Arduino STM32 NeoPixels (WS2812B) using SPI DMA


RGB LED strips (aka NeoPixels) have been around for a number of years now, so I’ve been somewhat behind the curve in not having tried these interesting devices until now. However, I recently bought a 1m strip of 30 LEDs, which use the WS2812B device, from a local eBay vendor.

As regular readers will know, my microcontroller of choice for most general purpose work is the cheap and trusty STM32F103C in conjunction with the Arduino API, so I checked if anyone had ported Adafruit’s NeoPixel library to the Arduino STM32, and found

Unfortunately, when I tried it I occasionally got completely random LED colours, random flashing etc. Hooking up a logic analyser to the data pin, I found that the pulse timings produced by that library were considerably off. This was caused by a combination of factors, including the compiler not caching the GPIO register address, call overhead and some inline assembler. USB Serial was also failing, because all interrupts were disabled for the entire time it took to send the data, to ensure the critical timing was not disturbed.

I spent some time fixing this library (but have not published an updated version yet), and it worked OK, but I still had the issue with USB failing. Bit-banging the data to the LEDs also seemed quite inefficient, as the STM32 has several independent subsystems which could be used for this purpose.

The simplest and most obvious subsystem for sending a bit-stream is of course SPI, and the STM32 features DMA driven SPI, which would allow data to be sent to the LEDs while new data is prepared.

Looking at the datasheet for the WS2812B:

To send a logic 0 requires a pulse that is high for 400ns and low for 850ns (all values plus or minus 150ns).
To send a logic 1 requires a pulse that is high for 800ns and low for 450ns (all values plus or minus 150ns).

This is approximately a 1 to 2 ratio between the “mark” and “space” lengths (or vice versa).

As the STM32F103’s normal operating frequency is 72MHz and the SPI clock prescaler is available in powers of 2, I checked and found that 72MHz / 32 = 2.25MHz, which equates to a bit period of 444.44ns. Sending the binary pattern 100 results in a high pulse of 444ns followed by a low period of 889ns (rounding to the nearest ns), and sending 110 results in a high pulse of 889ns and a low period of 444ns.

Both combinations are within the spec for the WS2812B.
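
The arithmetic above can be checked with a few lines of plain C++. This is only a sketch: the function and constant names are my own, the timing figures are the ones quoted from the datasheet above, and I am assuming the SPI peripheral is clocked at the full 72MHz.

```cpp
#include <cassert>
#include <cmath>

// Datasheet figures quoted above, in nanoseconds.
constexpr double kTolNs = 150.0;
constexpr double kT0H = 400.0, kT0L = 850.0;  // logic 0: high time, low time
constexpr double kT1H = 800.0, kT1L = 450.0;  // logic 1: high time, low time

// Duration of one SPI bit for a given peripheral clock and prescaler.
double spiBitNs(double clockHz, unsigned prescaler) {
  assert(prescaler != 0);
  return 1e9 * prescaler / clockHz;
}

// True if a generated duration is within the tolerance of the nominal one.
bool withinSpec(double actualNs, double nominalNs) {
  return std::fabs(actualNs - nominalNs) <= kTolNs;
}

// Checks both triplets: "100" (1 bit high, 2 low) and "110" (2 high, 1 low).
bool tripletsFit(double clockHz, unsigned prescaler) {
  const double bit = spiBitNs(clockHz, prescaler);
  return withinSpec(1 * bit, kT0H) && withinSpec(2 * bit, kT0L) &&
         withinSpec(2 * bit, kT1H) && withinSpec(1 * bit, kT1L);
}
```

With a 72MHz clock, tripletsFit() accepts a prescaler of 32 and rejects the neighbouring values 16 and 64, which is why /32 is the only usable setting for this trick.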

It should be noted at this point, that the older WS2812 has slightly different timings, e.g. a pulse of 350nS is required for a logic zero, so this device is not compatible with this method of sending data.

Now that I knew that I could use bit triplets “100” and “110” to send a data “0” and “1” to the WS2812B, I needed a way to generate these triplets and concatenate them to produce the GRB data for each LED (the data order is Green Red Blue, not the traditional Red Green Blue). So I wrote this function, which converts a single byte (one colour channel) into the 24 bit pattern (8 x 3 bit triplets):


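uint32_t convert(uint8_t data)
{
  uint32_t out=0;
  for(uint8_t mask = 0x80; mask; mask >>= 1)
  {
    out = out << 3;       // make room for the next triplet
    if (data & mask)
      out = out | 0B110;  // data bit high: high, high, low
    else
      out = out | 0B100;  // data bit low: high, low, low
  }
  return out;
}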

I used this function to build a buffer of “encoded” data and then used the Arduino STM32 extended SPI function dmaSend() to send the buffer to the LED strip.
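
As a rough illustration of how one LED’s worth of encoded data could be laid out in a byte buffer, here is a sketch in plain C++. The helper names (encodeChannel, putChannel, encodePixel) are hypothetical and are not taken from the library; the only assumptions are that SPI sends each byte most significant bit first and that the channel order is Green, Red, Blue as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same encoding as convert() above: each data bit becomes 110 (1) or 100 (0).
uint32_t encodeChannel(uint8_t data) {
  uint32_t out = 0;
  for (uint8_t mask = 0x80; mask; mask >>= 1) {
    out = (out << 3) | ((data & mask) ? 0b110u : 0b100u);
  }
  return out;
}

// Writes the 24 encoded bits of one channel as 3 bytes, most significant first.
uint8_t *putChannel(uint8_t *dst, uint8_t value) {
  const uint32_t bits = encodeChannel(value);
  *dst++ = static_cast<uint8_t>(bits >> 16);
  *dst++ = static_cast<uint8_t>(bits >> 8);
  *dst++ = static_cast<uint8_t>(bits);
  return dst;
}

// Encodes one LED (9 bytes) in the WS2812B's Green, Red, Blue order.
void encodePixel(uint8_t *dst, uint8_t r, uint8_t g, uint8_t b) {
  assert(dst != nullptr);
  dst = putChannel(dst, g);
  dst = putChannel(dst, r);
  putChannel(dst, b);
}
```

A channel value of 0x00 encodes to the bytes 0x92 0x49 0x24 and 0xFF encodes to 0xDB 0x6D 0xB6, so every LED occupies 9 bytes of the buffer handed to dmaSend().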


Initial results of this were very promising, with the 30 LED strip being updated quickly, but I noticed that the first LED in the strip sometimes displayed the wrong colour.

I hooked up my 100MHz USB logic analyser, looked at the data being sent, and compared it with a bit-banged version I’d previously been working on.


SPI version

Bit-banged version


The length of the very first pulse (logic high) was 0.49us for SPI and 0.45us for bit-banged.


Reading the spec of the WS2812B, a pulse duration of < 550nS (0.55us) is supposed to be a logic zero, however my LED strip was treating this as a logic 1, and hence setting the Green channel to 10000000, or in a general case the MS bit was always set to 1, hence green values were never lower than 0x80 (50%)

I don’t know why the WS2812B is treating a 490nS pulse as if it is longer than 550nS, but the only thing I can conjecture is that the pulse width being shown on the logic analyser is based on a different threshold voltage e.g. Vdd (3.3V) / 2. However according to the WS2812B spec, a High is signalled when the input voltage is 0.7 of its supply voltage which is nominally 5V, = 3.5V.
I know that driving these devices using 3.3V logic is known to be problematic if the LED supply is 5V because of this input voltage threshold, and that there are various workarounds for this, usually involving diodes in series with the GND or the 5V power line to the WS2812B. But in my case I was getting the opposite effect, as if the threshold was a lower voltage than Vdd (3.3V) / 2 = 1.65V.


I checked the width of the other pulses being generated by the SPI and found that only the first pulse was this length and all other pulses were 450nS (actually 444.4ns but my analyser can only resolve to the nearest 10nS), so I concluded that this effect was being caused by the MCU hardware setting up the MOSI signal in advance of the transfer starting.

Normally this would not be a problem when using conventional SPI, as the data is clocked using a separate signal. It only happens because I’m using the SPI hardware for a purpose other than the one it was designed for, so I don’t blame ST for building defective hardware 😉


The workaround for this is actually very simple. An additional byte of 0x00 (“00000000”) was added to the start of the SPI data buffer, so that the first LED pulse is actually in the second byte of the transfer.

I also noticed that occasionally the STM32 hardware seemed to leave the MOSI signal at logic 1 after the end of the transfer, even though the last binary bit of the transfer in this protocol is always a zero. This causes a problem, because the WS2812 protocol requires its Data In to be logic low for 50us prior to each transfer, to act as a Reset signal. The fix for this was also simple: append another byte of 0x00 to the end of the transfer.
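
The resulting buffer layout can be sketched as follows. This is a desktop-side illustration in plain C++ rather than the library code: the names are hypothetical, std::vector is used only for convenience, and the 9 bytes per LED come from three colour channels of three encoded bytes each.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kBytesPerLed = 9;  // 3 channels x 3 encoded bytes

// Total SPI bytes for a strip: 1 leading zero + pixel data + 1 trailing zero.
size_t frameBytes(size_t numLeds) { return numLeds * kBytesPerLed + 2; }

// Offset of LED n's first encoded byte, skipping the leading zero byte.
size_t pixelOffset(size_t n) { return n * kBytesPerLed + 1; }

// Builds an all-off frame. A channel value of 0 is the triplet "100" repeated
// 8 times, which packs into the bytes 0x92 0x49 0x24.
std::vector<uint8_t> makeBlankFrame(size_t numLeds) {
  assert(numLeds > 0);
  static const uint8_t kZero[3] = {0x92, 0x49, 0x24};
  std::vector<uint8_t> frame(frameBytes(numLeds), 0x00);
  for (size_t i = 1; i + 1 < frame.size(); ++i) {
    frame[i] = kZero[(i - 1) % 3];  // first and last bytes stay 0x00
  }
  return frame;
}
```

Keeping the first and last bytes at 0x00 means MOSI idles low on both sides of the pixel data, which is the whole point of the two padding bytes.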


Having got this working within a test sketch, I rewrote it as a library which replicated the Adafruit NeoPixel library API as closely as possible.

I also did some speed optimisation by replacing the function that calculates the SPI bit pattern, with a lookup table that converts a single colour channel (RGB 8 bit value) into the 24 bit encoded pulse-train needed for the SPI.
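
A lookup table of this kind could be generated along the following lines. This is a sketch, not the library’s table: I am assuming a flat array of 256 x 3 bytes filled at start-up, whereas the real table may well be precomputed, and buildEncoderLookup(), encoded() and the lookupBuilt flag are names invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// 256 entries x 3 bytes: the encoded pulse train for every 8 bit channel value.
static uint8_t encoderLookup[256 * 3];
static bool lookupBuilt = false;

// Fills the table once, e.g. from setup(); triplet 110 = data 1, 100 = data 0.
void buildEncoderLookup() {
  for (unsigned value = 0; value < 256; ++value) {
    uint32_t bits = 0;
    for (unsigned mask = 0x80; mask; mask >>= 1) {
      bits = (bits << 3) | ((value & mask) ? 0b110u : 0b100u);
    }
    uint8_t *entry = &encoderLookup[value * 3];
    entry[0] = static_cast<uint8_t>(bits >> 16);
    entry[1] = static_cast<uint8_t>(bits >> 8);
    entry[2] = static_cast<uint8_t>(bits);
  }
  lookupBuilt = true;
}

// Returns a pointer to the 3 encoded bytes for one channel value.
const uint8_t *encoded(uint8_t value) {
  assert(lookupBuilt);  // the table must be filled before it is used
  return &encoderLookup[value * 3];
}
```

The table costs 768 bytes, in exchange for replacing the eight-iteration loop with three byte reads per colour channel.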


But one thing was still troubling me. Although I was using DMA to send the SPI data, the existing Arduino STM32 (LibMaple) SPI function dmaSend() is “blocking” (aka synchronous), so code execution effectively has to wait until the DMA is complete before the next set of LED data can be constructed.
To overcome this, I modified dmaSend() to make a new function called dmaSendAsync(), which returns immediately after the DMA transmission of data has started. In case the code to construct the next set of LED data completes before the current asynchronous transfer has finished, I took the blocking code from the end of dmaSend() and put it at the start of dmaSendAsync(), and added a static flag to the function so that the blocking code (which waits for DMA completion) is only run if a DMA transfer has previously been started.
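
The control flow described here looks roughly like the sketch below. It is not the real dmaSendAsync(): FakeDma is a made-up stand-in for the DMA controller so that the logic can run on a desktop, and every name in it is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the DMA hardware. Everything in FakeDma is hypothetical;
// it is not the LibMaple API.
struct FakeDma {
  bool busy = false;
  int waits = 0;   // how many times the caller had to block
  int starts = 0;  // how many transfers were started
  void start(const uint8_t *, size_t) { busy = true; ++starts; }
  void waitUntilDone() { ++waits; busy = false; }  // real code would poll a flag
};

FakeDma dma;

// Block on the previous transfer (if there was one), then start the new
// transfer and return without waiting for it.
void dmaSendAsyncSketch(const uint8_t *buf, size_t len) {
  static bool transferStarted = false;  // false until the first call
  assert(buf != nullptr && len > 0);
  if (transferStarted) {
    dma.waitUntilDone();  // the blocking code moved from the end to the start
  }
  dma.start(buf, len);
  transferStarted = true;
}
```

The first call starts a transfer without waiting at all; every later call only blocks for whatever is left of the previous transfer, so time spent preparing the next frame overlaps with the DMA.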

Just replacing dmaSend with dmaSendAsync, however, would not work correctly, because the Arduino sketch code could update the buffer of data that was currently being sent to the LEDs via DMA and cause unexpected results. To address this problem I added a double buffer system, so that the data buffer that functions like setPixelColor() interact with is different from the buffer being sent to the LEDs. The buffers are swapped as part of the library’s show() function, which sends the data via SPI.

This is all very standard and easy to implement in the code, but when I ran the test / example sketch, I found that some visual effects, specifically the colorWipe() function, were not working as expected and caused the LEDs to flash in a very strange way.

Initially I presumed I must have made a mistake with how I handled the double buffering, because the problem went away if I switched back to single buffering, (with some added delays), but after exhaustive examination of the data and using the logic analyser to see what was actually being sent, I finally realised that the visual effects created by functions like colorWipe() are additive.


void colorWipe(uint32_t c, uint8_t wait) {
   for(uint16_t i=0; i<strip.numPixels(); i++) {
      strip.setPixelColor(i, c);
      strip.show(); delay(wait); // send data to all LEDs, then pause
   }
}

Where this function effectively does the following:-

Set pixel 1 to Colour X
Send data to all LEDs

Set pixel 2 to Colour X
Send data to all LEDs

Set pixel 3 to Colour X
Send data to all LEDs

Set pixel 4 to Colour X
Send data to all LEDs

Which produces the following effect




But if there are 2 buffers, (both initially empty) what the code would do is…

Set Buffer 1, pixel 1 to Colour X
Send data to all LEDs

Set Buffer 2 pixel 2 to Colour X
Send data to all LEDs

Set Buffer 1 pixel 3 to Colour X
Send data to all LEDs

Set Buffer 2 pixel 4 to Colour X
Send data to all LEDs


Which produces this effect
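
The two sequences can be reproduced off-target with a small simulation. This is purely illustrative plain C++: simulateWipe() is a hypothetical helper that models a colour wipe over one buffer, or over two buffers that are swapped on every show() without any copying.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns what the strip displays after each show() of a colour wipe.
// '#' is a lit pixel, '.' is off. buffers = 1 models single buffering,
// buffers = 2 models two buffers swapped on every show() with no copy.
std::vector<std::string> simulateWipe(int numPixels, int buffers) {
  assert(numPixels > 0 && (buffers == 1 || buffers == 2));
  std::vector<std::string> buf(buffers, std::string(numPixels, '.'));
  std::vector<std::string> shown;
  int current = 0;  // the buffer that setPixelColor() writes to
  for (int i = 0; i < numPixels; ++i) {
    buf[current][i] = '#';              // setPixelColor(i, c)
    shown.push_back(buf[current]);      // show() sends this buffer...
    current = (current + 1) % buffers;  // ...then swaps to the other one
  }
  return shown;
}
```

For four pixels, the single-buffer run displays `#...`, `##..`, `###.`, `####`, while the swapped run displays `#...`, `.#..`, `#.#.`, `.#.#`, which is the strange flashing seen on the strip.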




Unfortunately, the only way around this problem is to copy the contents of the last updated frame buffer to the other frame buffer. But at least this can be done during the DMA transfer if the buffer pointers are exchanged each time:


// Sends the current buffer to the LEDs
void WS2812B::show(void)
{
  SPI.dmaSendAsync(pixels,numBytes);// Start the DMA transfer of the current pixel buffer to the LEDs and return immediately.

  // Need to copy the last / current buffer to the other half of the double buffer as most API code does not rebuild the entire contents
  // from scratch. Often just a few pixels are changed e.g in a chaser effect
  if (pixels==doubleBuffer)
  {
	// pixels was using the first buffer
	pixels	= doubleBuffer+numBytes;  // set pixels to second buffer
	memcpy(pixels,doubleBuffer,numBytes);// copy first buffer to second buffer
  }
  else
  {
	// pixels was using the second buffer
	pixels	= doubleBuffer;  // set pixels to first buffer
	memcpy(pixels,doubleBuffer+numBytes,numBytes);	 // copy second buffer to first buffer
  }
}


At the time of writing I’m using memcpy to copy the buffers, but this may not be the most efficient way to do this, as memcpy may be doing single byte copies, whereas 32 bit copies would be faster.

So I’m considering padding the frame buffers to a multiple of 4 bytes, adding bytes as necessary depending on the number of LEDs. Currently each buffer is NUM_LEDs times 9 bytes (3 colour channels, each encoded as 3 bytes) + 1 start byte + 1 end byte, e.g. 30 LEDs take 272 bytes.
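
The padding calculation and a word-wise copy could look something like this. It is a sketch of the idea only: none of these helpers exist in the library, the buffers are assumed to be word aligned, and whether it actually beats memcpy would need measuring on the target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Unpadded buffer size: 9 encoded bytes per LED plus the start and end bytes.
constexpr size_t rawBytes(size_t numLeds) { return numLeds * 9 + 2; }

// Rounds the size up to the next multiple of 4 so it can be copied as words.
constexpr size_t paddedBytes(size_t numLeds) {
  return (rawBytes(numLeds) + 3) & ~static_cast<size_t>(3);
}

// Copies a padded buffer 32 bits at a time. Both pointers must be word
// aligned and byteCount must be a multiple of 4.
void copyWords(uint32_t *dst, const uint32_t *src, size_t byteCount) {
  assert(byteCount % 4 == 0);
  for (size_t i = 0; i < byteCount / 4; ++i) {
    dst[i] = src[i];
  }
}
```

For 30 LEDs the buffer is already 272 bytes, a multiple of 4, so no padding is needed; 8 LEDs would grow from 74 to 76 bytes. Any padding bytes should be 0x00 so that, if they are sent, they only lengthen the low period at the end of the frame.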

The other thing that makes the code run slower in some places than the bit-banged version is the need to use the lookup table to copy the 3 bytes per colour into the frame buffer.


void WS2812B::setPixelColor(uint16_t n, uint8_t r, uint8_t g, uint8_t b)
{
   uint8_t *bptr = pixels + (n<<3) + n +1;// 9 bytes per LED, plus the start byte
   uint8_t *tPtr = (uint8_t *)encoderLookup + g*2 + g;// need to index 3 x g into the lookup
   *bptr++ = *tPtr++;
   *bptr++ = *tPtr++;
   *bptr++ = *tPtr++;

   tPtr = (uint8_t *)encoderLookup + r*2 + r;
   *bptr++ = *tPtr++;
   *bptr++ = *tPtr++;
   *bptr++ = *tPtr++;
   tPtr = (uint8_t *)encoderLookup + b*2 + b;
   *bptr++ = *tPtr++;
   *bptr++ = *tPtr++;
   *bptr++ = *tPtr++;
}

I’ve tried to optimise the code by using sequential pointer reads and writes with increment, however it was still 60% slower than the bit-banged code, which simply needs to write the RGB values straight into the frame buffer.

Moving the LUT from flash to RAM has improved the speed by about 30%, so that setPixelColor(uint16_t n, uint8_t r, uint8_t g, uint8_t b) now takes 1224ns to execute (including the call overhead), rather than 1642ns. The bit-banged version, which just sets 3 bytes in the frame buffer, takes 948ns, so it is still 28% faster even with no optimisation.

There may be further optimisations that can be performed on this code to increase the speed, possibly using the strategy described by @stevestrong on the forum

But that will have to wait for a day when I have more free time to allocate to this.


Overall, I think the jury is still out, about whether using SPI and DMA rather than simply bit-banging the data is the best approach.

If the bit-banged method could be made to work without disabling the interrupts during the entire duration of the show() function, it would probably be fine for most simple effects, and may even work faster in a lot of cases.

However, just disabling interrupts when each High / Low waveform is created, and re-enabling the interrupts between data bits, would technically break the WS2812B’s spec, as the Low period would be variable and almost always longer than is technically correct.

Also, for effects which require a lot of processing for each pixel, using DMA would potentially increase the frame rate.

If I get time I’ll also publish a bit-banged version, and in the meantime anyone interested in trying the code can download the latest version of the Arduino STM32 repo

Or look at the library code here

6 Responses

  1. Jon Dresser

    Hey Roger, I wanted to chime in here because I’ve been building quadcopters/drones lately, and the flight controllers in those use STM32 processors and can control RGB LEDs quite well. I HAVE observed the strange misbehavior of them at times, but I think it might be related to the power supply, i.e. if I’m powering the board via USB instead of from the battery. Also, if I have not assigned every LED in the strip. I run Betaflight firmware on my boards, and it’s open-source, if you were interested in looking at some source…

  2. Roger Clark

    Thanks Jon

    I have noticed a lot of STM32 based FCU’s appearing, and also some ultra cheap quads that use a GD32F103 processor.

    But at the moment, I’m completely swamped with other things, both work and projects, so I’m probably not going to have time to do much on that front (which is a shame as quads are fun !)

  3. Matthias Dübon

    Hi Roger,
    thanks for this blog, but one question. As far as I know SPI does have a data width (configurable up to 16 bit on ST devices). The SPI is transferring the data in chunks of this data width, e.g. if you have a data width of 8 you should get a stream like:
    … even with DMA. Is the pause on your system negligible?

  4. Roger Clark

    SPI access to the flash chip in the GD77 should be ultra fast for reading, so I have no idea why downloading the codeplug takes so long.

    One of my contacts has taken their GD77 apart and read the SPI chip using an external reader, and it does contain the codeplug along with the DMR ID and some other (unknown) data.

  5. Matthias Dübon

    Thanks for your response but I’ve just seen there is text missing in my last comment. Let me try to rephrase:
    A SPI frame has a fixed data width (e.g. 8 bit); after sending 8 bits the SPI “hardware” makes a pause.
    It’s like:
    These pauses should be much longer than the tolerance of WS2812B and it should totally corrupt your timing. Or do I get something wrong?

  6. Roger Clark


    OK. I thought you were talking about something else.

    SPI is sent by DMA so there should be no pauses between the data for each LED