The Positive Rail

Bare metal visibility - writing a device driver for the Kinetis KL26 SPI peripheral

1/10/2015

One of the things I've been doing lately is writing device drivers from bare metal, that is, without the help of Processor Expert or other libraries.  This means flipping bits in registers, using bits of assembly language, etc.  When I mentioned this to a friend of mine, he looked at me like I was insane.  Why would I waste that time?  The whole point of libraries and these tools is to abstract it away so you don't get lost in the weeds.

To be honest, I'm not totally sure if this is a good use of my time.  Part of me agrees with my friend.  It's a lot of horrible debugging - digging through documentation, trying a setting at a time and staring at uncooperative traces.  It takes forever.  It's work someone else has done.  I have plenty of things to do besides staring at a register map and doing binary-hex conversions in my head.

There are a couple of things I think I understand better now, though.  First, I am getting a deeper sense of what peripherals can do and what they can't do.  This is not always clear in the documentation (at least not for the Freescale devices I'm using).  

Second, I'm starting to recognize the things that regularly slow down debugging at this level.  I'm getting a little better at it.  And I'm starting to believe there are a few tools that would speed things up a lot.

The reason I care is that I've noticed that probably 90% of my development time is wasted on 10% of bugs - which tend to be bare-metal bugs, or at least clearly visible in bare metal.  In these cases, you end up needing bare metal visibility - views of the registers and memory - to see where your code is not behaving as it should.  Having this visibility - just like being able to see what the variables in your code are doing - would speed this up incredibly.

So let's take a short walk through the SPI driver and its quirks...

The Serial Peripheral Interface (SPI)

The Kinetis KL26 is one of Freescale's Cortex-M0+ microcontrollers.  Among its on-chip peripherals (e.g. UART, I2C, I2S, GPIO) is a Serial Peripheral Interface (SPI) - which allows it to communicate with the huge number of chips that speak SPI.

SPI has two main functions: to transmit data, and receive data in a serial stream.  It is simple in theory - you first set the clocks and pins in the peripheral and the various modes, and then enable it.  Then, at least in the Kinetis designs, you push the transmit data stream into the data register either by copying from memory with the CPU or DMA (direct memory access, a way of transferring data without using the CPU), and read the receive data stream from the same register.

What prevents this from being easy?  
  • Transmitting using the CPU is a little different from transmitting using DMA.
  • There are subtle but important differences between receiving and transmitting.  So once you learn how to transmit, you haven't necessarily learned how to receive, and in fact you can be led astray!

In the next sections I'll first talk about transmitting, then receiving.

SPI Transmit

Once you've set up the clocks and pins for the SPI interface, it is ready to transmit data to another chip.  The basic idea is that you signal to the other chip that you are about to transmit by pulling down the CS (chip select, also called SS, slave select) line which connects you to it, and then send and receive data on two additional lines, the MISO (Master In Slave Out) and MOSI (Master Out Slave In) lines.  Each bit of the data is marked by a pulse on the SCLK (serial clock) line.  When you are done, you pull the CS line high.

If you are using the CPU to manage the SPI peripheral, it must take the additional step of copying data to and from the buffers on the SPI.  So the procedure is as follows:
  1. logic low on chip select (enable)
  2. read data from data low byte (DL) register - this is a dummy read to trigger the register
  3. poll the SPI Transmit Empty Flag (status register S, SPTEF) to check when the register is ready to receive a new byte
  4. push your data onto data low byte (DL) register
  5. poll the SPI receive full flag (status register S, SPRF) to check when the receive buffer is full
  6. read the data from data low byte (DL) register
  7. logic high on chip select (disable)
  8. reset DMA and interrupts
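
The steps above can be sketched in C roughly as follows. This is a minimal sketch under some assumptions, not the driver itself: `spi_regs_t` is a hypothetical, simplified view containing only the S and DL registers, the flag masks are my assumed bit positions (check them against the reference manual), and `cs` is a stand-in for whatever GPIO write drives the chip select pin. Step 8 is left out because no DMA or interrupts are armed on this path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the two SPI registers used here.
   On hardware this would overlay the peripheral's register addresses. */
typedef struct {
    volatile uint8_t S;   /* status register */
    volatile uint8_t DL;  /* data low byte */
} spi_regs_t;

/* Assumed flag positions in S; confirm against the reference manual. */
#define SPI_S_SPRF  0x80u  /* receive buffer full */
#define SPI_S_SPTEF 0x20u  /* transmit buffer empty */

/* Exchange len bytes using the CPU only.
   cs stands in for the chip select pin: 0 = low (selected), 1 = high. */
static void spi_transfer_polled(spi_regs_t *spi, volatile int *cs,
                                const uint8_t *tx, uint8_t *rx, size_t len)
{
    assert(spi && cs && tx && rx);

    *cs = 0;                                  /* 1. chip select low */
    (void)spi->DL;                            /* 2. dummy read of DL */
    for (size_t i = 0; i < len; i++) {
        while (!(spi->S & SPI_S_SPTEF)) { }   /* 3. wait until DL accepts a byte */
        spi->DL = tx[i];                      /* 4. push the byte */
        while (!(spi->S & SPI_S_SPRF)) { }    /* 5. wait for the received byte */
        rx[i] = spi->DL;                      /* 6. read it back */
    }
    *cs = 1;                                  /* 7. chip select high */
}
```

Off target, a plain `spi_regs_t` in RAM with both flags preset behaves like a loopback (each read of DL returns the byte just written), which is enough to exercise the sequencing on a PC before moving to hardware.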
This approach is simple and straightforward.  The problem is that there is a long pause between each byte as it is sent out (2-3x the length of the byte itself) which turns out to be due to the CPU timing.  Plus, it uses a lot of CPU.  In one of the applications I am developing, a 2-3x increase in speed is actually quite important.  Direct memory access should get rid of both of these issues.

But when I tried using DMA to transmit a stream of data, I ran into a couple of problems:

1) Chip select disable timing.  In the SPI protocol you must set the 'chip select' line to logic low before transmitting data, and return it to high to indicate the end of the transmission.  Although there is an option to have the peripheral control the chip select, it toggles on every byte - which breaks most protocols, since they expect chip select to stay low for the whole multi-byte transfer.  So here we are manually enabling and disabling the chip-select.

When using DMA, you don't have direct control over when the transmission ends.  Instead, you get an interrupt when the DMA transfer is complete.  

The problem is that if you disable the chip select immediately after receiving the DMA transfer complete interrupt, the last word may not have finished transmitting.  The DMA engine only knows that it's delivered the last bytes, but not when they are actually output on the pins.  

This problem is also tricky to pick up because:
  • It only affects the very last byte in the transmission - so it may be difficult to notice.
  • It is -not- an obvious problem at the higher speeds (e.g. divider 1,2,3) because the CPU takes some time to execute the chip select disable command.  But at slower speeds, it will cut off the last word, or even garble the next transmission - which would be especially fiendish.

2) First byte.   It turns out that the DMA system is not quite fast enough to catch the first DMA transfer request - the symptom of this is the first byte in the transmission is duplicated.  Again, a little tricky to catch, if you're not paying attention.  The solution is to push the first word out with the CPU.

With these two twists in mind, the procedure for transmitting using DMA then is:
  1. logic low on chip select (enable)
  2. push first byte into DL register (this is actually a read of the status register S, followed by write to the DL - otherwise the DL write is ignored)
  3. enable transmit DMA request (TXDMAE)
  4. wait for the transmit DMA to complete (I use an interrupt handler to set an indicator to 1 and poll it; can also be done using an RTOS semaphore)
  5. poll the SPI receive full flag (status register S, SPRF) until it is set, indicating that the last word was read from the pin
  6. logic high on chip select (disable)
  7. reset DMA and interrupts - if you don't, you will also get some garbled first bytes (the direct push out of the register will be ignored, and you'll see a double byte)
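
A minimal C sketch of this sequence follows. It assumes the DMA channel has already been configured elsewhere to feed the remaining bytes into DL; that setup is not shown. The `spi_regs_t` layout, the flag masks, the `cs` stand-in for the chip select GPIO, and the `tx_dma_done` indicator are all hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified register view; not the real memory layout. */
typedef struct {
    volatile uint8_t S;   /* status register */
    volatile uint8_t C2;  /* control register holding the DMA enables */
    volatile uint8_t DL;  /* data low byte */
} spi_regs_t;

/* Assumed bit positions; confirm against the reference manual. */
#define SPI_S_SPRF    0x80u  /* receive buffer full */
#define SPI_C2_TXDMAE 0x20u  /* transmit DMA request enable */

/* Indicator polled by the driver, set from the interrupt handler. */
static volatile int tx_dma_done;

/* Hook this to the DMA channel's transfer-complete interrupt. */
static void tx_dma_complete_isr(void)
{
    tx_dma_done = 1;
}

/* first_byte goes out via the CPU; the DMA channel supplies the rest. */
static void spi_send_dma(spi_regs_t *spi, volatile int *cs, uint8_t first_byte)
{
    assert(spi && cs);

    *cs = 0;                              /* 1. chip select low */
    (void)spi->S;                         /* 2. read S so the DL write is accepted */
    spi->DL = first_byte;                 /*    then push the first byte by hand */
    spi->C2 |= SPI_C2_TXDMAE;             /* 3. let DMA feed the remaining bytes */
    while (!tx_dma_done) { }              /* 4. DMA has delivered its last byte */
    while (!(spi->S & SPI_S_SPRF)) { }    /* 5. last byte has cleared the pins */
    *cs = 1;                              /* 6. only now release chip select */
    spi->C2 &= (uint8_t)~SPI_C2_TXDMAE;   /* 7. disarm for the next transfer */
    tx_dma_done = 0;
}
```

The two waits in steps 4 and 5 are the point of the sketch: the first only says the DMA engine is finished, the second is what keeps chip select low until the final byte is actually on the wire.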

SPI Receive

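Using DMA to receive SPI data has another couple of twists:

1) Transmit buffer blocking.  When transmitting using DMA, we did not care whether the receive buffer was full, until the very end when we wanted to find out when the last bit had been pushed out the pin.  But it turns out that if the transmit buffer is empty, the peripheral will raise a flag and block - until you push more data into the transmit buffer.  This makes sense once you remember that an SPI master only generates clock pulses while it has something to send, so with nothing to transmit there is nothing to clock the incoming data in.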

There are two solutions and one non-solution:
  • Continuously push dummy data onto the transmit buffer using the CPU (this must be done by first reading the S register then pushing a dummy value into the DL register).  Problem is this takes a bunch of CPU.  
  • Use a dummy DMA transmit to push data onto the DL register.  Problem is that this occupies one of the DMA channels.  Also, it appears that the speed is limited - at higher speeds the receive buffer begins to miss bytes - the symptom of this is that the transmit buffer will be finished (and so the system will block with a transmit buffer empty flag) but the receive buffer will have a small number of bytes left (e.g. at the end of 512 bytes, it would have 12 bytes left).
  • It -doesn't- work to use an interrupt to push dummy data only when the transmit buffer is empty.  There are two problems.  First, it is very slow.  Second, it seems that the receive DMA request is blocked.  The symptom of this is that the system goes on transmitting an empty byte endlessly and the DMA receive byte count never drops (either because it never receives a request, or perhaps because there is a 'destination' error).

So the procedure for receiving data (using the first method) becomes:
  1. logic low on chip select (enable)
  2. enable receive DMA request (RXDMAE)
  3. continuously read the S register and push a dummy value onto the DL register until the DMA receive is finished
  4. logic high on chip select (disable)
  5. reset DMA and interrupts 
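
As a rough C sketch of the receive steps above, assuming the receive DMA channel is configured elsewhere: the register view, flag masks, dummy value and `cs` stand-in are hypothetical. The "is the receive DMA finished" check is passed in as a callback so the same code can run against a fake on a PC; `done_after_n_polls` is only that host-side fake, not something you would ship.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified register view; not the real memory layout. */
typedef struct {
    volatile uint8_t S;   /* status register */
    volatile uint8_t C2;  /* control register holding the DMA enables */
    volatile uint8_t DL;  /* data low byte */
} spi_regs_t;

/* Assumed bit positions; confirm against the reference manual. */
#define SPI_S_SPTEF    0x20u  /* transmit buffer empty */
#define SPI_C2_RXDMAE  0x04u  /* receive DMA request enable */
#define SPI_DUMMY_BYTE 0xFFu  /* value clocked out while receiving */

/* Returns nonzero once the receive DMA has stored its last byte. */
typedef int (*rx_done_fn)(void *ctx);

/* Host-side stand-in for the DMA: reports "done" after *ctx polls. */
static int done_after_n_polls(void *ctx)
{
    int *polls_left = ctx;
    return (*polls_left)-- <= 0;
}

/* Returns how many dummy bytes were pushed, which is handy when debugging. */
static size_t spi_receive_dma(spi_regs_t *spi, volatile int *cs,
                              rx_done_fn rx_done, void *ctx)
{
    size_t dummies = 0;
    assert(spi && cs && rx_done);

    *cs = 0;                               /* 1. chip select low */
    spi->C2 |= SPI_C2_RXDMAE;              /* 2. receive DMA drains DL */
    while (!rx_done(ctx)) {                /* 3. keep the clock running */
        if (spi->S & SPI_S_SPTEF) {        /*    read S, then write DL */
            spi->DL = SPI_DUMMY_BYTE;
            dummies++;
        }
    }
    *cs = 1;                               /* 4. chip select high */
    spi->C2 &= (uint8_t)~SPI_C2_RXDMAE;    /* 5. disarm for the next transfer */
    return dummies;
}
```

On target, `rx_done` would simply return the indicator set by the DMA transfer-complete interrupt.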

SPI Transmit and Receive

Once I understood the individual transmit and receive cases, it became clear that this system was designed for simultaneous DMA transmit and receive - this is the procedure, and it involves very little polling or other oddities:
  1. logic low on chip select (enable)
  2. push the first byte onto the register (following a read of S register)
  3. enable transmit DMA request (TXDMAE)
  4. enable receive DMA request (RXDMAE)
  5. wait until DMA receive is completed (which also means that the last byte has been output/read from the pin)
  6. logic high on chip select (disable)
  7. reset DMA and interrupts 
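
The combined procedure is short enough to sketch in a few lines of C. As before this is an illustration under assumptions: both DMA channels are configured elsewhere, and `spi_regs_t`, the C2 masks, the `cs` stand-in and the `rx_dma_done` indicator (set by the receive DMA's completion interrupt) are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified register view; not the real memory layout. */
typedef struct {
    volatile uint8_t S;   /* status register */
    volatile uint8_t C2;  /* control register holding the DMA enables */
    volatile uint8_t DL;  /* data low byte */
} spi_regs_t;

/* Assumed bit positions; confirm against the reference manual. */
#define SPI_C2_TXDMAE 0x20u  /* transmit DMA request enable */
#define SPI_C2_RXDMAE 0x04u  /* receive DMA request enable */

static void spi_transfer_dma(spi_regs_t *spi, volatile int *cs,
                             const volatile int *rx_dma_done,
                             uint8_t first_byte)
{
    assert(spi && cs && rx_dma_done);

    *cs = 0;                           /* 1. chip select low */
    (void)spi->S;                      /* 2. read S, then push the first byte */
    spi->DL = first_byte;
    spi->C2 |= SPI_C2_TXDMAE;          /* 3. transmit DMA feeds the rest */
    spi->C2 |= SPI_C2_RXDMAE;          /* 4. receive DMA drains DL */
    while (!*rx_dma_done) { }          /* 5. last byte received, so also sent */
    *cs = 1;                           /* 6. chip select high */
    /* 7. disarm both requests, leaving the other C2 bits untouched */
    spi->C2 &= (uint8_t)~(SPI_C2_TXDMAE | SPI_C2_RXDMAE);
}
```

Because every transmitted byte produces a received byte, waiting on the receive side alone covers the "has the last byte left the pin" question that needed an extra poll in the transmit-only case.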
The only issue seems to be the speed limit - it appears you are limited in speed to a divider slower than 1/4 the bus speed (e.g. 0x03 will work but 0x02 will not, at 12.288MHz).  This is a pretty strong limitation that I still need to puzzle through.  I've tried reversing the priority of the DMA request channels (e.g. giving the transmit DMA higher or lower priority than the receive) but it does not seem to matter. 

Another limitation seems to be that the 16 bit mode does -not- work, which is a pity because it would ease the load on the DMA.  And in the KL26 there is definitely no FIFO - which is funny because they devote a good chunk of the manual to it.

Conclusions

Figuring this all out took a few days of pretty intense work as I was also getting used to using GDB, rather than the IDE I usually use.  I knew about some of these bugs before but had never truly solved them or trusted my understanding of them.  But here are my observations:

  • Once I got quick views of the peripheral registers in GDB (which I had taken for granted in the IDE) things sped up a lot, as I could more easily see what was going wrong. 
  • It also helped to be able to put very short output onto the UART (e.g. just a character or two).  In some cases you have to keep the system running in order to see the bug - and using a full console printf is too slow and interferes with the process.
  • The Saleae logic analyzer was invaluable in picking up the bugs in the first place - including some difficult hardware bugs due to a loose battery connection.
  • The P&E GDB server is very unstable - much more than I experienced in their integrated system.  The worst instabilities would force me to reset the entire debugging system every time - totally unacceptable.
  • Taking the time to write down the registers and even writing this post helped clarify my thinking a lot.
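
To illustrate the short-UART-output idea from the list above, here is a minimal sketch of a one-character trace call that never waits. The `uart_regs_t` view and the `UART_S1_TDRE` mask are assumptions for illustration (a status register with a "transmit data register empty" flag, plus a data register); check the names and bit positions against the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified UART view: one status and one data register. */
typedef struct {
    volatile uint8_t S1;  /* status register */
    volatile uint8_t D;   /* data register */
} uart_regs_t;

/* Assumed position of the transmit-data-register-empty flag. */
#define UART_S1_TDRE 0x80u

/* Emit one marker character if the UART can take it right now.
   Returns 1 if the character was queued, 0 if it was dropped.
   Dropping instead of waiting keeps the timing of the code under
   test as undisturbed as possible. */
static int trace_putc(uart_regs_t *uart, char c)
{
    assert(uart);
    if (!(uart->S1 & UART_S1_TDRE)) {
        return 0;                 /* transmitter busy: skip this marker */
    }
    uart->D = (uint8_t)c;
    return 1;
}
```

A call such as `trace_putc(uart, 'T')` at the start of a transfer and `trace_putc(uart, 'D')` in the DMA-complete handler is usually enough to see ordering problems without slowing the system the way a full printf would.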

A bare metal visibility tool that I want to have is a visual/text tool that gives me an instantaneous view of not only the registers in hex - but the individual flags and what they mean and when they changed - and notes about the peripheral and the register.  And memory - in a format that I need.  
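
The core of such a tool is small. As a sketch of the flag-decoding part only, here is a table-driven formatter that turns a raw status byte into its hex value plus the names of the flags that are set. The flag names and masks for the SPI S register are my assumptions and should be checked against the reference manual; `flag_desc_t` and `spi_status_describe` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One named bit in a register. */
typedef struct {
    uint8_t mask;
    const char *name;
} flag_desc_t;

/* Assumed layout of the SPI status register S. */
static const flag_desc_t spi_s_flags[] = {
    { 0x80u, "SPRF"  },  /* receive buffer full */
    { 0x40u, "SPMF"  },  /* match flag */
    { 0x20u, "SPTEF" },  /* transmit buffer empty */
    { 0x10u, "MODF"  },  /* mode fault */
};

/* Write e.g. "S=0xA0 SPRF SPTEF" into out. Returns the snprintf-style
   length, which exceeds out_len - 1 if the text was truncated. */
static int spi_status_describe(uint8_t s, char *out, size_t out_len)
{
    assert(out && out_len > 0);
    int n = snprintf(out, out_len, "S=0x%02X", (unsigned)s);
    for (size_t i = 0; i < sizeof spi_s_flags / sizeof spi_s_flags[0]; i++) {
        if ((s & spi_s_flags[i].mask) && n >= 0 && (size_t)n < out_len) {
            n += snprintf(out + n, out_len - (size_t)n, " %s",
                          spi_s_flags[i].name);
        }
    }
    return n;
}
```

The same table could carry a one-line note per flag, and a wrapper that remembers the previous value could report which flags changed between two reads.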

A peripheral register and memory viewer -is- in the Freescale CodeWarrior based IDE and I found it extremely useful, but it will -not- be present in the next version of the IDE.  The other problem with it was that it was difficult to get to and clunky to use.  It reset when the debugger reset, and took several clicks and drags to open up, and the memory maps were still somewhat arcane.  I've seen some third party programs that look like they might support this functionality - but I haven't tried them yet.  They look like horrible early-2000s-era Windows programs.  Why shouldn't such a program be beautiful, fast, powerful and easy to use?

Also, I saw that entering information about the registers took only a fraction of the time it saved me from repeatedly looking things up in the PDF reference manual - and it also made me pay attention to the registers and think about them early on.

And most importantly I found that notes - both inline in the code and in a form like this - really are invaluable in both solving the problem and helping understand what is going on later.

But I'm happy now.  I now have a reasonably stable SPI device driver - with only a couple of known issues - and more importantly, a bunch of documentation about how it works and its limitations. 

Happy bare metal debugging!
2 Comments
Jeff Gough
5/3/2015 11:51:35 pm

Have a look at the embsysregview plugin for KDS (and eclipse in general). When supplied with the CMSIS XML register description file from Freescale, it provides a pretty decent interface of the sort you describe.

Check out the post about it on mcuoneclipse.

Cheers
Jeff

Frederick Soo
5/4/2015 01:22:14 am

Wow, that is really cool! I will definitely try to install it and use it when I can - it's very close to what I was imagining. Thank you for the link.
