At Promenade Software, we write a lot of firmware, and we also help clients debug their own. Across these engagements we keep seeing the same situation: the firmware has intermittent problems that the client simply cannot figure out. The reason is always clear upon code review: some fundamental firmware tenets have been broken. This post is by no means a complete tutorial for writing good firmware, but it shares the 5 main tenets for resolving intermittent bugs.
For simplicity, we will focus on small single-core processors running bare-metal code: one thread running in a main loop, with interrupt handlers. The concepts can be extended to a threaded RTOS or even a multi-core system.
The first thing we do when we receive the code is to clarify what is being called from an interrupt. We follow the flow from every interrupt and append “ISR” (Interrupt Service Routine) to the name of each function it calls. We rename the non-local variables those functions read or write in the same way. Every remaining use of an old name then fails to compile, so the compiler tells us what is being shared between an interrupt and the main thread. This naming convention also helps with maintenance of the code. Often the problem was introduced later by someone who was not aware that a function ran from an interrupt.
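Any function called from both an interrupt and the main thread must be reentrant: the two callers must not end up using the same resources, such as a hardware peripheral or a static memory area. Stack-based local variables are fine. If the function does not need to share state between the two contexts, it is best to make one version for the ISR and one for the main thread, which future-proofs the code.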
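As a minimal sketch of the difference, consider a small number-formatting helper. The names `format_u16` and `format_u16_unsafe` are hypothetical, made up for this illustration. The unsafe version keeps its result in one static buffer, so an interrupt that calls it overwrites whatever the main thread was in the middle of using. The reentrant version works only on storage the caller hands in.

```c
#include <stddef.h>
#include <stdint.h>

/* Reentrant: all working storage is on the stack or supplied by the caller.
 * Returns the number of characters written (excluding the terminator),
 * or 0 if the output buffer is too small. */
size_t format_u16(uint16_t value, char *out, size_t out_len)
{
    char digits[5];                 /* 65535 has at most 5 digits */
    size_t n = 0;

    do {                            /* collect digits, least significant first */
        digits[n++] = (char)('0' + value % 10u);
        value /= 10u;
    } while (value != 0u);

    if (out_len < n + 1u) {
        return 0;                   /* no room for digits plus terminator */
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = digits[n - 1u - i];
    }
    out[n] = '\0';
    return n;
}

/* NOT reentrant: every caller shares this one buffer. If an ISR calls this
 * while the main thread still holds the returned pointer, the main thread's
 * text silently changes underneath it. */
static char shared_scratch[6];

const char *format_u16_unsafe(uint16_t value)
{
    format_u16(value, shared_scratch, sizeof shared_scratch);
    return shared_scratch;
}
```

With the reentrant version, the ISR and the main thread each pass their own buffer, so neither can disturb the other.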
For example, if the interrupt is filling in an array of data, the consumer of the data in the thread needs to disable interrupts, copy the data into local variables, and then re-enable interrupts.
The unsafe way to read what the interrupt collected in the main thread would be:
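int temp1 = temperature_data_ISR[0]; // an interrupt can update the array right after this read
int temp2 = temperature_data_ISR[1]; // so temp2 may belong to a newer sample than temp1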
The safe way to read it would be:
ENTER_CRITICAL_SECTION(); // disable interrupts
int temp1 = temperature_data_ISR[0];
int temp2 = temperature_data_ISR[1];
EXIT_CRITICAL_SECTION(); // re-enable interrupts, if they were enabled when you came in
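One way the two macros might be built is sketched below. This is an assumption about their shape, not the implementation behind the snippet above. The `port_irq_*` functions and `sim_irq_enabled` are hypothetical host-side stand-ins for the real interrupt mask, so the sketch can be compiled and tried on a PC. On a target they would be replaced by the processor's own mask and restore operations. The nesting counter is what makes the exit re-enable interrupts only if they were enabled on the way in.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host stand-ins for the hardware interrupt mask. */
static volatile bool sim_irq_enabled = true;
static bool port_irq_is_enabled(void) { return sim_irq_enabled; }
static void port_irq_disable(void)    { sim_irq_enabled = false; }
static void port_irq_enable(void)     { sim_irq_enabled = true; }

static uint8_t cs_depth;        /* how many critical sections are open */
static bool    cs_was_enabled;  /* interrupt state seen by the outermost entry */

static void critical_enter(void)
{
    bool was_enabled = port_irq_is_enabled();
    port_irq_disable();                 /* mask first, then touch bookkeeping */
    if (cs_depth++ == 0u) {
        cs_was_enabled = was_enabled;   /* remember only the outermost state */
    }
}

static void critical_exit(void)
{
    if (cs_depth > 0u && --cs_depth == 0u && cs_was_enabled) {
        port_irq_enable();              /* restore only what we found */
    }
}

#define ENTER_CRITICAL_SECTION() critical_enter()
#define EXIT_CRITICAL_SECTION()  critical_exit()

/* Written by the interrupt, read by the main thread. */
volatile int temperature_data_ISR[2];

/* Copies both readings as one consistent pair. */
void read_temperatures(int *temp1, int *temp2)
{
    ENTER_CRITICAL_SECTION();
    *temp1 = temperature_data_ISR[0];
    *temp2 = temperature_data_ISR[1];
    EXIT_CRITICAL_SECTION();
}
```

Keeping the critical section down to the two copies keeps the time with interrupts masked as short as possible. Any processing of the values happens afterwards on the local copies.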
Make sure that simple variable access is atomic. For example, sharing a 32-bit variable on a 16-bit processor means that the access takes more than one machine instruction. Interrupts can fire between those instructions unless the access is protected with a critical section.
Even something innocuous like the following can be a problem:
The thread writes a value, and the ISR reads the value:
mysharedvar = 1;
This looks like it would be an atomic action, but we have seen the compiler optimizer turn this into a clear and increment, which is faster than moving the value from flash memory. In one case we saw, the interrupt would occasionally fire between the clear and the increment. The interrupt saw the value as 0, even though the thread logic was continually writing only 1.
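The same kind of torn access can be reproduced on a PC with a small simulation. This is only a sketch. We model a shared 32-bit value that the CPU stores as two 16-bit halves, and `fake_isr` is a hypothetical stand-in that we call by hand at the point where a real interrupt could fire.

```c
#include <stdint.h>

/* The two halves of one shared 32-bit value, as a 16-bit CPU would store it. */
static uint16_t shared_lo;
static uint16_t shared_hi;
static uint32_t seen_by_isr;    /* what the "interrupt" read */

/* Hypothetical ISR body: reads the whole shared value. */
static void fake_isr(void)
{
    seen_by_isr = ((uint32_t)shared_hi << 16) | shared_lo;
}

/* Unprotected store: the interrupt lands between the two half-word writes.
 * Returns the value the ISR observed. */
uint32_t store_unprotected(uint32_t value)
{
    shared_lo = (uint16_t)(value & 0xFFFFu);
    fake_isr();                 /* interrupt fires mid-update */
    shared_hi = (uint16_t)(value >> 16);
    return seen_by_isr;
}

/* Protected store: with interrupts masked, the pending interrupt can only
 * run after both halves are written. Returns the value the ISR observed. */
uint32_t store_protected(uint32_t value)
{
    /* ENTER_CRITICAL_SECTION() would go here on a target */
    shared_lo = (uint16_t)(value & 0xFFFFu);
    shared_hi = (uint16_t)(value >> 16);
    /* EXIT_CRITICAL_SECTION() would go here; the held-off interrupt runs now */
    fake_isr();
    return seen_by_isr;
}
```

Starting from 0x0000FFFF and storing 0x00010000 without protection, the simulated ISR reads 0x00000000, a value the thread never wrote. With the protected store it reads 0x00010000.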
An interrupt should not erase flash, read a slow ADC, wait for a bus to send or receive, or do CPU-intensive work. Interrupts need to be quick: in and out. If they are not, other interrupts may not be serviced in time, and their data can be lost (e.g., for Bluetooth, serial buses, etc.). Use state flags so that the main loop can pick up the work.
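A minimal sketch of the flag hand-off might look like this. All names are hypothetical. `uart_rx_ISR` takes the received byte as a parameter so that the sketch can run on a PC, whereas a real handler would read it from the peripheral's data register. The two macros are empty host stand-ins for a real critical section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host stand-ins; on a target these would mask and restore interrupts. */
#define ENTER_CRITICAL_SECTION() ((void)0)
#define EXIT_CRITICAL_SECTION()  ((void)0)

static volatile bool    rx_pending_ISR;    /* set by the ISR, cleared by the main loop */
static volatile uint8_t rx_byte_ISR;       /* written by the ISR, read by the main loop */
static uint32_t         running_checksum;  /* touched only by the main loop */

/* ISR: capture the byte, raise the flag, get out. */
void uart_rx_ISR(uint8_t received)
{
    rx_byte_ISR = received;
    rx_pending_ISR = true;
}

/* Called once per main-loop pass. Returns true if there was work to do. */
bool service_uart_rx(void)
{
    uint8_t byte;

    if (!rx_pending_ISR) {
        return false;                   /* nothing arrived since last pass */
    }

    ENTER_CRITICAL_SECTION();           /* take the byte and the flag together */
    byte = rx_byte_ISR;
    rx_pending_ISR = false;
    EXIT_CRITICAL_SECTION();

    running_checksum += byte;           /* placeholder for the slow work */
    return true;
}
```

A single flag and a single byte only work if the main loop always comes around before the next byte arrives. If it might not, a small queue filled by the ISR is the usual next step.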
We want to avoid the situation in which the main loop takes longer than the expected time, and everything gets pushed out. We generally measure this by toggling an available GPIO pin and watching it on a logic analyzer while exercising the worst case. Avoid delays in the main loop, even if you think you have time. That will help future-proof your code. For example, do not set an output, spin for 100 ms, and clear it. Instead, set up a state table of actions and times (based off a timer tick), and manage the states each time the main loop comes around.
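The 100 ms pulse could be written as a small state table along these lines. This is a sketch under stated assumptions: `output_pin` is a hypothetical stand-in for the GPIO, and `now_ms` is assumed to come from a free-running millisecond tick counter that the caller passes in.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One row per action: the level to drive and how long to hold it. */
typedef struct {
    bool     level;
    uint32_t hold_ms;
} pulse_step_t;

static const pulse_step_t pulse_table[] = {
    { true,  100u },    /* output high for 100 ms */
    { false, 0u   },    /* then low, and the sequence is finished */
};
#define PULSE_STEPS (sizeof pulse_table / sizeof pulse_table[0])

static size_t   step_index = PULSE_STEPS;   /* PULSE_STEPS means idle */
static uint32_t step_started_ms;
static bool     output_pin;                 /* stand-in for the real GPIO */

static void apply_step(uint32_t now_ms)
{
    output_pin = pulse_table[step_index].level;
    step_started_ms = now_ms;
}

/* Kick off the sequence; returns immediately. */
void pulse_start(uint32_t now_ms)
{
    step_index = 0;
    apply_step(now_ms);
}

/* Call on every main-loop pass; never blocks. */
void pulse_service(uint32_t now_ms)
{
    if (step_index >= PULSE_STEPS) {
        return;                             /* idle */
    }
    /* Unsigned subtraction stays correct when the tick counter wraps. */
    if ((uint32_t)(now_ms - step_started_ms) < pulse_table[step_index].hold_ms) {
        return;                             /* current step still running */
    }
    step_index++;
    if (step_index < PULSE_STEPS) {
        apply_step(now_ms);
    }
}
```

Each pass through `pulse_service` costs a couple of comparisons, so the rest of the main loop keeps running during the 100 ms. Adding another timed action means adding a row to the table rather than another delay.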