Hardware and software fault finding - an interesting bug
For makers who tinker with hardware and software, there will inevitably come a time when you have a project all put together, but some aspect of it is not working as you expect. In cases like these, the first problem to solve is where to start looking. Is it the code, is it the hardware, or is it the basic design or even the concept? That then leads on to the question of how you approach finding the root cause of the problem.
This is the tale of how we tracked down an elusive bug while on a software integration and testing job on some interesting hardware. Various bugs and features were found and fixed in the process, but one in particular stood out as “interesting”, since it took many steps to finally find and fix.
Background
The project (in the early 2000s) was to develop software for a replacement microprocessor based control system on Marconi’s H1141 HF Transmitter / Amplifier. The H1141 is a 10kW HF transmitter that can operate anywhere in the 1.6MHz to 30MHz HF band. It has a tuneable penultimate stage that allows it to be tuned and its output level set in just a few seconds. Designed in the early ’80s, it is regarded as a “compact” transmitter – part solid state and part thermionic valve, standing a tad over 6’ tall on its plinth, about 5’ wide, and a bit over 2’ deep. It weighs in at around 750kg. There is a copy of one of the spec sheets for the original version of this transmitter here. There is also an article from 1986 showing 8 of them installed in a broadcast station in Rugby, UK (in the far back left of the second photo).
Hardware
Now for anyone with an interest in electronics or radio, but little exposure to broadcast radio, this was a fascinating thing to work on. Many of the expected components are there; it is just the scale and the power that were very different from your average electronics project!
In modern electronics you often see a mains input fed to a power supply section that reduces the voltage down to a low “safe” level to power the electronics. While there were indeed some low voltage circuits and power supplies here, there was also a very large step up transformer to boost the 415V three phase mains input up to 10kV. That huge voltage is ultimately used to power a high gain, high voltage thermionic valve that does the bulk of the signal amplification. Running at full output, the transmitter will suck 28kVA from the mains (much of which will ultimately be blasted up the cooling chimney as hot air!)
Those who have studied the tuning mechanism used in many classic radio receiver circuits will be familiar with how the tuning usually works: a wound inductor and a variable capacitor create a resonant “tank” circuit. This forms a moveable band pass filter that makes the tuning stage “selective”, the resonance giving a sharp peak in the tuner’s output at the tuned frequency. Well, those elements are all here, except the inductor is made from a few turns of “wire” that looked more like 15mm copper pipe. There are also a couple of low value variable capacitors – but since they are rated for operation at up to 16kV, these are large vacuum chamber types about the size of a large can of paint. The capacitance can be varied by pulling on a plunger that moves one of the plates of the capacitor. However, the vacuum means that to actually pull the plunger against the substantial effects of atmospheric pressure, you need to pull really, really hard, and so a substantial gearbox and servo motor assembly does it under software control. The inductor coil is also tuneable – this being achieved by shorting bars driven by large solenoids, which fire in sequence with a loud and satisfying “clank” to shorten or lengthen the effective length of the coil in use.
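As a reminder of the underlying maths (standard LC tank theory, nothing specific to this transmitter): the tank resonates at f₀ = 1/(2π√(LC)), so pulling the vacuum capacitors through their range, and clanking coil sections in and out to change the inductance, is what moves that sharp peak anywhere across the 1.6MHz to 30MHz band.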
Even turning this thing on took a little dance of coordination and timing. Just flicking the power switch would have resulted in the massive inrush current tripping all the breakers. So, a “Step Start” process was used. Software would close one mains contactor, which applied power to the transmitter via a substantial high wattage, low ohms wire wound resistor. The resistor would limit the inrush current. The only problem with that approach was that the resistor would be glowing red hot in no time. So, 500ms after the step start contactor closed, the software would close the main start contactor, bypassing the current limiting resistor.
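For illustration, the step-start dance might look something like the sketch below. The function and constant names are invented for the example, and the blocking delay is a simplification (the real code ran under the cooperative scheduler described later):

```c
/* Illustrative sketch of the step-start sequence - names and the
 * blocking delay are assumptions, not the real source code. */
extern void contactor_close(int which);  /* energise a mains contactor */
extern void delay_ms(unsigned int ms);   /* timer-based delay          */

#define CONTACTOR_STEP_START 0  /* feeds mains via the limiting resistor */
#define CONTACTOR_MAIN       1  /* bypasses the limiting resistor        */

void transmitter_step_start(void)
{
    contactor_close(CONTACTOR_STEP_START); /* resistor limits the inrush */
    delay_ms(500);                         /* let the inrush die away... */
    contactor_close(CONTACTOR_MAIN);       /* ...then bypass the resistor
                                              before it glows red hot    */
}
```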
New control board
The original design had a rack of vertically mounted cards fitted at the top of the case that would handle the control and tuning of the transmitter. For this upgrade, around half of them had been removed and replaced by a modern (mostly surface mount) PCB around 10” square. This housed a 16MHz embedded “system on a chip” controller based on a Z80 CPU core, plus extra circuitry to interface to the older electronics and to provide a serial data link for communications with an external “digital drive”. The digital drive was a very high-tech radio transceiver bristling with digital signal processors (DSPs) that could modulate and demodulate pretty much any kind of radio carrier you might want. (In modern terms it would probably be classed as a “software defined radio”.) The digital drive sat atop the main transmitter in a new extension grafted onto the case. The drive sported a basic user interface and a large electroluminescent display that could present a range of operating details to the user/operator.
The software
There is a surprising amount of software embedded in transmitters. Some of it provides extensive remote control capabilities to set and inspect all the various parameters, and to activate functions like tuning and output level settings. There is lots of built in test and monitoring, and then code for handling operational anomalies like detected arc faults or severe output impedance mismatches, where the software would need to take “evasive action” (typically muting the output for a short time, and then automatically retrying a number of times after a delay to see if the fault clears). This would recover from “brief” problems caused by lightning strikes near the antenna, or bird strikes, but allow the transmitter to “trip out” for sustained problems.
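As a rough sketch of that mute-and-retry behaviour (all names and limits here are invented for illustration, and the blocking delay is a simplification of what would really run under the scheduler):

```c
/* Hedged sketch of "evasive action" on an arc fault or severe
 * mismatch: mute, retry a few times, then trip out if it persists. */
extern void rf_mute(int on);            /* mute/unmute the RF output  */
extern int  fault_present(void);        /* arc or mismatch detected?  */
extern void delay_ms(unsigned int ms);
extern void transmitter_trip(void);     /* shut down and latch fault  */

#define MAX_RETRIES    3
#define RETRY_DELAY_MS 500

void handle_rf_fault(void)
{
    int attempt;

    rf_mute(1);                          /* take evasive action */
    for (attempt = 0; attempt < MAX_RETRIES; attempt++) {
        delay_ms(RETRY_DELAY_MS);
        rf_mute(0);                      /* try again */
        if (!fault_present())
            return;                      /* brief problem - recovered */
        rf_mute(1);                      /* still faulty - mute again */
    }
    transmitter_trip();                  /* sustained problem: trip out */
}
```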
The software was what is known as a “real time” embedded system. It had a bespoke operating system consisting of a library of code to drive and interact with all the various hardware sub systems, and a very compact cooperative multitasking scheduler to run all the required functions when required, many concurrently. This kept the structure of the code clear and simple, while creating the illusion that it could juggle many balls at the same time!
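To give a flavour of the approach (this is a generic minimal example, not the actual bespoke scheduler), a cooperative scheduler can be as simple as a table of run-to-completion tasks driven from a timer tick:

```c
/* Minimal cooperative scheduler sketch. Each task is a function that
 * runs to completion and returns quickly; a timer interrupt is
 * assumed to bump 'tick' once per millisecond. */
#define NUM_TASKS 2

extern void adc_task(void);    /* e.g. the 20ms ADC task described later */
extern void comms_task(void);  /* hypothetical serial comms task         */

typedef struct {
    void (*run)(void);         /* task body                */
    unsigned int period;       /* period in 1ms ticks      */
    unsigned int next_due;     /* tick count when next due */
} task_t;

volatile unsigned int tick;    /* incremented by a timer ISR */

static task_t tasks[NUM_TASKS] = {
    { adc_task,   20, 0 },
    { comms_task,  5, 0 },
};

void scheduler_loop(void)
{
    int i;
    for (;;) {
        for (i = 0; i < NUM_TASKS; i++) {
            if ((int)(tick - tasks[i].next_due) >= 0) {
                tasks[i].next_due = tick + tasks[i].period;
                tasks[i].run();  /* no preemption: runs to completion */
            }
        }
    }
}
```

Because no task can preempt another, tasks need no locking around shared data – hence the clear and simple structure.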
The Job
Basically our job was to design and implement the software. This was to be based on what the old system did, plus the automation of a number of previously manual tasks, and the addition of more advanced systems integration and communications capabilities borrowed from other, more recently designed transmitters in the Marconi range. Once that was done, we were to carry out hardware/software integration on the newly designed control system and get everything working in the transmitter.
The top level specification for the software had been written by one of the few remaining engineers who had worked on the original design of the transmitter, much earlier in his career. Now, some decades later and nearer retirement, he had to conjure much of the detail from old notes, memory, and a bit of reverse engineering of the circuits!
The new software had been designed, written and code inspected. Now all we needed to do was integrate it with the hardware, get it all working, and test it. This was made a bit more fun by the hardware also being rather new and untested, and by the fact that the main control board available for development and testing was a hand assembled prototype rather than a production board.
Work proceeded quite quickly: the “low level” code could interact with the hardware, and the serial comms were up and running, so we could hook up a terminal and have an easy way to feed the system commands and see the responses, as well as debug messages.
We were finding and fixing a mixture of bugs, and also re-discovering some long forgotten hardware interface requirements that needed to be added to the spec. We also had to fix some manufacturing errors in the prototype control card (like when we worked out that one bit of functionality did not seem to work as expected because someone had forgotten to solder one complete side of an IC’s pins onto the board!)
The bug
One day something odd was observed. We noticed that the power output bar graph displayed on the digital drive’s screen was jumping all over the place at random, regardless of the actual output power. The power output level is conjured up by the software by reading a voltage produced by an RF coupler connected to the transmitter’s output, and scaling it with the correct factors to turn it into a power in Watts. The digital drive would periodically request this information from the transmitter, and would turn the value returned into a live bar graph on screen showing the forward power output. Today, however, something was not right.
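As an aside, that voltage-to-Watts scaling might look something like the fragment below. The calibration constants are pure invention for the example, and the real code on a 16MHz Z80 core would more likely have used scaled integer arithmetic than floating point:

```c
/* Hypothetical conversion of a raw coupler ADC reading to Watts.
 * All the constants here are made up, not calibration values. */
#define ADC_FULL_SCALE   255u   /* assuming an 8-bit converter   */
#define COUPLER_FS_VOLTS 5.0    /* volts at ADC full scale       */
#define K_CAL            400.0  /* folds in the coupling factor
                                   and load impedance (invented) */

double forward_power_watts(unsigned char raw)
{
    double volts = (raw * COUPLER_FS_VOLTS) / ADC_FULL_SCALE;
    /* into a fixed load, power goes as the square of the
     * coupled voltage */
    return K_CAL * volts * volts;
}
```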
The control system included a pair of 8 channel Analogue to Digital Converters (ADCs) to allow capture of 16 separate analogue voltages from various sensors like the power couplers and level detectors. Reading a value from them is pretty easy: you write to a control register on the ADC to tell it which channel you want to digitise, write to another register to start the conversion, wait for it to digitise the value, and then read the result. Rather than reading and re-reading them every time a value was needed, our code had a simple task scheduled to run every 20ms which would read and store the results of the previous conversions from the pair of ADCs, and then trigger them to start the next ones. This maintained a continuously refreshed pool of digitised analogue readings available for use by the software.
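In outline, that 20ms task might have looked like this (names are illustrative; the per-chip access helpers are sketched after the next paragraph):

```c
/* Sketch of the 20ms ADC housekeeping task. adc1_access() and
 * adc2_access() each return the result of the previous conversion
 * and kick off the next one - see the following sketch. */
#define ADC_CHANNELS 8

extern unsigned char adc1_access(unsigned char next_ch);
extern unsigned char adc2_access(unsigned char next_ch);

unsigned char adc_pool[2][ADC_CHANNELS];  /* refreshed readings pool */
static unsigned char cur_ch;

void adc_task(void)   /* scheduled every 20ms */
{
    unsigned char next = (cur_ch + 1) % ADC_CHANNELS;

    /* store the conversions triggered on the previous pass, and
     * start the next channel converting on both chips */
    adc_pool[0][cur_ch] = adc1_access(next);
    adc_pool[1][cur_ch] = adc2_access(next);

    cur_ch = next;
}
```

The rest of the software then simply reads the latest value it wants straight out of adc_pool.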
Now nothing had changed in this area of the code; however, it looked like all the readings from just one of the ADCs were no longer sensible. The code that actually read the ADC chip was a very simple function coded in Z80 assembler, included inside the C function as an inline assembler insert (many development systems specifically targeted at embedded system design actually compile to assembler source code as an intermediate step – this makes dropping in extra assembler code very easy). There was one function for each chip – identical, apart from being hard coded to the address range of the ADC in question. Given this was development hardware, the first suspicion was simply a failing ADC chip. It may have been damaged by exposure to an “out of range” voltage input, or even static damage during assembly. The obvious fix was to swap the devices over and check that the fault “moved”. Alas, they were Quad Flat Pack (QFP) devices soldered directly to the board rather than being mounted in sockets, so that was not a quick or trivial exercise.
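To give a feel for what those per-chip routines looked like, here is a hedged reconstruction – the #asm/#endasm convention, the port numbers, the parameter handling and the underscore-prefixed symbol names are all assumptions about the toolchain, not the real source:

```c
/* One near-identical copy of this existed per ADC chip, hard coded
 * to that chip's IO addresses (the ports 0x40-0x42 are invented). */
unsigned char adc1_result;
unsigned char adc1_channel;

unsigned char adc1_access(unsigned char next_ch)
{
    adc1_channel = next_ch;
#asm
    in   a,(0x41)          ; read the completed conversion result
    ld   (_adc1_result),a
    ld   a,(_adc1_channel)
    out  (0x40),a          ; select the next channel to convert
                           ; <- the later "no op" fixes went in here
    out  (0x42),a          ; start the conversion
#endasm
    return adc1_result;
}
```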
However, I had a hunch… in the past I had worked on systems that used peripheral chips designed in the 4MHz Z80 era that were now running on faster CPU buses, and in some cases the timings on the faster buses were just on the margins of what some of these chips could handle. So I thought it worth seeing if that was the case here. I added a single “no op” instruction to the ADC read function, introducing a very short delay between the instruction that selected the channel to convert and the one that started the conversion. I re-compiled and re-tested. Success – normal functionality was restored. So we made a note to go back and investigate that at the hardware level later, to see what was actually going on. Meanwhile, integration and testing carried on fine for the next few days. Then the problem came back. Again, no changes in that area of the code. We tried the same fix – add another sub microsecond “no op” delay. Problem solved. So it could have been that the device was getting ever closer to complete failure, but now I was suspicious that there was more at play. A few builds later, in the same day, the problem returned. So rather than add yet another delay, I tried something different – I removed all the no op instructions we had put in, returning the function to its original form. That also fixed it! That meant it was not a conventional bus timing problem.
Now at this stage my colleague was all for fiddling with delays to keep it working so we could carry on, but to me it seemed clear that we had a much deeper problem, and that we would be better off finding out what was going on sooner rather than later (otherwise, you can imagine, a kludge like that would end up being shipped in the final code!) We needed to start a deeper probe into the hardware and software.
The first problem was a bit mundane – access to the CPU board. It was buried in the box and not easy to get at. So far we had coped OK by programming EPROMs containing each new build of software off the board, and then just plugging them in and relying on what we could do via the serial console, but that was not really any good for deeper access with test equipment. So we had to wait a while for the production department to make up a set of cable loom extensions, and a folded up metal bracket that would allow us to mount the PCB just outside the front of the case, at a jaunty angle where we could get at it and prod it with test gear.
My plan of attack was to hook up the logic analyser we had sitting there, and see what was going on. That started the bonus quest to find who in the company had the one and only set of rare and expensive QFP test clips that would actually allow access to the pins on the surface mount chips. Once those were tracked down, I probed some of the pins on the CPU and also those on the ADC chip – capturing the main bus signals so we could observe when the chip was being accessed, and which registers were being used.
Looking at the working chip, the accesses looked normal. A write to it during the boot time initialisation to select the first ADC channel, then a write to the start conversion register to kick off the first conversion, and then later reads and writes once the scheduled task took over, continuously grabbing conversions and storing them. However, clipping the probe onto the failing device showed something obviously different. You could see the write to the device channel selection register and then to the start conversion register, but then, about 6 microseconds later, an additional write to the start conversion register was happening while there was a conversion still in process. (The ADC was a “successive approximation” style device that took around 25 uSec to perform a conversion.) This unexpected write was glitching the conversion in progress, and causing the nonsensical results.
Now armed with this knowledge, it looked like we had a software error causing an unexpected write to the device. This in itself was somewhat unusual on a Z80 platform, since that CPU has a separate address space for its IO devices, meaning that only the special IN and OUT instructions can actually access IO – you can’t simply trample on IO space with a spurious read or write caused by accessing a corrupted or uninitialised pointer (not untypical in languages like C). Still, it should be simple enough to find this with the right kit… all we needed was to find the kit!
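To make that memory/IO distinction concrete, here is a miniature illustration (the port number and the inline assembly syntax are assumptions for the example):

```c
/* On the Z80, memory and IO are separate address spaces strobed by
 * different bus signals, so a wild pointer cannot normally reach IO. */
unsigned char *p = (unsigned char *)0x0040;

void demo(void)
{
    *p = 0xFF;       /* memory write: /MREQ is asserted - an ADC
                        mapped at IO address 0x40 never sees this */
#asm
    ld   a,0xFF
    out  (0x40),a    ; IO write: only IN/OUT instructions assert
                     ; /IORQ, so only deliberate code reaches IO
#endasm
}
```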
What we now needed was an “In Circuit Emulator” or ICE. Eventually we tracked down someone with a venerable old Pentica Mime ICE for the Z80 CPU, relieved them of it, and got it connected up to our test system. I set up some IO access breakpoints to trigger whenever any of the block of IO addresses associated with the ADCs was accessed, then configured the breakpoints to briefly enable the trace buffer to log the code running at the time of the accesses. That meant we could run the code at full speed, in its normal mode of operation, but still capture the “smoking gun” that was shooting the ADC, in the act.
We ran the code for a bit, and then stopped execution after enough time had passed to be sure we would have captured many runs through the ADC access code. Then we had a look at the trace buffer. That revealed… nothing, nada, zilch! We could see all the reads and writes that should be there, but we could not see the one that was causing the problem.
This was now a very interesting problem – the ADC was definitely getting written to unexpectedly, but the software was not actually doing it at all – even in error. The plot thickened.
Now we needed a bigger picture. Previously I had hooked up the logic analyser to “just enough” of the address and data pins on the CPU to get a feel for what was going on. Probes on the IO select pin on the CPU, and the chip select and register select pins on the ADC, were enough to see the problem. However, there must have been something more going on that we were not seeing. So I spent a good while hooking up test leads to many more CPU lines, including all the address, data and control bus lines, the interrupt line, and anything else that might show us more. (This is a fairly slow process – find the right wire on the logic analyser probe, find the right pin on the chip from the circuit diagram or the device’s spec sheet, count to the position of that pin on the chip, attach the probe wire to it, and then configure the logic analyser to label that probe line so the final trace shown on its screen would make some sense.)
Running the code now, I could see something unusual. When the ADC was accessed, you could see the IO pin on the CPU go active, as you would expect. On the spurious access, however, you could see the IO pin on the CPU go active, but you could also see the memory access pin go active at the same time. Now this raised an interesting question: why? Not only why was the memory access pin active at the same time as the IO access one, but, more fundamentally, why does the CPU even have two separate pins for this? (Other CPUs that also have a separate IO address space, like the Intel 8086 range, use only one control bus pin: in one state it means memory should be active, and in the other, IO.) What does it actually mean when the CPU apparently indicates that it would like to access memory and IO simultaneously – something that makes no sense at all? I realised my knowledge of Z80 hardware was lacking.
I retreated from the integration “rig room” to the lab to dig out my copy of the seminal Z80 programmer’s bible, “Programming the Z80” by Rodnay Zaks – a book I had bought about 20 years earlier in a fit of enthusiasm when first learning machine code programming. Not having access to a Z80 based system at the time meant I had never really found much use for it. Today, however, it would pay for itself! It turns out that the Z80 hardware uses the simultaneous activation of the memory and IO access signals as a special “interrupt acknowledge” indication (strictly, it is the M1 opcode fetch signal – which normally accompanies a memory read – that is asserted together with the IO request line). It is intended to allow the CPU to signal to an external peripheral device, or a dedicated interrupt controller IC, that the CPU has seen and responded to an interrupt. (Depending on the interrupt mode selected on the CPU, the external device may even be expected to inject an instruction opcode onto the data bus to effect a call to the interrupt handler.) Being able to send an “int ack” is fundamental to the proper handling of interrupts – especially in more complex systems. Apparently the chip designers decided not to provide a dedicated pin for it.
Normally a computer program will decide its own flow of execution – looping over repetitive tasks, taking decisions based on the information available, doing calculations, and reading or writing information as required. Sometimes, however, a computer program needs to be able to respond rapidly to an event that is not under its direct control. To enable this, CPUs support an “interrupt” capability. This is a signal wired up to the CPU that allows some external device to get the attention of the CPU when it needs it. When the CPU is “interrupted”, it will stop whatever it was doing and jump into an interrupt service routine (ISR). The ISR will do whatever is required to deal with the device that caused the interrupt. When it has finished, it will return to whatever it was doing before it was interrupted. A good example might be data arriving on a communications link: the communications device needs the CPU to read and save the data quickly, before it gets overwritten by more data arriving.
ISRs are very useful, but they also add a layer of complexity to a program, since they can occur at any time and disrupt the sequence of operation. An ISR has to take great care not to upset the main program by what it does. It should also complete quickly, so as not to significantly alter the execution time of the main program, since in some systems the accurate timing of execution can be very important.
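The classic serial-port case might look like this minimal sketch (the names and the ring buffer are illustrative, not from the project’s code):

```c
/* A UART receive ISR: grab the byte before the next one overwrites
 * it, stash it in a ring buffer, and get out fast. */
#define RX_BUF_SIZE 64

extern unsigned char uart_read_data(void);  /* read received byte */

static volatile unsigned char rx_buf[RX_BUF_SIZE];
static volatile unsigned char rx_head;
static unsigned char rx_tail;

void uart_rx_isr(void)      /* entered via the interrupt mechanism */
{
    rx_buf[rx_head] = uart_read_data();
    rx_head = (rx_head + 1) % RX_BUF_SIZE;
}

int uart_getc(void)         /* called from the main program */
{
    int c = -1;
    if (rx_tail != rx_head) {
        c = rx_buf[rx_tail];
        rx_tail = (rx_tail + 1) % RX_BUF_SIZE;
    }
    return c;
}
```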
Looking closely at the logic analyser captures, I could indeed see that an interrupt from one of the serial port UARTs was occurring just before the problem IO write. The question now was: why was the ADC being selected at all, since this particular use of the IO access pin did not actually indicate a real access to IO? The address decoding logic on the processor board should explicitly ignore IO accesses to anything when the memory access line is also active. It looked like we had an address decoding bug!
Looking at the circuit diagram and working back from the chip select pin of the ADCs, it was easy to find the logic responsible for the address decoding. This looked at the address lines and the IO access line when deciding when to activate the ADC; critically, however, the decode circuitry did not include the memory access control line, and so could not ignore an active IO line when it was actually being used to acknowledge an interrupt rather than to access IO.
That meant we had a strange situation: if an interrupt were to occur during that very brief moment after the ADC access routine had been called and the next conversion started, but before the conversion had finished, the CPU would start processing the interrupt request and generate the interrupt acknowledge signal (which our hardware could mistake for an IO write). Normally even this would not matter, since the IO line being active was not enough by itself to actually select a real IO device. The address bus would also have to be pointing somewhere in the address range of an actual IO device at the exact moment that the interrupt was acknowledged. Only then would you get the perfect combination of events that would cause the decoding logic to activate an IO device by mistake.
Explaining the bug
Looking at the logic analyser again, we could see the address on the bus at the time the interrupt occurred. Not surprisingly, it was the memory location of an instruction inside one of the ADC read functions. What was surprising, however, was the address itself: numerically, it was very close to the address in IO space where the second ADC chip had been mapped by the decoding logic – in fact, it was within the 16 byte window that would allow the address decoding logic to activate the ADC in error.
So for this bug to occur there was a very tight timing race hazard: an interrupt had to occur during a 25 uSec window of time – after the ADC access function had been called (which only happened once every 20ms), but before the requested conversion had completed (which only took around 25 uSec) – and the memory address space occupied by that ADC access routine had to be numerically the same as (or very close to) the IO address where the ADC itself was mapped in IO space. By coincidence, it was!
This is how the bug could appear and disappear. When other code linked into the ROM address space below that of the ADC access routines changed in size, it could shift the start location of the ADC routines up and down in the memory address space. Now, the IO library code was one of the first modules linked during the build, so it was close to the start of the ROM and there was not much code located below it, meaning changes to code low enough in the address space to actually matter were fairly rare. Apparently, however, we had by now tweaked enough code to land in this bit of ROM space. This is also how the early “no op” fixes we tried temporarily solved the problem – it had nothing to do with timing at all. What they did was change the size of the code just enough to nudge the address of the ADC routine out of that critical 16 byte window that the address decode logic could misuse.
The fix
Fixing the bug was actually quite easy. The designer of the address decode logic had used an 8 input logic gate to collect together any signals that should mask out the remainder of the decode logic under any conditions that should prevent it from being activated. Only about three of its inputs were in use. One mod wire added to the PCB, bringing the memory access bus signal into this gate, was all that was needed. Now, when the memory access line was “active”, the chip select to the ADC could never be activated – even if the address on the bus did happen to point at it and the IO signal was active.
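Expressed as boolean logic (shown active-high here for readability; the real Z80 bus signals are active-low), the change amounts to one extra term in the decode:

```c
/* Before the mod: an interrupt acknowledge (IO request asserted
 * during what is otherwise flagged as a memory/opcode-fetch cycle)
 * could still select the ADC if the address happened to match. */
int adc_select_before(int io_req, int addr_match)
{
    return io_req && addr_match;
}

/* After the mod wire: the memory-cycle indication masks the decode,
 * so an interrupt acknowledge can never select the chip. */
int adc_select_after(int io_req, int mem_cycle, int addr_match)
{
    return io_req && !mem_cycle && addr_match;
}
```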