01 March 2008

Debugging War Story

One day, a colleague gave me an update on a problem he was trying to fix in one of the older products he maintained.

My colleague informed me:

We can't fix the problem because we can't even produce a build with the current codebase that doesn't crash instantly on the board. Something in the code changed. I've tracked it down; it is a compiler bug. We'll have to call the compiler vendor.

You have to realize, this problem report pushed so many wrong buttons in my engineer's brain that my head immediately started to hurt. I got a cup of coffee and prepared for battle.

When I got back to my colleague's desk I asked, "When was the last time you produced a working build for this product?" Let's just say that the answer was "a lot longer than 6 months and many code re-orgs ago".

Great. The pain in my head became a dull throb.

I thought for a moment and then asked "You mentioned that you had tracked this down to being a compiler bug -- how do you know this?"

A few minutes later my colleague was showing me the assembly language output generated by the compiler. I was a bit out of my element here; I was not familiar with the target processor or its assembly language.

"So, what exactly is the bug here?" I asked. My co-worker explained to me that the compiler was dealing with some code that was working with uint32_t values, but in one particular case it just decided to deal with a uint32_t value using the processor's 16-bit instructions. So, a value in a register was getting "shaved", and this was the root cause of the fatal error the product was experiencing.

Again, I was not familiar with the target processor, but I did manage to look through a reference book on my colleague's desk and I did verify that, sure enough, the assembly language output was using 16-bit instructions in a sea of other code that treated the value properly as a 32-bit value.

At this point I learned a little bit more about the compiler. It wasn't GCC -- this compiler was provided by the chip vendor. The whole compiler seemed to be tightly integrated with the vendor's IDE, some Win32 app that seemed a little flaky at best. I'd never used this compiler before in my life.

At this point I had two conflicting thoughts going on in my brain: (1) my co-worker was telling me that there was a compiler bug and (2) I hadn't seen an actual compiler bug in a C compiler in over a decade, especially for code as simple as this.

So, I decided to look at the C code in question a little more carefully. It turned out that the problem description of "the compiler is generating code that uses 16-bit instructions to work with 32-bit values" was a bit of an oversimplification; rather, the problem could more accurately be described as "the compiler was emitting 16-bit instructions to move a 32-bit return value (returned from a function call) off of the stack". Let's call the function in question foo().

Oh. I was starting to get a hunch about the problem.

"Is there a prototype for this function that returns a uint32_t?" I asked my colleague. "Yes" was his response. Sure enough, he showed me the prototype in a header file. Damn, this was a minor setback to my hunch. It looked like this in the code, of course:
extern uint32_t foo(uint32_t some_param);
So, at this point I directed my colleague to utilize one of my favorite debugging techniques -- I asked him to run the compiler on the source file in question, but to only run the C preprocessor on the file. This is usually as simple as invoking the compiler like "cc -E" or "gcc -E". After a few minutes of futzing around with the win32 IDE that controlled the compiler, we were eventually able to generate the preprocessed output, all dumped to a file.

As soon as we generated the file, I had my smoking gun.

We opened the file in a text editor and I immediately asked my colleague to search for "foo". Sure enough, the first occurrence of this string in the file was at the place where the function was invoked. Let me be really clear here: yes, there was a prototype for this function, and it existed in some header file, but that header was never #included in the .c file that we were looking at!

I asked my colleague one more question, but I knew what the response would be before I even asked:

"What size are ints on this processor?"

"16 bits." was his response.

I started doing a little jig in his office... problem solved!

There was no compiler bug. The problem was that the compiler was being asked to generate some code to invoke a function called foo() but it had never heard of that function before. But this is C (at least as it stood before C99), and this is legal: a call to an undeclared function implicitly declares it as returning int. So, the compiler generated the code to pop the return value off of the stack using that default -- int -- and on this particular target, ints were 16 bits wide.

What are the lessons from all of this? I would humbly suggest that there are three:

1: Quality code is built in an environment in which compiler warnings are copiously enabled and paid attention to.

2: If you have a product and you're not building and testing the build output frequently, you're doing something wrong.

3: Occasionally, it is handy to have an engineer who can debug issues like these on staff...