03 March 2008

Debugging War Story 2

At one of the projects I worked on in the past I got to work on some protocol design and implementation. This was actually one of my favorite projects ever; it was a project where I had a lot of responsibility, I got to work with a lot of interesting people on some hard problems, and I got to work on a project that allowed me to be creative and technical at the same time.

Anyways, I was in charge of the protocol implementation. I am a very careful and a very conservative programmer, and a lot of the protocol implementation was coded in the style that I prefer.

The protocol itself was binary (out of necessity). One day as we continued the design of the protocol we concluded that we needed to add a 64-bit integer to one of the message fields. After agreeing on the design, I updated one of the structs that I had created to store the message fields:

#if defined(SOME_COMPILER) || defined(SOME_OTHER_COMPILER)
#pragma pack (1)
#endif
struct SomeMsgStruct {
uint32_t field1;
uint32_t field2;
....
uint64_t fieldN; /* NEW FIELD */
}
#if defined(GNUC)
__attribute__ ((__packed__))
#endif
;
#if defined(SOME_COMPILER) || defined(SOME_OTHER_COMPILER)
#pragma pack (0)
#endif


The key point you need to understand here is that I wanted to make sure this struct was packed and there was no padding in the struct (as the C standard allows). You need to understand that I was supporting several different compilers and target architectures, some of which I did not have easy access to.

So, in my conservative programming style, I also added the following code to one of the protocol's initialization functions:


SomeMsgStruct x;
assert(sizeof(x) == (sizeof(x.field1) + sizeof(x.field2) + sizeof(x.fieldN)));


After testing my code, I checked it in, announced my changes, and then moved on to my next tasks.

....

Several days later, one of my colleagues, who, shall we say, was not a detail-oriented engineer, complained to me that something was wrong with the protocol stack and that there was garbage being transmitted on the wire. I knew this problem had "wild goose chase" written all over it, but I didn't have a magic wand to fix this problem. So, at around 2pm, we sat down to debug the problem.

We analyzed logfiles. We looked at protocol traces. I tried to reproduce the problem on my setup, but the problem only reared its ugly head on my colleague's setup. My colleague's setup included a different target processor than what I had in my setup, so I was quickly forced to try to understand this problem on my colleague's foreign setup (which seemed to include a very tedious compile/link/load-the-binary-onto-the-target phase).

Eventually, in one of the logfiles, I noticed that something seemed to be wrong with "struct SomeMsgStruct". Ah! So I had my colleague add this to the code:

printf("sizeof SomeMsgStruct: %d\n", sizeof(SomeMsgStruct));

And, sure enough, the output was not what I was expecting.

Now I was really confused. At around 7:30pm, I wondered aloud to my colleague "How could this possibly be happening?! How could the compiler be doing this? I even put an assert() in the code to make sure that everything was right!".

At this point my co-worker blurted out:

assert()? Oh, what's that? When I first started working with your new code this morning, I kept on getting this `assertion failed' message. But I just wanted to get the code going, so I commented out that pesky line of code.

After this revelation, I excused myself from the noisy lab where I had just spent the entire afternoon, went outside to the parking lot, and had a good scream. After I composed myself, I went back inside and re-wrote the code so that the structure would be packed correctly on my colleague's target architecture. I also told my colleague, in no uncertain terms, to never modify my code again and to absolutely positively never ever remove an assertion from the code again unless he knew what he was doing and was prepared to deal with the consequences.

I am saddened to inform you, kind reader, that this wasn't the last time in my life that a colleague removed an assertion from my code. I guess that I am better able to deal with this now, but I am never less surprised when I see this.