28 February 2011

Synchronization Bugs

My previous posting reminds me of an amusing story I read in the book _Coders at Work_ by Peter Seibel. The story is told by Joshua Bloch, whom I consider to be an exceptionally gifted programmer:

To test the code, I wrote a monstrous “basher.” It ran lots of transactions, each of which contained nested transactions, recursively up to some maximum nesting depth. Each of the nested transactions would lock and read several elements of a shared array in ascending order and add something to each element, preserving the invariant that the sum of all the elements in the array was zero. Each subtransaction was either committed or aborted—90 percent commits, 10 percent aborts, or whatever. Multiple threads ran these transactions concurrently and beat on the array for a prolonged period. Since it was a shared-memory facility that I was testing, I ran multiple multithreaded bashers concurrently, each in its own process.

At reasonable concurrency levels, the basher passed with flying colors. But when I really cranked up the concurrency, I found that occasionally, just occasionally, the basher would fail its consistency check. I had no idea what was going on. Of course I assumed it was my fault because I had written all of this new code.

[...]

I spent a week or so writing painfully thorough unit tests of each component, and all the tests passed. Then I wrote detailed consistency checks for each internal data structure, so I could call the consistency checks after every mutation until a test failed. Finally I caught a low-level consistency check failing—not repeatably, but in a way that allowed me to analyze what was going on. And I came to the inescapable conclusion that my locks weren’t working. I had concurrent read-modify-write sequences taking place in which two transactions locked, read, and wrote the same value and the last write was clobbering the first.

I had written my own lock manager, so of course I suspected it. But the lock manager was passing its unit tests with flying colors. In the end, I determined that what was broken wasn’t the lock manager, but the underlying mutex implementation! This was before the days when operating systems supported threads, so we had to write our own threading package. It turned out that the engineer responsible for the mutex code had accidentally exchanged the labels on the lock and try-lock routines in the assembly code for our Solaris threading implementation. So every time you thought you were calling lock, you were actually calling try-lock, and vice versa. Which means that when there was actual contention—rare in those days—the second thread just sailed into the critical section as if the first thread didn’t have the lock. The funny thing was that this meant the whole company had been running without mutexes for a couple weeks, and nobody noticed.
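The failure mode Bloch describes is easy to appreciate with a contrived example. The sketch below is my own, in C with POSIX threads, and has nothing to do with Bloch's actual code: it shows what happens when the routine everyone believes is a blocking lock really performs a try-lock and throws away the result. Under contention the second thread walks straight into the critical section and updates get clobbered; as a bonus, the matching unlock then releases a mutex the thread never owned.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    /* What callers believe is a blocking acquire. Because it really does a
     * try-lock and ignores the result, a thread that hits contention just
     * keeps going as if it held the mutex. */
    static void broken_acquire(void) {
        (void)pthread_mutex_trylock(&mu);   /* may return EBUSY; ignored */
    }

    static void release(void) {
        /* If the try-lock above failed, this unlocks a mutex we never
         * owned -- undefined behavior, and more fuel for the fire. */
        pthread_mutex_unlock(&mu);
    }

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            broken_acquire();   /* should have been pthread_mutex_lock(&mu) */
            counter++;          /* read-modify-write, now unprotected */
            release();
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* With a real lock this always prints 2000000; with the swapped
         * try-lock, updates are occasionally lost and the total comes up short. */
        printf("counter = %ld\n", counter);
        return 0;
    }

This also explains why the lock manager's unit tests kept passing: with a single thread there is never any contention, so single-threaded tests are happy with either routine.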


I find stories like these from the trenches to be weirdly entertaining.

The problem that Bloch reports here reminds me of a bug I encountered in one of my graduate classes. I was taking a parallel programming class in which we did a fair amount of coding in several different parallel environments. One of these environments provided a barrier synchronization primitive whose implementation was heavily optimized to scale to a large number of processors. Since a large part of the grade in the class was based on writing code that was demonstrably efficient, my code made use of this primitive.

Then the trouble began: I noticed some weirdness in my benchmarks, so I started inserting more and more self-checks into my code. Eventually I came to the inescapable conclusion that the barrier synchronization primitive had a bug in its implementation. At that point I had no choice but to implement my own barrier primitive and migrate my code to it. When I reported the problem I (understandably) got a skeptical reply, and I had very little time to dig into the details, so I simply kept using my own barrier instead of the system's. My implementation was definitely slower than the system's, but I could not bring myself to use theirs anymore.
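For anyone who hasn't written one, a barrier just makes every thread in a group wait until all of them have reached the same point in the program. I won't pretend the sketch below is the exact code I wrote back then, but a plain, unoptimized centralized barrier along these lines (in C with POSIX threads; the names are mine) is the kind of thing one falls back on, and it also shows why a hand-rolled version loses to a primitive tuned for large processor counts: every arriving thread funnels through a single mutex.

    #include <pthread.h>

    /* A minimal centralized barrier. Each thread increments a counter under a
     * mutex; the last one to arrive resets the counter, bumps the generation
     * number, and wakes everyone else. The generation number lets the barrier
     * be reused round after round without racing threads from the previous
     * round. */
    typedef struct {
        pthread_mutex_t mu;
        pthread_cond_t  cv;
        int             nthreads;   /* participants per round */
        int             arrived;    /* threads that have reached the barrier */
        unsigned        generation; /* which round we are on */
    } barrier_t;

    void barrier_init(barrier_t *b, int nthreads) {
        pthread_mutex_init(&b->mu, NULL);
        pthread_cond_init(&b->cv, NULL);
        b->nthreads   = nthreads;
        b->arrived    = 0;
        b->generation = 0;
    }

    void barrier_wait(barrier_t *b) {
        pthread_mutex_lock(&b->mu);
        unsigned gen = b->generation;
        if (++b->arrived == b->nthreads) {
            /* Last arrival: start a new round and release everyone. */
            b->arrived = 0;
            b->generation++;
            pthread_cond_broadcast(&b->cv);
        } else {
            /* Wait until the round changes; the loop handles spurious wakeups. */
            while (gen == b->generation)
                pthread_cond_wait(&b->cv, &b->mu);
        }
        pthread_mutex_unlock(&b->mu);
    }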

After a couple of weeks in this state, I felt a good deal of vindication when my fellow classmates began complaining in class that they, too, suspected the barrier synchronization primitive had a bug. At that point the professor gave us the code for the system's underlying barrier implementation. I really wish I could report that I (or one of my classmates) had an "a-ha!" moment with that code...but unfortunately it never happened. I put around three hours into trying to track down the bug in the system's highly optimized barrier mechanism, but given the number of hours I was putting into $DAYJOB at the time, three hours was all I could spare. As far as I know, nobody ever found the actual problem. I always felt a little bad about that, but there was nothing more I could do.

Despite this particular incident, I have had a good deal of success debugging threading and synchronization bugs over the years, and the one thing I can say about this class of problem is that it is really satisfying to find and fix one!
