kdc-blog: parallel programming

My previous posting reminds me of an amusing story I was reading in the book _Coders at Work_ by Peter Seibel. The story is told by Joshua Bloch, who I consider to be an exceptionally gifted programmer:

To test the code, I wrote a monstrous “basher.” It ran lots of transactions, each of which contained nested transactions, recursively up to some maximum nesting depth. Each of the nested transactions would lock and read several elements of a shared array in ascending order and add something to each element, preserving the invariant that the sum of all the elements in the array was zero. Each subtransaction was either committed or aborted—90 percent commits, 10 percent aborts, or whatever. Multiple threads ran these transactions concurrently and beat on the array for a prolonged period. Since it was a shared-memory facility that I was testing, I ran multiple multithreaded bashers concurrently, each in its own process.

At reasonable concurrency levels, the basher passed with flying colors. But when I really cranked up the concurrency, I found that occasionally, just occasionally, the basher would fail its consistency check. I had no idea what was going on. Of course I assumed it was my fault because I had written all of this new code.

......

I spent a week or so writing painfully thorough unit tests of each component, and all the tests passed. Then I wrote detailed consistency checks for each internal data structure, so I could call the consistency checks after every mutation until a test failed. Finally I caught a low-level consistency check failing—not repeatably, but in a way that allowed me to analyze what was going on. And I came to the inescapable conclusion that my locks weren’t working. I had concurrent read-modify-write sequences taking place in which two transactions locked, read, and wrote the same value and the last write was clobbering the first.

I had written my own lock manager, so of course I suspected it. But the lock manager was passing its unit tests with flying colors. In the end, I determined that what was broken wasn’t the lock manager, but the underlying mutex implementation! This was before the days when operating systems supported threads, so we had to write our own threading package. It turned out that the engineer responsible for the mutex code had accidentally exchanged the labels on the lock and try-lock routines in the assembly code for our Solaris threading implementation. So every time you thought you were calling lock, you were actually calling try-lock, and vice versa. Which means that when there was actual contention—rare in those days—the second thread just sailed into the critical section as if the first thread didn’t have the lock. The funny thing was that that this meant the whole company had been running without mutexes for a couple weeks, and nobody noticed.

I find stories like these from the trenches to be weirdly entertaining.

The problem that Bloch reports here reminds me of a bug that I encountered once during one of my graduate classes. I was taking a parallel programming class in which we were doing a fair amount of coding with various parallel programming environments. One of these environments provided a barrier synchronization primitive. The implementation of this primitive was heavily optimized to be efficient with a large number of processors. Since a large part of the grade in the class was based on our ability to write code that was demonstrably efficient, my code made use of this primitive. The problem was this: I started noticing some weirdness in my benchmarks, so I started inserting more and more self-checks into my code. Eventually I came to the inescapable conclusion that the barrier synchronization primitive had some bug in its implementation. At this point I had no choice but to implement my own barrier synchronization primitive and migrate my code to use my implementation. When I reported my problem I (understandably) got a skeptical reply. I had very little time to look into the details of the problem, so I continued to use my own barrier synchronization primitive instead of the system's. My implementation was definitely slower than the system's, but I simply could not use the system's anymore.

After a couple of weeks of existing in this state, I felt a good deal of vindication when I heard my fellow classmates complain in class that they thought that the barrier synchronization primitive had a bug. At this point the professor who ran the class gave us the code for the system's underlying barrier synchronization primitive. I really wish I could report that I (or one of my classmates) had an "a-ha!" moment with the code at this point...but unfortunately this never happened. I put around three hours into trying to track down the bug in the system's highly optimized barrier sync mechanism, but given the large amount of hours I was putting into $DAYJOB at the time, three hours was all I had to give to finding this problem. As far as I know, nobody ever found the actual problem here. I always felt a little badly about this, but there was nothing I could do about this problem.

Despite this specific incident, I have had a good deal of success in debugging thread/synchronization bugs over the years, and the one thing I can say about this class of problem is this: it is really satisfying to find and fix a problem in this area!

Over at JavaLobby, Clay Breshears asks "What is so Hard About Parallel Programming?".

Hmmm. Interesting. I've spent quite a bit of time thinking about this question too. I happened to write my masters thesis on the subject of parallel computation.

I think that Mr. Breshears is obviously a very smart individual, but I disagree with his assertion that:

[...] learning parallel programming is no harder than learning a second programming language.

Let me be clear: I am not violently disagreeing with Mr. Breshears, but let me put it this way: I would suggest that for a reasonably intelligent computer science student, learning a second programming language, on a scale of 1-10, is around a "3" (in terms of difficulty). Now, for this same student to learn parallel programming, I would estimate that, on a scale of 1-10, the amount of difficulty would be a "7".

So, I disagree with Mr. Breshears.

What is my rationale for my disagreement? In my experience, it is often hard to write a serial program to solve a particular problem. I mean, this should be uncontroversial. However, I think that moving from a serial program that solves a given problem to a correct parallel program that solves the same problem requires a paradigm shift. The central part of this transformation does not involve more rote learning of some new programming language's syntax and APIs -- this process requires a completely different way of thinking. It is one thing to hazily understand that an execution unit on a given machine is going to be executing a given routine; it is another thing entirely to be able to grok that multiple execution units on a given machine might be executing the same routine all at the same time.

There is one other thing that this transformation requires as well: a fantastic attention to detail. Coming up with a correct/efficient parallel program that solves a problem means that the programmer has to completely understand the problem at hand (especially the data dependencies). The programmer also has to know the semantics of the parallel programming system that they are using like the back of their hand. This last bit cannot be stressed enough: the number of details involved here can be mind-boggling.

How difficult can it be to come up with a parallel implementation of a serial program? Well, let me answer this question somewhat indirectly. One of the most seductively popular ways to write a parallel program is to use threads and synchronization mechanisms. In fact, threads and synchronization mechanisms aren't just used for parallel programming -- they're also used quite a bit in concurrent systems programming. This is where I spend quite a bit of my time as a software engineer, actually. Over the years, I have seen quite a few problems in real products that have as their root-cause something to do with threads and synchronization primitives. These problems have all been inadvertently created by engineers with at least a few years of industry experience (sometimes much more...). If engineers with this level of experience can get confused and make mistakes in such an environment, what chance does the average college sophomore (as Mr. Breshears suggests) have to use similar technology and create parallel programs? I would humbly suggest that the answer to this question is "not much".

kdc-blog

28 February 2011

Synchronization Bugs

06 April 2009

What is so Hard About Parallel Programming? Lots!

About Me

Blog Archive