12 September 2012
Semaphores
[text: "Perhaps you should have somebody on your development team that understands how to use semaphores and other synchronization primitives", photograph reminds people that in some systems, a system failure is a lot more inconvenient than simply having to restart]
Image credit: http://www.flickr.com/photos/smithsonian/4011428942/
If you buy me beer I might even tell you how I got the inspiration for this creation. Let's just say that my inspiration came from code I looked at a long time ago.
28 February 2011
Synchronization Bugs
To test the code, I wrote a monstrous “basher.” It ran lots of transactions, each of which contained nested transactions, recursively up to some maximum nesting depth. Each of the nested transactions would lock and read several elements of a shared array in ascending order and add something to each element, preserving the invariant that the sum of all the elements in the array was zero. Each subtransaction was either committed or aborted—90 percent commits, 10 percent aborts, or whatever. Multiple threads ran these transactions concurrently and beat on the array for a prolonged period. Since it was a shared-memory facility that I was testing, I ran multiple multithreaded bashers concurrently, each in its own process.
At reasonable concurrency levels, the basher passed with flying colors. But when I really cranked up the concurrency, I found that occasionally, just occasionally, the basher would fail its consistency check. I had no idea what was going on. Of course I assumed it was my fault because I had written all of this new code.
......
I spent a week or so writing painfully thorough unit tests of each component, and all the tests passed. Then I wrote detailed consistency checks for each internal data structure, so I could call the consistency checks after every mutation until a test failed. Finally I caught a low-level consistency check failing—not repeatably, but in a way that allowed me to analyze what was going on. And I came to the inescapable conclusion that my locks weren’t working. I had concurrent read-modify-write sequences taking place in which two transactions locked, read, and wrote the same value and the last write was clobbering the first.
I had written my own lock manager, so of course I suspected it. But the lock manager was passing its unit tests with flying colors. In the end, I determined that what was broken wasn’t the lock manager, but the underlying mutex implementation! This was before the days when operating systems supported threads, so we had to write our own threading package. It turned out that the engineer responsible for the mutex code had accidentally exchanged the labels on the lock and try-lock routines in the assembly code for our Solaris threading implementation. So every time you thought you were calling lock, you were actually calling try-lock, and vice versa. Which means that when there was actual contention—rare in those days—the second thread just sailed into the critical section as if the first thread didn’t have the lock. The funny thing was that that this meant the whole company had been running without mutexes for a couple weeks, and nobody noticed.
I find stories like these from the trenches to be weirdly entertaining.
The problem that Bloch reports here reminds me of a bug that I encountered once during one of my graduate classes. I was taking a parallel programming class in which we were doing a fair amount of coding with various parallel programming environments. One of these environments provided a barrier synchronization primitive. The implementation of this primitive was heavily optimized to be efficient with a large number of processors. Since a large part of the grade in the class was based on our ability to write code that was demonstrably efficient, my code made use of this primitive. The problem was this: I started noticing some weirdness in my benchmarks, so I started inserting more and more self-checks into my code. Eventually I came to the inescapable conclusion that the barrier synchronization primitive had some bug in its implementation. At this point I had no choice but to implement my own barrier synchronization primitive and migrate my code to use my implementation. When I reported my problem I (understandably) got a skeptical reply. I had very little time to look into the details of the problem, so I continued to use my own barrier synchronization primitive instead of the system's. My implementation was definitely slower than the system's, but I simply could not use the system's anymore.
After a couple of weeks of existing in this state, I felt a good deal of vindication when I heard my fellow classmates complain in class that they thought that the barrier synchronization primitive had a bug. At this point the professor who ran the class gave us the code for the system's underlying barrier synchronization primitive. I really wish I could report that I (or one of my classmates) had an "a-ha!" moment with the code at this point...but unfortunately this never happened. I put around three hours into trying to track down the bug in the system's highly optimized barrier sync mechanism, but given the large amount of hours I was putting into $DAYJOB at the time, three hours was all I had to give to finding this problem. As far as I know, nobody ever found the actual problem here. I always felt a little badly about this, but there was nothing I could do about this problem.
Despite this specific incident, I have had a good deal of success in debugging thread/synchronization bugs over the years, and the one thing I can say about this class of problem is this: it is really satisfying to find and fix a problem in this area!
04 August 2010
Java IO Faster Than NIO?
I read Paul Tyma's slides intently because the subject of "how do you design a massively scalable server?" is one of my favorite topics. There are so many interesting areas to explore in this area.
I have designed several scalable servers over the years. In one case this was for some telco-class equipment that I was helping to design. On that project I ended up using a combination of threads, /dev/poll (on platforms that had this) and poll() (on other platforms). I was under a large amount of pressure to make things fast/scalable on this project, so I was always tinkering around. In the end, I came up with a fairly high-performance communications subsystem that satisfied the goals that we had. The test harnesses that I wrote to go along with this server seemed to validate this as well.
For work on another scalable server that I designed, I went with Java NIO. This is why the article referenced on /. interested me so much. When I started on this particular project, I had no idea of whether to go with a blocking/threads-based architecture or a pure NIO-based system. In the end, I guessed that NIO might be faster so I went with that. The design that I came up with for this server is based on a lot of research that I've done over the years, and this server appears to be very scalable and fast.
I do have some other random thoughts on this topic...
- One of the scalable servers that I have designed happens to interact with SSL/TLS. I have to tell you -- getting a non-blocking I/O layer to interact correctly and efficiently with a SSL/TLS layer is pretty complicated. I spent a huge amount of brainpower and time getting things right here. Given the information that is presented in Tyma's article, if I were to start designing a scalable Java-based server again, I would definitely consider just using Java IO rather than NIO. This would make the code much more simple and straightforward.
- Speaking of making a high-performance communications subsystem, what's the deal with all of the postings on the web from "programmers" who can't figure out how to use their socket API so they insert calls to sleep() in order to "guarantee" that they can read/write data? I must have seen a dozen postings with advice like this. This boggles my mind! A call to sleep() has no place in a high-performance communications subsystem, and anybody who puts one there is not producing good quality code.
- In Tyma's slides, I especially liked the story of Rob Van Behren. When I got to the sentence that read "[w]hen he was done, he sat back and realized he had written the foundation for a threading package" I felt like I had known this gentleman named Rob for a hundred years...
06 April 2009
What is so Hard About Parallel Programming? Lots!
Hmmm. Interesting. I've spent quite a bit of time thinking about this question too. I happened to write my masters thesis on the subject of parallel computation.
I think that Mr. Breshears is obviously a very smart individual, but I disagree with his assertion that:
[...] learning parallel programming is no harder than learning a second programming language.Let me be clear: I am not violently disagreeing with Mr. Breshears, but let me put it this way: I would suggest that for a reasonably intelligent computer science student, learning a second programming language, on a scale of 1-10, is around a "3" (in terms of difficulty). Now, for this same student to learn parallel programming, I would estimate that, on a scale of 1-10, the amount of difficulty would be a "7".
So, I disagree with Mr. Breshears.
What is my rationale for my disagreement? In my experience, it is often hard to write a serial program to solve a particular problem. I mean, this should be uncontroversial. However, I think that moving from a serial program that solves a given problem to a correct parallel program that solves the same problem requires a paradigm shift. The central part of this transformation does not involve more rote learning of some new programming language's syntax and APIs -- this process requires a completely different way of thinking. It is one thing to hazily understand that an execution unit on a given machine is going to be executing a given routine; it is another thing entirely to be able to grok that multiple execution units on a given machine might be executing the same routine all at the same time.
There is one other thing that this transformation requires as well: a fantastic attention to detail. Coming up with a correct/efficient parallel program that solves a problem means that the programmer has to completely understand the problem at hand (especially the data dependencies). The programmer also has to know the semantics of the parallel programming system that they are using like the back of their hand. This last bit cannot be stressed enough: the number of details involved here can be mind-boggling.
How difficult can it be to come up with a parallel implementation of a serial program? Well, let me answer this question somewhat indirectly. One of the most seductively popular ways to write a parallel program is to use threads and synchronization mechanisms. In fact, threads and synchronization mechanisms aren't just used for parallel programming -- they're also used quite a bit in concurrent systems programming. This is where I spend quite a bit of my time as a software engineer, actually. Over the years, I have seen quite a few problems in real products that have as their root-cause something to do with threads and synchronization primitives. These problems have all been inadvertently created by engineers with at least a few years of industry experience (sometimes much more...). If engineers with this level of experience can get confused and make mistakes in such an environment, what chance does the average college sophomore (as Mr. Breshears suggests) have to use similar technology and create parallel programs? I would humbly suggest that the answer to this question is "not much".
