22 September 2015

Blue Man Group -- Baba O'Riley




I watch this every once in a while, and I think that it is awesome....

19 September 2015

The Casa Sui Performance Problem

One day at {dayjob}, I was asked to take a look at a system that was suffering from a serious performance problem.  The end-users were having a difficult time.  The phone was ringing, and people were complaining...

So, I took a look at the system.  The first thing that I noted was that the 15-minute load-average on the system was running at around 10, with frequent 1-minute jumps to 14.  The hardware in question was....awful (that's another story...).  The system had either 2 or 4 vCPUs -- I can't remember which.  Either way, this system was running horribly behind.

Despite the fact that better hardware was available, the owners of this system had the following perspective: this system worked OK during the previous busy periods many months (and software releases...) ago.  They were not going to pay one more red cent to get better hardware.  They also had a support contract and supported (but nearly EOL'd) hardware.

So, it was my job to try to make things better, under difficult circumstances.

The production system in question was handling a large number of transactions.  Unfortunately, with my meager test equipment back in the lab, I had a difficult time reproducing the same load that the customer's system was handling (this situation was deplorable, I admit).  So, the only way to look at the product with a mind towards fixing the performance problem at-hand was to collect data directly from the customer's system....while it was running in production.  I had already cobbled together some utilities for this purpose, so I started gathering data.  The system in question was running in a heavily multi-threaded JVM, and so any of a dozen things could have been contributing to the high CPU load.
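The utilities I'd cobbled together are long gone, but the core trick can be sketched in a few lines of Java (a reconstruction, not the original code -- the class and method names here are made up): periodically snapshot every thread's stack from inside the JVM, and count which method sits at the top of each stack.  The hot spots bubble to the top of the counts.

```java
import java.util.HashMap;
import java.util.Map;

// Poor-man's sampling profiler (a sketch, not the actual utility):
// take periodic snapshots of every live thread's stack and count the
// method at the top of each one.
class StackSampler {

    static Map<String, Integer> topOfStackCounts(int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<String, Integer>();
        for (int i = 0; i < samples; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length == 0) continue;   // thread with no frames yet
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                Integer n = hits.get(top);
                hits.put(top, (n == null) ? 1 : n + 1);
            }
            try { Thread.sleep(intervalMillis); }
            catch (InterruptedException e) { break; }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample ~10 times/second for 10 seconds, then dump the counts.
        // Methods with the highest counts are where the CPU likely lives.
        for (Map.Entry<String, Integer> e : topOfStackCounts(100, 100).entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

It's crude (it only sees the top frame, and sampling from inside the process perturbs the process), but on a heavily-loaded multi-threaded JVM it's often enough to find the neighborhood where the time is going.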

A couple of hours later, I started to sift through the data that I was collecting.  This was....challenging.  Eventually, I started to get a better and better idea of the general neighborhood in our large codebase where the CPU was spending all of its time.  Many of the running threads seemed to be running in a certain area of the code, so I started looking there for obvious problems.

Unfortunately, no obvious problems presented themselves.  As I traced through the code, I couldn't find any algorithmically complex routines, or busy-waiting, or other obvious boffo stuff.  I'd already fixed some easy problems in this area many months before, so I had some familiarity with this area of the codebase.

So....I went for a long walk, and traced through things in my head.  At some point during my walk, I remembered that in this codebase the system used some custom queue classes to transport messages from one thread to another.  When I got back from my walk, I decided to look at the code for one of these queues.

This is what I saw:
import java.util.Vector;

class FooQueue extends Vector { [..........]
    public synchronized void put(Foo foo) {
        addElement(foo);
        notifyAll();
    }

    public synchronized Foo get() {
        while (size() == 0) {
            try { wait(); }
            catch (InterruptedException e) {}
        }
        Foo r = (Foo) elementAt(0);
        removeElementAt(0);
        return r;
    }
}

I started to curse.....a lot.   Wow.....I was pretty shocked that code like this existed in a production codebase.  This is the sort of thing that you might kid with colleagues about doing, but of course you would NEVER EVER write code like this for real.  It was especially disturbing to look at this code with the understanding that this was a work-queue, and a ton of transactions were getting funneled through it.

This code was awesome in its sheer inefficiency.   Actually, no, that's not quite right....it is even worse than that.  This code has the quality that the more work it is asked to handle, the more useless work it has to perform.  This queue "implementation" actually caused its own problems, because all of the threads that were executing this code probably spent more time maintaining the queue than doing actual work.  And....of course....doing this meant that even more transactions ended up in the queue, waiting to get CPU time.

With this queue "implementation", adding items to the queue was fairly quick.  However, removing elements from the queue involved a lot more work.....because using Java's Vector class is totally inappropriate for this purpose.  Every single time a consumer thread tries to de-queue an item from the queue, java.util.Vector.removeElementAt(0) is called and.....guess what? -- the Vector class does exactly what it is supposed to do -- keep the internal array contiguous, with all of the elements starting at array index 0.  So, this means that for every removal from the queue, if there were N elements in the queue before the removal, then (N-1) elements would need to be copied as part of the removal process.

Hilariously, I even managed to determine that, at many points in time, this production system had on the order of 900 items waiting in the queue.
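For scale (a back-of-the-envelope sketch, not a measurement from the actual system): draining an N-deep queue one removeElementAt(0) at a time copies (N-1) + (N-2) + ... + 1 = N*(N-1)/2 elements.

```java
// Back-of-the-envelope: element copies needed to drain an N-deep
// Vector-backed queue via repeated removeElementAt(0).
class DrainCost {
    static long copiesToDrain(long n) {
        // (n-1) + (n-2) + ... + 1 = n*(n-1)/2
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // With a ~900-item backlog:
        System.out.println(copiesToDrain(900));   // prints 404550
    }
}
```

So a 900-item backlog means roughly 400,000 element copies just to empty the queue once -- and the backlog keeps refilling while you do it.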

The truly awful part about this code was that it actually worked....OK...when the system was under light load.  But then, as the system was put under heavier and heavier load, the code performed worse and worse.

....

So, anyway, I fixed this by re-writing the code to use a much more reasonable queue.  While I was doing this, I found several other things that caused me to curse even more, but, eventually, I got this system to work a bit more efficiently.  The best thing I can say about this codebase was that there were always ways to improve it....
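I no longer have the actual fix in front of me, but the shape of it was roughly this (a sketch -- the class name is hypothetical): back the queue with something whose head-removal is O(1), and let the library handle the blocking instead of hand-rolling wait()/notifyAll().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A sketch of the shape of the fix (class name hypothetical): same
// put/get contract as the old Vector-based queue, but head removal is
// O(1), and the blocking/wakeup logic comes from the library instead
// of hand-rolled wait()/notifyAll().
class FixedFooQueue<T> {
    private final BlockingQueue<T> q = new LinkedBlockingQueue<T>();

    public void put(T item) {
        q.add(item);              // O(1); the queue is unbounded, so add() never blocks
    }

    public T get() {
        try {
            return q.take();      // blocks until an element is available; O(1) removal
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt flag
            return null;
        }
    }
}
```

Unlike the original, removing the head never shifts the remaining elements, so the per-item cost stays constant no matter how deep the backlog gets.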

15 July 2015

An Appreciation for _Programmers at Work_

A few months ago I learned that an important book in my life had been "republished" in the form of a blog.  The book is/was Programmers at Work.  I'm happy to see the content in this book available on the web, because I dimly remember giving my own copy of this book away to somebody else who I thought could use it.

I found this book to be useful when I was younger because back then I really didn't have very many people to talk with when I was exploring the idea of going into the computer (specifically software) field.  There were literally no programmers to talk with in the area where I grew up, and of course, this was before the era of the Internet.  The thing that I was wondering about was "how do I get from point A to point B?".  Based on the information I got from this book, I got a sense that I needed to change the trajectory of my math studies in high-school (I decided that I really needed to take Calculus), so I enrolled in a local community college (night school) and got myself onto a better path.  Later on, as a college freshman, I was glad that I had changed my trajectory.

....

Back when I read this book, I was a twisted, confused kid, who really didn't have too many people to talk to about college in the first place.  Looking back at this section of Programmers at Work:

LAMPSON: I used to think that undergraduate computer-science education was bad, and that it should be outlawed. Recently I realized that position isn’t reasonable. An undergraduate degree in computer science is a perfectly respectable professional degree, just like electrical engineering or business administration. But I do think it’s a serious mistake to take an undergraduate degree in computer science if you intend to study it in graduate school.
INTERVIEWER: Why?
LAMPSON: Because most of what you learn won’t have any long-term significance. You won’t learn new ways of using your mind, which does you more good than learning the details of how to write a compiler, which is what you’re likely to get from undergraduate computer science. I think the world would be much better off if all the graduate computer-science departments would get together and agree not to accept anybody with a bachelor’s degree in computer science. Those people should be required to take a remedial year to learn something respectable like mathematics or history, before going on to graduate-level computer science. However, I don’t see that happening.
INTERVIEWER: What kind of training or type of thought leads to the greatest productivity in the computer field?
LAMPSON: From mathematics, you learn logical reasoning. You also learn what it means to prove something, as well as how to handle abstract essentials. From an experimental science such as physics, or from the humanities, you learn how to make connections in the real world by applying these abstractions.

....maybe this was one of the reasons why I also decided to study liberal-arts at college.  Sometimes books are the source of dangerous ideas....

17 June 2015

Adventures in Virtualization and Storage Management

One fine Sunday while I was doing some work around the house, I received a series of dire work emails from people in the management of ${employer}.  Early that morning, one of our marquee customers had some problems with an appliance that we supported at their site.  From what I read, it seemed like the appliance had suffered a significant failure...but the actual cause of the failure was unclear.

Thankfully, a member of our Support staff had managed to get the appliance working again by simply rebooting.  This was a small amount of good news in this difficult situation.

The "marquee customer" in this story is a famous company in the financial services industry.  Their infrastructure is large and complex.

There was little that I could do about the problem at this point, so I sent a request to the folks in Support to please try to get some logfiles from the appliance.  It seemed like a good idea to get to work early the next day, so I decided to turn in early that night.  I was fairly sure that this situation was going to land on my desk in the morning, and I was right....

The next morning, the logfiles that I asked for soon appeared, and I learned that this customer had implemented our appliance as a VM in their big-iron ESX server farm.  The content in the logfiles was...strange.  Something Crazy seemed to have happened to the appliance's database, but beyond this, the root-cause of the problem was unclear.

So, at this point, we began to do several things:

  • trying to get more logfiles to help determine the root cause
  • looking through the code related to the database to see if there were any obvious problems
  • attending a set of conference calls with the customer to manage the problem.

This was a challenging problem to deal with.   Nothing in the logfiles really shed any light on the problem.  Unfortunately there were some issues in the database code that I had been wanting to do something about for a long time prior to this incident, so I started to work on these (though it seemed unlikely that the problems I was concerned about were related in any way to the problems that this customer had encountered).  And, of course, the conference calls were....tense.  Everybody wished that the problem had never happened.

A detailed reading of all of the logfiles still led me to conclude that Something Crazy had happened.

On Wednesday/Thursday we deployed some new code at this site that we hoped would improve things in general.  The stress level in the office began to go back down to normal...

...

The next weekend, the machine crashed again at exactly the same time.  Now I got even more forwarded email....and this time it was obvious that this problem was being noticed by various higher-ups in both organizations.   Not good, not good....

Again, somebody from Support managed to get the machine going again by simply rebooting.

....

Monday morning came along, and we arranged a conference call with the customer's IT staff.   We started going through the logfiles, etc.  We were pulling our hair out trying to figure out what had gone wrong while the higher-ups were fuming.

Around 5 minutes into the conference call, as we were talking to this customer's IT staff (waiting for some "higher up" staff to arrive), one of the customer's sysadmins told us that the last two weekends had been pretty bad at their site.  "Why?" I asked.  The answer:  "well, the corporate SAN server was scheduled for some maintenance over the past two weekends, and when they were performing the maintenance a lot of things went haywire".  I asked "does this SAN server provide the storage for all of the VMs on the ESX server, the same ESX server that runs our appliance/VM?".  "Yes", the sysadmin said, and then he continued, "actually, when the SAN server went down, we had around 40 VMs on the ESX server crash or go into a flaky mode where they needed to be rebooted".

I pressed "mute" on the speakerphone, looked around the table and asked our Support person "at any point in this incident has this customer's staff ever mentioned the fact that their SAN server went down?"  The reply:  "nope".

...

We then learned that their SAN server had been down for 45+ minutes during both weekends.  Then I had to explain to this customer's IT staff that our appliance had a database server running on it, and that a database server doesn't react very kindly when the underlying disk drive goes away for 45+ minutes.   "Ohhhhhh" was their reply....

The end-result of this incident is that we actually had to add verbiage to the customer-facing docs reminding site-admins that if they chose to deploy the appliance as a VM, they needed to ensure that the underlying storage was highly-available.

I swear I didn't make this up.