19 September 2015

The Casa Sui Performance Problem

One day at {dayjob}, I was asked to take a look at a system that was suffering from a serious performance problem.  The end-users who used this system were having a difficult time.  The phone was ringing, and people were complaining...

So, I took a look at the system.  The first thing that I noted was that the 15-minute load-average on the system was running at around 10, with frequent 1-minute jumps to 14.  The hardware in question was....awful (that's another story...).  The system had either 2 or 4 vCPUs -- I can't remember which.  Either way, this system was running horribly behind.

Despite the fact that better hardware was available, the owners of this system had the following perspective: this system worked OK during the previous busy periods many months (and software releases...) ago.  They were not going to pay one more red cent to get better hardware.  They also had a support contract and supported (but nearly EOL'd) hardware.

So, it was my job to try to make things better, under difficult circumstances.

The production system in question was handling a large number of transactions.  Unfortunately, with my meager test equipment back in the lab, I had a difficult time reproducing the same load that the customer's system was handling (this situation was deplorable, I admit).  So, the only way to look at the product with a mind towards fixing the performance problem at-hand was to collect data directly from the customer's system....while it was running in production.  I had already cobbled together some utilities for this purpose, so I started gathering data.  The system in question was running in a heavily multi-threaded JVM, and so any of a dozen things could have been contributing to the high CPU load.

A couple of hours later, I started to sift through the data that I was collecting.  This was....challenging.  Eventually, started to get a better and better idea of the general neighborhood in our large codebase where the CPU was spending all of its time.  Many of the running threads seemed to be running in a certain area of the code, so I started looking there for obvious problems.

Unfortunately, no obvious problems presented themselves.  As I traced through the code, I couldn't find any algorithmically complex routines, or busy-waiting, or other obvious boffo stuff.  I'd already fixed some easy problems in this area many months before, so I had some familiarity with this area of the codebase.

So....I went for a long walk, and traced through things in my head.  At some point during my walk, I remembered that in this codebase the system used some custom queue classes to transport messages from one thread to another.  When I got back from my walk, I decided to look at the code for one of these queues.

This is what I saw:
import java.util.Vector;
class FooQueue extends Vector { [..........]
    public synchronized void put(Foo foo) {

    public synchronized Foo get() {
         while(size() == 0) {
             try { wait(); }
             catch InterruptedException e) {}
         r = elementAt(0);
         return r;

I started to curse.....a lot.   Wow.....I was pretty shocked that code like this existed in a production codebase.  This is the sort-of thing that you might kid with colleagues about doing, but of course you would NEVER EVER write code like this for real.  It was especially disturbing to look at this code with the understanding that this was a work-queue, and a ton of transactions were getting funneled through this queue.

This code was awesome in its sheer inefficiency.   Actually, no, that's not quite right....it is even worse than that.  This code has the quality that, the more work it is asked to handle, the more useless work it has to perform.  This queue "implementation" actually caused its own problems, because all of the threads that were executing this code probably spent more time maintaining the queue than doing actual work.  And....of course....doing this meant that even more transactions ended up in the queue, waiting to get CPU time.

With this queue "implementation", adding items to the queue was fairly quick.  However, removing elements from the queue involved a lot more work.....because using Java's Vector class is totally inappropriate for this purpose.  Every single time the processor tries to de-queue and item from the queue, java.util.Vector.removeElementAt(0) is called and.....guess what? -- the Vector class does exactly what it is supposed to do -- keep the internal array contiguous and with all of the elements starting at array index 0.  So, this means that for every removal from the queue, if there were N elements in the queue before the removal, then (N-1) elements would need to be copied as part of the removal process.

Hilariously, I even managed to determine that at many times on this production system that there were on the order of at least 900 items waiting in the queue.

The truly awful part about this code was that it actually worked....OK...when the system was under light load.  But then, as the system was put under heavier and heavier load, the code performs worse and worse.


So, anyway, I fixed this code by re-writing this code to use a much more reasonable queue.  While I was doing this, I found several other things that caused me to curse even more, but, eventually I got this system to work a bit more efficiently.  The best thing I can say about this codebase was that there was always ways to improve it....

No comments: