17 June 2015

Adventures in Virtualization and Storage Management

One fine Sunday while I was doing some work around the house, I received a series of dire work emails from people in the management of ${employer}.  Early that morning, one of our marquee customers had some problems with an appliance that we supported at their site.  From what I read, it seemed like the appliance had suffered a significant failure...but the actual cause of the failure was unclear.

Thankfully, a member of our Support staff had managed to get the appliance working again by simply rebooting.  This was a small amount of good news in this difficult situation.

The "marquee customer" in this story is a famous company in the financial services industry.  Their infrastructure is large and complex.

There was little that I could do about the problem at this point, so I sent a request to the folks in Support to please try to get some logfiles from the appliance.  It seemed like a good idea to get to work early the next day, so I decided to turn in early that night.  I was fairly sure that this situation was going to land on my desk in the morning, and I was right....

The next morning, the logfiles I had asked for soon appeared, and I learned that this customer had implemented our appliance as a VM in their big-iron ESX server farm.  The content in the logfiles was...strange.  Something Crazy seemed to have happened to the appliance's database, but beyond this, the root cause of the problem was unclear.

So, at this point, we began to do several things:

  • trying to get more logfiles to help determine the root cause
  • looking through the code related to the database to see if there were any obvious problems
  • attending a set of conference calls with the customer to manage the problem

This was a challenging problem to deal with.   Nothing in the logfiles really shed any light on it.  Unfortunately, there were some issues in the database code that I had been wanting to do something about for a long time prior to this incident, so I started to work on these (though it seemed unlikely that the issues I was concerned about were related in any way to the problems this customer had encountered).  And, of course, the conference calls were....tense.  Everybody wished that the problem had never happened.

A detailed reading of all of the logfiles still led me to conclude that Something Crazy had happened.

On Wednesday/Thursday we deployed some new code at this site that we hoped would improve things in general.  The stress level in the office began to go back down to normal...

...

The next weekend, the machine crashed again at exactly the same time.  Now I got even more forwarded email....and this time it was obvious that this problem was being noticed by various higher-ups in both organizations.   Not good, not good....

Again, somebody from Support managed to get the machine going again by simply rebooting.

....

Monday morning came along, and we arranged a conference call with the customer's IT staff.   We started going through the logfiles, etc.  We were pulling our hair out trying to figure out what had gone wrong while the higher-ups were fuming.

Around 5 minutes into the conference call, as we were talking to this customer's IT staff (waiting for some "higher up" staff to arrive), one of the customer's sysadmins told us that the last two weekends had been pretty bad at their site.  "Why?" I asked.  The answer:  "Well, the corporate SAN server was scheduled for some maintenance over the past two weekends, and when they were performing the maintenance a lot of things went haywire".  I asked "Does this SAN server provide the storage for all of the VMs on the ESX server, the same ESX server that runs our appliance/VM?".  "Yes", the sysadmin said, and then he continued, "Actually, when the SAN server went down, we had around 40 VMs on the ESX server crash or go into a flaky mode where they needed to be rebooted".

I pressed "mute" on the speakerphone, looked around the table and asked our Support person "at any point in this incident has this customer's staff ever mentioned the fact that their SAN server went down?"  The reply:  "nope".

...

We then learned that their SAN server had been down for 45+ minutes during both weekends.  So I had to explain to this customer's IT staff that our appliance had a database server running on it, and that a database server doesn't react very kindly when the underlying disk drive goes away for 45+ minutes.   "Ohhhhhh" was their reply....
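
As an aside, here is roughly what "doesn't react very kindly" means in practice.  The sketch below (Python, with hypothetical paths and thresholds, nothing from our actual appliance) performs the same kind of small synchronous write that a database commit does.  When the backing storage vanishes, that write either stalls for a very long time or fails with an I/O error, and a database engine has no graceful way to deal with either one in the middle of a transaction:

    import os
    import time

    # Hypothetical locations/limits for illustration; not the appliance's real layout.
    DATA_DIR = "/var/lib/appliance/db"
    PROBE_FILE = os.path.join(DATA_DIR, ".storage_probe")
    STALL_THRESHOLD_SECS = 10.0

    def probe_storage():
        """Do one small synchronous write, the way a database commit would.

        If the SAN behind the VM's datastore disappears, this call either
        blocks for a very long time or fails with an I/O error.  Those are
        the two behaviors a database's commit path has to survive.
        """
        start = time.monotonic()
        try:
            fd = os.open(PROBE_FILE, os.O_WRONLY | os.O_CREAT, 0o600)
            try:
                os.write(fd, b"probe\n")
                os.fsync(fd)  # force the data all the way down to the backing store
            finally:
                os.close(fd)
        except OSError as exc:
            return False, "I/O error from backing store: %s" % exc
        elapsed = time.monotonic() - start
        if elapsed > STALL_THRESHOLD_SECS:
            return False, "write stalled for %.1f seconds; storage may be degraded" % elapsed
        return True, "ok (%.3f seconds)" % elapsed

    if __name__ == "__main__":
        healthy, detail = probe_storage()
        print(("HEALTHY: " if healthy else "UNHEALTHY: ") + detail)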

The end result of this incident is that we actually had to add verbiage to the customer-facing docs reminding site admins that if they choose to deploy the appliance as a VM, they need to ensure that the underlying storage is highly available.

I swear I didn't make this up.