09 June 2013

The Death of a Hard-Drive

Have you ever wanted to see a hard-drive die on a production system?  Yeah, me neither...

The other day I was called upon to try to figure out what was going wrong on a heavily-utilized production system.  I dimly recalled looking at this system several months ago.  According to my notes, the system was acting a bit sluggishly back then, and I even made a few performance improvements at the time.

Fast forward to this past week...  The performance improvements that I made months ago helped get the system through a critical period, but now the system was acting sluggishly again.  So, I analyzed the logs.  There, in the logs, was a periodic warning that the hard-drives were acting a little bit flaky.  I do not have any sort of physical access to this system; the logs are all I have.  The problem is that this periodic warning seems to have been occurring for a long time now, and there seems to be some resistance to replacing the hardware in this system.  So, I really had to make a strong case that this system was taking a nosedive and needed immediate action.

I decided to create a graph that shows the frequency of this system's warning message over time.  Here is my graph:

Every point on this graph means "something bad happened".  And... I think it is clear that this graph shows that "a whole lot of new badness is happening", and that this system needs some serious fixing.
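
For anyone who wants to build a similar graph from their own logs, here is a minimal sketch in Python of the counting step.  It is not the script behind the graph in this post.  The warning text (`disk warning`), the timestamp format, and the function name are all assumptions made up for illustration, so adjust them to whatever your logs actually contain.

```python
from collections import Counter
from datetime import datetime

# Hypothetical marker: the real warning text is not given in this post.
WARNING_MARKER = "disk warning"


def count_warnings_per_day(lines, marker=WARNING_MARKER):
    """Return a Counter mapping each calendar day to its number of warning lines."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        # Assumption: each line starts with an ISO timestamp,
        # for example 2013-06-01T12:00:00, followed by a space.
        stamp = line.split(" ", 1)[0]
        try:
            day = datetime.fromisoformat(stamp).date()
        except ValueError:
            continue  # first field is not a timestamp, so skip the line
        counts[day] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, only to show the shape of the input.
    sample = [
        "2013-06-01T10:00:00 host kernel: disk warning on sda",
        "2013-06-02T09:30:00 host kernel: disk warning on sda",
        "2013-06-02T09:45:00 host kernel: disk warning on sda",
    ]
    # Print a crude text histogram, one row per day.
    for day, n in sorted(count_warnings_per_day(sample).items()):
        print(day, "#" * n)
```

A rising count per day is the "new badness" that a graph like this is meant to make obvious.  The per-day counts can be handed to any plotting tool once they are collected.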

I'm crossing my fingers right now...

