05 June 2016

Mis-Adventures In Security Engineering

One day, I found myself working at a new {dayjob}.  I was enthusiastic about this job; I thought I could make a real difference at this shop and work with some interesting tech.

But the environment in this shop was chaotic, and I spent a lot of time fighting fires.  Every day brought its own challenges...

Early on, I became aware of a security problem in the product I was supporting.  The problem itself was pretty serious, and in my assessment, if somebody had managed to exploit it, the effects would have been significant and widespread.

On numerous occasions, I asked my manager to allocate time in my schedule to fix this problem.  Unfortunately, I was spending a huge amount of time fire-fighting other problems, and I simply had no time to put together a proper solution for this security problem.  At the time, my manager was not interested in me devoting any time to solving this problem.

This situation went on for a while, until one of our more security-conscious customers started asking some pretty reasonable questions to the effect of "how is this area of your product secured?".  As I recall, the official answer (which I wanted no part of) involved some hand-waving.

But I used this incident to stress my point:  the security in this area of the product was dangerously flawed, and we needed to fix this problem.  And....I stressed a second point too:  we could either fix this problem according to our own schedule, or else we could try to fix this problem in a panic because of a public disclosure of this flaw.

I was still in fire-fighting mode -- with very little time to work on this security problem.  But, I did manage to write a design specification for how this problem could be solved, and I wrote some proof-of-concept code that showed how the fix could be implemented.

I continued to ask my manager for time to implement the real solution -- time to productize my proof-of-concept, and time to work through a few problems that I hadn't quite solved yet.  But this never yielded much.

Instead, what happened was that one day I was told that {very-senior-and-important-engineer} would be implementing a fix for this problem.  Apparently it had been decided that this problem was actually important.

{Very-senior-and-important-engineer} had much more experience with the company's products than I did.  He had also been involved with the meetings I had held in which I had described the problem, the ramifications of the problem, and my proposed solution.  He had easy access to my design document, and my proof-of-concept code too.

Meanwhile, I was still up to my neck with other problems.

I was envious that {very-senior-and-important-engineer} was going to get to solve this problem....but, on the whole, I was glad that we'd actually get to solve this problem in a non-emergency mode.  This was a pretty complicated problem!

Also, it seemed somewhat reasonable for {very-senior-and-important-engineer} to fix this problem, since, in my mind, he was the engineer who was responsible for the original flawed scheme.


To my great surprise, two days after I was told that {very-senior-and-important-engineer} was going to start working on this problem, he stopped me in the company break-room, rubbed his hands together and in his booming voice proclaimed "IT IS DONE!".  I replied "what?".  And then he told me that the security problem was fixed.

I was surprised to hear this news, because I estimated that fixing this problem would take....many weeks of uninterrupted effort.  I knew that {very-senior-and-important-engineer} could implement code very quickly...and he was far more familiar with the system than I was....but still...

{Very-senior-and-important-engineer} told me that he had checked his whole implementation of the security fix into source control a few minutes earlier.

So...I went back to my desk and looked through our source control system.  Here I present to you the entirety of {very-senior-and-important-engineer}'s effort.  Just to be 100% clear, the {entity} that sits on the other side of this socket.....there is no good reason to have any trust in this entity, and, in fact, you actually have to assume that this entity has been completely compromised by an attacker.

Here's what I saw:
$ svn blame .....
Annotations for /long/path/to/file.java
[...some editing done here....]
public void run() {
    try {
        InetAddress remoteHost = socket.getInetAddress();
        String ip = remoteHost.getHostAddress();
        System.out.println("\nClient connecting IP = " +
                     remoteHost.getHostAddress() + " name = " +
                     remoteHost.getHostName() + "\n");
        socket.setSoTimeout(30 * 1000);
        InputStream inputStream = socket.getInputStream();
        OutputStream outputStream = socket.getOutputStream();
        DataInputStream input = new DataInputStream(inputStream);
        DataOutputStream output = new DataOutputStream(outputStream);
        int time = (int)(System.currentTimeMillis() / 1000);
        PublicKey serverPublicKey = serverKeyPair.getPublic();
        BASE64Encoder myB64 = new BASE64Encoder();
        String b64 = myB64.encode(serverPublicKey.getEncoded());
        byte[] bytes = b64.getBytes();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream outKey = new DataOutputStream(baos);
        outKey.write(serverPassPhrase, 0, serverPassPhrase.length);
        byte[] phrase = baos.toByteArray();
        bytes = encryptAES(phrase, bytes);
        output.write(bytes, 0, bytes.length);

        PrivateKey agentPrivateKey = agentKeyPair.getPrivate();
        b64 = myB64.encode(agentPrivateKey.getEncoded());
        bytes = b64.getBytes();
        bytes = encryptAES(phrase, bytes);
        output.write(bytes, 0, bytes.length);
    } catch (Exception e) {
        [...]
    }
}

This was the entirety of {very-senior-and-important-engineer}'s solution to the security problem.

I was pretty gobsmacked when I saw this code.  The solution that I had proposed did in fact call for some public-private key crypto, but this is where the similarities between what I had proposed and this "solution" ended.  This code didn't look much like my proof-of-concept code either.

Wow....this code...  I still shake my head at the thought of it.

I decided to wait a day, to give {very-senior-and-important-engineer} some time.  "Perhaps he forgot to check in all of his changes?" I mused.  For example, readers who are paying attention will note that there are no changes that correspond to the client side of the connection.

The next day, I concluded that this diff was going to be the entirety of {very-senior-and-important-engineer}'s contribution to the solution to this security problem.  That made perfect sense to me, in a twisted sort of way.

Again....wow...this code....  Wow.

This code amazed me so much that I just HAD TO ask a question about it.  So, I stopped by {very-senior-and-important-engineer}'s office and asked him:

Say, I have a question about your code.....the code here goes through the trouble of creating a public/private keypair, and then it passes the public key off to the client....and then right afterwards, it sends the private key to the (untrusted) client as well.....so, my question is this....what was the reasoning behind this?

He answered:

I just wanted to make sure that the client had all of the data that it would need.


Well, at this point, I knew pretty much exactly what I was dealing with.  So, I told {very-senior-and-important-engineer} "say, I think that a few of the points that I made in my design aren't quite captured in this code -- would you mind if I updated it a little bit?".  He replied "sure, that would be fine".


Over the next {long period of time}, I managed to find small pockets of time, here and there, to implement a fix for this security problem.  And, the very first thing that I did was to throw out all of this code....ALL OF IT.  When {very-senior-and-important-engineer} proclaimed that this problem was "DONE", he had in fact implemented around 0.001% of the required fix, and the 0.001% that he did implement was completely and utterly wrong.
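For contrast with the listing above, here is a minimal sketch of the one rule that code broke: a private key stays with its owner, and only the public half is ever written to a peer. This is not the design described in this post (which isn't reproduced here). The class name `ServerIdentity`, the choice of RSA-2048 and `SHA256withRSA`, and the nonce-signing step are all assumptions made for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.security.Signature;

// Hypothetical illustration only; names and algorithm choices are assumptions.
class ServerIdentity {
    private final KeyPair keyPair; // the private half never leaves this object

    ServerIdentity() throws GeneralSecurityException {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        keyPair = gen.generateKeyPair();
    }

    /** The only key material that is safe to hand to an untrusted peer. */
    PublicKey publicKey() {
        return keyPair.getPublic();
    }

    /** Prove possession of the private key by signing a peer-supplied nonce. */
    byte[] sign(byte[] nonce) throws GeneralSecurityException {
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(keyPair.getPrivate());
        signer.update(nonce);
        return signer.sign();
    }

    /** What the peer does: check the signature using only the public key. */
    static boolean verify(PublicKey key, byte[] nonce, byte[] signature)
            throws GeneralSecurityException {
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(key);
        verifier.update(nonce);
        return verifier.verify(signature);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        ServerIdentity server = new ServerIdentity();

        // The peer picks a fresh random nonce so an old signature can't be replayed.
        byte[] nonce = new byte[32];
        new SecureRandom().nextBytes(nonce);

        byte[] signature = server.sign(nonce);
        System.out.println("signature valid: "
                + verify(server.publicKey(), nonce, signature));
    }
}
```

Even this much is only a fragment. The peer still needs some trustworthy way to know that the public key really belongs to the server, and in practice an established protocol such as TLS is a safer starting point than hand-rolled key exchange.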

Eventually I fixed this security problem.  But....it was a really tough problem to solve, one of the more difficult problems I have ever solved in my life.

22 September 2015

Blue Man Group -- Baba O Riley

I watch this every once in a while, and I think that it is awesome....

19 September 2015

The Casa Sui Performance Problem

One day at {dayjob}, I was asked to take a look at a system that was suffering from a serious performance problem.  The end-users who used this system were having a difficult time.  The phone was ringing, and people were complaining...

So, I took a look at the system.  The first thing that I noted was that the 15-minute load-average on the system was running at around 10, with frequent 1-minute jumps to 14.  The hardware in question was....awful (that's another story...).  The system had either 2 or 4 vCPUs -- I can't remember which.  Either way, this system was running horribly behind.
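To put those numbers in perspective, a common rule of thumb is to divide the load average by the number of CPUs. The sketch below does that arithmetic for the figures above and then for whatever machine it runs on. The class and method names are made up for illustration, and the ratio is only a rough indicator (on Linux, for example, the load average also counts tasks stuck in uninterruptible I/O wait).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper; not one of the utilities mentioned in this post.
class LoadCheck {
    /** Load average per CPU; values well above 1.0 suggest runnable work is queuing up. */
    static double loadPerCpu(double loadAverage, int cpus) {
        if (cpus <= 0) {
            throw new IllegalArgumentException("cpus must be positive");
        }
        return loadAverage / cpus;
    }

    public static void main(String[] args) {
        // The figures from the story: a load of about 10 on either 2 or 4 vCPUs.
        System.out.println("4 vCPUs: " + loadPerCpu(10.0, 4));
        System.out.println("2 vCPUs: " + loadPerCpu(10.0, 2));

        // The same ratio for the local machine. getSystemLoadAverage() reports the
        // 1-minute average and returns a negative value where it is unavailable.
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        if (load >= 0) {
            System.out.println("this host: " + loadPerCpu(load, os.getAvailableProcessors()));
        } else {
            System.out.println("this host: load average not available");
        }
    }
}
```

With a load of 10, that works out to 2.5 runnable tasks per CPU on 4 vCPUs and 5 on 2 vCPUs, so either way there was far more work waiting than the machine could run.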

Despite the fact that better hardware was available, the owners of this system had the following perspective: this system worked OK during the previous busy periods many months (and software releases...) ago.  They were not going to pay one more red cent to get better hardware.  They also had a support contract and supported (but nearly EOL'd) hardware.

So, it was my job to try to make things better, under difficult circumstances.

The production system in question was handling a large number of transactions.  Unfortunately, with my meager test equipment back in the lab, I had a difficult time reproducing the same load that the customer's system was handling (this situation was deplorable, I admit).  So, the only way to look at the product with a mind towards fixing the performance problem at hand was to collect data directly from the customer's system....while it was running in production.  I had already cobbled together some utilities for this purpose, so I started gathering data.  The system in question was running in a heavily multi-threaded JVM, and so any of a dozen things could have been contributing to the high CPU load.

A couple of hours later, I started to sift through the data that I was collecting.  This was....challenging.  Eventually, I started to get a better and better idea of the general neighborhood in our large codebase where the CPU was spending all of its time.  Many of the running threads seemed to be running in a certain area of the code, so I started looking there for obvious problems.
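The utilities used here aren't shown, so the following is only a sketch of one way to get this kind of data from inside a JVM: take two snapshots of per-thread CPU time with the standard `ThreadMXBean`, then print the stacks of the threads that burned the most CPU in between. The class name, the one-second interval, and the choice of the top five threads are arbitrary assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sampler; not the tooling described in this post.
class ThreadCpuSampler {
    /** CPU nanoseconds consumed so far by each live thread, keyed by thread id. */
    static Map<Long, Long> snapshot(ThreadMXBean mx) {
        Map<Long, Long> cpu = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long nanos = mx.getThreadCpuTime(id); // -1 if the thread died or timing is disabled
            if (nanos >= 0) {
                cpu.put(id, nanos);
            }
        }
        return cpu;
    }

    /** Thread ids ordered by CPU consumed between the two snapshots, busiest first. */
    static List<Long> busiest(Map<Long, Long> before, Map<Long, Long> after) {
        Map<Long, Long> delta = new HashMap<>();
        for (Map.Entry<Long, Long> e : after.entrySet()) {
            delta.put(e.getKey(), e.getValue() - before.getOrDefault(e.getKey(), 0L));
        }
        List<Long> ids = new ArrayList<>(delta.keySet());
        ids.sort((a, b) -> Long.compare(delta.get(b), delta.get(a)));
        return ids;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.out.println("per-thread CPU time is not supported on this JVM");
            return;
        }
        Map<Long, Long> before = snapshot(mx);
        Thread.sleep(1000);
        Map<Long, Long> after = snapshot(mx);

        int shown = 0;
        for (long id : busiest(before, after)) {
            if (shown++ == 5) {
                break;
            }
            ThreadInfo info = mx.getThreadInfo(id, 8); // up to 8 stack frames
            if (info == null) {
                continue; // thread exited since the snapshot
            }
            long millis = (after.get(id) - before.getOrDefault(id, 0L)) / 1_000_000;
            System.out.println(info.getThreadName() + " cpu_ms=" + millis);
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

Run repeatedly, output like this tends to show the same few stack frames over and over, which is one way to find the "general neighborhood" where the CPU time is going.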

Unfortunately, no obvious problems presented themselves.  As I traced through the code, I couldn't find any algorithmically complex routines, or busy-waiting, or other obvious boffo stuff.  I'd already fixed some easy problems in this area many months before, so I had some familiarity with this area of the codebase.

So....I went for a long walk, and traced through things in my head.  At some point during my walk, I remembered that in this codebase the system used some custom queue classes to transport messages from one thread to another.  When I got back from my walk, I decided to look at the code for one of these queues.

This is what I saw:
import java.util.Vector;
class FooQueue extends Vector { [..........]
    public synchronized void put(Foo foo) {
        [...]
    }

    public synchronized Foo get() {
        while (size() == 0) {
            try { wait(); }
            catch (InterruptedException e) {}
        }
        Foo r = (Foo) elementAt(0);
        removeElementAt(0);
        return r;
    }
}

I started to curse.....a lot.   Wow.....I was pretty shocked that code like this existed in a production codebase.  This is the sort of thing that you might kid with colleagues about doing, but of course you would NEVER EVER write code like this for real.  It was especially disturbing to look at this code with the understanding that this was a work-queue, and a ton of transactions were getting funneled through this queue.

This code was awesome in its sheer inefficiency.   Actually, no, that's not quite right....it is even worse than that.  This code has the quality that, the more work it is asked to handle, the more useless work it has to perform.  This queue "implementation" actually caused its own problems, because all of the threads that were executing this code probably spent more time maintaining the queue than doing actual work.  And....of course....doing this meant that even more transactions ended up in the queue, waiting to get CPU time.

With this queue "implementation", adding items to the queue was fairly quick.  However, removing elements from the queue involved a lot more work.....because Java's Vector class is totally inappropriate for this purpose.  Every single time the processor tries to dequeue an item from the queue, java.util.Vector.removeElementAt(0) is called and.....guess what? -- the Vector class does exactly what it is supposed to do -- keep the internal array contiguous, with all of the elements starting at array index 0.  So, this means that for every removal from the queue, if there were N elements in the queue before the removal, then (N-1) elements would need to be copied as part of the removal process.

Hilariously, I even managed to determine that at many times on this production system there were at least 900 items waiting in the queue.

The truly awful part about this code was that it actually worked....OK...when the system was under light load.  But then, as the system was put under heavier and heavier load, the code performed worse and worse.
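To make that concrete, here is a small sketch that counts the element copies implied by removing from index 0, using the 900-item backlog as the example. The class and method names are made up for illustration, and it models only the copying, not the lock contention.

```java
// Hypothetical cost model for removing from the front of an array-backed list.
class FrontRemovalCost {
    /** Element copies for one removal at index 0 when the list holds n items. */
    static long copiesForOneRemoval(int n) {
        return n > 0 ? n - 1 : 0;
    }

    /** Element copies needed to drain a backlog of n items, one removal at a time. */
    static long copiesToDrain(int n) {
        long copies = 0;
        for (int size = n; size > 0; size--) {
            copies += copiesForOneRemoval(size); // the remaining size-1 elements shift down
        }
        return copies;
    }

    public static void main(String[] args) {
        System.out.println("one removal at 900 items: " + copiesForOneRemoval(900) + " copies");
        System.out.println("draining 900 items:       " + copiesToDrain(900) + " copies");
    }
}
```

A single dequeue at 900 items shifts 899 elements, and draining the whole backlog costs 900 × 899 / 2 = 404,550 copies. Each individual array copy is cheap, but every one of them happens inside a `synchronized` method, so producers and other consumers are locked out of the queue while it runs.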


So, anyway, I fixed this by rewriting the code to use a much more reasonable queue.  While I was doing this, I found several other things that caused me to curse even more, but, eventually I got this system to work a bit more efficiently.  The best thing I can say about this codebase was that there were always ways to improve it....
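The post doesn't say which queue replaced the original, so the following is just one plausible shape for "a much more reasonable queue": a thin wrapper over `java.util.concurrent.LinkedBlockingQueue`, which enqueues and dequeues in constant time and handles the blocking itself. The `WorkQueue` name and the generic element type (standing in for `Foo`) are assumptions.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical replacement sketch; not the code that was actually shipped.
class WorkQueue<T> {
    private final BlockingQueue<T> items = new LinkedBlockingQueue<>();

    /** Enqueue in constant time; never blocks because the queue is unbounded. */
    void put(T item) {
        items.add(item);
    }

    /** Dequeue in constant time, blocking while the queue is empty. */
    T get() throws InterruptedException {
        return items.take();
    }

    int size() {
        return items.size();
    }

    public static void main(String[] args) throws InterruptedException {
        WorkQueue<String> queue = new WorkQueue<>();
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                queue.put("job-" + i);
            }
        });
        producer.start();
        for (int i = 0; i < 3; i++) {
            System.out.println(queue.get()); // blocks until the producer has added an item
        }
        producer.join();
    }
}
```

Two deliberate differences from the original: `get()` lets `InterruptedException` propagate instead of swallowing it, and nothing is copied on removal. A bounded `LinkedBlockingQueue` (constructed with a capacity) would additionally push back on producers when consumers fall behind, which may or may not be what a given system wants.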

15 July 2015

An Appreciation for _Programmers at Work_

A few months ago I learned that an important book in my life had been "republished" in the form of a blog.  The book is/was Programmers at Work.  I'm happy to see the content in this book available on the web, because I dimly remember giving my own copy of this book away to somebody else who I thought could use it.

I found this book to be useful when I was younger because back then I really didn't have very many people to talk with when I was exploring the idea of going into the computer (specifically software) field.  There were literally no programmers to talk with in the area that I grew up in, and of course, this was before the era of the Internet.  The thing that I was wondering about was "how do I get from point A to point B?".  Based on the information I got from this book, I got a sense that I needed to change the trajectory of my math studies in high-school (I decided that I really needed to take Calculus in high-school) so I enrolled in a local community college (night school) and got myself onto a better path.  Later on, as a college freshman, I was glad that I had changed my trajectory.


Back when I read this book, I was a twisted, confused kid, who really didn't have too many people to talk to about college in the first place.  Looking back at this section of Programmers at Work:

LAMPSON: I used to think that undergraduate computer-science education was bad, and that it should be outlawed. Recently I realized that position isn’t reasonable. An undergraduate degree in computer science is a perfectly respectable professional degree, just like electrical engineering or business administration. But I do think it’s a serious mistake to take an undergraduate degree in computer science if you intend to study it in graduate school.
LAMPSON: Because most of what you learn won’t have any long-term significance. You won’t learn new ways of using your mind, which does you more good than learning the details of how to write a compiler, which is what you’re likely to get from undergraduate computer science. I think the world would be much better off if all the graduate computer-science departments would get together and agree not to accept anybody with a bachelor’s degree in computer science. Those people should be required to take a remedial year to learn something respectable like mathematics or history, before going on to graduate-level computer science. However, I don’t see that happening.
INTERVIEWER: What kind of training or type of thought leads to the greatest productivity in the computer field?
LAMPSON: From mathematics, you learn logical reasoning. You also learn what it means to prove something, as well as how to handle abstract essentials. From an experimental science such as physics, or from the humanities, you learn how to make connections in the real world by applying these abstractions.

....maybe this was one of the reasons why I also decided to study liberal-arts at college.  Sometimes books are the source of dangerous ideas....

17 June 2015

Adventures in Virtualization and Storage Management

One fine Sunday while I was doing some work around the house, I received a series of dire work emails from people in the management of ${employer}.  Early that morning, one of our marquee customers had some problems with an appliance that we supported at their site.  From what I read, it seemed like the appliance had suffered a significant failure...but the actual cause of the failure was unclear.

Thankfully, a member of our Support staff had managed to get the appliance working again by simply rebooting.  This was a small amount of good news in this difficult situation.

The "marquee customer" in this story is a famous company in the financial services industry.  Their infrastructure is large and complex.

There was little that I could do about the problem at this point, so I sent a request to the folks in Support to please try to get some logfiles from the appliance.  It seemed like a good idea to get to work early the next day, so I decided to turn in early that night.  I was fairly sure that this situation was going to land on my desk in the morning, and I was right....

The next morning, the logfiles that I asked for soon appeared, and I learned that this customer had implemented our appliance as a VM in their big-iron ESX server farm.  The content in the logfiles was...strange.  Something Crazy seemed to have happened to the appliance's database, but beyond this, the root-cause of the problem was unclear.

So, at this point, we began to do several things:

  • trying to get more logfiles to help determine the root cause
  • looking through the code related to the database to see if there were any obvious problems
  • attending a set of conference calls with the customer to manage the problem.

This was a challenging problem to deal with.   Nothing in the logfiles really shed any light on the problem.  Unfortunately there were some issues in the database code that I really had been wanting to do something about for a long time prior to this incident, so I started to work on these (but it seemed unlikely that the problems I was concerned about were related in any way to the problems that this customer had encountered).  And, of course, the conference calls were....tense.  Everybody wished that the problem had never happened.

A detailed reading of all of the logfiles still led me to conclude that Something Crazy had happened.

On Wednesday/Thursday we deployed some new code at this site that we hoped would improve things in general.  The stress level in the office began to go back down to normal...


The next weekend, the machine crashed again at exactly the same time.  Now I got even more forwarded email....and this time it was obvious that this problem was being noticed by various higher-ups in both organizations.   Not good, not good....

Again, somebody from Support managed to get the machine going again by simply rebooting.


Monday morning came along, and we arranged a conference call with the customer's IT staff.   We started going through the logfiles, etc.  We were pulling our hair out trying to figure out what had gone wrong while the higher-ups were fuming.

Around 5 minutes into the conference call, as we were talking to this customer's IT staff (waiting for some "higher up" staff to arrive), one of the customer's sysadmins told us that the last two weekends had been pretty bad at their site.  "Why?" I asked.  The answer:  "well, the corporate SAN server was scheduled for some maintenance over the past two weekends, and when they were performing the maintenance a lot of things went haywire".  I asked "does this SAN server provide the storage for all of the VMs on the ESX server, the same ESX server that runs our appliance/VM?".  "Yes", the sysadmin said, and then he continued, "actually, when the SAN server went down, we had around 40 VMs on the ESX server crash or go into a flaky mode where they needed to be rebooted".

I pressed "mute" on the speakerphone, looked around the table and asked our Support person "at any point in this incident has this customer's staff ever mentioned the fact that their SAN server went down?"  The reply:  "nope".


We then learned that their SAN server had been down for 45+ minutes during both weekends.  Then I had to explain to this customer's IT staff that our appliance had a database server running on it, and that a database server doesn't react very kindly when the underlying disk drive goes away for 45+ minutes.   "Ohhhhhh" was their reply....

The end result of this incident is that we actually had to add verbiage to the customer-facing docs reminding site admins that if they chose to deploy the appliance as a VM, they needed to ensure that the underlying storage mechanism was highly available.

I swear I didn't make this up.

02 November 2014

Philosophy Tech Support

I'm pretty sure that Philosophy Tech Support from Existential Comics is the funniest thing I'm going to read all day.

05 September 2014

Hanover / Chelsea Loop -- August 2014

This was a nice ride!  Gloomy skies for the entire day, but it never actually rained on us.  Even though it was August, I wore arm-warmers for the entire day, and a vest during descents.  The roads were super-nice, with a fair amount of dirt, but I rode all of it with no problems on 700x25s.

Just before we rolled back into Hanover, we stopped by a store that is frequented by AT hikers.  When I hike in the Whites I always give AT hikers whatever food that I have (because I know how many calories they burn through), and that's exactly what I did here in Norwich too.  It was neat to chat with one particular AT hiker, asking him about his experiences so far on the trail.

As we crossed the border back into NH, the sun came out and it got hot.  That was the first real sun we got for the entire day.  Still, this was a great ride, with a great bunch of people, and I'm glad that we shared this little adventure!