09 February 2009

Fun With Logging

What is it with logging subsystems? Writing one always seems to bring up the most interesting problems....

One day at $DAYJOB I was called upon to write a logging subsystem. So, I dived in.

In my judgment, the logging subsystem I had to design+implement had to satisfy a number of basic constraints:
  • output redirectable to a file or terminal or syslog
  • configurable so that engineers working on one subsystem could see their own output and ignore the output generated by other subsystems
  • configurable so that several levels of output could be generated: informational, warnings, debug, etc.
  • ability to generate timestamps with output
  • bombproof and reliable
  • portable to a number of systems and compilers
In addition, due to the nature of the project we were working on, I also added a few more features:
  • very high performance
  • ability to log to a circular memory buffer
  • ability to change logfile parameters at runtime
We were implementing a very big system, and we needed all of these features. So, this is what I designed+implemented.

With all of these constraints, the logging subsystem was not everything to everybody. For example, I was under CONSIDERABLE pressure to make this subsystem as efficient as possible, so when it came time to choose between making the system easy to use or ultra-efficient, I had to pick the latter design goal. Still, I was pleased with my efforts. I unit-tested my code, finished my docs, checked it in, alerted my co-workers, and started to work on other things.

My co-workers thought my work was OK and they started using my logging subsystem.

This is when the interesting problems began.

The problem that hit me right away was that now that our SQA staff had logfiles that they could read, when something went wrong in the product (still in its infancy), most of the time the new bug in the bug-tracking system got assigned to me. What was the rationale for this? Answer (paraphrasing): "The product bombed and I saw an error message in the logfile. Kevin owns the logfile subsystem, so I assigned the problem to Kevin". By this logic, I was getting assigned (and bothered by) dozens and dozens of bugs. I mean, a logfile message could have said "Fatal error in Ralph's code; please contact Ralph if you see this error" and the bug still likely would have gotten assigned to me. It took me several iterations with SQA to address this problem.

At this point SQA understood that if they wanted to test out my code, they would have to test the logging system ITSELF and not simply look at the messages that happened to be traversing through it. Well, the next test that came up went like this: pump as much {stuff} as possible through the logging system to see if it could handle it. I was a bit dubious of this test, but I had little time to complain. After a week of SQA staying out of my hair, I again started getting many bugs assigned to me. This time the logic went like this: after turning on full logging in every subsystem, heavily loading the system, and then letting it run for a day, the system bombed. Oh, and the system was completely out of disk space too, due to the immense size of the resulting logfile.

I had to roll up my sleeves and investigate this situation. It turns out that my code was not the culprit here either, because my code deliberately did not check the return value from the call to write() to send the data to disk. However, now I was getting assigned lots of bugs that had the following pattern: (1) turn on full logging (2) hammer on system for long time (3) observe that it bombs. I did not enjoy getting assigned these sorts of bugs, so I protested to SQA that they were assuming that there was a causal relationship between two things: the fact that the system's disks were filling up and the fact that the system died. I pointed out that these two things might be completely unrelated. Eventually, I had to make this point crystal clear to SQA by writing a tool for them that filled the local disk to 99.999% full before they tested the system. At this point, with a bit of effort, I was able to convince them that if the system didn't die right away when they started their testing, that the problem must lie somewhere else. Eventually, I sat down in the lab for a few hours to figure out where the real root cause was, and sure enough, there were several real problems elsewhere.

The last problem I had with the logging subsystem was the weirdest one. My boss's boss had heard of all of the "problems" that were cropping up "in the logging subsystem" and he decided to see what was going on. I explained the situation with the previous two "problems" to him and he understood right away -- he was, after all, a very good engineer. But then over the next few weeks he asked me several times if I was sure that the logging subsystem was efficient enough. In response to his first few queries as to whether the logging subsystem was "efficient enough" I gave some short, straightforward answers: "yeah, it is written very efficiently" and "I used some preprocessor hacks to precompute things", etc. But he kept on asking me about the logging subsystem, and I began to detect a very worried vibe from this gentleman.

The final straw came one day when, for the Nth time, I had to explain that if somebody were to turn on full logging on the system (and you might as well include a few calls to fflush() and fsync() in here, because I made these configurable too), then the system would run slower AND THIS WAS OK. I kept on fielding questions and concerns from various people in the company about this issue, and it was starting to drive me a little bit batty. I'd explain it this way "look, if you ask a computer to do more work, would you also reasonably expect that it would take more time to do this work?". After I managed to extract an answer of "yes" I'd ask "so why are you concerned about our system slowing down if you turn on full debug logging?". I'd also point out that you would NEVER want to do this on a production system.

Anyways, my boss's boss stopped by one more time to ask me about all of this and so I had it out with him. We went back to his office and I explained exactly how the logging subsystem worked and answered all of his questions. Then I got to ask my question: "why are you so concerned about the system running slower when somebody turns on debug logging?". At this point he told me one of HIS war stories: one day when he was working in the telecom world he was involved in a sales demo of a serious, big-iron telephone switch. Not having one of these just lying around for a demo, they brought the potential customer to the central office, where a big-iron switch was operating in a production environment. Well, the demo started, and at one point the potential customer asked "What if something goes wrong on the switch? Are there any debugging capabilities?". Well, the sales engineer, wanting to make a sale, said "No problem! I know how to turn on debug output!" and before anybody could stop him he did exactly that. Well, at this point, operations on the switch screeched to a crawl and THIRTY-FIVE THOUSAND PHONE CALLS WERE DROPPED INSTANTLY. He didn't tell me if they got the sale or not.... (I can only assume that this event was the inspiration behind the "can you hear me now!?" slogan)

Anyways, after hearing this story, I understood where my boss's boss was coming from. I never got to make any modifications to the logging subsystem based on his input (like, maybe some levels of debug logging should not be available in a production build at all) because the issue never became all that important, and I transitioned to another project a short time later.

Ahh...logging subsystems....never a dull moment.