22 February 2013

Retransmission Dampening

A certain network protocol that I work with can spew a lot of traffic.  Staggering amounts, actually.

I only have limited control over this traffic.  It just shows up (sometimes as if fired by a Gatling gun), and code that I help maintain on a certain server has to deal with it.   When the traffic isn't handled in a timely manner, people get upset and the phone starts ringing.

Recently, after watching our server start to have difficulty handling a certain burst in traffic, I started asking myself how we were going to improve.  I came up with a variety of ideas.

One of my ideas went as follows:


It is computationally non-trivial to unpack everything in every PDU that arrives at the server.  Under normal operations, everything is fine, and there are plenty of CPU resources for this sort of thing.  However, during heavy traffic loads, my data suggests that unpacking all of these PDUs is something that I need to pay attention to.

And....one of the problems is that the {things} that are sending our server all of these PDUs almost seem to be impatient.  Some of these {things} seem to operate in the following way: 

  1. Send a PDU with a request in it
  2. Wait an infinitesimal amount of time
  3. Have you received a response yet?  If "yes", we are done!  However, if "no" then goto step 1.
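That impatient loop, modeled as code (a toy sketch; the function names and the retry interval are my own illustration, not anything from the real protocol):

```python
import time

# "An infinitesimal amount of time" between retransmissions (made-up value).
RETRY_WAIT = 0.05

def impatient_client(send_request, poll_response, retry_wait=RETRY_WAIT,
                     max_tries=100):
    """Model of the impatient {thing}: send, wait briefly, check for a
    reply, and retransmit until answered.  Returns how many times the
    request PDU was sent."""
    sends = 0
    for _ in range(max_tries):
        send_request()            # step 1: send a PDU with the request in it
        sends += 1
        time.sleep(retry_wait)    # step 2: wait a tiny interval
        if poll_response():       # step 3: response received yet?
            return sends          # "yes" -> we are done!
    return sends                  # (a real {thing} might never give up)
```

Every trip through that loop that ends in "no" is one more PDU our server has to unpack.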

When I analyze the traffic, the problem I see is a fair number of re-sent request PDUs.  Our server is handling the requests, but perhaps not at the rate that these {things} on the network would like.  All that these re-sent request PDUs do during busy periods is add more load onto our server, which is already unpacking the received PDUs as quickly as it can and then doing the actual work.

But...when I look at the guts of this protocol, I believe that I see a non-trivial (but not difficult either) way of limiting these re-sent packets.  The code that I have in mind won't actually have to entirely unpack all of the PDUs in order to figure out which packets are re-sent packets and which packets are valid new requests.
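To make that concrete: suppose (and this is purely a hypothetical wire format, not the real one) that each PDU starts with a fixed-size header carrying a sender ID and a request ID.  Then spotting a re-sent packet only requires peeking at the first few bytes:

```python
import struct

# Hypothetical fixed header: 4-byte sender ID + 4-byte request ID,
# network byte order.  The real protocol's layout differs; the point is
# that the identifying fields can be read without decoding the
# (possibly large, possibly expensive) body.
HEADER = struct.Struct("!II")

def peek_request_key(pdu: bytes):
    """Return (sender_id, request_id) by reading only the first 8 bytes."""
    sender_id, request_id = HEADER.unpack_from(pdu, 0)
    return sender_id, request_id
```

Two PDUs with the same key arriving close together are, under this assumption, the same request being re-sent.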

Here is the thing that I have in mind:
  1. Under normal operations, everything works as it usually does.
  2. A "retransmission dampening filter" (RDF) will always be running.
  3. The RDF will let new PDUs into the system with no restrictions.
  4. In a CPU-efficient way, the RDF will be able to detect PDUs that are retransmissions.  If a PDU has been re-transmitted by the network {thing} without having waited {N} milliseconds, then the packet will be dropped.  This will spare the rest of the system from having to unpack the entire PDU, which, again, takes some CPU time.
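Here is a minimal sketch of what I have in mind for the RDF, assuming a per-request key can be extracted cheaply from each incoming PDU (the class name, key shape, and 1000 ms default are all my illustration, not settled design):

```python
import time

class RetransmissionDampeningFilter:
    """Drop PDUs that are retransmitted within min_interval_ms of the
    last accepted copy, without fully unpacking them."""

    def __init__(self, min_interval_ms=1000, now=time.monotonic):
        self.min_interval = min_interval_ms / 1000.0
        self.now = now              # injectable clock (handy for testing)
        self.last_seen = {}         # request key -> time of last acceptance

    def accept(self, key) -> bool:
        """True: pass the PDU into the system.  False: it is a too-eager
        retransmission; drop it before anyone unpacks it."""
        t = self.now()
        prev = self.last_seen.get(key)
        if prev is not None and (t - prev) < self.min_interval:
            return False            # re-sent without waiting {N} ms
        self.last_seen[key] = t     # new request, or a legitimate retry
        return True
```

One housekeeping detail: entries older than the minimum interval can be pruned whenever it is convenient, since a PDU that old would be accepted anyway, so the table never needs to grow without bound.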

I think that the nice part about this scheme is that it pretty much maintains itself.

However, before I go off and write some code to implement this, I decided that I wanted to run a simulation using some Real World data to see how much of an effect this scheme would have on overall traffic if the system were running with a {1, 2, 3} second minimum.
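The replay logic itself is simple enough to sketch.  The traffic list below is invented for illustration; the real run used captured data, with each recorded PDU reduced to an (arrival time, request key) pair:

```python
def simulate(traffic, min_interval_s):
    """Replay (timestamp, key) pairs through the dampening rule and
    count how many PDUs the given minimum interval would drop."""
    last = {}       # request key -> time of last accepted copy
    dropped = 0
    for t, key in traffic:
        prev = last.get(key)
        if prev is not None and (t - prev) < min_interval_s:
            dropped += 1            # retransmitted too soon: drop
        else:
            last[key] = t           # accept and remember it
    return dropped

# Made-up example: one impatient {thing} hammering a single request.
traffic = [(0.0, "a"), (0.3, "a"), (0.6, "a"), (1.2, "a"), (2.5, "a")]
for floor in (1, 2, 3):
    print(floor, "second minimum drops", simulate(traffic, floor), "PDUs")
```

Running the captured traffic through something like this, once per candidate minimum, gives the drop counts to compare.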

So, I threw together a simulator, and here are my results:


So, um, yeah....obviously I'm going to write some code soon to implement this.  Of course, I'll make my code tunable, but I'm pretty sure that the default will be to enforce a 1-second retransmission minimum.  This should be both a fun project and something that will really help the server that I help maintain!
