14 October 2013

Fun With Real-World Network Problems

A long time ago, on a ${Dayjob} far, far away, somebody from Support found me in the lunch-room and said "Kevin, can you give me a hand?  I'm having trouble with a Customer-site and I can't figure out what is going wrong".  "Okay", I said, as we walked back to his desk.

On the way, the Support guy warned me "look, this Customer is not going to be easy to deal with.  His network is down; people are yelling at him, and so he's yelling at us".   OK, I can deal with this...

So, I sat down in front of the Support guy's computer, which already had a desktop/VPN sharing session set up to the remote customer site.  The Customer (Site Administrator) was soon on the speakerphone, and we were quickly in the thick of the problem.

The first thing I said was "Look, let me try to get up to speed here.  What is going wrong?".   OK, so I learned that their DHCP server was only working for some of this site's end-users.  It seemed that every user who wasn't able to get a DHCP address was calling the site's help-desk.

The Site Administrator was...upset...and he even reminded us that everything should be working just fine -- after all -- our company had just replaced his old DHCP server with a brand-new and rather expensive server one month ago.  From his perspective, it simply was not acceptable that this new piece of hardware was not performing flawlessly....

So, I started to get the Site Administrator to run some basic diagnostics.   Was the DHCP server configured correctly?  Was it running?  Were the disk drives acting reasonably?  Etc.  Everything checked out just fine.

Many users at this site continued to not be able to get onto the network....

At this time, I was out of simple diagnostics to run.  I decided that I really wanted to see a network capture of the DHCP traffic.  I asked the Site Administrator to run the network trace.  He did this, not quite in the manner that I was hoping for, but now I was starting to see more information.

With the trace enabled, I was now starting to see the DHCP server perform its work, on the wire.  The challenging thing for me at this point was that the network trace that the Site Administrator was running showed me a high-level view of ALL of the traffic that was hitting the system's DHCP server.  At least 75% of this traffic related to DHCP interactions in which the client was successfully able to obtain a DHCP address....and picking out failing sessions was like picking out a needle in a haystack that was getting tossed around by a tornado.  After a few tense minutes of watching traffic like this (sprinkled with shouts from the Site Administrator of "there's another failure right there!!!"), I finally was able to get the Site Administrator to send one of his techs out to the location of the problem (wireless) network so I would have a known client to look at.  Once the tech was at the network location with his test MacBook, I obtained the MAC address of the test MacBook's wireless interface.  Then I had the Site Admin narrow the network capture so that it only showed the DHCP traffic from the tech's MacBook.
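Narrowing a capture to one client's DHCP traffic can be sketched roughly like this, assuming a capture tool that accepts pcap-filter expressions (tcpdump and Wireshark's capture filters both do).  The helper name and the MAC address in the usage line are made up for illustration; the story doesn't record which tool or address was actually used.

```python
import re


def dhcp_filter_for_mac(mac: str) -> str:
    """Build a pcap-filter expression matching DHCP traffic to or from one MAC.

    Hypothetical helper for illustration only.  DHCP runs over UDP ports
    67 (server) and 68 (client), so the filter combines those ports with
    an Ethernet-level match on the client's MAC address.
    """
    # Keep only the hex digits, so "00-1B-63-..." and "00:1b:63:..." both work.
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    normalized = ":".join(digits[i:i + 2] for i in range(0, 12, 2))
    return f"ether host {normalized} and udp and (port 67 or port 68)"


# Made-up MAC address standing in for the test MacBook's wireless interface.
print(dhcp_filter_for_mac("00-1B-63-84-45-E6"))
```

The resulting string could be handed to something like `tcpdump -i <interface> '<filter>'`; the interface name is site-specific and left open here.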

Despite the fact that we were now looking at a much smaller subset of the network's total traffic, it still wasn't obvious what was going wrong.  The DHCP traffic from the failing MacBook seemed....strangely slow....and the DHCP negotiation wasn't working correctly....but the problem wasn't immediately obvious.  The Site Admin was becoming more and more unhappy.

Many users at this site continued to not be able to get onto the network....

I was nearly out of ideas.  The trace that I was running wasn't showing any obvious problems.  Finally, I told the Site Admin "look, can we please run a trace in the manner that I originally wanted, which would be a complete network capture in which I can look at ALL of the traffic, including all of the headers and complete contents of all of the packets?".  After a few minutes of back and forth, I eventually got this capture started (capturing traffic as seen by the DHCP server), and then we got the tech with the MacBook to run one more DHCP negotiation.  Then I got the Site Admin to send me the complete network capture.  I ran back to my desk and opened the capture in Wireshark.

Now that I was back at my desk and I was able to think more clearly, I was hoping that I would somehow be able to employ the Feynman Problem Solving Algorithm to identify where the problem was.  Unfortunately, my brain was unable to successfully implement this algorithm.

However.....I did notice one strange thing in the trace.  I noticed that none of the DHCP traffic from the failing DHCP client was destined for a broadcast address.  Instead it was destined for the specific IP address of the shiny new DHCP server running on the brand new appliance at the customer site.   Also, since the DHCP requests were destined for a specific IP address, the destination MAC address in the Ethernet headers was also set to a unicast MAC address.  Furthermore, I noticed that at some point early in the DHCP negotiation, the DHCP client was actually issuing an ARP-request with the IP address of the DHCP server inside....indicating that the DHCP client somehow simultaneously DID and DID NOT know the MAC address of the DHCP server.  I didn't know what to make of this....so I wandered back to the Support guy's desk, where we still had a speakerphone connection with the Site Admin of the failing site.  I verified that the destination IP address that I had seen was correct....and then I tried to verify the destination MAC address that I had seen in the packet trace.

Here is where we were able to start dragging The Real Problem out into the light of day: the MAC addresses didn't match!  In fact, we soon established that:

  • there were no DHCP-helpers on this network -- it was a big but simple network.
  • all of the failing DHCP clients were wireless clients, but not all wireless clients were failing -- many were working just fine.
  • all of the failing DHCP clients happened to be sending their DHCP protocol traffic to a specific (correct) IP destination address but an incorrect unicast MAC address.
  • all of the successful DHCP clients (both wired and wireless) were interacting with the DHCP server via IP-broadcast/MAC-broadcast addresses (the way that things usually go).
  • ....and there was this weird ARP issue too:   I wondered aloud "how is the tech's MacBook knowledgeable enough to send out an Ethernet frame with the destination MAC address correctly filled in with some (wrong) unicast MAC address, but then a few milliseconds later it seems to be in a state where it doesn't have this unicast entry in its ARP cache, because it is sending out an ARP-request for the DHCP server's IP address?".
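The sorting-out in the list above amounts to grouping DHCP clients by the destination MAC address they used and flagging any destination that is neither broadcast nor the server's real address.  Here is a small sketch of that triage; the function names and every MAC address in it are invented for illustration, and the input is assumed to be (client MAC, destination MAC) pairs already pulled out of a capture by some other means.

```python
from collections import defaultdict

BROADCAST = "ff:ff:ff:ff:ff:ff"


def group_clients_by_destination(requests):
    """Map each destination MAC to the set of client MACs that sent to it."""
    groups = defaultdict(set)
    for client_mac, dest_mac in requests:
        groups[dest_mac.lower()].add(client_mac.lower())
    return dict(groups)


def suspicious_destinations(requests, server_mac):
    """Return destinations that are neither broadcast nor the server's MAC."""
    expected = {BROADCAST, server_mac.lower()}
    return {
        dest: clients
        for dest, clients in group_clients_by_destination(requests).items()
        if dest not in expected
    }


# Invented sample data: two healthy clients and one sending to a stale MAC.
observed = [
    ("00:1b:63:84:45:e6", "00:14:22:01:23:45"),  # failing wireless client
    ("3c:07:54:11:22:33", BROADCAST),            # working wireless client
    ("00:26:b9:44:55:66", BROADCAST),            # working wired client
]
print(suspicious_destinations(observed, server_mac="00:25:90:aa:bb:cc"))
```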

I explained this situation to the Site Administrator and made the following fact very clear:  the appliance that we were all trying to fix at the customer site was simply NOT GOING TO RESPOND to traffic that arrived at its NIC with a destination MAC address that wasn't either a broadcast address or an address that matched its own MAC address.  Then I told the Site Admin "unfortunately, I really haven't thought of a reason why this would be occurring on your network yet....I'm not sure where this phantom MAC address is coming from....".  Upon hearing these new facts, the Site Admin told us "hold on a sec..." and then he said "I'll call you back soon".  Click....
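The acceptance rule I described to the Site Admin can be written down in a few lines.  This is a deliberately simplified sketch: it ignores multicast and promiscuous mode, the function names are hypothetical, and the "old" and "new" MAC addresses are invented stand-ins rather than values from the real capture.

```python
BROADCAST_MAC = bytes.fromhex("ffffffffffff")


def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or dash-separated) into 6 raw bytes."""
    raw = bytes.fromhex(text.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {text!r}")
    return raw


def nic_accepts(frame: bytes, own_mac: bytes) -> bool:
    """Simplified NIC filter: accept broadcast frames or frames addressed to us.

    The destination MAC is the first 6 bytes of an Ethernet frame; anything
    shorter than a 14-byte Ethernet header is rejected outright.
    """
    if len(frame) < 14:
        return False
    dest = frame[:6]
    return dest == BROADCAST_MAC or dest == own_mac


# Invented addresses: the replaced appliance, the new one, and a client.
old_server = parse_mac("00:14:22:01:23:45")
new_server = parse_mac("00:25:90:aa:bb:cc")
client = parse_mac("00:1b:63:84:45:e6")
ipv4 = bytes.fromhex("0800")  # EtherType for IPv4

stale_frame = old_server + client + ipv4 + b"dhcp payload"
broadcast_frame = BROADCAST_MAC + client + ipv4 + b"dhcp payload"

print(nic_accepts(stale_frame, new_server))      # frame aimed at the old MAC
print(nic_accepts(broadcast_frame, new_server))  # ordinary broadcast frame
```

Under this rule, a frame addressed to the replaced appliance's MAC is dropped before the DHCP server software ever sees it, no matter how correct the destination IP address is.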

I went back to my desk to puzzle over the network traces some more.  I really had no good way to explain this traffic that I was seeing at this site.

Soon after, I got my explanation.  The Site Admin sent us an email that read:

I believe we've found part of the problem.  There are two settings on ${famous_networking_vendor's} wireless controllers: "Broadcast Filter ARP" and "Convert Broadcast ARP Requests to Unicast".  Together they do what it implies, convert Broadcast ARP requests to unicast traffic over wireless, reducing the amount of broadcast traffic.  This setting has been enabled for years with no issues.  The problem seems to have started after we upgraded our hardware.  If I disable this feature on ${famous_network_vendor's} wireless controller then the ARP process works normally and the client doesn't lose connectivity to it. 

At this point I asked the Site Admin via email "say, that destination unicast MAC address we observed in all of the failing DHCP negotiations and in those ARP-requests -- did that MAC address belong to one of the NICs on the appliance that we replaced at your site around a month ago?  The MAC address seems like it would match the NIC vendor that I believe would go along with that vintage of old hardware"...

I never got a response to my query, but I am 100% certain that the answer to my question was "yes".

Problem solved.  The root cause of this site's entire problem was not the DHCP server itself, but rather how the site's wireless configuration had been set up to deal with one (and only one) specific DHCP server.

Obviously, somebody at this site had previously attempted to set up their wireless network in some sort of an optimal way (I can see the rationale here), but, of course, this "optimal configuration" fell down the second that a new appliance was set up at this site....a new appliance with new/different MAC addresses.   I learn something new every day!


[certain details of this story are a tiny bit cloudy in my mind, but I'm fairly sure that I have most of the details right.  Unfortunately, the network captures that I ran on this day are long gone....]
