In my response to The Network Collective’s group therapy session where the participants ‘fessed up to engineering sins, I promised to share my own personal nightmare story, as seen on Reuters. Grab a bag of popcorn, a mug of hot chocolate and your best
ghost story flashlight, and I will share a tale which will haunt you for years to come. If you have small children in the room, this may be a good time to send them outside to play.
At one point in my career I was a network engineer for a national mobile provider in the USA. The mobility market is a high-stakes environment, perhaps more so than most industry outsiders might expect. Users have surprisingly high expectations and are increasingly reliant on the availability of the network at all times of day or night.
Mobile networks are typically not just for consumers but are also used by a large number of private entities including fleet management companies, fire/burglar alarm systems, shipping companies and emergency services, so even a minor outage can be a problem. These commercial organizations all had customized private connectivity to the mobile provider, and thankfully we had a contractually-identified maintenance window available six days a week, during which all changes had to happen. Nonetheless, even during a change window the attitude, rightly enough, was that if an interruption in service could be avoided, it should be. I refer to this as make before break engineering (a reference to electrical switches in which the new connection is made before the old connection is broken), and writing changes this way requires a different mindset from that found in a typical enterprise environment.
When the stakes are high, the stress is high, and with true gallows humor we would joke that you weren't a fully-fledged member of the team until you had caused an outage you could read about on Reuters. It was an ironic badge of honor, in some ways. In many networking roles, losing connectivity for a few hours is just an annoyance. Think about it, though: have you ever heard or read a news story about a mobile provider having some kind of outage? The risk of damage to a provider's reputation should not be underestimated; reports of outages have a direct impact on customers' perception of the reliability and capabilities of each provider when they're choosing their next mobile contract, and that means a direct impact on the bottom line.
While I’m not proud of it, I do have the aforementioned badge of honor (and possibly the t-shirt as well). As background, I should explain that one of my roles at this particular mobile provider was to manage internet peering for the data centers. Internally, we had backhaul between the public-facing addresses for each site (so we would not have to transit the public internet when a service was not local), so internally we knew all our public routes, but externally we carefully filtered what we advertised to the Internet to ensure that traffic from outside the provider came to the right place.
The error I made was when updating a route-map on our edge internet routers at Data Center A. My intent had been to add a new sequence something like this:
route-map RM_OUTBOUND_TO_INTERNET seq 700 permit
match ip prefix-list PL_LocalRoutes
Simple, right? Unfortunately, at some point during the creation of my MOP (Method Of Procedure, or a change script), I had managed to mistype the name of the prefix-list, and my change instead read like this:
route-map RM_OUTBOUND_TO_INTERNET seq 700 permit
match ip prefix-list PL LocalRoutes
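With hindsight, this is exactly the kind of typo a script could have caught before the MOP ever reached review. As a purely hypothetical sketch (nothing like this existed at the time), a lint could flag any match statement that appears to reference more than one prefix-list so a human can confirm it's intentional:

```python
import re

def flag_suspect_matches(mop_text):
    """Flag 'match ip prefix-list' lines referencing more than one name.

    Matching several prefix-lists in one clause is legal in IOS, so these
    are review flags rather than errors -- but a space in a mistyped name
    looks exactly like two list references.
    """
    suspects = []
    for line in mop_text.splitlines():
        m = re.match(r"\s*match ip prefix-list\s+(.+)", line)
        if m and len(m.group(1).split()) > 1:
            suspects.append(line.strip())
    return suspects

mop = """route-map RM_OUTBOUND_TO_INTERNET seq 700 permit
 match ip prefix-list PL LocalRoutes"""
print(flag_suspect_matches(mop))
# flags: ['match ip prefix-list PL LocalRoutes']
```

A check this dumb would have turned a Reuters headline into a one-line review comment.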
The MOP had been through reviews both within Engineering and with Operations, nobody had spotted the error, and so the change was scheduled for execution. At this point it is worth explaining that this company had strict separation of duties between Operations and Engineering: Engineering wrote the MOPs but weren't allowed to execute them; Operations executed the MOPs but weren't allowed to write them. My access to the routers as an Engineer was read-only. I've posted previously about writing a MOP so that it can be successfully executed by another person, and I recommend reading that post too. While it's a pain to have to write changes out in such detail, the upside is that I didn't actually have to be there at 4AM when the change was being executed. After all, how could I help?
Fast Forward to 11:30AM
Somewhere around 11:30AM the morning of my 4AM internet change at Data Center A, I received an email asking if I had heard about an outage in Data Center B, and wondering if I could help take a look because they couldn’t figure out what had happened. This was the first report I’d heard about it, so I asked for further details of what was happening. Data Center B, it seems, was mostly offline. Throughput was way down on the internet-facing firewalls, and users going through that site were reporting that they couldn’t access many services. I thought about this for a minute, issued one command on the edge router at Data Center A, and I was able to confirm that the root cause was the change made on my behalf. I told them to roll the change back per my change script, and the problem would disappear, and within 10 minutes — by 11:45AM or so — service had been restored.
I learned something important that morning about Cisco IOS route-map configuration; did you know that you can match more than one prefix-list within the same match command? i.e. it’s valid to have:
match ip prefix-list PL1 PL2 PL3 PL4 PL5
This is handy to know, because it means that my typo:
match ip prefix-list PL LocalRoutes
…was not rejected as a syntax error by IOS. Instead, it was interpreted as a request to match a route in either of two prefix-lists, one called PL and one called LocalRoutes. In true IOS fashion, there was also no warning or error about the fact that the command referenced two prefix-lists, neither of which existed.
Another helpful thing to understand is that when a prefix-list does not exist, Cisco IOS treats a match against it as a match-all clause. Thus, instead of only matching the list of networks in PL_LocalRoutes, my route-map statement now matched all routes, including our internal routes to the public ranges in other data centers.
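This combination of behaviors is easy to model. Here is a toy simulation of the semantics as described above (an illustration only, not the real IOS implementation): a clause matches a route if any named prefix-list contains it, and an undefined list name behaves as match-all. Prefixes are from the RFC 5737 documentation ranges.

```python
def route_map_matches(route, list_names, prefix_lists):
    """Toy model: match if ANY named list contains the route; a name with
    no configured prefix-list behaves as 'match everything'."""
    for name in list_names:
        plist = prefix_lists.get(name)
        if plist is None:          # undefined prefix-list: matches all routes
            return True
        if route in plist:
            return True
    return False

prefix_lists = {"PL_LocalRoutes": {"192.0.2.0/24"}}

# Intended clause: only the local route is matched and advertised.
print(route_map_matches("198.51.100.0/24", ["PL_LocalRoutes"], prefix_lists))
# False

# The typo'd clause references "PL" and "LocalRoutes", neither of which is
# defined, so every route matches and leaks to the Internet.
print(route_map_matches("198.51.100.0/24", ["PL", "LocalRoutes"], prefix_lists))
# True
```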
The end result was that Data Center A was advertising routes which belonged to Data Center B, so traffic was going to the wrong place. Some of that traffic was permitted to transit our internal network to Data Center B, but the return path from B to the Internet didn't include Data Center A, so the path through the firewalls was asymmetrical and sessions never established.
The outage ran from 4:15AM until around 11:45AM, but it had only been noticed at around 7:15AM. Needless to say, this extended way beyond our maintenance window. Customers were complaining, and when I jumped on Google to see if there was any word about an outage affecting (roughly) a quarter of the American population, I was rewarded with a page and a half of news reports, and at the top of the list was Reuters. Level up! The Reuters badge, I found out, comes with a complimentary wave of nausea.
What command did I issue to figure out what was going on?
show ip bgp neighbors a.b.c.d advertised-routes | inc Total
While I’m not quite at the level where I can fix radios by thinking, I was able to listen to the symptoms, think about what might cause them, realize that my change involved one of those potential causes (i.e. that I was advertising too many routes from Data Center A), and validate my theory fairly easily. I knew how many routes were advertised before my change, and I knew how many routes I had intended to allow in addition, so when I checked how many routes were being advertised to one of our internet providers and saw a significantly larger number than expected, it was obvious what was wrong. I didn’t immediately know why it had happened, but I knew what had happened. Once I knew that it was the route-map change which had evidently not gone to plan, the space in the middle of the prefix-list name was an easy thing to discover.
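That validation amounts to pulling one number out of the command output and comparing it to an expectation. A hypothetical helper might look like this; the "Total number of prefixes" footer is how IOS summarizes advertised-routes output, though exact formatting varies by version, and the numbers here are illustrative:

```python
import re

def advertised_count(show_output):
    """Extract the advertised-prefix count from the summary footer of
    'show ip bgp neighbors <peer> advertised-routes'."""
    m = re.search(r"Total number of prefixes\s+(\d+)", show_output)
    return int(m.group(1)) if m else None

# Illustrative numbers: we expected roughly the pre-change count plus the
# handful of routes the change added. Seeing far more than that is the
# giveaway that the route-map is matching everything.
before, added = 120, 5
output = "Total number of prefixes 947"
count = advertised_count(output)
print(count, "advertised; expected about", before + added)
```

One regex and one subtraction turned "Data Center B is mostly offline" into a root cause.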
Why Wasn’t The Problem Noticed Earlier?
Why was it 7AM before a problem was identified? The answer to this is both good and bad. During maintenance windows, the NOC were used to seeing anomalies in device performance and traffic flows as we made changes, so a culture had built up whereby anomalies would be ignored during the maintenance window, even if we had not advised that such anomalies should be expected. After the BGP change was made, traffic for Data Center B was coming in to Data Center A, and the internet-facing firewalls were blocking a huge number of sessions, and the idle session count was through the roof. CPU had doubled because two data centers’ traffic was hitting the firewall. In all cases, while these symptoms had been noted, they were ignored as the normal fluctuation during a big change.
With the benefit of hindsight, obviously the NOC would not have done this, but at the time, that's what happened. Even at the end of the maintenance window at 6:15AM, the firewall statistics were clearly abnormal, but the NOC was changing shifts around then, and the outgoing shift passed on the message that they were ignoring the firewall statistics due to maintenance activities; consequently, the next shift continued to ignore them for another hour before somebody again questioned why the utilization and failed-session statistics were still so high. This was an outage extender (i.e. something which wasn't causal, but extended the outage beyond the point at which it could or should have been identified and fixed), because the issue had been in place for three hours before anybody started looking at it, and we had already exited the agreed maintenance window.
Why Wasn’t I Called Earlier?
Perhaps understandably, when an outage occurred in Data Center B, Operations did not immediately consider changes made in Data Center A. Even when I was eventually contacted, it was to get help troubleshooting, not because my change was suspected of being the cause. This was a lesson learned; the data centers were inherently coupled when it came to public IP space and internet access, so it was important to always consider that coupling when an issue arose. Again, this doesn't change the root cause of the problem, but it's another extender. Once I was called, I identified the problem within five minutes, and service was restored 10 minutes after that.
Surely You Tested After The Change?
We did test after the change. Data Center A — where we made the change — was working perfectly. We did not, however, test Data Center B. Why would we? The change was in Data Center A. Another lesson learned, and a good case study in considering the potential downstream impact of a change.
Hey Mr Hypocrite, Where’s Your Implementation Test Plan?
Where was my test plan? In the script, actually. Every change in the MOP was followed by a set of test steps to validate the correct implementation of the change. Before changing the route-map, the MOP gave the commands to check and note down the number of routes being sent to each internet peer. The MOP specified how many new routes should be advertised, and post-change I had included specific checks on how many routes we were advertising to each internet BGP peer, noting that the number should be [routes_before] + [added_routes].
When the Operations engineer checked their session logs, they honestly admitted that they had evidently not issued the commands specified in the MOP to validate the post-change routing. Once more, this is an outage extender: had the commands been issued and the route counts not matched what was specified, the MOP directed the Operations engineer to stop and roll back from that point. Had the tests been carried out, the problem would have been identified by 4:20AM and rolled back by 4:25AM, limiting to 10 minutes an outage which eventually lasted nearly seven and a half hours.
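The decision rule the MOP encoded was mechanical, which is part of why skipping it stings. A sketch of that proceed-or-rollback check, with illustrative numbers:

```python
def post_change_check(routes_before, added_routes, routes_after):
    """Return the action the MOP directed: proceed if the advertised-route
    count matches expectation, otherwise stop and roll back."""
    expected = routes_before + added_routes
    return "proceed" if routes_after == expected else "stop and roll back"

# Healthy change: e.g. 120 routes advertised before, 5 added, 125 seen after.
print(post_change_check(120, 5, 125))  # proceed

# The outage scenario: the match-all leak balloons the advertised count.
print(post_change_check(120, 5, 947))  # stop and roll back
```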
Whose Fault Is It Anyway?
It was my fault. I produced a MOP with a typo in it, so the root cause of the outage is all mine. However, were it not for an unfortunate storm of bad assumptions and incomplete process execution, the incident could have been identified and resolved well within the maintenance window, and somebody at Reuters could have had a quieter morning. Similarly, I would not have spent the next two days putting together a detailed Root Cause Analysis document for management and generally feeling like the worst engineer in the world.
Was I Fired?
No, I was not. I owned up to my typo, but with so many other elements contributing to the outage, it would have been very unfair if the company had singled me out. Instead, I worked with Operations to find ways to avoid this kind of issue in the future and create the necessary policy to support that goal.
Lessons to Learn
I noted a number of lessons learned on the way through, but as a brief summary:
- Fix the outage first; point fingers later
- Own up to your mistakes
- Always question anomalies and see if the answers make sense
- Always have a thorough test plan including the expected results
- Always execute the test plan…
- Consider downstream impacts and environments which may have a shared fate
- Don’t do it again! Once is unfortunate; twice is just careless. Figure out what you need to do to ensure that you don’t repeat the same mistake.
I think that’s more than enough from me. If you have your own horror stories I’d love to hear them, and if you haven’t listened to The Network Collective, Episode 1, you should, because you’ll hear about some more bad days happening to other people and you can empathize or cackle with the schadenfreude, as is your preference.
Some times, places, people and technical details about this incident have been changed to protect the guilty. And also to stop you finding it in Reuters’ archives…