The end of March brought with it the first episode of a neat new project called The Network Collective, a video roundtable for networking engineers. The hosts and co-founders of this escapade are Jordan Martin (@BCJordo), Eyvonne Sharp (@SharpNetwork) and Phil Gervasi (@Network_Phil).
Top 10 Ways To Break Your Network
Episode 1 brought three guests to the virtual table: Carl Fugate, Mike Zsiga and Jody Lemoine, the latter of whom (top right on the YouTube video) is actually blurry in real life, and this is not a video artifact. The topic for discussion was the Top 10 Ways To Break Your Network.
Thankfully, the show didn’t actually provide tips on how to break your network — as if we need any help doing that — but instead looked at the shameful ways in which each participant had managed to cause network destruction in the past, and what lessons could be learned.
The fact that five of the six experienced professionals were willing to own up to their blunders (one brought a colleague’s mistake to put on the chopping block) highlights one of the most important lessons from the episode: be honest and own up to your mistakes. It is better for your career to do that than to pretend that you have no idea how an outage happened. Trust me; I have a very particular set of skills, skills I have acquired over a very long career. Skills that make me a nightmare for people who cause outages. If you admit your error up front, that’ll be the end of it. I will not throw you under the bus; I will not pursue you. But if you try to cover up your error, I will look for you, I will find you, and I will hang you out to dry. But I digress… The long and short of it is that if I waste my time tracking down the source of a problem which somebody knew all along but didn’t want to admit to, I’m going to be pretty steamed. After 16 years as a consultant, one of the many mantras I learned to live by is this:
It’s All About The Environment
With that said, I feel there’s another important lesson here, and it’s for management rather than the engineers. As a manager, it behooves you to create an environment which encourages honesty instead of punishing it. I have worked in environments where the most important part of finding the root cause of an outage was assigning blame to an individual. Guess what? Nobody ever wanted to own up to doing anything, because they feared for their jobs. If you’re currently thinking “Well, of course, that’s obvious!”, you’d be right, yet I’ve seen and heard about companies like this far too often. How does your company treat honest mistakes?
Confession Time
We all make mistakes. The reasons for those mistakes vary from carelessness and over-confidence through to ignorance, software bugs, unfamiliarity with an environment and sheer bad luck. However, they are mistakes and — other than in circumstances of exceptional disgruntlement — are not an intentional attempt to take down a network. I don’t want to sound like a greetings card, but every mistake is also an opportunity to learn, and this is where the metaphorical rubber meets the road. The aim of performing a root cause analysis (RCA) after an outage is not simply to determine what happened; it should also be to look at how that same mistake can be avoided in the future. Without the latter, there’s no point in performing the RCA in the first place, in my opinion.
Finding Root Cause
When looking for a root cause, I go beyond simply the action that caused the outage. I ask questions like:
- Was there a process failure which allowed this to happen (or did somebody break a process which would have prevented the issue)?
- How quickly was the issue discovered? Why did it take that long?
- Were there “extenders”, i.e. did something happen (or not happen) which meant that the outage continued for longer than it needed to?
- What testing was done during and after the change, and did it catch the error? If not, why not? In other words, were there holes in the test plan that we can now see? (One possible automated check is sketched after this list.)
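To make that last question concrete, here’s a minimal sketch (my own, not anything from the episode) of a pre/post-change check in Python: snapshot key state before the change and diff it afterwards, so gaps in testing surface as explicit failures rather than as surprises later. The device names and the collect_state() helper are hypothetical placeholders for whatever your own tooling (SNMP, device APIs, CLI scraping) actually uses.

```python
from typing import Dict, List


def collect_state(device: str) -> Dict[str, int]:
    """Hypothetical collector: replace these canned values with real queries."""
    return {"bgp_peers_established": 12, "routes": 840, "interfaces_up": 48}


def compare_state(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Return human-readable differences between two state snapshots."""
    findings = []
    for key in sorted(set(before) | set(after)):
        if before.get(key) != after.get(key):
            findings.append(f"{key}: {before.get(key)} -> {after.get(key)}")
    return findings


if __name__ == "__main__":
    devices = ["core-rtr-01", "core-rtr-02"]            # hypothetical names
    baseline = {d: collect_state(d) for d in devices}   # captured pre-change
    # ... the change is performed here ...
    for d in devices:
        diffs = compare_state(baseline[d], collect_state(d))
        if diffs:
            print(f"{d}: unexpected differences {diffs} - consider the recovery plan")
        else:
            print(f"{d}: post-change state matches the baseline")
```

The value isn’t in this particular script; it’s in deciding up front which state matters, so the test plan is written down before the change rather than improvised afterwards.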
One of the comments during this episode was along the lines that an outage can occur, and then steps are taken to make sure that particular outage path can’t happen again, but it’s almost pointless because the next outage will inevitably be something else unexpected. The implication seemed to be that making changes to avoid a repeat incident was somehow pointless. I respectfully disagree. The first time a mistake happens, it’s a mistake. If the same mistake happens again because I didn’t take steps to prevent it, then it’s not a mistake any more; it’s a known, unresolved problem.
As a corollary to that, if an engineer makes the same mistake repeatedly, perhaps this career is not for them.
All Aboard The Blunder Bus
In response to this first episode, in a future post I will share one of my own epic blunders and analyze the lessons to be learned from what happened.
The Network Collective
The Network Collective looks like it should be an interesting project to follow, and I would recommend subscribing. I love hearing tales from the real world, and next week’s recording of Episode 2 (Choosing a Routing Protocol) features the ubiquitous Russ White. What’s not to love?
For me there are three key lessons from this excellent episode that apply to all IT staff.
1) Admit to mistakes (as you already covered)
2) The importance of really understanding what you are doing
3) As listed at 36:45, the three most important documents when performing a change, in ascending order of importance (most important last): the change details, the test plan, and the recovery plan. (A small sketch of such a change record follows below.)
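As a loose illustration of that third point, here’s a hedged Python sketch (my own, not from the episode) of a change record that isn’t considered ready until all three documents exist; the class, field names, and example content are entirely hypothetical.

```python
# A change record that is only "ready" once the change details, test plan,
# and recovery plan are all present. Names and content are hypothetical.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRecord:
    summary: str                                        # the change details
    test_plan: List[str] = field(default_factory=list)
    recovery_plan: List[str] = field(default_factory=list)

    def ready_for_approval(self) -> bool:
        # The recovery plan is listed last above but matters most:
        # no recovery plan, no change.
        return bool(self.summary and self.test_plan and self.recovery_plan)


change = ChangeRecord(
    summary="Migrate core-rtr-01 uplink to the new 100G interface",
    test_plan=["Verify BGP peers re-establish", "Ping key loopbacks"],
    recovery_plan=["Restore interface config from backup", "Re-run the test plan"],
)
print(change.ready_for_approval())  # True only when all three documents exist
```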