As a consultant, I have had the pleasure of working for a large number of companies in a range of industries, and one of the things that interests me in all of them is seeing the varied approaches to network changes, from the initial determination of what change needs to be made, following the change through some kind of Change Control process (in most cases), through to execution on the production network. The one thing that’s consistent with every company is that they aren’t consistent – they all have their own way of doing things, and while there’s some fundamental commonality between them, each is usually missing steps that I would consider important.
I’m not going to try to document a full change control process here (though of course, if you want to engage me to consult for your company, we can discuss that), but instead I present to you:
The Network Engineer’s Guide to Making Changes
A common sense approach to making supportable and repeatable
changes to your network.
Before we get stuck in, let me make the following brief points:
- This is a long post, but please stick with it. I thought it better to put it all in one place than to split over multiple posts.
- Apply as necessary. What I mean is, use your common sense to determine which areas of this guide are really necessary for a given change. And while you’re doing that, remember that common sense isn’t as common as we’d all like it to be. 🙂 Personally, I err on the side of caution, but do your own risk assessment.
I’ve made a lot of network changes over time. I wish I could tell you that every single one of them was successful – of course they were not – but I do have a pretty solid track record, all things considered. I’ve seen many approaches to planned change (i.e. changes that are scheduled, rather than those required to resolve an outage), and they typically fall somewhere between these two scenarios:
An engineer decides a change is needed and figures out in their head what commands are needed. They log into the core routers during peak hours and make the change without any requirement for notification or approval. Five hours later (after the business day has ended), a critical problem is discovered by a user at a remote site. After three hours of aimless troubleshooting by the support team, somebody finally happens to call the engineer that made the change earlier in the day and ask for help with the problem. The engineer realizes what has happened, plays dumb, pretends to troubleshoot, then magically manages to solve the problem and is the team’s hero for finding and fixing the problem so quickly! In their hurry to back the change out, the engineer has actually introduced a different problem which is discovered three hours later. This time, the engineer is able to restore the configuration to its former state, and the problems (as far as they know) as fixed. With the network problems finally resolved, everybody moves on. Two days later, the same thing happens again, and our hero once again is showered in adulation.
An engineer decides a change is needed. They work up a candidate configuration in the lab, then create a change script documenting the purpose of the change, the expected outcomes, a validated sequence of backout commands, a pre- and post-change test plan, a notification list, and a risk assessment. The script is submitted to a team of the company’s top network engineers for review and approval; once gained, the engineer submits a change request into the company’s Change system, and waits for the 17 required approvals to be completed. After 16 of the 17 approvals are in place, the change must go before the Change Advisory Board (CAB) who review the change in the context of the risk assessment, other scheduled changes, freezes and other significant events, and once satisfied the CAB provides the final approval for the change to go ahead at the specified time and date. Testers are lined up, and the change is executed as planned, outside business hours. A problem is detected by the testers, and after brief troubleshooting to determine the cause, the change is successfully rolled back. The following morning, the CAB meets again to review the outcome of the previous night’s changes, and request a root cause analysis of the change failure from the engineer who had authored the change. The report is reviewed, and once the CAB is satisfied that the error has been correctly identified and steps have been taken to ensure that it isn’t repeated, the engineer gets approval to update their script and repeat the review and change control process again so that a second attempt can be scheduled.
I’m guessing that elements of both these scenarios are familiar to most readers. I’m also going to take a stab in the dark here and say that many people reading the second scenario raised their eyebrows about the constraints of the process described, and the inevitable delays and paperwork that it must introduce; I believe the standard war cry is that “the process gets in the way of actually doing my job.” And there’s a truth to that – process certainly can kill productivity. On the other hand, it can also add a layer of protection to the network that can help keep you in business. Whether the network is the service you provide to your customers (e.g. in a Service Provider) or your customers are internal (i.e. you maintain the internal corporate network), no customer likes to see unplanned service interruption, and depending on the nature of the business the actual cost of such outages can be surprisingly high.
Ok, so now you’re probably assuming that I’m going to tell you that the Process Driven approach is the right one, and you don’t need to bother reading the rest of this. Not quite. I’ll tell you that there’s value to it, but I’d also note that obsessively process-driven organizations can also be inflexible and slow moving – things that customers also do not like, when they need that change, or are waiting for a new product to be launched. There’s a compromise required for sure, but regardless of the process that surrounds you, as Engineers I believe there are things we can do that will improve our success rate and assist troubleshooting efforts. You can even help with – gasp – documentation along the way!
2. The Change Script
For me, the biggest thing an Engineer can do is to write a good change script. Yes, a script. It’s not cool to do things in your head – things get missed, errors get made, and then it’s very hard to evaluate what was changed.
TIP #1: Write your changes down, no matter how tedious it seems.
How you write them down will also make a huge difference. I would not consider the following, for example, to be a good change script:
I mean, yes, it might be syntactically correct, and congratulations for following TIP #1 and all that, but what does this change actually do? Many larger organizations separate the Engineering and Operations organizations, so this change script may be handed off to an Operations group to execute. How do they know that interface GigabitEthernet0/2 is the right interface to shut down? What’s being shut down? Should the BGP neighbor come up at this time? Should we be testing anything? It looks like some pretty major routing changes are going on. If something breaks, how do I back out?
Think I’m kidding? I’ve seen – and indeed, I’ve had to issue – scripts just like this to an Operations team – and amazingly they were executed without question. Looking back, the Ops folks must have had nerves of steel to take these contextless directives and put them on a production network. So with that in mind, even though there are only nine lines of configuration in this example, they scream out to me as needing some improvement, so let’s look at a couple of ways to make things better.
3. The Syntax
When it comes to change scripts, it’s important to note that there are two levels of ‘correctness’ in my eyes, and they are absolutely not the same thing:
Syntactically correct means that when you paste the commands in, the router doesn’t spew errors. This is the very basic requirement for a change script – the syntax should be correct!
Semantically correct means that the change does what you intended it to do and solves the problem you had. For this to work, the change will have to be syntactically correct of course, but if your syntactically correct change does the wrong thing, we didn’t fix anything (and possibly caused other problems).
TIP #2: Validate your syntax.
For syntax, it’s really simple – use lab equipment or an emulator (e.g. GNS3 / dynagen / dynamips, or the fabled IOU) to validate that the commands you have written down are correct. Try to ensure that the IOS version you are using to validate is similar to the version on the router you are configuring. There’s nothing so fun as executing a change only to discover that that command you were relying on doesn’t exist in the version of code running on the production router.
Basically, in my eyes there’s little excuse for syntax errors in scripts that are hitting a router.
4. Beware The Copy and Paste Monster
Especially in environments where changes are made to redundant pairs of devices, the temptation to copy and paste chunks of configuration in your script is strong – after all, the two routers are configured so similarly that you’ll likely be making the same change on both devices in a pair.
TIP #3: Don’t Copy and Paste.
That’s a bit of a black and white statement isn’t it? You can take this one with a grain of salt, BUT – just note that it’s very dangerous to copy and paste anything that involves interface IPs, OSPF network statements (if you use host wildcards), and so on. Copy and paste can be deadly unless you’re really diligent at identifying every reference that needs to change. Go carefully, my friends!
5. The Backout
How do we backout IOS scripts? We put “no” in front of every line! I’m kidding, obviously, yet I’ll bet we’ve all seen scripts that do this. To be absolutely fair, very often that’s sufficient so long as we avoid some of the existing structure. For example, in the change script above, the new BGP neighbor is added with these commands:
router bgp 65000
neighbor 220.127.116.11 remote-as 65000
To back that out, since (we assume) BGP 65000 already exists, we likely want to do:
router bgp 65000
no neighbor 18.104.22.168 remote-as 65000
That is, we don’t just put ‘no’ in front of both lines. This is simple stuff and I think most people can cope with doing this on the fly, although as I inferred earlier, I’m making the assumption here that bgp process 65000 already exists. Perhaps it didn’t – how would I know unless I ran some checks before making any changes?
So all’s good until you find a command that doesn’t work so nicely. We added OSPF to BGP redistribution with these commands:
router bgp 65000
redistribute ospf 1 route-map REDIST
Putting aside that the route-map is badly named, let’s back this out (picture that you’re under stress, it’s 3AM, and executive management is screaming for blood):
router bgp 65000
no redistribute ospf 1 route-map REDIST
All good? Not really. This particular ‘no’ command actually just removes the route-map from the redistribution, but leaves redistribution in place – now without any filters; things just got worse. What’s the lesson here?
TIP #4: Document – and validate – your backout script.
Use the same lab or emulator you used to validate the change syntax, and make sure that the configuration has been restored to its original state after you run the backout commands. You don’t want to be trying to figure out how to backout a 200 line script when you’re under pressure, and as we saw above, even a two line backout can cause chaos. There’s little worse than backing out a change that caused an outage only to find that you introduced a new problem in its place.
Backout scripts must be CLEARLY identified as backout, and differentiated from the actual change configuration. You don’t want a tired Ops engineer to run your commands, then plough on blindly and back them out again without thinking 🙂
So we’ve covered the basics – we got the script syntactically correct, and we added a validated backout. If this is as far as you read and you do this for every change script, you’re probably in the upper quartiles now. But wait, there’s more! Why stop at syntax?
6. Change Simulation
For the record, I do not subscribe to the school of thought that says “if we simulate every change in the lab, we’ll never have an outage.” This only works well if you have a tiny network and your lab 100% matches production down to the cable, with production-mirrored syntax and appropriate route injections to simulate external peers. Even then, sod’s law says that the change you make relating to the external peers that you didn’t fully replicate will be the one that takes down your network.
Back in the real world however, most networks are too big to effectively simulate. That said, I’m a big fan of trying to simulate what you can. I use GNS3 regularly when writing complex routing changes to ensure as best I can that the change behaves the way I expect it to, and that the downstream impact is as expected. I might only be modeling 4-6 routers rather than the 20 that are impacted, but I do try my best to at least minimize the likelihood that I’m just dead wrong about something.
TIP #5: Simulate whatever seems sensible in order to validate your change.
You can give yourself permission to skip this step for access port turn-ups, ok? If you’re really good, that is.
7. Semantic Validation
Or: Are You Actually Doing The Right Thing?
As I mentioned earlier, while correct syntax is essential, it’s no good if the changes you make aren’t actually going to have the end result that you had anticipated. Of course, if you knew it wouldn’t have the right result. you would have already done it a different way. From this you can conclude that the worst person to sanity check a script is you yourself.
You should of course check your own work for syntax, but when it comes to getting a grip on whether your change will work properly, you need somebody else to review your change and assess whether or not it’s the right approach, and to evaluate whether or not you got it right.
TIP #6: Always get a peer (or better) review. In my opinion, this should be mandatory for ALL changes, no matter how small.
Any review is better than no review, but obviously the better qualified the reviewer, and the better they know the environment in which the change is being made, the more valuable and reliable the review is likely to be. Most people are pretty amazed (and embarrassed) at how many errors a good peer review can reveal. The reviewer should be logging into the devices and validating the existing configuration and checking that your assumptions are correct. They should use their experience on the network to assess how your change will affect other devices in the network, which leads neatly into:
8. Downstream Impact
Often forgotten when writing a change on a device is what the impact is of the change on other devices – i.e. those downstream of your change. For example, let’s assume that you need to change the OSPF reference bandwidth on a router because it’s making a suboptimal decision about which interface to use for a particular destination. The change in itself may work locally on the router, but how will it impact other routers in the same OSPF area? Will their view of the network now change for the worse? Have you thought it through?
Another example may be that somebody looks at BGP and sees that for some reason, three BGP NLRI are being generated on a particular device:
The engineer decides that this is wasteful, and removes the two /17 networks from BGP. Locally this may not make much difference. Downstream however, traffic on the two WAN (links which were using the /17 routes to distribute traffic) is now all going over a single link and the link is almost at saturation point. Nothing bad will show up on the device where the change is made – it’s all downstream.
TIP #7: Think about the effect on downstream devices!
Between the engineer authoring the script and the reviewer, attention to the knock on effect of a change is critical. If you have a configuration archive and you’re playing with routing, why not search the archive for instances of the routes you’re manipulating? What you find may save your bacon.
9. Current Configuration
When you’re writing a change, you typically look at the current configuration and then create your script based on that. What happens if somebody changes the configuration between now and then? When the outage occurs, you’re left saying “Uh, but when I wrote the script, I’m sure that command wasn’t there. I checked!” Can you prove it?
TIP #8: Include dated configuration snippets in your change script.
Grab the existing configuration and paste the relevant parts into your script (and note the date on which you captured the configuration) to back up your change. This has a few benefits:
- If somebody changes the config, you have proof that you weren’t imagining things when you wrote your script;
- You can validate the router configuration against the snippets to ensure that there has been no change, before the script is executed.
- You provide the reviewer of your script with contextual information that will help them do their job better, and also save them time because they don’t have to log into the router and look for the configuration themselves;
- It provides a handy historical mini-snapshot of how things were on that date. When you look back, it will help you understand why you made the decisions you made in terms of the configuration. Remember I mentioned how this would help you document the network? Stick with it, there’s more in this vein.
I should note that when you paste in configuration from the router, you really need to make sure that it is clearly identified as current configuration, and not as configuration to be entered into the router again.
If you’re using notepad the easiest way to prevent errors is to precede every non-executable line with a ! so that even if config is copied and pasted by mistake, only the lines without a ! are executed; the rest are ignored as comments. I use this trick, for example, to allow me to confirm interface descriptions and explicitly call out existing config when I’m making a change, e.g.:
!Change interface IP facing NYC-RTR02A:
! description L3 link to NYC-RTR02A
! ip address 192.168.250.3 255.255.255.0
ip address 192.168.250.1 255.255.255.0
I actually also use colors and other formatting to make the executable change more obvious compared to the configuration I grabbed from the router itself, and I haven’t replicated that entirely here. However, what you can see is that the existing IP is .3, we see the interface description copied from the router, confirming that this is the correct interface (we hope!), then we see the actual change. This is quite effective, I promise! You’ll note that I used the ! to make a comment about the change as well; here’s another example:
! Change static route to point to 22.214.171.124 instead of 126.96.36.199
! Existing config (2011-12-07):
! ip route 188.8.131.52 255.255.255.255 184.108.40.206
! Add new route
ip route 220.127.116.11 255.255.255.255 18.104.22.168 name Force_DBHost_via_R01
! Remove old route
no ip route 22.214.171.124 255.255.255.255 126.96.36.199
In theory you could cut and paste that whole lot into a router and only the desired changes would be made. It’s more typing than if you just put in the two commands – but then, isn’t this a lot clearer? From this excerpt we still don’t know why we’re changing the next hop, nor what those next hops are – that’s something that should also be in the change script, so let’s talk about:
10. Problem and Purpose
I believe that every change script to be executed in a data center should begin with a problem statement, even if the script is just for a port turn up. For a basic change like the port turn up, it doesn’t need to be complex – the purpose could be to “enable the ports for a server <y> to support the <x> service web interface”, or something like that. For a more complex change, the problem statement should explain the current status/architecture, what needs to be done, how you intend to achieve the change, and what the end status should be.
Tip #9: Always include a section that defines the script’s purpose.
The explanation you write may end up being one line or ten pages. Again, I can sense eyebrows being raised because that’s a pain to do. Yes, it can be – but doing it, as I see it, has six major benefits:
- Writing it out forces you to think through what you’re planning to do. Often, the very act of spelling out the problem and your solution along with some diagrams to demonstrate the change is enough to make you realize why something isn’t going to work the way you thought, or perhaps to alert you to a downstream issue that you hadn’t previously thought of.
- By documenting the purpose, you make the reviewer’s job easier and make the review more accurate as well, because they now understand the semantics of what you are attempting. Without a true understanding of purpose, the reviewer is just checking syntax, which is really only half the job.
- By writing out what you’re doing, you are creating an archive of network change documentation that you can store and search at a later date. When you wonder why something was implemented in a particular way, you can go back to this change script, read the explanation and understand what problem was being solved, and see why the particular solution was chosen.
- You give the group that actually executes the script (if it’s not you) a clear understanding of what the script does, which means that they are (a) more likely to approve it in CAB, and (b) more likely to detect any problems that occur, based on the fact that they have comprehension of the task at hand. This is often undervalued, but you get far fewer calls in the middle of the night if you empower the Operations engineer by giving them information.
- You provide a model for future changes of the same type, and a way to on-board new team members. This whole ‘learning by osmosis’ thing is way overrated I think. What better way for the new or junior team member to learn than to read recent change scripts and see how things are done, and understand how the network functions?
- Your documentation will make troubleshooting easier if any problems arise in the hours and days following your change. Operations can read what was done, troubleshoot it for you, and determine if the change needs to be backed out – all without calling you.
This is a big step up from just typing up a list of commands, and I believe – along with backout scripts – is one of the major things missing from many organization’s change requirements. When you look at the six benefits I listed above, doesn’t it make sense to do it? The level of understanding you can impart even in a couple of paragraphs ultimately reduces outages because it empowers everybody to do their jobs better and understand the network better.
11.Project or Service
What project or service is your change for? Just a quick mention of it will make the script searchable later!
TIP #10: State what project or service your change supports.
12. Don’t Include ‘conf t’
I see this a lot – scripts that include ‘conf t’ at the top. My advice is: don’t do that. If you must, put a comment saying “enter configuration mode”. After all, if either you or your Operations engineer doesn’t know to enter configuration mode before entering configuration commands, you have bigger problems on the horizon.
Critically though, I’ve seen it happen multiple times that an engineer is working on a configuration script, logging in to routers to validate information, and copying and pasting chunks of their script around – and one misplaced right click in a PuTTY window (or any other terminal client that supports right-click paste) dumps the paste buffer into a production router. Avoiding ‘conf t’ means it’s never in your paste buffer, and your accidental commands will be rejected and you should still have a job at the end of the day.
TIP #11: Don’t put ‘conf t’ in your scripts
13. Test, Test, Test!
It’s amazing how many changes I have seen being made without even basic testing steps being undertaken. If you add a BGP neighbor, why not check if it came up? Is it receiving routes? If you enable an interface, why not check it came up?
TIP #12: Test before and after, compare results, specify what the result should be.
Testing takes many forms of course – some can be done on the routers, and some will be service-based and require the platform owners to participate. Testing should be specified in your change script, not in your head. If you’re making routing changes, it’s probably also worth having a basic set of standard tests to ensure that everything else still works; it’s easy to test only the service you’re making the change for, and forget the routing affects downstream devices and you might have just killed another service. Use your NOC or network management team if you have one, and get them to check alarms and service metrics for any anomalies.
The tip instructs you to specify what the result should be – this is really important. There’s no point saying:
sh ip route 188.8.131.52
Is 184.108.40.206 supposed to respond to that ping? What am I expecting to see in the traceroute? What are you expecting to see in the output show ip route command? What are the SUCCESS criteria for these tests? If you don’t specify what counts as success, the testing is worthless. If possible, simulate actual command output – late at night (when many changes occur) it really helps to compare like for like.
14. Baseline Before/After
Protect yourself by baselining the devices you are changing – grab a copy of the running config. Check device cpu is normal. Check the logging buffer (or syslog servers) to see if the device has a problem before you start the change. This helps eliminate false errors after the change. I’ll wager that most readers have made a change then had somebody say “Oh my goodness, the even log is going crazy, back out, back out!” And then you realize that the event log was always crazy… Repeat the same commands before and after your change, and log the output. Always log the output… 🙂
TIP #13: Baseline your devices before and after your change.
15. Document Information
When you create a change script, always include a title, your name and contact details, the date, the date config snippets were gathered, and so on. You may also want to include details like whether the commands were lab-tested, and who reviewed/approved the change script.
TIP #14: Include basic info in your scripts, including author details.
This info takes very little time to add, but again will pay you back in the future when you need to find a script, or find out who made a particular change.
16. Document Format
It is probably evident from the tips above that Notepad ain’t gonna cut it. I don’t care what your favorite tool is for knocking up configurations – we all have our preferences I’m sure – but ultimately, I’ve found that delivering documentation in a format that’s visually attractive and allows you to format the text as well as to include diagrams is a big step forward. To that end, I generally suggest using something like Microsoft Word for change scripts (but for goodness sake, please format configuration as Courier New or similar – it’s what the eye expects). I have previously created templates with pre-defined styles for configuration, headers, comments, config snippets, and so forth so that formatting text is simple.
TIP #15: Deliver change scripts in Word format
If you use document management systems like SharePoint, you also have the ability to edit and version-control Word documents in an integrated fashion, and SharePoint indexes Word document contents as well, so you can search your archive. In fact, I’ve previously used SharePoint for the change review / approval process as well, providing both a record of the approval process and a searchable archive of all changes made since the site was set up.
17. Use Change Script Templates
Now you’re started using Word, create a template with standard styles for each type of information, and a framework for all the information that’s expected in the script. Now mandate that the whole team use it. It’s actually quite refreshing to have every script look and feel the same way, and makes reviewing, reading and executing scripts much easier because everybody knows where to look for information, they know what the changes look like that should be pasted into a router.
TIP #16: Use a change script template
18. Date and Version Control
Every script – and its filename – should have a version number and date. I’ve seen way too often confusion on a change window when it’s realized (usually half way through) that people are working from a different cut of the change script.
TIP #17: Add Date and Version to all scripts, and keep it updated.
Putting It All Together
It’s not as bad as the previous 4,500 words might make it seem. Here are the tips all together in a handy list:
- Write your changes down, no matter how tedious it seems.
- Validate your syntax.
- Don’t Copy and Paste.
- Document – and validate – your backout script.
- Simulate whatever seems sensible in order to validate your change.
- Always get a peer (or better) review.
- Think about the effect on downstream devices!
- Include dated configuration snippets in your change script.
- Always include a section that defines the script’s purpose.
- State what project or service your change supports.
- Don’t put ‘conf t’ in your scripts
- Test before and after, compare results, specify what the result should be.
- Baseline your devices before and after your change.
- Include basic info in your scripts, including author details.
- Deliver change scripts in Word format
- Use a change script template
- Add Date and Version to all scripts, and keep it updated.
If you create a good change script template, much of the work is done for you. The template I usually use contains all the styles necessary to format the document consistently, then has the following sections, each starting on their own page. Sections can be removed if not needed, or just marked as not required:
- Document Info page, containing:
- Script version and date
- Author (and contact info)
- Work request number (the request for me to do the work)
- Purpose (1 line summary)
- Priority / Criticality
- Project Manager
- Impact of delaying execution
- Date of Configuration Snapshots
- Table of Contents
- Is the change service impacting?
- Has it been lab tested?
- Is other support required?
- Is testing required outside what’s in the script?
- Is the change on production devices?
- Dependencies & Pre-Requisites
- Document any pre-change requirements, e.g. does another change have to be executed before this one will work?
- Detailed Background / Purpose / Requirements, if required
- Pre-Change Testing
- Includes standard set of test commands for IOS, NXOS, ScreenOS, Junos OS
- KPI and Alarm/Event Log Checks for NOC or NMS
- A boilerplate guide to the document styles that are used
- Cable matrix (if applicable)
- IP matrix (if applicable)
- Device configuration
- Each device to be configured starts on a new page with the device name as the title for the section (it also auto-populates into the header so that when there are multiple pages of changes for a device, it’s easy to reconfirm what device you are in the middle of configuring;
- Testing commands and output for each specific device is included inline as required;
- Config snippets for each device are included either at the start of a section, or individual lines are included on the way through
- Post-Change Testing – tests to be carried out after all changes are completed
- Backout scripts (usually in reverse order to the implementation scripts above)
- Post-Backout Testing
If I didn’t feel the eyebrows raising before, I know they’re up right now. It looks huge doesn’t it? But here’s the thing – most of this information is already in the template. I just fill in some gaps. Sections that are likely to be optional are, by default, marked as “NOT REQUIRED” or similar in the template – so if I don’t need them, I have no change to make. All I really have to do is to explain what the change is, write the change, and go from there. Every script looks the same; it’s clear, consistently formatted, and there’s enough information in there to sink a bus. And for archival purposes, it’s awesome – the information in a change script of this type is like gold when you need to look back and understand a change.
There’s more I could say, but I have to stop somewhere and this is a good grounding at least. I hope this has given you something to think about; or maybe you have something you can teach me too.
My name is John Herbert, and I approve this message. Happy scripting!
I’d love to hear how you do your change scripts. What did I miss? Is there anything here you’re going to start doing from now on? Do you agree, or is this just too much for you? Throw in your opinions below!