Get A Lab And Be Error Free!

Today I’m going to discuss a misunderstanding I’ve encountered in many companies that are striving to reduce outages caused by network changes that go wrong. In a nutshell, here is the fallacy in question:

If we get a lab, our changes will never cause an outage again!

I put it to you that anybody who believes that having a lab means no more errors is out of their everlovin’ mind. Still, labs sure don’t hurt, and we’ll see why.

A Common Fallacy?

I’ve heard the idea that labs eliminate errors expressed repeatedly in the past, in a variety of industries and sizes of company. The myth seems to persist that a lab will magically fix what’s wrong with the current change process. That’s probably for two reasons:

  1. If you don’t ever test your changes before you execute them, there’s a good chance they won’t do what you expect.
  2. When there’s an outage caused by a change, Engineering teams without labs may lean on the excuse that “we don’t have any way to test our changes, so we do the best we can. You can’t blame us when there’s a mistake.”

I agree with #1 and I’d argue that some level of change simulation is essential. Writing network changes that have not been validated in some way is, in the long run, a recipe for disaster. Why? Because, bless our hearts, most of us cannot predict all the knock-on effects of our changes. If we could, we’d never have outages, we’d never have to fix things on the fly, and we’d never have to roll back. And that’s assuming that you got the basics right, like having the correct syntax.

Excuse/reason #2 on the other hand has a grain of truth to it but becomes dangerous when extrapolated, because the inverse of #2 is “If we had a way to test our changes, this wouldn’t have happened.” And thus the die is cast in the eye of management. After repeated outages, they hear the “no way to test” excuse, and eventually one of them tumbles to the idea that if they spend some money on a lab, it will fix things.

Why Do Labs Not Work?

A lab can, in theory, allow you to accurately simulate your production environment. However, unless your network is really small, or your company has a really large amount of spare cash floating about, the likelihood that you will have the entire network mirrored in the lab is going to be close to zero. And therein lies the rub. If you are not able to 100% mirror the production environment, then anything you do in the lab is just an approximation of reality. Approximations aren’t 100% accurate, which means your lab results aren’t 100% accurate either.

The reality about lab testing then lies somewhere in the middle. A lab can help you reduce errors, but generally speaking cannot eliminate them.

Once you agree that some level of non-production testing is critical to minimizing the chance of mistakes, you have to decide how to go about simulating your network. You want the simulation to be as close as possible to reality, and the way I see it, there are two main paths to take – a hardware lab made out of real equipment, or a software “lab” like GNS3 or IOU.

Real Labs (Physical Hardware)

Hardware labs are great if you have the money, space and power. You’ll be able to match the real hardware used in production. You should be able to match IOS levels, and the configurations should mirror what’s in production at the time you run the simulation. That’s a pain because to make it easy to sync production configs over to the lab, you really need to match hardware down to the line card.

It’s also important with a shared lab resource to make sure that the lab time is scheduled so that you never have more than one person trying to make changes in the lab environment at a time. When a lab session is finished, configs should be reset to match production again, with none of the tested changes left in place to disrupt the next lab user. In fact the cost of running a large lab goes far beyond the cost of the hardware, and even beyond the cost of the facilities to host it. Who will be responsible for keeping the lab in sync with production, both topologically and in terms of code revisions? Very quickly you can end up needing a “Lab Ops” group who keep the lab in a usable state.
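As a rough illustration of that reset-to-production check, here is a minimal Python sketch that reports how a lab device's config has drifted from its production counterpart. It assumes you already have both configs as plain text (how you collect them is up to you), and the function name `config_drift` is my own invention rather than part of any tool.

```python
import difflib


def config_drift(production: str, lab: str) -> list:
    """List the lines that differ between a production config and a lab config.

    Lines prefixed "- " exist only in production; lines prefixed "+ " exist
    only in the lab. Blank lines and trailing whitespace are ignored, but
    leading indentation is kept because it is meaningful in IOS-style configs.
    """
    prod_lines = [line.rstrip() for line in production.splitlines() if line.strip()]
    lab_lines = [line.rstrip() for line in lab.splitlines() if line.strip()]
    # ndiff marks unchanged lines with two spaces and hint lines with "? ",
    # so keeping only the "- " and "+ " entries leaves just the drift.
    return [d for d in difflib.ndiff(prod_lines, lab_lines) if d.startswith(("- ", "+ "))]


if __name__ == "__main__":
    # Hypothetical sample configs, for illustration only.
    production_cfg = "hostname R1\ninterface Loopback0\n ip address 10.0.0.1 255.255.255.255\n"
    lab_cfg = production_cfg + "router ospf 1\n"
    for line in config_drift(production_cfg, lab_cfg):
        print(line)
```

An empty result suggests the lab device is back in step with production. Anything else is a leftover from somebody's test, or a production change that never made it to the lab.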

Personally I don’t use hardware labs much any more other than to test out command syntax on matching hardware/IOS combinations, because for that, you can’t beat the real thing.

Soft Labs (Simulations)

I’m a huge fan of lab simulation. I mostly work on Cisco equipment, so I’m going to be rude and talk about GNS3 and IOU again, but let’s take it as read that, for example, if you want to test Junos OS you can run an Olive or two. Some things just don’t simulate well – service modules, for example, aren’t really catered for – but for general routing, both GNS3 and IOU are fabulous. Unless you’re really lucky, you’ll be unlikely to match production hardware with a software simulation though – there will be a lot of compromise required in that respect.

To demonstrate the value of a software approach, I will share that my last network change simulation involved 22 routers and a switch. That sounds like a lot, but in reality I actually made a lot of compromises and assumptions to reduce it even that far. Few physical labs I’ve had access to have that much equipment available on tap for you to hook together. Even a Cisco CPOC (Customer Proof of Concept) lab would likely raise an eyebrow if you asked for that much equipment. Using IOU I would be able to simulate that entire topology in a single virtual machine on my laptop. If I had IOU, that is. Which I don’t. It wouldn’t be quick to set up this topology in IOU though; the configuration files for IOU can be a little tedious to put together – much worse than dynagen configuration files (and why is GNS3 so popular as a front end to dynamips? Because you don’t have to fiddle with the topology config files!).

The change I simulated prior to that only required 10 routers, and GNS3 coped with that reasonably well all things considered. I tend to simulate routing in GNS3 with 3640s whenever I can, simply because the CPU is old and slow which means dynamips has to do less CPU emulation work. My laptop has 4GB of RAM and an Intel Core2 Duo P8700 @ 2.53GHz – not a powerhouse by today’s standards, but it worked well enough to let me find the problems I might face!

Caution Required

I’ve mentioned that running software simulations means making compromises, and that applies in particular to the hardware and software being simulated. Even after simulating a change in a “software lab”, it’s still worth using real hardware (and the correct IOS version and feature set) to validate that your syntax is supported. Sometimes you’ll want to test a feature – perhaps something very new – that’s not supported in your simulated environment, and that’s just hard luck. For most routing changes though, I’ve been able to get close enough to reality using simulations that I’ve managed to discover problems (often fairly complex routing interactions), and as a result I’ve been able to mitigate them while still in the lab environment. In other words, yes, the lab successfully reduced the chances of an outage being caused by my change.

Should I Campaign For A Lab?

Yes – I would always recommend you have something mirroring your production environment on which you can test syntax. If for example you deploy 6500 switches everywhere, having at least one in the lab – or preferably two – could turn out to be really handy. While I’m not a huge fan of physical labs any more, they do have a use, as there’s nothing so great as showing up for a change and discovering that the production hardware doesn’t support the command option you were relying on. It seems expensive to buy the equipment, but remember that Cisco may offer you special lab pricing on hardware that won’t be used in production, and you could consider refurbished equipment for a lab, even if you’d never use it in production.

Beyond that, simulate, simulate, simulate. Doubly so for routing changes with multiple protocols and redistribution involved. I’ll look in a future post at some of the compromises required to do that, but the value of knowing how a change will affect the routing tables is hard to overstate.
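To make that a little more concrete, one way to compare routing tables captured before and after a simulated change is sketched below in Python. It assumes each table has already been parsed into a simple prefix-to-next-hop mapping. The parsing itself, the function name `route_changes`, and the sample prefixes are all hypothetical.

```python
from __future__ import annotations


def route_changes(before: dict[str, str], after: dict[str, str]) -> dict[str, list]:
    """Summarise how a routing table changed between two snapshots.

    Each snapshot maps a prefix (e.g. "10.1.0.0/16") to its next hop.
    """
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    # Prefixes present in both snapshots whose next hop is different.
    moved = sorted(
        (prefix, before[prefix], after[prefix])
        for prefix in set(before) & set(after)
        if before[prefix] != after[prefix]
    )
    return {"added": added, "removed": removed, "moved": moved}


if __name__ == "__main__":
    # Hypothetical snapshots taken in the simulation, not real data.
    pre = {"10.1.0.0/16": "192.0.2.1", "10.2.0.0/16": "192.0.2.1", "0.0.0.0/0": "192.0.2.254"}
    post = {"10.1.0.0/16": "192.0.2.2", "10.3.0.0/16": "192.0.2.1", "0.0.0.0/0": "192.0.2.254"}
    for kind, entries in route_changes(pre, post).items():
        print(kind, entries)
```

A prefix showing up under "removed" or "moved" that you did not expect is exactly the kind of knock-on effect you want to find in the simulation rather than in the change window.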

Whichever approach you take, DO NOT promise to eliminate outages. If anybody offers to fund a lab in return for eliminating outages, just say no. It sounds like a really good way to get lab funding, but it’ll come back to bite you really quickly.

Labs and simulations help you reduce outages. Reduction is a good goal, right?

4 Comments on Get A Lab And Be Error Free!

  1. I have a small client who tests everything in a VRF in production equipment (the VRF mirrors production VRFs). It works for them for the most part and honestly rarely bites them unless they hit some weird bug. Another option for the cash strapped out there.

    • I’ve heard of people doing that, and as I read your comment I was desperately hoping that the abbreviation I was about to see was going to be VDC not VRF. I’ve definitely seen servers tested in VRFs (e.g. a DEV environment overlaid on the main production network) but in those instances the router configurations were pretty static.

      However, to have people issuing configuration commands on a production router for test purposes scares me a little. Do they make changes to the test VRF configurations during the day? It worries me that one command missing the “vrf XXX” option could theoretically bring down the network. One mis-typed interface name and your production servers are down. In other words yes, while I can see how that would work, I’m not sure I could ever consider doing that as I would personally evaluate the risk of accidentally affecting production as way too high for my comfort. That’s not to say it can’t work – but count me out! 🙂

  2. More often than not, lab equipment actually does end up in production. Classic scenario: something broke, and while we wait for the spare part, let’s use this machine so there’s no downtime for days or even weeks. Then it never gets pulled out of the rack because, you know, once it’s there its destination is sealed… :-)

    • Well, there is that… 🙂 I guess it’s down to support contracts and company policies to prevent that, but I take your point!
