Today’s post is a quick tip on configuring a route-map in IOS, or more specifically, how not to configure it.
Like almost everybody else, I am capable of making stupid mistakes. I don’t mean the ones where an extremely complex change has something unexpected go wrong, but rather I’m thinking about the kinds of things where you slap your forehead because the mistake was just so avoidable. I do my best to catch potential bonehead errors in review and testing, but once in a while something slips by and disaster ensues (usually without hilarious consequences). Laugh quietly, please.
This one is really simple. When you’re matching on an ACL or prefix-list in a route-map, an additional space in the middle of the ACL/PL name will screw things up royally. So will a typo.
To explain what I mean, here’s a route-map intended to only allow the Atlanta data center’s routes to be advertised via BGP:
ip prefix-list PL-ATLANTADATACENTER permit 10.1.0.0/16 le 24
route-map RM-BGP-FILTER-OUT permit 10
 description Advertise ATL data center ranges only
 match ip address prefix-list PL-ATLANTA DATACENTER
route-map RM-BGP-FILTER-OUT deny 999
 description Explicit deny, just for clarity
router bgp XXXXX
 neighbor a.b.c.d route-map RM-BGP-FILTER-OUT out
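For anyone less familiar with prefix-list syntax, here is a rough Python sketch of what that single prefix-list entry is meant to permit: anything inside 10.1.0.0/16 with a mask length from /16 through /24. The function name intended_match and the two constants are mine, purely for illustration. This models the intent of the one entry above, not how IOS evaluates prefix-lists internally.

```python
import ipaddress

# Models the single entry "permit 10.1.0.0/16 le 24" from the example config.
BASE = ipaddress.ip_network("10.1.0.0/16")
MAX_LEN = 24  # the "le 24" part of the entry


def intended_match(prefix):
    """Return True if `prefix` is something the entry is meant to permit."""
    net = ipaddress.ip_network(prefix)
    return (
        net.version == BASE.version                      # same address family
        and BASE.prefixlen <= net.prefixlen <= MAX_LEN   # mask between /16 and /24
        and net.subnet_of(BASE)                          # falls inside 10.1.0.0/16
    )


if __name__ == "__main__":
    for candidate in ["10.1.0.0/16", "10.1.20.0/24", "10.1.20.0/25", "10.2.0.0/16"]:
        print(candidate, intended_match(candidate))
```

The last two candidates are the interesting ones: a /25 is too specific, and 10.2.0.0/16 belongs to some other range entirely, so neither should ever leave through this route-map.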
In my case, after the script had been written and lab tested, I had been asked to change the name of the prefix-list from “PL-ATLDATACENTER” to “PL-ATLANTADATACENTER”. Since it only appeared a couple of times, I edited the text directly. I don’t know whether it was Word being smart and adding a space after I pasted “ANTA” in the match statement (I’ve had that problem before) or if I stupidly hit the space bar myself, but either way, the end result was a match line in the route-map with a space where I didn’t want it. The script was peer reviewed before execution, and the space wasn’t noticed (not really a surprise).
So what’s the problem?
match ip address prefix-list PL-ATLANTA DATACENTER
- There’s no prefix-list called PL-ATLANTA
- There’s no prefix-list called DATACENTER
- This is valid syntax – IOS interprets this line as you wanting to match against more than one prefix-list
- IOS doesn’t warn you that these prefix-lists do not exist
One unexpected space in the configuration and, without any errors being raised, my route-map is matching against vapors.
What happens when you match against a non-existent prefix-list? IOS matches all routes. So instead of allowing just the ATL data center routes out, this route-map permitted everything to be sent.
Needless to say, this was not a good thing, but because post-change testing of systems in the ATL data center was successful, the issue was not identified immediately. Later in the morning, I got a call to help troubleshoot issues being seen in some other data centers. Based on the symptoms they described, I applied a bit of Psychic Troubleshooting (similar to Raymond Chen’s Psychic Debugging), which was enough for me to guess where to check, and I fixed the problem within 10 minutes of joining the outage call.
Here is what I learned from it:
- Double check the match statements in your script. Then triple check them.
- Make sure you don’t have a typo in the prefix-list or ACL name. When writing the scripts, cut and paste the prefix-list name into the route-map so you can’t get it wrong.
- If you’re using BGP soft reconfiguration (it’s handy if you can spare the memory), make sure to do a pre-change and post-change check on the number of routes being sent/received through your route-maps, and compare it with the number you expected to be there. That’s a really simple test that I have skipped in the past, but once I started doing it in every BGP change, it caught multiple problems with prefix-lists, ACLs and route-maps.
- Testing in local data centers is great, but you need to consider whether your change might affect another data center too. In the example above, the routes that slipped out through the route-map included routes for other data centers, and some traffic intended for those DCs started coming to Atlanta instead. We didn’t test services in those other data centers after my change, and had we done so we would have spotted the problem faster.
- Check KPIs across the board before and after your change. For this change, the moment the route-map was in place, firewalls in the Atlanta data center started showing a spike in the number of drops (they were seeing traffic not intended for this data center). Systems in other data centers started showing failures. Getting the NOC to perform a KPI check on critical systems across all data centers could have identified that there was a problem.
- Secondary KPI advice: train your NOC not to assume that spikes in KPIs during regular maintenance window hours are just “something that happens” during maintenance windows, and thus something they can ignore. Once a change is complete and you check in with them, they need to check that things have normalized again.
- If you change your script post-testing, you might consider re-testing the updated configuration. Script changes may invalidate your lab testing!
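The first two checks in that list are easy to automate. Below is a minimal Python sketch, assuming the change script is available as plain text, that lists any prefix-list name referenced in a "match ip address prefix-list" line but never defined by an "ip prefix-list" line in the same text. The function name undefined_prefix_lists is hypothetical, and the regexes only cover the two statement forms used in this post. A prefix-list that already exists on the router but is absent from the script would show up as a false positive.

```python
import re

CONFIG = """\
ip prefix-list PL-ATLANTADATACENTER permit 10.1.0.0/16 le 24
route-map RM-BGP-FILTER-OUT permit 10
 description Advertise ATL data center ranges only
 match ip address prefix-list PL-ATLANTA DATACENTER
route-map RM-BGP-FILTER-OUT deny 999
 description Explicit deny, just for clarity
"""


def undefined_prefix_lists(config):
    """Map each prefix-list match line to the names it references
    that are never defined in the same config text."""
    defined = set()
    references = []  # (match line, [referenced names])
    for raw in config.splitlines():
        line = raw.strip()
        m = re.match(r"ip prefix-list (\S+)\s", line)
        if m:
            defined.add(m.group(1))
            continue
        m = re.match(r"match ip address prefix-list\s+(.+)", line)
        if m:
            # Every whitespace-separated token is treated as a separate name.
            references.append((line, m.group(1).split()))

    missing = {}
    for line, names in references:
        absent = [name for name in names if name not in defined]
        if absent:
            missing[line] = absent
    return missing


if __name__ == "__main__":
    for line, names in undefined_prefix_lists(CONFIG).items():
        print("Undefined prefix-list(s) %s in: %s" % (", ".join(names), line))
```

Run against the broken example from this post, it flags both PL-ATLANTA and DATACENTER on the match line. With the stray space removed, it reports nothing.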
So now you know.