DCNM ISSU: Disaster or Triumph? Let’s Find Out.

If you’ve recovered from the shock of hearing me say something positive about Cisco’s Data Center Network Manager (DCNM) product, then you’ll want to hold on tightly to your underbritches as I tell you that I just used DCNM ISSU to perform disruptive software upgrades on eight of the switches in my Ethernet fabric, and–spoiler alert–it was actually a fairly pleasant experience!

DCNM ISSU

DCNM offers management of ISSU (In-Service Software Upgrade). ISSU usually implies some kind of hitless (to the revenue ports) upgrade. Historically this was made possible on the Nexus 7000, for example, by having dual supervisor modules and using Non-Stop Forwarding (NSF) to keep the forwarding plane intact while the supervisors fail over. With the single-CPU layer-2 Nexus 5600 switches, however, the data plane can be told to continue forwarding frames while the control plane reboots with new code, allowing an upgrade to take place without interruption.

Disruptive Upgrades

Unfortunately, it’s not always possible to perform a non-disruptive upgrade. The code version I was installing included a fix to the linecard BIOS, so the linecards had to be reloaded as well as the main CPU. In other words, the switch had to be rebooted after the upgrade, and there was no way around it.

So what happens when I ask DCNM to upgrade more than one switch, and how does it handle a disruptive upgrade? To find out, I requested an ISSU for all four spine switches on my Ethernet fabric.

Upgrading Fabric Spines

DCNM - Select ISSU

The steps to perform an upgrade are pretty straightforward. After selecting the devices to upgrade, the code versions need to be selected:

DCNM ISSU - Selecting an Image

Thankfully, despite appearances, it’s only necessary to select each software image once per platform type. However, picking the image from the standard repository is an interesting exercise in finding your file despite random file ordering:

DCNM - Select Image

Can you identify the sort order chosen for these files? I can’t. It would also be helpful if the file list were limited to files ending in .bin and other valid extensions; after all, what else would I be selecting? It also occurs to me that having to select the kickstart and system images separately may lead to avoidable mistakes, and it would be possible (preferable, indeed) to select one image and have the other selected automatically. I acknowledge that it’s possible to rename images from the official Cisco naming convention, but when the files do match the naming convention, this would be a really nice touch. The blurred folders are serial numbers for devices in my fabric; they are used as part of the boot process to hold files relevant to a particular chassis.
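To illustrate the kind of filtering and pairing I have in mind, here is a minimal Python sketch. It is not how DCNM works internally. It assumes images follow the pattern <platform>-uk9-kickstart.<version>.bin and <platform>-uk9.<version>.bin, and the filenames in the usage example are made up for illustration.

```python
import re

# Assumed naming convention (an illustration, not a DCNM rule):
#   kickstart: <platform>-uk9-kickstart.<version>.bin
#   system:    <platform>-uk9.<version>.bin
IMAGE_RE = re.compile(
    r"^(?P<platform>[a-z0-9]+)-uk9(?P<kick>-kickstart)?\.(?P<version>.+)\.bin$"
)


def version_key(version):
    """Turn '7.1.4.N1.1' into a tuple that sorts numerically, not as text."""
    key = []
    for token in re.findall(r"\d+|[A-Za-z]+", version):
        if token.isdigit():
            key.append((0, int(token), ""))
        else:
            key.append((1, 0, token))
    return tuple(key)


def pair_images(filenames):
    """Return (platform, version, kickstart, system) tuples, sorted by version.

    Files that do not match the assumed convention are ignored, and so are
    versions for which only one of the two images is present.
    """
    found = {}
    for name in filenames:
        m = IMAGE_RE.match(name)
        if not m:
            continue
        slot = "kickstart" if m["kick"] else "system"
        found.setdefault((m["platform"], m["version"]), {})[slot] = name
    pairs = [
        (platform, version, imgs["kickstart"], imgs["system"])
        for (platform, version), imgs in found.items()
        if "kickstart" in imgs and "system" in imgs
    ]
    pairs.sort(key=lambda p: (p[0], version_key(p[1])))
    return pairs


if __name__ == "__main__":
    listing = [
        "n6000-uk9.7.1.4.N1.1.bin",
        "readme.txt",
        "n6000-uk9-kickstart.7.0.8.N1.1.bin",
        "n6000-uk9-kickstart.7.1.4.N1.1.bin",
        "n6000-uk9.7.0.8.N1.1.bin",
    ]
    for pair in pair_images(listing):
        print(pair)
```

Picking either image would then be enough to find its partner, and anything that has been renamed away from the convention simply falls back to manual selection.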

Checking Compatibility

Having selected software images, the next step in the process is to validate the images on the switches themselves. This is a time-consuming process, and DCNM offers the option to finish the installation later:

DCNM - Finish ISSU Later

Sounds good, right? Unfortunately what this really means is that the validation will complete, but the software upgrade will not happen. The ISSU job can be opened to confirm the completion status, but I couldn’t find an option to indicate to DCNM that now is a good time to complete the request and install the software. Instead, I had to go back through the same process as before to create a new installation job, but this time select the option to skip validation. This seems a little bit odd to me, so perhaps I’m just missing something obvious!

Compatibility Check Failed!

The error message received when a check fails is simply the output from the switch CLI when requested to validate the software images, so its helpfulness may vary depending on the issue. The first time I tried the ISSU, the compatibility check correctly identified that my kickstart image was fine, but my system image was corrupted somehow. I’m still a little bit confused as to how it got corrupted, as I ran checksum verification after transfer, but when I looked again, the system image was much smaller than it should have been. After uploading the system image again, the file size was where it should be, and the subsequent checks worked.

Checking the image compatibility is slow and boring (even if I don’t have to do anything), but there’s value to doing it!

Validating Files Using Checksums

If you aren’t sure how to validate a file checksum (usually MD5 or SHA-1), I posted a brief guide to file checksum calculation recently. The DCNM server runs a Linux variant, so checksums can be generated in the shell using the commands listed in that post (md5sum, sha1sum and openssl).
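If you would rather script the check, the same digests can be computed with Python’s standard hashlib module. This is a minimal sketch; the function names are mine, and the expected value has to come from wherever your vendor publishes it.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in chunks so a large image never has to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex, algorithm="md5"):
    """Compare the file's digest with a published value, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Calling matches_published(image_path, published_md5) after every copy, rather than only on the original download, would catch a truncated transfer like the one I ran into.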

Installing The Software

I watched the installation process via the console ports on my fabric spine switches, and everything happened as expected. Finally, remembering that this was a disruptive upgrade, the switch rebooted. This is where things got interesting for me; what I wanted to see was what would happen after the first upgrade was completed. Would the DCNM ISSU process wait before moving on to the next switch, or would it continue regardless? This is critical because the Nexus 5600 switches tend to take five minutes or more to become functional after powering on.

Upgrade Sequencing

How does DCNM ISSU sequence and schedule the updates? I could not find any explicit documentation, but as far as I can tell from sitting on the console, once the reboot command has been issued to an upgraded device, DCNM moves on to the next one and begins the software load process. Whether by luck or design, it takes longer to perform a disruptive software upgrade than it does for a rebooted switch to become active again. So although I am not aware of DCNM actively checking that only one switch is rebooting at any one time, the net effect of the various delays is the same, and my leaf/spine network never had fewer than three active spines at any time.
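The timing argument can be made concrete with a toy model. It assumes, based only on what I observed, that DCNM starts loading the next switch the moment it reloads the current one, that every load takes the same time, and that every reboot takes the same time. The durations below are hypothetical, not measurements.

```python
def max_switches_down(n_switches, install_min, boot_min):
    """Worst-case number of switches rebooting at the same moment.

    Switch k (0-based) is reloaded once its own software load finishes,
    at (k + 1) * install_min, and is then unavailable for boot_min minutes.
    """
    outages = []
    for k in range(n_switches):
        reload_at = (k + 1) * install_min
        outages.append((reload_at, reload_at + boot_min))
    # The overlap can only increase at the instant an outage begins,
    # so checking each start time is enough.
    return max(
        (sum(1 for start, end in outages if start <= t < end) for t, _ in outages),
        default=0,
    )


if __name__ == "__main__":
    print(max_switches_down(4, install_min=15, boot_min=5))  # 1
    print(max_switches_down(4, install_min=3, boot_min=5))   # 2
```

As long as the load takes at least as long as the reboot, the model never has more than one switch down, which matches what I saw. If the load ever became faster than the reboot, nothing in this sequencing would stop two spines from being down together.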

Although it would take longer to perform larger upgrades, I would very much prefer to see DCNM wait for each switch to confirm that it is up and passing traffic (and perhaps even that the FabricPath IS-IS neighbor relationships have been restored) before the upgrade process begins on the next switch. Similarly, for my leaf switches running vPC+, I’d like to know that the A side of a leaf vPC switch pair is passing traffic before the B side starts upgrading (assuming that the version difference is insignificant enough that vPC will establish successfully between old and new NX-OS versions).
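The behavior I am asking for amounts to a health-gated rolling upgrade. Here is a minimal sketch of that logic. The upgrade and is_healthy callbacks are hypothetical placeholders for whatever triggers the install and whatever checks adjacencies or vPC state; nothing here is a DCNM API.

```python
import time


def rolling_upgrade(switches, upgrade, is_healthy,
                    timeout_s=900, poll_s=15,
                    sleep=time.sleep, clock=time.monotonic):
    """Upgrade switches one at a time, waiting for each to recover.

    upgrade(switch) starts the upgrade (hypothetical callback).
    is_healthy(switch) returns True once the switch is forwarding again
    (hypothetical callback). If a switch does not recover within
    timeout_s seconds, the run stops before the next switch is touched.
    """
    completed = []
    for switch in switches:
        upgrade(switch)
        deadline = clock() + timeout_s
        while not is_healthy(switch):
            if clock() >= deadline:
                raise TimeoutError(
                    f"{switch} not healthy after {timeout_s}s; "
                    "stopping before the next switch"
                )
            sleep(poll_s)
        completed.append(switch)
    return completed
```

Stopping on the first switch that fails to recover is deliberate: with a vPC pair, the worst outcome is upgrading the B side while the A side is still down.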

DCNM ISSU – Conclusions

After upgrading four leaf switches as well as the four spine switches without any major issues, I was pretty happy with the procedure. The interface has a few odd or clunky behaviors as mentioned above, but functionally it’s quite nice to say “go upgrade this” and come back when it’s done. Obviously, I’m a nervous ninny, so I’m the kind of guy who will watch over the upgrade process to ensure that things are going well. Still, it’s preferable to be monitoring upgrades than to be performing them manually.

Another win for DCNM!

2 Comments on DCNM ISSU: Disaster or Triumph? Let’s Find Out.

  1. Thank you for your tests and candid feedback.
    I’ve raised your concerns to DCNM Engineering and created enhancement requests CSCvd06103 (ISSU option should only list NX-OS image files from flash and sort them by version) and CSCvd06111 (ISSU feature “validate and finish the installation later” can’t be scheduled) in order to have these addressed in a future release.
