Last–but not least–in the technology triumvirate presenting a joint session at Networking Field Day 17 was Cumulus Networks. This post looks at the benefits of Cumulus Linux as a NOS on the Mellanox Spectrum Ethernet switch platform.
I’ve not yet managed to deploy Cumulus Linux in anger, but it’s on a fairly short list of Network Operating Systems (NOS) which I would like to evaluate in earnest, because every time I hear about it, I conclude that it’s a great solution. In fact, I’m having difficulty typing this post because I have to stop frequently to wipe the drool from my face.
Cumulus Linux supports around 70 switches from 8 manufacturers at this time, and perhaps obviously, that includes the Mellanox Spectrum switches that were presented during this session. This is the beauty of disaggregation of course; it’s possible to make a hardware selection, then select the software to run on it. Mellanox made a fairly strong case for why the Spectrum-based hardware is better than others, so now Cumulus has to argue for why they would be the best NOS to run on the Mellanox hardware.
Cumulus Linux, as the name suggests, is based on Debian linux. Cumulus Networks then adds its deep knowledge of the ASICs and platforms it supports in order to write the linux drivers necessary to interface with the switch. One of the arguments made against much commercial use of open source software is that the companies do not give back to the open source community, while profiting from the same. Cumulus Networks is proud to tell us that it feeds patches back upstream where possible, and in the NFD17 presentation, Cumulus TME Pete Lumbis described how Cumulus wanted to add support for VRFs to linux, as they felt that the existing network namespaces were not an adequate solution. Cumulus went back to the community with ideas and code to implement VRFs; the proposal was accepted and now VRFs are part of the linux kernel. Once linux supported VRFs natively, Cumulus then implemented VRFs on the switching platform. The end result – as with pretty much everything in Cumulus Linux – is that configuring a VRF or an interface, for example, on Cumulus Linux is done exactly the same as it would be on any other linux device.
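To make the "same as any other linux device" point concrete, here is a minimal sketch that builds the standard iproute2 commands for creating a VRF and attaching interfaces to it. The VRF name, table number and interface names are my own placeholders, not anything shown in the presentation, and the function only returns the command strings rather than running them.

```python
from typing import List


def vrf_commands(vrf: str, table: int, interfaces: List[str]) -> List[str]:
    """Return the iproute2 commands that create a VRF and attach interfaces."""
    commands = [
        # A VRF is a network device bound to its own routing table.
        f"ip link add {vrf} type vrf table {table}",
        f"ip link set dev {vrf} up",
    ]
    for ifname in interfaces:
        # Enslaving an interface to the VRF device places it in that VRF.
        commands.append(f"ip link set dev {ifname} master {vrf}")
    return commands


if __name__ == "__main__":
    # "vrf-blue", table 10 and the swp names are placeholders for illustration.
    for command in vrf_commands("vrf-blue", 10, ["swp1", "swp2"]):
        print(command)
```

The same commands would apply on a linux server with a recent enough kernel, which is rather the point.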
Cumulus Networks was also a key player in the forking of Quagga to create Free Range Routing (FRR). Some parts of the community, including Cumulus Networks and other vendors, were unhappy that Quagga was not being developed and patches were not being reviewed and updated at the kind of speed necessary to keep Cumulus – a user of Quagga – competitive. In the end, unable to find other solutions to a claimed backlog of 3,000 patches for Quagga, the project was forked and Cumulus Linux now uses FRR instead.
It is of note that not everybody is, or wants to be, a linux admin, and for many network engineers the linux CLI is not comfortable. With that in mind, Cumulus Networks created a network command line utility (NCLU). Linux admins can just edit the config files if desired, while network engineers are more likely to appreciate NCLU’s online help and tab completion of commands. Under the hood, NCLU is just a more friendly abstraction of the underlying configuration files, so both server and network admins can view and manage the configurations in whichever way makes most sense to them. Pete Lumbis notes that in some cases the NCLU command takes output which in linux is positively messy, and turns it into something far clearer and more familiar; but the source for the information is the same regardless of which commands are used to view it.
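As a rough illustration of the "friendly abstraction over config files" idea, the toy translator below turns one command shape into an ifupdown2-style interface stanza. This is not NCLU's code or its full grammar; the function names and the single supported command pattern are assumptions I made up for the sketch.

```python
from typing import List


def render_interface(name: str, addresses: List[str]) -> str:
    """Render an ifupdown2-style stanza for one interface."""
    lines = [f"auto {name}", f"iface {name}"]
    lines += [f"    address {address}" for address in addresses]
    return "\n".join(lines) + "\n"


def translate(command: str) -> str:
    """Translate a toy 'add interface <name> ip address <cidr>' command.

    Hypothetical helper: it understands exactly one command shape and
    rejects everything else.
    """
    words = command.split()
    if (
        len(words) != 6
        or words[:2] != ["add", "interface"]
        or words[3:5] != ["ip", "address"]
    ):
        raise ValueError(f"unsupported command: {command}")
    return render_interface(words[2], [words[5]])


if __name__ == "__main__":
    print(translate("add interface swp1 ip address 10.0.0.1/24"), end="")
```

Whichever side writes the stanza, the file is the single source of truth, which is why both audiences end up looking at the same configuration.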
It’s also of note that since everything NCLU does is reflected within linux itself, management of Cumulus switches can be achieved using the tools that already exist to manage linux servers. That’s right; finally there’s a network product which Ansible understands right out of the box, because it’s really a linux server!
Cumulus VX is a free VM of Cumulus Linux supporting VMware, VirtualBox and KVM. It can be downloaded and used for network testing and learning purposes, and with the assistance of a tool like Vagrant the connectivity between multiple VMs can be automated so that full network architectures can be simulated using Cumulus VX. The demo in the video is entirely virtualized, based on this topology:
The configurations in Cumulus VX are identical to those on dedicated switching hardware, so once modeled and validated successfully the configuration can simply be copied over to the target devices with a high degree of confidence that it will work. It’s worth noting that Cumulus VX is not optimized for throughput in any way (no DPDK or similar), and is intended purely as a validation platform, not a production platform.
The feature which most excited me (it pushed my nerd buttons) was NetQ. Every switch is configured to run a NetQ agent which monitors Netlink messages as well as a few other things. The NetQ agent can thus see the operational status of its switch and store the status in a Redis in-memory distributed database, meaning that the status of every device is known on, well, every device, so it’s possible to see the current state of an entire fabric from a single device.
For example, the screenshot below is from the demo session and includes an example of tracing a MAC (netq show mac), running a layer 2 trace (netq trace ... from leaf02) and checking current BGP session status (netq check bgp):
NetQ can ‘perform’ a layer 2 or layer 3 traceroute from anywhere to anywhere else in the network, including identifying ECMP links along the way. What’s interesting about this is that no packets are sent; the traceroutes are calculated from the known status of every node, so the result is theoretical but based on real device status.
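The idea of a trace that sends no packets can be sketched in a few lines: if the next hops each node would use are already known, every path (including ECMP branches) can be enumerated by walking that data. The data model, node names and function below are my own assumptions for illustration and say nothing about how NetQ actually stores or walks its state.

```python
from typing import Dict, List

# Hypothetical fabric state: node -> destination -> list of next hops.
# More than one next hop for a destination represents ECMP.
NextHops = Dict[str, Dict[str, List[str]]]


def trace(state: NextHops, src: str, dst: str) -> List[List[str]]:
    """Enumerate every path from src to dst using only the recorded state."""
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        if node == dst:
            paths.append(path)
            return
        for hop in state.get(node, {}).get(dst, []):
            if hop not in path:  # never revisit a node, so loops terminate
                walk(hop, path + [hop])

    walk(src, [src])
    return paths


FABRIC: NextHops = {
    "leaf01": {"leaf02": ["spine01", "spine02"]},
    "spine01": {"leaf02": ["leaf02"]},
    "spine02": {"leaf02": ["leaf02"]},
}

if __name__ == "__main__":
    for path in trace(FABRIC, "leaf01", "leaf02"):
        print(" -> ".join(path))
```

With the sample state above, the walk yields one path through each spine, which is the ECMP fan-out a real trace would have to discover hop by hop.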
Because the state data are held in a Redis database, if there’s not a netq command providing the desired output, users can choose to write a query as a select statement instead:
And one more thing: running a netq command means querying the current system state data, so if a system were to keep track of that state data with timestamps, it would be possible to ask for a netq command to be run on the system state at a specific time in the past. Enter the telemetry server (another VM) which does exactly that. To me this kind of thing is like gold when troubleshooting a report of a problem that occurred at a specific time in the past.
Cumulus has extended NetQ support to Docker as well, so it’s possible to see, for example, which containers are running on a given host, or where a container connects to the fabric, and similar. With the telemetry server running, the same queries can be used to show how containers are being spun up and closed down.
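The "query the state as it was at time T" idea boils down to keeping timestamped snapshots and looking up the latest one at or before T. Here is a minimal sketch of that lookup; the class, its methods and the sample data are hypothetical and are not how the telemetry server is implemented.

```python
import bisect
from typing import Any, Dict, List, Optional


class StateHistory:
    """Keep timestamped snapshots and return the one in effect at a given time."""

    def __init__(self) -> None:
        self._times: List[float] = []
        self._snapshots: List[Dict[str, Any]] = []

    def record(self, timestamp: float, snapshot: Dict[str, Any]) -> None:
        # Insert in timestamp order so lookups can use binary search.
        index = bisect.bisect_right(self._times, timestamp)
        self._times.insert(index, timestamp)
        self._snapshots.insert(index, snapshot)

    def at(self, timestamp: float) -> Optional[Dict[str, Any]]:
        # Latest snapshot taken at or before the requested time, if any.
        index = bisect.bisect_right(self._times, timestamp)
        return self._snapshots[index - 1] if index else None


if __name__ == "__main__":
    history = StateHistory()
    history.record(100.0, {"bgp": "Established"})
    history.record(200.0, {"bgp": "Idle"})
    print(history.at(150.0))
```

Any query that works on the current snapshot can then be pointed at `history.at(some_time)` instead, which is what makes after-the-fact troubleshooting possible.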
I know I can’t really do justice to Cumulus Linux and NetQ, but Cumulus Networks’ Pete Lumbis does a great job in this NFD17 session video:
I really like the way Cumulus Networks thinks, and its commitment to preserving the ‘linux’ way of doing things. Obviously, in this context Cumulus Linux is being presented as a good partner for Mellanox switches, but even if another hardware vendor is chosen, Cumulus always seems to me to be a pretty sweet product that’s accessible to administrators with either server experience or network experience. Two thumbs way up!
I was an invited delegate to Network Field Day 17 at which all the companies listed above presented. Sponsors pay for presentation time in front of the NFD delegates, and in turn that money funds the delegates’ transport, accommodation and nutritional needs while there. With that said, I always want to make it clear that I don’t get paid anything to be there (I took vacation), and I’m under no obligation to write so much as a single word about any of the presenting companies, let alone write anything nice about them. If I have written about a sponsoring company, you can rest assured that I’m saying what I want to say and I’m writing it because I want to, not because I have to.
You can read more here if you would like.