I’ve been developing yet more automation recently, and I’ve been hitting two major stumbling blocks that have had a negative impact on my ability to complete the tooling.
When APIs were first made available, the documentation from many vendors was simply incomplete; it seemed that the documentation team was always a release or two behind the people implementing the API. To fix that, a number of vendors have moved to a self-documenting API system along the lines of Swagger. The theory is that if you build an API endpoint, you're automatically building the documentation for it at the same time, which is a super idea. This has improved endpoint coverage, but in some cases the result is thorough documentation explaining what the endpoints are, with little to nothing explaining why one would choose to use a particular endpoint.
As a result, with one API in particular I have been losing my mind trying to understand which endpoint I should use to accomplish a particular task, when no fewer than three of them appear to handle the same thing. I'm then left using trial and error to find the correct path, and at the end of it I know which endpoint to use, but not really why.
There are few better ways to waste an afternoon than an API endpoint which you call correctly per the documentation, only for the call to fail for some other reason. The worst one I've encountered recently is an HTTP REST API call which returns an HTTP 400 error (implying that the problem is in the request I sent) but with a JSON error message in the returned content saying there was an internal error on the back end. Surely an internal server error should be in the 5xx series? That particular error is caused by a bug which prevents that API call from working correctly when the device is deployed in a cluster. This is infuriating, and took a long time to track down and confirm as a bug rather than an error on my part.
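When a status code and an error body disagree like this, it helps to flag the mismatch explicitly in tooling rather than blindly trusting the status code. Below is a minimal sketch of that idea; the JSON field names ("error", "message") and the hint strings are assumptions, not this vendor's actual schema, so adjust them to whatever your API really returns.

```python
# Sketch: classify a response where the HTTP status and the JSON error body
# disagree. Field names and hint strings are hypothetical -- adapt as needed.

def classify_error(status_code, body):
    """Return a label describing where the fault most likely lies."""
    message = str(body.get("error", {}).get("message", "")).lower()
    server_side_hints = ("internal", "backend", "back end", "server error")
    body_blames_server = any(hint in message for hint in server_side_hints)

    if 400 <= status_code < 500 and body_blames_server:
        # The status says "your request is bad" but the body says "we broke";
        # treat this as a suspected server-side bug, not a client error.
        return "suspect-server-bug"
    if status_code >= 500:
        return "server-error"
    if status_code >= 400:
        return "client-error"
    return "ok"

# The mismatch described above: HTTP 400 with an "internal error" body.
print(classify_error(400, {"error": {"message": "Internal error on back end"}}))
# -> suspect-server-bug
# versus a genuine client error:
print(classify_error(400, {"error": {"message": "missing required field"}}))
# -> client-error
```

Logging "suspect-server-bug" separately from genuine client errors would have saved me most of that afternoon, because the pattern shows up immediately instead of after repeated attempts to "fix" a request that was never broken.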
Unfortunately this discovery also suggests that, as part of the software validation process before code release, the API is not being fully tested (incomplete test coverage), is not being tested against clustered devices, or both. For this device, I'd suggest that clustered deployments represent the majority of implementations.
Trust But Verify
Worse, in some ways, than that are the endpoints which return a valid-looking result and a success code, but are not necessarily providing what was requested. I’ve learned the hard way that just because an API tells you that a request was successful, it’s still necessary during development to manually inspect the returned data to make sure that the API is behaving itself and providing what it claims.
For example, I am working with one API where a request for the FIB (Forwarding Information Base) returns lots of entries. However, closer inspection of those entries reveals that only the first of any ECMP next-hops is being returned; it’s not possible to see all calculated equal cost paths. As a second irritation, the entire FIB cannot be retrieved at once; after much trial and error I determined that it is necessary to page the result in blocks of around 30-35 entries, or the request eventually triggers an internal error and fails. Naturally, the documentation does not indicate that there is any kind of limit on how many FIB entries can be returned safely at one time, nor that there would be an issue with a larger FIB being returned.
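The paging workaround above can be sketched as a simple loop that requests small blocks until a short page signals the end of the table. The `fetch_page` callable, its `offset`/`limit` parameters, and the safe page size of 32 are all assumptions standing in for the real API call, since the real limit was only ever determined empirically.

```python
# Sketch of retrieving a large table in small blocks, as the FIB retrieval
# above required. fetch_page is a stand-in for the real API call; the
# parameter names and the page size are assumptions.

PAGE_SIZE = 32  # determined empirically; larger requests eventually fail

def fetch_all(fetch_page, page_size=PAGE_SIZE):
    """Retrieve every entry by repeatedly asking for small pages."""
    entries = []
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        entries.extend(page)
        if len(page) < page_size:  # a short page means we've hit the end
            return entries
        offset += page_size

# Demonstration with a fake backend holding 100 FIB entries:
fib = [f"10.0.{i // 256}.{i % 256}/32" for i in range(100)]
fake_fetch = lambda offset, limit: fib[offset:offset + limit]
print(len(fetch_all(fake_fetch)))  # 100
```

Nothing about this loop is clever; the point is that none of it should be necessary, and the safe block size had to be discovered by watching requests fail.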
Worse, retrieving the RIB (Routing Information Base) from the same API – which thankfully does include the ECMP routes I’m looking for – only returns the first ~60 entries, and ignores any pagination requests entirely, so it’s not possible to see anything but those 60 entries. Again, looking manually allowed me to confirm that although I had asked for entries 90-129, for example, I was still getting the first 60 RIB entries. If I had not looked carefully, I could have made some very bad decisions on the basis of those incomplete data.
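The manual check described above, confirming that a request for entries 90-129 really returns entries 90-129, can be automated as a cheap sanity test: ask for a non-zero offset and compare the result against the first page. Again, `fetch_page` and its parameter names are stand-ins for the real call.

```python
# Sketch of verifying that an API honors pagination before trusting its
# results. fetch_page is a stand-in for the real call; parameter names
# are assumptions.

def pagination_honored(fetch_page, offset=90, limit=40):
    """Return False if a non-zero offset returns the same data as offset 0."""
    first_page = fetch_page(offset=0, limit=limit)
    later_page = fetch_page(offset=offset, limit=limit)
    # An API that ignores pagination hands back the start of the table again.
    return later_page != first_page

# A broken backend like the one described: it always returns its first
# entries no matter what offset was requested.
rib = [f"192.0.2.{i}/32" for i in range(60)]
broken_fetch = lambda offset, limit: rib[:limit]
print(pagination_honored(broken_fetch))  # False
```

Running a check like this during development turns "silently incomplete data" into an immediate, visible failure, which is exactly the difference between a bad decision and a bug report.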
If the “show ip route” (or similar) command didn’t work properly in the CLI of a network device, customers would lose their minds, and I am pretty certain that a patched version of code would become available almost immediately. When the API doesn’t work, I get shrugs and promises of a fix in a future release at some unspecified time.
My 2 Bits
APIs have grown big and unwieldy, and that's partly our fault as users, because, reasonably enough, we want them to allow us to do everything the device can do. These APIs take a lot more effort on the part of the vendors to document, and the end result seems to be that in some cases at least, the quality and value of that documentation has decreased even while coverage of endpoint availability and capabilities has increased. Making those APIs usable and understandable is key for developers, but also for retaining customers: if I can't figure out how to do something, or I lose faith in the API's reliability on one product, there's a danger I'll move to a different product.
As an industry we expect to be able to control everything via an API, and we are automating our business processes based on those APIs. Broken API endpoints mean broken business processes and that’s just not acceptable. I’m also getting a little tired of waiting for one bug to be fixed in a code release, then discovering another API bug and having to repeat the cycle, never quite finding a version of code that I can safely deploy and automate.
APIs have to be functional and reliable or they're useless, and they need to be thoroughly tested before code ships to customers. Vendors might also consider how to patch API bugs in a more agile fashion, so that issues can be fixed without requiring a full code upgrade on a device, which has a high cost to the business. Unfortunately the API frequently seems to be tightly bound to the operating system rather than abstracted safely away from it, which means this will largely remain a dream rather than an actuality.
Most importantly, APIs need to be a first class citizen in the operation of every device, not a “table stakes” feature wedged uncomfortably and unreliably into legacy code.