Not content with digging into the A10 health monitors recently, I thought I should do the same for f5 LTM, which has some slightly different settings and, it turns out, works really quite differently.
I hate to say it again, but there’s a good chance if somebody explained health monitors to you, they got it wrong or, at least, only told you half the story. The results presented here are relevant to f5 LTM version 11.3.
Health Check Interval x 3 = Down
This is probably the fallacy I have heard most often about f5 health monitors: “The f5 sends a health check every 5 seconds, then if three health checks fail, it marks the pool member (or node) down.” This makes us comfortable because it sounds awfully like OSPF and a variety of other protocols that rely on counting missed packets to determine neighbor liveness.
When it comes to f5 health monitors, however, this is not true.
f5 Health Monitors
The f5 manages its monitors through four main configuration parameters:
Interval
Interval is, as you might have guessed, the time period between each health check attempt, measured in seconds; the default for an HTTP monitor is five seconds. Technically, in fact, this is how often the f5 issues a monitor check when the monitored system is down. Yes, you read correctly. It should actually be called “Down Interval” so it pairs with:
Up Interval
Disabled by default, the Up Interval dictates how often a device should be checked when it is known to be up. If not set, the default behavior is to check using the value defined for Interval. If set, the f5 uses this value as the health check interval while the host is considered to be “up”. When the host goes down, the f5 reverts to the regular Interval value for health checks.
You could, for example, set Up Interval to 2s so a server is checked every 2s while it’s up, then set Interval to 10s so that once the server is detected as being down, the health check interval drops to 10s.
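To make the two intervals concrete, here’s a minimal Python sketch of how the check interval gets chosen based on the member’s last known state. This is my model of the behavior described above, not f5 code, and the parameter names are just labels for the example.

```python
# Illustrative model of the behaviour described above, not f5 code.

def next_check_interval(member_is_up, interval, up_interval=None):
    """Return how long to wait before the next health check, in seconds."""
    if member_is_up and up_interval is not None:
        return up_interval   # host is up and Up Interval is configured
    return interval          # host is down, or Up Interval is disabled

# Matching the example above: check every 2s while up, every 10s once down.
print(next_check_interval(member_is_up=True, interval=10, up_interval=2))   # 2
print(next_check_interval(member_is_up=False, interval=10, up_interval=2))  # 10
```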
Time Until Up
This rather clumsily named parameter defines how long the f5 waits to mark a pool member (or node) as up after a successful health check. It’s a little like HSRP’s “preempt delay” setting – yeah, the neighbor’s back up, but let’s wait and make sure it’s stable and fully initialized before letting it take over as primary again. By default this setting is 0s, i.e. the server is immediately available. Just as with A10, I’d argue that this is a bad choice – but with f5 and A10 both making the same decisions, I’m apparently in a minority here.
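If it helps, here’s the same idea as a tiny Python sketch (again, a model of the described behavior rather than anything f5 ships): a successful check at time t only makes the member available at t plus Time Until Up.

```python
# Illustrative model only: delay the "up" transition after a successful check.

def available_at(success_time, time_until_up=0.0):
    """Return the time at which the member is actually marked up."""
    return success_time + time_until_up

print(available_at(success_time=100.0))                      # 100.0 - up immediately (the default)
print(available_at(success_time=100.0, time_until_up=30.0))  # 130.0 - held back while the server stabilizes
```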
Timeout
This one might sound obvious but, uh, timeout for what exactly? The default setting is 16 seconds, or (Interval x 3) + 1, which is always f5’s recommendation for monitor settings. So, lie number one:
Sixteen seconds is, in fact, enough for four monitors to be sent, as the timer is started when the first health check is sent.
Still, that aside, the number of failed monitors that occur during the Timeout period is actually irrelevant. Yes it would be nice to have four monitors fail just to prove that a device is down, but here’s lie number two:
The requirement is simply that there is not a successful health check within that Timeout period. The actual number of check failures is irrelevant; there is no “3 failures” or “4 failures” mechanism; it’s purely about time.
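In other words, the logic is closer to this little Python model than to any failure counter. It’s my sketch of the behavior described above, not f5 code.

```python
# Illustrative model: a member goes down when no successful health check has
# been seen within the Timeout window, regardless of how many checks failed.

def member_is_down(now, last_success, timeout=16.0):
    """Down if the last successful check is older than the Timeout window."""
    return (now - last_success) > timeout

# Checks at 5s and 10s may have failed, but a success at 14s keeps it up:
print(member_is_down(now=15.0, last_success=14.0))  # False - still up
# No success at all since the timer started at 0s:
print(member_is_down(now=17.0, last_success=0.0))   # True - marked down
```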
So what happens when a server simply responds slowly? Imagine that we have a server that’s running really badly. It’s responding to requests, but it’s taking 14 seconds to get the response back (yes, I hacked a little HTTP daemon to do exactly this for testing purposes). What happens?
| Time | Health Check # | Response / Progress     |
|------|----------------|-------------------------|
| 0s   | 1              | TCP connection made     |
| 5s   | 2              | TCP connection made     |
| 10s  | 3              | TCP connection made     |
| 14s  | (1)            | HTTP 200 response to #1 |
| 15s  | 4              | TCP connection made     |
Success! After 14 seconds we received an HTTP response. Because this response comes before the 16-second Timeout we defined, it counts as a successful health check. Yes, that’s right: a server that takes 14 seconds to respond to a query is, by default, considered A-OK by an f5. Does that sound smart? It sure doesn’t to me. I was able to run my crippled server indefinitely without the pool member ever going down. The A10, which has a separate timeout for application responses like this, treated the health check as a failure after 5 seconds and took the pool member down after three failures.
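If you want to see why the member never drops, here’s a short Python simulation of the scenario in the table. It’s a model under the assumptions used in this post (5s Interval, 16s Timeout, a server that answers after 14s), not f5 code.

```python
# Simulate the slow-server scenario: checks every 5s, each answered 14s later,
# member marked down only if 16s pass with no successful response.

INTERVAL = 5          # seconds between health checks
TIMEOUT = 16          # seconds allowed without a successful check
RESPONSE_DELAY = 14   # how long the crippled server takes to answer

def ever_marked_down(duration):
    """Return True if the member would be marked down within `duration` seconds."""
    check_times = range(0, duration, INTERVAL)
    # Each check eventually succeeds, RESPONSE_DELAY seconds after it was sent.
    success_times = sorted(t + RESPONSE_DELAY for t in check_times)

    last_success = 0  # the timeout timer starts when the first check is sent
    for now in range(duration):
        while success_times and success_times[0] <= now:
            last_success = success_times.pop(0)   # a response has arrived
        if now - last_success > TIMEOUT:
            return True   # no success within the Timeout window: marked down
    return False

print(ever_marked_down(duration=300))  # False - the member stays up indefinitely
```

Successful (if painfully slow) responses keep arriving every 5 seconds, so the gap between successes never gets anywhere near the 16-second Timeout.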
HTTP Response Codes
Side note: While the A10’s http monitor looks for non-2xx response codes, the f5 by default looks for any response at all. It’s possible to change the Receive String to look for HTTP 200, for example, but it will not accelerate the failure like the A10 does; it simply counts as a failed health check if the HTTP 200 isn’t in the received data.
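Here’s that distinction as a small Python sketch (my illustration of the behavior described above, not the f5’s actual matching code): the Receive String only decides whether a given check passes or fails once data arrives; it doesn’t shorten the Timeout.

```python
# Illustrative model: a receive string is a pass/fail test on whatever data
# arrived before the Timeout, not a way to fail faster.

def http_check_passes(response_data, receive_string=""):
    """With the default (empty) receive string, any response at all passes."""
    if not receive_string:
        return bool(response_data)
    return receive_string in response_data

print(http_check_passes("HTTP/1.1 500 Internal Server Error"))            # True  - default monitor is happy
print(http_check_passes("HTTP/1.1 500 Internal Server Error", "200 OK"))  # False - fails, but only once data arrives
```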
My Thoughts
I like the way A10 has approached this more than f5. The definitions of the parameters are clearer, and I think it handles slow-responding servers better. More importantly though, talking to other people about the f5 monitors and thinking back on the understanding that I had been given, it was clear that there is some confusion about what these parameters mean, and that – of all things – is the bigger risk to the network in terms of potentially misguided configuration.
30 Blogs in 30 Days
This post is part of my participation in Etherealmind’s 30 Blogs in 30 Days challenge.
Hey John,
This is a subject close to my heart. I visited this particular rabbit hole a couple of months back (https://devcentral.f5.com/questions/monitor-timeouts – there’s a nice diagram I drew too).
I’d agree that the setting names (the whole thing really) are unclear and lead to confusion. I also agree regarding the Time Until Up parameter although the Slow Ramp Time setting (at the Pool level) in F5 LTM mitigates this somewhat.
I’m sure with both vendors you can have multiple monitors at the Pool and Node level and be somewhat more sophisticated. I’d certainly recommend using custom, well-defined and application-specific send and receive strings where you can, to ensure responses and a positive UP mark actually reflect the user experience as much as possible.
Speaking generally (and note I’ve not read your A10 posts yet) the whole subject of monitoring is rather more involved than most ‘people’ realise. Just from an ‘orchestration’ and coordination perspective I’d say it’s very complex: tracking multiple, simultaneous underlying TCP/UDP connections and application-level requests, marrying that all up with responses (or not) and so on.
Monitoring is the unsung hero of the ADC domain and I’d imagine the vendors’ systems/approaches/software could be re-purposed in quite valuable ways if they put their minds to it.
Great feedback, Steven – thank you! I just took a look, and I agree – you drew nice diagrams on that post. I was fighting to find the right way to present the information without it getting too cluttered (not sure I succeeded).
Definitely agree about the application specific send/receive strings. The default ‘blank receive string’ isn’t much use in reality and I wouldn’t want to use it. And you’re right about the complexity of monitoring; it makes my mind boggle at times!