How can the reported downtime be less than our monitoring frequency?

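Downtime is counted from the moment a failure is confirmed, not from the start of the test cycle that detected it. Here is an example, starting with a test cycle at 11:05 am: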

11:05:00 - The test cycle begins.
11:05:30 - The primary monitoring node times out after 30 seconds.
11:05:38 - The confirmation nodes report agreement on the failure.
11:05:38 - Downtime is recorded and alerts are sent.

(The total duration of this test cycle is 38 seconds.)

11:06:00 - The next test cycle begins.
11:06:02 - The correct response is received, indicating that the host is back up.
11:06:02 - A change of state from down -> up is noted, and recovery alerts are sent.

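The calculated downtime runs from 11:05:38, when the failure was confirmed, until 11:06:02, when the recovery was detected. It is therefore 24 seconds, which is less than the one-minute gap between the two test cycles.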
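
The arithmetic can be checked with a short sketch. This is only an illustration of the example above, not the monitoring service's actual code; the function name `reported_downtime_seconds` and its parameters are hypothetical.

```python
from datetime import datetime

def reported_downtime_seconds(confirmed_down: str, confirmed_up: str) -> int:
    """Seconds between the confirmed failure and the detected recovery.

    Both arguments are HH:MM:SS strings on the same day (an assumption
    made to keep the sketch short).
    """
    fmt = "%H:%M:%S"
    down = datetime.strptime(confirmed_down, fmt)
    up = datetime.strptime(confirmed_up, fmt)
    return int((up - down).total_seconds())

# Times taken from the example: failure confirmed at 11:05:38,
# recovery detected at 11:06:02.
downtime = reported_downtime_seconds("11:05:38", "11:06:02")
print(downtime)  # 24
```

Because the clock starts at the confirmation time rather than at 11:05:00, the 38 seconds spent timing out and confirming the failure are not counted as downtime.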

