Pingdom: great service, but bad at maths

We use Pingdom at Whisher to keep tabs on our servers. This service, albeit not as comprehensive as Nagios, can send TCP and UDP probes, and even check the expected output of an HTTP GET. If one of the sensors fails, you receive an email and/or SMS telling you about the problem. What is great about Pingdom is that, through a very simple interface, you can set up basic monitors for your servers, with graphs and reports on downtime. Very good value for money.

Today we had a problem with the switch: it somehow erased the mapping table entry between one of the server ports and the upstream provider’s gateway, effectively blackholing that server off the internet. The rest of the servers were working, and pings were fine between all of them, so I rebooted the affected server. The reset on the switch’s port seemed to cure the problem.

Initially, I was alerted to this by an email sent by Pingdom. After the reboot and switch soft reset, Pingdom sent me another email reporting the server as being up again. This is where things went funny:

One hour and sixty-three minutes of downtime? Wouldn’t that add up to two hours and three minutes? Weird maths! It’s also curious that the down alert was received at 20:50:37, meaning the server was really down for only 1h 3m.
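For illustration only, here is a minimal Python sketch of how a duration formatter could end up printing "1h 63m". This is a guess at the class of bug, not Pingdom's actual code: the function names are hypothetical, and the 63-minute figure is simply the real downtime worked out above. The idea is that the minutes field is taken from the total number of minutes without being reduced modulo 60.

```python
def format_downtime_buggy(total_seconds: int) -> str:
    """Hypothetical buggy formatter: minutes are not reduced modulo 60."""
    hours = total_seconds // 3600
    minutes = total_seconds // 60  # bug: should be (total_seconds % 3600) // 60
    return f"{hours}h {minutes}m"


def format_downtime(total_seconds: int) -> str:
    """Correct formatter: split into whole hours, then the leftover minutes."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes = remainder // 60
    return f"{hours}h {minutes}m"


# Assumed real downtime: 63 minutes, as estimated from the alert times.
downtime_seconds = 63 * 60

print(format_downtime_buggy(downtime_seconds))  # 1h 63m
print(format_downtime(downtime_seconds))        # 1h 3m

# Taking "1h 63m" literally instead gives 60 + 63 = 123 minutes.
print(format_downtime((60 + 63) * 60))          # 2h 3m
```

If something like this is what happened, the email's "1h 63m" and the real downtime of 1h 3m are the same 63 minutes, just formatted wrongly.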
