Because the council runs a web service, I should be able to use its website out of hours. It’s a great idea: even when the staff aren’t in the office, I can still pay my council tax or order some green garden bags.
Except that I can’t. For some reason I always end up attempting this at the end of a weekend, by which time their web services usually look like the screen grab shown above.
Here’s what’s supposed to happen. You’re running some web apps. You monitor the servers that those apps are hosted on. First you monitor their vital signs (are the hard drives OK, has one of the power supplies gone a bit iffy, is it too hot in the server room), then you monitor the services running on each server. Finally, every so often, you run some sanity checks on the actual web front end: can you still access the server, is the SQL database still accepting connections, and so on. If any one of these things fails, whoever in your team is on call should receive a text message. Then they log in remotely and fix it (or at least reboot something!).
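To make the sanity checks a bit more concrete, here is a rough Python sketch of the two I mentioned: can you still reach the web front end, and is the database port still accepting connections. It uses only the standard library. The function names are my own invention, and any host names or port numbers you pass in are up to you, so treat it as an illustration of the idea rather than how any real monitoring tool does it.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTP errors, network errors and malformed URLs all count as "down".
        return False


def run_checks(checks: dict) -> list:
    """Run each named check (a zero-argument callable) and return the names that failed."""
    return [name for name, check in checks.items() if not check()]
```

You would call run_checks with a dictionary mapping a label such as "front end" to a small function that calls http_ok or tcp_port_open against your own servers. Anything in the returned list is a reason to text whoever is on call.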
To do this, you use a wonderful system called Nagios. It’s free and open source, and although it’s a bit of a pain to set up, it’s highly configurable. You can get it to only fire messages to your on-call person when two minor faults have occurred. You can set it to wait a specific number of minutes to see if a fault resolves itself. You can even start firing text messages and emails to the on-call person’s boss if it’s still broken after a couple of hours!
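If you’re wondering what the "wait and see, then escalate" behaviour amounts to, here is a toy Python model of it. This is not Nagios code or Nagios configuration, just my simplified sketch of the logic. The class and function names are made up for illustration, and the thresholds (three failed checks before paging, two hours before escalating) are assumptions. The name max_check_attempts is borrowed from the Nagios setting that plays a similar role.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertPolicy:
    max_check_attempts: int = 3        # consecutive failures before anyone is paged
    escalate_after_minutes: int = 120  # how long a confirmed fault may last before the boss hears


@dataclass
class ServiceState:
    consecutive_failures: int = 0
    minutes_in_fault: int = 0  # minutes since the fault was confirmed


def record_check(state: ServiceState, policy: AlertPolicy,
                 ok: bool, minutes_since_last_check: int) -> List[str]:
    """Record one check result and return who should be notified right now."""
    if ok:
        # The fault cleared (possibly by itself), so start again from scratch.
        state.consecutive_failures = 0
        state.minutes_in_fault = 0
        return []

    state.consecutive_failures += 1
    if state.consecutive_failures < policy.max_check_attempts:
        return []  # not confirmed yet: wait and see whether it resolves itself

    if state.consecutive_failures == policy.max_check_attempts:
        return ["on-call"]  # fault has just been confirmed

    # Fault was already confirmed: track how long it has been going on.
    state.minutes_in_fault += minutes_since_last_check
    if state.minutes_in_fault >= policy.escalate_after_minutes:
        return ["on-call", "boss"]
    return ["on-call"]
```

With hourly checks and the defaults above, the first two failures notify nobody, the third pages the on-call person, and by the fifth the boss is getting messages too. A real setup would also limit how often repeat messages go out, which this sketch ignores.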
If you work for Lewisham council, you should ask your IT people about this. It’s not that hard, and if it’s done well, it means that you’ll know about a fault before any of your users do.
Please, for the sake of my untidy garden.