I woke up yesterday to discover a hundred emails telling me that TPN had been down for several hours. This was about 8am Melbourne time. I woke Mano and he confirmed that we couldn't even get shell access to our servers, so he logged a support call with our hosting provider in Texas. They re-booted our server but we still couldn't access it. We called their call centre, and got nowhere fast. I started emailing and calling our account executive, leaving voicemails (turns out he was on leave).
Something like FIVE HOURS later, after our server had been offline for 8 - 9 hours, I got this email from the hosting provider:
Your server had been rebooted twice as per your support request. After the second failure it was consoled. During boot the system was dumping I/O error messages from the SCSI RAID and entering a kernel panic. The system was brought down for diagnostics and hardware failure on the RAID device was detected. The failed hardware was replaced and your server has been brought online. If you note any further issues with your system please inform us promptly and ask tech support for immediate escalation to the NOC. I apologize for any inconvenience caused by this issue. The new hardware has been tested and should give you no further problems. Thank you.
My question is this - should it take 8 hours for them to identify and rectify a situation like this? Am I being unrealistic in thinking that is totally unacceptable? My business was off the air for a third of the day. In their service level agreement, under the heading "network", the provider states that they guarantee a 99.99% uptime. What's that work out to? Something like 61 hours in a year? This is the second time our server has been offline for several hours in a single day in the last couple of months, both times due to something on their end.
So anyway... what kind of expectations should I have in terms of an SLA?
I don't know if I miss the point or not, but shouldn't your expectations be that they meet their SLA or some penalty is worn by them? I would imagine questions should be asked about their use of tools to detect problems. Surely there are tools that could be used to ping the server (not necessarily a network ping, but some sort of test that it's okay) and raise the alarm if something goes wrong. I know one of our products (not suggesting you should use our product (although you should!), just mentioning the availability of such tools) has a small feature that allows you to monitor a webpage and see if a page is served, and even if that page is within the parameters you expect (i.e. the contents of the page haven't changed). I guess it depends on how quickly you wish to be informed of the problems us users see (i.e. is it useful if we email/Skype/ring your mobile/other?). I noticed that TPN was down and was going to contact you, but saw your Skype message, which meant you knew.
Back to the provider, are there any penalties for not meeting the SLA mentioned in the contract? Also, couldn't they argue that if they allow for, let's say, 60 hours of downtime a year and you are down for 60 hours straight but never again for the rest of the year, they meet the SLA (technically)? Might be something to consider for your next hosting contract. Do they have any escalation steps? i.e. down for 1 hour, they do this; down for 2, they do this and that?
Very interesting. My sites are nowhere near as important as yours, but I have been considering this as I'm thinking of using a hosting service for my blog (want to try out Wordpress to see what all the fuss is about and try something a bit different) and have been interested in the different services' SLAs.
Best of luck with this dude.
Molly
Posted by: Phillip Molly Malone | Wednesday, April 05, 2006 at 01:59 PM
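For anyone wondering what the "monitor a webpage" idea in the comment above might look like, here is a minimal sketch in Python. It is not the product Molly mentions and not what any particular NOC runs; the function names and the idea of comparing a SHA-256 hash of the page body are assumptions made purely for illustration. Alerting (email, SMS, Skype) is deliberately left out.

```python
import hashlib
import urllib.request
from typing import Callable, Optional, Tuple


def fetch_page(url: str, timeout: float = 10.0) -> bytes:
    """Download the page body; urlopen raises an OSError subclass on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def check_page(
    url: str,
    expected_sha256: Optional[str] = None,
    fetch: Callable[[str], bytes] = fetch_page,  # injectable so the check can be tested offline
) -> Tuple[bool, str]:
    """Return (healthy, reason) for a single probe of `url`."""
    try:
        body = fetch(url)
    except OSError as exc:  # covers URLError, HTTPError and socket timeouts
        return False, f"fetch failed: {exc}"

    # Optional content check: has the page changed from what we expect?
    if expected_sha256 is not None:
        actual = hashlib.sha256(body).hexdigest()
        if actual != expected_sha256:
            return False, "content changed"

    return True, "ok"
```

Something like this would typically be run every minute or so from a machine outside the hosting provider's network, so that a dead server or a dead network both show up as a failed probe.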
99.99% uptime means 53 minutes of downtime p.a., if my maths is correct. However "network" wasn't the problem, it was hardware, so they might be able to weasel out of it. If I were you I'd scour the SLA clauses of your contract and try to invoke them, Cameron. Best of luck with it.
Posted by: Paul Montgomery | Wednesday, April 05, 2006 at 03:02 PM
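The arithmetic behind the 53-minute figure above is easy to check. A small Python sketch follows; the function names are made up for illustration, and it assumes a 365-day year and that the SLA is measured over a full calendar year (the post does not say what measurement window the provider's SLA actually uses).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an uptime guarantee over the given period."""
    return period_minutes * (1 - uptime_percent / 100)


def achieved_uptime_percent(downtime_minutes: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Uptime percentage actually achieved given total downtime in the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


# 99.99% over a year leaves about 52.56 minutes of downtime.
budget = allowed_downtime_minutes(99.99)

# A single 8.5-hour outage (510 minutes) on its own works out to
# roughly 99.90% for the year, i.e. about ten times the 99.99% budget.
after_outage = achieved_uptime_percent(8.5 * 60)
```

By the same sum, the 61 hours guessed at in the post would correspond to roughly 99.3% uptime, not 99.99%.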
Cam
They're fucking bone heads! Any fucking NOC has tools to monitor systems. If they don't see a system go down within minutes, then they're not doing their job.
Simply put. You host your server in a datacenter to ensure that if there are issues (which there should be few), they fix it asap. So, a datacenter should implement UPS, monitoring, etc. to ensure downtime is at a minimum. 53 minutes a year is great (though probably only if you cluster a system), but 8-9 hours at a time is a JOKE!! Completely unacceptable. I could host a service in my house and have better availability than that.
Getting techie for a sec....any NOC should be using software that captures SNMP traps, which will raise a flag when a system has any issue (low hard drive space, failing fans, a system going down). If a NOC doesn't have one of these, they're either lying, or won't survive long.
In my opinion they've failed their SLA, and hence should pay for losses (or whatever is agreed in the contract).
Rich
Posted by: Richard Giles | Wednesday, April 05, 2006 at 03:31 PM
Hey Rich,
They obviously must be using Sun Hardware! ;-)
But seriously, on uptime. I remember being in our Bedford, MA HQ once for an IT conference and they were showing us their EMC Disk Array and said the first they know there's a problem is when the EMC guy turns up with a new disk. (I assumed they were exaggerating a little).
Molly
Posted by: Phillip Molly Malone | Wednesday, April 05, 2006 at 04:14 PM
Cam, we are about to launch a new service for Aus and have been forced into using US hosting for the time being, and over the last few months have had quite a few outages due to hardware at their end. It is very frustrating. If you have the $, you need to get beyond the standard hosting farm SLAs because they normally have an out regarding hardware failures. The other salient point here is the design of your hardware. You need to get things sorted so you don't have a single point of failure. Redundancy is very important for internet based companies.
Angus
Posted by: Angus Scown | Wednesday, April 05, 2006 at 04:47 PM
Three words: SNMP, SNMP & SNMP... 8-9 hours... Unacceptable...
Posted by: Stephen Edgar | Wednesday, April 05, 2006 at 07:39 PM
You'll find that most stock service level agreements with penalties spend more time limiting the provider's liability than compensating you for the downtime. All you'll get is, typically, a month's free service.
To summarise: "We promise you 99.99% availability across any calendar year, or we'll spot you a month's free service. If something breaks for three days and you lose AUD$1m because of the downtime, though: tough."
What's important, here, is the relationship. If you have no relationship with them aside from the contract, change providers immediately. If you have a decent relationship AND they're acknowledging that they screwed up AND telling you what they're doing to stop it from happening again, it might be worth giving them another shot.
When you get big enough, there are affordable techniques to spread the TPN delivery network across multiple provider networks such that you have both server and provider redundancy. It's more expensive to guarantee that the control systems are always up, but at the very least you should be able to guarantee access to the feeds and episodes.
Regards,
Garth.
Posted by: Garth Kidd | Thursday, April 06, 2006 at 07:08 AM