Google cries foul over coverage of Apps outages
- 13 November, 2008 08:09
Recent outages affecting Google Apps have received a disproportionately large amount of coverage from the technology press, resulting in a misperception about the stability of this hosted collaboration and communication suite.
That's the opinion of Matthew Glotzbach, product management director of Google's Enterprise unit, who recently chatted about this issue with IDG News Service, one of the news outlets that Google feels has blown the problem out of proportion.
Glotzbach's view, which he outlined in a recent blog post, is that the availability and performance of Web-hosted software, like Google Apps, get more scrutiny because its outages occur publicly in the Internet cloud. The resulting press coverage creates a false perception about the overall reliability of cloud applications versus that of on-premise software.
For example, Gmail's availability, measured as average uptime per user based on server-side error rates, has been north of 99.9 percent over the last year, which works out to an aggregate of 10 to 15 minutes of downtime per month, according to Glotzbach. That's lower, he points out, than the 30 to 60 minutes of average unplanned downtime that, according to a recent Radicati Group study, hit on-premise e-mail systems, which also cost more to acquire, install and maintain than Google Apps.
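To make the arithmetic behind these figures concrete, here is a minimal sketch that converts between an availability percentage and minutes of downtime. It assumes a 30-day month, and the function names are illustrative rather than anything Google or Radicati uses. Under that assumption, exactly 99.9 percent availability allows roughly 43 minutes of downtime a month, so the 10 to 15 minutes Glotzbach cites corresponds to an availability of about 99.97 percent, consistent with "north of 99.9 percent."

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_pct, days=30):
    """Minutes of downtime implied by an availability percentage.

    Assumes a period of `days` days (30 by default, an assumption).
    """
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


def availability_pct(downtime_min, days=30):
    """Availability percentage implied by minutes of downtime in the period."""
    return 100 * (1 - downtime_min / (days * MINUTES_PER_DAY))


if __name__ == "__main__":
    # Downtime allowed at exactly 99.9 percent availability.
    print(f"99.9% availability: {downtime_minutes(99.9):.1f} min/month")

    # Availability implied by the downtime figures quoted in the article.
    for minutes in (10, 15, 30, 60):
        print(f"{minutes} min/month: {availability_pct(minutes):.3f}% availability")
```

The same conversion applies to the on-premise figure, assuming it is also measured per month: 30 to 60 minutes of downtime would sit between roughly 99.86 and 99.93 percent.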
In the interview, Glotzbach put into what Google considers proper perspective the several outages in August and October that left some Apps users unable to access their Gmail service for 24 hours or more. An edited version of the conversation follows.
Would you like to recap the main points in your recent blog post about Gmail's and Apps' reliability and performance?
The reliability of the cloud overall is under more scrutiny than the average enterprise IT system reliability, and that's fine. I think it's good to hold the cloud to a higher standard. However, people's perception is maybe overstated relative to the reality. Right now, when we have the most minor issue that may affect an infinitesimally small number of people, it's being picked up and talked about as if it's affecting a large portion [of users]. I'm not saying that it's acceptable to have [outages]. I realize the expectation is 100 percent reliability and that's the goal: to be 100 percent reliable so that there is no discussion because it's always available. That's the gold standard we've gotten with Google.com and that's where we want to get Google Apps as well.
Why are you experiencing outages of 24-plus hours in Apps' Gmail?
It's very rare that any one user is out for that length of time. Even when there's a report of an outage, if the total duration of the outage was 24 hours or 12 hours, whatever the case may be, it's very common that during that period a user may be affected only for 10 minutes or something like that.
Regarding how it is that a user might be down for some number of hours, it really depends on the failure scenario we're dealing with. All users are dual-homed, meaning their data is served from two separate locations, so there is a redundant live copy of their data. Any time there's an outage, the vast majority -- 99-plus percent of people -- don't experience any issue because we automatically fail them over to the other location.
Where there are problems is in the cases where we can't fail that user over [to the backup] for whatever reason -- there's an error with the account, or the master and the slave [copies] are out of sync. So in a few circumstances, we have been unable to fail a user over and we can't restore that user's access to the service until we restore that physical location. This is an area where we're constantly getting better and some of the things we've done as a result of our learnings over the last few months address that.
When I talk to affected Apps administrators, many say they'd like Google to be more transparent in how it acknowledges problems on the Apps Help discussion forum and to offer more details. For example, Salesforce and Amazon have public Web sites where they report in real time the performance and availability status of their hosted services and applications.
We're always striving for more transparency to give our users of all shapes and sizes, whether consumers or the largest [Apps Premier] paying corporations and everyone in between. We systematically and very publicly post on our user forums whenever there is an issue, and offer work-arounds whenever possible.
One of the perceived challenges or issues with transparency is that we run a lot of services and historically we've tried to target the messaging to the people it would affect, so we have discussion forums for Gmail, Google Docs, and so on. We're definitely hearing what people are saying and responding to feedback in that very transparent way and also looking at whether we need a centralized place like Amazon and Salesforce do.
However, both Amazon and Salesforce offer much lower-scale service. That doesn't mean we don't respect what they're doing, but we operate on a much larger scale. The information we give is as or more transparent than what they give. If you go to the Salesforce Trust dashboard, and you click into an issue, it says something like "on this day there was an issue for two hours." Again, I'm not suggesting that's not sufficient, but to suggest that we don't also offer transparency to our users [isn't correct].
After the Gmail outages in August, you offered a service-level agreement credit to your Apps Premier customers and outlined plans for improved communication during problems. How is that going?
We already produce incident reports within 48 hours and share them with our Premier edition customers. You're also seeing more systematic and timely postings of issues through the existing channels. The dashboard, the actual application people can go to [to check outage status], is in the works.
Still, some Apps administrators, especially Standard edition ones, who don't have phone support, want Google to be quicker about posting problem acknowledgements and details in the official Apps forum.
We constantly work to improve the service and, when there are issues, to be more responsive and provide high quality data. Any time someone's unable to access the service, that's a cause for concern and we're highly sensitive to that.
[However,] I'd draw attention to the pragmatic comparison of how often people's corporate mail systems go down, and the five-person company that uses the free Standard edition. Their alternatives are interesting to look at: they can pay hundreds of dollars per seat for a hosted business mail platform from a different provider, or pay tens of thousands of dollars to run their own server. And even if they did that, their uptime guarantees would be less than the actual uptime they're seeing from Apps.
But don't cloud computing providers get onto a slippery slope when they start justifying whatever performance problems they encounter by pointing at the different on-premise software model? A big reason why people go to cloud options is to hand over software installation and maintenance tasks to someone like Google, who offers to do them better and at lower cost. But I don't think they expect to be down for 36 straight hours, at which point they may wish they had the mail server in house.
Absolutely. We would never want somebody to be down for any number of hours, or any number of minutes for that matter. Unfortunately, we're talking about cases on the fringe. Our goal is to be at 100 percent reliability and we're getting ever closer week after week. When you're dealing with literally hundreds of millions of active users or accounts, unfortunately until you get pretty darned close to 100 percent, even when you're at 99.999 percent, there is the reality that a user could encounter an issue with service at some point. The goal is absolutely for the expectation that I can move to Google Apps and this cloud-based service and I'll experience perfect uptime and perfectly reliable service and I'll be really happy about the overall experience.