Originally posted on 4/11/12
Anyone who has spent even a small amount of time in the software industry, whether on the development side or the service side, knows there is but one truth: bug fixes beget bugs, which beget more bug fixes. In IT Service Management language, this translates to changes beget more changes, which beget even more changes. It is a vicious cycle of sorts, and no matter how much cloud vendors tout the pursuit of infinite uptime, something is bound to happen.
Consumers of cloud technologies may be up in arms (and for the most part rightfully so) about losing access to their systems, and in some cases their whole infrastructure. What they really seem to take issue with, though, is that large providers don’t appear to do much better at keeping systems running around the clock, at one hundred percent uptime, than they could do themselves. Is that really true?
Of course, this question is really about how well a third party can keep an organization’s entire IT infrastructure running without a single problem. In reality, zero downtime is all but impossible. Even with unlimited resources, mistakes still happen. Humans will continue to be what they are: imperfect.
Case in point: the Microsoft Azure platform experienced a major outage in February due to a “bug” that miscalculated February 29 of the 2012 leap year. The outage left customers around the world with no service, and essentially no way to run the online and server-related parts of their business. To Microsoft’s credit, they handled the issue relatively quickly, within about eight hours, and though a root cause analysis is still to be conducted, they have identified a preliminary cause and relayed it to their customers.
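To make this class of bug concrete, here is a minimal Python sketch of how a leap-day miscalculation can arise. It is a generic illustration, not Microsoft’s code, and it assumes the common pattern of computing "one year from today" by simply incrementing the year. The function names are hypothetical.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Buggy assumption: the same month and day always exist next year.
    # For February 29 this raises ValueError, because the following
    # year is not a leap year.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Same calculation, but with an explicit fallback for leap day.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # February 29 has no counterpart next year; use February 28.
        return d.replace(year=d.year + 1, day=28)
```

The naive version works on 365 days out of every leap year, which is exactly why a bug like this can sit unnoticed through testing and only surface in production.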
Microsoft is not the only lame duck in the pond, though. Amazon and Google both experienced major outages in 2011, and each company identified bugs, or rather changes, that caused systems to go down. Even RIM, once the wunderkind of smartphone producers, had its own share of disastrous outages in 2011. Aside from RIM, each of these companies eventually fared well thanks to the direct and open way they detected, resolved, and communicated the issue. So, what did they do differently?
While we can’t answer that completely for each of these companies, we have worked with companies that have very similar responsibilities and handle issues such as this with extreme prejudice. Their success comes down to five key strategic elements that together enable them to manage change better than their competitors, even if that competitor is The Cloud. We’ve included those five below. However, if Change Management is a new concept to you, you will want to check out our Practical Overview to Change Management, which is available as a free download.
Detailed, Specific, and Thorough SLA
A service level agreement, or SLA, establishes the parameters you must work within to provide high-quality support to your customers. Even if you are providing service or support to a front-facing product or an external set of customers, an SLA puts everyone’s expectations on the table. It also serves as a map for dealing with change and its perceived impact on the business.
Collaborative and Continuously Updated Disaster Recovery Plan
As the incident with Microsoft’s Windows Azure shows, significant downtime, meaning any scenario that leaves all or part of an organization unable to conduct business, must be planned for. How resources are pooled together, and which systems are critical and must be addressed first, should be part of the plan. This is a living document, however, and it requires members of the entire organization to weigh in.
Dedication to Clear, Concise, and Continuous Communication
Again, Microsoft is the poster child here, but Amazon and Google did a great job as well. RIM, not so much. Even though customers are screaming (and they have that right), there is no reason you shouldn’t be immediately up front about current issues, known causes, and possible solutions. You should also set reasonable expectations about the duration and scope of any incidents or problems. Finally, and this leads into element four, you need a portal for communicating all of this information. As part of a complete IT Service Management solution, a modern Service Catalog will provide this functionality.
Modern, Intuitive IT Service Management Solution
Change is really just one component of a true IT Service Management solution. Your Service Desk, Asset and Configuration Management, and the Service Catalog will all play a role in making sure any issues caused by a change, or “bug,” are identified quickly, traced back to their origin, and then communicated clearly to all users. For more information on an IT Service Management solution based on ITIL best practices, take a look at ChangeGear. Our five-minute Quick Tour is a great place to start.
Comprehensive IT Service Management Automation
This last piece is perhaps the most essential, mainly because the human component, with its mistakes and errors, is inevitable. Automation, specifically the mechanisms in place for notification, will make the difference between an issue that is identified and addressed quickly (Microsoft, Amazon, and Google) and one that might go on for weeks completely unnoticed (RIM). The key is that your IT Service Management solution (see element number four above) must have modern features that provide three things: automatic and robust reporting that gives managers the information they need to make strategic and informed decisions; SLA automation and linkage that helps identify incidents that may expand outside agreed thresholds; and historical audit-tracking capabilities that allow your engineers to pinpoint when and where an issue occurred.
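As a rough illustration of the SLA automation described above, the following Python sketch flags open incidents that are approaching their agreed resolution window so that someone can be notified before the threshold is breached. It is a minimal sketch under stated assumptions, not how ChangeGear or any particular product implements this: the Incident fields, the at_risk function, and the 80 percent warning level are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    id: str                 # hypothetical incident identifier
    opened_at: datetime     # when the incident was logged
    sla_limit: timedelta    # resolution time agreed in the SLA


def at_risk(incidents: List[Incident], now: datetime,
            warn_fraction: float = 0.8) -> List[str]:
    """Return ids of incidents that have used at least
    warn_fraction of their SLA resolution window."""
    flagged = []
    for inc in incidents:
        elapsed = now - inc.opened_at
        if elapsed >= inc.sla_limit * warn_fraction:
            flagged.append(inc.id)
    return flagged
```

In practice a check like this would run on a schedule and feed a notification mechanism, so that an incident drifting toward its threshold reaches a person rather than waiting to be noticed.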