On June 29, Rackspace experienced a localized outage in its Dallas-Fort Worth, Texas (DFW) data center, its first outage in more than two years. The following week, the same data center experienced another outage, this time shorter in duration and affecting a smaller number of customers. What resulted was an important learning opportunity for the team at Rackspace, from technical preparations to communication best practices. Throughout the outages, the number one priority was putting our customers first: maintaining Rackspace’s integrity and customer relationships while also developing an ambitious plan to reinvest in our infrastructure to prevent future outages.
Rackspace operates nine data center facilities across the world, including the one in DFW. Over a period of ten days, Rackspace experienced two power disruption incidents in one of the three phases of that data center, disruptions that could not be prevented by the redundant design of Rackspace’s systems. While limited to a single phase of the data center, the disruptions did impact the portion of our customer base served by that phase.
How Rackspace Responded
As is always the case, Rackspace’s first priority was getting its customers back up and running. Customer uptime is a principle at the heart of Rackspace Fanatical Support. Rackspace and its team of Rackers (Rackspace’s term for its employees) took an “all hands on deck” approach to remedying the situation and minimizing the impact of the outages. For Rackspace, this meant calling in teams who were not scheduled to work and dedicating extra hours to make sure that customer issues were addressed in a timely manner. When the main phone lines were busy, Rackers turned to Twitter as a vehicle for customer communication, supplying mobile phone numbers to ensure customers had as many points of contact into Rackspace as possible.
While the team in the DFW data center worked tirelessly to identify the source of the disruption and resolve it, Rackers on the support teams turned to social media channels, like Twitter and Rackspace’s corporate blog (www.rackspace.com/blog), to keep customers up-to-date on the progress made in returning to normal operations. Transparency and regular communication were important, and Rackers on Twitter responded directly to customers and made regular updates via the @Rackspace handle. After Rackspace obtained details and a root cause analysis of these disruptions, CEO Lanham Napier, who strongly believes that forthcoming and honest communication is the best way to maintain customer trust and satisfaction, posted a video blog to the Rackspace blog explaining the cause of the outages. Another central tenet of Fanatical Support is taking responsibility. Regardless of the source of these disruptions, Rackspace did not make excuses or point fingers. Instead, Rackspace took responsibility and fully honored its service level agreements (SLAs) with its customers. Rackspace has the best SLAs in the industry and will not hesitate to make things right with its customers when there is a disruption.
What Rackspace is Doing Now
Rackspace has learned from this experience and has developed a plan of action:
1. Put the best people on it, and bring in the experts. A team of the best Rackspace talent from the US and the UK has been brought together to focus on the issues. The team will be joined by top talent from our vendors, as well as knowledgeable outside consultants, to ensure all issues, known and unknown, are considered and resolved.
2. Assess the status of the infrastructure. The Rackspace team is combing through the data center and assessing every link in the chain.
3. Improve standard operating procedures. Rackspace will increase the frequency of its testing, monitoring and measurement programs within the data center. Maintenance schedules will change, and the level of detail reviewed internally and shared externally will increase.
4. Invest. Rackspace will continue to invest in its data center infrastructure. Investment in additional information systems, as appropriate, will also be made to support new measurement and management procedures.
While no data center is risk-free, managing and mitigating the risk to acceptable levels is paramount at Rackspace. In the case of the Rackspace DFW data center, the power infrastructure has been stabilized, although Rackspace will continue to be hyper-vigilant in monitoring and responding to any irregularity.
Rackspace’s goal is to use this unfortunate experience to grow better and stronger, both in technical innovation and in customer relationships. Based on initial feedback from customers and the industry, everyone at Rackspace is proud of how the situation was handled and how the company communicated about the outages. Even with a temporary setback, it is more important to Rackspace to maintain long-term credibility and trust as a world-class hosting organization that will continue to evolve and learn from these experiences.
Jacques Greyling, operations director at Rackspace, is responsible for a variety of groups, including network security, managed backup, SAN infrastructure, DC engineering, business services and operations. Greyling is an eight-year veteran of Rackspace. He previously held engineering roles at organizations including Nissan and Datacentrix.