Disaster recovery (DR) comprises the policies and processes put in place to keep IT infrastructure functioning during or after a natural or human-induced disaster. Data center managers realized long ago how much their organizations depend on their computer systems, and they developed procedures to ensure the continuity of business functions. These plans and strategies are tied to key metrics such as recovery point objective (RPO) and recovery time objective (RTO), which are specified for particular business processes (such as payroll, service ticketing, order processing, online banking and smartphone applications) and mapped to the underlying IT systems and infrastructure that support those processes.
As IT systems have become increasingly critical to the smooth operation of companies and their customers, the importance of keeping those systems and applications running, and of recovering them rapidly, has increased. In fact, most companies and organizations have gone so far as to build multiple data centers to mitigate the effect of a disaster in any single one, on the assumption that if something does go wrong in one, IT functionality will fail over to another instantaneously. As a result, companies today operate multiple data centers in a variety of geographies, each able to handle the peak demand of each individual application at any given time. Companies have also built redundancy into each data center so that the failure of a single component won’t result in an application outage.
But even with all this hardware and infrastructure in place, failover (or disaster recovery) procedures still require error-prone manual intervention 80 percent of the time, because “automated” failover procedures don’t work as planned and the applications go down. A study by Symantec found that 25 percent of disaster recovery failover tests fail completely, before manual intervention even comes into play.
Disaster recovery planning not only defines the steps to be taken in the event of a disaster but also outlines the potential challenges to the plan, such as storage and backup management, migration among physical and virtual servers, software and hardware upgrades, data protection and security and, of course, cost.
It may appear that DR planning is a very difficult task that offers no guarantee that things will go as planned. That does not mean, however, that rigorous DR planning is futile or that it is acceptable for failover procedures to work only occasionally. In fact, the procedures should be updated with every change and tested after every update. But because of the “risk” of application downtime and the uncertainty about whether a disaster will ever actually happen, these procedures and the backup data centers sit dormant and untested, running idle and consuming lots of power.
Even though there are many possible causes of application downtime, those caused by IT equipment have been virtually eliminated by the virtualization of the physical IT infrastructure. Because single points of failure in the servers, storage and networking systems have been minimized or eliminated, hardware failures rarely cause any application downtime today. System software has seen similar improvements in both stability and self-recovery capabilities, making the “soft crash” a rare occurrence as well. But even with all the IT improvements, downtime is still higher than organizations want or can accept.
Now the problem is power. More than half of all application outages are now caused by problems related to power, and the impact of power quality and reliability on application availability will only get worse as disturbances on the electric grid and its distribution system become more frequent and longer in duration.
The reason power is now the primary cause of application downtime is the very success the IT organization has had in minimizing or eliminating single points of failure throughout the application infrastructure. The software-defined layers of abstraction created by virtualizing servers, storage and networking systems have dramatically reduced application outages caused by hardware failures. Unfortunately, the transfer switches, uninterruptible power supplies (UPS) and backup generators in the data center have not seen the same improvements. As a result, the percentage of outages that are power-related has risen sharply, and any power outage leads to a lengthy recovery procedure for the application.
So the question is: why didn’t the power infrastructure follow a similar path? Why can’t we abstract applications from power dependencies the same way they are abstracted from IT hardware dependencies?
This is where software-defined power (SDP) comes into play. SDP abstracts an application from the physical power infrastructure, leveraging IT resources in multiple geographies and data centers, and moves the application to wherever power is most reliable (and most cost efficient) at any point in time. By making it possible to move applications from one location to another, SDP lets data center operators proactively shift an application away from potential issues instead of waiting for an event to happen and hoping that disaster recovery procedures handle it automatically. As noted earlier, if the automated “after-the-fact” procedures fail 80 percent of the time, maybe it’s time to use an ongoing, proactive management approach as part of standard operating procedures.
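The placement decision described above can be illustrated with a minimal sketch. It is not how any particular SDP product works: the `Site` fields, the reliability floor and all of the figures are hypothetical, invented purely to show one plausible way of ranking locations by power reliability first and cost second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    power_reliability: float  # hypothetical score between 0 and 1
    cost_per_kwh: float       # hypothetical energy price in USD


def choose_site(sites, min_reliability=0.99):
    """Return the cheapest site whose power reliability meets the floor.

    If no site meets the floor, fall back to the most reliable one.
    """
    if not sites:
        raise ValueError("no candidate sites")
    eligible = [s for s in sites if s.power_reliability >= min_reliability]
    if eligible:
        return min(eligible, key=lambda s: s.cost_per_kwh)
    return max(sites, key=lambda s: s.power_reliability)


# Illustrative values only; none of these numbers come from real sites.
sites = [
    Site("Sacramento", power_reliability=0.95, cost_per_kwh=0.11),
    Site("Ashburn", power_reliability=0.999, cost_per_kwh=0.09),
    Site("Dallas", power_reliability=0.995, cost_per_kwh=0.07),
]
best = choose_site(sites)
print(best.name)
```

Treating reliability as a hard floor and cost as the tiebreaker is only one possible policy; a real system would presumably also weigh capacity, latency and service-level requirements.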
SDP automates the original disaster recovery procedures and turns them into standard operating procedures. It moves applications from one location to another, extends the procedures with power and utility intelligence, integrates with all load-balanced, virtualized environments and infrastructure management components, and dynamically adjusts the equipment as needed to support any given application load across all data centers. Manual overrides allow operators to push applications proactively to other locations to avoid disasters or for maintenance and test purposes.
Once configured with the service level and other application requirements, SDP continuously and automatically tests disaster recovery procedures while optimizing resource levels, both within and between data centers, thereby increasing application availability and reducing operational costs.
SDP uses runbooks to automate application load shifting and shedding, which involve multiple parallel tasks drawn from the standard operating procedures of both the IT and facility systems. Any runbook, whether event-driven, ongoing or scheduled, can be tested and refined until it runs securely and without errors.
Integral to its design are advanced verification features that not only ensure fail-safe operation but also continuously test DR procedures, making them much more effective. Because load shifting does not occur until the availability of the destination has been verified, the process is risk free, and when disaster does strike, the chances of a smooth transition are dramatically improved.
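The verify-before-shift rule can be sketched in a few lines. This is an illustration under stated assumptions, not a description of an actual SDP implementation: `shift_load`, the `placement` mapping and the `is_available` health check are all hypothetical names.

```python
def shift_load(app, placement, destination, is_available):
    """Move `app` to `destination` only if the destination check passes.

    `placement` maps application name -> current site.
    `is_available` is a caller-supplied callable (a hypothetical health
    check) that returns True when the destination can take the load.
    Returns True if the shift happened, False if it was refused.
    """
    if not is_available(destination):
        return False  # refuse the shift; the application stays where it is
    placement[app] = destination
    return True


# Illustrative use: only Ashburn is reported healthy in this toy example.
placement = {"order-processing": "Sacramento"}
healthy_sites = {"Ashburn"}

moved = shift_load("order-processing", placement, "Ashburn",
                   lambda site: site in healthy_sites)
refused = shift_load("order-processing", placement, "Dallas",
                     lambda site: site in healthy_sites)
print(moved, refused, placement["order-processing"])
```

Because a failed check leaves the placement untouched, every attempted shift also doubles as a test of the destination, which is the property the paragraph above relies on.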
So while virtualization, load balancing and data center infrastructure management solutions are necessary for maximizing application reliability in an energy-efficient data center, they are not sufficient, because none of them abstracts your applications from power or can proactively shift a complete environment from one location to another. It should come as no surprise, then, that power is the next (and last) resource waiting to become software defined. With other “software-defined” abstractions, the software can dynamically change and allocate available capacity; with power, this is not possible. While you cannot directly adjust how much power goes to a rack or outlet, you can change the power consumed in any rack or outlet by shifting the workload.
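A rough worked example shows how shifting workload changes the power drawn by a rack. The linear model below (idle draw plus a utilization-proportional share of the remaining range) is a common simplification assumed here for illustration; it does not come from this article, and the wattage figures are invented.

```python
def rack_power_watts(idle_w, peak_w, utilization):
    """Estimate rack power draw with a simple linear model (an assumption).

    idle_w:      draw with no workload, in watts
    peak_w:      draw at full utilization, in watts
    utilization: fraction of capacity in use, between 0 and 1
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_w + (peak_w - idle_w) * utilization


# Hypothetical rack: 2,000 W idle, 5,000 W at full load.
before = rack_power_watts(2000, 5000, 0.8)  # workload running locally
after = rack_power_watts(2000, 5000, 0.2)   # most workload shifted away
print(f"before: {before:.0f} W, after: {after:.0f} W, "
      f"saved: {before - after:.0f} W")
```

Under this model the power delivered to the rack is never set directly; it falls only because the workload, and therefore the utilization, has moved elsewhere.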
In addition to increasing availability by affording greater immunity from unplanned downtime caused by volatile power sources, shifting application workloads across data centers also makes it easier to schedule the planned downtime needed for routine maintenance and upgrades within each data center. Together these improvements have the effect of maximizing application uptime with absolutely no adverse impact on service levels, performance or quality of service.
In fact, just recently we received a “tornado warning” from our utility company in Sacramento, forwarded to us by our co-location services provider late in the afternoon. We pressed a manual override button on our dashboard that shifted our application load to Ashburn, VA. In less than 3 minutes and without interruption to users, we eliminated any potential risk from the tornado, and we did not have to watch our service monitor all evening for potential outages or failing disaster recovery procedures. Once the tornado warning was lifted, we moved the application load back to Sacramento. That’s how SDP eliminates power risk: proactively.
SDP offers a lot more functionality, and being able to shift applications seamlessly from one location to another in less than five minutes brings many more operational benefits, but those are the subject of a future article.
About the Author
Clemens Pfeiffer is the CTO of Power Assure and is a 25-year veteran of the software industry, where he has held leadership roles in process modeling and automation, software architecture and database design, and data center management and optimization technologies.