The Internet has transformed the landscape of business. Traditional high availability is no longer good enough; key applications must be accessible at all times for businesses to survive and thrive in today's highly competitive and dynamic environment. Meeting these higher availability demands requires a well-thought-out strategy that accounts for the increasing complexity of enterprise application infrastructures. Data centers and systems now span the globe, integrating disparate business processes. Delivering application availability across these interdependencies is increasingly difficult. This complexity makes it difficult to assess, a priori, how vulnerable an application is to failure. Designing your application infrastructure for continuous availability, therefore, begins with the architecture. Unfortunately, too many enterprise applications today are deployed with high availability as an afterthought. The result is a proliferation of point products and solutions throughout an enterprise. This situation not only complicates management but also makes it hard to respond rapidly during crises.
The goal of the traditional High Availability (HA) architecture is to mitigate or prevent application downtime or outages due to failures caused by application errors or any infrastructure failure. Disaster recovery primarily deals with falling back on the secondary site in case of a failure at the primary site. With globalization and the Internet driving application access from all corners of the world, making applications available all the time is far more important than ever before. In contrast, continuous application and data availability recommends a unified enterprise architecture and operational model that embraces application awareness, which is the foundation for keeping business-critical applications and data available all the time. The goal of Continuous Application and Data Availability is to provide a resilient business computing infrastructure, applications, and data, enabling customer access to business-critical applications all the time with no interruptions or downtime.
This article presents a new way of achieving Continuous Application and Data Availability through strategic architecture and design thinking.
Application Resiliency is Critical to Business Success
Although technologies have improved greatly during the past decade, achieving application high availability is still fraught with complexity and high costs. Why? All applications must be shut down for patching, version upgrades, and deployment of new features. Data corruption, server, storage, and network failures, application errors, faults in the dependent software stack, and human errors cause further outages. While planned outages last from a few minutes to several hours, depending on the outage, unplanned outages caused by the application infrastructure or the application itself last from a few minutes to an hour or so, depending on the failure and the complexity of the application infrastructure.
There are two dimensions to application resiliency: Continuous Availability of the business computing infrastructure and the mechanism to fail over to secondary sites in case of failure. Enterprises often use High Availability (HA) and Disaster Recovery (DR) interchangeably to define very similar requirements, but they are quite different in scope and size. HA focuses on overcoming technology failures such as network and storage failures. DR focuses on overcoming the physical data center or infrastructure disasters. Both HA and DR focus on ensuring that applications are available 24x7 with zero or minimal downtime caused by planned or unplanned outages.
Failing to deploy appropriate HA and DR strategies for enterprise applications results in significant losses to enterprises, either in the cost of downtime or in the impact on their reputations and customer experience, and imposes hurdles on long-term enterprise objectives and progress.
Let us discuss the challenges of keeping enterprise applications and data centers resilient. Current HA and DR solutions that protect applications in the enterprise are very expensive and add to the complexity of an already complex application infrastructure. The biggest challenge faced by many enterprises is to find the balance between the cost of achieving the resiliency of application infrastructure and the cost of losing customers or business due to application failures. Architects or operations staff who have little or no insight into the business impact of such failures tend to put together point solutions to meet the HA and DR needs of an enterprise as a tactical plan.
As companies look to deploy high-availability systems to protect their most critical data and applications, they are soon overwhelmed by the complexity of creating and deploying the clustered solutions needed to achieve their high-availability goals. These choices are primarily driven by how quickly an application must be back in operation following a failure, called the Recovery Time Objective (RTO), the project budget, and the cost of using data replication or shared storage. The scalable availability curve below shows the relationship between the level of protection that can be achieved from each configuration relative to the cost and complexity of implementation. As you move up the curve, more protection is afforded, but at higher cost and a higher degree of complexity. Exploring what each of these configurations delivers will help us better understand why old models don't scale well for dynamic and agile 21st-century enterprises.
At the low end of the availability curve, disk-to-disk backup and restore has emerged as a viable data-protection mechanism supported by falling drive prices and growing disk capacity. Having a back-up copy of critical production data on a local data volume greatly reduces time to recover over traditional tape archive systems. Using a simple mount of the backup volume, files needed for recovery can be dragged and dropped back onto the production server. Systems can also be configured to perform point-in-time data snapshots if desired. For customers looking for better data protection with quicker recovery times, disk-to-disk backup is an ideal solution. When combined with an existing tape archive solution, you can build disk-to-disk-to-tape solutions that offer the best of on-site, real-time data protection and recovery with off-site archival.
The growing complexity of applications and the growing size of data limit the usefulness of backup and restore as a DR strategy: it is very time consuming, resulting in longer application downtime. Though it is a time-tested and proven approach, it provides little value for applications that drive revenue for enterprises. A backup-and-restore strategy is acceptable only if your application can tolerate hours or days of downtime.
If protection of applications against planned or unplanned downtime is desired, then adding in high-availability clustering with its ability to monitor servers and applications and take automated recovery actions between servers in a cluster should be explored. HA clusters can be deployed in one of several configurations, again depending on recovery requirements and budget. Data replication and HA clustering can be combined to build what is called a "shared nothing cluster". The clustering technology ensures that applications and servers are operational and can perform a failover from one server to another in the event that any problems are detected. The data replication software handles mirroring data needed by the application between the servers so that no matter which server the application is running on, the data is available to it. The data replication can occur across either a LAN or across a WAN, depending on where the servers in the cluster are located.
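The monitor-and-fail-over behavior described above can be sketched in a few lines. This is a minimal illustration, not how any particular clustering product works: the `FailoverMonitor` class, its method names, the node names, and the 5-second timeout are all hypothetical, and the caller supplies timestamps so the logic stays easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FailoverMonitor:
    """Hypothetical heartbeat-based monitor for a small failover cluster."""

    nodes: List[str]          # ordered by failover preference
    timeout_seconds: float    # silence longer than this counts as a failure
    active: str = field(init=False)
    last_heartbeat: Dict[str, float] = field(init=False)

    def __post_init__(self) -> None:
        if not self.nodes:
            raise ValueError("at least one node is required")
        self.active = self.nodes[0]
        self.last_heartbeat = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_heartbeat[node] = now

    def is_healthy(self, node: str, now: float) -> bool:
        """A node is healthy if it has reported within the timeout window."""
        seen = self.last_heartbeat.get(node)
        return seen is not None and now - seen <= self.timeout_seconds

    def check(self, now: float) -> str:
        """Return the node that should run the application at time `now`.

        If the active node has gone silent, fail over to the first healthy
        standby. If no standby is healthy, the active node is left unchanged.
        """
        if self.is_healthy(self.active, now):
            return self.active
        for candidate in self.nodes:
            if candidate != self.active and self.is_healthy(candidate, now):
                self.active = candidate
                break
        return self.active


monitor = FailoverMonitor(nodes=["node-a", "node-b"], timeout_seconds=5.0)
monitor.heartbeat("node-a", now=0.0)
monitor.heartbeat("node-b", now=0.0)
monitor.heartbeat("node-b", now=8.0)   # node-a has stopped reporting
print(monitor.check(now=10.0))         # node-a is past the timeout, node-b is not
```

The sketch only decides where the application should run. Restarting the application on the standby and making its data available there, which is the job of the data replication software in a shared-nothing cluster, are outside its scope.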
Because failures are caused by a number of factors, adding 9s to your application availability requires adding more nodes to reduce the chances of failure of the backup nodes. The complexity of your clustered application increases with the number of nodes you add to your application cluster. With the addition of more moving parts to your application, availability of applications is now a function of all nodes within your application and their interaction multiplied by the number of times you replicate this infrastructure to increase the availability of such clusters. In addition to this, the main challenge with the clustering technologies available today is that the time it takes to recognize the failure, switch over to the failover node, and restart the application instances is significant. Though a clustering approach is easily implementable with commodity hardware, it increases the complexity of the application infrastructure as well as the cost.
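The arithmetic behind "adding 9s" by adding nodes can be sketched as follows. This is a rough illustration rather than a sizing method: it assumes node failures are independent and that failover is instantaneous, which the paragraph above notes is not true in practice, so the results are best read as an upper bound. The function names and the 99 percent per-node figure are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def redundant_availability(node_availability: float, nodes: int) -> float:
    """Availability of `nodes` redundant nodes when any single one can serve.

    Assumes independent failures and instantaneous failover, so the cluster
    is down only when every node is down at the same time.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 - (1.0 - node_availability) ** nodes


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical example: each node is available 99 percent of the time.
for n in (1, 2, 3):
    a = redundant_availability(0.99, n)
    print(f"{n} node(s): availability {a:.6f}, "
          f"about {downtime_minutes_per_year(a):.1f} minutes of downtime per year")
```

Each added node multiplies the theoretical unavailability by the per-node failure probability, but detection, switchover, and restart time eat into that gain, which is why the cost and complexity grow faster than the availability actually delivered.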
At the other extreme, there exists a solution purely based on expensive fault-tolerant hardware. In the world of commoditization you can take a more expensive path to investing in fault-tolerant hardware. But this increases the complexity of managing the infrastructure as well as hiring and retaining the specialized operational staff to keep it operable and meet business goals.
Regardless of how large an investment is directed to building silos of HA and DR solutions, none of these is designed with application awareness in mind. As a result, these silos introduce infrastructure complexity and operational challenges. For example, VMware HA, Red Hat Cluster Suite, or any traditional HA and DR tools are primarily designed to monitor and protect individual components within the system. These tools have no understanding or knowledge of what is happening within the application, creating a discrepancy in achieving application-level SLAs. To achieve 99.9 percent availability at the application level, each dependent component and management software needs to be available at much higher rates, making the cost prohibitively high. In the following section, I recommend using Enterprise Architecture as a strategy to define and direct a uniform foundation and platform for delivering Continuous Application and Data Availability for business applications, depending on the level of risk and revenue impact or loss associated with each application.
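A short calculation shows why each component must be available at a much higher rate than the application-level target. It is a sketch under stated assumptions: the application is up only when every component it depends on is up, component failures are independent, and the count of ten components is a hypothetical figure chosen for illustration. The function names are likewise illustrative.

```python
import math
from typing import Iterable


def serial_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of an application that needs every component to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(component_availabilities)


def required_component_availability(target: float, components: int) -> float:
    """Per-component availability needed to hit `target` end to end,
    assuming `components` equally reliable components in series."""
    if components < 1:
        raise ValueError("components must be at least 1")
    return target ** (1.0 / components)


# Hypothetical stack of ten dependent components, each at 99.9 percent.
stack = [0.999] * 10
print(f"end-to-end availability: {serial_availability(stack):.4%}")

# What each of the ten would need in order to deliver 99.9 percent overall.
needed = required_component_availability(0.999, 10)
print(f"required per component:  {needed:.4%}")
```

Under these assumptions, ten components at 99.9 percent each deliver only about 99.0 percent end to end, and reaching 99.9 percent end to end requires roughly 99.99 percent from every component.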
Continuous Availability as Enterprise Architecture Strategy
Enterprise Architecture (EA) is the organizing logic for business processes and IT infrastructure reflecting the integration and standardization requirements of the company's operating model. The operating model outlines the expectations for integration and standardization across business units. The Enterprise Architecture delineates the key processes, systems, and data composing the core of a company's operations. Enterprise Architecture directs a cohesive platform for execution.
Enterprise Architecture is broad in scope and strategic in focus. The key to effective Enterprise Architecture is to identify the processes, data, technologies, and customer interfaces. An effective EA program that captures the continuous availability of applications and data protection not only orchestrates coherence between business and Information Technology functions, but also provides clear business value. It helps enterprises attain capabilities that move the organization towards its goals. Architects drive the creation of product roadmaps and unified architecture to reduce the complexity of application infrastructure, minimize risk, and maximize capabilities and efficiency.
Defining the Enterprise Architecture and how the HA and DR would be delivered across all applications and application delivery infrastructure is critical to reducing cost and complexities associated with deploying many point products. Typically, solutions delivered by operating system vendors are primarily focused on generalized solutions, making it hard to implement Application Awareness for enterprise applications like SAP or Oracle. In this case, applications are very complex with a majority of them deployed as n-tier architecture - applications running across multiple servers and multiple networks. Making all these components meet your SLA goals collectively is a daunting task. In addition, it quickly impacts the availability of your business-critical applications. Application-aware HA/DR solutions are designed and architected with applications in mind. These products sense and respond to application-specific events, thereby providing architects with a unified, cohesive architecture to plug into their EA, delivering one common management framework across the enterprise. This not only reduces complexity, but also allows administrators to respond more quickly to improve the performance of their application.
Lastly, focusing on the operating model rather than on individual business strategies provides organizations with better guidance for development and a more stable platform for reliable application delivery infrastructure. This stable platform, or foundation, enables IT to be a proactive rather than a reactive force in identifying future strategic initiatives. Attention shifts from firefighting to innovation. This transformation drives simplicity: just enough complexity to become agile.
Application Infrastructure & Design Thinking
There are two challenges for application availability: ensuring that the application is up and running all the time, and keeping the data available to the application when needed. Application and data availability issues can be addressed more elegantly, without introducing unnecessary complexity, by following the design thinking model described below. Make sure you wear six design hats while building and deploying your business-critical application.
- Design for Mobility: Application availability depends on the availability of the infrastructure on which these applications are deployed. If you are hosting multiple applications on the same hardware or server, or sharing the same storage or network, then your application availability depends on the impact caused by other applications. In addition, each application depends on operating system-specific patches or configurations, which may in turn impact other applications. So, build your enterprise architecture to include mechanisms or models to capture and deploy your application descriptions along with a mechanism to isolate one application from the other. You can use a combination of techniques, such as chroot on Linux together with a configuration management tool, to capture and audit your application configurations. Defining the application isolation and dependencies clearly enables you to quickly migrate your application from one server to the other. This will also enable you to move your applications easily from physical to virtual environments.
- Design for Data Availability: Moving data along with the application is a time-consuming task. Sometimes, sending the disk via FedEx is much cheaper than copying it. So, your application architecture strategy should include mechanisms to replicate critical application data to multiple locations. Locating your application closer to your data is much more effective than moving the data closer to the application. Continuous Data Replication, combined with an independent data verification and audit process, ensures that you can move the application to any backup or failover location with a click of the mouse button. This option gives you a higher degree of flexibility by proactively protecting your application data. By using virtualization or standby servers, applications can be started at alternate locations to satisfy a wide range of recovery point and recovery time objectives.
- Design for Elasticity: Performance degradation can be more problematic than application outages. An application that is taking too long to respond and an application that is not available are treated the same way by the user. Both of them offer a poor user experience. Users are more frustrated with slow response. Design for application elasticity; i.e., grow or shrink the application footprint based on demand. When you design for elasticity you are also designing your application for continuous availability. Virtualization and cloud-based delivery models are strategic drivers for your application architecture. Though virtualization can provide a quick and easy answer for your application high availability, it doesn't really solve your application-awareness issue. A hypervisor has no knowledge of application dependencies. A Virtual Machine Manager (VMM) or hypervisor can supervise hosts and migrate VMs, but moving the whole application is more challenging than moving a single VM. All this complexity adds to your application Recovery Time Objective. HA solutions offered by hypervisor vendors provide you with a poor man's clustering strategy. Defining your application failover strategy within your enterprise architecture is key to achieving your application resiliency.
- Design for Security: Because moving a physical server is more complex and time consuming than moving VMs, you can emulate a virtual, fault-tolerant application infrastructure by running one VM on each physical box and using the HA infrastructure of your choice, combining reliable application-aware HA tools like SIOS SteelEye Protection Suite (SPS) with live VM migration tools like VMware vMotion or XenMotion. With this approach, you can build more reliable and dependable application protection architectures to increase your application resiliency.
- Design for Simplicity: In case of a disaster or application failover, a majority of the solutions we put in place didn't work during the crisis. Now, we have better technologies and better tools to manage this process. However, our biggest enemy today is the complexity and cost of managing the identical server configuration and dependencies. By defining the enterprise-wide strategy for application description and packaging as part of the enterprise architecture, one should be able to simply record and play this back at the remote site, assuring application availability.
- Design for Testability: Periodically switch your application between primary and secondary sites. For example, you can run your application during the daytime on the primary site and during the nighttime on the secondary site. This not only preempts problems, but also gives you an opportunity to explore new ways to improve your application performance. When you design your applications for automatic movement, you are making them ready for virtualization and, hence, ready for cloud-based delivery models. More importantly, testing for DR and HA is not complex, but it is a very disruptive process. By designing for AAR, you eliminate the need for DR testing.
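The day/night rotation suggested under Design for Testability can be expressed as a simple schedule. The sketch below is illustrative only: the 08:00 to 20:00 daytime window, the site labels, and the function names are assumptions, and the actual switchover would be performed by whatever HA tooling is in place.

```python
from datetime import time


def scheduled_site(now: time,
                   day_start: time = time(8, 0),
                   day_end: time = time(20, 0)) -> str:
    """Return the site that should host the application at `now`.

    Daytime (day_start inclusive, day_end exclusive) runs on the primary
    site; the rest of the day runs on the secondary site. The default
    window is a hypothetical choice.
    """
    return "primary" if day_start <= now < day_end else "secondary"


def needs_switch(current_site: str, now: time) -> bool:
    """True when the application is not running where the schedule says."""
    return current_site != scheduled_site(now)


print(scheduled_site(time(12, 0)))            # midday falls in the daytime window
print(needs_switch("primary", time(23, 30)))  # late evening is scheduled on secondary
```

Because the switch happens every day, a failover path that has silently broken shows up within hours rather than during a real disaster.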
From my experience of managing large-scale deployments and operations for the past 20+ years, I can tell where operations teams spend most of their time. Application change management and applications broken by scripting errors or infrastructure problems consume a good part of the IT budget. According to EMA, close to 80 percent of costs are attributed to application change management and control processes to prevent application failures. By defining the right HA and DR strategy with the proper architecture, enterprises can redeploy these non-productive expenses to more revenue-generating opportunities by funding innovation and R&D.
I encourage IT executives to debate their company's operating model to articulate a vision for how the company will operate and how those operations will distinguish the company from its competition. By engaging in this conversation, IT executives provide critical direction for building a foundation for execution. Continuous availability of applications and data is key to this foundation.
Reddy is most recently known for his position as vice president of Yahoo!'s cloud computing and virtualization business unit. Working in close coordination with senior executive teams and architects, Reddy developed Yahoo!'s organization-wide cloud computing strategy. Reddy is a technology leader in data center architecture, implementation and operational strategies who began his career in Silicon Valley working on data center server consolidation and grid computing solutions for such Fortune 500 enterprises as Oracle and later as a co-founder of Optena Corp. He also architected and delivered a cloud-based supply chain platform at Mitrix.