
Tuesday, 30 June 2015

Fault Tolerance in Virtualized Data Centers

Written by James Dyckowski
Leveraging the Resilience of FT and HA Solutions

Most enterprises face challenges in managing data effectively. One of today’s dominant answers to that challenge continues to be virtualization, partly because it fits well with consolidation, an ongoing trend in data centers. As more servers are virtualized, fewer physical machines are needed, allowing IT administrators to consolidate disparate servers onto shared hardware.

When you look at the impressive and well-documented benefits of virtualization, it becomes obvious why this solution is still so popular. Whether we’re talking about storage servers or application servers, the merits of virtualization include:

  • Centralized manageability
  • Optimal use of resources
  • Lower total cost of ownership
  • “Greener” data center
    • Decreased power consumption
    • Reduced data center footprint
    • Less noise

While it’s clear that virtualization and server consolidation will continue to increase, enterprises still must exercise proper care in deployment to steer clear of possible pitfalls when using these powerful tools. Consolidated servers, for example, put many eggs in one basket: if a single physical server goes down, every workload it hosts is at risk of going down with it.

This is why the concept of “fault tolerance” (FT) becomes so important—it offers a solution to this threat. To that end, some companies, such as VMware, have introduced FT technology into their latest product lines. Let’s examine two new complementary FT technologies that have entered the marketplace, and compare some of their similarities and distinguishing features.

Fault Tolerance in Virtual Machines

For IT administrators, there are two top concerns in any enterprise data center: data integrity and service continuation. If either of these areas becomes compromised or interrupted, it can lead to serious consequences, including loss of service availability and/or data loss that is beyond recovery.

In data centers that deploy virtual infrastructures with more than one virtual server housed on a single physical server, avoiding these challenges can be even more difficult. Think about what happens when the sole physical server goes down—it puts all of the connected virtual application servers promptly out of commission. This led to the development of an initial FT solution designed to remove this potential problem with virtual machines (VMs).

The FT solution works like this: in a data center that hosts multiple virtual servers, enabling FT for one of those servers automatically creates a copy of that virtual server. The copy runs on another physical server chosen by an automatic scheduler, which selects the most appropriate available system to host the secondary VM; alternatively, the administrator can select a server from a list of available systems. The figure below shows the general framework for this FT solution.

[Figure 1: General framework of the FT solution]
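To make the scheduler’s role concrete, here is a minimal Python sketch of the kind of placement decision it makes when choosing a host for the secondary VM. Every name and attribute below (host CPU model, free CPU and memory) is a hypothetical illustration, not any vendor’s actual API.

```python
# Illustrative placement logic for the secondary VM. Hypothetical names only.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    cpu_model: str      # FT solutions typically require compatible CPUs
    free_cpu_mhz: int
    free_mem_mb: int

def pick_secondary_host(hosts, primary_host, vm_cpu_mhz, vm_mem_mb):
    """Choose the most appropriate available system to host the secondary VM."""
    candidates = [
        h for h in hosts
        if h.name != primary_host.name             # never co-locate with the primary
        and h.cpu_model == primary_host.cpu_model  # the identical-code requirement extends to CPUs
        and h.free_cpu_mhz >= vm_cpu_mhz
        and h.free_mem_mb >= vm_mem_mb
    ]
    if not candidates:
        raise RuntimeError("no eligible host for the secondary VM")
    # Prefer the candidate with the most headroom left after placement.
    return max(candidates, key=lambda h: (h.free_cpu_mhz - vm_cpu_mhz,
                                          h.free_mem_mb - vm_mem_mb))
```

A production scheduler weighs many more factors (affinity rules, datastore connectivity, network reachability), but the shape of the decision is the same.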

This FT solution can be enabled for an application server regardless of the operating system (OS) it runs. Because the solution is both OS-independent and application-independent, the administrator need not modify the OS or the application to support FT. That’s the upside. On the downside, successful deployment requires administrators to meet several specific hardware requirements (including the CPU) and to satisfy rigid requirements that the primary and secondary VMs run identical, synchronized OS and application code.

To meet that synchronization requirement, one manufacturer of this solution uses an innovative technology to ensure that the two VMs execute identical code. Over a high-speed network link, the primary VM delivers nondeterministic inputs, such as keyboard, mouse, and network events, to the secondary VM. The following figure shows the data stream that the primary VM ships to the secondary VM.

[Figure 2: Data stream shipped from the primary VM to the secondary VM]
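The record-and-replay idea behind this stream can be sketched in a few lines of Python. The queue stands in for the high-speed link, and all function names are invented for illustration; this shows the concept only, not the vendor’s implementation.

```python
# Conceptual record/replay channel between a primary and secondary VM.
import json
import queue
import threading

log_channel = queue.Queue()   # stand-in for the high-speed logging link

def primary_capture(seq, event_type, payload):
    """Primary side: record a nondeterministic input and forward it to the secondary."""
    log_channel.put(json.dumps({"seq": seq, "type": event_type, "data": payload}))

def secondary_replay():
    """Secondary side: apply events in exactly the order the primary saw them."""
    expected = 0
    while True:
        record = json.loads(log_channel.get())
        assert record["seq"] == expected, "events must replay in order"
        expected += 1
        # A real replica would inject the event into the VM at this point.
        print("replayed", record["type"], record["data"])
        log_channel.task_done()

threading.Thread(target=secondary_replay, daemon=True).start()
primary_capture(0, "keyboard", "dir C:\\logs")
primary_capture(1, "network", {"src": "10.0.0.5", "len": 1500})
log_channel.join()            # wait until the secondary has replayed everything
```

In this scheme only the inputs travel over the link, which keeps the secondary synchronized without streaming the primary’s entire memory state.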

In this solution, if the primary server fails and the primary VM becomes unavailable, then the secondary VM takes over the primary’s role. To guarantee transparent failover, this solution relies on a network virtualization mechanism. This mechanism ensures that the secondary VM takes over the complete network identity of the primary VM. What’s more, this solution’s infrastructure can automatically choose another suitable physical server to host a new secondary VM, ensuring future FT.
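A minimal sketch of that failover step follows, assuming a gratuitous-ARP-style announcement is how the network identity moves; the article does not spell out the mechanism, so every name, address, and helper below is hypothetical.

```python
# Illustrative failover: promote the secondary VM, take over the primary's
# network identity, then line up a host for a replacement secondary.
from dataclasses import dataclass

@dataclass
class NetIdentity:
    ip: str
    mac: str

def announce_identity(identity: NetIdentity):
    # A real implementation would broadcast something like a gratuitous ARP
    # mapping identity.ip to identity.mac so clients redirect traffic transparently.
    print(f"announce: {identity.ip} is now at {identity.mac}")

def fail_over(secondary_vm: dict, primary_identity: NetIdentity, eligible_hosts: list):
    # 1. Promote the secondary and hand it the primary's complete network identity.
    secondary_vm["role"] = "primary"
    secondary_vm["identity"] = primary_identity
    announce_identity(primary_identity)
    # 2. Restore fault tolerance by choosing a host for a brand-new secondary VM.
    new_secondary_host = eligible_hosts[0] if eligible_hosts else None
    return secondary_vm, new_secondary_host

vm = {"name": "app-vm (secondary copy)", "role": "secondary"}
promoted, spare_host = fail_over(vm, NetIdentity("192.168.10.20", "00:50:56:aa:bb:cc"),
                                 ["host-03", "host-04"])
print(promoted["role"], "-> new secondary will be created on", spare_host)
```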

Fault Tolerance in Virtualized Storage Servers

If your data center deploys a virtualized server infrastructure, it requires at least one shared virtual storage server to host the datastores for the virtual machines (VMs). Needless to say, it is crucial for this shared server to be both sufficiently robust and fault-tolerant.

This brings us to the second FT solution, which features a high availability (HA) component capable of delivering on these demands. This HA architecture, exemplified by StorTrends, works in virtualized environments just as it does when deployed with application servers: a separate server handles the secondary storage role.

The solution is very simple for administrators, who need only select a secondary server and enable either Active/Passive or Active/Active during configuration; the network storage management software handles the rest. Unlike the first FT solution—the success of which rests on underlying shared storage—the HA solution relies on mirrored storage. The beauty of this approach is that it requires no shared resources between the two nodes, allowing each server to offer complete FT.
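As a rough illustration of how little the administrator specifies, the snippet below models that configuration as a small dictionary plus a sanity check. The keys, values, and rules are invented for this sketch and do not mirror StorTrends’ actual interface.

```python
# Hypothetical HA pairing configuration; illustrative names and keys only.
ha_config = {
    "primary_node": "stor-node-a",
    "secondary_node": "stor-node-b",       # the peer the administrator selects
    "mode": "active-active",               # or "active-passive"
    "replication": "synchronous-mirror",   # mirrored storage, no disks shared between nodes
    "heartbeat_links": ["10.10.1.0/24", "10.10.2.0/24"],
}

def validate(cfg: dict) -> dict:
    assert cfg["primary_node"] != cfg["secondary_node"], "the peer must be a separate server"
    assert cfg["mode"] in ("active-active", "active-passive"), "unsupported HA mode"
    assert cfg["replication"] == "synchronous-mirror", "this sketch models mirrored storage only"
    return cfg

validate(ha_config)
```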

The HA architecture also provides efficient load-balancing. The volumes residing on the solution’s servers can be grouped under as many as six aliases per controller, with one controller assuming the role of primary server for one alias and the second controller doing so for the other.

In terms of requirements, the owner of an alias is responsible for mirroring all data writes to its peer. Despite the time this mirroring adds, the requirement offers enterprises two important benefits. First, the two storage servers share the compute and I/O loads, since each plays an active role for an alias. Second, it promotes more efficient use of link bandwidth over the full-duplex links, with critical writes mirrored to the peers in two different directions over the network link.

[Figure 3: Writes mirrored in both directions between the two controllers]
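The mirrored write path and alias ownership described above can be modeled with a toy in-memory controller. The sketch below is illustrative only; the point it captures is that a write is acknowledged only after the owning controller has committed it locally and mirrored it to its peer.

```python
# Toy model of two controllers mirroring writes for the aliases they own.
class Controller:
    def __init__(self, name, owned_aliases, peer=None):
        self.name = name
        self.owned_aliases = set(owned_aliases)
        self.peer = peer
        self.store = {}                        # (alias, block) -> data

    def write(self, alias, block, data):
        if alias not in self.owned_aliases:
            raise ValueError(f"{self.name} is not the primary for alias {alias}")
        self.store[(alias, block)] = data      # local commit
        self.peer.mirror(alias, block, data)   # synchronous mirror to the peer
        return "ACK"                           # acknowledged only once both copies exist

    def mirror(self, alias, block, data):
        self.store[(alias, block)] = data

# Each controller is primary for a different alias, so writes (and their mirror
# traffic) flow in both directions over the full-duplex link.
ctrl_a = Controller("ctrl-a", ["alias-1"])
ctrl_b = Controller("ctrl-b", ["alias-2"])
ctrl_a.peer, ctrl_b.peer = ctrl_b, ctrl_a
ctrl_a.write("alias-1", 100, b"payload-1")
ctrl_b.write("alias-2", 200, b"payload-2")
```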

In the event of a controller server failure, the surviving server takes on the “primary” role for both aliases. This solution also employs network virtualization, which allows the surviving server to take over the entire network identity of the failed server. In short, these capabilities ensure that the failover unfolds transparently in an “OS-agnostic” manner.
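Here is a self-contained sketch of that takeover, again with invented structures: the survivor becomes primary for both aliases and answers on the failed peer’s addresses.

```python
# Illustrative HA failover: absorb the failed peer's aliases and network identities.
survivor = {"name": "ctrl-a", "aliases": {"alias-1"}, "identities": {"10.10.1.1"}}
failed   = {"name": "ctrl-b", "aliases": {"alias-2"}, "identities": {"10.10.1.2"}}

def take_over(survivor: dict, failed: dict) -> dict:
    survivor["aliases"] |= failed["aliases"]        # primary role for both aliases
    survivor["identities"] |= failed["identities"]  # network identity takeover
    # In practice the survivor would re-announce these addresses (for example with
    # a gratuitous ARP) so clients keep issuing I/O with no OS-level changes.
    return survivor

print(take_over(survivor, failed))
```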

Comparing Capabilities

Now that you have a basic understanding of these two fault tolerance technologies, let’s examine some potential hazards that could adversely affect an enterprise’s operation and data availability in an instant, comparing how the FT and HA solutions handle these threats:

  • Software crashes. Both physical and virtual servers must deal with the constant threat of a software crash. Servers can be temporarily lost when the OS of a virtual machine crashes due to bugs in the kernel. Service interruptions can also occur due to viruses, or when applications running in the VM terminate abnormally. When these events happen, the first FT solution described above is at a distinct disadvantage: because the primary and secondary VMs execute identical software code in lockstep, they suffer identical consequences when either VM crashes, leading to service disruption. Contrast this with the HA solution, where the two nodes do not execute identical code. This architecture enables the secondary node to take over the role of the primary, even when the primary server goes down due to a software crash, and the storage cluster continues operations without interruption. (A simple sketch of this difference appears after the list.)
  • Software upgrades. Every server must stay updated with the latest software upgrades to ensure optimum security and performance. Here too, the first FT solution faces a disadvantage because it requires identical software code to be running in both VMs. To upgrade software, the administrator must bring down both VMs to complete the updates, taking the protected service offline for the duration. With the HA solution, on the other hand, each node can generally be updated independently, keeping the other node available throughout the process.
  • Hardware crashes. With the FT solution, if the primary VM goes down due to a hardware crash or power failure, the secondary VM is available to assume the primary role. This means even in the event of a hardware failure, service availability remains unaffected. This is also true with the HA solution—if the primary node becomes unavailable when hardware crashes, the secondary node takes over the role of the primary, leaving the storage cluster fully operational.
    [Comparison chart: how the FT and HA solutions handle each failure scenario]
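The sketch below is the toy model referenced in the software-crash bullet above. It assumes nothing beyond the behavior already described: lockstep replicas process identical inputs and therefore fail together, while an HA peer runs independently and simply takes over.

```python
# Toy comparison of lockstep FT versus an independent HA pair under a software fault.
def buggy_service(request):
    if request == "poison":              # stand-in for input that triggers a kernel or app bug
        raise RuntimeError("crash")
    return "ok"

def lockstep_pair(request):
    # Primary and secondary execute identical code on identical inputs.
    results = []
    for vm in ("primary", "secondary"):
        try:
            results.append((vm, buggy_service(request)))
        except RuntimeError:
            results.append((vm, "crashed"))
    return results

def ha_pair(request):
    # Only the primary processes the request; the peer holds mirrored data and stands by.
    try:
        return [("primary", buggy_service(request)), ("secondary", "standing by")]
    except RuntimeError:
        return [("primary", "crashed"), ("secondary", "takes over")]

print(lockstep_pair("poison"))   # both replicas hit the same bug -> service disruption
print(ha_pair("poison"))         # the surviving node takes over the primary role
```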

Conclusion

As virtualization continues to make inroads into the data center, the issue of fault tolerance continues to magnify in importance. A wide range of companies are competing to emerge as leaders in the server virtualization arena, but two types of solutions stand out by virtue of their feature-rich offerings: the FT solution and the HA solution.

The latest version of the FT solution has emerged as the leader in addressing the robustness and reliability of virtual server farms. This solution lets enterprises implement server virtualization in an OS-neutral and platform-agnostic manner. While this FT solution is powerful, it is still in its infancy and does have certain limitations, such as supporting only single-processor VMs. One manufacturer of this solution also recommends that as a best practice, administrators should implement other subcomponent-level redundancies, such as network teaming for network-related irregularities, even when FT is deployed.

The shared storage server is a key component of any virtualized data center. The availability and fault tolerance of this underlying storage server is also critical, which makes the HA solution very compelling as well. As with the FT solution, one manufacturer recommends implementing sub-component-level redundancies, such as RAID for disk failures and network teaming. For heightened robustness and fault tolerance, enterprises should consider deploying both the FT solution and the HA solution in a data center, thus benefitting jointly from the many powerful features that this pair of products provides.

 

Senior Sales Solutions Engineer James Dyckowski has more than eight years of technology experience in storage, RAID, virtualization (VMware, MS Hyper-V, Citrix and RHVS), SAN, NAS, Linux and DB applications (MS SQL, Exchange and Oracle), backup and disaster recovery (DR). Dyckowski has spent the past eight years with American Megatrends, Inc.'s StorTrends Data Storage Division, where he is responsible for pre-sales support, solution design, installation and post-sales support for StorTrends SAN and NAS products. Dyckowski received his bachelor of science degree in computer science from the University of Georgia and his master of business administration degree from Shorter University. He enjoys spending time in data centers with cool new technologies that help clients enhance productivity, protect data, save money, and increase their bottom line.