
Wednesday, 20 July 2016

Disaster Recovery Strategies for Big Data

Written by Jim Scott

Organizations are increasingly analyzing operational data to gain a competitive advantage, increase revenue, or mitigate risk. These big data workloads are becoming essential to operations and must run in real-time, continuous data environments. As this "big data" becomes more mission-critical in nature, grows in volume, and arrives as a continuous stream (think clickstreams, IoT sensors, or web/video content), traditional disaster recovery strategies are no longer adequate for enterprises. Given this shift, it's vital that organizations implement a big data disaster recovery plan that can shield them from the effects of significant negative events such as crippling cyberattacks or equipment failures.

What's alarming is that most organizations aren't backing up their big data systems at all. Without a solid disaster recovery plan in place, the cost to your business could prove devastating: lost data and downtime can mean lost revenue, lost productivity, and missed opportunities.

Let's take a look at some of the key considerations and best practices for developing a reliable disaster recovery strategy that will help minimize the negative effects of a disaster and allow you to quickly resume mission-critical functions.

Legacy Considerations
Data these days arrives fast and furious, in a variety of formats such as Facebook likes, web logs, audio, and images. Data of this velocity and variety cannot easily be handled by relational database management systems (RDBMS). In addition, backing up data with legacy data protection systems is time consuming, since the entire file system has to be scanned each time a backup job runs. Have you ever attempted an actual restore of an RDBMS backup, only to have it fail because the backup contained bad data? It's not pretty, and it is more common than most people realize.

Traditional disaster recovery strategies often fall into a "boiling frog" pattern. Remember the story about the frog in hot water? The frog lounges in a pot of water that is heating up so slowly that the frog doesn't even notice. The water gets hotter and hotter, and by the time the frog notices, he's already boiled.

Shockingly, many IT managers think of a disaster recovery plan as something that's nice to have, but not a necessary part of their data management activities. Many organizations don't have any kind of disaster recovery plan beyond nightly backups.

Because of these kinds of legacy considerations, organizations are evolving from complex RDBMS models to NoSQL + SQL (structured query language), storing data in JavaScript Object Notation (JSON) structures, while also using a converged data platform and Apache Drill to query data in a multitude of different ways. This approach provides organizations with a significantly more reliable way of recovering critical data in the event of a disaster. NoSQL data stores offer greater flexibility and manageability for data and applications; most NoSQL databases support data replication and can store multiple copies of data across clusters or even across data centers in order to ensure high availability and disaster recovery.
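As a rough illustration of this approach, the sketch below runs a SQL query over raw JSON files through Apache Drill's REST endpoint (port 8047 is Drill's default). The host name, file path, and column names are placeholders, not part of any real deployment.

import requests  # pip install requests

# Placeholder Drillbit address; 8047 is Drill's default web/REST port.
DRILL_URL = "http://drillbit.example.com:8047/query.json"

# Query raw JSON click-stream files in place, with no ETL or fixed schema.
sql = """
SELECT t.user_id, COUNT(*) AS clicks
FROM dfs.`/data/clickstream/2016/07/*.json` t
GROUP BY t.user_id
ORDER BY clicks DESC
LIMIT 10
"""

resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)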

Cloud vs. On-Premises Approaches
Disaster recovery in the cloud is a relatively new concept. Although many organizations now use the cloud in some form, they have been slow to migrate big data and data warehousing to the cloud, despite cost and scalability benefits. In fact, a 2014 Gartner survey shows that only 42 percent of organizations with big data are using the cloud in any form.

Whether your clusters are in the cloud or on-premises, mirroring and replication features are a critical part of a successful disaster recovery strategy, especially if you're talking about multiple clusters and multiple data centers.

Consider Netflix as an example. Netflix has been moving huge portions of its streaming operation to Amazon Web Services (AWS) for years, and it completed its cloud migration in January 2016. What if all of Netflix's systems in Amazon went down? Netflix keeps backups of everything in Google Cloud Storage in case of a natural disaster, a self-inflicted failure, or a security breach. However, all of its data on Google is effectively a dead copy: in the face of a disaster, Netflix would have to restore from that data, because it doesn't have services running on Google.

If Netflix had built its site on a converged data platform that was independent of the cloud provider, it could run on any or all of the different providers, with load balancing in front of them. If one service provider went completely down (or doubled its pricing), everything would still be up and running.

Whether you're running your clusters on Amazon, Google, or on-premises, you need to be able to manage them in a cohesive way. If you want to successfully run and manage multiple clusters and multiple data centers, look for a converged big data platform that includes backup and mirroring capabilities to protect against data loss after a site-wide disaster.


Backups – Look for a Hadoop distribution that allows you to take a snapshot of your cluster at the volume level. The snapshot will include all data in the volume, both files and database tables. The snapshot completes nearly instantaneously and represents a consistent view of the data; its contents never change once it is taken. The snapshot can then be written to another medium as a backup.
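As a minimal sketch of scripting such a snapshot, assuming a MapR-style maprcli command line is available on the PATH (the volume and snapshot names below are made up):

import subprocess
from datetime import datetime

# Hypothetical volume name; substitute your own.
VOLUME = "prod.clickstream"
snapshot_name = "nightly-" + datetime.now().strftime("%Y%m%d")

# Create a point-in-time, volume-level snapshot covering files and tables.
subprocess.run(
    ["maprcli", "volume", "snapshot", "create",
     "-volume", VOLUME,
     "-snapshotname", snapshot_name],
    check=True,
)
print("created snapshot", snapshot_name, "of volume", VOLUME)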
With some Hadoop distributions, you can set up a very small instance and then mirror the data to it. In this case, all you pay for is capacity, which is predominantly storage, since you don't really need compute capacity. So even if you have an on-premises system and spin up a small Amazon cluster with a large amount of storage behind it, you're still looking at a fairly low-cost implementation on the compute side; you're essentially paying only to store "x" amount of data in the cloud. Compare that to the backup costs of a traditional system such as an RDBMS doing daily, nightly, monthly, and incremental backups, and to the recovery costs of legacy backup systems.
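To make the cost argument concrete, here is a back-of-the-envelope estimate; the data size and per-GB price are assumptions for illustration, not quotes from any provider.

# Back-of-the-envelope monthly cost of keeping a mirror in cloud object storage.
# Both figures are assumed values for illustration only.
data_tb = 50                   # size of the mirrored data set, in TB
price_per_gb_month = 0.03      # assumed object-storage price, USD per GB-month

monthly_cost = data_tb * 1024 * price_per_gb_month
print(f"Roughly ${monthly_cost:,.0f} per month to hold {data_tb} TB of mirrored data")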


Replication – Organizations are continually pushing the boundaries of real-time analytics in Hadoop, but are often challenged by geographically dispersed environments. Being able to replicate events between data centers enables broad, general-purpose analytics across sites. If you take a close look at how different cloud providers offer their services, you'll notice substantial differences between their application programming interfaces (APIs) and how you manage them.
Your Hadoop distribution should be able to handle both local replication for high availability and remote replication for disaster recovery (DR), availability, and data locality in a single architecture. You need to be able to replicate event streams between data centers so that when you perform analytics, you don't care where the analytics are performed; you have a full view of everything. If you have a multinational organization with regulatory concerns about where data resides, replication is also important, but for a different purpose. Look for a Hadoop distribution that offers cross data center table replication, so that you don't have to tie your data to one site and can instead have global relevance, with live data updates across multiple clusters that can be shared and analyzed immediately.
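The following sketch is purely illustrative of the failover behavior that cross data center replication makes possible; the reader callable and replica paths are hypothetical placeholders, not a real client API.

def query_with_failover(reader, replica_paths):
    """Run `reader` against the first reachable replica of a replicated table.

    `reader` stands in for whatever read call your Hadoop distribution's client
    provides; the replica paths are made-up examples of the same table living
    in two data centers.
    """
    last_error = None
    for path in replica_paths:
        try:
            return reader(path)
        except OSError as err:          # connection refused, timeout, etc.
            last_error = err            # fall through to the next replica
    raise RuntimeError("no replica of the table was reachable") from last_error

# Example usage with hypothetical paths:
# rows = query_with_failover(my_client.read_table,
#                            ["/mapr/dc-east/tables/events",
#                             "/mapr/dc-west/tables/events"])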

Mirroring – Mirroring supports the following characteristics that are critical for DR deployments:

  • Scheduled: Administrators should be able to schedule how often mirrors are updated. A higher frequency of updates leads to a lower recovery point objective (RPO); a simple worst-case estimate is sketched after this list.
  • Incremental: Only deltas should be transferred from the master cluster to the replicas. If only an 8K block is updated at the master cluster, then only that block would be transferred in the next mirroring job.
  • Efficient: Transferred data should be compressed and sent asynchronously and in parallel, so that it does not significantly impact system performance.
  • Consistent: Prior to creating remote mirrors, a snapshot should automatically be taken to ensure a remote mirror of a consistent, known state of the master. Checksums are run to ensure integrity.
  • Atomic: Changes on the mirror should be made only after all data has been received for a given mirroring operation.
  • Flexible: Multiple mirroring topologies should be supported, including cascaded and one-to-many mirroring.
  • Resilient: Should there be a network partition during a mirroring operation, the system would periodically retry the connection and resume once the network is restored.
  • Secure: Configurable over-the-wire encryption prevents network eavesdropping on the mirrored data.
  • Global: Your Hadoop distribution should be able to globally replicate data at IoT scale, with global metadata replication, where stream metadata is replicated alongside data, allowing producers and consumers to fail over between sites for high availability. Data is spread across geographically distributed locations via cross-cluster replication to ensure business continuity should an entire site-wide disaster occur.
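As a quick illustration of the "Scheduled" point above, the worst-case RPO is roughly the mirror interval plus the time it takes the incremental transfer to complete; the numbers below are assumed values, not measurements.

# Worst-case recovery point objective (RPO) for scheduled, incremental mirroring.
# Both inputs are assumed values for illustration.
mirror_interval_min = 15   # how often the mirror job is scheduled to run
transfer_time_min = 5      # time to ship and apply the deltas to the replica

worst_case_rpo_min = mirror_interval_min + transfer_time_min
print(f"Worst-case RPO: about {worst_case_rpo_min} minutes of data at risk")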

Summary
The rapid pace of change in the enterprise, coupled with the threat of data loss in big data platforms, raises concerns for those responsible for ensuring business continuity in the face of disaster or malicious intent. Companies need to ensure they are protected to the level they are comfortable with. Operating in any combination of on-premises and cloud environments is a good strategy for protection. It becomes an even better strategy when you gain the ability to run multi-master environments to prevent downtime. Disasters come in a variety of forms, and speaking from experience, you should be prepared for the eventuality that your data center provider may "accidentally" turn off the power to your cage of servers. This is completely unplanned and unpredictable, yet it happened to me. Following the practices outlined here saved us the pain of recovery.

It's vital that you have a robust and well-tested disaster recovery plan. Protecting a big data environment requires a new way of thinking about how to take advantage of new technologies that will keep pace with your data growth and help protect your business from unforeseen disasters.

Jim Scott is director of enterprise strategy and architecture at MapR Technologies.