Drives Will Fail: What’s Your Back-up Plan?

Written by Mike Cobb, Wednesday, 07 November 2007 16:07

Have you ever been stuck in a traffic jam on the freeway and thought, "Gee, they really ought to widen this road and add a couple of lanes to handle all this traffic?" Except you know what would happen. Those lanes would fill up with cars in no time, and you’d be stuck again. Or, as someone said, "Nature abhors a vacuum."

That’s sort of how it is with today’s hard drives. As manufacturers have been able to create more and more digital capacity in less and less physical space, users have eagerly poured in more and more data to fill that extra space, often right to the limit.

The problem is that it takes longer to back up these expanding data sets. Many individuals don’t bother, and many companies don’t budget adequately for the right tools to keep up with the volume. Meanwhile, the drives themselves have become more intricate, with smaller components and tighter tolerances. So the risk of failure is increased. Drive failure has been a fact of life since the first computers, but with the capacity of today’s machines and the propensity of users to use it all, the sheer volume of data at risk today is staggering. Whether you edit home movies on your laptop for fun or manage a room full of servers for business, it’s time for a back-up plan. 

A Data Salvager’s Perspective

As a director of engineering, I deal with the challenge of data overflow every day at both the end-user and the enterprise level. When you’ve been doing this for a few years – 13 in my case – you get an interesting perspective on how much data volume has grown. A typical data recovery job in 1994 involved a hard drive with storage capacity in the 20-40 megabyte range. For the recovery process we used 240 MB hard drives to hold the data we recovered, and the average file count, including all the files of the operating system as well as the user’s data, was around 25,000 files per recovery. In those days floppy disks were the primary back-up medium, and they were what we used to send customers their recovered data.

Today, the average recovery is orders of magnitude larger. A typical recovery on a Macintosh, for instance, is 60 gigabytes, and the average number of files is 160,000. In the PC world, the average recovery is a bit smaller – 10-15 GB. Often the recovered data can be returned to the customer on DVDs or CDs, but increasingly we’re sending it out on new internal or external hard drives because of the large data sets and file sizes. External drives have become today’s floppies.

Drive Failure Happens

Drive failure is inevitable, and its causes are many. A few are extreme. I have recovered data from computers that have been dropped, run over, burned, drowned, and shot. But those are the exception. The everyday causes of drive failure are more mundane: breakdowns in the inner workings of the drives themselves, brought on by the very complexity that makes them so powerful.

Many drives come out of the factory with some kind of defect that will eventually surface. The average service life of a drive these days is three to five years. Drive manufacturers claim the failure rate is about 1 percent of all drives in use per year, but some independent estimates put it as high as 4 percent and even up to 13 percent.
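To put those annual rates in perspective, here is a minimal sketch of the arithmetic for a group of drives. It assumes each drive fails independently with the same yearly probability, which is a simplification (drives that share a power source or a production batch do not fail independently). The function name and the 20-drive example are illustrative assumptions, not figures from any recovery job.

```python
def chance_of_any_failure(annual_rate: float, drive_count: int) -> float:
    """Probability that at least one of drive_count drives fails within a year.

    Assumes independent failures with the same annual_rate per drive.
    """
    if not 0.0 <= annual_rate <= 1.0:
        raise ValueError("annual_rate must be between 0 and 1")
    if drive_count < 0:
        raise ValueError("drive_count must not be negative")
    # P(at least one failure) = 1 - P(no drive fails)
    return 1.0 - (1.0 - annual_rate) ** drive_count


# The three annual rates quoted above, applied to a hypothetical 20-drive office.
for rate in (0.01, 0.04, 0.13):
    chance = chance_of_any_failure(rate, 20)
    print(f"{rate:.0%} per drive per year -> {chance:.0%} chance of at least one failure")
```

Even at the manufacturers' 1 percent figure, the odds of seeing at least one failure climb quickly as the number of drives grows.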

Just as drive failure can’t be avoided, it can’t be predicted either. There have been various efforts at "smart" failure prediction, but a majority of drive failures happen immediately, like bad accidents, without warning. When the drive heads suddenly decide they’re going to crash into a spinning platter, no one can see it coming.

Electronic failure at the printed circuit board or component level and minor media damage (less than 5,000 bad sectors) used to be the primary causes of drive failure. Nowadays, we see less electrical failure and more physical media damage resulting from the tight packing of ever-shrinking, fast-moving mechanical parts, especially at the interface between the heads and the platter surface.

Power surges are a common cause, too. They’re especially bad for the users who are conscientious about backing up, because their back-up drives are usually plugged into the same power source as their main ones.

Here’s one that surprises people: hard drives are sensitive to altitude. They have a higher rate of failure over 10,000 feet, even in pressurized airplane cabins where every other person seems to have a laptop or an MP3 player. In an unpressurized environment, like a mountaintop in the Andes, they simply won’t function over 10,000 feet.

User error is a less common cause of failure, but it certainly happens. It might be as simple as unplugging a FireWire or USB drive without first "ejecting" it. You might even get away with it nine times and get lulled into thinking nothing’s going to happen – until the tenth.

At the enterprise level, I repeatedly hear from IT managers who thought that, in a RAID server, back-up was "built in" – if one drive failed, another would take over – never anticipating that a second drive failure would crash the entire system. In fact, when one drive fails the remaining drives start working overtime, running faster and hotter than normal, increasing the risk of complete failure.

IT support people often call after they’ve done a reinstall on a desktop system to try to eradicate some kind of corruption, only to find there were some crucial documents that the user didn’t back up to the server. People want to know how they can prevent their disks from failing. The short answer is you can’t. It’s not a question of if, but when. The more pertinent question, though, is how can you prevent your data from being lost or avoid going through a recovery with the downtime and costs that it entails? Based on my experiences, there’s only one answer. 

Back Up, Back Up, Back Up Some More

Being told you need to back up regularly is kind of like having the dentist tell you that you need to floss. You know it’s true. You vow to be better about it. You have the means and every intention. But you forget, or you put it off. And next thing you know, you’re getting a root canal.

Cost used to be an impediment. After paying for a computer, who really wanted to shell out the money for an extra hard drive? But external drives are comparatively cheap now. We see more and more of them coming in for data recovery. The reason? Users are buying external drives for back-up, and then they wind up using them for data overflow. So the data on their external drives is just as much at risk as that on their computers. The only data that is not at risk is data that has been backed up.

So what’s the best back-up system for the heavy hobbyist or small creative business? The answer is the one you’re most likely to use – if it encourages you to back up instead of discouraging you, it’s right for you. If you have large files of photos, movies, music, and the like, CDs or DVDs are simply not a practical option. Handling and storing them is also a bit cumbersome. Tape back-up was once the standard in business, but nowadays it’s costly and slow compared to other options and does not give you the flexibility to restore on other systems. FireWire or USB external drives are the way to go – provided you’re not tempted to use that extra capacity for your data overflow. If that’s inevitable, you need to buy another drive. If you can’t be bothered to remember to back up on a regular basis, there are programs you can buy to schedule automatic backups.

Of course, in a creative business, you probably have very large data sets and a large number of files, possibly more than you can back up in one night. And if your back-up time cuts into your work time, that translates to downtime. One solution is to upgrade your network to gigabit Ethernet. It’s 10 times faster than 100BASE-T Ethernet – currently the standard in many businesses – and you can transfer as much as 2 GB per minute (vs. 200 MB per minute).
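As a rough worked example of that arithmetic, the sketch below estimates how long a back-up would take at the two transfer rates quoted above (about 200 MB per minute on 100 Mbit Ethernet, about 2 GB per minute on gigabit). The function name and the 500 GB data set are illustrative assumptions, and real throughput will also depend on the drives and the number of files involved.

```python
def backup_minutes(data_gb: float, rate_gb_per_min: float) -> float:
    """Estimated transfer time in minutes for data_gb gigabytes of data."""
    if rate_gb_per_min <= 0:
        raise ValueError("rate_gb_per_min must be positive")
    return data_gb / rate_gb_per_min


DATA_GB = 500.0      # hypothetical size of a small studio's data set
FAST_ETHERNET = 0.2  # GB per minute, the 100 Mbit figure quoted above
GIGABIT = 2.0        # GB per minute, the gigabit figure quoted above

for label, rate in (("100 Mbit", FAST_ETHERNET), ("gigabit", GIGABIT)):
    hours = backup_minutes(DATA_GB, rate) / 60
    print(f"{label}: about {hours:.1f} hours for {DATA_GB:.0f} GB")
```

If the slower estimate runs longer than the hours the office is closed, the back-up will cut into working time.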

Strategies for the RAID Environment

The issue is trickier for enterprise IT departments that measure their volume in terabytes rather than gigabytes, because it comes down to the fundamental business tradeoffs of time and money. If you have a RAID 1 mirror or a RAID 5 array (striping with distributed parity), you’re off to a good start. But if there’s corruption on any of the drives, it will be mirrored as well. You still need a back-up. Tape can take days to back up a high volume of data. Now you can get external drives capable of holding up to 2 terabytes, allowing you to back up to multiple drives and restore the data to any computer – an advantage that tape doesn’t give you.

Do you need to back up your entire system every time you back up? One strategy to consider is incremental back-ups. Start with a full system back-up, then back up only the data that has been changed in succeeding intervals. You’ll still need to do full back-ups regularly, and you’ll need to determine the schedule based on the amount of your data and the nature of your business – for instance, incremental back-ups nightly and full back-ups on the weekends.
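The incremental idea can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a real back-up product: it assumes file modification times are trustworthy, it does not track deleted files, and the function name and its parameters are hypothetical.

```python
import os
import shutil
from pathlib import Path


def incremental_backup(source: Path, dest: Path, since: float) -> list[Path]:
    """Copy files under source that were modified after `since` into dest.

    `since` is the time of the previous back-up in seconds since the epoch.
    Returns the relative paths of the files that were copied.
    """
    copied = []
    for root, _dirs, files in os.walk(source):
        for name in files:
            src = Path(root) / name
            if src.stat().st_mtime > since:
                rel = src.relative_to(source)
                target = dest / rel
                # Recreate the folder structure inside the back-up location.
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, target)  # copy2 also preserves timestamps
                copied.append(rel)
    return copied
```

A nightly job would pass the time of the previous run as `since`. A weekend full back-up can use the same function with `since=0`, provided `dest` is a fresh, empty location.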

The real killer in the movement of data is not always the amount of data in gigabytes or terabytes, but the file count. Is there a way to consolidate multiple files (directories, for example) into a single file? Fewer, larger files will be easier and faster to back up and restore than more, smaller files, even if the total volume of data is the same.
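One way to consolidate a directory into a single file is a tar archive. The sketch below uses Python's standard tarfile module. The function name is hypothetical, and it assumes the archive is written somewhere outside the directory being packed.

```python
import tarfile
from pathlib import Path


def bundle_directory(directory: Path, archive: Path) -> int:
    """Pack directory into one uncompressed tar file.

    Returns the number of regular files stored in the archive.
    The archive path must lie outside directory, or the archive
    would try to include itself.
    """
    with tarfile.open(archive, "w") as tar:
        # arcname keeps only the folder's own name, not its full path.
        tar.add(directory, arcname=directory.name)
    # Reopen the archive to count what was actually written.
    with tarfile.open(archive, "r") as tar:
        return sum(1 for member in tar.getmembers() if member.isfile())
```

The back-up job then moves one large file instead of thousands of small ones. The tradeoff is that restoring a single document means opening the archive first.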

Of course, all the back-up planning in the world is for naught if you don’t also have a plan and the means for restoring the data. That means having yet another server dedicated to restoration. And it’s advisable if not imperative to do an occasional practice restore on another computer – obviously you can’t restore to one that’s failed. It doesn’t have to be the identical computer, just one that’s sufficiently robust to handle the data.
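A practice restore is only useful if you check the result. As a minimal sketch, the code below compares a restored folder against the live data by SHA-256 checksum and lists anything missing or different. The function names are hypothetical, and it assumes the live data has not changed since the back-up was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 checksum of a file, read in 1 MB chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing or differ in the restored tree."""
    problems = []
    for src in sorted(p for p in original.rglob("*") if p.is_file()):
        rel = src.relative_to(original)
        copy = restored / rel
        if not copy.is_file() or file_digest(src) != file_digest(copy):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every file in the original tree came back intact. Anything else is worth investigating before a real failure forces the issue.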

Managing the movement and storage of data is a secondary priority at most companies, if not a more distant one, relative to their primary business. That is why IT departments have to keep reminding management of the importance of adequate back-up resources and of the cost of downtime.

When All Else Fails: Data Recovery

So you vowed to be better about backing up, but you got too busy and just plain forgot. Or everything got backed up except one crucial file. Or your well-planned back-up system didn’t do its job properly. Well, good news. Chances are good, actually better than good, that the data is not really lost and can be restored.

Data can be imaged from the original drive in the cleanroom, then sent to engineers to ascertain whether the damage is logical or physical. Turnaround time and cost depend on the service selected (priority, standard, or economy), the drive capacity, the operating system, and the complexity of the recovery.

More Data Than Ever

As we’ve witnessed the increase in data volume over the years, we’ve also noted that it is proportional to the rise in the value of data – because so much of our work and lives now exists primarily in digital form. Data loss can cripple businesses and send everyday users to the edge of despair.

With today’s data recovery capabilities, despair is unwarranted in the vast majority of cases. But the best recovery plan is the user’s own. Think about what that growing volume of data is really worth to you and figure out how to safeguard it before it overflows. Hard drive failure may be unavoidable, but with the right tools, strategies, and mindset, you can avoid losing what really matters.

Mike Cobb is the director of Macintosh and Unix engineering at DriveSavers, a data recovery services company. He joined the company in 1994 and has performed recoveries on all types of hard drives. Before joining DriveSavers, he worked as a tech support supervisor and beta-test coordinator for a manufacturer of Mac-based RAID mirroring hardware and software, among other products.



"Appeared in DRJ's Fall 2007 Issue"