Tuesday, December 27, 2005

RAIDing the data

Having worked in the IT field for pretty much all of my adult life, and some of my non-adult life, there's one lesson I've learned the hard way: hard drives die. If you store your important data on a hard drive, chances are you're going to lose that data somehow, someday. Backups are a great way to preserve your data from these inevitable failures (they'll happen, it's just a matter of when!). The problem is, backing up and protecting data needs to be easy enough that you'll remember to do it, otherwise it just doesn't happen. I've been doing this for a very long time, and even I don't back up my data as often as I should. I've lost enough data in my life that you'd think I'd have learned, but...

So, one of my primary goals with this new server was to have it protect my data without needing my intervention. I'm going to do this with a multi-layered approach, and RAID arrays are the first layer. For those not familiar, RAID stands for Redundant Array of Inexpensive Disks. Hard drives are cheap these days. It's almost impossible to find a drive that doesn't come with triple-digit gigabytes anymore. In fact, the most recent addition to the server, a 200G 7200 RPM monster, cost me only $30 after rebate. The basic idea behind RAID is to spread your data across multiple cheap disks in such a way that if one fails, you don't lose everything. You can Google for more info. The two levels of RAID I'll be using are RAID-1 and RAID-5.

RAID-1 is commonly known as "drive mirroring". I set up two drives of equal size, and every time I write data to one, it's written to the other. If one drive fails, I have a duplicate of the data on the other one. The big drawback to this setup is that writing to two drives is typically slower than writing to one. The other is that you "lose" a whole drive: if you take two 200G drives and mirror them, you only get to store 200G of data. I offset the first drawback by putting the two drives on different controllers in the system (also known as drive duplexing, since I'm protected by redundant controllers as well). Performance is then not affected as much. The second is offset by the fact that drives are cheap. For $30, I can't afford not to protect my data.

RAID-5 is also known as "striping with parity". RAID-5 requires at least three drives. In a nutshell, let's say you wanted to store the following sequence of numbers:

1 2 3 4 5 6

With a single drive, all six numbers are written to the single drive, obviously. In a mirror, all six are written to each drive. In RAID-5, the numbers are spread out across multiple drives, with one drive in each row storing "parity data". Parity data is essentially the result of a mathematical formula that describes the other pieces of data, such that if you lose one of the pieces, you can reconstruct the missing one from what's left. So, here's how the data would look on a R5 array:

Drive 1 Drive 2 Drive 3
1 2 P3
P7 3 4
5 P11 6

For simplicity's sake, I used a simple algorithm to calculate the parity: I added the data written to the other two drives together. Let's say we lose Drive 1. Well, we can figure out that the data missing from row 1 is the number 1, since we know X + 2 = 3. The missing parity for row 2 is 7, because X = 3 + 4, and so on. In most implementations, if a drive fails, the system will stay up and running until you replace it, since it can figure out what's missing. Drawbacks: performance is similar to RAID-1, in that you now have to write to three drives. You also lose one drive's worth of space, but that's a smaller fraction of the total than with a mirror.
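The recovery math above is simple enough to sketch in a few lines of shell, using the toy addition parity from the table (real RAID-5 implementations use XOR instead of addition, but the recovery principle is identical):

```shell
#!/bin/sh
# Toy parity demo using the addition scheme from the table above.
# Real RAID-5 uses XOR parity, but recovery works the same way.

d2=2      # surviving data on Drive 2, row 1
parity=3  # surviving parity on Drive 3, row 1

# Drive 1 is dead. Since d1 + d2 = parity, we know d1 = parity - d2.
d1=$(( parity - d2 ))
echo "Recovered data from Drive 1: $d1"
```

Run it and you get the missing number 1 back, which is exactly what the md driver does (with XOR) for every stripe on a failed disk.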

So, how am I using this? Well, in this machine I have the following drives: 1x120G, 2x200G. The 120 is where the OS is stored, and the 200s typically store my data. I'm going to carve them up as shown in this picture:




My "I can't possibly live without this, so it needs maximum protection" data is stored on the RAID-5 array listed as "personal". My Exchange virtual machine is stored on the one listed as "exchange". I have them separated to minimize corruption issues. My "I'd like to make sure I don't lose this since it's a PITA to replace, but I CAN replace it if necessary" data is stored on the RAID-1 (MP3s, videos, etc). Here's how I did it:

The first problem is the fact that HDE was already set up with a single 200G partition, filled with about 140G of data. I didn't have enough drive space to store it elsewhere, but fortunately, the mdadm tool in Linux gives us a simple workaround.

First, I needed to very carefully document what it was I wanted to do. I've got ADD, so I have to make absolutely sure I've got a detailed plan of attack or I'll forget stuff. :)

Create an /etc/mdadm.conf file

echo 'DEVICE /dev/hd* /dev/sd*' > /etc/mdadm.conf

This tells mdadm that any hard drive in the machine could be considered a candidate for creating arrays, and that it should find them at boot time.

Partition the disks

HDA already had some partitions on it, some of which I didn't need anymore. I created two 10G partitions, and one big one with the rest. Since this is going to be a MythTV box, I'll use the free space on that drive for scheduled, temporary recordings. If I want to save something, I can re-encode it and put it into the store. On HDG, I also created two 10G partitions and one remaining bigity-big one. Also remember, when in fdisk, to set the partition type to "fd" (partitions are created as "83" by default). "fd" is the type for Linux RAID Autodetect.

After a quick reboot into single-user mode (the safest way to do this stuff), I created my arrays. Three simple commands:

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/hda6 /dev/hdg1 missing
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/hda7 /dev/hdg2 missing
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/hdg3 missing

The "missing" keyword is what allows me to keep my data intact until the last drive is ready to be added to the array: it lets me create the array without all of its partitions present. Now, all we do is create the filesystems:
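At this point each array is running in degraded mode, with the "missing" slot empty. A "cat /proc/mdstat" here shows something like the following (device names and block counts are illustrative, not from my actual box):

```
Personalities : [raid1] [raid5]
md0 : active raid5 hdg1[1] hda6[0]
      10485696 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
```

The "[3/2]" and "[UU_]" mean the array wants three devices but only has two, with the underscore marking the empty slot we'll fill with HDE later.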

mkfs.jfs /dev/md0
mkfs.jfs /dev/md1
mkfs.jfs /dev/md2

I chose JFS based on recommendations on the MythTV board. I knew I wanted a journaling filesystem for the arrays, primarily due to their size, and JFS seems to be the most "stable" in this configuration.

As a final step in this section, I mounted the arrays (treat them as a regular drive, i.e. "mount /dev/md0 /mnt/personal") and copied all of the data from the HDE partition over to the new arrays.

Once all the data was copied over, I simply fdisked HDE so that its partition table was similar to HDG's. A note: Linux's RAID support is pretty flexible. You don't have to worry about getting EXACTLY the same number of blocks in each partition. When I created the personal and exchange partitions, I used "+10000M" at each of the "end block" prompts. That got them close enough in size.

To finalize the arrays and get them to sync, simply issue the following:

mdadm --add /dev/md0 /dev/hde1
mdadm --add /dev/md1 /dev/hde2
mdadm --add /dev/md2 /dev/hde3

To check the sync status, do "cat /proc/mdstat". You'll see them synchronizing. Go do something for a half hour or so; the system will sync each array in turn. When everything's up and running, you'll see "[UUU]" or "[UU]" at the end of each status line (indicating that all three, or both, drives are up). It's probably not terribly dangerous to use them early, but you should wait until the arrays are fully synced before putting them to work.
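If you'd rather not keep eyeballing it, a tiny helper can poll /proc/mdstat until the sync is done. A sketch (the function name is my own, and it takes the status file as a parameter so you can test it against a fake file):

```shell
#!/bin/sh
# Block until no md array is resyncing or recovering, polling once a minute.
# Takes the mdstat path as an argument (defaults to /proc/mdstat).
wait_for_sync() {
    mdstat="${1:-/proc/mdstat}"
    while grep -Eq 'resync|recovery' "$mdstat"; do
        sleep 60
    done
    echo "all arrays in sync"
}
```

Call it as "wait_for_sync" after adding the drives, and it returns once every status line is back to "[UUU]" or "[UU]".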

Some final config file changes

Obviously, you'll need to add the arrays to your /etc/fstab. Again, treat them just like any other type of drive. Mine now looks like this:

/dev/hda3 / ext2 defaults 1 1
/dev/md0 /mnt/personal jfs defaults 0 0

And so on. You should also do the following:

mdadm --detail --scan >> /etc/mdadm.conf

This puts the information about the arrays into the conf file. mdadm doesn't strictly need this, but it's good to have the info around for the future.

As a final step, let's make sure we know when our arrays have degraded. Edit your rc.local (or its equivalent in your distro) and include the following line for each of your arrays:

nohup mdadm --monitor --mail=root@localhost --delay=300 /dev/md0 &

Don't forget the ampersand. This line will send an e-mail to you any time mdadm detects a degraded array. The --mail directive simply acts as a "frontend" to sendmail; mdadm doesn't actually send the mail itself. So, if sendmail isn't set up properly, you won't get the mail. Which is why I have it going to the root mailbox... ;-)
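One way to prove the whole mail path works before you actually need it: mdadm can fire off a test alert for each array it finds. Something like this (from memory, so check your mdadm man page for the exact flags on your version):

```shell
# Generate a one-off TestMessage alert for every array, then exit.
# If root's mailbox gets the messages, the sendmail path is working.
mdadm --monitor --scan --oneshot --test --mail=root@localhost
```

Much better to find out the mail is broken now than when a drive actually dies.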

I think that's about it. Sorry it was a bit long, but there was a lot of ground to cover. Drop me a comment if you found this info useful!
