Wednesday, January 4, 2006

Your drive is dead, you are so screwed

Remember me saying how you really need to protect your data? Boy, do I know how to predict failure or what?! :) Seriously, I had a panic attack over the last couple of days as I started getting mails from mdadm:

A DegradedArray event had been detected on md device /dev/md0.

Yipes! So, I ran cat /proc/mdstat and found that all three arrays were showing a degraded state, with one drive missing from each: hdg was not showing up in any of the arrays. An mdadm -Q /dev/hdg confirmed that it was not part of any array.
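For anyone who wants to script that check, here is a minimal sketch. It assumes the usual /proc/mdstat layout, where each array's status line ends in a bracketed map such as [UU] (all members up) or [U_] (one missing). The function name degraded_arrays is made up for illustration.

```shell
# Print the name of every md array whose member map contains a "_",
# i.e. at least one member is missing. Reads mdstat-formatted text on stdin.
degraded_arrays() {
  awk '
    # Remember the array name from lines like "md0 : active raid1 ..."
    /^md[0-9]+ :/ { name = $1 }
    # A member map made only of U and _ with at least one _ means degraded
    /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
  '
}
```

On a live box this would be run as degraded_arrays < /proc/mdstat. It prints nothing when every array is healthy.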

A quick inspection of /var/log/messages shows:

Jan 4 16:09:47 alfred kernel: ide: failed opcode was: unknown
Jan 4 16:09:47 alfred kernel: hdg: task_out_intr: status=0x50 { DriveReady SeekComplete }
Jan 4 16:15:47 alfred kernel: hdg: dma_intr: error=0x84 { DriveStatusError BadCRC }
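To see how often this is happening and on which drive, something like the following would tally the BadCRC lines per device. It's a rough sketch that assumes syslog lines shaped like the ones above, with the device name (hdg:) as its own field. The function name badcrc_counts is made up.

```shell
# Count BadCRC kernel messages per IDE device. Reads log text on stdin
# and prints "device count" pairs, one per line.
badcrc_counts() {
  awk '
    /BadCRC/ {
      # Find the first field that looks like "hdX:" and strip the colon
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^hd[a-z]+:$/) {
          dev = $i
          sub(/:$/, "", dev)
          count[dev]++
          break
        }
      }
    }
    END { for (d in count) print d, count[d] }
  ' | sort
}
```

Usage would be badcrc_counts < /var/log/messages. A count that keeps climbing after a change means the change didn't help.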


BadCRC?! BADCRC!?!? This hard drive is just over a month old!! It better not be failing. So, I did me some searching, just to be sure I didn't need to go through the trouble of trying to get this drive out and shipped back to the manufacturer (especially since I JUST threw out the damn box!) I figured I'd give it a little test to make sure the drive itself wasn't bad. Since I knew it wasn't part of any array, I could play with it how I wanted. So, I repartitioned it and created a new ext2 filesystem on it. I then copied a large amount of data to and from the drive, with no errors in /var/log/messages. Hmmm...

Unfortunately, you know all that rhetoric about how Linux support is just BUSTLING on the Internet? How it's so much better than commercial support? Yeah, I've heard it, too... After about two hours of searching, I finally came to the conclusion that this error did not indicate a dying drive, but a problem with the driver, the card, the cable or the drive (as in, one of these things was not entirely compatible with the others). In other words, there were lots of opinions out there on what these messages meant, but no real information. Most blamed it on the kernel ("it works fine with a 2.4 kernel"), some blamed it on the drive's manufacturer ("if it can't keep up with DMA requests, you'll get that error. Get a new drive"), and others said it was the APIC ("add pci=noapic to your boot option"). Even on the kernel-dev list, a number of people had posted the exact same problem, but few had found a solution. None of their solutions worked for me.

I'll spare you the exact details on everything I had to do to fix this, but suffice it to say I believe the problem was that I had two drives of different speeds on the same controller card. It didn't make any sense to me, either, since each drive was on its own controller on the card. The real reason for it was more likely a combo of that and the fact that I'm using these drives in an array (quite a few of the folks with this issue were using arrays). So, I issued the following:

hdparm -X udma3 /dev/hde
hdparm -X udma3 /dev/hdg

This drops both drives down to UDMA mode 3, which tops out at about 44 MB/s. So far, after half an hour, I haven't seen any errors recur in the log. I'll try moving them up to UDMA4 (66 MB/s) at some point, but for now the RAID1 array is rebuilding and I'm not seeing the error, so I'm pretty sure this is a valid fix.
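To confirm that the setting took, hdparm -i lists the modes a drive supports and marks the active one with an asterisk (for example "*udma3"). Here is a small sketch for pulling that out, assuming that output format. The function name current_xfer_mode is made up.

```shell
# Print the transfer mode that hdparm -i marks as active (the token
# prefixed with "*"). Reads hdparm -i output on stdin.
current_xfer_mode() {
  # Put every whitespace-separated token on its own line,
  # then keep only the one that starts with "*" and drop the "*".
  tr ' \t' '\n\n' | sed -n 's/^\*\(.*\)$/\1/p'
}
```

On the real machine that would be hdparm -i /dev/hdg | current_xfer_mode, run as root, and it should print udma3 after the commands above.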

Update: it's the next day, and I had no errors last night or this morning. The arrays are still holding, so I think we're good. I modified /etc/sysconfig/harddisks to use the following command line on each of my drives at boot time:

hdparm -X udma3 -d1 -c3 /dev/hdx

The drives will be a little slower and not running at max efficiency, but I can deal. For the most part, these drives are for storage, and I don't need quick-like-a-bunny access. I think this supports my theory of it being an md driver issue. If one drive is faster than the other, md should "wait" for the other to catch up. Of course, that would probably introduce many other timing issues in a driver that's designed to minimize data loss.

This situation could have been a LOT worse had I not set up those arrays (ignoring the fact that it probably wouldn't have occurred had it not been for the arrays...). The machine just chugged along nicely without a burp, even with one drive missing. It's also a good thing I told mdadm to send me alert e-mails. Otherwise, the machine would have plodded along without me ever knowing the drive had failed. I would have found out the seriously hard way: when one of the others died, too...
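If you're not sure whether your own box is set up to mail you, mdadm's monitor mode takes its alert address from a MAILADDR line in mdadm.conf. Here is a quick sketch for checking that the line exists. The function name mdadm_mailaddr is made up, and the config path varies by distro (/etc/mdadm.conf on mine, as an assumption for yours).

```shell
# Print the address from the MAILADDR line of an mdadm.conf.
# Reads the config text on stdin; prints nothing if no address is set.
mdadm_mailaddr() {
  awk '$1 == "MAILADDR" && NF >= 2 { print $2 }'
}
```

Run it as mdadm_mailaddr < /etc/mdadm.conf. If it prints an address, mdadm --monitor --scan --oneshot --test should send a test alert for each array, so you know the mail actually arrives before you need it.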
