Server drive failure

In other news, I received a barrage of emails from my home server the other day, each complaining about a degraded RAID array. A swift check through the emails showed that every array on the machine had been degraded, and a little more investigation led to the simple conclusion that one of the hard drives had completely failed.
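For anyone wanting to confirm this sort of thing without wading through the emails, here is a minimal sketch of spotting degraded arrays from the text of /proc/mdstat. The function name and the sample text are made up for illustration (they are not taken from my server), and the parsing assumes the usual "[2/1] [U_]" status format, where an underscore marks a missing member.

```python
import re

# Hypothetical sample in the style of /proc/mdstat; not real output from my machine.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      1020032 blocks [2/1] [U_]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line such as "md0 : active raid1 ...".
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line looks like "... [2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

print(degraded_arrays(SAMPLE))
```

On a real machine the same function could be fed the contents of /proc/mdstat, read with open("/proc/mdstat").read().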

So far I’ve not had time to check whether this is a failure of the drive, the controller, or perhaps even just a cable coming loose, but it’s nice to see the server continuing to function completely normally despite the failure. Lots of kudos for the software RAID functionality in Linux.

My job for tomorrow morning is to find out what has actually failed, replace it, and then rebuild the degraded RAID arrays. One thing I’ll look into is enabling SMART monitoring of the hard drives. It isn’t enabled at the moment, and some advance warning would have been nice: I could have had a new drive ordered and waiting.
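As a rough idea of what a SMART check could look like, here is a small sketch built around smartctl from the smartmontools package. It assumes smartctl is installed, and the bit meanings are as I recall them from the smartctl man page, so verify them against your version. The function names and the device path in the usage note are hypothetical. For ongoing monitoring with email alerts, the smartd daemon from the same package is the more usual route; this only shows a one-off check.

```python
import subprocess

# Meaning of each bit in smartctl's exit status (as I recall from the man page;
# treat as an assumption and check your local documentation).
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART command failed or returned bad data",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "some attributes were at or below threshold in the past",
    6: "device error log contains errors",
    7: "self-test log contains errors",
}

def decode_smartctl_exit(code):
    """Translate a smartctl exit status into a list of human-readable problems."""
    return [msg for bit, msg in sorted(SMARTCTL_EXIT_BITS.items())
            if code & (1 << bit)]

def check_drive(device):
    """Run 'smartctl -H' on a device and decode the exit status (needs root)."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return decode_smartctl_exit(result.returncode)
```

Calling check_drive("/dev/sda") (a hypothetical device name) would return an empty list for a drive reporting no problems.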

Still, hopefully the whole thing is not more than a couple of hours’ work.

Update: Initially it looked like a cabling problem; simply replugging all the drives appeared to fix it. However, putting everything back together made it fail again. After quite a lot of swapping of cables, and finally wiggling of cables, it became clear that the problem was the drive after all. It looks like the circuit board attached to the drive has failed. Flexing the cables moves the circuit board slightly, which I suspect has caused it to fail over time.

A new drive seems to have completely resolved the problem. Once it was installed, it took about 5 minutes to partition it, another 5 minutes to add the partitions to the RAID sets, and about 4 hours for the Linux software RAID system to rebuild them.
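The partition-and-add steps can be sketched roughly as follows. This is not a record of exactly what I typed: it assumes the new drive should mirror the partition layout of the surviving drive (copied with sfdisk), and the device names and the array-to-partition mapping are hypothetical. The function only builds the command strings, so nothing is run against any disks.

```python
def rebuild_commands(good_disk, new_disk, arrays):
    """Build shell commands to re-add a replacement disk to md arrays.

    arrays maps an md device (e.g. "/dev/md0") to the partition number
    that belongs to it on each disk. All names here are illustrative.
    """
    # Copy the partition table from the surviving disk to the new one.
    cmds = [f"sfdisk -d {good_disk} | sfdisk {new_disk}"]
    # Add each new partition to its array; md then starts rebuilding.
    for md, partnum in sorted(arrays.items()):
        cmds.append(f"mdadm --manage {md} --add {new_disk}{partnum}")
    return cmds

for cmd in rebuild_commands("/dev/sda", "/dev/sdb",
                            {"/dev/md0": 1, "/dev/md1": 2}):
    print(cmd)
```

Once the partitions are added, rebuild progress shows up in /proc/mdstat.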

And all is now working perfectly again.


