I’m sure things are programmed to know when you’re skint so their failure causes you even more grief than it should!
And on that encouraging note, my server, which has its disks in a RAID 5 configuration, decided to send me an email at 10pm last night to let me know one of the disks had failed!
RAID 5 gives you a degree of protection from failure by providing redundancy, but with this being the second disk in that machine to die in the last 5 months there was no time for hanging around in case another pegged out! RAID 5 allows for one disk death; two is game over!
The resources below are useful guides on how to get things back up and running again.
HOWTO Replace a failing disk on Linux Software RAID-5 – Consultancy.EdVoncken – pdf copy of web page as above link is currently down
However, a week on, all of a sudden I was getting a FailSpare event error email from the server and the disk was being automatically removed from the array. This was a James May moment (“Oh Cock!”), and when it turned out to be my brand new Western Digital Red drive throwing up the error, the James May moment escalated to a new level! That’s a real cause for concern.
After a considerable amount of internet reading I installed smartmontools and ran some tests on the disks in the RAID array (over an SSH session through a VPN over 3G!). Not a single error!
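For reference, getting smartmontools installed and running a quick health check on a drive looks roughly like this (the device name is just an example, not necessarily what’s in this machine):
sudo apt-get install smartmontools
sudo smartctl -H /dev/sdb
sudo smartctl -a /dev/sdb
The -H switch gives a simple pass/fail health verdict, while -a dumps all the SMART attributes and the drive’s error log.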
The solution to removing the error is basically to mark the disk as failed, remove the failed disk from the array, add it back and set it off rebuilding again.
# mdadm --manage /dev/md0 -f /dev/sdc
( mark the disk as failed )
# mdadm --manage /dev/md0 -r /dev/sdc
( remove it from the array )
# mdadm --manage /dev/md0 -a /dev/sdc
( add the device back to the array )
# mdadm --detail /dev/md0
( verify there are no faults and the array knows about the spare )
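Once the disk has been added back you can keep an eye on the rebuild progress with:
cat /proc/mdstat
(or watch cat /proc/mdstat for a live view), which shows the resync percentage and an estimated finish time.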
OK, we have a solution, but why has a perfectly good, brand spanking new disk done this? One suggestion is SATA cables. Those in the know have had perfectly good disks report errors when there’s nothing wrong with them, only to discover that a substandard cable is the root cause. You could argue over why that cause is exhibiting symptoms now rather than x years ago when the machine was first built, but in the absence of any other explanation, sourcing a few brand new cables of reputable origin and swapping them out seems a good idea.
There is also the possibility that my failed 3TB disk of last week has nothing wrong with it. Plugging it into a SATA port on a motherboard and having a look at what smartmontools has to say about it under an Ubuntu Live disk may be an interesting few hours’ work.
In the meantime I’m trying to get smartmontools to run as a daemon and carry out some meaningful tests on the disks on a regular basis to give me some advance warning of problems.
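For anyone wanting to do the same, the key bit is smartd’s config file. As a rough sketch (the device name, schedule and email address are my own placeholders rather than anything from this build), a line like the following in /etc/smartd.conf runs a short self-test every night at 2am, a long test every Saturday at 3am, and emails on any trouble:
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
You also need the smartd daemon itself enabled; on Debian/Ubuntu that has traditionally meant setting start_smartd=yes in /etc/default/smartmontools and restarting the smartmontools service.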
Right, almost two weeks on and it’s not a happy story. In short I’ve lost all the data on the RAID array and had to replace assorted pieces of hardware and start from scratch. The data loss is an annoyance rather than a “throwing yourself from a bridge” situation, but the amount of time and effort (thank you Mat for your help) that has gone into this is a major pisser.
I will gloss over the majority of this but it’s worth making some key notes as there are useful tools and tactics out there should anyone else suffer a similar problem. I’m sure at some point in my life I’ll find myself here again and having something to refer to is going to be a help!
The RAID failure was caused by a failing disk throwing up bad blocks PLUS a duff SATA cable and/or controller card. RAID5 will tolerate a disk failure, but it looks like the failing cable on a second drive caused it to believe two disks were failing. As such I was never getting out of that situation in one piece.
The method of attempted repair went like this –
stop the array ASAP –
sudo mdadm --stop /dev/md0
use smartmontools to carry out long tests on all disks in the machine
sudo smartctl -t long /dev/sda
this takes a long time on 3TB disks but it tells you everything you need to know!
View the necessary reports
sudo smartctl -l selftest /dev/sda
Now the crucial point was that only one disk was reporting errors in that diagnostic report: in theory the other three are intact, and with the data striped across all of them the faulty disk can be replaced and the dataset rebuilt.
As such the anticipated next step is to restart the array, forcing mdadm to use the pre-existing disks with their data, but ignoring the crazy spare fail flags and the like as there is only one faulty disk and we can swap that out once the array rebuilds.
So the way to do this is
mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=64 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 --assume-clean
Having a read of http://www.linuxquestions.org/questions/linux-server-73/mdadm-re-added-disk-treated-as-spare-750739/ gives an explanation of where this command comes from.
The --assume-clean throws caution to the wind and creates the array regardless of the disk states.
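Before running a --create like that it’s worth checking what the existing superblocks say, so the chunk size, metadata version and device order in the new command match the old array; something along these lines:
sudo mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1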
Right, the next trick is to run fsck on the filesystem to correct any bad blocks, which we know we’ve got because smartmontools told us so during the long tests.
Now I’m not convinced this didn’t add to the problems. If the RAID array is rebuilding, running something like fsck, which is going to start moving blocks around, seems to me like a good way to confuse the hell out of things, but the research done prior to hitting Enter said otherwise.
sudo fsck.ext4 -y -s /dev/md0
the -y switch saves having to answer every question fsck throws up by assuming yes to all of them.
Problems started when fsck began running out of memory while running its check, which seemed very odd!
A lot of reading later, it turns out this isn’t an uncommon problem, and people have resolved it by creating swapfiles to expand the machine’s memory to cope.
dd if=/dev/zero of=/swapfile1 bs=1024 count=12582912
mkswap /swapfile1
swapon /swapfile1
The above creates a 12GB swapfile called swapfile1, which gives you an extra 12GB of swap space to work with.
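A quick check that the new swap has actually been picked up:
swapon -s
free -m
should show the swapfile listed and the total swap figure increased.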
Internet posts suggest 1GB of memory and/or swapfile per terabyte of disk you’re trying to fix; however, in the middle of the night fsck crashed out again complaining about insufficient memory. Fsck is not a quick process, especially on a 12TB array!
You can add as many swapfiles as you want to expand the machine’s virtual memory, limited only by the capacity of the disk holding them. The snag is that fsck runs slower, but if you want your data back it’s got to be worth a shot. That, however, was a shot in the dark! With almost 64GB of virtual memory from 4 swapfiles, fsck still didn’t want to play, and during its runs more and more bad blocks were being reported.
There comes a point where the return on investment question is asked. If I’ve got so many bad blocks in this data set, how usable will it be? How do I know which files are corrupt until I try and access them? Will I be able to replace that data in x years time when I discover a corrupted file?
Tough questions with only one solution. Start again from scratch. So, with a certain degree of reluctance, after an awful lot of effort, it was back to square one.
I decided to wipe the entire server, install the Ubuntu 16.04 LTS beta (full release due imminently) and set my disks up in a RAID5-style array using ZFS. ZFS is now supported as standard in 16.04, which is a bonus! The duff disk was replaced with a brand new 3TB drive and we’re ready to go.
ZFS is set to perform a weekly scrub and smartmontools does its thing on a weekly basis, checking every single disk and reporting back.
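For completeness, setting the pool up on 16.04 was along these lines (the pool name “tank” and the device names are placeholders rather than the actual build):
sudo apt install zfsutils-linux
sudo zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde
sudo zpool status tank
raidz is ZFS’s RAID5 equivalent, and the weekly scrub is just a scheduled job running:
sudo zpool scrub tank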
The learning points from this are that nothing is infallible, regardless of the level of redundancy a RAID array offers. If it’s valuable, back it up regularly to removable media and lock it in a fire safe ready for Judgement Day!
I’m more pissed off with the fact I haven’t progressed my 1Watter build or managed to do anything else for the last 2 weeks!
Don’t ask me to pick your winning Lotto numbers!