I love linux software ("md") raid. I use md raid1 on a zillion systems. I've never had issues. Until today...
I get a call that a customer has lost all their emails for about a month and their apps' data appears to be old and/or missing. Strange.
I log in to the linux server and see:
cat /proc/mdstat
Personalities : [raid1]
md122 : active raid1 sda1[0]
      409536 blocks [2/1] [U_]

md123 : active raid1 sda2[0]
      5242816 blocks [2/1] [U_]

md124 : active raid1 sda3[0]
      1939865536 blocks [2/1] [U_]

md125 : active raid1 sdb1[1]
      409536 blocks [2/1] [_U]

md126 : active raid1 sdb2[1]
      5242816 blocks [2/1] [_U]

md127 : active raid1 sdb3[1]
      1939865536 blocks [2/1] [_U]
That's not correct. This system should have 3 md arrays, not 6. Ah, md has done something really goofball with this pathological case: it has separated each mirror into its two halves and assembled each half as its own degraded array! Whoa!
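For comparison, a healthy assembly of the same three mirrors would look something like this (the md names below are just illustrative; the point is 3 arrays, each with both members and [2/2] [UU]):

Personalities : [raid1]
md125 : active raid1 sdb1[1] sda1[0]
      409536 blocks [2/2] [UU]

md126 : active raid1 sdb2[1] sda2[0]
      5242816 blocks [2/2] [UU]

md127 : active raid1 sdb3[1] sda3[0]
      1939865536 blocks [2/2] [UU]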
They said they had an accidental reboot today (kid hitting the reset button), and it booted/rooted off the wrong schizo set (sda).
There appears to have been a drive failure/kick a month ago:

Apr 4 10:10:32 firewall kernel: [1443781.218260] md/raid1:md127: Disk failure on sda3, disabling device.
Apr 4 10:10:32 firewall kernel: [1443781.218262] <1>md/raid1:md127: Operation continuing on 1 devices.
And the system hadn't been rebooted between then and today.
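A quick way to confirm which half of a mirror is stale (just a sketch of what I'd check, using plain mdadm): compare the superblock event counter and update time on the two members of one of the arrays. The member that got kicked stops receiving updates, so its numbers lag behind:

mdadm --examine /dev/sda3 | egrep 'Update Time|Events'
mdadm --examine /dev/sdb3 | egrep 'Update Time|Events'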
It gets stranger... I rebooted the system to test out a few recovery ideas (working offsite). On the next reboot it came up using the good/current sdb drive for boot/root! Huh? It's like it's picking which one to use at random! It still shows 6 md arrays, but it's using the proper 3 this time.
So is all this a bug?
1. Shouldn't the system have marked sda as failed/bad PERMANENTLY, so that on the next reboot it would ignore it? OK, I can understand that if md thought the whole drive was bad, it wouldn't be able to write to sda's superblock for that to survive the reboot. But couldn't it have written the info to sdb's superblock? If a system can't remember what has failed, then I don't see how this behaviour can be avoided.
2. Why did linux md bring up both sets of arrays? It can see they are the same array. Why on earth would it ever split them? That seems majorly screwy to me.
Still, thank God it didn't try to start syncing the stale set onto the good set! We had backups, but it's a pain to recover from them. In the end, just rebooting until luck gave us the current set was all it took. I'll head on-site to replace the bad disk and do a proper resync.
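For the record, here's roughly the cleanup I expect to do on-site (a sketch only, assuming sda is the disk being replaced and that md125/md126/md127 are still the good sdb-backed arrays as in the listing above):

# stop the three stale half-arrays that were assembled from the old sda
mdadm --stop /dev/md122
mdadm --stop /dev/md123
mdadm --stop /dev/md124

# copy the partition table from the good disk to the replacement disk
sfdisk -d /dev/sdb | sfdisk /dev/sda

# add the new partitions into the surviving arrays; md does a full resync
mdadm /dev/md125 --add /dev/sda1
mdadm /dev/md126 --add /dev/sda2
mdadm /dev/md127 --add /dev/sda3

# (if re-using the old disk instead of replacing it, I'd first clear its
# stale superblocks with mdadm --zero-superblock on each old partition)

# then watch the rebuild
cat /proc/mdstat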
I have had hardware RAID systems (ARAID99) in this exact situation go into a schizo state where the disks were unsynched yet both were being used for writes! The problems always seem to revolve around a disk going "soft" bad and then coming alive after reboot.
I don't know much about the inner workings of Linux software RAID, but I'm wondering if some ID got borked that makes them look like halves of two separate arrays.
[root@bob ~]# mdadm --detail /dev/md0 | grep UUID
           UUID : f84b1e9d:5cac2742:d382826c:eabfdbf8
[root@bob ~]# mdadm --query --examine /dev/sdb1 | egrep '(Magic|UUID)'
          Magic : a92b4efc
           UUID : f84b1e9d:5cac2742:d382826c:eabfdbf8
[root@bob ~]# mdadm --query --examine /dev/sda2 | egrep '(Magic|UUID)'
          Magic : a92b4efc
           UUID : f84b1e9d:5cac2742:d382826c:eabfdbf8
Just poking around with the mdadm command doesn't show anything specific to a single device. My guess would be that there's some algorithm that reconstructs the array based on what's found on the attached drives: a device belongs to a given RAID set as long as it has the same UUID (which is also repeated in /etc/mdadm.conf on my system). Then it would look at device-specific metadata to figure out the sync status. Browsing the source to dm-raid1.c and some other files shows there's a notion of a primary device in the RAID set and some sync tables.
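For what it's worth, here's roughly what the mdadm.conf side of that looks like on my box (a representative excerpt, not the literal file); note the ARRAY line only carries the array-level UUID, nothing that distinguishes one member disk from the other:

# /etc/mdadm.conf (excerpt)
DEVICE partitions
ARRAY /dev/md0 UUID=f84b1e9d:5cac2742:d382826c:eabfdbf8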
Sean
On Mon, May 14, 2012 at 2:55 PM, Trevor Cordes <trevor@tecnopolis.ca> wrote: