There were intermittent storms and wind around me last night, causing not full-on power outages but instead dreadful power dips!

So my server went down. I waited a bit until things seemed to be back to normal and brought it back up. About 2 minutes later, another power dip (I'm thinking now that it was really only a problem for devices drawing a ton of power, aka my server).

So I said screw it. Shut everything down.

The next day, I brought the server back up, hoping everything would be just fine like normal. But…

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : active (auto-read-only) raid6 sdi1[3] sdh1[2] sdd1[4] sdj1[0] sde1[1] sdb1[5] sdf1[7] sdg1[8] sdc1[6]
      20510947072 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      resync=PENDING

unused devices: <none>

The array came up read-only (auto-read-only)…which means something is going on…some mismatch in parity or something. Also, what's up with that PENDING?
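If you want to poke at what state the array is actually in before touching anything, a couple of read-only commands tell most of the story (a rough sketch; swap in your own md device name):

# kernel's summary: array state, member devices, and any pending/running resync
cat /proc/mdstat

# per-array detail from mdadm; the State line spells out the read-only/resync situation
mdadm --detail /dev/md127

For what it's worth, mdadm --readwrite /dev/md127 is supposed to flip an auto-read-only array back to read-write (which also lets a pending resync actually start), but I went the stop/start route below.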

Catting /var/log/syslog, I saw this:

Nov 18 11:00:36 whyte kernel: [ 1.434360] md/raid:md127: not clean -- starting background reconstruction
Nov 18 11:00:36 whyte kernel: [ 1.434473] md/raid:md127: device sdi1 operational as raid disk 3
Nov 18 11:00:36 whyte kernel: [ 1.434532] md/raid:md127: device sdh1 operational as raid disk 2
Nov 18 11:00:36 whyte kernel: [ 1.434611] md/raid:md127: device sdd1 operational as raid disk 4
Nov 18 11:00:36 whyte kernel: [ 1.434670] md/raid:md127: device sdj1 operational as raid disk 0
Nov 18 11:00:36 whyte kernel: [ 1.434728] md/raid:md127: device sde1 operational as raid disk 1
Nov 18 11:00:36 whyte kernel: [ 1.434787] md/raid:md127: device sdb1 operational as raid disk 8
Nov 18 11:00:36 whyte kernel: [ 1.434846] md/raid:md127: device sdf1 operational as raid disk 6
Nov 18 11:00:36 whyte kernel: [ 1.434905] md/raid:md127: device sdg1 operational as raid disk 5
Nov 18 11:00:36 whyte kernel: [ 1.434964] md/raid:md127: device sdc1 operational as raid disk 7
Nov 18 11:00:36 whyte kernel: [ 1.436045] md/raid:md127: allocated 9618kB
Nov 18 11:00:36 whyte kernel: [ 1.436204] md/raid:md127: raid level 6 active with 9 out of 9 devices, algorithm 2
Nov 18 11:00:36 whyte kernel: [ 1.436293] RAID conf printout:
Nov 18 11:00:36 whyte kernel: [ 1.436294]  --- level:6 rd:9 wd:9
Nov 18 11:00:36 whyte kernel: [ 1.436296]  disk 0, o:1, dev:sdj1
Nov 18 11:00:36 whyte kernel: [ 1.436298]  disk 1, o:1, dev:sde1
Nov 18 11:00:36 whyte kernel: [ 1.436299]  disk 2, o:1, dev:sdh1
Nov 18 11:00:36 whyte kernel: [ 1.436301]  disk 3, o:1, dev:sdi1
Nov 18 11:00:36 whyte kernel: [ 1.436302]  disk 4, o:1, dev:sdd1
Nov 18 11:00:36 whyte kernel: [ 1.436304]  disk 5, o:1, dev:sdg1
Nov 18 11:00:36 whyte kernel: [ 1.436305]  disk 6, o:1, dev:sdf1
Nov 18 11:00:36 whyte kernel: [ 1.436307]  disk 7, o:1, dev:sdc1
Nov 18 11:00:36 whyte kernel: [ 1.436308]  disk 8, o:1, dev:sdb1
Nov 18 11:00:36 whyte kernel: [ 1.436348] md127: detected capacity change from 0 to 21003209801728
Nov 18 11:00:36 whyte kernel: [ 1.443230]  md127: unknown partition table

So, just like a few blog posts ago when a power outage hit during a grow operation, I stopped the array and started it again.
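I didn't save my exact shell history, but the stop/reassemble dance with mdadm goes roughly like this (a sketch; --scan pulls the member list from mdadm.conf / the superblocks, and the array came back as /dev/md0 afterwards):

# stop the auto-read-only array
mdadm --stop /dev/md127

# re-assemble it from its member devices
mdadm --assemble --scan

Right after reassembly, syslog showed: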

Nov 18 12:00:52 whyte kernel: [ 3620.478830] md/raid:md0: not clean -- starting background reconstruction
Nov 18 12:00:52 whyte kernel: [ 3620.478854] md/raid:md0: device sdj1 operational as raid disk 0
Nov 18 12:00:52 whyte kernel: [ 3620.478856] md/raid:md0: device sdb1 operational as raid disk 8
Nov 18 12:00:52 whyte kernel: [ 3620.478858] md/raid:md0: device sdc1 operational as raid disk 7
Nov 18 12:00:52 whyte kernel: [ 3620.478859] md/raid:md0: device sdf1 operational as raid disk 6
Nov 18 12:00:52 whyte kernel: [ 3620.478861] md/raid:md0: device sdg1 operational as raid disk 5
Nov 18 12:00:52 whyte kernel: [ 3620.478863] md/raid:md0: device sdd1 operational as raid disk 4
Nov 18 12:00:52 whyte kernel: [ 3620.478864] md/raid:md0: device sdi1 operational as raid disk 3
Nov 18 12:00:52 whyte kernel: [ 3620.478866] md/raid:md0: device sdh1 operational as raid disk 2
Nov 18 12:00:52 whyte kernel: [ 3620.478876] md/raid:md0: device sde1 operational as raid disk 1
Nov 18 12:00:52 whyte kernel: [ 3620.480237] md/raid:md0: allocated 9618kB
Nov 18 12:00:52 whyte kernel: [ 3620.480363] md/raid:md0: raid level 6 active with 9 out of 9 devices, algorithm 2
Nov 18 12:00:52 whyte kernel: [ 3620.480365] RAID conf printout:
Nov 18 12:00:52 whyte kernel: [ 3620.480367]  --- level:6 rd:9 wd:9
Nov 18 12:00:52 whyte kernel: [ 3620.480369]  disk 0, o:1, dev:sdj1
Nov 18 12:00:52 whyte kernel: [ 3620.480371]  disk 1, o:1, dev:sde1
Nov 18 12:00:52 whyte kernel: [ 3620.480372]  disk 2, o:1, dev:sdh1
Nov 18 12:00:52 whyte kernel: [ 3620.480374]  disk 3, o:1, dev:sdi1
Nov 18 12:00:52 whyte kernel: [ 3620.480376]  disk 4, o:1, dev:sdd1
Nov 18 12:00:52 whyte kernel: [ 3620.480378]  disk 5, o:1, dev:sdg1
Nov 18 12:00:52 whyte kernel: [ 3620.480380]  disk 6, o:1, dev:sdf1
Nov 18 12:00:52 whyte kernel: [ 3620.480382]  disk 7, o:1, dev:sdc1
Nov 18 12:00:52 whyte kernel: [ 3620.480384]  disk 8, o:1, dev:sdb1
Nov 18 12:00:52 whyte kernel: [ 3620.480450] md0: detected capacity change from 0 to 21003209801728
Nov 18 12:00:52 whyte kernel: [ 3620.480800]  md0: unknown partition table
Nov 18 12:00:52 whyte kernel: [ 3620.481444] md: resync of RAID array md0
Nov 18 12:00:52 whyte kernel: [ 3620.481460] md: minimum guaranteed speed: 1000 KB/sec/disk.
Nov 18 12:00:52 whyte kernel: [ 3620.481462] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
Nov 18 12:00:52 whyte kernel: [ 3620.481478] md: using 128k window, over a total of 2930135296k.

So a resync has definitely started:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid6 sdj1[0] sdb1[5] sdc1[6] sdf1[7] sdg1[8] sdd1[4] sdi1[3] sdh1[2] sde1[1]
      20510947072 blocks super 1.2 level 6, 64k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [>....................]  resync =  0.0% (1064960/2930135296) finish=779.2min speed=62644K/sec

unused devices: <none>
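That finish estimate is governed by the md resync speed limits visible in the syslog above (1000 KB/sec/disk guaranteed minimum, 200000 KB/sec cap). They're tunable at runtime if you want the resync to win against other IO; a sketch, with example values only:

# current limits, in KB/sec
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max

# raise the guaranteed minimum so the resync doesn't get starved by other activity
echo 50000 > /proc/sys/dev/raid/speed_limit_min

Keep in mind that raising speed_limit_min makes everything else touching the array slower while the resync runs.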

GRRRRR. Well, what's actually wrong here? Data needing to be shuffled around? Is there a drive acting up? Then I found this:

This is output from the autodetection of a RAID-5 array that was not cleanly shut down (e.g. the machine crashed). Reconstruction is automatically initiated. Mounting this device is perfectly safe, since reconstruction is transparent and all data are consistent (it’s only the parity information that is inconsistent – but that isn’t needed until a device fails).

Phewwww… Ok I can breathe again. So yeah, I think everything is fine. Going to postpone any data migration I was doing until this finishes…
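In the meantime, a couple of easy ways to keep an eye on it (a sketch; md0 is my array name):

# refresh the resync progress every 30 seconds
watch -n 30 cat /proc/mdstat

# or just block until the resync/recovery finishes (handy in a script)
mdadm --wait /dev/md0

Once the resync completes, the array drops back to a normal clean state on its own.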

Oh, and that unknown partition table error…don't worry about that…it's merely a warning. The kernel scans every new block device for a partition table, and since I'm not putting a "valid" partition table on my array (the filesystem sits directly on it), there's nothing for it to find…see here and here.

Mario Loria is a builder of diverse infrastructure with modern workloads on both bare-metal and cloud platforms. He's traversed roles in system administration, network engineering, and DevOps. You can learn more about him here.