
Problem

I booted my desktop one day and realized that the RAID built on top of two external drives went into inactive state:

cat /proc/mdstat
Personalities:
md0 : inactive sdc1[0] sdd1[2]
      1953521072 blocks super 1.2

unused devices: <none>

I checked for possible I/O problems and found a list of errors in /var/log/messages:

Jun 30 01:46:39 lhost kernel: [  648.634235] xhci_hcd 0000:04:00.0: ERROR Transfer event TRB DMA ptr not part of current TD
Jun 30 01:46:40 lhost kernel: [  649.181187] usb 9-1: USB disconnect, device number 2
Jun 30 01:46:40 lhost kernel: [  649.184335] scsi 10:0:0:0: [sdc] killing request
Jun 30 01:46:40 lhost kernel: [  649.184383] scsi 10:0:0:0: [sdc] Unhandled error code
Jun 30 01:46:40 lhost kernel: [  649.184385] scsi 10:0:0:0: [sdc]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 30 01:46:40 lhost kernel: [  649.184388] scsi 10:0:0:0: [sdc] CDB: Read(10): 28 00 03 7a e4 00 00 00 80 00
Jun 30 01:46:40 lhost kernel: [  649.184393] end_request: I/O error, dev sdc, sector 58385408
Jun 30 01:46:40 lhost kernel: [  649.192231] md/raid1:md0: Disk failure on sdc1, disabling device.
Jun 30 01:46:40 lhost kernel: [  649.192232] md/raid1:md0: Operation continuing on 1 devices.

It's not clear what happened to the hardware, but rebooting and checking S.M.A.R.T. data[1] confirmed the disks had no problems:

smartctl -A /dev/sdc
...
smartctl -A /dev/sdd
...

Maybe the cable came loose, or something else happened beyond my imagination and understanding.

First tries

One of the possible reasons I investigated was a mismatch between the output of mdadm --examine --scan and the /etc/mdadm.conf file [2]. Nothing seemed wrong in my case, the UUIDs were the same, but I made sure the ARRAY line for the array in question was exactly the output of the mdadm --examine --scan command.
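For reference, this is roughly how the comparison and update can be done. The configuration path is /etc/mdadm.conf on my system; some distributions use /etc/mdadm/mdadm.conf instead:

# Show what mdadm detects on the disks versus what the config file says
mdadm --examine --scan
grep ^ARRAY /etc/mdadm.conf

# If the ARRAY line is missing or stale, back up the file and append the scanned line
cp /etc/mdadm.conf /etc/mdadm.conf.bak
mdadm --examine --scan >> /etc/mdadm.conf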

Before re-activation, stopping the array does no harm:

mdadm --stop /dev/md0

In order to activate it, I tried the following command [3]:

mdadm --assemble /dev/md0

According to the man page, this is meant to "Assemble a pre-existing array". It didn't work for me; the array still appeared in the inactive state.
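In hindsight, assembling with the member devices listed explicitly and with verbose output might have produced more useful diagnostics. A sketch of what I would try, with the device names as they were in my setup:

mdadm --stop /dev/md0
mdadm --assemble --verbose /dev/md0 /dev/sdc1 /dev/sdd1

# Or let mdadm scan the configuration file for the array definition
mdadm --assemble --scan --verbose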

I even tried to re-create the array, following some reportedly successful posts on the Internet [4], even though there were no permanently failed disks in my case:

mdadm --create /dev/md0 --verbose --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
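A caution I would add in hindsight: mdadm --create over existing member partitions rewrites the superblocks and only leaves the data intact if the metadata version and data offset match the original array, so it is very much a last resort. If it is attempted at all, --assume-clean at least skips the pointless initial resync; a sketch, assuming the original array used 1.2 metadata as mine did:

mdadm --create /dev/md0 --verbose --level=1 --raid-devices=2 --metadata=1.2 --assume-clean /dev/sdc1 /dev/sdd1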

What suggested a workaround to me was the following error:

mdadm: failed to RUN_ARRAY /dev/md0: Invalid argument

This error led me to a Wikipedia article [5] which suggested a lack of required kernel modules. After many attempts to re-activate the array, the dmesg output was filled with messages like:

[   11.011904] mdadm: sending ioctl 1261 to a partition!
...
[   11.085704] md: personality for level 1 is not loaded!
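That last message points at the raid1 personality module. On a kernel where the module is actually available on disk, it can be inspected and loaded manually with the standard tools (this did not help on the broken kernel in my case, but it is the obvious first thing to try):

# Is the raid1 personality already loaded?
lsmod | grep raid1

# Is the module present for the running kernel?
modinfo raid1

# Load it, then retry the assemble
modprobe raid1
mdadm --assemble /dev/md0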

Workaround

The output of lsmod, which lists the loaded kernel modules, was suspiciously short, with only 7 modules.

I had two kernels installed on the desktop, so I booted the system into the older kernel and everything worked. The array got activated by itself during normal boot and went into recovery automatically:

cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdc1[0] sdd1[1]
      976760400 blocks super 1.2 [2/2] [UU]
      [===>.................]  resync = 15.0% (147095744/976760400) finish=1587.4min speed=8710K/sec
      
unused devices: <none>
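The resync takes a while, as the estimated finish time above shows; its progress can be watched with the usual commands:

watch -n 60 cat /proc/mdstat
mdadm --detail /dev/md0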

I compared the lists of loaded kernel modules by booting into both the newer and the older kernel; the lists contained 7 versus 50 modules respectively.
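A simple way to capture and compare the two lists (the file location under /root is just my choice for illustration):

# Run once under each kernel
lsmod | awk '{print $1}' | sort > /root/modules-$(uname -r).txt

# Then diff the two resulting files
diff /root/modules-*.txt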

What was strange is that I had used the newer kernel for a week before the incident. Why the newer kernel couldn't load the necessary modules is probably a separate question.
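If I revisit it, the first thing I would check under the newer kernel is whether its module tree is present and indexed at all:

# The module directory should match the running kernel version,
# and raid1.ko normally lives under kernel/drivers/md/
uname -r
ls /lib/modules/$(uname -r)/kernel/drivers/md/

# Rebuild the module dependency index
depmod -a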