drive dropped from RAID5 set after reboot



On an Ubuntu Feisty system, I received notice of a degraded RAID array 
after rebooting today. Investigation showed:

# mdadm --detail /dev/md1
/dev/md1:
         Version : 00.90.03
   Creation Time : Fri Jan 26 16:20:26 2007
      Raid Level : raid5
...
    Raid Devices : 4
   Total Devices : 3
Preferred Minor : 1
     Persistence : Superblock is persistent
...
           State : clean, degraded
  Active Devices : 3
Working Devices : 3
  Failed Devices : 0
   Spare Devices : 0
...
     Number   Major   Minor   RaidDevice State
        0     254        4        0      active sync   /dev/mapper/sda1
        1       0        0        1      removed
        2     254        5        2      active sync   /dev/mapper/sdc1
        3     254        6        3      active sync   /dev/mapper/sdd1


If it were a hardware problem, or otherwise a problem with the physical 
drive, I'd expect it to show up as "failed" rather than "removed."
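
If it happens again, comparing the superblocks on the members seems 
like the thing to check first; if sdb1's event counter falls behind the 
others, that would at least confirm md kicked it out as stale. 
Something along the lines of:

# mdadm --examine /dev/mapper/sdb1 | grep -i event
# mdadm --examine /dev/mapper/sda1 | grep -i event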

No complaints when the device was re-added:

# mdadm -v /dev/md1 --add /dev/mapper/sdb1
mdadm: added /dev/mapper/sdb1
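
Presumably it's now resyncing onto the re-added device; I'm keeping an 
eye on the progress with the usual:

# watch cat /proc/mdstat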

Still, it troubles me that it just disappeared on its own. dmesg doesn't 
seem to show anything interesting, other than sdb1 not being picked up 
by md:

# dmesg | fgrep sd
...
[   35.520480] sdb: Write Protect is off
[   35.520483] sdb: Mode Sense: 00 3a 00 00
[   35.520496] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   35.520542] SCSI device sdb: 625142448 512-byte hdwr sectors (320073 MB)
[   35.520550] sdb: Write Protect is off
[   35.520552] sdb: Mode Sense: 00 3a 00 00
[   35.520564] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[   35.520567]  sdb: sdb1
[   35.538213] sd 1:0:0:0: Attached scsi disk sdb
...
[   35.939614] md: bind<sdc1>
[   35.939797] md: bind<sdd1>
[   35.939942] md: bind<sda1>
[   49.731674] md: unbind<sda1>
[   49.731684] md: export_rdev(sda1)
[   49.731707] md: unbind<sdd1>
[   49.731711] md: export_rdev(sdd1)
[   49.731722] md: unbind<sdc1>
[   49.731726] md: export_rdev(sdc1)


Other than the DegradedArray event, /var/log/daemon.log doesn't show 
anything interesting. smartd didn't report any problems with /dev/sdb. 
Then again, while looking into this I found:

smartd[6370]: Device: /dev/hda, opened
smartd[6370]: Device: /dev/hda, found in smartd database.
smartd[6370]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
...
smartd[6370]: Device: /dev/sda, opened
smartd[6370]: Device: /dev/sda, IE (SMART) not enabled, skip device Try 'smartctl -s on /dev/sda' to turn on SMART features
...
smartd[6370]: Device: /dev/sdb, IE (SMART) not enabled...
smartd[6370]: Device: /dev/sdc, IE (SMART) not enabled...
smartd[6370]: Device: /dev/sdd, IE (SMART) not enabled...
smartd[6370]: Monitoring 1 ATA and 0 SCSI devices

So it looks like the drives in the RAID array weren't being monitored by 
smartd. Running the suggested command:

# smartctl -s on /dev/sda
  smartctl version 5.36 ...
  unable to fetch IEC (SMART) mode page [unsupported field in scsi command]
  A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Seems it doesn't like these SATA drives. I'll have to investigate further...
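
My guess, and only a guess at this point, is that with these SATA 
drives sitting behind the libata/SCSI layer I need to tell smartctl and 
smartd to speak ATA to them, something like:

# smartctl -d ata -s on /dev/sda
# smartctl -d ata -a /dev/sda

and correspondingly in /etc/smartd.conf:

/dev/sda -a -d ata

I haven't tested this yet.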


I've noticed the device names changed as of a reboot last weekend, 
probably due to upgrades to the udev system. The array was originally 
set up with /dev/sda1 ... /dev/sdd1, and the output of /proc/mdstat 
prior to a reboot last week showed:
md1 : active raid5 sda1[0] sdd1[3] sdc1[2] sdb1[1]

and now shows:
md1 : active raid5 dm-7[4] dm-6[3] dm-5[2] dm-4[0]

but if that were the source of the problem, I'd expect it to throw off 
all of the devices, not just one of the drives.
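
For what it's worth, mapping the dm-N names back to drives looks 
straightforward: the major/minor numbers in the mdadm --detail output 
above (254:4 through 254:6) should match up with the nodes under 
/dev/mapper, e.g.:

# ls -l /dev/mapper/
# dmsetup ls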


It may be relevant to note that the array was initially created in a 
degraded state (a 4-device array with only 3 devices active), with the 
4th device being added just prior to the previous reboot. But the added 
device was /dev/sda1, not /dev/sdb1.
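
(The degraded creation was done with the usual "missing" placeholder, 
something like:

# mdadm --create /dev/md1 --level=5 --raid-devices=4 \
      missing /dev/sdb1 /dev/sdc1 /dev/sdd1

though I'm reciting the exact device order from memory.)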


During the last couple of reboots I've also noticed a console message 
that says something like "no RAID arrays found in mdadm.conf", but as 
that file has been updated to reflect the current output of 
"mdadm --detail --scan" and the array has been functioning, I've 
ignored it. However, while investigating the above I noticed:

mythtv:/etc# dmesg | fgrep md:
[   31.069854] md: raid1 personality registered for level 1
[   31.651721] md: raid6 personality registered for level 6
[   31.651723] md: raid5 personality registered for level 5
[   31.651724] md: raid4 personality registered for level 4
[   35.710310] md: md0 stopped.
[   35.793291] md: md1 stopped.
[   35.939614] md: bind<sdc1>
[   35.939797] md: bind<sdd1>
[   35.939942] md: bind<sda1>
[   36.251952] md: array md1 already has disks!
[...80 more identical messages deleted...]
[   49.476995] md: array md1 already has disks!
[   49.731660] md: md1 stopped.
[   49.731674] md: unbind<sda1>
[   49.731684] md: export_rdev(sda1)
[   49.731707] md: unbind<sdd1>
[   49.731711] md: export_rdev(sdd1)
[   49.731722] md: unbind<sdc1>
[   49.731726] md: export_rdev(sdc1)
[   51.613310] md: bind<dm-4>
[   51.618923] md: bind<dm-5>
[   51.632529] md: bind<dm-6>
[   51.714527] md: couldn't update array info. -22
[   51.714580] md: couldn't update array info. -22

The "array md1 already has disks" messages as well as the repeated 
starting/stopping and binding and unbinding seems to suggest that 
something isn't quite right.

Although maybe some of this is by design. I see in /etc/default/mdadm:

# list of arrays (or 'all') to start automatically when the initial ramdisk
# loads. This list *must* include the array holding your root filesystem.
# Use 'none' to prevent any array from being started from the initial ramdisk.
INITRDSTART='all'

so maybe the array is initially being set up by the initrd, and then 
being set up again at a later stage. This system doesn't have its root 
file system on the array, so I'm going to switch 'all' to 'none'.
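
I gather that setting is only read when the initramfs is rebuilt, so 
presumably the change needs to be followed by something like:

# update-initramfs -u

(or by "dpkg-reconfigure mdadm", which I believe updates 
/etc/default/mdadm and regenerates the initramfs in one step). I 
haven't tried it yet.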


That still leaves me without a likely cause for why the drive 
disappeared from the array.

  -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
