What to do if a drive failure occurs?

First signs

A drive failure might be reported via fmdump or zpool status, e.g.

# fmdump -v
Jul 16 22:18:52.7769 f52c874e-914d-c1c0-9b3f-a80b973ea08a DISK-8000-0X
  100%  fault.io.disk.predictive-failure

        Problem in: hc://:product-id=Sun-Fire-X4500:chassis-id=0746AMT004:server-id=s08:serial=KRVN67ZBHV0J9H:part=HITACHI-HDS7250SASUN500G-0737KV0J9H:revision=K2AOAJ0A/bay=33/disk=0
           Affects: dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBHV0J9H//pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@6,0
               FRU: hc://:product-id=Sun-Fire-X4500:chassis-id=0746AMT004:server-id=s08:serial=KRVN67ZBHV0J9H:part=HITACHI-HDS7250SASUN500G-0737KV0J9H:revision=K2AOAJ0A/bay=33/disk=0
          Location: HD_ID_33

or

# zpool status
  pool: atlashome
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed with 0 errors on Thu Aug 14 01:52:20 2008
 
        NAME        STATE     READ WRITE CKSUM
        atlashome   ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
[...]
            c1t6d0  ONLINE   10.23     0     0

In this case the disk c1t6d0, sitting in bay HD_ID_33, is faulty and needs attention.
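
To spot such a disk without reading the whole table by eye, the error counters can be filtered with a small helper. This is only a sketch: the function name suspect_devices is made up for this page, and it assumes the device table has the column layout shown above (NAME STATE READ WRITE CKSUM, ended by a blank line).

```shell
# Print every pool device whose READ, WRITE or CKSUM counter is not "0".
# Reads `zpool status` output on stdin. (Hypothetical helper, not part of ZFS.)
suspect_devices() {
    awk '
        # The header line marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line ends the device table.
        in_table && NF == 0 { in_table = 0 }
        # Rows with all five columns: report the name if any counter is non-zero.
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") { print $1 }
    '
}
```

It would be used as

# zpool status atlashome | suspect_devices

A non-zero counter is only a hint; whether the disk really needs to be replaced is still a judgment call, as the action text in the zpool output says.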

Identify drive and offline/unconfigure drive

Use cfgadm to get current status of disk

# cfgadm | grep c1t6
sata1/6::dsk/c1t6d0            disk         connected    configured   ok
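
The cfgadm commands further down need the attachment point (sata1/6), not the disk name. A minimal sketch for looking it up, assuming the first column always has the form shown above (attachment point, then ::dsk/, then the disk name); the function name ap_for_disk is hypothetical:

```shell
# Print the cfgadm attachment point (e.g. sata1/6) for a disk name (e.g. c1t6d0).
# Reads `cfgadm` output on stdin. (Hypothetical helper, not part of cfgadm.)
ap_for_disk() {
    awk -v disk="$1" '
        {
            # First column looks like sata1/6::dsk/c1t6d0; split it at "::dsk/".
            n = split($1, part, "::dsk/")
            if (n == 2 && part[2] == disk) print part[1]
        }
    '
}
```

It would be used as

# cfgadm | ap_for_disk c1t6d0

Matching the full disk name avoids the partial matches that a plain grep for c1t6 could produce.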

If this disk is in a zpool, offline it by running
# zpool offline atlashome c1t6d0
Bringing device c1t6d0 offline

Then unconfigure the drive by running
# cfgadm -c unconfigure sata1/6
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:6
This operation will suspend activity on the SATA device
Continue (yes/no)? yes

Do not disconnect the drive. If you do, the blue LED that indicates the drive may be safely removed will not be lit.

At this point you may want to note the slot number and/or position by running hd -c.

Physically exchange disk

  1. Go to the box, pull it out carefully (watch the cables in the back!)
  2. Open cover, locate slot (blue LED should be on)
  3. Take disk out, put new disk in
  4. Close cover, push box back in (cables!)

Configure new disk

After exchanging the disk, run the following commands to get the system back into action:

# cfgadm -c connect sata1/6
Activate the port: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:6
This operation will enable activity on the SATA port
Continue (yes/no)? yes
# cfgadm -c configure sata1/6
# cfgadm -a | grep sata1/6
sata1/6::dsk/c1t6d0            disk         connected    configured   ok

If this looks good, put this disk back into the zpool, replacing it with itself:

# zpool replace atlashome c1t6d0

The zpool will "resilver" the disk and after an hour or so, the system should be back in a non-degraded state.
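
Instead of checking back by hand, the resilver can be watched with a small test on the zpool status output. This is a sketch under one assumption: that the status text contains the phrase "resilver in progress" while the resilver is running. The function name resilver_running is made up for this page.

```shell
# Exit status 0 while the status text on stdin still reports a running resilver,
# non-zero otherwise. (Hypothetical helper; the matched phrase is an assumption
# about the wording of `zpool status`.)
resilver_running() {
    grep -q 'resilver in progress'
}
```

It could be used in a polling loop such as

# while zpool status atlashome | resilver_running; do sleep 60; done

Once the loop ends, run zpool status again to confirm that the pool and the replaced disk are both ONLINE.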

-- CarstenAulbert - 25 Aug 2008
Topic revision: r6 - 29 Aug 2008, CarstenAulbert