What to do if a drive failure occurs?
First signs
A drive failure might be reported via fmdump or zpool status, e.g.
# fmdump -v
Jul 16 22:18:52.7769 f52c874e-914d-c1c0-9b3f-a80b973ea08a DISK-8000-0X
100% fault.io.disk.predictive-failure
Problem in: hc://:product-id=Sun-Fire-X4500:chassis-id=0746AMT004:server-id=s08:serial=KRVN67ZBHV0J9H:part=HITACHI-HDS7250SASUN500G-0737KV0J9H:revision=K2AOAJ0A/bay=33/disk=0
Affects: dev:///:devid=id1,sd@SATA_____HITACHI_HDS7250S______KRVN67ZBHV0J9H//pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@6,0
FRU: hc://:product-id=Sun-Fire-X4500:chassis-id=0746AMT004:server-id=s08:serial=KRVN67ZBHV0J9H:part=HITACHI-HDS7250SASUN500G-0737KV0J9H:revision=K2AOAJ0A/bay=33/disk=0
Location: HD_ID_33
or
# zpool status
pool: atlashome
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: scrub completed with 0 errors on Thu Aug 14 01:52:20 2008
NAME STATE READ WRITE CKSUM
atlashome ONLINE 0 0 0
raidz2 ONLINE 0 0 0
[...]
c1t6d0 ONLINE 10.23 0 0
In this case the disk c1t6 sitting in bay HD_ID_33 is faulty and needs attention.
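For monitoring scripts, both reports can be reduced to the essentials. The following is a minimal sketch and not part of the original procedure: the helper names fault_locations and devices_with_errors are hypothetical, and the parsing assumes the output layouts shown in the two examples above.

```shell
# Hypothetical helpers; both read command output on stdin and assume
# the layouts shown in the examples above.

# Print the value of every "Location:" line from `fmdump -v` output.
fault_locations() {
    awk '$1 == "Location:" { print $2 }'
}

# Print every device from the `zpool status` table whose READ, WRITE
# or CKSUM counter is not "0".
devices_with_errors() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        in_table && NF >= 5 && $3 ~ /^[0-9]/ &&
            ($3 != "0" || $4 != "0" || $5 != "0") { print $1 }
    '
}

# Intended use on the live system (not executed here):
#   fmdump -v    | fault_locations
#   zpool status | devices_with_errors
```

A device listed here is only a candidate for replacement; as the zpool status message says, it is still up to the administrator to decide whether to clear the errors or replace the disk.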
Use cfgadm to get the current status of the disk:
# cfgadm | grep c1t6
sata1/6::dsk/c1t6d0 disk connected configured ok
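The attachment point (sata1/6 here) is what the cfgadm calls further down operate on. As a sketch, assuming cfgadm prints lines in the form shown above, a hypothetical helper ap_id_for_disk can look it up from the disk name:

```shell
# Hypothetical helper: print the attachment point for disk $1
# (e.g. c1t6d0), reading `cfgadm` output on stdin.
# Assumes first-column entries of the form <ap_id>::dsk/<disk>.
ap_id_for_disk() {
    awk -v disk="$1" '
        { n = split($1, parts, "::") }
        n == 2 && parts[2] == ("dsk/" disk) { print parts[1] }
    '
}

# Intended use on the live system (not executed here):
#   cfgadm | ap_id_for_disk c1t6d0
```

Matching the full disk name avoids the ambiguity of a plain grep, which would also match any other line containing the same string.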
If this disk is in a zpool, offline it by running
# zpool offline atlashome c1t6d0
Bringing device c1t6d0 offline
Then unconfigure the drive by running
# cfgadm -c unconfigure sata1/6
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:6
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
Do not disconnect the drive; if you do, the blue LED indicating that the drive may be safely removed will not light up.
At this point you may want to note the slot number and/or position by running hd -c.
Physically exchange disk
- Go to the box, pull it out carefully (watch the cables in the back!)
- Open cover, locate slot (blue LED should be on)
- Take disk out, put new disk in
- Close cover, push box back in (cables!)
After exchanging the disk, run the following commands to get the system back into action:
# cfgadm -c connect sata1/6
Activate the port: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:6
This operation will enable activity on the SATA port
Continue (yes/no)? yes
# cfgadm -c configure sata1/6
# cfgadm -a | grep sata1/6
sata1/6::dsk/c1t6d0 disk connected configured ok
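The same check can be scripted. This is a sketch only: ap_is_ready is a hypothetical helper, and it assumes the five-column cfgadm line format shown above.

```shell
# Hypothetical helper: succeed if the cfgadm line for attachment point $1
# (read from stdin) reports "connected configured ok".
ap_is_ready() {
    awk -v ap="$1" '
        index($1, ap "::") == 1 &&
            $3 == "connected" && $4 == "configured" && $5 == "ok" { found = 1 }
        END { exit !found }
    '
}

# Intended use on the live system (not executed here):
#   cfgadm -a | ap_is_ready sata1/6 && echo "sata1/6 is ready"
```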
If this looks good, put this disk back into the zpool, replacing it with itself:
# zpool replace atlashome c1t6d0
The zpool will "resilver" the disk and after an hour or so, the system should be back in a non-degraded state.
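To avoid checking by hand, the pool state can be polled. The sketch below uses a hypothetical helper pool_state that reads the "state:" line as it appears in the zpool status output above; it assumes the pool reports a state other than ONLINE while the resilver is running.

```shell
# Hypothetical helper: print the value of the "state:" line from
# `zpool status <pool>` output read on stdin.
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Intended use on the live system (not executed here): poll once a
# minute until the pool reports ONLINE again.
#   until [ "$(zpool status atlashome | pool_state)" = "ONLINE" ]; do
#       sleep 60
#   done
```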
--
CarstenAulbert - 25 Aug 2008