Verifying and Locating a Faulty Disk on Linux

Background

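The business team noticed a large number of log messages about /dev/sdu in dmesg and suspected that the sdu disk had failed. This article records how the disk's status was checked and how the physical disk was located.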

Approach

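  1. Check the alert messages in dmesg
  2. Use /sys/block/sd* to map the device name to the disk slot
  3. Use the RAID controller CLI (storcli) to check the disk's health
  4. Use the RAID controller CLI to turn on the disk's locate LED in preparation for replacement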
dmesg | grep sdu
ll /sys/block/sd*
/opt/MegaRAID/storcli/storcli64 /c0 show all
/opt/MegaRAID/storcli/storcli64 /c0/eall/s15 show all
/opt/MegaRAID/storcli/storcli64 /c0/eall/s15 start locate

Procedure

[root@localhost monitor]# dmesg | grep sdu | tail
[90695.836127] sd 0:0:23:0: [sdu] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[90695.836139] sd 0:0:23:0: [sdu] tag#20 Sense Key : Medium Error [current] [descriptor]
[90695.836144] sd 0:0:23:0: [sdu] tag#20 Add. Sense: Unrecovered read error
[90695.836150] sd 0:0:23:0: [sdu] tag#20 CDB: Read(16) 88 00 00 00 00 02 07 07 1f 60 00 00 01 00 00 00
[90695.836154] blk_update_request: critical medium error, dev sdu, sector 8707842088
[90715.450506] sd 0:0:23:0: [sdu] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[90715.450518] sd 0:0:23:0: [sdu] tag#3 Sense Key : Medium Error [current] [descriptor]
[90715.450523] sd 0:0:23:0: [sdu] tag#3 Add. Sense: Unrecovered read error
[90715.450528] sd 0:0:23:0: [sdu] tag#3 CDB: Read(16) 88 00 00 00 00 02 03 ff c6 90 00 00 01 00 00 00
[90715.450532] blk_update_request: critical medium error, dev sdu, sector 8657028944

The dmesg output confirms medium error alerts on sdu. The device's SCSI address (host:channel:target:LUN) is 0:0:23:0.
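When many disks are logging at once, it can help to pull the SCSI address out of the log mechanically. The following is a minimal sketch, assuming the kernel messages have the `sd H:C:T:L: [sdX]` shape shown above; the function name `scsi_addr_for_dev` is made up for this example.

```shell
# Print the unique SCSI H:C:T:L addresses logged for one block device.
# Reads dmesg-style text on stdin; $1 is the device name without /dev/ (e.g. sdu).
scsi_addr_for_dev() {
    dev="$1"
    # Keep only lines of the form "sd H:C:T:L: [dev] ..." and print the address.
    sed -n "s/.*sd \([0-9]*:[0-9]*:[0-9]*:[0-9]*\): [[]$dev[]].*/\1/p" | sort -u
}

# Usage on a live system:
#   dmesg | scsi_addr_for_dev sdu
```

Lines that mention the device without an address, such as the `blk_update_request` lines, are skipped.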

[root@localhost monitor]# ll /sys/block/sdu
lrwxrwxrwx 1 root root 0 Jan 11 18:45 /sys/block/sdu -> ../devices/pci0000:00/0000:00:01.0/0000:01:00.0/host0/target0:0:23/0:0:23:0/block/sdu

The symlink under /sys/block/ confirms the same address: the path ends in 0:0:23:0/block/sdu.
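The target ID is the third field of that address, and it is the number used in the next step. A small sketch for extracting it from the symlink target, assuming the path layout shown above; `scsi_target_from_path` is a hypothetical helper name.

```shell
# Extract the SCSI target ID (third field of H:C:T:L) from a sysfs link target.
# $1 is the symlink target of /sys/block/<dev>.
scsi_target_from_path() {
    printf '%s\n' "$1" |
        sed -n 's#.*/[0-9]*:[0-9]*:\([0-9]*\):[0-9]*/block/.*#\1#p'
}

# Usage on a live system:
#   scsi_target_from_path "$(readlink /sys/block/sdu)"
```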

/opt/MegaRAID/storcli/storcli64 /c0 show

PD LIST :
=======

-------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
-------------------------------------------------------------------------------
...
0:11 17 JBOD - 5.456 TB SATA HDD N N 512B HUS726060ALE610 U -
0:12 19 JBOD - 5.456 TB SATA HDD N N 512B HUS726060ALE610 U -
0:13 11 JBOD - 5.456 TB SATA HDD N N 512B HUS726060ALE610 U -
0:14 24 JBOD - 5.456 TB SATA HDD N N 512B HUS726060ALE610 U -
0:15 23 JBOD - 5.456 TB SATA HDD N N 512B HUS726060ALE610 U -
0:16 28 JBOD - 5.456 TB SATA HDD N N 512B HUS726060ALE610 U -
0:17 5 JBOD - 5.456 TB SATA HDD N N 512B HUS726060ALE610 U -
0:18 21 JBOD - 5.456 TB SATA HDD N N 512B HUS726060ALE610 U -

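DID is the device ID. The row with DID 23, which matches target ID 23 in the SCSI address 0:0:23:0, has EID:Slt 0:15, so the disk sits in enclosure 0, slot 15 on the RAID controller.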
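The lookup from DID to slot can be scripted against the PD LIST table. This is a sketch only: it assumes the column layout shown above (EID:Slt first, DID second), and it relies on the SCSI target ID matching the DID, which held on this machine but is not guaranteed on every controller or configuration. `slot_for_did` is a name invented for this example.

```shell
# Print the EID:Slt of the drive whose DID equals $1.
# Reads `storcli64 /c0 show` output on stdin.
slot_for_did() {
    # Only rows whose first column looks like EID:Slt are drive rows;
    # this skips the header, separators and other tables.
    awk -v did="$1" '$1 ~ /^[0-9]+:[0-9]+$/ && $2 == did { print $1 }'
}

# Usage on a live system:
#   /opt/MegaRAID/storcli/storcli64 /c0 show | slot_for_did 23
```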

[root@localhost monitor]# /opt/MegaRAID/storcli/storcli64 /c0/eall/s15 show all
...
Drive /c0/e0/s15 State :
======================
Shield Counter = 0
Media Error Count = 329
Other Error Count = 0
Drive Temperature = 30C (86.00 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No
...

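Media Error Count = 329 is the number of media errors recorded for this drive. As a rule of thumb, a drive can be replaced once this count exceeds 10.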
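The same check can be automated for monitoring. A minimal sketch, assuming the `Media Error Count = N` line format shown above; the helper names and the default threshold of 10 (taken from the rule of thumb in this article, not from any vendor specification) are assumptions.

```shell
# Print the Media Error Count value found in `show all` output on stdin.
media_error_count() {
    awk -F'=' '/Media Error Count/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Succeed (exit status 0) when count $1 is above threshold $2 (default 10).
over_threshold() {
    [ "$1" -gt "${2:-10}" ]
}

# Usage on a live system:
#   n=$(/opt/MegaRAID/storcli/storcli64 /c0/e0/s15 show all | media_error_count)
#   over_threshold "$n" && echo "slot 15: $n media errors, consider replacement"
```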

[root@localhost monitor]# /opt/MegaRAID/storcli/storcli64 /c0/eall/s15 start locate
Controller = 0
Status = Failure
Description = Start Drive Locate Failed.

Detailed Status :
===============

-------------------------------------------
Drive Status ErrCd ErrMsg
-------------------------------------------
/c0/e252/s15 Failure 255 Drive not found
/c0/e0/s15 Success 0 -
-------------------------------------------

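This turns on the locate LED for slot 15 to make on-site replacement easier. The overall Status is reported as Failure only because eall also tried enclosure 252, which has no drive in slot 15; the /c0/e0/s15 row shows Success.
The BMC interface confirmed that the LED was on.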
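Addressing the enclosure explicitly avoids the spurious failure row, and the LED should be switched off again after the swap with `stop locate`. The sketch below only builds and prints the command lines from an EID:Slt value so they can be reviewed before being run; `locate_cmd` and the `STORCLI` variable are names made up for this example.

```shell
# Path to the storcli binary; override via the STORCLI environment variable.
STORCLI="${STORCLI:-/opt/MegaRAID/storcli/storcli64}"

# Build the locate command for a drive given as EID:Slt (e.g. 0:15).
# $1 = EID:Slt, $2 = start|stop, $3 = controller index (defaults to 0).
locate_cmd() {
    eid="${1%%:*}"   # part before the colon: enclosure ID
    slt="${1##*:}"   # part after the colon: slot number
    printf '%s /c%s/e%s/s%s %s locate\n' "$STORCLI" "${3:-0}" "$eid" "$slt" "$2"
}

# Print the commands; run them by hand once they look right.
locate_cmd 0:15 start
locate_cmd 0:15 stop
```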