A disk dropped out of a RAID 5 array last night.
The array consisted of three Western Digital disks, 2 TB each. It started with a stream of errors like these:
ata2.00 input/output error
ata2.00: exception emask
ata2.00: failed command: MULTIREAD
After that the server would hang, and both the network and the disks dropped off.
The server itself has four disks: one for the system, the other three make up the array described above. The OS is Ubuntu. SMART reports that all the disks are healthy.
When I try to reassemble the array, I get:
raid5: cannot start dirty degraded array for md0
raid5: failed to run raid set md0
md: pers->run() failed ...
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
http://i68.fastpic.ru/big/2014/0831/95/1d54bab199a150bf73a87...
http://i68.fastpic.ru/big/2014/0831/61/8507d5212bc7f7652cc1a...
It says the array is dirty and refuses to assemble it. People online write that the dirty status can be cleared at your own risk:
echo "clean" > /sys/block/md0/md/array_state
http://www.devinzuczek.com/2010/09/raid5-cannot-start-dirty-.../
The problem is also mentioned here:
http://www.tampabaycomputing.com/blog/raid5-cannot-start-dir...
I can't figure out why the array can't be reassembled or why it fell apart. The case is full of dust, so maybe the motherboard's controller glitched. I'm going to clean it now and swap the drives' cables for spare ones.
Please advise how to reassemble the array and what to do about the dirty degraded status.
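For reference, the route more commonly suggested for a "dirty degraded" md array is a forced assembly from the members that are still in sync, rather than writing to array_state by hand. A minimal sketch, assuming /dev/sda1 and /dev/sdb1 are the two in-sync members (an assumption to verify with mdadm --examine first). The `run` helper is hypothetical and only prints the commands unless APPLY=1 is set:

```shell
#!/bin/sh
# Dry-run sketch: commands are only printed unless APPLY=1 is set.
# ARRAY and MEMBERS are assumptions; confirm them with `mdadm --examine`.
ARRAY=/dev/md0
MEMBERS="/dev/sda1 /dev/sdb1"

# Hypothetical helper: echo the command, execute it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

force_assemble() {
    # A half-assembled, not-started array has to be stopped first.
    run mdadm --stop "$ARRAY"
    # --force asks mdadm to assemble even if the metadata looks unclean
    # or out of date; --run starts the array with a device missing.
    # MEMBERS is left unquoted on purpose so it splits into two arguments.
    run mdadm --assemble --force --run "$ARRAY" $MEMBERS
    run cat /proc/mdstat
}

force_assemble
```

Starting a dirty degraded RAID 5 means trusting parity that may be stale for stripes that were being written at the time of the crash, so working on copies of the disks is the safer order of operations.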
dd it onto an identical disk. The PSU may be at fault. Then assemble the array from the copies of the disks.
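A minimal sketch of the imaging step suggested here, assuming /dev/sdd is the suspect disk and /dev/sde is a blank disk of at least the same size (both device names are assumptions, and swapping them would destroy the data). The hypothetical `run` helper prints the command unless APPLY=1 is set:

```shell
#!/bin/sh
# Dry-run sketch: the dd command is only printed unless APPLY=1 is set.
SRC=/dev/sdd   # assumption: the disk that dropped out of the array
DST=/dev/sde   # assumption: an empty disk, same size or larger

# Hypothetical helper: echo the command, execute it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

clone_disk() {
    # conv=noerror,sync keeps going past read errors and pads short or
    # failed reads with zeros, so offsets on the copy stay aligned.
    run dd if="$SRC" of="$DST" bs=64K conv=noerror,sync
}

clone_disk
```

With conv=noerror,sync a smaller bs loses less data around an unreadable sector at the cost of speed. GNU ddrescue is a common alternative when the source disk actually has read errors.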
> dd it onto an identical disk. The PSU may be at fault. Then assemble the array
> from the copies of the disks.
Could there be a problem with the other disks as well?
Does it make sense to image the disk that dropped out, or is it better to buy a new disk and rebuild the array with it?
Here is the state of the disks and the RAID:
smartctl -a /dev/sda1:
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EARX-00PASB0
Serial Number: WD-WCAZA9629068
Firmware Version: 51.0AB51
User Capacity: 2 000 398 934 016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Aug 31 15:16:27 2014 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (36360) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
3 Spin_Up_Time 0x0027 191 169 021 Pre-fail Always - 5433
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 89
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 070 070 000 Old_age Always - 22563
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 87
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 63
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 802712
194 Temperature_Celsius 0x0022 120 105 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 199 000 Old_age Always - 49
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 8
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl -a /dev/sdb1:
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EARX-00PASB0
Serial Number: WD-WCAZA9637224
Firmware Version: 51.0AB51
User Capacity: 2 000 398 934 016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Aug 31 15:17:13 2014 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (37080) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 187 166 021 Pre-fail Always - 5608
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 90
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 068 068 000 Old_age Always - 23756
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 88
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 64
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 806873
194 Temperature_Celsius 0x0022 122 106 000 Old_age Always - 28
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl -a /dev/sdd1:
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EARX-00PASB0
Serial Number: WD-WMAZA5049112
Firmware Version: 51.0AB51
User Capacity: 2 000 398 934 016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Aug 31 15:17:27 2014 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (39180) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
3 Spin_Up_Time 0x0027 189 168 021 Pre-fail Always - 5533
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 91
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 068 068 000 Old_age Always - 23753
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 89
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 64
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 781694
194 Temperature_Celsius 0x0022 120 104 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 199 000 Old_age Always - 16
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 2
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
mdadm -D /dev/md0:
/dev/md0:
Version : 00.90
Creation Time : Thu Sep 29 20:57:02 2011
Raid Level : raid5
Used Dev Size : 1953514432 (1863.02 GiB 2000.40 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Sun Aug 31 03:17:07 2014
State : active, degraded, Not Started
Active Devices : 2
Working Devices : 3
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 64K
UUID : 942adab0:1d983c20:b94deef7:686c3c5d
Events : 0.4915
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
2 0 0 2 removed
3 8 49 - spare /dev/sdd1
sudo mdadm -E /dev/sda1:
/dev/sda1:
Magic : a92b4efc
Version : 00.90.00
UUID : 942adab0:1d983c20:b94deef7:686c3c5d
Creation Time : Thu Sep 29 20:57:02 2011
Raid Level : raid5
Used Dev Size : 1953514432 (1863.02 GiB 2000.40 GB)
Array Size : 3907028864 (3726.03 GiB 4000.80 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Update Time : Sun Aug 31 03:17:07 2014
State : active
Active Devices : 2
Working Devices : 3
Failed Devices : 1
Spare Devices : 1
Checksum : 93a0e7e0 - correct
Events : 4915
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 0 8 1 0 active sync /dev/sda1
0 0 8 1 0 active sync /dev/sda1
1 1 8 17 1 active sync /dev/sdb1
2 2 0 0 2 faulty removed
3 3 8 49 3 spare /dev/sdd1
sudo mdadm -E /dev/sdb1:
/dev/sdb1:
Magic : a92b4efc
Version : 00.90.00
UUID : 942adab0:1d983c20:b94deef7:686c3c5d
Creation Time : Thu Sep 29 20:57:02 2011
Raid Level : raid5
Used Dev Size : 1953514432 (1863.02 GiB 2000.40 GB)
Array Size : 3907028864 (3726.03 GiB 4000.80 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Update Time : Sun Aug 31 03:17:07 2014
State : active
Active Devices : 2
Working Devices : 3
Failed Devices : 1
Spare Devices : 1
Checksum : 93a0e7f2 - correct
Events : 4915
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 1 8 17 1 active sync /dev/sdb1
0 0 8 1 0 active sync /dev/sda1
1 1 8 17 1 active sync /dev/sdb1
2 2 0 0 2 faulty removed
3 3 8 49 3 spare /dev/sdd1
sudo mdadm -E /dev/sdd1:
/dev/sdd1:
Magic : a92b4efc
Version : 00.90.00
UUID : 942adab0:1d983c20:b94deef7:686c3c5d
Creation Time : Thu Sep 29 20:57:02 2011
Raid Level : raid5
Used Dev Size : 1953514432 (1863.02 GiB 2000.40 GB)
Array Size : 3907028864 (3726.03 GiB 4000.80 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 0
Update Time : Sun Aug 31 03:16:39 2014
State : clean
Active Devices : 2
Working Devices : 3
Failed Devices : 1
Spare Devices : 1
Checksum : 93a0fb26 - correct
Events : 4914
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 3 8 49 3 spare /dev/sdd1
0 0 8 1 0 active sync /dev/sda1
1 1 8 17 1 active sync /dev/sdb1
2 2 0 0 2 faulty removed
3 3 8 49 3 spare /dev/sdd1
sudo blkid
/dev/sda1: UUID="942adab0-1d98-3c20-b94d-eef7686c3c5d" TYPE="linux_raid_member"
/dev/sdb1: UUID="942adab0-1d98-3c20-b94d-eef7686c3c5d" TYPE="linux_raid_member"
/dev/sdd1: UUID="942adab0-1d98-3c20-b94d-eef7686c3c5d" TYPE="linux_raid_member"
/dev/sdc1: UUID="120e7a8f-d11e-4369-9eef-dc86a25ce595" TYPE="ext4"
/dev/sdc5: UUID="4e64d2d3-8cfd-418e-8ad1-347950f56973" TYPE="swap"
/dev/sdc6: UUID="2b5bcaca-fcf2-4542-b376-d0f8e35dbc94" TYPE="ext4"
df /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdc6 146627104 37908160 101270676 28% /
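One thing the mdadm -E output above allows is a comparison of the per-member event counters: /dev/sda1 and /dev/sdb1 report 4915, while /dev/sdd1 reports 4914 and is listed as a spare. A small sketch that pulls the Events field out of mdadm -E output; the `events_of` helper is hypothetical, and the sample lines are copied from the output above:

```shell
#!/bin/sh
# Print the "Events" counter from `mdadm -E` output read on stdin.
events_of() {
    awk -F: '$1 ~ /^[ \t]*Events[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Sample input copied from the thread (sda1, then sdd1).
printf 'Update Time : Sun Aug 31 03:17:07 2014\nEvents : 4915\n' | events_of
printf 'State : clean\nEvents : 4914\n' | events_of
```

On a live system the same helper could be fed directly, for example `mdadm -E /dev/sda1 | events_of`. A member whose counter lags behind missed the most recent superblock updates.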
> Here is the state of the disks and the RAID:
Going by SMART, the disks show no problems.
Try the command smartctl -t long /dev/sdd, which starts a long self-test of the disk /dev/sdd.
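A short sketch of how the long test might be started and its result read back later. The `run` helper is hypothetical and only prints the commands unless APPLY=1 is set:

```shell
#!/bin/sh
# Dry-run sketch: commands are only printed unless APPLY=1 is set.
DISK=/dev/sdd   # the disk named in the reply above

# Hypothetical helper: echo the command, execute it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

long_test() {
    # Start the extended (long) self-test; it runs inside the drive.
    run smartctl -t long "$DISK"
    # Once the test has had time to finish, read the self-test log.
    run smartctl -l selftest "$DISK"
}

long_test
```

The SMART output posted in the thread lists a recommended polling time of 255 minutes for the extended self-test, so the log is only worth checking after roughly that long.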
>> Here is the state of the disks and the RAID:
> Going by SMART, the disks show no problems.
> Try the command smartctl -t long /dev/sdd, which starts a long self-test of the disk /dev/sdd.
What do you mean, no problems?
What about this?
199 UDMA_CRC_Error_Count 0x0032 200 199 000 Old_age Always - 49
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 8
And on two of the disks at that.
And what is so bad about that?
Value - 200
Worst - 199
Threshold - 000
The WHEN_FAILED field is empty.
Both parameters relate to disk wear (Old_age).
>[overquoting removed]
> What do you mean, no problems?
> What about this?
> 199 UDMA_CRC_Error_Count 0x0032 200 199 000 Old_age Always - 49
> 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 8
> And on two of the disks at that.
RAW_VALUE is above zero. That is bad, time to replace them.
> RAW_VALUE is above zero. That is bad, time to replace them.
Raw values can mean entirely different things across manufacturers and even firmware versions. You should not rely on them.
>> RAW_VALUE is above zero. That is bad, time to replace them.
> Raw values can mean entirely different things across manufacturers and even
> firmware versions. You should not rely on them.
199 UDMA_CRC_Error_Count 0x0032 200 199 000 Old_age Always - 49
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 8
When these are nonzero, that is already bad.
In my experience UDMA_CRC_Error_Count usually comes down to a bad SATA cable.
The second one means the mechanics are dying. If the values are not growing you can perhaps live with it, but don't be surprised when the RAID falls apart.
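Whether the counters are still growing can be checked by saving the SMART attribute table at two points in time and comparing the raw values. A rough sketch, assuming the table layout shown earlier in the thread (attribute ID in the first column, raw value in the last); the `raw_value` and `grew` helpers are hypothetical, and the sample line is copied from the thread:

```shell
#!/bin/sh
# Print the RAW_VALUE (last column) of SMART attribute ID $1
# from a saved attribute table in file $2.
raw_value() {
    awk -v id="$1" '$1 == id { print $NF }' "$2"
}

# Succeed if attribute $1 is larger in snapshot $3 than in snapshot $2.
grew() {
    [ "$(raw_value "$1" "$3")" -gt "$(raw_value "$1" "$2")" ]
}

# Demo on a single snapshot holding one line from the thread.
snap=$(mktemp)
echo '199 UDMA_CRC_Error_Count 0x0032 200 199 000 Old_age Always - 49' > "$snap"
raw_value 199 "$snap"
rm -f "$snap"
```

With two snapshots taken some days apart, `grew 199 old.txt new.txt` (file names are placeholders) would indicate that CRC errors are still accumulating, which fits a cable or connector problem that has not been fixed.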