HP MSA1000 RAID ADG Data Recovery Darth Technology

Introduction to the HP MSA1000 Disk Array Cabinet

RAID ADG (Advanced Data Guarding), HP's proprietary RAID level available on the MSA1000, is a disk configuration scheme that extends RAID 5. The weakness of RAID 5 is that once one hard disk fails, the RAID group drops from online to degraded, and if a second hard disk then fails, the data of the entire RAID group is lost, which is catastrophic for an enterprise. RAID ADG overcomes this weakness by design. Its defining feature is that it maintains two sets of parity and sets aside the capacity of two hard disks to store them, so the array can survive the simultaneous failure of two hard disks. This removes the earlier limit of tolerating only one failed disk at a time and effectively improves the reliability of the data on the server's disks. RAID ADG costs less to implement than RAID 0+1 while providing higher fault tolerance than RAID 5.
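To make the "two parity sets" idea concrete, here is a minimal Python sketch of dual parity for a single stripe. It is an illustration only: the article does not describe HP's actual ADG parity arithmetic, so the sketch assumes the generic RAID 6 style P+Q scheme (XOR parity plus a Reed-Solomon parity over GF(2^8)). The function names compute_pq and recover_two are hypothetical.

```python
# Assumption: generic RAID 6 style P+Q parity, not HP's documented ADG internals.
# Log/antilog tables for GF(2^8) with polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Divide a by a non-zero field element b."""
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 255]


def compute_pq(data):
    """Return (P, Q) for a list of equal-length data blocks, one per data disk."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data):
        coeff = EXP[i]  # generator 2 raised to the disk index
        for k, byte in enumerate(block):
            p[k] ^= byte                 # first parity set: plain XOR
            q[k] ^= gf_mul(coeff, byte)  # second parity set: weighted XOR
    return bytes(p), bytes(q)


def recover_two(data, x, y, p, q):
    """Rebuild data blocks x and y (both missing) from the survivors plus P and Q."""
    n = len(p)
    pxy, qxy = bytearray(p), bytearray(q)
    for i, block in enumerate(data):
        if i in (x, y):
            continue  # missing blocks contribute nothing
        coeff = EXP[i]
        for k, byte in enumerate(block):
            pxy[k] ^= byte
            qxy[k] ^= gf_mul(coeff, byte)
    # Now pxy = Dx ^ Dy and qxy = g^x*Dx ^ g^y*Dy; solve the two equations.
    denom = EXP[x] ^ EXP[y]
    dx = bytes(gf_div(qxy[k] ^ gf_mul(EXP[y], pxy[k]), denom) for k in range(n))
    dy = bytes(pxy[k] ^ dx[k] for k in range(n))
    return dx, dy
```

With seven disks, each stripe in this model holds five data blocks plus P and Q, which is why the usable capacity is that of five disks and why any two blocks of a stripe can be lost without losing data.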

Fault description:

1. A seven-disk RAID ADG array on an MSA1000 was attached to a Windows Server 2003 server. Some time before the failure, the system had reported a $MFT error and asked for CHKDSK to be run to repair it, but the administrator ignored the prompt. One day the administrator noticed that disk No. 2 in the array was showing a red alarm light, pulled it out, and replaced it with a new disk.

2. After disk No. 2 was replaced, the array ran for a while and then became inaccessible. The system still reported the $MFT error and asked for CHKDSK to be run. This time the administrator ran CHKDSK against the array. The command ran for a while and then stopped on its own before finishing. CHKDSK was run several more times, and each time it stopped on its own shortly before completion. During this period the array was also found to be rebuilding automatically. After all of these operations finished, the disk partition still could not be opened.

Detection and recovery results:

The causes of the data corruption and loss can be explained as follows:

1. The $MFT error reported by the system indicates that the directory structure of the data is corrupted. A directory structure can be damaged for several reasons: a. logical errors caused by faulty reads and writes to the disks; b. hardware faults in the hard disks themselves, including multiple disks dropping offline and bad sectors. When we examined the hard disks we found no physical problems: every disk was recognized and could be imaged.

2. Noticing the alarm on disk No. 2, removing it, and replacing it with a new hard disk is, in theory, the correct procedure. It presupposes, however, that the other six disks in the array have first been confirmed healthy, that is, that all six are online. A disk that has gone offline does not always light its red alarm, so this has to be checked explicitly. If two or more of the other disks are already offline, replacing the disk cannot bring the array back to normal: by design, RAID ADG keeps working with two failed disks, but the array fails once three or more disks are lost.

3. Running CHKDSK was the main cause of the disordered directory structure; many directories such as found.000 and dir.001 were created at this stage. CHKDSK reorganizes the file directory, renaming some directories and rearranging where files are placed. For a disk array, CHKDSK is the command to avoid above all others. If the array members are combined incorrectly, or the RAID has been reconfigured (a change in RAID parameters such as disk order or block size), the system reports disk parameter errors or directory structure errors and asks for CHKDSK to be run. The prompt is simply the operating system doing its job, but the operating system gives no warning that running CHKDSK in this situation can scramble and damage the data, and that is what makes it so dangerous.

4. Analysis of the underlying data showed that, within the first 15 GB of each disk, the redundancy information on disk No. 2 and disk No. 7 is identical, which does not conform to the RAID ADG data layout rules. Beyond roughly 15 GB on each disk, the data layout is consistent with RAID ADG. Possible explanations are: after the new disk No. 2 was installed, the system treated it as disk No. 7 and rebuilt it from the remaining disks; or disk No. 7 had dropped offline and the system rebuilt onto disk No. 2 as if it were disk No. 7; or disks No. 2 and No. 7 were physically swapped; and so on, followed by the series of other operations carried out after the problem appeared. We reassembled the array with the "escort ship" tool, following the data layout observed beyond the first 15 GB of each disk. After recovery the directory structure was disordered, but most of the files were recovered and open normally.
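A boundary like the 15 GB point described in item 4 can be located by comparing disk images directly. The following Python sketch is one simple way to do that and is not the procedure or tool used in this case. It assumes that "identical" means byte-for-byte equal regions at the same offsets, and it works on any two binary file objects; the function name first_difference and the image file names in the comment are hypothetical.

```python
import io


def first_difference(img_a, img_b, chunk_size=1 << 20):
    """Return the byte offset at which two binary streams first differ.

    Returns None if both streams are identical through to the end.
    """
    offset = 0
    while True:
        a = img_a.read(chunk_size)
        b = img_b.read(chunk_size)
        if not a and not b:
            return None  # reached the end of both streams with no difference
        if a != b:
            # Narrow the mismatch down to the exact byte inside this chunk.
            for i, (x, y) in enumerate(zip(a, b)):
                if x != y:
                    return offset + i
            # One stream simply ended earlier than the other.
            return offset + min(len(a), len(b))
        offset += len(a)


# In-memory demonstration: two tiny "images" that share their first 6 bytes.
demo_a = io.BytesIO(b"PARITY" + b"AAAA")
demo_b = io.BytesIO(b"PARITY" + b"BBBB")
demo_offset = first_difference(demo_a, demo_b, chunk_size=4)

# With real images (hypothetical file names) the call would look like:
#   with open("disk2.img", "rb") as a, open("disk7.img", "rb") as b:
#       print(first_difference(a, b))
```

Always run a comparison like this against read-only images of the disks, never against the original members of the array.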


The factors behind the data corruption come down to the following: the $MFT error was not dealt with promptly when it was first reported; when disk No. 2 lit its red alarm, the other disks in the array should have been checked before the replacement of disk No. 2 was confirmed, and system operation should have been monitored closely after the replacement; running CHKDSK was the main cause of the disordered directory structure; and the redundancy information in the first 15 GB of disk No. 2 is abnormal, which is a result of the Rebuild operation and accounts for the loss of some files. Any hardware-level protection scheme is only relative; professional, responsible administration is what keeps the loss from a failure to a minimum.
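The check recommended above (confirm the state of every other disk before swapping the alarmed one) can be sketched in a few lines of Python. This is a simplified model built only on the rule stated in this article, that RAID ADG tolerates two failed disks and fails at three. The state strings and the function names adg_array_state and replacement_check are hypothetical and do not correspond to any HP utility.

```python
MAX_TOLERATED_FAILURES = 2  # RAID ADG survives two failed disks, not three


def adg_array_state(disk_states):
    """Classify the array from a {slot: "online" | "offline"} mapping."""
    failed = [slot for slot, state in disk_states.items() if state != "online"]
    if not failed:
        return "online"
    if len(failed) <= MAX_TOLERATED_FAILURES:
        return "degraded"
    return "failed"


def replacement_check(disk_states, slot):
    """Assess replacing the disk in `slot`, based on the state of the OTHER disks."""
    others_down = [s for s, state in disk_states.items()
                   if s != slot and state != "online"]
    if not others_down:
        return "ok"  # all other disks online: the replacement can rebuild
    if len(others_down) < MAX_TOLERATED_FAILURES:
        return "no-redundancy"  # array still up, but nothing left to lose
    return "array-failed"  # three or more disks lost: a swap cannot fix this


# Seven-disk array with only disk No. 2 in alarm, as the administrator assumed.
states = {slot: "online" for slot in range(1, 8)}
states[2] = "offline"
print(adg_array_state(states), replacement_check(states, 2))
```

The point of the model is that a single red light says nothing about the other six slots; their state has to be read from the controller before the swap, since an offline disk does not always raise an alarm.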

Author: Mr. Tan, CTO, Darth Data Recovery

