WO2020048442A1

WO2020048442A1 - Hard disk fault processing method, array controller and hard disk

Info

Publication number: WO2020048442A1
Application number: PCT/CN2019/104163
Authority: WO
Inventors: 刘国霞; 吴黎明
Original assignee: 华为技术有限公司
Priority date: 2018-09-05
Filing date: 2019-09-03
Publication date: 2020-03-12

Abstract

Disclosed is a hard disk fault processing method. The method is executed by an array controller (101) of a storage array (100). The storage array (100) comprises multiple hard disks (105, 106, 107, 108); each of the hard disks (105, 106, 107, 108) is divided into multiple storage blocks; and multiple storage blocks of different hard disks (105, 106, 107, 108) constitute a storage block set by means of a redundancy algorithm. The method comprises: acquiring fault information of a faulty area, where a fault occurs, in a first hard disk; when the fault information indicates that data loss occurs in the faulty area, determining a faulty storage block where the lost data is located; using other storage blocks in a storage block set to which the faulty storage block belongs to restore the data in the faulty storage block; storing the restored data in a restored storage block, wherein the restored storage block is located at a second hard disk and the second hard disk is a hard disk except the hard disk where the storage block set is located; and recording a correlation between the address, in the first hard disk, of the data in the faulty storage block and the address of the restored storage block in the second hard disk.

Description

Hard disk failure processing method, array controller and hard disk

Technical field

The present invention relates to the field of storage technology, and in particular, to a processing method after a failure occurs in a storage area in a hard disk, and an array controller and a hard disk that execute the processing method.

Background art

Because a solid-state hard disk cannot overwrite data in place, new data can only be written to a different location. A part of the space in the solid-state hard disk must therefore be reserved as redundant space, which serves as free space for writing data and improves the performance of the solid-state hard disk. The nominal capacity that the solid-state hard disk presents externally does not include the capacity of the redundant space.

When a partial area of the solid-state hard disk fails (for example, a die failure; such an area is hereinafter referred to as a fault area), the fault area must be compensated with capacity from the redundant space in order to keep the nominal capacity from being reduced, which reduces the capacity of the redundant space. The reduction of the redundant space increases the wear of the solid-state hard disk, which in turn affects its performance.

Summary of the Invention

An embodiment of the present invention provides a method for processing a fault area that appears in a hard disk. With this processing method, the redundant space of the hard disk is not reduced after a fault area occurs in the hard disk, and thus the wear of the hard disk is not increased.

A first aspect of the embodiments of the present invention provides a hard disk failure processing method, where the method is executed by an array controller of a storage array. The storage array includes a plurality of hard disks, each hard disk is divided into a plurality of storage blocks, and a plurality of storage blocks located on different hard disks form a storage block group through a redundancy algorithm. The method includes: obtaining fault information of a fault area in a first hard disk; when the fault information indicates that data is lost in the fault area, determining a faulty storage block where the lost data is located; recovering the data of the faulty storage block by using the other storage blocks in the storage block group to which the faulty storage block belongs; storing the recovered data to a recovery storage block, where the recovery storage block is located on a second hard disk, and the second hard disk is a hard disk other than the hard disks where the storage block group is located; and recording the correspondence between the address, in the first hard disk, of the data in the faulty storage block and the address of the recovery storage block in the second hard disk.

By recovering the lost data of the faulty storage block in the first hard disk to the recovery storage block in the second hard disk, and recording the correspondence between the address of the faulty storage block in the first hard disk and the address of the recovery storage block in the second hard disk, the redundant space in the first hard disk is not reduced, thereby ensuring the performance of the first hard disk.

In an embodiment of the first aspect, two methods are provided for obtaining the fault information of the fault area in the first hard disk. In the first, the array controller receives the fault information reported by the first hard disk. In the second, the array controller sends a fault query command to the first hard disk and then receives the fault information that the first hard disk reports in response to the fault query command.
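The recovery and remapping steps of the first aspect can be illustrated with a minimal sketch. It assumes a RAID 5 style XOR parity as the redundancy algorithm; the class ArrayController, its fields, and its method names are hypothetical and are not taken from the embodiments.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID 5 style parity, an assumption)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class ArrayController:
    """Hypothetical controller state: block contents plus an address remapping table."""

    def __init__(self):
        self.disks = {}   # disk id -> {block address: bytes}
        self.remap = {}   # (first disk, address) -> (second disk, address)

    def recover_block(self, failed, survivors, target):
        """Rebuild the faulty block from the other blocks of its group, store the
        result on a disk outside the group, and record the correspondence."""
        data = xor_blocks([self.disks[disk][addr] for disk, addr in survivors])
        target_disk, target_addr = target
        self.disks.setdefault(target_disk, {})[target_addr] = data
        self.remap[failed] = target
        return data

    def read(self, disk, addr):
        """Reads follow the recorded correspondence when one exists."""
        disk, addr = self.remap.get((disk, addr), (disk, addr))
        return self.disks[disk][addr]
```

In this sketch, a later read of the faulty block's address on the first hard disk is redirected to the recovery storage block on the second hard disk, so the first hard disk does not have to give up redundant space.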

Optionally, the fault information includes an identifier indicating whether data is lost, and it is determined that data is lost in the fault area according to the identifier.

In an embodiment of the first aspect, two methods are provided for determining the faulty storage block where the lost data is located. In the first, the array controller obtains the address, in the first hard disk, of a first storage block of the first hard disk, and sends a data loss query command to the first hard disk, the query command carrying the address of the first storage block in the first hard disk. After receiving the query command, the first hard disk determines whether the address carried in the query command includes part or all of the address of the fault area. If it does, the first hard disk carries, in the return message of the query command, indication information indicating that the first storage block includes lost data; otherwise, the return message carries indication information indicating that the first storage block does not include lost data. After receiving the indication information returned by the first hard disk, the array controller determines that the first storage block is the faulty storage block if the indication information indicates that the first storage block includes lost data. The array controller then generates a new data loss query command, which carries the address, in the first hard disk, of a second storage block of the first hard disk.

In the second method for determining the faulty storage block where the lost data is located, the array controller sends a fault area query command to the first hard disk, receives information returned by the first hard disk that includes the address of the fault area, and determines the faulty storage block according to the address.

Optionally, the fault information includes a capacity of the fault area, and the method further includes: obtaining the capacity of the fault area from the fault information, adding the capacity of the fault area to a total fault capacity of the first hard disk, and, when the total fault capacity is determined to be greater than a preset value, prompting the user to replace the first hard disk.

Determining whether to replace the hard disk based on its total lost capacity is more convenient and accurate than the prior-art approach of detecting the wear of the hard disk.
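Both ways of locating the faulty storage block reduce to an address-range overlap test, sketched below. The sketch assumes fixed-size storage blocks addressed by byte offset; the function names and parameters are hypothetical.

```python
def chunk_has_lost_data(chunk_index, chunk_size, fault_start, fault_length):
    """Disk-side check for the first method: does the queried block's address
    range include part or all of the fault area?"""
    chunk_start = chunk_index * chunk_size
    chunk_end = chunk_start + chunk_size          # exclusive
    fault_end = fault_start + fault_length        # exclusive
    return chunk_start < fault_end and fault_start < chunk_end


def chunks_overlapping(fault_start, fault_length, chunk_size, disk_capacity):
    """Controller-side mapping for the second method: indices of the blocks
    that contain any part of the reported fault area."""
    if fault_length <= 0:
        return []
    fault_end = min(fault_start + fault_length, disk_capacity)  # exclusive
    first = fault_start // chunk_size
    last = (fault_end - 1) // chunk_size
    return list(range(first, last + 1))
```

The first function corresponds to querying one storage block at a time; the second computes all affected blocks at once from the fault area address returned by the hard disk.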
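A minimal sketch of this accumulation on the array controller side follows. The class name, the per-disk dictionary, and the use of bytes as the capacity unit are assumptions made for illustration.

```python
class FaultCapacityTracker:
    """Hypothetical per-disk accumulator of reported fault-area capacity."""

    def __init__(self, preset_limit):
        self.preset_limit = preset_limit
        self.total_by_disk = {}

    def record_fault(self, disk_id, fault_capacity):
        """Add the reported capacity to the disk's total and return True when
        the total exceeds the preset value, i.e. replacement should be prompted."""
        total = self.total_by_disk.get(disk_id, 0) + fault_capacity
        self.total_by_disk[disk_id] = total
        return total > self.preset_limit
```

The comparison is strict, matching the wording "greater than a preset value".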

A second aspect of the present invention provides a hard disk failure processing method. The method is executed by an array controller of a storage array. The storage array includes a first hard disk, and the first hard disk includes a fault area. The method includes: obtaining fault information of the fault area; determining a capacity of the fault area according to the fault information; migrating part of the data in the first hard disk to a second hard disk according to the capacity; and recording the mapping relationship between the address of the migrated data in the first hard disk and its address in the second hard disk.

Optionally, the first hard disk and other hard disks in the storage array form a logical disk according to a redundancy algorithm. The method further includes: determining, according to the fault information, whether data is lost on the first hard disk; and if data is lost on the first hard disk, recovering the data in the first hard disk through the redundancy algorithm.

By using the information about data loss in the hard disk together with the redundancy algorithm across the hard disks, the lost data in the hard disk can be recovered in time.

A third aspect of the present invention provides a hard disk failure processing method, which is executed by a hard disk. The method includes: detecting a fault area in the hard disk; determining whether there is data loss in the fault area; setting, according to the determination result, a flag indicating whether there is data loss in the fault area; and reporting the flag to the array controller as fault information.

Optionally, the hard disk also records the capacity of the fault area, and reports the capacity of the fault area to the array controller as fault information.

By reporting the capacity of the fault area of the hard disk, the array controller can perceive the capacity lost in the fault area, so that when the array receives a write request, it can allocate the write request to a hard disk with a larger remaining capacity and thereby manage the hard disks better.

Optionally, the method further includes determining whether the capacity of the fault area is greater than a preset value, and when the capacity of the fault area is greater than a preset value, reporting the fault information to the array controller.

Reporting the fault information only when the capacity of the fault area is greater than the preset value avoids the performance impact on the storage array of frequently reporting fault information.
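The hard disk side of this threshold-based reporting might look like the sketch below. The names FaultInfo and DiskFaultReporter are hypothetical, and clearing the pending values after each report is an assumption that the text does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultInfo:
    data_lost: bool   # flag indicating whether the fault area lost data
    capacity: int     # accumulated capacity of the fault area


class DiskFaultReporter:
    """Hypothetical disk-side logic: accumulate fault capacity and report
    only once it exceeds a preset value."""

    def __init__(self, report_threshold):
        self.report_threshold = report_threshold
        self.pending_capacity = 0
        self.pending_data_lost = False

    def on_fault_area(self, capacity, data_lost) -> Optional[FaultInfo]:
        self.pending_capacity += capacity
        self.pending_data_lost = self.pending_data_lost or data_lost
        if self.pending_capacity <= self.report_threshold:
            return None  # below the preset value: nothing is reported yet
        info = FaultInfo(self.pending_data_lost, self.pending_capacity)
        # Assumption: pending values are cleared once a report has been produced.
        self.pending_capacity = 0
        self.pending_data_lost = False
        return info
```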

In an implementation of the third aspect, the manner in which the hard disk records and reports the fault information differs with the communication protocol between the hard disk and the array controller. When the communication protocol between the hard disk and the array controller is the SCSI protocol, the fault information is recorded in the informational exceptions log page of the SCSI protocol.

Under the SCSI protocol, there are two ways to report the fault information. In the first, the hard disk receives an input/output (IO) request sent by the array controller, carries the informational exceptions log page in the response information of the IO request, and reports the fault information to the array controller through the response information. In the second, the hard disk receives a fault information query request sent by the array controller, carries the informational exceptions log page in the response information of the fault information query request, and reports the fault information through the response information.

When the communication protocol between the hard disk and the array controller is the ATA protocol, the fault information is recorded in the in-disk information statistics page of the ATA protocol. The fault information is then reported as follows: the hard disk receives a fault information query request sent by the array controller, carries the in-disk information statistics page in the response information of the fault information query request, and reports the fault information through the response information.

When the communication protocol between the hard disk and the array controller is the NVMe protocol, the fault information is recorded in the health information log of the NVMe protocol. The fault information is then reported by carrying the health information log in the response information of an asynchronous event request and reporting the fault information through the response information.

A fourth aspect of the present invention provides a hard disk failure processing method. The hard disk failure processing method provided in the fourth aspect differs from that provided in the first aspect only in that, after the recovered data is stored in the recovery storage block, the recovery storage block replaces the faulty storage block in the storage block group, without recording the correspondence between the recovery storage block and the faulty storage block.

A fifth aspect of the present invention provides an array controller corresponding to the hard disk failure processing method provided in the first aspect. The functions performed by the functional modules of the array controller are the same as the steps of the hard disk failure processing method provided in the first aspect and are not repeated here.

A sixth aspect of the present invention provides an array controller corresponding to the hard disk failure processing method provided in the second aspect. The functions performed by the functional modules of the array controller are the same as the steps of the hard disk failure processing method provided in the second aspect and are not repeated here.

A seventh aspect of the present invention provides a hard disk corresponding to the hard disk failure processing method provided in the third aspect. The functions performed by the functional modules of the hard disk are the same as the steps of the hard disk failure processing method provided in the third aspect and are not repeated here.

An eighth aspect of the present invention provides an array controller corresponding to the hard disk failure processing method provided in the fourth aspect. The functions performed by the functional modules of the array controller are the same as the steps of the hard disk failure processing method provided in the fourth aspect and are not repeated here.

A ninth aspect of the present invention provides an array controller. The array controller includes a processor and a computer-readable storage medium. The storage medium stores program instructions, and the processor runs the program instructions to execute the hard disk failure processing method of the first aspect, the second aspect, or the fourth aspect.

A tenth aspect of the present invention provides a hard disk. The hard disk includes a processor and a computer-readable storage medium. The storage medium stores program instructions, and the processor runs the program instructions to execute the hard disk failure processing method provided in the third aspect.

An eleventh aspect of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method of the first aspect, the second aspect, the third aspect, or the fourth aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings used in the embodiments or the description of the prior art will be briefly introduced below.

FIG. 1 is a structural diagram of a storage array.

FIG. 2 is a schematic diagram of an array controller generating a logical disk in the first embodiment of the present invention.

FIG. 3 is a schematic diagram of a hot spare space and a redundant space provided by a storage array in the first embodiment of the present invention.

FIG. 4 is a flowchart of processing a fault area in a hard disk by the storage array in the first embodiment of the present invention.

FIG. 5 shows the ASC code and ASCQ codes, defined for the SCSI protocol in the embodiment of the present invention, that indicate a fault area in a hard disk.

FIG. 6 is an exemplary diagram of an informational exceptions log page in the SCSI protocol according to an embodiment of the present invention.

FIG. 7 is a schematic diagram of the descriptor format sense data in the return information of the IO request in the embodiment of the present invention.

FIG. 8 is a schematic diagram of an in-disk information statistics page defined in the ATA protocol in an embodiment of the present invention.

FIG. 9 is a schematic diagram of a health information log defined in the NVMe protocol according to an embodiment of the present invention.

FIG. 10 is a flowchart of a processing method when an array controller receives a rewrite request according to an embodiment of the present invention.

FIG. 11 is a schematic diagram of an array controller forming a logical disk from a plurality of independent hard disks through a RAID algorithm in a second embodiment of the present invention.

FIG. 12 is a flowchart of a method for processing a fault area occurring in a hard disk in a second embodiment of the present invention.

FIG. 13 is a block diagram of a hard disk in an embodiment of the present invention.

FIG. 14 is a block diagram of an array controller in the first embodiment of the present invention.

FIG. 15 is a block diagram of an array controller in a second embodiment of the present invention.

Detailed description

In the following, the technical solutions in the embodiments of the present invention will be clearly and completely described with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments.

FIG. 1 is a structural diagram of a storage array 100. The storage array 100 includes an array controller 101, a plurality of hard disks 105-108, a memory 103, a communication interface 104, and a bus 102. The array controller 101 is configured to run a program (not shown) in the memory 103 to manage the hard disks 105-108 and access data therein. The communication interface 104 is used to connect to a host (not shown), and the host can transmit read-write instructions or management instructions to the storage array 100 through the communication interface 104. The communication interface may be a Non-Volatile Memory Express (NVMe) interface or a Small Computer System Interface (SCSI) interface.

As shown in FIG. 2, in the first embodiment of the present invention, the array controller 101 generates a logical disk for use by a host. When the storage array 100 is a flash memory array using an NVMe interface, the host communicates with the storage array 100 through the NVMe protocol, and the logical disk generated by the storage array 100 can be represented by a namespace defined in the NVMe protocol. When the storage array 100 is a storage array using a SCSI interface, the host communicates with the storage array 100 through the SCSI protocol, and the logical disk can be represented by a logical unit number (LUN) defined in the SCSI protocol. In the embodiment of FIG. 2, the process of generating the LUN by the array controller 101 under the SCSI protocol is taken as an example for description.

As shown in FIG. 2, each of the hard disks 105-108 in the storage array 100 is divided into chunks of the same size, and chunks belonging to different hard disks form a chunk group through a redundant array of independent disks (RAID) algorithm. As shown in FIG. 2, the chunk 201 belonging to the hard disk 102, the chunk 202 belonging to the hard disk 106, and the chunk 203 belonging to the hard disk 107 form a chunk group 204 through a RAID 5 algorithm. The chunk 205 belonging to the hard disk 102, the chunk 206 belonging to the hard disk 107, and the chunk 207 belonging to the hard disk 108 form a chunk group 208 through a RAID 5 algorithm. After a chunk group is generated, the array controller 101 records the hard disk on which each chunk of the chunk group is located. A storage pool 209 is constructed based on the chunk groups, and logical disks such as LUN0, LUN1, and LUN3 are constructed based on the storage pool 209.

As shown in FIG. 3, in the embodiment of the present invention, in addition to the storage pool 209, the storage array 100 also provides a hot spare space 210 and a redundant space 211. The hot spare space 210 is used to recover the data of a failed chunk and to replace the failed chunk after a chunk in a hard disk fails. The redundant space is a space reserved by the storage array 100. The storage array 100 does not expose the size of the redundant space externally, and the redundant space is used to improve the performance of the storage array 100. The hot spare space 210 and the redundant space 211 may each be a single hard disk or a pooled space formed from chunks of multiple hard disks. In the embodiment of the present invention, the hot spare space and the redundant space may be divided at a granularity of the same size as a chunk, so that they can be used to replace a failed chunk in a hard disk. How the hot spare space 210 and the redundant space 211 are used to replace a failed chunk in a hard disk is described in detail below together with the hard disk failure processing method.
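The constraint that the chunks of one chunk group come from different hard disks, and that the controller records where each chunk is located, can be sketched as follows. The Chunk type and the first-fit selection policy are assumptions for illustration only; the embodiment does not specify how chunks are chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    disk_id: int   # hard disk on which the chunk is located
    index: int     # position of the chunk within that hard disk


def make_chunk_group(free_chunks, width):
    """Pick `width` free chunks, each from a different hard disk, remove them
    from the free list, and return them as one chunk group."""
    chosen, used_disks = [], set()
    for chunk in free_chunks:
        if chunk.disk_id not in used_disks:
            chosen.append(chunk)
            used_disks.add(chunk.disk_id)
            if len(chosen) == width:
                break
    if len(chosen) < width:
        raise ValueError("not enough hard disks with free chunks")
    for chunk in chosen:
        free_chunks.remove(chunk)
    return tuple(chosen)
```

Because each Chunk carries its disk_id, the returned tuple doubles as the record of which hard disk each member of the chunk group is located on.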

As described in the background art, in the prior art, after a fault area (that is, a section of storage space in a hard disk where a fault occurs) appears in a hard disk, for example after a die in the hard disk fails, the array controller 101 is not aware of it and therefore does not process the fault. Only after the array controller 101 detects that the wear degree of the hard disk reaches a threshold does it directly perform disk replacement. However, while the wear degree of the hard disk has not reached the threshold, the fault area reduces the redundant space of the hard disk, and the reduction of the redundant space affects the performance of the storage system.

The fault processing method provided in the embodiment of the present invention enables the array controller 101 to determine the faulty storage block in which a fault area in a hard disk is located, and to replace the faulty storage block with a storage block in the redundant space 211 or the hot spare space 210, so that the performance of the entire storage system is not affected.

A method for processing a failure in a hard disk in the embodiment of the present invention will be described below through a flowchart of FIG. 4.

FIG. 4 is a flowchart of a method for a storage system to process a failed area in a hard disk under the architecture shown in FIG. 2. In the following description, only the failure of the hard disk 105 in the storage array 100 is described as an example.

In step S401, the hard disk 105 identifies a fault area in the hard disk 105 and accumulates the capacity of the fault area.

The fault area may be a die in the flash memory of the hard disk, or may be a section of space on the hard disk. During operation, the hard disk counts the number of times an abnormality occurs in each of its storage areas. When the number of times a certain type of abnormality occurs in a storage area exceeds a preset value, the storage area can be identified as a fault area. The abnormality may be an error checking and correction (ECC) error, an uncorrectable error-correcting code (UNC) error, a slow response to I/O, or an I/O response timeout. The fault area may be identified in any manner known in the prior art, which is not limited herein. The functions performed by the hard disk are realized by a processor (not shown) in the hard disk executing program code stored in a memory (not shown) in the hard disk.
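The counting rule described above can be sketched as follows. The class name, the area identifiers, and the abnormality labels are hypothetical; the embodiment leaves the identification method open.

```python
from collections import Counter


class AbnormalityMonitor:
    """Hypothetical per-area counter: a storage area becomes a fault area once
    one type of abnormality has occurred more than `preset_count` times."""

    def __init__(self, preset_count):
        self.preset_count = preset_count
        self.counts = Counter()    # (area id, abnormality type) -> occurrences
        self.fault_areas = set()

    def record(self, area_id, kind):
        """Count one abnormality and return whether the area is now a fault area."""
        self.counts[(area_id, kind)] += 1
        if self.counts[(area_id, kind)] > self.preset_count:
            self.fault_areas.add(area_id)
        return area_id in self.fault_areas
```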

In step S402, the array controller 101 obtains fault information of a fault area in the hard disk 105. The fault information includes an identifier indicating whether there is data loss in the fault area. In some embodiments, the fault information further includes a capacity of the fault area.

In practical applications, in order to ensure the reliability of data, some hard disks implement in-disk RAID with data recovery capability, such as RAID 1 or RAID 5. In this way, even if there is a fault area in the hard disk, the in-disk RAID algorithm can recover the data in the fault area, so that no data is lost. However, when the hard disk has no in-disk RAID, or uses a RAID level that cannot recover data, such as RAID 0, the data in the fault area cannot be recovered, and data is lost. Therefore, in the embodiment of the present invention, the array controller 101 obtains an identifier indicating whether there is data loss in the fault area, so that the lost data in the fault area can subsequently be recovered.

In the embodiment of the present invention, the methods for acquiring the fault information are different for different communication protocols between the hard disk 105 and the storage array 100. The methods for acquiring the fault information of the fault area under different communication protocols are described below.

First, the SCSI protocol

In the existing SCSI protocol, an additional sense code (ASC) and an additional sense code qualifier (ASCQ) are defined, and abnormalities occurring in a hard disk are identified through different ASC and ASCQ values. After the hard disk detects an abnormality, the ASC and ASCQ corresponding to the abnormality are recorded in the informational exceptions log page, which is a log page defined in the SCSI protocol for recording abnormalities of a hard disk. In the existing SCSI protocol, an ASC of 5D means that a failure prediction threshold is exceeded (FAILURE PREDICTION THRESHOLD EXCEEDED), that is, when a monitored parameter in the hard disk exceeds its set threshold, the parameter needs to be reported to the array controller. Each ASCQ under 5D defines one of the parameters that need to be monitored in the hard disk. Since the existing SCSI protocol defines no ASCQ for reporting a fault area in a hard disk, and the capacity of the fault area cannot be reported, the embodiment of the present invention defines ASCQ codes 6D and 6E to indicate the fault status of the fault area. FIG. 5 shows the definitions of the newly defined ASCQ codes 6D and 6E, where 6D indicates that a fault area exists in the hard disk but no data is lost, and 6E indicates that a fault area exists in the hard disk and data is lost.

When the hard disk detects a fault area and the total fault capacity reaches a preset value, 6D or 6E is recorded in the informational exceptions log page according to whether the data in the fault area is lost. FIG. 6 is an example of an informational exceptions log page. When the capacity of the fault area reaches the preset value and no data is lost, 5D is filled into the informational exception ASC (INFORMATIONAL EXCEPTION ADDITIONAL SENSE CODE) indicated by the 8th byte of the log page, and 6D is filled into the informational exception ASCQ (INFORMATIONAL EXCEPTION ADDITIONAL SENSE CODE QUALIFIER) indicated by the 9th byte of the log page. When the capacity of the fault area reaches the preset value and data is lost, 5D is filled into the informational exception ASC indicated by the 8th byte of the log page, and 6E is filled into the informational exception ASCQ indicated by the 9th byte. In addition, optionally, the capacity of the fault area is also recorded in the information bytes of the log page; in FIG. 6, these bytes carry "00 00 00 00 00 00 00" (hexadecimal), which represents a fault capacity of 8 GB. The above ASCQ codes 6D and 6E are just examples. In actual use, any ASCQ under ASC 5D that is not used by the protocol can be used.
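How an array controller might interpret these two bytes is sketched below. The sketch assumes that the "8th byte" and "9th byte" are zero-based offsets 8 and 9 of the returned log page; the function name is hypothetical, and 6D and 6E are only the example codes used in this embodiment.

```python
ASC_FAILURE_PREDICTION = 0x5D   # FAILURE PREDICTION THRESHOLD EXCEEDED
ASCQ_FAULT_NO_DATA_LOSS = 0x6D  # example code from this embodiment
ASCQ_FAULT_DATA_LOST = 0x6E     # example code from this embodiment


def parse_fault_status(log_page: bytes):
    """Return None when no fault area is recorded, otherwise True if data
    was lost in the fault area and False if not."""
    if len(log_page) < 10:
        raise ValueError("log page too short")
    asc, ascq = log_page[8], log_page[9]   # assumed zero-based offsets
    if asc != ASC_FAILURE_PREDICTION:
        return None
    if ascq == ASCQ_FAULT_DATA_LOST:
        return True
    if ascq == ASCQ_FAULT_NO_DATA_LOSS:
        return False
    return None  # some other parameter exceeded its threshold
```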

In the SCSI protocol, the fault information of the fault area can be obtained in two ways.

The first method is that the hard disk 105 actively reports.

After receiving an IO request sent by the array controller 101, if an abnormality code, for example 5D and 6E, is recorded in the informational exception ASC byte and the informational exception ASCQ byte of the informational exceptions log page, the hard disk 105 writes the 5D, the 6E, and the fault capacity of 8 GB recorded in the information bytes into the descriptor format sense data in the return information of the IO request. As shown in FIG. 7, the descriptor format sense data includes an ASC byte and an ASCQ byte, and the ASC code 5D and the ASCQ code 6D or 6E obtained from the informational exceptions log page are filled into the ASC and ASCQ of the descriptor format sense data. In addition, the descriptor format sense data also includes information bytes, and the capacity of the fault area can be written into these information bytes.

In this way, after the array controller 101 receives the return information of the IO request, it can obtain the fault information of the fault area in the hard disk from the descriptor format sense data of the return information.
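Extracting these fields on the controller side can be sketched as below. The byte offsets (response code 72h or 73h in byte 0, ASC in byte 2, ASCQ in byte 3, additional length in byte 7, and an information descriptor of type 00h with an 8-byte value) reflect a common reading of the SCSI descriptor sense format and are stated here as assumptions; they are not taken from FIG. 7 and should be checked against the standard. The function name is hypothetical.

```python
def parse_descriptor_sense(sense: bytes):
    """Return (asc, ascq, information) from descriptor format sense data.
    `information` is None when no information descriptor is present."""
    if len(sense) < 8 or (sense[0] & 0x7F) not in (0x72, 0x73):
        raise ValueError("not descriptor format sense data")
    asc, ascq = sense[2], sense[3]
    information = None
    offset = 8
    end = min(len(sense), 8 + sense[7])   # byte 7: additional sense length
    while offset + 2 <= end:
        desc_type, desc_len = sense[offset], sense[offset + 1]
        if desc_type == 0x00 and offset + 12 <= end:
            # Assumed information descriptor: 8-byte big-endian value at +4.
            information = int.from_bytes(sense[offset + 4:offset + 12], "big")
        offset += 2 + desc_len
    return asc, ascq, information
```

In the scheme described above, the returned ASC and ASCQ would then be compared against 5D and 6D or 6E, and the information value would be read as the capacity of the fault area.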

The second way is for the array controller 101 to periodically query the fault information of the fault area in the hard disk 105.

In this manner, the array controller 101 periodically sends a fault query instruction to the hard disk 105, and the fault query instruction carries the identifier of the informational exceptions log page. After receiving the fault query instruction, the hard disk 105 returns the informational exceptions log page to the array controller 101. After receiving the log page, the array controller 101 obtains the contents of the ASC, ASCQ, and information fields from the log page. If the ASC and ASCQ are 5D and 6D, respectively, the array controller 101 can determine that the abnormality in the hard disk 105 is a fault area and that there is no data loss in the fault area. If the ASC and ASCQ are 5D and 6E, respectively, the array controller 101 can determine that the abnormality in the hard disk is a fault area and that data is lost in the fault area. Optionally, the capacity of the fault area can also be obtained from the information bytes.

Second, the Advanced Technology Attachment (ATA) protocol

In the ATA protocol, an in-disk information statistics page (Solid State State Device Statistics) is defined, and this statistics page is used to record abnormal information of various abnormalities of the hard disk detected by the hard disk. In the embodiment of the present invention, new abnormal information, that is, uncorrectable flash unit error information (Uncorrect Flash Unit Error Information) is defined, and is used to record fault information of a fault area in a hard disk. As shown in FIG. 8, the information is represented by a 64-bit binary number, in which 15: 0 digits are set to 00000010 (hexadecimal representation in the figure: 0002), which is an identifier of the information statistics page in the disc. The 23:16 digit is used to indicate whether there is a fault zone in the hard disk. If there is a fault zone, set the 23:16 bit to 00000001 (the hexadecimal representation is 01). If there is no fault zone, set 23: The 16 bits are set to 00000000 (the hexadecimal representation in the figure is 00). The 31:24 digits are used to indicate whether there is data loss in the fault area. If there is data loss, set 31:24 to 00000001 (the hexadecimal representation in the figure is 01). If there is no data loss, set 31:24 Set to 00000000 (or 00 in hexadecimal). In addition, the hard disk also records the capacity of the faulty area in the information statistics page of the disk. For example, in Figure 8 24-31 bytes define parameters: uncorrectable capacity parameter, which is also 64 bits. The fault capacity of the fault zone is recorded at the position corresponding to the parameter. In the embodiment of the present invention, the reported capacity is: 00 00 00 00 00 01 00 (hexadecimal), that is, 8G.

The array controller 101 periodically sends a query command to the hard disk 105, and the query command carries the identifier of the in-disk information statistics page. After receiving the query command, the hard disk 105 returns the in-disk information statistics page to the array controller 101. After the array controller 101 receives the in-disk information statistics page, it obtains the uncorrectable flash unit error information, that is, the 64-bit binary number (or 16-digit hexadecimal number). By parsing the uncorrectable flash unit error information, the array controller 101 obtains the fault area information of the hard disk 105.
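
As an illustration only, the bit layout of the 64-bit uncorrectable flash unit error information described above could be parsed as in the following Python sketch. The class name `FlashUnitErrorInfo` and the function name `parse_flash_unit_error_info` are hypothetical; only the bit positions (15:0, 23:16, 31:24) come from the description.

```python
from dataclasses import dataclass


@dataclass
class FlashUnitErrorInfo:
    page_id: int              # bits 15:0, identifier of the statistics page
    fault_area_present: bool  # bits 23:16 equal to 0x01
    data_lost: bool           # bits 31:24 equal to 0x01


def parse_flash_unit_error_info(qword: int) -> FlashUnitErrorInfo:
    """Split the 64-bit value into the fields described for FIG. 8."""
    return FlashUnitErrorInfo(
        page_id=qword & 0xFFFF,
        fault_area_present=((qword >> 16) & 0xFF) == 0x01,
        data_lost=((qword >> 24) & 0xFF) == 0x01,
    )
```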

Third, the NVMe protocol

In the NVMe protocol, a health information log (Health Information Log) is defined, and the log is used to record abnormal information about various abnormalities detected by the hard disk. As shown in FIG. 9, different bits in byte 0 of the health information log define different abnormal information in the hard disk. Among them, bits 1, 2, 3, and 4 are abnormal information already defined in the existing NVMe protocol; since they are not related to the present invention, they are not described here. In the embodiment of the present invention, a fifth bit is newly defined to indicate whether a fault area appears in the hard disk 105. When the value of the fifth bit is 1, it indicates that a fault area exists in the hard disk 105. In addition, a 32-bit field is defined in the four bytes 6 to 9. The most significant bit, bit 7 of byte 9, indicates whether there is data loss in the fault area: when the bit is set to 1, there is data loss in the fault area; when it is set to 0, there is no data loss in the fault area. The bits after the most significant bit indicate the fault capacity of the faulty storage block. For example, "00 00 00 00 00 01 00" in hexadecimal indicates that there is no data loss in the faulty storage block and that the fault capacity of the faulty storage block is 8 GB, whereas "80, 00, 00, 00, 01, 00" (the highest hexadecimal digit is 8, so the highest binary bit is 1) indicates that there is data loss in the faulty storage block and that the fault capacity of the faulty storage block is 8 GB.
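
A minimal Python sketch of how the fields described above might be read from the health information log follows. It rests on two assumptions that the description does not settle: that the "fifth bit" is bit index 5 of byte 0, and that the capacity bits are returned as a raw number without unit conversion. The function name `parse_health_log` is hypothetical.

```python
from typing import Tuple

# Assumption: the newly defined "fifth bit" is read as bit index 5 of byte 0.
FAULT_AREA_BIT = 5


def parse_health_log(log: bytes) -> Tuple[bool, bool, int]:
    """Return (fault_area_present, data_lost, raw_capacity).

    Bytes 6 to 9 form a 32-bit field whose most significant bit is
    bit 7 of byte 9, so byte 9 is the most significant byte.
    """
    fault_area_present = bool((log[0] >> FAULT_AREA_BIT) & 1)
    field = int.from_bytes(log[6:10], "little")
    data_lost = bool(field >> 31)      # most significant bit of the field
    raw_capacity = field & 0x7FFFFFFF  # remaining 31 bits, unit left unconverted
    return fault_area_present, data_lost, raw_capacity
```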

When the fault capacity of faulty storage blocks counted by the hard disk reaches a preset value, the hard disk reports the health information log to the array controller 101 in a response to an asynchronous event request (Asynchronous Event Request). The array controller 101 can parse the health information log to obtain the fault information of the fault area.

In step S403, the array controller 101 acquires the capacity of the fault area of the hard disk 105 from the fault information, and adds the acquired capacity to the total fault capacity of the hard disk recorded by the array controller 101. When the total fault capacity of the hard disk 105 reaches a preset value, the user is notified to replace the hard disk 105.
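
The accumulation in step S403 can be sketched as follows. This is an illustrative Python sketch only; the class name `FaultCapacityTracker`, the per-disk dictionary, and the threshold parameter are assumptions, and capacities are treated as plain numbers in an unspecified unit.

```python
class FaultCapacityTracker:
    """Tracks the total fault capacity per hard disk (hypothetical helper)."""

    def __init__(self, replace_threshold: int) -> None:
        self.replace_threshold = replace_threshold
        self.total_by_disk: dict = {}

    def add_fault(self, disk_id: int, fault_capacity: int) -> bool:
        """Accumulate a newly reported fault capacity.

        Returns True when the total reaches the preset value, meaning the
        user should be notified to replace the disk.
        """
        total = self.total_by_disk.get(disk_id, 0) + fault_capacity
        self.total_by_disk[disk_id] = total
        return total >= self.replace_threshold
```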

In step S404, the array controller 101 determines that there is data loss in the fault area according to the fault information, and the array controller 101 determines a chunk where the lost data is located.

In the embodiment of the present invention, two methods for determining the chunk where the lost data is located are provided. In the first method, the array controller 101 obtains the address, in the hard disk 105, of each chunk belonging to the hard disk 105 (when the hard disk is an SSD, the address in the hard disk 105 is a logical address in the hard disk), and then sends a data loss query command to the hard disk 105, where the query command carries the logical address of one of the chunks. As described with reference to FIG. 2, when the storage pool is constructed, the array controller 101 records the chunks belonging to each hard disk. Therefore, when determining the chunk where the lost data is located, the array controller 101 queries the hard disk for the logical address of the lost data at chunk granularity. When the hard disk receives the data loss query command, it determines whether the logical address carried by the command includes part or all of the address of the fault area. If it does, the hard disk reports a data loss identifier to the array controller 101; if not, the hard disk reports an identifier indicating that no data is lost to the array controller 101. After the array controller 101 receives the report information, if the report information includes the data loss identifier, the array controller 101 takes the chunk indicated by the logical address carried in the data loss query command as the chunk where the lost data is located. In one embodiment, the fault area reported by the hard disk is generally smaller than a chunk. In this embodiment, if the report information includes the identifier indicating that no data is lost, a new data loss query command is sent to the hard disk, where the new data loss query command carries the logical address of another chunk of the hard disk, and so on, until the chunk where the lost data is located is found.
In another embodiment, if the fault area reported by the hard disk is larger than a chunk, the array controller sends the address of one chunk at a time to the hard disk to determine the faulty storage blocks. After receiving the return information for a chunk, the array controller sends the address of the next chunk to the hard disk, until the addresses of all chunks of the hard disk have been sent, so as to determine the multiple chunks where the lost data is located.
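
The first method can be sketched as a chunk-by-chunk query loop. The Python sketch below is only illustrative: `SimulatedDisk` stands in for the hard disk's handling of the data loss query command, and its name, the half-open address ranges, and the `stop_at_first` switch are all assumptions introduced for the example.

```python
class SimulatedDisk:
    """Stand-in for a hard disk that knows its own fault areas."""

    def __init__(self, fault_ranges):
        # Each fault area is a half-open range (start, end) of logical addresses.
        self.fault_ranges = fault_ranges

    def query_data_loss(self, start: int, length: int) -> bool:
        """True if the queried range includes part or all of a fault area."""
        end = start + length
        return any(s < end and start < e for s, e in self.fault_ranges)


def find_faulty_chunks(disk, chunk_addresses, chunk_size, stop_at_first):
    """Query the disk one chunk at a time and collect faulty chunk addresses.

    stop_at_first=True models a fault area smaller than a chunk;
    stop_at_first=False models querying every chunk of the disk.
    """
    faulty = []
    for addr in chunk_addresses:
        if disk.query_data_loss(addr, chunk_size):
            faulty.append(addr)
            if stop_at_first:
                break
    return faulty
```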

The second method is that the array controller 101 sends a fault list query command to the hard disk 105, and after receiving the query command, the hard disk 105 reports the recorded logical address list of the fault area to the array controller 101. The array controller 101 can determine the chunk where the lost data is located according to the reported logical address list.
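
For the second method, mapping the reported logical address list onto chunks might look like the following Python sketch. It assumes, purely for illustration, that each list entry is a (start, length) pair with a positive length and that chunks are laid out contiguously from logical address 0 with a fixed size; neither detail is specified in the description.

```python
def chunks_from_fault_list(fault_ranges, chunk_size):
    """Return the sorted indexes of chunks touched by any reported fault range.

    fault_ranges: iterable of (start, length) logical address ranges
    (hypothetical layout), with length > 0.
    """
    faulty = set()
    for start, length in fault_ranges:
        first = start // chunk_size
        last = (start + length - 1) // chunk_size
        faulty.update(range(first, last + 1))
    return sorted(faulty)
```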

In step S405, when the array controller 101 determines the chunk where the lost data is located, that is, the failed chunk, it uses other chunks that form a chunk group with the failed chunk to recover the data in the failed chunk by using a RAID algorithm.
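
The description does not fix a particular RAID algorithm. As one possible illustration, the Python sketch below assumes a single-parity (XOR) layout, in which a lost chunk is the XOR of all surviving chunks of the chunk group, including the parity chunk. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_chunk(surviving_chunks):
    """Rebuild the failed chunk from the other chunks of the chunk group.

    Assumes single-parity XOR redundancy; other RAID algorithms need
    different arithmetic.
    """
    return xor_blocks(surviving_chunks)
```

Under this assumption, a parity chunk computed as `xor_blocks([d0, d1])` allows `d1` to be rebuilt with `recover_chunk([d0, parity])`.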

In step S406, the array controller 101 stores the recovered data to an idle chunk in the hot spare space or the OP space, and the idle chunk is a backup chunk. The hard disk where the backup chunk is located is different from the hard disk where other chunks in the chunk group are located.

In step S407, the array controller 101 records a mapping relationship between an address of the failed chunk in the hard disk and an address of the backup chunk in the backup space or the OP space.

In this way, when the array controller 101 subsequently receives a request to update and write data in the faulty chunk, it writes the data to be written in the request into the backup chunk. The data in the faulty chunk is invalidated. In the subsequent process of garbage collection, the space in the faulty chunk other than the faulty area can be released.
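
The mapping recorded in step S407 and the subsequent redirection of writes can be sketched as a simple lookup table. In this illustrative Python sketch, a location is represented as a (disk identifier, address) tuple and the class name `ChunkRemapTable` is hypothetical.

```python
class ChunkRemapTable:
    """Maps the location of a failed chunk to the location of its backup chunk."""

    def __init__(self) -> None:
        self._map = {}

    def record(self, failed_location, backup_location) -> None:
        """Record the mapping described in step S407."""
        self._map[failed_location] = backup_location

    def resolve(self, location):
        """Return the backup location if the chunk was remapped, else the original."""
        return self._map.get(location, location)
```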

In the second embodiment of the present invention, after the recovered data is stored to the backup chunk in step S406, the array controller 101 replaces the failed chunk in the chunk group with the backup chunk (the recovery storage block). When replacing, the address of the failed chunk on the first hard disk, as recorded in the metadata of the chunk group, may be replaced with the address of the recovery storage block on the hard disk where the recovery storage block is located.

After the hard disk has reported the lost capacity of the hard disk, the array controller 101 records the lost capacity of each hard disk, calculates the current available capacity of each hard disk, and limits the data written to the hard disk with a relatively large lost capacity. FIG. 10 is a flowchart of a processing method when the array controller 101 receives a rewrite request for data in a hard disk.

Step S501: Receive a write request, where the write request carries data to be written, a logical address of the data to be written, and a data amount of the data to be written;

In step S502, it is determined that the target hard disk of the data to be written is the hard disk 105 according to the logical address of the data to be written.

Step S503: Query the available capacity of the hard disk 105.

In step S504, it is determined whether the available capacity of the hard disk is smaller than the data amount of the data to be written.

Step S505: If the available capacity of the hard disk is greater than the data amount of the data to be written, write the data to be written to the hard disk.

Step S506: if the available capacity of the hard disk is less than or equal to the data amount of the data to be written, write the data to be written into the hot spare space or redundant space, and mark the data pointed to by the logical address in the hard disk 105 as garbage data awaiting subsequent garbage collection.

After the array controller 101 marks the available capacity of each hard disk, when a new chunk group needs to be created subsequently, a hard disk with a large available capacity can be selected to create a chunk group. The available capacity of the hard disk is: the nominal capacity of the hard disk minus the lost capacity, and then minus the used space.
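
The available-capacity formula and the decision made in steps S504 to S506 can be sketched in Python as follows. The function names and the string labels returned by `choose_write_target` are assumptions used only for illustration.

```python
def available_capacity(nominal: int, lost: int, used: int) -> int:
    """Nominal capacity minus the lost capacity, then minus the used space."""
    return nominal - lost - used


def choose_write_target(available: int, data_amount: int) -> str:
    """Decide where a write goes, following steps S504 to S506."""
    if available > data_amount:
        return "hard_disk"  # step S505
    # Step S506: the old data in the hard disk is then marked as garbage.
    return "hot_spare_or_redundant_space"
```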

As shown in FIG. 11, in the third embodiment of the present invention, a plurality of independent hard disks 1104-1106 form a logical disk LUN 1101 through a RAID algorithm. The hot spare space 1102 and the redundant space 1103 are also provided by independent hard disks 1107 and 1108.

The following describes how to deal with a fault area occurring in a hard disk in the third embodiment through a flowchart shown in FIG. 12.

In the third embodiment, steps S701 to S703 are the same as steps S401 to S403 in FIG. 4 in the first embodiment, and details are not described herein again.

In step S704, the array controller 101 obtains an identifier in the fault information indicating whether there is data loss in the fault area.

Step S705, if the identification information indicates that no data is lost in the faulty area, the array controller 101 migrates data in the hard disk 105 with the same capacity as the lost capacity to the hot spare space 1102 or redundant space 1103.

In step S706, if the identification information indicates that data is lost in the faulty area of the hard disk, the data in the hard disk is recovered by using a RAID algorithm. After the recovery, step S705 is performed, that is, data in the hard disk 105 with the same capacity as the lost capacity is migrated to the hot spare space 1102 or the redundant space 1103.

Step S707: Record a mapping relationship between the address, in the hard disk 105, of the migrated data and the address to which the data is migrated in the hot spare space or the redundant space.

When a subsequent access request to the migrated data is received, the migrated data may be accessed in the hot spare space or redundant space according to the mapping relationship.
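
The ordering of steps S704 to S707 can be summarized in a short Python sketch. The function name `plan_fault_handling` and the step labels are hypothetical; the sketch only returns the sequence of actions and does not perform any recovery or migration.

```python
def plan_fault_handling(data_lost: bool, lost_capacity: int) -> list:
    """Return the ordered actions for a reported fault area (third embodiment)."""
    steps = []
    if data_lost:
        steps.append("recover_with_raid")      # step S706: recover first
    steps.append(("migrate", lost_capacity))   # step S705: migrate equal capacity
    steps.append("record_mapping")             # step S707
    return steps
```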

In this way, the redundant space or the hot spare space is used to compensate for the space lost in the fault area of the hard disk, so that the redundant space in the hard disk does not need to be used to compensate the fault area in the hard disk. When a fault area occurs, the redundant space in the hard disk will not be reduced, so that the wear of the hard disk will not be increased, and the performance of the storage array is guaranteed.

FIG. 13 is a block diagram of a hard disk 1200 according to an embodiment of the present invention.

The hard disk 1200 includes an identification module 1201, a marking module 1202, and a reporting module 1203. The identification module 1201 is configured to identify a fault area in the hard disk 105 and accumulate the capacity of the fault area. The function performed by the identification module 1201 is the same as step S401 in FIG. 4. For how to identify the fault area and the capacity of the cumulative fault area, refer to the description of step S401. The marking module 1202 is used to mark the fault information of the identified fault area. For the marking method, please refer to step S402 on how to mark the fault area in the hard disk in different protocols, such as the SCSI protocol, the ATA protocol, and the NVMe protocol. For related descriptions of the fault information, refer to the related descriptions of FIG. 4, FIG. 5, FIG. 8, and FIG. 9.

The reporting module 1203 is configured to report the fault information marked by the marking module 1202 to the array controller. For the specific way in which the reporting module 1203 reports fault information, refer to the descriptions in step S402 of how the hard disk reports fault information of the fault area in different protocols, such as the SCSI protocol, the ATA protocol, and the NVMe protocol. Details are not repeated here.

FIG. 14 is a block diagram of the array controller 1300 in the first embodiment of the present invention. The array controller 1300 includes an acquisition module 1301, an accumulation module 1302, a recovery module 1303, and a recording module 1304. The acquisition module 1301 is configured to acquire fault information of a fault area in a hard disk. For the method of acquiring the fault information, refer to the description of step S402. The fault information is acquired in different ways under different protocols, such as the SCSI protocol, the ATA protocol, and the NVMe protocol; for details, refer to the description of step S402, which is not repeated here.

The accumulation module 1302 is configured to obtain the capacity of the hard disk fault area from the fault information, and add the acquired capacity information to the total fault capacity of the hard disk that is recorded. When the total failure capacity of the hard disk reaches a preset value, the user is notified to replace the hard disk. For details, refer to the related description of step S403.

The recovery module 1303 is configured to: after the acquisition module 1301 obtains the fault information of a fault area of a hard disk, if the fault information indicates data loss in the fault area, determine the faulty chunk in which the lost data is located; use the other chunks that form a chunk group with the faulty chunk to recover the data in the faulty chunk through the RAID algorithm; store the recovered data in the backup chunk; and form a new chunk group from the backup chunk and the chunks in the chunk group other than the faulty chunk. For details, refer to the related description of steps S404 to S407.

The recording module 1304 is configured to record a mapping relationship between an address of the faulty chunk in the hard disk and an address of the backup chunk in the backup space or the OP space. For details, refer to the related description of step S407.

The acquisition module, the accumulation module, and the recovery module of the array controller in the second embodiment of the present invention have the same functions as the acquisition module 1301, the accumulation module 1302, and the recovery module 1303 in the array controller of the first embodiment. The difference is that, in the second embodiment, the recording module replaces the failed chunk in the chunk group with the backup chunk that stores the recovered data (the recovery storage block). When replacing, the address of the failed chunk on the first hard disk, as recorded in the metadata of the chunk group, may be replaced with the address of the recovery storage block on the hard disk where the recovery storage block is located.

FIG. 15 is a block diagram of an array controller 1400 in a third embodiment of the present invention. The array controller 1400 includes an acquisition module 1401, an accumulation module 1402, a migration module 1403, and a recording module 1404.

The functions of the acquisition module 1401 and the accumulation module 1402 are the same as those of the acquisition module 1301 and the accumulation module 1302 in the array controller 1300. For details, refer to the related descriptions of the acquisition module 1301 and the accumulation module 1302, which are not repeated here. The migration module 1403 is configured to: if the fault information indicates that there is no data loss in the fault area, migrate data in the hard disk with the same capacity as the lost capacity to the hot spare space or the redundant space; if the fault information indicates that data is lost in the fault area of the hard disk, recover the data in the hard disk by a RAID algorithm and, after the recovery, migrate data in the hard disk with the same capacity as the lost capacity to the hot spare space or the redundant space. For details, refer to the related description of steps S704 to S706.

The recording module 1404 is configured to record a mapping relationship between the address, in the hard disk 105, of the migrated data and the address to which the data is migrated in the hot spare space or the redundant space. For details, refer to the related description of step S707. When a subsequent access request to the migrated data is received, the migrated data may be accessed in the hot spare space or redundant space according to the mapping relationship.

One or more of the above modules may be implemented in software, hardware, or a combination of both. When any of the above modules or units is implemented in software, the software exists in the form of computer program instructions stored in a memory, and a processor may be used to execute the program instructions and implement the above method flow. The processor may include, but is not limited to, at least one of the following types of computing devices that run software: a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a microcontroller (MCU), or an artificial intelligence processor. Each computing device may include one or more cores for executing software instructions to perform operations or processing. The processor can be built into a SoC (system on a chip) or an application-specific integrated circuit (ASIC), or it can be a separate semiconductor chip. In addition to the cores used to execute software instructions to perform operations or processing, the processor may further include necessary hardware accelerators, such as field programmable gate arrays (FPGAs), programmable logic devices (PLDs), or logic circuits that implement dedicated logic operations.

When the above modules or units are implemented in hardware, the hardware can be any one or any combination of a CPU, a microprocessor, a DSP, an MCU, an artificial intelligence processor, an ASIC, a SoC, an FPGA, a PLD, a dedicated digital circuit, a hardware accelerator, or a non-integrated discrete device, which may run the necessary software or perform the above method flow without depending on software.

The hard disk failure processing method, the array controller, and the hard disk provided in the embodiments of the present invention have been described above. Specific examples are used in this document to explain the principle and implementation of the present invention. The descriptions of the above embodiments are only intended to help understand the method of the present invention and its core ideas. Meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the ideas of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

A hard disk failure processing method, executed by an array controller of a storage array, wherein the storage array includes multiple hard disks, each hard disk is divided into multiple storage blocks, and multiple storage blocks located on different hard disks form a storage block group through a redundant algorithm; the method includes:

Acquiring fault information of a fault area where a fault occurs in the first hard disk;

When the fault information indicates data loss in the fault area, determining the faulty storage block where the missing data is located;

Recovering data of the faulty storage block by using other storage blocks in the storage block group to which the faulty storage block belongs;

Storing the recovered data to a recovery storage block, the recovery storage block being located on a second hard disk, the second hard disk being a hard disk other than the hard disk where the storage block group is located;

Recording the correspondence between the address of the data in the faulty storage block in the first hard disk and the address of the recovery storage block in the second hard disk.
The method according to claim 1, wherein the acquiring the fault information of the fault area where the fault occurs in the first hard disk comprises:

Receiving the failure information reported by the first hard disk.
The method according to claim 1, wherein the acquiring the fault information of the fault area where the fault occurs in the first hard disk comprises:

Sending a fault query command to the first hard disk;

Receiving the fault information reported by the first hard disk according to the fault query command.
The method according to claim 2 or 3, wherein the fault information includes an identifier indicating whether data is lost, and it is determined that data is lost in the fault area according to the identifier.
The method according to any one of claims 1-4, wherein the determining the faulty storage block where the missing data is located comprises:

Obtaining an address, in the first hard disk, of a first storage block of the first hard disk;

Sending a data loss query command to the first hard disk, where the query command carries an address of the first storage block in the first hard disk;

Receiving indication information returned by the first hard disk indicating whether the first storage block includes the lost data;

When the indication information indicates that the first storage block includes the lost data, determining that the first storage block is the faulty storage block;

When the indication information indicates that the first storage block does not include the lost data, generating a new data loss query command, where the new data loss query command carries an address, in the first hard disk, of a second storage block of the first hard disk.
The method according to any one of claims 1-4, wherein the determining the faulty storage block where the missing data is located comprises:

Sending a fault zone query command to the first hard disk;

Receiving information returned by the first hard disk and including an address of the faulty area;

The faulty storage block is determined according to the address of the faulty area.
The method according to any one of claims 1-6, wherein the fault information includes a capacity of the fault area, and the method further comprises:

Acquiring the capacity of the fault area in the fault information, and adding the capacity of the fault area to the total fault capacity of the first hard disk;

When it is determined that the total failure capacity is greater than a preset value, the user is prompted to replace the first hard disk.
A hard disk failure processing method is executed by an array controller of a storage array. The storage array includes a first hard disk, and the first hard disk includes a fault area. The method includes:

Acquiring fault information of the fault area;

Determining the capacity of the fault area according to the fault information;

Migrate part of the data in the first hard disk to the second hard disk according to the capacity;

The mapping relationship between the address of the migrated data in the first hard disk and the address in the second hard disk is recorded.
The method according to claim 8, wherein the first hard disk and other hard disks in the storage array form a logical disk according to a redundant algorithm, and the method further comprises:

Determining whether there is data loss on the first hard disk according to the failure information;

If there is data loss on the first hard disk, the data in the first hard disk is restored through the redundant algorithm.
A hard disk failure processing method executed by a hard disk, the method includes:

Detecting a fault area in the hard disk;

Determining whether there is data loss in the fault area;

Setting a flag for whether there is data loss in the fault area according to the determination result;

Reporting to the array controller, as fault information, a flag indicating that the hard disk includes a fault area and the flag indicating whether there is data loss in the fault area.
The method of claim 10, further comprising:

The capacity of the fault area is recorded, and the fault information further includes the capacity of the fault area.
The method according to claim 11, further comprising:

Judging whether the capacity of the fault area is greater than a preset value;

When the capacity of the fault area is greater than a preset value, the fault information is reported to the array controller.
The method according to any one of claims 10 to 12, wherein the fault information is recorded in an information exception log page in a small computer system interface SCSI protocol;

The method further includes:

Receiving an input-output IO request sent by the array controller;

Carrying the information abnormality log page in the response information of the IO request, and reporting the failure information to the array controller through the response information.
The method according to any one of claims 10 to 12, wherein the fault information is recorded in an information exception log page in a small computer system interface SCSI protocol;

The method further includes:

Receiving the fault information query request sent by the array controller;

The information abnormality log page is carried in response information of the failure information query request, and the failure information is reported through the response information.
The method according to any one of claims 10 to 12, wherein the fault information is recorded in an on-disk information statistics page in the Advanced Technology Attachment ATA protocol;

The method further includes:

Receiving a fault information query request sent by the array controller;

Carrying the on-disk information statistics page in response information of the failure information query request, and reporting the failure information through the response information.
The method according to any one of claims 10 to 12, wherein the fault information is recorded in a health information log in a non-volatile memory standard NVMe protocol;

The method further includes:

Carrying the health information log in response information of an asynchronous event request, and reporting the failure information through the response information.
An array controller in a storage array, wherein the storage array includes multiple hard disks, each hard disk is divided into multiple storage blocks, and multiple storage blocks located on different hard disks form a storage block group through a redundant algorithm; the array controller includes:

An acquisition module, configured to acquire fault information of a fault area where a fault occurs in the first hard disk;

A recovery module, configured to: when the fault information indicates data loss in the fault area, determine a faulty storage block in which the lost data is located; use the other storage blocks in the storage block group to which the faulty storage block belongs to recover the data of the faulty storage block; and store the recovered data to a recovery storage block, the recovery storage block being located on a second hard disk, the second hard disk being a hard disk other than the hard disk where the storage block group is located;

A recording module, configured to record a correspondence between an address of the data in the faulty storage block in the first hard disk and an address of the recovery storage block in the second hard disk.
The array controller according to claim 17, wherein the way for the acquisition module to obtain the fault information of the faulted area in the first hard disk is to receive the fault information reported by the first hard disk.
The array controller according to claim 17, wherein the way for the acquisition module to obtain the fault information of the faulted area in the first hard disk is to send a fault query command to the first hard disk and receive the fault information reported by the first hard disk according to the fault query command.
The array controller according to any one of claims 17 to 19, wherein the fault information includes an identifier indicating whether data is lost, and it is determined that data is lost in the fault area according to the identifier.
The array controller according to any one of claims 17 to 20, wherein, when determining the faulty storage block where the lost data is located, the recovery module is specifically configured for:

Obtaining an address, in the first hard disk, of a first storage block of the first hard disk;

Sending a data loss query command to the first hard disk, where the query command carries an address of the first storage block in the first hard disk;

Receiving indication information returned by the first hard disk indicating whether the first storage block includes the lost data;

When the indication information indicates that the first storage block includes the lost data, determining that the first storage block is the faulty storage block;

When the indication information indicates that the first storage block does not include the lost data, generating a new data loss query command, where the new data loss query command carries an address, in the first hard disk, of a second storage block of the first hard disk.
The array controller according to any one of claims 17 to 20, wherein, when determining the faulty storage block where the lost data is located, the recovery module is specifically configured for:

Sending a fault zone query command to the first hard disk;

Receiving information returned by the first hard disk and including an address of the faulty area;

The faulty storage block is determined according to the address of the faulty area.
The array controller according to any one of claims 17 to 22, wherein the fault information includes a capacity of the fault area, and the array controller further comprises:

An accumulation module, configured to obtain the capacity of the fault area in the fault information, and add the capacity of the fault area to the total fault capacity of the first hard disk; and when it is determined that the total fault capacity is greater than a preset value, prompt the user to replace the first hard disk.
An array controller for a storage array, wherein the storage array includes a first hard disk, the first hard disk includes a fault area, and the array controller includes:

An acquisition module, configured to acquire fault information of the fault area;

A migration module, configured to determine the capacity of the fault area according to the fault information, and migrate part of the data in the first hard disk to the second hard disk according to the capacity;

The recording module is configured to record a mapping relationship between an address of the migrated data in the first hard disk and an address in the second hard disk.
The array controller according to claim 24, wherein the first hard disk and other hard disks in the storage array form a logical disk according to a redundancy algorithm, and the migration module is further configured to:

Determine whether there is data loss on the first hard disk according to the fault information;

If there is data loss on the first hard disk, restore the data in the first hard disk through the redundancy algorithm.
A hard disk including:

An identification module for detecting a fault area in the hard disk;

A marking module, configured to determine whether there is data loss in the fault area, and to set, according to the determination result, a flag indicating whether there is data loss in the fault area;

A reporting module, configured to report to the array controller, as fault information, a flag indicating that the hard disk includes a fault area together with the flag indicating whether there is data loss in the fault area.
The hard disk according to claim 26, wherein the marking module is further configured to record the capacity of the fault area, and the fault information further includes the capacity of the fault area.
The hard disk according to claim 26, wherein the reporting module is further configured to:

Judging whether the capacity of the fault area is greater than a preset value;

When the capacity of the fault area is greater than a preset value, the fault information is reported to the array controller.
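The hard-disk-side behaviour of the claims above (set the flags, record the capacity, report only above a threshold) might look like the following sketch. The `FaultInfo` record, its field names, and the function `build_report` are assumptions made for illustration; the claims do not define a concrete data layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultInfo:
    """Hypothetical fault information reported to the array controller."""
    has_fault_area: bool   # flag: the disk contains a fault area
    data_lost: bool        # flag: data was lost in the fault area
    capacity: int          # capacity of the fault area (unit assumed: bytes)


def build_report(capacity: int, data_lost: bool, preset_value: int) -> Optional[FaultInfo]:
    """Return fault information only when the capacity exceeds the preset value."""
    if capacity > preset_value:
        return FaultInfo(has_fault_area=True, data_lost=data_lost, capacity=capacity)
    return None   # below the threshold: nothing is reported


reported = build_report(capacity=10, data_lost=True, preset_value=4)
suppressed = build_report(capacity=2, data_lost=False, preset_value=4)
```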
The hard disk according to any one of claims 26 to 28, wherein the recording module records the fault information in an informational exceptions log page of the Small Computer System Interface (SCSI) protocol;

The reporting module is specifically configured to:

Receive an input/output (IO) request sent by the array controller;

Carry the informational exceptions log page in the response information of the IO request, and report the fault information to the array controller through the response information.
The hard disk according to any one of claims 26 to 28, wherein the recording module records the fault information in an informational exceptions log page of the Small Computer System Interface (SCSI) protocol;

The reporting module is specifically configured to:

Receive a fault information query request sent by the array controller;

Carry the informational exceptions log page in the response information of the fault information query request, and report the fault information through the response information.
The hard disk according to any one of claims 26 to 28, wherein the fault information is recorded in a disk information statistics page of the Advanced Technology Attachment (ATA) protocol;

The reporting module is specifically configured to:

Receive a fault information query request sent by the array controller;

Carry the disk information statistics page in the response information of the fault information query request, and report the fault information through the response information.
The hard disk according to any one of claims 26 to 28, wherein the fault information is recorded in a health information log of the Non-Volatile Memory Express (NVMe) protocol;

The reporting module is specifically configured to:

Carry the health information log in the response information of an asynchronous event request, and report the fault information through the response information.
A hard disk fault processing method, executed by an array controller of a storage array, wherein the storage array includes multiple hard disks, each hard disk is divided into multiple storage blocks, and multiple storage blocks located on different hard disks form a storage block group through a redundancy algorithm; the method includes:

Acquiring fault information of a fault area where a fault occurs in the first hard disk;

When the fault information indicates data loss in the fault area, determining the faulty storage block where the missing data is located;

Recovering data of the faulty storage block by using other storage blocks in the storage block group to which the faulty storage block belongs;

Storing the recovered data to a recovery storage block, the recovery storage block being located on a second hard disk, the second hard disk being a hard disk other than the hard disk where the storage block group is located;

Replacing the faulty storage block in the storage block group with the recovery storage block.
The method according to claim 33, wherein the replacing the faulty storage block in the storage block group with the recovery storage block comprises:

Replacing the address, in the first hard disk, of the faulty storage block recorded in the metadata of the storage block group with the address of the recovery storage block in the second hard disk.
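The recovery and replacement steps of the two claims above can be illustrated with a small sketch. The claims only require "a redundancy algorithm"; single-parity XOR (as in RAID 5) is assumed here purely for illustration. The names `xor_blocks` and `recover_and_replace`, and the representation of the group metadata as a list of `(disk_id, address)` pairs, are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def recover_and_replace(group, disks, faulty, spare):
    """Rebuild the faulty member from the survivors and update the group metadata.

    group  : list of (disk_id, address) members of the storage block group
    disks  : dict disk_id -> dict address -> data
    faulty : the (disk_id, address) member whose data was lost
    spare  : (disk_id, address) of the recovery storage block on another disk
    """
    survivors = [disks[d][a] for (d, a) in group if (d, a) != faulty]
    spare_disk, spare_addr = spare
    disks[spare_disk][spare_addr] = xor_blocks(survivors)   # store recovered data
    # Replace the faulty block's address with the recovery block's address.
    return [spare if member == faulty else member for member in group]


data0, data1 = b"\x01\x02", b"\x04\x08"
parity = xor_blocks([data0, data1])
disks = {0: {0: data0}, 1: {}, 2: {0: parity}, 3: {}}   # disk 1 lost its block
group = [(0, 0), (1, 0), (2, 0)]
new_group = recover_and_replace(group, disks, faulty=(1, 0), spare=(3, 7))
```

In this example disk 3 is outside the original group, which corresponds to the requirement that the recovery storage block be located on a second hard disk that does not already hold a member of the storage block group.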
The method according to claim 33 or 34, wherein the acquiring fault information of a fault area in which a fault occurs in the first hard disk comprises:

Receiving the failure information reported by the first hard disk.
The method according to claim 33 or 34, wherein the acquiring fault information of a fault area in which a fault occurs in the first hard disk comprises:

Sending a fault query command to the first hard disk;

Receiving the fault information reported by the first hard disk according to the fault query command.
The method according to any one of claims 34 to 36, wherein the fault information includes an identifier indicating whether data is lost, and it is determined that data is lost in the fault area according to the identifier.
The method according to any one of claims 33 to 37, wherein the determining the faulty storage block where the missing data is located comprises:

Obtaining the address, in the first hard disk, of a first storage block of the first hard disk;

Sending a data loss query command to the first hard disk, where the query command carries an address of the first storage block in the first hard disk;

Receiving indication information returned by the first hard disk indicating whether the first storage block includes the lost data;

When the indication information indicates that the first storage block includes the missing data, determining that the first storage block is the faulty storage block;

When the indication information indicates that the first storage block does not include the lost data, generating a new data loss query command, where the new data loss query command carries an address, in the first hard disk, of a second storage block of the first hard disk.
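The block-by-block query described in this claim amounts to a simple loop, sketched below. The callable `query_data_loss` stands in for the data loss query command and its returned indication information; it is a hypothetical placeholder, not a real disk interface, and the sketch assumes the search stops at the first faulty storage block found.

```python
def find_faulty_block(block_addresses, query_data_loss):
    """Query each storage block in turn until one reports lost data.

    block_addresses : addresses, in the first hard disk, of its storage blocks
    query_data_loss : callable(address) -> bool, True if the block holds lost data
    Returns the address of the faulty storage block, or None if none is found.
    """
    for address in block_addresses:
        if query_data_loss(address):
            return address    # this block is the faulty storage block
        # Otherwise a new query is issued with the next block's address.
    return None


lost_blocks = {0x2000}
queried = []


def simulated_query(address):
    """Stand-in for the disk: records the query and answers from lost_blocks."""
    queried.append(address)
    return address in lost_blocks


faulty = find_faulty_block([0x0000, 0x1000, 0x2000, 0x3000], simulated_query)
```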
The method according to any one of claims 33 to 37, wherein the determining the faulty storage block where the missing data is located comprises:

Sending a fault area query command to the first hard disk;

Receiving information that is returned by the first hard disk and includes an address of the fault area;

Determining the faulty storage block according to the address of the fault area.
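Determining the faulty storage block from the address of the fault area can be sketched as an overlap calculation. The claim does not specify how the address is encoded, so the sketch assumes the disk returns a start offset and a length, and that storage blocks are contiguous and of a fixed size; the function name `blocks_overlapping` is hypothetical.

```python
def blocks_overlapping(fault_start, fault_length, block_size):
    """Return the indices of the storage blocks touched by the fault area.

    fault_start  : start offset of the fault area on the disk (assumed byte offset)
    fault_length : length of the fault area, must be positive
    block_size   : size of each storage block (assumed fixed)
    """
    if fault_length <= 0:
        raise ValueError("fault_length must be positive")
    first_block = fault_start // block_size
    last_block = (fault_start + fault_length - 1) // block_size
    return list(range(first_block, last_block + 1))


# A fault area covering bytes 1500..2499 touches blocks 1 and 2 of size 1024.
affected = blocks_overlapping(fault_start=1500, fault_length=1000, block_size=1024)
```

Compared with querying block by block, this approach needs a single command to the disk, at the cost of requiring the disk to expose the fault area's address.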
An array controller in a storage array, wherein the storage array includes multiple hard disks, each hard disk is divided into multiple storage blocks, and multiple storage blocks located on different hard disks form a storage block group through a redundancy algorithm; the array controller includes:

An acquisition module, configured to acquire fault information of a fault area where a fault occurs in the first hard disk;

A recovery module, configured to: when the fault information indicates data loss in the fault area, determine a faulty storage block in which the lost data is located; use the other storage blocks in the storage block group to which the faulty storage block belongs to recover the data of the faulty storage block; and store the recovered data to a recovery storage block, the recovery storage block being located on a second hard disk, the second hard disk being a hard disk other than the hard disk where the storage block group is located;

A recording module, configured to replace the faulty storage block in the storage block group with the recovery storage block.
An array controller in a storage array, wherein, when replacing the faulty storage block in the storage block group with the recovery storage block, the replacement module is specifically configured to replace the address, in the first hard disk, of the faulty storage block recorded in the metadata of the storage block group with the address of the recovery storage block in the second hard disk.
The array controller according to claim 40 or 41, wherein the way for the acquisition module to obtain the fault information of the faulted area in the first hard disk is to receive the fault information reported by the first hard disk.
The array controller according to claim 40 or 41, wherein the way for the acquisition module to obtain the fault information of the fault area in the first hard disk is to send a fault query command to the first hard disk, and to receive the fault information reported by the first hard disk according to the fault query command.
The array controller according to any one of claims 40 to 43, wherein the fault information includes an identifier indicating whether data is lost, and it is determined that data is lost in the fault area according to the identifier.
The array controller according to any one of claims 40 to 43, wherein, when determining the faulty storage block where the lost data is located, the recovery module is specifically configured to:

Obtain the address, in the first hard disk, of a first storage block of the first hard disk;

Send a data loss query command to the first hard disk, where the query command carries the address of the first storage block in the first hard disk;

Receive indication information returned by the first hard disk indicating whether the first storage block includes the lost data;

When the indication information indicates that the first storage block includes the lost data, determine that the first storage block is the faulty storage block;

When the indication information indicates that the first storage block does not include the lost data, generate a new data loss query command, where the new data loss query command carries an address, in the first hard disk, of a second storage block of the first hard disk.
The array controller according to any one of claims 40 to 43, wherein, when determining the faulty storage block where the lost data is located, the recovery module is specifically configured to:

Send a fault area query command to the first hard disk;

Receive information that is returned by the first hard disk and includes an address of the fault area;

Determine the faulty storage block according to the address of the fault area.