WO2020048442A1 - Hard disk fault processing method, array controller and hard disk - Google Patents
- Publication number
- WO2020048442A1 (PCT/CN2019/104163)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- hard disk
- fault
- information
- storage block
- data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2053—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
- G06F11/2094—Redundant storage or storage space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
Definitions
- the present invention relates to the field of storage technology, and in particular, to a processing method after a failure occurs in a storage area in a hard disk, and an array controller and a hard disk that execute the processing method.
- Because solid-state hard disks cannot overwrite data in place and can only write it to a new location, part of the disk must be reserved as redundant space that serves as free space for incoming writes, which improves the performance of the solid-state hard disk.
- the nominal capacity provided by the solid state hard disk to the outside does not include the capacity of the redundant space.
- When a partial area of a solid-state hard disk fails (for example, a die failure; such an area is hereinafter referred to as a fault area), the fault area must be compensated for with capacity from the redundant space so that the nominal capacity is not reduced, which shrinks the redundant space. A smaller redundant space increases the wear of the solid-state hard disk and thereby degrades its performance.
- An embodiment of the present invention provides a method for processing a fault area that appears in a hard disk. Using the processing method, after a fault area of a hard disk occurs, the redundant space of the hard disk is not reduced, and thus the wear degree of the hard disk is not increased.
- a first aspect of the embodiments of the present invention provides a hard disk failure processing method, where the method is executed by an array controller of a storage array.
- the storage array includes a plurality of hard disks, each hard disk is divided into a plurality of storage blocks, and a plurality of storage blocks located on different hard disks form a storage block group through a redundant algorithm.
- the method includes: obtaining fault information of a fault area in a first hard disk; when the fault information indicates that data is lost in the fault area, determining the faulty storage block in which the lost data is located; recovering the data of the faulty storage block by using the other storage blocks in the storage block group to which the faulty storage block belongs; storing the recovered data in a recovery storage block, where the recovery storage block is located on a second hard disk, and the second hard disk is a hard disk other than the hard disks on which the storage block group is located; and recording the correspondence between the address of the data of the faulty storage block in the first hard disk and the address of the recovery storage block in the second hard disk.
- two methods are provided for obtaining fault information of a faulted area in the first hard disk, and the first is for the array controller to receive the fault information reported by the first hard disk.
- the second is that the array controller sends a fault query command to the first hard disk; and then receives the fault information reported by the first hard disk according to the fault query command.
- the fault information includes an identifier indicating whether data is lost, and it is determined that data is lost in the fault area according to the identifier.
- two methods are provided for determining a failed storage block where the missing data is located.
- In the first, the array controller obtains the address, in the first hard disk, of a first storage block of the first hard disk, and sends the first hard disk a data loss query command carrying that address. After receiving the query command, the first hard disk determines whether the address carried in the command includes part or all of the address of the fault area. If it does, the first hard disk returns, in the response to the query command, indication information indicating that the first storage block includes lost data; otherwise, the response carries indication information indicating that the first storage block does not include lost data.
- After receiving the indication information returned by the first hard disk, the array controller determines that the first storage block is the faulty storage block if the indication information indicates that the first storage block includes lost data. The array controller then generates a new data loss query command that carries the address, in the first hard disk, of a second storage block of the first hard disk.
- In the second, the array controller sends a fault area query command to the first hard disk, receives information returned by the first hard disk that includes the address of the fault area, and determines the faulty storage block according to that address.
- the fault information includes the capacity of the fault area, and the method further includes: obtaining the capacity of the fault area from the fault information and adding it to the total fault capacity of the first hard disk; and, when the total fault capacity is greater than a preset value, prompting the user to replace the first hard disk.
- the determination of whether to replace the hard disk based on the total lost capacity of the hard disk is more convenient and accurate than the method for detecting the wear of the hard disk in the prior art.
- a second aspect of the present invention provides a hard disk failure processing method.
- the method is executed by an array controller of a storage array.
- the storage array includes a first hard disk, and the first hard disk includes a fault area.
- the method includes obtaining fault information of the fault area; determining a capacity of the fault area according to the fault information; migrating part of the data in the first hard disk to a second hard disk according to the capacity; recording the migrated The mapping relationship between the address of data in the first hard disk and the address in the second hard disk.
- the first hard disk and other hard disks in the storage array form a logical disk according to a redundant algorithm.
- the method further includes judging whether data is lost on the first hard disk according to the failure information; and if data is lost on the first hard disk, recovering data in the first hard disk through the redundancy algorithm.
- the lost data in the hard disk can be recovered in time.
- a third aspect of the present invention provides a hard disk failure processing method, which is executed by a hard disk.
- the method includes: detecting a fault area in the hard disk; determining whether data is lost in the fault area; setting, according to the determination result, a flag indicating whether data is lost in the fault area; and reporting the flag to the array controller as fault information.
- the hard disk also records the capacity of the fault area, and reports the capacity of the fault area to the array controller as fault information.
- the array controller can thus sense the capacity lost to the fault area, so that when the storage array receives a write request, it can allocate the write request to a hard disk with a larger remaining capacity and thereby manage the hard disks better.
- the method further includes determining whether the capacity of the fault area is greater than a preset value, and when the capacity of the fault area is greater than a preset value, reporting the fault information to the array controller.
- the manner in which the hard disk records and reports the fault information differs with the communication protocol between the hard disk and the array controller.
- when the communication protocol between the hard disk and the array controller is the SCSI protocol, the fault information is recorded in the informational exceptions log page of the SCSI protocol.
- the fault information can then be reported in two ways. In the first, the fault information is reported to the array controller through the response information of an I/O request. In the second, the hard disk receives a fault information query request sent by the array controller, carries the informational exceptions log page in the response to the query request, and reports the fault information through that response.
- when the communication protocol is the ATA protocol, the fault information is recorded in the in-disk information statistics page of the ATA protocol. To report it, the hard disk receives a fault information query request sent by the array controller, carries the in-disk information statistics page in the response to the query request, and reports the fault information through that response.
- when the communication protocol is the NVMe protocol, the fault information is recorded in the health information log of the NVMe protocol. To report it, the hard disk carries the health information log in the response to an asynchronous event request and reports the fault information through that response.
- the fourth aspect of the present invention provides a hard disk failure processing method.
- the method of the fourth aspect differs from the method of the first aspect only in that, after the recovered data is stored in the recovery storage block, the recovery storage block replaces the faulty storage block in the storage block group, and the correspondence between the recovery storage block and the faulty storage block is not recorded.
- a fifth aspect of the present invention provides an array controller corresponding to the hard disk failure processing method of the first aspect. The functions performed by the functional modules of the array controller are the same as the steps of that method and are not repeated here.
- a sixth aspect of the present invention provides an array controller corresponding to the hard disk failure processing method of the second aspect. The functions performed by the functional modules of the array controller are the same as the steps of that method and are not repeated here.
- a seventh aspect of the present invention provides a hard disk corresponding to the hard disk failure processing method of the third aspect. The functions performed by the functional modules of the hard disk are the same as the steps of that method and are not repeated here.
- an eighth aspect of the present invention provides an array controller corresponding to the hard disk failure processing method of the fourth aspect. The functions performed by the functional modules of the array controller are the same as the steps of that method and are not repeated here.
- a ninth aspect of the present invention provides an array controller.
- the array controller includes a processor and a computer-readable storage medium.
- the storage medium stores program instructions, and the processor runs the program instructions to execute the hard disk failure processing method of the first aspect, the second aspect, or the fourth aspect.
- a tenth aspect of the present invention provides a hard disk.
- the hard disk includes a processor and a computer-readable storage medium.
- the storage medium stores program instructions.
- the processor runs the program instructions to execute the hard disk failure processing method of the third aspect.
- a computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method of the first aspect, the second aspect, the third aspect, or the fourth aspect.
- FIG. 1 is a structural diagram of a storage array.
- FIG. 2 is a schematic diagram of an array controller generating a logical disk in the first embodiment of the present invention.
- FIG. 3 is a schematic diagram of a hot spare space and a redundant space provided by a storage array in the first embodiment of the present invention.
- FIG. 4 is a flowchart of processing a fault area in a hard disk by the storage array in the first embodiment of the present invention.
- FIG. 5 shows the ASC and ASCQ codes, defined in the SCSI protocol, that indicate a fault area in a hard disk in the embodiment of the present invention.
- FIG. 6 is an exemplary diagram of an informational exceptions log page in the SCSI protocol according to an embodiment of the present invention.
- FIG. 7 is a schematic diagram of the descriptor format sense data in the return information of an I/O request in the embodiment of the present invention.
- FIG. 8 is a schematic diagram of an intra-disk information statistics page defined in the ATA protocol in an embodiment of the present invention.
- FIG. 9 is a schematic diagram of a health information log defined in the NVMe protocol according to an embodiment of the present invention.
- FIG. 10 is a flowchart of a processing method when an array controller receives a rewrite request according to an embodiment of the present invention.
- FIG. 11 is a schematic diagram of an array controller forming a logical disk by a RAID algorithm through a plurality of independent hard disks in a second embodiment of the present invention.
- FIG. 12 is a flowchart of a method for processing a fault area occurring in a hard disk in a second embodiment of the present invention.
- FIG. 13 is a block diagram of a hard disk in an embodiment of the present invention.
- FIG. 14 is a block diagram of an array controller in the first embodiment of the present invention.
- FIG. 15 is a block diagram of an array controller in a second embodiment of the present invention.
- FIG. 1 shows the structure of a storage array 100.
- the storage array 100 includes an array controller 101, a plurality of hard disks 105-108, a memory 103, a communication interface 104, and a bus 102.
- the array controller 101 is configured to run a program (not shown) in the memory 103 to manage the hard disks 105-108 and access data therein.
- the communication interface 104 is used to connect to a host (not shown), and the host can transmit read-write instructions or management instructions to the storage array 100 through the communication interface 104.
- the communication interface may be a Non-Volatile Memory Express (NVMe) interface or a Small Computer System Interface (SCSI) interface.
- the array controller 101 generates a logical disk for use by a host.
- when the storage array 100 is a flash memory array using an NVMe interface, the host communicates with the storage array 100 through the NVMe protocol, and a logical disk generated by the storage array 100 can be represented by a namespace defined in the NVMe protocol.
- when the storage array 100 is a storage array using a SCSI interface, the host communicates with the storage array 100 through the SCSI protocol, and the logical disk can be represented by a logical unit number (LUN) defined in the SCSI protocol.
- the process of generating the LUN by the array controller 101 under the SCSI protocol is taken as an example for description.
- each of the hard disks 105-108 in the storage array 100 is divided into chunks of the same size, and chunks belonging to different hard disks are combined into a chunk group by a redundant array of independent disks (RAID) algorithm.
- the chunk 201 belonging to the hard disk 105, the chunk 202 belonging to the hard disk 106, and the chunk 203 belonging to the hard disk 107 form a chunk group 204 through a RAID 5 algorithm.
- the chunk 205 belonging to the hard disk 105, the chunk 206 belonging to the hard disk 107, and the chunk 207 belonging to the hard disk 108 form a chunk group 208 through a RAID 5 algorithm.
- after the chunk groups are generated, the array controller 101 records the hard disk on which each chunk of each chunk group is located.
- a storage pool 209 is constructed based on the chunk group, and logical disks such as LUN0, LUN1, and LUN3 are constructed based on the storage pool 209.
- in addition to the storage resource pool 209, the storage array 100 also provides a hot spare space 210 and a redundant space 211.
- the hot spare space 210 is used to recover data in the failed chunk and replace the failed chunk after the chunk in the hard disk fails.
- the redundant space is a space reserved by the storage array 100.
- the storage array 100 does not provide the size of the redundant space to the outside, and the redundant space is used to improve the performance of the storage array 100.
- the hot spare space 210 and the redundant space 211 may each be a single hard disk, or a pooled space formed from chunks of multiple hard disks.
- the hot spare space and the redundant space may be divided at the same granularity as the chunks, so that their units can replace failed chunks in a hard disk. How the hot spare space 210 and the redundant space 211 are used to replace a failed chunk is described in detail with the hard disk failure processing method below.
- when a fault area (that is, a section of storage space in which a fault occurs) appears in a hard disk, the array controller 101 is not aware of it and therefore does not handle the fault; only after the array controller 101 detects that the wear degree of the hard disk has reached a threshold does it replace the disk.
- before the wear degree of the hard disk reaches the threshold, however, the fault area reduces the redundant space of the hard disk, and the reduced redundant space degrades the performance of the storage system.
- the fault processing method provided in the embodiment of the present invention enables the array controller 101 to determine the faulty storage block in which a fault area of a hard disk is located and to replace the faulty storage block with a storage block in the redundant space 211 or the hot spare space 210, so that the performance of the entire storage system is not affected.
- FIG. 4 is a flowchart of a method for a storage system to process a failed area in a hard disk under the architecture shown in FIG. 2. In the following description, only the failure of the hard disk 105 in the storage array 100 is described as an example.
- Step S401: the hard disk 105 identifies a fault area in the hard disk 105 and accumulates the capacity of the fault area.
- the fault area may be a die in the flash memory of the hard disk, or a section of space on the hard disk.
- the hard disk counts the number of times an abnormality occurs in the storage area of the hard disk. When the number of times a certain type of abnormality occurs in a storage area exceeds a preset value, the storage area can be identified as a fault area.
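The counting rule above can be sketched as follows. This is a minimal illustration, not the drive firmware described in the patent: the class name `FaultAreaDetector`, the area identifiers, and the threshold value are all assumptions, since the source only speaks of "a preset value".

```python
from collections import Counter

# Hypothetical threshold; the source only says "a preset value".
ABNORMALITY_THRESHOLD = 3


class FaultAreaDetector:
    """Counts abnormalities per (storage area, abnormality type)."""

    def __init__(self, threshold=ABNORMALITY_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()
        self.fault_areas = set()

    def record(self, area_id, kind):
        """Record one abnormality; return True if the area is now a fault area."""
        self.counts[(area_id, kind)] += 1
        # An area becomes a fault area once one type of abnormality
        # has occurred more often than the preset value.
        if self.counts[(area_id, kind)] > self.threshold:
            self.fault_areas.add(area_id)
        return area_id in self.fault_areas
```

Counts are kept per abnormality type, so two ECC errors and one UNC error on the same area do not add up to three occurrences of "a certain type" of abnormality.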
- the abnormality may be an error checking and correction (ECC) error, an uncorrectable error-correcting code (UNC) error, a slow I/O response, or an I/O response timeout.
- the identification of the fault area may be in any manner in the prior art, and is not limited herein.
- the function performed by the hard disk is realized by a processor (not shown) in the hard disk executing a piece of program code stored in a memory (not shown) in the hard disk.
- Step S402: the array controller 101 obtains fault information of a fault area in the hard disk 105.
- the fault information includes an identifier indicating whether there is data loss in the fault area.
- the fault information further includes a capacity of the fault area.
- the array controller 101 obtains an identifier indicating whether there is data loss in the fault area, so as to recover the lost data in the fault area subsequently.
- the methods for acquiring the fault information are different for different communication protocols between the hard disk 105 and the storage array 100.
- the methods for acquiring the fault information of the fault area under different communication protocols are described below.
- in the SCSI protocol, an additional sense code (ASC) and an additional sense code qualifier (ASCQ) are defined, and different ASC and ASCQ values identify different abnormalities in a hard disk. After the hard disk detects an abnormality, it records the corresponding ASC and ASCQ in the informational exceptions log page.
- the informational exceptions log page is a log page defined in the SCSI protocol for recording abnormalities of a hard disk.
- in the SCSI protocol, ASC 5D indicates FAILURE PREDICTION THRESHOLD EXCEEDED, that is, a failure prediction threshold has been exceeded.
- each ASCQ under ASC 5D defines a parameter to be monitored in the hard disk. Because the existing SCSI protocol defines no ASCQ for reporting a fault area in a hard disk and provides no way to report the capacity of the fault area, the embodiment of the present invention defines ASCQ codes 6D and 6E to indicate the condition of a fault area. FIG. 5 shows the definitions of the new ASCQ codes 6D and 6E: 6D indicates that a fault area has occurred in the hard disk but no data is lost, and 6E indicates that a fault area has occurred in the hard disk and data is lost.
- FIG. 6 is an example of an informational exceptions log page.
- when the capacity of the fault area reaches the preset value and no data is lost, byte 8 of the log page (INFORMATIONAL EXCEPTION ADDITIONAL SENSE CODE) is filled with 5D and byte 9 (INFORMATIONAL EXCEPTION ADDITIONAL SENSE CODE QUALIFIER) is filled with 6D.
- when the capacity of the fault area reaches the preset value and data is lost, byte 8 of the log page is filled with 5D and byte 9 is filled with 6E.
- the capacity of the fault area is also recorded in the information bytes of the log page. In the example of FIG. 6, these bytes carry "00 00 00 00 00 00 00 00 00 00 00" (hexadecimal), which represents a fault capacity of 8 GB.
- ASCQ codes 6D and 6E are only examples. In practice, any ASCQ under ASC 5D that the protocol does not already use can be chosen.
- the fault information of the fault area can be obtained in two ways.
- in the first method, the hard disk 105 reports the fault information actively, in the return information of an I/O request.
- the return information carries descriptor format sense data, which includes ASC and ASCQ bytes; the ASC code 5D and the ASCQ code 6D or 6E obtained from the informational exceptions log page are filled into these bytes.
- the descriptor format sense data also includes information bytes, into which the capacity of the fault area can be written.
- after the array controller 101 receives the return information of the I/O request, it obtains the fault information of the fault area in the hard disk from the descriptor format sense data in the return information.
- the second way is for the array controller 101 to periodically query the fault information of the fault area in the hard disk 105.
- the array controller 101 periodically sends a fault query command to the hard disk 105, and the command carries the identifier of the informational exceptions log page.
- after receiving the fault query command, the hard disk 105 returns the informational exceptions log page to the array controller 101.
- the array controller 101 reads the ASC, the ASCQ, and the information bytes from the log page. If the ASC and ASCQ are 5D and 6D, respectively, the array controller 101 learns that the abnormality in the hard disk 105 is a fault area and that no data is lost in the fault area.
- if the ASC and ASCQ are 5D and 6E, respectively, the array controller 101 learns that the abnormality in the hard disk is a fault area and that data is lost in the fault area.
- the capacity of the fault area can also be obtained from the bytes corresponding to the information.
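How the array controller might interpret the three values read from the log page can be sketched as below. The codes 5D, 6D, and 6E come from the text, where 6D and 6E are themselves only example values; the function name `decode_fault`, the `FaultInfo` type, and the treatment of the information field as a plain integer are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

ASC_FAILURE_PREDICTION = 0x5D  # ASC 5D, as in the text
ASCQ_FAULT_NO_LOSS = 0x6D      # example code from the text: fault area, no data lost
ASCQ_FAULT_DATA_LOST = 0x6E    # example code from the text: fault area, data lost


class FaultInfo(NamedTuple):
    data_lost: bool
    capacity: int  # raw value of the information field (unit is an assumption)


def decode_fault(asc: int, ascq: int, information: int) -> Optional[FaultInfo]:
    """Return FaultInfo if the ASC/ASCQ pair describes a fault area, else None."""
    if asc != ASC_FAILURE_PREDICTION:
        return None
    if ascq == ASCQ_FAULT_NO_LOSS:
        return FaultInfo(False, information)
    if ascq == ASCQ_FAULT_DATA_LOST:
        return FaultInfo(True, information)
    # Any other ASCQ under 5D reports a different monitored parameter.
    return None
```

The same decoding applies whether the values arrive in sense data returned with an I/O request or in a log page returned to a periodic query.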
- in the ATA protocol, an in-disk information statistics page (Solid State Device Statistics) is defined, which records information about the abnormalities that the hard disk detects in itself.
- in the embodiment of the present invention, a new item of abnormal information, uncorrectable flash unit error information (Uncorrect Flash Unit Error Information), is defined to record the fault information of a fault area in a hard disk.
- this information is represented by a 64-bit binary number, in which bits 15:0 are set to 00000010 (shown in hexadecimal in the figure as 0002), the identifier of the in-disk information statistics page.
- bits 23:16 indicate whether a fault area exists in the hard disk.
- the hard disk also records the capacity of the fault area in the in-disk information statistics page. For example, in FIG. 8, bytes 24-31 define an uncorrectable capacity parameter, which is also 64 bits, and the fault capacity of the fault area is recorded at the position of this parameter. In the embodiment of the present invention, the reported capacity is 00 00 00 00 00 01 00 (hexadecimal), that is, 8 GB.
- the array controller 101 periodically sends a query command to the hard disk 105, and the query command carries the identifier of the information statistics page in the disk.
- after receiving the query command, the hard disk returns the in-disk information statistics page to the array controller 101.
- after the array controller 101 receives the in-disk information statistics page, it obtains the uncorrectable flash unit error information, that is, the 64-bit binary number (or 16-digit hexadecimal number). By parsing this information, the array controller obtains the fault area information of the hard disk 105.
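Parsing the 64-bit value can be sketched as below. The bit positions (15:0 for the page identifier, 23:16 for the fault area indication) follow the text; the function name and the rule that any non-zero value in bits 23:16 means "fault area present" are assumptions, because the source does not state the encoding of that field.

```python
def parse_flash_unit_error_info(qword: int) -> dict:
    """Split the 64-bit value into the fields named in the text."""
    if not 0 <= qword < 2**64:
        raise ValueError("expected a 64-bit unsigned value")
    page_id = qword & 0xFFFF            # bits 15:0, page identifier
    fault_field = (qword >> 16) & 0xFF  # bits 23:16, fault area indication
    # Assumption: any non-zero value in bits 23:16 means a fault area exists.
    return {"page_id": page_id, "has_fault_area": fault_field != 0}
```

The uncorrectable capacity parameter in bytes 24-31 of the page would be read separately; it is left out here because the source does not give its byte order or unit.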
- in the NVMe protocol, a health information log (Health Information Log) is defined, which records information about the abnormalities that the hard disk detects in itself.
- different bits in byte 0 of the health information log define different abnormal information in the hard disk.
- bits 1, 2, 3, and 4 are abnormal information that has been defined in the existing NVMe protocol. Since it is not related to the present invention, it will not be described here.
- a fifth bit is newly defined to indicate whether a fault area appears in the hard disk 105. When the value of the fifth bit is 1, it indicates that a fault area exists in the hard disk 105.
- a 32-bit value is defined in the 4 bytes 6-9.
- bit 7 of byte 9 indicates whether data is lost in the fault area: when the bit is set to 1, data is lost in the fault area; when it is set to 0, no data is lost.
- the bits after the highest bit are used to indicate the failure capacity of the faulty storage block. For example, "00 00 00 00 00 01 00" in hexadecimal indicates that there is no data loss of the faulty storage block, and the faulty storage block has a capacity of 8 GB.
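The field layout just described can be decoded as in the following sketch. It assumes that "the fifth bit" means bit index 5 of byte 0 and that bytes 6-9 form a little-endian 32-bit value, so that bit 7 of byte 9 is the most significant bit; neither detail is stated in the source, and the function name is hypothetical. The capacity is returned as a raw number because the unit is not specified.

```python
def parse_health_log(log: bytes):
    """Return (has_fault_area, data_lost, capacity_raw) from a health log buffer."""
    if len(log) < 10:
        raise ValueError("need at least 10 bytes")
    has_fault_area = bool(log[0] & (1 << 5))     # byte 0, bit 5 (assumed index)
    field = int.from_bytes(log[6:10], "little")  # bytes 6..9, byte 9 most significant
    data_lost = bool(field >> 31)                # bit 7 of byte 9
    capacity_raw = field & 0x7FFFFFFF            # remaining 31 bits
    return has_fault_area, data_lost, capacity_raw
```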
- the health information log is reported to the array controller 101 through a response of an asynchronous event request (Asynchronous Event Request).
- the array controller 101 can analyze the health information log to obtain the fault information of the fault area.
- Step S403: the array controller 101 obtains the capacity of the fault area of the hard disk 105 from the fault information and adds it to the total fault capacity that the array controller 101 records for the hard disk.
- when the total fault capacity of the hard disk 105 reaches a preset value, the user is notified to replace the hard disk 105.
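Step S403 amounts to a per-disk running total with a threshold check, which can be sketched as follows. The class name, the disk identifiers, and the threshold are illustrative assumptions; the comparison uses "reaches" (greater than or equal) as in this step, whereas the summary of the first aspect says "greater than".

```python
class FaultCapacityTracker:
    """Tracks the accumulated fault capacity of each hard disk."""

    def __init__(self, replace_threshold):
        self.replace_threshold = replace_threshold  # the "preset value"
        self.total = {}

    def add_fault(self, disk_id, capacity):
        """Accumulate a reported fault capacity; return True when the disk should be replaced."""
        self.total[disk_id] = self.total.get(disk_id, 0) + capacity
        return self.total[disk_id] >= self.replace_threshold
```

A caller would notify the user when `add_fault` returns True.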
- Step S404: when the array controller 101 determines from the fault information that data is lost in the fault area, it determines the chunk in which the lost data is located. The chunk can be determined in either of two ways.
- in the first method, the array controller 101 obtains the address, in the hard disk 105, of each chunk belonging to that hard disk (the address in the hard disk 105 is a logical address in the hard disk), and then sends a data loss query command to the hard disk 105, where the query command carries the logical address of one of the chunks.
- because the array controller 101 records the chunks that belong to each hard disk, it queries the hard disk for the logical address of the lost data at chunk granularity when determining the chunk in which the lost data is located.
- when the hard disk receives the data loss query command, it determines whether the logical address carried by the command includes part or all of the address of the fault area. If it does, the hard disk reports a data loss identifier to the array controller 101; if not, it reports an identifier indicating that no data is lost. After the array controller 101 receives the report information, if the report information includes the data loss identifier, the array controller 101 takes the chunk indicated by the logical address carried in the data loss query command as the chunk in which the lost data is located. In one embodiment, the fault area reported by the hard disk is generally smaller than a chunk. In this embodiment, if the report information includes the identifier indicating that no data is lost, a new data loss query command is sent to the hard disk.
- the new data loss query command carries the logical address of another chunk of the hard disk, and so on, until the chunk in which the lost data is located is found.
- the array controller sends an address of the chunk to the hard disk to determine the faulty storage block. After receiving the return information of a chunk, , The addresses of the next chunk will be sent to the hard disk, until the addresses of all chunks of the hard disk are sent to the hard disk, so as to determine multiple chunks where the lost data is located.
- the second method is that the array controller 101 sends a fault list query command to the hard disk 105, and the hard disk 105 reports the recorded logical address list of the fault area to the hard disk 105 after receiving the query command.
- An array controller 101 The array controller 101 can determine the chunk where the lost data is located according to the reported logical address list.
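Both methods come down to the same overlap test between a chunk's logical address range and the fault area: the hard disk performs it per query in the first method, and the array controller performs it over the reported list in the second. The sketch below illustrates that test under assumptions made here: ranges are expressed as (start, length) pairs, and the function name and tuple layout are hypothetical.

```python
from typing import List, Tuple


def chunks_with_lost_data(
    chunk_ranges: List[Tuple[int, int, int]],
    fault_ranges: List[Tuple[int, int]],
) -> List[int]:
    """Return the ids of chunks whose logical range overlaps any fault range.

    chunk_ranges: (chunk_id, start_address, length) for each chunk on the disk.
    fault_ranges: (start_address, length) for each reported fault area.
    """
    failed = []
    for chunk_id, c_start, c_len in chunk_ranges:
        c_end = c_start + c_len
        for f_start, f_len in fault_ranges:
            # Two half-open ranges overlap when each starts before the other ends.
            if f_start < c_end and c_start < f_start + f_len:
                failed.append(chunk_id)
                break
    return failed


chunks = [(0, 0, 100), (1, 100, 100), (2, 200, 100)]
faults = [(150, 10), (195, 10)]  # the second fault straddles chunks 1 and 2
failed_chunks = chunks_with_lost_data(chunks, faults)
```

A partial overlap is enough to mark a chunk as failed, which matches the "part or all of the address of the fault area" condition in the text.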
- In step S405, once the array controller 101 has determined the chunk where the lost data is located, that is, the failed chunk, it uses the other chunks that form a chunk group with the failed chunk to recover the data in the failed chunk through a RAID algorithm.
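The text does not say which RAID algorithm is used. As one illustration only, the sketch below assumes single-parity XOR redundancy (as in RAID 5), where the missing chunk equals the XOR of all surviving chunks in the group, parity included. The function name `xor_chunks` is hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_chunks(chunks: Sequence[bytes]) -> bytes:
    """XOR equal-length chunks byte by byte (assumed single-parity scheme)."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    size = len(chunks[0])
    if any(len(chunk) != size for chunk in chunks):
        raise ValueError("all chunks must have the same length")
    # zip(*chunks) yields one tuple of byte values per offset.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


data0 = b"\x01\x02"
data1 = b"\x10\x20"
parity = xor_chunks([data0, data1])      # computed when the chunk group is written
recovered = xor_chunks([data0, parity])  # rebuilds data1 after its chunk fails
```

Other redundancy schemes, such as dual parity or erasure codes, would need a different reconstruction routine.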
- In step S406, the array controller 101 stores the recovered data in an idle chunk in the hot spare space or the OP space; this idle chunk is the backup chunk.
- The hard disk where the backup chunk is located is different from the hard disks where the other chunks in the chunk group are located.
- In step S407, the array controller 101 records a mapping relationship between the address of the failed chunk in the hard disk and the address of the backup chunk in the backup space or the OP space.
- When the array controller 101 subsequently receives a request to update data in the faulty chunk, it writes the data carried in the request into the backup chunk.
- The data in the faulty chunk is invalidated.
- The space in the faulty chunk other than the fault area can be released.
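The mapping recorded in step S407 and the redirection of later writes can be sketched as a small lookup table. The address representation (disk id plus chunk start address) and the class name `ChunkRemapTable` are assumptions made for illustration.

```python
from typing import Dict, Tuple

# (disk id, chunk start address); a hypothetical address representation.
Address = Tuple[str, int]


class ChunkRemapTable:
    """Maps a failed chunk's address to its backup chunk's address."""

    def __init__(self) -> None:
        self._backup_of: Dict[Address, Address] = {}

    def record(self, failed: Address, backup: Address) -> None:
        # The backup chunk must not share a disk with the chunk it replaces.
        if failed[0] == backup[0]:
            raise ValueError("backup chunk must be on a different disk")
        self._backup_of[failed] = backup

    def resolve(self, target: Address) -> Address:
        """Return where a request aimed at `target` should actually go."""
        return self._backup_of.get(target, target)


table = ChunkRemapTable()
table.record(("disk105", 4096), ("spare1", 0))
redirected = table.resolve(("disk105", 4096))
untouched = table.resolve(("disk105", 8192))
```

Requests for chunks that were never remapped pass through unchanged, so the table only grows with the number of failed chunks.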
- the array controller 101 replaces the failed chunk in the chunk group with the restored chunk.
- the address of the failed chunk recorded in the metadata of the chunk group on the first hard disk may be replaced with the address of the recovery storage block on the hard disk where the recovery storage block is located.
- FIG. 10 is a flowchart of a processing method when the array controller 101 receives a rewrite request for data in a hard disk.
- Step S501: Receive a write request, where the write request carries the data to be written, the logical address of the data to be written, and the data amount of the data to be written.
- Step S502: Determine, according to the logical address of the data to be written, that the target hard disk of the data to be written is the hard disk 105.
- Step S503: Query the available capacity of the hard disk 105.
- Step S504: Determine whether the available capacity of the hard disk is smaller than the data amount of the data to be written.
- Step S505: If the available capacity of the hard disk is greater than the data amount of the data to be written, write the data to be written to the hard disk.
- Step S506: If the available capacity of the hard disk is less than or equal to the data amount of the data to be written, write the data to be written into the hot spare space or the redundant space, and mark the data pointed to by the logical address in the hard disk 105 as garbage data awaiting subsequent garbage collection.
- The available capacity of the hard disk is the nominal capacity of the hard disk minus the lost capacity, minus the used space.
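The capacity check in steps S503 to S506 can be written out directly. The function names and the string return values below are illustrative assumptions; only the arithmetic and the comparison come from the text.

```python
def available_capacity(nominal: int, lost: int, used: int) -> int:
    """Nominal capacity minus lost capacity, minus used space."""
    return nominal - lost - used


def choose_write_target(nominal: int, lost: int, used: int, write_size: int) -> str:
    """Pick the destination of a write according to steps S504 to S506."""
    if available_capacity(nominal, lost, used) > write_size:
        return "hard_disk"    # S505: enough room remains on the disk
    return "spare_space"      # S506: hot spare or redundant space


# 1000 nominal, 100 lost, 600 used leaves 300 available.
fits = choose_write_target(1000, 100, 600, 200)
spills = choose_write_target(1000, 100, 600, 300)
```

Note that a write exactly equal to the available capacity goes to the spare space, because step S506 covers the "less than or equal to" case.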
- a plurality of independent hard disks 1104-1106 form a logical disk LUN 1101 through a RAID algorithm.
- the hot spare space 1102 and the redundant space 1103 are also provided by independent hard disks 1107 and 1108.
- steps S701 to S703 are the same as steps S401 to S403 in FIG. 4 in the first embodiment, and details are not described herein again.
- In step S704, the array controller 101 obtains the identifier in the fault information that indicates whether data has been lost in the fault area.
- Step S705: If the identifier indicates that no data is lost in the fault area, the array controller 101 migrates data in the hard disk 105 equal in capacity to the lost capacity to the hot spare space 1102 or the redundant space 1103.
- Step S706: If the identifier indicates that data is lost in the fault area of the hard disk, the data in the hard disk is recovered by using a RAID algorithm. After the recovery, step S705 is performed, that is, data in the hard disk 105 equal in capacity to the lost capacity is migrated to the hot spare space 1102 or the redundant space 1103.
- Step S707: Record a mapping relationship between the address of the migrated data in the hard disk 105 and the address to which it is migrated in the hot spare space or the redundant space.
- When a subsequent access request for the migrated data is received, the migrated data may be accessed in the hot spare space or the redundant space according to the mapping relationship.
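The ordering of steps S704 to S707 can be summarized in a short sketch: recovery happens only when data was lost, and migration and mapping always follow. The function name and the step strings are assumptions used for illustration, and the actual recovery and migration are not modeled.

```python
from typing import List


def plan_fault_handling(data_lost: bool, lost_capacity: int) -> List[str]:
    """List the steps the controller would take, in order, for one fault report."""
    steps = []
    if data_lost:
        # S706: lost data must be rebuilt before anything is migrated.
        steps.append("S706: recover data with the RAID algorithm")
    # S705: move as much data off the disk as the fault area took away.
    steps.append(f"S705: migrate {lost_capacity} bytes to spare space")
    # S707: remember where the migrated data now lives.
    steps.append("S707: record address mapping")
    return steps


no_loss_plan = plan_fault_handling(data_lost=False, lost_capacity=4096)
loss_plan = plan_fault_handling(data_lost=True, lost_capacity=4096)
```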
- the redundant space or the hot spare space is used to compensate for the space lost in the fault area of the hard disk, so that the redundant space in the hard disk does not need to be used to compensate the fault area in the hard disk.
- the redundant space in the hard disk will not be reduced, so that the wear of the hard disk will not be increased, and the performance of the storage array is guaranteed.
- FIG. 13 is a block diagram of a hard disk 1200 according to an embodiment of the present invention.
- The hard disk 1200 includes an identification module 1201, a marking module 1202, and a reporting module 1203.
- The identification module 1201 is configured to identify a fault area in the hard disk 105 and accumulate the capacity of the fault area.
- The function performed by the identification module 1201 is the same as step S401 in FIG. 4.
- The marking module 1202 is configured to mark the fault information of the identified fault area.
- For the marking method, refer to the description in step S402 of how the fault area in the hard disk is marked under different protocols, such as the SCSI protocol, the ATA protocol, and the NVMe protocol.
- For the fault information, refer to the related descriptions of FIG. 4, FIG. 5, FIG. 8, and FIG. 9.
- The reporting module 1203 is configured to report the fault information marked by the marking module 1202 to the array controller.
- For the specific way in which the reporting module 1203 reports fault information, refer to the description in step S402 of how the hard disk reports fault information of the fault area under different protocols, such as the SCSI protocol, the ATA protocol, and the NVMe protocol.
- the array controller 1300 includes an acquisition module 1301, an accumulation module 1302, a recovery module 1303, and a recording module 1304.
- The acquisition module 1301 is configured to acquire fault information of a fault area in a hard disk.
- The fault information is obtained in different ways under different protocols, such as the SCSI protocol, the ATA protocol, and the NVMe protocol. For details, refer to the description of step S402, which is not repeated here.
- the accumulation module 1302 is configured to obtain the capacity of the hard disk fault area from the fault information, and add the acquired capacity information to the total fault capacity of the hard disk that is recorded. When the total failure capacity of the hard disk reaches a preset value, the user is notified to replace the hard disk. For details, refer to the related description of step S403.
- The recovery module 1303 is configured to, after the acquisition module obtains fault information of a fault area of a hard disk and if the fault information indicates data loss in the fault area, determine the faulty chunk in which the lost data is located, use the other chunks that form a chunk group with the faulty chunk to recover the data in the faulty chunk through the RAID algorithm, store the recovered data in the backup chunk, and form a new chunk group from the backup chunk and the chunks in the chunk group other than the failed chunk. For details, refer to the related description of steps S404 to S407.
- the recording module 1304 is configured to record a mapping relationship between an address of the faulty chunk in the hard disk and an address of the backup chunk in the backup space or the OP space. For details, refer to the related description of step S407.
- The acquisition module 1301, the accumulation module 1302, and the recovery module 1303 of the array controller in the second embodiment of the present invention have the same functions as in the array controller of the first embodiment, except that in the second embodiment the recording module replaces the failed chunk in the chunk group with the restored chunk.
- the address of the failed chunk recorded in the metadata of the chunk group on the first hard disk may be replaced with the address of the recovery storage block on the hard disk where the recovery storage block is located.
- FIG. 15 is a block diagram of an array controller 1400 in a third embodiment of the present invention.
- the array controller 1400 includes an acquisition module 1401, an accumulation module 1402, a migration module 1403, and a recording module 1404.
- The functions of the acquisition module 1401 and the accumulation module 1402 are the same as those of the acquisition module 1301 and the accumulation module 1302 in the array controller 1300. For details, refer to the related descriptions of the acquisition module 1301 and the accumulation module 1302, which are not repeated here.
- The migration module 1403 is configured to: if the fault information indicates that no data is lost in the fault area, migrate data in the hard disk equal in capacity to the lost capacity to the hot spare space or the redundant space; and if the fault information indicates that data is lost in the fault area of the hard disk, recover the data in the hard disk by a RAID algorithm and, after the recovery, migrate data in the hard disk equal in capacity to the lost capacity to the hot spare space or the redundant space. For details, refer to the related description of steps S704 to S706.
- The recording module 1404 is configured to record a mapping relationship between the address of the migrated data in the hard disk 105 and the address to which it is migrated in the hot spare space or the redundant space. For details, refer to the related description of step S707. When a subsequent access request for the migrated data is received, the migrated data may be accessed in the hot spare space or the redundant space according to the mapping relationship.
- One or more of the above modules may be implemented in software, hardware or a combination of both.
- the software exists in the form of computer program instructions and is stored in a memory, and the processor may be used to execute the program instructions and implement the above method flow.
- The processor may include, but is not limited to, at least one of the following: a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a microcontroller, or an artificial intelligence processor.
- Each computing device may include one or more cores for executing software instructions for operations or processing.
- the processor can be built into a SoC (System on a Chip) or an application specific integrated circuit (ASIC), or it can be a separate semiconductor chip.
- In addition to the cores used to execute software instructions for operations or processing, the processor may further include necessary hardware accelerators, such as field programmable gate arrays (FPGAs), programmable logic devices (PLDs), or logic circuits that implement dedicated logic operations.
- The hardware may be any one or any combination of a CPU, microprocessor, DSP, MCU, artificial intelligence processor, ASIC, SoC, FPGA, PLD, dedicated digital circuit, hardware accelerator, or non-integrated discrete device, and it may run the necessary software or perform the above method flow without depending on software.
Claims (46)
- A hard disk fault processing method, executed by an array controller of a storage array, wherein the storage array includes multiple hard disks, each hard disk is divided into multiple storage blocks, and multiple storage blocks located on different hard disks form a storage block group through a redundancy algorithm; the method comprises: acquiring fault information of a fault area in which a fault occurs in a first hard disk; when the fault information indicates that data is lost in the fault area, determining the faulty storage block where the lost data is located; recovering the data of the faulty storage block by using the other storage blocks in the storage block group to which the faulty storage block belongs; storing the recovered data in a recovery storage block, the recovery storage block being located on a second hard disk, the second hard disk being a hard disk other than the hard disks where the storage block group is located; and recording a correspondence between the address in the first hard disk of the data in the faulty storage block and the address in the second hard disk of the recovery data block.
- The method according to claim 1, wherein acquiring the fault information of the fault area in which a fault occurs in the first hard disk comprises: receiving the fault information reported by the first hard disk.
- The method according to claim 1, wherein acquiring the fault information of the fault area in which a fault occurs in the first hard disk comprises: sending a fault query command to the first hard disk; and receiving the fault information reported by the first hard disk in response to the fault query command.
- The method according to claim 2 or 3, wherein the fault information includes an identifier indicating whether data is lost, and it is determined according to the identifier that data is lost in the fault area.
- The method according to any one of claims 1 to 4, wherein determining the faulty storage block where the lost data is located comprises: obtaining the address in the first hard disk of a first storage block of the first hard disk; sending a data loss query command to the first hard disk, the query command carrying the address of the first storage block in the first hard disk; receiving indication information returned by the first hard disk indicating whether the first storage block includes the lost data; when the indication information indicates that the first storage block includes the lost data, determining that the first storage block is the faulty storage block; and when the indication information indicates that the first storage block does not include the lost data, generating a new data loss query command, the new data loss query command carrying the address in the first hard disk of a second storage block of the first hard disk.
- The method according to any one of claims 1 to 4, wherein determining the faulty storage block where the lost data is located comprises: sending a fault area query command to the first hard disk; receiving information returned by the first hard disk that includes the address of the fault area; and determining the faulty storage block according to the address of the fault area.
- The method according to any one of claims 1 to 6, wherein the fault information includes the capacity of the fault area, and the method further comprises: obtaining the capacity of the fault area from the fault information, and adding the capacity of the fault area to the total fault capacity of the first hard disk; and when it is determined that the total fault capacity is greater than a preset value, prompting the user to replace the first hard disk.
- A hard disk fault processing method, executed by an array controller of a storage array, wherein the storage array includes a first hard disk and the first hard disk includes a fault area; the method comprises: acquiring fault information of the fault area; determining the capacity of the fault area according to the fault information; migrating part of the data in the first hard disk to a second hard disk according to the capacity; and recording a mapping relationship between the address of the migrated data in the first hard disk and its address in the second hard disk.
- The method according to claim 8, wherein the first hard disk and other hard disks in the storage array form a logical disk according to a redundancy algorithm, and the method further comprises: determining, according to the fault information, whether data is lost in the first hard disk; and if data is lost in the first hard disk, recovering the data in the first hard disk through the redundancy algorithm.
- A hard disk fault processing method, executed by a hard disk, the method comprising: detecting a fault area in the hard disk; determining whether data is lost in the fault area; setting, according to the determination result, a flag indicating whether data is lost in the fault area; and reporting to an array controller, as fault information, a flag indicating that the hard disk includes a fault area together with the flag indicating whether data is lost in the fault area.
- The method according to claim 10, further comprising: recording the capacity of the fault area, wherein the fault information further includes the capacity of the fault area.
- The method according to claim 11, further comprising: determining whether the capacity of the fault area is greater than a preset value; and when the capacity of the fault area is greater than the preset value, reporting the fault information to the array controller.
- The method according to any one of claims 10 to 12, wherein the fault information is recorded in an informational exceptions log page of the Small Computer System Interface (SCSI) protocol; the method further comprises: receiving an input/output (IO) request sent by the array controller; and carrying the informational exceptions log page in the response to the IO request, so that the fault information is reported to the array controller through the response.
- The method according to any one of claims 10 to 12, wherein the fault information is recorded in an informational exceptions log page of the Small Computer System Interface (SCSI) protocol; the method further comprises: receiving a fault information query request sent by the array controller; and carrying the informational exceptions log page in the response to the fault information query request, so that the fault information is reported through the response.
- The method according to any one of claims 10 to 12, wherein the fault information is recorded in an on-disk information statistics page of the Advanced Technology Attachment (ATA) protocol; the method further comprises: receiving a fault information query request sent by the array controller; and carrying the on-disk information statistics page in the response to the fault information query request, so that the fault information is reported through the response.
- The method according to any one of claims 10 to 12, wherein the fault information is recorded in a health information log of the Non-Volatile Memory Express (NVMe) protocol; the method further comprises: carrying the on-disk information statistics page in the response to an asynchronous event request, so that the fault information is reported through the response.
- An array controller in a storage array, wherein the storage array includes multiple hard disks, each hard disk is divided into multiple storage blocks, and multiple storage blocks located on different hard disks form a storage block group through a redundancy algorithm; the array controller comprises: an acquisition module, configured to acquire fault information of a fault area in which a fault occurs in a first hard disk; a recovery module, configured to: when the fault information indicates that data is lost in the fault area, determine the faulty storage block where the lost data is located; recover the data of the faulty storage block by using the other storage blocks in the storage block group to which the faulty storage block belongs; and store the recovered data in a recovery storage block, the recovery storage block being located on a second hard disk, the second hard disk being a hard disk other than the hard disks where the storage block group is located; and a recording module, configured to record a correspondence between the address in the first hard disk of the data in the faulty storage block and the address in the second hard disk of the recovery data block.
- The array controller according to claim 17, wherein the acquisition module acquires the fault information of the fault area in which a fault occurs in the first hard disk by receiving the fault information reported by the first hard disk.
- The array controller according to claim 17, wherein the acquisition module acquires the fault information of the fault area in which a fault occurs in the first hard disk by sending a fault query command to the first hard disk and receiving the fault information reported by the first hard disk in response to the fault query command.
- The array controller according to any one of claims 17 to 19, wherein the fault information includes an identifier indicating whether data is lost, and it is determined according to the identifier that data is lost in the fault area.
- The array controller according to any one of claims 17 to 20, wherein, when determining the faulty storage block where the lost data is located, the recovery module is specifically configured to: obtain the address in the first hard disk of a first storage block of the first hard disk; send a data loss query command to the first hard disk, the query command carrying the address of the first storage block in the first hard disk; receive indication information returned by the first hard disk indicating whether the first storage block includes the lost data; when the indication information indicates that the first storage block includes the lost data, determine that the first storage block is the faulty storage block; and when the indication information indicates that the first storage block does not include the lost data, generate a new data loss query command, the new data loss query command carrying the address in the first hard disk of a second storage block of the first hard disk.
- The array controller according to any one of claims 17 to 20, wherein, when determining the faulty storage block where the lost data is located, the recovery module is specifically configured to: send a fault area query command to the first hard disk; receive information returned by the first hard disk that includes the address of the fault area; and determine the faulty storage block according to the address of the fault area.
- The array controller according to any one of claims 17 to 22, wherein the fault information includes the capacity of the fault area, and the array controller further comprises: an accumulation module, configured to obtain the capacity of the fault area from the fault information and add the capacity of the fault area to the total fault capacity of the first hard disk, and, when it is determined that the total fault capacity is greater than a preset value, prompt the user to replace the first hard disk.
- An array controller of a storage array, wherein the storage array includes a first hard disk and the first hard disk includes a fault area, the method including: an acquisition module, configured to acquire fault information of the fault area; a migration module, configured to determine the capacity of the fault area according to the fault information and migrate part of the data in the first hard disk to a second hard disk according to the capacity; and a recording module, configured to record a mapping relationship between the address of the migrated data in the first hard disk and its address in the second hard disk.
- The array controller according to claim 24, wherein the first hard disk and other hard disks in the storage array form a logical disk according to a redundancy algorithm, and the migration module is further configured to: determine, according to the fault information, whether data is lost in the first hard disk; and if data is lost in the first hard disk, recover the data in the first hard disk through the redundancy algorithm.
- A hard disk, comprising: an identification module, configured to detect a fault area in the hard disk; a marking module, configured to determine whether data is lost in the fault area and set, according to the determination result, a flag indicating whether data is lost in the fault area; and a reporting module, configured to report to an array controller, as fault information, a flag indicating that the hard disk includes a fault area together with the flag indicating whether data is lost in the fault area.
- The hard disk according to claim 26, wherein the marking module is further configured to record the capacity of the fault area, and the fault information further includes the capacity of the fault area.
- The hard disk according to claim 26, wherein the reporting module is further configured to: determine whether the capacity of the fault area is greater than a preset value; and when the capacity of the fault area is greater than the preset value, report the fault information to the array controller.
- The hard disk according to any one of claims 26 to 28, wherein the recording module records the fault information in an informational exceptions log page of the Small Computer System Interface (SCSI) protocol; the reporting module is specifically configured to: receive an input/output (IO) request sent by the array controller; and carry the informational exceptions log page in the response to the IO request, so that the fault information is reported to the array controller through the response.
- The hard disk according to any one of claims 26 to 28, wherein the recording module records the fault information in an informational exceptions log page of the Small Computer System Interface (SCSI) protocol; the reporting module is specifically configured to: receive a fault information query request sent by the array controller; and carry the informational exceptions log page in the response to the fault information query request, so that the fault information is reported through the response.
- The hard disk according to any one of claims 26 to 28, wherein the fault information is recorded in an on-disk information statistics page of the Advanced Technology Attachment (ATA) protocol; the reporting module is specifically configured to: receive a fault information query request sent by the array controller; and carry the on-disk information statistics page in the response to the fault information query request, so that the fault information is reported through the response.
- The hard disk according to any one of claims 26 to 28, wherein the fault information is recorded in a health information log of the Non-Volatile Memory Express (NVMe) protocol; the reporting module is specifically configured to: carry the on-disk information statistics page in the response to an asynchronous event request, so that the fault information is reported through the response.
- A hard disk fault processing method, performed by an array controller of a storage array, wherein the storage array comprises a plurality of hard disks, each hard disk is divided into a plurality of storage blocks, and a plurality of storage blocks located on different hard disks form a storage block group by means of a redundancy algorithm; the method comprises: obtaining fault information of a faulty area in a first hard disk; when the fault information indicates that data in the faulty area has been lost, determining the faulty storage block in which the lost data is located; recovering the data of the faulty storage block by using the other storage blocks in the storage block group to which the faulty storage block belongs; storing the recovered data in a recovery storage block, wherein the recovery storage block is located on a second hard disk, and the second hard disk is a hard disk other than the hard disks on which the storage block group is located; and replacing the faulty storage block in the storage block group with the recovery storage block.
- The method according to claim 33, wherein the replacing the faulty storage block in the storage block group with the recovery storage block comprises: replacing the address, in the first hard disk, of the faulty storage block recorded in metadata of the storage block group with the address of the recovery storage block in the second hard disk.
- The method according to claim 33 or 34, wherein the obtaining fault information of a faulty area in a first hard disk comprises: receiving the fault information reported by the first hard disk.
- The method according to claim 33 or 34, wherein the obtaining fault information of a faulty area in a first hard disk comprises: sending a fault query command to the first hard disk; and receiving the fault information reported by the first hard disk in response to the fault query command.
- The method according to any one of claims 34 to 36, wherein the fault information comprises an identifier indicating whether data has been lost, and whether data in the faulty area has been lost is determined according to the identifier.
- The method according to any one of claims 33 to 37, wherein the determining the faulty storage block in which the lost data is located comprises: obtaining the address, in the first hard disk, of a first storage block of the first hard disk; sending a data loss query command to the first hard disk, wherein the query command carries the address of the first storage block in the first hard disk; receiving indication information returned by the first hard disk, wherein the indication information indicates whether the first storage block contains the lost data; when the indication information indicates that the first storage block contains the lost data, determining that the first storage block is the faulty storage block; and when the indication information indicates that the first storage block does not contain the lost data, generating a new data loss query command, wherein the new data loss query command carries the address, in the first hard disk, of a second storage block of the first hard disk.
- The method according to any one of claims 33 to 37, wherein the determining the faulty storage block in which the lost data is located comprises: sending a faulty area query command to the first hard disk; receiving information that is returned by the first hard disk and that includes the address of the faulty area; and determining the faulty storage block according to the address of the faulty area.
- An array controller in a storage array, wherein the storage array comprises a plurality of hard disks, each hard disk is divided into a plurality of storage blocks, and a plurality of storage blocks located on different hard disks form a storage block group by means of a redundancy algorithm; the array controller comprises: an obtaining module, configured to obtain fault information of a faulty area in a first hard disk; a recovery module, configured to: when the fault information indicates that data in the faulty area has been lost, determine the faulty storage block in which the lost data is located; recover the data of the faulty storage block by using the other storage blocks in the storage block group to which the faulty storage block belongs; and store the recovered data in a recovery storage block, wherein the recovery storage block is located on a second hard disk, and the second hard disk is a hard disk other than the hard disks on which the storage block group is located; and a recording module, configured to replace the faulty storage block in the storage block group with the recovery storage block.
- An array controller in a storage array, wherein, when replacing the faulty storage block in the storage block group with the recovery storage block, the replacement module is specifically configured to replace the address, in the first hard disk, of the faulty storage block recorded in metadata of the storage block group with the address of the recovery storage block in the second hard disk.
- The array controller according to claim 40 or 41, wherein the obtaining module obtains the fault information of the faulty area in the first hard disk by receiving the fault information reported by the first hard disk.
- The array controller according to claim 40 or 41, wherein the obtaining module obtains the fault information of the faulty area in the first hard disk by sending a fault query command to the first hard disk and receiving the fault information reported by the first hard disk in response to the fault query command.
- The array controller according to any one of claims 40 to 43, wherein the fault information comprises an identifier indicating whether data has been lost, and whether data in the faulty area has been lost is determined according to the identifier.
- The array controller according to any one of claims 40 to 43, wherein, when determining the faulty storage block in which the lost data is located, the recovery module is specifically configured to: obtain the address, in the first hard disk, of a first storage block of the first hard disk; send a data loss query command to the first hard disk, wherein the query command carries the address of the first storage block in the first hard disk; receive indication information returned by the first hard disk, wherein the indication information indicates whether the first storage block contains the lost data; when the indication information indicates that the first storage block contains the lost data, determine that the first storage block is the faulty storage block; and when the indication information indicates that the first storage block does not contain the lost data, generate a new data loss query command, wherein the new data loss query command carries the address, in the first hard disk, of a second storage block of the first hard disk.
- The array controller according to any one of claims 40 to 43, wherein, when determining the faulty storage block in which the lost data is located, the recovery module is specifically configured to: send a faulty area query command to the first hard disk; receive information that is returned by the first hard disk and that includes the address of the faulty area; and determine the faulty storage block according to the address of the faulty area.
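
The controller-side method recited in the claims above (check the data-loss identifier, map the faulty area's address to a storage block, rebuild that block from the rest of its storage block group, store it on a second hard disk, and update the group's metadata) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the claims: the `Disk` and `FaultInfo` types, the `handle_fault` function, the 4-byte `CHUNK_SIZE`, and in particular the use of single-parity XOR as the redundancy algorithm, which the claims leave unspecified.

```python
from dataclasses import dataclass
from functools import reduce

CHUNK_SIZE = 4  # bytes per storage block; tiny value chosen only for illustration


@dataclass
class Disk:
    disk_id: int
    chunks: dict  # storage block index -> bytes held in that block


@dataclass
class FaultInfo:
    data_lost: bool     # identifier indicating whether data was lost
    fault_address: int  # byte address of the faulty area on the reporting disk


def xor_blocks(blocks):
    """XOR equal-length byte strings (assumed single-parity redundancy)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def handle_fault(disks, group, failed_disk_id, fault):
    """Rebuild one lost storage block and remap it inside its group.

    `group` is a list of (disk_id, block_index) pairs and stands in for the
    metadata of a storage block group. Returns the recovered bytes, or None
    when nothing has to be rebuilt.
    """
    if not fault.data_lost:
        return None
    # Determine the faulty storage block from the faulty area's address.
    faulty = (failed_disk_id, fault.fault_address // CHUNK_SIZE)
    if faulty not in group:
        return None
    # Recover the data from the other storage blocks of the same group.
    survivors = [disks[d].chunks[c] for d, c in group if (d, c) != faulty]
    recovered = xor_blocks(survivors)
    # Choose a second disk that holds no block of this group
    # (raises StopIteration if no such disk exists; a real controller
    # would need a policy for that case).
    used = {d for d, _ in group}
    spare_id = next(d for d in sorted(disks) if d not in used)
    spare_block = max(disks[spare_id].chunks, default=-1) + 1
    disks[spare_id].chunks[spare_block] = recovered
    # Replace the faulty block's address in the group metadata.
    group[group.index(faulty)] = (spare_id, spare_block)
    return recovered


# Hypothetical three-member group (two data blocks plus XOR parity) and a spare disk.
a, b = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"
disks = {
    0: Disk(0, {1: a}),
    1: Disk(1, {1: b}),  # this block is the one reported as lost
    2: Disk(2, {1: xor_blocks([a, b])}),
    3: Disk(3, {}),
}
group = [(0, 1), (1, 1), (2, 1)]
recovered = handle_fault(disks, group, 1, FaultInfo(data_lost=True, fault_address=5))
```

In this sketch only the one affected storage block is rebuilt and remapped; the other blocks of the first hard disk stay in their groups, which is the point of handling the fault at storage block granularity rather than failing the whole disk.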
Priority Applications (9)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP19856725.7A EP3822792A4 (en) | 2018-09-05 | 2019-09-03 | Hard disk fault processing method, array controller and hard disk |
MX2021002274A MX2021002274A (en) | 2018-09-05 | 2019-09-03 | Hard disk fault processing method, array controller and hard disk. |
BR112021002987-7A BR112021002987B1 (en) | 2018-09-05 | 2019-09-03 | HARD DISK FAILURE HANDLING METHOD, ARRAY CONTROLLER, AND HARD DISK |
EP21164813.4A EP3920031B1 (en) | 2018-09-05 | 2019-09-03 | Hard disk fault handling method, array controller, and hard disk |
JP2021512508A JP7147050B2 (en) | 2018-09-05 | 2019-09-03 | Hard disk failure countermeasures, array controllers, and hard disks |
KR1020217006066A KR102632961B1 (en) | 2018-09-05 | 2019-09-03 | Hard disk failure handling methods, array controllers, and hard disks |
US17/167,231 US11322179B2 (en) | 2018-09-05 | 2021-02-04 | Hard disk fault handling method, array controller, and hard disk |
US17/226,588 US11264055B2 (en) | 2018-09-05 | 2021-04-09 | Hard disk fault handling method, array controller, and hard disk |
US17/549,094 US11501800B2 (en) | 2018-09-05 | 2021-12-13 | Hard disk fault handling method, array controller, and hard disk |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811028931 | 2018-09-05 | ||
CN201811028931.7 | 2018-09-05 | ||
CN201811451958.7 | 2018-11-30 | ||
CN201811451958.7A CN110879761A (en) | 2018-09-05 | 2018-11-30 | Hard disk fault processing method, array controller and hard disk |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/167,231 Continuation US11322179B2 (en) | 2018-09-05 | 2021-02-04 | Hard disk fault handling method, array controller, and hard disk |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020048442A1 (en) | 2020-03-12 |
Family
ID=69721562
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/104163 WO2020048442A1 (en) | 2018-09-05 | 2019-09-03 | Hard disk fault processing method, array controller and hard disk |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020048442A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6327671B1 (en) * | 1998-11-18 | 2001-12-04 | International Business Machines Corporation | Delta compressed asynchronous remote copy |
CN101276302A (en) * | 2007-03-29 | 2008-10-01 | 中国科学院计算技术研究所 | Magnetic disc fault processing and data restructuring method in magnetic disc array system |
CN106371947A (en) * | 2016-09-14 | 2017-02-01 | 郑州云海信息技术有限公司 | Multi-fault disk data recovery method for RAID (Redundant Arrays of Independent Disks) and system thereof |
CN108345519A (en) * | 2018-01-31 | 2018-07-31 | 河南职业技术学院 | The processing method and processing device of hard disc of computer failure |
- 2019-09-03 WO PCT/CN2019/104163 patent/WO2020048442A1/en unknown
Non-Patent Citations (1)
Title |
---|
See also references of EP3822792A4 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488124A (en) * | 2020-04-08 | 2020-08-04 | 深信服科技股份有限公司 | Data updating method and device, electronic equipment and storage medium |
CN111813588A (en) * | 2020-06-01 | 2020-10-23 | 北京百卓网络技术有限公司 | Computer hard disk fault positioning method, device, equipment and storage medium |
CN111813588B (en) * | 2020-06-01 | 2024-03-19 | 北京百卓网络技术有限公司 | Computer hard disk fault positioning method, device, equipment and storage medium |
CN113900594A (en) * | 2021-10-12 | 2022-01-07 | 天津津航计算技术研究所 | RAID control card S.M.A.R.T.information early warning method |
CN115080340A (en) * | 2022-05-13 | 2022-09-20 | 苏州浪潮智能科技有限公司 | Method, system, computer device and storage medium for monitoring floppy disk array |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7326379B2 (en) | Hard disk failure countermeasures, array controllers, and hard disks | |
JP6294518B2 (en) | Synchronous mirroring in non-volatile memory systems | |
WO2020048442A1 (en) | Hard disk fault processing method, array controller and hard disk | |
CN111177040B (en) | Storage device for sharing host memory, operation method thereof and storage system | |
US10459814B2 (en) | Drive extent based end of life detection and proactive copying in a mapped RAID (redundant array of independent disks) data storage system | |
US9952795B2 (en) | Page retirement in a NAND flash memory system | |
TWI428737B (en) | Semiconductor memory device | |
TWI465904B (en) | Semiconductor memory device | |
US20170308303A1 (en) | Systems, Methods, and Computer Readable Media Providing Arbitrary Sizing of Data Extents | |
US8418029B2 (en) | Storage control device and storage control method | |
US9348704B2 (en) | Electronic storage system utilizing a predetermined flag for subsequent processing of each predetermined portion of data requested to be stored in the storage system | |
CN114600073A (en) | Data reconstruction method and device applied to disk array system and computing equipment | |
BR112021002987B1 (en) | HARD DISK FAILURE HANDLING METHOD, ARRAY CONTROLLER, AND HARD DISK | |
CN110659152B (en) | Data processing method and equipment |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19856725; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2019856725; Country of ref document: EP; Effective date: 20210215. Ref document number: 20217006066; Country of ref document: KR; Kind code of ref document: A |
REG | Reference to national code | Ref country code: BR; Ref legal event code: B01A; Ref document number: 112021002987; Country of ref document: BR |
ENP | Entry into the national phase | Ref document number: 2021512508; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
ENP | Entry into the national phase | Ref document number: 112021002987; Country of ref document: BR; Kind code of ref document: A2; Effective date: 20210218 |