CN112732163B - Data verification method and device

Info

Publication number: CN112732163B
Authority: CN (China)
Prior art keywords: data, hard disk, written, response message, write request
Legal status: Active (granted)
Application number: CN201910974182.5A
Other languages: Chinese (zh)
Other versions: CN112732163A (application publication)
Inventor: 杨雁军
Current assignee: Chengdu Huawei Technology Co Ltd
Original assignee: Chengdu Huawei Technology Co Ltd
Events:
    • Application filed by Chengdu Huawei Technology Co Ltd
    • Priority to CN201910974182.5A
    • Publication of CN112732163A (application)
    • Application granted
    • Publication of CN112732163B (grant)
    • Status: Active

Classifications

    • G06F 3/0619: Interfaces specially adapted for storage systems; improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F 11/1004: Error detection or correction by redundancy in data representation; adding special bits or symbols to protect a block of data words, e.g. CRC or checksum
    • G06F 3/0638: Interfaces specially adapted for storage systems making use of a particular technique; organizing or formatting or addressing of data
    • G06F 3/0689: Interfaces specially adapted for storage systems adopting a particular infrastructure; disk arrays, e.g. RAID, JBOD


Abstract

The embodiment of the application provides a data verification method and device, relates to the field of storage, and can identify data reliability risks in the data transmission process of a storage array in time. The method comprises the following steps: writing the data to be written in a write request into the hard disk; receiving a response message of the write request sent by the hard disk, wherein the response message of the write request carries the address of the hard disk at which the data to be written was written; generating a read request according to the address in the response message of the write request, and sending the read request to the hard disk; receiving a response message of the read request sent by the hard disk, wherein the response message of the read request carries the data read from the hard disk; and checking the data read from the hard disk to determine whether the data written into the hard disk is erroneous. The method and the device are used for detecting whether an error occurred when the data was written into the hard disk, and for determining the error range and whether a write offset occurred.

Description

Data verification method and device
Technical Field
The present application relates to the field of storage, and in particular, to a data verification method and apparatus.
Background
During the reading, writing, transmission, and storage of data, the data passes through multiple components, multiple transmission channels, and complex software processing procedures; if the data is damaged along the way, data errors result. If an error is not detected when it occurs but is only discovered when a subsequent application accesses the saved data, the situation is referred to as a silent data failure. Because the error is not found when it occurs, the best repair opportunity may be missed, ultimately leading to serious consequences such as critical data errors or system downtime.
In a storage array, each critical component, such as the memory, the links, and the hard disks, has its own data integrity protection measures, such as the error correction code (ECC) of the memory, the cyclic redundancy check (CRC) of the links, and the ECC technique of the hard disks. However, the protection measures of each component are limited to integrity checks during data access and internal transmission; a uniform and continuous check mechanism between the components is lacking, and data damage caused by errors in the software logic layer (software bugs) cannot be handled, which leads to silent data failures. To address the silent data failure problem, the T10 technical committee of the International Committee for Information Technology Standards (INCITS), accredited by the American National Standards Institute (ANSI), defined the Protection Information (PI) standard as a method for verifying data integrity. This verification method appends 8 bytes of protection information to each data block. The PI check mainly supports a CRC check and a logical block address (LBA) check. By applying the PI check uniformly and continuously along the I/O path of a storage array, silent data failures can be prevented and detected, such as hardware failures and data corruption caused by software bugs on the data channels, as well as data errors that the hard disk can detect and correct. However, even as hard disks are updated, hard disks that do not support PI check still exist, and for such hard disks the risk of data damage cannot be discovered in time; that is, the silent data failure problem remains.
Disclosure of Invention
The embodiment of the application provides a data verification method and device, which can identify data reliability risks in a data transmission process of a storage array in time.
In a first aspect, a data verification method performed by a storage array is provided, the method including: writing the data to be written in a write request into the hard disk; receiving a response message of the write request sent by the hard disk, wherein the response message of the write request carries the address of the hard disk at which the data to be written was written; generating a read request according to the address in the response message of the write request, and sending the read request to the hard disk; receiving a response message of the read request sent by the hard disk, wherein the response message of the read request carries the data read from the hard disk; and checking the data read from the hard disk to determine whether the data written into the hard disk is erroneous.
Therefore, the method and the device generate a read request according to the response message of the write request, so as to read back the data written into the hard disk by the write request and check whether the read data is erroneous. The reliability risk of the data written into the hard disk can thus be identified in time. For a hard disk that does not support PI check, the data can be read back for checking immediately after it is written, without relying on the PI check function of the hard disk, so hard disks without the PI function are also supported.
In one possible design, the method is performed by a block device management (BDM) module in the storage array, the BDM module includes a data integrity field (DIF) submodule, and checking the data read from the hard disk to determine whether the data written into the hard disk is erroneous includes: performing a CRC check on the data read from the hard disk, and if the CRC check fails, performing a CRC check on the data to be written carried in the response message of the write request; and if the CRC check on the data to be written in the response message of the write request also fails, determining that an upper-layer module of the DIF submodule was abnormal when sending the write request, and sending an error code indicating that the upper-layer module is abnormal. That is to say, when the CRC checks performed on both the response message of the write request and the response message of the read request fail, it is determined that the data error is an abnormality that occurred when an upper-layer module of the DIF submodule issued the write request, i.e., the source of the abnormality is an upper-layer module of the DIF submodule. In this case, the DIF submodule may send an error code to the upper-layer module, where the error code indicates that an error occurred when the data was written into the hard disk and that the fault originates from an upper-layer module of the DIF submodule. Therefore, not only can the data error be determined, but the range of the data error can also be determined, so that the data can be repaired in time.
In a possible design, if the CRC check on the data to be written in the response message of the write request succeeds, it is determined that a lower-layer module of the DIF submodule was abnormal when the write request was sent, and an error code indicating that the lower-layer module is abnormal is sent. In other words, when the check on the response message of the write request succeeds but the CRC check performed on the response message of the read request fails, the data error may be regarded as an abnormality that occurred when a lower-layer module of the DIF submodule issued the write request, i.e., the source of the abnormality is a lower-layer module of the DIF submodule. In this case, the DIF submodule may send an error code to the upper-layer module, where the error code indicates that an error occurred when the data was written into the hard disk and that the fault originates from a lower-layer module of the DIF submodule.
In one possible design, the write request carries a DIF check bit; checking the data read from the hard disk to determine whether the data written into the hard disk is erroneous includes: performing a CRC check on the data read from the hard disk; if the CRC check succeeds, comparing the DIF of the data to be written carried in the response message of the write request with the DIF of the data read from the hard disk; and if the DIF of the data to be written carried in the response message of the write request is inconsistent with the DIF of the data read from the hard disk, determining that a write offset occurred when the data to be written was written into the hard disk. The reason is that the CRC check the storage array performs on the data read from the hard disk, i.e., on the data carried in the response message of the read request, can only determine whether the read data itself is erroneous, i.e., whether the data portion and the check portion match; it cannot determine whether the read data is data for which a write offset occurred. Therefore, once it is determined that the data carried in the response message of the read request is not erroneous, the DIF of the data to be written carried in the response message of the write request can be compared with the DIF of the data read from the hard disk, to further check whether the data carried in the response message of the read request is data affected by a write offset.
In one possible design, the write request is a write request sampled in one sampling period. In this way, not all write requests need to be checked, which reduces the impact of the check on user service performance.
In a second aspect, a storage array is provided, comprising: a write operation unit, configured to write the data to be written in a write request into the hard disk; a response unit, configured to receive a response message of the write request sent by the hard disk, wherein the response message of the write request carries the address of the hard disk at which the data to be written was written; and a read operation unit, configured to generate a read request according to the address in the response message of the write request and send the read request to the hard disk. The response unit is further configured to receive a response message of the read request sent by the hard disk, wherein the response message of the read request carries the data read from the hard disk. A checking unit is configured to check the data read from the hard disk to determine whether the data written into the hard disk is erroneous.
In one possible design, the storage array comprises a block device management (BDM) module, the BDM module comprises the checking unit, and the checking unit comprises a data integrity field (DIF) submodule. The checking unit is configured to: perform a CRC check on the data read from the hard disk, and if the CRC check fails, perform a CRC check on the data to be written carried in the response message of the write request; and if the CRC check on the data to be written in the response message of the write request fails, determine that an upper-layer module of the DIF submodule was abnormal when the write request was sent, and send an error code indicating that the upper-layer module is abnormal.
In one possible design, the checking unit is configured to: if the CRC check on the data to be written in the response message of the write request succeeds, determine that a lower-layer module of the DIF submodule was abnormal when the write request was sent, and send an error code indicating that the lower-layer module is abnormal.
In one possible design, the write request carries a DIF check bit, and the checking unit is configured to: perform a CRC check on the data read from the hard disk; if the CRC check succeeds, compare the DIF of the data to be written carried in the response message of the write request with the DIF of the data read from the hard disk; and if the DIF of the data to be written carried in the response message of the write request is inconsistent with the DIF of the data read from the hard disk, determine that a write offset occurred when the data to be written was written into the hard disk.
In one possible design, the write request is a write request sampled in one sampling period.
In a third aspect, a computer-readable storage medium is provided, comprising computer instructions, which, when run on an electronic device, cause the electronic device to perform the method according to the first aspect.
In a fourth aspect, a computer program product is provided, which, when run on an electronic device, causes the electronic device to perform the method according to the first aspect.
Therefore, by reading the written data back and checking whether the read data is erroneous, the method and the device can identify the reliability risk of the data written into the hard disk in time. For a hard disk that does not support PI check, the data can be read back for checking after it is written, without relying on the PI check function of the hard disk, so hard disks without the PI function are also supported.
Drawings
Fig. 1 is a data format for PI verification according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a network architecture according to an embodiment of the present application;
fig. 3 is a schematic diagram of a network architecture according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a memory array according to an embodiment of the present application;
fig. 5 is a schematic software partitioning diagram of a BDM module according to an embodiment of the present disclosure;
fig. 5A is a schematic diagram of a data format according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a data verification method according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a data verification method according to an embodiment of the present application;
fig. 8 is a schematic software partitioning diagram of a BDM module according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a data write bias detection process according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a memory array according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a memory array according to an embodiment of the present application.
Detailed Description
For ease of understanding, some of the concepts related to the present application are illustratively presented for reference. As follows:
PI check: also known as a data integrity field (DIF) check, data is checked by adding PI to each data block. For example, in the T10 PI standard, for a 512+8 data format, as shown in fig. 1, each 512-byte logical sector (the data portion) is extended with 8 bytes of protection information (PI), including a 2-byte CRC check value (carried in the GRD (guard) field of the data block), 2 bytes of user-defined, application-related information (carried in the APP (application) field of the data block), and a 4-byte logical block address (LBA) (carried in the REF (reference) field of the data block). By comparing the protection information after reading/writing the data, it can be ensured that the data in the data block has not changed.
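To make the 512+8 layout concrete, the following is a minimal C sketch of one protected sector; the GRD/APP/REF fields follow the description above, while the struct names themselves are illustrative and byte-order handling is omitted.

```c
#include <stdint.h>

/* Sketch of one 512 + 8 protected sector in the T10 PI format
 * described above. Struct names are illustrative. */
struct pi_tuple {
    uint16_t grd;  /* GRD: CRC-16 check value over the data portion */
    uint16_t app;  /* APP: user-defined, application-related info   */
    uint32_t ref;  /* REF: logical block address (LBA)              */
};

struct protected_sector {
    uint8_t data[512];   /* 512-byte logical sector (data portion)     */
    struct pi_tuple pi;  /* 8-byte protection information (check part) */
};
```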
CRC: a hash function that generates a short, fixed-length check code from a network data packet or a computer file, mainly used to detect or check errors after data transmission or storage. It performs error detection based on division and the resulting remainder. Like other error-detection schemes, CRC appends (n-k) redundancy bits (also called the frame check sequence, FCS) to k bits of data to be transmitted, forming an n-bit transmission frame that is then transmitted, where n and k are integers greater than or equal to 1 and n > k.
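As an illustration of the principle just described, below is a minimal bit-by-bit CRC-16 sketch assuming the T10-DIF polynomial 0x8BB7 (initial value 0, no bit reflection), the polynomial used for the GRD field in the T10 PI standard; production storage stacks typically use table-driven or hardware-assisted implementations instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal bit-by-bit CRC-16, assuming the T10-DIF polynomial 0x8BB7
 * with initial value 0 and no reflection. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);  /* feed in next byte */
        for (int bit = 0; bit < 8; bit++)          /* long division     */
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;   /* the (n-k)-bit remainder, carried as the check code */
}
```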
Logical unit number (LUN): a LUN is an independent storage unit on a storage device that can be recognized by a server. The space of a LUN is derived from a storage pool, and the space of the storage pool is derived from a plurality of hard disks constituting a disk array (RAID). From the server's perspective, a LUN can be viewed as a usable hard disk.
Block device management (BDM), which may also be referred to as the Linux small computer system interface (SCSI) middle layer. The BDM is mainly responsible for hard disk management, hard disk path management, input/output (I/O) merging (combining a plurality of I/Os into one I/O), sorting (ordering a plurality of I/Os according to I/O priority), I/O bearing (also called the I/O channel), hard disk reliability processing, and the like.
The terms "disk" and "hard disk" are used in the embodiments of the present application to have substantially the same meaning. The magnetic disk is a device magnetic disk which performs a read-write function through magnetism, and can be a nonvolatile storage medium, and files stored after power failure cannot be lost. The hard disk can be better protected by arranging the storage sheets of the magnetic disk in a hard metal box.
The embodiments of the application can be applied to verifying data while a host performs write-operation access to the storage array.
As shown in fig. 2, the network architecture of the present application may include a server 21, which may be understood as the computing core, and a storage array 22, which is the core of data storage. One server can access one or more storage arrays, and one storage array can also be accessed by a plurality of servers.
For example, the server may be a personal computer (PC) server with an x86 architecture, or a minicomputer server running an operating system such as Solaris or AIX. Taking a PC server as an example, as shown in fig. 3, a terminal 31 at the server periphery may access the PC server through a local area network (LAN), and the PC server 32 may access the storage array 22 through a small computer system interface (SCSI).
Typically, a server is configured with local disk storage, which may be several hundred GB in capacity. When the server needs to store massive data, for example hundreds of TB, its local disk capacity is insufficient and must be expanded. A storage array can combine several, dozens, or even hundreds of disks into a storage device with huge capacity; all data is stored in the storage array, which is connected to the database host through optical fibers, thereby increasing the storage capacity available to the server.
As shown in FIG. 4, the storage array 22 may include a plurality of controllers 41, a cache 42 for each controller 41, and a plurality of hard disks 43. The hard disk 43 may be connected to the controller 41 through a Serial Attached SCSI (SAS) link. Each controller 41 may access the hard disk 43 through multiple access paths.
The storage array 22 maintains a LUN module according to service needs. The LUN module is used to create a plurality of designated LUNs, also called LUN objects, and to manage them, including creating, querying, and deleting LUN objects; each LUN object corresponds to a LUN identifier (ID). After the storage array 22 establishes a networking connection with the server 21, the server 21 may find the storage space allocated by the storage array by scanning the LUN module, and perform read/write operations on that storage space. Taking fig. 3 as an example, when a user wants to perform a write operation, the PC server 32 may, through operations on the terminal 31, start the service of a target (TGT) module, find the LUN ID to be written, and generate a write request. The write request comprises a data portion and a check portion to be written into the hard disk (if the server supports PI, the check portion in the write request can be generated directly by the server; if the server does not support PI, the check portion can be generated by the TGT module). The write request reaches a controller of the storage array 22 through the PC server, and the controller stores the data and the check code of the write request in the cache under that controller. At this point, if the write request is a write-back request, the storage array may first return a response to the PC server indicating that the write succeeded. The data of the write request stored in the cache is written to the hard disk 43 only when it is evicted from the cache according to the cache eviction policy. A write request going from the cache 42 to the hard disk 43 may pass through a chunk (CKG) module of the storage array; the CKG module is mainly used to manage the RAID of the storage pool, and may determine, according to the LUN ID in the write request, which logical disks corresponding to that LUN ID the data needs to be written to. Then, the BDM module may determine the physical hard disk to which the write request writes data, according to the logical disk determined by the CKG module and the correspondence between the logical addresses of the logical disk and the physical addresses of the physical hard disks. The TGT module, the LUN module, the CKG module, and the BDM module are all software modules of the operating system in the storage array.
As shown in fig. 5, the BDM module may include a logical disk (LD) submodule, a SCSI hard disk (SCSI disk, SD) submodule, and a SCSI input/output (SCSI I/O, SIO) submodule. The LD submodule manages the correspondence between the logical addresses of the logical disks and the logical addresses of the physical hard disks, and the SD submodule manages the logical addresses of the physical hard disks. The SIO submodule may include a check layer (which may also be referred to as a DIF submodule or DIF layer). When the BDM module receives a write request sent by the CKG module, it may write the data portion and the check portion carried in the write request to the hard disk. When the data is written successfully, the hard disk returns a write response that still carries the data portion and the check portion written by the write request; the data portion and the check portion stored in the cache are released only after the write response has passed up through the BDM module, the CKG module, and the cache to the LUN module, and the LBA check performed at the LUN module (which checks whether the data suffered a write offset) succeeds. Specifically, when the write response reaches the check layer of the BDM module, a CRC check is performed on it: a CRC check value is computed over the data portion carried in the write response and compared with the check portion (the CRC check value) carried in the write response. If the two are the same, the check succeeds, indicating that the data carried in the write response is itself intact; if they differ, the check fails, the data carried in the write response is erroneous, and a data miswrite fault has occurred. A successful CRC check, however, only shows that the data itself is correct; the data carried in the write response may still be data affected by a write offset, and therefore the LBA check must still be performed when the write response continues up to the LUN module, to check whether the data was written with an offset.
The principle by which the LBA check detects a data write offset may be as follows. When a write request is generated at the LUN module, as shown in fig. 5A, taking the 512+8 data format as an example, the write request carries the data portion and check portion of the written data A as well as other attributes; the other attributes carry the data size, the data type, and a first LBA corresponding to the write request, and the REF field of the check portion carries a second LBA corresponding to the LUN ID (the second LBA does not change during transmission, and at the LUN module the second LBA and the first LBA are the same). During the transmission of the write request, the first LBA is converted into the LBA of each layer by the software of each layer. Finally, at the LD submodule, the physical address for accessing the hard disk is determined and issued according to the LBA (logical address) received from the CKG module and the correspondence between the logical addresses of the logical disk and the physical addresses of the physical hard disks, and the physical hard disk writes the data carried in the write request to that physical address. When data A is written successfully, the write response returned by the hard disk carries the physical address at which data A was finally written, together with the data portion and check portion of data A. At this point, owing to a software or hardware fault, a data write offset may occur when the data is finally written into the hard disk. For example, if no fault occurs during the transmission of the write request, the first LBA should be converted to physical address 100 and data A should be written to physical address 100; if a fault occurs during transmission so that the first LBA is finally converted, through the software of each layer, to physical address 200, then data A carried in the write request is written to physical address 200, causing a data write offset. When the server later reads data A again, the read request carries the first LBA, and the read response also carries the first LBA, but the data at physical address 100 corresponding to the first LBA, as carried in the read response, is not data A but data B. When the read response returns to the LUN module and the LBA check is performed, it is found that the third LBA in the check portion carried with data B does not match the first LBA carried in the read response, and the LUN module determines that a data write offset occurred at the back end. In other words, the LBA check that verifies whether the data was written with an offset takes place only when the read response from reading the data again returns to the LUN module.
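The LBA check described above can be sketched as follows, reusing the protected_sector layout from the earlier sketch; the helper name and the assumption that the REF field is directly comparable to the request LBA are illustrative.

```c
/* LBA check at the LUN module, as described above: data B sitting at
 * the address where data A was expected carries a REF (LBA) that does
 * not match the LBA of the request, revealing a back-end write offset. */
static int lba_check(const struct protected_sector *read_back,
                     uint32_t request_lba)
{
    if (read_back->pi.ref != request_lba)
        return -1;   /* mismatch: data write offset at the back end */
    return 0;        /* REF matches the requested LBA */
}
```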
It should be noted that the software layers or software modules in the embodiment of the present application are only divided schematically, for example, the division of the BDM module in fig. 5 does not constitute a limitation on the division of the software layers of the BDM module.
It should be further noted that storage systems are mainly divided into two storage mechanisms, the storage area network (SAN) and network attached storage (NAS); the LUN examples referred to in this application relate to SAN, but the application is also applicable to NAS.
The CRC check and the LBA check above are both PI checks. Some hard disks support the PI check, but not all do, and on a hard disk that does not support the PI check, a data write error or write offset that may occur cannot be identified.
To solve this problem, the present application provides a data verification method executed by a storage array. Its principle may be as follows: when the storage array performs a write operation to write data into the hard disk, the returned response message is not forwarded to the upper layer immediately; instead, the response message is suspended at the check layer, and a read operation is generated to read the data just written into the hard disk. The read data is then checked at the check layer to determine whether the data written into the hard disk is erroneous, so the reliability risk of back-end data can be identified in time at the check layer. For a hard disk that does not support PI check, the data can be read back for checking right after it is written, without relying on the PI check function of the hard disk, so hard disks without the PI function are also supported.
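Under assumed helper names (struct disk, disk_write, disk_read, and dif_check are placeholders, not an actual driver API; one possible dif_check appears after step 605 below), the principle can be sketched as:

```c
#include <stdint.h>

/* Assumed placeholders for the real driver and check-layer entries. */
struct disk;
int disk_write(struct disk *d, uint64_t addr, const struct protected_sector *s);
int disk_read(struct disk *d, uint64_t addr, struct protected_sector *s);
int dif_check(const struct protected_sector *s);   /* see step 605 sketch */

/* Write-then-read-back principle: the write response is held at the
 * check layer, a read request is generated for the same address, and
 * the read-back data is checked before the response is released. */
int write_with_readback_verify(struct disk *d, uint64_t addr,
                               const struct protected_sector *sec)
{
    struct protected_sector read_back;

    if (disk_write(d, addr, sec) != 0)        /* write data to the disk   */
        return -1;
    /* the write response is suspended here instead of going upward */
    if (disk_read(d, addr, &read_back) != 0)  /* read the written data    */
        return -1;
    if (dif_check(&read_back) != 0)           /* check the read-back data */
        return -1;                            /* data written in error    */
    return 0;                                 /* release the write response */
}
```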
Based on the above principle, the following describes an implementation flow of the present application.
An embodiment of the present application provides a data verification method, as shown in fig. 6, the method includes:
601. The storage array writes the data to be written in the write request into the hard disk.
When the storage array receives a write request sent by the database host, the storage array may write the data to be written carried in the write request into the hard disk according to the above description. For example, when the BDM module receives data to be written, the data may be written to the hard disk as described above for fig. 3 and fig. 5. The write request carries a data portion and a check portion (the protection information portion), and when the storage array performs a write operation on the hard disk according to the write request, both the data portion and the check portion are written.
602. The storage array receives a response message of the write request sent by the hard disk, wherein the response message of the write request carries the address of the hard disk at which the data to be written was written.
When the data is successfully written into the hard disk, the hard disk can generate a response message of the write request. The response message is used to indicate to an upper-layer module of the hard disk that the data to be written has been written into the hard disk, and it carries the address of the hard disk at which the data was written.
603. The storage array generates a read request according to the address in the response message of the write request and sends the read request to the hard disk.
The storage array can suspend the response message of the write request at the check layer and generate a read request according to the address in that response message; the read request carries the address of the hard disk at which the data to be written was written. For example, if the address at which the data was written into the hard disk in step 602 is 200, the address carried in the read request is, under normal conditions, also 200. The read request is used to read the data that the storage array wrote into the hard disk at address 200 in step 601.
604. The storage array receives a response message of the read request sent by the hard disk, wherein the response message of the read request carries the data read from the hard disk.
When the read request reaches the hard disk, the hard disk may, according to the address carried in the read request, read the data that was written into the hard disk in step 601, and return a response message of the read request to the upper-layer module. The response message of the read request carries the data written into the hard disk in step 601, including the data portion and the check portion.
605. The storage array checks the data read from the hard disk to determine whether the data written into the hard disk is erroneous.
In this step, checking the data read from the hard disk, i.e., checking the data carried in the response message of the read request, may be the CRC check performed at the check layer of fig. 5. That is, when the response message of the read request returns to the check layer (the DIF submodule or DIF layer) in fig. 5, a CRC check is performed on the data read from the hard disk. If the CRC check succeeds, the data carried in the response message of the read request is not erroneous; if the CRC check fails, the data written into the hard disk is determined to be erroneous, which may be due to a hardware or software fault. The CRC check may compute a CRC value over the data portion carried in the response message of the read request and compare it with the CRC check value in the check portion carried in the same response message: if the computed CRC value is the same as the carried CRC check value, the CRC check succeeds; if they differ, the CRC check fails.
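One possible implementation of the dif_check placeholder used in the earlier sketch, i.e., step 605 expressed as code (reusing crc16_t10dif from the CRC example):

```c
/* Step 605: recompute the CRC over the data portion and compare it
 * with the CRC check value carried in the check portion. */
int dif_check(const struct protected_sector *s)
{
    uint16_t computed = crc16_t10dif(s->data, sizeof(s->data));
    return (computed == s->pi.grd) ? 0    /* CRC check successful        */
                                   : -1;  /* CRC check failed: data error */
}
```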
Therefore, by reading the written data back and checking whether the read data is erroneous, the method can identify the reliability risk of the data written into the hard disk in time. For a hard disk that does not support PI check, the data can be read back for checking after it is written, without relying on the PI check function of the hard disk, so hard disks without the PI function are also supported.
In the embodiment of the application, not only can a CRC check be performed on the response message of the read request at the check layer to determine whether the data carried in it is erroneous, but the suspended response message of the write request can also be further checked at the check layer to determine the range of the data error. In addition, as mentioned above, the LBA check is performed at the LUN module in the upper layer; that is, whether the data suffered a write offset is not verified until the read response from reading the data again reaches the LUN module, which is not timely.
In addition, performing data verification on every write request accessing the hard disk by the method corresponding to fig. 6 would affect user service performance. The application therefore may sample the write requests accessing the hard disk and check the sampled write requests for data write offset. Accordingly, an embodiment of the present application provides a data verification method, as shown in fig. 7, including:
701. the storage array samples write requests of the hard disks in one sampling period by taking each hard disk as a unit.
That is, the storage array may sample write requests in units of individual hard disks. For a given hard disk, the write requests accessing it are sampled in each sampling period, for example every 30 s; at least one write request may be sampled per period. The period may be configured or updated; for example, it may be configured how many write requests are sampled in each period of a hard disk. The sampling function may also be provided with a switch for enabling or disabling write sampling. Sampling a write request may be implemented by adding a tag field or bit to it; a write request carrying the tag field or bit is a sampled write request.
Referring to fig. 5 for the software layer partitioning of the BDM module, the storage array may perform the sampling process in the program at the SIO submodule entry of the BDM module.
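A sketch of the per-hard-disk sampling decision, assuming the 30 s example period above; all names and the one-request-per-period policy are illustrative, since the period and the number of sampled requests are configurable:

```c
/* Per-hard-disk write sampling: at most one write request is tagged
 * per sampling period here; the tag (a field or bit added to the
 * request) marks it for read-back verification. */
#define SAMPLE_PERIOD_MS 30000u   /* example 30 s period; configurable */

struct sample_state {
    uint64_t period_start_ms;     /* start of the current period      */
    int      sampled;             /* already sampled in this period?  */
    int      enabled;             /* write-sampling switch (on/off)   */
};

static int should_sample(struct sample_state *st, uint64_t now_ms)
{
    if (!st->enabled)
        return 0;
    if (now_ms - st->period_start_ms >= SAMPLE_PERIOD_MS) {
        st->period_start_ms = now_ms;   /* a new sampling period begins */
        st->sampled = 0;
    }
    if (!st->sampled) {
        st->sampled = 1;
        return 1;                       /* tag this write request */
    }
    return 0;
}
```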
702. The storage array writes the data to be written in the sampled write request into the hard disk.
The data to be written in the write request includes a data portion and a check portion. The process of issuing the write request can be seen in the description of fig. 5.
703. The storage array receives a response message of a sampled write request sent by the hard disk, wherein the response message of the write request carries an address of the hard disk to which the data to be written is written.
The response message of the write request carries the address of the hard disk at which the data to be written by the write request was written; the data written into the hard disk comprises a data portion and a check portion. The response message of the write request is generated by the hard disk to which the data was written. The response message of the write request will reach the BDM module; referring to fig. 5, it is returned toward the upper-layer module through the drive interface layer of the BDM module.
704. The storage array generates a read request according to the address in the response message of the sampled write request and sends the read request to the hard disk.
Generating the read request based on the address in the response message to the write request may be performed at a check layer of the BDM module.
In some embodiments, as shown in fig. 8, the check layer that generates the read request from the address in the response message of the sampled write request in step 704 may be a check layer different from the check layer performing the CRC check in fig. 5. In fig. 8, the check layer of fig. 5 is labeled check layer 1 (or DIF submodule 1), and the check layer available for generating the read request is labeled check layer 2 (or DIF submodule 2). Check layer 2 is a check layer newly added in the present application; it may be configured to generate a read request according to the address carried in the response message of the write request and send the read request toward the hard disk through the modules below check layer 2.
For example, when the response message of the sampled write request reaches check layer 2 of the BDM module, it is not returned to the upper-layer module immediately; instead, it is suspended at check layer 2, and a read request is generated according to the address in the response message of the sampled write request. The read request is used to read the data written into the hard disk by the write request sampled in step 702, and it carries the address from the response message of the write request.
705. The storage array receives a response message of the read request sent by the hard disk, wherein the response message of the read request carries the data read from the hard disk.
When the hard disk receives the read request generated by check layer 2, it may read the data written into the hard disk in step 702 according to the address in the read request; the data includes a data portion and a check portion. The hard disk then generates a response message of the read request, which carries the data portion and the check portion.
In some embodiments, when the hard disk reads data according to an address in the read request, all data of a data portion corresponding to the address in the hard disk may be read, or partial data of the data portion may be read according to a preset interval mode, which is not limited in this application.
It should be noted that, in step 705 and the following steps, the write request refers to the sampled write request, and the response message of the write request refers to the response message of the sampled write request.
706. The storage array performs a CRC check on the data read from the hard disk; if the CRC check fails, step 707 is executed; if the CRC check succeeds, step 710 is entered.
Referring to fig. 5, the response message of the read request may pass through the SAS driver to the BDM module. When the response message of the read request reaches check layer 1 of the BDM module, a CRC check may be performed on it at check layer 1, i.e., it is computed whether the data portion and the check portion in the response message of the read request match; the matching may be as shown in step 605. If they do not match, i.e., the CRC check fails, the data carried in the response message of the read request is erroneous; if they match, i.e., the CRC check succeeds, the data carried in the response message of the read request is correct.
It should be noted that the SAS driver is only an example in the present application; the driver is not limited herein and may be a protocol driver such as NVMe (Non-Volatile Memory Express).
707. The storage array performs a CRC check on the data to be written carried in the response message of the write request; then step 708 or step 709 is entered.
A data error may be caused by a software fault or a hardware fault. To determine the range of the data error, a further CRC check may be performed on the data to be written carried in the response message of the write request suspended in step 704.
In step 707, since the response message of the write request is suspended at check layer 2, the CRC check on the response message of the write request may be performed at check layer 2.
708. If the CRC check on the data to be written in the response message of the write request fails, the storage array determines that an upper-layer module of the DIF submodule was abnormal when the write request was sent, and sends an error code indicating that the upper-layer module is abnormal.
If the CRC check on the data to be written carried in the suspended response message of the write request fails, then the CRC checks on both the response message of the write request and the response message of the read request have failed. In this case, the data error may be regarded as an abnormality that occurred when an upper-layer module of check layer 2 issued the write request, i.e., the source of the abnormality is an upper-layer module of check layer 2. Check layer 2 may then send an error code to the upper-layer module, where the error code indicates that an error occurred when the data was written into the hard disk and that the fault originates from an upper-layer module of check layer 2. For example, the fault may come from one or more of the debug/merge layer, the SIO submodule entry, the SD submodule, the LD submodule, the CKG module, and the like.
709. If the CRC check on the data to be written in the response message of the write request succeeds, the storage array determines that a lower-layer module of the DIF submodule was abnormal when the write request was sent, and sends an error code indicating that the lower-layer module is abnormal.
If the CRC check on the data to be written carried in the suspended response message of the write request succeeds, then the check on the response message of the write request has succeeded while the CRC check on the response message of the read request has failed. In this case, the data error may be regarded as an abnormality that occurred when a lower-layer module of check layer 2 issued the write request, i.e., the source of the abnormality is a lower-layer module of check layer 2. Check layer 2 may then send an error code to the upper-layer module, where the error code indicates that an error occurred when the data was written into the hard disk and that the fault originates from a lower-layer module of check layer 2. For example, the fault may come from one or more modules in the distribution layer or the drive interface layer, or from the SAS lines, the hard disk enclosure, or the hard disk itself.
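Steps 706 to 709 amount to the following decision, sketched with illustrative enum names:

```c
/* Error localization, steps 706-709: a failed CRC on the read-back
 * data plus a failed CRC on the suspended write response points above
 * check layer 2; a failed read-back CRC with a good write-response
 * CRC points below it. */
enum check_result {
    CHECK_OK,            /* step 706: read-back CRC succeeded (go to 710) */
    ERR_UPPER_MODULE,    /* step 708: fault above check layer 2           */
    ERR_LOWER_MODULE,    /* step 709: fault below check layer 2           */
};

static enum check_result localize_error(const struct protected_sector *write_rsp,
                                        const struct protected_sector *read_rsp)
{
    if (dif_check(read_rsp) == 0)
        return CHECK_OK;           /* step 706: read-back data is intact   */
    if (dif_check(write_rsp) != 0)
        return ERR_UPPER_MODULE;   /* steps 707/708: write response bad too */
    return ERR_LOWER_MODULE;       /* steps 707/709: write response is good */
}
```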
710. The storage array compares the DIF of the data to be written carried in the response message of the write request with the DIF of the data read from the hard disk; then step 711 or step 712 is entered.
At check layer 1, step 706 is executed, i.e., a CRC check is performed on the data read from the hard disk carried in the response message of the read request. This can only determine whether the read data itself is erroneous, i.e., whether the read data portion and check portion match; it cannot determine whether the read data is data for which a write offset fault occurred. Therefore, in the present application, once it is determined that the data carried in the response message of the read request is not erroneous, step 710 may further be performed to check whether the data carried in the response message of the read request is data affected by a write offset.
In some embodiments, checking whether the data suffered a write offset may be performed at check layer 2 of the present application. As mentioned above, the LBA check is otherwise performed at the LUN module in the upper layer, i.e., whether the data was written with an offset is not verified until the response message reaches the LUN module, which is not timely. Therefore, in this embodiment of the application, check layer 2 has at least the following functions: 1) generating a read request according to the response of the write request; 2) checking the response message of the write request; 3) checking whether the data was written with an offset.
711. If the DIF of the data to be written carried in the response message of the write request is inconsistent with the DIF of the data read from the hard disk, it is determined that a write offset occurred when the data to be written was written into the hard disk.
As shown in fig. 9, for example, when the data carried by the write request is data A and the intended physical address of data A is address 100, if a write offset occurs, data A is written to address 200 while the physical address carried by the response message of the write request is still address 100 and the carried data is data A. The read request generated according to the response message of the write request then carries physical address 100, and the data carried by the response message of the read request is data B, the data actually stored at physical address 100; that is, the data read from the hard disk is data B. The DIF 1 of data A carried by the response message of the write request and the DIF 2 of data B read from the hard disk are then inconsistent, so it is determined that a write offset occurred when data A was written into the hard disk, and the write-offset fault was caused by a lower-layer module of check layer 2. Check layer 2 may send an error code to the upper-layer module, where the error code indicates that a data write offset occurred and that the abnormal range is the lower-layer modules of check layer 2.
712. If the DIF of the data to be written carried in the response message of the write request is consistent with the DIF of the data read from the hard disk, it is determined that no write offset occurred when the data to be written was written into the hard disk.
Still referring to fig. 9, for example, when the data to be written carried in the response message of the write request is data A, the address at which data A was intended to be written in the hard disk is address 100, and data A was actually written to address 100, then the data read from the hard disk, i.e., the data carried in the response message of the read request, is the data at address 100, and the DIF of the data carried in the response message of the write request and the DIF of the data read from the hard disk are both DIF 1 of data A. It is therefore determined that no write offset occurred when the data to be written was written into the hard disk.
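Steps 710 to 712 reduce to a field-by-field comparison of the two DIFs; a sketch:

```c
/* Write-offset check, steps 710-712: with the read-back CRC already
 * successful, compare the DIF carried in the suspended write response
 * with the DIF read back from the hard disk. Any difference means the
 * data landed at the wrong address (write offset, step 711). */
static int write_offset_check(const struct pi_tuple *write_dif,
                              const struct pi_tuple *read_dif)
{
    if (write_dif->grd != read_dif->grd ||
        write_dif->app != read_dif->app ||
        write_dif->ref != read_dif->ref)
        return -1;   /* step 711: write offset occurred */
    return 0;        /* step 712: no write offset       */
}
```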
In addition, in step 709 and step 711, after checking whether the data suffered a write offset, the suspension of the response message of the write request may be canceled. When the response message of the write request returns to the LD submodule, if the LD submodule has also received an error code attributing the data abnormality to a lower-layer module, the LD submodule may switch the controller access path, for example from the path of controller 1 to the path of controller 2, and then retry the write request through controller 2.
If the retried write request, when executed through steps 702-712, is also determined to have a data abnormality caused by a lower-layer module, then, since both controller paths access the same hard disk, it may be concluded that the data abnormality was caused by a hard disk fault; the repair process of the hard disk may be triggered and the hard disk isolated. If no data abnormality occurs for the retried write request in steps 702-712, or the range of the abnormality is attributed to an upper-layer module, it may be determined that the data abnormality was caused by a fault on the path of controller 1; the repair procedure for the controller 1 path may be triggered and that path isolated.
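The retry-and-isolate logic of the two paragraphs above can be sketched as follows; trigger_disk_repair, isolate_disk, trigger_path_repair, and isolate_path are assumed names for the repair processes mentioned in the text:

```c
/* Assumed placeholders for the repair/isolation processes. */
void trigger_disk_repair(struct disk *d);
void isolate_disk(struct disk *d);
void trigger_path_repair(struct disk *d, int controller);
void isolate_path(struct disk *d, int controller);

/* Fault isolation after a retry over the second controller path:
 * both paths reach the same hard disk, so a failure on both
 * implicates the disk; a clean retry implicates the first path. */
static void isolate_fault(struct disk *d, int retry_also_failed)
{
    if (retry_also_failed) {
        trigger_disk_repair(d);          /* both paths failed: disk fault */
        isolate_disk(d);
    } else {
        trigger_path_repair(d, 1);       /* only the controller-1 path failed */
        isolate_path(d, 1);
    }
}
```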
According to the embodiments of the application, the method does not depend on the PI check function of the hard disk and can support hard disks without the PI function; errors of data write offset can also be detected in time, without waiting for the LUN module to detect them, so write offsets are discovered promptly. Moreover, the data verification process of the embodiments does not distinguish between data formats and is effective for service scenarios with mixed data formats.
It will be appreciated that, to implement the above functions, the storage array contains corresponding hardware and/or software modules for performing the respective functions. The present application can be implemented in hardware or in a combination of hardware and computer software, in conjunction with the example algorithm steps described for the embodiments disclosed herein. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In this embodiment, the electronic device may be divided into functional modules according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be implemented in the form of hardware. It should be noted that, the division of the modules in this embodiment is schematic, and is only one logic function division, and another division manner may be available in actual implementation.
In the case of dividing each functional module according to its corresponding function, fig. 10 shows a schematic diagram of a possible composition of the storage array 100 related to the above embodiments. As shown in fig. 10, the storage array 100 may include: a write operation unit 1001, a response unit 1002, a read operation unit 1003, and a checking unit 1004.
Among other things, the write operation unit 1001 may be used to support the storage array 100 in performing steps 601 and 702 above, and/or other processes for the techniques described herein; the response unit 1002 may be used to support the storage array 100 in performing steps 602, 604, 703, and 705, and/or other processes for the techniques described herein; the read operation unit 1003 may be used to support the storage array 100 in performing steps 603 and 704, and/or other processes for the techniques described herein; and the checking unit 1004 may be used to support the storage array 100 in performing steps 605, 706, 707, 710, 711, and 712, and/or other processes for the techniques described herein.
In some embodiments, a sampling unit 1005 may also be included to support the storage array 100 in performing step 701 described above, and/or other processes of the techniques described herein; an exception feedback unit 1006 may also be included to support the storage array 100 in performing steps 708 and 709 described above, and/or other processes of the techniques described herein. A structural sketch of this unit split follows.
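As a structural sketch only, the unit split of fig. 10 can be pictured as one object whose methods mirror the steps each unit supports; all names and signatures below are invented for illustration and are not the disclosure's actual interfaces.

class StorageArray100:
    """Hypothetical outline of the functional units 1001-1006 in fig. 10."""

    def write(self, address: int, block: bytes):        # write operation unit 1001 (steps 601/702)
        ...

    def on_response(self, message: dict):                # response unit 1002 (steps 602/604/703/705)
        ...

    def read_back(self, address: int) -> bytes:          # read operation unit 1003 (steps 603/704)
        ...

    def check(self, written: bytes, read_back: bytes):   # verification unit 1004 (steps 605/706-707/710-712)
        ...

    def sample(self, write_requests: list) -> list:      # sampling unit 1005 (step 701)
        ...

    def report_exception(self, error_code: int):         # exception feedback unit 1006 (steps 708/709)
        ...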
It should be noted that, for the details of each step in the above method embodiment, reference may be made to the functional description of the corresponding functional module; they are not repeated here.
The storage array 100 provided in this embodiment is used to execute the above data verification method and can therefore achieve the same effects as the method implementation.
Where integrated units are employed, the storage array may include a processing module, a storage module, and a communication module. The processing module may be configured to control and manage the operations of the storage array; for example, it may be configured to support the storage array in executing the steps performed by the write operation unit 1001, the response unit 1002, the read operation unit 1003, the verification unit 1004, the sampling unit 1005, and the exception feedback unit 1006. The storage module may be used to store the program code and data of the storage array. The communication module may be used to support communication between the storage array and other devices, such as a server.
The processing module may be a processor or a controller, which may implement or execute the various example logical blocks, modules, and circuits described in connection with this disclosure. A processor may also be a combination of devices implementing computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor (DSP) and a microprocessor. The storage module may be one or more memories together with a plurality of hard disks. The communication module may specifically be a transceiver, a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip, or another device that interacts with other electronic devices.
In an embodiment, when the processing module is a controller and the storage module is a memory and a hard disk, the storage array according to this embodiment may be the storage array 110 having the structure shown in fig. 11.
Embodiments of the present application further provide a storage array, which includes one or more controllers, one or more memories, and a plurality of hard disks. The one or more memories and the plurality of hard disks are coupled to the one or more controllers. The one or more memories store computer program code for the storage array, the computer program code comprising computer instructions that, when executed by the one or more controllers, cause the storage array to perform the related method steps described above, thereby implementing the data verification method in the foregoing embodiments.
Embodiments of the present application further provide a computer storage medium storing computer instructions that, when run on a storage array, cause the storage array to execute the related method steps above to implement the data verification method in the foregoing embodiments.
Embodiments of the present application further provide a computer program product that, when run on a computer, causes the computer to execute the related steps above to implement the data verification method performed by the storage array in the foregoing embodiments.
In addition, embodiments of the present application also provide an apparatus, which may specifically be a chip, a component, or a module, and may include a controller and a memory connected to each other. When the apparatus runs, the controller can execute the computer-executable instructions stored in the memory, so that the chip performs the data verification method executed by the storage array in the foregoing method embodiments.
The storage array, computer storage medium, computer program product, and chip provided in this embodiment are all used to execute the corresponding methods provided above; for the beneficial effects they can achieve, reference may be made to the beneficial effects of those corresponding methods, which are not repeated here.
Through the description of the above embodiments, those skilled in the art will understand that the division into the above functional modules is used only as an example for convenience and simplicity of description. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules or units is only one division of logical functions, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another device, and some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or multiple physical units; that is, they may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any change or substitution within the technical scope disclosed by the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A data verification method performed by a storage array, wherein the method is performed by a block device management, BDM, module in the storage array, the BDM module including a data integrity field, DIF, sub-module, the method comprising:
writing the data to be written in the write request into the hard disk;
receiving a response message of the write request sent by the hard disk, wherein the response message of the write request carries an address of the hard disk to which the data to be written is written;
generating a read request according to the address in the response message of the write request, and sending the read request to the hard disk;
receiving a response message of a read request sent by the hard disk, wherein the response message of the read request carries data read from the hard disk;
checking the data read from the hard disk to determine whether the data written into the hard disk is in error;
wherein the checking of the data read from the hard disk to determine whether the data written into the hard disk is in error comprises:
performing CRC on the data read from the hard disk, and if the CRC fails, performing CRC on the data to be written carried in the response message of the write request;
and if the CRC of the data to be written in the response message of the write request also fails, determining that an upper-layer module of the DIF sub-module was abnormal when the write request was sent, and sending an error code indicating the upper-layer module exception.
2. The method as claimed in claim 1, wherein if the CRC check of the data to be written in the response message of the write request succeeds, it is determined that a lower-layer module of the DIF sub-module was abnormal when the write request was sent, and an error code indicating the lower-layer module exception is sent.
3. The method of claim 1, wherein the write request carries a DIF parity bit; and the checking of the data read from the hard disk to determine whether the data written into the hard disk is in error comprises:
performing CRC on the data read from the hard disk;
if the CRC is successful, comparing the DIF of the data to be written carried in the response message of the write request with the DIF of the data read from the hard disk;
and if the DIF of the data to be written carried in the response message of the write request is not consistent with the DIF of the data read from the hard disk, determining that write offset occurs when the data to be written is written into the hard disk.
4. The method of claim 1, wherein the write request is a write request sampled in one sampling period.
5. A storage array, wherein data verification is performed by a block device management, BDM, module in the storage array, the BDM module including a data integrity field, DIF, sub-module, the storage array comprising:
a write operation unit, configured to write the data to be written in the write request into the hard disk;
a response unit, configured to receive a response message of the write request sent by the hard disk, wherein the response message of the write request carries an address at which the data to be written is written in the hard disk;
a read operation unit, configured to generate a read request according to the address in the response message of the write request and send the read request to the hard disk;
wherein the response unit is further configured to receive a response message of the read request sent by the hard disk, wherein the response message of the read request carries data read from the hard disk; and
a verification unit, configured to verify the data read from the hard disk to determine whether the data written into the hard disk is in error;
wherein the BDM module comprises the verification unit, and the verification unit comprises the DIF sub-module; and
wherein the verification unit is configured to:
performing CRC on the data read from the hard disk, and if the CRC fails, performing CRC on the data to be written carried in the response message of the write request;
and if the CRC of the data to be written in the response message of the write request also fails, determine that an upper-layer module of the DIF sub-module was abnormal when the write request was sent, and send an error code indicating the upper-layer module exception.
6. The storage array of claim 5, wherein the verification unit is configured to:
and if the CRC of the data to be written in the response message of the write request succeeds, determine that a lower-layer module of the DIF sub-module was abnormal when the write request was sent, and send an error code indicating the lower-layer module exception.
7. The storage array of claim 5, wherein the write request carries a DIF parity bit; the verification unit is configured to:
performing CRC on the data read from the hard disk;
if the CRC is successful, comparing the DIF of the data to be written carried in the response message of the write request with the DIF of the data read from the hard disk;
and if the DIF of the data to be written carried in the response message of the write request is not consistent with the DIF of the data read from the hard disk, determining that write offset occurs when the data to be written is written into the hard disk.
8. The storage array of claim 5, wherein the write request is a write request sampled in one sampling period.
9. A computer readable storage medium comprising computer instructions which, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-4.
CN201910974182.5A 2019-10-14 2019-10-14 Data verification method and device Active CN112732163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910974182.5A CN112732163B (en) 2019-10-14 2019-10-14 Data verification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910974182.5A CN112732163B (en) 2019-10-14 2019-10-14 Data verification method and device

Publications (2)

Publication Number Publication Date
CN112732163A CN112732163A (en) 2021-04-30
CN112732163B (en) 2023-02-03

Family

ID=75588562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910974182.5A Active CN112732163B (en) 2019-10-14 2019-10-14 Data verification method and device

Country Status (1)

Country Link
CN (1) CN112732163B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114253484B (en) * 2021-12-25 2023-06-09 东莞市微三云大数据科技有限公司 Big data cloud storage server
CN114442953B (en) * 2022-01-26 2023-07-14 山东云海国创云计算装备产业创新中心有限公司 Data verification method, system, chip and electronic equipment
CN115292697B (en) * 2022-10-10 2022-12-16 北京安帝科技有限公司 Memory protection method and device based on intrusion behavior analysis

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5102915B2 (en) * 2007-09-07 2012-12-19 株式会社日立製作所 Storage apparatus and data verification method thereof
JP2010033287A (en) * 2008-07-28 2010-02-12 Hitachi Ltd Storage subsystem and data-verifying method using the same
US8819450B2 (en) * 2008-11-25 2014-08-26 Dell Products L.P. System and method for providing data integrity
JP2016149051A (en) * 2015-02-13 2016-08-18 富士通株式会社 Storage control device, storage control program, and storage control method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1221646A2 (en) * 2001-01-09 2002-07-10 Kabushiki Kaisha Toshiba Disk control system and method
JP2007128527A (en) * 2006-11-16 2007-05-24 Hitachi Ltd Disk array device
JP2011008650A (en) * 2009-06-29 2011-01-13 Fujitsu Ltd Method, system and program for error verification
CN103019880A (en) * 2012-12-14 2013-04-03 华为技术有限公司 Data verification method, storage device and storage system
CN104182701A (en) * 2014-08-15 2014-12-03 华为技术有限公司 Array control unit, array and data processing method
CN109918226A (en) * 2019-02-26 2019-06-21 平安科技(深圳)有限公司 A kind of silence error-detecting method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Buffering algorithm based on PCRAM data page clustering; Li Shunfen et al.; China Integrated Circuit; 2014-02-05; vol. 23, no. Z1; pp. 30-35, 40 *

Also Published As

Publication number Publication date
CN112732163A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
US7873878B2 (en) Data integrity validation in storage systems
US8504768B2 (en) Storage apparatus, recording medium and method for controlling storage apparatus
US8473816B2 (en) Data verification using checksum sidefile
US8713251B2 (en) Storage system, control method therefor, and program
US7788541B2 (en) Apparatus and method for identifying disk drives with unreported data corruption
US9372743B1 (en) System and method for storage management
CN112732163B (en) Data verification method and device
US7590884B2 (en) Storage system, storage control device, and storage control method detecting read error response and performing retry read access to determine whether response includes an error or is valid
US9009569B2 (en) Detection and correction of silent data corruption
US8762681B2 (en) Blocked based end-to-end data protection for extended count key data (ECKD)
KR20090112670A (en) Data integrity validation in storage systems
JP2010033287A (en) Storage subsystem and data-verifying method using the same
US9244784B2 (en) Recovery of storage device in a redundant array of independent disk (raid) or raid-like array
US9690651B2 (en) Controlling a redundant array of independent disks (RAID) that includes a read only flash data storage device
US7360018B2 (en) Storage control device and storage device error control method
US11232032B2 (en) Incomplete write group journal
US20130198585A1 (en) Method of, and apparatus for, improved data integrity
WO2019210844A1 (en) Anomaly detection method and apparatus for storage device, and distributed storage system
WO2021055129A1 (en) Protocol for improving rebuild times of redundant array of independent disks
US20090210634A1 (en) Data transfer controller, data consistency determination method and storage controller
US11372555B2 (en) Reconstructing data in a smart storage array
US11080136B2 (en) Dropped write error detection
CN110134572B (en) Validating data in a storage system
US9418100B1 (en) System and method for storage management
US11614868B2 (en) Data set overlay protection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant