WO2020034695A1 - Data storage method, data recovery method, apparatus, device and storage medium

Info

Publication number: WO2020034695A1
Application number: PCT/CN2019/087904
Authority: WO
Inventors: 魏明昌
Original assignee: 华为技术有限公司
Priority date: 2018-08-14
Filing date: 2019-05-22
Publication date: 2020-02-20
Also published as: CN110825552A; CN110825552B

Abstract

Provided is a data storage method. The method introduces a mechanism for cross-backup of metadata of data blocks in EC stripes; and by means of storing the metadata of the data blocks and that subjected to cross-backup together in data stripe units, it is ensured that the metadata of the data blocks is mutually stored in different storage nodes, so that even if the metadata of a certain storage node is lost, since the metadata backup of the storage node is pre-stored in the data stripe units of the other storage nodes, the lost metadata of the storage node can also be acquired from the data stripe units of the other storage nodes. As a result, the probability of metadata loss is reduced, and the reliability and security of data storage is greatly improved, thereby improving the storage performance of a distributed storage system.

Description

Data storage method, data recovery method, apparatus, device and storage medium

Technical field

The present application relates to the field of storage technology, and in particular, to a data storage method, a data recovery method, an apparatus, a device, and a storage medium.

Background

With the development of storage technology, current distributed storage systems often use erasure code (EC) technology to store data in the form of EC stripes. Each EC stripe is composed of m data blocks and k check blocks. The data blocks are used to store data, and the check blocks are used to recover data. When blocks are lost from an EC stripe, as long as the total number of lost blocks is not greater than k, the lost data can be recovered by performing EC decoding on the remaining data blocks and check blocks, thereby greatly improving the stability and reliability of data storage.

When data is stored using EC technology, a distributed storage system usually includes a client node, a main storage node, and at least one backup storage node. The client node sends data to the main storage node; the main storage node performs EC coding and sends data blocks and check blocks to the at least one backup storage node; and each backup storage node stores a data block or a check block. Specifically, when the client node receives data to be stored, it sends the data to the main storage node corresponding to the target storage location of the data. The main storage node divides the data into m data blocks and uses a redundancy algorithm to perform EC coding on the m data blocks to obtain k check blocks. The main storage node itself stores one data block or check block and, after the storage succeeds, records the metadata of the stored data block or check block. The main storage node also sends the remaining m + k - 1 data blocks and check blocks to m + k - 1 backup storage nodes, each of which stores one data block or check block and, after the storage succeeds, records the metadata of the stored data block or check block.

In the above solution, only the security of the data block can be guaranteed, and the security of the metadata of the data block is poor.

Summary of the Invention

The embodiments of the present application provide a data storage method, a data recovery method, an apparatus, a device, and a storage medium, which can solve the problem in related technologies that the metadata of stored data blocks is poorly protected. The technical solution is as follows:

In a first aspect, a data storage method is provided. The method includes:

Generating at least one data stripe unit according to at least one data block to be stored, where each data stripe unit includes a data block and cross-backup metadata, and the cross-backup metadata includes metadata of the data block of the data stripe unit and metadata of data blocks included in data stripe units other than the data stripe unit;

Performing erasure code (EC) coding on the at least one data stripe unit to obtain at least one check stripe unit;

Distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node.

The method provided in this embodiment introduces a mechanism for cross-backup of the metadata of data blocks in an EC stripe. By storing the data blocks and the cross-backup metadata together in the data stripe units, it ensures that different storage nodes store the metadata of each other's data blocks. Even if the metadata of a storage node is lost, because the metadata backup of that storage node is pre-stored in the data stripe units of other storage nodes, the lost metadata can be obtained from the data stripe units of those other storage nodes. This reduces the probability of metadata loss, greatly improves the reliability and security of data storage, and thereby improves the storage performance of the distributed storage system.

Optionally, generating at least one data stripe unit according to at least one data block to be stored includes:

Backing up the metadata of the at least one data block to obtain at least one metadata backup, where the at least one data block is in one-to-one correspondence with the at least one metadata backup;

For a data block in the at least one data block, selecting at least one target metadata backup corresponding to the data block from the at least one metadata backup;

Generating a data stripe unit according to the data block, the metadata of the data block, and the at least one target metadata backup.

Optionally, the selecting, from the at least one metadata backup, at least one target metadata backup corresponding to the data block includes:

Querying the cross-backup relationship between the storage nodes according to the first storage node corresponding to the data block to obtain at least one second storage node corresponding to the first storage node;

Determining a metadata backup of a data block corresponding to the at least one second storage node as the at least one target metadata backup.
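
The generation steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the `DataStripeUnit` class, the `CROSS_BACKUP` table, the node names, and the metadata layout produced by `make_metadata` are all hypothetical and are not defined by the present application.

```python
from dataclasses import dataclass, field


@dataclass
class DataStripeUnit:
    node: str     # first storage node that will hold this unit
    block: bytes  # the data block itself
    # Cross-backup metadata: node name -> metadata of that node's data block.
    cross_backup: dict = field(default_factory=dict)


# Hypothetical cross-backup relationship: each first storage node maps to
# the second storage nodes whose metadata backups it also carries.
CROSS_BACKUP = {"node1": ["node2"], "node2": ["node3"], "node3": ["node1"]}


def make_metadata(node, block):
    # Illustrative metadata only; the real metadata layout is not specified here.
    return {"node": node, "length": len(block)}


def build_units(blocks_by_node, relationship):
    # One metadata backup per data block (one-to-one correspondence).
    backups = {n: make_metadata(n, b) for n, b in blocks_by_node.items()}
    units = []
    for node, block in blocks_by_node.items():
        cross = {node: make_metadata(node, block)}  # the unit's own metadata
        for second in relationship[node]:           # target metadata backups
            cross[second] = dict(backups[second])
        units.append(DataStripeUnit(node, block, cross))
    return units


units = build_units(
    {"node1": b"aaaa", "node2": b"bb", "node3": b"c"}, CROSS_BACKUP
)
```

In this sketch the unit destined for `node1` also carries the metadata of the block destined for `node2`, so the metadata of `node2` survives the loss of `node2`.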

Optionally, performing EC coding on the at least one data stripe unit to obtain at least one check stripe unit includes at least one of the following steps:

Performing EC coding on the data blocks in the at least one data stripe unit to obtain a check block in the at least one check stripe unit;

Performing EC coding on the metadata in the at least one data stripe unit to obtain a metadata check block in the at least one check stripe unit.

Based on this method, performing EC coding on the metadata in each data stripe unit further improves the reliability and security of metadata storage. Specifically, when the metadata in any data stripe unit is lost, the lost metadata can be restored not only by reading the cross-backup metadata stored in other data stripe units, but also by performing EC decoding on the metadata in the other data stripe units and the metadata check block, thereby further reducing the probability of metadata loss and greatly improving the reliability and security of the stored metadata.
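
The separate coding of data blocks and metadata can be illustrated with a small sketch. The present application does not name a specific erasure code, so the example below substitutes single XOR parity (equivalent to k = 1) purely for illustration; a real system would more likely use a code that supports k > 1, such as Reed-Solomon. The function name and the sample byte strings are assumptions.

```python
def xor_parity(chunks):
    """XOR equal-length byte strings together (a k = 1 erasure code)."""
    size = len(chunks[0])
    if any(len(c) != size for c in chunks):
        raise ValueError("all chunks must have the same length")
    out = bytearray(size)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


# Sample data blocks and metadata of three data stripe units (made up).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
metadata = [b"meta-1..", b"meta-2..", b"meta-3.."]

# Data and metadata are coded separately, giving a check block and a
# metadata check block for the check stripe unit.
check_block = xor_parity(data_blocks)
metadata_check_block = xor_parity(metadata)

# Suppose metadata[1] is lost: decode it from the surviving metadata
# and the metadata check block.
recovered_metadata = xor_parity([metadata[0], metadata[2], metadata_check_block])

# The same decoding applies to a lost data block.
recovered_block = xor_parity([data_blocks[1], data_blocks[2], check_block])
```

With XOR parity, decoding is the same operation as encoding, which keeps the sketch short; it tolerates the loss of only one chunk per stripe.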

Optionally, the distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node includes:

Sending a corresponding data stripe unit and / or check stripe unit to each storage node.

Based on this method, the client node sends a corresponding data stripe unit and / or check stripe unit directly to each storage node, so the data to be stored reaches each storage node after only one hop, from the client node to the storage node. The data does not need to travel from the client node to the main storage node and then from the main storage node to each backup storage node, which would take two hops. This removes the network delay caused by forwarding through the main storage node, thereby reducing the latency of storing data, improving the efficiency and speed of storing data, and greatly improving the storage performance of the distributed storage system.

Optionally, after the at least one data stripe unit and the at least one check stripe unit are distributed to at least one storage node, the method further includes:

Receiving a first write completion message sent by the at least one storage node;

Sending a second write completion message to a target application or an external host, where the first write completion message is used to indicate that the corresponding storage node has stored the data stripe unit and / or the check stripe unit, and the second write completion message is used to indicate that the data has been written to the target storage location.

Determining that a third storage node of the at least one storage node is in a sub-health state;

Sending a corresponding data stripe unit and / or a check stripe unit to a fourth storage node of the at least one storage node.

Based on this optional method, if a storage node is in a sub-health state, that storage node can be avoided immediately when storing data, and the data is stored through other storage nodes. This achieves rapid switching between storage nodes and reduces the impact of the sub-health state of a storage node on the performance of the storage system, ensuring that the storage system can store data quickly even if a storage node is in a sub-health state, and thereby ensuring the reliability and stability of the storage system.

Optionally, the sending a corresponding data stripe unit and / or a check stripe unit to a fourth storage node of the at least one storage node includes at least one of the following steps:

Sending, to the fourth storage node, the at least one check stripe unit and at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node;

Sending, to the fourth storage node, the at least one data stripe unit and at least one second check stripe unit other than the first check stripe unit corresponding to the third storage node.

Based on this implementation, the data to be stored is written in degraded mode: when a storage node is in a sub-health state, the storage node can be avoided immediately, and only the data stripe units and / or check stripe units other than the stripe unit corresponding to that storage node are distributed, thereby achieving rapid switching in the sub-health state.

Optionally, after the sending, to the fourth storage node of the at least one storage node, at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node and the at least one check stripe unit, the method further includes:

When a read request is received, sending the read request to a fifth storage node, where the read request is used to instruct to read a data block of the first data stripe unit, and the cross-backup metadata stored by the fifth storage node includes metadata of the data block of the first data stripe unit.

Optionally, before the sending the read request to the fifth storage node, the method further includes:

Querying the cross-backup relationship between the storage nodes to determine the fifth storage node corresponding to the third storage node.
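
The routing of such a read request can be sketched as follows. The `BACKUP_HOLDER` table, the node names, and the `route_read` function are hypothetical; the sketch only illustrates looking up, in a cross-backup relationship, the node that holds the metadata backup of a node that was skipped during a degraded write.

```python
# Hypothetical cross-backup relationship, expressed as
# "node whose metadata is backed up" -> "node that holds the backup".
BACKUP_HOLDER = {"node1": "node3", "node2": "node1", "node3": "node2"}


def route_read(owner, skipped_nodes, backup_holder=BACKUP_HOLDER):
    """Pick the node that should receive a read request.

    owner is the node that would normally hold the requested stripe unit.
    If that node was skipped during a degraded write, the request goes to
    the node that carries the backup of its metadata instead.
    """
    if owner in skipped_nodes:
        return backup_holder[owner]
    return owner


# node3 plays the role of the third storage node (sub-healthy, skipped);
# node2 then plays the role of the fifth storage node.
target = route_read("node3", skipped_nodes={"node3"})
```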

Optionally, the sending, to the fourth storage node of the at least one storage node, at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node and the at least one check stripe unit includes at least one of the following steps:

Writing a sub-health mark of the third storage node to the at least one second data stripe unit, where the sub-health mark is used to indicate that the third storage node is in a sub-health state;

Sending a write request to the fourth storage node, where the write request carries a sub-health mark of the third storage node and a second data stripe unit corresponding to the fourth storage node.
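
A degraded write of this kind can be sketched as follows. The request layout (a dictionary with `target`, `unit`, and `sub_health_marks` keys), the node names, and the placeholder stripe units are assumptions made for illustration only.

```python
def plan_degraded_write(units_by_node, sub_healthy):
    """Build the write requests for one stripe, skipping sub-healthy nodes.

    units_by_node maps a storage node to the stripe unit intended for it.
    Every request that is sent carries the sub-health marks, so that the
    receiving nodes know which nodes were skipped.
    """
    requests = []
    for node, unit in units_by_node.items():
        if node in sub_healthy:
            continue  # the stripe unit of a sub-healthy node is not sent
        requests.append(
            {
                "target": node,
                "unit": unit,
                "sub_health_marks": sorted(sub_healthy),
            }
        )
    return requests


# Placeholder stripe units: three data stripe units and one check stripe unit.
units_by_node = {"node1": "D1", "node2": "D2", "node3": "D3", "node4": "P1"}
requests = plan_degraded_write(units_by_node, sub_healthy={"node3"})
```

Here `node3` plays the role of the third storage node, and each node that still receives a request plays the role of a fourth storage node.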

Optionally, after sending the corresponding data stripe unit and / or check stripe unit to the fourth storage node, the method further includes:

When a first write completion message sent by the at least one fourth storage node is received, sending a second write completion message to a target application or an external host.
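
The handling of write completion messages can be sketched as follows. The `WriteTracker` class is hypothetical, and the sketch assumes that the second write completion message is sent only once every storage node that was sent a stripe unit has returned a first write completion message; the present application does not fix that policy.

```python
class WriteTracker:
    """Track first write completion messages for one stripe write."""

    def __init__(self, expected_nodes):
        self.pending = set(expected_nodes)  # nodes that have not acknowledged
        self.completed = False

    def on_first_write_completion(self, node):
        """Record one acknowledgement.

        Returns True exactly once, when the last pending node has
        acknowledged and the second write completion message should be
        sent to the target application or external host.
        """
        self.pending.discard(node)
        if not self.pending and not self.completed:
            self.completed = True
            return True
        return False


# Only the nodes that were actually written to are expected to acknowledge.
tracker = WriteTracker(["node1", "node2", "node4"])
results = [tracker.on_first_write_completion(n) for n in ["node1", "node4", "node2"]]
```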

In a second aspect, a data recovery method is provided. The method includes:

Storing at least one data stripe unit, where each data stripe unit includes a data block and cross-backup metadata, and the cross-backup metadata includes metadata of the data block of the data stripe unit and metadata of data blocks included in data stripe units other than the data stripe unit;

Obtaining the missing metadata of a third storage node according to the at least one data stripe unit;

Sending the missing metadata to the third storage node.

The method provided in this embodiment introduces a mechanism for cross-backup of the metadata of data blocks in an EC stripe. By storing the data blocks and the cross-backup metadata together in the data stripe units, it ensures that different storage nodes store the metadata of each other's data blocks. Even if the metadata of a storage node is lost, because the metadata backup of that storage node is pre-stored in the data stripe units of other storage nodes, the other storage nodes can obtain the missing metadata from the data stripe units they store and synchronize it to that storage node. This reduces the probability of metadata loss, greatly improves the reliability and security of data storage, and thereby improves the storage performance of the distributed storage system.

Optionally, obtaining the missing metadata of the third storage node according to the at least one data stripe unit includes:

Selecting at least one second data stripe unit from the at least one data stripe unit, where the storage time of the second data stripe unit falls within a period when the third storage node is in a sub-health state;

Determining the missing metadata according to the cross-backup metadata in the at least one second data stripe unit.

Optionally, the selecting at least one second data stripe unit from the at least one data stripe unit includes at least one of the following steps:

Querying a sub-health log according to the identifier of the third storage node to obtain the at least one second data stripe unit, where the sub-health log is used to indicate the data stripe units stored while the third storage node is in a sub-health state;

Selecting a data stripe unit carrying a sub-health mark of the third storage node as the at least one second data stripe unit, where the sub-health mark is used to indicate that the third storage node is in a sub-health state.

Optionally, before querying the sub-health log according to the identifier of the third storage node, the method further includes:

Recording the sub-health log for the third storage node while the third storage node is in a sub-health state.

Based on this optional method, if a storage node is missing metadata because of a brief sub-health episode, then when the storage node enters a sub-health recovery state, the other storage nodes can use the metadata differences recorded in their locally recorded sub-health logs to synchronize the missing metadata to the storage node that has recovered from sub-health. In this way, the storage node automatically recovers its missing metadata after recovering from the sub-health state, which reduces the impact of the sub-health state of the storage node on the performance of the storage system and ensures high stability and reliability of the distributed storage system.

Optionally, when the third storage node is in a sub-health state, recording the sub-health log for the third storage node includes at least one of the following steps:

When a received write request carries a sub-health mark of the third storage node and a data stripe unit, writing a sub-health record corresponding to the third storage node to the sub-health log;

When a received data stripe unit carries the sub-health mark of the third storage node, writing a sub-health record corresponding to the third storage node to the sub-health log.

Optionally, before obtaining the missing metadata of the third storage node according to the at least one data stripe unit, the method further includes:

Receiving a sub-health recovery message, where the sub-health recovery message is used to indicate that the third storage node is in a sub-health recovery state.
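
The recovery flow of the second aspect can be sketched from the point of view of one healthy storage node. The `StorageNode` class, its method names, the stripe unit identifiers, and the metadata dictionaries are hypothetical and serve only to illustrate recording a sub-health log on write and replaying it when a sub-health recovery message arrives.

```python
class StorageNode:
    def __init__(self, name):
        self.name = name
        # Stored data stripe units: unit id -> cross-backup metadata,
        # itself a mapping of node name -> metadata of that node's block.
        self.units = {}
        # Sub-health log: sub-healthy node -> ids of units stored meanwhile.
        self.sub_health_log = {}

    def handle_write(self, unit_id, cross_backup, sub_health_marks=()):
        self.units[unit_id] = cross_backup
        # A write that carries a sub-health mark adds a sub-health record.
        for marked in sub_health_marks:
            self.sub_health_log.setdefault(marked, []).append(unit_id)

    def on_sub_health_recovery(self, recovered):
        """Return the metadata the recovered node missed, keyed by unit id."""
        missing = {}
        for unit_id in self.sub_health_log.pop(recovered, []):
            backup = self.units[unit_id].get(recovered)
            if backup is not None:
                missing[unit_id] = backup
        return missing


node2 = StorageNode("node2")
# Written while node3 was healthy: no sub-health record is needed.
node2.handle_write("s1-u2", {"node2": {"len": 4}, "node3": {"len": 8}})
# Written while node3 was sub-healthy: the request carries its mark.
node2.handle_write(
    "s2-u2", {"node2": {"len": 2}, "node3": {"len": 6}}, sub_health_marks=["node3"]
)
missing = node2.on_sub_health_recovery("node3")
```

Only the stripe unit written during the sub-health period is consulted, so the amount of metadata to synchronize is limited to what the recovered node actually missed.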

According to a third aspect, a client node is provided to execute the foregoing data storage method. Specifically, the client node includes function modules for executing the data storage method described in the first aspect or any optional manner of the first aspect.

According to a fourth aspect, a storage node is provided for performing the foregoing data recovery method. Specifically, the storage node includes function modules for executing the data recovery method described in the second aspect or any optional manner of the second aspect.

According to a fifth aspect, a client node is provided. The client node includes a processor and a memory. The memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data storage method described in the first aspect or any optional manner of the first aspect.

According to a sixth aspect, a storage node is provided. The storage node includes a processor and a memory. The memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data recovery method described in the second aspect or any optional manner of the second aspect.

According to a seventh aspect, a non-transitory computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data storage method described in the first aspect or any optional manner of the first aspect.

According to an eighth aspect, a non-transitory computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data recovery method described in the second aspect or any optional manner of the second aspect.

According to a ninth aspect, a computer program product containing instructions is provided, which, when run on a client node, enables the client node to implement the operations performed by the data storage method described in the first aspect or any optional manner of the first aspect.

According to a tenth aspect, a computer program product containing instructions is provided, which, when run on a storage node, enables the storage node to implement the operations performed by the data recovery method described in the second aspect or any optional manner of the second aspect.

According to an eleventh aspect, a data storage system is provided. In a possible implementation manner, the system includes:

The client node according to the third aspect and the storage node according to the fourth aspect.

In another possible implementation manner, the system includes:

The client node according to the fifth aspect and the storage node according to the sixth aspect.

According to a twelfth aspect, a chip is provided, where the chip includes a processor and / or program instructions, and when the chip runs, the operations performed by the data storage method described in the first aspect or any optional manner of the first aspect are implemented.

According to a thirteenth aspect, a chip is provided, where the chip includes a processor and / or program instructions, and when the chip runs, the operations performed by the data recovery method described in the second aspect or any optional manner of the second aspect are implemented.

Brief Description of the Drawings

FIG. 1 is a system architecture diagram of a distributed storage system according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a client node according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a storage node according to an embodiment of the present application;

FIG. 4 is a flowchart of a data storage method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of data storage according to an embodiment of the present application;

FIG. 6 is a flowchart of a data storage method according to an embodiment of the present application;

FIG. 7 is a schematic diagram of data storage according to an embodiment of the present application;

FIG. 8 is a flowchart of a data recovery method according to an embodiment of the present application;

FIG. 9 is a schematic diagram of data storage according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a client node according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a storage node according to an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The following explains the terms involved in this application:

Cross-backup: a technology for backing up data between different storage nodes. For example, when cross-backup is performed among node A, node B, and node C, node A can store the data of node A and a backup of the data of node B, node B can store the data of node B and a backup of the data of node C, and node C can store the data of node C and a backup of the data of node A. With cross-backup, even if a node loses its data, the data backup can still be read from another node to restore the lost data.
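
The example of nodes A, B, and C can be written out as a short sketch. The function names and the dictionary layout are illustrative assumptions; the ring arrangement (each node backs up the next one) simply mirrors the example given in the definition.

```python
def cross_backup(data_by_node):
    """Arrange nodes in a ring so that each node also stores a backup of
    the next node's data (A backs up B, B backs up C, C backs up A)."""
    nodes = list(data_by_node)
    stored = {}
    for i, node in enumerate(nodes):
        neighbour = nodes[(i + 1) % len(nodes)]
        stored[node] = {
            "own": data_by_node[node],
            "backup_of": neighbour,
            "backup": data_by_node[neighbour],
        }
    return stored


def restore(stored, lost_node):
    """Read the data of lost_node back from whichever node backs it up."""
    for node, contents in stored.items():
        if node != lost_node and contents["backup_of"] == lost_node:
            return contents["backup"]
    raise LookupError("no backup found for " + lost_node)


stored = cross_backup({"A": "data-A", "B": "data-B", "C": "data-C"})
restored_a = restore(stored, "A")  # read from node C, which backs up node A
```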

Three-copy technology: a technology for redundant storage of data. It replicates the data into three copies and stores the three copies on three storage nodes respectively. With three-copy technology, when one copy is lost, it can be restored from one of the other copies. Because each piece of data occupies three times its size in storage space, the disk utilization is 1/3.

EC technology: a technology for redundant storage of data. The original data is encoded by an erasure coding algorithm to obtain redundant check blocks, and the data blocks and check blocks are stored on different storage nodes. Specifically, the data to be stored is divided into m data blocks, and a redundancy algorithm is used to perform EC coding on the m data blocks to generate k check blocks. The m data blocks and the k check blocks form an EC stripe. Each data block or check block can be called an EC block of the EC stripe, and the EC blocks can be distributed to different storage nodes for storage. Each EC stripe can tolerate the loss of up to k EC blocks: if storage nodes fail, as long as the number of failed storage nodes does not exceed k, the EC blocks stored on the failed storage nodes can be recovered from the EC blocks on the non-faulty storage nodes. Distributed storage systems that use EC technology to store data therefore have high reliability. In addition, storing data with EC technology saves considerable storage space compared with three-copy technology: three-copy technology requires three times the storage space to store one copy of data, whereas EC technology can require only 1.4 times the storage space, depending on the values of m and k.
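
The storage figures above follow from simple arithmetic: an EC stripe stores m + k blocks for every m blocks of data, so its space overhead is (m + k) / m. The sketch below computes this; the function names are illustrative, and the choice of m = 10 and k = 4 is an assumption, being one configuration that yields the 1.4 times figure, since the text does not state which values of m and k it has in mind.

```python
from fractions import Fraction


def ec_overhead(m, k):
    """Storage space used per unit of data for m data + k check blocks."""
    return Fraction(m + k, m)


def replica_overhead(copies=3):
    """Storage space used per unit of data when keeping full copies."""
    return Fraction(copies, 1)


def recoverable(lost_blocks, k):
    """An EC stripe can be decoded as long as at most k blocks are lost."""
    return lost_blocks <= k


overhead = ec_overhead(10, 4)             # 14 blocks stored for 10 of data
utilization = 1 / replica_overhead(3)     # disk utilization of three copies
```

Under that assumed configuration, the stripe tolerates the loss of up to four blocks while using less than half the space of three full copies.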

FIG. 1 is a system architecture diagram of a distributed storage system according to an embodiment of the present application. The distributed storage system includes a client node, at least one storage node, a metadata controller (MDC) node, a cloud server (Elastic Compute Service, hereinafter referred to as ECS) service node, and a cloud disk backup (Volume Backup Service, hereinafter referred to as VBS) node. The system shown in FIG. 1 can provide object storage services for customers.

Client nodes are also called storage clients. A client node can interact with upper-layer applications or external hosts, receive data from them, and distribute the data to storage nodes for storage. The client node can be a server, a personal computer (hereinafter referred to as PC), a notebook computer, or the like. The client node can be an independent device, one or more program modules on a device, or a virtual machine or container running on a device. The client node can also be a cluster of multiple devices; for example, it can be a collective name for multiple program modules distributed on multiple devices.

The storage node may store data stripe units and / or check stripe units, receive read requests and / or write requests, and access locally stored data and metadata. The storage node may be an Object-based Storage Device (OSD) node, a Network Attached Storage (NAS) node, a Storage Area Network (SAN) node, or the like. Storage nodes can be servers, PCs, laptops, or the like. The storage methods of the storage nodes include, but are not limited to, object storage, block storage, and file storage. Storage nodes can be physical storage nodes, or logical storage nodes obtained by logically dividing physical storage nodes.

The MDC service node can be used to maintain a partition view, which can include the mapping between partitions and storage nodes and the current state of each storage node. When the partition view changes, the changed partition view can be synchronized to the client node and to each storage node. The MDC service node can be a server, a PC, a laptop, or the like. The MDC service node can be an independent device, one or more program modules on a device, or a virtual machine or container running on a device. It can also be a cluster of multiple devices; for example, it can be a collective name for multiple program modules distributed on multiple devices.

The ECS service node is used to allocate a blank EC stripe and send the blank EC stripe to the client node, so that the client node writes data to the blank EC stripe. The ECS service node can be an independent device, one or more program modules on a device, or a virtual machine or container running on a device. It can also be a cluster of multiple devices; for example, it can be a collective name for multiple program modules distributed on multiple devices.

The VBS node can provide a virtual hard disk function to an upper-layer application or an external host, and can receive a read request or a write request from the upper-layer application or external host. The VBS node can be an independent device, one or more program modules on a device, or a virtual machine or container running on a device. It can also be a cluster of multiple devices; for example, it can be a collective name for multiple program modules distributed on multiple devices.

FIG. 2 is a schematic structural diagram of a client node according to an embodiment of the present application. The client node 200 may vary considerably depending on configuration or performance, and may include one or more processors (Central Processing Units, hereinafter referred to as CPU) 201 and one or more memories 202, where the memory 202 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 201 to implement the data storage method in each method embodiment described below. Of course, the client node 200 may further have components such as a wired or wireless network interface and an input/output interface. The client node 200 may further include other components for implementing device functions, and details are not described herein.

In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as a memory including instructions, and the foregoing instructions can be executed by a processor in a client node to complete the data storage method in the following embodiments. For example, the computer-readable storage medium may be a read-only memory (hereinafter referred to as ROM), a random access memory (hereinafter referred to as RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

FIG. 3 is a schematic structural diagram of a storage node according to an embodiment of the present application. The storage node 300 may vary considerably depending on configuration or performance, and may include one or more processors (Central Processing Units, hereinafter referred to as CPU) 301 and one or more memories 302, where the one or more memories 302 may be hard disks mounted on the storage node, and the hard disks may be logical virtual hard disks or physical hard disks. The one or more memories 302 store at least one instruction, and the at least one instruction is loaded and executed by the processor 301 to implement the data recovery method in each method embodiment described below. Of course, the storage node 300 may also have components such as a wired or wireless network interface and an input/output interface. The storage node 300 may further include other components for implementing device functions, and details are not described herein.

In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as a memory including instructions, and the foregoing instructions may be executed by a processor in a storage node to complete the data recovery method in the following embodiments. For example, the computer-readable storage medium may be a read-only memory (hereinafter referred to as ROM), a random access memory (hereinafter referred to as RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

FIG. 4 is a flowchart of a data storage method provided by an embodiment of the present application. As shown in FIG. 4, the method is performed through interaction between a client node and at least one storage node, and includes the following steps:

401. A client node obtains at least one blank EC stripe, and each blank EC stripe is used to carry at least one data stripe unit and at least one check stripe unit.

In a distributed storage system, a client node can interact with upper-layer applications or external hosts, receive data from them, and distribute the data to the storage nodes for storage. The client node can be a server, a personal computer, a laptop, or the like. The client node can be an independent device, one or more program modules on a device, or a virtual machine or container running on a device. The client node can also be a cluster of multiple devices; for example, it can be a collective name for multiple program modules distributed on multiple devices. This embodiment does not limit the physical form of the client node.

The client node can obtain at least one blank EC stripe, so that when data is stored, data blocks, check blocks, and metadata are written to the blank EC stripe. Each blank EC stripe can be regarded as a row of blank data blocks. Each blank EC stripe can carry a stripe identifier and at least one stripe unit identifier. The stripe identifier is used to identify the corresponding EC stripe and can be the identification (hereinafter referred to as ID) of the EC stripe, for example, the number or name of the EC stripe. The stripe unit identifier is used to identify the corresponding stripe unit and can be the ID of the stripe unit, for example, the number or name of the stripe unit.

Regarding the manner of obtaining the blank EC stripe, optionally, the ECS service may allocate at least one blank EC stripe to the client node and send the allocated blank EC stripe to the client node; the client node receives the at least one blank EC stripe from the ECS service, thereby obtaining it. Optionally, the distributed storage system may include multiple client nodes, and the ECS service may allocate corresponding blank EC stripes to each client node and send them to that client node. The stripe identifiers of the blank EC stripes allocated to different client nodes may be different. For example, blank EC stripe 1 to blank EC stripe 100 are allocated to client node 1, and blank EC stripe 101 to blank EC stripe 200 are allocated to client node 2.
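The allocation of disjoint stripe identifier ranges can be sketched as follows. This is only an illustrative sketch: the names `BlankStripe` and `allocate_blank_stripes`, the contiguous numbering, and the default of five stripe units per stripe are assumptions made for the example, not details specified by this embodiment.

```python
from dataclasses import dataclass

@dataclass
class BlankStripe:
    """Hypothetical record for a blank EC stripe."""
    stripe_id: int   # stripe identifier
    unit_ids: list   # stripe unit identifiers carried by the stripe

def allocate_blank_stripes(client_ids, stripes_per_client, units_per_stripe=5):
    """Give each client node a disjoint, contiguous range of stripe IDs."""
    allocation = {}
    next_id = 1
    for client in client_ids:
        allocation[client] = [
            BlankStripe(sid, list(range(1, units_per_stripe + 1)))
            for sid in range(next_id, next_id + stripes_per_client)
        ]
        next_id += stripes_per_client
    return allocation

# Client node 1 receives stripes 1..100, client node 2 receives 101..200.
alloc = allocate_blank_stripes(["client-1", "client-2"], 100)
```

Because the ranges never overlap, two client nodes cannot write to the same blank EC stripe in this sketch.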

It should be noted that step 401 is only an optional step, not a mandatory step. This embodiment does not limit whether step 401 is performed. Optionally, the client node may not need to perform the step of obtaining a blank EC stripe. For example, at least one blank EC stripe may be stored in advance, and data may be written into the pre-stored blank EC stripe when storing data. For another example, it is not necessary to use a blank EC stripe to store data; instead, when storing data, an old EC stripe to which data has already been written is overwritten with the data to be stored, so that the data is stored by reusing the old EC stripe.

402. The client node obtains data to be stored and a target storage location of the data.

The data to be stored may be IO data, that is, data input to and/or output from the distributed storage system. The target storage location refers to the location where the data needs to be stored. By obtaining the target storage location of the data, it can be determined where to store the data, so that when the data is read later, the stored data can be read from the target storage location. Optionally, the storage space of the distributed storage system can be used as one or more logical volumes. Each logical volume can be divided into multiple logical blocks, and one or more logical blocks can be used to store data. The target storage location can be a logical block address (Logical Block Address, hereinafter abbreviated as LBA), which may include a starting LBA and a data length. In addition, the target storage location may also be an identifier of a logical block, for example, a key of the logical block.

Regarding the process of obtaining the data to be stored and the target storage location, optionally, the client node may receive a write request, parse the write request, and obtain the data and the target storage location carried by the write request. The write request may be generated by an upper-layer application, an external host, or a VBS node and sent to the client node, so that the client node receives the write request from the upper-layer application, external host, or VBS node. Optionally, the write request may be triggered by a user's input operation.

Optionally, the client node may divide the data to be stored to obtain at least one data block to be stored. The data block refers to data obtained by dividing the data with the granularity of the block. For example, the data block may be GRAIN data 1, GRAIN data 2 or GRAIN data 3 in FIG. 5.
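A minimal sketch of dividing the data at block granularity is shown below. The function name `split_into_blocks`, the fixed block size, and the zero-padding of the last block are assumptions for illustration; this embodiment does not specify how a short trailing block is handled.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Divide data into fixed-size blocks; the last block is zero-padded (assumption)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1] = blocks[-1].ljust(block_size, b"\x00")
    return blocks

# Ten bytes at a granularity of four bytes yield three data blocks.
blocks = split_into_blocks(b"abcdefghij", 4)
```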

It should be noted that step 402 is only an optional step, not a mandatory step. In a possible embodiment, the client node may store at least one data block to be stored in advance, and may directly perform the subsequent step 403 on the at least one data block.

403. The client node generates at least one data stripe unit according to the at least one data block to be stored, and each data stripe unit includes a data block and cross-backup metadata.

Metadata of a data block refers to data used to describe the data block, and the corresponding data block can be located through the metadata. The metadata of the data block may be the mapping relationship between the target storage location and the stripe unit identifier, for example, the mapping relationship between the LBA and the stripe unit identifier. For example, referring to FIG. 5, the metadata may be GRAIN metadata 1, GRAIN metadata 2, or GRAIN metadata 3 in FIG. 5, where GRAIN metadata 1 is the metadata of GRAIN data 1, and its content can be the mapping relationship between stripe unit 1 and LBA1; GRAIN metadata 2 is the metadata of GRAIN data 2, and its content can be the mapping relationship between stripe unit 2 and LBA2; and so on.

Stripe unit: also called an EC stripe unit, it is a component of an EC stripe. The EC stripe and the stripe unit have a whole-and-part relationship. Each EC stripe may include at least one stripe unit, for example, 5 stripe units or 7 stripe units. An EC stripe can be regarded as a row of data blocks, and a stripe unit can be regarded as a column of data blocks in the EC stripe. For example, FIG. 5 provides an example of an EC stripe that includes 5 stripe units, namely stripe unit 1, stripe unit 2, stripe unit 3, stripe unit 4, and stripe unit 5.

The stripe unit is used to store data in the EC stripe. According to the difference in the stored data, in this embodiment, a stripe unit in the EC stripe that is used to store a data block and cross-backup metadata is called a data stripe unit, and a stripe unit in the EC stripe that is used to store at least one of a check block and a metadata check block is called a check stripe unit. The EC stripe may include at least one data stripe unit and at least one check stripe unit, and the number of data stripe units and check stripe units in the EC stripe can be determined according to the EC encoding algorithm. For example, an EC stripe can include 3 data stripe units and 2 check stripe units.

The cross-backup metadata includes the metadata of the data block of the data stripe unit itself and the metadata of the data blocks of data stripe units other than that data stripe unit. That is, for any data stripe unit, the cross-backup metadata in the data stripe unit includes the metadata of the data block stored in the data stripe unit itself, and also includes the metadata of the data blocks stored by one or more other data stripe units.

For example, referring to FIG. 5, the cross-backup metadata may be the part in the bold box in FIG. 5. The data block stored by data stripe unit 1 is GRAIN data 1, and its cross-backup metadata includes GRAIN metadata 1 (metadata of GRAIN data 1), GRAIN metadata 2 (metadata of GRAIN data 2), and GRAIN metadata 3 (metadata of GRAIN data 3). Data stripe unit 1 therefore stores not only the metadata of the data block stored by data stripe unit 1 itself (GRAIN metadata 1), but also the metadata of the data block stored by data stripe unit 2 (GRAIN metadata 2) and the metadata of the data block stored by data stripe unit 3 (GRAIN metadata 3). Similarly, the data block stored by data stripe unit 2 is GRAIN data 2, and its cross-backup metadata includes GRAIN metadata 2 (metadata of GRAIN data 2), GRAIN metadata 1 (metadata of GRAIN data 1), and GRAIN metadata 3 (metadata of GRAIN data 3). Data stripe unit 2 therefore stores not only the metadata of the data block stored by data stripe unit 2 itself (GRAIN metadata 2), but also the metadata of the data block stored by data stripe unit 1 (GRAIN metadata 1) and the metadata of the data block stored by data stripe unit 3 (GRAIN metadata 3).

Optionally, the cross-backup metadata can be of either of the following two types, (1) and (2):

(1) The cross-backup metadata may include the metadata of all data blocks to be stored. Accordingly, the cross-backup metadata of every data stripe unit may be the same, and each data stripe unit stores the metadata of the data blocks in all data stripe units. For example, if the data is divided into m data blocks, the cross-backup metadata may be the metadata of the m data blocks, that is, each data stripe unit stores the metadata of m data blocks.

Exemplarily, referring to FIG. 5, assuming that the data is divided into three data blocks, the cross-backup metadata in data stripe unit 1, data stripe unit 2, and data stripe unit 3 may all be the same, namely the metadata of these three data blocks.

Based on (1), each data stripe unit stores the metadata of the data blocks of all the data stripe units. If the metadata of the data block of a certain data stripe unit is lost, the lost metadata can be read and recovered from any of the remaining data stripe units. This improves the reliability and security of metadata storage and the availability of the distributed storage system.

(2) The cross-backup metadata may include the metadata of only some of the data blocks to be stored. Accordingly, each data stripe unit may store the metadata of the data blocks in some of the data stripe units. For example, assuming that the data is divided into m data blocks, the cross-backup metadata may include the metadata of at least two data blocks, that is, each data stripe unit stores the metadata of at least two data blocks.

Optionally, the cross-backup metadata may include the metadata of the data blocks in one or more adjacent data stripe units, for example, the metadata of the data block of the previous data stripe unit and the metadata of the data block of the next data stripe unit. Exemplarily, the cross-backup metadata of the i-th data stripe unit may include the metadata of the data block of the i-th data stripe unit, the metadata of the data block of the (i-1)-th data stripe unit, and the metadata of the data block of the (i+1)-th data stripe unit.

Based on (2), each data stripe unit can store metadata of data blocks of some data stripe units, which can save storage space of a distributed storage system.

It should be noted that the cross-backup metadata in the data stripe unit can be designed according to requirements as described in (1) or as described in (2). In addition, how many data blocks' metadata the cross-backup metadata includes, and which data blocks' metadata it includes, can be designed according to actual requirements, which is not limited in this embodiment.
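The two types of cross-backup metadata can be compared with a small sketch that computes, for each data stripe unit, which data blocks' metadata it would hold. The function names are hypothetical, and the wrap-around at the first and last data stripe unit in the neighbor variant is an assumption; this embodiment does not state how the boundary units are treated.

```python
def full_cross_backup(m):
    """Type (1): every data stripe unit holds the metadata of all m data blocks."""
    return {i: list(range(1, m + 1)) for i in range(1, m + 1)}

def neighbor_cross_backup(m):
    """Type (2): unit i holds metadata of units i, i-1 and i+1.

    Indices wrap around at the ends (an assumption made for this sketch).
    """
    layout = {}
    for i in range(1, m + 1):
        previous_unit = (i - 2) % m + 1
        next_unit = i % m + 1
        held = []
        for unit in (i, previous_unit, next_unit):
            if unit not in held:      # avoid duplicates when m is small
                held.append(unit)
        layout[i] = held
    return layout

full_layout = full_cross_backup(3)
neighbor_layout = neighbor_cross_backup(5)
```

With m = 5, type (1) stores 5 metadata entries per data stripe unit while the neighbor variant of type (2) stores 3, which illustrates the storage saving described for (2).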

A data stripe unit refers to a stripe unit in the EC stripe that is used to store a data block and cross-backup metadata. It can be an EC block in an EC stripe, and each data stripe unit can include a data block. Exemplarily, referring to FIG. 5, the data stripe unit may be stripe unit 1, stripe unit 2, or stripe unit 3. Each EC stripe may include at least one data stripe unit, and the number of data stripe units in an EC stripe may be equal to the number of data blocks. For example, assuming that the data to be stored is divided into m data blocks, the EC stripe of the data may include m data stripe units.

Optionally, for any data stripe unit, the cross-backup metadata in the data stripe unit may consist of the metadata of the data block in the data stripe unit and at least one target metadata backup, where the at least one target metadata backup refers to a metadata backup of the data block of a data stripe unit other than that data stripe unit. Specifically, the process of generating any data stripe unit may include the following steps one to three:

Step 1: Back up metadata of at least one data block to obtain at least one metadata backup, and the at least one data block corresponds to the at least one metadata backup in a one-to-one manner.

For each data block in the at least one data block, the metadata of the data block can be determined and backed up to obtain a metadata backup of that data block. By backing up the metadata of every data block in this way, at least one metadata backup corresponding to the at least one data block can be obtained. For example, referring to FIG. 5, the metadata of GRAIN data 1 is backed up as GRAIN metadata 1, and the metadata of GRAIN data 2 is backed up as GRAIN metadata 2.

Regarding the process of determining the metadata of the data block, the mapping relationship between the target storage location and the stripe unit ID can be recorded, according to the target storage location corresponding to each data stripe unit and the stripe unit ID of each data stripe unit, as the metadata of the data block.
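As an illustration of metadata that maps a target storage location to a stripe unit ID, consider the following sketch. The record layout `GrainMetadata` (starting LBA, length, stripe unit ID) and the helper names are assumptions for the example only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GrainMetadata:
    """Hypothetical metadata record: target storage location -> stripe unit ID."""
    lba: int              # starting LBA of the data block
    length: int           # data length
    stripe_unit_id: int   # stripe unit that stores the data block

def build_metadata(placements):
    """placements: iterable of (lba, length, stripe_unit_id) tuples."""
    return [GrainMetadata(*placement) for placement in placements]

def find_stripe_unit(metadata, lba):
    """Use the metadata as an index: return the stripe unit holding this LBA."""
    for record in metadata:
        if record.lba <= lba < record.lba + record.length:
            return record.stripe_unit_id
    return None

metadata = build_metadata([(0, 8, 1), (8, 8, 2), (16, 8, 3)])
unit_for_lba_10 = find_stripe_unit(metadata, 10)
```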

Step 2: For the data block of the data stripe unit, select, from the at least one metadata backup, at least one target metadata backup corresponding to the data block.

Optionally, this step may specifically include the following steps (2.1) to (2.2):

Step (2.1): Query the cross-backup relationship between the storage nodes according to the first storage node corresponding to the data block to obtain at least one second storage node corresponding to the first storage node.

For ease of description, any storage node of the distributed storage system is referred to as a first storage node, and a storage node having a cross-backup relationship with the first storage node is referred to as a second storage node. It should be noted that the terms “first storage node” and “second storage node” are only used to distinguish different storage nodes, and should not be understood as expressing or implying the order of the storage nodes, their relative importance, or the total number of storage nodes.

A cross-backup relationship refers to a relationship in which storage nodes store metadata backups for one another. A cross-backup relationship can have at least two functions. First, when storing data, the cross-backup relationship can determine which storage nodes' metadata backups are stored by any given storage node, so that the cross-backup metadata that each storage node needs to store is determined according to the cross-backup relationship. For example, if the cross-backup relationship instructs OSD node 1 to store the metadata backups of OSD node 2 and OSD node 3, then when storing data, the data stripe unit of OSD node 1 can store the metadata backups of OSD node 2 and OSD node 3. Second, when restoring data, the cross-backup relationship can determine which storage nodes any given metadata backup is stored in, so that the metadata backup is read, according to the cross-backup relationship, from the cross-backup metadata stored by the corresponding storage node. For example, if the cross-backup relationship indicates that the metadata backup of OSD node 1 is stored in OSD node 2 and OSD node 3, then when the metadata of OSD node 1 is lost, the metadata backup of OSD node 1 can be read from OSD node 2 and OSD node 3.

The data form of the cross-backup relationship can be designed according to actual needs. For example, the cross-backup relationship can include one or more of the following (1) and (2):

(1) The cross-backup relationship may be a mapping relationship between the identifiers of storage nodes, and may include the identifiers of multiple storage nodes. Correspondingly, the process of determining at least one second storage node may include: using the identifier of the first storage node as an index, querying the cross-backup relationship to obtain the identifier of the at least one second storage node, and determining the at least one second storage node according to that identifier. The storage node identifier is used to identify the corresponding storage node, and may include the ID, name, or number of the storage node.

(2) The cross-backup relationship may be a mapping relationship between stripe unit identifiers, and may include the stripe unit identifiers of multiple data stripe units. Taking the data stripe unit corresponding to the first storage node as the first data stripe unit, and the data stripe unit corresponding to the second storage node as the second data stripe unit, the process of determining at least one second storage node may include: determining the stripe unit identifier of the first data stripe unit according to the first storage node; using that stripe unit identifier as an index to query the cross-backup relationship and obtain the stripe unit identifier of at least one second data stripe unit; determining the identifier of the storage node corresponding to the stripe unit identifier of the at least one second data stripe unit, thereby obtaining the identifier of the at least one second storage node; and determining the at least one second storage node according to that identifier.

Optionally, the MDC node may record the cross-backup relationship and send the cross-backup relationship to the client node. For example, the MDC node can record the cross-backup relationship in the partitioned view and send the partitioned view to the client node. The client node can query the partitioned view to obtain the cross-backup relationship in the partitioned view.

Step (2.2): Determine the metadata backup of the data block corresponding to the at least one second storage node as the at least one target metadata backup.

After the at least one second storage node is determined, the metadata backup of the data block corresponding to the at least one second storage node may be selected from the generated at least one metadata backup as the at least one target metadata backup. For example, assuming that the first storage node is OSD node 1, and that, according to the cross-backup relationship between the storage nodes, the OSD nodes corresponding to OSD node 1 are OSD node 2 and OSD node 3, the metadata backups of the data blocks of OSD node 2 and OSD node 3, namely GRAIN metadata 2 and GRAIN metadata 3, are determined and used as the target metadata backups of OSD node 1.
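Steps (2.1) and (2.2) can be sketched as a dictionary lookup followed by a selection. The sketch assumes the form (1) of the cross-backup relationship, a mapping between storage node identifiers; the identifiers such as `osd-1` and the function name `select_target_backups` are hypothetical.

```python
# Cross-backup relationship: node -> nodes whose metadata backups it stores.
CROSS_BACKUP = {
    "osd-1": ["osd-2", "osd-3"],
    "osd-2": ["osd-1", "osd-3"],
    "osd-3": ["osd-1", "osd-2"],
}

def select_target_backups(first_node, cross_backup, backups_by_node):
    """Return the metadata backups that first_node must store for its peers."""
    # Step (2.1): query the relationship with the first node's identifier.
    second_nodes = cross_backup.get(first_node, [])
    # Step (2.2): pick the backups of the data blocks held by those nodes.
    return [backups_by_node[node] for node in second_nodes if node in backups_by_node]

backups = {
    "osd-1": "GRAIN metadata 1",
    "osd-2": "GRAIN metadata 2",
    "osd-3": "GRAIN metadata 3",
}
targets = select_target_backups("osd-1", CROSS_BACKUP, backups)
```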

It should be noted that selecting the target metadata backup according to the cross-backup relationship is only an optional method, not a mandatory one. Alternatively, the target metadata backup may be selected in other ways. For example, for any data stripe unit, the metadata backups of all data blocks can be used as the target metadata backups, so that the metadata of the data blocks of all data stripe units is stored in that data stripe unit.

Step 3: Generate a data stripe unit according to the data block, the metadata of the data block, and at least one target metadata backup.

Specifically, a data block, the metadata of the data block, and at least one target metadata backup may be written to any stripe unit in the EC stripe, and the stripe unit after writing is used as a data stripe unit. The data stripe unit thus carries the data block, the metadata of the data block, and the at least one target metadata backup, where the metadata of the data block together with the at least one target metadata backup constitutes the cross-backup metadata.

Optionally, the data block, the metadata of the data block, and the at least one target metadata backup may be written to any blank stripe unit in any blank EC stripe, thereby generating a data stripe unit from the blank stripe unit. It is also possible to overwrite any stripe unit in an EC stripe to which data has already been written, so as to generate a data stripe unit from that stripe unit, which is not limited in this embodiment.

In summary, the above describes how one data stripe unit in an EC stripe is generated. By repeating the above steps, every data stripe unit in an EC stripe can be generated. An EC stripe can be regarded as a row of data blocks, and each data stripe unit can be regarded as a column of data blocks in this row. For example, referring to FIG. 5, suppose the data is divided into three data blocks, namely GRAIN data 1, GRAIN data 2, and GRAIN data 3. Three data stripe units can then be generated, namely data stripe unit 1, data stripe unit 2, and data stripe unit 3, where data stripe unit 1 carries GRAIN data 1 and cross-backup metadata 1, data stripe unit 2 carries GRAIN data 2 and cross-backup metadata 2, and data stripe unit 3 carries GRAIN data 3 and cross-backup metadata 3.
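Steps one to three can be combined into the following sketch, which builds every data stripe unit of one EC stripe. The class `DataStripeUnit`, the dictionary layout of a metadata entry, and the convention of placing a unit's own metadata first are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class DataStripeUnit:
    """Hypothetical in-memory form of a data stripe unit."""
    unit_id: int
    data_block: bytes
    cross_backup_metadata: list   # own metadata first, then target backups

def generate_data_stripe_units(blocks, lbas, cross_backup):
    """cross_backup maps a unit ID to the unit IDs whose metadata it backs up."""
    # Step 1: determine and back up the metadata of every data block.
    metadata = {
        i + 1: {"lba": lba, "stripe_unit_id": i + 1} for i, lba in enumerate(lbas)
    }
    units = []
    for unit_id, block in enumerate(blocks, start=1):
        # Step 2: select the target metadata backups for this unit.
        targets = [dict(metadata[other]) for other in cross_backup[unit_id]]
        # Step 3: write the block, its own metadata and the backups into the unit.
        units.append(DataStripeUnit(unit_id, block, [metadata[unit_id]] + targets))
    return units

units = generate_data_stripe_units(
    [b"GRAIN data 1", b"GRAIN data 2", b"GRAIN data 3"],
    [0, 8, 16],
    {1: [2, 3], 2: [1, 3], 3: [1, 2]},
)
```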

In this embodiment, the metadata of the data blocks is cross-backed up. After the cross-backup metadata is written to each data stripe unit, different data stripe units store the metadata of one another's data blocks. After the data stripe units are distributed to the storage nodes, different storage nodes store the metadata of one another's data blocks, so even if a storage node loses the metadata of its data block, the lost metadata can still be read and restored from other storage nodes. This improves the reliability and security of metadata storage, thereby ensuring the high availability and high reliability of the distributed storage system.

For example, referring to FIG. 5, OSD node 1, OSD node 2, and OSD node 3 store the metadata of one another's data blocks. Even if OSD node 1 loses GRAIN metadata 1, since OSD node 2 and OSD node 3 have stored GRAIN metadata 1 in advance, GRAIN metadata 1 can still be read and restored from OSD node 2 and OSD node 3. Similarly, even if OSD node 2 loses GRAIN metadata 2, since OSD node 1 and OSD node 3 have stored GRAIN metadata 2 in advance, GRAIN metadata 2 can still be read and restored from OSD node 1 and OSD node 3.
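The read-back of a lost metadata entry from a peer can be sketched as follows. The function `recover_metadata`, the node identifiers, and the shape of `node_stores` (each reachable node mapped to the backups it holds) are assumptions for illustration only.

```python
def recover_metadata(lost_node, cross_backup, node_stores):
    """Return the first surviving backup of lost_node's metadata, or None.

    cross_backup maps a node to the nodes whose metadata backups it stores.
    node_stores maps each reachable node to {owner_node: metadata_backup}.
    """
    for holder, owners in cross_backup.items():
        if holder == lost_node or lost_node not in owners:
            continue
        backup = node_stores.get(holder, {}).get(lost_node)
        if backup is not None:
            return backup
    return None

cross_backup = {
    "osd-1": ["osd-2", "osd-3"],
    "osd-2": ["osd-1", "osd-3"],
    "osd-3": ["osd-1", "osd-2"],
}
# OSD node 1 has lost its metadata; only its peers are consulted.
node_stores = {
    "osd-2": {"osd-1": "GRAIN metadata 1", "osd-3": "GRAIN metadata 3"},
    "osd-3": {"osd-1": "GRAIN metadata 1", "osd-2": "GRAIN metadata 2"},
}
restored = recover_metadata("osd-1", cross_backup, node_stores)
```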

404. The client node performs EC coding on at least one data stripe unit to obtain at least one check stripe unit.

The check stripe unit is used to recover the data stripe unit, which can ensure the reliability and security of the data stripe unit. Specifically, if one or more data stripe units are lost, as long as the number of lost data stripe units is less than the total number of check stripe units, EC inverse coding can be performed on the remaining data stripe units and the check stripe units to recover the lost data stripe units. For example, referring to FIG. 5, the check stripe unit may be check stripe unit 1 or check stripe unit 2 in FIG. 5.

The check stripe unit can be referred to as the check protection of the EC stripe. The check stripe unit can include a check block and a metadata check block; for example, it can include one or more check blocks and one or more metadata check blocks. The check block may be obtained by performing EC coding on the data block in at least one data stripe unit, and may be used to recover the data block in the data stripe unit. For example, referring to FIG. 5, the check block may be the GRAIN metadata check block 1 and the GRAIN metadata check block 2 in FIG. 5. The metadata check block may be obtained by performing EC coding on the metadata in at least one data stripe unit, and may be used to recover the metadata in the data stripe unit. For example, referring to FIG. 5, the metadata check block may be the GRAIN metadata check block 1 and the GRAIN metadata check block 2 in FIG. 5.

Regarding the process of generating at least one check stripe unit, the client node may use an encoding algorithm to perform EC coding on at least one data stripe unit to obtain at least one check stripe unit. The encoding algorithm includes, but is not limited to, Reed-Solomon encoding, Cauchy encoding, and the like. This embodiment does not limit which encoding algorithm is used for EC encoding.

The at least one check stripe unit and the at least one data stripe unit may form an EC stripe. Each data stripe unit and each check stripe unit is a column of data blocks in the EC stripe. For example, referring to FIG. 5, data stripe unit 1, data stripe unit 2, and data stripe unit 3 can be EC-coded to obtain check stripe unit 1 and check stripe unit 2. Data stripe unit 1, data stripe unit 2, data stripe unit 3, check stripe unit 1, and check stripe unit 2 then form an EC stripe.
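To make the 3 data + 2 check layout concrete, the sketch below uses a small two-parity code over GF(2^8), the P+Q construction known from RAID 6, which belongs to the Reed-Solomon family. It is a stand-in chosen for brevity and is not necessarily the encoding algorithm used by this embodiment; all function names and the reducing polynomial 0x11D are assumptions of the example.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) with reducing polynomial 0x11D."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return product

def gf_pow2(n: int) -> int:
    """Return 2**n in GF(2^8); used as the coefficient of data unit n."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, 2)
    return result

def gf_inv(a: int) -> int:
    """Brute-force multiplicative inverse (fine for a 256-element field)."""
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

def encode(units):
    """Return two check units (P, Q) for equal-length data units."""
    p, q = bytearray(len(units[0])), bytearray(len(units[0]))
    for i, unit in enumerate(units):
        coeff = gf_pow2(i)
        for pos, byte in enumerate(unit):
            p[pos] ^= byte
            q[pos] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

def recover_two(units, x, y, p, q):
    """Rebuild lost data units x and y (x != y) from the survivors, P and Q."""
    survivors = [(i, u) for i, u in enumerate(units) if i not in (x, y)]
    gx, gy = gf_pow2(x), gf_pow2(y)
    denom_inv = gf_inv(gx ^ gy)
    dx, dy = bytearray(len(p)), bytearray(len(p))
    for pos in range(len(p)):
        pxy, qxy = p[pos], q[pos]
        for i, unit in survivors:
            pxy ^= unit[pos]                      # leaves Dx ^ Dy
            qxy ^= gf_mul(gf_pow2(i), unit[pos])  # leaves gx*Dx ^ gy*Dy
        dx[pos] = gf_mul(denom_inv, qxy ^ gf_mul(gy, pxy))
        dy[pos] = pxy ^ dx[pos]
    return bytes(dx), bytes(dy)

data_units = [b"GRAIN-1!", b"GRAIN-2!", b"GRAIN-3!"]
check_1, check_2 = encode(data_units)
damaged = [None, data_units[1], None]   # data stripe units 1 and 3 are lost
rebuilt = recover_two(damaged, 0, 2, check_1, check_2)
```

In this sketch the two check units are enough to rebuild any two lost data units; a production system would normally rely on an optimized erasure-coding library rather than byte-wise loops.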

Optionally, the manner of performing EC coding on at least one data stripe unit may include one or a combination of the following manners 1 to 3.

Manner 1: Perform EC coding on the data blocks in at least one data stripe unit to obtain a check block in at least one check stripe unit.

Specifically, at least one data block may be EC-coded to obtain at least one check block, one or more check blocks may be written to any stripe unit, and the stripe unit after writing is used as a check stripe unit. The check block may be written to any blank stripe unit in any blank EC stripe, or may overwrite any stripe unit in an EC stripe to which data has already been written, which is not limited in this embodiment.

For example, referring to FIG. 5, GRAIN data 1, GRAIN data 2, and GRAIN data 3 can be EC-coded to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2. GRAIN metadata check block 1 is written to stripe unit 4, and GRAIN metadata check block 2 is written to stripe unit 5. After writing is completed, stripe unit 4 can be used as check stripe unit 1, and stripe unit 5 can be used as check stripe unit 2.

Manner 2: Perform EC coding on the metadata in at least one data stripe unit to obtain a metadata check block in at least one check stripe unit.

In a possible implementation, EC coding may be performed on the cross-backup metadata in each of the at least one data stripe unit to obtain a metadata check block. For example, referring to FIG. 5, EC coding can be performed on cross-backup metadata 1, cross-backup metadata 2, and cross-backup metadata 3 to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2.

In another possible implementation, one or more pieces of metadata may be selected from the cross-backup metadata of each data stripe unit, and EC coding may be performed on the metadata selected from the at least one data stripe unit to obtain the metadata check block. Optionally, the cross-backup metadata in each data stripe unit may include multiple rows, each row carrying one piece of metadata, and the metadata in the same row of each data stripe unit may be selected and EC-coded together. For example, referring to FIG. 5, the cross-backup metadata in each data stripe unit occupies 3 rows, which carry GRAIN metadata 1, GRAIN metadata 2, and GRAIN metadata 3. To EC-code the first row of the cross-backup metadata, the first row of metadata is selected from data stripe unit 1 to obtain GRAIN metadata 1, from data stripe unit 2 to obtain GRAIN metadata 2, and from data stripe unit 3 to obtain GRAIN metadata 3; EC coding is then performed on GRAIN metadata 1, GRAIN metadata 2, and GRAIN metadata 3 to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2.
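The row-wise selection can be illustrated with the sketch below. For brevity it uses a single XOR parity as a stand-in for the EC encoding algorithm, so it yields one check value rather than two; the fixed-length metadata records, the row order of data stripe unit 3, and the function names are assumptions of the example.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length records."""
    return bytes(x ^ y for x, y in zip(a, b))

def row_parity(units_metadata, row):
    """Select the same row from every data stripe unit and XOR the records."""
    selected = [rows[row] for rows in units_metadata]
    return reduce(xor_bytes, selected)

# Each inner list is the cross-backup metadata of one data stripe unit,
# with the unit's own metadata in the first row.
cross_backup_rows = [
    [b"meta-1", b"meta-2", b"meta-3"],   # data stripe unit 1
    [b"meta-2", b"meta-1", b"meta-3"],   # data stripe unit 2
    [b"meta-3", b"meta-1", b"meta-2"],   # data stripe unit 3 (order assumed)
]
parity = row_parity(cross_backup_rows, 0)

# If the first row of data stripe unit 1 is lost, XOR the parity with the
# first rows of the surviving units to rebuild it.
recovered = xor_bytes(xor_bytes(parity, b"meta-2"), b"meta-3")
```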

After the metadata check block is obtained, it can be written to any blank stripe unit in any blank EC stripe, or can overwrite any stripe unit in an EC stripe to which data has already been written, and the stripe unit after writing can be used as a check stripe unit. For example, referring to FIG. 5, GRAIN metadata check block 1 can be written to stripe unit 4 and GRAIN metadata check block 2 can be written to stripe unit 5. After the writing is completed, stripe unit 4 can be used as check stripe unit 1, and stripe unit 5 can be used as check stripe unit 2.

By encoding the metadata in each data stripe unit, the reliability and security of metadata storage can be further improved. Specifically, when metadata in any data stripe unit is lost, the lost metadata can be read and restored not only from the cross-backup metadata stored in other data stripe units, but also by performing EC inverse coding on the metadata in the remaining data stripe units and the metadata check block, thereby further reducing the probability of metadata loss and greatly improving the reliability and security of the stored metadata.

Manner 3: Perform EC coding on the data blocks and metadata in at least one data stripe unit to obtain a check block in at least one check stripe unit.

Manner 3 is a combination of manner 1 and manner 2. At least one data block and the metadata of the at least one data block can be EC-coded together to obtain at least one metadata check block. After the metadata check block is written to any stripe unit, the stripe unit after writing is used as the check stripe unit.

For example, referring to FIG. 5, all data carried in stripe unit 1, stripe unit 2, and stripe unit 3 can be EC-coded to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2. GRAIN metadata check block 1 is written to stripe unit 4, and stripe unit 4 after writing is used as check stripe unit 1; GRAIN metadata check block 2 is written to stripe unit 5, and stripe unit 5 after writing is used as check stripe unit 2.

405. The client node distributes at least one data stripe unit and at least one check stripe unit to at least one storage node.

Optionally, at least one data stripe unit and at least one check stripe unit may be distributed to at least one storage node corresponding to the target storage location of the data. Specifically, the storage node corresponding to the target storage location may be determined according to the target storage location of the data, to obtain at least one storage node; a data stripe unit and/or a check stripe unit may be allocated to each storage node; and the data stripe unit and/or check stripe unit allocated to each storage node may be distributed to that storage node.

Regarding the manner of determining the storage node corresponding to the target storage location, a mapping relationship between the storage location and the identification of the storage node may be established in advance, and the mapping relationship between the storage location and the identification of the storage node may be queried according to the target storage location to obtain the target storage. The identifier of the at least one storage node corresponding to the location is used as the storage node corresponding to the target storage location.

Optionally, a partitioned view is generated by a partition allocation algorithm according to the number and status of storage nodes in the distributed storage system. The partitioned view is used to indicate the storage node corresponding to each partition, and may include at least one partition identifier and the identifier of the corresponding storage node. The client node may determine the partition identifier corresponding to the target storage location of the data, query the partitioned view according to the partition identifier to obtain the identifier of the at least one storage node, and use the storage node corresponding to that identifier as the storage node corresponding to the target storage location.

Regarding the process of determining the partition identifier according to the target storage location of the data, when the target storage location is an LBA, the LBA can be divided by the total number of partitions, and the remainder is used as the partition identifier, which determines the partition to which the LBA belongs. The correspondence between LBAs and partitions can be referred to as the relationship by which LBAs are scattered to partitions, and the correspondence between partitions and storage nodes can be referred to as the deployment relationship by which partitions are deployed to storage nodes.
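The two lookups, LBA to partition and partition to storage nodes, can be sketched as follows. The contents of `PARTITION_VIEW`, the node identifiers, and the function names are hypothetical; only the remainder rule comes from the description above.

```python
# Hypothetical partitioned view: partition identifier -> storage node identifiers.
PARTITION_VIEW = {
    0: ["osd-1", "osd-2", "osd-3", "osd-4", "osd-5"],
    1: ["osd-2", "osd-3", "osd-4", "osd-5", "osd-6"],
    2: ["osd-3", "osd-4", "osd-5", "osd-6", "osd-1"],
    3: ["osd-4", "osd-5", "osd-6", "osd-1", "osd-2"],
}

def partition_for_lba(lba: int, total_partitions: int) -> int:
    """The remainder of LBA / total number of partitions is the partition ID."""
    return lba % total_partitions

def nodes_for_lba(lba: int, partition_view: dict) -> list:
    """Query the partitioned view with the partition ID derived from the LBA."""
    return partition_view[partition_for_lba(lba, len(partition_view))]

target_nodes = nodes_for_lba(10, PARTITION_VIEW)
```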

Regarding the manner of allocating data stripe units or check stripe units to storage nodes, the data stripe units or check stripe units can be allocated to the storage nodes in turn, in the order of the storage node identifiers. For example, data stripe unit 1 can be allocated to storage node 1 and data stripe unit 2 to storage node 2. As another example, a data stripe unit or a check stripe unit may be randomly allocated to each storage node. Of course, other methods may also be used. This embodiment does not limit how data stripe units or check stripe units are allocated.
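Both allocation manners mentioned above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes at most one stripe unit per storage node, which the embodiment does not require.

```python
import random
from typing import Dict, List, Optional, Sequence


def allocate_in_order(node_ids: Sequence[int], stripe_units: Sequence[str]) -> Dict[int, str]:
    """Assign stripe units to storage nodes in ascending order of node identifier."""
    if len(stripe_units) > len(node_ids):
        raise ValueError("more stripe units than storage nodes")
    return dict(zip(sorted(node_ids), stripe_units))


def allocate_randomly(
    node_ids: Sequence[int],
    stripe_units: Sequence[str],
    rng: Optional[random.Random] = None,
) -> Dict[int, str]:
    """Assign each stripe unit to a distinct, randomly chosen storage node."""
    rng = rng or random.Random()
    # sample() picks distinct nodes and raises ValueError if there are too few.
    chosen: List[int] = rng.sample(list(node_ids), len(stripe_units))
    return dict(zip(chosen, stripe_units))
```

With node identifiers 1 to 5 and the units `["D1", "D2", "D3", "P1", "P2"]`, `allocate_in_order` places D1 on node 1 and P2 on node 5.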

Regarding the manner of distributing the data stripe units and/or check stripe units, at least one write request can be generated according to the data stripe unit and/or check stripe unit corresponding to each storage node, and the at least one write request can be distributed to the at least one storage node. For example, one write request may be sent to each storage node, so as to distribute the corresponding data stripe unit and/or check stripe unit to that storage node. Each write request carries the data stripe unit and/or check stripe unit corresponding to the storage node and, optionally, the target storage location of the data. Each write request may be an input/output (IO) request in the form of a key-value pair.
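A write request of this kind might be modeled as below. The embodiment only states that the request is a key-value IO request carrying the stripe unit and, optionally, the target storage location; the `WriteRequest` type, its field names, and the key format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WriteRequest:
    node_id: int                      # destination storage node
    key: str                          # assumed key format, not from the source
    value: bytes                      # the data or check stripe unit itself
    target_lba: Optional[int] = None  # optional target storage location


def build_write_requests(
    assignment: Dict[int, bytes], target_lba: Optional[int] = None
) -> List[WriteRequest]:
    """Generate one write request per storage node from a node -> unit assignment."""
    return [
        WriteRequest(node_id, f"stripe-unit/{node_id}", unit, target_lba)
        for node_id, unit in sorted(assignment.items())
    ]
```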

Optionally, the client node may send a corresponding data stripe unit and / or check stripe unit to each storage node. Specifically, a client node may allocate a corresponding data stripe unit and / or check stripe unit to each storage node, and the client node generates at least one write request, and sends a write request to each storage node. The client node interacts with each storage node to distribute at least one data stripe unit and at least one check stripe unit to the at least one storage node.

Exemplarily, referring to FIG. 5, the client node may send a write request carrying data stripe unit 1 to OSD node 1, a write request carrying data stripe unit 2 to OSD node 2, a write request carrying data stripe unit 3 to OSD node 3, a write request carrying check stripe unit 1 to OSD node 4, and a write request carrying check stripe unit 2 to OSD node 5, thereby distributing the data stripe units and check stripe units.

In this embodiment, the client node sends the corresponding data stripe unit and/or check stripe unit to each storage node, so the data to be stored reaches each storage node after a single hop between the client node and that storage node. The data does not have to travel from the client node to a primary storage node and then from the primary storage node to each standby storage node, which would take two hops. This saves the network delay caused by forwarding through the primary storage node, thereby reducing the latency of storing data, improving the efficiency and speed of storing data, and greatly improving the storage performance of the distributed storage system.

It should be noted that having the client node send a data stripe unit and/or a check stripe unit to each storage node is only an optional way, not a mandatory way, to distribute the data stripe units and/or check stripe units. In another possible embodiment, one of the storage nodes may be selected as the primary storage node, with the other storage nodes acting as standby storage nodes. The client node can send the at least one data stripe unit and the at least one check stripe unit to the primary storage node, and the primary storage node sends the data stripe units and/or check stripe units to each standby storage node. This embodiment does not limit how the data stripe units and/or check stripe units are distributed.

406. When at least one storage node receives the data stripe unit and / or the verification stripe unit, the at least one storage node stores the data stripe unit and / or the verification stripe unit.

Specifically, each storage node can receive a write request, parse the write request to obtain the data stripe unit and/or check stripe unit it carries, and write the data stripe unit and/or check stripe unit to its storage space, thereby storing the data stripe unit and/or check stripe unit. The write request may carry the target storage location of the data, in which case the data stripe unit and/or check stripe unit may be written to the storage space corresponding to the target storage location.

407. At least one storage node sends a first write completion message to the client node, where the first write completion message is used to indicate that the corresponding storage node has stored the data stripe unit and / or the verification stripe unit.

To distinguish the two in the description, in this embodiment the write completion message sent by a storage node is referred to as a first write completion message, and the write completion message sent by the client node is referred to as a second write completion message. It should be noted that the terms "first write completion message" and "second write completion message" are used only to distinguish different write completion messages, and should not be understood as expressing or implying the relative importance of the write completion messages or their total number.

After each storage node writes the data stripe unit and/or check stripe unit, it can generate a first write completion message and send it to the client node, to notify the client node that it has stored the data successfully.

408. The client node receives a first write completion message sent by at least one storage node, and sends a second write completion message to the target application or an external host. The second write completion message is used to indicate that the data to be stored has been written to the target storage location.

Specifically, the client node can determine whether the first write completion messages of all the storage nodes have been received. When it confirms that they have, it generates a second write completion message and sends it to the target application or external host. The target application or external host can receive the second write completion message and present it according to a preset prompt mode, thereby informing the user that the data to be stored has been written to the target storage location. In addition, the target application may be located on an upper layer of the client node in the logical architecture, and may be referred to as an upper-layer application of the client node.
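The check performed by the client node amounts to a set comparison between the storage nodes that were sent a write request and the storage nodes that have returned a first write completion message. A minimal sketch, with hypothetical function names:

```python
from typing import Iterable, Set


def pending_nodes(expected: Iterable[int], acknowledged: Iterable[int]) -> Set[int]:
    """Storage nodes whose first write completion message has not arrived yet."""
    return set(expected) - set(acknowledged)


def can_send_second_completion(expected: Iterable[int], acknowledged: Iterable[int]) -> bool:
    """True once every storage node that was written to has acknowledged."""
    return not pending_nodes(expected, acknowledged)
```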

It should be noted that steps 407 and 408 are only optional steps for storing data, and are not mandatory steps for storing data. This embodiment does not limit whether to perform steps 407 and 408.

In summary, the above describes the write IO process of storing data, and the read IO process of reading data may include the following steps 1 to 2.

Step 1: The client node receives the read request, and determines at least one storage node of the partition corresponding to the target storage location according to the target storage location in the read request. For example, according to the mapping relationship between the LBA and the partition, the partition corresponding to the LBA can be determined, the partition view can be queried, and at least one storage node corresponding to the partition can be determined.

Step 2: The client node forwards the read request to at least one storage node, so that data is read in the at least one storage node.

The method provided in this embodiment introduces a mechanism for cross-backup of the metadata of data blocks in an EC stripe. The data blocks and the cross-backup metadata are stored together in data stripe units, EC coding is performed on the data stripe units to obtain check stripe units, and each data stripe unit and each check stripe unit is distributed to a storage node. After every storage node has finished storing, different storage nodes hold the metadata of one another's data blocks. Even if the metadata of a certain storage node is lost, the lost metadata can be read and recovered from the data stripe units of other storage nodes, because those nodes stored a backup of it in advance. This greatly improves the reliability and security of data storage and improves the storage performance of the distributed storage system. In addition, using EC technology to store data can avoid the performance overhead of the cache when storing data through the three-copy mechanism, thereby saving storage space and reducing operating costs. Further, the client node can send the corresponding data stripe unit and check stripe unit directly to each storage node, so the data reaches each storage node after only one hop of network forwarding, which greatly reduces latency, improves the speed and efficiency of storing data, and thus improves the storage performance of the distributed storage system.

The following uses the embodiment in FIG. 6 to describe the data storage process in a scenario where a storage node is in a sub-health state. To distinguish the two in the description, in the embodiment of FIG. 6 a storage node in a sub-health state is referred to as a third storage node, and a storage node that is not in a sub-health state is referred to as a fourth storage node. It should be noted that the terms "third storage node" and "fourth storage node" are used only to distinguish storage nodes according to whether they are in a sub-health state, and should not be understood as expressing or implying the order, relative importance, or total number of storage nodes.

FIG. 6 is a flowchart of a data storage method according to an embodiment of the present application. As shown in FIG. 6, the interaction body of the method includes a client node and at least one fourth storage node, including the following steps:

601. The client node obtains at least one blank EC stripe.

This step is the same as the above step 401, and details are not described herein.

602. The client node obtains data to be stored.

This step is the same as the above step 402, and details are not described herein.

603. The client node determines that the third storage node is in a sub-health state.

The sub-health state is also called a sub-health flash, and the sub-health state may include a state in which read and write requests are abnormally slow, a state where a write cache is invalid, and a state where a disk is damaged.

Regarding the manner of determining that the third storage node is in a sub-health state, optionally, the client node may receive a sub-health message and determine from it that the third storage node is in a sub-health state. The sub-health message is used to indicate that the third storage node is in a sub-health state, and can carry the identifier of the third storage node. The client node can parse the sub-health message to obtain the identifier of the third storage node, and thereby determine that the third storage node is in a sub-health state.

Optionally, the sub-health message received by the client node may come from an MDC node. Specifically, the distributed storage system may include an MDC node, which maintains the state of each storage node and may maintain communication with each storage node. When the MDC node determines that the third storage node is in a sub-health state, it may generate a sub-health message according to the identifier of the third storage node and send the sub-health message to the client node. The client node may receive the sub-health message from the MDC node, thereby determining that the third storage node is in a sub-health state.

It should be noted that receiving the sub-health message is only an example of a method, not a mandatory method, for determining that the third storage node is in a sub-health state. Optionally, the client node may also determine that the third storage node is in a sub-health state by other methods. For example, the client node can maintain communication with the third storage node and actively detect that the third storage node is in a sub-health state. This embodiment does not limit how to determine that the third storage node is in a sub-health state.

604. The client node generates at least one data stripe unit according to at least one data block to be stored.

605. The client node performs EC coding on at least one data stripe unit to obtain at least one check stripe unit.

606. The client node distributes at least one data stripe unit and at least one check stripe unit to a fourth storage node.

Steps 604 to 606 are the same as the above steps 403 to 405, and the differences mainly include the following two points:

Difference one: the distributed content is extended. In this embodiment, in addition to the data stripe units and check stripe units, a sub-health mark of the third storage node is also distributed. The sub-health mark is used to indicate that the third storage node is in a sub-health state, and may include the identifier of the third storage node. The sub-health mark may be generated by the client node.

The manner of distributing the sub-health mark of the third storage node may include a combination of one or more of the following manners 1 to 2.

Method 1: A sub-health mark may be written to at least one data stripe unit, so that each data stripe unit carries a data block and cross-backup metadata, and also carries a sub-health mark of a third storage node. Therefore, by distributing at least one data stripe unit, the sub-health mark of the third storage node is distributed.

The sub-health mark may first be written into the at least one data stripe unit, after which EC coding is performed on the at least one data stripe unit to obtain the at least one check stripe unit. Alternatively, the at least one data stripe unit may be generated first and EC-coded to obtain the at least one check stripe unit, after which the sub-health mark is written to a free position in the at least one data stripe unit.

It should be noted that if the client node sends the data stripe units to each storage node, the client node can write the sub-health mark to the at least one data stripe unit. If the primary storage node sends the data stripe units to each storage node, the primary storage node can write the sub-health mark to the at least one data stripe unit. This embodiment does not limit which entity writes the sub-health mark to the data stripe unit.

Method 2: A sub-health mark may be written to at least one write request, so that each write request carries a sub-health mark of a third storage node while carrying a data stripe unit and / or a verification stripe unit, Thus, the sub-health mark of the third storage node is distributed by distributing at least one write request.

For example, when generating a write request, the sub-health mark and the data stripe unit may be encapsulated to obtain a write request carrying the sub-health mark and the data stripe unit, so as to write the sub-health mark to the write request. For another example, the sub-health field can be reserved in the write request, and the sub-health field can be set to write the sub-health flag into the write request. This embodiment does not limit how to write the sub-health flag into the write request.

It should be noted that if the client node sends the write requests to each storage node, the client node can write the sub-health mark to the at least one write request. If the primary storage node sends the write requests to each storage node, the primary storage node may write the sub-health mark to the at least one write request. This embodiment does not limit which entity writes the sub-health mark to the write request.

For example, referring to FIG. 7, when a sub-health flash occurs on OSD node 1, the client node may write the sub-health mark of OSD node 1 into the write request for OSD node 2 and into the write request for OSD node 3. After writing the sub-health mark, the client node can send the write requests carrying the sub-health mark of OSD node 1 to OSD node 2 and OSD node 3, thereby distributing the sub-health mark of OSD node 1 to OSD node 2 and OSD node 3.

Difference two: the destination of the distributed data can change. Specifically, if the third storage node is in a sub-health state, the data stripe units and/or check stripe units may be sent to a fourth storage node among the at least one storage node. The fourth storage node and the third storage node may be different storage nodes. Regarding the manner of determining the fourth storage node, the storage nodes other than the third storage node among the at least one storage node may be determined, yielding at least one fourth storage node, and data stripe units and/or check stripe units may be sent to the at least one fourth storage node. Either the client node or the primary storage node may send the data stripe units and/or check stripe units to each fourth storage node; this embodiment does not limit this.

Optionally, the data to be stored can be written in a degraded manner. That is, the data stripe unit and/or check stripe unit corresponding to the third storage node is not sent, and only the data stripe units and/or check stripe units other than the one corresponding to the third storage node are sent. When data is stored in this degraded manner, a write request carrying such a data stripe unit and/or check stripe unit may be referred to as a degraded write request.

Specifically, the degraded write may include any one or more of the following implementation manners 1 to 2:

Implementation manner 1: Send at least one second data stripe unit and at least one check stripe unit other than the first data stripe unit corresponding to the third storage node to at least one fourth storage node.

When the third storage node corresponds to a data stripe unit, the data stripe unit corresponding to the third storage node is referred to as the first data stripe unit, and the data stripe units other than the first data stripe unit are referred to as second data stripe units. Implementation manner 1 may specifically include: determining, among the at least one generated data stripe unit, the data stripe units other than the first data stripe unit to obtain at least one second data stripe unit, and sending the corresponding second data stripe unit and/or check stripe unit to each of the at least one fourth storage node.

For example, referring to FIG. 7, when a sub-health flash occurs on OSD node 1, the data stripe units and/or check stripe units other than data stripe unit 1 (that is, data stripe unit 2, data stripe unit 3, check stripe unit 1, and check stripe unit 2) can be sent to the OSD nodes other than OSD node 1 (that is, OSD node 2 to OSD node 5).

Implementation manner 2: Send at least one second check stripe unit, other than the first check stripe unit corresponding to the third storage node, and at least one data stripe unit to at least one fourth storage node.

When the third storage node corresponds to a check stripe unit, the check stripe unit corresponding to the third storage node is referred to as the first check stripe unit, and the other check stripe units are referred to as second check stripe units. Implementation manner 2 may specifically include: determining, among the at least one generated check stripe unit, the check stripe units other than the first check stripe unit to obtain at least one second check stripe unit, and sending the corresponding second check stripe unit and/or data stripe unit to each of the at least one fourth storage node.
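Both implementation manners reduce to the same operation: drop the stripe unit assigned to the third storage node and send the rest, each request carrying the sub-health mark described under difference one. The sketch below assumes a dictionary-based request and hypothetical field names; it uses the FIG. 7 layout as sample input.

```python
from typing import Dict, List


def degraded_write_requests(
    assignment: Dict[int, str], sub_healthy_node: int
) -> List[dict]:
    """Build degraded write requests for every node except the sub-healthy one.

    The stripe unit assigned to the sub-healthy (third) storage node is skipped,
    whether it is a data stripe unit or a check stripe unit.
    """
    requests = []
    for node_id, unit in sorted(assignment.items()):
        if node_id == sub_healthy_node:
            continue  # degraded write: nothing is sent to the third storage node
        requests.append(
            {
                "node_id": node_id,
                "stripe_unit": unit,
                "sub_health_mark": sub_healthy_node,  # identifier of the third node
            }
        )
    return requests


# Layout of FIG. 7: three data stripe units and two check stripe units.
FIG7_ASSIGNMENT = {1: "D1", 2: "D2", 3: "D3", 4: "P1", 5: "P2"}
```

Calling `degraded_write_requests(FIG7_ASSIGNMENT, 1)` yields four requests, for OSD nodes 2 to 5, none of which carries data stripe unit 1.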

607. When the fourth storage node receives the corresponding data stripe unit and / or check stripe unit, each fourth storage node stores the corresponding data stripe unit and / or check stripe unit.

Step 607 is the same as the above step 406. The difference is that, while the third storage node is in a sub-health state, the fourth storage node can record a sub-health log for the third storage node in addition to storing the data stripe unit and/or check stripe unit, so that when the third storage node is in a sub-health recovery state, the metadata missing from the third storage node can be sent to the third storage node.

The sub-health log is used to indicate the data stripe units stored while the third storage node was in a sub-health state. The sub-health log can also be called the metadata difference log, because it indicates the difference between the metadata stored by the third storage node and by the fourth storage node during the period in which the third storage node is in a sub-health state. The sub-health log may include a correspondence between the third storage node and the data stripe unit, for example, a correspondence between the sub-health mark of the third storage node and the stripe unit identifier of the data stripe unit.

Optionally, the implementation manner of recording the sub-health log may include any one or more of the following implementations one to two:

Implementation one: When the received write request carries the sub-health mark of the third storage node and the data stripe unit, write the sub-health record corresponding to the third storage node to the sub-health log.

Specifically, when the fourth storage node parses the write request, if it obtains the sub-health mark of the third storage node from the write request, it generates a sub-health record corresponding to the third storage node and writes that record into the sub-health log. A sub-health record is a record in the sub-health log. It may include the sub-health mark of the third storage node, and may also include other information, such as the timestamp at which the sub-health mark was received and the identifier of the data stripe unit in the write request.

Implementation two: When the received data strip unit carries the sub-health mark of the third storage node, write the sub-health record corresponding to the third storage node to the sub-health log.

Specifically, when the fourth storage node parses the data stripe unit, if it obtains the sub-health mark of the third storage node from the data stripe unit, it generates a sub-health record corresponding to the third storage node and writes that record into the sub-health log.

It should be noted that the step of writing a sub-health record to the sub-health log may be performed multiple times while the third storage node is in the sub-health state. For example, each time a received write request carries the sub-health mark of the third storage node, a sub-health record may be written to the sub-health log, so that the log keeps a record of every write request that carried a sub-health mark.
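The logging behaviour of the fourth storage node for implementation one can be sketched as follows. The request and record field names are assumptions; the embodiment only says that a record may contain the sub-health mark, a timestamp, and the data stripe unit identifier.

```python
import time
from typing import List, Optional


def record_sub_health(
    log: List[dict], write_request: dict, now: Optional[float] = None
) -> bool:
    """Append a sub-health record if the write request carries a sub-health mark.

    Returns True when a record was written, False when the request has no mark.
    """
    mark = write_request.get("sub_health_mark")
    if mark is None:
        return False
    log.append(
        {
            "sub_healthy_node": mark,
            "stripe_unit_id": write_request.get("stripe_unit_id"),
            "timestamp": time.time() if now is None else now,
        }
    )
    return True
```

Calling the function once per incoming write request produces one record for each request that carried the mark, which matches the repeated logging described above.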

608. Each fourth storage node sends a first write completion message to the client node.

609. The client node receives a first write completion message sent by at least one fourth storage node, and sends a second write completion message to a target application or an external host.

Steps 608 to 609 are the same as the above steps 407 to 408. The difference is that, because the third storage node is in a sub-health state, the client node does not need to determine whether a first write completion message has been received from the third storage node. When it has received the first write completion message from the at least one fourth storage node other than the third storage node, the client node confirms that the data has been stored successfully and sends the second write completion message to the target application or external host. That is, once the degraded write to the other storage nodes completes, the client node can report that the write IO succeeded.

In summary, the above describes the write IO process when the third storage node is in a sub-health state. This embodiment also provides the read IO process when the third storage node is in a sub-health state. Taking as an example the case in which the data stripe unit corresponding to the third storage node is called the first data stripe unit and the metadata backup of the data blocks of the first data stripe unit is stored on a fifth storage node, the read IO process may include the following steps 1 to 3.

Step 1: When the client node receives a read request, it queries the cross-backup relationship between the storage nodes and determines the fifth storage node corresponding to the third storage node. The cross-backup metadata stored by the fifth storage node includes the metadata of the data block of the first data stripe unit. The read request is used to instruct reading of the data block of the first data stripe unit.

If the client node receives a read request that instructs reading of data in the data block of the first data stripe unit, then, because the third storage node storing the first data stripe unit is currently in a sub-health state, the client node queries the cross-backup relationship between the storage nodes according to the third storage node to obtain the fifth storage node corresponding to the third storage node. The cross-backup metadata in the data stripe unit stored by the fifth storage node includes the metadata of the data block of the first data stripe unit. In other words, the metadata backup of the data block of the first data stripe unit has been stored on the fifth storage node in advance, so the read request is forwarded to the fifth storage node to read the data of the first data stripe unit.

For example, referring to FIG. 7, when OSD node 1 is in a sub-health state, if the client node receives a read request indicating that GRAIN data 1 stored on OSD node 1 should be read, the client node queries the cross-backup relationship between the storage nodes according to OSD node 1, obtains OSD node 2 and OSD node 3, and forwards the read request to OSD node 2 and OSD node 3.

Step 2: The client node sends a read request to the fifth storage node.

Step 3: The fifth storage node receives the read request sent by the client node and performs data reading.

The fifth storage node may read the cross-backup metadata in the data stripe unit that it stores, obtain the metadata of the data block of the first data stripe unit from the cross-backup metadata, index to the corresponding data block according to that metadata, read the data in the data block, and return the data to the client node. Specifically, the fifth storage node can exchange data with the storage nodes other than the third storage node and receive the data stripe units and check stripe units sent by those nodes, thereby obtaining the at least one second data stripe unit other than the first data stripe unit and the at least one check stripe unit. It can then perform EC inverse coding on the at least one second data stripe unit and the at least one check stripe unit to restore the first data stripe unit, read the data block of the first data stripe unit, and return the data in the data block to the client node.
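Steps 1 to 3 can be illustrated with a small sketch. The embodiment does not name the erasure code it uses, so the sketch substitutes the simplest one, a single XOR parity unit, as a stand-in for EC coding and EC inverse coding; a real deployment with two check stripe units would need a stronger code. The function names and the `CROSS_BACKUP` table are assumptions; the entry for node 1 follows the FIG. 7 example.

```python
from functools import reduce
from operator import xor
from typing import Dict, List, Sequence, Set

# Cross-backup relationship: node -> nodes holding a backup of its metadata.
CROSS_BACKUP: Dict[int, List[int]] = {1: [2, 3]}


def route_read(owner: int, sub_healthy: Set[int], cross_backup: Dict[int, List[int]]) -> List[int]:
    """Step 1: pick the nodes to read from, avoiding sub-healthy nodes."""
    if owner not in sub_healthy:
        return [owner]
    return [n for n in cross_backup.get(owner, []) if n not in sub_healthy]


def xor_units(units: Sequence[bytes]) -> bytes:
    """XOR equal-length stripe units byte by byte (single-parity EC coding)."""
    if len({len(u) for u in units}) != 1:
        raise ValueError("need at least one unit, all of equal length")
    return bytes(reduce(xor, column) for column in zip(*units))


def recover_first_unit(second_units: Sequence[bytes], check_unit: bytes) -> bytes:
    """Step 3: inverse coding, rebuilding the missing unit from the survivors."""
    return xor_units(list(second_units) + [check_unit])
```

Under this XOR stand-in, the check unit is `xor_units([d1, d2, d3])`, and XOR-ing it with `d2` and `d3` cancels them out and leaves `d1`.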

In the method provided by this embodiment, if a certain storage node is in a sub-health state, that storage node can be avoided immediately when storing data, and the data is stored through the other storage nodes. This achieves fast switching between storage nodes, reduces the impact of a storage node's sub-health state on the performance of the storage system, and ensures that the storage system can store data quickly even when a storage node is in a sub-health state, thereby ensuring the reliability and stability of the storage system.

FIG. 8 is a flowchart of a data recovery method according to an embodiment of the present application. As shown in FIG. 8, the method is executed by a fifth storage node and includes the following steps:

801. The fifth storage node stores at least one data stripe unit.

For specific implementation of this step, refer to the embodiment in FIG. 4 and the embodiment in FIG. 6, and details are not described herein.

802. The fifth storage node determines that the third storage node is in a sub-healthy recovery state.

The sub-healthy recovery state refers to a state of transition from a sub-healthy state to a healthy state, that is, a state in which the sub-healthy state is recovering.

Regarding how to determine that the third storage node is in a sub-health recovery state, optionally, the fifth storage node may receive a sub-health recovery message, parse it to obtain the identifier of the third storage node carried in it, and determine according to that identifier that the third storage node is in a sub-health recovery state. The sub-health recovery message is used to indicate that the third storage node is in a sub-health recovery state, and may carry the identifier of the third storage node.

Optionally, the fifth storage node may receive the sub-health recovery message sent by the MDC node. Specifically, the MDC node can maintain communication with each storage node and can sense the current status of each storage node. When the MDC node determines that the third storage node is in a sub-health recovery state, it can generate a sub-health recovery message and send it to the fifth storage node. The fifth storage node may receive the sub-health recovery message from the MDC node, thereby determining that the third storage node is in a sub-health recovery state.

The first point to be noted is that receiving a sub-health recovery message sent by the MDC node is only an optional way, not a mandatory way, to receive a sub-health recovery message. Optionally, the fifth storage node may receive sub-health recovery messages from other nodes. For example, when the third storage node is in a sub-health recovery state, the third storage node may actively send a sub-health recovery message to the fifth storage node, and the fifth storage node may receive the sub-health recovery message sent by the third storage node.

The second point that needs to be explained is that step 802 is only an optional step for data recovery, not a mandatory step for data recovery. In another possible implementation, the fifth storage node may not need to perform step 802. For example, the fifth storage node may perform the following step 803 when receiving a sending instruction of missing metadata.

803. The fifth storage node acquires missing metadata of the third storage node according to at least one data stripe unit.

Taking the data stripe unit stored in the third storage node as the first data stripe unit as an example, the cross-backup metadata of the data stripe unit stored in the fifth storage node includes the metadata of the data block of the first data stripe unit. In other words, a metadata backup of the data block of the first data stripe unit has been stored on the fifth storage node in advance, so the fifth storage node can obtain the missing metadata of the third storage node from the data stripe units it stores itself.

The missing metadata refers to the metadata of the data blocks of the data stripe units that the third storage node should have stored but did not store. The missing metadata can be understood as difference metadata, that is, the difference between the metadata stored by the third storage node and the metadata stored by the fifth storage node.

Optionally, the process of obtaining missing metadata may specifically include the following steps 1 to 2:

Step 1: Select at least one second data stripe unit from the at least one data stripe unit.

The second data stripe unit is a data stripe unit stored by the fifth storage node itself, and the storage time of the second data stripe unit falls within the period during which the third storage node is in a sub-health state. The storage time of the second data stripe unit refers to the point in time at which the fifth storage node stores the second data stripe unit. The period during which the third storage node is in a sub-health state may refer to the time range from when the third storage node enters the sub-health state to when it leaves the sub-health state, for example, the time range from when the third storage node enters the sub-health state to when it enters the sub-health recovery state. The fifth storage node may select each data stripe unit that it stored while the third storage node was in the sub-health state, and use the selected at least one data stripe unit as the at least one second data stripe unit.

Optionally, step one may specifically include a combination of one or more of the following implementation manners one to three.

Implementation method 1: According to the identifier of the third storage node, query the sub-health log to obtain at least one second data stripe unit.

The identifier of the third storage node can be used as an index to query the sub-health log, so as to determine the stripe unit identifiers (IDs) of the data stripe units that correspond to the identifier of the third storage node in the sub-health log. After at least one stripe unit ID is obtained, the at least one data stripe unit corresponding to the at least one stripe unit ID can be determined as the at least one second data stripe unit. For example, the OSD node 2 may query the sub-health log according to the identifier of the OSD node 1, and obtain the stripe unit identifier corresponding to the identifier of the OSD node 1.
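A minimal sketch of such a lookup is given below, assuming the sub-health log is an in-memory mapping from node identifier to stripe unit IDs. The `SubHealthLog` class, the `select_by_log` helper, and the identifiers `osd1` and `su-7` are hypothetical names introduced for illustration only.

```python
from collections import defaultdict


class SubHealthLog:
    """Maps a sub-healthy node's identifier to stripe unit identifiers."""

    def __init__(self):
        self._records = defaultdict(list)

    def record(self, node_id, stripe_unit_id):
        """Append one sub-health record for node_id."""
        self._records[node_id].append(stripe_unit_id)

    def query(self, node_id):
        """Return the stripe unit identifiers recorded for node_id."""
        return list(self._records.get(node_id, []))


def select_by_log(log, stored_units, node_id):
    """Return the locally stored stripe units that the log lists for node_id."""
    return [stored_units[uid] for uid in log.query(node_id) if uid in stored_units]


log = SubHealthLog()
log.record("osd1", "su-7")
log.record("osd1", "su-9")
stored = {"su-5": "unit5", "su-7": "unit7", "su-9": "unit9"}
second_units = select_by_log(log, stored, "osd1")
```

Using the node identifier as the index keeps the lookup independent of how many stripe units the node has stored in total.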

Implementation method 2: Select a data stripe unit carrying the sub-health mark of the third storage node as at least one second data stripe unit.

It can be determined whether each stored data stripe unit carries the sub-health mark of the third storage node. When any data stripe unit carries the sub-health mark of the third storage node, that data stripe unit is selected as a second data stripe unit.
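The mark-based selection can be sketched as a simple filter. The `DataStripeUnit` fields below are assumptions: the application does not define how a sub-health mark is encoded, so the sketch models it as a set of node identifiers attached to each unit.

```python
from dataclasses import dataclass, field


@dataclass
class DataStripeUnit:
    unit_id: str
    # Identifiers of nodes that were sub-healthy when this unit was written
    # (hypothetical encoding of the sub-health mark).
    sub_health_marks: set = field(default_factory=set)


def select_by_mark(stored_units, node_id):
    """Keep only the units that carry the sub-health mark of node_id."""
    return [u for u in stored_units if node_id in u.sub_health_marks]


stored = [
    DataStripeUnit("su-5"),
    DataStripeUnit("su-7", {"osd1"}),
    DataStripeUnit("su-9", {"osd1", "osd4"}),
]
second_units = select_by_mark(stored, "osd1")
```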

Implementation manner three: A target time period in which the third storage node is in a sub-health state may be determined, and at least one data stripe unit stored in the target time period is selected as at least one second data stripe unit.

Regarding how to determine the target time period, for example, a first timestamp may be recorded when a sub-health message from the third storage node is received, and a second timestamp may be recorded when a sub-health recovery message from the third storage node is received. The time period between the first timestamp and the second timestamp is used as the target time period in which the third storage node is in a sub-health state. For another example, the first timestamp can be recorded when a received write request carries the sub-health mark of the third storage node, and the second timestamp can be recorded when a received write request no longer carries the sub-health mark of the third storage node. The time period between the first timestamp and the second timestamp is used as the target time period in which the third storage node is in a sub-health state.

Regarding how to select at least one second data stripe unit, the storage time point of each data stripe unit can be recorded when that data stripe unit is stored. According to the storage time point of each data stripe unit, a data stripe unit whose storage time point falls within the target time period can be selected as at least one second data stripe unit.
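The time-based selection can be sketched as follows. The `SubHealthWindow` class and the numeric timestamps are illustrative assumptions, and treating both ends of the target time period as inclusive is a choice made here, not something the application states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubHealthWindow:
    """Target time period during which a node is considered sub-healthy."""
    start: Optional[float] = None  # first timestamp
    end: Optional[float] = None    # second timestamp

    def contains(self, t: float) -> bool:
        """True if t lies inside the window (bounds inclusive, by assumption)."""
        if self.start is None or self.end is None:
            return False
        return self.start <= t <= self.end


def select_by_time(storage_times: dict, window: SubHealthWindow) -> list:
    """Return IDs of stripe units whose storage time point is in the window."""
    return [uid for uid, t in storage_times.items() if window.contains(t)]


window = SubHealthWindow()
window.start = 100.0  # recorded when the sub-health message arrived
window.end = 160.0    # recorded when the sub-health recovery message arrived
storage_times = {"su-5": 90.0, "su-7": 120.0, "su-9": 155.0, "su-11": 170.0}
second_units = select_by_time(storage_times, window)
```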

Step 2: Determine the missing metadata of the third storage node according to the cross-backup metadata in the at least one second data stripe unit.

Step two may include a combination of any one or more of the following implementation manners one to two:

Implementation manner one: Cross-backup metadata in at least one second data stripe unit may be used as missing metadata of the third storage node. For example, referring to FIG. 9, the OSD node 2 may use GRAIN metadata 2, GRAIN metadata 3, and GRAIN metadata 1 as the missing metadata of the OSD node 1.

Implementation manner 2: One or more metadata may be selected from the metadata of the cross-backup of at least one second data stripe unit, and the selected one or more metadata may be used as the missing metadata of the third storage node.

For example, the metadata of the data block corresponding to the third storage node may be selected as the missing metadata of the third storage node. For example, referring to FIG. 9, the OSD node 2 may select the GRAIN metadata 1 corresponding to the OSD node 1, and use GRAIN metadata 1 as the missing metadata of OSD node 1.
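The two implementation manners of step two can be contrasted in a short sketch. It assumes that the cross-backup metadata of each second data stripe unit is available as a mapping from owner node identifier to metadata. The function names and the identifiers `su-7`, `osd1`, `osd2` and `osd3` are hypothetical, and the metadata strings mirror the FIG. 9 example.

```python
def all_cross_backup(second_units):
    """Implementation manner one: treat every cross-backup entry as missing."""
    return [
        (unit_id, owner, metadata)
        for unit_id, cross in second_units.items()
        for owner, metadata in cross.items()
    ]


def only_for_node(second_units, node_id):
    """Implementation manner two: keep only the entries owned by node_id."""
    return [
        (unit_id, node_id, cross[node_id])
        for unit_id, cross in second_units.items()
        if node_id in cross
    ]


# Cross-backup metadata held by OSD node 2 for one stripe unit.
second_units = {
    "su-7": {
        "osd2": "GRAIN metadata 2",
        "osd3": "GRAIN metadata 3",
        "osd1": "GRAIN metadata 1",
    }
}
everything = all_cross_backup(second_units)
selected = only_for_node(second_units, "osd1")
```

Manner one sends more than the third storage node strictly needs, while manner two sends only the metadata of the data block that the third storage node itself should hold.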

804. The fifth storage node sends the missing metadata to the third storage node.

By sending the missing metadata to the third storage node, the metadata that the third storage node missed while it was in the sub-health state can be synchronized to the third storage node. In other words, the difference metadata between the third storage node and the fifth storage node can be synchronized to the third storage node, so that after the third storage node receives the missing metadata, it can re-store the missing metadata, thereby restoring the metadata of the data blocks that were not stored during the sub-health period.

Optionally, when the third storage node is in a sub-healthy state, the read IO process may include the following steps one to three:

Step 1: The client node receives the read request, and determines at least one storage node corresponding to the target storage location according to the target storage location carried in the read request. For example, according to the mapping relationship between the LBA and the partition, the partition corresponding to the LBA can be determined, the partition view can be queried, and at least one storage node corresponding to the partition can be determined.

Step 2: The client node forwards the read request to the fifth storage node.

When the client node determines that the third storage node is in a sub-health recovery state, it can query the cross-backup relationship between the storage nodes according to the third storage node to obtain the fifth storage node, and forward the read request corresponding to the third storage node to the fifth storage node, that is, forward the read request to the storage node where the metadata backup is located.

For example, referring to FIG. 9, assuming that the third storage node is OSD node 1 and that the cross-backup metadata stored by OSD node 2 and OSD node 3 includes the metadata backup of OSD node 1, the read request corresponding to OSD node 1 can be forwarded to OSD node 2 and OSD node 3.
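The read path above can be sketched as two lookups. The range-based LBA-to-partition mapping, the constants, and all function names are assumptions made for illustration, since the application does not specify how the mapping or the partition view is represented.

```python
# Hypothetical constants: the application does not specify the LBA mapping.
PARTITION_COUNT = 4
LBAS_PER_PARTITION = 1024


def partition_of(lba):
    """Map a logical block address to a partition (assumed range-based)."""
    return (lba // LBAS_PER_PARTITION) % PARTITION_COUNT


def nodes_for_lba(lba, partition_view):
    """Step 1: look up the storage nodes of the partition holding this LBA."""
    return partition_view[partition_of(lba)]


def forward_targets(owner, recovering, cross_backup):
    """Step 2: pick where to send a read that corresponds to `owner`.

    If the owner is in a sub-health recovery state, the read goes to the
    nodes holding its metadata backup; otherwise it goes to the owner.
    """
    if owner in recovering:
        return [n for n in cross_backup.get(owner, []) if n not in recovering]
    return [owner]


partition_view = {0: ["osd1", "osd2", "osd3"]}
cross_backup = {"osd1": ["osd2", "osd3"]}  # as in the FIG. 9 example
targets = forward_targets("osd1", recovering={"osd1"}, cross_backup=cross_backup)
```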

Optionally, the specific process of recovering the missing metadata for the third storage node may include the following steps one to three:

Step 1: The third storage node receives write requests from the client node and the fifth storage node, and maintains a first write request record corresponding to the client node and a second write request record corresponding to the fifth storage node.

The write request from the client node carries the data stripe unit and the target storage location. The first write request record is used to record the data and metadata that the third storage node needs to store while it is in the sub-health recovery state. Whenever the third storage node receives a write request from the client node, it can record that write request in the first write request record. The write request from the fifth storage node carries the missing metadata, and the second write request record is used to record the missing metadata that the third storage node has received and needs to store. When the third storage node receives a write request from the fifth storage node, it records that write request in the second write request record.

Step 2: After receiving the missing metadata, merge the multiple write requests according to the first write request record and the second write request record to obtain at least one target write request.

Specifically, all write requests in the first write request record and the second write request record can be determined and sorted in the order of the timestamps they carry. For the sorted write requests, it can be determined, according to the target storage locations they carry, whether there are write requests with the same target storage location. When multiple write requests carry the same target storage location, those write requests are merged: the most recent write request among them is retained and the other write requests are filtered out. At least one target write request is finally obtained.

Step 3: Store metadata carried in the at least one target write request.

When the metadata has been stored, the third storage node is restored to a healthy state. Optionally, after the third storage node is restored to a healthy state, it may send a message to the MDC node indicating that the third storage node is in a healthy state, and the MDC node may broadcast the health message of the third storage node to the client node and each storage node.
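The merge in step two amounts to a last-writer-wins reduction keyed by target storage location. The sketch below assumes integer timestamps and a hypothetical `WriteRequest` record. When two requests carry equal timestamps, the request from the second record wins, which is an arbitrary choice made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRequest:
    timestamp: int
    target_location: str
    metadata: str
    source: str  # "client" or the identifier of the sending storage node


def merge_write_requests(first_record, second_record):
    """Keep only the most recent write request per target storage location."""
    # sorted() is stable, so on equal timestamps second_record entries come last.
    ordered = sorted(first_record + second_record, key=lambda r: r.timestamp)
    latest = {}
    for request in ordered:
        latest[request.target_location] = request  # later requests overwrite
    return sorted(latest.values(), key=lambda r: r.timestamp)


first_record = [
    WriteRequest(1, "lba0", "m-a", "client"),
    WriteRequest(5, "lba1", "m-b", "client"),
]
second_record = [
    WriteRequest(3, "lba0", "m-c", "osd2"),
    WriteRequest(2, "lba1", "m-d", "osd2"),
]
target_requests = merge_write_requests(first_record, second_record)
```

Here the client's newer write to `lba1` supersedes the older missing metadata from the fifth storage node, so stale backup metadata never overwrites data written during recovery.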

It should be noted that the third storage node being in a sub-health state is only an exemplary scenario of missing metadata, not a mandatory one. Accordingly, the third storage node being in a sub-health recovery state is only an exemplary scenario for data recovery, not a mandatory one. In another exemplary scenario, the third storage node may also lack metadata for other reasons, such as equipment failure, temporary power outage, or insufficient memory capacity. Correspondingly, the third storage node may also perform data recovery in other scenarios, for example, when the third storage node returns to normal, or when the third storage node receives a data recovery instruction. This embodiment does not limit the scenario in which the third storage node recovers data.

The method provided in this embodiment introduces a mechanism for cross-backup of the metadata of data blocks in an EC stripe. By storing the data blocks and the cross-backup metadata together in the data stripe units, it ensures that different storage nodes store the metadata of each other's data blocks. Even if the metadata of a certain storage node is lost, since the metadata backup of that storage node is pre-stored in the data stripe units of other storage nodes, the other storage nodes can use the data stripe units they store to obtain the missing metadata of the storage node and synchronize it to the storage node, thereby reducing the probability of metadata loss, greatly improving the reliability and security of data storage, and improving the storage performance of the distributed storage system. In particular, if a storage node lacks metadata due to a brief sub-health episode (a sub-health flash), then when the storage node is in a sub-health recovery state, other storage nodes can, based on the metadata differences recorded in their locally recorded sub-health logs, synchronize the missing metadata to the storage node recovering from sub-health. The storage node can thus automatically recover the missing metadata after recovering from the sub-health state, which reduces the impact of the storage node's sub-health state on the performance of the storage system and ensures high stability and high reliability of the distributed storage system.
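The cross-backup mechanism can be illustrated end to end with a small sketch. Two simplifications are assumed: a single XOR parity block stands in for real EC coding, which would produce k check blocks, and the cross-backup relationship is a hand-written mapping. All function names and node identifiers are hypothetical.

```python
from functools import reduce


def build_stripe_units(blocks, metadata, cross_backup):
    """Attach to each data block its own metadata plus its peers' metadata."""
    units = {}
    for node, block in blocks.items():
        cross = {node: metadata[node]}
        for peer in cross_backup[node]:
            cross[peer] = metadata[peer]
        units[node] = {"data_block": block, "cross_backup_metadata": cross}
    return units


def xor_parity(blocks):
    """Single-parity stand-in for EC coding; blocks must have equal length."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


blocks = {"osd1": b"\x01\x02", "osd2": b"\x04\x08", "osd3": b"\x10\x20"}
metadata = {
    "osd1": "GRAIN metadata 1",
    "osd2": "GRAIN metadata 2",
    "osd3": "GRAIN metadata 3",
}
cross_backup = {
    "osd1": ["osd2", "osd3"],
    "osd2": ["osd3", "osd1"],
    "osd3": ["osd1", "osd2"],
}
units = build_stripe_units(blocks, metadata, cross_backup)
check_block = xor_parity(list(blocks.values()))

# If osd1 loses its metadata, osd2 can supply it from its own stripe unit.
recovered = units["osd2"]["cross_backup_metadata"]["osd1"]
```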

FIG. 10 is a schematic structural diagram of a client node according to an embodiment of the present application. The client node includes a generating module 1001, an encoding module 1002, and a distribution module 1003.

A generating module 1001, configured to execute the above step 403;

An encoding module 1002, configured to perform the above step 404;

The distribution module 1003 is configured to perform the foregoing step 405.

Optionally, the generating module 1001 is configured to perform steps 1 to 3 in step 403 above;

Optionally, the generating module 1001 is configured to execute steps (2.1) to (2.2) in step two in step 403 above;

Optionally, the encoding module 1002 is configured to perform at least one of the first to third methods in step 404.

Optionally, the distribution module 1003 is configured to send a corresponding data stripe unit and / or check stripe unit to each storage node.

Optionally, the device further includes:

The sending module is configured to perform step 408 described above.

Optionally, the distribution module 1003 is configured to perform step 606 described above.

Optionally, the distribution module 1003 is configured to perform at least one of implementation manners 1 to 2 in step 606.

Optionally, the device further includes:

The sending module is configured to execute step two of the IO reading process in the embodiment in FIG. 6.

Optionally, the device further includes:

The query module is configured to execute step 1 of the IO reading process in the embodiment in FIG. 6 described above.

Optionally, the device further includes:

A sending module is configured to perform the foregoing step 609.

The first point that needs to be explained is that each module described in the above embodiments may specifically be a software module that executes a corresponding function in software, that is, a "module" may be a functional module composed of a group of computer programs, and the computer program may be a source program or an object program. The computer program can be implemented in any programming language. Through the above modules, the client node can implement the function of storing data based on the hardware of the processor and the memory, that is, the processor of the client node can run the software code stored in the memory of the client node to execute the corresponding software and thereby implement the function of storing data.

The second point that needs to be explained is that, when storing data, the client node provided in the above embodiment is described using only the above division of functional modules as an example. In practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the client node is divided into different functional modules to complete all or part of the functions described above. In addition, the client node provided in the foregoing embodiment belongs to the same concept as the data storage method embodiment, and its specific implementation process is described in the method embodiment in detail, and is not repeated here.

FIG. 11 is a schematic structural diagram of a storage node according to an embodiment of the present application. The apparatus includes a storage module 1101, an obtaining module 1102, and a sending module 1103.

A storage module 1101, configured to perform the above step 801;

An obtaining module 1102, configured to execute the foregoing step 803;

The sending module 1103 is configured to perform the foregoing step 804.

Optionally, the obtaining module 1102 is configured to perform steps 1 to 2 in step 803 described above;

Optionally, the selection sub-module is configured to perform at least one of the implementation manners 1 to 3 in step 1 in the foregoing step 803.

Optionally, the device further includes:

Recording module for recording sub-health logs.

Optionally, the recording module is configured to execute at least one of implementations one to two in step 607.

Optionally, the device further includes:

A receiving module for receiving a sub-health recovery message.

The first point that needs to be explained is that each module described in the above embodiments may specifically be a software module that executes a corresponding function in software, that is, a "module" may be a functional module composed of a group of computer programs, and the computer program may be a source program or an object program. The computer program can be implemented in any programming language. Through the above modules, the storage node can implement the function of recovering data based on the hardware of the processor and the memory, that is, the processor of the storage node can run the software code stored in the memory of the storage node to execute the corresponding software and thereby implement the function of recovering data.

The second point that needs to be explained is that, when recovering data, the storage node provided in the above embodiment is described using only the above division of functional modules as an example. In practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the storage node is divided into different functional modules to complete all or part of the functions described above. In addition, the storage node provided in the foregoing embodiment belongs to the same concept as the data recovery method embodiment. For the specific implementation process, refer to the method embodiment; details are not described herein again.

In an exemplary embodiment, the present application further provides a computer program product containing instructions, which when executed on a client node, enables the client node to implement the operations performed by the data storage method in the foregoing embodiment.

In an exemplary embodiment, the present application further provides a computer program product containing instructions, which when executed on a storage node, enables the storage node to implement the operations performed by the data recovery method in the foregoing embodiment.

In an exemplary embodiment, the present application further provides a data storage system. In a possible implementation manner, the system includes the client node described in the foregoing FIG. 2 embodiment and the storage node described in the foregoing FIG. 3 embodiment.

In another possible implementation manner, the system includes: the client node described in the foregoing FIG. 10 embodiment and the storage node described in the foregoing FIG. 11 embodiment.

In an exemplary embodiment, the present application further provides a chip, where the chip includes a processor and / or program instructions, and when the chip runs, implements operations performed by the data storage method in the foregoing embodiment.

In an exemplary embodiment, the present application further provides a chip, where the chip includes a processor and / or program instructions, and when the chip runs, implements operations performed by the data recovery method in the foregoing embodiment.

All the above-mentioned optional technical solutions may be used in any combination to form optional embodiments of the present application, which are not repeated here one by one.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are wholly or partially generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital video disc (DVD)), a semiconductor medium (such as a solid state drive), or the like.

The term "and / or" in this application describes only an association relationship between related objects, and indicates that three kinds of relationships are possible. For example, A and / or B can mean three cases: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this application generally indicates that the related objects before and after it are in an "or" relationship.

The term “multiple” in the present application means two or more than two, for example, multiple data packets refer to two or more data packets.

The terms "first" and "second" in this application are used to distinguish between identical or similar items that have basically the same function and effect. Those skilled in the art can understand that the words "first" and "second" do not limit quantity or execution order.

The above description is only an optional embodiment of the present application, and is not intended to limit the present application. Any changes or replacements that a person skilled in the art can easily think of within the technical scope disclosed in this application should be covered within the scope of protection of this application.

Claims

A data storage method, characterized in that the method includes:

Generating at least one data stripe unit according to at least one data block to be stored, each data stripe unit including a data block and cross-backup metadata, the cross-backup metadata including metadata of the data block included in the data stripe unit and metadata of data blocks included in other data stripe units other than the data stripe unit;

Performing erasure code (EC) coding on the at least one data stripe unit to obtain at least one check stripe unit;

And distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node.
The method according to claim 1, wherein the generating at least one data stripe unit according to at least one data block to be stored comprises:

Backing up the metadata of the at least one data block to obtain at least one metadata backup, the at least one data block being in one-to-one correspondence with the at least one metadata backup;

For a data block in the at least one data block, selecting at least one target metadata backup corresponding to the data block from the at least one metadata backup;

Generating a data stripe unit according to the data block, the metadata of the data block, and the at least one target metadata backup.
The method according to claim 2, wherein the selecting, from the at least one metadata backup, at least one target metadata backup corresponding to the data block comprises:

Querying the cross-backup relationship between the storage nodes according to the first storage node corresponding to the data block to obtain at least one second storage node corresponding to the first storage node;

Determining a metadata backup of a data block corresponding to the at least one second storage node as the at least one target metadata backup.
The method according to claim 1, wherein the performing erasure code EC coding on the at least one data stripe unit to obtain at least one check stripe unit comprises at least one of the following steps:

Performing EC coding on the data blocks in the at least one data stripe unit to obtain a check block in the at least one check stripe unit;

Performing EC coding on the metadata in the at least one data stripe unit to obtain a metadata check block in the at least one check stripe unit.
The method according to claim 1, wherein the distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node comprises:

Determining that a third storage node of the at least one storage node is in a sub-health state;

Sending, to a fourth storage node of the at least one storage node, at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node, and the at least one check stripe unit; and / or sending, to the fourth storage node of the at least one storage node, at least one second check stripe unit other than the first check stripe unit corresponding to the third storage node, and the at least one data stripe unit.
The method according to claim 5, wherein after the sending to the fourth storage node of the at least one storage node at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node and the at least one check stripe unit, the method further includes:

When a read request is received, sending the read request to a fifth storage node, where the read request is used to instruct to read a data block of the first data stripe unit, and the cross-backup metadata of the data stripe unit stored by the fifth storage node includes metadata of a data block of the first data stripe unit.
The method according to claim 5, wherein the sending to the fourth storage node of the at least one storage node at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node and the at least one check stripe unit includes at least one of the following steps:

Writing a sub-health mark of the third storage node to the at least one second data stripe unit, where the sub-health mark is used to indicate that the third storage node is in a sub-health state;

Sending a write request to the fourth storage node, where the write request carries a sub-health mark of the third storage node and a second data stripe unit corresponding to the fourth storage node.
A data recovery method, characterized in that the method includes:

Storing at least one data stripe unit, each data stripe unit including a data block and cross-backup metadata, the cross-backup metadata including metadata of the data block of the data stripe unit and metadata of data blocks included in data stripe units other than the data stripe unit;

Obtaining the missing metadata of the third storage node according to the at least one data stripe unit;

Sending the missing metadata to the third storage node.
The method according to claim 8, wherein the acquiring the missing metadata of the third storage node according to the at least one data stripe unit comprises:

Selecting at least one second data stripe unit from the at least one data stripe unit, where a storage time of the second data stripe unit falls within a period during which the third storage node is in a sub-health state;

Determining the missing metadata according to the cross-backup metadata in the at least one second data stripe unit.
The method according to claim 9, wherein the selecting at least one second data stripe unit from the at least one data stripe unit comprises at least one of the following steps:

Querying the sub-health log according to the identifier of the third storage node to obtain the at least one second data stripe unit, where the sub-health log is used to indicate the data stripe units stored while the third storage node is in a sub-health state;

Selecting a data stripe unit carrying a sub-health mark of the third storage node as the at least one second data stripe unit, where the sub-health mark is used to indicate that the third storage node is in a sub-health state.
The method according to claim 10, wherein before the querying the sub-health log according to the identifier of the third storage node, the method further comprises:

Recording the sub-health log for the third storage node while the third storage node is in a sub-health state.
The method according to claim 11, wherein the recording the sub-health log for the third storage node while the third storage node is in a sub-health state comprises at least one of the following steps:

When a received write request carries a sub-health mark of the third storage node and a data stripe unit, writing a sub-health record corresponding to the third storage node to the sub-health log;

When a received data stripe unit carries the sub-health mark of the third storage node, writing a sub-health record corresponding to the third storage node to the sub-health log.
The method according to claim 8, wherein before the acquiring the missing metadata of the third storage node according to the at least one data stripe unit, the method further comprises:

Receiving a sub-health recovery message, where the sub-health recovery message is used to indicate that the third storage node is in a sub-health recovery state.
A client node, wherein the client node includes:

A generating module, configured to generate at least one data stripe unit according to at least one data block to be stored, each data stripe unit including a data block and cross-backup metadata, the cross-backup metadata including metadata of the data block of the data stripe unit and metadata of data blocks included in data stripe units other than the data stripe unit;

An encoding module, configured to perform erasure code (EC) coding on the at least one data stripe unit to obtain at least one check stripe unit;

A distribution module, configured to distribute the at least one data stripe unit and the at least one check stripe unit to at least one storage node.
A storage node, characterized in that the storage node includes:

A storage module, configured to store at least one data stripe unit, each data stripe unit including a data block and cross-backup metadata, the cross-backup metadata including metadata of the data block of the data stripe unit and metadata of data blocks included in data stripe units other than the data stripe unit;

An obtaining module, configured to obtain the missing metadata of the third storage node according to the at least one data stripe unit;

A sending module, configured to send the missing metadata to the third storage node.
A client node, characterized in that the client node includes a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data storage method according to any one of claims 1 to 7.
A storage node, characterized in that the storage node includes a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data recovery method according to any one of claims 8 to 13.
A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data storage method according to any one of claims 1 to 7.
A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data recovery method according to any one of claims 8 to 13.