WO2020034695A1 - Data storage method, data recovery method, apparatus, device and storage medium - Google Patents

Data storage method, data recovery method, apparatus, device and storage medium

Info

Publication number
WO2020034695A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
storage node
metadata
stripe unit
storage
Prior art date
Application number
PCT/CN2019/087904
Other languages
French (fr)
Chinese (zh)
Inventor
魏明昌
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2020034695A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/37 Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/37 Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35
    • H03M13/373 Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35 with erasure correction and erasure determination, e.g. for packet loss recovery or setting of erasures for the decoding of Reed-Solomon codes

Definitions

  • The present application relates to the field of storage technology, and in particular, to a data storage method, a data recovery method, an apparatus, a device, and a storage medium.
  • Erasure code (EC)
  • Each EC stripe is composed of m data blocks and k check blocks.
  • The data blocks are used to store data and the check blocks are used to recover data.
  • The data can be recovered by performing EC decoding (inverse coding) on the remaining data blocks and check blocks, thereby greatly improving the stability and reliability of data storage.
  • a distributed system usually includes a client node, a main storage node, and at least one backup storage node.
  • The client node is used to send data to the main storage node.
  • The main storage node is used to perform EC encoding and to send data blocks and check blocks to at least one backup storage node.
  • Each backup storage node is used to store data blocks or check blocks.
  • When the client node receives the data to be stored, it sends the data to the main storage node corresponding to the target storage location of the data.
  • The main storage node divides the data into m data blocks and uses a redundancy algorithm to EC-encode the m data blocks, obtaining k check blocks.
  • The main storage node itself stores one data block or check block and, after the storage succeeds, records the metadata of the stored data block or check block. The main storage node also sends the remaining m + k - 1 data blocks and check blocks to m + k - 1 backup storage nodes, each of which stores one data block or check block and, after the storage succeeds, records the metadata of the stored data block or check block.
  • The embodiments of the present application provide a data storage method, a data recovery method, an apparatus, a device, and a storage medium, which can solve the problem of poor security of the metadata of stored data blocks in related technologies.
  • the technical solution is as follows:
  • a data storage method includes:
  • each data stripe unit includes a data block and cross-backup metadata
  • the cross-backup metadata includes metadata of the data block of the data stripe unit and metadata of data blocks included in data stripe units other than the data stripe unit;
  • The method provided in this embodiment introduces a mechanism for cross-backup of the metadata of data blocks within the EC stripe, which ensures that different storage nodes store one another's data block metadata. Even if the metadata of a storage node is lost, a backup of that metadata was stored in advance in the data stripe units of other storage nodes, so the missing metadata can be obtained from the data stripe units of those other storage nodes. This reduces the probability of metadata loss, greatly improves the reliability and security of data storage, and thereby improves the storage performance of the distributed storage system.
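The cross-backup of metadata described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `DataStripeUnit` class, the helper names, the ring-order backup rule, and the `backups` parameter are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class DataStripeUnit:
    unit_id: int
    data_block: bytes
    # Cross-backup metadata: unit_id -> metadata of that unit's data block.
    cross_backup_metadata: dict = field(default_factory=dict)


def build_data_stripe_units(blocks, lbas, backups=1):
    """Hypothetical helper: unit i keeps its own metadata plus the metadata
    of the next `backups` units in ring order (an assumed placement rule)."""
    n = len(blocks)
    metadata = {i: {"lba": lbas[i], "unit_id": i} for i in range(n)}
    units = []
    for i in range(n):
        kept = {(i + j) % n: metadata[(i + j) % n] for j in range(backups + 1)}
        units.append(DataStripeUnit(i, blocks[i], kept))
    return units


def recover_metadata(units, lost_unit_id):
    """Look up the lost unit's metadata in any other unit's cross-backup."""
    for unit in units:
        if unit.unit_id != lost_unit_id and lost_unit_id in unit.cross_backup_metadata:
            return unit.cross_backup_metadata[lost_unit_id]
    return None


units = build_data_stripe_units([b"a", b"b", b"c"], [0, 8, 16])
units[1].cross_backup_metadata.clear()  # simulate metadata loss on one node
recovered = recover_metadata(units, 1)
```

Here unit 0 also holds the metadata of unit 1, so the loss on unit 1 is repaired from unit 0.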
  • generating at least one data stripe unit according to at least one data block to be stored includes:
  • the selecting, from the at least one metadata backup, at least one target metadata backup corresponding to the data block includes:
  • a metadata backup of a data block corresponding to the at least one second storage node is determined as the at least one target metadata backup.
  • performing erasure coding (EC) on the at least one data stripe unit to obtain at least one check stripe unit includes at least one of the following steps:
  • EC coding is performed on the metadata in the at least one data stripe unit to obtain a metadata check block in the at least one check stripe unit.
  • The reliability and security of metadata storage can be further improved. Specifically, when metadata in any data stripe unit is lost, the lost metadata can be restored not only by reading the cross-backup metadata stored in other data stripe units, but also by performing EC decoding on the metadata in the other stripe units and the metadata check block, thereby further reducing the probability of metadata loss and greatly improving the reliability and security of the stored metadata.
  • the distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node includes:
  • The data to be stored needs only one hop, from the client node to the storage nodes, to reach each storage node and be stored there. It does not have to travel from the client node to the main storage node and then from the main storage node to each backup storage node, which takes two hops. This saves the network delay caused by forwarding through the main storage node, thereby reducing the latency of storing data and improving the efficiency and speed of storing data, which greatly improves the storage performance of the distributed storage system.
  • the method further includes:
  • the distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node includes:
  • If a storage node is in a sub-healthy state, that storage node can be avoided immediately when storing data, and the data is stored through other storage nodes, thereby achieving rapid switching between storage nodes. This reduces the impact of a storage node's sub-health status on the performance of the storage system and ensures that the storage system can store data quickly even when a storage node is sub-healthy, thereby ensuring the reliability and stability of the storage system.
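Avoiding a sub-healthy storage node at write time can be sketched as below. The state strings, node names, and the `choose_target_nodes` helper are hypothetical and only illustrate the idea of routing stripe units to other storage nodes.

```python
def choose_target_nodes(node_states, needed):
    """Pick `needed` storage nodes, skipping any node marked sub-healthy."""
    healthy = [node for node, state in node_states.items() if state == "healthy"]
    if len(healthy) < needed:
        raise RuntimeError("not enough healthy storage nodes")
    return healthy[:needed]


# Assumed view of node states held by the client node.
states = {
    "node1": "healthy",
    "node2": "sub-healthy",
    "node3": "healthy",
    "node4": "healthy",
}
targets = choose_target_nodes(states, 3)
```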
  • the sending of a corresponding data stripe unit and / or check stripe unit to a fourth storage node of the at least one storage node includes at least one of the following steps:
  • after the sending, to the fourth storage node of the at least one storage node, of at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node and the at least one check stripe unit, the method further includes:
  • the read request is sent to a fifth storage node, where the read request is used to instruct to read a data block of the first data stripe unit, and the
  • the metadata includes metadata of a data block of the first data stripe unit.
  • before the sending of the read request to the fifth storage node, the method further includes:
  • the sending, to the fourth storage node of the at least one storage node, of at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node and the at least one check stripe unit includes at least one of the following steps:
  • the method further includes:
  • a data recovery method includes:
  • each data stripe unit includes a data block and cross-backup metadata
  • the cross-backup metadata includes metadata of the data block of the data stripe unit and metadata of data blocks included in data stripe units other than the data stripe unit;
  • The method provided in this embodiment introduces a mechanism for cross-backup of the metadata of data blocks within the EC stripe, which ensures that different storage nodes store one another's data block metadata. Even if the metadata of a certain storage node is lost, a backup of that metadata was stored in advance in the data stripe units of other storage nodes, so the other storage nodes can obtain the missing metadata from the data stripe units they store and synchronize it to the storage node. This reduces the probability of metadata loss, greatly improves the reliability and security of data storage, and thereby improves the storage performance of the distributed storage system.
  • obtaining the missing metadata of the third storage node according to the at least one data stripe unit includes:
  • the selecting at least one second data stripe unit from the at least one data stripe unit includes at least one of the following steps:
  • before querying the sub-health log according to the identifier of the third storage node, the method further includes:
  • When a storage node is missing metadata because of a brief sub-health interruption, other storage nodes can use the metadata difference recorded in their local sub-health logs to synchronize the missing metadata to the storage node once it has returned to normal. After the storage node recovers from the sub-health state, its missing metadata is thus restored automatically, which reduces the impact of the storage node's sub-health status on the performance of the storage system and ensures high stability and reliability of the distributed storage system.
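A sub-health log of this kind can be sketched as a per-peer list of metadata entries that is replayed once the peer recovers. The `SubHealthLog` class, its method names, and the entry layout (`lba`, `unit_id`) are assumptions for illustration, not structures defined by the source.

```python
class SubHealthLog:
    """Hypothetical log kept by a healthy node on behalf of a sub-healthy peer."""

    def __init__(self):
        self._missing = {}  # peer node id -> list of metadata entries

    def record(self, peer_id, metadata_entry):
        """Remember metadata the peer could not store while sub-healthy."""
        self._missing.setdefault(peer_id, []).append(metadata_entry)

    def replay(self, peer_id, peer_metadata_store):
        """Synchronize the recorded entries to the recovered peer, then clear them."""
        entries = self._missing.pop(peer_id, [])
        for entry in entries:
            peer_metadata_store[entry["lba"]] = entry["unit_id"]
        return len(entries)


log = SubHealthLog()
log.record("node3", {"lba": 16, "unit_id": 3})
log.record("node3", {"lba": 24, "unit_id": 7})

node3_store = {}  # metadata store of the recovered node (assumed to be a dict)
synced = log.replay("node3", node3_store)
```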
  • recording the sub-health log for the third storage node includes at least one of the following steps:
  • before determining the missing metadata of the third storage node according to the at least one data stripe unit, the method further includes:
  • a client node is provided to execute the foregoing data storage method.
  • the client node includes a function module for executing the data storage method described in the first aspect or any optional manner of the first aspect.
  • a storage node for performing the foregoing data recovery method.
  • the storage node includes a function module for executing the data recovery method according to the second aspect or any one of the optional aspects of the second aspect.
  • a client node includes a processor and a memory.
  • the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data storage method according to the first aspect or any optional manner of the first aspect.
  • a storage node includes a processor and a memory.
  • the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data recovery method according to the second aspect or any optional manner of the second aspect.
  • a non-transitory computer-readable storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data storage method described in the first aspect or any optional manner of the first aspect.
  • a non-transitory computer-readable storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data recovery method described in the second aspect or any optional manner of the second aspect.
  • a computer program product containing instructions which, when run on a client node, enables the client node to implement the operations performed by the data storage method described in the first aspect or any optional manner of the first aspect.
  • a computer program product containing instructions which, when run on a storage node, enables the storage node to implement the operations performed by the data recovery method described in the second aspect or any optional manner of the second aspect.
  • a data storage system includes:
  • the client node according to the third aspect and the storage node according to the fourth aspect.
  • the system includes:
  • the client node according to the fifth aspect and the storage node according to the sixth aspect.
  • a chip is provided, where the chip includes a processor and / or program instructions, and when the chip is running, the operations performed by the data storage method described in the first aspect or any optional manner of the first aspect are implemented.
  • a chip is provided, where the chip includes a processor and / or program instructions, and when the chip is running, the operations performed by the data recovery method described in the second aspect or any optional manner of the second aspect are implemented.
  • FIG. 1 is a system architecture diagram of a distributed storage system according to an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a client node according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a storage node according to an embodiment of the present application.
  • FIG. 4 is a flowchart of a data storage method according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a data storage provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a data storage method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a data storage provided by an embodiment of the present application.
  • FIG. 8 is a flowchart of a data recovery method according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a data storage provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a client node according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a storage node according to an embodiment of the present application.
  • Cross-backup: a technology for backing up data between different storage nodes. For example, when cross-backup is performed between node A, node B, and node C, node A can store the data of node A and a data backup of node B, node B can store the data of node B and a data backup of node C, and node C can store the data of node C and a data backup of node A. With cross-backup, even if a node loses data, it can still read the data backup from other nodes to restore its own data.
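The node A / node B / node C example can be sketched in a few lines. The `cross_backup` function and the "next node in the list" rule are illustrative assumptions that happen to reproduce the placement given in the text.

```python
def cross_backup(node_data):
    """Each node stores its own data plus a backup of the next node's data."""
    names = list(node_data)
    stored = {}
    for i, name in enumerate(names):
        neighbour = names[(i + 1) % len(names)]
        stored[name] = {name: node_data[name], neighbour: node_data[neighbour]}
    return stored


stored = cross_backup({"A": "data-A", "B": "data-B", "C": "data-C"})

# Node B loses everything; its data is still held as a backup on node A.
del stored["B"]
restored_b = next(s["B"] for s in stored.values() if "B" in s)
```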
  • Three-copy technology: a technology for redundant storage of data. It makes three copies of the data and stores them on three storage nodes respectively. With the three-copy technology, when one copy is lost, it can be restored from one of the other copies. Each piece of data occupies 3 times its size in storage space, so the disk utilization is 1/3.
  • EC technology: a technology for redundant storage of data.
  • The original data is encoded by an erasure coding algorithm to obtain redundant check blocks, and each data block and check block is stored on a different storage node.
  • Specifically, the data to be stored is divided into m data blocks, and a redundancy algorithm is used to perform EC coding on the m data blocks to generate k check blocks. The m data blocks and the k check blocks form an EC stripe. Each data block or check block can be called an EC block of the EC stripe, and each EC block can be distributed to a different storage node for storage.
  • Each EC stripe can tolerate the loss of up to k EC blocks.
  • When storage nodes fail, as long as the number of failed storage nodes does not exceed k, the EC blocks stored on the failed nodes can be recovered from the EC blocks on the non-faulty storage nodes, so distributed storage systems that use EC technology to store data have high reliability.
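The encode-and-recover cycle can be illustrated with the simplest possible erasure code: a single XOR check block, which corresponds to k = 1. This is a stand-in chosen for brevity, not the coding algorithm of the source; codes that tolerate k > 1 losses (for example Reed-Solomon) need more machinery. The function names are assumptions.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data, m):
    """Split `data` into m equal blocks (zero-padded) and append one XOR check block."""
    size = -(-len(data) // m)  # ceiling division
    padded = data.ljust(size * m, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(m)]
    return blocks + [xor_blocks(blocks)]


def recover(stripe, lost_index):
    """Rebuild one lost EC block from all surviving blocks of the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripe = encode(b"hello erasure code", m=3)  # 3 data blocks + 1 check block
rebuilt = recover(stripe, lost_index=1)
```

Because the check block is the XOR of all data blocks, XOR-ing the survivors reproduces whichever single block is missing, including the check block itself.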
  • using EC technology to store data can greatly save storage space compared to the three-copy technology. Specifically, using three-copy technology requires three times the storage space to store one copy of data, while using EC technology requires only 1.4 times the storage space to store one copy of data.
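The space comparison is simple arithmetic: replication with c copies uses c times the data size, and an m + k EC stripe uses (m + k) / m times the data size. The source does not state which m and k yield the 1.4 figure; m = 5, k = 2 below is one assumed configuration that produces it.

```python
from fractions import Fraction


def replication_overhead(copies):
    """Raw space used per unit of user data with full replication."""
    return Fraction(copies, 1)


def ec_overhead(m, k):
    """Raw space used per unit of user data for an m + k EC stripe."""
    return Fraction(m + k, m)


three_copy = replication_overhead(3)   # 3x space
three_copy_utilization = 1 / three_copy  # 1/3

ec_5_2 = ec_overhead(5, 2)             # 7/5 = 1.4x space (assumed m=5, k=2)
ec_utilization = 1 / ec_5_2            # 5/7
```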
  • FIG. 1 is a system architecture diagram of a distributed storage system according to an embodiment of the present application.
  • the distributed storage system includes a client node, at least one storage node, and a metadata controller (MDC) node.
  • ECS Elastic Compute Service
  • VBS Volume Backup Service
  • the system provided in Figure 1 can provide object storage services for customers.
  • Client nodes are also called storage clients. Client nodes can interact with upper-layer applications or external hosts, receive data from upper-layer applications or external hosts, and distribute data to storage nodes for data storage.
  • the client node can be a server, a personal computer (hereinafter referred to as a PC), a notebook computer, etc.
  • The client node can be an independent device; for example, it can be one or more program modules on a device, or a virtual machine or container running on one device. The client node can also be a cluster of multiple devices; for example, it can be a collective name for multiple program modules distributed on multiple devices.
  • the storage node may store data stripe units and / or check stripe units, receive read requests and / or write requests, and access locally stored data and metadata.
  • the storage node may be an Object-based Storage Device (OSD) node, a Network Attached Storage (NAS) node, a Storage Area Network (SAN) node, etc.
  • Storage nodes can be servers, PCs, laptops, etc.
  • the storage methods of the storage nodes include, but are not limited to, object storage, block storage, and file storage.
  • Storage nodes can be physical storage nodes, or logical storage nodes obtained by logically dividing physical storage nodes.
  • The MDC service node can be used to maintain a partitioned view, which can include the mapping relationship between partitions and storage nodes and the current state of each storage node. When the partitioned view changes, the changed partitioned view can be synchronized to the client node as well as to each storage node.
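One way to picture the partitioned view is as a versioned record that maps partitions to storage nodes and tracks each node's state. The `PartitionView` class, the version counter, and the `sync` helper are assumptions made for this sketch; the source does not describe how synchronization is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionView:
    version: int = 0
    partition_to_nodes: dict = field(default_factory=dict)  # partition id -> node ids
    node_state: dict = field(default_factory=dict)          # node id -> state string

    def set_node_state(self, node_id, state):
        """Change a node's state and bump the version so holders can resync."""
        if self.node_state.get(node_id) != state:
            self.node_state[node_id] = state
            self.version += 1


def sync(view, holder_versions):
    """Bring every holder (client or storage node) up to the current view version."""
    stale = [name for name, v in holder_versions.items() if v < view.version]
    for name in stale:
        holder_versions[name] = view.version
    return stale


view = PartitionView(
    partition_to_nodes={0: ["node1", "node2", "node3"]},
    node_state={"node1": "healthy", "node2": "healthy", "node3": "healthy"},
)
holders = {"client": 0, "node1": 0}
view.set_node_state("node2", "sub-healthy")
updated = sync(view, holders)
```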
  • the MDC service node can be a server, PC, laptop, etc.
  • The MDC service node can be an independent device; for example, it can be one or more program modules on a device, or a virtual machine or container running on a device. It can also be a cluster composed of multiple devices; for example, it can be a collective name for multiple program modules distributed on multiple devices.
  • the ECS service node is used to allocate a blank EC stripe, and sends the blank EC stripe to the client node, so that the client node writes data to the blank EC stripe.
  • An ECS service node can be an independent device; for example, it can be one or more program modules on one device, or a virtual machine or container running on one device. It can also be a cluster of multiple devices; for example, it can be a collective name for multiple program modules distributed across multiple devices.
  • the VBS node can provide a virtual hard disk function to an upper-layer application or an external host, and the VBS node can receive a read request or a write request from an upper-layer application or an external host.
  • A VBS node can be an independent device; for example, it can be one or more program modules on one device, or a virtual machine or container running on one device. It can also be a cluster of multiple devices; for example, it can be a collective name for multiple program modules distributed across multiple devices.
  • FIG. 2 is a schematic structural diagram of a client node according to an embodiment of the present application.
  • The client node 200 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 201 and one or more memories 202, where the memory 202 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 201 to implement the data storage method in each method embodiment described below.
  • the client node 200 may further have components such as a wired or wireless network interface and an input-output interface for input and output.
  • the client node 200 may further include other components for implementing device functions, and details are not described herein.
  • A non-transitory computer-readable storage medium is also provided, such as a memory including instructions, where the instructions can be executed by a processor in a client node to complete the data storage method in the following embodiments.
  • The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • FIG. 3 is a schematic structural diagram of a storage node according to an embodiment of the present application.
  • The storage node 300 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 301 and one or more memories 302, where the one or more memories 302 may be hard disks mounted on the storage node, and the hard disks may be logical virtual hard disks or physical hard disks.
  • the one or more memories 302 store at least one instruction, and the at least one instruction is loaded and executed by the processor 301 to implement the data recovery method in each method embodiment described below.
  • the storage node may also have components such as a wired or wireless network interface and an input-output interface for input and output.
  • the storage node 300 may further include other components for implementing device functions, and details are not described herein.
  • A non-transitory computer-readable storage medium is also provided, such as a memory including instructions, where the instructions may be executed by a processor in a storage node to complete the data recovery method in the following embodiments.
  • The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • FIG. 4 is a flowchart of a data storage method provided by an embodiment of the present application. As shown in FIG. 4, the method is performed through interaction between a client node and at least one storage node, and includes the following steps:
  • a client node obtains at least one blank EC stripe, and each blank EC stripe is used to carry at least one data stripe unit and at least one check stripe unit.
  • A client node can interact with upper-layer applications or external hosts, receive data from upper-layer applications or external hosts, and distribute the data to various storage nodes for storage.
  • The client node can be a server, personal computer, laptop, etc.
  • The client node can be an independent device; for example, it can be one or more program modules on a device, or a virtual machine or container running on a device.
  • The client node may also be a cluster composed of multiple devices; for example, it may be a collective name for multiple program modules distributed on multiple devices. This embodiment does not limit the physical form of the client node.
  • the client node can obtain at least one blank EC stripe, so that when data is stored, data blocks, check blocks, and metadata are written to the blank EC stripe.
  • Each blank EC stripe can be regarded as a row of blank data blocks.
  • Each blank EC stripe can carry a stripe identifier and at least one stripe unit identifier.
  • The stripe identifier is used to identify the corresponding EC stripe, and may be the identification (ID) of the EC stripe, for example, the number or name of the EC stripe.
  • the stripe unit identifier is used to identify the corresponding stripe unit, and may be the ID of the stripe unit, for example, the number and name of the stripe unit.
  • The ECS service may allocate at least one blank EC stripe to the client node and send the allocated at least one blank EC stripe to the client node, and the client node may receive the at least one blank EC stripe from the ECS service, thereby obtaining at least one blank EC stripe.
  • the distributed storage system may include multiple client nodes, and the ECS service may allocate a corresponding blank EC stripe to each client node, and send a corresponding blank EC stripe to each client node.
  • The stripe identifiers of the blank EC stripes assigned to different client nodes may be different. For example, blank EC stripe 1 to blank EC stripe 100 are allocated to client node 1, and blank EC stripe 101 to blank EC stripe 200 are allocated to client node 2.
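Allocating blank EC stripes with non-overlapping stripe identifiers can be sketched with a simple counter. `StripeAllocator` is a hypothetical stand-in for the ECS service, and the default of 5 stripe units per stripe is an assumption.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class BlankECStripe:
    stripe_id: int   # stripe identifier
    unit_ids: list   # stripe unit identifiers carried by the blank stripe


class StripeAllocator:
    """Hypothetical stand-in for the ECS service."""

    def __init__(self, units_per_stripe=5):
        self._next_id = count(1)
        self._units = units_per_stripe

    def allocate(self, n):
        """Hand out n blank EC stripes whose identifiers are never reused."""
        stripes = []
        for _ in range(n):
            unit_ids = list(range(1, self._units + 1))
            stripes.append(BlankECStripe(next(self._next_id), unit_ids))
        return stripes


ecs = StripeAllocator()
client1_stripes = ecs.allocate(100)  # stripe identifiers 1..100
client2_stripes = ecs.allocate(100)  # stripe identifiers 101..200
```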
  • this step 401 is only an optional step, not a mandatory step. This embodiment does not limit whether to perform step 401 or not.
  • The client node may not need to perform the step of obtaining a blank EC stripe.
  • At least one blank EC stripe may be stored in advance, and data may be written into the pre-stored blank EC stripe when storing data.
  • the client node obtains data to be stored and a target storage location of the data.
  • the data to be stored may be IO data, that is, data input and / or output to the distributed storage system.
  • the target storage location refers to the location where the data needs to be stored. By obtaining the target storage location of the data, you can determine where to store the data so that when the data is read later, the stored data can be read from the target storage location.
  • the storage space of the distributed storage system can be used as one or more logical volumes. Each logical volume can be divided into multiple logical blocks. One or more logical blocks can be used to store data.
  • The target storage location can be a logical block address (LBA), which may include a starting LBA and a data length.
  • the target storage location may also be an identifier of a logical block, for example, it may be a key of the logical block.
  • the client node may receive the write request, parse the write request, and obtain the data carried by the write request and the target storage location.
  • An upper-layer application, an external host, or a VBS node may generate a write request and send it to the client node, so that the client node receives the write request from the upper-layer application, external host, or VBS node.
  • the write request may be triggered by a user's input operation.
  • the client node may divide the data to be stored to obtain at least one data block to be stored.
  • the data block refers to data obtained by dividing the data with the granularity of the block.
  • the data block may be GRAIN data 1, GRAIN data 2 or GRAIN data 3 in FIG. 5.
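Dividing the data at block granularity can be sketched as below. The `split_into_blocks` name and the 4-byte block size are illustrative only; the source does not specify the granularity.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Divide `data` into blocks of at most `block_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


blocks = split_into_blocks(b"0123456789", block_size=4)
```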
  • this step 402 is only an optional step, not a mandatory step.
  • The client node may store at least one data block to be stored in advance, and may proceed directly to the subsequent step 403 with the at least one data block.
  • the client node generates at least one data stripe unit according to the at least one data block to be stored, and each data stripe unit includes a data block and cross-backup metadata.
  • Metadata of a data block refers to data used to describe the data block, and the metadata of the data block can be indexed to the corresponding data block.
  • the metadata of the data block may be the mapping relationship between the target storage location and the stripe unit identifier, for example, it may be the mapping relationship between the LBA and the stripe unit identifier.
  • the metadata may be GRAIN metadata 1, GRAIN metadata 2 or GRAIN metadata 3 in FIG. 5, where GRAIN metadata 1 is the metadata of GRAIN data 1 and its content can be the mapping relationship between stripe unit 1 and LBA1; GRAIN metadata 2 is the metadata of GRAIN data 2 and its content can be the mapping relationship between stripe unit 2 and LBA2; and so on.
  • A stripe unit, also called an EC stripe unit, is a component of the EC stripe.
  • the EC stripe and the stripe unit have a relationship of whole and part.
  • Each EC stripe may include at least one stripe unit, for example, it may include 5 stripe units, 7 stripe units, and the like.
  • An EC stripe can be regarded as a row of data blocks, and a stripe unit can be regarded as a column of data blocks in the EC stripe.
  • FIG. 5 provides an example of an EC stripe.
  • the EC stripe includes 5 stripe units, that is, stripe unit 1, stripe unit 2, stripe unit 3, stripe unit 4 and stripe unit 5.
  • the stripe unit is used to store data in the EC stripe. In this embodiment, stripe units are distinguished by what they store: a stripe unit in the EC stripe that stores a data block and cross-backup metadata is called a data stripe unit.
  • a stripe unit in the EC stripe that stores at least one of a check block and a metadata check block is called a check stripe unit.
  • the EC stripe may include at least one data stripe unit and at least one check stripe unit. The number of data stripe units and check stripe units in the EC stripe can be determined according to the EC encoding algorithm. For example, an EC stripe can include 3 data stripe units and 2 check stripe units.
  • the cross-backup metadata includes the metadata of the data block of this data stripe unit and the metadata of the data blocks of data stripe units other than this data stripe unit.
  • that is, in addition to the metadata of the data block stored in the data stripe unit itself, the cross-backup metadata in a data stripe unit also includes the metadata of the data blocks stored by one or more other data stripe units.
  • the cross-backup metadata may be the part in the bold box in FIG. 5.
  • the data block stored by data stripe unit 1 is GRAIN data 1, and its cross-backup metadata includes GRAIN metadata 1 (metadata of GRAIN data 1), GRAIN metadata 2 (metadata of GRAIN data 2), and GRAIN metadata 3 (metadata of GRAIN data 3). Data stripe unit 1 therefore stores not only the metadata of its own data block (GRAIN metadata 1), but also the metadata of the data block stored in data stripe unit 2 (GRAIN metadata 2) and the metadata of the data block stored in data stripe unit 3 (GRAIN metadata 3).
  • the data block stored by data stripe unit 2 is GRAIN data 2, and its cross-backup metadata includes GRAIN metadata 2 (metadata of GRAIN data 2), GRAIN metadata 1 (metadata of GRAIN data 1), and GRAIN metadata 3 (metadata of GRAIN data 3). Data stripe unit 2 therefore stores not only the metadata of its own data block (GRAIN metadata 2), but also the metadata of the data blocks stored by data stripe unit 1 (GRAIN metadata 1) and data stripe unit 3 (GRAIN metadata 3).
  • the cross-backup metadata can be of the following two types (1) and (2):
  • (1) the cross-backup metadata may include the metadata of all data blocks to be stored. Accordingly, the cross-backup metadata of every data stripe unit may be the same, and each data stripe unit stores the metadata of the data blocks in all data stripe units. For example, if the data is divided into m data blocks, the cross-backup metadata may be the metadata of the m data blocks, that is, each data stripe unit stores the metadata of m data blocks.
  • for example, the cross-backup metadata in data stripe unit 1, data stripe unit 2, and data stripe unit 3 may all be the same, namely the metadata of these three data blocks.
  • in this way, each data stripe unit stores the metadata of the data blocks of all data stripe units. If the metadata of the data block of a data stripe unit is lost, the lost metadata can be read and recovered from any of the remaining data stripe units, which improves the reliability and security of metadata storage and the availability of the distributed storage system.
  • (2) the cross-backup metadata may include the metadata of only some of the data blocks to be stored. Accordingly, each data stripe unit may store the metadata of the data blocks in some of the data stripe units.
  • the cross-backup metadata may include metadata of at least two data blocks, that is, each data stripe unit stores metadata of at least two data blocks.
  • the cross-backup metadata may include metadata of data blocks in one or more adjacent data stripe units, for example, the metadata of the data block of the previous data stripe unit and the metadata of the data block of the next data stripe unit. That is, the cross-backup metadata of the i-th data stripe unit may include the metadata of the data block of the i-th data stripe unit, the metadata of the data block of the (i-1)-th data stripe unit, and the metadata of the data block of the (i+1)-th data stripe unit.
  • in this way, each data stripe unit stores the metadata of the data blocks of only some data stripe units, which saves storage space in the distributed storage system.
  • the cross-backup metadata in the data stripe unit can be designed according to requirements as described in (1) or (2) above. How many data blocks' metadata the cross-backup metadata includes, and which data blocks' metadata it includes, can be designed according to actual requirements, which is not limited in this embodiment.
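  • The two types of cross-backup metadata can be sketched in Python as layouts that map each data stripe unit to the data blocks whose metadata it stores. The function names are hypothetical, and the wrap-around at the first and last stripe unit in the adjacent layout is an assumption; this embodiment does not say how the boundaries are handled.

```python
from typing import Dict, List


def full_cross_backup(m: int) -> Dict[int, List[int]]:
    """Type (1): every data stripe unit holds the metadata of all m data blocks."""
    return {i: list(range(1, m + 1)) for i in range(1, m + 1)}


def adjacent_cross_backup(m: int) -> Dict[int, List[int]]:
    """Type (2): unit i holds the metadata of blocks i, i-1 and i+1.

    Wrapping around at the ends is an assumption made for this sketch.
    """
    layout = {}
    for i in range(1, m + 1):
        prev_unit = m if i == 1 else i - 1
        next_unit = 1 if i == m else i + 1
        # dict.fromkeys drops duplicates (possible for small m) and keeps order.
        layout[i] = list(dict.fromkeys([i, prev_unit, next_unit]))
    return layout


full = full_cross_backup(3)
ring = adjacent_cross_backup(5)
```

  • With 5 data stripe units, the full layout stores 5 metadata entries per unit while the adjacent layout stores 3, which illustrates the trade-off between redundancy and storage space described above.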
  • a data stripe unit refers to a stripe unit in the EC stripe that is used to store a data block and cross-backup metadata. It can be an EC block in an EC stripe, and each data stripe unit can include a data block. Exemplarily, referring to FIG. 5, the data stripe unit may be stripe unit 1, stripe unit 2, or stripe unit 3. Each EC stripe may include at least one data stripe unit, and the number of data stripe units in an EC stripe may be equal to the number of data blocks. For example, assuming that the data to be stored is divided into m data blocks, the EC stripe of the data may include m data stripe units.
  • the cross-backup metadata in the data stripe unit may consist of the metadata of the data block in the data stripe unit and at least one target metadata backup, where a target metadata backup refers to a metadata backup of the data block of a data stripe unit other than this data stripe unit.
  • the process of generating any data stripe unit may include the following steps one to three:
  • Step 1: Back up the metadata of the at least one data block to obtain at least one metadata backup, where the at least one data block corresponds to the at least one metadata backup in a one-to-one manner.
  • for each data block, the metadata of the data block can be determined and backed up to obtain a metadata backup of that data block. By backing up the metadata of every data block in this way, at least one metadata backup corresponding to the at least one data block is obtained.
  • for example, the metadata of GRAIN data 1 is backed up as GRAIN metadata 1, and the metadata of GRAIN data 2 is backed up as GRAIN metadata 2.
  • the metadata of a data block may be determined according to the target storage location corresponding to each data stripe unit and the stripe unit ID of each data stripe unit: the mapping relationship between the target storage location and the stripe unit ID is recorded and used as the metadata of the data block.
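  • Step 1 can be sketched as follows, assuming the metadata of a data block is the mapping between an LBA and a stripe unit ID as described above. The class name `GrainMetadata`, the function `backup_all`, and the sample LBA values are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class GrainMetadata:
    lba: int             # target storage location of the data block
    stripe_unit_id: int  # stripe unit that carries the data block


def backup_all(metadata_list: List[GrainMetadata]) -> Dict[int, GrainMetadata]:
    """Return one metadata backup per data block, keyed by stripe unit ID."""
    # replace() with no changes creates an independent copy of each record.
    return {md.stripe_unit_id: replace(md) for md in metadata_list}


metadata_list = [
    GrainMetadata(lba=0x1000, stripe_unit_id=1),
    GrainMetadata(lba=0x2000, stripe_unit_id=2),
]
backups = backup_all(metadata_list)
```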
  • Step 2: For any data block of the at least one data block, select, from the at least one metadata backup, at least one target metadata backup corresponding to the data block.
  • this step may specifically include the following steps (2.1) to (2.2):
  • any storage node of a distributed storage system is referred to as a first storage node, and a storage node having a cross-backup relationship with the first storage node is referred to as a second storage node.
  • the terms "first storage node" and "second storage node" are only used to distinguish different storage nodes, and should not be understood as expressing or implying an order among the storage nodes, their relative importance, or the total number of storage nodes.
  • a cross-backup relationship refers to a relationship in which each storage node stores metadata backups with each other.
  • a cross-backup relationship can have at least two functions. First, when storing data, the cross-backup relationship determines which storage nodes' metadata backups are stored by any given storage node, so that the cross-backup metadata each storage node needs to store can be determined according to the cross-backup relationship. For example, if the cross-backup relationship instructs OSD node 1 to store the metadata backups of OSD node 2 and OSD node 3, then when storing data, the data stripe unit of OSD node 1 can store the metadata backups of OSD node 2 and OSD node 3.
  • Second, when recovering data, the cross-backup relationship determines which storage nodes store any given metadata backup, so that the metadata backup can be read from the cross-backup metadata stored by the corresponding storage nodes. For example, if the cross-backup relationship indicates that the metadata backup of OSD node 1 is stored in OSD node 2 and OSD node 3, then when the metadata of OSD node 1 is lost, the metadata backup of OSD node 1 can be read from OSD node 2 and OSD node 3.
  • the data form of the cross-backup relationship can be designed according to actual needs.
  • the cross-backup relationship can include one or more of the following (1) and (2):
  • (1) the cross-backup relationship may be a mapping relationship between the identifiers of storage nodes, and the cross-backup relationship may include the identifiers of multiple storage nodes.
  • the process of determining at least one second storage node may include: using the identifier of the first storage node as an index, querying the cross-backup relationship to obtain the identifier of the at least one second storage node, and determining the at least one second storage node according to the identifier of the at least one second storage node.
  • the storage node identifier is used to identify the corresponding storage node, and may include the ID, name, or number of the storage node.
  • (2) the cross-backup relationship may be a mapping relationship between stripe unit IDs, and the cross-backup relationship may include the stripe unit IDs of multiple data stripe units.
  • the process of determining at least one second storage node may include: determining the stripe unit identifier of the first data stripe unit according to the first storage node; using the stripe unit identifier of the first data stripe unit as an index, querying the cross-backup relationship to obtain the stripe unit identifier of at least one second data stripe unit; determining the identifier of the storage node corresponding to the stripe unit identifier of the at least one second data stripe unit, to obtain the identifier of the at least one second storage node; and determining the at least one second storage node according to the identifier of the at least one second storage node.
  • the MDC node may record the cross-backup relationship and send the cross-backup relationship to the client node.
  • the MDC node can record the cross-backup relationship in the partitioned view and send the partitioned view to the client node.
  • the client node can query the partitioned view to obtain the cross-backup relationship in the partitioned view.
  • Step (2.2): Determine the metadata backup of the data block corresponding to the at least one second storage node as the at least one target metadata backup.
  • a metadata backup of the data block corresponding to the at least one second storage node may be selected from the generated at least one metadata backup as the at least one target metadata backup. For example, assume that the first storage node is OSD node 1 and that, according to the cross-backup relationship between the storage nodes, the OSD nodes corresponding to OSD node 1 are OSD node 2 and OSD node 3. The metadata backups of the data blocks of OSD node 2 and OSD node 3, namely GRAIN metadata 2 and GRAIN metadata 3, are then used as the target metadata backups of OSD node 1.
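  • Steps (2.1) and (2.2) can be sketched as a lookup in a cross-backup relationship keyed by storage node identifier. The table `CROSS_BACKUP`, the node names, and the function `select_target_backups` are hypothetical; they mirror the OSD node example above rather than any defined interface.

```python
from typing import Dict, List

# Hypothetical cross-backup relationship keyed by storage node identifier.
CROSS_BACKUP: Dict[str, List[str]] = {
    "osd1": ["osd2", "osd3"],
    "osd2": ["osd1", "osd3"],
    "osd3": ["osd1", "osd2"],
}


def select_target_backups(first_node: str,
                          relation: Dict[str, List[str]],
                          backups_by_node: Dict[str, str]) -> List[str]:
    """Find the second storage nodes (2.1), then pick their metadata backups (2.2)."""
    second_nodes = relation[first_node]
    return [backups_by_node[node] for node in second_nodes]


backups_by_node = {
    "osd1": "GRAIN metadata 1",
    "osd2": "GRAIN metadata 2",
    "osd3": "GRAIN metadata 3",
}
targets = select_target_backups("osd1", CROSS_BACKUP, backups_by_node)
```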
  • selecting the target metadata backup according to the cross-backup relationship is only an optional method, not a mandatory one.
  • the target metadata backup may also be selected in other ways. For example, for any data stripe unit, the metadata backups of all data blocks can be used as the target metadata backups, so that the metadata of the data blocks of all data stripe units is stored in that data stripe unit.
  • Step 3: Generate a data stripe unit according to the data block, the metadata of the data block, and the at least one target metadata backup.
  • the data block, the metadata of the data block, and the at least one target metadata backup may be written to any stripe unit in the EC stripe, and the stripe unit after writing is used as a data stripe unit. The data stripe unit thus carries the data block, the metadata of the data block, and the at least one target metadata backup.
  • the metadata of the data block and the at least one target metadata backup together form the cross-backup metadata.
  • the data block, the metadata of the data block, and the at least one target metadata backup may be written to any blank stripe unit in any blank EC stripe, thereby generating a data stripe unit from the blank stripe unit. It is also possible to overwrite any stripe unit in an EC stripe to which data has already been written, thereby generating a data stripe unit from that stripe unit, which is not limited in this embodiment.
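  • Step 3 can be sketched as assembling one record from the data block, its own metadata, and the selected target metadata backups. The `DataStripeUnit` class and its field names are assumptions for illustration; the actual on-disk layout of a stripe unit is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataStripeUnit:
    unit_id: int
    data_block: bytes
    # Own metadata first, followed by the target metadata backups.
    cross_backup_metadata: List[str] = field(default_factory=list)


def generate_data_stripe_unit(unit_id: int, data_block: bytes,
                              own_metadata: str,
                              target_backups: List[str]) -> DataStripeUnit:
    """Write the block, its metadata and the target backups into one stripe unit."""
    return DataStripeUnit(unit_id, data_block, [own_metadata, *target_backups])


unit1 = generate_data_stripe_unit(
    1, b"GRAIN data 1", "GRAIN metadata 1",
    ["GRAIN metadata 2", "GRAIN metadata 3"],
)
```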
  • each data stripe unit in an EC stripe can be generated.
  • an EC stripe can be regarded as a row of data blocks, and each data stripe unit can be regarded as a column of data blocks in this row.
  • for example, referring to FIG. 5, suppose the data is divided into three data blocks, namely GRAIN data 1, GRAIN data 2, and GRAIN data 3. Three data stripe units can then be generated, namely data stripe unit 1, data stripe unit 2, and data stripe unit 3, where data stripe unit 1 carries GRAIN data 1 and cross-backup metadata 1, data stripe unit 2 carries GRAIN data 2 and cross-backup metadata 2, and data stripe unit 3 carries GRAIN data 3 and cross-backup metadata 3.
  • the metadata of the data blocks is cross-backed up. After it is written to each data stripe unit, different data stripe units store each other's data block metadata. After the data stripe units are distributed to the storage nodes, different storage nodes store each other's data block metadata, so even if a storage node loses the metadata of a data block, the lost metadata can still be read and restored from other storage nodes. This improves the reliability and security of metadata storage, thereby ensuring the high availability and high reliability of the distributed storage system.
  • for example, OSD node 1, OSD node 2 and OSD node 3 store each other's data block metadata. Even if OSD node 1 loses GRAIN metadata 1, GRAIN metadata 1 can still be read and restored from OSD node 2 and OSD node 3, because they have stored GRAIN metadata 1 in advance.
  • similarly, even if OSD node 2 loses GRAIN metadata 2, GRAIN metadata 2 can still be read and restored from OSD node 1 and OSD node 3, because they have stored GRAIN metadata 2 in advance.
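  • The recovery path in the OSD example can be sketched as follows. The structure `stored_backups` (which metadata backups each surviving node holds) and the function `recover_metadata` are hypothetical names chosen for this sketch; a real system would read the backups over the network.

```python
from typing import Dict, List

CROSS_BACKUP: Dict[str, List[str]] = {
    "osd1": ["osd2", "osd3"],
    "osd2": ["osd1", "osd3"],
    "osd3": ["osd1", "osd2"],
}


def recover_metadata(lost_node: str,
                     relation: Dict[str, List[str]],
                     stored_backups: Dict[str, Dict[str, str]]) -> str:
    """Read the lost node's metadata backup from the first peer that still holds it."""
    for peer in relation[lost_node]:
        backup = stored_backups.get(peer, {}).get(lost_node)
        if backup is not None:
            return backup
    raise LookupError(f"no surviving metadata backup for {lost_node}")


# osd1 has lost its metadata; osd2 and osd3 still hold a backup of it.
stored_backups = {
    "osd2": {"osd1": "GRAIN metadata 1", "osd3": "GRAIN metadata 3"},
    "osd3": {"osd1": "GRAIN metadata 1", "osd2": "GRAIN metadata 2"},
}
recovered = recover_metadata("osd1", CROSS_BACKUP, stored_backups)
```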
  • the client node performs EC coding on at least one data stripe unit to obtain at least one check stripe unit.
  • the check stripe unit is used to recover the data stripe unit, which ensures the reliability and security of the data stripe unit. Specifically, if one or more data stripe units are lost, as long as the number of lost data stripe units is less than the total number of check stripe units, EC decoding can be performed on the remaining data stripe units and the check stripe units to recover the lost data stripe units.
  • the check stripe unit may be check stripe unit 1 or check stripe unit 2 in FIG. 5.
  • the check stripe unit can be referred to as EC stripe check protection.
  • the check stripe unit can include a check block and a metadata check block. For example, it can include one or more check blocks and one or more metadata check blocks.
  • the check block may be obtained by performing EC coding on the data block in at least one data stripe unit, and the check block may be used to recover the data block in the data stripe unit.
  • the check block may be the GRAIN metadata check block 1 and the GRAIN metadata check block 2 in FIG. 5.
  • the metadata check block may be obtained by performing EC encoding according to metadata in at least one data stripe unit, and the metadata check block may be used to recover metadata in the data stripe unit.
  • the metadata check block may be GRAIN metadata check block 1 or GRAIN metadata check block 2 in FIG. 5.
  • the client node may use an encoding algorithm to perform EC coding on at least one data stripe unit to obtain at least one check stripe unit.
  • the encoding algorithm includes, but is not limited to, Reed-Solomon encoding, Cauchy encoding, and the like. This embodiment does not limit which encoding algorithm is used for EC encoding.
  • the at least one parity strip unit and the at least one data stripe unit may form an EC stripe.
  • Each data stripe unit and each parity stripe unit is a column of data blocks in the EC stripe.
  • for example, data stripe unit 1, data stripe unit 2, and data stripe unit 3 can be EC coded to obtain check stripe unit 1 and check stripe unit 2. Data stripe unit 1, data stripe unit 2, data stripe unit 3, check stripe unit 1, and check stripe unit 2 then form an EC stripe.
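  • The idea of encoding data stripe units into a check stripe unit and decoding a lost unit can be illustrated with a single XOR parity. This is a deliberate simplification and not the Reed-Solomon or Cauchy coding mentioned above: one XOR parity tolerates only one lost unit, whereas the 3 + 2 layout of FIG. 5 needs a stronger code. All names and sample values are assumptions.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(units: List[bytes]) -> bytes:
    """Produce one check unit as the XOR of all data stripe units."""
    return reduce(xor_bytes, units)


def recover_lost_unit(surviving: List[bytes], parity: bytes) -> bytes:
    """XOR the parity with the surviving units to rebuild the single lost unit."""
    return reduce(xor_bytes, surviving, parity)


units = [b"unit-one", b"unit-two", b"unit-3!!"]  # equal-length stripe units
parity = encode_parity(units)
# Pretend the second data stripe unit was lost.
recovered = recover_lost_unit([units[0], units[2]], parity)
```

  • Because XOR is its own inverse, combining the parity with every surviving unit cancels their contribution and leaves exactly the missing unit.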
  • a manner of performing EC coding on at least one data stripe unit may include a combination of one or more of the following manners 1 to 3.
  • Manner 1: Perform EC coding on the data blocks in at least one data stripe unit to obtain a check block in at least one check stripe unit.
  • At least one data block may be EC coded to obtain at least one check block; one or more check blocks may be written to any stripe unit, and the stripe unit after writing is used as a check stripe unit.
  • a check block may be written to any blank stripe unit in any blank EC stripe, or may be overwritten onto any stripe unit in an EC stripe to which data has been written, which is not limited in this embodiment.
  • for example, GRAIN data 1, GRAIN data 2 and GRAIN data 3 can be EC coded to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2. GRAIN metadata check block 1 is written to stripe unit 4 and GRAIN metadata check block 2 is written to stripe unit 5. After the writing is completed, stripe unit 4 can be used as check stripe unit 1, and stripe unit 5 can be used as check stripe unit 2.
  • Manner 2: Perform EC coding on the metadata in at least one data stripe unit to obtain a metadata check block in at least one check stripe unit.
  • the cross-backup metadata of the at least one data stripe unit may be EC coded to obtain a metadata check block.
  • for example, EC coding can be performed on cross-backup metadata 1, cross-backup metadata 2 and cross-backup metadata 3 to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2.
  • alternatively, one or more pieces of metadata may be selected from the cross-backup metadata of each data stripe unit, and the metadata selected from the at least one data stripe unit may be EC coded.
  • the cross-backup metadata in each data stripe unit may include multiple rows, with each row carrying one piece of metadata. The metadata in the same row of the at least one data stripe unit may be selected and EC coded. For example, referring to FIG. 5, the cross-backup metadata in each data stripe unit occupies 3 rows, and these 3 rows carry GRAIN metadata 1, GRAIN metadata 2 and GRAIN metadata 3, respectively.
  • the metadata check block can be written to any blank stripe unit in any blank EC stripe, or can be overwritten onto any stripe unit in an EC stripe to which data has been written. The stripe unit after the writing is completed can be used as a check stripe unit.
  • for example, GRAIN metadata check block 1 can be written to stripe unit 4 and GRAIN metadata check block 2 can be written to stripe unit 5. Stripe unit 4 can then be used as check stripe unit 1, and stripe unit 5 can be used as check stripe unit 2.
  • in this way, the reliability and security of metadata storage can be further improved. Specifically, when the metadata in any data stripe unit is lost, the lost metadata can be read and restored not only from the cross-backup metadata stored in other data stripe units, but also by performing EC decoding on the metadata in the remaining data stripe units and the metadata check block. This further reduces the probability of metadata loss and greatly improves the reliability and security of the stored metadata.
  • Manner 3: Perform EC coding on the data blocks and metadata in at least one data stripe unit to obtain a check block in at least one check stripe unit.
  • Manner 3 is a combination of Manner 1 and Manner 2.
  • At least one data block and the metadata of the at least one data block can be EC coded together to obtain at least one metadata check block. The metadata check block can be written to any stripe unit, and the stripe unit after writing is used as a check stripe unit.
  • for example, all data carried in stripe unit 1, stripe unit 2, and stripe unit 3 can be EC coded to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2. GRAIN metadata check block 1 is written to stripe unit 4, and stripe unit 4 after writing is used as check stripe unit 1; GRAIN metadata check block 2 is written to stripe unit 5, and stripe unit 5 after writing is used as check stripe unit 2.
  • the client node distributes at least one data stripe unit and at least one check stripe unit to at least one storage node.
  • At least one data stripe unit and at least one check stripe unit may be distributed to at least one storage node corresponding to a target storage location of the data.
  • the storage nodes corresponding to the target storage location may be determined according to the target storage location of the data, to obtain the at least one storage node.
  • a data stripe unit and/or a check stripe unit may be allocated to each storage node, and the data stripe unit and/or check stripe unit allocated to each storage node is distributed to that storage node.
  • a mapping relationship between storage locations and storage node identifiers may be established in advance. This mapping relationship may be queried according to the target storage location to obtain the identifier of the at least one storage node corresponding to the target storage location, and the storage node identified thereby is used as the storage node corresponding to the target storage location.
  • a partition allocation algorithm is used to generate a partitioned view.
  • the partitioned view is used to indicate the storage node corresponding to each partition.
  • the partitioned view may include at least one partition identifier and the identifier of the corresponding at least one storage node. The client node may determine the partition identifier corresponding to the target storage location according to the target storage location of the data, query the partitioned view according to the partition identifier to obtain the identifier of the at least one storage node, and use the storage node corresponding to that identifier as the storage node corresponding to the target storage location.
  • for example, when the target storage location is an LBA, the LBA can be divided by the total number of partitions, and the remainder is used as the partition identifier, which determines the partition to which the LBA belongs.
  • the correspondence between LBAs and partitions can be referred to as the relationship in which LBAs are scattered to partitions, and the correspondence between partitions and storage nodes can be referred to as the deployment relationship in which partitions are deployed to storage nodes.
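  • The two-level lookup described above (LBA to partition by remainder, then partition to storage nodes through the partitioned view) can be sketched as follows. The partition count, the contents of `PARTITION_VIEW`, and the function names are assumptions chosen for illustration.

```python
from typing import Dict, List

PARTITION_COUNT = 4  # hypothetical total number of partitions

# Hypothetical partitioned view: partition identifier -> storage node identifiers.
PARTITION_VIEW: Dict[int, List[str]] = {
    0: ["osd1", "osd2", "osd3", "osd4", "osd5"],
    1: ["osd2", "osd3", "osd4", "osd5", "osd6"],
    2: ["osd3", "osd4", "osd5", "osd6", "osd1"],
    3: ["osd4", "osd5", "osd6", "osd1", "osd2"],
}


def partition_of(lba: int, partition_count: int = PARTITION_COUNT) -> int:
    """The remainder of the LBA divided by the partition count is the partition ID."""
    return lba % partition_count


def storage_nodes_for(lba: int, view: Dict[int, List[str]],
                      partition_count: int = PARTITION_COUNT) -> List[str]:
    """Query the partitioned view with the partition ID derived from the LBA."""
    return view[partition_of(lba, partition_count)]


nodes = storage_nodes_for(10, PARTITION_VIEW)
```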
  • the data stripe units or check stripe units can be allocated to the storage nodes in turn, in the order of the storage node identifiers.
  • for example, data stripe unit 1 can be allocated to storage node 1 and data stripe unit 2 can be allocated to storage node 2.
  • alternatively, a data stripe unit or a check stripe unit may be randomly allocated to each storage node.
  • other methods may also be used to allocate the data stripe units or check stripe units. This embodiment does not limit how the data stripe units or check stripe units are allocated.
  • At least one write request can be generated according to the data stripe unit and / or the verification stripe unit corresponding to each storage node, and at least one write request is distributed.
  • a write request may be sent to each storage node, so as to distribute the corresponding data stripe unit and / or check stripe unit to each storage node.
  • Each write request carries a data stripe unit and / or a verification stripe unit corresponding to the storage node, and optionally, can also carry a target storage location of the data.
  • Each write request may be an input / output (Input / Output, hereinafter referred to as: IO) request in the form of a key-value pair.
  • the client node may send a corresponding data stripe unit and / or check stripe unit to each storage node.
  • a client node may allocate a corresponding data stripe unit and / or check stripe unit to each storage node, and the client node generates at least one write request, and sends a write request to each storage node.
  • the client node interacts with each storage node to distribute at least one data stripe unit and at least one check stripe unit to the at least one storage node.
  • for example, the client node may send a write request carrying data stripe unit 1 to OSD node 1, a write request carrying data stripe unit 2 to OSD node 2, a write request carrying data stripe unit 3 to OSD node 3, a write request carrying check stripe unit 1 to OSD node 4, and a write request carrying check stripe unit 2 to OSD node 5, thereby distributing the data stripe units and check stripe units.
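  • The distribution step can be sketched as building one key-value write request per storage node, allocating stripe units in the order of the storage node identifiers. The request fields (`node`, `key`, `value`) and the function `build_write_requests` are hypothetical; the actual IO request format is not specified beyond being a key-value pair.

```python
from typing import Dict, List


def build_write_requests(stripe_units: List[str], storage_nodes: List[str],
                         target_location: str) -> List[Dict[str, str]]:
    """Allocate one stripe unit per storage node, in node identifier order."""
    if len(stripe_units) != len(storage_nodes):
        raise ValueError("expected exactly one stripe unit per storage node")
    requests = []
    for node, unit in zip(sorted(storage_nodes), stripe_units):
        # Each request carries the stripe unit and the target storage location.
        requests.append({"node": node, "key": target_location, "value": unit})
    return requests


units = ["data stripe unit 1", "data stripe unit 2", "data stripe unit 3",
         "check stripe unit 1", "check stripe unit 2"]
nodes = ["osd1", "osd2", "osd3", "osd4", "osd5"]
requests = build_write_requests(units, nodes, target_location="LBA1")
```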
  • when the client node sends the corresponding data stripe unit and/or check stripe unit to each storage node, the data to be stored needs only one hop, between the client node and the storage node, to reach each storage node and be stored there. It does not need to travel from the client node to a primary storage node and then from the primary storage node to each standby storage node, which would take two hops. This saves the network delay caused by forwarding through the primary storage node, thereby reducing the latency of storing data, improving the efficiency and speed of storing data, and greatly improving the storage performance of the distributed storage system.
  • having the client node send a data stripe unit and/or a check stripe unit to each storage node is only an optional way to distribute the data stripe units and/or check stripe units, not a mandatory way.
  • alternatively, one of the storage nodes may be selected as the primary storage node, with the other storage nodes serving as standby storage nodes. The client node can send the at least one data stripe unit and the at least one check stripe unit to the primary storage node, and the primary storage node sends the data stripe units and/or check stripe units to each standby storage node. This embodiment does not limit how the data stripe units and/or check stripe units are distributed.
  • the at least one storage node When at least one storage node receives the data stripe unit and / or the verification stripe unit, the at least one storage node stores the data stripe unit and / or the verification stripe unit.
  • each storage node can receive a write request, parse the write request to obtain the data stripe unit and/or check stripe unit carried by the write request, and write the data stripe unit and/or check stripe unit to its storage space, thereby storing the data stripe unit and/or check stripe unit.
  • the write request may carry a target storage location of data, and a data stripe unit and / or a verification stripe unit may be written to a storage space corresponding to the target storage location.
  • At least one storage node sends a first write completion message to the client node, where the first write completion message is used to indicate that the corresponding storage node has stored the data stripe unit and / or the verification stripe unit.
  • the write completion message sent by a storage node is referred to as a first write completion message, and the write completion message sent by the client node is referred to as a second write completion message.
  • the terms "first write completion message" and "second write completion message" are only used to distinguish different write completion messages, and should not be understood as expressing or implying the relative importance of the write completion messages or the total number of write completion messages.
  • after each storage node writes the data stripe unit and/or check stripe unit, it can generate a first write completion message and send it to the client node, to notify the client node that the data has been stored successfully.
  • the client node receives the first write completion message sent by the at least one storage node, and sends a second write completion message to the target application or an external host.
  • the second write completion message is used to indicate that the data to be stored has been written to the target storage location.
  • the client node can determine whether the first write completion messages of all the storage nodes have been received. When it confirms that the first write completion messages of all the storage nodes have been received, it generates a second write completion message and sends it to the target application or the external host.
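  • The acknowledgement logic can be sketched as a check that every storage node has reported a first write completion message before the second write completion message is produced. The function name and the message shape are assumptions; timeouts and retries are left out of this sketch.

```python
from typing import Dict, Iterable, Optional


def second_write_completion(expected_nodes: Iterable[str],
                            first_completions: Iterable[str]) -> Optional[Dict[str, str]]:
    """Return the second write completion message once every node has acknowledged."""
    pending = set(expected_nodes) - set(first_completions)
    if pending:
        return None  # still waiting for these storage nodes
    return {"status": "write complete"}


expected = ["osd1", "osd2", "osd3", "osd4", "osd5"]
partial = second_write_completion(expected, ["osd1", "osd2"])
final = second_write_completion(expected, ["osd5", "osd4", "osd3", "osd2", "osd1"])
```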
  • the target application or the external host can receive the second write completion message, and prompt the second write completion message according to a preset prompt mode, thereby achieving the function of prompting the user that the data to be stored has been written to the target storage location.
  • the target application may be located on an upper layer of the client node in a logical architecture, and the target application may be referred to as an upper layer application of the client node.
  • steps 407 and 408 are only optional steps for storing data, and are not mandatory steps for storing data. This embodiment does not limit whether to perform steps 407 and 408.
  • the above describes the write IO process of storing data
  • the read IO process of reading data may include the following steps 1 to 2.
  • Step 1: The client node receives the read request and, according to the target storage location in the read request, determines at least one storage node of the partition corresponding to the target storage location. For example, according to the mapping relationship between the LBA and the partition, the partition corresponding to the LBA can be determined, the partition view can be queried, and at least one storage node corresponding to the partition can be determined.
  • Step 2: The client node forwards the read request to the at least one storage node, so that the data is read from the at least one storage node.
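  • The two read steps above can be sketched as a pair of lookups. The modulo mapping from LBA to partition, the partition count, and the contents of `PARTITION_VIEW` are assumptions made only for illustration; the text does not specify how the mapping relationship is computed.

```python
PARTITION_COUNT = 4  # assumed number of partitions

# Hypothetical partition view: partition id -> storage nodes holding that partition.
PARTITION_VIEW = {
    0: ["osd1", "osd2", "osd3"],
    1: ["osd2", "osd3", "osd4"],
    2: ["osd3", "osd4", "osd1"],
    3: ["osd4", "osd1", "osd2"],
}


def partition_for_lba(lba: int) -> int:
    """Map an LBA to a partition; modulo is an assumed stand-in for the real mapping."""
    return lba % PARTITION_COUNT


def nodes_for_read(lba: int) -> list:
    """Step 1: query the partition view for the storage nodes of the LBA's partition."""
    return PARTITION_VIEW[partition_for_lba(lba)]


# Step 2 would forward the read request to each node returned here.
targets = nodes_for_read(10)
```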
  • the method provided in this embodiment introduces a mechanism for cross-backup of metadata of data blocks in an EC stripe.
  • After EC coding is performed on the data stripe units to obtain the check stripe units, each data stripe unit and each check stripe unit are distributed to the storage nodes. This ensures that, once every storage node has finished storing, different storage nodes hold the metadata of each other's data blocks.
  • Even if the metadata of a certain storage node is lost, because the other storage nodes have previously stored the metadata of that storage node, the lost metadata can be read and recovered from the data stripe units of the other storage nodes.
  • the client node can send the corresponding data stripe unit and check stripe unit to each storage node to ensure that the data can reach each storage node to be stored only through one-hop network forwarding. It can greatly reduce the delay, improve the speed and efficiency of storing data, and thus improve the storage performance of distributed storage systems.
  • a storage node in a sub-health state is referred to as a third storage node
  • a storage node that is not in a sub-health state is referred to as a fourth storage node.
  • the terms "third storage node" and "fourth storage node" are only used to distinguish storage nodes that are in a sub-health state from those that are not, and should not be understood as expressing or implying the order of, the relative importance of, or the total number of storage nodes.
  • FIG. 6 is a flowchart of a data storage method according to an embodiment of the present application. As shown in FIG. 6, the interaction body of the method includes a client node and at least one fourth storage node, including the following steps:
  • the client node obtains at least one blank EC stripe.
  • This step is the same as the above step 401, and details are not described herein.
  • the client node obtains data to be stored.
  • This step is the same as the above step 402, and details are not described herein.
  • the client node determines that the third storage node is in a sub-health state.
  • the sub-health state is also called a sub-health flash, and the sub-health state may include a state in which read and write requests are abnormally slow, a state where a write cache is invalid, and a state where a disk is damaged.
  • the client node may receive a sub-health message, and determine that the third storage node is in a sub-health state according to the sub-health message.
  • the sub-health message is used to indicate that the third storage node is in a sub-health state, and can carry the identity of the third storage node.
  • the client node can parse the sub-health message and obtain the identity of the third storage node, thereby determining that the third storage node is in a sub-health state.
  • the sub-health message received by the client node may come from an MDC node.
  • a distributed storage system may include an MDC node.
  • the MDC node is used to maintain the state of each storage node.
  • the MDC node may maintain communication with each storage node.
  • the MDC node may generate a sub-health message according to the identity of the third storage node, and send the sub-health message to the client node.
  • the client node may receive the sub-health message of the MDC node, thereby determining that the third storage node is in a sub-health state.
  • receiving the sub-health message is only an example of a method for determining that the third storage node is in a sub-health state, and is not a mandatory method.
  • The client node may also determine that the third storage node is in a sub-health state by other methods. For example, the client node can maintain communication with the third storage node and actively detect that the third storage node is in a sub-health state. This embodiment does not limit how to determine that the third storage node is in a sub-health state.
  • the client node generates at least one data stripe unit according to at least one data block to be stored.
  • the client node performs EC coding on at least one data stripe unit to obtain at least one check stripe unit.
  • the client node distributes at least one data stripe unit and at least one check stripe unit to a fourth storage node.
  • Steps 604 to 606 are the same as the above steps 403 to 405, and the differences mainly include the following two points:
  • Difference one: The content of the distributed data has increased.
  • In addition to the data stripe units and check stripe units, a sub-health mark of the third storage node is also distributed, so that the sub-health mark indicates to the receiving nodes that the third storage node is in a sub-health state.
  • the sub-health tag is used to indicate that the third storage node is in a sub-health state, and may include an identifier of the third storage node.
  • the sub-health tag may be generated by the client node.
  • the manner of distributing the sub-health mark of the third storage node may include a combination of one or more of the following manners 1 to 2.
  • Method 1: A sub-health mark may be written to at least one data stripe unit, so that each data stripe unit carries a data block and cross-backup metadata, and also carries the sub-health mark of the third storage node. Therefore, by distributing the at least one data stripe unit, the sub-health mark of the third storage node is distributed.
  • For example, a sub-health mark may be written into at least one data stripe unit, and then EC coding is performed on the at least one data stripe unit to obtain at least one check stripe unit.
  • If the client node sends the data stripe units to each storage node, the client node can write the sub-health mark to at least one data stripe unit. If the main storage node sends the data stripe units to each storage node, the main storage node can write the sub-health mark to at least one data stripe unit. This embodiment does not limit the execution subject that writes the sub-health mark to the data stripe unit.
  • Method 2: A sub-health mark may be written to at least one write request, so that each write request carries the sub-health mark of the third storage node in addition to a data stripe unit and/or a check stripe unit. Thus, the sub-health mark of the third storage node is distributed by distributing the at least one write request.
  • the sub-health mark and the data stripe unit may be encapsulated to obtain a write request carrying the sub-health mark and the data stripe unit, so as to write the sub-health mark to the write request.
  • the sub-health field can be reserved in the write request, and the sub-health field can be set to write the sub-health flag into the write request. This embodiment does not limit how to write the sub-health flag into the write request.
  • If the client node sends a write request to each storage node, the client node can write the sub-health mark to at least one write request. If the main storage node sends a write request to each storage node, the main storage node may write the sub-health mark to at least one write request. This embodiment does not limit the execution subject that writes the sub-health mark to the write request.
  • For example, the client node may write the sub-health mark of OSD node 1 to the write request for OSD node 2 and to the write request for OSD node 3.
  • The client node can then send the write requests carrying the sub-health mark of OSD node 1 to OSD node 2 and OSD node 3, thereby distributing the sub-health mark of OSD node 1 to OSD node 2 and OSD node 3.
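  • Method 2 can be illustrated with a write request that reserves a field for the sub-health mark. This is only a sketch: the `WriteRequest` type, its field names, and the `encapsulate` helper are hypothetical and are not a wire format defined by this embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WriteRequest:
    target_node: str                        # storage node the request is sent to
    stripe_unit: bytes                      # data or check stripe unit payload
    sub_health_mark: Optional[str] = None   # reserved field: id of the sub-healthy node


def encapsulate(units_by_node: dict, sub_healthy_node: Optional[str]) -> list:
    """Build one write request per target node, setting the reserved mark field."""
    return [
        WriteRequest(node, unit, sub_health_mark=sub_healthy_node)
        for node, unit in units_by_node.items()
    ]


# OSD node 1 is sub-healthy; its mark travels in the requests for OSD nodes 2 and 3.
requests = encapsulate({"osd2": b"unit2", "osd3": b"unit3"}, sub_healthy_node="osd1")
```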
  • Difference two: The distribution object of the data can change. Specifically, if the third storage node is in a sub-health state, a data stripe unit and/or a check stripe unit may be sent to a fourth storage node of the at least one storage node.
  • the fourth storage node and the third storage node may be different storage nodes.
  • The storage nodes other than the third storage node among the at least one storage node may be determined to obtain at least one fourth storage node, and data stripe units and/or check stripe units may be sent to the at least one fourth storage node.
  • The client node may send the data stripe units and/or check stripe units to each fourth storage node, or the main storage node may do so; this is not limited in this embodiment.
  • The data to be stored can be written in degraded mode, that is, the data stripe units and/or check stripe units corresponding to the fourth storage nodes are sent, without sending the data stripe unit and/or check stripe unit corresponding to the third storage node.
  • the write request carrying the data stripe unit and / or the check stripe unit may be referred to as a degraded write request.
  • the degraded write may include any one or more of the following implementation manners 1 to 2:
  • Implementation manner 1: Send, to at least one fourth storage node, at least one second data stripe unit (the data stripe units other than the first data stripe unit corresponding to the third storage node) and at least one check stripe unit.
  • Implementation manner 1 may specifically include: determining, among the at least one data stripe unit generated, the data stripe units other than the first data stripe unit to obtain at least one second data stripe unit, and sending the corresponding second data stripe unit and/or check stripe unit to the at least one fourth storage node.
  • For example, the data stripe units and/or check stripe units other than data stripe unit 1 (that is, data stripe unit 2, data stripe unit 3, check stripe unit 1, and check stripe unit 2) can be sent to the OSD nodes other than OSD node 1.
  • Implementation manner 2: Send, to at least one fourth storage node, at least one second check stripe unit (the check stripe units other than the first check stripe unit corresponding to the third storage node) and at least one data stripe unit.
  • Implementation manner 2 may specifically include: determining, among the at least one check stripe unit generated, the check stripe units other than the first check stripe unit to obtain at least one second check stripe unit, and sending the corresponding second check stripe unit and/or data stripe unit to the at least one fourth storage node.
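  • Both implementation manners amount to dropping whichever stripe unit belongs to the third storage node and sending the rest. A minimal sketch follows; the `placement` mapping, the unit names, and the five-node layout are assumptions chosen to mirror the OSD example, not a layout prescribed by this embodiment.

```python
def degraded_write_plan(placement: dict, sub_healthy_node: str) -> dict:
    """Return the stripe units to send, skipping the unit of the sub-healthy node.

    placement maps a stripe unit name to the storage node it belongs to.
    """
    return {
        unit: node for unit, node in placement.items() if node != sub_healthy_node
    }


placement = {
    "data_unit_1": "osd1",
    "data_unit_2": "osd2",
    "data_unit_3": "osd3",
    "check_unit_1": "osd4",
    "check_unit_2": "osd5",
}

# OSD node 1 is sub-healthy, so data stripe unit 1 is not sent (implementation manner 1).
plan = degraded_write_plan(placement, sub_healthy_node="osd1")
```

  • If the sub-healthy node held a check stripe unit instead (for example `osd4`), the same function would leave out that check stripe unit, which corresponds to implementation manner 2.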
  • When each fourth storage node receives the corresponding data stripe unit and/or check stripe unit, it stores the corresponding data stripe unit and/or check stripe unit.
  • Step 607 is the same as the above step 406. The difference is that, while storing the data stripe unit and/or the check stripe unit, the fourth storage node can also record a sub-health log for the third storage node during the period in which the third storage node is in a sub-health state, so that when the third storage node is in a sub-health recovery state, the metadata missing from the third storage node can be sent to the third storage node.
  • the sub-health log is used to indicate the data stripe unit stored during the third storage node's sub-health state.
  • the sub-health log can also be called the metadata difference log, which can indicate the difference in stored metadata between the third storage node and the fourth storage node while the third storage node is in the sub-health state.
  • the sub-health log may include a correspondence between the third storage node and the data stripe unit, for example, may include a correspondence between the sub-health mark of the third storage node and the stripe unit identifier of the data stripe unit.
  • the implementation manner of recording the sub-health log may include any one or more of the following implementations one to two:
  • Implementation one: When the fourth storage node parses the write request, if the sub-health mark of the third storage node is obtained from the write request, a sub-health record corresponding to the third storage node is generated and written into the sub-health log.
  • The sub-health record refers to a record in the sub-health log. It may include the sub-health mark of the third storage node, and may also include other information, such as the timestamp at which the sub-health mark was received and the stripe unit identifier of the data stripe unit in the write request.
  • Implementation two: When the fourth storage node parses the data stripe unit, if the sub-health mark of the third storage node is obtained from the data stripe unit, a sub-health record corresponding to the third storage node is generated and written into the sub-health log.
  • The step of writing a sub-health record to the sub-health log may be performed multiple times while the third storage node is in the sub-health state: each time a received write request carries the sub-health mark of the third storage node, a sub-health record is written to the sub-health log, so that the log keeps a record of the write requests that carried the sub-health mark.
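  • The log-recording behaviour of a fourth storage node might look like the sketch below. The `SubHealthLog` class, the dictionary-shaped write request, and the record fields (`node_id`, `stripe_unit_id`, `timestamp`) are hypothetical names chosen for illustration; the text only states what kind of information a record may contain.

```python
import time


class SubHealthLog:
    """Per-node log of writes received while another node was sub-healthy (sketch)."""

    def __init__(self):
        self.records = []

    def record_if_marked(self, request: dict, now=None) -> bool:
        """Append a sub-health record if the write request carries a sub-health mark."""
        mark = request.get("sub_health_mark")
        if mark is None:
            return False
        self.records.append({
            "node_id": mark,                              # the sub-healthy (third) node
            "stripe_unit_id": request["stripe_unit_id"],  # unit stored by this node
            "timestamp": time.time() if now is None else now,
        })
        return True

    def stripe_units_for(self, node_id: str) -> list:
        """Stripe unit ids logged while node_id was sub-healthy."""
        return [r["stripe_unit_id"] for r in self.records if r["node_id"] == node_id]


log = SubHealthLog()
log.record_if_marked({"stripe_unit_id": "s1-u2", "sub_health_mark": "osd1"}, now=100.0)
log.record_if_marked({"stripe_unit_id": "s2-u2"}, now=101.0)  # no mark: not logged
log.record_if_marked({"stripe_unit_id": "s3-u2", "sub_health_mark": "osd1"}, now=102.0)
```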
  • Each fourth storage node sends a first write completion message to the client node.
  • the client node receives a first write completion message sent by at least one fourth storage node, and sends a second write completion message to a target application or an external host.
  • Steps 608 to 609 are the same as the above steps 407 to 408. The difference is that, because the third storage node is in a sub-health state, the client node may not need to determine whether the first write completion message is received from the third storage node.
  • When the client node receives the first write completion message from the at least one fourth storage node other than the third storage node, it confirms that the data has been stored successfully and sends the second write completion message to the target application or the external host. That is, once the degraded write succeeds, the client node can return an IO write success.
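  • The completion rule for a degraded write can be stated compactly: only the storage nodes that are not sub-healthy need to acknowledge. The function and parameter names below are hypothetical and serve only to illustrate that rule.

```python
def degraded_write_succeeded(all_nodes, sub_healthy_nodes, acked_nodes) -> bool:
    """True when every node that is not sub-healthy has sent a first write completion."""
    required = set(all_nodes) - set(sub_healthy_nodes)
    return required <= set(acked_nodes)


nodes = ["osd1", "osd2", "osd3"]
done = degraded_write_succeeded(nodes, {"osd1"}, {"osd2", "osd3"})    # osd1 is ignored
pending = degraded_write_succeeded(nodes, {"osd1"}, {"osd2"})         # osd3 still missing
```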
  • the above describes the write IO process of the third storage node in a sub-health state.
  • This embodiment also provides the read IO process when the third storage node is in a sub-health state. In the following, the data stripe unit corresponding to the third storage node is called the first data stripe unit, and the case where the metadata backup of the data blocks of the first data stripe unit is stored in the fifth storage node is taken as an example.
  • The read IO process may include the following steps 1 to 3.
  • Step 1: When the client node receives the read request, it queries the cross-backup relationship between the storage nodes and determines the fifth storage node corresponding to the third storage node. The cross-backup metadata stored by the fifth storage node includes the metadata of the data block of the first data stripe unit.
  • The read request is used to instruct reading of the data block of the first data stripe unit.
  • According to the third storage node, the cross-backup relationship between the storage nodes is queried to obtain the fifth storage node corresponding to the third storage node. Because the cross-backup metadata in the data stripe unit stored by the fifth storage node includes the metadata of the data block of the first data stripe unit (in other words, the metadata backup of the data block of the first data stripe unit has been stored on the fifth storage node in advance), the read request is forwarded to the fifth storage node to read the data of the first data stripe unit.
  • For example, when OSD node 1 is in a sub-health state, if the client node receives a read request indicating that GRAIN data 1 stored in OSD node 1 should be read, the client node can query the cross-backup relationship between the storage nodes according to OSD node 1, obtain OSD node 2 and OSD node 3, and forward the read request to OSD node 2 and OSD node 3.
  • Step 2: The client node sends the read request to the fifth storage node.
  • Step 3: The fifth storage node receives the read request sent by the client node and performs data reading.
  • The fifth storage node may read the cross-backup metadata in the data stripe unit stored by itself and obtain the metadata of the data block of the first data stripe unit from the cross-backup metadata. According to the metadata of the data block, the corresponding data block can be indexed, so that the data in the data block is read and returned to the client node.
  • The fifth storage node can also exchange data with the storage nodes other than the third storage node and receive the data stripe units and check stripe units they send, thereby obtaining the at least one second data stripe unit other than the first data stripe unit and the at least one check stripe unit. It can then perform EC inverse coding on the at least one second data stripe unit and the at least one check stripe unit to restore the first data stripe unit, read the data block of the first data stripe unit, and return the data in the data block to the client node.
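  • The EC inverse coding in step 3 can be illustrated with the simplest possible erasure code, a single XOR parity unit. This is a deliberately swapped-in stand-in: the text does not say which EC code is used, and a real deployment with several check stripe units would use a stronger code. The unit contents and the helper name `xor_units` are assumptions.

```python
from functools import reduce


def xor_units(units):
    """Byte-wise XOR of equally sized stripe units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


# Encoding: one check stripe unit is the XOR of all data stripe units.
data_units = [b"AAAA", b"BBBB", b"CCCC"]
check_unit = xor_units(data_units)

# The first data stripe unit (held by the sub-healthy node) is unavailable.
# XOR-ing the surviving data units with the check unit restores it.
recovered = xor_units([data_units[1], data_units[2], check_unit])
```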
  • With this method, a storage node in the sub-health state can be avoided immediately when storing data, and the data is stored through the other storage nodes, which achieves rapid switching between storage nodes. This reduces the impact of the sub-health state of a storage node on the performance of the storage system and ensures that the storage system can still store data quickly even if a storage node is in the sub-health state, thereby ensuring the reliability and stability of the storage system.
  • FIG. 8 is a flowchart of a data recovery method according to an embodiment of the present application. As shown in FIG. 8, the method is executed by a fifth storage node and includes the following steps:
  • the fifth storage node stores at least one data stripe unit.
  • the fifth storage node determines that the third storage node is in a sub-healthy recovery state.
  • the sub-healthy recovery state refers to a state of transition from a sub-healthy state to a healthy state, that is, a state in which the sub-healthy state is recovering.
  • the fifth storage node may receive the sub-health recovery message, parse it, and obtain the identifier of the third storage node carried in it. According to the identifier of the third storage node, it can be determined that the third storage node is in a sub-health recovery state.
  • the sub-health recovery message is used to indicate that the third storage node is in a sub-health recovery state, and may carry the identifier of the third storage node.
  • the fifth storage node may receive the sub-health recovery message sent by the MDC node.
  • the MDC node can maintain communication with each storage node and can sense the current status of each storage node. When the MDC node determines that the third storage node is in a sub-health recovery state, it can generate a sub-health recovery message and send it to the fifth storage node. The fifth storage node may receive the sub-health recovery message from the MDC node, thereby determining that the third storage node is in a sub-health recovery state.
  • receiving a sub-health recovery message sent by an MDC node is only an optional way to receive a sub-health recovery message, rather than a mandatory way.
  • The fifth storage node may also receive sub-health recovery messages from other nodes. For example, when the third storage node is in a sub-health recovery state, the third storage node may actively send a sub-health recovery message to the fifth storage node, and the fifth storage node may receive the sub-health recovery message sent by the third storage node.
  • step 802 is only an optional step for data recovery, not a mandatory step for data recovery.
  • the fifth storage node may not need to perform step 802.
  • the fifth storage node may perform the following step 803 when receiving a sending instruction of missing metadata.
  • the fifth storage node acquires missing metadata of the third storage node according to at least one data stripe unit.
  • the fifth storage node can obtain the missing metadata of the third storage node according to the data stripe units it stores itself.
  • the missing metadata refers to the metadata of the data blocks of the data stripe units that the third storage node should have stored but did not store.
  • the missing metadata can be understood as difference metadata, that is, the difference in metadata stored between the third storage node and the fifth storage node.
  • the process of obtaining missing metadata may specifically include the following steps 1 to 2:
  • Step 1: Select at least one second data stripe unit from the at least one data stripe unit.
  • the second data stripe unit is a data stripe unit stored by the fifth storage node itself, and the storage time of the second data stripe unit belongs to the period when the third storage node is in a sub-health state.
  • the storage time of the second data stripe unit refers to the point in time when the fifth storage node stores the second data stripe unit.
  • The period during which the third storage node is in a sub-health state may refer to the time range from the point at which the third storage node starts to be in a sub-health state to the point at which it starts to be in a sub-health recovery state.
  • The fifth storage node may select each data stripe unit that it stored while the third storage node was in a sub-health state, and use the selected data stripe units as the at least one second data stripe unit.
  • step one may specifically include a combination of one or more of the following implementation manners one to three.
  • Implementation method 1: According to the identifier of the third storage node, query the sub-health log to obtain at least one second data stripe unit.
  • The identifier of the third storage node can be used as an index to query the sub-health log and determine the stripe unit identifiers of the data stripe units that correspond to the third storage node in the sub-health log. Once at least one stripe unit identifier is obtained, the data stripe units corresponding to those identifiers can be determined and used as the at least one second data stripe unit.
  • the OSD node 2 may query the sub-health log according to the identifier of the OSD node 1, and obtain the stripe unit identifier corresponding to the identifier of the OSD node 1.
  • Implementation method 2: Select the data stripe units that carry the sub-health mark of the third storage node as the at least one second data stripe unit.
  • For each stored data stripe unit, it can be determined whether the data stripe unit carries the sub-health mark of the third storage node. If it does, the data stripe unit is selected as a second data stripe unit.
  • Implementation manner three: A target time period in which the third storage node is in a sub-health state may be determined, and at least one data stripe unit stored in the target time period is selected as the at least one second data stripe unit.
  • For example, a first timestamp may be recorded when the third storage node starts to be in a sub-health state, and a second timestamp may be recorded when the third storage node starts to be in a sub-health recovery state. The time period between the first timestamp and the second timestamp is used as the target time period in which the third storage node is in a sub-health state.
  • For another example, the first timestamp can be recorded when a received write request carries the sub-health mark of the third storage node, and the second timestamp can be recorded when a received write request no longer carries the sub-health mark of the third storage node. The time period between the first timestamp and the second timestamp is used as the target time period in which the third storage node is in a sub-health state.
  • The storage time point of each data stripe unit can be recorded when the data stripe unit is stored. According to these storage time points, the data stripe units whose storage time points fall within the target time period are selected as the at least one second data stripe unit.
  • Step 2: Determine the missing metadata of the third storage node according to the cross-backup metadata in the at least one second data stripe unit.
  • Step two may include a combination of any one or more of the following implementation manners one to two:
  • Implementation manner 1: The cross-backup metadata in the at least one second data stripe unit may be used as the missing metadata of the third storage node.
  • the OSD node 2 may use GRAIN metadata 2, GRAIN metadata 3, and GRAIN metadata 1 as the missing metadata of the OSD node 1.
  • Implementation manner 2: One or more metadata may be selected from the cross-backup metadata of the at least one second data stripe unit, and the selected metadata may be used as the missing metadata of the third storage node.
  • the metadata of the data block corresponding to the third storage node may be selected as the missing metadata of the third storage node.
  • For example, OSD node 2 may select GRAIN metadata 1, which corresponds to OSD node 1, and use GRAIN metadata 1 as the missing metadata of OSD node 1.
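  • Steps 1 and 2 above can be combined into a short sketch that uses the sub-health log lookup for step 1 and the per-owner filter (implementation manner 2) for step 2. The dictionary layout of a stored stripe unit, the `owner` field, and the stripe unit identifiers are hypothetical; they only illustrate how cross-backup metadata could be filtered.

```python
def select_second_units(stored_units, sub_health_log, node_id):
    """Step 1: pick the stripe units logged while node_id was sub-healthy."""
    wanted = {r["stripe_unit_id"] for r in sub_health_log if r["node_id"] == node_id}
    return [u for u in stored_units if u["stripe_unit_id"] in wanted]


def missing_metadata(second_units, node_id):
    """Step 2: keep only the cross-backup metadata that belongs to node_id."""
    return [
        meta
        for unit in second_units
        for meta in unit["cross_backup_metadata"]
        if meta["owner"] == node_id
    ]


# Stripe units held by OSD node 2; only the first was stored while OSD node 1 was sub-healthy.
stored_units = [
    {"stripe_unit_id": "s1-u2", "cross_backup_metadata": [
        {"owner": "osd1", "name": "GRAIN metadata 1"},
        {"owner": "osd2", "name": "GRAIN metadata 2"},
        {"owner": "osd3", "name": "GRAIN metadata 3"},
    ]},
    {"stripe_unit_id": "s0-u2", "cross_backup_metadata": [
        {"owner": "osd1", "name": "older metadata"},
    ]},
]
sub_health_log = [{"node_id": "osd1", "stripe_unit_id": "s1-u2"}]

second = select_second_units(stored_units, sub_health_log, "osd1")
missing = missing_metadata(second, "osd1")
```

  • Returning every entry of `cross_backup_metadata` instead of filtering by owner would correspond to using all of the cross-backup metadata as the missing metadata.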
  • the fifth storage node sends the missing metadata to the third storage node.
  • By sending the missing metadata to the third storage node, the metadata that the third storage node missed while it was in a sub-health state can be synchronized to the third storage node. In other words, the difference metadata between the third storage node and the fifth storage node is synchronized to the third storage node, so that after the third storage node receives the missing metadata, it can store the missing metadata again, thereby restoring the metadata of the data blocks that were not stored during the sub-health period.
  • the read IO process may include the following steps one to three:
  • Step 1: The client node receives the read request and, according to the target storage location carried in the read request, determines at least one storage node corresponding to the target storage location. For example, according to the mapping relationship between the LBA and the partition, the partition corresponding to the LBA can be determined, the partition view can be queried, and at least one storage node corresponding to the partition can be determined.
  • Step 2: The client node forwards the read request to the fifth storage node.
  • When the client node determines that the third storage node is in a sub-health recovery state, it can query the cross-backup relationship between the storage nodes according to the third storage node to obtain the fifth storage node, and forward the read request corresponding to the third storage node to the fifth storage node, that is, to the storage node where the metadata backup is located.
  • For example, if the third storage node is OSD node 1, and the cross-backup metadata stored by OSD node 2 and OSD node 3 includes the metadata backup of OSD node 1, then the read request corresponding to OSD node 1 can be forwarded to OSD node 2 and OSD node 3.
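  • The forwarding decision above can be sketched as a lookup in the cross-backup relationship. The `CROSS_BACKUP` table and the three-node ring it describes are assumptions made for illustration; the text does not specify how the relationship is stored.

```python
# Hypothetical cross-backup relationship: node -> nodes holding its metadata backup.
CROSS_BACKUP = {
    "osd1": ["osd2", "osd3"],
    "osd2": ["osd3", "osd1"],
    "osd3": ["osd1", "osd2"],
}


def route_read(owner_node: str, unavailable: set) -> list:
    """Return the nodes a read for owner_node's data should be forwarded to.

    Nodes in `unavailable` are sub-healthy or still recovering from sub-health.
    """
    if owner_node not in unavailable:
        return [owner_node]
    return [n for n in CROSS_BACKUP[owner_node] if n not in unavailable]


redirected = route_read("osd1", unavailable={"osd1"})  # go to the backup holders
direct = route_read("osd2", unavailable={"osd1"})      # owner is healthy, read directly
```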
  • the specific process of recovering the missing metadata for the third storage node may include the following steps one to three:
  • Step 1: The third storage node receives write requests from the client node and from the fifth storage node, and maintains a first write request record corresponding to the client node and a second write request record corresponding to the fifth storage node.
  • the write request from the client node carries the data stripe unit and the target storage location.
  • the first write request record is used to record the data and metadata that the third storage node needs to store while the third storage node is in a sub-healthy recovery state.
  • Whenever the third storage node receives a write request from the client node, it can record that write request in the first write request record.
  • The write request from the fifth storage node carries the missing metadata.
  • The second write request record is used to record the missing metadata that the third storage node has received and needs to store.
  • Step 2: After receiving the missing metadata, merge the multiple write requests according to the first write request record and the second write request record to obtain at least one target write request.
  • All write requests in the first write request record and the second write request record can be determined and sorted in order of the timestamps they carry. For the sorted write requests, it can be determined, according to the target storage locations they carry, whether there are write requests with the same target storage location. When multiple write requests carry the same target storage location, those write requests are merged: the most recent write request is retained and the other write requests are filtered out. At least one target write request is finally obtained.
  • Step 3: Store the metadata carried in the at least one target write request.
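  • The merge in step 2 keeps, for each target storage location, only the write request with the latest timestamp. A minimal sketch follows; the dictionary shape of a write request (`target`, `timestamp`, `metadata`) and the sample values are hypothetical.

```python
def merge_write_requests(first_record, second_record):
    """Merge two write request records into target write requests.

    Requests are sorted by timestamp; for each target storage location the
    most recent request is retained and the older ones are filtered out.
    """
    latest = {}
    for req in sorted(first_record + second_record, key=lambda r: r["timestamp"]):
        latest[req["target"]] = req  # a later timestamp overwrites an earlier one
    return sorted(latest.values(), key=lambda r: r["timestamp"])


# First record: new writes from the client node during recovery.
first_record = [{"target": "lba-10", "timestamp": 5, "metadata": "new"}]
# Second record: missing metadata sent by the fifth storage node.
second_record = [
    {"target": "lba-10", "timestamp": 3, "metadata": "old"},
    {"target": "lba-20", "timestamp": 4, "metadata": "missing"},
]

targets = merge_write_requests(first_record, second_record)
```

  • Here the older write to `lba-10` is discarded because the client node has since overwritten that location, while the missing metadata for `lba-20` is kept and stored in step 3.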
  • When the metadata has been stored, the third storage node is restored to a healthy state.
  • A message that the third storage node is in a healthy state may be sent to the MDC node, and the MDC node may broadcast, to the client node and each storage node, a message that the third storage node is healthy.
  • The third storage node being in a sub-health state is only an exemplary scenario of missing metadata, not a mandatory one. Accordingly, the third storage node being in a sub-health recovery state is only an exemplary scenario for data recovery, not a mandatory one.
  • the third storage node may also lack metadata due to other reasons, such as missing metadata due to equipment failure, temporary power outage, insufficient memory capacity, etc.
  • the third storage node may also perform data recovery in other scenarios, such as when the third storage node returns to normal or when it receives a data recovery instruction. This embodiment does not limit the scenario in which the third storage node recovers data.
  • The method provided in this embodiment introduces a mechanism for cross-backing up the metadata of data blocks in an EC stripe. Because each data stripe unit stores a data block together with cross-backup metadata, different storage nodes hold one another's data block metadata. Even if the metadata of a storage node is lost, the data stripe units of the other storage nodes already hold a backup of it, so the other storage nodes can obtain the missing metadata from the data stripe units they store and synchronize it to that storage node. This reduces the probability of metadata loss, greatly improves the reliability and security of data storage, and thereby improves the storage performance of the distributed storage system.
  • If a storage node lacks metadata because of a transient sub-health interruption, then once that storage node is in the sub-health recovery state, the other storage nodes can use the metadata differences recorded in their locally recorded sub-health logs to synchronize the missing metadata to the storage node that has recovered from sub-health. The storage node can therefore recover its missing metadata automatically after leaving the sub-health state, which reduces the impact of the sub-health state on storage system performance and ensures high stability and high reliability of the distributed storage system.
  • FIG. 10 is a schematic structural diagram of a client node according to an embodiment of the present application.
  • the client node includes a generation module 1001, an encoding module 1002, and a distribution module 1003.
  • An encoding module 1002 configured to perform the above step 404;
  • the distribution module 1003 is configured to perform the foregoing step 405.
  • the generating module 1001 is configured to perform steps 1 to 3 in step 403 above;
  • the generating module 1001 is configured to execute steps (2.1) to (2.2) in step two in step 403 above;
  • the encoding module 1002 is configured to perform at least one of the first to third methods in step 404.
  • the distribution module 1003 is configured to send a corresponding data stripe unit and / or a verification stripe unit to each storage node.
  • the device further includes:
  • the sending module is configured to perform step 408 described above.
  • the distribution module 1003 is configured to perform step 606 described above.
  • the distribution module 1003 is configured to perform at least one of implementation manners 1 to 2 in step 606.
  • the device further includes:
  • the sending module is configured to execute step two of the IO reading process in the embodiment in FIG. 6.
  • the device further includes:
  • the query module is configured to execute step 1 of the IO reading process in the embodiment in FIG. 6 described above.
  • the distribution module 1003 is configured to perform step 606 described above.
  • the device further includes:
  • a sending module is configured to perform the foregoing step 609.
  • Each module described in the above embodiments may specifically be a software module that performs the corresponding function in software. That is, a "module" may be a functional module composed of a group of computer programs, and the computer program may be a source program or an object program.
  • the computer program can be implemented in any programming language.
  • The client node can implement the function of storing data through the above modules on the basis of processor and memory hardware. That is, the processor of the client node can run the software code stored in the memory of the client node to execute the corresponding software and thereby implement the function of storing data.
  • the client node provided in the above embodiment only uses the above-mentioned division of functional modules as an example for storing data.
  • the above-mentioned functions can be allocated by different functional modules as required.
  • the client node provided in the foregoing embodiment belongs to the same concept as the data storage method embodiment, and its specific implementation process is described in the method embodiment in detail, and is not repeated here.
  • FIG. 11 is a schematic structural diagram of a storage node according to an embodiment of the present application.
  • the apparatus includes a storage module 1101, an obtaining module 1102, and a sending module 1103.
  • An obtaining module 1102 configured to execute the foregoing step 803;
  • the sending module 1103 is configured to perform the foregoing step 804.
  • the obtaining module 1102 is configured to perform steps 1 to 2 in step 803 described above;
  • the selection sub-module is configured to perform at least one of the implementation manners 1 to 3 in step 1 in the foregoing step 803.
  • the device further includes:
  • Recording module for recording sub-health logs.
  • the recording module is configured to execute at least one of implementations one to two in step 607.
  • the device further includes:
  • a receiving module for receiving a sub-health recovery message.
  • Each module described in the above embodiments may specifically be a software module that performs the corresponding function in software. That is, a "module" may be a functional module composed of a group of computer programs, and the computer program may be a source program or an object program.
  • the computer program can be implemented in any programming language.
  • The storage node can implement the function of recovering data on the basis of processor and memory hardware. That is, the processor of the storage node can run the software code stored in the memory of the storage node to execute the corresponding software and thereby implement the function of recovering data.
  • the storage nodes provided in the above embodiments only use the division of the above functional modules as an example.
  • the above functions can be allocated by different functional modules as required. That is, the internal structure of the storage node is divided into different functional modules to complete all or part of the functions described above.
  • The storage node provided in the foregoing embodiment and the data recovery method embodiment belong to the same concept. For the specific implementation process, refer to the method embodiment; details are not repeated here.
  • The present application further provides a computer program product containing instructions which, when executed on a client node, enables the client node to implement the operations performed by the data storage method in the foregoing embodiment.
  • the present application further provides a computer program product containing instructions, which when executed on a storage node, enables the storage node to implement the operations performed by the data recovery method in the foregoing embodiment.
  • the present application further provides a data storage system.
  • The system includes the client node described in the foregoing FIG. 2 embodiment and the storage node described in the foregoing FIG. 3 embodiment.
  • the system includes: the client node described in the foregoing FIG. 10 embodiment and the storage node described in the foregoing FIG. 11 embodiment.
  • the present application further provides a chip, where the chip includes a processor and / or program instructions, and when the chip runs, implements operations performed by the data storage method in the foregoing embodiment.
  • the present application further provides a chip, where the chip includes a processor and / or program instructions, and when the chip runs, implements operations performed by the data recovery method in the foregoing embodiment.
  • the computer program product includes one or more computer program instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner.
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital video disc (DVD)), a semiconductor medium (such as a solid-state drive), or the like.
  • The term "multiple" in the present application means two or more; for example, multiple data packets refers to two or more data packets.
  • The terms "first" and "second" in this application are used to distinguish between identical or similar items that have basically the same functions and effects. Those skilled in the art will understand that the terms "first" and "second" do not limit quantity or execution order.

Abstract

Provided is a data storage method. The method introduces a mechanism for cross-backup of metadata of data blocks in EC stripes; and by means of storing the metadata of the data blocks and that subjected to cross-backup together in data stripe units, it is ensured that the metadata of the data blocks is mutually stored in different storage nodes, so that even if the metadata of a certain storage node is lost, since the metadata backup of the storage node is pre-stored in the data stripe units of the other storage nodes, the lost metadata of the storage node can also be acquired from the data stripe units of the other storage nodes. As a result, the probability of metadata loss is reduced, and the reliability and security of data storage is greatly improved, thereby improving the storage performance of a distributed storage system.

Description

Data storage method, data recovery method, apparatus, device and storage medium
Technical Field
The present application relates to the field of storage technologies, and in particular, to a data storage method, a data recovery method, an apparatus, a device, and a storage medium.
Background
With the development of storage technology, current distributed storage systems often use erasure code (EC) technology and store data in the form of EC stripes. Each EC stripe consists of m data blocks and k check blocks; the data blocks store data, and the check blocks are used to recover data. When data blocks in an EC stripe are lost, the lost data blocks can be recovered by EC decoding the remaining data blocks and check blocks, as long as the total number of lost data blocks does not exceed k. This greatly improves the stability and reliability of data storage.
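As a toy illustration of the recovery property, the sketch below uses a single XOR check block (k = 1) over m = 3 data blocks. XOR parity is chosen here only because it is short; the source does not name a specific code, and production EC stripes typically use codes that support k > 1. The function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def is_recoverable(lost_blocks, k):
    """A stripe with k check blocks tolerates at most k lost blocks."""
    return lost_blocks <= k

data_blocks = [b"abcd", b"efgh", b"ijkl"]   # m = 3 data blocks
check_block = xor_blocks(data_blocks)       # k = 1 check block

# Lose the second data block and rebuild it from the surviving blocks.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], check_block])
```

Here `rebuilt` equals the lost block because XOR-ing the check block with the surviving data blocks cancels their contributions.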
In a typical process of storing data with EC technology, a distributed system includes a client node, a primary storage node, and at least one secondary storage node. The client node delivers data to the primary storage node; the primary storage node performs EC encoding on the data and sends data blocks and check blocks to the at least one secondary storage node; and each secondary storage node stores a data block or a check block. Specifically, when the client node receives data to be stored, it sends the data to the primary storage node corresponding to the target storage location of the data. The primary storage node divides the data into m data blocks and applies a redundancy algorithm to EC encode the m data blocks into k check blocks. The primary storage node itself stores one data block or check block and, after the store succeeds, records the metadata of that block. The primary storage node also sends the remaining m+k-1 data blocks and check blocks to m+k-1 secondary storage nodes. Each secondary storage node stores one data block or check block and, after the store succeeds, records the metadata of the stored block.
The above solution can guarantee only the security of the data blocks, whereas the security of the metadata of the data blocks is poor.
Summary of the Invention
Embodiments of the present application provide a data storage method, a data recovery method, an apparatus, a device, and a storage medium, which can solve the problem in the related art that the metadata of stored data blocks is poorly secured. The technical solutions are as follows:
According to a first aspect, a data storage method is provided. The method includes:
generating at least one data stripe unit according to at least one data block to be stored, where each data stripe unit includes a data block and cross-backup metadata, and the cross-backup metadata includes the metadata of the data block of that data stripe unit and the metadata of data blocks included in data stripe units other than that data stripe unit;
performing erasure code (EC) encoding on the at least one data stripe unit to obtain at least one check stripe unit; and
distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node.
The method provided in this embodiment introduces a mechanism for cross-backing up the metadata of data blocks in an EC stripe. Because each data stripe unit stores a data block together with cross-backup metadata, different storage nodes hold one another's data block metadata. Even if the metadata of a storage node is lost, the data stripe units of other storage nodes already hold a backup of it, so the missing metadata can be obtained from those data stripe units. This reduces the probability of metadata loss, greatly improves the reliability and security of data storage, and thereby improves the storage performance of the distributed storage system.
Optionally, the generating at least one data stripe unit according to at least one data block to be stored includes:
backing up the metadata of the at least one data block to obtain at least one metadata backup, where the at least one data block is in one-to-one correspondence with the at least one metadata backup;
for a data block in the at least one data block, selecting, from the at least one metadata backup, at least one target metadata backup corresponding to the data block; and
generating a data stripe unit according to the data block, the metadata of the data block, and the at least one target metadata backup.
Optionally, the selecting, from the at least one metadata backup, at least one target metadata backup corresponding to the data block includes:
querying a cross-backup relationship between storage nodes according to a first storage node corresponding to the data block, to obtain at least one second storage node corresponding to the first storage node; and
determining the metadata backup of the data block corresponding to the at least one second storage node as the at least one target metadata backup.
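The generation of data stripe units with cross-backup metadata can be sketched as below. This is an illustrative sketch under stated assumptions: the `DataStripeUnit` class, the function `build_stripe_units`, and the representation of the cross-backup relationship as a dictionary from a first storage node to its second storage nodes are all hypothetical; the source does not prescribe a data layout.

```python
from dataclasses import dataclass, field

@dataclass
class DataStripeUnit:
    node: str                 # first storage node that will hold this unit
    block: bytes              # the data block
    metadata: dict            # metadata of this unit's own data block
    backups: dict = field(default_factory=dict)  # second node -> metadata backup

def build_stripe_units(blocks, metadata, cross_backup):
    """blocks / metadata: node -> data block / its metadata.
    cross_backup: first node -> list of second nodes whose metadata it backs up."""
    units = []
    for node, block in blocks.items():
        # Copy the metadata of each second storage node's block as a backup.
        backups = {peer: dict(metadata[peer]) for peer in cross_backup[node]}
        units.append(DataStripeUnit(node, block, metadata[node], backups))
    return units
```

With a ring-shaped relationship such as `{"n1": ["n2"], "n2": ["n3"], "n3": ["n1"]}`, every node's metadata ends up on exactly one other node; denser relationships trade space for more redundancy.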
Optionally, the performing erasure code (EC) encoding on the at least one data stripe unit to obtain at least one check stripe unit includes at least one of the following steps:
performing EC encoding on the data blocks in the at least one data stripe unit to obtain a check block in the at least one check stripe unit; and
performing EC encoding on the metadata in the at least one data stripe unit to obtain a metadata check block in the at least one check stripe unit.
In this manner, performing EC encoding on the metadata in each data stripe unit further improves the reliability and security of metadata storage. Specifically, when the metadata in any data stripe unit is lost, the lost metadata can be read and recovered not only from the cross-backup metadata stored in other data stripe units, but also by EC decoding the metadata in the other data stripe units together with the metadata check block. This further reduces the probability of metadata loss and greatly improves the reliability and security of metadata storage.
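The two encoding steps above can be sketched with a single XOR parity per stripe, one over the data blocks and one over the serialized metadata. XOR parity and JSON serialization are stand-ins chosen for brevity, not the encoding the source uses, and the names `xor_parity` and `encode_check_unit` are hypothetical.

```python
import json
from functools import reduce

def xor_parity(chunks):
    """XOR byte strings together, zero-padding shorter ones to a common length."""
    size = max(len(c) for c in chunks)
    padded = [c.ljust(size, b"\0") for c in chunks]
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*padded))

def serialize(metadata):
    """Assumed metadata encoding: canonical JSON bytes."""
    return json.dumps(metadata, sort_keys=True).encode()

def encode_check_unit(units):
    """units: list of (data block, metadata dict) pairs from the data stripe units."""
    return {
        "check_block": xor_parity([block for block, _ in units]),
        "metadata_check_block": xor_parity([serialize(meta) for _, meta in units]),
    }
```

To rebuild one lost metadata entry, XOR the metadata check block with the serialized metadata of the surviving units and strip the trailing zero padding before parsing.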
Optionally, the distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node includes:
sending, to each storage node, the corresponding data stripe unit and/or check stripe unit.
In this manner, the client node sends the corresponding data stripe unit and/or check stripe unit to each storage node, so the data to be stored reaches every storage node after a single hop between the client node and that storage node and is then stored there. The data does not need to travel from the client node to a primary storage node and then from the primary storage node to each secondary storage node, which avoids the two-hop forwarding path. This eliminates the network latency introduced by forwarding through the primary storage node, reduces the latency of storing data, and improves the efficiency and speed of data storage, thereby greatly improving the storage performance of the distributed storage system.
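A one-hop fan-out from the client node might look like the sketch below. It is only an illustration: the `distribute` function and the caller-supplied `send` callable (standing in for whatever transport the system uses) are hypothetical, and a thread pool is just one way to issue the sends concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def distribute(units_by_node, send):
    """Send each storage node its stripe unit directly from the client.

    units_by_node: storage node -> data or check stripe unit.
    send: callable(node, unit) that performs the transfer and returns an ack.
    """
    workers = max(1, len(units_by_node))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {node: pool.submit(send, node, unit)
                   for node, unit in units_by_node.items()}
        # Collect one acknowledgement per storage node.
        return {node: future.result() for node, future in futures.items()}
```

Because every send leaves the client directly, no storage node waits on a primary node to forward its unit.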
Optionally, after the distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node, the method further includes:
receiving a first write completion message sent by the at least one storage node; and
sending a second write completion message to a target application or an external host, where the first write completion message indicates that the corresponding storage node has stored the data stripe unit and/or check stripe unit, and the second write completion message indicates that the data has been written to the target storage location.
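The two-level acknowledgement can be sketched as a small tracker on the client node. The class name `WriteTracker`, the `notify` callback, and the policy of waiting for every targeted storage node before acknowledging upward are assumptions made for illustration.

```python
class WriteTracker:
    """Collects first write completion messages and emits the second one once."""

    def __init__(self, nodes, notify):
        self.pending = set(nodes)   # storage nodes that have not yet acknowledged
        self.notify = notify        # callable that reaches the application or host
        self.done = False

    def on_first_write_complete(self, node):
        self.pending.discard(node)
        # Assumed policy: acknowledge upward only after all nodes have stored their unit.
        if not self.pending and not self.done:
            self.done = True
            self.notify("second write completion")
```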
Optionally, the distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node includes:
determining that a third storage node of the at least one storage node is in a sub-health state; and
sending a corresponding data stripe unit and/or check stripe unit to a fourth storage node of the at least one storage node.
In this optional manner, if a storage node is in a sub-health state, that node can be bypassed immediately when data is stored, and the data is stored through other storage nodes. This enables fast switching between storage nodes, reduces the impact of a storage node's sub-health state on storage system performance, and ensures that the storage system can still store data quickly even when a storage node is in a sub-health state, thereby ensuring the reliability and stability of the storage system.
Optionally, the sending a corresponding data stripe unit and/or check stripe unit to a fourth storage node of the at least one storage node includes at least one of the following steps:
sending, to the fourth storage node, at least one second data stripe unit other than a first data stripe unit corresponding to the third storage node, and the at least one check stripe unit; and
sending, to the fourth storage node, at least one second check stripe unit other than a first check stripe unit corresponding to the third storage node, and the at least one data stripe unit.
In this implementation, the data to be stored is written in degraded mode: when a storage node is in a sub-health state, that storage node can be bypassed immediately, and the data stripe units and/or check stripe units other than those corresponding to that storage node are distributed, achieving fast switching in the sub-health state.
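The degraded write can be sketched as a simple partition of the stripe units by node health. The function name `plan_degraded_write` and the representation of sub-healthy nodes as a set are hypothetical.

```python
def plan_degraded_write(units_by_node, sub_healthy):
    """Split a stripe into units to send now and units that are skipped.

    units_by_node: storage node -> its data or check stripe unit.
    sub_healthy: set of storage nodes currently in the sub-health state.
    """
    # Units for healthy ("fourth") storage nodes are sent as usual.
    to_send = {node: unit for node, unit in units_by_node.items()
               if node not in sub_healthy}
    # Units for sub-healthy ("third") storage nodes are not sent.
    skipped = [node for node in units_by_node if node in sub_healthy]
    return to_send, skipped
```

For example, with units for `n1`, `n2`, and `n3` and `n3` sub-healthy, only the units for `n1` and `n2` are sent.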
Optionally, after the sending, to the fourth storage node of the at least one storage node, at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node, and the at least one check stripe unit, the method further includes:
when a read request is received, sending the read request to a fifth storage node, where the read request instructs reading of the data block of the first data stripe unit, and the cross-backup metadata stored by the fifth storage node includes the metadata of the data block of the first data stripe unit.
Optionally, before the sending the read request to the fifth storage node, the method further includes:
querying the cross-backup relationship between storage nodes to determine the fifth storage node corresponding to the third storage node.
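Routing a read around a sub-healthy node can be sketched as a lookup in the cross-backup relationship. In this sketch the relationship is assumed to be stored as a dictionary from a node to the nodes that hold a backup of its metadata; that layout and the function name `route_read` are hypothetical.

```python
def route_read(owner, backup_holders, sub_healthy):
    """Pick the storage node that should receive a read for `owner`'s data block.

    backup_holders: node -> nodes whose cross-backup metadata covers that node.
    sub_healthy: set of nodes currently in the sub-health state.
    """
    if owner not in sub_healthy:
        return owner
    # Redirect to a healthy node that holds the owner's metadata backup.
    for peer in backup_holders.get(owner, []):
        if peer not in sub_healthy:
            return peer
    raise LookupError(f"no healthy node holds metadata for {owner}")
```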
Optionally, the sending, to the fourth storage node of the at least one storage node, at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node, and the at least one check stripe unit includes at least one of the following steps:
writing a sub-health mark of the third storage node into the at least one second data stripe unit, where the sub-health mark indicates that the third storage node is in a sub-health state; and
sending a write request to the fourth storage node, where the write request carries the sub-health mark of the third storage node and the second data stripe unit corresponding to the fourth storage node.
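The second option, in which the write request itself carries the sub-health mark, can be sketched as follows. The request is modeled as a plain dictionary; the field names and the function `build_write_requests` are hypothetical.

```python
def build_write_requests(units_by_node, sub_healthy):
    """Build one write request per healthy node, tagged with sub-health marks.

    units_by_node: storage node -> stripe unit intended for that node.
    sub_healthy: set of storage nodes currently in the sub-health state.
    """
    marks = sorted(sub_healthy)  # identifiers of the sub-healthy ("third") nodes
    return [
        {"target": node, "stripe_unit": unit, "sub_health_marks": marks}
        for node, unit in units_by_node.items()
        if node not in sub_healthy
    ]
```

A receiving storage node can use the marks to note which peers missed this stripe.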
Optionally, after the sending the corresponding data stripe unit and/or check stripe unit to the fourth storage node, the method further includes:
when a first write completion message sent by the at least one fourth storage node is received, sending a second write completion message to a target application or an external host.
According to a second aspect, a data recovery method is provided. The method includes:
storing at least one data stripe unit, where each data stripe unit includes a data block and cross-backup metadata, and the cross-backup metadata includes the metadata of the data block of that data stripe unit and the metadata of data blocks included in data stripe units other than that data stripe unit;
obtaining missing metadata of a third storage node according to the at least one data stripe unit; and
sending the missing metadata to the third storage node.
The method provided in this embodiment introduces a mechanism for cross-backing up the metadata of data blocks in an EC stripe. Because each data stripe unit stores a data block together with cross-backup metadata, different storage nodes hold one another's data block metadata. Even if the metadata of a storage node is lost, the data stripe units of the other storage nodes already hold a backup of it, so the other storage nodes can obtain the missing metadata from the data stripe units they store and synchronize it to that storage node. This reduces the probability of metadata loss, greatly improves the reliability and security of data storage, and thereby improves the storage performance of the distributed storage system.
Optionally, the obtaining missing metadata of a third storage node according to the at least one data stripe unit includes:
selecting at least one second data stripe unit from the at least one data stripe unit, where the storage time of the second data stripe unit falls within the period during which the third storage node is in a sub-health state; and
determining the missing metadata according to the cross-backup metadata in the at least one second data stripe unit.
Optionally, the selecting at least one second data stripe unit from the at least one data stripe unit includes at least one of the following steps:
querying a sub-health log according to an identifier of the third storage node to obtain the at least one second data stripe unit, where the sub-health log indicates the data stripe units stored while the third storage node was in a sub-health state; and
selecting a data stripe unit that has a sub-health mark of the third storage node as the at least one second data stripe unit, where the sub-health mark indicates that the third storage node is in a sub-health state.
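The selection of second data stripe units and the extraction of the missing metadata can be sketched as below. Stored units are modeled as dictionaries with hypothetical keys `stripe_id`, `sub_health_marks`, and `cross_backup`; the sub-health log is modeled as a dictionary from node identifier to stripe identifiers. None of these layouts comes from the source.

```python
def select_second_units(stored_units, third_node, sub_health_log=None):
    """Pick the units stored while `third_node` was in the sub-health state."""
    if sub_health_log is not None:
        # Option 1: look up the stripe identifiers in the sub-health log.
        wanted = set(sub_health_log.get(third_node, []))
        return [u for u in stored_units if u["stripe_id"] in wanted]
    # Option 2: pick the units that carry the node's sub-health mark.
    return [u for u in stored_units if third_node in u["sub_health_marks"]]

def missing_metadata(second_units, third_node):
    """Read the third node's metadata from each unit's cross-backup metadata."""
    return [u["cross_backup"][third_node]
            for u in second_units if third_node in u["cross_backup"]]
```

The returned list is what a healthy storage node would send to the third storage node during recovery.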
Optionally, before the querying a sub-health log according to the identifier of the third storage node, the method further includes:
recording the sub-health log for the third storage node while the third storage node is in a sub-health state.
In this optional manner, if a storage node lacks metadata because of a transient sub-health interruption, then once that storage node is in the sub-health recovery state, the other storage nodes can use the metadata differences recorded in their locally recorded sub-health logs to synchronize the missing metadata to the storage node that has recovered from sub-health. The storage node can therefore recover its missing metadata automatically after leaving the sub-health state, which reduces the impact of the sub-health state on storage system performance and ensures high stability and high reliability of the distributed storage system.
Optionally, the recording the sub-health log for the third storage node while the third storage node is in a sub-health state includes at least one of the following steps:
when a received write request carries the sub-health mark of the third storage node and a data stripe unit, writing a sub-health record corresponding to the third storage node into the sub-health log; and
when a received data stripe unit carries the sub-health mark of the third storage node, writing a sub-health record corresponding to the third storage node into the sub-health log.
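Recording the sub-health log on a receiving storage node can be sketched as a small in-memory structure. The class `SubHealthLog`, its method names, and the choice to record stripe identifiers as the sub-health records are assumptions for illustration; a real node would presumably persist the log.

```python
from collections import defaultdict

class SubHealthLog:
    """Per-node record of stripe units stored while a peer was sub-healthy."""

    def __init__(self):
        self._records = defaultdict(list)   # sub-healthy node -> stripe identifiers

    def on_write(self, stripe_id, sub_health_marks):
        # Called when a write request or stripe unit carries sub-health marks.
        for node in sub_health_marks:
            self._records[node].append(stripe_id)

    def query(self, node):
        # Stripe identifiers stored while `node` was in the sub-health state.
        return list(self._records.get(node, []))
```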
Optionally, before the determining the missing metadata of the third storage node according to the at least one data stripe unit, the method further includes:
receiving a sub-health recovery message, where the sub-health recovery message indicates that the third storage node is in a sub-health recovery state.
According to a third aspect, a client node is provided, configured to perform the foregoing data storage method. Specifically, the client node includes functional modules for performing the data storage method according to the first aspect or any optional manner of the first aspect.
According to a fourth aspect, a storage node is provided, configured to perform the foregoing data recovery method. Specifically, the storage node includes functional modules for performing the data recovery method according to the second aspect or any optional manner of the second aspect.
According to a fifth aspect, a client node is provided. The client node includes a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data storage method according to the first aspect or any optional manner of the first aspect.
According to a sixth aspect, a storage node is provided. The storage node includes a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data recovery method according to the second aspect or any optional manner of the second aspect.
According to a seventh aspect, a non-transitory computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data storage method according to the first aspect or any optional manner of the first aspect.
According to an eighth aspect, a non-transitory computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data recovery method according to the second aspect or any optional manner of the second aspect.
According to a ninth aspect, a computer program product containing instructions is provided. When the computer program product runs on a client node, the client node is enabled to implement the operations performed by the data storage method according to the first aspect or any optional manner of the first aspect.
According to a tenth aspect, a computer program product containing instructions is provided. When the computer program product runs on a storage node, the storage node is enabled to implement the operations performed by the data recovery method according to the second aspect or any optional manner of the second aspect.
According to an eleventh aspect, a data storage system is provided. In a possible implementation, the system includes:
the client node according to the third aspect and the storage node according to the fourth aspect.
In another possible implementation, the system includes:
the client node according to the fifth aspect and the storage node according to the sixth aspect.
According to a twelfth aspect, a chip is provided. The chip includes a processor and/or program instructions. When the chip runs, it implements the operations performed in the data storage method described in the first aspect or any optional implementation of the first aspect.
According to a thirteenth aspect, a chip is provided. The chip includes a processor and/or program instructions. When the chip runs, it implements the operations performed in the data recovery method described in the second aspect or any optional implementation of the second aspect.
Brief Description of the Drawings
FIG. 1 is a system architecture diagram of a distributed storage system according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a client node according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a storage node according to an embodiment of the present application;
FIG. 4 is a flowchart of a data storage method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of data storage according to an embodiment of the present application;
FIG. 6 is a flowchart of a data storage method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of data storage according to an embodiment of the present application;
FIG. 8 is a flowchart of a data recovery method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of data storage according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a client node according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a storage node according to an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
The terms used in this application are explained below:
Cross backup: a technique in which different storage nodes store backups of one another's data. For example, when node A, node B, and node C perform cross backup, node A may store its own data and a backup of node B's data, node B may store its own data and a backup of node C's data, and node C may store its own data and a backup of node A's data. With cross backup, even if a node loses its data, the backup can still be read from another node to restore the lost data.
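The ring arrangement in the node A, B, C example can be sketched as follows. This is a minimal illustration only; the function names `cross_backup_layout` and `backup_holder` are hypothetical and do not appear in the application.

```python
def cross_backup_layout(nodes):
    """Map each node to the peer whose data it backs up (the next node in the ring)."""
    if len(nodes) < 2:
        raise ValueError("cross backup needs at least two nodes")
    n = len(nodes)
    return {node: nodes[(i + 1) % n] for i, node in enumerate(nodes)}


def backup_holder(layout, lost_node):
    """Return the node that stores a backup of lost_node's data."""
    for holder, backed_up in layout.items():
        if backed_up == lost_node:
            return holder
    raise KeyError(lost_node)


# A holds a backup of B, B holds a backup of C, C holds a backup of A.
layout = cross_backup_layout(["A", "B", "C"])
```

If node A loses its data, `backup_holder(layout, "A")` identifies node C as the place to read the backup from.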
Three-copy technology: a technique for storing data redundantly. One piece of data is replicated into three copies, and the three copies are stored on three storage nodes respectively. With three-copy technology, when one copy is lost, it can be restored by copying one of the remaining copies. Each piece of data occupies three times the storage space, so the disk utilization is 1/3.
EC technology: a technique for storing data redundantly. The original data is encoded with an erasure coding algorithm to obtain redundant check blocks, and the data blocks and check blocks are stored on different storage nodes. Specifically, the data to be stored is divided into m data blocks, and a redundancy algorithm is used to perform EC encoding on the m data blocks to generate k check blocks. The m data blocks and the k check blocks form an EC stripe. Each data block or check block may be called an EC block of the EC stripe, and each EC block may be distributed to a different storage node for storage. Each EC stripe can tolerate the loss of at most k EC blocks. If storage nodes fail, as long as no more than k storage nodes have failed, the EC blocks stored on the failed nodes can be recovered from the EC blocks on the non-failed storage nodes. A distributed storage system that stores data with EC technology therefore has high reliability. In addition, compared with three-copy technology, EC technology greatly saves storage space: three-copy technology needs three times the storage space to store one piece of data, whereas EC technology needs only 1.4 times the storage space.
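The storage comparison above follows from simple arithmetic: replication with c copies consumes c times the raw space, while an EC stripe of m data blocks and k check blocks consumes (m + k)/m times. The sketch below works this out; the values m = 10 and k = 4 are an assumption chosen only because they reproduce the 1.4 figure, since the text does not state which m and k it has in mind.

```python
from fractions import Fraction


def replication_overhead(copies):
    """Raw space consumed per unit of user data with plain replication."""
    return Fraction(copies, 1)


def ec_overhead(m, k):
    """Raw space consumed per unit of user data with m data blocks and k check blocks."""
    return Fraction(m + k, m)


def disk_utilization(overhead):
    """Fraction of raw space that holds user data."""
    return 1 / overhead


three_copy = replication_overhead(3)   # 3x space, utilization 1/3
ec_10_4 = ec_overhead(10, 4)           # assumed m=10, k=4 gives 7/5 = 1.4x
```

Other combinations with the same ratio, such as m = 5 and k = 2, give the same overhead but tolerate fewer lost EC blocks per stripe.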
FIG. 1 is a system architecture diagram of a distributed storage system according to an embodiment of the present application. The distributed storage system includes a client node, at least one storage node, a metadata controller (Meta Data Controller, MDC) node, an Elastic Compute Service (ECS) service node, and a Volume Backup Service (VBS) node. The system shown in FIG. 1 can provide an object storage service for customers.
The client node is also called a storage client. The client node can interact with an upper-layer application or an external host, receive data from the upper-layer application or external host, and distribute the data to storage nodes for storage. The client node may be a server, a personal computer (PC), a notebook computer, or the like. The client node may be a single device, for example one or more program modules on a device, or a virtual machine or container running on a device. The client node may also be a cluster of multiple devices, for example a collective name for multiple program modules distributed across multiple devices.
The storage node can store data stripe units and/or check stripe units, receive read requests and/or write requests, and access locally stored data and metadata. The storage node may be an object-based storage (OSD) node, a network attached storage (NAS) node, a storage area network (SAN) node, or the like, and may be a server, a PC, a notebook computer, or the like. The ways in which the storage node stores data include but are not limited to object storage, block storage, and file storage. The storage node may be a physical storage node or a logical storage node obtained by partitioning a physical storage node.
The MDC service node can be used to maintain a partition view. The partition view may include the mapping between partitions and storage nodes and the current state of each storage node. When the partition view changes, the changed partition view can be synchronized to the client node and to each storage node. The MDC service node may be a server, a PC, a notebook computer, or the like. The MDC service node may be a single device, for example one or more program modules on a device, or a virtual machine or container running on a device. It may also be a cluster of multiple devices, for example a collective name for multiple program modules distributed across multiple devices.
The ECS service node is used to allocate blank EC stripes and send them to the client node, so that the client node writes data to the blank EC stripes. The ECS service node may be a single device, for example one or more program modules on a device, or a virtual machine or container running on a device. It may also be a cluster of multiple devices, for example a collective name for multiple program modules distributed across multiple devices.
The VBS node can provide a virtual hard disk function to an upper-layer application or an external host, and can receive read requests or write requests from the upper-layer application or external host. The VBS node may be a single device, for example one or more program modules on a device, or a virtual machine or container running on a device. It may also be a cluster of multiple devices, for example a collective name for multiple program modules distributed across multiple devices.
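One way to picture the partition view and its synchronization is the sketch below. It is an illustrative assumption, not the application's design: the class names `PartitionView` and `MdcService`, the version counter, the callback mechanism, and the state strings "normal" and "sub-healthy" are all hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PartitionView:
    version: int = 0
    partition_to_node: Dict[int, str] = field(default_factory=dict)  # partition -> storage node
    node_state: Dict[str, str] = field(default_factory=dict)         # storage node -> state


class MdcService:
    def __init__(self) -> None:
        self.view = PartitionView()
        self._subscribers: List[Callable[[PartitionView], None]] = []

    def subscribe(self, callback: Callable[[PartitionView], None]) -> None:
        """Register a client node or storage node that should receive view updates."""
        self._subscribers.append(callback)

    def _publish(self) -> None:
        # Bump the version and push an independent snapshot to every subscriber.
        self.view.version += 1
        for callback in self._subscribers:
            callback(copy.deepcopy(self.view))

    def assign_partition(self, partition: int, node: str) -> None:
        self.view.partition_to_node[partition] = node
        self.view.node_state.setdefault(node, "normal")
        self._publish()

    def set_node_state(self, node: str, state: str) -> None:
        if self.view.node_state.get(node) != state:
            self.view.node_state[node] = state
            self._publish()


received = []
mdc = MdcService()
mdc.subscribe(received.append)
mdc.assign_partition(0, "osd-1")
mdc.set_node_state("osd-1", "sub-healthy")
```

Each subscriber receives a snapshot, so a later change on the MDC side does not silently alter a view that a node has already acted on.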
FIG. 2 is a schematic structural diagram of a client node according to an embodiment of the present application. The client node 200 may vary considerably with configuration or performance, and may include one or more processors (Central Processing Units, CPU) 201 and one or more memories 202. The memory 202 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 201 to implement the data storage method in the method embodiments described below. The client node 200 may also have components such as a wired or wireless network interface and an input/output interface for input and output, and may further include other components for implementing device functions. Details are not described here.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, for example a memory including instructions. The instructions can be executed by a processor in a client node to complete the data storage method in the embodiments described below. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
FIG. 3 is a schematic structural diagram of a storage node according to an embodiment of the present application. The storage node 300 may vary considerably with configuration or performance, and may include one or more processors (Central Processing Units, CPU) 301 and one or more memories 302. The one or more memories 302 may be hard disks mounted on the storage node, and each hard disk may be a logical virtual hard disk or a physical hard disk. The one or more memories 302 store at least one instruction, and the at least one instruction is loaded and executed by the processor 301 to implement the data recovery method in the method embodiments described below. The storage node 300 may also have components such as a wired or wireless network interface and an input/output interface for input and output, and may further include other components for implementing device functions. Details are not described here.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, for example a memory including instructions. The instructions can be executed by a processor in a storage node to complete the data recovery method in the embodiments described below. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
FIG. 4 is a flowchart of a data storage method according to an embodiment of the present application. As shown in FIG. 4, the interacting entities of the method include a client node and at least one storage node, and the method includes the following steps:
401. The client node obtains at least one blank EC stripe, where each blank EC stripe is used to carry at least one data stripe unit and at least one check stripe unit.
In a distributed storage system, the client node can interact with an upper-layer application or an external host, receive data from the upper-layer application or external host, and distribute the data to the storage nodes for storage. The client node may be a server, a personal computer, a notebook computer, or the like. The client node may be a single device, for example one or more program modules on a device, or a virtual machine or container running on a device. The client node may also be a cluster of multiple devices, for example a collective name for multiple program modules distributed across multiple devices. This embodiment does not limit the physical form of the client node.
The client node can obtain at least one blank EC stripe so that, when storing data, it can write data blocks, check blocks, and metadata to the blank EC stripe. Each blank EC stripe can be regarded as a row of blank data blocks, and can carry a stripe identifier and at least one stripe unit identifier. The stripe identifier identifies the corresponding EC stripe and may be the identification (ID) of the EC stripe, for example its number or name. The stripe unit identifier identifies the corresponding stripe unit and may be the ID of the stripe unit, for example its number or name.
Regarding how the blank EC stripe is obtained, optionally, the ECS service may allocate at least one blank EC stripe to the client node and send it to the client node, and the client node receives the at least one blank EC stripe from the ECS service. Optionally, the distributed storage system may include multiple client nodes, and the ECS service may allocate corresponding blank EC stripes to each client node and send them to that client node. The stripe identifiers of the blank EC stripes allocated to different client nodes may differ. For example, blank EC stripes 1 to 100 are allocated to client node 1, and blank EC stripes 101 to 200 are allocated to client node 2.
It should be noted that step 401 is optional rather than mandatory, and this embodiment does not limit whether step 401 is performed. Optionally, the client node may skip the step of obtaining a blank EC stripe. For example, at least one blank EC stripe may be stored in advance, and data is written to the pre-stored blank EC stripe when data is stored. As another example, data may be stored without a blank EC stripe: when data is stored, the data to be stored overwrites an old EC stripe that already contains data, so that the old EC stripe is reused.
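A blank EC stripe that carries a stripe identifier and stripe unit identifiers could be modeled as below. This is only a sketch: the class and field names are hypothetical, numeric identifiers are one of the options the text allows, and the default of 3 data units plus 2 check units simply mirrors the example stripe of FIG. 5.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StripeUnit:
    unit_id: int          # stripe unit identifier (a number here; a name would also do)
    kind: str             # "data" or "check"
    payload: bytes = b""  # empty until the client node writes to the unit


@dataclass
class ECStripe:
    stripe_id: int        # stripe identifier
    units: List[StripeUnit] = field(default_factory=list)

    def is_blank(self) -> bool:
        return all(not unit.payload for unit in self.units)


def make_blank_stripe(stripe_id: int, data_units: int = 3, check_units: int = 2) -> ECStripe:
    """Create a stripe whose units are identified but not yet written."""
    kinds = ["data"] * data_units + ["check"] * check_units
    units = [StripeUnit(unit_id=i + 1, kind=kind) for i, kind in enumerate(kinds)]
    return ECStripe(stripe_id=stripe_id, units=units)


stripe = make_blank_stripe(stripe_id=1)
```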
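The allocation example (stripes 1 to 100 for client node 1, stripes 101 to 200 for client node 2) amounts to handing out non-overlapping identifier ranges. A minimal sketch, assuming a simple monotonically increasing counter; the class name `StripeAllocator` and its interface are hypothetical.

```python
class StripeAllocator:
    """Hands out disjoint ranges of blank EC stripe identifiers to client nodes."""

    def __init__(self, first_id=1):
        self._next_id = first_id
        self.allocations = {}  # client name -> list of allocated ranges

    def allocate(self, client, count):
        if count <= 0:
            raise ValueError("count must be positive")
        ids = range(self._next_id, self._next_id + count)
        self._next_id += count
        self.allocations.setdefault(client, []).append(ids)
        return ids


allocator = StripeAllocator()
client1_ids = allocator.allocate("client node 1", 100)  # stripes 1..100
client2_ids = allocator.allocate("client node 2", 100)  # stripes 101..200
```

Because every range starts where the previous one ended, two client nodes never receive the same stripe identifier.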
402. The client node obtains the data to be stored and the target storage location of the data.
The data to be stored may be I/O data, that is, data input to and/or output from the distributed storage system. The target storage location is the location at which the data needs to be stored. Obtaining the target storage location determines where the data is stored, so that the stored data can later be read from that location. Optionally, the storage space of the distributed storage system may be presented as one or more logical volumes, each logical volume may be divided into multiple logical blocks, and one or more logical blocks may be used to store data. The target storage location may then be a logical block address (LBA) and may include a starting LBA and a data length. Alternatively, the target storage location may be an identifier of a logical block, for example the key of the logical block.
Regarding how the data to be stored and the target storage location are obtained, optionally, the client node may receive a write request and parse it to obtain the data and the target storage location carried in the write request. The write request may be generated by an upper-layer application, an external host, or a VBS node and sent to the client node. Optionally, the write request may be triggered by a user's input operation.
Optionally, the client node may divide the data to be stored to obtain at least one data block to be stored. A data block is the data obtained by dividing the data at block granularity. For example, a data block may be GRAIN data 1, GRAIN data 2, or GRAIN data 3 in FIG. 5.
It should be noted that step 402 is optional rather than mandatory. In a possible embodiment, the client node may store the at least one data block to be stored in advance, and may then perform the subsequent step 403 directly on the pre-stored data blocks.
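Dividing the data at block granularity can be sketched as a fixed-size split. The block size and the handling of a short final block are assumptions; the text does not specify a block size or say whether the last block is padded.

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Split data into consecutive blocks of block_size bytes; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Three blocks, playing the role of GRAIN data 1, 2 and 3 in FIG. 5.
blocks = split_into_blocks(b"abcdefgh", 3)
```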
403. The client node generates at least one data stripe unit according to the at least one data block to be stored, where each data stripe unit includes a data block and cross-backup metadata.
The metadata of a data block is data that describes the data block and can be used to index to the corresponding data block. The metadata of a data block may be the mapping between the target storage location and a stripe unit identifier, for example the mapping between an LBA and a stripe unit identifier. For example, referring to FIG. 5, the metadata may be GRAIN metadata 1, GRAIN metadata 2, or GRAIN metadata 3. GRAIN metadata 1 is the metadata of GRAIN data 1, and its content may be the mapping between stripe unit 1 and LBA1. GRAIN metadata 2 is the metadata of GRAIN data 2, and its content may be the mapping between stripe unit 2 and LBA2, and so on.
Stripe unit: a unit of an EC stripe, that is, a component of the EC stripe. An EC stripe and its stripe units are in a whole-part relationship. Each EC stripe may include at least one stripe unit, for example 5 or 7 stripe units. If an EC stripe is regarded as a row of data blocks, a stripe unit can be regarded as one column in that row. For example, FIG. 5 shows an EC stripe that includes 5 stripe units: stripe unit 1, stripe unit 2, stripe unit 3, stripe unit 4, and stripe unit 5.
A stripe unit stores data of the EC stripe. In this embodiment, stripe units are distinguished by what they store. A stripe unit that stores a data block and cross-backup metadata is called a data stripe unit. A stripe unit that stores at least one of a check block and a metadata check block is called a check stripe unit. An EC stripe may include at least one data stripe unit and at least one check stripe unit, and the numbers of data stripe units and check stripe units in an EC stripe may be determined by the EC encoding algorithm. For example, an EC stripe may include 3 data stripe units and 2 check stripe units.
The cross-backup metadata of a data stripe unit includes the metadata of the data block in that data stripe unit and the metadata of data blocks in other data stripe units. In other words, for any data stripe unit, its cross-backup metadata contains the metadata of the data block stored in the unit itself as well as the metadata of the data blocks stored in one or more other data stripe units.
For example, referring to FIG. 5, the cross-backup metadata is the part shown in the bold boxes. The data block stored in data stripe unit 1 is GRAIN data 1, and its cross-backup metadata includes GRAIN metadata 1 (the metadata of GRAIN data 1), GRAIN metadata 2 (the metadata of GRAIN data 2), and GRAIN metadata 3 (the metadata of GRAIN data 3). Data stripe unit 1 therefore stores not only the metadata of its own data block (GRAIN metadata 1) but also the metadata of the data block stored in data stripe unit 2 (GRAIN metadata 2) and the metadata of the data block stored in data stripe unit 3 (GRAIN metadata 3). Similarly, the data block stored in data stripe unit 2 is GRAIN data 2, and its cross-backup metadata includes GRAIN metadata 2, GRAIN metadata 1, and GRAIN metadata 3. Data stripe unit 2 therefore stores not only the metadata of its own data block (GRAIN metadata 2) but also the metadata of the data block stored in data stripe unit 1 (GRAIN metadata 1) and the metadata of the data block stored in data stripe unit 3 (GRAIN metadata 3).
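Metadata of this kind, a mapping between a target storage location and a stripe unit identifier, can be sketched as a small record plus a lookup. The record layout (starting LBA, length, stripe unit identifier) and the names `GrainMetadata` and `find_stripe_unit` are assumptions for illustration; the application does not define a concrete metadata format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class GrainMetadata:
    start_lba: int       # starting logical block address of the data block
    length: int          # number of logical blocks covered
    stripe_unit_id: int  # stripe unit that stores the data block


def find_stripe_unit(entries: Iterable[GrainMetadata], lba: int) -> Optional[int]:
    """Return the stripe unit holding the given LBA, or None if no entry covers it."""
    for entry in entries:
        if entry.start_lba <= lba < entry.start_lba + entry.length:
            return entry.stripe_unit_id
    return None


# GRAIN metadata 1..3: each maps an LBA range to stripe unit 1..3 (illustrative values).
metadata = [
    GrainMetadata(start_lba=0, length=8, stripe_unit_id=1),
    GrainMetadata(start_lba=8, length=8, stripe_unit_id=2),
    GrainMetadata(start_lba=16, length=8, stripe_unit_id=3),
]
```

A read for LBA 10 would be routed to stripe unit 2 via `find_stripe_unit(metadata, 10)`.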
可选地,交叉备份的元数据可以分为以下(1)至(2)两种类型:Optionally, the metadata of the cross backup can be divided into the following two types (1) to (2):
(1)交叉备份的元数据可以包括待存储的所有数据块的元数据,相应地,至少一个数据条带单元的交叉备份的元数据可以相同,每个数据条带单元均存储所有数据条带单元中数据块的元数据。例如,假设数据划分为m个数据块,则交叉备份的元数据可以为m个数据块的元数据,也即是,每个数据条带单元均存储m个数据块的元数据。(1) The metadata of the cross-backup may include the metadata of all data blocks to be stored, and accordingly, the metadata of the cross-backup of at least one data stripe unit may be the same, and each data stripe unit stores all data stripe The metadata of the data block in the unit. For example, if the data is divided into m data blocks, the metadata of the cross-backup may be the metadata of m data blocks, that is, each data stripe unit stores the metadata of m data blocks.
示例性地,请参见图5,假设数据划分为3个数据块,则数据条带单元1、数据条带单元2以及数据条带单元3中交叉备份的元数据可以均相同,为这3个数据块的元数据。Exemplarily, referring to FIG. 5, assuming that the data is divided into three data blocks, the metadata backed up in the data stripe unit 1, the data stripe unit 2, and the data stripe unit 3 may all be the same, for these three The metadata of the data block.
基于(1),每个数据条带单元会存储所有数据条带单元的数据块的元数据,如果 某个数据条带单元的数据块的元数据丢失,可以从剩余的任一数据条带单元中,读取和恢复丢失的元数据,从而提高元数据存储的可靠性和安全性,从而提高分布式存储系统的可用性。Based on (1), each data stripe unit will store the metadata of the data blocks of all the data stripe units. If the metadata of the data block of a certain data stripe unit is lost, you can start from any of the remaining data stripe units. In the process of reading and recovering lost metadata, the reliability and security of metadata storage is improved, and the availability of distributed storage systems is improved.
(2)交叉备份的元数据可以包括待存储的部分数据块的元数据,相应地,每个数据条带单元可以存储部分数据条带单元中数据块的元数据。例如,假设数据划分为m个数据块,则交叉备份的元数据可以包括至少两个数据块的元数据,也即是,每个数据条带单元均存储至少两个数据块的元数据。(2) The metadata of the cross-backup may include the metadata of a part of the data block to be stored, and accordingly, each data stripe unit may store the metadata of the data block in the part of the data stripe unit. For example, assuming that the data is divided into m data blocks, the metadata of the cross-backup may include metadata of at least two data blocks, that is, each data stripe unit stores metadata of at least two data blocks.
可选地,交叉备份的元数据可以包括相邻的一条或多条数据条带单元中数据块的元数据,例如可以包括上一个数据条带单元的数据块的元数据和下一个数据条带单元的数据块的元数据。示例性地,第i个数据条带单元的交叉备份的元数据中,可以包括第i个数据条带单元的数据块的元数据、第i-1个数据条带单元的数据块的元数据以及第i+1个数据条带单元的数据块的元数据。Optionally, the cross-backup metadata may include metadata of data blocks in one or more adjacent data stripe units, for example, may include metadata of data blocks of the previous data stripe unit and the next data stripe The metadata of the data block of the cell. Exemplarily, the metadata of the cross backup of the i-th data stripe unit may include the metadata of the data block of the i-th data stripe unit and the metadata of the data block of the i-1th data stripe unit And the metadata of the data block of the i + 1th data stripe unit.
基于(2),每个数据条带单元可以存储部分数据条带单元的数据块的元数据,可以节约分布式存储系统的存储空间。Based on (2), each data stripe unit can store metadata of data blocks of some data stripe units, which can save storage space of a distributed storage system.
It should be noted that whether the cross-backup metadata in a data stripe unit follows (1) or (2) above can be designed according to requirements. In addition, how many data blocks' metadata the cross-backup metadata includes, and which data blocks' metadata it includes, can both be designed according to actual requirements, which is not limited in this embodiment.
A data stripe unit is a stripe unit in an EC stripe that is used to store a data block and cross-backup metadata. It may be an EC block in the EC stripe, and each data stripe unit may include one data block. Exemplarily, referring to FIG. 5, a data stripe unit may be stripe unit 1, stripe unit 2, or stripe unit 3. Each EC stripe may include at least one data stripe unit, and the number of data stripe units in an EC stripe may be equal to the number of data blocks. For example, assuming that the data to be stored is divided into m data blocks, the EC stripe of the data may include m data stripe units.
Optionally, for any data stripe unit, the cross-backup metadata in the data stripe unit may consist of the metadata of the data block in that data stripe unit and at least one target metadata backup, where the at least one target metadata backup is a metadata backup of the data blocks of data stripe units other than that data stripe unit. Specifically, the process of generating any data stripe unit may include the following steps 1 to 3:
Step 1: Back up the metadata of at least one data block to obtain at least one metadata backup, where the at least one data block and the at least one metadata backup are in one-to-one correspondence.
For each data block in the at least one data block, the metadata of the data block may be determined and backed up to obtain a metadata backup of that data block. By analogy, the metadata of every data block is backed up, which yields at least one metadata backup in one-to-one correspondence with the at least one data block. For example, referring to FIG. 5, the metadata backup of GRAIN data 1 is GRAIN metadata 1, and the metadata backup of GRAIN data 2 is GRAIN metadata 2.
Regarding the process of determining the metadata of a data block: according to the target storage location corresponding to each data stripe unit and the stripe unit identifier of each data stripe unit, the mapping relationship between the target storage location and the stripe unit identifier may be recorded as the metadata of the data block.
Step 2: For a data block in the at least one data block, select at least one target metadata backup corresponding to the data block from the at least one metadata backup.
Optionally, this step may specifically include the following steps (2.1) and (2.2):
Step (2.1): According to the first storage node corresponding to the data block, query the cross-backup relationship between storage nodes to obtain at least one second storage node corresponding to the first storage node.
For ease of description, any storage node of the distributed storage system is referred to as a first storage node, and a storage node that has a cross-backup relationship with the first storage node is referred to as a second storage node. It should be noted that the terms "first storage node" and "second storage node" are used only to distinguish different storage nodes, and should not be understood as expressing or implying an order between storage nodes, their relative importance, or the total number of storage nodes.
A cross-backup relationship is a relationship in which storage nodes store metadata backups for one another. It can serve at least two functions. First, when data is stored, the cross-backup relationship determines which storage nodes' metadata backups any given storage node stores, so that the cross-backup metadata each storage node needs to store can be determined from it. For example, if the cross-backup relationship indicates that OSD node 1 is to store the metadata backups of OSD node 2 and OSD node 3, then when data is stored, the metadata backups of OSD node 2 and OSD node 3 can be stored in the data stripe unit of OSD node 1. Second, when data is recovered, the cross-backup relationship determines on which storage nodes any given metadata backup is stored, so that the metadata backup can be read from the cross-backup metadata stored by the corresponding storage nodes. For example, if the cross-backup relationship indicates that the metadata backup of OSD node 1 is stored on OSD node 2 and OSD node 3, then when the metadata of OSD node 1 is lost, the metadata backup of OSD node 1 can be read from OSD node 2 and OSD node 3.
The data form of the cross-backup relationship can be designed according to actual requirements. For example, the cross-backup relationship may include one or more of the following (1) and (2):
(1) The cross-backup relationship may be a mapping relationship between storage node identifiers, and may include the identifiers of multiple storage nodes. Accordingly, the process of determining the at least one second storage node may include: using the identifier of the first storage node as an index, querying the cross-backup relationship to obtain the identifier of at least one second storage node, and determining the at least one second storage node according to that identifier. A storage node identifier identifies the corresponding storage node, and may include the ID, name, number, or the like of the storage node.
(2) The cross-backup relationship may be a mapping relationship between stripe unit identifiers, and may include the stripe unit identifiers of multiple data stripe units. Taking as an example the case where the data stripe unit corresponding to the first storage node is called the first data stripe unit and the data stripe unit corresponding to a second storage node is called a second data stripe unit, the process of determining the at least one second storage node may include: determining the stripe unit identifier of the first data stripe unit according to the first storage node; using that stripe unit identifier as an index, querying the cross-backup relationship to obtain the stripe unit identifier of at least one second data stripe unit; determining the storage node identifier corresponding to the stripe unit identifier of the at least one second data stripe unit, to obtain the identifier of at least one second storage node; and determining the at least one second storage node according to that identifier.
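Form (2) adds one level of indirection compared with form (1): node to stripe unit, stripe unit to stripe units, and stripe units back to nodes. A minimal sketch follows, assuming both tables are plain dictionaries; the identifiers and the function name are illustrative and do not come from the source.

```python
# Hypothetical lookup tables; all identifiers are illustrative only.
UNIT_OF_NODE = {"osd1": "unit1", "osd2": "unit2", "osd3": "unit3"}
UNIT_RELATION = {
    "unit1": ["unit2", "unit3"],
    "unit2": ["unit1", "unit3"],
    "unit3": ["unit1", "unit2"],
}


def second_nodes_by_unit(first_node, unit_of_node, unit_relation):
    """Resolve the second storage nodes through stripe unit identifiers."""
    # First storage node -> identifier of the first data stripe unit.
    first_unit = unit_of_node[first_node]
    # Query the cross-backup relationship with that identifier as the index.
    second_units = unit_relation[first_unit]
    # Second data stripe units -> identifiers of the second storage nodes.
    node_of_unit = {unit: node for node, unit in unit_of_node.items()}
    return [node_of_unit[unit] for unit in second_units]
```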
Optionally, the MDC node may record the cross-backup relationship and send it to the client node. For example, the MDC node may record the cross-backup relationship in the partition view and send the partition view to the client node; the client node may then query the partition view to obtain the cross-backup relationship it contains.
Step (2.2): Determine the metadata backups of the data blocks corresponding to the at least one second storage node as the at least one target metadata backup.
After the at least one second storage node is determined, the metadata backups of the data blocks corresponding to the at least one second storage node may be selected from the generated metadata backups as the at least one target metadata backup. For example, assume that the first storage node is OSD node 1 and that, according to the cross-backup relationship between storage nodes, the OSD nodes corresponding to OSD node 1 are OSD node 2 and OSD node 3. The metadata backups of the data blocks of OSD node 2 and OSD node 3 are then determined, giving GRAIN metadata 2 and GRAIN metadata 3, and these are used as the target metadata backups of OSD node 1.
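Steps (2.1) and (2.2) amount to two lookups, which the following minimal sketch illustrates for the node-identifier form of the cross-backup relationship. The dictionaries and the function name are assumptions made for illustration; the source does not prescribe a data structure.

```python
# Hypothetical cross-backup relationship: node ID -> IDs of the nodes whose
# metadata backups that node stores.
CROSS_BACKUP = {
    "osd1": ["osd2", "osd3"],
    "osd2": ["osd1", "osd3"],
    "osd3": ["osd1", "osd2"],
}
# Hypothetical metadata backups from step 1, keyed by the node that holds
# the corresponding data block.
METADATA_BACKUPS = {
    "osd1": "GRAIN metadata 1",
    "osd2": "GRAIN metadata 2",
    "osd3": "GRAIN metadata 3",
}


def select_target_backups(first_node, relation, backups):
    """Return the target metadata backups for the first storage node."""
    # Step (2.1): query the relationship to find the second storage nodes.
    second_nodes = relation[first_node]
    # Step (2.2): take the metadata backups of those nodes' data blocks.
    return [backups[node] for node in second_nodes]
```

For OSD node 1 this yields GRAIN metadata 2 and GRAIN metadata 3, matching the example in the text.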
It should be noted that selecting the target metadata backups according to the cross-backup relationship is only an optional way, not a mandatory way, of selecting them. Optionally, the target metadata backups may be selected in other ways. For example, for any data stripe unit, the metadata backups of all data blocks may be used as the target metadata backups, so that the data stripe unit stores the metadata of the data blocks of all data stripe units.
Step 3: Generate the data stripe unit according to the data block, the metadata of the data block, and the at least one target metadata backup.
Specifically, the data block, the metadata of the data block, and the at least one target metadata backup may be written to any stripe unit in the EC stripe, and the stripe unit is used as a data stripe unit once writing is complete. The data stripe unit carries the data block, the metadata of the data block, and the at least one target metadata backup; the metadata of the data block together with the at least one target metadata backup constitutes the cross-backup metadata.
Optionally, the data block, the metadata of the data block, and the at least one target metadata backup may be written to any blank stripe unit in any blank EC stripe, so that the data stripe unit is generated from a blank stripe unit. Alternatively, an overwrite may be performed on any stripe unit in an EC stripe to which data has already been written, so that the data stripe unit is generated from a stripe unit that already holds data. This is not limited in this embodiment.
In summary, the above describes an example of generating one data stripe unit in an EC stripe. By analogy, each data stripe unit in the EC stripe can be generated by repeating the above steps. An EC stripe can be regarded as a row of data blocks, and each data stripe unit can be regarded as one column in that row. For example, referring to FIG. 5, assume that the data is divided into three data blocks: GRAIN data 1, GRAIN data 2, and GRAIN data 3. Three data stripe units can then be generated: data stripe unit 1, which carries GRAIN data 1 and cross-backup metadata 1; data stripe unit 2, which carries GRAIN data 2 and cross-backup metadata 2; and data stripe unit 3, which carries GRAIN data 3 and cross-backup metadata 3.
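Steps 1 to 3 can be put together in a minimal sketch. It assumes that the metadata of a data block is a small record mapping the stripe unit identifier to the target storage location, and that a metadata backup is a copy of that record; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataStripeUnit:
    unit_id: int
    block: bytes
    # Block index -> metadata; holds the unit's own metadata plus the backups.
    cross_backup_metadata: dict = field(default_factory=dict)


def build_data_stripe_units(blocks, locations, relation=None):
    """Build one data stripe unit per data block (unit IDs are 1-based)."""
    count = len(blocks)
    # Metadata of block i: the stripe unit ID mapped to its target location.
    metadata = {
        i: {"stripe_unit_id": i, "location": locations[i - 1]}
        for i in range(1, count + 1)
    }
    units = []
    for i in range(1, count + 1):
        # Step 2: pick the other blocks whose metadata backups this unit keeps.
        if relation is None:
            others = [j for j in metadata if j != i]  # back up everything
        else:
            others = relation[i]
        # Step 1: a backup is modelled as a copy of the metadata record.
        cross = {i: metadata[i]}
        for j in others:
            cross[j] = dict(metadata[j])
        # Step 3: write the block, its metadata and the backups into one unit.
        units.append(DataStripeUnit(i, blocks[i - 1], cross))
    return units
```

Calling `build_data_stripe_units([b"g1", b"g2", b"g3"], [0, 8, 16])` produces three units that each carry all three metadata records, which corresponds to the layout described for FIG. 5. Passing a `relation` such as `{1: [2], 2: [3], 3: [1]}` gives the partial layout instead.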
In this embodiment, the metadata of the data blocks is cross-backed up. Once every data stripe unit has been written, different data stripe units store the metadata of one another's data blocks. Therefore, after each data stripe unit is distributed to a storage node, different storage nodes store the metadata of one another's data blocks. Even if a storage node loses the metadata of its data block, the lost metadata can still be read and recovered from other storage nodes. This improves the reliability and security of metadata storage and helps ensure the high availability and high reliability of the distributed storage system.
For example, referring to FIG. 5, OSD node 1, OSD node 2, and OSD node 3 store the metadata of one another's data blocks. Even if OSD node 1 loses GRAIN metadata 1, GRAIN metadata 1 can still be read and recovered from OSD node 2 and OSD node 3, because both of them stored it in advance. Similarly, even if OSD node 2 loses GRAIN metadata 2, GRAIN metadata 2 can still be read and recovered from OSD node 1 and OSD node 3, because both of them stored it in advance.
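The recovery path in this example can be sketched as follows. The in-memory `stores` dictionary stands in for the per-node metadata storage and `holders` for the cross-backup relationship; both, like the function name, are assumptions made for illustration.

```python
def recover_metadata(lost_node, block_id, stores, holders):
    """Restore one lost metadata record from the first node that has a backup.

    stores:  node ID -> {block ID: metadata} held by that node
    holders: node ID -> IDs of the nodes that keep its metadata backup
    """
    for holder in holders[lost_node]:
        backup = stores.get(holder, {}).get(block_id)
        if backup is not None:
            # Write the recovered record back to the node that lost it.
            stores.setdefault(lost_node, {})[block_id] = backup
            return backup
    raise LookupError(f"no surviving backup of metadata {block_id}")
```

If no backup survives on any holder, the sketch raises an error rather than guessing, which is the point at which the EC check data would have to be used instead.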
404. The client node performs EC encoding on the at least one data stripe unit to obtain at least one check stripe unit.
A check stripe unit is used to recover data stripe units and ensures the reliability and security of the stored data stripe units. Specifically, if one or more data stripe units are lost, then as long as the number of lost data stripe units is less than the total number of check stripe units, the lost data stripe units can be recovered by performing EC decoding on the remaining data stripe units and the check stripe units. For example, referring to FIG. 5, the check stripe units may be check stripe unit 1 and check stripe unit 2 in FIG. 5.
A check stripe unit may be referred to as the check protection of the EC stripe. A check stripe unit may include check blocks and metadata check blocks, for example one or more check blocks and one or more metadata check blocks. A check block may be obtained by performing EC encoding on the data blocks in the at least one data stripe unit, and may be used to recover the data blocks in the data stripe units. For example, referring to FIG. 5, the check blocks may be GRAIN metadata check block 1 and GRAIN metadata check block 2 in FIG. 5. A metadata check block may be obtained by performing EC encoding on the metadata in the at least one data stripe unit, and may be used to recover the metadata in the data stripe units. For example, referring to FIG. 5, the metadata check blocks may be GRAIN metadata check block 1 and GRAIN metadata check block 2 in FIG. 5.
Regarding the process of generating the at least one check stripe unit, the client node may use an encoding algorithm to perform EC encoding on the at least one data stripe unit to obtain the at least one check stripe unit. The encoding algorithm includes, but is not limited to, Reed-Solomon coding, Cauchy coding, and the like. This embodiment does not limit which encoding algorithm is used for EC encoding.
The at least one check stripe unit and the at least one data stripe unit may form one EC stripe, in which each data stripe unit and each check stripe unit is one column of the EC stripe. For example, referring to FIG. 5, EC encoding may be performed on data stripe unit 1, data stripe unit 2, and data stripe unit 3 to obtain check stripe unit 1 and check stripe unit 2; data stripe unit 1, data stripe unit 2, data stripe unit 3, check stripe unit 1, and check stripe unit 2 then form one EC stripe.
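The encode and recover cycle can be illustrated with a deliberately simplified stand-in. The text names Reed-Solomon and Cauchy coding, which can produce several check units; the sketch below instead uses a single XOR parity, which yields only one check unit and tolerates the loss of only one unit. The function names are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_check_unit(units):
    """XOR equally positioned bytes of all units into one check unit."""
    size = max(len(unit) for unit in units)
    # Pad shorter units with zero bytes so that all columns line up.
    padded = [unit.ljust(size, b"\x00") for unit in units]
    return bytes(reduce(xor, column) for column in zip(*padded))


def recover_lost_unit(surviving_units, check_unit):
    """Rebuild the single missing unit from the survivors and the check unit."""
    return xor_check_unit(list(surviving_units) + [check_unit])
```

Because XOR is its own inverse, applying the same operation to the surviving units and the check unit returns the missing unit (zero-padded to the common length). A production system would replace both functions with a real erasure-coding library.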
Optionally, the manner of performing EC encoding on the at least one data stripe unit may include one of the following manners 1 to 3 or a combination of more than one of them.
Manner 1: Perform EC encoding on the data blocks in the at least one data stripe unit to obtain the check blocks in the at least one check stripe unit.
Specifically, EC encoding may be performed on at least one data block to obtain at least one check block, one or more check blocks may be written to any stripe unit, and the stripe unit is used as a check stripe unit once writing is complete. The check block may be written to any blank stripe unit in any blank EC stripe, or an overwrite may be performed on any stripe unit in an EC stripe to which data has already been written. This is not limited in this embodiment.
For example, referring to FIG. 5, EC encoding may be performed on GRAIN data 1, GRAIN data 2, and GRAIN data 3 to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2. GRAIN metadata check block 1 is written to stripe unit 4 and GRAIN metadata check block 2 is written to stripe unit 5. After writing is complete, stripe unit 4 can serve as check stripe unit 1 and stripe unit 5 can serve as check stripe unit 2.
Manner 2: Perform EC encoding on the metadata in the at least one data stripe unit to obtain the metadata check blocks in the at least one check stripe unit.
In one possible implementation, EC encoding may be performed on the cross-backup metadata of the at least one data stripe unit, taken as a whole for each data stripe unit, to obtain the metadata check blocks. For example, referring to FIG. 5, EC encoding may be performed on cross-backup metadata 1, cross-backup metadata 2, and cross-backup metadata 3 to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2.
In another possible implementation, one or more pieces of metadata may be selected from the cross-backup metadata of each data stripe unit, and EC encoding may be performed on the metadata selected from the at least one data stripe unit to obtain the metadata check blocks. Optionally, the cross-backup metadata in each data stripe unit may include multiple rows, each carrying one piece of metadata. The metadata located in the same row of the at least one data stripe unit may be selected, and EC encoding may be performed on the selected metadata. For example, referring to FIG. 5, the cross-backup metadata in each data stripe unit occupies three rows, which carry GRAIN metadata 1, GRAIN metadata 2, and GRAIN metadata 3 respectively. EC encoding may first be performed for the first row of the cross-backup metadata: the first-row metadata is selected from data stripe unit 1, giving GRAIN metadata 1; the first-row metadata is selected from data stripe unit 2, giving GRAIN metadata 2; and the first-row metadata is selected from data stripe unit 3, giving GRAIN metadata 3. EC encoding is then performed on GRAIN metadata 1, GRAIN metadata 2, and GRAIN metadata 3 to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2.
After the metadata check blocks are obtained, they may be written to any blank stripe unit in any blank EC stripe, or written as an overwrite to any stripe unit in an EC stripe to which data has already been written. The stripe unit is used as a check stripe unit once writing is complete. For example, referring to FIG. 5, GRAIN metadata check block 1 may be written to stripe unit 4 and GRAIN metadata check block 2 may be written to stripe unit 5. After writing is complete, stripe unit 4 can serve as check stripe unit 1 and stripe unit 5 can serve as check stripe unit 2.
Performing EC encoding on the metadata in each data stripe unit further improves the reliability and security of metadata storage. Specifically, when the metadata in any data stripe unit is lost, the lost metadata can be read and recovered in two ways: from the cross-backup metadata stored in the other data stripe units, or by performing EC decoding on the metadata in the other data stripe units together with the metadata check blocks. This further reduces the probability of metadata loss and improves the reliability and security of the stored metadata.
Manner 3: Perform EC encoding on the data blocks and the metadata in the at least one data stripe unit to obtain the check blocks in the at least one check stripe unit.
Manner 3 is a combination of manner 1 and manner 2. EC encoding may be performed on at least one data block together with the metadata of the at least one data block to obtain at least one metadata check block. The metadata check block is written to any stripe unit, and the stripe unit is used as a check stripe unit once writing is complete.
For example, referring to FIG. 5, EC encoding may be performed on all the data carried in stripe unit 1, stripe unit 2, and stripe unit 3 to obtain GRAIN metadata check block 1 and GRAIN metadata check block 2. GRAIN metadata check block 1 is written to stripe unit 4, which then serves as check stripe unit 1, and GRAIN metadata check block 2 is written to stripe unit 5, which then serves as check stripe unit 2.
405. The client node distributes the at least one data stripe unit and the at least one check stripe unit to at least one storage node.
Optionally, the at least one data stripe unit and the at least one check stripe unit may be distributed to the at least one storage node corresponding to the target storage location of the data. Specifically, the storage nodes corresponding to the target storage location may be determined according to the target storage location of the data, giving at least one storage node. Data stripe units and/or check stripe units may be allocated to each storage node, and the data stripe units and/or check stripe units allocated to each storage node are then distributed to it.
Regarding the manner of determining the storage nodes corresponding to the target storage location, a mapping relationship between storage locations and storage node identifiers may be established in advance. The mapping relationship is queried according to the target storage location to obtain the identifier of at least one storage node corresponding to the target storage location, and the storage node corresponding to each such identifier is used as a storage node corresponding to the target storage location.
Optionally, a partition view may be generated by a partition allocation algorithm according to the number and status of the storage nodes in the distributed storage system. The partition view indicates the storage nodes corresponding to each partition, and may include at least one partition identifier and the identifier of the corresponding at least one storage node. The client node may determine the partition identifier corresponding to the target storage location of the data, query the partition view according to the partition identifier to obtain the identifier of at least one storage node, and use the storage node corresponding to each such identifier as a storage node corresponding to the target storage location.
Regarding the process of determining the partition identifier according to the target storage location of the data: when the target storage location is an LBA, the LBA may be divided by the total number of partitions and the remainder used as the partition identifier, which determines the partition to which the LBA belongs. The correspondence between LBAs and partitions may be called the relationship by which LBAs are scattered across partitions, and the correspondence between partitions and storage nodes may be called the deployment relationship by which partitions are deployed to storage nodes.
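The modulo mapping and the partition view lookup can be sketched as follows. The partition view is assumed to be a dictionary whose partition identifiers run from 0 to the number of partitions minus one; the sample contents and function names are hypothetical.

```python
# Hypothetical partition view: partition ID -> IDs of the storage nodes that
# the partition is deployed to.
PARTITION_VIEW = {
    0: ["osd1", "osd2", "osd3", "osd4", "osd5"],
    1: ["osd2", "osd3", "osd4", "osd5", "osd6"],
}


def partition_id_for_lba(lba, partition_count):
    """Scatter an LBA to a partition: remainder of LBA / number of partitions."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    return lba % partition_count


def storage_nodes_for_lba(lba, partition_view):
    """Look up the storage nodes of the partition that the LBA belongs to."""
    partition_id = partition_id_for_lba(lba, len(partition_view))
    return partition_view[partition_id]
```

With four partitions, LBA 10 maps to partition 2, because 10 divided by 4 leaves a remainder of 2.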
Regarding the manner of allocating data stripe units or check stripe units to storage nodes, a data stripe unit or check stripe unit may be allocated to each storage node in turn, in the order of the storage node identifiers. For example, data stripe unit 1 may be allocated to storage node 1 and data stripe unit 2 to storage node 2. As another example, a data stripe unit or check stripe unit may be allocated to each storage node at random. Other allocation methods may also be used; this embodiment does not limit how data stripe units or check stripe units are allocated.
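The in-order allocation can be sketched in a few lines. The sketch assumes that node identifiers sort in the intended order and that each node receives at most one stripe unit; the function name is hypothetical.

```python
def assign_in_identifier_order(node_ids, stripe_units):
    """Pair stripe units with storage nodes in the order of the node IDs."""
    if len(stripe_units) > len(node_ids):
        raise ValueError("not enough storage nodes for the stripe units")
    # sorted() fixes the node order; zip() pairs units with nodes one to one.
    return dict(zip(sorted(node_ids), stripe_units))
```

A random allocation, the other option mentioned in the text, could be obtained by shuffling the node identifiers instead of sorting them.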
Regarding the manner of distributing the data stripe units and/or check stripe units, at least one write request may be generated according to the data stripe unit and/or check stripe unit corresponding to each storage node, and the at least one write request may be distributed to the at least one storage node. For example, one write request may be sent to each storage node, so that the corresponding data stripe unit and/or check stripe unit is distributed to each storage node. Each write request carries the data stripe unit and/or check stripe unit corresponding to the storage node, and optionally also carries the target storage location of the data. Each write request may be an input/output (IO) request in key-value form.
Optionally, the client node may send the corresponding data stripe unit and/or check stripe unit to each storage node. Specifically, the client node may allocate the corresponding data stripe unit and/or check stripe unit to each storage node, generate at least one write request, and send a write request to each storage node. By interacting with each storage node in this way, the client node distributes the at least one data stripe unit and the at least one check stripe unit to the at least one storage node.
Exemplarily, referring to FIG. 5, the client node may send a write request carrying data stripe unit 1 to OSD node 1, a write request carrying data stripe unit 2 to OSD node 2, a write request carrying data stripe unit 3 to OSD node 3, a write request carrying check stripe unit 1 to OSD node 4, and a write request carrying check stripe unit 2 to OSD node 5, thereby distributing the data stripe units and the check stripe units.
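The distribution in this example can be sketched as the construction of one key-value write request per storage node. The request layout (a dictionary with `node`, `key`, and `value` fields, keyed by the target storage location) is an assumption; the source only states that the request is in key-value form.

```python
def build_write_requests(assignment, target_location):
    """Create one key-value write request per storage node.

    assignment: node ID -> stripe unit allocated to that node
    """
    return [
        # Assumed layout: the target storage location is the key and the
        # stripe unit is the value.
        {"node": node, "key": target_location, "value": unit}
        for node, unit in assignment.items()
    ]
```

Each request is addressed directly to its storage node, so the client needs no intermediate node to fan the stripe units out.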
In this embodiment, the client node sends the corresponding data stripe unit and/or check stripe unit to each storage node. The data to be stored therefore reaches each storage node after a single hop between the client node and the storage node, and is stored there. It does not need to travel from the client node to a primary storage node and then from the primary storage node to each backup storage node, which would take two hops. This removes the network delay caused by forwarding through the primary storage node, reduces the latency of storing data, and improves the storage performance of the distributed storage system.
It should be noted that having the client node send the data stripe units and/or check stripe units to each storage node is only an optional way of distributing them, not a mandatory one. In another possible embodiment, one of the at least one storage node may be selected as the primary storage node and the other storage nodes as backup storage nodes. The client node may then send the at least one data stripe unit and the at least one check stripe unit to the primary storage node, and the primary storage node sends the corresponding data stripe unit and/or check stripe unit to each backup storage node. This embodiment does not limit how the data stripe units and/or check stripe units are distributed.
406. When the at least one storage node receives the data stripe unit and/or check stripe unit, the at least one storage node stores the data stripe unit and/or check stripe unit.
Specifically, each storage node may receive the write request, parse it to obtain the data stripe unit and/or check stripe unit that it carries, and write the data stripe unit and/or check stripe unit to storage space, thereby storing it. The write request may carry the target storage location of the data, in which case the data stripe unit and/or check stripe unit may be written to the storage space corresponding to the target storage location.
407. The at least one storage node sends a first write completion message to the client node, where the first write completion message indicates that the corresponding storage node has stored the data stripe unit and/or check stripe unit.
To distinguish them in the description, this embodiment refers to the write completion message sent by a storage node as the first write completion message, and to the write completion message sent by the client node as the second write completion message. It should be noted that the terms "first write completion message" and "second write completion message" are used only to distinguish different write completion messages, and should not be understood as expressing or implying the relative importance of the messages or the total number of write completion messages.
After writing the data stripe unit and/or check stripe unit, each storage node may generate a first write completion message and send it to the client node, to notify the client node that the storage node has stored the data successfully.
408. The client node receives the first write completion message sent by the at least one storage node and sends a second write completion message to the target application or external host, where the second write completion message indicates that the data to be stored has been written to the target storage location.
Specifically, the client node may determine whether it has received the first write completion messages of all the storage nodes. Once it confirms that it has, it generates a second write completion message and sends it to the target application or external host. The target application or external host may receive the second write completion message and present it in a preset prompt manner, thereby informing the user that the data to be stored has been written to the target storage location. In addition, the target application may sit above the client node in the logical architecture and may be referred to as an upper-layer application of the client node.
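The acknowledgement logic of steps 407 and 408 can be sketched as a small tracker on the client node. This is an illustrative sketch only; the class name `WriteCompletionTracker` and its method are hypothetical and do not come from the application.

```python
class WriteCompletionTracker:
    """Tracks which storage nodes still owe a first write completion message."""

    def __init__(self, expected_nodes):
        # Nodes from which a first write completion message is still expected.
        self.pending = set(expected_nodes)

    def on_first_write_complete(self, node_id):
        """Record one first write completion message.

        Returns True once every expected node has reported, which is the point
        at which the client node would send the second write completion message.
        """
        self.pending.discard(node_id)
        return not self.pending


tracker = WriteCompletionTracker(["osd-1", "osd-2", "osd-3"])
results = [tracker.on_first_write_complete(n) for n in ["osd-2", "osd-1", "osd-3"]]
```

In the sub-health scenario described later, the same tracker could be constructed with only the fourth storage nodes as `expected_nodes`, so that the sub-healthy node is never waited on.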
It should be noted that steps 407 and 408 are optional rather than mandatory steps for storing data. This embodiment does not limit whether steps 407 and 408 are performed.
The above describes the write IO process for storing data. The read IO process for reading data may include the following steps 1 and 2.
Step 1. The client node receives a read request and, according to the target storage location in the read request, determines the at least one storage node of the partition corresponding to the target storage location. For example, the client node may determine the partition corresponding to an LBA according to the mapping between LBAs and partitions, and then query the partition view to determine the at least one storage node corresponding to that partition.
Step 2. The client node forwards the read request to the at least one storage node, so that the data is read on the at least one storage node.
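The two read steps above can be sketched as a lookup from LBA to partition to storage nodes. The application does not specify how LBAs map to partitions, so the modulo mapping below is purely an assumption for illustration, as are the names `partition_for_lba`, `nodes_for_read`, and the contents of `partition_view`.

```python
PARTITION_COUNT = 4  # hypothetical number of partitions


def partition_for_lba(lba, partition_count=PARTITION_COUNT):
    # Assumption: a simple modulo mapping. The actual LBA-to-partition
    # mapping is not specified in the text.
    return lba % partition_count


def nodes_for_read(lba, partition_view):
    """Step 1: resolve the storage nodes of the partition that holds this LBA."""
    return partition_view[partition_for_lba(lba)]


# Hypothetical partition view: partition number -> storage nodes of that partition.
partition_view = {
    0: ["osd-1", "osd-2", "osd-3"],
    1: ["osd-2", "osd-3", "osd-4"],
    2: ["osd-3", "osd-4", "osd-5"],
    3: ["osd-4", "osd-5", "osd-1"],
}

# Step 2 would forward the read request to each node in this list.
targets = nodes_for_read(5, partition_view)
```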
The method provided in this embodiment introduces a mechanism for cross-backing up the metadata of the data blocks in an EC stripe. Data blocks and cross-backed-up metadata are stored together in data stripe units, the data stripe units are EC-encoded to obtain check stripe units, and the data stripe units and check stripe units are distributed to the storage nodes. As a result, once each storage node has finished storing, different storage nodes hold each other's data block metadata. Even if the metadata on one storage node is lost, other storage nodes have already stored that node's metadata, so the lost metadata can be read and recovered from the data stripe units on those other nodes. This greatly improves the reliability and security of data storage and improves the storage performance of the distributed storage system. In addition, storing data with EC avoids the cache performance overhead incurred when data is stored with a three-replica mechanism, which saves storage space and reduces operating cost. Further, the client node may send the corresponding data stripe unit and check stripe unit to each storage node, so that data reaches each storage node and is stored after a single network hop. This greatly reduces latency and improves the speed and efficiency of storing data, thereby improving the storage performance of the distributed storage system.
The following describes, through the embodiment of FIG. 6, the data storage process in a scenario where a storage node is in a sub-health state. To distinguish them in the description, in the embodiment of FIG. 6 a storage node that is in the sub-health state is referred to as the third storage node, and a storage node that is not in the sub-health state is referred to as a fourth storage node. It should be noted that the terms "third storage node" and "fourth storage node" are used only to distinguish storage nodes by whether they are in the sub-health state, and should not be understood as expressing or implying an order or relative importance among the storage nodes, or the total number of storage nodes.
FIG. 6 is a flowchart of a data storage method according to an embodiment of the present application. As shown in FIG. 6, the interacting entities of the method include a client node and at least one fourth storage node, and the method includes the following steps:
601. The client node obtains at least one blank EC stripe.
This step is similar to step 401 above and is not described again here.
602. The client node obtains the data to be stored.
This step is similar to step 402 above and is not described again here.
603. The client node determines that the third storage node is in a sub-health state.
The sub-health state is also referred to as a sub-health flash interruption. It may include a state in which read and write requests are processed abnormally slowly, a state in which the write cache has failed, a state in which a disk is damaged, and the like.
As for how to determine that the third storage node is in the sub-health state, optionally, the client node may receive a sub-health message and determine from it that the third storage node is in the sub-health state. The sub-health message indicates that the third storage node is in the sub-health state and may carry the identifier of the third storage node. The client node may parse the sub-health message to obtain the identifier of the third storage node, and thereby determine that the third storage node is in the sub-health state.
Optionally, the sub-health message received by the client node may come from an MDC node. Specifically, the distributed storage system may include an MDC node, which maintains the state of each storage node and may stay in communication with each storage node. When the MDC node determines that the third storage node is in the sub-health state, it may generate a sub-health message according to the identifier of the third storage node and send the message to the client node. The client node may receive the sub-health message from the MDC node and thereby determine that the third storage node is in the sub-health state.
It should be noted that receiving a sub-health message is only one example of how to determine that the third storage node is in the sub-health state, not a mandatory way of doing so. Optionally, the client node may determine this in other ways. For example, the client node may stay in communication with the third storage node and actively detect that the third storage node is in the sub-health state. This embodiment does not limit how the sub-health state of the third storage node is determined.
604. The client node generates at least one data stripe unit according to at least one data block to be stored.
605. The client node performs EC encoding on the at least one data stripe unit to obtain at least one check stripe unit.
606. The client node distributes the at least one data stripe unit and the at least one check stripe unit to the fourth storage node.
Steps 604 to 606 are similar to steps 403 to 405 above. The differences mainly include the following two points:
Difference 1: more content is distributed. In this embodiment, in addition to the data stripe units and check stripe units, the sub-health mark of the third storage node is also distributed, so that the mark indicates that the third storage node is in the sub-health state. The sub-health mark may include the identifier of the third storage node and may be generated by the client node.
The sub-health mark of the third storage node may be distributed in either or both of the following manner 1 and manner 2.
Manner 1: the sub-health mark may be written into at least one data stripe unit, so that each such data stripe unit carries the sub-health mark of the third storage node in addition to the data block and the cross-backed-up metadata. The sub-health mark of the third storage node is thus distributed by distributing the at least one data stripe unit.
The sub-health mark may be written into the at least one data stripe unit first, after which the at least one data stripe unit is EC-encoded to obtain at least one check stripe unit. Alternatively, the at least one data stripe unit may be generated and EC-encoded to obtain at least one check stripe unit first, after which the sub-health mark is written into a free position in the at least one data stripe unit.
It should be noted that if the client node sends the data stripe units to each storage node, the client node may write the sub-health mark into the at least one data stripe unit; if the primary storage node sends the data stripe units to each storage node, the primary storage node may write the sub-health mark into the at least one data stripe unit. This embodiment does not limit which entity writes the sub-health mark into the data stripe units.
Manner 2: the sub-health mark may be written into at least one write request, so that each such write request carries the sub-health mark of the third storage node in addition to the data stripe unit and/or check stripe unit. The sub-health mark of the third storage node is thus distributed by distributing the at least one write request.
For example, when a write request is generated, the sub-health mark and the data stripe unit may be encapsulated together to obtain a write request that carries both, thereby writing the sub-health mark into the write request. As another example, a sub-health field may be reserved in the write request, and the sub-health mark may be written into the write request by setting that field. This embodiment does not limit how the sub-health mark is written into the write request.
It should be noted that if the client node sends the write requests to each storage node, the client node may write the sub-health mark into the at least one write request; if the primary storage node sends the write requests to each storage node, the primary storage node may write the sub-health mark into the at least one write request. This embodiment does not limit which entity writes the sub-health mark into the write requests.
For example, referring to FIG. 7, when a sub-health flash interruption occurs on OSD node 1, the client node may write the sub-health mark of OSD node 1 into the write request for OSD node 2 and into the write request for OSD node 3. The client node then sends these write requests, which carry the sub-health mark of OSD node 1, to OSD node 2 and OSD node 3, thereby distributing the sub-health mark of OSD node 1 to OSD node 2 and OSD node 3.
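Manner 2 can be sketched as attaching one extra field to each outgoing write request. This is only an illustrative sketch: the dictionary layout and the field name `sub_health_mark` are assumptions, not the message format of the application.

```python
def tag_with_sub_health_mark(write_request, sub_healthy_node):
    """Return a copy of the write request that also carries the sub-health mark.

    The mark is represented here simply as the identifier of the sub-healthy
    node, stored in a hypothetical reserved field.
    """
    return {**write_request, "sub_health_mark": sub_healthy_node}


# Mirrors the FIG. 7 example: OSD node 1 is sub-healthy, and the requests
# for OSD node 2 and OSD node 3 carry its mark.
requests = [
    {"node_id": "osd-2", "unit_id": "data-2"},
    {"node_id": "osd-3", "unit_id": "data-3"},
]
tagged = [tag_with_sub_health_mark(r, "osd-1") for r in requests]
```

Returning a copy rather than mutating the original request keeps the untagged request available, which may be convenient if the same stripe is later resent without a mark.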
Difference 2: the recipients of the data may change. Specifically, if the third storage node is in the sub-health state, the data stripe units and/or check stripe units may be sent to the fourth storage nodes among the at least one storage node, where a fourth storage node is a storage node different from the third storage node. To determine the fourth storage nodes, the storage nodes other than the third storage node among the at least one storage node may be identified, yielding at least one fourth storage node, and the data stripe units and/or check stripe units are sent to the at least one fourth storage node. Either the client node or the primary storage node may send the data stripe units and/or check stripe units to each fourth storage node; this embodiment does not limit this.
Optionally, a degraded write may be performed on the data to be stored. That is, the data stripe unit and/or check stripe unit corresponding to the third storage node need not be sent; only the data stripe units and/or check stripe units other than the one corresponding to the third storage node are sent. When data is stored by degraded write, the write request carrying the data stripe unit and/or check stripe unit may be referred to as a degraded write request.
Specifically, the degraded write may include either or both of the following implementation 1 and implementation 2:
Implementation 1: send, to the at least one fourth storage node, at least one second data stripe unit, that is, a data stripe unit other than the first data stripe unit corresponding to the third storage node, and at least one check stripe unit.
When the third storage node corresponds to a data stripe unit, call that data stripe unit the first data stripe unit and call the other data stripe units second data stripe units. Implementation 1 may specifically include: determining, among the generated data stripe units, those other than the first data stripe unit, to obtain at least one second data stripe unit; and sending the corresponding second data stripe unit and/or check stripe unit to the at least one fourth storage node.
For example, referring to FIG. 7, when a sub-health flash interruption occurs on OSD node 1, the data stripe units and/or check stripe units other than data stripe unit 1 (that is, data stripe unit 2, data stripe unit 3, check stripe unit 1, and check stripe unit 2) may be sent to the OSD nodes other than OSD node 1 (that is, OSD node 2 to OSD node 5).
Implementation 2: send, to the at least one fourth storage node, at least one second check stripe unit, that is, a check stripe unit other than the first check stripe unit corresponding to the third storage node, and at least one data stripe unit.
When the third storage node corresponds to a check stripe unit, call that check stripe unit the first check stripe unit and call the other check stripe units second check stripe units. Implementation 2 may specifically include: determining, among the generated check stripe units, those other than the first check stripe unit, to obtain at least one second check stripe unit; and sending the corresponding second check stripe unit and/or data stripe unit to the at least one fourth storage node.
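Both degraded-write implementations reduce to dropping whichever stripe unit is assigned to the sub-healthy node and sending the rest. A minimal sketch, assuming a hypothetical `assignment` mapping from storage node to stripe unit identifier:

```python
def select_degraded_units(assignment, sub_healthy_node):
    """Return the node-to-unit pairs that are still sent in a degraded write.

    `assignment` maps each storage node to the stripe unit it would normally
    receive. The unit assigned to the sub-healthy (third) storage node is
    skipped, whether it is a data stripe unit or a check stripe unit.
    """
    return {
        node: unit
        for node, unit in assignment.items()
        if node != sub_healthy_node
    }


# Mirrors FIG. 7: OSD node 1 is sub-healthy, so data stripe unit 1 is not sent.
assignment = {
    "osd-1": "data-1",
    "osd-2": "data-2",
    "osd-3": "data-3",
    "osd-4": "check-1",
    "osd-5": "check-2",
}
degraded = select_degraded_units(assignment, "osd-1")
```

Calling the same function with `"osd-4"` as the sub-healthy node corresponds to implementation 2, where a check stripe unit is the one withheld.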
607. When a fourth storage node receives the corresponding data stripe unit and/or check stripe unit, each fourth storage node stores the corresponding data stripe unit and/or check stripe unit.
Step 607 is similar to step 406 above. The main difference is that, in addition to storing the data stripe unit and/or check stripe unit, the fourth storage node may record a sub-health log for the third storage node while the third storage node is in the sub-health state, so that when the third storage node enters the sub-health recovery state, the metadata that the third storage node is missing can be sent to it.
The sub-health log indicates the data stripe units that were stored while the third storage node was in the sub-health state. It may also be called a metadata difference log, because it indicates the difference between the metadata stored on the third storage node and on the fourth storage node that results from the third storage node being in the sub-health state. The sub-health log may include the correspondence between the third storage node and data stripe units, for example the correspondence between the sub-health mark of the third storage node and the stripe unit identifier of a data stripe unit.
Optionally, the sub-health log may be recorded in either or both of the following implementation 1 and implementation 2:
Implementation 1: when a received write request carries the sub-health mark of the third storage node and a data stripe unit, write a sub-health record corresponding to the third storage node into the sub-health log.
Specifically, when the fourth storage node parses the write request and obtains the sub-health mark of the third storage node from it, the fourth storage node generates a sub-health record corresponding to the third storage node and writes it into the sub-health log. A sub-health record is one entry in the sub-health log. It may include the sub-health mark of the third storage node and may also include other information, such as the timestamp at which the sub-health mark was received and the stripe unit identifier of the data stripe unit in the write request.
Implementation 2: when a received data stripe unit carries the sub-health mark of the third storage node, write a sub-health record corresponding to the third storage node into the sub-health log.
Specifically, when the fourth storage node parses the data stripe unit and obtains the sub-health mark of the third storage node from it, the fourth storage node generates a sub-health record corresponding to the third storage node and writes it into the sub-health log.
It should be noted that the step of writing a sub-health record into the sub-health log may be performed multiple times while the third storage node is in the sub-health state. For example, a sub-health record may be written into the sub-health log each time a received write request carries the sub-health mark of the third storage node, so that a log of the write requests carrying the sub-health mark is maintained.
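The record-keeping described above can be sketched as an append-only log on the fourth storage node. This is a sketch under stated assumptions: the `SubHealthLog` class, its methods, and the request field names are hypothetical, and the record layout (node identifier, stripe unit identifier, timestamp) follows the optional fields mentioned in the text rather than any defined format.

```python
import time


class SubHealthLog:
    """Hypothetical sub-health (metadata difference) log kept by a fourth storage node."""

    def __init__(self):
        self.records = []

    def record_if_marked(self, write_request, now=None):
        """Append one sub-health record if the write request carries a sub-health mark.

        Returns True when a record was written, False when the request was unmarked.
        """
        mark = write_request.get("sub_health_mark")
        if mark is None:
            return False
        self.records.append({
            "node_id": mark,                      # the sub-healthy (third) storage node
            "unit_id": write_request["unit_id"],  # stripe unit stored meanwhile
            "timestamp": time.time() if now is None else now,
        })
        return True

    def units_missed_by(self, node_id):
        """Stripe units written while the given node was sub-healthy."""
        return [r["unit_id"] for r in self.records if r["node_id"] == node_id]


log = SubHealthLog()
log.record_if_marked({"unit_id": "data-2", "sub_health_mark": "osd-1"}, now=1.0)
log.record_if_marked({"unit_id": "data-7"}, now=2.0)  # unmarked, so not logged
log.record_if_marked({"unit_id": "data-9", "sub_health_mark": "osd-1"}, now=3.0)
missed = log.units_missed_by("osd-1")
```

When the third storage node enters the sub-health recovery state, a list such as `missed` would identify which stripe units' metadata needs to be sent to it.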
608. Each fourth storage node sends a first write completion message to the client node.
609. The client node receives the first write completion message sent by the at least one fourth storage node and sends a second write completion message to the target application or external host.
Steps 608 and 609 are similar to steps 407 and 408 above. The difference is that, because the third storage node is in the sub-health state, the client node does not need to determine whether it has received a first write completion message from the third storage node. Once it has received the first write completion messages from the at least one fourth storage node, it confirms that the data has been stored successfully and sends the second write completion message to the target application or external host. In other words, as soon as the degraded write to the other storage nodes succeeds, the client node can report the IO write as successful to the upper layer.
The above describes the write IO process when the third storage node is in the sub-health state. This embodiment also provides a read IO process for when the third storage node is in the sub-health state. Taking as an example the case where the data stripe unit corresponding to the third storage node is called the first data stripe unit, and the metadata backup of the data blocks of the first data stripe unit is stored on a fifth storage node, the read IO process may include the following steps 1 to 3.
Step 1. When the client node receives a read request, it queries the cross-backup relationship between storage nodes and determines the fifth storage node corresponding to the third storage node. The cross-backed-up metadata stored on the fifth storage node includes the metadata of the data blocks of the first data stripe unit, and the read request instructs reading of a data block of the first data stripe unit.
If the client node receives a read request that instructs reading of data in a data block of the first data stripe unit, and the third storage node that stores the first data stripe unit is currently in the sub-health state, the client node may query the cross-backup relationship between storage nodes according to the third storage node to obtain the fifth storage node corresponding to the third storage node. The cross-backed-up metadata in the data stripe unit stored on the fifth storage node includes the metadata of the data blocks of the first data stripe unit; in other words, a backup of that metadata has already been stored on the fifth storage node. Forwarding the read request to the fifth storage node therefore allows the data of the first data stripe unit to be read.
For example, referring to FIG. 7, when OSD node 1 is in the sub-health state and the client node receives a read request for GRAIN data 1 stored on OSD node 1, the client node may query the cross-backup relationship between storage nodes according to OSD node 1, obtain OSD node 2 and OSD node 3, and forward the read request to OSD node 2 and OSD node 3.
Step 2. The client node sends the read request to the fifth storage node.
Step 3. The fifth storage node receives the read request sent by the client node and reads the data.
The fifth storage node may read the cross-backed-up metadata in the data stripe unit that it stores and obtain from it the metadata of the data block of the first data stripe unit. Using that metadata, it can index to the corresponding data block, read the data in the data block, and return the data to the client node. Specifically, the fifth storage node may exchange data with the storage nodes other than the third storage node and receive the data stripe units and check stripe units that they send, thereby obtaining at least one second data stripe unit (a data stripe unit other than the first data stripe unit) and at least one check stripe unit. It may then perform EC decoding on the at least one second data stripe unit and the at least one check stripe unit to restore the first data stripe unit, read the data block of the first data stripe unit, and return the data in the data block to the client node.
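The restore step can be illustrated with a deliberately simplified code. The application does not name the EC algorithm, and its examples use two check stripe units; the sketch below instead swaps in a single XOR parity unit, which can recover exactly one missing unit, purely to show how a missing first data stripe unit is rebuilt from the surviving units. All function names are hypothetical, and equal-length units are assumed.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_parity(units):
    """Compute a single XOR parity unit over all data stripe units."""
    return reduce(xor_bytes, units)


def recover_missing_unit(surviving_units, parity):
    """Rebuild the one missing data stripe unit from the survivors and the parity."""
    return reduce(xor_bytes, surviving_units, parity)


# Three equal-length data stripe units; the first one lives on the sub-healthy node.
units = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(units)

# The fifth storage node gathers the second data stripe units and the parity,
# then decodes to restore the first data stripe unit.
restored = recover_missing_unit(units[1:], parity)
```

A production EC scheme with two or more check stripe units would use a different code, but the data flow (collect survivors, decode, read the restored block) is the same as in the paragraph above.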
With the method provided by this embodiment, if a storage node is in a sub-health state, that node can be bypassed immediately when data is stored, and the data is stored through other storage nodes instead. This enables fast switching between storage nodes, reduces the impact of a node's sub-health state on storage system performance, and ensures that the storage system can still store data quickly even when a storage node is in a sub-health state, thereby ensuring the reliability and stability of the storage system.
FIG. 8 is a flowchart of a data recovery method according to an embodiment of this application. As shown in FIG. 8, the method is performed by a fifth storage node and includes the following steps:
801. The fifth storage node stores at least one data stripe unit.
For the specific implementation of this step, refer to the embodiments of FIG. 4 and FIG. 6; details are not repeated here.
802. The fifth storage node determines that the third storage node is in a sub-health recovery state.
The sub-health recovery state is the state of transition from the sub-health state to the healthy state, that is, the state of recovering from sub-health.
As to how to determine that the third storage node is in the sub-health recovery state, optionally, the fifth storage node may receive a sub-health recovery message, parse it to obtain the identifier of the third storage node carried in it, and determine according to that identifier that the third storage node is in the sub-health recovery state. The sub-health recovery message indicates that the third storage node is in the sub-health recovery state and may carry the identifier of the third storage node.
Optionally, the fifth storage node may receive the sub-health recovery message sent by the MDC node. Specifically, the MDC node may keep in communication with each storage node and be aware of each storage node's current state. When the MDC node determines that the third storage node is in the sub-health recovery state, it may generate a sub-health recovery message and send it to the fifth storage node; the fifth storage node receives the message and thereby determines that the third storage node is in the sub-health recovery state.
The first point to note is that receiving the sub-health recovery message from the MDC node is only one optional way of receiving it, not a mandatory one. Optionally, the fifth storage node may receive a sub-health recovery message sent by another node. For example, when the third storage node is in the sub-health recovery state, it may actively send a sub-health recovery message to the fifth storage node, and the fifth storage node may receive the sub-health recovery message sent by the third storage node.
The second point to note is that step 802 is only an optional step of data recovery, not a mandatory one. In another possible implementation, the fifth storage node need not perform step 802; for example, it may perform step 803 below upon receiving an instruction to send the missing metadata.
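As an illustrative sketch of step 802, a storage node might keep a local table of peer states and update it whenever a sub-health or sub-health recovery message arrives. The message layout (a dictionary with `kind` and `node_id` fields) and the names `NodeState` and `apply_message` are assumptions made for this example; the description only states that the message may carry the identifier of the affected storage node.

```python
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    SUB_HEALTH = "sub_health"
    SUB_HEALTH_RECOVERY = "sub_health_recovery"


# Assumed mapping from message kind to the state it announces.
_KIND_TO_STATE = {
    "health": NodeState.HEALTHY,
    "sub_health": NodeState.SUB_HEALTH,
    "sub_health_recovery": NodeState.SUB_HEALTH_RECOVERY,
}


def apply_message(states: dict, message: dict) -> dict:
    """Parse a state message and return an updated copy of the peer-state table."""
    updated = dict(states)
    updated[message["node_id"]] = _KIND_TO_STATE[message["kind"]]
    return updated
```

The sender of the message is deliberately not modeled, since the message may come from the MDC node or from the recovering storage node itself.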
803. The fifth storage node obtains the missing metadata of the third storage node according to the at least one data stripe unit.
Take the data stripe unit stored by the third storage node, referred to as the first data stripe unit, as an example. The cross-backup metadata of the data stripe unit stored by the fifth storage node includes the metadata of the data block of the first data stripe unit. In other words, a backup of that metadata has already been stored on the fifth storage node in advance, so the fifth storage node can obtain the missing metadata of the third storage node from the data stripe unit it stores.
The missing metadata is the metadata of data blocks of data stripe units that the third storage node should have stored but did not, for example the metadata of data blocks that the third storage node did not store because it was in a sub-health state. The missing metadata can be understood as difference metadata, that is, the difference between the metadata stored on the third storage node and the metadata stored on the fifth storage node.
Optionally, obtaining the missing metadata may include the following steps 1 and 2:
Step 1: Select at least one second data stripe unit from the at least one data stripe unit.
A second data stripe unit is a data stripe unit stored by the fifth storage node itself whose storage time falls within the period during which the third storage node was in a sub-health state. The storage time of a second data stripe unit is the point in time at which the fifth storage node stored it. The period during which the third storage node was in a sub-health state is the time range from when the third storage node entered the sub-health state to when it left that state, for example the time range from when it entered the sub-health state to when it entered the sub-health recovery state. The fifth storage node may select every data stripe unit that it stored while the third storage node was in a sub-health state and use the selected data stripe units as the at least one second data stripe unit.
Optionally, step 1 may include one or a combination of the following implementations 1 to 3.
Implementation 1: Query the sub-health log according to the identifier of the third storage node to obtain the at least one second data stripe unit.
The identifier of the third storage node may be used as an index to query the sub-health log and determine the stripe unit identifiers of the data stripe units that correspond to that identifier in the log, yielding at least one stripe unit identifier. The at least one data stripe unit corresponding to the at least one stripe unit identifier is then determined and used as the at least one second data stripe unit. For example, OSD node 2 may query the sub-health log according to the identifier of OSD node 1 and obtain the stripe unit identifiers corresponding to the identifier of OSD node 1.
Implementation 2: Select the data stripe units that carry the sub-health mark of the third storage node as the at least one second data stripe unit.
Each stored data stripe unit may be checked for the sub-health mark of the third storage node; when a data stripe unit carries that mark, it is selected as a second data stripe unit.
Implementation 3: Determine the target time period during which the third storage node was in a sub-health state, and select at least one data stripe unit stored within the target time period as the at least one second data stripe unit.
As to how to determine the target time period, for example, a first timestamp may be recorded when the sub-health message of the third storage node is received and a second timestamp when the sub-health recovery message of the third storage node is received; the period between the first timestamp and the second timestamp is used as the target time period. As another example, a first timestamp may be recorded when a received write request carries the sub-health mark of the third storage node and a second timestamp when received write requests no longer carry that mark; the period between the first timestamp and the second timestamp is used as the target time period.
As to how to select the at least one second data stripe unit, the storage time point of each data stripe unit may be recorded when the unit is stored; the data stripe units whose storage time points fall within the target time period are then selected as the at least one second data stripe unit.
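Two of the selection approaches for step 1 can be sketched as follows: looking up stripe unit identifiers in a sub-health log keyed by node identifier (implementation 1) and filtering stored units by a target time period (implementation 3). The in-memory log structure, the `StoredUnit` record and the inclusive window bounds are assumptions of this sketch, not details given in the description.

```python
from collections import defaultdict
from dataclasses import dataclass


class SubHealthLog:
    """Assumed in-memory sub-health log: sub-healthy node id -> stripe unit ids."""

    def __init__(self):
        self._entries = defaultdict(list)

    def record(self, sub_healthy_node_id: str, stripe_unit_id: str) -> None:
        # Called when a unit is stored while the given peer is sub-healthy.
        self._entries[sub_healthy_node_id].append(stripe_unit_id)

    def stripe_units_for(self, node_id: str) -> list:
        # Implementation 1: use the node identifier as the index into the log.
        return list(self._entries.get(node_id, []))


@dataclass(frozen=True)
class StoredUnit:
    unit_id: str
    stored_at: float  # storage time point recorded when the unit was written


def select_units_in_window(units, window_start: float, window_end: float) -> list:
    """Implementation 3: keep units stored between the first and second timestamp.

    Both bounds are treated as inclusive, which is an assumption of this sketch.
    """
    return [u for u in units if window_start <= u.stored_at <= window_end]
```

The two approaches can be combined, for example by intersecting the identifiers returned by the log with those of the units selected by time window.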
Step 2: Determine the missing metadata of the third storage node according to the cross-backup metadata in the at least one second data stripe unit.
Step 2 may include any one or a combination of the following implementations 1 and 2:
Implementation 1: The cross-backup metadata in the at least one second data stripe unit may be used as the missing metadata of the third storage node. For example, referring to FIG. 9, OSD node 2 may use GRAIN metadata 2, GRAIN metadata 3 and GRAIN metadata 1 as the missing metadata of OSD node 1.
Implementation 2: One or more pieces of metadata may be selected from the cross-backup metadata of the at least one second data stripe unit, and the selected metadata is used as the missing metadata of the third storage node.
For example, the metadata of the data block corresponding to the third storage node may be selected as the missing metadata of the third storage node. Referring to FIG. 9, OSD node 2 may select GRAIN metadata 1, which corresponds to OSD node 1, and use GRAIN metadata 1 as the missing metadata of OSD node 1.
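The two implementations of step 2 can be sketched as follows. The sketch assumes that each second data stripe unit is represented as a dictionary whose `cross_backup_metadata` list holds entries tagged with an `owner` node identifier; this representation and the function names are hypothetical.

```python
def all_cross_backup_metadata(units: list) -> list:
    """Implementation 1: treat every cross-backup metadata entry as missing metadata."""
    merged = []
    for unit in units:
        merged.extend(unit["cross_backup_metadata"])
    return merged


def metadata_owned_by(units: list, node_id: str) -> list:
    """Implementation 2: keep only the entries that belong to the given node."""
    return [m for m in all_cross_backup_metadata(units) if m["owner"] == node_id]
```

Implementation 1 is simpler but sends more than the recovering node strictly needs; implementation 2 sends only the entries the node itself failed to store. For the FIG. 9 example, assuming GRAIN metadata 1, 2 and 3 belong to OSD nodes 1, 2 and 3 respectively, `metadata_owned_by(units, "osd1")` would return only GRAIN metadata 1.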
804. The fifth storage node sends the missing metadata to the third storage node.
Sending the missing metadata to the third storage node synchronizes to it the metadata it missed while it was in a sub-health state. In other words, the difference metadata between the third storage node and the fifth storage node is synchronized to the third storage node, so that after receiving the missing metadata the third storage node can store it again and thereby recover the metadata of the data blocks that were not stored during the sub-health period.
Optionally, when the third storage node is in a sub-health state, the read I/O flow may include the following steps 1 to 3:
Step 1: The client node receives a read request and determines, according to the target storage location carried in the read request, at least one storage node corresponding to the target storage location. For example, the partition corresponding to the LBA may be determined according to the mapping between LBAs and partitions, and the partition view may then be queried to determine the at least one storage node corresponding to the partition.
Step 2: The client node forwards the read request to the fifth storage node.
When the client node determines that the third storage node is in the sub-health recovery state, it may query the cross-backup relationship between storage nodes according to the third storage node, obtain the fifth storage node, and forward the read request intended for the third storage node to the fifth storage node, that is, to the storage node that holds the metadata backup.
For example, referring to FIG. 9, assume the third storage node is OSD node 1 and the cross-backup metadata stored by OSD node 2 and OSD node 3 includes the metadata backup of OSD node 1; the read request intended for OSD node 1 may then be forwarded to OSD node 2 and OSD node 3.
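The forwarding decision made by the client node can be sketched as a lookup in the cross-backup relationship. The state strings, the shape of the `cross_backup` table and the extra check that a backup holder is itself healthy are assumptions of this sketch; the description only states that the request is forwarded to the nodes holding the metadata backup.

```python
def route_read(target_node: str, node_states: dict, cross_backup: dict) -> list:
    """Return the storage nodes a read request should be sent to.

    node_states maps node id -> state string; nodes not listed are assumed healthy.
    cross_backup maps node id -> ids of the nodes holding its metadata backup.
    """
    if node_states.get(target_node, "healthy") == "healthy":
        return [target_node]
    # The target is sub-healthy or recovering: redirect to its backup holders,
    # skipping any holder that is itself not healthy (an added safeguard).
    return [
        peer
        for peer in cross_backup.get(target_node, [])
        if node_states.get(peer, "healthy") == "healthy"
    ]
```

For the FIG. 9 example, a read intended for OSD node 1 while it is recovering would be routed to OSD node 2 and OSD node 3.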
Optionally, the process by which the third storage node recovers the missing metadata may include the following steps 1 to 3:
Step 1: The third storage node receives write requests from the client node and write requests from the fifth storage node, and maintains a first write request record corresponding to the client node and a second write request record corresponding to the fifth storage node.
A write request from the client node carries a data stripe unit and a target storage location. The first write request record holds the data and metadata that the third storage node needs to store while it is in the sub-health recovery state; each time the third storage node receives a write request from the client node, it records that request in the first write request record. A write request from the fifth storage node carries missing metadata. The second write request record holds the missing metadata that the third storage node has received and needs to store; each time the third storage node receives a write request from the fifth storage node, it records that request in the second write request record.
Step 2: After all the missing metadata has been received, merge the write requests according to the first write request record and the second write request record to obtain at least one target write request.
Specifically, all write requests in the first and second write request records are collected and sorted in chronological order of the timestamps they carry. The sorted write requests are then checked, according to the target storage locations they carry, for requests with the same target storage location. When several write requests carry the same target storage location, they are merged: the most recent of them is kept and the others are discarded, which finally yields the at least one target write request.
Step 3: Store the metadata carried in the at least one target write request.
After the metadata has been stored, the third storage node returns to the healthy state. Optionally, after returning to the healthy state, the third storage node may send the MDC node a message indicating that it is healthy, and the MDC node may broadcast the health message of the third storage node to the client node and to each storage node.
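The merge in step 2 above can be sketched as follows: all recorded write requests are sorted by timestamp and only the most recent request per target storage location is kept. The `WriteRequest` fields and the tie-breaking behavior (for equal timestamps, the request from the second record wins because it sorts later) are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRequest:
    target_location: int  # target storage location carried by the request
    timestamp: float      # timestamp carried by the request
    metadata: str         # metadata payload to be stored


def merge_write_requests(first_record: list, second_record: list) -> list:
    """Keep only the latest write request for each target storage location."""
    latest = {}
    # sorted() is stable, so later timestamps overwrite earlier ones per location.
    for request in sorted(first_record + second_record, key=lambda r: r.timestamp):
        latest[request.target_location] = request
    # Return the surviving target write requests in chronological order.
    return sorted(latest.values(), key=lambda r: r.timestamp)
```

This prevents older missing metadata received from the peer node from overwriting a newer write that the client issued to the same location during recovery.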
It should be noted that the third storage node being in a sub-health state is only one example scenario in which metadata goes missing, not a mandatory one. Accordingly, the third storage node being in the sub-health recovery state is only one example scenario for data recovery, not a mandatory one. In another example scenario, the third storage node may be missing metadata for other reasons, such as a device failure, a temporary power outage or insufficient memory capacity. Accordingly, the third storage node may also perform data recovery in other scenarios, for example when it returns to normal or when it receives a data recovery instruction. This embodiment does not limit the scenario in which the third storage node recovers data.
The method provided by this embodiment introduces a mechanism for cross-backing up the metadata of the data blocks in an EC stripe. Storing the data block together with the cross-backup metadata in each data stripe unit ensures that different storage nodes hold the metadata of one another's data blocks. Even if the metadata of a storage node is lost, the other storage nodes have already stored a backup of that node's metadata in their data stripe units, so they can obtain the node's missing metadata from the data stripe units they store and synchronize it to that node. This reduces the probability of metadata loss and improves the reliability and security of data storage, and thereby the storage performance of the distributed storage system. In particular, if a storage node is missing metadata because of a transient sub-health interruption, then once the node is in the sub-health recovery state, the other storage nodes can synchronize its missing metadata to it according to the metadata differences recorded in their local sub-health logs. The storage node can thus recover its missing metadata automatically after recovering from the sub-health state, which reduces the impact of a node's sub-health state on storage system performance and helps keep the distributed storage system stable and reliable.
FIG. 10 is a schematic structural diagram of a client node according to an embodiment of this application. The client node includes a generation module 1001, an encoding module 1002 and a distribution module 1003.
The generation module 1001 is configured to perform step 403 above;
the encoding module 1002 is configured to perform step 404 above;
the distribution module 1003 is configured to perform step 405 above.
Optionally, the generation module 1001 is configured to perform steps 1 to 3 of step 403 above;
Optionally, the generation module 1001 is configured to perform (2.1) to (2.2) of step 2 of step 403 above;
Optionally, the encoding module 1002 is configured to perform at least one of manners 1 to 3 of step 404 above.
Optionally, the distribution module 1003 is configured to send the corresponding data stripe unit and/or check stripe unit to each storage node.
Optionally, the apparatus further includes:
a sending module, configured to perform step 408 above.
Optionally, the distribution module 1003 is configured to perform step 606 above.
Optionally, the distribution module 1003 is configured to perform at least one of implementations 1 and 2 of step 606 above.
Optionally, the apparatus further includes:
a sending module, configured to perform step 2 of the read I/O flow in the embodiment of FIG. 6 above.
Optionally, the apparatus further includes:
a query module, configured to perform step 1 of the read I/O flow in the embodiment of FIG. 6 above.
Optionally, the distribution module 1003 is configured to perform step 606 above.
Optionally, the apparatus further includes:
a sending module, configured to perform step 609 above.
The first point to note is that each module described in the above embodiment may be a software module that performs the corresponding function; that is, a "module" may be a functional module made up of a group of computer programs. The computer program may be a source program or an object program and may be implemented in any programming language. Through these modules, the client node can implement the data storage function on hardware consisting of a processor and a memory; that is, the processor of the client node runs the software code stored in the memory of the client node to implement the data storage function.
The second point to note is that, when the client node provided by the above embodiment stores data, the division into the functional modules above is only an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the client node may be divided into different functional modules to perform all or some of the functions described above. In addition, the client node provided by the above embodiment and the data storage method embodiments belong to the same concept; for its specific implementation, see the method embodiments. Details are not repeated here.
FIG. 11 is a schematic structural diagram of a storage node according to an embodiment of this application. The apparatus includes a storage module 1101, an obtaining module 1102 and a sending module 1103.
The storage module 1101 is configured to perform step 801 above;
the obtaining module 1102 is configured to perform step 803 above;
the sending module 1103 is configured to perform step 804 above.
Optionally, the obtaining module 1102 is configured to perform steps 1 and 2 of step 803 above;
Optionally, the selection submodule is configured to perform at least one of implementations 1 to 3 of step 1 of step 803 above.
Optionally, the apparatus further includes:
a recording module, configured to record the sub-health log.
Optionally, the recording module is configured to perform at least one of implementations 1 and 2 of step 607.
Optionally, the apparatus further includes:
a receiving module, configured to receive the sub-health recovery message.
The first point to note is that each module described in the above embodiment may be a software module that performs the corresponding function; that is, a "module" may be a functional module made up of a group of computer programs. The computer program may be a source program or an object program and may be implemented in any programming language. Through these modules, the storage node can implement the data recovery function on hardware consisting of a processor and a memory; that is, the processor of the storage node runs the software code stored in the memory of the storage node to implement the data recovery function.
The second point to note is that, when the storage node provided by the above embodiment recovers data, the division into the functional modules above is only an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the storage node may be divided into different functional modules to perform all or some of the functions described above. In addition, the storage node provided by the above embodiment and the data recovery method embodiments belong to the same concept; for its specific implementation, see the method embodiments. Details are not repeated here.
In an example embodiment, this application further provides a computer program product containing instructions which, when run on a client node, enables the client node to implement the operations performed by the data storage method in the above embodiments.
In an example embodiment, this application further provides a computer program product containing instructions which, when run on a storage node, enables the storage node to implement the operations performed by the data recovery method in the above embodiments.
In an example embodiment, this application further provides a data storage system. In one possible implementation, the system includes the client node described in the embodiment of FIG. 2 above and the storage node described in the embodiment of FIG. 3 above.
In another possible implementation, the system includes the client node described in the embodiment of FIG. 10 above and the storage node described in the embodiment of FIG. 11 above.
In an example embodiment, this application further provides a chip. The chip includes a processor and/or program instructions and, when running, implements the operations performed by the data storage method in the above embodiments.
In an example embodiment, this application further provides a chip. The chip includes a processor and/or program instructions and, when running, implements the operations performed by the data recovery method in the above embodiments.
All of the optional technical solutions above may be combined in any manner to form optional embodiments of this application; they are not described here one by one.
The above embodiments may be implemented wholly or partly in software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired or wireless manner. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid-state drive), or the like.
The term "and/or" in this application merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the character "/" in this application generally indicates an "or" relationship between the objects before and after it.
The term "multiple" in this application means two or more; for example, multiple data packets means two or more data packets.
The terms "first", "second" and the like in this application are used to distinguish between identical or similar items whose roles and functions are basically the same. A person skilled in the art will understand that "first", "second" and the like do not limit quantity or order of execution.
The foregoing describes only optional embodiments of this application and is not intended to limit this application. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the protection scope of this application.

Claims (19)

  1. A data storage method, characterized in that the method comprises:
    generating at least one data stripe unit according to at least one data block to be stored, wherein each data stripe unit includes a data block and cross-backed-up metadata, and the cross-backed-up metadata includes the metadata of the data block included in that data stripe unit and the metadata of data blocks included in data stripe units other than that data stripe unit;
    performing erasure code (EC) encoding on the at least one data stripe unit to obtain at least one check stripe unit; and
    distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node.
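For illustration only, the three steps of claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the metadata fields, the ring-shaped backup relation, the use of a single XOR parity block in place of a real erasure code, and all names (`StripeUnit`, `build_stripe_units`, `xor_parity`, `distribute`) are hypothetical.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class StripeUnit:
    """One data stripe unit: a data block plus cross-backed-up metadata."""
    index: int
    block: bytes
    # Maps a data-block index to that block's metadata. Holds this unit's own
    # entry and entries for data blocks that live in other stripe units.
    metadata: dict = field(default_factory=dict)


def make_metadata(index, block):
    # Hypothetical metadata layout; the claims do not define the fields.
    return {"block": index, "length": len(block)}


def build_stripe_units(blocks, backups=1):
    """Attach each block's own metadata plus `backups` neighbours' metadata."""
    n = len(blocks)
    meta = [make_metadata(i, b) for i, b in enumerate(blocks)]
    units = []
    for i, block in enumerate(blocks):
        unit = StripeUnit(i, block, {i: meta[i]})
        for step in range(1, backups + 1):
            j = (i + step) % n  # assumed ring-shaped backup relation
            unit.metadata[j] = meta[j]
        units.append(unit)
    return units


def xor_parity(blocks):
    """Single XOR parity block; a stand-in for a real erasure code."""
    size = max(len(b) for b in blocks)
    padded = [b.ljust(size, b"\x00") for b in blocks]
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*padded))


def distribute(units, parity, nodes):
    """Place one stripe unit per storage node (nodes are plain dicts here)."""
    for node, unit in zip(nodes, units + [parity]):
        node["units"].append(unit)
```

With three data blocks and `backups=1`, each stripe unit carries its own metadata and that of one neighbour, so every block's metadata exists on two storage nodes after distribution.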
  2. The method according to claim 1, characterized in that the generating at least one data stripe unit according to at least one data block to be stored comprises:
    backing up the metadata of the at least one data block to obtain at least one metadata backup, wherein the at least one data block is in one-to-one correspondence with the at least one metadata backup;
    for a data block among the at least one data block, selecting, from the at least one metadata backup, at least one target metadata backup corresponding to the data block; and
    generating a data stripe unit according to the data block, the metadata of the data block, and the at least one target metadata backup.
  3. The method according to claim 2, characterized in that the selecting, from the at least one metadata backup, at least one target metadata backup corresponding to the data block comprises:
    querying a cross-backup relationship between storage nodes according to a first storage node corresponding to the data block, to obtain at least one second storage node corresponding to the first storage node; and
    determining the metadata backups of the data blocks corresponding to the at least one second storage node as the at least one target metadata backup.
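The lookup in claim 3 can be pictured with a small Python sketch. The table `CROSS_BACKUP`, the node names, and the function `select_target_backups` are hypothetical; the claims state only that a cross-backup relationship between storage nodes is queried, not how it is represented.

```python
# Hypothetical cross-backup relationship: first storage node -> the second
# storage nodes whose data blocks' metadata it also keeps.
CROSS_BACKUP = {
    "node1": ["node2", "node3"],
    "node2": ["node3", "node1"],
    "node3": ["node1", "node2"],
}


def select_target_backups(first_node, block_of_node, metadata_backups):
    """Return the metadata backups that travel with first_node's data block.

    block_of_node maps a storage node to the id of the data block assigned
    to it; metadata_backups maps a data block id to its metadata backup.
    """
    second_nodes = CROSS_BACKUP[first_node]
    return [metadata_backups[block_of_node[node]] for node in second_nodes]
```

In this sketch, the data stripe unit bound for `node1` would carry its own block's metadata plus the backups returned for `node2` and `node3`.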
  4. The method according to claim 1, characterized in that the performing erasure code (EC) encoding on the at least one data stripe unit to obtain at least one check stripe unit comprises at least one of the following steps:
    performing EC encoding on the data blocks in the at least one data stripe unit to obtain a check block in the at least one check stripe unit;
    performing EC encoding on the metadata in the at least one data stripe unit to obtain a metadata check block in the at least one check stripe unit.
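As a hedged illustration of claim 4, the sketch below encodes the data blocks and the metadata separately, producing a check block and a metadata check block. XOR parity is used only as the simplest stand-in for an erasure code; the claims do not specify the code, and the names `xor_bytes` and `encode_check_unit` are assumptions.

```python
def xor_bytes(chunks):
    """XOR byte strings together, treating shorter ones as zero-padded."""
    size = max(len(chunk) for chunk in chunks)
    out = bytearray(size)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def encode_check_unit(data_units):
    """Build one check stripe unit from (data_block, metadata_bytes) pairs."""
    return {
        # Parity over the data blocks only.
        "check_block": xor_bytes([block for block, _ in data_units]),
        # Separate parity over the serialized metadata only.
        "metadata_check_block": xor_bytes([meta for _, meta in data_units]),
    }
```

Keeping the two parities separate means a lost metadata entry can be rebuilt from the other metadata and the metadata check block without reading any data block.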
  5. The method according to claim 1, characterized in that the distributing the at least one data stripe unit and the at least one check stripe unit to at least one storage node comprises:
    determining that a third storage node among the at least one storage node is in a sub-health state; and
    sending, to a fourth storage node among the at least one storage node, at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node, together with the at least one check stripe unit; and/or sending, to a fourth storage node among the at least one storage node, at least one second check stripe unit other than the first check stripe unit corresponding to the third storage node, together with the at least one data stripe unit.
  6. The method according to claim 5, characterized in that after the sending, to the fourth storage node among the at least one storage node, of the at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node and the at least one check stripe unit, the method further comprises:
    when a read request is received, sending the read request to a fifth storage node, wherein the read request instructs reading the data block of the first data stripe unit, and the cross-backed-up metadata stored by the fifth storage node includes the metadata of the data block of the first data stripe unit.
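The read routing in claim 6 might look like the following Python sketch. It assumes the client knows, for each storage node, which blocks' metadata that node holds; the mapping `cross_metadata`, the set `sub_healthy`, and the function name are hypothetical. How the fifth storage node then serves the block is outside this sketch.

```python
def choose_read_target(block_id, home_node, sub_healthy, cross_metadata):
    """Pick the storage node that should receive a read request.

    cross_metadata maps a node to the set of block ids whose metadata that
    node stores (its own block plus cross-backed-up entries).
    """
    if home_node not in sub_healthy:
        return home_node
    # The home node is sub-healthy: fall back to a healthy node whose
    # cross-backed-up metadata covers the requested block.
    for node in sorted(cross_metadata):
        if (node != home_node and node not in sub_healthy
                and block_id in cross_metadata[node]):
            return node
    raise LookupError(f"no healthy node holds metadata for block {block_id!r}")
```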
  7. The method according to claim 5, characterized in that the sending, to the fourth storage node among the at least one storage node, of the at least one second data stripe unit other than the first data stripe unit corresponding to the third storage node and the at least one check stripe unit comprises at least one of the following steps:
    writing a sub-health mark of the third storage node into the at least one second data stripe unit, wherein the sub-health mark indicates that the third storage node is in a sub-health state;
    sending a write request to the fourth storage node, wherein the write request carries the sub-health mark of the third storage node and the second data stripe unit corresponding to the fourth storage node.
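Claims 5 and 7 together describe skipping the sub-healthy node and tagging what is sent to the others. A minimal Python sketch, assuming write requests are plain dictionaries and that the mark is simply the identifier of the sub-healthy node (both assumptions, as is the name `build_write_requests`):

```python
def build_write_requests(units_by_node, sub_healthy_node=None):
    """Build one write request per healthy storage node.

    units_by_node maps a storage node to the stripe unit planned for it.
    The unit planned for the sub-healthy node is not sent; every other
    request carries that node's id as its sub-health mark.
    """
    requests = []
    for node, unit in units_by_node.items():
        if node == sub_healthy_node:
            continue  # skip the node that is in the sub-health state
        requests.append({
            "target": node,
            "unit": unit,
            "sub_health_mark": sub_healthy_node,  # None when all are healthy
        })
    return requests
```

The mark lets a receiving node later identify which stripe units were written while the other node was unavailable.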
  8. A data recovery method, characterized in that the method comprises:
    storing at least one data stripe unit, wherein each data stripe unit includes a data block and cross-backed-up metadata, and the cross-backed-up metadata includes the metadata of the data block of that data stripe unit and the metadata of data blocks included in data stripe units other than that data stripe unit;
    obtaining missing metadata of a third storage node according to the at least one data stripe unit; and
    sending the missing metadata to the third storage node.
  9. The method according to claim 8, characterized in that the obtaining missing metadata of a third storage node according to the at least one data stripe unit comprises:
    selecting at least one second data stripe unit from the at least one data stripe unit, wherein the storage time of the second data stripe unit falls within the period during which the third storage node is in a sub-health state; and
    determining the missing metadata according to the cross-backed-up metadata in the at least one second data stripe unit.
  10. The method according to claim 9, characterized in that the selecting at least one second data stripe unit from the at least one data stripe unit comprises at least one of the following steps:
    querying a sub-health log according to an identifier of the third storage node to obtain the at least one second data stripe unit, wherein the sub-health log indicates the data stripe units stored while the third storage node was in the sub-health state;
    selecting the data stripe units that have a sub-health mark of the third storage node as the at least one second data stripe unit, wherein the sub-health mark indicates that the third storage node is in a sub-health state.
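The recovery path of claims 9 and 10 can be sketched in Python as below, using the mark-based selection. Stripe units are modelled as dictionaries, and `block_home` (block id to home storage node) is an assumed lookup that the claims do not name; the function names are likewise hypothetical.

```python
def select_marked_units(stored_units, node_id):
    """Pick stripe units written while node_id was in the sub-health state."""
    return [u for u in stored_units if u.get("sub_health_mark") == node_id]


def missing_metadata(stored_units, node_id, block_home):
    """Collect metadata that node_id missed, from cross-backed-up entries.

    block_home maps a block id to the storage node the block belongs to.
    """
    missing = {}
    for unit in select_marked_units(stored_units, node_id):
        for block_id, meta in unit["metadata"].items():
            if block_home[block_id] == node_id:
                missing[block_id] = meta
    return missing
```

The resulting dictionary is what a healthy storage node would send to the third storage node once that node recovers.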
  11. The method according to claim 10, characterized in that before the querying a sub-health log according to an identifier of the third storage node, the method further comprises:
    recording the sub-health log for the third storage node while the third storage node is in the sub-health state.
  12. The method according to claim 11, characterized in that the recording the sub-health log for the third storage node while the third storage node is in the sub-health state comprises at least one of the following steps:
    when a received write request carries the sub-health mark of the third storage node and a data stripe unit, writing a sub-health record corresponding to the third storage node into the sub-health log;
    when a received data stripe unit carries the sub-health mark of the third storage node, writing a sub-health record corresponding to the third storage node into the sub-health log.
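A sub-health log of the kind described in claims 10 to 12 could be kept as a simple per-node list, as in this Python sketch. The class `SubHealthLog`, the request layout, and the choice to record stripe unit ids are all assumptions made for illustration.

```python
from collections import defaultdict


class SubHealthLog:
    """Per-node record of stripe units stored during that node's sub-health."""

    def __init__(self):
        self._records = defaultdict(list)

    def on_write(self, request):
        """Record the unit if the request or the unit itself carries a mark."""
        unit = request["unit"]
        mark = request.get("sub_health_mark")
        if mark is None:
            mark = unit.get("sub_health_mark")
        if mark is not None:
            self._records[mark].append(unit["id"])

    def units_for(self, node_id):
        """Query by node identifier: ids of units stored while it was down."""
        return list(self._records.get(node_id, []))
```

Querying the log by node identifier avoids scanning every stored stripe unit for marks when the sub-healthy node comes back.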
  13. The method according to claim 8, characterized in that before the obtaining missing metadata of a third storage node according to the at least one data stripe unit, the method further comprises:
    receiving a sub-health recovery message, wherein the sub-health recovery message indicates that the third storage node is in a sub-health recovery state.
  14. A client node, characterized in that the client node comprises:
    a generating module, configured to generate at least one data stripe unit according to at least one data block to be stored, wherein each data stripe unit includes a data block and cross-backed-up metadata, and the cross-backed-up metadata includes the metadata of the data block of that data stripe unit and the metadata of data blocks included in data stripe units other than that data stripe unit;
    an encoding module, configured to perform erasure code (EC) encoding on the at least one data stripe unit to obtain at least one check stripe unit; and
    a distribution module, configured to distribute the at least one data stripe unit and the at least one check stripe unit to at least one storage node.
  15. A storage node, characterized in that the storage node comprises:
    a storage module, configured to store at least one data stripe unit, wherein each data stripe unit includes a data block and cross-backed-up metadata, and the cross-backed-up metadata includes the metadata of the data block and the metadata of data blocks included in data stripe units other than that data stripe unit;
    an obtaining module, configured to obtain missing metadata of a third storage node according to the at least one data stripe unit; and
    a sending module, configured to send the missing metadata to the third storage node.
  16. A client node, characterized in that the client node comprises a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data storage method according to any one of claims 1 to 7.
  17. A storage node, characterized in that the storage node comprises a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operations performed by the data recovery method according to any one of claims 8 to 13.
  18. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data storage method according to any one of claims 1 to 7.
  19. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the operations performed by the data recovery method according to any one of claims 8 to 13.
PCT/CN2019/087904 2018-08-14 2019-05-22 Data storage method, data recovery method, apparatus, device and storage medium WO2020034695A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810922810.0 2018-08-14
CN201810922810.0A CN110825552B (en) 2018-08-14 2018-08-14 Data storage method, data recovery method, node and storage medium

Publications (1)

Publication Number Publication Date
WO2020034695A1 true WO2020034695A1 (en) 2020-02-20

Family

ID=69525070

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/087904 WO2020034695A1 (en) 2018-08-14 2019-05-22 Data storage method, data recovery method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN110825552B (en)
WO (1) WO2020034695A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111510338B (en) * 2020-03-09 2022-04-26 苏州浪潮智能科技有限公司 Distributed block storage network sub-health test method, device and storage medium
CN112527561B (en) * 2020-12-09 2021-10-01 广州技象科技有限公司 Data backup method and device based on Internet of things cloud storage

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104331478A (en) * 2014-11-05 2015-02-04 浪潮电子信息产业股份有限公司 Data consistency management method of self-simplified storage system
US20160205190A1 (en) * 2014-08-22 2016-07-14 Nexenta Systems, Inc. Parallel transparent restructuring of immutable content in a distributed object storage system
CN106662983A (en) * 2015-12-31 2017-05-10 华为技术有限公司 Method, apparatus and system for data reconstruction in distributed storage system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8352501B2 (en) * 2010-01-28 2013-01-08 Cleversafe, Inc. Dispersed storage network utilizing revision snapshots
CN102682012A (en) * 2011-03-14 2012-09-19 成都市华为赛门铁克科技有限公司 Method and device for reading and writing data in file system
CN107748702B (en) * 2015-06-04 2021-05-04 华为技术有限公司 Data recovery method and device
CN107885612B (en) * 2016-09-30 2020-02-21 华为技术有限公司 Data processing method, system and device

Also Published As

Publication number Publication date
CN110825552A (en) 2020-02-21
CN110825552B (en) 2021-04-09

Similar Documents

Publication Publication Date Title
US11093324B2 (en) Dynamic data verification and recovery in a storage system
CN107807794B (en) Data storage method and device
CA2897129C (en) Data processing method and device in distributed file storage system
US7266716B2 (en) Method and recovery of data using erasure coded data from stripe blocks
WO2019119311A1 (en) Data storage method, device, and system
CN107526536B (en) Method and system for managing storage system
CN106776130B (en) Log recovery method, storage device and storage node
US10649843B2 (en) Storage systems with peer data scrub
CN110651246B (en) Data reading and writing method and device and storage server
US7284088B2 (en) Methods of reading and writing data
US7310703B2 (en) Methods of reading and writing data
US10725662B2 (en) Data updating technology
CN107729536B (en) Data storage method and device
WO2015085530A1 (en) Data replication method and storage system
CN109582213B (en) Data reconstruction method and device and data storage system
WO2017071563A1 (en) Data storage method and cluster management node
CN108733311B (en) Method and apparatus for managing storage system
WO2019137323A1 (en) Data storage method, apparatus and system
CN111949210A (en) Metadata storage method, system and storage medium in distributed storage system
US20190347165A1 (en) Apparatus and method for recovering distributed file system
WO2020034695A1 (en) Data storage method, data recovery method, apparatus, device and storage medium
US20220129346A1 (en) Data processing method and apparatus in storage system, and storage system
CN111381770A (en) Data storage switching method, device, equipment and storage medium
CN115470041A (en) Data disaster recovery management method and device
CN112256204B (en) Storage resource allocation method and device, storage node and storage medium

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19850311; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase. Ref country code: DE

122 Ep: PCT application non-entry in European phase. Ref document number: 19850311; Country of ref document: EP; Kind code of ref document: A1