CN114237971A

CN114237971A - Erasure code coding layout method and system based on distributed storage system

Info

Publication number: CN114237971A
Application number: CN202111481100.7A
Authority: CN
Inventors: 宋�莹; 穆天童; 杨明杰
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2021-12-06
Filing date: 2021-12-06
Publication date: 2022-03-25

Abstract

The invention provides an erasure code coding layout method and system for a distributed storage system. The aim is to improve the recovery efficiency and reliability of the whole system by reducing the amount of data transmitted during recovery and the recovery time when the distributed system loses data. On top of conventional RS erasure-coded storage, the invention adds a parity calculation inside each node, using RS codes with different n and k values, and stores the result of that calculation in the current node. When a small amount of data is lost or a node fails, decoding can use the parity check block stored in the node, which reduces the cross-rack and cross-node network bandwidth consumed by recovery.

Description

Erasure code coding layout method and system based on distributed storage system

Technical Field

The invention relates to the field of distributed storage computing and data recovery, and in particular to an erasure code coding layout method, based on a distributed storage system, for improving recovery efficiency. It belongs to the field of distributed computing.

Background

With the rapid development of Internet technology, data volume is growing exponentially, and data storage is gradually shifting from single-machine storage to distributed storage. The most popular open-source big data framework at present is Hadoop, a platform that can process massive data offline and in parallel. It is highly reliable, highly scalable, efficient, low-cost and open source, and has become the preferred massive data processing solution for many Internet companies. Hadoop mainly comprises the Hadoop Distributed File System (HDFS) and the MapReduce distributed computing framework. Although Hadoop has reached version 3.x and is mature, some aspects still have shortcomings and need improvement and optimization.

Distributed clusters (e.g., Hadoop) are typically composed of many independent, unreliable commodity components, and component failures are common. To ensure high reliability and availability of data in such distributed storage systems, two common approaches to fault tolerance are multiple copies (replication) and erasure codes. Replication is easy to deploy and makes failure recovery simple, but its storage overhead is too large for systems with very large data volumes and limited disk space. Erasure codes, as an alternative, provide fault tolerance close to that of replication at a lower storage overhead, and have been deployed in some distributed systems. They can reduce storage redundancy from the traditional 3x to 1.4x, saving considerable space. However, recovering a failed block with erasure codes requires retrieving multiple available blocks, which makes recovery expensive. Although erasure codes improve storage efficiency, they significantly increase the disk I/O and network bandwidth used for failure recovery.

To maximize data availability in distributed storage systems that use erasure codes, the blocks of each erasure-coded stripe are stored on nodes in different racks. This data layout enables the system to tolerate a certain number of node failures and rack failures. However, it also means that repairing any failed block requires retrieving available blocks from other racks, which consumes a significant amount of cross-rack bandwidth. Typically, the cross-rack bandwidth available to each node is only 1/20 to 1/5 of the intra-rack bandwidth. In a distributed storage system, intra-rack bandwidth is therefore considered sufficient, while cross-rack bandwidth is considered a scarce resource; excessive cross-rack traffic inevitably delays the recovery process and reduces recovery efficiency.

Disclosure of Invention

The invention provides an erasure code coding layout method for improving data recovery efficiency. It aims to improve the recovery efficiency and reliability of the whole system by reducing the amount of data transmitted and the time taken for recovery when a distributed system loses data. On top of conventional RS erasure-coded storage, the invention adds a parity calculation inside each node, using RS codes with different n and k values, and stores the result in the current node. When a small amount of data is lost or a node fails, decoding can use the parity check block stored in the node, which reduces the cross-rack and cross-node network bandwidth consumed by recovery.

Specifically, to address the defects of the prior art, the invention provides an erasure code coding layout method based on a distributed storage system, which comprises the following steps:

step 1, acquiring a distributed storage system with a plurality of storage nodes, setting transverse and longitudinal coding parameters according to the number of storage nodes of the distributed storage system, and dividing all the storage nodes, according to what they store, into data nodes for storing data blocks and check nodes for storing transverse check blocks;

step 2, according to the transverse and longitudinal coding parameters, performing longitudinal and transverse erasure coding on each original data block on each data node to obtain a longitudinal check block and a transverse check block corresponding to each original data block; storing the transverse check block on a check node, and storing the longitudinal check block on the data node that holds the corresponding original data block;

step 3, when data is lost, judging whether the lost data is an original data block; if so, decoding with the longitudinal check block of the data node where the lost data was located to recover the lost data, and storing the recovered data on that data node; otherwise, judging whether the lost data is a longitudinal check block; if so, performing longitudinal erasure coding to regenerate the lost check block, and storing it on the data node where it was located; otherwise, the lost data is a transverse check block, which is regenerated by transverse erasure coding and stored on the check node.

The erasure code coding layout method based on the distributed storage system further comprises: step 4, when a data node fails, decoding with the transverse check blocks of the check nodes to recover the stripe where the longitudinal check block is located, and then decoding to recover the original data blocks, until the last remaining original data block is decoded and recovered using the recovered longitudinal check block.

The erasure code coding layout method based on the distributed storage system further comprises: step 5, when a check node fails, performing transverse erasure coding on each original data block in the data nodes to recover the failed check node.

The erasure code coding layout method based on the distributed storage system is characterized in that the longitudinal erasure codes and the transverse erasure codes belong to parity check codes, and the number of the transverse check blocks is greater than that of the longitudinal check blocks.

The invention also provides an erasure code coding layout system based on the distributed storage system, comprising:

the initial module is used for acquiring a distributed storage system with a plurality of storage nodes, setting transverse and longitudinal coding parameters according to the number of the storage nodes of the distributed storage system, and dividing all the storage nodes into data nodes for storing data blocks and check nodes for storing transverse check blocks according to storage contents;

the encoding module is used for respectively carrying out longitudinal and transverse erasure coding on each original data block on each data node according to the transverse and longitudinal coding parameters to obtain a longitudinal check block and a transverse check block corresponding to each original data block; storing the transverse check block to a check node, and storing the longitudinal check block to the data node corresponding to the original data block;

the recovery module is used for judging, when data is lost, whether the lost data is an original data block; if so, decoding with the longitudinal check block of the data node where the lost data was located to recover the lost data, and storing the recovered data on that data node; otherwise, judging whether the lost data is a longitudinal check block; if so, performing longitudinal erasure coding to regenerate the lost check block, and storing it on the data node where it was located; otherwise, the lost data is a transverse check block, which is regenerated by transverse erasure coding and stored on the check node.

The erasure code coding layout system based on the distributed storage system is characterized in that the recovery module is further used for decoding the transverse check block of the check node when the data node fails to obtain the stripe where the longitudinal check block is located, and then decoding and recovering the original data block until the last remaining original data block is decoded and recovered by using the longitudinal check block obtained by recovery.

The erasure code coding layout system based on the distributed storage system is characterized in that the recovery module is further configured to perform transverse erasure coding on each original data block in the data node when the check node fails, so as to recover the failed check node.

The erasure code coding layout system based on the distributed storage system is characterized in that the longitudinal erasure codes and the transverse erasure codes belong to parity check codes, and the number of the transverse check blocks is greater than that of the longitudinal check blocks.

The invention also provides a storage medium for storing a program for executing any erasure code coding layout method based on the distributed storage system.

The invention also provides a client used for any erasure code coding layout system based on the distributed storage system.

According to the scheme, the invention has the advantages that:

The invention first reads the id and position of each original data block in the distributed file storage system, generates the transverse parity check blocks and longitudinal parity check blocks, and records them in a file. When data is lost, the fault type is judged from the file, and encoding or decoding is performed to recover the lost data block or parity check block. The layout method provided by the invention improves the reliability of the distributed cluster and the efficiency of cluster data recovery, and reduces the amount of data transmitted and the bandwidth occupied during recovery.

Drawings

FIG. 1 is a layout diagram of the present invention;

FIG. 2 is a flow chart of data recovery according to the present invention.

Detailed Description

The method provided by the invention comprises the following steps:

A. Select the values of RS (n, k) and RS (n', k') according to the number of nodes in the distributed cluster.

A1. Determine the values of n, k, n' and k' according to the number of nodes in the distributed cluster and the commonly used RS coding configurations.

A2. Divide all nodes in the cluster into data nodes and parity check nodes according to the above conditions, where the data nodes only store data blocks and the parity check nodes only store parity check blocks.

B. Encode, mark and record all data blocks based on the determined number of nodes.

B1. Read the id and position of each original data block in the distributed file storage system.

B2. Perform longitudinal erasure coding on all data blocks on the corresponding node according to the longitudinal RS (n', k') determined in step A.

B3. Perform transverse erasure coding on all data blocks according to the transverse RS (n, k) determined in step A.

B4. Store the label of each block in a file together with the content read in step B1.

C. Judge the fault type according to the errors reported when the user reads data blocks and the file uploaded in step B.

C1. Determine whether a single data block is lost.

C2. Determine whether a single node has failed.

D. Select a data recovery mode according to the distributed cluster fault condition judged in step C.

D1. Recover the lost single data block according to the judgment result of step C.

D2. Recover the failed single node according to the judgment result of step C.

D3. Read all the recovered data, compare it with the content in the file, and check whether the recovery succeeded.

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

The steps of the present invention are further described below in conjunction with FIGS. 1 and 2. They comprise: A. selecting the values of RS (n, k) and RS (n', k') according to the number of nodes in the distributed cluster; B. encoding, marking and recording all data blocks based on the determined number of nodes; C. judging the fault type according to the errors reported when the user reads data blocks and the file uploaded in step B; D. selecting a data recovery mode according to the distributed cluster fault condition judged in step C. One specific implementation is as follows:

A1. Determine the values of n, k, n' and k' according to the number of nodes in the distributed cluster and the commonly used RS coding configurations. An (n, k) code here takes n data blocks and generates k additional transverse parity blocks. The total number of nodes is greater than or equal to the sum of n and k, and the encoding result of RS (n, k) is stored on the parity check nodes. n' and k' are less than or equal to n and k respectively; a fixed combination can be used to calculate the longitudinal parity blocks within each node, and the results are stored on the respective nodes.

A2. All nodes in the cluster are divided into data nodes and parity check nodes according to the above conditions. The data nodes store the original data blocks and the longitudinal parity blocks described in step A1, and the parity check nodes store only the transverse parity blocks.
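The constraints stated in step A1 can be checked mechanically. The following is a minimal sketch, not the patent's implementation: the names LayoutParams, validate and storage_overhead are hypothetical, and the overhead formula assumes that every row of a node group, including the longitudinal parity row, is also transverse-coded, as step B3-1 describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutParams:
    n: int    # data blocks per transverse stripe
    k: int    # transverse parity blocks per stripe
    n_v: int  # n': original blocks per longitudinal group on one node
    k_v: int  # k': longitudinal parity blocks per group


def validate(params: LayoutParams, node_count: int) -> None:
    """Raise ValueError if the parameters break a constraint from step A1."""
    if min(params.n, params.k, params.n_v, params.k_v) < 1:
        raise ValueError("all coding parameters must be positive")
    if node_count < params.n + params.k:
        raise ValueError("node count must be at least n + k")
    if params.n_v > params.n or params.k_v > params.k:
        raise ValueError("n' and k' must not exceed n and k")


def storage_overhead(params: LayoutParams) -> float:
    """Stored blocks per original block (assumption: all rows are transverse-coded)."""
    total = (params.n_v + params.k_v) * (params.n + params.k)
    return total / (params.n_v * params.n)


# Parameters of the FIG. 1 example: transverse RS (3,2), longitudinal RS (3,1).
fig1 = LayoutParams(n=3, k=2, n_v=3, k_v=1)
validate(fig1, node_count=5)
print(round(storage_overhead(fig1), 3))  # 20 stored blocks for 9 originals
```

Under these assumptions the extra longitudinal parity raises the stored volume, which is the price paid for repairing single block losses without leaving the node.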

B1-1. Set the number of copies of the stored data on all nodes to 1 through the cluster configuration file.

B1-2. Obtain, through the API, the id number and storage position of each original data block in the distributed file storage system, including which data blocks were split from the same file and which node each block is on.

B1-3. Arrange the original data blocks into an abstract matrix according to the information obtained in step B1-2, for use when the parity check blocks are generated by the subsequent encoding.

B2-1. Perform longitudinal erasure coding on all the original data blocks on each data node according to the RS (n', k') determined in step A and the information obtained in step B1-1.

B2-2. Store the newly generated longitudinal parity blocks on their respective nodes, and add an identification to every longitudinal parity block.

B2-3. Record the position and identification of each longitudinal parity block in the file in the form of an abstract matrix. One possible layout is shown in FIG. 1, where the longitudinal erasure coding uses RS (3,1): 1 longitudinal parity block is generated for every 3 data blocks and stored in the current node. This coding scheme can tolerate the failure of at most 1 block in each such group.
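The in-node coding of step B2 can be illustrated with a single parity block. The sketch below is an assumption-laden stand-in: it uses byte-wise XOR in place of RS (3,1), which gives the same tolerance of one lost block per group, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    if not blocks or any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("blocks must be non-empty and of equal length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


def encode_longitudinal(originals):
    """Compute one longitudinal parity block for the blocks held on a node."""
    return xor_blocks(originals)


def recover_local(survivors, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(list(survivors) + [parity])


# Three equally sized blocks on one data node, as in the FIG. 1 example.
node_blocks = [b"blk0", b"blk1", b"blk2"]
parity = encode_longitudinal(node_blocks)
restored = recover_local([node_blocks[0], node_blocks[2]], parity)
print(restored == node_blocks[1])
```

Every input to recover_local already sits on the same node, so this repair path generates no network traffic.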

B3-1. Perform transverse erasure coding on all current blocks (both the original data blocks and the longitudinal parity blocks) according to the transverse RS (n, k) determined in step A and the storage positions resulting from the longitudinal erasure coding in step B2.

B3-2. Store the generated transverse parity blocks on the parity check nodes, stripe by stripe, and add an identification to every transverse parity block.

B3-3. Record the position and identification of each transverse parity block in the file in the form of an abstract matrix. One possible layout is shown in FIG. 1, where the transverse erasure coding uses RS (3,2): 2 transverse parity blocks are generated for every 3 data blocks and stored on the parity check nodes. This coding scheme can tolerate the failure of at most 2 blocks in each stripe.
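A stripe with two parity blocks needs finite-field arithmetic. The following sketch is an illustration under stated assumptions, not the coder used by the invention: it implements a P+Q code over GF(2^8) in the style of RAID-6 as a stand-in for RS (3,2), with field polynomial 0x11D and generator 2, and all function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo the polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a, exponent):
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    if a == 0:
        raise ZeroDivisionError("zero has no inverse in GF(2^8)")
    return gf_pow(a, 254)  # a^255 == 1 for every non-zero a


def accumulate(blocks, p, q, skip=()):
    """XOR each block into p, and its generator-weighted copy into q."""
    for i, block in enumerate(blocks):
        if i in skip:
            continue
        coeff = gf_pow(2, i)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)


def encode_stripe(data):
    """Return the two transverse parity blocks (P, Q) of one stripe."""
    p, q = bytearray(len(data[0])), bytearray(len(data[0]))
    accumulate(data, p, q)
    return bytes(p), bytes(q)


def recover_two(data, x, y, p, q):
    """Rebuild data blocks x and y; their entries in `data` are ignored."""
    pxy, qxy = bytearray(p), bytearray(q)
    accumulate(data, pxy, qxy, skip=(x, y))  # pxy = dx ^ dy, qxy = gx*dx ^ gy*dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(inv, qb ^ gf_mul(gy, pb)) for pb, qb in zip(pxy, qxy))
    dy = bytes(pb ^ db for pb, db in zip(pxy, dx))
    return dx, dy


stripe = [b"abcd", b"efgh", b"ijkl"]
p_block, q_block = encode_stripe(stripe)
d0, d2 = recover_two([None, stripe[1], None], 0, 2, p_block, q_block)
print(d0 == stripe[0] and d2 == stripe[2])
```

A production system would normally rely on an existing erasure coding library; the point here is only that two parity blocks per stripe suffice to rebuild any two lost data blocks of that stripe.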

B4-1. After the above steps are completed, there are three different types of blocks, namely original data blocks, longitudinal parity blocks and transverse parity blocks, each carrying its own identification. The identification and the position in the abstract matrix of every block are recorded in a file.

B4-2. Upload the file to the distributed file storage system for storage, so that it can be downloaded and checked when data loss occurs.
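The file recorded in step B4 can be pictured as a serialized matrix of block records. The sketch below is only one possible shape: the field names, the node naming scheme and the choice of JSON are assumptions, since the source does not specify the file format.

```python
import json


def build_layout(n, k, n_v, k_v):
    """Return the abstract matrix: one row per stripe, one cell per block."""
    matrix = []
    for row in range(n_v + k_v):
        # The first n' rows hold original blocks, the rest longitudinal parity.
        kind = "original" if row < n_v else "longitudinal_parity"
        cells = [
            {"id": f"r{row}c{col}", "type": kind, "node": f"data-{col}"}
            for col in range(n)
        ]
        # Each row is extended by k transverse parity blocks on parity nodes.
        cells += [
            {"id": f"r{row}c{n + j}", "type": "transverse_parity",
             "node": f"parity-{j}"}
            for j in range(k)
        ]
        matrix.append(cells)
    return matrix


def save_layout(matrix, path):
    """Write the matrix to a local file before uploading it to the cluster."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(matrix, handle, indent=2)


layout = build_layout(n=3, k=2, n_v=3, k_v=1)
print(len(layout), "rows x", len(layout[0]), "columns")
```

With the FIG. 1 parameters this gives 4 rows and 5 columns: three data-node columns holding three original blocks and one longitudinal parity block each, plus two parity-node columns.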

C1. It is determined whether a single data block is lost.

C1-1. Read all blocks after the failure, compare them with the blocks and identifications recorded in the file, and judge which type of block is lost: an original data block, a transverse parity block or a longitudinal parity block.

C2. It is determined whether it is a single node failure.

C2-1. Read all blocks after the failure, compare them with the blocks and identifications recorded in the file, and judge which type of node has failed: a data node or a parity check node.
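The comparison in steps C1 and C2 amounts to a set difference between the recorded blocks and the blocks that can still be read. The following sketch assumes a hypothetical flat list of records with "id", "type" and "node" fields; the function name and return values are illustrative only.

```python
def classify_fault(records, surviving_ids):
    """Classify a failure by comparing recorded blocks with readable ones."""
    lost = [r for r in records if r["id"] not in surviving_ids]
    if not lost:
        return ("none", None)
    if len(lost) == 1:
        # Step C1: a single block is missing; report its type.
        return ("single_block", lost[0]["type"])
    lost_nodes = {r["node"] for r in lost}
    if len(lost_nodes) == 1:
        node = next(iter(lost_nodes))
        blocks_on_node = [r for r in records if r["node"] == node]
        if len(blocks_on_node) == len(lost):
            # Step C2: every block of exactly one node is missing.
            is_parity = lost[0]["type"] == "transverse_parity"
            return ("single_node", "parity_node" if is_parity else "data_node")
    return ("multiple", None)


records = [
    {"id": "d0", "type": "original", "node": "data-0"},
    {"id": "d1", "type": "original", "node": "data-0"},
    {"id": "v0", "type": "longitudinal_parity", "node": "data-0"},
    {"id": "h0", "type": "transverse_parity", "node": "parity-0"},
    {"id": "h1", "type": "transverse_parity", "node": "parity-0"},
]
print(classify_fault(records, {"d0", "v0", "h0", "h1"}))
```

Failures that fit neither case are reported as "multiple" here; the source describes only the single-block and single-node cases, so their handling is left open.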

D1. Recover the lost single data block according to the judgment result of step C.

D1-1. If an original data block is lost, it only needs to be decoded using the longitudinal parity block on the same node, and no data needs to be transmitted.

D1-2. If a longitudinal parity block is lost, the original data blocks are encoded with the longitudinal RS (n', k') to regenerate the parity block, which is stored on the current node without transmitting data.

D1-3. If a transverse parity block is lost, the original data blocks on the same stripe as the lost parity block are encoded with the transverse RS (n, k) to regenerate the transverse parity block, which is stored on the parity check node.

D2. Recover the single failed node according to the judgment result of step C.

D2-1. If a data node has failed, first use the transverse parity blocks on the parity check nodes to decode and recover the stripe containing the longitudinal parity block, then decode and recover any n-1 original data blocks, and finally decode and recover the one remaining original data block using the longitudinal parity block that was recovered first.

D2-2. If a parity check node has failed, the transverse parity block of each stripe can be recovered by re-running the encoding calculation.
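The choice of recovery mode in step D can be written as a lookup from fault type to an ordered list of operations. This is a minimal sketch with hypothetical names (Step, recovery_plan); the cross_node flag encodes the reading that longitudinal operations use only blocks on the same node while transverse operations read blocks from other nodes.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    action: str       # "encode" or "decode"
    code: str         # "longitudinal" or "transverse"
    cross_node: bool  # True if blocks must be read from other nodes


def recovery_plan(fault_kind: str) -> List[Step]:
    """Map a fault type from step C to the operations described in step D."""
    plans = {
        # D1-1: decode locally with the longitudinal parity block.
        "original": [Step("decode", "longitudinal", False)],
        # D1-2: re-encode the original blocks held on the same node.
        "longitudinal_parity": [Step("encode", "longitudinal", False)],
        # D1-3: re-encode the stripe and store the result on the parity node.
        "transverse_parity": [Step("encode", "transverse", True)],
        # D2-1: parity row first, then all but one original, then the last one.
        "data_node": [
            Step("decode", "transverse", True),
            Step("decode", "transverse", True),
            Step("decode", "longitudinal", False),
        ],
        # D2-2: recompute the transverse parity of every stripe.
        "parity_node": [Step("encode", "transverse", True)],
    }
    if fault_kind not in plans:
        raise ValueError(f"unknown fault kind: {fault_kind}")
    return plans[fault_kind]


for kind in ("original", "data_node"):
    traffic = any(step.cross_node for step in recovery_plan(kind))
    print(kind, "needs cross-node traffic:", traffic)
```

Keeping the plan as data makes it easy to see which fault types can be repaired entirely inside one node.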

The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and are not repeated here. Likewise, the technical details mentioned in this embodiment can also be applied to the above embodiment.

The invention reads the id and position of each original data block in the distributed file storage system, generates the transverse parity check blocks and longitudinal parity check blocks, marks the blocks with the 3 types and records them in a file. When the distributed cluster fails, the fault type is judged from the file, and encoding or decoding is performed to recover the lost data block or node. The layout and recovery method provided by the invention reduce the amount of data transmitted and the cross-rack and cross-node bandwidth occupied during recovery. As a result, the recovery efficiency and reliability of the whole system are improved, and the method has good market prospects and application value.

Claims

1. An erasure code coding layout method based on a distributed storage system is characterized by comprising the following steps:

2. The distributed storage system based erasure code coding layout method of claim 1, further comprising: step 4, when a data node fails, decoding with the transverse check blocks of the check nodes to recover the stripe where the longitudinal check block is located, and then decoding to recover the original data blocks, until the last remaining original data block is decoded and recovered using the recovered longitudinal check block.

3. The distributed storage system based erasure code coding layout method of claim 1 or 2, further comprising: step 5, when a check node fails, performing transverse erasure coding on each original data block in the data nodes to recover the failed check node.

4. The distributed storage system-based erasure code coding layout method of claim 1, wherein the vertical and horizontal erasure codes belong to parity codes, and the number of horizontal parity check blocks is greater than the number of vertical parity check blocks.

5. An erasure code coding layout system based on a distributed storage system, comprising:

6. The distributed storage system-based erasure code coding layout system of claim 5, wherein the recovery module is further configured to, in case of a data node failure, decode the horizontal parity chunks of the parity nodes to recover the stripe where the vertical parity chunks are located, and then decode and recover the original data chunks until the last remaining original data chunk is recovered by decoding the recovered vertical parity chunk.

7. The distributed storage system-based erasure coding layout system of claim 5 or 6, wherein the recovery module is further configured to perform lateral erasure coding on each original data block in the data node when the check node fails, so as to recover the failed check node.

8. The distributed storage system-based erasure code coding layout system of claim 5, wherein the vertical and horizontal erasure codes are parity codes, and the number of horizontal parity blocks is greater than the number of vertical parity blocks.

9. A storage medium storing a program for executing the erasure code coding layout method based on the distributed storage system of any one of claims 1 through 4.

10. A client for use in an erasure coding layout system based on a distributed storage system as claimed in any one of claims 6 to 8.