WO2020010504A1 - Data repair method for a distributed storage system, and storage medium - Google Patents

Data repair method for a distributed storage system, and storage medium

Info

Publication number
WO2020010504A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
nodes
node
auxiliary
repair
Prior art date
Application number
PCT/CN2018/095084
Other languages
English (en)
French (fr)
Inventor
张婧垚
Original Assignee
深圳花儿数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳花儿数据技术有限公司
Priority to PCT/CN2018/095084 (published as WO2020010504A1, zh)
Priority to CN201880005531.6A (published as CN110168505B, zh)
Priority to US16/973,029 (published as US11500725B2, en)
Publication of WO2020010504A1

Classifications

    • G06F11/1088 Reconstruction on already foreseen single or plurality of spare disks
    • G06F11/2094 Redundant storage or storage space
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0793 Remedial or corrective actions
    • G06F11/1004 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's, to protect a block of data words, e.g. CRC or checksum
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored, where the computing system component is a storage system, e.g. DASD based or network based
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • H03M13/373 Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35, with erasure correction and erasure determination, e.g. for packet loss recovery or setting of erasures for the decoding of Reed-Solomon codes
    • H03M13/616 Matrix operations, especially for generator matrices or check matrices, e.g. column or row permutations
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F3/065 Replication mechanisms
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Definitions

  • The present invention relates to the field of distributed storage, and in particular to a data repair method for product-matrix minimum storage regenerating (PM MSR) codes.
  • Nodes that provide auxiliary data are called "auxiliary nodes" (Assistant Nodes).
  • In a relatively busy storage system, the bandwidth required for data repair, including the transmission of auxiliary data, is referred to as the "repair bandwidth" and has a crucial impact on the performance of the entire system. Minimizing the repair bandwidth is therefore a necessary consideration when designing and optimizing a distributed storage system.
  • Regenerating codes, which are based on the idea of network coding, achieve exactly this goal.
  • The minimum storage regenerating (MSR) code, a class of regenerating code that minimizes the repair bandwidth while also maximizing space utilization, is of great value.
  • The MSR code built on a product matrix (PM) has a simple and elegant construction and is an important minimum storage regenerating code.
  • The original PM MSR code only provides a repair method for the case of a single failed node.
  • In practice, multiple nodes often fail at the same time.
  • One solution is to repair the nodes one after another, as in the single-failure case; this is simple and direct but wastes bandwidth.
  • Another idea is to repair multiple failed nodes jointly, minimizing the overall bandwidth required.
  • A further advantage of joint repair is that repair need not start as soon as a node fails: the system can wait until the number of failed nodes reaches a threshold, or until degraded reads have continued for a specified time, which further reduces the repair overhead.
  • Another class of methods for repairing multiple data blocks simultaneously uses inter-node cooperation. These methods first have the new nodes download auxiliary data from the normal nodes and then exchange data among the new nodes. With a carefully designed code structure, the repair bandwidth can be reduced, even down to the theoretical lower bound. Although such methods can in theory minimize the repair bandwidth, they are very difficult to apply in practice: the "download first, then exchange" approach increases the time and the various overheads required for repair, needs complex protocols to control, and is error-prone.
  • Data repair methods can be divided into two categories: centralized and distributed.
  • The former usually designates a proxy node, which collects the auxiliary data, reconstructs the lost data, and finally transmits the repaired data to the new nodes.
  • The latter is the opposite: there is no proxy node.
  • The new nodes themselves collect and reconstruct the data. The two can also be combined, as in the inter-node cooperation strategy mentioned above, but at the cost of high complexity.
  • Most existing work adopts centralized repair and considers only the bandwidth needed to collect the auxiliary data. In practice, uploading the reconstructed data to the new nodes also consumes bandwidth, which existing methods do not take into account.
  • This application provides a data repair method for a distributed storage system that uses product-matrix minimum storage regenerating codes.
  • The technical problem to be solved is to provide a data repair method for a distributed storage system that can repair multiple failed nodes simultaneously.
  • The data repair method is used in the data repair process when nodes of a distributed storage system fail.
  • The distributed storage system stores its data using an encoding method for distributed storage systems.
  • This application provides two data repair methods for the product-matrix minimum storage regenerating (PM MSR) code: centralized and distributed.
  • PM MSR: product-matrix minimum storage regenerating code.
  • In the centralized mode, the auxiliary nodes are selected according to the number of failed nodes, and the collection of auxiliary data sub-blocks, the computation of the repair matrix, and the reconstruction of the lost data blocks are performed by the central proxy node.
  • In the distributed mode, the auxiliary nodes are likewise selected according to the number of failed nodes; each new node reconstructs the data blocks stored on the failed node it replaces, and the collection of auxiliary data sub-blocks, the computation of the repair matrix, and the reconstruction of the lost data blocks are performed by that new node.
  • The data repair process includes the following steps:
  • Step 1: Select auxiliary nodes from the normal nodes, have the auxiliary nodes compute auxiliary data sub-blocks, and send the sub-blocks to the node that reconstructs the data.
  • The number of auxiliary nodes selected depends on the number of failed nodes to be repaired.
  • Step 2: Compute the repair matrix from the PM MSR code and the information about the missing data.
  • Step 3: Multiply the repair matrix by the auxiliary data sub-blocks to obtain the missing data blocks, completing the reconstruction.
  • In the distributed mode, for example, a new node may reconstruct not only the data blocks stored on the failed node that it replaces, but also data blocks needed by other new nodes.
  • In that case the new node can send the extra reconstructed data blocks to the other new nodes that need them, so that those nodes do not have to rebuild the blocks themselves. This can further reduce repair bandwidth and computational overhead.
  • The beneficial effect of the present invention is that the data repair method for distributed storage systems using product-matrix minimum storage regenerating codes can reconstruct one or more data blocks at the same time, overcomes the difficulty of attaining the theoretical lower bound on the minimum repair bandwidth, requires no data exchange between new nodes, and can repair data for any number of failed nodes and arbitrary coding parameters.
  • FIG. 1 is a flowchart of a data block repair calculation process
  • FIG. 2 is a system model diagram of data repair in a centralized repair mode
  • FIG. 3 is a system model diagram of data repair in a distributed repair mode
  • FIG. 4 is a calculation process diagram of a left sub-matrix of a repair matrix
  • FIG. 5 is a calculation process diagram of a right sub-matrix of the repair matrix
  • FIG. 6 is a flowchart of a data repair algorithm in a centralized repair mode
  • FIG. 7 is a flowchart of a data repair algorithm in a distributed repair mode.
  • In this embodiment, a data repair method for a distributed storage system using product-matrix minimum storage regenerating (PM MSR) codes is proposed.
  • The basic idea is to construct a repair matrix that applies to any number of failed nodes and arbitrary coding parameters, and to obtain the missing data blocks simply by multiplying the auxiliary data by the repair matrix.
  • This embodiment provides two data repair methods for PM MSR codes: centralized and distributed.
  • FIG. 2 and FIG. 3 show the system models for joint repair of data blocks in the centralized and distributed repair modes, respectively.
  • In a distributed storage system, let the total number of storage nodes be $n$, numbered $N_1, N_2, \ldots, N_n$, and let the number of failed nodes be $t$.
  • Without loss of generality, assume that the first $t$ nodes $N_1, N_2, \ldots, N_t$ fail.
  • In the centralized repair mode, a central proxy collects the auxiliary data, reconstructs the lost data, and then sends the reconstructed data to the $t$ new nodes.
  • In the distributed repair mode, the $t$ new nodes each independently collect auxiliary data and reconstruct lost data, and each new node only needs to reconstruct the data lost on the failed node that it replaces.
  • A systematic node is a node that stores an uncoded data block.
  • A parity (check) node is a node that stores an encoded data block.
  • $d$ is the "repair degree": the number of auxiliary nodes needed to repair the data when exactly one node has failed.
  • Data of length $B$ is first divided equally into $k$ segments and then encoded into $n$ data blocks, denoted $b_1, b_2, \ldots, b_n$, with block numbers $1, 2, \ldots, n$.
  • Each data block $b_i$ contains $\alpha$ sub-blocks $s_{i1}, s_{i2}, \ldots, s_{i\alpha}$.
  • PM MSR codes require $d \ge 2k-2$, so $n \ge d+1 \ge 2k-1$ and $\alpha \ge k-1$.
  • The PM MSR code is constructed by the following formula: $C = \Psi M = [\Phi \;\; \Lambda\Phi]\, M$
  • $M$ is the $d' \times \alpha$ message matrix shown below: $M = \begin{bmatrix} S_1 \\ S_2 \end{bmatrix}$
  • $S_1$ and $S_2$ are both $\alpha \times \alpha$ symmetric matrices; the $(\alpha+1)\alpha/2$ elements in the upper-triangular part of each are filled with distinct data sub-blocks.
  • $\Psi = [\Phi \;\; \Lambda\Phi]$ is the $n' \times d'$ encoding matrix, and the $i$-th row of $\Phi$ is denoted $\phi_i$.
  • $\Lambda$ is an $n' \times n'$ diagonal matrix as shown below: $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{n'})$
  • $C$ is the $n' \times \alpha$ matrix obtained after encoding; each element is an encoded data sub-block.
  • When a node fails, $\delta$ virtual nodes are first added to the system before the data is repaired. The data on these virtual nodes is all zeros.
  • Let $\{N_i \mid 1 \le i \le n'\}$ be the set consisting of the virtual nodes and the real nodes, where $N_1$ to $N_\delta$ are virtual nodes and $N_{\delta+1}$ to $N_{n'}$ are real nodes. Therefore, it can be considered that $\{c_i \mid \ldots\}$ ...
  • A "node" in this embodiment can be physical or logical: multiple logical nodes can reside on one or more physical machines without affecting the effectiveness of the invention.
  • The virtual nodes mentioned in the present invention do not really exist. They are a conceptual device used for theoretical reasoning and are introduced to make the method easier for those skilled in the art to understand.
  • The set $X = \{x_i \mid i = 1, \ldots, t,\ \delta < x_i \le n'\}$, composed of the numbers of all missing data blocks, is defined as the loss list, and $c_{x_1}, \ldots, c_{x_t}$ denote the missing data blocks. Without loss of generality, the $x_i$ are assumed to be arranged in ascending order in $X$. Counting the virtual nodes, there are $n' - t$ normal nodes in total, of which $d' - t + 1$ are selected as auxiliary nodes to help repair the data. The set of auxiliary nodes is denoted $N_a$. The auxiliary nodes and the failed nodes are together called "relevant nodes" and are represented by the following set:
  • $N_e = \{N_j \mid j \in X \text{ or } N_j \in N_a\}$
  • When repairing data, each auxiliary node $N_j \in N_a$ first computes the inner product of its own stored data block $c_j$ (regarded as the vector of sub-blocks $[c_{j1}, c_{j2}, \ldots, c_{j\alpha}]$) with each $\phi_{x_i}$, $x_i \in X$, obtaining $t$ encoded sub-blocks as given by formula (10).
  • The $d' - t + 1$ auxiliary nodes are selected from the normal nodes mentioned above.
  • This presupposes that all $t$ failed nodes are repaired at the same time.
  • The nodes to be repaired can also be chosen according to actual conditions, in which case the number of auxiliary nodes selected changes; the details are given below for the centralized and distributed repair modes.
  • The node that reconstructs the data recovers the lost data blocks by means of the "repair matrix".
  • The repair matrix is obtained as follows.
  • In formula (13), one matrix is the sub-matrix formed by the rows that correspond to the other relevant nodes, that is, what remains of the matrix after the rows belonging to the node being repaired are removed; the associated index gives the sequence numbers of the columns that correspond to the auxiliary data sub-blocks.
  • The calculation of the above matrices is shown in FIG. 4, the calculation process of the left sub-matrix.
  • The calculation method is the same as in formula (12); the specific process is shown in FIG. 5, the calculation process of the right sub-matrix.
  • The auxiliary data vector consists of the auxiliary encoded sub-blocks that each auxiliary node obtains by taking inner products with its own data block.
  • These vectors have length $d' - t + 1$. In fact, their first $\delta$ elements are all 0 (they are supplied by the virtual nodes), and accordingly the first $\delta$ columns of each corresponding matrix are also unused.
  • The repair matrix is calculated by formula (18).
  • The repair matrix given by formula (18) is applicable to any $t < \min\{k, \alpha\}$.
  • Otherwise, the lost data can be reconstructed by decoding: select $k$ nodes from the normal nodes as auxiliary nodes, download a total of $k\alpha$ data sub-blocks from them, decode these sub-blocks to obtain the original data, and then re-encode the original data into the missing data blocks.
  • If the matrix constructed by formula (17) is not invertible, the repair matrix cannot be calculated by formula (18).
  • In that case, the data blocks that are actually repaired include not only the blocks that were really lost, but also blocks on nodes newly added to, or substituted into, the loss list.
  • The calculation of the repair matrix is shown in steps 102 to 106 of FIG. 1; the specific operation of each step is as follows:
  • Step 1: For each $x_i \in X$, calculate the corresponding matrices according to formulas (9) and (13);
  • Step 2: For each pair $x_i, x_j \in X$ with $j \ne i$, calculate the sub-matrix indexed by $(i, j)$ according to FIG. 4 and formula (11);
  • Step 3: From the results of Step 2, construct the matrix of formula (17);
  • Step 5: If the matrix from Step 3 is invertible, use formula (18) to multiply its inverse with the generalized diagonal matrix obtained in Step 4, giving the repair matrix $R_X$.
  • Steps 2 and 3 can be executed either before or after Step 4.
  • Step 600 determines whether $t < \min\{k, \alpha\}$ holds.
  • Step 611: Select $k$ nodes from the normal nodes as auxiliary nodes. The central proxy node sends a request to the auxiliary nodes, asking each to send its stored data block to the central proxy node;
  • Step 612: The central proxy node waits until all $k\alpha$ auxiliary data sub-blocks have been received;
  • Step 613: The central proxy node decodes the received data sub-blocks to obtain the original data;
  • Step 614: The central proxy node reconstructs the missing data blocks by re-encoding and sends the reconstructed data blocks to the $t$ new nodes.
  • Step 621: Calculate the matrix of formula (17) according to steps 102 to 104 in FIG. 1;
  • Step 622: If this matrix is not invertible, choose one of the three operations a) to c); otherwise proceed to the next step;
  • Return to step 611 and reconstruct the data by decoding;
  • Step 623: As shown in step 101 of FIG. 1, select $d - t + 1$ nodes from the normal nodes as auxiliary nodes. The central proxy node sends a request to the auxiliary nodes, asking each to calculate and provide $t$ auxiliary data sub-blocks according to formula (10);
  • Step 624: The central proxy node waits until all $t(d - t + 1)$ auxiliary data sub-blocks have been received;
  • Step 625: Calculate the repair matrix $R_X$ according to step 106 in FIG. 1;
  • Step 626: Rearrange the received auxiliary data sub-blocks according to formula (16) so that they correspond to the generalized diagonal matrix in formula (18);
  • Step 627: As shown in step 107 of FIG. 1, left-multiply the vector of auxiliary data sub-blocks rearranged in step 626 by the repair matrix $R_X$ to reconstruct the missing data blocks, and send the reconstructed data to the $t$ new nodes.
  • In the distributed repair mode, each new node only needs to rebuild the data blocks stored on the failed node it replaces. If $t \le n - d$, a new node can select $d$ normal nodes as auxiliary nodes, obtain one encoded auxiliary data sub-block from each, and reconstruct its lost data block; the total repair bandwidth is then $td$ sub-blocks. If $t > n - d$, a new node cannot rebuild only its own lost data block, because there are not enough auxiliary nodes. In that case each new node must rebuild at least $t - (n - d) + 1$ data blocks at once, and the number of auxiliary nodes required is $n - t$.
  • In step 700, the value of $t$ is checked.
  • Step 711: Select $k$ nodes from the normal nodes as auxiliary nodes. The new node sends a request to the auxiliary nodes, asking each to send its stored data block to the new node;
  • Step 712: The new node waits until all $k\alpha$ data sub-blocks have been received;
  • Step 713: Decode the received auxiliary data sub-blocks to obtain the original data;
  • Step 714: Rebuild the missing data block by re-encoding, and return.
  • Step 721: Select $d$ nodes from the normal nodes as auxiliary nodes. The new node sends a request to the auxiliary nodes, asking each to calculate and provide one auxiliary data sub-block according to formula (10);
  • Step 722: The new node waits until all $d$ auxiliary data sub-blocks have been received;
  • Step 723: Calculate the repair matrix for $x_i$ as shown in step 106 of FIG. 1, where $x_i$ is the number of the new node;
  • Step 724: Rearrange the received auxiliary data sub-blocks according to formula (16) so that they correspond to the generalized diagonal matrix in formula (18);
  • Step 725: Left-multiply the vector of auxiliary data sub-blocks rearranged in step 724 by the repair matrix to reconstruct the missing data block, and return.
  • The loss list is $X = \{x_i, y_1, \ldots, y_u\}$, where $x_i$ is the number of the new node;
  • Step 732: Calculate the matrix of formula (17) according to the steps shown in FIG. 1;
  • Step 733: If this matrix is not invertible, choose one of the three operations a) to c); otherwise perform the next step;
  • Return to step 1.1 and reconstruct the data by decoding;
  • Step 734: Select all $n - t$ normal nodes as auxiliary nodes. The new node sends a request to the auxiliary nodes, asking each to calculate and provide $u + 1$ auxiliary data sub-blocks according to formula (10);
  • Step 735: The new node waits until all $(n - t)(u + 1)$ auxiliary data sub-blocks have been received;
  • Step 736: Calculate the repair matrix $R_X$ according to step 106 in FIG. 1;
  • Step 737: Rearrange the received auxiliary data sub-blocks according to formula (16) so that they correspond to the generalized diagonal matrix in formula (18);
  • Step 738: Left-multiply the vector of auxiliary data sub-blocks rearranged in step 737 by the repair matrix $R_X$ to reconstruct the missing data blocks, and return.
  • In the distributed repair mode, when $t > n - d$, a new node reconstructs not only the data blocks stored on the failed node that it replaces, but possibly also data blocks needed by other new nodes. The new node can then send the extra reconstructed data blocks to the other new nodes that need them, so that those nodes do not have to rebuild the blocks themselves. This can further reduce repair bandwidth and computing overhead, but it requires coordination between nodes and may increase the repair time; it already falls into the category of cooperative repair. In practice, the trade-off should be made according to the system's performance needs.
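The two distributed-mode cases described above ($t \le n - d$ and $t > n - d$) can be sketched as a small planning function. This is only an illustration: the names `RepairPlan` and `distributed_plan` are hypothetical, the fallback to decoding and the case of a non-invertible matrix are deliberately left out, and the second branch assumes each new node rebuilds exactly the minimum $t - (n - d) + 1$ blocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairPlan:
    helpers: int               # auxiliary nodes contacted by one new node
    blocks_rebuilt: int        # data blocks that this new node reconstructs
    subblocks_per_helper: int  # sub-blocks requested from each auxiliary node

    @property
    def download(self) -> int:
        """Sub-blocks one new node downloads in total."""
        return self.helpers * self.subblocks_per_helper


def distributed_plan(n: int, d: int, t: int) -> RepairPlan:
    """Plan for one new node, given n nodes, repair degree d, t failures."""
    if not (0 < t < n and d < n):
        raise ValueError("expected 0 < t < n and d < n")
    if t <= n - d:
        # Enough survivors: d helpers, one sub-block each, one block rebuilt.
        return RepairPlan(helpers=d, blocks_rebuilt=1, subblocks_per_helper=1)
    # Too few survivors: use all n - t of them and rebuild several blocks
    # jointly (assumption: exactly the minimum number stated in the text).
    blocks = t - (n - d) + 1
    return RepairPlan(helpers=n - t, blocks_rebuilt=blocks,
                      subblocks_per_helper=blocks)


if __name__ == "__main__":
    for t in (1, 2, 3):
        print(t, distributed_plan(n=7, d=5, t=t))
```

With the illustrative values $n = 7$, $d = 5$, the first branch applies for $t \le 2$ and the second from $t = 3$ onward, where each new node contacts all four surviving nodes.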

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Hardware Redundancy (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

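A data repair method for a distributed storage system, that is, a method for simultaneously recovering multiple failed nodes with the minimum feasible bandwidth when nodes of a distributed storage system fail. Auxiliary nodes are selected and used to compute auxiliary data sub-blocks; a repair matrix is then computed (106), and the lost data blocks are reconstructed by multiplying the repair matrix by the auxiliary data sub-blocks (107), or alternatively the lost data blocks are reconstructed by decoding. The method applies to data repair with any number of failed nodes and arbitrary coding parameters; it attains the theoretical lower bound on the minimum repair bandwidth, is highly flexible, and is easy to implement.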

Description

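Data repair method for a distributed storage system, and storage medium
Technical Field
The present invention relates to the field of distributed storage, and in particular to a data repair method for product-matrix minimum storage regenerating codes.
Background
In recent years, with the rapid development of emerging technologies such as the Internet, big data, and cloud computing, the demand for mass data storage has surged, making distributed data storage systems increasingly the mainstream of storage technology. Compared with single-node storage systems, distributed storage systems offer higher reliability, availability, and scalability. In a distributed storage system, one or several nodes may fail. In that case, redundancy techniques such as replication or erasure codes ensure that the service of the whole system is not interrupted. Compared with simple replication, erasure codes have relatively high space utilization and are therefore widely used in industry; distributed storage systems such as Microsoft's Azure and OceanStore, for example, use erasure codes. Among erasure codes, maximum distance separable (MDS) codes are especially valued by industry because they maximize space utilization.
When a node fails, a distributed storage system should recover the lost data promptly, both to maintain the system's redundancy and to avoid prolonged "degraded reads", in which the required data is computed from several non-original data fragments. This is usually done by replacing or repairing the failed node and performing the data recovery on the new node. Because the new nodes initially know nothing about the lost data, they must obtain "auxiliary data" (Helper Data) from other nodes that have not failed; the nodes that provide this data are called "auxiliary nodes" (Assistant Nodes). In a relatively busy storage system, the bandwidth required for data repair, including the transmission of auxiliary data, is referred to as the "repair bandwidth" and has a crucial impact on the performance of the entire system. Minimizing the repair bandwidth is therefore a necessary consideration when designing and optimizing a distributed storage system. Regenerating codes, which are based on the idea of network coding, achieve exactly this goal. Among them, the minimum storage regenerating (MSR) code, a class of regenerating code that minimizes the repair bandwidth while also maximizing space utilization, is of great value.
The MSR code built on a product matrix (PM) has a simple and elegant construction and is an important minimum storage regenerating code. Unfortunately, the original PM MSR code only provides a repair method for the case of a single failed node, whereas in real deployments multiple nodes often fail at the same time. One solution is to repair the nodes one after another, as in the single-failure case; this is simple and direct but wastes bandwidth. Another idea is to repair multiple failed nodes jointly, minimizing the overall bandwidth required. A further advantage of joint repair is that repair need not start as soon as a node fails: the system can wait until the number of failed nodes reaches a threshold, or until degraded reads have continued for a specified time, which further reduces the repair overhead.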
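To make the bandwidth argument concrete, the following minimal sketch counts auxiliary sub-blocks for the two strategies. It assumes the figures used later in the embodiment: a single-node repair downloads one sub-block from each of $d$ auxiliary nodes, and a joint centralized repair of $t$ nodes downloads $t$ sub-blocks from each of $d - t + 1$ auxiliary nodes. The function names are illustrative, and the cost of forwarding rebuilt blocks to the new nodes is not counted.

```python
def sequential_subblocks(t: int, d: int) -> int:
    """Sub-blocks downloaded when t failed nodes are repaired one at a time."""
    return t * d


def joint_subblocks(t: int, d: int) -> int:
    """Sub-blocks downloaded when t failed nodes are repaired jointly."""
    if not 1 <= t <= d:
        raise ValueError("expected 1 <= t <= d")
    return t * (d - t + 1)


if __name__ == "__main__":
    d = 5  # repair degree (illustrative value)
    for t in range(1, 4):
        seq, joint = sequential_subblocks(t, d), joint_subblocks(t, d)
        print(f"t={t}: one-by-one={seq}, joint={joint}, saved={seq - joint}")
```

Under these assumptions the saving is $t(t-1)$ sub-blocks, so it grows quadratically with the number of nodes repaired together.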
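Some existing studies have considered this problem from several angles and proposed a method based on "virtual symbols" that can repair multiple failed nodes simultaneously, and have also proved a lower bound on the minimum bandwidth required for simultaneous repair. However, this method usually requires solving complex systems of equations; its computational overhead is large and it is difficult to implement. Other researchers have proposed proxy-assisted minimum storage regenerating codes, with which auxiliary nodes need not encode their data before transmitting auxiliary data during repair. However, such codes can handle only one or two failed nodes. A joint repair scheme for an arbitrary number of failed nodes and arbitrary coding parameters has not yet been proposed.
Another class of methods for repairing multiple data blocks simultaneously uses inter-node cooperation. These methods first have the new nodes download auxiliary data from the normal nodes and then exchange data among the new nodes. With a carefully designed code structure, the repair bandwidth can be reduced, even down to the theoretical lower bound. Although such methods can in theory minimize the repair bandwidth, they are very difficult to apply in practice: the "download first, then exchange" approach increases the time and the various overheads required for repair, needs complex protocols to control, and is error-prone.
From an engineering point of view, data repair methods fall into two categories: centralized and distributed. The former usually designates a proxy node, which collects the auxiliary data, reconstructs the lost data, and finally transfers the repaired data to the new nodes. The latter, by contrast, has no proxy node; the new nodes themselves collect and reconstruct the data. The two can also be combined, as in the inter-node cooperation strategy mentioned above, but at the cost of higher complexity. The vast majority of existing work adopts centralized repair and considers only the bandwidth needed to collect the auxiliary data. In practice, uploading the reconstructed data to the new nodes also consumes bandwidth, which existing methods do not take into account. Some studies have examined how network topology affects repair bandwidth and repair strategy, but exact mathematical expressions have been obtained only for a few simple topologies such as linear cascades, stars, and grids.
In addition, existing data repair strategies use as many auxiliary nodes as possible when reconstructing data, in order to reduce the repair bandwidth. It has also been proved theoretically that the required bandwidth is smallest when all normal nodes other than the failed ones (the "surviving nodes") take part in the reconstruction as auxiliary nodes. In a real system, however, using more auxiliary nodes means greater latency and overhead, so when several aspects of performance are weighed together, more auxiliary nodes are not always better.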
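Summary of the Invention
This application provides a data repair method for a distributed storage system that uses product-matrix minimum storage regenerating codes. The technical problem to be solved is to provide a data repair method for a distributed storage system that can repair multiple failed nodes of the system simultaneously.
A data repair method for a distributed storage system is used in the data repair process when nodes of the distributed storage system fail; the distributed storage system stores its data using an encoding method for distributed storage systems. This application provides two data repair methods for the product-matrix minimum storage regenerating (PM MSR) code: centralized and distributed. In the centralized mode, the auxiliary nodes are selected according to the number of failed nodes, and the collection of auxiliary data sub-blocks, the computation of the repair matrix, and the reconstruction of the lost data blocks are performed by the central proxy node. In the distributed mode, the auxiliary nodes are likewise selected according to the number of failed nodes; each new node reconstructs the data blocks stored on the failed node it replaces, and the collection of auxiliary data sub-blocks, the computation of the repair matrix, and the reconstruction of the lost data blocks are performed by that new node.
The data repair process includes the following steps:
Step 1: Select auxiliary nodes from the normal nodes, have the auxiliary nodes compute auxiliary data sub-blocks, and send the sub-blocks to the node that reconstructs the data; the number of auxiliary nodes selected depends on the number of failed nodes to be repaired;
Step 2: Compute the repair matrix from the PM MSR code and the information about the missing data;
Step 3: Multiply the repair matrix by the auxiliary data sub-blocks to obtain the missing data blocks, completing the reconstruction.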
在实际操作过程中,为了得到修复矩阵,需要向丢失列表中增加未包含丢失数据块的节点信息,或用未包含丢失数据块的节点信息替换原丢失列表中的节点信息。例如在分布式模式下,一个新节点不仅可以重建自己所替换的失效节点上存储的数据块,还可能同时重建其他新节点所需的数据块。这种情况下,该新节点可以选择将额外重建的数据块发送给需要的其他新节点,以免这些新节点自己去重建这些数据块。这样可以进一步降低修复带宽和计算开销,
本发明的有益效果是:通过使用乘积矩阵的最小存储再生码提供的分布式存储系统的数据修复方法,能够同时重建一个或多个数据块,克服了最小修复带宽理论下限难以实现的困难,同时不需要在新节点之间交换数据,能够在任意数量的失效节点和任意编码参数的情况下对数据进行修复。
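Brief Description of the Drawings
FIG. 1 is a flowchart of the computation for repairing data blocks;
FIG. 2 is a system model diagram of data repair in the centralized repair mode;
FIG. 3 is a system model diagram of data repair in the distributed repair mode;
FIG. 4 shows the computation of the left sub-matrix of the repair matrix;
FIG. 5 shows the computation of the right sub-matrix of the repair matrix;
FIG. 6 is a flowchart of the data repair algorithm in the centralized repair mode;
FIG. 7 is a flowchart of the data repair algorithm in the distributed repair mode.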
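Detailed Description
The present invention is described in further detail below through specific embodiments with reference to the drawings. In the following embodiments, many details are described so that this application can be better understood. Those skilled in the art will readily recognize, however, that some of these features may be omitted in different situations or replaced by other methods. In some cases, certain operations related to this application are not shown or described in the specification, so that the core of this application is not obscured by excessive description; a detailed description of these operations is unnecessary for those skilled in the art, who can fully understand them from the description in the specification and general technical knowledge in the field.
In addition, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. The steps or actions in the method descriptions may also be reordered or adjusted in ways that are obvious to those skilled in the art. The various orders in the specification and drawings therefore serve only to describe a particular embodiment clearly and do not imply a required order, unless it is stated otherwise that a certain order must be followed.
The embodiments of the present invention propose a data repair method for distributed storage systems that use product-matrix minimum storage regenerating (PM MSR) codes. The basic idea is to construct a repair matrix that is applicable to any number of failed nodes and any coding parameters, and to obtain the lost data blocks by simply multiplying the helper data by the repair matrix.
The embodiments provide two data repair methods for PM MSR codes, centralized and distributed. FIG. 2 and FIG. 3 show the system models for joint repair of data blocks in the centralized and distributed repair modes, respectively. In a distributed storage system, let the total number of storage nodes be n, numbered N_1, N_2, ..., N_n, and let the number of failed nodes be t. As shown in FIG. 2 and FIG. 3, assume without loss of generality that the first t nodes N_1, N_2, ..., N_t have failed. In the centralized repair mode, a central proxy collects the helper data, reconstructs the lost data, and then sends the reconstructed data to the t new nodes. In the distributed repair mode, the t new nodes independently collect the helper data and reconstruct the lost data themselves, and each new node only needs to reconstruct the data lost on the failed node it replaces.
For convenience, the symbols used in the present invention and their definitions are given first. A product-matrix minimum storage regenerating (PM MSR) code is denoted C(n,k,d), where n is the total number of nodes and k is the number of "systematic nodes", so that m=n-k is the number of "parity nodes". A systematic node stores uncoded data blocks, and a parity node stores coded data blocks. d is the "repair degree", that is, the number of assistant nodes required to repair the data when only one node has failed.
Let the total amount of data to be stored be B. As required by the PM MSR code, encoding first divides the data of length B into k equal segments and then encodes them into n data blocks, denoted b_1, b_2, ..., b_n, with block numbers 1, 2, ..., n. Each data block b_i contains α sub-blocks s_{i1}, s_{i2}, ..., s_{iα}. According to the theory of MSR codes, the following two equations hold: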
α=d-k+1         (1)
B=kα=k(d-k+1)            (2)
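The PM MSR code requires d≥2k-2, so n≥d+1≥2k-1 and α≥k-1.
For this general case, the data of total amount B is encoded into n' data blocks based on an extended PM MSR code C(n',k',d'), where n'=n+δ, k'=k+δ, and d'=d+δ. Here δ is the number of virtual nodes added by the extension and is chosen so that d'=2k'-2, which gives: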
δ=d-(2k-2)            (3)
and
d'=2(k+δ)-2=2(d-k+1)=2α         (4)
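Equations (1) to (4) can be checked with a few lines of code. The following is a minimal sketch only; the names `PmMsrParams` and `pm_msr_params` are hypothetical and not part of the application, and the validity checks simply restate the constraints d≥2k-2 and n≥d+1 given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PmMsrParams:
    n: int       # real storage nodes
    k: int       # systematic nodes
    d: int       # repair degree
    alpha: int   # sub-blocks per data block, equation (1)
    B: int       # total number of data sub-blocks, equation (2)
    delta: int   # virtual nodes added by the extension, equation (3)
    n_ext: int   # n' = n + delta
    k_ext: int   # k' = k + delta
    d_ext: int   # d' = d + delta = 2 * alpha, equation (4)


def pm_msr_params(n: int, k: int, d: int) -> PmMsrParams:
    """Derive the extended code C(n', k', d') from C(n, k, d)."""
    if d < 2 * k - 2:
        raise ValueError("PM MSR requires d >= 2k - 2")
    if n < d + 1:
        raise ValueError("PM MSR requires n >= d + 1")
    alpha = d - k + 1
    delta = d - (2 * k - 2)
    return PmMsrParams(n, k, d, alpha, k * alpha, delta,
                       n + delta, k + delta, d + delta)


if __name__ == "__main__":
    print(pm_msr_params(10, 4, 7))
```

For example, C(10,4,7) gives α=4, B=16, and δ=1, so the extended code is C(11,5,8) and d'=2k'-2=2α=8.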
The PM MSR code is constructed by the following formula:
C=ΨM               (5)
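where M is the d'×α message matrix given by:
Figure PCTCN2018095084-appb-000001
where S_1 and S_2 are both α×α symmetric matrices whose upper-triangular parts, each containing (α+1)α/2 elements, are filled with distinct data sub-blocks. Ψ=[Φ ΛΦ] is the n'×d' encoding matrix, and the i-th row of Φ is denoted φ_i. Λ is the n'×n' diagonal matrix shown below:
Figure PCTCN2018095084-appb-000002
C is the n'×α matrix obtained by encoding; each of its elements is an encoded data sub-block, and the i-th row of C is:
Figure PCTCN2018095084-appb-000003
where
Figure PCTCN2018095084-appb-000004
is the i-th row of Ψ. M is encoded in such a way that C is a systematic code: the first δ rows of C are all zero, rows δ+1 through δ+k are exactly the original data, and the remaining m=n'-k' rows are the encoded parity data. The advantage is that, when no systematic node has failed, the original data can be read without decoding. Note that all matrices and data mentioned in this embodiment are represented by elements of the finite field GF(q).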
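As a dimension check, the encoding can be written out explicitly. The block below is a sketch under two assumptions that are consistent with d'=2α and with the usual product-matrix construction, but are not confirmed by the equation images: that M stacks S_1 above S_2, and that Λ=diag(λ_1,…,λ_{n'}).

```latex
\begin{aligned}
M &= \begin{bmatrix} S_1 \\ S_2 \end{bmatrix} \in \mathrm{GF}(q)^{d' \times \alpha},
  \qquad d' = 2\alpha, \\
C &= \Psi M
   = \begin{bmatrix} \Phi & \Lambda\Phi \end{bmatrix}
     \begin{bmatrix} S_1 \\ S_2 \end{bmatrix}
   = \Phi S_1 + \Lambda \Phi S_2
   \in \mathrm{GF}(q)^{n' \times \alpha}, \\
c_i &= \psi_i M = \phi_i S_1 + \lambda_i \phi_i S_2,
  \qquad \psi_i = \begin{bmatrix} \phi_i & \lambda_i \phi_i \end{bmatrix}, \\
\#\{\text{free symbols in } M\} &= 2 \cdot \frac{(\alpha+1)\alpha}{2}
   = (\alpha+1)\alpha = k'\alpha,
  \qquad \text{since } d' = 2k'-2 = 2\alpha \Rightarrow k' = \alpha + 1 .
\end{aligned}
```

The last line shows that the two symmetric matrices hold exactly k'α symbols, which matches equation (2) applied to the extended code.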
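When nodes fail, δ virtual nodes are first added to the system before the data is repaired. The data on these virtual nodes is all zero. Let {N_i | 1≤i≤n'} denote the set of virtual and real nodes together, where N_1 to N_δ are virtual nodes and N_{δ+1} to N_{n'} are real nodes. The rows {c_i | 1≤i≤n'} can thus be regarded as stored on the n' nodes, respectively. Note that a "node" in this embodiment may be either physical or logical; that is, multiple logical nodes may reside on one or more physical machines without affecting the validity of the invention. Furthermore, the virtual nodes mentioned here do not really exist; they are a conceptual device introduced for the theoretical derivation and to help those skilled in the art understand the method.
Suppose a total of t≥1 nodes have failed. The set of the numbers of all lost data blocks, X={x_i | i=1,...,t, δ<x_i≤n'}, is defined as the loss list. Let
Figure PCTCN2018095084-appb-000005
denote the lost data blocks. Without loss of generality, assume that the x_i are arranged in ascending order in X. Counting the virtual nodes, there are currently n'-t healthy nodes in total, from which d'-t+1 nodes are selected as assistant nodes to help repair the data. The set of assistant nodes is denoted
Figure PCTCN2018095084-appb-000006
The union of the assistant nodes and the failed nodes is called the set of "relevant nodes" and is denoted by:
N_e = {N_j | N_j ∈ N_a or j ∈ X}          (9)
It follows from the above that the set N_e has d'+1 elements.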
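The node bookkeeping above can be sketched as follows. This is an illustration only: `select_assistants` is a hypothetical helper, and it assumes that all δ virtual nodes are always chosen as assistant nodes (which matches the later remark that the first δ helper entries are zero) and that the remaining d-t+1 assistants are simply the lowest-numbered healthy real nodes.

```python
def select_assistants(n: int, k: int, d: int, lost: list[int]):
    """Return (assistants, relevant) in extended numbering 1..n'.

    Assumption: nodes 1..delta are virtual and are always used as assistants;
    real assistants are picked in ascending order, which is arbitrary.
    """
    delta = d - (2 * k - 2)
    n_ext, d_ext = n + delta, d + delta
    t = len(lost)
    lost_set = set(lost)
    if len(lost_set) != t or any(not (delta < x <= n_ext) for x in lost):
        raise ValueError("lost indices must be distinct real nodes")
    virtual = list(range(1, delta + 1))
    healthy_real = [j for j in range(delta + 1, n_ext + 1) if j not in lost_set]
    need_real = (d_ext - t + 1) - delta          # equals d - t + 1
    if need_real > len(healthy_real):
        raise ValueError("not enough healthy nodes for joint repair")
    assistants = virtual + healthy_real[:need_real]
    relevant = sorted(set(assistants) | lost_set)  # the set N_e of formula (9)
    return assistants, relevant


if __name__ == "__main__":
    a, e = select_assistants(10, 4, 7, [2, 5])
    print(a, e, len(e))
```

With C(10,4,7) and X={2,5}, this selects d'-t+1=7 assistant nodes (one virtual, six real), and the relevant set has d'+1=9 elements.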
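To repair the data, each assistant node N_j∈N_a first computes the inner product of the data block c_j that it stores (regarded as the vector of sub-blocks [c_{j1}, c_{j2}, ..., c_{jα}]) with
Figure PCTCN2018095084-appb-000007
which gives t encoded sub-blocks:
Figure PCTCN2018095084-appb-000008
It then sends these encoded sub-blocks as helper data to the node that reconstructs the data, which thus obtains a total of t(d'-t+1) helper sub-blocks. Note that selecting d'-t+1 healthy nodes as assistant nodes, as described above, presupposes that the t failed nodes are repaired simultaneously. In practice, the nodes to be repaired may also be chosen according to the actual situation, in which case the number of assistant nodes selected changes; the details are described below for the centralized and distributed repair modes.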
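To make the helper computation concrete, the sketch below runs the classical single-failure (t=1) product-matrix repair, in which d'=2α assistant nodes each send one inner product. It is not the repair matrix of this application. Several choices are assumptions made for illustration: the prime field GF(257), a Vandermonde encoding matrix with φ_i=[1, x_i, …, x_i^{α-1}] and λ_i=x_i^α, δ=0, a non-systematic encoding, and the inner-product vector taken to be φ of the failed node.

```python
P = 257  # assumed prime field GF(257); the application only says GF(q)


def mat_mul(A, B):
    """Matrix product over GF(P)."""
    return [[sum(a * b for a, b in zip(row, col)) % P for col in zip(*B)]
            for row in A]


def mat_inv(A):
    """Gauss-Jordan inverse over GF(P); raises StopIteration if singular."""
    n = len(A)
    aug = [[v % P for v in row] + [int(i == j) for j in range(n)]
           for i, row in enumerate(A)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c])
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], P - 2, P)          # Fermat inverse
        aug[c] = [v * inv % P for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(vr - f * vc) % P for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def psi_row(x, alpha):
    """Row [phi, lambda*phi] of a Vandermonde Psi for node number x."""
    return [pow(x, e, P) for e in range(2 * alpha)]


def encode(S1, S2, n_nodes):
    """C = Psi * M with M = [S1; S2]; row i-1 is the block of node i."""
    alpha = len(S1)
    M = S1 + S2                                  # stack S1 above S2
    Psi = [psi_row(x, alpha) for x in range(1, n_nodes + 1)]
    return mat_mul(Psi, M)


def helper_symbol(c_j, failed, alpha):
    """Inner product of a stored block with phi of the failed node."""
    phi_f = psi_row(failed, alpha)[:alpha]
    return sum(a * b for a, b in zip(c_j, phi_f)) % P


def repair_single(failed, helpers, symbols, alpha):
    """Rebuild the failed block from 2*alpha helper symbols."""
    Psi_H = [psi_row(j, alpha) for j in helpers]
    v = [r[0] for r in mat_mul(mat_inv(Psi_H), [[s] for s in symbols])]
    lam = pow(failed, alpha, P)
    # v = [S1*phi^T ; S2*phi^T]; symmetry of S1, S2 gives c_f = v1 + lam * v2
    return [(v[a] + lam * v[alpha + a]) % P for a in range(alpha)]


if __name__ == "__main__":
    S1, S2 = [[1, 2], [2, 3]], [[4, 5], [5, 6]]  # symmetric, alpha = 2
    C = encode(S1, S2, 6)
    helpers = [1, 3, 4, 5]
    syms = [helper_symbol(C[j - 1], 2, 2) for j in helpers]
    print(repair_single(2, helpers, syms, 2) == C[1])
```

Each assistant node transmits a single field element instead of its whole block of α elements, which is where the bandwidth saving comes from.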
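In the present invention, the data-reconstructing node recovers the lost data blocks by means of a "repair matrix", which is obtained as follows.
First, for every x_i, x_j∈X with j≠i, compute the α×α matrix according to formula (11) below:
Figure PCTCN2018095084-appb-000009
where
Figure PCTCN2018095084-appb-000010
In formula (12),
Figure PCTCN2018095084-appb-000011
is the vector formed by the first α elements of
Figure PCTCN2018095084-appb-000012
that is, the first half of
Figure PCTCN2018095084-appb-000013
and
Figure PCTCN2018095084-appb-000014
is the vector formed by the last α elements of
Figure PCTCN2018095084-appb-000015
that is, the second half of
Figure PCTCN2018095084-appb-000016
Here
Figure PCTCN2018095084-appb-000017
denotes the columns of the matrix
Figure PCTCN2018095084-appb-000018
that correspond to the helper data sub-blocks
Figure PCTCN2018095084-appb-000019
and
Figure PCTCN2018095084-appb-000020
is the inverse of the matrix shown in formula (13) below:
Figure PCTCN2018095084-appb-000021
That is, in (13),
Figure PCTCN2018095084-appb-000022
is the sub-matrix formed by the rows of Ψ that correspond to the relevant nodes other than
Figure PCTCN2018095084-appb-000023
or, in other words,
Figure PCTCN2018095084-appb-000024
is what remains of the matrix Ψ after the rows related to node
Figure PCTCN2018095084-appb-000025
are removed. Because N_e has d'+1 elements and each row of Ψ has d' entries, this sub-matrix is square, of size d'×d'. Finally,
Figure PCTCN2018095084-appb-000026
is the index of the columns of the matrix
Figure PCTCN2018095084-appb-000027
that correspond to the helper data sub-blocks
Figure PCTCN2018095084-appb-000028
The computation of these matrices is shown in FIG. 4 (computation of the left sub-matrix).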
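Next, let X_i=X\x_i denote the ordered set of the numbers remaining after x_i is removed from the loss list X, and let
Figure PCTCN2018095084-appb-000029
denote the ordered set of the indices, in order, of those columns of the matrix
Figure PCTCN2018095084-appb-000030
that correspond to the data blocks
Figure PCTCN2018095084-appb-000031
For each i=1,...,t, compute according to formula (14) below:
Figure PCTCN2018095084-appb-000032
where
Figure PCTCN2018095084-appb-000033
is computed in the same way as in formula (12); the detailed procedure is shown in FIG. 5 (computation of the right sub-matrix).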
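Combining the computations above gives
Figure PCTCN2018095084-appb-000034
where
Figure PCTCN2018095084-appb-000035
is the vector formed by the helper coded sub-blocks, each obtained by an assistant node as the inner product of its stored data block with
Figure PCTCN2018095084-appb-000036
For each x_i, the vector
Figure PCTCN2018095084-appb-000037
has length d'-t+1. In fact, however, the first δ elements of
Figure PCTCN2018095084-appb-000038
are all zero (they are supplied by the virtual nodes), so the first δ columns of each Θ_i are correspondingly useless.
The matrix on the right-hand side of (15) can therefore be trimmed by using
Figure PCTCN2018095084-appb-000039
in place of
Figure PCTCN2018095084-appb-000040
where Θ'_i is what remains of Θ_i after its first δ columns are removed, and
Figure PCTCN2018095084-appb-000041
is what remains of
Figure PCTCN2018095084-appb-000042
after its first δ elements are removed.
Figure PCTCN2018095084-appb-000043
If the matrix Ξ_X is invertible, the repair matrix can be computed by formula (18) below:
Figure PCTCN2018095084-appb-000044
Once the repair matrix has been computed, left-multiplying the vector of helper data sub-blocks
Figure PCTCN2018095084-appb-000045
by R_X yields the lost data blocks.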
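The repair matrix given by formula (18) is applicable for any t<min{k,α}. For t≥min{k,α}, the lost data can be reconstructed by decoding: select any k healthy nodes as assistant nodes, download a total of k×α data sub-blocks from them, decode these sub-blocks to obtain the original data, and then encode the original data into the lost data blocks. For t>m, however, the lost data cannot be recovered, because m=n-k implies that the number of healthy nodes is then n-t<k.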
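The three regimes described above can be summarized in a small decision function. This is a sketch only; the function name and the returned strings are hypothetical labels, and the invertibility of Ξ_X is not checked here.

```python
def repair_strategy(n: int, k: int, d: int, t: int) -> str:
    """Classify a failure count t for the code C(n, k, d)."""
    if t < 1:
        raise ValueError("t must be at least 1")
    alpha = d - k + 1
    m = n - k                      # number of parity nodes
    if t > m:
        return "unrecoverable"     # fewer than k healthy nodes remain
    if t < min(k, alpha):
        return "repair_matrix"     # formula (18) applies
    return "decode"                # download k*alpha sub-blocks and decode


if __name__ == "__main__":
    for t in range(1, 8):
        print(t, repair_strategy(10, 4, 7, t))
```

For C(10,4,7), where α=4 and m=6, t=1 to 3 uses the repair matrix, t=4 to 6 falls back to decoding, and t=7 cannot be recovered.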
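In addition, if the matrix Ξ_X is not invertible, the repair matrix cannot be computed by formula (18). This problem can be solved in several ways, including but not limited to: 1) adding one or more nodes to X so that Ξ_X becomes invertible; 2) replacing one or more nodes in X with other nodes so that Ξ_X becomes invertible; 3) reconstructing the data by decoding. Because this situation occurs with very low probability, the impact on overall performance is negligible whichever solution is used. When method 1) and/or 2) is used, the data blocks actually repaired include not only the data blocks that were really lost but also the data blocks on the newly added or newly substituted nodes.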
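Method 1) can be sketched as a bounded search over enlarged loss lists. The code below is an assumption-laden illustration: `find_repairable_list` and the callback `xi_is_invertible` are hypothetical names, the callback stands in for building Ξ_X by formula (17) and testing its invertibility, and the search order (smallest enlargement first, ascending node numbers) is arbitrary. Method 2) is not covered; a result of `None` means that method 3), decoding, should be used.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence


def find_repairable_list(
    lost: Sequence[int],
    healthy: Sequence[int],
    limit: int,
    xi_is_invertible: Callable[[list[int]], bool],
) -> Optional[list[int]]:
    """Return a loss list containing `lost` whose Xi matrix is invertible.

    `limit` is min(k, alpha); candidate lists must stay strictly shorter.
    """
    base = sorted(lost)
    if len(base) < limit and xi_is_invertible(base):
        return base
    # Try adding 1, 2, ... healthy nodes while the list stays below the limit.
    for extra in range(1, limit - len(base)):
        for added in combinations(sorted(healthy), extra):
            candidate = sorted(base + list(added))
            if xi_is_invertible(candidate):
                return candidate
    return None  # exhausted: fall back to decoding


if __name__ == "__main__":
    # Stand-in predicate for demonstration only.
    print(find_repairable_list([2, 5], [3, 4, 6, 7, 8, 9], 4,
                               lambda X: 9 in X))
```

Every node added to the list also has its block rebuilt, so each enlargement costs additional helper traffic, which is why the search starts with the smallest lists.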
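The repair matrix is computed in steps 102 to 106 of FIG. 1, and the specific operations of each step are as follows:
Step 1: for each x_i∈X, compute the following according to formulas (9) and (13):
Figure PCTCN2018095084-appb-000046
Step 2: for every x_i, x_j∈X with j≠i, compute Ω_{i,j} according to FIG. 4 and formula (11);
Step 3: based on the results of step 2, construct the matrix Ξ_X according to formula (17);
Step 4: for each x_i∈X, compute Θ'_i according to FIG. 5 and formula (14), and arrange the resulting Θ'_i, i=1,…,t, into a generalized diagonal matrix according to formula (18);
Step 5: if Ξ_X is invertible, use
Figure PCTCN2018095084-appb-000047
according to formula (18) to left-multiply the generalized diagonal matrix obtained in step 4, which yields the repair matrix R_X.
As shown in FIG. 1, steps 2 and 3 may be performed either before or after step 4.
The joint repair method for multiple data blocks in distributed storage is described in detail below for the centralized and distributed repair modes, respectively.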
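For the centralized repair mode, the joint multi-block repair algorithm for PM MSR regenerating codes is shown in FIG. 6, and the specific operations of each step are described below. Because the repair is centralized, all operations other than providing helper data are performed on the central proxy node. After the data is reconstructed, the central proxy node must also send the reconstructed data to the corresponding new nodes. When t>m, the lost data cannot be recovered; for brevity, that case is not described again here. As shown in FIG. 6, step 600 determines whether t≥min{k,α} holds.
(1) If t≥min{k,α}:
Step 611: select any k healthy nodes as assistant nodes; the central proxy node sends a request to the assistant nodes asking them to send their stored data blocks to the central proxy node;
Step 612: the central proxy node waits until it has received all k×α helper data sub-blocks;
Step 613: the central proxy node decodes the received data sub-blocks to obtain the original data;
Step 614: the central proxy node reconstructs the lost data blocks by encoding and sends the reconstructed data blocks to the t new nodes.
(2) If t<min{k,α}:
Step 621: compute Ξ_X according to steps 102 to 104 of FIG. 1;
Step 622: if Ξ_X is not invertible, perform any one of operations a) to c) below; otherwise proceed to the next step;
a) return to step 611 and reconstruct the data by decoding;
b) add one node to X and recompute Ξ_X; if it is invertible, go to step 623; otherwise perform a), b), or c);
c) replace one node in X with a node outside X and recompute Ξ_X; if it is invertible, go to step 623; otherwise perform a), b), or c);
Let z be the number of elements in X. If, while performing b) and/or c), all possible combinations with z<min{k,α} have been exhausted, a) must be performed;
Step 623: as in step 101 of FIG. 1, select d-t+1 healthy nodes as assistant nodes; the central proxy node sends a request to the assistant nodes asking each of them to compute and provide t helper data sub-blocks according to formula (10);
Step 624: the central proxy node waits until it has received all t(d-t+1) helper data sub-blocks;
Step 625: compute the repair matrix R_X according to step 106 of FIG. 1;
Step 626: rearrange the received helper data sub-blocks according to formula (16) so that they correspond to the generalized diagonal matrix in formula (18);
Step 627: as shown in step 107 of FIG. 1, left-multiply the vector of helper data sub-blocks rearranged in step 626 by the repair matrix R_X to reconstruct the lost data blocks, and send the reconstructed data to the t new nodes.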
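The traffic implied by the two branches of FIG. 6 can be tallied as follows. This is a rough sketch, not part of the application: `centralized_traffic` is a hypothetical name, the download counts come from steps 612 and 624, the upload count assumes that each of the t rebuilt blocks is sent as α sub-blocks, and Ξ_X is assumed invertible without enlarging the loss list.

```python
def centralized_traffic(n: int, k: int, d: int, t: int):
    """Return (mode, downloaded_subblocks, uploaded_subblocks) for the proxy."""
    alpha = d - k + 1
    if not 1 <= t <= n - k:
        raise ValueError("need 1 <= t <= n - k for the data to be recoverable")
    upload = t * alpha                   # assumed: t rebuilt blocks, alpha each
    if t < min(k, alpha):
        return "repair_matrix", t * (d - t + 1), upload   # step 624
    return "decode", k * alpha, upload                    # step 612


if __name__ == "__main__":
    for t in range(1, 7):
        print(t, centralized_traffic(10, 4, 7, t))
```

For C(10,4,7) with t=2, the proxy downloads 2×6=12 helper sub-blocks and uploads 8, compared with 16 downloaded sub-blocks if it decoded the whole file instead.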
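In the distributed repair mode, each new node only needs to reconstruct the data block stored on the failed node it replaces. If t≤n-d, a new node can select d healthy nodes as assistant nodes and obtain one encoded helper data sub-block from each of them, from which it reconstructs its lost data block; the total repair bandwidth is then td sub-blocks. If t>n-d, a new node can no longer reconstruct only its own lost data block, because there are not enough assistant nodes; each new node must then reconstruct at least t-(n-d)+1 data blocks simultaneously, and the number of assistant nodes required is n-t.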
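In the distributed repair mode, the joint multi-block repair algorithm for PM MSR regenerating codes is shown in FIG. 7, and the specific operations of each step are described below. All operations other than providing helper data are performed on the new nodes.
As shown in FIG. 7, step 700 checks the value of t.
(1) If t≥n-d-1+min{k,α}:
Step 711: select any k healthy nodes as assistant nodes; the new node sends a request to the assistant nodes asking them to send their stored data blocks to the new node;
Step 712: the new node waits until it has received all k×α data sub-blocks;
Step 713: decode the received helper data sub-blocks to obtain the original data;
Step 714: reconstruct the lost data block by encoding, and return.
(2) If t≤n-d:
Step 721: select any d healthy nodes as assistant nodes; the new node sends a request to the assistant nodes asking each of them to compute and provide one helper data sub-block according to formula (10);
Step 722: the new node waits until it has received all d helper data sub-blocks;
Step 723: as shown in step 106 of FIG. 1, compute the repair matrix
Figure PCTCN2018095084-appb-000048
where x_i is the number of this new node;
Step 724: rearrange the received helper data sub-blocks according to formula (16) so that they correspond to the generalized diagonal matrix in formula (18);
Step 725: use the repair matrix
Figure PCTCN2018095084-appb-000049
to left-multiply the vector of helper data sub-blocks rearranged in step 724, reconstruct the lost data block, and return.
(3) Otherwise:
Step 731: select any other u=t-n+d lost data blocks
Figure PCTCN2018095084-appb-000050
and let X={x_i, y_1, …, y_u}, where x_i is the number of this new node;
Step 732: compute Ξ_X according to the steps shown in FIG. 1;
Step 733: if Ξ_X is not invertible, perform any one of operations a) to c) below; otherwise proceed to the next step;
a): return to step 711 and reconstruct the data by decoding;
b): add one node to X and recompute Ξ_X; if it is invertible, go to step 734; otherwise perform a), b), or c);
c): replace one node in X with a node outside X and recompute Ξ_X; if it is invertible, go to step 734; otherwise perform a), b), or c);
Let z be the number of elements in X. If, while performing b) and/or c), all possible combinations with z<min{k,α} have been exhausted, a) must be performed;
Step 734: select all n-t healthy nodes as assistant nodes; the new node sends a request to the assistant nodes asking each of them to compute and provide u+1 helper data sub-blocks according to formula (10);
Step 735: the new node waits until it has received all (n-t)(u+1) helper data sub-blocks;
Step 736: compute the repair matrix R_X according to step 106 of FIG. 1;
Step 737: rearrange the received helper data sub-blocks according to formula (16) so that they correspond to the generalized diagonal matrix in formula (18);
Step 738: left-multiply the vector of helper data sub-blocks rearranged in step 737 by the repair matrix R_X to reconstruct the lost data blocks, and return.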
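The case analysis of FIG. 7 can be condensed into a per-node planning function. The sketch below is illustrative only: `distributed_plan` and the dictionary keys are hypothetical, the cases are tested in the order listed above, and Ξ_X is assumed invertible so that no fallback from case (3) to decoding is needed.

```python
def distributed_plan(n: int, k: int, d: int, t: int) -> dict:
    """Plan what one new node requests in the distributed repair mode."""
    alpha = d - k + 1
    if not 1 <= t <= n - k:
        raise ValueError("need 1 <= t <= n - k for the data to be recoverable")
    if t >= n - d - 1 + min(k, alpha):            # case (1): decode
        return {"mode": "decode", "assistants": k,
                "subblocks_per_assistant": alpha,
                "blocks_rebuilt": 1, "download": k * alpha}
    if t <= n - d:                                # case (2): own block only
        return {"mode": "single", "assistants": d,
                "subblocks_per_assistant": 1,
                "blocks_rebuilt": 1, "download": d}
    u = t - n + d                                 # case (3): u extra blocks
    return {"mode": "joint", "assistants": n - t,
            "subblocks_per_assistant": u + 1,
            "blocks_rebuilt": u + 1, "download": (n - t) * (u + 1)}


if __name__ == "__main__":
    for t in range(1, 7):
        print(t, distributed_plan(10, 4, 7, t))
```

For C(10,4,7), a new node downloads 7 sub-blocks when t≤3, 12 when t=4, 15 when t=5, and falls back to decoding (16 sub-blocks) when t=6. In case (3) the assistant count n-t equals d-(u+1)+1, which is the joint-repair helper count for a loss list of u+1 blocks.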
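Note that, when the above algorithm is applied with t>n-d, a new node reconstructs not only the data block stored on the failed node it replaces but possibly also data blocks needed by other new nodes. The new node may then choose to send the additionally reconstructed data blocks to the other new nodes that need them, so that those nodes do not have to reconstruct the blocks themselves. This can further reduce the repair bandwidth and the computational overhead, but it requires coordination among nodes and may increase the repair time, and it already falls within the scope of cooperative repair. In practice, the trade-off should be made according to the performance requirements of the system.
The present invention has been described above with specific embodiments, which serve only to aid understanding and are not intended to limit the invention. Those skilled in the art to which the invention belongs may make simple deductions, variations, or substitutions based on the idea of the invention.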

Claims (7)

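  1. A data repair method for a distributed storage system, wherein the total number of storage nodes of the system is n, and the system encodes the data to be stored based on a product-matrix minimum storage regenerating code C(n',k',d'),
    where n'=n+δ, k'=k+δ, d'=d+δ, δ denotes the number of virtual nodes, k denotes that the data is divided into k equal segments, each segment containing α data sub-blocks, d' is the repair degree, the C obtained by encoding is an n'×α matrix each element of which is an encoded data sub-block, the first δ rows of C are zero, rows δ+1 through δ+k are the original data, and the remaining m=n'-k' rows are the encoded parity data, so that only n=n'-δ of the n' data blocks obtained by encoding need to be stored;
    characterized in that the data repair method comprises the following steps:
    Step 1: selecting assistant nodes from the healthy nodes, computing helper data sub-blocks with the assistant nodes, and sending the helper data sub-blocks to the node that reconstructs the data:
    letting {N_i | 1≤i≤n'} denote the set of virtual and real nodes together, where N_1 to N_δ are virtual nodes and N_{δ+1} to N_{n'} are real nodes, there being n'-t healthy nodes in total, where t is the number of failed nodes and t≥1, defining X={x_i | i=1,...,t, δ<x_i≤n'} as the loss list, and letting
    Figure PCTCN2018095084-appb-100001
    denote the lost data blocks, selecting d'-t+1 of the healthy nodes as the assistant nodes, the set of assistant nodes being denoted
    Figure PCTCN2018095084-appb-100002
    the union of the assistant nodes and the failed nodes being the "relevant nodes", denoted N_e={N_j | N_j∈N_a or j∈X},
    each assistant node N_j∈N_a first computing the inner product of its stored data block c_j with
    Figure PCTCN2018095084-appb-100003
    to obtain t of the helper data sub-blocks:
    Figure PCTCN2018095084-appb-100004
    and then sending the helper data sub-blocks as helper data to the node that reconstructs the data,
    Step 2: computing the repair matrix R_X:
    (1) for every x_i, x_j∈X with j≠i, computing a matrix by the following formula:
    Figure PCTCN2018095084-appb-100005
    where
    Figure PCTCN2018095084-appb-100006
    is the vector formed by the first α elements of
    Figure PCTCN2018095084-appb-100007
    and
    Figure PCTCN2018095084-appb-100008
    is the vector formed by the last α elements of
    Figure PCTCN2018095084-appb-100009
    where
    Figure PCTCN2018095084-appb-100010
    denotes the columns of the matrix
    Figure PCTCN2018095084-appb-100011
    that correspond to the helper data sub-blocks
    Figure PCTCN2018095084-appb-100012
    and
    Figure PCTCN2018095084-appb-100013
    is the inverse matrix of
    Figure PCTCN2018095084-appb-100014
    (2) letting X_i=X\x_i denote the ordered set of the numbers remaining after x_i is removed from the loss list X, and letting
    Figure PCTCN2018095084-appb-100015
    denote the ordered set of the indices, in order, of the columns of the matrix
    Figure PCTCN2018095084-appb-100016
    that correspond to the data blocks
    Figure PCTCN2018095084-appb-100017
    for each i=1,...,t:
    Figure PCTCN2018095084-appb-100018
    where
    Figure PCTCN2018095084-appb-100019
    is computed in the same way as
    Figure PCTCN2018095084-appb-100020
    and m=n-k,
    the above procedure giving:
    Figure PCTCN2018095084-appb-100021
    where
    Figure PCTCN2018095084-appb-100022
    (3) for each x_i, the vector
    Figure PCTCN2018095084-appb-100023
    has length d'-t+1; in fact the first δ elements of
    Figure PCTCN2018095084-appb-100024
    are all zero and, correspondingly, the first δ columns of each Θ_i are useless, so
    Figure PCTCN2018095084-appb-100025
    is used in place of
    Figure PCTCN2018095084-appb-100026
    where Θ'_i is what remains of Θ_i after its first δ columns are removed, and
    Figure PCTCN2018095084-appb-100027
    is what remains of
    Figure PCTCN2018095084-appb-100028
    after its first δ elements are removed:
    Figure PCTCN2018095084-appb-100029
    the repair matrix then being:
    Figure PCTCN2018095084-appb-100030
    Step 3: reconstructing the lost data blocks:
    left-multiplying the vector of helper data sub-blocks
    Figure PCTCN2018095084-appb-100031
    by the repair matrix R_X to obtain the lost data blocks.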
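  2. The method of claim 1, characterized in that the repair matrix is applicable for t<min{k,α}, and for t≥min{k,α} the lost data is reconstructed by decoding:
    any k healthy nodes are selected as assistant nodes, a total of k×α data sub-blocks are downloaded from the assistant nodes, and the data sub-blocks are then decoded to obtain the original data.
  3. The method of claim 1, characterized in that Ξ_X is an invertible matrix, or Ξ_X is made invertible by adding nodes to the loss list, or Ξ_X is made invertible by replacing one or more nodes in the loss list with other nodes; otherwise the lost data is reconstructed by decoding.
  4. The method of claim 1, characterized in that the data repair method includes centralized data repair; in the centralized mode, the number of assistant nodes is selected according to the number of failed nodes to be repaired, and the collection of the helper data sub-blocks, the computation of the repair matrix, and the reconstruction of the lost data blocks are performed by a central proxy node.
  5. The method of claim 1, characterized in that the data repair method includes distributed data repair; in the distributed mode, the number of assistant nodes is selected according to the number of failed nodes to be repaired, each new node reconstructs the data block stored on the failed node it replaces, and the collection of the helper data sub-blocks, the computation of the repair matrix, and the reconstruction of the lost data blocks are performed by the new node.
  6. The method of claim 5, characterized in that the data blocks reconstructed by the new node further include data blocks needed by other new nodes.
  7. A computer-readable storage medium, characterized by comprising a program executable by a processor to implement the method of any one of claims 1 to 6.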
PCT/CN2018/095084 2018-07-10 2018-07-10 Data repair method for distributed storage system, and storage medium WO2020010504A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
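PCT/CN2018/095084 WO2020010504A1 (zh) 2018-07-10 2018-07-10 Data repair method for distributed storage system, and storage medium
CN201880005531.6A CN110168505B (zh) 2018-07-10 2018-07-10 Data repair method for distributed storage system, and storage medium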
US16/973,029 US11500725B2 (en) 2018-07-10 2018-07-10 Methods for data recovery of a distributed storage system and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/095084 WO2020010504A1 (zh) 2018-07-10 2018-07-10 Data repair method for distributed storage system, and storage medium

Publications (1)

Publication Number Publication Date
WO2020010504A1 true WO2020010504A1 (zh) 2020-01-16

Family

ID=67644889

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/095084 WO2020010504A1 (zh) 2018-07-10 2018-07-10 Data repair method for distributed storage system, and storage medium

Country Status (3)

Country Link
US (1) US11500725B2 (zh)
CN (1) CN110168505B (zh)
WO (1) WO2020010504A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112445656A (zh) * 2020-12-14 2021-03-05 北京京航计算通讯研究所 Method and device for repairing data in a distributed storage system

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11513898B2 (en) * 2019-06-19 2022-11-29 Regents Of The University Of Minnesota Exact repair regenerating codes for distributed storage systems
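CN110704232B (zh) * 2019-10-10 2023-03-14 广东工业大学 Method, apparatus, and device for repairing failed nodes in a distributed system
CN112181707B (zh) * 2020-08-21 2022-05-17 山东云海国创云计算装备产业创新中心有限公司 Scheduling method, system, device, and storage medium for distributed storage data recovery
CN113225395A (zh) * 2021-04-30 2021-08-06 桂林电子科技大学 Data distribution strategy and data repair algorithm for a multi-data-center environment
US11886295B2 (en) * 2022-01-31 2024-01-30 Pure Storage, Inc. Intra-block error correction
CN114595092B (zh) * 2022-04-28 2022-09-20 阿里云计算有限公司 Distributed storage system, data reconstruction method, device, and storage medium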

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150358037A1 (en) * 2013-02-26 2015-12-10 Peking University Shenzhen Graduate School Method for encoding msr (minimum-storage regenerating) codes and repairing storage nodes
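CN105260259A (zh) * 2015-09-16 2016-01-20 长安大学 Locality repair coding method based on systematic minimum storage regenerating codes
CN106776129A (zh) * 2016-12-01 2017-05-31 陕西尚品信息科技有限公司 Method for repairing multi-node data files based on minimum storage regenerating codes
CN106790408A (zh) * 2016-11-29 2017-05-31 中国空间技术研究院 Coding method for node repair in distributed storage systems
CN106911793A (zh) * 2017-03-17 2017-06-30 上海交通大学 I/O-optimized data repair method for distributed storage
CN108199720A (zh) * 2017-12-15 2018-06-22 深圳大学 Node repair method and system for reducing storage overhead and improving repair efficiency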

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8631269B2 (en) * 2010-05-21 2014-01-14 Indian Institute Of Science Methods and system for replacing a failed node in a distributed storage network
CN103631761B (zh) * 2012-08-29 2018-02-27 睿励科学仪器(上海)有限公司 Method for performing matrix operations with a parallel processing architecture for rigorous coupled-wave analysis
US10146618B2 (en) * 2016-01-04 2018-12-04 Western Digital Technologies, Inc. Distributed data storage with reduced storage overhead using reduced-dependency erasure codes
CN110178122B (zh) * 2018-07-10 2022-10-21 深圳花儿数据技术有限公司 Synchronous data repair method for distributed storage system, and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
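CN112445656A (zh) * 2020-12-14 2021-03-05 北京京航计算通讯研究所 Method and device for repairing data in a distributed storage system
CN112445656B (zh) * 2020-12-14 2024-02-13 北京京航计算通讯研究所 Method and device for repairing data in a distributed storage system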

Also Published As

Publication number Publication date
US20210271552A1 (en) 2021-09-02
US11500725B2 (en) 2022-11-15
CN110168505A (zh) 2019-08-23
CN110168505B (zh) 2022-10-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18926115

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.04.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18926115

Country of ref document: EP

Kind code of ref document: A1