WO2015180038A1 - Partial replica code construction method and device, and data recovery method therefor - Google Patents

Partial replica code construction method and device, and data recovery method therefor

Info

Publication number
WO2015180038A1
WO2015180038A1 PCT/CN2014/078539 CN2014078539W WO2015180038A1 WO 2015180038 A1 WO2015180038 A1 WO 2015180038A1 CN 2014078539 W CN2014078539 W CN 2014078539W WO 2015180038 A1 WO2015180038 A1 WO 2015180038A1
Authority
WO
WIPO (PCT)
Prior art keywords
blocks
code
block
elements
data
Prior art date
Application number
PCT/CN2014/078539
Other languages
French (fr)
Chinese (zh)
Inventor
李挥
朱兵
陈俊
侯韩旭
周泰
Original Assignee
北京大学深圳研究生院
深圳赛思鹏科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大学深圳研究生院, 深圳赛思鹏科技发展有限公司
Priority to PCT/CN2014/078539 (WO2015180038A1)
Priority to CN201480078750.9A (CN107003933B)
Publication of WO2015180038A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • H - ELECTRICITY
    • H03 - ELECTRONIC CIRCUITRY
    • H03M - CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 - Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes

Definitions

  • the present invention relates to the field of network storage, and more particularly to a method and a device for constructing a partial replica code, and a method for repairing its data.
  • the encoding method generally uses the MDS (Maximum Distance Separable) code, which achieves optimal storage-space efficiency.
  • An MDS code with parameters (n, k) divides an original file into k equal-sized modules and encodes them into n mutually independent coded modules; n nodes each store a different module, and the data stored in any k nodes suffices to reconstruct the original file. This feature is referred to as the MDS property.
  • This coding technology plays an important role in providing effective network storage redundancy, and is especially suitable for large file storage and archive data backup applications.
  • data of size B is usually stored in n storage nodes, and the size of the data stored in each node is α.
  • the data receiver only needs to connect to any k of the n storage nodes and download their data to recover the original data B, which is called the data reconstruction process.
  • the RS (Reed-Solomon) code is a typical code that has the MDS property.
  • the RS code first needs to download the data of k storage nodes and recover the original file, and then encodes the data stored by the failed node for the newly introduced node. Decoding the entire original data in order to recover the data of one storage node is obviously a waste of network bandwidth.
  • EC code: Erasure Codes
  • when data is repaired, the data is first downloaded from k storage nodes in the system and the original file is reconstructed; the original file is then re-encoded into a new module, which is stored on the new node.
  • the recovery process indicates that the network load required to repair any failed node is at least the content stored by k nodes.
  • RGC: Regenerating Codes
  • MSR: minimum storage regenerating code
  • MBR: minimum repair bandwidth regenerating code
  • the repair process of the regenerated code is computationally complex, and usually involves a large number of finite field operations, that is, the repair node needs to perform a random linear network coding operation on the data stored therein. Specifically, the node participating in the repair reads the stored data block and performs a specific linear operation, and then passes the combined data block to the replacement node. In order to satisfy that all coding packets are independent of each other, the operation of the RGC code needs to be in a large finite field. Considering that the node read and write bandwidth is less than the network bandwidth in the actual system, the read and write bandwidth can easily become a system performance bottleneck.
  • the concept of the FR code was proposed on the basis of the MBR code, and it was shown that the FR code can provide exact and efficient repair.
  • the FR code consists of two parts: an outer MDS code and an inner replica code. After the data blocks are MDS-encoded, the output code blocks are replicated an integer number of times and then distributed to the storage nodes.
  • the repair can be done by directly downloading data from other nodes and storing it to the replacement node, without additional operations.
  • the FR code greatly improves the node failure repair speed and correspondingly reduces the repair time.
  • the technical problem to be solved by the present invention, in view of the prior-art defects of long construction time, inconvenient parameter setting, and large system overhead, is to provide a method and a device for constructing a partial replica code, and a method for repairing its data, that take little time, allow convenient parameter setting, and incur low system overhead.
  • the data to be stored is equally divided into ⁇ parts and encoded with an MDS code with parameters (β, ⁇) to obtain β code blocks;
  • a block is a set whose elements all belong to different groups;
  • the n blocks together contain the β code blocks, each replicated f times;
  • each storage node stores the code blocks corresponding to one selected block;
  • ⁇, ⁇, t, s, and f are all positive integers, and β is divisible by t.
  • in step C), all the obtained groups constitute a set G; the set G is a partition of the set V.
  • the blocks obtained in step D) are such that every element of the set V appears in exactly f blocks.
  • the size of each of the groups is the same, and the capacity of each group is the same.
  • the factor f is obtained;
  • the number n of storage nodes is obtained according to n ⁇ - ⁇ ⁇ - ⁇ .
  • the method further includes the following steps:
  • D2) arbitrarily selecting f of the p parallel classes to obtain the selected blocks, where f is less than or equal to p, and the p parallel classes include n blocks.
  • the invention also relates to an apparatus for implementing the above method, comprising:
  • the code block acquisition module is configured to divide the data to be stored into ⁇ parts and perform MDS encoding with parameters (β, ⁇) to obtain β code blocks;
  • the set V construction module is configured to obtain the set parameters, which include the number t of elements included in each group and the number s of code blocks stored in each storage node; the β code blocks are numbered sequentially and used as the elements of the set V to obtain the set V;
  • the grouping module is used to group the elements in the set V to obtain β/t groups; the element numbers in a group are all different;
  • the block construction module is configured to obtain all the blocks of the set V from the obtained grouping and to select among all the blocks according to the set parameters, obtaining the selected n blocks;
  • a block is a set whose elements all belong to different groups; the n blocks together contain the β code blocks, each replicated f times;
  • the data storage module is configured to store the code blocks corresponding to the selected blocks in the storage nodes, each storage node storing the code blocks corresponding to one selected block;
  • ⁇, ⁇, t, s, and f are all positive integers, and β is divisible by t.
  • block building module further includes:
  • Parallel class division unit: configured to divide all the obtained blocks into p parallel classes, where, if the elements of several blocks together are exactly all the elements of the set V and no element is shared between these blocks, these blocks form a parallel class;
  • Parallel class selection unit used to arbitrarily select f among the p parallel classes to obtain a selected block; wherein, f is less than or equal to p; and the p parallel classes include n blocks.
  • the present invention also relates to a method of repairing data stored using the above method, including the following steps:
  • the repair table is stored in the system metadata of a tracking server in the storage system; the repair table contains at least one repair scheme for each node.
  • the method and apparatus for constructing a partial replica code and the data repair method of the present invention have the following beneficial effects: because the inner replica code of the partial replica code is constructed by group design, parameter setting becomes more convenient and flexible while the construction time remains short and the system overhead remains small; the code can therefore be used flexibly on different storage systems.
  • FIG. 1 is a flow chart showing a process of constructing a partial replica code in an embodiment of a method, apparatus, and data repair method for constructing a partial replica code according to the present invention
  • FIG. 2 is a schematic diagram of a construction of a partial replica code in the embodiment
  • Figure 3 is a schematic view showing another construction in the embodiment
  • FIG. 4 is a schematic diagram of a device for constructing a partial replica code in the embodiment
  • Figure 5 is a schematic diagram showing the relationship between data repair selectivity and node storage capacity in the embodiment
  • Figure 6 is a comparison of the repair time of various codes in the case of a storage parameter in the embodiment
  • Figure 7 is a comparison of the repair time of various codes in the case of another storage parameter in the embodiment.
  • the method for constructing the partial replica code includes the following steps:
  • Step S11: perform MDS encoding on the data to obtain code blocks:
  • the data to be stored on the network, usually a file, is equally divided into ⁇ parts and encoded with an MDS code with parameters (β, ⁇) to obtain β code blocks. Since MDS coding is itself a relatively mature technology, it is not described in detail here.
  • Step S12: process the obtained code blocks according to the set parameters to obtain the set V: In this step, the previously set parameters are obtained. These parameters are relevant not only to this step but also to the later steps of grouping, obtaining the blocks, and selecting the blocks.
  • these parameters include the number of elements t in each group, the number of coding blocks s stored in each storage node, the number n of storage nodes, the copy multiple f of the coding block, and the like.
  • the above ⁇, ⁇, t, s, and f are all positive integers, and β is divisible by t.
  • the code blocks obtained in the above step are numbered sequentially and taken as the elements of the set V. That is, each element of the set V has a unique number, and the set V contains β code blocks in total.
  • each element is represented by its number; the specific content of the element is not involved.
  • Step S13 groups the set V:
  • the set V obtained above is grouped; each group includes t of the above elements, and no element is repeated across groups (that is, no element number is repeated). In other words, the β elements (i.e., code blocks) are divided into groups of t elements each, giving β/t groups.
  • all the obtained groups constitute a set G; every group has the same capacity; and the set G is a partition of the set V.
  • Step S14 obtains the block group, and selects the block group according to the setting parameters:
  • the blocks are obtained on the basis of the groups; a block is a set whose elements are all elements of the above set V. Each block includes s elements, and no element in a block belongs to the same group as any other element in that block.
  • the set V is first divided once by means of grouping. On the basis of this division, the set V is divided again according to the definition of a block to obtain the blocks. The elements of a block are elements of V that were assigned to different groups.
  • let the set V have 4 code blocks, numbered 1, 2, 3, and 4, divided into two groups, (1, 2) and (3, 4); then the blocks may include (1, 3), (2, 4), (1, 4), and (2, 3). If all of them are selected in this step, they can be stored in 4 storage nodes, each storing 2 code blocks, and the replication factor for the elements of the set V is 2, because these blocks contain each element twice.
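The four-element example above can be reproduced with a short sketch. It assumes the simplest case, in which the block size s equals the number of groups, so that every block takes exactly one element from each group; the function names `partition_into_groups` and `candidate_blocks` are illustrative and do not come from the patent.

```python
from itertools import product

def partition_into_groups(v, t):
    """Split the numbered code blocks v into consecutive groups of t elements."""
    if len(v) % t != 0:
        raise ValueError("the number of code blocks must be divisible by t")
    return [tuple(v[i:i + t]) for i in range(0, len(v), t)]

def candidate_blocks(groups):
    """Enumerate every block that takes exactly one element from each group
    (assumption: block size s equals the number of groups)."""
    return [tuple(b) for b in product(*groups)]

V = [1, 2, 3, 4]
groups = partition_into_groups(V, t=2)
blocks = candidate_blocks(groups)

# Replication factor of each element if all candidate blocks are stored.
replication = {x: sum(x in b for b in blocks) for x in V}

print(groups)       # the two groups (1, 2) and (3, 4)
print(blocks)       # four blocks, one per storage node
print(replication)  # every element is held by two nodes
```

Storing all four blocks gives four nodes with two code blocks each, matching the replication factor of 2 described in the text.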
  • the range of selection of the setting parameters is relatively large. When the settings are chosen appropriately, the resulting blocks may form parallel classes.
  • If the elements of several blocks together are exactly all the elements of the set V, and no element is shared between these blocks, these blocks are considered to form a parallel class.
  • it can be divided into p parallel classes.
  • all the blocks of the p parallel classes for the set V can also be obtained first, and then f among the p parallel classes is selected to realize the selection of the block.
  • f is less than p. At this time, f can be set or calculated.
  • the parameters used may be all given or set, or may be obtained by calculating some other ungiven parameters.
  • Step S15 assigns the selected block to each storage node:
  • the above-mentioned block groups are respectively stored on the storage nodes, and each storage node stores one block.
  • two parallel classes have been selected for a total of six blocks, so that the code blocks represented by the selected six blocks are stored in six storage nodes.
  • the amount of data stored on a storage node is the amount of data included in (or pointed to by) a block; for example, if a block includes two code block numbers, that is, it includes (or points to) two code blocks, then the amount of data stored on one storage node is that of two code blocks, and the data stored there is those two code blocks.
  • ⁇, ⁇, t, s, and f are all positive integers, and β is divisible by t.
  • a distributed storage system is generally represented by (n, k, d), where n is the total number of nodes of the storage system, k is the minimum number of nodes required to reconstruct the original file, and d is the number of available nodes required to repair a failed node, satisfying k ≤ d ≤ n − 1.
  • research on MDS codes is relatively mature, and MDS codes exist for almost any admissible parameters. Therefore, the difficulty in constructing a partial replica code lies in the design of the inner replica code.
  • the essence of the FR code is an arrangement of data blocks replicated f times on the nodes that ensures the copies of each data block are stored on different nodes.
  • each element in U belongs to f subsets.
  • each subset corresponds to a storage node. All data blocks are distributed over n different nodes, and each node has a storage capacity of α.
  • (; ⁇ ,. represents a file containing 5 data blocks, representing a finite field of size q.
  • the code block for each output is duplicated twice, and the generated data block is stored on 4 nodes, see Figure 2.
  • in Figure 2, the number in each box indicates the subscript of the code block.
  • the three data blocks stored by node N1 are, in turn, y1, y3, y5.
  • V and ⁇ be the given positive integers, and S and ⁇ be the given positive integer set.
  • the elements in V are called points, the elements in A are called blocks, and the elements in G are called groups. If the following conditions are met:
  • let (V, G, A) be a GD design and P ⊂ A; if every point in V is incident with exactly one block in P, then P is called a parallel class. If all the blocks of a GD design can be partitioned into parallel classes, it is called a resolvable GD design.
  • the existence of a transversal design (TD) is equivalent to the existence of mutually orthogonal Latin squares.
  • the Steiner system is a special GD design, but not all GD designs are Steiner systems.
  • V = {1, 2, ..., 6}
  • three equal-sized groups are taken as G: {1, 2}, {3, 4}, {5, 6}
  • the generated blocks are {1, 3, 5}, {2, 3, 6}, {1, 4, 6}, {2, 4, 5}.
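The six-point example can be checked mechanically. The sketch below assumes the standard conditions for a group divisible (GD) design with index 1: the groups partition V, two points of the same group never share a block, and two points of different groups share exactly one block. The helper `is_gd_design` is a hypothetical name, and the relation n = v·f/s is obtained by counting stored copies rather than taken from the patent's own equations.

```python
from collections import Counter
from itertools import combinations

def is_gd_design(points, groups, blocks, index=1):
    """Check the assumed GD-design conditions for the given index."""
    group_of = {x: gi for gi, g in enumerate(groups) for x in g}
    if sorted(group_of) != sorted(points) or sum(map(len, groups)) != len(points):
        return False  # the groups must partition the point set
    # Count how many blocks contain each unordered pair of points.
    pair_count = Counter(
        pair for b in blocks for pair in combinations(sorted(b), 2)
    )
    for x, y in combinations(sorted(points), 2):
        expected = 0 if group_of[x] == group_of[y] else index
        if pair_count[(x, y)] != expected:
            return False
    return True

V = [1, 2, 3, 4, 5, 6]
G = [{1, 2}, {3, 4}, {5, 6}]
A = [{1, 3, 5}, {2, 3, 6}, {1, 4, 6}, {2, 4, 5}]

valid = is_gd_design(V, G, A)
replication = Counter(x for b in A for x in b)  # blocks holding each point
f = replication[1]                              # replication factor
n = len(V) * f // len(A[0])                     # v * f / s storage nodes

print(valid, f, n)
```

Here every point lies in two blocks, so six code blocks replicated twice fill four nodes of capacity three.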
  • a GD design may have several isomorphisms. In this embodiment, only one specific design (corresponding to a specific grouping) is considered, and the corresponding construction method is equally applicable to all other isomorphic designs.
  • the GDDFRC code construct takes a given GD0, 1, t; v), where ⁇ >2.
  • the replication factor f and the number of system nodes n can be obtained from the above equations.
  • the constructed FR code is shown in Figure 2.
  • the system can tolerate one node failure and regenerate the data exactly without coding. If nodes N1 and N2 fail at the same time, the original file must be reconstructed to obtain code block y3. In general, for an FR code with replication factor f, the system can withstand f − 1 node failures without losing the exact uncoded repair property, since all data blocks in the system then still have at least one copy.
  • the copying factor of the coding block and the node size in the system can be flexibly selected.
  • a GD(3, 1, 3; 9)
  • the nine blocks generated by this design can be divided into three parallel classes (the blocks of each row form a parallel class):
  • This flexible parameter selection provides great convenience for system design.
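To illustrate how choosing f of the p parallel classes fixes the replication factor and the node count, here is a minimal sketch on nine points. The nine blocks are generated from addition modulo 3 (a Latin square of order 3); this is one possible resolvable design with three groups of three points and is not necessarily the arrangement shown in the patent. The names `parallel_classes_9` and `select_nodes` are hypothetical.

```python
from collections import Counter

def parallel_classes_9():
    """Build 9 blocks on points 1..9 with groups {1,2,3}, {4,5,6}, {7,8,9},
    split into 3 parallel classes (illustrative construction)."""
    m = 3
    classes = [[] for _ in range(m)]
    for i in range(m):
        for j in range(m):
            # One point from each group; the third is fixed by (i + j) mod 3.
            block = (1 + i, 4 + j, 7 + (i + j) % m)
            # Blocks with the same (j - i) mod 3 are pairwise disjoint.
            classes[(j - i) % m].append(block)
    return classes

def select_nodes(classes, f):
    """Pick f parallel classes; each selected block becomes one storage node."""
    if not 1 <= f <= len(classes):
        raise ValueError("f must lie between 1 and the number of parallel classes")
    return [block for cls in classes[:f] for block in cls]

classes = parallel_classes_9()
nodes = select_nodes(classes, f=2)
copies = Counter(x for block in nodes for x in block)

print(len(nodes))            # number of storage nodes for f = 2
print(set(copies.values()))  # every code block has the same number of copies
```

Because each parallel class covers every point exactly once, selecting f classes always yields replication factor f and 3f nodes in this design.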
  • an apparatus for implementing the foregoing method includes a code block acquisition module 1, a set V construction module 2, a grouping module 3, a block construction module 4, and a data storage module 5;
  • the code block acquisition module 1 is configured to divide the data to be stored into ⁇ parts and perform MDS encoding with parameters (β, ⁇) to obtain β code blocks;
  • the set V construction module 2 is used to obtain the setting parameters.
  • the parameters include the number t of elements included in each group and the number s of code blocks stored in each storage node; the β code blocks are numbered sequentially and used as the elements of the set V to obtain the set V;
  • the grouping module 3 is configured to group the elements in the set V to obtain β/t groups; the element numbers in a group are all different; the block construction module 4 is used to obtain all the blocks of the set V from the obtained grouping.
  • the data storage module 5 is configured to store the code blocks corresponding to the selected blocks in the storage nodes, each storage node storing the code blocks corresponding to one selected block.
  • ⁇, ⁇, t, s, and f are all positive integers, and β is divisible by t.
  • the block construction module 4 further includes a parallel class division unit 41 and a parallel class selection unit 42; the parallel class division unit 41 is configured to divide all the obtained blocks into p parallel classes, where, if the elements of several blocks together are exactly all the elements of the set V and no element is shared between these blocks, these blocks form a parallel class; the parallel class selection unit 42 is used to arbitrarily select f of the p parallel classes to obtain the selected blocks, where f is smaller than p, and the p parallel classes include n blocks.
  • the GDDFRC code covers all the characteristics of the FR code.
  • the copying multiple of each data block is the same, and the storage capacity of each node of the system is the same.
  • the GDDFRC code uses a table-based repair method.
  • the repair table indicates the repair options that are selectable for each particular failed node. As shown in FIG. 3, if node N8 fails, repairs can be made through nodes N2, N4, N6 instead of nodes, N2 and N3.
  • a storage server deployment usually includes a tracking server.
  • the repair table information can be written to the metadata so that it can be read quickly when a failure is repaired.
  • the degree of selection of data repair is relatively large.
  • the nodes to download from can be chosen at random among the other n − 1 available nodes; the original file is reconstructed and then re-encoded for repair. Therefore, for any node failure, the MDS code has C(n − 1, k) repair schemes. The number of alternative schemes for repairing a node failure is called the repair selectivity of the node.
  • the GDDFRC code uses a table-based repair method, in which the table gives the repair schemes for each specific node. Since the copies of each data block are distributed over different nodes and a pair of different data blocks is stored together on a unique node, when one node fails, the other nodes that store the same data blocks as that node can be contacted, and copies of the lost data blocks are downloaded to regenerate the replacement node. It can be seen that, given storage nodes of capacity α, the system has (f − 1)^α failure repair schemes. Figure 5 shows the relationship between the node repair selectivity and the storage capacity for the GDDFRC code with a replication factor of 3.
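A repair table of this kind can be derived directly from the node layout. The sketch below is an assumption about how such a table might be built, not the patent's implementation: for each failed node it lists every way of picking one surviving holder per lost code block. The 4-node layout uses the blocks of the six-point example; the 9-node layout is an illustrative design with replication factor 3.

```python
from itertools import product

def repair_table(nodes):
    """Map each node index to its list of repair schemes.
    A scheme maps every lost code block to the helper node it is copied from."""
    table = {}
    for failed, lost in enumerate(nodes):
        # For each lost block, all other nodes that also store it.
        choices = [
            [h for h, held in enumerate(nodes) if h != failed and x in held]
            for x in lost
        ]
        table[failed] = [dict(zip(lost, pick)) for pick in product(*choices)]
    return table

# Replication factor 2, capacity 3: the four blocks of the six-point example.
four_nodes = [(1, 3, 5), (2, 3, 6), (1, 4, 6), (2, 4, 5)]
# Replication factor 3, capacity 3: an illustrative nine-node layout.
nine_nodes = [(1, 4, 7), (2, 5, 9), (3, 6, 8),
              (1, 5, 8), (2, 6, 7), (3, 4, 9),
              (1, 6, 9), (2, 4, 8), (3, 5, 7)]

table4 = repair_table(four_nodes)
table9 = repair_table(nine_nodes)

print(table4[0])       # single scheme: each lost block has one other holder
print(len(table9[0]))  # number of schemes for one node when f = 3
```

With replication factor f and node capacity s, each lost block has f − 1 possible helpers, so every node ends up with (f − 1)^s schemes, which is the exponential growth in capacity described above.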
  • the repair method of the GDDFRC code is based on a table
  • the node repair selectivity can still reach a very high level.
  • the node repair selectivity increases exponentially with the node storage capacity.
  • the GDDFRC code proposed by the present invention is implemented by using the popular Hadoop distributed file system in the industry, and the file encoding and decoding and failure recovery functions are completed.
  • the CPU of the system server is an Intel(R) Xeon(R) E5-2609 at 2.40 GHz, and the memory size is 24 GB.
  • a normal PC CPU is AMD A8-5600k 3.0GHz, 4G memory
  • the same experimental environment is configured, and there is no other operation for each node during the experiment. Under the condition that the node storage capacity is the same, the difference between the GDDFRC code and the classic RS code and MBR code in the repair time is analyzed from different A values.
  • the advantage of the GDDFRC code in repair time is more obvious.
  • the original file needs to be restored, and the generated code block is re-encoded and stored in the replacement node, so the repair time is relatively long.
  • the node participating in the repair performs a linear operation on the stored data, and then transfers the combined data block to the replacement node.
  • the node further integrates all the received data blocks to recover the invalid data.
  • the entire process involves a large number of finite field operations, increasing the repair time.
  • when a node failure is detected, the system first determines which node has failed and determines the repair scheme according to the GDDFRC code repair table (stored in the system metadata). It then connects to the available nodes specified in the scheme, downloads the corresponding data blocks, and stores them directly in the replacement node. The entire repair process therefore involves only file reads and introduces no other complex operations. Although the redundancy of the system is increased to a certain extent, the results show that the GDDFRC code can greatly reduce the failure repair time.
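The copy-only repair path can be sketched as follows. This is a simplified in-memory model, not the Hadoop-based implementation mentioned in the text: `layout` stands in for the repair metadata, the node names N1 to N4 and the payload bytes are made up for illustration, and the helper lookup scans the layout instead of reading a precomputed repair table.

```python
def repair_node(storage, failed, layout):
    """Rebuild node `failed` by copying each lost block from a surviving holder."""
    rebuilt = {}
    for block_id in layout[failed]:
        # Pick the first surviving node that stores the same code block.
        helper = next(
            name for name, ids in layout.items()
            if name != failed and block_id in ids and name in storage
        )
        rebuilt[block_id] = storage[helper][block_id]  # plain copy, no arithmetic
    return rebuilt

# Which code blocks each node holds (blocks of the six-point example).
layout = {"N1": (1, 3, 5), "N2": (2, 3, 6), "N3": (1, 4, 6), "N4": (2, 4, 5)}
payload = {i: f"code-block-{i}".encode() for i in range(1, 7)}
storage = {name: {i: payload[i] for i in ids} for name, ids in layout.items()}

lost = storage.pop("N1")                         # simulate the failure of N1
storage["N1"] = repair_node(storage, "N1", layout)

print(storage["N1"] == lost)  # the replacement holds exactly the same data
```

The replacement node ends up byte-for-byte identical to the failed one, which is the exact-repair behaviour the text attributes to the GDDFRC code.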
  • the greatest advantage of the partial replica code based on group design (GDDFRC) is that the computational complexity of encoding and decoding is significantly reduced, with complex finite-field operations replaced by simple, easy-to-implement data replication.
  • the construction of traditional RGC codes is based on a finite field, and finite-field addition, subtraction, and multiplication are involved in the encoding and decoding process.
  • the GDDFRC code is different.
  • a node failure in the system can be repaired by directly downloading data from other nodes and storing it on the replacement node. No additional computation is required, which greatly improves the node repair rate.
  • the fast regeneration of data blocks has high application value and development potential in practical distributed storage systems.
  • the partial replica code based on the groupable design not only reduces the computational complexity in the node repair process, but also ensures that the bandwidth consumed during the node repair process is the smallest (ie, the original file size), and does not consume excess bandwidth.
  • the GDDFRC code guarantees the following: a lost code block can be repaired by directly downloading from any of several subsets of the other code blocks; a lost code block can be repaired from a fixed number of code blocks; and the repair is table-based.
  • the data stored by the node after repair with the GDDFRC code is exactly the same as that of the failed node, that is, the repair is exact, which greatly reduces system operation complexity (such as metadata updates and broadcasting of updated data).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A partial replica code construction method. MDS coding is performed on data, and β coding blocks are obtained (S11), and the coding blocks are numbered sequentially to obtain a set V (S12); grouping is performed on elements in the set V, and β/t groups are obtained (S13); according to the grouping of the set V, all area groups satisfying a condition are obtained (S14); coding blocks corresponding to the obtained area groups are stored in storage nodes, each storage node storing coding blocks corresponding to one area group, and a partial replica code is obtained (S15). The partial replica code construction method, a device implementing the method and a method performing data recovery on the partial replica code have the following beneficial effect: parameter setting is convenient and flexible.

Description

Method and device for constructing partial replica code and method for repairing data thereof
Technical field
The present invention relates to the field of network storage, and more particularly to a method and a device for constructing a partial replica code, and a method for repairing its data.
Background
With the rapid development of computer technology and the Internet, the volume of network data is growing explosively. Big data poses a serious challenge to existing storage systems, and systems that store massive amounts of data efficiently have become increasingly important. Distributed storage systems, with their scalability and high availability, are currently an effective way to store massive data. However, in large-scale distributed storage systems, data storage nodes are unreliable because of events such as sudden power outages. To provide a reliable storage service from unreliable storage nodes, redundancy usually has to be introduced into the storage system. The most direct way to introduce redundancy is to back up the original data directly; this backup mechanism is simple, but its storage efficiency is low. At the same level of redundancy, coding techniques can greatly improve storage efficiency. In current storage systems, the encoding method generally uses the MDS (Maximum Distance Separable) code, which achieves optimal storage-space efficiency. An MDS code with parameters (n, k) divides an original file into k equal-sized modules and encodes them into n mutually independent coded modules; n nodes each store a different module, and the data stored in any k nodes suffices to reconstruct the original file. This feature is referred to as the MDS property. This coding technology plays an important role in providing efficient network storage redundancy and is especially suitable for large-file storage and archival data backup applications.
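The MDS property can be seen in the smallest non-trivial case. The sketch below uses a (3, 2) single-parity code, chosen here only for illustration (the patent refers to MDS codes in general and to RS codes as an example): the file is split into two modules, a third module is their XOR, and any two of the three modules recover the file. The function names are hypothetical.

```python
from itertools import combinations

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes) -> list[bytes]:
    """Split into k = 2 equal modules and add one XOR parity module (n = 3)."""
    if len(data) % 2:
        raise ValueError("this sketch needs an even number of bytes")
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return [a, b, xor_bytes(a, b)]

def decode(available: dict[int, bytes]) -> bytes:
    """Rebuild the file from any two modules, keyed by module index 0, 1, 2."""
    if 0 in available and 1 in available:
        a, b = available[0], available[1]
    elif 0 in available:
        a = available[0]
        b = xor_bytes(a, available[2])  # b = a XOR parity
    else:
        b = available[1]
        a = xor_bytes(b, available[2])  # a = b XOR parity
    return a + b

original = b"partial replica!"
modules = encode(original)

# Try every choice of k = 2 modules out of n = 3.
recovered = [
    decode({i: modules[i], j: modules[j]})
    for i, j in combinations(range(3), 2)
]
print(all(r == original for r in recovered))
```

Each module is half the file size, so the three nodes together store 1.5 times the file while tolerating the loss of any one node; plain two-way replication would need 2 times the file for the same tolerance.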
In a distributed storage system, data of size B is usually stored in n storage nodes, and the size of the data stored in each node is α. The data receiver only needs to connect to any k of the n storage nodes and download their data to recover the original data B; this process is called data reconstruction. The RS (Reed-Solomon) code is a typical code that has the MDS property. When a storage node in the storage system fails, the data stored by the failed node must be recovered and stored on a new node in order to maintain the redundancy of the storage system; this process is called the repair process. However, during repair the RS code first needs to download the data of k storage nodes and recover the original file, and then encodes the data stored by the failed node for the newly introduced node. Decoding the entire original data in order to recover the data of one storage node is obviously a waste of network bandwidth.
However, because of node failures or file loss, the redundancy of the system gradually decreases over time, so a mechanism is needed to maintain the system's redundancy. On this basis, the EC code (Erasure Codes) was proposed, which effectively reduces the storage overhead of the system; however, the communication overhead required to support redundancy recovery is comparatively large. In the EC code, when data is repaired, the data is first downloaded from k storage nodes in the system and the original file is reconstructed; the original file is then re-encoded into a new module, which is stored on the new node. This recovery process shows that the network load required to repair any failed node is at least the content stored by k nodes.
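The repair cost described above is simple arithmetic, sketched here under the assumption that each node stores exactly one module of size B/k. The function name `rs_repair_traffic` and the example numbers are illustrative only.

```python
def rs_repair_traffic(file_size: float, k: int) -> dict:
    """Traffic needed to rebuild one lost module when repair must first
    reconstruct the whole file from k nodes (one module of size B/k per node)."""
    module_size = file_size / k
    downloaded = k * module_size  # contents of k nodes, i.e. the whole file
    return {
        "lost": module_size,
        "downloaded": downloaded,
        "overhead": downloaded / module_size,
    }

stats = rs_repair_traffic(file_size=1024.0, k=4)
print(stats)
```

Repairing a single module therefore moves k times as much data as was lost, which is the inefficiency that regenerating codes and FR codes aim to reduce.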
To reduce the bandwidth used during repair, regenerating codes (RGC) were proposed based on ideas from network coding theory; RGC codes also satisfy the MDS property. In the conventional regenerating-code repair process, the replacement node connects to x of the remaining available storage nodes and downloads data of size y from each of them, so the repair bandwidth of an RGC code is xy. For functional repair of RGC codes, two classes of optimal codes have been proposed: minimum-storage regenerating (MSR) codes and minimum-bandwidth regenerating (MBR) codes. Repairing an RGC code does not require reconstructing the source file, and its repair bandwidth is lower than that of an RS code.
However, the repair process of regenerating codes has relatively high computational complexity and usually involves a large number of finite-field operations: the nodes taking part in the repair must perform random linear network coding on the data they store. Specifically, each node participating in the repair reads out its stored data blocks, performs a specific linear operation on them, and then sends the combined block to the replacement node. To keep all coded packets mutually independent, the RGC operations must be carried out over a fairly large finite field. Since the read/write bandwidth of a node in a practical system is lower than the network bandwidth, the read/write bandwidth can easily become the performance bottleneck of the system. To reduce the computational complexity of repair, the FR code (partial replica code) was proposed on the basis of the MBR code, and it was shown that FR codes can provide exact and efficient repair. In general, an FR code consists of two parts: an outer MDS code and an inner replication code. After the data blocks are MDS-encoded, the output coded blocks are replicated an integer number of times and then distributed to the storage nodes. When a node fails, repair is completed by downloading data directly from other nodes and storing it on the replacement node, with no additional computation. Compared with conventional RS codes and regenerating codes, FR codes greatly increase the speed of node repair and correspondingly reduce the repair time. Because constructing MDS codes is a relatively mature technique, the difficulty in constructing a partial replica code lies in designing the inner replication code. Existing partial replica codes are generally constructed from finite geometries, such as regular graphs, finite projective planes, and orthogonal Latin squares. These specific abstract geometric constructions are fairly complicated and restrict the choice of parameters, which increases the design complexity of partial replica codes.
Summary of the invention
The technical problem to be solved by the present invention is that the prior art takes a long time, makes parameter setting inconvenient, and incurs a large system overhead. To address these defects, the invention provides a method and a device for constructing a partial replica code, and a data repair method therefor, that take less time, allow convenient parameter setting, and incur a small system overhead.
The technical solution adopted by the present invention to solve this technical problem is a method for constructing a partial replica code, comprising the following steps:
A) dividing the data to be stored equally into α parts and applying MDS encoding with parameters (β, α) to them to obtain β coded blocks;
B) obtaining the set parameters, which include the number t of elements in each group and the number s of coded blocks stored on each storage node; numbering the β coded blocks in sequence and taking them as the elements of a set V, thereby obtaining the set V;
C) grouping the elements of the set V to obtain β/t groups, where the element numbers within a group are all different;
D) obtaining all blocks of the set V from the groups obtained in the above step, and selecting among all the blocks according to the set parameters to obtain n selected blocks; a block is a set whose elements all come from different groups; the n blocks together contain the β coded blocks replicated f times;
E) storing the coded blocks corresponding to the selected blocks on the storage nodes, each storage node storing the coded blocks corresponding to one selected block;
where α, β, t, s and f are all positive integers, and β is divisible by t.
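Steps B) to D) can be illustrated with a minimal Python sketch. It assumes that groups are formed from consecutive block numbers, which is only one of many possible partitions of V, and the helper names `make_groups` and `is_valid_block` are hypothetical rather than taken from the source.

```python
def make_groups(beta, t):
    """Partition the block numbers 1..beta into beta/t groups of size t.

    Assumption: groups are consecutive runs of numbers; the method only
    requires that the groups form a partition of V.
    """
    if beta % t != 0:
        raise ValueError("beta must be divisible by t")
    return [list(range(start, start + t)) for start in range(1, beta + 1, t)]


def is_valid_block(block, groups):
    """Return True if no two elements of `block` lie in the same group."""
    group_of = {elem: idx for idx, group in enumerate(groups) for elem in group}
    group_ids = [group_of[elem] for elem in block]
    return len(set(group_ids)) == len(group_ids)


groups = make_groups(6, 2)
print(groups)
print(is_valid_block([1, 3, 5], groups))  # one element from each group
print(is_valid_block([1, 2, 5], groups))  # 1 and 2 share a group
```

The sketch only checks the "elements from different groups" requirement of step D); choosing which valid blocks to keep is governed by the design described in the detailed description.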
Further, in step C), all the groups obtained constitute a set G; the set G is a partition of the set V.
Further, all the blocks obtained in step D) satisfy the condition that every element of the set V appears in exactly f blocks.
Further, all blocks have the same size, and all groups have the same capacity.
Further, in step B), the replication factor f of the coded blocks is obtained as $f = (\beta - t)/(s - 1)$, and the number n of storage nodes is obtained as $n = \beta(\beta - t)/(s(s - 1))$.
Further, step D) further comprises the following steps:
D1) dividing all the obtained blocks into p parallel classes, where a collection of blocks constitutes a parallel class if the elements of those blocks are exactly all the elements of the set V and the blocks have no elements in common;
D2) arbitrarily selecting f of the p parallel classes to obtain the selected blocks, where f is less than or equal to p, and the p parallel classes include n blocks.
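The parallel-class condition in step D1) can be checked mechanically. The following is a minimal sketch; the function name `is_parallel_class` is an assumption made for illustration, and the sample blocks are taken from the nine-point example given later in the description.

```python
def is_parallel_class(blocks, universe):
    """True if the blocks are pairwise disjoint and together cover `universe`."""
    seen = []
    for block in blocks:
        seen.extend(block)
    # No element may repeat, and the union must equal the whole set V.
    return len(seen) == len(set(seen)) and set(seen) == set(universe)


V = range(1, 10)
print(is_parallel_class([{1, 4, 7}, {2, 5, 9}, {3, 6, 8}], V))
print(is_parallel_class([{1, 4, 7}, {1, 6, 9}, {1, 5, 8}], V))
```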
The present invention also relates to a device for implementing the above method, comprising:
a coded block acquisition module, configured to divide the data to be stored equally into α parts and apply MDS encoding with parameters (β, α) to them to obtain β coded blocks;
a set V construction module, configured to obtain the set parameters, which include the number t of elements in each group and the number s of coded blocks stored on each storage node, and to number the β coded blocks in sequence and take them as the elements of a set V, thereby obtaining the set V;
a grouping module, configured to group the elements of the set V to obtain β/t groups, where the element numbers within a group are all different;
a block construction module, configured to obtain all blocks of the set V from the groups obtained, and to select among all the blocks according to the set parameters to obtain n selected blocks; a block is a set whose elements all come from different groups; the n blocks together contain the β coded blocks replicated f times;
a data storage module, configured to store the coded blocks corresponding to the selected blocks on the storage nodes, each storage node storing the coded blocks corresponding to one selected block;
where α, β, t, s and f are all positive integers, and β is divisible by t.
Further, the block construction module further comprises:
a parallel class division unit, configured to divide all the obtained blocks into p parallel classes, where a collection of blocks constitutes a parallel class if the elements of those blocks are exactly all the elements of the set V and the blocks have no elements in common;
a parallel class selection unit, configured to arbitrarily select f of the p parallel classes to obtain the selected blocks, where f is less than or equal to p, and the p parallel classes include n blocks.
The present invention also relates to a method for repairing data stored using the above method, comprising the following steps:
M) obtaining a repair table and looking up the repair scheme using the number of the failed node as the index;
N) downloading the data of the nodes indicated by the repair table, obtaining the data of the replacement node, and creating the replacement node.
Further, the repair table is stored in the system metadata of a tracking server in the storage system; the repair table contains at least one repair scheme for each node.
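Steps M) and N) can be sketched as a table lookup followed by plain block copies. The sketch below is an illustration under stated assumptions, not the patented implementation: the node layout is the eight-node example from the detailed description, with nodes assumed to be numbered in the order the blocks are listed there, and `build_repair_table` and `repair` are hypothetical helper names.

```python
from itertools import product

# Assumed layout: node number -> indices of the coded blocks it stores.
LAYOUT = {
    1: {1, 3, 5}, 2: {2, 4, 6}, 3: {1, 4, 7}, 4: {2, 3, 8},
    5: {1, 6, 8}, 6: {2, 5, 7}, 7: {3, 6, 7}, 8: {4, 5, 8},
}


def build_repair_table(layout):
    """For every node, list all ways to fetch each of its blocks from a peer."""
    table = {}
    for failed, blocks in layout.items():
        options = []
        for blk in sorted(blocks):
            holders = [n for n, stored in layout.items()
                       if n != failed and blk in stored]
            options.append([(n, blk) for n in holders])
        # One scheme picks exactly one helper node per lost block.
        table[failed] = [list(scheme) for scheme in product(*options)]
    return table


def repair(failed, table, layout):
    """Step M: look up a scheme; step N: copy the listed blocks (no decoding)."""
    scheme = table[failed][0]
    return {blk for node, blk in scheme if blk in layout[node]}


table = build_repair_table(LAYOUT)
print(len(table[8]))
print(repair(8, table, LAYOUT))
```

Because repair is a direct copy, the table can be precomputed once and kept in metadata, which matches the table-based approach described for the tracking server.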
Implementing the method and device for constructing a partial replica code and the data repair method therefor according to the present invention has the following beneficial effects: because a group divisible design is used to construct the inner replication code of the partial replica code, the construction keeps its short running time and low system overhead while making parameter setting more convenient and flexible, so that the code can be applied flexibly to different storage systems.
Brief description of the drawings
Fig. 1 is a flowchart of the construction process of the partial replica code in an embodiment of the method and device for constructing a partial replica code and the data repair method therefor according to the present invention;
Fig. 2 is a schematic diagram of one construction of the partial replica code in the embodiment;
Fig. 3 is a schematic diagram of another construction in the embodiment;
Fig. 4 is a schematic diagram of the device for constructing the partial replica code in the embodiment;
Fig. 5 is a schematic diagram of the relationship between data repair selectivity and node storage capacity in the embodiment;
Fig. 6 is a comparison of the repair times of various codes under one set of storage parameters in the embodiment;
Fig. 7 is a comparison of the repair times of various codes under another set of storage parameters in the embodiment.
Detailed description
The embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in Fig. 1, in the embodiment of the method and device for constructing a partial replica code and the data repair method therefor according to the present invention, the method for constructing the partial replica code comprises the following steps:
Step S11: apply MDS encoding to the data to obtain coded blocks. In this step, the data on the network that is to be stored, usually a file, is divided equally into α parts, and MDS encoding with parameters (β, α) is applied to obtain β coded blocks. Because MDS encoding is itself a relatively mature technique, it is not described further here.
Step S12: process the coded blocks according to the set parameters to obtain the set V. In this step, the preset parameters are obtained. These parameters relate not only to this step but also to the grouping, block generation, and block selection in the subsequent steps. They include the number t of elements in each group, the number s of coded blocks stored on each storage node, the number n of storage nodes, the replication factor f of the coded blocks, and so on. In this embodiment, α, β, t, s and f are all positive integers, and β is divisible by t. After the set parameters are obtained, the coded blocks produced in the previous step are numbered in sequence and taken as the elements of the set V. The number of each element of V is therefore unique, and V contains β coded blocks in total. In this embodiment, whenever a set is involved, its elements are represented by their numbers, and the actual content of the elements is not involved.
Step S13: group the set V. In this step, the set V obtained above is divided into groups, each containing t of its elements, with no element (that is, no element number) repeated between groups. In other words, the β elements (coded blocks) are divided into groups of t elements each, giving β/t groups. All the groups obtained in this step constitute a set G; every group has the same capacity; and the set G is a partition of the set V.
Step S14: obtain the blocks and select blocks according to the set parameters. Since the groups have been obtained in the previous step, the blocks are obtained on the basis of the groups. A block is a set whose elements are all elements of the set V; each block contains s elements, and no two elements of a block belong to the same group. In other words, in this embodiment the set V is first partitioned by grouping, and on the basis of that partition the blocks are then formed from V according to the definition of a block. The elements of a block are thus elements of V that were assigned to different groups. For example, suppose the set V has four coded blocks, numbered 1, 2, 3 and 4, divided into two groups, (1, 2) and (3, 4); the possible blocks are then (1, 3), (2, 4), (1, 4) and (2, 3). If all of them are selected in this step, they can be stored on 4 storage nodes, each storing 2 coded blocks, and the replication factor of the elements of V is 2, because these blocks together contain two copies of each of the coded blocks 1, 2, 3 and 4. This example shows that in this embodiment the set parameters can be chosen from a fairly wide range. When the parameters are chosen suitably, the resulting blocks may form parallel classes. In general, if the elements of several blocks are exactly all the elements of the set V, with no element shared between blocks, these blocks are regarded as forming a parallel class. All the blocks obtained from a set may be divided into p parallel classes. In that case, this step may instead first obtain all the blocks of the p parallel classes of V and then select f of the p parallel classes, which accomplishes the block selection. For example, if each parallel class contains 3 blocks and all the blocks form 3 parallel classes, selecting two of the parallel classes amounts to selecting 6 of the 9 blocks. Here f is at most p, and f may be either set or calculated.
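The four-element example above can be reproduced in a few lines of Python. This is a sketch under one assumption that happens to hold in the example: the block size s equals the number of groups, so a block is obtained by picking one element from each group.

```python
from collections import Counter
from itertools import product

# The two groups of the example: V = {1, 2, 3, 4}.
groups = [(1, 2), (3, 4)]

# Pick one element from each group; every result is a valid block.
blocks = [tuple(choice) for choice in product(*groups)]

# Count how many blocks contain each element (the replication factor).
replication = Counter(elem for block in blocks for elem in block)

print(blocks)
print(dict(replication))
```

Each of the four blocks would be placed on its own storage node, and every coded block ends up on two nodes.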
It is worth noting that in this embodiment the parameters may all be given or set in advance, or some may be given and the remaining ones calculated. For example, the replication factor of the coded blocks can be obtained as $f = (v - t)/(s - 1)$, and the number of storage nodes as $n = v(v - t)/(s(s - 1))$, where v is the number of elements of the set V (here v = β).
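As a quick check of these two formulas, the following sketch computes f and n from v, t and s. The function name `fr_parameters` and the decision to reject non-integer results are assumptions made for illustration.

```python
def fr_parameters(v, t, s):
    """Return (f, n) with f = (v - t)/(s - 1) and n = v(v - t)/(s(s - 1))."""
    if s < 2:
        raise ValueError("s must be at least 2")
    f_num, f_den = v - t, s - 1
    n_num, n_den = v * (v - t), s * (s - 1)
    # Both quotients must be integers for the parameters to be usable.
    if f_num % f_den or n_num % n_den:
        raise ValueError("parameters do not give integer f and n")
    return f_num // f_den, n_num // n_den


print(fr_parameters(6, 2, 3))
print(fr_parameters(9, 3, 3))
```

For v = 6, t = 2, s = 3 this gives f = 2 and n = 4, which agrees with the four-node layout of Fig. 2 described below.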
Step S15: assign the selected blocks to the storage nodes. In this step, the blocks obtained above are stored on the storage nodes, one block per storage node. For example, if 2 parallel classes containing 6 blocks in total were selected in the previous step, the coded blocks represented by the 6 selected blocks are stored on 6 storage nodes. The amount of data stored on a storage node is the amount of data that one block contains (or points to). For example, if a block contains two coded-block numbers, that is, it contains (or points to) two coded blocks, then a storage node stores the data of two coded blocks, namely those two coded blocks.
In the above steps, α, β, t, s and f are all positive integers, and β is divisible by t.
In this embodiment, a distributed storage system is, on the whole, described by a triple (n, k, d), where n is the total number of nodes in the storage system, k is the minimum number of nodes needed to reconstruct the original file, and d is the number of available nodes needed to repair one failed node, with $k \le d \le n - 1$. Research on MDS codes is relatively mature, and MDS codes exist for almost any admissible parameters. The difficulty in constructing a partial replica code therefore lies in designing the inner replication code. In essence, an FR code is an arrangement on the nodes of θ data blocks with replication factor f, such that the copies of each data block are stored on different nodes.
A partial replica code C = (U, M) with replication factor f, for a distributed storage system with parameters (n, k, d), is a collection $M = \{M_1, \dots, M_n\}$ of n specific subsets, where the elements of each subset are drawn from the symbol set $U = \{1, \dots, \theta\}$ and the following two conditions hold:
(1) every subset has size d;
(2) every element of U belongs to exactly f subsets in M.
In this definition, the elements of each subset are the indices of the MDS-coded data blocks, and these data blocks are stored on the corresponding node $N_i$ ($i = 1, \dots, n$). Each subset thus corresponds to one storage node. All data blocks are distributed over n different nodes, and each node has storage capacity d.
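The two conditions of this definition are easy to verify for a concrete layout. The sketch below is an illustration, assuming the four-node layout that the description later associates with Fig. 2; the function name `is_fr_code` is hypothetical.

```python
from collections import Counter


def is_fr_code(subsets, theta, d, f):
    """Check the two FR-code conditions for subsets of {1, ..., theta}."""
    symbols = set(range(1, theta + 1))
    # Condition (1): every subset has size d and uses only valid symbols.
    if any(len(set(sub)) != d or not set(sub) <= symbols for sub in subsets):
        return False
    # Condition (2): every symbol appears in exactly f subsets.
    counts = Counter(elem for sub in subsets for elem in set(sub))
    return all(counts[sym] == f for sym in symbols)


layout = [{1, 3, 5}, {2, 3, 6}, {1, 4, 6}, {2, 4, 5}]
print(is_fr_code(layout, theta=6, d=3, f=2))
```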
Suppose $x = (x_1, \dots, x_5) \in \mathbb{F}_q^5$ denotes a file consisting of 5 data blocks, where $\mathbb{F}_q$ is a finite field of size q. MDS encoding with parameters (6, 5) outputs 6 coded blocks $y_1, \dots, y_6$, where $y_i = x_i$ for $i = 1, \dots, 5$ and $y_6 = \sum_{i=1}^{5} x_i$. Each coded block is replicated twice, and the resulting blocks are stored on 4 nodes, as shown in Fig. 2. The numbers in the boxes of Fig. 2 are the indices of the coded blocks; for example, node $N_1$ stores the three data blocks $y_1, y_3, y_5$. The data stored on any two nodes suffices to reconstruct the original file, so k = 2. When a node fails, it can be repaired by downloading data from the other three nodes, so d = 3.
Let v and λ be given positive integers, and let S and T be given sets of positive integers. Let D = (V, G, A) be a finite incidence structure, where V is a set of v elements and G is a partition of V. The elements of V are called points, the elements of A are called blocks, and the elements of G are called groups. Suppose the following conditions hold:
(1) $|B| \in S$ for every block $B \in A$;
(2) $|G_j| \in T$ for every group $G_j \in G$;
(3) $|B \cap G_j| \le 1$ for every block $B \in A$ and every group $G_j \in G$;
(4) every pair of elements of V that belong to different groups is contained in exactly λ blocks.
Then D is called a group divisible design, or GD design, denoted GD(S, λ, T; v). If all groups have the same capacity and all blocks have the same size, that is, S = {s} and T = {t}, then GD({s}, λ, {t}; v) is abbreviated as GD(s, λ, t; v) and is called a uniform group divisible design. If, for $1 \le i \le m$, G contains $u_i$ groups of capacity $t_i$, with $v = \sum_{i=1}^{m} u_i t_i$, then D is called a GD design of type $t_1^{u_1} t_2^{u_2} \cdots t_m^{u_m}$.
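The four conditions can be tested directly for the uniform case. The sketch below is an illustration only; the function name `is_uniform_gd_design` and its argument order are assumptions, and the sample data is the six-point design used in the description.

```python
from itertools import combinations


def is_uniform_gd_design(points, groups, blocks, s, lam, t):
    """Check the four conditions of a uniform GD(s, lam, t; v) design."""
    points = set(points)
    group_of = {}
    for idx, group in enumerate(groups):
        if len(group) != t:                      # condition (2): |G_j| = t
            return False
        for point in group:
            if point in group_of:                # groups must not overlap
                return False
            group_of[point] = idx
    if set(group_of) != points:                  # groups must cover V
        return False
    for block in blocks:
        if len(block) != s or not set(block) <= points:   # condition (1)
            return False
        if len({group_of[p] for p in block}) != s:        # condition (3)
            return False
    for p, q in combinations(sorted(points), 2):          # condition (4)
        if group_of[p] == group_of[q]:
            continue
        if sum(1 for block in blocks if p in block and q in block) != lam:
            return False
    return True


groups = [{1, 2}, {3, 4}, {5, 6}]
blocks = [{1, 3, 5}, {2, 3, 6}, {1, 4, 6}, {2, 4, 5}]
print(is_uniform_gd_design(range(1, 7), groups, blocks, s=3, lam=1, t=2))
```

Condition (4) is the expensive part: it compares every cross-group pair against every block, which is acceptable for the small designs used as examples here.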
Let D = (V, G, A) be a GD design and let $P \subseteq A$. If every point of V is incident with exactly one block of P, then P is called a parallel class. If all the blocks of a GD(s, λ, t; v) can be partitioned into parallel classes, the design is called a resolvable GD design. When v = st, a GD(s, λ, t; v) is called a λ-fold transversal design, denoted TD(s, λ; t), or TD design for short. For λ = 1, the existence of a transversal design TD(s, 1; t) is equivalent to the existence of mutually orthogonal Latin squares. If every group contains only one point, that is, t = 1, the design is equivalent to a Steiner system. Although a Steiner system is a special GD design, not every GD design is a Steiner system.
In a uniform GD(s, λ, t; v), every point of V belongs to the same number of blocks, denoted r and called the replication number of the design, which satisfies
$r = \lambda (v - t)/(s - 1)$.
Let b denote the total number of blocks of the GD design; then
$b = \lambda v (v - t)/(s(s - 1))$.
For example, let V = {1, 2, …, 6} and take three groups of equal size, G = {{1, 2}, {3, 4}, {5, 6}}; the blocks generated are {1, 3, 5}, {2, 3, 6}, {1, 4, 6}, {2, 4, 5}. Then (V, G, A) forms a uniform GD(3, 1, 2; 6), in which every point belongs to two different blocks, so r = 2 and b = 4.
To construct an FR code that achieves the system storage capacity under the random access model, λ = 1 should be taken in the GD design, so that every pair of points from different groups is contained in exactly one block. Since all nodes in the design have the same storage capacity, a uniform GD design is used. A GD design may have several isomorphic copies; this embodiment considers only one concrete design (corresponding to a specific grouping), and the construction method applies equally to all other isomorphic designs.
GDDFRC code construction: take a given GD(s, 1, t; v) with $t \ge 2$, and let its full set of blocks be $A = \{B_1, \dots, B_b\}$. An FR code C = (V, A) can then be generated. The parameters of the constructed FR code are $\theta = v$ and $f = (v - t)/(s - 1)$. The corresponding storage system has $n = v(v - t)/(s(s - 1))$ nodes, and each node stores d = s data blocks.
The replication factor f and the number of system nodes n are obtained from the formulas above. With the uniform GD(3, 1, 2; 6) design above, the constructed FR code is the one shown in Fig. 2. This system tolerates a single node failure while guaranteeing exact, uncoded data regeneration. If nodes $N_1$ and $N_2$ fail simultaneously, the original file must be reconstructed in order to recover coded block $y_3$. In general, for an FR code with replication factor f, the system can tolerate f − 1 node failures without losing the exact uncoded repair property, because at least one copy of every data block still exists in the system.
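The failure behaviour described above can be checked with a short sketch. It assumes the four nodes of Fig. 2 hold the blocks of the GD(3, 1, 2; 6) design in the order they are listed in the description; the function name `lost_blocks` is hypothetical.

```python
# Assumed layout of Fig. 2: node number -> indices of stored coded blocks.
LAYOUT = {1: {1, 3, 5}, 2: {2, 3, 6}, 3: {1, 4, 6}, 4: {2, 4, 5}}


def lost_blocks(layout, failed_nodes):
    """Return the coded blocks with no surviving copy after the given failures."""
    surviving = set()
    for node, stored in layout.items():
        if node not in failed_nodes:
            surviving |= stored
    all_blocks = set().union(*layout.values())
    return all_blocks - surviving


print(lost_blocks(LAYOUT, {1}))     # single failure: every block has a copy left
print(lost_blocks(LAYOUT, {1, 2}))  # double failure: one block has no copy left
```

An empty result means the failed nodes can be rebuilt by copying alone; a non-empty result means the missing blocks have to be regenerated by decoding the MDS code.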
For the case where multiple node failures are to be tolerated, let the element set be V = {1, 2, …, 8} and G = {{1, 2}, {3, 4}, {5, 6}, {7, 8}}, which yields 8 blocks:
{1, 3, 5}, {2, 4, 6}, {1, 4, 7}, {2, 3, 8},
{1, 6, 8}, {2, 5, 7}, {3, 6, 7}, {4, 5, 8}.
Take a file consisting of 6 data blocks, denoted $x_1, \dots, x_6$. MDS encoding with parameters (8, 6) outputs 8 coded blocks $y_1, \dots, y_8$. With this uniform GD design, the data blocks are stored in the system as shown in Fig. 3.
If the design used in the construction can be decomposed into p parallel classes, an FR code with replication factor f can be generated by selecting any f (f ≤ p) of them. Each parallel class contains all the elements of the symbol set, so as long as one complete parallel class remains in the system, node repair can proceed normally. Accordingly, if the GD design used in the GDDFRC construction is resolvable, the replication factor of the coded blocks and the number of nodes in the system can be chosen flexibly.
As another example, consider a uniform GD(3, 1, 3; 9) whose three groups are {1, 2, 3}, {4, 5, 6}, {7, 8, 9}. The 9 blocks generated by this design can be divided into 3 parallel classes (the blocks in each row form one parallel class):
{1, 4, 7}, {2, 5, 9}, {3, 6, 8};
{1, 6, 9}, {2, 4, 8}, {3, 5, 7};
{1, 5, 8}, {2, 6, 7}, {3, 4, 9}.
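Selecting parallel classes from the rows above can be sketched as follows. The function name `build_layout` and the sequential node numbering are assumptions for illustration; the three rows are copied from the design as listed.

```python
from collections import Counter

PARALLEL_CLASSES = [
    [{1, 4, 7}, {2, 5, 9}, {3, 6, 8}],
    [{1, 6, 9}, {2, 4, 8}, {3, 5, 7}],
    [{1, 5, 8}, {2, 6, 7}, {3, 4, 9}],
]


def build_layout(parallel_classes, chosen):
    """Assign the blocks of the chosen parallel classes to nodes 1..n."""
    blocks = [block for idx in chosen for block in parallel_classes[idx]]
    return {node: block for node, block in enumerate(blocks, start=1)}


layout = build_layout(PARALLEL_CLASSES, chosen=[0, 1])
copies = Counter(elem for block in layout.values() for elem in block)
print(len(layout))           # number of storage nodes n
print(set(copies.values()))  # replication factor of every coded block
```

Choosing two of the three rows gives six nodes with every coded block stored twice; choosing all three rows gives nine nodes with every coded block stored three times.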
If any two of these parallel classes are selected, the construction method yields a GDDFRC code with replication factor f = 2, suitable for a distributed storage system with parameters (6, 3, 3); if three parallel classes are taken, a GDDFRC code with replication factor f = 3 is generated, corresponding to storage system parameters (9, 3, 3). This flexibility in parameter selection is a considerable convenience for system design.
Referring to Figure 4, this embodiment also includes a device implementing the above method, comprising a coded block obtaining module 1, a set V construction module 2, a grouping module 3, a block construction module 4, and a data storage module 5. The coded block obtaining module 1 divides the data to be stored into α equal parts and applies MDS encoding with parameters (β, α) to obtain β coded blocks. The set V construction module 2 obtains the setting parameters, which include the number t of elements in each group and the number s of coded blocks stored in each storage node; it numbers the β coded blocks sequentially and takes them as the elements of the set V. The grouping module 3 groups the elements of the set V into β/t groups; the element numbers within a group are all different. The block construction module 4 derives all blocks of the set V from the groups and selects among them according to the setting parameters, obtaining n selected blocks; a block is a set whose elements are drawn from pairwise different groups, and the n blocks together contain the β coded blocks replicated f times. The data storage module 5 stores the coded blocks corresponding to the selected blocks in the storage nodes, each storage node storing the coded blocks of one selected block. Here α, β, t, s, and f are all positive integers, and β is divisible by t.
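The module pipeline can be sketched in a few lines for the three-group case. This is a minimal sketch under stated assumptions: the MDS encoding step is skipped and coded blocks are represented only by their numbers 1..β; the cyclic Latin-square rule used to generate blocks is a standard textbook construction chosen for illustration and is not taken from the source; the function names and the node labels N1, N2, ... are hypothetical. The relations f = (β − t)/(s − 1) and n = βf/s used below follow from counting pairs when every two elements from different groups share exactly one block.

```python
def design_parameters(beta, t, s):
    """Return (f, n): replication factor and node count when every pair of
    elements from different groups shares exactly one block."""
    f, rem_f = divmod(beta - t, s - 1)
    n, rem_n = divmod(beta * f, s)
    if rem_f or rem_n:
        raise ValueError("parameters do not admit such a design")
    return f, n


def make_groups(beta, t):
    """Split coded-block numbers 1..beta into beta // t groups of size t."""
    if beta % t:
        raise ValueError("beta must be divisible by t")
    return [list(range(g * t + 1, (g + 1) * t + 1)) for g in range(beta // t)]


def three_group_blocks(groups):
    """Blocks of size 3, one element per group, from a cyclic Latin square.

    Assumed construction (not from the source): block (i, j) takes element i
    of the first group, element j of the second and element (i + j) mod t of
    the third.
    """
    first, second, third = groups
    t = len(first)
    return [
        {first[i], second[j], third[(i + j) % t]}
        for i in range(t)
        for j in range(t)
    ]


def store(blocks):
    """Place one block on each storage node (labels N1, N2, ... are hypothetical)."""
    return {f"N{i + 1}": sorted(block) for i, block in enumerate(blocks)}


beta, t, s = 9, 3, 3
f, n = design_parameters(beta, t, s)
layout = store(three_group_blocks(make_groups(beta, t)))
# Number of nodes holding each coded block; should equal f for every block.
copies = {x: sum(x in held for held in layout.values()) for x in range(1, beta + 1)}
print(f, n, len(layout), set(copies.values()))
```

For β = 9, t = 3, s = 3 this generates the same nine blocks as the GD(3, 1, 3; 9) example, although not necessarily in the order in which a real deployment would assign them to nodes.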
In this embodiment, the block construction module 4 further includes a parallel class division unit 41 and a parallel class selection unit 42. The parallel class division unit 41 divides all the obtained blocks into p parallel classes: if the elements of several blocks together are exactly all the elements of the set V, and no two of these blocks share an element, these blocks form a parallel class. The parallel class selection unit 42 arbitrarily selects f of the p parallel classes to obtain the selected blocks, where f is less than p and the p parallel classes contain n blocks.
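The work of units 41 and 42 can be illustrated as follows. This is a hedged sketch, not the patent's implementation: the division step is written as a backtracking search that peels off one parallel class at a time, which is an assumed strategy that can get stuck on designs admitting several different resolutions, and the selection step simply keeps the first f classes. It accepts f up to p, matching the three-class example and claims 6 and 8. All function names are hypothetical.

```python
def find_parallel_class(blocks, points, used):
    """Backtracking search for unused, pairwise disjoint blocks covering all points."""

    def extend(chosen, covered):
        if covered == points:
            return chosen
        target = min(points - covered)  # this point must be covered next
        for idx, block in enumerate(blocks):
            if idx in used or target not in block or block & covered:
                continue
            found = extend(chosen + [idx], covered | block)
            if found is not None:
                return found
        return None

    return extend([], frozenset())


def split_into_parallel_classes(blocks):
    """Peel off parallel classes one at a time; None if this greedy split gets stuck."""
    points = frozenset().union(*blocks)
    used, classes = set(), []
    while len(used) < len(blocks):
        cls = find_parallel_class(blocks, points, used)
        if cls is None:
            return None
        used.update(cls)
        classes.append([blocks[i] for i in cls])
    return classes


def select_blocks(classes, f):
    """Keep the blocks of f parallel classes (here simply the first f)."""
    if not 1 <= f <= len(classes):
        raise ValueError("f must be between 1 and the number of parallel classes")
    return [block for cls in classes[:f] for block in cls]


# The nine blocks of the GD(3, 1, 3; 9) example, deliberately listed out of class order.
blocks = [
    {1, 4, 7}, {1, 6, 9}, {1, 5, 8},
    {2, 5, 9}, {2, 4, 8}, {2, 6, 7},
    {3, 6, 8}, {3, 5, 7}, {3, 4, 9},
]
classes = split_into_parallel_classes(blocks)
selected = select_blocks(classes, 2)
print(len(classes), len(selected))
```

Selecting two classes leaves six blocks, one per storage node, with every coded block held by exactly two nodes.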
This embodiment also relates to a method of data repair for the partial replica code obtained by the above method. The GDDFRC code has all the properties of an FR code: every data block has the same replication factor, and every node in the system has the same storage capacity. Notably, unlike the traditional random access mode, the GDDFRC code uses a table-based repair method. Specifically, the repair table lists the repair schemes available for each particular failed node. As shown in Figure 3, if node N8 fails, it can be repaired through nodes N2, N4, and N6, but not through nodes […], N2, and N3. Practical storage system deployments usually include a tracker server that records system metadata. The repair table can therefore be written into the metadata so that it can be read quickly during failure repair. Given how much it reduces the complexity of the repair process, the cost of building and maintaining the node repair table is worthwhile.
In addition, the partial replica code obtained by the method of this embodiment offers a relatively large choice of repair schemes. For an MDS code, when a node fails, k of the other n − 1 available nodes can be selected at random to download data from; the original file is reconstructed and then re-encoded for repair. Hence, for any single node failure, an MDS code has $\binom{n-1}{k}$ repair schemes. This number of available schemes for repairing a failed node is called the repair selectivity of the node.
In this embodiment, unlike the random access mode, the GDDFRC code uses table-based repair, where the table gives the specific repair schemes of each node. Because the f replicas of each data block are distributed over different nodes, and any pair of different data blocks is stored together on only one node, when a node fails the system can connect to the other nodes that store the same data blocks as the failed node and download replicas of the lost data blocks to regenerate a replacement node. It follows that, for a storage node holding s coded blocks, the system has $(f-1)^{s}$ failure repair schemes. Figure 5 shows the relationship between node repair selectivity and node storage capacity for a GDDFRC code with replication factor 3.
As the figure shows, although GDDFRC repair is table-based, its node repair selectivity can still reach a very high level. For a GDDFRC code with a fixed replication factor, the node repair selectivity grows exponentially with the node storage capacity.
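The two selectivity counts can be compared on a small instance. The sketch below is illustrative only: it assumes the nine blocks of the GD(3, 1, 3; 9) example are stored on nodes labelled N1 to N9 in row order (a hypothetical assignment), enumerates every repair scheme for one failed node, and evaluates the closed-form counts $\binom{n-1}{k}$ and $(f-1)^{s}$.

```python
from itertools import product
from math import comb


def mds_repair_choices(n, k):
    """Ways to pick k helper nodes among the n - 1 survivors of an MDS-coded system."""
    return comb(n - 1, k)


def fr_repair_choices(f, s):
    """Each of the s lost blocks can be fetched from any of its f - 1 other replicas."""
    return (f - 1) ** s


# Hypothetical node assignment: one block of the example design per node.
nodes = {
    "N1": {1, 4, 7}, "N2": {2, 5, 9}, "N3": {3, 6, 8},
    "N4": {1, 6, 9}, "N5": {2, 4, 8}, "N6": {3, 5, 7},
    "N7": {1, 5, 8}, "N8": {2, 6, 7}, "N9": {3, 4, 9},
}


def repair_schemes(failed, nodes):
    """All ways to pick one surviving holder for each block of the failed node."""
    lost = sorted(nodes[failed])
    holders = [
        [name for name, held in nodes.items() if name != failed and b in held]
        for b in lost
    ]
    return [dict(zip(lost, choice)) for choice in product(*holders)]


schemes = repair_schemes("N1", nodes)
print(len(schemes), fr_repair_choices(3, 3), mds_repair_choices(9, 6))
```

Each scheme maps a lost block number to the node it is copied from, which is also the natural shape of one row of a repair table.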
In one example of this embodiment, the GDDFRC code proposed by the invention was implemented on the widely used Hadoop Distributed File System, covering file encoding and decoding as well as failure recovery. The system server in the experiment had an Intel(R) Xeon(R) E5-2609 2.40 GHz CPU and 24 GB of memory. Ordinary PCs (AMD A8-5600k 3.0 GHz CPU, 4 GB of memory) served as data storage nodes, all configured with the same experimental environment, and no node ran any other job during the experiments. With equal node storage capacity, the repair times of the GDDFRC code, the classic RS code, and the MBR code were compared for different parameter values.
First, the number of nodes was set to n = 9, with the data stored on any 6 nodes sufficient to reconstruct the original file. A GDDFRC code with replication factor 2 was used, and the single-node failure repair time of the three codes was measured for node storage capacities of 100 MB, 200 MB, and 300 MB. Each test was run several times under identical conditions and the results averaged; they are shown in Figure 6. As the figure shows, the GDDFRC code substantially reduces node failure recovery time compared with the RS and MBR codes.
Next, the number of nodes was set to 6 and […] = 10. A GDDFRC code with replication factor 3 was used; the results are shown in Figure 7. As node storage capacity increases, the repair-time advantage of the GDDFRC code becomes more pronounced.
In traditional RS code repair, the original file must be recovered and re-encoded, and the resulting coded blocks are then stored on the replacement node, so repair takes comparatively long. With minimum-bandwidth regenerating (MBR) codes, the nodes participating in the repair perform linear operations on their stored data and send the combined data blocks to the replacement node, which further combines all received blocks to recover the lost data. The whole process involves a large number of finite field operations, which increases repair time. With the GDDFRC code, when a node failure is detected, the system first determines which node has failed and then determines the repair scheme from the GDDFRC repair table (stored in the system metadata). It then connects to the available nodes specified in the scheme, downloads the corresponding data blocks, and stores them directly on the replacement node. The whole repair process thus involves only file reads and introduces no other complex operations. Although this increases the system's redundancy to some extent, the results show that the GDDFRC code can substantially reduce failure repair time.
Compared with traditional RGC codes, the greatest advantage of the group divisible design based partial replica code (GDDFRC) is that it significantly reduces the computational complexity of encoding and decoding, replacing complex finite field operations with simple, easily implemented data replication. Traditional RGC codes are constructed over a finite (Galois) field, and their encoding and decoding involve finite field addition, subtraction, and multiplication. Although the theory of finite field arithmetic is mature, in practice it is cumbersome and time-consuming, and clearly cannot meet the fast and reliable design goals of today's distributed storage systems. The GDDFRC code is different: a failed node is repaired by downloading data directly from other nodes and storing it on the replacement node, with no additional computation. This greatly increases the speed of node repair and data block regeneration and gives the code high practical value and development potential in real distributed storage systems.
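The table-driven repair flow described above (identify the failed node, look up its scheme, copy blocks to the replacement) can be sketched as follows. This is a toy in-memory model under stated assumptions, not the Hadoop implementation used in the experiments: storage is a dictionary of byte strings, the repair table would in practice live in the tracker server's metadata, only a single node failure is handled, and the function names and node labels are hypothetical.

```python
def build_repair_table(layout):
    """For every node, list for each of its blocks the other nodes holding a replica."""
    return {
        node: {
            block: [other for other, held in layout.items()
                    if other != node and block in held]
            for block in sorted(blocks)
        }
        for node, blocks in layout.items()
    }


def repair(failed, storage, table):
    """Rebuild a failed node by copying each lost block from a surviving helper."""
    rebuilt = {}
    for block, helpers in table[failed].items():
        helper = next(h for h in helpers if h in storage)  # first helper still online
        rebuilt[block] = storage[helper][block]            # plain copy, no decoding
    return rebuilt


# Two parallel classes of the GD(3, 1, 3; 9) example, i.e. replication factor 2.
layout = {
    "N1": {1, 4, 7}, "N2": {2, 5, 9}, "N3": {3, 6, 8},
    "N4": {1, 6, 9}, "N5": {2, 4, 8}, "N6": {3, 5, 7},
}
storage = {
    node: {b: f"coded-block-{b}".encode() for b in blocks}
    for node, blocks in layout.items()
}
table = build_repair_table(layout)

original = storage.pop("N1")                  # node N1 fails
storage["N1"] = repair("N1", storage, table)  # replacement node takes its place
print(table["N1"], storage["N1"] == original)
```

The repaired node ends up byte-for-byte identical to the failed one, and no arithmetic on block contents is performed at any point.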
The group divisible design based partial replica code not only reduces the computational complexity of node repair but also ensures that the bandwidth consumed during node repair is minimal (namely, the size of the original file), with no excess bandwidth consumed. Now that bandwidth is an increasingly precious resource, the benefit of the GDDFRC code is evident. In this embodiment, the GDDFRC code guarantees that a lost coded block can be repaired by directly downloading subsets of other coded blocks, and that a lost coded block can be repaired from a fixed number of coded blocks, with the repair mode based on a table. Moreover, the data stored on a node repaired with the GDDFRC code is identical to that of the failed node, i.e. the repair is exact, which greatly reduces system operation complexity (for example metadata updates and broadcasting of updated data).
The embodiments above are described in specific detail, but this should not be understood as limiting the scope of the patent. It should be noted that persons of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, all of which fall within the scope of protection of the invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims

1. A method for constructing a partial replica code, characterized by comprising the following steps:
A) dividing the data to be stored into α equal parts and applying MDS encoding with parameters (β, α) to obtain β coded blocks;
B) obtaining setting parameters, the parameters including the number t of elements in each group and the number s of coded blocks stored in each storage node; numbering the β coded blocks sequentially and taking them as the elements of a set V, thereby obtaining the set V;
C) grouping the elements of the set V to obtain β/t groups, the element numbers within a group all being different;
D) obtaining all blocks of the set V from the groups obtained in the preceding step, and selecting among all the blocks according to the setting parameters to obtain n selected blocks; a block being a set whose elements are drawn from pairwise different groups; the n blocks together containing the β coded blocks replicated f times;
E) storing the coded blocks corresponding to the selected blocks in storage nodes, each storage node storing the coded blocks corresponding to one selected block;
wherein α, β, t, s, and f are all positive integers, and β is divisible by t.
2. The method for constructing a partial replica code according to claim 1, characterized in that, in step C), all the groups obtained form a set G, and the set G is a partition of the set V.
3. The method for constructing a partial replica code according to claim 2, characterized in that all the blocks obtained in step D) are such that every element of the set V occurs in f blocks.
4. The method for constructing a partial replica code according to claim 3, characterized in that all blocks have the same size and all groups have the same capacity.
5. The method for constructing a partial replica code according to claim 4, characterized in that, in step B), the coded block replication factor f is obtained as $f = (\beta - t)/(s - 1)$, and the number of storage nodes n is obtained as $n = \beta(\beta - t)/(s(s - 1))$.
6. The method for constructing a partial replica code according to claim 5, characterized in that step D) further comprises the following steps:
D1) dividing all the obtained blocks into p parallel classes, where several blocks form a parallel class if their elements together are exactly all the elements of the set V and no two of the blocks share an element;
D2) arbitrarily selecting f of the p parallel classes to obtain the selected blocks, where f is less than or equal to p, and the p parallel classes contain n blocks.
7. A device implementing the method for constructing a partial replica code according to claim 1, characterized by comprising:
a coded block obtaining module, configured to divide the data to be stored into α equal parts and apply MDS encoding with parameters (β, α) to obtain β coded blocks;
a set V construction module, configured to obtain setting parameters, the parameters including the number t of elements in each group and the number s of coded blocks stored in each storage node, and to number the β coded blocks sequentially and take them as the elements of a set V, thereby obtaining the set V;
a grouping module, configured to group the elements of the set V to obtain β/t groups, the element numbers within a group all being different;
a block construction module, configured to obtain all blocks of the set V from the obtained groups and to select among all the blocks according to the setting parameters to obtain n selected blocks; a block being a set whose elements are drawn from pairwise different groups; the n blocks together containing the β coded blocks replicated f times;
a data storage module, configured to store the coded blocks corresponding to the selected blocks in storage nodes, each storage node storing the coded blocks corresponding to one selected block;
wherein α, β, t, s, and f are all positive integers, and β is divisible by t.
8. The device according to claim 7, characterized in that the block construction module further comprises:
a parallel class division unit, configured to divide all the obtained blocks into p parallel classes, where several blocks form a parallel class if their elements together are exactly all the elements of the set V and no two of the blocks share an element;
a parallel class selection unit, configured to arbitrarily select f of the p parallel classes to obtain the selected blocks, where f is less than or equal to p, and the p parallel classes contain n blocks.
9. A method of data repair for data obtained using the method for constructing a partial replica code according to claim 1, characterized by comprising the following steps:
M) obtaining a repair table and looking up the repair scheme using the number of the failed node as the index;
N) downloading the node data indicated by the repair table to obtain the replacement node data, and generating the replacement node.
10. The data repair method according to claim 9, characterized in that the repair table is stored in the system metadata of a tracker server in the storage system, and the repair table contains at least one repair scheme for each node.
PCT/CN2014/078539 2014-05-27 2014-05-27 Partial replica code construction method and device, and data recovery method therefor WO2015180038A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/078539 WO2015180038A1 (en) 2014-05-27 2014-05-27 Partial replica code construction method and device, and data recovery method therefor
CN201480078750.9A CN107003933B (en) 2014-05-27 2014-05-27 Method and device for constructing partial copy code and data restoration method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/078539 WO2015180038A1 (en) 2014-05-27 2014-05-27 Partial replica code construction method and device, and data recovery method therefor

Publications (1)

Publication Number Publication Date
WO2015180038A1 true WO2015180038A1 (en) 2015-12-03

Family

ID=54697824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/078539 WO2015180038A1 (en) 2014-05-27 2014-05-27 Partial replica code construction method and device, and data recovery method therefor

Country Status (2)

Country Link
CN (1) CN107003933B (en)
WO (1) WO2015180038A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018209541A1 (en) * 2017-05-16 2018-11-22 北京大学深圳研究生院 Coding structure based on t-design fractional repetition codes, and coding method
CN109257049A (en) * 2018-08-09 2019-01-22 东莞理工学院 A kind of building method and restorative procedure for repairing binary array code check matrix

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032470B (en) * 2019-03-18 2023-02-28 长安大学 Method for constructing heterogeneous partial repeat codes based on Huffman tree
CN110532125A (en) * 2019-07-15 2019-12-03 长安大学 A kind of part repetition code constructing method decomposed based on factor of diagram
CN111125014B (en) * 2019-11-19 2023-02-28 长安大学 Construction method of flexible partial repeat code based on U-shaped design
CN111290710B (en) * 2020-01-20 2024-04-05 北京信息科技大学 Cloud copy storage method and system based on dynamic adjustment of replication factors
CN112799605B (en) * 2021-03-31 2021-06-29 中南大学 Square part repeated code construction method, node repair method and capacity calculation method
CN113157485B (en) * 2021-05-06 2022-07-15 中南大学 Expansion construction method of partial repetition code

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102932331A (en) * 2012-09-29 2013-02-13 南京云创存储科技有限公司 Super-safe-storage coding/decoding method applicable to distributed storage system
WO2013090233A1 (en) * 2011-12-12 2013-06-20 Cleversafe, Inc. Distributed computing in a distributed storage and task network
CN103559102A (en) * 2013-10-22 2014-02-05 北京航空航天大学 Data redundancy processing method and device and distributed storage system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100138717A1 (en) * 2008-12-02 2010-06-03 Microsoft Corporation Fork codes for erasure coding of data blocks
CN102624866B (en) * 2012-01-13 2014-08-20 北京大学深圳研究生院 Data storage method, data storage device and distributed network storage system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013090233A1 (en) * 2011-12-12 2013-06-20 Cleversafe, Inc. Distributed computing in a distributed storage and task network
CN102932331A (en) * 2012-09-29 2013-02-13 南京云创存储科技有限公司 Super-safe-storage coding/decoding method applicable to distributed storage system
CN103559102A (en) * 2013-10-22 2014-02-05 北京航空航天大学 Data redundancy processing method and device and distributed storage system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU, BING ET AL.: "General Fractional Repetition Codes for Distributed Storage Systems", IEEE COMMUNICATIONS LETTERS, vol. 18, no. 4, 30 April 2014 (2014-04-30), pages 660 - 663, XP011545432 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018209541A1 (en) * 2017-05-16 2018-11-22 北京大学深圳研究生院 Coding structure based on t-design fractional repetition codes, and coding method
CN109257049A (en) * 2018-08-09 2019-01-22 东莞理工学院 A kind of building method and restorative procedure for repairing binary array code check matrix
CN109257049B (en) * 2018-08-09 2020-11-06 东莞理工学院 Construction method for repairing binary array code check matrix and repairing method

Also Published As

Publication number Publication date
CN107003933A (en) 2017-08-01
CN107003933B (en) 2020-12-08

Similar Documents

Publication Publication Date Title
WO2015180038A1 (en) Partial replica code construction method and device, and data recovery method therefor
US10514971B2 (en) Dispersed b-tree directory trees
US10289488B1 (en) System and method for recovery of unrecoverable data with erasure coding and geo XOR
US10241695B2 (en) Optimizing rebuilds when using multiple information dispersal algorithms
US9330137B2 (en) Cloud data backup storage manager
CN109491835B (en) Data fault-tolerant method based on dynamic block code
WO2010033644A1 (en) Matrix-based error correction and erasure code methods and apparatus and applications thereof
CN105703782B (en) A kind of network coding method and system based on incremental shift matrix
CN112799605B (en) Square part repeated code construction method, node repair method and capacity calculation method
CN107153661A (en) A kind of storage, read method and its device of the data based on HDFS systems
CN109478125B (en) Manipulating a distributed consistency protocol to identify a desired set of storage units
Ivanichkina et al. Mathematical methods and models of improving data storage reliability including those based on finite field theory
WO2014059651A1 (en) Method for encoding, data-restructuring and repairing projective self-repairing codes
CN108647108B (en) Construction method of minimum bandwidth regeneration code based on cyclic VFRC
JP2018524705A (en) Method and system for processing data access requests during data transfer
WO2018209541A1 (en) Coding structure based on t-design fractional repetition codes, and coding method
CN110781163B (en) Heterogeneous part repeated code construction and fault node repairing method based on complete graph
CN112667443A (en) User-oriented variable distributed storage copy fault tolerance method
Subedi et al. FINGER: a novel erasure coding scheme using fine granularity blocks to improve Hadoop write and update performance
Datta et al. Storage codes: Managing big data with small overheads
Xu et al. CRL: Efficient Concurrent Regeneration Codes with Local Reconstruction in Geo-Distributed Storage Systems
CN113157485B (en) Expansion construction method of partial repetition code
Akutsu et al. MEC: Network optimized multi-stage erasure coding for scalable storage systems
CN112988454B (en) Expansion part repeated code construction method
Zhang et al. BEC: A reliable and efficient mechanism for cloud storage service

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14893377

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14893377

Country of ref document: EP

Kind code of ref document: A1