WO2020199162A1 - Rack-aware regenerating code for data center - Google Patents

Rack-aware regenerating code for data center

Info

Publication number
WO2020199162A1
Authority
WO
WIPO (PCT)
Prior art keywords
rack
code
repair
node
nodes
Prior art date
Application number
PCT/CN2019/081267
Other languages
French (fr)
Chinese (zh)
Inventor
侯韩旭
李柏晴
Original Assignee
东莞理工学院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东莞理工学院 filed Critical 东莞理工学院
Priority to CN201980000640.3A priority Critical patent/CN111971945A/en
Priority to PCT/CN2019/081267 priority patent/WO2020199162A1/en
Publication of WO2020199162A1 publication Critical patent/WO2020199162A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40: Support for services or applications

Definitions

  • The invention belongs to the field of improvements in digital coding technology, and in particular relates to a rack-aware regenerating code for data centers.
  • Modern storage systems are usually deployed as data centers, in which data is distributed over a large number of storage nodes in different racks. Examples include the Google File System [1], Windows Azure Storage [2], and Facebook's storage system [3].
  • Erasure codes are widely used in modern storage systems to encode data. Compared with traditional replication, erasure codes offer higher fault tolerance with lower storage redundancy.
  • The Reed-Solomon (RS) code [4] is the most widely used erasure code (for example, at Google [1]).
  • An (n, k) RS code encodes a file of k data blocks (a block being the unit of the erasure-coding operation) into n coded blocks over a finite field and distributes these n blocks over n different nodes (where k < n). During reconstruction, a data collector rebuilds the data file by retrieving any k of the n blocks.
  • RS codes have two important practical properties: (i) they achieve minimum storage redundancy while tolerating any n-k node failures, and (ii) they support arbitrary values of n and k (k < n).
  • When a node is lost, every data block stored in it must be repaired in a new node to maintain the same level of fault tolerance.
  • The conventional repair method is also used with RS codes: first reconstruct the data file, then re-encode it to regenerate the lost data block. Consequently, for an (n, k) RS code, repairing one lost data block requires downloading k data blocks (k times the amount of lost data), which increases network bandwidth and I/O.
  • Regenerating codes (RC) were proposed by Dimakis et al. [5] with the goal of optimizing network bandwidth during repair.
  • An RC encodes a data file into a number of data blocks that is a multiple of n and distributes them over n nodes, so that each node stores several data blocks. The data file can then be reconstructed from a sufficient number of nodes.
  • To repair a lost node, the new node downloads coded data blocks from a selected subset of the surviving nodes.
  • The total amount of coded data downloaded from the surviving nodes is called the repair bandwidth, and it is much smaller than the size of the original data file. Dimakis et al. [5] also characterized the optimal trade-off between repair bandwidth and storage redundancy.
  • For some parameters, DRC codes achieve lower cross-rack repair bandwidth than RC codes.
  • DRC codes are built on the condition of minimum storage redundancy (as with RS codes). As with the optimality analysis of RC codes [5], many questions about the optimal trade-off between storage redundancy and cross-rack repair bandwidth remain open.
  • When kr/n is an integer, the trade-off curve of the MSRR (Minimum Storage Rack-aware Regenerating) code is the same as that of the minimum storage regenerating (MSR) code.
  • MSRR: Minimum Storage Rack-aware Regenerating.
  • When kr/n is not an integer, the cross-rack repair bandwidth of the MSRR code is strictly smaller than that of the MSR code.
  • MBR: Minimum Bandwidth Regenerating.
  • For all parameters, the cross-rack repair bandwidth of the RRC code is less than or equal to that of the codes in [7], and for most parameters it is also less than that of the codes in [8] (see Section VII-A).
  • The purpose of the present invention is to provide a rack-aware regenerating code for data centers, to solve the problem that existing regenerating codes do not minimize the cross-rack repair bandwidth.
  • Each rack is provided with a relay node, and each relay node can access the information stored in the other nodes of the same rack.
  • A further technical solution of the present invention is that the relay node is any surviving node selected from the rack, and different data files may be associated with different relay nodes during the repair operation.
  • A further technical solution of the present invention is: when a storage node fails, the rack-aware regenerating code generates a new node to replace it, and the new node is placed in the same rack. The new node arbitrarily selects d other racks, where d < r, and connects to the relay node of each selected rack. Each of these relay nodes sends β symbols to the new node.
  • The new node regenerates the lost symbols from the dβ symbols it receives and the (n/r-1)α symbols stored in the other nodes of the rack where the failed node is located.
  • A further technical solution of the present invention is that, when the relay node is connected to the data collector, the rack-aware regenerating code supports exact repair and functional repair. In exact repair, the symbols stored in the new node are the same as those stored in the failed node; in functional repair, the new node may contain symbols different from those of the failed node, as long as the decoding property is satisfied.
  • The beneficial effects of the present invention are: by reconstructing part of the repair data inside each rack and combining the partial repair data from different racks, the cross-rack repair bandwidth is minimized. Because the regenerating code addresses repair across multiple racks rather than only within a single rack, it has greater practical engineering significance.
  • This work studies the optimal trade-off between storage redundancy and cross-rack repair bandwidth in a data center.
  • The scope of the study is broader than that of earlier work, which considered only some special points of the regenerating code.
  • The minimum-storage construction supports a wide range of parameters, and in most cases its cross-rack repair bandwidth is strictly lower than the repair bandwidth of the classic minimum storage regenerating code. The minimum cross-rack repair bandwidth construction supports all parameters, and for most parameters its cross-rack repair bandwidth is smaller than that of the minimum bandwidth regenerating code.
  • Fig. 1 is a schematic diagram of a rack-aware regenerating code used in a data center provided by an embodiment of the present invention.
  • The flexible RC code [19] is designed for heterogeneous storage systems and can reach the lower bound on repair bandwidth. Combined with a tree-structured regeneration topology, RC codes can further save network bandwidth [20], [21]. Some studies, such as [16] and [22], focus on the capacity limits of heterogeneous models. However, none of the above studies distinguishes between the two kinds of communication overhead in a data center, intra-rack and cross-rack.
  • Table 1 compares and analyzes RRC codes and some closely related researches for erasure code data centers.
  • DRC codes [6], [18] consider the same model as this paper, and achieve a compromise between storage and cross-rack repair bandwidth under minimum storage conditions.
  • the DRC code can be seen as a special case of the MSRR code when all other racks are connected to repair a lost node.
  • Sohn et al. [8] considered a different repair model and provided the best compromise between storage and repair bandwidth (including cross-rack repair bandwidth and intra-rack repair bandwidth).
  • the compromise curve of the RRC code is the same as the optimal compromise curve of [7], but the precise repair construction method of the MSRR code and the MBRR code is more than [
  • the minimum storage code and minimum bandwidth code in [7] only support when k is a multiple of the number of nodes in each rack, while our MSRR code and MBRR code do not have this restriction on k.
  • k+1 will not be a multiple of the number of nodes in each rack.
  • the same method is applied to the two-layer encoding of the data center.
  • the first layer uses an (n, k) MDS code to encode the data document, and then distributes it to n nodes.
  • the second layer creates data blocks stored in each node through the MDS code with the coding rate ⁇ . If the proportion of failed data blocks in the data blocks stored in the node is not greater than 1- ⁇ (that is, some nodes fail), the failed data blocks can be recovered by the failed node itself. Otherwise, there is a trade-off between storage and repair bandwidth.
  • A failed node can be repaired by downloading all other symbols from its host rack and β symbols from each of any d other racks.
  • The data file can be reconstructed by a data collector that downloads kα symbols from any k nodes.
  • n is a multiple of r
  • i-th node in rack h as X h,i .
  • All operations in this paper are based on a finite field of size q, and a data file is regarded as a vector of length B in the finite field.
  • A data file is encoded into nα symbols and stored in n nodes. Each node stores α symbols.
  • In each rack there is a relay node, which can obtain the information stored in the other nodes of the same rack. Information exchange within a rack is assumed to be very fast, so the transmission overhead between nodes in the same rack is very low. If a storage node fails, a new node is generated to replace it and is placed in the same rack.
  • The new node arbitrarily selects d other racks, where d < r, and connects to the relay node of each selected rack.
  • The relay node can be any surviving node selected from the rack, and different data files can be associated with different relay nodes during the repair operation.
  • A node failure is treated as a partial failure of its rack. The failed node is repaired by downloading β symbols from each of any d other racks, and (n/r-1)α symbols from the other n/r-1 nodes in the same rack.
  • A coding scheme with parameters n, k, r, d, α and β that meets the above requirements is called a rack-aware storage system code, RSS(n, k, r, d, α, β).
  • Table 2 summarizes the main symbols used in this article.
  • an information flow graph is a directed acyclic graph (DAG) constructed according to the following rules.
  • The storage system may experience a series of node failures and repairs, and the above process is repeated accordingly. Finally, we draw k edges from k out-vertices to T. If T is connected to the vertex Out_{h,1} of the relay node of rack h, then by convention T is also connected to all vertices Out_{h,2}, ..., Out_{h,n/r} in rack h.
  • The DAG obtained as described above is called an information flow graph and is denoted by G(n, k, r, d, α, β).
  • The MSRR point is obtained by first minimizing α and then minimizing β, and the MBRR point is obtained by first minimizing β and then minimizing α.
  • the MSRR code degenerates to a minimum storage regeneration (MSR) code
  • the MBRR code degenerates to a minimum bandwidth regeneration (MBR) code.
  • the cross-rack repair bandwidth ⁇ is equal to the storage ⁇
  • the amount of data downloaded from other racks is the same as the size of the failed data.
  • MSRR codes (MBRR codes) have lower cross-rack repair bandwidth than MSR codes (MBR codes).
  • the cross-rack repair bandwidth of the MBRR code is smaller than that of the MBR code if and only if
  • The MBRR code has lower cross-rack repair bandwidth than the MBR code. In other words, if the code rate is not too low, the cross-rack repair bandwidth of the MBRR code is strictly less than that of the MBR code.
  • The cross-rack repair bandwidth of the RC code increases as r increases.
  • The cross-rack repair bandwidth of the RRC code increases as r increases, just as for the RC code.
  • The cross-rack repair bandwidth of the RRC(n,k,r') code is strictly smaller than that of the RRC(n,k,r) code.
  • For the same parameters, the RRC code has lower cross-rack repair bandwidth than the RC code.
  • kr/n is an integer
  • the cross-rack repair bandwidth of the MSRR code and the MSR code are the same. According to Theorem 4.
  • the cross-rack repair bandwidth of MBRR code is greater than that of MBR code
  • the cross-rack repair bandwidth of MBRR code is the same as that of MBR code.
  • A systematic code is a code in which the kα uncoded data symbols are stored in k nodes. Assume that the first k nodes are data nodes that store uncoded data, and the last n-k nodes are coding nodes that store coded data.
  • section V-C its construction method is used for MSRR codes, and satisfies ⁇ n/r ⁇ m+ ⁇ t.
  • The construction in Section V-C achieves the optimal cross-rack repair bandwidth only for the data nodes. If kr is a multiple of n, the k data nodes are stored in the first m racks, called data racks, and the n-k coding nodes are placed in the last r-m racks, called coding racks. If kr is not a multiple of n, the first m racks are data racks, and the last r-m-1 racks are coding racks.
  • When kr/n is an integer, the MSRR code is called a homogeneous MSRR code; when kr/n is not an integer, the MSRR code is called a hybrid MSRR code.
  • Q i,h is the coding matrix of B ⁇ , and the rank h of the coding matrix G h is defined as,
  • the new node When a node in the frame f fails, the new node will access all the nodes in the frame f (n/r-1) ⁇ symbol information, and from the frame Download coding symbols and local coding vectors among them Is a row vector whose length is not greater than ⁇ n/r sub-matrix.
  • The sub-matrix of G consisting of rows i1 to i2 and columns j1 to j2 is written G[(i1,i2),(j1,j2)]. Likewise, G[(i1,i2),(:)] and G[(:),(j1,j2)] denote the sub-matrices of G consisting of rows i1 to i2 and of columns j1 to j2, respectively.
  • the example in Figure 4 shows the repair of a data node.
  • five symbols are downloaded, and only the coded symbol (called the desired symbol) downloaded from the rack 5 is linearly dependent on s 1 .
  • The new node accesses all other symbols in the host rack. It downloads (i) m-1 coded symbols from the m-1 data racks, (ii) one coded symbol from the hybrid rack, and (iii) d-m coded symbols from d-m coding racks.
  • The d-m+1 desired symbols come from the hybrid rack and the d-m coding racks.
  • the row vectors v 1 , v 2 ,..., v m are orthogonal to the length m.
  • the size of the vectors u 1 , u 2 ,..., u ⁇ is 1 ⁇ n/r
  • E 1 , E 2 ,...,E m be the ⁇ n/r ⁇ m matrix
  • F 1 ,F 2 ,... ,F ⁇ is the (B-m ⁇ n/r) ⁇ m matrix
  • R 1 , R 2 ,..., R ⁇ is the B ⁇ ( ⁇ n/rm) matrix.
  • y 1 ,y 2 ,...,y m are m orthogonal vectors with length ⁇ n/r–m.
  • x 2 ,x 3 ,...,x ⁇ are ⁇ -1 vectors of length ⁇ n/r.
  • D i,1 ,D i,2 ,...,D i,m are matrices of size ⁇ n/r ⁇ ( ⁇ n/rm).
  • C i is a matrix of size (B-m ⁇ n/r) ⁇ ( ⁇ n/rm).
  • the proposed code satisfies the (n, k) decoding characteristics if and only if the character can be repaired from any k node. This is equivalent to all ⁇ sub-matrices of the matrix
  • the missing ⁇ symbol can be repaired because the ⁇ sub-matrix corresponding to the matrix in (9) is non-singular.
  • ⁇ 1,1 and ⁇ 1,2 are two non-zero elements.
  • the coding matrix G 4 is
  • the (n, k) repair characteristic may be that if there are B independent symbols in the kd symbol, that is, if there are at most m(m-1)/2 related symbols in any k nodes, the repair characteristic is satisfied.
  • PM product matrix
  • The B data symbols are divided into two parts: the first part has (k-m)d data symbols, and the second part has md-m(m-1)/2 data symbols.
  • The first part is encoded by a PM-MBR(r, d, d) code to generate dr code symbols.
  • The generated code symbols are divided into r groups, each with d code symbols. For each group, the code symbols in the group are linearly combined with all (n/r-1)d symbols stored in the last (n/r-1) nodes of the rack, where each coding vector is a column vector of length (n/r-1)d+1, and the resulting d code symbols are stored in the first node of the rack.
  • matrix Q it is stored in the last (n/r-1) nodes of the rack r, represented by a matrix M1 of r ⁇ (n/r-1)d.
  • We can choose the matrix Q to be a Cauchy matrix, so that if n-r ≥ k, any k nodes among the n-r nodes (the last n/r-1 nodes of each of the r racks) are sufficient to reconstruct the B data symbols.
  • The matrix S_1 is a symmetric m × m matrix built from m(m+1)/2 data symbols: the upper-triangular part is filled with the data symbols first, and the lower-triangular part is then obtained by reflection along the diagonal.
  • S_2^T is the transpose of S_2, and 0 is the (d-m) × (d-m) all-zero matrix.
  • B = 23 data symbols.
  • the first 18 data symbols and 6 global code symbols are stored in the last two nodes of each rack, and the 4 local code symbols are stored in the first node in each rack.
  • Any k nodes can recover the B data symbols, and the α symbols stored in any node can be repaired with the optimal cross-rack repair bandwidth of the MBRR code.
  • the product of all determinants is a polynomial, and the maximum total degree is (16).
  • a node in rack f fails, where f ∈ {1, 2, ..., r}.
  • the relay node of the auxiliary rack hi accesses all the symbols stored in the rack to repair Calculate and send code symbols To the new node.
  • the new node contains d code symbols
  • the new node can calculate the code symbol ⁇ f M 2 , just like the repair process of the PM-MBR code, the new node can repair the faulty node by accessing all other symbols in the rack f.
  • the upper limit of the field size in Theorem 8 is an exponential relationship of k.
  • We search through calculations, we can always find P and ⁇ , for the example (n,k,r,d) (12,8,4,3), the field size is 11, any k nodes can reconstruct B Data symbol.
  • The MBRR code has lower cross-rack repair bandwidth than the MBR code. Therefore, if the code rate is not too low, the cross-rack repair bandwidth of the MBRR code is strictly smaller than that of the MBR code.
  • The cross-rack repair bandwidth of the code in [7] is equal to the cross-rack repair bandwidth of our RRC code.
  • The MSRR(n,k+i,r) code (MBRR(n,k+i,r) code) has lower cross-rack repair bandwidth than the MSRR(n,k,r) code (MBRR(n,k,r) code).
  • ⁇ c are the number of symbols downloaded from the main rack and the auxiliary nodes in other racks, respectively.
  • ⁇ msr is the cross-rack repair bandwidth and links all n-1 surviving nodes.
  • the cross-rack repair bandwidth of our RRC is strictly smaller than the code in [8], and if kr/n is an integer, it is the same as the code in [7]. In addition, RRC can tolerate more failure modes than the codes in [7].
  • kr/n is an integer
  • The cross-rack repair bandwidth of the hybrid MSRR(n,k+i,r) code, where i = 1, 2, ..., n/r-1, is less than that of the MSR(n,k,r) code and of the minimum-storage (n,k,r) code in [7].
  • kr/n should be an integer.
  • The ranges of k supported by the MSRR code, the clustering code in [7], and DRC are shown in Figure 11. The results show that the MSRR code supports far more parameters than the other two codes.
  • the precise repair structure of the clustering code in [7] is based on the existing structure of the MSR code.
  • The existing MSR constructions require the storage α to be exponential in k, so the minimum-storage construction of the clustering code in [7] inherits the same limitation. The construction of our MSRR code does not have this restriction.
  • the MBRR code proposed in this paper can support all parameters.
  • The construction of the minimum bandwidth code in [8] is given in [10] and also supports all parameters. The minimum bandwidth code in [7], however, requires kr/n to be an integer. Therefore, our construction supports more parameters than that in [7].
  • The relay node X_{1,1} is connected to T. Since the input edges of T all have infinite capacity, we only need to check the input edges of Out_{1,1} and In_{1,1}. Since X_{1,1} is not a failed node, the input edge capacities of Out_{1,1} and In_{1,1} are αn/r, which is a finite value. Therefore, a relay node that has not failed transmits αn/r to the edge. On the other hand, if the relay node X'_{1,1}, as a failed node, is connected to T, the input edges of Out'_{1,1} and In'_{1,1} have capacities αn/r and α(n/r-1)+dβ.
  • Node X'_{1,1} can transmit min{(n/r-1)α+dβ, αn/r} symbols to the edge. Each of the other n/r-1 nodes in rack 1 has an edge connecting its input vertex to Out_{1,1}, with capacity α. The other n/r-1 nodes in rack 1 add nothing further to the edge, regardless of whether they are connected to T. Therefore, if a relay node is connected to T, we should connect all other n/r-1 nodes in the same rack to T to minimize the amount of communication at the edge.
  • The n/r vertices transmit at least min{(n/r-1)α+dβ, αn/r} to the edge.
  • For the next n/r out-vertices, the same argument shows that those n/r nodes transmit at least min{(n/r-1)α+(d-1)β, αn/r} to the edge. Continuing in the same way for each further group of n/r vertices and for the final k-mn/r vertices, the minimum cut of any information flow graph G(n,k,r,d,α,β) is exactly equal to the value in (1).
  • each element of the matrix M 1 in (9) and M 2 in (10) can be interpreted as a polynomial having a total degree of 2 and 1, respectively.
  • Each element of the matrix in (11) is a polynomial with a total degree of 1.
  • the product of all determinants can be regarded as a polynomial whose total degree is
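
The repair model and the cut argument in the definitions above can be sketched as a small calculator. This is only an illustration under stated assumptions: the function names are hypothetical, and the general per-rack term `min{(n/r-1)α + (d-l+1)β, αn/r}` is extrapolated from the two terms that appear in the text (with dβ and (d-1)β), with each of the final k - mn/r nodes assumed to contribute α and m taken as floor(kr/n).

```python
from fractions import Fraction

def repair_download(n, r, d, alpha, beta):
    """Symbols fetched to repair one node: inside its own rack and across racks."""
    return {"intra_rack": (n // r - 1) * alpha, "cross_rack": d * beta}

def cut_value(n, k, r, d, alpha, beta):
    """Assumed min-cut of the information flow graph G(n, k, r, d, alpha, beta)."""
    per_rack = n // r          # nodes per rack (n is a multiple of r)
    m = (k * r) // n           # number of racks fully contacted by the collector (assumption)
    total = 0
    for l in range(1, m + 1):
        helpers = max(d - l + 1, 0)   # racks that still contribute fresh symbols
        total += min((per_rack - 1) * alpha + helpers * beta, per_rack * alpha)
    total += (k - m * per_rack) * alpha   # the final k - m*n/r nodes
    return total

# Example with (n, k, r, d) = (12, 8, 4, 3) and a file normalised to B = 1.
msrr = cut_value(12, 8, 4, 3, Fraction(1, 8), Fraction(1, 16))
mbrr = cut_value(12, 8, 4, 3, Fraction(3, 23), Fraction(1, 23))
```

With these example values both cuts equal the normalised file size, which is consistent with B = 23 for d = 3 and β = 1 in the MBRR example given earlier.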

Abstract

The present invention is applicable to the field of improvement of digital encoding technology. Erasure codes are widely applied to data centers with massive storage to implement high fault tolerance and low storage redundancy. Provided is a new erasure code which is called a rack-aware regenerating code (RRC) and is capable of implementing an optimal compromise. Further provided is a method for constructing an accurate repair RRC having minimum storage redundancy and a minimum cross-rack repair bandwidth. The present invention provides that: (1) the method for constructing an RRC having minimum storage redundancy has a wide parameter selection range, and in most cases, the cross-rack repair bandwidth of the RRC is strictly less than that of a classic minimum storage regenerating code; (2) the method for constructing an RRC having minimum cross-rack repair bandwidth can support all parameters, and for most parameters, the cross-rack repair bandwidth of the RRC is less than that of a minimum bandwidth regenerating code.

Description

A rack-aware regenerating code for data centers
Technical field
The invention belongs to the field of improvements in digital coding technology, and in particular relates to a rack-aware regenerating code for data centers.
Background
Modern storage systems are usually deployed as data centers, in which data is distributed over a large number of storage nodes in different racks. Examples include the Google File System [1], Windows Azure Storage [2], and Facebook's storage system [3]. To protect against node failures and improve the reliability and durability of stored data, modern storage systems widely use erasure codes to encode data. Compared with traditional replication, erasure codes offer higher fault tolerance with lower storage redundancy. The Reed-Solomon (RS) code [4] is the most widely used erasure code (for example, at Google [1]). An (n, k) RS code encodes a file of k data blocks (a block being the unit of the erasure-coding operation) into n coded blocks over a finite field and distributes these n blocks over n different nodes (where k < n). During reconstruction, a data collector rebuilds the data file by retrieving any k of the n blocks. RS codes have two important practical properties: (i) they achieve minimum storage redundancy while tolerating any n-k node failures, and (ii) they support arbitrary values of n and k (k < n).
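
The "any k of n" property can be illustrated with a toy code. The sketch below is not the construction of this document: it is a textbook polynomial-evaluation RS-style code over the prime field GF(257), with one symbol per block, and the names `rs_encode`, `rs_decode` and the field size are assumptions chosen for the example.

```python
P = 257  # prime field size (assumption for this sketch; any prime larger than n works)

def rs_encode(data, n):
    """Treat the k data symbols as polynomial coefficients; block i is the value at x = i + 1."""
    return [sum(c * pow(x, j, P) for j, c in enumerate(data)) % P
            for x in range(1, n + 1)]

def rs_decode(shares, k):
    """Recover the k coefficients from any k (node_index, value) pairs by Lagrange interpolation."""
    assert len(shares) >= k
    pts = [(i + 1, v) for i, v in shares[:k]]
    coeffs = [0] * k
    for a, (xa, ya) in enumerate(pts):
        basis, denom = [1], 1          # basis polynomial, lowest-degree coefficient first
        for b, (xb, _) in enumerate(pts):
            if b == a:
                continue
            # Multiply the basis polynomial by (x - xb).
            basis = [((basis[j - 1] if j > 0 else 0)
                      - xb * (basis[j] if j < len(basis) else 0)) % P
                     for j in range(len(basis) + 1)]
            denom = denom * (xa - xb) % P
        scale = ya * pow(denom, P - 2, P) % P   # divide by denom via Fermat inverse
        for j in range(k):
            coeffs[j] = (coeffs[j] + scale * basis[j]) % P
    return coeffs

data = [5, 17, 42, 200]          # k = 4 data symbols
blocks = rs_encode(data, 6)      # n = 6 coded blocks, one per node
```

Any four of the six blocks, passed to `rs_decode` together with their node indices, return the original `data`, so the code tolerates any n-k = 2 node failures.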
When a node is lost, every data block stored in it must be repaired in a new node to maintain the same level of fault tolerance. The conventional repair method is also used with RS codes: first reconstruct the data file, then re-encode it to regenerate the lost data block. Consequently, for an (n, k) RS code, repairing one lost data block requires downloading k data blocks (k times the amount of lost data), which increases network bandwidth and I/O.
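
The k-fold repair cost can be made concrete with a small sketch. It assumes the same toy polynomial-evaluation code over GF(257) as a stand-in for an RS code (an assumption, not this document's code); `repair_block` is a hypothetical helper that rebuilds one lost block from k helper blocks and reports how many blocks it had to download.

```python
P = 257  # prime field size (assumption for this sketch)

def encode(data, n):
    """Block i is the data polynomial evaluated at x = i + 1."""
    return [sum(c * pow(x, j, P) for j, c in enumerate(data)) % P
            for x in range(1, n + 1)]

def repair_block(lost, helpers):
    """Rebuild the block of node `lost` from helper blocks given as {node_index: value}."""
    x0 = lost + 1
    value = 0
    for a, ya in helpers.items():
        xa = a + 1
        num, den = 1, 1
        for b in helpers:
            if b != a:
                xb = b + 1
                num = num * (x0 - xb) % P
                den = den * (xa - xb) % P
        # Lagrange evaluation at the lost node's point; division via Fermat inverse.
        value = (value + ya * num * pow(den, P - 2, P)) % P
    return value, len(helpers)   # repaired block, number of blocks downloaded

blocks = encode([5, 17, 42, 200], 6)                 # (n, k) = (6, 4)
helpers = {i: blocks[i] for i in (0, 1, 3, 5)}       # any k surviving nodes
repaired, downloaded = repair_block(2, helpers)
```

Here one lost block is recovered only after fetching k = 4 blocks, which is the overhead that regenerating codes aim to reduce.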
Regenerating codes (RC) were proposed by Dimakis et al. [5] with the goal of optimizing network bandwidth during repair. An RC encodes a data file into a number of data blocks that is a multiple of n and distributes them over n nodes, so that each node stores several data blocks. The data file can then be reconstructed from a sufficient number of nodes. To repair the data blocks of a lost node, the new node downloads coded data blocks from a selected subset of the surviving nodes. The total amount of coded data downloaded from the surviving nodes is called the repair bandwidth, and it is much smaller than the size of the original data file. Dimakis et al. [5] also characterized the optimal trade-off between repair bandwidth and storage redundancy.
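
The two end points of that trade-off can be computed directly. The closed forms below are the standard MSR and MBR expressions from the regenerating-code literature rather than formulas stated in this passage, and the function names are hypothetical; B is the file size, k the number of nodes needed for reconstruction, and d the number of helper nodes.

```python
from fractions import Fraction

def msr_point(B, k, d):
    """Minimum-storage point: returns (alpha, beta, repair bandwidth d*beta)."""
    alpha = Fraction(B, k)
    beta = Fraction(B, k * (d - k + 1))
    return alpha, beta, d * beta

def mbr_point(B, k, d):
    """Minimum-bandwidth point: returns (alpha, beta, repair bandwidth d*beta)."""
    beta = Fraction(2 * B, k * (2 * d - k + 1))
    alpha = d * beta
    return alpha, beta, d * beta

# Example: n = 12, k = 8, all d = n - 1 = 11 survivors help, file size normalised to 1.
msr = msr_point(1, 8, 11)
mbr = mbr_point(1, 8, 11)
```

In this example both repair bandwidths are well below the file size, whereas the conventional RS repair of a node of the same size would download the whole file.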
In a real data center, storage nodes are organized in racks, and cross-rack communication is usually more expensive than intra-rack communication. For an erasure code it is therefore important to minimize the cross-rack repair bandwidth, that is, the total amount of data transferred between different racks during repair. RC codes do not address this problem and cannot minimize the cross-rack repair bandwidth, which has motivated a series of studies on repair in data centers (see Section II). In particular, Hu et al. [6] proposed double regenerating codes (DRC), which minimize the cross-rack repair bandwidth by reconstructing part of the repair data inside each rack and combining the partial repair data from different racks. Their results show that, for some parameters, DRC codes achieve lower cross-rack repair bandwidth than RC codes. However, DRC codes are built on the condition of minimum storage redundancy (as with RS codes). As with the optimality analysis of RC codes [5], many questions about the optimal trade-off between storage redundancy and cross-rack repair bandwidth remain open.
This paper considers a model more general than that of DRC codes [6] (for example, it allows a flexible storage size and a flexible number of surviving nodes contacted during repair). On this basis, it proposes an erasure code for data centers called the rack-aware regenerating code (RRC). The main contributions are as follows:
First, we derive the trade-off between storage redundancy and cross-rack repair bandwidth for RRC. The optimal trade-off curve has two extreme points, the minimum storage rack-aware regenerating (MSRR) point and the minimum bandwidth rack-aware regenerating (MBRR) point, which correspond to minimum storage and minimum cross-rack repair bandwidth, respectively. When each rack contains only one node, the optimal trade-off curve of RRC reduces to that of regenerating codes (RC). Let r be the number of racks. When kr/n is an integer, the trade-off curve of MSRR codes is the same as that of minimum storage regenerating (MSR) codes. When kr/n is not an integer, the cross-rack repair bandwidth of MSRR codes is strictly less than that of MSR codes. We also show that, for most parameters (see Theorem 4), the cross-rack repair bandwidth of MBRR codes is much smaller than that of minimum bandwidth regenerating (MBR) codes. For example, when (n,k,r)=(12,8,4), MSRR codes have 33.3% less cross-rack repair bandwidth than MSR codes; for the same parameters, MBRR codes have 13.1% less cross-rack repair bandwidth and 28.9% less storage than MBR codes (see Fig. 3).
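The 33.3% figure can be checked with a short calculation. The sketch below is not taken from this section: it assumes the standard MSR point, where each of d=n-1 helpers sends α/(d-k+1) symbols and only the n-n/r helpers outside the host rack count as cross-rack traffic, and it assumes an MSRR point with β=α/(d-m+1) for m=⌊kr/n⌋ and d=r-1 helper racks. The function names are illustrative only.

```python
from fractions import Fraction


def msr_cross_rack(n, k, r, alpha=1):
    """Cross-rack traffic of an MSR code repaired from d = n - 1 helpers (assumed model)."""
    d = n - 1
    beta = Fraction(alpha) / (d - k + 1)  # symbols sent by each helper node
    helpers_outside_rack = n - n // r     # helpers that are not in the host rack
    return helpers_outside_rack * beta


def msrr_cross_rack(n, k, r, d, alpha=1):
    """Cross-rack traffic gamma = d * beta at the MSRR point (assumed formula)."""
    m = (k * r) // n                      # floor(kr/n)
    beta = Fraction(alpha) / (d - m + 1)  # symbols sent by each helper rack
    return d * beta


n, k, r = 12, 8, 4
msr = msr_cross_rack(n, k, r)
msrr = msrr_cross_rack(n, k, r, d=r - 1)
reduction = 1 - msrr / msr
print(f"MSR: {msr} alpha, MSRR: {msrr} alpha, reduction: {float(reduction):.1%}")
```

Under these assumptions the two quantities also coincide when kr/n is an integer, for example for (n,k,r)=(12,9,4), which agrees with the statement above.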
Compared with existing work, the cross-rack repair bandwidth of RRC is less than or equal to that of the codes in [7] for all parameters, and less than that of the codes in [8] for most parameters (see Section VII-A).
Second, we present exact-repair constructions of MSRR and MBRR codes for a wide range of parameters. The range of parameters supported by our constructions is much larger than that of the codes in [6]-[8] (see Section VII-B). Exact-repair constructions for the codes in [8] are given in [9] and [10]. For example, when n=12 and r=4, our MSRR construction supports k=4,5,...,11, whereas the constructions in [6] and [7] support only k=9 and k=6,9, respectively. The exact-repair minimum-storage construction for [8], given in the related work [9], supports only r=2 and n=2k.
References
[1] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, "Availability in globally distributed storage systems," in Proc. of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010, pp. 1–7.
[2] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure coding in Windows Azure storage," in Proc. USENIX Annual Technical Conference, 2012.
[3] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, "XORing elephants: Novel erasure codes for big data," in Proceedings of the 39th International Conference on Very Large Data Bases. VLDB Endowment, 2013, pp. 325–336.
[4] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial & Applied Mathematics, vol. 8, no. 2, pp. 300–304, 1960.
[5] A. Dimakis, P. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Information Theory, vol. 56, no. 9, pp. 4539–4551, Sep. 2010.
[6] Y. Hu, P. P. C. Lee, and X. Zhang, "Double regenerating codes for hierarchical data centers," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2016, pp. 245–249.
[7] N. Prakash, V. Abdrashitov, and M. Médard, "The storage versus repair-bandwidth trade-off for clustered storage systems," IEEE Trans. Information Theory, vol. 64, no. 8, pp. 5783–5805, Aug. 2018.
[8] J.-y. Sohn, B. Choi, S. W. Yoon, and J. Moon, "Capacity of clustered distributed storage," accepted for publication in IEEE Trans. Inf. Theory, 2018.
[9] J.-y. Sohn, B. Choi, and J. Moon, "A class of MSR codes for clustered distributed storage," https://arxiv.org/abs/1801.02014, 2018.
[10] J.-y. Sohn and J. Moon, "Explicit construction of MBR codes for clustered distributed storage," https://arxiv.org/abs/1801.02287, 2018.
[11] H. C. Chen, Y. Tang, Y. Hu, and P. P. C. Lee, "NCCloud: a network-coding-based storage system in a cloud-of-clouds," IEEE Trans. Computers, vol. 63, no. 1, pp. 31–44, Jan. 2014.
[12] K. Rashmi, P. Nakkiran, J. Wang, N. B. Shah, and K. Ramchandran, "Having your cake and eating it too: Jointly optimal erasure codes for I/O, storage, and network-bandwidth," in Proc. of USENIX FAST, 2015, pp. 81–94.
[13] L. Pamies-Juarez, F. Blagojevic, R. Mateescu, C. Guyot, E. E. Gad, and Z. Bandic, "Opening the chrysalis: On the real repair performance of MSR codes," in Proc. of USENIX FAST, 2016, pp. 81–94.
[14] J. Li and B. Li, "Beehive: erasure codes for fixing multiple failures in distributed storage systems," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1257–1270, 2017.
[15] J. Pernas, C. Yuen, B. Gastón, and J. Pujol, "Non-homogeneous two-rack model for distributed storage systems," in Proc. IEEE Int. Symp. Inf. Theory, 2013, pp. 1237–1241.
[16] T. Ernvall, S. El Rouayheb, C. Hollanti, and H. V. Poor, "Capacity and security of heterogeneous distributed storage systems," IEEE J. Selected Areas in Communications, vol. 31, no. 12, pp. 2701–2709, Dec. 2013.
[17] M. A. Tebbi, T. H. Chan, and C. W. Sung, "A code design framework for multi-rack distributed storage," in Proc. IEEE Inf. Theory Workshop (ITW), 2014, pp. 55–59.
[18] Y. Hu, X. Li, M. Zhang, P. P. C. Lee, X. Zhang, P. Zhou, and D. Feng, "Optimal repair layering for erasure-coded data centers: From theory to practice," ACM Transactions on Storage, vol. 13, no. 4, pp. 33–56, 2017.
[19] N. B. Shah, K. V. Rashmi, and P. V. Kumar, "A flexible class of regenerating codes for distributed storage," in Proc. IEEE Int. Symp. Inf. Theory, 2010, pp. 1943–1947.
[20] J. Li, S. Yang, X. Wang, and B. Li, "Tree-structured data regeneration in distributed storage systems with regenerating codes," in Conference on Information Communications, 2010, pp. 2892–2900.
[21] Y. Wang, D. Wei, X. Yin, and X. Wang, "Heterogeneity-aware data regeneration in distributed storage systems," in Proc. IEEE INFOCOM, 2014, pp. 1878–1886.
[22] S. Akhlaghi, A. Kiani, and M. R. Ghanavati, "Cost-bandwidth tradeoff in distributed storage systems," Computer Communications, vol. 33, no. 17, pp. 2105–2115, 2010.
[23] S. Goparaju, A. Fazeli, and A. Vardy, "Minimum storage regenerating codes for all parameters," IEEE Trans. Information Theory, vol. 63, no. 10, pp. 6318–6328, 2017.
[24] B. Gastón, J. Pujol, and M. Villanueva, "A realistic distributed storage system that minimizes data storage and repair bandwidth," arXiv preprint arXiv:1301.1549, 2013.
[25] Z. Shen, J. Shu, and P. P. C. Lee, "Reconsidering single failure recovery in clustered file systems," in 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016, pp. 323–334.
[26] M. Gerami, M. Xiao, and M. Skoglund, "Two-layer coding in distributed storage systems with partial node failure/repair," IEEE Communications Letters, vol. 21, no. 4, pp. 726–729, 2017.
[27] R. W. Yeung, Information Theory and Network Coding. Springer, 2008.
[28] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Explicit codes minimizing repair bandwidth for distributed storage," in Proc. IEEE Inf. Theory Workshop (ITW), 2010, pp. 1–5.
[29] C. Suh and K. Ramchandran, "Exact-repair MDS codes for distributed storage using interference alignment," in Proc. IEEE Int. Symp. Inf. Theory, 2010, pp. 161–165.
[30] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge University Press, 1995.
[31] J. Li, X. Tang, and C. Tian, "A generic transformation for optimal repair bandwidth and rebuilding access in MDS codes," in Proc. IEEE Int. Symp. Inf. Theory, 2017, pp. 1623–1627.
[32] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction," IEEE Trans. Information Theory, vol. 57, no. 8, pp. 5227–5239, Aug. 2011.
[33] H. Hou, K. W. Shum, M. Chen, and H. Li, "BASIC codes: Low-complexity regenerating codes for distributed storage systems," IEEE Trans. Information Theory, vol. 62, no. 6, pp. 3053–3069, 2016.
Summary of the invention
The purpose of the present invention is to provide a rack-aware regenerating code for use in a data center.
The present invention is realized as follows. A rack-aware regenerating code for a data center is applied to a data center consisting of n nodes. The n nodes are distributed evenly over r racks, so that each rack h holds n/r nodes, where n is a multiple of r. All nodes are labeled from 1 to n, with h=1,2,...,r and i=1,2,...,n/r. A data file is regarded as a vector of length B over a finite field; it is encoded into nα symbols and stored in the n nodes, with α symbols stored in each node.
In a further technical solution of the present invention, each rack has a relay node, and each relay node can obtain the information stored in the other nodes of the same rack.
In a further technical solution of the present invention, the relay node is any surviving node selected from the rack, and different data files may be associated with different relay nodes during repair operations.
In a further technical solution of the present invention, when a storage node fails, the rack-aware regenerating code generates a new node to replace it and places the new node in the same rack. The new node selects any d other racks, where d<r, and connects to the relay node of each selected rack. Each of these relay nodes sends β symbols to the new node, and the new node regenerates its content from the dβ received symbols and the (n/r-1)α symbols stored in the rack of the failed node. The cross-rack repair bandwidth is γ=dβ.
In a further technical solution of the present invention, when a relay node is connected to a data collector, the rack-aware regenerating code supports exact repair and functional repair. In exact repair, the symbols stored in the new node are identical to those stored in the failed node. In functional repair, the new node may contain symbols different from those of the failed node, as long as the decoding property is satisfied.
The beneficial effects of the present invention are as follows. Cross-rack repair bandwidth is minimized by reconstructing part of the repair data inside each rack and combining the partial repair data from different racks. Because regenerating codes are applied to repair across multiple racks rather than only within a single rack, the invention achieves lower cross-rack repair bandwidth and has greater practical engineering significance. The work studies the optimal trade-off between storage redundancy and cross-rack repair bandwidth in a data center, which is a broader scope than earlier studies of particular points of regenerating codes. The construction supports a wide range of parameters, and in most cases its cross-rack repair bandwidth is strictly lower than the repair bandwidth of classical minimum storage regenerating codes. The construction for minimum cross-rack repair bandwidth supports all parameters and, for all parameters, has lower cross-rack repair bandwidth than minimum bandwidth regenerating codes.
Description of the drawings
Fig. 1 is a schematic diagram of the rack-aware regenerating code for a data center provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of the information flow graph for n=9, k=5, r=3 and d=2 provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of the optimal trade-off curves between α and γ=dβ for RRC and RC with n=12, k=8, r=3,4,6 and d=r-1 provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram of an example with n=10, k=8, r=5 and d=4 provided by an embodiment of the present invention, where the data symbols [s_1, s_2, ..., s_8] are denoted by s.
Fig. 5 is a schematic diagram of an example with (n,k,r)=(12,8,4) provided by an embodiment of the present invention, where the data symbols are denoted by s=[s_1, s_2, ..., s_16].
Fig. 6 is a schematic diagram of an MBRR code example with (n,k,r,d)=(12,8,4,3) provided by an embodiment of the present invention.
Fig. 7 is a schematic diagram of the cross-rack repair bandwidth of MSRR and MSR codes for r=3,4 provided by an embodiment of the present invention.
Fig. 8 is a schematic diagram of the cross-rack repair bandwidth of MSRR and MSR codes for r=5,6 provided by an embodiment of the present invention.
Fig. 9 is a schematic diagram of the cross-rack repair bandwidth of MSRR and MSR codes for n=18,24 provided by an embodiment of the present invention.
Fig. 10 is a schematic diagram of the trade-off between storage and cross-rack repair bandwidth for MSRR codes and the minimum-bandwidth point of the codes in [8] for n=20,25 provided by an embodiment of the present invention.
Fig. 11 is a schematic diagram of the parameters of MSRR codes, the clustered codes in [7] and DRC codes for r=3,4,5,6 provided by an embodiment of the present invention.
Detailed description
Regenerating codes (RC) have been studied extensively, including practical implementations [11]-[14] and repair problems with heterogeneous structures [6]-[8], [15]-[18].
Flexible RC [19] are designed for heterogeneous storage systems and achieve the lower bound on repair bandwidth. Combined with tree-structured regeneration topologies, RC can save further network bandwidth [20], [21]. Some studies, such as [16], [22], focus on the capacity bounds of a heterogeneous model. However, none of these studies distinguishes intra-rack from cross-rack communication cost in a data center.
Although some existing studies do distinguish cross-rack from intra-rack communication cost, their system models differ substantially. Table I compares RRC with closely related work on erasure-coded data centers. DRC codes [6], [18] consider the same model as this paper and achieve the trade-off between storage and cross-rack repair bandwidth at the minimum-storage point. DRC can be viewed as a special case of MSRR codes in which all other racks are contacted to repair a failed node. Sohn et al. [8] consider a different repair model and give the optimal trade-off between storage and repair bandwidth (including both cross-rack and intra-rack repair bandwidth). In their repair process, no information is encoded between two nodes in the same rack, whereas in our model the information downloaded from another rack is a linear combination of all information in that rack (similar to DRC [6], [18]). In addition, to repair a failed node, the new node in [8] must contact all other racks, and the cross-rack repair bandwidth of RRC is less than that of the codes in [8] for most parameters. Recently, Sohn et al. proposed exact-repair constructions for the minimum-storage and minimum-bandwidth points of the codes in [8] in [9], [10].
The work most closely related to ours is that of Prakash et al. [7]. In their model, a file is reconstructed from a certain number of racks, so k must be a multiple of the number of nodes per rack. In contrast, our model allows a file to be reconstructed from any k nodes. RRC can therefore tolerate more node failures than the codes in [7]. When k is a multiple of the number of nodes per rack, the trade-off curve of RRC is the same as the optimal trade-off curve in [7], but our exact-repair constructions of MSRR and MBRR codes support more parameters than the exact-repair minimum-storage and minimum-bandwidth constructions in [7] (see Section VII-B). The minimum-storage and minimum-bandwidth codes in [7] support only the case where k is a multiple of the number of nodes per rack, whereas our MSRR and MBRR codes place no such restriction on k. When k is a multiple of the number of nodes per rack, k+1 is not. We show that the cross-rack repair bandwidth of MSRR (MBRR) codes with k+1 data nodes is much smaller than that of MSRR (MBRR) codes with k data nodes, where k is a multiple of the number of nodes per rack.
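The restriction on k can be made concrete with a few lines of code. The sketch below simply lists the values of k that are multiples of the rack size n/r; the function name is illustrative and is not part of either construction.

```python
def k_multiples_of_rack_size(n, r, k_values):
    """Return the k values that are multiples of the number of nodes per rack."""
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    per_rack = n // r
    return [k for k in k_values if k % per_rack == 0]


# n=12 nodes in r=4 racks, candidate k in 4..11
print(k_multiples_of_rack_size(12, 4, range(4, 12)))  # [6, 9]
```

For n=12 and r=4 this leaves only k=6 and k=9 out of k=4,5,...,11, which matches the parameter ranges quoted earlier for [7] and for the MSRR construction.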
Other studies address the use of erasure codes in rack-based data centers. Some, such as [15], [24], consider the interaction between two racks. Tebbi et al. [17] design locally repairable codes for multi-rack storage systems. Shen et al. [25] present a rack-aware repair algorithm designed specifically for RS codes. In this paper, we analyze and formulate a general model that gives the optimal trade-off between storage and cross-rack repair bandwidth.
The work in [26] applies two-layer coding to data centers. The first layer encodes the data file with an (n,k) MDS code and distributes it over n nodes. The second layer creates the data blocks stored in each node with an MDS code of rate δ. If the fraction of failed blocks among the blocks stored in a node is at most 1-δ (a partial node failure), the failed blocks can be recovered by the node itself. Otherwise, there is a trade-off between storage and repair bandwidth. The main difference between [26] and our work is that we distinguish intra-rack from cross-rack communication and consider the repair of one failed node in a rack-based storage system, whereas [26] considers the repair of partial node failures.
System model
TABLE I: Comparison with related work.
Figure PCTCN2019081267-appb-000001
We consider a data center consisting of n nodes that are distributed evenly over r racks, so that each rack holds n/r nodes (see Fig. 1). A failed node can be repaired by downloading all other symbols in its host rack and β symbols from each of any d other racks. The data file can be retrieved by a data collector that downloads kα symbols from any k nodes. Throughout this paper, we assume that n is a multiple of r and label the nodes from 1 to n. For h=1,2,...,r and i=1,2,...,n/r, we denote the i-th node in rack h by X_{h,i}. All operations are over a finite field of size q, and a data file is regarded as a vector of length B over this field. A data file is encoded into nα symbols and stored in n nodes, with each node storing α symbols.
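As a small illustration of this labeling, the sketch below maps each node label 1..n to a pair (h, i) identifying node X_{h,i}. It assumes that labels are assigned rack by rack, which the text does not state explicitly; the function name is hypothetical.

```python
def rack_layout(n, r):
    """Map node labels 1..n to (rack h, position i), filling racks in order (assumed)."""
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    per_rack = n // r
    layout = {}
    for label in range(1, n + 1):
        h = (label - 1) // per_rack + 1  # rack index, 1..r
        i = (label - 1) % per_rack + 1   # position inside the rack, 1..n/r
        layout[label] = (h, i)
    return layout


layout = rack_layout(9, 3)
print(layout[1], layout[4], layout[9])  # (1, 1) (2, 1) (3, 3)
```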
Each rack has a designated relay node that can obtain the information stored in the other nodes of the same rack. We assume that information exchange within a rack is very fast, so the cost of transmission between nodes in the same rack is very low. If a storage node fails, we create a new node to replace it and place the new node in the same rack. The new node selects any d other racks, where d<r, and connects to the corresponding relay nodes. We call the relay nodes (or racks) that take part in the repair process helpers, and we call the parameter d the repair degree. Based on the αn/r symbols stored in its rack, each helper relay node sends β symbols to the new node. The cross-rack repair bandwidth is γ=dβ. The content of the new node is then regenerated from the dβ received symbols and the (n/r-1)α symbols stored in the host rack. Note that the relay node can be any surviving node in the rack, and different data files may be associated with different relay nodes during repair. A node failure can be viewed as a partial failure of a rack: the failed node is repaired by downloading β symbols from each of any d other racks and (n/r-1)α symbols from the other n/r-1 nodes in the same rack. By relabeling the storage nodes if necessary, we assume without loss of generality that X_{h,1} is the relay node of rack h for h=1,2,...,r.
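The traffic of a single repair follows directly from these quantities. The sketch below only restates the accounting above (γ=dβ across racks and (n/r-1)α inside the host rack); the function name and the example values of α and β are arbitrary and chosen for illustration.

```python
def repair_traffic(n, r, d, alpha, beta):
    """Count the symbols downloaded to repair one failed node."""
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    if not 0 < d < r:
        raise ValueError("d must satisfy 0 < d < r")
    cross_rack = d * beta                # gamma = d * beta, sent by d relay nodes
    intra_rack = (n // r - 1) * alpha    # from the surviving nodes in the host rack
    return {"cross_rack": cross_rack, "intra_rack": intra_rack}


print(repair_traffic(n=12, r=4, d=3, alpha=2, beta=1))
# {'cross_rack': 3, 'intra_rack': 4}
```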
We require that any k nodes suffice to decode the data file, and we call this the (n,k) decoding property. A data collector that connects to a relay node is effectively connected to all n/r nodes in that rack. We therefore assume that if a data collector connects to a relay node, it also connects to all other nodes in the same rack. This paper considers two modes of repair, exact repair and functional repair. In exact repair, the symbols stored in the new node are identical to those stored in the failed node. In functional repair, the new node may contain symbols different from those of the failed node, as long as the (n,k) decoding property is maintained. A coding scheme with parameters n, k, r, d, α and β that satisfies these requirements is called a rack-aware storage system code, denoted RSS(n,k,r,d,α,β). Table II summarizes the main notation used in this paper.
Optimal trade-off between storage and cross-rack repair bandwidth
We use an information flow graph to represent the storage system described above. To distinguish it from the system diagram in Fig. 1, we use the term "vertex" rather than "node" when describing the information flow graph.
TABLE II: Major notation used in this paper.
Figure PCTCN2019081267-appb-000002
Given the system parameters n, k, r, d, α, β, an information flow graph is a directed acyclic graph (DAG) constructed according to the following rules. There is a vertex S representing the data file and a vertex T representing the data collector. For h=1,2,...,r and i=1,2,...,n/r, the i-th node in rack h is represented by a pair of vertices In_{h,i} and Out_{h,i}. We draw an edge from In_{h,i} to Out_{h,i} with capacity α. For each vertex In_{h,i}, we draw an edge from S to In_{h,i} with infinite capacity. This reflects that the content of each storage node is a function of all symbols of the data file and that each node stores at most α symbols. For h=1,2,...,r and i=2,3,...,n/r, we draw an edge from Out_{h,i} to Out_{h,1} with capacity α. This indicates that X_{h,1} is the relay node and can access all content stored in X_{h,i}.
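The initial stage of the graph can be written down as an edge list. The sketch below follows the three rules above (S to In with infinite capacity, In to Out with capacity α, and Out_{h,i} to Out_{h,1} with capacity α); the tuple encoding of vertices and the function name are illustrative choices, not part of the original description.

```python
import math


def initial_flow_graph(n, r, alpha):
    """Return {(tail, head): capacity} for the initial information flow graph."""
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    per_rack = n // r
    edges = {}
    for h in range(1, r + 1):
        for i in range(1, per_rack + 1):
            # The source S feeds every storage node without restriction.
            edges[("S", ("In", h, i))] = math.inf
            # Each node stores at most alpha symbols.
            edges[(("In", h, i), ("Out", h, i))] = alpha
            # The relay node X_{h,1} can read the other nodes of its rack.
            if i >= 2:
                edges[(("Out", h, i), ("Out", h, 1))] = alpha
    return edges


graph = initial_flow_graph(n=9, r=3, alpha=2)
print(len(graph))  # n + n + (n - r) = 24 edges
```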
Suppose that the f-th node in rack h fails, where h∈{1,2,...,r} and f∈{1,2,...,n/r}. We add n/r pairs of vertices to the information flow graph, denoted In'_{h,j} and Out'_{h,j} for j=1,2,...,n/r. For j∈{1,2,...,n/r}\{f}, we draw an edge with infinite capacity from Out_{h,j} to In'_{h,j} and an edge with infinite capacity from In'_{h,j} to Out'_{h,j}. This means that the content of node j is unchanged by the repair. The new node is represented by the vertex In'_{h,f}. For each j∈{1,2,...,n/r}\{f}, we draw an edge with infinite capacity from Out_{h,j} to In'_{h,f}, because the new node can access the symbols stored in all other nodes of the same rack. Suppose that the new node connects to the relay nodes of d racks h_1, h_2, ..., h_d, which are distinct indices not equal to h. For i=1,2,...,d, the information flow graph has an edge with capacity β from Out_{h_i,1} to In'_{h,f}. Thus In'_{h,f} has (n/r-1)+d incoming edges, of which d have capacity β and n/r-1 have infinite capacity. The new node finally stores α symbols, which we represent by an edge with capacity α from In'_{h,f} to Out'_{h,f}. For j=2,3,...,n/r, there is also an edge with infinite capacity from Out'_{h,j} to Out'_{h,1}.
The storage system may experience a sequence of node failures and repairs, and we repeat the above procedure for each of them. Finally, we draw k edges from k out-vertices to T. If T is connected to the vertex Out_{h,1} of the relay node of rack h, we adopt the convention that T is also connected to all the vertices Out_{h,2}, ..., Out_{h,n/r} in rack h.
The DAG obtained in this way is called an information flow graph and is denoted by G(n, k, r, d, α, β). Figure 2 shows an example for (n, k, r, d) = (9, 5, 3, 2).
Given an information flow graph G, we take the unique vertex S as the source vertex and the unique vertex T as the terminal vertex, and consider the maximum flow from S to T. An (s,t)-cut is a subset of the edges of G such that, once these edges are removed from G, there is no path from s to t. The capacity of an (s,t)-cut is the sum of the capacities of the edges in the cut. Let mincut(G) denote the minimum capacity of an (s,t)-cut in a given information flow graph G, and let min_G mincut(G) denote the minimum of mincut(G) over all information flow graphs G. By the max-flow bound in network coding theory [24, Theorem 18.3], the file size B cannot exceed min_G mincut(G). The next theorem determines min_G mincut(G) and thus gives an upper bound on the file size. Throughout, we use the notation
Figure PCTCN2019081267-appb-000004
Theorem 1. Given the parameters n, k, r, d, α, β and B, if there exists an RSS(n, k, r, d, α, β) code with file size B, then
Figure PCTCN2019081267-appb-000005
The proof of Theorem 1 is given in Appendix A.
If an RSS(n, k, r, d, α, β) code attains equality in (1), we call it a rack-aware regenerating code RRC(n, k, r, d, α, β). The value on the left side of (1) is called the capacity of the RRC(n, k, r, d, α, β) code. When r = n, the trade-off curve of RRC codes in (1) reduces to the optimal trade-off curve of RC codes [5].
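A numeric check can make the capacity concrete. The image of (1) is not reproduced in this text, so the expression used below, (k - m)α + Σ_{l=1}^{min(m,d)} min{(d - l + 1)β, α} with m = ⌊kr/n⌋, is an assumption. It is chosen because it reproduces the two values stated in this description for β = 1, namely B = k(d - m + 1) at α = d - m + 1 and B = kd - m(m - 1)/2 at α = d. The function name `rrc_capacity` is hypothetical.

```python
def rrc_capacity(n, k, r, d, alpha, beta):
    """Assumed form of the file-size bound for RRC(n, k, r, d, alpha, beta).
    m = floor(k*r/n) is the number of racks that are full of data."""
    m = (k * r) // n
    total = (k - m) * alpha
    for l in range(1, min(m, d) + 1):
        # The l-th contacted rack contributes at most (d - l + 1) * beta
        # fresh symbols, capped by the per-node storage alpha.
        total += min((d - l + 1) * beta, alpha)
    return total

# (n, k, r, d) = (12, 8, 4, 3) gives m = 2.
print(rrc_capacity(12, 8, 4, 3, alpha=2, beta=1))  # minimum-storage point
print(rrc_capacity(12, 8, 4, 3, alpha=3, beta=1))  # minimum-bandwidth point
```

Under this assumption the two calls return k(d - m + 1) = 16 and kd - m(m - 1)/2 = 23, respectively.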
Remark. If kr/n is an integer (i.e., m = kr/n), the upper bound in (1) coincides with the upper bound of [7] (see (1) in [7]). Note that our repair scheme is the same as that of [7], but our model tolerates more node failures than the model in [7]. If kr/n is not an integer, our upper bound is tighter than the bound given in [7].
We now characterize the trade-off between the storage α and the cross-rack repair bandwidth γ = dβ for given (n, k, r). Given β, define α*(β) to be the smallest α for which equality holds in (1) if such an α exists, and infinity otherwise. The following theorem gives the optimal trade-off.
Theorem 2. Given the parameters n, k, r, d, B, let
Figure PCTCN2019081267-appb-000006
where i = 0, 1, ..., m-1. If β ranges from f(m-1) to infinity, then the minimum storage α*(β) is
Figure PCTCN2019081267-appb-000007
for i = 1, 2, ..., m-2, and
Figure PCTCN2019081267-appb-000008
when β = f(m-1).
For the proof, see Appendix B.
TABLE III: Some parameters r, n, k for which the MBRR codes have high code rates.
Figure PCTCN2019081267-appb-000009
The optimal trade-off curve has two extreme points, corresponding to minimum storage and minimum cross-rack repair bandwidth. They are called the minimum storage rack-aware regenerating (MSRR) point and the minimum bandwidth rack-aware regenerating (MBRR) point. The MSRR point is obtained by first minimizing α and then minimizing β, whereas the MBRR point is obtained by first minimizing β and then minimizing α.
From (2) and (3), the MSRR point is achieved when
Figure PCTCN2019081267-appb-000010
and the MBRR point is achieved when
Figure PCTCN2019081267-appb-000011
When r = n, MSRR codes reduce to minimum storage regenerating (MSR) codes and MBRR codes reduce to minimum bandwidth regenerating (MBR) codes. In an MBRR code, the cross-rack repair bandwidth γ equals the storage α, so the amount of data downloaded from other racks equals the amount of data lost. For an MBRR code with β = 1, (5) gives B = kd - m(m-1)/2 and α = γ = d. If 2kd - m(m-1) > nd, then the code rate of the MBRR code (i.e.,
Figure PCTCN2019081267-appb-000012
) satisfies
Figure PCTCN2019081267-appb-000013
Therefore, we can construct MBRR codes with high code rate, whereas the code rate of every MBR code is at most 0.5. Given r and n, the values of k for which the MBRR code with d = r-1 has code rate greater than 0.5 are summarized in Table III. For all parameters evaluated in Table III, the MBRR codes have high code rate when k/n > 0.5.
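The rate condition 2kd - m(m-1) > nd can be evaluated directly. The sketch below assumes β = 1, m = ⌊kr/n⌋, file size B = kd - m(m-1)/2 and storage α = d, as stated above for MBRR codes; the function names are hypothetical, and the output is not claimed to reproduce Table III, whose contents are not available in this text.

```python
from fractions import Fraction

def mbrr_rate(n, k, r, d):
    """Code rate B / (n * alpha) of an MBRR code with beta = 1."""
    m = (k * r) // n
    file_size = Fraction(2 * k * d - m * (m - 1), 2)  # B = kd - m(m-1)/2
    alpha = d                                         # alpha = gamma = d
    return file_size / (n * alpha)

def high_rate_ks(n, r):
    """Values of k with MBRR rate above 1/2 when d = r - 1."""
    d = r - 1
    return [k for k in range(1, n) if mbrr_rate(n, k, r, d) > Fraction(1, 2)]

print(mbrr_rate(12, 8, 4, 3))
print(high_rate_ks(12, 4))
```

Exact fractions are used so that the comparison with 1/2 is not affected by rounding.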
Remark. From (4) and (5), we see that, for the same parameters B, d, m, the cross-rack repair bandwidth of MSRR codes and MBRR codes decreases as k increases. If kr/n is an integer, then
Figure PCTCN2019081267-appb-000014
for i = 1, 2, ..., n/r-1, and hence the cross-rack repair bandwidth of MSRR(n, k+i, r) codes is strictly less than that of MSRR(n, k, r) codes. When kr/n is an integer, the cross-rack repair bandwidth of MSRR codes equals that of the minimum-storage codes in [16]. Constructing MSRR codes for the case where kr/n is not an integer is therefore worthwhile, because they have lower cross-rack repair bandwidth. Exact-repair constructions of MSRR codes and MBRR codes are given in Sections V and VI, respectively.
If RC(n, k, d') codes are used directly in rack-based storage, the trade-off curve between the storage α and the cross-rack repair bandwidth γ' of RC(n, k, d') follows from Theorem 1 in [5], as stated below.
Theorem 3. Suppose that RC(n, k, d') codes are used directly in a rack-based storage system, i.e., a failed node is repaired by downloading β' symbols from each of d' = dn/r + n/r - 1 helper nodes (the n/r - 1 nodes in the host rack and dn/r other nodes). Then the trade-off between the minimum storage α'*(γ') and the cross-rack repair bandwidth γ' is
Figure PCTCN2019081267-appb-000015
for i = 1, 2, ..., k-1, where
Figure PCTCN2019081267-appb-000016
By Theorem 3, the cross-rack repair bandwidths at the MSR and MBR points are, respectively,
Figure PCTCN2019081267-appb-000017
and
Figure PCTCN2019081267-appb-000018
The next theorem shows that, for most parameters, MSRR codes (respectively, MBRR codes) have lower cross-rack repair bandwidth than MSR codes (respectively, MBR codes).
Theorem 4. Let d' = dn/r + n/r - 1. If kr/n is an integer, then MSR(n, k, d') codes have the same cross-rack repair bandwidth as MSRR(n, k, d) codes. If kr/n is not an integer, then MSRR(n, k, d) codes have lower cross-rack repair bandwidth than MSR(n, k, d') codes. If kr/n is an integer and k/n > 2/r, then MBRR(n, k, d) codes have lower cross-rack repair bandwidth than MBR(n, k, d') codes.
Proof. Since
Figure PCTCN2019081267-appb-000019
when kr/n is an integer we have kr/n = m and
Figure PCTCN2019081267-appb-000020
which equals the cross-rack repair bandwidth of MSRR codes in (4). If kr/n is not an integer, then kr/n > m and the cross-rack repair bandwidth of MSRR codes is smaller than that of MSR codes. The cross-rack repair bandwidth of MBRR codes is γ_MBRR in (5). If kr/n is an integer, then m = kr/n, and the cross-rack repair bandwidth of MBRR codes is smaller than that of MBR codes if and only if
Figure PCTCN2019081267-appb-000021
Therefore, MBRR codes have lower cross-rack repair bandwidth than MBR codes if and only if k/n > 2/r. In other words, unless the code rate is very low, the cross-rack repair bandwidth of MBRR codes is strictly less than that of MBR codes.
Figure 3 shows the trade-off curves of RRC and RC for B = 1, n = 12, k = 8, r = 3, 4, 6 and d = r-1, where d' = dn/r + n/r - 1.
Figure 3: Optimal trade-off curves between storage and cross-rack repair bandwidth of RRC and RC for n = 12, k = 8, r = 3, 4, 6 and d = r-1. When (n, k, r) = (12, 8, 4), the cross-rack repair bandwidths of MSRR codes and MSR codes are γ_MSRR = 0.1875 and γ_MSR = 0.2813, respectively. The storage and cross-rack repair bandwidth of MBRR codes and MBR codes are (α_MBRR, γ_MBRR) = (0.1304, 0.1304) and (α_MBR, γ_MBR) = (0.1833, 0.1500), respectively.
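The numbers quoted in the caption of Figure 3 can be recomputed with a few lines. The sketch below is a plausibility check under stated assumptions: the MSRR and MBRR operating points are taken as α = B/k, β = B/(k(d - m + 1)) and α = dβ, β = 2B/(2kd - m(m-1)) with m = ⌊kr/n⌋ (the images of (4) and (5) are not available here), and the MSR and MBR points use the standard regenerating-code values with d' helpers, of which dn/r lie outside the host rack. All function names are hypothetical.

```python
from fractions import Fraction

def msrr_point(B, n, k, r, d):
    m = (k * r) // n
    beta = Fraction(B, k * (d - m + 1))
    return Fraction(B, k), d * beta           # (alpha, cross-rack gamma)

def mbrr_point(B, n, k, r, d):
    m = (k * r) // n
    beta = Fraction(2 * B, 2 * k * d - m * (m - 1))
    return d * beta, d * beta                 # alpha equals gamma

def msr_point(B, n, k, r, d):
    per_rack = n // r
    d2 = d * per_rack + per_rack - 1          # d' helper nodes
    beta = Fraction(B, k * (d2 - k + 1))
    return Fraction(B, k), d * per_rack * beta

def mbr_point(B, n, k, r, d):
    per_rack = n // r
    d2 = d * per_rack + per_rack - 1
    beta = Fraction(2 * B, k * (2 * d2 - k + 1))
    return d2 * beta, d * per_rack * beta

for name, point in [("MSRR", msrr_point), ("MSR", msr_point),
                    ("MBRR", mbrr_point), ("MBR", mbr_point)]:
    alpha, gamma = point(1, 12, 8, 4, 3)
    print(name, float(alpha), float(gamma))
```

For (n, k, r, d) = (12, 8, 4, 3) the exact values are γ = 3/16 and 9/32 at the two minimum-storage points, and (3/23, 3/23) and (11/60, 3/20) at the two minimum-bandwidth points, which agree with the caption to four decimal places.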
We make several observations. First, for the same storage redundancy, the cross-rack repair bandwidth of RC codes increases with r. Second, unlike RC codes, for the same storage the cross-rack repair bandwidth of RRC codes with r = 4 is smaller than that with r = 3. In general, if kr/n is an integer, the cross-rack repair bandwidth of RRC codes increases with r, as it does for RC codes. However, when kr'/n is not an integer and
Figure PCTCN2019081267-appb-000022
the cross-rack repair bandwidth of RRC(n, k, r') codes is strictly smaller than that of RRC(n, k, r) codes. Third, when r = 3 and r = 6, RRC codes have lower cross-rack repair bandwidth than RC codes with the same parameters, except at the two MSR points. Indeed, the cross-rack repair bandwidths of MSRR codes and MSR codes coincide only when kr/n is an integer. When (n, k, r) = (12, 8, 4) and (n, k, r) = (12, 8, 6), MBRR codes have lower cross-rack repair bandwidth than MBR codes, whereas when (n, k, r) = (12, 8, 3) the two are equal. By Theorem 4, if kr/n is an integer and k/n > 2/r, MBRR codes have lower cross-rack repair bandwidth than MBR codes, so the result for (n, k, r) = (12, 8, 6) in Figure 3 agrees with Theorem 4. When kr/n is not an integer, as for (n, k, r) = (12, 8, 4), the cross-rack repair bandwidth of MBRR codes is smaller than that of MBR codes. For all parameters, the storage redundancy of MBRR codes is strictly less than that of MBR codes. In the rest of this document, we focus on exact-repair constructions of MSRR codes and MBRR codes.
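The comparison between MBRR and MBR codes for r = 3, 4, 6 can be checked numerically. This sketch assumes B = 1, d = r - 1, an MBRR cross-rack bandwidth of 2Bd/(2kd - m(m-1)) with m = ⌊kr/n⌋, and the standard MBR value β' = 2B/(k(2d' - k + 1)) with dn/r of the d' helpers outside the host rack. The function names are hypothetical.

```python
from fractions import Fraction

def mbrr_gamma(n, k, r, d, B=1):
    """Assumed cross-rack repair bandwidth at the MBRR point."""
    m = (k * r) // n
    return Fraction(2 * B * d, 2 * k * d - m * (m - 1))

def mbr_cross_rack_gamma(n, k, r, d, B=1):
    """Cross-rack part of the MBR repair traffic with d' helpers."""
    per_rack = n // r
    d2 = d * per_rack + per_rack - 1
    beta = Fraction(2 * B, k * (2 * d2 - k + 1))
    return d * per_rack * beta

n, k = 12, 8
for r in (3, 4, 6):
    d = r - 1
    print(r, mbrr_gamma(n, k, r, d), mbr_cross_rack_gamma(n, k, r, d))
```

With these assumptions the two bandwidths coincide for r = 3 (where k/n = 2/r) and the MBRR value is strictly smaller for r = 4 and r = 6.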
Exact-repair code construction
This section presents systematic constructions of exact-repair MSRR codes. A code is systematic if the kα uncoded data blocks are stored in k nodes. We assume that the first k nodes are data nodes, which store the uncoded data blocks, and the last n-k nodes are coding nodes, which store coded data blocks. The construction in Section V-B applies to α = 1 and any (n, k), and all of its data nodes and coding nodes achieve the optimal cross-rack repair bandwidth. The construction in Section V-C yields MSRR codes whose parameters satisfy αn/r ≥ m + αt, and it achieves the optimal cross-rack repair bandwidth for data nodes only. If kr is a multiple of n, the k data nodes are placed in the first m racks, called data racks, and the n-k coding nodes are placed in the last r-m racks, called coding racks. If kr is not a multiple of n, the first m racks are data racks, the last r-m-1 racks are coding racks, and rack m+1, called the hybrid rack, contains t = k mod (n/r) data nodes and n/r - t coding nodes. When kr/n is an integer, the MSRR code is called a homogeneous MSRR code; otherwise it is called a hybrid MSRR code.
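The placement rule for data, hybrid and coding racks can be written out as a short helper. This is an illustrative sketch only; the function name `rack_layout` and the role strings are hypothetical.

```python
def rack_layout(n, k, r):
    """Return (m, t, roles), where roles maps rack index (1-based) to
    'data', 'hybrid' or 'coding' following the placement described above."""
    assert n % r == 0 and 0 < k < n
    per_rack = n // r
    m, t = divmod(k, per_rack)   # m full data racks, t data nodes left over
    roles = {}
    for h in range(1, r + 1):
        if h <= m:
            roles[h] = "data"
        elif h == m + 1 and t > 0:
            roles[h] = "hybrid"  # t data nodes and n/r - t coding nodes
        else:
            roles[h] = "coding"
    return m, t, roles

print(rack_layout(12, 8, 4))  # hybrid case: kr/n is not an integer
print(rack_layout(12, 8, 3))  # homogeneous case: kr/n is an integer
```

For (n, k, r) = (12, 8, 4) the helper yields m = 2 and t = 2, so racks 1 and 2 are data racks, rack 3 is the hybrid rack and rack 4 is a coding rack.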
We assume β = 1, since, as in the construction of MSR codes, the construction extends easily to β ≠ 1. When β = 1, we have
α = d-m+1, B = k(d-m+1).
A. By Theorem 4, MSR codes have the same cross-rack repair bandwidth as homogeneous MSRR codes. That is, when kr/n is an integer, all existing constructions of MSR codes can be applied directly to MSRR codes, but with less intra-rack repair bandwidth. Since existing constructions of MSR codes support all parameters [23], there is no need to give a construction of homogeneous MSRR codes, and the rest of this section focuses on hybrid MSRR codes. We first present a construction of MSRR codes with α = 1. We then give a construction of hybrid MSRR codes with αn/r ≥ m + αt and α > 1. Our constructions use interference alignment, which is similar to the concept of eigenvectors and has been used to construct exact-repair MSR codes [28], [29]. Before giving the constructions, we introduce some notation used in this section.
A data file is represented by B data symbols s = [s_1, s_2, ..., s_B] over the finite field F_q. The α symbols stored in node i of rack h are sQ_{i,h}, where i = 1, 2, ..., n/r and h = 1, 2, ..., r. Here Q_{i,h} is a B × α encoding matrix, and the encoding matrix G_h of rack h is defined as
Figure PCTCN2019081267-appb-000023
When a node in rack f fails, the new node accesses the (n/r-1)α symbols stored in the other nodes of rack f, and downloads from rack
Figure PCTCN2019081267-appb-000024
Figure PCTCN2019081267-appb-000025
a coded symbol with local encoding vector
Figure PCTCN2019081267-appb-000026
where
Figure PCTCN2019081267-appb-000027
is a row vector of length at most αn/r. For a matrix G, G[(i1,i2),(j1,j2)] denotes the submatrix consisting of rows i1 to i2 and columns j1 to j2, while G[(i1,i2),(:)] and G[(:),(j1,j2)] denote the submatrices consisting of rows i1 to i2 and of columns j1 to j2, respectively.
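The submatrix notation uses 1-indexed, inclusive ranges, which differs from the half-open slices common in programming languages. The sketch below shows one way to mirror the notation on a matrix stored as a list of rows; the helper name `submatrix` is hypothetical, and `None` plays the role of "(:)".

```python
def submatrix(G, rows=None, cols=None):
    """G[(i1,i2),(j1,j2)] with 1-indexed inclusive ranges.
    Passing None for rows or cols selects all rows or all columns."""
    i1, i2 = rows if rows is not None else (1, len(G))
    j1, j2 = cols if cols is not None else (1, len(G[0]))
    # Convert the inclusive 1-based ranges to Python's 0-based slices.
    return [row[j1 - 1:j2] for row in G[i1 - 1:i2]]

G = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(submatrix(G, (2, 3), (1, 2)))  # G[(2,3),(1,2)]
print(submatrix(G, rows=(1, 1)))     # G[(1,1),(:)]
print(submatrix(G, cols=(3, 3)))     # G[(:),(3,3)]
```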
B. Code construction for α = 1 and arbitrary n and k.
We show in the next theorem that any (n, k) MDS code achieves the minimum cross-rack repair bandwidth.
Theorem 5. If α = 1, then any symbol of an (n, k) MDS code can be repaired by downloading all the other n/r-1 symbols in the host rack and one symbol from each of d arbitrary other racks.
Proof. When α = 1, we have B = k and d = m. For notational convenience, let c_{i,f} denote the symbol stored in node i of rack f, where i = 1, 2, ..., n/r and f = 1, 2, ..., r. We need to show that the symbol c_{i,f} can be recovered by downloading the n/r-1 symbols
c_{1,f}, ..., c_{i-1,f}, c_{i+1,f}, ..., c_{n/r,f}
from rack f and d symbols from any d racks h_1, h_2, ..., h_d, where h_1, ..., h_d are distinct elements of {1, 2, ..., r} \ {f}. We first consider the case where kr/n is not an integer.
The symbols in rack h_i are
Figure PCTCN2019081267-appb-000028
where i = 1, 2, ..., d. Since each rack holds n/r symbols, the symbols in racks f, h_1, h_2, ..., h_{d-1} together with the first t symbols in rack h_d number dn/r + t = k in total. We can therefore view the symbol
Figure PCTCN2019081267-appb-000029
as a linear combination of these k symbols, namely the dn/r symbols in racks f, h_1, h_2, ..., h_{d-1} and the t symbols in rack h_d
Figure PCTCN2019081267-appb-000030
that is,
Figure PCTCN2019081267-appb-000031
where q_i ≠ 0 for i = 1, 2, ..., k. Hence c_{i,f} can be repaired by downloading the symbol
Figure PCTCN2019081267-appb-000032
from rack h_d, the symbols
Figure PCTCN2019081267-appb-000033
from racks h_i for i ∈ {1, 2, ..., d-1}, and the n/r-1 symbols
c_{1,f}, ..., c_{i-1,f}, c_{i+1,f}, ..., c_{n/r,f}
from rack f.
Therefore, when kr/n is not an integer, any single failed symbol can be repaired by downloading one symbol from each of d arbitrary racks and n/r-1 symbols from the host rack. When kr/n is an integer, the repair of the lost symbol c_{i,f} is the special case t = 0 of the procedure above.
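The repair argument in the proof can be traced on a concrete MDS code. The sketch below is a minimal illustration, assuming a Reed-Solomon code over the prime field GF(257) with evaluation points 1, ..., n; neither the field nor the code family is specified in the text. Racks and nodes are 0-indexed in the code, and the names `encode`, `lagrange_coeffs` and `repair` are hypothetical. Each helper rack sends exactly one field element.

```python
P = 257  # assumed prime field size, chosen only for illustration

def encode(msg, n, p=P):
    """Reed-Solomon encoding: evaluate the message polynomial at 1..n."""
    return [sum(c * pow(x, e, p) for e, c in enumerate(msg)) % p
            for x in range(1, n + 1)]

def lagrange_coeffs(xs, x_target, p=P):
    """Coefficients q_j with f(x_target) = sum_j q_j f(xs[j]), deg f < len(xs)."""
    coeffs = []
    for j, xj in enumerate(xs):
        num, den = 1, 1
        for l, xl in enumerate(xs):
            if l != j:
                num = num * (x_target - xl) % p
                den = den * (xj - xl) % p
        coeffs.append(num * pow(den, -1, p) % p)  # modular inverse, Python 3.8+
    return coeffs

def repair(code, n, k, r, f, i, helpers, p=P):
    """Recover node i of rack f from its rack mates plus one symbol per helper rack."""
    size = n // r
    d, t = divmod(k, size)                    # d = m helper racks, t leftover symbols
    assert len(helpers) == d and f not in helpers
    node = lambda rack, idx: rack * size + idx
    *full, last = helpers
    basis = [node(f, j) for j in range(size)]
    for h in full:
        basis += [node(h, j) for j in range(size)]
    basis += [node(last, j) for j in range(t)]  # k symbols in total
    target = node(last, t)                      # a further symbol of rack `last`
    q = dict(zip(basis, lagrange_coeffs([b + 1 for b in basis], target + 1, p)))
    # One combined symbol from each fully used helper rack (interference).
    sent = [sum(q[node(h, j)] * code[node(h, j)] for j in range(size)) % p
            for h in full]
    # One symbol from the last helper rack: target minus its own t terms.
    desired = (code[target]
               - sum(q[node(last, j)] * code[node(last, j)] for j in range(t))) % p
    # Symbols read locally inside the host rack.
    local = sum(q[node(f, j)] * code[node(f, j)] for j in range(size) if j != i) % p
    lost = node(f, i)
    return (desired - sum(sent) - local) * pow(q[lost], -1, p) % p

codeword = encode([3, 1, 4, 1, 5, 9, 2, 6], n=12)
print(repair(codeword, 12, 8, 4, f=0, i=1, helpers=[2, 3]), codeword[1])
```

The coefficient of the lost symbol is non-zero because the target evaluation point differs from all k basis points, which mirrors the condition q_i ≠ 0 in the proof.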
The example in Figure 4 illustrates the repair of a data node. To recover the failed symbol s_1, five symbols are downloaded, and only the coded symbol downloaded from rack 5 (called the desired symbol) depends linearly on s_1. The desired symbol
Figure PCTCN2019081267-appb-000034
consists of a desired component q_{1,1}s_1, which is what is needed to recover the failed symbol, and an interference component
Figure PCTCN2019081267-appb-000035
If the interference component is aligned, we obtain the desired component q_{1,1}s_1, and the failed symbol can be repaired if q_{1,1} ≠ 0. The other four downloaded coded symbols (called interference symbols) are therefore used to align the interference component. The first construction of DRC codes in [18] can be viewed as the special case where d = r-1 and n/(n-k) is an integer.
C. Hybrid MSRR codes satisfying αn/r ≥ m + αt
Idea: In a hybrid MSRR code, the first k-t data nodes are placed in the m data racks and the last t = k mod (n/r) data nodes are placed in the hybrid rack; each node stores α = d-m+1 symbols. To repair a data node in a data rack, the new node accesses all other symbols in the host rack. It downloads (i) m-1 coded symbols from the other m-1 data racks, (ii) one coded symbol from the hybrid rack, and (iii) d-m coded symbols from d-m coding racks. The d-m+1 desired symbols come from the hybrid rack and the d-m coding racks. Note that all interference symbols downloaded from the data racks are independent of the last tα data symbols. To recover the failed symbols, each of the d-m+1 desired symbols must therefore also be independent of the last tα data symbols. To align the interference components of all α desired symbols simultaneously, the encoding matrices must be chosen carefully, for which we introduce the concept of orthogonal vectors. We point out that in our construction of hybrid MSRR codes, a data node is repaired with optimal cross-rack repair bandwidth from d specific helper racks, rather than from d arbitrary helper racks.
Construction: Before describing the construction, we introduce some notation. The row vectors v_1, v_2, ..., v_m are orthogonal vectors of length m. The vectors u_1, u_2, ..., u_α have size 1 × αn/r. Let E_1, E_2, ..., E_m be αn/r × m matrices, F_1, F_2, ..., F_α be (B-mαn/r) × m matrices, and R_1, R_2, ..., R_α be B × (αn/r-m) matrices. For h = m+1, m+2, ..., r, the encoding matrix G_h is
Figure PCTCN2019081267-appb-000036
where the matrices R_i, i = 1, 2, ..., α, are as follows. The matrix R_1 is defined as
Figure PCTCN2019081267-appb-000037
where 0_{(B-αt)×αt} is the (B-αt) × αt zero matrix, I_{αt×αt} is the αt × αt identity matrix, and L_1 is a B × (αn/r-m-αt) matrix. The parameters must therefore satisfy αn/r ≥ m + αt. For i = 2, 3, ..., α, R_i is
Figure PCTCN2019081267-appb-000038
where y_1, y_2, ..., y_m are m orthogonal vectors of length αn/r-m, x_2, x_3, ..., x_α are α-1 vectors of length αn/r, D_{i,1}, D_{i,2}, ..., D_{i,m} are matrices of size αn/r × (αn/r-m), and C_i is a matrix of size (B-mαn/r) × (αn/r-m).
The following conditions must hold. For f = 1, 2, ..., m and i = 2, 3, ..., α, there exist non-zero elements λ'_{i,j}, for j = 1, ..., f-1, f+1, ..., a, such that (7) and (8) hold, and for l = 1, 2, ..., n/r all the submatrices
Figure PCTCN2019081267-appb-000039
of the matrix M_1 in (9) are non-singular. For i = 1, 2, ..., t, all the submatrices M_2[(1+(i-1)α, iα), (1, α)] of the matrix M_2 in (10) are non-singular. These requirements are called the repair conditions.
Figure PCTCN2019081267-appb-000040
Figure PCTCN2019081267-appb-000041
Figure PCTCN2019081267-appb-000042
Figure PCTCN2019081267-appb-000043
The proposed code satisfies the (n, k) decoding property if and only if the data file can be recovered from any k nodes. This is equivalent to requiring that all the α × α submatrices of the matrix
[G_{m+1}[(:), (1, (n/r-t)α)]  G_{m+2}  …  G_r]     (11)
consisting of rows
Figure PCTCN2019081267-appb-000044
and
Figure PCTCN2019081267-appb-000045
are non-singular, where the following relations hold:
Figure PCTCN2019081267-appb-000046
Figure PCTCN2019081267-appb-000047
and
Figure PCTCN2019081267-appb-000048
These requirements are called the fault-tolerance conditions.
If a data node in rack f ∈ {1, 2, ..., m} fails, the new node downloads from the relay node of rack m+1 one desired symbol
Figure PCTCN2019081267-appb-000049
and from the relay nodes of racks i+m, where i = 2, ..., α, the symbols
Figure PCTCN2019081267-appb-000050
It also downloads m-1 interference symbols from racks {1, 2, ..., m} \ {f}, with local encoding vectors
Figure PCTCN2019081267-appb-000051
respectively. Subtracting the symbols downloaded from racks {1, ..., f-1, f+1, ..., m} from the symbols downloaded from racks m+1, ..., m+α, we obtain α coded symbols, which are the product of the matrix in (9) and the vector
[s_{(f-1)αn/r+1}  s_{(f-1)αn/r+2}  …  s_{fαn/r}]
The lost α symbols can be recovered because the corresponding α × α submatrix of the matrix in (9) is non-singular. Before establishing the repair conditions and the fault-tolerance conditions, we recall the Schwartz-Zippel lemma.
Lemma 6 (Schwartz-Zippel [30]). Let Q(x_1, ..., x_n) ∈ F_q[x_1, ..., x_n] be a non-zero multivariate polynomial of degree d. Let r_1, ..., r_n be chosen independently and uniformly at random from a subset of F_q. Then
Figure PCTCN2019081267-appb-000052
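The lemma can be checked by brute force on a small case. In its standard form, the Schwartz-Zippel lemma bounds the probability that Q vanishes at a random point by d/|S|, where S is the subset from which the values are drawn; the sketch below assumes this standard form, since the displayed inequality is not reproduced in this text. The polynomial, the field size 11 and the function name are arbitrary choices made for illustration.

```python
from itertools import product

def count_zeros(poly, num_vars, p):
    """Count the points of F_p^num_vars at which poly vanishes modulo p."""
    zeros = sum(1 for point in product(range(p), repeat=num_vars)
                if poly(*point) % p == 0)
    return zeros, p ** num_vars

p = 11
degree = 3
Q = lambda x, y: x * x * y + 3 * x + y  # non-zero polynomial of total degree 3

zeros, total = count_zeros(Q, 2, p)
print(zeros, total, zeros / total, degree / p)
```

Here S is the whole field F_11, so the fraction of zeros must not exceed 3/11. This is the property used to argue that suitable encoding matrices exist when the field is large enough.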
TABLE IV: Parameters that satisfy the construction of hybrid MSRR codes in Section V-C.
Figure PCTCN2019081267-appb-000053
then there exist encoding matrices G h for h=m+1,m+2,...,r over
Figure PCTCN2019081267-appb-000054
of hybrid MSRR codes,where the parameters n,r,m,t,α satisfy αn/r≥m+αt.
Proof: See Appendix C.
From Theorem 7, the parameters supported by the proposed hybrid MSRR codes satisfy n≥(m+αt)r/α. Table IV lists some examples with d=r-1 and r=3, 4, 5, 6. Table IV shows that the construction covers most parameters of hybrid MSRR codes.
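The condition of Theorem 7 is easy to check for a given parameter set. The sketch below does so under assumed definitions that are not restated in this section: m = floor(kr/n), t = k - mn/r, α = d - m + 1 and B = kα for β=1. These are inferred from the worked example (n,k,r,d)=(12,8,4,3) in this section, and the function names are hypothetical.

```python
def hybrid_msrr_params(n, k, r, d):
    """Derive (m, t, alpha, B) for beta = 1.

    Assumed definitions (inferred from the worked example, not quoted):
    m = floor(k*r/n), t = k - m*n/r, alpha = d - m + 1, B = k*alpha.
    """
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    nodes_per_rack = n // r
    m = k // nodes_per_rack
    t = k - m * nodes_per_rack
    alpha = d - m + 1
    return m, t, alpha, k * alpha


def satisfies_theorem7(n, k, r, d):
    """Check the stated condition alpha*n/r >= m + alpha*t."""
    m, t, alpha, _ = hybrid_msrr_params(n, k, r, d)
    return alpha * (n // r) >= m + alpha * t


print(hybrid_msrr_params(12, 8, 4, 3), satisfies_theorem7(12, 8, 4, 3))
```

For (12,8,4,3) the condition holds with equality, since αn/r = 6 and m+αt = 6.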
Example. Take (n,k,r,d)=(12,8,4,3), which gives m=2, t=2, α=2 and B=16. The row vectors v_1, v_2 are orthogonal vectors of length 2, and the vectors u_1, u_2 have size 1×6. The matrices e_1, e_2 have size 6×2, F_1, F_2 have size 4×2, and R_1, R_2 have size 16×4. The encoding matrix G_3 is then given by
Figure PCTCN2019081267-appb-000055
where λ_{1,1}, λ_{1,2} are two nonzero elements. The encoding matrix G_4 is
Figure PCTCN2019081267-appb-000056
where y_1, y_2 are orthogonal vectors of length 4, x_2 has length 6, λ_{2,1}, λ_{2,2} are two nonzero elements, D_{2,1} and D_{2,2} have size 6×4, and C_2 is a 4×4 matrix. Figure 5 shows the example.
We can repair the two data symbols s_1, s_2 in node 1 by downloading the other four data symbols s_3, s_4, s_5, s_6 in rack 1 (interference symbols), one symbol from rack 2 (an interference symbol),
Figure PCTCN2019081267-appb-000057
and two symbols from racks 3 and 4 (the desired symbols)
Figure PCTCN2019081267-appb-000058
Note that, according to (8),
Figure PCTCN2019081267-appb-000059
and, according to (7),
Figure PCTCN2019081267-appb-000060
so the desired symbol downloaded from rack 4 is
Figure PCTCN2019081267-appb-000061
By subtracting the two desired symbols of racks 3 and 4 from the interference symbols in (14), we obtain the following two symbols
Figure PCTCN2019081267-appb-000062
Therefore, we can repair the two data symbols s_1, s_2 by removing the contribution of the four interference symbols s_3, s_4, s_5, s_6 from the above two symbols and then solving the resulting linear system in two unknowns, because the right 2×2 submatrix of the above matrix is nonsingular by (9). Nodes 2 and 3 can be repaired similarly, and so can a node in rack 2 or rack 3. Although the constructed hybrid MSRR codes achieve the minimum cross-rack repair bandwidth only for data nodes, the generic transformation of [31] can be applied to them so that every coded node has the same cross-rack repair bandwidth as uniform MSRR(n,k=mn/r,d) codes.
Exact-repair construction of MBRR codes for all parameters
As in the construction of MSRR codes, we consider the construction of MBRR codes with β=1. When β=1, the parameters of MBRR codes satisfy
B=kd-m(m-1)/2, α=γ=d.
We want to construct codes whose parameters satisfy the above and that have the (n,k) recovery property.
By connecting to any k nodes, we obtain kd symbols. The (n,k) recovery property is satisfied if the kd symbols contain B independent symbols, that is, if there are at most m(m-1)/2 dependent symbols among the symbols of any k nodes. We adapt the construction of product-matrix (PM) MBR codes [32] to MBRR codes. The specific construction is given below.
The encoding process is as follows:
· Divide the B data symbols into two parts: the first part has (k-m)d data symbols and the second part has md-m(m-1)/2 data symbols.
· Compute (n-r-k+m)d global coded symbols by encoding all B data symbols. Store the (n-r-k+m)d global coded symbols and the (k-m)d data symbols of the first part ((n-r)d symbols in total) in the last n/r-1 nodes of each of the r racks.
· Encode the second part with a PM-MBR(r,d,d) code to generate rd coded symbols. Divide the generated coded symbols into r groups of d coded symbols each. For each group, take linear combinations of the coded symbols in the group and all (n/r-1)d symbols stored in the last n/r-1 nodes of the rack, where each coding vector is a column vector of length (n/r-1)d+1, and store the resulting d coded symbols in the first node of the rack.
The construction is as follows. Denote the first (k-m)d data symbols s_j, j=1,2,...,(k-m)d, by the row vector [s_1 s_2 … s_{(k-m)d}].
Compute the (n-r-k+m)d global coded symbols by
[c_1 c_2 … c_{(n-r-k+m)d}] = [s_1 s_2 … s_B] Q,
The (n-r)d symbols stored in the last n/r-1 nodes of the r racks are represented by an r×(n/r-1)d matrix M_1. In general, we can choose Q to be a Cauchy matrix so that, if n-r≥k, any k of the n-r nodes (the last n/r-1 nodes of the r racks) suffice to reconstruct the B data symbols.
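A Cauchy matrix is a natural choice for Q because every square submatrix of a Cauchy matrix is nonsingular. The sketch below builds a small Cauchy matrix over a prime field and verifies that property by brute force. The field GF(11), the evaluation points and the matrix size are illustrative assumptions, not the dimensions of Q in the construction.

```python
from itertools import combinations

P = 11  # illustrative prime field


def cauchy_matrix(xs, ys):
    """Cauchy matrix over GF(P): entry (i, j) is 1/(xs[i] - ys[j]) mod P.

    xs and ys must be disjoint sets of distinct field elements.
    """
    return [[pow((x - y) % P, -1, P) for y in ys] for x in xs]


def det_mod_p(mat):
    """Determinant mod P by Gaussian elimination."""
    a = [row[:] for row in mat]
    size, det = len(a), 1
    for col in range(size):
        pivot = next((r for r in range(col, size) if a[r][col]), None)
        if pivot is None:
            return 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det = det * a[col][col] % P
        inv = pow(a[col][col], -1, P)
        for r in range(col + 1, size):
            factor = a[r][col] * inv % P
            for c in range(col, size):
                a[r][c] = (a[r][c] - factor * a[col][c]) % P
    return det % P


def all_square_submatrices_nonsingular(mat):
    """Brute-force check of every square submatrix."""
    rows, cols = len(mat), len(mat[0])
    for size in range(1, min(rows, cols) + 1):
        for rs in combinations(range(rows), size):
            for cs in combinations(range(cols), size):
                if det_mod_p([[mat[r][c] for c in cs] for r in rs]) == 0:
                    return False
    return True


Q = cauchy_matrix(xs=[0, 1, 2, 3], ys=[4, 5, 6])
print(all_square_submatrices_nonsingular(Q))
```

The modular inverse via `pow(x, -1, P)` needs Python 3.8 or later.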
Construct the d×d data matrix
Figure PCTCN2019081267-appb-000063
where S_1 is a symmetric m×m matrix obtained by first filling its upper-triangular part with the m(m+1)/2 data symbols s_j, j=(k-m)d+1,(k-m)d+2,…,(k-m)d+m(m+1)/2, and then obtaining the lower-triangular part by reflection along the diagonal. The rectangular matrix S_2 has size m×(d-m), and its entries are the m(d-m) data symbols s_j, j=(k-m)d+m(m+1)/2+1,…,B, listed in some arbitrary fixed order. The matrix
Figure PCTCN2019081267-appb-000064
is the transpose of S_2, and the matrix 0 is the (d-m)×(d-m) all-zero matrix.
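The layout of the data matrix can be sketched in a few lines. The function below arranges md-m(m-1)/2 symbols into the symmetric d×d block form [[S_1, S_2], [S_2^T, 0]] described above. The function name, the row-major fill order for S_2 and the integer placeholders used as symbols are assumptions made for illustration.

```python
def build_message_matrix(symbols, m, d):
    """Arrange m*d - m*(m-1)//2 symbols into the symmetric d x d matrix
    [[S1, S2], [S2^T, 0]]."""
    expected = m * d - m * (m - 1) // 2
    if len(symbols) != expected:
        raise ValueError(f"expected {expected} symbols, got {len(symbols)}")
    matrix = [[0] * d for _ in range(d)]
    it = iter(symbols)
    # S1: fill the upper triangle (diagonal included), then reflect it.
    for i in range(m):
        for j in range(i, m):
            matrix[i][j] = matrix[j][i] = next(it)
    # S2 (m x (d-m)) and its transpose; row-major order is an arbitrary choice.
    for i in range(m):
        for j in range(m, d):
            matrix[i][j] = matrix[j][i] = next(it)
    return matrix


# Placeholders 19..23 stand for s_19..s_23 when (n,k,r,d)=(12,8,4,3), m=2.
M2 = build_message_matrix([19, 20, 21, 22, 23], m=2, d=3)
for row in M2:
    print(row)
```

The symbol count m(m+1)/2 + m(d-m) equals md-m(m-1)/2, which is the size of the second part of the data symbols.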
Define Φ as a d×r matrix whose i-th column is denoted by
Figure PCTCN2019081267-appb-000065
for i=1,2,...,r. Define P as an (n/r-1)d×rd matrix whose l-th column is denoted by
Figure PCTCN2019081267-appb-000066
Figure PCTCN2019081267-appb-000067
For i=1,2,...,r, the d local coded symbols stored in the first node of rack i are computed as
Figure PCTCN2019081267-appb-000068
Note that
Figure PCTCN2019081267-appb-000069
can be viewed as a codeword of a PM-MBR(r,d,d) code.
Figure 6 shows an example with (n,k,r,d)=(12,8,4,3). In this example, there are B=23 data symbols. The first 18 data symbols and 6 global coded symbols are stored in the last two nodes of each rack, and the 4 local coded symbols are stored in the first node of each rack.
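The symbol counts in this example follow directly from the formulas of the encoding process. The sketch below recomputes them, assuming m = floor(kr/n), which is not restated in this section. The function name and dictionary keys are hypothetical.

```python
def mbrr_symbol_counts(n, k, r, d):
    """Symbol counts of the MBRR layout for beta = 1.

    Assumption: m = floor(k*r/n), with n a multiple of r.
    """
    nodes_per_rack = n // r
    m = k // nodes_per_rack
    return {
        "file_size": k * d - m * (m - 1) // 2,         # B
        "first_part": (k - m) * d,                     # data symbols stored as-is
        "second_part": m * d - m * (m - 1) // 2,       # input to the PM-MBR code
        "global_coded": (n - r - k + m) * d,           # global coded symbols
        "in_last_nodes": (n - r) * d,                  # last n/r-1 nodes of all racks
    }


counts = mbrr_symbol_counts(12, 8, 4, 3)
print(counts)
```

For (12,8,4,3) this gives B=23, split into 18 and 5 data symbols, with 6 global coded symbols, so the last two nodes of the four racks hold 18+6=24 symbols in total.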
Theorem 8. If the field size is larger than
Figure PCTCN2019081267-appb-000070
then any k nodes can recover the B data symbols, and the α symbols stored in any single node can be repaired with the optimal cross-rack repair bandwidth of MBRR codes.
Proof: File recovery. Suppose first that the data collector connects to k nodes that all belong to the n-r nodes formed by the last n/r-1 nodes of the r racks. Then the B data symbols can be recovered, because every square submatrix of a Cauchy matrix is nonsingular. Now suppose that a data collector connects to
Figure PCTCN2019081267-appb-000071
nodes among the last n/r-1 nodes of the racks and to
Figure PCTCN2019081267-appb-000072
other nodes, where
Figure PCTCN2019081267-appb-000073
The kd received symbols can be represented by a B×kd encoding matrix. If we view each entry of P and Φ as a nonzero variable, it suffices to check that there exists a B×B submatrix whose determinant, of total degree at most B, is a nonzero polynomial. There are
Figure PCTCN2019081267-appb-000074
possible choices, and the product of all these determinants is a polynomial of total degree at most (16).
Therefore, by the Schwartz-Zippel lemma, if the field size is larger than the value in (16), the B data symbols can be decoded from any k nodes.
Repair. Suppose that a node in rack f fails, where f∈{1,2,...,r}. The new node connects to any d helper racks h_i, i=1,2,…,d. The relay node of helper rack h_i accesses all symbols stored in its rack to recover
Figure PCTCN2019081267-appb-000075
and then computes and sends the coded symbol
Figure PCTCN2019081267-appb-000076
to the new node. The new node thus holds the d coded symbols
Figure PCTCN2019081267-appb-000077
Since the left matrix above is invertible, the new node can compute the coded symbols φ_f M_2, as in the repair process of PM-MBR codes, and can then repair the failed node by accessing all other symbols in rack f.
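The step borrowed from PM-MBR codes can be illustrated on its own. The following is a minimal sketch of the standard product-matrix MBR repair over a small prime field: each helper sends one inner product, the new node inverts the helpers' encoding rows, and the symmetry of the message matrix gives back the lost content. It does not model racks or relay nodes. The field GF(11), the Vandermonde encoding rows and the message values are illustrative assumptions.

```python
P = 11  # illustrative prime field


def solve_mod_p(a, b):
    """Solve a*x = b over GF(P) for a square nonsingular matrix a (Gauss-Jordan)."""
    size = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        # A nonsingular matrix always has a nonzero pivot in this column.
        pivot = next(r for r in range(col, size) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * w) % P for v, w in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def encode(message, psi_row):
    """Node content psi_row^T * M, a row vector of d symbols."""
    d = len(message)
    return [sum(psi_row[i] * message[i][j] for i in range(d)) % P for j in range(d)]


def repair(stored, psi, failed, helpers):
    """Rebuild the content of node `failed` from one symbol per helper."""
    psi_f = psi[failed]
    # Each helper h sends the scalar (psi_h^T M) psi_f.
    received = [sum(s * c for s, c in zip(stored[h], psi_f)) % P for h in helpers]
    # received = Psi_helpers * (M psi_f); invert Psi_helpers to get M psi_f.
    m_psi_f = solve_mod_p([psi[h] for h in helpers], received)
    # M is symmetric, so (M psi_f)^T equals psi_f^T M, the lost content.
    return m_psi_f


message = [[1, 2, 4], [2, 3, 5], [4, 5, 0]]       # symmetric, hypothetical values
psi = [[1, x, x * x % P] for x in range(1, 5)]    # Vandermonde rows for 4 nodes
stored = [encode(message, row) for row in psi]
rebuilt = repair(stored, psi, failed=0, helpers=[1, 2, 3])
print(rebuilt == stored[0])
```

Any d rows of a Vandermonde matrix with distinct evaluation points are linearly independent, which is what makes the helpers' matrix invertible here.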
In fact, the upper bound on the field size in Theorem 8 is exponential in k. However, whether any k nodes can reconstruct the B data symbols can be checked directly by computer search. By computer search we can always find P and Φ; for the example (n,k,r,d)=(12,8,4,3), a field of size 11 suffices for any k nodes to reconstruct the B data symbols. The underlying finite field can also be replaced with binary cyclic codes [33] to reduce the computational complexity.
Comparative analysis
In this section, we evaluate the cross-rack repair bandwidth at the two extreme points of RRC, RC and other related codes, such as the clustered codes in [7] and the codes in [8]. We also discuss the parameters supported by our exact-repair constructions, by the exact-repair constructions in [7], and by DRC [6], [18].
A. Cross-rack repair bandwidth
1) Comparison of MSRR (MBRR) and MSR (MBR): According to Theorem 4, if kr/n is an integer, the cross-rack repair bandwidth of MSR codes equals that of MSRR codes; if kr/n is not an integer, the cross-rack repair bandwidth of MSRR codes is strictly less than that of MSR codes. Figure 7 shows the cross-rack repair bandwidth of MSRR and MSR codes for B=1, r=3,4 and d=r-1. The results show that hybrid MSRR codes have strictly less cross-rack repair bandwidth than MSR codes, and that this advantage grows as k increases. For example, when (n,k,r)=(18,11,3) and (n,k,r)=(18,17,3), the cross-rack repair bandwidth of MSRR codes is 42% and 83% less than that of MSR codes, respectively.
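The two percentages can be reproduced with a short calculation. The sketch below assumes the MSRR point γ = dB/(k(d-m+1)) with m = floor(kr/n) and d=r-1, and an MSR baseline in which all n-1 surviving nodes help, each sending B/(k(n-k)) symbols, with the n-n/r helpers outside the host rack counted as cross-rack traffic. These formulas are assumptions that are not restated in this section, and the function names are hypothetical.

```python
from fractions import Fraction


def msrr_cross_rack_bandwidth(n, k, r, d, B=1):
    """Assumed MSRR point: alpha = B/k, beta = alpha/(d-m+1), gamma = d*beta."""
    m = k // (n // r)
    return Fraction(d * B, k * (d - m + 1))


def msr_cross_rack_bandwidth(n, k, r, B=1):
    """Assumed MSR baseline with n-1 helpers, each sending B/(k(n-k)) symbols;
    only the n - n/r helpers outside the host rack count as cross-rack."""
    return Fraction((n - n // r) * B, k * (n - k))


def reduction_percent(n, k, r):
    """Relative saving of MSRR over MSR in percent, with d = r-1."""
    d = r - 1
    ratio = msrr_cross_rack_bandwidth(n, k, r, d) / msr_cross_rack_bandwidth(n, k, r)
    return float(100 * (1 - ratio))


for params in [(18, 11, 3), (18, 17, 3), (18, 12, 3)]:
    print(params, round(reduction_percent(*params), 1))
```

Under these assumptions the savings are 5/12 and 5/6, which round to the 42% and 83% quoted above, and the saving is zero for (18,12,3), where kr/n is an integer.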
According to Theorem 4, if k/n>2/r and kr/n is an integer, MBRR codes have less cross-rack repair bandwidth than MBR codes. Therefore, unless the code rate is too low, the cross-rack repair bandwidth of MBRR codes is strictly less than that of MBR codes.
Figure 8 shows the cross-rack repair bandwidth of the two codes for B=1, r=5,6 and d=r-1. The results show that MBRR codes have less cross-rack repair bandwidth for all evaluated parameters. For given r and n, the gap between MBRR and MBR codes widens as k increases. When n=20 and r=5, the cross-rack repair bandwidth of MBRR codes is 11% to 32% less than that of MBR codes. When n=18 and r=6, the reduction is 7% to 32%.
Let B=1 and d=r-1. For the specific cases (n,k)=(18,13) and (n,k)=(24,18), Figure 9 shows the cross-rack repair bandwidth of MBR and MBRR codes for r=3,6,9 and r=3,4,6,8,12, respectively. Two observations follow from Figure 9. First, the cross-rack repair bandwidth of MBRR codes is always less than that of MBR codes. Second, for given n and k, the advantage of MBRR codes varies with r. For example, when (n,k,r)=(24,18,3), the cross-rack repair bandwidth of MBRR codes is 6.8% less than that of MBR codes, whereas when (n,k,r)=(24,18,8) the reduction increases to 21.6%.
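The figures for (n,k)=(24,18) can be reproduced in the same way. The sketch below assumes the MBRR point γ = dB/(kd-m(m-1)/2) with m = floor(kr/n) and d=r-1, and an MBR baseline in which all n-1 surviving nodes help, each sending 2B/(k(2(n-1)-k+1)) symbols, with the n-n/r helpers outside the host rack counted as cross-rack traffic. The baseline formula and the function names are assumptions that are not restated in this section.

```python
from fractions import Fraction


def mbrr_cross_rack_bandwidth(n, k, r, d, B=1):
    """MBRR point scaled to file size B: gamma = d*B / (k*d - m(m-1)/2)."""
    m = k // (n // r)
    return Fraction(d * B) / (k * d - Fraction(m * (m - 1), 2))


def mbr_cross_rack_bandwidth(n, k, r, B=1):
    """Assumed MBR baseline with n-1 helpers; only the n - n/r helpers
    outside the host rack count as cross-rack traffic."""
    beta = Fraction(2 * B, k * (2 * (n - 1) - k + 1))
    return (n - n // r) * beta


def reduction_percent(n, k, r):
    """Relative saving of MBRR over MBR in percent, with d = r-1."""
    d = r - 1
    ratio = mbrr_cross_rack_bandwidth(n, k, r, d) / mbr_cross_rack_bandwidth(n, k, r)
    return float(100 * (1 - ratio))


for params in [(24, 18, 3), (24, 18, 8)]:
    print(params, round(reduction_percent(*params), 1))
```

Under these assumptions the two savings come out as 6.8% and 21.6%, matching the values quoted above.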
2) Comparison of MSRR (MBRR) with the minimum storage (bandwidth) codes of [7], [8]: The two most closely related works are [7] and [8]. In our model, any k nodes suffice to reconstruct the data file, whereas in [7] any kr/n racks (where kr/n is an integer) can reconstruct the data file, but there may be k nodes that cannot. In [7], a failed node is repaired by downloading α symbols from each of
Figure PCTCN2019081267-appb-000078
other nodes in the host rack and β symbols from each of d other racks. The β symbols downloaded from a remote rack are linear combinations of all αn/r symbols stored in that rack. For functional repair, Theorem 4.1 in [7] shows that the file size is upper bounded by the following expression (note that k and m in [7] should be replaced by kr/n and n/r, respectively):
Figure PCTCN2019081267-appb-000079
If we download all α(n/r-1) symbols in the host rack to repair the failed node, i.e.,
Figure PCTCN2019081267-appb-000080
then the above upper bound becomes
Figure PCTCN2019081267-appb-000081
Assume that d≥kr/n. Then max{d-i,0}β-α=(d-i)β-α, and when kr/n is an integer, the above bound coincides with the bound (1) in our Theorem 1. Therefore, the cross-rack repair bandwidth of the codes in [7] equals that of our RRC. By the analysis of Theorem 3, for i=1,2,...,n/r-1 with kr/n an integer, MSRR(n,k+i,r) codes (MBRR(n,k+i,r) codes) have less cross-rack repair bandwidth than MSRR(n,k,r) codes (MBRR(n,k,r) codes). Hence MSRR(n,k+i,r) codes (MBRR(n,k+i,r) codes) have strictly less cross-rack repair bandwidth than the (n,k,r) minimum storage (bandwidth) codes of [7] for i=1,2,…,n/r-1, where kr/n is an integer. Moreover, in our model any k nodes can reconstruct the data file, which is not possible in [7]. Under the same setting, however, the bound of our model is the same as that in [7]. Our results together with those in [7] also yield the following observation. All α(n/r-1) symbols in the host rack are needed to achieve the minimum cross-rack repair bandwidth. If we reduce the intra-rack repair bandwidth, i.e., reduce
Figure PCTCN2019081267-appb-000082
then the cross-rack repair bandwidth increases.
Let
Figure PCTCN2019081267-appb-000083
and β_c be the numbers of symbols downloaded from the helper nodes in the host rack and in the other racks, respectively. Let
Figure PCTCN2019081267-appb-000084
Theorem 3 in [8] shows that the minimum storage overhead, i.e., α=B/k, is achievable if and only if ε≥1/(n-k). Therefore, when the minimum storage overhead is achieved, ε=1/(n-k) is the condition for attaining the minimum cross-rack repair bandwidth. When ε=1/(n-k), the minimum storage point of the codes in [8] is
Figure PCTCN2019081267-appb-000085
where γ_msr is the cross-rack repair bandwidth when all n-1 surviving nodes are contacted. We make three observations. First, all α symbols of the nodes in the host rack should be downloaded to minimize the cross-rack repair bandwidth. Second, when ε=1/(n-k), the cross-rack repair bandwidth at the minimum storage point of the codes in [8] equals that of the original MSR codes with the same parameters. Third, when t≠0, the cross-rack repair bandwidth of our MSRR codes is strictly less than that at the minimum storage point of the codes in [8]. On the other hand, if t=0, i.e., kr/n is an integer, the cross-rack repair bandwidth of MSRR codes equals that at the minimum storage point of the codes in [8].
Consider the cross-rack repair bandwidth at the minimum bandwidth point of the codes in [8]. For ε>0, let β_c=1; then
Figure PCTCN2019081267-appb-000086
and the minimum bandwidth point of the codes in [8] is
(α_mbr, γ_mbr) = ((n/r-1)/ε+(n-n/r), (n-n/r)),
where γ_mbr is the cross-rack repair bandwidth, and the file size is
Figure PCTCN2019081267-appb-000087
according to Proposition 3 of [10]. We observe that the storage increases as ε decreases. If we decrease ε, the normalized cross-rack repair bandwidth decreases at the cost of increased storage. At the minimum bandwidth point of the codes in [8], the storage α_mbr is larger than the cross-rack repair bandwidth γ_mbr, whereas for MBRR codes the storage equals the cross-rack repair bandwidth.
Let ε=1. Figure 10 shows the trade-off between storage and cross-rack repair bandwidth for B=1, n=20,25, r=5 and d=4. For all of these parameters, MBRR codes have less storage and less cross-rack repair bandwidth than the minimum bandwidth point of the codes in [8].
In summary, for most parameters the cross-rack repair bandwidth of our RRC is strictly less than that of the codes in [8] and, if kr/n is an integer, equal to that of the codes in [7]. In addition, RRC can tolerate more failure patterns than the codes in [7]. When kr/n is an integer, hybrid MSRR(n,k+i,r) codes, i=1,2,…,n/r-1, have less cross-rack repair bandwidth than MSR(n,k,r) codes and the minimum storage (n,k,r) codes in [7].
B. Parameters of exact-repair MSRR and MBRR codes
We now analyze the parameters supported by the exact-repair constructions at the two extreme points for RRC, the codes in [7], [8], and the DRC codes in [18]. The first construction of DRC in [18] can be viewed as a special case of our MSRR construction in Section V-B, in which n/(n-k) is an integer and d=r-1. The second construction of DRC in [18] is for r=3. In the construction in [7], kr/n must be an integer. The construction of the minimum storage codes of [8] is given in [9] and supports only r=2 and n=2k. For r=3,4,5,6 and different values of n, Figure 11 shows the range of k supported by MSRR codes, the clustered codes in [7], and DRC. The results show that MSRR codes support far more parameters than the other two codes.
The exact-repair construction of the clustered codes in [7] is based on existing constructions of MSR codes. When k/n<0.5, existing MSR constructions require the storage α to be exponential in k, and the minimum storage construction of the clustered codes in [7] has the same limitation. Our MSRR construction has no such limitation.
The MBRR codes proposed here support all parameters. The construction of the minimum bandwidth codes of [8] is given in [10] and also supports all parameters, whereas kr/n must be an integer in the minimum bandwidth construction of [7]. Therefore, our construction supports more parameters than that in [7].
Conclusion and future work
We have studied the optimal trade-off between storage and cross-rack repair bandwidth in rack-based data centers. We proposed rack-aware regenerating codes (RRC) that achieve the optimal trade-off, derived the two extreme optimal points, namely the MSRR point and the MBRR point, and gave exact-repair constructions of MSRR and MBRR codes. For most parameters, the cross-rack repair bandwidth of MSRR codes (MBRR codes) is strictly less than that of MSR codes (MBR codes). In our system model, all symbols in the host rack must be downloaded to repair a failed node. One direction for future work is to generalize the results to allow a more flexible choice of helper nodes in the host rack; another is to implement RRC in practical rack-based data centers.
Appendix A
Proof of Theorem 1
Proof: We first state the following lemma.
Lemma 9. If the relay node of a rack is connected to the data collector T while the other n/r-1 nodes of the rack are not all connected to T, then the capacity of the (S,T)-cut is not minimal.
Proof. Suppose that the relay node X_{1,1} is connected to T. Since the incoming edges of T all have infinite capacity, we only need to examine the incoming edges of Out_{1,1} and In_{1,1}. Since X_{1,1} is not a failed node, the incoming edges of Out_{1,1} and In_{1,1} have capacity αn/r and some finite value, respectively. Therefore, a relay node that has not failed contributes αn/r to the cut. On the other hand, if a relay node X'_{1,1} that has failed and been repaired is connected to T, the incoming edges of Out'_{1,1} and In'_{1,1} have capacities αn/r and α(n/r-1)+dβ, respectively, so node X'_{1,1} can contribute min{(n/r-1)α+dβ, αn/r} symbols to the cut. Each of the other n/r-1 nodes in rack 1 has an edge of capacity α connecting its input node to Out_{1,1}. The other n/r-1 nodes in rack 1 contribute nothing to the cut, whether or not they are connected to T. Therefore, if the relay node is connected to T, all other n/r-1 nodes in the same rack should also be connected to T in order to minimize the capacity of the cut.
Next, we show that there exists an information flow graph G(n,k,r,d,α,β) such that mincut(G) equals the right-hand side of (1). In this graph, the relay nodes X_{1,1}, X_{2,1},..., X_{m,1} fail in this order. For
Figure PCTCN2019081267-appb-000088
each new node X'_{1,1} reads α symbols from each of the nodes
Figure PCTCN2019081267-appb-000089
and β symbols from the first d relay nodes. The data collector T connects to all nodes of the first m racks and to k-mn/r nodes of rack m+1 (excluding the relay node). Figure 2 shows the graph G(n,k,r,d,α,β) for (n,k,r,d)=(9,5,3,2). For each
Figure PCTCN2019081267-appb-000090
rack
Figure PCTCN2019081267-appb-000091
can contribute min{(n/r-1)α+(d-l+1)β, n/r·α} to the cut. Therefore, mincut(G) equals the right-hand side of (1).
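The min-cut described in this paragraph can be evaluated numerically. The sketch below sums the per-rack contributions min{(n/r-1)α+(d-l+1)β, (n/r)α} for l=1,...,m and adds (k-mn/r)α for the remaining nodes. This is a reading of the right-hand side of (1) reconstructed from the appendix text, with m = floor(kr/n) assumed, and the function name is hypothetical.

```python
def min_cut(n, k, r, d, alpha, beta):
    """Min-cut value as reconstructed from the appendix (an assumption):

    sum_{l=1..m} min{(n/r-1)*alpha + (d-l+1)*beta, (n/r)*alpha}
        + (k - m*n/r)*alpha,   with m = floor(k*r/n).
    """
    per_rack = n // r
    m = k // per_rack
    rack_terms = sum(
        min((per_rack - 1) * alpha + (d - l + 1) * beta, per_rack * alpha)
        for l in range(1, m + 1)
    )
    return rack_terms + (k - m * per_rack) * alpha


# (n,k,r,d)=(12,8,4,3) with beta=1: alpha=2 (MSRR point), alpha=3 (MBRR point).
print(min_cut(12, 8, 4, 3, alpha=2, beta=1))
print(min_cut(12, 8, 4, 3, alpha=3, beta=1))
```

With β=1, the two calls return 16 and 23, which agree with B=kα=16 at the MSRR point and B=kd-m(m-1)/2=23 at the MBRR point for these parameters.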
Next, we show that (1) must hold for every information flow graph G(n,k,r,d,α,β). Suppose that T connects to k out-vertices, denoted by {Out_{h,i}: (h,i)∈I}, where I has cardinality k. We want to show that mincut(G) is at least the right-hand side of (1).
Without loss of generality, assume that
Figure PCTCN2019081267-appb-000092
are the first n/r out-vertices. If exactly one of the vertices
Figure PCTCN2019081267-appb-000093
i.e.,
Figure PCTCN2019081267-appb-000094
is a relay node, it can contribute min{(n/r-1)α+dβ, n/r·α} to the cut, and we take the other n/r-1 vertices in the same rack, which contribute nothing to the cut. If the number of relay nodes is larger than 1, the contribution is larger than min{(n/r-1)α+dβ, n/r·α}. If none of the vertices
Figure PCTCN2019081267-appb-000095
is a relay node, they can contribute n/r to the cut. Therefore, the n/r vertices contribute at least min{(n/r-1)α+dβ, n/r·α} to the cut.
Now assume that
Figure PCTCN2019081267-appb-000096
are another n/r out-vertices. As in the discussion above, these n/r vertices contribute at least min{(n/r-1)α+(d-1)β, n/r·α} to the cut. In the same way, for
Figure PCTCN2019081267-appb-000097
considering the
Figure PCTCN2019081267-appb-000098
-th group of n/r vertices and the last k-mn/r vertices, we obtain that the min-cut of any information flow graph G(n,k,r,d,α,β) is at least the right-hand side of (1).
Appendix B
Proof of Theorem 2
Proof: We solve for α*(β) as follows
Figure PCTCN2019081267-appb-000099
If α≤(d-m+1)β, then kα≥B and α*(β)=B/k. If α≥dβ, then
kα+(dβ-α)+((d-1)β-α)+…+((d-m+1)β-α)≥B,
with
Figure PCTCN2019081267-appb-000100
For i=1,2,...,m-1, if (d-m+i+1)β<α≤(d-m+i+2)β, the cut capacity is
kα+((d-m+1)β-α)+((d-m+2)β-α)+…+((d-m+i+1)β-α)
=(k-i-1)α+(i+1)(d-m+i/2+1)β.
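The closed form follows from summing the arithmetic progression (d-m+1)+(d-m+2)+…+(d-m+i+1) = (i+1)(d-m+1)+i(i+1)/2. The sketch below checks the identity numerically with exact rational arithmetic; the function names and the tested parameter ranges are arbitrary choices made for illustration.

```python
from fractions import Fraction


def cut_capacity_sum(k, d, m, i, alpha, beta):
    """k*alpha plus the i+1 terms ((d-m+1+j)*beta - alpha) for j = 0..i."""
    return k * alpha + sum((d - m + 1 + j) * beta - alpha for j in range(i + 1))


def cut_capacity_closed_form(k, d, m, i, alpha, beta):
    """(k-i-1)*alpha + (i+1)*(d-m+i/2+1)*beta."""
    return (k - i - 1) * alpha + (i + 1) * (d - m + Fraction(i, 2) + 1) * beta


mismatches = [
    (k, d, m, i)
    for k in range(2, 8)
    for d in range(1, 6)
    for m in range(1, d + 1)
    for i in range(0, m)
    if cut_capacity_sum(k, d, m, i, Fraction(7, 3), Fraction(2, 5))
    != cut_capacity_closed_form(k, d, m, i, Fraction(7, 3), Fraction(2, 5))
]
print(len(mismatches))
```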
The minimum cut capacity Φ as a function of α is
Figure PCTCN2019081267-appb-000101
where
b_i=β(d-m+i+1),       (18)
for i=0,1,...,m-1. Since Φ≥B, we can solve for α*(β), which is
Figure PCTCN2019081267-appb-000102
Therefore, if
Figure PCTCN2019081267-appb-000103
then, for i=1,2,...,m-1,
Figure PCTCN2019081267-appb-000104
Recalling that b_i is defined in (18), we compute
Figure PCTCN2019081267-appb-000105
Figure PCTCN2019081267-appb-000106
which gives the optimal trade-off stated in the theorem.
Appendix C.
Proof of Theorem 7
Proof: We treat the values of v_1, v_m, y_1,…,y_m and λ_{i,j} as constants, and the other entries of the vectors and matrices as variables. Then there are
Figure PCTCN2019081267-appb-000107
variables and αmn/r(α-1)^2 equations in (7), and
Figure PCTCN2019081267-appb-000108
variables and (B-αmn/r)m(α-1) equations in (8). Since αn/r≥2m, we can regard all elements of D_{i,j}, F_2,…,F_α and c_2,…,c_α as constants, and the other elements as free variables. Each element of the matrix M_1 in (9) and of M_2 in (10) can be interpreted as a polynomial of total degree 2 and 1, respectively. For the repair condition, the product of all the determinants of the corresponding submatrices in (9) and (10) is a polynomial of total degree 2αn/r+αt=α(2n/r+t).
Reconstruction condition. Each element of the matrix in (11) is a polynomial of total degree 1. The product of all the determinants can be regarded as a polynomial whose total degree is
Figure PCTCN2019081267-appb-000109
Therefore, by the Schwartz-Zippel lemma, if the field size is larger than (13), the repair condition and the MDS property can be satisfied.
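The Schwartz-Zippel lemma states that a nonzero polynomial of total degree D over a field with q elements vanishes on at most a D/q fraction of all points, so a non-root exists whenever q > D. The following sketch illustrates that bound on a toy polynomial, the determinant ad-bc of a 2×2 matrix (total degree 2) over GF(p) for small primes p. The toy polynomial and the function name are illustrative assumptions; they are not the determinant products from (9) to (11).

```python
from itertools import product


def count_det_zeros(p):
    """Count points (a, b, c, d) in GF(p)^4 where ad - bc = 0 (mod p).

    Returns (number_of_zeros, total_number_of_points).
    """
    zeros = sum(
        1
        for a, b, c, d in product(range(p), repeat=4)
        if (a * d - b * c) % p == 0
    )
    return zeros, p ** 4


for p in (3, 5, 7):
    zeros, total = count_det_zeros(p)
    # Schwartz-Zippel bound for total degree 2: zeros / total <= 2 / p.
    assert zeros * p <= 2 * total
    print(p, zeros, total)
```

The same counting argument is what makes a sufficiently large field enough to guarantee that all required determinants are simultaneously nonzero for some assignment of the free variables.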
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (5)

  1. A rack-aware regenerating code for a data center, characterized in that the rack-aware regenerating code is used in a data center consisting of n nodes, the n nodes being evenly distributed over r racks, with n/r nodes in each rack h, where n is a multiple of r; all nodes are labeled from 1 to n, where h=1,2,...,r and i=1,2,...,n/r; a data file is regarded as a vector of length B over a finite field, is encoded into nα symbols, and is stored in the n nodes, with α symbols stored in each node.
  2. The rack-aware regenerating code for a data center according to claim 1, characterized in that each rack is provided with a relay node, and each relay node can read the information in the other nodes in the same rack.
  3. The rack-aware regenerating code for a data center according to claim 2, characterized in that the relay node is any surviving node selected from the rack, and different data files are associated with different relay nodes during the repair operation.
  4. The rack-aware regenerating code for a data center according to claim 3, characterized in that, when a storage node fails, a new node is generated to replace it and is placed in the same rack; the new node arbitrarily selects d other racks, where d<r, and connects to the corresponding relay node of each selected rack; each such relay node sends β symbols to the new node; the new node regenerates its content from the dβ symbols received and the (n/r-1)α symbols in the rack where the failed node was located; wherein the cross-rack repair bandwidth is γ=dβ.
  5. The rack-aware regenerating code for a data center according to claim 4, characterized in that, when the relay node is connected to a data collector, the rack-aware regenerating code performs exact repair and functional repair; in exact repair, the symbols stored in the new node are identical to those stored in the failed node; in functional repair, the new node may contain symbols different from those of the failed node, as long as the decoding property is satisfied.
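As an informal illustration of the quantities named in claim 4, the sketch below tallies the symbols read during a single-node repair: (n/r-1)α inside the failed node's rack and γ=dβ across racks. The function name, the returned dictionary keys, and the sample parameters are hypothetical and serve only to make the arithmetic concrete.

```python
def repair_traffic(n, r, d, alpha, beta):
    """Symbols downloaded by the replacement node in one repair.

    n nodes are spread evenly over r racks, d < r helper racks each send
    beta symbols through their relay node, and each node stores alpha symbols.
    """
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    if not d < r:
        raise ValueError("d must be smaller than r")
    intra_rack = (n // r - 1) * alpha  # read from surviving nodes in the same rack
    cross_rack = d * beta              # gamma = d * beta
    return {"intra_rack_symbols": intra_rack, "cross_rack_symbols": cross_rack}


# Hypothetical example: 12 nodes, 4 racks, 3 helper racks, alpha=4, beta=2.
print(repair_traffic(12, 4, 3, 4, 2))
```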
PCT/CN2019/081267 2019-04-03 2019-04-03 Rack-aware regenerating code for data center WO2020199162A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980000640.3A CN111971945A (en) 2019-04-03 2019-04-03 Rack sensing regeneration code for data center
PCT/CN2019/081267 WO2020199162A1 (en) 2019-04-03 2019-04-03 Rack-aware regenerating code for data center

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/081267 WO2020199162A1 (en) 2019-04-03 2019-04-03 Rack-aware regenerating code for data center

Publications (1)

Publication Number Publication Date
WO2020199162A1 (en) 2020-10-08

Family

ID=72664578

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/081267 WO2020199162A1 (en) 2019-04-03 2019-04-03 Rack-aware regenerating code for data center

Country Status (2)

Country Link
CN (1) CN111971945A (en)
WO (1) WO2020199162A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120023362A1 (en) * 2010-07-20 2012-01-26 Tata Consultancy Services Limited System and method for exact regeneration of a failed node in a distributed storage system
CN103688514A (en) * 2013-02-26 2014-03-26 北京大学深圳研究生院 Coding method for minimum storage regeneration codes and method for restoring of storage nodes
CN105260259A (en) * 2015-09-16 2016-01-20 长安大学 System minimum storage regeneration code based local repair encoding method
CN105721611A (en) * 2016-04-15 2016-06-29 西南交通大学 General method for generating minimal storage regenerating code with maximum distance separable storage code

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8788913B1 (en) * 2011-12-30 2014-07-22 Emc Corporation Selection of erasure code parameters for no data repair
US9722637B2 (en) * 2013-03-26 2017-08-01 Peking University Shenzhen Graduate School Construction of MBR (minimum bandwidth regenerating) codes and a method to repair the storage nodes
US9734007B2 (en) * 2014-07-09 2017-08-15 Qualcomm Incorporated Systems and methods for reliably storing data using liquid distributed storage
CN108512553B (en) * 2018-03-09 2022-09-27 哈尔滨工业大学深圳研究生院 Truncated regeneration code construction method for reducing bandwidth consumption
CN108512918A (en) * 2018-03-23 2018-09-07 山东大学 The data processing method of heterogeneous distributed storage system
CN109062724B (en) * 2018-07-21 2019-04-05 湖北大学 A kind of correcting and eleting codes conversion method and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU, YUCHONG ET AL.: "Double Regenerating Codes for Hierarchical Data Centers", 2016 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY, 31 December 2016 (2016-12-31), XP032940240 *

Also Published As

Publication number Publication date
CN111971945A (en) 2020-11-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19922850

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19922850

Country of ref document: EP

Kind code of ref document: A1