WO2020199162A1

WO2020199162A1 - Rack-aware regenerating code for data center

Info

Publication number: WO2020199162A1
Application number: PCT/CN2019/081267
Authority: WO
Inventors: 侯韩旭; 李柏晴
Original assignee: 东莞理工学院
Priority date: 2019-04-03
Filing date: 2019-04-03
Publication date: 2020-10-08
Also published as: CN111971945A

Abstract

The present invention relates to improvements in digital encoding technology. Erasure codes are widely used in data centers with massive storage to achieve high fault tolerance with low storage redundancy. Provided is a new erasure code, called a rack-aware regenerating code (RRC), that achieves the optimal trade-off between storage redundancy and cross-rack repair bandwidth. Further provided are methods for constructing exact-repair RRCs with minimum storage redundancy and with minimum cross-rack repair bandwidth. The present invention provides that: (1) the construction of an RRC with minimum storage redundancy supports a wide range of parameters, and in most cases its cross-rack repair bandwidth is strictly less than that of a classic minimum storage regenerating code; (2) the construction of an RRC with minimum cross-rack repair bandwidth supports all parameters, and for most parameters its cross-rack repair bandwidth is less than that of a minimum bandwidth regenerating code.

Description

Rack-aware regenerating code for data centers

Technical field

The invention relates to improvements in digital coding technology, and in particular to a rack-aware regenerating code for use in data centers.

Background technique

Modern storage systems are usually deployed as data centers, in which data is distributed across a large number of storage nodes in different racks. Examples include the Google file system [1], the Windows Azure storage system [2], and the Facebook storage system [3]. To protect against node failures and improve the reliability and durability of stored data, modern storage systems widely use erasure codes to encode data. Compared with traditional replication, erasure codes provide higher fault tolerance with lower storage redundancy. The Reed-Solomon (RS) code [4] is the most widely used erasure code (for example, at Google [1]). An (n, k) RS code takes a file of k data blocks (a block being the unit on which the coding operations are performed), encodes it over a finite field into n blocks, and distributes these n blocks across n different nodes (where k<n). During reconstruction, a data collector rebuilds the data file by connecting to any k of the n nodes and retrieving their blocks. RS codes have two important practical properties: (i) they tolerate any n-k node failures with minimum storage redundancy, and (ii) they support arbitrary values of n and k (k<n).

When a node fails, each data block stored on the failed node must be repaired on a new node to maintain the same level of fault tolerance. RS codes use the conventional repair method: first reconstruct the data file, then re-encode it to regenerate the lost data block. Therefore, for an (n, k) RS code, repairing one lost data block requires downloading k data blocks (that is, k times the amount of lost data), which increases network bandwidth usage and I/O read and write operations.
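The repair cost described above can be made concrete with a small calculation. The sketch below is illustrative only: the helper name `rs_costs` and its field names are hypothetical, and the values simply restate the (n, k) RS properties given in the text, measured in blocks.

```python
from fractions import Fraction

def rs_costs(n: int, k: int) -> dict:
    """Basic costs of an (n, k) RS code, in blocks (hypothetical helper)."""
    if not 0 < k < n:
        raise ValueError("require 0 < k < n")
    return {
        "tolerated_failures": n - k,         # any n-k nodes may fail
        "storage_overhead": Fraction(n, k),  # stored blocks per data block
        "repair_download_blocks": k,         # conventional repair reads k blocks
    }

costs = rs_costs(12, 8)
```

For (n, k) = (12, 8), repairing a single lost block with the conventional method reads 8 blocks, which is the overhead that regenerating codes aim to reduce.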

Regenerating codes (RC) were proposed by Dimakis et al. [5] with the goal of optimizing the network bandwidth used during repair. An RC encodes a data file into blocks that are distributed across n nodes, with each node storing multiple blocks, such that the data file can be reconstructed from a certain number of nodes. To repair the blocks of a failed node, the new node downloads encoded blocks from a selected subset of the surviving nodes. The total amount of encoded data downloaded from the surviving nodes is called the repair bandwidth, and it is much smaller than the size of the original data file. Dimakis et al. [5] also characterized the optimal trade-off between repair bandwidth and storage redundancy.

In an actual data center, storage nodes are organized in racks, and communication between racks is usually more expensive than communication within a rack. For erasure coding it is therefore important to minimize the cross-rack repair bandwidth (that is, the total amount of data transmitted between different racks during repair). RCs do not address this problem and cannot minimize the cross-rack repair bandwidth. This has motivated a line of research on the repair problem in data centers (see Part 2 for details). In particular, Hu et al. [6] proposed double regenerating codes (DRC), which reconstruct part of the repair data inside each rack and combine the partially repaired data across racks to minimize the cross-rack repair bandwidth. Their results show that, for some parameters, DRCs achieve lower cross-rack repair bandwidth than RCs. However, DRCs are restricted to minimum storage redundancy (like RS codes). In contrast to the optimality analysis available for RCs [5], the optimal trade-off between storage redundancy and cross-rack repair bandwidth remains largely unresolved.

This paper considers a more general model than that of DRC [6] (for example, flexible storage size, and a flexible choice of the number of surviving nodes contacted during repair). On this basis, we propose an erasure code for data centers called rack-aware regenerating codes (RRC). The main contributions are as follows:

First, we derive the trade-off between storage redundancy and cross-rack repair bandwidth for RRCs. The optimal trade-off curve has two extreme points, the minimum storage rack-aware regenerating (MSRR) point and the minimum bandwidth rack-aware regenerating (MBRR) point, which correspond to minimum storage and minimum cross-rack repair bandwidth, respectively. When each rack has only one node, the optimal trade-off curve of RRCs reduces to that of RCs. Let r be the number of racks. When kr/n is an integer, the trade-off point of the MSRR code is the same as that of the minimum storage regenerating (MSR) code. When kr/n is not an integer, the cross-rack repair bandwidth of the MSRR code is strictly smaller than that of the MSR code. We also show that, for most parameters (see Theorem 4 for details), the cross-rack repair bandwidth of the MBRR code is smaller than that of the minimum bandwidth regenerating (MBR) code. For example, when (n,k,r)=(12,8,4), the MSRR code has 33.3% less cross-rack repair bandwidth than the MSR code; for the same parameters, the MBRR code has 13.1% less cross-rack repair bandwidth and 28.9% less storage than the MBR code (see Fig. 3 for details).
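The percentages in the (n,k,r)=(12,8,4) example can be checked with a short calculation. The sketch below is illustrative and rests on assumptions that are not stated at this point in the text: the MSRR point is taken as α=B/k with β=α/(d-m+1), and the MBRR point as α=γ=dβ with β=B/(kd-m(m-1)/2), where m=⌊kr/n⌋ and d=r-1; the MSR and MBR baselines use the standard points of [5] with d=n-1 helpers, and only the n-n/r helpers outside the failed node's rack are counted as cross-rack traffic. All function names are hypothetical.

```python
from fractions import Fraction

def rack_aware_points(n, k, r, d, B=1):
    """Return ((alpha, gamma) at MSRR, (alpha, gamma) at MBRR) under the assumed formulas."""
    m = (k * r) // n
    alpha_ms = Fraction(B) / k
    gamma_ms = d * alpha_ms / (d - m + 1)
    beta_mb = Fraction(B) / (k * d - Fraction(m * (m - 1), 2))
    return (alpha_ms, gamma_ms), (d * beta_mb, d * beta_mb)

def classic_points(n, k, r, B=1):
    """MSR and MBR points of [5] with d = n-1; gamma counts cross-rack helpers only."""
    d = n - 1
    outside = n - n // r  # helpers located in other racks (assumption of this sketch)
    beta_ms = Fraction(B) / (k * (d - k + 1))
    beta_mb = Fraction(2 * B) / (k * (2 * d - k + 1))
    return (Fraction(B) / k, outside * beta_ms), (d * beta_mb, outside * beta_mb)

def saving(new, old):
    """Relative reduction of `new` with respect to `old`."""
    return 1 - Fraction(new) / Fraction(old)

msrr, mbrr = rack_aware_points(12, 8, 4, d=3)
msr, mbr = classic_points(12, 8, 4)
msrr_bw_saving = saving(msrr[1], msr[1])
mbrr_bw_saving = saving(mbrr[1], mbr[1])
mbrr_storage_saving = saving(mbrr[0], mbr[0])
```

Under these assumptions the savings come out as 1/3 (33.3%) for the MSRR bandwidth, 3/23 (about 13%) for the MBRR bandwidth, and 73/253 (about 28.9%) for the MBRR storage, in line with the figures quoted above.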

Compared with existing work, the cross-rack repair bandwidth of RRCs is less than or equal to that of the codes in [7] for all parameters, and less than that of the codes in [8] for most parameters (see Part VII-A for details).

Second, we propose exact-repair constructions of MSRR codes and MBRR codes for a wide range of parameters. The parameter range supported by our constructions is much larger than that of the codes in [6]-[8] (see Part VII-B for details); exact-repair constructions of the codes in [8] are given in [9] and [10]. For example, when n=12 and r=4, our MSRR construction supports k=4,5,...,11, whereas the constructions in [6] and [7] support only k=9 and k=6,9, respectively. The exact-repair construction of the minimum storage code of [8], given in [9], supports only r=2 and n=2k.

References

[1] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, "Availability in globally distributed storage systems," in Proc. of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010, pp. 1–7.

[2] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure coding in Windows Azure storage," in Proc. USENIX Annual Technical Conference, 2012.

[3] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, "XORing elephants: Novel erasure codes for big data," in Proceedings of the 39th International Conference on Very Large Data Bases. VLDB Endowment, 2013, pp. 325–336.

[4] I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," Journal of the Society for Industrial & Applied Mathematics, vol. 8, no. 2, pp. 300–304, 1960.

[5] A. Dimakis, P. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Information Theory, vol. 56, no. 9, pp. 4539–4551, Sep. 2010.

[6] Y. Hu, P. P. C. Lee, and X. Zhang, "Double regenerating codes for hierarchical data centers," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2016, pp. 245–249.

[7] N. Prakash, V. Abdrashitov, and M. Médard, "The storage versus repair-bandwidth trade-off for clustered storage systems," IEEE Trans. Information Theory, vol. 64, no. 8, pp. 5783–5805, Aug. 2018.

[8] J.-y. Sohn, B. Choi, S. W. Yoon, and J. Moon, "Capacity of clustered distributed storage," accepted for publication in IEEE Trans. Inf. Theory, 2018.

[9] J.-y. Sohn, B. Choi, and J. Moon, "A class of MSR codes for clustered distributed storage," https://arxiv.org/abs/1801.02014, 2018.

[10] J.-y. Sohn and J. Moon, "Explicit construction of MBR codes for clustered distributed storage," https://arxiv.org/abs/1801.02287, 2018.

[11] H. C. Chen, Y. Tang, Y. Hu, and P. P. C. Lee, "NCCloud: A network-coding-based storage system in a cloud-of-clouds," IEEE Trans. Computers, vol. 63, no. 1, pp. 31–44, Jan. 2014.

[12] K. Rashmi, P. Nakkiran, J. Wang, N. B. Shah, and K. Ramchandran, "Having your cake and eating it too: Jointly optimal erasure codes for I/O, storage, and network-bandwidth," in Proc. of USENIX FAST, 2015, pp. 81–94.

[13] L. Pamies-Juarez, F. Blagojevic, R. Mateescu, C. Guyot, E. E. Gad, and Z. Bandic, "Opening the chrysalis: On the real repair performance of MSR codes," in Proc. of USENIX FAST, 2016, pp. 81–94.

[14] J. Li and B. Li, "Beehive: Erasure codes for fixing multiple failures in distributed storage systems," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1257–1270, 2017.

[15] J. Pernas, C. Yuen, B. Gastón, and J. Pujol, "Non-homogeneous two-rack model for distributed storage systems," in Proc. IEEE Int. Symp. Inf. Theory, 2013, pp. 1237–1241.

[16] T. Ernvall, S. El Rouayheb, C. Hollanti, and H. V. Poor, "Capacity and security of heterogeneous distributed storage systems," IEEE J. Selected Areas in Communications, vol. 31, no. 12, pp. 2701–2709, Dec. 2013.

[17] M. A. Tebbi, T. H. Chan, and C. W. Sung, "A code design framework for multi-rack distributed storage," in Proc. IEEE Inf. Theory Workshop (ITW), 2014, pp. 55–59.

[18] Y. Hu, X. Li, M. Zhang, P. P. C. Lee, X. Zhang, P. Zhou, and D. Feng, "Optimal repair layering for erasure-coded data centers: From theory to practice," ACM Transactions on Storage, vol. 13, no. 4, pp. 33–56, 2017.

[19] N. B. Shah, K. V. Rashmi, and P. V. Kumar, "A flexible class of regenerating codes for distributed storage," in Proc. IEEE Int. Symp. Inf. Theory, 2010, pp. 1943–1947.

[20] J. Li, S. Yang, X. Wang, and B. Li, "Tree-structured data regeneration in distributed storage systems with regenerating codes," in Conference on Information Communications, 2010, pp. 2892–2900.

[21] Y. Wang, D. Wei, X. Yin, and X. Wang, "Heterogeneity-aware data regeneration in distributed storage systems," in Proc. IEEE INFOCOM, 2014, pp. 1878–1886.

[22] S. Akhlaghi, A. Kiani, and M. R. Ghanavati, "Cost-bandwidth tradeoff in distributed storage systems," Computer Communications, vol. 33, no. 17, pp. 2105–2115, 2010.

[23] S. Goparaju, A. Fazeli, and A. Vardy, "Minimum storage regenerating codes for all parameters," IEEE Trans. Information Theory, vol. 63, no. 10, pp. 6318–6328, 2017.

[24] B. Gastón, J. Pujol, and M. Villanueva, "A realistic distributed storage system that minimizes data storage and repair bandwidth," arXiv preprint arXiv:1301.1549, 2013.

[25] Z. Shen, J. Shu, and P. P. C. Lee, "Reconsidering single failure recovery in clustered file systems," in 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016, pp. 323–334.

[26] M. Gerami, M. Xiao, and M. Skoglund, "Two-layer coding in distributed storage systems with partial node failure/repair," IEEE Communications Letters, vol. 21, no. 4, pp. 726–729, 2017.

[27] R. W. Yeung, Information Theory and Network Coding. Springer, 2008.

[28] N. B. Shah, K. V. Rashmi, P. V. Kumar, and K. Ramchandran, "Explicit codes minimizing repair bandwidth for distributed storage," in Proc. IEEE Inf. Theory Workshop (ITW), 2010, pp. 1–5.

[29] C. Suh and K. Ramchandran, "Exact-repair MDS codes for distributed storage using interference alignment," in Proc. IEEE Int. Symp. Inf. Theory, 2010, pp. 161–165.

[30] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge University Press, 1995.

[31] J. Li, X. Tang, and C. Tian, "A generic transformation for optimal repair bandwidth and rebuilding access in MDS codes," in Proc. IEEE Int. Symp. Inf. Theory, 2017, pp. 1623–1627.

[32] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction," IEEE Trans. Information Theory, vol. 57, no. 8, pp. 5227–5239, Aug. 2011.

[33] H. Hou, K. W. Shum, M. Chen, and H. Li, "BASIC codes: Low-complexity regenerating codes for distributed storage systems," IEEE Trans. Information Theory, vol. 62, no. 6, pp. 3053–3069, 2016.

Summary of the invention

The purpose of the present invention is to provide a rack-aware regenerating code for use in data centers, so as to solve the technical problems described above.

The present invention is realized as follows: a rack-aware regenerating code for a data center, in which the data center consists of n nodes distributed equally across r racks, with n/r nodes in each rack, where n is a multiple of r; the nodes are labeled from 1 to n and indexed by the pair (h, i), where h=1, 2,..., r is the rack and i=1, 2,..., n/r is the position within the rack; a data file is regarded as a vector of length B over a finite field, which is encoded into nα symbols and stored across the n nodes, with α symbols stored in each node.

In a further technical solution of the present invention, each rack is provided with a relay node, and each relay node can access the information stored in the other nodes of the same rack.

In a further technical solution of the present invention, the relay node is any surviving node selected from the rack, and different data files may be associated with different relay nodes during the repair operation.

In a further technical solution of the present invention, when a storage node fails, a new node is generated to replace it and is placed in the same rack; the new node arbitrarily selects d other racks, where d<r, and connects to the relay node of each selected rack; each such relay node sends β symbols to the new node; the new node regenerates its content from the dβ received symbols and the (n/r-1)α symbols stored in the rack where the failed node was located; the cross-rack repair bandwidth is γ=dβ.

In a further technical solution of the present invention, with the data collector connecting to the relay nodes, the rack-aware regenerating code supports exact repair and functional repair: in exact repair, the symbols stored in the new node are the same as those stored in the failed node; in functional repair, the new node may contain symbols different from those of the failed node, as long as the decoding property is satisfied.

The beneficial effects of the present invention are as follows. By reconstructing part of the repair data inside each rack and combining the partially repaired data across racks, the cross-rack repair bandwidth is minimized. Because the regenerating code addresses repair across multiple racks rather than only within a single rack, it has greater practical engineering significance. The invention studies the optimal trade-off between storage redundancy and cross-rack repair bandwidth in a data center, which is broader in scope than previous studies of particular points of regenerating codes. The minimum storage construction supports a wide range of parameters, and in most cases its cross-rack repair bandwidth is strictly lower than the repair bandwidth of the classic minimum storage regenerating code; the minimum cross-rack repair bandwidth construction supports all parameters, and for all parameters it has a smaller cross-rack repair bandwidth than the minimum bandwidth regenerating code.

Description of the drawings

Fig. 1 is a schematic diagram of a rack-aware regenerating code used in a data center, provided by an embodiment of the present invention.

Fig. 2 is a schematic diagram of the information flow graph when n=9, k=5, r=3, and d=2, provided by an embodiment of the present invention.

Fig. 3 is a schematic diagram of the optimal trade-off curves between α and γ=dβ for RRCs and RCs with n=12, k=8, r=3, 4, 6 and d=r-1, provided by an embodiment of the present invention.

Fig. 4 is a schematic diagram of an example with n=10, k=8, r=5, d=4, provided by an embodiment of the present invention, where the data information is denoted s=[s₁, s₂,..., s₈].

Fig. 5 is a schematic diagram of an example with (n,k,r)=(12,8,4), provided by an embodiment of the present invention, where the data information is denoted s=[s₁, s₂,..., s₁₆].

Fig. 6 is a schematic diagram of an MBRR code example with (n, k, r, d)=(12, 8, 4, 3), provided by an embodiment of the present invention.

Fig. 7 is a schematic diagram of the cross-rack repair bandwidth of the MSRR code and the MSR code when r=3, 4, provided by an embodiment of the present invention.

Fig. 8 is a schematic diagram of the cross-rack repair bandwidth of the MSRR code and the MSR code when r=5, 6, provided by an embodiment of the present invention.

Fig. 9 is a schematic diagram of the cross-rack repair bandwidth of the MSRR code and the MSR code when n=18, 24, provided by an embodiment of the present invention.

Fig. 10 is a schematic diagram of the trade-off between storage and cross-rack repair bandwidth for the MSRR code and the minimum bandwidth point of the code in [8] when n=20, 25, provided by an embodiment of the present invention.

Fig. 11 is a schematic diagram of the parameters supported by the MSRR code, the clustered code in [7], and the DRC code when r=3, 4, 5, and 6, provided by an embodiment of the present invention.

Detailed description

RCs have been studied extensively in the literature, including practical applications [11]-[14] and repair problems in heterogeneous structures [6]-[8], [15]-[18].

Flexible RCs [19] are designed for heterogeneous storage systems and achieve the lower bound on repair bandwidth. Combined with tree-structured regeneration topologies, RCs can further save network bandwidth [20], [21]. Other studies, such as [16] and [22], focus on the capacity limits of heterogeneous models. However, none of the above studies distinguishes between the two kinds of communication overhead in a data center, intra-rack and cross-rack.

Other existing studies do distinguish between inter-rack and intra-rack communication overhead, but their system models differ from ours. Table 1 compares RRCs with closely related work on erasure-coded data centers. DRCs [6], [18] consider the same model as this paper and achieve the optimal trade-off between storage and cross-rack repair bandwidth at the minimum storage point. A DRC can be seen as a special case of the MSRR code in which all other racks are contacted to repair a failed node. Sohn et al. [8] considered a different repair model and derived the optimal trade-off between storage and repair bandwidth (including both cross-rack and intra-rack repair bandwidth). In their repair process, the data of two nodes in the same rack is not coded together, whereas in our model the information downloaded from another rack is a linear combination of all the information in that rack (as in DRC [6], [18]). Moreover, to repair a failed node, the new node in [8] must connect to all other racks, and for most parameters the cross-rack repair bandwidth of the RRCs in this paper is less than that of the codes in [8]. Recently, Sohn et al. proposed in [9] and [10] exact-repair constructions for the minimum storage point and the minimum bandwidth point of the codes in [8].

The work most closely related to ours is that of Prakash et al. [7]. In their model, a file is reconstructed from a certain number of complete racks, so k must be a multiple of the number of nodes in each rack. Our model, by contrast, allows a file to be reconstructed from any k nodes. RRCs can therefore tolerate more node failures than the codes in [7]. When k is a multiple of the number of nodes in each rack, the trade-off curve of RRCs coincides with the optimal trade-off curve of [7], but the exact-repair constructions of MSRR and MBRR codes support more parameters than the exact-repair constructions of the minimum storage and minimum bandwidth codes in [7] (see Part VII-B for details). The minimum storage and minimum bandwidth codes in [7] are defined only when k is a multiple of the number of nodes in each rack, whereas our MSRR and MBRR codes place no such restriction on k. When k is a multiple of the number of nodes in each rack, k+1 is not (provided each rack has more than one node). Our results show that the cross-rack repair bandwidth of the MSRR (MBRR) code with k+1 data nodes is much smaller than that of the MSRR (MBRR) code with k data nodes, where k is a multiple of the number of nodes in each rack.

Other work has addressed erasure coding in rack-based data centers. Studies such as [15] and [24] consider the interaction between two racks. Tebbi et al. [17] designed a locally repairable code for multi-rack storage systems. Shen et al. [25] provide a rack-aware repair algorithm designed specifically for RS codes. In this paper, we analyze a general model and derive the optimal trade-off between storage and cross-rack repair bandwidth.

Two-layer coding for data centers is studied in [26]. The first layer uses an (n, k) MDS code to encode the data file and distributes it across n nodes. The second layer encodes the blocks stored in each node with an MDS code of rate δ. If the fraction of failed blocks within a node is no greater than 1-δ (that is, the node has partially failed), the failed blocks can be recovered by the node itself. Otherwise, there is a trade-off between storage and repair bandwidth. The main difference between [26] and our work is that we distinguish between intra-rack and cross-rack communication and consider the repair of a failed node in a rack-based storage system, whereas [26] considers the repair of partially failed nodes.

System model

TABLE I: Comparison with related work.

We consider a data center of n nodes divided into r racks, with n/r nodes per rack; see Fig. 1. As Fig. 1 illustrates, a failed node can be repaired by downloading all the symbols stored in the other nodes of its host rack and β symbols from each of any d other racks, and the data file can be retrieved by a data collector by downloading kα symbols from any k nodes. Throughout, we assume that n is a multiple of r. The nodes are labeled from 1 to n and indexed by h=1, 2,..., r and i=1, 2,..., n/r; we denote the i-th node in rack h by X_{h,i}. All operations are over a finite field of size q, and a data file is regarded as a vector of length B over this field. A data file is encoded into nα symbols and stored across the n nodes, with each node storing α symbols.

Each rack has a designated relay node, which can access the information stored in the other nodes of the same rack. We assume that communication within a rack is very fast, so the transmission overhead between nodes in the same rack is negligible. When a storage node fails, a new node is generated to replace it and is placed in the same rack. The new node arbitrarily selects d other racks, where d<r, and connects to their relay nodes. The relay nodes (or racks) that participate in the repair are called helpers, and the parameter d is called the repair degree. Using the αn/r symbols stored in its rack, each helper relay node sends β symbols to the new node, so the cross-rack repair bandwidth is γ=dβ. The content of the new node is then regenerated from the dβ received symbols and the (n/r-1)α symbols stored in the host rack. Note that the relay node can be any surviving node of the rack, and different data files may be associated with different relay nodes during repair. A node failure can thus be regarded as a partial failure of a rack: the failed node is repaired by downloading β symbols from each of any d other racks and (n/r-1)α symbols from the other n/r-1 nodes in the same rack. By relabeling the storage nodes if necessary, we assume without loss of generality that X_{h,1} is the relay node of rack h for h=1, 2,...,r.
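The repair procedure above amounts to simple traffic accounting, sketched below. This is a minimal illustration, not the patent's implementation: the function `plan_repair`, its return fields, and the numeric values of α and β in the usage line are hypothetical, and the helper racks are drawn at random only to reflect that any d other racks may be chosen.

```python
import random

def plan_repair(n, r, d, alpha, beta, failed_rack, seed=None):
    """Pick d helper racks and tally repair traffic in symbols (illustrative)."""
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    if not 0 < d < r:
        raise ValueError("require 0 < d < r")
    rng = random.Random(seed)
    other_racks = [h for h in range(1, r + 1) if h != failed_rack]
    helpers = sorted(rng.sample(other_racks, d))  # any d other racks may help
    per_rack = n // r
    return {
        "helper_racks": helpers,
        "cross_rack_symbols": d * beta,                  # gamma = d * beta
        "intra_rack_symbols": (per_rack - 1) * alpha,    # from the host rack
    }

plan = plan_repair(9, 3, 2, alpha=4, beta=2, failed_rack=1, seed=0)
```

With n=9, r=3, d=2, every other rack is a helper, so only the dβ symbols cross rack boundaries while the (n/r-1)α symbols stay inside the host rack.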

We require that the data file can be decoded from any k nodes, which is called the (n,k) decoding property. When the data collector connects to a relay node, this is equivalent to connecting to all n/r nodes in that rack; we therefore assume that a data collector connected to a relay node is also connected to all other nodes in the same rack. This paper considers two kinds of repair: exact repair and functional repair. In exact repair, the symbols stored in the new node are the same as those stored in the failed node. In functional repair, the new node may contain symbols different from those of the failed node, as long as the (n, k) decoding property is preserved. A coding scheme with parameters n, k, r, d, α and β that meets the above requirements is called a rack-aware storage system code, denoted RSS(n, k, r, d, α, β). Table 2 summarizes the main notation used in this paper.

Optimized trade-off between storage and repair bandwidth across racks

We use information flow graphs to represent the storage system described above. To distinguish them from the system diagram in Fig. 1, we use the term "vertex" instead of "node" when describing an information flow graph.

TABLE II: Major notation used in this paper.

Given the system parameters n, k, r, d, α, β, an information flow graph is a directed acyclic graph (DAG) constructed according to the following rules. There is a vertex S for the data file and a vertex T for the data collector. For h=1, 2,...,r and i=1, 2,...,n/r, the i-th node in rack h is represented by a pair of vertices In_{h,i} and Out_{h,i}. We draw an edge from In_{h,i} to Out_{h,i} with capacity α. For each vertex In_{h,i}, we draw an edge from S to In_{h,i} with infinite capacity. This models the fact that the content of each storage node is a function of all the symbols of the data file, and that each node stores at most α symbols. For h=1, 2,...,r and i=2,3,...,n/r, we draw an edge from Out_{h,i} to Out_{h,1} with capacity α. This models the fact that X_{h,1} is the relay node and can access all the content stored in X_{h,i}.

Suppose that the f-th node in rack h fails, where h ∈ {1, 2,..., r} and f ∈ {1, 2,..., n/r}. We add n/r new pairs of vertices In′_{h,j} and Out′_{h,j}, j=1, 2,..., n/r, to the information flow graph. For j ∈ {1,2,...,n/r}\{f}, we draw an edge from Out_{h,j} to In′_{h,j} with infinite capacity and an edge from In′_{h,j} to Out′_{h,j} with infinite capacity. This means that the content of node j remains unchanged during the repair. The vertex In′_{h,f} represents the new node. For each j≠f, we draw an edge with infinite capacity from Out_{h,j} to In′_{h,f}, since the new node can access the symbols stored in all other nodes of the same rack. Suppose that the new node connects to the relay nodes of d racks h₁, h₂,..., h_d, where h₁, h₂,..., h_d are distinct indices not equal to h. For i = 1, 2,..., d, there is an edge with capacity β from Out_{h_i,1} to In′_{h,f}. Therefore, In′_{h,f} has (n/r-1)+d incoming edges, of which d have capacity β and n/r-1 have infinite capacity. Finally, the new node stores α symbols, which we represent by an edge from In′_{h,f} to Out′_{h,f} with capacity α. For j=2,3,...,n/r, there is an edge from Out′_{h,j} to Out′_{h,1} with infinite capacity.

The storage system may undergo a sequence of node failures and repairs, and we repeat the above procedure for each. Finally, we draw k edges from k Out vertices to T. If T is connected to the vertex Out_{h,1} of the relay node of rack h, we adopt the convention that T is also connected to all vertices Out_{h,2},...,Out_{h,n/r} in rack h.

The DAG obtained as described above is called an information flow graph and is represented by G(n, k, r, d, α, β). Figure 2 shows an example of (n, k, r, d) = (9, 5, 3, 2).
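The construction rules can be sketched as edge lists. The code below is an illustrative sketch under stated assumptions: vertices are encoded as tuples such as ("In", h, i) and ("In'", h, j), the function names and the α, β values in the usage lines are hypothetical, and only the initial stage and a single repair are generated (the edges to T are omitted).

```python
import math

INF = math.inf

def initial_edges(n, r, alpha):
    """Edges (tail, head, capacity) of the graph before any failure."""
    per_rack = n // r
    edges = []
    for h in range(1, r + 1):
        for i in range(1, per_rack + 1):
            edges.append(("S", ("In", h, i), INF))
            edges.append((("In", h, i), ("Out", h, i), alpha))
            if i >= 2:  # X_{h,1} is the relay node of rack h
                edges.append((("Out", h, i), ("Out", h, 1), alpha))
    return edges

def repair_edges(n, r, alpha, beta, h, f, helper_racks):
    """Edges added when node f of rack h is repaired with the given helper racks."""
    per_rack = n // r
    edges = []
    for j in range(1, per_rack + 1):
        if j != f:
            edges.append((("Out", h, j), ("In'", h, j), INF))   # content unchanged
            edges.append((("In'", h, j), ("Out'", h, j), INF))
            edges.append((("Out", h, j), ("In'", h, f), INF))   # intra-rack access
    for hi in helper_racks:
        edges.append((("Out", hi, 1), ("In'", h, f), beta))     # cross-rack download
    edges.append((("In'", h, f), ("Out'", h, f), alpha))        # new node stores alpha
    for j in range(2, per_rack + 1):
        edges.append((("Out'", h, j), ("Out'", h, 1), INF))
    return edges

base = initial_edges(9, 3, alpha=4)
rep = repair_edges(9, 3, alpha=4, beta=2, h=1, f=2, helper_racks=[2, 3])
in_degree_new = sum(1 for _, head, _ in rep if head == ("In'", 1, 2))
```

For the (n, r, d) = (9, 3, 2) case the new vertex In′_{1,2} receives (n/r-1)+d = 4 incoming edges, two of capacity β and two of infinite capacity, matching the count given in the construction.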

Given an information flow graph G, we take the unique vertex S as the source and the unique vertex T as the sink, and consider the maximum flow from S to T. An (s, t)-cut is a subset of the edges of G whose removal leaves no directed path from s to t. The capacity of an (s, t)-cut is the sum of the capacities of its edges. Let mincut(G) denote the minimum capacity of an (s,t)-cut in a given information flow graph G, and let min_G mincut(G) denote the minimum of mincut(G) over all information flow graphs G(n, k, r, d, α, β). By the max-flow bound in network coding theory [27, Theorem 18.3], the file size B cannot exceed min_G mincut(G). The next theorem determines min_G mincut(G), which gives an upper bound on the file size. In this paper, we use the notation m = ⌊kr/n⌋.

Theorem 1. Given the parameters n, k, r, d, α, β, and B, if there is an RSS(n, k, r, d, α, β) code with file size B, then

The proof of Theorem 1 is given in Appendix A.

If an RSS(n, k, r, d, α, β) code attains equality in (1), we call it a rack-aware regenerating code RRC(n, k, r, d, α, β). The value of the bound in (1) is called the capacity of the RRC(n, k, r, d, α, β) code. When r=n, the trade-off curve of RRCs given by (1) reduces to the optimal trade-off curve of RCs [5].
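As a numerical aid, the sketch below evaluates a bound of the form B ≤ (k-m)α + Σ_{ℓ=1}^{min(m,d)} min{(d-ℓ+1)β, α}, with m=⌊kr/n⌋. This form is an assumption of the sketch, not a quotation of (1): it is chosen because it reduces to the RC trade-off of [5] when r=n and yields B=kd-m(m-1)/2 at α=d, β=1, the MBRR values quoted in this description, and it should be checked against inequality (1) before being relied upon. The function names are hypothetical.

```python
from fractions import Fraction

def capacity_bound(n, k, r, d, alpha, beta):
    """Assumed form of the file-size bound for an RSS(n, k, r, d, alpha, beta) code."""
    m = (k * r) // n
    alpha, beta = Fraction(alpha), Fraction(beta)
    # Upper limit min(m, d) keeps every term (d - l + 1) * beta positive.
    tail = sum(min((d - l + 1) * beta, alpha) for l in range(1, min(m, d) + 1))
    return (k - m) * alpha + tail

def rc_bound(k, d, alpha, beta):
    """Cut-set bound of regenerating codes from [5]: sum_{i<k} min(alpha, (d-i)beta)."""
    alpha, beta = Fraction(alpha), Fraction(beta)
    return sum(min(alpha, (d - i) * beta) for i in range(k))

mbrr_capacity = capacity_bound(12, 8, 4, 3, alpha=3, beta=1)
single_node_racks = capacity_bound(12, 8, 12, 11, alpha=5, beta=1)
classic = rc_bound(8, 11, alpha=5, beta=1)
```

For (n, k, r, d) = (12, 8, 4, 3) with α=d=3 and β=1 the assumed bound evaluates to kd-m(m-1)/2 = 23, and with one node per rack (r=n) it coincides with the classic RC bound.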

Remarks. If kr/n is an integer (i.e., m=kr/n), the upper bound in (1) coincides with the upper bound in [7] (see (1) in [7]). Note that our repair scheme is the same as that in [7], but our model tolerates more node failure patterns than the model in [7]. If kr/n is not an integer, our upper bound is tighter than the bound given in [7].

We now describe the trade-off between the storage α and the cross-rack repair bandwidth γ=dβ for given (n,k,r). Given β, define α*(β) as the minimum value of α for which equality in (1) holds; if no such α exists, α*(β) is taken to be infinite. The following theorem gives the optimal trade-off.

Theorem 2. Given the parameters n, k, r, d, and B, let

where i=0,1,...,m-1. If β ranges from f(m-1) to infinity, then the minimum storage α*(β) is

for i=1,2,...,m-2, and

for β=f(m-1).

The proof is given in Appendix B.

TABLE III: Some parameters of r, n, k for which the MBRR codes have high code rates.

There are two extreme points on the optimal trade-off curve, which correspond to minimum storage and to minimum cross-rack repair bandwidth. They are called the minimum storage rack-aware regenerating (MSRR) point and the minimum bandwidth rack-aware regenerating (MBRR) point. The MSRR point is obtained by first minimizing α and then minimizing β, and the MBRR point is obtained by first minimizing β and then minimizing α.

From (2) and (3), the MSRR point is achieved when

and the MBRR point is achieved when

When r=n, the MSRR code reduces to a minimum storage regenerating (MSR) code, and the MBRR code reduces to a minimum bandwidth regenerating (MBR) code. In the MBRR code, the cross-rack repair bandwidth γ equals the storage α, so the amount of data downloaded from other racks equals the amount of data lost. For the MBRR code with β=1, according to (5), we have B=kd-m(m-1)/2 and α=γ=d. If 2kd-m(m-1)>nd, then the code rate of the MBRR code, i.e., B/(nα), satisfies B/(nα) = (2kd-m(m-1))/(2nd) > 1/2.

Therefore, we can construct MBRR codes with high code rates, whereas the code rate of every MBR code is at most 0.5. Given r and n, for d=r-1, the values of k for which the code rate of the MBRR code exceeds 0.5 are summarized in Table III. For all evaluated parameters in Table III, the MBRR code has a high code rate when k/n>0.5.
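The code-rate condition can be checked numerically. The sketch below uses only the β=1 relations stated in the text (B=kd-m(m-1)/2, α=d) and assumes m=⌊kr/n⌋; the function names are illustrative, and the enumeration is not a reproduction of Table III.

```python
from fractions import Fraction


def mbrr_code_rate(n: int, k: int, r: int, d: int) -> Fraction:
    """Code rate B/(n*alpha) of an MBRR code with beta = 1.

    Uses B = k*d - m*(m-1)/2 and alpha = d from the text.
    Assumption: m = floor(k*r/n).
    """
    m = (k * r) // n
    file_size = Fraction(2 * k * d - m * (m - 1), 2)
    alpha = d
    return file_size / (n * alpha)


def high_rate_k_values(n: int, r: int) -> list:
    """All k < n for which the MBRR code with d = r - 1 has rate above 1/2."""
    d = r - 1
    return [k for k in range(1, n) if mbrr_code_rate(n, k, r, d) > Fraction(1, 2)]


if __name__ == "__main__":
    # (n, k, r, d) = (12, 8, 4, 3): 2kd - m(m-1) = 46 > nd = 36, so the rate exceeds 1/2.
    print(mbrr_code_rate(12, 8, 4, 3))   # 23/36
    print(high_rate_k_values(12, 4))
```

Exact fractions are used so that the comparison with 1/2 is not affected by floating-point rounding.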

Remarks. From (4) and (5), we see that, for the same parameters B, d, and m, the cross-rack repair bandwidth of MSRR codes and MBRR codes decreases as k increases. If kr/n is an integer, we have ⌊(k+i)r/n⌋ = kr/n = m for i=1,2,...,n/r-1, and the cross-rack repair bandwidth of the MSRR(n, k+i, r) code is strictly smaller than that of the MSRR(n, k, r) code. When kr/n is an integer, the cross-rack repair bandwidth of the MSRR code equals that of the minimum storage code in [16]. The construction of MSRR codes is therefore most needed when kr/n is not an integer, because such codes have less cross-rack repair bandwidth. We give exact-repair constructions of MSRR codes and MBRR codes in Section V and Section VI, respectively.

If an RC(n,k,d') code is used directly in rack-based storage, we obtain a trade-off curve between the storage α of the RC(n,k,d') code and the cross-rack repair bandwidth γ', in the same way as Theorem 1 in [5], as follows.

Theorem 3. Suppose that we use an RC(n, k, d') code directly in a rack-based storage system, that is, we repair a failed node by downloading β' symbols from each of d'=dn/r+n/r-1 helper nodes (the n/r-1 other nodes in the main rack and dn/r nodes in other racks). Then the trade-off between the minimum storage α'*(γ') and the cross-rack repair bandwidth γ' is as follows,

for i=1,2,...,k-1, where

According to Theorem 3, the cross-rack repair bandwidths at the MSR and MBR points are, respectively,

and

The next theorem shows that, for most parameters, MSRR codes (MBRR codes) have less cross-rack repair bandwidth than MSR codes (MBR codes).

Theorem 4. Let d'=dn/r+n/r-1. If kr/n is an integer, then the MSR(n,k,d') code has the same cross-rack repair bandwidth as the MSRR(n,k,d) code. If kr/n is not an integer, the MSRR(n,k,d) code has less cross-rack repair bandwidth than the MSR(n,k,d') code. If kr/n is an integer and k/n>2/r, the MBRR(n,k,d) code has less cross-rack repair bandwidth than the MBR(n,k,d') code.

Proof. Since

when kr/n is an integer, we get kr/n=m and

Since kr/n=m, the above expression equals the cross-rack repair bandwidth of the MSRR code in (4). If kr/n is not an integer, we get kr/n>m, and the cross-rack repair bandwidth of the MSRR code is smaller than that of the MSR code. The cross-rack repair bandwidth of the MBRR code is γ_MBRR in equation (5). If kr/n is an integer, we have m=kr/n, and the cross-rack repair bandwidth of the MBRR code is smaller than that of the MBR code if and only if

Therefore, when kr/n is an integer, the MBRR code has less cross-rack repair bandwidth than the MBR code if and only if k/n>2/r. In other words, if the code rate is not too low, the cross-rack repair bandwidth of the MBRR code is strictly less than that of the MBR code.

When B=1, n=12, k=8, r=3,4,6, and d=r-1, the trade-off curves of RRC and RC are shown in Figure 3, where d'=dn/r+n/r-1.

Figure 3: Optimal trade-off curves between storage and cross-rack repair bandwidth of RRC and RC when n=12, k=8, r=3, 4, 6 and d=r-1. When (n,k,r)=(12,8,4), the cross-rack repair bandwidths of the MSRR code and the MSR code are γ_MSRR=0.1875 and γ_MSR=0.2813, respectively. The storage and cross-rack repair bandwidth of the MBRR code and the MBR code are (α_MBRR, γ_MBRR)=(0.1304, 0.1304) and (α_MBR, γ_MBR)=(0.1833, 0.1500), respectively.
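The four operating points quoted in the caption of Figure 3 can be reproduced with a few lines of exact arithmetic. This is a sketch under stated assumptions: m=⌊kr/n⌋; the MSRR and MBRR points are taken as α=B/k, β=B/(k(d-m+1)) and α=dβ, β=2B/(2kd-m(m-1)), which match the β=1 values given in the text; the MSR and MBR points use the classical regenerating-code formulas with d' helpers, of which dn/r lie in other racks. All function names are illustrative.

```python
from fractions import Fraction


def msrr_point(B, n, k, r, d):
    """(alpha, gamma) at the MSRR point; assumption: m = floor(k*r/n)."""
    m = (k * r) // n
    beta = Fraction(B, k * (d - m + 1))
    return Fraction(B, k), d * beta


def mbrr_point(B, n, k, r, d):
    """(alpha, gamma) at the MBRR point, where alpha = gamma = d*beta."""
    m = (k * r) // n
    beta = Fraction(2 * B, 2 * k * d - m * (m - 1))
    return d * beta, d * beta


def msr_point(B, n, k, r, d):
    """(alpha, cross-rack gamma) of an MSR(n, k, d') code used in racks."""
    per_rack = n // r
    d_prime = d * per_rack + per_rack - 1
    beta = Fraction(B, k * (d_prime - k + 1))
    return Fraction(B, k), d * per_rack * beta   # only dn/r helpers are remote


def mbr_point(B, n, k, r, d):
    """(alpha, cross-rack gamma) of an MBR(n, k, d') code used in racks."""
    per_rack = n // r
    d_prime = d * per_rack + per_rack - 1
    beta = Fraction(2 * B, k * (2 * d_prime - k + 1))
    return d_prime * beta, d * per_rack * beta


if __name__ == "__main__":
    args = (1, 12, 8, 4, 3)          # B, n, k, r, d with d = r - 1
    for name, point in [("MSRR", msrr_point), ("MSR", msr_point),
                        ("MBRR", mbrr_point), ("MBR", mbr_point)]:
        alpha, gamma = point(*args)
        print(name, float(alpha), float(gamma))
```

With these assumptions the MSRR and MSR bandwidths come out as 3/16 and 9/32, and the MBRR and MBR points as (3/23, 3/23) and (11/60, 3/20), which agree with the rounded values in the caption.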

We make a few observations. First, for the same storage, the cross-rack repair bandwidth of the RC code increases with r. Second, unlike the RC code, for the same storage the cross-rack repair bandwidth of the RRC code with r=4 is smaller than that with r=3. In general, if kr/n is an integer, the cross-rack repair bandwidth of the RRC code increases as r increases, just as for the RC code. However, when kr'/n is not an integer and

the cross-rack repair bandwidth of the RRC(n,k,r') code is strictly smaller than that of the RRC(n,k,r) code. Third, when r=3 and r=6, the RRC code has less cross-rack repair bandwidth than the RC code with the same parameters, except at the two MSR points. Indeed, by Theorem 4, the cross-rack repair bandwidths of the MSRR code and the MSR code coincide only when kr/n is an integer. When (n,k,r)=(12,8,4) and (n,k,r)=(12,8,6), the cross-rack repair bandwidth of the MBRR code is less than that of the MBR code; when (n,k,r)=(12,8,3), it is the same as that of the MBR code. According to Theorem 4, if kr/n is an integer and k/n>2/r, the MBRR code has less cross-rack repair bandwidth than the MBR code, so the result for (n,k,r)=(12,8,6) in Figure 3 agrees with Theorem 4. When kr/n is not an integer, as for (n,k,r)=(12,8,4), the cross-rack repair bandwidth of the MBRR code is also smaller than that of the MBR code. For all parameters, the storage of the MBRR code is strictly less than that of the MBR code. In the rest of this description, we focus on exact-repair constructions of MSRR codes and MBRR codes.

Exact-repair construction of MSRR codes

This section presents systematic exact-repair constructions of MSRR codes. A systematic code is one in which the kα uncoded data symbols are stored in k nodes. Assume that the first k nodes are data nodes, which store the uncoded data symbols, and the last n-k nodes are coding nodes, which store coded symbols. The construction in Section V-B applies to α=1 and any (n, k), and all of its data nodes and coding nodes have the optimal cross-rack repair bandwidth. The construction in Section V-C is for MSRR codes whose parameters satisfy αn/r≥m+αt; it achieves the optimal cross-rack repair bandwidth only for the data nodes. If kr is a multiple of n, the k data nodes are placed in the first m racks, called data racks, and the n-k coding nodes are placed in the last r-m racks, called coding racks. If kr is not a multiple of n, the first m racks are data racks and the last r-m-1 racks are coding racks; rack m+1 is called the hybrid rack and contains t=k mod (n/r) data nodes and n/r-t coding nodes. When kr/n is an integer, the MSRR code is called a homogeneous MSRR code; when kr/n is not an integer, it is called a hybrid MSRR code.
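The node placement just described can be written down directly. The following sketch assumes that r divides n and uses m=⌊kr/n⌋ and t=k mod (n/r); the function name and the returned dictionary layout are illustrative only.

```python
def rack_layout(n: int, k: int, r: int) -> dict:
    """Classify the r racks of a systematic MSRR code as data, hybrid or coding racks.

    Assumption: r divides n, so that each rack holds n/r nodes.
    """
    if n % r != 0:
        raise ValueError("r must divide n")
    per_rack = n // r
    m, t = divmod(k, per_rack)          # m = floor(kr/n), t = k mod (n/r)
    racks = []
    for h in range(1, r + 1):
        if h <= m:
            data_nodes = per_rack       # data rack
        elif h == m + 1:
            data_nodes = t              # hybrid rack (a coding rack when t = 0)
        else:
            data_nodes = 0              # coding rack
        if data_nodes == per_rack:
            kind = "data"
        elif data_nodes == 0:
            kind = "coding"
        else:
            kind = "hybrid"
        racks.append((h, kind, data_nodes, per_rack - data_nodes))
    return {"m": m, "t": t,
            "code_type": "homogeneous" if t == 0 else "hybrid",
            "racks": racks}


if __name__ == "__main__":
    for row in rack_layout(12, 8, 4)["racks"]:
        print(row)   # (rack index, kind, data nodes, coding nodes)
```

For (n,k,r)=(12,8,4) this gives two data racks, one hybrid rack with two data nodes and one coding node, and one coding rack.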

We assume β=1, since, as with constructions of MSR codes, the construction extends easily to β≠1. When β=1, we have

α=d-m+1, B=k(d-m+1).

A. According to Theorem 4, MSR codes have the same cross-rack repair bandwidth as homogeneous MSRR codes. That is, when kr/n is an integer, all existing constructions of MSR codes can be applied directly to MSRR codes, but with less repair bandwidth within the rack. Since existing constructions of MSR codes support all parameters [23], it is not necessary to give a construction of homogeneous MSRR codes, and we focus on hybrid MSRR codes in the rest of this section. We first present the construction of MSRR codes with α=1. We then give the construction of hybrid MSRR codes with αn/r≥m+αt and α>1. Our construction uses interference alignment, which is related to the concept of eigenvectors and has been used to construct exact-repair MSR codes [28], [29]. Before giving the construction, we introduce some notation used in this section.

A data file is represented by B data symbols s=[s_1, s_2, ..., s_B] over the finite field F_q. The α symbols stored on node i of rack h are the encoding result sQ_{i,h}, where i=1,2,...,n/r and h=1,2,...,r. Q_{i,h} is the B×α coding matrix, and the coding matrix G_h of rack h is defined as

When a node in rack f fails, the new node accesses all (n/r-1)α symbols of the other nodes in rack f and, from each helper rack

downloads a coded symbol together with the local coding vector

where

is a vector of length at most αn/r. The sub-matrix of G consisting of rows i1 to i2 and columns j1 to j2 is denoted by G[(i1,i2),(j1,j2)]; G[(i1,i2),(:)] and G[(:),(j1,j2)] denote the sub-matrices of G consisting of rows i1 to i2 and of columns j1 to j2, respectively.

B. Code construction for α=1 and arbitrary n and k

We show in the next theorem that any (n,k) MDS code can achieve the smallest cross-rack repair bandwidth.

Theorem 5. If α=1, then any symbol of an (n,k) MDS code can be repaired by downloading all the other n/r-1 symbols in the main rack and one symbol from each of any d other racks.

Proof: When α=1, we have B=k and d=m. For notational convenience, let c_{i,f} denote the symbol stored in node i of rack f, where i=1,2,...,n/r and f=1,2,...,r. We need to show that the symbol c_{i,f} can be recovered by downloading the n/r-1 symbols

c_{1,f}, ..., c_{i-1,f}, c_{i+1,f}, ..., c_{n/r,f}

from rack f and d symbols from any d racks h_1, h_2, ..., h_d, where h_1, ..., h_d are distinct elements of {1,2,...,r}\{f}. We first consider the case where kr/n is not an integer.

Note that the symbols in rack h_i are

where i=1,2,...,d. Since each rack holds n/r symbols, the total number of symbols in racks f, h_1, h_2, ..., h_{d-1} together with t symbols in rack h_d is dn/r+t=k. We can regard the symbol

as a linear combination of these k symbols, namely the dn/r symbols in racks f, h_1, h_2, ..., h_{d-1} and t symbols in rack h_d,

that is,

where q_i≠0 for i=1,2,...,k. We can therefore repair c_{i,f} by downloading the following symbol from rack h_d,

the symbols

from racks h_i, i∈{1,2,...,d-1}, and the n/r-1 symbols

c_{1,f}, ..., c_{i-1,f}, c_{i+1,f}, ..., c_{n/r,f}

from rack f.

Therefore, when kr/n is not an integer, any failed symbol can be repaired by downloading one symbol from each of any d other racks and n/r-1 symbols from the main rack. When kr/n is an integer, the repair of a lost symbol c_{i,f} can be regarded as the special case t=0 of the above repair process.
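The repair in Theorem 5 can be illustrated on a small Reed-Solomon code. The sketch below is an illustration and not the construction of the text: it assumes a polynomial-evaluation code over the prime field GF(13), 0-based node indices with rack f holding nodes f·n/r, ..., (f+1)·n/r-1, and m≥1. The failed symbol is expressed as a linear combination of k surviving symbols (the n/r-1 local ones, all symbols of the first m-1 helper racks, and t+1 symbols of the last helper rack), and each helper rack sends a single combined symbol.

```python
P = 13  # assumed prime field size; any prime larger than n works for this sketch


def encode(data, n):
    """Reed-Solomon codeword: evaluate the data polynomial at x = 1, ..., n over GF(P)."""
    return [sum(s * pow(x, i, P) for i, s in enumerate(data)) % P
            for x in range(1, n + 1)]


def lagrange_coeff(x_target, x_j, points):
    """Coefficient of the symbol at x_j when interpolating the value at x_target."""
    num, den = 1, 1
    for x_l in points:
        if x_l != x_j:
            num = num * (x_target - x_l) % P
            den = den * (x_j - x_l) % P
    return num * pow(den, -1, P) % P  # modular inverse, Python 3.8+


def repair(codeword, n, k, r, failed, helper_racks):
    """Repair node `failed`; returns (value, number of symbols sent across racks)."""
    q = n // r                       # nodes per rack
    m, t = divmod(k, q)              # m = floor(kr/n), t = k mod (n/r)
    f = failed // q                  # rack of the failed node
    assert m >= 1 and len(helper_racks) == m and f not in helper_racks
    local = [j for j in range(f * q, (f + 1) * q) if j != failed]
    remote = []
    for idx, h in enumerate(helper_racks):
        used = q if idx < m - 1 else t + 1   # last helper rack contributes t+1 symbols
        remote.append(list(range(h * q, h * q + used)))
    helpers = local + [j for nodes in remote for j in nodes]
    assert len(helpers) == k
    points = [j + 1 for j in helpers]
    coeff = {j: lagrange_coeff(failed + 1, j + 1, points) for j in helpers}
    # Each relay node combines its rack's symbols and sends ONE symbol.
    relay_symbols = [sum(coeff[j] * codeword[j] for j in nodes) % P for nodes in remote]
    local_part = sum(coeff[j] * codeword[j] for j in local) % P
    return (local_part + sum(relay_symbols)) % P, len(relay_symbols)


if __name__ == "__main__":
    cw = encode([3, 1, 4, 1, 5, 9, 2, 6], 12)       # (n, k, r) = (12, 8, 4)
    print(repair(cw, 12, 8, 4, failed=0, helper_racks=[1, 2]), cw[0])
```

Lagrange interpolation is used here only because it gives the combination coefficients of a Reed-Solomon code in closed form; for a general MDS code the coefficients would come from solving a k×k linear system.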

The example in Figure 4 illustrates the repair of a data node. To recover the failed symbol s_1, five symbols are downloaded, of which only the coded symbol downloaded from rack 5 (called the desired symbol) depends linearly on s_1. The desired symbol

consists of a desired component q_{1,1}s_1, which is what is needed to recover the failed symbol, and an interference component

Once the interference component is aligned and cancelled, we obtain the desired component q_{1,1}s_1, and the failed symbol can be repaired if q_{1,1}≠0. The other four downloaded symbols (called interference symbols) are therefore used to align the interference component. The first construction of the DRC code in [18] can be regarded as a special case in which d=r-1 and n/(n-k) is an integer.

C. Hybrid MSRR codes satisfying αn/r≥m+αt

Idea: In the hybrid MSRR code, the first k-t data nodes are placed in the m data racks and the last t=k mod (n/r) data nodes are placed in the hybrid rack; each node stores α=d-m+1 symbols. To repair a data node in a data rack, the new node accesses all other symbols in the main rack. It downloads (i) m-1 coded symbols from the m-1 other data racks, (ii) one coded symbol from the hybrid rack, and (iii) d-m coded symbols from d-m coding racks. The d-m+1 desired symbols come from the hybrid rack and the d-m coding racks. Note that all interference symbols downloaded from the data racks are independent of the last tα data symbols. To recover the failed symbols, each of the d-m+1 desired symbols should therefore also be independent of the last tα data symbols. To align all interference components of the α desired symbols simultaneously, the coding matrices must be chosen carefully, and we introduce orthogonal vectors for this purpose. We point out that, in the construction of hybrid MSRR codes, a data node is repaired with the optimal cross-rack repair bandwidth from d specific helper racks rather than from d arbitrary helper racks.

Construction: We first introduce some notation. The row vectors v_1, v_2, ..., v_m are orthogonal and have length m. The vectors u_1, u_2, ..., u_α have size 1×αn/r. Let E_1, E_2, ..., E_m be αn/r×m matrices, F_1, F_2, ..., F_α be (B-mαn/r)×m matrices, and R_1, R_2, ..., R_α be B×(αn/r-m) matrices. For h=m+1, m+2, ..., r, the coding matrix G_h is

Among the matrices R_i, i=1,2,...,α, the matrix R_1 is defined as

where 0_{(B-α·t)×α·t} is the (B-α·t)×α·t zero matrix, I_{α·t×α·t} is the α·t×α·t identity matrix, and L_1 is a B×(αn/r-m-α·t) matrix. The parameters must therefore satisfy αn/r≥m+αt. For i=2,3,...,α, R_i is

where y_1, y_2, ..., y_m are m orthogonal vectors of length αn/r-m, x_2, x_3, ..., x_α are α-1 vectors of length αn/r, D_{i,1}, D_{i,2}, ..., D_{i,m} are matrices of size αn/r×(αn/r-m), and C_i is a matrix of size (B-mαn/r)×(αn/r-m).

The following conditions must be met. For f=1,2,...,m and i=2,3,...,α, there exist non-zero elements λ′_{i,j}, j=1,...,f-1,f+1,...,m, such that the equations in (7) and (8) hold. For l=1,2,...,n/r, all sub-matrices of the matrix M_1 in (9)

are non-singular. For i=1,2,...,t, all sub-matrices M_2[(1+(i-1)α,iα),(1,α)] of the matrix M_2 in (10) are non-singular. These conditions are called the repair conditions.

The proposed code has the (n, k) recovery property if and only if the file can be recovered from any k nodes. This is equivalent to requiring that all α×α sub-matrices of the matrix

[G_{m+1}[(:), (1, (n/r-t)α)] G_{m+2} … G_r] (11)

consisting of the rows

and

be non-singular, where the following relations must be satisfied:

and

These requirements are called the fault-tolerance conditions.

If a data node in rack f∈{1,2,...,m} fails, the new node downloads a desired symbol from the relay node of rack m+1,

and, from the relay node of rack i+m for i=2,...,α, the symbol

The m-1 interference symbols from racks {1,2,...,m}\{f}, together with the local coding vectors, are

Subtracting the symbols downloaded from racks {1,...,f-1,f+1,...,m} from the symbols downloaded from racks m+1,...,m+α, we obtain α coded symbols, which are the product of the matrix in (9) and the vector

[s_{(f-1)αn/r+1} s_{(f-1)αn/r+2} … s_{fαn/r}]

The α lost symbols can be repaired because the corresponding α×α sub-matrix of the matrix in (9) is non-singular. Before establishing the repair conditions and the fault-tolerance conditions, we recall the Schwartz-Zippel lemma.

Lemma 6 (Schwartz-Zippel [30]). Let Q(x_1,...,x_n)∈F_q[x_1,...,x_n] be a non-zero multivariate polynomial of total degree d. Let r_1,...,r_n be chosen independently and uniformly at random from a finite subset Ω of F_q. Then Pr[Q(r_1,...,r_n)=0] ≤ d/|Ω|.

TABLE IV: Parameters satisfy the construction of hybrid MSRR codes in Section V-C.

then there exist encoding matrices G _h for h=m+1, m+2,..., r over

of hybrid MSRR codes, where the parameters n, r, m, t, α satisfy αn/r≥m+αt.

Proof: Refer to Appendix C

From Theorem 7, the parameters supported by the proposed hybrid MSRR code satisfy n≥(m+αt)r/α. Table IV lists some examples with d=r-1 and r=3, 4, 5, and 6. Table IV shows that the construction of the hybrid MSRR code covers most parameters.
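The parameter condition can be evaluated mechanically. The sketch below tests αn/r≥m+αt with α=d-m+1, m=⌊kr/n⌋ and t=k mod (n/r), and restricts attention to the hybrid case t≠0; the function names are illustrative, the requirement d≥m is an added assumption, and the enumeration is not a reproduction of Table IV.

```python
def hybrid_msrr_supported(n: int, k: int, r: int, d: int) -> bool:
    """Check the condition alpha*n/r >= m + alpha*t for a hybrid MSRR code (beta = 1)."""
    if n % r != 0:
        return False
    per_rack = n // r
    m, t = divmod(k, per_rack)      # m = floor(kr/n), t = k mod (n/r)
    if t == 0 or d < m:             # homogeneous case, or too few helper racks (assumed)
        return False
    alpha = d - m + 1
    return alpha * per_rack >= m + alpha * t


def supported_k(n: int, r: int) -> list:
    """Values of k < n that satisfy the condition when d = r - 1."""
    return [k for k in range(1, n) if hybrid_msrr_supported(n, k, r, r - 1)]


if __name__ == "__main__":
    print(hybrid_msrr_supported(12, 8, 4, 3))   # alpha*n/r = 6 and m + alpha*t = 6
    print(supported_k(12, 4))
```

For (n,k,r,d)=(12,8,4,3) the condition holds with equality, which is the example worked out in the text.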

Example. Take (n,k,r,d)=(12,8,4,3), which gives m=2, t=2, α=2, and B=16. The row vectors v_1 and v_2 are orthogonal and of length 2, and the vectors u_1 and u_2 have size 1×6. The matrices E_1, E_2 have size 6×2, F_1, F_2 have size 4×2, and R_1, R_2 have size 16×4. The coding matrix G_3 is then given by

where λ_{1,1} and λ_{1,2} are two non-zero elements. The coding matrix G_4 is

where y_1, y_2 are orthogonal vectors of length 4, x_2 has length 6, λ_{2,1}, λ_{2,2} are two non-zero elements, D_{2,1} and D_{2,2} have size 6×4, and C_2 is a matrix of size 4×4. Figure 5 shows the example.

We can repair the two data symbols s_1, s_2 in node 1 by downloading the other four data symbols s_3, s_4, s_5, s_6 (interference symbols) in the first rack, one symbol (an interference symbol) from rack 2,

and two symbols (desired symbols) from racks 3 and 4.

Note that, according to (8),

and, according to (7),

so the desired symbol downloaded from rack 4 is

Cancelling the interference symbol in (14) from the two desired symbols downloaded from racks 3 and 4, we obtain the following two symbols:

We can therefore repair the two data symbols s_1, s_2 by removing the contribution of the four interference symbols s_3, s_4, s_5, s_6 from these two symbols and then solving the resulting 2×2 linear system, which is solvable because the matrix on the right above is a 2×2 sub-matrix of the matrix in (9) and is therefore non-singular.

Nodes 2 and 3 can be repaired similarly. In the same way, we can repair a node in rack 2 or in rack 3. Although the constructed hybrid MSRR code achieves the minimum cross-rack repair bandwidth only for data nodes, we can apply the generic transformation [31] to the hybrid MSRR code so that each coding node has the same cross-rack repair bandwidth as the homogeneous MSRR (n, k=mn/r, d) code.

Exact-repair construction of MBRR codes for all parameters

As with the construction of MSRR codes, we consider the construction of MBRR codes with β=1. When β=1, the parameters of the MBRR code satisfy

B=kd-m(m-1)/2, α=γ=d.

We need to construct a code with these parameters that has the (n, k) recovery property.

By connecting to any k nodes, we obtain kd symbols. The (n, k) recovery property holds if there are B independent symbols among these kd symbols, that is, if there are at most m(m-1)/2 dependent symbols in any k nodes. We adapt the construction of product-matrix (PM) MBR codes [32] to MBRR codes. The specific construction is given below.

The encoding process is as follows:

The B data symbols are divided into two parts: the first part consists of (k-m)d data symbols and the second part of md-m(m-1)/2 data symbols.

Compute (n-r-k+m)d global code symbols by encoding all B data symbols. The (n-r-k+m)d global code symbols and the (k-m)d data symbols of the first part ((n-r)d symbols in total) are stored in the last n/r-1 nodes of the r racks.

The second part is encoded by the PM-MBR (r, d, d) code to generate dr code symbols, which are divided into r groups of d code symbols each. For each group, the code symbols in the group are linearly combined with all (n/r-1)d symbols stored in the last n/r-1 nodes of the corresponding rack, where each coding vector is a column vector of length (n/r-1)d+1, and the resulting d code symbols are stored in the first node of that rack.

The construction is as follows. The first (k-m)d data symbols s_j, j=1,2,...,(k-m)d, are represented by the row vector [s_1 s_2 … s_{(k-m)d}].

Compute the (n-r-k+m)d global code symbols by

[c_1 c_2 … c_{(n-r-k+m)d}]=[s_1 s_2 … s_B]Q,

These symbols, together with the first part, are stored in the last n/r-1 nodes of the r racks and are represented by an r×(n/r-1)d matrix M_1. In general, Q can be chosen to be a Cauchy matrix, so that, if n-r≥k, any k of the n-r nodes (the last n/r-1 nodes of the r racks) suffice to reconstruct the B data symbols.

Construct the d×d data matrix

M_2 = [ S_1  S_2 ; S_2^T  0 ],

where S_1 is a symmetric m×m matrix whose upper-triangular part is filled with the m(m+1)/2 data symbols s_j, j=(k-m)d+1, (k-m)d+2, ..., (k-m)d+m(m+1)/2, and whose lower-triangular part is obtained by reflection along the diagonal. The rectangular matrix S_2 has size m×(d-m), and its entries are the m(d-m) data symbols s_j, j=(k-m)d+m(m+1)/2+1, ..., B, listed in an arbitrary fixed order. The matrix S_2^T is the transpose of S_2, and 0 is the (d-m)×(d-m) all-zero matrix.
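The layout of the symmetric d×d data matrix can be made concrete. The sketch below arranges m(m+1)/2+m(d-m) symbols into the block form with S_1 in the top-left corner, S_2 to its right, the transpose of S_2 below, and a zero block in the bottom-right corner. The row-by-row fill order and the function name are assumptions made for illustration; the text only requires some fixed order.

```python
def build_data_matrix(symbols, m, d):
    """Arrange the symbols into the symmetric d x d matrix [[S1, S2], [S2^T, 0]]."""
    needed = m * (m + 1) // 2 + m * (d - m)
    if len(symbols) != needed:
        raise ValueError(f"expected {needed} symbols, got {len(symbols)}")
    it = iter(symbols)
    matrix = [[0] * d for _ in range(d)]
    # S1: fill the upper triangle row by row and mirror it below the diagonal.
    for i in range(m):
        for j in range(i, m):
            matrix[i][j] = matrix[j][i] = next(it)
    # S2 (m x (d-m)) and its transpose; the bottom-right block stays zero.
    for i in range(m):
        for j in range(m, d):
            matrix[i][j] = matrix[j][i] = next(it)
    return matrix


if __name__ == "__main__":
    # (n, k, r, d) = (12, 8, 4, 3): m = 2, and the second part has 5 symbols,
    # represented here by their indices 19..23.
    for row in build_data_matrix([19, 20, 21, 22, 23], m=2, d=3):
        print(row)
```

The number of symbols consumed, m(m+1)/2+m(d-m), equals md-m(m-1)/2, the size of the second part of the data.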

Define Φ as a d×r matrix whose i-th column is denoted by

for i=1,2,...,r. Define P as an (n/r-1)d×rd matrix whose l-th column is denoted by



For i=1,2,...,r, the d local code symbols stored in the first node of rack i are computed as

Note that

can be viewed as a codeword of the PM-MBR (r, d, d) code.

Fig. 6 shows an example of (n, k, r, d)=(12, 8, 4, 3). In this example, we have B=23 data symbols. The first 18 data symbols and 6 global code symbols are stored in the last two nodes of each rack, and the 4 local code symbols are stored in the first node in each rack.
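The symbol counts in this example follow from the encoding steps. The sketch below recomputes them from (n,k,r,d), assuming m=⌊kr/n⌋ and β=1; the function name and dictionary keys are illustrative.

```python
def mbrr_symbol_counts(n: int, k: int, r: int, d: int) -> dict:
    """Count the symbols handled by each encoding step of the MBRR construction."""
    m = (k * r) // n                               # assumption: m = floor(kr/n)
    return {
        "B": k * d - m * (m - 1) // 2,             # file size
        "first_part": (k - m) * d,                 # stored uncoded
        "second_part": m * d - m * (m - 1) // 2,   # fills the d x d data matrix
        "global_code_symbols": (n - r - k + m) * d,
        "stored_in_last_nodes": (n - r) * d,       # last n/r - 1 nodes of all racks
        "local_code_symbols_per_rack": d,          # first node of each rack
    }


if __name__ == "__main__":
    counts = mbrr_symbol_counts(12, 8, 4, 3)
    for name, value in counts.items():
        print(name, value)
```

For (12,8,4,3) the first part and the global code symbols together fill the (n-r)d=24 positions in the last two nodes of the four racks.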

Theorem 8. If the field size exceeds

then any k nodes can recover the B data symbols, and the α symbols stored in any node can be repaired with the optimal cross-rack repair bandwidth of the MBRR code.

Proof: File recovery. Suppose first that the data collector connects to k nodes that are all among the last n/r-1 nodes of the r racks. Then it can recover the B data symbols, because every square sub-matrix of a Cauchy matrix is non-singular. Now consider a data collector that connects to

nodes among the last n/r-1 nodes of the racks and to

first nodes of the racks, where

The kd received symbols can be represented by a B×kd coding matrix. If we treat each entry of P and Φ as a variable, we can check that there is a B×B sub-matrix whose determinant is a non-zero polynomial of total degree at most B. There are in total

possible choices, and the product of all these determinants is a polynomial whose total degree is at most the value in (16).

Therefore, by the Schwartz-Zippel lemma, if the field size is greater than the value in (16), the B data symbols can be decoded from any k nodes.

Repair. Suppose that a node in rack f fails, where f∈{1,2,...,r}. The new node connects to any d helper racks h_i, i=1,2,...,d. The relay node of helper rack h_i accesses all the symbols stored in its rack in order to

compute the code symbol, and sends

to the new node. The new node then holds d code symbols

Since the left matrix above is invertible, the new node can compute the code symbols φ_f M_2, as in the repair process of the PM-MBR code, and can then repair the failed node by accessing all other symbols in rack f.

In fact, the upper bound on the field size in Theorem 8 is exponential in k. However, we can check directly by computer search whether any k nodes can reconstruct the B data symbols. For the example (n,k,r,d)=(12,8,4,3), a computer search finds matrices P and Φ over a field of size 11 such that any k nodes can reconstruct the B data symbols. We can replace the underlying finite field with binary cyclic codes [33] to reduce the computational complexity.

Comparative analysis

In this section, we evaluate the cross-rack repair bandwidth at the two extreme points of RRC codes, RC codes, and other related codes, such as the clustering code in [7] and the code in [8]. We also discuss the parameters supported by our exact-repair constructions, by the exact-repair construction in [7], and by DRC [6], [18].

A. Cross-rack repair bandwidth

1) Comparison of MSRR (MBRR) and MSR (MBR): According to Theorem 4, if kr/n is an integer, the cross-rack repair bandwidth of the MSR code is the same as that of the MSRR code. If kr/n is not an integer, the cross-rack repair bandwidth of the MSRR code is strictly less than that of the MSR code. Figure 7 shows the cross-rack repair bandwidth of the MSRR code and the MSR code when B=1, r=3,4, and d=r-1. The results show that the hybrid MSRR code has strictly less cross-rack repair bandwidth than the MSR code, and that this advantage grows as k increases. For example, when (n,k,r)=(18,11,3) and (n,k,r)=(18,17,3), the cross-rack repair bandwidth of the MSRR code is 42% and 83% lower than that of the MSR code, respectively.

According to Theorem 4, if k/n>2/r and kr/n is an integer, the MBRR code has less cross-rack repair bandwidth than the MBR code. Therefore, if the code rate is not too low, the cross-rack repair bandwidth of the MBRR code is strictly smaller than that of the MBR code.

Figure 8 shows the cross-rack repair bandwidth of the two codes when B=1, r=5,6 and d=r-1. The results show that MBRR codes have less cross-rack repair bandwidth for all evaluated parameters. For given r and n, the gap between the MBRR code and the MBR code widens as k increases. When n=20 and r=5, the cross-rack repair bandwidth of the MBRR code is 11% to 32% lower than that of the MBR code. When n=18 and r=6, the reduction is 7% to 32%.

Let B=1 and d=r-1. For the specific cases (n,k)=(18,13) and (n,k)=(24,18), Figure 9 shows the cross-rack repair bandwidth of MBR codes and MBRR codes when r=3,6,9 and r=3,4,6,8,12, respectively. We make two observations from Figure 9. First, the cross-rack repair bandwidth of MBRR codes is always smaller than that of MBR codes. Second, for given n and k, the advantage of the MBRR code in cross-rack repair bandwidth varies with r. For example, when (n,k,r)=(24,18,3), the cross-rack repair bandwidth of the MBRR code is 6.8% lower than that of the MBR code, and when (n,k,r)=(24,18,8), the reduction increases to 21.6%.
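The percentage reductions quoted in this subsection can be recomputed from the extreme-point bandwidths. The sketch below assumes B=1, d=r-1, m=⌊kr/n⌋ and d'=dn/r+n/r-1, takes the MSRR and MBRR bandwidths as d/(k(d-m+1)) and 2d/(2kd-m(m-1)), and uses the classical MSR and MBR formulas restricted to the dn/r helpers in other racks. The function names are illustrative.

```python
from fractions import Fraction


def gamma_msrr(n, k, r, d):
    m = (k * r) // n
    return Fraction(d, k * (d - m + 1))


def gamma_mbrr(n, k, r, d):
    m = (k * r) // n
    return Fraction(2 * d, 2 * k * d - m * (m - 1))


def gamma_msr(n, k, r, d):
    per_rack = n // r
    d_prime = d * per_rack + per_rack - 1
    return Fraction(d * per_rack, k * (d_prime - k + 1))


def gamma_mbr(n, k, r, d):
    per_rack = n // r
    d_prime = d * per_rack + per_rack - 1
    return Fraction(2 * d * per_rack, k * (2 * d_prime - k + 1))


def reduction(rack_aware, classic):
    """Relative saving of the rack-aware code over the classical code."""
    return 1 - rack_aware / classic


if __name__ == "__main__":
    for n, k, r in [(18, 11, 3), (18, 17, 3)]:
        d = r - 1
        saving = reduction(gamma_msrr(n, k, r, d), gamma_msr(n, k, r, d))
        print("MSRR vs MSR", (n, k, r), f"{float(saving):.1%}")
    for n, k, r in [(24, 18, 3), (24, 18, 8)]:
        d = r - 1
        saving = reduction(gamma_mbrr(n, k, r, d), gamma_mbr(n, k, r, d))
        print("MBRR vs MBR", (n, k, r), f"{float(saving):.1%}")
```

Under these assumptions the savings are 5/12 and 5/6 for the two MSRR cases and 19/280 and 8/37 for the two MBRR cases, in line with the figures of 42%, 83%, 6.8% and 21.6% stated in the text.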

2) Comparison of MSRR (MBRR) codes with the minimum storage (bandwidth) codes in [7], [8]: Two recent related works are [7] and [8]. In our model, any k nodes suffice to rebuild the data file, whereas in [7] any kr/n racks (where kr/n is an integer) can rebuild the data file, but there may be k nodes that cannot. In [7], a failed node is repaired by downloading α symbols from each of

other nodes in the same rack and β symbols from each of d other racks. The β symbols downloaded from a remote rack are linear combinations of all αn/r symbols stored in that rack. For functional repair, Theorem 4.1 in [7] shows that the file size is upper bounded as follows (note that k in [7] should be replaced by kr/n and m in [7] by n/r):

If we download all α(n/r-1) symbols in the main rack to repair the failed node, that is,

then the above upper bound becomes

Assume that d≥kr/n. Then max{d-i,0}β-α=(d-i)β-α, and when kr/n is an integer, the above bound is the same as the bound (1) in our Theorem 1. Therefore, the cross-rack repair bandwidth of the code in [7] equals that of our RRC code. According to the analysis of Theorem 3, for i=1,2,...,n/r-1, where kr/n is an integer, the MSRR(n,k+i,r) code (MBRR(n,k+i,r) code) has less cross-rack repair bandwidth than the MSRR(n,k,r) code (MBRR(n,k,r) code). Therefore, the MSRR(n,k+i,r) code (MBRR(n,k+i,r) code) has strictly less cross-rack repair bandwidth than the (n,k,r) minimum storage (bandwidth) code in [7] for i=1,2,...,n/r-1, where kr/n is an integer. In our model any k nodes can reconstruct the data file, which is not guaranteed in [7]; nevertheless, under the same setting, the bound of our model is the same as that in [7]. The following observation follows from our results and those in [7]: all α(n/r-1) symbols in the main rack are needed to attain the minimum cross-rack repair bandwidth. If we reduce the repair bandwidth within the rack, i.e., reduce

then the cross-rack repair bandwidth increases.

Let

and β_c be the numbers of symbols downloaded from helper nodes in the main rack and in other racks, respectively. Assume

Theorem 3 in [8] shows that the minimum storage overhead, i.e., α=B/k, can be achieved if and only if ε≥1/(n-k). Hence, under minimum storage overhead, ε=1/(n-k) is the condition for achieving the minimum cross-rack repair bandwidth. When ε=1/(n-k), the minimum storage point of the code in [8] is

where γ_msr is the cross-rack repair bandwidth when all n-1 surviving nodes are contacted. We make three observations. First, all α symbols of each node in the main rack should be downloaded during repair to minimize the cross-rack repair bandwidth. Second, when ε=1/(n-k), the cross-rack repair bandwidth at the minimum storage point of the code in [8] is the same as that of the original MSR code with the same parameters. Third, when t≠0, the cross-rack repair bandwidth of our MSRR code is strictly smaller than that at the minimum storage point of the code in [8]. On the other hand, if t=0, that is, kr/n is an integer, the cross-rack repair bandwidth of the MSRR code equals that at the minimum storage point of the code in [8].

Consider now the cross-rack repair bandwidth at the minimum bandwidth point of the code in [8]. For ε>0, let β_c=1; then

the minimum bandwidth point of the code in [8] is

(α_mbr, γ_mbr)=((n/r-1)/ε+(n-n/r), (n-n/r)),

where γ_mbr is the cross-rack repair bandwidth, and the file size is

according to Proposition 3 of [10]. We observe that the storage increases as ε decreases: reducing ε reduces the normalized cross-rack repair bandwidth at the expense of increased storage. At the minimum bandwidth point of the code in [8], the storage α_mbr is greater than the cross-rack repair bandwidth γ_mbr, whereas for the MBRR code the storage equals the cross-rack repair bandwidth.

Let ε=1. Figure 10 shows the trade-off relationship between storage and cross-rack repair bandwidth when B=1, n=20, 25, r=5, and d=4. We can see from Figure 10 that for all parameters, the storage and cross-rack repair bandwidth of MBRR codes is less than the minimum bandwidth point of the code in [8].

In summary, for most parameters, the cross-rack repair bandwidth of our RRC is strictly smaller than that of the code in [8], and if kr/n is an integer, it is the same as that of the code in [7]. In addition, RRC can tolerate more failure patterns than the codes in [7]. When kr/n is an integer, the cross-rack repair bandwidth of a hybrid MSRR(n, k+i, r) code, where i=1, 2, ..., n/r-1, is less than that of the MSR(n, k, r) code and of the minimum storage (n, k, r) code in [7].

B. Parameters of precise-repair MSRR codes and MBRR codes

Now, we compare the parameters supported by the precise repair constructions at the two extreme points of the RRC code, by the codes in [7] and [8], and by the DRC code in [18]. The first construction of the DRC code in [18] can be regarded as a special case of the MSRR code construction in Section V-B, where n/(n-k) is an integer and d=r-1. The second construction of DRC in [18] applies when r=3. In the code construction in [7], kr/n must be an integer. The construction of the minimum storage code in [8] is given in [9], and it can only support r=2 and n=2k. For r=3, 4, 5, 6 and different values of n, the ranges of k supported by the MSRR code, the clustering code in [7], and DRC are shown in Figure 11. The results show that the MSRR code supports far more parameters than the other two codes.
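The divisibility conditions quoted above can be checked mechanically. The following Python sketch lists, for given n and r, the values of k that satisfy the condition stated for the code in [7] (kr/n is an integer) and the condition stated for the first DRC construction in [18] (n/(n-k) is an integer; that construction additionally requires d=r-1, which does not restrict k). The function name `supported_k` and the dictionary keys are hypothetical, and the range supported by the MSRR code is not computed because its condition is not restated in this passage.

```python
def supported_k(n, r):
    """List the k in 1..n-1 that meet the stated divisibility conditions.

    'clustered_7': kr/n is an integer (condition stated for the code in [7]).
    'drc_first_18': n/(n-k) is an integer (first DRC construction in [18]).
    """
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    clustered = [k for k in range(1, n) if (k * r) % n == 0]
    drc_first = [k for k in range(1, n) if n % (n - k) == 0]
    return {"clustered_7": clustered, "drc_first_18": drc_first}


if __name__ == "__main__":
    for n in (9, 12, 15):
        print(n, supported_k(n, 3))
```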

The precise repair construction of the clustering code in [7] is based on existing constructions of MSR codes. When k/n<0.5, the existing constructions of MSR codes are limited in that the storage α is exponential in k, and the minimum storage construction of the clustering code in [7] inherits this limitation. The construction of our MSRR code does not have this restriction.

The MBRR code proposed in this paper can support all parameters. The construction of the minimum bandwidth code in [8] is given in [10], and it can also support all parameters. However, the construction of the minimum bandwidth code in [7] requires kr/n to be an integer. Therefore, our construction supports more parameters than the one in [7].

Conclusion and future work

In this article, we studied the optimal trade-off between storage and cross-rack repair bandwidth in rack-based data centers. We propose a Rack-Aware Regeneration Code (RRC) that can achieve the optimal trade-off, derive its two extreme points, namely the MSRR point and the MBRR point, and give precise repair constructions of the MSRR code and the MBRR code. For most parameters, the cross-rack repair bandwidth of the MSRR code (MBRR code) is strictly smaller than that of the MSR code (MBR code). In our system model, all symbols in the host rack must be downloaded to repair a failed node. One of our future tasks is to extend the results to allow a more flexible selection of helper nodes in the host rack; another is to implement RRC codes in an actual rack-based data center.

Appendix A

Proof of Theorem 1

Proof: First, we state the following lemma.

Lemma 9. If the relay node of a rack is connected to the data collector T but not all of the other n/r-1 nodes of that rack are connected to T, then the capacity of the (S, T)-cut is not the smallest.

Proof. Suppose that the relay node X_{1,1} is connected to T. Since the incoming edges of T all have infinite capacity, we only need to examine the incoming edges of Out_{1,1} and In_{1,1}. If X_{1,1} is not a failed node, the incoming edges of Out_{1,1} and In_{1,1} have capacity αn/r, which is a finite value. Therefore, a relay node that has not failed contributes αn/r to the cut. On the other hand, if the relay node X'_{1,1}, which replaces a failed node, is connected to T, the incoming edges of Out'_{1,1} and In'_{1,1} have capacities αn/r and α(n/r-1)+dβ, respectively. Node X'_{1,1} can contribute min{(n/r-1)α+dβ, αn/r} symbols to the cut. Each of the other n/r-1 nodes in rack 1 has an edge of capacity α connecting its input node to Out_{1,1}. These n/r-1 nodes contribute nothing further to the cut, regardless of whether they are connected to T. Therefore, if a relay node is connected to T, we should connect all the other n/r-1 nodes in the same rack to T to minimize the capacity of the cut.

Next, we show that there is an information flow graph G(n,k,r,d,α,β) such that mincut(G) is equal to the right-hand side of (1). In this graph, the relay nodes X_{1,1}, X_{2,1}, ..., X_{m,1} fail in this order, where

Each new node X'_{1,1} reads, from each node

α symbols, and reads β symbols from each of the previous d relay nodes. The data collector T connects to all the nodes of the first m racks and to k-mn/r nodes of rack m+1 (excluding the relay node). Figure 2 shows the graph G(n,k,r,d,α,β) when (n,k,r,d)=(9,5,3,2). For each

the rack

can contribute min{(n/r-1)α+(d-l+1)β, n/r·α} to the cut. Therefore, mincut(G) is equal to the right-hand side of (1).
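The cut capacity of the graph described above can be evaluated numerically. The following Python sketch sums the per-rack contributions min{(n/r-1)α+(d-l+1)β, n/r·α} over the first m racks and adds α for each of the k-mn/r remaining nodes in rack m+1. It assumes m=⌊kr/n⌋, since the definition of m is not reproduced in this passage, and the function name `cut_capacity` is hypothetical.

```python
from fractions import Fraction


def cut_capacity(n, k, r, d, alpha, beta):
    """Capacity of the cut for the graph in which relay nodes fail in order.

    Assumption: m = floor(k*r/n) racks are fully connected to the data
    collector, and k - m*n/r non-relay nodes of rack m+1 are connected.
    """
    if n % r != 0:
        raise ValueError("n must be a multiple of r")
    per_rack = n // r
    m = k // per_rack                      # assumed definition of m
    alpha, beta = Fraction(alpha), Fraction(beta)
    total = Fraction(0)
    for l in range(1, m + 1):
        # Rack l contributes the smaller of its repair inflow and its storage.
        repaired = (per_rack - 1) * alpha + (d - l + 1) * beta
        total += min(repaired, per_rack * alpha)
    # Each remaining non-relay node of rack m+1 contributes alpha.
    total += (k - m * per_rack) * alpha
    return total


if __name__ == "__main__":
    # Parameters of Figure 2: (n, k, r, d) = (9, 5, 3, 2).
    print(cut_capacity(9, 5, 3, 2, alpha=3, beta=1))
```

For the parameters of Figure 2 with α=3 and β=1, the single full rack contributes min{8, 9}=8 and the two remaining nodes contribute 6, giving a capacity of 14 under the stated assumption.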

Next, we show that (1) must be satisfied by every information flow graph G(n, k, r, d, α, β). Suppose that T is connected to k out-vertices, denoted by {Out_{h,i} : (h,i)∈I}, where the cardinality of I is k. We want to show that mincut(G) is at least the right-hand side of (1).

Without loss of generality, assume that

are the first n/r out-vertices in the cut. If only one vertex

namely

is a relay node, it can contribute min{(n/r-1)α+dβ, n/r·α} to the cut. We choose the other n/r-1 vertices to be in the same rack, so that they contribute nothing further to the cut. If the number of relay nodes is greater than 1, the contribution is greater than min{(n/r-1)α+dβ, n/r·α}. If the vertices

are all non-relay nodes, they contribute n/r·α to the cut. Therefore, these n/r vertices contribute at least min{(n/r-1)α+dβ, n/r·α} to the cut.

Now we assume that

are another n/r out-vertices. By an argument similar to the one above, these n/r vertices contribute at least min{(n/r-1)α+(d-1)β, n/r·α} to the cut. Similarly, when

for the first

n/r vertices and the final k-mn/r vertices, we obtain that the smallest min-cut over all information flow graphs G(n,k,r,d,α,β) is exactly equal to the right-hand side of (1).

Appendix B.

Proof of Theorem 2

Proof: We solve for α*(β) in the following steps.

If α≤(d-m+1)β, then we have kα≥B and α*(β)=B/k. If α≥dβ, we have

kα+(dβ-α)+((d-1)β-α)+…+((d-m+1)β-α)≥B,

and

For i=1,2,...,m-1, if (d-m+i+1)β<α≤(d-m+i+2)β, the capacity of the cut is

kα+((d-m+1)β-α)+((d-m+2)β-α)+…+((d-m+i+1)β-α) = (k-i-1)α+(i+1)(d-m+i/2+1)β.
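The closed form above follows from summing an arithmetic progression, and it can be checked against the term-by-term sum. The following Python sketch computes kα plus every term ((d-l+1)β-α) that is negative, and compares the result with (k-i-1)α+(i+1)(d-m+i/2+1)β. The function names are hypothetical and the parameter values are chosen only for illustration.

```python
from fractions import Fraction


def capacity_direct(k, m, d, alpha, beta):
    """k*alpha plus every term ((d-l+1)*beta - alpha), l = 1..m, that is negative."""
    alpha, beta = Fraction(alpha), Fraction(beta)
    total = k * alpha
    for l in range(1, m + 1):
        term = (d - l + 1) * beta
        if term < alpha:                   # only these racks reduce the capacity
            total += term - alpha
    return total


def capacity_closed_form(k, m, d, i, alpha, beta):
    """(k-i-1)*alpha + (i+1)*(d-m+i/2+1)*beta, valid when exactly i+1 terms apply."""
    alpha, beta = Fraction(alpha), Fraction(beta)
    return (k - i - 1) * alpha + (i + 1) * (d - m + 1 + Fraction(i, 2)) * beta


if __name__ == "__main__":
    k, m, d, beta = 5, 2, 4, 1             # illustrative values only
    for i, alpha in ((0, Fraction(7, 2)), (1, Fraction(9, 2))):
        print(i, capacity_direct(k, m, d, alpha, beta),
              capacity_closed_form(k, m, d, i, alpha, beta))
```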

The relationship between the minimum capacity Φ and α is as follows:

where

b_i = β(d-m+i+1), (18)

for i=0,1,...,m-1. Because Φ≥B, we can solve for α*(β), which is

So if

then, for i=1,2,...,m-1,

Recall that b_i is defined in (18), which gives

Then we obtain the optimal trade-off stated in the theorem.
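Since the capacity is piecewise linear and increasing in α, the smallest α satisfying Φ≥B can be found by inverting each linear piece in turn. The following Python sketch does this using the breakpoints b_i from (18) and the closed form (k-i-1)α+(i+1)(d-m+i/2+1)β on the piece b_i<α≤b_{i+1}. It is an illustrative sketch rather than the expression derived in the proof: the function name `min_storage` is hypothetical, and it assumes k>m≥1 and treats the last piece as unbounded above.

```python
from fractions import Fraction


def min_storage(B, k, m, d, beta):
    """Smallest alpha whose cut capacity is at least B, for a given beta.

    Assumptions: k > m >= 1; breakpoints b_i = beta*(d-m+i+1), i = 0..m-1;
    on b_i < alpha <= b_{i+1} the capacity is
    (k-i-1)*alpha + (i+1)*(d-m+1+i/2)*beta.
    """
    B, beta = Fraction(B), Fraction(beta)
    b = [beta * (d - m + i + 1) for i in range(m)]
    if B / k <= b[0]:                      # first piece: capacity is k*alpha
        return B / k
    for i in range(m):
        offset = (i + 1) * (d - m + 1 + Fraction(i, 2)) * beta
        alpha = (B - offset) / (k - i - 1)
        upper = b[i + 1] if i + 1 < m else None   # last piece has no upper end
        if alpha > b[i] and (upper is None or alpha <= upper):
            return alpha
    raise ValueError("no piece contains the solution")


if __name__ == "__main__":
    for B in (15, 17, Fraction(41, 2)):    # illustrative file sizes
        print(B, min_storage(B, k=5, m=2, d=4, beta=1))
```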

Appendix C.

Proof of Theorem 7

Proof: We treat the values of v_1, v_m, y_1, …, y_m and λ_{i,j} as constants, and the other entries of the vectors and matrices as variables.

Variables, αmn/r(α-1) 2 equations are summed in (7)

There are three variables, (B-αmn/r)m(α-1) equations in (8). Since αn/r≥2m,

We can regard all elements of D_{i,j}, F_2, ..., F_α and c_2, ..., c_α as constants, and the other elements as free variables. Each element of the matrices M_1 in (9) and M_2 in (10) can be interpreted as a polynomial of total degree 2 and 1, respectively. For the repair condition, the product of all the determinants of the corresponding sub-matrices in (9) and (10) is a polynomial of total degree 2αn/r+αt=α(2n/r+t).

For the reconstruction condition, each element of the matrix in (11) is a polynomial of total degree 1. The product of all the determinants can be regarded as a polynomial whose total degree is

Therefore, according to the Schwartz-Zippel lemma, if the size of the field is larger than (13), the repair condition and the MDS property can be satisfied.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.
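The Schwartz-Zippel lemma states that a nonzero polynomial of total degree D over a finite field F vanishes on at most a fraction D/|F| of the points of F^n, so a large enough field always contains an assignment that makes the product of determinants nonzero. As a small self-contained illustration, unrelated to the specific matrices in (9)-(11), the following Python sketch exhaustively counts the zeros of the 2×2 determinant x0·x3-x1·x2 (total degree 2) over GF(p) for a prime p and compares the zero fraction with the bound 2/p. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def det2_zero_count(p):
    """Count points of GF(p)^4 where x0*x3 - x1*x2 is zero (p must be prime)."""
    return sum(
        1
        for x0, x1, x2, x3 in product(range(p), repeat=4)
        if (x0 * x3 - x1 * x2) % p == 0
    )


def schwartz_zippel_bound(total_degree, field_size):
    """Upper bound on the fraction of zeros of a nonzero polynomial."""
    return Fraction(total_degree, field_size)


if __name__ == "__main__":
    for p in (3, 5, 7):
        zeros = det2_zero_count(p)
        fraction = Fraction(zeros, p ** 4)
        print(p, zeros, fraction, fraction <= schwartz_zippel_bound(2, p))
```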

Claims

1. A rack-aware regeneration code for data centers, characterized in that the rack-aware regeneration code is applied to n nodes in a data center, the n nodes being equally distributed over r racks, with n/r nodes on each rack h, where n is a multiple of r; all nodes are labeled from 1 to n, where h=1, 2, ..., r and i=1, 2, ..., n/r; a data file is regarded as a vector of length B over a finite field, encoded into nα symbols, and stored in the n nodes, with α symbols stored in each node.
2. The rack-aware regeneration code for data centers according to claim 1, wherein each rack is provided with a relay node, and each relay node can read the information in the other nodes in the same rack.
3. The rack-aware regeneration code for data centers according to claim 2, wherein the relay node is any surviving node selected from the rack, and, during repair operations, different data files are associated with different relay nodes.
4. The rack-aware regeneration code for data centers according to claim 3, wherein, when a storage node fails, the rack-aware regeneration code generates a new node to replace it, and the new node is placed in the same rack; the new node arbitrarily selects d other racks, where d<r, and selects the corresponding relay node of each selected rack; each corresponding relay node sends β symbols to the new node, and the content of the new node is regenerated from the dβ received symbols and the (n/r-1)α symbols in the rack where the failed node is located; wherein the cross-rack repair bandwidth is γ=dβ.
5. The rack-aware regeneration code for data centers according to claim 4, wherein, when the relay node is connected to the data collector, the rack-aware regeneration code performs precise repair and functional repair; in precise repair, the symbol information stored in the new node is the same as the symbol information stored in the failed node; in functional repair, the new node may contain symbol information different from that of the failed node, as long as the decoding property is satisfied.