CN116909475A - Cross-rack-aware balanced redundancy conversion method and device for erasure-coded storage systems - Google Patents

Cross-rack-aware balanced redundancy conversion method and device for erasure-coded storage systems

Info

Publication number
CN116909475A
Authority
CN
China
Prior art keywords
check
data
conversion
blocks
redundancy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310648438.XA
Other languages
Chinese (zh)
Inventor
沈志荣 (Zhirong Shen)
张峰 (Feng Zhang)
舒继武 (Jiwu Shu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202310648438.XA
Publication of CN116909475A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604Improving or facilitating administration, e.g. storage management
    • G06F3/0607Improving or facilitating administration, e.g. storage management by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0631Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0644Management of space entities, e.g. partitions, extents, pools
    • G06F3/0646Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems

Abstract

The invention discloses a cross-rack-aware balanced redundancy conversion method and device for erasure-coded storage systems, aimed at the problems of heavy cross-rack transfer traffic and load imbalance between racks during storage redundancy conversion in erasure-coded storage systems. The method comprises the following steps: formulating a balanced inter-rack data block layout and an aggregated intra-rack check block layout, and maintaining this layout throughout the redundancy conversion process, to reduce the cross-rack transfer traffic caused by data block relocation and check block updating; designing a data block selection and incremental check block update strategy to further reduce the cross-rack transfer traffic of check block updates; and proposing a heuristic algorithm that selects stripe groups to perform redundancy conversion in parallel, so as to balance the traffic load of each rack. The invention balances the load between racks while suppressing the cross-rack transfer traffic induced by redundancy conversion, and can shorten the time needed to complete the redundancy conversion.

Description

Cross-rack-aware balanced redundancy conversion method and device for erasure-coded storage systems
Technical Field
The invention relates to the technical field of storage, and in particular to a cross-rack-aware balanced redundancy conversion method and device for erasure-coded storage systems.
Background
Failures are increasingly common in today's large-scale storage systems. To protect data reliability against unexpected failures, most existing storage systems maintain additional data redundancy in advance, through either replication (backups) or erasure coding, so that the original data can be recovered from the pre-stored redundancy when a failure occurs. Replication stores N copies of the data on N different nodes, so its storage overhead is N times that of the original data. Erasure coding instead performs a lightweight computation: it takes fixed-size units of data (called data blocks) as input and computes a small amount of redundant data of the same size (called check blocks). Specifically, an erasure code is configured by two parameters k and m: during encoding, m check blocks are computed from k data blocks, and the resulting (k+m) blocks form a stripe stored on (k+m) different nodes (each node must hold at most one block of the stripe). Any k blocks of the stripe can decode the remaining m blocks, so the erasure code guarantees that the original data can be recovered when failures occur. Compared with replication, which stores multiple identical copies, erasure coding is more space-efficient and maintains the same fault tolerance with less data redundancy. Erasure codes are now widely used in various storage systems, mainly for persistent storage of cold data and for mitigating the latency of frequently accessed data.
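As a concrete illustration of the stripe structure above, the following minimal Python sketch encodes a (k, m) stripe for the special case m = 1, where the single check block is the XOR of the k data blocks; production erasure codes such as Reed-Solomon generalize this to m > 1 check blocks via Galois-field arithmetic, but the single-parity case already shows how any k of the k+m blocks recover the rest. All names here are illustrative, not from the patent.

```python
# Minimal (k, m) stripe sketch with m = 1: the check block is the XOR of
# the k data blocks. Reed-Solomon codes (used by real erasure-coded
# systems) generalize this to m > 1 over a Galois field; XOR is the short
# special case. Names are illustrative only.

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def encode_stripe(data_blocks):
    """Compute the check block of a (k, 1) stripe."""
    return xor_blocks(data_blocks)

k = 4
data = [bytes([i]) * 8 for i in range(1, k + 1)]  # four 8-byte data blocks
check = encode_stripe(data)

# Lose one data block; XOR-ing the k surviving blocks of the stripe
# reproduces the missing one, which is the fault-tolerance property
# described above (any k of the k+m blocks decode the rest).
lost = data[2]
recovered = xor_blocks(data[:2] + data[3:] + [check])
assert recovered == lost
```

The same recovery property holds for every single lost block of the stripe, whether data or check.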
In most cases, storage systems employ a series of redundancy levels (i.e., storage overheads, computed as (k+m)/k) to gracefully adjust access performance in the face of varying access characteristics and varying reliability requirements. Conversion between erasure codes of different redundancy levels (called redundancy conversion) is therefore critical to modern storage systems, and the process tends to generate large amounts of conversion traffic (i.e., data transmitted over the network for the conversion). In general, redundancy conversion denotes the operation of converting a (k, m) erasure code into a (k', m) erasure code, and existing approaches consider conversions with k' > k. Specifically, redundancy conversion disassembles some old stripes of the (k, m) code (called decomposed stripes) and assigns their data blocks to other (k, m) stripes (called stretched stripes) for re-encoding under the new (k', m) code. Suppose {D_1, D_2, ..., D_k} and {P_1, P_2, ..., P_m} are the k data blocks and m check blocks of a stripe. Each check block is a linear combination of the k data blocks and can be computed over a Galois field [1] as P_j = Σ_{i=1}^{k} α_{i,j}·D_i, where α_{i,j} (1 ≤ i ≤ k, 1 ≤ j ≤ m) is the coding coefficient applied to data block D_i when computing check block P_j. For a decomposed stripe, the old linear relation must be decoupled, i.e., its check blocks are invalidated and its data blocks are treated as new data blocks added to stretched stripes; a stretched stripe must receive these new data blocks and update its own check blocks. Letting P_j' denote the updated check block P_j, it can be computed as P_j' = P_j + ΔP_j, where 1 ≤ j ≤ m and ΔP_j, computed from the newly added data blocks, is called the check increment block.
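The incremental update of a stretched stripe's check block can be sketched as follows, simplifying the Galois-field arithmetic to GF(2) (XOR, i.e., all coding coefficients equal to 1); the point is that the updated check block is obtained from the old check block plus a delta computed only from the newly added data blocks, without re-reading the k data blocks it already covers. This is our illustrative simplification, not the patent's exact coding.

```python
# Incremental check-block update sketch over GF(2): with XOR as the code,
# the update is new_check = old_check XOR delta, where delta (the check
# increment block) is computed only from the data blocks newly added to
# the stretched stripe. Illustrative simplification: real (k, m) codes
# apply per-block Galois-field coefficients alpha_{i,j}.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

old_data = [bytes([i]) * 4 for i in (1, 2, 3, 4, 5)]  # k = 5 data blocks
new_data = [bytes([i]) * 4 for i in (6, 7, 8)]        # k' - k = 3 new blocks

old_check = xor_blocks(old_data)            # check block of the (5, 1) stripe
delta = xor_blocks(new_data)                # check increment block
new_check = xor_blocks([old_check, delta])  # incremental update

# The incremental result equals a full re-encode over all k' = 8 data
# blocks, but only the k' - k new blocks had to be read.
assert new_check == xor_blocks(old_data + new_data)
```

This equality is exactly what lets the conversion process touch only the newly assigned data blocks when refreshing parity.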
The traffic overhead of redundancy conversion consists of two parts: data block migration and check block updating. During conversion, some data blocks must be migrated to other nodes to ensure that the (k'+m) blocks of a new (k', m)-coded stripe are still stored on (k'+m) different nodes; the traffic thus generated is the data block migration traffic. Meanwhile, as each stretched stripe is converted from a (k, m) code to a (k', m) code and gains new data blocks, its m check blocks must be recomputed so that fault tolerance remains effective after conversion; this part is the check block update traffic. Existing work accelerates redundancy conversion mainly by reducing this traffic: SRS [2] designs the data layout so as to eliminate data block migration traffic; ERS [3] designs a special layout similar to SRS that not only eliminates data block migration traffic but further reduces check block update traffic; and StripeMerge [4] achieves a k-to-2k conversion by directly merging two stripes. These existing methods have the following shortcomings: (i) the choice of conversion parameters lacks flexibility (SRS and ERS need the parameters in advance to configure the layout, and StripeMerge can only convert k to a multiple of itself); (ii) although a single conversion is traffic-optimized, they still tend to produce large amounts of cross-rack traffic under continuous conversion (i.e., multiple successive conversions), since SRS and ERS must re-adjust the layout before each conversion; and (iii) load imbalance between racks during conversion is not considered, which prolongs the conversion time.
[1] Plank J S, Simmerman S, Schuman C D. Jerasure: A library in C/C++ facilitating erasure coding for storage applications - Version 1.2[J]. University of Tennessee, Tech. Rep. CS-08-627, 2008, 23.
[2] Taranov K, Alonso G, Hoefler T. Fast and strongly-consistent per-item resilience in key-value stores[C]//Proceedings of the Thirteenth EuroSys Conference. 2018: 1-14.
[3] Wu S, Shen Z, Lee P P C. Enabling I/O-efficient redundancy transitioning in erasure-coded KV stores via elastic Reed-Solomon codes[C]//2020 International Symposium on Reliable Distributed Systems (SRDS). IEEE, 2020: 246-255.
[4] Yao Q, Hu Y, Cheng L, et al. StripeMerge: Efficient wide-stripe generation for large-scale erasure-coded storage[C]//2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS). IEEE, 2021: 483-493.
Disclosure of Invention
The main purpose of the invention is to solve the problems of heavy cross-rack transfer traffic and load imbalance between racks during storage redundancy conversion in erasure-coded storage systems. It provides a cross-rack-aware balanced redundancy conversion method and device for erasure-coded storage systems, which balances the load between racks while suppressing the cross-rack transfer traffic induced by redundancy conversion, and can shorten the time needed to complete the redundancy conversion.
The invention adopts the following technical scheme:
in a first aspect, a cross-rack-aware balanced redundancy conversion method for an erasure-coded storage system includes:
a block layout formulation step: aggregating the check blocks of all stripes in a redundancy conversion group into the same check rack, and uniformly distributing the data blocks of each stripe in the redundancy conversion group across the data racks; the stripes include stretched stripes and decomposed stripes;
a check block updating step: updating the check blocks of the stretched stripes by directly reading data blocks and by decoupling the check blocks of the decomposed stripes;
a load balancing step: selecting redundancy conversion groups with pairwise different check racks to form an execution group, iteratively replacing the execution group with a heuristic algorithm, selecting the execution group with the minimum upload load-balance ratio, and executing the redundancy conversion groups within it in parallel.
Preferably, the block layout formulation step specifically includes:
step 1.1, for a redundancy conversion group comprising k stretched stripes and k'-k decomposed stripes, aggregating the check blocks of all stripes in the group into the same check rack;
step 1.2, uniformly distributing the data blocks of each stripe in the redundancy conversion group across the data racks, such that the data blocks of the whole group are also uniformly distributed across the data racks; the data racks are the racks other than the check rack;
step 1.3, after the redundancy conversion is initiated, assigning priorities to the data racks according to the number of data blocks each rack holds per stripe, building a priority queue, assigning the k'-k data blocks of the k'-k decomposed stripes to the first k'-k stretched stripes in the order of the data racks in the priority queue, and designating the other stretched stripes as remaining stretched stripes;
step 1.4, establishing a network flow graph for the remaining data blocks and the remaining stretched stripes; in the network flow graph, each stretched stripe, decomposed stripe and data rack is represented by a vertex; each decomposed-stripe vertex is connected by directed edges to the vertices of the data racks that store its remaining data blocks, with edge capacity equal to the number of data blocks the decomposed stripe can provide in the corresponding data rack; the rack vertices are connected by directed edges to the stretched-stripe vertices, with edge capacity equal to the number of data blocks the stretched stripe can receive in the corresponding rack; a source vertex is established with a directed edge to each decomposed-stripe vertex, with capacity equal to the number of remaining data blocks of that decomposed stripe; a sink vertex is established, and each stretched-stripe vertex has a directed edge to the sink, with capacity equal to the number of data blocks the stretched stripe still needs to receive;
step 1.5, running the Dinic algorithm to find the maximum flow, and assigning the remaining data blocks of the decomposed stripes to the remaining stretched stripes according to the maximum flow found.
Preferably, data racks holding fewer data blocks have higher priority.
Preferably, for the remaining stretched stripes, data blocks are allocated according to the priority queue until a fully balanced state is reached, i.e., every data rack holds an equal number of data blocks.
Preferably, the source vertex has k'-k outgoing directed edges, one to each decomposed-stripe vertex.
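A minimal sketch of the priority queue used in step 1.3, under our own simplifying assumptions: each data rack is prioritized by how many data blocks it currently holds (fewer blocks, higher priority), so each newly assigned block lands in the currently least-loaded rack and the per-rack counts stay within one of each other. Rack names and initial counts are hypothetical.

```python
import heapq

# Sketch of the step-1.3 priority queue (our simplification): racks
# holding fewer data blocks have higher priority, so a min-heap keyed on
# the current block count always yields the least-loaded rack for the
# next assignment. Rack names and initial counts are hypothetical.

def assign_balanced(rack_counts, num_blocks):
    """Assign num_blocks data blocks to racks, least-loaded rack first.

    Returns (placements, final_counts)."""
    heap = [(count, rack) for rack, count in sorted(rack_counts.items())]
    heapq.heapify(heap)
    placements = []
    for _ in range(num_blocks):
        count, rack = heapq.heappop(heap)   # least-loaded rack
        placements.append(rack)
        heapq.heappush(heap, (count + 1, rack))
    return placements, {rack: count for count, rack in heap}

# Racks start at (2, 1, 1) blocks; assigning 2 more blocks reaches the
# fully balanced state where every rack holds the same number.
placements, counts = assign_balanced({"R2": 2, "R3": 1, "R4": 1}, 2)
assert placements == ["R3", "R4"]
assert counts == {"R2": 2, "R3": 2, "R4": 2}
```

The same heap discipline keeps the layout balanced no matter how many blocks are assigned, which is the invariant the block layout formulation step maintains.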
Preferably, the check block updating step specifically includes:
step 2.1, transmitting the k'-k data blocks allocated to the remaining stretched stripes to the first check node of the corresponding decomposed stripe in the check rack;
step 2.2, computing, at the first check node of the decomposed stripe, the check increment blocks required by the check blocks of the remaining stretched stripes, m check increment blocks in total, and sending them to the check nodes of the remaining stretched stripes;
step 2.3, the first check node of the decomposed stripe forwarding the data blocks it has read to the other check nodes, which decouple the received data blocks from their local check blocks, computing m check increment blocks in total, and distributing them to the check nodes of the first k'-k stretched stripes;
step 2.4, for each stretched stripe, the check node using the received check increment block to update the check block it stores.
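A back-of-the-envelope comparison (our own arithmetic, not a figure from the patent) of why steps 2.1-2.4 save cross-rack traffic: a naive update reads all k' data blocks of a stretched stripe into the check rack, while the incremental scheme ships only the k'-k newly assigned data blocks across racks; the m check increment blocks then move between check nodes inside the check rack, which counts as intra-rack rather than cross-rack traffic.

```python
# Cross-rack block transfers per stretched stripe (illustrative
# accounting, assuming check blocks are aggregated in one check rack and
# all data blocks reside in other racks, as in the layout above).

def naive_update_traffic(k_new):
    # Recompute the check blocks from scratch: every one of the k' data
    # blocks crosses into the check rack.
    return k_new

def incremental_update_traffic(k_old, k_new):
    # Only the k' - k newly assigned data blocks cross racks; the check
    # increment blocks travel between check nodes inside the check rack.
    return k_new - k_old

k, k_new = 5, 8   # the k = 5 -> k' = 8 conversion used in the Fig. 2 example
assert naive_update_traffic(k_new) == 8
assert incremental_update_traffic(k, k_new) == 3
# The incremental scheme sends 5 fewer blocks across racks per stripe here.
```

The saving grows with k, since the naive cost scales with the full stripe width while the incremental cost scales only with the stretch amount k'-k.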
Preferably, the load balancing step specifically includes:
step 3.1, grouping all redundancy conversion groups according to the rack in which their check blocks reside; a combination of redundancy conversion groups whose check blocks reside in pairwise different racks is called an execution group (with R denoting the total number of racks, an execution group therefore contains at most R redundancy conversion groups);
step 3.2, traversing the redundancy conversion groups not yet selected, selecting those whose check racks differ from the check racks already present in the current execution group, and adding them to the execution group;
step 3.3, recording the load condition of the execution group each time a redundancy conversion group is added, adding upload load to the execution group according to the number of data blocks uploaded by each data rack of that conversion group;
step 3.4, computing the load-balance ratio of the execution group, i.e., the ratio of its maximum upload load to its average upload load;
step 3.5, selecting the execution group with the minimum load-balance ratio, all redundancy conversion groups in which then perform redundancy conversion in parallel;
step 3.6, repeating steps 3.2-3.5 until all redundancy conversion groups have been selected.
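The selection criterion of steps 3.1-3.5 can be sketched as below. For brevity this illustration enumerates candidate execution groups exhaustively with itertools.combinations rather than building them iteratively as the heuristic does, but the criterion, the minimum ratio of maximum to average upload load, is the same. The group data are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Sketch of the load-balance criterion of steps 3.1-3.5. Each redundancy
# conversion group is (check_rack, {data_rack: blocks_uploaded}); an
# execution group combines conversion groups with pairwise different
# check racks, and the candidate minimizing max/mean upload load wins.
# Exhaustive enumeration stands in for the patent's iterative heuristic;
# the numbers are hypothetical.

def upload_ratio(execution_group):
    loads = {}
    for _check_rack, uploads in execution_group:
        for rack, n in uploads.items():
            loads[rack] = loads.get(rack, 0) + n
    values = list(loads.values())
    return max(values) / mean(values)

def best_execution_group(groups, size):
    best, best_ratio = None, float("inf")
    for combo in combinations(groups, size):
        if len({check_rack for check_rack, _ in combo}) < size:
            continue   # check racks must all differ
        ratio = upload_ratio(combo)
        if ratio < best_ratio:
            best, best_ratio = combo, ratio
    return best, best_ratio

groups = [
    ("R1", {"R2": 2, "R3": 1, "R4": 1}),
    ("R2", {"R1": 1, "R3": 2, "R4": 1}),
    ("R1", {"R2": 1, "R3": 2, "R4": 1}),
]
chosen, ratio = best_execution_group(groups, size=2)
# Groups 0 and 1 together give loads R1=1, R2=2, R3=3, R4=2: max 3 over
# mean 2, the smallest ratio among valid pairs.
assert ratio == 1.5
```

A ratio of 1 would mean perfectly even upload loads; the heuristic drives each round of parallel conversions as close to that as the available groups allow.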
In a second aspect, a cross-rack-aware balanced redundancy conversion apparatus for an erasure-coded storage system comprises:
a block layout formulation module, configured to aggregate the check blocks of all stripes in a redundancy conversion group into the same check rack and to uniformly distribute the data blocks of each stripe in the redundancy conversion group across the data racks; the stripes include stretched stripes and decomposed stripes;
a check block updating module, configured to update the check blocks of the stretched stripes by directly reading data blocks and by decoupling the check blocks of the decomposed stripes;
a load balancing module, configured to select redundancy conversion groups with pairwise different check racks to form an execution group, iteratively replace the execution group with a heuristic algorithm, select the execution group with the minimum upload load-balance ratio, and execute the redundancy conversion groups within it in parallel.
In a third aspect, a computer device includes a program or instructions that, when executed, perform the above cross-rack-aware balanced redundancy conversion method for erasure-coded storage systems.
In a fourth aspect, a storage medium includes a program or instructions that, when executed, perform the above cross-rack-aware balanced redundancy conversion method for erasure-coded storage systems.
Compared with the prior art, the invention has the following beneficial effects:
(1) By formulating a balanced inter-rack data layout and an aggregated intra-rack check block layout, and maintaining this layout throughout the redundancy conversion process, the cross-rack transfer traffic caused by data block relocation and check block updating is reduced; the data block selection and incremental check block update strategy further reduces the cross-rack transfer traffic of check block updates; and the proposed heuristic algorithm carefully selects stripe groups to perform redundancy conversion in parallel, balancing the traffic load of each rack;
(2) The redundancy conversion parameters can be chosen flexibly and the method retains good performance under continuous conversion, making it a more general design that balances the load between racks and better exploits I/O parallelism;
(3) The conversion parameters are unrestricted: any given k can be converted to k', and the parameters need not be given in advance;
(4) The invention maintains good and stable performance under continuous redundancy conversion operations, which the existing SRS and ERS cannot guarantee;
(5) The invention considers inter-rack load balancing during redundancy conversion and provides a heuristic algorithm that carefully selects redundancy conversion groups to execute conversion in parallel, balancing the load of each rack.
Drawings
FIG. 1 is a flowchart of the cross-rack-aware balanced redundancy conversion method for erasure-coded storage systems provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of the block layout formulation strategy according to an embodiment of the invention, wherein (a) shows the stripe layout before redundancy conversion; (b) shows the allocation of data blocks; (c) shows the construction of the network flow graph and the search for the maximum flow; (d) shows the stripe layout after redundancy conversion;
FIG. 3 is an example diagram of the optimized check block update algorithm provided by an embodiment of the invention;
FIG. 4 is an example diagram of the heuristic selection of stretched stripes provided by an embodiment of the invention;
FIG. 5 is a structural block diagram of the cross-rack-aware balanced redundancy conversion apparatus for erasure-coded storage systems provided by an embodiment of the invention;
FIG. 6 is a schematic diagram of a prototype system according to an embodiment of the present invention;
FIG. 7 is a graph of experimental results for data block relocation and check block update traffic in the large-scale simulation experiments of an embodiment of the invention;
FIG. 8 is a graph of experimental results for continuous-conversion traffic in the large-scale simulation experiments of an embodiment of the invention;
FIG. 9 is a graph of experimental results for different numbers of racks in the large-scale simulation experiments of an embodiment of the invention, wherein (a) shows the redundancy conversion traffic with 10 racks; (b) with 20 racks; (c) with 30 racks;
FIG. 10 is a graph of experimental results for different values of m in the large-scale simulation experiments of an embodiment of the invention, wherein (a) shows the redundancy conversion traffic with 3 check increment blocks; (b) with 4 check increment blocks; (c) with 5 check increment blocks;
FIG. 11 is a graph of the load balancing results in the large-scale simulation experiments of an embodiment of the invention;
FIG. 12 is a graph of experimental results for continuous conversion time and computation time in the Amazon cloud environment experiments of an embodiment of the invention, wherein (a) shows the successive redundancy conversion times; (b) shows the total computation time;
FIG. 13 is a graph of experimental results for different block sizes in the Amazon cloud environment experiments of an embodiment of the invention, wherein (a) shows the conversion time with a block size of 16 MB; (b) with a block size of 64 MB;
FIG. 14 is a graph of experimental results for different network bandwidths in the Amazon cloud environment experiments of an embodiment of the invention, wherein (a) shows the conversion time with a network bandwidth of 1 GB/s; (b) with a network bandwidth of 3 GB/s.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
Referring to FIG. 1, the cross-rack-aware balanced redundancy conversion method for erasure-coded storage systems includes:
step 1, a block layout formulation step: aggregating the check blocks of all stripes in a redundancy conversion group into the same check rack, and uniformly distributing the data blocks of each stripe in the redundancy conversion group across the data racks; the stripes include stretched stripes and decomposed stripes;
step 2, a check block updating step: updating the check blocks of the stretched stripes by directly reading data blocks and by decoupling the check blocks of the decomposed stripes;
step 3, a load balancing step: selecting redundancy conversion groups with pairwise different check racks to form an execution group, iteratively replacing the execution group with a heuristic algorithm, selecting the execution group with the minimum upload load-balance ratio, and executing the redundancy conversion groups within it in parallel.
Specifically, step 1 is implemented as follows.
Step 1.1: assume there are k stretched stripes and k'-k decomposed stripes; the combination of the k stretched stripes and their corresponding k'-k decomposed stripes is called a redundancy conversion group. For each redundancy conversion group, the check blocks of all stripes in the group are aggregated in the same rack. The rack where the check blocks reside is called the check rack, and all other racks are called data racks.
Step 1.2: the data blocks of each stripe in the redundancy conversion group are distributed uniformly across the data racks. Uniform distribution means that the numbers of data blocks of a stripe in any two data racks differ by at most 1. Meanwhile, the data blocks of the entire redundancy conversion group are also distributed uniformly across the data racks.
Step 1.3: after the redundancy conversion is initiated, priorities are assigned to the data racks according to the number of data blocks each rack holds per stripe (racks holding fewer data blocks have higher priority), and a priority queue is built. The k'-k data blocks of the k'-k decomposed stripes are assigned to the first k'-k stretched stripes in the order of the data racks in the priority queue, and the other stretched stripes are designated as remaining stretched stripes. For the remaining stretched stripes, data blocks are allocated according to the priority queue until a fully balanced state is reached (i.e., every data rack holds an equal number of data blocks).
Step 1.4: a network flow graph is built for the remaining data blocks and the remaining stretched stripes. In the network flow graph, each stretched stripe, decomposed stripe and data rack is represented by a vertex. Each decomposed-stripe vertex (decomposed vertex for short) is connected by directed edges to the vertices of the data racks (rack vertices for short) that store its data blocks; the edge capacity is the number of data blocks of that decomposed stripe in the corresponding rack, i.e., the number of data blocks the decomposed stripe can provide from that data rack. Rack vertices are connected by directed edges to the stretched-stripe vertices (stretched vertices for short); the edge capacity is the number of data blocks the stretched stripe can receive in the corresponding rack, meaning the stretched stripe can acquire data blocks from that data rack while remaining evenly distributed across the data racks. A source vertex is established with a directed edge to each decomposed vertex (k'-k edges in total); the edge capacity is the number of remaining data blocks of that decomposed stripe. A sink vertex is established, and each stretched vertex has a directed edge to the sink; the edge capacity is the number of data blocks the stretched stripe still needs to receive. This completes the construction of the network flow graph.
Step 1.5: the Dinic algorithm, an existing maximum-flow algorithm, is run to find the maximum flow, and the remaining data blocks of the decomposed stripes are assigned to the remaining stretched stripes according to the maximum flow found.
Referring to fig. 2, a block layout strategy for RS(5,3) encoding over racks {R1, R2, R3, R4} is shown, which uses the data blocks of the decomposed stripes {S6, S7, S8} to stretch the stripes {S1, S2, S3, S4, S5}, completing the redundancy conversion from k=5 to k'=8. In fig. 2(a), the check blocks of all stripes of the redundancy conversion group are aggregated in the check rack R1, and the data blocks are distributed uniformly over the data racks {R2, R3, R4}; it can be seen that each data rack holds the same number of data blocks. In fig. 2(b), the k'-k=3 data blocks from the decomposed stripes {S6, S7, S8} are allocated to the stretched stripes {S1, S2, S3} in the rack order of the priority queue, and the remaining stretched stripes {S4, S5} are stretched toward the fully balanced state in the same priority-queue order. In fig. 2(c), a network flow graph is built according to the rules of step 1 to allocate the remaining data blocks: the decomposed vertex S7 is connected by directed edges of capacity 1 to the rack vertices {R2, R3, R4} (S8 has no remaining data blocks, so it has no edges to the rack vertices); each rack vertex is connected by directed edges of capacity 1 to the stretched vertices {S4, S5}; the source is connected to the decomposed vertex by a directed edge of capacity 2; and each stretched vertex is connected to the sink by a directed edge of capacity 2. The red edges between rack vertices and stretched vertices belong to the maximum flow of the network flow graph and indicate which data blocks in each rack are assigned to which stretched stripes. The resulting stretched-stripe layout after redundancy conversion is shown in fig. 2(d); the stretched layout still maintains inter-rack balance while incurring no additional cross-rack data migration overhead.
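The graph construction and max-flow step can be sketched as a small program. Below is a compact, self-contained Dinic implementation applied to the fig. 2(c) example; the vertex numbering (0 = source, 1 = S7, 2-4 = R2-R4, 5-6 = S4-S5, 7 = sink) is an assumption for illustration, and this is a sketch, not the patent's implementation:

```python
from collections import deque

class Dinic:
    """Compact Dinic max-flow (the algorithm named in step 1.5)."""
    def __init__(self, n):
        self.n = n
        self.g = [[] for _ in range(n)]  # edge = [to, capacity, reverse-edge index]

    def add_edge(self, u, v, cap):
        self.g[u].append([v, cap, len(self.g[v])])
        self.g[v].append([u, 0, len(self.g[u]) - 1])  # residual edge

    def _bfs(self, s, t):
        self.level = [-1] * self.n
        self.level[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v, cap, _ in self.g[u]:
                if cap > 0 and self.level[v] < 0:
                    self.level[v] = self.level[u] + 1
                    q.append(v)
        return self.level[t] >= 0

    def _dfs(self, u, t, f):
        if u == t:
            return f
        while self.it[u] < len(self.g[u]):
            e = self.g[u][self.it[u]]
            if e[1] > 0 and self.level[e[0]] == self.level[u] + 1:
                d = self._dfs(e[0], t, min(f, e[1]))
                if d > 0:
                    e[1] -= d
                    self.g[e[0]][e[2]][1] += d
                    return d
            self.it[u] += 1
        return 0

    def max_flow(self, s, t):
        flow = 0
        while self._bfs(s, t):
            self.it = [0] * self.n
            f = self._dfs(s, t, float("inf"))
            while f > 0:
                flow += f
                f = self._dfs(s, t, float("inf"))
        return flow

# Graph of fig. 2(c): 0=source, 1=S7, 2..4=R2..R4, 5=S4, 6=S5, 7=sink.
din = Dinic(8)
din.add_edge(0, 1, 2)             # source -> S7: 2 remaining data blocks
for rack in (2, 3, 4):
    din.add_edge(1, rack, 1)      # S7 -> each data rack: capacity 1
    din.add_edge(rack, 5, 1)      # rack -> S4: capacity 1
    din.add_edge(rack, 6, 1)      # rack -> S5: capacity 1
din.add_edge(5, 7, 2)             # S4 -> sink
din.add_edge(6, 7, 2)             # S5 -> sink
mf = din.max_flow(0, 7)
print(mf)  # → 2: both remaining blocks of S7 can be placed
```

The saturated rack-to-stripe edges in the resulting flow correspond to the red edges of fig. 2(c) and determine which rack supplies which stretched stripe.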
The specific implementation of the step 2 is as follows.
Step 2.1, send the k'-k data blocks allocated to the remaining stretched stripes to the first check node of the corresponding decomposed stripe in the check rack;
step 2.2, read the data blocks directly: the first check node of the decomposed stripe computes the check-increment blocks required by the check blocks of the remaining stretched stripes, m check-increment blocks in total, and sends the m check-increment blocks to the check nodes of the remaining stretched stripes;
step 2.3, decouple the check blocks of the decomposed stripe: the first check node of the decomposed stripe forwards the data blocks read in step 2.1 to the other check nodes; each check node performs a decoupling operation on the received data blocks and its local check block, computing m check-increment blocks in total, and the m check-increment blocks are distributed to the check nodes of the first k'-k stretched stripes;
and step 2.4, for each stretched stripe, the check nodes use the received check-increment blocks to update the check blocks they store.
Specifically, in step 2, the data blocks used to stretch the remaining stretched stripes are sent to the check rack, where the check blocks of the remaining stretched stripes are updated and the check blocks of the decomposed stripes are decoupled; the first k'-k stretched stripes then read the check blocks of the decomposed stripes in the check rack and update the check blocks stored on the check-rack nodes. Referring to fig. 3, an example of check block updating for a redundancy conversion group with k=4, m=2 and k'=5 is shown: first, according to the layout policy produced by the block layout design module, the data blocks corresponding to the remaining stretched stripes (i.e., D18, D19, D20) are read into the check rack and the original check blocks of the remaining stretched stripes are updated; these data blocks are then combined with the check blocks P9 and P10 of the decomposed stripe S5 to generate check-increment blocks, which are used to update the original check blocks of the stretched stripe S1.
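The check-increment update of steps 2.2-2.4 relies on the linearity of the erasure code: a check node can fold a received increment into its stored check block without re-reading the whole stripe. A minimal sketch, using plain bytewise XOR in place of the GF(2^8) coefficient arithmetic a real Reed-Solomon code would apply (block contents are hypothetical):

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def apply_parity_delta(parity: bytes, delta: bytes) -> bytes:
    """Step 2.4: a check node updates its stored check block by folding
    in a received check-increment block.  Plain XOR stands in for the
    GF(2^8) linear combination of an actual RS code."""
    return xor_blocks(parity, delta)

# Hypothetical 4-byte blocks.
old_parity = bytes([0x0F, 0xF0, 0xAA, 0x55])
delta      = bytes([0xFF, 0x0F, 0x00, 0xFF])  # check-increment block
new_parity = apply_parity_delta(old_parity, delta)
# XOR deltas are self-inverse, so applying the same delta twice undoes it.
assert apply_parity_delta(new_parity, delta) == old_parity
```

Because the update touches only the check-increment block and the local check block, the cross-rack traffic is the m increment blocks rather than the k' data blocks of the full stretched stripe.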
The specific implementation of the step 3 is as follows.
Step 3.1, group all redundancy conversion groups according to the rack in which their check blocks reside; a combination of redundancy conversion groups whose check blocks reside in pairwise different racks (one per rack, i.e., as many groups as the total number of racks in the system) is called an execution group;
step 3.2, traverse the redundancy conversion groups not yet selected and attempt to select one whose check rack differs from those already in the current execution group, adding it to the execution group;
step 3.3, record the load condition of the execution group each time a redundancy conversion group is added, increasing the upload load of the execution group by the number of data blocks uploaded by each data rack of that redundancy conversion group (only the upload load is recorded here, because an execution group formed in step 3.1 consists of redundancy conversion groups whose check blocks reside in different racks, so the download load imposed on each rack is the same);
step 3.4, calculating the load balancing ratio of the execution group, namely the ratio of the maximum uploading load of the execution group to the average uploading load of the execution group;
step 3.5, selecting an execution group with the minimum load balancing ratio, and executing redundancy conversion in parallel by all redundancy conversion groups in the execution group;
step 3.6, repeating the above steps until all the redundant switch groups have been selected.
Specifically, in step 3, execution groups are formed by selecting redundancy conversion groups with different check racks, which balances the download load of every rack; a heuristic algorithm then iterates over alternative combinations to select the execution group with the smallest upload load balancing ratio; finally, the redundancy conversion groups in the execution group execute in parallel, and these steps are repeated until all redundancy conversion groups in the system have executed. Referring to fig. 4, an example of the heuristic algorithm selecting redundancy conversion groups to execute in parallel is shown. The data blocks and check blocks in the figure are the data blocks each redundancy conversion group must send to perform redundancy conversion and the check blocks that must be updated; these blocks are distributed over three racks, so to equalize the download bandwidth of each rack, three redundancy conversion groups are selected at a time to execute in parallel. First, group G1 is selected from the redundancy conversion groups whose check rack is the first rack; racks R2 and R3 must upload 7 and 5 data blocks respectively. Next, one of the groups G3 and G4, whose check rack is the second rack, is added to the parallel execution group: with G3, racks R1 and R3 must each upload 6 blocks, giving a load balancing ratio of 11/8; with G4, the ratio is 10/8, so G4 is added. Then redundancy conversion group G6 is selected from the groups whose check rack is the third rack and added to the parallel execution group, minimizing the upload load balancing ratio of the system and completing the selection.
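The greedy choice of steps 3.2-3.5 can be sketched as follows. The per-rack upload counts below are assumptions chosen only so that the candidate ratios match the 11/8 and 10/8 values of the fig. 4 walkthrough; this is an illustrative sketch, not the patent's implementation:

```python
def balance_ratio(loads: dict) -> float:
    """Upload load-balancing ratio (step 3.4): max load over mean load."""
    vals = list(loads.values())
    return max(vals) / (sum(vals) / len(vals))

def pick_group(current: dict, candidates: dict) -> str:
    """Greedy step: add the candidate redundancy conversion group that
    keeps the execution group's upload balance ratio smallest."""
    def merged(extra):
        out = dict(current)
        for rack, n in extra.items():
            out[rack] = out.get(rack, 0) + n
        return out
    return min(candidates, key=lambda name: balance_ratio(merged(candidates[name])))

# Hypothetical per-rack upload counts after selecting G1 (R2: 7, R3: 5).
current = {"R1": 0, "R2": 7, "R3": 5}
candidates = {"G3": {"R1": 6, "R3": 6},   # would give ratio 11/8
              "G4": {"R1": 7, "R3": 5}}   # would give ratio 10/8
print(pick_group(current, candidates))  # → G4
```

Repeating this choice for each check rack yields the execution group with the smallest upload load balancing ratio, which is then executed in parallel.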
Referring to fig. 5, the embodiment further discloses a device for equalizing redundancy conversion of a cross-rack perceived erasure code storage system, which comprises:
the block layout making module 501 is configured to aggregate the check blocks of all stripes in the redundancy conversion group in the same check rack and distribute the data blocks of each stripe in the redundancy conversion group uniformly over the data racks; the stripes include stretched stripes and decomposed stripes;
the check block updating module 502 is configured to update the check blocks of the stretched stripes by directly reading the data blocks and decoupling the check blocks of the decomposed stripes;
the load balancing module 503 is configured to select the redundant conversion groups with different check racks to form an execution group, iteratively replace the execution group by using a heuristic algorithm, select the execution group with the smallest uploading load balancing ratio, and execute the redundant conversion groups in the execution group in parallel.
For the specific implementation of each module of the cross-frame perceived erasure code storage system equalization redundancy conversion device, refer to the cross-frame perceived erasure code storage system equalization redundancy conversion method; repeated description is omitted in this embodiment.
Further, referring to fig. 6, this embodiment discloses a system architecture prototype. The prototype comprises a centralized controller (implementing the above cross-frame perceived erasure code storage system equalization redundancy conversion method/device) and multiple agents. The controller runs in a metadata server and is responsible for generating conversion decisions according to the data distribution and the conversion parameters (i.e., k, m, k'), guiding redundancy conversion by accessing metadata information (e.g., the positions of the data blocks and check blocks of each stripe). A conversion decision can be represented by a custom data structure that specifies the unique ID of the block to be transmitted and its destination node; the agent of the corresponding node executes the conversion decision, and the agent component on each storage node listens for conversion decisions and performs redundancy conversion. When a conversion request is reported to the metadata server, the coordinator first generates the conversion decisions and then distributes them to the agents of the nodes participating in the conversion. After receiving the conversion decisions, each agent parses them to learn its tasks, including which blocks should be relocated and which blocks need to be sent for check block updates. An agent informs the coordinator of the completion of its task by returning an ACK command (these stages correspond to the numbered steps in fig. 6). Once all ACKs from the participating nodes are collected, the coordinator knows the conversion operation is complete.
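A hypothetical sketch of the controller/agent exchange described above; the class names and data-structure fields are illustrative assumptions, not the prototype's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class ConversionDecision:
    """One conversion decision: the unique ID of the block to transmit
    and its destination node (the fields the patent's controller encodes)."""
    block_id: str
    dest_node: str

@dataclass
class Coordinator:
    """Tracks which participating nodes still owe an ACK."""
    pending: set = field(default_factory=set)

    def dispatch(self, decisions):
        # Distribute decisions to the agents of the participating nodes.
        for d in decisions:
            self.pending.add(d.dest_node)

    def on_ack(self, node) -> bool:
        # An agent reports completion; conversion is done once all have ACKed.
        self.pending.discard(node)
        return not self.pending

coord = Coordinator()
coord.dispatch([ConversionDecision("D18", "node-3"),
                ConversionDecision("P9", "node-7")])
assert coord.on_ack("node-3") is False  # still waiting on node-7
assert coord.on_ack("node-7") is True   # all ACKs collected: conversion done
```

This mirrors the fig. 6 flow: decisions go out, each agent performs its relocations and check-update sends, and the coordinator declares the operation complete only after collecting every ACK.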
The redundancy conversion method/device of the present invention was subjected to the following performance tests.
The performance of the present invention was evaluated through extensive simulations and cloud-environment experiments. The invention is compared with two other state-of-the-art redundancy conversion methods: (i) SRS, which establishes a stripe layout based on pre-fixed coding parameters, eliminating the data migration of the first conversion operation; and (ii) ERS, which also establishes a stripe layout with pre-fixed coding parameters to eliminate data migration for the first conversion operation, but further reduces the check-update traffic by expanding the coding matrix in advance. In the experimental result figures, (k, m, k') denotes a redundancy conversion from erasure code (k, m) to erasure code (k', m).
A. Large-scale simulation experiment
Simulations were first performed to reveal the performance of the present invention when deployed in a mass storage system. Experiment setting: actual network transmission and storage operations are omitted, and the redundancy conversion traffic under common erasure code configurations is measured. Specifically, the experiments first deploy erasure code (k, m) and continually increase the value of k while keeping m unchanged. Unless otherwise indicated, the following default configuration is used: the number of stripes is set to 100000, distributed over 100 nodes, and the block size is set to 64MB (the size used in Hadoop HDFS).
A.1 data relocation and check block update traffic experiment:
this experiment demonstrates that most of the traffic in performing successive redundancy conversions comes from check block updates, while the traffic caused by data block relocation tends toward 0. The number of system racks is set to 10 and the value of k is increased from 6 to 26 to perform 8 conversion operations, measuring the traffic of data block relocation and check block updates separately. Referring to fig. 7 (the two bar groups are the results of data block relocation and check block update, in order), in most cases the traffic caused by data block relocation is almost negligible compared to that caused by check block updates.
A.2 continuous conversion traffic experiment:
the value of k is increased from 6 to 96 to perform 14 conversion operations, and the conversion traffic of the three methods is measured. Referring to fig. 8 (the three bar groups are the results of the present invention, ERS and SRS, in order), the method of the present invention greatly reduces redundancy conversion traffic, while SRS and ERS amplify conversion traffic over successive conversion operations. Overall, the present invention reduces redundancy conversion traffic by 93.8% and 96.5% on average compared to ERS and SRS, respectively.
A.3 experiments with different numbers of racks:
the number of racks in the system is increased from 10 to 30 and the conversion traffic of the three methods is measured. Referring to fig. 9 (the three bar groups are the results of the present invention, ERS and SRS, in order), the conversion traffic of the method of the present invention hardly changes with the number of racks. The root cause is that the present invention maintains a relatively balanced block layout across successive conversions, while ERS and SRS require additional overhead to adjust the layout. The present invention reduces redundancy conversion traffic by 88.9% and 95.4% on average compared to ERS and SRS, respectively.
A.4 different m-value experiments:
the conversion traffic is measured separately for different values of m. Referring to the experimental results shown in fig. 10 (the three bar groups are the results of the present invention, ERS and SRS, in order), the conversion traffic of the method of the present invention hardly changes, because the present invention aggregates the check blocks of the redundancy conversion group in the same rack, so a change in the number of check blocks does not change the cross-rack transmission traffic during updates. As m increases, ERS and SRS need to transmit more data blocks for layout adjustment and read more data blocks for check updates. The present invention reduces redundancy conversion traffic by 88.7% and 95.7% on average compared to ERS and SRS, respectively.
A.5 load balancing experiment:
the load balancing ratios of the three methods are evaluated experimentally. Referring to fig. 11 (the three bar groups are the results of the present invention, ERS and SRS, in order), the present invention balances inter-rack load well: it employs a download-bandwidth-balanced layout to ensure download bandwidth balance across racks, and by carefully selecting conversion groups it further balances the upload bandwidth of each rack. Across all successive conversion operations, the average load balancing ratio of ERS and SRS is 2.2, whereas the method of the present invention reduces the load balancing ratio to 1.1, closer to the optimum (which is 1).
B. Alibaba Cloud environment experiments
The prototype of the invention was deployed in an Alibaba Cloud environment and evaluated to reveal its performance in a real-world cloud data center. The experiment used 19 virtual machine instances (type ecs.g7.large), each equipped with 2 vCPUs (2.7 GHz third-generation Intel Xeon Scalable processors) and 8GB of memory. The operating system is Ubuntu 18.04, and the network bandwidth between any two instances was measured with iperf to be about 10Gb/s.
Of these 19 instances, one is deployed with the controller of the prototype to direct redundancy conversion, and the remaining 18 instances run the agents of the prototype, one agent per instance, with three agents constituting one rack (i.e., six racks in total). Initially, 300 stripes encoded with erasure code (6, 3) were deployed across the 18 instances, and the value of k was continually increased to 15. The data block size is set to 64MB by default. The Linux tool tc is used to control the network bandwidth between racks. Each experiment was repeated 5 times, and the average redundancy conversion time per stretched stripe was calculated. The experimental results also plot error bars to show the maximum and minimum values over all runs (some of which may be too small to be visible).
B.1 continuous conversion time experiment:
referring to fig. 12(a) (the three bar groups are the results of the present invention, ERS and SRS, in order), the continuous redundancy conversion time results show that the method of the present invention always achieves the minimum conversion time, on average 72.3% and 89.7% less than ERS and SRS, respectively.
B.2 calculation time experiment:
the experiments measure the total computation time required by the method of the present invention to generate a redundancy conversion scheme under different numbers of racks and stripes. The number of racks is increased from 10 to 40, the number of stripes from 1000 to 100000, and the (6,3,8) conversion is considered. The results in fig. 12(b) show that the running time of generating the conversion scheme always remains small compared to the transmission time and is negligible.
B.3 different block size experiments:
finally, the conversion time is measured while changing the block size from 16MB to 64MB. The experimental results shown in fig. 13 (the three bar groups are the results of the present invention, ERS and SRS, in order) demonstrate that the conversion time of all methods increases with block size, since a larger block size amplifies both conversion traffic and storage I/O. Overall, the average conversion time of the present invention is reduced by 65.6% and 89.1% compared to ERS and SRS, respectively.
B.4 experiments with different cross-rack transmission bandwidths:
the experiments evaluate the methods under different cross-rack network transmission bandwidths, varying the bandwidth from 1Gb/s to 3Gb/s using tc. Referring to the experimental results shown in fig. 14 (the three bar groups are the results of the present invention, ERS and SRS, in order), the conversion time of the method of the present invention is minimal at every bandwidth; moreover, the present invention performs relatively better when cross-rack bandwidth is scarce, because it reduces cross-rack conversion traffic and improves load balancing. Overall, the present invention reduces conversion time by 56.7% and 87.0% on average compared to ERS and SRS, respectively.
The invention first analyzes the data layout scheme in redundancy conversion and proposes a stripe layout that accelerates conversion, which helps reduce cross-rack traffic in successive redundancy conversions; it then reuses the old check blocks when recomputing the new check blocks and exploits load balancing to speed up conversion. Numerical simulations and the Alibaba Cloud environment experiments show that the conversion performance of the invention is good.
The embodiment of the invention also provides a computer device, comprising a program or instructions which, when executed, perform the cross-frame perceived erasure code storage system equalization redundancy conversion method provided by the embodiments of the invention and any optional method thereof.
The embodiment of the invention provides a storage medium, comprising a program or instructions which, when executed, perform the cross-frame perceived erasure code storage system equalization redundancy conversion method and any optional method thereof.
Finally, it should be noted that: it will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A cross-frame perceived erasure code storage system equalization redundancy conversion method is characterized by comprising the following steps:
a block layout making step, namely aggregating the check blocks of all the strips in the redundancy conversion group into the same check rack, and uniformly distributing the data blocks of each strip in the redundancy conversion group into each data rack; the strips include stretching strips and decomposing strips;
a check block updating step, namely updating the check blocks of the stretching strips by directly reading the data blocks and decoupling the check blocks of the decomposing strips;
and a load balancing step, namely selecting redundancy conversion groups with different check racks to form an execution group, iteratively replacing the execution group by using a heuristic algorithm, selecting the execution group with the minimum upload load balancing ratio, and executing the redundancy conversion groups in the execution group in parallel.
2. The across-chassis aware erasure code storage system equalization redundancy conversion method of claim 1, wherein the block layout formulation step specifically comprises:
step 1.1, for a redundant conversion group comprising k stretching strips and k' -k decomposing strips, aggregating check blocks of all strips in the group into the same checking rack;
step 1.2, uniformly distributing the data blocks of each stripe in the redundant conversion group in each data rack; the data blocks are also uniformly distributed in each data frame; the data rack is a rack except for the verification rack;
step 1.3, after the redundancy operation is initiated, assigning priorities to the data racks according to the number of data blocks of each stripe in the data racks, establishing a priority queue, sequentially assigning the k'-k data blocks of the k'-k decomposing strips to the first k'-k stretching strips according to the order of the data racks in the priority queue, and designating the rest of the stretching strips as remaining stretching strips;
step 1.4, establishing a network flow graph for the remaining data blocks and the remaining stretching strips; in the network flow graph, each stretching strip, each decomposing strip and each data rack is represented by a vertex; the vertex of each decomposing strip is connected with directed edges pointing to the vertices of the data racks in which its data blocks are stored, the edge capacity being determined by the number of data blocks of the decomposing strip in the corresponding rack, i.e., the number of data blocks the decomposing strip can provide from that data rack; the rack vertices are connected with directed edges pointing to the vertices of the stretching strips, the edge capacity being determined by the number of data blocks the stretching strip can receive in the corresponding rack; a source point is established and connected with directed edges pointing to the vertex of each decomposing strip, the edge capacity depending on the number of remaining data blocks of the decomposing strip; a sink is established, and the vertex of each stretching strip is connected with a directed edge pointing to the sink, the edge capacity depending on the number of data blocks the stretching strip needs to receive;
and 1.5, running a Dinic algorithm to find the maximum flow, and distributing the rest data blocks in the decomposed stripes and the rest stretched stripes according to the found maximum flow.
3. The across chassis aware erasure code storage system equalization redundancy conversion method of claim 2, wherein a data rack containing a smaller number of data blocks has a higher priority.
4. The across-rack aware erasure code storage system equalization redundancy conversion method of claim 2, wherein for the remaining stretching strips, the allocation of data blocks is performed according to the priority queue until a fully equalized state is reached, i.e., the number of data blocks in each data rack is equal.
5. The across chassis aware erasure code storage system equalization redundancy conversion method of claim 2, wherein the number of the directed edges from the source point is k'-k.
6. The across-chassis aware erasure code storage system equalization redundancy conversion method of claim 2, wherein the check block updating step specifically comprises:
step 2.1, transmitting the k'-k data blocks allocated to the remaining stretching strips to the first check node of the corresponding decomposing strip in the check rack;
step 2.2, calculating, by the first check node of the decomposing strip, the check increment blocks required by the check blocks of the remaining stretching strips, m check increment blocks in total, and sending the m check increment blocks to the check nodes of the remaining stretching strips;
step 2.3, the first check node of the decomposing strip forwards the read data blocks to the other check nodes, the check nodes perform a decoupling operation on the received data blocks and the local check blocks, computing m check increment blocks in total, and the m check increment blocks are distributed to the check nodes of the first k'-k stretching strips;
And 2.4, for each stretching strip, the check node uses the received check increment block to update the check block stored in the check node.
7. The across chassis aware erasure code storage system equalization redundancy conversion method according to claim 6, wherein the load balancing step specifically comprises:
step 3.1, grouping all redundancy conversion groups according to the rack in which the check block is located; a combination of redundancy conversion groups whose check blocks are located in different racks, one per rack, is called an execution group, the number of groups in the combination being equal to the total number of racks;
step 3.2, traversing the redundant conversion groups which are not selected, selecting the redundant conversion groups which are provided with different checking racks from the current execution group, and adding the redundant conversion groups into the execution group;
step 3.3, recording the load condition of the execution group when the redundancy conversion group is newly added each time, and newly adding an uploading load for the execution group according to the number of data blocks uploaded by each data rack of the redundancy conversion group;
step 3.4, calculating the load balancing ratio of the execution group, namely the ratio of the maximum uploading load of the execution group to the average uploading load of the execution group;
step 3.5, selecting an execution group with the minimum load balancing ratio, and executing redundancy conversion in parallel by all redundancy conversion groups in the execution group;
Step 3.6, repeating the steps 3.2-3.5 until all the redundant switch groups have been selected.
8. An erasure code storage system equalization redundancy conversion device perceived across racks, comprising:
the block layout making module is used for aggregating the check blocks of all the strips in the redundancy conversion group into the same check rack and uniformly distributing the data blocks of each strip in the redundancy conversion group into each data rack; the strips include stretching strips and decomposing strips;
the check block updating module is used for updating the check blocks of the stretching strips by directly reading the data blocks and decoupling the check blocks of the decomposing strips;
and the load balancing module is used for selecting the redundant conversion groups with different checking racks to form an execution group, iteratively replacing the execution group by using a heuristic algorithm, selecting the execution group with the minimum uploading load balancing ratio, and executing the redundant conversion groups in the execution group in parallel.
9. A computer device comprising a program or instructions which, when executed, performs the method of any of claims 1 to 8.
10. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 8.
CN202310648438.XA 2023-06-02 2023-06-02 Cross-frame perceived erasure code storage system equalization redundancy conversion method and device Pending CN116909475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310648438.XA CN116909475A (en) 2023-06-02 2023-06-02 Cross-frame perceived erasure code storage system equalization redundancy conversion method and device

Publications (1)

Publication Number Publication Date
CN116909475A true CN116909475A (en) 2023-10-20

Family

ID=88361750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310648438.XA Pending CN116909475A (en) 2023-06-02 2023-06-02 Cross-frame perceived erasure code storage system equalization redundancy conversion method and device

Country Status (1)

Country Link
CN (1) CN116909475A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination