CN114237985A - Method for repairing failed memory block in erasure code memory system and related device - Google Patents


Publication number
CN114237985A
CN114237985A (application CN202111584947.8A; granted as CN114237985B)
Authority
CN
China
Prior art keywords
directed
repair
bandwidth
shortest
tree
Prior art date
Legal status
Granted
Application number
CN202111584947.8A
Other languages
Chinese (zh)
Other versions
CN114237985B (en)
Inventor
郑龙
康红宴
姚煊道
夏俊旭
程葛瑶
叶昭晖
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111584947.8A
Publication of CN114237985A
Application granted
Publication of CN114237985B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14: Error detection or correction of the data by redundancy in operation
    • G06F11/1479: Generic software techniques for error detection or fault masking
    • G06F11/1489: Generic software techniques for error detection or fault masking through recovery blocks
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/901: Indexing; Data structures therefor; Storage structures
    • G06F16/9024: Graphs; Linked lists
    • G06F16/9027: Trees


Abstract

The application provides a method and related device for repairing a failed storage block in an erasure code storage system. An initial directed Steiner tree is computed from a directed graph that models the erasure code storage system; further directed Steiner trees satisfying the requirements are then computed iteratively and, together with the initial tree, form a directed Steiner tree group, whose trees transmit data in parallel to repair the failed storage block. Splitting the repair operation into multiple parallel tree-shaped repair pipelines improves repair throughput, and the switches on the directed paths of each directed Steiner tree aggregate intermediate data in transit, which reduces bandwidth occupation and significantly improves the repair performance for the failed storage block.

Description

Method for repairing failed memory block in erasure code memory system and related device
Technical Field
The present application relates to erasure code storage systems, and more particularly, to a method and related apparatus for repairing failed storage blocks in an erasure code storage system.
Background
Erasure codes are a low-cost fault-tolerance mechanism widely adopted by real-world Distributed Storage Systems (DSS). Compared with replica-based fault tolerance, deploying erasure codes can save a DSS PB-level space overhead. However, when a storage block becomes unavailable due to a node failure, repairing it requires fetching many related storage blocks from other storage nodes, producing heavy incast traffic. This incast transmission causes both high network delay and a heavy traffic load on the network, and equipment and component failures are the norm in large-scale DSS.
To solve this problem, many efforts have been made in both industry and academia. Existing methods can be broadly divided into two categories: transmission planning and in-network aggregation. The former focuses on planning how to route the storage blocks involved in incast transmission so as to avoid potential network bottlenecks. One representative effort is the Partial Parallel Repair mechanism (PPR), which divides the repair operation into sequential sub-operations, dispatches them to servers already participating in the repair process, and then aggregates the partial results step by step via a distributed protocol to reconstruct the failed block. PPR can thus effectively avoid congestion caused by incast transmission and reduce repair time. However, PPR does not reduce the bandwidth consumption of the repair process; moreover, PPR operates at the data block level, so its repair time is still generally significantly higher than the normal block read time.
In contrast, the in-network aggregation approach aggregates small coded units from multiple blocks, on a server or a switch, into one unit and then sends the aggregated unit to the next hop. This aggregation is possible because the related units have the same size and are code-dependent, so XORing them generates a unit of the same size, and the traffic after aggregation is naturally reduced. Because each occupied link transmits only a standard-size block, the bottleneck problem of incast transmission is avoided. Following this design concept, the Repair Pipeline (RP) was proposed: it serializes the repair operation of failed blocks across storage nodes in a pipelined manner, greatly improving repair speed, with a repair time roughly consistent with the normal read time of the same data volume in a homogeneous environment. However, the intermediate servers on the RP must receive and send intermediate data through their neighboring switches, which still incurs significant transmission cost.
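As a hedged illustration of the XOR aggregation described above (the function name and byte-level representation are assumptions for this sketch, not part of the application), a switch or server could combine same-size coded units like this:

```python
def aggregate_units(units):
    """XOR several same-size coded units into one unit of the same size,
    as an aggregating server or switch would before forwarding it."""
    assert len({len(u) for u in units}) == 1, "units must have the same size"
    out = bytes(len(units[0]))  # all-zero unit of the common size
    for u in units:
        out = bytes(a ^ b for a, b in zip(out, u))
    return out
```

Because the output has the same size as each input, every occupied link still carries only one standard-size unit, which is the property that avoids the incast bottleneck.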
With the advent of Software Defined Networking (SDN) and programmable switches, a DSS can aggregate these blocks directly at switches rather than only on servers. Accordingly, the Repair Tree (RT) has been proposed as an in-switch aggregation solution for failed block repair. RT models the network as a weighted undirected graph and computes a minimum-cost Steiner tree connecting the associated storage nodes. When related block units pass through the same switch, the switch aggregates the related data and then forwards the result, reducing repair traffic. However, the repair tree suffers from asynchronous message arrival: the first arriving packet must wait for the other related packets before it can be aggregated, which makes it difficult to guarantee an effective repair speed. Furthermore, RT ignores an intrinsic characteristic of the network, namely the parallel paths between any pair of nodes, so its repair performance is severely limited by the bandwidth threshold of a single path.
In order to solve the above problems, a method capable of effectively repairing a failed memory block in an erasure code storage system is needed to achieve the purposes of reducing transmission cost and avoiding the limitation of a bandwidth threshold of a single path on repair performance.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method and related device for repairing a failed memory block in an erasure code memory system.
Based on the above object, the present application provides a method for repairing failed memory blocks in an erasure code storage system, which is characterized by comprising:
determining a repair node and a plurality of candidate supply nodes which are positioned on the same strip as the failed storage block and can be used for repairing the failed storage block according to a directed graph of a network formed by the erasure code storage system;
respectively determining a plurality of shortest directed paths from the candidate supply nodes to the repair node according to the directed graph, and calculating respective bandwidths of the shortest directed paths;
arranging the shortest directed paths in descending order of bandwidth, and acquiring the first k shortest directed paths among the arranged shortest directed paths, wherein k is equal to the number of data blocks on the stripe;
taking k candidate supply nodes respectively corresponding to the first k shortest directed paths in the plurality of candidate supply nodes as k first supply nodes, and taking the bandwidth of the k-th shortest directed path in the plurality of arranged shortest directed paths as a bandwidth threshold;
generating an initial directed Steiner tree connecting the k first supply nodes and the repair node in the directed graph according to the first k shortest directed paths, wherein the width of the initial directed Steiner tree is equal to the bandwidth threshold;
iteratively executing the following operations until a preset iteration end condition is met:
in response to determining that the previously generated alpha directed Steiner trees occupy the same link, updating the available bandwidth of the link to 1/alpha of the available bandwidth, wherein alpha is a positive integer;
generating a new directed Steiner tree according to the updated available bandwidth;
and repairing the failed memory block in parallel by using the k valid memory blocks respectively stored on the k first supply nodes and by using the generated plurality of directed Steiner trees as a plurality of tree repair pipelines in parallel.
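The selection in the steps above (ranking the candidates' shortest-directed-path bandwidths, taking the first k, and fixing the bandwidth threshold as the k-th bandwidth) can be sketched as follows; the dictionary layout and function name are illustrative assumptions, not the application's implementation:

```python
def select_first_suppliers(path_bandwidths, k):
    """Given each candidate supply node's shortest-directed-path bandwidth
    to the repair node, pick the k first supply nodes and the bandwidth
    threshold (the bandwidth of the k-th ranked path)."""
    # Sort candidates by path bandwidth in descending order.
    ranked = sorted(path_bandwidths.items(), key=lambda kv: kv[1], reverse=True)
    first_k = ranked[:k]
    suppliers = [node for node, _ in first_k]
    threshold = first_k[-1][1]  # bandwidth of the k-th shortest directed path
    return suppliers, threshold
```

For example, with candidates H1, H2, H3, H6 and k = 2, the two highest-bandwidth paths determine both the first supply nodes and the threshold that the initial directed Steiner tree's width must equal.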
In some embodiments, generating the initial directed steiner tree from the first k shortest directed paths comprises:
for each of the k first donor nodes,
generating a candidate directed path from the first supply node to the repair node using a path generation function;
calculating the bandwidth and transmission cost of the candidate directed path;
and in response to determining that the bandwidth of the candidate directed path is greater than the bandwidth of a target shortest directed path corresponding to the first supply node in the first k shortest directed paths, or in response to determining that the bandwidth of the candidate directed path is equal to the bandwidth of the target shortest directed path and the transmission cost of the candidate directed path is less than the transmission cost of the target shortest directed path, replacing the target shortest directed path with the candidate directed path.
In some embodiments, the iteration end condition includes that a width of the newly generated directed steiner tree is less than a ratio of the bandwidth threshold to a preset parameter value.
In some embodiments, calculating the respective bandwidths of the plurality of shortest directed paths comprises:
and for each shortest directed path in the plurality of shortest directed paths, calculating the minimum value in the bandwidths of all links forming the shortest directed path, and taking the minimum value as the bandwidth of the shortest directed path.
In some embodiments, repairing the failed memory blocks in parallel comprises:
for each of the k valid storage blocks, partitioning the valid storage block into a plurality of storage units to be allocated to the plurality of tree repair pipelines for transmission, respectively, wherein related storage units are aggregated by programmable switches in the plurality of tree repair pipelines;
for each tree repair pipeline in the plurality of tree repair pipelines, repairing a part of data on the failed storage block by using the storage unit transmitted by the tree repair pipeline on the repair node.
In some embodiments, partitioning the valid storage block into a plurality of storage units to be allocated to the plurality of tree repair pipelines for transmission, respectively, comprises:
calculating a plurality of transmission bandwidths respectively allocated to the plurality of tree repair pipelines and a sum of the plurality of transmission bandwidths;
and for each tree-shaped repair pipeline in the plurality of tree-shaped repair pipelines, according to the proportion of the transmission bandwidth of the tree-shaped repair pipeline in the sum, dividing a part of the effective storage block as the storage unit to be distributed to the tree-shaped repair pipeline for transmission.
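The proportional division just described can be sketched as follows; this is a minimal illustration assuming integer byte sizes, and the function name and rounding policy are assumptions of the sketch:

```python
def split_block(block_size, pipeline_bandwidths):
    """Split one valid storage block into storage units sized in proportion
    to each tree repair pipeline's transmission bandwidth."""
    total = sum(pipeline_bandwidths)
    sizes = [block_size * bw // total for bw in pipeline_bandwidths]
    # Assign any rounding remainder to the last pipeline so sizes sum exactly.
    sizes[-1] += block_size - sum(sizes)
    return sizes
```

A pipeline with twice another's transmission bandwidth thus carries twice the data, so all pipelines finish at roughly the same time.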
Based on the same object, the present application further provides an apparatus for repairing failed memory blocks in an erasure code storage system, comprising:
a node determination module configured to: determining a repair node and a plurality of candidate supply nodes which are positioned on the same strip as the failed storage block and can be used for repairing the failed storage block according to a directed graph of a network formed by the erasure code storage system;
a path computation module configured to: respectively determining a plurality of shortest directed paths from the candidate supply nodes to the repair node according to the directed graph, and calculating respective bandwidths of the shortest directed paths;
a path selection module configured to: arrange the shortest directed paths in descending order of bandwidth, and acquire the first k shortest directed paths among the arranged shortest directed paths, where k is equal to the number of data blocks on the stripe;
a node selection module configured to: taking k candidate supply nodes respectively corresponding to the first k shortest directed paths in the plurality of candidate supply nodes as k first supply nodes, and taking the bandwidth of the k-th shortest directed path in the plurality of arranged shortest directed paths as a bandwidth threshold;
an initial directed steiner tree generation module configured to: generating an initial directed Steiner tree connecting the k first supply nodes and the repair node in the directed graph according to the first k shortest directed paths, wherein the width of the initial directed Steiner tree is equal to the bandwidth threshold;
an iteration generation module configured to: iteratively executing the following operations until a preset iteration end condition is met: in response to determining that the previously generated alpha directed Steiner trees occupy the same link, updating the available bandwidth of the link to 1/alpha of the available bandwidth, wherein alpha is a positive integer; generating a new directed Steiner tree according to the updated available bandwidth;
a parallel repair module configured to: and repairing the failed memory block in parallel by using the k valid memory blocks respectively stored on the k first supply nodes and by using the generated plurality of directed Steiner trees as a plurality of tree repair pipelines in parallel.
In some embodiments, the initial directed steiner tree generation module is configured to:
for each of the k first donor nodes,
generating a candidate directed path from the first supply node to the repair node using a path generation function;
calculating the bandwidth and transmission cost of the candidate directed path;
and in response to determining that the bandwidth of the candidate directed path is greater than the bandwidth of a target shortest directed path corresponding to the first supply node in the first k shortest directed paths, or in response to determining that the bandwidth of the candidate directed path is equal to the bandwidth of the target shortest directed path and the transmission cost of the candidate directed path is less than the transmission cost of the target shortest directed path, replacing the target shortest directed path with the candidate directed path.
Based on the same object, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable by the processor, where the processor implements a method for repairing a failed memory block in an erasure code storage system when executing the computer program.
The present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method of repairing failed memory blocks in an erasure code storage system.
As can be seen from the above, in the method for repairing a failed storage block in an erasure code storage system and the related device, the first supply nodes and the bandwidth threshold required for repair are computed from the directed graph obtained by modeling the erasure code storage system, together with the candidate supply nodes and the repair node. An initial directed Steiner tree whose width equals the bandwidth threshold is computed from the first supply nodes and the bandwidth threshold; further directed Steiner trees whose widths are no less than a first ratio are computed iteratively and, together with the initial tree, form a directed Steiner tree group. Each directed Steiner tree in the group transmits data in parallel to repair the failed storage block, so the repair operation is split into multiple parallel tree-shaped repair pipelines, improving repair throughput and transmission performance. Meanwhile, the switches on the paths of each directed Steiner tree aggregate intermediate data during transmission, reducing bandwidth occupation and improving the repair performance of the erasure code storage system for failed storage blocks. In performance comparisons with existing repair methods, the proposed method significantly improves repair performance; in networks with many redundant paths in particular, throughput improves by more than 3x over existing repair methods.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the related art, the drawings needed to be used in the description of the embodiments or the related art will be briefly introduced below, and it is obvious that the drawings in the following description are only embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of an erasure code storage system with failed storage blocks;
FIGS. 2a to 2f are schematic diagrams illustrating repair of a failed memory block by using a plurality of related methods, respectively;
FIG. 3 is a diagram illustrating a repair of a failed memory block using a plurality of minimum and maximum width directed Steiner trees according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for repairing failed memory blocks in an erasure code storage system according to an embodiment of the present application;
fig. 5 is a flowchart of generating an initial directed steiner tree connecting k first supply nodes and repair nodes in a directed graph according to the embodiment of the present application;
fig. 6 is a flowchart for generating a new directed steiner tree according to an updated available bandwidth according to an embodiment of the present application;
FIG. 7 is a flowchart for repairing the failed memory block in parallel by using the generated directed Steiner trees as a plurality of parallel tree repair pipelines according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating repairing a plurality of failed memory blocks according to an embodiment of the present disclosure;
FIGS. 9a and 9b are comparative graphs of repair performance of different repair methods for a single failed memory block under different geographically distributed topologies;
FIGS. 10a and 10b are comparative graphs of repair performance of different repair methods for a single failed memory block under different data center topologies;
FIGS. 11a and 11b are comparative graphs of repair performance of different repair methods under different erasure code codings for a single failed memory block in a geographically distributed topology;
FIGS. 12a and 12b are comparative graphs of repair performance of different repair methods under different erasure code codings for a single failed storage block in a data center topology;
FIGS. 13a and 13b are comparative graphs of repair performance of different repair methods on a single failed memory block in a geographically distributed topology at different link utilization rates;
FIGS. 14a and 14b are comparative diagrams of repair performance on a plurality of failed memory blocks by the method provided by the embodiment of the application and the RT method;
FIG. 15 is a block diagram of an apparatus for repairing failed memory blocks in an erasure code storage system according to an embodiment of the present application;
fig. 16 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items.
Practical erasure codes are linear, including RS codes, LRC codes, MSR codes, MBR codes, and the like. Specifically, the k storage blocks on the same stripe may be denoted B1, B2, ..., Bk. Any block B* on the same stripe can then be computed as a linear combination of these k storage blocks. In practical deployments, encoding and decoding of storage blocks are performed in smaller units to achieve better performance. Thus, when repairing a failed storage block, the erasure code storage system can perform a decode operation as soon as it receives a small portion of each storage block, rather than waiting for entire data blocks to be received before processing.
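The unit-by-unit decoding described above can be sketched under a simple XOR parity code, which is a special case of the linear codes mentioned; the function name and representation are assumptions of this sketch, and a real deployment would use RS/LRC coefficients over a finite field rather than plain XOR:

```python
def recover_failed_block(surviving_blocks, unit_size):
    """Recover a failed block unit by unit under an XOR parity code:
    decoding of each small unit can begin as soon as that unit of every
    surviving block has arrived, without waiting for whole blocks."""
    recovered = []
    n_units = len(surviving_blocks[0]) // unit_size
    for i in range(n_units):
        acc = bytes(unit_size)  # all-zero accumulator for this unit
        for blk in surviving_blocks:
            unit = blk[i * unit_size:(i + 1) * unit_size]
            acc = bytes(a ^ b for a, b in zip(acc, unit))
        recovered.append(acc)
    return b"".join(recovered)
```

Because each iteration needs only the i-th unit of every surviving block, the decode naturally overlaps with transmission, which is the property the tree repair pipelines exploit.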
Referring to fig. 1, the erasure code storage system shown in fig. 1 has failed storage blocks on node H4, nodes H1, H2 and H3 are supply nodes for repairing the failed storage blocks, node H5 is a repair node, and S1 to S7 are switches in the erasure code storage system.
Referring to FIGS. 2a to 2f, which illustrate different repair methods for a failed storage block: FIG. 2a is a schematic diagram of repairing node H4, which contains a failed storage block, using the conventional repair method. Since all the blocks involved in this method must be transmitted through link {S7, H5} of node H5, serious congestion may result and the transmission cost is high.
FIGS. 2b and 2c are schematic diagrams of repairing node H4, which contains a failed storage block, using PPR. FIG. 2b shows the first round of the PPR repair process: nodes H1 and H3 multiply their storage blocks by the corresponding decoding coefficients and send them to nodes H2 and H5, respectively, along the corresponding paths in FIG. 2b. FIG. 2c shows the second round: node H2 multiplies its storage block by the corresponding decoding coefficient, aggregates it with the storage block received from node H1, and sends the aggregation result to node H5 along the path in FIG. 2c; node H5 then aggregates the received contents to complete the repair of the failed storage block. PPR uses a partially parallel mechanism to avoid congestion: storage blocks are first multiplied by their decoding coefficients and then transmitted to the related nodes at different times, so its performance is better than the conventional repair method. However, PPR does not reduce the bandwidth consumption of the repair process; moreover, PPR operates at the data block level, so its repair time is still generally significantly higher than the normal block read time.
FIG. 2d is a schematic diagram of repairing node H4, which contains failed storage blocks, using RP. Each supply node divides its associated block into many small units, which are aggregated at the supply nodes along the path; the repair process completes when the supply nodes have sent a complete block. RP serializes the repair operation of failed storage blocks across storage nodes in a pipelined manner, greatly improving repair speed and keeping the repair time roughly consistent with the normal read time of the same data volume in a homogeneous environment. However, the intermediate servers on the RP must receive and send intermediate data through their neighboring switches, which still incurs significant transmission cost.
Fig. 2e is a schematic diagram of repairing a node H4 containing a failed memory block by using RT, wherein a tree-shaped repair pipeline is constructed, so that switches on a path also participate in processing of intermediate data, thereby further reducing transmission traffic. Specifically, the RT sets a bandwidth and delay related cost to each link and then computes a minimum steiner tree to connect the donor node and the repair node. We extend this to directed graphs with asymmetric links. As can be seen from fig. 2e, since the repair traffic is aggregated in the switches on the path, the bandwidth consumption can be effectively reduced. The RT may achieve lower transmission costs than the RP, but because the RT blindly selects the lower latency link, the packets may arrive asynchronously at the switch. The first arriving packet must wait for other related packets to arrive before being aggregated, resulting in many unnecessary links being added to the transmission path, resulting in additional bandwidth consumption.
Referring to FIG. 2f, to improve on RT, the failed storage block on node H4 may be repaired using a Minimum-Width Directed Steiner Tree (MWDST). Compared with the repair tree, MWDST achieves the same repair throughput for node H4 but repairs it at a lower transmission cost.
Fig. 2f shows an MWDST comprising source nodes {H1, H2, H3} and destination node H5: by finding an MWDST, the repair problem of node H4 with failed storage blocks can be solved well. However, in the network of an actual erasure code storage system there are multiple paths between any pair of nodes, so multiple such Steiner trees can be found, and using multiple MWDSTs concurrently can potentially increase repair throughput. Referring to FIG. 3, in this case the cost of each Steiner tree is the product of the amount of data it transmits and its link cost; FIG. 3 shows the transmission costs of Steiner tree 1 and Steiner tree 2 computed in this way. Therefore, once the total amount of data to be transmitted is determined, our goal is to find multiple MWDSTs that maximize the sum of their bandwidths while minimizing their overall cost. We refer to this as the Packing Minimum-Width Directed Steiner Trees (PMWDST) problem.
In view of this, one or more embodiments of the present application provide a method for repairing a failed storage block in an erasure code storage system, where an operation for repairing the failed storage block in the erasure code storage system is split and then executed by a plurality of MWDSTs in parallel, so as to improve a repair throughput when repairing the erasure code storage system. And the aggregation of data in the network is realized by utilizing the switch nodes on the transmission path, so that the transmission cost is reduced, and compared with the existing repair method, the repair method provided by the application has obvious advantages in both repair throughput and transmission cost.
As an alternative embodiment, referring to fig. 4, the method for repairing a failed memory block in an erasure code memory system provided by the present application includes:
step S401, determining, according to a directed graph of a network formed by the erasure code storage system, one repair node and a plurality of candidate supply nodes that are located on the same stripe as the failed storage block and that can be used for repairing the failed storage block.
In erasure code storage systems, files are stored as fixed-size blocks as basic read-write units. Typically, in (k, m) erasure code storage systems, each k original blocks on a stripe (stripe) are encoded together, resulting in m redundant blocks. The k original blocks are from the original file and are called data blocks (data blocks), and the m redundant blocks are called parity blocks (parity blocks), mainly for fault tolerance. The k + m memory blocks on the same stripe are stored on the k + m memory nodes in a distributed manner. Each block may be regenerated by decoding any k blocks on the slice. Thus, the system can tolerate up to m node failures without losing data.
The erasure code storage system network is modeled as a directed graph G = (V, E) with asymmetric links, where each node v_i ∈ V is a server or a switch, and e_ij ∈ E is the edge from node v_i to node v_j. Each edge e_ij is assigned two real numbers: b_ij ∈ B, the available bandwidth (called the bandwidth), and c_ij ∈ C, the transmission cost per unit of data. Since the links are asymmetric, b_ij is not necessarily equal to b_ji, and c_ij is not necessarily equal to c_ji. If there is no link between two nodes, b_ij = b_ji = 0 and c_ij = c_ji = ∞. For any directed path p_ij = {v_i, v_x, v_y, ..., v_z, v_j} from node v_i to v_j, the bandwidth (width) of the path is defined as width(p_ij) = min{b_ix, b_xy, ..., b_zj}, and the cost of the path is defined as cost(p_ij) = γ × (c_ix + c_xy + ... + c_zj), where γ is the size of the data transmitted on this path. In particular, when different paths transmit the same amount of data, their transmission costs can be compared directly by summing their link costs.
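The path width and cost definitions above can be sketched directly; the edge-dictionary representation and function names are assumptions of this illustration:

```python
def path_width(path, b):
    """width(p) = minimum available bandwidth over the links of the path."""
    return min(b[(u, v)] for u, v in zip(path, path[1:]))

def path_cost(path, c, gamma):
    """cost(p) = gamma (size of data sent) times the sum of per-unit link costs."""
    return gamma * sum(c[(u, v)] for u, v in zip(path, path[1:]))
```

For a path {v1, v2, v3}, `zip(path, path[1:])` yields the links (v1, v2) and (v2, v3), so the width is the bottleneck link bandwidth and the cost scales linearly with the transmitted data size γ.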
Step S402, according to the directed graph, respectively determining a plurality of shortest directed paths from the candidate supply nodes to the repair node, and calculating respective bandwidths of the shortest directed paths.
When the bandwidths of the shortest directed paths are calculated, for each shortest directed path the bandwidths of all the links forming that path are collected, and the minimum of these link bandwidths is taken as the bandwidth of the shortest directed path.
Step S403, arranging the shortest directed paths in descending order of bandwidth, and obtaining the first k shortest directed paths among the arranged shortest directed paths, where k is equal to the number of data blocks on the stripe.
Step S404, regarding k candidate providing nodes respectively corresponding to the first k shortest directed paths among the multiple candidate providing nodes as k first providing nodes, and regarding the bandwidth of the k-th shortest directed path among the multiple shortest directed paths after arrangement as a bandwidth threshold.
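Steps S402 to S404 reduce to sorting the candidate supply nodes by the bandwidth of their shortest directed path to the repair node and keeping the first k. A hedged sketch, assuming the path bandwidths have already been computed (names are illustrative):

```python
def select_first_supply_nodes(candidates, widths, k):
    """Sort candidate supply nodes by the bandwidth of their shortest
    directed path to the repair node (descending); keep the first k.
    widths: dict mapping candidate node -> path bandwidth."""
    ranked = sorted(candidates, key=lambda v: widths[v], reverse=True)
    first_k = ranked[:k]
    # Bandwidth threshold = bandwidth of the k-th widest path.
    threshold = widths[first_k[-1]]
    return first_k, threshold

widths = {'a': 5, 'b': 9, 'c': 7, 'd': 3}
nodes, thr = select_first_supply_nodes(['a', 'b', 'c', 'd'], widths, 3)
print(nodes)  # ['b', 'c', 'a']
print(thr)    # 5
```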
Step S405, according to the first k shortest directed paths, generating an initial directed steiner tree connecting the k first supply nodes and the repair node in the directed graph, where a width of the initial directed steiner tree is equal to the bandwidth threshold.
Wherein, based on the directed graph G = (V, E) obtained in step S401, the directed Steiner tree is defined as follows: given a directed graph G = (V, E), a set of source nodes S, and a destination node d, a directed Steiner tree T = (V_t, E_t) is a subgraph of G with S ∪ {d} ⊆ V_t ⊆ V and E_t ⊆ E, such that in T there is a directed path from each node in S to d; for all nodes v ∈ V_t − {d}, degree_out(v) = 1; degree_out(d) = 0; and for all nodes v ∈ V_t − S, degree_in(v) ≥ 1. Here degree_in(v) denotes the number of edges e_uv entering node v, and degree_out(v) denotes the number of edges e_vu leaving node v.
Wherein, the width of the directed Steiner tree T = (V_t, E_t) is defined as the minimum bandwidth over all edges in the tree, i.e., width(T) = min{b_ij | e_ij ∈ E_t}, where b_ij is the bandwidth of edge e_ij. The cost of the directed Steiner tree T is the sum of the costs of all its edges when the tree transmits a single unit of data. The directed Steiner tree with the largest bandwidth is called the Widest Directed Steiner Tree (WDST), and the MWDST is further defined as the WDST with the lowest transmission cost among all WDSTs.
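Selecting the MWDST among candidate trees is a lexicographic choice: widest first, then cheapest. A minimal sketch with an assumed record layout:

```python
def pick_mwdst(trees):
    """Among candidate directed Steiner trees, pick the widest; among the
    widest ones (the WDSTs), pick the least-cost one (the MWDST)."""
    return max(trees, key=lambda t: (t['width'], -t['cost']))

trees = [{'width': 4, 'cost': 9},
         {'width': 6, 'cost': 12},
         {'width': 6, 'cost': 10}]
print(pick_mwdst(trees))  # {'width': 6, 'cost': 10}
```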
Referring to fig. 5, when an initial directed steiner tree connecting the k first supply nodes and the repair node in the directed graph is generated according to the first k shortest directed paths in step S405, the same operation is performed on each first supply node, and the method includes:
in step S501, a path generation function is used to generate a candidate directed path from the first supply node to the repair node.
Step S502, calculating the bandwidth and transmission cost of the candidate directed path.
Step S503, in response to determining that the bandwidth of the candidate directed path is greater than the bandwidth of the target shortest directed path corresponding to the first supply node among the first k shortest directed paths, or in response to determining that the bandwidth of the candidate directed path is equal to the bandwidth of the target shortest directed path and the transmission cost of the candidate directed path is less than that of the target shortest directed path, replacing the target shortest directed path with the candidate directed path.
The method has the advantages that the directed path with lower bandwidth is replaced by the directed path with higher bandwidth, so that the repair throughput of the generated initial directed Steiner tree can be improved, and the transmission performance is further improved; the directed path with the lower transmission cost is used for replacing the directed path with the higher transmission cost, so that the transmission cost is reduced under the condition of unchanged bandwidth, and the transmission performance is improved.
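The replacement rule above amounts to a two-level comparison: a candidate path wins if it is strictly wider, or equally wide but strictly cheaper. A minimal sketch:

```python
def should_replace(cand_bw, cand_cost, cur_bw, cur_cost):
    """Replace the current target shortest path with the candidate if the
    candidate is strictly wider, or equally wide but strictly cheaper."""
    return cand_bw > cur_bw or (cand_bw == cur_bw and cand_cost < cur_cost)

assert should_replace(10, 7, 8, 3)     # wider wins despite higher cost
assert should_replace(8, 2, 8, 3)      # same width, cheaper wins
assert not should_replace(8, 3, 8, 3)  # identical: keep the current path
```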
Step S406, iteratively performing the following operations until a preset iteration end condition is satisfied: in response to determining that the previously generated alpha directed Steiner trees occupy the same link, updating the available bandwidth of the link to 1/alpha of the available bandwidth, wherein alpha is a positive integer; and generating a new directed Steiner tree according to the updated available bandwidth.
Wherein the preset iteration end condition is as follows: a first ratio is calculated as the bandwidth threshold divided by a preset threshold; when the width of a directed Steiner tree generated in an iteration is smaller than the first ratio, the iteration end condition is met. The initial directed Steiner tree, together with every iteratively generated directed Steiner tree whose width is greater than or equal to the first ratio, is recorded in the directed Steiner tree group.
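The iteration of step S406 and its end condition can be sketched as follows; `build_tree` stands in for the tree-generation procedure of fig. 6, and all names are illustrative assumptions rather than the patent's implementation:

```python
def pack_steiner_trees(initial_tree, bandwidth, threshold, sigma, build_tree):
    """Iteratively generate directed Steiner trees; alpha trees occupying the
    same link each get 1/alpha of its bandwidth. Stop when a new tree's
    width falls below the first ratio (threshold / sigma)."""
    group = [initial_tree]
    first_ratio = threshold / sigma
    usage = {}                                   # link -> occupying tree count
    while True:
        for link in group[-1]['links']:          # account for the latest tree
            usage[link] = usage.get(link, 0) + 1
        avail = {l: bandwidth[l] / usage.get(l, 1) for l in bandwidth}
        tree = build_tree(avail)                 # fig. 6 procedure stand-in
        if tree['width'] < first_ratio:
            break                                # iteration end condition met
        group.append(tree)
    return group

# Toy setting: every tree uses the single link 'e1' of bandwidth 8.
def build_tree(avail):
    return {'width': avail['e1'], 'links': ('e1',)}

init = {'width': 8.0, 'links': ('e1',)}
group = pack_steiner_trees(init, {'e1': 8.0}, threshold=8.0, sigma=4.0,
                           build_tree=build_tree)
print(len(group))  # 5 trees with widths 8, 8, 4, 8/3, 2
```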
Referring to fig. 6, the step S406 of generating a new directed steiner tree according to the updated available bandwidth includes:
step S601, respectively calculating k shortest directed paths from the k first providing nodes to the repairing node according to the updated available bandwidth, the k first providing nodes, and the repairing node.
Step S602, respectively calculating the bandwidths of the k shortest directed paths, arranging the bandwidths of the k shortest directed paths in a descending order, and acquiring the bandwidth of the kth shortest directed path.
Step S603, generating the new directed steiner tree with the width of the kth shortest directed path according to the k shortest directed paths.
When a new directed Steiner tree is generated in step S603, its transmission cost is also reduced as far as possible according to the method shown in fig. 5. The bandwidth of the k-th shortest directed path serves as the bandwidth threshold of the directed Steiner tree generated in this iteration.
In step S407, the failed memory block is repaired in parallel by using the k valid memory blocks stored in the k first supply nodes, respectively, and by using the generated plurality of directed steiner trees as a plurality of tree repair pipelines in parallel.
Wherein, each valid memory block of the k valid memory blocks is partitioned into a plurality of memory units, which are allocated to the plurality of tree repair pipelines for transmission. The programmable switches in the tree repair pipelines aggregate the associated memory units in-network, which avoids the wasted repair traffic of the switches first transmitting data to a first supply node for aggregation and then receiving the aggregated data back for forwarding; this reduces the bandwidth occupied by each tree repair pipeline and further improves repair performance.
When the plurality of tree-shaped repair pipelines repair the memory block, each tree-shaped repair pipeline repairs a part of data on the failed memory block by using the memory unit transmitted by the tree-shaped repair pipeline on the repair node. The bandwidth of each directed Steiner tree is obtained according to the shape and the directed graph of each directed Steiner tree recorded by the directed Steiner tree group, the data quantity required to be transmitted by each directed Steiner tree is determined according to the proportion of the bandwidth of each directed Steiner tree occupying the sum of the bandwidths of all the directed Steiner trees, and each directed Steiner tree is enabled to transmit a corresponding quantity of data in parallel to repair a failed storage block, so that the associated data packets can reach the switch at a relatively consistent time. Referring to fig. 7, this process includes:
step S701 calculates a plurality of transmission bandwidths respectively allocated to the plurality of tree repair pipelines and a sum of the plurality of transmission bandwidths.
Calculating the transmission bandwidth of each directed Steiner tree according to the directed graph and the bandwidth matrix of the directed graph; and adding the transmission bandwidths of the tree repair pipelines to obtain a transmission bandwidth sum.
Step S702, for each tree-like repair pipeline in the plurality of tree-like repair pipelines, according to the ratio of the transmission bandwidth of the tree-like repair pipeline to the total sum, dividing a part of the valid memory block as the memory unit to be allocated to the tree-like repair pipeline for transmission.
Firstly, the ratio of the transmission bandwidth of each tree-shaped repair pipeline to the sum of the transmission bandwidths is calculated as the transmission bandwidth ratio of each tree-shaped repair pipeline. And respectively calculating the product of the size of the failed storage block to be repaired and the transmission bandwidth ratio of each tree-shaped repair pipeline as the data volume required to be transmitted in the repair process of each tree-shaped repair pipeline, wherein when repairing, each tree-shaped repair pipeline transmits the corresponding data volume to the repair node, and the repair node finishes repairing the failed storage block.
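The proportional split of steps S701 and S702 can be sketched as follows (illustrative names; a real implementation would also round to whole storage units):

```python
def allocate_units(block_size, tree_bandwidths):
    """Split a valid block among tree repair pipelines in proportion to
    each pipeline's transmission bandwidth."""
    total = sum(tree_bandwidths)
    return [block_size * bw / total for bw in tree_bandwidths]

# A 64 MiB block over pipelines with bandwidths 6, 3, 1:
parts = allocate_units(64, [6, 3, 1])
print(parts)  # the pipeline shares sum back to 64
```

Transmitting data volumes proportional to pipeline bandwidth makes the associated packets arrive at the aggregating switch at roughly the same time, as the description above requires.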
According to the method for repairing the failed storage block in the erasure code storage system, the first supply node and the bandwidth threshold value required by repairing the erasure code storage system are calculated according to the directed graph, the candidate supply nodes and the repair nodes obtained by modeling the erasure code storage system. The method comprises the steps of calculating an initial directed Steiner tree with the width being a bandwidth threshold according to all first supply nodes and the bandwidth threshold, and iteratively calculating a directed Steiner tree with the whole width being larger than or equal to a first ratio and the initial directed Steiner tree to form a directed Steiner tree group, wherein each directed Steiner tree in the directed Steiner tree group transmits data in parallel to repair a failed storage block, so that the purpose of dividing the operation of repairing the failed storage block into a plurality of parallel tree-shaped repair pipelines for repair is achieved, the repair throughput is improved, and the transmission performance is improved; and intermediate data in the data transmission process are aggregated by the switches on the path in the directed Steiner tree, so that the bandwidth occupation is reduced, and the repair performance of the erasure code storage system on the failed storage block is improved.
As an alternative embodiment, the following Threshold Constrained Shortest-Widest Path algorithm (TCSWP) may be further used to perform steps S402 to S404:
[The TCSWP algorithm pseudocode appears as a figure in the original.]
In the above TCSWP algorithm, the inputs are the repair node dst, the available network Bandwidth matrix, the link Cost matrix, and the specified bandwidth threshold bot_init. To obtain the shortest-widest directed path from a candidate supply node to the repair node dst, bot_init is set to infinity. The variable prenode output by the algorithm is a sequence of n nodes, where n is the total number of nodes in the directed graph G = (V, E); the element at the i-th position of prenode represents the node previous to the i-th node. The output variables bot and cst are also sequences of n values, representing respectively the bandwidth and the transmission cost of the shortest-widest path from the corresponding node in prenode to the repair node dst.
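Since the TCSWP pseudocode appears only as a figure, the following is a hedged reconstruction in the spirit of the description: a Dijkstra-like backward search from dst that maximizes bottleneck bandwidth (capped at bot_init) and breaks ties by minimizing cost. The variable names prenode, bot, and cst follow the text; the patent's exact algorithm may differ.

```python
import heapq
import math

def tcswp(n, edges, dst, bot_init=math.inf):
    """edges[u] = list of (v, bandwidth, cost) for directed edge u -> v.
    Returns prenode (next hop toward dst), bot (bottleneck bandwidth) and
    cst (transmission cost) of the widest, then cheapest, path to dst."""
    rev = [[] for _ in range(n)]          # reversed adjacency for backward search
    for u in range(n):
        for v, b, c in edges[u]:
            rev[v].append((u, b, c))
    bot = [0] * n
    cst = [math.inf] * n
    prenode = [None] * n
    bot[dst], cst[dst] = bot_init, 0
    heap = [(-bot[dst], 0, dst)]
    while heap:
        nb, nc, v = heapq.heappop(heap)
        nb = -nb
        if (nb, nc) != (bot[v], cst[v]):
            continue                      # stale heap entry
        for u, b, c in rev[v]:
            w, cc = min(nb, b), nc + c    # bottleneck and cost through u -> v
            if w > bot[u] or (w == bot[u] and cc < cst[u]):
                bot[u], cst[u], prenode[u] = w, cc, v
                heapq.heappush(heap, (-w, cc, u))
    return prenode, bot, cst

# Route 0 -> 1 -> 2 (bottleneck 6, cost 2) beats the direct 0 -> 2 link (width 3).
pre, bot, cst = tcswp(3, [[(1, 10, 1), (2, 3, 1)], [(2, 6, 1)], []], dst=2)
```

The widest-path (bottleneck) metric is monotone along a path, so the Dijkstra-style greedy scan with a lexicographic (width, cost) order remains correct.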
As an alternative embodiment, in order to execute step S405 to obtain the initial directed Steiner tree, the following Modified SCTF (MSCTF) algorithm may be obtained by modifying the Selected Closest Terminal First (SCTF) algorithm, and the MWDST is obtained by running it.
[The MSCTF algorithm pseudocode appears as a figure in the original.]
In the MSCTF algorithm, a knob parameter κ is added to the inputs of the TCSWP algorithm. The knob parameter κ is a specified value between 1 and n, where n is the total number of memory blocks on a stripe of the erasure code storage system; according to the SCTF algorithm, the higher the value of κ, the higher the precision, but the computation time also increases correspondingly, and a small increase of κ can bring a significant improvement in accuracy. When the MSCTF algorithm runs, the shortest-widest directed path from each candidate supply node to the repair node is computed by the TCSWP algorithm; the nodes corresponding to the first k directed paths with the largest bandwidths are selected as the first supply nodes and recorded as H_temp, and the bandwidth of the k-th widest directed path is recorded as bot_tree. The MSCTF algorithm finally outputs a directed Steiner tree whose bandwidth is bot_tree as the initial directed Steiner tree.
As an alternative embodiment, to iteratively generate the directed Steiner tree group in step S406, a Steiner tree packing (SRSTP) algorithm is used. Iterative computation is performed on the input preset threshold (whose symbol appears as a figure in the original), the knob parameter κ, the repair node dst, the set H of supply nodes capable of repairing the failed storage block, the erasure code parameter k, the available network Bandwidth matrix, and the link Cost matrix, to obtain a directed Steiner tree group for transmitting the data that repairs the failed storage block. The bandwidth of the minimum-bandwidth directed Steiner tree in the group obtained by the SRSTP algorithm is greater than or equal to the ratio of the bandwidth of the initial directed Steiner tree to the preset threshold.
[The SRSTP algorithm pseudocode appears as a figure in the original.]
As an alternative embodiment, the method provided in this application may also be used to repair a plurality of failed memory blocks in the erasure code storage system, and with reference to fig. 8, the method includes:
step S801, generating corresponding directed steiner tree groups for the plurality of failed storage blocks according to the directed graph.
Firstly, k first supply nodes and repair nodes corresponding to each failed storage block are calculated according to a directed graph, and an initial directed Steiner tree and a bandwidth threshold corresponding to each failed storage block are calculated. And respectively generating new directed Steiner trees in an iterative manner according to the initial directed Steiner tree corresponding to each failed storage block and the bandwidth threshold value until the width of the generated new directed Steiner trees corresponding to each failed storage block is smaller than the ratio of the bandwidth threshold value corresponding to each failed storage block to the preset threshold value.
In an iteration process, a new directed steiner tree corresponding to each failed storage block is generated, and then the available bandwidth is updated according to all the occupied links of the generated directed steiner trees.
Step S802, respectively repairing the plurality of failed memory blocks in parallel by using the generated directed Steiner tree groups respectively corresponding to the plurality of failed memory blocks.
The performance of the method for repairing a failed memory block in an erasure code storage system provided by the present application is described below by comparing the existing repair method with the repair method provided by the present application.
The topology listed in table 1 is used to evaluate the performance of the method for repairing failed blocks in erasure coded storage systems provided herein and the existing repair methods. Wherein the R1-R6 topologies in table 1 are a plurality of real geographically distributed topologies (ISPs) in the DEFO dataset, respectively; topology F7-F8 is a Fat-Tree topology (Fat-Tree), topology B9-B10 is a BCube topology, both of which are data center topologies, the former being switch-centric topology and the latter being server-centric topology.
TABLE 1 Experimental topology
[Table 1 appears as a figure in the original.]
Background traffic needs to be added to the topology to simulate different network states when performing repair performance evaluations. For a DEFO dataset, the background flow comes from its own demand data. A portion of the demand data is randomly drawn to simulate different network loads. For other cases, a flow matrix is generated based on a gravity model with the same distribution index random variables. The resulting flow is quite realistic. Each link is assigned a cost-related value, which is contained in the DEFO dataset; for other cases, it is assumed that the cost of use of each link is the same and the background traffic is routed using shortest paths. If multiple paths with the same cost exist, the path with larger bandwidth is selected to achieve better load balance. When the available bandwidth of the shortest path cannot meet the transmission requirements, a longer path is selected. The available bandwidth of each link is then calculated to monitor the status of the network. Since the bandwidth in the DEFO dataset is dimensionless, the experimental results are mainly based on normalized values.
Firstly, the repair performance for single-node and multi-node failures in an erasure code storage system is evaluated. Specifically, for t failed nodes, k + m − t available donor nodes are allocated and t repair nodes are designated. Each requester selects k of the k + m − t donor nodes to repair the corresponding failed memory block. In an ISP topology, the available donor nodes and repair nodes are selected randomly, whereas in a data center topology both the donor nodes and repair nodes are limited to server nodes. The evaluation uses RS(6,3), RS(8,4), RS(10,4), and RS(12,4) erasure codes, encoding schemes widely applied in practical erasure code storage systems. The routing algorithm is implemented in a centralized manner, assuming that the switches can aggregate intermediate data; the knob parameter κ is set to 20 and the preset threshold is set to a specified value (given as a figure in the original).
Results are based on the mean and standard deviation of multiple runs. The experiments were run on a desktop equipped with an 8-core 3.8 GHz Intel(R) Core(TM) i7 CPU and 64 GB of memory.
Four existing repair methods are selected for performance comparison with the method for repairing failed blocks in an erasure code storage system provided by the present application: the Conventional Repair method (CR), in which k first supply nodes are randomly selected on the stripe where the failed storage block is located and the related data blocks are sent to the repair node over shortest paths; the method of repairing failed memory blocks with a single MWDST (MW); the RP method; and the RT method. To deploy the RT method, the ratio of the data volume transmitted on a link to its available bandwidth is taken as the delay cost of the link, and the uplink and downlink directions of a link are assigned different costs depending on the available bandwidth in each direction. Since the RT method assumes that the k first supply nodes are pre-designated, the k candidate supply nodes with the lowest delay to the repair node are selected to evaluate the RT method. The method for repairing the failed block in the erasure code storage system provided herein is referred to as Parallelized In-network Aggregation repair (Paint).
Firstly, the repair performance of the different methods on a single failed storage block under different topologies is evaluated: multiple rounds of experiments are performed on each topology in table 1 with RS(6,3) coding, and the repair throughput and repair cost of each method when repairing a single failed storage block are calculated and compared, with the results shown in fig. 9a and 9b. The abscissa of fig. 9a and 9b is the different ISP topologies; the ordinate of fig. 9a is the repair throughput of the different methods, and the ordinate of fig. 9b is the repair cost spent by the different methods, across the 6 ISP topologies. Evidently, Paint always maintains the best repair throughput together with a satisfactory bandwidth overhead. In contrast, RP can achieve throughput close to MW, but its repair cost is the highest, because it has to transmit a large amount of intermediate data between the donor nodes for aggregation. Paint and the other methods exploit nodes on the transmission path to aggregate intermediate data, thereby significantly reducing repair traffic. The RT method has the lowest overhead; however, it cannot achieve a satisfactory repair speed, which may be due to a defect in its selection of donor nodes.
Referring to fig. 10a and 10b, the abscissa of fig. 10a and 10b is different data center topologies, the ordinate of fig. 10a is repair throughput of different methods, and the ordinate of fig. 10b is repair cost consumed by different methods. Similar conclusions were also drawn from evaluations performed in the data center topology, and in the BCube topology Paint showed a very high repair throughput advantage (more than 3 times higher than the comparative approach in B10), while the increase in repair cost was slight. This is because in a server-centric topology, servers are interconnected by multiple links. Thus, there are a large number of parallel paths between nodes, which provides the power for parallelization. In contrast, Paint has a weak advantage in the Fat-Tree topology, because each server is connected to only one single link with limited bandwidth in the Fat-Tree topology, which results in a low probability of parallel repair, and cannot embody the advantage of parallel repair.
An ISP topology R2 may be selected to evaluate the repair performance of a single node failure under different erasure codes; referring to fig. 11a and 11b, the abscissa of fig. 11a and 11b is the different erasure codes, the ordinate of fig. 11a is the repair throughput of the different methods, and the ordinate of fig. 11b is the repair cost consumed by the different methods. As the figures show, as the number of erasure-coded blocks increases, the parallelism gain of Paint decreases, but it still outperforms the other methods in repair throughput. The root cause is that repairing a failed block requires more donor nodes, and the available bandwidth between these donor nodes may be very low.
However, the extent of this effect depends on the underlying topology. BCube topology B10 is chosen to verify this. Referring to the results shown in fig. 12a and 12b, the abscissa of fig. 12a and 12b is the different erasure codes, the ordinate of fig. 12a is the repair throughput of the different methods, and the ordinate of fig. 12b is the repair cost consumed by the different methods. As the number of memory blocks increases, Paint remains overwhelmingly superior to the other approaches. The reason is that there are a large number of parallel paths in a BCube network, and partially congested links do not greatly impact the parallelism; therefore, Paint can guarantee sufficient throughput over multiple parallel paths even if some links are congested. Furthermore, the repair cost of Paint does not increase linearly as with the RP or CR methods, whether in BCube or Fat-Tree.
Different amounts of background traffic may also be added to the topological network to simulate different network loads. And calculating the maximum link utilization rate of different repair methods, namely the ratio of the occupied bandwidth of each link to the total available bandwidth, so as to reflect the load level of the network. As shown in fig. 13a and 13b, the abscissa of fig. 13a and 13b is the link utilization, the ordinate of fig. 13a is the repair throughput of the different method, and the ordinate of fig. 13b is the repair cost consumed by the different method. The results of the R2 topology are shown in fig. 13a, where the maximum link utilization ranges from 20% to 95%, the RT method and the CR method are sensitive to network load, while the MW method and the RP method only slightly decrease throughput, and the repair performance of Paint is not significantly degraded compared to the above methods. The reason for this is that the actual traffic is highly skewed and severe congestion usually occurs on part of the link; in addition, the load balancing mechanism for background traffic also ensures that most links do not have obvious congestion. Therefore, Paint and other methods can mitigate the effects of congestion by selecting an appropriate provisioning node, while RT methods and CR methods suffer from congestion. But the change of network load has little influence on the repair cost, as shown in fig. 13b, the repair cost does not increase sharply as the load increases, because these methods mainly select the providing node according to the available bandwidth rather than the cost; and congested links occur randomly throughout the network, the repair costs of these methods are not significantly affected by the network load. In addition, the CR method always selects the supply node according to the shortest path, and therefore, the repair cost is less sensitive to the network load.
As an alternative embodiment, Paint is also applicable to repairing multiple failed blocks on the same stripe simultaneously, although this case is quite rare; the multi-fault repair of the RP method adopts the traditional repair method. Specifically, a repair tree is first obtained according to the given single-failure repair method. The remaining requesters are then connected to the last aggregation node via the shortest path, where the shortest path is the lowest-delay path, so as to reduce delay and cost as much as possible. RS(6,3) coding is adopted to evaluate the multi-fault repair performance of the Paint and RT methods under different topologies. As shown in fig. 14a and 14b, the abscissa of fig. 14a and 14b is the different ISP topologies, and the ordinate of fig. 14a and 14b is the repair throughput of the different methods. Fig. 14a shows the throughput when repairing two failed blocks simultaneously, and fig. 14b shows the throughput when repairing three failed blocks simultaneously. In the multi-fault repair scenario, Paint still shows its superiority in repair speed. When 2 failed blocks are repaired simultaneously, the repair throughput of Paint is significantly higher than that of the RT method; especially in R5 and R6, Paint's repair throughput is almost twice that of the RT method. This is because Paint employs an efficient mechanism to reduce the impact of stragglers, and in this way the maximum completion time can be shortened as much as possible. When 3 failed blocks are repaired concurrently, Paint's repair throughput is lower than in the 2-failed-block scenario, because some bottleneck links are shared among more repair nodes; but compared with the RT method, Paint still keeps its advantage.
In summary, according to the method for repairing a failed storage block in an erasure code storage system, the first supply nodes and the bandwidth threshold required for repair are calculated according to the directed graph, the candidate supply nodes, and the repair node obtained by modeling the erasure code storage system. An initial directed Steiner tree whose width equals the bandwidth threshold is calculated from all the first supply nodes and the bandwidth threshold, and directed Steiner trees whose widths are greater than or equal to the first ratio are iteratively calculated; together with the initial directed Steiner tree they form a directed Steiner tree group, in which each directed Steiner tree transmits data in parallel to repair the failed storage block. The operation of repairing the failed storage block is thus split into a plurality of parallel tree repair pipelines, which improves repair throughput and transmission performance; and the intermediate data in the transmission process is aggregated by the switches on the paths of the directed Steiner trees, which reduces bandwidth occupation and improves the repair performance of the erasure code storage system for failed storage blocks. Performance comparison with existing repair methods shows that the method can significantly improve repair performance; in particular, in networks with a large number of redundant paths, throughput is improved by more than 3 times over existing repair methods.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment of the application can also be applied to a distributed scene and is completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a device for repairing the failed memory block in the erasure code memory system.
Referring to fig. 15, the apparatus for repairing a failed memory block in an erasure code storage system includes:
a node determination module 151 configured to: determining a repair node and a plurality of candidate supply nodes which are positioned on the same strip as the failed storage block and can be used for repairing the failed storage block according to a directed graph of a network formed by the erasure code storage system;
a path computation module 152 configured to: respectively determining a plurality of shortest directed paths from the candidate supply nodes to the repair node according to the directed graph, and calculating respective bandwidths of the shortest directed paths;
a path selection module 153 configured to: arranging the shortest directed paths in a descending order of the bandwidth, and acquiring front k shortest directed paths in the arranged shortest directed paths, wherein k is equal to the number of data blocks on the strip;
a node selection module 154 configured to: take the k candidate supply nodes respectively corresponding to the first k shortest directed paths as k first supply nodes, and take the bandwidth of the k-th shortest directed path among the arranged shortest directed paths as a bandwidth threshold;
an initial directed Steiner tree generation module 155 configured to: generate, in the directed graph according to the first k shortest directed paths, an initial directed Steiner tree connecting the k first supply nodes and the repair node, the width of the initial directed Steiner tree being equal to the bandwidth threshold;
an iteration generation module 156 configured to: iteratively execute the following operations until a preset iteration end condition is met: in response to determining that alpha previously generated directed Steiner trees occupy the same link, update the available bandwidth of the link to 1/alpha of its available bandwidth, where alpha is a positive integer; and generate a new directed Steiner tree according to the updated available bandwidth;
a parallel repair module 157 configured to: repair the failed storage block in parallel, using the k valid storage blocks respectively stored on the k first supply nodes and using the generated plurality of directed Steiner trees as a plurality of tree repair pipelines.
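As a non-authoritative sketch of the selection performed by modules 152-154 — a path's bandwidth is the minimum bandwidth of its links, the paths are sorted in descending bandwidth order, the first k become the repair suppliers, and the k-th bandwidth becomes the threshold — the logic could look as follows. All function names and the path/link representation here are illustrative assumptions, not taken from the patent:

```python
def path_bandwidth(path_links, link_bandwidth):
    """Bandwidth of a directed path = minimum bandwidth among its links."""
    return min(link_bandwidth[link] for link in path_links)

def select_suppliers(candidate_paths, link_bandwidth, k):
    """Pick the k first supply nodes and the bandwidth threshold.

    candidate_paths: {supplier_node: [link, ...]}, each value being that
    supplier's shortest directed path to the repair node.
    Returns (k supplier nodes, their paths, bandwidth threshold).
    """
    # Sort all shortest directed paths in descending order of bandwidth.
    ranked = sorted(
        candidate_paths.items(),
        key=lambda item: path_bandwidth(item[1], link_bandwidth),
        reverse=True,
    )
    first_k = ranked[:k]
    # The k-th (i.e., last retained) path's bandwidth is the threshold.
    threshold = path_bandwidth(first_k[-1][1], link_bandwidth)
    suppliers = [supplier for supplier, _ in first_k]
    return suppliers, dict(first_k), threshold
```

For example, with three candidate suppliers whose path bandwidths are 10, 5, and 7, choosing k = 2 keeps the 10- and 7-bandwidth suppliers and sets the threshold to 7.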
The initial directed Steiner tree generation module is configured to, for each of the k first supply nodes:
generate a candidate directed path from the first supply node to the repair node using a path generation function;
calculate the bandwidth and transmission cost of the candidate directed path;
and, in response to determining that the bandwidth of the candidate directed path is greater than the bandwidth of the target shortest directed path corresponding to the first supply node among the first k shortest directed paths, or that the bandwidth of the candidate directed path is equal to the bandwidth of the target shortest directed path and the transmission cost of the candidate directed path is less than that of the target shortest directed path, replace the target shortest directed path with the candidate directed path.
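A minimal sketch of this replacement criterion — higher bandwidth always wins, and at equal bandwidth the cheaper path wins — is shown below. The function name is an illustrative assumption, not taken from the patent:

```python
def should_replace(candidate_bw, candidate_cost, target_bw, target_cost):
    """A candidate directed path replaces the target shortest directed path
    when it has strictly greater bandwidth, or equal bandwidth at a
    strictly lower transmission cost."""
    return (candidate_bw > target_bw
            or (candidate_bw == target_bw and candidate_cost < target_cost))
```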
For convenience of description, the above apparatus is described as being divided into various modules by function. Of course, when implementing the present application, the functions of the various modules may be implemented in one or more pieces of software and/or hardware.
The apparatus in the foregoing embodiment is used to implement the method for repairing a failed memory block in an erasure code storage system in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to the method of any of the above embodiments, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for repairing a failed storage block in an erasure code storage system described in any of the above embodiments.
Fig. 16 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device may include: a processor 1610, a memory 1620, an input/output interface 1630, a communication interface 1640, and a bus 1650. The processor 1610, memory 1620, input/output interface 1630, and communication interface 1640 are communicatively connected to one another within the device via the bus 1650.
The processor 1610 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The memory 1620 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1620 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present specification are implemented in software or firmware, the relevant program code is stored in the memory 1620 and is called and executed by the processor 1610.
The input/output interface 1630 is used to connect an input/output module to realize information input and output. The input/output module may be configured as a component within the device (not shown) or may be external to the device to provide the corresponding function. Input devices may include a keyboard, a mouse, a touch screen, a microphone, and various sensors; output devices may include a display, a speaker, a vibrator, an indicator light, and the like.
The communication interface 1640 is used to connect a communication module (not shown) to enable communication interaction between the device and other devices. The communication module may communicate in a wired manner (e.g., USB or a network cable) or in a wireless manner (e.g., a mobile network, Wi-Fi, or Bluetooth).
Bus 1650 includes a pathway for communicating information between various components of the device, such as processor 1610, memory 1620, input/output interface 1630, and communication interface 1640.
It should be noted that although the above device shows only the processor 1610, the memory 1620, the input/output interface 1630, the communication interface 1640, and the bus 1650, in a specific implementation the device may also include other components necessary for proper operation. In addition, those skilled in the art will appreciate that the above device may also include only the components necessary to implement the embodiments of the present specification, rather than all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the method for repairing a failed memory block in an erasure code storage system in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiment methods, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the method for repairing a failed memory block in an erasure code storage system according to any of the above embodiments.
Computer-readable media of the embodiments of the present application, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the method for repairing a failed storage block in an erasure code storage system according to any of the foregoing embodiments, and have the beneficial effects of corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present application, features of the above embodiments or of different embodiments may also be combined, steps may be implemented in any order, and there exist many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications, and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.
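To make the iterative tree-generation step described above concrete, the following Python sketch shows the 1/alpha link-bandwidth update applied when alpha of the previously generated directed Steiner trees occupy the same link, together with the iteration end condition of claim 3 (tree width falling below the bandwidth threshold divided by a preset parameter). All names and the data layout are illustrative assumptions, not taken from the patent:

```python
def shared_link_bandwidth(base_bandwidth, trees):
    """Per-link available bandwidth after some Steiner trees were generated:
    a link occupied by alpha of the previously generated trees retains only
    1/alpha of its base bandwidth; unoccupied links are unchanged."""
    usage = {}
    for tree_links in trees:  # each tree is the set of links it occupies
        for link in tree_links:
            usage[link] = usage.get(link, 0) + 1
    return {link: bw / usage[link] if link in usage else bw
            for link, bw in base_bandwidth.items()}

def iteration_ended(new_tree_width, bandwidth_threshold, parameter):
    """Preset end condition: the newest tree's width has fallen below the
    ratio of the bandwidth threshold to the preset parameter value."""
    return new_tree_width < bandwidth_threshold / parameter
```

For instance, a link of base bandwidth 12 occupied by two earlier trees offers 6 to the next tree, which shrinks the widths of successive trees until the stopping ratio is reached.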

Claims (10)

1. A method for repairing a failed storage block in an erasure code storage system, comprising:
determining, according to a directed graph of the network formed by the erasure code storage system, a repair node and a plurality of candidate supply nodes that are located on the same stripe as the failed storage block and can be used to repair the failed storage block;
respectively determining, according to the directed graph, a plurality of shortest directed paths from the candidate supply nodes to the repair node, and calculating the respective bandwidths of the plurality of shortest directed paths;
arranging the plurality of shortest directed paths in descending order of bandwidth, and acquiring the first k shortest directed paths among the arranged shortest directed paths, where k is equal to the number of data blocks on the stripe;
taking the k candidate supply nodes respectively corresponding to the first k shortest directed paths as k first supply nodes, and taking the bandwidth of the k-th shortest directed path among the arranged shortest directed paths as a bandwidth threshold;
generating an initial directed Steiner tree connecting the k first supply nodes and the repair node in the directed graph according to the first k shortest directed paths, wherein the width of the initial directed Steiner tree is equal to the bandwidth threshold;
iteratively executing the following operations until a preset iteration end condition is met:
in response to determining that alpha previously generated directed Steiner trees occupy the same link, updating the available bandwidth of the link to 1/alpha of its available bandwidth, wherein alpha is a positive integer;
generating a new directed Steiner tree according to the updated available bandwidth;
and repairing the failed storage block in parallel, using the k valid storage blocks respectively stored on the k first supply nodes and using the generated plurality of directed Steiner trees as a plurality of tree repair pipelines.
2. The method of claim 1, wherein generating the initial directed Steiner tree from the first k shortest directed paths comprises:
for each of the k first supply nodes,
generating a candidate directed path from the first supply node to the repair node using a path generation function;
calculating the bandwidth and transmission cost of the candidate directed path;
and, in response to determining that the bandwidth of the candidate directed path is greater than the bandwidth of the target shortest directed path corresponding to the first supply node among the first k shortest directed paths, or that the bandwidth of the candidate directed path is equal to the bandwidth of the target shortest directed path and the transmission cost of the candidate directed path is less than that of the target shortest directed path, replacing the target shortest directed path with the candidate directed path.
3. The method of claim 1, wherein the iteration end condition comprises the width of a newly generated directed Steiner tree being less than the ratio of the bandwidth threshold to a preset parameter value.
4. The method of claim 1, wherein calculating the respective bandwidths of the plurality of shortest directed paths comprises:
for each shortest directed path of the plurality of shortest directed paths, calculating the minimum of the bandwidths of all links forming the shortest directed path, and taking the minimum as the bandwidth of the shortest directed path.
5. The method of any of claims 1 to 4, wherein repairing the failed storage block in parallel comprises:
for each of the k valid storage blocks, dividing the valid storage block into a plurality of storage units to be respectively allocated to the plurality of tree repair pipelines for transmission, wherein associated storage units are aggregated using programmable switches in the plurality of tree repair pipelines;
for each tree repair pipeline of the plurality of tree repair pipelines, repairing, on the repair node, a part of the data of the failed storage block by using the storage units transmitted by that tree repair pipeline.
6. The method of claim 5, wherein dividing the valid storage block into a plurality of storage units to be respectively allocated to the plurality of tree repair pipelines for transmission comprises:
calculating a plurality of transmission bandwidths respectively allocated to the plurality of tree repair pipelines and a sum of the plurality of transmission bandwidths;
and, for each tree repair pipeline of the plurality of tree repair pipelines, dividing, according to the proportion of the transmission bandwidth of that tree repair pipeline in the sum, a part of the valid storage block as storage units to be allocated to that tree repair pipeline for transmission.
7. An apparatus for repairing a failed storage block in an erasure code storage system, comprising:
a node determination module configured to: determine, according to a directed graph of the network formed by the erasure code storage system, a repair node and a plurality of candidate supply nodes that are located on the same stripe as the failed storage block and can be used to repair the failed storage block;
a path computation module configured to: determine, according to the directed graph, a plurality of shortest directed paths from the candidate supply nodes to the repair node, respectively, and calculate the respective bandwidths of the plurality of shortest directed paths;
a path selection module configured to: arrange the plurality of shortest directed paths in descending order of bandwidth, and acquire the first k shortest directed paths among the arranged shortest directed paths, where k is equal to the number of data blocks on the stripe;
a node selection module configured to: take the k candidate supply nodes respectively corresponding to the first k shortest directed paths as k first supply nodes, and take the bandwidth of the k-th shortest directed path among the arranged shortest directed paths as a bandwidth threshold;
an initial directed Steiner tree generation module configured to: generate, in the directed graph according to the first k shortest directed paths, an initial directed Steiner tree connecting the k first supply nodes and the repair node, the width of the initial directed Steiner tree being equal to the bandwidth threshold;
an iteration generation module configured to: iteratively execute the following operations until a preset iteration end condition is met: in response to determining that alpha previously generated directed Steiner trees occupy the same link, update the available bandwidth of the link to 1/alpha of its available bandwidth, where alpha is a positive integer; and generate a new directed Steiner tree according to the updated available bandwidth;
a parallel repair module configured to: repair the failed storage block in parallel, using the k valid storage blocks respectively stored on the k first supply nodes and using the generated plurality of directed Steiner trees as a plurality of tree repair pipelines.
8. The apparatus of claim 7, wherein the initial directed Steiner tree generation module is configured to:
for each of the k first supply nodes,
generate a candidate directed path from the first supply node to the repair node using a path generation function;
calculate the bandwidth and transmission cost of the candidate directed path;
and, in response to determining that the bandwidth of the candidate directed path is greater than the bandwidth of the target shortest directed path corresponding to the first supply node among the first k shortest directed paths, or that the bandwidth of the candidate directed path is equal to the bandwidth of the target shortest directed path and the transmission cost of the candidate directed path is less than that of the target shortest directed path, replace the target shortest directed path with the candidate directed path.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, the processor implementing the method of any one of claims 1 to 6 when executing the computer program.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.
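The proportional division of a valid storage block across the tree repair pipelines described in claim 6 can be sketched as follows. This is a hypothetical helper, and the policy of assigning any integer-rounding remainder to the widest pipeline is our assumption, not specified by the claims:

```python
def split_block(block_size, pipeline_bandwidths):
    """Assign each tree repair pipeline a number of storage units
    proportional to its transmission bandwidth over the sum of all
    pipeline bandwidths; units left over by integer rounding go to
    the widest pipeline so the whole block is covered."""
    total = sum(pipeline_bandwidths)
    shares = [block_size * bw // total for bw in pipeline_bandwidths]
    shares[pipeline_bandwidths.index(max(pipeline_bandwidths))] += (
        block_size - sum(shares))
    return shares
```

For example, a 100-unit block split over pipelines with bandwidths 6, 3, and 1 yields 60, 30, and 10 units respectively.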
CN202111584947.8A 2021-12-22 2021-12-22 Method for repairing failed memory block in erasure code memory system and related device Active CN114237985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111584947.8A CN114237985B (en) 2021-12-22 2021-12-22 Method for repairing failed memory block in erasure code memory system and related device


Publications (2)

Publication Number Publication Date
CN114237985A true CN114237985A (en) 2022-03-25
CN114237985B CN114237985B (en) 2022-09-23

Family

ID=80761629


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116149576A (en) * 2023-04-20 2023-05-23 北京大学 Method and system for reconstructing disk redundant array oriented to server non-perception calculation

Citations (2)

Publication number Priority date Publication date Assignee Title
CN110212923A (en) * 2019-05-08 2019-09-06 西安交通大学 A kind of distributed correcting and eleting codes memory system data restorative procedure based on simulated annealing
CN112714031A (en) * 2021-03-29 2021-04-27 中南大学 Fault node rapid repairing method based on bandwidth sensing


Non-Patent Citations (1)

Title
Qi Fenglin et al., "Node selection scheme for data repair of regenerating codes in distributed storage", Journal of Computer Research and Development *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant