CN111131457B - Capacity and bandwidth compromise method and system for heterogeneous distributed storage - Google Patents

Publication number
CN111131457B
CN111131457B CN201911355800.4A CN201911355800A CN111131457B CN 111131457 B CN111131457 B CN 111131457B CN 201911355800 A CN201911355800 A CN 201911355800A CN 111131457 B CN111131457 B CN 111131457B
Authority
CN
China
Prior art keywords
cluster
node
module
sequence
bandwidth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911355800.4A
Other languages
Chinese (zh)
Other versions
CN111131457A (en)
Inventor
骆源
王旌兆
顾振兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201911355800.4A
Publication of CN111131457A
Application granted
Publication of CN111131457B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a capacity and bandwidth compromise method and system for heterogeneous distributed storage, comprising a client module, a repair sequence generation module, and a compromise curve drawing module. Parameter information of the storage system is entered through the input module of the client module and passed to the repair sequence generation module to obtain a repair sequence; the repair sequence is passed to the compromise curve drawing module to obtain a compromise curve, which is sent to the output module of the client module. The repair sequence generation module analyzes, for any bandwidth and capacity, the influence of the repair sequence on the minimum cut and generates the repair sequence whose information flow graph has the smallest minimum cut. The compromise curve drawing module draws the compromise curve of the storage capacity and the repair bandwidth of the storage system. For heterogeneous distributed storage systems, the invention provides a method for drawing the compromise curve of storage capacity and repair bandwidth by analyzing the upper bound on the storable file size achievable under different repair schemes.

Description

Capacity and bandwidth compromise method and system for heterogeneous distributed storage
Technical Field
The invention relates to the field of data storage, in particular to a method and a system for compromising capacity and bandwidth of heterogeneous distributed storage, and more particularly to calculation of capacity of a heterogeneous distributed storage system and drawing of a compromise curve of the storage capacity and repair bandwidth.
Background
In recent years, with the rapid development of internet technology and of the information industry as a whole, information has been generated, transmitted, processed, and stored in large quantities, and its volume is growing exponentially. To meet the storage requirements of massive data, distributed storage systems are widely used because of their low cost, strong scalability, high access speed, high reliability, and support for highly concurrent access.
Erasure codes greatly reduce data redundancy while maintaining high data reliability, and are therefore widely used in distributed storage systems. They work as follows: an erasure code generally encodes a file with a linear code, so that the original data is divided and encoded into $n$ pieces that are stored on $n$ nodes. If any $k$ of these pieces suffice to recover the original data, the erasure code is said to satisfy the MDS (Maximum Distance Separable) property. Linear codes that satisfy the MDS property are called MDS codes. MDS codes are a highly storage-efficient class of coding schemes. Although MDS codes are optimal in the tradeoff between redundancy and reliability, repairing a single node still requires accessing $k$ other intact nodes. If extra check information is added to some (fewer than $k$) information symbols, then when one of those nodes fails only the nodes involved in that check need to be accessed, rather than $k$ nodes. Adding extra parity reduces storage efficiency to some extent but can save a large amount of repair bandwidth.
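As a minimal illustration of the MDS property described above, the following sketch uses a single-parity XOR code with $n = 3$ and $k = 2$. The function names `encode` and `decode` are hypothetical and chosen only for this example; the code is not taken from the invention.

```python
from itertools import combinations


def _xor(u: bytes, v: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(u, v))


def encode(block_a: bytes, block_b: bytes) -> list:
    """Encode k = 2 data blocks into n = 3 pieces: a, b, and a XOR b."""
    if len(block_a) != len(block_b):
        raise ValueError("data blocks must have equal length")
    return [block_a, block_b, _xor(block_a, block_b)]


def decode(pieces: dict) -> tuple:
    """Recover (a, b) from any two pieces, keyed by piece index 0, 1, or 2."""
    if len(pieces) < 2:
        raise ValueError("at least k = 2 pieces are required")
    if 0 in pieces and 1 in pieces:
        return pieces[0], pieces[1]
    if 0 in pieces:
        # Piece 2 is a XOR b, so b = a XOR (a XOR b).
        return pieces[0], _xor(pieces[0], pieces[2])
    # Only b and the parity are available, so a = b XOR (a XOR b).
    return _xor(pieces[1], pieces[2]), pieces[1]


stored = encode(b"DATA", b"CODE")
# Every choice of 2 out of the 3 stored pieces recovers the original data.
for i, j in combinations(range(3), 2):
    assert decode({i: stored[i], j: stored[j]}) == (b"DATA", b"CODE")
```

Note that repairing any one lost piece here requires reading both surviving pieces in full, which is the repair cost that the text above attributes to MDS codes.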
As described above, conventional erasure codes require a large amount of network bandwidth to repair failed nodes, while adding extra parity reduces storage efficiency. To balance storage capacity against repair bandwidth, an information flow graph is introduced to model the distributed storage system, the system capacity is defined using network coding, and the compromise relationship between the storage capacity of a node and its repair bandwidth is characterized accordingly. Regenerating codes are constructed mainly at the minimum-storage point and the minimum-bandwidth point of the optimal compromise curve, which correspond to Minimum Storage Regenerating (MSR) codes and Minimum Bandwidth Regenerating (MBR) codes, respectively.
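For orientation, the sketch below evaluates the well-known min-cut capacity bound for the homogeneous case, $\sum_{i=0}^{k-1} \min(\alpha, (d-i)\beta)$, and checks it at the MSR and MBR points. The symbols $\alpha$ (per-node storage), $d$ (number of helper nodes), $\beta$ (download per helper), and the file size `M` do not appear in the text above and are assumptions taken from the standard regenerating-code setting; this is background for the homogeneous model, not the heterogeneous method of the invention.

```python
from fractions import Fraction


def capacity(k: int, d: int, alpha: Fraction, beta: Fraction) -> Fraction:
    """Homogeneous min-cut bound: sum over i of min(alpha, (d - i) * beta)."""
    return sum((min(alpha, (d - i) * beta) for i in range(k)), Fraction(0))


def msr_point(M: int, k: int, d: int) -> tuple:
    """Minimum-storage point: alpha = M / k, beta = M / (k * (d - k + 1))."""
    return Fraction(M, k), Fraction(M, k * (d - k + 1))


def mbr_point(M: int, k: int, d: int) -> tuple:
    """Minimum-bandwidth point: beta = 2M / (k * (2d - k + 1)), alpha = d * beta."""
    beta = Fraction(2 * M, k * (2 * d - k + 1))
    return d * beta, beta


# Example with file size M = 1, k = 2, d = 3: both points store exactly M.
for alpha, beta in (msr_point(1, 2, 3), mbr_point(1, 2, 3)):
    assert capacity(2, 3, alpha, beta) == 1
```

Exact rational arithmetic is used so that the two extreme points can be compared with the file size without floating-point tolerance.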
The above studies of data recovery with erasure codes all assume that the nodes of a distributed storage system are identical. Practical distributed systems, however, tend to be heterogeneous, i.e., the amount of data stored by each node and the amount downloaded from each helper node differ. In this case it is important to calculate the compromise between capacity and repair bandwidth for the heterogeneous distributed storage system, because regenerating codes are constructed on the basis of the compromise relationship between storage capacity and repair bandwidth.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a capacity and bandwidth compromise method and system for heterogeneous distributed storage.
The capacity and bandwidth compromise system for the heterogeneous distributed storage provided by the invention comprises the following components:
module M1: inputting parameter information of a storage system through a client module;
module M2: inputting the parameter information of the storage system into a repair sequence generation module to obtain a repair sequence;
module M3: calculating, from the repair sequence, the relationship among the size of the file that the storage system can correctly store, the storage capacity, and the bandwidth; and drawing a compromise curve from this relationship by the compromise curve drawing module;
the client module is used as a user interface;
the repair sequence generation module: for any bandwidth and capacity, analyzing the influence of the repair sequence on the minimum cut, and generating the repair sequence whose information flow graph has the smallest minimum cut;
the compromise curve drawing module: drawing the compromise curve of the storage capacity and the repair bandwidth of the storage system.
Preferably, said module M1 comprises: acquiring parameter information of the heterogeneous distributed storage system from the deployed heterogeneous distributed storage system;
the parameter information includes: the number of clusters $L$ of the heterogeneous distributed storage system, the number of storage nodes $R$ in each cluster, the number of scattered storage nodes $E$, and/or the total number of storage nodes $n$, where $n = LR + E$; the erasure code parameters $(n, k)$ adopted by the user; the intra-cluster node transmission bandwidth $\beta_I$; and the cross-cluster transmission bandwidth $\beta_C$.
Preferably, said module M2 comprises:
module M2.1: the node cluster source sequence generation module takes the parameter information of the heterogeneous distributed storage system as input and generates a node cluster source sequence; among all cluster source sequences, the generated one has the smallest minimum cut of the information flow graph;
module M2.2: the cluster position sequence generation module takes the node cluster source sequence generated by the node cluster source sequence generation module as input and generates a cluster position sequence; among all cluster position sequences for the current cluster source sequence, the generated one has the smallest minimum cut of the information flow graph;
the minimum cut value of the information flow graph is the maximum size of a file that can be stored, and thus gives the relationship among the storable file size, the storage capacity, and the bandwidth.
Preferably, said module M2.1 comprises: the node cluster source $p = (p_0, p_1, \ldots, p_i, \ldots, p_L)$; the node cluster source gives the number of helper nodes in each cluster, i.e., $p_i$ nodes of the $i$-th cluster serve as helper nodes; $p_0$ denotes the number of scattered nodes serving as helper nodes;
module M2.1.1: determining the number of scattered nodes serving as helper nodes, and selecting those scattered nodes as helper nodes;
module M2.1.2: selecting the remaining helper nodes from the clusters in the node cluster source in ascending order of cluster number; repeating modules M2.1.1 to M2.1.2 until all selected nodes have been taken;
said module M2.2 comprises: the node position order $q = (q_1, q_2, \ldots, q_i, \ldots, q_k)$; the node position order gives the number of the cluster to which each node in a repair sequence belongs, i.e., the $i$-th repair node comes from cluster $q_i$;
module M2.2.1: starting from cluster 1, selecting nodes preferentially in ascending order of cluster number;
module M2.2.2: when the highest-numbered cluster has been reached, or the current cluster has no selectable node, starting again from cluster 1; repeating modules M2.2.1 to M2.2.2 until all selected nodes have been taken;
module M2.2.3: selecting all scattered nodes as helper nodes.
Preferably, said module M3 comprises:
module M3.1: sequentially calculating the incoming-edge weight coefficients $a_i$ and $b_i$ of the $k$ selected nodes in the information flow graph;
module M3.2: combining the incoming-edge weight coefficients $a_i$ and $b_i$ with the intra-cluster node transmission bandwidth $\beta_I$ and the cross-cluster transmission bandwidth $\beta_C$ to calculate the edge weight $w_i$;
module M3.3: calculating the compromise relationship between each edge weight $w_i$ and $\beta_C$, combining the compromise relationships of the $k$ selected nodes, and drawing the compromise curve by an iterative method.
The invention provides a method for compromising capacity and bandwidth of heterogeneous distributed storage, which comprises the following steps:
step M1: inputting parameter information of a storage system through a client module;
step M2: inputting the parameter information of the storage system into a repair sequence generation module to obtain a repair sequence;
step M3: calculating, from the repair sequence, the relationship among the size of the file that the storage system can correctly store, the storage capacity, and the bandwidth; and drawing a compromise curve from this relationship in the compromise curve drawing step;
the client module is used as a user interface;
the repair sequence generation module: for any bandwidth and capacity, analyzing the influence of the repair sequence on the minimum cut, and generating the repair sequence whose information flow graph has the smallest minimum cut;
the compromise curve drawing module: drawing the compromise curve of the storage capacity and the repair bandwidth of the storage system.
Preferably, the step M1 includes: acquiring parameter information of the heterogeneous distributed storage system from the deployed heterogeneous distributed storage system;
the parameter information includes: the number of clusters $L$ of the heterogeneous distributed storage system, the number of storage nodes $R$ in each cluster, the number of scattered storage nodes $E$, and/or the total number of storage nodes $n$, where $n = LR + E$; the erasure code parameters $(n, k)$ adopted by the user; the intra-cluster node transmission bandwidth $\beta_I$; and the cross-cluster transmission bandwidth $\beta_C$.
Preferably, the step M2 includes:
step M2.1: the node cluster source sequence generation module takes the parameter information of the heterogeneous distributed storage system as input and generates a node cluster source sequence; among all cluster source sequences, the generated one has the smallest minimum cut of the information flow graph;
step M2.2: the cluster position sequence generation module takes the node cluster source sequence generated by the node cluster source sequence generation module as input and generates a cluster position sequence; among all cluster position sequences for the current cluster source sequence, the generated one has the smallest minimum cut of the information flow graph;
the minimum cut value of the information flow graph is the maximum size of a file that can be stored, and thus gives the relationship among the storable file size, the storage capacity, and the bandwidth.
Preferably, said step M2.1 comprises: the node cluster source $p = (p_0, p_1, \ldots, p_i, \ldots, p_L)$; the node cluster source gives the number of helper nodes in each cluster, i.e., $p_i$ nodes of the $i$-th cluster serve as helper nodes; $p_0$ denotes the number of scattered nodes serving as helper nodes;
step M2.1.1: determining the number of scattered nodes serving as helper nodes, and selecting those scattered nodes as helper nodes;
step M2.1.2: selecting the remaining helper nodes from the clusters in the node cluster source in ascending order of cluster number; repeating steps M2.1.1 to M2.1.2 until all selected nodes have been taken;
said step M2.2 comprises: the node position order $q = (q_1, q_2, \ldots, q_i, \ldots, q_k)$; the node position order gives the number of the cluster to which each node in a repair sequence belongs, i.e., the $i$-th repair node comes from cluster $q_i$;
step M2.2.1: starting from cluster 1, selecting nodes preferentially in ascending order of cluster number;
step M2.2.2: when the highest-numbered cluster has been reached, or the current cluster has no selectable node, starting again from cluster 1; repeating steps M2.2.1 to M2.2.2 until all selected nodes have been taken;
step M2.2.3: selecting all scattered nodes as helper nodes.
Preferably, the step M3 includes:
step M3.1: sequentially calculating the incoming-edge weight coefficients $a_i$ and $b_i$ of the $k$ selected nodes in the information flow graph;
step M3.2: combining the incoming-edge weight coefficients $a_i$ and $b_i$ with the intra-cluster node transmission bandwidth $\beta_I$ and the cross-cluster transmission bandwidth $\beta_C$ to calculate the edge weight $w_i$;
step M3.3: calculating the compromise relationship between each edge weight $w_i$ and $\beta_C$, combining the compromise relationships of the $k$ selected nodes, and drawing the compromise curve by an iterative method.
Compared with the prior art, the invention has the following beneficial effects: existing research considers the capacity and bandwidth compromise for homogeneous distributed storage systems, where homogeneous means that the transmission bandwidth between all nodes is the same; for heterogeneous distributed storage systems, the invention provides a method for drawing the compromise curve of storage capacity and repair bandwidth by analyzing the upper bound on the storable file size achievable under different repair schemes.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a general block diagram of the process;
FIG. 2 is a diagram of algorithm 0-1 generating a repair scenario;
FIG. 3 is a plot of the storage capacity and repair bandwidth compromise curve for this cluster, drawn using algorithm 0-2.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention aims to provide an efficient and feasible method for calculating the system capacity of a heterogeneous distributed storage system, and a method for drawing the compromise curve of storage capacity and repair bandwidth. For a heterogeneous distributed storage system with given parameters $(n, k, L, R, E = 0)$, the repair scheme whose information flow graph has the smallest minimum cut among all feasible repair sequences is generated, the system capacity is calculated, and the compromise curve is then drawn from the system capacity.
A distributed storage system provides cloud storage with low cost, strong scalability, high access speed, and high reliability. In a heterogeneous distributed storage system, calculating the system capacity and the compromise between storage capacity and repair bandwidth are very important problems. By studying the repair sequence of a heterogeneous distributed storage system when nodes fail, the invention provides a method for generating the repair node sequence whose information flow graph has the smallest minimum cut, a method for drawing the compromise curve of storage capacity and repair bandwidth from that repair sequence, and a practical calculation method. The method is suitable for capacity calculation and compromise curve drawing in the heterogeneous distributed storage systems commonly used today.
The capacity and bandwidth compromise system for the heterogeneous distributed storage comprises a client module, a repair sequence generation module and a compromise curve drawing module;
module M1: inputting parameter information of a storage system through a client module;
specifically, the module M1 includes: acquiring parameter information of the heterogeneous distributed storage system from the deployed heterogeneous distributed storage system;
the parameter information includes: the number of clusters $L$ of the heterogeneous distributed storage system, the number of storage nodes $R$ in each cluster, the number of scattered storage nodes $E$, and/or the total number of storage nodes $n$, where $n = LR + E$; the erasure code parameters $(n, k)$ adopted by the user; the intra-cluster node transmission bandwidth $\beta_I$; and the cross-cluster transmission bandwidth $\beta_C$.
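The parameter information collected by module M1 could be held in a small record such as the one sketched below. The class name `StorageParams`, its field names, and the validation rules (non-negative counts and bandwidths, $0 < k \le n$) are assumptions made for illustration; only the relation $n = LR + E$ comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageParams:
    """Hypothetical container for the inputs collected by module M1."""
    L: int         # number of clusters
    R: int         # storage nodes per cluster
    E: int         # scattered storage nodes outside any cluster
    k: int         # erasure code parameter k of the (n, k) code
    beta_I: float  # intra-cluster node transmission bandwidth
    beta_C: float  # cross-cluster transmission bandwidth

    @property
    def n(self) -> int:
        """Total number of storage nodes, n = L * R + E."""
        return self.L * self.R + self.E

    def validate(self) -> None:
        """Reject obviously inconsistent inputs (assumed rules, see lead-in)."""
        if min(self.L, self.R, self.E) < 0:
            raise ValueError("L, R and E must be non-negative")
        if not 0 < self.k <= self.n:
            raise ValueError("k must satisfy 0 < k <= n")
        if self.beta_I < 0 or self.beta_C < 0:
            raise ValueError("bandwidths must be non-negative")


params = StorageParams(L=3, R=4, E=0, k=6, beta_I=2.0, beta_C=1.0)
params.validate()
assert params.n == 12
```

Deriving `n` from `L`, `R`, and `E` rather than storing it separately avoids inputs in which the stated total disagrees with the cluster layout.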
Module M2: inputting the parameter information of the storage system into a repair sequence generation module to obtain a repair sequence;
specifically, the module M2 includes:
module M2.1: the node cluster source sequence generation module takes the parameter information of the storage system as input and generates a node cluster source sequence; among all cluster source sequences, the generated one has the smallest minimum cut of the information flow graph;
the minimum cut represents the maximum size of a file that the system can correctly repair, i.e., it gives the relationship among the storable file size, the capacity, and the bandwidth; a file that exceeds this size may not be repaired correctly. Only files no larger than the minimum cut are guaranteed to be repaired correctly.
In particular, said module M2.1 comprises: the node cluster source $p = (p_0, p_1, \ldots, p_i, \ldots, p_L)$; the node cluster source gives the number of helper nodes in each cluster, i.e., $p_i$ nodes of the $i$-th cluster serve as helper nodes; $p_0$ denotes the number of scattered nodes serving as helper nodes;
module M2.1.1: determining the number of scattered nodes serving as helper nodes, and selecting those scattered nodes as helper nodes;
module M2.1.2: selecting the remaining helper nodes from the clusters in the node cluster source in ascending order of cluster number; repeating modules M2.1.1 to M2.1.2 until all selected nodes have been taken;
module M2.2: the cluster position sequence generation module takes the node cluster source sequence generated by the node cluster source sequence generation module as input and generates a cluster position sequence; among all cluster position sequences for the current cluster source sequence, the generated one has the smallest minimum cut of the information flow graph;
the minimum cut value of the information flow graph is the maximum size of a file that can be stored, and thus gives the relationship among the storable file size, the storage capacity, and the bandwidth.
Said module M2.2 comprises: the node position order $q = (q_1, q_2, \ldots, q_i, \ldots, q_k)$; the node position order gives the number of the cluster to which each node in a repair sequence belongs, i.e., the $i$-th repair node comes from cluster $q_i$;
module M2.2.1: starting from cluster 1, selecting nodes preferentially in ascending order of cluster number;
module M2.2.2: when the highest-numbered cluster has been reached, or the current cluster has no selectable node, starting again from cluster 1; modules M2.2.1 to M2.2.2 are repeated until all selected nodes have been taken.
Module M2.2.3: selecting all scattered nodes as helper nodes.
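One possible reading of modules M2.2.1 to M2.2.3 is a round-robin pass over the clusters, sketched below. The function name `cluster_position_order`, the use of 0 to mark a scattered node, and the choice to place scattered nodes last are assumptions of this sketch; the text does not fix these details.

```python
def cluster_position_order(p: list) -> list:
    """Build a node position order q from a node cluster source p.

    p[0] is the number of scattered nodes used; p[i] for i >= 1 is the
    number of nodes taken from cluster i. Clusters are visited in ascending
    order, one node per visit, wrapping back to cluster 1 after the
    highest-numbered cluster and skipping clusters with no nodes left.
    Scattered nodes are appended at the end and marked with 0 (assumption).
    """
    if any(count < 0 for count in p):
        raise ValueError("counts in p must be non-negative")
    remaining = list(p)
    order = []
    while any(remaining[1:]):
        for cluster in range(1, len(remaining)):
            if remaining[cluster] > 0:
                order.append(cluster)
                remaining[cluster] -= 1
    order.extend([0] * remaining[0])
    return order


# Three clusters contributing 3, 2 and 1 nodes, plus one scattered node.
assert cluster_position_order([1, 3, 2, 1]) == [1, 2, 3, 1, 2, 1, 0]
```

The sketch only produces an ordering; it does not verify that this ordering attains the smallest minimum cut of the information flow graph, which is the property the text attributes to the generated sequence.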
Module M3: calculating, from the repair sequence, the relationship among the size of the file that the storage system can correctly store, the storage capacity, and the bandwidth; drawing the compromise curve from this relationship by the compromise curve drawing module; specifically, enumerating a plurality of bandwidth points, calculating the minimum storage capacity corresponding to each bandwidth point, and fitting the points to form the compromise curve;
specifically, the module M3 includes:
module M3.1: sequentially calculating the incoming-edge weight coefficients $a_i$ and $b_i$ of the $k$ selected nodes in the information flow graph;
module M3.2: combining the incoming-edge weight coefficients $a_i$ and $b_i$ with the intra-cluster node transmission bandwidth $\beta_I$ and the cross-cluster transmission bandwidth $\beta_C$ to calculate the edge weight $w_i$;
module M3.3: calculating the compromise relationship between each edge weight $w_i$ and $\beta_C$, combining the compromise relationships of the $k$ selected nodes, and drawing the compromise curve by an iterative method.
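The enumeration of bandwidth points in module M3 can be sketched as follows. Two assumptions are made that the text does not state: the edge weight is taken as the linear combination $w_i = a_i \beta_I + b_i \beta_C$, and the storable file size is bounded by $\sum_i \min(\alpha, w_i)$ for a per-node storage capacity $\alpha$. The names `edge_weights`, `min_storage`, `tradeoff_points`, the fixed ratio $\beta_I / \beta_C$, and the coefficient values in the example are likewise hypothetical.

```python
from fractions import Fraction


def edge_weights(a, b, beta_I, beta_C):
    """Assumed form of the edge weight: w_i = a_i * beta_I + b_i * beta_C."""
    return [ai * beta_I + bi * beta_C for ai, bi in zip(a, b)]


def min_storage(weights, file_size):
    """Smallest alpha with sum(min(alpha, w_i)) >= file_size, or None."""
    ws = sorted(weights)
    if sum(ws) < file_size:
        return None  # this bandwidth cannot support the file at any alpha
    k = len(ws)
    prefix = 0
    for j, w in enumerate(ws):
        # Assume the j smallest weights lie below alpha; the remaining
        # k - j terms each contribute alpha to the sum.
        alpha = Fraction(file_size - prefix) / (k - j)
        if alpha <= w:
            return alpha
        prefix += w
    return None


def tradeoff_points(a, b, ratio, beta_C_values, file_size):
    """Return (beta_C, minimum alpha) pairs, with beta_I = ratio * beta_C."""
    points = []
    for beta_C in beta_C_values:
        weights = edge_weights(a, b, ratio * beta_C, beta_C)
        alpha = min_storage(weights, file_size)
        if alpha is not None:
            points.append((beta_C, alpha))
    return points


bandwidths = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 4)]
points = tradeoff_points([1, 1], [2, 1], 1, bandwidths, 1)
# beta_C = 1/10 is infeasible; the two remaining points form the curve.
assert points == [(Fraction(1, 5), Fraction(3, 5)), (Fraction(1, 4), Fraction(1, 2))]
```

The resulting pairs can be passed to any plotting routine to draw the curve; larger bandwidth values yield smaller minimum storage, which matches the compromise described in the text.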
The client module: serves as the user interface; the user inputs the parameter information of the storage system through the client module, and the client module returns the compromise curve to the user once drawing is complete.
The repair sequence generation module: for any bandwidth and capacity, analyzes the influence of the repair sequence on the minimum cut and generates the repair sequence whose information flow graph has the smallest minimum cut, taking into account the impact of the cluster source and cluster position of the helper nodes on the minimum cut of the information flow graph. Among all possible cluster source sequences, the generated node cluster source sequence has the smallest minimum cut of the information flow graph.
The compromise curve drawing module: draws the compromise curve of the storage capacity and the repair bandwidth of the storage system.
The invention provides a capacity and bandwidth compromise method for heterogeneous distributed storage, which comprises a client step, a repair sequence generation step and a compromise curve drawing step;
step M1: inputting parameter information of a storage system through a client module;
specifically, the step M1 includes: acquiring parameter information of the heterogeneous distributed storage system from the deployed heterogeneous distributed storage system;
the parameter information includes: the number of clusters $L$ of the storage system, the number of storage nodes $R$ in each cluster, the number of scattered storage nodes $E$, and/or the total number of storage nodes $n$, where $n = LR + E$; the erasure code parameters $(n, k)$ adopted by the user; the intra-cluster node transmission bandwidth $\beta_I$; and the cross-cluster transmission bandwidth $\beta_C$.
Step M2: inputting the parameter information of the storage system into a repair sequence generation module to obtain a repair sequence;
specifically, the step M2 includes:
step M2.1: the node cluster source sequence generation module takes the parameter information of the storage system as input and generates a node cluster source sequence; among all cluster source sequences, the generated one has the smallest minimum cut of the information flow graph;
the minimum cut represents the maximum size of a file that the system can correctly repair, i.e., it gives the relationship among the storable file size, the capacity, and the bandwidth; a file that exceeds this size may not be repaired correctly. Only files no larger than the minimum cut are guaranteed to be repaired correctly.
In particular, said step M2.1 comprises: the node cluster source $p = (p_0, p_1, \ldots, p_i, \ldots, p_L)$; the node cluster source gives the number of helper nodes in each cluster, i.e., $p_i$ nodes of the $i$-th cluster serve as helper nodes; $p_0$ denotes the number of scattered nodes serving as helper nodes;
step M2.1.1: determining the number of scattered nodes serving as helper nodes, and selecting those scattered nodes as helper nodes;
step M2.1.2: selecting the remaining helper nodes from the clusters in the node cluster source in ascending order of cluster number; repeating steps M2.1.1 to M2.1.2 until all selected nodes have been taken;
step M2.2: the cluster position sequence generation module takes the node cluster source sequence generated by the node cluster source sequence generation module as input and generates a cluster position sequence; among all cluster position sequences for the current cluster source sequence, the generated one has the smallest minimum cut of the information flow graph;
the minimum cut value of the information flow graph is the maximum size of a file that can be stored, and thus gives the relationship among the storable file size, the storage capacity, and the bandwidth.
Said step M2.2 comprises: the node position order $q = (q_1, q_2, \ldots, q_i, \ldots, q_k)$; the node position order gives the number of the cluster to which each node in a repair sequence belongs, i.e., the $i$-th repair node comes from cluster $q_i$;
step M2.2.1: starting from cluster 1, selecting nodes preferentially in ascending order of cluster number;
step M2.2.2: when the highest-numbered cluster has been reached, or the current cluster has no selectable node, starting again from cluster 1; steps M2.2.1 to M2.2.2 are repeated until all selected nodes have been taken.
Step M2.2.3: selecting all scattered nodes as helper nodes.
Step M3: calculating, from the repair sequence, the relationship among the size of the file that the storage system can correctly store, the storage capacity, and the bandwidth; drawing the compromise curve from this relationship in the compromise curve drawing step; specifically, enumerating a plurality of bandwidth points, calculating the minimum storage capacity corresponding to each bandwidth point, and fitting the points to form the compromise curve;
specifically, the step M3 includes:
step M3.1: sequentially calculating the incoming-edge weight coefficients $a_i$ and $b_i$ of the $k$ selected nodes in the information flow graph;
step M3.2: combining the incoming-edge weight coefficients $a_i$ and $b_i$ with the intra-cluster node transmission bandwidth $\beta_I$ and the cross-cluster transmission bandwidth $\beta_C$ to calculate the edge weight $w_i$;
step M3.3: calculating the compromise relationship between each edge weight $w_i$ and $\beta_C$, combining the compromise relationships of the $k$ selected nodes, and drawing the compromise curve by an iterative method.
The client module: serves as the user interface; the user inputs the parameter information of the storage system through the client module, and the client module returns the compromise curve to the user once it has been drawn.
The repair sequence generation module: for any bandwidth and capacity, analyzes the influence of the repair sequence on the minimum cut and generates the repair sequence whose information flow graph has the smallest minimum cut; the influence of both the cluster source and the cluster position of the help nodes on the minimum cut of the information flow graph is considered. Among all possible cluster source sequences, the generated node cluster source sequence yields the smallest minimum cut of the information flow graph.
The compromise curve drawing module: draws the compromise curve between the storage capacity and the repair bandwidth of the storage system.
The present invention is further described in detail by the following preferred examples:
the implementation scheme comprises three parts: the implementation of the client module, the implementation of the repair sequence generation module and the implementation of the compromise curve drawing module.
Client module implementation
The client module consists of an input module and an output module. The input module receives parameters input by a user and transmits the parameters to the repair sequence generation module. And the output module outputs the image drawn by the compromise curve drawing module in a form specified by a user.
Repair sequence generation module implementation
A repair node sequence is described by the node cluster source p = (p_0, p_1, ..., p_L) and the node position order q = (q_1, q_2, ..., q_k). The node cluster source gives the number of help nodes in each cluster, i.e., p_i nodes of the i-th cluster are used as help nodes, and p_0 is the number of scatter points used as help nodes. The node position order describes the number of the cluster to which each node belongs in a repair scheme, i.e., the i-th repair node comes from cluster q_i. The node cluster source and the node position order both affect the size of the minimum cut of the information flow graph corresponding to the repair scheme.
The invention employs the following algorithm to determine the node cluster source.
First, the number of scatter points used as help nodes is determined: all scatter points are selected as help nodes, so that p_0 = E.
For the remaining help nodes, selection is made preferentially from the cluster with the smaller cluster number. That is, nodes are first selected from cluster 1; when all R nodes in cluster 1 have been selected, nodes are selected from cluster 2; when all R nodes in cluster 2 have been selected, nodes are selected from cluster 3; and so on.
It can be proved that, among all possible node cluster source sequences, the one generated by the above algorithm yields the smallest minimum cut of the information flow graph.
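The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's Algorithm 0-1: the function name `node_cluster_source` is hypothetical, the list index 0 is used for the scatter points, and the sketch assumes E <= k <= n = LR + E so that all scatter points can be selected.

```python
def node_cluster_source(L, R, E, k):
    """Return p = [p_0, p_1, ..., p_L] for k selected nodes.

    L: number of clusters, R: nodes per cluster, E: number of scatter points.
    Hypothetical helper; assumes E <= k <= L*R + E.
    """
    n = L * R + E
    if not E <= k <= n:
        raise ValueError("this sketch assumes E <= k <= n = L*R + E")
    p = [0] * (L + 1)
    p[0] = E                      # all scatter points are selected, p_0 = E
    remaining = k - E
    for cluster in range(1, L + 1):
        take = min(R, remaining)  # fill the lowest-numbered cluster first
        p[cluster] = take
        remaining -= take
    return p


print(node_cluster_source(L=3, R=4, E=2, k=8))
```

With L = 3, R = 4, E = 2 and k = 8, the two scatter points are taken first, cluster 1 is filled completely and the remaining two nodes come from cluster 2.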
After determining the cluster source of the node, the present invention determines the node cluster location by the following method.
Starting from cluster 1, nodes are preferably selected from low to high according to the cluster number.
When the highest-numbered cluster has been reached, or no node can be selected in the current cluster, selection starts again from cluster 1.
The above steps are repeated until all cluster nodes have been taken.
Finally, each scatter point is selected in turn as a help node.
It can be demonstrated that, among all possible distributions of node cluster positions, the one determined by the above algorithm yields the smallest minimum cut of the information flow graph.
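One reading of the position-ordering steps above is a round-robin pass over the clusters followed by the scatter points. The Python sketch below illustrates that reading only; the function name `node_position_order` and the use of 0 as the label for scatter points are assumptions, and the patent's own pseudocode (Algorithm 0-1 in Fig. 2) is not reproduced here.

```python
def node_position_order(p):
    """Build q from p = [p_0, p_1, ..., p_L].

    p[0] is the number of scatter points, p[j] the number of nodes
    taken from cluster j. Scatter points are labelled 0 in q (assumption).
    """
    left = list(p[1:])            # nodes still to be placed, per cluster
    q = []
    while any(left):
        # One pass from cluster 1 upward, taking one node per cluster.
        for j in range(1, len(left) + 1):
            if left[j - 1] == 0:
                # Exhausted cluster: skip it. When p is filled from the
                # lowest-numbered cluster first, all higher clusters are
                # also exhausted, so this equals restarting from cluster 1.
                continue
            q.append(j)
            left[j - 1] -= 1
    q.extend([0] * p[0])          # scatter points are placed last
    return q


print(node_position_order([2, 4, 2, 0]))
```

For p = (2, 4, 2, 0) the clusters are visited as 1, 2, 1, 2, 1, 1 and the two scatter points close the sequence.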
The pseudocode of the algorithm is given as Algorithm 0-1 in Fig. 2. Once the repair sequence of the nodes is obtained, the system capacity can be obtained by calculating the minimum cut of the corresponding information flow graph.
Compromise curve drawing module
The following parameters are required to draw the compromise curve: the size M of the file to be stored; when a node in a cluster is damaged, the new node downloads β_I of data from nodes in the same cluster (referred to as the intra-cluster repair bandwidth) and β_C of data from nodes outside the cluster (referred to as the cross-cluster repair bandwidth). Once the repair scheme corresponding to the information flow graph with the smallest minimum cut is known, Algorithm 0-2 is used, as shown in Fig. 3, to draw the compromise curve between the storage capacity and the repair bandwidth of the cluster.
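Algorithm 0-2 itself appears only in Fig. 3, so the following Python sketch is a rough illustration of the enumeration idea rather than the patent's procedure. It rests on several assumptions that the text does not state: the per-node storage capacity is written α, each edge weight is taken as w_i = a_i β_I + b_i β_C, and the minimum cut is taken as the sum over the k selected nodes of min(α, w_i), which is the usual form of the information-flow-graph bound for regenerating codes. The coefficient lists `a` and `b`, the function names and the sample numbers are all hypothetical.

```python
def min_alpha(M, weights):
    """Smallest alpha with sum(min(alpha, w) for w in weights) >= M.

    Returns None when even unlimited storage cannot reach M.
    """
    if sum(weights) < M:
        return None
    ws = sorted(weights)
    k = len(ws)
    prefix = 0.0                      # sum of weights already saturated
    for i, w in enumerate(ws):
        # If alpha lies at or below ws[i], the i smallest weights are
        # saturated and the other k - i nodes each contribute alpha.
        alpha = (M - prefix) / (k - i)
        if alpha <= w:
            return alpha
        prefix += w
    return ws[-1]                     # guard against floating-point rounding


def tradeoff_points(M, a, b, beta_I, beta_C_values):
    """Enumerate cross-cluster bandwidth points and the minimum alpha for each."""
    points = []
    for beta_C in beta_C_values:
        # Assumed linear form of the edge weights w_i.
        weights = [ai * beta_I + bi * beta_C for ai, bi in zip(a, b)]
        alpha = min_alpha(M, weights)
        if alpha is not None:         # infeasible bandwidth points are skipped
            points.append((beta_C, alpha))
    return points


points = tradeoff_points(M=5, a=[2, 1, 0], b=[1, 1, 1],
                         beta_I=1.0, beta_C_values=[0.5, 1.0, 2.0])
print(points)
```

The returned (β_C, α) pairs can then be joined or fitted to obtain the curve; in this toy input the point β_C = 0.5 is dropped because the total edge weight is below M.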
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (8)

1. A capacity and bandwidth tradeoff system for heterogeneous distributed storage, comprising:
module M1: inputting parameter information of a storage system through a client module;
module M2: inputting the parameter information of the storage system into a repair sequence generation module to obtain a repair sequence;
module M3: from the repair sequence, calculating the relation among the size of the file that the storage system can correctly store, the storage capacity of the system and the bandwidth; using this relation to draw a curve in the compromise curve drawing step;
the client module is used as a user interface;
the repair sequence generation module: for any bandwidth and capacity, analyzing the influence of the repair sequence on the minimum cut to generate the repair sequence with the minimum cut of the information flow graph;
the compromise curve drawing module: drawing a compromise curve of the storage capacity and the repair bandwidth of the storage system;
the module M2 includes:
module M2.1: the node cluster source sequence generation module takes the parameter information of the storage system as input and generates a node cluster source sequence; among all cluster source sequences, the generated one yields the smallest minimum cut of the information flow graph;
module M2.2: the cluster position sequence generation module takes as input the node cluster source sequence produced by the node cluster source sequence generation module and generates a cluster position sequence; among all cluster position sequences for the current cluster source sequence, the generated one yields the smallest minimum cut of the information flow graph;
the minimum cut value of the information flow graph is the maximum size of a file that can be stored, which gives the relation among the storable file size, the storage capacity and the bandwidth.
2. The capacity and bandwidth tradeoff system for heterogeneous distributed storage according to claim 1, wherein the module M1 comprises: acquiring parameter information of the heterogeneous distributed storage system through the built heterogeneous distributed storage system;
the parameter information includes: the number L of clusters of the heterogeneous distributed storage system, the number R of storage nodes in each cluster, the number E of scattered storage points and/or the total number n of storage points, where n = LR + E; the erasure code parameters (n, k) adopted by the user; the intra-cluster node transmission bandwidth β_I; and the cross-cluster transmission bandwidth β_C.
3. The capacity and bandwidth tradeoff system for heterogeneous distributed storage according to claim 1, wherein said module M2.1 comprises: the node cluster source p = (p_0, p_1, ..., p_i, ..., p_L); the node cluster source gives the number of help nodes in each cluster, i.e., p_i nodes of the i-th cluster are used as help nodes; p_0 is the number of scatter points used as help nodes;
module M2.1.1: determining the number of scattered points as help nodes, and selecting the scattered points as the help nodes;
module M2.1.2: for the selection of the remaining help nodes, selecting from the clusters in the node cluster source in ascending order of cluster number; repeating modules M2.1.1 to M2.1.2 until all selected nodes have been taken;
said module M2.2 comprises: the node position order q = (q_1, q_2, ..., q_i, ..., q_k) describes the number of the cluster to which each node belongs in a repair sequence, i.e., the i-th repair node comes from cluster q_i;
module M2.2.1: starting from the cluster 1, preferentially selecting nodes from low to high according to the cluster number;
module M2.2.2: when the highest-numbered cluster has been reached, or no node can be selected in the current cluster, starting again from cluster 1; repeating modules M2.2.1 to M2.2.2 until all selected nodes have been taken;
module M2.2.3: and selecting all scatter points as the help nodes.
4. The capacity and bandwidth tradeoff system for heterogeneous distributed storage according to claim 1, wherein the module M3 comprises:
module M3.1: sequentially calculating the incoming-edge weight coefficients a_i and b_i of the k selected nodes in the information flow graph;
module M3.2: combining the incoming-edge weight coefficients a_i and b_i with the intra-cluster node transmission bandwidth β_I and the cross-cluster transmission bandwidth β_C to calculate the edge weights w_i;
module M3.3: separately calculating the compromise relationship between each edge weight w_i and β_C, integrating the compromise relationships of the k selected nodes, and drawing the compromise curve by an iterative method.
5. A capacity and bandwidth compromise method for heterogeneous distributed storage is characterized by comprising the following steps:
step M1: inputting parameter information of a storage system through a client module;
step M2: inputting the parameter information of the storage system into a repair sequence generation module to obtain a repair sequence;
step M3: from the repair sequence, calculating the relation among the size of the file that the storage system can correctly store, the storage capacity of the system and the bandwidth; using this relation to draw a curve in the compromise curve drawing step;
the client module is used as a user interface;
the repair sequence generation module: for any bandwidth and capacity, analyzing the influence of the repair sequence on the minimum cut to generate the repair sequence with the minimum cut of the information flow graph;
the compromise curve drawing module: drawing a compromise curve of the storage capacity and the repair bandwidth of the storage system;
the step M2 includes:
step M2.1: the node cluster source sequence generation module takes the parameter information of the storage system as input and generates a node cluster source sequence; among all cluster source sequences, the generated one yields the smallest minimum cut of the information flow graph;
step M2.2: the cluster position sequence generation module takes as input the node cluster source sequence produced by the node cluster source sequence generation module and generates a cluster position sequence; among all cluster position sequences for the current cluster source sequence, the generated one yields the smallest minimum cut of the information flow graph;
the minimum cut value of the information flow graph is the maximum size of a file that can be stored, which gives the relation among the storable file size, the storage capacity and the bandwidth.
6. The capacity and bandwidth trade-off method for the heterogeneous distributed storage according to claim 5, wherein the step M1 comprises: acquiring parameter information of the heterogeneous distributed storage system through the built heterogeneous distributed storage system;
the parameter information includes: the number L of clusters of the heterogeneous distributed storage system, the number R of storage nodes in each cluster, the number E of scattered storage points and/or the total number n of storage points, where n = LR + E; the erasure code parameters (n, k) adopted by the user; the intra-cluster node transmission bandwidth β_I; and the cross-cluster transmission bandwidth β_C.
7. The method of claim 5, wherein the step M2.1 comprises: the node cluster source p = (p_0, p_1, ..., p_i, ..., p_L); the node cluster source gives the number of help nodes in each cluster, i.e., p_i nodes of the i-th cluster are used as help nodes; p_0 is the number of scatter points used as help nodes;
step M2.1.1: determining the number of scattered points as help nodes, and selecting the scattered points as the help nodes;
step M2.1.2: for the selection of the rest of the help nodes, the cluster numbers in the node cluster source are sequentially selected from small to large; repeating the steps M2.1.1 to M2.1.2 until all the selected nodes are obtained;
said step M2.2 comprises: the node position order q = (q_1, q_2, ..., q_i, ..., q_k) describes the number of the cluster to which each node belongs in a repair sequence, i.e., the i-th repair node comes from cluster q_i;
step M2.2.1: starting from the cluster 1, preferentially selecting nodes from low to high according to the cluster number;
step M2.2.2: when the cluster with the largest number is obtained or the current cluster has no node selection, the cluster 1 is obtained again; repeating the steps M2.2.1 to M2.2.2 until all the selected nodes are obtained;
step M2.2.3: and selecting all scatter points as the help nodes.
8. The capacity and bandwidth trade-off method for heterogeneous distributed storage according to claim 5, wherein said step M3 comprises:
step M3.1: sequentially calculating the incoming-edge weight coefficients a_i and b_i of the k selected nodes in the information flow graph;
step M3.2: combining the incoming-edge weight coefficients a_i and b_i with the intra-cluster node transmission bandwidth β_I and the cross-cluster transmission bandwidth β_C to calculate the edge weights w_i;
step M3.3: separately calculating the compromise relationship between each edge weight w_i and β_C, integrating the compromise relationships of the k selected nodes, and drawing the compromise curve by an iterative method.
CN201911355800.4A 2019-12-25 2019-12-25 Capacity and bandwidth compromise method and system for heterogeneous distributed storage Active CN111131457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911355800.4A CN111131457B (en) 2019-12-25 2019-12-25 Capacity and bandwidth compromise method and system for heterogeneous distributed storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911355800.4A CN111131457B (en) 2019-12-25 2019-12-25 Capacity and bandwidth compromise method and system for heterogeneous distributed storage

Publications (2)

Publication Number Publication Date
CN111131457A CN111131457A (en) 2020-05-08
CN111131457B true CN111131457B (en) 2021-11-30

Family

ID=70503588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911355800.4A Active CN111131457B (en) 2019-12-25 2019-12-25 Capacity and bandwidth compromise method and system for heterogeneous distributed storage

Country Status (1)

Country Link
CN (1) CN111131457B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102857558A (en) * 2012-08-13 2013-01-02 广东科学技术职业学院 Dynamically constructed and autonomously managed mobile cloud storage cluster system
CN103124299A (en) * 2013-03-21 2013-05-29 杭州电子科技大学 Distributed block-level storage system in heterogeneous environment
CN108512918A (en) * 2018-03-23 2018-09-07 山东大学 The data processing method of heterogeneous distributed storage system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103645861B (en) * 2013-12-03 2016-04-13 华中科技大学 The reconstructing method of failure node in a kind of correcting and eleting codes cluster
CN105159603B (en) * 2015-08-18 2018-01-12 福建省海峡信息技术有限公司 A kind of restorative procedure of distributed data-storage system
US10120764B1 (en) * 2016-07-29 2018-11-06 Nutanix, Inc. Efficient disaster recovery across heterogeneous storage systems
CN107977167B (en) * 2017-12-01 2020-08-18 西安交通大学 Erasure code based degeneration reading optimization method for distributed storage system
CN110212923B (en) * 2019-05-08 2020-11-17 西安交通大学 Distributed erasure code storage system data restoration method based on simulated annealing
CN110636122A (en) * 2019-09-11 2019-12-31 中移(杭州)信息技术有限公司 Distributed storage method, server, system, electronic device, and storage medium

Also Published As

Publication number Publication date
CN111131457A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
US9048862B2 (en) Systems and methods for selecting data compression for storage data in a storage system
US8788913B1 (en) Selection of erasure code parameters for no data repair
CN102970043B (en) A kind of compression hardware system based on GZIP and accelerated method thereof
US9378155B2 (en) Method for processing and verifying remote dynamic data, system using the same, and computer-readable medium
US10122379B1 (en) Content-aware compression of data with reduced number of class codes to be encoded
CN110301095B (en) Non-binary context hybrid compressor/decompressor
US20180024746A1 (en) Methods of encoding and storing multiple versions of data, method of decoding encoded multiple versions of data and distributed storage system
CN104461641A (en) Data burning and writing method, system and equipment and target equipment
CN113687975A (en) Data processing method, device, equipment and storage medium
CN117574976B (en) Large language model software and hardware collaborative quantization acceleration calculation method and system
JP4028381B2 (en) Method and apparatus for encoding information using multiple passes and decoding in a single pass
US8515882B2 (en) Efficient storage of individuals for optimization simulation
CN105844210A (en) Hardware efficient fingerprinting
CN105528183A (en) Data storage method and storage equipment
CN106658034A (en) File storage and reading method and device
CN111131457B (en) Capacity and bandwidth compromise method and system for heterogeneous distributed storage
US9450619B2 (en) Dynamic log likelihood ratio quantization for solid state drive controllers
CN113609090A (en) Data storage method and device, computer readable storage medium and electronic equipment
van den Bos et al. Domain-specific optimization in digital forensics
WO2023082629A1 (en) Data storage method and apparatus, electronic device, and storage medium
CN107463462B (en) Data restoration method and data restoration device
US20090112900A1 (en) Collaborative Compression
CN113157715B (en) Erasure code data center rack collaborative updating method
Luby Repair rate lower bounds for distributed storage
CN103326731B (en) A kind of Hidden Markov correlated source coded method encoded based on distributed arithmetic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant