CN107273207A - Related data storage method based on a hypergraph partitioning algorithm

Info

Publication number: CN107273207A
Application number: CN201710388857.9A
Authority: CN
Inventors: 王宝亮; 张光荣; 常鹏; 张荧允
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2017-05-25
Filing date: 2017-05-25
Publication date: 2017-10-20

Abstract

The present invention relates to a related data storage method based on a hypergraph partitioning algorithm, comprising: a task that needs to process data is called a demand pattern, and the demand pattern needs multiple data items stored on data center nodes; once the demand pattern is determined, its demand rate is predicted; according to the demand rates, a metric is selected, i.e. the criterion for hypergraph partitioning; a hypergraph model is built; a coarsening phase; an initial partitioning phase; and an optimization phase, in which fission nodes are selected at random from the k subgraphs and are successively split and restored, constructing a series of hypergraphs until the scale of the original unweighted hypergraph is reached, which yields the k optimized subgraphs.

Description

A related data storage method based on a hypergraph partitioning algorithm

Technical field

The invention belongs to the field of big data processing technology and relates to a related data storage method.

Background art

With the rapid development and spread of the Internet, the global volume of data is growing explosively, and we have entered an era of information explosion. Faced with massive and complex data, processing loads at the TB or even PB scale have become normal, and the concept of big data has emerged. Compared with traditional data, the characteristics of big data are commonly summarized as the four Vs: large volume (Volume), high speed (Velocity), many types (Variety), and low value density (Value). Large volume can be alleviated to some extent by extending storage, but the requirements of timely response, data diversity, and data uncertainty cannot be met by traditional data processing methods. To cope with the difficulties and challenges of big data, many large Internet companies have launched various types of big data processing systems in recent years. As an emerging technology, big data processing still has many shortcomings, such as latency caused by accessing distributed data, and heavy network load caused by huge data throughput and mismatched network rates. Scholars in China and abroad have therefore been looking for better data storage methods to strengthen the overall capability of big data processing.

Seemingly massive and complex data has a certain internal correlation. The data needed to process a specific task shares certain features (such as frequency of use, size, and being used together with other data). If highly correlated data is stored on the computing node as far as possible, no waiting for network resources is needed at use time, which saves time and improves system efficiency.

A hypergraph is a generalization of the ordinary graph in discrete mathematics. Its mathematical definition is as follows: a hypergraph H has a vertex set V and a set E of edges (hyperedges), so that H = (V, E). Each hyperedge e is a non-empty subset of V, and the number of vertices that e contains is its degree, denoted |e| (greater than or equal to 2). Hypergraph partitioning divides the vertices of the hypergraph into k roughly equal parts such that the number of hyperedges connecting vertices in more than one part is minimized.
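The definition above can be illustrated with a minimal Python sketch. The function names `make_hypergraph` and `degree` and the example vertices are hypothetical and are not part of the patent; the sketch only checks that every hyperedge is a non-empty subset of V and reports the degree |e|.

```python
def make_hypergraph(vertices, edges):
    """Return (V, E) after checking that every hyperedge is a non-empty subset of V."""
    V = frozenset(vertices)
    E = {name: frozenset(members) for name, members in edges.items()}
    for name, e in E.items():
        if not e or not e <= V:
            raise ValueError(f"hyperedge {name!r} must be a non-empty subset of V")
    return V, E


def degree(e):
    """Degree |e| of a hyperedge: the number of vertices it contains."""
    return len(e)


# Hypothetical example: four vertices, two hyperedges.
V, E = make_hypergraph(
    {"a", "b", "c", "d"},
    {"e1": {"a", "b", "c"}, "e2": {"c", "d"}},
)
degrees = {name: degree(e) for name, e in E.items()}
```

Unlike an ordinary graph edge, `e1` here joins three vertices at once, which is what lets a single hyperedge stand for a whole group of related data items.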

Summary of the invention

The purpose of the present invention is to propose a storage optimization method for related data based on hypergraph partitioning. For tasks of the same class, whose data demands are essentially the same, the method predicts the fixed demand of the class, records it as a demand pattern, and uses a hypergraph partitioning algorithm to move the data needed by the pattern to nodes with lower load. The technical scheme is as follows:

A related data storage method based on a hypergraph partitioning algorithm, comprising the following steps:

(1) A task that needs to process data is called a demand pattern. The demand pattern needs multiple data items stored on data center nodes. Once the demand pattern is determined, its demand rate is predicted: assuming the demand rate of the pattern at each node is R_py, the total demand rate of the demand pattern is R_p = Σ_y R_py.

(2) According to the demand rates, the metric, i.e. the criterion for hypergraph partitioning, is selected. The first metric is the overhead C^A required to complete the demands, and the second is the total relay traffic C^L required to complete the demands. The resulting metric is C(D) = C^A + αC^L, where α is a parameter between 0 and 1 that balances the two metric values.

(3) A hypergraph model is built according to the criterion of step (2). All data items and data nodes form the vertex set V of the hypergraph, and the hyperedge set E contains the mapping relations between all demand patterns, data items, and nodes. Every hyperedge e ∈ E is given a weight w_e based on the metric in (2). The hypergraph has two classes of vertices, storage nodes and data items, and two classes of hyperedges, demand pattern hyperedges and data node hyperedges.

(4) The vertices of the hypergraph are partitioned into n output sets, and each vertex belongs to exactly one of the n sets. The cut weight of the partition is the sum of the cut weights of the hyperedges: if the vertices of a hyperedge do not all belong to one set, the hyperedge is cut, and if the vertices of hyperedge e fall into t sets, its cut weight is (t − 1)·w_e.

(5) Coarsening phase: the weights of the hyperedges are reduced and closely connected vertices are merged, constructing smaller unweighted hypergraphs such that the reduction ratio between two adjacent levels reaches the configured reduction ratio. The reduction ratio is the percentage by which the number of vertices decreases between two adjacent levels.

(6) Initial partitioning phase: the smaller unweighted hypergraph obtained in step (5) is given an initial partition by random division, which yields the initial k subgraphs;

(7) Optimization phase: fission nodes are selected at random from the k subgraphs obtained in step (6) and are successively split and restored, constructing a series of hypergraphs until the scale of the original unweighted hypergraph is reached, which yields the k optimized subgraphs.

Brief description of the drawings

Fig. 1 is an example of demand patterns

Fig. 2 is the bipartite graph

Fig. 3 is the hypergraph model

Fig. 4 is the algorithm flow chart

Detailed description of the embodiments

The basic idea of this patent is as follows. For a given demand pattern, a binary relation between the demand pattern and the data storage nodes of the data center is built from the data the pattern needs. From this binary relation and the proposed metric, a mapping function that assigns data to data nodes is constructed. The details are as follows.

I. Data items and nodes

X denotes the set of m data items stored on the data nodes, and each task needs d different data items transferred from X. Let the pattern space be the set of all possible demand patterns; the demand patterns that occur in a practical application form a subset of this space, denoted P. Fig. 1 shows an example with five data items and three different demand patterns.

Y denotes the set of n storage nodes. Initially, each data item x ∈ X is assumed to be stored on exactly one node y ∈ Y. Let D: x → y be the rule that maps each stored data item to its node. The final purpose of the present invention is to provide a suitable storage scheme, that is, an efficient function D. In addition, D_y denotes the set of data items stored on node y.
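As an illustration of the notation, the placement function D and the per-node sets D_y can be sketched in Python as a dictionary and its inverse. The function name `items_by_node` and the item and node labels are hypothetical.

```python
from collections import defaultdict


def items_by_node(placement):
    """Invert D: x -> y into D_y, the set of data items stored on each node y."""
    d_y = defaultdict(set)
    for item, node in placement.items():
        d_y[node].add(item)
    return dict(d_y)


# Hypothetical placement D: every data item is stored on exactly one node.
placement = {"x1": "y1", "x2": "y1", "x3": "y2"}
stored = items_by_node(placement)
```

Storing D as a plain mapping enforces the assumption that each data item lives on exactly one node, because a dictionary key can have only one value.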

II. Data placement

1. Demand rate

Data stored in the data center may be output as the input of another task, or may simply be processed locally. Without loss of generality, the data node that a demand initially accesses is taken as the source location of the demand. In the model, a data center or node therefore plays two roles at once: the source location of demand patterns and the final location where data is stored.

For each demand pattern completed at a demand node y ∈ Y, the workload or demand rate is predictable (prediction methods are mature and are not covered here) and is denoted R_py; data storage decisions are made according to the demand rates. The set of workloads or demand rates is defined as R = {R_py | p ∈ P, y ∈ Y}. Fig. 2 shows a hypothetical bipartite graph model: data centers and demand patterns are the two vertex classes of the bipartite graph, and each edge connecting the two is labeled with the demand rate R_py as its weight. The total demand rate of each demand node y is the sum Σ_{p∈P} R_py, and the total demand rate of each demand pattern p is R_p = Σ_{y∈Y} R_py.
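The two totals can be computed directly from the predicted rates R_py. The following Python sketch assumes the rates are given as a dictionary keyed by (pattern, node); the function name `total_rates` and the numbers are illustrative only.

```python
from collections import defaultdict


def total_rates(rates):
    """Sum R_py over nodes (per pattern) and over patterns (per node)."""
    per_pattern = defaultdict(float)
    per_node = defaultdict(float)
    for (pattern, node), rate in rates.items():
        per_pattern[pattern] += rate
        per_node[node] += rate
    return dict(per_pattern), dict(per_node)


# Hypothetical predicted demand rates R_py, i.e. the weighted bipartite edges.
rates = {("p1", "y1"): 3.0, ("p1", "y2"): 1.0, ("p2", "y2"): 2.0}
per_pattern, per_node = total_rates(rates)
```

Each dictionary entry corresponds to one weighted edge of the bipartite graph, so the two totals are simply the weighted degrees of its two vertex classes.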

2. Metrics

Data placement affects system performance in two respects: system efficiency and user-perceived latency. By observing the relationship between data placement and system performance, we summarize two metrics.

1) Co-location of related data

System efficiency is evaluated by the system processing time needed to complete a given workload. In a distributed system, the average system time needed to complete a demand depends not only on the amount of information read but also on the total number of nodes involved, because each node adds its own processing overhead. Let S_p denote the amount of data needed to complete one demand pattern p, and let S_py denote the workload needed on node y ∈ Y to complete demand pattern p, so that S_p = Σ_{y∈Y} S_py. S_py is a variable determined by the data placement mapping D: x → y, whereas S_p is a constant. The system time needed for node y ∈ Y to complete a demand p, partially or fully, is defined as S_py + λ·1(S_py), where S_py is the regular time needed to process the demand and λ·1(S_py) is the constant time needed for the routine operations of processing demand p, such as setting up a TCP connection; the indicator 1(S_py) equals 1 when S_py > 0 and 0 otherwise. Taking the demand rates of the different patterns into account, the total system time to complete all demands is Σ_{p∈P} Σ_{y∈Y} R_p (S_py + λ·1(S_py)), which is equivalent to Σ_{p∈P} R_p S_p + λ Σ_{p∈P} Σ_{y∈Y} R_p·1(S_py). Minimizing this expression improves server efficiency and reduces overhead, and this can be achieved by co-locating strongly related data. In the extreme case where all data items needed by a demand pattern are stored on the same node, S_py = S_p on that node and the minimum system time needed to complete the demand is R_p (S_py + λ). For any given workload, Σ_{p∈P} R_p S_p is a constant, so the overhead C^A required to complete the demands is determined by the remaining, placement-dependent term.
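The effect of co-location on the system time can be checked with a small Python sketch. It assumes, as an illustration only, that S_py is the summed size of the items of pattern p stored on node y; the function names, sizes, rates, and the value of λ (`lam`) are hypothetical.

```python
def pattern_time(pattern, placement, sizes, lam):
    """Time for one demand: sum over touched nodes of S_py + lam * 1(S_py)."""
    per_node = {}
    for item in pattern:
        node = placement[item]
        # Assumption: S_py is the total size of the pattern's items on node y.
        per_node[node] = per_node.get(node, 0.0) + sizes[item]
    return sum(s_py + lam for s_py in per_node.values() if s_py > 0)


def total_time(patterns, pattern_rates, placement, sizes, lam):
    """Total system time: sum over patterns of R_p times the per-demand time."""
    return sum(
        pattern_rates[p] * pattern_time(items, placement, sizes, lam)
        for p, items in patterns.items()
    )


patterns = {"p1": {"x1", "x2"}}
pattern_rates = {"p1": 4.0}          # R_p
sizes = {"x1": 2.0, "x2": 3.0}
lam = 1.0                            # fixed per-node overhead, e.g. connection setup

co_located = total_time(patterns, pattern_rates, {"x1": "y1", "x2": "y1"}, sizes, lam)
spread = total_time(patterns, pattern_rates, {"x1": "y1", "x2": "y2"}, sizes, lam)
```

With both items on one node the fixed overhead λ is paid once per demand; spreading them over two nodes pays it twice, while the data-dependent part R_p·S_p stays the same.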

2) Local data service

The location of the demand node may differ from the location of the data nodes that store the data the demand needs, and this also affects system performance in the form of relay traffic. We therefore take the total relay traffic needed to complete the demands, C^L, as the second metric. Its definition uses the indicator 1(x ∈ D_y), which states whether data item x is stored on node y.

The final purpose of the present invention is to provide a related data storage method that increases system efficiency and reduces system overhead, specifically by providing an optimized mapping function D: x → y from data to storage. Based on the two metrics above, the final optimization criterion is C(D) = C^A + αC^L, where α is the parameter that balances the two metrics.

III. Hypergraph partitioning

1. Construction of the hypergraph model

All data items and data nodes form the vertex set V of the hypergraph, V = X ∪ Y. The hyperedge set E contains the mapping relations between all demand patterns, data items, and nodes, E = {e_p | p ∈ P} ∪ {e_xy | x ∈ X, y ∈ Y}. Every hyperedge e ∈ E is given a weight w_e, which is set for each of the two classes of hyperedges based on the optimization criterion C(D) = C^A + αC^L. As shown in Fig. 3, the hypergraph has two classes of vertices, storage nodes and data items, and two classes of hyperedges, demand pattern hyperedges and data node hyperedges.
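A minimal Python sketch of this construction follows. It assumes that a demand pattern hyperedge e_p connects the data items of pattern p and that a data node hyperedge e_xy connects data item x with node y; this membership is an interpretation of the text, and the function name `build_hypergraph` and the labels are hypothetical. Weights are left out because they depend on the metric chosen in the previous section.

```python
def build_hypergraph(items, nodes, patterns):
    """Build V = X ∪ Y and the two hyperedge classes e_p and e_xy."""
    vertices = set(items) | set(nodes)
    edges = {}
    # Demand pattern hyperedges (assumed to join the data items of each pattern).
    for p, members in patterns.items():
        edges[("pattern", p)] = frozenset(members)
    # Data node hyperedges: one per (data item, node) pair.
    for x in items:
        for y in nodes:
            edges[("item-node", x, y)] = frozenset({x, y})
    return vertices, edges


items = ["x1", "x2", "x3"]
nodes = ["y1", "y2"]
patterns = {"p1": {"x1", "x2"}, "p2": {"x2", "x3"}}
vertices, edges = build_hypergraph(items, nodes, patterns)
```

The edge count is |P| + |X|·|Y|, so on large instances the data node hyperedges dominate the size of the model.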

2. Basis of hypergraph partitioning

Theorem: An output set I is modeled as a hypergraph by the method described above. Partitioning the hypergraph into n sets of vertices yields the data placement function D. Let H be the cut weight of the partition; then H = C(D) − B, where B is a constant.

Proof: First, consider the cut weight of the demand pattern hyperedges e_p, denoted H_p. By the definition of the cut weight, a hyperedge whose vertices fall into t sets contributes (t − 1) times its weight, so the sum of H_p over all patterns differs from the corresponding part of C(D) only by a constant. Second, consider the cut weight of the data node hyperedges, denoted H_xy. In the hypergraph model, every data item x is connected to all nodes. After partitioning, it can remain connected to only one node; otherwise x would link several sets of the partition result, given that each node is placed in a different set. Suppose data item x is finally connected to node f_x. The total cut weight of the data node hyperedges associated with x is then the sum of the weights of e_xy over all y ≠ f_x. Therefore, H = C(D) − B.

By the theorem, minimizing the cut weight of the n-way partition obtained by hypergraph partitioning is equivalent to minimizing C(D).
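The cut weight H used in the theorem can be computed as follows. This Python sketch applies the (t − 1)·w_e rule to every hyperedge; the function name `cut_weight`, the example hyperedges, and their weights are hypothetical.

```python
def cut_weight(edges, weights, part_of):
    """Sum of (t - 1) * w_e, where t is the number of sets hyperedge e spans."""
    total = 0.0
    for name, members in edges.items():
        t = len({part_of[v] for v in members})
        total += (t - 1) * weights[name]
    return total


edges = {"e1": {"a", "b", "c"}, "e2": {"c", "d"}}
weights = {"e1": 2.0, "e2": 5.0}

# e1 spans two sets, e2 stays inside one set.
h_good = cut_weight(edges, weights, {"a": 0, "b": 0, "c": 1, "d": 1})
# e1 spans three sets, e2 spans two sets.
h_bad = cut_weight(edges, weights, {"a": 0, "b": 1, "c": 2, "d": 0})
```

A hyperedge that stays inside one set contributes nothing, so a partition that keeps heavy hyperedges intact has a small H.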

3. Steps of hypergraph partitioning

1) Coarsening phase: closely connected vertices are merged to construct smaller hypergraphs such that the reduction ratio between two adjacent levels reaches the configured reduction ratio. The reduction ratio is the percentage by which the number of vertices decreases between two adjacent levels;

2) Initial partitioning phase: the smaller unweighted hypergraph obtained in step 1 is given an initial partition by random division, which yields the initial k subgraphs;

3) Optimization phase: fission nodes are selected at random from the k subgraphs obtained in step 2 and are successively split and restored, constructing a series of hypergraphs until the scale of the original unweighted hypergraph is reached, which yields the k optimized subgraphs.

The algorithm flow chart is shown in Fig. 4.
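The three phases can be sketched in Python as a simplified, single-level version of the scheme. This is not the patent's exact procedure: the pairwise merge rule, the greedy vertex moves used in place of the fission and restoration step, the balance cap, and all names and parameters (`coarsen`, `refine`, `partition`, `ratio`, `seed`) are assumptions made for illustration.

```python
import math
import random


def cut(edges, part):
    """Unweighted cut: a hyperedge spanning t parts contributes t - 1."""
    return sum(len({part[v] for v in e}) - 1 for e in edges)


def coarsen(vertices, edges, ratio, rng):
    """One pass over the hyperedges in random order, merging pairs of vertices
    that share a hyperedge while the vertex count is above the target."""
    parent = {v: v for v in vertices}
    target = len(vertices) * (1 - ratio)
    merged, count = set(), len(vertices)
    for e in rng.sample(edges, len(edges)):
        free = [v for v in sorted(e) if v not in merged]
        if len(free) >= 2 and count > target:
            keep, absorb = free[0], free[1]
            parent[absorb] = keep
            merged.update((keep, absorb))
            count -= 1
    return parent


def refine(vertices, edges, part, k, cap, rng):
    """Greedy refinement: move a vertex when that strictly lowers the cut
    and the receiving part is below the size cap."""
    sizes = [0] * k
    for v in vertices:
        sizes[part[v]] += 1
    improved = True
    while improved:
        improved = False
        for v in rng.sample(sorted(vertices), len(vertices)):
            home = part[v]
            best, best_cut = home, cut(edges, part)
            for q in range(k):
                if q == home or sizes[q] >= cap:
                    continue
                part[v] = q
                trial = cut(edges, part)
                if trial < best_cut:
                    best, best_cut = q, trial
            part[v] = best
            if best != home:
                sizes[home] -= 1
                sizes[best] += 1
                improved = True
    return part


def partition(vertices, edges, k, ratio=0.3, seed=0):
    rng = random.Random(seed)
    edges = [frozenset(e) for e in edges]
    # Coarsening phase: vertex -> representative of its merged group.
    parent = coarsen(vertices, edges, ratio, rng)
    # Initial partitioning phase: random division of the coarse vertices.
    coarse = sorted(set(parent.values()))
    rng.shuffle(coarse)
    coarse_part = {c: i % k for i, c in enumerate(coarse)}
    # Optimization phase: split merged vertices back, then refine.
    part = {v: coarse_part[parent[v]] for v in vertices}
    cap = math.ceil(len(vertices) / k) + 1
    return refine(vertices, edges, part, k, cap, rng)


vertices = ["a", "b", "c", "d", "e", "f"]
edges = [{"a", "b", "c"}, {"d", "e", "f"}, {"a", "b"}, {"e", "f"}, {"c", "d"}]
result = partition(vertices, edges, k=2)
```

Because the initial division is random, the coarse hypergraph only determines which vertices start together; the quality of the final k subgraphs comes from the refinement after the merged vertices are split back. A fixed `seed` keeps the sketch reproducible.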

In summary, the present invention proposes an optimized storage method for related data based on hypergraph partitioning, which improves system efficiency and reduces user-perceived latency.

Claims

1. A related data storage method based on a hypergraph partitioning algorithm, comprising the following steps:

(1) A task that needs to process data is called a demand pattern. The demand pattern needs multiple data items stored on data center nodes. Once the demand pattern is determined, its demand rate is predicted: assuming the demand rate of the pattern at each node is R_py, the total demand rate of the demand pattern is R_p = Σ_y R_py.

(2) According to the demand rates, the metric, i.e. the criterion for hypergraph partitioning, is selected. The first metric is the overhead C^A required to complete the demands, and the second is the total relay traffic C^L required to complete the demands. The resulting metric is C(D) = C^A + αC^L, where α is a parameter between 0 and 1 that balances the two metric values.

(3) A hypergraph model is built according to the criterion of step (2). All data items and data nodes form the vertex set V of the hypergraph, and the hyperedge set E contains the mapping relations between all demand patterns, data items, and nodes. Every hyperedge e ∈ E is given a weight w_e based on the metric in (2). The hypergraph has two classes of vertices, storage nodes and data items, and two classes of hyperedges, demand pattern hyperedges and data node hyperedges.

(4) The vertices of the hypergraph are partitioned into n output sets, and each vertex belongs to exactly one of the n sets. The cut weight of the partition is the sum of the cut weights of the hyperedges: if the vertices of a hyperedge do not all belong to one set, the hyperedge is cut, and if the vertices of hyperedge e fall into t sets, its cut weight is (t − 1)·w_e.

(5) Coarsening phase: the weights of the hyperedges are reduced and closely connected vertices are merged, constructing smaller unweighted hypergraphs such that the reduction ratio between two adjacent levels reaches the configured reduction ratio. The reduction ratio is the percentage by which the number of vertices decreases between two adjacent levels.

(6) Initial partitioning phase: the smaller unweighted hypergraph obtained in step (5) is given an initial partition by random division, which yields the initial k subgraphs;

(7) Optimization phase: fission nodes are selected at random from the k subgraphs obtained in step (6) and are successively split and restored, constructing a series of hypergraphs until the scale of the original unweighted hypergraph is reached, which yields the k optimized subgraphs.