CN107506452B - Big data full-comparison data distribution method and system based on graph coverage - Google Patents

Publication number
CN107506452B
Authority
CN
China
Prior art keywords
graph
data
induction
optimal
solution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710751446.1A
Other languages
Chinese (zh)
Other versions
CN107506452A (en
Inventor
张雪英
李凤莲
田玉楚
李彦民
焦江丽
高燕军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN201710751446.1A
Publication of CN107506452A
Application granted
Publication of CN107506452B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2237 Vectors, bitmaps or matrices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data distribution method and system for big data full comparison based on graph coverage. The data distribution method comprises the following steps: abstracting M data files to be processed into vertices of a graph, abstracting the comparison computation between any two data files to be processed into an edge of the graph, and thereby mapping the full comparison computation of the M data files into a complete graph $G_M$; dividing the complete graph $G_M$ into N induced graphs whose combination covers $G_M$ while minimizing $\max\{|V_1|, |V_2|, \ldots, |V_N|\}$; determining an optimal coverage solution according to the induced graphs; and sequentially distributing the data to be processed to the computing nodes according to the optimal coverage solution. By abstracting data files into vertices and comparison computations into edges, the method maps the full comparison computation onto a complete graph and introduces graph covering to solve its data distribution problem; dividing the complete graph into induced graphs then determines an optimal coverage solution and achieves global optimality of the data distribution.

Description

Big data full-comparison data distribution method and system based on graph coverage
Technical Field
The invention relates to the technical field of data distribution of big data distributed computation, in particular to a data distribution method based on graph coverage big data full comparison.
Background
Full comparison is a special computational problem that arises widely in bioinformatics, biometrics, data mining and other fields. In bioinformatics, phylogenetic relationships are inferred by comparing gene sequences of different species. In biometrics, a typical full comparison problem is to identify human physiological features, as in face recognition, fingerprint identification and palm scanning, by pairwise comparison of large amounts of data in a biometric database. In data mining, computing a similarity matrix, which represents the similarity between the objects under consideration, is a key step in classification and cluster analysis. Sequence alignment, cluster analysis and global network alignment, a current research focus, are all typical full comparison problems in computational biology and bioinformatics.
Full comparison computation represents a typical computation pattern in which each data item in a data set is compared with all other data items in the set. As the number of files in the data set, or the amount of data in the files, grows, the scale of the full comparison computation grows as well. Solutions to the full comparison problem have been proposed in some specific fields, such as the well-known BLAST and ClustalW. In addition, distributed systems such as the open-source distributed processing framework Hadoop are widely used to solve large-scale data-intensive computational problems, including full comparison computation. In recent years, it has been proposed to abstract the data distribution problem of full comparison computation into a constrained combinatorial optimization problem and to search for a solution with a heuristic algorithm.
However, existing methods require all data files to be stored on every node in the system, which significantly increases time overhead and communication cost and requires a large amount of storage space. In addition, Hadoop's data allocation strategy does not consider the dependency between comparison tasks and data, so its efficiency for full comparison computation is low. Compared with Hadoop, the heuristic data distribution strategy improves overall computing performance. However, as the amount of data increases, the solution space becomes larger and the problem size grows exponentially. Furthermore, a heuristic algorithm cannot guarantee the global optimality of the solution.
Disclosure of Invention
The invention aims to provide a data distribution method and a data distribution system based on graph coverage big data full comparison, which can ensure the global optimality of data distribution.
In order to achieve the purpose, the invention provides the following scheme:
A data distribution method for big data full comparison based on graph coverage comprises the following steps:
abstracting M data files to be processed into vertices of a graph, abstracting the comparison computation between any two data files to be processed into an edge of the graph, and mapping the full comparison computation of the M data files to be processed into a complete graph $G_M$, where a complete graph is a graph in which every pair of vertices is connected by an edge;
dividing the complete graph $G_M$ into N induced graphs $G(V_1), G(V_2), \ldots, G(V_N)$ such that the combination of the induced graphs covers the complete graph $G_M$ and $\max\{|V_1|, |V_2|, \ldots, |V_N|\}$ is minimized, where $V_n$ denotes a vertex set and $|V_n|$ denotes the number of vertices in the vertex set of the n-th induced graph;
determining an optimal coverage solution according to each induced graph;
and sequentially distributing the data to be processed to each computing node according to the optimal coverage solution.
Optionally, whether the combination of the induced graphs covers the complete graph $G_M$ is determined according to the following condition:
Figure BDA0001391239140000021
where n denotes the index of an induced graph, n = 1, 2, ..., N.
Optionally, the determining an optimal coverage solution according to each induced graph specifically includes:
selecting induced graphs that meet the following conditions:
no two induced graphs of the complete graph $G_M$ share a common edge, and every edge of the complete graph $G_M$ lies on at least one induced graph;
each selected induced graph is an element of the optimal coverage solution, and the combination of the selected induced graphs $G_n$ is the optimal cover of the complete graph $G_M$, denoted $G_n | G_M$, where n denotes the index of an induced graph, n = 1, 2, ..., N.
Optionally, the sequentially allocating the data to be processed to each computing node according to the optimal coverage solution specifically includes:
defining four set variables $L_1$, $L_2$, $L_3$ and $L_4$, where $L_1$ stores the elements of the optimal coverage solution found so far; $L_2$ stores the differences between any two elements of the optimal coverage solution; $L_3$ stores the sums of the elements of $L_1$ and $L_2$; and $L_4$ stores the differences between a newly found element of the optimal coverage solution and the existing elements;
constructing an optimal coverage solution, and distributing data according to the obtained optimal coverage solution:
when M = N = n(n-1)+1 with n ≥ 2, distributing data according to the optimal coverage solution;
and when M > N, evenly packing the data files to be processed into N blocks, and distributing data according to the blocks and the optimal coverage solution.
Optionally, the performing data allocation specifically includes:
making the number of data files to be processed on each computing node equal to n.
In order to achieve the above purpose, the invention also provides the following scheme:
A data distribution system for big data full comparison based on graph coverage, the data distribution system comprising:
a mapping unit for abstracting M data files to be processed into vertices of a graph, abstracting the comparison computation between any two data files to be processed into an edge of the graph, and mapping the full comparison computation of the M data files to be processed into a complete graph $G_M$, where a complete graph is a graph in which every pair of vertices is connected by an edge;
a dividing unit for dividing the complete graph $G_M$ into N induced graphs $G(V_1), G(V_2), \ldots, G(V_N)$ such that the combination of the induced graphs covers the complete graph $G_M$ and $\max\{|V_1|, |V_2|, \ldots, |V_N|\}$ is minimized, where $V_n$ denotes a vertex set and $|V_n|$ denotes the number of vertices in the vertex set of the n-th induced graph;
a determining unit, configured to determine an optimal coverage solution according to each of the induction maps;
and the distribution unit is used for sequentially distributing the data to be processed to each computing node according to the optimal coverage solution.
Optionally, the dividing unit determines whether the combination of the induced graphs covers the complete graph $G_M$ according to the following condition:
Figure BDA0001391239140000041
where n denotes the index of an induced graph, n = 1, 2, ..., N.
Optionally, the determining unit determines the optimal coverage solution according to each induced graph, and specifically includes:
selecting induced graphs that meet the following conditions:
no two induced graphs of the complete graph $G_M$ share a common edge, and every edge of the complete graph $G_M$ lies on at least one induced graph;
each selected induced graph is an element of the optimal coverage solution, and the combination of the selected induced graphs $G_n$ is the optimal cover of the complete graph $G_M$, denoted $G_n | G_M$, where n denotes the index of an induced graph, n = 1, 2, ..., N.
Optionally, the allocating unit sequentially allocates the data to be processed to each computing node according to the optimal coverage solution, and specifically includes:
defining four set variables $L_1$, $L_2$, $L_3$ and $L_4$, where $L_1$ stores the elements of the optimal coverage solution found so far; $L_2$ stores the differences between any two elements of the optimal coverage solution; $L_3$ stores the sums of the elements of $L_1$ and $L_2$; and $L_4$ stores the differences between a newly found element of the optimal coverage solution and the existing elements;
constructing an optimal coverage solution, and distributing data according to the obtained optimal coverage solution:
when M = N = n(n-1)+1 with n ≥ 2, distributing data according to the optimal coverage solution;
and when M > N, evenly packing the data files to be processed into N blocks, and distributing data according to the blocks and the optimal coverage solution.
Optionally, the performing data allocation specifically includes:
making the number of data files to be processed on each computing node equal to n.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
In the method, the data files to be processed are abstracted into vertices of a graph and the comparison computation between any two data files is abstracted into an edge of the graph, so that the full comparison computation of the data files is mapped into a complete graph. Graph covering is introduced to solve the data distribution problem of full comparison computation; the optimal coverage solution is then determined by dividing the complete graph into induced graphs, which fixes the amount of data distributed to each computing node and achieves global optimality of the data distribution.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a full comparison calculation process;
FIG. 2 is a schematic diagram of the additional tasks resulting from adding new data d to node k;
FIG. 3 is a flowchart of a data distribution method based on graph-covered big data full comparison according to an embodiment of the present invention;
FIG. 4 is a block diagram of a data distribution system based on full comparison of big data covered by graphs according to an embodiment of the present invention.
Description of the symbols:
mapping unit-1, dividing unit-2, determining unit-3, and allocating unit-4.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a big data full-comparison data distribution method and system based on graph coverage.
Distributed computing: the term originally referred to a network of computers in which the individual computers are dispersed within some geographical area. Today it is used in a broader sense and may, for example, refer to processes running on the same computer that communicate with each other through message passing. Distributed computing and parallel computing have many similarities, and the same system can be considered either "parallel" or "distributed". Parallel computing may be regarded as a tightly coupled form of distributed computing, and distributed computing as a loosely coupled form of parallel computing. In parallel computing, all processors can exchange information by accessing a shared memory; in distributed computing, each processor has its own private memory, and information is exchanged by passing messages between processors.
Full comparison computing (All-to-All Comparison Computing): a typical computation pattern in which each data item in a data set is compared with all other data items in the set.
Graph covering (Graph Covering): given a graph and a number of induced graphs, the original graph is covered with these induced graphs while the maximum number of vertices in any of them is minimized.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The full comparison problem is specified as follows:
Let A denote the set of data files to be processed, C the comparison function applied to the data in A, and M the output similarity matrix of A. The full comparison computation can be expressed as:
$M_{ij} = C(A_i, A_j), \quad i, j = 1, 2, \ldots, |A|$ (1);
where $A_i$ denotes the i-th data item in A; $M_{ij}$, an element of the output matrix M, is the result of comparing $A_i$ with $A_j$; and $|A|$ denotes the number of data items in A.
A typical full comparison problem is shown in FIG. 1, where each data item needs to be compared with all other data items. In the output matrix shown in FIG. 1, the comparison between two data items is unordered, i.e.:
$C(A_i, A_j) = C(A_j, A_i)$;
Therefore, only the upper triangular elements of this symmetric matrix need to be computed.
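The upper-triangle observation can be illustrated with a minimal sketch. The function `shared_characters` and the sample strings are hypothetical stand-ins for a real comparison function C and real data files; they are not part of the patent.

```python
from itertools import combinations

def full_comparison(items, compare):
    """Build the symmetric output matrix, calling compare once per unordered pair."""
    size = len(items)
    matrix = [[None] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):  # upper triangle only: i < j
        value = compare(items[i], items[j])
        matrix[i][j] = value
        matrix[j][i] = value  # mirrored, because C(Ai, Aj) = C(Aj, Ai)
    return matrix

def shared_characters(x, y):
    """Hypothetical symmetric comparison: number of distinct characters in common."""
    return len(set(x) & set(y))

sequences = ["ACGT", "ACCA", "TTGA"]
similarity = full_comparison(sequences, shared_characters)
```

For |A| items this performs |A|(|A|-1)/2 comparisons instead of |A|², and the diagonal is left unset because an item is not compared with itself.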
In order to be able to efficiently solve the full comparison problem, the following aspects will be looked at: (1) the computational performance of each full compare task; (2) overall performance of the distributed system; (3) overhead of allocating data.
(1) Computational performance of each full comparison task: for each comparison task, if the data required by the task is stored on the node performing the task, the task does not need to access data remotely over the network. In this case, the comparison task can be performed immediately, without additional data movement between nodes. Let $C(x, y)$, $T$, $T_i$ and $D_i$ denote, respectively, the comparison task between data x and y, the set of all tasks, the set of tasks performed by node i, and the data set stored on node i. Good data locality for all comparison tasks can be described by formula (2):
Figure BDA0001391239140000071
(2) Overall performance of the distributed system: if each computing node in the system is assigned a number of tasks that matches its processing capacity, the system achieves load balancing and all nodes finish their tasks at about the same time, making full use of the system's computing capacity. Let $T_i$ denote the number of comparison tasks executed by the i-th node. For a distributed system with N nodes and M data files, M(M-1)/2 comparison tasks need to be distributed among the nodes. Load balancing of the system can be expressed as formula (3):
Figure BDA0001391239140000072
where $\lceil \cdot \rceil$ denotes the ceiling (round-up) function.
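As a rough illustration of the load-balancing requirement, the sketch below assumes that the bound takes the common form "no node executes more than ⌈M(M-1)/(2N)⌉ comparison tasks". Formula (3) itself is only available as an image here, so this form and the function names are assumptions rather than the patent's exact statement.

```python
def total_tasks(num_files):
    """Number of pairwise comparison tasks for M files: M(M-1)/2."""
    return num_files * (num_files - 1) // 2

def task_cap(num_files, num_nodes):
    """Assumed per-node upper bound: ceil(M(M-1) / (2N)), in integer arithmetic."""
    return -(-total_tasks(num_files) // num_nodes)

def is_balanced(tasks_per_node, num_files):
    """True if every task is assigned and no node exceeds the assumed cap."""
    cap = task_cap(num_files, len(tasks_per_node))
    return (sum(tasks_per_node) == total_tasks(num_files)
            and max(tasks_per_node) <= cap)
```

With M = 7 files and N = 7 nodes there are 21 tasks, so the assumed cap is 3 tasks per node.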
(3) Overhead of allocating data: although storing all of the data on every node satisfies formulas (2) and (3), the overhead of allocating data should be reduced further. To distribute all the data in the distributed system, the storage used on each node must be within that node's capacity. In addition, the time spent allocating data should be reduced. In general, the data distribution time is linear in the number of data files distributed. Let $|D_i|$ denote the number of files allocated to node i. Considering the limits on data allocation time and storage, the data allocation policy needs to satisfy formula (4):
$\text{Minimize} \ \max\{|D_1|, |D_2|, \ldots, |D_N|\}$ (4);
Therefore, data allocation is performed with formula (4) as the objective, subject to the constraints of formulas (2) and (3).
Data allocation problem: given a full comparison problem of M data and a distributed system of N nodes, the data files are distributed to the nodes so that all comparison tasks have local data (equation (2)). In addition, each node can be assigned a number of comparison tasks that matches its computational power (equation (3)). In the case where the above requirements are satisfied, the maximum number of data files among all nodes is minimized.
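The three requirements can be checked mechanically for any candidate allocation. The following sketch is illustrative only: `check_allocation` and its argument layout are hypothetical, and the balance test again assumes a cap of ⌈M(M-1)/(2N)⌉ tasks per node.

```python
from itertools import combinations

def check_allocation(num_files, data_on_node, tasks_on_node):
    """Check a candidate allocation.

    data_on_node:  list of sets of file ids (1..M), one set per node.
    tasks_on_node: list of sets of (x, y) pairs with x < y, one set per node.
    """
    all_pairs = set(combinations(range(1, num_files + 1), 2))
    assigned = [pair for tasks in tasks_on_node for pair in tasks]
    # Every comparison task is assigned exactly once.
    complete = len(assigned) == len(all_pairs) and set(assigned) == all_pairs
    # Data locality: both files of a task are stored on the node that runs it.
    local = all(x in files and y in files
                for files, tasks in zip(data_on_node, tasks_on_node)
                for x, y in tasks)
    # Load balance under the assumed cap ceil(M(M-1) / (2N)).
    cap = -(-len(all_pairs) // len(data_on_node))
    balanced = all(len(tasks) <= cap for tasks in tasks_on_node)
    # Objective value: the largest number of files stored on any node.
    max_files = max(len(files) for files in data_on_node)
    return {"complete": complete, "local": local,
            "balanced": balanced, "max_files": max_files}

report = check_allocation(
    3,
    [{1, 2}, {2, 3}, {1, 3}],
    [{(1, 2)}, {(2, 3)}, {(1, 3)}],
)
```

Two allocations that both pass the first three checks can then be compared by their `max_files` value, which is the quantity being minimized.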
The following will focus on a data distribution strategy based on heuristic algorithms.
As shown in FIG. 2, the rightmost column of the comparison matrix shows that when a new data file d is assigned to a node k that already stores p data files, additional comparison tasks can be assigned to node k. Thus, if as many comparison tasks as possible are assigned to node k, the total number of data files that need to be distributed can be minimized.
The additional comparison tasks resulting from adding new data d to node k include those that have never been previously assigned and those that have already been assigned. The relevant rules for distributing data can be summarized as follows:
Rule 1: for those comparison tasks that have never been previously assigned, the data allocation policy is designed to assign as many tasks as possible to node k, and obey equation (4).
Rule 2: for those comparison tasks that have already been assigned, the data allocation policy is designed to redistribute these comparison tasks while respecting equation (4). For example, if a comparison task t has already been assigned to node q, the policy compares the number of assigned comparison tasks between node k and node q, and reassigns the comparison tasks.
Based on these heuristic rules, the corresponding data allocation algorithm is as follows.
The heuristic, task-driven data distribution method comprises the following steps:
Step 1: Find all unallocated comparison tasks.
Step 2: Initialize a set I as the empty set; find the data files needed by the unallocated comparison tasks and put them into I.
Step 3: From set I, find the data file needed by the largest number of unallocated comparison tasks. Denote this data file by d.
Step 4: Select the set of storage nodes that meet the following conditions: (1) the node does not store data file d; (2) the node has been assigned the smallest number of comparison tasks; (3) the node stores the smallest number of data files. Denote this set by C.
Step 5: Check all nodes in set C according to rule 1. If none of the nodes satisfies equation (4), remove data file d from set I and return to step 3.
Step 6: Find a node k in set C that is either empty or can be assigned the largest number of new unallocated comparison tasks resulting from the addition of data file d, and assign data file d to node k.
Step 7: For comparison tasks that arise from adding data file d in step 6 but have previously been assigned to other nodes, redistribute them using rule 2.
Step 8: Repeat steps 1 to 7 until all pairwise comparison tasks are assigned to nodes.
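For orientation only, the sketch below shows a much simplified greedy allocator in the spirit of rule 1. It is not the eight-step procedure: it never redistributes tasks (rule 2 is omitted), it processes tasks in a fixed order, and the per-node cap ⌈M(M-1)/(2N)⌉ is an assumed form of the load-balancing bound. All names are hypothetical.

```python
from itertools import combinations

def greedy_allocate(num_files, num_nodes):
    """Assign each comparison task to the node that needs the fewest new files."""
    pairs = list(combinations(range(1, num_files + 1), 2))
    cap = -(-len(pairs) // num_nodes)        # assumed per-node task cap
    data = [set() for _ in range(num_nodes)]  # files stored on each node
    tasks = [[] for _ in range(num_nodes)]    # tasks assigned to each node
    for x, y in pairs:
        # Only nodes below the cap are eligible; since N * cap >= number of
        # tasks, at least one such node exists while tasks remain unassigned.
        candidates = [k for k in range(num_nodes) if len(tasks[k]) < cap]

        def cost(k):
            missing = (x not in data[k]) + (y not in data[k])
            return (missing, len(data[k]), len(tasks[k]))

        best = min(candidates, key=cost)
        data[best].update((x, y))
        tasks[best].append((x, y))
    return data, tasks

data, tasks = greedy_allocate(7, 7)
```

Such a greedy pass always yields full data locality and respects the cap, but, like the heuristic it imitates, it gives no guarantee that the largest per-node file count is minimal.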
The heuristic data distribution algorithm addresses the minimization problem in formula (4): all comparison tasks are distributed while the total amount of data distributed is kept as small as possible. Distributing the data evenly across all computing nodes helps to meet the storage constraints of each node. The load-balancing requirement of formula (3) can easily be implemented as a constraint of the optimization problem (4) if each node has the same or a similar number of comparison tasks.
However, as the amount of data increases, the solution space becomes larger and the problem size grows exponentially. Furthermore, the heuristic algorithm cannot guarantee the global optimality of the solution.
As shown in FIG. 3, the data distribution method for big data full comparison based on graph coverage includes:
Step 310: abstracting M data files to be processed into vertices of a graph, abstracting the comparison computation between any two data files to be processed into an edge of the graph, and mapping the full comparison computation of the M data files to be processed into a complete graph $G_M$, where a complete graph is a graph in which every pair of vertices is connected by an edge;
Step 320: dividing the complete graph $G_M$ into N induced graphs $G(V_1), G(V_2), \ldots, G(V_N)$ such that the combination of the induced graphs covers the complete graph $G_M$ and $\max\{|V_1|, |V_2|, \ldots, |V_N|\}$ is minimized, where $V_n$ denotes a vertex set and $|V_n|$ denotes the number of vertices in the vertex set of the n-th induced graph;
Step 330: determining an optimal coverage solution according to each induced graph;
Step 340: sequentially distributing the data to be processed to each computing node according to the optimal coverage solution.
Here $G(V, E)$ denotes a graph, where V and E are its vertex set and edge set, respectively. Selecting a subset $V'$ of the vertices in V forms an induced graph, denoted $G[V']$. The union of two induced graphs $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$ is $G = (V_1 \cup V_2, E_1 \cup E_2)$, i.e. the vertex sets and the edge sets of the two graphs are merged respectively.
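These definitions translate directly into set operations. The sketch below, with hypothetical function names, represents the complete graph on vertices 1..M by its edge set and tests whether the union of a family of induced graphs covers it; the seven triples in the example are the n = 3 combinations obtained by cyclically incrementing (1,2,4) for N = 7.

```python
from itertools import combinations

def complete_graph_edges(num_vertices):
    """Edge set of the complete graph on vertices 1..M."""
    return set(combinations(range(1, num_vertices + 1), 2))

def induced_edges(vertex_subset):
    """Edges of the subgraph of a complete graph induced by vertex_subset."""
    return set(combinations(sorted(vertex_subset), 2))

def covers_complete_graph(num_vertices, vertex_subsets):
    """True if the union of the induced graphs contains every edge of G_M."""
    union = set()
    for subset in vertex_subsets:
        union |= induced_edges(subset)
    return union == complete_graph_edges(num_vertices)

def max_subset_size(vertex_subsets):
    """The quantity max{|V_1|, ..., |V_N|} that the cover tries to minimize."""
    return max(len(subset) for subset in vertex_subsets)

triples = [(1, 2, 4), (2, 3, 5), (3, 4, 6), (4, 5, 7),
           (1, 5, 6), (2, 6, 7), (1, 3, 7)]
```

Here the seven induced graphs together contain all 21 edges of the complete graph on 7 vertices, and the largest vertex set has 3 vertices.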
In step 320, whether the combination of the induced graphs covers the complete graph $G_M$ is determined according to the following condition:
Figure BDA0001391239140000101
where n denotes the index of an induced graph, n = 1, 2, ..., N.
Because $G_M$ has $\frac{M(M-1)}{2}$ edges and is covered by copies of $G_n$, each of which has $\frac{n(n-1)}{2}$ edges, we have
$\frac{n(n-1)}{2} \,\Big|\, \frac{M(M-1)}{2}$,
that is, $n(n-1) \mid M(M-1)$. In $G_M$, $M-1$ edges start from any given vertex A; A is also a vertex of some induced graph $G_n$, in which $n-1$ edges are connected to A, and thus $(n-1) \mid (M-1)$. Further, fix a vertex m; the number of $K_n$ containing m is
$\frac{M-1}{n-1}$.
The n vertices required by any other $K_n$ can then be taken at most one from each of the $K_n$ containing m, so
$n \le \frac{M-1}{n-1}$,
that is, $M - 1 \ge n(n-1)$.
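The three conditions derived above, n(n-1) | M(M-1), (n-1) | (M-1) and M-1 ≥ n(n-1), are necessary conditions only; satisfying them does not by itself prove that a cover exists. A small sketch with a hypothetical function name shows how they can be used to rule out a pair (M, n) early:

```python
def passes_necessary_conditions(num_files, n):
    """Check the three necessary conditions for covering G_M with copies of K_n."""
    if n < 2:
        return False
    edges_divide = (num_files * (num_files - 1)) % (n * (n - 1)) == 0
    degree_divides = (num_files - 1) % (n - 1) == 0
    enough_vertices = num_files - 1 >= n * (n - 1)
    return edges_divide and degree_divides and enough_vertices
```

The case M = n(n-1)+1 meets the third condition with equality, which is the case the construction in this description targets.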
In step 330, the determining an optimal coverage solution according to each induced graph specifically includes:
selecting induced graphs that meet the following conditions:
no two induced graphs of the complete graph $G_M$ share a common edge, and every edge of the complete graph $G_M$ lies on at least one induced graph;
each selected induced graph is an element of the optimal coverage solution, and the combination of the selected induced graphs $G_n$ is the optimal cover of the complete graph $G_M$, denoted $G_n | G_M$, where n denotes the index of an induced graph, n = 1, 2, ..., N.
In step 340, the sequentially allocating the data to be processed to each computing node according to the optimal coverage solution specifically includes:
defining four set variables $L_1$, $L_2$, $L_3$ and $L_4$, where $L_1$ stores the elements of the optimal coverage solution found so far; $L_2$ stores the differences between any two elements of the optimal coverage solution; $L_3$ stores the sums of the elements of $L_1$ and $L_2$; and $L_4$ stores the differences between a newly found element of the optimal coverage solution and the existing elements;
constructing an optimal coverage solution, and distributing data according to the obtained optimal coverage solution:
when M = N = n(n-1)+1 with n ≥ 2, distributing data according to the optimal coverage solution;
and when M > N, evenly packing the data files to be processed into N blocks, and distributing data according to the blocks and the optimal coverage solution.
Further, the performing data allocation specifically includes: making the number of data files to be processed on each computing node equal to n.
Specifically, when n is 2 or 3, the optimal solution is relatively short and is therefore constructed by enumeration, as shown in Table 1 and Table 2. The following points can be summarized from Tables 1 and 2: (1) if M = N = n(n-1)+1 with n ≥ 2, an optimal graph-cover solution can be constructed, and the number of data files on each node is n; (2) it suffices to find one combination of n vertices, such as (1,2,4) in Table 2, and then to increment every point in the combination repeatedly. When a point is incremented to N+1, its value is set to 1, and the n points are then rearranged in ascending order. The process turns out to be a cycle: in Table 2, incrementing each of the three points (1,3,7) on node 7 by 1 gives (2,4,8); setting 8 to 1 gives (1,2,4), which is the combination of node 1 again.
If a combination of n vertices, denoted $(V_1, V_2, \ldots, V_n)$, can be found that satisfies the definition in step 320 throughout the incrementing process, an optimal graph-cover solution is obtained. Observing Tables 1 and 2 yields the following rules: (1) among the n points, 1 and 2 can be fixed first; once 1 and 2 are fixed, 3 can be excluded, because incrementing (1,2,3) by 1 gives (2,3,4), so (2,3) is a common edge and the condition for optimal graph coverage is not satisfied. Thus, the third point starts at 4; (2) since the n points are incremented cyclically, no two pairs of points may have the same difference. For example, if the four points are taken as (1,2,4,7), a common edge appears when they are incremented to (4,5,7,10), and common edges remain until the fourth number is incremented to 13; (3) once rules (1) and (2) are satisfied, no common edge arises before the n-th point $V_n$ of $(V_1, V_2, \ldots, V_n)$ is incremented to N+1. It is therefore also required that the difference between $V_n$ and N+1 not equal the difference between any two points in the combination. For example, take n = 4 and M = N = 13 with the four points (1,2,4,8): the difference between 8 and 14 is 6, which is exactly the difference between 2 and 8. Thus, incrementing to (7,8,10,14) and setting 14 to 1 gives (1,7,8,10), and incrementing by 1 again gives (2,8,9,11), which produces the common edge (2,8).
Table 1: Optimal graph-cover solution when n = 2
Figure BDA0001391239140000111
Figure BDA0001391239140000121
Table 2: Optimal graph-cover solution when n = 3
Figure BDA0001391239140000122
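The cyclic incrementing described above, and the common-edge test behind the three rules, can be sketched as follows. The function names are hypothetical; the check simply counts how often each edge occurs across the N generated combinations.

```python
from collections import Counter
from itertools import combinations

def cyclic_cover(base, num_nodes):
    """Generate N vertex sets by adding 0, 1, ..., N-1 to every point of base,
    wrapping N+1 back to 1 and re-sorting in ascending order."""
    return [tuple(sorted((v - 1 + shift) % num_nodes + 1 for v in base))
            for shift in range(num_nodes)]

def edge_multiplicity(vertex_sets):
    """Count how many vertex sets contain each edge (x, y) with x < y."""
    counts = Counter()
    for vertices in vertex_sets:
        counts.update(combinations(vertices, 2))
    return counts

def is_exact_cover(vertex_sets, num_vertices):
    """True if every edge of the complete graph occurs exactly once,
    i.e. all edges are covered and no two sets share a common edge."""
    counts = edge_multiplicity(vertex_sets)
    expected = num_vertices * (num_vertices - 1) // 2
    return len(counts) == expected and all(c == 1 for c in counts.values())

nodes = cyclic_cover((1, 2, 4), 7)
```

In this sketch, `nodes[6]` is (1, 3, 7), matching the combination on node 7 mentioned in the text, while the rejected example (1, 2, 4, 8) with N = 13 fails the exact-cover test because the edge (2, 8) is generated twice.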
The following conclusions can be drawn from the above analysis: an optimal graph-cover solution exists when M = N = n(n-1)+1 with n ≥ 2, and it can be constructed with the three rules; once the optimal solution is constructed, the data and the corresponding tasks are distributed according to it. When M > N, the files are evenly divided into N blocks (Blocks) such that the numbers of files in any two blocks differ by at most 1; a solution is then constructed as in the case M = N, and the data and tasks are distributed. For example, if N = 7 and M = 9, 7 blocks can be constructed, as shown in Table 3.
Table 3: Structure of the blocks when M > N
Figure BDA0001391239140000123
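The even packing for M > N can be sketched as follows. This is an illustrative assumption about how the Blocks might be formed; the function name `pack_into_blocks` and the file labels are hypothetical, and the actual arrangement in Table 3 may differ.

```python
def pack_into_blocks(files, num_blocks):
    """Split `files` into `num_blocks` consecutive Blocks whose sizes
    differ by at most 1 (the first `extra` Blocks get one more file)."""
    base, extra = divmod(len(files), num_blocks)
    blocks, start = [], 0
    for index in range(num_blocks):
        size = base + (1 if index < extra else 0)
        blocks.append(files[start:start + size])
        start += size
    return blocks


files = [f"file{k}" for k in range(1, 10)]   # M = 9 hypothetical files
blocks = pack_into_blocks(files, 7)          # N = 7 Blocks
print([len(block) for block in blocks])      # [2, 2, 1, 1, 1, 1, 1]
```

Each Block is then treated as a single vertex, so the M = N construction applies unchanged.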
Using this construction method, we constructed the optimal solutions for M = N = 13, 21, and 31; together with the manually constructed solutions for n = 2 and n = 3, they are shown in Table 4:
Table 4. Optimal coverage solutions when N = 3, 7, 13, 21, 31
Figure BDA0001391239140000124
Figure BDA0001391239140000131
The specific implementation method comprises the following steps:
Assumption: when M = N = n(n-1)+1 and n ≥ 2, an optimal solution exists.
Step 1: Construct the optimal solution
Define variables: define four lists L_1, L_2, L_3, and L_4, and construct the optimal solution (V_1, V_2, ..., V_n), where V_1 ← 1, V_2 ← 2, and V_3 starts from 4.
Store V_1, V_2, V_3 in L_1, and store the differences between any two of V_1, V_2, V_3 in L_2 without repetition:
while V_3 < N do
    for i = 4 to n do
        for x in L_1 do
            for y in L_2 do
                put x + y into L_3 without repetition
            end for
        end for
        sort L_3 in ascending order; find the first natural number between V_(i-1) and L_3[last] that is not in L_3; if there is one, assign it to V_i; otherwise assign L_3[last]++ to V_i
        while V_i < N do
            for z in L_1 do
                L_4.add(V_i - z)
            end for
            if (N+1-V_i) ∈ L_2 || (N+1-V_i) ∈ L_4 then
                V_i++
                if V_i is an element of L_3 between V_(i-1) and L_3[last] then
                    V_i ← L_3[last]++; continue
                end if
            else
                put V_i into L_1, copy L_4 into L_2, and clear L_4
                i++
                break
            end if
        end while
        if all elements V_i of the optimal solution have been found then
            store the optimal solution and exit the loop
        else
            V_3++
        end if
    end for
end while
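For experimentation, a base solution can also be found with a plain backtracking search. The sketch below is a stand-in that swaps in exhaustive search for the L_1 to L_4 bookkeeping of Step 1; it is not the procedure described above. It assumes N = n(n-1)+1 and works with differences modulo N. The function name `find_base_solution` is hypothetical.

```python
def find_base_solution(n):
    """Search for (V_1, ..., V_n) with V_1 = 1 and V_2 = 2 whose pairwise
    differences modulo N = n(n-1)+1 are all distinct.

    Returns the first such list in lexicographic order, or None.
    """
    assert n >= 2
    modulus = n * (n - 1) + 1

    def extend(points, used):
        if len(points) == n:
            return points
        for candidate in range(points[-1] + 1, modulus + 1):
            new_diffs = set()
            for p in points:
                forward = (candidate - p) % modulus
                backward = (p - candidate) % modulus
                if {forward, backward} & (used | new_diffs):
                    break  # a difference would repeat: common edge
                new_diffs.update((forward, backward))
            else:
                found = extend(points + [candidate], used | new_diffs)
                if found is not None:
                    return found
        return None

    # Rule (1): start from 1 and 2, whose differences are 1 and N-1.
    return extend([1, 2], {1, modulus - 1})


print(find_base_solution(3))  # [1, 2, 4]
print(find_base_solution(4))  # [1, 2, 4, 10]
```

Exhaustive search is only practical for small n; it is shown here to make the constraint explicit, not as a replacement for the construction in Step 1.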
Step 2: Distribute the data to the computing nodes in sequence according to the optimal solution:
When M = N, the data are distributed according to the optimal coverage solution; when M > N, the files are first evenly packed into N Blocks, and the data are then distributed according to the optimal coverage solution.
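Step 2 can be illustrated with a short sketch, assuming that node k simply receives the base solution shifted cyclically by k positions. The function name `distribute` and the 0-based shift are assumptions made for illustration; the base solution (1, 2, 4) for N = 7 follows the rules discussed above.

```python
from collections import Counter


def distribute(base, num_nodes):
    """Return one sorted file list per node: node k (0-based) receives the
    base solution shifted cyclically by k, with num_nodes + 1 wrapping to 1."""
    return [sorted((v - 1 + k) % num_nodes + 1 for v in base)
            for k in range(num_nodes)]


assignment = distribute((1, 2, 4), 7)
for node, files in enumerate(assignment, start=1):
    print(f"node {node}: files {files}")

# Each file is stored on exactly n = 3 nodes.
copies = Counter(f for files in assignment for f in files)
print(sorted(copies.values()))  # [3, 3, 3, 3, 3, 3, 3]
```

When M > N, the same assignment applies to Blocks instead of individual files.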
The graph-coverage-based data distribution for big-data full comparison provides an in-depth theoretical analysis of the full-comparison problem, models the data distribution problem of full-comparison calculation, and proposes a data distribution method based on graph coverage. First, the theoretical basis for solving the data distribution problem of full-comparison calculation with graph coverage is introduced; second, it is shown that under certain conditions an optimal graph-coverage solution can be constructed, and several sets of optimal solutions are constructed. Compared with the heuristic method, the graph-coverage-based data distribution algorithm not only ensures 100% data locality and load balance for the comparison tasks but also achieves better computational performance when an optimal coverage solution exists.
In addition, the invention also provides a data distribution system for big-data full comparison based on graph coverage. As shown in Fig. 4, the data distribution system of the present invention includes a mapping unit 1, a dividing unit 2, a determining unit 3, and a distribution unit 4.
The mapping unit 1 is configured to abstract M data files to be processed into vertices of a graph, abstract the comparison calculation between any two data files to be processed into an edge of the graph, and map the full-comparison calculation of the M data files to be processed into a complete graph G_M; the complete graph is a graph in which every pair of vertices is connected by an edge;
the dividing unit 2 is configured to divide the complete graph G_M into N induced graphs G(V_1), G(V_2), ..., G(V_N) such that the union of the induced graphs covers the complete graph G_M and max{|V_1|, |V_2|, ..., |V_N|} is minimized; where V denotes a vertex set and |V_N| denotes the number of vertices in the vertex set of the N-th induced graph;
the determining unit 3 is configured to determine an optimal coverage solution according to the induced graphs;
the distribution unit 4 is configured to distribute the data to be processed to the computing nodes in sequence according to the optimal coverage solution.
Specifically, the dividing unit 2 determines, according to the following condition, that the union of the induced graphs covers the complete graph G_M:
Figure BDA0001391239140000151
where n denotes the index of an induced graph, n = 1, 2, ..., N.
The determining unit 3 determines an optimal coverage solution according to the induced graphs, which specifically includes:
selecting induced graphs that satisfy the following conditions:
no two induced graphs of the complete graph G_M share a common edge, and every edge of the complete graph G_M lies on at least one induced graph;
each selected induced graph is an optimal coverage solution, and the union of the selected induced graphs G_n is an optimal coverage of the complete graph G_M, denoted G_n | G_M; where n denotes the index of an induced graph, n = 1, 2, ..., N.
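The two selection conditions can be tested directly on a candidate family of vertex sets. The following is a minimal sketch; the names `coverage_report` and `is_optimal_coverage` are hypothetical, and the example families are generated by cyclic shifts purely for illustration.

```python
from collections import Counter
from itertools import combinations


def coverage_report(vertex_sets, num_vertices):
    """Return (uncovered, shared): edges of the complete graph that lie on
    no induced graph, and edges that lie on more than one."""
    edge_counts = Counter()
    for vertices in vertex_sets:
        # Every pair of vertices in an induced graph of G_M is an edge.
        edge_counts.update(combinations(sorted(vertices), 2))
    all_edges = set(combinations(range(1, num_vertices + 1), 2))
    uncovered = all_edges - set(edge_counts)
    shared = {edge for edge, count in edge_counts.items() if count > 1}
    return uncovered, shared


def is_optimal_coverage(vertex_sets, num_vertices):
    uncovered, shared = coverage_report(vertex_sets, num_vertices)
    return not uncovered and not shared


good = [{(v - 1 + k) % 7 + 1 for v in (1, 2, 4)} for k in range(7)]
bad = [{(v - 1 + k) % 7 + 1 for v in (1, 2, 3)} for k in range(7)]
print(is_optimal_coverage(good, 7))  # True
print(is_optimal_coverage(bad, 7))   # False
```

In the failing family, edges such as (2,3) appear on two induced graphs while edges such as (1,4) appear on none, which is exactly what the two conditions rule out.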
The distribution unit 4 distributes the data to be processed to the computing nodes in sequence according to the optimal coverage solution, which specifically includes:
defining four set variables L_1, L_2, L_3, and L_4, where L_1 stores the optimal coverage solution elements found so far; L_2 stores the differences between any two optimal coverage solution elements; L_3 stores the sums of the elements of L_1 and L_2; and L_4 stores the differences between a newly found optimal coverage solution element and the existing optimal coverage solution elements;
constructing an optimal coverage solution and distributing the data according to the obtained optimal coverage solution:
when M = N = n(n-1)+1 and n ≥ 2, distributing the data according to the optimal coverage solution;
when M > N, evenly packing the data files to be processed into N regions and distributing the data according to the regions and the optimal coverage solution.
The data distribution specifically includes: making the number of data files to be processed on each computing node equal to n.
Compared with the prior art, the data distribution system for big-data full comparison based on graph coverage has the same beneficial effects as the corresponding data distribution method, which are not repeated here.
The embodiments in this description are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A data distribution method for big-data full comparison based on graph coverage, characterized in that the data distribution method comprises:
abstracting M data files to be processed into vertices of a graph, abstracting the comparison calculation between any two data files to be processed into an edge of the graph, and mapping the full-comparison calculation of the M data files to be processed into a complete graph G_M, the complete graph being a graph in which every pair of vertices is connected by an edge;
dividing the complete graph G_M into N induced graphs G(V_1), G(V_2), ..., G(V_N) such that the union of the induced graphs covers the complete graph G_M and max{|V_1|, |V_2|, ..., |V_N|} is minimized, where V denotes a vertex set and |V_N| denotes the number of vertices in the vertex set of the N-th induced graph;
determining an optimal coverage solution according to the induced graphs;
distributing the data to be processed to the computing nodes in sequence according to the optimal coverage solution, which specifically comprises:
defining four set variables L_1, L_2, L_3, and L_4, where L_1 stores the optimal coverage solution elements found so far; L_2 stores the differences between any two optimal coverage solution elements; L_3 stores the sums of the elements of L_1 and L_2; and L_4 stores the differences between a newly found optimal coverage solution element and the existing optimal coverage solution elements;
constructing an optimal coverage solution and distributing the data according to the obtained optimal coverage solution:
when M = N = n(n-1)+1 and n ≥ 2, distributing the data according to the optimal coverage solution;
when M > N, evenly packing the data files to be processed into N regions and distributing the data according to the regions and the optimal coverage solution; where n denotes the number of vertices in each induced graph.
2. The data distribution method for big-data full comparison based on graph coverage according to claim 1, characterized in that the union of the induced graphs is determined to cover the complete graph G_M according to the following condition:
Figure FDA0002307404150000011
where n denotes the number of vertices in each induced graph, 1 < n ≤ M, there are N induced graphs in total, and | denotes optimal coverage.
3. The data distribution method for big-data full comparison based on graph coverage according to claim 1, characterized in that determining an optimal coverage solution according to the induced graphs specifically comprises:
selecting induced graphs that satisfy the following conditions:
no two induced graphs of the complete graph G_M share a common edge, and every edge of the complete graph G_M lies on at least one induced graph;
each selected induced graph is an optimal coverage solution, and the union of the selected induced graphs G_n is an optimal coverage of the complete graph G_M, denoted G_n | G_M; where n denotes the number of vertices in each induced graph, 1 < n ≤ M, and there are N induced graphs in total.
4. The data distribution method for big-data full comparison based on graph coverage according to claim 1, characterized in that the data distribution specifically comprises:
making the number of data files to be processed on each computing node equal to n.
5. A data distribution system for big-data full comparison based on graph coverage, characterized in that the data distribution system comprises:
a mapping unit, configured to abstract M data files to be processed into vertices of a graph, abstract the comparison calculation between any two data files to be processed into an edge of the graph, and map the full-comparison calculation of the M data files to be processed into a complete graph G_M, the complete graph being a graph in which every pair of vertices is connected by an edge;
a dividing unit, configured to divide the complete graph G_M into N induced graphs G(V_1), G(V_2), ..., G(V_N) such that the union of the induced graphs covers the complete graph G_M and max{|V_1|, |V_2|, ..., |V_N|} is minimized, where V denotes a vertex set and |V_N| denotes the number of vertices in the vertex set of the N-th induced graph;
a determining unit, configured to determine an optimal coverage solution according to the induced graphs;
a distribution unit, configured to distribute the data to be processed to the computing nodes in sequence according to the optimal coverage solution, which specifically comprises:
defining four set variables L_1, L_2, L_3, and L_4, where L_1 stores the optimal coverage solution elements found so far; L_2 stores the differences between any two optimal coverage solution elements; L_3 stores the sums of the elements of L_1 and L_2; and L_4 stores the differences between a newly found optimal coverage solution element and the existing optimal coverage solution elements;
constructing an optimal coverage solution and distributing the data according to the obtained optimal coverage solution:
when M = N = n(n-1)+1 and n ≥ 2, distributing the data according to the optimal coverage solution, where n denotes the number of vertices in each induced graph;
when M > N, evenly packing the data files to be processed into N regions and distributing the data according to the regions and the optimal coverage solution.
6. The data distribution system for big-data full comparison based on graph coverage according to claim 5, characterized in that the dividing unit determines, according to the following condition, that the union of the induced graphs covers the complete graph G_M:
Figure FDA0002307404150000031
where n denotes the number of vertices in each induced graph, 1 < n ≤ M, there are N induced graphs in total, and | denotes optimal coverage.
7. The data distribution system for big-data full comparison based on graph coverage according to claim 5, characterized in that the determining unit determining an optimal coverage solution according to the induced graphs specifically comprises:
selecting induced graphs that satisfy the following conditions:
no two induced graphs of the complete graph G_M share a common edge, and every edge of the complete graph G_M lies on at least one induced graph;
each selected induced graph is an optimal coverage solution, and the union of the selected induced graphs G_n is an optimal coverage of the complete graph G_M, denoted G_n | G_M; where n denotes the number of vertices in each induced graph, 1 < n ≤ M, and there are N induced graphs in total.
8. The data distribution system for big-data full comparison based on graph coverage according to claim 5, characterized in that the data distribution specifically comprises:
making the number of data files to be processed on each computing node equal to n.
CN201710751446.1A 2017-08-28 2017-08-28 Big data full-comparison data distribution method and system based on graph coverage Active CN107506452B (en)


Publications (2)

Publication Number Publication Date
CN107506452A CN107506452A (en) 2017-12-22
CN107506452B (en) 2020-05-08

Family

ID=60693691


Country Status (1)

Country Link
CN (1) CN107506452B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1402176A (en) * 2001-08-21 2003-03-12 松下电器产业株式会社 Data allocation method and system
CN102929945A (en) * 2012-09-28 2013-02-13 用友软件股份有限公司 Data distribution device and data distribution method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Data-aware task scheduling for all-to-all comparison problems in;Yi-Fan Zhang et al.;《Journal of Parallel Distributed Computing》;20160425;第87-101页 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant