CN107506452B

CN107506452B - Big data full-comparison data distribution method and system based on graph coverage

Info

Publication number: CN107506452B
Application number: CN201710751446.1A
Authority: CN
Inventors: 张雪英; 李凤莲; 田玉楚; 李彦民; 焦江丽; 高燕军
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2017-08-28
Filing date: 2017-08-28
Publication date: 2020-05-08
Anticipated expiration: 2037-08-28
Also published as: CN107506452A

Abstract

The invention discloses a big data full-comparison data distribution method and system based on graph coverage. The data distribution method comprises the following steps: abstracting M data files to be processed into vertices of a graph, abstracting the comparison calculation between any two data files to be processed into an edge of the graph, and mapping the full comparison calculation of the M data files to be processed into a complete graph G_M; dividing the complete graph G_M into N induced graphs whose union covers G_M while minimizing max{|V₁|, |V₂|, ..., |V_N|}; determining an optimal coverage solution according to the induced graphs; and sequentially distributing the data to be processed to the computing nodes according to the optimal coverage solution. By abstracting data files into vertices and comparison calculations into edges, the method maps the full comparison calculation onto a complete graph, applies graph covering to the data distribution problem of full comparison calculation, and determines an optimal coverage solution by dividing the graph into induced graphs, thereby achieving globally optimal data distribution.

Description

Big data full-comparison data distribution method and system based on graph coverage

Technical Field

The invention relates to the technical field of data distribution for distributed big data computation, and in particular to a graph-coverage-based data distribution method for big data full comparison.

Background

Full comparison is a special class of computational problem that arises widely in bioinformatics, biometrics, data mining and other fields. In bioinformatics, lineage relationships are inferred by comparing gene sequences of different species. In biometrics, a typical full comparison problem is to identify human physiological features, as in facial recognition, fingerprint identification and palm scanning, by pairwise comparison of large amounts of data in a biometric database. In data mining, the computation of a similarity matrix, which represents the similarity between the objects under consideration, is a key step in classification and clustering analysis. Sequence alignment, cluster analysis, and global network alignment, a current research focus, are all typical full comparison problems in computational biology and bioinformatics.

Full comparison computation represents a typical mode of computation in which each data item in a data set is compared with all other data items in the data set. As the number of files in the data set or the amount of data in the files grows, the scale of the full comparison calculation grows with it. Solutions to the full comparison problem have been proposed in some specific areas, such as the well-known BLAST and ClustalW. In addition, distributed systems, such as the open-source distributed processing framework Hadoop, are widely used to solve large-scale data-intensive computational problems, including full comparison calculations. In recent years, it has been proposed to abstract the data distribution problem of full comparison computation into a constrained combinatorial optimization problem and to solve it with a heuristic algorithm.

However, the existing method requires all data files to be stored on each node in the system, which significantly increases the time overhead and communication cost, and requires a large storage space. In addition, the data allocation strategy of Hadoop does not consider the dependency relationship between the comparison task and the data, so the calculation efficiency for full comparison is low. Compared with Hadoop, the data distribution strategy based on the heuristic algorithm improves the overall calculation performance. However, as the amount of data increases, the solution space becomes larger and the problem size grows exponentially. Furthermore, the heuristic algorithm cannot guarantee the global optimality of the solution.

Disclosure of Invention

The invention aims to provide a data distribution method and a data distribution system based on graph coverage big data full comparison, which can ensure the global optimality of data distribution.

In order to achieve the purpose, the invention provides the following scheme:

a data distribution method based on graph-covered big data full comparison comprises the following steps:

abstracting M data files to be processed into vertices of a graph, abstracting the comparison calculation between any two data files to be processed into an edge of the graph, and mapping the full comparison calculation of the M data files to be processed into a complete graph G_M, where a complete graph is a graph in which every pair of vertices is connected by an edge;

dividing the complete graph G_M into N induced graphs G(V₁), G(V₂), ..., G(V_N) such that the union of the induced graphs covers the complete graph G_M and max{|V₁|, |V₂|, ..., |V_N|} is minimized, where V_n denotes a vertex set and |V_n| denotes the number of vertices in the vertex set of the n-th induced graph;

determining an optimal coverage solution according to each induced graph;

and sequentially distributing the data to be processed to each computing node according to the optimal coverage solution.

Optionally, whether the union of the induced graphs covers the complete graph G_M is determined according to the following condition:

where n denotes the induced graph index, n = 1, 2, ..., N.

Optionally, the determining an optimal coverage solution according to each induced graph specifically includes:

selecting induced graphs meeting the following conditions:

no two induced graphs of the complete graph G_M share a common edge, and every edge of the complete graph G_M lies on at least one induced graph;

each selected induced graph is an optimal coverage solution, and the selected induced graphs G_n together form the optimal coverage of the complete graph G_M, denoted G_n|G_M, where n denotes the induced graph index, n = 1, 2, ..., N.

Optionally, the sequentially allocating the data to be processed to each computing node according to the optimal coverage solution specifically includes:

defining four set variables L₁, L₂, L₃ and L₄, where L₁ stores the elements of the optimal coverage solution found so far; L₂ stores the differences between any two elements of the optimal coverage solution; L₃ stores the sums of the elements of L₁ and L₂; and L₄ stores the differences between a newly found element of the optimal coverage solution and the existing elements;

constructing an optimal coverage solution, and distributing data of the obtained optimal coverage solution:

when M = N = n(n-1)+1 and n ≥ 2, performing data distribution according to the optimal coverage solution;

and when M > N, evenly packing the data files to be processed into N blocks, and distributing the data according to the blocks and the optimal coverage solution.

Optionally, the performing data allocation specifically includes:

and enabling the number of the data files to be processed on each computing node to be n.

In order to achieve the above purpose, the invention also provides the following scheme:

a data distribution system based on graph-overlaid big-data full-comparison, the data distribution system comprising:

a mapping unit for abstracting M data files to be processed into vertices of a graph, abstracting the comparison calculation between any two data files to be processed into an edge of the graph, and mapping the full comparison calculation of the M data files to be processed into a complete graph G_M, where a complete graph is a graph in which every pair of vertices is connected by an edge;

a dividing unit for dividing the complete graph G_M into N induced graphs G(V₁), G(V₂), ..., G(V_N) such that the union of the induced graphs covers the complete graph G_M and max{|V₁|, |V₂|, ..., |V_N|} is minimized, where V_n denotes a vertex set and |V_n| denotes the number of vertices in the vertex set of the n-th induced graph;

a determining unit, configured to determine an optimal coverage solution according to each of the induction maps;

and the distribution unit is used for sequentially distributing the data to be processed to each computing node according to the optimal coverage solution.

Optionally, the dividing unit determines whether the union of the induced graphs covers the complete graph G_M according to the following condition:

where n denotes the induced graph index, n = 1, 2, ..., N.

Optionally, the determining unit determines the optimal coverage solution according to each induced graph, and specifically includes:

selecting induced graphs meeting the following conditions:

Optionally, the allocating unit sequentially allocates the data to be processed to each computing node according to the optimal coverage solution, and specifically includes:

Optionally, the performing data allocation specifically includes:

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

According to the method, the data files to be processed are abstracted into vertices of a graph, and the comparison calculation between any two data files to be processed is abstracted into an edge of the graph, so that the full comparison calculation of the data files is mapped onto a complete graph. Graph covering is then applied to solve the data distribution problem of full comparison calculation: an optimal coverage solution is determined by dividing the complete graph into induced graphs, which fixes the amount of data distributed to each computing node and achieves globally optimal data distribution.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.

FIG. 1 is a full comparison calculation process;

FIG. 2 is a schematic diagram of the additional tasks resulting from adding new data d to node k;

FIG. 3 is a flowchart of a data distribution method based on graph-covered big data full comparison according to an embodiment of the present invention;

FIG. 4 is a block diagram of a data distribution system based on full comparison of big data covered by graphs according to an embodiment of the present invention.

Description of the symbols:

mapping unit-1, dividing unit-2, determining unit-3, and allocating unit-4.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention aims to provide a big data full-comparison data distribution method and system based on graph coverage.

Distributed Computing (Distributed Computing): originally referred to as a network of computers with individual computers dispersed throughout a particular geographic location. Today, the term is used in a broader sense, such as referring to processes running on the same computer, where the processes communicate with each other through message passing. "distributed computing" and "parallel computing" have many similarities, and the same system can be considered "parallel" or "distributed". "parallel computing" may be considered a tightly coupled form of distributed computing, which may be considered a loosely coupled form of parallel computing. In parallel computing, all processors can exchange information by accessing a shared memory, in distributed computing, each processor has a private memory, and information exchange is realized by message passing between the processors.

Full comparison computing (All-to-All Comparison Computing): represents a typical mode of computation in which each data item in a data set is compared with all other data items in the data set.

Graph coverage (Graph Covering): given a graph and a number of induced graphs, the original graph is covered with these induced graphs while the maximum number of vertices in any induced graph is minimized.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

The full comparison problem is defined as follows.

Let A denote the set of data files to be processed, C the comparison function applied to the data in A, and M the output similarity matrix of A. The full comparison calculation can be expressed as:

M_ij = C(A_i, A_j), i, j = 1, 2, ..., |A|    (1)

where A_i denotes the i-th data item in A, M_ij denotes the element of the output matrix M that holds the result of comparing A_i with A_j, and |A| denotes the number of data items in A.

A typical full comparison problem is shown in FIG. 1, where each data item needs to be compared with all other data items. In the output matrix shown in FIG. 1, the comparison between two data items does not depend on their order, i.e.:

C(A_i, A_j) = C(A_j, A_i)

Therefore, the matrix is symmetric and only its upper triangular elements need to be calculated.
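The symmetry argument can be illustrated with a short Python sketch. The comparison function `shared_chars` and the sample strings are hypothetical stand-ins chosen only for illustration; the source does not prescribe any particular comparison function.

```python
from itertools import combinations


def full_comparison(data, compare):
    """Fill a symmetric similarity matrix, calling compare once per unordered pair."""
    size = len(data)
    matrix = [[None] * size for _ in range(size)]
    for i, j in combinations(range(size), 2):  # i < j: upper triangle only
        matrix[i][j] = matrix[j][i] = compare(data[i], data[j])
    return matrix


def shared_chars(a, b):
    """Hypothetical symmetric comparison: number of distinct characters in common."""
    return len(set(a) & set(b))


files = ["ACGT", "ACCA", "GGTT", "TTAA"]
similarity = full_comparison(files, shared_chars)
pair_count = len(list(combinations(range(len(files)), 2)))
print(pair_count)  # number of comparisons actually performed
```

For |A| data items this performs |A|(|A|-1)/2 comparisons instead of |A|², which is the task count used in the load-balancing discussion.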

In order to be able to efficiently solve the full comparison problem, the following aspects will be looked at: (1) the computational performance of each full compare task; (2) overall performance of the distributed system; (3) overhead of allocating data.

(1) Computational performance of each full comparison task: for each comparison task, if the data required by the task is stored on the node performing the task, the task does not need to access data remotely over the network. In this case, the comparison task can be performed immediately without additional data movement between nodes. Let C(x, y), T, T_i and D_i denote, respectively, the comparison task between data x and y, the set of all comparison tasks, the set of tasks performed by node i, and the data set stored on node i. Good data locality for all comparison tasks can be described as equation (2):

∀ C(x, y) ∈ T, ∃ i ∈ {1, ..., N} such that x ∈ D_i, y ∈ D_i and C(x, y) ∈ T_i    (2)

(2) Overall performance of the distributed system: if each computing node can be assigned a workload that matches its processing capacity, the system achieves load balancing and all nodes complete their tasks at the same time, making full use of the system's computing capacity. Let |T_i| denote the number of comparison tasks executed by the i-th node. For a distributed system with N nodes and M data files, M(M-1)/2 comparison tasks must be distributed among the nodes. Load balancing of the system can be expressed as equation (3):

|T_i| ≤ ⌈M(M-1)/(2N)⌉, i = 1, 2, ..., N    (3)

where ⌈·⌉ denotes the ceiling (round-up) function.

(3) Overhead of allocating data: although storing all of the data on every node satisfies equations (2) and (3), the overhead of allocating data should be reduced further. For all of the data to be distributed into the distributed system, the storage used on each node must stay within that node's capacity. In addition, the time taken to allocate data should be reduced. In general, the data distribution time is linear in the number of distributed data files. Let |D_i| denote the number of files allocated to node i. Considering the limits on data allocation time and storage, the data allocation policy needs to satisfy equation (4):

Minimize max{|D₁|, |D₂|, ..., |D_N|}    (4)

Therefore, data allocation is performed with equation (4) as the objective, subject to the constraints of equations (2) and (3).

Data allocation problem: given a full comparison problem of M data and a distributed system of N nodes, the data files are distributed to the nodes so that all comparison tasks have local data (equation (2)). In addition, each node can be assigned a number of comparison tasks that matches its computational power (equation (3)). In the case where the above requirements are satisfied, the maximum number of data files among all nodes is minimized.
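The three requirements can be checked mechanically for any proposed allocation. The following is a minimal sketch, assuming that locality means every task's two files are stored on the node that runs it and that balance means no node exceeds ⌈M(M-1)/(2N)⌉ tasks; the function name `check_allocation` and the sample allocation are illustrative and do not come from the source.

```python
from itertools import combinations


def check_allocation(num_files, data_on_node, tasks_on_node):
    """Return (locality_ok, balance_ok, max_files) for a proposed allocation.

    data_on_node[i]  -- set of file ids stored on node i (D_i)
    tasks_on_node[i] -- set of (x, y) pairs with x < y executed on node i (T_i)
    """
    num_nodes = len(data_on_node)
    all_tasks = set(combinations(range(1, num_files + 1), 2))
    assigned = set().union(*tasks_on_node)
    # Locality: every task is assigned, and both of its files are local.
    locality_ok = assigned == all_tasks and all(
        x in data_on_node[i] and y in data_on_node[i]
        for i in range(num_nodes)
        for (x, y) in tasks_on_node[i]
    )
    # Load balance: no node exceeds ceil(M(M-1) / (2N)) tasks.
    total_tasks = num_files * (num_files - 1) // 2
    bound = (total_tasks + num_nodes - 1) // num_nodes
    balance_ok = all(len(tasks) <= bound for tasks in tasks_on_node)
    # Objective of equation (4): the largest number of files on any node.
    max_files = max(len(files) for files in data_on_node)
    return locality_ok, balance_ok, max_files


# Illustrative allocation for M = 3 files on N = 3 nodes.
data = [{1, 2}, {2, 3}, {1, 3}]
tasks = [{(1, 2)}, {(2, 3)}, {(1, 3)}]
result = check_allocation(3, data, tasks)
print(result)
```

A checker of this kind only verifies a candidate allocation; finding an allocation that minimizes the third value is the optimization problem the rest of the description addresses.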

The following will focus on a data distribution strategy based on heuristic algorithms.

As shown in FIG. 2, the rightmost column of the comparison matrix shows that when a new data file d is assigned to a node k that already stores p data files, additional comparison tasks can be assigned to node k. Thus, if as many comparison tasks as possible are assigned to node k, the total number of data files that need to be distributed can be minimized.

The additional comparison tasks resulting from adding new data d to node k include those that have never been previously assigned and those that have already been assigned. The relevant rules for distributing data can be summarized as follows:

rule 1: for those comparison tasks that have never been previously assigned, the data allocation policy is designed to assign as many tasks as possible to node k, and obey equation (4).

Rule 2: for those comparison tasks that have already been assigned, the data allocation policy is designed to redistribute these comparison tasks while respecting equation (4). For example, if a comparison task t has already been assigned to node q, the policy compares the number of assigned comparison tasks between node k and node q, and reassigns the comparison tasks.

Through these heuristic rules, a corresponding data allocation algorithm will be proposed:

The data distribution method based on the heuristic method and the task-driven method comprises the following steps:

Step 1: Find all unallocated comparison tasks.

Step 2: Find the data files needed by the unallocated comparison tasks and put them into a set I, which is initialized as the empty set.

Step 3: From set I, find the data file needed by the largest number of unallocated comparison tasks. This data file is denoted by d.

Step 4: Select the set of storage nodes that satisfy the following conditions: (1) the node does not hold data file d; (2) the node has been assigned the smallest number of comparison tasks; (3) the node stores the smallest number of data files. This set is denoted by C.

Step 5: Check all nodes in set C according to rule 1. If none of the nodes satisfies equation (4), remove data file d from set I and return to step 3.

Step 6: Find a node k in set C that either is empty or can be assigned the largest number of new unallocated comparison tasks resulting from the addition of data file d, and assign data file d to node k.

Step 7: For comparison tasks that arise from adding data file d in step 6 but have previously been allocated to other nodes, redistribute them using rule 2.

Step 8: Repeat steps 1 to 7 until all pairwise comparison tasks are assigned to nodes.

The heuristic data distribution algorithm addresses the minimization problem in equation (4): all comparison tasks are distributed while the total amount of distributed data is kept as small as possible. Distributing the data evenly across the computing nodes helps to meet the storage constraint of each node. The load-balancing requirement of equation (3) is easily implemented as a constraint on the optimization problem (4) when every node has the same or a similar number of comparison tasks.

However, as the amount of data increases, the solution space becomes larger and the problem size grows exponentially. Furthermore, the heuristic algorithm cannot guarantee the global optimality of the solution.

As shown in fig. 3, the data distribution method based on the full comparison of the big data covered by the graph includes:

Step 310: abstracting M data files to be processed into vertices of a graph, abstracting the comparison calculation between any two data files to be processed into an edge of the graph, and mapping the full comparison calculation of the M data files to be processed into a complete graph G_M, where a complete graph is a graph in which every pair of vertices is connected by an edge;

Step 320: dividing the complete graph G_M into N induced graphs G(V₁), G(V₂), ..., G(V_N) such that the union of the induced graphs covers the complete graph G_M and max{|V₁|, |V₂|, ..., |V_N|} is minimized, where V_n denotes a vertex set and |V_n| denotes the number of vertices in the vertex set of the n-th induced graph;

step 330: determining an optimal coverage solution according to each induced graph;

step 340: and sequentially distributing the data to be processed to each computing node according to the optimal coverage solution.

Here G(V, E) denotes a graph, where V and E are its vertex set and edge set, respectively. A subset V' of the vertex set V, together with all edges of G whose endpoints both lie in V', forms an induced graph, denoted G[V']. The union of two induced graphs G₁(V₁, E₁) and G₂(V₂, E₂) is G = (V₁∪V₂, E₁∪E₂), i.e. the vertex sets and the edge sets of the two graphs are merged respectively.
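These definitions translate directly into set operations. The sketch below, with illustrative function names not taken from the source, builds the edge set of an induced graph of a complete graph and tests whether the union of several induced graphs covers every edge of G_M.

```python
from itertools import combinations


def induced_edges(vertex_subset):
    """Edges of the subgraph of a complete graph induced by vertex_subset."""
    return {frozenset(pair) for pair in combinations(sorted(vertex_subset), 2)}


def covers_complete_graph(num_vertices, vertex_subsets):
    """True if the union of the induced graphs contains every edge of G_M."""
    complete = induced_edges(range(1, num_vertices + 1))
    union = set()
    for subset in vertex_subsets:
        union |= induced_edges(subset)
    return union == complete


subsets_ok = [{1, 2, 3}, {3, 4}, {1, 4}, {2, 4}]
subsets_bad = [{1, 2, 3}, {3, 4}]
print(covers_complete_graph(4, subsets_ok), covers_complete_graph(4, subsets_bad))
```

In the data distribution setting, each vertex subset is the set of files stored on one node, and an uncovered edge is a comparison task that no node could run locally.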

In step 320, whether the union of the induced graphs covers the complete graph G_M is determined according to the following condition:

where n denotes the induced graph index, n = 1, 2, ..., N.

Because G_M has M(M-1)/2 edges and is to be covered, without common edges, by induced graphs G_n that are complete graphs K_n with n(n-1)/2 edges each, n(n-1)/2 must divide M(M-1)/2, i.e. n(n-1) | M(M-1).

G_M has (M-1) edges starting from any given vertex A. Vertex A is also a vertex of some induced graph G_n, in which (n-1) edges are incident to A, so (n-1) | (M-1).

Furthermore, fix a vertex m. The number of K_n containing m is (M-1)/(n-1). The n vertices of any other K_n can include at most one vertex from each K_n containing m, so

(M-1)/(n-1) ≥ n,

that is, M-1 ≥ n(n-1).
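As a quick numerical check of these necessary conditions, the sketch below evaluates them for given M and n. The function name is illustrative; the conditions are necessary only, so a True result does not by itself show that a covering exists.

```python
def necessary_conditions(num_files, files_per_node):
    """Check the three necessary conditions on M and n (requires n >= 2)."""
    m, n = num_files, files_per_node
    return (
        (m * (m - 1)) % (n * (n - 1)) == 0  # n(n-1) divides M(M-1)
        and (m - 1) % (n - 1) == 0          # (n-1) divides (M-1)
        and m - 1 >= n * (n - 1)            # M-1 >= n(n-1)
    )


# M = n(n-1)+1 meets the last condition with equality.
tight_cases = {n: n * (n - 1) + 1 for n in range(2, 7)}
all_pass = all(necessary_conditions(m, n) for n, m in tight_cases.items())
print(tight_cases, all_pass)
```

The smallest M satisfying the inequality for a given n is M = n(n-1)+1, which is the case treated in the construction that follows.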

In step 330, the determining an optimal coverage solution according to each induced graph specifically includes:

selecting induced graphs meeting the following conditions:

In step 340, the sequentially allocating the data to be processed to each computing node according to the optimal coverage solution specifically includes:

Four set variables L₁, L₂, L₃ and L₄ are defined, where L₁ stores the elements of the optimal coverage solution found so far; L₂ stores the differences between any two elements of the optimal coverage solution; L₃ stores the sums of the elements of L₁ and L₂; and L₄ stores the differences between a newly found element of the optimal coverage solution and the existing elements;

Further, the performing data allocation specifically includes: and enabling the number of the data files to be processed on each computing node to be n.

Specifically, when n is 2 or 3, the optimal solution is relatively short, so it is constructed by enumeration, as shown in Table 1 and Table 2. The following points can be summarized from Tables 1 and 2: (1) if M = N = n(n-1)+1 and n ≥ 2, an optimal graph-coverage solution can be constructed, and the number of data files on each node is then n; (2) it is sufficient to find one combination of n vertices, such as (1,2,4) in Table 2, and then to increment every point in the combination by 1 repeatedly. When a point is incremented to N+1, its value is set to 1, and the n points are then re-sorted in ascending order. This process forms a cycle: in Table 2, the three points on node 7 are (1,3,7); incrementing each point by 1 gives (2,4,8), and setting 8 to 1 gives (1,2,4), which is the combination on node 1.
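The cyclic incrementing described in point (2) can be sketched in a few lines of Python. The function name `cyclic_solution` is illustrative; the base combination (1,2,4) with N = 7 is the example given in the text.

```python
def cyclic_solution(base, num_nodes):
    """Generate one vertex combination per node by cyclic incrementing.

    Each step adds 1 to every point; a point that reaches num_nodes + 1
    wraps to 1, and the combination is re-sorted in ascending order.
    """
    combos = []
    current = tuple(sorted(base))
    for _ in range(num_nodes):
        combos.append(current)
        current = tuple(sorted(p % num_nodes + 1 for p in current))
    return combos


solution = cyclic_solution((1, 2, 4), 7)
for node, combo in enumerate(solution, start=1):
    print(node, combo)
```

The seventh combination produced is (1, 3, 7), and one more increment returns to (1, 2, 4), matching the cycle described for node 7 and node 1.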

As long as a combination of n vertices, denoted (V₁, V₂, ..., V_n), can be found that satisfies the conditions of step 320 throughout the incrementing process, an optimal graph-coverage solution can be found. By observing Tables 1 and 2, the following rules can be obtained: (1) of these n points, 1 and 2 are the two points that can be determined first, and once 1 and 2 are determined, 3 can be excluded. For example, if (1,2,3) is incremented by 1, (2,3,4) is obtained; (2,3) is then a common edge, so the condition for optimal graph coverage is not satisfied. Thus, the third point starts at 4. (2) Since the n points are cyclically incremented, no two pairs of points may have the same difference. For example, if the four points are taken as (1,2,4,7), a common edge appears when the combination is incremented to (4,5,7,10), and there will be a common edge until the fourth number increments to 13. (3) Once rules (1) and (2) are satisfied, no common edge exists in (V₁, V₂, ..., V_n) before the n-th point V_n is incremented to N+1. Therefore, the difference between V_n and N+1 must not equal the difference between any two points in the combination. For example, take n = 4 and M = N = 13 with the four points (1,2,4,8): the difference between 8 and 14 is 6, which is exactly the difference between 2 and 8. Thus, incrementing to (7,8,10,14) and setting 14 to 1 gives (1,7,8,10), and incrementing by 1 again gives (2,8,9,11), which produces the common edge (2,8).
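Rules (2) and (3) can be folded into a single test by taking differences modulo N, since two cyclic shifts of a combination share an edge exactly when two ordered pairs of points have the same difference modulo N. This modular formulation is a restatement introduced here for illustration, not wording from the source, and the function name is hypothetical.

```python
from itertools import permutations


def has_distinct_cyclic_differences(base, num_nodes):
    """True if all ordered pairwise differences modulo num_nodes are distinct.

    Working modulo num_nodes covers rule (2) and the wrap-around case of
    rule (3) in one check.
    """
    diffs = [(a - b) % num_nodes for a, b in permutations(base, 2)]
    return len(diffs) == len(set(diffs))


print(has_distinct_cyclic_differences((1, 2, 4, 7), 13))   # fails rule (2)
print(has_distinct_cyclic_differences((1, 2, 4, 8), 13))   # fails rule (3)
print(has_distinct_cyclic_differences((1, 2, 4, 10), 13))  # passes both
```

The combination (1,2,4,10) is not quoted from the source tables; it is simply one combination for N = 13 that passes this check.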

Table 1: Optimal graph-coverage solution when n = 2

Table 2: Optimal graph-coverage solution when n = 3

From the above analysis the following conclusions can be drawn. When M = N = n(n-1)+1 and n ≥ 2, an optimal graph-coverage solution exists and can be constructed using the three rules; after the optimal solution is constructed, the data and the corresponding tasks are distributed according to it. When M > N, the files are evenly divided into N blocks such that the numbers of files in any two blocks differ by at most 1; a solution is then constructed as in the case M = N, and the data and tasks are distributed accordingly. For example, if N = 7 and M = 9, 7 blocks are constructed, as shown in Table 3.
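The block packing for M > N can be sketched as follows. Which blocks receive the extra files is an assumption of this sketch (the first M mod N blocks get one extra file); the layout in Table 3 may differ.

```python
def pack_into_blocks(file_ids, num_blocks):
    """Split file_ids into num_blocks contiguous blocks whose sizes differ by at most 1."""
    base_size, extra = divmod(len(file_ids), num_blocks)
    blocks, start = [], 0
    for index in range(num_blocks):
        # Assumption: the first `extra` blocks each take one additional file.
        size = base_size + (1 if index < extra else 0)
        blocks.append(file_ids[start:start + size])
        start += size
    return blocks


blocks = pack_into_blocks(list(range(1, 10)), 7)  # M = 9 files, N = 7 blocks
print(blocks)
```

Each block then plays the role of a single vertex in the M = N construction, so a node that receives a block stores all of the files in that block.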

Table 3: Structure of the blocks when M > N

Using the optimal solution construction method, the optimal solutions for M = N = 13, 21 and 31 were constructed; together with the manually constructed solutions for n = 2 and 3, they are shown in Table 4:

Table 4: Optimal coverage solutions when N = 3, 7, 13, 21, 31

The specific implementation method comprises the following steps:

Assume M = N = n(n-1)+1 with n ≥ 2, so that an optimal solution exists.

Step 1: Construct the optimal solution

Define variables: four lists L₁, L₂, L₃ and L₄ are defined, and an optimal solution (V₁, V₂, ..., V_n) is constructed, where V₁ ← 1, V₂ ← 2, and V₃ starts from 4.

Store V₁, V₂ and V₃ in L₁, and store the differences between any two of V₁, V₂ and V₃ in L₂ without repetition:

while V₃ < N do
    for i = 4 to n do
        for x in L₁ do
            for y in L₂ do
                put x + y into L₃ without repetition
            end for
        end for
        sort L₃ in ascending order; find the first natural number between V_(i-1) and L₃[last] that is not in L₃; if there is one, assign it to V_i, otherwise V_i ← L₃[last] + 1
        while V_i < N do
            for z in L₁ do
                L₄.add(V_i - z)
            end for
            if (N+1-V_i) ∈ L₂ or (N+1-V_i) ∈ L₄ then
                V_i++
                if V_i is an element of L₃ between V_(i-1) and L₃[last] then
                    V_i ← L₃[last] + 1; continue
                end if
            else
                put V_i into L₁; copy L₄ into L₂; clear L₄
                i++
                break
            end if
        end while
        if all elements V_i of the optimal solution have been found then
            store the optimal solution and exit the loop
        else
            V₃++
        end if
    end for
end while
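For readers who want something executable, the sketch below searches for a base combination by plain backtracking over candidates with distinct cyclic differences. It is a swapped-in technique and not the L₁ to L₄ list procedure of step 1; the function name is hypothetical, and the search is only exercised here for n = 2 to 6.

```python
def find_base_combination(n):
    """Search for n points in 1..N, N = n(n-1)+1, with distinct cyclic differences.

    Plain backtracking, not the four-list procedure of step 1.
    Returns a sorted tuple that starts with 1 and 2, or None if none exists.
    """
    num_nodes = n * (n - 1) + 1

    def extend(combo, used_diffs):
        if len(combo) == n:
            return tuple(combo)
        for candidate in range(combo[-1] + 1, num_nodes + 1):
            new_diffs = set()
            for point in combo:
                d = (candidate - point) % num_nodes
                new_diffs.update((d, num_nodes - d))
            # The candidate must add 2 * len(combo) differences, all unused.
            if len(new_diffs) == 2 * len(combo) and not (new_diffs & used_diffs):
                found = extend(combo + [candidate], used_diffs | new_diffs)
                if found:
                    return found
        return None

    # Rule (1): the first two points are fixed to 1 and 2.
    return extend([1, 2], {1, num_nodes - 1})


for n in range(2, 7):
    print(n, n * (n - 1) + 1, find_base_combination(n))
```

Fixing the first two points to 1 and 2 loses no solutions, because any valid combination contains exactly one pair with difference 1 and can be shifted cyclically so that this pair lands on 1 and 2.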

Step 2: Sequentially distribute the data to the computing nodes according to the optimal solution:

When M = N, the data is distributed according to the optimal coverage solution. When M > N, the files are evenly packed into N blocks, and the blocks are then distributed according to the optimal coverage solution.
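The distribution step for M = N can be sketched as below: node k receives the base combination shifted cyclically by k-1, and runs every pairwise comparison among its own files. The function name `distribute` and the base combination (1,2,4,10) for N = 13 are assumptions of this sketch; the combination is one that has distinct cyclic differences, not a value quoted from Table 4.

```python
from itertools import combinations


def distribute(base, num_nodes):
    """Map node k (1-based) to its file set and comparison tasks for M = N."""
    plan = {}
    for k in range(1, num_nodes + 1):
        # Shifting the base by k - 1 matches k - 1 rounds of cyclic incrementing.
        files = sorted((p + k - 2) % num_nodes + 1 for p in base)
        plan[k] = {"files": files, "tasks": list(combinations(files, 2))}
    return plan


plan = distribute((1, 2, 4, 10), 13)
all_tasks = [task for node in plan.values() for task in node["tasks"]]
print(len(all_tasks), len(set(all_tasks)))
```

With this base, each of the 13 nodes stores 4 files and runs 6 tasks, and the 78 tasks are all distinct, so every pair of files is compared exactly once on a node that holds both files.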

The graph-coverage-based data distribution for big data full comparison rests on a theoretical analysis of the full comparison problem, a model of its data distribution problem, and a data distribution method built on graph covering. First, the theoretical basis for solving the data distribution problem of full comparison calculation with graph coverage is introduced; second, it is shown that under certain conditions an optimal graph-coverage solution can be constructed, and several sets of optimal solutions are successfully constructed. Compared with the heuristic method, the graph-coverage-based data distribution algorithm not only ensures 100% data locality for the comparison tasks and load balance, but also has better calculation performance when an optimal coverage solution exists.

In addition, the invention also provides a data distribution system based on the graph coverage big data full comparison. As shown in fig. 4, the data distribution system based on the full comparison of the big data covered by the graph of the present invention includes a mapping unit 1, a dividing unit 2, a determining unit 3, and a distributing unit 4.

The mapping unit 1 is configured to abstract M data files to be processed into vertices of a graph, abstract the comparison calculation between any two data files to be processed into an edge of the graph, and map the full comparison calculation of the M data files to be processed into a complete graph G_M, where a complete graph is a graph in which every pair of vertices is connected by an edge;

the dividing unit 2 is configured to divide the complete graph G_M into N induced graphs G(V₁), G(V₂), ..., G(V_N) such that the union of the induced graphs covers the complete graph G_M and max{|V₁|, |V₂|, ..., |V_N|} is minimized, where V_n denotes a vertex set and |V_n| denotes the number of vertices in the vertex set of the n-th induced graph;

the determining unit 3 is configured to determine an optimal coverage solution according to each induced graph;

the distribution unit 4 is configured to sequentially distribute the data to be processed to each computing node according to the optimal coverage solution.

Specifically, the dividing unit 2 determines whether the union of the induced graphs covers the complete graph G_M according to the following condition:

where n denotes the induced graph index, n = 1, 2, ..., N.

The determining unit 3 determines an optimal coverage solution according to each induced graph, and specifically includes:

selecting induced graphs meeting the following conditions:

The allocating unit 4 sequentially allocates the data to be processed to each computing node according to the optimal coverage solution, and specifically includes:

The data allocation specifically includes: and enabling the number of the data files to be processed on each computing node to be n.

Compared with the prior art, the data distribution system for big data full comparison based on graph coverage has the same beneficial effects as the corresponding data distribution method, which are not repeated here.

The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, a person skilled in the art may, according to the idea of the present invention, change the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the invention.

Claims

1. A data distribution method for big data full comparison based on graph coverage, characterized by comprising the following steps:

abstracting M data files to be processed into vertices of a graph, abstracting the comparison calculation between any two data files to be processed into an edge of the graph, and mapping the full comparison calculation of the M data files to be processed into a complete graph G_M; the complete graph is formed by connecting an edge between each pair of vertices;

dividing the complete graph G_M into N induced graphs G(V₁), G(V₂), ..., G(V_N), such that the union of the induced graphs covers the complete graph G_M and max{|V₁|, |V₂|, ..., |V_N|} is minimized; where V denotes a vertex set and |V_N| denotes the number of vertices in the vertex set of the N-th induced graph;

determining an optimal coverage solution according to each induced graph;

according to the optimal coverage solution, sequentially distributing the data to be processed to each computing node, specifically comprising:

defining four set variables L₁, L₂, L₃ and L₄, wherein L₁ is used for storing the optimal coverage solution elements already found; L₂ is used for storing the difference between any two optimal coverage solutions; L₃ is used for storing the sum of the elements of L₁ and L₂; L₄ is used for storing the difference between a newly found optimal coverage solution element and the existing optimal coverage solution elements;

when M is larger than N, evenly packing the data files to be processed into N regions, and performing data distribution according to the regions and the optimal coverage solution; wherein n represents the number of vertices in each induced graph.

2. The data distribution method for big data full comparison based on graph coverage according to claim 1, wherein the union of the induced graphs is determined to cover the complete graph G_M according to the following condition:

wherein n represents the number of vertices in each induced graph, 1 < n ≤ M, the total number of induced graphs is N, and | denotes optimal coverage.

3. The data distribution method based on graph coverage big data full comparison according to claim 1, wherein the determining an optimal coverage solution according to each induced graph specifically includes:

selecting the induced graphs that meet the following condition:

each selected induced graph is an optimal coverage solution, and the selected induced graphs G_n taken together form an optimal coverage of the complete graph G_M, denoted G_n|G_M; wherein n represents the number of vertices in each induced graph, 1 < n ≤ M, and the number of induced graphs is N.

4. The data distribution method based on graph coverage big data full comparison according to claim 1, wherein the data distribution specifically includes:

5. A data distribution system for big data full comparison based on graph coverage, characterized in that the data distribution system comprises:

a dividing unit, configured to divide the complete graph G_M into N induced graphs G(V₁), G(V₂), ..., G(V_N), such that the union of the induced graphs covers the complete graph G_M and max{|V₁|, |V₂|, ..., |V_N|} is minimized; where V denotes a vertex set and |V_N| denotes the number of vertices in the vertex set of the N-th induced graph;

a distribution unit, configured to sequentially distribute the data to be processed to each computing node according to the optimal coverage solution, which specifically includes:

when M = n(n − 1) + 1 and n ≥ 2, performing data distribution according to the optimal coverage solution; wherein n represents the number of vertices in each induced graph;

6. The data distribution system for big data full comparison based on graph coverage according to claim 5, wherein the dividing unit determines that the union of the induced graphs covers the complete graph G_M according to the following condition:

7. The data distribution system based on graph coverage big data full comparison according to claim 5, wherein the determining unit determines an optimal coverage solution according to each induced graph, specifically comprising:

selecting the induced graphs that meet the following condition:

8. The data distribution system based on graph coverage big data full comparison according to claim 5, wherein the data distribution specifically includes: