CN113282415B - Method for matching patterns of labeled graph in distributed environment

Info

Publication number: CN113282415B
Application number: CN202110570428.XA
Authority: CN
Inventors: 李靖东; 王晓玲; 卢兴见; 张吉
Original assignee: East China Normal University; Zhejiang Lab
Current assignee: East China Normal University; Zhejiang Lab
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2023-10-31
Anticipated expiration: 2041-05-25
Also published as: CN113282415A

Abstract

The invention discloses a labeled graph pattern matching method in a distributed environment. A master node in the distributed environment partitions a data graph and sends each piece of node data to a slave node, and at the same time distributes the labeled pattern graph to every slave node. Each slave node dynamically selects a matching path according to its local data storage and communication conditions, obtains graph pattern matching results and feeds them back to the master node, which aggregates and outputs all graph pattern matching results. While using a task-centric graph computation model, the invention takes full account of load balancing in the distributed environment, so as to make full use of the CPU computing power of every machine and effectively improve graph pattern matching efficiency.

Description

Method for matching patterns of labeled graph in distributed environment

Technical Field

The invention belongs to the technical field of graph data mining, and particularly relates to a tagged graph pattern matching method in a distributed environment.

Background

A graph is commonly used to represent complex structured data as a generic data structure. It better stores and expresses entities and their associations relative to other data structures. In the real world, the graph has wide application in the fields of social network analysis, web network analysis, traffic network optimization, knowledge graph construction, computational chemistry, computational biology and the like. Aiming at the graph data with rich semantics, various styles and huge data volume, how to quickly and accurately acquire valuable information in the graph data becomes a very popular research direction at present.

With the continuous development of emerging technologies such as the Internet of Things and cloud computing, the rapid rise of novel Internet applications such as social networks, and the wide adoption of wearable electronic devices, the scale and complexity of graph data keep increasing. Existing graph computing methods therefore face great challenges in performance and efficiency, particularly for computationally intensive tasks such as graph pattern mining in large-scale graph data. An intuitive way to handle such high-complexity computations is to execute them in parallel on multiple CPU cores. However, existing big data frameworks mainly target data-intensive graph mining tasks, in which data transmission often becomes the bottleneck and CPU utilization is low. Using these frameworks for computationally intensive graph mining tasks typically results in poor performance.

Graph pattern matching is a computationally intensive graph mining task, and because data graphs are large, the storage space and computing power of a single machine can hardly meet its requirements. Existing distributed graph pattern matching methods are mainly based on MapReduce or on vertex-centric designs similar to the Pregel model. Neither of these well-known distributed architectures is suitable for computationally intensive problems. The MapReduce-based graph computing system NScale does not begin the computationally intensive processing of the decomposable subgraphs until all of them have been constructed synchronously, which leaves the CPU underutilized. Meanwhile, the bulk synchronous parallel model used in Pregel contains a barrier synchronization stage, so different machines have to wait a long time to synchronize, and this design also has difficulty making full use of CPU computing power.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a labeled graph pattern matching method in a distributed environment. It adopts a task-centric graph computation model so as to make full use of the CPU computing power of every machine in the distributed environment and improve the efficiency of the graph pattern matching process.

In order to achieve the above object, the pattern matching method for the labeled graph in the distributed environment of the present invention comprises the following steps:

S1: obtain the graph data on which labeled graph pattern matching is to be performed, wherein the graph data comprises a data graph and a labeled pattern graph, the data graph is an undirected graph containing node IDs, node label information and node association relations, and the labeled pattern graph comprises node labels and node association relations;

S2: denote the master node in the distributed environment as M and the slave nodes as S_n, n = 1, 2, …, N, where N is the number of slave nodes. The master node M divides the data graph obtained in step S1 into N pieces of node data, each piece containing several items of node information, each item comprising a node ID, node label information and the node association relation, and then distributes each piece of node data to the corresponding slave node S_n; at the same time, the master node M sends the labeled pattern graph to every slave node S_n;

S3: after each slave node S_n receives its node data and the labeled pattern graph, denote the labels in the labeled pattern graph as f_k, k = 1, 2, …, K, where K is the number of labels in the labeled pattern graph. Slave node S_n counts, in the received node data, the node set Φ_{n,k} of nodes with label f_k, and then selects the node set with the fewest nodes as the root node set R_n. Denoting the number of nodes in R_n as D_n, slave node S_n executes D_n graph pattern matching tasks, each taking one node of R_n as its root node, and feeds the obtained graph pattern matching results back to the master node M;

S4: the master node M gathers the graph pattern matching results fed back by the slave nodes, eliminates duplicate results, and obtains the final graph pattern matching result set.

In the labeled graph pattern matching method in a distributed environment of the invention, a master node in the distributed environment partitions a data graph and sends each piece of node data to a slave node, and at the same time distributes the labeled pattern graph to every slave node. Each slave node dynamically selects a matching path according to its local data storage and communication conditions, obtains graph pattern matching results and feeds them back to the master node, which aggregates and outputs all graph pattern matching results. While using a task-centric graph computation model, the invention takes full account of load balancing in the distributed environment, so as to make full use of the CPU computing power of every machine and effectively improve graph pattern matching efficiency.

Drawings

FIG. 1 is a flow chart of an embodiment of the labeled graph pattern matching method in a distributed environment of the present invention;

FIG. 2 is the data graph in the present embodiment;

FIG. 3 is the labeled pattern graph in the present embodiment;

FIG. 4 is a flow chart of the execution of the graph pattern matching task in the present invention;

FIG. 5 is the graph pattern matching result obtained with the present invention for the data graph of FIG. 2 and the labeled pattern graph of FIG. 3.

Detailed Description

Embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the main content of the invention.

Examples

FIG. 1 is a flow chart of an embodiment of the labeled graph pattern matching method in a distributed environment of the present invention. As shown in FIG. 1, the method comprises the following specific steps:

S101: Obtaining graph data:

Obtain the graph data on which labeled graph pattern matching is to be performed. The graph data comprises a data graph and a labeled pattern graph; the data graph is an undirected graph containing node IDs, node label information and node association relations, and the labeled pattern graph comprises node labels and node association relations.

That is, the data graph contains at least these three kinds of features. If the original data graph contains other node attributes that do not appear in the labeled pattern graph, it can be simplified by dropping them. Each line of the labeled pattern graph data may take the format (start-point label, end-point label). FIG. 2 is the data graph in the present embodiment, and FIG. 3 is the labeled pattern graph in the present embodiment. As shown in FIG. 2 and FIG. 3, the data graph in this embodiment is a personnel relationship graph with 14 personnel nodes. The nodes carry 6 kinds of labels in total; each label represents a different personnel identity and is drawn with a different shape.

The information contained in fig. 2 and 3 can be represented by tables. Table 1 is a data table of the data diagram shown in fig. 2.

Node ID	Node label	Adjacent nodes (association relation)
1	A	2,3,5
2	B	1,3,4
3	D	1,2,12
4	C	2
5	E	1
6	A	7,8
7	B	6,8,9,12
8	D	6,7
9	C	7
10	B	11,12,13,14
11	A	10,12
12	D	3,7,10,11
13	C	10
14	F	10

Table 1

Table 2 is the data table of the labeled pattern graph shown in FIG. 3 in this embodiment.

Start-point label	End-point label
A	B
A	D
B	C
B	D

Table 2

S102: Graph data distribution:

A distributed environment generally comprises a master node and several slave nodes. In real applications the data graph often exceeds the storage space of a single machine, and for a graph pattern matching task it is unnecessary to back up the whole graph on every machine; doing so could also produce duplicate matching results. The data graph is therefore first distributed by the master node. The pattern graph entered by the user, in contrast, tends to be much smaller than the data graph and can be sent directly to every slave node. The specific process of graph data distribution is as follows: denote the master node in the distributed environment as M and the slave nodes as S_n, n = 1, 2, …, N, where N is the number of slave nodes. The master node M divides the data graph obtained in step S101 into N pieces of node data, each piece containing several items of node information, each item comprising a node ID, node label information and the node association relation, and then distributes each piece of node data to the corresponding slave node S_n. At the same time, the master node M sends the labeled pattern graph to every slave node S_n.

To distribute the data graph evenly and prevent a single slave node from holding a large amount of data and becoming a communication bottleneck, this embodiment adopts a data graph distribution method based on the hash value of the node ID. The specific process is as follows: a hash is computed for each node ID in the data graph using the hash function hash(ID) % N, giving a hash value for each node ID, and nodes with the same hash value are placed in the same piece of node data.
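The hash-based distribution can be illustrated with a minimal Python sketch using the data of Table 1. The function name `partition_by_hash` and the choice N = 3 are assumptions made for illustration only. Python's built-in `hash` maps a small non-negative integer to itself; for string IDs it is salted per process, so a stable hash function would be needed there.

```python
from collections import defaultdict

# Table 1: node ID -> (label, adjacent node IDs)
DATA_GRAPH = {
    1: ("A", [2, 3, 5]),      2: ("B", [1, 3, 4]),
    3: ("D", [1, 2, 12]),     4: ("C", [2]),
    5: ("E", [1]),            6: ("A", [7, 8]),
    7: ("B", [6, 8, 9, 12]),  8: ("D", [6, 7]),
    9: ("C", [7]),            10: ("B", [11, 12, 13, 14]),
    11: ("A", [10, 12]),      12: ("D", [3, 7, 10, 11]),
    13: ("C", [10]),          14: ("F", [10]),
}


def partition_by_hash(graph, num_slaves):
    """Split the data graph into num_slaves pieces using hash(ID) % N."""
    pieces = defaultdict(dict)
    for node_id, info in graph.items():
        pieces[hash(node_id) % num_slaves][node_id] = info
    # One piece per slave, in slave order; a slave may receive an empty piece.
    return [pieces[n] for n in range(num_slaves)]


parts = partition_by_hash(DATA_GRAPH, 3)  # N = 3 is an illustrative choice
for n, piece in enumerate(parts):
    print(f"piece {n}: nodes {sorted(piece)}")
```

Each node lands in exactly one piece, so no node information is duplicated across slave nodes.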

S103: Graph pattern matching:

After each slave node S_n receives its node data and the labeled pattern graph, denote the labels in the labeled pattern graph as f_k, k = 1, 2, …, K, where K is the number of labels in the labeled pattern graph. Slave node S_n counts, in the received node data, the node set Φ_{n,k} of nodes with label f_k, and then selects the node set with the fewest nodes as the root node set R_n. Denoting the number of nodes in R_n as D_n, slave node S_n executes D_n graph pattern matching tasks, each taking one node of R_n as its root node, and feeds the obtained graph pattern matching results back to the master node M.
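The root node set selection can be sketched in Python as follows. The local node data is hypothetical: it is the set of odd-numbered nodes of Table 1, as a slave would hold them under hash(ID) % 2. The function name `select_root_set` and the tie-breaking rule (first label in pattern order) are assumptions; the text does not specify how ties are resolved.

```python
# Hypothetical local node data of one slave node: node ID -> label
local_labels = {1: "A", 3: "D", 5: "E", 7: "B", 9: "C", 11: "A", 13: "C"}

# Labels f_k that occur in the labeled pattern graph (Table 2)
pattern_labels = ["A", "B", "C", "D"]


def select_root_set(local_labels, pattern_labels):
    """Return (label, R_n): the pattern label with the fewest local nodes and those nodes."""
    # Phi_{n,k}: local nodes whose label is f_k
    phi = {
        f: [v for v, lab in local_labels.items() if lab == f]
        for f in pattern_labels
    }
    # Assumption: on ties, min() keeps the first label in pattern_labels order.
    root_label = min(pattern_labels, key=lambda f: len(phi[f]))
    return root_label, phi[root_label]


root_label, root_set = select_root_set(local_labels, pattern_labels)
print(root_label, root_set, "D_n =", len(root_set))
```

Here label B has a single local node, so this slave would run D_n = 1 graph pattern matching task rooted at node 7. Starting from the smallest node set keeps the number of tasks per slave low.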

FIG. 4 is a flow chart of the execution of the graph pattern matching task in the present invention. As shown in FIG. 4, the graph pattern matching task comprises the following specific steps:

S401: Initializing the root node:

Let the node serial number i = 0, determine the root node label from the root node of the current graph pattern matching task, and let node set B_0 be the set formed by the root node of the current graph pattern matching task.

S402: determining the next set of nodes:

Because the node data is distributed across different slave nodes, the selection of subsequent matching nodes (that is, the determination of the matching path) must take into account that the slave nodes differ in their local data and in the amount of data they need to pull from other machines while executing graph matching tasks, and that communication conditions also have an influence. The invention therefore proposes a candidate-set-based cost estimation model that generates an adaptive graph matching path from the cost function of each candidate label. The design of the model is based on two observations: (1) different matching orders lead to different communication costs and thereby affect the duration of the whole matching computation; (2) taking the size of the candidate sets present in the data graph into account when determining the matching order can effectively reduce the subsequent search space. Specifically, the cost estimation model considers three factors: (1) the structural information of the current matching node (node degree, candidate set size); (2) the matching probability of the current matching path (the likelihood of early termination); (3) the data storage status and workload of the current slave node (the number of tasks and the nodes in the cache). The cost function is calculated as follows:

Acquire the candidate label set A_{i+1} of the nodes that are adjacent, in the pattern graph, to the node carrying the label of the current (i-th) node. Denote the p-th label in A_{i+1} as u_p, p = 1, 2, …, P_{i+1}, where P_{i+1} is the number of candidate labels in A_{i+1}. First, calculate the communication cost cost_pull(u_p) of fetching all nodes in the data graph that correspond to candidate label u_p:

cost_pull(u_p) = [C(u_p) - C_local(u_p)] × W_remote + C_local(u_p) × W_local

where C(u_p) denotes the number of all nodes with label u_p in the data graph, which the slave node obtains by querying the master node; C_local(u_p) denotes the number of nodes with label u_p stored locally on the current slave node; W_local denotes the unit communication cost of obtaining node data local to the slave node; and W_remote denotes the unit communication cost of obtaining node data from other slave nodes.
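The communication cost can be written as a small Python function. The numeric values used below (5 nodes with the label, 2 of them local, W_remote = 10, W_local = 1) are assumptions chosen only to show the arithmetic.

```python
def cost_pull(total_count, local_count, w_remote, w_local):
    """cost_pull(u_p) = [C(u_p) - C_local(u_p)] * W_remote + C_local(u_p) * W_local."""
    remote_count = total_count - local_count  # nodes that must come from other slaves
    return remote_count * w_remote + local_count * w_local


# Illustrative values: C(u_p) = 5, C_local(u_p) = 2, W_remote = 10, W_local = 1
example_cost = cost_pull(5, 2, w_remote=10.0, w_local=1.0)
print(example_cost)  # 3 * 10 + 2 * 1 = 32.0
```

The more nodes of a label are already stored locally, the lower its pull cost, which is what steers the matching path towards local data.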

Next, calculate the communication cost cost_stop(u_p) that is saved when candidate label u_p is taken as the label of the next node on the matching path and a mismatch subsequently occurs:

where ρ_p denotes the probability that a mismatch subsequently occurs when candidate label u_p is taken as the label of the next node on the matching path, calculated as follows:

The cost function cost(u_p) of taking candidate label u_p as the next label on the matching path is then calculated as:

cost(u_p) = cost_pull(u_p) - cost_stop(u_p)

Select the label with the smallest cost function in candidate label set A_{i+1} as the label of the next node. By querying the master node, the slave node obtains the nodes in the data graph that carry this label and have an adjacent node belonging to node set B_i; these nodes form node set B_{i+1}.
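A minimal Python sketch of this selection and expansion step is given below, using the data of Table 1. The names `select_next_label` and `expand` are hypothetical, and the cost values are passed in directly (with W_local = 1 and W_remote = 10 as assumed unit costs) rather than derived from cost_stop. In a real deployment the lookup would go through the master node instead of a local dictionary.

```python
# Table 1: node ID -> (label, adjacent node IDs)
DATA_GRAPH = {
    1: ("A", [2, 3, 5]),      2: ("B", [1, 3, 4]),
    3: ("D", [1, 2, 12]),     4: ("C", [2]),
    5: ("E", [1]),            6: ("A", [7, 8]),
    7: ("B", [6, 8, 9, 12]),  8: ("D", [6, 7]),
    9: ("C", [7]),            10: ("B", [11, 12, 13, 14]),
    11: ("A", [10, 12]),      12: ("D", [3, 7, 10, 11]),
    13: ("C", [10]),          14: ("F", [10]),
}


def select_next_label(costs):
    """Return the candidate label u_p with the smallest cost(u_p)."""
    return min(costs, key=costs.get)


def expand(graph, current_set, next_label):
    """B_{i+1}: nodes labeled next_label that are adjacent to some node in B_i."""
    return {
        nb
        for v in current_set
        for nb in graph[v][1]
        if graph[nb][0] == next_label
    }


W_LOCAL, W_REMOTE = 1.0, 10.0                 # assumed unit costs
costs = {"B": W_LOCAL, "D": W_REMOTE}         # cost(u_p) of each candidate label
b0 = {1}                                      # root node set B_0 (node 1, label A)
next_label = select_next_label(costs)
b1 = expand(DATA_GRAPH, b0, next_label)
print(next_label, b1)
```

With these inputs the cheaper label B is chosen and B_1 contains only node 2; calling `expand(DATA_GRAPH, b1, "D")` afterwards yields the set containing node 3.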

S403: judging whether i is less than T-1, wherein T represents the number of nodes in the pattern diagram with labels, if so, proceeding to step S404, otherwise proceeding to step S405.

S404: let i=i+1, return to step S402.

S405: Backtracking to obtain the matching paths:

Take each node in node set B_{T-1} as a final node and backtrack to obtain the matching path corresponding to that final node, thereby obtaining the graph pattern matching results.

Taking FIG. 2 as an example, assume that the root node of a graph pattern matching task executed by some slave node is node 1, whose label is A. In the pattern graph, the nodes adjacent to the node labeled A are the nodes labeled B and D, so it must be decided which of these labels to match first.

For label B, the set of nodes in the data graph that are adjacent to node 1 and labeled B contains only node 2. If node 2 is on the current slave node, the communication cost of taking label B as the next label is cost_pull(B) = W_local.

For label D, the set of nodes in the data graph that are adjacent to node 1 and labeled D contains only node 3. If node 3 is not on the current slave node, the communication cost of taking label D as the next label is cost_pull(D) = W_remote.

In this example, nodes matching labels B and D both exist, so there is no possibility of early termination and the mismatch probability of both labels is 0. Consequently, when label B or D is taken as the label of the next node on the matching path, the communication cost saved by a subsequent mismatch is cost_stop(B) = cost_stop(D) = 0.

In summary, the cost functions of labels B and D are cost(B) = W_local and cost(D) = W_remote. In general W_remote > W_local, so label B is chosen as the next label, and the corresponding node set is the set consisting of node 2. In the same way, the third label is label D with the node set consisting of node 3, and the fourth label is label C with the node set consisting of node 4. The current graph pattern matching result can then be obtained by backtracking.

In step S103, the D_n graph pattern matching tasks on each slave node can be executed in parallel to improve graph pattern matching efficiency.

S104: Result aggregation:

The master node M gathers the graph pattern matching results fed back by the slave nodes, eliminates duplicate results, and obtains the final graph pattern matching result set.
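The aggregation with duplicate elimination can be sketched in Python as follows. The representation of a match as a tuple of data-node IDs in a fixed pattern-node order (A, B, D, C) is an assumption, as is the function name `aggregate`; the example tuples are illustrative matches consistent with Tables 1 and 2.

```python
def aggregate(results_per_slave):
    """Merge the match lists of all slave nodes, dropping duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for results in results_per_slave:
        for match in results:
            key = tuple(match)  # hashable form of one matching result
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Illustrative feedback from two slave nodes; (6, 7, 8, 9) is reported twice.
slave_results = [
    [(1, 2, 3, 4), (6, 7, 8, 9)],
    [(6, 7, 8, 9), (11, 10, 12, 13)],
]
final_results = aggregate(slave_results)
print(final_results)
```

Duplicates can arise because tasks rooted on different slave nodes may reach the same subgraph, so the master node keeps each distinct match once.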

FIG. 5 is the graph pattern matching result obtained with the present invention for the data graph of FIG. 2 and the labeled pattern graph of FIG. 3. As shown in FIG. 5, the present invention obtains accurate graph pattern matching results.

In addition, the master node M may schedule tasks across the slave nodes to achieve load balancing. That is, after a slave node has executed all of its graph pattern matching tasks, it sends a schedulable message to the master node M. After receiving the schedulable message, the master node M queries the slave nodes that have not completed their graph pattern matching tasks, obtains the tasks that have not yet been executed, and reassigns these tasks to the schedulable slave node.
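One possible form of this rescheduling is sketched below in Python. The text does not state which unfinished slave node is chosen or how many tasks are moved, so the policy here (take half of the pending tasks of the most loaded slave node) is purely an assumption, as are the function name `reschedule` and the slave names.

```python
def reschedule(pending, idle_slave):
    """Move unexecuted tasks from the most loaded slave node to idle_slave.

    pending maps a slave name to the list of root nodes whose tasks
    have not yet been executed. Returns the list of moved tasks.
    """
    busiest = max(pending, key=lambda s: len(pending[s]))
    if busiest == idle_slave or not pending[busiest]:
        return []  # nothing left to redistribute
    # Assumed policy: hand over half (rounded up) of the busiest slave's pending tasks.
    half = (len(pending[busiest]) + 1) // 2
    moved = pending[busiest][-half:]
    del pending[busiest][-half:]
    pending.setdefault(idle_slave, []).extend(moved)
    return moved


# S1 has finished and sent a schedulable message; S2 and S3 still have tasks.
pending = {"S1": [], "S2": [3, 8, 12], "S3": [7]}
moved = reschedule(pending, "S1")
print(moved, pending)
```

Because each task is identified only by its root node, moving a task amounts to sending one node ID, which keeps the scheduling messages small.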

To better illustrate the technical effects of the invention, it was verified experimentally on data sets from practical applications. The data graphs used in the test are the Email data set and the DBLP data set; the Email data set is a data graph representing communication relations among people, and the DBLP data set is a data graph representing paper citation relations among people. Both data sets are open-sourced on SNAP and can be downloaded from http:// SNAP. The pattern graphs used in the test were constructed according to a general pattern graph generation method in three successively increasing scales; for the generation rules, see the paper "Han M, Kim H, Gu G, et al. Efficient subgraph matching: Harmonizing dynamic programming, adaptive matching order, and failing set together [C]// Proceedings of the 2019 International Conference on Management of Data. 2019: 1429-1446."

In the experimental verification, the best-performing distributed graph pattern matching method in the prior art (BENU) was selected as the comparison method, and its matching time was compared with that of the invention. Details of the BENU method can be found in the paper "Wang, Zhaokang, et al. BENU: Distributed subgraph enumeration with backtracking-based framework. 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019."

Table 3 is a comparison table of matching times of the present invention and BENU method on two sets of personal relationship graph data in this experimental verification.

Table 3

As shown in Table 3, the invention performs labeled graph pattern matching more efficiently than the BENU method: the matching time is shortened by 26% on average.

Table 4 is a table comparing memory overhead and communication cost for the present invention and BENU method on two sets of personal relationship graph data in this experimental verification.

Table 4

As shown in Table 4, the memory overhead and communication cost required by the invention for labeled graph pattern matching are both lower than those of the BENU method: memory overhead is reduced by 14% on average and communication cost by 30% on average, which considerably reduces resource consumption.

In summary, by using a task-centric computation model and an adaptive matching path selection technique, the invention effectively improves three performance indicators (matching efficiency, memory overhead and communication cost) and has good application prospects.

While illustrative embodiments of the invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. Various changes that fall within the spirit and scope of the invention as defined by the appended claims are protected.

Claims

1. The pattern matching method for the labeled graph in the distributed environment is characterized by comprising the following steps of:

2. The labeled graph pattern matching method in a distributed environment according to claim 1, wherein in step S2, when the node data is distributed, a hash is computed for each node ID in the data graph using the hash function hash(ID) % N to obtain a hash value for each node ID, and nodes with the same hash value are placed in the same piece of node data.

3. The method for pattern matching of a tagged graph in a distributed environment according to claim 1, wherein the execution flow of the pattern matching task in step S3 comprises the steps of:

S3.1: let the node serial number i = 0, determine the root node label from the root node of the current graph pattern matching task, and let node set B_0 be the set formed by the root node of the current graph pattern matching task;

S3.2: acquire the candidate label set A_{i+1} of the nodes that are adjacent, in the pattern graph, to the node carrying the label of the current (i-th) node; denote the p-th label in A_{i+1} as u_p, p = 1, 2, …, P_{i+1}, where P_{i+1} is the number of candidate labels in A_{i+1}; first, calculate the communication cost cost_pull(u_p) of fetching all nodes in the data graph that correspond to candidate label u_p:

cost_pull(u_p) = [C(u_p) - C_local(u_p)] × W_remote + C_local(u_p) × W_local

where C(u_p) denotes the number of all nodes with label u_p in the data graph, which the slave node obtains by querying the master node; C_local(u_p) denotes the number of nodes with label u_p stored locally on the current slave node; W_local denotes the unit communication cost of obtaining node data local to the slave node; and W_remote denotes the unit communication cost of obtaining node data from other slave nodes;

cost(u_p) = cost_pull(u_p) - cost_stop(u_p)

select the label with the smallest cost function in candidate label set A_{i+1} as the label of the next node; by querying the master node, the slave node obtains the nodes in the data graph that carry this label and have an adjacent node belonging to node set B_i, and these nodes form node set B_{i+1};

S3.3: judge whether i is less than T-1, where T is the number of nodes in the labeled pattern graph; if so, go to step S3.4, otherwise go to step S3.5;

S3.4: let i = i + 1 and return to step S3.2;

S3.5: take each node in node set B_{T-1} as a final node and backtrack to obtain the matching path corresponding to that final node, thereby obtaining the graph pattern matching results.

4. The method according to claim 1, wherein in step S4 the master node M schedules tasks across the slave nodes to achieve load balancing, that is, after a slave node has executed all of its graph pattern matching tasks, it sends a schedulable message to the master node M; after receiving the schedulable message, the master node M queries the slave nodes that have not completed their graph pattern matching tasks, obtains the graph pattern matching tasks that have not yet been executed, and reassigns these tasks to the schedulable slave node.