CN112559807B - Graph pattern matching method based on multi-source point parallel exploration - Google Patents

Graph pattern matching method based on multi-source point parallel exploration

Info

Publication number
CN112559807B
Authority
CN
China
Prior art keywords
graph
nodes
node
pattern
exploration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011410948.6A
Other languages
Chinese (zh)
Other versions
CN112559807A (en)
Inventor
黄文杰
高杨
陈伟
王新根
黄滔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Bangsheng Technology Co ltd
Original Assignee
Zhejiang Bangsheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Bangsheng Technology Co ltd
Priority to CN202011410948.6A
Publication of CN112559807A
Application granted
Publication of CN112559807B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying

Abstract

The invention discloses a graph pattern matching method based on multi-source point parallel exploration, which can be used for fuzzy querying of graph patterns starting from a determined point in a graph database. The pattern graph to be queried is decomposed according to its hierarchical structure, and graph traversal queries are carried out layer by layer, which significantly reduces the exploration depth and improves parallel exploration performance. The invention introduces the concepts of a central pattern set and an edge pattern set, which are used to control the exploration process, convert the exploration task from subgraph-centric to node-centric, and allow the algorithm to be implemented on a general distributed graph computing platform. The invention also provides a method for refining matching results by combining multi-source point exploration results: several secondary source points are designated for repeated exploration, and the layer differences arising from the different perspectives strengthen the constraints on the matching result, thereby improving matching precision.

Description

Graph pattern matching method based on multi-source point parallel exploration
Technical Field
The invention relates to distributed computing in the field of big data, and in particular to a fuzzy graph pattern matching method for labeled directed graphs based on a distributed graph computing framework.
Background
With the development of the internet, social and economic activities that depend on it have become increasingly common, and the scale of data has grown rapidly. Graphs can model socio-economic activities that involve association relationships, and searching a graph for complex association patterns has become a common way to analyze the relationships between entities, but it also consumes a large amount of computing resources. As graphs keep growing, reducing the computing resources required for graph pattern matching becomes important.
Take an enterprise trade network as an example, with transaction entities as nodes and transaction behaviors as directed edges. Each enterprise is represented as a node in the network, and the node label represents the products or services the enterprise can provide. A directed edge between two nodes indicates that a trade transaction has occurred between the enterprises, and the weight of the edge indicates the transaction amount. Patterns in a trade network are sensitive to the graph topology, so graph simulation algorithms suited to social network analysis cannot be applied to it directly. Subgraph isomorphism, which strictly constrains the topology, is an NP-complete problem, and its exponentially growing time complexity makes it unsuitable for querying large graphs.
Disclosure of Invention
The invention aims to provide a fuzzy graph pattern matching method for labeled directed graphs that balances matching efficiency and precision while keeping polynomial time complexity. The method is characterized by layer decomposition, multi-source point exploration, and result fusion.
The purpose of the invention is achieved by the following technical solution: a graph pattern matching method based on multi-source point parallel exploration, comprising the following steps:
(a) Let the pattern graph to be matched be $Q = (V_q, E_q, L_q)$, where $V_q$ denotes the node set of the pattern graph, $E_q$ the set of edges connecting the nodes, and $L_q$ the label set of the pattern graph nodes. Select a node $s \in V_q$ of the pattern graph as the exploration source point, perform a depth traversal centered on the source point $s$, and mark the level of each pattern graph node with its depth value. The pattern graph is thereby decomposed into several layers, each containing the nodes of one level and the edges connecting them.
(b) Calculate a central pattern set according to the structure and node semantics of the pattern graph to be matched, and calculate an edge pattern set according to the dependency relationship between adjacent layers. Each node of $V_q$ has its own central pattern set, which includes the label of the node and the ids of its parent and child nodes. For a node at level $d > 0$, the edge pattern set is a subset of the central pattern set that contains only the ids of parent and child nodes at level $d - 1$. The central pattern set and the edge pattern set contain all structural constraints used in the layer-by-layer graph exploration process. With the depth of the layers defined as $D$, the layers are explored at least $D$ times and at most $2D$ times. The first $D$ explorations form the expansion stage, in which new matchable nodes are added and mismatched nodes are removed at the same time; after $D$ explorations the convergence stage begins, in which no new nodes are added and only mismatched nodes are removed.
(c) Acquire the data graph to be matched, and select a starting point in the data graph from which to explore neighbor nodes. Each data graph node maintains only its own matching set, which contains all nodes of $V_q$ that the data graph node can match. Each data graph node may also maintain its own state set, used to evaluate the constraints of the edge pattern set or the central pattern set. In the first $D$ explorations (the expansion stage), layer $d + 1$ is explored from layer $d$ as follows: traverse the data graph nodes whose matching sets are non-empty, select those whose matching sets contain layer-$d$ pattern graph nodes, and add the neighbors of the selected data graph nodes to an exploration queue; each node in the exploration queue collects the matching sets of its neighbors, and the non-empty ones are merged into its state set; for each node in the exploration queue, traverse the nodes of $V_q$ at level $d + 1$ and check whether the data graph node and the pattern graph node have the same label and whether the state set of the data graph node is a superset of the edge pattern set of the pattern graph node; if so, add the current pattern graph node to the matching set of the data graph node. Meanwhile, throughout the matching process, every data graph node with a non-empty matching set also collects the matching sets of its neighbors to form a state set, traverses the pattern graph nodes in its matching set that have not been updated, and checks whether the current state set is a superset of the central pattern set of each such pattern graph node; if not, the pattern graph node is removed from the matching set. The exploration step is repeated, matching only one adjacent layer per exploration, until all layers have been explored.
After the exploration step is finished, the nodes with non-empty matching sets continue to collect neighbor states and to verify whether their matching sets contain mismatched nodes, until no matching set changes any further.
(d) Acquire the nodes of the data graph whose matching sets are non-empty, and extract the subgraph they form as a new data graph. The subgraph is a subset of the data graph and comprises the nodes whose matching sets are non-empty after one exploration from $s$, together with their edges. One data graph matching task yields several subgraphs, each uniquely identified by the id of the data graph starting point used in step (c). For a data graph $G = (V, E, L)$, $V$ denotes the node set of the data graph, $E \subseteq V \times V$ is the edge set of the data graph, and $L$ is the label set of the data graph; in general,
Figure BDA0002814739580000022
Several secondary source points are then selected in the pattern graph, and the exploration of step (c) is repeated on the subgraph. A secondary source point is any node of the pattern graph other than the source point $s$. Finally, the intersection of the multi-source point exploration results is taken, and nodes whose matching sets are empty are removed, which yields the refined matching result.
Further, the pattern graph $Q = (V_q, E_q, L_q)$ and the data graph $G = (V, E, L)$ are directed graphs with node labels. A node label may be a symbol that distinguishes nodes with different roles. Each node of the pattern graph may also carry an additional explicit mark, indicating that a data graph node matching this node must exclude all other pattern graph nodes. Specifically, a data graph node that has matched an explicit pattern graph node $x$ cannot match any other pattern graph node at the same time. That is, its matching set $M$ either contains $x$ with $|M| \le 1$, or does not contain $x$ with $|M| < |V_q|$. Specifying explicit nodes makes the matched pattern more precise.
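The explicit-node condition above can be written as a small predicate. The following is a minimal sketch, assuming the matching set is a Python set of pattern graph node ids; the function name `respects_explicit` and the sample values are hypothetical and not part of the patent.

```python
def respects_explicit(matching_set, explicit_node, pattern_size):
    """Check the explicit-node condition for one data graph node.

    matching_set  -- pattern graph node ids currently matched (assumed a set)
    explicit_node -- id x of the pattern graph node marked explicit
    pattern_size  -- |V_q|, the number of pattern graph nodes
    """
    if explicit_node in matching_set:
        # x is matched, so no other pattern node may be matched alongside it.
        return len(matching_set) <= 1
    # x is not matched, so at most the remaining |V_q| - 1 nodes can appear.
    return len(matching_set) < pattern_size


# Hypothetical matching sets for a 7-node pattern graph with explicit node 2.
print(respects_explicit({2}, 2, 7))     # True
print(respects_explicit({2, 4}, 2, 7))  # False
print(respects_explicit({4, 6}, 2, 7))  # True
```

A matching set that fails the check would be reduced to the explicit node alone, as the detailed description states for the exploration step.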
Further, there may be a plurality of ways of selecting an exploration source point in step (a), including:
(a1) Randomly selecting one node of the pattern graph.
(a2) Selecting the node of the pattern graph with the smallest eccentricity and the largest degree.
(a3) When the data graph is a dynamic graph, selecting a node whose label is the same as that of an incremental node of the data graph.
(a4) Selecting manually according to specific business logic.
Further, in (a2), the eccentricity of a node is the maximum undirected-edge distance from that node to the other nodes of the pattern graph. The undirected-edge distance is the distance computed after all directed edges in the graph have been replaced by undirected edges.
Further, in step (a), the pattern graph is traversed from the source point in a depth-first manner, and the layers are divided by marking the level of each node with its traversal depth.
Further, in step (c), the layers are explored from low levels to high levels, in ascending order of depth value. The nodes of a new layer do not depend on each other, so the order in which they are visited during exploration is irrelevant, and the data graph nodes update their matching sets independently and in parallel.
Further, in step (d), the secondary source points are selected in descending order of node eccentricity in the pattern graph. When several nodes have the same eccentricity, the node with the largest degree is preferred. The degree of a node is computed by converting all directed edges into undirected edges and counting the adjacent edges.
Further, in step (c), the computation may be carried out on a distributed graph computing framework or on a single machine. In step (d), the selection of the secondary source points and their level processing may be carried out concurrently with step (c), and the repeated exploration step should be carried out in parallel on a single machine in a multi-task manner, with each task processing one subgraph.
Further, in a pattern matching task that starts from globally indeterminate points, the data graph nodes whose label is the same as that of the pattern graph source point $s$ can be selected as starting points; in a fixed-point pattern query task on a graph database, the starting point of the graph traversal query in the data graph can be selected as the exploration starting point.
Further, the data graph can model the interaction relationships between real-world entities, and a domain expert designs the pattern graph according to business experience in order to query the entities that conform to the relationship pattern. Specifically, the trade relationships between enterprises can be modeled, with the data graph defined as $G = (V, E, L)$, where $V$ is the set of unique enterprise identifiers; $E$ is the edge set of the data graph, in which an edge pointing from A to B is created if enterprise A purchases a service or commodity from enterprise B; and $L$ represents node attributes and can record information such as the product types an enterprise can provide. The pattern graph $Q = (V_q, E_q, L_q)$ is a query pattern designed by a business expert, where $V_q$ is the node set of the pattern graph; $E_q$ is the edge set of the pattern graph and represents the association relationships being queried; and $L_q$ is the same label set as the label set $L$ of the data graph $G$. The business expert designs the pattern graph from historical experience and does not need to know the structure of the data graph in advance. With the designed pattern graph, one can start from enterprises of a specified type and query the other enterprises that have a specific interaction pattern with them.
The beneficial effects of the invention are as follows. The method explores layer by layer and defines a central pattern set and an edge pattern set to constrain the matching. Compared with graph simulation algorithms, it obtains structurally more accurate results in polynomial time. The method refines the result by repeated exploration from multiple source points, which further strengthens the structural constraints from multiple perspectives and makes it possible to balance computational efficiency against matching precision. The method can be used to query fixed interaction patterns on natural graph structures such as enterprise trade networks, and gives business experts a degree of control over the query process.
Drawings
FIG. 1 is the overall flowchart of the invention;
FIG. 2 is the pattern graph preprocessing flow;
FIG. 3 is the data graph exploration flow;
FIG. 4 is the iterative result refinement flow;
FIG. 5 is a sample data graph;
FIG. 6 is a sample pattern graph.
Detailed Description
In order that those skilled in the art will better understand the invention, the invention will now be described in further detail with reference to the accompanying drawings and specific examples.
As shown in FIG. 1, the overall process of the invention is as follows. A source point is selected from the pattern graph in a suitable way, and the pattern graph is decomposed into several layers centered on the source point. The layers form a hierarchy; after a starting point has been selected on the data graph, the data graph is explored layer by layer, from low layers to high layers. The pattern graph source point and the data graph starting point must satisfy certain constraints. Then, optionally, one or more secondary source points are selected on the pattern graph, the exploration step is repeated on the explored subgraph, and the intersection of the results is taken to improve precision. Each exploration result is represented as a connected subgraph, and the set of all subgraphs obtained by exploring from the different starting points of the data graph is the graph pattern matching result.
The pattern graph is defined as $Q = (V_q, E_q, L_q, d_q)$, where $V_q$ is the node set of the pattern graph; $E_q \subseteq V_q \times V_q$ is the edge set of the pattern graph; $T$ is the set of labels, and $L_q: V_q \to T$ maps a node id to a label; $d_q: V_q \to \{0, 1\}$ indicates whether a node is an explicit node. A specific example is shown in FIG. 6, where the pattern graph of a query defined by a business expert contains 7 nodes:
Node set:
$V_q = \{1, 2, 3, 4, 5, 6, 7\}$
Edge set:
$E_q = \{(1,2), (2,3), (3,2), (3,4), (5,2), (5,6), (5,7), (7,5)\}$
Label set:
$L_q = \{A, B, C, D, E\}$
where $V_q$ represents 7 kinds of enterprises; $E_q$ indicates trade dealings between enterprises, such as whether goods or services are purchased; and $L_q$ specifies the goods or services that each enterprise can provide.
As shown in FIG. 2, pattern graph preprocessing covers steps 1 and 2 of the overall process. First, a distance matrix $D$ between nodes is calculated, defined as the undirected-edge distance between any two nodes of the pattern graph $Q$. The calculation proceeds as follows: first, the direction information on the edges of $Q$ is removed, converting it into an undirected graph. Then the weight of every edge is set to 1, the shortest distance between every pair of nodes is calculated with a shortest-path algorithm such as Dijkstra or Floyd, and the results are filled into the distance matrix $D$. $D$ is a symmetric matrix of size $|V_q| \times |V_q|$, where $D_{i,j}$ is the shortest distance from node $i$ to node $j$. Taking the pattern graph shown in FIG. 6 as an example, $D_{4,7} = D_{7,4} = 4$ and $D_{6,6} = 0$. The distances between other nodes are obtained in the same way, and $D$ is a symmetric matrix whose diagonal elements are 0.
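The distance matrix computation can be sketched as follows. This is a minimal illustration that assumes the Floyd algorithm on the edge list given for FIG. 6; the function name `undirected_distance_matrix` and the dictionary representation of $D$ are choices made here for illustration, not something the patent prescribes.

```python
from itertools import product

# Pattern graph of FIG. 6, using the node and edge sets listed in the description.
NODES = [1, 2, 3, 4, 5, 6, 7]
EDGES = [(1, 2), (2, 3), (3, 2), (3, 4), (5, 2), (5, 6), (5, 7), (7, 5)]


def undirected_distance_matrix(nodes, edges):
    """Return {(i, j): shortest undirected hop count} via Floyd-Warshall."""
    inf = float("inf")
    dist = {(i, j): 0 if i == j else inf for i, j in product(nodes, repeat=2)}
    for u, v in edges:
        if u != v:
            # Drop the direction: each edge is usable both ways with weight 1.
            dist[u, v] = dist[v, u] = 1
    # product() varies its first component slowest, so k is the outer loop.
    for k, i, j in product(nodes, repeat=3):
        if dist[i, k] + dist[k, j] < dist[i, j]:
            dist[i, j] = dist[i, k] + dist[k, j]
    return dist


D = undirected_distance_matrix(NODES, EDGES)
print(D[4, 7], D[7, 4], D[6, 6])  # 4 4 0
```

For a pattern graph of a handful of nodes the cubic cost of this approach is negligible, and the result depends only on the pattern graph, so it can be computed once before any data graph is seen.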
The node eccentricity $E$ can be calculated from the node distance matrix $D$. For node $i$, its eccentricity $E_i$ satisfies the formula
$E_i = \max_{j \in V_q} D_{i,j}$
that is, the maximum distance from node $i$ to the other nodes. For node 2 in FIG. 6, the eccentricity is the maximum of the 2nd row vector $[1, 0, 1, 2, 1, 2, 2]$ of the distance matrix $D$, i.e. 2. In the same way, $E_1 = 3$, $E_3 = 3$, $E_4 = 4$, $E_5 = 3$, $E_6 = 4$, $E_7 = 4$.
There are several ways to select the source point $s$ on the pattern graph. The simplest is random selection; a preferable method is to select the node of the pattern graph with the smallest eccentricity $E$ and the largest node degree as the source point. If several candidate nodes have the same degree, one of them may be chosen at random. The degree of a node is defined as the number of undirected edges connected to it, and can also be calculated by the formula
$deg_i = |\{(u, v) \in E_q : u = i \lor v = i\}|$
As shown in FIG. 6, $deg_1 = 1$, $deg_2 = 4$, $deg_3 = 3$, $deg_4 = 1$, $deg_5 = 4$, $deg_6 = 1$, $deg_7 = 2$. Here node 2 can be selected as the source point $s$, since its eccentricity $E_2 = 2$ is the minimum. The choice of source point has some influence on the matching result, and a business expert may select another node as the source point based on practical experience.
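The preferred source point rule (smallest eccentricity, then largest degree) can be sketched as below. This is an illustrative sketch only: it uses breadth-first search instead of a precomputed distance matrix, counts each listed directed edge as one adjacent edge (which reproduces the degree values given for FIG. 6), and breaks remaining ties by node order rather than randomly. All function names are hypothetical.

```python
from collections import deque

NODES = [1, 2, 3, 4, 5, 6, 7]
EDGES = [(1, 2), (2, 3), (3, 2), (3, 4), (5, 2), (5, 6), (5, 7), (7, 5)]


def bfs_distances(src, nodes, edges):
    """Undirected hop distances from src to every reachable node."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def eccentricity(node, nodes, edges):
    """Largest undirected distance from node to any other node."""
    return max(bfs_distances(node, nodes, edges).values())


def degree(node, edges):
    """Number of listed edges that touch node, ignoring direction."""
    return sum(1 for u, v in edges if node in (u, v))


def select_source(nodes, edges):
    """Smallest eccentricity first, then largest degree."""
    return min(nodes, key=lambda n: (eccentricity(n, nodes, edges), -degree(n, edges)))


print(select_source(NODES, EDGES))  # 2
```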
The level of each node is calculated from the source point $s$ and the node distance matrix $D$, and the pattern graph $Q$ is then decomposed into several layers. For node $i$, its level $G$ is defined as $G_i = D_{s,i}$. Once the levels are known, the node set $V_q$ can be divided by level into the layer sets $Y_s = \{y_0, y_1, \ldots, y_m\}$, with $y_0 = \{s\}$ and
$y_k = \{i \in V_q : G_i = k\}$
$m = \max_{i \in V_q} G_i$
where $m$ is the maximum level. In the pattern graph of FIG. 6, with node 2 as the source point $s$, $y_0 = \{2\}$, $y_1 = \{1, 3, 5\}$, $y_2 = \{4, 6, 7\}$, and the maximum level is 2.
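The layer decomposition can be sketched as follows. This is a minimal sketch that follows the definition $G_i = D_{s,i}$ by running a breadth-first search over the undirected pattern graph; the function name `layers_from_source` and the list-of-lists return type are assumptions made for illustration.

```python
from collections import defaultdict, deque

NODES = [1, 2, 3, 4, 5, 6, 7]
EDGES = [(1, 2), (2, 3), (3, 2), (3, 4), (5, 2), (5, 6), (5, 7), (7, 5)]


def layers_from_source(source, nodes, edges):
    """Group pattern graph nodes by undirected distance from the source."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    level = {source: 0}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt not in level:
                level[nxt] = level[cur] + 1
                queue.append(nxt)
    layers = defaultdict(list)
    for node, lv in level.items():
        layers[lv].append(node)
    # Index k of the result holds layer y_k, sorted for stable output.
    return [sorted(layers[k]) for k in range(max(layers) + 1)]


print(layers_from_source(2, NODES, EDGES))  # [[2], [1, 3, 5], [4, 6, 7]]
```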
The final step of pattern graph preprocessing is to compute the central pattern set $C$ and the edge pattern set $P$. $C$ and $P$ contain all constraint information needed during exploration and can be conveniently serialized to the nodes of a distributed computation. For node $i$ of the pattern graph $Q$, its central pattern set is
$C_i = \{F_i, S_i, t_i, d_i\}$
where $F_i = \{j : (j, i) \in E_q\}$ is the set of parent nodes of node $i$ and $S_i = \{j : (i, j) \in E_q\}$ is the set of child nodes of node $i$, with $F_i \subseteq V_q$ and $S_i \subseteq V_q$. $t_i$ is the node label, with $t_i = L_q(i)$. $d_i$ is a binary flag: node $i$ is considered explicit when $d_i = 1$, otherwise $d_i = 0$; this attribute must be marked by the business expert based on experience. Similarly to the central pattern set, the edge pattern set is defined as
$P_i = \{F'_i, S'_i, t_i, d_i\}$
where $t_i$ and $d_i$ are defined exactly as in the central pattern set. $F'_i$ is a subset of $F_i$ satisfying
$F'_i = \{j \in F_i : y(j) = y(i) - 1\}$
where the function $y: V_q \to \mathbb{N}$ gives the level to which a node belongs. Similarly, $S'_i$ satisfies
$S'_i = \{j \in S_i : y(j) = y(i) - 1\}$
Taking node 3 in FIG. 6 as an example, its central pattern set $C_3$ includes $F_3 = \{2\}$, $S_3 = \{2, 4\}$, $t_3 = B$, $d_3 = 0$. Node 3 belongs to layer 1, and its edge pattern set $P_3$ contains $F'_3 = \{2\}$ and $S'_3 = \{2\}$, i.e. only layer 0 nodes.
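The structural part of the two pattern sets can be sketched as follows. This is a minimal illustration that leaves out the label $t_i$ and the explicit flag $d_i$, since FIG. 6 is not reproduced here; the function names are hypothetical, and the `LEVEL` dictionary simply restates the layers obtained for source point 2.

```python
EDGES = [(1, 2), (2, 3), (3, 2), (3, 4), (5, 2), (5, 6), (5, 7), (7, 5)]
# Levels of the FIG. 6 pattern graph for source point s = 2.
LEVEL = {2: 0, 1: 1, 3: 1, 5: 1, 4: 2, 6: 2, 7: 2}


def central_pattern(i, edges):
    """Parent ids and child ids of pattern graph node i."""
    parents = {u for u, v in edges if v == i}
    children = {v for u, v in edges if u == i}
    return parents, children


def edge_pattern(i, edges, level):
    """Restrict the central pattern of node i to the previous layer."""
    parents, children = central_pattern(i, edges)
    prev = level[i] - 1
    return (
        {j for j in parents if level[j] == prev},
        {j for j in children if level[j] == prev},
    )


parents, children = central_pattern(3, EDGES)
print(sorted(parents), sorted(children))  # [2] [2, 4]
prev_parents, prev_children = edge_pattern(3, EDGES, LEVEL)
print(sorted(prev_parents), sorted(prev_children))  # [2] [2]
```

Because both sets are plain collections of ids, they can be shipped to every worker of a distributed computation without any reference to the rest of the pattern graph.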
In the scenario of a fixed-point query on a graph database, the conventional approach is to run the query as a graph traversal. A graph traversal starts from a determined point and keeps searching outwards along adjacent edges until a specified condition is met. In this case the query request already contains the starting point $v$ of the graph traversal, and the node of $V_q$ whose label is identical to that of $v$ can be selected directly as the source point.
In a dynamic graph scenario, it is generally desirable to explore directly from the incremental part of the data graph. The source point then need not be fixed: all nodes of the pattern graph are treated as potential source points and preprocessed in turn. When the increment $\Delta G$ of the data graph is received, the label set $\Delta T$ of all nodes in $\Delta G$ is obtained as well. The final source point is selected only among the nodes of $V_q$ whose labels are contained in $\Delta T$, using low eccentricity and high node degree as the criteria. Because pattern graph preprocessing is independent of the data graph, preprocessing every node as a source point in advance does not increase the matching time.
As shown in FIG. 3, the data graph exploration flow covers steps 3 and 4 of the overall flowchart. The data graph is defined as $G = (V, E, L)$, where $V$ is the node set of the data graph, $E$ is the edge set of the data graph, and $L: V \to T$ is the function that maps a data graph node id to a label; the label set $T$ is defined in the same way as the label set of the pattern graph. Taking the enterprise trade network shown in FIG. 5 as an example, the data graph includes 10 enterprises, which can provide 5 kinds of commodities. $V$ is the set of unique enterprise identifiers, represented here by node ids, i.e. 1, 2, …; for $E$, a directed edge from node 1 to node 2 is created if enterprise 1 purchases a commodity or service from enterprise 2. $L$ represents the products or services each enterprise can offer.
The exploration starting point of the data graph $G$ can be selected in several ways. For an indeterminate-point query, all data graph nodes whose label is the same as that of the source point are selected as exploration starting points, and the starting point set is defined as $\{v \in V : L(v) = L_q(s)\}$. For a fixed-point query, there is a set of fixed points $V_e$ that serves as the feasible starting points, satisfying
$V_e \subseteq V$
In this case $V_e$ can be treated as the set of exploration starting points, with the source point selected dynamically. The source point is selected such that $s \in V_q \land x \in V_e \land L_q(s) = L(x)$, where $s$ is the selected source point. Exploration on a dynamic graph can be regarded as a fixed-point query with $V_e = \Delta V$, where $\Delta V$ is the incremental part of the dynamic graph. Once the pattern graph source point and the corresponding data graph starting point have been determined, the exploration task can be started. The data graph exploration tasks started from the individual starting points are independent of one another and can be processed in parallel.
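The two starting point rules can be sketched with one helper. This is an illustrative sketch: the labels below are hypothetical (the labels of FIG. 5 are not reproduced in the text), and the function name `start_points` and its parameters are assumptions.

```python
def start_points(data_labels, source_label, fixed_points=None):
    """Select exploration starting points in the data graph.

    data_labels  -- dict mapping data graph node id to its label, i.e. L(v)
    source_label -- label of the pattern graph source point, i.e. L_q(s)
    fixed_points -- optional set V_e for a fixed-point query; None means an
                    indeterminate-point query over all data graph nodes
    """
    candidates = data_labels.keys() if fixed_points is None else fixed_points
    return {v for v in candidates if data_labels[v] == source_label}


# Hypothetical labels for a four-node data graph.
labels = {1: "A", 2: "B", 3: "B", 4: "C"}
print(sorted(start_points(labels, "B")))                       # [2, 3]
print(sorted(start_points(labels, "B", fixed_points={3, 4})))  # [3]
```

For a dynamic graph, the same call would be made with `fixed_points` set to the ids of the incremental nodes.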
Before exploration starts from the data graph starting point, a matching set $M$ must be established for each node. For node $i$ in the data graph $G$, the matching set $M_i$ is the set of pattern graph nodes matched with node $i$, so that
$M_i \subseteq V_q$
Initially,
$M_i = \emptyset$
Each exploration task maintains an explored node set, defined as
$V_a = \{i \in V : M_i \neq \emptyset\}$
Select a source point $s$ and an exploration starting point $u$, set $M_u = \{s\}$ and $t = 1$. Scheduling signals are then sent to the parent and child nodes of $u$, and the nodes that receive a scheduling signal are processed in parallel in the next iteration. In each iteration, a node $i$ that receives a scheduling signal performs the following processing. First, the level $t$ to be matched in this iteration is determined. Then the matching set of its parent nodes
Figure BDA0002814739580000065
and the matching set of its child nodes
Figure BDA0002814739580000066
are obtained from $V_a$; if there are several parent or child nodes, their matching sets are merged into one. Next, the nodes of $y_t$ whose label is $L(i)$ are selected to form a set
Figure BDA0002814739580000067
with
Figure BDA0002814739580000068
Each node in
Figure BDA0002814739580000069
is checked in turn against the edge pattern set constraint, namely
Figure BDA00028147395800000610
If node $v$ does not satisfy the constraint, compute
Figure BDA00028147395800000611
Each node in $M_i$ is then checked in turn against the central pattern set constraint, namely
Figure BDA00028147395800000612
If node $v$ does not satisfy the constraint, compute $M_i = M_i - \{v\}$. In particular, when $M_i$ contains a node marked as explicit, directly set
Figure BDA00028147395800000613
Finally, compute
Figure BDA00028147395800000614
to merge the matching sets; if
Figure BDA00028147395800000615
node $i$ sends a scheduling signal to its neighbors, sets $t = t + 1$, and proceeds to the next iteration. Here $m$ is the level of the highest layer. After $m$ iterations the algorithm enters the convergence stage: all nodes in $V_a$ send scheduling signals to their neighbors, and each node $i$ that receives a scheduling signal collects, by the same method, the matching sets of its parent and child nodes
Figure BDA0002814739580000071
and
Figure BDA0002814739580000072
and checks in turn whether each node $v$ in the original matching set $M_i$ satisfies
Figure BDA0002814739580000073
Figure BDA0002814739580000074
If not, compute $M_i = M_i - \{v\}$. If this results in $M_i = \emptyset$, then $V_a = V_a - \{i\}$. The convergence stage iterates until $V_a$ no longer changes, and takes at most $m$ iterations. For the data graph of FIG. 5, the matching result set with node 5 as the exploration starting point is $V_a = \{2, 3, 4, 5, 6, 7, 8, 10\}$, where $M_2 = \{3\}$, $M_3 = \{4\}$, $M_4 = \{1\}$, $M_5 = \{2, 4\}$, $M_6 = \{7\}$, $M_7 = \{5\}$, $M_8 = \{4, 6\}$, $M_{10} = \{6\}$. Nodes 5 and 8 each satisfy two interaction relationships at the same time, so some ambiguity exists. If $d_2 = 1$ is set in the pattern graph, then $M_5 = \{2\}$ is obtained, i.e. only the explicit pattern graph node is included, which eliminates the ambiguity to some extent. Whether this is needed depends on the specific business context and on the experience of the business expert.
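The pruning performed in the convergence stage can be illustrated with a simplified sketch. It covers only the central pattern set check, under these assumptions: the state of a node is the union of the matching sets of its data graph parents and children, labels and explicit flags are ignored, updates are applied sequentially rather than in parallel, and the loop runs to a fixed point instead of being bounded by $m$. The function name `converge` and the tiny example graph are hypothetical.

```python
def converge(match, data_edges, pattern_edges):
    """Remove pattern nodes whose central pattern is not covered by neighbors.

    match         -- dict: data graph node id -> set of matched pattern node ids
    data_edges    -- directed edges of the data graph
    pattern_edges -- directed edges of the pattern graph
    """
    def parents(node, edges):
        return {u for u, v in edges if v == node}

    def children(node, edges):
        return {v for u, v in edges if u == node}

    changed = True
    while changed:
        changed = False
        for i in list(match):
            # State collected from neighbors: what the parents and children match.
            parent_state = set().union(
                *(match.get(j, set()) for j in parents(i, data_edges)))
            child_state = set().union(
                *(match.get(j, set()) for j in children(i, data_edges)))
            for v in list(match[i]):
                covered = (parents(v, pattern_edges) <= parent_state
                           and children(v, pattern_edges) <= child_state)
                if not covered:
                    match[i].discard(v)
                    changed = True
    return match


# Hypothetical example: pattern a -> b; data graph 1 -> 2 plus an isolated node 3.
result = converge({1: {"a"}, 2: {"b"}, 3: {"b"}}, [(1, 2)], [("a", "b")])
explored = {i for i, m in result.items() if m}  # plays the role of V_a
print(sorted(explored))  # [1, 2]
```

Node 3 loses its only match because no parent of it matches pattern node `a`, so it drops out of the explored set, mirroring the rule $V_a = V_a - \{i\}$.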
After $m$ layers have been explored, according to the explored node set
Figure BDA0002814739580000076
a subgraph is generated from the data graph
Figure BDA0002814739580000077
as the first-stage result. If this result already meets the matching precision requirement, the subsequent refinement step can be skipped and the result returned directly.
As illustrated in FIG. 4, the refinement of the preliminary exploration result
Figure BDA0002814739580000078
corresponds to steps 5 and 6 of the overall flow. First, $K$ secondary source points are selected from the pattern graph $Q$. The selection rule is as follows: the nodes of $V_q$ are sorted in descending order of eccentricity $E$; nodes with the same eccentricity are sorted in descending order of node degree; the first $K$ nodes are selected as secondary source points. The pattern graph is again decomposed into several layers by dividing levels according to the node distance matrix $D$. Using the starting point selection rule for indeterminate-point queries on
Figure BDA0002814739580000079
several parallel graph exploration tasks are started, and the set of task numbers is $X = \{1, 2, \ldots, n\}$. For
Figure BDA00028147395800000710
node $i$ and graph exploration task $x$, the resulting matching set is
Figure BDA00028147395800000711
After all exploration tasks have been executed, each node computes
Figure BDA00028147395800000712
Finally, the nodes of
Figure BDA00028147395800000713
whose matching sets are empty are removed, and the subgraph is regenerated and output as the final result. The result refinement process requires $n$ exploration tasks to be started; typically
Figure BDA00028147395800000714
is small in scale and can be loaded directly onto a single computing node for processing, which avoids the overhead of distributed communication.
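The final fusion step can be sketched as follows, assuming that each exploration task reports a dictionary from data graph node id to its matching set and that the results are combined by plain set intersection, as the disclosure describes. The function name `refine` and the sample task results are hypothetical.

```python
def refine(task_results):
    """Intersect per-task matching sets and drop nodes left with nothing.

    task_results -- list of dicts, one per exploration task, each mapping a
                    data graph node id to its set of matched pattern node ids
    """
    nodes = set().union(*task_results)  # every node id seen by any task
    refined = {}
    for i in nodes:
        # A node missing from a task counts as having an empty matching set.
        common = set.intersection(*(set(r.get(i, ())) for r in task_results))
        if common:
            refined[i] = common
    return refined


# Hypothetical results of two exploration tasks on the same subgraph.
task_1 = {5: {2, 4}, 8: {4, 6}, 9: {1}}
task_2 = {5: {2}, 8: {4, 6}}
print(refine([task_1, task_2]) == {5: {2}, 8: {4, 6}})  # True
```

Since the first-stage subgraph is usually small, running this on one machine after the tasks finish keeps the step free of distributed communication.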
The above-described embodiments are intended to illustrate rather than limit the invention, and any modifications and variations of the present invention are within the spirit and scope of the appended claims.

Claims (8)

1. A graph pattern matching method based on multi-source point parallel exploration, characterized by comprising the following steps:
(a) the pattern graph to be matched is denoted Q and consists of a node set V_q, a set of edges connecting the nodes, and a set of labels of the pattern graph nodes; one node of the pattern graph is selected as the exploration source point, a depth traversal is performed centered on the source point, and the level of each pattern graph node is marked with its depth value; the pattern graph is decomposed into a plurality of layers, each layer comprising the nodes of the same level and the edges connecting them;
(b) a central pattern set is calculated from the structure and node semantics of the pattern graph to be matched, and an edge pattern set is calculated from the dependency relationship between adjacent layers; each node of the pattern graph node set V_q has its own central pattern set, which comprises the label of the node and the ids of its parent and child nodes; for a node at the level denoted by Figure DEST_PATH_IMAGE007, its edge pattern set is a subset of its central pattern set and contains only the ids of its parent and child nodes at the level denoted by Figure 197771DEST_PATH_IMAGE008; together, the central pattern sets and edge pattern sets contain all structural constraints of the layer-by-layer exploration of the graph; with the depth of the layering defined as D, at least D explorations and at most the number given by Figure 542165DEST_PATH_IMAGE010 are performed; the first D explorations form the expansion stage, in which new matchable nodes are added and mismatched nodes are removed; after D explorations comes the convergence stage, in which no new nodes are added and only mismatched nodes are removed;
(c) the data graph to be matched is acquired, and a starting point is selected in the data graph from which neighbor nodes are explored; each data graph node maintains only its own matching set, which contains all nodes of the pattern graph node set V_q that this data graph node can match; each data graph node also maintains its own state set, used to evaluate the constraints of the edge pattern set or the central pattern set; in the expansion stage (the first explorations, their number given by Figure DEST_PATH_IMAGE011), exploring the layer denoted by Figure DEST_PATH_IMAGE013 from the layer denoted by Figure 106636DEST_PATH_IMAGE012 proceeds as follows: the data graph nodes with non-empty matching sets are traversed, those whose matching sets contain pattern graph nodes of the layer denoted by Figure 192403DEST_PATH_IMAGE012 are selected, and the neighbor nodes of the selected data graph nodes are added to the exploration queue; each node in the exploration queue collects the matching sets of its neighbor nodes, and the non-empty ones are merged into its state set; for each node in the exploration queue, the nodes of the pattern graph node set whose level is that denoted by Figure 80911DEST_PATH_IMAGE013 are traversed, and it is checked whether the data graph node and the pattern graph node have the same label and whether the state set of the data graph node is a superset of the edge pattern set of the pattern graph node; if so, the current pattern graph node is added to the matching set of the data graph node; meanwhile, throughout the matching process, every data graph node with a non-empty matching set also collects the matching sets of its neighbors to form a state set, traverses the pattern graph nodes in its matching set that have not been updated, and checks whether the current state set is a superset of the central pattern set of each such pattern graph node; if not, that pattern graph node is removed from the matching set; the exploration step is repeated, with only one adjacent layer matched per exploration, until all layers have been explored; after the exploration steps are finished, the nodes with non-empty matching sets continue to collect neighbor states and to verify whether their matching sets contain mismatched nodes, until no matching set changes any more;
(d) the nodes of the data graph whose matching sets are non-empty are acquired, and a subgraph is extracted as the new data graph; the subgraph is a subset of the data graph and contains the nodes whose matching sets are non-empty in the exploration started from the source point, together with the edges connecting these nodes; one data graph matching task yields a plurality of subgraphs, each uniquely identified by the id of the data graph starting point of step (c); for the data graph, with its node set, its edge set and its label set, the relation shown in Figure DEST_PATH_IMAGE019 holds; a plurality of secondary source points are selected in the pattern graph, and the exploration of step (c) is repeated on the subgraph; a secondary source point is defined as any pattern graph node other than the source point; finally, the intersection of the multi-source point exploration results is taken and the nodes with empty matching sets are removed, giving the refined matching result;
the data graph can model the interactions between real-world entities, and a domain expert designs a pattern graph from business experience in order to query the entities that conform to the interaction pattern; specifically, the trade relationships between enterprises can be modeled, with the data graph defined by a node set, an edge set and a label set (Figure DEST_PATH_IMAGE021), wherein the nodes are the unique identifiers of the enterprises; the edges record purchases: if enterprise A purchases a service or commodity from enterprise B, an edge pointing from A to B is created; the labels represent node attributes and mark the types of products an enterprise can provide; the pattern graph (Figure DEST_PATH_IMAGE023) is the query pattern designed by the business expert, wherein its node set is the set of pattern graph nodes, its edge set represents the association relationships being queried, and its label set is identical to the label set of the data graph; the business expert designs the pattern graph from historical experience, without knowing the structure of the data graph in advance; with the designed pattern graph, starting from enterprises of a specified type, the other enterprises that have the specific interaction pattern with them can be queried.
2. The graph pattern matching method based on multi-source point parallel exploration according to claim 1, wherein the pattern graph and the data graph are both directed graphs with node labels; a node label is a symbol that distinguishes nodes with different roles; each node of the pattern graph also carries an additional explicit label indicating that a data graph node matching this node should exclude other pattern graph nodes; specifically, a data graph node that matches a pattern graph node carrying the explicit label (Figure 840794DEST_PATH_IMAGE028) cannot match other nodes at the same time; that is, the matching set (Figure DEST_PATH_IMAGE029) either contains the node of Figure 485533DEST_PATH_IMAGE028 and its cardinality satisfies Figure 52781DEST_PATH_IMAGE030, or does not contain the node of Figure 413355DEST_PATH_IMAGE028 and satisfies Figure DEST_PATH_IMAGE031; by specifying explicit nodes, the matching pattern can be specified.
3. The graph pattern matching method based on multi-source point parallel exploration according to claim 1, wherein a plurality of modes for selecting exploration source points exist in step (a), and the modes comprise:
(a1) randomly selecting a node in the pattern graph;
(a2) selecting a node with the smallest eccentricity and the largest degree of the pattern graph;
(a3) when the data graph is a dynamic graph, selecting nodes with the same labels as the incremental nodes of the data graph;
(a4) selecting a node manually according to specific business logic.
4. The graph pattern matching method based on multi-source point parallel exploration according to claim 3, wherein in (a2), the eccentricity of a node is the maximum undirected-edge distance from that node to the other nodes of the pattern graph; the undirected-edge distance is the distance computed after all directed edges in the graph have been replaced by undirected edges.
5. The graph pattern matching method based on multi-source point parallel exploration according to claim 1, wherein in step (c), the graph is explored level by level in ascending order of depth value, from lower levels to higher levels; the nodes of a newly explored layer do not depend on one another, the order in which nodes are visited during exploration is irrelevant, and the data graph nodes update their own matching sets independently and in parallel.
6. The graph pattern matching method based on multi-source point parallel exploration according to claim 1, wherein in step (d), the secondary source points are selected in descending order of the eccentricity of the pattern graph nodes; among nodes with the same eccentricity, the node with the largest degree is preferred; the degree of a node is computed by converting all directed edges into undirected edges and counting the edges adjacent to the node.
7. The graph pattern matching method based on multi-source point parallel exploration according to claim 1, wherein in step (c), the computation can be performed with a distributed graph computing framework or in single-machine mode; in step (d), the selection and layering of the secondary source points can be carried out concurrently with step (c), and the repeated exploration steps should be executed in parallel as multiple tasks on a single machine, each task processing one subgraph.
8. The graph pattern matching method based on multi-source point parallel exploration according to claim 1, wherein, in a pattern matching task that starts from globally uncertain points, the data graph nodes whose label is the same as that of the pattern graph source point can be selected as starting points; in a fixed-point pattern query task of a graph database, the starting point of the graph traversal query in the data graph can be selected as the exploration starting point.
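One expansion step of claim 1(c) and the superset-based pruning it describes can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation. The function names `expand_one_layer` and `prune_until_stable` are hypothetical. Levels are assumed to be numbered from 0, with the next layer being the current level plus one. Neighbors are taken regardless of edge direction. The edge pattern set and central pattern set are reduced to sets of pattern node ids, with labels compared separately. Updates are computed from a snapshot and applied afterwards, to mimic data graph nodes updating their matching sets independently.

```python
def neighbor_state(data_adj, matching, v):
    """State set of v: union of the matching sets of its neighbors."""
    state = set()
    for w in data_adj[v]:
        state |= matching.get(w, set())
    return state


def expand_one_layer(data_adj, data_label, matching, pattern_label,
                     pattern_level, edge_pattern, current_level):
    """Explore pattern layer current_level + 1 from layer current_level."""
    next_level = current_level + 1
    # Data nodes whose matching set holds a pattern node of the current layer.
    frontier = [v for v, matched in matching.items()
                if any(pattern_level[q] == current_level for q in matched)]
    # Their neighbors form the exploration queue.
    queue = set()
    for v in frontier:
        queue.update(data_adj[v])
    # Decide additions from a snapshot of the current matching sets.
    additions = {}
    for v in queue:
        state = neighbor_state(data_adj, matching, v)
        for q, level in pattern_level.items():
            if (level == next_level
                    and data_label[v] == pattern_label[q]
                    and edge_pattern[q] <= state):  # state is a superset
                additions.setdefault(v, set()).add(q)
    for v, new_nodes in additions.items():
        matching.setdefault(v, set()).update(new_nodes)
    return matching


def prune_until_stable(data_adj, matching, central_pattern):
    """Remove pattern nodes whose central pattern set is not covered by the
    neighbor state, repeating until no matching set changes."""
    changed = True
    while changed:
        changed = False
        removals = {}
        for v, matched in matching.items():
            if not matched:
                continue
            state = neighbor_state(data_adj, matching, v)
            bad = {q for q in matched if not central_pattern[q] <= state}
            if bad:
                removals[v] = bad
        for v, bad in removals.items():
            matching[v] -= bad
            changed = True
    return matching


# Hypothetical pattern: source s (label A, level 0) with one child t (label B).
pattern_label = {"s": "A", "t": "B"}
pattern_level = {"s": 0, "t": 1}
edge_pattern = {"s": set(), "t": {"s"}}
central_pattern = {"s": {"t"}, "t": {"s"}}

# Hypothetical data graph: 1(A) - 2(B), 1 - 3(C), 2 - 4(B).
data_adj = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}
data_label = {1: "A", 2: "B", 3: "C", 4: "B"}

expanded = expand_one_layer(data_adj, data_label, {1: {"s"}}, pattern_label,
                            pattern_level, edge_pattern, current_level=0)
pruned = prune_until_stable(data_adj, {1: {"s"}, 2: {"t"}, 4: {"t"}},
                            central_pattern)
```

In the example, data node 2 gains the pattern node t because it has the right label and a neighbor matched to s, while node 3 is rejected on its label. In the pruning call, node 4 loses t because none of its neighbors is matched to s.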

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011410948.6A CN112559807B (en) 2020-12-03 2020-12-03 Graph pattern matching method based on multi-source point parallel exploration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011410948.6A CN112559807B (en) 2020-12-03 2020-12-03 Graph pattern matching method based on multi-source point parallel exploration

Publications (2)

Publication Number Publication Date
CN112559807A CN112559807A (en) 2021-03-26
CN112559807B true CN112559807B (en) 2022-06-21

Family

ID=75048613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011410948.6A Active CN112559807B (en) 2020-12-03 2020-12-03 Graph pattern matching method based on multi-source point parallel exploration

Country Status (1)

Country Link
CN (1) CN112559807B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408827B (en) * 2021-07-21 2022-05-31 上海勘察设计研究院(集团)有限公司 Measuring line searching method based on graph algorithm
CN114817264B (en) * 2022-04-28 2023-04-25 电子科技大学 Topology query structure, query method, electronic equipment and medium for graph calculation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521332A (en) * 2011-12-06 2012-06-27 北京航空航天大学 Graphic mode matching method, device and system based on strong simulation
CN105138601A (en) * 2015-08-06 2015-12-09 中国科学院软件研究所 Graph pattern matching method for supporting fuzzy constraint relation
CN105956114A (en) * 2016-05-05 2016-09-21 南京邮电大学 Method for searching pattern matching subgraphs based on tag graph
CN107451210A (en) * 2017-07-13 2017-12-08 北京航空航天大学 A kind of figure matching inquiry method based on inquiry relaxation result enhancing
CN109614520A (en) * 2018-10-22 2019-04-12 中国科学院信息工程研究所 One kind is towards the matched parallel acceleration method of multi-mode figure
CN111737538A (en) * 2020-06-11 2020-10-02 浙江邦盛科技有限公司 Reverse real-time matching method for graph mode based on event driving

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10204174B2 (en) * 2015-12-15 2019-02-12 Oracle International Corporation Efficient method for subgraph pattern matching

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cypher-based Graph Pattern Matching in Gradoop; Martin Junghanns et al.; Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems; 20170530; pp. 1-8 *
Graph Pattern Matching through Model Checking; Rui Qiao et al.; 2015 8th International Conference on Database Theory and Application (DTA); 20151128; pp. 1-5 *
A Survey of Dynamic Graph Pattern Matching Techniques (动态图模式匹配技术综述); Xu Jia et al. (许嘉等); Journal of Software (软件学报); 20180330; pp. 663-687 *

Also Published As

Publication number Publication date
CN112559807A (en) 2021-03-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room ABCD, 17th floor, building D, Paradise Software Park, No.3 xidoumen Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant after: Zhejiang Bangsheng Technology Co.,Ltd.

Address before: Room ABCD, 17th floor, building D, Paradise Software Park, No.3 xidoumen Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant before: ZHEJIANG BANGSUN TECHNOLOGY Co.,Ltd.

GR01 Patent grant