CN109783696B - Multi-pattern graph index construction method and system for weak structure correlation - Google Patents


Publication number
CN109783696B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811466997.4A
Other languages
Chinese (zh)
Other versions
CN109783696A (en)
Inventor
于静
唐钰葆
刘小梅
刘燕兵
曹聪
谭建龙
郭莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Application filed by Institute of Information Engineering of CAS
Priority to CN201811466997.4A
Publication of CN109783696A
Application granted
Publication of CN109783696B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-pattern graph index construction method and system for weak structure correlation. The method comprises the following steps: 1) reading the pattern graphs in a pattern graph set of a target field and generating a pattern graph ID for each pattern graph; 2) constructing a pattern graph isomorphic tree: comparing the pattern graphs pairwise and, if a subgraph isomorphism relationship exists between two pattern graphs, adding a directed edge pointing from the smaller-scale pattern graph to the larger-scale pattern graph, thereby obtaining the pattern graph isomorphic tree of the pattern graph set; 3) performing frequent subgraph mining on the pattern graph isomorphic tree, finding common pattern graphs and adding them to the pattern graph isomorphic tree; 4) when a sub-pattern graph with a plurality of parent pattern graphs exists in the pattern graph isomorphic tree, retaining a unique parent pattern graph for that sub-pattern graph; 5) calculating the minimum spanning tree of the pattern graph isomorphic tree and performing a depth-first traversal on it to obtain the optimal matching order of the pattern graph set. The invention can greatly improve matching efficiency.

Description

Multi-pattern graph index construction method and system for weak structure correlation
Technical Field
The invention relates to a multi-pattern graph index construction method and system for weak structure correlation, and belongs to the technical field of computer software.
Background
With the arrival of the big data era, massive multi-source heterogeneous data from the internet and daily life are being generated and accumulated at an unprecedented speed, and the data are closely interrelated. The graph is a widely used data structure that is well suited to describing data with intrinsic relationships, and graph pattern matching is an important means of querying graph data efficiently and a basic technology for analyzing and mining graph data.
Graph pattern matching means that, given a data graph and a specific pattern graph, all matches are found in the data graph whose nodes and edges have the same connection topology as the pattern graph and whose corresponding nodes and edges have the same attributes. As data scale keeps growing, fast and accurate graph pattern matching on large-scale graph data is a problem that urgently needs to be solved. Existing graph pattern matching technology mainly improves performance by optimizing pruning strategies, building indexes on the data graph, and the like. In practice, however, many application scenarios need to process pattern graphs in batches. The problem of matching a plurality of pattern graphs simultaneously is called multi-pattern matching: for a given data graph and a group of pattern graphs to be queried, the graph pattern matching result of each pattern graph on the data graph is calculated. For example:
(1) In network attack detection, the relationships among computers, IP addresses, users and software services in a network can be represented as a graph structure, in which the computers, IP addresses and so on are abstracted as nodes and the relationships among them are represented as edges. By analyzing the data transmission paths and communication patterns of viruses (such as the Witty worm) and network attacks (such as Smurf DDoS and Fraggle DDoS), network security problems such as virus propagation and malicious attacks that may arise as the graph data is updated can be detected in real time by applying multi-graph pattern matching with a virus pattern library and a network attack pattern library.
(2) In protein structure analysis, the molecules of a protein are natural graph nodes, and the intermolecular forces are the edges between the corresponding nodes. During structure analysis, a large number of molecular combination structures with unknown properties need to be searched for in a protein database using multi-pattern matching.
(3) In social relationship queries, social relationships can be represented as graphs (such as an academic relationship graph or a social network graph), and community discovery, key role detection, cooperative relationship graph search, employee function importance evaluation and the like can be realized through multi-graph pattern matching.
(4) In social security analysis, services developed by the well-known US intelligence company DeUmbra, which use multi-graph pattern matching as their core technology, have been applied since 2015 in security fields such as terrorist tracking, public security intelligence analysis and fraud analysis. The company fuses and associates the behavior of people in the physical world and in cyberspace, constructs a large-scale personal relationship network, studies methods for detecting suspicious criminal behavior patterns, and mainly uses multi-pattern matching to monitor and give early warning of various abnormal behaviors in real time.
Existing graph pattern matching algorithms mainly address the single-pattern matching problem. When handling multiple patterns, each pattern graph is treated as an independent individual and computed independently on the data graph, so the structural correlation within the pattern graph set is ignored and the matching process contains a large amount of redundant computation. For the multi-pattern matching problem, researchers have therefore proposed a series of multi-pattern matching technologies based on structural correlation: the structural correlation of the pattern graphs is mined, repeated structures are merged and stored in the form of a tree or a graph to reduce redundant computation, a pattern graph index structure is established, and the optimal pattern graph matching order is obtained from that structure.
One example is the Pattern Tree method. For a given pattern graph set P, a subgraph isomorphism algorithm is first applied to the pattern graphs pairwise. Once the subgraph isomorphism relationships among all pattern graphs are determined, a pattern tree is built. The pattern tree is a tree structure whose root node is a virtual node that connects all pattern graphs that have no subgraph. In the pattern tree, the relationship between a child pattern graph and a parent pattern graph is represented by a directed edge pointing from the child pattern graph to the parent pattern graph. If a child pattern graph has multiple parent pattern graphs, a minimum spanning tree algorithm (Chu Y J, Liu T H. On the Shortest Arborescence of a Directed Graph [J]. Science Sinica, 1965, 14: 1396-) is used to retain a single parent pattern graph for it. This completes the construction of the graph index. The subsequent graph pattern matching process determines the matching order of the pattern graphs from the index: the parent pattern graph is matched first, and then the child pattern graph. The child pattern graph only needs to match the part that extends the parent pattern graph, and pattern graphs directly connected to the root node are matched with a subgraph isomorphism algorithm. Pattern trees built this way tend to be shallow and wide, so many repeated structures remain. Moreover, relying only on a subgraph isomorphism algorithm cannot fully mine the common subgraph relationships among the pattern graphs.
To overcome the shortcomings of the pattern tree algorithm, an improved method was proposed: the Pattern Containment Map (PCM) method builds the multi-pattern graph index from maximum common subgraph relationships and introduces a structure for the index, the pattern containment map. Maximum common subgraphs are mined from the pattern graphs pairwise, and each common subgraph enters the matching computation of the pattern graph set as a new pattern graph. To reduce the cost of maximum common subgraph mining, the method first extracts features from the pattern graphs, clusters them, and mines only within each cluster of pattern graphs. Introducing new pattern graphs strengthens the association between pattern graphs compared with the pattern tree. However, index construction in this method depends on the clustering quality, and the subgraph coverage rate is low.
In summary, the multi-pattern graph indexes in existing multi-pattern matching technology mine the redundant structures between pattern graphs incompletely and index weakly structure-correlated pattern graphs poorly. How to improve multi-pattern graph index construction under weak structural correlation therefore needs further study.
Conventional multi-pattern graph index construction methods have the following problems. The pattern tree algorithm builds the pattern graph index using subgraph isomorphism alone and adapts poorly to weakly correlated pattern graph sets: such a set contains many pairs of pattern graphs that have no isomorphism relationship but share a common subgraph, and the pattern tree algorithm alone cannot establish an association between them, which leaves a non-negligible amount of computation.
To address the shortcomings of the pattern tree, the improved algorithm uses the maximum common subgraphs within the graph set. However, it depends on the clustering quality, the clustering coefficient must be specified, and it scales poorly to arbitrary data sets. In addition, computing the maximum common subgraph for the pattern graphs pairwise is time-consuming, and the common subgraphs are not screened, so unnecessary subgraph computation easily occurs during matching.
Disclosure of Invention
In view of the technical problems in the prior art, the invention aims to provide a multi-pattern graph index construction method and system for weak structure correlation. Weak structure correlation is defined first: in the pattern graph set, only a few isomorphic relationships exist between some subsets of pattern graphs and the remaining subsets. That is, when the number of sub-pattern graphs having an isomorphic relationship in a pattern graph set is less than or equal to a set ratio h (for example, h = 10%) of the total number of pattern graphs, the pattern graph set is defined as weakly structure-correlated. In such a set it is difficult to mine the common structures (substructures). For such pattern graph sets, the invention provides a graph index construction method, the hybrid pattern tree algorithm (HybirdpatternTree). Building on a subgraph isomorphism algorithm, it integrates common subgraph mining, which requires mining more representative subgraphs. A subgraph is representative when it has a certain scale, appears in many pattern graphs, and saves more computation than the additional cost of computing the subgraph itself. Therefore, frequent subgraph mining is used to mine subgraphs that occur frequently in a specific pattern graph set, and suitable subgraphs are selected as auxiliary pattern graphs and added to the pattern graph index.
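The weak structure correlation test defined above can be illustrated with a minimal Python sketch. The function name `is_weakly_correlated` and its parameters are hypothetical, and the sketch assumes that "sub-pattern graphs having an isomorphic relationship" means pattern graphs that take part in at least one subgraph isomorphism relationship; the text does not fix this interpretation.

```python
def is_weakly_correlated(num_patterns, num_related, h=0.10):
    """Return True when a pattern graph set is weakly structure-correlated.

    num_patterns: total number of pattern graphs in the set.
    num_related:  number of pattern graphs taking part in at least one
                  subgraph isomorphism relationship (assumed interpretation).
    h:            the set ratio; 10% is the example value given in the text.
    """
    if num_patterns <= 0:
        raise ValueError("the pattern graph set must not be empty")
    return num_related <= h * num_patterns


# 80 of 1000 pattern graphs are related: below the 10% ratio.
print(is_weakly_correlated(1000, 80))
# 250 of 1000 pattern graphs are related: above the 10% ratio.
print(is_weakly_correlated(1000, 250))
```

Under this reading, the first set would be handled as weakly correlated and the second would not.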
The multi-pattern graph index construction adopts two-stage structural correlation mining. The algorithm first performs subgraph isomorphism computation on the pattern graph set to build an isomorphic tree, then performs frequent subgraph mining on all sub-pattern graphs of the same parent pattern graph in the isomorphic tree, selects representative subgraphs, and builds the multi-pattern graph index. Index construction consists of three main steps: constructing the pattern graph isomorphic tree, mining frequent subgraphs, and calculating the optimal matching path.
The technical scheme of the invention is as follows:
A multi-pattern graph index construction method for weak structure correlation comprises the following steps:
1) reading the pattern graphs in a pattern graph set of a target field and generating a pattern graph ID for each pattern graph; the pattern graph set is a weakly structure-correlated pattern graph set, that is, the number of sub-pattern graphs having isomorphic relationships in the set is less than or equal to the set ratio h of the total number of pattern graphs;
2) constructing a pattern graph isomorphic tree: comparing the pattern graphs pairwise and, if a subgraph isomorphism relationship exists between two pattern graphs, adding a directed edge pointing from the smaller-scale pattern graph to the larger-scale pattern graph, thereby obtaining the pattern graph isomorphic tree of the pattern graph set;
3) performing frequent subgraph mining on the pattern graph isomorphic tree, finding common pattern graphs and adding them to the pattern graph isomorphic tree;
4) when a sub-pattern graph with a plurality of parent pattern graphs exists in the pattern graph isomorphic tree, retaining a unique parent pattern graph for that sub-pattern graph;
5) calculating the minimum spanning tree of the pattern graph isomorphic tree and performing a depth-first traversal on it to obtain the optimal matching order of the pattern graph set.
Further, a double-layer filtering strategy is used when comparing the pattern graphs pairwise: a) the two pattern graphs are compared by the attributes of their nodes and edges; if the smaller-scale pattern graph has an attribute that does not exist in the larger-scale pattern graph, or an attribute occurs more often in the smaller-scale pattern graph than the corresponding attribute occurs in the larger-scale pattern graph, the smaller-scale pattern graph is determined not to be a subgraph of the larger-scale pattern graph; b) the attributes of nodes and edges are combined into triples $(l(v_i), l(e_{ij}), l(v_j))$, where $l(v_i)$ denotes the attribute of node $v_i$, $l(v_j)$ denotes the attribute of node $v_j$, and $l(e_{ij})$ denotes the attribute of the edge $e_{ij}$ with endpoints $v_i$ and $v_j$; if the smaller-scale pattern graph contains a triple that does not exist in the larger-scale pattern graph, or a triple occurs more often in the smaller-scale pattern graph than the corresponding triple occurs in the larger-scale pattern graph, the smaller-scale pattern graph is determined not to be a subgraph of the larger-scale pattern graph; wherein the scale refers to the number of nodes and the number of edges in the pattern graph.
Furthermore, if a sub-pattern graph in the pattern graph isomorphic tree has a plurality of parent pattern graphs, the difference between the structure of the sub-pattern graph and that of each corresponding parent pattern graph in the isomorphic tree is calculated, and the parent pattern graph with the smallest difference is selected as the parent pattern graph of the sub-pattern graph.
Further, the difference is calculated as $\mathrm{Score}(p_i, p_j) = (|V_i| - |V_j|) + (|E_i| - |E_j|)$, where $|V_i|$ and $|V_j|$ denote the numbers of nodes of pattern graphs $p_i$ and $p_j$ respectively, and $|E_i|$ and $|E_j|$ denote their numbers of edges.
Further, step 3) is implemented as follows: frequent subgraph mining is performed on the pattern graph isomorphic tree, and subgraphs whose number of occurrences is smaller than a set support s are filtered out; the computation that can be saved by adding a subgraph to the pattern graph isomorphic tree is then calculated as the weight of that subgraph, the subgraph with the largest weight is selected as a common pattern graph according to the subgraph isomorphism relationships and added to the pattern graph isomorphic tree, and the edge weights are updated.
Further, for a subgraph $p_{sub}$ and the set $PS$ of pattern graphs containing it, the weight of the subgraph is $\mathrm{weight}(p_{sub}) = (|V_{sub}| + |E_{sub}|) \cdot (|PS| - 1)$, where $|V_{sub}|$ denotes the number of nodes of $p_{sub}$ and $|E_{sub}|$ denotes its number of edges.
Further, the minimum spanning tree of the pattern graph isomorphic tree is obtained using the Chu-Liu algorithm.
Further, the set ratio h is 10%.
Further, the target fields include, but are not limited to, the field of network security, the field of social networking, and the field of bioscience.
A multi-pattern graph index system for weak structure correlation, characterized by comprising a pattern graph isomorphic tree construction module, a frequent subgraph mining module and an optimal matching path acquisition module, wherein:
the pattern graph isomorphic tree construction module is used for reading the pattern graphs in a pattern graph set of a target field and generating a pattern graph ID for each pattern graph, and then comparing the pattern graphs pairwise and, if a subgraph isomorphism relationship exists between two pattern graphs, adding a directed edge pointing from the smaller-scale pattern graph to the larger-scale pattern graph to obtain the pattern graph isomorphic tree of the pattern graph set; the pattern graph set is a weakly structure-correlated pattern graph set, that is, the number of sub-pattern graphs having isomorphic relationships in the set is less than or equal to the set ratio h of the total number of pattern graphs;
the frequent subgraph mining module is used for performing frequent subgraph mining on the pattern graph isomorphic tree, finding common pattern graphs and adding them to the pattern graph isomorphic tree;
and the optimal matching path acquisition module is used for calculating the minimum spanning tree of the pattern graph isomorphic tree and performing a depth-first traversal on it to obtain the optimal matching order of the pattern graph set.
Compared with the prior art, the invention has the following positive effects:
1. Common subgraph mining is integrated on the basis of a subgraph isomorphism algorithm, which reduces the computation for subgraphs that have a certain size and appear in many pattern graphs.
2. A two-stage multi-pattern graph index mining technology is provided. The multi-pattern graph index is constructed with subgraph isomorphism and frequent subgraph mining algorithms, which recovers common subgraph relationships that a subgraph isomorphism algorithm cannot mine, particularly in data sets with weak structural correlation.
The invention provides a multi-pattern graph index optimization technology based on frequent subgraph mining. Tests were carried out on data sets with different characteristics; on the AIDS data set the matching time is reduced to 85%-88.68% of that of the pattern tree algorithm. The test results also show that mining frequent subgraphs recovers common subgraph relationships that a subgraph isomorphism algorithm cannot mine, particularly in data sets with weak structural correlation. This demonstrates the positive effect of the multi-pattern graph index construction method for weak structure correlation provided by the invention.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 shows the index construction time for the AIDS pattern graph set;
FIG. 3 shows the index construction space consumption for the AIDS pattern graph set;
FIG. 4 shows the index construction time for the PDBS pattern graph set;
FIG. 5 shows the index construction space consumption for the PDBS pattern graph set;
FIG. 6 shows the index construction time for the Enron pattern graph set;
FIG. 7 shows the index construction space consumption for the Enron pattern graph set;
FIG. 8 shows the performance test of the hybrid pattern tree on the AIDS data set;
FIG. 9 shows the performance test of the hybrid pattern tree on the PDBS data set.
Detailed Description
In order to make the technical solutions in the embodiments of the present invention better understood and make the objects, features, and advantages of the present invention more comprehensible, the technical core of the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The method flow of the present invention is shown in FIG. 1. Following the specific steps in the Disclosure of Invention, this section describes the three steps of multi-pattern graph index construction:
(1) Constructing the pattern graph isomorphic tree. The specific steps are as follows:
After the pattern graphs are input, each is assigned a pattern graph ID according to the reading order. When the pattern graph isomorphic tree is constructed, the pattern graphs are compared pairwise; if a subgraph isomorphism relationship exists between two pattern graphs, a directed edge is added pointing from the smaller-scale pattern graph to the larger-scale pattern graph, where scale refers to the number of nodes and the number of edges in the pattern graph. To reduce the cost of subgraph isomorphism computation, a double-layer filtering strategy is designed that keeps the filtering time within a certain range while effectively reducing futile matching attempts. The double-layer filtering strategy works as follows:
independent attributes of nodes and edges: comparing the two pattern graphs according to the attributes of nodes and edges in the pattern graphs, wherein if the attribute which does not exist in the pattern graph with a larger scale exists in the pattern graph with a smaller scale or the frequency of the appearance of a certain attribute exceeds the frequency of the appearance of the attribute corresponding to the pattern graph with a larger scale, the pattern graph with a smaller scale is not a subgraph of the pattern graph with a larger scale; for example, when there is a structure attribute, referred to herein as attribute a, in the smaller-scale pattern diagram, the attribute a appears n times, i.e., the frequency is n. When finding the subgraph isomorphism, judging whether the pattern graph is a subgraph of another pattern graph with larger scale, if so, the pattern graph with smaller scale is a sub-pattern graph, and the pattern graph with larger scale is a parent pattern graph, and the sub-pattern graph inherits part of the structure attribute of the parent pattern graph. Thus, if a child schema has attribute A, the parent schema must have attribute A. Meanwhile, the frequency of the attribute A of the sub-pattern diagram is n, and the frequency of the father pattern diagram is larger than or equal to n.
Triple of node and edge: combining the attributes of the nodes and the edges to construct a triple, which is in the form of: (l (v)i),l(eij),l(vj) Wherein l (v)i) Representing a node viProperty of l (v)j) Representing a node vjProperty of l (e)ij) Is expressed as viAnd vjEdge e as an end pointijThe attribute of (2). Similarly, if there is a triplet of a triplet that does not exist in the pattern diagram with a larger scale in the pattern diagram with a smaller scale, which is not necessarily a subgraph of the pattern diagram with a larger scale, the filtering may also be performed according to the occurrence frequency of the triplet (that is, if there is a triplet in the pattern diagram with a smaller scale and the occurrence frequency of the triplet exceeds the occurrence frequency of the triplet corresponding to the pattern diagram with a larger scale, it is determined that the pattern diagram with a smaller scale is not a subgraph of the pattern diagram with a larger scale).
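The double-layer filter can be sketched in Python as follows. The graph representation (a pair of a node-label dictionary and an edge-label dictionary keyed by undirected endpoint pairs) and all function names are assumptions made for illustration, and labels are assumed to be sortable values such as strings. The filter is only a necessary condition: pairs that pass both layers still need a full subgraph isomorphism check.

```python
from collections import Counter


def label_counts(graph):
    """Layer 1: multiset of node attributes and of edge attributes."""
    nodes, edges = graph
    return (Counter(("node", lab) for lab in nodes.values())
            + Counter(("edge", lab) for lab in edges.values()))


def triple_counts(graph):
    """Layer 2: multiset of (l(v_i), l(e_ij), l(v_j)) triples."""
    nodes, edges = graph
    counts = Counter()
    for (u, v), edge_label in edges.items():
        # Sort the endpoint labels so that an undirected edge yields the
        # same triple regardless of the order in which it was stored.
        a, b = sorted((nodes[u], nodes[v]))
        counts[(a, edge_label, b)] += 1
    return counts


def fits(small_counts, large_counts):
    """True if every item of the small multiset occurs at least as often in the large one."""
    return all(large_counts[key] >= n for key, n in small_counts.items())


def may_be_subgraph(small, large):
    """Apply both filter layers; False means 'small' cannot be a subgraph of 'large'."""
    return (fits(label_counts(small), label_counts(large))
            and fits(triple_counts(small), triple_counts(large)))


# Hypothetical example: layer 1 passes (labels C, O and 'single' all occur in
# the larger graph), but the triple (C, single, O) does not, so layer 2 rejects.
small = ({0: "C", 1: "O"}, {(0, 1): "single"})
large = ({0: "C", 1: "O", 2: "N"}, {(0, 1): "double", (1, 2): "single"})
print(fits(label_counts(small), label_counts(large)))
print(may_be_subgraph(small, large))
```

The example is chosen so that the first layer alone would let the pair through while the triple layer filters it out, which is the case the second layer exists for.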
When the smaller pattern graph a is determined not to be a subgraph of the larger pattern graph b, the check continues with the other smaller pattern graphs; any that is a subgraph of b is added to the isomorphic tree, and otherwise the search continues with the remaining pattern graphs.
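The pairwise construction of directed edges can be sketched as follows. The text does not name a particular subgraph isomorphism algorithm, so this sketch substitutes a brute-force (non-induced) test that is only practical for very small graphs; the graph representation, the function names, and the use of "nodes plus edges" as a single scale value are all assumptions.

```python
from itertools import permutations

_MISSING = object()  # sentinel for "no such edge in the larger graph"


def is_subgraph(small, large):
    """Brute-force subgraph isomorphism test for tiny labeled undirected graphs."""
    s_nodes, s_edges = small
    l_nodes, l_edges = large
    # Look up edge labels of the larger graph independently of endpoint order.
    l_lookup = {frozenset(pair): lab for pair, lab in l_edges.items()}
    s_ids = list(s_nodes)
    for image in permutations(l_nodes, len(s_ids)):
        mapping = dict(zip(s_ids, image))
        if any(s_nodes[u] != l_nodes[mapping[u]] for u in s_ids):
            continue  # node attributes differ
        if all(l_lookup.get(frozenset((mapping[u], mapping[v])), _MISSING) == lab
               for (u, v), lab in s_edges.items()):
            return True  # every edge of the small graph is preserved
    return False


def scale(graph):
    """Scale of a pattern graph: number of nodes plus number of edges (assumed)."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def build_isomorphism_edges(patterns):
    """Return directed edges (small_id, large_id) for a dict of id -> graph."""
    ids = sorted(patterns, key=lambda pid: scale(patterns[pid]))
    edges = []
    for pos, small_id in enumerate(ids):
        for large_id in ids[pos + 1:]:
            if is_subgraph(patterns[small_id], patterns[large_id]):
                edges.append((small_id, large_id))
    return edges


# Hypothetical pattern graph set; IDs follow the reading order.
patterns = {
    0: ({0: "C", 1: "O"}, {(0, 1): "single"}),
    1: ({0: "C", 1: "O", 2: "N"}, {(0, 1): "single", (1, 2): "single"}),
    2: ({0: "N", 1: "N"}, {(0, 1): "single"}),
}
print(build_isomorphism_edges(patterns))
```

In a realistic setting the cheap attribute and triple filters would run before `is_subgraph`, and the brute-force test would be replaced by a proper subgraph isomorphism algorithm.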
After the pattern graph isomorphic tree of the data set is obtained, a pattern graph may have several parent pattern graphs. So that one parent pattern graph can be chosen quickly for the sub-pattern graph when the optimal matching path is selected, the difference between the structures of the parent pattern graph and the sub-pattern graph in the isomorphic tree is calculated. During matching, the smaller the structural difference between a sub-pattern graph and its parent pattern graph, the less computation remains for the sub-pattern graph, so the parent pattern graph with the smallest difference is selected as the parent pattern graph of the sub-pattern graph. The difference Score between the structures of the parent and sub-pattern graphs is calculated as:
$$\mathrm{Score}(p_i, p_j) = (|V_i| - |V_j|) + (|E_i| - |E_j|) \qquad (1)$$
where $|V_i|$ and $|V_j|$ denote the numbers of nodes of pattern graphs $p_i$ and $p_j$ respectively, $|E_i|$ and $|E_j|$ denote their numbers of edges, and $\mathrm{Score}(p_i, p_j)$ represents the cost of performing the matching computation for sub-pattern graph $p_i$ on the basis of parent pattern graph $p_j$.
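Formula (1) and the selection of the closest parent pattern graph can be sketched as follows. The sketch takes the formula literally, so the score is non-negative when $p_i$ has at least as many nodes and edges as $p_j$; the function names and the representation of a pattern graph as a (node list, edge list) pair are assumptions.

```python
def score(p_i, p_j):
    """Score(p_i, p_j) = (|V_i| - |V_j|) + (|E_i| - |E_j|), as in formula (1)."""
    (nodes_i, edges_i), (nodes_j, edges_j) = p_i, p_j
    return (len(nodes_i) - len(nodes_j)) + (len(edges_i) - len(edges_j))


def choose_parent(p_i, candidates):
    """Return the ID of the candidate pattern graph with the smallest Score."""
    return min(candidates, key=lambda pid: score(p_i, candidates[pid]))


# Hypothetical example: p_i has 5 nodes and 5 edges and two candidate parents.
p_i = (list(range(5)), [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
candidates = {
    "a": (list(range(3)), [(0, 1), (1, 2)]),                  # Score = 2 + 3 = 5
    "b": (list(range(4)), [(0, 1), (1, 2), (2, 3), (3, 0)]),  # Score = 1 + 1 = 2
}
print(choose_parent(p_i, candidates))
```

Candidate "b" leaves less residual structure to match, so it is the one kept under this rule.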
(2) Frequent subgraph mining. The isomorphic tree contains sub-pattern graphs that have no isomorphic relationship with each other but share a common pattern graph. To mine these relationships, frequent subgraph mining is needed to find the common pattern graphs. The aim of frequent subgraph mining is to find the subgraphs in the set whose number of occurrences is greater than or equal to a support threshold. The specific steps are as follows:
Frequent subgraph mining requires a suitable support to be set in advance; here the subgraph support s is set to 2 (during mining, subgraphs that occur fewer than 2 times are filtered out). Adding a new subgraph to the index (i.e. the pattern graph isomorphic tree) should reduce the overall computation, and the more it saves the better. The computation a subgraph can save is taken as its weight; according to the subgraph isomorphism relationships, the subgraph with the largest weight is selected as a common pattern graph and added to the isomorphic tree, and the edge weights are updated at the same time. The size of a subgraph can be constrained by a minimum number of nodes or a minimum number of edges. The computation a subgraph saves has to be analyzed from the mining result; it is recorded as the subgraph weight and is determined by the average computation of the pattern graphs involved (the subgraph and the pattern graphs containing it). Given a subgraph $p_{sub}$ and the pattern graph set $PS = \{p_i \mid i = 0, 1, \ldots, n\}$ containing it, $\mathrm{weight}(p_{sub})$ is:
$$\mathrm{weight}(p_{sub}) = (|V_{sub}| + |E_{sub}|) \cdot (|PS| - 1) \qquad (2)$$
where $|V_{sub}|$ denotes the number of nodes of subgraph $p_{sub}$ and $|E_{sub}|$ denotes its number of edges.
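Formula (2), combined with the support filter s = 2, can be illustrated with a short Python sketch. The function names and the candidate tuples are hypothetical; only the formula and the support value come from the text.

```python
def subgraph_weight(num_nodes, num_edges, num_containing):
    """weight(p_sub) = (|V_sub| + |E_sub|) * (|PS| - 1), as in formula (2)."""
    return (num_nodes + num_edges) * (num_containing - 1)


def weigh_candidates(candidates, s=2):
    """Drop candidates occurring in fewer than s pattern graphs and weigh the rest.

    candidates: dict name -> (|V_sub|, |E_sub|, |PS|).
    """
    return {
        name: subgraph_weight(v, e, n)
        for name, (v, e, n) in candidates.items()
        if n >= s
    }


# Hypothetical mining result.
candidates = {
    "a": (4, 3, 5),  # (4 + 3) * (5 - 1) = 28
    "b": (6, 6, 2),  # (6 + 6) * (2 - 1) = 12
    "c": (8, 9, 1),  # occurs only once, below the support s = 2
}
print(weigh_candidates(candidates))
```

The factor |PS| - 1 reflects that the subgraph itself must be matched once, so a subgraph contained in a single pattern graph would save nothing.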
When several common pattern graphs satisfying the size limit are mined, they are retained selectively. Subgraphs (i.e. common pattern graphs) are selected in a loop: each iteration selects only the subgraph with the largest weight, then removes the computation already covered by the selected subgraph from the remaining subgraphs and updates their weights as the new comparison values, and so on until all pattern graphs are covered or no subgraphs remain.
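One possible reading of this selection loop is sketched below. The text does not specify how "the computation covered by the selected subgraph" is removed, so the sketch assumes that the pattern graphs covered by a selected subgraph are dropped from the containing sets of the remaining candidates before their weights are recomputed; the function name and data layout are hypothetical.

```python
def greedy_select(candidates, all_patterns):
    """Greedily pick common subgraphs by weight.

    candidates:   dict name -> (size, set of IDs of pattern graphs containing it),
                  where size stands for |V_sub| + |E_sub|.
    all_patterns: iterable of all pattern graph IDs.
    """
    remaining = {name: (size, set(ps)) for name, (size, ps) in candidates.items()}
    uncovered = set(all_patterns)
    chosen = []
    while remaining and uncovered:
        # Recompute weights from the pattern graphs that are still uncovered.
        weights = {name: size * (len(ps) - 1) for name, (size, ps) in remaining.items()}
        best = max(weights, key=weights.get)
        if weights[best] <= 0:
            break  # no remaining candidate saves any computation
        chosen.append(best)
        covered = remaining.pop(best)[1]
        uncovered -= covered
        # Assumed update rule: covered pattern graphs no longer count for the others.
        remaining = {name: (size, ps - covered) for name, (size, ps) in remaining.items()}
    return chosen


# Hypothetical candidates over pattern graphs 1..6.
candidates = {
    "a": (7, {1, 2, 3, 4}),   # initial weight 7 * 3 = 21
    "b": (10, {3, 4, 5}),     # initial weight 10 * 2 = 20, drops to 0 after "a"
    "c": (5, {5, 6}),         # weight 5 * 1 = 5
}
print(greedy_select(candidates, range(1, 7)))
```

With these values "a" is picked first, "b" loses its overlap with "a" and becomes worthless, and "c" covers the rest.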
Before frequent subgraph mining, the pattern graph set to be mined must be determined. Hierarchical frequent subgraph mining is used: following the isomorphic tree structure, mining starts from the nodes of the last layer, the results are merged into the layer above, and mining then continues on that layer. That is, starting from the last layer of non-leaf nodes, frequent subgraph mining is performed on the sub-pattern graphs of the isomorphic tree with support s = 2, and the number of nodes and edges of a mined subgraph must not be lower than the number of nodes and edges of the current pattern graph.
(3) Obtaining the optimal matching path. The specific steps are as follows:
When a sub-pattern graph is found to have several parent pattern graphs, a unique parent pattern graph is retained for it. Selecting the optimal matching path can be converted into a minimum spanning tree problem. Following the idea of the Chu-Liu algorithm as used in "Multi-query optimization for subgraph isomorphism search", published at VLDB 2016 (Ren X, Wang J. Multi-query optimization for subgraph isomorphism search [J]. Proceedings of the International Conference on Very Large Data Bases, 2016, 10(3): 121-132), the incoming edge with the minimum weight is retained for each sub-pattern graph to obtain a minimum spanning tree, and a depth-first traversal of that tree yields the optimal matching order of the pattern graph set.
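The step above can be sketched in Python under a simplifying assumption: because every edge points from a smaller-scale pattern graph to a larger-scale one, the edge set is taken to be acyclic, so keeping the minimum-weight incoming edge per node already yields a tree and the cycle-contraction part of the full Chu-Liu algorithm is omitted. The function name, the virtual root label `"ROOT"` and the edge weights are hypothetical.

```python
def matching_order(nodes, edges, root="ROOT"):
    """Keep the cheapest incoming edge per node, then traverse depth-first.

    nodes: list of pattern graph IDs.
    edges: list of (source_id, target_id, weight) tuples, assumed acyclic.
    """
    best_in = {}
    for src, dst, w in edges:
        if dst not in best_in or w < best_in[dst][1]:
            best_in[dst] = (src, w)

    # Nodes without any incoming edge hang off the virtual root.
    children = {root: []}
    for n in nodes:
        children.setdefault(n, [])
    for n in nodes:
        parent = best_in[n][0] if n in best_in else root
        children[parent].append(n)

    # Iterative depth-first traversal; the virtual root itself is not reported.
    order, stack = [], [root]
    while stack:
        current = stack.pop()
        if current != root:
            order.append(current)
        stack.extend(reversed(children[current]))
    return order


# Hypothetical index: p3 can extend p1 (cost 4) or p2 (cost 2); p4 extends p1.
nodes = ["p1", "p2", "p3", "p4"]
edges = [("p1", "p3", 4), ("p2", "p3", 2), ("p1", "p4", 3)]
print(matching_order(nodes, edges))
```

In the resulting order every pattern graph appears after the pattern graph it extends, so its match can start from the already computed partial result.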
Based on the above design, this section presents the beneficial effects of the multi-pattern graph index construction method of the invention and of its matching efficiency. The test data are three real graph datasets (AIDS, PDBS, Enron) from the biological sciences and social networks. The experimental environment is a 64-bit Windows system with an Intel 3 GHz CPU and 4 GB of RAM, and all algorithms are implemented in Python 2.7.
Table 1. Experimental datasets used to validate the algorithm
Dataset | AIDS | PDBS | Enron
Number of data graphs | 10,000 | 600 | 4,959
Average number of nodes | 25.4 | 2,939 | 18.2
Average number of edges | 27.4 | 3,063.7 | 17.9
Average node degree | 1.95 | 2.08 | 2.95
Maximum node degree | 11 | 7 | 913
Number of node attribute types | 51 | 10 | 13,758
Number of edge attribute types | 4 | 0 | 0
To make the tests closer to real conditions, each pattern graph set in the experiments consists of pattern graphs of different sizes. First, six groups of pattern graphs are generated, one for each edge count from 4 to 24 (4, 8, 12, 16, 20 and 24), with 1000 pattern graphs in each group. The pattern graph sets required by the tests are then formed by drawing pattern graphs from these groups of different sizes. The pattern graph set sizes used in the tests are 600, 1200, 1800, 2400 and 3000.
The experiments have two purposes: (1) to measure the time and space cost of the mixed-mode tree algorithm provided by the invention, in order to determine the additional time and space consumed once the frequent subgraph mining algorithm is added; and (2) to compare the matching speed of the mixed-mode tree algorithm with that of existing algorithms, testing the effectiveness of the algorithm by varying the size of the pattern graph set. The two experiments are set up as follows:
(1) Time and space consumption of the mixed-mode tree algorithm for optimized multi-pattern graph index construction: on the three datasets, the time and space consumed by index construction are measured, and the time spent on subgraph isomorphism and on frequent subgraph mining is recorded separately, for comparison with the time and space consumption of the original pattern tree algorithm. To better show the trend in the results, matching tests with pattern graph sets of five different sizes are performed on each dataset; the pattern graph set sizes are 600, 1200, 1800, 2400 and 3000.
(2) Matching efficiency of the mixed-mode tree algorithm: the purpose of optimizing index construction is to reduce the total matching time. In this experiment the pattern tree algorithm and the mixed-mode tree algorithm are each plugged into the classical Turbo algorithm for index construction and matching, and the total time from input of the test dataset to output of the results is compared, so as to better show the performance of the algorithm. The result is also compared with the total matching time of the Turbo algorithm alone. In addition, the size of the pattern graph set is varied in this experiment, and multiple runs are performed on each dataset.
The experimental results are as follows:
(1) Performance test of the multi-pattern graph index construction optimization algorithm
Figs. 2 to 7 show the index construction performance test results on the three real datasets, presented as stacked bar charts. The horizontal axis is the number of pattern graphs, which increases from 600 to 3000, and the vertical axis is the total index construction time. Blue represents the computation time of subgraph isomorphism and red represents the time of frequent subgraph mining; that is, blue shows the index construction cost of the pattern tree algorithm, and red shows the additional cost of the mixed-mode tree algorithm compared with the original algorithm. The dataset size is kept unchanged in these tests and equals the size of the full dataset.
The multi-pattern graph index construction method for weak structure correlation provided by the invention is superior to the existing graph index construction methods in the following respect. Of the three datasets used in the experiments, the pattern graphs of the AIDS and PDBS datasets contain a large number of repeated structures, whereas the structural correlation among the pattern graphs of the Enron dataset is weak.
Figs. 2 and 3 show the time and space consumption on the AIDS dataset; Fig. 3 is presented in the same way as Fig. 2. The overall trends of the two figures are consistent. As the data scale grows, the time spent computing subgraph isomorphism grows as well, because the algorithm must compare the pattern graphs pairwise, so the cost is positively correlated with the size of the pattern graph set. The additional time and space consumed by the mixed-mode tree algorithm are small relative to the pattern tree algorithm: the added time is less than 2% of the original time, and the added space is less than 1% of the original space. This is because the pattern graph sets generated from the AIDS dataset have strong structural correlation: the pattern graphs have similar structures and the sizes of the structures in each layer do not differ much, so the additional time and space consumption does not rise significantly as the pattern graph set grows.
The experimental results on the PDBS dataset show essentially the same trends and behavior as those on the AIDS dataset. The only difference is that index construction on PDBS takes more time, because the structure of the PDBS dataset is more complex than that of the AIDS dataset and the computation is therefore more expensive.
Figs. 6 and 7 show the test results on the Enron dataset. Unlike on the first two datasets, both the time and the space consumption of index construction on Enron are clearly greater than those of the original pattern tree algorithm: the mixed-mode tree algorithm adds 1.25 to 3.39 times the time and 1.31 to 5.83 times the space. The fundamental reason is that the datasets have different characteristics. The Enron dataset has many node attributes and edge attributes, and the pattern graph sets generated from it have weaker structural correlation. After the subgraph isomorphism relations have been computed, more pattern graphs therefore need to be mined, and the time consumption of the frequent subgraph mining algorithm gSpan is positively correlated with the number of pattern graphs to be mined. As a result, Fig. 6 shows that frequent subgraph mining takes longer than the subgraph isomorphism computation, and because the number of qualifying subgraphs finally obtained is large, the space consumption also increases more.
(2) Multi-pattern matching efficiency test
Figs. 8 and 9 show the multi-pattern graph matching performance of the three algorithms on two different datasets. The blue line is the total matching time of the classical Turbo algorithm, the red line is the total time of the Turbo algorithm combined with the pattern tree, and the green line is the total time of the Turbo algorithm accelerated by the mixed-mode tree algorithm.
The multi-pattern graph matching efficiency obtained with the multi-pattern graph index construction method for weak structure correlation provided by the invention is superior to that obtained with the conventional multi-pattern graph index construction method. Fig. 8 compares the total matching time on the AIDS dataset. Compared with Turbo_PatternTree, the advantage of Turbo_HybridPatternTree is modest: when the pattern graph set size is 1200 to 3000, the total time is reduced to 85%-88.68% of that of Turbo_PatternTree, and when the pattern graph set size is 600, the total time is slightly higher. This is consistent with the index construction analysis on the AIDS dataset: on a dataset with strong structural correlation, few useful auxiliary subgraphs are mined, so the overall room for improvement is small.
Fig. 9 shows the performance test results of the mixed-mode tree algorithm on the PDBS dataset. Here the gain of Turbo_HybridPatternTree is smaller than on the AIDS dataset: the total matching time is reduced only to 94.7%-99.9%. This shows even more clearly that, on a dataset with stronger structural correlation, there is less room for optimization through frequent subgraph mining.
The above embodiments merely illustrate implementations of the present invention. Although they are described in specific detail, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the invention. The scope of protection of this patent is therefore defined by the appended claims.

Claims (9)

1. A multi-pattern graph index construction method oriented to weak structure correlation, comprising the following steps:
1) reading the pattern graphs in a pattern graph set of a target field and generating a pattern graph identifier (ID) for each pattern graph; the pattern graph set is a pattern graph set with weak structure correlation, that is, the number of sub-pattern graphs having isomorphic relations in the pattern graph set is lower than or equal to a set proportion h of the total number of pattern graphs; the target field is the network security field, in which, for network attack detection, the relations among computers, IP addresses, users and software services in a network are represented as a graph structure, wherein the computers and IP addresses are abstracted as nodes of the graph structure and the relations between the nodes are represented as edges;
2) constructing a pattern graph isomorphic tree: comparing the pattern graphs pairwise and, if a subgraph isomorphism relation exists between two pattern graphs, adding a directed edge pointing from the smaller-scale pattern graph to the larger-scale pattern graph of the two, to obtain the pattern graph isomorphic tree of the pattern graph set; wherein the scale refers to the number of nodes and the number of edges in a pattern graph;
3) performing frequent subgraph mining on the pattern graph isomorphic tree, finding common pattern graphs and adding them to the pattern graph isomorphic tree;
4) when a sub-pattern graph with several parent pattern graphs exists in the pattern graph isomorphic tree, retaining a unique parent pattern graph for the sub-pattern graph;
5) calculating the minimum spanning tree of the pattern graph isomorphic tree, and performing a depth-first traversal of the minimum spanning tree to obtain the optimal matching order of the pattern graph set.
2. The method of claim 1, wherein the pattern graphs are compared pairwise using a two-level filtering strategy, comprising: a) comparing the two pattern graphs according to the attributes of the nodes and edges in the pattern graphs, and if an attribute that does not exist in the larger-scale pattern graph exists in the smaller-scale pattern graph, or the number of occurrences of an attribute in the smaller-scale pattern graph exceeds the number of occurrences of the corresponding attribute in the larger-scale pattern graph, determining that the smaller-scale pattern graph is not a subgraph of the larger-scale pattern graph; b) combining the attributes of the nodes and the edges to construct triples $(l(v_i), l(e_{ij}), l(v_j))$, wherein $l(v_i)$ represents the attribute of node $v_i$, $l(v_j)$ represents the attribute of node $v_j$, and $l(e_{ij})$ represents the attribute of the edge $e_{ij}$ whose endpoints are $v_i$ and $v_j$; and if a triple that does not exist in the larger-scale pattern graph exists in the smaller-scale pattern graph, or the number of occurrences of a triple in the smaller-scale pattern graph exceeds the number of occurrences of the corresponding triple in the larger-scale pattern graph, determining that the smaller-scale pattern graph is not a subgraph of the larger-scale pattern graph.
3. The method of claim 1, wherein in the pattern graph isomorphism tree, if a sub-pattern graph has multiple parent pattern graphs, the difference between the sub-pattern graph and the corresponding parent pattern graph structure in the pattern graph isomorphism tree is calculated, and the parent pattern graph with the smallest difference with the sub-pattern graph is selected as the parent pattern graph of the sub-pattern graph.
4. The method of claim 3, wherein the difference is calculated as $\mathrm{Score}(p_i, p_j) = (|V_i| - |V_j|) + (|E_i| - |E_j|)$; wherein $|V_i|$ and $|V_j|$ respectively represent the numbers of nodes of pattern graphs $p_i$ and $p_j$, and $|E_i|$ and $|E_j|$ respectively represent the numbers of edges of pattern graphs $p_i$ and $p_j$.
5. The method as claimed in claim 1, wherein the step 3) is implemented by: carrying out frequent sub-graph mining on the mode graph isomorphic tree and filtering sub-graphs of which the sub-graph occurrence times are smaller than a set support degree s; and then calculating the calculation amount which can be saved by newly adding a subgraph in the mode graph isomorphic tree as the weight of the subgraph, selecting the subgraph with the maximum weight as a common mode graph according to the subgraph isomorphic relation, adding the subgraph into the mode graph isomorphic tree, and updating the weight of the edge.
6. The method of claim 5, wherein, for a subgraph $p_{sub}$ and the set $PS$ of pattern graphs containing the subgraph $p_{sub}$, the weight of the subgraph is $\mathrm{Weight}(p_{sub}) = (|V_{sub}| + |E_{sub}|) \times (|PS| - 1)$; wherein $|V_{sub}|$ represents the number of nodes of subgraph $p_{sub}$ and $|E_{sub}|$ represents the number of edges of subgraph $p_{sub}$.
7. The method of claim 1 or 5, wherein the minimum spanning tree of the pattern graph isomorphic tree is calculated using the Chu-Liu algorithm.
8. The method of claim 1, wherein the set ratio h is 10%.
9. A multi-pattern graph index system oriented to weak structure correlation, characterized by comprising a pattern graph isomorphic tree construction module, a frequent subgraph mining module and an optimal matching path acquisition module; wherein:
the pattern graph isomorphic tree construction module is used for reading the pattern graphs in a pattern graph set of a target field and generating a pattern graph identifier (ID) for each pattern graph; and then for comparing the pattern graphs pairwise and, if a subgraph isomorphism relation exists between two pattern graphs, adding a directed edge pointing from the smaller-scale pattern graph to the larger-scale pattern graph of the two, to obtain the pattern graph isomorphic tree of the pattern graph set; the pattern graph set is a pattern graph set with weak structure correlation, that is, the number of sub-pattern graphs having isomorphic relations in the pattern graph set is lower than or equal to a set proportion h of the total number of pattern graphs; the target field is the network security field, in which, for network attack detection, the relations among computers, IP addresses, users and software services in a network are represented as a graph structure, wherein the computers and IP addresses are abstracted as nodes of the graph structure and the relations between the nodes are represented as edges; wherein the scale refers to the number of nodes and the number of edges in a pattern graph;
the frequent subgraph mining module is used for performing frequent subgraph mining on the pattern graph isomorphic tree, finding common pattern graphs and adding them to the pattern graph isomorphic tree;
and the optimal matching path acquisition module is used for calculating the minimum spanning tree of the pattern graph isomorphic tree and performing a depth-first traversal of the minimum spanning tree to obtain the optimal matching order of the pattern graph set.
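The two-level filtering strategy of claim 2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: a graph is assumed to be a pair of a node-to-attribute dictionary and a list of undirected `(u, v, edge_attribute)` edges, attributes are assumed to be strings, and all function names are hypothetical.

```python
from collections import Counter


def label_counts(graph):
    """Level 1 input: occurrence counts of node and edge attributes."""
    nodes, edges = graph
    node_labels = Counter(nodes.values())
    edge_labels = Counter(label for _, _, label in edges)
    return node_labels, edge_labels


def triple_counts(graph):
    """Level 2 input: occurrence counts of (l(v_i), l(e_ij), l(v_j)) triples."""
    nodes, edges = graph
    triples = Counter()
    for u, v, label in edges:
        # Assumption: edges are undirected, so endpoint attributes are sorted
        # to make (a, x, b) and (b, x, a) count as the same triple.
        a, b = sorted((nodes[u], nodes[v]))
        triples[(a, label, b)] += 1
    return triples


def contained(small, large):
    """True if every key of `small` occurs in `large` at least as often."""
    return all(large[key] >= count for key, count in small.items())


def may_be_subgraph(small, large):
    """Return False when either filter rules out subgraph isomorphism."""
    s_nodes, s_edges = label_counts(small)
    l_nodes, l_edges = label_counts(large)
    if not (contained(s_nodes, l_nodes) and contained(s_edges, l_edges)):
        return False  # level 1: attribute missing or too frequent
    return contained(triple_counts(small), triple_counts(large))  # level 2


if __name__ == "__main__":
    large = ({1: "C", 2: "O", 3: "N"}, [(1, 2, "single"), (2, 3, "double")])
    small = ({1: "C", 2: "O"}, [(1, 2, "single")])
    print(may_be_subgraph(small, large))
```

Both levels are necessary conditions only. A pair that passes the filter still has to be checked by a full subgraph isomorphism test; the filter merely discards pairs cheaply before that test is run.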

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811466997.4A CN109783696B (en) 2018-12-03 2018-12-03 Multi-pattern graph index construction method and system for weak structure correlation

Publications (2)

Publication Number Publication Date
CN109783696A CN109783696A (en) 2019-05-21
CN109783696B CN109783696B (en) 2021-06-04

Family

ID=66496565

Country Status (1)

Country Link
CN (1) CN109783696B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929029A (en) * 2019-11-04 2020-03-27 中国科学院信息工程研究所 Text classification method and system based on graph convolution neural network
CN111737538B (en) * 2020-06-11 2023-12-26 浙江邦盛科技股份有限公司 Graph mode reverse real-time matching method based on event driving
CN112182497B (en) * 2020-09-25 2021-04-27 齐鲁工业大学 Biological sequence-based negative sequence pattern similarity analysis method, realization system and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218397A (en) * 2013-03-12 2013-07-24 浙江大学 Privacy protecting method for social network based on undirected graph modification
CN107885797A (en) * 2017-10-27 2018-04-06 中国科学院信息工程研究所 A kind of multi-mode figure matching process based on structural dependence

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
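A Parallel Algorithm for Frequent Subgraph Mining; B. Vo et al.; Advanced Computational Methods for Knowledge Engineering; 20150623; pp. 163-172 *
Graph Indexing; Xifeng Yan et al.; Managing and Mining Graph Data; 20100118; pp. 161-180 *
图数据分析系统计算模型综述 [Survey of computing models for graph data analysis systems]; Liu Mengya et al.; Application Research of Computers (计算机应用研究); 20171130; Vol. 34, No. 11; pp. 3204-3208 *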

Also Published As

Publication number Publication date
CN109783696A (en) 2019-05-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant