CN114490799A

CN114490799A - Method and device for mining frequent subgraphs of single graph

Info

Publication number: CN114490799A
Application number: CN202011254159.8A
Authority: CN
Inventors: 田群; 戴永恒; 李荣华; 李艳斌; 潘敏佳; 刘学谦
Original assignee: Diankeyun Beijing Technology Co ltd
Current assignee: Diankeyun Beijing Technology Co ltd
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2022-05-13

Abstract

The invention provides a method and a device for mining frequent subgraphs of a single graph. The method comprises: generating a canonical adjacency matrix of the single graph according to the dictionary ordering result of its node labels, and numbering the graph nodes in sequence; generating an initial sub-optimal canonical adjacency matrix tree from the canonical adjacency matrix, wherein each leaf node comprises a first number of edges and the CSP search space of a leaf node is the dictionary ordering sequence combination of the numbers of the graph nodes corresponding to the node labels it contains; performing an FFSM-Join operation or an FFSM-Extension operation on the leaf nodes according to the canonical adjacency matrix, so that subgraph growth yields child nodes extended by one edge; taking the child nodes as candidate subgraphs and constructing their CSP search spaces according to the subgraph growth mode; marking a candidate subgraph as an invalid subgraph if the effective number of combinations in its search space is smaller than the set support threshold; and outputting the frequent subgraphs once subgraph growth is finished. This scheme can improve the efficiency of frequent subgraph mining.

Description

Method and device for mining frequent subgraphs of single graph

Technical Field

The invention relates to the technical field of data mining, in particular to a frequent subgraph mining method and device for a single graph.

Background

With the rapid development of big data technology, representing data as graph structures is increasingly applied to massive data sets. Traditional big data analysis usually relies on relatively general-purpose analysis engines built on SQL or SQL-like tabular tools, whereas massive graph data, because of the complexity and particularity of how relationships are stored, can only be handled by a dedicated computing and analysis engine.

A graph is a high-level abstraction of structure. Frequent subgraph mining is one of the key graph mining technologies and is widely applied in fields such as social networks, information mining, bioengineering, communication network optimization, text mining and knowledge reasoning, for example in protein structure analysis, link prediction, sensitive group recognition and image classification. The results of frequent subgraph mining can also serve as the basis for data classification, clustering, retrieval, matching and similarity analysis.

Traditional frequent subgraph mining algorithms have high complexity and are mostly single-machine serial algorithms. They fall mainly into two categories, Apriori-based (association analysis) and FFSM-based (Fast Frequent Subgraph Mining), represented by the AGM (Apriori-based Graph Mining) and gSpan (graph-based Substructure pattern mining) algorithms respectively. FFSM-based algorithms often outperform Apriori-based algorithms, but they also occupy more memory.

In recent years, frequent subgraph mining algorithms built on MapReduce and BSP (Bulk Synchronous Parallel) have appeared in the field of massive graph data mining. These technologies rely on general-purpose big data storage and computing platforms such as Hadoop, Spark and Flink, and achieve frequent subgraph mining on graphs with more than a billion edges on commercial clusters.

However, the existing engines that run frequent subgraph mining algorithms on massive data are not dedicated engines optimized for graph computation, which greatly limits the mining efficiency of these algorithms.

Disclosure of Invention

In view of this, the present invention provides a method and an apparatus for frequent subgraph mining of a single graph to improve the efficiency of frequent subgraph mining.

In order to achieve the purpose, the invention adopts the following scheme:

according to an aspect of the embodiment of the invention, a method for mining frequent subgraphs of a single graph is provided, which includes:

generating a canonical adjacency matrix of the single graph according to a dictionary ordering result of the node labels of the single graph, and numbering each graph node in the canonical adjacency matrix of the single graph according to the sequence of rows or columns;

generating an initial sub-optimal canonical adjacency matrix tree by searching the canonical adjacency matrix of the single graph in the order of the numbering of the graph nodes, wherein each leaf node of the initial sub-optimal canonical adjacency matrix tree includes a first number of edges; the first number is an integer not less than one; the root node of the initial sub-optimal canonical adjacency matrix tree does not contain the graph nodes and edges of the single graph, the child nodes of the root node are frequent nodes, and the child nodes of the frequent nodes are frequent edges; the CSP search space of a leaf node is the dictionary ordering sequence combination of the numbers of the graph nodes corresponding to the node labels contained in the leaf node;

in the case where a first node, being a leaf node containing the first number of edges, is a canonical adjacency matrix and there is a second node, being another leaf node containing the first number of edges that shares a parent node with the first node, performing an FFSM-Join operation on the first node and the second node according to the canonical adjacency matrix of the single graph, so that subgraph growth yields child nodes that take the first node and the second node as parent nodes and contain a second number of edges; in the case where a third node, being a leaf node containing the first number of edges, is a canonical adjacency matrix and is an outer matrix, performing an FFSM-Extension operation on the third node according to the canonical adjacency matrix of the single graph, so that subgraph growth yields a child node that takes the third node as its parent node and contains the second number of edges; wherein the second number minus the first number equals one; the third node is the same as or different from the first node or the second node; and the child nodes containing the second number of edges become leaf nodes of the sub-optimal canonical adjacency matrix tree;

taking the nodes of the leaves containing the second number of edges as candidate subgraphs, and constructing CSP search spaces of the corresponding candidate subgraphs according to a subgraph growth mode and the parent nodes of the candidate subgraphs and the CSP search spaces of the parent nodes;

under the condition that the effective number of the dictionary ordering sequence combination of the numbers of the graph nodes in the CSP search space of the current candidate subgraph is smaller than a set support threshold, marking the corresponding candidate subgraph as an invalid subgraph;

and under the condition that the current candidate subgraph is not the subgraph which is finished to grow according to the canonical adjacency matrix of the single graph, carrying out subgraph growth according to the canonical adjacency matrix of the single graph and the nodes which are not marked as invalid subgraphs and contain the second number of edges so as to update the leaf nodes of the suboptimal canonical adjacency matrix tree, and outputting the frequent subgraph of the single graph according to the leaf nodes which are finished to grow the subgraph.

In some embodiments, generating an initial sub-optimal canonical adjacency matrix tree by searching the canonical adjacency matrices of the single graph in the order of the numbering of the graph nodes comprises:

initializing a sub-optimal canonical adjacency matrix tree by searching the canonical adjacency matrix of the single graph in the order of the numbering of the graph nodes, to obtain the initial sub-optimal canonical adjacency matrix tree; wherein the root node of the initial sub-optimal canonical adjacency matrix tree is null; the first number is equal to one; and the CSP search space of the frequent nodes is a dictionary ordering sequence combination of the node label numbers of the start graph nodes and of the end graph nodes of the corresponding edges.

In some embodiments, the frequent subgraph mining method for a single graph further includes:

under the condition that the numbers of the graph nodes in the suboptimal adjacency matrix corresponding to the nodes of the leaves comprising the first number of edges do not accord with the dictionary sorting order, marking the corresponding nodes of the leaves comprising the first number of edges as invalid subgraphs;

when a third node of a leaf including a first number of edges is a canonical adjacency matrix and is an outer matrix, performing an FFSM-Extension operation on the third node according to the canonical adjacency matrix of the single graph, and performing sub-graph growth to obtain a child node including the third node as a parent node and a second number of edges, including:

and under the condition that a third node of a leaf containing the first number of edges is a canonical adjacency matrix, is an outer matrix and is not marked as an invalid subgraph, performing FFSM-Extension operation on the third node according to the canonical adjacency matrix of the single graph, and growing the subgraph to obtain a child node which takes the third node as a parent node and contains a second number of edges.

In some embodiments, in a case where a first node of a leaf containing a first number of edges is a canonical adjacency matrix and there are other second nodes of the leaf containing the first number of edges with which a parent node is common, performing an FFSM-Join operation on the first node and the second node according to the canonical adjacency matrix of the single graph, a subgraph growth resulting in child nodes containing a second number of edges with the first node and the second node being parent nodes comprises:

in the case where the first node of a leaf containing the first number of edges is a canonical adjacency matrix and there are other second nodes of the leaf containing the first number of edges with which parent nodes are common,

if the first node and the second node are both inner matrices, then if and only if $f \neq k$, performing subgraph growth in a first mode to obtain the adjacency matrix, denoted $C_{m \times m}$, of the child node that takes the first node and the second node as parent nodes and contains a second number of edges, wherein the elements of the adjacency matrix $C_{m \times m}$ are expressed as:

wherein $c_{i,j}$ denotes the element in row $i$ and column $j$ of the adjacency matrix corresponding to the child node, $a_{i,j}$ denotes the element in row $i$ and column $j$ of the canonical adjacency matrix corresponding to the first node, and $b_{i,j}$ denotes the element in row $i$ and column $j$ of the adjacency matrix corresponding to the second node; the last edge in the canonical adjacency matrix corresponding to the first node is denoted $a_{m,f}$, where $m$ and $f$ respectively represent the number of rows and columns and the total number of edges of the canonical adjacency matrix corresponding to the first node; and the last edge in the adjacency matrix corresponding to the second node is denoted $b_{n,k}$, where $n$ and $k$ respectively represent the number of rows and columns and the total number of edges of the adjacency matrix corresponding to the second node;

if the first node is an inner matrix and the second node is an outer matrix, performing subgraph growth in a second mode to obtain the adjacency matrix, denoted $C_{n \times n}$, of the child node that takes the first node and the second node as parent nodes and contains a second number of edges, wherein the elements of the adjacency matrix $C_{n \times n}$ are expressed as:

if the first node and the second node are both outer matrices, then if and only if $f \neq k \wedge a_{m,m} = b_{m,m}$, performing subgraph growth in a third mode to obtain the adjacency matrix, denoted $C_{m \times m}$, of the child node that takes the first node and the second node as parent nodes and contains a second number of edges, wherein the elements of the adjacency matrix $C_{m \times m}$ are expressed as:

if the first node and the second node are both outer matrices, performing subgraph growth in a fourth mode to obtain the adjacency matrix, denoted $D_{(m+1) \times (m+1)}$, of the child node that takes the first node and the second node as parent nodes and contains a second number of edges, wherein the elements of the adjacency matrix $D_{(m+1) \times (m+1)}$ are expressed as:

wherein $d_{i,j}$ denotes the element in row $i$ and column $j$ of the adjacency matrix corresponding to the child node, $a_{i,j}$ denotes the element in row $i$ and column $j$ of the canonical adjacency matrix corresponding to the first node, $b_{m,j}$ denotes the element in row $m$ and column $j$ of the adjacency matrix corresponding to the second node, $b_{m,m}$ denotes the element in row $m$ and column $m$ of the adjacency matrix corresponding to the second node, and $m$ represents the number of rows and columns of the canonical adjacency matrix corresponding to the first node and of the adjacency matrix corresponding to the second node.

In some embodiments, constructing CSP search spaces of the respective candidate subgraphs according to the subgraph growth mode and according to parent nodes of the candidate subgraphs and the CSP search spaces of the parent nodes, with nodes of leaves including the second number of edges as the candidate subgraphs, includes:

in the case where the leaf node containing the second number of edges, taken as a candidate subgraph, is obtained by subgraph growth in the first mode or the third mode, merging the subgraphs corresponding to the dictionary ordering sequence combinations of the numbers of the graph nodes in the CSP search space of the first node and in the CSP search space of the second node and removing duplicates, to obtain the CSP search space of the candidate subgraph;

and under the condition that the nodes of leaves containing a second number of edges are obtained as candidate subgraphs by the growth of the subgraphs in the second mode or the fourth mode, removing duplication of the subgraphs corresponding to the dictionary sorting sequence combination of the numbers of the graph nodes in the two CSP search spaces corresponding to the common part of the canonical adjacency matrix corresponding to the first node and the adjacency matrix corresponding to the second node, and splicing the subgraphs to the CSP search spaces corresponding to the other nodes except the common part of the canonical adjacency matrix corresponding to the first node in the adjacency matrix corresponding to the second node to obtain the CSP search spaces of the candidate subgraphs.

and under the condition that the node of the leaf containing the second number of edges is obtained as a candidate subgraph by performing FFSM-Extension operation subgraph growth, acquiring the CSP search space of the third node, and adding the number of the terminal graph node of the candidate subgraph increased relative to the third node to the tail of the dictionary ordering sequence combination of the numbers of the graph nodes in the CSP search space of the third node to obtain the CSP search space of the candidate subgraph.
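The FFSM-Extension case above appends the number of the newly added end graph node to the tail of each number sequence in the parent's CSP search space. A minimal Python sketch of that step follows. The function name `extend_search_space`, the `neighbors` adjacency map, and the parameters `src_pos` and `new_label` are illustrative assumptions and do not come from the source; the sketch also assumes the new end node must not already appear in the sequence.

```python
def extend_search_space(parent_space, neighbors, labels, src_pos, new_label):
    """Build a child CSP search space from a parent one (illustrative only).

    parent_space: list of tuples of graph node numbers (one tuple per match).
    neighbors:    dict mapping a graph node number to the set of adjacent numbers.
    labels:       list where labels[n] is the node label of graph node n.
    src_pos:      position in the sequence of the node the new edge starts from.
    new_label:    node label required for the newly added end graph node.
    """
    child_space = []
    for seq in parent_space:
        # Visit candidate end nodes in numbering order so the output stays sorted.
        for v in sorted(neighbors[seq[src_pos]]):
            if labels[v] == new_label and v not in seq:
                child_space.append(seq + (v,))  # append the number at the tail
    return child_space
```

For example, with labels `["a", "a", "b", "b"]` and edges 0-2, 0-3 and 1-3, extending the single-node sequences `(0,)` and `(1,)` by a node labelled `"b"` yields `(0, 2)`, `(0, 3)` and `(1, 3)`.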

if there is another node sharing a parent node with the fourth node, finding a fifth node which has the same graph node set and edge set as the fourth node and is a canonical adjacency matrix, generating a new numbering sequence combination by adjusting the numbering sequence in the sequence combination of the graph nodes in the CSP search space of the fourth node, adding the generated new numbering sequence combination to the CSP search space of the fifth node, and marking the fourth node as an invalid subgraph.

and distributing all frequent nodes, all frequent edges and child nodes thereof in the initial suboptimal canonical adjacency matrix tree to each computing node of the distributed system according to the frequent nodes, so that each computing node executes subgraph growth under the condition of load balancing to obtain nodes containing a second number of edges.

and in the case where the current candidate subgraph is not a subgraph that has finished growing according to the canonical adjacency matrix of the single graph, distributing, according to the CSP search space of the candidate subgraph of the single graph, the parent node, the corresponding nodes containing the first number of edges and their child nodes to the computing nodes of the distributed system, grouped by the parent node of the nodes containing the first number of edges in the sub-optimal canonical adjacency matrix tree, so that each computing node, under load balancing, performs subgraph growth according to the canonical adjacency matrix of the single graph and the nodes containing the second number of edges that are not marked as invalid subgraphs.

In some embodiments, assigning parent nodes and their child nodes of nodes in the sub-optimal canonical adjacency matrix tree that contain the first number of edges to compute nodes in the distributed system according to the CSP search space of the candidate subgraph of the single graph comprises:

determining a pre-search space for constructing the CSP corresponding to the candidate subgraph according to the number of node numbers between the number of the last graph node in the CSP search space of the node containing the first number of edges as the candidate subgraph and the number of the last graph node in the canonical adjacency matrix of the single graph;

and performing grouping summation on the pre-search spaces for constructing the CSP of the nodes containing the first number of edges according to the parent node of the nodes containing the first number of edges, and distributing the nodes containing the first number of edges, the parent node and the child nodes thereof in the suboptimal canonical adjacency matrix tree to each computing node in the distributed system according to the grouping summation result so as to enable the sum of the pre-search spaces executed on different computing nodes to be similar.
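The grouping and distribution step above can be sketched as follows. The source only requires that the sums of the pre-search spaces on different computing nodes be similar; the greedy largest-group-first assignment used here is an assumed strategy chosen for illustration, and the names `assign_groups`, `pre_search_space` and `parent_of` are hypothetical.

```python
import heapq
from collections import defaultdict


def assign_groups(pre_search_space, parent_of, num_workers):
    """Assign parent-node groups to computing nodes with similar total load.

    pre_search_space: dict mapping a tree node to its pre-search space size.
    parent_of:        dict mapping a tree node to its parent node in the tree.
    num_workers:      number of computing nodes in the distributed system.
    """
    # Group-sum the pre-search space sizes by parent node.
    group_total = defaultdict(int)
    for node, size in pre_search_space.items():
        group_total[parent_of[node]] += size

    # Min-heap of (current load, worker index).
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)

    assignment = {}
    # Largest groups first; each goes to the currently least loaded worker.
    for parent, total in sorted(group_total.items(),
                                key=lambda kv: (-kv[1], str(kv[0]))):
        load, worker = heapq.heappop(heap)
        assignment[parent] = worker
        heapq.heappush(heap, (load + total, worker))

    loads = [0] * num_workers
    for parent, worker in assignment.items():
        loads[worker] += group_total[parent]
    return assignment, loads
```

Keeping a whole parent group on one computing node matches the requirement that a parent node and its child nodes are distributed together; only the balancing heuristic is an assumption.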

In some embodiments, in a case that the number of valid dictionary ordering order combinations of numbers of graph nodes in the CSP search space of the current candidate subgraph is less than a set support threshold, marking the corresponding candidate subgraph as an invalid subgraph comprises:

calculating the product of the number of number combinations in the CSP search space of the candidate subgraph and the number of search modes available when the newly added node of the candidate subgraph is searched in the numbering order of the graph nodes in the canonical adjacency matrix of the single graph, to obtain the effective number of dictionary ordering sequence combinations of the numbers of the graph nodes in the CSP search space of the candidate subgraph; and marking the corresponding candidate subgraph as an invalid subgraph in the case where that effective number is less than the set support threshold.

In some embodiments, in a case that it is determined that the current candidate subgraph is not a subgraph that has completed growing according to the canonical adjacency matrix of the single graph, performing subgraph growing according to the canonical adjacency matrix of the single graph and according to nodes containing a second number of edges that are not marked as invalid subgraphs to update leaf nodes of a sub-optimal canonical adjacency matrix tree, and outputting frequent subgraphs of the single graph according to the leaf nodes that have completed subgraph growing, the method comprises:

under the condition that the current candidate subgraph is judged not to be the subgraph which is finished to grow according to the canonical adjacency matrix of the single graph, under the condition that a sixth node which is taken as the candidate subgraph and comprises a second number of edges is the canonical adjacency matrix and other seventh nodes which share the parent node and comprise other leaves of the second number of edges exist, FFSM-Join operation is carried out on the sixth node and the seventh node according to the canonical adjacency matrix of the single graph, and the subgraph is grown to obtain child nodes which take the sixth node and the seventh node as parent nodes and comprise a third number of edges; under the condition that an eighth node which is a candidate subgraph and contains a second number of edges is not marked as an invalid subgraph, is a canonical adjacency matrix and is an outer matrix, performing FFSM-Extension operation on the eighth node according to the canonical adjacency matrix of the single graph, and growing the subgraph to obtain a child node which takes the eighth node as a parent node and contains a third number of edges; wherein the third number minus the second number equals one; the eighth node is the same as or different from the sixth node or the seventh node; child nodes containing a third number of edges become nodes of leaves of the sub-optimal canonical adjacency matrix tree;

taking the nodes of the leaves containing the third number of edges as new candidate subgraphs, and constructing CSP search spaces of the corresponding candidate subgraphs according to a subgraph growth mode and the parent nodes of the candidate subgraphs and the CSP search spaces of the parent nodes;

under the condition that the effective number of the dictionary ordering sequence combination of the numbers of the graph nodes in the CSP search space of the current new candidate subgraph is smaller than a set support threshold, marking the corresponding candidate subgraph as an invalid subgraph;

and under the condition that the current new candidate subgraph is judged to be the subgraph which is finished to grow according to the canonical adjacency matrix of the single graph, outputting the frequent subgraph of the single graph according to the current new candidate subgraph.

under the condition that a first node of a leaf containing a first number of edges is a canonical adjacency matrix and other second nodes of the leaf containing the first number of edges and sharing a parent node with the first node exist, traversing the canonical adjacency matrix of the single graph by using a push-pull dual-mode engine by using a distributed system, carrying out FFSM-Join operation on the first node and the second node, and carrying out subgraph growth to obtain child nodes which take the first node and the second node as parent nodes and contain a second number of edges;

under the condition that a third node of a leaf containing a first number of edges is a canonical adjacency matrix and is an outer matrix, traversing the canonical adjacency matrix of the single graph by using a push-pull dual-mode engine by using a distributed system, carrying out FFSM-Extension operation on the third node, and carrying out sub-graph growth to obtain child nodes which take the third node as a father node and contain a second number of edges;

the distributed communication scheduling of the distributed system is realized by adopting non-blocking communication and a dynamic thread resource release mode.

According to another aspect of the embodiments of the present invention, there is also provided a computer device, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any of the above embodiments when executing the program.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any of the embodiments described above.

According to the frequent subgraph mining method for a single graph, the computer device and the computer-readable storage medium of the embodiments of the invention, a canonical adjacency matrix of the single graph is generated and the graph nodes are numbered, so that the graph nodes and edges of the single graph can be searched in sequence and repeated searching is avoided, which is equivalent to pruning. Constructing a sub-optimal canonical adjacency matrix tree and growing subgraphs only from the tree nodes whose matrices meet the requirements is equivalent to further pruning. Moreover, marking invalid subgraphs performs pre-pruning, so that no further subgraphs are grown from them. By pruning in these several respects, the embodiments of the invention can greatly reduce the search space and improve the efficiency of frequent subgraph mining.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort. In the drawings:

FIG. 1 is a flow chart of a single graph frequent subgraph mining method according to an embodiment of the invention;

FIG. 2 is a flowchart illustrating a method for mining frequent subgraphs of a single graph according to an embodiment of the present invention;

FIG. 3 is a schematic illustration of a push-pull dual engine mode in accordance with an embodiment of the present invention;

FIG. 4 is a graph structure example of a single graph of 6 nodes of a particular embodiment of the invention;

FIG. 5 is a diagram illustrating S (0) to S (3) stage subgraph growth on the single graph of FIG. 4 in accordance with an exemplary embodiment of the present invention;

FIG. 6 is a diagram illustrating S (4) stage subgraph growth on a single graph of FIG. 4 in accordance with an exemplary embodiment of the present invention;

FIG. 7 is a diagram illustrating S (5) stage subgraph growth on a single graph of FIG. 4 in accordance with an exemplary embodiment of the present invention;

FIG. 8 is a diagram illustrating S (6) to S (8) stage subgraph growth on the single graph of FIG. 4 in accordance with an exemplary embodiment of the present invention;

FIG. 9 is a diagram illustrating the construction of a search space for the subgraph shown in FIG. 7 in accordance with an embodiment of the present invention;

FIG. 10 is a diagram illustrating the construction of a search space for the subgraph shown in FIG. 7 according to another embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.

First, terms that may be referred to in this description are explained as follows:

Canonical Adjacency Matrix (CAM): given an adjacency matrix M of the single graph G, the character string obtained by concatenating the lower triangular elements and the diagonal elements of M from top to bottom and from left to right is called the code sequence of M, denoted code(M). Because the diagonal elements (graph nodes) can be ordered in different ways, the same single graph G can yield several code sequences. When all code sequences are sorted in alphabetical order, the adjacency matrix M with the largest code sequence is called the canonical adjacency matrix of the single graph G.
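The definition of code(M) and of the canonical adjacency matrix can be illustrated with a brute-force Python sketch that tries every ordering of the graph nodes. It is meant only for very small graphs. It assumes single-character node and edge labels, with "0" marking the absence of an edge, so that plain string comparison gives the alphabetical order; the function names are illustrative and not taken from the source.

```python
from itertools import permutations


def adjacency_matrix(labels, edges, order):
    """Adjacency matrix for one node ordering: node labels on the diagonal,
    edge labels off the diagonal, "0" where there is no edge."""
    n = len(order)
    m = [["0"] * n for _ in range(n)]
    for r, u in enumerate(order):
        m[r][r] = labels[u]
        for c, v in enumerate(order):
            if r != c:
                edge_label = edges.get((u, v)) or edges.get((v, u))
                if edge_label:
                    m[r][c] = edge_label
    return m


def code(m):
    """Concatenate the lower triangle and diagonal, row by row, left to right."""
    return "".join(m[r][c] for r in range(len(m)) for c in range(r + 1))


def canonical_adjacency_matrix(labels, edges):
    """Return the adjacency matrix whose code sequence is largest."""
    orders = permutations(range(len(labels)))
    return max((adjacency_matrix(labels, edges, o) for o in orders), key=code)
```

For a graph with node labels `a`, `b`, `b` and edges `a-b` labelled `x` and `a-b` labelled `y`, the largest code sequence is `bya0xb`, so the matrix with diagonal `b, a, b` is the canonical one.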

Maximal Proper Submatrix: given an m × m adjacency matrix A, an n × n adjacency matrix B is called the maximal proper submatrix of A if B is obtained by removing the last non-zero entry of the lower triangular region (excluding the diagonal) of A.
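A small Python sketch of this definition follows. The text above does not say what happens to the last graph node when its only edge is removed; the sketch assumes that the last row and column are then dropped, giving an (m − 1) × (m − 1) matrix, and that a single-node matrix maps to the empty (NULL) matrix. The function name is illustrative.

```python
def maximal_proper_submatrix(m):
    """Remove the last non-zero off-diagonal entry of the lower triangle.

    m is a symmetric list-of-lists adjacency matrix with node labels on the
    diagonal and "0" where there is no edge.
    """
    n = len(m)
    if n <= 1:
        return []  # assumed: the parent of a single node is the NULL matrix
    last = n - 1
    edge_cols = [c for c in range(last) if m[last][c] != "0"]
    if not edge_cols:
        raise ValueError("last row contains no edge to remove")
    if len(edge_cols) == 1:
        # Assumed: removing the only edge of the last node removes the node too.
        return [row[:last] for row in m[:last]]
    result = [row[:] for row in m]
    c = edge_cols[-1]  # rightmost edge in the last row is last in code order
    result[last][c] = result[c][last] = "0"
    return result
```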

Sub-optimal Canonical Adjacency Matrix (SCAM): if the maximal proper submatrix N of an adjacency matrix M is a canonical adjacency matrix, M is called a sub-optimal canonical adjacency matrix. In particular, if M is not itself a canonical adjacency matrix (CAM) but N is, then M is called a proper sub-optimal canonical adjacency matrix (proper sub-CAM), which can be regarded as a special case of a sub-optimal canonical adjacency matrix.

Sub-optimal Canonical Adjacency Matrix Tree (SCAM Tree): a set of subgraph patterns organized in a tree structure that satisfies the following conditions: 1) the root node of the tree is the empty matrix NULL; 2) each node in the tree represents a different subgraph, represented by its sub-optimal canonical adjacency matrix (i.e., CAM or SCAM); 3) for any matrix M on a non-root node, the parent node of M is the maximal proper submatrix of M.

Inner and outer matrices: given the adjacency matrix A of a single graph G, if the last row of A contains at least two non-zero elements other than the diagonal element (that is, at least two edges), A is called an inner matrix; otherwise it is an outer matrix.
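The inner/outer test reduces to counting the edges in the last row. A minimal Python sketch, assuming a list-of-lists matrix with node labels on the diagonal and "0" for a missing edge (the function names are illustrative):

```python
def is_inner_matrix(m):
    """True if the last row holds at least two edges (diagonal excluded)."""
    last = len(m) - 1
    edge_count = sum(1 for c in range(last) if m[last][c] != "0")
    return edge_count >= 2


def is_outer_matrix(m):
    """Every matrix that is not an inner matrix is an outer matrix."""
    return not is_inner_matrix(m)
```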

Fig. 1 is a schematic flowchart of a frequent subgraph mining method of a single graph according to an embodiment of the present invention, and referring to fig. 1, the frequent subgraph mining method of a single graph according to an embodiment of the present invention may include the following steps S110 to S160.

Specific embodiments of steps S110 to S160 will be described in detail below.

Step S110: generating a canonical adjacency matrix of the single graph according to the dictionary ordering result of the node labels of the single graph, and numbering the graph nodes in the canonical adjacency matrix of the single graph according to the sequence of rows or columns.

In the above step S110, the single graph may refer to a single large graph. A node label may indicate that a graph node has a certain characteristic or is of a certain type, and different graph nodes may have the same node label. The node labels may be represented by sortable symbols, such as numbers or letters, so that all node labels can follow a certain dictionary ordering, such as numerical or alphabetical order. Each diagonal element in the canonical adjacency matrix of the single graph corresponds to a graph node in the single graph, and these diagonal elements may be node labels arranged in dictionary order, so that identical node labels may be adjacent on the diagonal. Numbering the graph nodes in the canonical adjacency matrix of the single graph may mean assigning an identifier (ID) to each diagonal element (graph node) in the matrix, and the identifiers (numbers) of the graph nodes have a certain order. For example, when numbering by columns, the numbers may be inserted as a column on the left side of the canonical adjacency matrix, whose elements increase from top to bottom. The resulting canonical adjacency matrix of the single graph can be used as the data index matrix for subsequent processes such as subgraph growth and support calculation.
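As a rough illustration of step S110, the Python sketch below sorts the graph nodes by node label, assigns consecutive numbers in that order, and builds the label-carrying adjacency matrix indexed by those numbers. Ascending order, numbering from 0, and breaking ties between equal labels by the original identifier are assumptions made for the example; the function name `number_graph_nodes` is hypothetical.

```python
def number_graph_nodes(node_labels, edges):
    """Number graph nodes in dictionary order of their labels.

    node_labels: dict mapping an original node identifier to its node label.
    edges:       list of (u, v, edge_label) with u and v original identifiers.
    Returns (number_of, matrix), where number_of maps identifier -> number
    and matrix carries node labels on the diagonal and edge labels elsewhere.
    """
    # Sort by label first; ties are broken by identifier (an assumption).
    ordered = sorted(node_labels, key=lambda u: (node_labels[u], str(u)))
    number_of = {u: i for i, u in enumerate(ordered)}

    n = len(ordered)
    matrix = [["0"] * n for _ in range(n)]
    for u, i in number_of.items():
        matrix[i][i] = node_labels[u]
    for u, v, edge_label in edges:
        i, j = number_of[u], number_of[v]
        matrix[i][j] = matrix[j][i] = edge_label  # undirected graph
    return number_of, matrix
```

Graph nodes that share a label receive consecutive numbers, which is what makes identical labels adjacent on the diagonal.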

Step S120: generate an initial suboptimal canonical adjacency matrix tree by searching the canonical adjacency matrix of the single graph in the numbering order of the graph nodes, wherein each leaf node of the initial suboptimal canonical adjacency matrix tree contains a first number of edges; the first number is an integer not less than one; the root node of the initial suboptimal canonical adjacency matrix tree contains no graph nodes or edges of the single graph, the child nodes of the root node are frequent nodes, and the child nodes of the frequent nodes are frequent edges; and the CSP search space of a leaf node is the set of dictionary-ordered sequence combinations of the numbers of the graph nodes corresponding to the node labels contained in the leaf node.

In step S120, each node of the suboptimal canonical adjacency matrix tree represents a subgraph of the single graph. The subgraph essentially corresponds to a subgraph pattern, that is, to a matrix containing node labels and edge labels. Each subgraph (node) corresponds to a CSP search space, which contains the possible sequential combinations of the numbers of all graph nodes in the subgraph pattern. Each combination may be called a number sequence and may form one row of the matrix corresponding to the CSP search space, so that the number of rows equals the number of combinations. A parent node yields a child node by expanding one edge. The nodes of the suboptimal canonical adjacency matrix tree may be represented as adjacency matrices, all of which are suboptimal canonical adjacency matrices but which may or may not be canonical adjacency matrices. The root node, the frequent nodes (each containing only one graph node), and the frequent edges (each containing one edge) can be obtained through initialization, so the leaf nodes described in step S120 may refer to the frequent edges. All frequent nodes may constitute the frequent node set of the single graph, and all frequent edges may constitute the frequent edge set of the single graph. The root node may be null; a frequent node adds one graph node relative to the root node, and a frequent edge adds one edge and its end graph node relative to a frequent node. It should be noted that, herein, a point of the single graph is referred to as a "graph node" and a node of the suboptimal canonical adjacency matrix tree is referred to as a "node" to distinguish the two. The CSP search space of a frequent node is the set of graph node numbers corresponding to the node label of the frequent node. The CSP search space of a frequent edge is the set of combinations of the start graph node and the end graph node of the frequent edge. For a node (subgraph or subgraph pattern), a combination of numbers is in effect a number sequence.

In specific implementations, the frequent nodes and frequent edges of the suboptimal canonical adjacency matrix tree can be obtained through initialization. For example, in step S120, generating the initial suboptimal canonical adjacency matrix tree by searching the canonical adjacency matrix of the single graph in the numbering order of the graph nodes may specifically include step S121: initialize the suboptimal canonical adjacency matrix tree by searching the canonical adjacency matrix of the single graph in the numbering order of the graph nodes, to obtain the initial suboptimal canonical adjacency matrix tree; wherein the root node of the initial suboptimal canonical adjacency matrix tree is null; the first number is equal to one; and the CSP search space of a frequent edge is the set of dictionary-ordered sequence combinations of the number of the start graph node and the number of the end graph node of the corresponding edge.

In step S121, for example, the graph nodes in the canonical adjacency matrix of the single graph are numbered from top to bottom. To generate the frequent nodes of the suboptimal canonical adjacency matrix tree, the numbers of the graph nodes in the canonical adjacency matrix are searched sequentially from top to bottom, yielding in turn the frequent nodes (subgraphs containing only one graph node) corresponding to the graph nodes of the single graph. To generate the frequent edges of the tree, starting from a frequent node (one graph node, which is the start graph node of the frequent edge), the graph nodes below it in the canonical adjacency matrix and its adjacent edges are searched sequentially from top to bottom; that is, one edge is expanded and the end graph node of that edge is obtained. The initialized CSP search space of a frequent node may be the numbers of the corresponding graph nodes, and the initialized CSP search space of a frequent edge may be the dictionary-ordered combinations of the number of the start graph node and the number of the end graph node; the number combinations obtained by searching the canonical adjacency matrix in the numbering order of the graph nodes can be arranged directly in sequence.
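The initialization of step S121 can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not taken from the source: the graph is given as a list of node labels indexed by graph node number plus a dictionary of undirected edges keyed by (start, end) with start < end; a pattern counts as frequent when its number of number sequences reaches the threshold; and all names are hypothetical.

```python
from collections import defaultdict


def init_frequent_patterns(node_label, adjacency, threshold):
    """Build CSP search spaces for single-node and single-edge patterns.

    node_label: list where node_label[i] is the label of graph node number i.
    adjacency: dict mapping (i, j) with i < j to an edge label (undirected graph).
    threshold: support threshold; patterns with fewer number sequences are dropped.
    """
    node_space = defaultdict(list)
    for number, label in enumerate(node_label):           # scan in numbering order
        node_space[label].append((number,))
    frequent_nodes = {p: s for p, s in node_space.items() if len(s) >= threshold}

    edge_space = defaultdict(list)
    for (i, j), edge in sorted(adjacency.items()):         # dictionary order of (start, end)
        li, lj = node_label[i], node_label[j]
        if li in frequent_nodes and lj in frequent_nodes:  # grow only from frequent nodes
            edge_space[(li, edge, lj)].append((i, j))
    frequent_edges = {p: s for p, s in edge_space.items() if len(s) >= threshold}
    return frequent_nodes, frequent_edges


node_label = ["a", "a", "b", "b"]
adjacency = {(0, 2): "x", (1, 3): "x", (0, 1): "y"}
frequent_nodes, frequent_edges = init_frequent_patterns(node_label, adjacency, threshold=2)
print(frequent_nodes)
print(frequent_edges)
```

In this toy graph the edge pattern a-y-a occurs only once and is therefore dropped, while a-x-b keeps the two number sequences (0, 2) and (1, 3).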

In other embodiments, more nodes of the suboptimal canonical adjacency matrix tree may be obtained through initialization; for example, nodes containing at most two edges, or nodes containing at most three edges, may be obtained.

Step S130: when a first leaf node containing the first number of edges is a canonical adjacency matrix and there is another, second leaf node containing the first number of edges that shares a parent node with the first node, perform an FFSM-Join operation on the first node and the second node according to the canonical adjacency matrix of the single graph, and perform subgraph growth to obtain a child node that has the first node and the second node as parent nodes and contains a second number of edges; when a third leaf node containing the first number of edges is a canonical adjacency matrix and is an outer matrix, perform an FFSM-Extension operation on the third node according to the canonical adjacency matrix of the single graph, and perform subgraph growth to obtain a child node that has the third node as its parent node and contains the second number of edges; wherein the second number minus the first number equals one; the third node is the same as or different from the first node or the second node; and the child nodes containing the second number of edges become the leaf nodes of the suboptimal canonical adjacency matrix tree.

In step S130, a node of the suboptimal canonical adjacency matrix tree may be represented by a canonical adjacency matrix (which is also a suboptimal canonical adjacency matrix) or by a non-canonical adjacency matrix (which may still be a suboptimal canonical adjacency matrix); in addition, frequent nodes and frequent edges may be treated as special cases of outer matrices. Before subgraph growth, if the suboptimal canonical adjacency matrix tree comprises three layers, namely the root node, the frequent nodes, and the frequent edges, then the frequent edges are the leaf nodes. If the lowest nodes of the suboptimal canonical adjacency matrix tree contain the first number of edges, those nodes are the leaf nodes. A leaf node containing the first number of edges may be a canonical adjacency matrix (CAM) or a non-canonical suboptimal canonical adjacency matrix (SCAM). Only a leaf node that is a canonical adjacency matrix initiates an FFSM-Join operation (FFSM: Fast Frequent Subgraph Mining) with other nodes sharing its parent node; a leaf node that is a non-canonical adjacency matrix may still take part in an FFSM-Join operation as the other node that shares a parent node with a canonical adjacency matrix. A leaf node that is a canonical adjacency matrix may additionally be an inner matrix or an outer matrix. A leaf node containing the first number of edges can undergo the FFSM-Extension operation only if it is a canonical adjacency matrix.

In specific implementations, the FFSM-Join operation proceeds differently depending on whether each of the two leaf nodes that contain the first number of edges and share a parent node is an inner matrix or an outer matrix.

In step S130, performing the FFSM-Join operation on the first node and the second node according to the canonical adjacency matrix of the single graph, and performing subgraph growth to obtain a child node that has the first node and the second node as parent nodes and contains the second number of edges, when the first leaf node containing the first number of edges is a canonical adjacency matrix and a second leaf node containing the first number of edges shares a parent node with it, may specifically include the following cases S131 to S134:

S131: if the first node and the second node are both inner matrices, then if and only if f and k are different, subgraph growth is performed in a first manner to obtain the adjacency matrix C_{m×m} corresponding to the child node that has the first node and the second node as parent nodes and contains the second number of edges, wherein the elements of C_{m×m} are given by formula (1),

wherein c_{i,j} denotes the element in row i and column j of the adjacency matrix corresponding to the child node; a_{i,j} denotes the element in row i and column j of the canonical adjacency matrix corresponding to the first node; b_{i,j} denotes the element in row i and column j of the adjacency matrix corresponding to the second node; the last edge in the canonical adjacency matrix corresponding to the first node is denoted a_{m,f}, where m denotes the number of rows and columns of that matrix and f denotes the total number of edges; and the last edge in the adjacency matrix corresponding to the second node is denoted b_{n,k}, where n denotes the number of rows and columns of that matrix and k denotes the total number of edges;

S132: if the first node is an inner matrix and the second node is an outer matrix, subgraph growth is performed in a second manner to obtain the adjacency matrix C_{n×n} corresponding to the child node that has the first node and the second node as parent nodes and contains the second number of edges, wherein the elements of C_{n×n} are given by formula (2);

S133: if the first node and the second node are both outer matrices, then if and only if f ≠ k ∧ a_{m,m} = b_{m,m}, subgraph growth is performed in a third manner to obtain the adjacency matrix C_{m×m} corresponding to the child node that has the first node and the second node as parent nodes and contains the second number of edges, wherein the elements of C_{m×m} are given by formula (3);

S134: if the first node and the second node are both outer matrices, subgraph growth is performed in a fourth manner to obtain the adjacency matrix D_{(m+1)×(m+1)} corresponding to the child node that has the first node and the second node as parent nodes and contains the second number of edges, wherein the elements of D_{(m+1)×(m+1)} are given by formula (4),

wherein d_{i,j} denotes the element in row i and column j of the adjacency matrix corresponding to the child node; a_{i,j} denotes the element in row i and column j of the canonical adjacency matrix corresponding to the first node; b_{m,j} denotes the element in row m and column j of the adjacency matrix corresponding to the second node; b_{m,m} denotes the element in row m and column m of the adjacency matrix corresponding to the second node; and m denotes the number of rows and columns of both the canonical adjacency matrix corresponding to the first node and the adjacency matrix corresponding to the second node.

In the above embodiment, the FFSM-Join operation connects two matrices that share a common parent matrix, adding one edge to generate a new matrix. Given two SCAMs A and B, let a_{m,f} denote the last edge of SCAM(A) (suboptimal canonical adjacency matrix A) and b_{n,k} the last edge of SCAM(B) (suboptimal canonical adjacency matrix B); the FFSM-Join operation can then be defined as follows:

1) if matrix A and matrix B are both inner matrices, the FFSM-Join operation is expressed as Join(A, B) = C if and only if f and k are different, and the elements c_{i,j} of matrix C satisfy formula (1) above;

2) if matrix A is an inner matrix and matrix B is an outer matrix, the FFSM-Join operation is expressed as Join(A, B) = C, and the elements c_{i,j} of matrix C satisfy formula (2) above;

3) if matrix A and matrix B are both outer matrices, there are two further cases:

Case one: if and only if f ≠ k ∧ a_{m,m} = b_{m,m}, the result of the FFSM-Join operation is denoted C_{m×m}, and the elements c_{i,j} of matrix C_{m×m} satisfy formula (3) above;

Case two: the result of the FFSM-Join operation is denoted D_{(m+1)×(m+1)}; in this case the matrix D_{(m+1)×(m+1)} always exists, and its elements d_{i,j} can be expressed as formula (4) above.
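The four join cases can be sketched in Python as follows. Formulas (1) to (4) are not reproduced in this text, so the sketch follows the verbal description above together with the join cases as they are commonly stated for the published FFSM algorithm; it should be read as an assumption to be checked against the formulas, not as the definitive operation. The matrix layout (lower triangle, node labels on the diagonal, 0 for "no edge", zero-based indices) and all function names are hypothetical, f and k are taken here to be the column indices of the last edges, and each matrix is assumed to contain at least one edge.

```python
import copy


def edge_columns(matrix):
    """Columns of the non-zero off-diagonal entries in the last row."""
    last = len(matrix) - 1
    return [j for j in range(last) if matrix[last][j] != 0]


def ffsm_join(a, b):
    """Join two matrices that share a parent; returns a list of candidate matrices."""
    m, n = len(a), len(b)
    a_cols, b_cols = edge_columns(a), edge_columns(b)
    f, k = a_cols[-1], b_cols[-1]             # columns of the last edges of A and B
    a_inner, b_inner = len(a_cols) >= 2, len(b_cols) >= 2
    results = []

    def overlay_last_edge():
        # Same size as A: copy A and add the last edge of B to the last row.
        c = copy.deepcopy(a)
        c[m - 1][k] = b[n - 1][k]
        return c

    if a_inner and b_inner:
        if f != k:                            # case 1: both inner
            results.append(overlay_last_edge())
    elif a_inner and not b_inner:             # case 2: B has one more graph node than A
        c = copy.deepcopy(b)
        for i in range(m):
            c[i][:m] = a[i]                   # top-left block comes from A
        results.append(c)
    elif not a_inner and not b_inner:         # case 3: both outer
        if f != k and a[m - 1][m - 1] == b[m - 1][m - 1]:
            results.append(overlay_last_edge())           # case one: C, size m
        d = [row + [0] for row in a]                       # case two: D, size m + 1
        d.append(b[m - 1][:m - 1] + [0, b[m - 1][m - 1]])  # last graph node of B as new row
        results.append(d)
    return results


a = [["a", 0, 0], ["x", "b", 0], ["y", 0, "c"]]   # c attached to the first graph node
b = [["a", 0, 0], ["x", "b", 0], [0, "y", "c"]]   # c attached to the second graph node
c, d = ffsm_join(a, b)
print(c)
print(d)
```

In this example both inputs are outer matrices with the same parent (the single edge a-x-b), so the join yields both a three-node matrix C, in which graph node c is linked to both other nodes, and a four-node matrix D, in which a second graph node labeled c is appended. The combination "A outer, B inner" returns an empty list here, since it is covered by joining in the opposite order.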

In addition, in step S130, performing the FFSM-Extension operation on the third node according to the canonical adjacency matrix of the single graph, and performing subgraph growth to obtain a child node that has the third node as its parent node and contains the second number of edges, may specifically include the following step: when a third leaf node containing the first number of edges is a canonical adjacency matrix and is an outer matrix, expand one edge and obtain the end graph node of that edge, to obtain a child node that has the third node as its parent node and contains the second number of edges, denoted B_{(n+1)×(n+1)}, wherein the elements of the matrix B_{(n+1)×(n+1)} of the child node are given by formula (5),

wherein a_{i,j} denotes the element in row i and column j of the matrix corresponding to the third node; b_{i,j} denotes the element in row i and column j of the matrix corresponding to the child node containing the second number of edges; e denotes the added edge; and n denotes the end graph node (the newly introduced graph node) of the added edge.

In the above embodiment, during the FFSM-Extension operation, if the suboptimal canonical adjacency matrix (SCAM) is an outer matrix, a new edge e is added between the last graph node and a newly introduced graph node n, expanding the current subgraph of k edges into a new subgraph B_{(n+1)×(n+1)} of k + 1 edges, whose elements satisfy formula (5) above.
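The extension step can be sketched in Python as follows. Formula (5) is not reproduced in this text, so the sketch relies only on the verbal description: the new edge joins the last graph node of an outer matrix to a newly introduced graph node. The matrix layout (lower triangle, node labels on the diagonal, 0 for "no edge") and the function name are assumptions.

```python
def ffsm_extend(a, edge_label, new_node_label):
    """Grow an outer matrix by one edge from its last graph node to a new graph node."""
    n = len(a)
    last_row_edges = sum(1 for j in range(n - 1) if a[n - 1][j] != 0)
    if last_row_edges >= 2:
        raise ValueError("extension is only applied to outer matrices")
    b = [row + [0] for row in a]                       # copy A into the top-left n x n block
    # New last row: no edges to earlier nodes, edge e to the old last node, then the new label.
    b.append([0] * (n - 1) + [edge_label, new_node_label])
    return b


edge_ab = [["a", 0],
           ["x", "b"]]
print(ffsm_extend(edge_ab, "y", "c"))
```

A single-node matrix such as [["a"]] is handled as well, which matches the remark that frequent nodes may be treated as special cases of outer matrices.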

Step S140: take the leaf nodes containing the second number of edges as candidate subgraphs, and construct the CSP search space of each candidate subgraph according to the subgraph growth manner, the parent node of the candidate subgraph, and the CSP search space of the parent node.

Here, let C_{k-1} and C_k be the sets of SCAMs of the subgraphs of the single graph G with k-1 edges and k edges, respectively (k ≥ 2); then every SCAM in C_k can be obtained by joining (Join) two elements of C_{k-1} or by extending (Extension) an element of C_{k-1}.

The CSP (Constraint Satisfaction Problem) model is as follows. Let g be a subgraph of the single large graph G. The problem of finding subgraphs of the single large graph G that are isomorphic to the candidate subgraph g can be represented as a constraint satisfaction problem CSP(X, D, C). Here, V denotes the node set (frequent node set) and E denotes the edge set (frequent edge set) of G; V_1 and E_1 denote the nodes and edges of the subgraph g; L_V denotes the node label set of the node set V and L_E denotes the edge label set of the edge set E, and G further comprises the mapping between the node set V and the node label set L_V and the mapping between the edge set E and the edge label set L_E; the subgraph g likewise has its own node label set, its own edge label set, and its own mappings between nodes and node labels and between edges and edge labels. The CSP satisfies the following conditions:

1. X is a set of ordered variables (in this embodiment, these may be the sequence numbers of nodes), and each node of the subgraph g corresponds to one variable in X;

2. the domain space (CSP search space) formed by the domains of the variables x_v ∈ X is called D, where each element of D is a subset of V;

3. the constraint set C contains the following three constraints:

a) the graph nodes assigned to the variables in X are distinct;

b) each graph node assigned to a variable in X has the same node label as the corresponding node of the subgraph;

c) each edge formed by the graph nodes assigned to variables in X has the same label as the corresponding edge of the subgraph.
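The three constraints can be checked for one assignment as sketched below in Python. The data layout is an assumption made for illustration: the pattern is a lower-triangular matrix with node labels on the diagonal and 0 for "no edge", the graph is given by a label list and a dictionary of undirected edges, and the function name is hypothetical.

```python
def satisfies_constraints(assignment, pattern, graph_labels, graph_edges):
    """Check constraints a) to c) for one assignment of graph node numbers.

    assignment: tuple; assignment[i] is the graph node number bound to pattern node i.
    pattern: lower-triangular matrix, node labels on the diagonal, 0 = no edge.
    graph_labels: list; graph_labels[v] is the label of graph node v.
    graph_edges: dict mapping frozenset({u, v}) to an edge label.
    """
    if len(set(assignment)) != len(assignment):             # a) graph nodes are distinct
        return False
    for i, v in enumerate(assignment):
        if graph_labels[v] != pattern[i][i]:                 # b) node labels agree
            return False
        for j in range(i):
            wanted = pattern[i][j]
            if wanted != 0 and graph_edges.get(frozenset((v, assignment[j]))) != wanted:
                return False                                 # c) edge labels agree
    return True


graph_labels = ["a", "a", "b"]
graph_edges = {frozenset((0, 2)): "x", frozenset((0, 1)): "y"}
pattern = [["a", 0],
           ["x", "b"]]                                       # pattern: a -x- b

print(satisfies_constraints((0, 2), pattern, graph_labels, graph_edges))  # True
print(satisfies_constraints((1, 2), pattern, graph_labels, graph_edges))  # False
```

The second assignment fails constraint c) because graph nodes 1 and 2 are not connected by an edge labeled x.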

In specific implementations, the CSP (constraint satisfaction problem) search space may be constructed differently for candidate subgraphs (nodes containing the second number of edges) obtained in different manners. For example, step S140 may specifically include the following steps. S1411: when the leaf node containing the second number of edges that serves as the candidate subgraph was obtained by subgraph growth in the first manner or the third manner, merge the dictionary-ordered number sequences of graph nodes in the CSP search space of the first node and the CSP search space of the second node, removing duplicates of the corresponding subgraphs, to obtain the CSP search space of the candidate subgraph. S1412: when the leaf node containing the second number of edges that serves as the candidate subgraph was obtained by subgraph growth in the second manner or the fourth manner, merge and deduplicate the subgraphs corresponding to the dictionary-ordered number sequences of graph nodes in the two CSP search spaces for the part common to the canonical adjacency matrix corresponding to the first node and the adjacency matrix corresponding to the second node, and splice onto the result the CSP search space corresponding to the remaining graph nodes of the adjacency matrix corresponding to the second node that lie outside the common part, to obtain the CSP search space of the candidate subgraph.

In step S1411, with this growth manner, if the graph node sets and edge sets of the first node and the second node are the same, then after the two nodes are each expanded by one edge through the FFSM-Extension operation, the concrete subgraphs corresponding to the two search spaces of the resulting subgraph pattern may be identical, so deduplication is required. Put differently, the largest common subset of the CSP search space of the first node and the CSP search space of the second node is taken.
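Taking the largest common subset of two search spaces can be sketched in Python as follows. The sketch assumes that each search space is a list of number sequences (tuples) and that both parents list the graph nodes of the pattern in the same order; the function name is hypothetical.

```python
def merge_same_shape_spaces(space_a, space_b):
    """Keep, once each and in dictionary order, the number sequences found in both spaces."""
    common = set(space_a) & set(space_b)     # set intersection also removes duplicates
    return sorted(common)


space_a = [(0, 2, 4), (1, 3, 5), (0, 2, 4)]
space_b = [(1, 3, 5), (0, 2, 4), (2, 4, 6)]
print(merge_same_shape_spaces(space_a, space_b))   # [(0, 2, 4), (1, 3, 5)]
```

The intersection reflects the fact that, when the child pattern has the same graph nodes as both parents, a number sequence is an occurrence of the child only if it is an occurrence of each parent.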

In another example, step S140 may specifically include step S1421: when the leaf node containing the second number of edges that serves as the candidate subgraph was obtained by subgraph growth through the FFSM-Extension operation, obtain the CSP search space of the third node, and append the number of the end graph node that the candidate subgraph adds relative to the third node to the end of each dictionary-ordered number sequence of graph nodes in the CSP search space of the third node, to obtain the CSP search space of the candidate subgraph. Step S1421 corresponds to ordinary subgraph growth: the number of the end graph node of the newly added edge is appended to the end of each number combination (number sequence) of the previous search space.
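Appending the end graph node number to each parent sequence can be sketched in Python as follows. The adjacency-list format, the parameter names, and the function name are assumptions made for illustration; the new edge is taken to start at the last graph node of each sequence, as in the extension operation described above.

```python
def extend_search_space(parent_space, neighbours, graph_labels, edge_label, node_label):
    """Append the number of the new end graph node to each parent number sequence.

    neighbours: dict mapping a graph node number to a list of (neighbour, edge_label).
    """
    child_space = []
    for sequence in parent_space:
        last = sequence[-1]                        # the new edge starts at the last graph node
        for nxt, label in sorted(neighbours.get(last, [])):
            if (label == edge_label and graph_labels[nxt] == node_label
                    and nxt not in sequence):      # keep graph nodes distinct
                child_space.append(sequence + (nxt,))
    return child_space


graph_labels = ["a", "b", "c", "c"]
neighbours = {0: [(1, "x")],
              1: [(0, "x"), (2, "y"), (3, "y")],
              2: [(1, "y")],
              3: [(1, "y")]}
parent_space = [(0, 1)]                            # occurrences of the pattern a -x- b
print(extend_search_space(parent_space, neighbours, graph_labels, "y", "c"))
```

Here the single parent sequence (0, 1) yields two child sequences, (0, 1, 2) and (0, 1, 3), one for each graph node labeled c that is reachable from graph node 1 over an edge labeled y.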

In further embodiments, nodes that do not conform to the dictionary ordering may be pruned more precisely. For example, step S140 may specifically include step S1431: when a fourth node containing the second number of edges is a non-canonical adjacency matrix and there are other nodes sharing a parent node with the fourth node, find a fifth node that contains the second number of edges, is a canonical adjacency matrix, and has the same graph node set and edge set as the fourth node; generate new number sequences by adjusting the order of the numbers in the number sequences of graph nodes in the CSP search space of the fourth node; add the generated number sequences to the CSP search space of the fifth node; and mark the fourth node as an invalid subgraph.

In step S1431, the CSP search space of a node that is a non-canonical adjacency matrix is transferred to the CSP search space of the corresponding node that is a canonical adjacency matrix.
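The transfer in step S1431 can be sketched in Python as follows. The source does not state how the reordering is determined, so the sketch simply assumes that a permutation between the two graph node orderings is already known; the parameter layout and the function name are hypothetical.

```python
def move_sequences(noncanonical_space, permutation, canonical_space):
    """Reorder each number sequence and add it to the search space of the canonical node.

    permutation[i] is the position, in the non-canonical ordering, of the graph node
    that sits at position i in the canonical ordering (assumed to be known).
    """
    moved = [tuple(seq[p] for p in permutation) for seq in noncanonical_space]
    return sorted(set(canonical_space) | set(moved))   # merge without duplicates


noncanonical_space = [(4, 0, 2)]
canonical_space = [(1, 3, 5)]
print(move_sequences(noncanonical_space, (1, 2, 0), canonical_space))
```

After the transfer, the non-canonical node can be marked as an invalid subgraph without losing the occurrences it had found.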

Step S150: when the effective number of dictionary-ordered number sequences of graph nodes in the CSP search space of the current candidate subgraph is less than a set support threshold, mark the corresponding candidate subgraph as an invalid subgraph.

Here, a node corresponds to a candidate subgraph, that is, to a subgraph pattern. Calculating the support of the candidate subgraph g is equivalent to calculating the effective number (number of valid assignments) of each variable in the CSP model, that is, the number of number combinations in the CSP search space; in other words, for each variable in X, the corresponding CSP search space D must contain at least τ nodes, where τ is the support threshold. Unless otherwise specified, the number combinations in the CSP search space of a candidate subgraph that is not marked as an invalid subgraph can be considered valid (number combinations removed by deduplication are invalid).

In step S150, a candidate subgraph may include several graph nodes and several edges. The number combinations in its CSP search space differ in order, and different combinations correspond to different concrete subgraphs (a subgraph pattern may correspond to one or more concrete subgraphs, depending on the number combinations in the CSP search space). Each combination may correspond to one row of the matrix of the CSP search space, so the number of rows may equal the number of combinations, that is, the number of valid assignments of the candidate subgraph. This effective number may also be referred to as the support. A candidate subgraph that meets the support threshold may be retained; one that does not meet it undergoes no further subgraph growth and cannot be selected as a frequent subgraph, which achieves a pre-pruning effect.
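The pre-pruning of step S150 can be sketched in Python as follows, taking the effective number to be the number of distinct rows of the search space, as described above. The dictionary keyed by a pattern name and the function names are assumptions made for illustration.

```python
def effective_number(search_space):
    """Number of distinct number sequences (rows) in a CSP search space."""
    return len(set(search_space))


def prune(candidates, threshold):
    """Split candidates into kept and invalid patterns by the support threshold."""
    kept, invalid = {}, []
    for pattern, space in candidates.items():
        if effective_number(space) < threshold:
            invalid.append(pattern)                # marked as an invalid subgraph
        else:
            kept[pattern] = space                  # eligible for further subgraph growth
    return kept, invalid


candidates = {"a-x-b": [(0, 2), (1, 3)], "a-y-a": [(0, 1)]}
kept, invalid = prune(candidates, threshold=2)
print(kept)      # {'a-x-b': [(0, 2), (1, 3)]}
print(invalid)   # ['a-y-a']
```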

In a specific implementation, step S150 may specifically include the following step: calculate the product of the number of number combinations in the CSP search space of the candidate subgraph and the number of search manners available for the newly added graph node of the candidate subgraph when searching in the numbering order of the graph nodes in the canonical adjacency matrix of the single graph, to obtain the effective number of dictionary-ordered number sequences of graph nodes in the CSP search space of the candidate subgraph; and mark the corresponding candidate subgraph as an invalid subgraph when the effective number is less than the set support threshold.

Here, the newly added graph node may be the graph node added last to the candidate subgraph. The number of search manners available for the newly added graph node when searching in the numbering order of the graph nodes in the canonical adjacency matrix of the single graph may be the number of possible ways of searching backwards for an adjacent graph node of a graph node according to the canonical adjacency matrix of the single graph.

Step S160: when it is determined, according to the canonical adjacency matrix of the single graph, that the current candidate subgraph has not finished growing, perform subgraph growth according to the canonical adjacency matrix of the single graph from the nodes that contain the second number of edges and are not marked as invalid subgraphs, so as to update the leaf nodes of the suboptimal canonical adjacency matrix tree; and output the frequent subgraphs of the single graph according to the leaf nodes whose subgraph growth is finished.

In step S160, if subgraph growth is not finished, it can be continued using the methods of steps S130 to S150.

In an exemplary implementation, when it is determined according to the canonical adjacency matrix of the single graph that the current candidate subgraph has not finished growing, step S160 may specifically include the following steps. S161: when a sixth node that serves as a candidate subgraph and contains the second number of edges is a canonical adjacency matrix and there is another, seventh leaf node that contains the second number of edges and shares a parent node with it, perform an FFSM-Join operation on the sixth node and the seventh node according to the canonical adjacency matrix of the single graph, and perform subgraph growth to obtain a child node that has the sixth node and the seventh node as parent nodes and contains a third number of edges; when an eighth node that serves as a candidate subgraph and contains the second number of edges is not marked as an invalid subgraph, is a canonical adjacency matrix, and is an outer matrix, perform an FFSM-Extension operation on the eighth node according to the canonical adjacency matrix of the single graph, and perform subgraph growth to obtain a child node that has the eighth node as its parent node and contains the third number of edges; wherein the third number minus the second number equals one; the eighth node is the same as or different from the sixth node or the seventh node; and the child nodes containing the third number of edges become the leaf nodes of the suboptimal canonical adjacency matrix tree. S162: take the leaf nodes containing the third number of edges as new candidate subgraphs, and construct the CSP search space of each candidate subgraph according to the subgraph growth manner, the parent node of the candidate subgraph, and the CSP search space of the parent node. S163: when the effective number of dictionary-ordered number sequences of graph nodes in the CSP search space of the current new candidate subgraph is less than the set support threshold, mark the corresponding candidate subgraph as an invalid subgraph. S164: when it is determined according to the canonical adjacency matrix of the single graph that the current new candidate subgraph has finished growing, output the frequent subgraphs of the single graph according to the current new candidate subgraph.

In step S161, only subgraphs that are not marked as invalid subgraphs are grown, which is equivalent to pruning.

In the frequent subgraph mining method for a single graph of the above embodiments, generating the canonical adjacency matrix of the single graph and numbering the graph nodes makes it convenient to search the graph nodes and edges of the single graph sequentially and avoids repeated searching, which is equivalent to pruning. Constructing a suboptimal canonical adjacency matrix tree and growing subgraphs only from the nodes of the tree whose matrices meet the requirements is equivalent to further pruning. Moreover, pre-pruning by marking invalid subgraphs avoids the subsequent growth of invalid subgraphs. The embodiments of the invention can therefore greatly reduce the search space through pruning in several respects, address the lack of targeted optimization in aspects such as data structure and space occupation, and improve the efficiency of frequent subgraph mining.

To further reduce the search space, nodes of the suboptimal canonical adjacency matrix tree whose matrices list the graph nodes in an order that does not conform to the dictionary ordering may be pruned. Illustratively, the method shown in Fig. 1 may further include step S170: when the numbers of the graph nodes in the suboptimal canonical adjacency matrix corresponding to a leaf node containing the first number of edges do not conform to the dictionary ordering, mark that leaf node as an invalid subgraph.

In this case, in step S130, performing the FFSM-Extension operation on the third node according to the canonical adjacency matrix of the single graph, and performing subgraph growth to obtain a child node that has the third node as its parent node and contains the second number of edges, may specifically include the following step: when a third leaf node containing the first number of edges is a canonical adjacency matrix, is an outer matrix, and is not marked as an invalid subgraph, perform the FFSM-Extension operation on the third node according to the canonical adjacency matrix of the single graph, and perform subgraph growth to obtain a child node that has the third node as its parent node and contains the second number of edges. In this way, nodes marked as invalid subgraphs do not undergo Extension growth, which is equivalent to pruning.

The frequent subgraph mining method for a single graph of the above embodiments can run on a single machine or on a distributed system. Running on a distributed system can improve the execution speed.

In some embodiments, a distributed system may be used for subgraph growth, search space construction, and similar computations. Before this, the load of the computing nodes of the distributed system may be balanced.

For example, the frequent subgraph mining method for a single graph shown in Fig. 1 may further include step S180: distribute all frequent nodes, all frequent edges, and their child nodes in the initial suboptimal canonical adjacency matrix tree to the computing nodes of the distributed system, grouped by frequent node, so that each computing node performs subgraph growth under load balancing to obtain the nodes containing the second number of edges. In step S180, the graph nodes and edges of the single graph may be distributed as evenly as possible among the computing nodes of the distributed system to balance the load.

In other embodiments, a distributed system may be used for computations such as search space construction. After subgraph growth, the load of the computing nodes of the distributed system can be balanced according to the search spaces of the nodes.

For example, the frequent subgraph mining method for the single graph shown in fig. 1 may further include step S190: when, according to the canonical adjacency matrix of the single graph, the current candidate subgraph has not finished growing, distributing, according to the CSP search space of the candidate subgraph of the single graph, the parent nodes of the nodes containing the first number of edges in the suboptimal canonical adjacency matrix tree, together with the corresponding nodes containing the first number of edges and their child nodes, to the computing nodes of the distributed system, grouped by parent node, so that each computing node, under load balancing, performs subgraph growth according to the canonical adjacency matrix of the single graph from the nodes that contain the second number of edges and are not marked as invalid subgraphs.

In step S190, whether the current candidate subgraph can be expanded with a new edge can be checked by searching the canonical adjacency matrix of the single graph; if it can, subgraph growth has not been completed, and if it cannot, subgraph growth has been completed. If the first number is k, the nodes containing k edges and k+1 edges in the suboptimal canonical adjacency matrix tree can be divided into branches according to the nodes containing k-1 edges: each child node of a node containing k-1 edges, together with the child nodes of that child node, corresponds to one branch. Branches serve as units that are distributed across the computing nodes, and one computing node may hold one or more units, so that the amount of computation on each computing node is balanced. In addition, if a candidate subgraph is an invalid subgraph, it need not undergo subgraph growth, or it need not be allocated to a computing node at all, which amounts to pruning. While subgraph growth is incomplete, subgraph growth similar to step S130 can continue, together with search space construction, pruning and other steps similar to steps S140 and S150, to improve search efficiency.

In a specific implementation, the allocation in step S190 of the parent nodes and child nodes of the nodes containing the first number of edges in the suboptimal canonical adjacency matrix tree to the computing nodes of the distributed system, according to the CSP search space of the candidate subgraph of the single graph, may specifically include: S191, determining the pre-search space for CSP construction corresponding to a candidate subgraph from the count of node numbers lying between the number of the last graph node in the CSP search space of the node containing the first number of edges (the candidate subgraph) and the number of the last graph node in the canonical adjacency matrix of the single graph; and S192, grouping the pre-search spaces of the nodes containing the first number of edges by parent node and summing them, and distributing the nodes containing the first number of edges, their parent nodes and their child nodes to the computing nodes of the distributed system according to the grouped sums, so that the sums of the pre-search spaces handled on different computing nodes are approximately equal.

In step S191, for example, if the CSP search space of a node containing two edges consists of the number sequence (1,2,3) of the graph nodes it contains, and the number of the last graph node in the canonical adjacency matrix of the single graph is 10 (the graph may contain 10 graph nodes in total), then 10 minus 3 gives 7 as the pre-search space of that node. In general, the larger the pre-search space, the larger the amount of calculation required for subsequent subgraph growth, search space construction, and the like. In step S192, for example, if the parent node of the nodes containing two edges is a frequent node, then for each frequent node the sum of the search spaces of all child nodes (that is, the nodes containing two edges) of all child nodes of that frequent node can be calculated, and each frequent node, its child nodes and their child nodes can be allocated according to the sum corresponding to that frequent node, so as to equalize the search space sum assigned to each computing node.
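The arithmetic of steps S191 and S192 can be illustrated with a minimal Python sketch. The function names `pre_search_space` and `group_sums` and the sample candidate list are hypothetical and are introduced only for illustration; the sketch assumes each candidate is given as a pair of its grouping parent and the number sequence in its CSP search space.

```python
from collections import defaultdict

def pre_search_space(last_assigned: int, last_graph_node: int) -> int:
    """Count the graph-node numbers that lie after the last assigned number."""
    return last_graph_node - last_assigned

def group_sums(candidates, last_graph_node):
    """Sum the pre-search spaces of candidates per grouping parent.

    candidates: iterable of (parent, number_sequence) pairs (hypothetical layout).
    """
    totals = defaultdict(int)
    for parent, sequence in candidates:
        totals[parent] += pre_search_space(sequence[-1], last_graph_node)
    return dict(totals)

# Hypothetical candidates; only (1, 2, 3) with 10 graph nodes comes from the text.
candidates = [("A", (1, 2, 3)), ("A", (1, 2, 4)), ("B", (2, 3, 5))]
sums = group_sums(candidates, last_graph_node=10)
```

Here the first candidate reproduces the 10 minus 3 example above, and `sums` holds one total per parent that a scheduler could then use when assigning groups to computing nodes.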

Under the condition of realizing the scheme by using the distributed system, the traversal mode can be optimized.

For example, step S130 (performing an FFSM-Join operation on a first leaf node and a second leaf node that each contain the first number of edges, when the first node is a canonical adjacency matrix and the second node shares a parent node with it, and growing the subgraph to obtain child nodes that take the first node and the second node as parent nodes and contain a second number of edges) may specifically include: when a first leaf node containing the first number of edges is a canonical adjacency matrix and there is another, second leaf node containing the first number of edges that shares a parent node with the first node, using the distributed system to traverse the canonical adjacency matrix of the single graph with a push-pull dual-mode engine, performing the FFSM-Join operation on the first node and the second node, and growing the subgraph to obtain child nodes that take the first node and the second node as parent nodes and contain the second number of edges.

In addition, step S130 (performing an FFSM-Extension operation on a third leaf node containing the first number of edges when that node is a canonical adjacency matrix and an outer matrix, and growing the subgraph to obtain a child node that takes the third node as its parent node and contains a second number of edges) may specifically include: when the third leaf node containing the first number of edges is a canonical adjacency matrix and is an outer matrix, using the distributed system to traverse the canonical adjacency matrix of the single graph with a push-pull dual-mode engine, performing the FFSM-Extension operation on the third node, and growing the subgraph to obtain a child node that takes the third node as its parent node and contains the second number of edges.

The distributed communication scheduling of the distributed system can be implemented using non-blocking communication and dynamic release of thread resources. Distributed message communication and related aspects are thereby optimized, which addresses the lack of targeted optimization of distributed message communication in the prior art.

In one embodiment, the frequent subgraph mining method for the single graph comprises the following steps: searching all frequent subgraphs of the single graph through a distributed subgraph growth algorithm; the calculation efficiency of the distributed engine is improved by using a load balancing algorithm; and rapidly searching a frequent subgraph set meeting the conditions through a distributed support algorithm, and reducing the search space of the algorithm by utilizing a pre-search pruning and self-downward pruning algorithm.

Input of the distributed subgraph growth algorithm: the adjacency matrix M of the single large graph G. The nodes and edges of the single large graph are traversed; the nodes whose occurrence frequency is larger than a certain value form the frequent node set S (0) of the single large graph, and the edges whose occurrence frequency is larger than a certain value form the frequent edge set S (1).
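As a rough illustration of how S (0) and S (1) could be initialized by counting occurrences, the following Python sketch counts node labels and edge label pairs and keeps those whose count exceeds a minimum. It is a simplification under stated assumptions: the function name `frequent_sets`, the edge list and the value of `min_count` are hypothetical, edge labels are ignored, and an edge is identified only by the sorted pair of its end-point labels.

```python
from collections import Counter

def frequent_sets(labels, edges, min_count):
    """Return (frequent node labels, frequent edge label pairs).

    labels: dict mapping graph-node number -> node label.
    edges:  list of (u, v) graph-node number pairs (undirected).
    An item is kept when its occurrence count is larger than min_count.
    """
    node_counts = Counter(labels.values())
    s0 = {label for label, c in node_counts.items() if c > min_count}
    edge_counts = Counter(
        tuple(sorted((labels[u], labels[v]))) for u, v in edges
    )
    s1 = {pair for pair, c in edge_counts.items() if c > min_count}
    return s0, s1

# Node labels follow the six-node example; the edge list is hypothetical.
labels = {1: "A", 2: "B", 3: "B", 4: "C", 5: "D", 6: "A"}
edges = [(1, 2), (1, 3), (2, 3), (2, 5), (3, 5)]
s0, s1 = frequent_sets(labels, edges, min_count=1)
```

With `min_count=0` every label and every edge pattern present in the graph is kept, which corresponds to treating all of them as frequent.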

The subgraph growth and support calculation process takes as input the frequent subgraph set of k-1 edges, for example the frequent subgraph set of 2 edges; the frequent subgraph set with 0 edges and the frequent subgraph set with 1 edge are obtained through initialization. The distributed subgraph growth algorithm outputs the newly generated frequent subgraph set S (k).

The process of subgraph growth using the distributed subgraph growth algorithm is as follows: frequent subgraphs (suboptimal canonical adjacency matrices, SCAMs) in the frequent subgraph set S (k-1) are placed in the same group if they share a common parent CAM (canonical adjacency matrix). The following operations are performed on each group of frequent subgraphs that share a common parent CAM: an FFSM-Join operation is performed on every pair of matrices (SCAMs) of the frequent subgraphs in the group to obtain new k-edge frequent subgraphs, which are added to the frequent subgraph set S (k); if the CAM (a SCAM in the first round) of a frequent subgraph in the group is an outer matrix, an FFSM-Extension operation is performed on that frequent subgraph to generate the corresponding new k-edge frequent subgraphs, which are added to the frequent subgraph set S (k). The frequent subgraphs in the k-edge frequent subgraph set S (k) are then renumbered, and S (k) is output.
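The grouping and scheduling part of this process can be sketched in Python as follows. The sketch only plans which operations would be applied; it does not implement FFSM-Join or FFSM-Extension themselves. The record layout (`id`, `parent`, `is_cam`, `is_outer`) and the function name `plan_growth` are hypothetical, and pairing each two distinct group members once is an assumption about what "every pair" means.

```python
from collections import defaultdict
from itertools import combinations

def plan_growth(subgraphs):
    """Group S(k-1) members by common parent and plan the growth operations.

    subgraphs: list of dicts with hypothetical keys
               'id', 'parent', 'is_cam', 'is_outer'.
    Returns (join_pairs, extension_ids).
    """
    groups = defaultdict(list)
    for sg in subgraphs:
        groups[sg["parent"]].append(sg)

    join_pairs, extension_ids = [], []
    for members in groups.values():
        # Join candidates: every pair of siblings under the same parent.
        for a, b in combinations(members, 2):
            join_pairs.append((a["id"], b["id"]))
        # Extension candidates: canonical matrices that are outer matrices.
        for sg in members:
            if sg["is_cam"] and sg["is_outer"]:
                extension_ids.append(sg["id"])
    return join_pairs, extension_ids

# Hypothetical S(1) members used only to exercise the planner.
s_prev = [
    {"id": "AB", "parent": "A", "is_cam": True, "is_outer": True},
    {"id": "AC", "parent": "A", "is_cam": True, "is_outer": True},
    {"id": "BB", "parent": "B", "is_cam": True, "is_outer": False},
]
joins, extensions = plan_growth(s_prev)
```

Because groups are independent of one another, a plan of this shape is also a natural unit for distributing work across computing nodes.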

Input of the distributed support algorithm: the subgraph set G (the set of all subgraphs g) and the preset support threshold τ. A CSP (constraint satisfaction problem) search space Dg is constructed from the subgraph set G; each candidate subgraph g (obtained by the CSP method) corresponds to one search space Dg.

Input of the load balancing algorithm: the candidate subgraph set S on each Partition (distributed server node). The candidate subgraph set is sorted by CSP search space, and the frequent subgraph sets on each Partition (server node) are reallocated so that the CSP search space on each Partition (the sum of the Dg corresponding to all g in S) is approximately equal in scale. Output of the load balancing algorithm: the adjusted candidate subgraph set S_balance of each Partition.

The process of searching for frequent subgraphs that meet the conditions using the distributed server nodes and the distributed support algorithm is as follows: if the number of adjacent nodes of a variable v in the search space Dg is larger than one, the redundant adjacent nodes are removed according to the node uniqueness constraint; if the number of valid assignments of the variable v in the search space Dg (the number of nodes in the search space) is less than the support threshold τ, the support requirement is not met and false is returned (otherwise true is returned); the invalid assignment schemes in the search space Dg are marked (as nodes not to be searched), and the search spaces of the child SCAMs of the marked nodes need not be searched.

In addition, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to any of the above embodiments when executing the program.

Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method according to any of the above embodiments.

The following description will explain embodiments of the present invention by way of specific examples to enable those skilled in the art to better understand the present invention.

In a particular embodiment, large scale single graph frequent subgraph mining is implemented using distributed computing engines. The method realizes frequent subgraph enumeration by constructing an SCAM tree (suboptimal canonical adjacency matrix), realizes frequent subgraph screening by incrementally calculating the support degree in the growth process of the subgraph, and supports quick search of frequent subgraphs of a single big graph in a distributed environment. The main processes comprise the generation of CAM (canonical adjacency matrix) of a single large graph, the initialization of SCAM tree, subgraph growth (supporting single machine and distributed environments), incremental support degree calculation, SCAM tree pre-pruning and dynamic load balancing. Specifically, referring to fig. 2, the method of this embodiment includes the following steps (hereinafter, "node" means a node of a SCAM tree or a node of a single graph (graph node), and the specific meaning is determined according to the detailed description):

S1, initializing the adjacency matrix of the single graph: a unique CAM (canonical adjacency matrix) of the large graph is generated using a matrix dictionary sorting algorithm and is used as the data index matrix in the subgraph growth and incremental support calculation processes; the graph nodes in the initialized CAM matrix are renumbered so that the node numbers of the CAM matrix increase sequentially along the rows or columns;

S2, traversing the CAM matrix in sequence during the growth of the SCAM tree; in a distributed environment, the CAM matrix can be load balanced, with the node and edge data of the CAM distributed across different computing nodes;

S3, calculating the subgraph support using a CSP model, that is, computing the set of frequent subgraphs that satisfy the support threshold τ within the domain space D of the CSP model; the space D is constructed incrementally and pre-pruned to reduce the search space of the algorithm;

and S4, after the incremental support has been calculated, if subgraph growth has not finished, the distributed computing engine performs dynamic load balancing according to the scale of the domain space D on the distributed nodes, so as to optimize the computing resources of the distributed cluster and improve the efficiency of the algorithm.

The distributed graph computing engine uses a push-pull dual-mode engine and CAM indexes. As shown in FIG. 3, both subgraph growth and support calculation need to traverse the adjacency lists of nodes; depending on the number of active nodes, push mode is used when the active nodes are sparse and pull mode is used when the active nodes are dense, which improves graph traversal efficiency. In addition, in a distributed computing environment, distributed communication scheduling uses non-blocking communication and dynamic release of thread resources to overlap computation and communication effectively, improving the overall efficiency of the distributed engine.
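The push-pull switch can be illustrated with a single-machine Python sketch of one traversal step. This is not the engine described here; the function names, the `dense_ratio` threshold of 0.05 and the sample adjacency list are hypothetical, and the rule "pull when the active fraction exceeds a fixed ratio" is an assumed heuristic.

```python
def choose_mode(num_active, num_nodes, dense_ratio=0.05):
    """Pick 'pull' when the active set is dense, otherwise 'push' (assumed rule)."""
    return "pull" if num_active > dense_ratio * num_nodes else "push"

def traversal_step(adj, active, visited, mode):
    """Return the next frontier of an undirected graph given as an adjacency list."""
    frontier = set()
    if mode == "push":
        # Push: each active node sends along its own adjacency list.
        for u in active:
            for v in adj[u]:
                if v not in visited:
                    frontier.add(v)
    else:
        # Pull: each unvisited node checks whether any neighbour is active.
        for v in adj:
            if v not in visited and any(u in active for u in adj[v]):
                frontier.add(v)
    return frontier

# Hypothetical four-node graph.
adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
pushed = traversal_step(adj, active={1}, visited={1}, mode="push")
pulled = traversal_step(adj, active={1}, visited={1}, mode="pull")
```

Both modes produce the same frontier; they differ in which side does the work, which is why the choice depends on how many nodes are active.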

In step S1, the building of the SCAM tree may specifically include the following steps:

S11, the input for constructing the SCAM tree is the adjacency matrix M of the single large graph G; a CAM matrix is generated through dictionary sorting, where the CAM matrix uniquely represents the large graph and the graph node numbers on the CAM matrix are arranged in ascending order;

for example, as shown in FIG. 4, a single large graph G contains 6 nodes, i.e., A, B, B, C, D, A, with four node labels A, B, C, D, and the 6 nodes are sorted in the dictionary order of the node labels to generate an adjacency matrix M, as shown in FIG. 5 as a "canonical adjacency matrix," where x, y, z, x are all edges in a single graph G, and the same edge label indicates that the same attribute is present. Numbering the 6 nodes in sequence according to columns to obtain the serial numbers (numbers) of the graph nodes, namely 1,2,3,4,5 and 6;

S12, performing first-stage load balancing after the CAM is generated, in which the CAM is distributed to different nodes of the computing cluster according to an even-distribution principle;

S13, traversing the rows or columns of the CAM matrix and extracting the nodes with the same labels to generate the frequent node set S (0) of the large graph G, where the domain space (CSP search space) of each element of S (0) in the CSP consists of the node numbers of that element; the parent node of the frequent node set S (0) in the SCAM tree is a NULL node;

for example, as shown in fig. 5, the root node in the SCAM tree is node NULL, the frequent node set S (0) is the child node of node NULL, and contains node A, B, C, D (node label), and each tree node (graph node label) corresponds to a domain space (matrix) (CSP search space) represented by a sequence number, e.g., a corresponds to 1 and 6, B corresponds to 2 and 3, C corresponds to 4, and D corresponds to 5.
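The mapping from node labels to domain spaces in this example can be reproduced with a short Python sketch. It assumes the column order of the canonical adjacency matrix (A, B, B, C, D, A) is already given, as in the example; the function name `label_domains` is hypothetical.

```python
def label_domains(cam_order):
    """Number graph nodes 1..n in CAM column order and group numbers by label."""
    domains = {}
    for number, label in enumerate(cam_order, start=1):
        domains.setdefault(label, []).append(number)
    return domains

# Column order of the six-node example graph.
domains = label_domains(["A", "B", "B", "C", "D", "A"])
```

Each key of `domains` plays the role of one child of the NULL root, and its list is the domain space attached to that tree node.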

S14, traversing the adjacency lists of the frequent nodes in S (0), and generating the frequent edge set S (1) of the SCAM tree with the corresponding frequent node labels as parent nodes; the domain space of S (1) in the CSP is the end-point sequence number (the number of the end-point graph node) of each edge in the corresponding vertex adjacency list (only edges whose end-point sequence number is larger than that of the starting point are selected, that is, the adjacency list is taken from below or to the right of the diagonal elements in the CAM matrix of the single graph G); for example, as shown in fig. 5, the child nodes of the frequent node set S (0) form the frequent edge set S (1), which includes nodes AB (corresponding to domain spaces (combinations of numbering sequences) 1,2 and 2,3), AC (corresponding to domain spaces 1, 4), BB (corresponding to domain spaces 2,3), BD (corresponding to domain spaces 2, 5 and 3, 5), CA (corresponding to domain spaces 4, 6), and DA (corresponding to domain spaces 5, 6). The node labels in nodes CA and DA do not follow dictionary order, so these nodes are marked as invalid subgraphs (dotted frames); they no longer undergo Extension operations but can still participate in Join operations;

S15, after the frequent node set S (0) and the frequent edge set S (1) are generated, the initialization process of the SCAM tree is complete; as shown in fig. 5, the root node, the frequent node set S (0) and the frequent edge set S (1) of the SCAM tree are obtained through initialization.

The subgraph growth algorithm takes the elements of the subgraph set (node set) S (K-1) containing K-1 edges in the SCAM tree as parent nodes and generates the child nodes that form the subgraph set S (K) containing K edges. After the SCAM tree is initialized, the frequent node set S (0) and the frequent edge set S (1) are known; the subgraph set S (K) is the set of subgraphs (nodes of the SCAM tree) with K edges. The node matrices generated in the SCAM tree are all CAMs or SCAMs, and the elements of the frequent node set S (0) and the frequent edge set S (1) can be regarded as special cases of CAM outer matrices.

In the step S2, the SCAM tree growing (sub-graph growing) process may specifically include the following steps:

S21, this process carries out SCAM tree growth for K >= 2; all node matrices in S (K) take node matrices in S (K-1) as parent matrices, where only CAM nodes can serve as parent matrices and SCAM nodes only participate in subgraph growth operations; fig. 5 to 8 show the subgraph matrices and CSP search space matrices of the subgraph set S (2) containing 2 edges, the subgraph set S (3) containing 3 edges, the subgraph set S (4) containing 4 edges, the subgraph set S (5) containing 5 edges, the subgraph set S (6) containing 6 edges, the subgraph set S (7) containing 7 edges, and the subgraph set S (8) containing 8 edges. The label j3.1 indicates that an FFSM-Join operation is performed for subgraph growth according to the first mode or the third mode, the label e indicates that an FFSM-Extension operation is performed for subgraph growth, the label j2 indicates that an FFSM-Join operation is performed for subgraph growth according to the second mode or the fourth mode, a node drawn with a dashed box no longer undergoes Extension operations but can still undergo Join operations, and the label "pruning" indicates that an invalid subgraph needs to be pruned. In addition, each subgraph drawn with a dashed box has an arrow pointing to the solid-box matrix under the common parent node that represents it, and the search space of the dashed-box subgraph is added to the matrix the arrow points to.

S22, in the step S21, selecting a CAM matrix in S (K-1) as a father node, and performing FFSM-Join and FFSM-Extension operations to realize subgraph growth, wherein the specific growth operation can be as follows:

1. If the selected parent node and another SCAM in S (K-1) have a common parent node (that is, a node in S (K-2)), the two matrices undergo an FFSM-Join operation, and the subgraph selected as the parent matrix expands by 1 edge taken from the other matrix to form a node in S (K);

2. When the selected parent node is an outer matrix, an FFSM-Extension operation can be performed on the matrix: the parent node expands by 1 edge taken from the adjacency list of the node ranked last in the parent node matrix to form a node in S (K); this process repeats as subgraph growth continues, until all suitable edges in the adjacency list have been added (in the CSP domain space D, the end-point sequence number of a newly added edge is larger than the sequence number of its starting point);

3. In the two operations 1 and 2, the operations can be carried out in the order of formula (1), formula (2), formula (3) (case one), formula (4) (case two) and formula (5), and the generated subgraphs are arranged in ascending order of the characters in the CAM, so that the newly generated subgraphs are complete and free of redundancy;

S23, when the operations of step S22 stop, the growth process of S (K) ends, and the next round of subgraph growth can begin after step S4.

In step S3, the specific process of calculating the increment support degree is as follows:

S31, calculating the subgraph support, where the inputs are the candidate subgraph g, the parent node g_{k-1} of the candidate subgraph g (one or two in number, since there are two parents if g is obtained by an FFSM-Join operation), the CSP search space D(g_{k-1}) of g_{k-1}, and the support threshold τ;

S32, the incremental construction of the domain space D in the CSP model takes place in both the SCAM initialization stage (step S1) and the subgraph growth stage (step S2); the CSP search space D(g_k) can be constructed from D(g_{k-1}) and the CAM of the single graph G, mainly in the following cases:

1. If the candidate subgraph g is generated from two parent nodes g_{k-1} by the operations of formula (1) and formula (3) (case one) in FFSM-Join, the greatest common subset of the two D(g_{k-1}) spaces is taken directly as D(g_k) (duplicates are removed where different node sequence number combinations of the same subgraph pattern correspond to the same specific subgraph). As shown in fig. 7 and 9, two nodes (subgraphs) containing graph nodes A, B, B, C, D and 5 edges correspond to the CSP search spaces ((1,2,3,4,5), (1,3,2,4,5)) and (1,2,3,4,5) respectively, where the number combination (1,3,2,4,5) in the first search space is a transformation of (1,2,3,4,5). When FFSM-Join is performed with these two nodes as parent nodes to obtain a node containing graph nodes A, B, B, C, D and 6 edges, the search space of the resulting node is obtained by de-duplicating, or taking the greatest common subset of, the CSP search spaces of the two parent nodes; that is, (1,3,2,4,5), being a transformation of (1,2,3,4,5), is disregarded, and the CSP search space (1,2,3,4,5) is obtained for the node containing graph nodes A, B, B, C, D and 6 edges.

2. If the candidate subgraph g is generated from two parent nodes g_{k-1} by the operations of formula (2) and formula (4) (case two) in FFSM-Join, the greatest common subset is taken over the common part of matrix A, corresponding to one parent node, and matrix B, corresponding to the other parent node, and the search space of the remaining elements of matrix B is then merged in to give D(g_k). As shown in fig. 7 and 10, for the node containing graph nodes A, B, B, C, D and 5 edges obtained by the FFSM-Join operation (j3.1) and the node containing graph nodes A, B, B, C, D, A and 5 edges, the common part of the matrices of the two nodes (the part containing graph nodes A, B, B, C, D and 5 edges) corresponds to the common part of the search spaces (1,2,3,4,5) and (1,2,3,4,5,6), namely (1,2,3,4,5); the search space 6 of the remaining element A of the latter node is then merged in, giving the CSP search space (1,2,3,4,5,6) of the node that contains graph nodes A, B, B, C, D, A and 6 edges and is generated by the FFSM-Join operation with the two nodes as parent nodes.

3. If the candidate subgraph g is generated by an FFSM-Extension operation, the search space D(g_{k-1}) of the parent node g_{k-1} is retained, and the end-point sequence number of the newly added edge is filled into the last column of D(g_{k-1}) to generate D(g_k);

4. When the adjacency matrix of the candidate subgraph g is a non-CAM matrix, if other subgraphs precede the candidate subgraph g under the same parent node, a subgraph g' with the CAM matrix corresponding to g can certainly be found, and the search space D(g_k) generated for the candidate subgraph g needs to be added to the search space D(g'_k) of the CAM-matrix subgraph g'.

In cases 1 and 2, taking the greatest common subset can reduce the newly generated search space, which can be regarded as a pruning process. In cases 3 and 4, the newly generated search space can grow, increasing the number of subgraphs.
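Two of the construction cases above (the common-subset case for Join and the last-column fill for Extension) can be sketched in Python. This is a simplified illustration, not the method itself: search spaces are modelled as lists of number tuples, label and edge-label constraints are ignored, the function names `join_space` and `extend_space` and the parameter `attach_col` are hypothetical, and the adjacency list in the second example is invented.

```python
def join_space(d_a, d_b):
    """Case 1 (simplified): keep the rows of d_a that also occur in d_b."""
    rows_b = set(d_b)
    return [row for row in d_a if row in rows_b]

def extend_space(d_parent, adj, attach_col):
    """Case 3 (simplified): append the end-point number of a new edge as a last column.

    adj:        dict mapping graph-node number -> list of neighbour numbers.
    attach_col: column of the node from which the new edge starts (assumed).
    Only end points with a larger number than the starting point are used,
    and a graph node may appear only once per row.
    """
    rows = []
    for row in d_parent:
        start = row[attach_col]
        for end in adj.get(start, ()):
            if end > start and end not in row:
                rows.append(row + (end,))
    return rows

# Search spaces of the two 5-edge parents from the example above.
joined = join_space([(1, 2, 3, 4, 5), (1, 3, 2, 4, 5)], [(1, 2, 3, 4, 5)])

# Hypothetical adjacency list for the extension example.
extended = extend_space([(1, 2), (1, 3)], {2: [1, 3, 5], 3: [1, 2, 5]}, attach_col=1)
```

The join shrinks the space (one row survives), while the extension can enlarge it (two parent rows become three), matching the remark that the Join cases prune and the Extension case can increase the number of rows.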

S33, if the support threshold condition is not met, the incremental support calculation outputs false; specifically, if the number of valid assignments of D(g_k) in the search space is less than the threshold τ, false is returned. The number of valid assignments of D(g_k) equals the number of rows of the D(g_k) matrix, or the product of the number of search spaces and the number of elements in the domain of the last newly added node label. For example, the D(g_k) matrix of one node shown in fig. 9 contains two rows, (1,2,3,4,5) and (1,3,2,4,5), so the number of rows (the number of search spaces) is 2; searching backward from the last newly added node label D finds only one edge x (whose end-point graph node is A), that is, the domain of that graph node has size 1, so the number of valid assignments of D(g_k) is 2 multiplied by 1, that is, 2.

S34, after this round of support calculation is completed, the invalid assignment schemes in the search space (that is, the SCAM nodes that returned false) are marked, and no subgraph growth is performed from such a node onward (the node itself included); this is the pre-pruning process;

S35, if the support threshold condition is met, the incremental support calculation outputs true; specifically, if the number of valid assignments of D(g_k) in the search space is greater than or equal to the threshold τ, true is returned.
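The threshold test of steps S33 to S35 can be sketched in Python as follows. The sketch assumes that a search space is a list of number tuples and that the domain size of the last newly added node label is supplied by the caller; the names `valid_assignment_count` and `split_by_support` and the candidate list are hypothetical.

```python
def valid_assignment_count(rows, last_domain_size=1):
    """Rows of the search-space matrix times the domain size of the last added label."""
    return len(set(rows)) * last_domain_size

def split_by_support(candidates, tau):
    """Separate candidates into frequent ones and ones marked invalid (pre-pruned).

    candidates: list of (name, rows, last_domain_size) triples (hypothetical layout).
    """
    frequent, invalid = [], []
    for name, rows, last_domain_size in candidates:
        if valid_assignment_count(rows, last_domain_size) >= tau:
            frequent.append(name)
        else:
            invalid.append(name)  # no further growth from this node
    return frequent, invalid

# First entry uses the two-row example; the second entry is hypothetical.
candidates = [
    ("g1", [(1, 2, 3, 4, 5), (1, 3, 2, 4, 5)], 1),
    ("g2", [(1, 2, 3, 4, 5)], 1),
]
frequent, invalid = split_by_support(candidates, tau=2)
```

The names collected in `invalid` correspond to the SCAM nodes that would be marked and skipped in later rounds of growth.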

In step S4, the second-stage load balancing may be performed once after each round of growth, or once when the load imbalance reaches a certain degree; the specific steps may be as follows:

s41, inputting a candidate sub-graph set S (K) on Partition (computing node) in load balancing; wherein, a candidate subgraph refers to a subgraph mode (corresponding to a subgraph matrix);

S42, the main factor affecting the load balance of the candidate subgraph set S (K) is that the FFSM-Extension process of subgraph growth must traverse the edges, in the adjacency list of the last node in D(g), whose end-point sequence number is larger than the largest sequence number currently assigned. This search space is referred to as the pre-search space for CSP construction;

S43, the output goal of load balancing is to adjust the candidate subgraph set S' (K) of each Partition so that the CSP pre-search space scales of all nodes in S' (K) are approximately the same;

S44, the CSP pre-search space of each element in the candidate subgraph set S (K) is approximately equal to N-Vmax, where N is the total number of nodes in the current graph G, and Vmax is the minimum, taken over all assignments in D(g), of the maximum node sequence number of each assignment.

S45, grouping and summing the CSP pre-search spaces in the candidate subgraph set S (K) according to S (K-2), and re-partitioning so that the total CSP pre-search spaces of the partitions are approximately equal;

and S46, redistributing the frequent subgraph sets in each Partition according to the result of the repartitioning.
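Steps S44 to S46 can be sketched in Python as follows. The text does not fix a partitioning rule, so the sketch substitutes a simple greedy heuristic (largest group first, onto the currently lightest partition); that heuristic, the function names `v_max` and `rebalance`, and the sample loads are assumptions made only for illustration.

```python
import heapq

def v_max(search_space):
    """Minimum over all assignments of the largest node number in each assignment."""
    return min(max(row) for row in search_space)

def rebalance(group_loads, num_partitions):
    """Greedy re-partitioning (assumed heuristic, not prescribed by the text).

    group_loads: dict mapping a group key (e.g. an S(K-2) ancestor) to the
                 summed pre-search space of its candidates.
    Returns a dict mapping each group key to a partition index.
    """
    heap = [(0, p) for p in range(num_partitions)]  # (current load, partition)
    heapq.heapify(heap)
    assignment = {}
    for group, load in sorted(group_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, partition = heapq.heappop(heap)
        assignment[group] = partition
        heapq.heappush(heap, (total + load, partition))
    return assignment

# Pre-search space N - Vmax for a hypothetical search space with N = 10.
space = [(1, 2, 3, 4, 5), (1, 3, 2, 4, 6)]
pre_search = 10 - v_max(space)

# Hypothetical grouped sums spread over two partitions.
assignment = rebalance({"g1": 7, "g2": 5, "g3": 4, "g4": 2}, num_partitions=2)
```

In this example both partitions end up with a total pre-search space of 9; other balancing strategies would serve equally well as long as the per-partition totals come out approximately equal.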

In the frequent subgraph mining method of this embodiment, SCAM tree growth is driven by the ordered CAM matrix, and node ordering makes it easy to extend the traditional FFSM algorithm to a distributed computing environment. In implementing the FFSM algorithm, traversing the CAM matrix in sequence and searching backward from the tail node effectively reduces the search space for subgraph growth, and incremental support calculation allows subgraph growth and support calculation to be computed jointly, whereas in the traditional FFSM algorithm the search space for support calculation is large and its complexity is high. In a distributed environment, two-stage load balancing (the first stage balances the CAM matrix, and the second stage dynamically adjusts the balance of the CSP search space), the push-pull dual-mode computing mode (for distributed adjacency list access), and non-blocking messaging with dynamic thread resource management together overlap computation with communication and improve the efficiency of the algorithm.

In summary, according to the frequent subgraph mining method, the computer device and the computer-readable storage medium for the single graph in the embodiments of the present invention, the canonical adjacency matrix of the single graph is generated and the graph nodes are numbered, so that the graph nodes and edges of the single graph can be searched in sequence and repeated searches are avoided, which is equivalent to pruning. Constructing a suboptimal canonical adjacency matrix tree and growing subgraphs only from the tree nodes that meet the matrix requirements is equivalent to further pruning. Moreover, pre-pruning the marked invalid subgraphs avoids the growth of subsequent invalid subgraphs. The embodiments of the present invention can therefore greatly reduce the search space and improve the efficiency of frequent subgraph mining through pruning in several respects.

In the description herein, reference to the description of the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," "an example," "a particular example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The sequence of steps involved in the various embodiments is provided to schematically illustrate the practice of the invention, and the sequence of steps is not limited and can be suitably adjusted as desired.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A frequent subgraph mining method for a single graph is characterized by comprising the following steps:

2. The method of frequent subgraph mining of a single graph in accordance with claim 1, wherein generating an initial sub-optimal canonical adjacency matrix tree by searching the canonical adjacency matrices of the single graph in the order of the numbering of the graph nodes comprises:

3. The frequent subgraph mining method of a single graph according to claim 1,

the method further comprises the following steps:

4. The method of claim 1, wherein, in a case where a first node among the leaves containing the first number of edges is a canonical adjacency matrix and there is a second node among those leaves that shares a parent node with the first node, performing an FFSM-Join operation on the first node and the second node according to the canonical adjacency matrix of the single graph, and performing subgraph growth to obtain child nodes that have the first node and the second node as parent nodes and contain a second number of edges, comprises:

if the first node and the second node are both inner matrices, then if and only if f ≠ k, performing subgraph growth in a first mode to obtain an adjacency matrix, denoted C_m×m, corresponding to the child node that has the first node and the second node as parent nodes and contains the second number of edges, wherein the elements of the adjacency matrix C_m×m are represented as:

wherein c_i,j represents the element in the i-th row and j-th column of the adjacency matrix corresponding to the child node, a_i,j represents the element in the i-th row and j-th column of the canonical adjacency matrix corresponding to the first node, and b_i,j represents the element in the i-th row and j-th column of the adjacency matrix corresponding to the second node; the last edge in the canonical adjacency matrix corresponding to the first node is represented as a_m,f, where m and f respectively represent the number of rows and columns of the canonical adjacency matrix corresponding to the first node and the total number of edges; and the last edge in the adjacency matrix corresponding to the second node is represented as b_n,k, where n and k respectively represent the number of rows and columns of the adjacency matrix corresponding to the second node and the total number of edges;

if the first node is an inner matrix and the second node is an outer matrix, performing subgraph growth in a second mode to obtain an adjacency matrix, denoted C_n×n, corresponding to the child node that has the first node and the second node as parent nodes and contains the second number of edges, wherein the elements of the adjacency matrix C_n×n are represented as:

if the first node and the second node are both outer matrices, then if and only if f ≠ k ∧ a_m,m = b_m,m, performing subgraph growth in a third mode to obtain an adjacency matrix, denoted C_m×m, corresponding to the child node that has the first node and the second node as parent nodes and contains the second number of edges, wherein the elements of the adjacency matrix C_m×m are represented as:

if the first node and the second node are outer matrices, performing subgraph growth in a fourth mode to obtain an adjacency matrix, denoted D_(m+1)×(m+1), corresponding to the child node that has the first node and the second node as parent nodes and contains the second number of edges, wherein the elements of the adjacency matrix D_(m+1)×(m+1) are represented as:

wherein d_i,j represents the element in the i-th row and j-th column of the adjacency matrix corresponding to the child node, a_i,j represents the element in the i-th row and j-th column of the canonical adjacency matrix corresponding to the first node, b_m,j represents the element in the m-th row and j-th column of the adjacency matrix corresponding to the second node, b_m,m represents the element in the m-th row and m-th column of the adjacency matrix corresponding to the second node, and m represents the number of rows and columns of the canonical adjacency matrix corresponding to the first node and of the adjacency matrix corresponding to the second node.

5. The method of frequent subgraph mining of a single graph according to claim 4, wherein constructing the CSP search spaces of the respective candidate subgraphs according to the subgraph growth mode and according to the parent nodes of the candidate subgraphs and the CSP search spaces of the parent nodes, with the nodes of the leaves containing the second number of edges as the candidate subgraphs, comprises:

6. The method of frequent subgraph mining of a single graph according to claim 1, wherein constructing CSP search spaces of respective candidate subgraphs according to a subgraph growth mode and according to parent nodes of the candidate subgraphs and CSP search spaces of the parent nodes, with nodes of leaves containing a second number of edges as the candidate subgraphs, comprises:

7. The method of frequent subgraph mining of a single graph according to claim 1, wherein constructing CSP search spaces of respective candidate subgraphs according to a subgraph growth mode and according to parent nodes of the candidate subgraphs and CSP search spaces of the parent nodes, with nodes of leaves containing a second number of edges as the candidate subgraphs, comprises:

8. The frequent subgraph mining method of a single graph according to any one of claims 1 to 7, further comprising:

9. The frequent subgraph mining method of a single graph according to any one of claims 1 to 7, further comprising:

10. The method of frequent subgraph mining of a single graph in accordance with claim 9, wherein assigning parent nodes and their child nodes of nodes in a sub-optimal canonical adjacency matrix tree that contain a first number of edges to compute nodes in a distributed system based on the CSP search space of the candidate subgraph of the single graph comprises:

determining a pre-search space corresponding to the candidate subgraph and used for constructing the CSP according to the number of node numbers between the number of the last graph node in the CSP search space of the node containing the first number of edges as the candidate subgraph and the number of the last graph node in the canonical adjacency matrix of the single graph;

11. The method of frequent subgraph mining of a single graph in accordance with claim 1, wherein in the event that the number of valid lexicographically ordered combinations of numbers of graph nodes in the CSP search space of the current candidate subgraph is less than a set support threshold, marking the corresponding candidate subgraph as an invalid subgraph comprises:

12. The method of claim 1, wherein in case that the current candidate subgraph is not a completely grown subgraph according to the canonical adjacency matrix of the single graph, performing subgraph growth according to the canonical adjacency matrix of the single graph and according to nodes containing a second number of edges that are not marked as invalid subgraphs to update leaf nodes of a suboptimal canonical adjacency matrix tree, and outputting the frequent subgraph of the single graph according to the leaf nodes of the completely grown subgraph, comprises:

13. The frequent subgraph mining method of a single graph according to claim 1,

wherein, in a case where a first node among the leaves containing the first number of edges is a canonical adjacency matrix and there is a second node among those leaves that shares a parent node with the first node, performing an FFSM-Join operation on the first node and the second node according to the canonical adjacency matrix of the single graph, and performing subgraph growth to obtain child nodes that have the first node and the second node as parent nodes and contain a second number of edges, comprises:

14. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 13 are implemented when the program is executed by the processor.

15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 13.