CN114661927A

CN114661927A - Frequent subgraph mining method based on community detection

Info

Publication number: CN114661927A
Application number: CN202210382776.9A
Authority: CN
Inventors: 袁野; 张义; 马德龙; 马玉亮
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2022-04-13
Filing date: 2022-04-13
Publication date: 2022-06-24

Abstract

The invention provides a frequent subgraph mining method based on community detection. Community detection is performed on the preprocessed graph data, the vertices in each community are clustered according to their vertex labels, and the same proportion of vertices is extracted from each label set to obtain the initial set for frequent subgraph mining. Subgraph expansion is then carried out from the initial set. After each round of expansion the generated candidate subgraphs are pruned so that infrequent subgraphs are deleted promptly, which reduces subsequent invalid expansion and improves expansion and mining efficiency. Isomorphism detection is carried out efficiently during pruning, which reduces execution time and further improves subgraph mining efficiency. In addition, the invention verifies the efficiency of the method through comparison experiments.

Description

Frequent subgraph mining method based on community detection

Technical Field

The invention belongs to the technical field of computer networks, and particularly relates to a frequent subgraph mining method based on community detection.

Background

With the rapid development of information technology, the analysis and mining of big data play an increasingly important role in production and daily life; however, mining valuable information from massive data remains a significant challenge. In many domains, data are modeled as entities whose attributes are represented by labels and whose relationships are linked through those attributes, forming complex graph data structures. The graph is a universal data structure that can conveniently and efficiently express the basic attributes of each entity and the complex interrelations among entities. As the amount of data increases, graphs grow rapidly in size and become structurally more complex. Frequent subgraph mining is one of the hotspots of graph analysis research and is widely applied to social networks, biological information networks, business networks and the like.

Frequent subgraph mining can be performed on a set of graphs or on a single large graph, and most existing work targets graph sets. However, subgraph isomorphism is an NP-complete problem and is a subproblem of frequent subgraph mining, so mining frequent subgraphs from a single large graph is computationally very expensive. Traditional frequent subgraph mining methods use all frequent vertices or all frequent edges as the initial set for subgraph expansion, so a large number of repeated subgraphs are generated during mining, which causes high computational cost and high memory consumption. Reducing the size of the initial set effectively reduces the generation of repeated subgraphs, so that correct solutions can be mined in a shorter time.

At present, frequent subgraph mining methods fall into two main types: methods based on the Apriori idea, which adopt a breadth-first search strategy, and methods based on FP-growth pattern growth, which adopt a depth-first search strategy. (1) Apriori-based methods follow the generate-and-test idea: candidate (k+1)-subgraphs are generated from frequent k-subgraphs, and the downward closure property (if any k-subgraph of a (k+1)-subgraph is infrequent, then the (k+1)-subgraph must also be infrequent) is used to obtain the set of frequent (k+1)-subgraphs. For example, the method proposed by A. Inokuchi et al. in "An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data" (AGM for short) mines all frequent subgraphs that satisfy the frequent threshold based on recursive counting. The method proposed by Kuramochi and Karypis in "An Efficient Algorithm for Discovering Frequent Subgraphs" (FSG for short) improves the expansion and pruning of the AGM method and can effectively find frequent subgraphs in small graphs. (2) Pattern-growth methods generate candidate (k+1)-subgraphs by expanding a frequent k-subgraph by one vertex or one edge at every possible position. For example, the gSpan (graph-based Substructure pattern mining) method proposed by X. Yan et al. is a classical pattern-growth method. gSpan introduces the concept of rightmost extension, in which a new candidate subgraph is generated by adding a new edge between the rightmost vertex and another vertex on the rightmost path of the graph; to accelerate isomorphism testing, gSpan introduces DFS (depth-first search) lexicographic ordering and the minimum DFS code, forming a canonical labeling system that supports DFS search.

Chinese patent CN106777065, "A method and system for frequent subgraph mining", acquires the graph data to be mined and partitions it into subgraphs; performs parallel computation on the subgraphs based on depth-first search to find the corresponding frequent itemsets; and merges the frequent itemsets of the subgraphs to obtain the frequent subgraphs of the graph data to be mined. Because several subgraphs are computed and processed at the same time, parallel and multithreaded concurrent processing improves efficiency compared with serial processing. Chinese patent CN106446161, "A method for mining maximal frequent subgraphs using Hadoop", combines frequent subtrees with candidate edges and then uses stored intermediate results to judge whether the resulting subgraphs are frequent and to generate the maximal frequent subgraphs, without traversing the database again.

Although the frequent subgraph mining problem has received wide attention, existing work still has disadvantages. The AGM method proposed by A. Inokuchi et al. produces a large number of redundant (k+1)-subgraphs during subgraph expansion, and when pruning a (k+1)-subgraph it must spend considerable time determining whether all of its k-subgraphs are frequent. The FSG method proposed by Kuramochi and Karypis improves the expansion and pruning of AGM and can effectively find frequent subgraphs in small graphs, but joining two k-subgraphs to generate a (k+1)-subgraph still incurs considerable overhead. The gSpan method proposed by X. Yan et al. is only suitable for frequent subgraph mining on a graph set and cannot be applied effectively to a single large graph.

The method of Chinese patent CN106777065 divides the graph data to be mined into n subgraph segments but does not guarantee the structural integrity of the divided graph data. The size of the frequent subgraphs that can be mined is limited by the size of the divided subgraphs, which may lead to missed solutions: a larger frequent subgraph may be split across several segments, so that an originally frequent subgraph becomes infrequent. In addition, the method does not prune intermediate results, which consumes a large amount of memory, so it cannot be applied effectively to mining larger graph data. The method of Chinese patent CN106446161 does not process the graph data before mining, i.e. it does not delete infrequent vertices and edges according to a frequent threshold, and it stores intermediate results instead of pruning them; it therefore also consumes a large amount of memory and cannot be applied effectively to mining larger graph data.

Current frequent subgraph mining methods leave several aspects insufficiently addressed, for example: how to compress the size of the initial set of the subgraph expansion stage so as to reduce the generation of invalid intermediate results; how to handle invalid intermediate results generated during mining so as to reduce memory consumption; and how to perform subgraph isomorphism detection efficiently so as to speed up execution. Therefore, how to complete frequent subgraph mining with less memory consumption and faster execution is a problem to be solved.

Disclosure of Invention

The invention aims to provide a frequent subgraph mining method based on community detection, which can compress the size of an initial set in a subgraph expansion stage, effectively reduce the generation of intermediate results, efficiently carry out subgraph isomorphism detection in the subgraph expansion stage, and further delete non-frequent intermediate results through subgraph pruning.

In order to achieve the above object, the invention provides a frequent subgraph mining method based on community detection, which comprises the following steps:

step 1: preprocessing the acquired social network data;

step 2: carrying out community detection work on the processed social network to obtain a community set;

step 3: sampling each community in the community set to obtain a subgraph expansion initial set;

step 4: performing frequent subgraph mining on the graph data according to the initial set.

The step 1 comprises the following steps:

step 1.1: counting the frequency of the vertex labels and the edge labels in the social network data;

step 1.2: setting a frequent threshold τ, and marking the vertices and edges whose vertex-label or edge-label frequency is smaller than τ as infrequent vertices and infrequent edges;

step 1.3: removing all infrequent vertices and infrequent edges from the graph data, and recording the number of remaining vertices in the graph data as m;

step 1.4: renumbering and reconstructing the remaining vertices;

step 1.5: reconstructing the edges according to the reconstructed vertices, replacing the start-point and end-point numbers of each edge with the numbers of the reconstructed vertices according to the vertex mapping function f.

The step 1.4 comprises the following steps:

step 1.4.1: grouping the vertices by label, and sorting the groups in descending order of the number of vertices in each group;

step 1.4.2: sorting the vertices inside each group in ascending order of vertex number;

step 1.4.3: renumbering the remaining vertices in the graph data from 0 to (m-1) according to the ordering rules of step 1.4.1 and step 1.4.2 to obtain the reconstructed vertex set $V' = \{v'_0, v'_1, \ldots, v'_{m-1}\}$;

step 1.4.4: storing the mapping between the vertex numbers after reconstruction and the vertex numbers before reconstruction as a mapping function f

such that $f(u) = v$ for $u \in V'$ and $v \in V$; wherein the vertex set before reconstruction is $V = \{v_1, v_2, \ldots, v_n\}$, $n \ge m$.
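By way of illustration only, steps 1.1 to 1.5 can be sketched in Python as follows. The function name `preprocess`, the dictionary and tuple representation of the graph, and the tie-break by label between groups of equal size are assumptions of this sketch; the method itself does not specify them.

```python
from collections import Counter

def preprocess(vertex_labels, edges, tau):
    """vertex_labels: dict old_id -> label; edges: list of (u, v, edge_label).

    Hypothetical helper illustrating steps 1.1-1.5; not the patented code.
    """
    # Step 1.1: count vertex-label and edge-label frequencies.
    v_freq = Counter(vertex_labels.values())
    e_freq = Counter(lbl for _, _, lbl in edges)
    # Steps 1.2-1.3: a label is infrequent when its frequency is smaller than tau.
    kept = {v: l for v, l in vertex_labels.items() if v_freq[l] >= tau}
    # Steps 1.4.1-1.4.2: larger label groups first, old ids ascending inside a group.
    # ASSUMPTION: groups of equal size are ordered by label to keep them contiguous.
    order = sorted(kept, key=lambda v: (-v_freq[kept[v]], kept[v], v))
    # Steps 1.4.3-1.4.4: f maps the new number (0..m-1) to the old number.
    f = dict(enumerate(order))
    new_id = {old: new for new, old in f.items()}
    new_labels = {new_id[v]: kept[v] for v in kept}
    # Step 1.5: rewrite edge endpoints with the new numbers, dropping edges that
    # are infrequent or that touch a removed vertex.
    new_edges = [(new_id[u], new_id[v], l) for u, v, l in edges
                 if e_freq[l] >= tau and u in new_id and v in new_id]
    return new_labels, new_edges, f
```

Keeping `f` as a plain dictionary makes it straightforward to translate mined subgraphs back to the original vertex numbers once mining has finished.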

The step 2 comprises the following steps:

step 2.1: calculating the k-shell value of each vertex, wherein the k-shell value of a vertex is the largest k such that the vertex belongs to a maximal connected subgraph in which the degree of every vertex is at least k, and the k-shell value of the vertex v is recorded as Ks(v);

step 2.2: initializing and allocating a unique community label from 0 to (m-1) for each vertex in the graph data;

step 2.3: for each vertex v, calculating the vertex importance VI(v) according to formula (1):

wherein alpha is a variable parameter between 0 and 1, d (u) is the degree of the vertex u, N (v) is a set of adjacent vertices of the vertex v, and Ks (u) is the k-shell value of the vertex u;

step 2.4: sorting the vertices in descending order of importance, and taking the sorted sequence, recorded as Seq, as the order in which vertex labels are updated;

step 2.5: initializing an iteration variable t = 1;

step 2.6: following the order of Seq, changing the community label of each vertex to the community label carried by the largest number of its adjacent vertices;

step 2.7: when several community labels are carried by the same maximum number of adjacent vertices, calculating the importance LI(v, l) of each of these community labels according to formula (2), and assigning the community label with the largest importance LI(v, l);

in the formula, $N^l(v)$ denotes the set of vertices with community label l among the adjacent vertices of the vertex v whose community label is to be modified, and VI(u) is the importance of the vertex u;

step 2.8: letting t = t + 1 and jumping to step 2.6; the loop terminates when t is greater than the maximum number of iterations, or when no vertex had its community label modified in the last iteration;

step 2.9: clustering all vertices in the graph data according to their community labels to obtain a community set.
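By way of illustration only, steps 2.1 to 2.9 can be sketched in Python as follows. Formulas (1) and (2) are not reproduced in this text, so the sketch assumes the forms VI(v) = Ks(v) + α · Σ_{u∈N(v)} Ks(u)/d(u) and LI(v, l) = Σ_{u∈N^l(v)} VI(u)/d(u), which use only the symbols defined above. These forms, the function names, the adjacency-list representation and the smallest-label tie-break are assumptions of the sketch.

```python
from collections import Counter

def k_shell(adj):
    """Step 2.1: k-shell value of every vertex, computed by repeated peeling."""
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining, ks, k = set(adj), {}, 0
    while remaining:
        k = max(k, min(deg[v] for v in remaining))
        peel = [v for v in remaining if deg[v] <= k]
        while peel:
            v = peel.pop()
            if v not in remaining:
                continue
            remaining.discard(v)
            ks[v] = k
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
                    if deg[u] <= k:
                        peel.append(u)
    return ks

def detect_communities(adj, alpha=0.5, max_iter=20):
    """adj: dict vertex -> list of neighbours, vertices numbered 0..m-1."""
    ks = k_shell(adj)
    deg = {v: len(adj[v]) for v in adj}
    # ASSUMED form of formula (1).
    vi = {v: ks[v] + alpha * sum(ks[u] / deg[u] for u in adj[v]) for v in adj}
    seq = sorted(adj, key=lambda v: (-vi[v], v))      # step 2.4
    label = {v: v for v in adj}                       # step 2.2
    for _ in range(max_iter):                         # steps 2.5 and 2.8
        changed = False
        for v in seq:                                 # step 2.6
            if not adj[v]:
                continue
            count = Counter(label[u] for u in adj[v])
            top = max(count.values())
            ties = sorted(l for l, c in count.items() if c == top)
            # Step 2.7 with the ASSUMED form of formula (2); ties in LI fall
            # back to the smallest label (also an assumption).
            best = max(ties, key=lambda l: sum(
                vi[u] / deg[u] for u in adj[v] if label[u] == l))
            if best != label[v]:
                label[v], changed = best, True
        if not changed:
            break
    groups = {}                                       # step 2.9
    for v in sorted(adj):
        groups.setdefault(label[v], []).append(v)
    return sorted(groups.values())
```

Labels are updated in place while walking Seq, so a vertex already sees the new labels of the more important vertices processed before it in the same iteration.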

The step 3 comprises the following steps:

step 3.1: in each community of the community set, aggregating the vertices of the community according to vertex degree, and recording each aggregate as a set $d_i^l$,

wherein i is the degree of the vertex, and l is the community number;

step 3.2: setting a sampling factor δ, and randomly extracting vertices from each such set in the proportion δ;

step 3.3: saving the extracted vertices to the subgraph expansion initial set.
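By way of illustration only, the sampling of steps 3.1 to 3.3 can be sketched in Python as follows. The function name `cg_samp`, the rounding up of the per-bucket sample size and the fixed random seed are assumptions of this sketch.

```python
import math
import random
from collections import defaultdict

def cg_samp(communities, degree, delta, seed=0):
    """communities: list of vertex lists; degree: dict vertex -> degree.

    Returns the vertices drawn for the subgraph expansion initial set.
    """
    rng = random.Random(seed)
    initial = []
    for members in communities:
        # Step 3.1: bucket the vertices of one community by degree.
        buckets = defaultdict(list)
        for v in members:
            buckets[degree[v]].append(v)
        # Step 3.2: draw the same proportion delta from every bucket.
        # ASSUMPTION: the sample size is rounded up so no bucket is skipped.
        for d in sorted(buckets, reverse=True):
            k = math.ceil(delta * len(buckets[d]))
            initial.extend(rng.sample(buckets[d], k))   # step 3.3
    return initial
```

Drawing per community and per degree bucket is what keeps the sample spread across the graph, in contrast to sampling once from the whole vertex set.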

The step 4 comprises the following steps:

step 4.1: initializing an iteration variable i = 1 and a maximum number of iterations maxIter;

step 4.2: performing the subgraph expansion operation on the current set (the initial set in the first iteration);

step 4.3: performing the subgraph pruning operation on the candidate set obtained by the expansion;

step 4.4: letting i = i + 1 and jumping to step 4.2; the iteration ends when i > maxIter, or when the set obtained after pruning is empty;

step 4.5: returning the result set, i.e. the set of all frequent subgraphs mined from the graph data by the frequent subgraph mining method.
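By way of illustration only, the iteration of steps 4.1 to 4.5 can be sketched in Python as a loop over two pluggable operations. The names `mine`, `expand` and `prune` are hypothetical, and accumulating the pruned result of every round into the returned list is an assumption about what the result set contains.

```python
def mine(initial, expand, prune, max_iter):
    """Run expansion and pruning rounds until max_iter or an empty round."""
    frequent = []
    current = list(initial)
    i = 1                                   # step 4.1
    while i <= max_iter and current:        # step 4.4 termination test
        candidates = expand(current)        # step 4.2
        current = prune(candidates)         # step 4.3
        frequent.extend(current)
        i += 1
    return frequent                         # step 4.5
```

Because pruning runs inside every round, infrequent candidates never reach the next expansion, which is where the memory saving described in the method comes from.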

The step 4.2 comprises the following steps:

step 4.2.1: for every vertex in the set, first expanding the vertex by one edge and one vertex;

step 4.2.2: if the expanded edge and vertex are frequent, saving the subgraph obtained by the expansion to the candidate set;

step 4.2.3: if the expanded edge and vertex are not frequent, expanding the vertex by one edge only; if the expanded edge is frequent, saving the subgraph obtained by the expansion to the candidate set.

The step 4.3 comprises the following steps:

step 4.3.1: judging the subgraphs in the candidate set pairwise, and calculating the degree d(g) of a subgraph g according to formula (3):

$d(g) = \sum_{v \in g} \deg(v)$ (3)

where deg(v) represents the degree of v in the subgraph;

step 4.3.2: if the degrees of the two subgraphs are equal, calculating the importance of each vertex in the subgraphs according to formula (4) and obtaining the vertex importance sequence of each subgraph, recorded as NIS(g); otherwise, if the degrees of the two subgraphs are not equal, skipping directly to the judgment of the next pair of subgraphs;

wherein freq(v.label) represents the frequency with which the label of vertex v appears in the subgraph;

step 4.3.3: if the vertex importance sequences of the two subgraphs are equal, regarding the two subgraphs as isomorphic subgraphs and storing them in the same set, referred to as an isomorphic set;

step 4.3.4: after all subgraphs in the candidate set have been judged, setting the candidate set to the empty set;

step 4.3.5: for each isomorphic set, counting the number of subgraphs in the set;

step 4.3.6: if the number of subgraphs counted in an isomorphic set is greater than τ, storing all of its subgraphs back into the candidate set; otherwise deleting the subgraphs in that isomorphic set.
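By way of illustration only, the pruning of steps 4.3.1 to 4.3.6 can be sketched in Python as follows. Formula (4) is not reproduced in this text, so the sketch assumes NI(v) = deg(v) · freq(v.label) as a stand-in. That form, the function names and the subgraph representation are assumptions. Since equality of d(g) and NIS(g) is an equivalence relation, the pairwise judgment is replaced here by grouping on a signature, which yields the same isomorphic sets.

```python
from collections import Counter, defaultdict

def signature(labels, edges):
    """labels: dict vertex -> label; edges: list of (u, v) inside the subgraph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    freq = Counter(labels.values())
    d_g = sum(deg.values())                           # formula (3)
    # ASSUMED stand-in for formula (4): NI(v) = deg(v) * freq(v.label).
    nis = tuple(sorted((deg[v] * freq[labels[v]] for v in labels), reverse=True))
    return d_g, nis

def prune(candidates, tau):
    """Keep only candidates whose isomorphic set holds more than tau subgraphs."""
    groups = defaultdict(list)                        # steps 4.3.1-4.3.3
    for g in candidates:
        groups[signature(*g)].append(g)
    return [g for members in groups.values()          # steps 4.3.5-4.3.6
            if len(members) > tau for g in members]
```

Equal signatures are a necessary condition for isomorphism rather than a proof of it, so under the assumed formula two structurally different subgraphs can share a signature; an exact isomorphism test could be applied inside each group if stricter results are required.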

The invention has the beneficial effects that:

The invention provides a frequent subgraph mining method based on community detection. First, community detection is performed on the social network so that vertices are extracted uniformly from the graph data during sampling and the relationships among entities can be mined quickly. Second, an effective subgraph expansion method and an efficient subgraph pruning strategy are provided, which improve the efficiency of subgraph isomorphism judgment in the subgraph isomorphism detection stage.

Drawings

FIG. 1 is a flow chart of a frequent subgraph mining method based on community detection in the present invention;

FIG. 2 is a flow chart of community detection in the present invention;

FIG. 3 is a data sampling flow diagram of the present invention;

FIG. 4 is a diagram illustrating the subgraph discovery process according to the present invention;

FIG. 5 is a flow chart of subgraph expansion according to the present invention;

FIG. 6 is a flow chart of subgraph pruning in the present invention;

FIG. 7 compares performance on the three synthetic data sets syn1, syn2 and syn3, wherein (a), (b) and (c) show the execution time of the method of the invention (CG-FSM), the Baseline method and the FFSM method as the sampling rate changes on syn1, syn2 and syn3, respectively;

FIG. 8 compares performance on the real data sets micro and youtube, wherein (a) and (b) show the execution time of CG-FSM, the Baseline method and the FFSM method as the sampling rate changes on micro and youtube, respectively.

Detailed Description

The invention is further described with reference to the following figures and specific examples.

Frequent subgraph mining means mining, from given graph data, all subgraphs whose number of occurrences is greater than or equal to a frequent threshold τ, and finally returning the set of all frequent subgraphs that satisfy this condition, wherein the frequent threshold τ is a user-defined input value.

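Subgraph isomorphism means that two subgraphs $g_1 = (V_1, E_1, \mu_1, \varepsilon_1)$ and $g_2 = (V_2, E_2, \mu_2, \varepsilon_2)$ are isomorphic subgraphs if they have the same topology and there is a one-to-one mapping $f: V_1 \to V_2$ such that for every $p \in V_1$, $\mu_1(p) = \mu_2(f(p))$, and for every edge $(p, q) \in E_1$ there is an edge $(f(p), f(q)) \in E_2$ with $\varepsilon_1(p, q) = \varepsilon_2(f(p), f(q))$.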
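By way of illustration only, the definition can be checked for a given candidate mapping with the following Python sketch. The function name `is_isomorphism` and the representation of a subgraph as a pair of dictionaries (vertex labels, and edge labels keyed by unordered vertex pairs, with no self-loops) are assumptions of this sketch.

```python
def is_isomorphism(f, g1, g2):
    """f: dict V1 -> V2; g = (mu, eps) with mu: vertex -> label and
    eps: frozenset({p, q}) -> edge label."""
    mu1, eps1 = g1
    mu2, eps2 = g2
    # Same topology requires equal vertex and edge counts.
    if len(mu1) != len(mu2) or len(eps1) != len(eps2):
        return False
    # f must be defined on all of V1 and cover all of V2 (a bijection).
    if set(f) != set(mu1) or set(f.values()) != set(mu2):
        return False
    # Vertex labels are preserved: mu1(p) == mu2(f(p)).
    if any(mu1[p] != mu2[f[p]] for p in mu1):
        return False
    # Every edge (p, q) maps to an edge (f(p), f(q)) with the same label.
    for edge, lbl in eps1.items():
        p, q = tuple(edge)
        image = frozenset((f[p], f[q]))
        if image not in eps2 or eps2[image] != lbl:
            return False
    return True
```

The check verifies one mapping only; finding such a mapping among all candidates is the expensive part that the method avoids through its simplified judgment.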

In the method, large graph data (such as a social network) are first preprocessed: all infrequent vertices and infrequent edges in the graph are deleted according to the frequent threshold τ. This reduces the structural complexity of the large graph and compresses the size of the initial set for frequent subgraph mining, thereby reducing the generation of repeated subgraphs during mining. After this processing, the vertices and edges of the graph data need to be reconstructed. Vertex reconstruction must meet two requirements: (1) the label groups are arranged in descending order of the number of vertices they contain; (2) the vertices under the same label are arranged in ascending order of their original vertex numbers. After the vertices are reconstructed, the edges are reconstructed so that the endpoints of each frequent edge are assigned the new vertex numbers. The mapping between the vertex numbers after reconstruction and the vertex numbers before reconstruction is stored during reconstruction, so that the results can be restored after frequent subgraph mining is completed and the data remain consistent before and after.

To extract vertices uniformly from the social network during sampling, community detection is performed on the social network, which is divided into communities by NIBLPA (Node Influence Based Label Propagation Algorithm). After the communities are obtained, vertices are extracted uniformly from each community by the CG-Samp sampling method (Community Graph Sample) to serve as the initial set for frequent subgraph mining expansion. Subgraph expansion is performed from the initial set, expanding either one edge, or one edge and one vertex, at a time. Subgraph isomorphism detection is performed on the subgraphs in the expansion stage, and infrequent subgraphs are pruned. Finally, all frequent subgraph sets in the large graph are returned. An experiment is then designed to verify the effectiveness of the CG-FSM method.

As shown in fig. 1, a frequent subgraph mining method based on community detection includes:

step 1: preprocessing the acquired social network data, comprising:

step 1.4: renumbering and reconstructing the remaining vertices, comprising:

step 1.4.1: grouping the vertices by label, and sorting the groups in descending order of the number of vertices in each group;

the mapping function f being such that $f(u) = v$; wherein the vertex set before reconstruction is $V = \{v_1, v_2, \ldots, v_n\}$, $n \ge m$;

step 1.5: reconstructing the edges according to the reconstructed vertices, replacing the start-point and end-point numbers of each edge with the numbers of the reconstructed vertices according to the vertex mapping function f.

In order to extract vertices uniformly from the social network during sampling, community detection is first performed on the social network, which is divided into communities by the NIBLPA method; the community detection flow chart is shown in fig. 2. Community detection comprises three stages in total: (1) initialization, in which each vertex is allocated a unique community label from 0 to m-1, where m is the number of vertices; (2) calculating the importance of each vertex and fixing the update order of the vertices in descending order of importance; the vertex importance calculation formula is as follows:

wherein VI(v) is the importance of vertex v; Ks(v) is the k-shell value of vertex v, i.e. the largest k such that v belongs to a maximal connected subgraph of the network in which the degree of every vertex is at least k; α is an adjustable parameter between 0 and 1; N(v) is the set of adjacent vertices of vertex v; d(u) is the degree of vertex u. (3) Each vertex changes its community label to the community label carried by the largest number of its adjacent vertices; when several community labels are carried by the same maximum number of adjacent vertices, the influence of each of these labels is calculated, and the community label with the largest influence is selected to update the community label of the vertex; the community label influence calculation formula is as follows:

wherein LI(v, l) is the influence of the community label l on the vertex v, and $N^l(v)$ denotes the set of adjacent vertices of vertex v whose community label is l.

Step 2: as shown in fig. 2, performing community detection on the processed social network to obtain a community set, comprising:

step 2.1: calculating the k-shell value of each vertex, wherein the k-shell value of a vertex is the largest k such that the vertex belongs to a maximal connected subgraph in which the degree of every vertex is at least k, and the k-shell value of the vertex v is recorded as Ks(v);

step 2.5: initializing an iteration variable t = 1;

in formula (2), $N^l(v)$ denotes the set of vertices with community label l among the adjacent vertices of the vertex v whose community label is to be modified, and VI(u) is the importance of the vertex u;

In order to further reduce the generation of repeated subgraphs during frequent subgraph mining, the data are sampled by the CG-Samp method, whose flow is shown in FIG. 3. First, in each community, the degrees of the vertices contained in the community are calculated, and all vertices are stored in a hash table together with their ids and degrees (the key is a vertex degree and the value is the list of all vertices having that degree); the vertex lists are then sorted in descending order of key. Vertices with the same degree are thus aggregated and stored in a list $d_i$, where i denotes the degree of the vertices in the list. The method finally returns the size-1 frequent vertex set from which frequent subgraph mining starts its expansion.

The size-1 frequent vertex set from which subgraph expansion starts, denoted $F^1$, is used as the starting point of a depth-first search (DFS) expansion to mine the final size-k frequent subgraphs. When all size-k frequent subgraphs have been found and the size-(k+1) set is empty, the iteration terminates. The flowcharts of subgraph discovery and subgraph expansion are shown in fig. 4 and fig. 5, respectively.

In the subgraph expansion stage, a candidate subgraph set is generated in each round of expansion. In the subgraph pruning stage, the support of each candidate subgraph is calculated, candidate subgraphs whose support is smaller than the frequent threshold are discarded, and the remaining subgraphs are added to the frequent subgraph list; the subgraph pruning process is shown in fig. 6. Calculating the support is expensive because subgraph isomorphism detection is an NP-hard problem. To avoid this computational overhead, two lemmas are given to simplify the judgment of subgraph isomorphism:

Lemma 1 (degree isomorphism): given two subgraphs g and g' of $G_L$ with degrees $d(g) = \sum_{v \in g} \deg(v)$ and $d(g') = \sum_{v' \in g'} \deg(v')$, if $d(g) \neq d(g')$, then g and g' are certainly not isomorphic subgraphs.

Lemma 2 (sequence isomorphism): given two subgraphs g and g' of $G_L$, if g and g' are isomorphic subgraphs, then they have the same vertex importance sequence. The vertex importance calculation formula is as follows:

where NI(v) represents the importance of vertex v, deg(v) represents the degree of vertex v, and freq(v.label) represents the frequency with which the label of vertex v appears in the subgraph. The importance of each vertex is calculated and the vertices are sorted in descending order of importance to obtain the importance sequence of the subgraph g, recorded as NIS(g).

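Step 3: sampling each community in the community set to obtain the subgraph expansion initial set, comprising:

in the aggregated sets, i is the degree of the vertex, and l is the community number;

step 3.2: setting a sampling factor δ, and randomly extracting vertices from each aggregated set in the proportion δ;

step 3.3: saving the extracted vertices to the subgraph expansion initial set.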

Step 4: performing frequent subgraph mining on the graph data according to the initial set, comprising:

step 4.1: initializing an iteration variable i = 1 and a maximum number of iterations maxIter;

step 4.2: performing the subgraph expansion operation on the current set (the initial set in the first iteration), comprising:

step 4.2.1: for every vertex in the set, first expanding the vertex by one edge and one vertex;

step 4.3: performing the subgraph pruning operation on the candidate set obtained by the expansion, comprising:

step 4.3.1: judging the subgraphs in the candidate set pairwise, and calculating the degree d(g) of a subgraph g according to formula (3):

$d(g) = \sum_{v \in g} \deg(v)$ (3)

where deg(v) represents the degree of v in the subgraph;

step 4.3.2: if the degrees of the two subgraphs are equal, calculating the importance of each vertex in the subgraphs according to formula (4) and obtaining the vertex importance sequence of each subgraph, recorded as NIS(g); otherwise, if the degrees of the two subgraphs are not equal, skipping directly to the judgment of the next pair of subgraphs;

step 4.3.4: after all subgraphs in the candidate set have been judged, setting the candidate set to the empty set;

step 4.3.5: for each set of subgraphs judged to be isomorphic, counting the number of subgraphs in the set;

step 4.3.6: if the number of subgraphs counted in such a set is greater than τ, storing all of its subgraphs back into the candidate set; otherwise deleting the subgraphs in that set;

step 4.4: letting i = i + 1 and jumping to step 4.2; the iteration ends when i > maxIter, or when the set obtained after pruning is empty;

step 4.5: returning the set of all frequent subgraphs mined from the graph data.

In order to verify the effectiveness of the method, the invention first compares NIBLPA experimentally with the traditional LPA (Label Propagation Algorithm); the experiments show that NIBLPA achieves a better community detection effect on several data sets, making the communities as small and uniform as possible. Subsequently, the method of the invention (CG-FSM for short) is compared with a baseline algorithm, Baseline (which differs from the method of the invention in that plain random sampling is used in the sampling process instead of uniform random sampling), and with the classical frequent subgraph mining method FFSM (Fast Frequent Subgraph Mining); by setting the sampling factor to 0.1 and 0.2, the execution times required by the method of the invention, the Baseline method and the FFSM method are observed as the sampling factor changes.

In the experiments of the invention, verification is carried out on three synthetic data sets (syn1, syn2 and syn3) and two real data sets (micro and youtube), wherein the synthetic data sets are obtained by expanding small graph data (comprising 12 vertices and 28 edges), and the real data sets are downloaded from http:// snap.

TABLE 1 data set Table

Data set name	Vertices	Edges
syn1	12000	28000
syn2	120000	280000
syn3	600000	1400000
micro	100000	1080299
youtube	1134890	2987624
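As a quick consistency check on Table 1, the following Python sketch tests whether each synthetic data set has exactly the size of a whole number of copies of the 12-vertex, 28-edge small graph. Reading "expanding" as replication of the small graph is an assumption of this sketch, and the helper name `replication_factor` is hypothetical.

```python
BASE_VERTICES, BASE_EDGES = 12, 28

# Vertex and edge counts of the synthetic data sets, copied from Table 1.
datasets = {
    "syn1": (12000, 28000),
    "syn2": (120000, 280000),
    "syn3": (600000, 1400000),
}

def replication_factor(vertices, edges):
    """Return k if the sizes equal k copies of the base graph, else None."""
    k, rem = divmod(vertices, BASE_VERTICES)
    return k if rem == 0 and edges == k * BASE_EDGES else None

factors = {name: replication_factor(v, e) for name, (v, e) in datasets.items()}
```

Under this reading, all three synthetic data sets keep the vertex-to-edge ratio of the small graph, so they differ from one another only in scale.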

Compared with the prior art, the technical scheme provided by the invention firstly deletes the infrequent vertexes and the infrequent edges in the big image data through pretreatment, and uniformly extracts vertexes from the big image data through community detection, so that the size of the initial vertex set is compressed and expanded, and the generation of intermediate results is greatly reduced; in the sub-graph isomorphism detection stage, in order to avoid expensive calculation overhead of detection, two lemmas are combined to simplify the judgment of sub-graph isomorphism, and non-frequent sub-graphs are deleted in advance through sub-graph pruning, so that invalid expansion and memory consumption are avoided.

Fig. 7 shows that the method of the present invention is compared with other methods in the execution time of frequent subgraph mining under three composite data sets, wherein the horizontal axis represents the sampling rate δ and the vertical axis represents the execution time. Comparing (a), (b) and (c) of fig. 7, it is clear that the method of the present invention performs better, and the execution time increases with the increase of the sampling rate δ, and other methods do not change with the change of δ because sampling is not performed.

Fig. 8 compares the execution time of frequent subgraph mining for the method of the invention and the other two methods on the two real data sets, where the horizontal axis is the sampling rate δ and the vertical axis is the execution time. Comparing (a) and (b) of Fig. 8 shows that the method of the invention performs better and that its execution time increases as the sampling rate δ increases; the other methods may overflow during execution and cannot run normally, so they are not shown in the figure.

Claims

1. A frequent subgraph mining method based on community detection is characterized by comprising the following steps:

step 1: preprocessing the acquired social network data;

step 4: performing frequent subgraph mining on the graph data according to the set.

2. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein the step 1 comprises:

step 1.2: setting a frequency threshold τ, and marking the vertices and edges whose counted vertex-label and edge-label frequencies are smaller than τ as infrequent vertices and infrequent edges;

step 1.4: renumbering and reconstructing the remaining vertices;
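
By way of non-limiting illustration, the preprocessing of step 1 might be sketched as follows. This is a minimal sketch and not the claimed implementation: the graph representation (a dictionary of vertex labels and a list of labeled edge triples) and the names `preprocess` and `renumber` are assumptions introduced only for this example, and edge-label frequencies are counted on the original graph before any vertex is removed.

```python
from collections import Counter

def preprocess(vertex_labels, edges, tau):
    """Drop infrequent vertices and edges, then renumber the rest.

    vertex_labels: dict vertex_id -> label (assumed representation)
    edges: list of (u, v, edge_label) triples (assumed representation)
    tau: frequency threshold
    """
    v_freq = Counter(vertex_labels.values())
    e_freq = Counter(lbl for _, _, lbl in edges)

    # A vertex is kept when its label occurs at least tau times.
    kept = sorted(v for v, lbl in vertex_labels.items() if v_freq[lbl] >= tau)
    # Hypothetical mapping from old vertex ids to new consecutive ids.
    renumber = {old: new for new, old in enumerate(kept)}

    new_labels = {renumber[v]: vertex_labels[v] for v in kept}
    # An edge is kept when its label is frequent and both endpoints survive.
    new_edges = [
        (renumber[u], renumber[v], lbl)
        for u, v, lbl in edges
        if e_freq[lbl] >= tau and u in renumber and v in renumber
    ]
    return new_labels, new_edges, renumber
```

Renumbering to consecutive ids keeps later array-based bookkeeping compact once infrequent vertices have been removed.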

3. The frequent subgraph mining method based on community detection as claimed in claim 2, wherein the step 1.4 comprises:

such that f(u) = v; wherein the vertex set before reconstruction is V = {v₁, v₂, …, v_n}, n ≥ m.

4. The frequent subgraph mining method based on community detection according to claim 1, wherein the step 2 comprises the following steps:

step 2.5: initializing an iteration variable t as 1;

step 2.6: in the order given by Seq, changing the community label of each vertex to the community label carried by the largest number of its adjacent vertices;

step 2.8: letting t = t + 1 and jumping to step 2.6; the loop terminates when t is greater than the maximum number of iterations, or when no vertex had its community label modified in the last iteration;
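
By way of non-limiting illustration, the iteration of steps 2.5 to 2.8 might be sketched as follows. This is a plain label propagation loop, not NIBLPA itself: the initial labeling (each vertex in its own community), the tie-break rule (smallest label) and the name `propagate_labels` are assumptions made for this example, and the update order `seq` is simply taken as an input.

```python
from collections import Counter

def propagate_labels(adj, seq, max_iter=20):
    """adj: dict vertex -> list of neighbours; seq: vertex update order."""
    # Assumption: every vertex starts in its own community.
    label = {v: v for v in adj}
    t = 1
    while t <= max_iter:
        changed = False
        for v in seq:
            if not adj[v]:
                continue  # isolated vertices keep their label
            counts = Counter(label[u] for u in adj[v])
            best = max(counts.values())
            # Assumed tie-break: smallest label among the most frequent ones.
            new = min(l for l, c in counts.items() if c == best)
            if new != label[v]:
                label[v] = new
                changed = True
        if not changed:
            break  # no label was modified in this iteration
        t += 1
    return label
```

On two disjoint triangles, for instance, the loop settles after two passes with one community label per triangle.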

5. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein the step 3 comprises:

Wherein i is the degree of the vertex, and l is the community number;

step 3.2: setting a sampling factor δ, and randomly extracting vertices from each set;

step 3.3: saving the extracted vertices to
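
By way of non-limiting illustration, extracting the same proportion of vertices from every group might be sketched as follows. This is a sketch under stated assumptions: the grouping attribute is passed in as `group_key`, the rounding rule (ceiling, at least one vertex per group) is a choice made for this example, and the name `sample_vertices` is hypothetical.

```python
import math
import random
from collections import defaultdict

def sample_vertices(community, group_key, delta, seed=0):
    """Draw a fraction delta of the vertices from every group.

    community: dict vertex -> community number
    group_key: dict vertex -> grouping attribute within the community
    delta: sampling factor in (0, 1]
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for v in community:
        groups[(community[v], group_key[v])].append(v)

    sampled = []
    for key in sorted(groups):
        members = sorted(groups[key])
        # Assumed rounding rule: round up, and take at least one vertex.
        k = max(1, math.ceil(delta * len(members)))
        sampled.extend(rng.sample(members, k))
    return sampled
```

Sampling per group rather than over the whole vertex set keeps small groups represented in the initial set.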

6. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein the step 4 comprises:

step 4.2: performing a subgraph expansion operation according to the set;

step 4.3: performing a subgraph pruning operation on the set obtained by expansion;

step 4.4: letting i = i + 1 and jumping to step 4.2; the iteration ends when i > maxIter or when the set is empty;

step 4.5: return to
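
By way of non-limiting illustration, the control flow of step 4 (expand, prune, repeat until the iteration limit is reached or nothing survives) might be sketched as follows. The callables `expand` and `prune`, the name `mine`, and the choice to return every subgraph that survived a pruning round are assumptions made for this example; the text does not fix them.

```python
def mine(initial, expand, prune, max_iter):
    """Alternate expansion and pruning until no candidate survives.

    initial: starting candidates (for example the sampled vertices)
    expand, prune: caller-supplied callables (hypothetical signatures)
    max_iter: maximum number of expansion rounds
    """
    current = list(initial)
    frequent = []
    i = 1
    while i <= max_iter and current:
        candidates = expand(current)   # grow every surviving candidate
        current = prune(candidates)    # drop infrequent candidates early
        frequent.extend(current)       # assumed: collect all survivors
        i += 1
    return frequent
```

Pruning inside the loop means an infrequent candidate is never expanded again, which is where the saving in intermediate results comes from.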

7. The frequent subgraph mining method based on community detection as claimed in claim 6, wherein the step 4.2 comprises:

step 4.2.1: for all vertices in the set, first expanding each vertex by one edge and one vertex;

8. The frequent subgraph mining method based on community detection as claimed in claim 6, wherein the step 4.3 comprises:

step 4.3.1: to pair

d(g) = ∑_{v∈g} deg(v)    (3)

where deg(v) represents the degree of v in the subgraph;

wherein freq (v.label) represents the frequency of occurrence of the label of vertex v in the subgraph;

step 4.3.3: if the vertex importance rankings of the two subgraphs are equal, the two subgraphs are considered to be isomorphic; storing the isomorphic subgraphs into the same set, and marking the set as a set

step 4.3.4: after all subgraphs in the set have been judged, setting the set to an empty set;

step 4.3.5: for each set, counting the number of subgraphs in the set;

step 4.3.6: if the counted number of subgraphs in the set is greater than τ, storing all of its subgraphs in the set; otherwise, deleting the subgraphs in that set.
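
By way of non-limiting illustration, the pruning of step 4.3 might be sketched as follows. `degree_sum` follows formula (3). The function `signature` is only an assumed stand-in for the vertex importance ranking, whose exact formula is not reproduced in the text: it compares the sorted (label, degree) pairs of two subgraphs, which is a cheap necessary condition for isomorphism rather than an exact test. All names are hypothetical.

```python
from collections import Counter, defaultdict

def degree_sum(edges):
    """d(g) of formula (3): sum of the vertex degrees inside subgraph g."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(deg.values())

def signature(labels, edges):
    """Assumed stand-in for the ranking: sorted (label, degree) pairs."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return tuple(sorted((labels[v], deg[v]) for v in labels))

def prune(subgraphs, tau):
    """subgraphs: list of (labels, edges); keep groups with more than tau members."""
    groups = defaultdict(list)
    for g in subgraphs:
        groups[signature(*g)].append(g)  # equal signature: treated as isomorphic
    return [g for members in groups.values() if len(members) > tau for g in members]
```

Because every edge contributes to the degree of two vertices, `degree_sum` always equals twice the number of edges, so subgraphs with different edge counts can be separated before any finer comparison.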