CN114661927A - Frequent subgraph mining method based on community detection - Google Patents

Info

Publication number
CN114661927A
Authority
CN
China
Prior art keywords
vertex
community
subgraph
frequent
subgraphs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210382776.9A
Other languages
Chinese (zh)
Inventor
袁野
张义
马德龙
马玉亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202210382776.9A priority Critical patent/CN114661927A/en
Publication of CN114661927A publication Critical patent/CN114661927A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53: Querying
    • G06F16/535: Filtering based on additional data, e.g. user or group profiles
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01: Social networking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00: Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03: Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a frequent subgraph mining method based on community detection. Community detection is performed on the preprocessed graph data, the vertices in each community are clustered by vertex label, and the same proportion of vertices is extracted from each label set to obtain the initial set for frequent subgraph mining. Subgraphs are then expanded from this initial set. After each round of expansion the generated candidate subgraphs are pruned, so that infrequent subgraphs are deleted promptly and subsequent invalid expansion is reduced, which improves expansion and mining efficiency. Isomorphism detection is carried out efficiently during pruning, which reduces execution time and further improves mining efficiency. In addition, the efficiency of the method is verified through comparison experiments.

Description

Frequent subgraph mining method based on community detection
Technical Field
The invention belongs to the technical field of computer networks, and particularly relates to a frequent subgraph mining method based on community detection.
Background
With the rapid development of information technology, analysis and mining of big data play an increasingly important role in production life, however, mining valuable information from massive data is still a non-negligible challenge. In many domains, data is often modeled as entities, various attributes of the entities are represented by tags, and relationships between entities are related by attributes, forming complex graph data structures. The graph is a universal data structure, and can conveniently and efficiently express the basic attributes of each entity and the complex interrelation among the entities. As the amount of data increases, the size of the graph also grows rapidly and the structure becomes more complex. Frequent subgraph mining is one of the hotspots of graph analysis research and is widely applied to social networks, biological information networks, business networks and the like.
Frequent subgraph mining can be performed on a graph set or on a single large graph, and most existing work targets graph sets. However, subgraph isomorphism is an NP-complete problem and is a sub-problem of frequent subgraph mining, so mining frequent subgraphs from a single large graph is computationally very expensive. Traditional frequent subgraph mining methods use all frequent vertices or frequent edges as the initial set for subgraph expansion, and a large number of repeated subgraphs are generated during mining, which causes high computational cost and high memory consumption. Reducing the size of the initial set effectively reduces the generation of repeated subgraphs, so that correct solutions can be mined in a short time.
At present, frequent subgraph mining methods fall into two main types: methods based on the Apriori idea, which adopt a breadth-first search strategy, and pattern-growth methods in the style of FP-growth, which adopt a depth-first search strategy. (1) Apriori-based methods follow a generate-and-test idea: size-(k+1) candidate subgraphs are generated from size-k frequent subgraphs, and the downward closure property (if any size-k subgraph of a size-(k+1) subgraph is infrequent, then the size-(k+1) subgraph must also be infrequent) is used to obtain the set of size-(k+1) frequent subgraphs. For example, the AGM method proposed by A. Inokuchi et al. in "An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data" mines all frequent subgraphs satisfying the frequent threshold based on recursive statistics. The FSG method proposed by Kuramochi and Karypis in "An Efficient Algorithm for Discovering Frequent Subgraphs" improves the expansion and pruning of the AGM method and can effectively find frequent subgraphs in small graphs. (2) Pattern-growth methods generate size-(k+1) candidate subgraphs by expanding a size-k frequent subgraph by one vertex or one edge at every possible position. For example, gSpan (graph-based Substructure pattern mining), proposed by X. Yan et al., is a classical pattern-growth method. gSpan introduces the concept of rightmost extension, in which a new candidate subgraph is generated by adding a new edge between the rightmost vertex and another vertex on the rightmost path of the graph. To accelerate isomorphism testing, gSpan introduces DFS (depth-first search) lexicographic ordering and the minimum DFS code, which together form a canonical labeling system that supports DFS search.
Chinese patent CN106777065, a method and system for frequent subgraph mining, acquires the graph data to be mined and divides it into subgraphs, performs parallel computation on the subgraphs based on a depth-first search method to find the corresponding frequent item sets, and merges the sub frequent item sets to obtain the frequent subgraphs of the graph data to be mined. Because the graph data is partitioned into subgraphs that are processed simultaneously, the parallel and multithreaded processing improves efficiency compared with serial processing. Chinese patent CN106446161, a method for mining very frequent subgraphs using Hadoop, combines frequent subtrees with candidate edges and then, according to stored intermediate results, judges whether the resulting subgraphs are frequent and generates the very frequent subgraphs without traversing the database again.
Although frequent subgraph mining has received wide attention, existing work still has shortcomings. The AGM method proposed by A. Inokuchi et al. produces a large number of redundant size-(k+1) subgraphs during subgraph expansion, and when pruning a size-(k+1) subgraph it must spend much time determining whether all of its size-k subgraphs are frequent. The FSG method proposed by Kuramochi and Karypis improves the expansion and pruning of AGM and can effectively find frequent subgraphs in small graphs, but it still incurs considerable overhead when two size-k subgraphs are joined to generate a size-(k+1) subgraph. The gSpan method proposed by X. Yan et al. is only suitable for frequent subgraph mining on a graph set and cannot be effectively applied to a single large graph.
The method of Chinese patent CN106777065 divides the graph data to be mined into n subgraph segments but does not guarantee the structural integrity of the divided graph data. The size of the mined frequent subgraphs is limited by the size of the divided subgraphs, which may lead to missed solutions, and a larger frequent subgraph may be split across several segments so that an originally frequent subgraph becomes infrequent. In addition, the method does not prune intermediate results, which causes heavy memory consumption, so it cannot be effectively applied to mining larger graph data. The method of Chinese patent CN106446161 does not process the graph data before mining, that is, it does not delete infrequent vertices and edges according to the frequent threshold, and it stores intermediate results instead of pruning them. It therefore also consumes a large amount of memory and cannot be effectively applied to mining larger graph data.
Current frequent subgraph mining methods leave several aspects insufficiently addressed: how to compress the size of the initial set of the subgraph expansion phase to reduce the generation of invalid intermediate results; how to handle invalid intermediate results generated during mining to reduce memory consumption; and how to perform subgraph isomorphism detection efficiently to speed up execution. How to complete frequent subgraph mining with less memory consumption and faster execution is therefore a problem to be solved.
Disclosure of Invention
The invention aims to provide a frequent subgraph mining method based on community detection, which can compress the size of an initial set in a subgraph expansion stage, effectively reduce the generation of intermediate results, efficiently carry out subgraph isomorphism detection in the subgraph expansion stage, and further delete non-frequent intermediate results through subgraph pruning.
In order to achieve the above object, the invention provides a frequent subgraph mining method based on community detection, which comprises the following steps:
step 1: preprocessing the acquired social network data;
step 2: carrying out community detection on the preprocessed social network to obtain a community set;
step 3: sampling each community in the community set to obtain a subgraph expansion initial set [symbol: Figure BDA0003593570520000031];
step 4: performing frequent subgraph mining on the graph data according to the set [symbol: Figure BDA0003593570520000032].
The step 1 comprises the following steps:
step 1.1: counting the frequency of the vertex labels and the edge labels in the social network data;
step 1.2: by setting a frequent threshold tau, marking the vertexes and edges with the frequency of the vertex labels and the edge labels smaller than the tau value as the infrequent vertexes and the infrequent edges;
step 1.3: removing all the infrequent vertexes and edges from the graph data, and recording the number of the residual vertexes in the graph data as m;
step 1.4: numbering and reconstructing the rest vertexes;
step 1.5: reconstructing the edges according to the reconstructed vertices, modifying the numbers of the start point and end point of each edge into the reconstructed vertex numbers according to the vertex mapping function f.
The step 1.4 comprises the following steps:
step 1.4.1: clustering and grouping the vertexes according to the labels, and performing descending order on the groups according to the number of the vertexes in the groups;
step 1.4.2: the inner vertexes of each group are in ascending order according to the vertex numbers;
step 1.4.3: the remaining vertices in the graph data are renumbered from 0 to (m-1) according to the ordering rules described in step 1.4.1 and step 1.4.2 to obtain a reconstructed vertex set $V' = \{v'_0, v'_1, \ldots, v'_{m-1}\}$;
step 1.4.4: storing the mapping between the vertex numbers after reconstruction and the vertex numbers before reconstruction as a mapping function $f: V' \to V$ such that for every $u \in V'$ there is a $v \in V$ with $f(u) = v$, where the vertex set before reconstruction is $V = \{v_1, v_2, \ldots, v_n\}$, $n \ge m$.
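The preprocessing in steps 1.1 to 1.5 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess`, the input layout (a dict of vertex labels and a list of labeled edges) and the tie-break between label groups of equal size (alphabetical by label) are assumptions introduced here.

```python
from collections import Counter

def preprocess(vertex_labels, edges, tau):
    """vertex_labels: {old_id: label}; edges: [(u, v, edge_label)]; tau: frequent threshold."""
    # Step 1.1: count how often each vertex label and edge label occurs.
    v_freq = Counter(vertex_labels.values())
    e_freq = Counter(lbl for _, _, lbl in edges)
    # Steps 1.2-1.3: keep only vertices whose label frequency is at least tau.
    kept = {v: l for v, l in vertex_labels.items() if v_freq[l] >= tau}
    # Steps 1.4.1-1.4.2: label groups in descending size, old ids ascending inside a group.
    # Assumption: groups of equal size are ordered by label.
    order = sorted(kept, key=lambda v: (-v_freq[kept[v]], kept[v], v))
    # Steps 1.4.3-1.4.4: new ids 0..m-1 and the mapping f(new id) = old id.
    f = dict(enumerate(order))
    new_id = {old: new for new, old in f.items()}
    new_labels = {new_id[v]: l for v, l in kept.items()}
    # Step 1.5: rewrite edge endpoints, dropping infrequent edges and removed vertices.
    new_edges = [(new_id[u], new_id[v], l) for u, v, l in edges
                 if u in new_id and v in new_id and e_freq[l] >= tau]
    return new_labels, new_edges, f
```

Keeping `f` alongside the renumbered graph lets mined subgraphs be translated back to the original vertex numbers afterwards.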
The step 2 comprises the following steps:
step 2.1: calculating a k-shell value of each vertex, where the k-shell value of a vertex is the largest k for which the vertex lies in a maximal connected subgraph whose vertices all have degree at least k; the k-shell value of the vertex v is recorded as Ks(v);
step 2.2: initializing and allocating a unique community label from 0 to (m-1) for each vertex in the graph data;
step 2.3: for each vertex v, the vertex importance VI(v) is calculated according to formula (1):
$VI(v) = Ks(v) + \alpha \sum_{u \in N(v)} \frac{Ks(u)}{d(u)}$ (1)
wherein α is a variable parameter between 0 and 1, d(u) is the degree of the vertex u, N(v) is the set of adjacent vertices of the vertex v, and Ks(u) is the k-shell value of the vertex u;
step 2.4: descending the vertexes according to the importance, taking the descending sequence as a vertex label updating sequence, and recording the descending sequence as Seq;
step 2.5: initializing an iteration variable t as 1;
step 2.6: following the order of Seq, changing the community label of each vertex into the community label carried by the largest number of its adjacent vertices;
step 2.7: when several community labels tie for the largest count, calculating the importance LI(v, l) of each tied community label according to formula (2), and assigning the community label with the largest LI(v, l);
[Formula (2): equation image BDA0003593570520000044]
in the formula, $N_l(v)$ represents the set of vertices carrying the community label l among the adjacent vertices of the vertex v whose community label is to be modified, and VI(u) is the importance of the vertex u;
step 2.8: let t = t + 1 and jump to step 2.6; the loop terminates when t is greater than the maximum iteration number or when no vertex changed its community label in the last iteration;
step 2.9: and clustering all vertexes in the graph data according to the labels to obtain a community set.
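Steps 2.2 and 2.4 to 2.9 can be sketched as a label propagation loop. The sketch takes the vertex importance VI as a precomputed input. Because formula (2) is not legible in this text, the tie-break score used below, the sum of VI(u)/d(u) over neighbours u carrying label l, is an assumption borrowed from the published NIBLPA algorithm and may differ from the patent's formula. The function name `propagate_labels`, the asynchronous update and the smallest-label fallback for exact ties are also assumptions.

```python
from collections import Counter

def propagate_labels(adj, vi, max_iter=100):
    """adj: {vertex: set of neighbours}; vi: {vertex: importance VI(v)}."""
    labels = {v: i for i, v in enumerate(sorted(adj))}        # step 2.2
    seq = sorted(adj, key=lambda v: (-vi[v], v))              # step 2.4
    for _ in range(max_iter):                                 # steps 2.5 and 2.8
        changed = False
        for v in seq:                                         # step 2.6
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            tied = [l for l, c in counts.items() if c == best]

            def li(l):
                # Assumed stand-in for formula (2), see the note above.
                return sum(vi[u] / len(adj[u]) for u in adj[v] if labels[u] == l)

            # Step 2.7: break ties by label influence, then by smallest label.
            new = tied[0] if len(tied) == 1 else max(tied, key=lambda l: (li(l), -l))
            if new != labels[v]:
                labels[v] = new
                changed = True
        if not changed:
            break
    communities = {}                                          # step 2.9
    for v, l in labels.items():
        communities.setdefault(l, set()).add(v)
    return list(communities.values())
```

Updating labels in place means that vertices later in Seq already see the new labels of more important vertices, which is one common reading of "according to the sequence of Seq".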
The step 3 comprises the following steps:
step 3.1: aggregating the vertices of each community in the community set according to vertex degree, and recording each aggregate as a set [symbol: Figure BDA0003593570520000045], where i is the degree of the vertices and l is the community number;
step 3.2: setting a sampling factor δ, and randomly extracting vertices from each set [symbol: Figure BDA0003593570520000046];
step 3.3: saving the extracted vertices to [symbol: Figure BDA0003593570520000047].
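The sampling in steps 3.1 to 3.3 can be sketched as stratified sampling by degree inside each community. Several details are assumptions, because the text does not fix them: the function name `cg_sample`, the use of whole-graph degree, the rounding rule (ceiling of δ times the bucket size, at least one vertex) and the fixed random seed.

```python
import math
import random
from collections import defaultdict

def cg_sample(communities, adj, delta, seed=0):
    """communities: iterable of vertex sets; adj: {vertex: set of neighbours}; delta in (0, 1]."""
    rng = random.Random(seed)
    start_set = []
    for community in communities:
        # Step 3.1: bucket the community's vertices by degree (key = degree).
        buckets = defaultdict(list)
        for v in sorted(community):
            buckets[len(adj[v])].append(v)
        # Step 3.2: draw the same proportion delta from every degree bucket.
        for degree in sorted(buckets, reverse=True):
            group = buckets[degree]
            k = min(len(group), max(1, math.ceil(delta * len(group))))
            # Step 3.3: collect the drawn vertices into the initial set.
            start_set.extend(rng.sample(group, k))
    return start_set
```

Sampling every degree bucket of every community keeps both high-degree and low-degree vertices of each community represented in the initial set.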
The step 4 comprises the following steps:
step 4.1: initializing an iteration variable i to be 1 and maximum iteration times maxIter;
step 4.2: performing the subgraph expansion operation according to the set [symbol: Figure BDA0003593570520000051];
step 4.3: performing the subgraph pruning operation on the set [symbol: Figure BDA0003593570520000052] obtained by the expansion;
step 4.4: let i = i + 1 and go to step 4.2; the iteration finishes when i > maxIter or when the set [symbol: Figure BDA0003593570520000053] is empty;
step 4.5: returning [symbol: Figure BDA0003593570520000054], namely the set of all frequent subgraphs mined from the graph data by the frequent subgraph mining method.
The step 4.2 comprises the following steps:
step 4.2.1: for every vertex in the set [symbol: Figure BDA0003593570520000055], first expanding the vertex by one edge and one vertex;
step 4.2.2: if the expanded edge and vertex are frequent, saving the expanded subgraph to the set [symbol: Figure BDA0003593570520000056];
step 4.2.3: if the expanded edge and vertex are not frequent, expanding the vertex by one edge only; if the expanded edge is frequent, saving the subgraph obtained by the expansion to the set [symbol: Figure BDA0003593570520000057].
The step 4.3 comprises the following steps:
step 4.3.1: judging the subgraphs in the set [symbol: Figure BDA0003593570520000058] pairwise, and calculating the degree d(g) of each subgraph g according to formula (3):
$d(g) = \sum_{v \in g} \deg(v)$ (3)
where deg(v) represents the degree of v in the subgraph;
step 4.3.2: if the degrees of the two subgraphs are equal, calculating the importance of each vertex in the subgraphs according to formula (4) and obtaining the vertex importance sequence of each subgraph, recorded as NIS(g); otherwise, if the degrees of the two subgraphs are not equal, skipping directly to the judgment of the next pair of subgraphs;
[Formula (4): equation image BDA0003593570520000059]
wherein freq(v.label) represents the frequency with which the label of the vertex v appears in the subgraph;
step 4.3.3: if the vertex importance sequences of the two subgraphs are equal, the two subgraphs are considered to be isomorphic; isomorphic subgraphs are stored in the same set, recorded as [symbol: Figure BDA00035935705200000510];
step 4.3.4: after all subgraphs in [symbol: Figure BDA00035935705200000511] have been judged, setting [symbol: Figure BDA00035935705200000512] to the empty set;
step 4.3.5: for each set [symbol: Figure BDA00035935705200000513], counting the number of subgraphs in the set;
step 4.3.6: if the number of subgraphs in the set [symbol: Figure BDA00035935705200000514] is greater than τ, storing all of its subgraphs back into the set [symbol: Figure BDA00035935705200000515]; otherwise deleting the subgraphs in the set [symbol: Figure BDA00035935705200000516].
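The pruning in steps 4.3.1 to 4.3.6 can be sketched as grouping candidates by a cheap signature and keeping only the large groups. Formula (3) is implemented as written. Formula (4) is not legible in this text, so the vertex importance sequence NIS(g) is replaced here by a hypothetical stand-in, the sorted list of (label, degree) pairs, which is only a necessary condition for isomorphism. The function names and the candidate layout are assumptions, and the "greater than τ" test follows step 4.3.6.

```python
from collections import Counter, defaultdict

def degree_signature(labels, edges):
    """labels: {vertex: label}; edges: [(u, v)] of one candidate subgraph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    d_g = sum(deg[v] for v in labels)                  # formula (3)
    # Hypothetical stand-in for NIS(g); formula (4) is not reproduced here.
    nis = tuple(sorted((labels[v], deg[v]) for v in labels))
    return d_g, nis

def prune(candidates, tau):
    """Group candidates with equal signatures and keep groups larger than tau."""
    groups = defaultdict(list)
    for labels, edges in candidates:                   # steps 4.3.1-4.3.3
        groups[degree_signature(labels, edges)].append((labels, edges))
    # Steps 4.3.5-4.3.6: a group survives only if it holds more than tau subgraphs.
    return [g for group in groups.values() if len(group) > tau for g in group]
```

Grouping by signature replaces the pairwise comparison with one pass over the candidates; two subgraphs land in the same group exactly when their degree sums and signatures agree.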
The invention has the beneficial effects that:
the invention provides a frequent subgraph mining method based on community detection. First, community detection is performed on the social network so that vertices are extracted uniformly from the graph data during sampling and the relationships among entities are mined quickly. Second, an effective subgraph expansion method and an efficient subgraph pruning strategy are provided, which improve the efficiency of subgraph isomorphism judgment in the subgraph isomorphism detection stage.
Drawings
FIG. 1 is a flow chart of a frequent subgraph mining method based on community detection in the present invention;
FIG. 2 is a flow chart of community detection in the present invention;
FIG. 3 is a data sampling flow diagram of the present invention;
FIG. 4 is a diagram illustrating the subgraph discovery process according to the present invention;
FIG. 5 is a flow chart of subgraph expansion according to the present invention;
FIG. 6 is a flow chart of subgraph pruning in the present invention;
FIG. 7 compares performance on three synthetic data sets syn1, syn2 and syn3, wherein (a), (b) and (c) are comparison experiments on syn1, syn2 and syn3, respectively, of how the execution time of the method of the invention (CG-FSM), the Baseline method and the FFSM method changes with the sampling rate;
FIG. 8 compares performance on the real data sets micro and youtube, wherein (a) and (b) are comparison experiments on micro and youtube, respectively, of how the execution time of CG-FSM, the Baseline method and the FFSM method changes with the sampling rate.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
The frequent subgraph mining means that all subgraphs with the occurrence times larger than or equal to a frequent threshold tau are mined from given graph data, and finally all frequent subgraph sets meeting the conditions are returned, wherein the frequent threshold tau is a numerical value input by a program user in a self-defining mode.
Subgraph isomorphism means that two subgraphs $g_1 = (v_1, e_1, \mu_1, \varepsilon_1)$ and $g_2 = (v_2, e_2, \mu_2, \varepsilon_2)$, with $\mu$ and $\varepsilon$ denoting the vertex and edge labels, are isomorphic if they have the same topology and there is a mapping $f: v_1 \to v_2$ such that $\forall p \in v_1$, $\mu_1(p) = \mu_2(f(p))$, and $\forall (p, q) \in e_1$ there is an edge $(f(p), f(q)) \in e_2$ with $\varepsilon_1(p, q) = \varepsilon_2(f(p), f(q))$.
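The definition can be made concrete with a brute-force check for very small graphs. This is only an illustration of the definition, not the accelerated test the method relies on: it tries every bijection between the vertex sets, so its cost grows factorially. The function name `is_isomorphic`, the graph layout and the treatment of edges as undirected are assumptions.

```python
from itertools import permutations

_MISSING = object()  # sentinel for "no such edge in g2"

def is_isomorphic(g1, g2):
    """g = (mu, eps): mu = {vertex: label}, eps = {(p, q): edge label}, undirected."""
    mu1, eps1 = g1
    mu2, eps2 = g2
    # Same topology requires equal vertex and edge counts.
    if len(mu1) != len(mu2) or len(eps1) != len(eps2):
        return False
    lookup2 = {frozenset(e): lbl for e, lbl in eps2.items()}
    v1 = list(mu1)
    for image in permutations(mu2):
        f = dict(zip(v1, image))                        # candidate bijection f
        # Vertex labels must be preserved: mu1(p) == mu2(f(p)).
        if any(mu1[p] != mu2[f[p]] for p in v1):
            continue
        # Every edge (p, q) must map to an edge (f(p), f(q)) with the same label.
        if all(lookup2.get(frozenset((f[p], f[q])), _MISSING) == lbl
               for (p, q), lbl in eps1.items()):
            return True
    return False
```

Because f is a bijection and both graphs have the same number of edges, mapping every edge of the first graph onto an edge of the second also covers every edge of the second.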
According to the method, large graph data (such as a social network) is first preprocessed: all infrequent vertices and infrequent edges in the graph data are deleted according to the frequent threshold τ. This reduces the complexity of the large graph structure and compresses the size of the initial set for frequent subgraph mining, so that fewer repeated subgraphs are generated during mining. After the graph data is processed, its vertices and edges need to be reconstructed. Vertex reconstruction must meet two requirements: (1) the label groups are arranged in descending order of the number of vertices in each group; (2) the vertices under the same label are arranged in ascending order of their original vertex numbers. After the vertices are reconstructed, the edges are reconstructed so that the endpoints of each frequent edge are assigned the new vertex numbers. The mapping between the vertex numbers after reconstruction and before reconstruction is stored during reconstruction, so that the original numbering can be restored after frequent subgraph mining is completed and the data stays consistent before and after.
In order to extract vertices uniformly from the social network during sampling, community detection is performed on the social network, which is divided into communities by NIBLPA (a node influence based label propagation algorithm). After the communities are obtained, vertices are extracted uniformly from each community by the CG-Samp sampling method (Community Graph Sample) to serve as the initial set for frequent subgraph expansion. Subgraph expansion is performed from the initial set, expanding either one edge, or one edge and one vertex, at a time. Subgraph isomorphism detection is carried out on the subgraphs produced in the expansion stage, and infrequent subgraphs are pruned. Finally, all frequent subgraphs in the large graph are returned, and an experiment is designed to verify the effectiveness of the CG-FSM method.
As shown in fig. 1, a frequent subgraph mining method based on community detection includes:
step 1: preprocessing the acquired social network data; the method comprises the following steps:
step 1.1: counting the frequency of the vertex labels and the edge labels in the social network data;
step 1.2: by setting a frequent threshold tau, marking the vertexes and edges with the frequency of the vertex labels and the edge labels smaller than the tau value as the infrequent vertexes and the infrequent edges;
step 1.3: removing all the infrequent vertexes and edges from the graph data, and recording the number of the residual vertexes in the graph data as m;
step 1.4: numbering and reconstructing the rest vertexes; the method comprises the following steps:
step 1.4.1: clustering and grouping the vertexes according to the labels, and performing descending on the groups according to the number of the vertexes in the groups;
step 1.4.2: the inner vertexes of each group are in ascending order according to the vertex numbers;
step 1.4.3: the remaining vertices in the graph data are renumbered from 0 to (m-1) according to the ordering rules described in step 1.4.1 and step 1.4.2 to obtain a reconstructed vertex set $V' = \{v'_0, v'_1, \ldots, v'_{m-1}\}$;
step 1.4.4: storing the mapping between the vertex numbers after reconstruction and the vertex numbers before reconstruction as a mapping function $f: V' \to V$ such that for every $u \in V'$ there is a $v \in V$ with $f(u) = v$, where the vertex set before reconstruction is $V = \{v_1, v_2, \ldots, v_n\}$, $n \ge m$;
step 1.5: reconstructing the edges according to the reconstructed vertices, modifying the numbers of the start point and end point of each edge into the reconstructed vertex numbers according to the vertex mapping function f.
In order to extract vertices uniformly from the social network during sampling, community detection is first performed on the social network, which is divided into communities by the NIBLPA method; the community detection flow chart is shown in fig. 2. Community detection comprises three stages: (1) initialization, in which each vertex is assigned a unique community label from 0 to m-1, where m is the number of vertices; (2) calculating the importance of each vertex and fixing the update order of the vertices in descending order of importance. The vertex importance calculation formula is as follows:
$VI(v) = Ks(v) + \alpha \sum_{u \in N(v)} \frac{Ks(u)}{d(u)}$
wherein VI(v) is the importance of the vertex v; Ks(v) is the k-shell value of v, that is, the largest k for which v belongs to a maximal connected subgraph of the network whose vertices all have degree at least k; α is an adjustable parameter between 0 and 1; N(v) is the set of adjacent vertices of v; and d(u) is the degree of the vertex u. (3) Each vertex changes its community label to the community label carried by the largest number of its adjacent vertices; when several community labels tie for the largest count, the influence of each tied label is calculated and the label with the largest influence is used to update the community label of the vertex. The community label influence calculation formula is as follows:
[Formula: equation image BDA0003593570520000082]
wherein LI(v, l) is the influence of the community label l on the vertex v, and $N_l(v)$ is the set of adjacent vertices of v whose community label is l.
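The k-shell value and the vertex importance can be sketched as follows. The k-shell part uses the standard peeling procedure. The importance formula in this text is an image, so the expression used here, VI(v) = Ks(v) + α · Σ Ks(u)/d(u) over neighbours u of v, is a reconstruction from the symbol definitions given above and from the published NIBLPA algorithm, and should be read as an assumption. The function names are hypothetical.

```python
def k_shell(adj):
    """adj: {vertex: set of neighbours}. Returns {vertex: Ks(v)} by iterative peeling."""
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    ks = {}
    k = 0
    while remaining:
        # Raise k to the smallest remaining degree, then peel everything at or below k.
        k = max(k, min(deg[v] for v in remaining))
        stack = [v for v in remaining if deg[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue
            remaining.discard(v)
            ks[v] = k
            for u in adj[v]:
                if u in remaining:
                    deg[u] -= 1
                    if deg[u] <= k:
                        stack.append(u)
    return ks

def vertex_importance(adj, ks, alpha):
    """Assumed form of the importance formula: Ks(v) + alpha * sum(Ks(u) / d(u))."""
    return {v: ks[v] + alpha * sum(ks[u] / len(adj[u]) for u in adj[v])
            for v in adj}
```

Sorting the vertices by this importance in descending order gives the fixed update order used in the label propagation stage.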
Step 2: as shown in fig. 2, performing community detection work on the processed social network to obtain a community set; the method comprises the following steps:
step 2.1: calculating a k-shell value of each vertex, where the k-shell value of a vertex is the largest k for which the vertex lies in a maximal connected subgraph whose vertices all have degree at least k; the k-shell value of the vertex v is recorded as Ks(v);
step 2.2: initializing and allocating a unique community label from 0 to (m-1) for each vertex in the graph data;
step 2.3: for each vertex v, the vertex importance VI(v) is calculated according to formula (1):
$VI(v) = Ks(v) + \alpha \sum_{u \in N(v)} \frac{Ks(u)}{d(u)}$ (1)
wherein α is a variable parameter between 0 and 1, d(u) is the degree of the vertex u, N(v) is the set of adjacent vertices of the vertex v, and Ks(u) is the k-shell value of the vertex u;
step 2.4: descending the vertexes according to the importance, taking the descending sequence as a vertex label updating sequence, and recording the descending sequence as Seq;
step 2.5: initializing an iteration variable t as 1;
step 2.6: following the order of Seq, changing the community label of each vertex into the community label carried by the largest number of its adjacent vertices;
step 2.7: when several community labels tie for the largest count, calculating the importance LI(v, l) of each tied community label according to formula (2), and assigning the community label with the largest LI(v, l);
[Formula (2): equation image BDA0003593570520000091]
in the formula, $N_l(v)$ represents the set of vertices carrying the community label l among the adjacent vertices of the vertex v whose community label is to be modified, and VI(u) is the importance of the vertex u;
step 2.8: let t = t + 1 and jump to step 2.6; the loop terminates when t is greater than the maximum iteration number or when no vertex changed its community label in the last iteration;
step 2.9: and clustering all vertexes in the graph data according to the labels to obtain a community set.
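Steps 2.1 to 2.9 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. Formulas (1) and (2) are images in the source, so the sketch assumes the forms VI(v) = Ks(v) + α · Σ_{u∈N(v)} Ks(u)/d(u) and LI(v, l) = Σ_{u∈N_l(v)} VI(u)/d(u), which use only the symbols defined in the text. The function names, the default `alpha` and `max_iter` values, and the rule for breaking exact ties are all hypothetical. The graph is assumed to be undirected and given as a symmetric adjacency dictionary.

```python
from collections import Counter, defaultdict


def k_shell(adj):
    """Step 2.1: k-shell value Ks(v) of every vertex, by repeated peeling."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    ks, k = {}, 0
    while remaining:
        k = max(k, min(degree[v] for v in remaining))
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue
            remaining.remove(v)
            ks[v] = k
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        stack.append(u)
    return ks


def vertex_importance(adj, ks, alpha):
    """Step 2.3, using the ASSUMED form of formula (1)."""
    return {v: ks[v] + alpha * sum(ks[u] / len(adj[u]) for u in nbrs)
            for v, nbrs in adj.items()}


def niblpa(adj, alpha=0.5, max_iter=100):
    ks = k_shell(adj)
    vi = vertex_importance(adj, ks, alpha)
    labels = {v: i for i, v in enumerate(sorted(adj))}   # step 2.2
    seq = sorted(adj, key=lambda v: (-vi[v], v))          # step 2.4
    for _ in range(max_iter):                             # steps 2.5 and 2.8
        changed = False
        for v in seq:                                     # step 2.6
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            tied = sorted(l for l, c in counts.items() if c == top)
            if len(tied) == 1:
                best = tied[0]
            else:
                # Step 2.7, using the ASSUMED form of formula (2).
                li = defaultdict(float)
                for u in adj[v]:
                    if labels[u] in tied:
                        li[labels[u]] += vi[u] / len(adj[u])
                # Exact ties fall back to the smallest label (assumption).
                best = max(tied, key=lambda l: li[l])
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    communities = defaultdict(set)                        # step 2.9
    for v, l in labels.items():
        communities[l].add(v)
    return list(communities.values())
```

Labels are updated in place while walking Seq, so later vertices in the same pass already see the new labels of earlier ones, which matches the sequential update order described in step 2.6.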
To further reduce the generation of duplicate subgraphs during frequent subgraph mining, the data is sampled with the CG-Samp method, whose flow is shown in FIG. 3. First, within each community, the degree of every vertex in the community is calculated, and the vertices are stored with their ids and degrees in a hash table (the key is a vertex degree, and the value is the list of all vertices having that degree). The vertex lists are then sorted in descending order of key. Vertices with the same degree are aggregated and stored in a list $d_i$, where i denotes the degree of the vertices in the list. The method finally returns the size-1 frequent vertex set from which frequent subgraph mining starts its expansion.
The size-1 frequent vertex set that starts subgraph expansion is denoted $F_1$. $F_1$ is used as the starting point of a depth-first search (DFS) expansion that mines the final size-k frequent subgraphs. The iteration terminates once all size-k frequent subgraphs have been found and the size-(k+1) set is empty. The flowcharts of subgraph discovery and subgraph expansion are shown in FIG. 4 and FIG. 5, respectively.
In the subgraph expansion stage, each round of expansion generates a set of candidate subgraphs. In the subgraph pruning stage, the support of each candidate subgraph is calculated, candidates whose support is below the frequent threshold are discarded, and the remaining subgraphs are added to the frequent subgraph list; the pruning process is shown in FIG. 6. Because subgraph isomorphism detection is an NP-hard problem, calculating the support is computationally expensive. To avoid this overhead, two lemmas are given to simplify the subgraph isomorphism decision:
Lemma 1 (degree isomorphism): Given two subgraphs g and g' of $G_L$ with $d(g)=\sum_{v\in g}\deg(v)$ and $d(g')=\sum_{v'\in g'}\deg(v')$, if $d(g)\neq d(g')$, then g and g' are not isomorphic.
Lemma 2 (sequence isomorphism): Given $G_L$, if subgraphs g and g' are isomorphic, then they have the same vertex importance ranking. The vertex importance is calculated as follows:
Figure BDA0003593570520000093
where NI(v) is the importance of vertex v, deg(v) is the degree of vertex v, and freq(v.label) is the frequency with which the label of vertex v appears in the subgraph. The importance of each vertex is calculated, and the values are sorted in descending order to obtain the importance sequence of subgraph g, denoted NIS(g).
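The two checks can be sketched as a cheap filter applied to a pair of subgraphs. This is an illustration only. The formula for NI(v) is an image in the source, so the default `importance` function below, deg(v) · freq(v.label), is a placeholder assumption and can be swapped for the actual formula. The function names and the edge-list representation of a subgraph are also hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Degree of every vertex inside the subgraph given as an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_sum(edges):
    """d(g): the sum of the vertex degrees of subgraph g."""
    return sum(degrees(edges).values())


def nis(edges, labels, importance=lambda deg, freq: deg * freq):
    """NIS(g): vertex importance values sorted in descending order.

    `importance` is a PLACEHOLDER for the vertex importance formula.
    """
    deg = degrees(edges)
    freq = Counter(labels[v] for v in deg)   # label frequency within g
    values = (importance(deg[v], freq[labels[v]]) for v in deg)
    return tuple(sorted(values, reverse=True))


def passes_both_checks(g1, labels1, g2, labels2):
    """Degree check first, then the importance sequence check."""
    if degree_sum(g1) != degree_sum(g2):
        return False                         # cannot be isomorphic
    return nis(g1, labels1) == nis(g2, labels2)


# Two labeled paths A-B-A with different vertex ids, and one triangle.
path1, lab1 = [(1, 2), (2, 3)], {1: "A", 2: "B", 3: "A"}
path2, lab2 = [(7, 8), (8, 9)], {7: "A", 8: "B", 9: "A"}
triangle = [(1, 2), (2, 3), (1, 3)]
print(passes_both_checks(path1, lab1, path2, lab2))
print(passes_both_checks(path1, lab1, triangle, lab1))
```

Both quantities depend only on degrees and label counts, so they cost time linear in the size of the subgraph, which is what makes them attractive compared with a full isomorphism test.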
Step 3: Sample each community in the community set to obtain the subgraph expansion initial set
Figure BDA0003593570520000101
comprising the following steps:
Step 3.1: In each community of the community set, aggregate the vertices by vertex degree, and denote each aggregated set as
Figure BDA0003593570520000102
where i is the vertex degree and l is the community number;
Step 3.2: Set a sampling factor δ and, from each
Figure BDA0003593570520000103
set, randomly extract vertices;
Step 3.3: Save the extracted vertices to
Figure BDA0003593570520000104
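Steps 3.1 to 3.3 can be sketched as degree-stratified sampling inside each community. This is a minimal illustration under stated assumptions: the source does not say how many vertices are drawn from each degree bucket, so the sketch draws ceil(δ · bucket size), with at least one vertex per bucket. The function name `cg_sample` and the `seed` parameter are hypothetical.

```python
import math
import random
from collections import defaultdict


def cg_sample(communities, degree, delta, seed=0):
    """Return sampled start vertices for subgraph expansion.

    communities: iterable of vertex collections (one per community)
    degree:      dict mapping vertex -> degree
    delta:       sampling factor, expected in (0, 1]
    """
    rng = random.Random(seed)
    sampled = []
    for community in communities:
        # Step 3.1: bucket the community's vertices by degree.
        buckets = defaultdict(list)
        for v in community:
            buckets[degree[v]].append(v)
        # Visit buckets in descending order of degree.
        for d in sorted(buckets, reverse=True):
            bucket = sorted(buckets[d])
            # ASSUMPTION: per-bucket sample size, at least one vertex.
            k = min(len(bucket), max(1, math.ceil(delta * len(bucket))))
            # Steps 3.2 and 3.3: draw at random and keep the result.
            sampled.extend(rng.sample(bucket, k))
    return sampled
```

Sampling per degree bucket rather than over the whole community keeps both high-degree and low-degree vertices represented in the initial set, which is the uniformity the method relies on.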
Step 4: According to the
Figure BDA0003593570520000105
set, perform frequent subgraph mining on the graph data, comprising the following steps:
Step 4.1: Initialize the iteration variable i = 1 and the maximum number of iterations maxIter;
Step 4.2: Perform the subgraph expansion operation on the
Figure BDA0003593570520000106
set, comprising the following steps:
Step 4.2.1: For every vertex in the
Figure BDA0003593570520000107
set, first expand the vertex by one edge and one vertex;
Step 4.2.2: If the expanded edge and vertex are frequent, save the expanded subgraph to the set
Figure BDA0003593570520000108
Step 4.2.3: If the expanded edge and vertex are not frequent, expand the vertex by one edge only; if the expanded edge is frequent, save the resulting subgraph to the set
Figure BDA0003593570520000109
Step 4.3: Perform the subgraph pruning operation on the
Figure BDA00035935705200001010
set obtained by expansion, comprising the following steps:
Step 4.3.1: Compare the subgraphs in the
Figure BDA00035935705200001011
set pairwise; for each subgraph g, calculate its degree d(g) according to formula (3):
$$d(g)=\sum_{v\in g}\deg(v) \tag{3}$$
where deg(v) is the degree of v in the subgraph;
Step 4.3.2: If the degrees of the two subgraphs are equal, calculate the importance of each vertex in the subgraphs according to formula (4) and obtain the vertex importance sequence of each subgraph, denoted NIS(g); otherwise, if the degrees of the two subgraphs are not equal, skip directly to the comparison of the next pair of subgraphs;
Figure BDA00035935705200001012
where freq(v.label) is the frequency with which the label of vertex v appears in the subgraph;
Step 4.3.3: If the vertex importance sequences of the two subgraphs are equal, the two subgraphs are considered isomorphic; isomorphic subgraphs are stored in the same set, denoted
Figure BDA00035935705200001013
Step 4.3.4: After all subgraphs in the
Figure BDA00035935705200001014
set have been compared, set the
Figure BDA00035935705200001015
set to the empty set;
Step 4.3.5: For each
Figure BDA0003593570520000111
set, count the number of subgraphs in the set;
Step 4.3.6: If the number of subgraphs counted in the
Figure BDA0003593570520000112
set is greater than τ, store all of its subgraphs back into the
Figure BDA0003593570520000113
set; otherwise, delete the subgraphs in the
Figure BDA0003593570520000114
set;
Step 4.4: Set i = i + 1 and return to step 4.2; the iteration ends when i > maxIter or when the
Figure BDA0003593570520000115
set is empty;
Step 4.5: Return the
Figure BDA0003593570520000116
set, i.e. the set of all frequent subgraphs mined from the graph data by the frequent subgraph mining method.
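The pruning of step 4.3 can be sketched as grouping candidate subgraphs and keeping only the groups whose size exceeds τ. This is an illustration, not the patented procedure. Instead of comparing every pair, the sketch hashes each candidate by the pair (d(g), NIS(g)); subgraphs land in the same group exactly when both values are equal, so the grouping is the same as the pairwise comparison would give. Formula (4) is an image in the source, so the importance deg(v) · freq(v.label) used here is a placeholder assumption, and the names `signature` and `prune` are hypothetical.

```python
from collections import Counter, defaultdict


def signature(edges, labels):
    """Return (d(g), NIS(g)) for a subgraph given as an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    freq = Counter(labels[v] for v in deg)   # label frequency within g
    # PLACEHOLDER importance: degree times label frequency (assumption).
    seq = tuple(sorted((deg[v] * freq[labels[v]] for v in deg), reverse=True))
    return sum(deg.values()), seq


def prune(candidates, labels, tau):
    """Keep candidates whose group of look-alike subgraphs is larger than tau."""
    groups = defaultdict(list)
    for g in candidates:
        groups[signature(g, labels)].append(g)           # steps 4.3.1 to 4.3.3
    kept = []
    for group in groups.values():                        # steps 4.3.5 and 4.3.6
        if len(group) > tau:
            kept.extend(group)
    return kept


labels = {1: "A", 2: "B", 3: "A", 4: "B"}
candidates = [[(1, 2)], [(3, 4)], [(2, 3)], [(1, 3)]]
frequent = prune(candidates, labels, tau=2)
```

Grouping by signature takes one pass over the candidates, whereas the pairwise comparison described in step 4.3.1 grows quadratically with the number of candidates.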
To verify the effectiveness of the method, the invention first compares NIBLPA experimentally with the traditional LPA (Label Propagation Algorithm). The experiments show that NIBLPA achieves a better community detection result on several datasets, producing communities that are as small and uniform as possible. The method of the invention (CG-FSM for short) is then compared with a baseline algorithm, Baseline (which differs from the method of the invention in that it uses random sampling rather than uniform random sampling), and with the classic frequent subgraph mining method FFSM (Fast Frequent Subgraph Mining). With the sampling factor set to 0.1 and 0.2, the execution time required by the method of the invention, the Baseline method and the FFSM method is observed as the sampling factor changes.
The experiments are carried out on three synthetic datasets (syn1, syn2 and syn3) and two real datasets (micro and youtube). The synthetic datasets are obtained by expanding a small graph (12 vertices and 28 edges); the real datasets are downloaded from http:// snap.
Table 1. Datasets
Dataset   Vertices   Edges
syn1      12000      28000
syn2      120000     280000
syn3      600000     1400000
micro     100000     1080299
youtube   1134890    2987624
Compared with the prior art, the technical scheme of the invention first removes infrequent vertices and infrequent edges from the large graph data through preprocessing, and then extracts vertices uniformly from the large graph data through community detection. This reduces the size of the initial vertex set used for expansion and greatly reduces the generation of intermediate results. In the subgraph isomorphism detection stage, two lemmas are combined to simplify the isomorphism decision and avoid its expensive computational overhead, and infrequent subgraphs are deleted early through subgraph pruning, which avoids useless expansion and memory consumption.
FIG. 7 compares the execution time of frequent subgraph mining for the method of the invention and the other methods on the three synthetic datasets, where the horizontal axis is the sampling rate δ and the vertical axis is the execution time. Comparing (a), (b) and (c) of FIG. 7 shows that the method of the invention performs better and that its execution time increases as the sampling rate δ increases; the execution time of the other methods does not change with δ because they perform no sampling.
FIG. 8 compares the execution time of frequent subgraph mining for the method of the invention and the other two methods on the two real datasets, where the horizontal axis is the sampling rate δ and the vertical axis is the execution time. Comparing (a) and (b) of FIG. 8 shows that the method of the invention performs better and that its execution time increases as the sampling rate δ increases; the other methods overflow during execution and cannot run normally, so they are not shown in the figure.

Claims (8)

1. A frequent subgraph mining method based on community detection, characterized by comprising the following steps:
Step 1: Preprocessing the acquired social network data;
Step 2: Performing community detection on the processed social network to obtain a community set;
Step 3: Sampling each community in the community set to obtain a subgraph expansion initial set
Figure FDA0003593570510000014
Step 4: According to the
Figure FDA0003593570510000015
set, performing frequent subgraph mining on the graph data.
2. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein step 1 comprises:
Step 1.1: Counting the frequency of the vertex labels and the edge labels in the social network data;
Step 1.2: Setting a frequent threshold τ, and marking the vertices and edges whose counted vertex-label or edge-label frequency is less than τ as infrequent vertices and infrequent edges;
Step 1.3: Removing all infrequent vertices and infrequent edges from the graph data, and denoting the number of remaining vertices in the graph data as m;
Step 1.4: Renumbering and reconstructing the remaining vertices;
Step 1.5: Reconstructing the edges according to the reconstructed vertices, by modifying the start and end vertex numbers of each edge to the reconstructed vertex numbers according to the vertex mapping function f.
3. The frequent subgraph mining method based on community detection as claimed in claim 2, wherein step 1.4 comprises:
Step 1.4.1: Grouping the vertices by label, and sorting the groups in descending order of the number of vertices they contain;
Step 1.4.2: Sorting the vertices within each group in ascending order of vertex number;
Step 1.4.3: Renumbering the remaining vertices in the graph data from 0 to (m-1) according to the ordering rules of steps 1.4.1 and 1.4.2 to obtain the reconstructed vertex set $V'=\{v'_0, v'_1, \ldots, v'_{m-1}\}$;
Step 1.4.4: Storing the mapping between the vertex numbers after reconstruction and the vertex numbers before reconstruction, wherein the mapping function f:
Figure FDA0003593570510000011
Figure FDA0003593570510000012
such that f(u) = v; wherein the vertex set before reconstruction is $V=\{v_1, v_2, \ldots, v_n\}$, n ≥ m.
4. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein step 2 comprises:
Step 2.1: Calculating the k-shell value of each vertex, wherein the k-shell value of a vertex is the largest k such that the vertex lies in a connected subgraph in which every vertex has degree at least k, and the k-shell value of vertex v is denoted Ks(v);
Step 2.2: Initializing the graph data by assigning each vertex a unique community label from 0 to (m-1);
Step 2.3: For each vertex v, calculating the vertex importance VI(v) according to formula (1):
Figure FDA0003593570510000013
wherein α is an adjustable parameter between 0 and 1, d(u) is the degree of vertex u, N(v) is the set of vertices adjacent to v, and Ks(u) is the k-shell value of vertex u;
Step 2.4: Sorting the vertices in descending order of importance, and using this order as the vertex label update sequence, denoted Seq;
Step 2.5: Initializing the iteration variable t = 1;
Step 2.6: Following the order of Seq, changing the community label of each vertex to the label carried by the largest number of its adjacent vertices;
Step 2.7: When several community labels are tied for the largest count, calculating the influence LI(v, l) of each tied label according to formula (2), and assigning the label with the largest LI(v, l);
Figure FDA0003593570510000021
wherein $N_l(v)$ is the set of vertices with community label l among the vertices adjacent to the vertex v whose label is being updated, and VI(u) is the importance of vertex u;
Step 2.8: Setting t = t + 1 and returning to step 2.6, the loop terminating when t exceeds the maximum number of iterations, or when no vertex changed its community label in the last iteration;
Step 2.9: Grouping all vertices in the graph data by label to obtain the community set.
5. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein step 3 comprises:
Step 3.1: In each community of the community set, aggregating the vertices by vertex degree, and denoting each aggregated set as
Figure FDA0003593570510000022
wherein i is the vertex degree and l is the community number;
Step 3.2: Setting a sampling factor δ and, from each
Figure FDA0003593570510000023
set, randomly extracting vertices;
Step 3.3: Saving the extracted vertices to
Figure FDA0003593570510000024
6. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein step 4 comprises:
Step 4.1: Initializing the iteration variable i = 1 and the maximum number of iterations maxIter;
Step 4.2: Performing the subgraph expansion operation on the
Figure FDA0003593570510000025
set;
Step 4.3: Performing the subgraph pruning operation on the
Figure FDA0003593570510000026
set obtained by expansion;
Step 4.4: Setting i = i + 1 and returning to step 4.2, the iteration ending when i > maxIter or when the
Figure FDA0003593570510000027
set is empty;
Step 4.5: Returning the
Figure FDA0003593570510000028
set, i.e. the set of all frequent subgraphs mined from the graph data by the frequent subgraph mining method.
7. The frequent subgraph mining method based on community detection as claimed in claim 6, wherein step 4.2 comprises:
Step 4.2.1: For every vertex in the
Figure FDA0003593570510000031
set, first expanding the vertex by one edge and one vertex;
Step 4.2.2: If the expanded edge and vertex are frequent, saving the expanded subgraph to the set
Figure FDA0003593570510000032
Step 4.2.3: If the expanded edge and vertex are not frequent, expanding the vertex by one edge only; if the expanded edge is frequent, saving the resulting subgraph to the set
Figure FDA0003593570510000033
8. The frequent subgraph mining method based on community detection as claimed in claim 6, wherein step 4.3 comprises:
Step 4.3.1: Comparing the subgraphs in the
Figure FDA0003593570510000034
set pairwise, and for each subgraph g, calculating its degree d(g) according to formula (3):
$$d(g)=\sum_{v\in g}\deg(v) \tag{3}$$
wherein deg(v) is the degree of v in the subgraph;
Step 4.3.2: If the degrees of the two subgraphs are equal, calculating the importance of each vertex in the subgraphs according to formula (4) and obtaining the vertex importance sequence of each subgraph, denoted NIS(g); otherwise, if the degrees of the two subgraphs are not equal, skipping directly to the comparison of the next pair of subgraphs;
Figure FDA0003593570510000035
wherein freq(v.label) is the frequency with which the label of vertex v appears in the subgraph;
Step 4.3.3: If the vertex importance sequences of the two subgraphs are equal, considering the two subgraphs to be isomorphic, and storing isomorphic subgraphs in the same set, denoted
Figure FDA0003593570510000036
Step 4.3.4: After all subgraphs in the
Figure FDA0003593570510000037
set have been compared, setting the
Figure FDA0003593570510000038
set to the empty set;
Step 4.3.5: For each
Figure FDA0003593570510000039
set, counting the number of subgraphs in the set;
Step 4.3.6: If the number of subgraphs counted in the
Figure FDA00035935705100000310
set is greater than τ, storing all of its subgraphs back into the
Figure FDA00035935705100000311
set; otherwise, deleting the subgraphs in the
Figure FDA00035935705100000312
set.
CN202210382776.9A 2022-04-13 2022-04-13 Frequent subgraph mining method based on community detection Pending CN114661927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210382776.9A CN114661927A (en) 2022-04-13 2022-04-13 Frequent subgraph mining method based on community detection

Publications (1)

Publication Number Publication Date
CN114661927A true CN114661927A (en) 2022-06-24

Family

ID=82034752

Country Status (1)

Country Link
CN (1) CN114661927A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858875A (en) * 2023-02-10 2023-03-28 武汉中科通达高新技术股份有限公司 Enterprise employee hierarchical relationship discovery method and device based on frequent graph pattern mining
CN115858875B (en) * 2023-02-10 2023-05-23 武汉中科通达高新技术股份有限公司 Enterprise employee hierarchical relationship discovery method and device based on frequent pattern mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination