CN114661927A - Frequent subgraph mining method based on community detection - Google Patents
Frequent subgraph mining method based on community detection
- Publication number
- CN114661927A (application number CN202210382776.9A)
- Authority
- CN
- China
- Prior art keywords
- vertex
- community
- subgraph
- frequent
- subgraphs
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/535—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2216/00—Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
- G06F2216/03—Data mining
Abstract
The invention provides a frequent subgraph mining method based on community detection. Community detection is performed on the preprocessed graph data, the vertices in each community are clustered by vertex label, and the same proportion of vertices is extracted from each label set to obtain the initial set for frequent subgraph mining. Subgraph expansion is carried out from the initial set; after each round of expansion the generated candidate subgraphs are pruned, so that infrequent subgraphs are deleted promptly and subsequent invalid expansion is reduced, which greatly improves expansion and mining efficiency. Isomorphism detection is performed efficiently during pruning, which reduces execution time and further improves subgraph mining efficiency. In addition, the efficiency of the method is verified through comparison experiments.
Description
Technical Field
The invention belongs to the technical field of computer networks, and particularly relates to a frequent subgraph mining method based on community detection.
Background
With the rapid development of information technology, the analysis and mining of big data play an increasingly important role in production and daily life, yet mining valuable information from massive data remains a significant challenge. In many domains, data is modeled as entities: the attributes of each entity are represented by labels, and entities are linked to one another through those attributes, forming complex graph-structured data. A graph is a general-purpose data structure that can express both the basic attributes of each entity and the complex relationships among entities conveniently and efficiently. As the amount of data increases, graphs grow rapidly in size and their structure becomes more complex. Frequent subgraph mining is one of the hot topics of graph analysis research and is widely applied to social networks, biological information networks, business networks, and the like.
Frequent subgraph mining can be performed on a set of graphs or on a single large graph, and most existing work targets graph sets. Because subgraph isomorphism is an NP-complete problem and is a subproblem of frequent subgraph mining, mining frequent subgraphs from a single large graph is extremely expensive computationally. Traditional frequent subgraph mining methods use all frequent vertices or all frequent edges as the initial set for subgraph expansion, so a large number of duplicate subgraphs are generated during mining, which leads to high computational cost and high memory consumption. Reducing the size of the initial set effectively reduces the generation of duplicate subgraphs, so that correct solutions can be mined in a short time.
At present, frequent subgraph mining methods fall mainly into two classes: methods based on the Apriori idea, which adopt a breadth-first search strategy, and methods based on FP-growth-style pattern growth, which adopt a depth-first search strategy. (1) Apriori-based methods follow a generate-and-test approach: (k+1)-subgraphs are generated from frequent k-subgraphs, and the downward closure property (if any k-subgraph of a (k+1)-subgraph is infrequent, the (k+1)-subgraph must also be infrequent) is used to obtain the set of frequent (k+1)-subgraphs. For example, the AGM method proposed by A. Inokuchi et al. in "An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data" mines all frequent subgraphs that satisfy the frequency threshold based on recursive statistics. The FSG method proposed by Kuramochi and Karypis in "An Efficient Algorithm for Discovering Frequent Subgraphs" improves the expansion and pruning of AGM and can find frequent subgraphs effectively in small graphs. (2) In pattern-growth methods, (k+1)-candidate subgraphs are generated by extending a frequent k-subgraph with one vertex or one edge at every possible position. For example, the gSpan (graph-based Substructure pattern mining) method proposed by X. Yan et al. is a classical pattern-growth method. gSpan introduces the rightmost extension, in which a new candidate subgraph is generated by adding a new edge between the rightmost vertex and another vertex on the rightmost path of the graph; to accelerate isomorphism checking, gSpan introduces DFS (depth-first search) lexicographic ordering and the minimum DFS code, which together form a canonical labeling system that supports DFS search.
Chinese patent CN106777065, "A method and system for frequent subgraph mining", acquires the graph data to be mined and divides it into subgraphs; performs parallel computation on the subgraphs based on depth-first search to find the corresponding frequent item sets; and merges the frequent item sets of the subgraphs to obtain the frequent subgraphs of the graph data to be mined. The graph data is partitioned into subgraphs, which are then processed in parallel, that is, several subgraphs are computed at the same time to obtain their frequent item sets. Compared with serial processing, parallel and multithreaded concurrent processing improve processing efficiency. Finally, the frequent item sets of the subgraphs are merged to obtain the final frequent subgraphs. Chinese patent CN106446161, "A method for mining maximal frequent subgraphs using Hadoop", uses Hadoop to mine maximal frequent subgraphs: it combines frequent subtrees with candidate edges and then, according to stored intermediate results, judges whether the resulting subgraphs are frequent and generates the maximal frequent subgraphs without traversing the database again.
Although frequent subgraph mining has received wide attention, existing work still has shortcomings. The AGM method of A. Inokuchi et al. produces a large number of redundant (k+1)-subgraphs during subgraph expansion, and when pruning a (k+1)-subgraph it spends much time determining whether all of its k-subgraphs are frequent. The FSG method of Kuramochi and Karypis improves the expansion and pruning of AGM, so that frequent subgraphs can be found effectively in small graphs, but joining two k-subgraphs to generate a (k+1)-subgraph still incurs considerable overhead. The gSpan method of X. Yan et al. is suitable only for frequent subgraph mining on a graph set and cannot be applied effectively to a single large graph.
The method of Chinese patent CN106777065 divides the graph data to be mined into n subgraphs but does not guarantee the structural integrity of the divided graph data. The size of the mined frequent subgraphs is limited by the size of the divided subgraphs, which may cause solutions to be missed: a larger frequent subgraph may be split across several subgraphs, so that an originally frequent subgraph becomes infrequent. In addition, the method does not prune intermediate results, which consumes a large amount of memory, so it cannot be applied effectively to mining larger graph data. The method of Chinese patent CN106446161 does not process the graph data before mining, that is, it does not delete infrequent vertices and edges according to the frequency threshold, and it stores intermediate results rather than pruning them, which consumes a large amount of memory, so it likewise cannot be applied effectively to mining larger graph data.
Current frequent subgraph mining methods leave several issues insufficiently addressed: how to compress the size of the initial set of the subgraph expansion stage to reduce the generation of invalid intermediate results; how to handle the invalid intermediate results generated during mining to reduce memory consumption; and how to perform subgraph isomorphism detection efficiently to speed up execution. How to complete frequent subgraph mining with less memory consumption and a shorter execution time is therefore a problem to be solved.
Disclosure of Invention
The invention aims to provide a frequent subgraph mining method based on community detection, which can compress the size of an initial set in a subgraph expansion stage, effectively reduce the generation of intermediate results, efficiently carry out subgraph isomorphism detection in the subgraph expansion stage, and further delete non-frequent intermediate results through subgraph pruning.
In order to achieve the above object, the invention provides a frequent subgraph mining method based on community detection, which comprises the following steps:
step 1: preprocessing the acquired social network data;
step 2: carrying out community detection work on the processed social network to obtain a community set;
The step 1 comprises the following steps:
step 1.1: counting the frequency of the vertex labels and the edge labels in the social network data;
step 1.2: setting a frequency threshold τ, and marking the vertices and edges whose vertex-label or edge-label frequency is less than τ as infrequent vertices and infrequent edges;
step 1.3: removing all infrequent vertices and edges from the graph data, and recording the number of remaining vertices in the graph data as m;
step 1.4: renumbering the remaining vertices;
step 1.5: rebuilding the edges according to the renumbered vertices, replacing the start-point and end-point numbers of each edge with the renumbered vertex numbers according to the vertex mapping function f.
The step 1.4 comprises the following steps:
step 1.4.1: grouping the vertices by label, and sorting the groups in descending order of the number of vertices in each group;
step 1.4.2: sorting the vertices within each group in ascending order of vertex number;
step 1.4.3: renumbering the remaining vertices in the graph data from 0 to (m-1) according to the ordering rules of step 1.4.1 and step 1.4.2 to obtain the renumbered vertex set $V' = \{v'_0, v'_1, \ldots, v'_{m-1}\}$;
step 1.4.4: storing the mapping between the vertex numbers after renumbering and the vertex numbers before renumbering as a mapping function f such that f(u) = v, where u is a renumbered vertex and v is the corresponding vertex of the original vertex set $V = \{v_1, v_2, \ldots, v_n\}$, $n \ge m$.
The step 2 comprises the following steps:
step 2.1: calculating the k-shell value of each vertex, denoted Ks(v) for a vertex v; the k-shell value of a vertex is the largest k such that the vertex belongs to a maximal connected subgraph in which every vertex has degree at least k;
step 2.2: initializing and allocating a unique community label from 0 to (m-1) for each vertex in the graph data;
step 2.3: for each vertex v, calculating the vertex importance VI(v) according to formula (1):
where α is an adjustable parameter between 0 and 1, d(u) is the degree of vertex u, N(v) is the set of vertices adjacent to vertex v, and Ks(u) is the k-shell value of vertex u;
step 2.4: sorting the vertices in descending order of importance, and taking this order as the vertex label update sequence, denoted Seq;
step 2.5: initializing the iteration variable t = 1;
step 2.6: following the order of Seq, changing the community label of each vertex to the community label carried by the largest number of its adjacent vertices;
step 2.7: when several community labels tie for the largest count, calculating the importance LI(v, l) of each tied community label according to formula (2), and assigning the community label with the largest LI(v, l);
where $N_l(v)$ is the set of vertices with community label l among the vertices adjacent to v, the vertex whose community label is being updated, and VI(u) is the importance of vertex u;
step 2.8: letting t = t + 1 and returning to step 2.6; the loop terminates when t exceeds the maximum number of iterations or when no vertex changed its community label in the last iteration;
step 2.9: grouping all vertices in the graph data by community label to obtain the community set.
The step 3 comprises the following steps:
step 3.1: aggregating the vertices of each community in the community set according to vertex degree, and recording each aggregated set indexed by i and l, where i is the vertex degree and l is the community number;
The step 4 comprises the following steps:
step 4.1: initializing the iteration variable i = 1 and the maximum number of iterations maxIter;
step 4.4: letting i = i + 1 and going to step 4.2; the iteration ends when i > maxIter or when the set is empty;
step 4.5: returning the result set, i.e., the set of all frequent subgraphs finally mined from the graph data by the frequent subgraph mining method.
The step 4.2 comprises the following steps:
step 4.2.2: if the extended edge and vertex are frequent, saving the extended subgraph to the set;
step 4.2.3: if the extended edge and vertex are not frequent, extending the vertex by one edge only; if the extended edge is frequent, saving the subgraph obtained by the extension to the set.
The step 4.3 comprises the following steps:
step 4.3.1: comparing the subgraphs in the set pairwise, and calculating the degree d(g) of each subgraph g according to formula (3):
$d(g) = \sum_{v \in g} \deg(v)$ (3)
where deg(v) represents the degree of v in the subgraph;
step 4.3.2: if the degrees of the two subgraphs are equal, calculating the importance of each vertex in the subgraphs according to formula (4) to obtain the vertex importance sequence of each subgraph, denoted NIS(g); otherwise, skipping directly to the comparison of the next pair of subgraphs;
where freq(v.label) represents the frequency with which the label of vertex v appears in the subgraph;
step 4.3.3: if the vertex importance sequences of the two subgraphs are equal, regarding the two subgraphs as isomorphic and storing them in the same isomorphism set;
step 4.3.4: after all subgraphs in the set have been compared, resetting the set to the empty set;
step 4.3.6: for each isomorphism set, if the number of subgraphs it contains is greater than τ, storing all of its subgraphs back into the set; otherwise deleting the subgraphs of that isomorphism set.
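The pruning of step 4.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a subgraph is represented as a pair (vertex-label dict, edge list), and because formula (4) is not reproduced in this text, the function `signature` is a hypothetical stand-in for NIS(g) built from sorted (label, degree) pairs. Grouping by a key replaces the pairwise comparison, which gives the same result when equality of the key is the only test.

```python
from collections import defaultdict

def subgraph_degree(edges):
    """Formula (3): d(g) is the sum of the vertex degrees inside the subgraph."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(deg.values()), dict(deg)

def signature(labels, edges):
    """Hypothetical stand-in for NIS(g): sorted (label, degree) pairs.
    The patent's formula (4) is not available here, so this is an assumption."""
    _, deg = subgraph_degree(edges)
    return tuple(sorted((labels[v], deg.get(v, 0)) for v in labels))

def prune(candidates, tau):
    """Group candidates by (d(g), signature); keep groups with more than tau members."""
    groups = defaultdict(list)
    for labels, edges in candidates:
        d_g, _ = subgraph_degree(edges)
        # Subgraphs with different d(g) can never share a group (step 4.3.2).
        groups[(d_g, signature(labels, edges))].append((labels, edges))
    frequent = []
    for members in groups.values():
        if len(members) > tau:          # step 4.3.6: count greater than tau
            frequent.extend(members)
    return frequent
```

With three A-B edges and one A-A edge as candidates and tau = 2, only the three A-B subgraphs survive.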
The invention has the beneficial effects that:
The invention provides a frequent subgraph mining method based on community detection. First, community detection is performed on the social network so that vertices can be extracted uniformly from the graph data during sampling and the relationships among entities can be mined quickly. Second, an effective subgraph expansion method and an efficient subgraph pruning strategy are provided, which improve the efficiency of subgraph isomorphism judgment in the subgraph isomorphism detection stage.
Drawings
FIG. 1 is a flow chart of a frequent subgraph mining method based on community detection in the present invention;
FIG. 2 is a flow chart of community detection in the present invention;
FIG. 3 is a data sampling flow diagram of the present invention;
FIG. 4 is a diagram illustrating the subgraph discovery process according to the present invention;
FIG. 5 is a flow chart of subgraph expansion according to the present invention;
FIG. 6 is a flow chart of subgraph pruning in the present invention;
FIG. 7 compares performance on the three synthetic data sets syn1, syn2 and syn3, wherein (a) is a comparison experiment on syn1 of how the execution time of the method of the invention (CG-FSM), the Baseline method and the FFSM method changes with the sampling rate; (b) is the same comparison experiment on syn2; and (c) is the same comparison experiment on syn3;
FIG. 8 compares performance on the real data sets micro and youtube, wherein (a) is a comparison experiment on micro of how the execution time of the method of the invention (CG-FSM), the Baseline method and the FFSM method changes with the sampling rate, and (b) is the same comparison experiment on youtube.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
Frequent subgraph mining means mining, from given graph data, all subgraphs whose number of occurrences is greater than or equal to a frequency threshold τ and returning the set of all frequent subgraphs that satisfy this condition; the frequency threshold τ is a value specified by the user of the program.
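Subgraph isomorphism is defined as follows: two subgraphs $g_1 = (v_1, e_1, \mu_1, \varepsilon_1)$ and $g_2 = (v_2, e_2, \mu_2, \varepsilon_2)$ are isomorphic if they have the same topology and there is a mapping $f: v_1 \to v_2$ such that $\mu_1(p) = \mu_2(f(p))$ for every vertex $p \in v_1$, and for every edge $(p, q) \in e_1$ there is an edge $(f(p), f(q)) \in e_2$ with $\varepsilon_1(p, q) = \varepsilon_2(f(p), f(q))$.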
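To make the definition concrete, the following Python sketch checks whether a given vertex mapping f witnesses the isomorphism of two labeled subgraphs. The data layout is an assumption made for illustration: each subgraph is a pair (vertex-label dict, edge-label dict keyed by `frozenset({p, q})`), edges are undirected without self-loops, and "same topology" is read as requiring f to be a bijection with equal edge counts. The function name `is_isomorphism` is hypothetical.

```python
def is_isomorphism(f, g1, g2):
    """Return True if the vertex mapping f (a dict) shows that g1 and g2 are isomorphic.

    g1 and g2 are (mu, eps) pairs: mu maps vertex -> label,
    eps maps frozenset({p, q}) -> edge label (undirected, no self-loops).
    """
    mu1, eps1 = g1
    mu2, eps2 = g2
    # f must be a bijection from the vertices of g1 onto the vertices of g2.
    if set(f) != set(mu1) or set(f.values()) != set(mu2):
        return False
    if len(set(f.values())) != len(f):
        return False
    # Same topology: equal number of edges, each edge mapped onto an edge.
    if len(eps1) != len(eps2):
        return False
    # Vertex labels must be preserved: mu1(p) == mu2(f(p)).
    if any(mu1[p] != mu2[f[p]] for p in mu1):
        return False
    # Edge labels must be preserved: eps1(p, q) == eps2(f(p), f(q)).
    for edge, label in eps1.items():
        p, q = tuple(edge)
        if eps2.get(frozenset((f[p], f[q]))) != label:
            return False
    return True
```

This only verifies a candidate mapping; finding such a mapping is the expensive part of subgraph isomorphism detection.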
The method first preprocesses the large graph data (such as a social network): all infrequent vertices and infrequent edges are deleted from the graph data according to the frequency threshold τ. This reduces the structural complexity of the large graph and compresses the size of the initial set for frequent subgraph mining, thereby reducing the generation of duplicate subgraphs during mining. After the infrequent elements are deleted, the vertices and edges of the graph data need to be rebuilt. Vertex renumbering must meet two requirements: (1) the label groups are arranged in descending order of the number of vertices in each group; (2) the vertices under the same label are arranged in ascending order of their original vertex numbers. After the vertices are renumbered, the edges are rebuilt so that the endpoints of every frequent edge are assigned the new vertex numbers. The mapping between the vertex numbers after and before renumbering is stored during renumbering, so that the original numbering can be restored after frequent subgraph mining is completed and the data remains consistent before and after.
So that vertices can be extracted uniformly from the social network during sampling, community detection is first performed on the social network, which is divided into communities by the NIBLPA (node influence based label propagation algorithm) method. After the communities are obtained, vertices are extracted uniformly from each community by the CG-Samp (Community Graph Sample) sampling method to serve as the initial set for frequent subgraph expansion. Subgraph expansion is performed from the initial set, extending either one edge, or one edge and one vertex, at a time. Subgraph isomorphism detection is carried out on the subgraphs produced in the expansion stage, and infrequent subgraphs are pruned. Finally, the set of all frequent subgraphs in the large graph is returned, and experiments are designed to verify the effectiveness of the CG-FSM method.
As shown in fig. 1, a frequent subgraph mining method based on community detection includes:
step 1: preprocessing the acquired social network data; the method comprises the following steps:
step 1.1: counting the frequency of the vertex labels and the edge labels in the social network data;
step 1.2: setting a frequency threshold τ, and marking the vertices and edges whose vertex-label or edge-label frequency is less than τ as infrequent vertices and infrequent edges;
step 1.3: removing all infrequent vertices and edges from the graph data, and recording the number of remaining vertices in the graph data as m;
step 1.4: renumbering the remaining vertices; the method comprises the following steps:
step 1.4.1: grouping the vertices by label, and sorting the groups in descending order of the number of vertices in each group;
step 1.4.2: sorting the vertices within each group in ascending order of vertex number;
step 1.4.3: renumbering the remaining vertices in the graph data from 0 to (m-1) according to the ordering rules of step 1.4.1 and step 1.4.2 to obtain the renumbered vertex set $V' = \{v'_0, v'_1, \ldots, v'_{m-1}\}$;
step 1.4.4: storing the mapping between the vertex numbers after renumbering and the vertex numbers before renumbering as a mapping function f such that f(u) = v, where u is a renumbered vertex and v is the corresponding vertex of the original vertex set $V = \{v_1, v_2, \ldots, v_n\}$, $n \ge m$;
step 1.5: rebuilding the edges according to the renumbered vertices, replacing the start-point and end-point numbers of each edge with the renumbered vertex numbers according to the vertex mapping function f.
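Steps 1.1 to 1.5 can be sketched in Python as follows. This is a minimal sketch under assumptions that are not stated in the source: the graph is given as a dict of vertex labels and a list of (u, v, edge_label) triples, labels are mutually comparable so that groups of equal size can be ordered deterministically, and the function name `preprocess` is hypothetical.

```python
from collections import Counter

def preprocess(vertex_labels, edges, tau):
    """vertex_labels: {vertex_id: label}; edges: [(u, v, edge_label)].

    Returns (new_labels, new_edges, f), where f maps each new vertex
    number to the original vertex number (step 1.4.4).
    """
    # Step 1.1: label frequencies in the original data.
    v_freq = Counter(vertex_labels.values())
    e_freq = Counter(label for _, _, label in edges)
    # Steps 1.2 and 1.3: keep only vertices whose label frequency is at least tau.
    kept = [v for v, label in vertex_labels.items() if v_freq[label] >= tau]
    kept_set = set(kept)
    # Steps 1.4.1 and 1.4.2: label groups by descending size (ties broken by
    # label, an assumption), then ascending original vertex number in a group.
    order = sorted(kept, key=lambda v: (-v_freq[vertex_labels[v]], vertex_labels[v], v))
    # Steps 1.4.3 and 1.4.4: renumber 0..m-1 and store the mapping f.
    f = dict(enumerate(order))
    new_id = {old: new for new, old in f.items()}
    new_labels = {new: vertex_labels[old] for new, old in f.items()}
    # Step 1.5: rebuild frequent edges with the new endpoint numbers.
    new_edges = [(new_id[u], new_id[v], label) for u, v, label in edges
                 if e_freq[label] >= tau and u in kept_set and v in kept_set]
    return new_labels, new_edges, f
```

Keeping f as a plain dict makes it straightforward to translate mined subgraphs back to the original vertex numbers afterwards.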
So that vertices can be extracted uniformly from the social network during sampling, community detection is first performed on the social network, which is divided into communities by the NIBLPA method; the community detection flow chart is shown in fig. 2. Community detection comprises three stages: (1) initialization, in which each vertex is assigned a unique community label from 0 to m-1, where m is the number of vertices; (2) calculating the importance of each vertex and fixing the update order of the vertices in descending order of importance; the vertex importance calculation formula is as follows:
where VI(v) is the importance of vertex v; Ks(v) is the k-shell value of vertex v, that is, the largest k such that v belongs to a maximal connected subgraph of the network in which every vertex has degree at least k; α is an adjustable parameter between 0 and 1; N(v) is the set of vertices adjacent to vertex v; d(u) is the degree of vertex u. (3) Each vertex changes its community label to the label carried by the largest number of its adjacent vertices; when several community labels tie for the largest count, the influence of each tied label is calculated, and the label with the largest influence is selected to update the community label of the vertex; the community label influence calculation formula is as follows:
where LI(v, l) is the influence of community label l on vertex v, and $N_l(v)$ is the set of vertices adjacent to vertex v whose community label is l.
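The two formulas themselves are not reproduced in this text. Assuming the method follows the published NIBLPA definitions of node influence and label influence, which use the symbols defined above, formulas (1) and (2) would read as below. This is a reconstruction, not a quotation; in particular the division by d(u) in formula (2) comes from NIBLPA and is not listed among the symbols the source gives for that formula.

```latex
\begin{align}
VI(v) &= Ks(v) + \alpha \sum_{u \in N(v)} \frac{Ks(u)}{d(u)} \tag{1} \\
LI(v, l) &= \sum_{u \in N_l(v)} \frac{VI(u)}{d(u)} \tag{2}
\end{align}
```

Under this reading, a vertex is important when it lies deep in the k-shell hierarchy and when its neighbors do too, with each neighbor's contribution divided by its degree.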
Step 2: as shown in Fig. 2, perform community detection on the preprocessed social network to obtain a community set. This step comprises:
Step 2.1: calculate the k-shell value of each vertex, where the k-shell value of a vertex is the largest k for which the vertex lies in a maximal connected subgraph whose vertices all have degree at least k; the k-shell value of vertex v is denoted Ks(v);
Step 2.2: initialize each vertex in the graph data with a unique community label from 0 to (m-1);
Step 2.3: for each vertex v, calculate the vertex importance VI(v) according to formula (1):
where α is a tunable parameter between 0 and 1, d(u) is the degree of vertex u, N(v) is the set of neighbors of vertex v, and Ks(u) is the k-shell value of vertex u;
Step 2.4: sort the vertices in descending order of importance and use this order as the vertex label update sequence, denoted Seq;
Step 2.5: initialize the iteration variable t = 1;
Step 2.6: following the order of Seq, change the community label of each vertex to the label carried by the largest number of its neighbors;
Step 2.7: when several community labels are tied for the largest number of carrying neighbors, calculate the importance LI(v, l) of each tied label according to formula (2) and assign the label with the largest LI(v, l);
where $N_l(v)$ is the set of neighbors of the vertex v being relabeled whose community label is l, and VI(u) is the importance of vertex u;
Step 2.8: let t = t + 1 and jump to step 2.6; the loop terminates when t exceeds the maximum number of iterations or when no vertex changed its community label in the last iteration;
Step 2.9: group all vertices in the graph data by label to obtain the community set.
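Steps 2.1 to 2.9 can be illustrated with a minimal Python sketch. Formulas (1) and (2) are not reproduced in this text, so the sketch assumes VI(v) = Ks(v) + α·Σ Ks(u)/d(u) over u in N(v) and LI(v, l) = Σ VI(u)/d(u) over u in N_l(v), which is consistent with the symbols defined above but is not confirmed by the source. The function names, the adjacency-list input, and the rule that ties in LI go to the smaller label are likewise illustrative assumptions.

```python
from collections import defaultdict

def k_shell(adj):
    """Step 2.1: k-shell value Ks(v) of every vertex, by repeated peeling."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive, ks, k = set(adj), {}, 0
    while alive:
        k = max(k, min(deg[v] for v in alive))
        peel = [v for v in alive if deg[v] <= k]
        while peel:
            v = peel.pop()
            if v not in alive:
                continue  # already removed through another neighbor
            alive.remove(v)
            ks[v] = k
            for u in adj[v]:
                if u in alive:
                    deg[u] -= 1
                    if deg[u] <= k:
                        peel.append(u)
    return ks

def vertex_importance(adj, ks, alpha=0.5):
    """Step 2.3 with the ASSUMED form of formula (1)."""
    return {v: ks[v] + alpha * sum(ks[u] / len(adj[u]) for u in adj[v])
            for v in adj}

def detect_communities(adj, alpha=0.5, max_iter=100):
    """adj: symmetric adjacency lists {vertex: [neighbors]}."""
    ks = k_shell(adj)
    vi = vertex_importance(adj, ks, alpha)
    label = {v: i for i, v in enumerate(sorted(adj))}   # step 2.2
    seq = sorted(adj, key=lambda v: (-vi[v], v))        # step 2.4
    for _ in range(max_iter):                           # steps 2.5 to 2.8
        changed = False
        for v in seq:
            if not adj[v]:
                continue  # an isolated vertex keeps its own label
            count = defaultdict(int)
            influence = defaultdict(float)
            for u in adj[v]:
                count[label[u]] += 1
                # ASSUMED form of formula (2): sum of VI(u) / d(u)
                influence[label[u]] += vi[u] / len(adj[u])
            best = max(count.values())                  # step 2.6
            tied = [l for l, c in count.items() if c == best]
            new = max(tied, key=lambda l: (influence[l], -l))  # step 2.7
            if new != label[v]:
                label[v], changed = new, True
        if not changed:
            break
    communities = defaultdict(set)                      # step 2.9
    for v, l in label.items():
        communities[l].add(v)
    return list(communities.values())
```

On two disconnected triangles, for example, this sketch returns one community per triangle, since labels can only spread along edges.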
To further reduce the generation of duplicate subgraphs during frequent subgraph mining, the data is sampled with the CG-Samp method, whose flow is shown in Fig. 3. First, within each community, the degree of every vertex in the community is calculated, and all vertices are stored with their ids and degrees in a hash table whose keys are vertex degrees and whose values are lists of the vertices having that degree; the entries are then sorted in descending order of key. Vertices with the same degree are thus aggregated in a list $d_i$, where i denotes the degree of the vertices in the list. The method finally returns the size-1 frequent vertex set from which frequent subgraph mining starts its expansion.
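The degree hash table described above might look like the following Python sketch. The text does not say whether degrees are counted in the whole graph or only inside the community, nor how many vertices are drawn from each list, so the sketch assumes whole-graph degrees and draws a fraction `delta` (rounded up) from every degree list; `degree_buckets`, `sample_buckets`, and `seed` are hypothetical names.

```python
import math
import random
from collections import defaultdict

def degree_buckets(community, adj):
    """Hash table {degree: [vertices]}, ordered by descending degree."""
    buckets = defaultdict(list)
    for v in sorted(community):
        buckets[len(adj[v])].append(v)  # ASSUMPTION: degree in the whole graph
    return dict(sorted(buckets.items(), reverse=True))

def sample_buckets(buckets, delta, seed=0):
    """Draw a fraction delta from every degree list (ASSUMED sampling rule)."""
    rng = random.Random(seed)
    picked = []
    for vertices in buckets.values():
        k = min(len(vertices), max(1, math.ceil(delta * len(vertices))))
        picked.extend(rng.sample(vertices, k))
    return picked
```

Sampling per degree list rather than over the whole community keeps both high-degree and low-degree vertices represented in the starting set.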
The size-1 frequent vertex set from which subgraph expansion starts is denoted $F_1$. $F_1$ serves as the starting point of a depth-first search (DFS) extension that mines the final size-k frequent subgraphs. The iteration terminates when all size-k frequent subgraphs have been found and the size-(k+1) set is empty. The flowcharts of subgraph discovery and subgraph expansion are shown in Fig. 4 and Fig. 5, respectively.
In the subgraph expansion stage, each round of expansion generates a candidate subgraph set. In the subgraph pruning stage, the support of each candidate subgraph is calculated; candidates whose support is below the frequency threshold are discarded and the remaining subgraphs are added to the frequent subgraph list. The pruning process is shown in Fig. 6. Because subgraph isomorphism detection is an NP-hard problem, calculating the support is computationally expensive. To avoid this overhead, two lemmas are used to simplify the subgraph isomorphism test:
Lemma 2 (sequence isomorphism): given $G_L$, if the subgraphs g and g' are isomorphic, then they have the same vertex importance ranking. The vertex importance is calculated as follows:
where NI(v) is the importance of vertex v, deg(v) is the degree of vertex v, and freq(v.label) is the frequency with which the label of vertex v appears in the subgraph. The importance of every vertex is calculated and the values are sorted in descending order to obtain the importance sequence of subgraph g, denoted NIS(g).
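A small Python sketch of the importance sequence NIS(g) follows. The vertex importance formula itself is not reproduced in this text, so the way deg(v) and freq(v.label) are combined is left as a parameter; the default product deg(v)·freq(v.label) is purely an assumption, as are the function name and the input layout.

```python
from collections import Counter

def importance_sequence(vertices, edges, score=lambda deg, freq: deg * freq):
    """NIS(g) of a labeled subgraph.

    vertices: {vertex_id: label}; edges: iterable of (u, v) pairs inside g.
    score: ASSUMED combination of deg(v) and freq(v.label).
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    freq = Counter(vertices.values())  # label frequency inside the subgraph
    values = [score(deg[v], freq[label]) for v, label in vertices.items()]
    return sorted(values, reverse=True)
```

Because the sequence depends only on degrees and label counts, two isomorphic subgraphs always produce the same sequence regardless of how their vertices are numbered, which is the property Lemma 2 relies on.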
Step 3: sample each community in the community set to obtain the initial set for subgraph expansion. This step comprises:
Step 3.1: aggregate the vertices of each community in the community set by vertex degree, and record each aggregated group as a set indexed by i and l, where i is the vertex degree and l is the community number;
Step 4: perform frequent subgraph mining on the graph data starting from the subgraph expansion initial set. This step comprises:
Step 4.1: initialize the iteration variable i = 1 and the maximum number of iterations maxIter;
Step 4.2: perform the subgraph expansion operation on the current set. This step comprises:
Step 4.2.1: for every vertex in the set, first extend the vertex by one edge and one vertex;
Step 4.2.2: if the extended edge and vertex are frequent, save the extended subgraph to the candidate set;
Step 4.2.3: if the extended edge and vertex are not frequent, extend the vertex by one edge only; if the extended edge is frequent, save the resulting subgraph to the candidate set;
Step 4.3: perform the subgraph pruning operation on the set obtained by expansion. This step comprises:
Step 4.3.1: compare the subgraphs in the set pairwise, calculating the degree d(g) of each subgraph g according to formula (3):
$d(g) = \sum_{v \in g} \deg(v)$ (3)
where deg(v) is the degree of v in the subgraph;
Step 4.3.2: if the degrees of the two subgraphs are equal, calculate the importance of each vertex in the subgraphs according to formula (4) and obtain the vertex importance sequence of each subgraph, denoted NIS(g); otherwise, skip directly to the comparison of the next pair of subgraphs;
where freq(v.label) is the frequency with which the label of vertex v appears in the subgraph;
Step 4.3.3: if the vertex importance sequences of the two subgraphs are equal, the two subgraphs are considered isomorphic and are stored in the same set;
Step 4.3.4: after all subgraphs in the set have been judged, reset the set to the empty set;
Step 4.3.6: if the number of subgraphs counted in a set of isomorphic subgraphs is greater than τ, store all of its subgraphs back in the set; otherwise delete the subgraphs in that set;
Step 4.4: let i = i + 1 and jump to step 4.2; the iteration ends when i > maxIter or when the set is empty;
Step 4.5: return the resulting set, i.e., the set of all frequent subgraphs mined from the graph data by the frequent subgraph mining method.
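The pruning in steps 4.3.1 to 4.3.6 can be sketched in Python by grouping candidates under a signature instead of comparing them pair by pair. The signature is the pair (d(g), NIS(g)): d(g) follows formula (3), while the per-vertex score inside NIS(g) uses an assumed product deg(v)·freq(v.label) because formula (4) is not reproduced in this text. The names `signature` and `prune` and the tuple layout of a candidate are hypothetical.

```python
from collections import Counter, defaultdict

def signature(vertices, edges):
    """(d(g), NIS(g)) for a labeled subgraph; the NIS score is ASSUMED."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    freq = Counter(vertices.values())
    d_g = sum(deg[v] for v in vertices)                    # formula (3)
    nis = sorted((deg[v] * freq[lab] for v, lab in vertices.items()),
                 reverse=True)
    return d_g, tuple(nis)

def prune(candidates, tau):
    """Keep candidates whose signature group has more than tau members.

    candidates: list of (vertices, edges) with vertices = {id: label}.
    """
    groups = defaultdict(list)
    for g in candidates:
        groups[signature(*g)].append(g)  # equal d(g) and NIS(g): same group
    return [g for members in groups.values() if len(members) > tau
            for g in members]
```

Grouping by a hashable signature gives the same partition as the pairwise comparison described in the steps, at one signature computation per candidate.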
To verify the effectiveness of the method, NIBLPA is first compared experimentally with the traditional LPA (Label Propagation Algorithm). The experiments show that NIBLPA achieves better community detection on several data sets, producing communities that are as small and uniform as possible. The method of the invention (CG-FSM for short) is then compared with a baseline algorithm, Baseline (which differs from the method of the invention in that it uses random sampling instead of uniform random sampling in the sampling process), and with the classic frequent subgraph mining method FFSM (Fast Frequent Subgraph Mining). By adjusting the sampling factor to 0.1 and 0.2, the execution times of the method of the invention, the Baseline method, and the FFSM method are observed as the sampling factor changes.
In the experiments, verification is carried out on three synthetic data sets (syn1, syn2, and syn3) and two real data sets (micro and youtube). The synthetic data sets are obtained by expanding a small graph (12 vertices and 28 edges); the real data sets were downloaded from http:// snap.
Table 1. Data sets
Data set name | Vertices | Edges |
---|---|---|
syn1 | 12000 | 28000 |
syn2 | 120000 | 280000 |
syn3 | 600000 | 1400000 |
micro | 100000 | 1080299 |
youtube | 1134890 | 2987624 |
Compared with the prior art, the technical solution provided by the invention first removes the infrequent vertices and infrequent edges from the large graph data through preprocessing, and then extracts vertices uniformly from the large graph data through community detection, so that the initial vertex set for expansion is compressed and the generation of intermediate results is greatly reduced. In the subgraph isomorphism detection stage, two lemmas are combined to simplify the isomorphism test and avoid its expensive computation, and infrequent subgraphs are deleted early through subgraph pruning, avoiding useless expansion and memory consumption.
Fig. 7 compares the execution time of frequent subgraph mining for the method of the invention and the other methods on the three synthetic data sets; the horizontal axis is the sampling rate δ and the vertical axis is the execution time. Panels (a), (b), and (c) of Fig. 7 show that the method of the invention performs better and that its execution time increases with the sampling rate δ; the other methods do not change with δ because they do not sample.
Fig. 8 compares the execution time of frequent subgraph mining for the method of the invention and the other two methods on the two real data sets; the horizontal axis is the sampling rate δ and the vertical axis is the execution time. Panels (a) and (b) of Fig. 8 show that the method of the invention performs better and that its execution time increases with the sampling rate δ; the other methods may overflow memory during execution and fail to run normally, so they are not shown in the figure.
Claims (8)
1. A frequent subgraph mining method based on community detection, characterized by comprising the following steps:
step 1: preprocessing the acquired social network data;
step 2: performing community detection on the preprocessed social network to obtain a community set;
2. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein the step 1 comprises:
step 1.1: counting the frequencies of the vertex labels and edge labels in the social network data;
step 1.2: setting a frequency threshold τ, and marking the vertices and edges whose counted label frequency is smaller than τ as infrequent vertices and infrequent edges;
step 1.3: removing all infrequent vertices and infrequent edges from the graph data, and recording the number of remaining vertices in the graph data as m;
step 1.4: renumbering and reconstructing the remaining vertices;
step 1.5: reconstructing the edges according to the reconstructed vertices, replacing the start and end vertex numbers of each edge with the reconstructed vertex numbers given by the vertex mapping function f.
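Steps 1.1 to 1.3 can be illustrated with a short Python sketch. The data layout (a label dictionary and labeled edge triples) and the function name are assumptions made for the example; the sketch also assumes that an edge is dropped when either of its endpoints is removed, which the steps imply but do not state.

```python
from collections import Counter

def remove_infrequent(vertex_labels, edges, tau):
    """vertex_labels: {vertex: label}; edges: list of (u, v, edge_label)."""
    v_freq = Counter(vertex_labels.values())            # step 1.1
    e_freq = Counter(lab for _, _, lab in edges)
    # Step 1.2: a label with frequency smaller than tau is infrequent.
    kept_vertices = {v: lab for v, lab in vertex_labels.items()
                     if v_freq[lab] >= tau}
    # Step 1.3: drop infrequent edges and (ASSUMED) edges that lost an endpoint.
    kept_edges = [(u, v, lab) for u, v, lab in edges
                  if e_freq[lab] >= tau
                  and u in kept_vertices and v in kept_vertices]
    return kept_vertices, kept_edges
```

The number of remaining vertices m is then `len(kept_vertices)`.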
3. The frequent subgraph mining method based on community detection as claimed in claim 2, wherein the step 1.4 comprises:
step 1.4.1: grouping the vertices by label, and sorting the groups in descending order of the number of vertices they contain;
step 1.4.2: sorting the vertices inside each group in ascending order of vertex number;
step 1.4.3: renumbering the remaining vertices in the graph data from 0 to (m-1) according to the ordering rules of step 1.4.1 and step 1.4.2, obtaining the reconstructed vertex set $V' = \{v'_0, v'_1, \dots, v'_{m-1}\}$;
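The renumbering in steps 1.4.1 to 1.4.3, together with the edge remapping through the vertex mapping function f, might be sketched in Python as follows. The order of two label groups of equal size is not specified in the text, so the sketch breaks such ties by label as an assumption; the function name and data layout are also illustrative.

```python
from collections import Counter

def renumber(vertex_labels, edges):
    """vertex_labels: {vertex: label}; edges: list of (u, v, edge_label)."""
    size = Counter(vertex_labels.values())
    # Larger label groups first (step 1.4.1), ascending id inside a group
    # (step 1.4.2); equal-sized groups are ordered by label (ASSUMPTION).
    order = sorted(vertex_labels,
                   key=lambda v: (-size[vertex_labels[v]], vertex_labels[v], v))
    f = {old: new for new, old in enumerate(order)}     # step 1.4.3
    new_labels = {f[v]: lab for v, lab in vertex_labels.items()}
    new_edges = [(f[u], f[v], lab) for u, v, lab in edges]  # edge remapping
    return f, new_labels, new_edges
```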
4. The frequent subgraph mining method based on community detection according to claim 1, wherein the step 2 comprises the following steps:
step 2.1: calculating the k-shell value of each vertex, where the k-shell value of a vertex is the largest k for which the vertex lies in a maximal connected subgraph whose vertices all have degree at least k, the k-shell value of vertex v being denoted Ks(v);
step 2.2: initializing each vertex in the graph data with a unique community label from 0 to (m-1);
step 2.3: for each vertex v, calculating the vertex importance VI(v) according to formula (1):
where α is a tunable parameter between 0 and 1, d(u) is the degree of vertex u, N(v) is the set of neighbors of vertex v, and Ks(u) is the k-shell value of vertex u;
step 2.4: sorting the vertices in descending order of importance and using this order as the vertex label update sequence, denoted Seq;
step 2.5: initializing the iteration variable t = 1;
step 2.6: following the order of Seq, changing the community label of each vertex to the label carried by the largest number of its neighbors;
step 2.7: when several community labels are tied for the largest number of carrying neighbors, calculating the importance LI(v, l) of each tied label according to formula (2) and assigning the label with the largest LI(v, l);
where $N_l(v)$ is the set of neighbors of the vertex v being relabeled whose community label is l, and VI(u) is the importance of vertex u;
step 2.8: letting t = t + 1 and jumping to step 2.6, the loop terminating when t exceeds the maximum number of iterations or when no vertex changed its community label in the last iteration;
step 2.9: grouping all vertices in the graph data by label to obtain the community set.
5. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein the step 3 comprises:
step 3.1: aggregating the vertices of each community in the community set by vertex degree, and recording each aggregated group as a set indexed by i and l, where i is the vertex degree and l is the community number;
6. The frequent subgraph mining method based on community detection as claimed in claim 1, wherein the step 4 comprises:
step 4.1: initializing the iteration variable i = 1 and the maximum number of iterations maxIter;
step 4.4: letting i = i + 1 and jumping to step 4.2, the iteration ending when i > maxIter or when the set is empty;
7. The frequent subgraph mining method based on community detection as claimed in claim 6, wherein the step 4.2 comprises:
step 4.2.1: for every vertex in the set, first extending the vertex by one edge and one vertex;
step 4.2.2: if the extended edge and vertex are frequent, saving the extended subgraph to the candidate set;
8. The frequent subgraph mining method based on community detection as claimed in claim 6, wherein the step 4.3 comprises:
step 4.3.1: comparing the subgraphs in the set pairwise, the degree d(g) of each subgraph g being calculated according to formula (3):
$d(g) = \sum_{v \in g} \deg(v)$ (3)
where deg(v) is the degree of v in the subgraph;
step 4.3.2: if the degrees of the two subgraphs are equal, calculating the importance of each vertex in the subgraphs according to formula (4) and obtaining the vertex importance sequence of each subgraph, denoted NIS(g); otherwise, skipping directly to the comparison of the next pair of subgraphs;
where freq(v.label) is the frequency with which the label of vertex v appears in the subgraph;
step 4.3.3: if the vertex importance sequences of the two subgraphs are equal, considering the two subgraphs isomorphic and storing them in the same set;
step 4.3.4: after all subgraphs in the set have been judged, resetting the set to the empty set;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210382776.9A CN114661927A (en) | 2022-04-13 | 2022-04-13 | Frequent subgraph mining method based on community detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114661927A true CN114661927A (en) | 2022-06-24 |
Family
ID=82034752
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210382776.9A Pending CN114661927A (en) | 2022-04-13 | 2022-04-13 | Frequent subgraph mining method based on community detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114661927A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115858875A (en) * | 2023-02-10 | 2023-03-28 | 武汉中科通达高新技术股份有限公司 | Enterprise employee hierarchical relationship discovery method and device based on frequent graph pattern mining |
CN115858875B (en) * | 2023-02-10 | 2023-05-23 | 武汉中科通达高新技术股份有限公司 | Enterprise employee hierarchical relationship discovery method and device based on frequent pattern mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||