CN113515685A - Community detection method integrating Embedding enhanced topology and node content information - Google Patents

Community detection method integrating Embedding enhanced topology and node content information

Info

Publication number
CN113515685A
CN113515685A CN202110425314.6A CN202110425314A CN113515685A CN 113515685 A CN113515685 A CN 113515685A CN 202110425314 A CN202110425314 A CN 202110425314A CN 113515685 A CN113515685 A CN 113515685A
Authority
CN
China
Prior art keywords
node
information
matrix
network
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110425314.6A
Other languages
Chinese (zh)
Inventor
曹金鑫
许伟忠
丁卫平
张晓峰
鞠恒荣
黄嘉爽
程纯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-04-20
Filing date
2021-04-20
Publication date
2021-10-19
Application filed by Nantong University
Priority to CN202110425314.6A
Publication of CN113515685A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a community detection method that fuses Embedding enhanced topology with node content information. The method comprises the following steps: acquiring a data set of the edges between nodes and the texts on the nodes of a complex network; applying matrix processing to the network topology and one-hot processing to the node content information, and embedding the network topology with the Node2vec method to obtain the Embedding information of the complex network; associating these kinds of information to construct a community detection model that fuses the enhanced network topology with the node content; and deriving the model parameters through model optimization, then computing and evaluating a clustering result based on those parameters. The beneficial effects of the invention are that it mitigates the negative effects of the insufficient representation and sparsity of the network topology, and that it has good theoretical and application value for community detection fusing network topology and content information.

Description

Community detection method integrating Embedding enhanced topology and node content information
Technical Field
The invention relates to the technical field of complex network analysis, in particular to a community detection method fusing Embedding enhanced topology and node content information.
Background
A large number of complex networks exist in the real world, such as social networks, the Internet, telephone networks, and gene function networks. Mining the community structure of a network helps in understanding the main functions or structures of the complex network and in predicting node behavior, and it is one of the important tasks of complex network analysis. Traditional community detection is usually based on the idea that links between nodes of the same community are dense, while links between nodes of different communities are sparse. However, a complex network also contains rich content information, which can assist the link information (also called the network topology) and improve the accuracy of community detection. At the same time, the links in a complex network are sparse, that is, the network topology is sparse, so its representation capability is insufficient. There is therefore still room to improve community detection methods that integrate network topology and node content information. With the development of Internet technology, the information contained in complex networks grows exponentially, so community detection on large-scale networks takes a long time. To mine the community structure of large-scale content networks effectively, a new method is urgently needed that strengthens the ability of the network topology to represent communities and robustly fuses the network topology with content information, so as to improve the precision of community detection and reduce the running time of the algorithm.
Disclosure of Invention
To address the sparsity of the network topology, the invention introduces Embedding information to strengthen the network topology and thereby mitigate the effect of its insufficient representation, and at the same time fuses the content information of the nodes to improve community detection performance. Model optimization is carried out with non-negative matrix factorization (NMF) to reduce the running time of the algorithm. The method has good theoretical and application value for community detection that fuses network topology and content information.
The idea of the invention is as follows. First, a data set is acquired that contains the edges between nodes, which describe the network topology of the complex network, and the texts on the nodes, which describe the content information. Next, the network topology is converted to matrix form and the content information is one-hot encoded; at the same time, the adjacency matrix representing the network topology is embedded with the Node2vec method to obtain the Network Embedding information of the complex network. Then, assuming that the membership matrix derived from the network topology is also a low-dimensional representation of the Network Embedding information and of the content information, the membership matrix is used to associate the network topology, the Network Embedding information, and the content information, and a community detection model fusing the enhanced network topology and the node content information is established. Finally, the model parameters of the community detection model are derived through model optimization, a clustering result is computed from the model parameters, and the result is evaluated by how closely it approximates the true community structure.
The invention is realized by the following measures: a community detection method fusing Embedding enhanced topology and node content information comprises the following steps:
S1, the complex network data with content information is denoted as $G = (V, E, F)$, where $V = \{v_1, v_2, \ldots, v_n\}$ represents the set of nodes, $E$ represents the set of links, and $F$ represents the set of feature vectors of the node content information;
S2, according to $G$ in step S1, designing the inputs of the algorithm: the topology information, the Network Embedding information, and the content information of the nodes;
S3, the model of the algorithm contains three sub-models, wherein the first sub-model is constructed based on the topology information, the second sub-model is constructed using Network Embedding to strengthen the network topology information, and the third sub-model is constructed based on the content information of the nodes;
S4, combining the three sub-models of step S3 into one model under a unified framework, verifying the model on a data set, and evaluating the community detection performance of the unified model using normalized mutual information as the evaluation measure.
The invention provides a further optimization scheme of the community detection method integrating the Embedding enhanced topology and the node content information, wherein the step S2 specifically comprises the following steps:
S2.1, formalizing the topology information, specifically: construct the adjacency matrix $A = \{A_{ij}\} \in \mathbb{R}^{n \times n}$ of $G$, where $A_{ij} = 1$ when node $v_i$ and node $v_j$ are connected by an edge, and $A_{ij} = 0$ otherwise;
S2.2, introducing Network Embedding to enhance the topology information: to generate the Network Embedding information, the Node2vec algorithm maps the nodes of the network into an $l$-dimensional manifold space, yielding the Embedding matrix $U \in \mathbb{R}^{n \times l}$;
S2.3, constructing the content information of the nodes, specifically: construct the node content matrix $M \in \mathbb{R}^{n \times m}$ of $G$, where each row of $M$ represents the content information of one node, i.e., the content information of each node is represented by an $m$-dimensional feature vector using one-hot encoding.
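The construction of the inputs in steps S2.1 and S2.3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper names `build_adjacency` and `build_content_matrix`, the toy edge list, and the representation of node content as sets of term indices are all assumptions, and the graph is assumed to be undirected. The Embedding matrix $U$ of step S2.2 would come from a Node2vec implementation and is not reproduced here.

```python
import numpy as np

def build_adjacency(n, edges):
    """Return the n x n adjacency matrix A with A[i, j] = 1 for each edge.

    ASSUMPTION: the graph is undirected, so A is made symmetric.
    """
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0
    return A

def build_content_matrix(n, m, node_terms):
    """Return the n x m content matrix M; row i marks the terms of node i.

    ASSUMPTION: node_terms maps a node index to a set of term indices
    in range(m), and each present term is encoded as a 1.
    """
    M = np.zeros((n, m))
    for node, terms in node_terms.items():
        M[node, list(terms)] = 1.0
    return M

# Hypothetical toy network: 5 nodes, 3 edges, vocabulary of 4 terms.
edges = [(0, 1), (1, 2), (3, 4)]
node_terms = {0: {0, 2}, 1: {0}, 2: {2}, 3: {1, 3}, 4: {3}}

A = build_adjacency(5, edges)
M = build_content_matrix(5, 4, node_terms)
```

Each row of `M` plays the role of the $m$-dimensional feature vector of one node, and `A` is the matrix that the first sub-model later fits.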
The invention provides a further optimization scheme of the community detection method integrating the Embedding enhanced topology and the node content information, wherein the step S3 specifically comprises the following steps:
S3.1, constructing the first sub-model based on the topology information, as follows:
The first sub-model is constructed based on the following two points:
First, the tendency of node $v_i$ to belong to community $j$ is called its membership degree and is denoted $H_{ij}$. Collecting the membership degrees of all nodes in the network yields a non-negative membership matrix
$$H = \{H_{ij}\} \in \mathbb{R}_{+}^{n \times k},$$
where $k$ represents the number of communities;
Second, the tendency of nodes $v_i$ and $v_j$ to belong to community $t$ at the same time is $H_{it}H_{jt}$. Because whether $v_i$ and $v_j$ are connected by an edge depends on the probability that they belong to the same community, $H_{it}H_{jt}$ can also represent the expected number of edges between $v_i$ and $v_j$ within community $t$. Summing these expected edge numbers over all communities gives the expected number of edges between $v_i$ and $v_j$:
$$\hat{A}_{ij} = \sum_{t=1}^{k} H_{it} H_{jt}.$$
Based on the above two points, an expected adjacency matrix $\hat{A}$ can be constructed from the characterization matrix $H$, namely
$$\hat{A} = H H^{T}.$$
$\hat{A}$ is then used to fit $A$, which gives the objective function of the first sub-model:
Figure BDA0003029263950000035
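As an illustration of the first sub-model, the sketch below builds the expected adjacency matrix from a membership matrix and measures how well it fits $A$. The objective function itself is an equation image that is not reproduced in this text, so the squared Frobenius norm used here is an assumption (it is a common choice in NMF-based models), and the function names are hypothetical.

```python
import numpy as np

def expected_adjacency(H):
    """Expected adjacency matrix: entry (i, j) is sum_t H[i, t] * H[j, t]."""
    return H @ H.T

def topology_loss(A, H):
    """Fit of H H^T to A.

    ASSUMPTION: squared Frobenius norm; the patent's exact objective
    is given only as an image and may differ.
    """
    R = A - expected_adjacency(H)
    return float(np.sum(R * R))

# Hypothetical toy case: nodes 0 and 1 in community 0, node 2 in community 1.
H = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

A_hat = expected_adjacency(H)
loss = topology_loss(A, H)
```

In this toy case `A_hat` reproduces the single edge between nodes 0 and 1 exactly; the remaining residual comes only from the diagonal, because $HH^{T}$ also predicts self-links that the adjacency matrix does not contain.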
S3.2, introducing Network Embedding information to strengthen the network topology information. Because the Network Embedding is an embedded expression of the topology information, it can be considered to contain the membership matrix $H$. Therefore, following the idea of dimension reduction by non-negative matrix factorization, the membership matrix $H$ is treated as a low-dimensional description of the Embedding matrix $U$ of step S2.2. A feature matrix $C \in \mathbb{R}^{l \times k}$ is introduced and combined with the membership matrix $H$ to construct the expected Network Embedding matrix
$$\hat{U} = H C^{T}$$
to fit $U$. The objective function of the second sub-model is as follows:
Figure BDA0003029263950000036
S3.3, constructing the third sub-model based on the content information of the nodes, as follows:
For the node content matrix $M$ designed in step S2.3, each node is described by $m$ contents, and the $m$ contents can be divided into $k$ topics, so the element $H_{ij}$ of the characterization matrix $H$ can be interpreted here as the tendency of node $v_i$ to belong to topic $j$. To construct an expected content matrix $\hat{M}$ that fits $M$, a matrix $N \in \mathbb{R}^{m \times k}$ needs to be introduced, where each column represents a different topic and the element $N_{ji}$ expresses the tendency of topic $i$ to contain content $j$. Thus, the tendency of node $v_i$ to contain content $j$ can be expressed as
$$\hat{M}_{ij} = \sum_{t=1}^{k} H_{it} N_{jt},$$
so that we can construct
$$\hat{M} = H N^{T}$$
to fit the original content matrix $M$. The objective function of the node content model is thus obtained as follows:
Figure BDA00030292639500000310
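The shapes involved in the second and third sub-models can be checked with a few lines of NumPy. This is only a sketch under assumptions: the sizes are hypothetical small values, the matrices are random rather than learned, and the products $HC^{T}$ and $HN^{T}$ follow from the stated dimensions of $U$, $M$, $H$, $C$, and $N$ (with $N$ taken as $m \times k$, consistent with the embodiment).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, l, m = 6, 2, 3, 5  # hypothetical sizes: nodes, communities, embedding dim, contents

H = rng.random((n, k))  # membership degrees, also read as topic tendencies
C = rng.random((l, k))  # feature matrix of the second sub-model
N = rng.random((m, k))  # content-by-topic matrix of the third sub-model

U_hat = H @ C.T  # expected Network Embedding matrix, shape (n, l)
M_hat = H @ N.T  # expected content matrix, shape (n, m)

# Entry (i, j) of M_hat is the tendency of node i to contain content j,
# i.e. the sum over topics t of H[i, t] * N[j, t].
i, j = 4, 3
entry = sum(H[i, t] * N[j, t] for t in range(k))
```

The same membership matrix `H` appears in both products, which is what lets the unified model tie the topology, the embedding, and the content together.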
The invention provides a further optimization scheme of the community detection method integrating the Embedding enhanced topology and the node content information, wherein step S4 specifically comprises the following steps:
The first sub-model based on the topology information, the second sub-model that strengthens the topology information, and the third sub-model based on the node content information of step S3 are fused, the relative weights of the sub-models are adjusted with the weight factors $\alpha$ and $\beta$, and a unified model is constructed. The objective function of the final model is as follows:
Figure BDA00030292639500000311
To minimize the objective function, the model parameters are updated with the following three update rules:
Figure BDA00030292639500000312
Figure BDA0003029263950000041
Figure BDA0003029263950000042
The model parameter matrices $H$, $C$ and $N$ are updated iteratively until the value of the objective function converges, which finally gives the target matrix $H$, i.e., the community membership matrix of the nodes used for community detection.
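The iterative optimization can be sketched as below. Important caveats: the patent's objective function and its three update rules are equation images that are not reproduced in this text, so this sketch assumes a weighted sum of squared Frobenius norms and substitutes projected gradient descent with step halving for the patent's update rules. The function names, the toy data, and the choice to leave $C$ unconstrained (because Node2vec embeddings may be negative) are likewise assumptions.

```python
import numpy as np

def objective(A, U, M, H, C, N, alpha, beta):
    # ASSUMED form: ||A - HH^T||^2 + alpha ||U - HC^T||^2 + beta ||M - HN^T||^2
    return (np.sum((A - H @ H.T) ** 2)
            + alpha * np.sum((U - H @ C.T) ** 2)
            + beta * np.sum((M - H @ N.T) ** 2))

def fit(A, U, M, k, alpha=1.0, beta=1.0, iters=200, step=1e-2, seed=0):
    """Projected gradient descent (a stand-in for the patent's update rules)."""
    rng = np.random.default_rng(seed)
    n, l, m = A.shape[0], U.shape[1], M.shape[1]
    H, C, N = rng.random((n, k)), rng.random((l, k)), rng.random((m, k))
    history = [objective(A, U, M, H, C, N, alpha, beta)]
    for _ in range(iters):
        RA, RU, RM = A - H @ H.T, U - H @ C.T, M - H @ N.T
        gH = -2 * (RA + RA.T) @ H - 2 * alpha * RU @ C - 2 * beta * RM @ N
        gC = -2 * alpha * RU.T @ H
        gN = -2 * beta * RM.T @ H
        s = step
        while True:
            H2 = np.maximum(H - s * gH, 0.0)  # keep memberships non-negative
            C2 = C - s * gC                   # ASSUMPTION: C left unconstrained
            N2 = np.maximum(N - s * gN, 0.0)  # keep topic weights non-negative
            new = objective(A, U, M, H2, C2, N2, alpha, beta)
            if new <= history[-1] or s < 1e-12:
                break
            s *= 0.5                          # halve the step until no increase
        if new > history[-1]:
            break                             # no improving step was found
        H, C, N = H2, C2, N2
        history.append(new)
    return H, C, N, history

# Hypothetical toy data: two triangles (nodes 0-2 and 3-5).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
U = np.array([[1.0, 0.1]] * 3 + [[0.1, 1.0]] * 3)  # toy stand-in for Node2vec output
M = np.array([[1.0, 0.0, 0.0]] * 3 + [[0.0, 0.0, 1.0]] * 3)

H, C, N, history = fit(A, U, M, k=2)
labels = H.argmax(axis=1)  # community of each node = index of its largest membership
```

The step-halving loop only accepts updates that do not increase the objective, so `history` is non-increasing by construction; whether the recovered `labels` match the two triangles depends on the random initialization and is not guaranteed by this sketch.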
Compared with the prior art, the invention has the beneficial effects that: according to the invention, the content information of the nodes and the topology information of the nodes are combined, so that the community detection precision is improved, the Network Embedding information is introduced, the topology information is enhanced, the influence caused by the sparsity of the topology information is relieved, and the community detection precision integrating the topology and the node content information is further improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
Fig. 1 is a schematic overall flow chart of an embodiment of the present invention.
FIG. 2 is a diagram illustrating experimental results of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. Of course, the specific embodiments described herein are merely illustrative of the invention and are not intended to be limiting.
Example 1
Referring to Fig. 1 and Fig. 2, the technical solution provided by the present invention is a community detection method integrating an Embedding enhanced topology and node content information, comprising the following steps:
(1) Obtaining community information from the data set. The data set used in this example is the LFR artificial network, which has the following properties:
a) The degrees of all nodes are generated from a power-law distribution with exponent $\gamma$. The maximum node degree is $k_{max}$, the minimum is $k_{min}$, and the average degree is $\langle k \rangle$.
b) Each node has a proportion $1 - \mu$ of its edges connected to nodes inside its own community and, correspondingly, a proportion $\mu$ connected to nodes outside it. $\mu$ is called the mixing coefficient and describes how fuzzy the community structure of the network is.
c) The sizes of all communities are generated from a power-law distribution with exponent $\beta$, and the sum of the community sizes equals the network size, i.e., the number of nodes in the network. The maximum community size is $s_{max}$ and the minimum is $s_{min}$.
In the present embodiment, the network size is set to 1000, the maximum node degree to 32, the average degree to 16, and both the maximum and minimum community sizes to 25, so the number of communities is 40. For each of the four settings (2, 1), (2, 2), (3, 1) and (3, 2) of $(\gamma, \beta)$, the mixing coefficient $\mu$ is varied from 0.1 to 0.6 in steps of 0.05, which generates four groups of 11 networks each, 44 networks in total.
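The counts stated for this embodiment can be verified with a short calculation. The sketch only enumerates the parameter grid; it does not generate the LFR networks themselves, and the variable names are illustrative.

```python
import numpy as np

n_nodes = 1000
community_size = 25  # minimum and maximum community size are both 25
n_communities = n_nodes // community_size

# Mixing coefficients from 0.1 to 0.6 in steps of 0.05 (rounded to avoid float drift).
mus = np.round(np.arange(0.10, 0.60 + 1e-9, 0.05), 2)

# The four (gamma, beta) exponent settings used in the embodiment.
exponent_pairs = [(2, 1), (2, 2), (3, 1), (3, 2)]

# One LFR network per (gamma, beta, mu) combination.
configs = [(g, b, float(mu)) for (g, b) in exponent_pairs for mu in mus]
```

This gives 40 communities, 11 values of the mixing coefficient, and 44 network configurations, matching the figures in the text.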
(2) Constructing the topology information, the Network Embedding information, and the content information of the nodes from the network obtained in step (1), as follows:
The adjacency matrix $A = \{A_{ij}\} \in \mathbb{R}^{1000 \times 1000}$ is constructed from the generated LFR network, where $A_{ij} = 1$ when node $v_i$ and node $v_j$ are connected by an edge and $A_{ij} = 0$ otherwise. To generate the Network Embedding information, the Node2vec algorithm maps the nodes of the network into a 24-dimensional manifold space, yielding a matrix $U \in \mathbb{R}^{1000 \times 24}$. The node content information is artificially generated: using one-hot encoding, the content of each node is described by a 1280-dimensional feature vector, which gives the node content matrix $M \in \mathbb{R}^{1000 \times 1280}$.
(3) Constructing a topology model based on the topology information, and introducing Network Embedding information to supplement the topology information, thereby reducing the influence of the sparsity of the topology information; establishing a node content model based on the content information of the nodes; and unifying the topology model and the content model into a final model using weight factors. The objective function obtained is as follows:
Figure BDA0003029263950000051
where $H \in \mathbb{R}^{1000 \times 40}$, $C \in \mathbb{R}^{24 \times 40}$, and $N \in \mathbb{R}^{1280 \times 40}$.
(4) Iteratively updating $H$, $C$ and $N$ with the update rules of step S4 until convergence to obtain the target matrix $H$, from which the community assignment of every node is finally obtained.
(5) Using normalized mutual information (NMI) as the evaluation index of the model. NMI is computed from a confusion matrix $C$, in which each row represents an actual category and each column represents a detected category. NMI is normalized to the interval [0, 1] and is commonly used to present the accuracy of an algorithm. Its expression is as follows:
Figure BDA0003029263950000061
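A confusion-matrix-based NMI computation can be sketched as follows. The patent's exact expression is an equation image that is not reproduced in this text, so this sketch assumes the widely used normalization $2I(X;Y) / (H(X) + H(Y))$; the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels):
    """Rows index the actual categories, columns the detected categories."""
    true_ids, t = np.unique(true_labels, return_inverse=True)
    pred_ids, p = np.unique(pred_labels, return_inverse=True)
    cm = np.zeros((len(true_ids), len(pred_ids)))
    np.add.at(cm, (t, p), 1)  # count co-occurrences of (actual, detected)
    return cm

def nmi(cm):
    """Normalized mutual information from a confusion matrix.

    ASSUMPTION: normalization 2 * I(X; Y) / (H(X) + H(Y)).
    """
    pij = cm / cm.sum()        # joint distribution
    pi = pij.sum(axis=1)       # marginal of the actual categories
    pj = pij.sum(axis=0)       # marginal of the detected categories
    nz = pij > 0               # skip empty cells to avoid log(0)
    mutual = np.sum(pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz]))
    h_true = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    h_pred = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))
    if h_true + h_pred == 0:
        return 1.0             # both partitions consist of a single group
    return float(2.0 * mutual / (h_true + h_pred))

truth = [0, 0, 1, 1]
same_up_to_renaming = nmi(confusion_matrix(truth, [1, 1, 0, 0]))
unrelated = nmi(confusion_matrix(truth, [0, 1, 0, 1]))
```

Because NMI depends only on how the nodes are grouped, a detected partition that matches the ground truth up to a renaming of the community labels scores 1, while a partition that is statistically independent of the ground truth scores 0.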
(6) The experiment is carried out on the four groups of LFR networks, with SCI, BigCLAM, SNE, SDNE, Node2vec and CESNA added as baselines. BigCLAM is a community detection model based on topology information, and SCI and CESNA are community detection models that combine topology information and node content information. For the target matrix (characterization matrix) $H$ obtained by these models, the community of each node is determined directly from the index of the largest element in the node's characterization vector. SNE, SDNE and Node2vec are three Network Embedding models: SDNE and Node2vec are based on topology information, while SNE combines topology information and node content information. Network Embedding represents the nodes of a network as low-dimensional, real-valued, dense vectors. The Embedding information obtained by SNE, SDNE and Node2vec is then clustered with the KMeans method to obtain a community detection result. To reduce the effect of randomness as much as possible, each method is run 10 times and the average is taken as the final result. The experimental results are shown in Fig. 2. The analysis of Fig. 2 shows that, as the mixing coefficient $\mu$ changes from 0.1 to 0.6, our model has the highest accuracy and is the most stable of all models on the four groups of networks. In particular, when $\mu \leq 0.45$ the precision of the model is almost the same as that of SCI and Node2vec, and when $\mu \geq 0.45$ the advantages of the model become apparent.
Theoretically, the advantages of the model are that it combines node content information to supplement the topology information and introduces Embedding information to strengthen the network topology, which mitigates the influence of the sparsity of the network topology and thereby further improves community detection.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. A community detection method fusing Embedding enhanced topology and node content information is characterized by comprising the following steps:
S1, the complex network data with content information is denoted as $G = (V, E, F)$, where $V = \{v_1, v_2, \ldots, v_n\}$ represents the set of nodes, $E$ represents the set of links, and $F$ represents the set of feature vectors of the node content information;
S2, according to $G$ in step S1, designing the inputs of the algorithm: the topology information, the Network Embedding information, and the content information of the nodes;
S3, the model of the algorithm contains three sub-models, wherein the first sub-model is constructed based on the topology information, the second sub-model is constructed using Network Embedding to strengthen the network topology information, and the third sub-model is constructed based on the content information of the nodes;
S4, combining the three sub-models of step S3 into one model under a unified framework, verifying the model on a data set, and evaluating the community detection performance of the unified model using normalized mutual information as the evaluation measure.
2. The method for detecting a community fusing an Embedding enhanced topology and node content information according to claim 1, wherein the step S2 specifically includes:
S2.1, formalizing the topology information, specifically: construct the adjacency matrix $A = \{A_{ij}\} \in \mathbb{R}^{n \times n}$ of $G$, where $A_{ij} = 1$ when node $v_i$ and node $v_j$ are connected by an edge, and $A_{ij} = 0$ otherwise;
S2.2, introducing Network Embedding to enhance the topology information: to generate the Network Embedding information, the Node2vec algorithm maps the nodes of the network into an $l$-dimensional manifold space, yielding the Embedding matrix $U \in \mathbb{R}^{n \times l}$;
S2.3, constructing the content information of the nodes, specifically: construct the node content matrix $M \in \mathbb{R}^{n \times m}$ of $G$, where each row of $M$ represents the content information of one node, i.e., the content information of each node is represented by an $m$-dimensional feature vector using one-hot encoding.
3. The method for detecting a community fusing an Embedding enhanced topology and node content information according to claim 1 or 2, wherein the step S3 specifically includes:
S3.1, constructing the first sub-model based on the topology information, as follows:
The first sub-model is constructed based on the following two points:
First, the tendency of node $v_i$ to belong to community $j$ is called its membership degree and is denoted $H_{ij}$. Collecting the membership degrees of all nodes in the network yields a non-negative membership matrix
$$H = \{H_{ij}\} \in \mathbb{R}_{+}^{n \times k},$$
where $k$ represents the number of communities;
Second, the tendency of nodes $v_i$ and $v_j$ to belong to community $t$ at the same time is $H_{it}H_{jt}$. Because whether $v_i$ and $v_j$ are connected by an edge depends on the probability that they belong to the same community, $H_{it}H_{jt}$ can also represent the expected number of edges between $v_i$ and $v_j$ within community $t$. Summing these expected edge numbers over all communities gives the expected number of edges between $v_i$ and $v_j$:
$$\hat{A}_{ij} = \sum_{t=1}^{k} H_{it} H_{jt}.$$
Based on the above two points, an expected adjacency matrix $\hat{A}$ can be constructed from the characterization matrix $H$, namely
$$\hat{A} = H H^{T}.$$
$\hat{A}$ is then used to fit $A$, which gives the objective function of the first sub-model:
Figure FDA0003029263940000026
S3.2, introducing Network Embedding information to strengthen the network topology information. Because the Network Embedding is an embedded expression of the topology information, it should contain the membership matrix $H$. Therefore, following the idea of dimension reduction by non-negative matrix factorization, the membership matrix $H$ is treated as a low-dimensional description of the Embedding matrix $U$ of step S2.2. A feature matrix $C \in \mathbb{R}^{l \times k}$ is introduced and combined with the membership matrix $H$ to construct the expected Network Embedding matrix
$$\hat{U} = H C^{T}$$
to fit $U$. The objective function of the second sub-model is as follows:
Figure FDA0003029263940000028
S3.3, constructing the third sub-model based on the content information of the nodes, as follows:
For the node content matrix $M$ designed in step S2.3, each node is described by $m$ contents, and the $m$ contents can be divided into $k$ topics, so the element $H_{ij}$ of the characterization matrix $H$ can be interpreted here as the tendency of node $v_i$ to belong to topic $j$. To construct an expected content matrix $\hat{M}$ that fits $M$, a matrix $N \in \mathbb{R}^{m \times k}$ needs to be introduced, where each column represents a different topic and the element $N_{ji}$ expresses the tendency of topic $i$ to contain content $j$. Thus, the tendency of node $v_i$ to contain content $j$ can be expressed as
$$\hat{M}_{ij} = \sum_{t=1}^{k} H_{it} N_{jt},$$
so that we can construct
$$\hat{M} = H N^{T}$$
to fit the original content matrix $M$. The objective function of the node content model is thus obtained as follows:
Figure FDA00030292639400000212
4. The method for detecting a community fusing an Embedding enhanced topology and node content information according to any one of claims 1-3, wherein step S4 specifically comprises:
The first sub-model based on the topology information, the second sub-model that strengthens the topology information, and the third sub-model based on the node content information of step S3 are fused, the relative weights of the sub-models are adjusted with the weight factors $\alpha$ and $\beta$, and a unified model is constructed. The objective function of the final model is as follows:
Figure FDA00030292639400000213
To minimize the objective function, the model parameters are updated with the following three update rules:
Figure FDA0003029263940000021
Figure FDA0003029263940000031
Figure FDA0003029263940000032
The model parameter matrices $H$, $C$ and $N$ are updated iteratively until the value of the objective function converges, which finally gives the target matrix $H$, i.e., the community membership matrix of the nodes used for community detection.
CN202110425314.6A 2021-04-20 2021-04-20 Community detection method integrating Embedding enhanced topology and node content information Pending CN113515685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110425314.6A CN113515685A (en) 2021-04-20 2021-04-20 Community detection method integrating Embedding enhanced topology and node content information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110425314.6A CN113515685A (en) 2021-04-20 2021-04-20 Community detection method integrating Embedding enhanced topology and node content information

Publications (1)

Publication Number Publication Date
CN113515685A true CN113515685A (en) 2021-10-19

Family

ID=78062945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110425314.6A Pending CN113515685A (en) 2021-04-20 2021-04-20 Community detection method integrating Embedding enhanced topology and node content information

Country Status (1)

Country Link
CN (1) CN113515685A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116244284A (en) * 2022-12-30 2023-06-09 成都中轨轨道设备有限公司 Big data processing method based on three-dimensional content
CN116244284B (en) * 2022-12-30 2023-11-14 成都中轨轨道设备有限公司 Big data processing method based on three-dimensional content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination