Disclosure of Invention
To address the sparseness of network topology, the invention introduces Embedding information to strengthen the network topology and thereby mitigate the effect of its insufficient representation, and at the same time fuses the content information of the nodes to improve the performance of community detection. Model optimization is based on Non-negative Matrix Factorization (NMF) in order to reduce the running time of the algorithm. The method has good theoretical and application value for community detection that fuses network topology and content information.
The idea of the invention is as follows. First, data sets are acquired, such as the edges between nodes, which describe the network topology of a complex network, and the texts on the nodes, which describe the content information. Next, the network topology is converted to matrix form and the content information is one-hot encoded; in addition, the adjacency matrix representing the network topology is embedded using the Node2vec method to obtain the Network Embedding information of the complex network. Then, assuming that the membership degrees underlying the network topology are also a low-dimensional representation of both the Network Embedding information and the content information, a membership matrix is used to link the network topology, the Network Embedding information and the content information, establishing a community detection model that fuses the enhanced network topology and the node content information. Finally, the model parameters of the community detection model are derived through model optimization, a clustering result is calculated from the model parameters, and the performance of the clustering result is evaluated by how closely it approximates the true community structure.
The invention is realized by the following measures: a community detection method fusing Embedding enhanced topology and node content information comprises the following steps:
S1, the complex network data with content information is denoted as $G = (V, E, F)$, where $V = \{v_1, v_2, \ldots, v_n\}$ represents the set of nodes, E represents the set of links, and F represents the set of feature vectors of the content information of the nodes;
S2, designing the inputs of the algorithm according to G in step S1, namely the topology information, the Network Embedding information and the content information of the nodes;
S3, the model of the algorithm contains three sub-models, wherein the first sub-model is constructed based on the topology information, the second sub-model is constructed using Network Embedding to strengthen the network topology information, and the third sub-model is constructed based on the content information of the nodes;
S4, combining the three sub-models of step S3 into one model under a unified framework, verifying the model on a data set, and evaluating the community detection performance of the unified model using normalized mutual information (NMI) as the evaluation method.
As a further refinement of the community detection method fusing Embedding enhanced topology and node content information provided by the invention, step S2 specifically comprises the following steps:
S2.1, formally designing the topology information, implemented as follows: constructing the adjacency matrix $A = \{A_{ij}\} \in \mathbb{R}^{n \times n}$ of G, where $A_{ij} = 1$ when node $v_i$ and node $v_j$ are connected by an edge, and $A_{ij} = 0$ otherwise;
S2.2, introducing Network Embedding to enhance the topology information: to generate the Network Embedding information, the Node2vec algorithm is used to map the nodes of the network to an l-dimensional manifold space, giving an Embedding matrix $U \in \mathbb{R}^{n \times l}$;
S2.3, constructing the content information of the nodes, implemented as follows: constructing the node content matrix $M \in \mathbb{R}^{n \times m}$ of G, where each row of M represents the content information of one node, i.e. the content information of each node is represented by an m-dimensional feature vector using one-hot encoding.
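The inputs of steps S2.1 and S2.3 can be sketched as follows. This is a minimal illustration, assuming an undirected network given as an edge list and node content given as a set of term indices per node; the function names `adjacency_matrix` and `content_matrix` are hypothetical. The Embedding matrix U of step S2.2 would come from a Node2vec implementation and is not shown.

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Build the n x n adjacency matrix A of an undirected network.

    edges is an iterable of (i, j) node-index pairs (assumed 0-based).
    """
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0  # undirected: A is symmetric
    return A

def content_matrix(n, m, node_terms):
    """Build the n x m node content matrix M.

    node_terms maps a node index to the indices of the content terms
    it contains; M[i, j] = 1 when node i contains term j, else 0.
    """
    M = np.zeros((n, m))
    for i, terms in node_terms.items():
        M[i, list(terms)] = 1.0
    return M
```

For example, `adjacency_matrix(3, [(0, 1), (1, 2)])` describes a three-node path, and `content_matrix(3, 4, {0: [0, 3], 2: [1]})` gives node 1 an all-zero content row.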
As a further refinement of the community detection method fusing Embedding enhanced topology and node content information provided by the invention, step S3 specifically comprises the following steps:
S3.1, constructing the first sub-model based on the topology information, as follows:
the first sub-model is constructed based on the following two points:
first, the tendency of node $v_i$ to belong to community j is called the membership degree and is denoted $H_{ij}$; collecting the membership degrees of all nodes in the network gives a non-negative membership matrix $H = \{H_{ij}\} \in \mathbb{R}_{+}^{n \times k}$, where k represents the number of communities;
second, the tendency of nodes $v_i$ and $v_j$ in the network to belong to community t at the same time is expressed as $H_{it}H_{jt}$. Because whether $v_i$ and $v_j$ are connected by an edge depends on the probability that they belong to the same community, $H_{it}H_{jt}$ can also represent the expected number of edges between $v_i$ and $v_j$ in community t. Summing the expected numbers of edges over all communities gives the expected number of edges between $v_i$ and $v_j$ as
$$\sum_{t=1}^{k} H_{it}H_{jt}.$$
Based on the above two points, an expected adjacency matrix $\hat{A}$ can be constructed from the characterization matrix H, namely
$$\hat{A} = HH^{T}.$$
The objective function of the first sub-model is then obtained by fitting $HH^{T}$ to A, that is, by minimizing the difference between A and $HH^{T}$ over non-negative H.
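The expected adjacency matrix and its fit to A can be illustrated on a toy network. This sketch measures the fit with the squared Frobenius norm, which is customary in non-negative matrix factorization but is an assumption here; the function names are hypothetical.

```python
import numpy as np

def expected_adjacency(H):
    """Entry (i, j) is sum_t H[i, t] * H[j, t], the expected edge count."""
    return H @ H.T

def topology_loss(A, H):
    """Squared Frobenius distance between A and H H^T (assumed loss)."""
    return float(np.sum((A - expected_adjacency(H)) ** 2))

# Four nodes, two communities: nodes 0-1 in community 0, nodes 2-3 in community 1.
H = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
# One edge inside each community, no edge between communities.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(expected_adjacency(H))
print(topology_loss(A, H))  # 4.0: only the four diagonal entries differ
```

Here the off-diagonal entries of $HH^{T}$ match A exactly, and the remaining loss comes from the diagonal, where $HH^{T}$ has ones and A has zeros.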
S3.2, introducing Network Embedding information to strengthen the network topology information. Because the Network Embedding is an embedded representation of the topology information, it can be regarded as containing the membership matrix H. Therefore, following the idea of dimension reduction by non-negative matrix factorization, the membership matrix H is treated as a low-dimensional description of the Embedding matrix U of step S2.2, and a feature matrix $C \in \mathbb{R}^{l \times k}$ is introduced which, together with the membership matrix H, forms the expected Network Embedding matrix $\hat{U} = HC^{T}$. The objective function of the second sub-model is obtained by fitting $HC^{T}$ to U.
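To make the role of the feature matrix C concrete, the sketch below fits $HC^{T}$ to U for a fixed H by ordinary least squares. This is an illustration only: it assumes a squared-error fit and leaves C unconstrained, since Node2vec embeddings can contain negative entries; the function name `fit_feature_matrix` is hypothetical.

```python
import numpy as np

def fit_feature_matrix(U, H):
    """Return the l x k matrix C minimizing ||U - H C^T||_F^2 for fixed H."""
    # lstsq solves H X = U in the least-squares sense, with X = C^T (k x l).
    Ct, *_ = np.linalg.lstsq(H, U, rcond=None)
    return Ct.T
```

When U is exactly of the form $HC^{T}$ and H has full column rank, this recovers C; in the joint model H is not fixed, so C and H would be updated in alternation.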
S3.3, constructing the third sub-model based on the content information of the nodes, as follows:
for the node content matrix M designed in step S2.3, each node is described by m content terms, and the m content terms can be divided into k topics, so that the element $H_{ij}$ of the characterization matrix H can be interpreted here as the tendency of node $v_i$ toward topic j. To construct an expected content matrix $\hat{M}$ that fits M, a matrix $N \in \mathbb{R}^{m \times k}$ is introduced, in which each column represents a different topic and the element $N_{jt}$ expresses the tendency of topic t to contain content j. Thus, the tendency of node $v_i$ to contain content j can be expressed as
$$\sum_{t=1}^{k} H_{it}N_{jt},$$
so that the expected content matrix
$$\hat{M} = HN^{T}$$
can be constructed to fit the original content matrix M. The objective function of the node content model is obtained by fitting $HN^{T}$ to M.
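A small numeric example may help to read the product of H and N. The values below are made up for illustration, and the function name `expected_content` is hypothetical.

```python
import numpy as np

def expected_content(H, N):
    """Entry (i, j) is sum_t H[i, t] * N[j, t]: tendency of node i to contain content j."""
    return H @ N.T

# Two nodes, two topics: node 0 leans toward topic 0, node 1 toward topic 1.
H = np.array([[0.8, 0.2],
              [0.1, 0.9]])
# Three content terms: term 0 belongs to topic 0, term 2 to topic 1, term 1 is shared.
N = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])

M_hat = expected_content(H, N)
print(M_hat)  # rows are nodes, columns are content terms
```

Node 0 obtains its largest value for term 0 and node 1 for term 2, which mirrors their topic tendencies in H.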
As a further refinement of the community detection method fusing Embedding enhanced topology and node content information provided by the invention, step S4 specifically comprises the following steps:
fusing the first sub-model based on the topology information, the second sub-model that strengthens the topology information, and the third sub-model based on the node content information of step S3, and adjusting the relative weights of the different sub-models using the weight factors α and β, so as to construct a unified model whose objective function is the weighted combination of the objective functions of the three sub-models.
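One plausible explicit form of the unified objective is given below. It assumes that each fit is measured by the squared Frobenius norm, that α weights the Network Embedding term and β the content term, and that C is left unconstrained because Node2vec embeddings can contain negative entries; none of these choices is fixed by the text above.

$$\min_{H \ge 0,\; N \ge 0,\; C} \; \lVert A - HH^{T} \rVert_F^{2} + \alpha \lVert U - HC^{T} \rVert_F^{2} + \beta \lVert M - HN^{T} \rVert_F^{2}$$

Setting α = 0 removes the Embedding enhancement, and setting β = 0 removes the node content, which makes the contribution of each sub-model easy to isolate.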
To minimize the objective function, three update formulas are used, one for each of the model parameter matrices H, C and N. The matrices are updated iteratively until the value of the objective function converges, and the resulting target matrix H is the community membership matrix of the nodes, which is used for community detection.
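The iterative optimization can be sketched as follows. This is not the invention's update formulas: it swaps in projected gradient descent with backtracking, applied to an assumed squared-Frobenius objective with α on the Embedding term and β on the content term. It further assumes A is symmetric and leaves C unconstrained; the names `objective` and `fit` are hypothetical.

```python
import numpy as np

def objective(A, U, M, H, C, N, alpha, beta):
    """Assumed loss: ||A-HH^T||^2 + alpha*||U-HC^T||^2 + beta*||M-HN^T||^2."""
    return (np.linalg.norm(A - H @ H.T) ** 2
            + alpha * np.linalg.norm(U - H @ C.T) ** 2
            + beta * np.linalg.norm(M - H @ N.T) ** 2)

def fit(A, U, M, k, alpha=1.0, beta=1.0, iters=200, seed=0):
    """Projected gradient descent with backtracking on H, C and N."""
    rng = np.random.default_rng(seed)
    H = rng.random((A.shape[0], k))            # non-negative memberships
    C = rng.standard_normal((U.shape[1], k))   # unconstrained feature matrix
    N = rng.random((M.shape[1], k))            # non-negative topic matrix
    step = 1e-2
    history = [objective(A, U, M, H, C, N, alpha, beta)]
    for _ in range(iters):
        RA, RU, RM = A - H @ H.T, U - H @ C.T, M - H @ N.T
        # Gradients of the assumed loss (the factor 4 relies on A being symmetric).
        gH = -4 * RA @ H - 2 * alpha * RU @ C - 2 * beta * RM @ N
        gC = -2 * alpha * RU.T @ H
        gN = -2 * beta * RM.T @ H
        while step > 1e-12:
            H2 = np.maximum(H - step * gH, 0.0)  # project onto H >= 0
            C2 = C - step * gC
            N2 = np.maximum(N - step * gN, 0.0)  # project onto N >= 0
            new = objective(A, U, M, H2, C2, N2, alpha, beta)
            if new <= history[-1]:               # accept only non-increasing steps
                H, C, N = H2, C2, N2
                history.append(new)
                step *= 1.5
                break
            step *= 0.5                          # otherwise shrink the step
        else:
            break  # no acceptable step found: treat as converged
    return H, C, N, history
```

Because a step is accepted only when the objective does not increase, `history` is non-increasing by construction, which gives a simple convergence check. Multiplicative update rules of the kind commonly used in NMF would be an alternative, provided all factors are kept non-negative.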
Compared with the prior art, the invention has the following beneficial effects: by combining the content information of the nodes with their topology information, the invention improves the accuracy of community detection; by introducing Network Embedding information, it strengthens the topology information and mitigates the effect of the sparsity of the topology information, thereby further improving the accuracy of community detection that fuses topology and node content information.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. Of course, the specific embodiments described herein are merely illustrative of the invention and are not intended to be limiting.
Example 1
Referring to fig. 1 to 2, the technical solution provided by the present invention is a community detection method fusing Embedding enhanced topology and node content information, comprising the following steps:
(1) Obtaining community information from the data set. The data set used in this example is the LFR artificial network, which is generated as follows:
a) The degrees of all nodes are generated from a power-law distribution with exponent γ. The maximum node degree is set to $k_{max}$, the minimum to $k_{min}$, and the average degree to $\langle k \rangle$.
b) Each node has a proportion 1 − μ of its edges connected to nodes inside its community and, correspondingly, a proportion μ of its edges connected to nodes outside its community. μ is called the mixing coefficient and describes how fuzzy the community structure of the network is.
c) The sizes of all communities are generated from a power-law distribution with exponent β, and the sum of all community sizes equals the network size, i.e. the number of nodes in the network. The maximum community size is set to $s_{max}$ and the minimum to $s_{min}$.
In the present embodiment, the network size is set to 1000, the maximum node degree is set to 32, the average degree is set to 16, and the maximum and minimum community sizes are both set to 25, so that the number of communities is 40. For each of the four settings (2, 1), (2, 2), (3, 1) and (3, 2) of (γ, β), the mixing coefficient μ is varied from 0.1 to 0.6 in steps of 0.05, generating four groups of 11 networks each, 44 networks in total.
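The counts stated above follow directly from the parameters, as this short check shows; the variable names are illustrative only.

```python
n_nodes = 1000
community_size = 25  # maximum and minimum community size coincide
n_communities = n_nodes // community_size

# Mixing coefficients 0.1, 0.15, ..., 0.6 (rounded to avoid float drift).
mus = [round(0.1 + 0.05 * i, 2) for i in range(11)]

# One network per (gamma, beta, mu) combination.
gamma_beta = [(2, 1), (2, 2), (3, 1), (3, 2)]
settings = [(g, b, mu) for (g, b) in gamma_beta for mu in mus]

print(n_communities, len(mus), len(settings))  # 40 11 44
```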
(2) Constructing the topology information, the Network Embedding information and the content information of the nodes from the network acquired in step (1), as follows:
constructing the adjacency matrix $A = \{A_{ij}\} \in \mathbb{R}^{1000 \times 1000}$ from the generated LFR network, where $A_{ij} = 1$ when node $v_i$ and node $v_j$ are connected by an edge, and $A_{ij} = 0$ otherwise. To generate the Network Embedding information, the Node2vec algorithm is used to map the nodes of the network to a 24-dimensional manifold space, giving a matrix $U \in \mathbb{R}^{1000 \times 24}$. The node content information is artificially generated: the content of each node is described by a 1280-dimensional feature vector using one-hot encoding, giving a node content matrix $M \in \mathbb{R}^{1000 \times 1280}$;
(3) Constructing a topology model based on the topology information, and introducing Network Embedding information to supplement the topology information, thereby reducing the effect of the sparsity of the topology information; constructing a node content model based on the content information of the nodes; and unifying the topology model and the content model into a final model using the weight factors. The resulting objective function is that of the final model in step S4, wherein $H \in \mathbb{R}^{1000 \times 40}$, $C \in \mathbb{R}^{24 \times 40}$ and $N \in \mathbb{R}^{1280 \times 40}$.
(4) Iteratively updating H, C and N using the update formulas of step S4 until convergence to obtain the target matrix H, from which the community assignment of every node is obtained.
(5) Using Normalized Mutual Information (NMI) as the evaluation index of the model. NMI is computed from a confusion matrix C (not to be confused with the feature matrix C of step S3.2), in which each row represents an actual category and each column represents a detected category. NMI is normalized to the interval [0, 1] and is commonly used to present the accuracy of an algorithm; its value is calculated from the entries of the confusion matrix together with its row and column sums.
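A widely used confusion-matrix form of NMI, assumed here to be the intended one, is $-2\sum_{ij} C_{ij}\ln\frac{C_{ij}N}{C_{i\cdot}C_{\cdot j}}$ divided by $\sum_i C_{i\cdot}\ln\frac{C_{i\cdot}}{N} + \sum_j C_{\cdot j}\ln\frac{C_{\cdot j}}{N}$, where $C_{i\cdot}$ and $C_{\cdot j}$ are row and column sums and N is the number of nodes. The sketch below implements it; the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(true_labels, found_labels):
    """Rows index actual communities, columns index detected communities."""
    t_ids, t_idx = np.unique(true_labels, return_inverse=True)
    f_ids, f_idx = np.unique(found_labels, return_inverse=True)
    conf = np.zeros((len(t_ids), len(f_ids)))
    np.add.at(conf, (t_idx, f_idx), 1)  # count nodes per (actual, detected) pair
    return conf

def nmi_from_confusion(conf):
    """Normalized mutual information computed from a confusion matrix."""
    conf = np.asarray(conf, dtype=float)
    n = conf.sum()
    rows, cols = conf.sum(axis=1), conf.sum(axis=0)
    expected = np.outer(rows, cols)
    nz = conf > 0  # skip empty cells, whose contribution is zero
    mutual = -2.0 * np.sum(conf[nz] * np.log(conf[nz] * n / expected[nz]))
    entropy = (np.sum(rows[rows > 0] * np.log(rows[rows > 0] / n))
               + np.sum(cols[cols > 0] * np.log(cols[cols > 0] / n)))
    # Convention chosen here: two single-community partitions count as identical.
    return 1.0 if entropy == 0 else float(mutual / entropy)
```

A detected partition that matches the actual one up to a relabelling of communities scores 1, while a partition that is statistically independent of the actual one scores 0.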
(6) The experiment is carried out on the four groups of LFR networks, with SCI, BigCLAM, SNE, SDNE, Node2vec and CESNA added as baseline comparison experiments. BigCLAM is a community detection model based on topology information, while SCI and CESNA are community detection models that combine topology information and node content information; for the target matrix (characterization matrix) H obtained by these models, the community of each node is determined directly from the index of the maximum element in the node's characterization vector.
SNE, SDNE and Node2vec are three Network Embedding models, of which SDNE and Node2vec are based on topology information and SNE combines topology information and node content information. Network Embedding represents the nodes of a network as low-dimensional, real-valued, dense vectors. The Embedding information obtained by SNE, SDNE and Node2vec is clustered with the KMeans method to obtain the community detection result.
To reduce the randomness of the results as far as possible, each method is run 10 times and the average is taken as the final result; the experimental results are shown in fig. 2. The analysis of fig. 2 shows that, as the mixing coefficient μ changes from 0.1 to 0.6, our model has the highest accuracy and is the most stable of all the models across the four groups of networks. In particular, when μ ≤ 0.45 the accuracy of our model is almost the same as that of SCI and Node2vec, and when μ ≥ 0.45 the advantages of our model become apparent.
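The two post-processing steps described above, assigning each node to the community with the largest membership and averaging a score over repeated runs, can be sketched as follows. The function names and the `run_once` callback are hypothetical.

```python
import numpy as np

def assign_communities(H):
    """Assign each node (row of H) to the index of its largest membership."""
    return np.argmax(H, axis=1)

def average_score(run_once, runs=10):
    """Average the score returned by run_once(seed) over several seeded runs."""
    return float(np.mean([run_once(seed) for seed in range(runs)]))
```

For instance, `assign_communities(np.array([[0.9, 0.1], [0.2, 0.8]]))` places the first node in community 0 and the second in community 1. For the embedding baselines, the same averaging would wrap a KMeans clustering of the embedding vectors instead of the arg-max rule.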
In theory, the advantage of the model is that it combines node content information to supplement the topology information and introduces Embedding information to strengthen the network topology, which mitigates the effect of the sparsity of the network topology and thereby further improves the performance of community detection.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.