CN113658012A - Community discovery method based on deep network representation learning - Google Patents


Info

Publication number
CN113658012A
CN113658012A (application CN202110703377.3A)
Authority
CN
China
Prior art keywords
network
community
node
matrix
community structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110703377.3A
Other languages
Chinese (zh)
Inventor
潘雨
潘志松
胡谷雨
王帅辉
邹军华
刘鑫
黎维
陶蔚
周星宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202110703377.3A priority Critical patent/CN113658012A/en
Publication of CN113658012A publication Critical patent/CN113658012A/en
Pending legal-status Critical Current


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 — Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 — Social networking
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 — Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 — Complex mathematical operations
    • G06F17/11 — Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/13 — Differential equations
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/22 — Matching criteria, e.g. proximity measures
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A community discovery method based on deep network representation learning relates to the technical field of graph partitioning. The method models a network as a graph; constructs a community structure matrix; obtains network node representation vectors; and runs the K-means algorithm on the resulting low-dimensional representation of the network to obtain the final network community structure. The method uses a deep neural network to capture the nonlinear structure of the network and learns more accurate and richer node representations, laying a solid foundation for subsequent community discovery. It accurately mines community structure in large-scale, sparse, high-dimensional networks.

Description

Community discovery method based on deep network representation learning
Technical Field
The invention relates to the technical field of graph partitioning, and in particular to a community discovery method based on deep network representation learning.
Background
The community structure is an important structural feature widely present in networks: connections among nodes within a community are dense, while connections between communities are sparse. Community discovery is the process of mining the community structure hidden in network data from a mesoscopic perspective by analyzing the interactions and latent information between nodes in the network. Community discovery provides an effective tool for exploring the latent characteristics of complex networks, and it has important theoretical and practical significance for understanding network organization, analyzing latent network characteristics, and discovering hidden rules and interaction patterns in networks.
In recent years, with the development of networks and the rise of social media, complex networks have become increasingly large-scale, sparse, and high-dimensional. Conventional topology-based community discovery algorithms suffer from high computational complexity, low parallelism, inability to scale to large networks, and inability to handle sparse data. It is therefore imperative to design a scalable community discovery algorithm for large-scale, sparse, high-dimensional networks.
Traditional community discovery algorithms operate on the adjacency matrix of the topological representation and suffer from high computational complexity, lack of parallelism, and inability to mine the nonlinear structure of the network; they are not suitable for large-scale, sparse, high-dimensional networks.
Disclosure of Invention
The invention aims to provide a community discovery method based on deep network representation learning that can accurately mine community structures in large-scale, sparse, high-dimensional networks.
A community discovery method based on deep network representation learning comprises the following steps:
the first step: modeling the network as a graph;
the second step: constructing a community structure matrix;
the third step: obtaining network node representation vectors;
the fourth step: running the K-means algorithm on the obtained low-dimensional representation of the network to obtain the final network community structure.
The method for constructing the community structure matrix specifically comprises the following steps:
firstly, a function R is designed to measure the similarity between community members; then, based on this similarity measure, a Skip-gram model with negative sampling is adopted to further explore the underlying community structure of the network; finally, a matrix X capturing the latent community structure of the network is obtained;
firstly, a function R is designed to measure the similarity between community members; a community membership indicator matrix H ∈ R^(n×k) is introduced, where each row H_i of H represents the degree of membership of the corresponding node in each community; the inner product H_i·H_j^T represents the probability that an edge exists between nodes v_i and v_j, with H_i·H_j^T ≥ 0;
the following node similarity function R is designed to measure the similarity of two nodes belonging to the same community:
R(i, j) = 2σ(H_i·H_j^T) − 1    (5)
where σ(·) is the sigmoid function, so that R(i, j) ∈ [0, 1);
a Skip-gram model based on negative sampling is adopted; for any two nodes v_i and v_j there is the formula:
ℓ_ij = a_ij·log σ(H_i·H_j^T) + κ·E_{v_n~P_n}[log σ(−H_i·H_n^T)]    (6)
where κ is the number of negative samples; negative samples are selected according to node degree, and a randomly sampled node v_n obeys the distribution P_n(v_n) = d_{v_n}/D,
where d_i is the degree of node v_i, d_i = Σ_j a_ij, and D = Σ_i d_i is the sum of all node degrees in the network. Equation (6) is rewritten as:
ℓ_ij = a_ij·log σ(H_i·H_j^T) + κ·(d_i·d_j/D)·log σ(−H_i·H_j^T)    (7)
Then, taking the partial derivative ∂ℓ_ij/∂(H_i·H_j^T) and setting it to zero to optimize equation (7) yields:
H_i·H_j^T = log(a_ij·D/(d_i·d_j)) − log κ    (8)
A weighted matrix X ∈ R^(n×n) storing the latent community information of the network is obtained; the elements of X are:
X_ij = max(log(a_ij·D/(d_i·d_j)) − log κ, 0)    (9)
the weights of the elements in the matrix X are the weights between the edges influenced by the community structure between the nodes, so that the structural proximity between the nodes is quantized, and the potential community structure of the network is reflected.
The specific process of obtaining the network node representation vectors is as follows:
the obtained community matrix X is used as the input of a deep autoencoder to obtain a low-dimensional vector representation of the network and capture its community structure, ensuring that nodes belonging to the same community are close to each other in the embedding space;
each row of matrix X is an input to the deep autoencoder, and the loss function is:
L = Σ_i ||x̂_i − x_i||₂²
where x̂_i is the reconstruction of the i-th input row.
Training the autoencoder to minimize the reconstruction error preserves the similarity between the input vectors in the embedding space. Minimizing the input-output loss retains the characteristics of the input data, i.e. the latent community structure of the network, in the hidden layers to the greatest extent. The node representations output by the last hidden layer therefore preserve the features of the input community structure matrix X, and applying them in the subsequent community discovery algorithm yields a clear and accurate community structure.
The invention provides a community discovery method based on network representation learning for the problem of discovering community structure in large-scale, sparse, high-dimensional networks. The method uses a deep neural network to accurately mine the community structure of the network. Strong interactions and complex dependencies between nodes in real networks make the network structure highly nonlinear, and the interactions between different features are often nonlinear. For such nonlinear relations, deep neural networks have strong representation and generalization capabilities. Deep learning has recently achieved great success in many applications, such as image classification, speech recognition, and natural language processing. The autoencoder is an unsupervised deep feature learning model with good performance in dimensionality reduction and feature extraction. An autoencoder is therefore employed here to capture the complex nonlinear relationships in the network.
Node representations of the network are learned with a deep model, and communities are then found in the embedding space. This largely retains efficiency and computational speed, is portable, has strong feature-learning capability, and is more robust to network sparsity. Applying deep learning to community discovery successfully solves the community discovery problem in large-scale sparse networks. Compared with traditional community discovery algorithms, the proposed method captures the nonlinear structure of the network with a deep neural network and learns more accurate and richer node representations, laying a solid foundation for subsequent community discovery and accurately mining community structure in large-scale, sparse, high-dimensional networks.
Detailed Description
The goal of the autoencoder is to reconstruct the original input so that the output is as close as possible to the input. The output of the hidden layer can then be viewed as a low-dimensional representation of the original data, extracting the features contained in the original data to the greatest extent. The autoencoder comprises two symmetric components: an encoder and a decoder. A basic autoencoder can be seen as a three-layer neural network consisting of an input layer, a hidden layer, and an output layer.
Given input data x_i, the encoder maps x_i to the hidden-layer encoding h_i, which can be regarded as a low-dimensional embedding of x_i:
h_i = σ(W^(1)·x_i + b^(1))    (1)
The decoder then reconstructs the input data; x̂_i is the reconstructed output:
x̂_i = σ(W^(2)·h_i + b^(2))    (2)
the input data is encoded and decoded to obtain a reconstructed representation of the input data. Wherein θ ═ W(1),W(2),b(1),b(2)Is a parameter set, W(1),W(2)Weight matrices for the encoder and decoder, respectively, b(1),b(2)The bias vectors for the encoder and decoder, respectively. σ (-) is a non-linear activation function, e.g. sigmoid function
Figure BDA0003130283840000061
tan h function
Figure BDA0003130283840000062
And the like. The self-encoder derives a characterization of the input data by minimizing the error between the input data and the reconstructed data:
L(θ) = Σ_i ||x̂_i − x_i||₂²    (3)
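As an illustration, the encoder pass, decoder pass, and reconstruction error described above can be sketched in numpy with randomly initialised parameters (a toy forward pass only, no training loop; all dimensions and names are ours, not the patent's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, d, h = 6, 8, 3            # nodes, input dim, hidden dim (arbitrary)
X = rng.random((n, d))       # toy input rows x_i

# Randomly initialised parameter set theta = {W1, W2, b1, b2}
W1, b1 = rng.normal(size=(h, d)) * 0.1, np.zeros(h)
W2, b2 = rng.normal(size=(d, h)) * 0.1, np.zeros(d)

H = sigmoid(X @ W1.T + b1)        # encoder: h_i = sigma(W1 x_i + b1)
X_hat = sigmoid(H @ W2.T + b2)    # decoder: x_hat_i = sigma(W2 h_i + b2)
loss = np.sum((X_hat - X) ** 2)   # reconstruction error over all rows
print(loss)
```

In practice the parameters would be fitted by gradient descent on this loss rather than left at their random initial values.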
in order to capture the high nonlinearity of the topological structure and the node attribute, the section combines a plurality of nonlinear functions into an encoder and a decoder, and learns the characteristics of different abstraction levels by carrying out multi-layer abstraction learning on data.
h_i^(k) = σ(W^(k)·h_i^(k−1) + b^(k)), k = 1, …, K    (4)
where K denotes the number of hidden layers, h_i^(0) = x_i, and h_i^(K) is the low-dimensional feature representation of node i.
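The multi-layer encoder described above can be sketched as a plain forward pass through K stacked nonlinear layers (a toy illustration with random weights; the layer sizes are arbitrary choices of ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, weights, biases):
    """Forward pass through K stacked layers:
    h^(k) = sigma(W^(k) h^(k-1) + b^(k)), with h^(0) = x."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h

rng = np.random.default_rng(1)
dims = [16, 8, 4, 2]   # input dim -> hidden dims -> embedding dim
Ws = [rng.normal(size=(dims[k + 1], dims[k])) * 0.1 for k in range(3)]
bs = [np.zeros(dims[k + 1]) for k in range(3)]

z = encode(rng.random(16), Ws, bs)  # low-dimensional representation
print(z.shape)  # (2,)
```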
The detailed process of the community discovery method based on network representation learning of the invention can be described as follows:
the first step: modeling the network as a graph;
the second step: constructing a community structure matrix;
According to the probabilistic generative model, the formation of edges between nodes is influenced by the latent community structure of the network: if two nodes are connected by an edge, they are likely to belong to the same community. Latent community structure can therefore be mined by maximizing the probability that an edge exists between two connected nodes. Based on this, the structural proximity of nodes is quantified according to the similarity of their latent community memberships, yielding the community structure matrix X.
First, we design a function R to measure the similarity between community members. Then, based on this similarity measure, a Skip-gram model with negative sampling is adopted to further explore the underlying community structure of the network. Finally, a matrix X capturing the latent community structure of the network is obtained.
First, a function R is designed to measure the similarity between community members. A community membership indicator matrix H ∈ R^(n×k) is introduced; each row H_i of H represents the degree of membership of the corresponding node in each community. The inner product H_i·H_j^T represents the probability that an edge exists between nodes v_i and v_j, and H_i·H_j^T ≥ 0.
Therefore, the following node similarity function R is designed to measure the similarity of two nodes belonging to the same community:
R(i, j) = 2σ(H_i·H_j^T) − 1    (5)
where σ(·) is the sigmoid function, so that R(i, j) ∈ [0, 1). Because R(i, j) increases monotonically with H_i·H_j^T, the discussion below focuses on H_i·H_j^T.
According to the probabilistic generative model, if the probability of an edge existing between two nodes is larger, i.e. R(i, j) is larger, then the probability that the two nodes belong to the same community is larger. Thus, for two nodes v_i and v_j connected by an edge in the network, H_i·H_j^T is maximized to capture the latent community structure of the network. At the same time, for two nodes selected at random from the network, H_i·H_j^T is minimized.
This is because networks are generally sparse and most node pairs are unconnected, so for two nodes selected at random the probability of an edge between them is low, as is the probability that they belong to the same community. Based on this, a Skip-gram model based on negative sampling is adopted; for any two nodes v_i and v_j there is the formula:
ℓ_ij = a_ij·log σ(H_i·H_j^T) + κ·E_{v_n~P_n}[log σ(−H_i·H_n^T)]    (6)
where κ is the number of negative samples. Negative samples are selected according to node degree: a randomly sampled node v_n obeys the distribution P_n(v_n) = d_{v_n}/D,
where d_i is the degree of node v_i, d_i = Σ_j a_ij, and D = Σ_i d_i is the sum of all node degrees in the network. Equation (6) is rewritten as:
ℓ_ij = a_ij·log σ(H_i·H_j^T) + κ·(d_i·d_j/D)·log σ(−H_i·H_j^T)    (7)
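Drawing negative samples in proportion to node degree can be sketched as follows (a toy graph; the exact noise distribution appears only as an image in the original, so P_n(v) = d_v/D is our reading of the surrounding definitions):

```python
import numpy as np

# Toy undirected adjacency matrix a_ij
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)   # node degrees d_i = sum_j a_ij
D = d.sum()         # D = sum_i d_i
P_n = d / D         # degree-proportional noise distribution

rng = np.random.default_rng(0)
kappa = 5           # number of negative samples per positive pair
negatives = rng.choice(len(d), size=kappa, p=P_n)
print(P_n, negatives)
```

High-degree nodes are drawn more often as negatives, matching the text's statement that negative samples are selected according to node degree.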
Then, taking the partial derivative ∂ℓ_ij/∂(H_i·H_j^T) and setting it to zero to optimize equation (7) yields:
H_i·H_j^T = log(a_ij·D/(d_i·d_j)) − log κ    (8)
In summary, the weighted matrix X ∈ R^(n×n) storing the latent community information of the network is obtained; the elements of X are:
X_ij = max(log(a_ij·D/(d_i·d_j)) − log κ, 0)    (9)
the weights of the elements in the matrix X are the weights between the edges influenced by the community structure between the nodes, so that the structural proximity between the nodes is quantized, and the potential community structure of the network is reflected.
the third step: obtaining network node representation vectors;
The obtained community matrix X is used as the input of a deep autoencoder to obtain a low-dimensional vector representation of the network and capture its community structure, ensuring that nodes belonging to the same community are close to each other in the embedding space.
Each row of matrix X is the input to the depth autoencoder, and the loss function is as follows:
Figure BDA0003130283840000086
Training the autoencoder to minimize the reconstruction error preserves the similarity between the input vectors in the embedding space. Minimizing the input-output loss retains the characteristics of the input data, i.e. the latent community structure of the network, in the hidden layers to the greatest extent. The node representations output by the last hidden layer therefore preserve the features of the input community structure matrix X, and applying them in the subsequent community discovery algorithm yields a clear and accurate community structure.
the fourth step: running the K-means algorithm on the obtained low-dimensional representation of the network to obtain the final network community structure.
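The fourth step — K-means on the rows of the embedding — can be sketched with plain Lloyd iterations (a stand-in for any standard K-means implementation; the toy "embeddings" below are two synthetic blobs rather than output of the autoencoder):

```python
import numpy as np

def kmeans(Z, k, iters=50):
    """Lloyd's algorithm on the rows of the embedding matrix Z.
    Farthest-point initialisation keeps the demo deterministic."""
    centers = [Z[0]]
    for _ in range(k - 1):  # pick each next center far from the rest
        dists = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(iters):
        # assign every node to its nearest center, then recompute means
        labels = np.argmin(
            ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels

# Two well-separated blobs standing in for learned node embeddings
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
labels = kmeans(Z, k=2)
print(labels)  # first five nodes share one community, last five the other
```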
The present invention will be described in further detail with reference to the following experiments. The effectiveness of the algorithm is evaluated on 5 power-law networks of different scales generated with the Lancichinetti-Fortunato-Radicchi (LFR) benchmark proposed by Lancichinetti et al. As shown in Table 1, the network size increases from Lnetwork1 to Lnetwork5.
TABLE 1 statistical information of LFR Artificial data set
[Table 1, giving the statistics of the five LFR networks Lnetwork1–Lnetwork5, is provided as an image in the original publication.]
The DNCE method of the invention achieves the best community discovery performance on all 5 artificial data sets. Compared with traditional community discovery methods, it better captures the nonlinear structure of the network on large-scale data sets and has stronger feature-learning capability, thus producing excellent community partitioning results.

Claims (3)

1. A community discovery method based on deep network representation learning, characterized by comprising the following steps:
the first step: modeling the network as a graph;
the second step: constructing a community structure matrix;
the third step: obtaining network node representation vectors;
the fourth step: running the K-means algorithm on the obtained low-dimensional representation of the network to obtain the final network community structure.
2. The method of claim 1, wherein the constructing the community structure matrix specifically comprises the following steps:
firstly, a function R is designed to measure the similarity between community members; then, based on this similarity measure, a Skip-gram model with negative sampling is adopted to further explore the underlying community structure of the network; finally, a matrix X capturing the latent community structure of the network is obtained;
firstly, a function R is designed to measure the similarity between community members; a community membership indicator matrix H ∈ R^(n×k) is introduced, where each row H_i of H represents the degree of membership of the corresponding node in each community; the inner product H_i·H_j^T represents the probability that an edge exists between nodes v_i and v_j, with H_i·H_j^T ≥ 0;
the following node similarity function R is designed to measure the similarity of two nodes belonging to the same community:
R(i, j) = 2σ(H_i·H_j^T) − 1    (5)
where σ(·) is the sigmoid function, so that R(i, j) ∈ [0, 1);
a Skip-gram model based on negative sampling is adopted; for any two nodes v_i and v_j there is the formula:
ℓ_ij = a_ij·log σ(H_i·H_j^T) + κ·E_{v_n~P_n}[log σ(−H_i·H_n^T)]    (6)
where κ is the number of negative samples; negative samples are selected according to node degree, and a randomly sampled node v_n obeys the distribution P_n(v_n) = d_{v_n}/D,
where d_i is the degree of node v_i, d_i = Σ_j a_ij, and D = Σ_i d_i is the sum of all node degrees in the network. Equation (6) is rewritten as:
ℓ_ij = a_ij·log σ(H_i·H_j^T) + κ·(d_i·d_j/D)·log σ(−H_i·H_j^T)    (7)
Then, taking the partial derivative ∂ℓ_ij/∂(H_i·H_j^T) and setting it to zero to optimize equation (7) yields:
H_i·H_j^T = log(a_ij·D/(d_i·d_j)) − log κ    (8)
A weighted matrix X ∈ R^(n×n) storing the latent community information of the network is obtained; the elements of X are:
X_ij = max(log(a_ij·D/(d_i·d_j)) − log κ, 0)    (9)
the weights of the elements in the matrix X are the weights between the edges influenced by the community structure between the nodes, so that the structural proximity between the nodes is quantized, and the potential community structure of the network is reflected.
3. The method of claim 2, wherein the obtaining of the network node representation vector comprises:
the obtained community matrix X is used as the input of a deep autoencoder to obtain a low-dimensional vector representation of the network and capture its community structure, ensuring that nodes belonging to the same community are close to each other in the embedding space;
each row of matrix X is an input to the deep autoencoder, and the loss function is:
L = Σ_i ||x̂_i − x_i||₂²
where x̂_i is the reconstruction of the i-th input row;
Training the autoencoder to minimize the reconstruction error preserves the similarity between the input vectors in the embedding space; minimizing the input-output loss retains the characteristics of the input data, i.e. the latent community structure of the network, in the hidden layers to the greatest extent. The node representations output by the last hidden layer preserve the features of the input community structure matrix X, and applying them in the subsequent community discovery algorithm yields a clear and accurate community structure.
CN202110703377.3A 2021-06-24 2021-06-24 Community discovery method based on deep network representation learning Pending CN113658012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703377.3A CN113658012A (en) 2021-06-24 2021-06-24 Community discovery method based on deep network representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110703377.3A CN113658012A (en) 2021-06-24 2021-06-24 Community discovery method based on deep network representation learning

Publications (1)

Publication Number Publication Date
CN113658012A true CN113658012A (en) 2021-11-16

Family

ID=78489013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703377.3A Pending CN113658012A (en) 2021-06-24 2021-06-24 Community discovery method based on deep network representation learning

Country Status (1)

Country Link
CN (1) CN113658012A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375502A (en) * 2022-08-16 2022-11-22 中国人民解放军海军指挥学院 Intelligent overlapped community mining method and system based on dual-scale graph wavelet neural network


Similar Documents

Publication Publication Date Title
Popat et al. Cluster-based probability model and its application to image and texture processing
CN107240136B (en) Static image compression method based on deep learning model
CN112464004A (en) Multi-view depth generation image clustering method
Miok et al. Generating data using Monte Carlo dropout
CN115688982A (en) Building photovoltaic data completion method based on WGAN and whale optimization algorithm
Lin et al. A deep clustering algorithm based on gaussian mixture model
CN111461348A (en) Deep network embedded learning method based on graph core
CN113658012A (en) Community discovery method based on deep network representation learning
CN114841296A (en) Device clustering method, terminal device and storage medium
de Castro et al. BAIS: A Bayesian Artificial Immune System for the effective handling of building blocks
CN117056763A (en) Community discovery method based on variogram embedding
Sorwar et al. DCT based texture classification using soft computing approach
CN115587626A (en) Heterogeneous graph neural network attribute completion method
CN113409159A (en) Deep community discovery method fusing node attributes
CN111767825A (en) Face attribute invariant robustness face recognition method and system
CN114610950B (en) Graph network node representation method
Yue Deep learning based image semantic feature analysis and image classification techniques and models
de Ridder et al. The adaptive subspace map for texture segmentation
CN113887591B (en) Multi-view clustering method based on double-layer weighted joint decomposition
Duan et al. Sparsity Regularization Model Based on Network Structure
Ullah et al. Time and memory efficient 3d point cloud classification
Cheng et al. Image color reduction based on self-organizing maps and growing self-organizing neural networks
Liu et al. Partial Mixture-of-Experts Similarity Variational Autoencoder for Clustering on Single Cell Data
Shao et al. Study on the construction of gene regulatory network based on non-homogeneous dynamic Bayesian network
CN116977681A (en) Data clustering method and system based on data diversity enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211116)