CN109039721B - Node importance evaluation method based on error reconstruction - Google Patents


Publication number
CN109039721B
Authority
CN
China
Prior art keywords
node
reconstruction
network
error
significance
Prior art date
Legal status
Active
Application number
CN201810802825.3A
Other languages
Chinese (zh)
Other versions
CN109039721A (en)
Inventor
朱先强
郭园园
朱承
周鋆
黄金才
林福良
丁兆云
闫晶晶
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201810802825.3A
Publication of CN109039721A
Application granted
Publication of CN109039721B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0893 Assignment of logical groups to network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50 Network service management, e.g. ensuring proper service fulfilment according to agreements

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a node importance evaluation method based on error reconstruction, which comprises the following steps: taking the sparse matrix of network connections as input, and calculating the network node feature representation matrix X through the Node2Vec algorithm of network representation learning; constructing a multi-scale network; for the networks at the different scales, calculating two reconstruction error significances of the network nodes according to the reconstruction error model; integrating the two reconstruction error significances across the different scales; and, for the two reconstruction error significances, calculating the fused reconstruction error significance with a weighted fusion algorithm and taking it as the final index of node importance. The method is universal across networks of different structures and types, so a relatively accurate conclusion can be obtained even when the properties of the network are not fully known in advance, avoiding both the extensive up-front network analysis required by traditional methods and the result errors caused by choosing the wrong ranking scheme. The invention is applied to the technical field of complex networks.

Description

Node importance evaluation method based on error reconstruction
Technical Field
The invention relates to the technical field of complex networks, in particular to a node importance evaluation method based on error reconstruction.
Background
Efficiently identifying key nodes is a fundamental problem in complex networks with wide application. A key node is a special node that affects the structure or function of the network to a greater extent than other nodes. For example, redundant backup of key nodes and key links can increase the fault tolerance and survivability of a network and effectively improve its robustness. In addition, influential users and propagators can be identified in a social network or community, optimizing the use of limited resources to promote information dissemination; the prior art includes the iterative AP Rank algorithm, which measures the influence of authors and publications through the multiple associations between them and effectively distinguishes prestige from popularity.
So far, a large number of methods for evaluating key nodes have been proposed, such as degree centrality, betweenness centrality, closeness centrality and eigenvector (feature vector) centrality. Degree centrality is a direct index, but it is often not sufficiently accurate; similar indexes include LocalRank, closeness and the H-index. The prior art has studied the mathematical relationship among three simple but important network centrality measures: degree, coreness and the H-index. Path-based centralities, such as closeness centrality and betweenness centrality, are global measures that identify key nodes in the network more effectively, but their computational complexity is higher. Eigenvector centrality and PageRank are also widely used to measure the importance of network nodes. Recently, the local ClusterRank approach has also performed well in some cases. In addition, semi-local centrality neglects the topological relations between neighbors and considers only the numbers of nearest and next-nearest neighbors of a node, as a trade-off between relevance and computational complexity; however, the position of a node in the network may play a more important role than global attributes such as degree. Kitsak et al. showed through the Kshell value that the position of a node in the network is a key factor affecting node significance. Under this metric strategy, nodes with larger Kshell values generally spread more widely, but when selected as source nodes for propagation their propagation speed can be comparatively poor. Other ranking methods, such as closeness, PageRank, LeaderRank and ClusterRank, have similar limitations.
Disclosure of Invention
The invention aims to provide a node importance evaluation method based on error reconstruction that finds important nodes in reverse, starting from the perspective of the unimportant nodes. The method is therefore universal across networks of different structures and types, which greatly widens its range of application; a relatively accurate conclusion can be obtained even when the network properties are not fully known in advance, avoiding both the extensive up-front network analysis required by traditional methods and the result errors caused by choosing the wrong ranking scheme.
The technical scheme adopted by the invention is as follows:
a node importance evaluation method based on error reconstruction specifically comprises the following steps:
s1, taking the sparse matrix of network connection as input, and calculating a network Node characteristic representation matrix X through a Node2Vec algorithm of network representation learning;
s2, constructing a multi-scale network according to the network node feature representation matrix X;
s3, calculating two reconstruction error significances of the network nodes under different scales according to the reconstruction error model for the network under different scales constructed in the step S2, wherein the two reconstruction error significances are sparse reconstruction error significance and dense reconstruction error significance;
s4, integrating two reconstruction error significances under different scales;
and S5, calculating the significance of the two reconstruction errors according to a weighted fusion algorithm, and taking the calculated significance of the fused reconstruction errors as an index for finally measuring the significance degree of the node.
As a further improvement of the above technical solution, in step S1, the form of the network node feature representation matrix X is:
X = [x_1, x_2, \ldots, x_N], \quad X \in R^{D \times N}
where D is the feature dimension and N is the number of nodes in the network.
As a further improvement of the above technical solution, step S2 specifically includes:
s21, initializing the scale of the network to be N;
s22, clustering the network node feature representation matrix X by a Kmeans clustering algorithm according to four scales of 0.95N, 0.9N, 0.85N and 0.8N, and calculating a module area to which each network node belongs;
S23, counting the network nodes in each module area, and taking the mean value of the feature representations of the network nodes in the module area to which they belong as the feature representation of the module area;
and S24, constructing network feature matrixes under different scales.
As a further improvement of the above technical solution, step S3 specifically includes:
s31, extracting unimportant nodes in the network nodes under each scale and respectively constructing the unimportant nodes as corresponding background modules B;
s32, reconstructing the network under each scale by using the sparse reconstruction model and the dense reconstruction model, and calculating two reconstruction error significances of the network under each scale;
and S33, calculating the significance of the propagation reconstruction errors for measuring the propagation influence between the adjacent nodes.
As a further improvement of the above technical solution, in step S31, the network is decomposed using the Kshell decomposition method, and the unimportant nodes located at the network edge of the network structure are selected as background nodes to form the background module B.
As a further improvement of the above technical solution, in step S32, the calculating two reconstruction error significances specifically includes:
S321, constructing a sparse reconstruction model, and solving the sparse reconstruction coefficient α and the sparse reconstruction error significance ε^s:
\alpha_i = \arg\min_{\alpha_i} \|x_i - B\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1
\varepsilon_i^s = \|x_i - B\alpha_i\|_2^2
where x_i is the feature representation of node i, B is the feature matrix formed by the background nodes of the corresponding-scale network, α_i is the sparse reconstruction coefficient of node i, λ is the L1 regularization coefficient, and ε_i^s is the sparse reconstruction error significance of node i;
S322, constructing a dense reconstruction model, and solving the dense reconstruction coefficient β and the dense reconstruction error significance ε^d:
\beta_i = U_B^T (x_i - \bar{x})
\varepsilon_i^d = \|x_i - (U_B \beta_i + \bar{x})\|_2^2
where x_i is the feature representation of node i, \bar{x} is the mean feature of all nodes, U_B = [u_1, u_2, \ldots, u_{D'}], u_i is the i-th principal component, D' is the number of extracted principal components, T denotes the matrix transpose, β_i is the dense reconstruction coefficient of node i, and ε_i^d is the dense reconstruction error significance of node i.
As a further improvement of the above technical solution, in step S33, the calculating significance of propagation reconstruction errors specifically includes:
s331, clustering the N nodes through a K-means clustering algorithm;
s332, constructing a similarity coefficient according to the similarity of the class to which the node belongs and other nodes, and performing error correction on the node i;
and S333, carrying out weighted estimation on the reconstruction error of the node.
As a further improvement of the above technical solution, in step S332:
the similarity coefficient is defined as:
w_{i,k_j} = \frac{\exp\left(-\|x_i - x_{k_j}\|^2 / (2\sigma_x^2)\right)}{\sum_{j=1}^{N_c} \exp\left(-\|x_i - x_{k_j}\|^2 / (2\sigma_x^2)\right)} \left(1 - \delta(k_j - i)\right)
where {k_1, k_2, ..., k_{N_c}} denotes the N_c node labels in clustering block k, w_{i,k_j} is the similarity-normalized weight between node i and the node labeled j in clustering block k, σ_x^2 is the sum of the variances of each feature dimension of x, x_i is the feature representation of node i, k_j (the node labeled j in clustering block k) is a peripheral adjacent node of node i, x_{k_j} is the feature representation of the node labeled j in clustering block k, and δ(k_j − i) is the indicator function of k_j − i;
the corrected error significance is:
\tilde{\varepsilon}_i = \tau\,\varepsilon_i + (1 - \tau) \sum_{j=1}^{N_c} w_{i,k_j}\,\varepsilon_{k_j}
where τ is a weight parameter, \tilde{ε}_i is the corrected sparse error significance or the corrected dense error significance of node i, ε_{k_j} is the sparse or dense error significance of the node labeled j in clustering block k, i.e. a peripheral adjacent node of node i, and ε_i is the sparse or dense error significance of node i.
As a further improvement of the above technical solution, in step S4, the expression integrating the reconstruction error significance at different scales is:
E(z) = \frac{\sum_{s=1}^{N_s} w_{zs}\,\tilde{\varepsilon}_s(z)}{\sum_{s=1}^{N_s} w_{zs}}
where z denotes a node in the network, N_s denotes the number of scales in the multi-scale analysis, \tilde{ε}_s(z) is the corrected sparse error significance or corrected dense error significance of the module containing node z at scale s, and w_{zs} is the feature similarity between node z and the module it belongs to, used as the weight of the current scale; the expression of w_{zs} is:
w_{zs} = \exp\left(-\|f_z - \bar{f}_{zs}\|^2 / (2\sigma_s^2)\right)
where f_z denotes the node feature of node z, \bar{f}_{zs} denotes the mean feature of the nodes in the module to which node z belongs at scale s, and σ_s^2 is the sum of the variances of each feature dimension at scale s.
As a further improvement of the above technical solution, in step S5, the calculation formula of the weighted fusion is:
S(z) = \alpha E_1(z) + (1 - \alpha) E_2(z)
where E_1(z) is the sparse reconstruction error significance of node z, E_2(z) is the dense reconstruction error significance of node z, α is the weighted fusion coefficient with α ∈ R and 0 ≤ α ≤ 1, and S(z) is the fused reconstruction error significance, i.e. the index measuring the importance of the node.
The invention has the beneficial technical effects that:
the invention relates to a Node importance evaluation method based on error reconstruction, which generates a network characteristic representation matrix through a Node2Vec algorithm, constructing a multi-scale network, solving the significance values of the dense and sparse reconstruction errors under the network with different scales, and the two models of sparse reconstruction and dense reconstruction are fused by a weighted fusion technology, the obtained fusion significance result is an index for judging the significance of the node, the important node is reversely found from the perspective of the unimportant node, therefore, the method has universality for networks with different structures and types, greatly improves the application range, under the condition that the property of the network is not fully known in advance, a relatively accurate conclusion can be obtained based on the method, and a large amount of analysis work on the network in advance in the traditional method and result errors caused by the fact that a ranking mode is used mistakenly are avoided.
Drawings
FIG. 1 is a schematic flow diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a collaboration network CA-GrQc according to the first embodiment;
fig. 3 is a schematic diagram of the average of 50 SIR propagation runs, seeded with the 200 top-ranked nodes, for the four comparison methods and the method of this embodiment in the second embodiment.
Detailed Description
In order to facilitate the practice of the invention, further description is provided below with reference to specific examples.
As shown in fig. 1, a node importance evaluation method based on error reconstruction specifically includes the following steps:
S1, taking the sparse matrix of network connections as input, and calculating the network node feature representation matrix X through the Node2Vec algorithm of network representation learning
The Node2Vec algorithm builds on DeepWalk (online learning of social representations) by defining a biased random-walk strategy to generate node sequences, so that a balance is struck between breadth-first search (BFS) and depth-first search (DFS). A feed-forward Skip-Gram neural network model is then trained on these sequences to extract the features of the nodes in the network, converting feature learning into the optimization of a likelihood objective function:
\max_f \sum_{u \in V} \Big[ -\log Z_u + \sum_{n_i \in N_S(u)} f(n_i) \cdot f(u) \Big]
where Z_u = \sum_{v \in V} \exp(f(v) \cdot f(u)), f: V → R^D is the mapping from the node set V to the D-dimensional real feature space (yielding the matrix in R^{D×N}), u is a node in the network, N_S(u) is the set of neighbor nodes of node u, f(n_i) is the feature representation of node n_i in the D-dimensional real space, f(u) is the feature representation of node u, and f(v) is the feature representation of a neighbor node v of node u in the D-dimensional real space.
The form of the optimal network node feature representation matrix X obtained by the objective function is as follows:
X = [x_1, x_2, \ldots, x_N], \quad X \in R^{D \times N}
where D is the feature dimension and N is the number of nodes in the network.
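For illustration only (this is not the patent's implementation, and the function name, adjacency-dict format and parameter defaults are assumptions), the biased second-order random walk that Node2Vec uses to generate training sequences can be sketched as:

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One biased second-order random walk in the Node2Vec style.

    adj: dict mapping node -> set of neighbor nodes (undirected graph).
    p:   return parameter (larger p discourages revisiting the previous node).
    q:   in-out parameter (q > 1 biases toward BFS-like, q < 1 toward DFS-like walks).
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:           # step back to the previous node
                weights.append(1.0 / p)
            elif x in adj[prev]:    # stay at distance 1 from the previous node
                weights.append(1.0)
            else:                   # move out to distance 2
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights)[0])
    return walk

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
walk = node2vec_walk(adj, start=0, length=12)
```

The walks produced this way are what a Skip-Gram model would then be trained on to obtain X.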
S2, constructing the multi-scale network according to the network node feature representation matrix X
The method comprises the following steps:
s21, initializing the scale of the network to be N;
s22, clustering the network node feature representation matrix X by a Kmeans clustering algorithm according to four scales of 0.95N, 0.9N, 0.85N and 0.8N, and calculating a module area to which each network node belongs;
S23, counting the network nodes in each module area, and taking the mean value of the feature representations of the network nodes in the module area to which they belong as the feature representation of the module area;
and S24, constructing network feature matrixes under different scales.
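The multi-scale construction of steps S21 to S24 can be sketched with scikit-learn's K-means; the toy matrix sizes and variable names here are illustrative assumptions, not the patent's own code:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
D, N = 8, 40                      # toy feature dimension and node count
X = rng.normal(size=(D, N))       # stand-in for the Node2Vec feature matrix

scales = {}
for r in (0.95, 0.9, 0.85, 0.8):  # the four scales of step S22
    k = int(round(r * N))         # number of module areas at this scale
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X.T)
    # step S23: feature of a module area = mean feature of its member nodes
    centers = np.stack([X[:, km.labels_ == c].mean(axis=1) for c in range(k)],
                       axis=1)
    scales[k] = (km.labels_, centers)  # step S24: per-scale feature matrices
```

Each entry of `scales` holds the node-to-module assignment and the module-level feature matrix for one scale.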
S3, calculating two reconstruction error significances of the network nodes under different scales according to the reconstruction error model for the networks under different scales constructed in the step S2, wherein the two reconstruction error significances are sparse reconstruction error significance and dense reconstruction error significance
The method comprises the following steps:
s31, extracting unimportant nodes in the network nodes under each scale to respectively form a corresponding background module B:
the nodes of the unimportant nodes at the network edge in the network structure are decomposed by using a Kshell decomposition method, the nodes at the network edge are selected as background nodes, and a background module B is formed:
B=[b1,b2,...,bM],B∈RD×M
in the formula, M represents the number of selected background nodes.
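A minimal pure-Python sketch of Kshell decomposition for picking the background nodes (the function, graph and selection rule are illustrative assumptions, not taken from the patent):

```python
def k_shell(adj):
    """Kshell decomposition: iteratively strip minimum-degree nodes.

    adj: dict mapping node -> set of neighbors (undirected graph).
    Returns {node: shell index}; low shells lie at the network periphery.
    """
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    shell, k = {}, 0
    while alive:
        k = max(k, min(deg[v] for v in alive))
        peel = [v for v in alive if deg[v] <= k]
        while peel:
            v = peel.pop()
            if v not in alive:
                continue
            alive.remove(v)
            shell[v] = k
            for u in adj[v]:
                if u in alive:
                    deg[u] -= 1
                    if deg[u] <= k:
                        peel.append(u)
    return shell

# Toy graph: a triangle {0,1,2} with a pendant node 3 attached to node 0.
shells = k_shell({0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}})
# Background nodes: the lowest shell, i.e. the network edge.
background = sorted(v for v, s in shells.items() if s == min(shells.values()))
```

Here node 3 lands in shell 1 (the edge) and would become a column of the background module B.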
S32, reconstructing the network under each scale by the sparse reconstruction model and the dense reconstruction model, and calculating two reconstruction error significances of the network under each scale, which specifically comprises the following steps:
S321, constructing a sparse reconstruction model, and solving the sparse reconstruction coefficient α and the sparse reconstruction error significance ε^s. Taking the background nodes as a set of basis vectors, multiple linear regression is performed on the nodes of the network. Conventional Lasso regression (linear regression with L1 regularization) is adopted to reduce overfitting and compute the regression coefficients quickly, i.e. the L1 regularization term λ‖α_i‖_1 is added when performing linear regression on the high-dimensional data. The regression coefficients are computed with the LARS (least angle regression) algorithm and taken as the sparse reconstruction coefficients α:
\alpha_i = \arg\min_{\alpha_i} \|x_i - B\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1
\varepsilon_i^s = \|x_i - B\alpha_i\|_2^2
where x_i is the feature representation of node i, B is the feature matrix formed by the background nodes of the corresponding-scale network, α_i is the sparse reconstruction coefficient of node i, ε_i^s is the sparse reconstruction error significance of node i, and λ is the L1 regularization coefficient, set to 0.01 in the experiments;
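The sparse reconstruction step can be sketched with scikit-learn's Lasso on random stand-in features; note that scikit-learn's `alpha` corresponds to the λ above only up to the library's 1/(2n) scaling of the squared-error term, and the sizes below are toy assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
D, N, M = 16, 50, 10             # feature dim, nodes, background nodes (toy)
X = rng.normal(size=(D, N))      # stand-in for the Node2Vec feature matrix
B = X[:, :M]                     # stand-in for the Kshell-selected background matrix

lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
eps_sparse = np.empty(N)
for i in range(N):
    lasso.fit(B, X[:, i])        # alpha_i = argmin ||x_i - B a||^2 + lam ||a||_1
    eps_sparse[i] = np.sum((X[:, i] - B @ lasso.coef_) ** 2)
```

Background nodes are columns of the dictionary B, so their sparse reconstruction error is near zero, while the remaining nodes cannot be reproduced exactly from only M basis vectors.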
S322, constructing a dense reconstruction model: principal component analysis (PCA) is used to extract the principal components U_B of the selected background module B, after which the dense reconstruction coefficient β is constructed from the residual between the node features and the mean, and the dense reconstruction error significance ε^d is solved:
\beta_i = U_B^T (x_i - \bar{x})
\varepsilon_i^d = \|x_i - (U_B \beta_i + \bar{x})\|_2^2
where x_i is the feature representation of node i, \bar{x} is the mean feature of all nodes, U_B = [u_1, u_2, \ldots, u_{D'}], u_i is the i-th principal component, D' is the number of extracted principal components, T denotes the matrix transpose, β_i is the dense reconstruction coefficient of node i, and ε_i^d is the dense reconstruction error significance of node i.
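The dense reconstruction step S322 can be sketched with a plain SVD-based PCA on stand-in data (sizes and names are illustrative assumptions, with `Dp` standing in for D'):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, M, Dp = 16, 50, 10, 4     # feature dim, nodes, background nodes, D'
X = rng.normal(size=(D, N))     # stand-in node features
B = X[:, :M]                    # stand-in background feature matrix

x_bar = X.mean(axis=1, keepdims=True)            # mean feature of all nodes
U, _, _ = np.linalg.svd(B - B.mean(axis=1, keepdims=True), full_matrices=False)
U_B = U[:, :Dp]                                  # top D' principal components of B

beta = U_B.T @ (X - x_bar)                       # beta_i = U_B^T (x_i - x_bar)
eps_dense = np.sum((X - (U_B @ beta + x_bar)) ** 2, axis=0)
```

Because the reconstruction is an orthogonal projection onto the span of U_B (plus the mean), each node's dense error is bounded by its centered feature energy.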
S33, considering the propagation influence of the neighboring nodes, and calculating the significance of the propagation reconstruction error, specifically including:
s331, clustering the N nodes through a K-means clustering algorithm;
S332, constructing a similarity coefficient from the similarity between the class to which a node belongs and the other nodes, and performing error correction on node i, wherein:
the similarity coefficient is defined as:
w_{i,k_j} = \frac{\exp\left(-\|x_i - x_{k_j}\|^2 / (2\sigma_x^2)\right)}{\sum_{j=1}^{N_c} \exp\left(-\|x_i - x_{k_j}\|^2 / (2\sigma_x^2)\right)} \left(1 - \delta(k_j - i)\right)
where {k_1, k_2, ..., k_{N_c}} denotes the N_c node labels in clustering block k, w_{i,k_j} is the similarity-normalized weight between node i and the node labeled j in clustering block k, σ_x^2 is the sum of the variances of each feature dimension of x, δ(·) is an indicator function, x_i is the feature representation of node i, k_j (the node labeled j in clustering block k) is a peripheral adjacent node of node i, x_{k_j} is the feature representation of the node labeled j in clustering block k, and δ(k_j − i) is the indicator function of k_j − i;
the corrected error significance is:
\tilde{\varepsilon}_i = \tau\,\varepsilon_i + (1 - \tau) \sum_{j=1}^{N_c} w_{i,k_j}\,\varepsilon_{k_j}
where τ is a weight parameter, \tilde{ε}_i is the corrected sparse error significance or the corrected dense error significance of node i, ε_{k_j} is the sparse or dense error significance of the node labeled j in clustering block k, i.e. a peripheral adjacent node of node i, and ε_i is the sparse or dense error significance of node i: substituting the sparse error significance into the formula yields the corrected sparse error significance, and substituting the dense error significance yields the corrected dense error significance.
And S333, carrying out weighted estimation on the reconstruction error of the node.
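A toy sketch of the similarity-weighted error correction of steps S332 and S333, assuming the weight takes the Gaussian form described above and excludes the node itself (all sizes and values are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
D, Nc = 8, 6                      # toy feature dimension and cluster-block size
Xk = rng.normal(size=(D, Nc))     # features of the Nc nodes in cluster block k
eps = rng.uniform(1.0, 5.0, Nc)   # their sparse (or dense) reconstruction errors
tau = 0.5                         # weight parameter tau
sigma2 = Xk.var(axis=1).sum()     # sum of the variances of each feature dimension

eps_tilde = np.empty(Nc)
for i in range(Nc):
    d2 = np.sum((Xk - Xk[:, [i]]) ** 2, axis=0)   # squared feature distances
    w = np.exp(-d2 / (2.0 * sigma2))
    w[i] = 0.0                    # exclude the node itself (the delta factor)
    w /= w.sum()                  # similarity-normalized weights
    eps_tilde[i] = tau * eps[i] + (1.0 - tau) * np.dot(w, eps)
```

Since the weights sum to one, each corrected value is a convex combination of the original errors, so the correction smooths errors within the block without leaving their range.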
S4, integrating two reconstruction error significances under different scales
The expression integrating the reconstruction error significance at different scales is:
E(z) = \frac{\sum_{s=1}^{N_s} w_{zs}\,\tilde{\varepsilon}_s(z)}{\sum_{s=1}^{N_s} w_{zs}}
where z denotes a node in the network, N_s denotes the number of scales in the multi-scale analysis, \tilde{ε}_s(z) is the corrected sparse error significance or corrected dense error significance of the module containing node z at scale s, and w_{zs} is the feature similarity between node z and the module it belongs to, used as the weight of the current scale; the expression of w_{zs} is:
w_{zs} = \exp\left(-\|f_z - \bar{f}_{zs}\|^2 / (2\sigma_s^2)\right)
where f_z denotes the node feature of node z, \bar{f}_{zs} denotes the mean feature of the nodes in the module to which node z belongs at scale s, and σ_s^2 is the sum of the variances of each feature dimension at scale s.
And S5, calculating the significance of the two reconstruction errors according to a weighted fusion algorithm, and taking the calculated significance of the fused reconstruction errors as an index for finally measuring the significance degree of the node.
The calculation formula of the weighted fusion is as follows:
S(z) = \alpha E_1(z) + (1 - \alpha) E_2(z)
where E_1(z) is the sparse reconstruction error significance of node z, E_2(z) is the dense reconstruction error significance of node z, and α is the weighted fusion coefficient with α ∈ R and 0 ≤ α ≤ 1; in this embodiment α = 0.5. S(z) is the fused reconstruction error significance, i.e. the index measuring the importance of the node.
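The final fusion of step S5 is a single convex combination; a toy sketch with α = 0.5 as in the embodiment (the three-node significance values are made up for illustration):

```python
import numpy as np

E1 = np.array([0.2, 0.9, 0.4])   # toy sparse reconstruction significances
E2 = np.array([0.3, 0.7, 0.8])   # toy dense reconstruction significances
alpha = 0.5                      # weighted fusion coefficient
S = alpha * E1 + (1.0 - alpha) * E2
ranking = np.argsort(-S)         # node indices, most important first
```

Sorting nodes by S in descending order gives the final importance ranking.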
Example one
Taking the public collaboration network dataset CA-GrQc as an embodiment, fig. 2 shows a schematic diagram of its network structure. Inspection of the dataset shows that the collaboration network is an undirected network formed by 5242 nodes and 28980 edges, where nodes represent authors and an edge indicates that two authors have co-authored at least one article. The maximum node ID is 26196, and the 5242 node IDs are not consecutively arranged; a possible reason is that some nodes have no connecting edges, i.e. they are isolated points representing authors who have not collaborated with others.
Example two
In order to verify the technical effect of the method of this embodiment, four ranking methods (degree, closeness, betweenness and eigenvector) are adopted for comparison. Fig. 3 shows the average of 50 SIR propagation runs seeded with the 200 top-ranked nodes for the four comparison methods and the method of this embodiment. As shown in fig. 3, the method of this embodiment propagates faster than the eigenvector and closeness ranking methods, and slower than the betweenness ranking method during propagation; however, when the propagation steady state is reached, the number of propagated nodes of the method of this embodiment gives a larger propagation scale than all four comparison methods.
[Table: number of propagated nodes and propagation times at the SIR steady state for the four comparison methods and the method of this embodiment.]
As shown in the above table, the four comparison methods (eigenvector centrality, degree centrality, betweenness centrality and K-kernel decomposition) reach the propagation steady state with 4158 nodes, while the invention reaches the steady state with 4299 nodes, so the invention propagates on a larger scale than the four comparison methods. The propagation time needed to reach the steady state is higher than for the degree and betweenness methods but lower than for the eigenvector and closeness comparison methods, and the propagation speed of the invention is effectively higher than that of the four comparison methods. In conclusion, the node significance method based on error reconstruction has a certain propagation advantage. The invention can be applied in the field of complex networks as a new node importance evaluation index: it finds the nodes that are critical relative to the core from the side of the locally unimportant nodes, effectively improves identification accuracy, and enables faster and wider propagation in the network.
The foregoing description of the preferred embodiments describes the features of the invention in detail and is not intended to limit the inventive concept to the particular forms of the embodiments described; other modifications and variations within the spirit of the inventive concept are also protected by this patent. The scope of the invention is defined by the claims rather than by the detailed description of the embodiments.

Claims (9)

1. A node importance evaluation method based on error reconstruction is characterized by comprising the following steps:
s1, taking the sparse matrix of network connection as input, and calculating a network Node characteristic representation matrix X through a Node2Vec algorithm of network representation learning;
s2, constructing a multi-scale network according to the network node feature representation matrix X;
s3, calculating two reconstruction error significances of the network nodes under different scales according to the reconstruction error model for the network under different scales constructed in the step S2, wherein the two reconstruction error significances are sparse reconstruction error significance and dense reconstruction error significance;
s4, integrating two reconstruction error significances under different scales;
s5, calculating the significance of the two reconstruction errors according to a weighted fusion algorithm, and taking the calculated significance of the fused reconstruction errors as an index for finally measuring the significance degree of the node;
step S3 specifically includes:
S31, extracting the unimportant nodes among the network nodes at each scale and constructing, for each scale, the corresponding background module B from them;
s32, reconstructing the network under each scale by using the sparse reconstruction model and the dense reconstruction model, and calculating two reconstruction error significances of the network under each scale;
and S33, calculating the significance of the propagation reconstruction errors for measuring the propagation influence between the adjacent nodes.
2. The method for evaluating the importance of nodes based on error reconstruction as claimed in claim 1, wherein in step S1, the form of the network node feature representation matrix X is:
X = [x_1, x_2, ..., x_N], X ∈ R^(D×N)
where D is the feature dimension and N is the number of nodes in the network.
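For illustration, a minimal numpy sketch of the matrix convention in claim 2; the random matrix here is a hypothetical stand-in for actual Node2Vec output, and the values of D and N are arbitrary:

```python
import numpy as np

# Hypothetical stand-in for Node2Vec output: a D-dimensional feature
# vector per node, stored column-wise, X = [x_1, ..., x_N] in R^(D x N).
D, N = 16, 100
rng = np.random.default_rng(0)
X = rng.standard_normal((D, N))

x_i = X[:, 0]          # feature representation of the first node
```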
3. The method for evaluating the importance of the node based on the error reconstruction as claimed in claim 1, wherein the step S2 specifically comprises:
s21, initializing the scale of the network to be N;
S22, clustering the network node feature representation matrix X with the K-means clustering algorithm at the four scales 0.95N, 0.9N, 0.85N, and 0.8N, and determining the module region to which each network node belongs;
S23, collecting the network nodes in each module region, and taking the mean of the feature representations of the network nodes in that module region as the feature representation of the module region;
and S24, constructing network feature matrixes under different scales.
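Steps S21–S24 can be sketched as follows, assuming plain Lloyd's K-means is acceptable (the claim does not fix the K-means variant); rows of `X` are node features here, `multiscale_features` is a name chosen for illustration, and empty clusters (likely when k is close to N) are simply dropped from the module list:

```python
import numpy as np

def kmeans(X, k, iters=30, seed=0):
    # Plain Lloyd's algorithm; X is (N, D), one node feature per row.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in np.unique(labels):
            centers[j] = X[labels == j].mean(axis=0)
    return labels

def multiscale_features(X, ratios=(0.95, 0.9, 0.85, 0.8)):
    # S22-S24: cluster at each scale and represent every module region
    # by the mean feature of the nodes it contains.
    N = len(X)
    out = {}
    for r in ratios:
        k = max(1, int(r * N))
        labels = kmeans(X, k)
        modules = np.vstack([X[labels == j].mean(axis=0)
                             for j in np.unique(labels)])
        out[r] = (labels, modules)
    return out
```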
4. The method for node importance assessment based on error reconstruction as claimed in claim 1, wherein in step S31, the network is decomposed by the K-shell decomposition method, and the unimportant nodes located at the network edge are selected as background nodes, which constitute the background module B.
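Claim 4's background selection can be sketched in pure Python: `k_shell` implements standard k-shell decomposition by iterative peeling, and `background_nodes` (a name invented here for illustration) returns the lowest-shell, i.e. most peripheral, nodes:

```python
def k_shell(adj):
    # adj: dict mapping node -> set of neighbours (undirected graph).
    # A node's shell index is the k at which iterative peeling removes it.
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    shell, k = {}, 0
    while adj:
        k = max(k, min(len(vs) for vs in adj.values()))
        changed = True
        while changed:
            changed = False
            for u in [u for u, vs in adj.items() if len(vs) <= k]:
                shell[u] = k
                for v in adj[u]:
                    adj[v].discard(u)
                del adj[u]
                changed = True
    return shell

def background_nodes(adj):
    # Background module B: the nodes in the lowest (most peripheral) shell.
    shell = k_shell(adj)
    k_min = min(shell.values())
    return [u for u, s in shell.items() if s == k_min]
```

For example, on a triangle {a, b, c} with a pendant node d attached to a, the triangle nodes get shell index 2 and d gets shell index 1, so d alone forms the background.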
5. The method for evaluating the importance of nodes based on error reconstruction according to claim 1, wherein the step S32 of calculating the significance of two reconstruction errors specifically comprises:
S321, constructing a sparse reconstruction model, and solving the sparse reconstruction coefficient α and the sparse reconstruction error significance ε^s:

α_i = argmin_{α_i} ||x_i − B·α_i||² + λ||α_i||₁

ε_i^s = ||x_i − B·α_i||²

where x_i is the feature representation of node i, B is the feature matrix formed by the background nodes of the corresponding-scale network, α_i is the sparse reconstruction coefficient of node i, λ is the L1 regularization coefficient, and ε_i^s is the sparse reconstruction error significance of node i;

S322, constructing a dense reconstruction model, and solving the dense reconstruction coefficient β and the dense reconstruction error significance ε^d:

β_i = U_B^T (x_i − x̄)

ε_i^d = ||x_i − (U_B·β_i + x̄)||²

where x_i is the feature representation of node i, x̄ is the mean feature of all nodes, U_B = [u_1, u_2, ..., u_{D'}], u_i is the i-th principal component, D' is the number of extracted principal components, T denotes the matrix transpose, β_i is the dense reconstruction coefficient of node i, and ε_i^d is the dense reconstruction error significance of node i.
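The two models of claim 5 can be sketched in numpy. ISTA is used below as one standard L1 solver (the claim does not specify a solver), the principal components come from an SVD of the centred background matrix, and for self-containment the mean is taken over the background columns rather than over all nodes:

```python
import numpy as np

def sparse_error(x, B, lam=0.01, iters=300):
    # min_a ||x - B a||^2 + lam * ||a||_1, solved by ISTA;
    # returns (epsilon_s, alpha) for one node feature x.
    L = np.linalg.norm(B, 2) ** 2 + 1e-12        # step-size scale
    a = np.zeros(B.shape[1])
    for _ in range(iters):
        a = a - B.T @ (B @ a - x) / L            # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / (2 * L), 0.0)
    return float(np.sum((x - B @ a) ** 2)), a

def dense_error(x, B, d_prime=2):
    # PCA of the background features; returns (epsilon_d, beta).
    x_bar = B.mean(axis=1)
    U, _, _ = np.linalg.svd(B - x_bar[:, None], full_matrices=False)
    UB = U[:, :d_prime]                          # top D' principal components
    beta = UB.T @ (x - x_bar)
    return float(np.sum((x - (UB @ beta + x_bar)) ** 2)), beta
```

A node whose feature lies close to the background subspace gets a small error (low significance); a node far from it gets a large error (high significance).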
6. The method for evaluating the importance of nodes based on error reconstruction according to claim 1, wherein the step S33 of calculating the significance of the propagation reconstruction error specifically comprises:
s331, clustering the N nodes through a K-means clustering algorithm;
S332, constructing a similarity coefficient according to the similarity between node i and the other nodes in the cluster to which it belongs, and performing error correction on node i;
and S333, carrying out weighted estimation on the reconstruction error of the node.
7. The method for evaluating the importance of nodes based on error reconstruction according to claim 6, wherein in step S332:
the similarity coefficient is defined as:

w̃_{k_j,i} = exp(−||x_i − x_{k_j}||² / (2σ_x²)) · δ(k_j − i) / Σ_{j=1}^{N_c} [exp(−||x_i − x_{k_j}||² / (2σ_x²)) · δ(k_j − i)]

where {k_1, k_2, ..., k_{N_c}} denotes the N_c node labels in cluster block k, w̃_{k_j,i} is the normalized similarity weight between node i and the node labeled j in cluster block k, σ_x² is the sum of the variances of each feature dimension of x, δ(·) is an indicator function, x_i is the feature representation of node i, k_j is the node labeled j in cluster block k, i.e. a peripheral adjacent node of node i, x_{k_j} is the feature representation of the node labeled j in cluster block k, and δ(k_j − i) is the indicator function applied to k_j − i;
the corrected error significance is:

ε̃_i = τ·Σ_{j=1}^{N_c} w̃_{k_j,i}·ε_{k_j} + (1 − τ)·ε_i

where τ is a weight parameter, ε̃_i is the corrected sparse error significance or corrected dense error significance of node i, ε_{k_j} is the sparse or dense error significance of the node labeled j in cluster block k, i.e. a peripheral adjacent node of node i, and ε_i is the sparse or dense error significance of node i.
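A numpy sketch of the correction in claim 7, under two assumptions read off the claim's variable descriptions: δ(k_j − i) excludes node i itself, and σ_x² is the summed per-dimension variance. `corrected_error` is an illustrative name, not the patent's:

```python
import numpy as np

def corrected_error(i, eps, labels, X, tau=0.5):
    # Smooth node i's error with the errors of the other nodes in its
    # cluster block, weighted by normalised feature similarity:
    # eps_tilde_i = tau * sum_j w_j * eps_j + (1 - tau) * eps_i.
    # X is (D, N) with features stored column-wise, as in claim 2.
    sigma2 = X.var(axis=1).sum() + 1e-12
    members = [j for j in range(X.shape[1])
               if labels[j] == labels[i] and j != i]
    if not members:                       # node alone in its cluster
        return float(eps[i])
    w = np.array([np.exp(-np.sum((X[:, i] - X[:, j]) ** 2) / (2 * sigma2))
                  for j in members])
    w /= w.sum()                          # normalised similarity weights
    return float(tau * np.dot(w, np.asarray(eps)[members])
                 + (1 - tau) * eps[i])
```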
8. The method for evaluating the importance of nodes based on error reconstruction as claimed in claim 7, wherein in step S4, the expression for integrating the reconstruction error significances at different scales is:

E(z) = Σ_{s=1}^{N_s} ω_z^(s)·ε̃_z^(s)

where z denotes a node in the network, N_s denotes the number of scales in the multi-scale analysis, ε̃_z^(s) is the corrected sparse error significance or corrected dense error significance of the module containing node z at scale s, and ω_z^(s) represents the feature similarity between node z and the module containing it, used as the weight of the current scale;

the expression of ω_z^(s) is:

ω_z^(s) = exp(−||f_z − f̄_z^(s)||² / σ_s²)

where f_z denotes the node feature corresponding to node z, f̄_z^(s) denotes the mean of the features of the nodes in the module to which node z belongs at scale s, and σ_s² is the sum of the variances of each feature dimension at scale s.
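The integration in claim 8 can be sketched as below; whether the scale weights are re-normalised cannot be read off the claim text, so this sketch uses them directly as a weighted sum, and `integrate_scales` is an illustrative name:

```python
import numpy as np

def integrate_scales(eps_tilde, f_z, module_means, sigma2s):
    # E(z) = sum_s w_z^(s) * eps_tilde^(s)(z), with
    # w_z^(s) = exp(-||f_z - mean_s||^2 / sigma_s^2).
    # eps_tilde[s]: corrected error of z's module at scale s;
    # module_means[s]: mean feature of that module; sigma2s[s]: variance sum.
    w = np.array([np.exp(-np.sum((f_z - m) ** 2) / s2)
                  for m, s2 in zip(module_means, sigma2s)])
    return float(np.dot(w, eps_tilde))
```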
9. The method for evaluating the importance of nodes based on error reconstruction according to claim 8, wherein in step S5, the calculation formula of the weighted fusion is:
S(z) = α·E_1(z) + (1 − α)·E_2(z)
where E_1(z) is the sparse reconstruction error significance of node z, E_2(z) is the dense reconstruction error significance of node z, α is the weighted fusion coefficient with α ∈ R and 0 ≤ α ≤ 1, and S(z) is the fused reconstruction error significance, i.e. the final measure of node importance.
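Claim 9's fusion is a single convex combination; a sketch that also ranks nodes by the fused score (`fuse_and_rank` is an illustrative name, not the patent's):

```python
import numpy as np

def fuse_and_rank(e1, e2, alpha=0.5):
    # S(z) = alpha * E1(z) + (1 - alpha) * E2(z), 0 <= alpha <= 1;
    # returns fused scores and node indices ordered by importance.
    assert 0.0 <= alpha <= 1.0
    s = alpha * np.asarray(e1, float) + (1 - alpha) * np.asarray(e2, float)
    return s, np.argsort(-s)   # higher fused score = more important node
```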
CN201810802825.3A 2018-07-20 2018-07-20 Node importance evaluation method based on error reconstruction Active CN109039721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810802825.3A CN109039721B (en) 2018-07-20 2018-07-20 Node importance evaluation method based on error reconstruction


Publications (2)

Publication Number Publication Date
CN109039721A CN109039721A (en) 2018-12-18
CN109039721B true CN109039721B (en) 2021-06-18

Family

ID=64643807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810802825.3A Active CN109039721B (en) 2018-07-20 2018-07-20 Node importance evaluation method based on error reconstruction

Country Status (1)

Country Link
CN (1) CN109039721B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109495328B (en) * 2018-12-30 2021-12-21 深圳市万通信息技术有限公司 Method for guaranteeing reliability of network communication
CN110826164B (en) * 2019-11-06 2023-07-04 中国人民解放军国防科技大学 Complex network node importance evaluation method based on local and global connectivity
CN112104515B (en) * 2020-11-19 2021-01-29 中国人民解放军国防科技大学 Network reconstruction method and device, computer equipment and storage medium
CN116436799B (en) * 2023-06-13 2023-08-11 中国人民解放军国防科技大学 Complex network node importance assessment method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101895957A (en) * 2010-07-23 2010-11-24 浙江大学 Random routing method facing matrix type wireless sensor network distribution weighing
CN102547903A (en) * 2012-01-18 2012-07-04 北京航空航天大学 Network backbone node selecting method for wireless sensor based on compressed sensing
CN106788885A (en) * 2016-12-07 2017-05-31 上海交通大学 A kind of compress ecg data rate Automatic adjusument wireless transmitting system and transmission method
CN106780053A (en) * 2015-11-20 2017-05-31 香港中文大学深圳研究院 A kind of overlap community discovery method and system based on node connection preference
CN106934398A (en) * 2017-03-09 2017-07-07 西安电子科技大学 Image de-noising method based on super-pixel cluster and rarefaction representation
CN107967440A (en) * 2017-09-19 2018-04-27 北京工业大学 A kind of monitor video method for detecting abnormality based on multizone mutative scale 3D-HOF


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Saliency region detection algorithm based on dual error reconstruction (基于两重误差重构的显著性区域检测算法); 范明喆 (Fan Mingzhe); 《红外技术》 (Infrared Technology); 2015-12-01; Vol. 37, No. 11; full text *

Also Published As

Publication number Publication date
CN109039721A (en) 2018-12-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant