CN113065037A - Label propagation community detection method and device based on density peak optimization

Info

Publication number: CN113065037A
Application number: CN202110407213.6A
Authority: CN
Inventors: 陈国强; 马岩; 赵艳丽; 周宏基
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2021-04-15
Filing date: 2021-04-15
Publication date: 2021-07-02

Abstract

The invention belongs to the technical field of complex networks and discloses a label propagation community detection method and device based on density peak optimization. Density peaks are introduced to find the cluster centers: the prototype of each community is determined first, which fixes the number of communities and the cluster centers of the complex network, and the communities are then detected with a label propagation algorithm. This improves the accuracy and robustness of community discovery, reduces the number of iterations, and accelerates the formation of communities. Compared with other advanced algorithms, the method solves the community detection problem quickly and effectively, can predict the number of communities without prior knowledge, and has better stability and accuracy because the number of communities found is always consistent with the actual number.

Description

Label propagation community detection method and device based on density peak optimization

Technical Field

The invention belongs to the technical field of complex networks, and particularly relates to a label propagation community detection method and device based on density peak optimization.

Background

Community structure is an extremely important attribute of complex networks. It plays a crucial role in analyzing social relationships in social networks, the functional relationships of tissues and organs in biological networks, and citation relationships in scientist collaboration networks. The discovery of community structure in complex networks has therefore been studied extensively over the last decade. In 2002, Girvan and Newman (M. Girvan, M. E. J. Newman. Community structure in social and biological networks [J]. Proceedings of the National Academy of Sciences of the United States of America, 2002, 99(12)) did pioneering work: they pointed out that community structure is ubiquitous in complex networks and proposed the modularity Q to measure the quality of the communities in a network. Although research has not settled on a single explicit definition of community structure, a community is generally considered to be a group of nodes, also called a cluster or a module, whose connections are dense inside the community and sparse toward the rest of the network.

The community discovery algorithm based on label propagation is one of the current research hot spots and is widely applied to community detection. It is a graph-based semi-supervised learning method; the advantage of semi-supervised learning is that the labels of a large number of unlabeled samples can be determined from a small number of labeled samples, which makes learning more effective. The basic idea of label propagation is to start from the label information of the labeled nodes and predict the labels of the unlabeled nodes using the topological relationships between nodes, finally partitioning the graph into a cluster structure. The algorithm is simple to implement, logically clear, does not need the number of communities in advance, and has near-linear time complexity, but its partition results are unstable and strongly random. In each iteration, the community a node joins is determined by the label with the largest cumulative weight among its neighbors, so when several neighbor labels tie for the maximum, one of them is chosen at random as the node's own label. This randomness causes an avalanche effect: a small clustering error that appears early is amplified continuously. The update order of the node labels also affects the result, and updating the more important nodes earlier accelerates convergence. In the label propagation algorithm, the closer the initial labels are placed to the core points, the more accurate the clustering.

Disclosure of Invention

The invention provides a label propagation community detection method and device based on density peak optimization, aiming at the problems that labels are randomly selected in the existing label propagation algorithm and the community division result is unstable.

In order to achieve the purpose, the invention adopts the following technical scheme:

a label propagation community detection method based on density peak optimization comprises the following steps:

step 1: constructing an adjacency matrix A from a complex network G = (V, E), where V is the node set of G and comprises n nodes, and E is the edge set of G and comprises m edges;

step 2: calculating a similarity matrix S between the nodes of the complex network using cosine similarity;

step 3: calculating a distance matrix d of the nodes of the complex network based on the similarity matrix S;

step 4: calculating the local density of the nodes with a Gaussian kernel function and normalizing it to obtain the normalized local density ρ*;

step 5: based on the distance matrix d and the normalized local density ρ*, obtaining the distance between each node in the complex network and the high-density nodes, and normalizing it to obtain the normalized distance δ*;

step 6: acquiring K core points based on the normalized local density ρ* and the normalized distance δ*;

step 7: constructing weights between the nodes with a Gaussian kernel function, and constructing a probability transition matrix P based on those weights;

step 8: constructing a label matrix F based on the K core points obtained;

step 9: propagating the label matrix F according to the similarity between nodes encoded in the probability transition matrix P, then resetting the label matrix F, and iterating this propagate-and-reset process until the change in the labels of the unlabeled nodes in F reaches the critical point, thereby completing the label partition.

Further, the step 3 comprises:

calculating a distance matrix d of nodes in the complex network according to the following mode:

where d_ij is an element of the distance matrix d and represents the distance between node i and node j; S(i, j) is an element of the similarity matrix S and represents the similarity of node i and node j; σ is a small positive number.

Further, the step 5 includes:

and calculating the distance between the node and the high-density node in the complex network according to the following mode:

where ρ_i represents the local density of node i, and ρ_j represents the local density of node j.

Further, the step 6 comprises:

calculating the product γ = ρ* × δ* for each node, collecting the values greater than the sum of the mean and the standard deviation of γ into a list, sorting the list, and finally selecting the top n × 20% nodes of the list as core points, that is, the number of core points is K = n × 20%.

A label propagation community detection device based on density peak optimization comprises:

a first construction module, configured to construct an adjacency matrix A from a complex network G = (V, E), where V is the node set of G and comprises n nodes, and E is the edge set of G and comprises m edges;

the first calculation module is used for calculating a similarity matrix S between nodes in the complex network by adopting cosine similarity;

the second calculation module is used for calculating a distance matrix d of the nodes in the complex network based on the similarity matrix S between the nodes;

a third calculation module, configured to calculate the local density of the nodes with a Gaussian kernel function and normalize it to obtain the normalized local density ρ*;

a fourth calculation module, configured to obtain, based on the distance matrix d and the normalized local density ρ*, the distance between each node in the complex network and the high-density nodes, and to normalize it to obtain the normalized distance δ*;

a core point deriving module, configured to acquire K core points based on the normalized local density ρ* and the normalized distance δ*;

the second construction module is used for constructing the weight among the nodes by adopting a Gaussian kernel function method and constructing a probability transfer matrix P based on the weight among the nodes;

the third building module is used for building a label matrix F based on the obtained K core points;

and a label propagation module, configured to propagate the label matrix F according to the similarity between nodes encoded in the probability transition matrix P, then reset the label matrix F, and iterate this propagate-and-reset process until the change in the labels of the unlabeled nodes in F reaches the critical point, thereby completing the label partition.

Compared with the prior art, the invention has the following beneficial effects:

The invention can predict the number of communities without prior knowledge, avoids the unstable partitions and strong randomness of label propagation with random label selection, and effectively improves the accuracy of community mining and the stability of the algorithm. In addition, because a probability transition matrix is constructed, the number of label propagation iterations is reduced, so the method runs efficiently and finds the community structure of the network quickly. Compared with other advanced algorithms, the method solves the community detection problem quickly and effectively, can predict the number of communities without prior knowledge, and has better stability and accuracy because the number of communities found is always consistent with the actual number.

Drawings

FIG. 1 is a basic flowchart of a label propagation community detection method based on density peak optimization according to an embodiment of the present invention;

FIG. 2 is a graph comparing the NMI results of the different algorithms on the LFR benchmark data set;

FIG. 3 is a visualization result diagram of the Football network division by the method of the present invention;

FIG. 4 is a visualization result diagram of the Karate network partitioning using the method of the present invention;

FIG. 5 is a visualization result diagram of the Dolphins network partitioning by the method of the present invention.

Detailed Description

The invention is further illustrated by the following examples in conjunction with the accompanying drawings:

Example 1

As shown in fig. 1, a label propagation community detection method based on density peak optimization, which is abbreviated as DPLPA for convenience of description, includes:

step S101: constructing an adjacency matrix A from a complex network G = (V, E), where V is the node set of G and comprises n nodes, and E is the edge set of G and comprises m edges;

step S102: calculating a similarity matrix S between the nodes of the complex network using cosine similarity;

step S103: calculating a distance matrix d of the nodes of the complex network based on the similarity matrix S;

step S104: calculating the local density of the nodes with a Gaussian kernel function and normalizing it to obtain the normalized local density ρ*;

step S105: based on the distance matrix d and the normalized local density ρ*, obtaining the distance between each node in the complex network and the high-density nodes, and normalizing it to obtain the normalized distance δ*;

step S106: acquiring K core points based on the normalized local density ρ* and the normalized distance δ*;

step S107: constructing weights between the nodes with a Gaussian kernel function, and constructing a probability transition matrix P based on those weights;

step S108: constructing a label matrix F based on the K core points obtained;

step S109: propagating the label matrix F according to the similarity between nodes encoded in the probability transition matrix P, then resetting the label matrix F, and iterating this propagate-and-reset process until the change in the labels of the unlabeled nodes in F reaches the critical point, thereby completing the label partition.

Further, the step S103 includes:

Further, the step S105 includes:

Further, the step S106 includes:

Specifically, let G = (V, E) be an undirected, unweighted complex network. The node set V comprises n nodes and the edge set E comprises m edges. The adjacency matrix of the graph G is A, where a_ij = 1 if node i and node j are connected by an edge and a_ij = 0 otherwise. The similarity of node i and node j is then expressed with the cosine similarity:

where N(i) and N(j) denote the sets of neighbors of node i and node j respectively, and |N(i)| denotes the number of neighbors of node i, so |N(i) ∩ N(j)| is the number of neighbors shared by node i and node j, and the denominator √(|N(i)| × |N(j)|) indicates the number of neighbors that node i and node j are expected to share. The value of S(i, j) lies between 0 and 1, and the closer S(i, j) is to 1, the more similar the two nodes are. The distance formula for node i and node j is as follows:

where σ is a small positive number to avoid a denominator of 0.
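The similarity and distance steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes the cosine similarity has the form S(i, j) = |N(i) ∩ N(j)| / sqrt(|N(i)| × |N(j)|) and that the distance is d_ij = 1 / (S(i, j) + σ), which is one form consistent with the remark that σ keeps the denominator from being 0. The function names and the default value of sigma are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(A):
    """S[i, j] = |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|) for a 0/1 adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    common = A @ A.T                       # common[i, j] = number of shared neighbors
    deg = A.sum(axis=1)                    # deg[i] = |N(i)|
    denom = np.sqrt(np.outer(deg, deg))
    S = np.zeros_like(common)
    np.divide(common, denom, out=S, where=denom > 0)  # isolated nodes get similarity 0
    return S

def distance_matrix(S, sigma=1e-3):
    """Assumed form d[i, j] = 1 / (S[i, j] + sigma); sigma avoids division by zero."""
    d = 1.0 / (np.asarray(S, dtype=float) + sigma)
    np.fill_diagonal(d, 0.0)               # assumption: a node is at distance 0 from itself
    return d

# Triangle graph: every pair of nodes shares exactly one neighbor.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])
S = cosine_similarity_matrix(A)
d = distance_matrix(S)
```

Under this assumed distance, pairs with no shared neighbors sit at the largest distance 1/σ, so the choice of σ sets the scale of the distance matrix.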

Next, a local density of the nodes is calculated by using a gaussian kernel function, and the formula is as follows:

where ρ_i represents the local density of node i, d_ij represents the distance between node i and node j, and d_c denotes the cut-off distance, in particular 1% to 2% of the total number of data points; in one embodiment, d_c is chosen as 1.5% of the total number of data points. The values of ρ_i are then normalized:

then, a distance formula between the nodes and the high-density nodes is defined:

where, when the local density of node i is the maximum, its distance is the maximum of the distances between node i and the other nodes. When the local density of node i is not the maximum, its distance is the smallest distance from node i to a node whose local density is greater than that of node i.

Then δ_i is normalized:

The threshold d_a is taken at around the 80% position of the δ list sorted in ascending order.
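The density and distance quantities described above can be sketched as follows. The sketch assumes the usual density-peak form of the Gaussian kernel, ρ_i = Σ_{j≠i} exp(−(d_ij / d_c)²), reads "1.5% of the total number of data points" as taking d_c at the 1.5% position of the ascending list of pairwise distances, and uses min-max scaling for both normalizations; all three are assumptions, as are the function names. The threshold d_a is not used here.

```python
import numpy as np

def local_density(d, dc_fraction=0.015):
    """Gaussian-kernel local density; d_c is taken from the sorted pairwise distances."""
    n = d.shape[0]
    pair = np.sort(d[np.triu_indices(n, k=1)])            # pairwise distances, ascending
    d_c = pair[int(round(dc_fraction * (len(pair) - 1)))]  # assumed reading of "1.5%"
    if d_c <= 0:
        raise ValueError("cut-off distance must be positive")
    kernel = np.exp(-(d / d_c) ** 2)
    np.fill_diagonal(kernel, 0.0)                          # exclude the j == i term
    return kernel.sum(axis=1)

def delta_distance(d, rho):
    """Smallest distance to a denser node; the densest node gets its largest distance."""
    delta = np.empty(len(rho))
    for i in range(len(rho)):
        higher = rho > rho[i]
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return delta

def min_max(x):
    """Min-max scaling to [0, 1] (assumed normalization)."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

# Four points on a line: a tight group {0, 1, 2} and an outlier at 10.
x = np.array([0.0, 1.0, 2.0, 10.0])
d = np.abs(x[:, None] - x[None, :])
rho = local_density(d)
delta = delta_distance(d, rho)
rho_star, delta_star = min_max(rho), min_max(delta)
```

In this toy example the middle point of the tight group is the densest, so it receives the largest δ, while the outlier has a large δ but almost no density.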

Finally, the product γ = ρ* × δ* is calculated for each node, the values greater than the sum of the mean and the standard deviation of γ are collected into a list and sorted, and K = n × 20% is taken as the number of core points (known labels), which are passed to the label propagation algorithm (LP algorithm) to form the label matrix.
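A hedged sketch of this selection rule follows. The text leaves open whether n in "n × 20%" is the number of nodes or the length of the candidate list; this sketch assumes the latter and exposes the fraction as a parameter. The population standard deviation and the name select_core_points are also assumptions.

```python
import math
import numpy as np

def select_core_points(rho_star, delta_star, top_fraction=0.2):
    """Return indices of core points ranked by gamma = rho* x delta*, plus gamma itself."""
    gamma = np.asarray(rho_star, dtype=float) * np.asarray(delta_star, dtype=float)
    threshold = gamma.mean() + gamma.std()                 # mean + standard deviation of gamma
    candidates = np.flatnonzero(gamma > threshold)
    order = np.argsort(-gamma[candidates], kind="stable")  # descending gamma
    candidates = candidates[order]
    if len(candidates) == 0:
        return candidates, gamma
    k = max(1, math.ceil(top_fraction * len(candidates)))  # assumed reading of "n x 20%"
    return candidates[:k], gamma

# Two nodes (0 and 5) combine high density with a large distance to denser nodes.
rho_star = [1.0, 0.1, 0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]
delta_star = [1.0, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.1]
core, gamma = select_core_points(rho_star, delta_star)
all_candidates, _ = select_core_points(rho_star, delta_star, top_fraction=1.0)
```

With top_fraction=1.0 every candidate above the threshold is kept, which makes it easy to compare the two readings of the rule on a given network.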

The label propagation algorithm is a graph-based clustering algorithm, so a graph G needs to be constructed first. The nodes of the graph are data points, and the weight between the two nodes is constructed by adopting a Gaussian kernel function method:

where d_ij represents the distance between node i and node j and β is a hyperparameter; this yields the similarity matrix formed by the weights w.

Next, the known labels are propagated through the edges between the nodes. The greater the weight of an edge, the more similar two nodes are represented, and the easier it is for label to propagate through. Defining a probability transition matrix:

where P_ij represents the probability of propagating the label of node i to node j. Because there are K core points with known labels, a K × K label matrix YL of the known-label nodes is defined, whose i-th row is the label indicator vector of node i: if the label of the i-th node is j, the j-th element of the row is 1 and the rest are 0. An unlabeled matrix YU is defined at the same time for the unknown-label nodes. Combining them gives the label matrix of all nodes:

F＝[YL,YU] (13)

and then, propagating the label matrix F according to the similarity between the nodes in the probability matrix P, wherein the formula is expressed as:

F＝PF (14)

After one propagation pass, the label matrix F must be reset, because propagation changes the YL part of F, whereas YL holds the previously obtained, accurate labels and should not change. The reset is:

FL＝YL (15)

Then the label matrix F is propagated and reset again, and the process is iterated until the change in the labels of the unlabeled nodes in F reaches the critical point, at which time DPLPA has completed the label partition.

TABLE 1 DPLPA pseudo code

After the clustered label matrix F is obtained, DPLPA groups the nodes whose entry in the same dimension (column) of F is 1 into one community. Once all nodes have been assigned by dimension, the clustering algorithm ends and the complex network has been partitioned.
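The propagate-and-reset loop and the final read-out can be sketched as below. The sketch assumes the weight form w_ij = exp(−d_ij² / β²) and a row-normalized transition matrix P_ij = w_ij / Σ_k w_ik; it zeroes the diagonal of the weight matrix, keeps the rows of F in node order instead of stacking F = [YL, YU], and reads the community of each node as the column with the largest entry. These choices, the tolerance tol, the cap max_iter, and the function name are all assumptions.

```python
import numpy as np

def label_propagation(d, core_points, beta=1.0, tol=1e-6, max_iter=1000):
    """Propagate one-hot labels from the core points; return (labels, F)."""
    d = np.asarray(d, dtype=float)
    core_points = np.asarray(core_points)
    n, K = d.shape[0], len(core_points)

    W = np.exp(-(d ** 2) / beta ** 2)          # assumed Gaussian-kernel weights
    np.fill_diagonal(W, 0.0)                   # no self-propagation (assumption)
    row_sum = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row_sum, out=np.zeros_like(W), where=row_sum > 0)

    Y_L = np.eye(K)                            # core point i carries label i
    F = np.zeros((n, K))
    F[core_points] = Y_L
    unlabeled = np.setdiff1d(np.arange(n), core_points)

    for _ in range(max_iter):
        F_new = P @ F                          # propagate: F = P F
        F_new[core_points] = Y_L               # reset: F_L = Y_L
        change = np.abs(F_new[unlabeled] - F[unlabeled]).max() if unlabeled.size else 0.0
        F = F_new
        if change < tol:                       # unlabeled rows have stopped changing
            break
    return F.argmax(axis=1), F

# Two well-separated pairs on a line, with one core point in each pair.
x = np.array([0.0, 1.0, 10.0, 11.0])
d = np.abs(x[:, None] - x[None, :])
labels, F = label_propagation(d, core_points=[0, 3], beta=2.0)
```

Because the known rows are reset after every pass, the core points act as fixed sources and the unlabeled rows converge to a weighted mixture of their labels.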

To evaluate the effect of DPLPA, the invention was tested on various real and synthetic data sets and compared with several classical methods: Newman's fast greedy algorithm (FN) (Newman M E J. Fast algorithm for detecting community structure in networks [J]. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics, 2004, 69(6 Pt 2)), the Louvain algorithm (BGLL) (Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre. Fast unfolding of communities in large networks [J]. Journal of Statistical Mechanics: Theory and Experiment, 2008(10)), the original LPA algorithm (Raghavan U N, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks [J]. Physical Review E, 2007, 76(3): 036106), and the LPAm algorithm (Barber M J, Clark J W. Detecting network communities by propagating labels under constraints [J]. Physical Review E, 2009, 80(2): 026129). The hardware environment of the experiment was an Intel(R) Core(TM) i7-7700M CPU at 3.60 GHz with 8 GB of memory. The programming language was Python 3.7, 64-bit.

The modularity function Q provided by Newman is used as an evaluation index of an experiment. The modularity is defined as:

where E represents the total number of edges of the social network, A represents the adjacency matrix, k_i is the degree of node i, and c_i represents the community to which node i is assigned. θ(c_i, c_j) is defined as follows:

where θ(c_i, c_j) is 1 when node i and node j are in the same community and 0 otherwise. It is generally considered that the higher the modularity, the more pronounced the community structure.
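As an illustration, Newman's modularity in its standard form, Q = (1 / 2E) Σ_ij (A_ij − k_i k_j / 2E) θ(c_i, c_j), which matches the symbols defined above, can be computed as follows. The function name is hypothetical, and the sketch assumes an undirected, unweighted graph given as a symmetric 0/1 adjacency matrix.

```python
import numpy as np

def modularity(A, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    A = np.asarray(A, dtype=float)
    c = np.asarray(communities)
    k = A.sum(axis=1)                       # node degrees k_i
    two_E = A.sum()                         # 2E: each edge appears twice in A
    theta = c[:, None] == c[None, :]        # theta[i, j] = 1 if same community
    return ((A - np.outer(k, k) / two_E) * theta).sum() / two_E

# Two disjoint edges: 0-1 and 2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
q_split = modularity(A, [0, 0, 1, 1])       # each edge is its own community
q_merged = modularity(A, [0, 0, 0, 0])      # everything in one community
```

Putting every node in a single community always gives Q = 0, which is why Q is read relative to that baseline.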

To verify the accuracy of DPLPA, the invention also uses normalized mutual information (NMI) to measure the similarity of two clustering results. NMI is one of the important metrics in community discovery and gives a largely objective assessment of how accurately a community partition matches the real partition. NMI takes values in [0, 1], and a higher value means the detected communities are closer to the real communities. NMI(A, B) is defined as:

where A (B) denotes community discovery algorithm A (B), C is the confusion matrix, C_ij is the number of nodes shared by community i in the partition of A and community j in the partition of B, C_A (C_B) is the number of communities found by method A (B), C_i. (C_.j) is the sum of the elements in row i (column j) of C, and N is the number of nodes. If the clustering results of algorithms A and B are identical, NMI(A, B) = 1.
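A minimal sketch of this measure follows, assuming the commonly used confusion-matrix form NMI(A, B) = −2 Σ_ij C_ij log(C_ij N / (C_i. C_.j)) / (Σ_i C_i. log(C_i. / N) + Σ_j C_.j log(C_.j / N)), which uses exactly the quantities defined above. The function name and the convention of returning 1 when both partitions consist of a single community are assumptions.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes."""
    a = np.unique(labels_a, return_inverse=True)[1].ravel()  # relabel to 0..C_A-1
    b = np.unique(labels_b, return_inverse=True)[1].ravel()  # relabel to 0..C_B-1
    N = len(a)
    C = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(C, (a, b), 1)                  # confusion matrix C[i, j]
    row, col = C.sum(axis=1), C.sum(axis=0)  # C_i. and C_.j
    nz = C > 0                               # 0 * log(0) terms are skipped
    num = -2.0 * (C[nz] * np.log(C[nz] * N / np.outer(row, col)[nz])).sum()
    den = (row * np.log(row / N)).sum() + (col * np.log(col / N)).sum()
    return 1.0 if den == 0 else num / den    # both trivial partitions: treated as identical

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same partition, labels renamed
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions share no information
```

The measure depends only on how nodes are grouped, so renaming the community labels leaves it unchanged.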

Evaluating an algorithm on artificially synthesized networks has become an effective way to test its quality. The most common benchmark network for community detection is the LFR benchmark proposed by Andrea Lancichinetti. The LFR benchmark network is an extension of the GN benchmark network and has higher practical value. It reflects the heterogeneity of the community size distribution and the power-law distribution of node degree. Its main parameters are as follows: n is the number of network nodes, k is the average node degree, maxk is the maximum node degree, minc is the minimum community size, maxc is the maximum community size, τ1 and τ2 are the negative exponents of the power-law distributions of node degree and community size respectively, and μ is the ratio of the number of edges connecting different communities to the total number of edges. μ characterizes how pronounced the communities are: the smaller μ is, the more pronounced the community structure. Fig. 2 compares the NMI results of the algorithms on the LFR benchmark data set.

The LFR experiment parameters were set to n = 1000, k = 15, maxk = 40, minc = 20, maxc = 50, τ1 = 2, τ2 = 1, with μ ranging from 0.1 to 0.8. Fig. 2 shows that when μ is small, that is, when the community structure of the complex network is pronounced, the NMI values of all algorithms except FN are high. As μ increases and the community structure becomes more complex, the NMI values of the FN and LPA algorithms begin to drop markedly, and the remaining algorithms begin to drop at μ = 0.6. However, DPLPA declines more slowly than BGLL and LPAm and ends with a high NMI value, which indicates that DPLPA detects communities accurately and remains more stable on networks with complex community structure.

To compare the algorithms further, tests were also carried out on several real-world community networks. These networks differ in size and come from various fields. The details are shown in Table 2, where n is the number of nodes, m is the number of edges, and k is the number of known communities.

TABLE 2 detailed description of the real network

Karate is the membership data set of a karate club at an American university; it is constructed from the interactions among club members and is commonly used in social network analysis. Dolphins is a network built from the living habits of 62 bottlenose dolphins, in which an edge between two nodes means that the corresponding dolphins are frequently seen together. Polbook is a network built from political books sold by Amazon in the USA; each node represents a book, and two nodes are connected by an edge if the two books were purchased by the same customer. Football is a network built from American college football games; nodes represent teams, and two nodes are connected by an edge if the teams played a game against each other. The results of the different algorithms on the different networks are shown in Table 3.

TABLE 3 comparison of Q values for different algorithms in a real network

In order to better compare the clustering effect of the DPLPA algorithm on the data set, the invention is explained in detail through the Football data set. The actual grouping of the Football data sets is shown in Table 4, and the clustering effect of the DPLPA algorithm is shown in FIG. 3.

TABLE 4 actual grouping of football dataset networks

As can be seen from table 3, although the Q value of the method of the present invention is not the best in some data sets, the partitioning result of the DPLPA algorithm is identical to the actual community distribution, which can be seen from table 4 and fig. 3. The probability transition matrix well inhibits the randomness of the propagation process in the label propagation process, so that each update of the nodes is updated to the label of the same community node as much as possible, and the result of community division is more stable and closer to the real community condition. The comparison of K values for different algorithms on different networks is shown in table 5.

TABLE 5 comparison of K values for different algorithms in a real network

It is also found from table 5 that the DPLPA algorithm can detect the true number of communities, which is exactly the same as the actual K value. This is mainly because the DPLPA algorithm starts to compute the local density and distance of the nodes through the topology of the network at the very beginning and selects the number of K values by means of a decision graph. Therefore, the K value does not need to be provided, and the DPLPA algorithm has the advantage of detecting the K value.

In order to better display the experimental results, the Karate network and the Dolphins network are taken as case studies, and the detected communities are visualized. The nodes of the same community are divided in the same color. Fig. 4 is a visualization of DPLPA algorithm partitioning of a Karate network. Fig. 5 is a visualization result of DPLPA algorithm partitioning of the Dolphins network.

As can be seen from fig. 4, the local densities of node 1 and node 34 are the highest, and as can be seen from fig. 5, the local densities of node 15 and node 18 are the highest; these nodes also have large node distances, so it is reasonable for the DPLPA algorithm to select them as the K core points, and the partition is completely consistent with the actual community partition. Therefore, the DPLPA algorithm can perform high-quality community detection on real communities.

In conclusion, the invention can predict the number of the communities under the condition of no prior condition, avoids the defects of unstable division and strong randomness of the random label algorithm, and effectively improves the accuracy of community mining and the stability of the algorithm. In addition, because a probability transition matrix is constructed, the iteration times of label propagation are reduced, so that the method has high operation efficiency, and finally the community structure of the network can be quickly found. Compared with other advanced algorithms, the method can quickly and effectively solve the community detection problem, can predict the community number under the condition of no prior condition, and has better stability and accuracy because the discovered community number is always consistent with the actual community number.

Example 2

The invention also discloses a label propagation community detection device based on density peak optimization, comprising:

a fourth calculation module, configured to obtain, based on the distance matrix d and the normalized local density ρ*, the distance between nodes in the complex network, and to normalize it to obtain the normalized distance δ*;

a core point deriving module, configured to acquire K core points based on the normalized local density ρ* and the normalized distance δ*;

Further, the second calculation module is specifically configured to:

Further, the fourth calculating module is specifically configured to:

the distance between nodes in the complex network is calculated as follows:

Further, the core point deriving module is specifically configured to:

The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims

1. A label propagation community detection method based on density peak optimization is characterized by comprising the following steps:

step 8: constructing a label matrix F based on the obtained K core points;

2. The label propagation community detection method based on density peak optimization as claimed in claim 1, wherein the step 3 comprises:

3. The label propagation community detection method based on density peak optimization as claimed in claim 2, wherein the step 5 comprises:

4. The label propagation community detection method based on density peak optimization as claimed in claim 1, wherein the step 6 comprises:

5. A label propagation community detection device based on density peak optimization is characterized by comprising:

a fourth calculation module, configured to obtain, based on the distance matrix d and the normalized local density ρ*, the distance between each node in the complex network and the high-density nodes, and to normalize it to obtain the normalized distance δ*;