CN111738515A - Social network community discovery method based on local distance and node rank optimization function

Info

Publication number: CN111738515A
Application number: CN202010581334.8A
Authority: CN
Inventors: 刘小洋; 丁楠; 刘加苗
Original assignee: Chongqing University of Technology
Current assignee: Chongqing University of Technology
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2020-10-02
Anticipated expiration: 2040-06-23
Also published as: CN111738515B

Abstract

The invention provides a social network community discovery method based on local distance and a node rank optimization function, comprising the following steps: S1, acquiring a network social node data set and performing Laplace normalization on it to obtain a Laplace node matrix; S2, calculating a social network node value from the internal distance and the external distance of the social network: if the network social node value is greater than or equal to a preset network social node value, a network social community is discovered; if it is smaller than the preset value, the network social community is rediscovered. First, the invention takes the node self-transfer problem into account. Second, it comprehensively considers edge weights and can effectively show the characteristic structure of the whole social network. Finally, the method performs better than the other methods it is compared with.

Description

Social network community discovery method based on local distance and node rank optimization function

Technical Field

The invention relates to the technical field of social networks, in particular to a social network community discovery method based on local distance and a node rank optimization function.

Background

In the last two decades, the internet has accelerated globalization, data networks have become increasingly important in human society, and researchers have become increasingly interested in complex networks. Complex networks take diverse forms, such as social networks, biological networks, economic networks, and information networks, and are composed of communities that are relatively independent yet influence one another. Community structure is an important topological attribute of a complex network, so community discovery is significant for complex network analysis, data mining, and related research. It allows complex networks to be analyzed more effectively and useful information to be extracted and applied in fields such as text analysis, personalized recommendation systems, user identification, epidemic propagation, and behavior prediction.

There are many articles on social network community discovery. In a network, to form a community, the nodes contained in each cluster must be related to each other more closely than to nodes outside the cluster. Most researchers hold that communities are characterized by dense connections among their own nodes and sparse connections to nodes outside the community. Since the pioneering work of Girvan and Newman, many algorithms for community detection in complex networks have been proposed; the most typical are modularity optimization, label propagation, greedy, random walk, spectral partitioning, and fuzzy algorithms.

Disclosure of Invention

The invention aims to at least solve the technical problems in the prior art, and particularly provides a social network community discovery method based on local distance and a node rank optimization function.

In order to achieve the above object, the present invention provides a social network community discovery method based on local distance and node rank optimization function, including the following steps:

S1, acquiring a network social node data set and performing Laplace normalization on the acquired data set to obtain a Laplace node matrix;

S2, calculating a social network node value from the internal distance and the external distance of the social network:

if the network social node value is greater than or equal to the preset network social node value, a network social community is discovered;

if the network social node value is smaller than the preset network social node value, the network social community is rediscovered.

In a preferred embodiment of the present invention, in step S1, the Laplace normalization of the acquired social network nodes is calculated as:

$$L^{sym} = D^{-1/2} L D^{-1/2}$$

wherein $D$ represents the node degree matrix;

$L = D - A$ represents the unnormalized Laplacian matrix;

$A$ denotes the adjacency matrix.

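In a preferred embodiment of the present invention, in step S1, the element values of the Laplace node matrix are calculated as:

$$L^{sym}_{ij} = \begin{cases} 1, & i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\,\deg(v_j)}}, & i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0, & \text{otherwise} \end{cases}$$

wherein $\deg(v_i)$ represents the degree of node i;

$\deg(v_j)$ represents the degree of node j;

$v_i$ represents node i;

$v_j$ represents node j;

$L^{sym}_{ij}$ represents the element value in the ith row and jth column of the Laplace node matrix.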
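The normalization in step S1 can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the Laplace node matrix is the standard symmetric normalized Laplacian $D^{-1/2}(D - A)D^{-1/2}$; the function name `normalized_laplacian` and the three-node path graph are hypothetical and not taken from the patent.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian D^{-1/2} (D - A) D^{-1/2} (assumed form)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)                      # node degrees
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0                        # isolated nodes keep zero rows/columns
    inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
    L = np.diag(deg) - A                     # unnormalized Laplacian
    return inv_sqrt[:, None] * L * inv_sqrt[None, :]

# Hypothetical toy input: path graph v1 - v2 - v3
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
L_sym = normalized_laplacian(A)
```

Each off-diagonal entry for an adjacent pair equals $-1/\sqrt{\deg(v_i)\deg(v_j)}$, and each diagonal entry of a non-isolated node equals 1.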

In a preferred embodiment of the present invention, in step S2, the method for calculating the social network internal distance is:

wherein $L^{sym}$ represents the Laplace node matrix;

the adjacency matrix used is that of the node set $V_k$;

G represents the social network;

$V_k$ represents a node set, k = 1, 2, 3, …, K;

$d^{internal}(G, V_k)$ represents the internal distance of the social network.

In a preferred embodiment of the present invention, in step S2, the method for calculating the social network external distance is:

wherein $L^{sym}$ represents the Laplace node matrix;

the adjacency matrices used are that of the node set $V - V_k$ and that of the node set $V_k$;

V represents the node partition set, V = {V₁, V₂, V₃, …, V_K};

G represents the social network;

$V_k$ represents a node set, k = 1, 2, 3, …, K;

$d^{external}(G, V_k)$ represents the external distance of the social network.

In a preferred embodiment of the present invention, in step S2, the method for calculating the social network node value is:

wherein $V_k$ represents a node set, k = 1, 2, 3, …, K;

V represents the node partition set, V = {V₁, V₂, V₃, …, V_K};

$d^{internal}(G, V_k)$ represents the internal distance of the social network;

$d^{external}(G, V_k)$ represents the external distance of the social network;

$S_{LDL}(G, V)$ represents the network social node value.

In a preferred embodiment of the present invention, the adjacency matrix of the node set $V_k$ is calculated as:

wherein $V_k$ represents a node set, k = 1, 2, 3, …, K;

V represents the node partition set, V = {V₁, V₂, V₃, …, V_K};

$v_x$ represents node x, x = 1, 2, 3, …, N;

$v_y$ represents node y, y = 1, 2, 3, …, N.

In a preferred embodiment of the present invention, in step S2, the adjacency matrix of the node set $V - V_k$ is calculated as:

wherein $V_k$ represents a node set, k = 1, 2, 3, …, K;

V represents the node partition set, V = {V₁, V₂, V₃, …, V_K};

$v_x$ represents node x, x = 1, 2, 3, …, N;

$v_y$ represents node y, y = 1, 2, 3, …, N.

In summary, by adopting the above technical scheme, the invention first takes the node self-transfer problem into account. Second, it comprehensively considers edge weights and can effectively show the characteristic structure of the whole social network. Finally, the method performs better than the other methods it is compared with.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic flow diagram of the present invention.

FIG. 2 is a schematic diagram of local distance community partitioning according to the present invention.

Fig. 3 is a schematic diagram comparing different algorithms of the present invention on an artificial network.

FIG. 4 is a schematic diagram illustrating an overview of the community discovery process of the present invention.

Fig. 5 is a schematic view of the visualization of the present invention on different networks.

FIG. 6 is a schematic diagram of community membership in a real data network according to the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

To date, several classical and effective local community discovery algorithms and matrix factorization (MF) algorithms have been proposed. Liu et al. propose a local community discovery framework based on node pair similarity, in which a new local community discovery algorithm can be obtained by embedding a better node similarity measure.

Clauset et al. propose a measure R of local community structure, calculated as follows:

$$R = \frac{B_{in}}{B_{in} + B_{out}}$$

wherein B is a local community, R measures the structure of the local community, $B_{in}$ is the number of edges whose endpoints are both in local community B, and $B_{out}$ is the number of edges with one endpoint in B and the other outside it. The algorithm requires a predefined community size: it keeps adding the neighbor node that increases R the most until the current community reaches that size.

Luo et al. propose another local community measure M, calculated as follows:

$$M = \frac{E_{in}}{E_{out}}$$

wherein M measures the local community, $E_{in}$ is the number of internal edges of the community, and $E_{out}$ is the number of edges between the community boundary and external nodes. The algorithm provides three heuristic node search methods that partially solve the problem of community discovery in complex networks. However, different thresholds must be set for networks of different sizes.
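The two local measures can be illustrated on a toy edge list. This is a sketch under stated assumptions: R is taken as $B_{in}/(B_{in}+B_{out})$ and M as $E_{in}/E_{out}$, the usual forms of these measures, and both are computed from the same two counts (edges inside the community, edges crossing its boundary). The function names and the example graph are hypothetical.

```python
def edge_counts(edges, community):
    """Count edges inside the community and edges crossing its boundary."""
    community = set(community)
    inside = sum(1 for u, v in edges if u in community and v in community)
    boundary = sum(1 for u, v in edges if (u in community) != (v in community))
    return inside, boundary

def clauset_R(edges, community):
    """Assumed form R = B_in / (B_in + B_out)."""
    b_in, b_out = edge_counts(edges, community)
    return b_in / (b_in + b_out) if b_in + b_out else 0.0

def luo_M(edges, community):
    """Assumed form M = E_in / E_out (infinite when no edge leaves the community)."""
    e_in, e_out = edge_counts(edges, community)
    return e_in / e_out if e_out else float("inf")

# Hypothetical toy graph: triangle 1-2-3 with a tail 3-4-5
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
R = clauset_R(edges, {1, 2, 3})   # 3 / (3 + 1)
M = luo_M(edges, {1, 2, 3})       # 3 / 1
```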

The two algorithms have several desirable advantages: they can detect clusters of any shape, do not need the number of clusters to be preset, and can display the center selection process through a decision graph. However, DPC still has drawbacks. First, the truncation distance has a large impact on the clustering results. Second, manual intervention is required to select suitable cluster centers.

Lancichinetti et al. propose a fitness function $F_c$ to measure the density of nodes within a community. The fitness function is defined as follows:

$$F_c = \frac{k^{c}_{in}}{\left(k^{c}_{in} + k^{c}_{out}\right)^{\alpha}}$$

in the formula, $k^{c}_{in}$ represents the sum of the internal degrees of community c, $k^{c}_{out}$ denotes the sum of the external degrees of community c, α denotes a resolution parameter that controls the size of the detected community, and $F_c$ represents the density of nodes within the community. This quality function effectively measures how closely the nodes in a community are connected, but it cannot fully use the local information between nodes.
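A small numerical illustration of the fitness function follows. It assumes the usual form $F_c = k_{in} / (k_{in} + k_{out})^{\alpha}$, with the internal degree sum counting every internal edge twice; the function name and the toy graph are hypothetical.

```python
def community_fitness(edges, community, alpha=1.0):
    """Fitness k_in / (k_in + k_out) ** alpha for one community (assumed form)."""
    community = set(community)
    # Every internal edge adds 1 to the internal degree of both of its endpoints.
    k_in = 2 * sum(1 for u, v in edges if u in community and v in community)
    # Every boundary edge adds 1 to the external degree of its inside endpoint.
    k_out = sum(1 for u, v in edges if (u in community) != (v in community))
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0

# Hypothetical toy graph: triangle 1-2-3 with a tail 3-4-5
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
f_default = community_fitness(edges, {1, 2, 3})            # 6 / 7
f_strict = community_fitness(edges, {1, 2, 3}, alpha=2.0)  # 6 / 49
```

A larger α penalizes the total degree more strongly, so the same community scores lower and smaller communities tend to be favored.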

Xu et al. studied how to apply genetic algorithms from computational intelligence to community discovery in directed and undirected networks, and developed iterative optimization algorithms. Wang et al. propose a method for discovering overlapping communities using a Bayesian MF model. Its advantages are that the number of communities is determined automatically and there is no resolution limit. However, its internal estimate of the number of communities may mislead the decomposition and return a wrong solution.

Guo et al. use local central nodes and Jaccard coefficients to detect the core members of communities as seeds, ensuring that the selected seeds are community center nodes. Each time, the seed node with the greatest degree is pre-expanded using the fitness function. Then, according to the fitness function and the internal force among nodes, the k best-performing nodes from the pre-expansion are expanded to obtain high-quality communities in the network.

Chen et al. propose a community discovery method that separates overlapping communities from the network using a non-negative matrix factorization (NMF) model and handles an unknown number of communities through feature matrix preprocessing and sorting optimization, so the algorithm can partition networks whose number of communities is unknown. Hu et al. propose an improved Lagrangian alternating direction algorithm for symmetric non-negative matrix factorization.

Recently, Li et al. proposed a community partitioning method based on semi-supervised matrix factorization and random walk. The transition probabilities among nodes are calculated from the network topology, the final walk probabilities are obtained with a random walk model, and a feature matrix is constructed.

Wu et al. propose a framework called mixed hypergraph regularized non-negative matrix factorization (MHGNMF) that takes higher-order information between nodes into account to improve clustering performance. The hypergraph regularization term forces nodes in the same hyperedge to be projected into the same latent subspace, yielding a more discriminative representation. In the proposed framework, topological connectivity information and structural similarity information are exploited by mixing two kinds of neighbors of each centroid to generate a set of hyperedges.

The local community discovery algorithms above all use the topological properties of the network and assume by default that all edges have the same weight. In a real data network, however, the connection strength between entities differs and node bias is not considered, so weights are easily misestimated. To the best of the inventors' knowledge, no other community discovery work combines local distance with a method based on Laplacian matrix decomposition.

To better describe the proposed model, the invention will use the following mathematical definition:

Definition 1: The network G = (V, E) is composed of a node partition set V and an edge set E. The nodes contained in V are labeled $v_1, v_2, v_3, \ldots, v_N$, where $v_p$ denotes node p, p = 1, 2, 3, …, N, and N is the total number of nodes in the node partition set. The edges contained in E indicate which nodes are connected: if the edge pair $\cup_{x \neq y}(v_x, v_y)$ is in the edge set E, with x = 1, 2, 3, …, N and y = 1, 2, 3, …, N, then $v_x$ is connected to $v_y$. The present invention processes only undirected graphs, so the edge pair $\cup_{x \neq y}(v_x, v_y)$ is the same as the edge pair $\cup_{y \neq x}(v_y, v_x)$, which indicates that $v_y$ is connected to $v_x$; that is, $v_x$ and $v_y$ are connected to each other. Here $\cup_{\zeta}$ denotes the condition ζ: $\cup_{x \neq y}$ indicates the condition x ≠ y, and $\cup_{y \neq x}$ indicates the condition y ≠ x.

Definition 2: Each network G has an adjacency matrix A. If the network G has N nodes, A is an N × N matrix whose entries are 0 or 1: $A_{pq} = 1$ if and only if the edge pair $\cup_{p \neq q}(v_p, v_q) \in E$, with p = 1, 2, 3, …, N and q = 1, 2, 3, …, N, i.e. $v_p$ and $v_q$ are connected. From Definition 1, $\cup_{p \neq q}(v_p, v_q) = \cup_{q \neq p}(v_q, v_p)$, so the adjacency matrix A is symmetric.
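Definitions 1 and 2 translate directly into code. The sketch below builds the symmetric 0/1 adjacency matrix of an undirected graph from a list of edge pairs; the function name `adjacency_matrix`, the 1-based node labels, and the sample edges are illustrative assumptions.

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Build the N x N symmetric 0/1 adjacency matrix from 1-based edge pairs."""
    A = np.zeros((n, n), dtype=int)
    for p, q in edges:
        if p != q:                 # Definition 1 only admits pairs with p != q
            A[p - 1, q - 1] = 1
            A[q - 1, p - 1] = 1    # undirected: (v_p, v_q) equals (v_q, v_p)
    return A

# Hypothetical toy graph with N = 4 nodes
edges = [(1, 2), (2, 3), (3, 4)]
A = adjacency_matrix(4, edges)
```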

Definition 3: Community discovery on the network G = (V, E) divides the node partition set V into node sets V₁, V₂, V₃, …, V_K such that V₁ ∪ V₂ ∪ V₃ ∪ … ∪ V_K equals V and none of V₁, V₂, V₃, …, V_K is empty. The node sets V₁, V₂, V₃, …, V_K are the community structure. The present invention defines a partition as V = {V₁, V₂, V₃, …, V_K}. The number of partitions is K = <V>, where <V> denotes the number of node sets in the node partition set V.

Definition 4: Given a network G = (V, E) and a node partition set V = {V₁, V₂, V₃, …, V_K}, the edges of the network G can be divided into edge sets $E_{mn}$, i.e. $E_{mn} \subseteq E$, with m = 1, 2, 3, …, K and n = 1, 2, 3, …, K: the edge pair $\cup_{x \neq y}(v_x, v_y)$ belongs to $E_{mn}$ if and only if $v_x \in V_m$ and $v_y \in V_n$.

Definition 5: In particular, the internal edge set and the external edge set of the node set $V_k$ are defined as $E^{internal}_k = E_{kk}$ and $E^{external}_k = \bigcup_{l \neq k} E_{kl}$, with k = 1, 2, 3, …, K and l = 1, 2, 3, …, K. In other words, the internal edge set $E^{internal}_k$ contains the edges inside the node set $V_k$: both nodes of any edge pair in it belong to the same community. The external edge set $E^{external}_k$ contains the edges leaving $V_k$: for any edge pair in it, one node is in the node set $V_k$ and the other node does not belong to $V_k$ but to the node set $V - V_k$.
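The split into internal and external edges described in Definitions 4 and 5 can be sketched as follows. The helper name `split_edges` and the sample partition are hypothetical; the code only relies on the prose description (both endpoints in $V_k$ versus exactly one endpoint in $V_k$).

```python
def split_edges(edges, partition, k):
    """Return (internal, external) edge lists for community index k."""
    Vk = set(partition[k])
    # Internal: both endpoints belong to V_k.
    internal = [(x, y) for x, y in edges if x in Vk and y in Vk]
    # External: exactly one endpoint belongs to V_k, the other to V - V_k.
    external = [(x, y) for x, y in edges if (x in Vk) != (y in Vk)]
    return internal, external

# Hypothetical toy graph and partition V = {V_1, V_2}
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
partition = [{1, 2, 3}, {4, 5}]
internal_1, external_1 = split_edges(edges, partition, 0)
internal_2, external_2 = split_edges(edges, partition, 1)
```

Every cross-community edge appears in the external set of both communities it touches, which matches the remark later in the text that edges are counted twice.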

The invention discloses a social network community discovery method based on local distance and a node rank optimization function, which comprises the following steps:

S1, acquiring a network social node data set and performing Laplace normalization on it to obtain a Laplace node matrix;

S2, calculating a social network node value from the internal distance and the external distance of the social network: if the network social node value is greater than or equal to the preset network social node value, a network social community is discovered; if it is smaller than the preset value, the network social community is rediscovered.

In step S1, the Laplace normalization is calculated as:

$$L^{sym} = D^{-1/2} L D^{-1/2}$$

wherein $D$ represents the node degree matrix; $L = D - A$ represents the unnormalized Laplacian matrix; $A$ denotes the adjacency matrix.

The element values of the Laplace node matrix are:

$$L^{sym}_{ij} = \begin{cases} 1, & i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\,\deg(v_j)}}, & i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0, & \text{otherwise} \end{cases}$$

wherein $\deg(v_i)$ represents the degree of node i; $\deg(v_j)$ represents the degree of node j; $v_i$ represents node i; $v_j$ represents node j.

In step S2, the internal distance $d^{internal}(G, V_k)$ and the external distance $d^{external}(G, V_k)$ of the social network are calculated from the Laplace node matrix $L^{sym}$, the adjacency matrix of the node set $V_k$, and the adjacency matrix of the node set $V - V_k$, wherein G represents the social network; V represents the node partition set, V = {V₁, V₂, V₃, …, V_K}; and $V_k$ represents a node set, k = 1, 2, 3, …, K.

The network social node value $S_{LDL}(G, V)$ is calculated from the internal distance $d^{internal}(G, V_k)$ and the external distance $d^{external}(G, V_k)$ of every node set $V_k$.

In a preferred embodiment of the present invention, in step S2, the adjacency matrices of the node set $V_k$ and of the node set $V - V_k$ are calculated with the following notation:

$V_k$ represents a node set, k = 1, 2, 3, …, K;

V represents the node partition set, V = {V₁, V₂, V₃, …, V_K};

$v_x$ represents node x, x = 1, 2, 3, …, N;

$v_y$ represents node y, y = 1, 2, 3, …, N.

In a preferred embodiment of the present invention, the method further comprises the steps of:

S3, optimizing the social network community found in step S2;

S4, displaying the social network community obtained in step S3.

In step S3, the method for optimizing the found social network community is:

wherein $V_k$ represents a node set, k = 1, 2, 3, …, K;

V represents the node partition set, V = {V₁, V₂, V₃, …, V_K};

$v_i$ represents node i;

indicates that in the case of ……, there is ……;

$V[v_i]$ indicates that node i belongs to the node set $V[v_i]$;

$v_j$ represents node j;

$A_{ij}$ represents the element value in the ith row and jth column of the adjacency matrix A;

if the condition is satisfied, the node set $V[v_i]$ is kept;

if not, the node set $V[v_i]$ is discarded.

In a preferred embodiment of the present invention, the method further comprises a method for calculating the performance metric Q:

$$Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - F_{ij}\right)\delta(c_i, c_j), \qquad F_{ij} = \frac{\deg(v_i)\,\deg(v_j)}{2m}$$

wherein m represents the total number of edges connecting nodes; $A_{ij}$ represents the element values of the adjacency matrix A; $F_{ij}$ represents the expected proportion of edges connecting the two nodes i and j; $\delta(c_i, c_j)$ equals 1 when nodes i and j are assigned to the same community and 0 otherwise;

$\deg(v_i)$ represents the degree of node i; $\deg(v_j)$ represents the degree of node j; $v_i$ represents node i; $v_j$ represents node j;

and/or a method for calculating the Jaccard coefficient:

$$J(V_M, V_0) = \frac{|V_M \cap V_0|}{|V_M \cup V_0|}$$

wherein $V_M$ is the optimal community structure;

$V_0$ is a reference vector;

$J(V_M, V_0)$ represents the Jaccard coefficient;

when $V_M$ and $V_0$ are both empty, $J(V_M, V_0) = 1$;

and/or a method for calculating the Error index:

wherein $V_M'$ is the structural feature of $V_M$;

$V_0'$ is the structural feature of $V_0$;

$E(V_M', V_0')$ represents the Error index;

when $V_M$ has the same community structure as $V_0$, $E(V_M', V_0')$ equals 0.

As shown in Fig. 2, the entire network G is divided into 5 partitions, i.e. V = {V₁, V₂, V₃, V₄, V₅}; the figure briefly indicates the internal edge set and the external edge set of partition V₁.

Community discovery is to find a node partition set V = {V₁, V₂, V₃, …, V_K} of the network G = (V, E); to form a community, the nodes contained in each cluster must be related to each other more closely than to nodes outside the cluster. First, to solve the problem of node information self-transmission, the invention comprehensively considers the influence of a node on itself, introduces a self-degree matrix, and constructs the following model using the Laplace matrix decomposition principle:

$$L^{sym} = D^{-1/2} L D^{-1/2} = I_n - D^{-1/2} A D^{-1/2}$$

wherein $D$ represents the node degree matrix; $L = D - A$ represents the unnormalized Laplacian matrix; $A$ represents the adjacency matrix; $I_n$ represents the n-order identity matrix; $L^{sym}$ is the Laplace node matrix.

To extract the network features completely, edge weights are fully considered. The matrix is obtained by normalizing the adjacency matrix: both sides of the adjacency matrix are multiplied by the inverse square root of the node degrees. For a single node, normalization divides by the degree of the node, so that the information transfer value of each adjacent edge is normalized. A node with 10 edges then does not have greater influence than a node with 1 edge, because each edge of the former has a weight of only 0.1 after normalization. Lifting this operation from a single node to a two-dimensional matrix means inverting the degree matrix and multiplying by that inverse, which amounts to a matrix division and completes the normalization. Here, however, the left and right sides are multiplied by the inverse square roots of the degrees of nodes i and j, the two endpoints of an edge. For each node pair $v_i, v_j$, the elements of the matrix are given by the following equation:

$$L^{sym}_{ij} = \begin{cases} 1, & i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\,\deg(v_j)}}, & i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0, & \text{otherwise} \end{cases}$$

wherein $\deg(v_i)$ represents the degree of node i and $\deg(v_j)$ represents the degree of node j, i.e. the values of the degree matrix at nodes i and j; $v_i$ represents node i; $v_j$ represents node j.

The internal and external distances are given by:

wherein $L^{sym}$ represents the Laplace node matrix; the adjacency matrices used are that of the node set $V_k$ and that of the node set $V - V_k$; $d^{internal}(G, V_k)$ represents the internal distance of the social network; $d^{external}(G, V_k)$ represents the external distance of the social network. $d^{external}(G, V_k)$ can be abbreviated as $d^{e}(G, V_k)$ or $d^{e}$; $d^{internal}(G, V_k)$ can be abbreviated as $d^{i}(G, V_k)$ or $d^{i}$.

It should be understood that in formula (8), when node x and node y both belong to the node set (also called community) $V_k$, the element in row x and column y of the adjacency matrix of $V_k$ is 1; when node x belongs to $V_k$ and node y belongs to the node set $V - V_k$, that element is 0. Similarly, in formula (9), when node x and node y both belong to $V_k$, the element in row x and column y of the adjacency matrix of $V - V_k$ is 0; when node x belongs to $V_k$ and node y belongs to $V - V_k$, that element is 1.

For all $v_x \in V$, $A_{xx} = 1$ (i.e., each node has a self-loop). All edges except the self-loops are counted twice. $d^{internal}(G, V_k)$ takes values in [0, 1]. The case in which the network G is a union of mutually disconnected communities is a perfect community structure graph. $d^{external}(G, V_k)$ also takes values in [0, 1], and for a perfect community structure graph its value is 0.

The local distance Laplace (LDL) network social node value function is as follows:

wherein $V_k$ represents a node set; V represents the node partition set; $d^{internal}(G, V_k)$ represents the internal distance of the social network; $d^{external}(G, V_k)$ represents the external distance of the social network; $S_{LDL}(G, V)$ represents the network social node value.

One point to emphasize for the LDL model is that the weight of each local partition (local internal distance plus local external distance) is $|V_k| / (2|V|)$. This prevents smaller communities from having a disproportionate impact on the total community score.
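As a worked illustration of this weighting, assume (the text does not state this explicitly) that $|V_k|$ counts the nodes in community $V_k$ and $|V|$ counts all nodes in the network. For a hypothetical network of 5 nodes split into communities of sizes 3 and 2:

```latex
w_1 = \frac{|V_1|}{2|V|} = \frac{3}{10}, \qquad
w_2 = \frac{|V_2|}{2|V|} = \frac{2}{10}, \qquad
w_1 + w_2 = \frac{1}{2}
```

Under this reading the weights of any partition sum to 1/2, and each community contributes in proportion to its share of the nodes rather than equally.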

3.3 Node Rank Optimization Function

The community discovery algorithm proposed by the present invention may generate more than one possible community discovery result, in which case the result must be optimized. The optimal community selection method provided by the invention is based on the idea of community validity, namely that a community should have more edges inside it than leading outside it. The weak criterion (WRC) and the strong criterion (SRC) were first proposed by Radicchi et al., but the WRC is too weak and does not discriminate at the level of individual nodes: even if a node u is completely disconnected from $V_k$, u can be added to $V_k$ and the WRC is still satisfied. This can cause many discovered communities to be invalid. Therefore, the present invention provides a node rank optimization (NRO) function, which is as follows:

wherein,

that is to say,

Φ represents the node set to which node j belongs; $A_{ij}$ represents the element value in the ith row and jth column of the adjacency matrix A; V[i] indicates that node i belongs to the node set V[i]; that is, in the case of ……, there is ……

wherein $v_i$ represents node i and $v_y$ represents node y.

Thus, the NRO function is expressed as follows:

wherein $V_k$ represents a node set and V represents the node partition set; $v_i$ represents node i; indicates that in the case of ……, there is ……; $V[v_i]$ indicates that node i belongs to the node set $V[v_i]$; $v_j$ represents node j; $A_{ij}$ represents the element value in the ith row and jth column of the adjacency matrix A.

If the condition is satisfied, the node set $V[v_i]$ is kept;

if not, the node set $V[v_i]$ is discarded.

In terms of optimization effect, because the two coordination parameters V[i] and V − V[i] are set, the criterion is stronger than the WRC and weaker than the SRC, and thus achieves a better optimization effect.
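For reference, the two Radicchi criteria that bracket the NRO function can be checked as sketched below. The sketch assumes their usual statement: the strong criterion requires every node of the community to have more neighbors inside than outside, while the weak criterion only compares the sums over the community. It does not implement the NRO function itself, and the function name and toy graph are hypothetical.

```python
import numpy as np

def radicchi_criteria(A, community):
    """Return (weak, strong) for one community under the assumed criteria."""
    A = np.asarray(A)
    inside = np.zeros(A.shape[0], dtype=bool)
    inside[list(community)] = True
    rows = A[inside]                        # adjacency rows of community nodes
    k_in = rows[:, inside].sum(axis=1)      # neighbors inside the community
    k_out = rows[:, ~inside].sum(axis=1)    # neighbors outside the community
    weak = bool(k_in.sum() > k_out.sum())
    strong = bool(np.all(k_in > k_out))
    return weak, strong

# Hypothetical toy graph: triangle 0-1-2 with a tail 2-3-4
A = np.zeros((5, 5), dtype=int)
for u, v in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1

tight = radicchi_criteria(A, {0, 1, 2})      # both criteria hold
loose = radicchi_criteria(A, {0, 1, 2, 3})   # node 3 breaks the strong criterion
```

In the second call, node 3 has one neighbor inside and one outside, so the community passes the weak criterion but fails the strong one, which is the gap an intermediate criterion is meant to fill.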

The main flow of the algorithm provided by the invention is as follows:

4 Experimental results and analysis

To evaluate the proposed algorithm, the present invention uses eleven real data networks as well as artificial network data sets. The data sources are http://www-personal.umich.edu/mejn/Netdata/ and http://snap.stanford.edu/data/. The hardware and software environment of the experiments was as follows: Intel(R) Core(TM) i5-4160M CPU, 3.60 GHz, 4 GB memory, Windows 10, MATLAB R2019a.

4.1 Evaluation index

In the present invention, Q is used as the performance metric in experiments on networks that have no ground-truth community structure.

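The performance metric Q is:

$$Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - F_{ij}\right)\delta(c_i, c_j)$$

where

$$F_{ij} = \frac{\deg(v_i)\,\deg(v_j)}{2m}$$

m represents the total number of edges; $\deg(v_i)$ represents the degree of node i; $\deg(v_j)$ represents the degree of node j; $v_i$ represents node i; $v_j$ represents node j.

$\delta(c_i, c_j)$ is defined as follows:

$$\delta(c_i, c_j) = \begin{cases} 1, & c_i = c_j \\ 0, & \text{otherwise} \end{cases}$$

wherein $c_i$ is the community to which vertex i is assigned and $c_j$ is the community to which vertex j is assigned.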
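The metric Q can be computed directly from an adjacency matrix and a community label per node. The sketch below assumes the standard Newman modularity form, $Q = \frac{1}{2m}\sum_{ij}(A_{ij} - \deg(v_i)\deg(v_j)/2m)\,\delta(c_i, c_j)$, on a graph without self-loops; the function name and the two-triangle example are hypothetical.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q for an undirected graph (assumed form)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    deg = A.sum(axis=1)
    two_m = deg.sum()                              # 2m: every edge counted twice
    F = np.outer(deg, deg) / two_m                 # expected edge proportion F_ij
    same = labels[:, None] == labels[None, :]      # delta(c_i, c_j)
    return float(((A - F) * same).sum() / two_m)

# Hypothetical toy graph: two disjoint triangles
A = np.zeros((6, 6), dtype=int)
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[u, v] = A[v, u] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_merged = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Splitting the graph into its two triangles gives Q = 0.5, while putting all nodes in one community gives Q = 0, which shows why a higher Q indicates a clearer community structure.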

The Jaccard coefficient (JSC) is used to compare the similarity and difference between finite sample sets.

Given two sets $V_M$ and $V_0$, the Jaccard coefficient is defined as the ratio of the size of their intersection to the size of their union; a larger value indicates a higher degree of similarity:

$$J(V_M, V_0) = \frac{|V_M \cap V_0|}{|V_M \cup V_0|}$$

wherein $V_M$ is the optimal community structure and $V_0$ is a reference vector; when $V_M$ and $V_0$ are both empty, $J(V_M, V_0) = 1$.
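The definition, including the empty-set convention, fits in a short function. This is a minimal sketch on plain Python sets; how a community structure is encoded as a set (for example as node labels or as node pairs) is left open here, and the function name is hypothetical.

```python
def jaccard(v_m, v_0):
    """Jaccard coefficient |V_M ∩ V_0| / |V_M ∪ V_0|, defined as 1 for two empty sets."""
    v_m, v_0 = set(v_m), set(v_0)
    if not v_m and not v_0:
        return 1.0
    return len(v_m & v_0) / len(v_m | v_0)

# Hypothetical example sets
j_partial = jaccard({1, 2, 3}, {2, 3, 4})   # 2 shared items out of 4 distinct items
j_empty = jaccard(set(), set())
```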

A larger Rand index (RI) value means that the community discovery result is more consistent with the real situation: the clustering is more accurate and the purity within each class is higher.
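The text does not define RI; assuming it denotes the standard Rand index used in Fig. 3(b), it is the fraction of node pairs on which two labelings agree (both place the pair in the same community, or both place it in different communities). The function name and the labelings below are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of node pairs on which two labelings agree (standard Rand index)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical labelings of four nodes
ri_same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same grouping, renamed labels
ri_diff = rand_index([0, 0, 1, 1], [0, 0, 0, 1])   # agree on 3 of 6 pairs
```

The index depends only on the grouping, not on the label names, so relabeled but identical partitions score 1.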

The Error index $E(V_M', V_0')$ equals 0 when $V_M$ has the same community structure as $V_0$, and is defined as follows:

wherein $V_M'$ is the structural feature of $V_M$ and $V_0'$ is the structural feature of $V_0$.

4.2 Artificial network performance comparison

The algorithm is run on an artificial data network (the GN benchmark network). Each node is connected to other nodes in the same community through the internal edge set $E^{internal}$ and to other communities through the external edge set $E^{external}$. As the external edge set $E^{external}$ grows, the community structure becomes less clear and the community discovery task becomes more challenging.

TABLE 1 Artificial network parameters

Fig. 3 compares the performance of 8 algorithms on the artificial data network. The proposed LDL algorithm was analyzed experimentally on various artificial and real network data sets and compared with seven conventional algorithms: LinkLPA, MFM, LFK, NMF, LRLFP, specluster 1 and specluster 2.

Fig. 3(a) shows the performance of the 8 algorithms on the Jaccard coefficient. As expected, the larger the external edge probability, the lower the Jaccard coefficient. When the external edge probability is less than 0.4, the proposed LDL algorithm has a clear advantage; above 0.4 its Jaccard coefficient is slightly lower than that of the other algorithms, but always higher than that of the LinkLPA algorithm.

Fig. 3(b) shows the performance of the algorithms on the Rand index. The overall trend of each algorithm is similar to that in Fig. 3(a): the Rand index gradually decreases as the external edge probability increases. Notably, the proposed LDL algorithm and LinkLPA have significant advantages over the other algorithms, and when the external edge probability is less than 0.4, LDL is better than LinkLPA.

Fig. 3(c) shows that the algorithms do not differ much on the modularity criterion, but the LDL algorithm remains ahead throughout.

The Error values of the algorithms in Fig. 3(d) differ significantly. The Error value of the LDL algorithm is the lowest when the external edge probability is less than 0.8, and LDL is only 3% worse than the MFM algorithm when the probability is greater than 0.8. Overall, on the artificial network the proposed LDL algorithm performs better and more stably than the other 7 algorithms.

4.3 true network Performance comparison

To further evaluate the LDL algorithm proposed by the present invention, eleven representative social networks of different sizes were selected. In Table 2, Networks is the name of the real data network, Node is the number of nodes, Edge is the number of edges, A-co is the average clustering coefficient of the nodes, A-Length is the average path length, and Description gives the practical meaning of the network.

Table 2 Real networks

Networks	Node	Edge	A-co	A-Length	Description
Karate	34	78	0.588	2.408	Zachary’s karate club
Dolphin	62	159	0.303	3.357	Dolphin social network
Lemis	77	254	0.736	2.641	Victor Hugo novel Les Miserables
Public book	103	441	0.488	3.079	Books about US politics
Football	115	616	0.289	3.421	A map of the popular board game Risk
Celegansnertal	297	2359	0.308	2.455	Celegansnertal dissertation
Email	1005	25571	0.439	2.968	Students in ANLP course email message
Public blogs	1490	19025	0.361	2.738	Blogs about politics
Netscience	1589	2742	0.701	2.842	Co-authorship in network science
Power grid	4941	6594	0.405	2.391	The topology of Power Grid
Hep_th	8361	15751	0.636	3.129	Collaboration of High Energy Physics Theory
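
The two structural statistics listed for each network (average clustering coefficient and average path length) can be computed directly from an edge list. The following is a minimal sketch, assuming the usual definitions: the mean local clustering coefficient, with nodes of degree below 2 counted as 0, and the mean shortest-path length over connected node pairs. The text does not state which exact definitions were used, and the function names and the toy graph are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0 (assumed convention)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        nbr_list = list(nbrs)
        # Count edges that exist between pairs of neighbours.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbr_list[j] in adj[nbr_list[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over reachable ordered node pairs (unweighted BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy = build_adjacency([(0, 1), (1, 2), (0, 2), (2, 3)])
toy_clustering = average_clustering(toy)
toy_path_length = average_path_length(toy)
```

Applied to the edge list of a real network, the same two functions would give values comparable to the A-co and A-Length columns, provided the definitions above match those used for the table.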

To illustrate the overall social network community discovery process, Figs. 4(a) to 4(i) give a brief overview of the process, taking the Power grid network as an example. There are 9 subgraphs in total, i.e., 9 communities are finally formed.

Fig. 4(a) shows the first community structure to be divided (green cut set). The second community structure (purple cut set) is then divided, as shown in Fig. 4(b), followed by the third community structure, as shown in Fig. 4(c). The process continues in the same way until the ninth community has been divided, at which point the convergence criterion is reached, i.e., every node is contained in some community, as shown in Fig. 4(i).

After division, the social networks have clear community structures. Figs. 5(a) to 5(d) show the community discovery results of the LDL algorithm proposed by the present invention on 4 social networks: Dolphin, Lemis, Celegansnertal and Netscience. The LDL algorithm achieves high recognition quality in large-scale data networks (see Table 2), and the higher the degree and the average clustering coefficient of a node, the more prominently it is displayed and the more easily it becomes a community center around which a community structure forms.

Table 4 shows the results of the proposed LDL algorithm compared to the conventional algorithm on the Jaccard index in the real dataset. The bolded values in the table indicate algorithms that perform optimally, and the shaded gray values indicate algorithms that perform suboptimally.

Table 4 Results of the LDL algorithm and the conventional algorithms on the Jaccard index (real datasets)

Networks	LinkLPA	MFM	LDL	NMF	LRLFP	LFK	speClust1	speClust2
Karate	0.5	0.7375	0.6507	0.5882	0.325	0.6052	0.5593	0.2852
Dolphins	0.1035	0.1918	0.2131	0.1877	0.0541	0.2118	0.2161	0.2136
Lemis	0.4112	0.4793	0.6524	0.2844	0.2410	0.6276	0.4159	0.1972
Public book	0.3403	0.6671	0.6440	0.6512	0.0551	0.3951	0.6749	0.6951
Football	0.7147	0.6357	0.4052	0.8413	0.6920	0.0798	0.0798	0.07798
Celegansnertal	0.3445	0.2151	0.4804	0.343	0.0681	0.3551	0.2150	0.2151
Email	0.2599	0.0460	0.2085	0.1912	0.1251	0.0462	0.0467	0.0467
Public blogs	0.3112	0.5167	0.5690	0.5426	0.0162	0.4027	0.4120	0.4998
Netscience	0.2186	0.1332	0.2213	0.1780	0.0841	0.1464	0.0239	0.0100
Power	0.1603	0.0168	0.2240	0.2092	0.0048	0.0023	0.1371	0.0285
Hep_th	0.1524	0.1912	0.2203	0.1036	0.2015	0.1242	0.2003	0.0972

As shown in Table 4, the LDL algorithm provided by the invention performs best in the Lemis, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, outperforming the other 7 algorithms. It is suboptimal in the Karate, Dolphins and Email data networks, second only to the MFM, speClust1 and LinkLPA algorithms respectively, while still outperforming the other 6 algorithms. In the Public book and Football data networks its performance is moderate, better than that of some of the other algorithms.

Table 5 Results of the LDL algorithm and the conventional algorithms on the Rand index (real datasets)

As shown in Table 5, the LDL algorithm provided by the invention is optimal in the Lemis, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, outperforming the other 7 algorithms. It is suboptimal in the Dolphins and Email data networks, second only to the LRLFP algorithm, while still outperforming the other 6 algorithms. In the Karate, Public book and Football data networks it performs slightly better than some of the other algorithms. For example, in the Karate data network it outperforms 5 algorithms: LFK, speClust1, LinkLPA, LRLFP and speClust2.
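
The Jaccard and Rand indices compare a detected partition with a reference partition. This section does not reproduce their formulas, so the following is only a sketch under the assumption that the standard pair-counting definitions are meant: every pair of nodes is classified by whether the two nodes share a community in the reference and in the detected partition. The function names and the toy labels are hypothetical.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Classify all node pairs.

    a: same community in both partitions
    b: same in the reference only
    c: same in the detected partition only
    d: different in both
    """
    a = b = c = d = 0
    for u, v in combinations(sorted(truth), 2):
        same_truth = truth[u] == truth[v]
        same_pred = pred[u] == pred[v]
        if same_truth and same_pred:
            a += 1
        elif same_truth:
            b += 1
        elif same_pred:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard_index(truth, pred):
    """Pair-counting Jaccard index a / (a + b + c)."""
    a, b, c, _ = pair_counts(truth, pred)
    return a / (a + b + c) if (a + b + c) else 1.0


def rand_index(truth, pred):
    """Rand index (a + d) / (a + b + c + d)."""
    a, b, c, d = pair_counts(truth, pred)
    total = a + b + c + d
    return (a + d) / total if total else 1.0


# Hypothetical toy labels: 5 nodes, reference communities A/B, detected communities x/y.
truth_labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
pred_labels = {0: "x", 1: "x", 2: "y", 3: "y", 4: "y"}
toy_jaccard = jaccard_index(truth_labels, pred_labels)
toy_rand = rand_index(truth_labels, pred_labels)
```

Under these definitions the Rand index also rewards pairs that are separated in both partitions, which is one reason the two indices can rank the same algorithms differently.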

Table 6 Results of the LDL algorithm and the conventional algorithms on the Modularity index (real datasets)

Networks	LinkLPA	MFM	LDL	NMF	LRLFP	LFK	speClust1	speClust2
Karate	0.4427	0.4477	0.4347	0.4459	0.3663	0.4343	0.4116	0.1545
Dolphins	0.46	0.0108	0.4709	0.4486	0.4022	0.01080	0.0054	0.1299
Lemis	0.5882	0.5768	0.5772	0.4849	0.5298	0.5632	0.1088	0.2034
Public book	0.5531	0.5091	0.5196	0.5182	0.4117	0.4065	0.4595	0.4209
Football	0.6189	0.5423	0.6092	0.6236	0.6171	0.1075	0.5933	0.5753
Celegansnertal	0.433	0.4378	0.4521	0.3761	0.1722	0.2035	0.0092	0.3874
Email	0.6178	0.0381	0.5008	0.6547	0.6547	0.3292	0.1002	0.3048
Public blogs	0.3007	0.3431	0.3967	0.367	0.1864	0.1155	0.0087	0.2133
Netscience	0.8085	0.8118	0.872	0.8238	0.8238	0.8011	0.2062	0.73
Power	0.5826	0.531	0.6438	0.6241	0.5471	0.5289	0.6207	0.5227
Hep_th	0.6021	0.6754	0.7181	0.501	0.6503	0.685	0.6821	0.5431

As shown in Table 6, the LDL algorithm of the present invention performs best in the Dolphins, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, outperforming the other 7 algorithms. It is suboptimal in the Lemis and Public book data networks, second only to the LinkLPA algorithm, while still outperforming the other 6 algorithms. In the Karate, Football and Email data networks it performs slightly better than some of the other algorithms. Taking the Football data network as an example, it outperforms 4 algorithms: MFM, LFK, speClust1 and speClust2.
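
For reference, a Modularity score of the kind compared above can be computed from an edge list and a community assignment. The sketch below assumes the standard Newman modularity for an unweighted, undirected graph, Q = Σ_c [L_c/m − (d_c/(2m))²], where m is the number of edges, L_c the number of edges inside community c and d_c the total degree of community c. The text does not confirm that this exact variant was used, and the function name and toy graph are hypothetical.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once
    community: dict mapping node -> community label
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = {}       # L_c: edges with both endpoints in community c
    degree_sum = {}  # d_c: total degree of the nodes in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(
        intra.get(c, 0) / m - (deg / (2 * m)) ** 2
        for c, deg in degree_sum.items()
    )


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
```

Splitting the toy graph at the bridge gives a positive score, while placing every node in one community gives zero, which matches the intuition that Modularity rewards partitions with dense interiors and few external edges.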

Table 7 Results of the LDL algorithm and the conventional algorithms on the Error index (real datasets)

Networks	LinkLPA	MFM	LDL	NMF	LRLFP	LFK	speClust1	speClust2
Karate	0.75	0.5	0.25	0.5	2.5	0.25	0.25	0.75
Dolphins	0.8	0.4	0.4	0.5	3.2	0.4	0.6	0.8
Lemis	0.8333	0.6667	0.1	0.5	0.4	0.8333	0.6667	0.8333
Public book	1.3333	0.667	0.3333	0.5543	0.62	0.3333	0.6667	0.6667
Football	0.833	0.3333	0.5	0.75	0.25	0.9167	0.9167	0.9167
Celegansnertal	0.5	0.8333	0.3333	0.6667	0.7	0.8333	0.8333	0.8333
Email	0.3333	0.5	0.4286	0.7	0.75	0.5234	0.9762	0.9762
Public blogs	0.17	0.138	0.1345	0.1375	0.1375	0.417	0.5	0.5
Netscience	0.0024	0.1935	0.0123	0.032	0.4975	0.0024	0.9877	0.9975
Power	0.8611	0.7574	0.5832	0.6239	0.8621	0.8265	0.980	0.7281
Hep_th	0.4278	0.5439	0.3738	0.7421	0.6384	0.4839	0.7846	0.8971

As shown in Table 7, the LDL algorithm provided by the invention has a more pronounced advantage on the Error index: it is optimal in ten data networks (Karate, Dolphins, Lemis, Public book, Celegansnertal, Email, Public blogs, Netscience, Power and Hep_th) and slightly worse than the LRLFP algorithm in Football. The experimental data indicate that the LDL algorithm is more stable.

In summary, although the LDL algorithm proposed by the present invention does not perform optimally in every data network, the proportion of cases in which it is optimal or suboptimal is much higher than for the other algorithms. The LDL algorithm performs better in social networks with a higher average clustering coefficient and in more complex data networks, and is therefore better suited to the large scale and complexity of modern social networks.

Figs. 6(a) to 6(e) compare the community structures found by the LDL algorithm with the reference community structures in 5 real data networks: Karate, Lemis, Celegansnertal, Public blogs and Power grid. The abscissa is the node number and the ordinate is the community membership, i.e., the community to which the node belongs; blue is the reference community structure and red is the community structure produced by the LDL algorithm. The more similar the community structure produced by the algorithm is to the reference community structure, the higher the score.

Table 8 summarizes the performance on each index of the LDL algorithm and the conventional algorithms listed in Tables 4 to 7. The loss function constructed by the present invention is Y = log₁₀((X₁ + X₂ + X₃)/(X₄ + 1)), where X₁ represents the coefficient of variation on the Jaccard index; X₂ represents the coefficient of variation on the Rand index; X₃ represents the coefficient of variation on the Modularity index; X₄ represents the coefficient of variation on the Error index; and Y represents the constructed loss function.
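
The loss function can be evaluated with a few lines of code. The sketch below assumes, following the later remark that the coefficient of variation is scored as mean/standard deviation (higher is better), that each X is that ratio computed over an algorithm's per-network scores. The use of the population standard deviation, the function names and the toy scores are assumptions for illustration and are not taken from Table 8.

```python
import math
from statistics import mean, pstdev


def variation_score(values):
    """Mean divided by standard deviation (population form assumed)."""
    spread = pstdev(values)
    if spread == 0:
        raise ValueError("standard deviation is zero; the ratio is undefined")
    return mean(values) / spread


def loss_y(x1, x2, x3, x4):
    """Y = log10((X1 + X2 + X3) / (X4 + 1)), as stated in the text."""
    return math.log10((x1 + x2 + x3) / (x4 + 1))


# Hypothetical per-network scores of one algorithm on the four indices.
toy_scores = {
    "jaccard": [0.2, 0.4, 0.6],
    "rand": [0.7, 0.8, 0.9],
    "modularity": [0.4, 0.5, 0.6],
    "error": [0.25, 0.40, 0.10],
}
x1, x2, x3, x4 = (
    variation_score(toy_scores[name])
    for name in ("jaccard", "rand", "modularity", "error")
)
toy_y = loss_y(x1, x2, x3, x4)
```

Because X₄ appears in the denominator, a larger X₄ lowers Y, while larger X₁, X₂ and X₃ raise it; the added 1 keeps the denominator positive even when X₄ is 0.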

Table 8 Results of the LDL algorithm and the conventional algorithms on each index (real datasets)

As shown in Table 8, the performance of the LDL algorithm on each index is analyzed using the mean, the standard deviation, the coefficient of variation and the constructed loss function.

First, the proposed LDL algorithm has the highest score (bold values) on both the Jaccard and Rand indices. On the Jaccard index, the LDL algorithm has the highest mean across the data networks, but its standard deviation is higher than that of LinkLPA, which indicates that its performance varies more from network to network than that of the LinkLPA algorithm; nevertheless, its coefficient of variation score (mean/standard deviation, where higher is better) is the highest overall. On the Rand index, the LDL algorithm is superior to the other algorithms in both mean and standard deviation, and its coefficient of variation score is clearly higher, by close to 77 percent, than that of the next-best algorithm, NMF.

Second, on the Modularity index the score of the LDL algorithm is second only to that of the LinkLPA algorithm, because the performance of the LDL algorithm varies more across the data networks than that of the LinkLPA algorithm. On the Error index the LDL algorithm performs best, with an error rate of only 0.0548, far better than the other algorithms; the experimental data show that the algorithm provided by the invention is more robust.

Finally, the LDL algorithm also has the highest loss function score, approximately 7 percentage points higher than that of the best conventional method.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A social network community discovery method based on local distance and node rank optimization functions is characterized by comprising the following steps:

2. The social network community discovery method based on local distance and node rank optimization function according to claim 1, wherein in step S1, the Laplace normalization of the acquired network social node data set is calculated by:

wherein D represents the node degree matrix;

L represents the unnormalized Laplacian matrix;

A represents the adjacency matrix.

3. The social network community discovery method based on local distance and node rank optimization function according to claim 1, wherein in step S1, the element values of the Laplacian node matrix are calculated by:

wherein deg(v_i) represents the degree of node i;

deg(v_j) represents the degree of node j;

v_i represents node i;

v_j represents node j.

4. The social network community discovery method based on local distance and node rank optimization function according to claim 1, wherein in step S2, the internal distance of the social network is calculated by:

wherein L^sym represents the Laplacian node matrix;

represents the adjacency matrix of the node set V_k;

G represents the social network;

V_k represents a node set, k = 1, 2, 3, ..., K;

d^internal(G, V_k) represents the internal distance of the social network.

5. The social network community discovery method based on local distance and node rank optimization function according to claim 1, wherein in step S2, the external distance of the social network is calculated by:

wherein L^sym represents the Laplacian node matrix;

represents the adjacency matrix of V − V_k;

represents the adjacency matrix of the node set V_k;

V represents the node partition set, V = {V₁, V₂, V₃, ..., V_K};

G represents the social network;

V_k represents a node set, k = 1, 2, 3, ..., K;

d^external(G, V_k) represents the external distance of the social network.

6. The social network community discovery method based on local distance and node rank optimization function according to claim 1, wherein in step S2, the network social node value is calculated by:

wherein V_k represents a node set, k = 1, 2, 3, ..., K;

V represents the node partition set, V = {V₁, V₂, V₃, ..., V_K};

d^internal(G, V_k) represents the internal distance of the social network;

d^external(G, V_k) represents the external distance of the social network;

S_LDL(G, V) represents the network social node value.

7. The social network community discovery method based on local distance and node rank optimization function according to claim 4, wherein in step S2, the adjacency matrix of the node set V_k is calculated by:

wherein V_k represents a node set, k = 1, 2, 3, ..., K;

V represents the node partition set, V = {V₁, V₂, V₃, ..., V_K};

v_x represents node x, x = 1, 2, 3, ..., N;

v_y represents node y, y = 1, 2, 3, ..., N.

8. The social network community discovery method based on local distance and node rank optimization function according to claim 5, wherein in step S2, the adjacency matrix of the node set V − V_k is calculated by:

wherein V_k represents a node set, k = 1, 2, 3, ..., K;

V represents the node partition set, V = {V₁, V₂, V₃, ..., V_K};

v_x represents node x, x = 1, 2, 3, ..., N;

v_y represents node y, y = 1, 2, 3, ..., N.
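
The formulas referred to in claims 2 and 3 are not reproduced in the text above. As an illustration only, the sketch below assumes the standard symmetric normalized Laplacian, L^sym = D^(-1/2)(D − A)D^(-1/2), which is consistent with the symbols named in those claims (D, the unnormalized Laplacian, A, deg(v_i) and deg(v_j)). It is not a statement of the claimed formula: it ignores self-loops, so any self-transfer adjustment made by the method is not modeled, and the function name and toy matrix are hypothetical.

```python
import math


def normalized_laplacian(adjacency):
    """Symmetric normalized Laplacian of an undirected graph without self-loops.

    adjacency: square list of lists with (possibly weighted) entries A[i][j]
    Returns L_sym with entries
        1                                     if i == j and deg(v_i) > 0
        -A[i][j] / sqrt(deg(v_i) * deg(v_j))  if i != j and A[i][j] != 0
        0                                     otherwise
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = 1.0 if degree[i] > 0 else 0.0
            elif adjacency[i][j]:
                lap[i][j] = -adjacency[i][j] / math.sqrt(degree[i] * degree[j])
    return lap


# Hypothetical toy graph: a path 0 - 1 - 2.
toy_adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
toy_laplacian = normalized_laplacian(toy_adjacency)
```

With this normalization every connected node has a diagonal entry of 1, and each off-diagonal entry is scaled by the degrees of both endpoints, so high-degree nodes do not dominate the matrix.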