CN111738516B

CN111738516B - Social network community discovery system through local distance and node rank optimization function

Info

Publication number: CN111738516B
Application number: CN202010582081.6A
Authority: CN
Inventors: 刘小洋; 丁楠; 吴松阳
Original assignee: Chongqing University of Technology
Current assignee: Hefei Jiuzhou Longteng Scientific And Technological Achievement Transformation Co ltd; Nantong Baisong Data Technology Co.,Ltd.
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2021-08-10
Anticipated expiration: 2040-06-23
Also published as: CN111738516A

Abstract

The invention provides a social network community discovery system based on a local distance and node rank optimization function, which comprises a data acquisition module, a Laplacian node matrix calculation module, a network social node value calculation discovery module, a community optimization module and a display module, wherein the data acquisition module is used for acquiring a network social node data set. The data output end of the data acquisition module is connected with the data input end of the Laplacian node matrix calculation module; the data output end of the Laplacian node matrix calculation module is connected with the data input end of the network social node value calculation discovery module; the data output end of the network social node value calculation discovery module is connected with the data input end of the community optimization module; and the data output end of the community optimization module is connected with the display data end of the display module. First, the invention considers the problem of node self-transmission. Second, it comprehensively considers the edge weights and can effectively show the characteristic structure of the whole social network. Third, for the multi-scale optimization problem, its optimization function can effectively find the optimal community structure. Finally, it performs better than the methods it is compared with.

Description

Social network community discovery system through local distance and node rank optimization function

Technical Field

The invention relates to the technical field of social networks, in particular to a social network community discovery system based on local distance and node rank optimization functions.

Background

Over the last two decades, the Internet has accelerated globalization, data networks have become increasingly important in human society, and researchers have become increasingly interested in the study of complex networks. Complex networks take diverse forms, such as social networks, biological networks, economic networks and information networks, and are composed of communities that are relatively independent yet influence one another. The community structure is an important topological attribute of a complex network, so community discovery is of great significance in complex network analysis, data mining and related research. Exploiting this attribute, community discovery makes it possible to analyze complex networks better, extract useful information, and apply it in various fields, such as text analysis, personalized recommendation systems, user identification, epidemic propagation and behavior prediction.

Many articles have been published on social network community discovery. In a network, the nodes contained in each cluster must be related to each other, rather than to nodes outside the cluster, in order to form a community. Most researchers hold that communities are characterized by dense connections among the nodes within a community and sparse connections to nodes outside it. Since the pioneering work of Girvan and Newman, many algorithms for community detection in complex networks have been proposed; the most typical include modularity optimization, label propagation, greedy, random walk, spectral partitioning and fuzzy algorithms.

Disclosure of Invention

The invention aims to at least solve the technical problems in the prior art, and particularly provides a social network community discovery system based on a local distance and node rank optimization function.

In order to achieve the above object, the present invention provides a social network community discovery system through a local distance and node rank optimization function, which includes a data acquisition module, a laplacian node matrix computation module, a network social node value computation discovery module, a community optimization module, and a display module;

the data output end of the data acquisition module is connected with the data input end of the Laplace node matrix calculation module, the data output end of the Laplace node matrix calculation module is connected with the data input end of the network social node value calculation discovery module, the data output end of the network social node value calculation discovery module is connected with the data input end of the community optimization module, and the data output end of the community optimization module is connected with the display data end of the display module;

the data acquisition module is used for acquiring a network social node data set;

the Laplace node matrix calculation module is used for carrying out Laplace normalization processing on the network social node data set acquired in the data acquisition module; obtaining a Laplace node matrix;

the network social node value calculation discovery module is used for calculating to obtain a network social node value according to the internal distance and the external distance of the network social:

if the network social node value is larger than or equal to the preset network social node value, discovering a network social community;

if the network social node value is smaller than the preset network social node value, rediscovering the network social community;

the community optimization module is used for optimizing the social network communities found in the social network node value calculation and discovery module;

and the display module is used for displaying the social networking communities obtained by the community optimization module or/and the social networking node value calculation and discovery module.

In a preferred embodiment of the present invention, the Laplacian node matrix calculation module performs Laplacian normalization on the acquired social network nodes as follows:

$$L^{sym} = D^{-1/2} L D^{-1/2}, \qquad L = D - A$$

wherein D represents the node degree matrix;

L represents the unnormalized Laplacian matrix;

A denotes the adjacency matrix.

In a preferred embodiment of the present invention, the element values in the Laplacian node matrix calculation module are calculated as follows:

$$L^{sym}_{ij} = \begin{cases} 1, & i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\deg(v_j)}}, & i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0, & \text{otherwise} \end{cases}$$

wherein deg(v_i) represents the degree of node i;

deg(v_j) represents the degree of node j;

v_i represents node i;

v_j represents node j;

L^sym_ij represents the element value in row i, column j of the Laplacian node matrix.

In a preferred embodiment of the present invention, the method for calculating the internal distance of the social network in the network social node value calculation discovery module is:

wherein L^sym represents the Laplacian node matrix;

represents the adjacency matrix of the node set V_k;

G represents a social network;

V_k represents a node set; k = 1, 2, 3, …, K;

d^internal(G, V_k) represents the internal distance of the social network.

In a preferred embodiment of the present invention, the method for calculating the external distance of the social network in the network social node value calculation discovery module is:

wherein L^sym represents the Laplacian node matrix;

represents the adjacency matrix of V − V_k;

represents the adjacency matrix of the node set V_k;

V represents the node partition set; V = {V₁, V₂, V₃, …, V_K};

G represents a social network;

V_k represents a node set; k = 1, 2, 3, …, K;

d^external(G, V_k) represents the external distance of the social network.

In a preferred embodiment of the present invention, the method for calculating the network social node value in the network social node value calculation discovery module is:

wherein V_k represents a node set; k = 1, 2, 3, …, K;

V represents the node partition set; V = {V₁, V₂, V₃, …, V_K};

d^internal(G, V_k) represents the internal distance of the social network;

d^external(G, V_k) represents the external distance of the social network;

S_LDL(G, V) represents the network social node value.

In a preferred embodiment of the present invention, the adjacency matrix of the node set V_k in the network social node value calculation discovery module is calculated as follows:

wherein V_k represents a node set; k = 1, 2, 3, …, K;

V represents the node partition set; V = {V₁, V₂, V₃, …, V_K};

v_x represents node x; x = 1, 2, 3, …, N;

v_y represents node y; y = 1, 2, 3, …, N.

In a preferred embodiment of the present invention, the adjacency matrix of the node set V − V_k in the network social node value calculation discovery module is calculated as follows:

wherein V_k represents a node set; k = 1, 2, 3, …, K;

V represents the node partition set; V = {V₁, V₂, V₃, …, V_K};

v_x represents node x; x = 1, 2, 3, …, N;

v_y represents node y; y = 1, 2, 3, …, N.

In a preferred embodiment of the present invention, the method for optimizing the discovered social network communities in the community optimization module is as follows:

wherein V_k represents a node set, k = 1, 2, 3, …, K;

V represents the node partition set; V = {V₁, V₂, V₃, …, V_K};

v_i represents node i;

the conditional symbol indicates that, in the case of …, there is …;

V[v_i] represents the node set to which node i belongs;

v_j represents node j;

A_ij represents the element value in row i, column j of the adjacency matrix A;

if the condition holds, the node set V[v_i] is kept;

if not, the node set V[v_i] is discarded.

In a preferred embodiment of the present invention, the system further includes a performance metric module, and the performance metric in the performance metric module is calculated as follows:

wherein m represents the total number of edges connecting nodes; A_ij represents the element values of the adjacency matrix A; F_ij represents the proportion of any edge connecting the two nodes i and j;

wherein deg(v_i) represents the degree of node i; deg(v_j) represents the degree of node j; v_i represents node i; v_j represents node j;

and/or the system further includes a Jaccard coefficient module, and the Jaccard coefficient in the Jaccard coefficient module is calculated as follows:

$$J(V_M, V_0) = \frac{|V_M \cap V_0|}{|V_M \cup V_0|}$$

wherein V_M is the optimal community structure;

V₀ is the reference vector;

J(V_M, V₀) represents the Jaccard coefficient;

when V_M and V₀ are both empty, J(V_M, V₀) = 1;

and/or the system further includes an Error index module, and the Error index in the Error index module is calculated as follows:

wherein V_M′ is the structural feature of V_M;

V₀′ is the structural feature of V₀;

E(V_M′, V₀′) represents the Error index;

when V_M has the same community structure as V₀, E(V_M′, V₀′) is equal to 0;

and one of, or any combination of, the performance metric value, the Jaccard coefficient value and the Error index value is displayed on the display module.

In summary, due to the adoption of the technical scheme, firstly, the invention considers the problem of node self-transmission. Secondly, the method comprehensively considers the problem of the edge weight and can effectively show the characteristic structure of the whole social network. Thirdly, in terms of processing the multi-scale optimization problem, the optimization function of the invention can effectively find the optimal community structure. Finally, compared with other methods, the method has better performance.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic block diagram of the system of the present invention.

FIG. 2 is a schematic diagram of local distance community partitioning according to the present invention.

Fig. 3 is a schematic diagram comparing different algorithms of the present invention on an artificial network.

FIG. 4 is a schematic diagram illustrating an overview of the community discovery process of the present invention.

Fig. 5 is a schematic view of the visualization of the present invention on different networks.

FIG. 6 is a schematic diagram of community membership in a real data network according to the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

To date, some classical and effective local community discovery algorithms and MF algorithms have been proposed, and Liu et al propose a local community discovery framework based on node pair similarity, and a new local community discovery algorithm can be obtained by embedding a better node similarity measure.

Clauset et al. propose a measure R of local community structure, calculated as follows:

$$R = \frac{B_{in}}{B_{in} + B_{out}}$$

wherein B is a local community, R measures the structure of the local community, B_in represents the number of edges whose endpoints are both in the local community B, and B_out is the number of edges that have one endpoint in the local community B. The algorithm requires a predefined community size: it keeps adding the neighbor node that increases R the most to the current community until the current community reaches the predefined size.

Luo et al. propose another local community discovery algorithm M, calculated as follows:

$$M = \frac{E_{in}}{E_{out}}$$

wherein M represents the local community discovery algorithm, E_in represents the number of internal edges of the community, and E_out represents the number of edges between the community boundary and external nodes. The algorithm provides three heuristic node search methods to partially solve the community discovery problem in complex networks. However, different thresholds must be set for networks of different sizes.
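As a rough illustration of the two local measures described above, the following sketch counts internal and boundary edges for a candidate community and evaluates R and M in their commonly cited ratio forms, R = B_in / (B_in + B_out) and M = E_in / E_out. The function names and the toy edge list are hypothetical, and using the same boundary-edge count for both B_out and E_out is a simplifying assumption.

```python
def edge_counts(edges, community):
    """Return (internal, boundary) edge counts for an undirected edge list."""
    members = set(community)
    # Internal edges: both endpoints inside the community.
    internal = sum(1 for u, v in edges if u in members and v in members)
    # Boundary edges: exactly one endpoint inside the community.
    boundary = sum(1 for u, v in edges if (u in members) != (v in members))
    return internal, boundary


def local_r(edges, community):
    """R = B_in / (B_in + B_out), assumed ratio form."""
    internal, boundary = edge_counts(edges, community)
    total = internal + boundary
    return internal / total if total else 0.0


def local_m(edges, community):
    """M = E_in / E_out, assumed ratio form; infinite when no edge leaves."""
    internal, boundary = edge_counts(edges, community)
    return internal / boundary if boundary else float("inf")


# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
community = {0, 1, 2}
r_value = local_r(edges, community)
m_value = local_m(edges, community)
print(r_value, m_value)
```

For the first triangle there are three internal edges and one boundary edge, so both measures are high; adding a node that contributes only boundary edges would lower them.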

The two algorithms have several desirable advantages: they can detect clusters of any shape, do not need the number of clusters to be preset, and can display the selection of cluster centers through a decision graph. However, DPC still has drawbacks. First, the cutoff distance has a considerable impact on the clustering results. Furthermore, manual intervention is required to select suitable cluster centers.

Lancichinetti et al. propose a fitness function F_c to measure the density of nodes within a community. The fitness function is defined as follows:

$$F_c = \frac{k_{in}^{c}}{\left(k_{in}^{c} + k_{out}^{c}\right)^{\alpha}}$$

in the formula, k_in^c represents the sum of the internal degrees of community c, k_out^c denotes the sum of the external degrees of community c, α denotes a resolution parameter that controls the size of the detected communities, and F_c represents the density of nodes within the community. This quality function can effectively measure the closeness of nodes within a community, but cannot make full use of the local information between nodes.
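The fitness function of Lancichinetti et al. is commonly written as F_c = k_in / (k_in + k_out)^α, where k_in is the sum of internal degrees (twice the number of internal edges) and k_out is the number of edges leaving the community. The sketch below evaluates it under that assumption; the function name and the toy graph are hypothetical.

```python
def community_fitness(edges, community, alpha=1.0):
    """Evaluate F_c = k_in / (k_in + k_out) ** alpha for an undirected edge list."""
    members = set(community)
    k_in = 0   # sum of internal degrees: each internal edge counts for both endpoints
    k_out = 0  # sum of external degrees: each boundary edge counts once
    for u, v in edges:
        if u in members and v in members:
            k_in += 2
        elif (u in members) != (v in members):
            k_out += 1
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
fitness_default = community_fitness(edges, {0, 1, 2})             # alpha = 1
fitness_strict = community_fitness(edges, {0, 1, 2}, alpha=2.0)   # alpha = 2
print(fitness_default, fitness_strict)
```

Raising α penalizes the total degree more strongly, which is how the resolution parameter steers the detected community size.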

Xu et al. studied how to apply computational-intelligence genetic algorithms to directed and undirected community discovery and developed optimization algorithms through iteration. Wang et al. propose a method for discovering overlapping communities using a Bayesian MF model. The advantage of this approach is that the number of communities can be determined automatically and there is no resolution limit. However, its internal estimate of the number of communities may mislead the decomposition and return a wrong solution.

Guo et al. use local central nodes and Jaccard coefficients to detect the core members of a community as seeds in the network, thereby ensuring that the selected seeds are the central nodes of the community. Each time, the node with the greatest degree among the seeds is pre-expanded by the fitness function. The first k nodes that perform best in the pre-expansion are then expanded, using the internal force between nodes according to the fitness function, to obtain high-quality communities in the network.

Chen et al. propose a novel community discovery method that separates overlapping communities from the network using a non-negative matrix factorization (NMF) model, and solves the problem of an unknown number of communities through feature matrix preprocessing and ranking optimization, thereby enabling the algorithm to partition network structures whose number of communities is unknown. Hu et al. propose an improved Lagrangian alternating direction algorithm for symmetric non-negative matrix factorization.

Recently, Li et al. proposed a method for community partitioning based on semi-supervised matrix factorization and random walk. The transition probabilities between nodes are calculated from the network topology, the final walk probabilities are obtained using a random walk model, and a feature matrix is constructed.

Wu et al. propose a novel framework called mixed hypergraph regularized non-negative matrix factorization (MHGNMF), which takes higher-order information between nodes into account to improve clustering performance. The hypergraph regularization term forces the nodes in the same hyperedge to be projected into the same latent subspace, yielding a more discriminative representation. In the proposed framework, topological connectivity information and structural similarity information are exploited by mixing two kinds of neighbors of each centroid to generate a set of hyperedges.

The local community discovery algorithms above all use the topological properties of the network and all assume by default that every edge has the same weight. In a real data network, however, the connection strength between entities differs, and because node bias is not considered, the weights are easily estimated incorrectly. To the best of the inventors' knowledge, there is no other community discovery work that combines local distance with a method based on Laplacian matrix decomposition.

To better describe the proposed model, the invention will use the following mathematical definition:

Definition 1: The network G = (V, E) is composed of a node partition set V and an edge set E. The nodes contained in the node partition set V are labeled v₁, v₂, v₃, …, v_N, where v_p represents node p, p = 1, 2, 3, …, N, and N represents the total number of nodes in the node partition set. The edges contained in the edge set E indicate which nodes are connected. Thus, if an edge pair ∪_{x≠y}(v_x, v_y) is in the edge set E, with x = 1, 2, 3, …, N and y = 1, 2, 3, …, N, then v_x is connected to v_y. The invention processes only undirected graphs, so the edge pair ∪_{x≠y}(v_x, v_y) is equal to the edge pair ∪_{y≠x}(v_y, v_x), where ∪_{y≠x}(v_y, v_x) denotes that v_y is connected to v_x, i.e. v_y and v_x are connected to each other. Here ∪_ζ represents a condition ζ; that is, ∪_{x≠y} indicates the condition x ≠ y, and ∪_{y≠x} indicates the condition y ≠ x.

Definition 2: Each network G has an adjacency matrix A. If the network G has N nodes, the adjacency matrix A is an N × N matrix whose entries are 0 or 1. A_pq = 1 if and only if the edge pair ∪_{p≠q}(v_p, v_q) ∈ E, with p = 1, 2, 3, …, N and q = 1, 2, 3, …, N; i.e. v_p and v_q are connected. From Definition 1, ∪_{p≠q}(v_p, v_q) = ∪_{q≠p}(v_q, v_p), so the adjacency matrix A is a symmetric matrix.

Definition 3: Community discovery of the network G = (V, E) divides the node partition set V into node sets V₁, V₂, V₃, …, V_K such that V₁ ∪ V₂ ∪ V₃ ∪ … ∪ V_K = V and none of V₁, V₂, V₃, …, V_K is empty. The node sets V₁, V₂, V₃, …, V_K are the community structure. The invention defines a partition as V = {V₁, V₂, V₃, …, V_K}. The number of partitions is K = <V>, where <V> represents the number of node sets in the node partition set V.
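Definitions 1 to 3 can be made concrete with a short sketch that builds the symmetric 0/1 adjacency matrix of an undirected graph and checks the two conditions Definition 3 places on a partition. The function names and the sample graph are hypothetical, and nodes are indexed from 0 for convenience.

```python
import numpy as np


def adjacency_matrix(num_nodes, edges):
    """Build the symmetric 0/1 adjacency matrix of Definition 2."""
    a = np.zeros((num_nodes, num_nodes), dtype=int)
    for x, y in edges:
        if x == y:
            continue  # Definition 1 only considers edge pairs with x != y
        a[x, y] = 1
        a[y, x] = 1  # undirected graph: (v_x, v_y) equals (v_y, v_x)
    return a


def is_valid_partition(num_nodes, node_sets):
    """Definition 3: every node set is non-empty and their union is V."""
    if any(len(node_set) == 0 for node_set in node_sets):
        return False
    return set().union(*node_sets) == set(range(num_nodes))


a = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3), (3, 3)])
valid = is_valid_partition(4, [{0, 1}, {2, 3}])
incomplete = is_valid_partition(4, [{0, 1}, {2}])
print(a)
print(valid, incomplete)
```

Note that Definition 3 as stated only requires coverage and non-emptiness, so the check does not reject overlapping node sets.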

Definition 4: Given a network G = (V, E) and a node partition set V = {V₁, V₂, V₃, …, V_K}, the edges of the network G can be divided into edge sets E_mn, with E_mn ⊆ E, where an edge pair belongs to E_mn if and only if one of its nodes is in V_m, the other is in V_n, and the edge pair exists in E.

Definition 5: In particular, an internal edge set E^internal and an external edge set E^external are defined for each node set V_k. The internal edge set contains the edges inside the node set V_k: the two nodes of any edge pair in the internal edge set belong to the same community. The external edge set contains the edges leaving V_k: one node of any edge pair in the external edge set is in the node set V_k, and the other node does not belong to V_k but to the node set V − V_k.

The invention provides a social network community discovery system based on a local distance and node rank optimization function, which comprises a data acquisition module, a Laplacian node matrix calculation module, a network social node value calculation discovery module, a community optimization module and a display module, wherein the data acquisition module is used for acquiring a network social node data set.

As shown in FIG. 2, the entire network G is divided into 5 partitions, i.e. V = {V₁, V₂, V₃, V₄, V₅}, and the internal edge set and the external edge set of the partition V₁ are briefly indicated.

Community discovery finds a node partition set V = {V₁, V₂, V₃, …, V_K} of a network G = (V, E); the nodes contained in each cluster must be related to each other, rather than to nodes outside the cluster, in order to form a community. First, to address the problem of node information self-transmission, the invention takes into account the influence of a node on itself, introduces a self-degree matrix, and constructs the following model using the principle of Laplacian matrix decomposition:

$$L^{sym} = D^{-1/2} L D^{-1/2} = I_n - D^{-1/2} A D^{-1/2}$$

wherein D represents the node degree matrix;

L = D − A represents the unnormalized Laplacian matrix; A represents the adjacency matrix; I_n represents the identity matrix of order n; L^sym is the Laplacian node matrix.

To extract the network features completely, the edge weights are fully taken into account. L^sym is obtained by normalizing the adjacency matrix: both sides of the adjacency matrix are multiplied by the inverse square root of the node degrees. For a single node, normalization means dividing by the degree of that node, so that the information transfer value of each adjacent edge is normalized. A node with 10 edges therefore does not gain more influence than a node with 1 edge, because after normalization each of its edges has a weight of only 0.1. Lifting this operation from a single node to a two-dimensional matrix means multiplying by the inverse of the degree matrix, since multiplying by a matrix inverse plays the role of division. Here, however, the left and right sides are multiplied by the inverse square roots of the degrees of nodes i and j respectively, i.e. the degrees of the two endpoints of an edge. For each node pair v_i, v_j, the elements of the matrix are given by the following equation:

$$L^{sym}_{ij} = \begin{cases} 1, & i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\deg(v_j)}}, & i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0, & \text{otherwise} \end{cases}$$

wherein deg(v_i) represents the degree of node i and deg(v_j) represents the degree of node j, i.e. the values of the degree matrix at nodes i and j; v_i represents node i; v_j represents node j.
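The degree normalization described above can be sketched with NumPy. The code below assumes the standard symmetric normalized Laplacian, L^sym = D^(-1/2) (D − A) D^(-1/2), on a graph without self-loops; the function name and the three-node path graph are hypothetical, and the self-loop handling mentioned in the text is deliberately left out.

```python
import numpy as np


def normalized_laplacian(adjacency):
    """Return D^(-1/2) (D - A) D^(-1/2) for a symmetric adjacency matrix."""
    a = np.asarray(adjacency, dtype=float)
    degrees = a.sum(axis=1)
    # Inverse square roots of the degrees; isolated nodes keep a factor of 0.
    inv_sqrt = np.zeros_like(degrees)
    nonzero = degrees > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degrees[nonzero])
    laplacian = np.diag(degrees) - a  # unnormalized Laplacian L = D - A
    # Scale row i by deg(v_i)^(-1/2) and column j by deg(v_j)^(-1/2).
    return inv_sqrt[:, None] * laplacian * inv_sqrt[None, :]


# Path graph v0 - v1 - v2 with degrees 1, 2, 1.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
l_sym = normalized_laplacian(path)
print(l_sym)
```

Each off-diagonal entry for a connected pair equals −1 / sqrt(deg(v_i) deg(v_j)), so the edge between the degree-1 and degree-2 nodes is weighted by −1/√2 rather than −1.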

The internal and external distances are given by:

wherein L^sym represents the Laplacian node matrix;

represents the adjacency matrix of V_k;

represents the adjacency matrix of V − V_k; d^internal(G, V_k) represents the internal distance of the social network; d^external(G, V_k) represents the external distance of the social network; d^external(G, V_k) can be written as d^e(G, V_k) or d^e; d^internal(G, V_k) can be written as d^i(G, V_k) or d^i.

In equation (8), when node x and node y both belong to the node set V_k (a node set is also called a community), the element in row x, column y of the adjacency matrix of V_k is 1; when node x belongs to the node set V_k and node y belongs to the node set V − V_k, that element is 0. Similarly, in equation (9), when node x and node y both belong to the node set V_k, the element in row x, column y of the adjacency matrix of V − V_k is 0; when node x belongs to the node set V_k and node y belongs to the node set V − V_k, that element is 1.
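The two membership rules just described can be written as a pair of 0/1 matrices. The sketch below follows the stated cases only; the function name is hypothetical, and setting every entry whose row node lies outside V_k to 0 is an assumption, since the text does not cover that case.

```python
import numpy as np


def membership_matrices(num_nodes, node_set):
    """Return the 0/1 matrices for V_k and V - V_k described in the text."""
    inside = np.zeros(num_nodes, dtype=int)
    inside[sorted(node_set)] = 1   # indicator of membership in V_k
    outside = 1 - inside           # indicator of membership in V - V_k
    # Entry (x, y) is 1 when v_x and v_y are both in V_k.
    internal = np.outer(inside, inside)
    # Entry (x, y) is 1 when v_x is in V_k and v_y is in V - V_k.
    external = np.outer(inside, outside)
    return internal, external


internal, external = membership_matrices(4, {0, 1})
print(internal)
print(external)
```

With V_k = {v0, v1} in a four-node network, the internal matrix marks the pairs inside the community and the external matrix marks the pairs that cross from the community to the rest of the network.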

For all v_x ∈ V, A_xx = 1 (i.e. each node has a self-loop). All edges except the self-loops are counted twice. d^internal(G, V_k) takes values in [0, 1]. When the network G is a union of mutually disconnected communities, it is a perfect community structure graph. d^external(G, V_k) also takes values in [0, 1], and its value is 0 for a perfect community structure graph.

The local distance Laplacian (LDL) network social node value function is as follows:

wherein V_k represents a node set; V represents the node partition set; d^internal(G, V_k) represents the internal distance of the social network; d^external(G, V_k) represents the external distance of the social network; S_LDL(G, V) represents the network social node value.

One point to emphasize for the LDL model is that the weight of each local partition (local internal distance plus local external distance) is |V_k| / (2|V|). This prevents smaller communities from having a disproportionate impact on the overall community score.
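The weighting can be illustrated with a small sketch. It assumes that the score is the weighted sum over partitions of (internal distance + external distance) with weight |V_k| / (2|V|), as the paragraph above suggests, that |V| is the total number of nodes, and that the distances are supplied as precomputed inputs. The function name and the sample values are hypothetical.

```python
def ldl_score(node_sets, internal_distances, external_distances):
    """Weighted sum of (d_internal + d_external) with weight |V_k| / (2 |V|)."""
    # Assumption: the node sets are disjoint, so |V| is the sum of their sizes.
    total_nodes = sum(len(node_set) for node_set in node_sets)
    score = 0.0
    for node_set, d_in, d_ex in zip(node_sets, internal_distances, external_distances):
        weight = len(node_set) / (2 * total_nodes)
        score += weight * (d_in + d_ex)
    return score


# Hypothetical partition: one community of three nodes and one singleton.
node_sets = [{0, 1, 2}, {3}]
score = ldl_score(node_sets, [0.8, 0.4], [0.2, 0.6])
print(score)
```

The three-node community contributes with weight 3/8 and the singleton with weight 1/8, so a small community cannot dominate the total.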

3.3 Node Rank Optimization Function

The community discovery algorithm proposed by the invention may generate more than one possible community discovery result, in which case a community discovery optimization is required. The optimal community selection method provided by the invention is based on the idea of community discovery validity, namely that a community should have more edges inside it than leaving it. The weak criterion (WRC) and the strong criterion (SRC) were first proposed by Radicchi et al., but the WRC is too weak and cannot discriminate between nodes: even if a node u is completely disconnected from V_k, u may still be added to V_k and the WRC is still satisfied. This can cause many discovered communities to be invalid. Therefore, the invention provides a node rank optimization (NRO) function, which is as follows:

wherein,

that is,

Φ represents the node set to which node j belongs; A_ij represents the element value in row i, column j of the adjacency matrix A; V[i] represents the node set to which node i belongs; the conditional symbol indicates that, in the case of …, there is ….

wherein v_i represents node i and v_y represents node y.

Thus, the NRO function is expressed as follows:

wherein V_k represents a node set, and V represents the node partition set; v_i represents node i; the conditional symbol indicates that, in the case of …, there is …; V[v_i] represents the node set to which node i belongs; v_j represents node j; A_ij represents the element value in row i, column j of the adjacency matrix A.

If the condition holds, the node set V[v_i] is kept;

if not, the node set V[v_i] is discarded.

Because two coordinating terms, V[i] and V − V[i], are used, the optimization is stronger than the WRC and weaker than the SRC, which achieves a better optimization effect.
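For reference, the two criteria of Radicchi et al. that the NRO function is positioned between can be sketched as follows: the strong criterion requires every node of a community to have more internal than external neighbors, while the weak criterion only compares the sums. This is not the NRO function itself, and the function names and toy graph are hypothetical.

```python
def internal_external_degrees(edges, community):
    """Return per-node internal and external degrees for an undirected edge list."""
    members = set(community)
    k_in = {node: 0 for node in members}
    k_out = {node: 0 for node in members}
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if a in members:
                if b in members:
                    k_in[a] += 1
                else:
                    k_out[a] += 1
    return k_in, k_out


def satisfies_strong(edges, community):
    """Strong criterion: k_in > k_out for every node of the community."""
    k_in, k_out = internal_external_degrees(edges, community)
    return all(k_in[node] > k_out[node] for node in k_in)


def satisfies_weak(edges, community):
    """Weak criterion: the sum of k_in exceeds the sum of k_out."""
    k_in, k_out = internal_external_degrees(edges, community)
    return sum(k_in.values()) > sum(k_out.values())


# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
weak_with_stranger = satisfies_weak(edges, {0, 1, 2, 5})
strong_with_stranger = satisfies_strong(edges, {0, 1, 2, 5})
print(weak_with_stranger, strong_with_stranger)
```

Node 5 has no neighbor among nodes 0, 1 and 2, yet adding it to their community still passes the weak criterion, which is the weakness the text points out; the strong criterion rejects it.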

The main flow of the algorithm provided by the invention is as follows:

4 Results and analysis of the experiments

To evaluate the algorithm proposed by the invention, eleven real data networks and artificial network datasets are used. The data sources are http://www-personal.umich.edu/mejn/Netdata/ and http://snap.stanford.edu/data/. The hardware and software environment of the experiments was as follows: Intel(R) Core(TM) i5-4160M CPU at 3.60 GHz, 4 GB memory, Windows 10, MATLAB R2019a.

4.1 Evaluation index

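In the present invention, Q is used as the performance metric in experiments to evaluate performance on networks that have no ground-truth community structure. The performance metric Q is:

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - F_{ij} \right) \delta_{ij}(c_i, c_j)$$

wherein

$$F_{ij} = \frac{\deg(v_i)\,\deg(v_j)}{2m}$$

m represents the total number of edges; A_ij represents the element values of the adjacency matrix A; deg(v_i) represents the degree of node i; deg(v_j) represents the degree of node j; v_i represents node i; v_j represents node j.

δ_ij(c_i, c_j) is expressed as follows:

$$\delta_{ij}(c_i, c_j) = \begin{cases} 1, & c_i = c_j \\ 0, & \text{otherwise} \end{cases}$$

wherein c_i is the community to which vertex i is assigned, and c_j is the community to which vertex j is assigned.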
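A minimal sketch of the performance metric, assuming Q is the standard Newman modularity, Q = (1/2m) Σ (A_ij − deg(v_i) deg(v_j) / 2m) δ(c_i, c_j), on an unweighted undirected graph without self-loops. The function name and the toy graph are hypothetical.

```python
import numpy as np


def modularity(adjacency, labels):
    """Compute modularity Q for a symmetric adjacency matrix and community labels."""
    a = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = a.sum(axis=1)
    two_m = degrees.sum()                          # equals 2m for an undirected graph
    expected = np.outer(degrees, degrees) / two_m  # deg(v_i) * deg(v_j) / 2m
    same = labels[:, None] == labels[None, :]      # delta(c_i, c_j)
    return float(((a - expected) * same).sum() / two_m)


# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

q_two = modularity(adjacency, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_one = modularity(adjacency, [0] * 6)             # everything in one community
print(q_two, q_one)
```

Splitting the graph along the bridge gives a clearly positive Q, while placing all nodes in a single community gives Q = 0, which is why Q can rank candidate partitions without a reference labeling.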

The Jaccard similarity coefficient (JSC) is used to compare the similarity and difference between finite sample sets.

Given two sets V_M and V₀, the Jaccard coefficient is defined as the ratio of the size of the intersection of V_M and V₀ to the size of their union. A larger value indicates a higher degree of similarity.

wherein V_M is the optimal community structure and V₀ is the reference vector; when V_M and V₀ are both empty, J(V_M, V₀) = 1.
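The definition above (intersection size over union size, with the value 1 when both sets are empty) can be sketched as follows. The function name `jaccard` and the sample sets are illustrative assumptions; how V_M and V₀ are encoded as sets in the experiments is not stated in this text.

```python
def jaccard(v_m, v_0):
    """Jaccard coefficient J(V_M, V_0) = |V_M ∩ V_0| / |V_M ∪ V_0|."""
    a, b = set(v_m), set(v_0)
    if not a and not b:
        return 1.0          # convention stated in the text for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical example: two of the four distinct elements are shared.
print(jaccard({1, 2, 3}, {2, 3, 4}))   # 0.5
```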

A larger Rand index (RI) value means that the community discovery result is more consistent with the real situation: a larger RI indicates higher clustering accuracy and higher purity within each class.
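No formula for RI is given in this text. As a minimal sketch, assuming the standard pair-counting Rand index (the fraction of node pairs on which two partitions agree about "same community" versus "different community"), RI could be computed as below. The function name `rand_index` and the label vectors are illustrative assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Standard Rand index between two label vectors of equal length."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0          # fewer than two nodes: nothing to disagree on
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical example: the same partition with renamed labels scores 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0
```

Because only pair agreement is counted, the index does not depend on how the communities are numbered.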

Error index: when V has the same community structure as V₀, E(V_M′, V₀′) is equal to 0. It is defined as follows:

wherein V_M′ is the structural feature of V and V₀′ is the structural feature of V₀.

4.2 Artificial network performance comparison

The algorithm of the invention is run on an artificial data network (the GN benchmark network). Each node is connected to other nodes in the same community through the internal edge set E^internal and to nodes in other communities through the external edge set E^external. As the external edge set E^external grows, the community structure becomes less clear and the community discovery task becomes more challenging.
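To illustrate how the external edge set blurs the community structure, the following is a minimal sketch of a planted-partition generator in the spirit of the GN benchmark. It is not the generator used in the experiments: the function name `planted_partition`, the parameters `p_internal` and `p_external`, and the layout of 4 communities of 32 nodes (the classic GN layout) are assumptions, and the parameters of Table 1 are not reproduced here.

```python
import random


def planted_partition(num_communities, community_size,
                      p_internal, p_external, seed=0):
    """Return (labels, internal_edges, external_edges) for a random graph.

    Each same-community pair is linked with probability p_internal and
    each cross-community pair with probability p_external (assumed model).
    """
    rng = random.Random(seed)
    n = num_communities * community_size
    labels = [v // community_size for v in range(n)]
    internal, external = set(), set()
    for u in range(n):
        for v in range(u + 1, n):
            same = labels[u] == labels[v]
            p = p_internal if same else p_external
            if rng.random() < p:
                (internal if same else external).add((u, v))
    return labels, internal, external


# Raising p_external adds cross-community edges and weakens the structure.
for p_ext in (0.0, 0.05, 0.2):
    _, e_in, e_out = planted_partition(4, 32, 0.3, p_ext)
    print(p_ext, len(e_in), len(e_out))
```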

TABLE 1 Artificial network parameters

Fig. 3 shows the performance comparison of 8 algorithms on the artificial data network. The proposed LDL algorithm was analyzed experimentally on various data sets of the artificial network and the real networks and compared with seven conventional algorithms: LinkLPA, MFM, LFK, NMF, LRLFP, speClust1 and speClust2.

Fig. 3(a) shows the performance of the 8 algorithms on the Jaccard coefficient evaluation standard. As expected, the larger the external-edge probability, the lower the Jaccard coefficient. When the external-edge probability is less than 0.4, the LDL algorithm provided by the invention has a clear advantage; above 0.4, its Jaccard coefficient is slightly lower than that of some other algorithms, but always higher than that of the LinkLPA algorithm.

Fig. 3(b) shows the performance of the algorithms on the Rand index evaluation standard. The overall trend of each algorithm is similar to that of Fig. 3(a): the Rand index gradually decreases as the external-edge probability increases. The LDL algorithm provided by the present invention and the LinkLPA algorithm have significant advantages over the other algorithms, and when the external-edge probability is less than 0.4, the LDL algorithm is better than the LinkLPA algorithm.

Fig. 3(c) shows that the algorithms do not differ much on the Modularity evaluation standard, but the LDL algorithm remains dominant throughout.

The Error values of the algorithms in Fig. 3(d) differ significantly. The Error value of the LDL algorithm is the lowest when the external-edge probability is less than 0.8, and the LDL algorithm is only 3% worse than the MFM algorithm when the probability is greater than 0.8. In conclusion, the LDL algorithm proposed by the present invention is better and more stable than the other 7 algorithms.

4.3 Real network performance comparison

To further evaluate the LDL algorithm proposed by the present invention, eleven representative social networks of different sizes were selected. In Table 2, Networks denotes the real data network, Nodes the number of nodes, Edges the number of edges, a-co the average clustering coefficient of the nodes, a-Lenth the average path length, and Description the practical meaning of the network. As shown in Table 2:

TABLE 2 Real networks

To better illustrate the overall social network community discovery process, Figs. 4(a) to 4(i) give a brief overview of the process, taking the power grid network as an example. There are 9 subgraphs in total, that is, 9 communities are finally formed.

Fig. 4(a) shows the first community structure that is divided (green cut set); the second community structure (purple cut set) is then divided, as shown in Fig. 4(b); next, the third community structure is divided, as shown in Fig. 4(c); and so on, until the ninth community is divided and the convergence criterion is reached, that is, every node is contained in some community, as shown in Fig. 4(i).

After division, the social networks have clear community structures. Figs. 5(a) to 5(d) are the visualization results of community discovery by the proposed LDL algorithm in 4 social networks: Dolphins, Lemis, Celegansnertal and Netscience. The LDL algorithm achieves high recognition quality in large-scale data networks (see Table 2). The higher the degree and the average clustering coefficient of a node, the more prominently it is displayed and the more easily it becomes a community center around which a community structure forms.

Table 4 compares the proposed LDL algorithm with the conventional algorithms on the Jaccard index for the real datasets. Bolded values in the table mark the best-performing algorithm, and gray-shaded values mark the second-best.

TABLE 4 Results of the LDL algorithm and the conventional algorithms on the Jaccard index (real datasets)

Network	LinkLPA	MFM	LDL	NMF	LRLFP	LFK	speClust1	speClust2
Karate	0.5	0.7375	0.6507	0.5882	0.325	0.6052	0.5593	0.2852
Dolphins	0.1035	0.1918	0.2131	0.1877	0.0541	0.2118	0.2161	0.2136
Lemis	0.4112	0.4793	0.6524	0.2844	0.2410	0.6276	0.4159	0.1972
Public book	0.3403	0.6671	0.6440	0.6512	0.0551	0.3951	0.6749	0.6951
Football	0.7147	0.6357	0.4052	0.8413	0.6920	0.0798	0.0798	0.07798
Celegansnertal	0.3445	0.2151	0.4804	0.343	0.0681	0.3551	0.2150	0.2151
Email	0.2599	0.0460	0.2085	0.1912	0.1251	0.0462	0.0467	0.0467
Public blogs	0.3112	0.5167	0.5690	0.5426	0.0162	0.4027	0.4120	0.4998
Netscience	0.2186	0.1332	0.2213	0.1780	0.0841	0.1464	0.0239	0.0100
Power	0.1603	0.0168	0.2240	0.2092	0.0048	0.0023	0.1371	0.0285
Hep_th	0.1524	0.1912	0.2203	0.1036	0.2015	0.1242	0.2003	0.0972

As shown in Table 4, the LDL algorithm provided by the invention performs best on the Lemis, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, where it is superior to the other 7 algorithms. On the Karate, Dolphins and Email data networks the LDL algorithm is second-best, behind the MFM, speClust1 and LinkLPA algorithms respectively, but ahead of the other 6 algorithms. On the Public book and Football data networks the LDL algorithm performs moderately, better than some of the other algorithms.

TABLE 5 Results of the LDL algorithm and the conventional algorithms on the Rand index (real datasets)

Network	LinkLPA	MFM	LDL	NMF	LRLFP	LFK	speClust1	speClust2
Karate	0.8503	0.9251	0.8574	0.9037	0.80214	0.8396	0.85	0.2852
Dolphins	0.7536	0.2523	0.7612	0.7536	0.7721	0.2523	0.2517	0.2136
Lemis	0.8835	0.8558	0.8914	0.73	0.8445	0.8168	0.4686	0.1972
Public book	0.7304	0.8419	0.7132	0.8377	0.639	0.3951	0.8432	0.395
Football	0.978	0.9682	0.8874	0.76	0.95	0.0781	0.08	0.0798
Celegansnertal	0.8281	0.2151	0.8468	0.7717	0.795	0.2151	0.2151	0.2151
Email	0.9107	0.0866	0.9233	0.9131	0.95	0.08	0.045	0.045
Public blogs	0.6779	0.7542	0.7799	0.7545	0.5085	0.5007	0.4998	0.4998
Netscience	0.9872	0.9879	0.9935	0.9881	0.99	0.9903	0.7792	0.01
Power	0.964	0.9599	0.976	0.9656	0.9668	0.8995	0.9394	0.9201
Hep_th	0.8674	0.9015	0.9395	0.9041	0.9192	0.803	0.9077	0.9234

As shown in Table 5, the LDL algorithm provided by the invention performs best on the Lemis, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, where it is superior to the other 7 algorithms. On the Dolphins and Email data networks the LDL algorithm is second-best, behind the LRLFP algorithm but ahead of the other 6 algorithms. On the Karate, Public book and Football data networks the LDL algorithm performs moderately, slightly better than some of the other algorithms. For example, on the Karate data network it outperforms 5 algorithms: LFK, speClust1, LinkLPA, LRLFP and speClust2.

TABLE 6 Results of the LDL algorithm and the conventional algorithms on the Modularity index (real datasets)

Network	LinkLPA	MFM	LDL	NMF	LRLFP	LFK	speClust1	speClust2
Karate	0.4427	0.4477	0.4347	0.4459	0.3663	0.4343	0.4116	0.1545
Dolphins	0.46	0.0108	0.4709	0.4486	0.4022	0.01080	0.0054	0.1299
Lemis	0.5882	0.5768	0.5772	0.4849	0.5298	0.5632	0.1088	0.2034
Public book	0.5531	0.5091	0.5196	0.5182	0.4117	0.4065	0.4595	0.4209
Football	0.6189	0.5423	0.6092	0.6236	0.6171	0.1075	0.5933	0.5753
Celegansnertal	0.433	0.4378	0.4521	0.3761	0.1722	0.2035	0.0092	0.3874
Email	0.6178	0.0381	0.5008	0.6547	0.6547	0.3292	0.1002	0.3048
Public blogs	0.3007	0.3431	0.3967	0.367	0.1864	0.1155	0.0087	0.2133
Netscience	0.8085	0.8118	0.872	0.8238	0.8238	0.8011	0.2062	0.73
Power	0.5826	0.531	0.6438	0.6241	0.5471	0.5289	0.6207	0.5227
Hep_th	0.6021	0.6754	0.7181	0.501	0.6503	0.685	0.6821	0.5431

As shown in Table 6, the LDL algorithm of the present invention performs best on the Dolphins, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, where it is superior to the other 7 algorithms. On the Lemis and Public book data networks the LDL algorithm is second-best, behind the LinkLPA algorithm but ahead of the other 6 algorithms. On the Karate, Football and Email data networks the LDL algorithm performs moderately, slightly better than some of the other algorithms. Taking the Football data network as an example, it outperforms 4 algorithms: MFM, LFK, speClust1 and speClust2.

TABLE 7 Results of the LDL algorithm and the conventional algorithms on the Error index (real datasets)

As shown in Table 7, the advantage of the LDL algorithm provided by the invention is more pronounced on the Error index. It is the best on ten data networks (Karate, Dolphins, Lemis, Public book, Celegansnertal, Email, Public blogs, Netscience, Power and Hep_th) and is only slightly worse than the LRLFP algorithm on Football. The experimental data thus show that the algorithm has strong stability.

In summary, although the LDL algorithm proposed by the present invention does not perform best on every data network, the proportion of networks on which it ranks best or second-best is much higher than for the other algorithms. The LDL algorithm performs better on social networks with a higher average clustering coefficient and a more complex structure, and is therefore well suited to the large scale and complexity of modern social networks.

Figs. 6(a) to 6(e) compare the community structures found by the LDL algorithm with the reference structures in 5 real data networks: Karate, Lemis, Celegansnertal, Public blogs and Power grid. The abscissa is the node number and the ordinate is the community membership, that is, the community to which the node belongs. Blue is the reference community structure and red is the community structure found by the LDL algorithm. The more similar the community structure produced by the algorithm is to the reference community structure, the higher the score.

Table 8 summarizes the performance on each index of the LDL algorithm and the conventional algorithms listed in Tables 4 to 7, and applies the loss function constructed by the invention, Y = log₁₀((X₁ + X₂ + X₃)/(X₄ + 1)). X₁ represents the coefficient of variation on the Jaccard index; X₂ represents the coefficient of variation on the Rand index; X₃ represents the coefficient of variation on the Modularity index; X₄ represents the coefficient of variation on the Error index; Y represents the constructed loss function.
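The loss function can be sketched as follows. The helper names `variation_score` and `loss` are illustrative assumptions. The score is computed as mean divided by standard deviation, which is how the discussion of Table 8 describes the coefficient of variation (higher is better); whether the sample or the population standard deviation was used is not stated, so the sample form is assumed. The X₂, X₃ and X₄ values passed at the end are placeholders, since the corresponding statistics of Table 8 are not reproduced in this text.

```python
import math
from statistics import mean, stdev


def variation_score(values):
    """Mean / sample standard deviation (assumed form; higher is better)."""
    return mean(values) / stdev(values)


def loss(x1, x2, x3, x4):
    """Y = log10((X1 + X2 + X3) / (X4 + 1))."""
    return math.log10((x1 + x2 + x3) / (x4 + 1))


# LDL column of Table 4 (Jaccard index on the eleven real networks).
ldl_jaccard = [0.6507, 0.2131, 0.6524, 0.6440, 0.4052, 0.4804,
               0.2085, 0.5690, 0.2213, 0.2240, 0.2203]
x1 = variation_score(ldl_jaccard)

# Placeholder values for X2, X3 and X4, for illustration only.
print(loss(x1, 10.0, 3.0, 0.5))
```

The +1 in the denominator keeps Y finite when X₄ is 0, and a larger Y corresponds to higher scores on the first three indices together with a smaller X₄.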

TABLE 8 Results of the LDL algorithm and the conventional algorithms on each index (real datasets)

As shown in Table 8, the performance of the LDL algorithm on each index is analyzed through the statistical mean, standard deviation, coefficient of variation and the constructed loss function. First, the proposed LDL algorithm has the highest score (bold values) on both the Jaccard and Rand indices. On the Jaccard index, the mean of the LDL algorithm over the data networks is the highest, but its standard deviation is higher than that of LinkLPA, which indicates that the performance of the LDL algorithm varies more across data networks than that of LinkLPA; taken together, however, its coefficient-of-variation score (mean/standard deviation, where higher is better) is still the highest. On the Rand index, the LDL algorithm is superior to the other algorithms in both mean and standard deviation, and its coefficient-of-variation score is clearly higher, by close to 77 percent, than that of the next-best algorithm, NMF. Second, the score of the LDL algorithm on the Modularity index is second only to that of the LinkLPA algorithm, because the performance of the LDL algorithm varies more across data networks than that of LinkLPA. The LDL algorithm performs best on the Error index, with an Error rate of only 0.0548, far better than the other algorithms; the experimental data show that the algorithm provided by the invention is more robust. Finally, its loss-function score is also the highest, approximately 7 percentage points higher than that of the best conventional method.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A social network community discovery system based on local distance and node rank optimization functions is characterized by comprising a data acquisition module, a Laplace node matrix calculation module, a network social node value calculation discovery module, a community optimization module and a display module;

the internal distance is calculated as follows:

wherein L^sym represents the Laplacian node matrix;

represents the adjacency matrix of the node set V_k;

G represents the social network;

V_k represents a node set; k = 1, 2, 3, ..., K;

d^internal(G, V_k) represents the internal distance of the social network;

the external distance is calculated as follows:

wherein L^sym represents the Laplacian node matrix;

represents the adjacency matrix of V-V_k;

represents the adjacency matrix of the node set V_k;

V represents the node partition set; V = {V₁, V₂, V₃, ..., V_K};

G represents the social network;

V_k represents a node set; k = 1, 2, 3, ..., K;

d^external(G, V_k) represents the external distance of the social network;

the network social node value is calculated as follows:

wherein V_k represents a node set; k = 1, 2, 3, ..., K;

V represents the node partition set; V = {V₁, V₂, V₃, ..., V_K};

d^internal(G, V_k) represents the internal distance of the social network;

d^external(G, V_k) represents the external distance of the social network;

S_LDL(G, V) represents the network social node value;

2. The social network community discovery system through a local distance and node rank optimization function according to claim 1, wherein the Laplacian normalization processing performed on the obtained network social nodes in the Laplace node matrix calculation module is calculated as follows:

wherein D represents the node degree matrix;

represents the un-normalized Laplacian matrix;

A represents the adjacency matrix;

I_n represents the n-order identity matrix.

3. The social network community discovery system through local distance and node rank optimization function according to claim 1, wherein the element values in the Laplace node matrix calculation module are calculated as follows:

wherein deg(v_i) represents the degree of node i;

deg(v_j) represents the degree of node j;

v_i represents node i;

v_j represents node j;

4. The social network community discovery system through local distance and node rank optimization function of claim 1, wherein the adjacency matrix of the node set V_k in the network social node value calculation discovery module is calculated as follows:

wherein V_k represents a node set; k = 1, 2, 3, ..., K;

V represents the node partition set; V = {V₁, V₂, V₃, ..., V_K};

v_x represents node x; x = 1, 2, 3, ..., N;

v_y represents node y; y = 1, 2, 3, ...

5. The social network community discovery system through local distance and node rank optimization function of claim 1, wherein the adjacency matrix of the node set V-V_k in the network social node value calculation discovery module is calculated as follows:

wherein V_k represents a node set; k = 1, 2, 3, ..., K;

V represents the node partition set; V = {V₁, V₂, V₃, ..., V_K};

v_x represents node x; x = 1, 2, 3, ..., N;

v_y represents node y; y = 1, 2, 3, ...

6. The system of claim 1, wherein the method for optimizing the discovered social networking communities in the community optimization module comprises:

wherein V_k represents a node set; k = 1, 2, 3, ..., K;

V represents the node partition set; V = {V₁, V₂, V₃, ..., V_K};

v_i represents node i;

indicates that in the case of … …, there is … …;

V[v_i] indicates that node i belongs to the node set V[v_i];

v_j represents node j;

if yes, the node set V[v_i] is kept;

if not, the node set V[v_i] is discarded.

7. The social network community discovery system through local distance and node rank optimization function of claim 1, further comprising a performance metric module, wherein the performance metric in the performance metric module is calculated by:

wherein V_M is the optimal community structure;

V₀ is the reference vector;

J(V_M, V₀) represents the Jaccard coefficient;

when V_M and V₀ are both empty, J(V_M, V₀) = 1;

wherein V_M′ is the structural feature of V;

V₀′ is the structural feature of V₀;

E(V_M′, V₀′) represents the Error index;