Disclosure of Invention
The invention aims to at least solve the technical problems in the prior art, and particularly provides a social network community discovery system based on a local distance and node rank optimization function.
To achieve the above object, the present invention provides a social network community discovery system based on a local distance and node rank optimization function, which includes a data acquisition module, a Laplace node matrix calculation module, a network social node value calculation discovery module, a community optimization module, and a display module;
the data output end of the data acquisition module is connected with the data input end of the Laplace node matrix calculation module, the data output end of the Laplace node matrix calculation module is connected with the data input end of the network social node value calculation discovery module, the data output end of the network social node value calculation discovery module is connected with the data input end of the community optimization module, and the data output end of the community optimization module is connected with the display data end of the display module;
the data acquisition module is used for acquiring a network social node data set;
the Laplace node matrix calculation module is used for carrying out Laplace normalization processing on the network social node data set acquired by the data acquisition module to obtain a Laplace node matrix;
the network social node value calculation discovery module is used for calculating to obtain a network social node value according to the internal distance and the external distance of the network social:
if the network social node value is larger than or equal to the preset network social node value, discovering a network social community;
if the network social node value is smaller than the preset network social node value, rediscovering the network social community;
the community optimization module is used for optimizing the network social communities discovered by the network social node value calculation discovery module;
and the display module is used for displaying the network social communities obtained by the community optimization module and/or the network social node value calculation discovery module.
In a preferred embodiment of the present invention, the Laplace normalization of the acquired network social nodes is calculated in the Laplace node matrix calculation module as follows:
$$L_{sym} = D^{-1/2} L D^{-1/2} = I_n - D^{-1/2} A D^{-1/2}$$
wherein $D$ represents the node degree matrix;
$L$ represents the unnormalized Laplacian matrix;
$A$ represents the adjacency matrix;
$I_n$ represents the n-order identity matrix.
In a preferred embodiment of the present invention, the element values are calculated in the Laplace node matrix calculation module as follows:
$$L^{sym}_{i,j} = \begin{cases} 1, & i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\deg(v_j)}}, & i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0, & \text{otherwise} \end{cases}$$
wherein $\deg(v_i)$ represents the degree of node i;
$\deg(v_j)$ represents the degree of node j;
$v_i$ represents node i;
$v_j$ represents node j;
$L^{sym}_{i,j}$ represents the element value of the i-th row and the j-th column in the Laplace node matrix.
In a preferred embodiment of the present invention, the internal distance of the network social is calculated in the network social node value calculation discovery module as follows:
wherein $L_{sym}$ represents the Laplace node matrix;
the formula also uses the adjacency matrix of the node set $V_k$;
$G$ represents the social network;
$V_k$ represents a node set, k = 1, 2, 3, …, K;
$d_{internal}(G, V_k)$ represents the internal distance of the network social.
In a preferred embodiment of the present invention, the external distance of the network social is calculated in the network social node value calculation discovery module as follows:
wherein $L_{sym}$ represents the Laplace node matrix;
the formula also uses the adjacency matrix of $V - V_k$ and the adjacency matrix of the node set $V_k$;
$V$ represents the node partition set, $V = \{V_1, V_2, V_3, …, V_K\}$;
$G$ represents the social network;
$V_k$ represents a node set, k = 1, 2, 3, …, K;
$d_{external}(G, V_k)$ represents the external distance of the network social.
In a preferred embodiment of the present invention, the network social node value is calculated in the network social node value calculation discovery module as follows:
wherein $V_k$ represents a node set, k = 1, 2, 3, …, K;
$V$ represents the node partition set, $V = \{V_1, V_2, V_3, …, V_K\}$;
$d_{internal}(G, V_k)$ represents the internal distance of the network social;
$d_{external}(G, V_k)$ represents the external distance of the network social;
$S_{LDL}(G, V)$ represents the network social node value.
In a preferred embodiment of the present invention, the adjacency matrix of the node set $V_k$ is calculated in the network social node value calculation discovery module as follows:
wherein $V_k$ represents a node set, k = 1, 2, 3, …, K;
$V$ represents the node partition set, $V = \{V_1, V_2, V_3, …, V_K\}$;
$v_x$ represents node x, x = 1, 2, 3, …, N;
$v_y$ represents node y, y = 1, 2, 3, …, N.
In a preferred embodiment of the present invention, the adjacency matrix of the node set $V - V_k$ is calculated in the network social node value calculation discovery module as follows:
wherein $V_k$ represents a node set, k = 1, 2, 3, …, K;
$V$ represents the node partition set, $V = \{V_1, V_2, V_3, …, V_K\}$;
$v_x$ represents node x, x = 1, 2, 3, …, N;
$v_y$ represents node y, y = 1, 2, 3, …, N.
In a preferred embodiment of the present invention, the discovered network social communities are optimized in the community optimization module as follows:
wherein $V_k$ represents a node set, k = 1, 2, 3, …, K;
$V$ represents the node partition set, $V = \{V_1, V_2, V_3, …, V_K\}$;
$v_i$ represents node i;
the conditional operator in the formula indicates that in the case of ……, there is ……;
$V[v_i]$ represents the node set to which node i belongs;
$v_j$ represents node j;
$A_{ij}$ represents the element value of the i-th row and the j-th column in the adjacency matrix $A$;
if the condition holds, the node set $V[v_i]$ is kept;
if not, the node set $V[v_i]$ is discarded.
In a preferred embodiment of the present invention, the system further includes a performance metric module, and the performance metric is calculated in the performance metric module as follows:
wherein m represents the total number of edges connecting nodes; $A_{ij}$ represents the element values in the adjacency matrix $A$; $F_{ij}$ represents the proportion of any edge connecting the two nodes i and j;
wherein $\deg(v_i)$ represents the degree of node i; $\deg(v_j)$ represents the degree of node j; $v_i$ represents node i; $v_j$ represents node j;
and/or the system further includes a Jaccard coefficient module, and the Jaccard coefficient is calculated in the Jaccard coefficient module as follows:
wherein $V_M$ is the optimal community structure;
$V_0$ is the reference vector;
$J(V_M, V_0)$ represents the Jaccard coefficient;
when $V_M$ and $V_0$ are both empty, $J(V_M, V_0) = 1$;
and/or the system further includes an Error index module, and the Error index is calculated in the Error index module as follows:
wherein $V_M'$ is the structural feature of $V_M$;
$V_0'$ is the structural feature of $V_0$;
$E(V_M', V_0')$ represents the Error index;
when $V_M$ has the same community structure as $V_0$, $E(V_M', V_0')$ is equal to 0;
and one of the performance metric value, the Jaccard coefficient value and the Error index value, or any combination thereof, is displayed on the display module.
In summary, by adopting the above technical scheme, the invention has the following advantages. First, it takes the self-transmission of node information into account. Second, it comprehensively considers edge weights and can effectively show the characteristic structure of the whole social network. Third, for the multi-scale optimization problem, its optimization function can effectively find the optimal community structure. Finally, it performs better than the other methods it is compared with.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
To date, a number of classical and effective local community discovery algorithms and matrix factorization (MF) algorithms have been proposed. Liu et al. propose a local community discovery framework based on node-pair similarity; a new local community discovery algorithm can be obtained by embedding a better node similarity measure into it.
Clauset et al. propose a measure R of local community structure, calculated as follows:
$$R = \frac{B_{in}}{B_{in} + B_{out}}$$
wherein B is a local community, R measures the structure of the local community, $B_{in}$ represents the number of edges whose endpoints are both in the local community B, and $B_{out}$ is the number of edges that have one endpoint in the local community B. The algorithm requires a predefined community size. It keeps adding the neighbor node that increases R the most to the current community until the community reaches the predefined size.
Luo et al. propose another local community measure M, calculated as follows:
$$M = \frac{E_{in}}{E_{out}}$$
wherein M represents the local community measure, $E_{in}$ represents the number of internal edges of the community, and $E_{out}$ represents the number of edges between the community boundary and external nodes. The algorithm provides three heuristic node search methods that partially solve the problem of community discovery in a complex network. However, different thresholds must be set for networks of different sizes.
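The two measures above can be illustrated with a minimal Python sketch. It assumes the usual readings R = B_in / (B_in + B_out) and M = E_in / E_out, and it counts an "outgoing" edge as one with exactly one endpoint in the community; the function name `local_measures` and the toy edge list are hypothetical and are not taken from the cited works.

```python
def local_measures(edges, community):
    """Return (R, M) for an undirected edge list and a set of community nodes."""
    community = set(community)
    inside = 0    # edges with both endpoints in the community
    boundary = 0  # edges with exactly one endpoint in the community (assumption)
    for u, v in edges:
        hits = (u in community) + (v in community)
        if hits == 2:
            inside += 1
        elif hits == 1:
            boundary += 1
    total = inside + boundary
    r = inside / total if total else 0.0
    m = inside / boundary if boundary else float("inf")
    return r, m


# Toy graph: a triangle 0-1-2 with a tail 2-3-4.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
r, m = local_measures(edges, {0, 1, 2})
print(r, m)  # 3 internal edges, 1 boundary edge
```

In this toy graph the triangle has three internal edges and one boundary edge, so R is 3/4 and M is 3.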
The two algorithms have several desirable properties: they can detect clusters of any shape, do not need the number of clusters to be preset, and can show the selection of cluster centers through a decision graph. However, DPC still has drawbacks. First, the truncation distance has a large impact on the clustering results. Second, manual intervention is required to select suitable cluster centers.
Lancichinetti et al. propose a fitness function $F_c$ to measure the density of nodes within a community. The fitness function is defined as follows:
$$F_c = \frac{k_{in}^{c}}{\left(k_{in}^{c} + k_{out}^{c}\right)^{\alpha}}$$
in the formula, $k_{in}^{c}$ represents the sum of the internal degrees of the community c, $k_{out}^{c}$ denotes the sum of the external degrees of the community c, α denotes a resolution parameter for controlling the size of the detected community, and $F_c$ represents the density of nodes within the community. The fitness function can effectively measure the closeness of nodes in the community, but cannot make full use of local information between the nodes.
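As a hedged illustration of the fitness function, the sketch below assumes the commonly cited form F_c = k_in / (k_in + k_out)^α, where each internal edge contributes 2 to the internal degree sum and each boundary edge contributes 1 to the external degree sum. The function name `fitness` and the example graph are hypothetical.

```python
def fitness(edges, community, alpha=1.0):
    """Fitness of a community under the assumed form k_in / (k_in + k_out)**alpha."""
    community = set(community)
    k_in = 0   # sum of internal degrees: each internal edge counts twice
    k_out = 0  # sum of external degrees: each boundary edge counts once
    for u, v in edges:
        hits = (u in community) + (v in community)
        if hits == 2:
            k_in += 2
        elif hits == 1:
            k_out += 1
    total = k_in + k_out
    return k_in / total ** alpha if total else 0.0


edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
print(fitness(edges, {0, 1, 2}, alpha=1.0))  # k_in = 6, k_out = 1
print(fitness(edges, {0, 1, 2}, alpha=2.0))  # larger alpha penalizes size more
```

Raising α lowers the score of the same community, which is how the resolution parameter steers the search toward smaller communities.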
Xu et al. studied how to apply genetic algorithms from computational intelligence to community discovery in directed and undirected networks, and developed an iterative optimization algorithm. Wang et al. propose a method for discovering overlapping communities using a Bayesian MF model. The advantages of this approach are that the number of communities can be determined automatically and that there is no resolution limit. However, its internal estimate of the number of communities may mislead the decomposition and return a wrong solution.
Guo et al. use local central nodes and the Jaccard coefficient to detect the core members of a community as seeds in the network, thereby ensuring that the selected seeds are the central nodes of the community. In each step, the node with the greatest degree in the seed is pre-expanded using the fitness function. The k best-performing nodes in the pre-expansion are then expanded, using the internal force among the nodes according to the fitness function, so as to obtain high-quality communities in the network.
Chen et al. propose a novel community discovery method that separates overlapping communities from the network using a non-negative matrix factorization (NMF) model, and solves the problem of an unknown number of communities through feature matrix preprocessing and sorting optimization, so that the algorithm can partition networks whose number of communities is unknown. Hu et al. propose an improved Lagrangian alternating direction algorithm for symmetric non-negative matrix factorization.
Recently, Li et al. proposed a community partitioning method based on semi-supervised matrix factorization and random walks. The transition probabilities among the nodes are calculated from the network topology, the final walk probabilities are obtained using a random walk model, and a feature matrix is constructed.
Wu et al. propose a novel framework called Mixed Hypergraph Regularized Non-negative Matrix Factorization (MHGNMF), which takes higher-order information between nodes into account to improve clustering performance. The hypergraph regularization term enforces that the nodes in the same hyperedge are projected to the same latent subspace, thereby yielding a more discriminative representation. In the proposed framework, topological connectivity information and structural similarity information are exploited by blending together two neighbors of each centroid to generate a set of hyperedges.
The above local community discovery algorithms all use the topological properties of the network and assume by default that all edges have the same weight. In a real data network, however, the connection strength between entities differs, and because node bias is not considered, the weights are easily estimated incorrectly. To the best of the inventors' knowledge, there is no other community discovery work that combines local distance with a method based on Laplacian matrix decomposition.
To better describe the proposed model, the invention will use the following mathematical definition:
Definition 1: The network G = (V, E) is composed of a node partition set V and an edge set E. The nodes contained in the node partition set V are labeled $v_1, v_2, v_3, …, v_N$, where $v_p$ represents node p, p = 1, 2, 3, …, N, and N represents the total number of nodes in the node partition set. The edges contained in the edge set E indicate which nodes are connected. Thus, if an edge pair $\cup_{x \neq y}(v_x, v_y)$ is in the edge set E, with x = 1, 2, 3, …, N and y = 1, 2, 3, …, N, then $v_x$ is connected to $v_y$. The present invention processes only undirected graphs, so the edge pair $\cup_{x \neq y}(v_x, v_y)$ is equal to the edge pair $\cup_{y \neq x}(v_y, v_x)$, where $\cup_{y \neq x}(v_y, v_x)$ denotes that $v_y$ is connected to $v_x$, i.e. $v_y$ and $v_x$ are connected with each other. Here $\cup_{\zeta}$ represents a condition ζ; that is, $\cup_{x \neq y}$ indicates the condition x ≠ y and $\cup_{y \neq x}$ indicates the condition y ≠ x.
Definition 2: Each network G has an adjacency matrix A. If the network G has N nodes, the adjacency matrix A is an N × N matrix whose entries are 0 or 1. $A_{pq} = 1$ if and only if the edge pair $\cup_{p \neq q}(v_p, v_q) \in E$, with p = 1, 2, 3, …, N and q = 1, 2, 3, …, N; that is, $v_p$ and $v_q$ are connected. From Definition 1, $\cup_{p \neq q}(v_p, v_q) = \cup_{q \neq p}(v_q, v_p)$, so the adjacency matrix A here is a symmetric matrix.
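Definition 2 can be sketched in a few lines of Python. The sketch assumes nodes are indexed from 0 (the text indexes them from 1) and represents the matrix as a list of lists; the function name `adjacency_matrix` is hypothetical.

```python
def adjacency_matrix(n, edges):
    """Build the symmetric 0/1 adjacency matrix of an undirected graph."""
    a = [[0] * n for _ in range(n)]
    for p, q in edges:
        if p != q:         # Definition 1 only admits edge pairs with p != q
            a[p][q] = 1
            a[q][p] = 1    # undirected: (v_p, v_q) equals (v_q, v_p)
    return a


A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
for row in A:
    print(row)
```

Writing both `a[p][q]` and `a[q][p]` is what makes the matrix symmetric, matching the last sentence of the definition.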
Definition 3: Community discovery on the network G = (V, E) divides the node partition set V into node sets $V_1, V_2, V_3, …, V_K$ such that $V_1 \cup V_2 \cup V_3 \cup … \cup V_K$ is equal to V and $V_1, V_2, V_3, …, V_K$ are all non-empty sets. That is, the node sets $V_1, V_2, V_3, …, V_K$ are the community structure. The present invention defines a partition as $V = \{V_1, V_2, V_3, …, V_K\}$. The number of partitions is K = <V>, where <V> represents the number of node sets in the node partition set V.
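The two conditions of Definition 3 (the node sets cover V, and none of them is empty) can be checked with a short Python sketch. The function name `is_valid_partition` is hypothetical; the definition as stated does not mention disjointness, so the sketch does not test for it.

```python
def is_valid_partition(nodes, parts):
    """Check the two conditions of Definition 3 for a candidate partition."""
    if any(len(p) == 0 for p in parts):
        return False                    # every V_k must be non-empty
    covered = set().union(*parts)       # V_1 ∪ V_2 ∪ ... ∪ V_K
    return covered == set(nodes)        # the union must equal V


nodes = range(5)
print(is_valid_partition(nodes, [{0, 1}, {2, 3, 4}]))        # True
print(is_valid_partition(nodes, [{0, 1}, {2, 3}]))           # False: node 4 is not covered
print(is_valid_partition(nodes, [{0, 1, 2, 3, 4}, set()]))   # False: one set is empty
```

The number of partitions K is simply `len(parts)` for a partition that passes the check.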
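Definition 4: Given a network G = (V, E) and a node partition set $V = \{V_1, V_2, V_3, …, V_K\}$, the edges of the network G can be divided into edge sets $E_{mn}$, i.e. $E_{mn} \subseteq E$, and an edge pair $(v_x, v_y)$ belongs to $E_{mn}$ if and only if $v_x \in V_m$ and $v_y \in V_n$ and there is an edge pair $\cup_{x \neq y}(v_x, v_y) \in E$.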
Definition 5: In particular, an internal edge set and an external edge set are defined for each node set $V_k$. In other words, the internal edge set contains the edges inside the node set $V_k$: the two nodes of every edge pair in the internal edge set belong to the same community. The external edge set contains the edges leaving $V_k$: for any edge pair in the external edge set, one node is in the node set $V_k$, and the other node does not belong to the node set $V_k$ but belongs to the node set $V - V_k$.
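The verbal description of Definition 5 can be illustrated with a minimal Python sketch that splits an edge list into the internal and external edge sets of one community. The function name `split_edges` and the toy graph are hypothetical; edges with neither endpoint in the community belong to neither set.

```python
def split_edges(edges, community):
    """Split edges into (internal, external) edge lists for one node set V_k."""
    community = set(community)
    internal, external = [], []
    for u, v in edges:
        in_u, in_v = u in community, v in community
        if in_u and in_v:
            internal.append((u, v))   # both endpoints in V_k
        elif in_u != in_v:
            external.append((u, v))   # one endpoint in V_k, the other in V - V_k
    return internal, external


edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
internal, external = split_edges(edges, {0, 1, 2})
print(internal)  # [(0, 1), (1, 2), (0, 2)]
print(external)  # [(2, 3)]
```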
As shown in Fig. 2, the entire network G is divided into 5 partitions, i.e. $V = \{V_1, V_2, V_3, V_4, V_5\}$, and the internal edge set and the external edge set of partition $V_1$ are briefly indicated.
Community discovery is to find a node partition set $V = \{V_1, V_2, V_3, …, V_K\}$ of a network G = (V, E) in which the nodes contained in each cluster are related to each other, and not to nodes outside the cluster, so as to form a community. First, to solve the problem of node information self-transmission, the invention comprehensively considers the influence of each node on itself, introduces a self-degree matrix, and constructs the following model using the Laplace matrix decomposition principle:
$$L_{sym} = D^{-1/2} L D^{-1/2} = I_n - D^{-1/2} A D^{-1/2}$$
wherein $D$ represents the node degree matrix; $L$ represents the unnormalized Laplacian matrix; $A$ represents the adjacency matrix; $I_n$ represents the n-order identity matrix; $L_{sym}$ is the Laplace node matrix.
To extract the network features completely, the edge weight problem is fully considered. The Laplace node matrix is obtained by normalizing the adjacency matrix, that is, by multiplying the adjacency matrix on both sides by the inverse of the square root of the node degrees. For a single node, normalization divides by the degree of the node, so that the information transferred over each adjacent edge is normalized. A node with 10 edges therefore does not have a larger influence than a node with 1 edge, because after normalization each of its edges carries a weight of only 0.1. Extending the operation from a single node to the two-dimensional matrix, the division is carried out as a multiplication by the inverse matrix. The left and right factors correspond to the square roots of the degrees of nodes i and j respectively, i.e. the degrees of the two endpoints of an edge. For each node pair $v_i$, $v_j$, the elements of the matrix are given by the following equation:
$$L^{sym}_{i,j} = \begin{cases} 1, & i = j \text{ and } \deg(v_i) \neq 0 \\ -\dfrac{1}{\sqrt{\deg(v_i)\deg(v_j)}}, & i \neq j \text{ and } v_i \text{ is adjacent to } v_j \\ 0, & \text{otherwise} \end{cases}$$
wherein $\deg(v_i)$ represents the degree of node i and $\deg(v_j)$ represents the degree of node j, i.e. the values of the degree matrix at nodes i and j; $v_i$ represents node i; $v_j$ represents node j;
$L^{sym}_{i,j}$ represents the element value of the i-th row and the j-th column in the Laplace node matrix.
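The normalization described above can be sketched in Python. The sketch assumes the standard symmetric normalized Laplacian of a simple graph (zero diagonal in A), with entries 1 on the diagonal for nodes of non-zero degree and -1/sqrt(deg(v_i) deg(v_j)) for adjacent nodes; how the self-degree matrix of the invention modifies this is not modeled here. The function name `normalized_laplacian` is hypothetical.

```python
import math


def normalized_laplacian(a):
    """Element-wise symmetric normalized Laplacian of a 0/1 adjacency matrix."""
    n = len(a)
    deg = [sum(row) for row in a]              # deg(v_i) = row sum of A
    l_sym = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j and deg[i] != 0:
                l_sym[i][j] = 1.0
            elif i != j and a[i][j] != 0:      # v_i is adjacent to v_j
                l_sym[i][j] = -1.0 / math.sqrt(deg[i] * deg[j])
    return l_sym


# Path graph 0 - 1 - 2: degrees are 1, 2, 1.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
for row in normalized_laplacian(A):
    print([round(x, 4) for x in row])
```

For the path graph, the entry between nodes 0 and 1 is -1/sqrt(1 · 2), which shows how an edge into a higher-degree node carries less weight.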
The inner and outer distances are given by:
wherein $L_{sym}$ represents the Laplace node matrix; the formulas also use the adjacency matrix of $V_k$ and the adjacency matrix of $V - V_k$; $d_{internal}(G, V_k)$ represents the internal distance of the network social; $d_{external}(G, V_k)$ represents the external distance of the network social. $d_{external}(G, V_k)$ can be written as $d_e(G, V_k)$ or $d_e$; $d_{internal}(G, V_k)$ can be written as $d_i(G, V_k)$ or $d_i$.
In equation (8), it should be understood that when node x and node y both belong to the node set $V_k$ (a node set is also called a community), the element in the x-th row and y-th column of the adjacency matrix of $V_k$ is 1; when node x belongs to the node set $V_k$ and node y belongs to the node set $V - V_k$, the element in the x-th row and y-th column of the adjacency matrix of $V_k$ is 0. Similarly, in equation (9), when node x and node y both belong to the node set $V_k$, the element in the x-th row and y-th column of the adjacency matrix of $V - V_k$ is 0; when node x belongs to the node set $V_k$ and node y belongs to the node set $V - V_k$, the element in the x-th row and y-th column of the adjacency matrix of $V - V_k$ is 1.
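The rule described for equations (8) and (9) can be sketched as two 0/1 indicator matrices. This follows only the verbal description above; the equations themselves, and whether the indicators are further combined with the adjacency matrix A, are not reproduced here, so the sketch is an assumption-laden illustration. The function name `indicator_matrices` is hypothetical and nodes are indexed from 0.

```python
def indicator_matrices(n, community):
    """Return (inner, outer) 0/1 matrices for a node set V_k over n nodes."""
    community = set(community)
    inner = [[0] * n for _ in range(n)]   # described for the matrix of V_k
    outer = [[0] * n for _ in range(n)]   # described for the matrix of V - V_k
    for x in range(n):
        for y in range(n):
            if x in community and y in community:
                inner[x][y] = 1           # both nodes in V_k
            elif x in community and y not in community:
                outer[x][y] = 1           # x in V_k, y in V - V_k
    return inner, outer


inner, outer = indicator_matrices(4, {0, 1})
print(inner)
print(outer)
```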
For all $v_x \in V$, $A_{xx} = 1$ (i.e., each node has a self-loop). All edges except the self-loops are counted twice. $d_{internal}(G, V_k)$ takes values in [0, 1]. When the network G is a union of mutually disconnected communities, it is a perfect community structure graph. $d_{external}(G, V_k)$ also takes values in [0, 1], and for a perfect community structure graph its value is 0.
The local distance Laplace (LDL) network social node value function is as follows:
wherein $V_k$ represents a node set; $V$ represents the node partition set; $d_{internal}(G, V_k)$ represents the internal distance of the network social; $d_{external}(G, V_k)$ represents the external distance of the network social; $S_{LDL}(G, V)$ represents the network social node value.
One point to emphasize about the LDL model is that the weight of each local partition (local inner distance plus local outer distance) is $|V_k| / (2|V|)$. This prevents smaller communities from having a disproportionate impact on the total score.
3.3 Node Rank Optimization Function
Because the community discovery algorithm proposed by the present invention can generate more than one possible community discovery result, a community discovery optimization is required. The optimal community selection method provided by the invention is based on the idea of community discovery validity, namely that a community should have more edges inside it than leaving it. The weak criterion (WRC) and the strong criterion (SRC) were first proposed by Radicchi et al., but the WRC is too weak and cannot discriminate between individual nodes. That is, even if a node u is completely disconnected from $V_k$, u may still be added to $V_k$ and the WRC remains satisfied. This can cause many discovered communities to be invalid. Therefore, the present invention provides a node rank optimization (NRO) function, which is as follows:
wherein Φ represents the node set to which node j belongs; $A_{ij}$ represents the element value of the i-th row and the j-th column in the adjacency matrix $A$; $V[i]$ represents the node set to which node i belongs; and the conditional operator indicates that in the case of ……, there is ……;
$v_i$ represents node i, and $v_y$ represents node y.
Thus, the NRO function is expressed as follows:
wherein $V_k$ represents a node set, and $V$ represents the node partition set; $v_i$ represents node i; the conditional operator indicates that in the case of ……, there is ……; $V[v_i]$ represents the node set to which node i belongs; $v_j$ represents node j; $A_{ij}$ represents the element value of the i-th row and the j-th column in the adjacency matrix $A$.
If the condition holds, the node set $V[v_i]$ is kept;
if not, the node set $V[v_i]$ is discarded.
Because two coordination parameters, V[i] and V − V[i], are set, the resulting criterion is stronger than the WRC and weaker than the SRC, which yields a better optimization effect.
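For context, the weak and strong criteria of Radicchi et al. that the NRO function is positioned between can be sketched as follows. The sketch uses the usual statement of those criteria (weak: the sum of internal degrees exceeds the sum of external degrees; strong: every member has more internal than external neighbors). It does not implement the NRO function itself, and the names `radicchi_criteria` and `build` are hypothetical.

```python
def radicchi_criteria(a, community):
    """Return (weak, strong) for a community in a 0/1 adjacency matrix."""
    comm = set(community)
    members = sorted(comm)
    outside = [j for j in range(len(a)) if j not in comm]
    k_in = [sum(a[i][j] for j in members if j != i) for i in members]
    k_out = [sum(a[i][j] for j in outside) for i in members]
    weak = sum(k_in) > sum(k_out)                          # WRC
    strong = all(ki > ko for ki, ko in zip(k_in, k_out))   # SRC
    return weak, strong


def build(n, edges):
    """Helper: symmetric 0/1 adjacency matrix from an edge list."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1
    return a


# Triangle 0-1-2, tail 2-3-4, and an isolated node 5.
A = build(6, [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])
print(radicchi_criteria(A, {0, 1, 2}))     # (True, True)
print(radicchi_criteria(A, {0, 1, 2, 5}))  # (True, False)
```

The second call shows the weakness discussed above: adding the completely disconnected node 5 to the triangle leaves the weak criterion satisfied, while the strong criterion rejects it.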
The main flow of the algorithm provided by the invention is as follows:
4 results and analysis of the experiments
To evaluate the proposed algorithm, the present invention uses eleven real data networks and artificial network datasets. The data sources are http://www-personal.umich.edu/mejn/Netdata/ and http://snap.stanford.edu/data/. The hardware and software environment of the experiment was as follows: Intel(R) Core(TM) i5-4160M CPU at 3.60 GHz, 4 GB memory, Windows 10, and MATLAB R2019a.
4.1 evaluation index
In the present invention, in order to evaluate performance on networks that have no ground truth, Q is used as the performance metric in the experiments. The performance metric Q is:
$$Q = \frac{1}{2m} \sum_{i,j} \left(A_{ij} - F_{ij}\right) \delta_{ij}(c_i, c_j)$$
wherein m represents the total number of edges connecting nodes; $A_{ij}$ represents the element values in the adjacency matrix $A$; $F_{ij}$ represents the proportion of any edge connecting the two nodes i and j:
$$F_{ij} = \frac{\deg(v_i)\deg(v_j)}{2m}$$
wherein $\deg(v_i)$ represents the degree of node i; $\deg(v_j)$ represents the degree of node j; $v_i$ represents node i; $v_j$ represents node j.
$\delta_{ij}(c_i, c_j)$ is expressed as follows:
$$\delta_{ij}(c_i, c_j) = \begin{cases} 1, & c_i = c_j \\ 0, & \text{otherwise} \end{cases}$$
wherein $c_i$ is the community to which vertex i is assigned, and $c_j$ is the community to which vertex j is assigned.
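A minimal Python sketch of the performance metric Q follows. It assumes Q is the standard modularity, with F_ij = deg(v_i) deg(v_j) / (2m) and δ equal to 1 when the two nodes share a community and 0 otherwise, on a simple graph without self-loops. The function name `modularity` and the example graph are hypothetical.

```python
def modularity(a, labels):
    """Modularity Q of a labeling, given a symmetric 0/1 adjacency matrix."""
    n = len(a)
    deg = [sum(row) for row in a]
    two_m = sum(deg)                       # each undirected edge is counted twice
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:     # delta(c_i, c_j) = 1
                q += a[i][j] - deg[i] * deg[j] / two_m
    return q / two_m


# Two disconnected triangles, each its own community.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
print(modularity(A, [0, 0, 0, 1, 1, 1]))  # 0.5
print(modularity(A, [0, 0, 0, 0, 0, 0]))  # 0.0: one community covering everything
```

Placing every node in a single community always gives Q = 0 under this form, which is why Q rewards partitions with more internal edges than expected by chance.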
The Jaccard similarity coefficient (JSC) is used to compare the similarity and difference between finite sample sets.
Given two sets $V_M$ and $V_0$, the Jaccard coefficient is defined as the ratio of the size of the intersection of $V_M$ and $V_0$ to the size of their union; a larger value indicates a higher degree of similarity:
$$J(V_M, V_0) = \frac{|V_M \cap V_0|}{|V_M \cup V_0|}$$
wherein $V_M$ is the optimal community structure and $V_0$ is the reference vector; when $V_M$ and $V_0$ are both empty, $J(V_M, V_0) = 1$.
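The Jaccard coefficient as defined above, including the convention for two empty sets, can be sketched in Python. The function name `jaccard` is hypothetical, and the inputs are treated as plain sets of node identifiers.

```python
def jaccard(v_m, v_0):
    """Ratio of intersection size to union size; 1 when both sets are empty."""
    v_m, v_0 = set(v_m), set(v_0)
    if not v_m and not v_0:
        return 1.0                      # convention stated in the text
    return len(v_m & v_0) / len(v_m | v_0)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5: 2 shared elements out of 4
print(jaccard(set(), set()))          # 1.0
```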
The larger the Rand index (RI), the more consistent the community discovery result is with the real situation. A larger RI indicates higher clustering accuracy and higher purity within each class.
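As an illustration, the sketch below computes RI under the assumption that it is the standard Rand index: the fraction of node pairs on which two labelings agree (both place the pair in the same community, or both place it in different communities). The function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of node pairs on which two labelings agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b      # pair treated the same way by both labelings
        total += 1
    return agree / total if total else 1.0


truth = [0, 0, 1, 1]
found = [0, 0, 1, 2]
print(rand_index(truth, found))  # 5 of 6 pairs agree
print(rand_index(truth, truth))  # 1.0
```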
The Error index $E(V_M', V_0')$ is equal to 0 when $V_M$ has the same community structure as $V_0$, and is defined as follows:
wherein $V_M'$ is the structural feature of $V_M$, and $V_0'$ is the structural feature of $V_0$.
4.2 Artificial network performance comparison
The algorithm is run on an artificial data network (the GN benchmark network). For each node, the internal edge set $E_{internal}$ connects it to other nodes in the same community, and the external edge set $E_{external}$ connects it to other communities. As the external edge set $E_{external}$ grows, the community structure becomes less clear and the community discovery task becomes more challenging.
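The role of the internal and external edge sets can be illustrated with a small generator. The Python sketch below is not the generator used in the experiments: it assumes the classic GN setting of 128 nodes in 4 equal communities with an average degree of about 16, and it assumes that each edge is drawn as external with probability `p_external`. All names and default values are illustrative.

```python
import random


def make_benchmark(n_communities=4, size=32, degree=16, p_external=0.2, seed=0):
    """Planted-partition graph in the spirit of the GN benchmark (assumed setup)."""
    if n_communities < 2 or size < 2:
        raise ValueError("need at least 2 communities of at least 2 nodes")
    rng = random.Random(seed)
    n = n_communities * size
    community = {v: v // size for v in range(n)}
    edges = set()
    for u in range(n):
        inside = [v for v in range(n) if community[v] == community[u] and v != u]
        outside = [v for v in range(n) if community[v] != community[u]]
        # Each node initiates degree // 2 edges, so the average degree is
        # about `degree` (slightly less when duplicate edges collapse).
        for _ in range(degree // 2):
            pool = outside if rng.random() < p_external else inside
            v = rng.choice(pool)
            edges.add((min(u, v), max(u, v)))
    return sorted(edges), community


def external_fraction(edges, community):
    """Share of edges whose endpoints lie in different communities."""
    external = sum(1 for u, v in edges if community[u] != community[v])
    return external / len(edges)


edges_clear, comm_clear = make_benchmark(p_external=0.0)
edges_mixed, comm_mixed = make_benchmark(p_external=1.0)
```

Sweeping `p_external` from 0 towards 1 reproduces the qualitative effect described above: the share of external edges rises and the planted communities become harder to recover.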
TABLE 1 Artificial network parameters
Fig. 3 compares the performance of 8 algorithms on the artificial data network. The proposed LDL algorithm was analyzed experimentally on various datasets of the artificial network and the real networks and compared with seven conventional algorithms, namely LinkLPA, MFM, LFK, NMF, LRLFP, speClust1 and speClust2.
Fig. 3(a) shows the performance of the 8 algorithms on the Jaccard coefficient. As the external edge probability increases, the Jaccard coefficient decreases. When the external edge probability is less than 0.4, the proposed LDL algorithm has a clear advantage; above 0.4, its Jaccard coefficient is slightly lower than that of the other algorithms, but always higher than that of the LinkLPA algorithm.
Fig. 3(b) shows the performance of the algorithms on the Rand index. The overall trend of each algorithm is similar to that in Fig. 3(a): the Rand index gradually decreases as the external edge probability increases. It is noted that the proposed LDL algorithm and LinkLPA have significant advantages over the other algorithms, and when the external edge probability is less than 0.4, the LDL algorithm is better than the LinkLPA algorithm.
Fig. 3(c) shows that the algorithms do not differ much on the Modularity criterion, but the LDL algorithm remains ahead throughout.
The Error values of the algorithms in Fig. 3(d) differ significantly. The Error value of the LDL algorithm is the lowest when the external edge probability is less than 0.8, and the LDL algorithm is only 3% worse than the MFM algorithm when the probability is greater than 0.8. In conclusion, the proposed LDL algorithm is better and more stable than the other 7 algorithms.
4.3 Real network performance comparison
To further evaluate the proposed LDL algorithm, eleven representative social networks of different sizes were selected. In Table 2, Networks denotes the real data network, Nodes the number of nodes, Edges the number of edges, a-co the average clustering coefficient of the nodes, a-Lenth the average path length, and Description the practical meaning of the network. As shown in Table 2:
TABLE 2 Real networks
To better illustrate the overall social network community discovery process, Figs. 4(a) to 4(i) give a brief overview of it, taking the power grid network as an example. There are 9 subgraphs in total, that is, 9 communities are finally formed.
Fig. 4(a) shows the first community structure divided (green cut set); the second community structure (purple cut set) is then divided, as shown in Fig. 4(b); then the third community structure is divided, as shown in Fig. 4(c); and so on, until the ninth community is divided and the convergence criterion is reached, that is, every node is contained in some community, as shown in Fig. 4(i).
The divided social networks have clear community structures. Figs. 5(a) to 5(d) show the visualization results of community discovery by the proposed LDL algorithm in 4 social networks: Dolphins, Lemis, Celegansnertal and Netscience. The LDL algorithm achieves high recognition quality in large-scale data networks (as shown in Table 2); the higher the degree and the average clustering coefficient of a node, the more prominent it is in the visualization and the more easily it becomes a community center around which a community structure forms.
Table 4 shows the results of the proposed LDL algorithm compared to the conventional algorithm on the Jaccard index in the real dataset. The bolded values in the table indicate algorithms that perform optimally, and the shaded gray values indicate algorithms that perform suboptimally.
TABLE 4 Results of the LDL algorithm and the conventional algorithms on the Jaccard index (real datasets)
| Network | LinkLPA | MFM | LDL | NMF | LRLFP | LFK | speClust1 | speClust2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Karate | 0.5 | 0.7375 | 0.6507 | 0.5882 | 0.325 | 0.6052 | 0.5593 | 0.2852 |
| Dolphins | 0.1035 | 0.1918 | 0.2131 | 0.1877 | 0.0541 | 0.2118 | 0.2161 | 0.2136 |
| Lemis | 0.4112 | 0.4793 | 0.6524 | 0.2844 | 0.2410 | 0.6276 | 0.4159 | 0.1972 |
| Public book | 0.3403 | 0.6671 | 0.6440 | 0.6512 | 0.0551 | 0.3951 | 0.6749 | 0.6951 |
| Football | 0.7147 | 0.6357 | 0.4052 | 0.8413 | 0.6920 | 0.0798 | 0.0798 | 0.07798 |
| Celegansnertal | 0.3445 | 0.2151 | 0.4804 | 0.343 | 0.0681 | 0.3551 | 0.2150 | 0.2151 |
| Email | 0.2599 | 0.0460 | 0.2085 | 0.1912 | 0.1251 | 0.0462 | 0.0467 | 0.0467 |
| Public blogs | 0.3112 | 0.5167 | 0.5690 | 0.5426 | 0.0162 | 0.4027 | 0.4120 | 0.4998 |
| Netscience | 0.2186 | 0.1332 | 0.2213 | 0.1780 | 0.0841 | 0.1464 | 0.0239 | 0.0100 |
| Power | 0.1603 | 0.0168 | 0.2240 | 0.2092 | 0.0048 | 0.0023 | 0.1371 | 0.0285 |
| Hep_th | 0.1524 | 0.1912 | 0.2203 | 0.1036 | 0.2015 | 0.1242 | 0.2003 | 0.0972 |
As shown in Table 4, the proposed LDL algorithm performs best on the Lemis, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, where it is superior to the other 7 algorithms; it is suboptimal on the Karate, Dolphins and Email data networks, second to the MFM, speClust1 and LinkLPA algorithms respectively, but better than the other 6 algorithms; on the Public book and Football data networks its performance is moderate, better than that of some of the other algorithms.
TABLE 5 Results of the LDL algorithm and the conventional algorithms on the Rand index (real datasets)
| Network | LinkLPA | MFM | LDL | NMF | LRLFP | LFK | speClust1 | speClust2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Karate | 0.8503 | 0.9251 | 0.8574 | 0.9037 | 0.80214 | 0.8396 | 0.85 | 0.2852 |
| Dolphins | 0.7536 | 0.2523 | 0.7612 | 0.7536 | 0.7721 | 0.2523 | 0.2517 | 0.2136 |
| Lemis | 0.8835 | 0.8558 | 0.8914 | 0.73 | 0.8445 | 0.8168 | 0.4686 | 0.1972 |
| Public book | 0.7304 | 0.8419 | 0.7132 | 0.8377 | 0.639 | 0.3951 | 0.8432 | 0.395 |
| Football | 0.978 | 0.9682 | 0.8874 | 0.76 | 0.95 | 0.0781 | 0.08 | 0.0798 |
| Celegansnertal | 0.8281 | 0.2151 | 0.8468 | 0.7717 | 0.795 | 0.2151 | 0.2151 | 0.2151 |
| Email | 0.9107 | 0.0866 | 0.9233 | 0.9131 | 0.95 | 0.08 | 0.045 | 0.045 |
| Public blogs | 0.6779 | 0.7542 | 0.7799 | 0.7545 | 0.5085 | 0.5007 | 0.4998 | 0.4998 |
| Netscience | 0.9872 | 0.9879 | 0.9935 | 0.9881 | 0.99 | 0.9903 | 0.7792 | 0.01 |
| Power | 0.964 | 0.9599 | 0.976 | 0.9656 | 0.9668 | 0.8995 | 0.9394 | 0.9201 |
| Hep_th | 0.8674 | 0.9015 | 0.9395 | 0.9041 | 0.9192 | 0.803 | 0.9077 | 0.9234 |
As shown in Table 5, the proposed LDL algorithm is optimal on the Lemis, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, where it is superior to the other 7 algorithms; it is suboptimal on the Dolphins and Email data networks, second to the LRLFP algorithm, but better than the other 6 algorithms; on the Karate, Public book and Football data networks its performance is moderate, slightly better than that of some of the other algorithms. For example, on the Karate data network it performs better than 5 algorithms: LFK, speClust1, LinkLPA, LRLFP and speClust2.
TABLE 6 Results of the LDL algorithm and the conventional algorithms on the Modularity index (real datasets)
| Network | LinkLPA | MFM | LDL | NMF | LRLFP | LFK | speClust1 | speClust2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Karate | 0.4427 | 0.4477 | 0.4347 | 0.4459 | 0.3663 | 0.4343 | 0.4116 | 0.1545 |
| Dolphins | 0.46 | 0.0108 | 0.4709 | 0.4486 | 0.4022 | 0.01080 | 0.0054 | 0.1299 |
| Lemis | 0.5882 | 0.5768 | 0.5772 | 0.4849 | 0.5298 | 0.5632 | 0.1088 | 0.2034 |
| Public book | 0.5531 | 0.5091 | 0.5196 | 0.5182 | 0.4117 | 0.4065 | 0.4595 | 0.4209 |
| Football | 0.6189 | 0.5423 | 0.6092 | 0.6236 | 0.6171 | 0.1075 | 0.5933 | 0.5753 |
| Celegansnertal | 0.433 | 0.4378 | 0.4521 | 0.3761 | 0.1722 | 0.2035 | 0.0092 | 0.3874 |
| Email | 0.6178 | 0.0381 | 0.5008 | 0.6547 | 0.6547 | 0.3292 | 0.1002 | 0.3048 |
| Public blogs | 0.3007 | 0.3431 | 0.3967 | 0.367 | 0.1864 | 0.1155 | 0.0087 | 0.2133 |
| Netscience | 0.8085 | 0.8118 | 0.872 | 0.8238 | 0.8238 | 0.8011 | 0.2062 | 0.73 |
| Power | 0.5826 | 0.531 | 0.6438 | 0.6241 | 0.5471 | 0.5289 | 0.6207 | 0.5227 |
| Hep_th | 0.6021 | 0.6754 | 0.7181 | 0.501 | 0.6503 | 0.685 | 0.6821 | 0.5431 |
As shown in Table 6, the proposed LDL algorithm performs best on the Dolphins, Celegansnertal, Public blogs, Netscience, Power and Hep_th data networks, where it is superior to the other 7 algorithms; it is suboptimal on the Lemis and Public book data networks, second to the LinkLPA algorithm, but better than the other 6 algorithms; on the Karate, Football and Email data networks its performance is moderate, slightly better than that of some of the other algorithms. Taking the Football data network as an example, it performs better than 4 algorithms: MFM, LFK, speClust1 and speClust2.
TABLE 7 Results of the LDL algorithm and the conventional algorithms on the Error index (real datasets)
As shown in Table 7, the proposed LDL algorithm has a more pronounced advantage on the Error index: it is optimal on ten data networks (Karate, Dolphins, Lemis, Public book, Celegansnertal, Email, Public blogs, Netscience, Power and Hep_th) and only slightly worse than the LRLFP algorithm on Football. The experimental data show that the algorithm has stronger stability.
In summary, although the proposed LDL algorithm does not perform optimally on every data network, the proportion of networks on which it is optimal or suboptimal is much higher than for the other algorithms. The LDL algorithm performs better on social networks with a higher average clustering coefficient and a more complex structure, and is therefore better suited to the large scale and complexity of modern social networks.
Figs. 6(a) to 6(e) compare the community structures obtained by the LDL algorithm with the reference structures in 5 real data networks: Karate, Lemis, Celegansnertal, Public blogs and Power grid. The abscissa represents the node number and the ordinate represents community membership, namely the community to which the node belongs; blue is the reference community structure and red is the community structure produced by the LDL algorithm. The more similar the community structure after the algorithm is executed is to the reference community structure, the higher the score.
Table 8 summarizes the performance on each index of the LDL algorithm and the conventional algorithms listed in Tables 4 to 7, using the loss function constructed by the invention, $Y = \log_{10}\left((X_1 + X_2 + X_3)/(X_4 + 1)\right)$. $X_1$ represents the coefficient of variation on the Jaccard index; $X_2$ represents the coefficient of variation on the Rand index; $X_3$ represents the coefficient of variation on the Modularity index; $X_4$ represents the coefficient of variation on the Error index; $Y$ represents the constructed loss function.
TABLE 8 Results of the LDL algorithm and the conventional algorithms on each index (real datasets)
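The loss function can be sketched in a few lines of Python. The sketch follows the reading given in the discussion of Table 8, where the coefficient-of-variation score is the mean divided by the standard deviation (higher is better). Whether the population or the sample standard deviation was used is not stated; the population form is assumed here, and the function names are illustrative.

```python
import math
import statistics


def variation_score(values):
    """Mean divided by standard deviation (population form assumed).

    Raises ZeroDivisionError if all values are identical.
    """
    return statistics.mean(values) / statistics.pstdev(values)


def loss_score(jaccard_vals, rand_vals, modularity_vals, error_vals):
    """Y = log10((X1 + X2 + X3) / (X4 + 1)) over per-network index values."""
    x1 = variation_score(jaccard_vals)
    x2 = variation_score(rand_vals)
    x3 = variation_score(modularity_vals)
    x4 = variation_score(error_vals)
    return math.log10((x1 + x2 + x3) / (x4 + 1))


# Toy input, not experimental data: every index takes the values 1 and 3,
# so each X equals 2 and Y = log10(6 / 3).
toy = [1.0, 3.0]
y_toy = loss_score(toy, toy, toy, toy)
```

Because the three quality indices sit in the numerator and the Error index in the denominator, a larger Y corresponds to higher and more stable quality scores relative to the Error term.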
As shown in Table 8, the performance of the LDL algorithm on each index is analyzed through the statistical mean, the standard deviation, the coefficient of variation and the constructed loss function. First, the proposed LDL algorithm has the highest score (bold data values) on the Jaccard and Rand indices. On the Jaccard index, the mean of the LDL algorithm over the data networks is the highest, but its standard deviation is higher than that of LinkLPA, which indicates that the performance of the LDL algorithm varies more across the data networks than that of LinkLPA; taken together, however, its coefficient-of-variation score (mean/standard deviation, higher is better) is the highest. On the Rand index, the LDL algorithm is superior to the other algorithms in both mean and standard deviation, and its coefficient-of-variation score is markedly higher (close to 77 percent) than that of NMF, the best of the other algorithms. Secondly, the score of the LDL algorithm on the Modularity index is second to that of the LinkLPA algorithm, because the performance of the LDL algorithm varies more across the data networks than that of LinkLPA. The LDL algorithm performs best on the Error index: its Error rate is only 0.0548, far better than that of the other algorithms, and the experimental data show that the proposed algorithm has stronger robustness. Finally, its score on the loss function is also the highest, approximately 7 percentage points higher than that of the best conventional method.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.