CN112015954A

CN112015954A - Community detection method based on the Matthew effect

Info

Publication number: CN112015954A
Application number: CN202010884765.1A
Authority: CN
Inventors: 孙泽军; 常新峰; 王启明
Original assignee: Pingdingshan University
Current assignee: Pingdingshan University
Priority date: 2020-08-28
Filing date: 2020-08-28
Publication date: 2020-12-01
Anticipated expiration: 2040-08-28
Also published as: CN112015954B

Abstract

The invention discloses a community detection method based on the Matthew effect, relating to the technical field of information science. The method comprises the following steps: inputting a network G consisting of nodes and edges; initializing the network G by assigning each node to its own independent community; calculating the core groups of the network G; simulating the Matthew effect process of the Matthew effect model by an iterative method; judging whether the network structure has reached the optimal division; if the optimal division has not been reached, performing the iterative simulation of the Matthew effect again; and if the optimal community division has been reached, carrying out community division to obtain the community division result.

Description

Community detection method based on the Matthew effect

Technical Field

The invention relates to the technical field of information science, in particular to a community detection method based on the Matthew effect.

Background

The community structure reflects the structural characteristics of a network at the mesoscopic scale and is widely present in real networks. Communities are also commonly referred to as clusters, groups, and the like. Owing to the diversity of complex networks and the complexity of community structures, there is at present no uniform, precise definition of community structure in complex networks. A community is generally considered to be a set of nodes in which the connections between nodes inside the set are dense, while the connections to nodes outside the set are sparse.

Community detection provides an important way of analyzing the structural characteristics of complex networks, studying their organizational functions, and mining their latent relationships. In addition, community detection has been widely used in many disciplines and fields, such as computer science, bioinformatics, sociology, economics, and epidemiology. For example, in a social network it may be used to discover organizational groups and provide personalized services. In an e-commerce network, community detection may be used for intelligent recommendation and precision marketing. In crime and anti-terrorism networks, it can be used to find criminal groups. It can also be used to optimize routing tables in the Internet, to find closely related proteins, and to analyze the cooperative relations among authors in citation networks. Therefore, researching and designing an efficient and accurate community structure detection method is of great significance.

Heretofore, various community detection methods have been proposed from different perspectives, including graph-partitioning-based methods, modularity-based methods, and dynamics-based methods. Among these, identifying communities by using the inherent topology and dynamics of the network is an emerging approach that is simple, efficient, accurate, and data-driven. However, most existing methods have limitations, such as high time complexity, complex parameter setting, and poor stability. For example, WalkTrap is a dynamics-based approach that uses random walks to obtain high-quality communities, but its time complexity is O(mn^2). The Markov clustering algorithm (MCL) is a well-known dynamics-based clustering method that is widely used in graph clustering, but it is sensitive to the "inflation" parameter. The label propagation algorithm (LPA) has approximately linear time complexity, but its community detection results are not always stable. The Fluid Communities (FluidC) algorithm is a diffusion-based approach similar to LPA, and FluidC often returns different results on each run. In summary, the conventional community detection methods all have limitations.

In view of the above, a community detection method based on the Matthew effect is proposed, which can reveal the community structure in a network and solve the problems of high time complexity, complex parameter setting, and poor stability in the existing methods.

Disclosure of Invention

The invention aims to provide a community detection method based on the Matthew effect, which can reveal the community structure in a network and solve the problems of high time complexity, complex parameter setting, and poor stability in the conventional methods.

The invention provides a community detection method based on the Matthew effect, which comprises the following steps:

S1: inputting a network G consisting of nodes and edges;

S2: initializing the network G by assigning each node to its own independent community;

S3: calculating the core groups of the network G;

S4: simulating the Matthew effect process of the Matthew effect model by an iterative method;

S5: judging whether the network structure has reached the optimal division;

S6: if the optimal division has not been reached, performing the iterative simulation of the Matthew effect again; and if the optimal community division has been reached, carrying out community division to obtain the community division result.

Further, the community division in step S2 comprises the following steps:

S21: taking the node number of each node as its label;

S22: assigning each node to its own independent community.
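As a minimal sketch of steps S21 and S22, assuming the network is given simply as a collection of node numbers (the function name `init_communities` is hypothetical and does not appear in the source), the initialization can be written as:

```python
def init_communities(nodes):
    """S21-S22: use each node number as its label, one community per node."""
    return {v: v for v in nodes}


# Illustrative five-node network; the node numbers are made up for the example.
labels = init_communities(range(1, 6))
```

After this step the number of communities equals the number of nodes, which is the starting point for the core-group calculation in step S3.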

Further, in step S3 the core groups of the network G are calculated by using the node attraction formula, and the calculation steps are as follows:

S31: calculating the Jaccard similarity coefficient between nodes:

Given an undirected network G = (V, E), the Jaccard similarity coefficient of nodes u and v is defined as:

$$J_{uv} = \frac{|\Gamma_u \cap \Gamma_v|}{|\Gamma_u \cup \Gamma_v|} \quad (1)$$

where $\Gamma_u = N(u) \cup \{u\}$ is the neighborhood set of node u, comprising node u itself and the nodes directly connected to node u;
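The Jaccard coefficient of step S31 can be sketched as follows. This is an illustration only: it assumes the standard Jaccard definition (intersection size over union size) applied to the closed neighborhoods N(u) ∪ {u}, and it assumes the graph is stored as a dictionary of neighbor sets. The names `closed_neighborhood`, `jaccard`, and `adj` are hypothetical.

```python
def closed_neighborhood(adj, u):
    """Neighbors of u together with u itself, i.e. N(u) ∪ {u}."""
    return set(adj[u]) | {u}


def jaccard(adj, u, v):
    """Jaccard similarity of the closed neighborhoods of u and v."""
    gu, gv = closed_neighborhood(adj, u), closed_neighborhood(adj, v)
    return len(gu & gv) / len(gu | gv)


# Toy undirected graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(jaccard(adj, 1, 2))  # closed neighborhoods are identical
print(jaccard(adj, 3, 4))  # two shared nodes out of four
```

Including the node itself in its neighborhood means two adjacent nodes always share at least two elements, so the coefficient is never zero for an edge.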

S32: calculating the node attraction:

Given an undirected network G = (V, E), the attraction exerted by node u on node v is defined as:

$$NA_{v \to u} = J_{uv} \cdot D_u \quad (2)$$

where $J_{uv}$ denotes the Jaccard similarity coefficient between nodes u and v, and $D_u$ denotes the degree of node u;
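Formula (2) can be illustrated with a short sketch. It assumes the same adjacency-dictionary representation and closed-neighborhood Jaccard coefficient as in S31; the function name `node_attraction` and the toy graph are hypothetical.

```python
def node_attraction(adj, v, u):
    """NA_{v->u} = J_uv * D_u: the pull that node u exerts on node v."""
    gu = set(adj[u]) | {u}
    gv = set(adj[v]) | {v}
    j_uv = len(gu & gv) / len(gu | gv)
    return j_uv * len(adj[u])  # D_u is the degree of u


# Toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(node_attraction(adj, 4, 3))  # 0.5 * 3
print(node_attraction(adj, 3, 4))  # 0.5 * 1
```

Although the Jaccard coefficient is symmetric, the attraction is not: the higher-degree node 3 pulls on node 4 three times as strongly as node 4 pulls on node 3.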

S33: calculating the core groups of the network by using the node attraction formula:

Because nodes exert attraction, each node attracts its neighbors to join its community, and each node chooses to join the community of the node that attracts it most strongly. This choice is formally defined by formula (3), where $C_v$ denotes the community to which node v belongs and $D_v$ is the degree of node v. If the degree of node v is greater than the degree of its neighbor nodes, node v remains in its original community $C_v$; if the degree of a neighbor node is greater than the degree of node v, node v selects the node with the greatest attraction $\max(NA_{v \to u})$ and joins its community. Through iteration, nodes with more resources attract neighboring nodes to join their communities, and a plurality of core groups is thereby formed.
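The selection rule of step S33 can be sketched as below. This is not a definitive implementation: it follows the verbal description only, and it makes three assumptions that the text does not settle, namely that a node keeps its label when no neighbor has a strictly greater degree, that ties in attraction are broken by the smaller node number, and that one sequential pass over the nodes is performed. The names `core_groups` and `jaccard` are hypothetical.

```python
def jaccard(adj, u, v):
    """Jaccard similarity of the closed neighborhoods N(u) ∪ {u} and N(v) ∪ {v}."""
    gu, gv = set(adj[u]) | {u}, set(adj[v]) | {v}
    return len(gu & gv) / len(gu | gv)


def core_groups(adj):
    """One pass of the S33 rule; returns a node -> community label mapping."""
    labels = {v: v for v in adj}  # S2: every node starts in its own community
    for v in sorted(adj):
        # Assumption: v keeps its label if no neighbor has a strictly larger degree.
        if all(len(adj[v]) >= len(adj[u]) for u in adj[v]):
            continue
        # Otherwise v joins the community of the neighbor with the largest
        # attraction NA_{v->u} = J_uv * D_u (ties: smallest node number).
        best = max(sorted(adj[v]), key=lambda u: jaccard(adj, u, v) * len(adj[u]))
        labels[v] = labels[best]
    return labels


# Two hubs (1 and 5) joined through node 4; node 4 is pulled harder by hub 5.
adj = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1, 5},
       5: {4, 6, 7, 8}, 6: {5}, 7: {5}, 8: {5}}
groups = core_groups(adj)
```

In this toy graph, node 4 is attracted with strength 0.4 × 3 = 1.2 by hub 1 and (1/3) × 4 ≈ 1.33 by hub 5, so it joins hub 5, and two core groups emerge around the two hubs.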

Further, the simulation of the Matthew effect process in step S4 comprises the following steps:

S41: calculating the community attraction:

The attraction of a community $c_i$ to a node v is calculated by formula (4), which depends on the proximity of node v to community $c_i$. Depending on whether the numbers of edges from the node to the different communities are the same, the proximity is formally defined by formula (5) in terms of the internal degree of node v in community $c_i$.

S42: simulating the Matthew effect process:

After the core groups are obtained in step S3, more and more nodes are attracted by the different core groups, and the Matthew effect is simulated according to formula (6), where $c_i$ is a neighbor community connected to node v and the attraction of community $c_i$ to node v is the quantity given by formula (4).
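The iterative relabelling of step S42 can be sketched with a pluggable attraction function. Formulas (4) and (5) are not reproduced in this text, so the default used here, the number of neighbors that node v has inside a community, is only a stand-in assumption and is not the patent's community attraction. The names `internal_degree` and `matthew_iterations`, the `max_iter` safeguard, and the rule of moving only on a strict improvement are likewise assumptions.

```python
def internal_degree(adj, labels, v, c):
    """Stand-in attraction (assumption): number of neighbors of v with label c."""
    return sum(1 for u in adj[v] if labels[u] == c)


def matthew_iterations(adj, labels, attraction=internal_degree, max_iter=100):
    """Repeat: each node moves to its most attractive neighbor community."""
    labels = dict(labels)
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):
            candidates = sorted({labels[u] for u in adj[v]})
            if not candidates:
                continue  # isolated node keeps its label
            best = max(candidates, key=lambda c: attraction(adj, labels, v, c))
            # Move only on a strict improvement, so the loop settles.
            if attraction(adj, labels, v, best) > attraction(adj, labels, v, labels[v]):
                labels[v] = best
                changed = True
        if not changed:  # no label changed: the structure is stable
            break
    return labels


# Two triangles joined by the edge 3-4; nodes 3 and 4 start in their own groups.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
start = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 6}
final = matthew_iterations(adj, start)
```

Any function with the same signature can be passed as `attraction`, which is where an implementation of formulas (4) and (5) would be substituted.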

Compared with the prior art, the invention has the following remarkable advantages:

The invention provides a community detection method based on the Matthew effect (CDME), which reveals the community structure in a network in a natural way by considering the community detection problem from the viewpoint of the Matthew effect. Compared with existing community detection methods, the CDME algorithm has higher efficiency and better community detection quality on both generated networks and real networks. Unlike algorithms that rely on prior knowledge and parameter settings, the CDME method requires no parameter setting. Because it adopts a local interaction mode, CDME only needs to compute the attraction of neighboring nodes, which reduces the time complexity to O(n·k²), where k denotes the average degree of the nodes and is usually small. Therefore, the CDME algorithm can be applied to large-scale networks. The community detection method based on the Matthew effect can reveal the community structure in a network and solve the problems of high time complexity, complex parameter setting, and poor stability in the existing methods.
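To give a sense of scale for the O(n·k²) bound, the following worked arithmetic compares it with the O(mn²) bound quoted for WalkTrap in the Background. The values n = 10^6 and k = 15 are taken from the largest scalability setting described in Example 8; the relation m = nk/2 holds for any undirected graph. The figures are orders of magnitude of the bounds, not measured operation counts.

```latex
\begin{align*}
n k^2 &= 10^{6} \times 15^{2} = 2.25 \times 10^{8} \\
m &= \frac{n k}{2} = \frac{10^{6} \times 15}{2} = 7.5 \times 10^{6} \\
m n^2 &= 7.5 \times 10^{6} \times \left(10^{6}\right)^{2} = 7.5 \times 10^{18}
\end{align*}
```

Under these assumptions the two bounds differ by roughly ten orders of magnitude, which is consistent with the statement that a small average degree makes the method applicable to large networks.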

Drawings

FIG. 1 is a flow chart of the community detection method based on the Matthew effect according to an embodiment of the present invention;

FIG. 2 is a community attraction diagram of the community detection method based on the Matthew effect according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the community detection process of the community detection method based on the Matthew effect according to an embodiment of the present invention;

FIG. 4 is a performance comparison of the comparison algorithms on generated benchmark networks when the parameter μ varies from 0.1 to 0.8, according to an embodiment of the present invention;

FIG. 5 is a performance comparison of the comparison algorithms on synthetic benchmark networks when the average degree changes, according to an embodiment of the present invention;

FIG. 6 is a diagram of the community detection result of CDME on the Zachary karate club network according to an embodiment of the present invention;

FIG. 7 is a diagram of the community detection result of CDME on the American college football network according to an embodiment of the present invention;

FIG. 8 is a diagram of the community detection result of CDME on the U.S. political books network according to an embodiment of the present invention;

FIG. 9 is a diagram of the community detection result of CDME on the dolphin network according to an embodiment of the present invention;

FIG. 10 is a running time comparison of the comparison algorithms when the number of nodes varies from 2K to 50K, according to an embodiment of the present invention;

FIG. 11 is a running time comparison of the comparison algorithms when the number of nodes varies from 1K to 10³K, according to an embodiment of the present invention.

Detailed Description

The technical solutions of the embodiments of the present invention are clearly and completely described below with reference to the drawings in the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.

First, referring to Table 1, the symbols used in the present application are defined, where G = (V, E) is a given network, representing an undirected and unweighted graph, V is the set of nodes, and E is the set of edges.

TABLE 1 description of symbols

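Referring to FIGS. 1-11, the present invention provides a community detection method based on the Matthew effect, comprising the following steps:

S1: inputting a network G consisting of nodes and edges;

S2: initializing the network G by assigning each node to its own independent community;

S3: calculating the core groups of the network G;

S4: simulating the Matthew effect process of the Matthew effect model by an iterative method;

S5: judging whether the network structure has reached the optimal division;

S6: if the optimal division has not been reached, performing the iterative simulation of the Matthew effect again; and if the optimal community division has been reached, carrying out community division to obtain the community division result.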

Example 1

The community division in step S2 comprises the following steps:

S21: taking the node number of each node as its label;

S22: assigning each node to its own independent community.

Example 2

Before the core groups are computed, the attraction between nodes must be calculated. This attraction depends on the resources owned by the nodes and on the similarity between the nodes. According to the topological structure of the network, the resources owned by a node are represented by its degree, and the similarity between nodes is represented by the Jaccard similarity coefficient. Owing to the mutual attraction between nodes, nodes attract adjacent nodes to join their communities, and core groups are formed.

In step S3 the core groups of the network G are calculated by using the node attraction formula, and the calculation steps are as follows:

S31: calculating the Jaccard similarity coefficient between nodes:

$$J_{uv} = \frac{|\Gamma_u \cap \Gamma_v|}{|\Gamma_u \cup \Gamma_v|} \quad (1)$$

S32: calculating the node attraction:

$$NA_{v \to u} = J_{uv} \cdot D_u \quad (2)$$

S33: calculating the core groups of the network by using the node attraction formula:

Each node chooses to join the community of the node that attracts it most strongly according to formula (3), where $C_v$ denotes the community to which node v belongs and $D_v$ is the degree of node v. If the degree of node v is greater than the degree of its neighbor nodes, node v remains in its original community $C_v$; if the degree of a neighbor node is greater than the degree of node v, node v selects the node with the greatest attraction $\max(NA_{v \to u})$ and joins its community. Through iteration, nodes with more resources attract neighboring nodes to join their communities, and a plurality of core groups is thereby formed.

Example 3

The simulation of the Matthew effect process in step S4 comprises the following steps:

S41: calculating the community attraction:

The attraction of a community $c_i$ to a node v is calculated by formula (4), which depends on the proximity of node v to community $c_i$. Depending on whether the numbers of edges from the node to the different communities are the same, the proximity is formally defined by formula (5) in terms of the internal degree of node v in community $c_i$.

S42: simulating the Matthew effect process:

The Matthew effect is simulated according to formula (6), where $c_i$ is a neighbor community connected to node v and the attraction of community $c_i$ to node v is the quantity given by formula (4).

The better a team is, the more attention it attracts, and the more likely it is to attract more people to join. Likewise, a node will naturally choose to join the most attractive community. When a node joins a community, the structure of the community may change, and so may its attraction. Thus, community detection based on the Matthew effect is a continually iterating process. In each iteration, each node updates its community label according to formula (6). Finally, driven by the network topology, the communities to which the nodes belong no longer change, and the network structure reaches an equilibrium state. The community structure of the network is then naturally revealed.

Example 4

In step S5, based on the proposed Matthew effect model, the core groups continuously attract the surrounding nodes to join, so as to form larger community structures, and an iterative method is used to simulate the Matthew effect process. Further, in step S5 the quality of each community division is evaluated using the Normalized Mutual Information (NMI) index. Over time, the network structure reaches a steady state under the drive of the topology, and the best community division is then obtained.
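The NMI index mentioned above can be sketched in plain Python. Several normalizations of mutual information are in use; this sketch assumes the common form 2·I(X;Y) / (H(X) + H(Y)), which the source does not specify. The function name `nmi` and the toy label lists are hypothetical.

```python
from collections import Counter
from math import log


def nmi(truth, pred):
    """Normalized mutual information, assuming 2*I / (H_truth + H_pred)."""
    n = len(truth)
    ct, cp = Counter(truth), Counter(pred)
    joint = Counter(zip(truth, pred))
    # Mutual information between the two labelings.
    mi = sum(nij / n * log(nij * n / (ct[a] * cp[b]))
             for (a, b), nij in joint.items())
    # Entropies of each labeling.
    h_t = -sum(c / n * log(c / n) for c in ct.values())
    h_p = -sum(c / n * log(c / n) for c in cp.values())
    if h_t + h_p == 0:  # both labelings put everything in one community
        return 1.0
    return 2 * mi / (h_t + h_p)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same partition, renamed labels
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions share no information
```

NMI compares partitions rather than label names, so relabelling the communities does not change the score.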

Example 5

Referring to FIG. 2, the network is composed of 13 nodes divided into two communities, $c_i$ and $c_j$. The attractions of the two communities are calculated using formulas (4) and (5). For node 6, its internal degree in community $c_i$ and its internal degree in community $c_j$ are computed first. The internal degrees of node 6 in these two communities are clearly not equal; therefore, the first part of formula (5) applies and gives the proximity of node 6 to each community, from which the attraction of community $c_i$ to node 6 and the attraction of community $c_j$ to node 6 are calculated. Because the attraction of community $c_i$ to node 6 is stronger than that of community $c_j$, node 6 is more likely to join community $c_i$.

Similarly, node 13 has the same internal degree in both communities. In this case the influence of indirect neighbors on node 13 needs to be further considered. According to the second part of formula (5), the attractions of the two communities to node 13 are calculated. The attraction of community $c_i$ to node 13 is larger, and node 13 is therefore more likely to join community $c_i$.

Example 6

Referring to FIG. 3, the community detection process based on the Matthew effect is shown, which is divided into four stages:

Stage one: as shown in FIG. 3(a), each individual initially has its own resources and is regarded as an independent group or community; edges represent the relationships between individuals, and each individual has a different label;

Stage two: as shown in FIG. 3(b), individuals with more resources attract neighboring individuals to join them and form initial core groups; for example, since individual U7 has more resources, individuals U1, U8, U10, and U12 are attracted to join its group;

Stage three: as shown in FIG. 3(c), owing to the Matthew effect, more and more individuals are attracted to join the core groups, forming larger communities; for example, individual U9 is attracted to the group labeled L7;

Stage four: as shown in FIG. 3(d), each individual has been attracted to a community, and the community structure reveals itself.

Example 7

In order to verify the effectiveness of the method, the community detection method based on the Matthew effect (CDME) is compared with nine representative community detection algorithms of the prior art. Experiments are carried out on generated networks and on real networks, and three widely used evaluation measures (NMI, ARI, and Purity) are adopted to evaluate the accuracy of community detection.
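For reference, the Purity and ARI measures named above can be sketched as follows. The sketch assumes the usual textbook definitions (Purity as the fraction of nodes that fall in the majority true class of their detected community, and ARI as the pair-counting Adjusted Rand Index); the source does not spell these out, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def purity(truth, pred):
    """Fraction of nodes belonging to the majority true class of their cluster."""
    best = Counter()
    for (p, _t), nij in Counter(zip(pred, truth)).items():
        best[p] = max(best[p], nij)
    return sum(best.values()) / len(truth)


def ari(truth, pred):
    """Adjusted Rand Index computed from pair counts."""
    n = len(truth)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single community
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
crossed = [0, 1, 0, 1]  # every detected community mixes both true classes
```

Purity rewards many small communities, which is one reason it is typically reported alongside NMI and ARI rather than alone.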

Each algorithm was run 20 times independently on each network, and the results were then averaged. All experiments were run on the same desktop computer with a 3.3 GHz Intel Core i5 processor and 16.0 GB of memory.

(I) Experiments on generated networks

Synthetic benchmark networks are created using the well-known LFR model. An LFR-generated network can be conveniently controlled through a number of attribute parameters of real networks, such as the average degree, the community size, and the clustering coefficient. The most important parameter of the LFR model is the mixing parameter μ, which controls the difficulty of community division. Table 2 describes the parameters of the LFR model:

table 2: description of LFR model parameters

Referring to FIG. 4, to evaluate the performance of the comparison algorithms, the parameter μ is varied from 0.1 to 0.8 to increase the difficulty of the generated networks. Meanwhile, the other parameters are fixed at n = 1000, k = 15, Cmin = 10, Cmax = 50, τ₁ = 2, and τ₂ = 1. FIG. 4 shows the performance of each algorithm on the three evaluation measures. On the NMI index, the CDME, Infomap, Ncut, LPA, MCL, WT, Louvain, and FluidC methods yield excellent community divisions when the parameter μ varies from 0.1 to 0.5. However, the performance of LPA, Infomap, and FluidC starts to decline significantly as μ increases. In particular, when μ increases to 0.6, the NMI value of LPA is essentially zero. In contrast, the MCL and CDME algorithms are more robust than the other comparison methods, and they still achieve good community division when μ is increased to 0.8.

To reflect the performance of each comparison algorithm more accurately, the standard deviation is used to evaluate the community detection results. Specifically, the parameter μ is increased from 0.1 to 0.8, and 10 networks are generated for each value of the parameter, for a total of 80 networks. Table 3 shows the average accuracy and standard deviation of each algorithm on these synthetic networks. Taking the NMI results as an example, the standard deviation is very low for most methods. For example, when the parameter μ varies between 0.1 and 0.4, the NMI values of CDME, Infomap, and WT reach the maximum of 1 and their standard deviations reach the minimum of 0. Even when the parameter μ is increased to 0.8, MCL and CDME still achieve good community division. By contrast, the average accuracy of LPA is close to 0 when the parameter μ is increased to 0.6.
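The per-setting summary described above (an average and a standard deviation over the networks generated for one value of μ) can be sketched with the standard library. The source does not say whether the sample or the population standard deviation is used; this sketch assumes the sample form. The function name `summarize` and the score list are hypothetical, not values from Table 3.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return mean(scores), (stdev(scores) if len(scores) > 1 else 0.0)


# Hypothetical NMI scores of one algorithm on 10 networks for one value of mu.
perfect_runs = [1.0] * 10
avg, sd = summarize(perfect_runs)
```

A mean of 1 with a standard deviation of 0, as in this made-up case, corresponds to an algorithm that recovers the planted communities on every generated network.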

TABLE 3 average accuracy and Standard deviation of algorithms on NMI index

Referring to FIG. 5, in order to evaluate the effectiveness of each comparison algorithm on networks of different community density, the parameter μ is fixed at 0.1 and the average degree parameter k is varied from 5 to 25 to generate benchmark networks. FIG. 5 shows the performance of each comparison algorithm on the different measures as k varies from 5 to 25. It can be observed that the CDME, Ncut, Infomap, MCL, LPA, WT, Louvain, and EDCD methods all work well on these generated networks when the average degree k > 10. However, when k ≤ 5, the community detection quality of the FG, FluidC, and EDCD methods degrades significantly.

(II) experiments on real networks

In order to further evaluate the performance of each comparison algorithm on real networks, several real networks with different characteristics and from different domains are selected for the experiments. Since the real community division of these networks is known, the NMI, ARI, and Purity indices are again used to evaluate the community detection quality of the respective algorithms. The basic information of these real networks is given in Table 4.

TABLE 4 statistical characteristics of each real network

Here |V| denotes the number of nodes, |E| the number of edges, k the average degree, CC the clustering coefficient, and #C the number of communities.
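The statistics listed above are linked by a simple identity: in an undirected network the average degree is k = 2|E| / |V|. The sketch below checks this against two of the networks used in the experiments. The node and edge counts are the commonly cited sizes of these public datasets (34 nodes and 78 edges for the Zachary karate club, 115 nodes and 613 edges for the American college football network); they are supplied here as assumptions and are not read from Table 4.

```python
def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge contributes to two nodes."""
    return 2 * num_edges / num_nodes


# Commonly cited dataset sizes (assumed, not taken from Table 4).
karate_k = average_degree(34, 78)
football_k = average_degree(115, 613)
print(round(karate_k, 2), round(football_k, 2))
```

The football value rounds to 10.66, matching the average degree quoted for that network in the discussion of FIG. 7.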

TABLE 5 Performance of different comparison algorithms on real networks

Tables 5 and 6 show the community detection results of each comparison algorithm on the real networks, where #C denotes the number of communities obtained by each algorithm. It can be seen that the CDME algorithm performs very well on these real networks, achieving good community detection quality.

Referring to FIG. 6, for the Zachary karate club network, the CDME method successfully detects two communities, achieves the highest index scores (NMI = 1, ARI = 1, Purity = 1), and outperforms the other comparison algorithms. FIG. 6 is a visualization of the CDME community detection result, which identifies a community structure completely consistent with the true division. The existing algorithms cannot assign the tenth node to the right community, because its degree of connection to the two communities is essentially equal. The present application can nevertheless assign this node correctly, because the CDME method considers not only the attraction of nodes but also the attraction of communities. The performance of each comparison algorithm is shown in Table 6:

TABLE 6 Performance of different comparison algorithms on large networks

Referring to FIG. 7, for the American college football network, the existing algorithms achieve a good clustering effect owing to the high average degree (k = 10.66). FIG. 7 shows the communities detected by CDME; the CDME algorithm automatically detects 12 high-quality communities (NMI = 0.93).

Referring to FIG. 8, for the political books network, CDME achieves good clustering results with the highest NMI and ARI scores (NMI = 0.58, ARI = 0.67). FIG. 8 is a visualization of the communities detected by CDME; three communities are discovered.

Referring to FIG. 9, for the dolphin network, CDME achieves good results and performs better than the other comparison algorithms.

To evaluate the performance of the comparison algorithms on large-scale networks, experiments were performed using three large datasets: Amazon, DBLP, and LiveJournal.

On the Amazon network, the CDME, WT, and Infomap algorithms achieve the best performance, with very high NMI and Purity values (NMI = 0.9, Purity = 0.99).

On the DBLP collaboration network, the CDME, LPA, and FluidC methods achieve good performance with an NMI value of 0.75. Meanwhile, CDME achieves the highest Purity value (Purity = 0.92).

On the LiveJournal social network, CDME obtains the best community division quality with the highest index values (NMI = 0.93, ARI = 0.49, Purity = 0.98). The Ncut, WT, FG, Infomap, MCL, and EDCD algorithms still did not yield a community division result after running for more than 10 days. The LPA, Louvain, and FluidC algorithms also achieve good community detection results.

In summary, Infomap and CDME both give good community detection results on the generated benchmark networks, and both obtain most of the highest index values. On the real networks, CDME also achieves the best community division. Therefore, CDME can not only process networks of different sizes but also achieve high community detection quality.

Example 8

With reference to FIGS. 10-11, the running times of the method of the present application and of the existing algorithms are analyzed. In order to evaluate the scalability of the CDME algorithm with respect to network size, the present embodiment generates benchmark networks of different sizes using the LFR model. The average degree is fixed at k = 15, the number of nodes varies from 1,000 to 1,000,000, and the running time of each algorithm on the networks of different scales is then measured. FIGS. 10 and 11 show the running times of the respective comparison algorithms at the different network scales. It can be observed that CDME is faster than Ncut, EDCD, WT, FG, and MCL, because its time complexity is O(k²·n), where the value of k is usually small. Thus, the CDME algorithm can be used to handle large-scale networks. CDME is slower than the existing LPA, Louvain, and FluidC algorithms, but the community quality detected by those algorithms is poorer than that of CDME. In addition, the LPA and FluidC methods suffer from stability problems.

The above disclosure describes only a few specific embodiments of the present invention. The present invention is not, however, limited to these embodiments, and any variations that can be conceived by those skilled in the art are intended to fall within the scope of protection of the present invention.

Claims

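1. A community detection method based on the Matthew effect, characterized by comprising the following steps:

S1: inputting a network G consisting of nodes and edges;

S2: initializing the network G by assigning each node to its own independent community;

S3: calculating the core groups of the network G;

S4: simulating the Matthew effect process of the Matthew effect model by an iterative method;

S5: judging whether the network structure has reached the optimal division;

S6: if the optimal division has not been reached, performing the iterative simulation of the Matthew effect again; and if the optimal community division has been reached, carrying out community division to obtain the community division result.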

2. The community detection method based on the Matthew effect of claim 1, wherein the community division in step S2 comprises the following steps:

S21: taking the node number of each node as its label;

S22: assigning each node to its own independent community.

3. The community detection method based on the Matthew effect of claim 1, wherein in step S3 the core groups of the network G are calculated by using a node attraction formula, by the following steps:

S31: calculating the Jaccard similarity coefficient between nodes:

$$J_{uv} = \frac{|\Gamma_u \cap \Gamma_v|}{|\Gamma_u \cup \Gamma_v|} \quad (1)$$

where $\Gamma_u = N(u) \cup \{u\}$ is the neighborhood set of node u, comprising node u itself and the nodes directly connected to node u;

S32: calculating the node attraction:

$$NA_{v \to u} = J_{uv} \cdot D_u \quad (2)$$

4. The community detection method based on the Matthew effect of claim 1, wherein the simulation of the Matthew effect process in step S4 comprises the following steps:

S41: calculating the community attraction:

The attraction of a community $c_i$ to a node v is calculated by formula (4), which depends on the proximity of node v to community $c_i$; the proximity is formally defined by formula (5) in terms of the internal degree of node v in community $c_i$.

S42: simulating the Matthew effect process:

After the core groups are obtained in step S3, more and more nodes are attracted by the different core groups, and the Matthew effect is simulated according to formula (6), where $c_i$ is a neighbor community connected to node v and the attraction of community $c_i$ to node v is the quantity given by formula (4).