CN110738418A

CN110738418A - Detection method of weakly connected overlapping communities

Info

Publication number: CN110738418A
Application number: CN201910981098.6A
Authority: CN
Inventors: 许小媛; 刘芳; 黄金国; 李海波
Original assignee: Jiangsu Open University of Jiangsu City Vocational College
Current assignee: Jiangsu Open University of Jiangsu City Vocational College
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2020-01-31

Abstract

The invention discloses a method for detecting weakly connected overlapping communities. The method comprises: receiving a community detection graph model and dividing the graph communities according to the number of processors; calculating an influence propagation model for each partition, wherein the influence propagation model adjusts the edge weights of the approximate active edges of each partition according to the neighborhood edge density; and detecting the weakly connected overlapping communities using a time interaction bias algorithm.

Description

Detection method of weakly connected overlapping communities

Technical Field

The invention relates to the technical field of community detection, in particular to a detection method of weakly connected overlapping communities.

Background

In modern society, social networks such as Facebook, Twitter, and LinkedIn have become an important part of people's daily lives. About 68% of online users use social networks to obtain news or to connect with friends, family, and other acquaintances, and many of these users form or join online communities.

In such applications, standard community detection methods, such as the Louvain method, the Infomap method, label propagation, and Newman's leading eigenvector method, are mainly concerned with the relational connections between users and depend on follower/followee links. These techniques represent the network as a static structure with stable links, because such links persist for a long time. Communities are detected, for example, by the following methods:

(1) A detection method for disjoint and overlapping communities has been proposed based on the Louvain method; it implements a graph partitioning process that handles social graphs at an acceptable time cost.

(2) By introducing a weighted community clustering metric, scalable community detection on large-scale graph community models is realized with the Infomap method; the metric depends on the triangle structures within a community.

(3) Based on a label propagation algorithm model and the PageRank method, a method has been proposed that addresses processor load balancing in large-scale community detection.

(4) Based on Newman's leading eigenvector method, community detection is treated as an NP-hard optimization problem, and a heuristic search algorithm is then used to detect the community structure effectively.

The above algorithms all work well, but observations on a Twitter dataset show that most users interact independently of their follower links: about 70% of users do not interact with the users to whom they are linked by following.

Disclosure of Invention

The invention aims to provide a method for detecting weakly connected overlapping communities. By predicting the future activity trend of influential users, the method can identify high-frequency interaction between users and determine the influence of a user on adjacent users, so that weakly connected users still have the opportunity to be included in communities and the accuracy of community detection is improved. Taking the community structure into account, the invention also provides a Time Interaction Bias (TIB) community detection method based on overlapping community detection, so as to obtain better overlapping community detection performance. In addition, an objective function is introduced into the graph partitioning process, and the partitions are processed in parallel when calculating the influence propagation model, which greatly improves community detection performance.

To achieve the above object, with reference to FIG. 1, the present invention provides a method for detecting weakly connected overlapping communities, the method including:

S1: receiving a community detection graph model, and dividing graph communities according to the number of processors;

S2: calculating an influence propagation model for each partition, wherein the influence propagation model adjusts the edge weights of the approximate active edges of each partition in combination with the neighborhood edge density;

S3: detecting the weakly connected overlapping communities using a time interaction bias algorithm.

In a further embodiment, in step S1, the process of receiving the community detection graph model and dividing graph communities according to the number of processors includes the following steps:

S11: searching for all communities in the community detection graph model, acquiring the fully connected subgraphs, and generating a subgraph set PCs, where pc ∈ PCs;

S12: collecting the neighborhood nodes within h hops of each edge in each subgraph, where h > 0;

S13: setting a plurality of partitions according to the number of processors, and judging whether the number of subgraphs is larger than the number of partitions; if so, distributing the subgraphs to some of the partitions accordingly and ending the process; otherwise, proceeding to step S14;

S14: allocating at least one subgraph to each partition;

S15: sequentially distributing the remaining subgraphs to the partitions, with minimization of the objective function as the constraint, where the objective function is:

where P^e is the number of edges in processor P and P^v is the number of vertices in processor P. The edge term for pc_i is the total number of edges in pc_i and in the additional data set of each processor, where the additional data set holds all the adjacent edges. The vertex term for pc_i is the total number of vertices in pc_i and in the additional data set of each processor.

The calculation formulas of the edge term and the vertex term are respectively as follows:

where pc_i is a connected community belonging to the set PCs, and the additional data set of pc_i is the set of weighted edges within h hops of pc_i.

In a further embodiment, in step S2, the process of calculating the influence propagation model for each partition includes the following steps:

S21: setting the community detection graph model G as an undirected weighted graph G = (V, E, W), where V is the node set, E is the edge set, and W is the set of weight functions, ω ∈ W; for each edge e = (u, v) ∈ E, the weight w(e) ≥ 0 denotes the total frequency of interaction between nodes u and v in a randomly selected time interval t;

S22: normalizing the weight w(e) of the edge e based on the following criteria:

where m and n are activity parameters, which are real numbers selected empirically according to the social network data set and satisfy m < n;

in the formula, 0 ≤ N(e) ≤ 1, and non-zero values represent different degrees of activity;

S23: propagating the normalized weight from the active edge e_i to its neighboring edge e as follows:

$$U(e) = \lambda^{h} \cdot N(e_i)$$

where λ is the attenuation factor, 0 < λ < 1, h is the number of hops between the current edge and its neighbor, and N(e_i) is the weight of the active neighbor of e;

repeatedly calculating U(e) for the next neighboring edges within h hops of the edge e, storing the results of U(e) in a hash table, and calculating the final weight f(e) of the edge e from the hash table:

In a further embodiment, in step S3, the process of detecting weakly connected overlapping communities using the time interaction bias algorithm includes:

S31: obtaining the set PL of all maximal communities that cannot be expanded further beyond size k, where k is the preset number of nodes;

S32: calculating a density index ρ(pl_i) for each community, and determining the final TIB community identification result through density comparison.

In a further embodiment, the selection criterion of a community is that the corresponding density index is greater than a preset density threshold θ.

In a further embodiment, in step S32, the density index of an active bias community is calculated according to the following formula:

$$\rho(C) = \frac{\sum_{e \in C} f(e)}{|C|}$$

where |C| is the size of the community and f(e) is the final weight of the edge e.

In a further embodiment, the detection method further includes:

S4: acquiring a community C, which is a set of connected k-communities, such that: 1) the edges e of the community are active or semi-active edges; 2) the active bias density metric ρ(C) is maximized.

In a further embodiment, the detection method further includes:

evaluating the detection result using the mutual information evaluation index and the F₁ evaluation index.

Compared with the prior art, the technical scheme of the invention has the following remarkable beneficial effects:

(1) An objective function is designed for the partitioning problem of the community detection graph model, and the community structure is used to optimize processor load balance, improving the efficiency of solving the model. (2) The concept of an active edge is redefined and an influence propagation model is provided, which identifies high-frequency interaction between users and performs well at identifying weakly connected users. (3) A Time Interaction Bias (TIB) community detection method based on overlapping community detection is provided, which achieves good overlapping community detection performance.

It should be understood that all combinations of the foregoing concepts as well as additional concepts described in greater detail below can be considered to be part of the presently disclosed subject matter unless such concepts are mutually inconsistent.

The foregoing and other aspects, embodiments and features of the present teachings can be more fully understood from the following description taken in conjunction with the accompanying drawings. Additional aspects of the present invention, such as features and/or advantages of exemplary embodiments, will be apparent from the description which follows, or may be learned by practice of specific embodiments in accordance with the teachings of the present invention.

Drawings

The drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures may be represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. Embodiments of various aspects of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is a flow chart of a method of detecting weakly connected overlapping communities of the present invention.

FIG. 2 is an exemplary graph of weight assignments of the present invention.

FIG. 3 is an exemplary plot of community bias density values of the present invention.

Fig. 4 is a diagram illustrating an example process of partitioning PCs in accordance with the present invention.

FIG. 5 is a block diagram of one example of the detection method of the present invention.

FIG. 6 is a schematic diagram of an experimental model object of the present invention.

FIG. 7 is a schematic diagram of the comparative experiment results (F₁) of the present invention.

Fig. 8 is a graph showing the results of comparative experiments (NMI) according to the present invention.

FIG. 9 is a schematic diagram comparing the real community network detection results of the present invention.

Detailed Description

In order to better understand the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.

With reference to FIG. 1, the present invention provides a method for detecting weakly connected overlapping communities, the method including steps S1 to S3 as set out above.

In the invention, the community detection graph model G is set as an undirected weighted graph G = (V, E, W), where V is the node set, E is the edge set, and W is the set of weight functions, ω ∈ W. Each edge e = (u, v) ∈ E has a weight w(e) ≥ 0.

For example, in Twitter, the weight is calculated as w(e) = Σ(@ + RTs), which represents the sum of the interactions, namely mentions and retweets, within a given time interval t, where @ denotes mentions and RTs denotes retweets. Table 1 gives all the mathematical symbol definitions used in the present invention.

TABLE 1 mathematical symbol definitions
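The weight definition w(e) = Σ(@ + RTs) can be illustrated with a minimal sketch. It assumes a hypothetical event format of (user_a, user_b, kind, timestamp) tuples and a half-open time window; neither the format nor the function name comes from the patent.

```python
from collections import Counter


def interaction_weights(events, t_start, t_end):
    """Count mentions and retweets per undirected user pair in [t_start, t_end).

    `events` is a hypothetical list of (user_a, user_b, kind, timestamp) tuples.
    """
    weights = Counter()
    for user_a, user_b, kind, timestamp in events:
        if kind not in ("mention", "retweet"):
            continue  # only @ and RT interactions contribute to w(e)
        if not (t_start <= timestamp < t_end):
            continue  # outside the chosen time interval t
        if user_a == user_b:
            continue  # ignore self-interactions
        edge = tuple(sorted((user_a, user_b)))  # undirected edge key
        weights[edge] += 1
    return dict(weights)


events = [
    ("alice", "bob", "mention", 5),
    ("bob", "alice", "retweet", 7),
    ("alice", "carol", "mention", 8),
    ("alice", "bob", "mention", 20),  # falls outside the window below
    ("bob", "carol", "like", 6),      # not a mention or retweet
]
weights = interaction_weights(events, 0, 10)
```

Sorting the two endpoints gives one key per undirected edge, so a mention from alice to bob and a retweet from bob to alice add to the same weight.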

The technical problems to be solved by the invention are the following two. (1) The time interaction bias problem, whose goal is to find a community C, which is a set of connected k-communities, such that: a) the edges e of the community are active or semi-active edges; b) the active bias density metric ρ(C) is maximized. (2) The community-based graph partitioning problem, whose goal is to find a set of communities R such that: a) the nodes and the edges belong to a community structure; b) the data (node and edge) load is balanced across the processor group (P).

The following describes the detection method of the present invention in detail with reference to specific examples, aiming at the two aforementioned technical problems.

Problem 1, time interaction bias problem

To handle the connected k-community constraint, we first define a "clique": a clique is a fully connected subgraph, and a PC is a set of adjacent cliques, where adjacent means that the cliques share k-1 nodes.
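The definition above can be sketched as follows. The sketch assumes that the k-cliques are already available as node sets and that adjacency is applied transitively, so that a chain of pairwise adjacent cliques ends up in one PC; the function name and the union-find grouping are illustrative choices, not taken from the patent.

```python
from itertools import combinations


def group_adjacent_cliques(cliques, k):
    """Group k-cliques into PCs: two cliques are adjacent if they share k-1 nodes."""
    cliques = [frozenset(c) for c in cliques]
    parent = list(range(len(cliques)))  # union-find forest over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), []).append(clique)
    return list(groups.values())


cliques = [{1, 2, 3}, {2, 3, 4}, {5, 6, 7}]
pcs = group_adjacent_cliques(cliques, k=3)
```

Here the first two triangles share the nodes 2 and 3, so they form one PC, while the third triangle remains a PC on its own.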

In a Twitter network, an edge with a weight of 40 represents 40 interactions and is likely to be considered an active edge. However, an inactive edge may also be important, particularly if most of its neighbors are highly active edges.

Although its impact is small in the current time interval, that impact may be amplified and become very important in subsequent community detection.

For each edge e of the graph, the model determines the weight of the edge through a two-step process. The first step is to normalize the weight w(e) of the edge e based on the following criteria:

where 0 ≤ N(e) ≤ 1, and non-zero values represent different degrees of activity; N(e) = 1 means that the edge is 100% active. The parameters m and n in the above equation are real activity parameters satisfying m < n; they are used to normalize the weights into the interval from 0 to 1.

Definition 1 (active edge e_i): when N(e) = 1, the edge e_i of the community is referred to as an active edge.

Definition 2 (approximate active edge): an edge adjacent to an active edge e_i within h hops.

The second step of the model propagates the normalized weight from the active edge e_i to its neighboring edge e as follows:

$$U(e) = \lambda^{h} \cdot N(e_i) \tag{2}$$

where λ (0 < λ < 1) is the attenuation factor, h is the number of hops between the current edge and its neighbor, and N(e_i) is the weight of the active neighbor of e. The calculation of U(e) is repeated for the next neighboring edges within h hops of edge e. The results of U(e) are then stored in a hash table, which is used to calculate the final weight f(e) of edge e, in the form:

where f(e) multiplies the values in the hash table to determine the new weight for edge e. The model thus considers the neighbors of an edge, which helps redefine the edge weights based on the activity of those neighbors.
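The two-step weighting can be sketched under several stated assumptions: the normalized weights N(e) are supplied as input because the normalization formula is not reproduced here; two edges are treated as one hop apart when they share an endpoint; f(e) is taken literally as the product of the hash-table values; and an edge with no active edge within range keeps N(e). The function names and defaults are hypothetical.

```python
from collections import deque
from math import prod


def edge_neighbors(edge, edges):
    """Edges one hop away: those sharing an endpoint with `edge` (assumption)."""
    u, v = edge
    return [other for other in edges if other != edge and (u in other or v in other)]


def propagate(normalized, lam=0.5, max_hops=3):
    """Return final weights f(e) from normalized weights N(e)."""
    edges = list(normalized)
    final = {}
    for e in edges:
        if normalized[e] >= 1.0:
            final[e] = 1.0  # active edges keep their full weight
            continue
        table = {}  # hash table: active edge e_i -> U(e) = lam**h * N(e_i)
        seen = {e}
        queue = deque([(e, 0)])
        while queue:
            current, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not look beyond the hop limit
            for nxt in edge_neighbors(current, edges):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if normalized[nxt] >= 1.0:
                    table[nxt] = lam ** (hops + 1) * normalized[nxt]
                queue.append((nxt, hops + 1))
        # Literal reading of "f(e) multiplies the values in the hash table";
        # falling back to N(e) for an empty table is an assumption.
        final[e] = prod(table.values()) if table else normalized[e]
    return final


normalized = {("a", "b"): 1.0, ("b", "c"): 0.2, ("c", "d"): 0.0}
final = propagate(normalized, lam=0.5, max_hops=3)
```

On the path a-b-c-d the only active edge is (a, b), so (b, c) receives λ¹ and (c, d) receives λ², which shows how an inactive edge gains weight from a nearby active one.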

The model aims to determine the edge weights of temporal interactions while taking into account the weights of the edges neighboring an inactive edge. It therefore considers not only the current time interval but also the influence probability of neighboring edges.

Example 1: Consider the specific example shown in FIG. 2. FIG. 2a gives the original weighted graph. Using N(e), the first step redefines the weights in the range from 0 to 1. Equation (3) is then executed. Finally, we use the hash table to compute the final weight of each edge using f(e), as shown in FIG. 2c.

Definition 3 (active bias density): the density index of an active bias community may be calculated as follows:

$$\rho(C) = \frac{\sum_{e \in C} f(e)}{|C|} \tag{4}$$

which is the sum of the biased weights within the community divided by the size of the community |C|. The quality of a community may be assessed against a threshold. Community discovery is therefore performed over the communities C for which ρ(C) can be maximized.
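Equation (4) can be illustrated with a short sketch. It assumes that |C| is the number of nodes in the community and that an edge belongs to C when both endpoints do; the source only says "size of the community", so both choices are assumptions, as is the function name.

```python
def active_bias_density(community_nodes, final_weights):
    """rho(C): sum of f(e) over edges inside C, divided by |C| (node count assumed)."""
    nodes = set(community_nodes)
    if not nodes:
        raise ValueError("community must contain at least one node")
    total = sum(
        weight
        for (u, v), weight in final_weights.items()
        if u in nodes and v in nodes  # edge lies entirely inside the community
    )
    return total / len(nodes)


final_weights = {("a", "b"): 1.0, ("b", "c"): 0.5, ("c", "d"): 0.25}
rho = active_bias_density({"a", "b", "c"}, final_weights)
theta = 0.4  # hypothetical density threshold
is_selected = rho > theta
```

With these example weights the community {a, b, c} contains two edges with total weight 1.5, so its density is 1.5 / 3.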

Example 2: Continuing with the example, as shown in FIG. 3, we now have the reconstructed weighted graph. We then find all communities in the graph and calculate their bias density values using equation (5), as shown in FIG. 3. Next, we calculate the bias density values for the PCs using equation (4). The PCs only form a community when their bias density score is higher than the bias density score of each individual community; otherwise, each community remains a community by itself. FIG. 3 gives two cases: the two PCs on the left have smaller density values when joined, and the right part shows the opposite case.

Problem 2, community-based graph partitioning problem

This problem is NP-hard and can be addressed using heuristic algorithms.

Definition 4 (objective function J): the PCs can be distributed evenly over the available partitions if the following conditions are met: 1) the number of partitions (R) is no less than the number of processors (P), i.e., R ≥ P; 2) each partition has almost the same number of PCs; 3) the objective function J has the smallest value.

To ensure that these conditions are met, we calculate for each unassigned pc ∈ PCs as follows:

where P^e is the number of edges in processor P, P^v is the number of vertices in processor P, and the edge term for pc_i is the total number of edges in pc_i and in the additional data set of each processor. The additional data set holds all the adjacent edges. The edge term can be calculated as:

where pc_i is a connected community belonging to the set PCs, and the additional data set of pc_i is the set of weighted edges within h hops of pc_i. The parameter h is any number greater than 0 and can be changed as required.

The vertex term for pc_i is the total number of vertices in pc_i and in the additional data set of each processor.

Example 3: Consider the example shown in FIG. 4a. To partition this graph, we first find the communities and then find the PCs, which are shown with dashed lines in FIG. 4b. Assume that the number of processors is P = 2; we then need two partitions, with each processor assigned one partition. PC₁ is assigned to partition 1 and PC₂ to partition 2, and the assignment of PC₃ is decided according to the value of the objective function J.

When PC₃ is added to PC₁ in partition 1, the objective function J evaluates to 54. When PC₃ is added to PC₂ in partition 2, J evaluates to 44. Based on these results, PC₃ takes the latter case and is assigned to processor 2.

FIG. 5 is a block diagram of one example of the detection method, which is divided into three phases: 1) graph partitioning, which is based on the PCs and the objective function J; here, nodes and edges not belonging to any community are eliminated; 2) computing the influence propagation model for each partition; 3) detecting communities using the TIB community detection method.

The detailed calculation process of each stage of the community detection algorithm framework shown in FIG. 5 is as follows:

process 1: graph partitioning

The computational steps of the process are shown in Table 2. The inputs to the algorithm are the undirected weighted graph, the neighborhood hop count, and the number of processors P. First, we look for communities (line 1) and find the PCs (line 2). Then we collect the neighborhood nodes within h hops of each edge in the PCs (lines 3-7), which will be used to compute the influence propagation model. Thereafter, we assign PCs to partitions: if there are six partitions, 6 PCs are assigned first (line 9); after that, the remaining PCs are assigned to partitions based on the computation of the objective function J (lines 10-13). When all PCs are evenly distributed across the partitions, the partitioning process of the algorithm ends (lines 14-15).
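The assignment loop described above can be sketched as a greedy procedure. Because the formula for the objective function J is not reproduced in this text, the sketch substitutes a stand-in imbalance measure (the spread of edge loads plus the spread of vertex loads); that measure, the function names, and the example sizes are assumptions and will not reproduce the values 54 and 44 of Example 3.

```python
def stand_in_objective(loads):
    """Hypothetical replacement for J: edge-load spread plus vertex-load spread."""
    edge_loads = [edges for edges, _ in loads]
    vertex_loads = [vertices for _, vertices in loads]
    return (max(edge_loads) - min(edge_loads)) + (max(vertex_loads) - min(vertex_loads))


def assign_pcs(pc_sizes, num_partitions, objective=stand_in_objective):
    """Greedy assignment of PCs to partitions.

    `pc_sizes[i]` is (edge_count, vertex_count) of PC i, including its
    additional data set. One PC seeds each partition; every remaining PC
    goes to the partition that minimizes the objective.
    """
    assignment = {}
    loads = [(0, 0)] * num_partitions
    for index in range(min(num_partitions, len(pc_sizes))):
        assignment[index] = index
        loads[index] = pc_sizes[index]
    for index in range(num_partitions, len(pc_sizes)):
        edges, vertices = pc_sizes[index]
        best_partition, best_value = None, None
        for p in range(num_partitions):
            trial = list(loads)
            trial[p] = (loads[p][0] + edges, loads[p][1] + vertices)
            value = objective(trial)
            if best_value is None or value < best_value:
                best_partition, best_value = p, value
        assignment[index] = best_partition
        old_edges, old_vertices = loads[best_partition]
        loads[best_partition] = (old_edges + edges, old_vertices + vertices)
    return assignment, loads


pc_sizes = [(10, 6), (4, 4), (5, 4)]  # illustrative sizes for PC1, PC2, PC3
assignment, loads = assign_pcs(pc_sizes, num_partitions=2)
```

As in Example 3, the third PC is placed with the second one, because adding it to the already heavier first partition would increase the imbalance.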

Process 2: influence propagation model computation

The computation of the influence propagation model is given in Algorithm 2 of Table 3. First, the weights are initialized and normalized by P(e) (lines 2-4). For each edge e with P(e) < 1 (line 5), the following are initialized: 1) h, the number of hops from the edge to the neighborhood node; 2) N, which contains the set of neighbors of e; 3) the hash table of e (line 6). The neighbors of e are then found using a depth-first search and stored in N. Then U(e) is calculated and stored in N. This process is repeated until the maximum hop count constraint h < 4 is reached (lines 7-11). Next, the set of U(e) weights in the hash table is taken as input (line 13), and the value f(e) of edge e is output (line 14). This process is repeated until all edges are processed, and a set of biased weights is output (line 16).

Process 3: TIB community detection

As shown in FIG. 2a, the algorithm first obtains the set of all maximal communities that cannot be expanded further beyond size k. In the prior art, all neighboring communities that share k-1 nodes are considered, and the community selection criterion is that the density index I(G) is greater than the threshold θ:

In the present invention, we use the aforementioned ρ(C) index instead of the I(G) index for the algorithm design, and the proposed bias density index can find TIB communities that are not connected. The density index ρ(pl_i) of each community is calculated, and the final TIB community identification result (lines 9-17) is determined by density comparison, as shown in FIG. 3b.

In a further embodiment, the detection method further includes:

evaluating the detection result using the mutual information evaluation index and the F₁ evaluation index. The normalized mutual information (NMI) index evaluates the similarity of the detected communities and reflects overlapping community detection, while the F₁ index evaluates community detection accuracy.

First, generation of experimental objects

The experimental objects are generated with the LFR and MMSB methods. The sparsity of the generated network can be controlled through the network parameter settings, to obtain networks with different characteristics:

(1) MMSB object generation: the community generation algorithm is based on probability theory. A community link between p and q is obtained, which follows the distribution Y(p, q):

in the formula, the parameter β is the interaction matrix used in the community detection process, and Z is the distribution arising in the community detection process, which is multinomial:

in the formula, the parameter α controls the degree of overlap of the generated model communities. The model structure is shown in FIG. 6. According to the experimental results, to obtain a sparse overlapping community network model, the modularity index can be set to 0.5 or more; conversely, to obtain a dense overlapping community network model, the modularity index can be set below 0.5. Table 5 shows the attribute data of the network experimental objects generated with the MMSB method.

TABLE 5 network object Attribute generated Using MMSB method

(2) LFR object generation: the community network generation algorithm is similar to the MMSB object generation method, and the sparsity of the community network model can be controlled through the modularity parameter. The generated LFR object is shown in FIG. 6b, and the parameter settings are shown in Table 6.

Table 6 network object attributes generated by LFR method

The experimental model objects shown in FIG. 6 correspond to the 1st and 3rd rows of the parameter settings in Table 5 and Table 6, respectively.

Secondly, stability comparison experiment results

First, the F₁ evaluation index is used to evaluate the quality of the community discovery process. The index takes the form:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

in the formula, the parameter FP is the proportion of cases in which the real network is negative but the community detection result is positive; the parameter FN is the proportion in which the real network is positive but the community detection result is negative; the parameter TP is the proportion in which the real network is positive and the community detection result is positive. P represents the precision of the community detection process, and R represents its recall.
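The F₁ index can be computed from the three counts with a minimal sketch. It uses the standard definitions of precision, recall, and their harmonic mean; how TP, FP, and FN are counted for a given community detection result is not specified here and is left to the caller. The function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0  # avoid division by zero
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

With 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so F₁ is also 0.8 up to floating-point rounding.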

To verify the effectiveness of the algorithm, the overlapping community detection algorithms of two documents are chosen for comparison: "Chen M, Nguyen T, Szymanski B. On measurement of the quality of a network community structure [J]. Social Computing, 2013, 52(3): 122-" (defined as comparative method 1) and a document published in IEEE Transactions on Services Computing, 2015, 8(2): 284-298 (defined as comparative method 2). The results of the F₁ evaluation experiment are shown in FIG. 7. In general, F₁ ranges between 0 and 1, and the higher the value, the higher the stability of the algorithm. The results in FIG. 7 show that the F₁ results of the algorithm of the invention are better than those of the two selected comparison algorithms, which indicates that its community detection is more stable. At the same time, the F₁ results of all three algorithms decrease gradually as the number of communities increases, which shows that the stability of the three algorithms is correlated with the number of communities during community detection: the larger the number of communities, the worse the stability.

Thirdly, accuracy comparison experiment results

For the experimental analysis of the community detection of the algorithm, NMI is selected as the comparative evaluation index. For community partitions A and B, it is defined as:

where the parameter N is the number of vertices in the community detection network, the parameter C is the confusion matrix formed by the community detection model, and C_ij is the number of vertices that belong simultaneously to community i of one detection result and community j of the other. C_A and C_B are the numbers of communities in partitions A and B. The two aforementioned overlapping community detection algorithms are again chosen for comparison. The results of the experiment are shown in FIG. 8.
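Since the NMI formula itself is not reproduced in this text, the following sketch uses the commonly used confusion-matrix form of normalized mutual information for two partitions, which matches the symbols N, C, C_ij, C_A, and C_B described above. Treating it as the formula intended here is an assumption, and this form applies to non-overlapping partitions only.

```python
from math import log


def nmi_from_confusion(confusion):
    """NMI of two partitions from confusion matrix C (rows: A, columns: B).

    NMI = -2 * sum_ij C_ij * log(C_ij * N / (C_i. * C_.j))
          / (sum_i C_i. * log(C_i. / N) + sum_j C_.j * log(C_.j / N))
    """
    row_sums = [sum(row) for row in confusion]          # C_i.
    col_sums = [sum(col) for col in zip(*confusion)]    # C_.j
    n = sum(row_sums)                                   # N, number of vertices
    numerator = 0.0
    for i, row in enumerate(confusion):
        for j, c_ij in enumerate(row):
            if c_ij > 0:  # zero cells contribute nothing
                numerator += c_ij * log(c_ij * n / (row_sums[i] * col_sums[j]))
    denominator = sum(r * log(r / n) for r in row_sums if r > 0)
    denominator += sum(c * log(c / n) for c in col_sums if c > 0)
    if denominator == 0:
        return 1.0  # convention (assumption): both partitions are one community
    return -2.0 * numerator / denominator


identical = nmi_from_confusion([[3, 0], [0, 3]])
independent = nmi_from_confusion([[2, 2], [2, 2]])
```

Identical partitions give a value of 1 and statistically independent partitions give 0, which are the two ends of the scale used when comparing detection results.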

In the NMI results shown in FIG. 8, the community identification accuracy of all the algorithms shows a relatively consistent downward trend as the number of network vertices increases. For networks of the same scale, a tighter network means more severe overlap. The results show that although the NMI accuracy of the proposed algorithm decreases as the number of vertices increases, the decrease is small. The proposed algorithm therefore achieves higher community detection accuracy and stability than the comparison algorithms as the degree of community overlap increases.

Fourth, real community network detection result

A real data set is constructed using an API-based network data crawling tool. The data acquisition platform sets seed accounts in the network and acquires the data characteristics of the network as follows: 1) 5 adjacent accounts are selected as the seed accounts; 2) adjacent vertices in Sina Weibo are captured according to a depth-first search strategy; 3) the parameters of the capturing process are adjusted according to a reference network.

Through this data capture process, the constructed Sina Weibo network contains 78 subject groups, 900 vertices, and 5 thousand items of microblog interaction data. The experimental program builds the model in Java, and the comparison algorithms are again the two overlapping community detection algorithms above. Algorithm performance is compared and verified using precision, recall, and the harmonic index F_meas, which can be defined as:

For the Sina Weibo data obtained by the network data crawling tool, the number of edges depends heavily on the sparsity of the crawled data. The experimental comparison of the two selected algorithms and the proposed detection method on the crawled data is shown in FIG. 9.

According to the comparison on the Sina Weibo data shown in FIG. 9, under low-density community detection the three algorithms differ little; in particular, the advantage of the proposed detection method over comparative method 1 is not obvious. However, as community compactness increases, the degree of overlap and interaction among communities also increases. Because the proposed detection method accounts for influence propagation under the time interaction bias of weakly connected overlapping communities, its overall performance improves quickly, while that of the two selected algorithms changes little. The experimental results verify the performance advantage of the proposed community discovery algorithm.

It should be understood that the various concepts and embodiments described above, as well as those described in greater detail below, may be embodied in any of numerous ways, since the disclosed concepts and embodiments are not limited to any one embodiment.

Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention should be determined by the appended claims.

Claims

1. A method for detecting weakly connected overlapping communities, comprising:

2. The method for detecting weakly-connected overlapping communities as claimed in claim 1, wherein in step S1, the process of receiving the community detection graph model and dividing graph communities according to the number of processors includes the following steps:

S14: allocating at least one subgraph to each partition;

and the calculation formulas of the edge term and the vertex term are respectively as follows:

wherein pc_i is a connected community belonging to the set PCs,

and the additional data set of pc_i is the set of weighted edges within h hops of pc_i.

3. The method for detecting weakly connected overlapping communities according to claim 1, wherein in the step S2, the process of calculating the influence propagation model of each partition comprises the following steps:

$$U(e) = \lambda^{h} \cdot N(e_i)$$

4. The method for detecting weakly connected overlapping communities according to claim 1, wherein in step S3, the process of detecting weakly connected overlapping communities using the time interaction bias algorithm includes:

S32: calculating a density index ρ(pl_i) for each community,

5. The method for detecting the weakly connected overlapping communities according to claim 4, wherein the selection criterion of the community is that the corresponding density index is greater than a preset density threshold θ.

6. The method for detecting weakly-connected overlapping communities according to claim 4, wherein in step S32, the density index of the active bias communities is calculated according to the following formula:

$$\rho(C) = \frac{\sum_{e \in C} f(e)}{|C|}$$

7. The method for detecting weakly-connected overlapping communities as claimed in claim 6, further comprising:

S4: acquiring a community C, which is a set of connected k-communities, such that: 1)

8. The method for detecting weakly connected overlapping communities as claimed in claim 1, further comprising: