WO2011114135A1

WO2011114135A1 - Detecting at least one community in a network

Info

Publication number: WO2011114135A1
Application number: PCT/GB2011/050464
Authority: WO
Inventors: Leto Paul Peel
Original assignee: Bae Systems Plc
Priority date: 2010-03-16
Filing date: 2011-03-09
Publication date: 2011-09-22
Also published as: GB201004376D0

Abstract

A method of detecting at least one community in a network (100) includes obtaining (202) data indicating at least one observable parameter of the network. The obtained parameter data is used to select (204) at least one community-detecting technique from a set of community-detecting techniques. The selected community-detecting technique(s) is applied (206) to detect at least one community in the network.

Description

Detecting at least one Community in a Network

The present invention relates to detecting at least one community in a network.

The study of large-scale networks has revealed a number of properties about the behaviour and topology of naturally occurring networks. One such property is the presence of community structures: sets of nodes in a network which are more interconnected among themselves than they are with the rest of the network. It is the aim of community detection to identify these structures. Community detection is a problem which has attracted much interest in recent years and has consequently produced a wide range of approaches to the problem. An in-depth review of most contemporary methods is given in S. Fortunato, "Community detection in graphs", eprint arXiv:0906.0612, 2009.

One of the reasons why the ability to detect communities is so attractive is that entities in a network have been observed to associate preferentially with similar entities. This suggests that the detection of communities may be used for identifying entities which share common attributes or purposes. An example of community structures identifying entity similarity would be community structures in a friendship network which reflect similarities in race and age. The wide range of complex systems that can naturally be expressed as networks (human interaction patterns, metabolic networks, WWW, the brain) implies that community detection has applications spanning domains as diverse as biology, sociology, computer science and security. With a large selection of algorithms available to undertake the task of community detection, choosing an appropriate algorithm becomes problematic. This is largely due to the lack of formal or commonly accepted evaluation procedures. The networks used to evaluate community detection tend to be a small selection of real networks and/or networks generated from simple models, where these networks vary widely between authors. Recent work to address this has focused on developing benchmark networks on which comparative analysis can be done to determine the reliability of different algorithms.

This work considers the idea that for different situations, different classes of algorithms may outperform other classes of algorithms. A large number of algorithms have been developed to tackle the community detection problem, but as with any machine learning task there is no single solution and each algorithm tends to suit a specific part of the problem space.

In the intelligence domain, community detection could be used to identify groups of people who share common goals or purposes. Community detection could therefore potentially be used to constrain the search space when investigating or detecting malicious activities. Here, the network nodes would represent people and the links would represent interactions or relationships between them. Such a network could be constructed from a database of phone records, email logs or other transactional data. The overwhelming choice of algorithms which are available is a problem. An intelligence analyst trying to detect communities would require a detailed knowledge of community detection algorithms in order to select an appropriate algorithm or subset of algorithms.

The present invention is intended to address at least some of the problems outlined above. In particular, the invention is intended to address the problem of selecting a technique/algorithm for community detection. An aim of community detection is to identify sets of nodes in a network which are more interconnected between themselves than they are relative to the rest of the network.

The invention benefits from the inventor's examination of the performance of algorithms developed for weighted networks against those using unweighted networks for different parts of the problem space (parameterised by the intra/inter community links) so that the choice of algorithm (e.g. weighted/unweighted) can be made based only on the observed network.

According to one aspect of the present invention there is provided a (computer-implemented) method of detecting at least one community in a network, the method including:

obtaining data indicating at least one observable parameter of the network;

using the obtained parameter data to select at least one community-detecting technique from a set of community-detecting techniques, and

applying the selected at least one community-detecting technique to detect at least one community in the network.

The parameter may provide an indirect indication of interaction between community structures in the network. The parameter may provide an indication of a proportion of links between a node in the network and others in its community and other nodes in the network that are outside its community. The parameter may provide an estimate of a mixing parameter of the node.

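The parameter may relate to a local clustering coefficient of at least one node in the network. The local clustering coefficient may be defined as:

$$C_v = \frac{\left|\{ e_{ij} : i, j \in N_v,\ e_{ij} \in E \}\right|}{k_v (k_v - 1)/2}$$

where C_v represents a proportion of neighbouring nodes (N_v) of the at least one node v that are connected, out of possible connections between the at least one node and its neighbouring nodes, k_v(k_v - 1)/2, where k_v is the number of neighbouring nodes and E is the set of links in the network.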

The method may include calculating a mean value of the local clustering coefficient of a set of nodes in the network. The set of nodes in the network may comprise all of the nodes in the network. The local clustering coefficient may include a weighted extension and may be defined as:

where w_vi is a weight associated with a network link between nodes v and i.

The community-detecting techniques in the set may include techniques that use network link weight information and techniques that do not use network link weight information. For example, the techniques may include a technique involving an Infomap algorithm and a technique involving a COPRA algorithm.

According to another aspect of the present invention there is provided a (computer-implemented) method of selecting at least one algorithm intended to detect at least one community in a network, the method including obtaining an estimate of at least one parameter of the network and using that parameter to select a community-detection algorithm.

According to yet another aspect of the present invention there is provided a computer program product comprising computer readable medium, having thereon computer program code means, when the program code is loaded, to make the computer execute a method substantially as described herein.

According to another aspect of the present invention there is provided a system configured to detect at least one community in a network, the system including:

a device configured to obtain data indicating at least one observable parameter of the network;

a device configured to use the obtained parameter data to select at least one community-detecting technique from a set of community-detecting techniques, and

a device configured to apply the selected at least one community-detecting technique to detect at least one community in the network.

According to a further aspect of the present invention there is provided apparatus configured to detect at least one community in a network, the apparatus including:

a storage device including a set of community-detection techniques;

a processor configured to:

obtain data indicating at least one observable parameter of the network;

select at least one community-detecting technique from the set of community-detecting techniques, and

apply the selected at least one community-detecting technique to detect at least one community in the network.

Whilst the invention has been described above, it extends to any inventive combination of features set out above or in the following description. Although illustrative embodiments of the invention are described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to these precise embodiments. As such, many modifications and variations will be apparent to practitioners skilled in the art. Furthermore, it is contemplated that a particular feature described either individually or as part of an embodiment can be combined with other individually described features, or parts of other embodiments, even if the other features and embodiments make no mention of the particular feature. Thus, the invention extends to such specific combinations not already described.

The invention may be performed in various ways, and, by way of example only, embodiments thereof will now be described, reference being made to the accompanying drawings in which:

Figure 1 illustrates schematically a network and a computing device configured to detect communities in the network;

Figure 2 illustrates schematically a set of example steps that can be performed for detecting network communities;

Figures 3(a) - 3(d) are graphs relating to mutual information scores of network nodes;

Figure 4 is another graph relating to mutual information scores;

Figure 5 illustrates an example network node with links and weights;

Figures 6(a) - 6(b) are scatter plots relating to mean local clustering coefficients of network nodes, and

Figures 7(a) - 7(d), 8 and 9 are further scatter plots.

Referring to Figure 1, an example of a simple network 100 is shown that comprises a set of nodes 102. Some of the nodes are in communication with each other via links 104. The type of network can vary, e.g. it can be a Local Area Communications network, a Wide Area Network or any other type of network. In general terms, a network is a structure made up of nodes, representing entities, and links or edges, representing relationships or interactions between entities. The total number of links connected to a node is known as its degree. The network links may also have weights associated with them which may represent the relative importance of the link. For example, in an interaction network representing a phone record database, the nodes would represent people and the links phone calls. The link weights could then represent the frequency of calls. Network links may also be directed.
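By way of illustration only, a weighted interaction network of the kind described above might be assembled from call records as in the following minimal Python sketch. The (caller, callee) record format, the function names build_call_network and degree, and the use of the call count as the link weight are assumptions made for the example and are not prescribed by the description.

```python
from collections import defaultdict


def build_call_network(call_records):
    """Return an undirected weighted adjacency map {node: {neighbour: weight}}.

    Each record is a (caller, callee) pair. The link weight is the number of
    calls observed between the two people (an assumed weighting).
    """
    adjacency = defaultdict(dict)
    for caller, callee in call_records:
        if caller == callee:
            continue  # ignore self-calls
        new_weight = adjacency[caller].get(callee, 0) + 1
        adjacency[caller][callee] = new_weight
        adjacency[callee][caller] = new_weight
    return {node: dict(neighbours) for node, neighbours in adjacency.items()}


def degree(adjacency, node):
    """Number of links connected to the node."""
    return len(adjacency[node])


# Hypothetical call log used only for this example.
calls = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"),
         ("bob", "cat"), ("ann", "bob")]
network = build_call_network(calls)
```

In this example the link between "ann" and "bob" carries a weight of 3, because three calls were observed between them, and the degree of "ann" is 2.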

Figure 1 also shows a computing device 110 including a processor 112 and memory 114. The memory includes an application 116 that is configured to detect at least one community in the network 100. The application 116 can access a set of community-detection techniques, which may be based on conventional, or modified, community-detection algorithms. In the example, two techniques 118A, 118B are shown in the set, but in alternative embodiments more techniques may be made available. Further, more than one of the techniques could be selected in alternative embodiments.

The premise of community detection is that there is some underlying assignment of nodes to communities which has to be discovered. Although precise definitions of a community may vary and may be dependent on the application, it can be understood in terms of the intuitive concept that community structures have more intra-community links than inter-community links. A suitably comprehensive parameter set can be used for describing the space of community types and structures of interest. A reasonable starting point is the parameter set used to generate networks and communities using the LFR benchmark generator (see A. Lancichinetti and S. Fortunato, "Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities," eprint arXiv: 0904.3940, 2009) as not only do these describe a number of network properties, but by using the generator it is possible to obtain networks and community assignments with those properties.

The parameter set used to describe the problem space in the example is the set of parameters used by the known LFR benchmark. The LFR benchmark was designed to generate datasets to test community detection algorithms and mimic the observed properties of large-scale real complex networks, such as power-law degree and community size distributions.

The parameters are best described in the context of the graph generation procedure:

1. N nodes are assigned to communities such that the community size distribution conforms to a power law with minus exponent τ_2.

2. Each node is assigned a degree such that the degree distribution conforms to a power law with minus exponent τ_1 and mean degree k.

3. Links are initially assigned randomly according to the degree distribution. A topological mixing parameter, μ_t, is set to define the proportion of each node's links which link outside its community. Topological consistency with this parameter is achieved through an iterative re-wiring procedure.

4. Each node is then assigned a strength according to a power-law distribution with minus exponent β. The strength of a node is the weighted analogue of degree and as such represents the sum of the weights of the links for a given node.

5. To assign the link weights a similar process to step 3 is carried out according to the weight mixing parameter, μ_w.
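By way of illustration only, when the community assignment is known (as it is for a generated benchmark network), the topological and weight mixing parameters referred to in steps 3 and 5 can be evaluated for an individual node as in the following Python sketch. The adjacency-map representation, the function name mixing_parameters and the example weights are assumptions made for the example; each node is assumed to belong to exactly one community.

```python
def mixing_parameters(adjacency, community, node):
    """Return (mu_t, mu_w) for one node with at least one link.

    mu_t: proportion of the node's links that lead outside its community.
    mu_w: proportion of the node's strength (sum of its link weights)
          carried by those inter-community links.
    """
    neighbours = adjacency[node]
    node_degree = len(neighbours)
    strength = sum(neighbours.values())
    external = [n for n in neighbours if community[n] != community[node]]
    mu_t = len(external) / node_degree
    mu_w = sum(neighbours[n] for n in external) / strength
    return mu_t, mu_w


# Hypothetical node "v": four links inside community "A", one link to "B".
# Only the row for "v" is needed by this example.
adjacency = {"v": {"a1": 2.0, "a2": 2.0, "a3": 2.0, "a4": 1.0, "b1": 3.0}}
community = {"v": "A", "a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B"}
mu_t, mu_w = mixing_parameters(adjacency, community, "v")
```

For this node one link in five leaves the community, giving a topological mixing value of 0.2, while that single link carries 3 of the 10 units of strength, giving a weight mixing value of 0.3.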

These may not comprise a full set of parameters to comprehensively describe the space of all possible network-community structures, but the skilled person will be aware of other parameters that may be useable. To constrain the problem dealt with by the example embodiment below, the values of all parameters were fixed with the exception of μ_t and μ_w, which from initial tests were found to have the greatest impact on use of link weights.

Figure 2 illustrates schematically steps that can be performed by an example implementation of the community-detection application 116. At step 202 the application obtains data indicating at least one observable parameter of the network. The data may be computed by the computer 110, or may be retrieved from a store. The data can be generated in various ways, depending on the selected parameter. In the embodiment detailed below, the observable parameter comprises a local clustering coefficient and this parameter can be obtained by monitoring data flow between the nodes 102. In general, an observable parameter can be thought of as one that can be discerned from network data without the community detection problem having been solved (yet).

At step 204 the application 116 uses the obtained parameter data to select at least one community-detecting technique from a set of community-detecting techniques 118A, 118B, and at step 206 the selected community-detecting technique(s) is/are used to detect at least one community in the network 100. Each selected technique can be executed to detect one or more communities.
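By way of illustration only, the flow of steps 202, 204 and 206 can be sketched in Python as below. Every name in the sketch (detect_communities, observe_link_count, select_by_link_count and the two placeholder techniques) is hypothetical, and the link-count rule merely stands in for whatever selection logic an implementation adopts; it is not the selection method of the embodiment.

```python
def detect_communities(network, observe, select, techniques):
    """Measure the network, select technique(s), then apply them."""
    parameters = observe(network)   # step 202: obtain observable parameter data
    chosen = select(parameters)     # step 204: select technique name(s)
    # Step 206: apply each selected technique to the network.
    return {name: techniques[name](network) for name in chosen}


# Hypothetical stand-ins, for illustration only.
def observe_link_count(network):
    return {"links": sum(len(nbrs) for nbrs in network.values()) // 2}


def select_by_link_count(parameters):
    return ["technique_a"] if parameters["links"] < 3 else ["technique_b"]


def single_community(network):
    return [set(network)]


def singleton_communities(network):
    return [{node} for node in network]


techniques = {"technique_a": single_community,
              "technique_b": singleton_communities}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
result = detect_communities(triangle, observe_link_count,
                            select_by_link_count, techniques)
```

Keeping the observation, selection and detection stages as separate callables means that further techniques can be added to the set, or more than one technique selected, without changing the overall flow.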

The technique/algorithm selection problem has been constrained in the present example to choosing between the class of algorithms which use link weight information and the class that does not. In light of this, algorithms suitable for unweighted or weighted networks are made available in the set. In this way a controlled comparison can be drawn between the performances of the unweighted and weighted algorithms without needing to worry about differences in algorithms. Two examples of such algorithms are provided:

- Infomap (see M. Rosvall and C.T. Bergstrom, "Maps of random walks on complex networks reveal community structure," eprint arXiv:0707.0609, Jul. 2007): This algorithm approaches the community detection problem by identifying a duality between community detection and information compression. By using random walks to analyse the information flow through a network it identifies communities as modules through which information flows quickly and easily. Coding theory is used to compress the data stream describing the random walks by assigning frequently visited nodes a shorter codeword. This is further optimised by assigning unique codewords to network modules and reusing short codewords for network nodes such that node names are unique given the context of the module. This two-level description of the path allows a more efficient compression by capitalising on the fact that a random walker spends more time within a community than moving between communities.

- COPRA (see S. Gregory and W.Y. Cheung, "Finding overlapping communities in networks by label propagation," Arxiv preprint arXiv:0910.5516, 2009): This is an extension of the label-propagation-based RAK algorithm (see U.N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E, vol. 76, 2007, p. 036106). The algorithm works as follows: to start, all nodes are initialised with a unique label. These labels are then updated iteratively, where a node's new label is assigned according to the label used most by its neighbours. If there is more than one most frequently occurring label amongst the neighbours, then the label is chosen randomly. At termination of the algorithm, nodes with the same label are assigned to the same community. The Community Overlap PRopagation Algorithm (COPRA) extends the RAK algorithm to deal with the possibility of overlapping communities. This can be done by augmenting each label with a belonging factor such that for a given node these factors sum to 1. To prevent all nodes becoming a member of all communities, a threshold is set below which the labels are discarded. Due to the stochastic nature of the algorithm, particularly in the initial iterations, in practice the algorithm is run a number of times and the "best" community assignment is taken to be the one which has the highest modularity. In the weighted instance of the algorithm, the weights of the network are incorporated by weighting the frequency of the labels according to the link weight connecting the respective node.
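By way of illustration only, the basic label propagation scheme on which COPRA builds can be sketched in Python as below. This is a simplified, non-overlapping sketch: it omits COPRA's belonging factors, the discard threshold and the repeated runs scored by modularity. The function name, the seed and sweep-limit parameters and the stopping test are assumptions made for the example. Label counts are weighted by link weight, so an unweighted network is represented by giving every link a weight of 1.

```python
import random
from collections import defaultdict


def label_propagation(adjacency, seed=0, max_sweeps=1000):
    """Non-overlapping label propagation on {node: {neighbour: weight}}."""
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # unique initial labels
    nodes = list(adjacency)

    def top_labels(node):
        # Total link weight behind each label among the node's neighbours.
        totals = defaultdict(float)
        for neighbour, weight in adjacency[node].items():
            totals[labels[neighbour]] += weight
        best = max(totals.values())
        return sorted(lab for lab, total in totals.items() if total == best)

    for _ in range(max_sweeps):
        rng.shuffle(nodes)
        for node in nodes:
            if adjacency[node]:
                # Ties between equally supported labels are broken randomly.
                labels[node] = rng.choice(top_labels(node))
        # Stop once every node already holds one of its best-supported labels.
        if all(not adjacency[n] or labels[n] in top_labels(n) for n in nodes):
            break

    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return list(communities.values())


# Two disconnected triangles, every link with weight 1.
triangles = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0},
             3: {4: 1.0, 5: 1.0}, 4: {3: 1.0, 5: 1.0}, 5: {3: 1.0, 4: 1.0}}
found = label_propagation(triangles)
```

Because labels cannot cross between the two disconnected triangles, the sketch recovers each triangle as a community in this example. On connected networks the outcome depends on the random tie-breaking, which is why repeated runs are used in practice.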

A number of different metrics can be used to measure the performance of community detection algorithms, but the Normalised Mutual Information (see A. Lancichinetti, S. Fortunato, and J. Kertesz, "Detecting the overlapping and hierarchical community structure of complex networks," New J. Phys, vol. 11, 2009, p. 033015) is used in the present example, although another metric may be used in alternative embodiments. This metric provides a measure of similarity between the algorithm output assignment and the true community assignment, where a value of 1 denotes a perfect match. The inventor ran experiments to examine the effect of varying the two mixing parameters μ_t and μ_w, the results of which can be seen in the graphs of Figures 3(a) - 3(d), which show the mutual information scores for the weighted algorithms (COPRAw, INFOMAPw) and unweighted algorithms (COPRAuw, INFOMAPuw) as one mixing parameter is changed. Each of the plots 3(a) - 3(c) shows the performance for a different fixed value of the other mixing parameter. The values of the other parameters were fixed.

Each point on the graphs represents the average mutual information over 25 generated networks with the indicated parameter values. It can be seen that the unweighted algorithms perform well when μ_t is low and are unaffected by μ_w for all values. This is only to be expected as these algorithms only rely on the topological information. The weighted algorithms on the other hand are affected by both parameters, but are seen to consistently perform well for low μ_w.
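By way of illustration only, a mutual information score of the kind used in these comparisons can be computed as in the following Python sketch. The sketch uses the standard normalised mutual information for non-overlapping assignments, 2I(X;Y)/(H(X)+H(Y)), which is a simpler stand-in for the overlapping-community variant of the cited reference; the function name and the example label lists are assumptions made for the example.

```python
from collections import Counter
from math import log


def normalised_mutual_information(labels_a, labels_b):
    """NMI between two community assignments given as per-node label lists."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    mutual = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    if entropy_a + entropy_b == 0:
        return 1.0  # both assignments place every node in a single community
    return 2 * mutual / (entropy_a + entropy_b)


true_assignment = [0, 0, 0, 1, 1, 1]
same_up_to_naming = ["a", "a", "a", "b", "b", "b"]
unrelated = [0, 1, 0, 1]
score_match = normalised_mutual_information(true_assignment, same_up_to_naming)
score_unrelated = normalised_mutual_information([0, 0, 1, 1], unrelated)
```

The score depends only on how nodes are grouped, not on the label names, so the first comparison yields a value of 1, while the second, in which the two groupings are statistically independent, yields a value of 0.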

The effect of μ_t may be best observed in Figure 4 (see key in Figure 3(d)), which shows mutual information scores for the weighted (w) and unweighted (uw) algorithms as μ_t is varied, with the value of μ_w fixed at 0.3. Here, it can be seen that the weighted algorithms perform well when μ_t is at least as high as μ_w (in this case μ_w = 0.3). It is noticeable that the two classes of algorithm perform well for complementary settings of μ_t. The reason for this may be that a low μ_t relative to μ_w means that there is a lower proportion of inter-community links relative to the proportion of inter-community weights. The effect of this is that a small number of inter-community links receive high link weights relative to the intra-community weights, see Figure 5.

Figure 5 shows an example node with links and weights from a network with μ_t = 0.2 and μ_w = 0.3. As a result a single inter-community link 602 receives a higher weight relative to the intra-community links. The effect of this is that there are regions of the problem space, parameterised by community mixing proportions, in which a weighted algorithm will outperform an unweighted one and vice versa. This can be seen in Figure 4 where the two regions are labelled w (weighted) and uw (unweighted). This result indicates that a choice can be made, based on the community structure, as to the class of community detection algorithm.

In order to take advantage of this information and select the best class of algorithm for a given network, some knowledge of the underlying community structure is normally required. However, the present inventor has appreciated that it may be possible to make some assumption about the communities that are sought based on some knowledge of the specific domain. In most community detection problems, however, this information about the community structure is unknown.

In order to use the information discussed above, it is required to know the values of the mixing parameters of the communities. Without knowledge of the communities (i.e. prior to community detection) it is not normally possible to evaluate these parameters. However, the present inventor has discovered how parameters of the observable network can be mapped to these community parameters and how these values can be used to build a classifier to determine the class of community detection most suitable for the given network.

There is a range of metrics associated with describing network topology, such as degree distribution, average diameter, betweenness, and centrality measures. A problem here is that a parameter is required which describes the way that the community structures interact, without explicitly knowing the community structures. To deal with this, the present inventor chose to consider the node measure called clustering coefficient (see D.J. Watts and S.H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, Jun. 1998, pp. 440-442). This measure can be defined as:

$$C_v = \frac{\left|\{ e_{ij} : i, j \in N_v,\ e_{ij} \in E \}\right|}{k_v (k_v - 1)/2}$$

where the local clustering coefficient, C_v, represents the proportion of the neighbours, N_v, of node v which are connected (i.e. there is an edge e_ij in the set of links E if there is a link between neighbouring nodes i and j) out of the possible connections between its neighbours, k_v(k_v - 1)/2, where k_v is the number of neighbours of node v.

The inventor found that the mean value of the local clustering coefficient, taken over all the nodes in the network, showed a strong correlation with the topological mixing parameter, μ_t (see Figure 6(a)). The clustering coefficients are network parameters that can be observed, whereas the mixing parameters (and others) are not observable from the network alone because these are dependent on the underlying community structure (which was not known previously). This suggests that the mean clustering coefficient could be used to estimate this mixing parameter. If the mean clustering coefficient is used to estimate the topological mixing then it follows that a weighted extension to this may yield information about the weighted mixing parameter (Equation 2):

where w_vi is the weight associated with the link between nodes v and i.

The mean of this value over the network was found to correlate with the weight mixing parameter, μ_w (see Figure 6(b)). In some cases a subset of nodes in the network may be considered instead of all nodes.
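By way of illustration only, the unweighted local clustering coefficient, a weighted counterpart and their means over a set of nodes can be computed as in the following Python sketch. The unweighted function follows the proportion-of-connected-neighbour-pairs definition given above for an undirected network. The weighted function uses the commonly used coefficient of Barrat et al. as an assumed stand-in; it is not asserted to be the weighted extension of Equation 2. The function names and the example network are likewise assumptions.

```python
from itertools import combinations


def local_clustering(adjacency, v):
    """Proportion of pairs of v's neighbours that are themselves linked."""
    neighbours = list(adjacency[v])
    k = len(neighbours)
    if k < 2:
        return 0.0  # no neighbour pairs to compare
    linked = sum(1 for i, j in combinations(neighbours, 2)
                 if j in adjacency[i])
    return linked / (k * (k - 1) / 2)


def weighted_local_clustering(adjacency, v):
    """Barrat-style weighted coefficient (an assumed form, see lead-in)."""
    neighbours = list(adjacency[v])
    k = len(neighbours)
    if k < 2:
        return 0.0
    strength = sum(adjacency[v].values())
    # Each closed neighbour pair contributes the weights of v's two links.
    total = sum(adjacency[v][i] + adjacency[v][j]
                for i, j in combinations(neighbours, 2)
                if j in adjacency[i])
    return total / (strength * (k - 1))


def mean_over_nodes(coefficient, adjacency, nodes=None):
    """Mean coefficient over all nodes, or over a chosen subset of nodes."""
    nodes = list(adjacency) if nodes is None else list(nodes)
    return sum(coefficient(adjacency, v) for v in nodes) / len(nodes)


# Hypothetical weighted network {node: {neighbour: weight}}.
network = {"v": {"a": 2.0, "b": 2.0, "c": 1.0},
           "a": {"v": 2.0, "b": 1.0},
           "b": {"v": 2.0, "a": 1.0},
           "c": {"v": 1.0}}
mean_c = mean_over_nodes(local_clustering, network)
mean_cw = mean_over_nodes(weighted_local_clustering, network)
```

For node "v" one of the three neighbour pairs is linked, giving an unweighted coefficient of 1/3, while the weighted form gives 0.4 because the closed pair lies on the two heavier links. With all weights equal the two functions return the same value.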

The results in Figures 6(a) and 6(b) suggest that the mixing parameters can be estimated from observed network characteristics without knowledge of the community structure.

The reason for this can be explained by considering the general principle of a community: nodes within a community are more likely to be connected compared to the overall probability of connection due to the sparse nature of the network. Hence, if two neighbours are within the same community, it is reasonable to expect them to be connected. However, if neighbours are not in the same community it is more likely that they are not connected. Based on this reasoning, the local clustering coefficient is an estimate of the individual node's mixing parameter, which averaged over the network yields a global estimate.
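By way of illustration only, this reasoning can be checked on a toy network as in the following Python sketch, in which two fully connected groups of four nodes are joined by an increasing number of inter-community links. The construction, the function names and the node numbering are assumptions made for the example, and a toy case of this kind is not a substitute for the benchmark experiments.

```python
from itertools import combinations


def local_clustering(adjacency, v):
    """Proportion of pairs of v's neighbours that are themselves linked."""
    neighbours = sorted(adjacency[v])
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for i, j in combinations(neighbours, 2)
                 if j in adjacency[i])
    return linked / (k * (k - 1) / 2)


def mean_local_clustering(adjacency):
    return sum(local_clustering(adjacency, v) for v in adjacency) / len(adjacency)


def mean_topological_mixing(adjacency, community):
    """Mean proportion of each node's links that leave its community."""
    fractions = []
    for v, neighbours in adjacency.items():
        outside = sum(1 for n in neighbours if community[n] != community[v])
        fractions.append(outside / len(neighbours))
    return sum(fractions) / len(fractions)


def two_cliques(bridges):
    """Cliques {0..3} and {4..7}, joined by `bridges` links from i to i + 4."""
    adjacency = {v: set() for v in range(8)}
    for group in (range(0, 4), range(4, 8)):
        for i, j in combinations(group, 2):
            adjacency[i].add(j)
            adjacency[j].add(i)
    for i in range(bridges):
        adjacency[i].add(i + 4)
        adjacency[i + 4].add(i)
    return adjacency


community = {v: v // 4 for v in range(8)}
results = [(mean_topological_mixing(g, community), mean_local_clustering(g))
           for g in (two_cliques(b) for b in (0, 1, 4))]
```

With 0, 1 and 4 inter-community links the mean mixing proportion rises from 0 to 0.0625 to 0.25 while the mean local clustering coefficient falls from 1.0 to 0.875 to 0.5, because each neighbour outside a node's community is not linked to that node's other neighbours.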

The above results suggested to the present inventor that it is possible to estimate the mixing parameters of the communities. Returning to the reason why it may be useful to estimate these parameters, i.e. to determine the class of algorithm, the inventor considered that rather than estimate the mixing parameters and in turn predict the algorithm class, it may be more useful to use the clustering coefficients to directly predict the algorithm class. Figures 7(a) - 7(d) show similar plots to Figures 6(a) - 6(b), but indicating the performance for the different algorithms. It can be seen that the weighted algorithms have a distinctly different performance pattern to the unweighted ones.

In order to confirm that these observable parameters can effectively predict the algorithm class, a simple classifier was built using a linear support vector machine (SVM - see N. Cristianini and J. Shawe-Taylor, An introduction to support vector machines, Cambridge University Press, 2000). To do this, each of the networks was assigned a class {weighted, unweighted, none} based on the class of algorithm which performed best in terms of its mutual information score. A class of "none" was assigned to any network where the mutual information score for the best performing algorithm was below some threshold. The reasoning for this is that for low performance values the output is not meaningful and therefore the choice of algorithm is irrelevant. As SVMs are restricted to two classes, three classifiers were trained (weighted vs. unweighted, weighted vs. none, unweighted vs. none) and the predicted class obtained using a voting scheme over the three outputs. The results are discussed below.
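By way of illustration only, the class assignment and the voting scheme over three two-class classifiers can be sketched in Python as below. The three linear decision functions use invented coefficients purely to exercise the vote and are not trained SVMs; the function names, the handling of ties and the example feature values are assumptions. The default threshold of 0.6 is the value used for the "none" class in the experiment described in this document.

```python
from collections import Counter


def assign_class(best_weighted_score, best_unweighted_score, threshold=0.6):
    """Label a training network by its best-performing algorithm class."""
    if max(best_weighted_score, best_unweighted_score) < threshold:
        return "none"
    # Equal scores fall to "unweighted" here (an arbitrary assumption).
    if best_weighted_score > best_unweighted_score:
        return "weighted"
    return "unweighted"


def linear_classifier(weights, bias, positive, negative):
    """Two-class linear decision function with made-up coefficients."""
    def classify(features):
        score = sum(w * x for w, x in zip(weights, features)) + bias
        return positive if score > 0 else negative
    return classify


def vote(pairwise_classifiers, features):
    """Majority vote; a three-way tie falls to the first class encountered."""
    votes = Counter(classify(features) for classify in pairwise_classifiers)
    return votes.most_common(1)[0][0]


# Features: (mean clustering coefficient, mean weighted clustering coefficient).
classifiers = [
    linear_classifier((-1.0, 1.0), 0.0, "weighted", "unweighted"),
    linear_classifier((0.0, 1.0), -0.2, "weighted", "none"),
    linear_classifier((1.0, 0.0), -0.2, "unweighted", "none"),
]
prediction_a = vote(classifiers, (0.5, 0.3))
prediction_b = vote(classifiers, (0.1, 0.05))
```

In the first case two of the three classifiers vote "unweighted", and in the second two vote "none", so those are the predicted classes.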

Figures 7(a) - 7(d) are clustering coefficient scatter plots showing the mutual information score for (a) unweighted Infomap, (b) weighted Infomap, (c) unweighted COPRA and (d) weighted COPRA.

In an experiment, a linear SVM was trained on 1790 networks taking the unweighted and weighted mean clustering coefficients as inputs. The "none" class was defined for networks for which a maximum mutual information score was below 0.6. The output classes for the test set (448 networks) are shown in Figure 8 (the predicted classification of the networks in the test set using a linear SVM). This can be compared to the true class labels in Figure 9 (the true classification of the networks in the test set). The overall performance on the test set was 83.9%. A confusion matrix of the test set performance is shown in the table below:

From these results the inventor appreciated that even with a simple classifier it is possible to obtain accurate predictions for the best class of community detection based on properties of the network alone. To the best of the present inventor's knowledge, no previous work has explored the problem of choosing an appropriate community detection algorithm based on the underlying structural properties. The embodiment above deals with a constrained case in which community detection algorithms are considered to fall into two classes (weighted or unweighted) and it is demonstrated that for different types of network and community structure, the class of algorithm has an effect on the performance. Furthermore, the inventor has shown that it is possible to choose the algorithm class based only on the observed network parameters without prior knowledge of the community structure or assignment.

The inventor has demonstrated how structural properties (which require knowledge of the underlying community assignment) can be estimated from features of the observed network. An estimation of the non-observable community-dependent parameters, such as the mixing parameters, can be made based on observable network parameters, such as the local clustering coefficients. This is useful because the community-dependent parameters can be used to choose the type of community detection algorithm. By considering algorithms for weighted networks and algorithms for unweighted networks as two separate classes, it is demonstrated how the two classes perform differently in different areas of the problem space. Mixing parameters can also be estimated using global clustering coefficients, but the mean of the local clustering coefficient provided better estimates in the experiments conducted.

The skilled person will appreciate that the principles set out in the above embodiment may be used in different embodiments. For instance, instead of indirectly estimating the mixing parameter of network nodes by obtaining data representing a mean value of the local clustering coefficients of nodes in the network, another parameter may be used to select at least one community-detection technique. Further, the techniques may be categorised in a manner other than weighted/unweighted, and a wider range of input networks can be considered. Other parameters that can be used to select community detection algorithms in the more general case (not constrained to weighted/unweighted) include (non-exhaustive list):

• degree distribution (parameterised by mean and exponent)

• community size distribution (parameterised by mean and exponent and/or min/max community size)

• link weight distribution (parameterised by mean and exponent)

• level of overlap of communities (parameterised by proportion of nodes in more than one community, mean number of communities to which overlapping nodes belong)

The parameters above can be observed directly from the network, with the exception of community size distribution and level of overlap, although observable parameters which can estimate these could be found. The present inventor has found that some of these parameters affect algorithm performance.

Claims

1. A method of detecting at least one community in a network (100), the method including:

obtaining (202) data indicating at least one observable parameter of the network;

using the obtained parameter data to select (204) at least one community-detecting technique from a set of community-detecting techniques, and

applying (206) the selected at least one community-detecting technique to detect at least one community in the network.

2. A method according to claim 1, wherein the observable parameter provides an indirect indication of interaction between community structures in the network (100).

3. A method according to claim 1, wherein the parameter provides an indication of a proportion of links (104) between a node (102) in the network and other nodes in its community and other nodes in the network that are outside its community.

4. A method according to claim 3, wherein the parameter provides an estimate of a mixing parameter of the node (102).

5. A method according to any one of the preceding claims, wherein the parameter relates to a local clustering coefficient of at least one node (102) in the network (100).

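6. A method according to claim 5, wherein the local clustering coefficient is defined as:

$$C_v = \frac{\left|\{ e_{ij} : i, j \in N_v,\ e_{ij} \in E \}\right|}{k_v (k_v - 1)/2}$$

where C_v represents a proportion of neighbouring nodes (N_v) of the at least one node v that are connected, out of possible connections between the at least one node (102) and its neighbouring nodes, k_v(k_v - 1)/2, where k_v is the number of neighbouring nodes and E is the set of links in the network.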

7. A method according to claim 5 or 6, including calculating a mean value of the local clustering coefficient of a set of nodes (102) in the network (100).

8. A method according to claim 7, wherein the set of nodes in the network may comprise all of the nodes (102) in the network (100).

9. A method according to claim 8, wherein the local clustering coefficient includes a weighted extension and is defined as:

where w_vi is a weight associated with a network link between nodes v and i.

10. A method according to any one of the preceding claims, wherein the community-detecting techniques (118A, 118B) in the set include techniques that use network link weight information and techniques that do not use network link weight information.

11. A method according to claim 10, wherein the techniques (118A, 118B) include a technique involving an Infomap algorithm and a technique involving a COPRA algorithm.

12. A method according to claim 1, wherein the observable parameter is selected from a set including: degree distribution (parameterised by mean and exponent); and link weight distribution (parameterised by mean and exponent).

13. A computer program product comprising computer readable medium, having thereon computer program code means, when the program code is loaded, to make the computer execute a method detecting at least one community in a network (100) according to any one of the preceding claims.

14. A system configured to detect at least one community in a network (100), the system including:

a device (110) configured to obtain (202) data indicating at least one observable parameter of the network;

a device (110) configured to use the obtained parameter data to select (204) at least one community-detecting technique from a set of community-detecting techniques, and

a device (110) configured to apply (206) the selected at least one community-detecting technique to detect at least one community in the network.

15. A method of selecting at least one technique intended to detect at least one community in a network, the method including obtaining an estimate of at least one observable parameter of the network and using that parameter to select a community-detection technique from a set of techniques.