CN103530503A - Complex network sampling method for keeping community structure

Info

Publication number: CN103530503A
Application number: CN201310447528.9A
Authority: CN
Inventors: 童超; 彭赋; 牛建伟; 谢忠玉; 罗小简
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2013-09-27
Filing date: 2013-09-27
Publication date: 2014-01-22
Anticipated expiration: 2033-09-27
Also published as: CN103530503B

Abstract

The invention provides a complex network sampling method that preserves community structure. The method is suited to sampling data for large-scale data analysis under the limits of existing hardware. Two concepts are introduced: community degree and community center. The community degrees of all nodes in the network are computed first; community centers are then selected in order of increasing community degree, and sampling proceeds from each center in the manner of forest fire sampling. The sample size for each community center is determined by the ratio of its community degree to the total community degree of the community centers not yet sampled, and all sampled nodes are output once every community center has been sampled. Experiments show that the sampling result is close to the data of the original network, the community structure is well preserved, and the data scale is greatly reduced, which facilitates big data processing under the limits of existing hardware.

Description

Complex network sampling method that preserves community structure

Technical field

The present invention proposes a complex network sampling method that preserves community structure, and belongs to the fields of computer applications and complex networks.

Background technology

In recent years it has been found that many systems in the real world exist in the form of complex networks, such as social networks, the Internet, mobile telephone networks, protein interaction networks, and neural networks. Complex networks are characterized by huge numbers of nodes and edges, complex structure, diverse nodes and edges, and complex dynamic evolution. For example, the World Wide Web currently has more than one trillion uniform resource locators (URLs), Facebook has one billion user nodes and a hundred billion edges representing user relationships, the network of neurons in the brain has tens of billions of nodes, and the mobile communication network of each of China's three major operators has hundreds of millions of users. How to process network data at such a scale has become a key scientific problem that researchers urgently need to solve.

Community structure has become one of the most common and most important topological properties of complex networks. In a community structure, nodes within a community are densely connected, while connections between communities are sparse. Studying the community structure of complex networks is of great theoretical significance. Community structure research has also been applied in many fields, such as identifying terrorist organizations, social network analysis and management, predicting the function of unknown proteins, identifying master control genes, Web community mining, and search engines, and it has broad application prospects.

Although some researchers firmly believe that processing the complete data will become the trend as computing power and data acquisition capability improve, sampling remains a common way of handling massive data today. For large-scale complex networks the difficulty is this: if the sample is small, it is hard to guarantee that the relationships of the original community structure are retained; if the sample is large, the scale of data processing remains large and processing is still difficult.

Summary of the invention

To address the difficulty of processing large-scale complex network data, the present invention proposes a complex network sampling method that preserves community structure. While better preserving the community structure of the original network, it greatly reduces the data scale. It serves as a preprocessing step on the original network data that facilitates subsequent research on community structure.

Let a complex network be represented by an undirected, unweighted graph G = (V, E), where V is the set of nodes in the network and E is the set of edges. Let the total number of nodes be n and the total number of edges be m, and let k_v denote the degree of node v.

The complex network sampling method that preserves community structure according to the present invention comprises the following steps:

Step 1: Determine the community degree of each node in the complex network. The community degree of a node is defined as the sum of the node's own degree and the degrees of its neighbor nodes.

Step 2: Traverse the nodes of graph G to find the community centers. When the community degree of a node is not less than the community degree of any of its neighbors, the node is marked as a community center.

Step 3: Sort all community centers by community degree.

Step 4: Select, as the start node of sampling, the community center that has not yet been sampled and has the smallest community degree.

Step 5: Mark the start node as a sampled node, and set the sample size of the start node.

Let the start node be u. The sample size Samsize of node u is:

$$\mathrm{Samsize} = \mathrm{cur\_size} \cdot \frac{N_u}{\sum_{v \in U_s} N_v}$$

where cur_size is the number of nodes that currently remain to be sampled, N_u is the community degree of node u, U_s is the set of community centers not yet sampled, and N_v is the community degree of node v.

Step 6: Sample starting from the start node to obtain sampled nodes.

Step 7: Determine whether all community centers have been sampled. If so, output all sampled nodes; otherwise, return to Step 4.

The sampling in Step 6 proceeds as follows:

Step 6.1: Take the start node u as the current sampling node v.

Step 6.2: Generate a random integer x for node v, choose x neighbors of node v that have not yet been sampled, and mark the chosen nodes as sampled nodes. Each neighbor w_i of node v is assigned a selection probability.

Step 6.3: Determine whether the number of sampled nodes obtained from start node u has reached the sample size of node u. If so, end the sampling of node u; otherwise, proceed to Step 6.4.

Step 6.4: Determine whether any of the chosen neighbors of node v has a neighbor that has not yet been sampled. If so, perform Step 6.5; if not, end the sampling of node u.

Step 6.5: Take each node that has unsampled neighbors as the current sampling node v, then return to Step 6.2.

The complex network sampling method that preserves community structure according to the present invention greatly reduces the data scale while better preserving the community structure of the original network, and thereby facilitates big data processing under the constraints of existing hardware.

Brief description of the drawings

Fig. 1 is an overall flow diagram of the complex network sampling method that preserves community structure according to the present invention;

Fig. 2 shows the sampling performance of each sampling algorithm on the as-22july06 data set in the embodiment of the present invention;

Fig. 3 shows the sampling performance of each sampling algorithm on the astro-ph data set in the embodiment of the present invention;

Fig. 4 shows the sampling performance of each sampling algorithm on the cond-mat data set in the embodiment of the present invention;

Fig. 5 shows the sampling performance of each sampling algorithm on the hep-th data set in the embodiment of the present invention;

Fig. 6 shows the sampling performance of each sampling algorithm on the cond-mat2005 data set in the embodiment of the present invention.

Embodiment

The technical solution of the present invention is described below with reference to the drawings and embodiments.

The complex network sampling method that preserves community structure provided by the present invention is an improvement on the forest fire sampling method, whose principle is described first.

The forest fire sampling method (reference: J. Leskovec, C. Faloutsos. Sampling from Large Graphs. In Proc. of ACM SIGKDD, 2006: 631-636) works as follows. For a given network, first select a node v at random, then generate a random number x that follows a geometric distribution. Node v selects x incident edges whose other endpoints have not yet been visited, which yields x unvisited nodes. Each of these x nodes in turn finds unvisited nodes by the same random-number procedure, and the process repeats until enough nodes have been burned. To avoid duplicates, no node may be visited twice during the process. If the fire dies, a new node is selected at random. The parameter p_f is called the forward burning probability. The forest fire sampling method tends to sample nodes of large degree.
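The baseline procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `forest_fire_sample`, the adjacency-dictionary graph format, and the `target_size` and `seed` parameters are assumptions made for the example. The geometric draw counts successes with probability p_f before the first failure, which is one common convention, and it is capped at the number of unvisited neighbors so that p_f = 1.0 still terminates.

```python
import random
from collections import deque

def forest_fire_sample(adj, target_size, p_f=0.7, seed=None):
    """Return a set of `target_size` nodes burned from an undirected graph.

    adj: dict mapping each node to the set of its neighbors (assumed format).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    target_size = min(target_size, len(nodes))
    visited = set()
    while len(visited) < target_size:
        # First round, or the fire died: ignite a random unvisited node.
        start = rng.choice([n for n in nodes if n not in visited])
        visited.add(start)
        queue = deque([start])
        while queue and len(visited) < target_size:
            v = queue.popleft()
            candidates = [w for w in adj[v] if w not in visited]
            rng.shuffle(candidates)
            # Geometric draw for x, capped at the number of unvisited neighbors.
            x = 0
            while x < len(candidates) and rng.random() < p_f:
                x += 1
            for w in candidates[:x]:
                if len(visited) >= target_size:
                    break
                visited.add(w)      # a node is never visited twice
                queue.append(w)
    return visited

# Small example graph: a triangle 0-1-2 with a tail 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
sample = forest_fire_sample(adj, 3, p_f=0.7, seed=1)
```

The start node here is chosen uniformly at random, which is exactly the behavior the present invention replaces with community centers.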

The present invention builds on the forest fire sampling method for two main reasons: (1) a comparison of several existing sampling algorithms shows that forest fire sampling performs best among them; (2) forest fire sampling preserves the connectivity of the graph well. However, forest fire sampling chooses its start node at random, and the algorithm has no tendency to preserve community structure. The present invention therefore improves on these two aspects.

The present invention first defines two concepts, community degree and community center, and on that basis proposes a complex network sampling method that preserves community structure, which both reduces the scale of the network data and preserves the community structure of the original network.

In the present invention, the community degree of a node is defined as the sum of the node's own degree and the degrees of its neighbor nodes, and a community center is defined as a network node whose community degree is a local maximum.

The method of the present invention allocates sample sizes to community centers proportionally. Its basic principle is as follows:

(1) Take the community center with the smallest community degree as the start node and sample in the forest fire manner. The sample size is allocated according to the share of the current start node's community degree in the total community degree of the community centers. Once the sample size is reached, move on to the next node.

(2) From the remaining community centers that have not been sampled, select the node with the smallest community degree as the new start node and continue sampling as in (1), until the overall sample reaches the specified sample size or all community centers have been sampled.

Fig. 1 shows the flow chart of the complex network sampling method that preserves community structure according to the present invention. Each step is described below with reference to Fig. 1.

First, the complex network to be sampled is represented by an undirected, unweighted graph G = (V, E), where V is the set of nodes in the network and E is the set of edges; the total number of nodes in G is n and the total number of edges is m. An edge of the network is written (u, v), where u, v ∈ V, and nodes u and v are neighbors of each other. The neighbor set of node u is U = {ν | (u, ν) ∈ E}. The degree of node v is denoted k_v.

Step 1: Determine the community degree of each node in the complex network.

For node u, its community degree N_u is:

$$N_u = k_u + \sum_{\nu \in U} k_\nu$$

where k_u is the degree of node u.
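As a small worked sketch of Step 1, the community degree N_u (a node's own degree plus the degrees of its neighbors) can be computed directly from an adjacency structure. The function name `community_degrees` and the adjacency-dictionary format are assumptions chosen for this illustration.

```python
def community_degrees(adj):
    """Return {u: N_u}, where N_u = k_u + sum of k_v over the neighbors v of u."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    return {u: degree[u] + sum(degree[v] for v in adj[u]) for u in adj}

# Star graph: node 1 is connected to nodes 0, 2 and 3.
adj = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}
N = community_degrees(adj)
# Node 1: k = 3, neighbor degrees 1 + 1 + 1, so N_1 = 6.
# Each leaf: k = 1, neighbor degree 3, so N = 4.
```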

Step 2: Traverse the nodes of graph G to find the community centers. A node whose community degree is not less than the community degree of any of its neighbors is a community center.

If the community degree of node u is greater than or equal to the community degrees of all of its neighbors, node u is marked as a community center. In general, more than one node is marked as a community center.
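The rule in Step 2 amounts to a local-maximum test on the community degree (a node's own degree plus its neighbors' degrees). The sketch below is illustrative only; `community_centers` and the adjacency-dictionary format are assumed names, and an isolated node trivially passes the test because it has no neighbors to compare against.

```python
def community_centers(adj):
    """Return the nodes whose community degree is >= that of every neighbor."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    N = {u: degree[u] + sum(degree[v] for v in adj[u]) for u in adj}
    return [u for u in adj if all(N[u] >= N[v] for v in adj[u])]

# Two triangles (0-1-2 and 3-4-5) joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
centers = community_centers(adj)
# N = 7 for nodes 0, 1, 4, 5 and N = 10 for nodes 2 and 3,
# so both bridge endpoints qualify (ties are allowed by ">=").
```

In this example each triangle contributes one center, which matches the intuition that a center marks the densest point of a community.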

Step 3: Sort all community centers by community degree, for example in ascending order of community degree. All community centers are initially marked as not sampled.

Step 4: Select the start node for each round of sampling. The selection criterion is: the node with the smallest community degree among the community centers not yet sampled. In the method of the present invention, starting from the node with the smallest degree helps correct the bias toward nodes of large degree.
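Steps 3 and 4 together reduce to repeatedly picking the smallest unsampled center. A minimal sketch follows; the helper name `next_start_node`, the `sampled` set, and the example community-degree values are hypothetical and serve only to illustrate the selection order.

```python
def next_start_node(centers, N, sampled):
    """Return the unsampled community center with the smallest community
    degree N[c], or None when every center has already been sampled."""
    remaining = [c for c in centers if c not in sampled]
    return min(remaining, key=lambda c: N[c]) if remaining else None

# Hypothetical community degrees of three community centers.
N = {"a": 12, "b": 7, "c": 9}
centers = ["a", "b", "c"]

first = next_start_node(centers, N, sampled=set())     # smallest N overall
second = next_start_node(centers, N, sampled={"b"})    # smallest among the rest
```

Sorting once and walking the sorted list gives the same order; the `min` form is used here only because it keeps the example short.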

Step 5: Mark the start node as a sampled node, and set the sample size of the start node.

Let the start node of the current round be u. The sample size Samsize of node u is:

$$\mathrm{Samsize} = \mathrm{cur\_size} \cdot \frac{N_u}{\sum_{v \in U_s} N_v}$$

where U_s is the set of remaining community centers not yet sampled, and cur_size is the number of nodes that currently remain to be sampled, set by the user as appropriate.
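A short numeric sketch may make the allocation rule concrete. It rests on assumptions the text does not state explicitly: U_s is taken to include the start node u itself, cur_size is reduced by each allocation, and fractional sizes are rounded to the nearest integer. The function name `sample_size` and the community-degree values are hypothetical.

```python
def sample_size(cur_size, u, unsampled_centers, N):
    """Samsize = cur_size * N_u / (sum of N_v over the unsampled centers)."""
    total = sum(N[v] for v in unsampled_centers)
    return round(cur_size * N[u] / total)   # rounding is an assumption

# Hypothetical community degrees of three community centers.
N = {"a": 12, "b": 7, "c": 9}
remaining = sorted(N, key=N.get)   # ascending community degree: b, c, a
cur_size = 40                      # total number of nodes the user wants
sizes = {}
while remaining:
    u = remaining[0]
    sizes[u] = sample_size(cur_size, u, remaining, N)
    cur_size -= sizes[u]           # nodes still to be sampled afterwards
    remaining.pop(0)
# b: 40 * 7 / 28 = 10;  c: 30 * 9 / 21 ≈ 12.86 -> 13;  a: 17 * 12 / 12 = 17
```

Under these assumptions the last center receives whatever quota remains, so the allocations add up to the requested total. In an actual run cur_size would shrink by the number of nodes really burned, which can be smaller than the allocation if the fire dies early.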

Step 6: Sample starting from the start node.

The sampling process is based on the forest fire sampling method and comprises Steps 6.1 to 6.5.

Step 6.1: Starting from the start node u, take it as the current sampling node v.

Step 6.2: Generate a random integer x for node v and select x edges of node v whose other endpoints have not been visited or sampled. Let w_1, w_2, w_3, ..., w_x denote the x unvisited nodes corresponding to these x edges.

The probability that neighbor w_i of node v is visited is defined in terms of p_f, k_v, and k_{w_i}, where p_f is the forward burning probability, whose value is set by the user, k_v is the degree of node v, and k_{w_i} is the degree of node w_i.

The x selected neighbors of node v are marked as sampled nodes.

Step 6.3: Determine whether the number of sampled nodes obtained from start node u has reached the sample size Samsize of node u. If so, end the sampling of node u; otherwise, proceed to Step 6.4.

Step 6.4: Determine whether any of the neighbors w_1, w_2, w_3, ..., w_x of node v has a neighbor that has not yet been sampled. If so, perform Step 6.5; if none does, end the sampling of node u.

Each of the x selected nodes in turn continues to find unvisited nodes according to a random number. The search repeats until enough nodes have been obtained or no more can be found. To avoid duplicates, no node may be visited twice during the search. When the sample size set in Step 5 has been reached, or no further nodes can be burned, a new start node is selected.

Because the forest fire sampling method is biased toward nodes of large degree, a visit probability is defined for each node, where v is the node currently being sampled and w_i is the node currently being considered for burning. If the degree of w_i is large, this probability reduces the chance that w_i is sampled, which to some extent avoids the bias of the forest fire sampling method toward nodes of large degree.
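Steps 6.1 to 6.5 can be sketched as a bounded burn from one start node. Several parts of this sketch are assumptions: the exact visit-probability formula is not reproduced in this text, so `stand_in_weight` is a hypothetical placeholder that only mimics the stated property (a larger degree of w_i lowers its chance of being chosen); the start node is counted toward the sample size; x is drawn geometrically and capped at the number of unsampled neighbors; and `burn_from` and the adjacency-dictionary format are illustrative names.

```python
import random

def stand_in_weight(p_f, k_v, k_w):
    # HYPOTHETICAL stand-in, not the patent's formula: it only reproduces
    # the property that a larger neighbor degree k_w gives a smaller weight.
    return p_f * k_v / (k_v + k_w)

def burn_from(adj, start, sam_size, sampled, p_f=0.7,
              weight=stand_in_weight, rng=None):
    """Burn from `start`, adding nodes to `sampled`; return how many nodes
    this start node contributed (the start node itself is counted)."""
    rng = rng or random.Random()
    sampled.add(start)                        # Step 5: mark the start node
    added, frontier = 1, [start]              # Step 6.1
    while frontier and added < sam_size:
        next_frontier = []
        for v in frontier:
            pool = [w for w in sorted(adj[v]) if w not in sampled]
            x = 0                             # Step 6.2: draw x, capped at |pool|
            while x < len(pool) and rng.random() < p_f:
                x += 1
            for _ in range(x):
                if added >= sam_size:         # Step 6.3: quota reached
                    return added
                ws = [weight(p_f, len(adj[v]), len(adj[w])) for w in pool]
                w = pool.pop(rng.choices(range(len(pool)), weights=ws)[0])
                sampled.add(w)
                added += 1
                next_frontier.append(w)
        # Steps 6.4 and 6.5: continue only from nodes with unsampled neighbors.
        frontier = [w for w in next_frontier
                    if any(n not in sampled for n in adj[w])]
    return added

# Two triangles joined by the bridge 2-3; burn 4 nodes starting at node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
sampled = set()
count = burn_from(adj, 2, 4, sampled, p_f=1.0, rng=random.Random(0))
```

With p_f = 1.0 every unsampled neighbor of node 2 is burned, so the call collects nodes 0, 1, 2 and 3 and then stops because the quota of 4 is reached. Passing a different `weight` function is the intended way to plug in the actual probability once it is known.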

The sampling method of the present invention (abbreviated FFS-Tran) is compared with six classical sampling algorithms: breadth-first sampling (BFS), random walk sampling (RW), Metropolis-Hastings random walk (MHRW), snowball sampling (SN), edge sampling (FS), and forest fire sampling (FFS). Five data sets are used in the experiments: a high-energy physics paper collaboration network (hep-th), an astrophysics collaboration network (astro-ph), an Internet autonomous systems network (as-22july06), and two condensed matter paper collaboration networks (cond-mat and cond-mat2005). The results are compared using several standard NCP indices. Conductance (Cond), Expansion (Exp), Internal density (ID), and Cut Ratio (CR): for these four indices, smaller values indicate that a community is more tightly connected internally and more sparsely connected to the outside. Normalized Cut (NC), Maximum-ODF (Out Degree Fraction, MODF), Average-ODF (AODF), and Flake-ODF (FODF): for these four indices, smaller values indicate better sampling performance. Modularity (Mod), Modularity ratio (MR), and Volume (Volum): for these three indices, larger values indicate better sampling performance.
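The text does not spell out how these indices are computed. As a rough sketch, assuming the definitions commonly used in the community-detection literature for the first four indices (with n nodes in the graph, n_S nodes and m_S internal edges in a community S, and c_S edges leaving S), they could be evaluated as follows. The function name `community_scores` and the graph format are assumptions for this example.

```python
def community_scores(adj, S):
    """Cond, Exp, ID and CR for a node set S under the assumed definitions.

    S must have at least two nodes, at least one incident edge, and must not
    be the whole graph, otherwise a denominator is zero.
    """
    S = set(S)
    n, n_s = len(adj), len(S)
    # Each internal edge is seen from both endpoints, hence the halving.
    m_s = sum(1 for u in S for v in adj[u] if v in S) // 2
    c_s = sum(1 for u in S for v in adj[u] if v not in S)
    return {
        "Cond": c_s / (2 * m_s + c_s),            # conductance
        "Exp": c_s / n_s,                         # expansion
        "ID": 1 - m_s / (n_s * (n_s - 1) / 2),    # internal density
        "CR": c_s / (n_s * (n - n_s)),            # cut ratio
    }

# Two triangles joined by the bridge 2-3; score the left triangle.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = community_scores(adj, {0, 1, 2})
# m_S = 3, c_S = 1: Cond = 1/7, Exp = 1/3, ID = 0, CR = 1/9
```

All four values are small for the triangle, consistent with the statement that smaller values mean a tighter interior and sparser outside connections.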

For the six algorithms, the sample size is set to 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.7, and 0.8; for each sample size the data are run ten times and the results are averaged. For the forest fire algorithm, the forward burning probability takes the values 0.6, 0.7, 0.8, 0.9, and 1.0; for each combination of sample size and forward burning probability the data are run ten times and the results are averaged. The resulting data are plotted as line charts to show the trends directly. The results show that on the first data set, as-22july06, the method of the present invention (FFS-Tran) differs little from the other sampling algorithms, while on the remaining data sets it outperforms them. On the cond-mat2005 data set in particular, its sampling performance is clearly better than that of the six classical sampling algorithms. Each data set is analyzed below:

As can be seen from Fig. 2, on the as-22july06 data set the sampling performance of the method of the present invention (FFS-Tran) is worse than that of random walk sampling (RW), MH random walk (MHRW), snowball sampling (SN), and edge sampling (FS), although the differences among these algorithms are small. FFS and BFS perform clearly worse than the method of the present invention and the other four classical sampling algorithms.

As can be seen from Fig. 3, on the astro-ph data set the method of the present invention outperforms the other sampling algorithms on most indices, but on the ID and CR indices it performs worse than the six classical sampling algorithms. Taking all indices together, the method of the present invention is clearly better than the six classical sampling algorithms on astro-ph; in the figure, the FFS-Tran curve lies well below the other curves.

As can be seen from Fig. 4, on the cond-mat data set the method of the present invention is also clearly better than the other classical sampling algorithms on most indices, but slightly worse on the ID and CR indices. FFS and BFS perform worst.

As can be seen from Fig. 5, on the hep-th data set the method of the present invention performs best on most indices and is clearly better than the other sampling algorithms on the Cond and Exp indices, but performs poorly on CR. FFS and BFS perform poorly on most indices.

As can be seen from Fig. 6, the method of the present invention performs much better than the other sampling algorithms. Apart from the CR index, the D-statistic values of all indices are around 0.2, which shows that the sample is close to the data of the original graph and that the community structure is well preserved.
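The text reports D-statistic values without defining them. Assuming this refers to the two-sample Kolmogorov-Smirnov D-statistic, the largest vertical gap between two empirical cumulative distributions, a value near 0 means the sample's distribution of an index closely tracks that of the original graph. The sketch below uses only the standard library; the function name `d_statistic` and the example values are hypothetical.

```python
from bisect import bisect_right

def d_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: max |F_a(x) - F_b(x)| over all x."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    # bisect_right(s, x) / len(s) is the empirical CDF of s evaluated at x.
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in points)

# Hypothetical index values measured on the original graph and on a sample.
original = [1, 1, 2, 3]
sample = [1, 2, 2, 3]
d = d_statistic(original, sample)
# CDFs at x = 1: 0.50 vs 0.25; at x = 2: 0.75 vs 0.75; at x = 3: 1.0 vs 1.0
```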

The experiments show that on the astro-ph, cond-mat, hep-th, and cond-mat2005 data sets the sampling performance of the method of the present invention is better than that of the six sampling algorithms, which indicates that the method is more effective than the six sampling algorithms at preserving community structure. On the cond-mat2005 data set in particular, the method preserves the community structure of the original graph well.

The above experiments demonstrate that the sampling result of the method of the present invention is close to the data of the original graph and preserves the community structure well. The comparison with six classical sampling algorithms verifies the feasibility and effectiveness of the method.

Claims

1. A complex network sampling method that preserves community structure, in which the complex network is represented by an undirected, unweighted graph G = (V, E), where V is the set of nodes in the network and E is the set of edges, the total number of nodes is n, the total number of edges is m, and the degree of node v is denoted k_v; characterized in that the complex network sampling method comprises the following steps:

Let the start node be u; the sample size Samsize of node u is:

$$\mathrm{Samsize} = \mathrm{cur\_size} \cdot \frac{N_u}{\sum_{v \in U_s} N_v}$$

Step 6: Sample starting from the start node to obtain sampled nodes;

2. The complex network sampling method that preserves community structure according to claim 1, characterized in that the sampling in Step 6 proceeds as follows:

Step 6.1: Take the start node u as the current sampling node v;