CN103530503A - Complex network sampling method for keeping community structure - Google Patents

Publication number
CN103530503A
Authority
CN
China
Prior art keywords
node
community
sampling
degrees
sampled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310447528.9A
Other languages
Chinese (zh)
Other versions
CN103530503B (en)
Inventor
童超
彭赋
牛建伟
谢忠玉
罗小简
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201310447528.9A
Publication of CN103530503A
Application granted
Publication of CN103530503B
Expired - Fee Related
Anticipated expiration

Abstract

The invention provides a complex network sampling method that preserves community structure. The method is suited to sampling data for large-scale data analysis under the constraints of existing hardware. Two concepts are introduced: community degree and community center. The community degrees of all nodes in the network are computed first; community centers are then selected in ascending order of community degree, and sampling is performed from each of them in the forest fire manner. The sample size for each community center is determined by the ratio of its community degree to the total community degree of the community centers not yet sampled, and all sampled nodes are output once every community center has been sampled. Experiments show that the sampling result is close to the data of the original network, that the community structure is well preserved, and that the data scale is greatly reduced, which makes big-data processing more convenient under existing hardware constraints.

Description

Complex network sampling method that preserves community structure
Technical field
The present invention proposes a complex network sampling method that preserves community structure, and belongs to the fields of computer applications and complex networks.
Background art
In recent years it has been found that many real-world systems exist in the form of complex networks, such as social networks, the Internet, mobile telephone networks, protein interaction networks, and neural networks. Complex networks are characterized by huge numbers of nodes and edges, complicated network structure, diverse nodes and edges, and complex dynamic evolution. For example, the World Wide Web currently has more than a trillion uniform resource locators (URLs), Facebook has one billion user nodes and a hundred billion user-relationship edges, the neural network of the brain has tens of billions of nodes, and each of China's three major mobile operators has several hundred million users. How to process ultra-large-scale network data has become a key scientific problem that researchers urgently need to solve.
Community structure has become one of the most common and most important topological properties of complex networks. In a community structure, nodes inside a community are densely connected, while connections between communities are sparse. Research on the community structure of complex networks has important theoretical significance. Community structure research has also been applied in fields such as terrorist organization identification, social network analysis and management, function prediction for unknown proteins, master regulator gene identification, Web community mining, and search engines, and has broad application prospects.
Although some researchers firmly believe that processing the complete data will become the trend as computing power and data acquisition capability improve, sampling remains a common way of handling massive data today. For large-scale complex networks, a small sample makes it difficult to preserve the relationships of the original community structure, while a large sample leaves the data processing scale large and, at present, difficult to handle.
Summary of the invention
To address the difficulty of processing large-scale complex network data, the present invention proposes a complex network sampling method that preserves community structure. It greatly reduces the data scale while better preserving the community structure of the original network, and serves as preprocessing of the original network data to facilitate subsequent community structure research.
Let a complex network be represented by an undirected, unweighted graph G = (V, E), where V is the set of nodes and E is the set of edges in the network. Let n be the total number of nodes and m the total number of edges, and let $k_v$ denote the degree of node v.
The complex network sampling method of the present invention that preserves community structure comprises the following steps:
Step 1: Determine the community degree of each node in the complex network; the community degree of a node is defined as the sum of the node's own degree and the degrees of its neighbor nodes;
Step 2: Traverse the nodes of graph G to find the community centers; when the community degree of a node is not less than the community degree of any of its neighbors, mark the node as a community center;
Step 3: Sort all community centers by community degree;
Step 4: Choose the community center that has not yet been sampled and has the smallest community degree as the start node of the sampling;
Step 5: Mark the start node as a sampled node, and set the sample size of the start node;
Let the start node be u; the sample size Samsize of node u is:
$$\mathrm{Samsize} = \mathrm{cur\_size} \cdot \frac{N_u}{\sum_{v \in U_s} N_v}$$
where cur_size denotes the number of nodes that currently still need to be sampled, $N_u$ denotes the community degree of node u, $U_s$ denotes the set of community centers not yet sampled, and $N_v$ denotes the community degree of node v;
Step 6: Start sampling from the start node to obtain sampled nodes;
Step 7: Determine whether all community centers have been sampled; if so, output all sampled nodes; otherwise, return to Step 4.
The sampling in Step 6 is performed as follows:
Step 6.1: Take the start node u as the current sampling node v;
Step 6.2: Generate a random integer x for node v, choose x neighbor nodes of v that have not been sampled, and mark the chosen nodes as sampled nodes; the probability of choosing neighbor node $w_i$ of node v is set to
[Formula image BDA0000388500530000022 in the original publication; not reproduced in this text]
Step 6.3: Determine whether the number of nodes sampled from start node u has reached the sample size of node u; if so, end the sampling for node u; otherwise, continue with Step 6.4;
Step 6.4: Determine whether each neighbor node of node v has neighbor nodes that have not been sampled; if so, perform Step 6.5; if not, end the sampling for node u;
Step 6.5: Take each node that has unsampled neighbor nodes as the sampling node v, and return to Step 6.2.
The complex network sampling method of the present invention greatly reduces the data scale while better preserving the community structure of the original network, and thereby facilitates big-data processing under existing hardware constraints.
Brief description of the drawings
Fig. 1 is the overall flowchart of the complex network sampling method of the present invention that preserves community structure;
Fig. 2 shows the sampling performance of each sampling algorithm on the as-22july06 data set in the embodiment of the present invention;
Fig. 3 shows the sampling performance of each sampling algorithm on the astro-ph data set in the embodiment of the present invention;
Fig. 4 shows the sampling performance of each sampling algorithm on the cond-mat data set in the embodiment of the present invention;
Fig. 5 shows the sampling performance of each sampling algorithm on the hep-th data set in the embodiment of the present invention;
Fig. 6 shows the sampling performance of each sampling algorithm on the cond-mat2005 data set in the embodiment of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described below with reference to the drawings and embodiments.
The complex network sampling method provided by the present invention is an improvement on forest fire sampling, so the principle of forest fire sampling is described first.
Forest fire sampling (reference: J. Leskovec, C. Faloutsos. Sampling from Large Graphs. In Proc. of ACM SIGKDD, 2006: 631-636) works as follows. For a given network, a node v is first selected at random, and a random number x is generated, where x follows a geometric distribution. Node v selects x incident edges whose other endpoints have not been visited, which yields the x unvisited nodes corresponding to these edges. Each of these x nodes then looks for unvisited nodes in turn by the same random-number procedure, and the process repeats until enough nodes have been burned. To avoid duplicates, a node cannot be visited twice during forest fire sampling. If the fire dies out, a new node is selected at random. The parameter of the geometric distribution is called the forward burning probability. Forest fire sampling tends to sample nodes of high degree.
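The forest fire procedure described above can be sketched as follows. This is a minimal illustration, not the reference implementation from the cited paper: the function name `forest_fire_sample`, the adjacency-dictionary graph format, and the way the geometric draw is parametrized (count of successes with probability `p_f` before the first failure, capped at the number of unvisited neighbors) are all assumptions made for this sketch.

```python
import random

def forest_fire_sample(adj, target, p_f=0.7, rng=None):
    """Return a set of `target` nodes from adj = {node: set(neighbors)}."""
    rng = rng or random.Random()
    target = min(target, len(adj))
    visited = set()
    while len(visited) < target:
        # First round, or the fire died out: restart from a random unvisited node.
        seed = rng.choice([n for n in adj if n not in visited])
        visited.add(seed)
        frontier = [seed]
        while frontier and len(visited) < target:
            v = frontier.pop(0)
            candidates = [w for w in adj[v] if w not in visited]
            # Geometric-style draw (assumed parametrization), capped so that
            # p_f = 1.0 terminates and x never exceeds the available neighbors.
            x = 0
            while x < len(candidates) and rng.random() < p_f:
                x += 1
            for w in rng.sample(candidates, x):
                if len(visited) >= target:
                    break
                visited.add(w)       # a node is never visited twice
                frontier.append(w)   # w will spread the fire in turn
    return visited
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when averaging results over repeated runs.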
The present invention builds on forest fire sampling for two main reasons: 1. a comparison of several existing sampling algorithms shows that forest fire sampling performs best among them; 2. forest fire sampling preserves the connectivity of the graph relatively well. However, forest fire sampling starts from a random node, and the algorithm has no inherent tendency to preserve community structure. The present invention therefore improves on these two aspects.
First, the present invention defines two concepts, community degree and community center, and on this basis proposes a complex network sampling method that preserves community structure, which both reduces the scale of the network data and preserves the community structure of the original network.
In the present invention, the community degree of a node is defined as the sum of the node's own degree and the degrees of its neighbor nodes, and a community center is defined as a network node whose community degree is a local maximum.
The method of the invention allocates the sample size among the community centers proportionally. Its basic principle is as follows:
(1) Take the community center with the smallest community degree as the start node and sample in the forest fire manner. The sample size is allocated according to the ratio of the community degree of the current start node to the total community degree of the community centers. Once the sample size has been reached, move on to the next node.
(2) From the remaining community centers that have not been sampled, select the node with the smallest community degree as the new start node, consistent with Step 4 below, and continue sampling as in (1) until the overall sample reaches the specified sample size or all community centers have been sampled.
Fig. 1 shows the flowchart of the complex network sampling method of the present invention that preserves community structure; each step is described below with reference to Fig. 1.
First, a complex network to be sampled is represented by an undirected, unweighted graph G = (V, E), where V is the set of nodes and E is the set of edges in the network; let n be the total number of nodes in G and m the total number of edges. An edge of the complex network is written (u, v), where u, v ∈ V; nodes u and v are then neighbors of each other, and the neighbor set of node u is $U = \{v \mid (u, v) \in E\}$. The degree of node v is denoted $k_v$.
Step 1: Determine the community degree of each node in the complex network.
For a node u, its community degree $N_u$ is $N_u = k_u + \sum_{v \in U} k_v$, where $k_u$ denotes the degree of node u and U is the neighbor set of node u.
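As a small worked illustration of the community degree, the sketch below computes $N_u$ for every node of a graph stored as an adjacency dictionary. The function name `community_degrees` and the graph representation are assumptions made for this example only.

```python
def community_degrees(adj):
    """adj = {node: set(neighbors)} for an undirected, unweighted graph."""
    degree = {u: len(neighbors) for u, neighbors in adj.items()}
    # N_u = k_u + sum of k_v over the neighbors v of u
    return {u: degree[u] + sum(degree[v] for v in adj[u]) for u in adj}

# Path graph a - b - c: degrees are 1, 2, 1.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(community_degrees(path))  # {'a': 3, 'b': 4, 'c': 3}
```

In the path graph, node b has degree 2 and two neighbors of degree 1, so its community degree is 2 + 1 + 1 = 4, while each end node has 1 + 2 = 3.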
Step 2: Traverse the nodes of graph G to find the community centers. A node whose community degree is not less than the community degree of any of its neighbor nodes is a community center.
That is, if the community degree of node u is greater than or equal to the community degrees of all its neighbor nodes, node u is marked as a community center. In general, more than one node is marked as a community center.
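The community center test of Step 2 is a local-maximum check and can be sketched in a few lines. The function name `community_centers` is hypothetical, and the community degrees are passed in as a precomputed dictionary; how isolated nodes should be treated is not stated in the text, so the behavior noted in the comment is a side effect of this sketch rather than a rule from the patent.

```python
def community_centers(adj, comm_deg):
    """adj = {node: set(neighbors)}, comm_deg = {node: community degree N_u}."""
    # A node is a center when no neighbor has a strictly larger community degree.
    # Note: an isolated node passes trivially, because all() of nothing is True.
    return [u for u in adj
            if all(comm_deg[u] >= comm_deg[v] for v in adj[u])]

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(community_centers(path, {"a": 3, "b": 4, "c": 3}))  # ['b']
```

Because the comparison is "greater than or equal", two adjacent nodes with equal community degree can both be marked as centers.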
Step 3: Sort all community centers by community degree, for example in ascending order of community degree. Initially, all community centers are marked as not sampled.
Step 4: Select the start node for each round of sampling. The selection criterion is: the node with the smallest community degree among the community centers not yet sampled. In the method of the invention, starting from the node with the smallest community degree helps correct the bias toward high-degree nodes.
Step 5: Mark the start node as a sampled node and set its sample size.
Let the start node of the current round be u; the sample size Samsize of node u is:
$$\mathrm{Samsize} = \mathrm{cur\_size} \cdot \frac{N_u}{\sum_{v \in U_s} N_v}$$
where $U_s$ is the set of community centers that currently remain unsampled, and cur_size is the number of nodes that currently still need to be sampled, set by the user as appropriate.
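To make the Samsize formula concrete, the sketch below allocates quotas to a set of community centers in ascending order of community degree. It rests on two assumptions that the text does not spell out: that $U_s$ still contains the current start node u when its quota is computed, and that each quota is filled exactly, so cur_size shrinks by Samsize after each round. The function name `allocate_sample_sizes` is hypothetical.

```python
def allocate_sample_sizes(center_comm_deg, total_size):
    """center_comm_deg = {center: N_u}; returns {center: Samsize}."""
    order = sorted(center_comm_deg, key=center_comm_deg.get)  # smallest N_u first
    cur_size = total_size
    sizes = {}
    for i, u in enumerate(order):
        # Sum of N_v over the centers not yet sampled (assumed to include u).
        remaining = sum(center_comm_deg[v] for v in order[i:])
        sizes[u] = cur_size * center_comm_deg[u] / remaining
        cur_size -= sizes[u]  # assumes the quota for u is filled exactly
    return sizes

print(allocate_sample_sizes({"a": 10, "b": 30, "c": 60}, 100))
# {'a': 10.0, 'b': 30.0, 'c': 60.0}
```

Under these assumptions the quotas come out proportional to the community degrees: the first center receives 100 · 10/100 = 10 nodes, the second 90 · 30/90 = 30, and the last the remaining 60.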
Step 6: Start sampling from the start node.
The sampling procedure is based on forest fire sampling and comprises Steps 6.1 to 6.5.
Step 6.1: Take the start node u as the current sampling node v.
Step 6.2: Generate a random integer x for node v and select x edges of node v whose other endpoints have not been visited or sampled; let $w_1, w_2, w_3, \ldots, w_x$ denote the x unvisited nodes corresponding to these x edges.
The probability that neighbor node $w_i$ of node v is visited is given by a formula that appears only as an image in the original publication and is not reproduced in this text; in it, $p_f$ denotes the forward burning probability, whose value is set by the user, $k_v$ denotes the degree of node v, and $k_{w_i}$ denotes the degree of node $w_i$.
The x selected neighbor nodes of node v are marked as sampled nodes.
Step 6.3: Determine whether the number of nodes sampled from start node u has reached the sample size Samsize of node u; if so, end the sampling for node u; otherwise, continue with Step 6.4.
Step 6.4: Determine whether each of the neighbor nodes $w_1, w_2, w_3, \ldots, w_x$ of node v has neighbor nodes that have not been sampled; if so, perform Step 6.5; if none of them has, end the sampling for node u.
Step 6.5: Take each node that has unsampled neighbor nodes as the sampling node v, and return to Step 6.2.
Each of the x selected nodes in turn continues to look for unvisited nodes using a random number. The search loops until enough nodes have been obtained or no more can be found. To avoid duplicates, a node cannot be visited twice during the search. If the sample size specified in Step 5 has been reached, or there are no nodes left to continue burning, a new start node is selected.
Because forest fire sampling is biased toward high-degree nodes, the visit probability of each node is defined in terms of $p_f$, $k_v$ and the degree of $w_i$ (the expression is an image in the original publication and is not reproduced in this text), where v is the node currently being sampled and $w_i$ is the node currently being considered for burning. If the degree of $w_i$ is large, this treatment lowers the probability that $w_i$ is sampled, which to some extent offsets the tendency of forest fire sampling to favor high-degree nodes.
Step 7: Determine whether all community centers have been sampled; if so, output all sampled nodes; otherwise, return to Step 4.
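Steps 1 to 7 can be put together in one short sketch. It is an illustration under stated assumptions, not the patented implementation. In particular, the neighbor selection probability is not reproduced in this text, so the default `weight` below (inverse degree of the candidate) is a hypothetical stand-in chosen only because it decreases as the candidate's degree grows; callers can pass their own `weight(v, w)`. Counting the start node toward its own quota, rounding the quota, and the geometric-style draw of x are likewise assumptions, as is the function name `ffs_tran_sketch`.

```python
import random

def ffs_tran_sketch(adj, total_size, p_f=0.7, weight=None, rng=None):
    """adj = {node: set(neighbors)}; returns the set of sampled nodes."""
    rng = rng or random.Random()
    deg = {u: len(adj[u]) for u in adj}
    if weight is None:
        # HYPOTHETICAL placeholder, not the formula from the patent.
        weight = lambda v, w: 1.0 / deg[w]
    # Step 1: community degree of every node.
    n_comm = {u: deg[u] + sum(deg[v] for v in adj[u]) for u in adj}
    # Step 2: community centers (local maxima of community degree).
    centers = [u for u in adj if all(n_comm[u] >= n_comm[v] for v in adj[u])]
    # Step 3: ascending order of community degree.
    centers.sort(key=lambda u: n_comm[u])
    sampled = set()
    for i, u in enumerate(centers):        # Step 4: smallest unsampled center
        cur_size = total_size - len(sampled)
        if cur_size <= 0:
            break
        if u in sampled:
            continue
        remaining = sum(n_comm[c] for c in centers[i:] if c not in sampled)
        share = n_comm[u] / remaining if remaining else 1.0
        quota = max(1, round(cur_size * share))   # Step 5 (rounding assumed)
        sampled.add(u)
        got, frontier = 1, [u]
        while frontier and got < quota:    # Step 6: modified forest fire
            v = frontier.pop(0)
            candidates = [w for w in adj[v] if w not in sampled]
            x = 0
            while x < len(candidates) and rng.random() < p_f:
                x += 1
            for _ in range(x):
                if got >= quota:
                    break
                w = rng.choices(candidates,
                                weights=[weight(v, c) for c in candidates])[0]
                candidates.remove(w)
                sampled.add(w)
                frontier.append(w)
                got += 1
    return sampled                         # Step 7: output all sampled nodes
```

A design point worth noting: because each quota is capped by the number of nodes still needed, the returned set never exceeds `total_size`, but it can be smaller when fires die out before their quotas are filled.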
The present invention compares six classical sampling algorithms with the sampling method of the invention (abbreviated FFS-Tran): breadth-first sampling (BFS), random walk sampling (RW), MH random walk (MHRW), snowball sampling (SN), edge sampling (FS), and forest fire sampling (FFS). Five data sets are used in the experiments: a high-energy physics paper collaboration network (hep-th), an astrophysics collaboration network (astro-ph), an Internet autonomous systems network (as-22july06), and two condensed matter paper collaboration networks (cond-mat and cond-mat2005). The comparison uses several standard NCP (network community profile) metrics. Conductance (Cond), Expansion (Exp), Internal density (ID), and Cut Ratio (CR): for these four metrics, smaller values reflect communities that are more tightly knit internally and more sparsely connected to the outside. Normalized Cut (NC), Maximum-ODF (Out Degree Fraction, MODF), Average-ODF (AODF), and Flake-ODF (FODF): for these four metrics, smaller values indicate better sampling performance. Modularity (Mod), Modularity ratio (MR), and Volume (Volum): for these three metrics, larger values indicate better sampling performance.
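The text does not define the community metrics it names. As a hedged illustration, the sketch below computes three of them for a node set S using the definitions commonly given in the network community profile literature: conductance $c_S / (2 m_S + c_S)$, expansion $c_S / n_S$, and cut ratio $c_S / (n_S (n - n_S))$, where $n_S$ is the number of nodes in S, $m_S$ the number of edges inside S, and $c_S$ the number of edges leaving S. Whether the experiments used exactly these definitions is an assumption, and the function name `community_scores` is hypothetical.

```python
def community_scores(adj, S):
    """adj = {node: set(neighbors)}; S = a non-empty proper subset of the nodes."""
    S = set(S)
    n, n_s = len(adj), len(S)
    # Each internal edge is seen from both endpoints, hence the division by 2.
    m_s = sum(1 for u in S for v in adj[u] if v in S) // 2
    # Edges with exactly one endpoint in S.
    c_s = sum(1 for u in S for v in adj[u] if v not in S)
    # Raises ZeroDivisionError if S is empty, covers all nodes, or has no edges.
    return {
        "conductance": c_s / (2 * m_s + c_s),
        "expansion": c_s / n_s,
        "cut_ratio": c_s / (n_s * (n - n_s)),
    }
```

For two triangles joined by a single edge, taking one triangle as S gives $m_S = 3$ and $c_S = 1$, so the conductance is 1/7, the expansion 1/3, and the cut ratio 1/9; all three are small because the triangle is well separated from the rest of the graph.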
For the six classical algorithms, the sample size is set to 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.7, and 0.8; each sample size is run ten times and the results are averaged. For the forest fire algorithm, the forward burning probability takes the values 0.6, 0.7, 0.8, 0.9, and 1.0; for each sample size and each forward burning probability, the data are run ten times and the results are averaged. The resulting data are plotted as line charts so that the patterns can be seen directly. The results show that, except on the first data set, as-22july06, where the method of the invention (FFS-Tran) differs little from the other sampling algorithms, the method of the invention outperforms the other sampling algorithms on the remaining data sets; on the cond-mat2005 data set in particular, its sampling performance is clearly better than that of the six classical sampling algorithms. Each data set is analyzed below:
As can be seen from Fig. 2, on the as-22july06 data set the sampling performance of the method of the invention (FFS-Tran) is worse than that of random walk sampling (RW), MH random walk (MHRW), snowball sampling (SN), and edge sampling (FS), although the differences among these algorithms are small; the sampling performance of FFS and BFS, however, is clearly worse than that of the method of the invention and the other four classical sampling algorithms.
As can be seen from Fig. 3, on the astro-ph data set the method of the invention is better than the other sampling algorithms on most metrics, but worse than the six classical sampling algorithms on the ID and CR metrics. Taking all metrics together, on astro-ph the method of the invention is clearly better than the six classical sampling algorithms; in the figure, the FFS-Tran line is much lower than the other lines.
As can be seen from Fig. 4, on the cond-mat data set the method of the invention is also clearly better than the other classical sampling algorithms on most metrics, but slightly worse on the ID and CR metrics. FFS and BFS perform worst.
As can be seen from Fig. 5, on the hep-th data set the method of the invention performs best on most metrics and is clearly better than the other sampling algorithms on the Cond and Exp metrics, but performs poorly on CR. FFS and BFS perform very poorly on most metrics.
As can be seen from Fig. 6, the method of the invention performs much better than the other sampling algorithms; apart from the CR metric, the D-statistic values of the remaining metrics are all around 0.2, which indicates that the sample is closer to the data of the original graph and that the community structure is preserved relatively well.
The experiments show that, on the astro-ph, cond-mat, hep-th, and cond-mat2005 data sets, the sampling performance of the method of the invention is better than that of the six classical sampling algorithms, which indicates that the method of the invention is more effective than these six algorithms at preserving community structure. On the cond-mat2005 data set, the method of the invention preserves the community structure of the original graph well.
The above experimental results show that the sampling result of the method of the invention is closer to the data of the original graph and preserves the community structure relatively well; the comparison with six classical sampling algorithms verifies the feasibility and effectiveness of the method of the invention.
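The evaluation refers to D-statistic values without defining them. Assuming this means the two-sample Kolmogorov-Smirnov D-statistic, that is, the largest vertical gap between two empirical cumulative distribution functions, which is the measure used in the forest fire sampling paper cited above, it can be computed as sketched below. The function name `ks_d_statistic` is hypothetical, and the text does not say which distributions were compared.

```python
from bisect import bisect_right

def ks_d_statistic(a, b):
    """Largest gap between the empirical CDFs of two non-empty samples."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    # The empirical CDF at x is the fraction of values <= x.
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in points)

print(ks_d_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value of 0 means the two samples have identical empirical distributions and a value of 1 means they do not overlap at all, so under this reading a D-statistic near 0.2 indicates that the sampled graph's distribution tracks that of the original graph fairly closely.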

Claims (2)

1. A complex network sampling method that preserves community structure, wherein the complex network is represented by an undirected, unweighted graph G = (V, E), V is the set of nodes and E is the set of edges in the network, n is the total number of nodes, m is the total number of edges, and $k_v$ denotes the degree of node v; characterized in that the complex network sampling method comprises the following steps:
Step 1: Determine the community degree of each node in the complex network; the community degree of a node is defined as the sum of the node's own degree and the degrees of its neighbor nodes;
Step 2: Traverse the nodes of graph G to find the community centers; when the community degree of a node is not less than the community degree of any of its neighbors, mark the node as a community center;
Step 3: Sort all community centers by community degree;
Step 4: Choose the community center that has not yet been sampled and has the smallest community degree as the start node of the sampling;
Step 5: Mark the start node as a sampled node, and set the sample size of the start node;
Let the start node be u; the sample size Samsize of node u is:
$$\mathrm{Samsize} = \mathrm{cur\_size} \cdot \frac{N_u}{\sum_{v \in U_s} N_v}$$
where cur_size denotes the number of nodes that currently still need to be sampled, $N_u$ denotes the community degree of node u, $U_s$ denotes the set of community centers not yet sampled, and $N_v$ denotes the community degree of node v;
Step 6: Start sampling from the start node to obtain sampled nodes;
Step 7: Determine whether all community centers have been sampled; if so, output all sampled nodes; otherwise, return to Step 4.
2. The complex network sampling method that preserves community structure according to claim 1, characterized in that the sampling in Step 6 is performed as follows:
Step 6.1: Take the start node u as the current sampling node v;
Step 6.2: Generate a random integer x for node v, choose x neighbor nodes of v that have not been sampled, and mark the chosen nodes as sampled nodes; the probability of choosing neighbor node $w_i$ of node v is set to
[Formula image FDA0000388500520000012 in the original publication; not reproduced in this text]
Step 6.3: Determine whether the number of nodes sampled from start node u has reached the sample size of node u; if so, end the sampling for node u; otherwise, continue with Step 6.4;
Step 6.4: Determine whether each neighbor node of node v has neighbor nodes that have not been sampled; if so, perform Step 6.5; if not, end the sampling for node u;
Step 6.5: Take each node that has unsampled neighbor nodes as the sampling node v, and return to Step 6.2.
CN201310447528.9A 2013-09-27 2013-09-27 Keep the complex network method of sampling of community structure Expired - Fee Related CN103530503B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310447528.9A CN103530503B (en) 2013-09-27 2013-09-27 Keep the complex network method of sampling of community structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310447528.9A CN103530503B (en) 2013-09-27 2013-09-27 Keep the complex network method of sampling of community structure

Publications (2)

Publication Number Publication Date
CN103530503A true CN103530503A (en) 2014-01-22
CN103530503B CN103530503B (en) 2016-05-04

Family

ID=49932508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310447528.9A Expired - Fee Related CN103530503B (en) 2013-09-27 2013-09-27 Keep the complex network method of sampling of community structure

Country Status (1)

Country Link
CN (1) CN103530503B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020713A (en) * 2012-11-19 2013-04-03 山东大学 Intelligent substation fault diagnosis method combining topology and relay protection logic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANWEI NIU et al.: "Evolution of Disconnected Components in Social Networks: Patterns and A Generative Model", IEEE, 31 December 2012 (2012-12-31), pages 305 - 313, XP032300718, DOI: doi:10.1109/PCCC.2012.6407772 *
童超 (Tong Chao) et al.: "移动模型研究综述" (Survey of research on mobility models), 《计算机科学》 (Computer Science), vol. 36, no. 10, 31 October 2009 (2009-10-31) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598605A (en) * 2015-01-30 2015-05-06 福州大学 Method for user influence evaluation in social network
CN104598605B (en) * 2015-01-30 2018-01-12 福州大学 A kind of user force appraisal procedure in social networks
US10110449B2 (en) 2015-12-18 2018-10-23 International Business Machines Corporation Method and system for temporal sampling in evolving network
US10608905B2 (en) 2015-12-18 2020-03-31 International Business Machines Corporation Method and system for temporal sampling in evolving network
CN108833211A (en) * 2018-06-28 2018-11-16 浙江理工大学 The unbiased delay sampling method of social networks
CN110717805A (en) * 2019-08-30 2020-01-21 华东理工大学 Method for identifying risk user, server and readable storage medium

Also Published As

Publication number Publication date
CN103530503B (en) 2016-05-04

Similar Documents

Publication Publication Date Title
CN103530503B (en) Keep the complex network method of sampling of community structure
Du Energy analysis of Internet of things data mining algorithm for smart green communication networks
Vadeyar et al. Farthest first clustering in links reorganization
Zhongsheng et al. Retracted: Traffic identification and traffic analysis based on support vector machine
Zhou et al. Approximate deep network embedding for mining large-scale graphs
Cui et al. Dygcn: Dynamic graph embedding with graph convolutional network
Liu et al. Todynet: Temporal dynamic graph neural network for multivariate time series classification
Sun et al. Graph Based Long-Term And Short-Term Interest Model for Click-Through Rate Prediction
Chen et al. A parallel approximate ss-elm algorithm based on mapreduce for large-scale datasets
Zhao Information iterative retrieval of Internet of Things communication terminal based on symmetric algorithm
Liu et al. Scientific Paper Classification Based on Graph Neural Network with Hypergraph Self-attention Mechanism
Meng et al. SCRN: a complex network reconstruction method based on multiple time series
Atwa et al. Active selection constraints for semi-supervised clustering algorithms
Kumar et al. Machine learning solutions for investigating streams data using distributed frameworks: Literature review
Jin et al. Web Log Analysis and Security Assessment Method Based on Data Mining
CN104715418A (en) Novel social network sampling method
Meng et al. A novel method based on entity relationship for online transaction fraud detection
Bian et al. Research on a privacy preserving clustering method for social network
Dazzi et al. Experiences with complex user profiles for approximate p2p community matching
Lu et al. Vertex centrality of complex networks based on joint nonnegative matrix factorization and graph embedding
Zhang et al. DND: Deep learning-based directed network disintegrator
Chen et al. Attribute-enhanced Dual Channel Representation Learning for Session-based Recommendation
Wang et al. Community detection in complex networks using improved artificial bee colony algorithm
Chen et al. Random walks on the folded hypercube
Li et al. DAMGNN: Deep adaptive multi-channel graph neural networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160504

Termination date: 20160927

CF01 Termination of patent right due to non-payment of annual fee