CN108833211A

CN108833211A - Unbiased delay sampling method for social networks

Info

Publication number: CN108833211A
Application number: CN201810689711.2A
Authority: CN
Inventors: 刘良桂; 陈炳宪; 贾会玲; 张宇
Original assignee: Zhejiang Sci Tech University ZSTU
Current assignee: Zhejiang Sci Tech University ZSTU; Zhejiang University of Science and Technology ZUST
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2018-11-16

Abstract

The present invention discloses a data sampling method for social networks (unbiased delay sampling). The method follows the Markov convergence criterion, so it adapts to networks with different degrees of connectivity. On the one hand, the unbiased delay method yields a sampled network with better unbiasedness; on the other hand, it reduces the probability that repeated data enter the sample, which improves the method's ability to explore the network.

Description

Unbiased delay sampling method for social networks

Technical field

The present invention relates to the technical field of social network data sampling, and in particular to an unbiased delay sampling method for social networks (Unbiased-delay sampling, UD Sampling).

Background art

In recent years, online social networks have become a major Internet service. Their rapid growth has attracted the attention of many researchers: sociologists want to study the behavior of online users, engineers use social networks to design better network systems, and scientists use them to study the structure and dynamics of complex networks.

Social networks are usually modeled as social graphs for research and analysis. The immediate problem researchers face is that the data volume of a social network is enormous. First, obtaining the complete data set is impractical, because crawling such a huge social graph takes an unthinkable amount of time and is sometimes impossible. Moreover, processing such a huge social graph requires a great deal of computing time even on a high-performance computer cluster. Second, because of trade secrets and users' privacy settings, part of a social network's data is unavailable. Finally, the number of users of a social network grows rapidly and the relationships between users change over time, so a typical large network cannot be crawled completely. How to crawl a suitable sample from a large network while preserving the properties of the original network has therefore become a fundamental problem in social network research.

Commonly used network sampling techniques generally sample data with the breadth-first search (BFS) algorithm. Although BFS can obtain a large amount of user data quickly, in practice it requires substantial resources to design a deduplication queue, which greatly reduces the efficiency of data extraction. Moreover, BFS is a typical network traversal algorithm, and the data it extracts are biased toward high-degree nodes, so the method cannot obtain reliable user data.

Summary of the invention

To overcome the shortcomings of existing social network data extraction schemes, which cannot obtain unbiased data and require a deduplication queue, the present invention provides a novel network sampling method (the unbiased delay sampling method) that obtains more reliable unbiased data.

The present invention adopts the following technical solution: an unbiased delay sampling method for social networks, comprising the following steps:

(1) Convert the real network into a graph G = (E, V), where E denotes the set of edges in the graph, each edge representing a relationship between users in the real network, and V denotes the set of nodes in the graph, each node representing a user in the real network.

(2) Initialize the sampling set S and the cache space Cache, emptying both S and Cache; select a node v at random from V; then sample according to the following steps.

(3) Detect 10 neighbor nodes of node v; for a node with fewer than 10 neighbors, detect all of its neighbor nodes. Store the detected neighbor nodes in the cache space Cache.

(4) Randomly select a neighbor node w from all neighbor nodes of node v. Determine whether K_v/K_w is greater than or equal to P; if so, take neighbor node w as the current node v, put node w into the sampling set S, and return to step (3); if not, continue to the next step. Here P is a random number uniformly distributed between 0 and 1, and K_v denotes the number of neighbor nodes of node v, that is, the degree of node v.

(5) Determine whether P is less than or equal to the repeated-sampling probability α; if so, keep the current node v unchanged and return to step (3); if not, continue to the next step.

(6) In the cache space Cache, find all detected nodes that have the same number of neighbors as the current node v, and select from them the node that has been detected the fewest times; if several nodes share the same detection count, pick one of them at random. Take the selected node as the current node v, put it into the sampling set S, and return to step (3).

Further, in step (5), α = 0.2.
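Steps (1) to (6) above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function name `ud_sample`, the adjacency-dictionary graph format, the `n_samples` budget, and the `probe_limit` parameter are hypothetical. The text does not say which 10 neighbors are detected, whether the node kept in step (5) is recorded again, or what happens when no cached node has the same degree; the sketch assumes a random subset, a repeated record, and staying at v, respectively. The same random number P is used in steps (4) and (5), as the steps describe.

```python
import random


def ud_sample(graph, n_samples, alpha=0.2, probe_limit=10, rng=None):
    """Sketch of UD sampling on an undirected graph {node: [neighbors]}.

    Assumes every node has at least one neighbor.
    """
    rng = rng or random.Random()
    sample = []   # sampling set S, kept as a list so repeated entries stay visible
    cache = {}    # Cache: detected node -> number of times it has been detected
    v = rng.choice(list(graph))  # step (2): random start node

    while len(sample) < n_samples:
        neighbors = list(graph[v])
        degree_v = len(neighbors)

        # Step (3): detect up to probe_limit neighbors and store them in the cache.
        # Which neighbors to detect is not specified; a random subset is assumed.
        if degree_v > probe_limit:
            detected = rng.sample(neighbors, probe_limit)
        else:
            detected = neighbors
        for d in detected:
            cache[d] = cache.get(d, 0) + 1

        # Step (4): propose a random neighbor w; accept it if K_v / K_w >= P.
        w = rng.choice(neighbors)
        p = rng.random()
        if degree_v / len(graph[w]) >= p:
            v = w
            sample.append(v)
            continue

        # Step (5): if P <= alpha, keep v. Recording v again is an assumption,
        # made because alpha is described as the probability of repeated sampling.
        if p <= alpha:
            sample.append(v)
            continue

        # Step (6): jump to the least-detected cached node with the same degree as v.
        candidates = [d for d in cache if len(graph[d]) == degree_v]
        if candidates:
            fewest = min(cache[d] for d in candidates)
            v = rng.choice([d for d in candidates if cache[d] == fewest])
        # If no cached node has the same degree, v is kept (assumption).
        sample.append(v)
    return sample
```

For example, `ud_sample({0: [1, 2], 1: [0, 2], 2: [0, 1]}, 10)` returns a list of 10 node identifiers drawn from the toy triangle graph.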

The advantages of the invention are as follows. First, on the set of independent (non-repeated) samples, the degree distribution of the sampled network is closer to that of the original network. Second, the method avoids the problem in conventional methods that low-degree nodes in highly connected subnetworks are over-sampled, and it improves the method's ability to explore the network. Third, at low sampling rates, the transitivity and assortativity of the sample are closer to those of the original network.

Brief description of the drawings

Fig. 1 shows the degree-distribution CDF and NMSE plots of the independent sample nodes for Twitter and Epinions;

Fig. 2 shows the transitivity of the different sampled networks;

Fig. 3 shows the assortativity of the different sampled networks;

Fig. 4 shows the turnover rate of the sampled nodes;

Fig. 5 shows the influence of the parameter α on the sampling repetition rate.

Specific embodiment

Step 1: Define the concepts:

Research on social network sampling methods usually converts the real network into a graph model, in which the edges represent relationships between users in the real network and the nodes represent those users. The graph is written G = (E, V), where E denotes the set of edges, V denotes the set of nodes, and v denotes a node in the graph. The sampling set is defined as S. Nodes detected during sampling are pushed into a cache space, defined as Cache. A node in the cache space is denoted d, and d^j denotes a detected node, where j is the number of times that node has been detected; the detected nodes that have the same number of neighbors as node v are denoted separately. K_v denotes the number of neighbor nodes of node v, that is, the degree of node v. α denotes the probability of repeated sampling and defaults to 0.2.

Step 2: Initialize the sampling set S and the cache space Cache, emptying both S and Cache.

Step 3: Choose the start node v by selecting it at random from the whole network.

Step 4: Detect 10 neighbor nodes of node v; if v has fewer than 10 neighbors, detect all of them. Store the detected nodes d^j in the cache space Cache.

Step 5: Randomly select a neighbor node w from all neighbor nodes of node v.

Step 6: Generate a random number P between 0 and 1, where P is uniformly distributed between 0 and 1.

Step 7: Determine whether P is less than or equal to K_v/K_w; if so, take neighbor node w as the current node v, put node w into the sampling set S, and go to Step 5; if not, continue to the next step.

Step 8: Determine whether P is less than or equal to α (α defaults to 0.2); if so, keep the current node v unchanged and go to Step 5; if not, continue to the next step.

Step 9: In the cache space Cache, find the set of detected nodes that have the same number of neighbors as the current node v. From this set, select the node with the smallest detection count J, where J = min(j); if several nodes share that count, pick one of them at random. Take the selected node as the current node v, put it into the sampling set S, and go to Step 5.

The stopping rule of the method may be a manual stop once enough data have been obtained, or an automatic stop after the program has extracted data for a set period of time.
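The two stopping rules can be sketched as a small driver loop. This is only an illustration: the helper name `run_until` and its parameters `max_samples` and `max_seconds` are hypothetical, and the "enough data" rule is modeled here as a fixed sample budget rather than a manual stop. The `step` argument stands for any callable that performs one sampling transition and returns the sampled node.

```python
import time


def run_until(step, max_samples=None, max_seconds=None):
    """Call step() repeatedly until a sample budget or a time budget is reached."""
    if max_samples is None and max_seconds is None:
        raise ValueError("give at least one stopping rule")
    collected = []
    start = time.monotonic()
    while True:
        # Rule 1: enough data have been collected.
        if max_samples is not None and len(collected) >= max_samples:
            break
        # Rule 2: the program has run for the allotted time.
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break
        collected.append(step())
    return collected
```

Both budgets may be given together, in which case the loop stops at whichever limit is reached first.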

Here the sampling performance of the unbiased delay sampling method is evaluated on two networks with different degrees of connectivity, Twitter and Epinions. The classical sampling methods used for comparison are breadth-first search (BFS), Metropolis-Hastings random walk (MHRW), and random walk (RW).

As the left half of Fig. 1 shows, the degree distributions obtained by MHRW and BFS are biased toward high-degree nodes, because their NMSE curves lie above the curve obtained by the UD method. At the same time, the degree-distribution CDF curve obtained by the UD method is closer to that of the original network than the curves of the other methods, which likewise shows that the degree distribution of the network obtained by the UD method is closer to the original degree distribution than those of MHRW and BFS. In summary, the subnetwork obtained by the UD method has a better degree distribution even when the sampled network contains no repeated data.
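The two quantities plotted in Fig. 1 can be computed along the following lines. The text does not define NMSE; the sketch assumes the definition commonly used for sampling estimators, the square root of the mean squared error over repeated sampling runs divided by the true value. The function names `degree_cdf` and `nmse` are hypothetical.

```python
import math
from collections import Counter


def degree_cdf(degrees):
    """Empirical CDF: map each observed degree k to the fraction of degrees <= k."""
    counts = Counter(degrees)
    total = len(degrees)
    cdf, running = {}, 0
    for k in sorted(counts):
        running += counts[k]
        cdf[k] = running / total
    return cdf


def nmse(estimates, true_value):
    """sqrt(mean((estimate - true)^2)) / true, over repeated sampling runs."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_value
```

Here `estimates` would hold, for one degree k, the fraction of degree-k nodes estimated in each independent run, and `true_value` the corresponding fraction in the original network.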

In Fig. 2, the black horizontal baseline represents the transitivity of the original network. As the figure shows, the transitivity of the networks sampled by the different methods approaches the baseline value as the sampling rate increases, but at smaller sampling ratios the improved UD method is closer to the transitivity of the original network than the MHRW and RW methods.
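The text does not state how transitivity is computed. A minimal sketch, assuming the standard global clustering coefficient (closed connected triples divided by all connected triples) and a hypothetical function name `transitivity`, is:

```python
from itertools import combinations


def transitivity(graph):
    """Global clustering coefficient of an undirected graph {node: set_of_neighbors}."""
    closed = 0   # connected triples whose two end nodes are also linked
    triples = 0  # all connected triples, counted once per center node
    for center, nbrs in graph.items():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in graph[a]:
                closed += 1
    # Each triangle is counted three times in `closed`, once per center node.
    return closed / triples if triples else 0.0
```

Applying the same function to the original graph and to the subgraph induced by a sample gives the baseline and the sampled value compared in the figure.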

Fig. 3 evaluates the assortativity of the networks extracted by the different sampling methods. As the figure shows, the candidate sampling methods converge quickly to the assortativity of the original network as the sampling rate increases. At lower sampling rates, however, the improved UD method is closer to the baseline, which shows that the UD method yields better network assortativity than the MHRW and RW methods.
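Assortativity is not defined in the text either. The sketch below assumes the usual degree assortativity, that is, the Pearson correlation between the degrees at the two ends of each edge, with every undirected edge counted in both directions. The function name `degree_assortativity` is hypothetical.

```python
import math


def degree_assortativity(graph):
    """Pearson correlation of end-point degrees over all edges of {node: set_of_neighbors}."""
    xs, ys = [], []
    for u, nbrs in graph.items():
        for v in nbrs:  # each undirected edge appears as (u, v) and (v, u)
            xs.append(len(graph[u]))
            ys.append(len(graph[v]))
    n = len(xs)
    if n == 0:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when all nodes have the same degree
    return cov / math.sqrt(var_x * var_y)
```

A negative value means high-degree nodes tend to connect to low-degree nodes; a star graph is the extreme case.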

Fig. 4 shows the sampling turnover rate for Twitter and Epinions. The horizontal axis is the number of independent (distinct) nodes extracted, and the vertical axis is the ratio of the number of independent nodes extracted to the actual number of sampled nodes, called the turnover rate. A higher turnover rate means fewer repeated nodes, and therefore a better ability to explore the network during sampling. As Fig. 4 shows, the sampling turnover rate of the sparse network (Epinions) is lower than that of the highly connected network (Twitter), and for networks of different connectivity the turnover rate of the UD method is better than that of MHRW. This is because sampling on a network with low connectivity has a higher probability of reaching nodes that have already been visited. This shows that the UD sampling method avoids MHRW's problem of over-sampling low-degree nodes and, at the same time, has a better ability to explore the network.
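The turnover curve described for Fig. 4 follows directly from the sequence of sampled nodes. A minimal sketch, with the hypothetical function name `turnover_curve`:

```python
def turnover_curve(samples):
    """Return (distinct_count, distinct_count / samples_drawn) after each draw."""
    seen = set()
    curve = []
    for drawn, node in enumerate(samples, start=1):
        seen.add(node)
        curve.append((len(seen), len(seen) / drawn))
    return curve
```

The first element of each pair is the horizontal-axis value and the second is the turnover rate on the vertical axis.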

In the UD sampling process, the parameter α controls the self-loop probability of the current node. Fig. 5 shows the influence of different values of α on the repetition rate of sampled nodes. The abscissa is the value of α, which runs from 0.05 to 1 in steps of 0.05 (α = 0.05, 0.1, 0.15, ..., 1), and the ordinate is the node repetition rate of the sampling set (the number of all nodes in the sample divided by the number of distinct nodes in the sample). The UD, MHRW, and RW methods were each used to collect 5% of the data in the Twitter data set. As Fig. 5 shows, when α is close to 0 the sample repetition rate of the UD method is close to that of RW. When α is 1, the sample repetition rate of the UD method is 6 times that of RW and equal to that of MHRW. In social networks, therefore, the parameter α can be used to control the sample repetition rate of the UD method. In particular, when α is between 0.2 and 0.4, the sampling repetition rate is considerably lower than that of MHRW.
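The repetition rate on the ordinate of Fig. 5 is the number of all nodes in the sample divided by the number of distinct nodes, as defined above. A one-function sketch, with the hypothetical name `repetition_rate`:

```python
def repetition_rate(samples):
    """Number of sampled nodes divided by the number of distinct sampled nodes."""
    if not samples:
        raise ValueError("the sample is empty")
    return len(samples) / len(set(samples))
```

A value of 1.0 means no node was sampled twice; larger values mean more repeats.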

Claims

1. An unbiased delay sampling method for social networks, characterized by comprising the following steps:

(1) Convert the real network into a graph G = (E, V), where E denotes the set of edges in the graph, each edge representing a relationship between users in the real network, and V denotes the set of nodes in the graph, each node representing a user in the real network.

(2) Initialize the sampling set S and the cache space Cache, emptying both S and Cache; select a node v at random from V; then sample according to the following steps.

(3) Detect 10 neighbor nodes of node v; for a node with fewer than 10 neighbors, detect all of its neighbor nodes. Store the detected neighbor nodes in the cache space Cache.

(4) Randomly select a neighbor node w from all neighbor nodes of node v. Determine whether K_v/K_w is greater than or equal to P; if so, take neighbor node w as the current node v, put node w into the sampling set S, and return to step (3); if not, continue to the next step. Here P is a random number uniformly distributed between 0 and 1, and K_v denotes the number of neighbor nodes of node v, that is, the degree of node v.

(5) Determine whether P is less than or equal to the repeated-sampling probability α; if so, keep the current node v unchanged and return to step (3); if not, continue to the next step.

(6) In the cache space Cache, find all detected nodes that have the same number of neighbors as the current node v, and select from them the node that has been detected the fewest times; if several nodes share the same detection count, pick one of them at random. Take the selected node as the current node v, put it into the sampling set S, and return to step (3).

2. The method according to claim 1, characterized in that, in step (5), α = 0.2.