CN108833211A - The unbiased delay sampling method of social networks - Google Patents


Info

Publication number
CN108833211A
Authority
CN
China
Prior art keywords
node
sampling
network
cache
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810689711.2A
Other languages
Chinese (zh)
Inventor
刘良桂
陈炳宪
贾会玲
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Sci Tech University ZSTU
Zhejiang University of Science and Technology ZUST
Original Assignee
Zhejiang Sci Tech University ZSTU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Sci Tech University ZSTU
Priority to CN201810689711.2A
Publication of CN108833211A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/022 Capturing of monitoring data by sampling
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a data sampling method for social networks (unbiased delay sampling). The method follows the Markov convergence criterion and adapts unbiased sampling to networks with different degrees of connectivity. On the one hand, the unbiased delay method yields a sampled network with better unbiasedness; on the other hand, it reduces the probability that repeated data enter the sample, which improves the method's ability to explore the network.

Description

The unbiased delay sampling method of social networks
Technical field
The present invention relates to the technical field of social network data sampling, and in particular to an unbiased delay sampling method for social networks (Unbiased-delay sampling, UD Sampling).
Background art
In recent years, online social networks have become a major Internet service. Their rapid growth has attracted the attention of many researchers: sociologists want to study the behavior of online users, engineers use social networks to design better network systems, and scientists use them to study the structure and dynamics of complex networks.
Social networks are usually modeled as social graphs for research and analysis. The immediate problem researchers face is that the data volume of a social network is enormous. First, obtaining the complete data set is impractical, because crawling such a huge social graph takes an unreasonable amount of time and is sometimes impossible. Processing such a graph also requires a great deal of computation time, even on a high-performance computer cluster. Second, because of trade secrets and users' privacy settings, part of the data of a social network is unavailable. Finally, the number of users grows rapidly and the relationships between users change over time, so a typical large network cannot be crawled completely. How to collect a suitable sample from a large network while preserving the network attributes of the original network has therefore become a fundamental problem in social network research.
Current network sampling techniques generally use the breadth-first search (BFS) algorithm to sample data. BFS can obtain a large amount of user data quickly. In practice, however, it requires a de-duplication queue that consumes substantial resources, which greatly reduces the efficiency of data extraction. BFS is also a typical graph traversal algorithm, and the data it extracts are biased toward high-degree nodes, so the method cannot obtain reliable user data.
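The bias of BFS toward high-degree nodes can be illustrated on a toy graph. The sketch below is only an illustration under assumptions that are not part of the patent: the adjacency dictionary `adj`, the helper names `bfs_sample` and `mean_degree`, and the ten-node graph are all hypothetical. Note that the `seen` set plays the role of the de-duplication structure mentioned above.

```python
from collections import deque

def bfs_sample(adj, start, budget):
    """Return the first `budget` nodes reached by breadth-first search."""
    seen = {start}            # de-duplication structure that BFS has to maintain
    order = [start]
    queue = deque([start])
    while queue and len(order) < budget:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                order.append(w)
                queue.append(w)
                if len(order) == budget:
                    break
    return order

def mean_degree(adj, nodes):
    """Average number of neighbours over the given nodes."""
    return sum(len(adj[n]) for n in nodes) / len(nodes)

# Hypothetical toy graph: a hub (node 0) linked to nodes 1..5, plus a chain 5-6-7-8-9.
adj = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0],
       5: [0, 6], 6: [5, 7], 7: [6, 8], 8: [7, 9], 9: [8]}
sample = bfs_sample(adj, start=1, budget=3)
print(mean_degree(adj, sample), mean_degree(adj, list(adj)))
```

On this graph the BFS sample reaches the hub almost immediately, so its mean degree is higher than the mean degree of the whole graph. A single toy graph is not evidence about real networks; it only shows the mechanism.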
Summary of the invention
Existing social network data extraction schemes cannot obtain unbiased data and require a de-duplication queue. To overcome these deficiencies, the present invention provides a novel network sampling method (an unbiased sampling method) that obtains more reliable, unbiased data.
The present invention adopts the following technical solution: an unbiased delay sampling method for social networks, comprising the following steps:
(1) Convert the real network into a graph G = (E, V), where E is the set of edges in the graph, each edge representing a relationship between users in the real network, and V is the set of nodes in the graph, each node representing a user in the real network.
(2) Initialize the sampling set S and the cache space Cache, emptying both S and Cache; randomly select a node v from V; then sample according to the following steps.
(3) Probe 10 neighbor nodes of node v; for a node with fewer than 10 neighbors, probe all of its neighbor nodes. Store the probed neighbor nodes in the cache space Cache.
(4) Randomly select a neighbor node w from all neighbor nodes of node v. Determine whether K_v/K_w is greater than or equal to P. If so, take neighbor node w as the current node v, put node w into the sampling set S, and return to step (3); if not, continue to the next step. Here P is a random number uniformly distributed on [0, 1], and K_v denotes the number of neighbor nodes of node v, i.e., the degree of node v.
(5) Determine whether P is less than or equal to the repeated-sampling probability α. If so, keep the current node v unchanged and return to step (3); if not, continue to the next step.
(6) In the cache space Cache, find all detected nodes that have the same number of neighbors as the current node v, and select from them the node that has been detected the fewest times; if several nodes share the smallest number of detections, choose one of them at random. Take the selected node as the current node v, put it into the sampling set S, and return to step (3).
Further, in step (5), α = 0.2.
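Steps (1) to (6) can be sketched in Python as follows. This is a minimal illustration and not the applicant's implementation. The function name `ud_sample`, the adjacency-dictionary representation, and the `budget` stopping condition are assumptions. The text also leaves several points open, and the sketch resolves them by assumption: the 10 probed neighbors are chosen at random, a self-loop in step (5) does not add v to S again, and the walk stays at v when the cache holds no node of the same degree. Every node is assumed to have at least one neighbor.

```python
import random

def ud_sample(adj, budget, alpha=0.2, probe_limit=10, rng=None):
    """Sketch of unbiased delay (UD) sampling on an undirected graph.

    adj maps each node to a list of its neighbours (step 1).
    """
    rng = rng or random.Random()
    sample = []                  # sampling set S, kept as a list so repeats stay visible
    detected = {}                # Cache: node -> number of times it has been detected
    v = rng.choice(list(adj))    # step 2: random start node
    while len(sample) < budget:
        # Step 3: probe at most `probe_limit` neighbours of v (random choice is an assumption).
        neighbours = adj[v]
        if len(neighbours) <= probe_limit:
            probed = neighbours
        else:
            probed = rng.sample(neighbours, probe_limit)
        for d in probed:
            detected[d] = detected.get(d, 0) + 1
        # Step 4: propose a random neighbour w and accept it if P <= K_v / K_w.
        w = rng.choice(neighbours)
        p = rng.random()
        if p <= len(adj[v]) / len(adj[w]):
            v = w
            sample.append(v)
            continue
        # Step 5: with the same P, stay at v if P <= alpha (v is not re-added: assumption).
        if p <= alpha:
            continue
        # Step 6: jump to the least-detected cached node with the same degree as v.
        same_degree = [d for d in detected if len(adj[d]) == len(adj[v])]
        if not same_degree:
            continue             # assumption: stay at v when no candidate exists
        fewest = min(detected[d] for d in same_degree)
        v = rng.choice([d for d in same_degree if detected[d] == fewest])
        sample.append(v)
    return sample
```

A call such as `ud_sample(adj, budget=1000, rng=random.Random(0))` returns a list of sampled nodes with repeats preserved, so statistics such as the repetition rate can be computed from it afterwards.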
The invention has the following advantages. First, on the set of unique sampled nodes, the degree distribution of the sampled network is closer to that of the original network. Second, it avoids the problem of conventional methods, in which low-degree nodes are over-sampled in highly connected subnetworks, and it improves the method's ability to explore the network. Third, at low sampling rates, the transitivity and assortativity of the sample are closer to the attributes of the original network.
Brief description of the drawings
Fig. 1 shows the degree-distribution CDF and NMSE of the unique sampled nodes for Twitter and Epinions;
Fig. 2 shows the transitivity of the different sampled networks;
Fig. 3 shows the assortativity of the different sampled networks;
Fig. 4 shows the turnover rate of the sampled nodes;
Fig. 5 shows the influence of the parameter α on the sampling repetition rate.
Detailed description of the embodiments
Step 1: Definitions:
Research on social network sampling usually converts the real network into a graph model, in which an edge represents a relationship between users in the real network and a node represents a user in the real network. The graph is denoted G = (E, V), where E is the set of edges in the graph, V is the set of nodes in the graph, and v is a node in the graph. The sampling set is defined as S. Nodes probed during sampling are pushed into a cache space, defined as Cache. A node in the cache space is denoted d, and d_j denotes a detected node, where j is the number of times the node has been detected. The detected nodes that have the same number of neighbors as node v form the detected node set of v. K_v denotes the number of neighbor nodes of node v, i.e., the degree of node v. α denotes the probability of repeated sampling and defaults to 0.2.
Step 2: Initialize the sampling set S and the cache space Cache, emptying both S and Cache.
Step 3: Choose the start node v by selecting it at random from the whole network.
Step 4: Probe 10 neighbor nodes of node v; if v has fewer than 10 neighbors, probe all of its neighbor nodes. Store the probed nodes d_j in the cache space Cache.
Step 5: Randomly select a neighbor node w from all neighbor nodes of node v.
Step 6: Generate a random number P between 0 and 1, where P is uniformly distributed on [0, 1].
Step 7: Determine whether P is less than or equal to K_v/K_w. If so, take neighbor node w as the current node v, put node w into the sampling set S, and go to Step 5; if not, continue to the next step.
Step 8: Determine whether P is less than or equal to α (α defaults to 0.2). If so, keep the current node v unchanged and go to Step 5; if not, continue to the next step.
Step 9: In the cache space Cache, find the set of detected nodes that have the same number of neighbors as the current node v. From this set, select the node with the smallest number of detections, J = min(j); if several nodes share this number, choose one of them at random. Take the selected node as the current node v, put it into the sampling set S, and go to Step 5.
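The cache lookup of Step 9 can be worked through on a small example. The sketch below is hypothetical: the helper name `pick_from_cache`, the dictionaries `detected` and `degree`, and the behavior of returning `None` when no cached node matches are assumptions made for illustration and are not specified in the text.

```python
import random

def pick_from_cache(detected, degree, current, rng=random):
    """Return the least-detected cached node whose degree equals that of `current`.

    detected maps cached node -> number of detections j.
    degree maps node -> number of neighbours K.
    Returns None when no cached node has the same degree (assumption).
    """
    target = degree[current]
    candidates = [d for d in detected if degree[d] == target]
    if not candidates:
        return None
    fewest = min(detected[d] for d in candidates)        # J = min(j)
    # Ties on the smallest detection count are broken at random.
    return rng.choice([d for d in candidates if detected[d] == fewest])

# Hypothetical cache state: "a" and "b" have the same degree as the current node "v".
detected = {"a": 3, "b": 1, "c": 2, "e": 1}
degree = {"v": 4, "a": 4, "b": 4, "c": 7, "e": 2}
chosen = pick_from_cache(detected, degree, "v")
print(chosen)
```

Here only "a" and "b" have degree 4, and "b" has been detected fewer times, so "b" is selected. Node "e" has an equally small detection count but a different degree, so it is not a candidate.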
The stopping rule of the method can be a manual stop once enough data have been obtained, or the program can stop automatically after extracting data for a period of time.
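The two stopping rules can be expressed as a small predicate that a sampling loop checks on each iteration. This is a sketch under assumptions: the name `make_stop_rule`, the parameters `max_samples` and `max_seconds`, and the injectable `clock` are hypothetical and do not come from the patent.

```python
import time

def make_stop_rule(max_samples=None, max_seconds=None, clock=time.monotonic):
    """Build a predicate that reports when sampling should stop.

    max_samples: stop once this many samples have been collected.
    max_seconds: stop once this much time has elapsed since creation.
    """
    start = clock()

    def should_stop(sample_count):
        if max_samples is not None and sample_count >= max_samples:
            return True
        if max_seconds is not None and clock() - start >= max_seconds:
            return True
        return False

    return should_stop

# Usage: stop after 1000 samples or 60 seconds, whichever comes first.
stop = make_stop_rule(max_samples=1000, max_seconds=60)
```

Passing the clock as a parameter keeps the time-based rule testable without waiting for real time to pass.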
The sampling performance of the unbiased delay sampling method is evaluated here on two networks with different degrees of connectivity, Twitter and Epinions. The classical sampling methods used for comparison are BFS, MHRW, and RW.
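For reference, the two random-walk baselines can be sketched as follows. These are textbook forms of the simple random walk (RW) and the Metropolis-Hastings random walk (MHRW), not code from the patent; the function names and the adjacency-dictionary input are assumptions, and every node is assumed to have at least one neighbor.

```python
import random

def random_walk(adj, budget, rng=None):
    """Simple random walk: always move to a uniformly chosen neighbour."""
    rng = rng or random.Random()
    v = rng.choice(list(adj))
    sample = []
    while len(sample) < budget:
        v = rng.choice(adj[v])
        sample.append(v)
    return sample

def mhrw(adj, budget, rng=None):
    """Metropolis-Hastings random walk with a uniform target distribution."""
    rng = rng or random.Random()
    v = rng.choice(list(adj))
    sample = []
    while len(sample) < budget:
        w = rng.choice(adj[v])
        # Accept the move with probability min(1, K_v / K_w).
        if rng.random() <= len(adj[v]) / len(adj[w]):
            v = w
        sample.append(v)     # a rejected move samples the current node again
    return sample
```

On a star graph, RW visits the hub on every second step, while MHRW rejects most moves from a leaf to the hub and so spends far less time there. Those rejections are also the source of the repeated samples that the UD method tries to reduce.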
As the left half of Fig. 1 shows, the degree distributions obtained by MHRW and BFS are biased toward high-degree nodes, because their NMSE curves lie above the curve obtained by the UD method. At the same time, the degree-distribution CDF curve obtained by the UD method is closer to that of the original network than the curves of the other methods, which likewise shows that the degree distribution of the network obtained by the UD method is closer to the original degree distribution than those obtained by MHRW and BFS. In summary, the subnetwork obtained by the UD method has better degree-distribution properties, even when the sampled network contains no repeated data.
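The two quantities plotted in Fig. 1 can be computed along the following lines. The patent does not give a formula for NMSE; the sketch assumes the definition commonly used in the graph sampling literature, the square root of the mean squared error of repeated estimates divided by the true value. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def degree_cdf(adj, nodes):
    """Empirical degree CDF over `nodes`, as {k: P(degree <= k)}."""
    counts = Counter(len(adj[n]) for n in nodes)
    total = len(nodes)
    cdf, running = {}, 0
    for k in sorted(counts):
        running += counts[k]
        cdf[k] = running / total
    return cdf

def nmse(estimates, truth):
    """Normalized root mean squared error of repeated estimates (assumed definition)."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return sqrt(mse) / truth

# Hypothetical star graph: one node of degree 3 and three nodes of degree 1.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(degree_cdf(adj, list(adj)))
```

Applying `degree_cdf` to the unique sampled nodes and to all nodes gives the two curves to compare, and `nmse` summarizes how far the estimates from several sampling runs are from the true value at a given degree.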
In Fig. 2, the black horizontal baseline represents the value of the transitivity statistic of the original network. As the figure shows, the transitivity of the networks sampled by the different methods tends toward the baseline value as the sampling rate increases, but at small sampling ratios the improved UD method is closer to the transitivity of the original network than the MHRW and RW methods.
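Transitivity is commonly defined as three times the number of triangles divided by the number of connected triples. The sketch below uses that standard definition, which the patent does not spell out. It also assumes that the sampled network is the subgraph induced by the sampled nodes; the text does not say how the sampled network is built, and both function names are hypothetical.

```python
def induced_subgraph(adj, nodes):
    """Subgraph induced by `nodes` (assumed construction of the sampled network)."""
    keep = set(nodes)
    return {v: [w for w in adj[v] if w in keep] for v in keep}

def transitivity(adj):
    """Global transitivity: 3 * triangles / connected triples."""
    closed = 0      # connected neighbour pairs; each triangle is counted three times
    triples = 0     # neighbour pairs around each centre node
    for v, nbrs in adj.items():
        nbrs = list(set(nbrs) - {v})
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for i in range(k):
            for j in range(i + 1, k):
                if nbrs[j] in adj[nbrs[i]]:
                    closed += 1
    return closed / triples if triples else 0.0

# Hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
paw = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(transitivity(paw))
```

The example graph has one triangle and five connected triples, so its transitivity is 3/5.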
Fig. 3 evaluates the assortativity of the networks extracted by the different sampling methods. As the figure shows, the assortativity of the candidate sampling methods converges to that of the original network more quickly as the sampling rate increases. At lower sampling rates, however, the improved UD method is closer to the baseline, which shows that the UD method preserves network assortativity better than the MHRW and RW methods.
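Degree assortativity is usually measured as the Pearson correlation between the degrees at the two ends of each edge. The sketch below assumes that standard definition, since the text does not state which formula was used, and the function name is hypothetical.

```python
def degree_assortativity(adj):
    """Pearson correlation of the degrees at the two ends of every edge."""
    xs, ys = [], []
    for v, nbrs in adj.items():
        for w in nbrs:
            # Each undirected edge appears once in each direction.
            xs.append(len(adj[v]))
            ys.append(len(adj[w]))
    n = len(xs)
    mean = sum(xs) / n                     # xs and ys have the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var else float("nan")

# Hypothetical star graph: the hub connects only to degree-1 leaves.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(degree_assortativity(star))
```

In the star graph every edge joins the highest-degree node to a lowest-degree node, so the coefficient takes its minimum value of -1. A regular graph has zero degree variance, and the function returns NaN in that case.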
Fig. 4 shows the sampling turnover rate for Twitter and Epinions. The horizontal axis is the number of unique nodes extracted, and the vertical axis is the ratio of the number of unique nodes extracted to the number of nodes actually sampled, called the turnover rate. A higher turnover rate means fewer repeated nodes, and therefore a better ability to explore the network during sampling. As Fig. 4 shows, the sampling turnover rate of the sparse network (Epinions) is lower than that of the highly connected network (Twitter), and for networks of different connectivity the turnover rate of the UD method is better than that of MHRW. This is because a walk on a network of low connectivity has a higher probability of reaching nodes it has already visited. This shows that the UD sampling method avoids MHRW's over-sampling of low-degree nodes and, at the same time, has a better ability to explore the network.
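The turnover rate defined above reduces to a one-line computation on the list of sampled nodes. The function name is hypothetical, and returning 0.0 for an empty sample is an assumption.

```python
def turnover_rate(sample):
    """Ratio of unique sampled nodes to the total number of sampled nodes."""
    return len(set(sample)) / len(sample) if sample else 0.0

# Four draws, three unique nodes.
print(turnover_rate(["a", "b", "a", "c"]))
```

For the example, three of the four sampled nodes are unique, so the turnover rate is 0.75.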
In the UD sampling process, the parameter α controls the self-loop probability of the current node. Fig. 5 shows the influence of different values of α on the repetition rate of sampled nodes. The abscissa is the value of α, which runs from 0.05 to 1 in steps of 0.05 (α = 0.05, 0.1, 0.15, ..., 1), and the ordinate is the node repetition rate in the sampling set (the number of all nodes in the sample divided by the number of unique nodes in the sample). The UD method, MHRW, and RW were each used to collect 5% of the data in the Twitter data set. As Fig. 5 shows, when α is close to 0, the sample repetition rate of the UD sampling method is close to that of RW. When α is 1, the sample repetition rate of the UD sampling method is 6 times that of RW and identical to that of MHRW. In social networks, the parameter α can therefore be used to control the sample repetition rate of the UD sampling method. In particular, when α is between 0.2 and 0.4, the sampling repetition rate is reduced considerably compared with MHRW.
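The α sweep behind Fig. 5 can be organized as below. This is only a sketch: `repetition_rate` follows the definition given in the text, while `sweep` and its `sampler` argument (any function that takes α and returns a list of sampled nodes) are hypothetical names introduced for illustration.

```python
def repetition_rate(sample):
    """Number of all sampled nodes divided by the number of unique sampled nodes."""
    return len(sample) / len(set(sample))

# Alpha grid from the text: 0.05, 0.10, ..., 1.00.
alphas = [round(0.05 * i, 2) for i in range(1, 21)]

def sweep(sampler, alphas):
    """Map each alpha to the repetition rate of the sample drawn with it."""
    return {a: repetition_rate(sampler(a)) for a in alphas}
```

With this definition the repetition rate is the reciprocal of the turnover rate, so a sample without repeats has a repetition rate of exactly 1.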

Claims (2)

1. An unbiased delay sampling method for social networks, characterized in that it comprises the following steps:
(1) Convert the real network into a graph G = (E, V), where E is the set of edges in the graph, each edge representing a relationship between users in the real network, and V is the set of nodes in the graph, each node representing a user in the real network.
(2) Initialize the sampling set S and the cache space Cache, emptying both S and Cache; randomly select a node v from V; then sample according to the following steps.
(3) Probe 10 neighbor nodes of node v; for a node with fewer than 10 neighbors, probe all of its neighbor nodes. Store the probed neighbor nodes in the cache space Cache.
(4) Randomly select a neighbor node w from all neighbor nodes of node v. Determine whether K_v/K_w is greater than or equal to P. If so, take neighbor node w as the current node v, put node w into the sampling set S, and return to step (3); if not, continue to the next step. Here P is a random number uniformly distributed on [0, 1], and K_v denotes the number of neighbor nodes of node v, i.e., the degree of node v.
(5) Determine whether P is less than or equal to the repeated-sampling probability α. If so, keep the current node v unchanged and return to step (3); if not, continue to the next step.
(6) In the cache space Cache, find all detected nodes that have the same number of neighbors as the current node v, and select from them the node that has been detected the fewest times; if several nodes share the smallest number of detections, choose one of them at random. Take the selected node as the current node v, put it into the sampling set S, and return to step (3).
2. The method according to claim 1, characterized in that, in step (5), α = 0.2.
CN201810689711.2A 2018-06-28 2018-06-28 The unbiased delay sampling method of social networks Pending CN108833211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810689711.2A CN108833211A (en) 2018-06-28 2018-06-28 The unbiased delay sampling method of social networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810689711.2A CN108833211A (en) 2018-06-28 2018-06-28 The unbiased delay sampling method of social networks

Publications (1)

Publication Number Publication Date
CN108833211A (en) 2018-11-16

Family

ID=64134724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810689711.2A Pending CN108833211A (en) 2018-06-28 2018-06-28 The unbiased delay sampling method of social networks

Country Status (1)

Country Link
CN (1) CN108833211A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530503A (en) * 2013-09-27 2014-01-22 北京航空航天大学 Complex network sampling method for keeping community structure
US20140221022A1 (en) * 2013-02-06 2014-08-07 Andrea Vaccari Grouping Ambient-Location Updates
CN105354244A (en) * 2015-10-13 2016-02-24 广西师范学院 Time-space LDA model for social network community mining
CN107145977A (en) * 2017-04-28 2017-09-08 电子科技大学 A kind of method that structured attributes deduction is carried out to online social network user
CN107945037A (en) * 2017-11-27 2018-04-20 北京工商大学 A kind of social networks based on node structure feature goes de-identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181116