CN108833211A - The unbiased delay sampling method of social networks - Google Patents
The unbiased delay sampling method of social networks
- Publication number
- CN108833211A (application CN201810689711.2A)
- Authority
- CN
- China
- Prior art keywords
- node
- sampling
- network
- cache
- detected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/02—Capturing of monitoring data
- H04L43/022—Capturing of monitoring data by sampling
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Environmental & Geological Engineering (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a data sampling method for social networks, called unbiased delay sampling. The method follows the Markov convergence criterion and adapts unbiased sampling to networks with different degrees of connectivity. On the one hand, the unbiased delay method yields a less biased sampled network; on the other hand, it reduces the probability that repeated data enters the sample, which improves the method's ability to explore the network.
Description
Technical field
The present invention relates to the technical field of social network data sampling, and in particular to an unbiased delay sampling method for social networks (unbiased-delay sampling, UD sampling).
Background art
In recent years, online social networks have become a major Internet service. Their rapid growth has attracted the attention of many researchers: sociologists want to study the behavior of online users, engineers use social networks to design better network systems, and scientists use them to study the structure and dynamics of complex networks.
A social network is usually modeled as a social graph for research and analysis. The first problem researchers face is that the data volume of a social network is enormous. First, obtaining the complete data set is impractical, because crawling such a large social graph takes an inordinate amount of time and is sometimes impossible. Moreover, processing such a large social graph requires a great deal of time even on a high-performance computing cluster. Second, because of trade secrets and users' privacy settings, part of a social network's data is unavailable. Finally, the number of users grows rapidly and the relationships between users change over time, so a typical large network cannot be crawled in full. How to draw a suitable sample from a large network while preserving the properties of the original network has therefore become a fundamental problem in social network research.
Current network sampling techniques generally use breadth-first search (BFS) to collect data. Although BFS can obtain a large amount of user data quickly, in production it requires a de-duplication queue that consumes substantial resources, which greatly reduces the efficiency of data extraction. BFS is also a typical graph traversal algorithm, and the data it extracts is biased toward high-degree nodes, so it cannot yield reliable user data.
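The cost of de-duplication in breadth-first sampling can be illustrated with a minimal sketch. It assumes the graph fits in memory as an adjacency dict; the names `bfs_sample`, `graph`, and `budget` are illustrative and do not come from the patent. The `seen` set stands in for the de-duplication queue and grows with every node the search touches.

```python
from collections import deque

def bfs_sample(graph, start, budget):
    """Collect up to `budget` distinct nodes by breadth-first search."""
    seen = {start}           # de-duplication structure: grows with the crawl
    queue = deque([start])
    sample = []
    while queue and len(sample) < budget:
        node = queue.popleft()
        sample.append(node)
        for neighbor in graph[node]:
            if neighbor not in seen:   # every neighbor must be checked
                seen.add(neighbor)
                queue.append(neighbor)
    return sample

# Toy graph (hypothetical data): node 0 is a hub.
graph = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4]}
sample = bfs_sample(graph, 1, 3)
```

Starting from the leaf node 1, the search reaches the hub after a single hop, which is a small instance of the pull toward high-degree nodes described above.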
Summary of the invention
To overcome the shortcomings of existing social network data extraction schemes, which cannot obtain unbiased data and require a de-duplication queue, the present invention provides a novel network sampling method (unbiased delay sampling) that obtains more reliable, unbiased data.
The present invention adopts the following technical scheme: an unbiased delay sampling method for social networks, comprising the following steps:
(1) Convert the real network into a graph G = (E, V), where E is the set of edges in the graph, each edge representing a relationship between users in the real network, and V is the set of nodes in the graph, each node representing a user in the real network.
(2) Initialize the sampling set S and the cache space Cache by emptying both; select a node v at random from V; then sample according to the following steps.
(3) Probe 10 neighbor nodes of node v; if v has fewer than 10 neighbors, probe all of them. Store the detected neighbor nodes in the cache space Cache.
(4) Randomly select a neighbor node w from all the neighbor nodes of node v. Judge whether Kv/Kw is greater than or equal to P; if so, take neighbor node w as the current node v, put node w into the sampling set S, and return to step (3); if not, continue to the next step. Here P is a random number uniformly distributed between 0 and 1, and Kv denotes the number of neighbor nodes of node v, i.e. the degree of node v (Kw likewise for node w).
(5) Judge whether P is less than or equal to the repeated-sampling probability α; if so, keep the current node v unchanged and return to step (3); if not, continue to the next step.
(6) Find in the cache space Cache all detected nodes that have the same number of neighbors as the current node v, and select from them the node that has been detected the fewest times; if several nodes share the fewest detections, pick one of them at random. Take the selected node as the current node v, put it into the sampling set S, and return to step (3).
Further, in step (5), α = 0.2.
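The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation. The function name `ud_sample` and its parameters are hypothetical. The graph is assumed to be an undirected adjacency dict in which every node has at least one neighbor. The 10 probed neighbors are assumed to be chosen at random. Two points the text leaves open are filled with labeled assumptions: a "stay" in step (5) is recorded as a repeated sample, and an empty candidate set in step (6) simply triggers a retry.

```python
import random

def ud_sample(graph, n_samples, alpha=0.2, probe_limit=10, rng=None):
    """Sketch of unbiased delay (UD) sampling over an adjacency dict."""
    rng = rng or random.Random()

    def degree(node):                    # K_node
        return len(graph[node])

    samples = []                         # sampling set S (repeats are kept)
    detected = {}                        # Cache: node -> times detected
    v = rng.choice(list(graph))          # step (2): random start node

    while len(samples) < n_samples:
        # Step (3): probe up to `probe_limit` neighbors and cache them.
        nbrs = graph[v]
        if len(nbrs) <= probe_limit:
            probed = nbrs
        else:
            probed = rng.sample(nbrs, probe_limit)  # ASSUMPTION: random subset
        for d in probed:
            detected[d] = detected.get(d, 0) + 1

        # Step (4): propose a random neighbor w; accept if P <= K_v / K_w.
        w = rng.choice(nbrs)
        p = rng.random()
        if p <= degree(v) / degree(w):
            v = w
            samples.append(v)
            continue

        # Step (5): the same P is compared with alpha; stay at v if P <= alpha.
        if p <= alpha:
            samples.append(v)            # ASSUMPTION: a stay re-samples v
            continue

        # Step (6): jump to the least-detected cached node of equal degree.
        same_degree = [d for d in detected if degree(d) == degree(v)]
        if not same_degree:              # ASSUMPTION: no candidate -> retry
            continue
        fewest = min(detected[d] for d in same_degree)
        v = rng.choice([d for d in same_degree if detected[d] == fewest])
        samples.append(v)
    return samples
```

A call such as `ud_sample(graph, 1000, rng=random.Random(0))` returns a list of node identifiers; passing a seeded `random.Random` makes a run reproducible.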
The advantages of the invention are as follows. First, on the set of distinct sampled nodes, the degree distribution is closer to that of the original network. Second, the method avoids the problem of conventional methods, in which low-degree nodes are over-sampled in highly connected subnetworks, and it improves the method's ability to explore the network. Third, at low sampling rates, the transitivity and assortativity of the sample are closer to those of the original network.
Brief description of the drawings
Fig. 1 shows the CDF and NMSE of the degree distribution of the distinct sampled nodes for Twitter and Epinions;
Fig. 2 shows the transitivity of the networks sampled by the different methods;
Fig. 3 shows the assortativity of the networks sampled by the different methods;
Fig. 4 shows the turnover rate of the sampled nodes;
Fig. 5 shows the influence of the parameter α on the sample repetition rate.
Detailed description of the embodiments
Step 1: Define the concepts:
Research on social network sampling usually converts the real network into a graph model, in which edges represent relationships between users in the real network and nodes represent the users. The graph is denoted G = (E, V), where E is the set of edges, V is the set of nodes, and v is a node in the graph. The sampling set is denoted S. Nodes detected during sampling are pushed into a cache space, denoted Cache. A node in the cache is denoted d, and dj denotes a detected node, where j is the number of times that node has been detected. Detected nodes that have the same number of neighbors as node v form the candidate set used in Step 9. Kv denotes the number of neighbor nodes of node v, i.e. the degree of node v. α denotes the probability of repeated sampling; its default value is 0.2.
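One possible in-memory representation of these definitions is sketched below. The class name `Cache`, its methods, and the toy graph are hypothetical; the patent does not prescribe a data structure. A counter keyed by node stores j, the number of times each node has been detected, and `least_detected` performs the lookup later used in Step 9.

```python
import random
from collections import Counter

# G = (E, V) as an adjacency dict: V is the key set, E is implied by the lists.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def K(node):
    """Degree K_v: the number of neighbors of `node`."""
    return len(graph[node])

class Cache:
    """Detected nodes d_j, stored as node -> j (times detected)."""

    def __init__(self):
        self.times_detected = Counter()

    def record(self, node):
        self.times_detected[node] += 1

    def least_detected(self, degree, rng=random):
        """Pick a cached node of the given degree with minimal j, ties at random."""
        same = [d for d in self.times_detected if K(d) == degree]
        if not same:
            return None   # ASSUMPTION: the patent does not cover this case
        J = min(self.times_detected[d] for d in same)
        return rng.choice([d for d in same if self.times_detected[d] == J])

cache = Cache()
for node in ["a", "b", "a"]:   # "a" detected twice, "b" once
    cache.record(node)
```

With this toy data, `cache.least_detected(2)` selects "b", since "a" and "b" both have degree 2 and "b" has been detected fewer times.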
Step 2: Initialize the sampling set S and the cache space Cache by emptying both.
Step 3: Choose the start node v by selecting a node at random from the whole network.
Step 4: Probe 10 neighbor nodes of node v; if v has fewer than 10 neighbors, probe all of them. Store the detected nodes dj in the cache space Cache.
Step 5: Randomly select a neighbor node w from all the neighbor nodes of node v.
Step 6: Generate a random number P uniformly distributed between 0 and 1.
Step 7: Judge whether P is less than or equal to Kv/Kw; if so, take neighbor node w as the current node v, put node w into the sampling set S, and go to Step 5; if not, continue to the next step.
Step 8: Judge whether P is less than or equal to α (0.2 by default); if so, keep the current node v unchanged and go to Step 5; if not, continue to the next step.
Step 9: Find in the cache space Cache the set of detected nodes that have the same number of neighbors as the current node v. From this set, select the node dJ with the smallest detected count, where J = min(j); if several nodes share the smallest count, pick one of them at random. Take node dJ as the current node v, put it into the sampling set S, and go to Step 5.
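Because Steps 7 and 8 compare the same random number P, the three outcomes partition the unit interval. Writing r = Kv/Kw for the proposed neighbor w, and assuming 0 ≤ α ≤ 1, the probabilities of the three branches work out as follows. This is a derivation from the steps as stated, not a formula given in the patent.

```latex
\begin{aligned}
\Pr[\text{move to } w] &= \Pr[P \le r] = \min(1, r),\\
\Pr[\text{stay at } v] &= \Pr[r < P \le \alpha] = \max\bigl(0,\ \alpha - \min(1, r)\bigr),\\
\Pr[\text{jump via Cache}] &= \Pr[P > \max(r, \alpha)] = 1 - \max\bigl(\alpha,\ \min(1, r)\bigr).
\end{aligned}
```

The three terms sum to 1. For example, with α = 0.2, Kv = 2 and Kw = 20, r = 0.1, so the walk moves with probability 0.1, stays with probability 0.1, and jumps through the cache with probability 0.8. The walk can stay at v only when r < α, that is, when the proposed neighbor has more than 1/α times the degree of v.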
The method can be stopped manually once enough data has been collected, or the program can stop automatically after extracting data for a set period of time.
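Both stopping rules can be folded into one check, sketched here under assumptions: the function name `should_stop` and its parameters are hypothetical, and a fixed sample budget stands in for the operator's judgment that enough data has been collected.

```python
import time

def should_stop(n_collected, started_at, max_samples=None, max_seconds=None, now=None):
    """Return True once the sample budget or the time budget is used up."""
    now = time.monotonic() if now is None else now
    if max_samples is not None and n_collected >= max_samples:
        return True    # enough data has been collected
    if max_seconds is not None and now - started_at >= max_seconds:
        return True    # the extraction period has elapsed
    return False
```

A sampling loop would record `started_at = time.monotonic()` once and call `should_stop(len(samples), started_at, max_seconds=3600)` after each step; the `now` argument exists only so the check can be exercised without waiting.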
The sampling performance of the unbiased delay (UD) sampling method is evaluated here on two networks with different degrees of connectivity, Twitter and Epinions. The classical sampling methods used for comparison are BFS, MHRW, and RW.
The left half of Fig. 1 shows that the degree distributions obtained by MHRW and BFS are biased toward high-degree nodes, since their NMSE curves lie above the curve obtained by the UD method. At the same time, the degree-distribution CDF obtained by the UD method is closer to that of the original network than those of the other methods, which likewise shows that the degree distribution of the network sampled by UD is closer to the original than those sampled by MHRW and BFS. In summary, the subnetwork obtained by the UD method has a better degree distribution, even when the sampled network contains no repeated data.
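The two quantities plotted in Fig. 1 can be computed as sketched below. The patent does not state which NMSE definition it uses; this sketch assumes the form common in the graph-sampling literature, the root mean squared error of an estimate over repeated runs divided by the true value. The function names are illustrative.

```python
from collections import Counter
from math import sqrt

def degree_cdf(degrees):
    """Empirical CDF of a degree list as sorted (k, P[K <= k]) pairs."""
    counts = Counter(degrees)
    total = len(degrees)
    cdf, running = [], 0
    for k in sorted(counts):
        running += counts[k]
        cdf.append((k, running / total))
    return cdf

def nmse(estimates, truth):
    """ASSUMED definition: sqrt(mean((estimate - truth)^2)) / truth."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return sqrt(mse) / truth
```

Here `estimates` would hold, for one degree k, the fraction of degree-k nodes measured in each of several independent sampling runs, and `truth` the fraction in the original network. A lower NMSE means the sampled degree distribution tracks the original more closely.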
In Fig. 2, the black horizontal baseline marks the transitivity of the original network. As the sampling rate increases, the transitivity of the sampled network approaches the baseline for every sampling method; at small sampling ratios, however, the improved UD method is closer to the transitivity of the original network than MHRW and RW.
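For reference, transitivity is conventionally defined as three times the number of triangles divided by the number of connected triples. A minimal sketch for an undirected adjacency dict follows; the function name is illustrative, and the patent does not say how its values were computed.

```python
from itertools import combinations

def transitivity(graph):
    """3 * triangles / connected triples for an undirected adjacency dict."""
    adj = {v: set(nbrs) for v, nbrs in graph.items()}
    closed = 0    # triples whose two end points are also linked
    triples = 0   # paths of length 2, counted once per centre node
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1   # each triangle is counted once per corner
    return closed / triples if triples else 0.0
```

Each triangle contributes three closed triples, one per corner, so `closed` already equals three times the triangle count.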
Fig. 3 evaluates the assortativity of the networks extracted by the different sampling methods. The candidate sampling methods converge quickly to the assortativity of the original network as the sampling rate increases. At lower sampling rates, however, the improved UD method is closer to the baseline, which shows that UD preserves assortativity better than MHRW and RW.
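Degree assortativity is conventionally measured as the Pearson correlation between the degrees at the two ends of each edge. The sketch below assumes an undirected adjacency dict and counts every edge in both directions; the function name is illustrative, and the patent does not say which implementation produced its figures.

```python
def degree_assortativity(graph):
    """Pearson correlation of end-point degrees over all undirected edges."""
    adj = {v: set(nbrs) for v, nbrs in graph.items()}
    xs, ys = [], []
    for u, nbrs in adj.items():
        for w in nbrs:               # each edge appears as (u, w) and (w, u)
            xs.append(len(adj[u]))
            ys.append(len(adj[w]))
    n = len(xs)
    if n == 0:
        return float("nan")          # no edges: correlation is undefined
    mean = sum(xs) / n               # xs and ys share a mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var else float("nan")
```

Values near +1 mean that nodes tend to link to nodes of similar degree; values near -1 mean that hubs link mostly to low-degree nodes, as in a star graph.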
Fig. 4 shows the sampling turnover rate on Twitter and Epinions. The horizontal axis is the number of distinct nodes extracted, and the vertical axis is the ratio of the number of distinct nodes extracted to the total number of nodes actually sampled, called the turnover rate. A higher turnover rate means fewer duplicate nodes, and therefore a better ability to explore the network during sampling. Fig. 4 shows that the turnover rate on the sparse network (Epinions) is lower than on the highly connected network (Twitter), and that on networks of either connectivity the turnover rate of UD is better than that of MHRW. The reason is that in a network with low connectivity a walk is more likely to reach nodes it has already visited. This shows that UD sampling avoids MHRW's over-sampling of low-degree nodes and explores the network better.
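The turnover rate defined above is straightforward to compute from a sample sequence. A minimal sketch, with an illustrative function name and toy data:

```python
def turnover_rate(samples):
    """Distinct sampled nodes divided by the total number of samples drawn."""
    return len(set(samples)) / len(samples) if samples else 0.0

walk = ["a", "b", "a", "c", "c", "d"]   # hypothetical sample sequence
rate = turnover_rate(walk)              # 4 distinct nodes out of 6 samples
```

A walk that never revisits a node has a turnover rate of 1; every repeat lowers it.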
In UD sampling, the parameter α controls the self-loop probability of the current node. Fig. 5 shows the influence of different values of α on the repetition rate of sampled nodes. The abscissa is the value of α, which runs from 0.05 to 1 in steps of 0.05 (α = 0.05, 0.1, 0.15, ..., 1); the ordinate is the node repetition rate of the sampling set (the number of all nodes in the sample divided by the number of distinct nodes in the sample). The UD, MHRW, and RW methods were each used to collect 5% of the Twitter data set. Fig. 5 shows that when α is close to 0, the sample repetition rate of UD is close to that of RW; when α is 1, the sample repetition rate of UD is 6 times that of RW and equal to that of MHRW. In social networks, α can therefore be used to control the sample repetition rate of UD sampling. In particular, when α is between 0.2 and 0.4, the sampling repetition rate is noticeably lower than that of MHRW.
Claims (2)
1. An unbiased delay sampling method for social networks, characterized by comprising the following steps:
(1) Convert the real network into a graph G = (E, V), where E is the set of edges in the graph, each edge representing a relationship between users in the real network, and V is the set of nodes in the graph, each node representing a user in the real network.
(2) Initialize the sampling set S and the cache space Cache by emptying both; select a node v at random from V; then sample according to the following steps.
(3) Probe 10 neighbor nodes of node v; if v has fewer than 10 neighbors, probe all of them. Store the detected neighbor nodes in the cache space Cache.
(4) Randomly select a neighbor node w from all the neighbor nodes of node v. Judge whether Kv/Kw is greater than or equal to P; if so, take neighbor node w as the current node v, put node w into the sampling set S, and return to step (3); if not, continue to the next step. Here P is a random number uniformly distributed between 0 and 1, and Kv denotes the number of neighbor nodes of node v, i.e. the degree of node v.
(5) Judge whether P is less than or equal to the repeated-sampling probability α; if so, keep the current node v unchanged and return to step (3); if not, continue to the next step.
(6) Find in the cache space Cache all detected nodes that have the same number of neighbors as the current node v, and select from them the node that has been detected the fewest times; if several nodes share the fewest detections, pick one of them at random. Take the selected node as the current node v, put it into the sampling set S, and return to step (3).
2. The method according to claim 1, characterized in that, in step (5), α = 0.2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689711.2A CN108833211A (en) | 2018-06-28 | 2018-06-28 | The unbiased delay sampling method of social networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810689711.2A CN108833211A (en) | 2018-06-28 | 2018-06-28 | The unbiased delay sampling method of social networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108833211A true CN108833211A (en) | 2018-11-16 |
Family
ID=64134724
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810689711.2A Pending CN108833211A (en) | 2018-06-28 | 2018-06-28 | The unbiased delay sampling method of social networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108833211A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103530503A (en) * | 2013-09-27 | 2014-01-22 | 北京航空航天大学 | Complex network sampling method for keeping community structure |
US20140221022A1 (en) * | 2013-02-06 | 2014-08-07 | Andrea Vaccari | Grouping Ambient-Location Updates |
CN105354244A (en) * | 2015-10-13 | 2016-02-24 | 广西师范学院 | Time-space LDA model for social network community mining |
CN107145977A (en) * | 2017-04-28 | 2017-09-08 | 电子科技大学 | A kind of method that structured attributes deduction is carried out to online social network user |
CN107945037A (en) * | 2017-11-27 | 2018-04-20 | 北京工商大学 | A kind of social networks based on node structure feature goes de-identification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181116 |