WO2016090877A1 - A generalized maximum-degree random walk graph sampling algorithm - Google Patents
A generalized maximum-degree random walk graph sampling algorithm
- Publication number
- WO2016090877A1 (application PCT/CN2015/081147)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- algorithm
- sample
- random walk
- degree
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
Definitions
- the invention belongs to the field of large-scale data mining technology, and in particular relates to a generalized maximum-degree random walk graph sampling algorithm.
- graph-traversal-based methods mainly use breadth-first search (BFS) or depth-first search (DFS) to collect nodes.
- BFS breadth-first search
- DFS depth-first search
- the main disadvantage of this type of method is that, while collecting nodes, the algorithm is biased toward higher-degree nodes, which does not match the goal of a uniform node sample.
- the bias of this kind of algorithm toward higher-degree nodes cannot be characterized theoretically, so it is difficult to correct, and a uniform node sample cannot be obtained.
- algorithms based on random walks overcome the defects of graph-traversal-based algorithms. They can either generate unbiased node samples directly or generate node samples whose bias is known. Such algorithms are therefore widely used in graph sampling.
- RW re-weighted random walk
- MD maximum-degree random walk
- n represents the number of nodes, and m represents the number of edges
- let N(u) be the set of all neighboring nodes of node $u \in V$
- $d_u$ denotes the degree of node u
- let $f: V \to \mathbb{R}$ be a real-valued function defined on the node set V, where f(u) represents the value of a certain property of node u, such as its degree or some other attribute value.
- the goal is to estimate the average of the f(u) values over all nodes in the network, that is, $\frac{1}{n}\sum_{u \in V} f(u)$, where n is the number of nodes.
- both the RW and MD algorithms can produce an unbiased estimate of this average.
- the RW algorithm uses a re-weighting strategy. Specifically, it estimates this average by a weighted estimate over the collected sample nodes, where S denotes the set of collected sample nodes, $w_{rw}(u) \propto 1/d_u$ is the weight of node u, and $\propto$ denotes proportionality. This estimate can be explained within the framework of importance sampling (IS).
- the IS framework collects sample nodes from a trial distribution (also called the experimental distribution) that is relatively easy to sample from, instead of from the target distribution, and then uses importance weighting to construct an unbiased estimate.
- for the RW algorithm, the target distribution is the uniform distribution $\pi_u$ and the trial distribution is $\pi_{rw}$.
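The re-weighting idea can be illustrated with a minimal Python sketch. The exact estimator formula is not legible in the text above, so the sketch assumes the standard self-normalized importance-sampling form, in which each visited node u contributes f(u) with weight 1/d_u. The function name `rw_estimate` and the toy graph are hypothetical.

```python
import random

def rw_estimate(adj, f, start, steps, rng):
    """Self-normalized importance-sampling estimate of the mean of f.

    A simple random walk visits node u with stationary probability
    proportional to d_u, so each visit is weighted by 1/d_u.
    Assumes a connected graph without isolated nodes.
    """
    u = start
    weighted_sum = 0.0
    weight_total = 0.0
    for _ in range(steps):
        w = 1.0 / len(adj[u])        # w_rw(u) proportional to 1/d_u
        weighted_sum += w * f(u)
        weight_total += w
        u = rng.choice(adj[u])       # move to a uniformly chosen neighbor
    return weighted_sum / weight_total

# Hypothetical toy graph as adjacency lists; the triangle 0-1-2 makes the walk aperiodic.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(0)
estimate = rw_estimate(adj, lambda u: len(adj[u]), start=0, steps=20000, rng=rng)
true_mean = sum(len(nbrs) for nbrs in adj.values()) / len(adj)
print(estimate, true_mean)
```

Here f is the node degree, so the estimate should approach the true mean degree of the toy graph as the number of steps grows.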
- the estimation accuracy of a sampling algorithm based on the IS framework depends on the chi-square distance between the trial distribution and the target distribution: the larger this distance, the worse the estimation accuracy.
- the chi-square distance is defined as follows: let p and q be the trial distribution and the target distribution, respectively; then the chi-square distance between p and q is $\mathrm{var}_p(q(X)/p(X))$, where var denotes the variance.
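As a small worked example of this definition, the sketch below computes the chi-square distance on a finite node set. It uses the identity $\mathrm{var}_p(q/p) = \sum_x q(x)^2/p(x) - 1$, which holds because the expectation of q/p under p equals 1. The degree values are hypothetical.

```python
def chi_square_distance(p, q):
    """var_p(q(X)/p(X)) for two distributions on the same finite support.

    Since E_p[q/p] = 1, the variance equals sum_x q(x)^2 / p(x) - 1.
    """
    return sum(q[x] ** 2 / p[x] for x in p) - 1.0

# Hypothetical degrees of a four-node graph.
degrees = {"a": 3, "b": 2, "c": 2, "d": 1}
total_degree = sum(degrees.values())

pi_rw = {u: d / total_degree for u, d in degrees.items()}   # trial: proportional to degree
pi_u = {u: 1 / len(degrees) for u in degrees}               # target: uniform

dist = chi_square_distance(pi_rw, pi_u)
print(dist)
```

The distance is zero exactly when the trial distribution equals the target, and it grows as the degree distribution becomes more skewed.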
- the MD algorithm is an unbiased graph sampling algorithm that collects nodes by a random walk on a dynamically constructed regular graph, and it can directly obtain uniform node samples.
- the principle is to add self-loops to the nodes of the original graph so that the degree of every node equals the maximum degree of the graph, which yields a regular graph (a graph in which all nodes have the same degree).
- when the random walk reaches node u, it moves to each node in the neighbor set N(u) with probability $1/d_{max}$, where $d_{max}$ denotes the maximum degree of the graph (the degree of the node with the largest degree).
- otherwise, the algorithm stays at the current node u, which happens with probability $(d_{max} - d_u)/d_{max}$.
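The MD transition rule described above can be sketched as a single step function. This is an illustrative sketch on a hypothetical toy graph, not the patent's reference implementation; `md_step` is an assumed name.

```python
import random

def md_step(adj, u, d_max, rng):
    """One MD step: each neighbor with probability 1/d_max, else stay at u."""
    d_u = len(adj[u])
    if rng.random() < d_u / d_max:   # leave u with total probability d_u / d_max
        return rng.choice(adj[u])    # each neighbor then has probability 1/d_max
    return u                         # self-loop with probability (d_max - d_u) / d_max

# Hypothetical toy graph as adjacency lists.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
d_max = max(len(nbrs) for nbrs in adj.values())

rng = random.Random(1)
steps = 40000
counts = {u: 0 for u in adj}
u = 0
for _ in range(steps):
    counts[u] += 1
    u = md_step(adj, u, d_max, rng)

freq = {u: c / steps for u, c in counts.items()}
print(freq)
```

The visit frequencies should all be close to 1/4, reflecting the uniform stationary distribution; the low-degree node 3 achieves this only by repeating itself often.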
- the trial distribution $\pi_{rw}$ of the RW algorithm is proportional to the node degree, whereas the target distribution is the uniform distribution $\pi_u$.
- in real networks the node degrees are usually far from uniform and follow a long-tailed distribution. Therefore, in many applications the trial distribution $\pi_{rw}$ deviates greatly from the target distribution $\pi_u$.
- the effectiveness of the RW algorithm depends on how close $\pi_{rw}$ is to $\pi_u$. Therefore, on real networks the RW algorithm tends to have a large deviation.
- the MD algorithm can produce uniform samples, so it avoids the "large deviation problem" of the RW algorithm. However, its self-loops produce many duplicate samples, and the effect is especially severe at nodes with small degree. Too many repeated samples usually lead to a large estimation variance, which reduces the estimation accuracy of the algorithm. This defect of the MD algorithm is called the "repetitive sample problem".
- in addition, the maximum degree of the graph is usually unknown. The usual practice is to set it to a very large constant that is guaranteed to exceed the true maximum degree. This leads to even more self-loops, which aggravates the "repetitive sample problem".
- the invention provides a generalized maximum-degree random walk graph sampling algorithm that effectively balances the "large deviation problem" of the RW algorithm and the "repetitive sample problem" of the MD algorithm, thereby improving the overall efficiency of collecting sample nodes from the network.
- a generalized maximum-degree random walk graph sampling algorithm, comprising the following steps:
- $S_i$ represents the i-th node collected by the algorithm
- $\xi_i$ represents the number of repetitions of the sample $S_i$.
- $d_u$ represents the degree of node u, and C is a non-negative integer.
- the above generalized maximum-degree random walk algorithm is referred to as the GMD algorithm
- the GMD algorithm can effectively solve the problem of extracting uniform samples from a "hidden" online social network, and it balances the "large deviation problem" of the RW algorithm against the "repetitive sample problem" of the MD algorithm.
- the GMD algorithm can therefore replace the widely used RW and MD algorithms for sampling online social networks.
- FIG. 1 is a schematic diagram of sample collection by the random walk algorithm.
- the invention provides a new generalized maximum-degree random walk algorithm, hereinafter referred to as the GMD algorithm.
- the GMD algorithm introduces a parameter C (a non-negative integer) on top of the MD algorithm to control the number of self-loops. Its transition probability is as follows:
- C is a non-negative integer.
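The transition equation itself is not legible in the text above. One form that is consistent with the description (self-loops controlled by C) and with the stationary ratio $\max\{d_v, C\}/\max\{d_u, C\}$ stated later in this document is sketched below; it should be read as an assumed reconstruction, not as the patent's exact formula.

```latex
P(u, v) =
\begin{cases}
  \dfrac{1}{\max\{d_u, C\}}, & v \in N(u), \\[2ex]
  1 - \dfrac{d_u}{\max\{d_u, C\}}, & v = u, \\[2ex]
  0, & \text{otherwise.}
\end{cases}
```

Under this form, C = 0 gives the simple random walk used by the RW algorithm (no self-loops), and any $C \ge d_{max}$ gives the MD algorithm.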
- the GMD algorithm includes two steps: first, collecting samples from the graph by a random walk with the above transition probability; second, constructing an unbiased estimate from the collected samples.
- the detailed process of the first step is as follows:
- the node u is taken as $S_i$ and added to the set S;
- the geometric random variable $\xi_i$ in the random walk algorithm represents the number of repetitions of the sample $S_i$.
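The individual sub-steps of the sampling phase are not fully legible above, so the following Python sketch shows one plausible reading: at each visited node the walk records the node together with a geometric repetition count (the number of consecutive steps the self-loop would keep the walk there), then moves to a uniformly chosen neighbor. The function name `gmd_walk`, the leave probability d_u / max{d_u, C}, and the toy graph are assumptions.

```python
import random

def gmd_walk(adj, start, C, num_samples, rng):
    """Collect (node, repetitions) pairs with a GMD-style walk.

    Assumes a connected graph without isolated nodes and that the walk
    leaves node u at each step with probability d_u / max(d_u, C).
    """
    samples = []
    u = start
    for _ in range(num_samples):
        d_u = len(adj[u])
        leave_prob = d_u / max(d_u, C)
        reps = 1                              # the first visit counts once
        while rng.random() >= leave_prob:     # each self-loop adds one repetition
            reps += 1
        samples.append((u, reps))
        u = rng.choice(adj[u])                # move to a uniformly chosen neighbor
    return samples

# Hypothetical toy graph as adjacency lists.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
C = 2
samples = gmd_walk(adj, start=0, C=C, num_samples=1000, rng=random.Random(2))
print(samples[:5])
```

Drawing the repetition count directly avoids simulating every self-loop step, and nodes whose degree is at least C are never repeated.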
- the GMD algorithm constructs an unbiased estimate by the following formula:
- $S_i$ represents the i-th node collected by the algorithm
- $\xi_i$ represents the number of repetitions of the sample $S_i$.
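The estimator formula is missing from the text above. A minimal sketch of one consistent choice follows: a self-normalized importance-sampling estimate in which each collected node counts as many times as it is repeated and carries the weight 1/max{d_u, C}, the inverse of the GMD stationary probability up to a constant. The function name `gmd_estimate` and the hand-made sample list are hypothetical.

```python
def gmd_estimate(samples, degree, f, C):
    """Weighted mean of f over (node, repetitions) samples.

    Assumed form: sum_i reps_i * w(S_i) * f(S_i) / sum_i reps_i * w(S_i),
    with w(u) = 1 / max(d_u, C).
    """
    numerator = 0.0
    denominator = 0.0
    for u, reps in samples:
        w = 1.0 / max(degree[u], C)
        numerator += reps * w * f(u)
        denominator += reps * w
    return numerator / denominator

# Hypothetical degrees and a small hand-made sample list of (node, repetitions).
degree = {"a": 3, "b": 2, "d": 1}
samples = [("a", 1), ("b", 2), ("d", 3)]
C = 2

estimate = gmd_estimate(samples, degree, lambda u: degree[u], C)
print(estimate)
```

With C = 0 every weight becomes 1/d_u and every repetition count is 1, so the sketch reduces to the re-weighted estimate of the RW algorithm.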
- the GMD algorithm adds fewer self-loops to each graph node than the MD algorithm, so it alleviates the "repetitive sample problem" of the MD algorithm to some extent. Moreover, the GMD algorithm also resolves the problem of the unknown maximum degree in the MD algorithm. In addition, it can be proved that the chi-square distance between the trial distribution of the GMD algorithm and the target distribution (the uniform distribution) is smaller than the chi-square distance between the trial distribution of the RW algorithm and the target distribution. Therefore, the GMD algorithm can also alleviate the "large deviation problem" of the RW algorithm to some extent.
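The effect on repeated samples can be made concrete with a little arithmetic. Assuming the walk leaves a node of degree d_u with probability d_u / max{d_u, cap} at each step, the expected number of consecutive repetitions is max{d_u, cap} / d_u, where cap is d_max for MD and C for GMD. The degree values and the choice C = 10 below are hypothetical.

```python
def expected_repetitions(d_u, cap):
    """Mean holding time of a node with degree d_u when self-loops pad it to max(d_u, cap)."""
    return max(d_u, cap) / d_u

# Hypothetical long-tailed degree values.
degrees = [1, 2, 5, 50, 1000]
d_max = max(degrees)
C = 10

md_reps = [expected_repetitions(d, d_max) for d in degrees]   # MD pads every node to d_max
gmd_reps = [expected_repetitions(d, C) for d in degrees]      # GMD pads only up to C

for d, md, gmd in zip(degrees, md_reps, gmd_reps):
    print(f"degree {d}: MD {md:.0f} repeats, GMD {gmd:.0f} repeats")
```

Under these assumptions a degree-1 node repeats about 1000 times per visit with MD but only about 10 times with GMD, and nodes with degree at least C do not repeat at all.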
- $\pi_{gmd}(v) / \pi_{gmd}(u) = \max\{d_v, C\} / \max\{d_u, C\}$.
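This ratio can be checked numerically on a small graph. The sketch below builds an assumed GMD transition matrix (each neighbor with probability 1/max{d_u, C}, the remainder as a self-loop) with exact fractions and verifies that the distribution proportional to max{d_u, C} is unchanged by one step. The helper names and the toy graph are hypothetical, and the graph is assumed to be simple (no self-loops in the input).

```python
from fractions import Fraction

def gmd_transition(adj, C):
    """Row-stochastic transition probabilities of the assumed GMD walk."""
    P = {}
    for u, nbrs in adj.items():
        denom = max(len(nbrs), C)
        row = {v: Fraction(1, denom) for v in nbrs}
        stay = 1 - Fraction(len(nbrs), denom)
        if stay:                       # add a self-loop only when it has positive probability
            row[u] = stay
        P[u] = row
    return P

def gmd_stationary(adj, C):
    """Distribution proportional to max(d_u, C)."""
    z = sum(max(len(nbrs), C) for nbrs in adj.values())
    return {u: Fraction(max(len(nbrs), C), z) for u, nbrs in adj.items()}

# Hypothetical toy graph as adjacency lists.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
C = 2
P = gmd_transition(adj, C)
pi = gmd_stationary(adj, C)

# One step of the chain started from pi; stationarity means this equals pi.
after_one_step = {v: sum(pi[u] * P[u].get(v, 0) for u in adj) for v in adj}
print(after_one_step == pi)
```

Exact fractions are used so that the stationarity check is an equality rather than a floating-point comparison.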
- GMD algorithm generalized maximum-degree random walk algorithm
- output: a set S containing 2 sample points
- a node v is selected from the adjacent nodes of $v_4$ with equal probability and used as the initial node for the next step.
- the generalized maximum-degree random walk algorithm, that is, the GMD algorithm, can effectively solve the problem of extracting uniform samples from a "hidden" online social network, and it balances the "large deviation problem" of the RW algorithm against the "repetitive sample problem" of the MD algorithm. On this basis, the GMD algorithm can replace the widely used RW and MD algorithms for sampling online social networks.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410749244.XA CN104462374B (zh) | 2014-12-09 | 2014-12-09 | A generalized maximum-degree random walk graph sampling method |
CN201410749244.X | 2014-12-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016090877A1 (zh) | 2016-06-16 |
Family
ID=52908409
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/081147 WO2016090877A1 (zh) | 2014-12-09 | 2015-06-10 | A generalized maximum-degree random walk graph sampling algorithm |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104462374B (zh) |
WO (1) | WO2016090877A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
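CN110196995A (zh) * | 2019-04-30 | 2019-09-03 | Xidian University | A complex network feature extraction method based on biased random walk |
CN111147311A (zh) * | 2019-12-31 | 2020-05-12 | Hangzhou Normal University | A method for quantifying network structural differences based on graph embedding |
CN112132326A (zh) * | 2020-08-31 | 2020-12-25 | Zhejiang University of Technology | A social network friend prediction method based on a random-walk degree penalty mechanism |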
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
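CN104462374B (zh) * | 2014-12-09 | 2018-06-05 | Shenzhen University | A generalized maximum-degree random walk graph sampling method |
CN106713035B (zh) * | 2016-12-23 | 2019-12-27 | Xidian University | A congested link localization method based on group testing |
CN107358534A (zh) * | 2017-06-29 | 2017-11-17 | Zhejiang Sci-Tech University | Unbiased data collection system and collection method for social networks |
CN110019975B (zh) | 2017-10-10 | 2020-10-16 | Advanced New Technologies Co., Ltd. | Random walk and cluster-based random walk method, apparatus, and device |
CN109658094B (zh) * | 2017-10-10 | 2020-09-18 | Alibaba Group Holding Limited | Random walk and cluster-based random walk method, apparatus, and device |
CN109547265A (zh) * | 2018-12-29 | 2019-03-29 | National University of Defense Technology | Local immunization method and system for complex networks based on random walk sampling |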
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8396855B2 (en) * | 2010-05-28 | 2013-03-12 | International Business Machines Corporation | Identifying communities in an information network |
US20140095616A1 (en) * | 2012-09-28 | 2014-04-03 | 7517700 Canada Inc. O/A Girih | Method and system for sampling online social networks |
CN104462374A (zh) * | 2014-12-09 | 2015-03-25 | Shenzhen University | A generalized maximum-degree random walk graph sampling algorithm |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8719211B2 (en) * | 2011-02-01 | 2014-05-06 | Microsoft Corporation | Estimating relatedness in social network |
US8583659B1 (en) * | 2012-07-09 | 2013-11-12 | Facebook, Inc. | Labeling samples in a similarity graph |
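CN103617609B (zh) * | 2013-10-24 | 2016-04-13 | Shanghai Jiao Tong University | Graph-theory-based k-means nonlinear manifold clustering and representative point selection method |
CN103942308B (zh) * | 2014-04-18 | 2017-04-05 | Institute of Information Engineering, Chinese Academy of Sciences | Method and apparatus for detecting communities in large-scale social networks |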
- 2014-12-09: CN application CN201410749244.XA, patent CN104462374B (zh), status: not active (Expired - Fee Related)
- 2015-06-10: WO application PCT/CN2015/081147, publication WO2016090877A1 (zh), status: active (Application Filing)
Non-Patent Citations (2)
Title |
---|
LI, R.H: "On Random Walk Based Graph Sampling", ICDE CONFERENCE, 2015, pages 931 - 933 * |
ZIV, B.Y. ET AL.: "Approximating Aggregate Queries about Web Pages via Random Walks.", PROCEEDINGS OF THE 26 TH VLDB CONFERENCE, 2000, Cairo, Egypt * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
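CN110196995A (zh) * | 2019-04-30 | 2019-09-03 | Xidian University | A complex network feature extraction method based on biased random walk |
CN110196995B (zh) * | 2019-04-30 | 2022-12-06 | Xidian University | A complex network feature extraction method based on biased random walk |
CN111147311A (zh) * | 2019-12-31 | 2020-05-12 | Hangzhou Normal University | A method for quantifying network structural differences based on graph embedding |
CN111147311B (zh) * | 2019-12-31 | 2022-06-21 | Hangzhou Normal University | A method for quantifying network structural differences based on graph embedding |
CN112132326A (zh) * | 2020-08-31 | 2020-12-25 | Zhejiang University of Technology | A social network friend prediction method based on a random-walk degree penalty mechanism |
CN112132326B (zh) * | 2020-08-31 | 2023-12-01 | Zhejiang University of Technology | A social network friend prediction method based on a random-walk degree penalty mechanism |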
Also Published As
Publication number | Publication date |
---|---|
CN104462374B (zh) | 2018-06-05 |
CN104462374A (zh) | 2015-03-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2016090877A1 (zh) | A generalized maximum-degree random walk graph sampling algorithm | |
Cui et al. | Detecting overlapping communities in networks using the maximal sub-graph and the clustering coefficient | |
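CN102456062B (zh) | Community similarity calculation method and social network cooperation pattern discovery method | |
CN107276793B (zh) | Node importance measurement method based on probabilistic-jump random walk | |
CN110705045B (zh) | A link prediction method that constructs a weighted network using network topology characteristics | |
CN103838803A (zh) | A social network community detection method based on node Jaccard similarity | |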
Hou et al. | Prediction methods and applications in the science of science: A survey | |
Meng et al. | A novel potential edge weight method for identifying influential nodes in complex networks based on neighborhood and position | |
Zhang et al. | Identifying node importance by combining betweenness centrality and katz centrality | |
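WO2016086634A1 (zh) | A Metropolis-Hastings graph sampling algorithm with controllable rejection rate | |
CN107784327A (zh) | A personalized community detection method based on GN | |
CN105162654A (zh) | A link prediction method based on local community information | |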
Ma et al. | The local triangle structure centrality method to rank nodes in networks | |
Han et al. | Generating uncertain networks based on historical network snapshots | |
Chen et al. | Fast community detection based on distance dynamics | |
Dong | Application of Big Data Mining Technology in Blockchain Computing | |
Jin et al. | Heterogeneous graph neural networks using self-supervised reciprocally contrastive learning | |
CN109492677A (zh) | Time-varying network link prediction method based on Bayesian theory | |
Jiang et al. | Efficiency improvements in social network communication via MapReduce | |
Jia et al. | Effect of weak ties on degree and H-index in link prediction of complex network | |
Xiu et al. | An extended self-representation model of complex networks for link prediction | |
Jiang et al. | Community Detection using Closeness Similarity based on Common Neighbor Node Clustering Entropy. | |
Jiang et al. | Robust size estimation of online social networks via subgraph sampling | |
Junjie et al. | Local optimization overlapping community discovery algorithm combining attribute features | |
Mehdiabadi et al. | Sampling from diffusion networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15868061 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15868061 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.06.2018) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15868061 Country of ref document: EP Kind code of ref document: A1 |