US20230032521A1

US20230032521A1 - Social graph generation method using a degree distribution generation model

Info

Publication number: US20230032521A1
Application number: US17/784,175
Authority: US
Inventors: Chaokun Wang; Binbin WANG; Bingyang Huang
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2020-03-20
Filing date: 2020-03-20
Publication date: 2023-02-02
Also published as: WO2021184367A1; CN114207573A

Abstract

The disclosure proposes a social graph generation method based on a degree distribution generation model, including: setting the social graph generation schema, which is the configuration used to generate graphs; setting the degree distribution generation model, which is used to generate an out-degree and a number of target vertices for a source vertex so that the out-degree and in-degree distributions follow the desired distributions; generating a general graph based on the degree distribution generation model; generating a social graph based on the degree distribution generation model; and generating graphs in a streaming manner. The disclosure determines the out-degree and a target vertex ID for a source vertex efficiently. The vertices in the generated graphs can represent users in social networks, and the edges can represent relationships in social networks. The synthetic graphs have the characteristics of real-world networks and can be used for social network analysis.

Description

TECHNICAL FIELD

The disclosure herein relates to computer science, especially to a social graph generation method based on a degree distribution model.

BACKGROUND

Social graph generators aim to generate social networks that are as realistic as possible. With the rapid progress of social media, a number of social network analysis tasks have emerged, such as community detection, community search, and network representation. Both real-world and synthetic graphs are necessary to evaluate the performance and scalability of the algorithms for these tasks. Social graph generators have therefore become increasingly important, especially because different algorithms focus on different features of social graphs.
For example, the community detection algorithms using hierarchical clustering or blocking matrices techniques proceed on homogeneous graphs in which there is only one type of nodes and edges. Some community detection algorithms are performed on heterogeneous graphs with multiple aspects of relationships and multiple labels of vertices. In addition, real-world communities can be classified as overlapping and non-overlapping communities, and many social applications encounter the exponential growth in the graph size.
However, existing synthetic graph generators cannot satisfy all of the above demands. Some schema-driven methods have been proposed to generate graphs for various domains and applications. These methods, such as gMark, use well-designed schemas to cover features commonly found in graphs, e.g., the labels of vertices and edges. However, most of these methods are not designed for social graphs, since they lack support for generating graphs with community structures. They are also not suitable for generating large-scale graphs.
LFR is a widely used benchmark tool for generating social graphs. It constructs communities based on the rule that vertices share more links with other vertices in the same community than with those in other communities. The in-degree of vertices of the generated graphs conforms to the power-law distribution, but the out-degree does not. In addition, LFR has a limitation on the size of the generated graph due to its high computational overhead when constructing communities.
There are a number of methods proposed to generate large-scale synthetic graphs. RMAT and Kronecker are most widely used among them. RMAT uses a recursive matrix model to recursively select a quadrant of the adjacency matrix until a cell is selected. The procedure repeats until all edges are generated. Kronecker has two graph generation models, i.e., Stochastic Kronecker Graph (SKG) and Deterministic Kronecker Graph (DKG). The widely used SKG is a generalized variant of the recursive matrix model in terms of the number of probability parameters. The space complexity of RMAT is O(|E|), and the time complexity of Kronecker is O(|V|²). TrillionG proposes a new generation model called the recursive vector model to generate trillion-scale graphs efficiently. However, TrillionG only generates general graphs, i.e., ones without the guarantee of having the community structures.
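As a rough illustration of the recursive quadrant selection attributed to RMAT above, the following Python sketch picks one cell of a 2^scale × 2^scale adjacency matrix. The function name `rmat_edge` and the quadrant probabilities `a`, `b`, and `c` (with the fourth quadrant taking the remaining probability) are illustrative assumptions and are not taken from this disclosure.

```python
import random


def rmat_edge(scale, a, b, c, rng):
    """Select one (row, col) cell by recursively choosing a quadrant.

    a, b, c are the assumed probabilities of the top-left, top-right and
    bottom-left quadrants; the bottom-right quadrant gets 1 - a - b - c.
    """
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass              # top-left: neither bit is set
        elif r < a + b:
            col |= 1          # top-right
        elif r < a + b + c:
            row |= 1          # bottom-left
        else:
            row |= 1          # bottom-right
            col |= 1
    return row, col


rng = random.Random(0)
sample_edges = [rmat_edge(4, 0.57, 0.19, 0.19, rng) for _ in range(200)]
```

Repeating the call once per edge corresponds to the statement that the procedure repeats until all edges are generated.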
Therefore, existing technology needs to be improved.
The foregoing background is for purposes of assisting in understanding the present disclosure only and is not intended to admit or recognize that any referenced matter is part of a well-known common sense with respect to the present disclosure.

SUMMARY

To solve the above technical problems, the present disclosure proposes a social graph generation method based on a degree distribution generation model.
In an instance of the social graph generation method based on the degree distribution generation model, the generator generates social graphs according to user-defined schema information.
Degree distribution generation model. The generator generates the out-degree for a source vertex and a number of target vertices using the degree distribution generation model to make sure the out-degree of source vertices and in-degree of target vertices conform to the desired distribution.
Given the number of source vertices and target vertices and the parameters of the degree distributions, the generator generates general graphs.
The generator generates social graphs based on the degree distribution generation model. It determines the size of each community, and then determines the number of source vertices and target vertices of each community graph and of the graphs among communities. These graphs are generated using the general graph generation algorithm and are combined into a social graph.
The generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
In another instance of the social graph generation method based on the degree distribution generation model, the schema for the social graph generation is defined as follows.
The schema includes vertex schema, edge schema, community schema, and social graph generation schema, whose symbols are defined as follows.
The vertex schema VS=(lbl, amount, attr), where amount is the number of vertices labeled lbl, and attr is the attribute information. Attributes of a vertex are represented as a set of key-value pairs.
The edge schema ES=(lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr), where lbl_s and lbl_t are respectively the labels of source vertices and target vertices, amount is the number of edges labeled lbl, distr_in stands for the in-degree distribution of target vertices, and distr_out stands for the out-degree distribution of source vertices. The attribute information attr of an edge is a set of key-value pairs.
The community schema CS=(lbl_e, amount, λ_s, λ_t, ρ), where lbl_e is the label of edges in a community, and amount is the number of communities. The community size conforms to a power-law distribution; λ_s and λ_t are the power-law parameters of the community size in source and target vertices, respectively. The number of edges between different communities depends on the community fusion parameter ρ, which is a real number between 0 and 1.
The social graph generation schema SGS=(VSS, ESS, CSS), where VSS, ESS, and CSS represent a set of vertex schemas, a set of edge schemas, and a set of community schemas, respectively.
The symbols of the generated social graph are as follows: the heterogeneous graph G=(V,E), where V is the vertex set, and E⊆V×V is the edge set.
A vertex v∈V is represented as a triple (id, lbl, attr), where id, lbl, and attr stand for ID, label, and attributes of the vertex, respectively.
An edge e ∈E is represented as (v_s, v_t, lbl, attr), where v_s is the source vertex ID, v_t is the target vertex ID, lbl is the edge label, and attr is the attributes of the edge. The values v_s, v_t, and lbl uniquely identify an edge e.
The attribute information of nodes and edges is optional. For example, in a Peking Opera scenario, users, the comments they post, famous artists, genres, and vocal pieces can be represented as nodes, while relationships such as a famous artist belonging to a certain genre, a famous artist performing a certain vocal piece, and a user following an artist of interest can be represented as edges. Here, the schema information of the famous-artist node comprises (label, amount, attribute information).
In another instance of the social graph generation method based on the degree distribution generation model, the community fusion parameter ρ is a real number between 0 and 1. A larger value of ρ means that there will be more edges among communities.
In another instance of the social graph generation method based on the degree distribution generation model, the social graph generation method randomly generates an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
The probability mass function is derived as follows:
$p (x) = \begin{cases} α P (D = x; θ) & \text{if } x \in [d_{\min}, d_{\max}] \text{ and } x \in N^{+} \\ 0 & \text{otherwise} \end{cases};$
where d_min and d_max are the minimum degree and maximum degree, respectively. θ indicates the parameters of the degree distribution. P(D=x;θ) is the existence probability of vertices with degree D=x. The normalization parameter α is used to make the probabilities of the degrees in the range [d_min, d_max] sum to 1.
Given the out-degree distribution distr_out, the number of source vertices n_s, and the expected number of edges n_e, the out-degree of a source vertex is calculated as follows.
The number of edges n_e′ when the out-degree of source vertices follows the distr_out distribution is:
$n_{e}^{'} = \sum_{x = \mathit{outd}_{\min}}^{\mathit{outd}_{\max}} x \cdot n_{s} \cdot α \cdot P (D = x; θ_{out}),$
where outd_min is the minimum out-degree of source vertices, outd_max is the maximum out-degree of source vertices, and θ_out is the parameter of distr_out.
Adjust the maximum out-degree outd_max to make the number of existing edges n_e′ match the number of expected edges n_e.
The formula of the cumulative distribution function (CDF) is:
$F (x) = \sum_{i = \mathit{outd}_{\min}}^{x} α P (D = i; θ_{out}),$
where x ∈[outd_min, outd_max].
To generate a random number following a specified distribution whose CDF is F(x), we first generate a uniformly distributed value on [0,1] denoted as y, and then F⁻¹(y) is the generated number.
Design a new function
$G (z) = \underset{x}{\arg \max} F (x) \leq z, x \in [\mathit{outd}_{\min}, \mathit{outd}_{\max}],$
where z ∈{i·step | i∈N⁺, step=min P(D=x), i·step≤1}.
Given a uniformly distributed random value y on [0,1], we can obtain F⁻¹(y) from
$G (⌊ \frac{y}{step} ⌋ \cdot step)$
directly.
In another instance of the social graph generation method based on the degree distribution generation model, the method of adjusting the maximum out-degree outd_max to make the number of existing edges n_e′ match the number of expected edges n_e is as follows.
If n_e′<n_e, increase outd_max until the number of vertices with out-degree outd_max is less than 1, or n_e′>n_e.
If n_e′=n_e, there is no need to adjust outd_max.
If n_e′>n_e, reduce outd_max to make n_e′<n_e.
In another instance of the social graph generation method based on the degree distribution generation model, the generator generates a target vertex for a source vertex with a determined out-degree.
Given the in-degree distribution distr_in, the number of target vertices n_t, and the expected number of edges n_e, compute the target vertex ID so that the in-degree distribution conforms to the expected distribution.
Define an additional cumulative distribution function of the sum of in-degrees:
$F_{s} (x) = \sum_{i = \mathit{ind}_{\min}}^{x} β \cdot i \cdot α \cdot P (D = i; θ_{in}),$
where x ∈[ind_min, ind_max] and β is a normalization parameter given by
$β = \frac{1}{\sum_{i = \mathit{ind}_{\min}}^{\mathit{ind}_{\max}} i \cdot α \cdot P (D = i; θ_{in})} .$
Two auxiliary functions that map random values on [0,1] to CDF values are defined as follows:
$H_{1} (z) = F_{s} (\underset{x}{\arg \max} F_{s} (x) \leq z), H_{2} (z) = F_{s} (\underset{x}{\arg \min} F_{s} (x) \geq z),$
where x is the in-degree and
$z \in \{i \cdot step \mid i \in N^{+}\}, step = \min_{x \in [\mathit{ind}_{\min}, \mathit{ind}_{\max}]} (F_{s} (x + 1) - F_{s} (x)), i \cdot step \leq 1.$
To find the corresponding target vertex IDs, another two functions are defined as follows:
$G_{1} (z) = \sum_{i = \mathit{ind}_{\min}}^{x_{1}} i \cdot α \cdot P (D = i; θ_{in}), G_{2} (z) = \sum_{i = \mathit{ind}_{\min}}^{x_{2}} i \cdot α \cdot P (D = i; θ_{in}),$
where
$x_{1} = \underset{x}{\arg \max} F_{s} (x) \leq z, x_{2} = \underset{x}{\arg \min} F_{s} (x) \geq z .$
The target vertex ID is calculated by
$G_{1} (z) + ⌊ (y - H_{1} (z)) \times \frac{G_{2} (z) - G_{1} (z)}{H_{2} (z) - H_{1} (z)} ⌋ .$
In another instance of the social graph generation method based on the degree distribution generation model, given the number of source vertices, the number of target vertices, and the degree distribution parameters, the generation process of general graphs is as follows.
The parameters for generation include the number of source vertices n_s, the number of target vertices n_t, the number of expected edges n_e, the in-degree distribution of target vertices distr_in, and the out-degree distribution of source vertices distr_out.
We can use an n_s×n_t matrix M to represent the graph. M_ij=1 means that there exists an edge from a source vertex v_i to a target vertex v_j, and M_ij=0 implies that there is no such edge.
The general graph generation method determines outd target vertices to build edges.
In another instance of the social graph generation method based on the degree distribution generation model, the method determines the number of source vertices and the number of target vertices of each community graph and of the graphs among communities. Then, the generator generates simple graphs and combines them into a social graph.
Given a social graph generation schema S, let d_out(u) be the out-degree of vertex u, d_out^i(u) the out-degree of u with vertices inside the same community, and d_out^e(u)=d_out(u)−d_out^i(u) the out-degree of u with vertices in other communities. The probability density function (PDF) for d_out^e(u) of vertex u is as follows:
$p (x) = \begin{cases} α e^{- \frac{x}{1 + ρ}} & \text{if } x \in [1, \mathit{outd}_{\max}^{'}], \\ 0 & \text{otherwise}, \end{cases}$
where α is a normalization parameter, ρ is the community fusion parameter defined by the user in the community schema and is a real number between 0 and 1, outd_max′=outd_max−d_out^i(u), and outd_max is the maximum out-degree of source vertices. p(x) is a monotonically decreasing function.
Regard the out-degree random variable d_out^e(u) as a continuous variable; then, by the normalization property of a PDF, the following equation holds:
$\int_{1}^{\mathit{outd}_{\max}^{'}} α e^{- \frac{x}{1 + ρ}} \, dx = 1.$
For a source vertex u, the out-degree with vertices in other communities is
$d_{out}^{e} (u) = - (1 + ρ) \log (e^{- \frac{1}{1 + ρ}} + y (e^{- \frac{\mathit{outd}_{\max}^{'}}{1 + ρ}} - e^{- \frac{1}{1 + ρ}})),$
where y is a real number from a uniform distribution U(0,1) and the following equation holds between y and the target external out-degree d_out^e(u):
$\int_{1}^{d_{out}^{e} (u)} α e^{- \frac{x}{1 + ρ}} \, dx = y .$
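The closed form for the external out-degree follows from the two integral equations by direct integration. The intermediate steps, which the text leaves implicit, can be written out as follows, using d for d_out^e(u) and m for outd_max′ as shorthand introduced here only for brevity, and reading log as the natural logarithm.
$\int_{1}^{m} α e^{- \frac{x}{1 + ρ}} \, dx = α (1 + ρ) (e^{- \frac{1}{1 + ρ}} - e^{- \frac{m}{1 + ρ}}) = 1,$
$\int_{1}^{d} α e^{- \frac{x}{1 + ρ}} \, dx = α (1 + ρ) (e^{- \frac{1}{1 + ρ}} - e^{- \frac{d}{1 + ρ}}) = y .$
Dividing the second equation by the first eliminates α and gives
$e^{- \frac{d}{1 + ρ}} = e^{- \frac{1}{1 + ρ}} + y (e^{- \frac{m}{1 + ρ}} - e^{- \frac{1}{1 + ρ}}),$
and taking the logarithm of both sides and multiplying by −(1+ρ) yields the stated expression for d_out^e(u). As a check, y=0 gives d=1 and y=1 gives d=m.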
For each edge schema ES and the corresponding source vertex schema VS_s, target vertex schema VS_t, and community schema CS, denote the number of generated edges ES.amount as n_e, the number of source vertices VS_s.amount as n_s, the number of target vertices VS_t.amount as n_t, and the number of communities CS.amount as n_c; CS.λ_s and CS.λ_t are the power-law parameters.
Determine the size of each community according to the number of communities CS.amount and the power-law parameters CS.λ_s and CS.λ_t so that the community size conforms to a power-law distribution. Denote the sizes of the n_c communities as:
$n_{s_{1}} \times n_{t_{1}}, \dots, n_{s_{n_{c}}} \times n_{t_{n_{c}}} .$
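The text does not spell out how the community sizes are drawn, so the following Python sketch is only one possible reading. It assumes that the k-th community receives a share proportional to k^(−λ), that every community gets at least one vertex, and that leftover vertices are assigned by largest remainder; the function name `community_sizes` and all three choices are assumptions made for illustration.

```python
def community_sizes(n_vertices, n_c, lam):
    """Split n_vertices into n_c sizes with shares proportional to k ** -lam."""
    if n_c < 1 or n_vertices < n_c:
        raise ValueError("need at least one vertex per community")
    weights = [k ** -lam for k in range(1, n_c + 1)]
    total = sum(weights)
    rest = n_vertices - n_c                 # one vertex is reserved per community
    quotas = [rest * w / total for w in weights]
    sizes = [1 + int(q) for q in quotas]
    leftover = n_vertices - sum(sizes)
    # Hand the remaining vertices to the largest fractional parts.
    by_fraction = sorted(range(n_c),
                         key=lambda k: quotas[k] - int(quotas[k]),
                         reverse=True)
    for k in by_fraction[:leftover]:
        sizes[k] += 1
    return sizes


source_sizes = community_sizes(100, 5, 2.0)   # plays the role of n_s1 .. n_snc
target_sizes = community_sizes(60, 5, 1.5)    # plays the role of n_t1 .. n_tnc
```

Calling the function once with λ_s and once with λ_t gives the two factors of each community block n_si × n_ti.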
For a source vertex u, generate an external out-degree d_out ^e(u) with vertices in other communities randomly, and then d_out ^e(u) target vertices are generated to build edges from the source vertex u to target vertices in other communities.
In another instance of social graph generation method based on the degree distribution generation model, the generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
Given the number of source vertices n_s, the number of target vertices n_t, the number of expected edges n_e, the in-degree distribution of target vertices distr_in, the out-degree distribution of source vertices distr_out, and the growing rate ry, which is a real number in the interval [0,1], the streaming graph generation process is as follows.
The last percentage and the target percentage are initialized to be 0 and ry, respectively.
The generation process can be decomposed into a series of sub-processes, each of which generates a general graph in a non-streaming manner. In each generation sub-process, the number of source vertices and that of target vertices are n_s·pc_tg and n_t·pc_tg, respectively, where pc_tg is the target percentage. For an existing source vertex, the out-degree is the difference between the result in this sub-process and that in the last sub-process. For a new source vertex, an out-degree is determined directly.
When generating a target vertex, the algorithm should make sure that the ID is equal to or less than n_t·pc_tg.
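The stage bookkeeping described above can be sketched in Python as follows. This is a minimal sketch that only tracks which vertex IDs are new in each stage; it does not generate edges. The name `streaming_stages`, the 0-based ID ranges, the truncation to integers, and the requirement that the growing rate be strictly positive (so that the loop terminates) are assumptions of the sketch.

```python
def streaming_stages(n_s, n_t, rate):
    """Yield per-stage vertex counts and the ID ranges of newly added vertices."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1] for the stages to terminate")
    pc_last = 0.0                              # last percentage starts at 0
    k = 1
    while True:
        pc_tg = min(1.0, k * rate)             # target percentage of this stage
        prev_s, prev_t = int(n_s * pc_last), int(n_t * pc_last)
        cur_s, cur_t = int(n_s * pc_tg), int(n_t * pc_tg)
        yield {
            "pc_tg": pc_tg,
            "n_s": cur_s,
            "n_t": cur_t,
            "new_sources": range(prev_s, cur_s),   # vertices absent in the last stage
            "new_targets": range(prev_t, cur_t),
        }
        if pc_tg >= 1.0:
            break
        pc_last = pc_tg
        k += 1


stages = list(streaming_stages(100, 40, 0.25))
```

In each stage, target IDs would be drawn only from `range(0, stage["n_t"])`, which mirrors the requirement that a generated target ID not exceed n_t·pc_tg.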
The disclosure has the following advantages compared with the existing techniques.
The disclosure proposes a social graph generation method based on the degree distribution generation model. The model can generate a random value following a given distribution in O(1) time. Thus, this model can be used to determine an out-degree and a number of target vertices for a source vertex in order to generate edges. The generated social graphs have the characteristics of real-world social graphs. The generator uses user-defined configurations to generate graphs, which makes it widely applicable. The generation method is efficient and scalable, and is suitable for generating trillion-scale graphs.

BRIEF DESCRIPTION OF FIGURES

The accompanying drawings, which form a part of the specification, describe embodiments of the present disclosure and together with the description serve to explain the principles of the present disclosure.

The present disclosure will be more clearly understood from the following detailed description, with reference to the accompanying drawings, in which:

FIG. 1 is a flow diagram of a social networking graph generation method based on a degree distribution generation model provided by the invention.

DETAILED DESCRIPTION

While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
With the rapid progress of social media, social networks with billions of nodes are becoming more and more common in real-world applications. Such complex social networks can be represented as graphs, where users are represented as nodes and interactions among users, such as following, commenting, and liking, are represented as edges. Many social network analysis tasks have emerged to assist in practical applications. For example, community detection algorithms detect the structure in the network, and the structural information can assist in risk control tasks and user recommendation tasks. To verify the effectiveness and scalability of social network analysis algorithms, synthetic datasets are needed because of the high cost of extracting networks from actual applications. Thus, it is necessary to generate social graphs efficiently.
The disclosure proposes a social graph generation method based on the degree distribution generation model. The method uses a user-defined schema to generate social graphs, which can meet the needs of various application scenarios. The efficient and scalable generation method is suitable for generating large-scale graphs.
A social networking graph generation method based on a degree distribution generation model provided by the present disclosure is described in more detail below in conjunction with the accompanying drawings and embodiments.
The social graph generation method based on the degree distribution generation model constructs a graph by operating on its matrix representation. For each source vertex, an out-degree is generated and then a number of target vertices are determined to generate edges.
We propose a new generation model called the degree distribution generation model to accelerate the generation process. The time complexity of determining the out-degree and a target vertex for a source vertex is O(1). Therefore, the degree distribution generation model is suitable for generating large-scale graphs. Moreover, the model is a general model, which means that it can be used to generate graphs with a specified degree distribution as long as the probability density function or the probability mass function is given.
FIG. 1 is a flowchart of a social networking graph generation method based on a degree distribution generation model provided by the present disclosure. As shown in FIG. 1, the social networking graph generation method based on the degree distribution generation model includes the following steps.
10. Set the schema information used to generate social graphs. The generator generates social graphs according to user-defined schema information.
20. Set the degree distribution generation model. The social graph generation method uses the degree distribution generation model to randomly generate an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
30. Given the number of source vertices, the number of target vertices, and the parameters of a given distribution, the generator generates a general graph based on the degree distribution generation model.
40. The social graph generation method based on the degree distribution generation model determines the number of source vertices and the number of target vertices of each community graph and of the graphs among communities. The method generates simple graphs and combines them into a social graph.
50. The social graph generation method based on the degree distribution generation model generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage with those in the last generation stage to determine the new vertices and edges. It generates a simple graph in each generation stage.
The schema for the social graph generation is defined as follows.
The schema includes vertex schema, edge schema, community schema, and social graph generation schema, whose symbols are defined as follows.
The vertex schema VS=(lbl, amount, attr), where amount is the number of vertices labeled lbl, and attr is the attribute information. Attributes of a vertex are represented as a set of key-value pairs.
The edge schema ES=(lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr), where lbl_s and lbl_t are respectively the labels of source vertices and target vertices, amount is the number of edges labeled lbl, distr_in stands for the in-degree distribution of target vertices, and distr_out stands for the out-degree distribution of source vertices. The attribute information attr of an edge is a set of key-value pairs.
The community schema CS=(lbl_e, amount, λ_s, λ_t, ρ), where lbl_e is the label of edges in a community, and amount is the number of communities. The community size conforms to a power-law distribution; λ_s and λ_t are the power-law parameters of the community size in source and target vertices, respectively. The number of edges between different communities depends on the community fusion parameter ρ, which is a real number between 0 and 1.
The social graph generation schema SGS=(VSS, ESS, CSS), where VSS, ESS, and CSS represent a set of vertex schemas, a set of edge schemas, and a set of community schemas, respectively.
The symbols of the generated social graph are as follows: the heterogeneous graph G=(V,E), where V is the vertex set, and E⊆V×V is the edge set.
A vertex v∈V is represented as a triple (id, lbl, attr), where id, lbl, and attr stand for ID, label, and attributes of the vertex, respectively.
An edge e ∈E is represented as (v_s, v_t, lbl, attr), where v_s is the source vertex ID, v_t is the target vertex ID, lbl is the edge label, and attr is the attributes of the edge. The values v_s, v_t, and lbl uniquely identify an edge e.
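The schema tuples above map naturally onto plain record types. The following Python sketch shows one possible encoding; the class names, the use of dictionaries for distr_in and distr_out, and the field name `rho` for the community fusion parameter are assumptions made for illustration, and only the field lists follow the definitions above.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VertexSchema:                      # VS = (lbl, amount, attr)
    lbl: str
    amount: int
    attr: dict[str, Any] = field(default_factory=dict)


@dataclass
class EdgeSchema:                        # ES = (lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr)
    lbl: str
    lbl_s: str
    lbl_t: str
    amount: int
    distr_in: dict[str, Any]             # assumed encoding, e.g. {"type": ..., "theta": ...}
    distr_out: dict[str, Any]
    attr: dict[str, Any] = field(default_factory=dict)


@dataclass
class CommunitySchema:                   # CS = (lbl_e, amount, lambda_s, lambda_t, rho)
    lbl_e: str
    amount: int
    lambda_s: float
    lambda_t: float
    rho: float                           # community fusion parameter in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.rho <= 1.0:
            raise ValueError("community fusion parameter must lie in [0, 1]")


@dataclass
class SocialGraphSchema:                 # SGS = (VSS, ESS, CSS)
    vss: list[VertexSchema]
    ess: list[EdgeSchema]
    css: list[CommunitySchema]


users = VertexSchema(lbl="user", amount=1000)
follows = EdgeSchema(lbl="follows", lbl_s="user", lbl_t="user", amount=5000,
                     distr_in={"type": "power_law", "theta": 2.0},
                     distr_out={"type": "power_law", "theta": 2.0})
groups = CommunitySchema(lbl_e="follows", amount=10, lambda_s=2.0, lambda_t=2.0, rho=0.2)
schema = SocialGraphSchema(vss=[users], ess=[follows], css=[groups])
```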
The social graph generation method randomly generates an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
The probability mass function is derived as follows:
$p (x) = \begin{cases} α P (D = x; θ) & \text{if } x \in [d_{\min}, d_{\max}] \text{ and } x \in N^{+} \\ 0 & \text{otherwise} \end{cases};$
where d_min and d_max are the minimum degree and maximum degree, respectively. θ indicates the parameters of the degree distribution. P(D=x; θ) is the existence probability of vertices with degree D=x. The normalization parameter α is used to make the probabilities of the degrees in the range [d_min, d_max] sum to 1.
The formula of computing α is as follows:
$α = \frac{1}{\sum_{x = d_{\min}}^{d_{\max}} P (D = x; θ)} .$
The formula of the cumulative distribution function (CDF) is as follows:
$F (x) = \sum_{i = d_{\min}}^{x} α P (D = i; θ),$
where x ∈[d_min, d_max].
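The normalization and the CDF can be computed directly from their definitions. The Python sketch below does so for an unnormalized weight function standing in for P(D=x; θ); the power-law weight x^(−2) used in the example and the function names are assumptions, since the model is stated to work for any given distribution.

```python
from itertools import accumulate


def truncated_pmf(weight, d_min, d_max):
    """Return (alpha, pmf), where pmf[k] = alpha * weight(d_min + k)."""
    raw = [weight(x) for x in range(d_min, d_max + 1)]
    alpha = 1.0 / sum(raw)               # alpha = 1 / sum_x P(D = x; theta)
    return alpha, [alpha * w for w in raw]


def cdf_from_pmf(pmf):
    """F(x) as a running sum; cdf[k] corresponds to degree d_min + k."""
    return list(accumulate(pmf))


# Example with an assumed power-law weight and degrees 1..4.
alpha, pmf = truncated_pmf(lambda x: x ** -2.0, 1, 4)
cdf = cdf_from_pmf(pmf)
```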
To generate a random value conforming to a desired cumulative distribution function F(x), we first generate a uniformly distributed random value y on [0,1], and then F⁻¹(y) is the random value derived from the CDF F(x). To compute F⁻¹(y) efficiently, a new function G is designed as follows:
$G (z) = \underset{x}{\arg \max} F (x) \leq z, x \in [d_{\min}, d_{\max}],$
where z ∈{i·step | i∈N⁺, step=min P(D=x), i·step≤1}. Given a uniformly distributed random value y on [0,1], we can obtain F⁻¹(y) from
$G (⌊ \frac{y}{step} ⌋ \cdot step)$
directly.
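The lookup idea behind G can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the disclosed implementation: the names `build_inverse_table` and `sample_index` are hypothetical, the table stores for each grid point z = k·step the smallest degree index whose CDF exceeds z (a convention chosen here so that the minimum degree is reachable, instead of the arg max form above), and a short correction step is added because a grid cell of width step may contain one CDF value.

```python
import bisect
from itertools import accumulate


def build_inverse_table(pmf):
    """pmf[i] is the normalized probability of degree d_min + i."""
    cdf = list(accumulate(pmf))
    step = min(pmf)                       # step = min P(D = x)
    size = int(1.0 / step) + 1
    last = len(cdf) - 1
    # table[k]: index of the smallest degree whose CDF exceeds k * step.
    table = [min(bisect.bisect_right(cdf, k * step), last) for k in range(size)]
    return cdf, step, table


def sample_index(y, cdf, step, table):
    """Map a uniform value y in [0, 1] to a degree index via one table lookup."""
    k = min(int(y / step), len(table) - 1)
    i = table[k]
    # Correct for a CDF value falling inside the cell and for rounding.
    while i > 0 and cdf[i - 1] > y:
        i -= 1
    while i < len(cdf) - 1 and cdf[i] <= y:
        i += 1
    return i


cdf, step, table = build_inverse_table([0.5, 0.3, 0.2])
d_min = 1
degree = d_min + sample_index(0.65, cdf, step, table)
```

Because every probability is at least step, a cell of width step contains at most one CDF value, so the correction loops run only a small constant number of times and each draw costs O(1) once the table is built.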
According to the degree distribution generation model, the procedure GenOutDegree(distr_out, n_s, n_e) is implemented. The parameters of the procedure are the out-degree distribution distr_out, the number of source vertices n_s, and the expected number of edges n_e. The output of the procedure is the out-degree of a source vertex.
Given the out-degree distribution distr_out, the number of source vertices n_s, and the expected number of edges n_e, the out-degree of a source vertex is calculated as follows.
The number of edges n_e′ when the out-degree of source vertices follows the distr_out distribution is:
$n_{e}^{'} = \sum_{x = \mathit{outd}_{\min}}^{\mathit{outd}_{\max}} x \cdot n_{s} \cdot α \cdot P (D = x; θ_{out}),$
where outd_min is the minimum out-degree of source vertices, outd_max is the maximum out-degree of source vertices, and θ_out is the parameter of distr_out.
If the number of the expected edges n_e=−1, there is no need to adjust parameters.
Otherwise, the method should adjust the maximum out-degree outd_max to make n_e′ match n_e. There are three cases as follows.
If n_e′<n_e, increase outd_max until the number of vertices with out-degree outd_max is less than 1, or n_e′>n_e.
If n_e′=n_e, there is no need to adjust outd_max.
If n_e′>n_e, reduce outd_max to make n_e′<n_e.
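The three cases can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: P(D=x; θ_out) is passed as an unnormalized weight function, α is recomputed whenever outd_max changes, the number of vertices with out-degree outd_max is taken to be n_s·α·P(D=outd_max; θ_out), and the loops stop as soon as n_e′ reaches or crosses n_e. The function names are hypothetical.

```python
def expected_edges(weight, n_s, outd_min, outd_max):
    """n_e' = sum over x of x * n_s * alpha * P(D = x)."""
    degrees = range(outd_min, outd_max + 1)
    alpha = 1.0 / sum(weight(x) for x in degrees)
    return sum(x * n_s * alpha * weight(x) for x in degrees)


def vertices_at_max(weight, n_s, outd_min, outd_max):
    """Expected number of source vertices whose out-degree equals outd_max."""
    alpha = 1.0 / sum(weight(x) for x in range(outd_min, outd_max + 1))
    return n_s * alpha * weight(outd_max)


def adjust_outd_max(weight, n_s, n_e, outd_min, outd_max):
    if n_e == -1:                         # no adjustment requested
        return outd_max
    current = expected_edges(weight, n_s, outd_min, outd_max)
    if current < n_e:
        while current < n_e:
            outd_max += 1
            current = expected_edges(weight, n_s, outd_min, outd_max)
            if vertices_at_max(weight, n_s, outd_min, outd_max) < 1:
                break                     # fewer than one vertex at the new maximum
    elif current > n_e:
        while current > n_e and outd_max > outd_min:
            outd_max -= 1
            current = expected_edges(weight, n_s, outd_min, outd_max)
    return outd_max


power_law = lambda x: x ** -2.0           # assumed example distribution
grown = adjust_outd_max(power_law, 1000, 3000, 1, 10)
shrunk = adjust_outd_max(power_law, 1000, 1500, 1, 10)
```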
The formula of the cumulative distribution function (CDF) is:
$F (x) = \sum_{i = \mathit{outd}_{\min}}^{x} α P (D = i; θ_{out}),$
where x ∈[outd_min, outd_max].
To generate a random number following a specified distribution whose CDF is F(x), we first generate a uniformly distributed value on [0,1] denoted as y, and then F⁻¹(y) is the generated number.
Design a new function
$G (z) = \underset{x}{\arg \max} F (x) \leq z, x \in [\mathit{outd}_{\min}, \mathit{outd}_{\max}],$
where z ∈{i ·step|i∈N⁺, step=minP(D=x), i·step≤1}.
Given a uniformly distributed random value y on [0,1], we can obtain F⁻¹(y) from
$G (⌊ \frac{y}{step} ⌋ \cdot step)$
directly.
The social graph generation method generates a target vertex for a source vertex with a determined out-degree.
Given the in-degree distribution distr_in, the number of target vertices n_t, and the expected number of edges n_e, compute a target vertex ID so that the in-degree distribution conforms to the expected distribution.
We impose a constraint on the relationship between the in-degrees of vertices and their IDs. Given a series of target vertices v₁, v₂, . . . , v_nt, the in-degrees of these vertices are nondecreasing. This constraint is reasonable because we can generate a permutation of [1, n_t] as a mapping function to change the original IDs of target vertices so that there is no apparent relationship between the IDs and in-degrees of vertices.
Define an additional cumulative distribution function of the sum of in-degrees:
$F_{s} (x) = \sum_{i = \mathit{ind}_{\min}}^{x} β \cdot i \cdot α \cdot P (D = i; θ_{in}),$
where x ∈[ind_min, ind_max] and β is a normalization parameter given by
$β = \frac{1}{\sum_{i = \mathit{ind}_{\min}}^{\mathit{ind}_{\max}} i \cdot α \cdot P (D = i; θ_{in})} .$
Given a random number y ∈[0,1], the degree distribution generation model is used to find two CDF values satisfying F_s(x₁)≤y≤F_s(x₂), and the corresponding target vertex IDs are determined as follows.
Given a uniformly distributed random value y on [0,1], we first find two cumulative distribution function values F_s(x₁) and F_s(x₂) satisfying F_s(x₁)≤y≤F_s(x₂) and x₁+1≥x₂, together with the corresponding vertex IDs.
Two auxiliary functions that map random values on [0,1] to CDF values are defined as follows:
$H_{1} (z) = F_{s} (\underset{x}{\arg \max} F_{s} (x) \leq z), H_{2} (z) = F_{s} (\underset{x}{\arg \min} F_{s} (x) \geq z),$
where x is the in-degree and
$z \in \{i \cdot step \mid i \in N^{+}\}, step = \min_{x \in [\mathit{ind}_{\min}, \mathit{ind}_{\max}]} (F_{s} (x + 1) - F_{s} (x)), i \cdot step \leq 1.$
To find the corresponding target vertex IDs, another two functions are defined as follows:
$G_{1} (z) = \sum_{i = \mathit{ind}_{\min}}^{x_{1}} i \cdot α \cdot P (D = i; θ_{in}), G_{2} (z) = \sum_{i = \mathit{ind}_{\min}}^{x_{2}} i \cdot α \cdot P (D = i; θ_{in}),$
where
$x_{1} = \underset{x}{\arg \max} F_{s} (x) \leq z, x_{2} = \underset{x}{\arg \min} F_{s} (x) \geq z .$
The target vertex ID is calculated by
$G_{1} (z) + ⌊ (y - H_{1} (z)) \times \frac{G_{2} (z) - G_{1} (z)}{H_{2} (z) - H_{1} (z)} ⌋ .$
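The interpolation idea can be illustrated with a simplified Python sketch. It is an interpretation, not the formula above: the in-degree class is located with a binary search over F_s instead of the step-indexed table, the ID range of each class is derived from vertex counts n_t·α·P(D=i; θ_in) rather than from G₁ and G₂, and IDs are 0-based and sorted by nondecreasing in-degree. The function names are hypothetical.

```python
import bisect
from itertools import accumulate


def build_target_sampler(in_pmf, ind_min, n_t):
    """in_pmf[k] is the normalized probability of in-degree ind_min + k."""
    degrees = range(ind_min, ind_min + len(in_pmf))
    mass = [d * p for d, p in zip(degrees, in_pmf)]       # i * alpha * P(D = i)
    beta = 1.0 / sum(mass)
    fs = list(accumulate(beta * m for m in mass))         # F_s(x)
    # bounds[k] .. bounds[k + 1] - 1 are the IDs of vertices in degree class k.
    bounds = [0] + [round(n_t * c) for c in accumulate(in_pmf)]
    bounds[-1] = n_t
    return fs, bounds


def target_vertex_id(y, fs, bounds):
    """Map a uniform value y in [0, 1] to a target vertex ID."""
    k = min(bisect.bisect_right(fs, y), len(fs) - 1)
    lo_cdf = fs[k - 1] if k > 0 else 0.0
    hi_cdf = fs[k]
    lo_id, hi_id = bounds[k], bounds[k + 1]
    if hi_id <= lo_id:                    # class is empty after rounding
        return min(lo_id, bounds[-1] - 1)
    frac = (y - lo_cdf) / (hi_cdf - lo_cdf)
    # Linear interpolation inside the ID range of the selected class.
    return min(lo_id + int(frac * (hi_id - lo_id)), hi_id - 1)


fs, bounds = build_target_sampler([0.5, 0.3, 0.2], ind_min=1, n_t=10)
ids = [target_vertex_id(j / 1000.0, fs, bounds) for j in range(1000)]
```

Because a class is chosen with probability proportional to its total in-degree and IDs inside a class are equally likely, vertices with larger in-degree receive proportionally more edges in this sketch.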
Given the number of source vertices, the number of target vertices, and the degree distribution parameters, the social graph generation method based on the degree distribution generation model generates general graphs as follows.
The parameters for generation include the number of source vertices n_s, the number of target vertices n_t, the number of expected edges n_e, the in-degree distribution of target vertices distr_in, and the out-degree distribution of source vertices distr_out.
We can use an n_s×n_t matrix M to represent the graph. M_ij=1 means that there exists an edge from a source vertex v_i to a target vertex v_j, and M_ij=0 implies that there is no such edge.
The general graph generation method determines outd target vertices to build edges.
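The overall loop of general graph generation can be sketched in Python as follows. This is a simplified sketch under assumptions: the matrix M is stored sparsely as a set of (source, target) pairs, both draws use a binary search over a CDF rather than the O(1) lookup of the model, the target side is described by a hypothetical per-vertex list `target_in_degrees`, and duplicate targets are rejected so that the result is a simple graph.

```python
import bisect
import random
from itertools import accumulate


def generate_general_graph(n_s, out_pmf, outd_min, target_in_degrees, seed=0):
    """Return a set of (source, target) pairs, i.e. the cells with M_ij = 1."""
    rng = random.Random(seed)
    out_cdf = list(accumulate(out_pmf))
    total_in = sum(target_in_degrees)
    in_cdf = list(accumulate(d / total_in for d in target_in_degrees))
    n_t = len(target_in_degrees)
    reachable = sum(1 for d in target_in_degrees if d > 0)
    edges = set()
    for s in range(n_s):
        # Draw the out-degree of source s by inverting the out-degree CDF.
        k = min(bisect.bisect_right(out_cdf, rng.random()), len(out_cdf) - 1)
        outd = min(outd_min + k, reachable)
        # Draw outd distinct targets, weighted by their in-degree.
        targets = set()
        while len(targets) < outd:
            t = min(bisect.bisect_right(in_cdf, rng.random()), n_t - 1)
            targets.add(t)
        edges.update((s, t) for t in targets)
    return edges


edges = generate_general_graph(
    n_s=50, out_pmf=[0.5, 0.3, 0.2], outd_min=1,
    target_in_degrees=[1] * 10 + [2] * 5 + [5] * 5, seed=1)
```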
The social graph generation method based on the degree distribution generation model determines the number of source vertices and the number of target vertices of each community graph and graphs among community. Then, the generator generates simple graphs and combine them into a social graph.
Given a social graph generation schema S, let d_out(u) be the out-degree of vertex u, d_out ⁱ(u) the out-degree of u with vertices inside the same community, and d_out ^e(u)=d_out(u) - d_out ⁱ(u) the out-degree of u with vertices in other communities. A probability density function (PDF) for d_out ^e(u) of vertex u is defined as follows.
$p (x) = \begin{cases} α e^{- \frac{x}{1 + ρ}} & \text{if } x \in [1, {outd}_{\max}^{'}] \\ 0 & \text{otherwise} \end{cases},$
where α is a normalization parameter, ρ is the community fusion parameter, which is user-defined in the community schema and is a real number between 0 and 1, outd_max′ = outd_max - d_out ⁱ(u), and outd_max is the maximum out-degree of source vertices. p(x) is a monotonically decreasing function, which means that a source vertex u is more likely to have a small d_out ^e(u), i.e., vertices in different communities are sparsely connected. The larger ρ is, the higher the probability that a source vertex has a larger d_out ^e(u), i.e., there will be more edges in the blocks that are not on the main diagonal.
Regarding the out-degree random variable d_out ^e(u) as a continuous variable, the following equation holds by the normalization property of a PDF:
$\int_{1}^{{outd}_{\max}^{'}} α e^{- \frac{x}{1 + ρ}} \, dx = 1 .$
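As a worked step that the text leaves implicit, evaluating this integral for fixed ρ and outd_max′ gives the normalization parameter in closed form:
$α (1 + ρ) (e^{- \frac{1}{1 + ρ}} - e^{- \frac{{outd}_{\max}^{'}}{1 + ρ}}) = 1, \quad \text{i.e.,} \quad α = \frac{1}{(1 + ρ) (e^{- \frac{1}{1 + ρ}} - e^{- \frac{{outd}_{\max}^{'}}{1 + ρ}})} .$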
For a source vertex u, the out-degree with vertices in other communities is
$d_{out}^{e} (u) = - (1 + ρ) \log (e^{- \frac{1}{1 + ρ}} + y (e^{- \frac{{outd}_{\max}^{'}}{1 + ρ}} - e^{- \frac{1}{1 + ρ}})),$
where y is a real number from a uniform distribution U(0,1) and the following equation holds between y and the target external out-degree d_out ^e(u):
$\int_{1}^{d_{out}^{e} (u)} α e^{- \frac{x}{1 + ρ}} \, dx = y .$
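The closed form for d_out ^e(u) is an inverse transform of the truncated exponential density, and can be sketched in Python as follows. The function name `external_out_degree` and the example values of ρ and outd_max′ are hypothetical. The sketch returns the continuous value, since the text does not state how it is rounded to an integer degree.

```python
import math
import random


def external_out_degree(y, rho, outd_max_prime):
    """Map a uniform draw y in [0, 1] to a continuous external out-degree."""
    scale = 1.0 + rho
    low = math.exp(-1.0 / scale)              # e^{-1/(1+rho)}
    high = math.exp(-outd_max_prime / scale)  # e^{-outd'_max/(1+rho)}
    return -scale * math.log(low + y * (high - low))


rng = random.Random(7)  # fixed seed, only so the sketch is reproducible
samples = [external_out_degree(rng.random(), rho=0.5, outd_max_prime=10)
           for _ in range(5)]
print(samples)
```

At y = 0 the expression evaluates to 1 and at y = 1 to outd_max′, so every draw stays inside the support [1, outd_max′] of p(x).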
For each edge schema ES and the corresponding source vertex schema VS_s, target vertex schema VS_t, and community schema CS, denote the number of generated edges ES.amount as n_e, the number of source vertices VS_s.amount as n_s, the number of target vertices VS_t.amount as n_t, and the number of communities CS.amount as n_c. CS.λ_s and CS.λ_t are the power-law parameters.
Determine the size of each community according to the number of communities CS.amount and the power-law parameters CS.λ_s and CS.λ_t so that the community size conforms to a power-law distribution. Denote the sizes of the n_c communities as:
$n_{s_{1}} \times n_{t_{1}}, \dots, n_{s_{n_{c}}} \times n_{t_{n_{c}}} .$
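The text does not spell out how the sizes are drawn, so the following Python sketch shows only one possible allocation, stated here as an assumption: the k-th community receives a share proportional to k^(-λ), every community keeps at least one vertex, and rounding leftovers go to the largest communities. The function name `power_law_sizes` and the example numbers are hypothetical.

```python
def power_law_sizes(total, n_c, lam):
    """Split `total` vertices into n_c community sizes that decay like k^(-lam)."""
    if total < n_c:
        raise ValueError("need at least one vertex per community")
    weights = [(k + 1) ** (-lam) for k in range(n_c)]
    spare = total - n_c  # vertices left after giving one to each community
    w_sum = sum(weights)
    sizes = [1 + int(spare * w / w_sum) for w in weights]
    # Hand out what rounding down left over, largest communities first.
    for k in range(total - sum(sizes)):
        sizes[k % n_c] += 1
    return sizes


# Source-side and target-side sizes use lambda_s and lambda_t, respectively.
source_sizes = power_law_sizes(total=100, n_c=5, lam=1.5)
target_sizes = power_law_sizes(total=200, n_c=5, lam=1.2)
print(list(zip(source_sizes, target_sizes)))
```

Pairing the k-th entries of the two lists gives the n_{s_k} × n_{t_k} block of the k-th community.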
For a source vertex u, generate an external out-degree d_out ^e(u) with vertices in other communities randomly, and then d_out ^e(u) target vertices are generated to build edges from the source vertex u to target vertices in other communities.
The social graph generation method based on the degree distribution generation model generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
Given the number of source vertices n_s, the number of target vertices n_t, the number of expected edges n_e, the in-degree distribution of target vertices distr_in, the out-degree distribution of source vertices distr_out, and the growing rate r_g, which is a real number in the interval [0,1], the streaming graph generation process is as follows.
The last percentage and the target percentage are initialized to be 0 and r_g, respectively.
The generation process can be decomposed into a series of sub-processes, each of which generates a general graph in a non-streaming manner. In each generation sub-process, the number of source vertices and that of target vertices are n_s·pc_tg and n_t·pc_tg, respectively.
For an existing source vertex, the out-degree is the difference between the result in this sub-process and that in the last sub-process. For a new source vertex, determine an out-degree directly. When generating a target vertex, the algorithm should make sure that the ID is equal to or less than n_t·pc_tg.
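The stage bookkeeping can be sketched in Python as follows. The sketch tracks vertex counts only. It assumes that the target percentage grows by r_g per stage until it reaches 1 and that fractional counts are rounded down, neither of which is stated in the text. The function name `streaming_stages` and the dictionary keys are hypothetical.

```python
def streaming_stages(n_s, n_t, r_g):
    """Yield the vertex counts of each streaming generation stage."""
    if not 0 < r_g <= 1:
        raise ValueError("r_g must lie in (0, 1]")
    pc_last, k = 0.0, 1
    while pc_last < 1.0:
        pc_tg = min(1.0, k * r_g)  # target percentage of this stage
        yield {
            "sources": int(n_s * pc_tg),
            "targets": int(n_t * pc_tg),  # upper bound for target IDs
            # Vertices that did not exist in the previous stage.
            "new_sources": int(n_s * pc_tg) - int(n_s * pc_last),
            "new_targets": int(n_t * pc_tg) - int(n_t * pc_last),
        }
        pc_last, k = pc_tg, k + 1


stages = list(streaming_stages(n_s=8, n_t=12, r_g=0.25))
for stage in stages:
    print(stage)
```

Within each stage, existing sources would receive only the difference between their new and previous out-degree, while the `new_sources` vertices are assigned an out-degree directly.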
With the rapid progress of social media, networks with billions of nodes are becoming more and more common in real-world applications. A number of existing social network analysis tasks are conducted on these large-scale networks to assist in practical applications. It usually takes significant resources to acquire the underlying large network. Thus, it is necessary to use synthetic graphs to verify the efficiency and scalability of social network analysis tasks.
This disclosure proposes a social graph generation method that uses a user-defined schema to satisfy various scenarios. We propose a degree distribution generation model to efficiently generate random values following a specified distribution. The model efficiently determines an out-degree and a number of target vertices for a source vertex in order to generate edges. The vertices in the synthetic graph could represent the users in the real-world network and the edges could represent the relationships in the network.
The generated graphs have the characteristics of real-world social networks, including the small-world property, community structures, and power-law degree distributions. The synthetic social graphs could be used for social network analysis tasks, such as community detection, community search, and network representation learning.
It is obvious to those skilled in the art that the embodiments of the present disclosure are not limited to the details of the exemplary embodiments described above, and that the embodiments of the present disclosure can be implemented in other specific forms without departing from the spirit or essential characteristics of the embodiments of the present disclosure. The embodiments are therefore to be considered in all respects as exemplary and not limiting. The scope of the embodiments of the present disclosure is defined by the appended claims rather than by the foregoing description, and it is therefore intended that all changes falling within the meaning and scope of equivalents of the claims be embraced within the embodiments of the present disclosure. Any reference numerals in the claims shall not be construed as limiting the claims concerned. In addition, the word “comprising” does not exclude other elements or steps, and the singular does not exclude the plural. Multiple elements, modules, or devices recited in the system, device, or terminal claims may also be implemented by the same element, module, or device in software or hardware. The terms first, second, and the like are used to denote names rather than any particular order.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solution of the embodiments of the present disclosure and are not intended to be limiting. While the embodiments of the present disclosure have been described in detail with reference to the preferred embodiments described above, it will be appreciated by those skilled in the art that modifications or equivalent substitutions to the embodiments of the present disclosure should not depart from the spirit and scope of the embodiments of the present disclosure.

Claims

What is claimed is:

1. A social graph generation method using a degree distribution generation model, comprising: setting the schema which is used to generate social graphs, wherein the generator generates social graphs according to user-defined schema information;

setting the degree distribution generation model, wherein the generator generates the out-degree for a source vertex and a number of target vertices using the degree distribution generation model, so that the out-degrees of source vertices and the in-degrees of target vertices conform to the desired distributions;

given the number of source vertices and target vertices and the parameters of the degree distributions, the generator generates general graphs;

the generator generates social graphs based on the degree distribution generation model: it determines the size of each community and the number of source vertices and target vertices of each community graph and of each graph among communities, which are generated using the general graph generation algorithm and combined into a social graph;

the generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges, then, it generates a simple graph in each generation stage.

2. The method of claim 1, wherein the schema for the social graph generation is defined as follows: the schema includes vertex schema, edge schema, community schema, and social graph generation schema, whose symbols are defined as follows:

the vertex schema VS=(lbl, amount, attr), where amount is the number of vertices labeled lbl, and attr is the attribute information. Attributes of a vertex are represented as a set of key-value pairs;

the edge schema ES=(lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr), where lbl_s and lbl_t are respectively the labels of source vertices and target vertices, amount is the number of edges labeled lbl, distr_in stands for the in-degree distribution of target vertices, and distr_out stands for the out-degree distribution of source vertices. The attribute information attr of an edge is a set of key-value pairs;

the community schema CS=(lbl_e, amount, λ_s, λ_t, ρ), where lbl_e is the label of edges in a community, and amount is the number of communities. The community size conforms to a power-law distribution, and λ_s and λ_t are the power-law parameters of the community size in source and target vertices, respectively. The number of edges between different communities depends on the community fusion parameter ρ, which is a real number between 0 and 1;

the social graph generation schema SGS=(VSS, ESS, CSS), where VSS, ESS, and CSS represent a set of vertex schemas, a set of edge schemas, and a set of community schemas, respectively;

the symbols of the generated social graph are as follows:

the heterogeneous graph G=(V,E), where V is the vertex set, and E⊆V×V is the edge set;

a vertex v∈V is represented as a triple (id, lbl, attr), where id, lbl, and attr stand for ID, label, and attributes of the vertex, respectively;

an edge e∈E is represented as (v_s, v_t, lbl, attr), where v_s is the source vertex ID, v_t is the target vertex ID, lbl is the edge label, and attr is the attributes of the edge. The v_s, v_t, and lbl uniquely identify an edge.

3. The method of claim 2, wherein the community fusion parameter ρ is a real number between 0 and 1. Larger ρ values mean that there will be more edges among communities.

4. The method of claim 1, wherein the social graph generation method generates an out-degree and multiple target vertices for a source vertex randomly and ensures that the in-degree distribution and out-degree distribution conform to desired distributions:

the probability mass function is derived as follows:

$p (x) = \begin{cases} α P (D = x; θ) & \text{if } x \in [d_{\min}, d_{\max}] \text{ and } x \in N^{+} \\ 0 & \text{otherwise} \end{cases};$

where d_min and d_max are the minimum degree and maximum degree, respectively, θ indicates the parameters of the degree distribution, P(D=x; θ) is the existence probability of vertices with degree D=x, and the normalization parameter α makes the probabilities of the degrees in the range [d_min, d_max] sum to 1;

given the out-degree distribution distr_out, the number of source vertices n_s and the expected number of edges n_e, the out-degree of a source vertex is calculated as follows:

the number of edges n_e′ when the out-degree of source vertices follows the distr_out distribution:

$n_{e}^{'} = \sum_{x = {outd}_{\min}}^{{outd}_{\max}} x \cdot n_{s} \cdot α \cdot P (D = x; θ_{out}),$

where outd_min is the minimum out-degree of source vertices, outd_max is the maximum out-degree of source vertices, and θ_out is the parameter of distr_out;

adjust the maximum out-degree outd_max to make the number of existing edges n_e′ match the number of expected edges n_e;

the formula of the cumulative distribution function (CDF) is: $F (x) = \sum_{i = {outd}_{\min}}^{x} α P (D = i; θ_{out})$, where x ∈ [outd_min, outd_max];

to generate a random number following a specified distribution whose CDF is F(x), we first generate a uniformly distributed value on [0,1] denoted as y, and then F⁻¹(y) is the generated number;

design a new function

$G (z) = \underset{x}{\arg \max} F (x) \leq z, \quad x \in [{outd}_{\min}, {outd}_{\max}],$

where z ∈{i·step|i∈N⁺, step=minP(D=x), i·step≤1};

given a uniformly distributed random value y on [0,1], we can obtain F⁻¹(y) from

$G (⌊ \frac{y}{step} ⌋ \cdot step)$

directly.

5. The method of claim 4, wherein the method of adjusting the maximum out-degree outd_max to make the number of existing edges n_e′ match the number of expected edges n_e is as follows:

if n_e′<n_e, increase outd_max until the number of vertices with out-degree outd_max is less than 1, or n_e′>n_e;

if n_e′=n_e, there is no need to adjust outd_max;

if n_e′>n_e, reduce outd_max to make n_e′≤n_e.

6. The method of claim 4, wherein the generator generates a target vertex for a source vertex with a determined out-degree:

given the in-degree distribution distr_in, the number of target vertices n_t, and the expected number of edges n_e, compute a target vertex ID so that the in-degree distribution conforms to the expected distribution;

define an additional cumulative distribution function of the sum of in-degree: $F_{s} (x) = \sum_{i = {ind}_{\min}}^{x} β \cdot i \cdot α \cdot P (D = i; θ_{in})$, where x ∈ [ind_min, ind_max] and β is a normalization parameter whose formula is

$β = \frac{1}{\sum_{i = {ind}_{\min}}^{{ind}_{\max}} i \cdot α \cdot P (D = i; θ_{in})};$

two auxiliary functions that map random values on [0,1] to CDF values are defined as follows:

$H_{1} (z) = F_{s} (\underset{x}{\arg \max} F_{s} (x) \leq z), \quad H_{2} (z) = F_{s} (\underset{x}{\arg \min} F_{s} (x) \geq z),$

where x is the in-degree,

$z \in \{ i \cdot step \mid i \in N^{+} \}, \quad step = \min_{x \in [{ind}_{\min}, {ind}_{\max}]} (F_{s} (x + 1) - F_{s} (x)), \quad i \cdot step \leq 1;$

to find the corresponding target vertex IDs, another two functions are defined as follows:

$G_{1} (z) = \sum_{i = {ind}_{\min}}^{x_{1}} i \cdot α \cdot P (D = i; θ_{in}), \quad G_{2} (z) = \sum_{i = {ind}_{\min}}^{x_{2}} i \cdot α \cdot P (D = i; θ_{in}),$

where

$x_{1} = \underset{x}{\arg \max} F_{s} (x) \leq z, \quad x_{2} = \underset{x}{\arg \min} F_{s} (x) \geq z;$

the target vertex ID is calculated by

$G_{1} (z) + ⌊ (y - H_{1} (z)) \times \frac{G_{2} (z) - G_{1} (z)}{H_{2} (z) - H_{1} (z)} ⌋ .$

7. The method of claim 6, wherein given the number of source vertices, the number of target vertices, and the degree distribution parameters, the generation process of general graphs is as follows:

the parameters for generation include the number of source vertices n_s, the number of target vertices n_t, the number of expected edges n_e, the in-degree distribution of target vertices distr_in, and the out-degree distribution of source vertices distr_out;

an n_s × n_t matrix M is used to represent the graph, where M[i][j]=1 means that there exists an edge from a source vertex v_i to a target vertex v_j, and M[i][j]=0 implies that there is no such edge;

the general graph generation method determines outd target vertices to build edges.

8. The method of claim 7, wherein the number of source vertices and the number of target vertices of each community graph and of the graphs among communities are determined, and then the generator generates simple graphs and combines them into a social graph;

given a social graph generation schema S, let d_out(u) be the out-degree of vertex u, d_out ⁱ(u) the out-degree of u with vertices inside the same community, and d_out ^e(u)=d_out(u) - d_out ⁱ(u) the out-degree of u with vertices in other communities. A probability density function (PDF) for d_out ^e(u) of vertex u is defined as follows:

$p (x) = \begin{cases} α e^{- \frac{x}{1 + ρ}} & \text{if } x \in [1, {outd}_{\max}^{'}] \\ 0 & \text{otherwise} \end{cases},$

where α is a normalization parameter, ρ is the community fusion parameter, which is user-defined in the community schema and is a real number between 0 and 1, outd_max′ = outd_max - d_out ⁱ(u), outd_max is the maximum out-degree of source vertices, and p(x) is a monotonically decreasing function;

regarding the out-degree random variable d_out ^e(u) as a continuous variable, the following equation holds by the normalization property of a PDF:

$\int_{1}^{{outd}_{\max}^{'}} α e^{- \frac{x}{1 + ρ}} \, dx = 1;$

for a source vertex u, the out-degree with vertices in other communities is

$d_{out}^{e} (u) = - (1 + ρ) \log (e^{- \frac{1}{1 + ρ}} + y (e^{- \frac{{outd}_{\max}^{'}}{1 + ρ}} - e^{- \frac{1}{1 + ρ}})),$

where y is a real number from a uniform distribution U(0,1) and the following equation holds between y and the target external out-degree d_out ^e(u):

$\int_{1}^{d_{out}^{e} (u)} α e^{- \frac{x}{1 + ρ}} \, dx = y;$

for each edge schema ES and the corresponding source vertex schema VS_s, target vertex schema VS_t, and community schema CS, denote the number of generated edges ES.amount as n_e, the number of source vertices VS_s.amount as n_s, the number of target vertices VS_t.amount as n_t, and the number of communities CS.amount as n_c, where CS.λ_s and CS.λ_t are the power-law parameters;

determine the size of each community according to the number of communities CS.amount and the power-law parameters CS.λ_s and CS.λ_t so that the community size conforms to a power-law distribution. Denote the sizes of the n_c communities as:

$n_{s_{1}} \times n_{t_{1}}, \dots, n_{s_{n_{c}}} \times n_{t_{n_{c}}};$

for a source vertex u, generate an external out-degree d_out ^e(u) with vertices in other communities randomly, and then d_out ^e(u) target vertices are generated to build edges from the source vertex u to target vertices in other communities.

9. The method of claim 8, wherein the generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges, then, it generates a simple graph in each generation stage:

given the number of source vertices n_s, the number of target vertices n_t, the number of expected edges n_e, the in-degree distribution of target vertices distr_in, the out-degree distribution of source vertices distr_out, and the growing rate r_g, which is a real number in the interval [0,1], the streaming graph generation process is as follows:

the last percentage and the target percentage are initialized to be 0 and r_g, respectively;

the generation process can be decomposed into a series of sub-processes, each of which generates a general graph in a non-streaming manner; in each generation sub-process, the number of source vertices and that of target vertices are n_s·pc_tg and n_t·pc_tg, respectively; for an existing source vertex, the out-degree is the difference between the result in this sub-process and that in the last sub-process; for a new source vertex, an out-degree is determined directly;

when generating a target vertex, the algorithm should make sure that the ID is equal to or less than n_t·pc_tg.