US20230032521A1 - Social graph generation method using a degree distribution generation model - Google Patents

Social graph generation method using a degree distribution generation model

Info

Publication number
US20230032521A1
Authority
US
United States
Prior art keywords
vertices
degree
vertex
target
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US17/784,175
Inventor
Chaokun Wang
Binbin WANG
Bingyang Huang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Assigned to TSINGHUA UNIVERSITY. Assignment of assignors' interest (see document for details). Assignors: HUANG, Bingyang; WANG, Binbin; WANG, Chaokun
Publication of US20230032521A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs

Definitions

  • the disclosure herein relates to computer science, especially to a social graph generation method based on a degree distribution model.
  • Social graph generators aim to generate social networks that are as realistic as possible. With the rapid progress of social media, a number of social network analysis tasks have emerged, such as community detection, community search, and network representation learning. Both real-world and synthetic graphs are necessary to evaluate the performance and scalability of the algorithms for these tasks. Social graph generators are therefore becoming increasingly important, especially because different algorithms focus on different features of social graphs.
  • the community detection algorithms using hierarchical clustering or blocking matrices techniques proceed on homogeneous graphs in which there is only one type of nodes and edges. Some community detection algorithms are performed on heterogeneous graphs with multiple aspects of relationships and multiple labels of vertices.
  • real-world communities can be classified as overlapping and non-overlapping communities, and many social applications encounter the exponential growth in the graph size.
  • LFR is a widely used benchmark tool for generating social graphs. It constructs communities based on the rule that vertices share more links with other vertices in the same community than with those in other communities. The in-degree of the vertices of the generated graphs conforms to the power-law distribution, but the out-degree does not. In addition, LFR has a limitation on the size of the generated graph due to its high computational overhead when constructing communities.
  • RMAT and Kronecker are most widely used among them.
  • RMAT uses a recursive matrix model to recursively select a quadrant of the adjacency matrix until a cell is selected. The procedure repeats until all edges are generated.
  • Kronecker has two graph generation models, i.e., Stochastic Kronecker Graph (SKG) and Deterministic Kronecker Graph (DKG).
  • SKG Stochastic Kronecker Graph
  • DKG Deterministic Kronecker Graph
  • the widely used SKG is a generalized variant of the recursive matrix model in terms of the number of probability parameters.
  • the space complexity of RMAT is O(|E|), and the time complexity of Kronecker is O(|V|^2).
  • TrillionG proposes a new generation model called the recursive vector model to generate trillion-scale graphs efficiently. However, TrillionG only generates general graphs, i.e., ones without the guarantee of having the community structures.
  • the present disclosure proposes a social graph generation method based on degree distribution generation model.
  • In an instance of the social graph generation method based on the degree distribution generation model, the generator generates social graphs according to user-defined schema information.
  • the generator generates the out-degree for a source vertex and a number of target vertices using the degree distribution generation model to make sure the out-degree of source vertices and in-degree of target vertices conform to the desired distribution.
  • Given the number of source vertices and target vertices and the parameters of the degree distributions, the generator generates general graphs.
  • the generator generates social graphs based on the degree distribution generation model. It determines the size of each community, then determines the number of source vertices and target vertices of each community graph and of each graph among communities; these graphs are generated using the general graph generation algorithm and combined into a social graph.
  • the generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
  • the schema for the social graph generation is defined as follows.
  • the schema includes vertex schema, edge schema, community schema, and social graph generation schema, whose symbols are defined as follows.
  • the vertex schema VS = (lbl, amount, attr), where amount is the number of vertices labeled lbl, and attr is the attribute information. Attributes of a vertex are represented as a set of key-value pairs.
  • the edge schema ES = (lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr), where lbl_s and lbl_t are the labels of the source vertices and target vertices, respectively, amount is the number of edges labeled lbl, distr_in is the in-degree distribution of target vertices, and distr_out is the out-degree distribution of source vertices. The attribute information attr of an edge is a set of key-value pairs.
  • the community schema CS = (lbl_e, amount, λ_s, λ_t, p), where lbl_e is the label of edges in a community, and amount is the number of communities.
  • the community size conforms to a power-law distribution; λ_s and λ_t are the power-law parameters of the community size in source and target vertices, respectively.
  • the number of edges between different communities depends on the community fusion parameter p, which is a real number between 0 and 1.
  • the social graph generation schema SGS = (VSS, ESS, CSS), where VSS, ESS, and CSS represent a set of vertex schemas, a set of edge schemas, and a set of community schemas, respectively.
  • the heterogeneous graph G = (V, E), where V is the vertex set and E ⊆ V × V is the edge set.
  • a vertex v ∈ V is represented as a triple (id, lbl, attr), where id, lbl, and attr stand for the ID, label, and attributes of the vertex, respectively.
  • An edge e ∈ E is represented as (v_s, v_t, lbl, attr), where v_s is the source vertex ID, v_t is the target vertex ID, lbl is the edge label, and attr is the attributes of the edge.
  • the attribute information of nodes and edges is optional.
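The schema tuples defined above map naturally onto plain record types. The following is a minimal sketch, not the patent's implementation: the class names, the `Distribution` container and its fields (`kind`, `d_min`, `d_max`, `theta`), and the inclusive range check on p are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Distribution:
    """Hypothetical container for distr_in / distr_out (fields are assumed)."""
    kind: str      # e.g. "power_law"
    d_min: int     # minimum degree
    d_max: int     # maximum degree
    theta: float   # distribution parameter


@dataclass
class VertexSchema:
    """VS = (lbl, amount, attr)."""
    lbl: str
    amount: int
    attr: Dict[str, Any] = field(default_factory=dict)


@dataclass
class EdgeSchema:
    """ES = (lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr)."""
    lbl: str
    lbl_s: str
    lbl_t: str
    amount: int
    distr_in: Distribution
    distr_out: Distribution
    attr: Dict[str, Any] = field(default_factory=dict)


@dataclass
class CommunitySchema:
    """CS = (lbl_e, amount, lambda_s, lambda_t, p)."""
    lbl_e: str
    amount: int
    lambda_s: float
    lambda_t: float
    p: float

    def __post_init__(self) -> None:
        # Assumption: "between 0 and 1" is treated as the closed interval.
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("community fusion parameter p must lie in [0, 1]")


@dataclass
class SocialGraphSchema:
    """SGS = (VSS, ESS, CSS)."""
    vss: List[VertexSchema]
    ess: List[EdgeSchema]
    css: List[CommunitySchema]


# Illustrative schema: users following users, grouped into 10 communities.
power_law = Distribution("power_law", d_min=1, d_max=100, theta=2.0)
sgs = SocialGraphSchema(
    vss=[VertexSchema("user", 1000)],
    ess=[EdgeSchema("follows", "user", "user", 5000, power_law, power_law)],
    css=[CommunitySchema("follows", 10, 2.0, 2.0, 0.1)],
)
```

The optional `attr` fields default to empty dictionaries, which matches the statement that attribute information is optional.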
  • For example, in a social network about Peking Opera, the comments posted by users, famous artists, genres, and vocal pieces can be represented as nodes, while relationships such as a famous artist belonging to a certain genre and a famous artist performing a certain vocal piece can be represented as edges.
  • Likewise, the follow relationship between users and the relationship of a user being interested in a certain artist can be represented as edges. Among these, the schema information of the famous-artist node includes (the label of the famous artist, amount, attribute information).
  • the community fusion parameter p is a real number between 0 and 1. Larger p values mean that there will be more edges among communities.
  • In another instance of the social graph generation method based on the degree distribution generation model, the method randomly generates an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
  • the probability mass function is derived as follows:
  • d_min and d_max are the minimum degree and maximum degree, respectively.
  • θ indicates the parameters of the degree distribution.
  • the normalization parameter a ensures that the probabilities of all degrees in the given range sum to 1.
  • the out-degree of a source vertex is calculated as follows.
  • outd_min is the minimum out-degree of source vertices
  • outd_max is the maximum out-degree of source vertices
  • θ_out is the parameter of distr_out.
  • CDF: cumulative distribution function
  • $G_\theta(z) = \arg\max_{x} F_\theta(x) \le z, \quad x \in [outd_{min}, outd_{max}],$
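Drawing a degree from a CDF can be illustrated with standard inverse-transform sampling. The sketch below is an illustration under stated assumptions, not the patent's procedure: it assumes a power-law PMF f(x) = a · x^(−θ) on [d_min, d_max] (the source does not reproduce its PMF here), it returns the smallest degree whose CDF value reaches z (the boundary convention of the arg max form above may differ), and it uses binary search rather than a constant-time lookup. The function names are hypothetical.

```python
import bisect
import random


def power_law_cdf(d_min, d_max, theta):
    """CDF table for an assumed PMF f(x) = a * x**(-theta) on [d_min, d_max]."""
    weights = [x ** (-theta) for x in range(d_min, d_max + 1)]
    a = 1.0 / sum(weights)          # normalization: probabilities sum to 1
    cdf, running = [], 0.0
    for w in weights:
        running += a * w
        cdf.append(running)
    cdf[-1] = 1.0                   # guard against floating-point drift
    return cdf


def sample_degree(cdf, d_min, z):
    """Map z in [0, 1] to the smallest degree x whose CDF value is >= z."""
    return d_min + bisect.bisect_left(cdf, z)


cdf = power_law_cdf(1, 4, 2.0)
rng = random.Random(0)
degrees = [sample_degree(cdf, 1, rng.random()) for _ in range(1000)]
```

With θ = 2 and degrees 1 to 4, roughly 70% of the probability mass falls on degree 1, so most sampled out-degrees are small, which is the heavy-tailed behaviour a power law is meant to produce.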
  • the method of adjusting the maximum out-degree outd_max so that the number of existing edges n_e′ matches the number of expected edges n_e is as follows.
  • In another instance of the social graph generation method based on the degree distribution generation model, the generator generates a target vertex for a source vertex with a determined out-degree.
  • Given the in-degree distribution distr_in, the number of target vertices n_t, and the expected number of edges n_e, compute the target vertex ID so that the in-degree distribution conforms to the expected distribution.
  • $H_1(z) = F_s(\arg\max_{x} F_s(x) \le z)$
  • $H_2(z) = F_s(\arg\min_{x} F_s(x) \ge z),$
  • $step = \min_{x \in [ind_{min}, ind_{max}]} (F_s(x+1) - F_s(x)), \quad i \cdot step \le 1.$
  • $x_1 = \arg\max_{x} F_s(x) \le z$
  • $x_2 = \arg\min_{x} F_s(x) \ge z.$
  • the target vertex ID is calculated by
  • the generation process of general graphs is as follows.
  • the parameters for generation include the number of source vertices n_s, the number of target vertices n_t, the number of expected edges n_e, the in-degree distribution of target vertices distr_in, and the out-degree distribution of source vertices distr_out.
  • the general graph generation method determines outd target vertices to build edges.
  • the social graph generation method determines the number of source vertices and the number of target vertices of each community graph and of the graphs among communities. Then, the generator generates simple graphs and combines them into a social graph.
  • is a normalization parameter
  • p is the community fusion parameter user-defined in community schemas and is a real number between 0 and 1
  • outd_max′ = outd_max − d_out^i(u)
  • outd_max is the maximum out-degree of source vertices
  • p(x) is a monotone decreasing function.
  • For each edge schema ES with its corresponding source vertex schema VS_s, target vertex schema VS_t, and community schema CS, denote the number of generated edges ES.amount as n_e, the number of source vertices VS_s.amount as n_s, the number of target vertices VS_t.amount as n_t, and the number of communities CS.amount as n_c; CS.λ_s and CS.λ_t are the power-law parameters.
  • For a source vertex u, randomly generate an external out-degree d_out^e(u) toward vertices in other communities; then d_out^e(u) target vertices are generated to build edges from the source vertex u to target vertices in other communities.
  • In another instance of the social graph generation method based on the degree distribution generation model, the generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage with those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
  • the streaming graphs generation process is as follows.
  • the last percentage and the target percentage are initialized to 0 and r_g, respectively.
  • the generation process can be decomposed into a series of sub-processes, each generating a general graph in a non-streaming manner.
  • the number of source vertices and that of target vertices are n_s × pc_tg and n_t × pc_tg, respectively.
  • the out-degree is the difference between the result in this sub-process and that in the last sub-process.
  • When generating a target vertex, the algorithm should make sure that the ID is equal to or less than n_t × pc_tg.
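The per-stage bookkeeping of the streaming process can be sketched as follows. This is an illustration only: it assumes that pc_tg denotes the target percentage, that the target percentage grows by r_g after every stage (the source states only the initial values 0 and r_g), and that vertex counts are truncated to integers. The function name and the dictionary keys are hypothetical, and the edge generation inside each stage is omitted.

```python
from fractions import Fraction


def streaming_stages(n_s, n_t, r_g):
    """Yield vertex bookkeeping for each generation stage.

    Assumption: the target percentage increases by r_g per stage until it
    reaches 100%. Exact fractions avoid floating-point accumulation error.
    """
    r_g = Fraction(r_g)
    if r_g <= 0:
        raise ValueError("r_g must be positive")
    pc_last, pc_tg = Fraction(0), r_g
    while pc_last < 1:
        pc_tg = min(pc_tg, Fraction(1))
        yield {
            "pc_tg": pc_tg,
            # Sources that already existed in the last stage.
            "old_sources": range(0, int(n_s * pc_last)),
            # Sources that appear for the first time in this stage.
            "new_sources": range(int(n_s * pc_last), int(n_s * pc_tg)),
            # Target IDs generated in this stage must not exceed this bound.
            "max_target_id": int(n_t * pc_tg),
        }
        pc_last, pc_tg = pc_tg, pc_tg + r_g


stages = list(streaming_stages(10, 20, Fraction(1, 4)))
```

In a full generator, each stage would run the non-streaming general graph procedure on the stage's vertex counts and emit, for every existing source vertex, only the edges that make up the difference between its out-degree in this stage and in the last one.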
  • the disclosure has the following advantages compared with the existing techniques.
  • the disclosure proposes a social graph generation method based on the degree distribution generation model.
  • the model can generate a random value following a given distribution in O(1) time. Thus, this model can be used to determine an out-degree and a number of target vertices for a source vertex in order to generate edges.
  • the generated social graphs have the characteristics of real-world social graphs.
  • the generator uses user-defined configurations to generate graphs, which is widely applicable.
  • the generation method is efficient and scalable, and is suitable for generating trillion-scale graphs.
  • FIG. 1 is a flow diagram of a social networking graph generation method based on a degree distribution generation model provided by the invention.
  • the disclosure proposes a social graph generation method based on the degree distribution generation model.
  • the method uses a user-defined schema to generate social graphs, which can meet the needs of various application scenarios.
  • the efficient and scalable generation method is suitable for generating large-scale graphs.
  • a social networking graph generation method based on a degree distribution generation model provided by the present disclosure is described in more detail below in conjunction with the accompanying drawings and embodiments.
  • the social graph generation method based on the degree distribution generation model constructs a graph by operating on its matrix representation. For each source vertex, an out-degree is generated and then a number of target vertices are determined to generate edges.
  • the disclosure proposes the degree distribution generation model to accelerate the generation process.
  • the time complexity of determining the out-degree and a target vertex for a source vertex is O(1). Therefore, it is suitable to use the degree distribution generation model to generate large-scale graphs.
  • the model is a general model, which means that we can use this model to generate graphs with specified degree distribution as long as the probability density function or the probability mass function is given.
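One way a constant-time lookup can be obtained from a discrete CDF is a precomputed bucket table whose bucket width is no larger than the smallest CDF increment, so that each bucket contains at most one CDF breakpoint. The sketch below illustrates that general idea; it is an assumption about how such a lookup could work, not a reproduction of the disclosed model. The function names are hypothetical, the CDF must be strictly increasing and end at 1.0, and including the first CDF value among the gaps is a convenience of this sketch.

```python
import bisect
import math


def build_step_table(cdf):
    """Precompute, per bucket of width `step`, the first CDF index at or above its left edge."""
    gaps = [cdf[0]] + [b - a for a, b in zip(cdf, cdf[1:])]
    step = min(gaps)                      # no bucket can contain two breakpoints
    n_buckets = math.ceil(1.0 / step)
    last = len(cdf) - 1
    table = [min(bisect.bisect_left(cdf, i * step), last) for i in range(n_buckets)]
    return step, table


def lookup(cdf, step, table, z):
    """Return the smallest index k with cdf[k] >= z, for z in [0, 1]."""
    i = min(int(z / step), len(table) - 1)
    k = table[i]
    # Bucket edges and floating-point rounding may leave k off by a position;
    # these loops correct it and run a bounded number of times.
    while k > 0 and cdf[k - 1] >= z:
        k -= 1
    while cdf[k] < z:
        k += 1
    return k


cdf = [0.5, 0.75, 0.9, 1.0]
step, table = build_step_table(cdf)
index = lookup(cdf, step, table, 0.8)
```

The trade-off is memory: the table has about 1/step entries, which can be large when the smallest CDF increment is tiny, as in the tail of a power law.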
  • FIG. 1 is a flowchart of the social graph generation method based on a degree distribution generation model provided by the present disclosure. As shown in FIG. 1, the method comprises the steps described below.
  • the social graph generation method uses the degree distribution generation model to randomly generate an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
  • Given the number of source vertices, the number of target vertices, and the parameters of a given distribution, the generator generates a general graph based on the degree distribution generation model.
  • the social graph generation method based on the degree distribution generation model determines the number of source vertices and the number of target vertices of each community graph and of the graphs among communities; it then generates simple graphs and combines them into a social graph.
  • the social graph generation method based on the degree distribution generation model generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage with those in the last generation stage to determine the new vertices and edges; it generates a simple graph in each generation stage.
  • the schema for the social graph generation is defined as follows.
  • the schema includes vertex schema, edge schema, community schema, and social graph generation schema, whose symbols are defined as follows.
  • the vertex schema VS = (lbl, amount, attr), where amount is the number of vertices labeled lbl, and attr is the attribute information. Attributes of a vertex are represented as a set of key-value pairs.
  • the edge schema ES = (lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr), where lbl_s and lbl_t are the labels of the source vertices and target vertices, respectively, amount is the number of edges labeled lbl, distr_in is the in-degree distribution of target vertices, and distr_out is the out-degree distribution of source vertices. The attribute information attr of an edge is a set of key-value pairs.
  • the community schema CS = (lbl_e, amount, λ_s, λ_t, p), where lbl_e is the label of edges in a community, and amount is the number of communities.
  • the community size conforms to a power-law distribution; λ_s and λ_t are the power-law parameters of the community size in source and target vertices, respectively.
  • the number of edges between different communities depends on the community fusion parameter p, which is a real number between 0 and 1.
  • the social graph generation schema SGS = (VSS, ESS, CSS), where VSS, ESS, and CSS represent a set of vertex schemas, a set of edge schemas, and a set of community schemas, respectively.
  • the symbols of the generated social graph are as follows:
  • the heterogeneous graph G = (V, E), where V is the vertex set and E ⊆ V × V is the edge set.
  • a vertex v ∈ V is represented as a triple (id, lbl, attr), where id, lbl, and attr stand for the ID, label, and attributes of the vertex, respectively.
  • An edge e ∈ E is represented as (v_s, v_t, lbl, attr), where v_s is the source vertex ID, v_t is the target vertex ID, lbl is the edge label, and attr is the attributes of the edge.
  • the social graph generation method randomly generates an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
  • the probability mass function is derived as follows:
  • d_min and d_max are the minimum degree and maximum degree, respectively.
  • θ indicates the parameters of the degree distribution.
  • the normalization parameter a ensures that the probabilities of all degrees in the given range sum to 1.
  • CDF: cumulative distribution function
  • $G_\theta(z) = \arg\max_{x} F_\theta(x) \le z, \quad x \in [d_{min}, d_{max}],$
  • the procedure GenOutDegree(distr_out, n_s, n_e) is implemented.
  • the parameters of the procedure are: the out-degree distribution distr_out, the number of source vertices n_s, and the expected number of edges n_e.
  • the output of the procedure is the out-degree of a source vertex.
  • the out-degree of a source vertex is calculated as follows.
  • outd_min is the minimum out-degree of source vertices
  • outd_max is the maximum out-degree of source vertices
  • θ_out is the parameter of distr_out
  • the method should adjust the maximum out-degree outd_max so that n_e′ matches n_e. There are three cases as follows.
  • CDF: cumulative distribution function
  • $G_\theta(z) = \arg\max_{x} F_\theta(x) \le z, \quad x \in [outd_{min}, outd_{max}],$
  • the social graph generation method generates a target vertex for a source vertex with a determined out-degree.
  • the degree distribution generation model is used to find two CDF values satisfying $F_s(x_1) \le y \le F_s(x_2)$, and the corresponding target vertex IDs are determined as follows.
  • $H_1(z) = F_s(\arg\max_{x} F_s(x) \le z)$
  • $H_2(z) = F_s(\arg\min_{x} F_s(x) \ge z),$
  • $step = \min_{x \in [ind_{min}, ind_{max}]} (F_s(x+1) - F_s(x)), \quad i \cdot step \le 1.$
  • $x_1 = \arg\max_{x} F_s(x) \le z$
  • $x_2 = \arg\min_{x} F_s(x) \ge z.$
  • the target vertex ID is calculated by
  • Given the number of source vertices, the number of target vertices, and the degree distribution parameters, the social graph generation method based on the degree distribution generation model generates general graphs as follows.
  • the parameters for generation include the number of source vertices n_s, the number of target vertices n_t, the number of expected edges n_e, the in-degree distribution of target vertices distr_in, and the out-degree distribution of source vertices distr_out.
  • we can use an n_s × n_t matrix M to represent the graph.
  • the general graph generation method determines outd target vertices to build edges.
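The per-source loop described above (draw an out-degree, then draw that many target vertices) can be sketched as follows. This is a simplified illustration, not the disclosed algorithm: it assumes power-law distributions, treats the in-degree distribution as popularity weights that decay with the target ID, uses binary-search inverse-transform sampling, rejects duplicate targets, and does not adjust outd_max to hit an expected edge count n_e. All function and parameter names are hypothetical.

```python
import bisect
import random


def make_cdf(weights):
    """Normalize positive weights into a CDF table ending exactly at 1.0."""
    total = float(sum(weights))
    cdf, running = [], 0.0
    for w in weights:
        running += w / total
        cdf.append(running)
    cdf[-1] = 1.0
    return cdf


def generate_general_graph(n_s, n_t, outd_min, outd_max, theta_out, theta_in, seed=0):
    """Return a list of (source ID, target ID) edges; assumes outd_min <= n_t."""
    rng = random.Random(seed)
    outd_max = min(outd_max, n_t)   # a source cannot reach more distinct targets than exist
    out_cdf = make_cdf([d ** -theta_out for d in range(outd_min, outd_max + 1)])
    # Assumption: target popularity follows a power law in the target ID.
    in_cdf = make_cdf([(t + 1) ** -theta_in for t in range(n_t)])
    edges = []
    for u in range(n_s):
        outd = outd_min + bisect.bisect_left(out_cdf, rng.random())
        targets = set()
        while len(targets) < outd:          # redraw on duplicates
            targets.add(bisect.bisect_left(in_cdf, rng.random()))
        edges.extend((u, v) for v in sorted(targets))
    return edges


edges = generate_general_graph(50, 30, 1, 5, 2.0, 1.5)
```

Each edge (u, v) corresponds to a non-zero cell M[u][v] of the n_s × n_t matrix, so the matrix never needs to be materialized.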
  • the social graph generation method based on the degree distribution generation model determines the number of source vertices and the number of target vertices of each community graph and of the graphs among communities. Then, the generator generates simple graphs and combines them into a social graph.
  • is a normalization parameter
  • p is the community fusion parameter user-defined in community schemas and is a real number between 0 and 1
  • outd_max′ = outd_max − d_out^i(u)
  • outd_max is the maximum out-degree of source vertices
  • p(x) is a monotone decreasing function, which means that the probability of having a small d_out^e(u) for a source vertex u is higher, i.e., vertices in two different communities connect sparsely.
  • the larger p is, the higher the probability that a source vertex has a larger d_out^e(u), i.e., there will be more edges in blocks that are not on the main diagonal.
  • For each edge schema ES with its corresponding source vertex schema VS_s, target vertex schema VS_t, and community schema CS, denote the number of generated edges ES.amount as n_e, the number of source vertices VS_s.amount as n_s, the number of target vertices VS_t.amount as n_t, and the number of communities CS.amount as n_c; CS.λ_s and CS.λ_t are the power-law parameters.
  • For a source vertex u, randomly generate an external out-degree d_out^e(u) toward vertices in other communities; then d_out^e(u) target vertices are generated to build edges from the source vertex u to target vertices in other communities.
  • the social graph generation method based on the degree distribution generation model generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
  • the streaming graphs generation process is as follows.
  • the last percentage and the target percentage are initialized to 0 and r_g, respectively.
  • the generation process can be decomposed into a series of sub-processes, each generating a general graph in a non-streaming manner.
  • the number of source vertices and that of target vertices are n_s × pc_tg and n_t × pc_tg, respectively.
  • the out-degree is the difference between the result in this sub-process and that in the last sub-process.
  • the algorithm should make sure that the ID is equal to or less than n_t × pc_tg.
  • This disclosure proposes a social graph generation method using a user-defined schema to satisfy various scenarios.
  • the disclosure proposes a degree distribution generation model to generate random values following a specified distribution efficiently. It is efficient to determine an out-degree and a number of target vertices for a source vertex to generate edges.
  • the vertices in the synthetic graph could represent the users in the real-world network, and the edges could represent the relationships in the network.
  • the generated graphs have the characteristics of real-world social networks, including small world, community structures, and power-law distribution.
  • the synthetic social graphs could be used for social network analysis tasks, such as community detection, community search, and network representation learning.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure proposes a social graph generation method based on a degree distribution generation model, including: setting the social graph generation schema, which is the configuration used to generate graphs; setting the degree distribution generation model, which is used to generate an out-degree and a number of target vertices for a source vertex so that the out-degree and in-degree distributions follow the desired distributions; generating a general graph based on the degree distribution generation model; generating a social graph based on the degree distribution generation model; and generating graphs in a streaming manner. The disclosure determines the out-degree and a target vertex ID for a source vertex efficiently. The vertices in the generated graphs can represent users in social networks, and the edges can represent the relationships in those networks. The synthetic graphs have the characteristics of real-world networks and can be used for social network analysis.

Description

    TECHNICAL FIELD
  • The disclosure herein relates to computer science, especially to a social graph generation method based on a degree distribution model.
    BACKGROUND
  • Social graph generators aim to generate social networks that are as realistic as possible. With the rapid progress of social media, a number of social network analysis tasks have emerged, such as community detection, community search, and network representation learning. Both real-world and synthetic graphs are necessary to evaluate the performance and scalability of the algorithms for these tasks. Social graph generators are therefore becoming increasingly important, especially because different algorithms focus on different features of social graphs.
  • For example, the community detection algorithms using hierarchical clustering or blocking matrices techniques proceed on homogeneous graphs in which there is only one type of nodes and edges. Some community detection algorithms are performed on heterogeneous graphs with multiple aspects of relationships and multiple labels of vertices. In addition, real-world communities can be classified as overlapping and non-overlapping communities, and many social applications encounter the exponential growth in the graph size.
  • However, existing synthetic graph generators cannot satisfy all of the above demands. Some schema-driven methods have been proposed to generate graphs for various domains and applications. These methods, such as gMark, use well-designed schemas to cover features commonly found in graphs, e.g., the labels of vertices and edges. However, most of these methods are not designed for social graphs, since they lack support for generating graphs with community structures. They are also not suitable for generating large-scale graphs.
  • LFR is a widely used benchmark tool for generating social graphs. It constructs communities based on the rule that vertices share more links with other vertices in the same community than with those in other communities. The in-degree of the vertices of the generated graphs conforms to the power-law distribution, but the out-degree does not. In addition, LFR has a limitation on the size of the generated graph due to its high computational overhead when constructing communities.
  • A number of methods have been proposed to generate large-scale synthetic graphs. RMAT and Kronecker are the most widely used among them. RMAT uses a recursive matrix model to recursively select a quadrant of the adjacency matrix until a cell is selected. The procedure repeats until all edges are generated. Kronecker has two graph generation models, i.e., Stochastic Kronecker Graph (SKG) and Deterministic Kronecker Graph (DKG). The widely used SKG is a generalized variant of the recursive matrix model in terms of the number of probability parameters. The space complexity of RMAT is O(|E|), and the time complexity of Kronecker is O(|V|^2). TrillionG proposes a new generation model called the recursive vector model to generate trillion-scale graphs efficiently. However, TrillionG only generates general graphs, i.e., ones without the guarantee of having community structures.
  • Therefore, existing technology needs to be improved.
  • The foregoing background is for purposes of assisting in understanding the present disclosure only and is not intended to admit or recognize that any referenced matter is part of a well-known common sense with respect to the present disclosure.
    SUMMARY
  • To solve the above technical problems, the present disclosure proposes a social graph generation method based on a degree distribution generation model.
  • In an instance of the social graph generation method based on the degree distribution generation model, the generator generates social graphs according to user-defined schema information.
  • Degree distribution generation model. The generator generates the out-degree for a source vertex and a number of target vertices using the degree distribution generation model to make sure the out-degree of source vertices and in-degree of target vertices conform to the desired distribution.
  • Given the number of source vertices and target vertices and the parameters of the degree distributions, the generator generates general graphs.
  • The generator generates social graphs based on the degree distribution generations model. Determine the size of each community. Determine the number of source vertices and target vertices of each community graph and graph among communities, which are generated using the general graph generation algorithm and be combined into a social graph.
  • The generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
  • In another instance of the social graph generation method based on the degree distribution generation model, the schema for the social graph generation is defined as follows.
  • The schema includes vertex schema, edge schema, community schema, and social graph generation schema, whose symbols are defined as follows.
  • The vertex schema VS=(lbl, amount, attr), where amount is the number of vertices labeled lbl, and attr is the attribute information. Attributes of a vertex are represented as a set of key-value pairs.
  • The edge schema ES=(lbl, lbls, lblt, amount, distrin, distrout, attr), where lbls and lblt are respectively the labels of source vertices and target vertices, amount is the number of edges labeled lbl, distrin stands for the in-degree distribution of target vertices, and distrout stands for the out-degree distribution of source vertices. The attribute information attr of an edge is a set of key-value pairs.
  • The community schema CS=(lble, amount, λs, λt, ρ), where lble is the label of edges in a community, and amount is the number of communities. The community size conforms to a power-law distribution; λs and λt are the power-law parameters of the community size in source and target vertices, respectively. The number of edges between different communities depends on the community fusion parameter ρ, which is a real number between 0 and 1.
  • The social graph generation schema SGS=(VSS, ESS, CSS), where VSS, ESS, and CSS represent a set of vertex schemas, a set of edge schemas, and a set of community schemas, respectively.
  • The symbols of the generated social graph are as follows: the heterogeneous graph G=(V,E), where V is the vertex set, and E⊆V×V is the edge set.
  • A vertex v∈V is represented as a triple (id, lbl, attr), where id, lbl, and attr stand for ID, label, and attributes of the vertex, respectively.
  • An edge e∈E is represented as (vs, vt, lbl, attr), where vs is the source vertex ID, vt is the target vertex ID, lbl is the edge label, and attr is the attributes of the edge. The vs, vt, and lbl uniquely identify an edge e.
  • The attribute information of nodes and edges is optional. For example, in a Peking Opera scenario, users, famous artists, Peking Opera vocal pieces, and the comments posted by users can be represented as nodes, while the facts that a famous artist belongs to a certain genre or participates in a certain vocal piece, the following relationship between users, and the interest of a user in a certain artist can be represented as edges. Among them, the schema information of the famous-artist node includes (famous-artist lbl, amount, attribute information).
  • In another instance of the social graph generation method based on the degree distribution generation model, the community fusion parameter ρ is a real number between 0 and 1. Larger ρ values mean that there will be more edges among communities.
  • In another instance of the social graph generation method based on the degree distribution generation model, the social graph generation method randomly generates an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
  • The probability mass function is derived as follows:
  • $$p(x) = \begin{cases} \alpha P(D = x; \theta) & \text{if } x \in [d_{min}, d_{max}] \text{ and } x \in \mathbb{N}^{+} \\ 0 & \text{otherwise;} \end{cases}$$
  • where $d_{min}$ and $d_{max}$ are the minimum degree and maximum degree, respectively, and $\theta$ denotes the parameters of the degree distribution. $P(D = x; \theta)$ is the existence probability of vertices with degree $D = x$. The normalization parameter $\alpha$ makes the probabilities of the degrees in $[d_{min}, d_{max}]$ sum to 1.
  • Given the out-degree distribution distrout, the number of source vertices ns and the expected number of edges ne, the out-degree of a source vertex is calculated as follows.
  • The number of edges $n_e'$ obtained when the out-degree of source vertices follows the distrout distribution is:
  • $$n_e' = \sum_{x = outd_{min}}^{outd_{max}} x \cdot n_s \cdot \alpha \cdot P(D = x; \theta_{out}),$$
  • where outdmin is the minimum out-degree of source vertices, outdmax is the maximum out-degree of source vertices, and θout is the parameter of distrout .
  • Adjust the maximum out-degree $outd_{max}$ so that the resulting number of edges $n_e'$ matches the expected number of edges $n_e$.
  • The cumulative distribution function (CDF) is $F(x) = \sum_{i = outd_{min}}^{x} \alpha P(D = i; \theta_{out})$, where $x \in [outd_{min}, outd_{max}]$.
  • To generate a random number following a specified distribution whose CDF is F(x), we first generate a uniformly distributed value on [0,1] denoted as y, and then F−1(y) is the generated number.
  • Design a new function
  • $$G(z) = \arg\max_{x} F(x) \le z, \quad x \in [outd_{min}, outd_{max}],$$
  • where $z \in \{i \cdot step \mid i \in \mathbb{N}^{+},\ step = \min P(D = x),\ i \cdot step \le 1\}$.
  • Given a uniformly distributed random value y on [0,1], we can obtain $F^{-1}(y)$ directly from
  • $$G\!\left(\left\lfloor \frac{y}{step} \right\rfloor \cdot step\right).$$
  • In another instance of the social graph generation method based on the degree distribution generation model, the method of adjusting the maximum out-degree outdmax to make the number of existing edges ne′ match the number of expected edges ne is as follows.
  • If $n_e' < n_e$, increase $outd_{max}$ until the number of vertices with out-degree $outd_{max}$ is less than 1, or $n_e' > n_e$.
  • If $n_e' = n_e$, there is no need to adjust $outd_{max}$.
  • If $n_e' > n_e$, reduce $outd_{max}$ to make $n_e' < n_e$.
  • In another instance of the social graph generation method based on the degree distribution generation model, the generator generates a target vertex for a source vertex with a determined out-degree.
  • Given the in-degree distribution distrin, the number of target vertices nt, and the expected number of edges ne, compute the target vertex ID so that the in-degree distribution conforms to the expected distribution.
  • Define an additional cumulative distribution function of the sum of in-degrees: $F_s(x) = \sum_{i = ind_{min}}^{x} \beta \cdot i \cdot \alpha \cdot P(D = i; \theta_{in})$, where $x \in [ind_{min}, ind_{max}]$ and $\beta$ is a normalization parameter given by
  • $$\beta = \frac{1}{\sum_{i = ind_{min}}^{ind_{max}} i \cdot \alpha \cdot P(D = i; \theta_{in})}.$$
  • Two auxiliary functions that map certain values on [0,1] to CDF values are defined as follows:
  • $$H_1(z) = F_s\!\left(\arg\max_{x} F_s(x) \le z\right), \quad H_2(z) = F_s\!\left(\arg\min_{x} F_s(x) \ge z\right),$$
  • where x is the in-degree, $z \in \{i \cdot step \mid i \in \mathbb{N}^{+}\}$, $step = \min_{x \in [ind_{min}, ind_{max}]} \left(F_s(x+1) - F_s(x)\right)$, and $i \cdot step \le 1$.
  • To find the corresponding target vertex IDs, another two functions are defined as follows:
  • $$G_1(z) = \sum_{i = ind_{min}}^{x_1} i \cdot \alpha \cdot P(D = i; \theta_{in}), \quad G_2(z) = \sum_{i = ind_{min}}^{x_2} i \cdot \alpha \cdot P(D = i; \theta_{in}),$$
  • where $x_1 = \arg\max_{x} F_s(x) \le z$ and $x_2 = \arg\min_{x} F_s(x) \ge z$.
  • The target vertex ID is calculated by
  • $$G_1(z) + (y - H_1(z)) \times \frac{G_2(z) - G_1(z)}{H_2(z) - H_1(z)}.$$
  • In another instance of the social graph generation method based on the degree distribution generation model, given the number of source vertices, the number of target vertices, and the degree distribution parameters, the generation process of general graphs is as follows.
  • The parameters for generation include the number of source vertices ns, the number of target vertices nt, the number of expected edges ne, the in-degree distribution of target vertices distrin, and the out-degree distribution of source vertices distrout.
  • We can use an ns ×nt matrix M to represent the graph. Mij =1 means that there exists an edge from a source vertex vi to a target vertex vj, and Mij=0 implies that there is no such edge.
  • The general graph generation method determines outd target vertices to build edges.
  • In another instance of the social graph generation method based on the degree distribution generation model, the number of source vertices and the number of target vertices are determined for each community graph and for each graph among communities. Then, the generator generates simple graphs and combines them into a social graph.
  • Given a social graph generation schema S, let $d_{out}(u)$ be the out-degree of vertex u, $d_{out}^{i}(u)$ the out-degree of u with vertices inside the same community, and $d_{out}^{e}(u) = d_{out}(u) - d_{out}^{i}(u)$ the out-degree of u with vertices in other communities. A probability density function (PDF) for $d_{out}^{e}(u)$ of vertex u is defined as follows:
  • $$p(x) = \begin{cases} \alpha e^{-\frac{x}{1+\rho}} & \text{if } x \in [1, outd_{max}'] \\ 0 & \text{otherwise,} \end{cases}$$
  • where $\alpha$ is a normalization parameter, $\rho$ is the user-defined community fusion parameter from the community schema and is a real number between 0 and 1, $outd_{max}' = outd_{max} - d_{out}^{i}(u)$, and $outd_{max}$ is the maximum out-degree of source vertices. $p(x)$ is a monotonically decreasing function.
  • Treating the out-degree random variable $d_{out}^{e}(u)$ as a continuous variable, the following equation holds by the normalization property of a PDF:
  • $$\int_{1}^{outd_{max}'} \alpha e^{-\frac{x}{1+\rho}} \, dx = 1.$$
  • For a source vertex u, the out-degree with vertices in other communities is
  • $$d_{out}^{e}(u) = -(1+\rho) \log\!\left(e^{-\frac{1}{1+\rho}} + y \left(e^{-\frac{outd_{max}'}{1+\rho}} - e^{-\frac{1}{1+\rho}}\right)\right),$$
  • where y is a real number drawn from the uniform distribution U(0,1) and the following equation holds between y and the target external out-degree $d_{out}^{e}(u)$:
  • $$\int_{1}^{d_{out}^{e}(u)} \alpha e^{-\frac{x}{1+\rho}} \, dx = y.$$
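  • As a worked check, the closed form for the external out-degree follows from evaluating the normalization integral and the integral for y explicitly. The derivation below uses only the symbols defined above; the shorthands $c = 1 + \rho$, $M = outd_{max}'$, and $d = d_{out}^{e}(u)$ are introduced here purely for brevity.

```latex
\begin{aligned}
\int_{1}^{M} \alpha e^{-x/c}\,dx = 1
  &\;\Rightarrow\; \alpha c \left(e^{-1/c} - e^{-M/c}\right) = 1, \\
\int_{1}^{d} \alpha e^{-x/c}\,dx = y
  &\;\Rightarrow\; \alpha c \left(e^{-1/c} - e^{-d/c}\right) = y, \\
\text{dividing the second by the first:}\quad
e^{-d/c} &= e^{-1/c} + y \left(e^{-M/c} - e^{-1/c}\right), \\
d &= -c \,\log\!\left(e^{-1/c} + y \left(e^{-M/c} - e^{-1/c}\right)\right).
\end{aligned}
```

  • Setting y = 0 gives d = 1 and setting y = 1 gives d = M, so the generated value always lies in the support of p(x).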
  • For each edge schema ES and the corresponding source vertex schema $VS_s$, target vertex schema $VS_t$, and community schema CS, denote the number of generated edges ES.amount as $n_e$, the number of source vertices $VS_s$.amount as $n_s$, the number of target vertices $VS_t$.amount as $n_t$, and the number of communities CS.amount as $n_c$; $CS.\lambda_s$ and $CS.\lambda_t$ are the power-law parameters.
  • Determine the size of each community according to the number of communities CS.amount and the power-law parameters $CS.\lambda_s$ and $CS.\lambda_t$ so that the community size conforms to a power-law distribution. Denote the sizes of the $n_c$ communities as:
  • $$n_s^{1} \times n_t^{1}, \; \ldots, \; n_s^{n_c} \times n_t^{n_c}.$$
  • For a source vertex u, generate an external out-degree dout e(u) with vertices in other communities randomly, and then dout e(u) target vertices are generated to build edges from the source vertex u to target vertices in other communities.
  • In another instance of social graph generation method based on the degree distribution generation model, the generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
  • Given the number of source vertices ns, the number of target vertices nt, the number of expected edges ne, the in-degree distribution of target vertices distrin, the out-degree distribution of source vertices distrout, and the growing rate rg, which is a real number in the interval [0,1], the streaming graph generation process is as follows.
  • The last percentage and the target percentage are initialized to 0 and rg, respectively.
  • The generation process can be decomposed into a series of sub-processes, each of which generates a general graph in a non-streaming manner. In each generation sub-process, the number of source vertices and that of target vertices are ns·pctg and nt·pctg, respectively, where pctg is the target percentage of that sub-process. For an existing source vertex, the out-degree is the difference between the result in this sub-process and that in the last sub-process. For a new source vertex, determine an out-degree directly.
  • When generating a target vertex, the algorithm should make sure that the ID is equal to or less than nt ·pctg.
  • The disclosure has the following advantages compared with the existing techniques.
  • The disclosure proposes a social graph generation method based on the degree distribution generation model. The model can generate a random value following a given distribution in O(1) time. Thus, we can use this model to determine an out-degree and a number of target vertices for a source vertex to generate edges. The generated social graphs have the characteristics of real-world social graphs. The generator uses user-defined configurations to generate graphs, which makes it widely applicable. The generation method is efficient and scalable, and is suitable for generating trillion-scale graphs.
  • BRIEF DESCRIPTION OF FIGURES
  • The accompanying drawings, which form a part of the specification, describe embodiments of the present disclosure and together with the description serve to explain the principles of the present disclosure.
  • The present disclosure will be more clearly understood from the following detailed description, with reference to the accompanying drawings, in which:
  • FIG. 1 is a flow diagram of a social networking graph generation method based on a degree distribution generation model provided by the invention.
  • DETAILED DESCRIPTION
  • While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
  • With the rapid progress of social media, networks with billions of nodes are becoming more and more common in real-world applications. Such complex social networks can be represented as graphs, where users are represented as nodes and interactions among users, such as following, commenting, and liking, are represented as edges. Many social network analysis tasks have emerged to assist in practical applications. For example, community detection algorithms detect the structure in the network, and the structural information can assist in risk control tasks and user recommendation tasks. To verify the effectiveness and scalability of social network analysis algorithms, synthetic datasets are needed because of the high cost of extracting networks from actual applications. Thus, it is necessary to generate social graphs efficiently.
  • The disclosure proposes a social graph generation method based on the degree distribution generation model. The method uses a user-defined schema to generate social graphs, which can meet the needs of various application scenarios. The efficient and scalable generation method is suitable for generating large-scale graphs.
  • A social networking graph generation method based on a degree distribution generation model provided by the present disclosure is described in more detail below in conjunction with the accompanying drawings and embodiments.
  • The social graph generation method based on the degree distribution generation model constructs a graph by operating on its matrix representation. For each source vertex, an out-degree is generated and then a number of target vertices are determined to generate edges.
  • We propose a new generation model called the degree distribution generation model to accelerate the generation process. The time complexity of determining the out-degree and a target vertex for a source vertex is O(1). Therefore, the degree distribution generation model is suitable for generating large-scale graphs. Moreover, the model is a general model, which means that we can use it to generate graphs with a specified degree distribution as long as the probability density function or the probability mass function is given.
  • FIG. 1 is a flowchart of a social networking graph generation method based on a degree distribution generation model provided by the present disclosure. As shown in FIG. 1, the social networking graph generation method based on the degree distribution generation model includes the following steps.
  • 10, Set the schema information used to generate social graphs: the generator generates social graphs according to user-defined schema information.
  • 20, Set the degree distribution generation model: the social graph generation method uses the degree distribution generation model to randomly generate an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
  • 30, Given the number of source vertices, the number of target vertices, and the parameters of a given distribution, the generator generates general graphs based on the degree distribution generation model.
  • 40, The social graph generation method based on the degree distribution generation model determines the number of source vertices and the number of target vertices of each community graph and of each graph among communities; the method then generates simple graphs and combines them into a social graph.
  • 50, The social graph generation method based on the degree distribution generation model generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage with those in the last generation stage to determine the new vertices and edges; it then generates a simple graph in each generation stage.
  • The schema for the social graph generation is defined as follows.
  • The schema includes vertex schema, edge schema, community schema, and social graph generation schema, whose symbols are defined as follows.
  • The vertex schema VS=(lbl, amount, attr), where amount is the number of vertices labeled lbl, and attr is the attribute information. Attributes of a vertex are represented as a set of key-value pairs.
  • The edge schema ES=(lbl, lbls, lblt, amount, distrin, distrout, attr), where lbls and lblt are respectively the labels of source vertices and target vertices, amount is the number of edges labeled lbl, distrin stands for the in-degree distribution of target vertices, and distrout stands for the out-degree distribution of source vertices. The attribute information attr of an edge is a set of key-value pairs.
  • The community schema CS=(lble, amount, λs, λt, ρ), where lble is the label of edges in a community, and amount is the number of communities. The community size conforms to a power-law distribution; λs and λt are the power-law parameters of the community size in source and target vertices, respectively. The number of edges between different communities depends on the community fusion parameter ρ, which is a real number between 0 and 1.
  • The social graph generation schema SGS=(VSS, ESS, CSS), where VSS, ESS, and CSS represent a set of vertex schemas, a set of edge schemas, and a set of community schemas, respectively.
  • The symbols of the generated social graph are as follows: the heterogeneous graph G=(V,E), where V is the vertex set, and E⊆V×V is the edge set.
  • A vertex v∈V is represented as a triple (id, lbl, attr), where id, lbl, and attr stand for ID, label, and attributes of the vertex, respectively.
  • An edge e∈E is represented as (vs, vt, lbl, attr), where vs is the source vertex ID, vt is the target vertex ID, lbl is the edge label, and attr is the attributes of the edge. The vs, vt, and lbl uniquely identify an edge e.
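  • The schema tuples above map naturally onto plain record types. The following is a minimal sketch, not the disclosed implementation: the class names, the Python field spellings (`lbl_s`, `lbl_t`, `lbl_e`, `rho`), and the choice to hold each degree distribution as a dictionary of parameters are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class VertexSchema:
    """VS = (lbl, amount, attr)."""
    lbl: str
    amount: int
    attr: Dict[str, Any] = field(default_factory=dict)


@dataclass
class EdgeSchema:
    """ES = (lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr)."""
    lbl: str
    lbl_s: str                      # label of source vertices
    lbl_t: str                      # label of target vertices
    amount: int                     # number of edges labeled lbl
    distr_in: Dict[str, Any]        # in-degree distribution (assumed dict form)
    distr_out: Dict[str, Any]       # out-degree distribution (assumed dict form)
    attr: Dict[str, Any] = field(default_factory=dict)


@dataclass
class CommunitySchema:
    """CS = (lbl_e, amount, lambda_s, lambda_t, rho)."""
    lbl_e: str
    amount: int
    lambda_s: float
    lambda_t: float
    rho: float                      # community fusion parameter in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.rho <= 1.0:
            raise ValueError("community fusion parameter must lie in [0, 1]")


@dataclass
class SocialGraphSchema:
    """SGS = (VSS, ESS, CSS)."""
    vss: List[VertexSchema]
    ess: List[EdgeSchema]
    css: List[CommunitySchema]


# Hypothetical example: users following users, grouped into 50 communities.
sgs = SocialGraphSchema(
    vss=[VertexSchema("user", 10_000)],
    ess=[EdgeSchema("follows", "user", "user", 80_000,
                    distr_in={"type": "power_law", "theta": 2.0},
                    distr_out={"type": "power_law", "theta": 2.2})],
    css=[CommunitySchema("follows", 50, 2.0, 2.0, 0.3)],
)
```

  • Validating ρ at construction time keeps an out-of-range fusion parameter from reaching the generator.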
  • The social graph generation method randomly generates an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and out-degree distribution conform to the desired distributions.
  • The probability mass function is derived as follows:
  • $$p(x) = \begin{cases} \alpha P(D = x; \theta) & \text{if } x \in [d_{min}, d_{max}] \text{ and } x \in \mathbb{N}^{+} \\ 0 & \text{otherwise;} \end{cases}$$
  • where $d_{min}$ and $d_{max}$ are the minimum degree and maximum degree, respectively, and $\theta$ denotes the parameters of the degree distribution. $P(D = x; \theta)$ is the existence probability of vertices with degree $D = x$. The normalization parameter $\alpha$ makes the probabilities of the degrees in $[d_{min}, d_{max}]$ sum to 1.
  • The formula for computing $\alpha$ is as follows:
  • $$\alpha = \frac{1}{\sum_{x = d_{min}}^{d_{max}} P(D = x; \theta)}.$$
  • The cumulative distribution function (CDF) is as follows: $F(x) = \sum_{i = d_{min}}^{x} \alpha P(D = i; \theta)$, where $x \in [d_{min}, d_{max}]$.
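  • The normalization and the CDF can be tabulated once per distribution. The sketch below assumes a power-law form $P(D = x; \theta) \propto x^{-\theta}$ for concreteness; the disclosure leaves P generic, and the function names are hypothetical.

```python
from itertools import accumulate


def truncated_power_law_pmf(d_min, d_max, theta):
    """Return (alpha, pmf) with pmf[k] = alpha * P(D = d_min + k; theta).

    Assumption: the unnormalized probability is x ** (-theta).
    """
    weights = [float(x) ** (-theta) for x in range(d_min, d_max + 1)]
    alpha = 1.0 / sum(weights)          # alpha = 1 / sum_x P(D = x; theta)
    return alpha, [alpha * w for w in weights]


def cdf_from_pmf(pmf):
    """F(x) = sum_{i <= x} alpha * P(D = i; theta), as a list aligned with pmf."""
    cdf = list(accumulate(pmf))
    cdf[-1] = 1.0                       # guard against floating-point drift
    return cdf


alpha, pmf = truncated_power_law_pmf(1, 100, 2.0)
cdf = cdf_from_pmf(pmf)
```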
  • To generate a random value conforming to a desired cumulative distribution function F(x), we first generate a uniformly distributed random value y on [0,1], and then $F^{-1}(y)$ is the random value derived from the CDF F(x). To compute $F^{-1}(y)$ efficiently, a new function G is designed as follows:
  • $$G(z) = \arg\max_{x} F(x) \le z, \quad x \in [d_{min}, d_{max}],$$
  • where $z \in \{i \cdot step \mid i \in \mathbb{N}^{+},\ step = \min P(D = x),\ i \cdot step \le 1\}$. Given a uniformly distributed random value y on [0,1], we can obtain $F^{-1}(y)$ directly from
  • $$G\!\left(\left\lfloor \frac{y}{step} \right\rfloor \cdot step\right).$$
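  • One way to realize a constant-time lookup in the spirit of G is a guide table indexed by ⌊y/step⌋. The sketch below is a variant, not the disclosed function itself: each table cell stores the smallest degree whose CDF exceeds the cell's left edge, and one extra comparison is added so that the result equals the exact inverse CDF. That comparison and the function names are assumptions. Because step is no larger than any single probability, each cell contains at most one CDF breakpoint, so the lookup takes constant time.

```python
import math


def build_guide_table(cdf, step):
    """table[i] = index k of the smallest degree with cdf[k] > i * step."""
    size = int(math.floor(1.0 / step)) + 1
    table, k = [], 0
    for i in range(size):
        z = i * step
        while k < len(cdf) - 1 and cdf[k] <= z:
            k += 1
        table.append(k)
    return table


def sample_degree(y, d_min, cdf, step, table):
    """Map a uniform value y in [0, 1) to a degree in O(1) time."""
    i = min(int(y / step), len(table) - 1)
    k = table[i]
    # At most one CDF breakpoint lies inside a cell, so one check suffices.
    if k < len(cdf) - 1 and y >= cdf[k]:
        k += 1
    return d_min + k


# Toy distribution on degrees 1..3 with probabilities 0.5, 0.3, 0.2.
toy_cdf = [0.5, 0.8, 1.0]
toy_step = 0.2                      # the smallest single probability
toy_table = build_guide_table(toy_cdf, toy_step)
```

  • The table has about 1/step entries, so the memory cost grows as the smallest probability shrinks.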
  • According to the degree distribution generation model, the procedure GenOutDegree(distrout, ns, ne) is implemented. The parameters of the procedure are: the out-degree distribution distrout, the number of source vertices ns and the expected number of edges ne. The output of the procedure is the out-degree of a source vertex.
  • Given the out-degree distribution distrout, the number of source vertices ns and the expected number of edges ne, the out-degree of a source vertex is calculated as follows.
  • The number of edges $n_e'$ obtained when the out-degree of source vertices follows the distrout distribution is:
  • $$n_e' = \sum_{x = outd_{min}}^{outd_{max}} x \cdot n_s \cdot \alpha \cdot P(D = x; \theta_{out}),$$
  • where $outd_{min}$ is the minimum out-degree of source vertices, $outd_{max}$ is the maximum out-degree of source vertices, and $\theta_{out}$ is the parameter of distrout.
  • If the number of the expected edges ne=−1, there is no need to adjust parameters.
  • Otherwise, the method should adjust the maximum out-degree $outd_{max}$ to make $n_e'$ match $n_e$. There are three cases as follows.
  • If $n_e' < n_e$, increase $outd_{max}$ until the number of vertices with out-degree $outd_{max}$ is less than 1, or $n_e' > n_e$.
  • If $n_e' = n_e$, there is no need to adjust $outd_{max}$.
  • If $n_e' > n_e$, reduce $outd_{max}$ to make $n_e' < n_e$.
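  • The three cases can be sketched as a simple search over the maximum out-degree. This is an illustration under stated assumptions: the out-degree distribution is taken to be a power law $x^{-\theta}$, the function names are hypothetical, and the loops stop as soon as the expected edge count reaches or crosses the target, which treats the boundary case of equality as acceptable.

```python
def pmf(d_min, d_max, theta):
    """Normalized probabilities alpha * P(D = x) for x in [d_min, d_max]."""
    weights = [float(x) ** (-theta) for x in range(d_min, d_max + 1)]
    alpha = 1.0 / sum(weights)
    return [alpha * w for w in weights]


def expected_edges(n_s, d_min, d_max, theta):
    """n_e' = sum_x x * n_s * alpha * P(D = x; theta_out)."""
    probs = pmf(d_min, d_max, theta)
    return sum(x * n_s * p for x, p in zip(range(d_min, d_max + 1), probs))


def adjust_outd_max(n_s, n_e, d_min, d_max, theta):
    """Return an adjusted maximum out-degree; n_e == -1 means no adjustment."""
    if n_e == -1:
        return d_max
    current = expected_edges(n_s, d_min, d_max, theta)
    if current < n_e:
        # Grow until the target is reached or fewer than one vertex would
        # be expected to have out-degree d_max.
        while current < n_e and n_s * pmf(d_min, d_max, theta)[-1] >= 1:
            d_max += 1
            current = expected_edges(n_s, d_min, d_max, theta)
    elif current > n_e:
        # Shrink until the expected edge count no longer exceeds the target.
        while current > n_e and d_max > d_min:
            d_max -= 1
            current = expected_edges(n_s, d_min, d_max, theta)
    return d_max
```

  • Recomputing the sum at every step keeps the sketch short; an incremental update of the sum would avoid the repeated work for large degree ranges.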
  • The cumulative distribution function (CDF) is $F(x) = \sum_{i = outd_{min}}^{x} \alpha P(D = i; \theta_{out})$, where $x \in [outd_{min}, outd_{max}]$.
  • To generate a random number following a specified distribution whose CDF is F(x), we first generate a uniformly distributed value on [0,1] denoted as y, and then $F^{-1}(y)$ is the generated number.
  • Design a new function
  • $$G(z) = \arg\max_{x} F(x) \le z, \quad x \in [outd_{min}, outd_{max}],$$
  • where $z \in \{i \cdot step \mid i \in \mathbb{N}^{+},\ step = \min P(D = x),\ i \cdot step \le 1\}$.
  • Given a uniformly distributed random value y on [0,1], we can obtain $F^{-1}(y)$ directly from
  • $$G\!\left(\left\lfloor \frac{y}{step} \right\rfloor \cdot step\right).$$
  • The social graph generation method generates a target vertex for a source vertex with a determined out-degree.
  • Given the in-degree distribution distrin, the number of target vertices nt, and the expected number of edges ne, compute a target vertex ID so that the in-degree distribution conforms to the expected distribution.
  • We impose a constraint on the relationship between the in-degrees of vertices and their IDs. Given a series of target vertices $v_1, v_2, \ldots, v_{n_t}$, the in-degrees of these vertices are nondecreasing. This constraint is reasonable because we can generate a permutation of [1, nt] as a mapping function to change the original IDs of target vertices so that there is no apparent relationship between the IDs and in-degrees of vertices.
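  • The remapping step mentioned above can be sketched as a seeded random permutation of the ID range. The function name and the use of a dictionary are assumptions for illustration; at very large scale a materialized mapping may not be practical, and this sketch does not address that case.

```python
import random


def make_id_mapping(n_t, seed=None):
    """Return a dict that maps each original target ID in [1, n_t] to a
    distinct new ID in [1, n_t], hiding the ID/in-degree ordering."""
    new_ids = list(range(1, n_t + 1))
    random.Random(seed).shuffle(new_ids)
    return dict(zip(range(1, n_t + 1), new_ids))


mapping = make_id_mapping(10, seed=7)
```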
  • Define an additional cumulative distribution function of the sum of in-degrees: $F_s(x) = \sum_{i = ind_{min}}^{x} \beta \cdot i \cdot \alpha \cdot P(D = i; \theta_{in})$, where $x \in [ind_{min}, ind_{max}]$ and $\beta$ is a normalization parameter given by
  • $$\beta = \frac{1}{\sum_{i = ind_{min}}^{ind_{max}} i \cdot \alpha \cdot P(D = i; \theta_{in})}.$$
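  • The degree-sum CDF weights each in-degree by the share of edge endpoints it accounts for. A minimal sketch follows, assuming the normalized probabilities $\alpha P(D = i; \theta_{in})$ are already available as a list; the function name is hypothetical.

```python
def degree_sum_cdf(ind_min, pmf):
    """Return (beta, fs) where fs[k] = F_s(ind_min + k).

    pmf[k] is assumed to hold alpha * P(D = ind_min + k; theta_in).
    """
    weighted = [(ind_min + k) * p for k, p in enumerate(pmf)]
    beta = 1.0 / sum(weighted)          # beta = 1 / sum_i i * alpha * P(D = i)
    fs, total = [], 0.0
    for w in weighted:
        total += beta * w
        fs.append(total)
    fs[-1] = 1.0                        # guard against floating-point drift
    return beta, fs


# Toy in-degree distribution on degrees 1..3.
beta, fs = degree_sum_cdf(1, [0.5, 0.3, 0.2])
```

  • In the toy case the degree-weighted masses are 0.5, 0.6, and 0.6, so higher in-degrees take a larger share of F_s than of the plain CDF.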
  • Given a uniformly distributed random value y on [0,1], the degree distribution generation model is first used to find two cumulative distribution function values $F_s(x_1)$ and $F_s(x_2)$ satisfying $F_s(x_1) \le y \le F_s(x_2)$ and $x_1 + 1 \ge x_2$; the corresponding target vertex IDs are then determined as follows.
  • Two auxiliary functions that map certain values on [0,1] to CDF values are defined as follows:
  • $$H_1(z) = F_s\!\left(\arg\max_{x} F_s(x) \le z\right), \quad H_2(z) = F_s\!\left(\arg\min_{x} F_s(x) \ge z\right),$$
  • where x is the in-degree, $z \in \{i \cdot step \mid i \in \mathbb{N}^{+}\}$, $step = \min_{x \in [ind_{min}, ind_{max}]} \left(F_s(x+1) - F_s(x)\right)$, and $i \cdot step \le 1$.
  • To find the corresponding target vertex IDs, another two functions are defined as follows:
  • $$G_1(z) = \sum_{i = ind_{min}}^{x_1} i \cdot \alpha \cdot P(D = i; \theta_{in}), \quad G_2(z) = \sum_{i = ind_{min}}^{x_2} i \cdot \alpha \cdot P(D = i; \theta_{in}),$$
  • where $x_1 = \arg\max_{x} F_s(x) \le z$ and $x_2 = \arg\min_{x} F_s(x) \ge z$.
  • The target vertex ID is calculated by
  • $$G_1(z) + (y - H_1(z)) \times \frac{G_2(z) - G_1(z)}{H_2(z) - H_1(z)}.$$
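  • The final expression is a linear interpolation between the two bracketing points $(H_1, G_1)$ and $(H_2, G_2)$. The sketch below shows only that interpolation step; the function name is hypothetical, the guard for $H_1 = H_2$ is an assumption, and any rounding of the result to an integer ID is left to the caller.

```python
def interpolate_target_id(y, h1, h2, g1, g2):
    """Return g1 + (y - h1) * (g2 - g1) / (h2 - h1).

    When h1 == h2 the bracket has collapsed to one point, so g1 is
    returned (an assumption; the text does not cover this case).
    """
    if h2 == h1:
        return g1
    return g1 + (y - h1) * (g2 - g1) / (h2 - h1)
```

  • For example, with $H_1 = 0.2$, $H_2 = 0.4$, $G_1 = 10$, and $G_2 = 30$, a value of y halfway between $H_1$ and $H_2$ lands halfway between the two IDs.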
  • Given the number of source vertices, the number of target vertices, and the degree distribution parameters, the social graph generation method based on the degree distribution generation model generates general graphs as follows.
  • The parameters for generation include the number of source vertices ns, the number of target vertices nt, the number of expected edges ne, the in-degree distribution of target vertices distrin, and the out-degree distribution of source vertices distrout.
  • We can use an ns × nt matrix M to represent the graph. Mij=1 means that there exists an edge from a source vertex vi to a target vertex vj, and Mij=0 implies that there is no such edge.
  • The general graph generation method determines outd target vertices to build edges.
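  • The general graph loop can be sketched as follows, with the nonzero cells of M held as a set of (source, target) pairs. The two sampler callables stand in for the out-degree and target-vertex procedures described above; their names, the de-duplication of repeated targets, and the cap of the out-degree at nt are assumptions of this sketch. It also assumes the target sampler can return at least as many distinct IDs as the out-degree requires, since otherwise the inner loop would not end.

```python
import random


def generate_general_graph(n_s, n_t, sample_out_degree, sample_target):
    """Return the set of edges (s, t), i.e. the cells with M[s][t] = 1."""
    edges = set()
    for s in range(1, n_s + 1):
        # A source cannot have more distinct targets than exist.
        outd = min(sample_out_degree(), n_t)
        targets = set()
        while len(targets) < outd:
            targets.add(sample_target())    # repeated targets are discarded
        edges.update((s, t) for t in targets)
    return edges


# Hypothetical usage with uniform stand-in samplers.
rng = random.Random(0)
demo_edges = generate_general_graph(
    n_s=5, n_t=8,
    sample_out_degree=lambda: rng.randint(1, 3),
    sample_target=lambda: rng.randint(1, 8),
)
```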
  • The social graph generation method based on the degree distribution generation model determines the number of source vertices and the number of target vertices of each community graph and of each graph among communities. Then, the generator generates simple graphs and combines them into a social graph.
  • Given a social graph generation schema S, let $d_{out}(u)$ be the out-degree of vertex u, $d_{out}^{i}(u)$ the out-degree of u with vertices inside the same community, and $d_{out}^{e}(u) = d_{out}(u) - d_{out}^{i}(u)$ the out-degree of u with vertices in other communities. A probability density function (PDF) for $d_{out}^{e}(u)$ of vertex u is defined as follows:
  • $$p(x) = \begin{cases} \alpha e^{-\frac{x}{1+\rho}} & \text{if } x \in [1, outd_{max}'] \\ 0 & \text{otherwise,} \end{cases}$$
  • where $\alpha$ is a normalization parameter, $\rho$ is the user-defined community fusion parameter from the community schema and is a real number between 0 and 1, $outd_{max}' = outd_{max} - d_{out}^{i}(u)$, and $outd_{max}$ is the maximum out-degree of source vertices. $p(x)$ is a monotonically decreasing function, which means that a source vertex u is more likely to have a small $d_{out}^{e}(u)$, i.e., vertices in two different communities are sparsely connected. The larger $\rho$ is, the higher the probability that a source vertex has a larger $d_{out}^{e}(u)$, i.e., there will be more edges in blocks that are not on the main diagonal.
  • Treating the out-degree random variable $d_{out}^{e}(u)$ as a continuous variable, the following equation holds by the normalization property of a PDF:
  • $$\int_{1}^{outd_{max}'} \alpha e^{-\frac{x}{1+\rho}} \, dx = 1.$$
  • For a source vertex u, the out-degree with vertices in other communities is
  • $$d_{out}^{e}(u) = -(1+\rho) \log\!\left(e^{-\frac{1}{1+\rho}} + y \left(e^{-\frac{outd_{max}'}{1+\rho}} - e^{-\frac{1}{1+\rho}}\right)\right),$$
  • where y is a real number drawn from the uniform distribution U(0,1) and the following equation holds between y and the target external out-degree $d_{out}^{e}(u)$:
  • $$\int_{1}^{d_{out}^{e}(u)} \alpha e^{-\frac{x}{1+\rho}} \, dx = y.$$
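  • The closed form above can be evaluated directly. The sketch below is a straightforward transcription; the function name is hypothetical and the result is returned as a real number, with any rounding to an integer degree left to the caller as an assumption.

```python
import math


def external_out_degree(y, rho, outd_max_prime):
    """Inverse-CDF sample of the external out-degree for uniform y in [0, 1]."""
    c = 1.0 + rho
    low = math.exp(-1.0 / c)                 # e^{-1/(1+rho)}
    high = math.exp(-outd_max_prime / c)     # e^{-outd_max'/(1+rho)}
    return -c * math.log(low + y * (high - low))
```

  • At y = 0 the function returns 1 and at y = 1 it returns outd_max', matching the support [1, outd_max'] of p(x); for a fixed y, a larger ρ yields a larger external out-degree.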
  • For each edge schema ES and the corresponding source vertex schema $VS_s$, target vertex schema $VS_t$, and community schema CS, denote the number of generated edges ES.amount as $n_e$, the number of source vertices $VS_s$.amount as $n_s$, the number of target vertices $VS_t$.amount as $n_t$, and the number of communities CS.amount as $n_c$; $CS.\lambda_s$ and $CS.\lambda_t$ are the power-law parameters.
  • Determine the size of each community according to the number of communities CS.amount and the power-law parameters $CS.\lambda_s$ and $CS.\lambda_t$ so that the community size conforms to a power-law distribution. Denote the sizes of the $n_c$ communities as:
  • $$n_s^{1} \times n_t^{1}, \; \ldots, \; n_s^{n_c} \times n_t^{n_c}.$$
  • For a source vertex u, randomly generate an external out-degree $d_{out}^{e}(u)$ with vertices in other communities, and then generate $d_{out}^{e}(u)$ target vertices to build edges from the source vertex u to target vertices in other communities.
  • The social graph generation method based on the degree distribution generation model generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges. Then, it generates a simple graph in each generation stage.
  • Given the number of source vertices ns, the number of target vertices nt, the number of expected edges ne, the in-degree distribution of target vertices distrin, the out-degree distribution of source vertices distrout, and the growing rate rg, which is a real number in the interval [0,1], the streaming graph generation process is as follows.
  • The last percentage and the target percentage are initialized to 0 and rg, respectively.
  • The generation process can be decomposed into a series of sub-processes, each of which generates a general graph in a non-streaming manner. In each generation sub-process, the number of source vertices and that of target vertices are ns·pctg and nt·pctg, respectively, where pctg is the target percentage of that sub-process.
  • For an existing source vertex, the out-degree is the difference between the result in this sub-process and that in the last sub-process. For a new source vertex, determine an out-degree directly. When generating a target vertex, the algorithm should make sure that the ID is equal to or less than nt·pctg.
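  • The stage bookkeeping of the streaming process can be sketched as a generator that yields, for each sub-process, which source vertices already exist, which are new, and the largest admissible target ID. The function name, the dictionary layout, truncation to integers, and the requirement that the growing rate be strictly positive are assumptions of this sketch; the edge generation inside each stage is omitted.

```python
import math


def streaming_stages(n_s, n_t, r_g):
    """Yield one record per generation stage as pctg grows by r_g up to 1."""
    if not 0.0 < r_g <= 1.0:
        raise ValueError("growing rate must lie in (0, 1] for this sketch")
    num_stages = math.ceil(1.0 / r_g)
    last_s = 0                                  # source count of the last stage
    for k in range(1, num_stages + 1):
        pctg = min(1.0, k * r_g)                # target percentage of this stage
        cur_s, cur_t = int(n_s * pctg), int(n_t * pctg)
        yield {
            "pctg": pctg,
            "existing_sources": range(1, last_s + 1),     # get extra out-degree
            "new_sources": range(last_s + 1, cur_s + 1),  # get a fresh out-degree
            "max_target_id": cur_t,                       # target IDs must be <= this
        }
        last_s = cur_s


stages = list(streaming_stages(n_s=10, n_t=20, r_g=0.25))
```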
  • With the rapid progress of social media, networks with billions of nodes are becoming more and more common in real-world applications. A number of existing social network analysis tasks are conducted on these large-scale networks to assist in practical applications. It usually takes significant resources to obtain the underlying large network. Thus, it is necessary to use synthetic graphs to verify the efficiency and scalability of social network analysis tasks.
  • This disclosure proposes a social graph generation method that uses a user-defined schema to satisfy various scenarios. We propose a degree distribution generation model to efficiently generate random values following a specified distribution. With it, an out-degree and a number of target vertices can be determined efficiently for a source vertex to generate edges. The vertices in the synthetic graph can represent the users in the real-world network, and the edges can represent the relationships in the network.
  • The generated graphs have the characteristics of real-world social networks, including the small-world property, community structures, and power-law degree distributions. The synthetic social graphs can be used for social network analysis tasks, such as community detection, community search, and network representation learning.
  • For those skilled in the art, Obviously, the embodiments of the present disclosure are not limited to the details of the exemplary embodiments described above, Moreover, without departing from the spirit or essential characteristics of the embodiments of the present disclosure, Embodiments of the present disclosure can thus be implemented in other specific forms, No matter from which point, The examples are to be considered exemplary, And is not limiting, The scope of embodiments of the present disclosure is defined by the appended claims rather than by the foregoing description, It is therefore intended that all changes falling within the meaning and scope of the equivalents of the claims be embraced within the embodiments of the present disclosure and that any reference numerals in the claims not be construed as limiting the claims concerned; in addition, Obviously, the word “comprising” does not exclude other elements or steps, the singular does not exclude multiple elements, modules, or devices recited in the plural system, device, or terminal claims, and the terms first, second, or the like may also be implemented by the same element, module, or device in software or hardware to denote names, rather than any particular order.
  • Finally, it should be noted that the above embodiments merely illustrate the technical solution of the embodiments of the present disclosure and are not intended to be limiting. Although the embodiments of the present disclosure have been described in detail with reference to the preferred embodiments above, those skilled in the art will appreciate that modifications or equivalent substitutions may be made without departing from the spirit and scope of the embodiments of the present disclosure.

Claims (9)

What is claimed is:
1. A social graph generation method using a degree distribution generation model, comprising: setting a schema that is used to generate social graphs, wherein a generator generates social graphs according to user-defined schema information;
generating, by the generator using the degree distribution generation model, an out-degree and a number of target vertices for each source vertex, so that the out-degrees of the source vertices and the in-degrees of the target vertices conform to the desired distributions;
generating, by the generator, general graphs given the number of source vertices, the number of target vertices, and the parameters of the degree distributions;
generating, by the generator, social graphs based on the degree distribution generation model by determining the size of each community and determining the numbers of source vertices and target vertices of each community graph and of the graphs among communities, which are generated using the general graph generation algorithm and combined into a social graph;
generating, by the generator, graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage with those in the last generation stage to determine the new vertices and edges, and generating a simple graph in each generation stage.
2. The method of claim 1, wherein the schema for the social graph generation is defined as follows: the schema includes vertex schema, edge schema, community schema, and social graph generation schema, whose symbols are defined as follows:
the vertex schema VS=(lbl, amount, attr), where amount is the number of vertices labeled lbl, and attr is the attribute information. Attributes of a vertex are represented as a set of key-value pairs;
the edge schema ES=(lbl, lbls, lblt, amount, distrin, distrout, attr), where lbls and lblt are the labels of the source vertices and target vertices, respectively, amount is the number of edges labeled lbl, distrin is the in-degree distribution of the target vertices, and distrout is the out-degree distribution of the source vertices. The attribute information attr of an edge is a set of key-value pairs;
the community schema CS=(lble, amount, λs, λt, ρ), where lble is the label of the edges in a community, and amount is the number of communities. The community size conforms to a power-law distribution, where λs and λt are the power-law parameters of the community size in source and target vertices, respectively. The number of edges between different communities depends on the community fusion parameter ρ, which is a real number between 0 and 1;
the social graph generation schema SGS=(VSS, ESS, CSS), where VSS, ESS, and CSS represent a set of vertex schemas, a set of edge schemas, and a set of community schemas, respectively;
the symbols of the generated social graph are as follows:
the heterogeneous graph G=(V,E), where V is the vertex set, and E⊆V×V is the edge set;
a vertex v∈V is represented as a triple (id, lbl, attr), where id, lbl, and attr stand for ID, label, and attributes of the vertex, respectively;
an edge e∈E is represented as (vs, vt, lbl, attr), where vs is the source vertex ID, vt is the target vertex ID, lbl is the edge label, and attr is the attributes of the edge. Together, vs, vt, and lbl uniquely identify an edge.
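The schema tuples of claim 2 can be pictured as plain records. The following is a minimal sketch only: the class names, the field names, and the dictionary encoding of a degree distribution are assumptions made for illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VertexSchema:
    """VS = (lbl, amount, attr)."""
    lbl: str                                             # vertex label
    amount: int                                          # number of vertices with this label
    attr: Dict[str, str] = field(default_factory=dict)   # key-value attributes


@dataclass
class EdgeSchema:
    """ES = (lbl, lbl_s, lbl_t, amount, distr_in, distr_out, attr)."""
    lbl: str                      # edge label
    lbl_s: str                    # label of the source vertices
    lbl_t: str                    # label of the target vertices
    amount: int                   # expected number of edges
    distr_in: Dict[str, float]    # in-degree distribution parameters (hypothetical encoding)
    distr_out: Dict[str, float]   # out-degree distribution parameters (hypothetical encoding)
    attr: Dict[str, str] = field(default_factory=dict)


@dataclass
class CommunitySchema:
    """CS = (lbl_e, amount, lambda_s, lambda_t, rho)."""
    lbl_e: str        # label of the edges inside a community
    amount: int       # number of communities
    lambda_s: float   # power-law parameter of community size (source side)
    lambda_t: float   # power-law parameter of community size (target side)
    rho: float        # community fusion parameter

    def __post_init__(self) -> None:
        # The claim states that the fusion parameter lies between 0 and 1.
        if not 0.0 <= self.rho <= 1.0:
            raise ValueError("rho must lie in [0, 1]")


@dataclass
class SocialGraphSchema:
    """SGS = (VSS, ESS, CSS): sets of vertex, edge, and community schemas."""
    vss: List[VertexSchema]
    ess: List[EdgeSchema]
    css: List[CommunitySchema]
```

Validating the fusion parameter at construction time is a design choice of this sketch; it simply rejects values outside the range the claim allows.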
3. The method of claim 2, wherein the community fusion parameter ρ is a real number between 0 and 1, and a larger value of ρ means that there are more edges among communities.
4. The method of claim 1, wherein the social graph generation method randomly generates an out-degree and multiple target vertices for a source vertex and ensures that the in-degree distribution and the out-degree distribution conform to the desired distributions:
the probability mass function is derived as follows:
$$p(x) = \begin{cases} \alpha\, P(D = x;\ \theta) & \text{if } x \in [d_{min}, d_{max}] \text{ and } x \in \mathbb{N}^{+}, \\ 0 & \text{otherwise;} \end{cases}$$
where $d_{min}$ and $d_{max}$ are the minimum degree and maximum degree, respectively, $\theta$ denotes the parameters of the degree distribution, $P(D=x;\ \theta)$ is the existence probability of vertices with degree $D=x$, and the normalization parameter $\alpha$ makes the probabilities of the vertices whose degrees lie in $[d_{min}, d_{max}]$ sum to 1;
given the out-degree distribution distrout, the number of source vertices $n_s$, and the expected number of edges $n_e$, the out-degree of a source vertex is calculated as follows:
the number of edges $n_e'$ when the out-degree of source vertices follows the distrout distribution is
$$n_e' = \sum_{x=outd_{min}}^{outd_{max}} x \cdot n_s \cdot \alpha \cdot P(D = x;\ \theta_{out}),$$
where $outd_{min}$ is the minimum out-degree of source vertices, $outd_{max}$ is the maximum out-degree of source vertices, and $\theta_{out}$ is the parameter of distrout;
adjust the maximum out-degree $outd_{max}$ to make the number of existing edges $n_e'$ match the number of expected edges $n_e$;
the formula of the cumulative distribution function (CDF) is
$$F(x) = \sum_{i=outd_{min}}^{x} \alpha\, P(D = i;\ \theta_{out}), \quad x \in [outd_{min}, outd_{max}];$$
to generate a random number following a specified distribution whose CDF is $F(x)$, we first generate a uniformly distributed value on [0,1], denoted as $y$, and then $F^{-1}(y)$ is the generated number;
design a new function
$$G(z) = \arg\max_{x}\ F(x) \le z, \quad x \in [outd_{min}, outd_{max}],$$
where $z \in \{\, i \cdot step \mid i \in \mathbb{N}^{+},\ step = \min P(D = x),\ i \cdot step \le 1 \,\}$;
given a uniformly distributed random value $y$ on [0,1], we can obtain $F^{-1}(y)$ directly from
$$G\!\left(\left\lfloor \frac{y}{step} \right\rfloor \cdot step\right).$$
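The sampling idea of claim 4 can be illustrated with a short sketch. It assumes a power-law form P(D = x; θ) ∝ x^(−λ), which is only one possible choice of distribution, and it inverts the CDF with a binary search rather than with the step-indexed lookup function G described in the claim. All function names are hypothetical.

```python
import bisect
from typing import List


def truncated_power_law_pmf(d_min: int, d_max: int, lam: float) -> List[float]:
    """Return [p(d_min), ..., p(d_max)] with p(x) = alpha * x**(-lam)."""
    weights = [x ** (-lam) for x in range(d_min, d_max + 1)]
    alpha = 1.0 / sum(weights)          # normalization parameter
    return [alpha * w for w in weights]


def cumulative(pmf: List[float]) -> List[float]:
    """Return the CDF values F(d_min), ..., F(d_max)."""
    total = 0.0
    cdf = []
    for p in pmf:
        total += p
        cdf.append(total)
    cdf[-1] = 1.0                       # guard against floating-point drift
    return cdf


def sample_degree(cdf: List[float], d_min: int, y: float) -> int:
    """Inverse transform: smallest degree x with F(x) >= y, for y in [0, 1]."""
    return d_min + bisect.bisect_left(cdf, y)
```

Feeding `sample_degree` a uniform random `y` (for example from `random.random()`) yields degrees distributed according to the truncated pmf. A precomputed table indexed by multiples of `step` would replace the binary search with a constant-time lookup.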
5. The method of claim 4, wherein the method of adjusting the maximum out-degree outdmax to make the number of existing edges ne′ match the number of expected edges ne is as follows:
if ne′ < ne, increase outdmax until the number of vertices with out-degree outdmax is less than 1, or ne′ > ne;
if ne′ = ne, there is no need to adjust outdmax;
if ne′ > ne, reduce outdmax to make ne′ ≤ ne.
6. The method of claim 4, wherein the generator generates a target vertex for a source vertex with a determined out-degree:
given the in-degree distribution distrin, the number of target vertices nt, and the expected number of edges ne, compute a target vertex ID such that the in-degree distribution conforms to the expected distribution;
define an additional cumulative distribution function of the sum of in-degrees:
$$F_s(x) = \sum_{i=ind_{min}}^{x} \beta \cdot i \cdot \alpha \cdot P(D = i;\ \theta_{in}), \quad x \in [ind_{min}, ind_{max}],$$
where $\beta$ is a normalization parameter given by
$$\beta = \frac{1}{\sum_{i=ind_{min}}^{ind_{max}} i \cdot \alpha \cdot P(D = i;\ \theta_{in})};$$
define two auxiliary functions between random values on [0,1] and the CDF values as follows:
$$H_1(z) = F_s\!\left(\arg\max_{x}\ F_s(x) \le z\right), \qquad H_2(z) = F_s\!\left(\arg\min_{x}\ F_s(x) \ge z\right),$$
where $x$ is the in-degree and
$$z \in \{\, i \cdot step \mid i \in \mathbb{N}^{+} \,\}, \quad step = \min_{x \in [ind_{min}, ind_{max}]} \bigl(F_s(x+1) - F_s(x)\bigr), \quad i \cdot step \le 1;$$
to find the corresponding target vertex IDs, another two functions are defined as follows:
$$G_1(z) = \sum_{i=ind_{min}}^{x_1} i \cdot \alpha \cdot P(D = i;\ \theta_{in}), \qquad G_2(z) = \sum_{i=ind_{min}}^{x_2} i \cdot \alpha \cdot P(D = i;\ \theta_{in}),$$
where $x_1 = \arg\max_{x}\ F_s(x) \le z$ and $x_2 = \arg\min_{x}\ F_s(x) \ge z$;
the target vertex ID is calculated by
$$G_1(z) + \bigl(y - H_1(z)\bigr) \times \frac{G_2(z) - G_1(z)}{H_2(z) - H_1(z)}.$$
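The interpolation in claim 6 can be illustrated with a sketch built on several assumptions that are not stated in the claim: target vertices are numbered from 0 and laid out in ascending in-degree classes, the class of in-degree i occupies n_t · α · P(D = i) consecutive positions, the quantities interpolated between are these cumulative vertex positions, and the in-degree distribution is a truncated power law. The function names are hypothetical.

```python
import bisect
from typing import List, Tuple


def build_tables(ind_min: int, ind_max: int, lam: float, n_t: int) -> Tuple[List[float], List[float]]:
    """Return (fs, cum_vertices), each with a leading 0 for the empty prefix.

    fs[k] is the degree-sum CDF F_s after k degree classes;
    cum_vertices[k] is the number of vertex positions in the first k classes.
    """
    degrees = list(range(ind_min, ind_max + 1))
    weights = [d ** (-lam) for d in degrees]
    alpha = 1.0 / sum(weights)
    pmf = [alpha * w for w in weights]
    beta = 1.0 / sum(d * p for d, p in zip(degrees, pmf))
    fs, cum_vertices = [0.0], [0.0]
    for d, p in zip(degrees, pmf):
        fs.append(fs[-1] + beta * d * p)
        cum_vertices.append(cum_vertices[-1] + n_t * p)
    fs[-1] = 1.0                     # guard against floating-point drift
    cum_vertices[-1] = float(n_t)
    return fs, cum_vertices


def target_vertex_id(y: float, fs: List[float], cum_vertices: List[float]) -> int:
    """Map a uniform y in [0, 1] to a 0-based target vertex ID."""
    # First class index k >= 1 whose cumulative degree-sum share reaches y.
    k = min(bisect.bisect_left(fs, y, 1), len(fs) - 1)
    h1, h2 = fs[k - 1], fs[k]                       # bracketing CDF values
    g1, g2 = cum_vertices[k - 1], cum_vertices[k]   # bracketing vertex positions
    pos = g1 + (y - h1) * (g2 - g1) / (h2 - h1)     # linear interpolation
    return min(int(pos), int(cum_vertices[-1]) - 1)
```

Because the class is chosen in proportion to its share of the total in-degree and the position inside the class is uniform, a vertex in a higher in-degree class is selected proportionally more often, which is the behavior the interpolation formula is meant to produce.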
7. The method of claim 6, wherein, given the number of source vertices, the number of target vertices, and the degree distribution parameters, the generation process of general graphs is as follows:
the parameters for generation include the number of source vertices ns, the number of target vertices nt, the number of expected edges ne, the in-degree distribution of target vertices distrin, and the out-degree distribution of source vertices distrout;
the graph can be represented by an ns × nt matrix M, where Mij=1 means that there exists an edge from a source vertex vi to a target vertex vj, and Mij=0 means that there is no such edge;
for a source vertex with out-degree outd, the general graph generation method determines outd target vertices to build edges.
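The general graph generation loop of claim 7 can be sketched as follows. This is a simplified illustration rather than the claimed procedure: out-degrees and targets are drawn with `random.choices` from explicit weight lists instead of the lookup functions of claims 4 and 6, the edge set is a sparse stand-in for the matrix M, and every name and parameter is hypothetical.

```python
import random
from typing import List, Set, Tuple


def generate_general_graph(
    n_s: int,
    n_t: int,
    out_degrees: List[int],
    out_pmf: List[float],
    target_weights: List[float],
    seed: int = 0,
) -> Set[Tuple[int, int]]:
    """Return the set of (i, j) pairs with M[i][j] = 1.

    target_weights must be strictly positive so that every source vertex
    can always find enough distinct targets.
    """
    rng = random.Random(seed)
    edges: Set[Tuple[int, int]] = set()
    for i in range(n_s):
        # Draw the out-degree of source vertex i, capped at n_t.
        outd = min(rng.choices(out_degrees, weights=out_pmf)[0], n_t)
        chosen: Set[int] = set()
        # Draw targets until outd distinct ones are found, keeping the graph simple.
        while len(chosen) < outd:
            chosen.add(rng.choices(range(n_t), weights=target_weights)[0])
        edges.update((i, j) for j in chosen)
    return edges
```

Storing only the nonzero entries avoids allocating the full ns × nt matrix, which matters once the vertex counts grow large.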
8. The method of claim 7, wherein the number of source vertices and the number of target vertices of each community graph and of the graphs among communities are determined, and the generator then generates simple graphs and combines them into a social graph:
given a social graph generation schema S, let $d_{out}(u)$ be the out-degree of vertex $u$, $d_{out}^{i}(u)$ the out-degree of $u$ with vertices inside the same community, and $d_{out}^{e}(u) = d_{out}(u) - d_{out}^{i}(u)$ the out-degree of $u$ with vertices in other communities. A probability density function (PDF) for $d_{out}^{e}(u)$ of vertex $u$ is defined as follows:
$$p(x) = \begin{cases} \alpha\, e^{-\frac{x}{1+\rho}} & \text{if } x \in [1, outd'_{max}], \\ 0 & \text{otherwise,} \end{cases}$$
where $\alpha$ is a normalization parameter, $\rho$ is the community fusion parameter defined by the user in the community schemas and is a real number between 0 and 1, $outd'_{max} = outd_{max} - d_{out}(u)$, $outd_{max}$ is the maximum out-degree of source vertices, and $p(x)$ is a monotonically decreasing function;
regard the out-degree random variable dout(u) as a continuous variable; then the following equation holds according to the normalization property of a PDF:
$$\int_{1}^{outd'_{max}} \alpha\, e^{-\frac{x}{1+\rho}}\, dx = 1;$$
for a source vertex $u$, the out-degree with vertices in other communities is
$$d_{out}^{e}(u) = -(1+\rho)\, \log\!\left( e^{-\frac{1}{1+\rho}} + y \left( e^{-\frac{outd'_{max}}{1+\rho}} - e^{-\frac{1}{1+\rho}} \right) \right),$$
where $y$ is a real number drawn from the uniform distribution $U(0,1)$, and the following equation holds between $y$ and the target external out-degree $d_{out}^{e}(u)$:
$$\int_{1}^{d_{out}^{e}(u)} \alpha\, e^{-\frac{x}{1+\rho}}\, dx = y;$$
for each edge schema ES with the corresponding source vertex schema VSs, target vertex schema VSt, and community schema CS, denote the number of generated edges ES.amount as ne, the number of source vertices VSs.amount as ns, the number of target vertices VSt.amount as nt, and the number of communities CS.amount as nc, and let CS.λs and CS.λt be the power-law parameters;
determine the size of each community according to the number of communities CS.amount and the power-law parameters CS.λs and CS.λt so that the community size conforms to a power-law distribution. Denote the sizes of the nc communities as
$$n_s^{1} \times n_t^{1},\ \ldots,\ n_s^{n_c} \times n_t^{n_c};$$
for a source vertex $u$, randomly generate an external out-degree $d_{out}^{e}(u)$ with vertices in other communities, and then generate $d_{out}^{e}(u)$ target vertices to build edges from the source vertex $u$ to target vertices in other communities.
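The closed form for the external out-degree in claim 8 follows from inverting the CDF of the exponential density on [1, outd′max]. The sketch below evaluates that closed form and the matching CDF so the two can be checked against each other. The function and parameter names are hypothetical, and how the real-valued result is rounded to an integer degree is left to the caller.

```python
import math


def external_out_degree(y: float, rho: float, outd_max_prime: float) -> float:
    """Inverse CDF of p(x) = alpha * exp(-x / (1 + rho)) on [1, outd_max_prime]."""
    a = math.exp(-1.0 / (1.0 + rho))
    b = math.exp(-outd_max_prime / (1.0 + rho))
    return -(1.0 + rho) * math.log(a + y * (b - a))


def external_cdf(d: float, rho: float, outd_max_prime: float) -> float:
    """Integral of p(x) from 1 to d, with alpha chosen so the total mass is 1."""
    a = math.exp(-1.0 / (1.0 + rho))
    b = math.exp(-outd_max_prime / (1.0 + rho))
    return (a - math.exp(-d / (1.0 + rho))) / (a - b)
```

With y = 0 the formula returns the lower bound 1, and with y = 1 it returns the upper bound, so uniform draws cover the whole support. A larger ρ flattens the density and therefore shifts mass toward larger external out-degrees, consistent with the statement that larger values of the fusion parameter produce more edges among communities.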
9. The method of claim 8, wherein the generator generates graphs in a streaming manner by comparing the numbers of source vertices and target vertices in the current generation stage and those in the last generation stage to determine the new vertices and edges, then, it generates a simple graph in each generation stage:
given the number of source vertices ns, the number of target vertices nt, the number of expected edges ne, the in-degree distribution of target vertices distrin, the out-degree distribution of source vertices distrout, and the growing rate rg, which is a real number in the interval [0,1], the streaming graph generation process is as follows:
the last percentage and the target percentage pctg are initialized to 0 and rg, respectively;
the generation process can be decomposed into a series of sub-processes, each generating a general graph in a non-streaming manner. In each generation sub-process, the number of source vertices and that of target vertices are ns·pctg and nt·pctg, respectively. For an existing source vertex, the out-degree is the difference between the result in this sub-process and that in the last sub-process; for a new source vertex, an out-degree is determined directly;
when generating a target vertex, the algorithm should make sure that the ID is equal to or less than nt·pctg.
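The stage bookkeeping of claim 9 can be sketched as below. The sketch assumes that the target percentage grows by rg in every stage until it reaches 1, which the claim does not state explicitly, and it only computes the vertex counts per stage; the edge generation inside each stage is omitted. The function name and the dictionary keys are hypothetical.

```python
from typing import Dict, List


def streaming_stages(n_s: int, n_t: int, r_g: float) -> List[Dict[str, int]]:
    """Return, per stage, the new vertex counts and the target ID bound."""
    if not 0.0 < r_g <= 1.0:
        raise ValueError("r_g must lie in (0, 1]")
    stages: List[Dict[str, int]] = []
    last_pct = 0.0
    stage = 1
    while last_pct < 1.0:
        pct = min(1.0, stage * r_g)    # target percentage of this stage
        cur_s, cur_t = round(n_s * pct), round(n_t * pct)
        prev_s, prev_t = round(n_s * last_pct), round(n_t * last_pct)
        stages.append({
            "new_sources": cur_s - prev_s,   # sources that first appear in this stage
            "new_targets": cur_t - prev_t,   # targets that first appear in this stage
            "max_target_id": cur_t,          # target IDs must not exceed n_t * pct
        })
        last_pct = pct
        stage += 1
    return stages
```

Computing the percentage as `stage * r_g` rather than by repeated addition avoids accumulating floating-point error across stages.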
US17/784,175 2020-03-20 2020-03-20 Social graph generation method using a degree distribution generation model Pending US20230032521A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/080472 WO2021184367A1 (en) 2020-03-20 2020-03-20 Social network graph generation method based on degree distribution generation model

Publications (1)

Publication Number Publication Date
US20230032521A1 true US20230032521A1 (en) 2023-02-02

Family

ID=77767988

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/784,175 Pending US20230032521A1 (en) 2020-03-20 2020-03-20 Social graph generation method using a degree distribution generation model

Country Status (3)

Country Link
US (1) US20230032521A1 (en)
CN (1) CN114207573A (en)
WO (1) WO2021184367A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609345B (en) * 2021-09-30 2021-12-10 腾讯科技(深圳)有限公司 Target object association method and device, computing equipment and storage medium
CN114329099B (en) * 2021-11-22 2023-07-07 腾讯科技(深圳)有限公司 Overlapping community identification method, device, equipment, storage medium and program product
CN115514580B (en) * 2022-11-11 2023-04-07 华中科技大学 Method and device for detecting source-tracing intrusion of self-encoder

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8560605B1 (en) * 2010-10-21 2013-10-15 Google Inc. Social affinity on the web
US8626835B1 (en) * 2010-10-21 2014-01-07 Google Inc. Social identity clustering
US8736612B1 (en) * 2011-07-12 2014-05-27 Relationship Science LLC Altering weights of edges in a social graph
US8880600B2 (en) * 2010-03-31 2014-11-04 Facebook, Inc. Creating groups of users in a social networking system
US20150026120A1 (en) * 2011-12-28 2015-01-22 Evan V Chrapko Systems and methods for visualizing social graphs
US20150120717A1 (en) * 2013-10-25 2015-04-30 Marketwire L.P. Systems and methods for determining influencers in a social data network and ranking data objects based on influencers
US20150242967A1 (en) * 2014-02-27 2015-08-27 Linkedin Corporation Generating member profile recommendations based on community overlap data in a social graph
US20170270210A1 (en) * 2016-03-16 2017-09-21 Sysomos L.P. Data Infrastructure and Method for Ingesting and Updating A Continuously Evolving Social Network
US20180103111A1 (en) * 2016-10-07 2018-04-12 International Business Machines Corporation Determination of well-knit groups in organizational settings
CN106097107B (en) * 2009-09-30 2020-10-16 柯蔼文 Systems and methods for social graph data analysis to determine connectivity within a community

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008052495A (en) * 2006-08-24 2008-03-06 Sony Corp Network system, information processor, method, program, and recording medium
CN105550212A (en) * 2015-12-03 2016-05-04 上海电机学院 Network construction element mining device and method
CN108510115A (en) * 2018-03-29 2018-09-07 山东科技大学 A kind of maximizing influence analysis method towards dynamic social networks
CN110659395B (en) * 2019-08-14 2023-05-30 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for constructing relational network map

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
G. Vakhitov, Z. Enikeeva, N. Yangirova, A. Shavalieva and P. Ustin, "Identification of the Clusters of Social Network Communities for Users with a Specific Characteristic," 2019 12th International Conference on Developments in eSystems Engineering (DeSE), Kazan, Russia, 2019, pp. 140-146, (Year: 2019) *
H. Jung and S. Kim, "Sigcon: Simplifying a Graph Based on Degree Correlation and Clustering Coefficient," 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; pp. 372-379 (Year: 2017) *
Rios, Juan Camilo Ramírez de los, Paula Alejandra Escudero Marín, and María Camila Vásquez-Correa. "ABMS of social network based on affinity." arXiv preprint arXiv:1903.05977 (2019). (Year: 2019) *

Also Published As

Publication number Publication date
WO2021184367A1 (en) 2021-09-23
CN114207573A (en) 2022-03-18

Legal Events

Date Code Title Description
AS Assignment

Owner name: TSINGHUA UNIVERSITY, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WANG, CHAOKUN;WANG, BINBIN;HUANG, BINGYANG;REEL/FRAME:060158/0507

Effective date: 20220610

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER