WO2021027331A1 - Graph data-based full relationship calculation method and apparatus, device, and storage medium - Google Patents

Graph data-based full relationship calculation method and apparatus, device, and storage medium Download PDF

Info

Publication number
WO2021027331A1
WO2021027331A1 PCT/CN2020/087619 CN2020087619W WO2021027331A1 WO 2021027331 A1 WO2021027331 A1 WO 2021027331A1 CN 2020087619 W CN2020087619 W CN 2020087619W WO 2021027331 A1 WO2021027331 A1 WO 2021027331A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
data
identifier
degree
relationship
Prior art date
Application number
PCT/CN2020/087619
Other languages
French (fr)
Chinese (zh)
Inventor
邓强
张娟
屠宁
赵之砚
施奕明
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2021027331A1 publication Critical patent/WO2021027331A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Definitions

  • This application relates to the field of big data technology, and in particular to a method, device, device, and storage medium for calculating a full relationship based on graph data.
  • Graph data mining is an important method in relationship mining and group profiling.
  • Graph data is composed of nodes and edges.
  • the nodes in the graph data are used to represent the connected subjects, and the edges in the graph data are used to represent the association between the subjects.
  • a node is associated with other nodes through the edges connected to it.
  • One of the typical applications in graph calculation is to find the full relationship of a certain node and analyze the statistical characteristics. Among them, the calculation of the second-degree relationship and the third-degree relationship has become a difficult point in the graph calculation due to the large amount of calculation and computational resource consumption.
  • the current typical environment for graph computing is the GraphX environment in the Spark project open sourced by the Apache Software Foundation.
  • GraphX uses a memory computing strategy to achieve fast iterative calculations; however, the inventor realized that memory computing consumes a huge amount of memory resources, and it is difficult to support the processing of massive data.
  • index calculation of second-degree and third-degree associated nodes using GraphX to calculate graphs on 50 million nodes and 400 million edges will consume 2000GB of memory. The calculation efficiency of the full relationship is low. On social networks, such calculation efficiency is difficult to meet the application of billions to billions of nodes.
  • This application provides a method, device, device, and storage medium for calculating a full-scale relationship based on graph data, which are used to merge node attributes into node identifiers using bit operations to avoid duplication of node data, reduce memory resource consumption, and improve computing efficiency .
  • the first aspect of the embodiments of the present application provides a full-scale relationship calculation method based on graph data, including: obtaining preprocessed graph data, the preprocessed graph data including node data and edge data of each node; Perform a bit operation to generate the synthetic node ID of each node; divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group includes a synthetic node of the current node ID and the edge data connected to the current node; send a single node ID list of each node to all adjacent nodes, the single node ID list is used to store the composite node ID of the adjacent node; according to each node The received single node identification list generates the second-degree relationship of each node.
  • a second aspect of the embodiments of the present application provides a full-scale relationship calculation device based on graph data, including: a first obtaining unit configured to obtain preprocessed graph data, the preprocessed graph data including node data of each node and Edge data; an operation generating unit for performing bit operations on the node data to generate a synthetic node identifier for each node; a division generating unit for centering the node data and the edge data on each node data Divide, generate multiple data groups, each data group includes a composite node ID of the current node and edge data connected to the current node; the first sending unit is used to send the single node ID list of each node to all Adjacent nodes, the single node identification list is used to store the composite node identification of the adjacent nodes; the first generating unit is used to generate the second degree of each node according to the single node identification list received by each node relationship.
  • the third aspect of the embodiments of the present application provides a computer device, including: one or more processors; a memory; one or more computer programs, wherein the one or more computer programs are stored in the memory and Is configured to be executed by the one or more processors, and the one or more computer programs are configured to execute a full relationship calculation method based on graph data, wherein the full relationship calculation method based on graph data includes :
  • the preprocessed graph data including node data and edge data of each node;
  • the second-degree relationship of each node is generated.
  • a fourth aspect of the embodiments of the present application provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is executed by a processor, a method for calculating a full relationship based on graph data is implemented, wherein , The method for calculating the full relationship based on graph data includes the following steps:
  • the preprocessed graph data including node data and edge data of each node;
  • the second-degree relationship of each node is generated.
  • preprocessed graph data is obtained, and the preprocessed graph data includes node data and edge data of each node; bit operations are performed on the node data to generate a synthetic node identifier for each node ; Divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group includes a synthetic node identifier of the current node and the edge data connected to the current node; The single-node identification list of each node is sent to all adjacent nodes, and the single-node identification list is used to store the composite node identification of the adjacent nodes; and each node is generated according to the single-node identification list received by each node. The second-degree relationship of the node. Use bit operation to merge node attributes into node identifiers, and eliminate self-connection, avoid node data duplication, reduce memory resource consumption, and improve calculation efficiency.
  • FIG. 1 is a schematic diagram of an embodiment of a method for calculating a full relationship based on graph data in an embodiment of the application;
  • FIG. 2 is a schematic diagram of another embodiment of a method for calculating a full relationship based on graph data in an embodiment of the application;
  • Fig. 3 is a schematic diagram of an embodiment of a full relation calculation device based on graph data in an embodiment of the application;
  • FIG. 4 is a schematic diagram of another embodiment of a full relationship calculation device based on graph data in an embodiment of the application;
  • FIG. 5 is a schematic diagram of an embodiment of a full relationship calculation device based on graph data in an embodiment of the application.
  • This application provides a method, device, device, and storage medium for calculating a full-scale relationship based on graph data, which are used to merge node attributes into node identifiers using bit operations to avoid duplication of node data, reduce memory resource consumption, and improve computing efficiency .
  • FIG. 1 a flowchart of a method for calculating a full relationship based on graph data provided by an embodiment of the present application, which specifically includes:
  • preprocessed graph data where the preprocessed graph data includes node data and edge data of each node.
  • the server obtains the preprocessed graph data, and the preprocessed graph data includes node data and edge data of each node. Specifically, the server loads the preprocessed graph data through Spark.
  • the preprocessed graph data is composed of node data and edge data.
  • the node data in the graph data is used to indicate the subject that is connected, and the edge data is used to indicate the association between the subjects. .
  • a node is related to other nodes through its connected edges.
  • the node data of a target node includes the target node attribute and a target node identifier
  • the target edge data includes a target edge identifier and two node identifiers connected to the target edge.
  • the node attribute contains several label data of the node.
  • the node label data may include an ID number, a mobile phone number, and three Boolean variables A, B, and C.
  • variable A is used to indicate the gender of the user, 1 is used to indicate male, 0 is used to indicate female;
  • variable B is used to indicate whether it is dishonest, 1 is used to indicate failure, and 0 is used to indicate no failure;
  • variable C is used to indicate whether there is a university degree, with 1 means a university degree, and 0 means no university degree.
  • the preprocessed graph data means that a large amount of original graph data has been deduplicated, and the graph data that meets the requirements have been selected.
  • the execution subject of this application may be a full-scale relational computing device based on graph data, or may be a terminal or a server, which is not specifically limited here.
  • the embodiment of the present application takes the server as the execution subject as an example for description.
  • the server performs bit operations on the node data to generate a synthetic node ID for each node. Specifically, the server determines multiple nodes in the node data; the server obtains the node attributes of each node and the initial node identifier corresponding to each node; the server obtains preset rules, which include the total storage bits for each node identifier Number and the starting and ending ordinal number of the storage location occupied by each variable; the server performs bit operations on the node attributes and initial node identification of each node according to the total number of storage bits for each node identification and the starting and ending ordinal number of the storage location occupied by each variable ; The server generates a synthetic node ID for each node.
  • the preset rules include: (1) the total number of storage bits for each node identification; (2) the starting and ending ordinal numbers of the storage bits occupied by each variable.
  • storage of ID card number, mobile phone number, and the aforementioned three Boolean variables A, B, and C are taken as examples for description.
  • the preset rules are: (1) Each node allocates 101 bits of storage; (2) 1-61 bits are used to store the ID number, the 62nd-98th bits store the mobile phone number, and the 99th bit stores the variable A, the 100th bit stores variable B, and the 101st bit stores variable C.
  • the server saves a lot of memory resources by using bit storage instead of individually assigning double integer storage space to each variable.
  • the initial node identification is collected by the user in the preprocessing stage of the graph data and stored on the external memory (hard disk) of the computer. After starting the calculation, the server loads the initial node identifier into the internal memory. The data identified by the initial node belongs to the original data, and its collection occurs before the bit operation.
  • each data group including a synthetic node identifier of the current node and edge data connected to the current node.
  • the server divides the node data and the edge data with each node data as the center, and generates multiple data groups.
  • Each data group includes a synthetic node identifier of the current node and the edge data connected to the current node.
  • the graph data contains node data and edge data. After being loaded into the memory, in order to perform distributed calculations, the graph data needs to be divided into small processing units, which are referred to as "data groups" here. By establishing a node-centric “data group”, it is ensured that each node has only one copy, which will only appear once, and avoiding multiple node data replication.
  • a data group contains the node ID and all the edge data on the node.
  • One edge data contains the composite node ID of the node and another node connected to it, but does not include any attribute data of the connected node.
  • the server sends the single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of each node's adjacent nodes.
  • the single node identification list is collected based on the edge data on the data group. Because the edge data contains the synthetic node identification of the current node and the once-associated node connected to the current node, all connected nodes can be collected by traversing all the edge data.
  • the single node identifier list contains only the composite node identifiers of adjacent nodes.
  • node a and nodes b, c, and d are adjacent nodes. Then first collect the identification list of all adjacent nodes (ie, single-node identification list), where the node identification is the four letters a, b, c, d (here is a simple example, in actual situations it contains hundreds of millions of nodes, you need Use more complex node identification to indicate, for example, when the device is a node, use the production information of the device, such as the date of generation, serial number, etc., as the node identification). For node a, the obtained node identification list is the list of [b, c, d]; then, node a passes the list of [b, c, d] to the three nodes b, c, d, respectively.
  • the server generates the second-degree relationship of each node according to the single node identification list received by each node. Specifically, the server receives the list of single node identifiers of each node; the server determines its own synthetic node identifier for each node; the server separately receives the same node identifier as its own synthetic node identifier in the list of single node identifiers received by each node Delete; the server generates a second-degree relationship for each node, and the second-degree relationship is used to indicate that there is an interval between the second-degree associated node and the current node.
  • node a is connected to nodes b, c, and d, and nodes b, c, and d are not directly connected, then after node b receives the list [b, c, d] sent by node a, Delete b from the list, and there are two nodes [c, d] left in the list. These two nodes are the second-degree relationship of b.
  • a first-degree relationship refers to the connection between two nodes, that is, adjacent nodes; a second-degree relationship refers to a node between two nodes.
  • node b and node c are separated by node a, so node b and node c are second-degree related, that is, node b and node c have a second-degree relationship.
  • the method further includes: the server obtains the node attribute of each node according to the second-degree relationship of each node, and performs statistical analysis according to the node attribute of each node to generate an analysis result.
  • the server reads the second-degree relationship of each node; the server determines the node attribute of each node from the second-degree relationship; the server separates the node attribute of each node from the synthetic node identification according to preset rules; Perform statistical analysis on the node attributes of each node to generate analysis results.
  • FIG. 2 another flowchart of the full relationship calculation method based on graph data provided by the embodiment of the present application, which specifically includes:
  • the preprocessed graph data includes node data and edge data of each node.
  • the server obtains the preprocessed graph data, and the preprocessed graph data includes the node data and edge data of each node. Specifically, the server loads the preprocessed graph data through Spark.
  • the preprocessed graph data is composed of node data and edge data.
  • the node data in the graph data is used to indicate the subject that is connected, and the edge data is used to indicate the association between the subjects. .
  • a node is related to other nodes through its connected edges.
  • the node data of a target node includes the target node attribute and a target node identifier
  • the target edge data includes a target edge identifier and two node identifiers connected to the target edge.
  • the node attribute contains several label data of the node.
  • the node label data may include an ID number, a mobile phone number, and three Boolean variables A, B, and C.
  • variable A is used to indicate the gender of the user, 1 is used to indicate male, 0 is used to indicate female;
  • variable B is used to indicate whether it is dishonest, 1 is used to indicate failure, and 0 is used to indicate no failure;
  • variable C is used to indicate whether there is a university degree, with 1 means a university degree, and 0 means no university degree.
  • the preprocessed graph data means that a large amount of original graph data has been deduplicated, and the graph data that meets the requirements have been selected.
  • the execution subject of this application may be a full-scale relational computing device based on graph data, or may be a terminal or a server, which is not specifically limited here.
  • the embodiment of the present application takes the server as the execution subject as an example for description.
  • the server performs bit operations on the node data to generate a synthetic node ID for each node. Specifically, the server determines multiple nodes in the node data; the server obtains the node attributes of each node and the initial node identifier corresponding to each node; the server obtains preset rules, which include the total storage bits for each node identifier Number and the starting and ending ordinal number of the storage location occupied by each variable; the server performs bit operations on the node attributes and initial node identification of each node according to the total number of storage bits for each node identification and the starting and ending ordinal number of the storage location occupied by each variable ; The server generates a synthetic node ID for each node.
  • the preset rules include: (1) the total number of storage bits for each node identification; (2) the starting and ending ordinal numbers of the storage bits occupied by each variable.
  • storage of ID card number, mobile phone number, and the aforementioned three Boolean variables A, B, and C are taken as examples for description.
  • the preset rules are: (1) Each node allocates 101 bits of storage; (2) 1-61 bits are used to store the ID number, the 62nd-98th bits store the mobile phone number, and the 99th bit stores the variable A, the 100th bit stores variable B, and the 101st bit stores variable C.
  • the server saves a lot of memory resources by using bit storage instead of individually assigning double integer storage space to each variable.
  • the initial node identification is collected by the user in the preprocessing stage of the graph data and stored on the external memory (hard disk) of the computer. After starting the calculation, the server loads the initial node identifier into the internal memory. The data identified by the initial node belongs to the original data, and its collection occurs before the bit operation.
  • each data group including a synthetic node identifier of the current node and the edge data connected to the current node.
  • the server divides the node data and edge data with each node data as the center, and generates multiple data groups.
  • Each data group includes a synthetic node identifier of the current node and the edge data connected to the current node.
  • the graph data contains node data and edge data. After being loaded into the memory, in order to perform distributed calculations, the graph data needs to be divided into small processing units, which are referred to as "data groups" here. By establishing a node-centric “data group”, it is ensured that each node has only one copy, which will only appear once, and avoiding multiple node data replication.
  • a data group contains the node ID and all the edge data on the node.
  • One edge data contains the composite node ID of the node and another node connected to it, but does not include any attribute data of the connected node.
  • the server sends the single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of the adjacent nodes of each node.
  • the single node identification list is collected based on the edge data on the data group. Because the edge data contains the synthetic node identification of the current node and the once-associated node connected to the current node, all connected nodes can be collected by traversing all the edge data.
  • the single node identifier list contains only the composite node identifiers of adjacent nodes.
  • node a and nodes b, c, and d are adjacent nodes. Then first collect the identification list of all adjacent nodes (ie, single-node identification list), where the node identification is the four letters a, b, c, d (here is a simple example, in actual situations it contains hundreds of millions of nodes, you need Use more complex node identification to indicate, for example, when the device is a node, use the production information of the device, such as the date of generation, serial number, etc., as the node identification). For node a, the obtained node identification list is the list of [b, c, d]; then, node a passes the list of [b, c, d] to the three nodes b, c, d, respectively.
  • each node According to the single node identification list received by each node, generate a second-degree relationship of each node.
  • the server generates the second-degree relationship of each node according to the single node identification list received by each node. Specifically, the server receives the list of single node identifiers of each node; the server determines its own synthetic node identifier for each node; the server separately receives the same node identifier as its own synthetic node identifier in the list of single node identifiers received by each node Delete; the server generates a second-degree relationship for each node, and the second-degree relationship is used to indicate that there is an interval between the second-degree associated node and the current node.
  • node a is connected to nodes b, c, and d, and nodes b, c, and d are not directly connected, then after node b receives the list [b, c, d] sent by node a, Delete b from the list, and there are two nodes [c, d] left in the list. These two nodes are the second-degree relationship of b.
  • the first-degree relationship refers to the connection between two nodes, that is, adjacent nodes;
  • the second-degree relationship refers to the distance between two nodes by one node.
  • node b and node c are separated by node a, so node b and node c are second-degree related, that is, node b and node c have a second-degree relationship.
  • the server generates a second-degree relationship identifier list according to the second-degree relationship of each node, and the second-degree relationship identifier list is used to store the second-degree relationship of each node.
  • the server sends the second-degree relationship identifier list of each node to all neighboring nodes, and the second-degree relationship identifier list is used to store the composite node identifier of the neighboring nodes of each node.
  • the three-degree relationship identifier list is used to store the three-degree relationship of each node, and the third-degree relationship is used for Indicates that there is a first-degree associated node and a second-degree associated node between the three-degree associated node and the current node.
  • the server generates a three-degree relationship identifier list for each node according to the second-degree relationship identifier list received by each node from neighboring nodes.
  • the three-degree relationship identifier list is used to store the three-degree relationship of each node, and the third-degree relationship is used to indicate Between the third-degree associated node and the current node, there is a first-degree associated node and a second-degree associated node.
  • the server obtains the second-degree relationship identifier list of each node; the server determines each node's own synthetic node identifier; the server separately receives the second-degree relationship identifier list received by each node that is the same as its own synthetic node identifier
  • the node ID is deleted; the server determines the three-degree relationship of each node, and the three-degree relationship is used to indicate the interval between two nodes (that is, the interval between the three-degree associated node and the current node is a one-degree associated node and a two-degree associated node.
  • Degree-related nodes generate a three-degree relationship identifier list, and the three-degree relationship identifier list is used to store the three-degree relationship of each node.
  • node a is connected to nodes b, c, and d
  • node e is connected to node b
  • nodes b, c, and d are not directly connected
  • node b receives the list [b, c, d] sent by node a, Delete b from the list, and there are two nodes [c, d] left in the list.
  • the node b continues to send the list [c, d] to the node e, and then obtains the three-degree relationship list of the node e, that is, the node e and the node c, and the e and d form a three-degree relationship.
  • node b will also send the node ID list [c, d] to node a.
  • the list [c, d] also exists in the one-degree relationship list [b, c, d] of the node a, c and d need to be excluded and do not form a three-degree relationship with a.
  • the method further includes: the server obtains the node attributes of each node according to the three-degree relationship identification list of each node, and performs statistical analysis according to the node attributes of each node.
  • the separation here refers to the process of reading the node attributes from the synthetic node identifier. It corresponds to the previous data reading process and follows the unified node attribute preset rules. Using the aforementioned example, use 101 bits to store the ID number, mobile phone number, and three Boolean variables of A, B, and C.
  • the separated node attributes are statistically analyzed according to business needs, which improves the calculation efficiency. For example, count all the friends worth recommending in the three-degree relationship, or count all the users with good credit, and so on.
  • the method further includes: the server obtains the original graph data of each node; the server performs deduplication and verification processing on the original graph data; and the server generates preprocessed graph data that meets the requirements.
  • An embodiment of the relational computing device includes:
  • the first obtaining unit 301 is configured to obtain preprocessed graph data, where the preprocessed graph data includes node data and edge data of each node;
  • the operation generating unit 302 is configured to perform a bit operation on the node data to generate a synthetic node identifier of each node;
  • the dividing and generating unit 303 is configured to divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group including a composite node identifier of the current node and connection with the current node Edge data;
  • the first sending unit 304 is configured to send a single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of the adjacent nodes;
  • the first generating unit 305 is configured to generate the second degree relationship of each node according to the single node identification list received by each node.
  • FIG. 4 another embodiment of the full relation calculation device based on graph data in the embodiment of the present application includes:
  • the first obtaining unit 301 is configured to obtain preprocessed graph data, where the preprocessed graph data includes node data and edge data of each node;
  • the operation generating unit 302 is configured to perform a bit operation on the node data to generate a synthetic node identifier of each node;
  • the dividing and generating unit 303 is configured to divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group including a composite node identifier of the current node and connection with the current node Edge data;
  • the first sending unit 304 is configured to send a single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of the adjacent nodes;
  • the first generating unit 305 is configured to generate the second degree relationship of each node according to the single node identification list received by each node.
  • the operation generating unit 302 is specifically configured to:
  • the full relationship calculation device based on graph data further includes:
  • the statistical analysis unit 306 is configured to obtain the node attribute of each node according to the second-degree relationship of each node, and perform statistical analysis according to the node attribute of each node to generate an analysis result.
  • the statistical analysis unit 306 is specifically used to:
  • the first generating unit 305 is specifically configured to:
  • Receive the single-node identification list of each node determine the connection status of each node; delete the synthetic node identification that is the same as itself in the received single-node identification list; generate the second degree relationship of each node, the second degree The relationship is used to indicate the second degree associated node and the current node, and there is an interval of one degree associated node.
  • the full relationship calculation device based on graph data further includes:
  • the second generating unit 307 is configured to generate a second-degree relationship identifier list according to the second-degree relationship of each node, and the second-degree relationship identifier list is used to store the second-degree relationship of each node;
  • the second sending unit 308 is configured to send the second-degree relationship identifier list of each node to all adjacent nodes;
  • the third generating unit 309 is configured to generate a third-degree relationship identifier list of each node according to the second-degree relationship identifier list sent by neighboring nodes received by each node, and the third-degree relationship identifier list is used to store the three-degree relationship identifier list of each node.
  • Degree relationship the three-degree relationship is used to indicate that there is an interval between a first-degree associated node and a second-degree associated node between the third-degree associated node and the current node.
  • the full relationship calculation device based on graph data further includes:
  • the second obtaining unit 310 is configured to obtain the original graph data of each node
  • the processing unit 311 is configured to perform deduplication processing and verification processing on the original image data
  • the fourth generating unit 312 is used to generate preprocessed image data that meets the requirements.
  • FIG. 5 is a schematic structural diagram of a full relational computing device based on graph data provided by an embodiment of the present application.
  • the full relational computing device 500 based on graph data may have relatively large differences due to different configurations or performances, and may include one or One or more processors (central processing units, CPU) 501 (for example, one or more processors) and a memory 509, and one or more storage media 508 (for example, one or one storage device with a large amount of storage) storing application programs 507 or data 506.
  • the memory 509 and the storage medium 508 may be short-term storage or persistent storage.
  • the program stored in the storage medium 508 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the full relational computing device based on graph data. Further, the processor 501 may be configured to communicate with the storage medium 508, and execute a series of instruction operations in the storage medium 508 on the full relationship computing device 500 based on graph data.
  • the full relational computing device 500 based on graph data may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input and output interfaces 504, and/or, one or more operating systems 505 , Such as Windows Serve, MacOS X, Unix, Linux, FreeBSD, etc.
  • operating systems 505 Such as Windows Serve, MacOS X, Unix, Linux, FreeBSD, etc.
  • FIG. 5 does not constitute a limitation on the full relationship computing device based on graph data, and may include more or less components than shown in the figure. Or combine certain components, or different component arrangements.
  • the processor 501 can execute the first acquisition unit 301, the operation generation unit 302, the division generation unit 303, the first generation unit 305, the statistical analysis unit 306, the second generation unit 307, the third generation unit 309, and the second generation unit 301 in the above embodiments. Functions of the acquiring unit 310, the processing unit 311, and the fourth generating unit 312.
  • the processor 501 is the control center of the full relationship calculation device based on graph data, and can perform processing in accordance with the set full relationship calculation method based on graph data.
  • the processor 501 uses various interfaces and lines to connect various parts of the entire graph data-based full relational computing device, by running or executing software programs and/or modules stored in the memory 509, and calling data stored in the memory 509, Perform various functions and processing data of full relational computing equipment based on graph data.
  • the storage medium 508 and the memory 509 are both carriers for storing data.
  • the storage medium 508 may refer to an internal memory with a small storage capacity but high speed, and the storage 509 may have a large storage capacity but a slow storage speed. External memory.
  • the memory 509 may be used to store software programs and modules.
  • the processor 501 executes various functional applications and data processing of the full relational computing device 500 based on graph data by running the software programs and modules stored in the memory 509.
  • the memory 509 may mainly include a program storage area and a data storage area.
  • the storage program area may store an operating system and at least one application program required by a function (for example, sending a single node identification list of each node to all adjacent nodes, The single node identifier list is used to store the composite node identifiers of adjacent nodes, etc.; the storage data area can store data created according to the use of the full relationship computing device based on graph data (such as the composite node identifier of each node).
  • the memory 509 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • a non-volatile memory such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, the storage medium being a volatile storage medium or a non-volatile storage medium,
  • the computer instructions can be sent from one website site, computer, server, or data center to another website site, through wired (such as coaxial cable, optical fiber, twisted pair) or wireless (such as infrared, wireless, microwave, etc.) Computer, server or data center for transmission.
  • the computer-readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid state disk (SSD)).
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components can be combined or It can be integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of this application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , Including several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (read-only memory, ROM), random access memory (random access memory, RAM), magnetic disk or optical disk and other media that can store program code .

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A graph data-based full relationship calculation method and apparatus, a device, and a storage medium, used for merging node attributes into node identifiers using bit operations, avoiding the duplication of node data, reducing the consumption of memory resources, and increasing calculation efficiency. The method comprises: acquiring preprocessed graph data, the preprocessed graph data comprising node data and edge data of each node (101); performing bit operations on the node data to generate a synthesised node identifier of each node (102); dividing the node data and edge data with each node data at the centre to generate multiple data groups, each data group comprising the synthesised node identifier of a current node and edge data connected to the current node (103); sending a single node identifier list of each node to all of the adjacent nodes, the single node identifier list being used for storing the synthesised node identifiers of the adjacent nodes (104); and, on the basis of the single node identifier list received by each node, generating the second-degree relationships of each node (105).

Description

基于图数据的全量关系计算方法、装置、设备及存储介质Full relationship calculation method, device, equipment and storage medium based on graph data
本申请要求于2019年8月15日提交中国专利局、申请号为201910751784.4,发明名称为“基于图数据的全量关系计算方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese patent application filed with the Chinese Patent Office on August 15, 2019, the application number is 201910751784.4, and the invention title is "the method, device, equipment and storage medium for calculating the full relationship based on graph data", all of which The content is incorporated in this application by reference.
技术领域Technical field
本申请涉及大数据技术领域,尤其涉及基于图数据的全量关系计算方法、装置、设备及存储介质。This application relates to the field of big data technology, and in particular to a method, device, device, and storage medium for calculating a full relationship based on graph data.
背景技术Background technique
图数据挖掘是关系挖掘和群体画像中的重要方法。图数据由节点和边组成,图数据中的节点用于表示发生连接的主体,图数据中的边用来表示主体之间的关联。节点通过与其相连的各个边和其他节点产生关联。图计算中的典型应用之一是查找某节点的全量关系并进行统计特征分析,其中,二度关系、三度关系的计算由于计算量和计算资源消耗非常大,成为图计算中的难点。Graph data mining is an important method in relationship mining and group profiling. Graph data is composed of nodes and edges. The nodes in the graph data are used to represent the connected subjects, and the edges in the graph data are used to represent the association between the subjects. A node is associated with other nodes through the edges connected to it. One of the typical applications in graph calculation is to find the full relationship of a certain node and analyze the statistical characteristics. Among them, the calculation of the second-degree relationship and the third-degree relationship has become a difficult point in the graph calculation due to the large amount of calculation and computational resource consumption.
目前图计算的典型环境是Apache软件基金会开源的Spark项目中的GraphX环境。GraphX使用内存计算的策略,可实现快速迭代计算;然而,发明人意识到内存计算对内存资源的耗用量巨大,难以支撑海量数据的处理。以二度、三度关联节点的指标计算来说,使用GraphX对5000万个节点,4亿条边上的图进行计算,要消耗2000GB的内存。对全量关系的计算效率低,在社交网络上,这样的计算效率很难满足亿级到十亿级节点上的应用。The current typical environment for graph computing is the GraphX environment in the Spark project open sourced by the Apache Software Foundation. GraphX uses a memory computing strategy to achieve fast iterative calculations; however, the inventor realized that memory computing consumes a huge amount of memory resources, and it is difficult to support the processing of massive data. In terms of index calculation of second-degree and third-degree associated nodes, using GraphX to calculate graphs on 50 million nodes and 400 million edges will consume 2000GB of memory. The calculation efficiency of the full relationship is low. On social networks, such calculation efficiency is difficult to meet the application of billions to billions of nodes.
发明内容Summary of the invention
本申请提供了基于图数据的全量关系计算方法、装置、设备及存储介质,用于利用位运算合并节点属性到节点标识中,避免节点数据的复制,降低了内存资源的消耗,提高了计算效率。This application provides a method, device, device, and storage medium for calculating a full-scale relationship based on graph data, which are used to merge node attributes into node identifiers using bit operations to avoid duplication of node data, reduce memory resource consumption, and improve computing efficiency .
本申请实施例的第一方面提供一种基于图数据的全量关系计算方法,包括:获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;对所述节点数据进行位运算,生成每个节点的合成节点标识;将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。The first aspect of the embodiments of the present application provides a full-scale relationship calculation method based on graph data, including: obtaining preprocessed graph data, the preprocessed graph data including node data and edge data of each node; Perform a bit operation to generate the synthetic node ID of each node; divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group includes a synthetic node of the current node ID and the edge data connected to the current node; send a single node ID list of each node to all adjacent nodes, the single node ID list is used to store the composite node ID of the adjacent node; according to each node The received single node identification list generates the second-degree relationship of each node.
本申请实施例的第二方面提供了一种基于图数据的全量关系计算装置, 包括:第一获取单元,用于获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;运算生成单元,用于对所述节点数据进行位运算,生成每个节点的合成节点标识;划分生成单元,用于将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;第一发送单元,用于将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;第一生成单元,用于根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。A second aspect of the embodiments of the present application provides a full-scale relationship calculation device based on graph data, including: a first obtaining unit configured to obtain preprocessed graph data, the preprocessed graph data including node data of each node and Edge data; an operation generating unit for performing bit operations on the node data to generate a synthetic node identifier for each node; a division generating unit for centering the node data and the edge data on each node data Divide, generate multiple data groups, each data group includes a composite node ID of the current node and edge data connected to the current node; the first sending unit is used to send the single node ID list of each node to all Adjacent nodes, the single node identification list is used to store the composite node identification of the adjacent nodes; the first generating unit is used to generate the second degree of each node according to the single node identification list received by each node relationship.
本申请实施例的第三方面提供了一种计算机设备,包括:一个或多个处理器;存储器;一个或多个计算机程序,其中所述一个或多个计算机程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个计算机程序配置用于执行一种基于图数据的全量关系计算方法,其中,所述基于图数据的全量关系计算方法包括:The third aspect of the embodiments of the present application provides a computer device, including: one or more processors; a memory; one or more computer programs, wherein the one or more computer programs are stored in the memory and Is configured to be executed by the one or more processors, and the one or more computer programs are configured to execute a full relationship calculation method based on graph data, wherein the full relationship calculation method based on graph data includes :
获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;Acquiring preprocessed graph data, the preprocessed graph data including node data and edge data of each node;
对所述节点数据进行位运算,生成每个节点的合成节点标识;Performing bit operations on the node data to generate a synthetic node identifier for each node;
将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;Dividing the node data and the edge data with each node data as the center to generate a plurality of data groups, each data group including a synthetic node identifier of the current node and the edge data connected to the current node;
将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;Sending a single node identification list of each node to all adjacent nodes, where the single node identification list is used to store the composite node identification of the adjacent nodes;
根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。According to the single node identification list received by each node, the second-degree relationship of each node is generated.
本申请实施例的第四方面提供了一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现基于图数据的全量关系计算方法,其中,所述基于图数据的全量关系计算方法包括以下步骤:A fourth aspect of the embodiments of the present application provides a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is executed by a processor, a method for calculating a full relationship based on graph data is implemented, wherein , The method for calculating the full relationship based on graph data includes the following steps:
获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;Acquiring preprocessed graph data, the preprocessed graph data including node data and edge data of each node;
对所述节点数据进行位运算,生成每个节点的合成节点标识;Performing bit operations on the node data to generate a synthetic node identifier for each node;
将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;Dividing the node data and the edge data with each node data as the center to generate a plurality of data groups, each data group including a synthetic node identifier of the current node and the edge data connected to the current node;
将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;Sending a single node identification list of each node to all adjacent nodes, where the single node identification list is used to store the composite node identification of the adjacent nodes;
根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。According to the single node identification list received by each node, the second-degree relationship of each node is generated.
本申请实施例提供的技术方案中,获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;对所述节点数据进行位运算,生成每个节点的合成节点标识;将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标 识以及与当前节点连接的边数据;将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。利用位运算合并节点属性到节点标识中,并排除自身相连的情况,避免节点数据的复制,降低了内存资源的消耗,提高了计算效率。In the technical solution provided by the embodiment of the present application, preprocessed graph data is obtained, and the preprocessed graph data includes node data and edge data of each node; bit operations are performed on the node data to generate a synthetic node identifier for each node ; Divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group includes a synthetic node identifier of the current node and the edge data connected to the current node; The single-node identification list of each node is sent to all adjacent nodes, and the single-node identification list is used to store the composite node identification of the adjacent nodes; and each node is generated according to the single-node identification list received by each node. The second-degree relationship of the node. Use bit operation to merge node attributes into node identifiers, and eliminate self-connection, avoid node data duplication, reduce memory resource consumption, and improve calculation efficiency.
附图说明Description of the drawings
图1为本申请实施例中基于图数据的全量关系计算方法的一个实施例示意图;FIG. 1 is a schematic diagram of an embodiment of a method for calculating a full relationship based on graph data in an embodiment of the application;
图2为本申请实施例中基于图数据的全量关系计算方法的另一个实施例示意图;2 is a schematic diagram of another embodiment of a method for calculating a full relationship based on graph data in an embodiment of the application;
图3为本申请实施例中基于图数据的全量关系计算装置的一个实施例示意图;Fig. 3 is a schematic diagram of an embodiment of a full relation calculation device based on graph data in an embodiment of the application;
图4为本申请实施例中基于图数据的全量关系计算装置的另一个实施例示意图;FIG. 4 is a schematic diagram of another embodiment of a full relationship calculation device based on graph data in an embodiment of the application;
图5为本申请实施例中基于图数据的全量关系计算设备的一个实施例示意图。FIG. 5 is a schematic diagram of an embodiment of a full relationship calculation device based on graph data in an embodiment of the application.
具体实施方式detailed description
本申请提供了基于图数据的全量关系计算方法、装置、设备及存储介质,用于利用位运算合并节点属性到节点标识中,避免节点数据的复制,降低了内存资源的消耗,提高了计算效率。This application provides a method, device, device, and storage medium for calculating a full-scale relationship based on graph data, which are used to merge node attributes into node identifiers using bit operations to avoid duplication of node data, reduce memory resource consumption, and improve computing efficiency .
请参阅图1,本申请实施例提供的基于图数据的全量关系计算方法的流程图,具体包括:Please refer to Fig. 1, a flowchart of a method for calculating a full relationship based on graph data provided by an embodiment of the present application, which specifically includes:
101、获取预处理图数据,预处理图数据包括每个节点的节点数据和边数据。101. Obtain preprocessed graph data, where the preprocessed graph data includes node data and edge data of each node.
服务器获取预处理图数据,预处理图数据包括每个节点的节点数据和边数据。具体的,服务器通过Spark加载预处理图数据,其中,预处理图数据由节点数据和边数据组成,图数据中的节点数据用于表示发生连接的主体,边数据用来表示主体之间的关联。一个节点通过与其相连的各个边和其他节点产生关联。一个目标节点的节点数据包括目标节点属性和一个目标节点标识,一个目标边数据包括一个目标边标识和与目标边相连的两个节点标识。The server obtains the preprocessed graph data, and the preprocessed graph data includes node data and edge data of each node. Specifically, the server loads the preprocessed graph data through Spark. The preprocessed graph data is composed of node data and edge data. The node data in the graph data is used to indicate the subject that is connected, and the edge data is used to indicate the association between the subjects. . A node is related to other nodes through its connected edges. The node data of a target node includes the target node attribute and a target node identifier, and the target edge data includes a target edge identifier and two node identifiers connected to the target edge.
其中,节点属性包含节点的若干标签数据。例如,节点标签数据可包括一个身份证号,一个手机号,以及3个布尔型变量A、B、C。其中,变量A用于指示用户性别,用1表示男性,用0表示女性;变量B用于指示是否失信,用1表示失信,用0表示未失信;变量C用于指示是否有大学学历,用1表示有大学学历,用0表示没有大学学历。Among them, the node attribute contains several label data of the node. For example, the node label data may include an ID number, a mobile phone number, and three Boolean variables A, B, and C. Among them, variable A is used to indicate the gender of the user, 1 is used to indicate male, 0 is used to indicate female; variable B is used to indicate whether it is dishonest, 1 is used to indicate failure, and 0 is used to indicate no failure; variable C is used to indicate whether there is a university degree, with 1 means a university degree, and 0 means no university degree.
需要说明的是,本申请实施例,预处理图数据为已经对海量原始图数据进行去重处理,并筛选出符合要求的图数据。It should be noted that, in this embodiment of the application, the preprocessed graph data means that a large amount of original graph data has been deduplicated, and the graph data that meets the requirements have been selected.
可以理解的是,本申请的执行主体可以为基于图数据的全量关系计算装置,还可以是终端或者服务器,具体此处不做限定。本申请实施例以服务器为执行主体为例进行说明。It is understandable that the execution subject of this application may be a full-scale relational computing device based on graph data, or may be a terminal or a server, which is not specifically limited here. The embodiment of the present application takes the server as the execution subject as an example for description.
102、对节点数据进行位运算,生成每个节点的合成节点标识。102. Perform a bit operation on the node data to generate a synthetic node identifier for each node.
服务器对节点数据进行位运算,生成每个节点的合成节点标识。具体的,服务器确定节点数据中的多个节点;服务器获取每个节点的节点属性和每个节点对应的初始节点标识;服务器获取预置规则,预置规则包括每个节点标识总共的存储比特位数和每个变量占据的存储位的起止序数;服务器按照每个节点标识总共的存储比特位数和每个变量占据的存储位的起止序数将每个节点的节点属性和初始节点标识进行位运算;服务器生成每个节点的合成节点标识。其中,预置规则包含:(1)每个节点标识总共的存储比特位数;(2)每个变量占据的存储位的起止序数。本实施例以存储身份证号,手机号,以及前述的三个布尔型变量A、B、C为例进行说明。The server performs bit operations on the node data to generate a synthetic node ID for each node. Specifically, the server determines multiple nodes in the node data; the server obtains the node attributes of each node and the initial node identifier corresponding to each node; the server obtains preset rules, which include the total storage bits for each node identifier Number and the starting and ending ordinal number of the storage location occupied by each variable; the server performs bit operations on the node attributes and initial node identification of each node according to the total number of storage bits for each node identification and the starting and ending ordinal number of the storage location occupied by each variable ; The server generates a synthetic node ID for each node. Among them, the preset rules include: (1) the total number of storage bits for each node identification; (2) the starting and ending ordinal numbers of the storage bits occupied by each variable. In this embodiment, storage of ID card number, mobile phone number, and the aforementioned three Boolean variables A, B, and C are taken as examples for description.
例如,服务器通过计算得到身份证号占用61比特,手机号占用37比特,3个布尔型变量一共占用3个比特,总共占用61+37+3=101个比特。这种情况下,预置规则为:(1)每个节点分配101比特存储位;(2)1-61比特用于存储身份证号,第62-98位存储手机号,第99位存储变量A,第100位存储变量B,第101位存储变量C。服务器通过使用位存储,而不是给每个变量单独赋予双整型的存储空间,节约了大量的内存资源。For example, the server calculates that the ID card number occupies 61 bits, the mobile phone number occupies 37 bits, and 3 Boolean variables occupy a total of 3 bits, which occupies 61+37+3=101 bits in total. In this case, the preset rules are: (1) Each node allocates 101 bits of storage; (2) 1-61 bits are used to store the ID number, the 62nd-98th bits store the mobile phone number, and the 99th bit stores the variable A, the 100th bit stores variable B, and the 101st bit stores variable C. The server saves a lot of memory resources by using bit storage instead of individually assigning double integer storage space to each variable.
需要说明的是,初始节点标识是图数据的预处理阶段由用户收集并存放在计算机外存储器(硬盘)上的。开始计算后,服务器将初始节点标识载入到内存储器中。初始节点标识的数据属于原始数据,其收集发生在位运算之前。It should be noted that the initial node identification is collected by the user in the preprocessing stage of the graph data and stored on the external memory (hard disk) of the computer. After starting the calculation, the server loads the initial node identifier into the internal memory. The data identified by the initial node belongs to the original data, and its collection occurs before the bit operation.
103、将节点数据和边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据。103. Divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group including a synthetic node identifier of the current node and edge data connected to the current node.
服务器将节点数据和边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据。The server divides the node data and the edge data with each node data as the center, and generates multiple data groups. Each data group includes a synthetic node identifier of the current node and the edge data connected to the current node.
需要说明的是,图数据包含节点数据和边数据,加载到内存中后,为了进行分布式计算,需要将图数据切割成小的处理单元,即这里说的“数据组”。通过建立以节点为中心的“数据组”,确保了每个节点仅有一个拷贝,仅会出现一次,避免出现多个节点数据的复制。It should be noted that the graph data contains node data and edge data. After being loaded into the memory, in order to perform distributed calculations, the graph data needs to be divided into small processing units, which are referred to as "data groups" here. By establishing a node-centric "data group", it is ensured that each node has only one copy, which will only appear once, and avoiding multiple node data replication.
例如,一个数据组包含本节点标识以及该节点上所有的边数据,其中一条边数据包含了该节点和与其相连接的另外一个节点的合成节点标识,但是不包括相连接节点的任何属性数据。For example, a data group contains the node ID and all the edge data on the node. One edge data contains the composite node ID of the node and another node connected to it, but does not include any attribute data of the connected node.
104、将每个节点的单节点标识列表发送给所有的相邻节点,单节点标识列表用于存储相邻节点的合成节点标识。104. Send the single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of the adjacent nodes.
服务器将每个节点的单节点标识列表发送给所有的相邻节点,单节点标 识列表用于存储每个节点的相邻节点的合成节点标识。The server sends the single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of each node's adjacent nodes.
需要说明的是,单节点标识列表是根据数据组上的边数据收集得到的。因为边数据上包含了当前节点和与当前节点相连的一度关联节点的合成节点标识,通过遍历所有边数据,即可收集全部相连接的节点。单节点标识列表中仅包含相邻节点的合成节点标识。It should be noted that the single node identification list is collected based on the edge data on the data group. Because the edge data contains the synthetic node identification of the current node and the once-associated node connected to the current node, all connected nodes can be collected by traversing all the edge data. The single node identifier list contains only the composite node identifiers of adjacent nodes.
例如,节点a与节点b、c、d是相邻节点。那么首先收集得到所有相邻节点的标识列表(即单节点标识列表),这里的节点标识就是a、b、c、d四个字母(此处为简单举例,实际情况下包含上亿节点,需要用更复杂的节点标识来表示,例如当设备是节点时,用设备的生产信息,如生成日期,流水号等做节点标识)。对节点a来说,获取到的节点标识列表就是[b、c、d]这个列表;然后,节点a将[b、c、d]这个列表分别传递给b、c、d三个节点。For example, node a and nodes b, c, and d are adjacent nodes. Then first collect the identification list of all adjacent nodes (ie, single-node identification list), where the node identification is the four letters a, b, c, d (here is a simple example, in actual situations it contains hundreds of millions of nodes, you need Use more complex node identification to indicate, for example, when the device is a node, use the production information of the device, such as the date of generation, serial number, etc., as the node identification). For node a, the obtained node identification list is the list of [b, c, d]; then, node a passes the list of [b, c, d] to the three nodes b, c, d, respectively.
105、根据每个节点接收到的单节点标识列表,生成各个节点的二度关系。105. According to the single node identification list received by each node, generate the second-degree relationship of each node.
服务器根据每个节点接收到的单节点标识列表,生成各个节点的二度关系。具体的,服务器接收每个节点的单节点标识列表;服务器确定每个节点的自身的合成节点标识;服务器分别将每个节点接收到的单节点标识列表中与自身的合成节点标识相同的节点标识删除;服务器生成每个节点的二度关系,二度关系用于指示二度关联节点和当前节点之间,间隔一个一度关联节点。The server generates the second-degree relationship of each node according to the single node identification list received by each node. Specifically, the server receives the list of single node identifiers of each node; the server determines its own synthetic node identifier for each node; the server separately receives the same node identifier as its own synthetic node identifier in the list of single node identifiers received by each node Delete; the server generates a second-degree relationship for each node, and the second-degree relationship is used to indicate that there is an interval between the second-degree associated node and the current node.
需要说明的是,因为自身的合成节点标识属于0度关系,非两度关系。将列表中与自身相同的合成节点标识删除,即完成了将自身节点排除。例如,根据上文描述的场景,节点a与节点b、c、d相连,节点b、c、d不直接连接,那么节点b收到节点a发送过来的列表[b、c、d]之后,将列表中的b删除,列表中还余下[c、d]两个节点,这两个节点即是b的二度关系。It should be noted that because its own synthetic node identifier belongs to a 0-degree relationship, not a two-degree relationship. Delete the synthetic node identifier that is the same as the self in the list, which completes the elimination of the self node. For example, according to the scenario described above, node a is connected to nodes b, c, and d, and nodes b, c, and d are not directly connected, then after node b receives the list [b, c, d] sent by node a, Delete b from the list, and there are two nodes [c, d] left in the list. These two nodes are the second-degree relationship of b.
可以理解的是,一度关系是指两个节点之间相连,即相邻节点;二度关系是指两个节点中间隔一个节点。例如上面的例子,节点b和节点c之间隔了节点a,因此节点b和节点c是二度关联,即节点b和节点c具有二度关系。It can be understood that a first-degree relationship refers to the connection between two nodes, that is, adjacent nodes; a second-degree relationship refers to a node between two nodes. For example, in the above example, node b and node c are separated by node a, so node b and node c are second-degree related, that is, node b and node c have a second-degree relationship.
可选的,步骤105之后,还包括:服务器根据每个节点的二度关系获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析,生成分析结果。具体的,服务器读取每个节点的二度关系;服务器从二度关系中确定每个节点的节点属性;服务器按照预置规则将每个节点的节点属性从合成节点标识中分离;服务器对每个节点的节点属性进行统计分析,生成分析结果。Optionally, after step 105, the method further includes: the server obtains the node attribute of each node according to the second-degree relationship of each node, and performs statistical analysis according to the node attribute of each node to generate an analysis result. Specifically, the server reads the second-degree relationship of each node; the server determines the node attribute of each node from the second-degree relationship; the server separates the node attribute of each node from the synthetic node identification according to preset rules; Perform statistical analysis on the node attributes of each node to generate analysis results.
本申请实施例,通过以节点数据为中心进行划分最小处理单元,避免生成大量重复的节点数据,减少了节点数据所占用的内存资源,节省了大量的计算资源,提高了计算效率。In the embodiment of the present application, by dividing the minimum processing unit with node data as the center, a large amount of repeated node data is avoided, memory resources occupied by node data are reduced, a large amount of computing resources are saved, and computing efficiency is improved.
请参阅图2,本申请实施例提供的基于图数据的全量关系计算方法的另一个流程图,具体包括:Please refer to FIG. 2, another flowchart of the full relationship calculation method based on graph data provided by the embodiment of the present application, which specifically includes:
201、获取预处理图数据,预处理图数据包括每个节点的节点数据和边数据。201. Obtain preprocessed graph data. The preprocessed graph data includes node data and edge data of each node.
服务器获取预处理图数据,预处理图数据包括每个节点的节点数据和边 数据。具体的,服务器通过Spark加载预处理图数据,其中,预处理图数据由节点数据和边数据组成,图数据中的节点数据用于表示发生连接的主体,边数据用来表示主体之间的关联。一个节点通过与其相连的各个边和其他节点产生关联。一个目标节点的节点数据包括目标节点属性和一个目标节点标识,一个目标边数据包括一个目标边标识和与目标边相连的两个节点标识。The server obtains the preprocessed graph data, and the preprocessed graph data includes the node data and edge data of each node. Specifically, the server loads the preprocessed graph data through Spark. The preprocessed graph data is composed of node data and edge data. The node data in the graph data is used to indicate the subject that is connected, and the edge data is used to indicate the association between the subjects. . A node is related to other nodes through its connected edges. The node data of a target node includes the target node attribute and a target node identifier, and the target edge data includes a target edge identifier and two node identifiers connected to the target edge.
其中,节点属性包含节点的若干标签数据。例如,节点标签数据可包括一个身份证号,一个手机号,以及3个布尔型变量A、B、C。其中,变量A用于指示用户性别,用1表示男性,用0表示女性;变量B用于指示是否失信,用1表示失信,用0表示未失信;变量C用于指示是否有大学学历,用1表示有大学学历,用0表示没有大学学历。Among them, the node attribute contains several label data of the node. For example, the node label data may include an ID number, a mobile phone number, and three Boolean variables A, B, and C. Among them, variable A is used to indicate the gender of the user, 1 is used to indicate male, 0 is used to indicate female; variable B is used to indicate whether it is dishonest, 1 is used to indicate failure, and 0 is used to indicate no failure; variable C is used to indicate whether there is a university degree, with 1 means a university degree, and 0 means no university degree.
需要说明的是,本申请实施例,预处理图数据为已经对海量原始图数据进行去重处理,并筛选出符合要求的图数据。It should be noted that, in this embodiment of the application, the preprocessed graph data means that a large amount of original graph data has been deduplicated, and the graph data that meets the requirements have been selected.
可以理解的是,本申请的执行主体可以为基于图数据的全量关系计算装置,还可以是终端或者服务器,具体此处不做限定。本申请实施例以服务器为执行主体为例进行说明。It is understandable that the execution subject of this application may be a full-scale relational computing device based on graph data, or may be a terminal or a server, which is not specifically limited here. The embodiment of the present application takes the server as the execution subject as an example for description.
202、对节点数据进行位运算,生成每个节点的合成节点标识。202. Perform a bit operation on the node data to generate a synthetic node identifier for each node.
服务器对节点数据进行位运算,生成每个节点的合成节点标识。具体的,服务器确定节点数据中的多个节点;服务器获取每个节点的节点属性和每个节点对应的初始节点标识;服务器获取预置规则,预置规则包括每个节点标识总共的存储比特位数和每个变量占据的存储位的起止序数;服务器按照每个节点标识总共的存储比特位数和每个变量占据的存储位的起止序数将每个节点的节点属性和初始节点标识进行位运算;服务器生成每个节点的合成节点标识。其中,预置规则包含:(1)每个节点标识总共的存储比特位数;(2)每个变量占据的存储位的起止序数。本实施例以存储身份证号,手机号,以及前述的三个布尔型变量A、B、C为例进行说明。The server performs bit operations on the node data to generate a synthetic node ID for each node. Specifically, the server determines multiple nodes in the node data; the server obtains the node attributes of each node and the initial node identifier corresponding to each node; the server obtains preset rules, which include the total storage bits for each node identifier Number and the starting and ending ordinal number of the storage location occupied by each variable; the server performs bit operations on the node attributes and initial node identification of each node according to the total number of storage bits for each node identification and the starting and ending ordinal number of the storage location occupied by each variable ; The server generates a synthetic node ID for each node. Among them, the preset rules include: (1) the total number of storage bits for each node identification; (2) the starting and ending ordinal numbers of the storage bits occupied by each variable. In this embodiment, storage of ID card number, mobile phone number, and the aforementioned three Boolean variables A, B, and C are taken as examples for description.
例如,服务器通过计算得到身份证号占用61比特,手机号占用37比特,3个布尔型变量一共占用3个比特,总共占用61+37+3=101个比特。这种情况下,预置规则为:(1)每个节点分配101比特存储位;(2)1-61比特用于存储身份证号,第62-98位存储手机号,第99位存储变量A,第100位存储变量B,第101位存储变量C。服务器通过使用位存储,而不是给每个变量单独赋予双整型的存储空间,节约了大量的内存资源。For example, the server calculates that the ID card number occupies 61 bits, the mobile phone number occupies 37 bits, and 3 Boolean variables occupy a total of 3 bits, which occupies 61+37+3=101 bits in total. In this case, the preset rules are: (1) Each node allocates 101 bits of storage; (2) 1-61 bits are used to store the ID number, the 62nd-98th bits store the mobile phone number, and the 99th bit stores the variable A, the 100th bit stores variable B, and the 101st bit stores variable C. The server saves a lot of memory resources by using bit storage instead of individually assigning double integer storage space to each variable.
需要说明的是,初始节点标识是图数据的预处理阶段由用户收集并存放在计算机外存储器(硬盘)上的。开始计算后,服务器将初始节点标识载入到内存储器中。初始节点标识的数据属于原始数据,其收集发生在位运算之前。It should be noted that the initial node identification is collected by the user in the preprocessing stage of the graph data and stored on the external memory (hard disk) of the computer. After starting the calculation, the server loads the initial node identifier into the internal memory. The data identified by the initial node belongs to the original data, and its collection occurs before the bit operation.
203、将节点数据和边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据。203. Divide the node data and the edge data with each node data as the center, and generate multiple data groups, each data group including a synthetic node identifier of the current node and the edge data connected to the current node.
服务器将节点数据和边数据以每个节点数据为中心进行划分,生成多个 数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据。The server divides the node data and edge data with each node data as the center, and generates multiple data groups. Each data group includes a synthetic node identifier of the current node and the edge data connected to the current node.
需要说明的是,图数据包含节点数据和边数据,加载到内存中后,为了进行分布式计算,需要将图数据切割成小的处理单元,即这里说的“数据组”。通过建立以节点为中心的“数据组”,确保了每个节点仅有一个拷贝,仅会出现一次,避免出现多个节点数据的复制。It should be noted that the graph data contains node data and edge data. After being loaded into the memory, in order to perform distributed calculations, the graph data needs to be divided into small processing units, which are referred to as "data groups" here. By establishing a node-centric "data group", it is ensured that each node has only one copy, which will only appear once, and avoiding multiple node data replication.
例如,一个数据组包含本节点标识以及该节点上所有的边数据,其中一条边数据包含了该节点和与其相连接的另外一个节点的合成节点标识,但是不包括相连接节点的任何属性数据。For example, a data group contains the node ID and all the edge data on the node. One edge data contains the composite node ID of the node and another node connected to it, but does not include any attribute data of the connected node.
204、将每个节点的单节点标识列表发送给所有的相邻节点,单节点标识列表用于存储相邻节点的合成节点标识。204. Send the single node identification list of each node to all adjacent nodes, where the single node identification list is used to store the composite node identification of the adjacent nodes.
服务器将每个节点的单节点标识列表发送给所有的相邻节点,单节点标识列表用于存储每个节点的相邻节点的合成节点标识。The server sends the single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of the adjacent nodes of each node.
需要说明的是,单节点标识列表是根据数据组上的边数据收集得到的。因为边数据上包含了当前节点和与当前节点相连的一度关联节点的合成节点标识,通过遍历所有边数据,即可收集全部相连接的节点。单节点标识列表中仅包含相邻节点的合成节点标识。It should be noted that the single node identification list is collected based on the edge data on the data group. Because the edge data contains the synthetic node identification of the current node and the once-associated node connected to the current node, all connected nodes can be collected by traversing all the edge data. The single node identifier list contains only the composite node identifiers of adjacent nodes.
例如,节点a与节点b、c、d是相邻节点。那么首先收集得到所有相邻节点的标识列表(即单节点标识列表),这里的节点标识就是a、b、c、d四个字母(此处为简单举例,实际情况下包含上亿节点,需要用更复杂的节点标识来表示,例如当设备是节点时,用设备的生产信息,如生成日期,流水号等做节点标识)。对节点a来说,获取到的节点标识列表就是[b、c、d]这个列表;然后,节点a将[b、c、d]这个列表分别传递给b、c、d三个节点。For example, node a and nodes b, c, and d are adjacent nodes. Then first collect the identification list of all adjacent nodes (ie, single-node identification list), where the node identification is the four letters a, b, c, d (here is a simple example, in actual situations it contains hundreds of millions of nodes, you need Use more complex node identification to indicate, for example, when the device is a node, use the production information of the device, such as the date of generation, serial number, etc., as the node identification). For node a, the obtained node identification list is the list of [b, c, d]; then, node a passes the list of [b, c, d] to the three nodes b, c, d, respectively.
205、根据每个节点接收到的单节点标识列表,生成各个节点的二度关系。205. According to the single node identification list received by each node, generate a second-degree relationship of each node.
服务器根据每个节点接收到的单节点标识列表,生成各个节点的二度关系。具体的,服务器接收每个节点的单节点标识列表;服务器确定每个节点的自身的合成节点标识;服务器分别将每个节点接收到的单节点标识列表中与自身的合成节点标识相同的节点标识删除;服务器生成每个节点的二度关系,二度关系用于指示二度关联节点和当前节点之间,间隔一个一度关联节点。The server generates the second-degree relationship of each node according to the single node identification list received by each node. Specifically, the server receives the list of single node identifiers of each node; the server determines its own synthetic node identifier for each node; the server separately receives the same node identifier as its own synthetic node identifier in the list of single node identifiers received by each node Delete; the server generates a second-degree relationship for each node, and the second-degree relationship is used to indicate that there is an interval between the second-degree associated node and the current node.
需要说明的是,因为自身的合成节点标识属于0度关系,非两度关系。将列表中与自身相同的合成节点标识删除,即完成了将自身节点排除。例如,根据上文描述的场景,节点a与节点b、c、d相连,节点b、c、d不直接连接,那么节点b收到节点a发送过来的列表[b、c、d]之后,将列表中的b删除,列表中还余下[c、d]两个节点,这两个节点即是b的二度关系。It should be noted that because its own synthetic node identifier belongs to a 0-degree relationship, not a two-degree relationship. Delete the synthetic node identifier that is the same as the self in the list, which completes the elimination of the self node. For example, according to the scenario described above, node a is connected to nodes b, c, and d, and nodes b, c, and d are not directly connected, then after node b receives the list [b, c, d] sent by node a, Delete b from the list, and there are two nodes [c, d] left in the list. These two nodes are the second-degree relationship of b.
可以理解的是,一度关系是指两个节点之间相连,即相邻节点;二度关系是指两个节点之间间隔一个节点。例如上面的例子,节点b和节点c之间隔了节点a,因此节点b和节点c是二度关联,即节点b和节点c具有二度关系。It can be understood that the first-degree relationship refers to the connection between two nodes, that is, adjacent nodes; the second-degree relationship refers to the distance between two nodes by one node. For example, in the above example, node b and node c are separated by node a, so node b and node c are second-degree related, that is, node b and node c have a second-degree relationship.
206、根据各个节点的二度关系生成二度关系标识列表,二度关系标识列表用于存储各个节点的二度关系。206. Generate a second-degree relationship identifier list according to the second-degree relationship of each node, where the second-degree relationship identifier list is used to store the second-degree relationship of each node.
服务器根据各个节点的二度关系生成二度关系标识列表,二度关系标识列表用于存储各个节点的二度关系。The server generates a second-degree relationship identifier list according to the second-degree relationship of each node, and the second-degree relationship identifier list is used to store the second-degree relationship of each node.
207、将每个节点的二度关系标识列表发送给所有的相邻节点。207. Send the second-degree relationship identifier list of each node to all adjacent nodes.
服务器将每个节点的二度关系标识列表发送给所有的相邻节点,二度关系标识列表用于存储每个节点的相邻节点的合成节点标识。The server sends the second-degree relationship identifier list of each node to all neighboring nodes, and the second-degree relationship identifier list is used to store the composite node identifier of the neighboring nodes of each node.
208、根据每个节点接收到的相邻节点发送的二度关系标识列表生成每个节点的三度关系标识列表,三度关系标识列表用于存储各个节点的三度关系,三度关系用于指示三度关联节点和当前节点之间,间隔一个一度关联节点和一个二度关联节点。208. Generate a three-degree relationship identifier list for each node according to the two-degree relationship identifier list sent by neighboring nodes received by each node. The three-degree relationship identifier list is used to store the three-degree relationship of each node, and the third-degree relationship is used for Indicates that there is a first-degree associated node and a second-degree associated node between the three-degree associated node and the current node.
服务器根据每个节点接收到的相邻节点发送的二度关系标识列表生成每个节点的三度关系标识列表,三度关系标识列表用于存储各个节点的三度关系,三度关系用于指示三度关联节点和当前节点之间,间隔一个一度关联节点和一个二度关联节点。具体的,服务器获取每个节点的二度关系标识列表;服务器确定每个节点的自身的合成节点标识;服务器分别将每个节点接收到的二度关系标识列表中与自身的合成节点标识相同的节点标识删除;服务器确定每个节点的三度关系,三度关系用于指示两个节点之间,间隔两个节点(即指示三度关联节点和当前节点之间间隔一个一度关联节点和一个二度关联节点);生成三度关系标识列表,三度关系标识列表用于存储每个节点的三度关系。The server generates a three-degree relationship identifier list for each node according to the second-degree relationship identifier list received by each node from neighboring nodes. The three-degree relationship identifier list is used to store the three-degree relationship of each node, and the third-degree relationship is used to indicate Between the third-degree associated node and the current node, there is a first-degree associated node and a second-degree associated node. Specifically, the server obtains the second-degree relationship identifier list of each node; the server determines each node's own synthetic node identifier; the server separately receives the second-degree relationship identifier list received by each node that is the same as its own synthetic node identifier The node ID is deleted; the server determines the three-degree relationship of each node, and the three-degree relationship is used to indicate the interval between two nodes (that is, the interval between the three-degree associated node and the current node is a one-degree associated node and a two-degree associated node. Degree-related nodes); generate a three-degree relationship identifier list, and the three-degree relationship identifier list is used to store the three-degree relationship of each node.
需要说明的是,因为自身的合成节点标识属于0度关系,非三度关系。将列表中与自身相同的节点标识删除,即完成了将自身节点排除。同时,需要在三度关系列表中排除一度和二度关系。例如,节点a与节点b、c、d相连,节点e与节点b相连,节点b、c、d不直接连接,那么节点b收到节点a发送过来的列表[b、c、d]之后,将列表中的b删除,列表中还余下[c、d]两个节点。这两个节点即是b的二度关系。节点b继续将列表[c、d]发送至节点e,则得到节点e的三度关系列表,即节点e与节点c,e与d形成三度关系。这里不存在自身相连的情况(即自身的合成节点标识),同时,节点b也会将节点标识列表[c、d]发送至节点a。然而,由于列表[c、d]同时存在于节点a的一度关系列表[b、c、d]中,因此,c和d需要排除,不与a形成三度关系。It should be noted that because its own synthetic node identifier belongs to a 0 degree relationship, not a three degree relationship. Delete the node ID that is the same as itself in the list, that is, to exclude the own node. At the same time, first-degree and second-degree relations need to be excluded from the list of third-degree relations. For example, node a is connected to nodes b, c, and d, node e is connected to node b, and nodes b, c, and d are not directly connected, then node b receives the list [b, c, d] sent by node a, Delete b from the list, and there are two nodes [c, d] left in the list. These two nodes are the second degree relationship of b. The node b continues to send the list [c, d] to the node e, and then obtains the three-degree relationship list of the node e, that is, the node e and the node c, and the e and d form a three-degree relationship. There is no case where it is connected to itself (that is, its own synthetic node ID), and at the same time, node b will also send the node ID list [c, d] to node a. However, since the list [c, d] also exists in the one-degree relationship list [b, c, d] of the node a, c and d need to be excluded and do not form a three-degree relationship with a.
本申请实施例,通过以节点数据为中心进行划分最小处理单元,避免生成大量重复的节点数据,减少了节点数据所占用的内存资源,节省了大量的计算资源,提高了计算效率。In the embodiment of the present application, by dividing the minimum processing unit with node data as the center, a large amount of repeated node data is avoided, memory resources occupied by node data are reduced, a large amount of computing resources are saved, and computing efficiency is improved.
可选的,步骤208之后,还包括:服务器根据每个节点的三度关系标识列表获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析。具体的,这里的分离是说将节点属性从合成节点标识读出的过程。和前面数据读入的过程是对应的,遵循统一的节点属性预置规则。使用前述的例子, 用101个比特位存储身份证号,手机号,和A、B、C三个布尔型变量。计算完成后,将这些数据读出时,取出第1-61比特位即读出身份证号,第62-98位读出手机号,从第99位读出变量A,第100位读出变量B,第101位读出变量C,即得到节点属性。Optionally, after step 208, the method further includes: the server obtains the node attributes of each node according to the three-degree relationship identification list of each node, and performs statistical analysis according to the node attributes of each node. Specifically, the separation here refers to the process of reading the node attributes from the synthetic node identifier. It corresponds to the previous data reading process and follows the unified node attribute preset rules. Using the aforementioned example, use 101 bits to store the ID number, mobile phone number, and three Boolean variables of A, B, and C. After the calculation is completed, when reading the data, take out the 1st to 61st bit to read the ID number, the 62nd to 98th bit to read the mobile phone number, the 99th bit to read the variable A, the 100th bit to read the variable B, read the variable C at the 101st bit to get the node attribute.
本申请实施例中,按照业务需要对分离得到的节点属性进行统计分析,提高了计算效率。例如统计三度关系中所有值得推荐的朋友,或者统计所有好信用的用户等等。In the embodiments of the present application, the separated node attributes are statistically analyzed according to business needs, which improves the calculation efficiency. For example, count all the friends worth recommending in the three-degree relationship, or count all the users with good credit, and so on.
可选的,步骤201之前,还包括:服务器获取每个节点的原始图数据;服务器对原始图数据进行去重处理和校验处理;服务器生成符合要求的预处理图数据。Optionally, before step 201, the method further includes: the server obtains the original graph data of each node; the server performs deduplication and verification processing on the original graph data; and the server generates preprocessed graph data that meets the requirements.
需要说明的是,本申请实施例是单纯的以节点为中心的列表,支持所有Spark在弹性分布式数据集(resilientdistributeddataset,RDD)上的优化,例如RDD上的内存参数、存储方式、map的计算策略等。It should be noted that the embodiments of this application are purely node-centric lists, which support all Spark optimizations on resilient distributed datasets (RDD), such as memory parameters, storage methods, and map calculations on RDDs. Strategy etc.
上面对本申请实施例中基于图数据的全量关系计算方法进行了描述,下面对本申请实施例中基于图数据的全量关系计算装置进行描述,请参阅图3,本申请实施例中基于图数据的全量关系计算装置的一个实施例包括:The above describes the full relationship calculation method based on graph data in the embodiment of this application. The following describes the full relationship calculation device based on graph data in the embodiment of this application. Please refer to FIG. 3, the full relationship calculation method based on graph data in the embodiment of this application. An embodiment of the relational computing device includes:
第一获取单元301,用于获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;The first obtaining unit 301 is configured to obtain preprocessed graph data, where the preprocessed graph data includes node data and edge data of each node;
运算生成单元302,用于对所述节点数据进行位运算,生成每个节点的合成节点标识;The operation generating unit 302 is configured to perform a bit operation on the node data to generate a synthetic node identifier of each node;
划分生成单元303,用于将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;The dividing and generating unit 303 is configured to divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group including a composite node identifier of the current node and connection with the current node Edge data;
第一发送单元304,用于将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;The first sending unit 304 is configured to send a single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of the adjacent nodes;
第一生成单元305,用于根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。The first generating unit 305 is configured to generate the second degree relationship of each node according to the single node identification list received by each node.
本申请实施例,通过以节点数据为中心进行划分最小处理单元,避免生成大量重复的节点数据,减少了节点数据所占用的内存资源,节省了大量的计算资源,提高了计算效率。In the embodiment of the present application, by dividing the minimum processing unit with node data as the center, a large amount of repeated node data is avoided, memory resources occupied by node data are reduced, a large amount of computing resources are saved, and computing efficiency is improved.
请参阅图4,本申请实施例中基于图数据的全量关系计算装置的另一个实施例包括:Please refer to FIG. 4, another embodiment of the full relation calculation device based on graph data in the embodiment of the present application includes:
第一获取单元301,用于获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;The first obtaining unit 301 is configured to obtain preprocessed graph data, where the preprocessed graph data includes node data and edge data of each node;
运算生成单元302,用于对所述节点数据进行位运算,生成每个节点的合成节点标识;The operation generating unit 302 is configured to perform a bit operation on the node data to generate a synthetic node identifier of each node;
划分生成单元303,用于将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;The dividing and generating unit 303 is configured to divide the node data and the edge data with each node data as the center to generate multiple data groups, each data group including a composite node identifier of the current node and connection with the current node Edge data;
第一发送单元304,用于将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;The first sending unit 304 is configured to send a single node identification list of each node to all adjacent nodes, and the single node identification list is used to store the composite node identification of the adjacent nodes;
第一生成单元305,用于根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。The first generating unit 305 is configured to generate the second degree relationship of each node according to the single node identification list received by each node.
可选的,运算生成单元302具体用于:Optionally, the operation generating unit 302 is specifically configured to:
确定所述节点数据中的多个节点;获取每个节点的节点属性和每个节点对应的初始节点标识;获取预置规则,所述预置规则包括每个节点标识总共的存储比特位数和每个变量占据的存储位的起止序数;按照所述每个节点标识总共的存储比特位数和所述每个变量占据的存储位的起止序数,将每个节点的节点属性和所述初始节点标识进行位运算;生成每个节点的合成节点标识。Determine multiple nodes in the node data; obtain the node attributes of each node and the initial node identifier corresponding to each node; obtain preset rules, the preset rules including the total number of storage bits for each node identifier and The starting and ending ordinal number of the storage position occupied by each variable; according to the total number of storage bits for each node identification and the starting and ending ordinal number of the storage position occupied by each variable, the node attributes of each node and the initial node The identification performs bitwise operations; the synthetic node identification of each node is generated.
可选的,基于图数据的全量关系计算装置还包括:Optionally, the full relationship calculation device based on graph data further includes:
统计分析单元306,用于根据每个节点的二度关系获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析,生成分析结果。The statistical analysis unit 306 is configured to obtain the node attribute of each node according to the second-degree relationship of each node, and perform statistical analysis according to the node attribute of each node to generate an analysis result.
可选的,统计分析单元306具体还用于:Optionally, the statistical analysis unit 306 is specifically used to:
读取每个节点的二度关系;从所述二度关系中确定每个节点的节点属性;按照预置规则将所述每个节点的节点属性从所述合成节点标识中分离;对所述每个节点的节点属性进行统计分析,生成分析结果。Read the second-degree relationship of each node; determine the node attribute of each node from the second-degree relationship; separate the node attribute of each node from the synthetic node identifier according to preset rules; Perform statistical analysis on the node attributes of each node to generate analysis results.
可选的,第一生成单元305具体用于:Optionally, the first generating unit 305 is specifically configured to:
接收每个节点的单节点标识列表;确定每个节点的自身相连情况;将接收到的单节点标识列表中与自身相同的合成节点标识删除;生成每个节点的二度关系,所述二度关系用于指示二度关联节点和当前节点之间,间隔一个一度关联节点。Receive the single-node identification list of each node; determine the connection status of each node; delete the synthetic node identification that is the same as itself in the received single-node identification list; generate the second degree relationship of each node, the second degree The relationship is used to indicate the second degree associated node and the current node, and there is an interval of one degree associated node.
可选的,基于图数据的全量关系计算装置还包括:Optionally, the full relationship calculation device based on graph data further includes:
第二生成单元307,用于根据所述各个节点的二度关系生成二度关系标识列表,所述二度关系标识列表用于存储各个节点的二度关系;The second generating unit 307 is configured to generate a second-degree relationship identifier list according to the second-degree relationship of each node, and the second-degree relationship identifier list is used to store the second-degree relationship of each node;
第二发送单元308,用于将每个节点的二度关系标识列表发送给所有的相邻节点;The second sending unit 308 is configured to send the second-degree relationship identifier list of each node to all adjacent nodes;
第三生成单元309,用于根据每个节点接收到的相邻节点发送的二度关系标识列表生成每个节点的三度关系标识列表,所述三度关系标识列表用于存储各个节点的三度关系,所述三度关系用于指示三度关联节点和当前节点之间,间隔一个一度关联节点和一个二度关联节点。The third generating unit 309 is configured to generate a third-degree relationship identifier list of each node according to the second-degree relationship identifier list sent by neighboring nodes received by each node, and the third-degree relationship identifier list is used to store the three-degree relationship identifier list of each node. Degree relationship, the three-degree relationship is used to indicate that there is an interval between a first-degree associated node and a second-degree associated node between the third-degree associated node and the current node.
可选的,基于图数据的全量关系计算装置还包括:Optionally, the full relationship calculation device based on graph data further includes:
第二获取单元310,用于获取每个节点的原始图数据;The second obtaining unit 310 is configured to obtain the original graph data of each node;
处理单元311,用于对所述原始图数据进行去重处理和校验处理;The processing unit 311 is configured to perform deduplication processing and verification processing on the original image data;
第四生成单元312,用于生成符合要求的预处理图数据。The fourth generating unit 312 is used to generate preprocessed image data that meets the requirements.
本申请实施例,通过以节点数据为中心进行划分最小处理单元,避免生成大量重复的节点数据,减少了节点数据所占用的内存资源,节省了大量的计算资源,提高了计算效率。In the embodiment of the present application, by dividing the minimum processing unit with node data as the center, a large amount of repeated node data is avoided, memory resources occupied by node data are reduced, a large amount of computing resources are saved, and computing efficiency is improved.
上面图3至图4从模块化功能实体的角度对本申请实施例中的基于图数据的全量关系计算装置进行详细描述,下面从硬件处理的角度对本申请实施例中基于图数据的全量关系计算设备进行详细描述。The above Figures 3 to 4 describe in detail the full relationship calculation device based on graph data in the embodiment of the present application from the perspective of a modular functional entity. The following describes the full relationship calculation device based on graph data in the embodiment of the present application from the perspective of hardware processing Give a detailed description.
图5是本申请实施例提供的一种基于图数据的全量关系计算设备的结构示意图,该基于图数据的全量关系计算设备500可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(centralprocessingunits,CPU)501(例如,一个或一个以上处理器)和存储器509,一个或一个以上存储应用程序507或数据506的存储介质508(例如一个或一个以上海量存储设备)。其中,存储器509和存储介质508可以是短暂存储或持久存储。存储在存储介质508的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对基于图数据的全量关系计算设备中的一系列指令操作。更进一步地,处理器501可以设置为与存储介质508通信,在基于图数据的全量关系计算设备500上执行存储介质508中的一系列指令操作。FIG. 5 is a schematic structural diagram of a full relational computing device based on graph data provided by an embodiment of the present application. The full relational computing device 500 based on graph data may have relatively large differences due to different configurations or performances, and may include one or One or more processors (central processing units, CPU) 501 (for example, one or more processors) and a memory 509, and one or more storage media 508 (for example, one or one storage device with a large amount of storage) storing application programs 507 or data 506. Among them, the memory 509 and the storage medium 508 may be short-term storage or persistent storage. The program stored in the storage medium 508 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the full relational computing device based on graph data. Further, the processor 501 may be configured to communicate with the storage medium 508, and execute a series of instruction operations in the storage medium 508 on the full relationship computing device 500 based on graph data.
基于图数据的全量关系计算设备500还可以包括一个或一个以上电源502,一个或一个以上有线或无线网络接口503,一个或一个以上输入输出接口504,和/或,一个或一个以上操作系统505,例如Windows Serve,MacOS X,Unix,Linux,FreeBSD等等。本领域技术人员可以理解,图5中示出的基于图数据的全量关系计算设备结构并不构成对基于图数据的全量关系计算设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。处理器501可以执行上述实施例中第一获取单元301、运算生成单元302、划分生成单元303、第一生成单元305、统计分析单元306、第二生成单元307、第三生成单元309、第二获取单元310、处理单元311和第四生成单元312的功能。The full relational computing device 500 based on graph data may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input and output interfaces 504, and/or, one or more operating systems 505 , Such as Windows Serve, MacOS X, Unix, Linux, FreeBSD, etc. Those skilled in the art can understand that the structure of the full relationship computing device based on graph data shown in FIG. 5 does not constitute a limitation on the full relationship computing device based on graph data, and may include more or less components than shown in the figure. Or combine certain components, or different component arrangements. The processor 501 can execute the first acquisition unit 301, the operation generation unit 302, the division generation unit 303, the first generation unit 305, the statistical analysis unit 306, the second generation unit 307, the third generation unit 309, and the second generation unit 301 in the above embodiments. Functions of the acquiring unit 310, the processing unit 311, and the fourth generating unit 312.
下面结合图5对基于图数据的全量关系计算设备的各个构成部件进行具体的介绍:In the following, the components of the full relational computing device based on graph data will be specifically introduced in conjunction with Figure 5:
处理器501是基于图数据的全量关系计算设备的控制中心,可以按照设置的基于图数据的全量关系计算方法进行处理。处理器501利用各种接口和线路连接整个基于图数据的全量关系计算设备的各个部分,通过运行或执行存储在存储器509内的软件程序和/或模块,以及调用存储在存储器509内的数据,执行基于图数据的全量关系计算设备的各种功能和处理数据,通过以节点数据为中心进行划分最小处理单元,避免生成大量重复的节点数据,减少了节点数据所占用的内存资源,节省了大量的计算资源,提高了计算效率。存储介质508和存储器509都是存储数据的载体,本申请实施例中,存储介质508可以是指储存容量较小,但速度快的内存储器,而存储器509可以是储存容量大,但储存速度慢的外存储器。The processor 501 is the control center of the full relationship calculation device based on graph data, and can perform processing in accordance with the set full relationship calculation method based on graph data. The processor 501 uses various interfaces and lines to connect various parts of the entire graph data-based full relational computing device, by running or executing software programs and/or modules stored in the memory 509, and calling data stored in the memory 509, Perform various functions and processing data of full relational computing equipment based on graph data. By dividing the minimum processing unit with node data as the center, avoid generating a large amount of repeated node data, reducing the memory resources occupied by node data, and saving a lot of Computing resources to improve computing efficiency. The storage medium 508 and the memory 509 are both carriers for storing data. In the embodiment of the present application, the storage medium 508 may refer to an internal memory with a small storage capacity but high speed, and the storage 509 may have a large storage capacity but a slow storage speed. External memory.
存储器509可用于存储软件程序以及模块,处理器501通过运行存储在存储器509的软件程序以及模块,从而执行基于图数据的全量关系计算设备500的各种功能应用以及数据处理。存储器509可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序 (比如将每个节点的单节点标识列表发送给所有的相邻节点,单节点标识列表用于存储相邻节点的合成节点标识)等;存储数据区可存储根据基于图数据的全量关系计算设备的使用所创建的数据(比如每个节点的合成节点标识)等。此外,存储器509可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。在本申请实施例中提供的基于图数据的全量关系计算方法程序和接收到的数据流存储在存储器中,当需要使用时,处理器501从存储器509中调用。The memory 509 may be used to store software programs and modules. The processor 501 executes various functional applications and data processing of the full relational computing device 500 based on graph data by running the software programs and modules stored in the memory 509. The memory 509 may mainly include a program storage area and a data storage area. The storage program area may store an operating system and at least one application program required by a function (for example, sending a single node identification list of each node to all adjacent nodes, The single node identifier list is used to store the composite node identifiers of adjacent nodes, etc.; the storage data area can store data created according to the use of the full relationship computing device based on graph data (such as the composite node identifier of each node). In addition, the memory 509 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. The full relation calculation method program based on graph data and the received data stream provided in the embodiment of the present application are stored in the memory, and the processor 501 is called from the memory 509 when needed.
在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,所述存储介质为易失性存储介质或非易失性存储介质,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、双绞线)或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,光盘)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。When the computer program instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium, the storage medium being a volatile storage medium or a non-volatile storage medium, For example, the computer instructions can be sent from one website site, computer, server, or data center to another website site, through wired (such as coaxial cable, optical fiber, twisted pair) or wireless (such as infrared, wireless, microwave, etc.) Computer, server or data center for transmission. The computer-readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a server or data center integrated with one or more available media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a solid state disk (SSD)).
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and conciseness of description, the specific working process of the above-described system, device, and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components can be combined or It can be integrated into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
另外,在本申请实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本 申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the technical solution of this application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , Including several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (read-only memory, ROM), random access memory (random access memory, RAM), magnetic disk or optical disk and other media that can store program code .
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that: The technical solutions recorded in the embodiments are modified, or some of the technical features are equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. 一种基于图数据的全量关系计算方法,其中,包括:A calculation method for full relationship based on graph data, including:
    获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;Acquiring preprocessed graph data, the preprocessed graph data including node data and edge data of each node;
    对所述节点数据进行位运算,生成每个节点的合成节点标识;Performing bit operations on the node data to generate a synthetic node identifier for each node;
    将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;Dividing the node data and the edge data with each node data as the center to generate a plurality of data groups, each data group including a synthetic node identifier of the current node and the edge data connected to the current node;
    将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;Sending a single node identification list of each node to all adjacent nodes, where the single node identification list is used to store the composite node identification of the adjacent nodes;
    根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。According to the single node identification list received by each node, the second-degree relationship of each node is generated.
  2. 根据权利要求1所述的基于图数据的全量关系计算方法,其中,所述对所述节点数据进行位运算,生成每个节点的合成节点标识包括:The method for calculating a full relationship based on graph data according to claim 1, wherein said performing bit operation on said node data to generate a synthetic node identifier for each node comprises:
    确定所述节点数据中的多个节点;Determine multiple nodes in the node data;
    获取每个节点的节点属性和每个节点对应的初始节点标识;Obtain the node attributes of each node and the initial node ID corresponding to each node;
    获取预置规则,所述预置规则包括每个节点标识总共的存储比特位数和每个变量占据的存储位的起止序数;Acquiring a preset rule, the preset rule including a total number of storage bits for each node identifier and a starting and ending sequence number of storage bits occupied by each variable;
    按照所述每个节点标识总共的存储比特位数和所述每个变量占据的存储位的起止序数,将每个节点的节点属性和所述初始节点标识进行位运算;Perform a bit operation on the node attributes of each node and the initial node identifier according to the total number of storage bits for each node identifier and the starting and ending ordinal numbers of the storage positions occupied by each variable;
    生成每个节点的合成节点标识。Generate a synthetic node ID for each node.
  3. 根据权利要求1所述的基于图数据的全量关系计算方法,其中,在所述根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系之后,所述方法还包括:The method for calculating a full relationship based on graph data according to claim 1, wherein, after said generating the two-degree relationship of each node according to the single node identification list received by each node, the method further comprises:
    根据每个节点的二度关系获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析,生成分析结果。Obtain the node attributes of each node according to the two-degree relationship of each node, and perform statistical analysis according to the node attributes of each node to generate analysis results.
  4. 根据权利要求3所述的基于图数据的全量关系计算方法,其中,所述根据每个节点的二度关系获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析,生成分析结果包括:The method for calculating a full relationship based on graph data according to claim 3, wherein the node attribute of each node is obtained according to the second-degree relationship of each node, and statistical analysis is performed according to the node attribute of each node to generate an analysis The results include:
    读取每个节点的二度关系;Read the second-degree relationship of each node;
    从所述二度关系中确定每个节点的节点属性;Determine the node attribute of each node from the two-degree relationship;
    按照预置规则将所述每个节点的节点属性从所述合成节点标识中分离;Separating the node attribute of each node from the synthetic node identifier according to a preset rule;
    对所述每个节点的节点属性进行统计分析,生成分析结果。Perform statistical analysis on the node attributes of each node to generate analysis results.
  5. 根据权利要求1所述的基于图数据的全量关系计算方法,其中,所述根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系包括:The method for calculating a full relationship based on graph data according to claim 1, wherein said generating the second-degree relationship of each node according to the single node identification list received by each node comprises:
    接收每个节点的单节点标识列表;Receive a single node identification list of each node;
    确定每个节点的自身的合成节点标识;Determine the synthetic node ID of each node;
    分别将每个节点接收到的单节点标识列表中与自身的合成节点标识相同的节点标识删除;Respectively delete the node ID that is the same as its own synthetic node ID from the single node ID list received by each node;
    生成每个节点的二度关系,所述二度关系用于指示二度关联节点和当前节点之间,间隔一个一度关联节点。A second-degree relationship is generated for each node, and the second-degree relationship is used to indicate that there is an interval between the second-degree associated node and the current node by a first-degree associated node.
  6. 根据权利要求1所述的基于图数据的全量关系计算方法,其中,在所述根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系之后,所述方法还包括:The method for calculating a full relationship based on graph data according to claim 1, wherein, after said generating the two-degree relationship of each node according to the single node identification list received by each node, the method further comprises:
    根据所述各个节点的二度关系生成二度关系标识列表,所述二度关系标识列表用于存储各个节点的二度关系;Generating a second-degree relationship identifier list according to the second-degree relationship of each node, where the second-degree relationship identifier list is used to store the second-degree relationship of each node;
    将每个节点的二度关系标识列表发送给所有的相邻节点;Send the second-degree relationship identifier list of each node to all adjacent nodes;
    根据每个节点接收到的相邻节点发送的二度关系标识列表生成每个节点的三度关系标识列表,所述三度关系标识列表用于存储各个节点的三度关系,所述三度关系用于指示三度关联节点和当前节点之间,间隔一个一度关联节点和一个二度关联节点。A three-degree relationship identifier list for each node is generated according to the two-degree relationship identifier list sent by neighboring nodes received by each node. The three-degree relationship identifier list is used to store the three-degree relationship of each node. Used to indicate the interval between a three-degree associated node and the current node, a first-degree associated node and a second-degree associated node.
  7. 根据权利要求1-6中任一所述的基于图数据的全量关系计算方法,其中,在所述获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据之前,所述方法还包括:The method for calculating a full relationship based on graph data according to any one of claims 1-6, wherein, before said acquiring preprocessed graph data, said preprocessed graph data including node data and edge data of each node, The method also includes:
    获取每个节点的原始图数据;Obtain the original graph data of each node;
    对所述原始图数据进行去重处理和校验处理;Performing de-duplication processing and verification processing on the original image data;
    生成符合要求的预处理图数据。Generate preprocessed map data that meets the requirements.
  8. 一种基于图数据的全量关系计算装置,其中,包括:A full-scale relational calculation device based on graph data, which includes:
    第一获取单元,用于获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;The first acquiring unit is configured to acquire preprocessed graph data, where the preprocessed graph data includes node data and edge data of each node;
    运算生成单元,用于对所述节点数据进行位运算,生成每个节点的合成节点标识;An operation generating unit, configured to perform bit operation on the node data to generate a synthetic node identifier for each node;
    划分生成单元,用于将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;The dividing and generating unit is used to divide the node data and the edge data with each node data as the center to generate a plurality of data groups, each data group including a synthetic node identifier of the current node and a connection to the current node Edge data
    第一发送单元,用于将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;The first sending unit is configured to send a single node identification list of each node to all adjacent nodes, where the single node identification list is used to store the composite node identification of the adjacent nodes;
    第一生成单元,用于根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。The first generating unit is configured to generate the second-degree relationship of each node according to the single node identification list received by each node.
  9. 一种计算机设备,其中,包括:A computer device, which includes:
    一个或多个处理器;One or more processors;
    存储器;Memory
    一个或多个计算机程序,其中所述一个或多个计算机程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个计算机程序配置用于执行一种基于图数据的全量关系计算方法,其中,所述基于图数据的全量关系计算方法包括:One or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, and the one or more computer programs are configured to execute A method for calculating a full relationship based on graph data, wherein the method for calculating a full relationship based on graph data includes:
    获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数 据;Acquiring preprocessed graph data, the preprocessed graph data including node data and edge data of each node;
    对所述节点数据进行位运算,生成每个节点的合成节点标识;Performing bit operations on the node data to generate a synthetic node identifier for each node;
    将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;Dividing the node data and the edge data with each node data as the center to generate a plurality of data groups, each data group including a synthetic node identifier of the current node and the edge data connected to the current node;
    将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;Sending a single node identification list of each node to all adjacent nodes, where the single node identification list is used to store the composite node identification of the adjacent nodes;
    根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。According to the single node identification list received by each node, the second-degree relationship of each node is generated.
  10. 根据权利要求9所述的计算机设备,其中,所述对所述节点数据进行位运算,生成每个节点的合成节点标识包括:The computer device according to claim 9, wherein said performing a bit operation on said node data to generate a synthetic node identifier of each node comprises:
    确定所述节点数据中的多个节点;Determine multiple nodes in the node data;
    获取每个节点的节点属性和每个节点对应的初始节点标识;Obtain the node attributes of each node and the initial node ID corresponding to each node;
    获取预置规则,所述预置规则包括每个节点标识总共的存储比特位数和每个变量占据的存储位的起止序数;Acquiring a preset rule, the preset rule including a total number of storage bits for each node identifier and a starting and ending sequence number of storage bits occupied by each variable;
    按照所述每个节点标识总共的存储比特位数和所述每个变量占据的存储位的起止序数,将每个节点的节点属性和所述初始节点标识进行位运算;Perform a bit operation on the node attributes of each node and the initial node identifier according to the total number of storage bits for each node identifier and the starting and ending ordinal numbers of the storage positions occupied by each variable;
    生成每个节点的合成节点标识。Generate a synthetic node ID for each node.
  11. 根据权利要求9所述的计算机设备,其中,在所述根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系之后,所述方法还包括:The computer device according to claim 9, wherein, after said generating the two-degree relationship of each node according to the single node identification list received by each node, the method further comprises:
    根据每个节点的二度关系获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析,生成分析结果。Obtain the node attributes of each node according to the two-degree relationship of each node, and perform statistical analysis according to the node attributes of each node to generate analysis results.
  12. 根据权利要求11所述的计算机设备,所述根据每个节点的二度关系获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析,生成分析结果包括:11. The computer device according to claim 11, wherein said acquiring the node attribute of each node according to the two-degree relationship of each node, and performing statistical analysis according to the node attribute of each node, and generating the analysis result comprises:
    读取每个节点的二度关系;Read the second-degree relationship of each node;
    从所述二度关系中确定每个节点的节点属性;Determine the node attribute of each node from the two-degree relationship;
    按照预置规则将所述每个节点的节点属性从所述合成节点标识中分离;Separating the node attribute of each node from the synthetic node identifier according to a preset rule;
    对所述每个节点的节点属性进行统计分析,生成分析结果。Perform statistical analysis on the node attributes of each node to generate analysis results.
  13. 根据权利要求9所述的计算机设备,其中,所述根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系包括:The computer device according to claim 9, wherein the generating the second-degree relationship of each node according to the single node identification list received by each node comprises:
    接收每个节点的单节点标识列表;Receive a single node identification list of each node;
    确定每个节点的自身的合成节点标识;Determine the synthetic node ID of each node;
    分别将每个节点接收到的单节点标识列表中与自身的合成节点标识相同的节点标识删除;Respectively delete the node ID that is the same as its own synthetic node ID from the single node ID list received by each node;
    生成每个节点的二度关系,所述二度关系用于指示二度关联节点和当前节点之间,间隔一个一度关联节点。A second-degree relationship is generated for each node, and the second-degree relationship is used to indicate that there is an interval between the second-degree associated node and the current node by a first-degree associated node.
  14. 根据权利要求9所述的计算机设备,其中,在所述根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系之后,所述方法还包括:The computer device according to claim 9, wherein, after said generating the two-degree relationship of each node according to the single node identification list received by each node, the method further comprises:
    根据所述各个节点的二度关系生成二度关系标识列表,所述二度关系标识列表用于存储各个节点的二度关系;Generating a second-degree relationship identifier list according to the second-degree relationship of each node, where the second-degree relationship identifier list is used to store the second-degree relationship of each node;
    将每个节点的二度关系标识列表发送给所有的相邻节点;Send the second-degree relationship identifier list of each node to all adjacent nodes;
    根据每个节点接收到的相邻节点发送的二度关系标识列表生成每个节点的三度关系标识列表,所述三度关系标识列表用于存储各个节点的三度关系,所述三度关系用于指示三度关联节点和当前节点之间,间隔一个一度关联节点和一个二度关联节点。A three-degree relationship identifier list for each node is generated according to the two-degree relationship identifier list sent by neighboring nodes received by each node. The three-degree relationship identifier list is used to store the three-degree relationship of each node. Used to indicate the interval between a three-degree associated node and the current node, a first-degree associated node and a second-degree associated node.
  15. 根据权利要求9-14任一所述计算机设备,其中,在所述获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据之前,所述方法还包括:The computer device according to any one of claims 9-14, wherein, before said acquiring preprocessed graph data, the preprocessed graph data including node data and edge data of each node, the method further comprises:
    获取每个节点的原始图数据;Obtain the original graph data of each node;
    对所述原始图数据进行去重处理和校验处理;Performing de-duplication processing and verification processing on the original image data;
    生成符合要求的预处理图数据。Generate preprocessed map data that meets the requirements.
  16. 一种计算机可读存储介质,其中,所述计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现基于图数据的全量关系计算方法,其中,所述基于图数据的全量关系计算方法包括以下步骤:A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, a method for calculating a full relationship based on graph data is implemented, wherein the full amount of graph data is The relationship calculation method includes the following steps:
    获取预处理图数据,所述预处理图数据包括每个节点的节点数据和边数据;Acquiring preprocessed graph data, the preprocessed graph data including node data and edge data of each node;
    对所述节点数据进行位运算,生成每个节点的合成节点标识;Performing bit operations on the node data to generate a synthetic node identifier for each node;
    将所述节点数据和所述边数据以每个节点数据为中心进行划分,生成多个数据组,每个数据组包括一个当前节点的合成节点标识以及与当前节点连接的边数据;Dividing the node data and the edge data with each node data as the center to generate a plurality of data groups, each data group including a synthetic node identifier of the current node and the edge data connected to the current node;
    将每个节点的单节点标识列表发送给所有的相邻节点,所述单节点标识列表用于存储所述相邻节点的合成节点标识;Sending a single node identification list of each node to all adjacent nodes, where the single node identification list is used to store the composite node identification of the adjacent nodes;
    根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系。According to the single node identification list received by each node, the second-degree relationship of each node is generated.
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述对所述节点数据进行位运算,生成每个节点的合成节点标识包括:The computer-readable storage medium according to claim 16, wherein the performing a bit operation on the node data to generate a synthetic node identifier for each node comprises:
    确定所述节点数据中的多个节点;Determine multiple nodes in the node data;
    获取每个节点的节点属性和每个节点对应的初始节点标识;Obtain the node attributes of each node and the initial node ID corresponding to each node;
    获取预置规则,所述预置规则包括每个节点标识总共的存储比特位数和每个变量占据的存储位的起止序数;Acquiring a preset rule, the preset rule including a total number of storage bits for each node identifier and a starting and ending sequence number of storage bits occupied by each variable;
    按照所述每个节点标识总共的存储比特位数和所述每个变量占据的存储位的起止序数,将每个节点的节点属性和所述初始节点标识进行位运算;Perform a bit operation on the node attributes of each node and the initial node identifier according to the total number of storage bits for each node identifier and the starting and ending ordinal numbers of the storage positions occupied by each variable;
    生成每个节点的合成节点标识。Generate a synthetic node ID for each node.
  18. 根据权利要求16所述的计算机可读存储介质,其中,在所述根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系之后,所述方法还包括:The computer-readable storage medium according to claim 16, wherein, after said generating the two-degree relationship of each node according to the single node identification list received by each node, the method further comprises:
    根据每个节点的二度关系获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析,生成分析结果。Obtain the node attributes of each node according to the two-degree relationship of each node, and perform statistical analysis according to the node attributes of each node to generate analysis results.
  19. 根据权利要求18所述的计算机可读存储介质,其中,所述根据每个节点的二度关系获取每个节点的节点属性,并根据每个节点的节点属性进行统计分析,生成分析结果包括:18. The computer-readable storage medium according to claim 18, wherein the obtaining the node attribute of each node according to the two-degree relationship of each node, and performing statistical analysis according to the node attribute of each node, and generating the analysis result comprises:
    读取每个节点的二度关系;Read the second-degree relationship of each node;
    从所述二度关系中确定每个节点的节点属性;Determine the node attribute of each node from the two-degree relationship;
    按照预置规则将所述每个节点的节点属性从所述合成节点标识中分离;Separating the node attribute of each node from the synthetic node identifier according to a preset rule;
    对所述每个节点的节点属性进行统计分析,生成分析结果。Perform statistical analysis on the node attributes of each node to generate analysis results.
  20. 根据权利要求16所述的计算机可读存储介质,其中,所述根据每个节点接收到的所述单节点标识列表,生成各个节点的二度关系包括:The computer-readable storage medium according to claim 16, wherein the generating the second-degree relationship of each node according to the single node identification list received by each node comprises:
    接收每个节点的单节点标识列表;Receive a single node identification list of each node;
    确定每个节点的自身的合成节点标识;Determine the synthetic node ID of each node;
    分别将每个节点接收到的单节点标识列表中与自身的合成节点标识相同的节点标识删除;Respectively delete the node ID that is the same as its own synthetic node ID from the single node ID list received by each node;
    生成每个节点的二度关系,所述二度关系用于指示二度关联节点和当前节点之间,间隔一个一度关联节点。A second-degree relationship is generated for each node, and the second-degree relationship is used to indicate that there is an interval between the second-degree associated node and the current node by a first-degree associated node.
PCT/CN2020/087619 2019-08-15 2020-04-28 Graph data-based full relationship calculation method and apparatus, device, and storage medium WO2021027331A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910751784.4A CN110609924A (en) 2019-08-15 2019-08-15 Method, device and equipment for calculating total quantity relation based on graph data and storage medium
CN201910751784.4 2019-08-15

Publications (1)

Publication Number Publication Date
WO2021027331A1 true WO2021027331A1 (en) 2021-02-18

Family

ID=68889757

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087619 WO2021027331A1 (en) 2019-08-15 2020-04-28 Graph data-based full relationship calculation method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110609924A (en)
WO (1) WO2021027331A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110609924A (en) * 2019-08-15 2019-12-24 深圳壹账通智能科技有限公司 Method, device and equipment for calculating total quantity relation based on graph data and storage medium
CN111984828A (en) * 2020-06-30 2020-11-24 联想(北京)有限公司 Neighbor node retrieval method and device
CN112528090B (en) * 2020-12-11 2023-08-04 北京百度网讯科技有限公司 Storage method and storage device for graph data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104952032A (en) * 2015-06-19 2015-09-30 清华大学 Graph processing method and device as well as rasterization representation and storage method
CN106919628A (en) * 2015-12-28 2017-07-04 阿里巴巴集团控股有限公司 A kind for the treatment of method and apparatus of diagram data
US20180203897A1 (en) * 2017-01-18 2018-07-19 Oracle International Corporation Fast graph query engine optimized for typical real-world graph instances whose small portion of vertices have extremely large degree
CN109446362A (en) * 2018-09-05 2019-03-08 北京费马科技有限公司 Chart database structure, diagram data storage method, device based on external memory
CN110609924A (en) * 2019-08-15 2019-12-24 深圳壹账通智能科技有限公司 Method, device and equipment for calculating total quantity relation based on graph data and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104952032A (en) * 2015-06-19 2015-09-30 清华大学 Graph processing method and device as well as rasterization representation and storage method
CN106919628A (en) * 2015-12-28 2017-07-04 阿里巴巴集团控股有限公司 A kind for the treatment of method and apparatus of diagram data
US20180203897A1 (en) * 2017-01-18 2018-07-19 Oracle International Corporation Fast graph query engine optimized for typical real-world graph instances whose small portion of vertices have extremely large degree
CN109446362A (en) * 2018-09-05 2019-03-08 北京费马科技有限公司 Chart database structure, diagram data storage method, device based on external memory
CN110609924A (en) * 2019-08-15 2019-12-24 深圳壹账通智能科技有限公司 Method, device and equipment for calculating total quantity relation based on graph data and storage medium

Also Published As

Publication number Publication date
CN110609924A (en) 2019-12-24

Similar Documents

Publication Publication Date Title
US11868359B2 (en) Dynamically assigning queries to secondary query processing resources
WO2021027331A1 (en) Graph data-based full relationship calculation method and apparatus, device, and storage medium
CN110543586B (en) Multi-user identity fusion method, device, equipment and storage medium
US11727004B2 (en) Context dependent execution time prediction for redirecting queries
CN110705683A (en) Random forest model construction method and device, electronic equipment and storage medium
US20240126817A1 (en) Graph data query
CN113687964B (en) Data processing method, device, electronic equipment, storage medium and program product
Ho et al. Distributed graph database for large-scale social computing
CA3119167A1 (en) Approach for a controllable trade-off between cost and availability of indexed data in a cloud log aggregation solution such as splunk or sumo
Elagib et al. Big data analysis solutions using MapReduce framework
CN111404932A (en) Method for accessing medical institution system to smart medical cloud service platform
US11221890B2 (en) Systems and methods for dynamic partitioning in distributed environments
US10162830B2 (en) Systems and methods for dynamic partitioning in distributed environments
CN112925859A (en) Data storage method and device
CN116166191A (en) Integrated system of lake and storehouse
Wang et al. A Data-Aware Remote Procedure Call Method for Big Data Systems.
US20200104425A1 (en) Techniques for lossless and lossy large-scale graph summarization
CN112612832B (en) Node analysis method, device, equipment and storage medium
Cao et al. LogKV: Exploiting key-value stores for event log processing
EP3926453A1 (en) Partitioning method and apparatus therefor
US20150149498A1 (en) Method and System for Performing an Operation Using Map Reduce
US11249901B2 (en) Ownership-based garbage collection of data
CN110309206B (en) Order information acquisition method and system
US11768814B2 (en) Data transmissions between two databases
Hua et al. Real-time semantic search using approximate methodology for large-scale storage systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20852001

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20852001

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.08.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20852001

Country of ref document: EP

Kind code of ref document: A1