CN114327363A - Graph data processing method, graph data processing apparatus, computing device, storage medium, and program product
- Publication number: CN114327363A (application number CN202111075644.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application provides a graph data processing method, which comprises the following steps: acquiring directed graph data; creating an associated radix counter for each node on a parameter server based on the acquired graph data; initializing the associated radix counters based on the node information of each node; and iteratively updating the radix counters until a desired iteration round is reached, where in each iteration round: for each source node, the radix counter associated with the source node is obtained from the parameter server and transmitted to all tail nodes associated with the source node, and for each tail node, the radix counter associated with the tail node on the parameter server is updated based on the received radix counters. Based on the iteratively updated radix counters, the number of nodes within a neighborhood of a desired order is determined for at least one of the plurality of nodes. The method and apparatus can be used in a variety of applications whose data can be abstracted as graph data for processing, including the fields of finance, social networking, vehicle traffic, and the like.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a graph data processing method, a graph data processing apparatus, a computing device, a computer-readable storage medium, and a computer program product.
Background
With the increasing popularity and development of computer networks, people participate in ever more network relationships, such as social networks, financial transaction networks, vehicle networks, and the like. Typically, these network relationships can be described using a graph. In particular, the nodes in the graph may correspond to the various participants in a network relationship, and the edges may correspond to network associations existing between different participants. For example, in a social network, edges may correspond to friend or communication relationships between different participants; in a financial transaction network, edges may correspond to payment or transfer relationships between different participants. As another example, in a vehicle network, an edge may correspond to a distance, speed, or orientation relationship between different vehicles.
In some scenarios, one may be interested in the number of participants with whom one or more participants have a direct or indirect network association, e.g., in order to monitor whether the one or more participants exhibit abnormal activity. This can be determined by counting the number of non-repeating nodes within a certain neighborhood of the corresponding node in the graph. However, most current network relationships involve a huge number of participants, among whom there may be intricate network associations, so the graphs used to describe these network relationships will also contain a huge amount of data. In a general processor or computing engine, performing the above statistics on a graph containing such massive data would occupy a large amount of storage space and computing resources and result in high data interaction pressure inside the processor or computing engine.
Disclosure of Invention
In view of the above, the present application provides a graph data processing method, a graph data processing apparatus, a computing device, a computer readable storage medium and a computer program product, which may alleviate, reduce or even eliminate the above-mentioned problems.
According to an aspect of the present application, there is provided a graph data processing method, including: acquiring directed graph data, wherein the directed graph data comprises a plurality of items of edge data related to a plurality of nodes, each item of edge data represents an edge pointing from a source node to a tail node, and each item of edge data comprises a corresponding source node identifier and a corresponding tail node identifier; creating, on a parameter server, an associated radix counter for each node based on the obtained directed graph data, each radix counter being configured to determine a radix estimate based on the node identifications added to the radix counter; initializing the radix counter associated with each node based on node information for the node, the node information including a node identification; iteratively updating the radix counters until a desired iteration round is reached, performing the following iteration operations in each iteration round: for each source node, obtaining the radix counter associated with the source node from the parameter server and transmitting the obtained radix counter to all tail nodes associated with the source node; and for each tail node, updating the radix counter associated with the tail node on the parameter server based on the radix counters received from all relevant source nodes of the tail node; and determining, based on the iteratively updated radix counters, the number of nodes within the neighborhood of a desired order of at least one of the plurality of nodes, wherein the desired order is equal to the desired iteration round.
In some embodiments, obtaining the directed graph data comprises: receiving undirected graph data, wherein the undirected graph data comprises a plurality of items of edge data related to a plurality of nodes, each item of edge data representing an edge connecting two nodes and comprising the corresponding two node identifications; exchanging the positions of the two node identifications in each item of edge data in the undirected graph data to generate additional edge data; and taking the combination of the plurality of items of edge data in the undirected graph data and the generated plurality of items of additional edge data as the directed graph data.
In some embodiments, the node information further comprises a node label, and initializing the associated radix counter based on the node information of each node comprises: when the node label of the node belongs to a target label, adding the node identification of the node to the associated radix counter so as to initialize the radix estimate of the radix counter to 1; and when the node label of the node does not belong to the target label, not adding the node identification of the node to the associated radix counter, so that the radix estimate of the radix counter is initialized to 0.
In some embodiments, updating the radix counter associated with the tail node on the parameter server based on the radix counters received from all relevant source nodes comprises: de-duplicating and merging the radix counters received from all relevant source nodes to generate an updated radix counter; and updating the radix counter associated with the tail node on the parameter server based on the generated updated radix counter.
In some embodiments, updating the radix counter associated with the tail node on the parameter server based on the generated updated radix counter comprises: pushing the generated updated radix counter to the parameter server, wherein the parameter server is configured to, upon receiving the updated radix counter, de-duplicate and merge the updated radix counter with the stored radix counter associated with the same node.
In some embodiments, the graph data processing method further comprises: aggregating, for the directed graph data, tail nodes having the same source node based on the source node identification to obtain a set of tail nodes corresponding to each source node, and transmitting the obtained radix counter to all tail nodes associated with the source node comprises: transmitting the acquired radix counter to all tail nodes in the tail node set corresponding to the source node.
In some embodiments, the graph data processing method further comprises: dividing directed graph data into a plurality of groups of graph data; wherein performing iterative operations in each iteration includes: in each iteration, an iterative operation is performed in parallel for each set of graph data.
In some embodiments, initializing the associated radix counter based on the node information of each node comprises: the associated radix counters are initialized in parallel based on the node information for each node in the sets of graph data.
In some embodiments, creating an associated radix counter for each node on the parameter server comprises: creating an associated first radix counter and second radix counter for each node on the parameter server, and performing iterative operations on the sets of graph data in parallel comprises: for each set of graph data, further dividing the set of graph data into a plurality of data batches; and for each data batch in turn, performing the following operations: for each source node in the data batch, obtaining the first radix counter associated with the source node from the parameter server and transmitting the obtained first radix counter to all tail nodes associated with the source node; and for each tail node, updating the second radix counter associated with the tail node on the parameter server based on the first radix counters received from all relevant source nodes of the tail node, wherein the parameter server is configured to update the first radix counter based on the second radix counter of each node at the end of each iteration round.
In some embodiments, the radix counter is configured to: determine a bit string corresponding to each node identification added to the radix counter using a preset hash function; and estimate the radix based on the position at which a 1 first occurs in at least some of the bits of each bit string.
In some embodiments, the directed graph data comprises financial transaction graph data, each node identification comprises a transactor identification, and each item of edge data represents a transaction payment relationship that points from a source transactor to an end transactor.
According to another aspect of the present application, there is provided a graph data processing apparatus comprising: the acquisition module is configured to acquire directed graph data, wherein the directed graph data comprises a plurality of items of edge data related to a plurality of nodes, each item of edge data represents an edge pointing from a source node to a tail node, and comprises a corresponding source node identifier and a corresponding tail node identifier; a creation module configured to create an associated radix counter for each node on the parameter server based on the obtained directed graph data, each radix counter configured to determine a radix estimate based on the node identification added to the radix counter; an initialization module configured to initialize a radix counter associated with each node based on node information for the node, the node information including a node identification; an iteration module configured to iteratively update the base counter until a desired iteration round is reached, in each iteration round performing the following iteration operations: for each source node, obtaining a radix counter associated with the source node from a parameter server and transmitting the obtained radix counter to all tail nodes associated with the source node; for each tail node, updating a radix counter on a parameter server associated with the tail node based on the radix counters received from all relevant source nodes of the tail node; a determination module configured to determine, based on the iteratively updated radix counter, a number of nodes within a neighborhood of a desired order of at least one node of the plurality of nodes, wherein the desired order is equal to the desired number of iterations.
According to another aspect of the present application, there is provided a computing device, comprising: a memory configured to store computer-executable instructions; and a processor configured to perform the graph data processing method described in the preceding aspect when executing the computer-executable instructions.
According to another aspect of the present application, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform the graph data processing method described according to the foregoing aspect.
According to another aspect of the present application, a computer program product is provided, comprising computer instructions which, when executed by a processor, implement the steps of the graph data processing method described according to the preceding aspect.
In the graph data processing scheme provided by the application, an associated radix counter may be created on the parameter server for each node based on the obtained graph data, the associated radix counters are initialized based on the node information of each node, the corresponding radix counters on the parameter server are updated in each iteration round by aggregating the radix counters of the relevant source nodes at each tail node, and when the desired iteration round is reached, the number of nodes in the neighborhood of the desired order of the corresponding node may be determined based on each radix counter on the parameter server. By introducing a parameter server to store and update the radix counters associated with the various nodes, the storage footprint and computational resource consumption of the processor or compute engine executing the scheme may be reduced. In addition, the parameter server allows data interaction with high parallelism; compared with storing the radix counters locally in the processor or computing engine, introducing the parameter server improves data interaction efficiency and reduces data interaction pressure, thereby improving the execution efficiency of the whole scheme.
These and other aspects of the application will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
Drawings
Further details, features and advantages of the present application are disclosed in the following description of exemplary embodiments, which is to be read in connection with the accompanying drawings, in which:
fig. 1 schematically shows an example application scenario in which the technical solution provided by the present application may be applied;
FIGS. 2A and 2B schematically illustrate examples of undirected and directed graphs, respectively;
FIG. 3 schematically illustrates a graph data processing architecture according to the related art;
FIG. 4A schematically illustrates an example flow diagram of a graph data processing method according to some embodiments of the present application;
FIG. 4B schematically illustrates another example flow diagram of a graph data processing method according to some embodiments of the present application;
FIG. 4C schematically illustrates yet another example flow diagram of a graph data processing method according to some embodiments of the present application;
FIG. 5 schematically illustrates an example of a conversion from an undirected graph to a directed graph, in accordance with some embodiments of the present application;
FIG. 6 schematically illustrates an example graph data processing architecture according to some embodiments of the present application;
FIG. 7 schematically illustrates an example block diagram of a graph data processing apparatus according to some embodiments of this application;
FIG. 8 schematically illustrates an example block diagram of a computing device in accordance with some embodiments of the present application.
Detailed Description
Before describing embodiments of the present application in detail, some relevant concepts are explained first:
Graph: in the field of computer technology, a graph may be understood as a collection of nodes (also referred to as vertices) connected by a series of edges (also referred to as links). As previously mentioned, graphs may be used to describe various network relationships, such as social networks, financial transaction networks, and the like. Graphs can be divided into undirected graphs and directed graphs according to whether the edges have directionality, and into homogeneous graphs and heterogeneous graphs according to whether there are different types of nodes and/or edges. In a computer, a graph can typically be described by an adjacency list or an adjacency matrix. In an adjacency list, each edge is represented by a pair of node identifiers; in an adjacency matrix, rows and columns represent nodes, and the matrix element determined by two nodes indicates whether an edge exists between them. In general, when the graph is sparse, i.e., each node is connected to only a few nodes relative to the total number of nodes, it may be preferable to use an adjacency list to describe the graph; conversely, when the graph is dense, i.e., each node is connected to most other nodes, an adjacency matrix may be preferable. In addition, data in the form of an adjacency list is easier to split, making the adjacency list more suitable for distributed storage of large-scale graph data.
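As an illustration of the two representations (a minimal Python sketch with purely hypothetical data, not part of the application), note how the adjacency list is simply the list of identifier pairs, which is what makes it easy to split across machines:

```python
# Adjacency list: one (source, destination) identifier pair per edge.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]

nodes = sorted({n for pair in edges for n in pair})
index = {n: i for i, n in enumerate(nodes)}

# Adjacency matrix: rows/columns are nodes; 1 marks an existing edge.
matrix = [[0] * len(nodes) for _ in nodes]
for src, dst in edges:
    matrix[index[src]][index[dst]] = 1
```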
Undirected graph: if the edges in the graph only indicate a connection relationship, but do not have directionality, the graph may be referred to as an undirected graph. Undirected graphs can be used to describe generally network relationships where the associations between participants are not directional. For example, in a social network, edges may be used to indicate friend relationships between participants, which may be considered to have no directional association, and thus, such network relationships may be described by an undirected graph.
Directed graph: if an edge in a graph has directionality, the graph may be referred to as a directed graph. Directed graphs can generally be used to describe network relationships where the association between participants is directional. For example, in a financial transaction network, edges may be used to indicate a transfer relationship from one party to another party, which may be considered to have a directional association, and thus, the network relationship may be described by a directed graph.
r-order neighborhood node set: assume that in a graph G(V, E), V represents the node set and E represents the edge set. For any two nodes x, y ∈ V, let the shortest path length from x to y be denoted d(x, y), measured by the number of edges included in the path. The r-order (r ≥ 1) neighborhood node set of a node x can then be defined as N_r(x) = { y ∈ V : d(y, x) ≤ r }, i.e., the set of nodes in graph G whose shortest path to node x is not greater than r, including node x itself.
Cardinality: refers to the number of non-repeating elements in a set; determining the number of neighborhood nodes may thus be understood as determining the cardinality of the neighborhood node set. In the related art, there are methods for accurately determining a cardinality, such as B-tree-based cardinality counting and bitmap-based cardinality counting, as well as cardinality estimation methods such as Linear Counting (LC), LogLog Counting (LLC), HyperLogLog Counting (HLLC), and Adaptive Counting (AC). In general, when processing a large amount of data, a cardinality estimation method can be used to determine an estimate of the cardinality so as to strike a balance between counting accuracy on the one hand and time consumption and the utilization efficiency of storage and computing resources on the other.
Cardinality estimates: refers to an estimate of the cardinality, i.e., an estimated value of the number of non-repeating elements in a set.
Parameter Server (PS): refers to a server for hosting parameters, which typically stores parameters in a distributed manner and allows the stored parameters to be called or new parameters to be stored.
FIG. 1 schematically illustrates an example application scenario 100 in which the graph data processing scheme of the present application may be applied.
As shown in FIG. 1, the scenario 100 may include a server 110, on which the graph data processing scheme provided herein may be deployed and used to process graph data in order to determine the cardinality, or a cardinality estimate, of the neighborhood node set of a desired order for one or more nodes. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms. Note that these servers are shown only as examples; other devices, or combinations of devices, with computing and storage capabilities may be used instead to provide the corresponding services.
Alternatively, the user 130 may access the server 110 through the terminal device 120 via the network 150 in order to obtain the services provided by the server 110. For example, the user 130 may input instructions through a user interface provided by the terminal device 120 to start the graph data processing scheme deployed on the server 110, obtain processing results, and the like. Illustratively, the terminal device 120 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart appliance, a vehicle-mounted terminal, and the like.
Alternatively, the server 110 may be linked with an external database 140, for example, to acquire the graph data to be processed from the database 140 or to store the processing result to the database 140, or the like. Illustratively, the database 140 may be a stand-alone data storage device or group of devices, or may also be a back-end data storage device or group of devices associated with other online services, such as online services that provide functions for social, financial transactions, and the like.
Further, in the present application, the network 150 may be a wired network connected via, for example, a cable or an optical fiber, or may be a wireless network such as 2G, 3G, 4G, 5G, Wi-Fi, Bluetooth, ZigBee, or Li-Fi.
Fig. 2A and 2B schematically show examples of undirected and directed graphs, respectively.
As shown in fig. 2A, the undirected graph 200A includes nodes A-H, and the connecting lines between these nodes represent edges; each edge represents only a connection relationship between nodes, without directionality. Illustratively, the undirected graph 200A may be described using graph data in the form of an adjacency list, which may include, for example, the edge data (A, B), (B, C), (C, D), (C, E), etc., where A-H may be regarded as node identifications, and the order of the node identifications in the items of edge data does not indicate any directionality of the edges. Alternatively, other types of node identification may be used, such as numerical numbers.
As shown in fig. 2B, the directed graph 200B includes nodes a-H, and the arrowed lines between these nodes represent the connection relationships between the nodes and have directionality. Illustratively, the directed graph 200B may be described using graph data in the form of an adjacency table, which may for example comprise edge data (a, B), (B, a), (C, B), (B, E), etc., wherein a-H may be considered as node identifications, the order of the node identifications in the items of edge data indicating the directionality of the edges, e.g. edge data (a, B) represents an edge pointing from node a to node B, and edge data (B, a) represents an edge pointing from node B to node a. Illustratively, a numerical number may be used as a node identification for subsequent processing.
In fig. 2B, there are two paths from node D to node G, namely (D, G) and (D, H)-(H, G), so the shortest path length from node D to node G is 1. Shortest paths between other nodes may be determined similarly. Further, taking node E as an example, its 0th-order neighborhood node set includes only node E itself; its 1st-order neighborhood node set includes nodes B, C, E, F and G (the shortest path lengths from these nodes to node E are all less than or equal to 1); and its 2nd-order neighborhood node set includes nodes A, B, C, D, E, F, G and H (the shortest path lengths from these nodes to node E are all less than or equal to 2). Neighborhood node sets of other orders, and of other nodes, may be determined similarly.
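For concreteness, the following sketch (a hypothetical helper, not part of the application) computes the r-order neighborhood node set by breadth-first search over reversed edges, matching the definition N_r(x) = { y : d(y, x) ≤ r }; the toy edge list is illustrative and is not the graph of FIG. 2B:

```python
from collections import deque

def neighborhood(edges, x, r):
    """Return N_r(x): nodes whose shortest path TO x is at most r (includes x).

    BFS proceeds over reversed edges, matching the definition's d(y, x).
    """
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, []).append(src)
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue
        for v in reverse.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

edges = [("B", "E"), ("C", "E"), ("A", "B")]   # toy example only
print(neighborhood(edges, "E", 1))  # {'E', 'B', 'C'}
print(neighborhood(edges, "E", 2))  # {'E', 'B', 'C', 'A'}
```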
Fig. 3 schematically shows a graph data processing architecture 300 according to the related art. As shown in fig. 3, the architecture 300 includes a driver 310 and an executor 320; optionally, there may be multiple executors 320. Under such an architecture, the driver 310 may be responsible for resource application, task assignment, monitoring, and the like, while the executors 320 are responsible for executing the assigned tasks. For example, when executing an application such as graph data processing, the driver 310 may apply for storage and/or computing resources from a resource management module, launch one or more executors 320, and assign tasks to the respective executors 320, e.g., send the corresponding application and file data to each executor. Each executor 320 may execute its assigned task and return the execution result to the driver after execution is complete. The architecture 300 may be implemented by a processor or a compute engine; illustratively, it may be implemented by a compute engine suitable for large-scale data processing, such as Spark.
Applicants have discovered, however, that when processing graph data using the graph data processing architecture 300 to determine the number of nodes in a neighborhood of a desired order of a node, it is often necessary to store a counter associated with each node on the driver 310 and to continually merge different counters and update the respective counters over multiple iterations. When processing large or very large scale (e.g., on the order of billions or more) graph data, storing the counters on the driver 310 takes a great deal of storage space; moreover, during each iteration, the driver 310 needs to send counters to each executor and collect the updates of each counter, which takes a great deal of time and communication resources, so it is easy to encounter a single-point network bottleneck at the driver, affecting overall processing performance. Furthermore, merging different counters and updating the counters may involve a large amount of data shuffling, which, given a large data volume, also incurs huge communication and computation overhead. Finally, in a compute engine such as Spark, computation is generally performed based on RDDs (Resilient Distributed Datasets), a fault-tolerant, parallel data structure that is conducive to higher data processing efficiency; however, due to the immutability (read-only property) of RDDs, a large number of update operations can only be realized by additionally creating and storing RDDs, which may further increase the consumption of storage and/or computing resources.
In view of the above, the applicant has proposed a graph data processing method based on a parameter server, which is more suitable for achieving superior processing performance on ultra-large scale graph data. Fig. 4A schematically illustrates an example flow diagram of a graph data processing method 400A, according to some embodiments of the present application. Illustratively, the method 400A may be deployed on the server 110 in the application scenario 100 of FIG. 1 in a form such as an application program.
At step 410, directed graph data may be obtained, where the obtained directed graph data may include a plurality of items of edge data associated with a plurality of nodes, where each item of edge data represents an edge pointing from a source node to a tail node and includes a corresponding source node identification and tail node identification. Here and in other expressions of the present application, a source node and a tail node respectively represent a start point and an end point of a directed edge represented by one item of edge data, and in the related art, the source node may also be referred to as a head node or the like, and the tail node may also be referred to as a target node or the like. Illustratively, the directed graph data may be similar to that described with reference to FIG. 2B, but may include a different number of edge data associated with a different number of nodes, e.g., may involve a much larger number of nodes and edge data. The directed graph data may be input by a user through a user interface of the terminal device, or may be obtained from a memory of another online service, or may be obtained by preprocessing other initial data, for example.
At step 420, an associated radix counter may be created on the parameter server for each node based on the obtained directed graph data, where each radix counter may be configured to determine a radix estimate based on the node identifications added to it. The parameter server may be any existing or self-developed parameter server that provides distributed parameter storage services and allows parallel, asynchronous parameter access operations. Illustratively, the created radix counters may be stored in different memory partitions of the parameter server. The radix counter may determine the radix estimate according to any known or self-developed radix determination or estimation algorithm, such as the aforementioned B-tree- or bitmap-based radix counting methods, or radix estimation methods such as LC, LLC, HLLC and AC. For example, when one or more node identifications are added to the radix counter, the radix counter may determine the radix of the added node identifications according to the corresponding algorithm, i.e., how many different node identifications exist among those added. When the graph data size is large, a radix counter based on a radix estimation method can be adopted in order to reduce computational complexity, improve counting efficiency, and save resources.
At step 430, the radix counter associated with each node may be initialized based on the node information for that node, where the node information may include a node identification. In some embodiments, the radix counters associated with the respective nodes may all be initialized to 1, since the cardinality of the 0th-order neighborhood node set of each node can be determined to be 1, i.e., within the 0th-order neighborhood of a node there is only the node itself. In other embodiments, the radix counters associated with the respective nodes may be selectively initialized to 0 or 1, and such selective initialization may be performed based on preset conditions on the node information, such as whether the node identification belongs to a preset set, whether the node has certain preset attributes, and so on.
At step 440, the radix counters may be iteratively updated until a desired iteration round is reached. In particular, the following iterative operations may be performed in each iteration round: for each source node, obtaining the radix counter associated with the source node from the parameter server and transmitting the obtained radix counter to all tail nodes associated with the source node; for each tail node, updating the radix counter associated with the tail node on the parameter server based on the radix counters received from all relevant source nodes of the tail node. Illustratively, an iteration ordinal may be created to record the iteration round; in general, the iteration ordinal may be initialized to 0 or 1 before the iteration starts and incremented by 1 at the beginning or end of each iteration round, and the iteration may end when the iteration ordinal indicates that the completed iteration rounds have reached the desired number. In the i-th iteration round, each tail node acquires the radix counters of all its relevant source nodes as updated after i-1 rounds; at that point, the radix in a counter updated over i-1 rounds represents the non-repeating number of nodes (or an estimate thereof) in the (i-1)-th order neighborhood of the corresponding node. By de-duplicating and merging the acquired counters of all relevant source nodes with the (i-1)-round-updated radix counter associated with the tail node (stored on the parameter server), and writing the result back to the parameter server, the radix of each radix counter stored on the parameter server represents, after the i-th iteration round finishes, the non-repeating number of nodes (or an estimate thereof) in the i-th order neighborhood of the corresponding node. In other words, each additional iteration round increases by 1 the order of the node neighborhood reflected by the radix counter associated with each node on the parameter server. Since each radix counter on the parameter server is initialized based on the 0th-order neighborhood of the corresponding node (i.e., the node itself) before the iteration starts, after the desired number of rounds the radix in each counter on the parameter server reflects the number of non-repeating nodes in the neighborhood of the corresponding order of the corresponding node.
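The following single-machine sketch illustrates one way this iterative operation could look; the `pull`/`push` parameter-server client methods and the counter's `copy`/`merge` operations are assumptions for illustration, not the application's actual API:

```python
def run_iterations(edges, ps, rounds):
    """Sketch of the iterative update over (source_id, tail_id) edge data.

    ps is an assumed parameter-server client: pull(node) returns the node's
    radix counter; push(node, c) merges c into the stored counter. After
    round i, ps holds, for each node, an estimate of the non-repeating node
    count in its i-th order neighborhood.
    """
    for _ in range(rounds):
        # Each tail node collects the counters of all of its relevant source nodes.
        inbox = {}
        for src, tail in edges:
            inbox.setdefault(tail, []).append(ps.pull(src))
        # De-duplicate and merge the received counters, then update the server.
        for tail, received in inbox.items():
            update = received[0].copy()
            for c in received[1:]:
                update.merge(c)
            ps.push(tail, update)
```

Note that all pulls in a round complete before any push, so every source counter reflects the previous round's state, consistent with the two-counter mechanism discussed later.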
At step 450, a number of nodes in the neighborhood of a desired order of at least one of the plurality of nodes may be determined based on the iteratively updated radix counter, where the desired order is equal to the desired number of iterations. Illustratively, the number of nodes in the neighborhood of the desired order of the corresponding node, which may be an estimated amount within an error tolerance range, may be determined by reading a radix counter on the parameter server.
By introducing a parameter server to store and update the radix counters associated with each node, the graph data processing method 400A for obtaining the number of nodes in a neighborhood of a desired order of a node may reduce the storage footprint and computational resource consumption of the processor or compute engine executing the scheme. In addition, the parameter server allows data interaction with high parallelism; compared with storing the radix counters locally in the processor or computing engine, introducing the parameter server improves data interaction efficiency and reduces data interaction pressure, thereby improving the execution efficiency of the whole scheme. Furthermore, since the number of nodes in the neighborhood of the desired order of a node is based on a radix counter that determines a radix estimate, the method is particularly suitable for graph data processing in big-data contexts, where accurate counting tends to be very difficult and would consume a significant amount of computing and/or storage resources.
In some embodiments, the directed graph data may include financial transaction graph data, wherein each node identification may be a transaction party identification, and each item of edge data may represent a transaction payment relationship directed from a source transaction party to an end transaction party. Illustratively, each node identification may be an identity of a transaction party or a transaction account identification, and the transaction payment relationship may be a transfer relationship, a transfer relationship of other types of materials, and the like. By determining the number of non-duplicate counterparties within a certain level of proximity of a counterparty node, it may be useful or helpful to, for example, monitor counterparties for anomalous activity, etc.
In some embodiments, the directed graph data may include transportation vehicle data, wherein each node identification may be a vehicle node identification on a road, and each item of edge data may represent a distance from one vehicle node to another vehicle node. In such embodiments, the number of non-duplicate vehicle nodes within a certain order of neighborhood of a certain vehicle node may be determined by the above-described method for use in subsequent analysis.
In some embodiments, the directed graph data may be obtained based on undirected graph data. Illustratively, fig. 5 schematically illustrates a conversion process from an undirected graph 510 to a directed graph 520, according to some embodiments of the present application.
First, undirected graph data can be received, which can include a plurality of items of edge data associated with a plurality of nodes, wherein each item of edge data represents an edge connecting two nodes and includes the corresponding two node identifications. Illustratively, the received undirected graph data can describe an undirected graph similar to undirected graph 510, but may involve a much larger number of nodes and edges. Taking undirected graph 510 as an example, which has nodes A-E, the graph data used to describe it may include the edge data (A, B), (B, C), (C, D), (C, E), and (D, E).
The positions of the two node identifications may then be exchanged based on each item of edge data in the undirected graph data to generate additional edge data. Continuing with the undirected graph 510 as an example, the locations of the two node identifications in the edge data (A, B) may be exchanged, generating additional edge data (B, A). Similarly, new side data (C, B), (D, C), (E, C), and (E, D) may be generated based on the other side data.
Thus, a combination of the plurality of items of edge data and the generated plurality of items of additional edge data in the undirected graph data can be regarded as the above-described directed graph data. Continuing with undirected graph 510 as an example, the combination of edge data and generated edge data in the graph data thereof may include edge data (a, B), (B, a), (B, C), (C, B), (C, D), (D, C), (C, E), (E, C), (D, E), and (E, D). The combined edge data may be used as the directed graph data in step 410, where each item of edge data represents an edge pointing from the source node to the tail node. In other words, after the combined edge data is treated as directed graph data, the directed graph data actually describes the directed graph 520 shown in fig. 5.
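A minimal sketch of this conversion (the function name is hypothetical):

```python
def undirected_to_directed(edge_data):
    """For each undirected edge (u, v), also emit the reversed edge (v, u)."""
    return edge_data + [(v, u) for (u, v) in edge_data]

und = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E")]
directed = undirected_to_directed(und)
# -> 10 items of edge data, matching the combination described above for graph 520
```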
By allowing the undirected graph data to be converted into directed graph data to execute the graph data processing method 400A described with reference to fig. 4A, the graph data processing method provided by the application can be compatible with undirected graphs and directed graphs, thereby being helpful for expanding the application range and application scenarios of the method.
For example, as mentioned above, in a social network, graph data can be used to describe friend relationships between different users, and such friend relationships and other similar relationships have no directionality, and thus the graph data involved tends to be undirected graph data. After converting it into directed graph data, the number of non-duplicate user nodes within a certain order neighborhood of user nodes may be determined according to the method 400A described with reference to fig. 4A, which may be used, for example, to analyze or assist in analyzing social liveness, interpersonal relationship complexity, etc. of a user, and may in turn assist in achieving other analysis objectives.
For example, in a vehicle communication network, graph data may be utilized to describe a communicable relationship between different vehicles, e.g., a communicable relationship (typically undirected) may exist between vehicles within a communication range. In such an example, a non-repeating number of vehicle nodes within a certain order of proximity of the vehicle nodes may be determined according to the method 400A described with reference to fig. 4A, such that, for example, vehicle density analysis or the like may be facilitated.
In some embodiments, the initialization in step 430 of method 400A described with reference to fig. 4A may be implemented based on node information beyond the node identification. Illustratively, the node information may further include a node label, and initializing the associated radix counter based on the node information of each node may include: when the node label of the node belongs to the target label, adding the node identification of the node to the associated radix counter so as to initialize the radix estimate of the radix counter to 1; or, when the node label of the node does not belong to the target label, not adding the node identification of the node to the associated radix counter, so that the radix estimate of the radix counter is initialized to 0. It can be understood that, since the counting of the nodes in the neighborhood of each order is performed on the basis of the initialized radix counter corresponding to each node, when the radix counter of a certain node is initialized to 0, that node does not actually participate in counting; therefore, initializing the radix counters of the nodes based on node labels effectively screens which nodes participate in the neighborhood node counting. In other words, in such embodiments, the determination of the number of nodes within the neighborhood of the desired order of a node counts only nodes whose node labels belong to the target label, and excludes nodes without such labels.
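A sketch of this label-based initialization, assuming a `make_counter` factory and the same hypothetical parameter-server client as before:

```python
def initialize_counters(node_info, ps, make_counter, target_labels):
    """node_info: iterable of (node_id, node_label) pairs."""
    for node_id, label in node_info:
        counter = make_counter()
        if label in target_labels:
            counter.add(node_id)  # radix estimate initialized to 1
        # Otherwise the counter stays empty (radix estimate 0), so the node
        # is effectively screened out of the neighborhood node counting.
        ps.push(node_id, counter)
```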
By initializing the radix counters of the nodes based on the node labels, the flexibility of the graph processing method in application is increased, and the application range of the graph processing method is widened.
For example, in a financial transaction network, some transaction nodes may have labels indicating that they are involved in illegal lending, cash-out, multi-head lending, gambling, or other malicious transaction activities. Assuming that labels indicating one or more of such malicious activities are set as target labels, the number of nodes involved in the corresponding malicious transaction activities within the neighborhood of the desired order of each node may be determined. Furthermore, if the number of nodes involved in a malicious transaction activity is significantly greater within a certain order (or certain orders) of neighborhood of a node, the probability that the node is involved in the corresponding or related malicious transaction activity may be higher. Or, for example, in a social network, some user nodes may have a label indicating that they are involved in abnormal social activities, such as frequently posting illegal statements or frequently being complained about or reported by other users; assuming that such a label is set as the target label, the number of user nodes involved in abnormal social activities in the neighborhood of the desired order of each node may be determined. Similarly, if the number of user nodes involved in abnormal social activities within a certain order (or certain orders) of neighborhood of a user node is significantly larger, the probability that the user node is involved in abnormal social activities may be higher.
Furthermore, in some network relationships, there may be different types of nodes. For example, in a financial transaction network, a natural human account node, a government account node, an enterprise account node, an individual merchant account node, and the like may be involved. These different types of nodes may also be translated into different node labels and the target type of node counted according to the above embodiment where the radix counter of the node is initialized based on the node label. For example, if it is determined according to the above embodiments that there are a large number of account nodes of the loan service type in the first, second, third or higher order neighborhood of a natural person account node, the probability that the natural person is involved in a high-risk financial activity or even an illegal financial activity may be high.
In some embodiments, the method 400A shown in FIG. 4A may include an optional step 460 as shown in FIG. 4B. At step 460, for the obtained directed graph data, tail nodes having the same source node may be aggregated based on the source node identification to obtain a set of tail nodes corresponding to each source node. In such an embodiment, each source node may transmit its own associated radix counter obtained from the parameter server to all tail nodes in the corresponding set of tail nodes in each subsequent iteration, as shown in step 440'. This helps to further optimize the processing efficiency of the graph data.
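A sketch of the aggregation step (the helper name is hypothetical):

```python
def aggregate_tail_sets(edge_data):
    """Group tail nodes by source node identification, removing duplicates."""
    tail_sets = {}
    for src, tail in edge_data:
        tail_sets.setdefault(src, set()).add(tail)
    return tail_sets

tail_sets = aggregate_tail_sets([("A", "B"), ("A", "C"), ("B", "C")])
# -> {"A": {"B", "C"}, "B": {"C"}}; each source's pulled counter is then
# transmitted once to every tail node in its set
```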
Further, in some embodiments, additionally, a separate node-tag table may be used as input data along with the directed graph data or undirected graph data. In such an embodiment, after the corresponding tail node set is obtained based on the source node identification aggregation, the table formed by the source node-tail node set and the node-tag table may be aggregated to obtain a table of (source node, source node tag, tail node set), where the source node tag may participate in initialization of a radix counter of the relevant source node as node information. This has also been described in detail in the foregoing and will not be described in detail here.
In some embodiments, the method 400A shown in FIG. 4A may include an optional step 470 as shown in FIG. 4C. At step 470, the acquired directed graph data may be divided into multiple sets of graph data. In such embodiments, as shown in step 440'', in each iteration the iterative operation of step 440 may be performed for each set of graph data in parallel. Alternatively, the graph data may be randomly and uniformly divided into multiple sets, which may be stored on different processing devices; in each iteration, the different processing devices may process their stored graph data in parallel according to the above iterative operation. This distributed, parallel data processing improves the processing efficiency of the graph data.
In some embodiments, the aforementioned radix counter may be configured to determine a radix estimate for the node identifications added to it based on a probabilistic algorithm (e.g., LC, LLC, HLLC, AC, etc.). Illustratively, the radix counter may be configured to: determine a bit string corresponding to each node identification added to the radix counter using a preset hash function; and estimate the radix based on the position at which a 1 first occurs in at least some of the bits of each bit string. The preset hash function may be configured to convert the raw data (e.g., node identifications) into hash values that satisfy, or substantially satisfy, a uniform distribution, so that each raw datum is converted into a bit string in which each bit being 0 or 1 can be regarded as an equiprobable event. According to the principle of the Bernoulli process, by checking, starting from the highest or lowest bit of each bit string, the position at which a 1 first appears, the cardinality of the original data can be estimated from the maximum value of these positions. Further, in order to improve the estimation accuracy, a part of the bits in each bit string may be used as a bucket number, e.g., the first few bits of the bit string; for each bucket, the maximum position of the first occurring 1 among the remaining bits is tracked, and the per-bucket maxima are then averaged, e.g., by an arithmetic, geometric, or harmonic mean. Optionally, a correction term may be added to further improve the estimation accuracy.
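The following is a simplified HyperLogLog-style sketch of such a radix counter; the SHA-1-based hash and the constants are illustrative choices, and the small-range correction terms mentioned above are omitted:

```python
import hashlib

class RadixCounter:
    """Simplified HyperLogLog-style radix counter (illustrative, uncorrected)."""

    def __init__(self, bucket_bits=10):
        self.bucket_bits = bucket_bits
        # One register per bucket: max position of the first 1 bit seen so far.
        self.buckets = [0] * (1 << bucket_bits)

    def _hash64(self, node_id):
        # Preset hash function: a (near-)uniformly distributed 64-bit string.
        return int.from_bytes(hashlib.sha1(str(node_id).encode()).digest()[:8], "big")

    def add(self, node_id):
        h = self._hash64(node_id)
        bucket = h >> (64 - self.bucket_bits)     # leading bits select the bucket
        rest_bits = 64 - self.bucket_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1  # position of the first 1 bit
        self.buckets[bucket] = max(self.buckets[bucket], rank)

    def estimate(self):
        m = len(self.buckets)
        alpha = 0.7213 / (1 + 1.079 / m)          # standard HLL bias constant
        return alpha * m * m / sum(2.0 ** -r for r in self.buckets)  # harmonic mean
```

With bucket_bits = 10 (m = 1024 buckets), the standard error of HyperLogLog is roughly 1.04/√m, i.e., about 3%, while the counter occupies only m small registers regardless of how many node identifications are added.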
By enabling the radix counter to determine the radix estimated value according to the algorithm, the consumption of storage and calculation resources can be reduced while the estimation precision is ensured, and the execution efficiency of the whole scheme is improved.
It should be understood that while fig. 4A, 4B, or 4C show the various steps in order, some of the steps may be performed in an order reverse to that shown, or one of the steps may be performed in parallel. For example, steps 460 and 420 shown in fig. 4B may be performed in reverse order or in parallel, steps 470 and 420 shown in fig. 4C may be performed in reverse order or in parallel, and so on.
In some embodiments, in the iterative operations of steps 440, 440' or 440'' described with reference to fig. 4A, 4B or 4C, the radix counter on the parameter server may be updated by the following steps. Specifically, after each source node transmits the radix counter obtained from the parameter server to all relevant tail nodes, each tail node may de-duplicate and merge the radix counters received from all relevant source nodes to generate an updated radix counter. The radix counter associated with the tail node on the parameter server may then be updated based on the generated updated radix counter. In some embodiments, this may include pushing the updated radix counter generated for each tail node to the parameter server, where the parameter server may be configured to, upon receiving an updated radix counter, de-duplicate and merge it with the stored radix counter associated with the same node. De-duplicating and merging two radix counters may refer to the following process: assuming that radix counter C1 determines and stores the cardinality of the elements in set A, and radix counter C2 determines and stores the cardinality of the elements in set B, the radix counter C3 obtained by de-duplicating and merging C1 and C2 determines and stores the cardinality of the elements in the set C formed by the union of sets A and B. In the related art, this de-duplicating merge may also be referred to as fusion.
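Under the RadixCounter sketch above, this de-duplicating merge (fusion) reduces to a bucket-wise maximum, which is what makes the result estimate |A ∪ B| without double-counting shared node identifications:

```python
def fuse(c1, c2):
    """De-duplicate and merge two RadixCounter sketches: C3 estimates |A ∪ B|."""
    assert c1.bucket_bits == c2.bucket_bits
    c3 = RadixCounter(c1.bucket_bits)
    c3.buckets = [max(a, b) for a, b in zip(c1.buckets, c2.buckets)]
    return c3
```

A tail node would fuse all received counters into one updated counter and push it; the parameter server then fuses that update with the stored counter for the same node in exactly the same way.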
Fig. 6 schematically illustrates an example graph data processing architecture 600, according to some embodiments of the present application. The architecture 600 may be deployed, for example, on the server 110 shown in fig. 1, and may be used to perform the graph processing method 400A, 400B, or 400C described with respect to fig. 4A, 4B, or 4C.
As shown in fig. 6, the graph data processing architecture 600 may include a driver 610, one or more executors 620, and a parameter server 630. Illustratively, the directed graph data obtained at step 410 may be stored on the one or more executors 620. Alternatively, as previously mentioned, the directed graph data may be generated based on initially acquired undirected graph data; illustratively, the one or more executors 620 may first obtain the undirected graph data and then invoke the undirected-graph-to-directed-graph conversion module 640 to convert it into directed graph data. This conversion process has been described in detail above with reference to fig. 5 and will not be repeated here.
In embodiments having multiple executors 620, such as the method 400C described with reference to fig. 4C, the acquired directed graph data or undirected graph data may be divided into multiple sets of graph data so that the graph data may be stored distributed across the various executors 620. Subsequently, the executors 620 may process their respective stored graph data in a distributed, parallel manner. For example, each executor 620 may initialize the associated radix counters in parallel based on the node information of each node in its set of graph data, and/or perform, in each iteration, the iterative operations in step 440 described with reference to fig. 4A on its set of graph data in parallel. Specifically, before the iteration starts, each executor 620 may request the parameter server, in a distributed and parallel manner, to initialize the corresponding radix counters based on its stored set of graph data, which may be accomplished according to the various embodiments described above. In each iteration, each executor 620 may pull, based on its stored set of graph data, the radix counters of all source nodes contained in that set from the parameter server, transfer the pulled radix counter of each source node to all tail nodes associated with that source node in the set, and then, for each tail node, update the radix counter associated with the tail node stored on the parameter server based on all the received radix counters. The updating process may be implemented according to the processes described in the foregoing embodiments and is not repeated here.
In some embodiments, before the iteration begins, such as at step 420 shown in FIG. 4A, the driver 610 may request the parameter server 630 to create the radix counters 631 associated with the respective nodes; for example, the driver 610 may request a corresponding client on the parameter server 630 to create a radix counter matrix to store the radix counters associated with the respective nodes. Illustratively, radix counters based on the aforementioned probabilistic algorithms may be created, e.g., radix counters of the HyperLogLog type. Further illustratively, for radix counters of, for example, the HyperLogLog type, an instruction for adding elements to the counter may be used to add node identifications to it; a merge instruction for merging and de-duplicating the elements involved in two radix counters may be used to de-duplicate and merge two or more radix counters; an instruction for outputting the cardinality stored by the counter may be used to view the cardinality in the counter; and so on.
In some embodiments, to further improve execution efficiency and reduce the demand on computing resources, the set of graph data stored on each executor 620 may be further divided into data batches. In such an embodiment, to ensure consistency of the radix counters pulled from the parameter server during each iteration, an associated first radix counter and second radix counter may be created for each node on the parameter server 630 when creating the radix counters, as shown in fig. 6. In each iteration, each set of graph data may be further divided into a plurality of data batches according to a preset standard; for example, the data amount of each batch may be preset, and the graph data may then be processed batch by batch according to the preset data amount until the stored graph data is fully processed. For each data batch, the following operations may be performed: for each source node in the data batch, obtaining the first radix counter associated with the source node from the parameter server and transmitting the obtained first radix counter to all tail nodes related to the source node; and for each tail node, updating the second radix counter associated with the tail node on the parameter server based on the first radix counters received from all relevant source nodes of the tail node. The updating process may be performed according to the foregoing embodiments and is not repeated here. The parameter server may be configured to update the first radix counter based on the second radix counter of each node at the end of each iteration round, e.g., the first radix counter may be updated so that it is the same as the second radix counter. Thus, within each iteration, the second radix counter changes as the different data batches and/or sets of graph data are processed, while the first radix counter remains unchanged; the content stored in the second radix counter is not transferred to the first radix counter until all data batches in all sets of graph data have been processed. Through this mechanism, during each iteration the radix counters pulled for each source node are exactly the counters as of the end of the previous iteration, unaffected by the current iteration; this improves execution efficiency and reduces the occupation of computing resources without affecting the accuracy of the graph data method.
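A sketch of how one round could be processed batch by batch under this two-counter mechanism; the `pull_first`/`merge_into_second`/`promote_second` parameter-server calls are hypothetical names for the operations described above:

```python
def run_round_in_batches(graph_group, ps, batch_size):
    """Process one iteration round over one set of graph data, in batches."""
    for start in range(0, len(graph_group), batch_size):
        for src, tail in graph_group[start:start + batch_size]:
            first = ps.pull_first(src)         # frozen during the whole round
            ps.merge_into_second(tail, first)  # accumulates this round's update

# At the end of the round, after ALL groups and batches are done:
#     ps.promote_second()  # copy each second counter into the first counter
```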
Alternatively, the graph data processing architecture 600 may be implemented by combining a Spark computation engine, such as those previously mentioned, with a parameter server, which may be, for example, an Angel-type parameter server. Introducing the parameter server mitigates several drawbacks of using the Spark computation engine alone: storing and updating the radix counters on the parameter server avoids creating and storing additional RDDs, and it avoids the large network overhead that the extensive data shuffling involved in updating the radix counters would otherwise incur, thereby effectively improving graph data processing performance.
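To make the parameter server's role concrete, here is a minimal in-memory stand-in compatible with the sketch above; the class and method names are our own illustrative choices and deliberately do not mirror Angel's actual API (a real parameter server shards the counter matrix across machines behind pull/push RPCs):

```python
import copy


class InMemoryParameterServer:
    """Single-process stand-in holding a read-side (first) and a
    write-side (second) radix counter per node."""

    def __init__(self):
        self.first = {}
        self.second = {}

    def create_and_init(self, node_ids):
        # Create both counters per node, initialized with the node's own identification.
        for n in node_ids:
            f, s = RadixCounter(), RadixCounter()
            f.add(n)
            s.add(n)
            self.first[n], self.second[n] = f, s

    def pull_first(self, node_id):
        return self.first[node_id]

    def merge_into_second(self, node_id, counter):
        self.second[node_id].merge(counter)

    def copy_second_to_first(self):
        # End of an iteration round: make first identical to second.
        self.first = {n: copy.deepcopy(c) for n, c in self.second.items()}
```

For example, with the edges a→b, a→c, and b→c, two iteration rounds leave node c's counter holding roughly three distinct identifications (a, b, and c itself):

```python
ps = InMemoryParameterServer()
ps.create_and_init({"a", "b", "c"})
for _ in range(2):
    run_iteration_round(ps, [[("a", ["b", "c"]), ("b", ["c"])]], batch_size=1)
print(round(ps.first["c"].estimate()))  # ≈ 3: a, b, and c are within 2 hops of c
```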
Fig. 7 schematically illustrates an example block diagram of a graph data processing apparatus 700 according to some embodiments of this application. Illustratively, the graph data processing apparatus 700 may be deployed on the server 110 shown in fig. 1, and may be used to perform the graph data processing method described above. As shown in fig. 7, the graph data processing apparatus 700 includes an acquisition module 710, a creation module 720, an initialization module 730, an iteration module 740, and a determination module 750.
In particular, the obtaining module 710 may be configured to obtain directed graph data that includes a plurality of items of edge data related to a plurality of nodes, where each item of edge data represents an edge pointing from a source node to a tail node and includes a corresponding source node identifier and tail node identifier; the creation module 720 may be configured to create an associated radix counter for each node on the parameter server based on the obtained directed graph data, each radix counter configured to determine a radix estimate based on the node identifications added to it; the initialization module 730 may be configured to initialize the radix counter associated with each node based on node information for the node, wherein the node information includes a node identification; the iteration module 740 may be configured to iteratively update the radix counters until a desired iteration round is reached, performing the following iteration operations in each iteration round: for each source node, obtaining the radix counter associated with the source node from the parameter server and transmitting the obtained radix counter to all tail nodes related to the source node; and for each tail node, updating the radix counter associated with the tail node on the parameter server based on the radix counters received from all source nodes related to the tail node; and the determining module 750 may be configured to determine, based on the iteratively updated radix counters, the number of nodes within a neighborhood of a desired order of at least one node of the plurality of nodes, wherein the desired order is equal to the desired number of iteration rounds. A toy end-to-end sketch tying these five modules together follows.
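As an end-to-end illustration of how the five modules cooperate (reusing the RadixCounter sketch above, with all state kept locally instead of on a parameter server; the function name and edge-list format are our own):

```python
import copy


def process_graph(edge_list, desired_order):
    """Estimate, per node, the number of distinct nodes within its
    desired_order-hop in-neighborhood (including the node itself)."""
    nodes = {n for edge in edge_list for n in edge}               # obtaining module
    counters = {n: RadixCounter() for n in nodes}                 # creation module
    for n in nodes:
        counters[n].add(n)                                        # initialization module
    for _ in range(desired_order):                                # iteration module
        snapshot = {n: copy.deepcopy(c) for n, c in counters.items()}
        for src, dst in edge_list:                                # each edge src -> dst
            counters[dst].merge(snapshot[src])                    # tail absorbs the source's snapshot
    return {n: counters[n].estimate() for n in nodes}             # determining module


# e.g. process_graph([("a", "b"), ("b", "c")], desired_order=2)["c"] ≈ 3.0
```

The per-round snapshot plays the role of the first radix counter described above: every merge within a round reads counters as they stood at the end of the previous round.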
It should be understood that the apparatus 700 may be implemented in software, hardware, or a combination of software and hardware. Several different modules may be implemented in the same software or hardware configuration, or one module may be implemented by several different software or hardware configurations.
Moreover, the apparatus 700 may be used to implement the method 400A described above, the relevant details of which have been described in detail above and will not be repeated here for the sake of brevity. The apparatus 700 may have the same features and advantages as described with respect to the previous method.
Fig. 8 schematically illustrates an example block diagram of a computing device 800 in accordance with some embodiments of the present application. For example, the computing device 800 may represent the server 110 in fig. 1 or another type of computing device that may be used to deploy the apparatus 700 provided herein.
As shown, the example computing device 800 includes a processing system 801, one or more computer-readable media 802, and one or more I/O interfaces 803 communicatively coupled to each other. Although not shown, the computing device 800 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus utilizing any of a variety of bus architectures, and may also include control and data lines.
The computer-readable medium 802 is illustrated as including memory/storage 805. Memory/storage 805 represents memory/storage associated with one or more computer-readable media. The memory/storage 805 may include volatile storage media (such as Random Access Memory (RAM)) and/or nonvolatile storage media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 805 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). Illustratively, the memory/storage 805 may be used to store the directed or undirected graph data mentioned in the embodiments above, radix counters associated with the nodes, and the like. The computer-readable medium 802 may be configured in various other ways as further described below.
One or more input/output interfaces 803 represent functionality that allows a user to enter commands and information to computing device 800, and that also allows information to be presented to the user and/or sent to other components or devices using various input/output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that does not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), a network card, a receiver, and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a haptic response device, a network card, a transmitter, and so forth. For example, in the above-described embodiments, the input device may allow the user to input the data of the directed graph or the undirected graph, or input the path for acquiring the data of the directed graph or the undirected graph, and the output device may allow the user to view the graph data processing progress and the processing result, for example, to view the number of non-repeated nodes in the neighborhood of the desired order of one or more nodes.
The computing device 800 also includes a graph data processing application 806. The graph data processing application 806 may be stored as computer program instructions in the memory/storage 805. The graph data processing application 806 may implement all functions of the various modules of the apparatus 700 described with respect to fig. 7, along with the processing system 801 and the like.
Various techniques may be described herein in the general context of software, hardware, elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and the like, as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 800. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or an article of manufacture suitable for storing the desired information and accessible by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to the hardware of computing device 800, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. By way of example, and not limitation, signal media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously described, the hardware elements of the processing system 801 and the computer-readable media 802 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include components of integrated circuits or systems-on-chip, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or other hardware devices. In this context, a hardware element may serve as a processing device that performs program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device for storing instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements of the processing system 801. The computing device 800 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, a module executable by the computing device 800 as software may be implemented at least partially in hardware, for example, using a computer-readable storage medium and/or the hardware elements of the processing system 801. The instructions and/or functions may be executed/operated by, for example, one or more computing devices 800 and/or processing systems 801 to implement the techniques, modules, and examples described herein.
The techniques described herein may be supported by these various configurations of computing device 800 and are not limited to specific examples of the techniques described herein.
It should be appreciated that for clarity, embodiments of the application have been described with reference to different functional units. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without detracting from the application. For example, functionality illustrated to be performed by a single unit may be performed by a plurality of different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present application may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
The present application provides a computer readable storage medium having computer readable instructions stored thereon, which when executed, implement the above-described graph data processing method.
A computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computing device executes the graph data processing method provided in the various embodiments described above.
Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Claims (15)
1. A graph data processing method, comprising:
obtaining directed graph data, wherein the directed graph data comprises a plurality of items of edge data related to a plurality of nodes, each item of edge data represents an edge pointing to a tail node from a source node, and comprises a corresponding source node identifier and a corresponding tail node identifier;
creating, on the parameter server, an associated radix counter for each node based on the obtained directed graph data, each radix counter configured to determine a radix estimate based on the node identification added to the radix counter;
initializing a radix counter associated with each node based on node information for the node, the node information including a node identification;
iteratively updating the radix counters until a desired iteration round is reached, performing the following iteration operations in each iteration round: for each source node, obtaining a radix counter associated with the source node from the parameter server and transmitting the obtained radix counter to all tail nodes associated with the source node; for each tail node, updating a radix counter on the parameter server associated with the tail node based on the radix counters received from all relevant source nodes of the tail node;
determining, based on the iteratively updated radix counter, a number of nodes within a neighborhood of a desired order of at least one node of the plurality of nodes, wherein the desired order is equal to the desired number of iterations.
2. The method of claim 1, wherein the obtaining directed graph data comprises:
receiving undirected graph data, wherein the undirected graph data comprises a plurality of items of edge data related to a plurality of nodes, each item of edge data represents an edge connecting two nodes and comprises the corresponding two node identifications;
for each item of edge data in the undirected graph data, exchanging the positions of the two node identifications to generate an item of additional edge data;
and taking the combination of the plurality of items of edge data in the undirected graph data and the generated plurality of items of additional edge data as the directed graph data.
3. The method of claim 1, wherein the node information further comprises a node label, and wherein the initializing an associated radix counter based on the node information for each node comprises:
when the node label of the node belongs to the target label, adding the node identification of the node to the associated radix counter so as to initialize the radix estimate of the radix counter to 1;
when the node label of the node does not belong to the target label, not adding the node identification of the node to the associated radix counter so as to initialize the radix estimate of the radix counter to 0.
4. The method of any of claims 1-3, wherein updating the radix counter on the parameter server associated with the tail node based on the radix counters received from all related source nodes comprises:
de-duplicating and merging the radix counters received from all relevant source nodes to generate an updated radix counter;
updating a radix counter on the parameter server associated with the tail node based on the generated updated radix counter.
5. The method of claim 4, wherein updating the radix counter associated with the tail node on the parameter server based on the generated updated radix counter comprises:
pushing the generated updated radix counter to the parameter server, wherein the parameter server is configured to, upon receiving the updated radix counter, de-duplicate and merge the updated radix counter with the radix counter associated with the same node as the updated radix counter.
6. The method of any of claims 1-3, further comprising:
aggregating tail nodes having the same source node for the directed graph data based on source node identification to obtain a set of tail nodes corresponding to each source node,
and wherein said communicating the obtained radix counter to all tail nodes associated with the source node comprises: and transmitting the acquired radix counter to all tail nodes in the tail node set corresponding to the source node.
7. The method of any of claims 1-3, further comprising:
dividing the directed graph data into a plurality of groups of graph data;
wherein performing the iterative operations in each iteration round comprises: in each iteration round, performing the iterative operations in parallel for the respective groups of graph data.
8. The method of claim 7, wherein the initializing an associated radix counter based on the node information of each node comprises:
the associated radix counters are initialized in parallel based on the node information for each node in the sets of graph data.
9. The method of claim 7, wherein the creating an associated radix counter for each node on a parameter server comprises:
creating an associated first base counter and second base counter for each node on the parameter server,
and wherein said performing the iterative operations in parallel for respective groups of graph data comprises:
for each group of graph data, further dividing the group of graph data into a plurality of data batches;
for each data batch in turn, performing the following operations: for each source node in the data batch, obtaining the first radix counter associated with the source node from the parameter server, and transmitting the obtained first radix counter to all tail nodes related to the source node, and for each tail node, updating the second radix counter associated with the tail node on the parameter server based on the first radix counters received from all source nodes related to the tail node, wherein the parameter server is configured to update the first radix counter based on the second radix counter of each node at the end of each iteration round.
10. The method of claim 1, wherein the radix counter is configured to:
determining, using a preset hash function, a bit string corresponding to each node identification added to the radix counter;
and estimating the radix value based on the position of the first occurrence of 1 in at least some of the bits of each bit string.
11. The method of claim 1, wherein the directed graph data comprises financial transaction graph data, each node identification comprises a transactor identification, and each edge data represents a transaction payment relationship that points from a source transactor to an end transactor.
12. A graph data processing apparatus, comprising:
an obtaining module configured to obtain directed graph data, where the directed graph data includes multiple items of edge data related to multiple nodes, where each item of edge data represents an edge pointing from a source node to a tail node, and includes a corresponding source node identifier and a tail node identifier;
a creation module configured to create an associated radix counter for each node on the parameter server based on the obtained directed graph data, each radix counter configured to determine a radix estimate based on the node identification added to the radix counter;
an initialization module configured to initialize a radix counter associated with each node based on node information for the node, the node information including a node identification;
an iteration module configured to iteratively update the radix counters until a desired iteration round is reached, in each iteration round performing the following iteration operations: for each source node, obtaining a radix counter associated with the source node from the parameter server and transmitting the obtained radix counter to all tail nodes associated with the source node; for each tail node, updating a radix counter on the parameter server associated with the tail node based on the radix counters received from all relevant source nodes of the tail node;
a determining module configured to determine, based on the iteratively updated radix counter, a number of nodes within a neighborhood of a desired order of at least one of the plurality of nodes, wherein the desired order is equal to the desired number of iterations.
13. A computing device, comprising:
a memory configured to store computer-executable instructions;
a processor configured to perform the method of any of claims 1 to 11 when the computer-executable instructions are executed by the processor.
14. A computer-readable storage medium storing computer-executable instructions that, when executed, perform the method of any one of claims 1 to 11.
15. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-11.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111075644.3A | 2021-09-14 | 2021-09-14 | Graph data processing method, graph data processing apparatus, computing device, storage medium, and program product
Publications (1)

Publication Number | Publication Date
---|---
CN114327363A (en) | 2022-04-12
Family
ID=81044992
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |