CN118151861A - Partition determination method and device for distributed storage of graphs

Info

Publication number: CN118151861A
Application number: CN202410401511.8A
Authority: CN (China)
Prior art keywords: node, partition, nodes, storage, seed
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 万小培 (Wan Xiaopei)
Current Assignee: Alipay Hangzhou Information Technology Co Ltd
Original Assignee: Alipay Hangzhou Information Technology Co Ltd
Application filed by: Alipay Hangzhou Information Technology Co Ltd
Classification: Information Retrieval; DB Structures and FS Structures Therefor

Abstract

The embodiments of the present specification provide a partition determination method and apparatus for distributed storage of a graph, which are used to split a target graph into storage partitions respectively corresponding to a plurality of distributed devices. Specifically, during the distributed storage of the graph, multiple rounds of diffusion operations are jointly performed by the distributed devices based on the initial partitions determined via an initial segmentation. In a single round of the diffusion operation, each distributed device determines seed nodes for its corresponding storage partition and broadcasts them to the other distributed devices; each device then diffuses to the first-order neighbor nodes of the seed nodes in its local initial partition, marking only the storage partition to which each node belongs without marking the storage partition to which any connection edge belongs; finally, each distributed device determines the storage partition to which each corresponding connection edge belongs according to the storage partitions to which the nodes in its local initial partition belong. This implementation can reduce memory consumption and improve the partitioning speed of the graph.

Description

Partition determination method and device for distributed storage of graphs
Technical Field
One or more embodiments of the present disclosure relate to the field of distributed computing technology, and in particular, to a partition determination method and apparatus for performing distributed storage on a graph.
Background
The graph may describe various entities or concepts existing in the real world and the relationships between them, and may take the form of a huge semantic network, where nodes represent entities or concepts (which may also be represented as the concept or entity to which an instance corresponds), and edges correspond to attributes of the entities or relationships between the entities. The graph may include, for example, a knowledge graph, a bipartite graph, or a homogeneous graph (containing a single node type and a single edge type, such as a social graph, a transaction graph, etc.).
In practical applications of graphs, the data volume of a graph may be huge, for example on the order of billions or tens of billions. An important application of graph data is to model the nodes in a graph using graph neural networks (Graph Neural Networks, GNNs) and then use the trained model to predict whether specific edges exist between nodes. As the scale of graph data continues to expand and graph structures continue to become more complex (e.g., heterogeneous graphs, multigraphs), it has become difficult for a single machine to support graph data at the billion or ten-billion scale. Conventional solutions may be implemented based on a distributed graph sampling system. Specifically, a graph-cut task is first performed on the full graph data, and the graph data is cut into a plurality of partitions, so that each partition is small enough to be loaded into the memory of a single device. Since graph segmentation generally consumes a large amount of memory, reducing the memory consumption of the graph segmentation process is an important technical problem for improving the efficiency of graph segmentation for distributed graph storage.
Disclosure of Invention
One or more embodiments of the present specification describe a partition determination method and apparatus for distributed storage of graphs, to solve one or more of the problems mentioned in the background.
According to a first aspect, there is provided a partition determination method for distributed storage of a graph, for splitting a target graph into storage partitions respectively corresponding to a plurality of distributed devices; the method is performed by a first device of the plurality of distributed devices, and includes multiple rounds of diffusion operations performed, based on the target graph, on a first initial partition local to the first device, wherein any round of the diffusion operation includes: determining a plurality of seed nodes for the first storage partition, sending them to the other distributed devices along with the storage partition mark of the first storage partition, and receiving other seed nodes with storage partition marks sent by the other distributed devices; for each current seed node in a seed node set, traversing its first-order neighbor nodes in the first initial partition and adding the storage partition mark corresponding to the current seed node to those first-order neighbor nodes, so as to update the mark states of the nodes in the first initial partition, wherein the seed node set includes the plurality of seed nodes and the other seed nodes; and exchanging the node marking result of the current round with the other distributed devices for the next round of diffusion operation.
In one embodiment, the first initial partition is obtained by initially partitioning the target graph using a 2D-hash partitioning manner, in which the first device corresponds to row r and column c of a device array with R rows and C columns, and the node identifier of at least one of the two nodes connected by any single connection edge in the first initial partition satisfies at least one of the following: its result modulo R is r, or its result modulo C is c.
In one embodiment, the graph data of the first initial partition is recorded in a target data format based on compressed row storage (CRS), the target data format including a connection record row in which the number of neighbor nodes of each node is recorded together with a first index value, where the first index value indicates the number of neighbor nodes in the first initial partition that belong to the same storage partition as the corresponding node.
In a further embodiment, determining the seed nodes for the first storage partition includes: selecting K nodes, in order of segmentation complexity, from a target node set obtained through the previous diffusion rounds, wherein the segmentation complexity is determined according to the first index value, and the nodes in the target node set are nodes that have been marked as belonging to the first storage partition and have not been determined as seed nodes in previous diffusion rounds.
In a still further embodiment, K is determined by the product of the number of nodes in the target node set and a partition speed, the partition speed being updated by gradient descent at a predetermined learning rate in each diffusion round, and the gradient of a single diffusion round being determined by the weighted sum of the logarithm of a first ratio and the logarithm of a second ratio, balanced by a predetermined point/edge balance factor: the first ratio is the ratio of the number of nodes cumulatively marked for the local storage partition in previous diffusion rounds to the average number of nodes cumulatively marked per storage partition in previous diffusion rounds; the second ratio is the ratio of the number of connection edges cumulatively determined for the local storage partition in previous diffusion rounds to the average number of connection edges cumulatively determined per storage partition in previous diffusion rounds.
In one embodiment, the K nodes are the first K nodes of the target node set when ranked by segmentation complexity from low to high; the segmentation complexity of a single node is positively correlated with the number of its edges that have not been traversed on each distributed device, and the number of edges of the single node that have not been traversed on a single distributed device is the difference between the number of its neighbor nodes in the corresponding initial partition and the first index value.
In one embodiment, in the connection record row, the number of neighboring nodes of each node in the first initial partition is recorded by a first number of high order bits, and the first index value is recorded by a second number of low order bits.
In one embodiment, the target data format further includes an edge attribute row; in the case where the graph is a directed graph, a reverse edge is added between every two nodes connected only by a one-way edge, and the edge attribute row records whether the edge corresponding to each element in the neighbor node row is an added reverse edge.
In one embodiment, the target data format further includes a node partition line, configured to record a storage partition to which each node belongs; wherein updating the marker state of the node in the first initial partition comprises: and updating a field corresponding to the first-order neighbor node of the node partition row to enable the field to contain a storage partition corresponding to the current seed node.
In one embodiment, the target data format further includes a neighbor node row, configured to record, in sequence, a neighbor node identifier of each node; updating the marking state of the node in the first initial partition, comprising: in the case where a single neighbor node of a single node is marked as having the same memory partition as the single node, the order of the neighbor node identifications of the single node in the neighbor node row is adjusted such that the local node identifications of the single neighbor node are arranged before the local node identifications of other nodes that do not have the same memory partition as the single node.
In a further embodiment, traversing, for each current seed node in the seed node set, its first-order neighbor nodes in the first initial partition and adding the storage partition mark corresponding to the current seed node to those first-order neighbor nodes includes: using the first index value of the current seed node as an index into the connection record row, and traversing the neighbor nodes located after that index as the first-order neighbor nodes.
In one embodiment, the method further includes: after the multiple rounds of diffusion operations, outputting, according to the partition marking state of each node in the first initial partition, the partition result of the storage partition to which each corresponding connection edge belongs to a local partition file.
In one embodiment, the first initial partition includes a first connection edge, and outputting, according to the storage partition to which each node in the first initial partition belongs, the partition result of the storage partition to which each corresponding connection edge belongs to a local partition file includes: in the case where the two nodes connected by the first connection edge are both marked as belonging to the same plurality of storage partitions, determining the number of connection edges currently corresponding to each of the plurality of storage partitions, and attributing the first connection edge to the storage partition with the smallest number of corresponding connection edges.
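As an illustration of the tie-breaking rule in this embodiment, the following Python sketch (not part of the specification; the data layout and names are assumptions) attributes a connection edge to a storage partition shared by both endpoints, preferring the partition that currently has the fewest attributed edges:

```python
def assign_edge(marks_u, marks_v, edges_per_partition):
    """Attribute a connection edge to a storage partition.

    marks_u / marks_v   : sets of storage-partition ids marked for the two endpoints
                          (assumed to share at least one partition once the edge is traversed)
    edges_per_partition : dict {partition_id: number of edges already attributed}
    """
    shared = marks_u & marks_v
    # If both endpoints are marked with several common storage partitions,
    # pick the one that currently has the fewest attributed connection edges.
    target = min(shared, key=lambda p: edges_per_partition.get(p, 0))
    edges_per_partition[target] = edges_per_partition.get(target, 0) + 1
    return target
```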
In one embodiment, the interacting the present round of node marking results with other distributed devices for the next round of diffusion operations includes: and acquiring node marking results of all nodes corresponding to the first initial partition from the local node marking results of other distributed devices, and updating the storage partition marking state of the corresponding node in the first initial partition.
According to a second aspect, there is provided a partition determination apparatus for distributed storage of a graph, for splitting a target graph into storage partitions respectively corresponding to a plurality of distributed devices; the apparatus is arranged on a first device of the plurality of distributed devices, and includes an acquisition unit, a diffusion unit and a communication unit, which are used to perform multiple rounds of diffusion operations, based on the target graph, on a first initial partition local to the first device, wherein in any round of the diffusion operation:
the acquisition unit is configured to: determining a plurality of seed nodes in the first storage partition, sending the seed nodes to other distributed devices along with the storage partition marks of the first storage partition, and receiving other seed nodes with the storage partition marks sent by other distributed devices;
The diffusion unit is configured to: traversing first-order neighbor nodes in a first initial partition aiming at each current seed node in a seed node set respectively, and adding storage partition marks corresponding to the current seed nodes for the first-order neighbor nodes so as to update the mark states of the nodes in the first initial partition; wherein the seed node set includes the plurality of seed nodes and other seed nodes;
The communication unit is configured to: and interacting the node marking result of the round with other distributed equipment for the next round of diffusion operation.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements the method of the first aspect.
The method and apparatus provided by the embodiments of the present specification are used to split a target graph into storage partitions respectively corresponding to a plurality of distributed devices. Specifically, during the distributed storage of the graph, multiple rounds of diffusion operations are jointly performed by the distributed devices based on the initial partitions determined by initial partitioning. In a single round of the diffusion operation, each distributed device determines seed nodes for its corresponding storage partition and broadcasts them to the other distributed devices; each device then diffuses to the first-order neighbor nodes in its local initial partition and marks them as belonging to the same storage partition as the corresponding seed node. Finally, each distributed device determines the storage partition to which each corresponding connection edge belongs according to the storage partitions to which the nodes in its local initial partition belong. Because only the storage partition to which each node belongs is marked, the storage partition to which each edge belongs need not be marked, and a single diffusion round considers only the first-order neighbor nodes of the seed nodes, this implementation consumes less memory and achieves a higher partitioning speed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a schematic diagram of an architecture applicable to the present specification;
FIG. 2 shows a schematic diagram of two cases of point segmentation and edge segmentation of a graph cut;
FIG. 3 is a schematic diagram of an architecture for one implementation of the technical concepts of the present specification;
FIG. 4 illustrates a partition determination flow diagram for distributed storage of a graph performed by a single distributed device, according to one embodiment;
FIG. 5 illustrates a schematic diagram of determining initial partitions for the connection edges of a graph via 2D hashing;
FIG. 6 illustrates a diffusion schematic of a diffusion operation of a seed node by a single distributed device in a single diffusion round, according to one specific example;
FIG. 7 illustrates a CRS storage format schematic diagram according to a specific example;
FIG. 8 illustrates a data update diagram for a diffusion operation according to the CRS memory format of FIG. 7;
FIG. 9 illustrates a schematic block diagram of a partition determination apparatus provided to a single distributed device for distributed storage of a graph, according to one embodiment.
Detailed Description
The technical scheme provided in the specification is described below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an architecture applicable to the present specification. As shown in fig. 1, an applicable architecture for the present specification is a distributed architecture. A distributed system may include multiple devices, such as distributed device 1, distributed device 2 and distributed device 3 in fig. 1 (in practice a distributed system may contain more distributed devices). In larger-scale graph data applications, it is often necessary to split a graph into multiple sub-graphs for storage in a distributed system. As in fig. 1, the graph is split into sub-graph 1, sub-graph 2, sub-graph 3, etc. (more sub-graphs may be split in practice), which are stored on the respective distributed devices.
It will be appreciated by those skilled in the art that a graph is a structure composed of a set of nodes and a set of connection edges connecting those nodes, and a graph may have a more complex structure, such as a heterogeneous graph, a multigraph, or a knowledge graph. In the field of graph applications, a homogeneous graph describes a graph structure with only one type of node and one type of edge (the graphs referred to in this specification may be various relation networks and may include such graphs), whereas the nodes and connection edges in a heterogeneous graph may have multiple types and a complex structure; for example, the nodes in the graph may be users, commodities, stores, and the like, and the edge types may be purchase, visit, and the like. In addition, a multigraph is a graph structure in which multiple connection edges may exist between two nodes; for example, where the node types include user and commodity and the edge type includes purchase, a single user purchasing the same commodity multiple times at different moments may correspond to multiple edges between that user and that commodity. Typically, a new graph formed by selecting a portion of the nodes and connection edges from an original graph is referred to as a sub-graph of the original graph.
Graph segmentation is the process of splitting the points and connection edges in a target graph into a plurality of partitions (which are a type of sub-graph) according to certain rules, so that very-large-scale graph data can conveniently be processed in a distributed manner. Graph splitting can generally be divided into two types: point splitting (edge partitioning) and edge splitting (point partitioning). As shown in fig. 2, the left side illustrates point splitting and the right side illustrates edge splitting. In point splitting, a single point in the graph (e.g., point 211 in fig. 2) may be allocated to multiple partitions simultaneously, while a single connection edge uniquely corresponds to a single partition. In edge splitting, a connection edge in the graph (such as the connection edge between points 221 and 222 in fig. 2) may be allocated to multiple partitions simultaneously, while a single point uniquely corresponds to a single partition.
Graph partitioning algorithms (also referred to as splitting algorithms) can be divided into offline partitioning algorithms and streaming partitioning algorithms according to whether the graph data must be fully loaded into memory. Offline partitioning algorithms generally need to run on a distributed system and therefore consume more resources, but usually obtain better partitioning results, e.g., ParMetis, XtraPuLP and Distributed NE; in contrast, streaming partitioning algorithms typically consume less memory and less time, but their partition quality is lower, e.g., 2D Hash, SNE, etc. On large-scale graph data, partition quality, the memory consumed by partitioning, and the time consumed by partitioning are all very important, and it is usually difficult to balance all three aspects. Graph partitioning serves multiple downstream tasks, which may have different requirements on indexes such as point balance, edge balance, and replication factor (an index describing the degree to which the same node is replicated across multiple distributed devices). Generally, on large-scale graph data, the resources and time consumed by partitioning are relatively large, so in industry the graph data is usually partitioned once, and the partitioning result should support the various downstream tasks performed based on the distributed graph partitions. For this reason, a partitioning algorithm with better point/edge balance and lower replication is often needed, so as to maximize the value of the partitioning.
In view of this, the present specification proposes a technical concept: in the process of point splitting of the graph (such as the left example of fig. 2), each connection edge is first subjected to preliminary partitioning, so as to obtain the initial partitions respectively corresponding to the distributed devices. Typically, a single distributed device obtains one initial partition. The initial partitions often have a certain randomness, and point balance, edge balance, replication factor, internal connectivity of the sub-graph, and the like often cannot be guaranteed; therefore, in the graph splitting process, the initial partitions serve as the starting partitions for the distributed graph-segmentation processing performed among the distributed devices. Then, each distributed device selects several seed nodes for its local storage partition and performs multiple rounds of iterative diffusion together with the other distributed devices, marking the nodes in its local initial partition and thereby determining, for each node in the local initial partition, the storage partition (corresponding to a distributed device) to which it belongs. Here, the storage partition and the initial partition are relative concepts: the initial partition is the sub-graph processed by a distributed device during the graph splitting process, and the storage partition is the sub-graph corresponding to (or stored by) a distributed device after the graph splitting is completed.
Specifically, as shown in fig. 3, in a single diffusion round: each distributed device broadcasts its selected seed nodes to the other distributed devices; each distributed device queries, according to all the seed nodes received in the current round, the corresponding first-order neighbor nodes in its local initial partition, and marks the storage partition to which each first-order neighbor node belongs; each device broadcasts its node marking results to the other distributed devices and, in turn, obtains the node marking results sent by the other distributed devices so as to re-determine the seed nodes corresponding to its local storage partition. Through multiple rounds of this iterative loop, once the connection edges in the local initial partition have all been traversed, a single distributed device can finish the node marking of its local initial partition; if other devices have not yet finished the diffusion process, it may still perform operations in the diffusion process other than marking the neighbor nodes of its local initial partition. Here, traversal of a connection edge can be understood as: its two endpoint nodes are marked with at least one common storage partition. Accordingly, the connection edges in the local initial partition may be considered fully traversed when the following condition is satisfied: for any node, each of its neighbor nodes shares at least one storage partition with that node. Finally, each distributed device outputs to a file the partition result that assigns each connection edge in its local initial partition to a storage partition. When a downstream task is performed, the device performing the downstream task may read the corresponding sub-graph according to the storage partition to which each connection edge belongs, as described in the above files of the respective distributed devices.
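For readability, the single diffusion round described above can be summarized in the following Python sketch; the communication helper (comm.all_gather) and the data structures are assumptions introduced for illustration rather than part of this specification.

```python
from collections import defaultdict

def one_diffusion_round(adjacency, my_partition_id, node_marks, comm):
    """Sketch of a single diffusion round on one distributed device.

    adjacency  : {node_id: [neighbor_ids]} of the local initial partition
    node_marks : defaultdict(set) mapping node_id -> set of storage-partition ids
    comm       : assumed communication helper exposing an all_gather() primitive
    """
    # 1. Select seed nodes for the storage partition owned by this device
    #    (simplified here: every locally marked node; the specification picks
    #    only a subset, as discussed later).
    my_seeds = [n for n, marks in node_marks.items() if my_partition_id in marks]

    # 2. Broadcast the local seeds (with their storage-partition mark) and
    #    receive the seeds of the other devices.
    seeds_by_partition = comm.all_gather({my_partition_id: my_seeds})

    # 3. Mark the first-order neighbors of every seed in the local initial partition.
    new_marks = defaultdict(set)
    for pid, seeds in seeds_by_partition.items():
        for seed in seeds:
            for nbr in adjacency.get(seed, []):
                if pid not in node_marks[nbr]:
                    node_marks[nbr].add(pid)      # only nodes are marked, never edges
                    new_marks[nbr].add(pid)

    # 4. Exchange this round's node-marking results for the next round.
    comm.all_gather(dict(new_marks))
```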
In the partitioning process of the graph, only the storage partition to which the node belongs is marked in the memory without marking the storage partition to which the connecting edge belongs, and the partitioning result of the connecting edge is directly output to the disk file after the storage partition to which the node belongs is marked, so that the memory can be saved.
The technical idea of the present specification is described in detail below with reference to the embodiment shown in fig. 4.
As shown in FIG. 4, a partition determination flow for distributed storage of a graph is shown, according to one embodiment. The flow may be performed by a single distributed device. The single distributed device may be any computer, device, server having a certain computing power, or may be a separate running space running on a computer, device, server, such as a virtual machine. It is understood that the partitioning process of the distributed storage graph data may be performed by a plurality of distributed devices. Each distributed device in the distributed system may jointly determine the storage partition to which each connection edge in the target graph data belongs. The graph is a dataset comprising a number of nodes and connection edges describing connection relationships between the nodes.
From the foregoing description of the technical concept of the present specification, it can be seen that any single distributed device in the distributed system may correspond to an initial partition determined by the initial segmentation, and to a storage partition that is the result of executing the graph cut. The initial partition is the starting partition used by each distributed device during the distributed processing; that is, it is the partition temporarily assigned to a device for processing before the corresponding storage partitions are determined. A partition is a type of sub-graph and may be part of the target graph data. In this specification, where the first device is any one of the distributed devices performing the graph cut, its corresponding initial partition may be referred to as the first initial partition. It should be noted that the preliminary partitioning of the connection edges may be performed randomly, or according to a predetermined rule (for example, the connection edges are evenly distributed to the distributed devices according to their edge identifiers).
According to one embodiment, to reduce replication, the graph may be initially partitioned using 2D hashing. In the technical field of graph segmentation, 2D hashing may arrange the distributed devices as a two-dimensional matrix and then allocate each connection edge, according to the hash values of the nodes it connects, to the device at the corresponding row and column of the two-dimensional matrix. Optionally, this process may be performed in parallel by the P (P greater than 1) distributed devices in the distributed system: each distributed device reads part of the graph and traverses its connection edges, determines the two 2D-hash values for each single connection edge, and distributes that connection edge to the distributed device corresponding to those two hash values.
For example, the P distributed devices may be regarded as a device grid with R rows and C columns; in other words, where P = R × C is satisfied, each distributed device is uniquely determined by one row number and one column number. Because there are various possibilities for the values of R and C, gridding the devices may be meaningless for unreasonable row/column values, e.g., R = 1 and C = P. Thus, in an alternative embodiment, for P distributed devices, the values of R and C may be determined by minimizing R + C subject to P = R × C. In this way, R and C can be made as close as possible, so that the values are more balanced. In alternative embodiments, R ≥ C may also be predefined. As an example of 2D hashing, assume a connection edge is denoted (S, O), where S and O are the node identifiers of the two nodes it connects. The hash values corresponding to the edge may be obtained by taking the node identifiers (or their values after processing by a hash algorithm) modulo the numbers of rows and columns, respectively: r = S % R, c = O % C. The connection edge (S, O) can then be assigned to the distributed device in row r and column c; all connection edges incident to a given node are thus spread over at most R + C − 1 distributed devices.
Fig. 5 shows an example of initially partitioning the connection edges associated with node 17 via 2D hashing. In the example shown in fig. 5, the numbers of rows and columns are R = C = 4; for the connection edges associated with the node identified as 17, the corresponding row number is r = 17 % R = 1 and the column number is c = 17 % C = 1. Depending on the node identifier of the other node each connection edge connects, each connection edge incident to node 17 is assigned to a distributed device with row number r = 1 or column number c = 1, i.e., the R + C − 1 distributed devices in the second row and second column of fig. 5. Thus, the replication of a node decreases from up to R × C devices to at most R + C − 1 devices.
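A minimal sketch of this 2D-hash assignment (assuming integer node identifiers and 0-indexed rows and columns; the edge endpoints other than node 17 are made up for illustration) could look as follows:

```python
def assign_edge_2d_hash(src, dst, R, C):
    """Assign the connection edge (src, dst) to the device at row src % R,
    column dst % C of an R x C device grid; returns the (row, col) index."""
    return (src % R, dst % C)

# Toy edges incident to node 17 (the other endpoints are made up for illustration).
# With R = C = 4 every such edge lands on a device whose row index is 1 or whose
# column index is 1, i.e. on at most R + C - 1 = 7 distinct devices.
edges_of_17 = [(17, 3), (17, 22), (5, 17), (40, 17)]
placements = [assign_edge_2d_hash(s, o, 4, 4) for s, o in edges_of_17]
# placements == [(1, 3), (1, 2), (1, 1), (0, 1)]
```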
It will be appreciated that, since the preliminary partitioning is performed according to hash values of the nodes connected by each connection edge, the initial partition (or initial sub-graph) obtained by a single distributed device may not be fully connected. The sub-graph that a single distributed device ultimately holds as the result of graph splitting via the distributed system, as opposed to its initial partition, may be referred to as its storage partition. Storage partitions are typically connected sub-graphs. For the first device of fig. 4, which may be any of the distributed devices, the corresponding storage partition may be referred to as the first storage partition.
After the preliminary partitioning is completed, each distributed device may jointly perform multiple rounds of diffusion operations. In the single round of diffusion operation, the first device marks the storage partition to which the node in the first initial partition belongs.
As shown in fig. 4, in a single round of the diffusion operation, the partition determination flow performed by the first device for distributed storage of the graph may include the following steps: step 401, determining a plurality of seed nodes for the first storage partition, sending them to the other distributed devices along with the storage partition mark of the first storage partition, and receiving other seed nodes with storage partition marks sent by the other distributed devices; step 402, for each current seed node in a seed node set, traversing its first-order neighbor nodes in the first initial partition and adding the storage partition mark corresponding to the current seed node to those first-order neighbor nodes, so as to update the mark states of the nodes in the first initial partition, the seed node set including the plurality of seed nodes and the other seed nodes; step 403, exchanging the node marking result of the current round with the other distributed devices.
First, in step 401, a number of seed nodes are determined for a first storage partition, sent to other distributed devices along with a storage partition flag for the first storage partition, and received other seed nodes with storage partition flags sent by other distributed devices.
It will be appreciated that distributed storage of graph data means splitting the graph data into multiple storage partitions that respectively correspond to the distributed devices, or that respectively serve as the storage sub-graphs of the distributed devices. To achieve partition processing efficiency as high as possible, a single distributed device may correspond to a single initial partition and a single storage partition. The present specification describes the related technical concept taking the case of a single distributed device corresponding to a single initial partition and a single storage partition as an example; however, in practice the possibility that a single distributed device corresponds to multiple initial partitions, or to zero or multiple storage partitions, is not excluded.
For a single distributed device, its corresponding storage partition may be diffused through a small number of seed nodes. A seed node may be understood here as an initial node determined for a corresponding memory partition. In an initial diffusion round, seed nodes of a local storage partition may be determined from the local initial partition. In the process of selecting the initial seed node, the initial seed node may be selected randomly or according to a predetermined rule (for example, selecting a predetermined number of nodes with the largest corresponding edges), which is not limited herein. Since memory partitions are typically connected subgraphs, in non-initial rounds, seed nodes may be determined from nodes marked as corresponding memory partitions in previous rounds. For the first device, in a first diffusion round, its seed node may be the node selected from the first initial partition, and in other diffusion rounds, its seed node may be determined from the various nodes marked as the first storage partition.
For a non-initial round, in the process of determining the current round's seed nodes for the first storage partition, all the nodes marked as the first storage partition in the previous round may be determined as the seed nodes of the current round, a part of those nodes may be randomly selected as the seed nodes of the current round, or a part of those nodes may be selected as the seed nodes of the current round according to a predetermined selection rule. In general, when the graph data is large, tens of millions of neighbor nodes may be generated after one diffusion iteration, and directly using all neighbor nodes as seed nodes for diffusion leads to very poor point/edge balance. Thus, each diffusion round may pick only a portion of the nodes marked as the first storage partition as seed nodes for the current round.
According to one embodiment, in each diffusion round, the first device records each node marked as the first storage partition and accumulates them into a target node set (e.g., recording the accumulated node identifiers in the form of a vector), and screens out of the target node set the seed nodes already selected in previous diffusion rounds, taking the remaining nodes as candidate nodes. In a single diffusion round, a portion of the target node set is selected as seed nodes.
According to another embodiment, in each diffusion round, the first device records each node that is marked as the first storage partition and still has connection edges that have not been traversed, and accumulates them into a target node set (e.g., recording the accumulated node identifiers in the form of a vector), and screens out of the target node set the seed nodes already selected in previous diffusion rounds, taking the remaining nodes as candidate nodes. Similarly, in a single diffusion round, a portion of the target node set is selected as seed nodes.
According to one possible design, the predetermined rule for determining the seed nodes of the first storage partition may be: determining a corresponding segmentation complexity score for each candidate node in the target node set, sorting the candidate nodes by segmentation complexity score from low to high, and taking the first K candidate nodes as the seed nodes of the current round. The segmentation complexity score of a single node describes how complex it is to handle the connection edges associated with that node in the subsequent segmentation, and the segmentation complexity of a single node may be positively correlated with the number of its connection edges that have not been traversed on each distributed device. For example, in one embodiment, the segmentation complexity of a single node is the sum, over the distributed devices, of the number of its connection edges that have not been traversed. The number of edges of a single node that have not been traversed on a single distributed device is the difference between the number of its neighbor nodes in the corresponding initial partition and the following index value: the number of its neighbor nodes in that initial partition that already have the same storage partition as the node. In another embodiment, the segmentation complexity of a single node may also be: the difference between the number of its neighbor nodes in the target graph and the number of its neighbor nodes currently accumulated across the distributed devices that are marked as having the same storage partition as the node. The number of neighbor nodes of a single node in the target graph can be obtained from the target graph, and its neighbor nodes currently accumulated as having the same storage partition can be obtained by counting the marked neighbor nodes on each distributed device.
The number K of seed nodes taken in a single diffusion round may be a predetermined value (e.g., 10), or may be a value determined based on the product of the number of candidate nodes in the target node set and a partition speed. The partition speed may be a proportional value determined based on the point counts and edge counts of all storage partitions, so that the points and edges of the storage partitions tend toward balance (e.g., the balance degree approaches 1) over the round-by-round iteration.
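The seed selection described above (complexity ordering combined with K determined from the candidate count and the partition speed) can be sketched as follows; the data layout and the "sum of untraversed edges" complexity variant are illustrative assumptions:

```python
def select_seeds(candidates, degree, cost, partition_speed, k_min=1):
    """Pick the K lowest-complexity candidate nodes as this round's seed nodes.

    candidates      : node ids marked as this storage partition and not yet used as seeds
    degree[n][d]    : number of neighbors of node n in the initial partition of device d
    cost[n][d]      : first index value, i.e. neighbors of n on device d already marked
                      with the same storage partition as n
    partition_speed : current partition speed v_i (assumed to lie in (0, 1])
    """
    def complexity(n):
        # segmentation complexity: untraversed edges of n summed over all devices
        return sum(degree[n][d] - cost[n][d] for d in degree[n])

    k = max(k_min, int(len(candidates) * partition_speed))
    return sorted(candidates, key=complexity)[:k]
```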
Those skilled in the art will appreciate that it is often difficult to give an objective function for the whole graph-cut process due to the complexity of the graph data. However, since the graph-segmentation process is performed iteratively over multiple diffusion rounds, the present specification proposes a gradient direction and ensures that the gradient value is inversely related to the point/edge imbalance, so that the point/edge imbalance approaches the ideal value 1 as the update proceeds. For example, let v_i denote the partition speed of the i-th storage partition, lr denote the learning rate (which may be a preset value), N_v and N_e respectively denote the number of points and the number of connection edges already determined for the storage partition corresponding to the current distributed device (cumulatively marked or determined in the previous diffusion rounds), N̄_v and N̄_e respectively denote the averages, over all storage partitions, of the numbers of points and connection edges already determined in the previous diffusion rounds, and λ denote the point/edge balance factor (which may be a preset value). The partition speed of the i-th storage partition may then be updated in each diffusion round as:

v_i ← v_i − lr · ( log(N_v / N̄_v) + λ · log(N_e / N̄_e) )

The point and edge imbalances are taken logarithmically here so that: when the current point/edge imbalance is smaller than 1, the term is negative, so the partition speed v_i is updated in the increasing direction; conversely, when the current point/edge imbalance is larger than 1, the term is positive, so the partition speed v_i is updated in the decreasing direction. At the same time, this avoids severe oscillation of the partition speed caused by an excessively large point/edge imbalance in the initially distributed partitions.
It should be noted that, in the case where the initial partitions are determined using 2D hashing, a single connection edge is divided into a single initial partition, while a single node may exist in multiple initial partitions. Therefore, for a single distributed device, N̄_e is relatively easy to determine, whereas determining N̄_v requires counting the number of storage partitions in which each single node is located, which requires a large amount of data synchronization and significantly slows down the graph segmentation. To this end, in one embodiment, corresponding estimates may be used in place of N_v and N̄_v, for example:

N_v ≈ Σ_n c_{i,n},   N̄_v ≈ (1/P) · Σ_i Σ_n c_{i,n}

where c_{i,n} denotes the number of nodes marked as the i-th storage partition on the n-th distributed device and P is the number of storage partitions. In general, the closer the point/edge distribution of the initial partitions is to uniform, the closer the estimates are to the true values.
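A sketch of this update rule might look like the following; the symbol names mirror the formula above, and the default values of lr and λ are assumptions for illustration:

```python
import math

def update_partition_speed(v_i, n_v, n_e, avg_n_v, avg_n_e, lr=0.1, lam=1.0):
    """Gradient-descent update of the partition speed of the i-th storage partition.

    n_v, n_e         : points / connection edges cumulatively determined for this partition
    avg_n_v, avg_n_e : averages (or their estimates) over all storage partitions
    lr, lam          : learning rate and point/edge balance factor (assumed defaults)
    """
    gradient = math.log(n_v / avg_n_v) + lam * math.log(n_e / avg_n_e)
    # If this partition holds fewer points/edges than average, both log terms are
    # negative and the speed increases; if it holds more, the speed decreases.
    return v_i - lr * gradient
```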
For a current single distributed device (e.g., a first device), the seed node of the local storage partition may be determined by itself and broadcast to other distributed devices from which the seed node of the other storage partition may be obtained. For each selected seed node, a storage partition flag for the current storage partition may be carried. Likewise, seed nodes obtained from other distributed devices may carry storage partition markers for other storage partitions. The first device may obtain a seed node set containing seed nodes of the respective distributed devices at the current diffusion round.
Then, via step 402, for each current seed node in the set of seed nodes, traversing its first-order neighbor nodes in the first initial partition, and adding a storage partition flag corresponding to the corresponding current seed node for each first-order neighbor node.
It can be understood that a seed node represents a seed of a storage partition; under the technical concept of the present specification, the first-order neighbor nodes of a single seed node may be marked as belonging to the corresponding storage partition, thereby generating the node marking result of the current diffusion round on the first device, which may also be referred to as the current-round node marking result of the first initial partition. In an alternative embodiment, for a single seed node, if a single first-order neighbor node in the first initial partition already has the same storage partition as the seed node, the storage partition is not marked again for that first-order neighbor node. That is, for a single current seed node in the seed node set, only its first-order neighbor nodes in the first initial partition that do not yet share a storage partition with it are traversed and marked so that they have the same storage partition as the current seed node.
As an example, as shown in fig. 6, assume that X0, X5 and X8 are all seed nodes of the current diffusion round, coming from 3 different distributed devices and corresponding to 3 different storage partitions (represented by different shadings), assumed to be storage partitions A0, A1 and A2, respectively. According to the local initial partition, the seed node X0 is diffused to its first-order neighbor nodes X6, X7 and X9; that is, its first-order neighbor nodes X6, X7 and X9 are marked as the storage partition A0 where the seed node X0 is located. Similarly, for the seed node X5, its first-order neighbor nodes X4 and X6 are marked as the storage partition A1 where the seed node X5 is located, and for the seed node X8, its first-order neighbor nodes X1, X2 and X9 are marked as the storage partition A2 where the seed node X8 is located. Node X6 is thus marked as belonging to both storage partition A0 and storage partition A1 and becomes a split point of storage partitions A0 and A1 (e.g., it may split the edge connecting X0 and X6 to storage partition A0 and the edge connecting X5 and X6 to storage partition A1). Similarly, node X9 is marked as belonging to both storage partition A0 and storage partition A2 and becomes a split point of storage partitions A0 and A2 (e.g., it may split the edge connecting X0 and X9 to storage partition A0 and the edge connecting X8 and X9 to storage partition A2). The current distributed device may record the marking result of the storage partitions to which the nodes X6, X7, X9, X4, X1 and X2 belong. These node marking results serve as the current-round marking result of this single distributed device.
Each of the distributed devices may perform this first-order neighbor diffusion and marking for each group of seed nodes on the basis of its local initial partition.
It should be noted that the target graph may be an undirected graph or a directed graph. In an undirected graph, the connection edge between two nodes is an undirected edge and the connection relationship is relatively easy to find; in a directed graph, the connection relationship between two nodes has a direction, and the connection edges of a node can be divided into outgoing edges and incoming edges according to the direction. Optionally, for a directed graph, the direction of the connection edges is not considered in the diffusion process; therefore, for convenience of lookup, an auxiliary connection edge in the opposite direction can be added for each unidirectional connection edge, so that the two nodes can be regarded as connected to each other by outgoing edges, which facilitates diffusion in both directions.
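For a directed graph, the auxiliary reverse edges and the corresponding isOrigin flags (introduced below) can be generated with a simple pass over the edge list; the tuple layout in this sketch is an assumption, not the specification's storage format:

```python
def add_reverse_edges(directed_edges):
    """Return (src, dst, is_origin) triples: every one-way edge of the directed graph
    gets an auxiliary reverse edge whose is_origin flag is False (a 0 in the isOrigin row)."""
    edge_set = set(directed_edges)
    out = []
    for s, o in directed_edges:
        out.append((s, o, True))          # outgoing edge present in the target graph
        if (o, s) not in edge_set:
            out.append((o, s, False))     # added auxiliary reverse edge
    return out
```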
In the diffusion operation, to facilitate finding the neighbor nodes of each node, in some alternative embodiments, a single distributed device may store the node and connection-edge data of its corresponding initial partition in a target data format based on compressed row storage (Compressed Row Storage, CRS).
CRS is one of the recording modes for sparse matrices; each field is represented as a simple contiguous array, which gives a higher loading speed and lower memory occupancy. CRS storage generally includes three vectors: a row statistics vector row_ptr, a column index vector col_index, and a data vector values. Typically, values stores the non-zero values, col_index records the column index of each non-zero value in the matrix, and row_ptr records, for each row of the matrix, the index in values of that row's first non-zero value; the first element of row_ptr is typically an initial value, e.g., 0, and the number of non-zero values in each row is given by the difference between two adjacent elements. For an initial partition of the graph, the connection relationships of the nodes can be represented by an adjacency matrix, where each row and each column corresponds to a node in the initial partition, the element at the intersection of the row and column of two nodes connected by a connection edge is a non-zero value (e.g., 1, corresponding to an entry of values), and the element for two nodes without a connection edge is zero. The adjacency matrix is a sparse matrix and can therefore be described by CRS. In general, the values item may be omitted when the non-zero values in the adjacency matrix are all a predetermined value.
Based on the CRS approach, in the target data format of the present specification, the neighbor nodes of the nodes may correspond to the col_index vector, denoted for example as the neighbor node row, and the degrees of the nodes may correspond to the row_ptr vector, denoted for example as the connection record row; the degree of a single node can be determined by the difference between its corresponding element value and the previous element value in the connection record row.
Referring to fig. 7, a specific example is given of recording the node information of a single initial partition in the target data format of the CRS scheme. First, each node in the initial partition has a global node identifier (globalIds) in the target graph; once these are sorted, the nodes can be described by local node identifiers. The local node identifier may simply default to the position of the node identifier (so that memory consumption for recording local node identifiers can be avoided). As shown in fig. 7, the global node identifiers describe nodes 1, 3, 4, 6, 7 and 9, which, arranged in order, correspond by position to local node identifiers 0, 1, 2, 3, 4 and 5, respectively. The first element of the offset row (i.e., the connection record row) in fig. 7 is an initial value, typically 0, and the difference between two adjacent elements corresponds in turn to the number of connection edges of each node in the local initial partition. It should be noted that in an undirected graph the number of connection edges is the degree of the node, whereas in a directed graph the number of connection edges may be the out-degree of the node. In the example shown in fig. 7, reverse edges have been added in advance between every two nodes connected by a unidirectional edge, so that the out-degree and in-degree of a single node are consistent, giving the graph undirected-like properties. The neighbor row (i.e., the neighbor node row) in fig. 7 records in turn the local node identifiers of the neighbor nodes of each node. For example, the value of the second element of the offset row minus the first element is 1, which means that the node with local node identifier 0 (corresponding to global node identifier 1) has 1 neighbor node, and the first element of the neighbor row, 4, means that the neighbor node of the node with local node identifier 0 is the node with local node identifier 4, i.e., the node corresponding to global identifier 7. Next, the value of the third element of the offset row minus the second element is 5 − 1 = 4, which means that the node with local node identifier 1 (corresponding to global node identifier 3) has 4 neighbor nodes, and the 4 elements starting from the second element of the neighbor row, namely 2, 3, 4 and 5, mean that the neighbor nodes of the node with local node identifier 1 are the nodes with local node identifiers 2, 3, 4 and 5, i.e., the nodes corresponding to global identifiers 4, 6, 7 and 9. And so on.
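Reading off the example of fig. 7, the rows of the target data format can be represented as plain arrays as follows; only the entries spelled out in the text are shown, and the lookup helper is an illustrative assumption:

```python
# Rows of the target data format for the initial partition of FIG. 7, as far as they
# are spelled out in the text (the remaining entries are elided because the figure
# itself is not reproduced here):
global_ids = [1, 3, 4, 6, 7, 9]   # global node ids; local ids are 0..5 by position
offset     = [0, 1, 5]            # prefix sums: node 0 has 1 neighbor, node 1 has 4, ...
neighbor   = [4, 2, 3, 4, 5]      # local ids: neighbor of node 0 is 4 (global 7);
                                  # neighbors of node 1 are 2, 3, 4, 5 (global 4, 6, 7, 9)

def neighbors_of(local_id):
    """First-order neighbors of a node, read from the CRS-style rows
    (only valid here for local_id 0 and 1, since the arrays above are truncated)."""
    start, end = offset[local_id], offset[local_id + 1]
    return neighbor[start:end]
```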
In the example of fig. 7, the storage partition marks of the nodes are also recorded in the nodePartitionSet row (node partition row). In the nodePartitionSet row, each bit of a single element value corresponds to one storage partition, and the two selectable values 0 and 1 on a bit represent the unmarked and marked states, respectively: in one specific example, 1 means marked as belonging to the corresponding storage partition and 0 means not marked, so on a single bit, 0 is the initial value indicating not marked, and 1 indicates that the node is marked as belonging to the corresponding storage partition. In fig. 7, an element value of 2 in the nodePartitionSet row (corresponding to the binary number 00 … 010) indicates that the corresponding node is marked as belonging to the storage partition corresponding to the second bit, and so on.
In addition, the offset row in fig. 7 may also record a first index value of each node, denoted cost, which indicates the number of neighbor nodes in the local initial partition (e.g., the first initial partition) that have the same storage partition as the corresponding node (corresponding to the number of edges of the node that have already been traversed on this distributed device). In an alternative embodiment, a single element of the offset row records the number of neighbor nodes of the corresponding node in the corresponding initial partition by a first number of high-order bits, and records the first index value by a second number of low-order bits. Thus, the offset and the cost can be packed together to reduce memory consumption, e.g., represented by one value of s bits (e.g., s = 64) rather than two values of s/2 (e.g., 32) bits. This is possible because there is a corresponding constraint between the offset and the cost. For example, with the offset described by the 64 − m (first number) high-order bits and the cost by the m (second number) low-order bits, m is determined from the point and edge counts of the 2D-hash result so that, in the case of s = 64, the following condition is satisfied:

max_{v ∈ V_i} |N(v)| < 2^m   and   |E_i| < 2^{64−1−m}

where N(v) represents the set of first-order neighbor nodes of node v, and V_i and E_i represent the node set and the connection edge set of the i-th initial partition, respectively. Because of the added reverse edges, |E_i| must be less than 2^{64−1−m} instead of 2^{64−m}. In practical applications, an increase of the cost value simply increases the single element value of the offset & cost row, and, in the case where m is less than s/2, the increase of the cost value does not affect the bits in which the offset is stored.
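A sketch of packing the offset and the cost into a single 64-bit element under the above constraint (offset in the high 64 − m bits, cost in the low m bits) might be:

```python
def pack(offset_value, cost_value, m):
    """Pack the offset into the high 64 - m bits and the cost into the low m bits."""
    assert cost_value < (1 << m) and offset_value < (1 << (64 - m))
    return (offset_value << m) | cost_value

def unpack(packed, m):
    return packed >> m, packed & ((1 << m) - 1)    # (offset, cost)

def bump_cost(packed):
    # Incrementing the cost only touches the low-order bits, so the stored offset
    # is unaffected as long as the constraint on m above holds.
    return packed + 1
```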
In an alternative embodiment, the target data format further includes an edge attribute row (the isOrigin row in fig. 7). isOrigin describes whether the corresponding outgoing edge is an added reverse edge. In the isOrigin row shown in fig. 7 (which may also be referred to as an edge record row or edge state record row), 1 represents an outgoing edge present in the target graph, and 0 represents an added reverse edge. Here, an outgoing edge takes the node recorded in the globalIds row as its head node and the node recorded in the neighbor row as its tail node.
In the example shown in fig. 7, since reverse edges are added for unidirectional edges, there is no need to record and query outgoing edges and incoming edges separately. In addition, since only the storage partitions of the points are marked and the storage partitions of the connection edges do not need to be marked, and the number of points is often far smaller than the number of connection edges, a great amount of memory consumption can be saved.
In the example of fig. 7, during the diffusion operation, when the first-order neighbor nodes of a single node (such as v_g) are queried, the start and end positions of its first-order neighbor nodes in the neighbor row can be obtained from the offset, and the corresponding local node identifiers can then be read directly from the neighbor row and used as index values for lookup, with complexity O(1). After the corresponding first-order neighbor node is retrieved, the storage partition to which it belongs may be marked, for example by updating the value in the nodePartitionSet row. On the other hand, the cost values of the seed node and its first-order neighbor nodes can be updated according to the local node identifiers. Taking nodes 7 and 3 in fig. 7 as seed nodes of the current diffusion round coming from two different storage partitions, the marking results and the updates of the data records are shown in fig. 8. Since nodes 4 and 6 are common neighbor nodes of the seed nodes 3 and 7, and node 3 is itself a first-order neighbor node of node 7, the bits corresponding to the storage partitions of the seed nodes can be set to 1 in the nodePartitionSet elements of nodes 3, 4 and 6, and the cost values of nodes 3, 4, 6 and 7 are increased, indicating that the number of marked neighbor nodes has grown. The modified element values in the nodePartitionSet row serve as the current-round marking result. In the case where a single node is a first-order neighbor node of multiple seed nodes, the bits corresponding to the multiple storage partitions in that node's element of the nodePartitionSet row are all modified to 1, so that the element value changes accordingly.
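In terms of data manipulation, one possible reading of the update shown in fig. 8 is a bitwise OR on the neighbor's nodePartitionSet element plus cost increments for both endpoints of the newly traversed edge; the array-based layout below is an assumption for illustration:

```python
def mark_edge_traversed(node_partition_set, offset_and_cost, seed_id, nbr_id, partition_id):
    """Mark neighbor `nbr_id` as belonging to storage partition `partition_id` and bump
    the cost (the low-order bits of the packed offset & cost element) of both endpoints,
    since the connection edge between seed and neighbor is now traversed on this device.
    Both arguments are lists indexed by local node identifier (assumed layout)."""
    bit = 1 << partition_id
    if not node_partition_set[nbr_id] & bit:       # not yet marked for this partition
        node_partition_set[nbr_id] |= bit          # e.g. value 2 == binary 0...010
        offset_and_cost[nbr_id] += 1               # cost occupies the low-order bits
        offset_and_cost[seed_id] += 1
```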
It should be noted that, in the examples of fig. 7 and fig. 8, obtaining a neighbor node, its partition marking result, and its cost value can be completed in one pass, so that in one diffusion round a single node needs to be queried only once, thereby reducing wasted memory accesses.
In addition, as can be seen from fig. 8, nodes 3, 4 and 6 may be diffused to by two different storage partitions in the current round, and, as described above, all storage partitions to which they may belong are recorded in the nodePartitionSet row. In fig. 7 and 8, the values in the nodePartitionSet row can be the result of converting the binary number formed by the individual bits into a decimal number. For example, a value of "3", corresponding to "00……011", indicates that the node has been marked in both the first and the second storage partition.
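As a small illustrative helper (the function name is an assumption), the decimal value of a nodePartitionSet element can be decoded back into the set of storage partitions in which the node has been marked:

def decode_partitions(value: int) -> set[int]:
    # e.g. 3 == 0b011 -> marked in storage partitions 0 and 1
    return {bit for bit in range(value.bit_length()) if value >> bit & 1}

print(decode_partitions(3))   # {0, 1}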
Those skilled in the art will appreciate that, during the graph partitioning process, the number of connection edges already determined for a single storage partition may be accumulated round by round; without directly storing whether a single edge has been assigned, a distributed device cannot tell whether an edge was marked in the current round or in a previous round, which would bias the statistics of the number of connection edges currently determined for each storage partition. To address this, in an alternative embodiment, each node in the local initial partition may be traversed in a single diffusion round. When a first-order neighbor node of a single node is marked in this round as having the same storage partition as that node, the node identifier of that neighbor in the neighbor row is moved to a forward position, while the identifiers of neighbor nodes that do not share a storage partition with the node are moved backward (for example, if neighbor node 9 of node 3 in fig. 7 is marked as having the same storage partition as seed node 3, the local node identifier 5 of node 9 may be moved ahead of the identifiers of the other three of node 3's four neighbor nodes); the corresponding elements of the isOrigin row are moved together with the neighbor row. At the same time, the number of edges already traversed for a single node can be recorded via the update of the cost, i.e., by accumulating the number of neighbor nodes that have been marked as sharing a storage partition with the corresponding node. In this way, the number of connection edges of the storage partition to which the corresponding node belongs can be counted correctly, and when the node is diffused again in a later round, traversal can start directly from the first neighbor node that has not yet been marked, so that the storage partition marks of all first-order neighbor nodes that do not yet share a storage partition are obtained quickly, which effectively accelerates the partitioning process. In this case, the neighbor nodes located after the index given by the first index value of the current seed node may be regarded as the first-order neighbor nodes described above for the current diffusion round.
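One possible realization of this reordering, continuing the hypothetical arrays above, is to treat the cost of a node as the boundary of its already-marked neighbors and to swap a newly marked neighbor entry, together with its isOrigin flag, to that boundary; a hedged sketch (the function name is an assumption):

def move_marked_front(node: int, pos: int) -> None:
    """Swap the neighbor at absolute position `pos` of `node` into the already-marked
    front region [start, start + cost) of the neighbor row, keeping isOrigin aligned."""
    start = offset_cost[node] >> M
    cost = offset_cost[node] & ((1 << M) - 1)
    front = start + cost                              # first not-yet-marked slot
    neighbor[pos], neighbor[front] = neighbor[front], neighbor[pos]
    is_origin[pos], is_origin[front] = is_origin[front], is_origin[pos]
    offset_cost[node] += 1                            # cost grows; offset bits are untouched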
It will be appreciated by those skilled in the art that fig. 7 and 8 show a specific example of a target data format, and in practice, based on the technical concept of the present specification, other reasonable data structures may be used to record node marking information, which is not described herein.
Next, in step 403, the current-round node marking results are exchanged with the other distributed devices.
Here, the respective distributed devices may broadcast the local current marking result to each other. A single distributed device (e.g., a first device) may obtain node marking results for a locally marked storage partition. For example, the distributed device with the corresponding storage partition identification 3 only obtains node information (e.g., global node identification information) that is marked as storage partition 3.
In one aspect, a single distributed device (e.g., the first device) may obtain, from the other distributed devices, the node marking information of the storage partitions to which nodes of its corresponding initial partition were marked in the current round. For example, the first device may query, in the current marking results broadcast by the other distributed devices, whether there are nodes belonging to the first initial partition; if so, it may record the storage partitions to which those nodes of the first initial partition were marked in the current round (e.g., by modifying the corresponding values of the nodePartitionSet row in the examples of fig. 7 and 8).
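Purely as an illustration (the message format and the local index mapping are assumptions of the sketch), processing a marking result broadcast by another device might look like this on the first device:

def apply_remote_marks(marks, local_index):
    """marks: iterable of (global node id, storage partition) pairs broadcast by another
    device; local_index: maps global ids of the first initial partition to local ids."""
    for gid, partition in marks:
        local = local_index.get(gid)
        if local is not None:                   # node belongs to the first initial partition
            node_part[local] |= 1 << partition  # record the current-round mark (nodePartitionSet)

# e.g. with the hypothetical arrays above, global node 7 has local id 1
apply_remote_marks([(7, 2), (99, 0)], {10: 0, 7: 1, 3: 2})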
In another aspect, a single distributed device may receive the information of nodes that were marked in the current round as belonging to its local storage partition. After receiving the node marking result of the local storage partition, it determines the seed nodes of the next diffusion round according to the corresponding node information. The method of determining the seed nodes of the next diffusion round is described in detail for step 301 and is not repeated here.
After a number of iterations of the diffusion round, the diffusion operation ends. The end condition of the diffusion operation may be, for example, that every edge has been visited. For a single distributed device, the end condition may specifically be that every node in the local initial partition shares a storage partition with at least one seed node among its neighbor nodes in the current initial partition, or that the number of edges already traversed locally equals the total number of edges of the local initial partition (as when the offset value equals the cost value in fig. 7 and 8). After the iterations of the diffusion rounds have finished, each distributed device obtains the marking result of the storage partition to which each node in its local initial partition belongs.
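For a single distributed device, the local end condition described above amounts to checking that every node's cost has caught up with its neighbor count; a sketch under the same hypothetical arrays (the function name is an assumption):

def local_diffusion_done() -> bool:
    # every local edge has been traversed when each node's cost equals its neighbor count
    for node in range(len(global_ids)):
        start = offset_cost[node] >> M
        end = offset_cost[node + 1] >> M if node + 1 < len(global_ids) else len(neighbor)
        cost = offset_cost[node] & ((1 << M) - 1)
        if cost < end - start:
            return False
    return True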
According to one possible design, after the multi-round diffusion operation, the partition determination procedure for distributed storage of the graph may further include: outputting, according to the storage partition marking state of each node in the first initial partition, the partition result of the storage partition to which each corresponding connection edge belongs to a local partition file.
It will be appreciated that, after all diffusion rounds have ended, a single distributed device (e.g., the first device) can obtain the marking results of the storage partitions to which the nodes in its local initial partition belong. In this way, the associated storage partition can be determined for each connection edge in the local initial partition; for the first device, the storage partition to which each connection edge in the first initial partition belongs can be determined. In a directed graph, there may be a unidirectional edge or a bidirectional edge between two nodes, and these edges can be marked and output to the local partition file as the partition result. A reverse edge added for a unidirectional edge can be discarded.
It will be appreciated that, with node marking, there may be situations in a single initial partition where two adjacent nodes are both marked as belonging to the same several storage partitions, such as node 3 and node 6, or node 3 and node 4, in fig. 8. In this case, taking quality indexes such as point balance and edge balance into account, the connection edge between the two nodes may be allocated to the storage partition having the smaller number of connection edges. The number of connection edges of a single storage partition may be the statistic, over all distributed devices, of the connection edges whose storage partition has already been determined.
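The tie-break on edge balance can be sketched as follows, assuming the two endpoints share at least one marked storage partition and that edge_counts is a per-partition edge statistic maintained as described above (the function name is an assumption):

def assign_edge(u: int, v: int, edge_counts: list[int]) -> int:
    common = node_part[u] & node_part[v]       # partitions in which both endpoints are marked
    candidates = [p for p in range(len(edge_counts)) if common >> p & 1]
    target = min(candidates, key=lambda p: edge_counts[p])   # least-loaded storage partition
    edge_counts[target] += 1                   # keep the per-partition edge statistic up to date
    return target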
Reviewing the above process, the present embodiments provide a method for splitting a target graph into storage partitions respectively corresponding to a plurality of distributed devices. Specifically, in the distributed storage of the graph, multiple rounds of diffusion operations are performed jointly by the distributed devices based on the initial partitions determined by the initial partitioning. In a single round of diffusion, each distributed device determines seed nodes for its corresponding storage partition and broadcasts them to the other distributed devices; each device then diffuses to the first-order neighbor nodes of the seed nodes in its local initial partition and marks them with the storage partitions of the corresponding seed nodes. Finally, each distributed device determines the storage partition to which each corresponding connection edge belongs according to the storage partitions to which the nodes in its local initial partition belong. Because only the storage partitions of nodes are marked, without marking the storage partitions of edges, and a single round of diffusion considers only the first-order neighbor nodes of the seed nodes, this embodiment consumes less memory and partitions the graph faster.
According to an embodiment of another aspect, a partition determination apparatus for distributed storage of a graph, provided in a single distributed device, is also provided. The apparatus may be provided in any device of the distributed system used for graph partitioning, such as the first device. The apparatus 900 may be used to determine, in cooperation with the other distributed devices in the distributed system, the storage partition to which each connection edge in the target graph belongs.
As shown in fig. 9, the partition determination apparatus 900 for distributed storage of a graph may include an acquisition unit 901, a diffusion unit 902 and a communication unit 903, which are used to perform multiple rounds of diffusion operations on a first initial partition local to the first device based on the target graph.
Wherein, in any round of diffusion operation:
the acquisition unit 901 is configured to determine a plurality of seed nodes for the first storage partition, send the seed nodes, together with the storage partition mark of the first storage partition, to the other distributed devices, and receive other seed nodes carrying storage partition marks sent by the other distributed devices;
the diffusion unit 902 is configured to traverse, for each current seed node in the seed node set, its first-order neighbor nodes in the first initial partition, and add the storage partition mark corresponding to the current seed node to those first-order neighbor nodes so as to update the marking states of the nodes in the first initial partition, where the seed node set includes the plurality of seed nodes and the other seed nodes;
the communication unit 903 is configured to exchange the current-round node marking results with the other distributed devices.
It should be noted that, the apparatus 900 shown in fig. 9 corresponds to the method described in fig. 4, and the corresponding description in the embodiment of the method shown in fig. 4 is also applicable to the apparatus 900, which is not repeated herein.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 4 and the like.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory having executable code stored therein and a processor that, when executing the executable code, implements the method described in connection with fig. 4 and the like.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present disclosure may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-described specific embodiments are used for further describing the technical concept of the present disclosure in detail, and it should be understood that the above description is only specific embodiments of the technical concept of the present disclosure, and is not intended to limit the scope of the technical concept of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made on the basis of the technical scheme of the embodiment of the present disclosure should be included in the scope of the technical concept of the present disclosure.

Claims (17)

1. The partition determining method is used for splitting the target graph into storage partitions respectively corresponding to a plurality of distributed devices; the method is performed by a first device of the plurality of distributed devices, including a multi-round diffusion operation performed at a first initial partition local to the first device based on a target graph, wherein any round of diffusion operation includes:
Determining a plurality of seed nodes for the first storage partition, transmitting the seed nodes to other distributed devices along with the storage partition marks of the first storage partition, and receiving other seed nodes with the storage partition marks transmitted by other distributed devices;
traversing first-order neighbor nodes in a first initial partition aiming at each current seed node in a seed node set respectively, and adding storage partition marks corresponding to the current seed nodes for the first-order neighbor nodes so as to update the mark states of the nodes in the first initial partition; wherein the seed node set includes the plurality of seed nodes and other seed nodes;
and interacting the round node marking result with other distributed devices.
2. The method of claim 1, wherein the first initial partition is obtained by initially partitioning the target graph using a 2D-hash partitioning method, in which the first device corresponds to the r-th row and the c-th column of a device array with R rows and C columns, and the node identifier of either of the two nodes connected by a single connection edge in the first initial partition satisfies at least one of the following: the result modulo R is r, and the result modulo C is c.
3. The method of claim 1, wherein the graph data of the first initial partition are recorded in a target data format based on compressed row storage (CRS), the target data format including a connection record row in which the number of neighbor nodes of each node is recorded, together with a first index value indicating the number of neighbor nodes that have the same storage partition as the corresponding node in the first initial partition.
4. The method of claim 3, wherein determining seed nodes for the first storage partition comprises:
K nodes are selected from a target node set obtained through previous diffusion rounds according to the sequence of segmentation complexity, wherein the segmentation complexity is determined according to the first index value, and the nodes in the target node set are marked as first storage partitions and are not determined as seed nodes in the previous diffusion rounds.
5. The method of claim 4, wherein K is determined by the product of the number of nodes in the set of target nodes and a partition speed, the partition speed being updated by gradient descent at a predetermined learning rate in each diffusion round, the gradient of a single diffusion round being determined by a weighted sum, adjusted by a predetermined point-edge balance factor, of the logarithm of the following first ratio and the logarithm of the following second ratio:
The first ratio is the ratio of the number of the nodes which are cumulatively marked by the local storage partition in the previous diffusion round to the average value of the number of the nodes which are cumulatively marked by each storage partition in the previous diffusion round;
The second ratio is a ratio of the number of connecting edges of the local storage partition, which are cumulatively determined in the previous diffusion round, to the average value of the number of connecting edges of each storage partition, which are cumulatively determined in the previous diffusion round.
6. The method of claim 4, wherein the K nodes are top K of the split complexity from low to high for each node in the set of target nodes;
the segmentation complexity of a single node is positively correlated with the number of edges of the single node which are not traversed on each distributed device, and the number of edges of the single node which are not traversed on the single distributed device is the difference value between the number of neighbor nodes of the single node in a corresponding single initial partition and the first index value.
7. The method of claim 3, wherein in the connection record row, the number of neighbor nodes of each node in the first initial partition is recorded by a first number of high order bits, and the first index value is recorded by a second number of low order bits.
8. The method of claim 3, wherein the target data format further comprises an edge attribute row; and adding reverse edges for every two nodes connected by only one-way edges under the condition that the graph is a directed graph, and recording whether the edges corresponding to all elements in the neighbor node rows are added reverse edges or not through the edge attribute rows.
9. The method of claim 3, wherein the target data format further comprises a node partition row for recording a storage partition to which each node belongs;
wherein updating the marker state of the node in the first initial partition comprises:
And updating a field corresponding to the first-order neighbor node of the node partition row to enable the field to contain a storage partition corresponding to the current seed node.
10. The method of claim 3, wherein the target data format further comprises a neighbor node row for sequentially recording neighbor node identifications of the nodes;
Updating the marking state of the node in the first initial partition, comprising:
In the case where a single neighbor node of a single node is marked as having the same memory partition as the single node, the order of the neighbor node identifications of the single node in the neighbor node row is adjusted such that the local node identifications of the single neighbor node are arranged before the local node identifications of other nodes that do not have the same memory partition as the single node.
11. The method of claim 10, wherein for each current seed node in the set of seed nodes, traversing its first-order neighbor nodes in the first partition, adding a storage partition marker corresponding to the current seed node for the first-order neighbor nodes, respectively, comprises:
and traversing the neighbor nodes after the index value of the current seed node by taking the first index value as an index from the connection record row to serve as the first-order neighbor nodes.
12. The method of claim 1, wherein the method further comprises: and after the multi-round diffusion operation, outputting the partition result of the storage partition to which each corresponding connecting edge belongs to a local partition file according to the storage partition marking state of each node in the first initial partition.
13. The method of claim 12, wherein the first initial partition includes a first connection edge, and the storing, according to the storage partition to which each node in the first initial partition belongs, the partition result of the storage partition to which each corresponding connection edge belongs in the local partition file includes:
In the case where the two nodes connected via the first connection edge are both marked as belonging to the same plurality of storage partitions, the number of connection edges respectively corresponding to each of the plurality of storage partitions is determined, and the first connection edge is assigned to the storage partition having the smallest number of corresponding connection edges.
14. The method of claim 1, wherein the interacting the local round of node marking results with other distributed devices further comprises:
And acquiring node marking results of all nodes corresponding to the first initial partition from the local node marking results of other distributed devices, and updating the storage partition marking state of the corresponding node in the first initial partition.
15. A partition determining device for carrying out distributed storage on a graph, which is used for splitting a target graph into storage partitions respectively corresponding to a plurality of distributed devices; the device is arranged on a first device in the plurality of distributed devices, and comprises an acquisition unit, a diffusion unit and a communication unit, wherein the acquisition unit, the diffusion unit and the communication unit are used for performing multi-round diffusion operation on the basis of a target graph on a first initial partition local to the first device, and in any round of diffusion operation:
The acquisition unit is configured to: determining a plurality of seed nodes for the first storage partition, transmitting the seed nodes to other distributed devices along with the storage partition marks of the first storage partition, and receiving other seed nodes with the storage partition marks transmitted by other distributed devices;
The diffusion unit is configured to: traversing first-order neighbor nodes in a first initial partition aiming at each current seed node in a seed node set respectively, and adding storage partition marks corresponding to the current seed nodes for the first-order neighbor nodes so as to update the mark states of the nodes in the first initial partition; wherein the seed node set includes the plurality of seed nodes and other seed nodes;
the communication unit is configured to: and interacting the round node marking result with other distributed devices.
16. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-14.
17. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-14.
CN202410401511.8A 2024-04-02 2024-04-02 Partition determination method and device for distributed storage of graphs Pending CN118151861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410401511.8A CN118151861A (en) 2024-04-02 2024-04-02 Partition determination method and device for distributed storage of graphs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410401511.8A CN118151861A (en) 2024-04-02 2024-04-02 Partition determination method and device for distributed storage of graphs

Publications (1)

Publication Number Publication Date
CN118151861A true CN118151861A (en) 2024-06-07

Family

ID=91293661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410401511.8A Pending CN118151861A (en) 2024-04-02 2024-04-02 Partition determination method and device for distributed storage of graphs

Country Status (1)

Country Link
CN (1) CN118151861A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination