CN114118443A - Large-scale graph embedding training method and system based on Optane DIMM - Google Patents

Large-scale graph embedding training method and system based on Optane DIMM

Info

Publication number
CN114118443A
CN114118443A CN202111415792.5A CN202111415792A
Authority
CN
China
Prior art keywords
graph
training
data
gpu
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111415792.5A
Other languages
Chinese (zh)
Inventor
姚建国 (Yao Jianguo)
陈悦 (Chen Yue)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202111415792.5A priority Critical patent/CN114118443A/en
Publication of CN114118443A publication Critical patent/CN114118443A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9014 Indexing; Data structures therefor; Storage structures using hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Abstract

The invention provides a large-scale graph embedding training method and system based on Optane DIMMs. The method comprises the following steps: an original graph processing step: processing the original graph to generate graph data that can be loaded into DRAM; a data preprocessing step: partitioning the graph data into two layers of graphs according to its characteristics, splitting the complete graph into subgraphs, and storing the subgraphs on disk so that they can be loaded into a GPU for partitioned training; a graph training step: storing the graph data used for training in different physical media according to the access characteristics of each medium, partitioning the algorithm according to the characteristics of the data it depends on during training, and balancing the costs of CPU computation, GPU computation and CPU-GPU communication by dividing training between the CPU and the GPU. The invention performs a high-quality two-layer graph partition according to the characteristics of the graph and converts the large graph into subgraphs stored on disk, so that the subgraphs can be loaded into a GPU for partitioned training.

Description

Large-scale graph embedding training method and system based on Optane DIMM
Technical Field
The invention relates to the technical fields of computer storage, computing and deep learning, and in particular to a large-scale graph embedding training method and system based on Optane DIMMs.
Background
Graphs, such as social networks, word co-occurrence networks, and communication networks, are widely present in a variety of real-world applications. By analyzing them we can gain insight into social structure, language, and different patterns of communication, so graphs have long been a focus of academic research.
Representing a network with an adjacency matrix A requires |V| × |V| storage space, which grows quadratically with the number of nodes. Meanwhile, the vast majority of entries in the adjacency matrix are 0, and this sparsity makes it difficult to apply fast and efficient learning methods. Graph embedding learning means learning a low-dimensional vector representation of the nodes in a network; formally, the goal of graph embedding learning is to learn, for each node v ∈ V, a real-valued vector in ℝ^κ, where κ ≪ |V| is the dimension of the vector. The graph and the graph embedding are mathematically defined as follows:
Graph: a graph G(V, E) consists of a vertex set V = {v1, …, vn} and an edge set E. An edge e_ij ∈ E connects a source vertex v_i and a target vertex v_j. For a weighted graph G, the adjacency matrix W contains a non-negative weight w_ij ≥ 0 associated with each edge; if v_i and v_j are not connected, w_ij is set to 0. For undirected weighted graphs, w_ij = w_ji always holds.
Graph embedding: given a graph G (V, E) and the predefined dimensionality of embedding d, a graph (node) embedding is a mapping f:
Figure RE-GDA0003445938290000014
so that the function f retains some of the semantic features defined on the graph G. The graph embedding problem is the problem of mapping an entire graph, sub-graph or edge to a d-dimensional vector. A visual representation of graph embedding is shown in fig. 1.
Graph embedding is a technique belonging to graph analysis and representation learning. Its purpose is to represent a graph as low-dimensional vectors while preserving the graph structure. As a simple and effective dimensionality-reduction method, graph embedding has been widely and successfully applied to node classification, clustering, recommendation, link prediction, network visualization, and other fields.
Graph embedding algorithms often measure similarity using random walks. Random walks are also the basis of a class of output-sensitive algorithms that use them to compute local structural information with complexity linear in the size of the input graph. It is this connection with local structure that makes random walks a basic tool for extracting information from graphs. Besides capturing community information, using random walks as the basis of the algorithm has two other desirable properties. First, local exploration is easy to parallelize: several random walkers (on different threads, processes, or machines) can simultaneously explore different parts of the same graph. Second, relying on information obtained from short random walks, small changes in the graph structure can be accommodated without global recomputation: the learned model can be updated with new random walks from the changed regions in time sub-linear in the size of the entire graph. Therefore, the invention extracts graph features on the basis of random walks.
In the big data era, the scale of graphs keeps growing. For example, in social networks the number of nodes has grown into the billions, and so has the number of edges. Processing graphs at this scale efficiently remains a challenge. For large graphs with billions of vertices, it is difficult for an ordinary single-node server to provide enough DRAM to support the service. Although existing swap-partition techniques can build a logical memory larger than the physical memory of a machine, it is still difficult to reach TB-level capacity, and they are limited by the low access speed and inefficiency of disks.
Another reference solution is to store the graph data and the embedding data on disk. However, since graph embedding algorithms often require byte addressing and disk access latency is high, it is unlikely that a high-performance graph embedding system can be built on disks. Distributed solutions, meanwhile, have limitations in efficiency, cost and user-friendliness: first, the graph data and embedding data need to be frequently accessed and transferred between different machines, which causes high network communication cost and latency; second, purchasing a set of powerful machines is expensive, placing a burden on small companies and individual developers. Therefore, how to build an efficient large-graph embedding system on a single machine has become both a challenge and an opportunity. Non-volatile memory is another direction for single-machine solutions to this problem.
The advent of the new Optane DIMM hardware makes efficient large-scale graph embedding training on a single machine promising. Intel's Optane DIMM is the first commercial persistent memory that supports byte-granularity access at DRAM-like speed. Meanwhile, the capacity of a single Optane DIMM can reach 512 GB; for a dual-socket machine, the supported Optane DIMM capacity is up to 6 TB (2 sockets × 6 channels × 512 GB/DIMM).
Chinese patent publication No. CN113343123A discloses a training method for a generative adversarial multi-relation graph network model for detecting machine accounts. The training method comprises: modeling a platform as graphs containing nodes v and relations r, where the number of graphs is determined by the number of types of relations r; generating a fake target node vt for a source node v with a generator G; feeding the sampled node pairs (v, u) and (v, vt) into a connection-relation discriminator D and repeatedly training the discriminator D; using the trained discriminator D to reason about node pairs in the graph, determine their connection relations, and thereby update the graph structure; and feeding the node characterization vectors into a classifier, back-propagating to update the model parameters according to the loss function, and training multiple times to obtain the trained generative adversarial multi-relation graph network model.
Disclosure of Invention
In view of the defects in the prior art, the object of the present invention is to provide a large-scale graph embedding training method and system based on Optane DIMMs.
The invention provides a large-scale graph embedding training method based on Optane DIMM, which comprises the following steps:
an original graph processing step: processing the original graph to generate graph data that can be loaded into DRAM;
a data preprocessing step: partitioning the graph data into two layers of graphs according to its characteristics, splitting the complete graph into subgraphs, and storing the subgraphs on disk so that they can be loaded into a GPU for partitioned training;
a graph training step: storing the graph data used for training in different physical media according to the access characteristics of each medium, partitioning the algorithm according to the characteristics of the data it depends on during training, and balancing the costs of CPU computation, GPU computation and CPU-GPU communication by dividing training between the CPU and the GPU.
Preferably, the original graph processing step includes the following sub-steps:
S1: initializing a hash table of size hash_table_size to store the mapping from vertex names to vertex IDs;
S2: entering a loop whose number of iterations equals the size of the original graph edge list;
S3: in each iteration, reading one edge from the original graph, the edge consisting of a source vertex v_name and a target vertex u_name;
S4: looking up in the hash table whether v_name has already appeared; if so, returning the mapped v_id; if not, calling the hash_table.AddVertex(name_v, count_num_vertices) method to insert a new vertex;
S5: if hash_table.AddVertex(name_v, count_num_vertices) is called, first creating a new vertex and adding it to the vertex set vertices, then incrementing the counter count_num_vertices by one; if count_num_vertices exceeds the maximum capacity of the existing vertices array, the array expands automatically;
S6: probing until an empty slot of the hash table is found, and inserting the vertex-name-to-ID mapping there;
S7: processing u_name according to steps S4-S6;
S8: writing the mapped edge to the output file.
Preferably, the graph data is stored on disk as files in the format source_vertex_id, destination_vertex_id; for an undirected graph, both source_vertex_id, destination_vertex_id and destination_vertex_id, source_vertex_id are stored.
Preferably, the splitting of the graph data includes:
- using an edge-partition strategy: the graph data is divided by vertices and the edges are cut; the graph is split into as many subgraphs as there are GPUs, and no edge partition is performed for a single-GPU device;
- using a point-partition strategy: the graph data is divided by edges; the number of subgraphs is determined by the GPU memory size, and no point partition is performed when a subgraph can be loaded entirely into the GPU.
Preferably, the CPU performs the negative sampling and cut-edge training operations, the GPU performs the positive sampling, positive-sample training and negative-sample training operations, and CPU-GPU communication is carried out over PCIe.
Preferably, the CPU tasks specifically include:
data loading: loading the data required for training, including graph structure data and graph embedding data;
the graph structure data is loaded from disk into the Optane DIMM, and a metadata structure Graph is abstracted to point to the concrete data in the Optane DIMM, with the metadata stored in DRAM;
the graph embedding data is allocated and initialized in DRAM, and a metadata structure Embedding is abstracted to point to the concrete data, with the metadata stored in DRAM;
negative sampling: the system starts FIRST_PARTITION_NUM threads, each thread maintains NEG_SAMPLE_POOL_NUM sampling pools of size NEG_SAMPLE_POOL_SIZE, and different threads perform negative sampling in parallel;
cut-edge training: each cut edge generated during edge partitioning is trained on the CPU; the CPU starts CROSS_EDGE_TRAIN_THREAD threads, each responsible for training part of the cut edges;
task scheduling: the first-layer subgraphs are trained by different GPUs in parallel, the GPUs are isolated from each other with no data-communication overhead, and the second-layer subgraphs are trained serially;
graph embedding evaluation: using the generated node embeddings, different types of machine learning tasks are run, and micro-F1 or macro-F1 is evaluated for comparison with other solutions.
Preferably, in the negative sampling process, the sampling strategy for each small sampling pool includes: counting the degree sum of the whole graph point and the degree sum of the sub-graph points, and determining the sampling number of each sub-graph according to the proportion of the sub-graph degree sum to the whole graph degree sum, wherein the following conditions are met:
subgraphi-degree_num/graph_degree_num
=subgraphi-neg_sam_num/NEG_SAMPLE_POOL_SIZE。
Preferably, the GPU tasks include:
positive sampling: for each edge, positive sampling is performed by random walk, with a vertex vid as input and a list of positively sampled edges as output;
training: each edge in the subgraph is trained on the GPU.
Preferably, the PCIe transfer tasks include:
graph structure data transfer: the graph structure data is copied over PCIe from the Optane DIMM to the GPU for computation;
negative sample transfer: the negative sample data is copied over PCIe from the host DRAM to the GPU for computation;
embedding transfer: the embedding data is copied over PCIe from the host DRAM to the GPU for computation;
embedding transfer: the embedding data is copied back over PCIe from the GPU to the host DRAM.
The invention discloses a large-scale graph embedding training system based on Optane DIMM, which comprises:
an original graph processing module: processing the original graph to generate graph data that can be loaded into DRAM;
a data preprocessing module: partitioning the graph data into two layers of graphs according to its characteristics, splitting the complete graph into subgraphs, and storing the subgraphs on disk so that they can be loaded into a GPU for partitioned training;
a graph training module: storing the graph data used for training in different physical media according to the access characteristics of each medium, partitioning the algorithm according to the characteristics of the data it depends on during training, and balancing the costs of CPU computation, GPU computation and CPU-GPU communication by dividing training between the CPU and the GPU.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention performs a high-quality two-layer graph partition according to the characteristics of the graph and converts the large graph into subgraphs stored on disk, so that the subgraphs can be loaded into a GPU for partitioned training.
2. The graph data used for training is stored in different physical media according to the access characteristics of each medium: frequently written data is kept in DRAM and read-only data in the Optane DIMM.
3. According to the different characteristics of the data each stage depends on, the algorithm is partitioned: the GPU handles positive sampling (random walk), positive-sample training and negative-sample training, while the CPU handles negative sampling and cut-edge training, so that the computing power of the system is fully utilized.
4. Asynchronous transfer balances computation and I/O overhead, building a high-performance, complete end-to-end training execution flow.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a diagram illustrating a graph embedding process in the prior art;
FIG. 2 is a schematic diagram of an overall method according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that various changes and modifications that are obvious to those skilled in the art can be made without departing from the spirit of the invention, and all of them fall within the scope of the present invention.
The large-scale graph embedding training method based on Optane DIMMs provided by the invention, referring to FIG. 2, comprises the following steps:
an original graph processing step: processing the original graph to generate graph data that can be loaded into DRAM;
a data preprocessing step: partitioning the graph data into two layers of graphs according to its characteristics, splitting the complete graph into subgraphs, and storing the subgraphs on disk so that they can be loaded into a GPU for partitioned training;
a graph training step: storing the graph data used for training in different physical media according to the access characteristics of each medium, partitioning the algorithm according to the characteristics of the data it depends on during training, and balancing the costs of CPU computation, GPU computation and CPU-GPU communication by dividing training between the CPU and the GPU.
To break through the memory limitation, the system designs a streaming hash-mapping scheme that does not require loading the complete graph data into memory. The original graph processing step comprises the following sub-steps (a minimal illustrative sketch follows the steps):
S1: initializing a hash table of size hash_table_size to store the mapping from vertex names to vertex IDs;
S2: entering a loop whose number of iterations equals the size of the original graph edge list;
S3: in each iteration, reading one edge from the original graph, the edge consisting of a source vertex v_name and a target vertex u_name;
S4: looking up in the hash table whether v_name has already appeared; if so, returning the mapped v_id; if not, calling the hash_table.AddVertex(name_v, count_num_vertices) method to insert a new vertex;
S5: if hash_table.AddVertex(name_v, count_num_vertices) is called, first creating a new vertex and adding it to the vertex set vertices, then incrementing the counter count_num_vertices by one; if count_num_vertices exceeds the maximum capacity of the existing vertices array, the array expands automatically;
S6: probing until an empty slot of the hash table is found, and inserting the vertex-name-to-ID mapping there;
S7: processing u_name according to steps S4-S6;
S8: writing the mapped edge to the output file.
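The streaming mapping can be illustrated with a minimal Python sketch (not the patent's implementation: the file paths, the dict-backed hash table and the helper name are assumptions). It reads the raw edge list one line at a time, assigns each previously unseen vertex name a dense integer ID, and writes the remapped edge to the output file, so the complete graph never has to reside in memory at once.

```python
# Minimal sketch of the streaming hash-mapping step (illustrative only).
# Assumptions: the raw graph is a whitespace-separated edge list of vertex
# names; file paths and the dict-based hash table are placeholders.

def add_vertex(hash_table, name, vertices):
    """Mimic hash_table.AddVertex: register a new vertex and return its ID."""
    new_id = len(vertices)          # count_num_vertices
    vertices.append(name)           # vertex set grows (and expands) automatically
    hash_table[name] = new_id       # name -> ID mapping
    return new_id

def remap_edges(raw_path="graph.raw", out_path="graph.mapped"):
    hash_table = {}                 # vertex name -> vertex ID
    vertices = []                   # vertex ID -> vertex name
    with open(raw_path) as fin, open(out_path, "w") as fout:
        for line in fin:            # one edge per loop iteration
            v_name, u_name = line.split()
            v_id = hash_table.get(v_name)
            if v_id is None:        # unseen source vertex: insert it
                v_id = add_vertex(hash_table, v_name, vertices)
            u_id = hash_table.get(u_name)
            if u_id is None:        # unseen target vertex: insert it
                u_id = add_vertex(hash_table, u_name, vertices)
            fout.write(f"{v_id} {u_id}\n")   # write the mapped edge
    return len(vertices)
```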
Further, the graph data is stored on disk as files in the format source_vertex_id, destination_vertex_id; for an undirected graph, both source_vertex_id, destination_vertex_id and destination_vertex_id, source_vertex_id are stored.
Graph partitioning divides the large complete graph into small subgraphs so that they can be loaded into a GPU for computation. The graph data is split with the following two strategies:
The first layer uses an edge-partition strategy: the graph is split into as many subgraphs as there are GPUs, and no edge partition is performed on a single-GPU device. Edge partitioning means the graph is divided by vertices and the edges are cut during partitioning. Its advantage is that each vertex appears in exactly one subgraph, which reduces synchronization communication during training; its cost is the creation of cut edges, which require extra handling during training.
The second layer uses a point-partition strategy: the number of subgraphs is determined by the GPU memory size, and no point partition is performed when a subgraph can be loaded entirely into the GPU; so when the graph is small enough to fit completely into the GPU, the two-layer graph partition is not actually performed. Point partitioning means the graph is divided by edges, so a vertex may have multiple replicas. Its advantage is that no extra cut edges are created; its disadvantage is that vertex replicas may introduce synchronization communication overhead.
Further, during graph data partitioning, an inter-partition and intra-partition graph data compression mechanism is applied. For example, in the full graph the vertex IDs are large and must be stored as long integers, each vertex occupying 8 bytes; after partitioning, a vertex can be stored as a <first_partition_id, second_partition_id, offset> triple of byte-typed fields, occupying 3 bytes per vertex (a sketch of this encoding follows).
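A small sketch of the two-layer partition plus the <first_partition_id, second_partition_id, offset> encoding, under the simplifying assumptions that vertices are assigned to partitions in contiguous ID ranges and that every field fits in one byte (as in the 8-byte-to-3-byte example above); the function names are illustrative.

```python
# Sketch of the two-level partitioning plus the <first_partition_id,
# second_partition_id, offset> vertex compression described above.
# Assumptions: vertices are numbered contiguously inside each second-level
# partition, and every field fits in one byte (as in the 3-byte example).
import struct

def partition_vertices(num_vertices, num_gpus, sub_per_gpu):
    """Assign each global vertex ID a (p1, p2, offset) triple, contiguously."""
    per_gpu = (num_vertices + num_gpus - 1) // num_gpus
    per_sub = (per_gpu + sub_per_gpu - 1) // sub_per_gpu
    triples = {}
    for gid in range(num_vertices):
        p1, rest = divmod(gid, per_gpu)        # first layer: one block per GPU
        p2, offset = divmod(rest, per_sub)     # second layer: fits GPU memory
        triples[gid] = (p1, p2, offset)
    return triples, per_gpu, per_sub

def pack_triple(p1, p2, offset):
    return struct.pack("BBB", p1, p2, offset)  # 3 bytes instead of an 8-byte long

def unpack_to_global(blob, per_gpu, per_sub):
    p1, p2, offset = struct.unpack("BBB", blob)
    return p1 * per_gpu + p2 * per_sub + offset

# Example: 1000 vertices, 2 GPUs, 4 subgraphs per GPU
triples, per_gpu, per_sub = partition_vertices(1000, 2, 4)
assert unpack_to_global(pack_triple(*triples[777]), per_gpu, per_sub) == 777
```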
According to the different characteristics of the data each stage depends on during training, the method partitions the algorithm: the GPU is responsible for positive sampling (random walk), positive-sample training and negative-sample training, while the CPU is responsible for negative sampling and cut-edge training, balancing the costs of CPU computation, GPU computation and CPU-GPU communication and thereby building a high-performance, complete end-to-end training execution flow. Specifically, the CPU performs the negative sampling and cut-edge training operations, the GPU performs the positive sampling, positive-sample training and negative-sample training operations, and CPU-GPU communication is carried out over PCIe.
Task assignments are summarized as follows:
CPU: data loading, negative sampling, cut-edge training, task scheduling, graph embedding evaluation;
GPU: positive sampling (random walk), positive-sample training, negative-sample training;
PCIe: graph structure data transfer, negative sample transfer, embedding transfer.
In more detail, the CPU tasks specifically include:
Data loading: the first step of the training workflow is data loading. The data in the system comprises two types: the first is graph structure data (nodes and edges), which stores the structural relations of the graph, has a large volume, and is only read, never written, during training; the second is graph embedding data, where each graph vertex maintains an embedding vector that is both the training target and the training result, is read and written frequently, and has a small volume.
The graph structure data is stored on disk after partitioning; during loading it is copied from disk into the Optane DIMM, and a metadata structure Graph is abstracted to point to the concrete data in the Optane DIMM, with the metadata stored in DRAM.
The graph embedding data is allocated and initialized in DRAM, and a metadata structure Embedding is abstracted to point to the concrete data, with the metadata stored in DRAM.
Negative sampling: the system starts FIRST_PARTITION_NUM threads; each thread maintains NEG_SAMPLE_POOL_NUM sampling pools of size NEG_SAMPLE_POOL_SIZE, and different threads perform negative sampling in parallel.
For each small sampling pool, the sampling strategy is: count the degree sum of the whole graph and the degree sum of each subgraph, and determine the number of samples drawn for each subgraph according to the ratio of the subgraph degree sum to the whole-graph degree sum, satisfying:
subgraph_i_degree_num / graph_degree_num = subgraph_i_neg_sam_num / NEG_SAMPLE_POOL_SIZE
in addition, a producer consumer model is adopted by the negative sampling module, the CPU sampling is taken as a producer, and the GPU is trained as a consumer. When the sampling pool is full, the CPU sampling thread is blocked, and the GPU training thread can be executed; when the sampling pool is empty, the CPU sampling thread can execute, and the GPU training thread is blocked. The size of NEG _ SAMPLE _ POOL _ NUM determines the load balance of the two; NEG _ SAMPLE _ POOL _ SIZE affects the quality of training, all user controllable parameters.
Cut-edge training: each cut edge generated during edge partitioning is trained on the CPU; the CPU starts CROSS_EDGE_TRAIN_THREAD threads, each responsible for training part of the cut edges.
The specific strategy is as follows: the CPU starts CROSS_EDGE_TRAIN_THREAD threads, each responsible for part of the cut edges; for each cut edge, a random walk from source_vertex generates positive samples; for each positive sample, a vertex is sampled globally at random to replace target_vertex, producing a negative sample; num_negative negative samples are drawn per positive sample; gradient descent is then applied to the positive and negative samples to perform the training (a sketch of this update follows).
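The patent only states that gradient descent is applied to the positive and negative samples; the sketch below assumes the common skip-gram-with-negative-sampling form of that update (one positive pair plus num_negative negatives per positive sample), with illustrative names and a toy embedding table.

```python
# Sketch of the per-edge training update (one positive sample plus
# num_negative negative samples, followed by gradient descent).
# Assumption: the loss is the usual skip-gram-with-negative-sampling
# objective; the patent only states "gradient descent on positive and
# negative samples", so this concrete form is illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_edge(emb, source_vertex, target_vertex, neg_pool,
               num_negative=5, lr=0.025):
    """Update embeddings for one (source, target) positive sample."""
    rng = np.random.default_rng()
    src = emb[source_vertex]
    grad_src = np.zeros_like(src)

    # positive sample: push source and target together (label 1)
    g = (1.0 - sigmoid(src @ emb[target_vertex])) * lr
    grad_src += g * emb[target_vertex]
    emb[target_vertex] += g * src

    # negative samples: draw vertices from the pool to replace target (label 0)
    for v in rng.choice(neg_pool, size=num_negative):
        g = (0.0 - sigmoid(src @ emb[v])) * lr
        grad_src += g * emb[v]
        emb[v] += g * src

    emb[source_vertex] += grad_src

# toy usage: 10 vertices, 8-dimensional embeddings
emb = np.random.rand(10, 8).astype(np.float32) * 0.1
train_edge(emb, source_vertex=0, target_vertex=1, neg_pool=np.arange(10))
```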
Task scheduling: according to the different characteristics of the data each stage depends on, the method partitions the algorithm; the GPU is responsible for positive sampling (random walk), positive-sample training and negative-sample training, and the CPU is responsible for negative sampling and cut-edge training. Algorithm scheduling and data communication are controlled uniformly by the CPU.
The specific strategy is as follows: the first-layer subgraphs are trained by different GPUs in parallel; the GPUs are isolated from each other with no data-communication overhead; the second-layer subgraphs are trained serially. The negative sampling module uses the producer-consumer model described above, with CPU sampling as the producer and GPU training as the consumer: when a sampling pool is full the CPU sampling thread blocks and the GPU training thread can run, and when a sampling pool is empty the CPU sampling thread can run and the GPU training thread blocks. Asynchronous transfer overlaps the GPU training overhead with the CPU-GPU data-transfer overhead: specifically, while subgraph P1 is being trained on the GPU, subgraph P2 is transferred from the host (Optane DIMM) to the GPU (see the pipelining sketch after this paragraph).
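The overlap of training and transfer can be sketched as a simple software pipeline: while partition P_i is trained, the copy of P_{i+1} is already in flight. The placeholder functions below are hypothetical; a real implementation would issue the copies and kernels on separate CUDA streams rather than Python threads.

```python
# Sketch of the asynchronous pipeline: while subgraph P_i is being trained,
# subgraph P_{i+1} is copied from the host (Optane DIMM) to the GPU.
# copy_to_gpu / train_on_gpu are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import time

def copy_to_gpu(partition):          # stands in for the PCIe transfer
    time.sleep(0.1)
    return f"{partition}-on-gpu"

def train_on_gpu(gpu_partition):     # stands in for the training kernels
    time.sleep(0.2)
    print("trained", gpu_partition)

def train_all(partitions):
    with ThreadPoolExecutor(max_workers=1) as copier:
        next_copy = copier.submit(copy_to_gpu, partitions[0])
        for i in range(len(partitions)):
            current = next_copy.result()          # wait for P_i to arrive
            if i + 1 < len(partitions):           # start copying P_{i+1} ...
                next_copy = copier.submit(copy_to_gpu, partitions[i + 1])
            train_on_gpu(current)                 # ... while training P_i

train_all(["P1", "P2", "P3", "P4"])
```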
Graph embedding evaluation: using the generated node embeddings, different types of machine learning tasks, such as node classification, are executed; micro-F1 or macro-F1 is evaluated and compared with other solutions (a minimal evaluation sketch follows).
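A minimal sketch of such an evaluation, assuming scikit-learn is available and using a logistic-regression node classifier on synthetic stand-in embeddings; only the micro-F1/macro-F1 reporting mirrors the text.

```python
# Sketch of the graph-embedding evaluation step: a node-classification task
# on the learned embeddings, reporting micro-F1 and macro-F1.
# scikit-learn is an assumed dependency; labels/embeddings here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate_embeddings(embeddings, labels):
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    pred = clf.predict(X_test)
    return (f1_score(y_test, pred, average="micro"),
            f1_score(y_test, pred, average="macro"))

emb = np.random.rand(200, 16)                 # stand-in node embeddings
labels = np.random.randint(0, 3, size=200)    # stand-in class labels
micro, macro = evaluate_embeddings(emb, labels)
print(f"micro-F1={micro:.3f}, macro-F1={macro:.3f}")
```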
In more detail, the GPU tasks include:
Positive sampling: for each edge, positive sampling is performed by random walk. The sampling procedure (reconstructed as a sketch after this description) takes a vertex vid as input and outputs a list of positively sampled edges. It initializes walk_length to 0 and then loops until the counter i reaches random_walk_length. In each iteration it obtains the neighbor vertex list of vertex vid; if the neighbor list is empty, the loop terminates. Otherwise a neighbor is randomly sampled with probability determined by the weight and assigned to uid; vid is written to head[i] and uid to tail[i]; uid then becomes the new vid, and both i and walk_length are incremented by one.
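Based on the description above, the missing pseudo-code can be reconstructed as the following sketch; the adjacency-list layout neighbors[vid] -> [(uid, weight), ...] and the interpretation of the sampling weight as a per-neighbor weight are assumptions, while the variable names (vid, uid, head, tail, walk_length, random_walk_length) follow the text.

```python
# Reconstruction of the random-walk positive-sampling routine from the
# description above (the original pseudo-code figure is not reproduced here).
import random

def positive_sample(vid, neighbors, random_walk_length):
    """Random walk from vertex vid; returns the list of sampled edges."""
    head, tail = [], []
    walk_length = 0
    for i in range(random_walk_length):
        nbrs = neighbors.get(vid, [])
        if not nbrs:                       # empty neighbour list: stop the walk
            break
        uids, weights = zip(*nbrs)
        uid = random.choices(uids, weights=weights)[0]   # weight-proportional pick
        head.append(vid)                   # head[i] = vid
        tail.append(uid)                   # tail[i] = uid
        vid = uid                          # continue the walk from uid
        walk_length += 1
    return list(zip(head, tail))           # positive-sample edge list

# toy usage with a 4-vertex weighted graph
neighbors = {0: [(1, 1.0), (2, 2.0)], 1: [(3, 1.0)], 2: [(3, 1.0)], 3: []}
print(positive_sample(0, neighbors, random_walk_length=5))
```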
Training: each edge in the subgraph is trained on the GPU. The specific strategy is as follows: the GPU launches NUM_EDGES threads, each responsible for training one edge; for each edge, a random walk from source_vertex generates positive samples; for each positive sample, a vertex is drawn from the negative sample pool to replace target_vertex, producing a negative sample; num_negative negative samples are drawn per positive sample; gradient descent is applied to the positive and negative samples to perform the training, following the same update as in cut-edge training.
In more detail, the PCIe transfer tasks include:
graph structure data transfer: host (Optane DIMM) to GPU;
the graph structure data is copied over PCIe from the Optane DIMM to the GPU for computation;
negative sample transfer: host (DRAM) to GPU;
the negative sample data is copied over PCIe from the host DRAM to the GPU for computation;
embedding transfer: host (DRAM) to GPU;
the embedding data is copied over PCIe from the host DRAM to the GPU for computation;
embedding transfer: GPU to host (DRAM);
the embedding data is copied back over PCIe from the GPU to the host DRAM.
The invention also provides a large-scale graph embedding training system based on Optane DIMMs, comprising:
an original graph processing module: processing the original graph to generate graph data that can be loaded into DRAM;
a data preprocessing module: partitioning the graph data into two layers of graphs according to its characteristics, splitting the complete graph into subgraphs, and storing the subgraphs on disk so that they can be loaded into a GPU for partitioned training;
a graph training module: storing the graph data used for training in different physical media according to the access characteristics of each medium, partitioning the algorithm according to the characteristics of the data it depends on during training, and balancing the costs of CPU computation, GPU computation and CPU-GPU communication by dividing training between the CPU and the GPU.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A large-scale graph embedding training method based on Optane DIMMs, characterized by comprising the following steps:
an original graph processing step: processing the original graph to generate graph data that can be loaded into DRAM;
a data preprocessing step: partitioning the graph data into two layers of graphs according to its characteristics, splitting the complete graph into subgraphs, and storing the subgraphs on disk so that they can be loaded into a GPU for partitioned training;
a graph training step: storing the graph data used for training in different physical media according to the access characteristics of each medium, partitioning the algorithm according to the characteristics of the data it depends on during training, and balancing the costs of CPU computation, GPU computation and CPU-GPU communication by dividing training between the CPU and the GPU.
2. The Optane DIMM based large-scale graph embedding training method of claim 1, wherein the original graph processing step includes the sub-steps of:
S1: initializing a hash table of size hash_table_size to store the mapping from vertex names to vertex IDs;
S2: entering a loop whose number of iterations equals the size of the original graph edge list;
S3: in each iteration, reading one edge from the original graph, the edge consisting of a source vertex v_name and a target vertex u_name;
S4: looking up in the hash table whether v_name has already appeared; if so, returning the mapped v_id; if not, calling the hash_table.AddVertex(name_v, count_num_vertices) method to insert a new vertex;
S5: if hash_table.AddVertex(name_v, count_num_vertices) is called, first creating a new vertex and adding it to the vertex set vertices, then incrementing the counter count_num_vertices by one; if count_num_vertices exceeds the maximum capacity of the existing vertices array, the array expands automatically;
S6: probing until an empty slot of the hash table is found, and inserting the vertex-name-to-ID mapping there;
S7: processing u_name according to steps S4-S6;
S8: writing the mapped edge to the output file.
3. The Optane DIMM based large-scale graph embedding training method of claim 1, wherein the graph data is stored on disk as files in the format source_vertex_id, destination_vertex_id; for an undirected graph, both source_vertex_id, destination_vertex_id and destination_vertex_id, source_vertex_id are stored.
4. The Optane DIMM based large-scale graph embedding training method of claim 1, wherein the splitting of the graph data comprises:
- using an edge-partition strategy: the graph data is divided by vertices and the edges are cut; the graph is split into as many subgraphs as there are GPUs, and no edge partition is performed for a single-GPU device;
- using a point-partition strategy: the graph data is divided by edges; the number of subgraphs is determined by the GPU memory size, and no point partition is performed when a subgraph can be loaded entirely into the GPU.
5. The Optane DIMM based large-scale graph embedding training method of claim 1, wherein the CPU performs the negative sampling and cut-edge training operations, the GPU performs the positive sampling, positive-sample training and negative-sample training operations, and CPU-GPU communication is carried out over PCIe.
6. The Optane DIMM based large-scale graph embedding training method of claim 5, wherein the CPU tasks specifically include:
data loading: loading the data required for training, including graph structure data and graph embedding data;
the graph structure data is loaded from disk into the Optane DIMM, and a metadata structure Graph is abstracted to point to the concrete data in the Optane DIMM, with the metadata stored in DRAM;
the graph embedding data is allocated and initialized in DRAM, and a metadata structure Embedding is abstracted to point to the concrete data, with the metadata stored in DRAM;
negative sampling: the system starts FIRST_PARTITION_NUM threads, each thread maintains NEG_SAMPLE_POOL_NUM sampling pools of size NEG_SAMPLE_POOL_SIZE, and different threads perform negative sampling in parallel;
cut-edge training: each cut edge generated during edge partitioning is trained on the CPU; the CPU starts CROSS_EDGE_TRAIN_THREAD threads, each responsible for training part of the cut edges;
task scheduling: the first-layer subgraphs are trained by different GPUs in parallel, the GPUs are isolated from each other with no data-communication overhead, and the second-layer subgraphs are trained serially;
graph embedding evaluation: using the generated node embeddings, different types of machine learning tasks are run, and micro-F1 or macro-F1 is evaluated for comparison with other solutions.
7. The Optane DIMM based large-scale graph embedding training method of claim 1, wherein in the negative sampling process, the sampling strategy for each small sampling pool includes: counting the degree sum of the whole graph and the degree sum of each subgraph, and determining the number of samples for each subgraph according to the ratio of the subgraph degree sum to the whole-graph degree sum, satisfying:
subgraph_i_degree_num / graph_degree_num = subgraph_i_neg_sam_num / NEG_SAMPLE_POOL_SIZE.
8. The Optane DIMM based large-scale graph embedding training method of claim 5, wherein the GPU tasks include:
positive sampling: for each edge, positive sampling is performed by random walk, with a vertex vid as input and a list of positively sampled edges as output;
training: each edge in the subgraph is trained on the GPU.
9. The Optane DIMM based large-scale graph embedding training method of claim 5, wherein the PCIe transfer tasks include:
graph structure data transfer: the graph structure data is copied over PCIe from the Optane DIMM to the GPU for computation;
negative sample transfer: the negative sample data is copied over PCIe from the host DRAM to the GPU for computation;
embedding transfer: the embedding data is copied over PCIe from the host DRAM to the GPU for computation;
embedding transfer: the embedding data is copied back over PCIe from the GPU to the host DRAM.
10. A large-scale graph embedding training system based on Optane DIMMs, characterized by comprising:
an original graph processing module: processing the original graph to generate graph data that can be loaded into DRAM;
a data preprocessing module: partitioning the graph data into two layers of graphs according to its characteristics, splitting the complete graph into subgraphs, and storing the subgraphs on disk so that they can be loaded into a GPU for partitioned training;
a graph training module: storing the graph data used for training in different physical media according to the access characteristics of each medium, partitioning the algorithm according to the characteristics of the data it depends on during training, and balancing the costs of CPU computation, GPU computation and CPU-GPU communication by dividing training between the CPU and the GPU.
CN202111415792.5A 2021-11-25 2021-11-25 Large-scale graph embedding training method and system based on Optane DIMM Pending CN114118443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111415792.5A CN114118443A (en) 2021-11-25 2021-11-25 Large-scale graph embedding training method and system based on Optane DIMM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111415792.5A CN114118443A (en) 2021-11-25 2021-11-25 Large-scale graph embedding training method and system based on Optane DIMM

Publications (1)

Publication Number Publication Date
CN114118443A true CN114118443A (en) 2022-03-01

Family

ID=80373219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111415792.5A Pending CN114118443A (en) 2021-11-25 2021-11-25 Large-scale graph embedding training method and system based on Optane DIMM

Country Status (1)

Country Link
CN (1) CN114118443A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897664A (en) * 2022-05-19 2022-08-12 北京百度网讯科技有限公司 Graph model deployment method and device, GPU (graphics processing Unit) and storage medium
CN114897664B (en) * 2022-05-19 2024-05-07 北京百度网讯科技有限公司 Graph model deployment method and device, GPU and storage medium

Similar Documents

Publication Publication Date Title
US8990209B2 (en) Distributed scalable clustering and community detection
US11354601B2 (en) Learning classification device and learning classification method
US11836610B2 (en) Concurrent training of functional subnetworks of a neural network
US9836701B2 (en) Distributed stage-wise parallel machine learning
US20190279088A1 (en) Training method, apparatus, chip, and system for neural network model
JP5950285B2 (en) A method for searching a tree using an instruction that operates on data having a plurality of predetermined bit widths, a computer for searching a tree using the instruction, and a computer thereof program
EP3979143A1 (en) Method of performing splitting in neural network model by means of multi-core processor, and related product
CN111930518B (en) Knowledge graph representation learning-oriented distributed framework construction method
US20230056760A1 (en) Method and apparatus for processing graph data, device, storage medium, and program product
CN114118443A (en) Large-scale graph embedding training method and system based on Optane DIMM
He et al. Parallel outlier detection using kd-tree based on mapreduce
CN111126443A (en) Network representation learning method based on random walk
US11640379B2 (en) Metadata decomposition for graph transformation
JP7103987B2 (en) Information processing equipment, information processing methods, and programs
Sakamoto et al. Model checking of the suzuki-kasami distributed mutual exclusion algorithm with spin
Zhang et al. An overlapping community detection algorithm based on triangle reduction weighted for large-scale complex network
Tadrat et al. A framework for using rough sets and formal concept analysis in case based reasoning
US11960520B2 (en) Hierarchical topic model with an interpretable topic hierarchy
CN116991986B (en) Language model light weight method, device, computer equipment and storage medium
Huang et al. Extreme classification in log memory
CN113033827B (en) Training method and device for deep forest
KR102585925B1 (en) Apparatus for automatically collecting learning data and method therefor
CN115936095B (en) Parameter gradient synchronization method, device, equipment and storage medium
Narayanan et al. Semantic node embeddings of distributed graphs using apache spark
CN116151337A (en) Method, system and storage medium for accelerating access of graph neural network attributes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination