CN114895985B - Data loading system for graph neural network training based on sampling - Google Patents

Data loading system for graph neural network training based on sampling Download PDF

Info

Publication number
CN114895985B
CN114895985B (application CN202210641439.7A)
Authority
CN
China
Prior art keywords
data
nodes
mini
batch
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210641439.7A
Other languages
Chinese (zh)
Other versions
CN114895985A (en)
Inventor
Xiong Yingtong (熊颖彤)
Weng Chuliang (翁楚良)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202210641439.7A priority Critical patent/CN114895985B/en
Publication of CN114895985A publication Critical patent/CN114895985A/en
Application granted granted Critical
Publication of CN114895985B publication Critical patent/CN114895985B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 - Arrangements for executing specific programs
    • G06F 9/445 - Program loading or initiating
    • G06F 9/44568 - Immediately runnable code
    • G06F 9/44578 - Preparing or optimising for loading
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a data loading system for sampling-based graph neural network training, comprising a neighbor node sampler and a data transmitter. The neighbor node sampler takes the output of the DataLoader of the deep learning framework PyTorch as input and samples neighbor nodes with a sampling operator. The data transmitter comprises a classifier, a feature aggregator and a data manager. The classifier divides the nodes output by the neighbor node sampler into shared nodes and non-shared nodes; the data manager maintains the feature data of the previous mini-batch in the GPU and provides a way to update that feature data in place; the feature aggregator uses a high-performance feature aggregation operator to fetch the features of the non-shared nodes from the CPU, while for shared nodes the data transmitter directly reuses the feature data already maintained in the GPU. The invention improves sampling efficiency, reduces unnecessary data transmission and increases data transmission throughput.

Description

Data loading system for graph neural network training based on sampling
Technical Field
The invention belongs to the technical field of software development, and particularly relates to a data loading system for graph neural network training based on sampling.
Background
As data grows, the relationships among data become more complex, and graph neural networks have received a great deal of attention. Unlike conventional deep neural networks, which excel at processing Euclidean-space data, and conventional graph computation, which specializes in processing graph data, graph neural networks focus on non-Euclidean data: they combine the automatic differentiation of neural networks with the message-passing mechanism of traditional graph computation, achieve better results on graph data, and have been successfully applied in production environments such as social networks, traffic prediction and recommendation systems.
To improve the learning ability of graph neural networks and to overcome the memory limits faced when processing large graphs, sampling-based graph neural networks have been widely studied. Mainstream graph sampling methods fall into three categories: node-based sampling, layer-based sampling and subgraph-based sampling. However, these sampling methods are designed only from a mathematical perspective and neglect how efficiently they run in a system. On the other hand, existing deep learning frameworks such as PyTorch and TensorFlow lack the capability to train graph neural networks efficiently, so graph neural network frameworks are continuously being proposed by both academia and industry. Representative graph neural network frameworks include DGL (Deep Graph Library), PyG (PyTorch Geometric) and AliGraph. These frameworks implement many sampling algorithms and graph neural network models and make it convenient to deploy such models. In general, real-world graph data is often very large, and a graph neural network framework therefore faces huge computation and storage pressure. Sampling-based graph neural networks relieve the storage pressure caused by large-scale graph data: before training, they limit the scope of aggregated neighbor nodes to reduce the scale of the graph data. Considering the neighbor-explosion problem common in graph data, the network structure of a graph neural network is smaller than that of a deep neural network; meanwhile, as GPU computing power keeps increasing, data loading and data transmission become the bottleneck of graph neural network training. In a distributed environment, the data loading bottleneck becomes even more severe.
The data loading bottleneck increases the end-to-end training time of a graph neural network, reduces the efficiency of the whole system, and leaves GPU resources underutilized. To improve data loading efficiency, some existing techniques optimize data loading in graph neural networks. For example, DGL supports sampling on the GPU to accelerate sampling, but this requires loading the entire graph onto the GPU, so scalability is limited by GPU memory capacity; Pytorch-Direct designs a unified CPU/GPU tensor type and fetches data directly from the CPU side with zero copy, but ignores redundant data transmission between different mini-batches; PaGraph caches high-degree nodes on the GPU in advance to reduce redundant CPU-to-GPU transfers, which occupies part of the precious GPU memory; RBD exploits the data shared between different mini-batches to reduce redundant transfers, but node data that is not shared must still be fetched from the CPU, so some transfer overhead remains; Torch-quick caches hot data to improve transfer efficiency, again at the cost of extra GPU memory.
In summary, data loading has become the main factor limiting the training efficiency of graph neural networks. Existing techniques optimize sampling and data transmission separately, but each has its own drawbacks, so data loading in graph neural network training needs further optimization.
Disclosure of Invention
The invention combines the characteristics of graph sampling, data transmission and graph data to realize an efficient, easy-to-use data loading system for sampling-based graph neural networks, and aims to solve the bottleneck of data loading (graph sampling and data transmission) in graph neural network training.
In order to solve the technical problems, the specific technical scheme for realizing the purpose of the invention is as follows:
a data loading system for sample-based graph neural network training, the system comprising: a neighbor node sampler and a data transmitter, wherein:
the data transmitter comprises a classifier, a feature aggregator and a data manager;
the neighbor node sampler is connected to the classifier in the data transmitter; it uses a sampling operator to obtain the nodes of the current mini-batch, and adopts a CSR array storage strategy to reduce data transmission latency during sampling;
the classifier in the data transmitter is respectively connected with the neighbor node sampler and the data manager, and classifies the nodes sampled by the neighbor node sampler into shared nodes and non-shared nodes;
the data transmitter performs different processing on different types of nodes: for shared nodes, feature data maintained in the GPU are directly used, and for non-shared nodes, feature data of the non-shared nodes are obtained by using a feature aggregator;
the data manager in the data transmitter maintains a variable-size memory space in the GPU to store the feature data of the previous mini-batch and updates the feature data in this memory space in place during each training iteration.
The sampling operator performs parallel sampling with the sampling task of each node as the basic unit, and outputs all nodes obtained after sampling, which form a mini-batch;
the CSR array storage strategy stores the CSR array in shared memory, global memory or CPU memory according to its size; when the CSR array is stored in CPU memory, the neighbor node sampler uses zero copy to further reduce data transmission latency.
The classifier in the data transmitter classifies the nodes output by the neighbor node sampler, using an inverted index to divide the nodes whose feature data is to be acquired in the current mini-batch into two classes: shared nodes and non-shared nodes;
the classifier establishes an inverted index for the nodes of the previous mini-batch and the current mini-batch and records the index of each node in the two mini-batches; the current mini-batch nodes are then classified into shared nodes and non-shared nodes according to the number of recorded indexes.
The feature aggregator of the data transmitter first uses zero copy to reduce data transmission latency, and then treats the feature aggregation task of each node as the basic unit of parallelism; within the feature aggregation of a single node, parallelism is further exploited at the granularity of feature dimensions.
The data manager of the data transmitter maintains a variable-size memory space in the GPU to store the feature data of the previous mini-batch and provides a way to update this feature data in place; if the current mini-batch has more nodes than the previous mini-batch, the data manager expands the memory space maintained in the GPU, otherwise it removes the redundant space;
updating the memory space in place means sorting the indexes of the non-shared nodes and replacing the corresponding feature data in the GPU memory space one by one.
The beneficial effect of the invention is that, through this data loading system for sampling-based graph neural networks, data loading latency is reduced and the overall performance of the system is improved.
Drawings
FIG. 1 is a block diagram of a data loading system of an embodiment of the present invention;
FIG. 2 is a flow chart of the use of a data loading system in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of a neighbor node sampler in a data loading system according to an embodiment of the present invention;
FIG. 4 is a flow chart of processing different types of data in a data transmitter in a data loading system according to an embodiment of the present invention;
FIG. 5 is a flow chart of a classifier in a data transmitter in a data loading system according to an embodiment of the present invention;
FIG. 6 is a flow chart of feature aggregation in a data loading system according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is described in detail below with reference to the embodiment and the attached drawings.
The data loading system for the graph neural network training based on sampling optimizes data loading by combining graph sampling, data transmission characteristics and graph data characteristics.
A data loading system for sampling-based graph neural networks optimizes graph sampling and data transmission separately to reduce data loading latency. The system comprises a neighbor node sampler and a data transmitter. The neighbor node sampler exploits the strong parallel computing power of the GPU and provides a CSR storage strategy to increase sampling speed. The data transmitter comprises a classifier, a feature aggregator and a data manager: the classifier exploits the fact that nodes are repeatedly sampled across different mini-batches to divide the current mini-batch nodes into shared nodes and non-shared nodes; for shared nodes, the feature data does not need to be transmitted again; for non-shared nodes, the feature aggregator fetches the feature data from the CPU. The data manager maintains a variable-size memory space in the GPU to store the feature data of the previous mini-batch and updates it in place. The system provides a simple, easy-to-use interface, including sampling and an overloaded operator for acquiring node features.
The neighbor node sampler of the system moves graph sampling from the CPU to the GPU, implements an efficient sampling operator on the GPU, and uses the GPU's high parallelism to hide random-access latency. The sampling operator takes the seed nodes generated by the DataLoader of the deep learning framework PyTorch as input and outputs all nodes obtained after sampling the seed nodes, which together form a mini-batch. The sampling operator performs parallel sampling with the sampling task of each node as the basic unit. The neighbor node sampler provides a CSR array storage strategy: the CSR array is stored in shared memory, global memory or CPU memory according to its size. When the CSR array is stored in CPU memory, the neighbor node sampler uses the zero-copy technique to further reduce data transmission latency.
The data transmitter of the system uses the classifier to divide the nodes whose feature data is to be acquired in the mini-batch into two classes: shared nodes and non-shared nodes. For shared nodes, the data transmitter directly reuses the feature data already stored in the GPU, reducing redundant CPU-to-GPU transfers; for non-shared nodes, the data transmitter implements an efficient feature aggregator on the GPU and uses GPU parallelism to reduce the latency of fetching feature data from the CPU. The classifier uses an inverted index to split the nodes into these two classes: it builds an inverted index over the nodes of the previous mini-batch and the current mini-batch, recording the index of each node in the two mini-batches, and classifies the current mini-batch nodes into shared and non-shared nodes according to the number of recorded indexes. The data transmitter uses the data manager to maintain a variable-size memory space in the GPU and provides a way to update this memory space in place: if the current mini-batch has more nodes than the previous one, the data transmitter expands the memory space in the GPU, otherwise it removes the redundant space. Updating the memory space in place means sorting the indexes of the non-shared nodes and replacing the corresponding feature data in the GPU memory space one by one.
The use steps of the system are as follows:
step A-1: generating the seed nodes of a mini-batch with the DataLoader in PyTorch;
step A-2: the neighbor node sampler rapidly samples the seed nodes; all sampled neighbor nodes together with the seed nodes form a mini-batch;
step A-3: the data transmitter acquires the feature data of the nodes in the mini-batch;
step A-4: training starts (a minimal usage sketch of these steps follows).
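For orientation, the following is a minimal PyTorch sketch of steps A-1 to A-4. Only step A-1 (seed-node generation with the DataLoader) is concrete; the sampler, data transmitter, model and optimizer calls appear only as comments because the patent does not define their programming interface, so those names are illustrative assumptions rather than the system's actual API.

```python
import torch
from torch.utils.data import DataLoader

# Step A-1: the DataLoader in PyTorch yields seed-node IDs for each mini-batch.
num_train_nodes = 10_000                     # illustrative graph size
train_ids = torch.arange(num_train_nodes)
seed_loader = DataLoader(train_ids, batch_size=1024, shuffle=True)

for seeds in seed_loader:
    # Step A-2 (hypothetical API): the neighbor node sampler expands the seeds
    # into the full mini-batch of sampled nodes on the GPU.
    #   mini_batch = sampler.sample(seeds.cuda())
    # Step A-3 (hypothetical API): the data transmitter reuses shared-node
    # features already resident on the GPU and gathers the rest from the CPU.
    #   feats = transmitter.fetch_features(mini_batch)
    # Step A-4: the mini-batch and its features are fed to the GNN model.
    #   loss = model(mini_batch, feats); loss.backward(); optimizer.step()
    pass
```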
Embodiment
The block diagram of the data loading system of this embodiment is shown in FIG. 1. The system comprises two key components: a neighbor node sampler and a data transmitter. The neighbor node sampler adopts the CSR storage strategy and assigns one GPU warp to the sampling task of each node, improving the efficiency of graph sampling. The data transmitter exploits the data shared between different mini-batches to reduce redundant CPU-to-GPU transfers, and implements an efficient feature aggregator for non-shared nodes to further improve data transmission throughput. The system provides a simple, easy-to-use interface, including the sampler and an overloaded operator for acquiring node features, so that users can run the system and obtain the performance improvement by modifying only a small amount of code. The flow of the system of this embodiment is shown in FIG. 2 and comprises the following steps:
step 101: generating seed nodes with the DataLoader in the deep learning framework PyTorch;
step 102: the neighbor node sampler receives the seed nodes, generates the sampled nodes and sends them to the classifier in the data transmitter;
step 103: the classifier compares the received mini-batch nodes with those of the previous mini-batch and classifies them into shared nodes and non-shared nodes;
step 104: the data manager maintains a variable-size space in the GPU to store the feature data of the previous mini-batch;
step 105: for shared nodes, the data transmitter directly uses the feature data stored in the GPU; for non-shared nodes, it acquires the data from the CPU side with the feature aggregator;
step 106: the data manager updates the feature data maintained in the GPU in place;
step 107: the data manager sends the acquired feature data of the mini-batch to the model for training.
The neighbor node sampler of this embodiment improves sampling efficiency. It migrates sampling from the CPU to the GPU, implements an efficient sampling operator, and hides random-access latency through the GPU's high parallelism. The sampling operator is designed as follows: for a batch of seed nodes, the sampling task of each node is treated as a basic task, and each warp is responsible for one basic task, i.e., the sampling of one node. Each thread in the same warp samples one neighbor of that seed node. This design makes the memory accesses of threads within a warp contiguous, which reduces the number of memory transactions issued by the GPU and thus the running overhead. Since the structure of the graph is usually stored in CSR or CSC form to reduce storage requirements, every thread needs to access the CSR arrays during sampling to obtain neighbor nodes. The neighbor node sampler therefore provides a CSR-based storage strategy to further reduce sampling latency. Specifically, the CSR-based storage strategy is as follows:
1. When size(CSR) <= size(shared memory): the CSR array is stored in shared memory. This gives the lowest access latency.
2. When size(CSR) > size(shared memory) and size(CSR) <= size(global memory): the CSR array is stored in global memory. The CSR array is still kept on the GPU, so the latency is lower than pulling the data from the CPU.
3. When size(CSR) > size(global memory): the GPU cannot hold the CSR array, so the graph structure information is read from the CPU in a zero-copy manner, which reduces transfer latency compared with copying the data from the CPU directly (this three-way decision is sketched below).
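As a concrete illustration of this three-way placement rule, the Python sketch below encodes the decision as a function; the capacity values passed in are illustrative inputs, and on a real device they would be queried from the GPU (shared memory per block, free global memory).

```python
def choose_csr_location(csr_bytes: int,
                        shared_mem_bytes: int,
                        global_mem_bytes: int) -> str:
    """Pick where the CSR arrays live, following the three cases above."""
    if csr_bytes <= shared_mem_bytes:
        return "gpu_shared_memory"     # case 1: lowest access latency
    if csr_bytes <= global_mem_bytes:
        return "gpu_global_memory"     # case 2: still on the GPU, no PCIe traffic
    return "cpu_pinned_zero_copy"      # case 3: GPU reads CPU memory via zero copy


# Illustrative capacities: 100 KB of shared memory, 16 GB of global memory.
print(choose_csr_location(64 * 1024, 100 * 1024, 16 * 2**30))    # gpu_shared_memory
print(choose_csr_location(4 * 2**30, 100 * 1024, 16 * 2**30))    # gpu_global_memory
print(choose_csr_location(64 * 2**30, 100 * 1024, 16 * 2**30))   # cpu_pinned_zero_copy
```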
The flow chart of the neighbor node sampler in this embodiment is shown in fig. 3, and includes the following steps:
step 201: before the system starts running, computing the size of the CSR array and determining where to store it according to the CSR array storage strategy;
step 202: for an input batch of seed nodes, assigning one warp to each node;
step 203: for a single node, each thread in the warp samples one neighbor of the corresponding seed node;
step 204: sampling is completed and the sampled nodes form the current mini-batch (a simplified reference implementation of the per-node sampling is sketched below).
The data transmitter of this embodiment improves data transmission throughput. It comprises three parts: the classifier, the feature aggregator and the data manager. The data manager maintains a block of memory in the GPU to store the feature data of the previous mini-batch and updates it in place during each training iteration. The classifier divides the nodes whose feature data is to be acquired into two classes. The first class consists of nodes shared between different mini-batches; the data transmitter directly reuses their feature data already stored in the GPU, reducing the amount of data transferred from the CPU to the GPU. The second class consists of nodes not shared between different mini-batches; for these, the data transmitter implements an efficient feature aggregator to reduce the latency of fetching their feature data from the CPU. FIG. 4 shows how the data transmitter of this embodiment processes the different types of data, specifically comprising the following steps:
step 301: classifying the mini-batch nodes by a classifier;
step 302: for shared nodes, the data transmitter directly uses the data still stored on the GPU;
step 303: for non-shared nodes, the data transmitter acquires feature data from the CPU end by using a feature aggregator.
The classifier uses an inverted index to classify the nodes of two different mini-batches; the inverted index records the index of each node in the two adjacent mini-batches. If the inverted index of a node contains two index positions, the node is classified as a shared node; otherwise it is classified as a non-shared node. While building the inverted index, the classifier records the indexes of the shared nodes and the indexes at which the non-shared nodes appear in the two mini-batches, for use in the subsequent feature aggregation step. FIG. 5 shows the flow of the classifier in the data transmitter of this embodiment, specifically comprising the following steps:
step 401: establishing an inverted index for the node in the previous mini-batch and the node of the current mini-batch;
step 402: classifying all nodes according to the number of the inverted indexes; nodes with two indexes are classified as shared nodes, and the rest nodes are classified as non-shared nodes;
step 403: recording the indexes of the shared nodes and the indexes of the non-shared nodes in the two mini-batches (a CPU sketch of this classification follows).
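As a minimal illustration of steps 401 to 403, the Python sketch below builds the inverted index and splits the current mini-batch into shared and non-shared nodes on the CPU; the patented classifier performs the equivalent work alongside the GPU-side data manager. The function name and return layout are illustrative, and node IDs are assumed to appear at most once per mini-batch.

```python
import torch

def classify_nodes(prev_batch: torch.Tensor, cur_batch: torch.Tensor):
    """Split the current mini-batch into shared and non-shared nodes.

    Builds an inverted index mapping node ID -> positions in the two batches;
    nodes indexed in both batches are shared. Returns the shared nodes'
    positions in the previous and current batch (to reuse GPU-resident
    features) and the non-shared nodes' sorted positions in the current batch
    (to be fetched from the CPU).
    """
    inverted = {}                                   # node id -> [(batch, pos), ...]
    for pos, nid in enumerate(prev_batch.tolist()):
        inverted.setdefault(nid, []).append(("prev", pos))
    for pos, nid in enumerate(cur_batch.tolist()):
        inverted.setdefault(nid, []).append(("cur", pos))

    shared_prev, shared_cur, non_shared_cur = [], [], []
    for nid, hits in inverted.items():
        batches = {b for b, _ in hits}
        if batches == {"prev", "cur"}:              # indexed in both -> shared
            shared_prev.append(next(p for b, p in hits if b == "prev"))
            shared_cur.append(next(p for b, p in hits if b == "cur"))
        elif "cur" in batches:                      # only in current -> non-shared
            non_shared_cur.append(hits[0][1])
    return (torch.tensor(shared_prev), torch.tensor(shared_cur),
            torch.tensor(sorted(non_shared_cur)))

prev = torch.tensor([4, 7, 9, 12])
cur  = torch.tensor([7, 12, 3, 5, 9])
print(classify_nodes(prev, cur))    # 7, 9, 12 are shared; 3, 5 are non-shared
```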
The feature aggregator implements an efficient feature aggregation operator. Its main design is as follows. First, the feature data is placed in page-locked (pinned) memory on the CPU side so that zero copy can be used, which removes one extra memory copy on the CPU side. For the nodes whose feature data must be fetched, the feature aggregation of each node is treated as a basic task, and each warp is responsible for one basic task, i.e., the feature aggregation of one node; each thread in the same warp aggregates the feature data of one dimension. This design makes the memory accesses of threads in the same warp contiguous, further reducing access latency and running overhead. Because different mini-batches may contain different numbers of nodes, the data manager maintains a dynamically resizable array of node feature data in the GPU and updates it in place. To realize the in-place update, the data manager first sorts the indexes of the non-shared nodes in the two mini-batches; when the current mini-batch is smaller, the redundant node feature data at the end of the original array is removed. Sorting the indexes guarantees that this redundant feature data can be removed smoothly and correctly. When the current mini-batch is larger, the mini-batch array is expanded. Because this in-place update changes the order of the resulting mini-batch feature data relative to the original mini-batch node order, the label data of the mini-batch must be rearranged after the feature data is obtained, so that the label array corresponds to the mini-batch feature array. The flow of feature aggregation (the feature aggregator and the data manager working together) in this embodiment is shown in FIG. 6, specifically comprising the following steps:
step 501: allocating the CPU-side feature data in page-locked (pinned) memory;
step 502: sorting the indexes of the non-shared nodes in the mini-batch;
step 503: if the current mini-batch has more nodes than the previous mini-batch, expanding the mini-batch array on the GPU side;
step 504: assigning one warp to each non-shared node to perform its feature aggregation;
step 505: each thread in the same warp handles the feature aggregation of one dimension;
step 506: replacing the feature data of the non-shared nodes;
step 507: after the non-shared node features have been replaced, if the current mini-batch has fewer nodes than the previous one, removing the redundant node data from the mini-batch array in the GPU (a simplified PyTorch sketch of this feature gathering and reuse follows).
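The PyTorch sketch below ties the gathering steps together in a simplified form, assuming the position arrays produced by the classifier sketch above. The feature table is kept in page-locked CPU memory and the non-shared rows are copied asynchronously; allocating a fresh output tensor each call, and using index_select plus a non-blocking copy instead of the patent's zero-copy CUDA kernel (one warp per node, one thread per feature dimension) and in-place buffer resizing, are deliberate simplifications.

```python
import torch

def fetch_minibatch_features(prev_gpu_feats,    # previous mini-batch features, on the GPU
                             cpu_feats,         # full feature table, pinned CPU memory
                             cur_batch,         # node IDs of the current mini-batch
                             shared_prev,       # positions of shared nodes in the previous batch
                             shared_cur,        # positions of shared nodes in the current batch
                             non_shared_cur,    # sorted positions of non-shared nodes in the current batch
                             device="cuda"):
    dim = cpu_feats.size(1)
    out = torch.empty(cur_batch.numel(), dim, device=device)

    # Shared nodes: reuse rows already resident on the GPU (no CPU-to-GPU transfer).
    out[shared_cur.to(device)] = prev_gpu_feats[shared_prev.to(device)]

    # Non-shared nodes: gather their rows on the CPU and copy them asynchronously.
    rows = torch.index_select(cpu_feats, 0, cur_batch[non_shared_cur]).pin_memory()
    out[non_shared_cur.to(device)] = rows.to(device, non_blocking=True)
    return out

# Toy example: 8 nodes with 4-dimensional features; nodes 7 and 2 are shared
# between the previous batch [4, 7, 2] and the current batch [7, 2, 5].
cpu_feats = torch.arange(32, dtype=torch.float32).view(8, 4).pin_memory()
prev_gpu_feats = cpu_feats[torch.tensor([4, 7, 2])].cuda()
feats = fetch_minibatch_features(prev_gpu_feats, cpu_feats,
                                 cur_batch=torch.tensor([7, 2, 5]),
                                 shared_prev=torch.tensor([1, 2]),
                                 shared_cur=torch.tensor([0, 1]),
                                 non_shared_cur=torch.tensor([2]))
print(feats)     # rows of nodes 7, 2, 5
```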

Claims (1)

1. A data loading system for sample-based graph neural network training, comprising:
a neighbor node sampler and a data transmitter, wherein:
the data transmitter comprises a classifier, a feature aggregator and a data manager;
the neighbor node sampler is connected to the classifier in the data transmitter; it uses a sampling operator to obtain the nodes of the current mini-batch, and adopts a CSR array storage strategy to reduce data transmission latency during sampling;
the classifier in the data transmitter is respectively connected with the neighbor node sampler and the data manager, and classifies the nodes sampled by the neighbor node sampler into shared nodes and non-shared nodes;
the data transmitter performs different processing on different types of nodes: for shared nodes, feature data maintained in the GPU are directly used, and for non-shared nodes, feature data of the non-shared nodes are obtained by using a feature aggregator;
the data manager in the data transmitter maintains a variable-size memory space in the GPU to store the feature data of the previous mini-batch, sends the acquired feature data of the mini-batch to the model for training, and updates the feature data in this memory space in place during each training iteration; wherein,
the sampling operator takes the seed nodes generated by the DataLoader in the deep learning framework PyTorch as input and outputs all nodes obtained after sampling the seed nodes, which form a mini-batch; the sampling operator performs parallel sampling with the sampling task of each node as the basic unit;
the CSR array storage strategy is specifically as follows:
i) when size(CSR) <= size(shared memory): the CSR array is stored in shared memory;
ii) when size(CSR) > size(shared memory) and size(CSR) <= size(global memory): the CSR array is stored in global memory;
iii) when size(CSR) > size(global memory): the GPU cannot hold the CSR array, and the graph structure information is acquired from the CPU in a zero-copy manner;
the classifier in the data transmitter classifies the nodes output by the neighbor node sampler, using an inverted index to divide the nodes whose feature data is to be acquired in the current mini-batch into two classes: shared nodes and non-shared nodes;
the classifier establishes an inverted index for the nodes of the previous mini-batch and the current mini-batch and records the index of each node in the two mini-batches; if the inverted index of a node contains two index positions, the node is classified as a shared node, otherwise as a non-shared node;
acquiring the feature data with the feature aggregator specifically comprises the following steps:
i) allocating the CPU-side feature data in page-locked memory;
ii) sorting the indexes of the non-shared nodes in the mini-batch;
iii) if the current mini-batch has more nodes than the previous mini-batch, expanding the mini-batch array on the GPU side;
iv) assigning one warp to each non-shared node to perform its feature aggregation;
v) each thread in the same warp handling the feature aggregation of one dimension;
vi) replacing the feature data of the non-shared nodes;
vii) after the non-shared node features have been replaced, if the current mini-batch has fewer nodes than the previous mini-batch, removing the redundant node data from the mini-batch array in the GPU;
the data manager of the data transmitter maintains a variable-size memory space in the GPU to store the feature data of the previous mini-batch and provides a way to update this feature data in place; if the current mini-batch has more nodes than the previous mini-batch, the data manager expands the memory space maintained in the GPU, otherwise it removes the redundant space;
updating the memory space in place means sorting the indexes of the non-shared nodes and replacing the corresponding feature data in the GPU memory space one by one.
CN202210641439.7A 2022-06-08 2022-06-08 Data loading system for graph neural network training based on sampling Active CN114895985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210641439.7A CN114895985B (en) 2022-06-08 2022-06-08 Data loading system for graph neural network training based on sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210641439.7A CN114895985B (en) 2022-06-08 2022-06-08 Data loading system for graph neural network training based on sampling

Publications (2)

Publication Number Publication Date
CN114895985A CN114895985A (en) 2022-08-12
CN114895985B (en) 2023-06-09

Family

ID=82727683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210641439.7A Active CN114895985B (en) 2022-06-08 2022-06-08 Data loading system for graph neural network training based on sampling

Country Status (1)

Country Link
CN (1) CN114895985B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116011551B (en) * 2022-12-01 2023-08-29 中国科学技术大学 Graph sampling training method, system, equipment and storage medium for optimizing data loading

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324630A (en) * 2020-03-04 2020-06-23 中科弘云科技(北京)有限公司 MPI-based neural network architecture search parallelization method and equipment
CN111400560A (en) * 2020-03-10 2020-07-10 支付宝(杭州)信息技术有限公司 Method and system for predicting based on heterogeneous graph neural network model
CN112906982A (en) * 2021-03-22 2021-06-04 哈尔滨理工大学 GNN-LSTM combination-based network flow prediction method
CN113867983A (en) * 2021-09-14 2021-12-31 杭州海康威视数字技术股份有限公司 Graph data mining method and device, electronic equipment and machine-readable storage medium
CN114048847A (en) * 2021-11-16 2022-02-15 中国人民解放军国防科技大学 Method, device and equipment for caching graph neural network data and storage medium
CN114048816A (en) * 2021-11-16 2022-02-15 中国人民解放军国防科技大学 Method, device and equipment for sampling graph neural network data and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11409015B2 (en) * 2020-06-12 2022-08-09 Saudi Arabian Oil Company Methods and systems for generating graph neural networks for reservoir grid models

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324630A (en) * 2020-03-04 2020-06-23 中科弘云科技(北京)有限公司 MPI-based neural network architecture search parallelization method and equipment
CN111400560A (en) * 2020-03-10 2020-07-10 支付宝(杭州)信息技术有限公司 Method and system for predicting based on heterogeneous graph neural network model
WO2021179838A1 (en) * 2020-03-10 2021-09-16 支付宝(杭州)信息技术有限公司 Prediction method and system based on heterogeneous graph neural network model
CN112906982A (en) * 2021-03-22 2021-06-04 哈尔滨理工大学 GNN-LSTM combination-based network flow prediction method
CN113867983A (en) * 2021-09-14 2021-12-31 杭州海康威视数字技术股份有限公司 Graph data mining method and device, electronic equipment and machine-readable storage medium
CN114048847A (en) * 2021-11-16 2022-02-15 中国人民解放军国防科技大学 Method, device and equipment for caching graph neural network data and storage medium
CN114048816A (en) * 2021-11-16 2022-02-15 中国人民解放军国防科技大学 Method, device and equipment for sampling graph neural network data and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li Mufan. Dynamic heterogeneous graph neural networks and their application in recommender systems. China Masters' Theses Full-text Database, Information Science and Technology Series; I138-3315 *
Huang Hao, Li Zhifang, Wang Jialun, Weng Chuliang. Implementation and optimization of a GPU-based relational stream processing system. Journal of East China Normal University (Natural Science); pp. 178-189 *
Chen Bofeng, Li Jingdong, Lu Xingjian, Sha Chaofeng, Wang Xiaoling, Zhang Ji. A survey of graph anomaly detection techniques based on deep learning. Journal of Computer Research and Development, 2021; pp. 1436-1455 *

Also Published As

Publication number Publication date
CN114895985A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN108280522A (en) A kind of plug-in type distributed machines study Computational frame and its data processing method
CN112235344B (en) Distributed machine learning-oriented sparse communication model implementation method
CN114895985B (en) Data loading system for graph neural network training based on sampling
CN113590321B (en) Task configuration method for heterogeneous distributed machine learning cluster
CN110119408B (en) Continuous query method for moving object under geospatial real-time streaming data
CN113672583B (en) Big data multi-data source analysis method and system based on storage and calculation separation
CN111898698A (en) Object processing method and device, storage medium and electronic equipment
Xia et al. Redundancy-free high-performance dynamic GNN training with hierarchical pipeline parallelism
CN108536823B (en) Cache design and query method for sensing big data of Internet of things
CN113051091A (en) Process-level cache data synchronization method and device
Yang et al. EDGES: An efficient distributed graph embedding system on GPU clusters
CN113282568B (en) IOT big data real-time sequence flow analysis application technical method
CN115935080A (en) Social network flow data oriented MPICH parallel computing-based maximum cluster enumeration method
CN115510134A (en) Quick data loading method and system for orientation graph-relation mixed storage database
CN112463904B (en) Mixed analysis method of distributed space vector data and single-point space data
CN111897784B (en) Key value storage-oriented near data computing cluster system
An et al. Evaluating parallel R-tree implementations on a network of workstations
CN113961562A (en) Internet of things data storage, processing and analysis system based on mass data
Li et al. Application and performance optimization of MapReduce model in image segmentation
WO2017186049A1 (en) Method and device for information processing
CN115914410B (en) Digital twin simulation method based on data driving
CN116011551B (en) Graph sampling training method, system, equipment and storage medium for optimizing data loading
CN117131000B (en) NetCDF meteorological data processing method and terminal
CN115658737B (en) Method and system for querying space-time accompaniment of large-scale track data
CN117312633B (en) Dynamic maximum group enumeration device and method based on FPGA with HBM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant