CN116167436A - Neural network pipeline parallel training method for optimizing model division

Neural network pipeline parallel training method for optimizing model division

Info

Publication number
CN116167436A
CN116167436A CN202310139664.5A
Authority
CN
China
Prior art keywords
model
neural network
gpu
calculation
theoretical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310139664.5A
Other languages
Chinese (zh)
Inventor
方熔翔
魏贵义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202310139664.5A priority Critical patent/CN116167436A/en
Publication of CN116167436A publication Critical patent/CN116167436A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 - Techniques for rebalancing the load in a distributed system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means


Abstract

The invention discloses a neural network pipeline parallel training method that optimizes model partitioning. The invention models the neural network as a DAG graph, derives the weights of the vertices and edges in the graph from the theoretical computation time and theoretical communication time of the model, and constructs a performance model. A solution set is obtained by iteratively searching model partitioning schemes, yielding the model partitioning scheme that minimizes the variance of the theoretical computation times among model fragments while also minimizing the global communication time. Under this model partitioning scheme, pipeline parallelism is introduced to further accelerate the training process of the model. By adopting the proposed model partitioning algorithm, the global communication time during neural network training is reduced while load balancing among computing devices is ensured.

Description

Neural network pipeline parallel training method for optimizing model division
Technical Field
The invention belongs to the field of neural network parallel training, and particularly relates to a neural network pipeline parallel training method for optimizing model division.
Background
With the advent of the big data era, the amount of data generated in daily life keeps growing. To process massive data, solve increasingly complex problems, and meet users' demands for highly accurate deep learning tasks, the number of layers in neural networks keeps increasing. Such large-scale neural network models greatly lengthen training time, so distributed parallel training methods are commonly used to accelerate model training.
Methods for distributed training of neural networks generally fall into three categories. In data parallelism, a large amount of data is split into small batches and sent to each computing device for computation, and the parameters are updated through some communication scheme after the computation completes. In model parallelism, a large model is divided into smaller sub-models that are deployed on the individual computing devices for training. Hybrid parallelism combines data parallelism and model parallelism.
Data parallelism requires that the memory of each computing device can hold the complete model, which limits scalability. When partitioning a model for model parallelism, traditional partitioning algorithms do not jointly consider the load balance of the computing devices and the overall communication time, and model parallelism alone yields low system throughput. How to design a distributed parallel training strategy for neural networks that keeps the computing tasks balanced across the hardware devices, reduces the global communication time as much as possible, and thereby accelerates neural network training is therefore an urgent problem in the field of neural network parallel training.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a neural network pipeline parallel training method for optimizing model division, which accelerates the training process of the neural network.
The technical scheme adopted by the invention is as follows:
The invention comprises the following steps:
Step 1, modeling the neural network with a DAG graph, setting the weight of each vertex in the DAG graph to the theoretical computation time of the corresponding neural network layer, and setting the weight of each edge to the theoretical communication time between neural network layers.
Step 2, partitioning the neural network model according to the DAG graph to obtain a group of solution sets, obtaining a relatively optimal model partitioning scheme using the evaluation algorithm for model partitioning schemes, and deploying the partitioned model on the corresponding GPUs.
Step 3, equally dividing one input data batch into several micro-batches; after the previous GPU finishes computing the current micro-batch, its output is transmitted to the next GPU, which continues the forward propagation of the neural network while the previous GPU starts computing the next micro-batch.
Step 4, updating the parameters of the neural network after all micro-batch data have completed the forward-propagation and backward-propagation computing tasks.
Step 5, repeating steps 3-4 until the set number of rounds is reached, thereby completing the training of the neural network.
Further, the step 1 specifically comprises the following steps:
step 1.1, modeling a DAG graph of the neural network, wherein the direction of the edge in the DAG graph represents the dependency relationship between the neural network model layers.
Step 1.2, weighting the vertices and edges of the original DAG graph, where the weight of a vertex is the theoretical computation time of the corresponding neural network model layer and the weight of an edge is the theoretical communication time between neural network model layers.
Further, the step 2 specifically comprises the following steps:
Step 2.1, according to the theoretical computation time tc_k of each layer of the neural network and the number m of GPUs, distributing the layers of the neural network approximately evenly across the GPUs so that the sums of the theoretical computation times of the model layers assigned to each GPU are approximately equal, thereby obtaining an initial model partitioning scheme as the initial solution set.
Step 2.2, traversing the current solution set to obtain, for each solution, all model layers that can be moved between model fragments, calculating the movement probability of these movable layers, and finally determining the layers to move according to the movement probability.
Step 2.3, moving the selected layers to obtain a new group of model partitioning schemes, using it as the solution set for the next iteration, and adding the solutions of the current round to the final solution set; repeating step 2.2 until the specified number of iterations is reached.
Step 2.4, obtaining the final solution set and selecting the optimal solution using the evaluation algorithm for model partitioning schemes.
Step 2.5, partitioning the neural network model according to the optimal solution and deploying the model fragments on the corresponding GPUs in order.
Further, the evaluation algorithm for model partitioning schemes is as follows: the load balance across the GPUs is evaluated by the variance of the theoretical computation times of the model partitions, the time required for communication during model training is measured by the global communication time, and the two indices are considered jointly to judge the quality of a model partitioning scheme.
Further, the step 3 specifically comprises the following steps:
Step 3.1, starting the training of the model, dividing a data batch equally into several micro-batches, and creating a thread for each GPU, where each thread maintains an input queue input_queue and an output queue output_queue.
Step 3.2, constructing the dependency relationships between computing tasks to facilitate task scheduling. Dependencies are built between the computing tasks of different micro-batches on the same GPU: on one GPU, computation of the next micro-batch starts only after computation of the previous micro-batch has completed. Dependencies are also built for the same micro-batch executed on different GPUs: the previous GPU completes the computation of the current micro-batch, puts the output into its own output_queue, and the data in the output_queue is then copied to the next GPU.
The invention has the beneficial effects that:
the invention models the neural network into a DAG graph, models the weight values of the vertexes and the edges in the graph by using the theoretical calculation time and the theoretical communication time of the model, and constructs a performance model. The solution set is obtained through iterative search of the model partitioning scheme, and the model partitioning scheme which can enable the variance of theoretical calculation time among model fragments to be minimum and simultaneously enable global communication time to be minimum is obtained. Under the model division scheme, a pipeline parallel technology is introduced to further accelerate the training process of the model.
By adopting the model partitioning algorithm, the global communication time during neural network training is reduced while load balancing among computing devices is ensured. Compared with the traditional model partitioning algorithm, the time required for training one round under the model parallel strategy is shorter. After the parallel of the pipeline is introduced, the total communication time is reduced while the bubble time in the pipeline is reduced, and better overlapping of calculation and communication is realized, so that the training of a model is further accelerated.
Drawings
FIG. 1 is a flow chart of the neural network pipeline parallel training of the optimization model partitioning of the present invention;
FIG. 2 is a diagram of a DAG modeled by a neural network of the present invention, wherein (a) is a non-weighted DAG diagram and (b) is a weighted DAG diagram;
FIG. 3 is a computational task space-time diagram of a neural network of the present invention, wherein (a) is a forward propagation space-time diagram and (b) is a reverse propagation space-time diagram;
FIG. 4 is a performance comparison graph of the model partitioning algorithms of the present invention;
FIG. 5 is a graph of the acceleration effect of the pipeline parallel algorithm of the present invention under different numbers of micro-batches;
FIG. 6 compares the training accuracy of the parallel algorithm of the present invention with single-GPU training, where (a) is the experiment with the ResNet-50 model and the CIFAR-10 dataset, (b) with the ResNet-101 model and the CIFAR-100 dataset, and (c) with the ResNet-152 model and the Caltech-101 dataset.
Detailed Description
The embodiment discloses a neural network pipeline parallel training method for optimizing model division, and a specific flow is shown in fig. 1.
The method mainly comprises three parts: the first part performs DAG graph modeling of the neural network to be trained, the second part partitions the neural network model, and the third part performs pipeline parallel training.
The specific steps are given in steps 1 to 3 below.
Step 1, modeling the neural network as a DAG graph.
Step 1.1, define the DAG graph as G = (V, E), where the vertex set V represents the set of all layers in the neural network L = {l_1, l_2, ..., l_n}, and the edge set E represents all topological relationships between layers. An example of the modeling is shown in fig. 2 (a).
Step 1.2, abstracting the theoretical computation time tc_k of each layer in the neural network into the weight of the corresponding vertex in the DAG graph.
The theoretical computation of the convolution layer is given by formula (1), that of the fully connected layer by formula (2), that of the BN layer by formula (3), and that of the pooling and ReLU layers by formula (4).
f_k = 2×b×K_k²×C_k_in×H_k_out×W_k_out×C_k_out (1)
f_k = 2×b×H_k_in×W_k_in×C_k_in×C_k_out (2)
f_k = 2×b×C_k_in×H_k_in×W_k_in (3)
f_k = b×C_k_in×H_k_in×W_k_in (4)
where K_k is the convolution kernel size of the k-th layer; C_k_in, H_k_in, W_k_in are the number of channels, the height and the width of the feature map input to the k-th layer; C_k_out, H_k_out, W_k_out are the number of channels, the height and the width of the feature map output by the k-th layer; and b is the input batch size.
The theoretical computation time of a neural network model layer is given by formula (5), where C is the theoretical computing power of the GPU and n is the total number of layers in the model.
tc_k = f_k / C, k = 1, 2, ..., n (5)
where tc_k is the theoretical computation time of the k-th neural network layer and f_k is its floating-point computation (GFLOPs).
The layer-to-layer communication time is abstracted into the weight ts_k of the corresponding edge in the DAG graph. The output tensor size of a model layer is obtained from formula (6). The theoretical communication time between layer k and layer k+1 of the model is given by formula (7), where B is the theoretical bandwidth of the GPU. The resulting weighted DAG graph is shown in fig. 2 (b).
d_k = b×C_k_out×H_k_out×W_k_out (6)
ts_k = d_k / B (7)
where ts_k is the theoretical communication time between the neural network layer and the next layer, and d_k is the output tensor size of that layer.
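As an illustration of how formulas (1)-(7) translate into a weighted DAG, a minimal sketch is given below. It assumes a simple chain topology and a dictionary-based layer description (field names such as "C_in" or "kernel" are illustrative assumptions, not part of the patent); networkx is used only as a convenient DAG container.

```python
import networkx as nx

def layer_flops(layer, b):
    """Theoretical floating-point computation f_k of one layer, per formulas (1)-(4)."""
    C_in, H_in, W_in = layer["C_in"], layer["H_in"], layer["W_in"]
    C_out, H_out, W_out = layer["C_out"], layer["H_out"], layer["W_out"]
    if layer["type"] == "conv":
        K = layer["kernel"]
        return 2 * b * K * K * C_in * H_out * W_out * C_out   # formula (1)
    if layer["type"] == "fc":
        return 2 * b * H_in * W_in * C_in * C_out             # formula (2)
    if layer["type"] == "bn":
        return 2 * b * C_in * H_in * W_in                     # formula (3)
    return b * C_in * H_in * W_in                             # pooling / ReLU, formula (4)

def build_weighted_dag(layers, b, C_gpu, B_gpu):
    """Vertex weight tc_k = f_k / C (formula (5)); edge weight ts_k = d_k / B (formulas (6)-(7))."""
    g = nx.DiGraph()
    for k, layer in enumerate(layers):
        g.add_node(k, tc=layer_flops(layer, b) / C_gpu)
        if k > 0:
            prev = layers[k - 1]
            d_prev = b * prev["C_out"] * prev["H_out"] * prev["W_out"]  # formula (6)
            g.add_edge(k - 1, k, ts=d_prev / B_gpu)
    return g
```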
Step 2, partitioning the neural network model.
Step 2.1, defining the evaluation indices of a model partitioning scheme. After the model is partitioned, a sequence of model fragments P = {P_1, P_2, ..., P_m} is obtained, where each model fragment is a set of layers of the neural network forming a contiguous subset of the original model layers. Summing formula (5) over the layers of each fragment gives the theoretical computation times Tc_1, Tc_2, ..., Tc_m of the fragments. The variance of the theoretical computation times between model fragments is given by formula (8), where Tc_avg is the mean of the theoretical computation times of the model fragments.
σ² = (1/m)·Σ_{i=1..m} (Tc_i - Tc_avg)² (8)
The theoretical communication times Ts_1, Ts_2, ..., Ts_{m-1} between adjacent model fragments are obtained from formula (7). The global communication time is given by formula (9).
T_comm_total = Σ_{i=1..m-1} Ts_i (9)
Step 2.2, evaluation algorithm for model partitioning schemes. Assume there is a model partitioning scheme P_a whose model fragments have theoretical-computation-time variance σ_a and global communication time T_comm_total_a, and another model partitioning scheme P_b with variance σ_b and global communication time T_comm_total_b.
If the difference between σ_a and σ_b is smaller than a given threshold ε, the two schemes differ little in variance, and the scheme with the shorter global communication time is selected. If the difference between σ_a and σ_b is larger than the threshold ε, the variances differ significantly, and the scheme with the smaller variance is selected.
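A minimal sketch of this evaluation rule follows, assuming the weighted DAG g from the sketch above and a partition represented as a list of fragments, each fragment being a non-empty, contiguous list of layer indices (the function names and the default threshold are assumptions):

```python
def partition_metrics(g, partition):
    """Variance of fragment computation times (formula (8)) and global communication time (formula (9))."""
    tc = [sum(g.nodes[k]["tc"] for k in frag) for frag in partition]
    tc_avg = sum(tc) / len(tc)
    variance = sum((t - tc_avg) ** 2 for t in tc) / len(tc)
    # only the edges cut between adjacent fragments contribute to global communication time
    comm_total = sum(g.edges[frag[-1], frag[-1] + 1]["ts"] for frag in partition[:-1])
    return variance, comm_total

def better(g, p_a, p_b, eps=1e-3):
    """Return the better of two partitioning schemes according to the rule of step 2.2."""
    var_a, comm_a = partition_metrics(g, p_a)
    var_b, comm_b = partition_metrics(g, p_b)
    if abs(var_a - var_b) < eps:                  # variances close: prefer less communication
        return p_a if comm_a <= comm_b else p_b
    return p_a if var_a < var_b else p_b          # otherwise prefer the smaller variance
```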
Step 2.3, generating the initial model partitioning scheme. First, the list of per-layer theoretical computation times is accumulated element by element to obtain a list list_flow, whose k-th entry is the sum of the theoretical computation times of the current layer and all preceding layers. The values in the list are then normalized to the range [0, 1], suppressing the influence of outlier time values, giving the normalized list list_flow_norm. Next, using the number of GPUs m, a step size of 1/m is taken and the layers are assigned to model fragments in order. For example, with m = 4, a layer whose normalized value lies in [0, 0.25) is assigned to the first model fragment, a layer whose value lies in [0.25, 0.50) to the second, and so on. This yields an initial model partitioning scheme in which the fragments have approximately equal theoretical computation time.
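A minimal sketch of this initial partitioning is shown below (tc_list holds the per-layer theoretical computation times tc_k; clamping the final layer into the last fragment is an implementation assumption):

```python
def initial_partition(tc_list, m):
    """Assign layers to m fragments by normalized cumulative theoretical computation time (step 2.3)."""
    list_flow, acc = [], 0.0
    for tc in tc_list:                                  # element-wise accumulation
        acc += tc
        list_flow.append(acc)
    total = list_flow[-1]
    list_flow_norm = [v / total for v in list_flow]     # normalize into [0, 1]
    fragments = [[] for _ in range(m)]
    for k, v in enumerate(list_flow_norm):
        idx = min(int(v * m), m - 1)                    # step size 1/m; clamp the final layer
        fragments[idx].append(k)
    return fragments
```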
Step 2.4, generating the solution set of model partitioning schemes and obtaining the optimal solution. The initial model partitioning scheme is added to the current solution set. The solution set is traversed, and for each solution all movable model layers are determined. Apart from the first and last layers of the original model, which are immovable, the first layer of a model fragment may be moved forward to follow the last layer of the previous model fragment, and the last layer of a model fragment may be moved backward to precede the first layer of the next model fragment. To reduce the search space, the movement probability of each layer is defined as r = ω^(-t), where ω is a value greater than 1 and t is the number of times the corresponding layer has already been moved; the more often a layer has been moved, the more likely the move is discarded. New solutions are then generated by moving the selected layers and added to a new solution set, and the move count of each moved layer is increased by 1. The new solution set serves as the solution set to be traversed in the next iteration. Finally, the non-repeated solutions newly generated in each iteration are collected to obtain the final solution set, from which the optimal solution, i.e. the relatively optimal model partitioning scheme, is obtained using the evaluation algorithm of step 2.2.
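A minimal sketch of this iterative search is given below, reusing the weighted DAG g, initial_partition() and better() from the earlier sketches; the bookkeeping of move counts and the iteration budget are assumptions rather than the patent's exact procedure:

```python
import copy
import random

def search_best_partition(g, m, iters=20, omega=2.0):
    """Iteratively move boundary layers between adjacent fragments and keep the best partition (step 2.4)."""
    tc_list = [g.nodes[k]["tc"] for k in sorted(g.nodes)]
    init = initial_partition(tc_list, m)
    move_count = {}                                   # t: how often each layer has already moved
    frontier, final_set = [init], [init]
    last_layer = g.number_of_nodes() - 1
    for _ in range(iters):
        new_frontier = []
        for sol in frontier:
            for i in range(len(sol) - 1):
                if not sol[i] or not sol[i + 1]:
                    continue                          # skip empty fragments (degenerate case)
                # candidate moves across the boundary of fragment i and fragment i+1
                for layer, src, dst in ((sol[i][-1], i, i + 1), (sol[i + 1][0], i + 1, i)):
                    if layer in (0, last_layer) or len(sol[src]) <= 1:
                        continue                      # first/last model layer never moves; keep fragments non-empty
                    t = move_count.get(layer, 0)
                    if random.random() > omega ** (-t):
                        continue                      # discard the move with probability 1 - r, r = omega^(-t)
                    new = copy.deepcopy(sol)
                    new[src].remove(layer)
                    if dst > src:
                        new[dst].insert(0, layer)     # last layer of fragment i moves before fragment i+1
                    else:
                        new[dst].append(layer)        # first layer of fragment i+1 moves after fragment i
                    move_count[layer] = t + 1
                    new_frontier.append(new)
                    if new not in final_set:          # collect only non-repeated solutions
                        final_set.append(new)
        frontier = new_frontier or frontier
    best = final_set[0]
    for cand in final_set[1:]:
        best = better(g, best, cand)                  # evaluation rule of step 2.2
    return best
```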
Step 3, pipeline parallel training.
Step 3.1, initialization. A thread is created for each GPU; each thread maintains an input queue input_queue and an output queue output_queue. The corresponding computing tasks are placed on the corresponding GPU and wait to be scheduled at the appropriate clock.
Step 3.2, equally dividing the input data of one batch into k micro-batches.
Step 3.3, determining the space-time diagram of the pipeline-parallel execution. Define F_{i,j} as the forward-propagation computing task of the i-th micro-batch on the j-th GPU, and B_{i,j} as the backward-propagation computing task of the i-th micro-batch on the j-th GPU. Fig. 3 (a) shows the space-time diagram of the forward-propagation computing tasks, and fig. 3 (b) shows that of the backward-propagation computing tasks. For example, after the forward-propagation task F_{4,4} is executed in clock 7, the backward-propagation task B_{4,4} is executed in clock 8.
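The forward part of this space-time diagram can be reproduced by a small helper; the clock indexing below is an assumption consistent with fig. 3 (a) and the F_{4,4} example above:

```python
def forward_schedule(num_micro_batches, num_gpus):
    """Map each forward clock to the (micro-batch i, GPU j) tasks F_{i,j} it executes."""
    schedule = {}
    for j in range(1, num_gpus + 1):
        for i in range(1, num_micro_batches + 1):
            clock = i + j - 1                         # F_{i,j} runs in clock i + j - 1
            schedule.setdefault(clock, []).append((i, j))
    return schedule

# e.g. forward_schedule(4, 4)[7] == [(4, 4)], i.e. F_{4,4} runs in clock 7 as stated above.
```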
Step 3.4, establishing the dependency relationships between computing tasks. Taking forward propagation as an example, on the current GPU a logical dependency is established from the computing task F_{i,j} of the current micro-batch to the computing task F_{i+1,j} of the next micro-batch. A dependency is also established from the computing task F_{i,j} of the current micro-batch on the current GPU to the computing task F_{i,j+1} on the next GPU, realized by copying the data in the output_queue of the current GPU to the next GPU. Backward propagation is handled in the same way.
Step 3.5, during scheduling, at the start of each clock all tasks belonging to that clock are submitted to the input_queue of the corresponding GPU; the outputs obtained by computation are then submitted to the output_queue of that GPU.
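A minimal sketch of steps 3.1-3.5 for the forward pass follows (one worker thread per GPU, queue-based hand-off between stages); the use of Python threads, standard queues and the .to() device copy are implementation assumptions, and backward propagation would be scheduled symmetrically as described in step 3.4:

```python
import threading
import queue

def gpu_worker(stage, device, input_queue, output_queue):
    """Per-GPU thread: take a task from input_queue, run the model fragment, push the result."""
    while True:
        task = input_queue.get()
        if task is None:                               # sentinel: no more work
            break
        micro_batch_id, x = task
        y = stage(x.to(device, non_blocking=True))     # copy activations to this GPU and compute
        output_queue.put((micro_batch_id, y))          # the next stage copies from here (step 3.4)

def run_forward_clocks(stages, devices, micro_batches):
    m = len(stages)
    in_qs = [queue.Queue() for _ in range(m)]
    out_qs = [queue.Queue() for _ in range(m)]
    threads = [threading.Thread(target=gpu_worker, args=(stages[j], devices[j], in_qs[j], out_qs[j]))
               for j in range(m)]
    for t in threads:
        t.start()
    schedule = forward_schedule(len(micro_batches), m)
    for clock in sorted(schedule):
        for i, j in schedule[clock]:                   # submit all tasks of this clock (step 3.5)
            x = micro_batches[i - 1] if j == 1 else out_qs[j - 2].get()[1]
            in_qs[j - 1].put((i, x))
    for q in in_qs:
        q.put(None)
    for t in threads:
        t.join()
    return [out_qs[-1].get() for _ in micro_batches]   # outputs of the last pipeline stage
```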
Step 3.6, after all micro-batches have completed one full forward propagation and backward propagation, the loss and gradients are computed and the parameters are updated.
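A minimal sketch of this update step is given below, shown sequentially for clarity rather than with the pipelined schedule above; the optimizer-per-fragment layout and the criterion argument are assumptions:

```python
def train_one_batch(stages, optimizers, micro_batches, targets, criterion):
    """Accumulate gradients over all micro-batches, then update parameters once (step 3.6)."""
    for opt in optimizers:
        opt.zero_grad()
    for x, y in zip(micro_batches, targets):
        out = x
        for stage in stages:                                     # forward through the model fragments
            out = out.to(next(stage.parameters()).device)        # move activations to the fragment's GPU
            out = stage(out)
        loss = criterion(out, y.to(out.device))
        loss.backward()                                          # gradients accumulate across micro-batches
    for opt in optimizers:
        opt.step()                                               # single parameter update per input batch
```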
Step 3.7, repeating steps 3.2-3.6 until the set number of rounds is reached, then outputting the trained neural network model.
Verification experiment:
(1) Experiment setting: the number of GPUs employed was 4. Three sets of neural network models and data sets are combined, namely a ResNet-50 model and a CIFAR-10 data set, a ResNet-101 model and a CIFAR-100 data set, and a ResNet-152 model and a Caltech-101 data set.
(2) Data initialization: the CIFAR-10 and CIFAR-100 datasets have an image size of 32×32, and the Caltech-101 dataset has an image size of 224×224.
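A minimal sketch of this data initialization, assuming torchvision datasets and standard transforms (the data root path and the absence of augmentation are assumptions):

```python
from torchvision import datasets, transforms

cifar_tf = transforms.ToTensor()                                   # CIFAR images are already 32x32
caltech_tf = transforms.Compose([
    transforms.Resize((224, 224)),                                 # Caltech-101 images resized to 224x224
    transforms.Lambda(lambda img: img.convert("RGB")),             # some Caltech-101 images are grayscale
    transforms.ToTensor(),
])

cifar10 = datasets.CIFAR10(root="./data", train=True, download=True, transform=cifar_tf)
cifar100 = datasets.CIFAR100(root="./data", train=True, download=True, transform=cifar_tf)
caltech101 = datasets.Caltech101(root="./data", download=True, transform=caltech_tf)
```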
(3) Model partitioning algorithm performance experiment: in the three experimental groups, on the basis of model parallelism, the optimized model partitioning algorithm proposed by the invention is compared with a partitioning algorithm based on theoretical computing power and a partitioning algorithm based on CUDA memory occupancy. The algorithm of the present invention completes one round of training faster than the other algorithms. In the ResNet-152 and Caltech-101 experimental group, the algorithm is 33.822 s faster than the partitioning algorithm based on theoretical computing power and 18.063 s faster than the partitioning algorithm based on CUDA memory occupancy, as shown in fig. 4.
(4) Optimal micro-batch number experiment: in the three experimental groups, on the basis of pipeline parallelism, the optimal number of micro-batches is searched for using the optimized model partitioning algorithm proposed by the invention so as to reduce the single-round training time. Experiments show that with the optimal number of micro-batches, the single-round training time is nearly half of that under model parallelism alone, as shown in fig. 5.
(5) Model convergence rate experiment: in the three experimental groups, pipeline parallelism with the optimal number of micro-batches is compared with single-GPU model training. Compared with single-GPU training, the single-round speed-up ratios of pipeline parallelism are 1.19, 1.60 and 1.88, respectively, as shown in the table below:
Experimental group | Single GPU (base)/s | Pipelined parallelism/s
ResNet-50 + CIFAR-10 | 30.123 | 25.316
ResNet-101 + CIFAR-100 | 57.328 | 35.789
ResNet-152 + Caltech-101 | 87.126 | 46.358
As the number of training rounds increases, the models converge to approximately equal accuracy, as shown in fig. 6 (a), (b) and (c).

Claims (8)

1. The neural network pipeline parallel training method for optimizing model division is characterized by comprising the following steps of:
step 1, modeling the neural network with a DAG graph, setting the weight of each vertex in the DAG graph to the theoretical computation time of the corresponding neural network layer, and setting the weight of each edge to the theoretical communication time between neural network layers;
step 2, partitioning the neural network model according to the DAG graph to obtain a group of solution sets, obtaining a relatively optimal model partitioning scheme using the evaluation algorithm for model partitioning schemes, and deploying the partitioned model on the corresponding GPUs;
step 3, equally dividing one input data batch into several micro-batches; after the previous GPU finishes computing the current micro-batch, its output is transmitted to the next GPU, which continues the forward propagation of the neural network while the previous GPU starts computing the next micro-batch;
step 4, updating the parameters of the neural network after all micro-batch data have completed the forward-propagation and backward-propagation computing tasks;
and step 5, repeating steps 3-4 until the set number of rounds is reached, thereby completing the training of the neural network.
2. The neural network pipeline parallel training method for optimizing model partitioning according to claim 1, wherein: the step 1 is specifically as follows:
step 1.1, modeling a DAG graph of a neural network, wherein the direction of an edge in the DAG graph represents the dependency relationship between layers of the neural network model;
step 1.2, weighting the vertices and edges of the original DAG graph, wherein the weight of a vertex is the theoretical computation time of the corresponding neural network model layer, calculated as:
tc_k = f_k / C, k = 1, 2, ..., n
wherein tc_k is the theoretical computation time of the neural network layer, f_k is its floating-point computation, n is the total number of layers in the neural network model, and C is the theoretical computing power of the GPU;
the weight of an edge in the DAG graph is the theoretical communication time between neural network model layers, calculated as:
ts_k = d_k / B
wherein ts_k is the theoretical communication time between the neural network layer and the next layer, d_k is the output tensor size of the neural network layer, and B is the theoretical bandwidth of the GPU.
3. The neural network pipeline parallel training method for optimizing model partitioning according to claim 2, wherein: the step 2 is specifically as follows:
step 2.1, according to the theoretical computation time tc_k of each layer of the neural network and the number m of GPUs, distributing the layers of the neural network approximately evenly across the GPUs so that the sums of the theoretical computation times of the model layers assigned to each GPU are approximately equal, thereby obtaining an initial model partitioning scheme as the initial solution set;
step 2.2, traversing the current solution set to obtain, for each solution, all model layers that can be moved between model fragments, calculating the movement probability of these movable layers, and finally determining the layers to move according to the movement probability;
step 2.3, moving the selected layers to obtain a new group of model partitioning schemes, using it as the solution set for the next iteration, and adding the solutions of the current round to the final solution set; repeating step 2.2 until the specified number of iterations is reached;
step 2.4, obtaining a final solution set, and obtaining an optimal solution by using an evaluation algorithm of the model division scheme;
and 2.5, dividing the neural network model according to the optimal solution, and sequentially deploying the model fragments on the corresponding GPUs.
4. A neural network pipeline parallel training method for optimizing model partitioning as claimed in claim 3, wherein: the calculation formula of the movement probability is as follows:
r = ω^(-t)
where r is the probability of movement, ω is a number greater than 1, and t represents the number of times the corresponding model layer has been moved.
5. The neural network pipeline parallel training method for optimizing model partitioning according to claim 3 or 4, wherein the evaluation algorithm for the model partitioning scheme is as follows: the load balance across the GPUs is evaluated by the variance of the theoretical computation times of the model partitions, the time required for communication during model training is measured by the global communication time, and the two indices are considered jointly to judge the quality of a model partitioning scheme.
6. The neural network pipeline parallel training method for optimizing model partitioning according to claim 5, wherein: if the difference between the variances of the theoretical computation times of two model partitioning schemes is smaller than a set threshold ε, their global communication times are further compared and the scheme with the smaller value is selected as the better scheme; if the difference between the variances of the theoretical computation times is larger than the set threshold ε, the scheme with the smaller variance of the theoretical computation time is selected as the better scheme.
7. The neural network pipeline parallel training method for optimizing model partitioning according to claim 1, wherein: the step 3 is specifically as follows:
step 3.1, starting the training of the model, dividing a data batch equally into several micro-batches, and creating a thread for each GPU, wherein each thread maintains an input queue input_queue and an output queue output_queue;
step 3.2, constructing the dependency relationships between computing tasks to facilitate task scheduling; constructing the dependency relationships between the computing tasks of different micro-batch data on the same GPU, where computation of the next micro-batch starts only after computation of the previous micro-batch on that GPU has completed; and constructing the dependency relationships of the same micro-batch data executed on different GPUs, where the previous GPU completes the computation of the current micro-batch, puts the output into its own output_queue, and the data in the output_queue is then copied to the next GPU.
8. The neural network pipeline parallel training method for optimizing model partitioning as claimed in claim 7, wherein, during the scheduling process: when each clock starts, all tasks in that clock are submitted to the input_queue of the corresponding GPU, the output is obtained through computation, and the output value is submitted to the output_queue of the GPU.
CN202310139664.5A 2023-02-21 2023-02-21 Neural network pipeline parallel training method for optimizing model division Pending CN116167436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310139664.5A CN116167436A (en) 2023-02-21 2023-02-21 Neural network pipeline parallel training method for optimizing model division

Publications (1)

Publication Number Publication Date
CN116167436A true CN116167436A (en) 2023-05-26

Family

ID=86414427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310139664.5A Pending CN116167436A (en) 2023-02-21 2023-02-21 Neural network pipeline parallel training method for optimizing model division

Country Status (1)

Country Link
CN (1) CN116167436A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093871A (en) * 2023-10-16 2023-11-21 之江实验室 Deep learning-oriented distributed training evaluation method and system
CN117093871B (en) * 2023-10-16 2024-02-13 之江实验室 Deep learning-oriented distributed training evaluation method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination