CN116167436A - Neural network pipeline parallel training method for optimizing model division

Neural network pipeline parallel training method for optimizing model division

Info

Publication number
CN116167436A
CN116167436A CN202310139664.5A
Authority
CN
China
Prior art keywords
model
neural network
gpu
calculation
theoretical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310139664.5A
Other languages
Chinese (zh)
Inventor
方熔翔
魏贵义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202310139664.5A priority Critical patent/CN116167436A/en
Publication of CN116167436A publication Critical patent/CN116167436A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 - Techniques for rebalancing the load in a distributed system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means


Abstract

The invention discloses a neural network pipeline parallel training method that optimizes model partitioning. The invention models the neural network as a DAG graph, derives the weights of the vertices and edges in the graph from the theoretical computation time and theoretical communication time of the model, and constructs a performance model. A solution set is obtained by iteratively searching model partitioning schemes, yielding the model partitioning scheme that minimizes the variance of the theoretical computation times among model fragments while also minimizing the global communication time. Under this model partitioning scheme, pipeline parallelism is introduced to further accelerate the training process of the model. By adopting the proposed model partitioning algorithm, the global communication time during neural network training is reduced while load balancing among computing devices is ensured.

Description

Neural network pipeline parallel training method for optimizing model division
Technical Field
The invention belongs to the field of neural network parallel training, and particularly relates to a neural network pipeline parallel training method for optimizing model division.
Background
With the advent of the big data era, the amount of data generated in daily life keeps growing. To process massive data, solve increasingly complex problems, and meet users' demands for highly accurate deep learning tasks, the number of layers in neural networks keeps increasing. Such large-scale neural network models greatly lengthen training time, so distributed parallel training methods are commonly used to accelerate model training.
Methods for distributed training of neural networks generally fall into three categories. In data parallelism, a large amount of data is split into small batches and sent to each computing device for computation, and the parameters are updated through some communication scheme after the computation completes. In model parallelism, a large model is divided into smaller sub-models that are deployed on the individual computing devices for training. Hybrid parallelism combines data parallelism and model parallelism.
Data parallelism requires that the memory of each computing device can hold the complete model, which limits scalability. When partitioning a model for model parallelism, traditional partitioning algorithms do not jointly consider the load balance of the computing devices and the overall communication time, and model parallelism alone yields low system throughput. How to design a distributed parallel training strategy for neural networks that keeps the computing tasks balanced across the hardware devices, reduces the global communication time as much as possible, and thereby accelerates neural network training is therefore an urgent problem in the field of neural network parallel training.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a neural network pipeline parallel training method for optimizing model division, which accelerates the training process of the neural network.
The technical scheme adopted by the invention is as follows:
The invention comprises the following steps:
Step 1, modeling the neural network with a DAG graph, setting the weight of each vertex in the DAG graph to the theoretical computation time of the corresponding neural network layer, and setting the weight of each edge to the theoretical communication time between neural network layers.
Step 2, partitioning the neural network model according to the DAG graph to obtain a group of solution sets, obtaining a relatively optimal model partitioning scheme using the evaluation algorithm for model partitioning schemes, and deploying the partitioned model on the corresponding GPUs.
Step 3, equally dividing one input data batch into several micro-batches; after the previous GPU finishes computing the current micro-batch, its output is transmitted to the next GPU, which continues the forward propagation of the neural network while the previous GPU starts computing the next micro-batch.
Step 4, updating the parameters of the neural network after all micro-batch data have completed the forward-propagation and backward-propagation computing tasks.
Step 5, repeating steps 3-4 until the set number of rounds is reached, thereby completing the training of the neural network.
Further, the step 1 specifically comprises the following steps:
step 1.1, modeling a DAG graph of the neural network, wherein the direction of the edge in the DAG graph represents the dependency relationship between the neural network model layers.
Step 1.2, weighting the vertices and edges of the original DAG graph, where the weight of a vertex is the theoretical computation time of the corresponding neural network model layer and the weight of an edge is the theoretical communication time between neural network model layers.
Further, the step 2 specifically comprises the following steps:
Step 2.1, according to the theoretical computation time tc_k of each layer of the neural network and the number m of GPUs, distributing the layers of the neural network approximately evenly across the GPUs so that the sums of the theoretical computation times of the model layers assigned to each GPU are approximately equal, thereby obtaining an initial model partitioning scheme as the initial solution set.
Step 2.2, traversing the current solution set to obtain, for each solution, all model layers that can be moved between model fragments, calculating the movement probability of these movable layers, and finally determining the layers to move according to the movement probability.
Step 2.3, moving the selected layers to obtain a new group of model partitioning schemes, using it as the solution set for the next iteration, and adding the solutions of the current round to the final solution set; repeating step 2.2 until the specified number of iterations is reached.
Step 2.4, obtaining the final solution set and selecting the optimal solution using the evaluation algorithm for model partitioning schemes.
Step 2.5, partitioning the neural network model according to the optimal solution and deploying the model fragments on the corresponding GPUs in order.
Further, the evaluation algorithm for model partitioning schemes is as follows: the load balance across the GPUs is evaluated by the variance of the theoretical computation times of the model partitions, the time required for communication during model training is measured by the global communication time, and the two indices are considered jointly to judge the quality of a model partitioning scheme.
Further, the step 3 specifically comprises the following steps:
Step 3.1, starting the training of the model, dividing a data batch equally into several micro-batches, and creating a thread for each GPU, where each thread maintains an input queue input_queue and an output queue output_queue.
Step 3.2, constructing the dependency relationships between computing tasks to facilitate task scheduling. Dependencies are built between the computing tasks of different micro-batches on the same GPU: on one GPU, computation of the next micro-batch starts only after computation of the previous micro-batch has completed. Dependencies are also built for the same micro-batch executed on different GPUs: the previous GPU completes the computation of the current micro-batch, puts the output into its own output_queue, and the data in the output_queue is then copied to the next GPU.
The invention has the beneficial effects that:
the invention models the neural network into a DAG graph, models the weight values of the vertexes and the edges in the graph by using the theoretical calculation time and the theoretical communication time of the model, and constructs a performance model. The solution set is obtained through iterative search of the model partitioning scheme, and the model partitioning scheme which can enable the variance of theoretical calculation time among model fragments to be minimum and simultaneously enable global communication time to be minimum is obtained. Under the model division scheme, a pipeline parallel technology is introduced to further accelerate the training process of the model.
By adopting the model partitioning algorithm, the global communication time during neural network training is reduced while load balancing among computing devices is ensured. Compared with the traditional model partitioning algorithm, the time required for training one round under the model parallel strategy is shorter. After the parallel of the pipeline is introduced, the total communication time is reduced while the bubble time in the pipeline is reduced, and better overlapping of calculation and communication is realized, so that the training of a model is further accelerated.
Drawings
FIG. 1 is a flow chart of the neural network pipeline parallel training of the optimization model partitioning of the present invention;
FIG. 2 is a diagram of a DAG modeled by a neural network of the present invention, wherein (a) is a non-weighted DAG diagram and (b) is a weighted DAG diagram;
FIG. 3 is a computational task space-time diagram of a neural network of the present invention, wherein (a) is a forward propagation space-time diagram and (b) is a reverse propagation space-time diagram;
FIG. 4 is a performance comparison graph of the model partitioning algorithms of the present invention;
FIG. 5 is a graph of the acceleration effect of the pipeline parallel algorithm of the present invention under different numbers of micro-batches;
FIG. 6 compares the training accuracy of the parallel algorithm of the present invention with single-GPU training, where (a) is the experiment with the ResNet-50 model and the CIFAR-10 dataset, (b) with the ResNet-101 model and the CIFAR-100 dataset, and (c) with the ResNet-152 model and the Caltech-101 dataset.
Detailed Description
The embodiment discloses a neural network pipeline parallel training method for optimizing model division, and a specific flow is shown in fig. 1.
The method mainly comprises three parts: the first part performs DAG graph modeling of the neural network to be trained, the second part partitions the neural network model, and the third part performs pipeline parallel training.
The specific steps are given in steps 1 to 3 below.
Step 1, modeling the neural network as a DAG graph.
Step 1.1, define the DAG graph as G = (V, E), where the vertex set V represents the set of all layers in the neural network L = {l_1, l_2, ..., l_n}, and the edge set E represents all topological relationships between layers. An example of the modeling is shown in fig. 2 (a).
Step 1.2, abstracting the theoretical computation time tc_k of each layer in the neural network into the weight of the corresponding vertex in the DAG graph.
The theoretical computation of the convolution layer is given by formula (1), that of the fully connected layer by formula (2), that of the BN layer by formula (3), and that of the pooling and ReLU layers by formula (4).
f_k = 2×b×K_k²×C_k_in×H_k_out×W_k_out×C_k_out (1)
f_k = 2×b×H_k_in×W_k_in×C_k_in×C_k_out (2)
f_k = 2×b×C_k_in×H_k_in×W_k_in (3)
f_k = b×C_k_in×H_k_in×W_k_in (4)
where K_k is the convolution kernel size of the k-th layer; C_k_in, H_k_in, W_k_in are the number of channels, the height and the width of the feature map input to the k-th layer; C_k_out, H_k_out, W_k_out are the number of channels, the height and the width of the feature map output by the k-th layer; and b is the input batch size.
The theoretical computation time of a neural network model layer is given by formula (5), where C is the theoretical computing power of the GPU and n is the total number of layers in the model.
tc_k = f_k / C, k = 1, 2, ..., n (5)
where tc_k is the theoretical computation time of the k-th neural network layer and f_k is its floating-point computation (GFLOPs).
The layer-to-layer communication time is abstracted into the weight ts_k of the corresponding edge in the DAG graph. The output tensor size of a model layer is obtained from formula (6). The theoretical communication time between layer k and layer k+1 of the model is given by formula (7), where B is the theoretical bandwidth of the GPU. The resulting weighted DAG graph is shown in fig. 2 (b).
d_k = b×C_k_out×H_k_out×W_k_out (6)
ts_k = d_k / B (7)
where ts_k is the theoretical communication time between the neural network layer and the next layer, and d_k is the output tensor size of that layer.
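As an illustration of how formulas (1)-(7) translate into a weighted DAG, a minimal sketch is given below. It assumes a simple chain topology and a dictionary-based layer description (field names such as "C_in" or "kernel" are illustrative assumptions, not part of the patent); networkx is used only as a convenient DAG container.

```python
import networkx as nx

def layer_flops(layer, b):
    """Theoretical floating-point computation f_k of one layer, per formulas (1)-(4)."""
    C_in, H_in, W_in = layer["C_in"], layer["H_in"], layer["W_in"]
    C_out, H_out, W_out = layer["C_out"], layer["H_out"], layer["W_out"]
    if layer["type"] == "conv":
        K = layer["kernel"]
        return 2 * b * K * K * C_in * H_out * W_out * C_out   # formula (1)
    if layer["type"] == "fc":
        return 2 * b * H_in * W_in * C_in * C_out             # formula (2)
    if layer["type"] == "bn":
        return 2 * b * C_in * H_in * W_in                     # formula (3)
    return b * C_in * H_in * W_in                             # pooling / ReLU, formula (4)

def build_weighted_dag(layers, b, C_gpu, B_gpu):
    """Vertex weight tc_k = f_k / C (formula (5)); edge weight ts_k = d_k / B (formulas (6)-(7))."""
    g = nx.DiGraph()
    for k, layer in enumerate(layers):
        g.add_node(k, tc=layer_flops(layer, b) / C_gpu)
        if k > 0:
            prev = layers[k - 1]
            d_prev = b * prev["C_out"] * prev["H_out"] * prev["W_out"]  # formula (6)
            g.add_edge(k - 1, k, ts=d_prev / B_gpu)
    return g
```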
Step 2, partitioning the neural network model.
Step 2.1, defining the evaluation indices of a model partitioning scheme. After the model is partitioned, a sequence of model fragments P = {P_1, P_2, ..., P_m} is obtained, where each model fragment is a set of layers of the neural network forming a contiguous subset of the original model layers. Summing formula (5) over the layers of each fragment gives the theoretical computation times Tc_1, Tc_2, ..., Tc_m of the fragments. The variance of the theoretical computation times between model fragments is given by formula (8), where Tc_avg is the mean of the theoretical computation times of the model fragments.
σ² = (1/m)·Σ_{i=1..m} (Tc_i - Tc_avg)² (8)
The theoretical communication times Ts_1, Ts_2, ..., Ts_{m-1} between adjacent model fragments are obtained from formula (7). The global communication time is given by formula (9).
T_comm_total = Σ_{i=1..m-1} Ts_i (9)
Step 2.2, evaluation algorithm for model partitioning schemes. Assume there is a model partitioning scheme P_a whose model fragments have theoretical-computation-time variance σ_a and global communication time T_comm_total_a, and another model partitioning scheme P_b with variance σ_b and global communication time T_comm_total_b.
If the difference between σ_a and σ_b is smaller than a given threshold ε, the two schemes differ little in variance, and the scheme with the shorter global communication time is selected. If the difference between σ_a and σ_b is larger than the threshold ε, the variances differ significantly, and the scheme with the smaller variance is selected.
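A minimal sketch of this evaluation rule follows, assuming the weighted DAG g from the sketch above and a partition represented as a list of fragments, each fragment being a non-empty, contiguous list of layer indices (the function names and the default threshold are assumptions):

```python
def partition_metrics(g, partition):
    """Variance of fragment computation times (formula (8)) and global communication time (formula (9))."""
    tc = [sum(g.nodes[k]["tc"] for k in frag) for frag in partition]
    tc_avg = sum(tc) / len(tc)
    variance = sum((t - tc_avg) ** 2 for t in tc) / len(tc)
    # only the edges cut between adjacent fragments contribute to global communication time
    comm_total = sum(g.edges[frag[-1], frag[-1] + 1]["ts"] for frag in partition[:-1])
    return variance, comm_total

def better(g, p_a, p_b, eps=1e-3):
    """Return the better of two partitioning schemes according to the rule of step 2.2."""
    var_a, comm_a = partition_metrics(g, p_a)
    var_b, comm_b = partition_metrics(g, p_b)
    if abs(var_a - var_b) < eps:                  # variances close: prefer less communication
        return p_a if comm_a <= comm_b else p_b
    return p_a if var_a < var_b else p_b          # otherwise prefer the smaller variance
```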
Step 2.3, generating the initial model partitioning scheme. First, the list of per-layer theoretical computation times is accumulated element by element to obtain a list list_flow, whose k-th entry is the sum of the theoretical computation times of the current layer and all preceding layers. The values in the list are then normalized to the range [0, 1], suppressing the influence of outlier time values, giving the normalized list list_flow_norm. Next, using the number of GPUs m, a step size of 1/m is taken and the layers are assigned to model fragments in order. For example, with m = 4, a layer whose normalized value lies in [0, 0.25) is assigned to the first model fragment, a layer whose value lies in [0.25, 0.50) to the second, and so on. This yields an initial model partitioning scheme in which the fragments have approximately equal theoretical computation time.
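A minimal sketch of this initial partitioning is shown below (tc_list holds the per-layer theoretical computation times tc_k; clamping the final layer into the last fragment is an implementation assumption):

```python
def initial_partition(tc_list, m):
    """Assign layers to m fragments by normalized cumulative theoretical computation time (step 2.3)."""
    list_flow, acc = [], 0.0
    for tc in tc_list:                                  # element-wise accumulation
        acc += tc
        list_flow.append(acc)
    total = list_flow[-1]
    list_flow_norm = [v / total for v in list_flow]     # normalize into [0, 1]
    fragments = [[] for _ in range(m)]
    for k, v in enumerate(list_flow_norm):
        idx = min(int(v * m), m - 1)                    # step size 1/m; clamp the final layer
        fragments[idx].append(k)
    return fragments
```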
Step 2.4, generating the solution set of model partitioning schemes and obtaining the optimal solution. The initial model partitioning scheme is added to the current solution set. The solution set is traversed, and for each solution all movable model layers are determined. Apart from the first and last layers of the original model, which are immovable, the first layer of a model fragment may be moved forward to follow the last layer of the previous model fragment, and the last layer of a model fragment may be moved backward to precede the first layer of the next model fragment. To reduce the search space, the movement probability of each layer is defined as r = ω^(-t), where ω is a value greater than 1 and t is the number of times the corresponding layer has already been moved; the more often a layer has been moved, the more likely the move is discarded. New solutions are then generated by moving the selected layers and added to a new solution set, and the move count of each moved layer is increased by 1. The new solution set serves as the solution set to be traversed in the next iteration. Finally, the non-repeated solutions newly generated in each iteration are collected to obtain the final solution set, from which the optimal solution, i.e. the relatively optimal model partitioning scheme, is obtained using the evaluation algorithm of step 2.2.
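A minimal sketch of this iterative search is given below, reusing the weighted DAG g, initial_partition() and better() from the earlier sketches; the bookkeeping of move counts and the iteration budget are assumptions rather than the patent's exact procedure:

```python
import copy
import random

def search_best_partition(g, m, iters=20, omega=2.0):
    """Iteratively move boundary layers between adjacent fragments and keep the best partition (step 2.4)."""
    tc_list = [g.nodes[k]["tc"] for k in sorted(g.nodes)]
    init = initial_partition(tc_list, m)
    move_count = {}                                   # t: how often each layer has already moved
    frontier, final_set = [init], [init]
    last_layer = g.number_of_nodes() - 1
    for _ in range(iters):
        new_frontier = []
        for sol in frontier:
            for i in range(len(sol) - 1):
                if not sol[i] or not sol[i + 1]:
                    continue                          # skip empty fragments (degenerate case)
                # candidate moves across the boundary of fragment i and fragment i+1
                for layer, src, dst in ((sol[i][-1], i, i + 1), (sol[i + 1][0], i + 1, i)):
                    if layer in (0, last_layer) or len(sol[src]) <= 1:
                        continue                      # first/last model layer never moves; keep fragments non-empty
                    t = move_count.get(layer, 0)
                    if random.random() > omega ** (-t):
                        continue                      # discard the move with probability 1 - r, r = omega^(-t)
                    new = copy.deepcopy(sol)
                    new[src].remove(layer)
                    if dst > src:
                        new[dst].insert(0, layer)     # last layer of fragment i moves before fragment i+1
                    else:
                        new[dst].append(layer)        # first layer of fragment i+1 moves after fragment i
                    move_count[layer] = t + 1
                    new_frontier.append(new)
                    if new not in final_set:          # collect only non-repeated solutions
                        final_set.append(new)
        frontier = new_frontier or frontier
    best = final_set[0]
    for cand in final_set[1:]:
        best = better(g, best, cand)                  # evaluation rule of step 2.2
    return best
```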
Step 3, pipeline parallel training.
Step 3.1, initialization. A thread is created for each GPU; each thread maintains an input queue input_queue and an output queue output_queue. The corresponding computing tasks are placed on the corresponding GPU and wait to be scheduled at the appropriate clock.
Step 3.2, equally dividing the input data of one batch into k micro-batches.
Step 3.3, determining the space-time diagram of the pipeline-parallel execution. Define F_{i,j} as the forward-propagation computing task of the i-th micro-batch on the j-th GPU, and B_{i,j} as the backward-propagation computing task of the i-th micro-batch on the j-th GPU. Fig. 3 (a) shows the space-time diagram of the forward-propagation computing tasks, and fig. 3 (b) shows that of the backward-propagation computing tasks. For example, after the forward-propagation task F_{4,4} is executed in clock 7, the backward-propagation task B_{4,4} is executed in clock 8.
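The forward part of this space-time diagram can be reproduced by a small helper; the clock indexing below is an assumption consistent with fig. 3 (a) and the F_{4,4} example above:

```python
def forward_schedule(num_micro_batches, num_gpus):
    """Map each forward clock to the (micro-batch i, GPU j) tasks F_{i,j} it executes."""
    schedule = {}
    for j in range(1, num_gpus + 1):
        for i in range(1, num_micro_batches + 1):
            clock = i + j - 1                         # F_{i,j} runs in clock i + j - 1
            schedule.setdefault(clock, []).append((i, j))
    return schedule

# e.g. forward_schedule(4, 4)[7] == [(4, 4)], i.e. F_{4,4} runs in clock 7 as stated above.
```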
Step 3.4, establishing the dependency relationships between computing tasks. Taking forward propagation as an example, on the current GPU a logical dependency is established from the computing task F_{i,j} of the current micro-batch to the computing task F_{i+1,j} of the next micro-batch. A dependency is also established from the computing task F_{i,j} of the current micro-batch on the current GPU to the computing task F_{i,j+1} on the next GPU, realized by copying the data in the output_queue of the current GPU to the next GPU. Backward propagation is handled in the same way.
Step 3.5, during scheduling, at the start of each clock all tasks belonging to that clock are submitted to the input_queue of the corresponding GPU; the outputs obtained by computation are then submitted to the output_queue of that GPU.
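A minimal sketch of steps 3.1-3.5 for the forward pass follows (one worker thread per GPU, queue-based hand-off between stages); the use of Python threads, standard queues and the .to() device copy are implementation assumptions, and backward propagation would be scheduled symmetrically as described in step 3.4:

```python
import threading
import queue

def gpu_worker(stage, device, input_queue, output_queue):
    """Per-GPU thread: take a task from input_queue, run the model fragment, push the result."""
    while True:
        task = input_queue.get()
        if task is None:                               # sentinel: no more work
            break
        micro_batch_id, x = task
        y = stage(x.to(device, non_blocking=True))     # copy activations to this GPU and compute
        output_queue.put((micro_batch_id, y))          # the next stage copies from here (step 3.4)

def run_forward_clocks(stages, devices, micro_batches):
    m = len(stages)
    in_qs = [queue.Queue() for _ in range(m)]
    out_qs = [queue.Queue() for _ in range(m)]
    threads = [threading.Thread(target=gpu_worker, args=(stages[j], devices[j], in_qs[j], out_qs[j]))
               for j in range(m)]
    for t in threads:
        t.start()
    schedule = forward_schedule(len(micro_batches), m)
    for clock in sorted(schedule):
        for i, j in schedule[clock]:                   # submit all tasks of this clock (step 3.5)
            x = micro_batches[i - 1] if j == 1 else out_qs[j - 2].get()[1]
            in_qs[j - 1].put((i, x))
    for q in in_qs:
        q.put(None)
    for t in threads:
        t.join()
    return [out_qs[-1].get() for _ in micro_batches]   # outputs of the last pipeline stage
```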
Step 3.6, after all micro-batches have completed one full forward propagation and backward propagation, the loss and gradients are computed and the parameters are updated.
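A minimal sketch of this update step is given below, shown sequentially for clarity rather than with the pipelined schedule above; the optimizer-per-fragment layout and the criterion argument are assumptions:

```python
def train_one_batch(stages, optimizers, micro_batches, targets, criterion):
    """Accumulate gradients over all micro-batches, then update parameters once (step 3.6)."""
    for opt in optimizers:
        opt.zero_grad()
    for x, y in zip(micro_batches, targets):
        out = x
        for stage in stages:                                     # forward through the model fragments
            out = out.to(next(stage.parameters()).device)        # move activations to the fragment's GPU
            out = stage(out)
        loss = criterion(out, y.to(out.device))
        loss.backward()                                          # gradients accumulate across micro-batches
    for opt in optimizers:
        opt.step()                                               # single parameter update per input batch
```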
Step 3.7, repeating steps 3.2-3.6 until the set number of rounds is reached, then outputting the trained neural network model.
Verification experiment:
(1) Experiment setting: the number of GPUs employed was 4. Three sets of neural network models and data sets are combined, namely a ResNet-50 model and a CIFAR-10 data set, a ResNet-101 model and a CIFAR-100 data set, and a ResNet-152 model and a Caltech-101 data set.
(2) Data initialization: the CIFAR-10 and CIFAR-100 datasets have an image size of 32×32, and the Caltech-101 dataset has an image size of 224×224.
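A minimal sketch of this data initialization, assuming torchvision datasets and standard transforms (the data root path and the absence of augmentation are assumptions):

```python
from torchvision import datasets, transforms

cifar_tf = transforms.ToTensor()                                   # CIFAR images are already 32x32
caltech_tf = transforms.Compose([
    transforms.Resize((224, 224)),                                 # Caltech-101 images resized to 224x224
    transforms.Lambda(lambda img: img.convert("RGB")),             # some Caltech-101 images are grayscale
    transforms.ToTensor(),
])

cifar10 = datasets.CIFAR10(root="./data", train=True, download=True, transform=cifar_tf)
cifar100 = datasets.CIFAR100(root="./data", train=True, download=True, transform=cifar_tf)
caltech101 = datasets.Caltech101(root="./data", download=True, transform=caltech_tf)
```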
(3) Model partitioning algorithm performance experiment: in the three experimental groups, on the basis of model parallelism, the optimized model partitioning algorithm proposed by the invention is compared with a partitioning algorithm based on theoretical computing power and a partitioning algorithm based on CUDA memory occupancy. The algorithm of the present invention completes one round of training faster than the other algorithms. In the ResNet-152 and Caltech-101 experimental group, the algorithm is 33.822 s faster than the partitioning algorithm based on theoretical computing power and 18.063 s faster than the partitioning algorithm based on CUDA memory occupancy, as shown in fig. 4.
(4) Optimal micro-batch number experiment: in the three experimental groups, on the basis of pipeline parallelism, the optimal number of micro-batches is searched for using the optimized model partitioning algorithm proposed by the invention so as to reduce the single-round training time. Experiments show that with the optimal number of micro-batches, the single-round training time is nearly half of that under model parallelism alone, as shown in fig. 5.
(5) Model convergence rate experiment: in the three experimental groups, pipeline parallelism with the optimal number of micro-batches is compared with single-GPU model training. Compared with single-GPU training, the single-round speed-up ratios of pipeline parallelism are 1.19, 1.60 and 1.88, respectively, as shown in the table below:
Experimental group | Single GPU (base)/s | Pipelined parallelism/s
ResNet-50 + CIFAR-10 | 30.123 | 25.316
ResNet-101 + CIFAR-100 | 57.328 | 35.789
ResNet-152 + Caltech-101 | 87.126 | 46.358
As the number of training rounds increases, the models converge to approximately equal accuracy, as shown in fig. 6 (a), (b) and (c).

Claims (8)

1. The neural network pipeline parallel training method for optimizing model division is characterized by comprising the following steps of:
step 1, modeling the neural network with a DAG graph, setting the weight of each vertex in the DAG graph to the theoretical computation time of the corresponding neural network layer, and setting the weight of each edge to the theoretical communication time between neural network layers;
step 2, partitioning the neural network model according to the DAG graph to obtain a group of solution sets, obtaining a relatively optimal model partitioning scheme using the evaluation algorithm for model partitioning schemes, and deploying the partitioned model on the corresponding GPUs;
step 3, equally dividing one input data batch into several micro-batches; after the previous GPU finishes computing the current micro-batch, its output is transmitted to the next GPU, which continues the forward propagation of the neural network while the previous GPU starts computing the next micro-batch;
step 4, updating the parameters of the neural network after all micro-batch data have completed the forward-propagation and backward-propagation computing tasks;
and step 5, repeating steps 3-4 until the set number of rounds is reached, thereby completing the training of the neural network.
2. The neural network pipeline parallel training method for optimizing model partitioning according to claim 1, wherein: the step 1 is specifically as follows:
step 1.1, modeling a DAG graph of a neural network, wherein the direction of an edge in the DAG graph represents the dependency relationship between layers of the neural network model;
step 1.2, weighting the vertices and edges of the original DAG graph, wherein the weight of a vertex is the theoretical computation time of the corresponding neural network model layer, calculated as:
tc_k = f_k / C, k = 1, 2, ..., n
wherein tc_k is the theoretical computation time of the neural network layer, f_k is its floating-point computation, n is the total number of layers in the neural network model, and C is the theoretical computing power of the GPU;
the weight of an edge in the DAG graph is the theoretical communication time between neural network model layers, calculated as:
ts_k = d_k / B
wherein ts_k is the theoretical communication time between the neural network layer and the next layer, d_k is the output tensor size of the neural network layer, and B is the theoretical bandwidth of the GPU.
3. The neural network pipeline parallel training method for optimizing model partitioning according to claim 2, wherein: the step 2 is specifically as follows:
step 2.1, according to the theoretical computation time tc_k of each layer of the neural network and the number m of GPUs, distributing the layers of the neural network approximately evenly across the GPUs so that the sums of the theoretical computation times of the model layers assigned to each GPU are approximately equal, thereby obtaining an initial model partitioning scheme as the initial solution set;
step 2.2, traversing the current solution set to obtain, for each solution, all model layers that can be moved between model fragments, calculating the movement probability of these movable layers, and finally determining the layers to move according to the movement probability;
step 2.3, moving the selected layers to obtain a new group of model partitioning schemes, using it as the solution set for the next iteration, and adding the solutions of the current round to the final solution set; repeating step 2.2 until the specified number of iterations is reached;
step 2.4, obtaining a final solution set, and obtaining an optimal solution by using an evaluation algorithm of the model division scheme;
and 2.5, dividing the neural network model according to the optimal solution, and sequentially deploying the model fragments on the corresponding GPUs.
4. A neural network pipeline parallel training method for optimizing model partitioning as claimed in claim 3, wherein: the calculation formula of the movement probability is as follows:
r = ω^(-t)
where r is the probability of movement, ω is a number greater than 1, and t represents the number of times the corresponding model layer has been moved.
5. The neural network pipeline parallel training method for optimizing model partitioning according to claim 3 or 4, wherein the evaluation algorithm for the model partitioning scheme is as follows: the load balance across the GPUs is evaluated by the variance of the theoretical computation times of the model partitions, the time required for communication during model training is measured by the global communication time, and the two indices are considered jointly to judge the quality of a model partitioning scheme.
6. The neural network pipeline parallel training method for optimizing model partitioning according to claim 5, wherein: if the difference between the variances of the theoretical computation times of two model partitioning schemes is smaller than a set threshold ε, their global communication times are further compared and the scheme with the smaller value is selected as the better scheme; if the difference between the variances of the theoretical computation times is larger than the set threshold ε, the scheme with the smaller variance of the theoretical computation time is selected as the better scheme.
7. The neural network pipeline parallel training method for optimizing model partitioning according to claim 1, wherein: the step 3 is specifically as follows:
step 3.1, starting the training of the model, dividing a data batch equally into several micro-batches, and creating a thread for each GPU, wherein each thread maintains an input queue input_queue and an output queue output_queue;
step 3.2, constructing the dependency relationships between computing tasks to facilitate task scheduling; constructing the dependency relationships between the computing tasks of different micro-batch data on the same GPU, where computation of the next micro-batch starts only after computation of the previous micro-batch on that GPU has completed; and constructing the dependency relationships of the same micro-batch data executed on different GPUs, where the previous GPU completes the computation of the current micro-batch, puts the output into its own output_queue, and the data in the output_queue is then copied to the next GPU.
8. The neural network pipeline parallel training method for optimizing model partitioning as claimed in claim 7, wherein, during the scheduling process: when each clock starts, all tasks in that clock are submitted to the input_queue of the corresponding GPU, the output is obtained through computation, and the output value is submitted to the output_queue of the GPU.
CN202310139664.5A 2023-02-21 2023-02-21 Neural network pipeline parallel training method for optimizing model division Pending CN116167436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310139664.5A CN116167436A (en) 2023-02-21 2023-02-21 Neural network pipeline parallel training method for optimizing model division

Publications (1)

Publication Number Publication Date
CN116167436A true CN116167436A (en) 2023-05-26

Family

ID=86414427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310139664.5A Pending CN116167436A (en) 2023-02-21 2023-02-21 Neural network pipeline parallel training method for optimizing model division

Country Status (1)

Country Link
CN (1) CN116167436A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093871A (en) * 2023-10-16 2023-11-21 之江实验室 Deep learning-oriented distributed training evaluation method and system
CN117093871B (en) * 2023-10-16 2024-02-13 之江实验室 Deep learning-oriented distributed training evaluation method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination