CN116167436A - Neural network pipeline parallel training method for optimizing model division - Google Patents
- Publication number
- CN116167436A (application CN202310139664.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- neural network
- gpu
- calculation
- theoretical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a neural network pipeline parallel training method for optimizing model division. The invention models the neural network into a DAG graph, models the weight values of the vertexes and the edges in the graph by using the theoretical calculation time and the theoretical communication time of the model, and constructs a performance model. The solution set is obtained through iterative search of the model partitioning scheme, and the model partitioning scheme which can enable the variance of theoretical calculation time among model fragments to be minimum and simultaneously enable global communication time to be minimum is obtained. Under the model division scheme, a pipeline parallel technology is introduced to further accelerate the training process of the model. By adopting the model partitioning algorithm, the global communication time during neural network training is reduced while load balancing among computing devices is ensured.
Description
Technical Field
The invention belongs to the field of neural network parallel training, and particularly relates to a neural network pipeline parallel training method for optimizing model division.
Background
With the advent of the big data era, the volume of data generated in daily life keeps growing. To process massive data, solve increasingly complex problems, and meet users' demand for high accuracy in deep learning tasks, neural networks keep getting deeper. Large-scale neural network models greatly increase training time, so distributed parallel training is commonly used to accelerate model training.
Distributed training methods for neural networks generally fall into three categories. Data parallelism splits a large amount of data into small batches, sends them to the computing devices for calculation, and updates the parameters through some communication scheme once the calculation completes. Model parallelism divides a large model into smaller sub-models and deploys them on the computing devices for training. Hybrid parallelism combines data parallelism and model parallelism.
Data parallelism requires the memory of each computing device to hold the complete model, so it scales poorly. When a model is divided for model parallelism, traditional partitioning algorithms do not jointly consider the load balance of the computing devices and the overall communication time. Training with model parallelism alone also yields low system throughput. Therefore, how to use a distributed parallel training strategy that balances the computing load across hardware devices, reduces global communication time as far as possible, and accelerates neural network training is a problem that urgently needs to be solved in the field of neural network parallel training.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a neural network pipeline parallel training method for optimizing model division, and accelerates the training process of the neural network.
The technical scheme adopted by the invention is as follows:
The invention comprises the following steps:
Step 1, modeling the neural network model as a DAG, setting the weight of each vertex in the DAG to the theoretical calculation time of the corresponding neural network layer, and setting the weight of each edge to the theoretical communication time between neural network layers.
Step 2, dividing the neural network model according to the DAG to obtain a set of candidate solutions, selecting a relatively optimal model partitioning scheme with the evaluation algorithm for model partitioning schemes, and deploying the divided model on the corresponding GPUs.
Step 3, equally dividing one input data batch into a plurality of micro-batches. After the previous GPU finishes the calculation of the current micro-batch, it transmits its output to the next GPU, which continues the forward-propagation calculation of the neural network, while the previous GPU starts the calculation of the next micro-batch.
Step 4, updating the parameters of the neural network after all micro-batches have completed the forward-propagation and back-propagation calculation tasks of the neural network.
Step 5, repeating steps 3 and 4 until the set number of rounds is reached, which completes the training of the neural network.
Further, step 1 is specifically as follows:
Step 1.1, modeling the neural network as a DAG, wherein the direction of each edge in the DAG represents the dependency relationship between neural network model layers.
Step 1.2, assigning weights to the vertices and edges of the original DAG, wherein the weight of a vertex is the theoretical calculation time of the corresponding neural network model layer, and the weight of an edge is the theoretical communication time between the neural network model layers it connects.
Further, step 2 is specifically as follows:
Step 2.1, according to the theoretical calculation time $tc_k$ of each layer of the neural network and the number m of GPUs, dividing the layers of the neural network approximately evenly among the GPUs, so that the sums of the theoretical calculation times of the model layers assigned to each GPU are approximately equal, and taking the resulting initial model partitioning scheme as the initial solution set.
Step 2.2, traversing the current solution set to obtain, for each solution, all model layers that can be moved between model fragments, calculating the movement probability of each movable layer, and determining the layers to be moved according to the movement probability.
Step 2.3, moving the selected layers to obtain a new set of model partitioning schemes, taking this set as the solution set of the next iteration, and adding the solution set of the current round to the final solution set. Step 2.2 is repeated until the specified number of iterations is reached.
Step 2.4, obtaining the final solution set and selecting the optimal solution with the evaluation algorithm for model partitioning schemes.
Step 2.5, dividing the neural network model according to the optimal solution and deploying the model fragments in order on the corresponding GPUs.
Further, the evaluation algorithm for model partitioning schemes is specifically as follows: the variance of the theoretical calculation time across model fragments is used to evaluate the load balance of the GPUs, the global communication time is used to measure the time required for communication during model training, and the two indices are considered together to determine the quality of a model partitioning scheme.
Further, step 3 is specifically as follows:
Step 3.1, starting the training of the model, equally dividing a data batch into a plurality of micro-batches, and creating a thread for each GPU, wherein each thread has an input queue input_queue and an output queue output_queue.
Step 3.2, constructing the dependency relationships between computing tasks to facilitate task scheduling. First, the dependency between the computing tasks of different micro-batches on the same GPU is constructed: on the same GPU, the calculation of the next micro-batch starts only after the calculation of the previous micro-batch is completed. Second, the dependency of the same micro-batch executed on different GPUs is constructed: the previous GPU completes the calculation of the current micro-batch and puts the output into its own output_queue, and the data in output_queue is then copied to the next GPU.
The beneficial effects of the invention are as follows:
The invention models the neural network as a DAG, sets the weights of the vertices and edges in the graph from the theoretical calculation time and the theoretical communication time of the model, and thereby constructs a performance model. A solution set is obtained through iterative search over model partitioning schemes, yielding the scheme that minimizes the variance of the theoretical calculation time across model fragments while also minimizing the global communication time. Under this model partitioning scheme, pipeline parallelism is introduced to further accelerate the training of the model.
With this model partitioning algorithm, the global communication time during neural network training is reduced while load balance among computing devices is maintained. Compared with traditional model partitioning algorithms, one round of training under the model parallel strategy takes less time. Once pipeline parallelism is introduced, the bubble time in the pipeline and the total communication time are both reduced, and calculation and communication overlap better, which further accelerates the training of the model.
Drawings
FIG. 1 is a flow chart of the neural network pipeline parallel training with optimized model division according to the invention;
FIG. 2 shows the DAG modeled from a neural network according to the invention, wherein (a) is the unweighted DAG and (b) is the weighted DAG;
FIG. 3 is a space-time diagram of the computing tasks of the neural network according to the invention, wherein (a) is the forward-propagation space-time diagram and (b) is the back-propagation space-time diagram;
FIG. 4 compares the performance of the model partitioning algorithm of the invention with other partitioning algorithms;
FIG. 5 shows the acceleration achieved by the parallel algorithm of the invention with different numbers of micro-batches;
FIG. 6 compares the training accuracy of the parallel algorithm of the invention with single-GPU training, wherein (a) is the experiment with the ResNet-50 model and the CIFAR-10 dataset, (b) is the experiment with the ResNet-101 model and the CIFAR-100 dataset, and (c) is the experiment with the ResNet-152 model and the Caltech-101 dataset.
Detailed Description
The embodiment discloses a neural network pipeline parallel training method for optimizing model division, and a specific flow is shown in fig. 1.
The method mainly comprises three parts, wherein the first part is used for carrying out DAG graph modeling on a neural network to be trained, the second part is used for dividing the neural network model, and the third part is used for carrying out pipeline parallel training.
The specific steps are given in steps 1 to 3 below.
Step 1, modeling the neural network as a DAG.
Step 1.1, defining the DAG as $G = (V, E)$, where the vertex set $V$ represents the set $L = \{l_1, l_2, \dots, l_n\}$ of all layers in the neural network and the edge set $E$ represents all topological relationships between layers. A modeling example is shown in FIG. 2 (a).
Step 1.2, abstracting the theoretical calculation time of each layer in the neural network into the weight of the corresponding vertex in the DAG. The theoretical calculation amount of a convolution layer is given by formula (1), that of a fully connected layer by formula (2), that of a BN layer by formula (3), and that of pooling and ReLU layers by formula (4).
$$f_k = 2 \times b \times K_k^2 \times C_{k\_in} \times C_{k\_out} \times H_{k\_out} \times W_{k\_out} \tag{1}$$
$$f_k = 2 \times b \times H_{k\_in} \times W_{k\_in} \times C_{k\_in} \times C_{k\_out} \tag{2}$$
$$f_k = 2 \times b \times C_{k\_in} \times H_{k\_in} \times W_{k\_in} \tag{3}$$
$$f_k = b \times C_{k\_in} \times H_{k\_in} \times W_{k\_in} \tag{4}$$
where $K_k$ is the convolution kernel size of the $k$-th layer; $C_{k\_in}$, $H_{k\_in}$, and $W_{k\_in}$ are the number of channels, height, and width of the input feature map of the $k$-th layer; $C_{k\_out}$, $H_{k\_out}$, and $W_{k\_out}$ are the number of channels, height, and width of the output feature map of the $k$-th layer; and $b$ is the input batch size.
The theoretical calculation time of a neural network model layer is given by formula (5), where $C$ is the theoretical computing power of the GPU and $n$ is the total number of layers in the model.
$$tc_k = \frac{f_k}{C}, \quad k = 1, 2, \dots, n \tag{5}$$
where $tc_k$ is the theoretical calculation time of the $k$-th neural network layer and $f_k$ is its floating-point calculation amount (GFLOPs).
The layer-to-layer communication time is abstracted into the weights of the edges in the DAG. The output tensor size of a model layer is obtained from formula (6). The theoretical communication time between layer $k$ and layer $k+1$ of the model is given by formula (7), where $B$ is the theoretical bandwidth of the GPU. The resulting weighted DAG is shown in FIG. 2 (b).
$$d_k = b \times C_{k\_out} \times H_{k\_out} \times W_{k\_out} \tag{6}$$
$$ts_k = \frac{d_k}{B} \tag{7}$$
where $ts_k$ is the theoretical communication time between the $k$-th neural network layer and the next layer, and $d_k$ is the output tensor size of the $k$-th neural network layer.
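The weights of step 1.2 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the `Layer` class, its field names, and the function names are hypothetical, the convolution estimate is the commonly used one and is an assumption here, and the caller is assumed to supply `peak_flops` and `bandwidth` in units consistent with the calculation amount and the tensor size.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Shape description of one layer (all field names are illustrative)."""
    kind: str          # "conv", "fc", "bn", or "pool_relu"
    c_in: int
    h_in: int
    w_in: int
    c_out: int
    h_out: int
    w_out: int
    kernel: int = 1    # square kernel size, used by "conv" only


def flops(layer: Layer, b: int) -> int:
    """Theoretical calculation amount f_k for batch size b."""
    if layer.kind == "conv":
        # Assumed standard convolution estimate.
        return (2 * b * layer.kernel ** 2 * layer.c_in * layer.c_out
                * layer.h_out * layer.w_out)
    if layer.kind == "fc":
        return 2 * b * layer.h_in * layer.w_in * layer.c_in * layer.c_out
    if layer.kind == "bn":
        return 2 * b * layer.c_in * layer.h_in * layer.w_in
    if layer.kind == "pool_relu":
        return b * layer.c_in * layer.h_in * layer.w_in
    raise ValueError(f"unknown layer kind: {layer.kind}")


def vertex_weight(layer: Layer, b: int, peak_flops: float) -> float:
    """Theoretical calculation time tc_k = f_k / C."""
    return flops(layer, b) / peak_flops


def edge_weight(layer: Layer, b: int, bandwidth: float) -> float:
    """Theoretical communication time ts_k = d_k / B to the next layer."""
    d_k = b * layer.c_out * layer.h_out * layer.w_out
    return d_k / bandwidth
```

Applying `vertex_weight` and `edge_weight` to every layer in topological order gives the two weight lists that the partitioning step consumes.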
Step 2, dividing the neural network model.
Step 2.1, defining the evaluation indices of a model partitioning scheme. After the model is divided, a sequence of model fragments $P = \{P_1, P_2, \dots, P_m\}$ is obtained, where each model fragment is a set of layers of the neural network that forms a contiguous subset of the original model layers. Summing the layer times from formula (5) within each fragment gives the theoretical calculation times $Tc_1, Tc_2, \dots, Tc_m$ of the model fragments. The variance of the theoretical calculation time across model fragments is
$$\sigma = \frac{1}{m} \sum_{i=1}^{m} \left( Tc_i - Tc_{avg} \right)^2$$
where $Tc_{avg}$ is the mean of the theoretical calculation times of the model fragments.
The theoretical communication times $Ts_1, Ts_2, \dots, Ts_{m-1}$ between adjacent model fragments are obtained according to formula (7). The global communication time is given by formula (8).
$$T_{comm\_total} = \sum_{i=1}^{m-1} Ts_i \tag{8}$$
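As a rough sketch of these two indices, the Python function below computes them for one partitioning scheme. Several things are assumptions made for illustration: a scheme is represented by its `cuts` (the 0-based index of the first layer of every fragment except the first), `ts[k]` is the communication time between layer `k` and layer `k+1`, the variance is the population variance, and the global communication time is the plain sum over fragment boundaries.

```python
def partition_metrics(tc, ts, cuts):
    """Return (variance, global_comm_time) for one partitioning scheme.

    tc   : per-layer theoretical calculation times, length n
    ts   : per-edge theoretical communication times, length n - 1
    cuts : sorted interior cut positions, length m - 1 (assumed encoding)
    """
    bounds = [0, *cuts, len(tc)]
    # Theoretical calculation time Tc_i of each fragment.
    fragment_times = [sum(tc[s:e]) for s, e in zip(bounds, bounds[1:])]
    mean = sum(fragment_times) / len(fragment_times)
    variance = sum((t - mean) ** 2 for t in fragment_times) / len(fragment_times)
    # A cut at position c separates layer c - 1 from layer c.
    global_comm_time = sum(ts[c - 1] for c in cuts)
    return variance, global_comm_time
```

For four layers with times `[1, 2, 3, 4]` split after the second layer, the fragment times are 3 and 7, so the variance is 4 and only the edge between the second and third layers contributes to the communication time.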
Step 2.2, the evaluation algorithm for model partitioning schemes. Assume a model partitioning scheme $P_a$ whose model fragments have theoretical calculation time variance $\sigma_a$ and global communication time $T_{comm\_total\_a}$, and another model partitioning scheme $P_b$ whose model fragments have theoretical calculation time variance $\sigma_b$ and global communication time $T_{comm\_total\_b}$.
If the difference between $\sigma_a$ and $\sigma_b$ is smaller than a given threshold $\epsilon$, the two schemes balance the load about equally well, and the scheme with the shorter global communication time is selected. If the difference between $\sigma_a$ and $\sigma_b$ is larger than $\epsilon$, the variances differ significantly, and the scheme with the smaller variance is selected.
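The pairwise rule can be written as a small Python function. This is a sketch under stated assumptions: each scheme is summarized as a hypothetical `(variance, global_comm_time)` pair, the difference is taken as an absolute value, and a difference exactly equal to the threshold is treated like the "larger" case because the text does not specify it.

```python
def better_scheme(a, b, eps):
    """Pick the better of two schemes given as (variance, comm_time) pairs."""
    var_a, comm_a = a
    var_b, comm_b = b
    if abs(var_a - var_b) < eps:
        # Load balance is comparable: prefer shorter global communication.
        return a if comm_a <= comm_b else b
    # Load balance differs noticeably: prefer the smaller variance.
    return a if var_a < var_b else b
```

One consequence of a threshold-based comparison is that it is not guaranteed to be transitive, so when it is folded over a whole solution set the winner can depend on the order in which the candidates are visited.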
Step 2.3, generating the initial model partitioning scheme. First, the list of theoretical calculation times of the model layers is accumulated element by element to obtain list_flow, in which each entry is the sum of the theoretical calculation times of the current layer and all previous layers. The values in the list are then normalized to the range [0,1], which removes the influence of outlying time values and yields the normalized list list_flow_norm. Next, a step length of 1/m is derived from the number m of GPUs, and the layers are assigned in order to the corresponding model fragments. For example, with m = 4, a layer whose element value lies in [0, 0.25) is assigned to the first model fragment, a layer whose element value lies in [0.25, 0.50) is assigned to the second model fragment, and so on. The result is an initial model partitioning scheme whose fragments have approximately equal theoretical calculation times.
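A minimal Python sketch of this initialization is given below. The normalization method is not spelled out in the text, so dividing by the total time is an assumption, as is the choice to place a value of exactly 1.0 in the last fragment; the function name is hypothetical.

```python
from itertools import accumulate


def initial_partition(tc, m):
    """Assign each layer a 0-based fragment index from its cumulative time."""
    list_flow = list(accumulate(tc))            # running sums of layer times
    total = list_flow[-1]
    list_flow_norm = [v / total for v in list_flow]
    # Bucket width is 1/m; the last layer has value 1.0 and is capped at m - 1.
    return [min(int(v * m), m - 1) for v in list_flow_norm]
```

For eight layers of equal cost and four GPUs this yields `[0, 1, 1, 2, 2, 3, 3, 3]`, which shows that the buckets are only approximately even. A layer that dominates the total time can also leave a fragment empty, so a practical implementation would need to repair such cases before the search starts.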
Step 2.4, generating the solution set of model partitioning schemes and obtaining the optimal solution. The initial model partitioning scheme is added to the current solution set. The solution set is traversed, and all movable model layers are determined for each solution. The first and last layers of the original model cannot be moved. Otherwise, the first layer of a model fragment can be moved forward to follow the last layer of the previous model fragment, and the last layer of a model fragment can be moved backward to precede the first layer of the next model fragment. To reduce the search space, the movement probability of each layer is defined as $r = \omega^{-t}$, where $\omega$ is a value greater than 1 and $t$ is the number of times the corresponding layer has already been moved, so the more often a layer has been moved, the more likely its next move is discarded. A new solution is then generated from each accepted layer move and added to a new solution set, and the move count of that layer is increased by 1. The new solution set serves as the solution set to be traversed in the next iteration. Finally, the distinct solutions newly generated in each iteration are collected into the final solution set, and the optimal solution, that is, the relatively optimal model partitioning scheme, is obtained with the evaluation algorithm of step 2.2.
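The search loop can be sketched in Python as follows. Several details are assumptions for illustration only: a scheme is encoded as a tuple of interior cut positions, fragments are never allowed to become empty, the move counts are shared across all solutions of one search, and the names `neighbours` and `search` are hypothetical.

```python
import random


def neighbours(cuts, n):
    """Yield (moved_layer, new_cuts) for every legal single-layer move."""
    bounds = (0, *cuts, n)
    for i, c in enumerate(cuts):
        # Last layer of fragment i (layer c - 1) moves to fragment i + 1.
        if c - 1 > bounds[i]:
            yield c - 1, cuts[:i] + (c - 1,) + cuts[i + 1:]
        # First layer of fragment i + 1 (layer c) moves to fragment i.
        if c + 1 < bounds[i + 2]:
            yield c, cuts[:i] + (c + 1,) + cuts[i + 1:]


def search(initial_cuts, n, iterations, omega=2.0, seed=0):
    """Collect the distinct schemes reached by probabilistic layer moves."""
    rng = random.Random(seed)
    moves = [0] * n                       # t: how often each layer has moved
    current = {tuple(initial_cuts)}
    final = set(current)
    for _ in range(iterations):
        new_set = set()
        for cuts in sorted(current):
            for layer, new_cuts in neighbours(cuts, n):
                # Accept the move with probability r = omega ** (-t).
                if rng.random() < omega ** -moves[layer]:
                    new_set.add(new_cuts)
                    moves[layer] += 1
        final |= new_set
        current = new_set
        if not current:
            break
    return final
```

Because a layer that has never moved has probability 1, the first iteration always explores every legal neighbour of the initial scheme. Later iterations thin out as the move counts grow, which is how the probability bounds the search space.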
Step 3, pipeline parallel training.
Step 3.1, initialization. A thread is created for each GPU, and each thread manages an input queue input_queue and an output queue output_queue. Each computing task is assigned to its corresponding GPU and waits to be scheduled at its corresponding clock.
Step 3.2, equally dividing the input data of one batch into k micro-batches.
Step 3.3, determining the space-time diagram of pipeline parallel operation. Define $F_{i,j}$ as the forward-propagation computing task of the $i$-th micro-batch on the $j$-th GPU and $B_{i,j}$ as the back-propagation computing task of the $i$-th micro-batch on the $j$-th GPU. FIG. 3 (a) shows the space-time diagram of the forward-propagation computing tasks, and FIG. 3 (b) shows that of the back-propagation computing tasks. After the computing task $F_{4,4}$ is executed in forward-propagation clock 7, the back-propagation computing task $B_{4,4}$ is executed in clock 8.
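One clock assignment that reproduces the two data points above (the forward task of micro-batch 4 on GPU 4 in clock 7 and its backward task in clock 8) is sketched below in Python. The closed-form rules are assumptions: the forward rule is the usual diagonal pipeline schedule, and the backward rule is taken to be its mirror image, which the text does not state explicitly. Indices are 1-based to match the notation.

```python
def forward_clock(i, j):
    """Clock of forward task F(i, j): micro-batch i on GPU j."""
    return i + j - 1


def backward_clock(i, j, k, m):
    """Clock of backward task B(i, j) for k micro-batches and m GPUs."""
    last_forward = k + m - 1              # clock in which F(k, m) runs
    return last_forward + (k - i) + (m - j) + 1


def schedule(k, m):
    """Map each clock to the list of tasks that run in it."""
    clocks = {}
    for i in range(1, k + 1):
        for j in range(1, m + 1):
            clocks.setdefault(forward_clock(i, j), []).append(f"F{i},{j}")
            clocks.setdefault(backward_clock(i, j, k, m), []).append(f"B{i},{j}")
    return dict(sorted(clocks.items()))
```

With k = 4 and m = 4 the schedule spans 14 clocks, and every task runs exactly one clock after both of the tasks it depends on, which is what allows different GPUs to work on different micro-batches at the same time.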
Step 3.4, establishing the dependency relationships between computing tasks. Taking forward propagation as an example, a logical dependency is established on the current GPU from the computing task $F_{i,j}$ of the current micro-batch to the computing task $F_{i+1,j}$ of the next micro-batch. A dependency is also established from the computing task $F_{i,j}$ of the current micro-batch on the current GPU to the computing task $F_{i,j+1}$ on the next GPU, which is realized by copying the data in the output_queue of the current GPU to the next GPU. Back propagation is handled in the same way.
Step 3.5, during scheduling, at the start of each clock all tasks of that clock are submitted to the input_queue of the corresponding GPU, the output is obtained through calculation, and the output value is submitted to the output_queue of that GPU.
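The queue-based scheduling of steps 3.1, 3.4, and 3.5 can be imitated on the CPU with the standard `queue` and `threading` modules. The sketch below covers forward propagation only and makes several assumptions: each GPU is stood in for by a worker thread, each model fragment by an ordinary Python function, and the device-to-device copy by handing the output value to the next stage. The function names are hypothetical.

```python
import queue
import threading


def worker(input_queue, output_queue):
    """Run tasks from input_queue until a None sentinel arrives."""
    while True:
        task = input_queue.get()
        if task is None:
            break
        stage_fn, data = task
        output_queue.put(stage_fn(data))


def run_forward(stages, micro_batches):
    """Push every micro-batch through every stage, clock by clock."""
    m, k = len(stages), len(micro_batches)
    input_queues = [queue.Queue() for _ in range(m)]
    output_queues = [queue.Queue() for _ in range(m)]
    threads = [
        threading.Thread(target=worker, args=(input_queues[j], output_queues[j]))
        for j in range(m)
    ]
    for t in threads:
        t.start()
    data = list(micro_batches)            # latest activation per micro-batch
    for clock in range(k + m - 1):
        # 0-based: micro-batch i runs on stage j when i + j == clock.
        tasks = [(clock - j, j) for j in range(m) if 0 <= clock - j < k]
        for i, j in tasks:
            input_queues[j].put((stages[j], data[i]))
        for i, j in tasks:
            data[i] = output_queues[j].get()   # "copy" to the next stage
    for q in input_queues:
        q.put(None)
    for t in threads:
        t.join()
    return data
```

Within one clock each worker receives at most one task, so all stages that have work run concurrently and the scheduler only waits at the clock boundary. A real implementation would also need error handling, since a stage function that raises would leave the scheduler waiting on an empty output queue.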
Step 3.6, after all micro-batches have completed one full forward propagation and back propagation, the loss and gradients are calculated and the parameters are updated.
Step 3.7, repeating steps 3.2 to 3.6 until the set number of rounds is reached, and outputting the trained neural network model.
Verification experiment:
(1) Experimental setup: 4 GPUs were used. Three combinations of neural network model and dataset were tested: the ResNet-50 model with the CIFAR-10 dataset, the ResNet-101 model with the CIFAR-100 dataset, and the ResNet-152 model with the Caltech-101 dataset.
(2) Data initialization: the image size is 32 × 32 for the CIFAR-10 and CIFAR-100 datasets and 224 × 224 for the Caltech-101 dataset.
(3) Model partitioning algorithm performance experiment: in the three experimental groups, under model parallelism, the optimized model partitioning algorithm proposed by the invention is compared with a partitioning algorithm based on theoretical computing power and a partitioning algorithm based on CUDA memory usage. The algorithm of the invention completes one round of training faster than the other algorithms. In the ResNet-152 and Caltech-101 group, it is 33.822 s faster than the partitioning algorithm based on theoretical computing power and 18.063 s faster than the partitioning algorithm based on CUDA memory usage, as shown in FIG. 4.
(4) Optimal micro-batch number experiment: in the three experimental groups, under pipeline parallelism with the optimized model partitioning algorithm proposed by the invention, the number of micro-batches that gives the shortest single-round training time is searched for. Experiments show that with a well-chosen number of micro-batches, the single-round training time is nearly half of that under model parallelism, as shown in FIG. 5.
(5) Model convergence rate experiment: in the three experimental groups, pipeline parallelism with the optimal number of micro-batches is compared with single-GPU model training. Relative to single-GPU training, the single-round speed-up ratios of pipeline parallelism are 1.19, 1.60, and 1.88, respectively, as listed in the following table:
Experimental group | Single GPU (base) / s | Pipeline parallelism (parallel 1) / s |
---|---|---|
ResNet-50 + CIFAR-10 | 30.123 | 25.316 |
ResNet-101 + CIFAR-100 | 57.328 | 35.789 |
ResNet-152 + Caltech-101 | 87.126 | 46.358 |
With increasing rounds, the models can converge to approximately equal accuracy, as shown in fig. 6 (a), (b), and (c).
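The reported speed-up ratios follow directly from the single-round times in the table, as the short Python check below shows. The dictionary layout and variable names are illustrative only.

```python
# Single-round training times in seconds, copied from the table:
# (single GPU, pipeline parallelism)
timings = {
    "ResNet-50 + CIFAR-10": (30.123, 25.316),
    "ResNet-101 + CIFAR-100": (57.328, 35.789),
    "ResNet-152 + Caltech-101": (87.126, 46.358),
}

# Speed-up ratio = single-GPU time / pipeline-parallel time.
speedups = {
    name: round(single / pipeline, 2)
    for name, (single, pipeline) in timings.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios grow with model depth, which is consistent with the deeper models offering more calculation per fragment to overlap with communication.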
Claims (8)
1. A neural network pipeline parallel training method for optimizing model division, characterized by comprising the following steps:
step 1, modeling a neural network model as a DAG, setting the weight of each vertex in the DAG to the theoretical calculation time of the corresponding neural network layer, and setting the weight of each edge to the theoretical communication time between neural network layers;
step 2, dividing the neural network model according to the DAG to obtain a set of candidate solutions, selecting a relatively optimal model partitioning scheme with the evaluation algorithm for model partitioning schemes, and deploying the divided model on the corresponding GPUs;
step 3, equally dividing an input data batch into a plurality of micro-batches, wherein after the previous GPU finishes the calculation of the current micro-batch it transmits its output to the next GPU, the next GPU continues the forward-propagation calculation of the neural network, and the previous GPU starts the calculation of the next micro-batch;
step 4, updating the parameters of the neural network after all micro-batches have completed the forward-propagation and back-propagation calculation tasks of the neural network;
and step 5, repeating steps 3 and 4 until the set number of rounds is reached, which completes the training of the neural network.
2. The neural network pipeline parallel training method for optimizing model partitioning according to claim 1, wherein step 1 is specifically as follows:
step 1.1, modeling the neural network as a DAG, wherein the direction of each edge in the DAG represents the dependency relationship between layers of the neural network model;
step 1.2, assigning weights to the vertices and edges of the original DAG; wherein the weight of a vertex is the theoretical calculation time of the neural network model layer, calculated as:
$$tc_k = \frac{f_k}{C}, \quad k = 1, 2, \dots, n$$
wherein $tc_k$ is the theoretical calculation time of the neural network layer, $f_k$ is its floating-point calculation amount, $n$ is the total number of layers in the neural network model, and $C$ is the theoretical computing power of the GPU;
the weight of an edge in the DAG is the theoretical communication time between neural network model layers, calculated as:
$$ts_k = \frac{d_k}{B}$$
wherein $ts_k$ is the theoretical communication time between the neural network layer and the next layer, $d_k$ is the output tensor size of the neural network layer, and $B$ is the theoretical bandwidth of the GPU.
3. The neural network pipeline parallel training method for optimizing model partitioning according to claim 2, wherein: the step 2 is specifically as follows:
step 2.1, according to the theoretical calculation time $tc_k$ of each layer of the neural network and the number m of GPUs, dividing the layers of the neural network approximately evenly among the GPUs, so that the sums of the theoretical calculation times of the model layers assigned to each GPU are approximately equal, and taking the resulting initial model partitioning scheme as the initial solution set;
step 2.2, traversing the current solution set to obtain all movable model layers among the model fragments under each solution, calculating the movement probability of the movable layers, and finally determining all the movable layers according to the movement probability;
step 2.3, moving all movable layers to obtain a solution set of a group of model division schemes, taking the solution set as a solution set of the next round of iteration, and adding the solution set of the current round into a final solution set; repeating the step 2.2 until the appointed iteration times are reached;
step 2.4, obtaining a final solution set, and obtaining an optimal solution by using an evaluation algorithm of the model division scheme;
and 2.5, dividing the neural network model according to the optimal solution, and sequentially deploying the model fragments on the corresponding GPUs.
4. The neural network pipeline parallel training method for optimizing model partitioning according to claim 3, wherein the movement probability is calculated as:
$$r = \omega^{-t}$$
where $r$ is the movement probability, $\omega$ is a number greater than 1, and $t$ is the number of times the corresponding model layer has been moved.
5. The neural network pipeline parallel training method for optimizing model partitioning according to claim 3 or 4, wherein the evaluation algorithm for model partitioning schemes is as follows: the variance of the theoretical calculation time across model fragments is used to evaluate the load balance of the GPUs, the global communication time is used to measure the time required for communication during model training, and the two indices are considered together to determine the quality of a model partitioning scheme.
6. The neural network pipeline parallel training method for optimizing model partitioning according to claim 5, wherein: if the difference between the theoretical calculation time variances of two model partitioning schemes is smaller than a set threshold $\epsilon$, the global communication times of the two schemes are further compared and the scheme with the smaller value is selected as the better scheme; if the difference between the theoretical calculation time variances of the two schemes is larger than the set threshold $\epsilon$, the scheme with the smaller theoretical calculation time variance is selected as the better scheme.
7. The neural network pipeline parallel training method for optimizing model partitioning according to claim 1, wherein step 3 is specifically as follows:
step 3.1, starting the training of the model, equally dividing a data batch into a plurality of micro-batches, and creating a thread for each GPU, wherein each thread has an input queue input_queue and an output queue output_queue;
step 3.2, constructing the dependency relationships between computing tasks to facilitate task scheduling; constructing the dependency between the computing tasks of different micro-batches on the same GPU, wherein on the same GPU the calculation of the next micro-batch starts after the calculation of the previous micro-batch is completed; constructing the dependency of the same micro-batch executed on different GPUs, wherein the previous GPU completes the calculation of the current micro-batch and puts the output into its own output_queue, and the data in output_queue is then copied to the next GPU.
8. The neural network pipeline parallel training method for optimizing model partitioning as claimed in claim 7, wherein during scheduling, at the start of each clock all tasks of that clock are submitted to the input_queue of the corresponding GPU, the output is obtained through calculation, and the output value is submitted to the output_queue of that GPU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310139664.5A CN116167436A (en) | 2023-02-21 | 2023-02-21 | Neural network pipeline parallel training method for optimizing model division |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310139664.5A CN116167436A (en) | 2023-02-21 | 2023-02-21 | Neural network pipeline parallel training method for optimizing model division |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116167436A true CN116167436A (en) | 2023-05-26 |
Family
ID=86414427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310139664.5A Pending CN116167436A (en) | 2023-02-21 | 2023-02-21 | Neural network pipeline parallel training method for optimizing model division |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116167436A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117093871A (en) * | 2023-10-16 | 2023-11-21 | 之江实验室 | Deep learning-oriented distributed training evaluation method and system |
-
2023
- 2023-02-21 CN CN202310139664.5A patent/CN116167436A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117093871A (en) * | 2023-10-16 | 2023-11-21 | 之江实验室 | Deep learning-oriented distributed training evaluation method and system |
CN117093871B (en) * | 2023-10-16 | 2024-02-13 | 之江实验室 | Deep learning-oriented distributed training evaluation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||