CN113961337A - Improved Ring All Reduce algorithm-based deep learning-oriented GPU parallel method - Google Patents

Improved Ring All Reduce algorithm-based deep learning-oriented GPU parallel method

Info

Publication number
CN113961337A
CN113961337A (application CN202111073054.7A)
Authority
CN
China
Prior art keywords
data
gradient
weight
bias
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111073054.7A
Other languages
Chinese (zh)
Other versions
CN113961337B (en)
Inventor
韩彦岭
沈思扬
曹守启
张云
洪中华
周汝雁
王静
杨树瑚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Ocean University
Original Assignee
Shanghai Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Ocean University
Priority to CN202111073054.7A
Priority claimed from CN202111073054.7A
Publication of CN113961337A
Application granted
Publication of CN113961337B
Active legal status (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a deep-learning-oriented GPU (Graphics Processing Unit) parallel method based on an improved Ring All Reduce algorithm, which improves the transmission efficiency among data-parallel devices and alleviates the bandwidth-loss problem of the traditional parameter-server parallel structure. In addition, exploiting the fact that the backbone of a typical deep learning network has fewer weight parameters than the fully connected layer and therefore a low synchronization cost, while the fully connected layer has an excessive number of weights and a high gradient-transmission cost, the backbone is processed with data parallelism and the fully connected layer with model parallelism, which addresses the difficulty that pure data parallelism has in supporting large-scale network parameters and the accompanying decay of acceleration. Compared with other methods, the final test accuracy differs little from the training accuracy and the acceleration effect decays less, so the overall effect is better; experiments also show that, compared with datasets with fewer classes such as Cifar10, the method has a larger acceleration advantage on miniImageNet and is therefore better suited to parallel training on massive data.

Description

Improved Ring All Reduce algorithm-based deep learning-oriented GPU parallel method
Technical Field
The invention relates to a deep-learning-oriented GPU parallel method based on an improved Ring All Reduce algorithm.
Background
Deep learning is now widely applied in fields such as image analysis, object detection, semantic segmentation, and autonomous driving. It extracts richer data features mainly by increasing network depth, and networks today commonly reach hundreds or even thousands of layers. The huge data volumes and complex network structures pose a great challenge to training efficiency, so, in order to shorten training time, parallel training methods designed for various computing platforms have gradually become a research hotspot.
Early researchers studied the parallelization of the back-propagation (BP) neural network training process and, through repeated experiments, first successfully combined the early BP neural network with the MapReduce framework; however, the early BP network was not applied to practical problems, so this work remained largely theoretical. On this basis, Hou et al. ran different parallelization examples of a data-parallel method on the Hadoop Distributed File System (HDFS) and showed that MapReduce can effectively accelerate network training, greatly shortening the training time of the neural network. In 2012, experiments that exploited the storage characteristics of HDFS showed that an iterative parallel BP algorithm clearly improves the network's convergence rate, learning accuracy, and parallel efficiency over earlier algorithms. At present there are two main approaches to parallelizing deep neural networks: data parallelism and model parallelism.
Data parallelism is the simplest parallel strategy: each participating device holds a model replica and processes an independent subset of the data, and mainstream frameworks such as TensorFlow and PyTorch support it in an easy-to-use, intuitive way. However, as the number of parallel devices grows, the global Batch Size generally grows with it, which degrades the scalability of data parallelism: for any given deep learning network, once the Batch Size exceeds a corresponding threshold, the number of iterations required to reach the same convergence accuracy increases significantly, mainly because the statistical efficiency of training drops. In addition, the communication overhead introduced by the larger number of devices further limits the overall training speed.
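As a point of reference for the framework support mentioned above, the following is a minimal, generic sketch of single-machine data parallelism in PyTorch. It is an illustration only, not the method of the invention; the tiny nn.Linear model and the random tensors are placeholders.

```python
import torch
import torch.nn as nn

# Minimal, generic sketch of framework-level data parallelism (not this patent's
# method): nn.DataParallel splits each mini-batch across the visible GPUs and
# sums the gradients back onto the default device; on a CPU-only machine it
# simply runs unsplit.
model = nn.DataParallel(nn.Linear(32, 10))
device = next(model.parameters()).device
inputs = torch.randn(64, 32, device=device)
targets = torch.randint(0, 10, (64,), device=device)
loss = nn.CrossEntropyLoss()(model(inputs), targets)
loss.backward()
```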
Model parallelism is another parallel method: the model graph is partitioned and deployed across multiple devices, which process the same mini-batch in parallel. It is usually used to split a large model that a single GPU cannot hold, and training can be accelerated in this way. At present, however, the acceleration obtainable from improved model-parallel algorithms and larger device counts is very limited, so using this method alone also scales poorly. Moreover, to obtain the maximum acceleration, the way the model is partitioned must be tuned repeatedly to achieve the best communication behaviour during forward and backward propagation. In most cases the communication and synchronization overhead of model parallelism exceeds that of data parallelism, so its speed-up ratio is lower.
DistBelief, developed by Google, trains large-scale models with a combination of data and model parallelism; Coates et al. built a model-parallel training method on a Graphics Processing Unit (GPU) cluster; and Li et al. [27] proposed an improved asynchronous parameter-server data-parallel scheme. However, most research on deep learning parallelization targets large-scale commercial GPU platforms, which differ greatly from the small-scale GPU experimental environments used for specific image-classification tasks.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a deep-learning-oriented GPU parallel method based on an improved Ring All Reduce algorithm that has a small data transmission volume and high training efficiency.
To solve the above technical problem, the invention adopts the following technical scheme. The GPU parallel method for deep learning based on the improved Ring All Reduce algorithm adopts a classification network comprising a backbone network, a fully connected layer and Softmax, with n nodes G_0, …, G_{n-1} participating in parallel training, where n is a power of 2; an even number n-k of the nodes, G_0, …, G_{n-k-1}, are responsible for backbone network data parallelism, and G_{n-k}, …, G_{n-1} are responsible for model parallelism.
The improved Ring All Reduce algorithm uses a total node count n that is a power of 2. Input: data set d = {Batch_0, Batch_1, …, Batch_x}, each Batch_x being split into subsets Input_y, y ∈ [0, n-1] [the splitting formula appears only as an image in the original]. Output: the gradient T_y of subset Input_y.
The method comprises the following specific steps:
Step A: the input data Input_y and the initialization parameters are loaded onto the corresponding nodes G_0, G_1, G_2, …, G_{n-1};
Step B: each sub-node G_0, G_1, G_2, …, G_{n-1} divides the Input_y it received into n parts;
Step C: each sub-node is paired with another sub-node at the current pairing interval q [the pairing formulas appear only as images in the original]; nodes that have been paired are not paired again;
Step D: during transmission, each sub-node G_i divides the data it stores into a first part and a second part; G_i transmits the first part to its paired sub-node, which accumulates it, and simultaneously receives the second part sent by that paired sub-node and accumulates it onto its own second part;
Step E: let q = q × 2 and repeat step D until q = n; the Reduce-Scatter operation is completed after log(n) steps, at which point each of G_0, G_1, G_2, …, G_{n-1} holds its own reduced portion of Batch_x, i.e. of Input_y [the fraction appears only as an image in the original];
Step F: sub-node G_i is paired with another sub-node [pairing formula shown only as an image], where q = 2^{log(n)}; nodes that have been paired are not paired again;
Step G: the paired sub-nodes interchange the accumulated data of Input_y corresponding to the first or second part and overwrite the original data of that part, obtaining the final data of that part;
Step H: each sub-node merges the final data part received from its partner with its own final data part;
Step I: let q take its next value [the update, presumably q = q/2, appears only as an image in the original] and repeat steps G-H until q = 1; the All Gather aggregation operation is completed after log(n) steps;
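For illustration, the following is a minimal single-process Python simulation of the Reduce-Scatter and All-Gather phases described in steps A-I. It is a sketch under stated assumptions: the pairing rule is taken to be the standard exchange-with-partner i XOR distance pattern, because the patent's exact pairing formulas appear only as images, and the function name improved_ring_allreduce is introduced here for convenience.

```python
import numpy as np


def improved_ring_allreduce(chunks):
    """Simulate the Reduce-Scatter and All-Gather phases on one process.

    chunks: list of n equal-length 1-D numpy arrays (one per simulated node),
            with n a power of two and the length divisible by n.
    Returns one array per node, each holding the element-wise sum of all chunks.
    """
    n = len(chunks)
    assert n & (n - 1) == 0, "node count must be a power of two"
    size = chunks[0].size
    assert size % n == 0, "data length must be divisible by the node count"
    # Step B: every node splits its data into n equal segments.
    data = [np.array(c, dtype=float).reshape(n, size // n) for c in chunks]
    lo = [0] * n   # first segment index a node is still responsible for
    hi = [n] * n   # one past its last such segment

    # Reduce-Scatter (steps C-E): log2(n) pairwise exchange-and-halve rounds.
    d = n // 2
    while d >= 1:
        for i in range(n):
            p = i ^ d                      # assumed pairing rule (i XOR distance)
            if i < p:                      # handle each pair once per round
                mid = (lo[i] + hi[i]) // 2
                # i keeps the first half of the shared range, p the second half;
                # each accumulates the partner's contribution to the half it keeps.
                data[i][lo[i]:mid] += data[p][lo[i]:mid]
                data[p][mid:hi[i]] += data[i][mid:hi[i]]
                hi[i], lo[p] = mid, mid
        d //= 2
    # Node i now holds the fully reduced segment i (its 1/n portion).

    # All-Gather (steps F-I): reverse order, exchanging finished segments.
    d = 1
    while d < n:
        for i in range(n):
            p = i ^ d
            if i < p:
                si, sp = slice(lo[i], hi[i]), slice(lo[p], hi[p])
                data[p][si] = data[i][si]  # partner copies i's finished part
                data[i][sp] = data[p][sp]  # i copies the partner's finished part
                lo[i] = lo[p] = min(lo[i], lo[p])
                hi[i] = hi[p] = max(hi[i], hi[p])
        d *= 2
    return [x.reshape(-1) for x in data]
```

With this pairing, each phase completes in log(n) exchange rounds, matching the step counts stated above.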
the parallel method specifically comprises the following steps:
Step 1: carry out data parallelism of the backbone network. Backbone input: data set d = {Batch_0, Batch_1, …, Batch_j}; each sample Batch_j is divided into n-k parts to obtain subsets Input_j, j ∈ [0, n-k-1];
Step 1-1: G_0, …, G_{n-k-1} load their data simultaneously;
Step 1-2: from G_0 to G_{n-k-1}, features are extracted through convolution operations;
Step 1-3: from G_{n-k-1} to G_0, the weight gradients and bias gradients of the convolutions are computed;
Step 1-4: after the gradients are computed, G_0, …, G_{n-k-1} use the improved Ring All Reduce algorithm to exchange, aggregate and average them and output T_i; at this point G_0, …, G_{n-k-1} hold the same gradient information;
Step 1-5: G_0, …, G_{n-k-1} use T_i to update the network parameters net data;
Step 1-6: after several iterations, G_0, …, G_{n-k-1} use the improved Ring All Reduce algorithm to output the complete net data, network weight, bias and bias weight;
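To make the data-parallel update concrete, the following sketch shows one synchronous gradient exchange and parameter update for the backbone nodes, reusing the hypothetical improved_ring_allreduce() simulation from the earlier sketch; the plain SGD update and the learning-rate value are illustrative assumptions, not the patent's exact update rule.

```python
# Sketch of one synchronous backbone update (cf. steps 1-3 to 1-5), reusing the
# hypothetical improved_ring_allreduce() simulation given earlier.
def backbone_data_parallel_step(node_grads, node_params, lr=0.01):
    """node_grads / node_params: one flat numpy array per backbone node,
    with the parameter replicas identical before the call."""
    summed = improved_ring_allreduce(node_grads)       # exchange and aggregate
    n = len(node_grads)
    for params, total in zip(node_params, summed):
        params -= lr * (total / n)                     # average, then update
    return node_params
```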
Step 2: carry out model parallelism of the fully connected layer. The backbone network outputs net data, the network weight, the bias and the bias weight.
Step 2-1: the input of the fully connected layer is the product of net data and weight plus the product of bias and bias weight. The fully connected layer is divided according to the output dimension of the backbone network and split equally across the k nodes G_{n-k}, …, G_{n-2}, G_{n-1}, giving the model-parallel forward-propagation inputs: net data_l, weight_l, bias_l and bias weight_l, l ∈ [n-k, n-1];
The nodes participating in model parallelism, G_{n-k}, …, G_{n-2}, G_{n-1}, exchange net data_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm and merge it to obtain net data_l', l ∈ [n-k, n-1];
Step 2-2: G_{n-k}, …, G_{n-2}, G_{n-1} compute with net data_l' and their respective weight_l and bias_l, l ∈ [n-k, n-1], and output G_l fc_output, l ∈ [n-k, n-1];
Step 2-3: G_{n-k}, …, G_{n-2}, G_{n-1} exchange bias_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm; after aggregation it is multiplied by bias weight_l and added to the G_l fc_output from step 2-2 to obtain a new G_l fc_output', l ∈ [n-k, n-1];
Step 2-4: G_{n-k}, …, G_{n-2}, G_{n-1} interchange their respective G_l fc_output', l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm and merge them to obtain the final G_l fc_output'', l ∈ [n-k, n-1];
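The forward data flow of steps 2-1 to 2-4 can be sketched as a column-split fully connected layer, which is one plausible reading of splitting the layer by the backbone output dimension; the separate bias × bias weight product of the patent is folded into a single bias term here, so this is an illustrative simplification rather than the exact claimed computation.

```python
import numpy as np


def fc_model_parallel_forward(x_shards, weight_shards, bias_shards):
    """Column-split fully connected forward pass (cf. steps 2-1 to 2-4).

    x_shards:      k arrays of shape (batch, in_dim // k)   -- the net data slices
    weight_shards: k arrays of shape (in_dim, out_dim // k)
    bias_shards:   k arrays of shape (out_dim // k,)
    """
    # Step 2-1: merge the input slices so every node sees the full net data
    # (done with the improved Ring All Reduce in the patent; a plain
    # concatenation stands in for that exchange here).
    full_x = np.concatenate(x_shards, axis=1)
    # Steps 2-2/2-3: each node computes its own output columns and adds its bias.
    partials = [full_x @ w + b for w, b in zip(weight_shards, bias_shards)]
    # Step 2-4: exchange the partial outputs and merge them into the final output.
    return np.concatenate(partials, axis=1)


# Tiny self-check with two model-parallel nodes (k = 2).
rng = np.random.default_rng(0)
x, w, b = rng.standard_normal((4, 6)), rng.standard_normal((6, 8)), rng.standard_normal(8)
out = fc_model_parallel_forward(np.hsplit(x, 2), np.hsplit(w, 2), np.hsplit(b, 2))
assert np.allclose(out, x @ w + b)
```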
Step 2-5, multiplying the reverse propagation gradient of the full connection layer by the characteristic net data to obtain a weight gradient, multiplying the bias by the gradient to obtain a bias gradient, and multiplying the gradient by the weight to obtain an input gradient net data gradnet;
when in transmission, the model is obtained by segmenting the parameter gradient in parallel in an equivalent mode: gradientl,l∈[n-k,n-1]Then with net datal;l∈[n-k,n-1]、weightl;l∈[n-k,n-1]And biasl;l∈[n-k,n-1]The backward propagation is carried out as an input, and the specific flow is as follows:
Gn-k...Gn-2、Gn-1exchanging grams using a modified Ring All Reduce algorithml,l∈[n-k,n-1]、net datal;l∈[n-k,n-1]Obtaining a complete gradientl',l∈[n-k,n-1]And network parameter net datal″;l∈[n-k,n-1];
Steps 2 to 6Gn-k...Gn-2、Gn-1Subjecting the complete gradient obtained in step 2-5 to gradient analysisl',l∈[n-k,n-1]And network parameter net datal″;l∈[n-k,n-1]Multiplying the weight gradientl',l∈[n-k,n-1];
Step 2-7 node Gn-k...Gn-2、Gn-1Using whole gradientl',l∈[n-k,n-1]And weightl;l∈[n-k,n-1]Multiplying to obtain net data gradinetl,l∈[n-k,n-1]Then, reducing the reduced Scatter operation;
steps 2 to 8Gn-k...Gn-2、Gn-1Exchanging bias using an improved Ring All Reduce algorithml;l∈[n-k,n-1]Obtaining the biasl';l∈[n-k,n-1];
Step 2-9 node Gn-k...Gn-2、Gn-1Subjecting the weight gradient obtained in the step 2-6 tol',l∈[n-k,n-1]With bias in step 2-8l';l∈[n-k,n-1]Multiplying to obtain bias gradientl';l∈[n-k,n-1];
Thus, the whole training process is completed.
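For comparison, the following sketch computes the backward quantities named in step 2-5 for the same column-split layer using the conventional formulas (weight gradient = net data transposed times the gradient, bias gradient = the gradient summed over the batch, input gradient = the gradient times the weight transposed). It shows the data flow that steps 2-5 to 2-9 distribute across the model-parallel nodes, not the patent's exact per-step products.

```python
import numpy as np


def fc_model_parallel_backward(x, weight_shards, grad_out_shards):
    """Backward quantities of step 2-5 for the column-split layer sketched above.

    x:               (batch, in_dim) full net data gathered in the forward pass
    weight_shards:   k arrays of shape (in_dim, out_dim // k)
    grad_out_shards: k arrays of shape (batch, out_dim // k), the output-gradient
                     slices held by the k model-parallel nodes
    """
    # Weight gradient: the output gradient multiplied by the feature net data.
    weight_grads = [x.T @ g for g in grad_out_shards]
    # Bias gradient: the output gradient summed over the batch, per slice.
    bias_grads = [g.sum(axis=0) for g in grad_out_shards]
    # Input gradient (net data gradient): gradient multiplied by the weight;
    # each node contributes a partial term, and summing the partial terms is
    # the role played by the Reduce-Scatter in step 2-7.
    x_grad = sum(g @ w.T for g, w in zip(grad_out_shards, weight_shards))
    return weight_grads, bias_grads, x_grad
```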
The invention has the following beneficial effects. The method optimizes the data transmission flow through a sub-node interval-pairing principle, yielding a Ring All Reduce data communication algorithm with improved time complexity, which improves the transmission efficiency among data-parallel devices and alleviates the bandwidth-loss problem of the traditional parameter-server parallel structure. In addition, exploiting the fact that the backbone of a typical deep learning network has fewer weight parameters than the fully connected layer and therefore a low synchronization cost, while the fully connected layer has an excessive number of weights and a high gradient-transmission cost, the backbone is processed with data parallelism and the fully connected layer with model parallelism, which addresses the difficulty that pure data parallelism has in supporting large-scale network parameters and the accompanying decay of acceleration. Compared with other methods, the final test accuracy differs little from the training accuracy and the acceleration effect decays less, so the overall effect is better; experiments also show that, compared with datasets with fewer classes such as Cifar10, the method has a larger acceleration advantage on miniImageNet and is therefore better suited to parallel training on massive data.
Drawings
FIG. 1 illustrates the Reduce-Scatter operation flow of the improved Ring All Reduce algorithm.
FIG. 2 illustrates the All Gather operation flow of the improved Ring All Reduce algorithm.
FIG. 3a compares the speed-up ratios of Ring All Reduce and its improved algorithm for Vgg16 at different Batch Sizes.
FIG. 3b compares the speed-up ratios of Ring All Reduce and its improved algorithm for Resnet50 at different Batch Sizes.
FIG. 4a shows the Cifar10 multi-node data-parallel training loss-rate curves.
FIG. 4b shows the Cifar10 multi-node data-parallel test loss-rate curves.
FIG. 5 shows the relationship between the number of parallel nodes and test time for Cifar10 multi-node data parallelism.
FIG. 6a compares the test loss-rate curves of Cifar10 data parallelism and the improved parallel method.
FIG. 6b compares the test times of Cifar10 data parallelism and the improved parallel method.
FIG. 7a compares the training times of mini ImageNet data parallelism and the parallel method herein.
FIG. 7b compares the training speed-up ratios of the parallel method herein on different data sets.
Detailed Description
Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The GPU parallel method for deep learning based on the improved Ring All Reduce algorithm adopts a classification network comprising a backbone network, a fully connected layer and Softmax, with n nodes G_0, …, G_{n-1} participating in parallel training, where n is a power of 2; in this embodiment, n-2 nodes G_0, …, G_{n-3} are responsible for backbone network data parallelism, and G_{n-2}, G_{n-1} are responsible for model parallelism.
The improved Ring All Reduce algorithm uses a total node count n that is a power of 2; in this embodiment n = 4. Input: data set d = {Batch_0, Batch_1, …, Batch_x}, each Batch_x being split into subsets Input_y, y ∈ [0, n-1] [the splitting formula appears only as an image in the original]. Output: the gradient T_y of subset Input_y.
The method comprises the following specific steps:
Step A: the input data Input_y and the initialization parameters are loaded onto the corresponding nodes G_0, G_1, G_2, …, G_{n-1};
Step B: each sub-node G_0, G_1, G_2, …, G_{n-1} evenly divides the Input_y it received into a number of parts equal to the total node count, obtaining a_0, a_1, a_2, …, a_{n-1}, as shown in Fig. 1 for n = 4;
Step C: each sub-node is paired with another sub-node at the current pairing interval q [the pairing formulas appear only as images in the original]; nodes that have been paired are not paired again;
Step D: during transmission, each sub-node G_i divides the data it stores into a first part and a second part; G_i transmits the first part to its paired sub-node, which accumulates it, and simultaneously receives the second part sent by that paired sub-node and accumulates it onto its own second part; the result is shown in Fig. 1(b);
Step E: let q = q × 2 and repeat step D until q = n; the result is shown in Fig. 1(c); the Reduce-Scatter operation is completed after log(n) steps, at which point each of G_0, G_1, G_2, …, G_{n-1} holds its own reduced portion of Batch_x, i.e. of Input_y [the fraction appears only as an image in the original];
Step F: sub-node G_i is paired with another sub-node [pairing formula shown only as an image], where q = 2^{log(n)}; nodes that have been paired are not paired again;
Step G: the paired sub-nodes interchange the accumulated data of Input_y corresponding to the first or second part and overwrite the original data of that part, obtaining the final data of that part; the result is shown in Fig. 2(b); after one round of data exchange, G_0, G_1 and G_2, G_3 have each obtained the other's accumulated Input_y data;
Step H: each sub-node merges the final data part received from its partner with its own final data part; the result is shown in Fig. 2(c);
Step I: let q take its next value [the update, presumably q = q/2, appears only as an image in the original] and repeat steps G-H until q = 1; the All Gather aggregation operation is completed after log(n) steps; at this point G_0, G_1, G_2, G_3 have all obtained the complete data;
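A quick check of the earlier improved_ring_allreduce() sketch for the n = 4 case of this embodiment (again assuming the XOR-distance pairing stated there):

```python
import numpy as np

# Four simulated nodes, each holding a different gradient vector.
chunks = [np.full(8, float(i)) for i in range(4)]        # node i holds all-i data
result = improved_ring_allreduce(chunks)
assert all(np.allclose(r, 6.0) for r in result)          # 0 + 1 + 2 + 3 = 6 everywhere
```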
the parallel method specifically comprises the following steps:
Step 1: carry out data parallelism of the backbone network, as follows.
Backbone input: data set d = {Batch_0, Batch_1, …, Batch_j}; each sample Batch_j is divided into n-2 parts to obtain subsets Input_i, i ∈ [0, n-3];
Step 1-1: G_0, …, G_{n-3} load their data simultaneously;
Step 1-2: from G_0 to G_{n-3}, features are extracted through convolution operations;
Step 1-3: from G_{n-3} to G_0, the weight gradients and bias gradients of the convolutional layers are computed;
Step 1-4: the improved Ring All Reduce algorithm is used so that G_0, …, G_{n-3} hold the same weight gradients and bias gradients;
Step 1-5: G_0, …, G_{n-3} update the network parameters net data;
Step 1-6: after several iterations, G_0, …, G_{n-3} use the improved Ring All Reduce algorithm to output the complete net data, network weight, bias and bias weight;
Step 2: carry out model parallelism of the fully connected layer, as follows.
The single-node forward-propagation parameters of the fully connected layer comprise the backbone outputs: net data, network weight, bias and bias weight.
The input of the fully connected layer is the product of net data and weight plus the product of bias and bias weight. The fully connected layer is divided according to the output dimension of the backbone network and split equally across the two nodes G_{n-2}, G_{n-1}, giving net data_l, weight_l, bias_l and bias weight_l, l ∈ [n-2, n-1], as the concrete model-parallel forward-propagation inputs;
Step 2-1: the nodes participating in model parallelism, G_{n-2} and G_{n-1}, exchange net data_{n-2} and net data_{n-1} using the improved Ring All Reduce algorithm and merge them to obtain net data_l', l ∈ [n-2, n-1];
Step 2-2: G_{n-2} and G_{n-1} compute with net data_l' and their respective weight_l and bias_l, l ∈ [n-2, n-1], and output G_{n-2} fc_output and G_{n-1} fc_output;
Step 2-3: G_{n-2} and G_{n-1} exchange bias_l, l ∈ [n-2, n-1], using the improved Ring All Reduce algorithm to obtain the aggregated bias_l', which is multiplied by bias weight_l and added to the G_l fc_output from step 2-2 to obtain the new G_{n-2} fc_output' and G_{n-1} fc_output';
Step 2-4: G_{n-2} and G_{n-1} interchange G_{n-2} fc_output' and G_{n-1} fc_output' using the improved Ring All Reduce algorithm to obtain the aggregated final G_{n-2} fc_output'' and G_{n-1} fc_output'';
Step 2-5: in the conventional back propagation of the fully connected layer, the output gradient is multiplied by the feature net data to obtain the weight gradient, the bias is multiplied by the gradient to obtain the bias gradient, and the gradient is multiplied by the weight to obtain the input gradient net data gradient;
For transmission, the parameter gradient is split equally across the model-parallel nodes to obtain gradient_l, l ∈ [n-2, n-1]; back propagation is then carried out with net data_l, weight_l and bias_l, l ∈ [n-2, n-1], as inputs. The specific flow is as follows:
G_{n-2} and G_{n-1} exchange gradient_l and net data_l, l ∈ [n-2, n-1], using the improved Ring All Reduce algorithm and combine them with the gradient and network parameters stored locally to obtain the complete gradient_l' and network parameters net data_l'', l ∈ [n-2, n-1];
Step 2-6: G_{n-2} and G_{n-1} multiply the complete gradient_l' obtained in step 2-5 by the network parameters net data_l'' to obtain the weight gradient_l', l ∈ [n-2, n-1];
Step 2-7: G_{n-2} and G_{n-1} multiply gradient_l' by weight_l to obtain net data gradient_l, l ∈ [n-2, n-1], and then perform the Reduce-Scatter operation;
Step 2-8: G_{n-2} and G_{n-1} exchange bias_l, l ∈ [n-2, n-1], using the improved Ring All Reduce algorithm to obtain bias_l', l ∈ [n-2, n-1];
Step 2-9: each node multiplies the weight gradient_l' obtained in step 2-6 by the bias_l' from step 2-8 to obtain the bias gradient_l', l ∈ [n-2, n-1];
Thus, the whole training process is completed.
Experimental data and experimental setup for this embodiment
The experiments used two common datasets, Cifar10 and mini ImageNet. Cifar10 is a small dataset for recognizing everyday objects. It contains RGB colour images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The images are 32 × 32, with 50000 training images and 10000 test images in the dataset. The full ImageNet contains about 15 million images in more than 20,000 categories. The mini ImageNet used herein consists of the first 100 categories selected from ImageNet, with a total of 129395 training images and 50000 validation images.
All experiments herein used 2080Ti GPUs in a single-machine multi-card environment. With training samples selected randomly, each group of experiments was repeated twice and the results were averaged.
Comparison of Ring All Reduce and its improved algorithm
The speed-up ratio is a common index for measuring parallel-computing performance: for the same workload it is the ratio of the time T_1 consumed by a single node to the time T_n consumed after parallel computation on multiple nodes, and how this ratio changes reflects performance. In the ideal case the speed-up ratio grows in direct proportion to the number of nodes, which is the optimal situation.
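Written explicitly in the notation above, the speed-up ratio for n parallel nodes is:

```latex
S_n = \frac{T_1}{T_n}, \qquad S_n \approx n \ \text{in the ideal (linear) case}
```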
Using the Cifar10 dataset, 100 rounds of single-machine multi-card data-parallel training were run on a two-node host, and the speed-up ratios of the Ring All Reduce algorithm and the improved algorithm relative to the parameter-server algorithm were compared at different Batch Sizes under Vgg16 and Resnet50, so as to measure their transmission performance.
Fig. 3 compares the speed-up ratios of data-parallel training with the two algorithms. For both Vgg16 and Resnet50, the speed-up ratios of the two algorithms decrease somewhat as the Batch Size increases, mainly because the larger data volume lengthens transmission time. The difference between the two algorithms is largest for Vgg16 at a Batch Size of 4, and for Resnet50 it peaks at about 25% at a Batch Size of 16, indicating that each network has an optimal Batch Size at which the improved algorithm performs best. Overall, the improved algorithm achieves better performance than Ring All Reduce, and its speed-up ratio is always greater than 1.
Comparison of Cifar10 data parallelism and the improved parallel method
First, the performance of data parallelism with different numbers of nodes is verified on the Cifar10 dataset by observing training indicators such as convergence and time.
Image classification was tested on the Resnet50 network, with data transmission implemented by the improved Ring All Reduce algorithm described herein. The training and test Batch Size was set to 64, training ran for 6000 iterations and testing for 1000 iterations, the initial learning rate was set to 0.001, and it was reduced by a factor of 10 after 3000 and 5000 iterations, respectively.
Parallel training was performed on clusters of 2, 4 and 8 nodes; the experimental results are shown in Table 1 and Fig. 4.
Table 1: comparison of multi-node Cifar10 data-parallel training loss rates at different iteration counts
[Table 1 appears only as an image in the original publication.]
Table 1 shows how the data-parallel training loss rate changes with the number of iterations for different node counts. Comparing 8 nodes with 2 nodes shows that increasing the number of parallel nodes raises the loss rate to some extent: at 1000 iterations the difference between the two is 11%, but the final difference is small, only 2% after 6000 iterations, indicating that data parallelism has little influence on the final training accuracy.
Fig. 4(a) shows the loss rate curve of the data parallel training, and fig. 4(b) shows the loss variation curve of the test on the test set, from which the following points can be obtained:
the data parallel post-training and test loss rate increases with the number of parallel nodes.
Referring to the training loss rates in Table 1 and the curves in Fig. 4(a), 8-node data parallelism converges more slowly than 2 nodes at the start of training and some differences exist; after 3000 iterations the differences shrink and stay within 3%, and at the final convergence point they do not exceed 2%. This is mainly because the data-exchange cost introduced by additional nodes reduces efficiency somewhat.
Fig. 4(b) shows that, during testing on the test set under data parallelism, at the same iteration count the 4-node curve also converges more slowly than the 8-node curve but reaches a better final convergence point, indicating a nonlinear relationship between the number of nodes and efficiency.
The trained weights were tested on a single card, and the relationship between the number of nodes and the test time on the test set, using time as the reference, is plotted in Fig. 5:
as can be seen from fig. 5, increasing the number of parallel nodes can accelerate training, which also embodies the advantages of data parallel, but it can also be seen that as the number of nodes increases, the slope of the curve increases, the acceleration effect decreases, and as the number of nodes increases, the number of data exchanges increases, which takes longer time[29-30]Eventually, linear acceleration under ideal conditions cannot be achieved.
The experimental results show that multi-node data-parallel training has little influence on the final classification accuracy, but its acceleration effect is limited to some extent, with the speed reduction most obvious for 8-node data parallelism. Therefore, with all other experimental conditions kept the same, 8-node parallel training is used to show the acceleration effect most clearly: the loss-rate comparison after 1000 test iterations on the test set is shown in Fig. 6(a), and the test-time comparison in Fig. 6(b).
As Fig. 6(a) shows, the final convergence of the parallel method proposed herein is essentially the same as that of the data-parallel strategy, so the method has little impact on network accuracy. Fig. 6(b) shows that, compared with data parallelism, the acceleration improves slightly, with the test time shortened from 9 minutes to 7.8 minutes. This is mainly because the content communicated in model parallelism is the feature map while in data parallelism it is the parameters; the strategy herein parallelizes the fully connected layer with the model and the rest with the data. For Cifar10, which has few classes, the total parameter count of the fully connected layer differs little from the feature-map size, so the improvement is limited; when the network's feature map is much smaller than the parameter count, communication performance can be better. The next subsection uses mini ImageNet, which has more categories and a larger data volume, for verification.
Comparison of mini ImageNet data parallelism with the improved method
In this subsection the mini ImageNet dataset is used to further verify the performance of the proposed parallel strategy. Resnet50 is used as the training network, and the experimental environment is a single-machine multi-card GPU cluster, which minimizes the influence of bandwidth on data transmission. With the improved Ring All Reduce algorithm proposed herein as the cluster communication mode, the training times of the data-parallel and improved parallel methods are compared on 1, 2, 4 and 8 nodes to verify that the method further improves acceleration.
The experimental Batch Size is set to 64, and the results after 90 iterations are shown in Table 2, where Acc denotes the training accuracy and T the training completion time.
Table 2: multi-node data parallelism vs. the parallel method herein on the mini ImageNet dataset (Acc: %; T: HH:MM, hours:minutes)
[Table 2 appears only as an image in the original publication.]
Analysis of Table 2 shows that on the mini ImageNet dataset, for both the data-parallel method and the parallel method herein, training time shortens steadily as the number of nodes increases, while accuracy drops to some extent, by about 2% compared with single-card training, matching the conclusion obtained on the Cifar10 dataset. Combining Table 2 and Fig. 7(a), the acceleration of the method provided by the invention is better than that of the data-parallel method.
Comparing the slopes of the polylines in Fig. 7(a) shows that the method proposed herein has a more significant advantage in multi-node parallelism. Fig. 7(b) shows the speed-up ratios of this method when training Cifar10 and mini ImageNet with the same number of nodes: with 2 parallel nodes, the speed-up obtained on Cifar10 is slightly better than on mini ImageNet; as the number of nodes increases, both speed-up ratios keep improving, but the speed-up on mini ImageNet grows by a larger margin, and the gap between the two becomes more obvious.
The above embodiments merely illustrate the principles and effects of the present invention and are illustrative rather than restrictive; it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept of the present invention, and such changes and modifications fall within the protection scope of the present invention.

Claims (1)

1. The GPU parallel method for deep learning based on the improved Ring All Reduce algorithm, which adopts a classification network comprising a backbone network, a fully connected layer and Softmax, with n nodes G_0, …, G_{n-1} participating in parallel training, where n is a power of 2; an even number n-k of the nodes, G_0, …, G_{n-k-1}, are responsible for backbone network data parallelism, and G_{n-k}, …, G_{n-1} are responsible for model parallelism;
wherein the improved Ring All Reduce algorithm uses a total node count n that is a power of 2; input: data set d = {Batch_0, Batch_1, …, Batch_x}, each Batch_x being split into subsets Input_y, y ∈ [0, n-1] [the splitting formula appears only as an image in the original]; output: the gradient T_y of subset Input_y;
The method comprises the following specific steps:
Step A: the input data Input_y and the initialization parameters are loaded onto the corresponding nodes G_0, G_1, G_2, …, G_{n-1};
Step B: each sub-node G_0, G_1, G_2, …, G_{n-1} divides the Input_y it received into n parts;
Step C: each sub-node is paired with another sub-node at the current pairing interval q [the pairing formulas appear only as images in the original]; nodes that have been paired are not paired again;
Step D: during transmission, each sub-node G_i divides the data it stores into a first part and a second part; G_i transmits the first part to its paired sub-node, which accumulates it, and simultaneously receives the second part sent by that paired sub-node and accumulates it onto its own second part;
Step E: let q = q × 2 and repeat step D until q = n; the Reduce-Scatter operation is completed after log(n) steps, at which point each of G_0, G_1, G_2, …, G_{n-1} holds its own reduced portion of Batch_x, i.e. of Input_y [the fraction appears only as an image in the original];
Step F: sub-node G_i is paired with another sub-node [pairing formula shown only as an image], where q = 2^{log(n)}; nodes that have been paired are not paired again;
Step G: the paired sub-nodes interchange the accumulated data of Input_y corresponding to the first or second part and overwrite the original data of that part, obtaining the final data of that part;
Step H: each sub-node merges the final data part received from its partner with its own final data part;
Step I: let q take its next value [the update, presumably q = q/2, appears only as an image in the original] and repeat steps G-H until q = 1; the All Gather aggregation operation is completed after log(n) steps;
the parallel method specifically comprises the following steps:
Step 1: carry out data parallelism of the backbone network. Backbone input: data set d = {Batch_0, Batch_1, …, Batch_j}; each sample Batch_j is divided into n-k parts to obtain subsets Input_j, j ∈ [0, n-k-1];
Step 1-1: G_0, …, G_{n-k-1} load their data simultaneously;
Step 1-2: from G_0 to G_{n-k-1}, features are extracted through convolution operations;
Step 1-3: from G_{n-k-1} to G_0, the weight gradients and bias gradients of the convolutions are computed;
Step 1-4: after the gradients are computed, G_0, …, G_{n-k-1} use the improved Ring All Reduce algorithm to exchange, aggregate and average them and output T_i; at this point G_0, …, G_{n-k-1} hold the same gradient information;
Step 1-5: G_0, …, G_{n-k-1} use T_i to update the network parameters net data;
Step 1-6: after several iterations, G_0, …, G_{n-k-1} use the improved Ring All Reduce algorithm to output the complete net data, network weight, bias and bias weight;
Step 2: carry out model parallelism of the fully connected layer, using the backbone outputs net data, network weight, bias and bias weight;
Step 2-1: the forward propagation of the fully connected layer is the product of net data and weight plus the product of bias and bias weight; the fully connected layer is divided according to the output dimension of the backbone network and split equally across the k nodes G_{n-k}, …, G_{n-2}, G_{n-1}, giving the model-parallel forward-propagation inputs: net data_l, weight_l, bias_l and bias weight_l, l ∈ [n-k, n-1];
the nodes participating in model parallelism, G_{n-k}, …, G_{n-2}, G_{n-1}, exchange net data_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm and merge it to obtain net data_l', l ∈ [n-k, n-1];
Step 2-2: G_{n-k}, …, G_{n-2}, G_{n-1} compute with net data_l' and their respective weight_l and bias_l, l ∈ [n-k, n-1], and output G_l fc_output, l ∈ [n-k, n-1];
Step 2-3: G_{n-k}, …, G_{n-2}, G_{n-1} exchange bias_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm; after aggregation it is multiplied by bias weight_l and added to the G_l fc_output from step 2-2 to obtain a new G_l fc_output', l ∈ [n-k, n-1];
Step 2-4: G_{n-k}, …, G_{n-2}, G_{n-1} interchange their respective G_l fc_output', l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm and merge them to obtain the final G_l fc_output'', l ∈ [n-k, n-1];
Step 2-5: in the back propagation of the fully connected layer, the gradient is multiplied by the feature net data to obtain the weight gradient, the bias is multiplied by the gradient to obtain the bias gradient, and the gradient is multiplied by the weight to obtain the input gradient net data gradient;
for transmission, the parameter gradient is split equally across the model-parallel nodes to obtain gradient_l, l ∈ [n-k, n-1]; back propagation is then carried out with net data_l, weight_l and bias_l, l ∈ [n-k, n-1], as inputs; the specific flow is as follows:
G_{n-k}, …, G_{n-2}, G_{n-1} exchange gradient_l and net data_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm to obtain the complete gradient_l' and network parameters net data_l'', l ∈ [n-k, n-1];
Step 2-6: G_{n-k}, …, G_{n-2}, G_{n-1} multiply the complete gradient_l' obtained in step 2-5 by the network parameters net data_l'' to obtain the weight gradient_l', l ∈ [n-k, n-1];
Step 2-7: the nodes G_{n-k}, …, G_{n-2}, G_{n-1} multiply the complete gradient_l' by weight_l to obtain net data gradient_l, l ∈ [n-k, n-1], and then perform the Reduce-Scatter operation;
Step 2-8: G_{n-k}, …, G_{n-2}, G_{n-1} exchange bias_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm to obtain bias_l', l ∈ [n-k, n-1];
Step 2-9: the nodes G_{n-k}, …, G_{n-2}, G_{n-1} multiply the weight gradient_l' obtained in step 2-6 by the bias_l' from step 2-8 to obtain the bias gradient_l', l ∈ [n-k, n-1];
Thus, the whole training process is completed.
CN202111073054.7A, filed 2021-09-14: Deep learning-oriented GPU parallel method based on improved Ring All Reduce algorithm (Active; granted as CN113961337B)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111073054.7A CN113961337B (en) 2021-09-14 Deep learning-oriented GPU parallel method based on improved Ring All Reduce algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111073054.7A CN113961337B (en) 2021-09-14 Deep learning-oriented GPU parallel method based on improved Ring All Reduce algorithm

Publications (2)

Publication Number Publication Date
CN113961337A 2022-01-21
CN113961337B 2024-05-10


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
US20200118000A1 (en) * 2018-10-10 2020-04-16 NEC Laboratories Europe GmbH Method and system for distributed deep learning
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
US20210211787A1 (en) * 2020-01-03 2021-07-08 Microsoft Technology Licensing, Llc Distributed processing architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks
US20200118000A1 (en) * 2018-10-10 2020-04-16 NEC Laboratories Europe GmbH Method and system for distributed deep learning
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
US20210211787A1 (en) * 2020-01-03 2021-07-08 Microsoft Technology Licensing, Llc Distributed processing architecture
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王裕民; 顾乃杰; 张孝慈: "Parallel algorithm for convolutional neural networks in a multi-GPU environment" (多GPU环境下的卷积神经网络并行算法), 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 03, 15 March 2017 (2017-03-15) *

Similar Documents

Publication Publication Date Title
EP3540652B1 (en) Method, device, chip and system for training neural network model
US8543517B2 (en) Distributed decision tree training
CN111224905B (en) Multi-user detection method based on convolution residual error network in large-scale Internet of things
CN108108814A (en) A kind of training method of deep neural network
WO2020233709A1 (en) Model compression method, and device
CN112288087A (en) Neural network pruning method and device, electronic equipment and storage medium
CN106778015A (en) One kind is based on FPGA isomery accelerated gene computational methods in cloud platform
CN116089883B (en) Training method for improving classification degree of new and old categories in existing category increment learning
CN112215199A (en) SAR image ship detection method based on multi-receptive-field and dense feature aggregation network
CN114219824A (en) Visible light-infrared target tracking method and system based on deep network
CN115170874A (en) Self-distillation implementation method based on decoupling distillation loss
Singh et al. Hetconv: Beyond homogeneous convolution kernels for deep cnns
WO2022265573A2 (en) Automatically and efficiently generating search spaces for neural network
Li et al. Dlw-nas: Differentiable light-weight neural architecture search
CN113961337A (en) Improved Ring All Reduce algorithm-based deep learning-oriented GPU parallel method
CN106294429A (en) Repeat data identification method and device
CN113961337B (en) Deep learning-oriented GPU parallel method based on improved Ring All Reduce algorithm
Khoa et al. SplitDyn: Federated split neural network for distributed edge AI applications
CN116167436A (en) Neural network pipeline parallel training method for optimizing model division
CN115630398A (en) Personalized differential privacy protection method, device and system based on small sample data
CN113033653B (en) Edge-cloud cooperative deep neural network model training method
He et al. A fast simulated annealing strategy for community detection in complex networks
CN115147870A (en) Pedestrian re-identification method and device
CN115829029A (en) Channel attention-based self-distillation implementation method
CN115035408A (en) Unmanned aerial vehicle image tree species classification method based on transfer learning and attention mechanism

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant