CN113961337A - Improved Ring All Reduce algorithm-based deep learning-oriented GPU parallel method - Google Patents

Improved Ring All Reduce algorithm-based deep learning-oriented GPU parallel method

Info

Publication number
CN113961337A
CN113961337A (application CN202111073054.7A)
Authority
CN
China
Prior art keywords
data
gradient
weight
bias
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111073054.7A
Other languages
Chinese (zh)
Other versions
CN113961337B (en)
Inventor
韩彦岭
沈思扬
曹守启
张云
洪中华
周汝雁
王静
杨树瑚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Ocean University
Original Assignee
Shanghai Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Ocean University
Priority to CN202111073054.7A
Priority claimed from CN202111073054.7A
Publication of CN113961337A
Application granted
Publication of CN113961337B
Active legal status (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a deep-learning-oriented GPU (Graphics Processing Unit) parallel method based on an improved Ring All Reduce algorithm, which improves the transmission efficiency among data-parallel devices and alleviates the bandwidth-loss problem of the traditional parameter-server parallel structure. In addition, exploiting the fact that the backbone of a typical deep learning network has fewer weight parameters than the fully connected layer and therefore a low synchronization cost, while the fully connected layer has an excessive number of weights and a high gradient-transmission cost, the backbone is processed with data parallelism and the fully connected layer with model parallelism, which addresses the difficulty that pure data parallelism has in supporting large-scale network parameters and the accompanying decay of acceleration. Compared with other methods, the final test accuracy differs little from the training accuracy and the acceleration effect decays less, so the overall effect is better; experiments also show that, compared with datasets with fewer classes such as Cifar10, the method has a larger acceleration advantage on miniImageNet and is therefore better suited to parallel training on massive data.

Description

Improved Ring All Reduce algorithm-based deep learning-oriented GPU parallel method
Technical Field
The invention relates to a deep-learning-oriented GPU parallel method based on an improved Ring All Reduce algorithm.
Background
Deep learning is now widely applied in fields such as image analysis, object detection, semantic segmentation, and autonomous driving. It extracts richer data features mainly by increasing network depth, and networks today commonly reach hundreds or even thousands of layers. The huge data volumes and complex network structures pose a great challenge to training efficiency, so, in order to shorten training time, parallel training methods designed for various computing platforms have gradually become a research hotspot.
Early researchers studied the parallelization of the back-propagation (BP) neural network training process and, through repeated experiments, first successfully combined the early BP neural network with the MapReduce framework; however, the early BP network was not applied to practical problems, so this work remained largely theoretical. On this basis, Hou et al. ran different parallelization examples of a data-parallel method on the Hadoop Distributed File System (HDFS) and showed that MapReduce can effectively accelerate network training, greatly shortening the training time of the neural network. In 2012, experiments that exploited the storage characteristics of HDFS showed that an iterative parallel BP algorithm clearly improves the network's convergence rate, learning accuracy, and parallel efficiency over earlier algorithms. At present there are two main approaches to parallelizing deep neural networks: data parallelism and model parallelism.
Data parallelism is the simplest parallel strategy: each participating device holds a model replica and processes an independent subset of the data, and mainstream frameworks such as TensorFlow and PyTorch support it in an easy-to-use, intuitive way. However, as the number of parallel devices grows, the global Batch Size generally grows with it, which degrades the scalability of data parallelism: for any given deep learning network, once the Batch Size exceeds a corresponding threshold, the number of iterations required to reach the same convergence accuracy increases significantly, mainly because the statistical efficiency of training drops. In addition, the communication overhead introduced by the larger number of devices further limits the overall training speed.
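As a point of reference for the framework support mentioned above, the following is a minimal, generic sketch of single-machine data parallelism in PyTorch. It is an illustration only, not the method of the invention; the tiny nn.Linear model and the random tensors are placeholders.

```python
import torch
import torch.nn as nn

# Minimal, generic sketch of framework-level data parallelism (not this patent's
# method): nn.DataParallel splits each mini-batch across the visible GPUs and
# sums the gradients back onto the default device; on a CPU-only machine it
# simply runs unsplit.
model = nn.DataParallel(nn.Linear(32, 10))
device = next(model.parameters()).device
inputs = torch.randn(64, 32, device=device)
targets = torch.randint(0, 10, (64,), device=device)
loss = nn.CrossEntropyLoss()(model(inputs), targets)
loss.backward()
```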
Model parallelism is another parallel method: the model graph is partitioned and deployed across multiple devices, which process the same mini-batch in parallel. It is usually used to split a large model that a single GPU cannot hold, and training can be accelerated in this way. At present, however, the acceleration obtainable from improved model-parallel algorithms and larger device counts is very limited, so using this method alone also scales poorly. Moreover, to obtain the maximum acceleration, the way the model is partitioned must be tuned repeatedly to achieve the best communication behaviour during forward and backward propagation. In most cases the communication and synchronization overhead of model parallelism exceeds that of data parallelism, so its speed-up ratio is lower.
DistBelief, developed by Google, trains large-scale models with a combination of data and model parallelism; Coates et al. built a model-parallel training method on a Graphics Processing Unit (GPU) cluster; and Li et al. [27] proposed an improved asynchronous parameter-server data-parallel scheme. However, most research on deep learning parallelization targets large-scale commercial GPU platforms, which differ greatly from the small-scale GPU experimental environments used for specific image-classification tasks.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a deep-learning-oriented GPU parallel method based on an improved Ring All Reduce algorithm that has a small data transmission volume and high training efficiency.
To solve the above technical problem, the invention adopts the following technical scheme. The GPU parallel method for deep learning based on the improved Ring All Reduce algorithm adopts a classification network comprising a backbone network, a fully connected layer and Softmax, with n nodes G_0, …, G_{n-1} participating in parallel training, where n is a power of 2; an even number n-k of the nodes, G_0, …, G_{n-k-1}, are responsible for backbone network data parallelism, and G_{n-k}, …, G_{n-1} are responsible for model parallelism.
The improved Ring All Reduce algorithm uses a total node count n that is a power of 2. Input: data set d = {Batch_0, Batch_1, …, Batch_x}, each Batch_x being split into subsets Input_y, y ∈ [0, n-1] [the splitting formula appears only as an image in the original]. Output: the gradient T_y of subset Input_y.
The method comprises the following specific steps:
Step A: the input data Input_y and the initialization parameters are loaded onto the corresponding nodes G_0, G_1, G_2, …, G_{n-1};
Step B: each sub-node G_0, G_1, G_2, …, G_{n-1} divides the Input_y it received into n parts;
Step C: each sub-node is paired with another sub-node at the current pairing interval q [the pairing formulas appear only as images in the original]; nodes that have been paired are not paired again;
Step D: during transmission, each sub-node G_i divides the data it stores into a first part and a second part; G_i transmits the first part to its paired sub-node, which accumulates it, and simultaneously receives the second part sent by that paired sub-node and accumulates it onto its own second part;
Step E: let q = q × 2 and repeat step D until q = n; the Reduce-Scatter operation is completed after log(n) steps, at which point each of G_0, G_1, G_2, …, G_{n-1} holds its own reduced portion of Batch_x, i.e. of Input_y [the fraction appears only as an image in the original];
Step F: sub-node G_i is paired with another sub-node [pairing formula shown only as an image], where q = 2^{log(n)}; nodes that have been paired are not paired again;
Step G: the paired sub-nodes interchange the accumulated data of Input_y corresponding to the first or second part and overwrite the original data of that part, obtaining the final data of that part;
Step H: each sub-node merges the final data part received from its partner with its own final data part;
Step I: let q take its next value [the update, presumably q = q/2, appears only as an image in the original] and repeat steps G-H until q = 1; the All Gather aggregation operation is completed after log(n) steps;
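For illustration, the following is a minimal single-process Python simulation of the Reduce-Scatter and All-Gather phases described in steps A-I. It is a sketch under stated assumptions: the pairing rule is taken to be the standard exchange-with-partner i XOR distance pattern, because the patent's exact pairing formulas appear only as images, and the function name improved_ring_allreduce is introduced here for convenience.

```python
import numpy as np


def improved_ring_allreduce(chunks):
    """Simulate the Reduce-Scatter and All-Gather phases on one process.

    chunks: list of n equal-length 1-D numpy arrays (one per simulated node),
            with n a power of two and the length divisible by n.
    Returns one array per node, each holding the element-wise sum of all chunks.
    """
    n = len(chunks)
    assert n & (n - 1) == 0, "node count must be a power of two"
    size = chunks[0].size
    assert size % n == 0, "data length must be divisible by the node count"
    # Step B: every node splits its data into n equal segments.
    data = [np.array(c, dtype=float).reshape(n, size // n) for c in chunks]
    lo = [0] * n   # first segment index a node is still responsible for
    hi = [n] * n   # one past its last such segment

    # Reduce-Scatter (steps C-E): log2(n) pairwise exchange-and-halve rounds.
    d = n // 2
    while d >= 1:
        for i in range(n):
            p = i ^ d                      # assumed pairing rule (i XOR distance)
            if i < p:                      # handle each pair once per round
                mid = (lo[i] + hi[i]) // 2
                # i keeps the first half of the shared range, p the second half;
                # each accumulates the partner's contribution to the half it keeps.
                data[i][lo[i]:mid] += data[p][lo[i]:mid]
                data[p][mid:hi[i]] += data[i][mid:hi[i]]
                hi[i], lo[p] = mid, mid
        d //= 2
    # Node i now holds the fully reduced segment i (its 1/n portion).

    # All-Gather (steps F-I): reverse order, exchanging finished segments.
    d = 1
    while d < n:
        for i in range(n):
            p = i ^ d
            if i < p:
                si, sp = slice(lo[i], hi[i]), slice(lo[p], hi[p])
                data[p][si] = data[i][si]  # partner copies i's finished part
                data[i][sp] = data[p][sp]  # i copies the partner's finished part
                lo[i] = lo[p] = min(lo[i], lo[p])
                hi[i] = hi[p] = max(hi[i], hi[p])
        d *= 2
    return [x.reshape(-1) for x in data]
```

With this pairing, each phase completes in log(n) exchange rounds, matching the step counts stated above.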
the parallel method specifically comprises the following steps:
Step 1: carry out data parallelism of the backbone network. Backbone input: data set d = {Batch_0, Batch_1, …, Batch_j}; each sample Batch_j is divided into n-k parts to obtain subsets Input_j, j ∈ [0, n-k-1];
Step 1-1: G_0, …, G_{n-k-1} load their data simultaneously;
Step 1-2: from G_0 to G_{n-k-1}, features are extracted through convolution operations;
Step 1-3: from G_{n-k-1} to G_0, the weight gradients and bias gradients of the convolutions are computed;
Step 1-4: after the gradients are computed, G_0, …, G_{n-k-1} use the improved Ring All Reduce algorithm to exchange, aggregate and average them and output T_i; at this point G_0, …, G_{n-k-1} hold the same gradient information;
Step 1-5: G_0, …, G_{n-k-1} use T_i to update the network parameters net data;
Step 1-6: after several iterations, G_0, …, G_{n-k-1} use the improved Ring All Reduce algorithm to output the complete net data, network weight, bias and bias weight;
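To make the data-parallel update concrete, the following sketch shows one synchronous gradient exchange and parameter update for the backbone nodes, reusing the hypothetical improved_ring_allreduce() simulation from the earlier sketch; the plain SGD update and the learning-rate value are illustrative assumptions, not the patent's exact update rule.

```python
# Sketch of one synchronous backbone update (cf. steps 1-3 to 1-5), reusing the
# hypothetical improved_ring_allreduce() simulation given earlier.
def backbone_data_parallel_step(node_grads, node_params, lr=0.01):
    """node_grads / node_params: one flat numpy array per backbone node,
    with the parameter replicas identical before the call."""
    summed = improved_ring_allreduce(node_grads)       # exchange and aggregate
    n = len(node_grads)
    for params, total in zip(node_params, summed):
        params -= lr * (total / n)                     # average, then update
    return node_params
```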
Step 2: carry out model parallelism of the fully connected layer. The backbone network outputs net data, the network weight, the bias and the bias weight.
Step 2-1: the input of the fully connected layer is the product of net data and weight plus the product of bias and bias weight. The fully connected layer is divided according to the output dimension of the backbone network and split equally across the k nodes G_{n-k}, …, G_{n-2}, G_{n-1}, giving the model-parallel forward-propagation inputs: net data_l, weight_l, bias_l and bias weight_l, l ∈ [n-k, n-1];
The nodes participating in model parallelism, G_{n-k}, …, G_{n-2}, G_{n-1}, exchange net data_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm and merge it to obtain net data_l', l ∈ [n-k, n-1];
Step 2-2: G_{n-k}, …, G_{n-2}, G_{n-1} compute with net data_l' and their respective weight_l and bias_l, l ∈ [n-k, n-1], and output G_l fc_output, l ∈ [n-k, n-1];
Step 2-3: G_{n-k}, …, G_{n-2}, G_{n-1} exchange bias_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm; after aggregation it is multiplied by bias weight_l and added to the G_l fc_output from step 2-2 to obtain a new G_l fc_output', l ∈ [n-k, n-1];
Step 2-4: G_{n-k}, …, G_{n-2}, G_{n-1} interchange their respective G_l fc_output', l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm and merge them to obtain the final G_l fc_output'', l ∈ [n-k, n-1];
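The forward data flow of steps 2-1 to 2-4 can be sketched as a column-split fully connected layer, which is one plausible reading of splitting the layer by the backbone output dimension; the separate bias × bias weight product of the patent is folded into a single bias term here, so this is an illustrative simplification rather than the exact claimed computation.

```python
import numpy as np


def fc_model_parallel_forward(x_shards, weight_shards, bias_shards):
    """Column-split fully connected forward pass (cf. steps 2-1 to 2-4).

    x_shards:      k arrays of shape (batch, in_dim // k)   -- the net data slices
    weight_shards: k arrays of shape (in_dim, out_dim // k)
    bias_shards:   k arrays of shape (out_dim // k,)
    """
    # Step 2-1: merge the input slices so every node sees the full net data
    # (done with the improved Ring All Reduce in the patent; a plain
    # concatenation stands in for that exchange here).
    full_x = np.concatenate(x_shards, axis=1)
    # Steps 2-2/2-3: each node computes its own output columns and adds its bias.
    partials = [full_x @ w + b for w, b in zip(weight_shards, bias_shards)]
    # Step 2-4: exchange the partial outputs and merge them into the final output.
    return np.concatenate(partials, axis=1)


# Tiny self-check with two model-parallel nodes (k = 2).
rng = np.random.default_rng(0)
x, w, b = rng.standard_normal((4, 6)), rng.standard_normal((6, 8)), rng.standard_normal(8)
out = fc_model_parallel_forward(np.hsplit(x, 2), np.hsplit(w, 2), np.hsplit(b, 2))
assert np.allclose(out, x @ w + b)
```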
Step 2-5, multiplying the reverse propagation gradient of the full connection layer by the characteristic net data to obtain a weight gradient, multiplying the bias by the gradient to obtain a bias gradient, and multiplying the gradient by the weight to obtain an input gradient net data gradnet;
when in transmission, the model is obtained by segmenting the parameter gradient in parallel in an equivalent mode: gradientl,l∈[n-k,n-1]Then with net datal;l∈[n-k,n-1]、weightl;l∈[n-k,n-1]And biasl;l∈[n-k,n-1]The backward propagation is carried out as an input, and the specific flow is as follows:
Gn-k...Gn-2、Gn-1exchanging grams using a modified Ring All Reduce algorithml,l∈[n-k,n-1]、net datal;l∈[n-k,n-1]Obtaining a complete gradientl',l∈[n-k,n-1]And network parameter net datal″;l∈[n-k,n-1];
Steps 2 to 6Gn-k...Gn-2、Gn-1Subjecting the complete gradient obtained in step 2-5 to gradient analysisl',l∈[n-k,n-1]And network parameter net datal″;l∈[n-k,n-1]Multiplying the weight gradientl',l∈[n-k,n-1];
Step 2-7 node Gn-k...Gn-2、Gn-1Using whole gradientl',l∈[n-k,n-1]And weightl;l∈[n-k,n-1]Multiplying to obtain net data gradinetl,l∈[n-k,n-1]Then, reducing the reduced Scatter operation;
steps 2 to 8Gn-k...Gn-2、Gn-1Exchanging bias using an improved Ring All Reduce algorithml;l∈[n-k,n-1]Obtaining the biasl';l∈[n-k,n-1];
Step 2-9 node Gn-k...Gn-2、Gn-1Subjecting the weight gradient obtained in the step 2-6 tol',l∈[n-k,n-1]With bias in step 2-8l';l∈[n-k,n-1]Multiplying to obtain bias gradientl';l∈[n-k,n-1];
Thus, the whole training process is completed.
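For comparison, the following sketch computes the backward quantities named in step 2-5 for the same column-split layer using the conventional formulas (weight gradient = net data transposed times the gradient, bias gradient = the gradient summed over the batch, input gradient = the gradient times the weight transposed). It shows the data flow that steps 2-5 to 2-9 distribute across the model-parallel nodes, not the patent's exact per-step products.

```python
import numpy as np


def fc_model_parallel_backward(x, weight_shards, grad_out_shards):
    """Backward quantities of step 2-5 for the column-split layer sketched above.

    x:               (batch, in_dim) full net data gathered in the forward pass
    weight_shards:   k arrays of shape (in_dim, out_dim // k)
    grad_out_shards: k arrays of shape (batch, out_dim // k), the output-gradient
                     slices held by the k model-parallel nodes
    """
    # Weight gradient: the output gradient multiplied by the feature net data.
    weight_grads = [x.T @ g for g in grad_out_shards]
    # Bias gradient: the output gradient summed over the batch, per slice.
    bias_grads = [g.sum(axis=0) for g in grad_out_shards]
    # Input gradient (net data gradient): gradient multiplied by the weight;
    # each node contributes a partial term, and summing the partial terms is
    # the role played by the Reduce-Scatter in step 2-7.
    x_grad = sum(g @ w.T for g, w in zip(grad_out_shards, weight_shards))
    return weight_grads, bias_grads, x_grad
```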
The invention has the following beneficial effects. The method optimizes the data transmission flow through a sub-node interval-pairing principle, yielding a Ring All Reduce data communication algorithm with improved time complexity, which improves the transmission efficiency among data-parallel devices and alleviates the bandwidth-loss problem of the traditional parameter-server parallel structure. In addition, exploiting the fact that the backbone of a typical deep learning network has fewer weight parameters than the fully connected layer and therefore a low synchronization cost, while the fully connected layer has an excessive number of weights and a high gradient-transmission cost, the backbone is processed with data parallelism and the fully connected layer with model parallelism, which addresses the difficulty that pure data parallelism has in supporting large-scale network parameters and the accompanying decay of acceleration. Compared with other methods, the final test accuracy differs little from the training accuracy and the acceleration effect decays less, so the overall effect is better; experiments also show that, compared with datasets with fewer classes such as Cifar10, the method has a larger acceleration advantage on miniImageNet and is therefore better suited to parallel training on massive data.
Drawings
FIG. 1 illustrates the Reduce-Scatter operation flow of the improved Ring All Reduce algorithm.
FIG. 2 illustrates the All Gather operation flow of the improved Ring All Reduce algorithm.
FIG. 3a compares the speed-up ratios of Ring All Reduce and its improved algorithm for Vgg16 at different Batch Sizes.
FIG. 3b compares the speed-up ratios of Ring All Reduce and its improved algorithm for Resnet50 at different Batch Sizes.
FIG. 4a shows the Cifar10 multi-node data-parallel training loss-rate curves.
FIG. 4b shows the Cifar10 multi-node data-parallel test loss-rate curves.
FIG. 5 shows the relationship between the number of parallel nodes and test time for Cifar10 multi-node data parallelism.
FIG. 6a compares the test loss-rate curves of Cifar10 data parallelism and the improved parallel method.
FIG. 6b compares the test times of Cifar10 data parallelism and the improved parallel method.
FIG. 7a compares the training times of mini ImageNet data parallelism and the parallel method herein.
FIG. 7b compares the training speed-up ratios of the parallel method herein on different data sets.
Detailed Description
Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The GPU parallel method for deep learning based on the improved Ring All Reduce algorithm adopts a classification network comprising a backbone network, a fully connected layer and Softmax, with n nodes G_0, …, G_{n-1} participating in parallel training, where n is a power of 2; in this embodiment, n-2 nodes G_0, …, G_{n-3} are responsible for backbone network data parallelism, and G_{n-2}, G_{n-1} are responsible for model parallelism.
The improved Ring All Reduce algorithm uses a total node count n that is a power of 2; in this embodiment n = 4. Input: data set d = {Batch_0, Batch_1, …, Batch_x}, each Batch_x being split into subsets Input_y, y ∈ [0, n-1] [the splitting formula appears only as an image in the original]. Output: the gradient T_y of subset Input_y.
The method comprises the following specific steps:
Step A: the input data Input_y and the initialization parameters are loaded onto the corresponding nodes G_0, G_1, G_2, …, G_{n-1};
Step B: each sub-node G_0, G_1, G_2, …, G_{n-1} evenly divides the Input_y it received into a number of parts equal to the total node count, obtaining a_0, a_1, a_2, …, a_{n-1}, as shown in Fig. 1 for n = 4;
Step C: each sub-node is paired with another sub-node at the current pairing interval q [the pairing formulas appear only as images in the original]; nodes that have been paired are not paired again;
Step D: during transmission, each sub-node G_i divides the data it stores into a first part and a second part; G_i transmits the first part to its paired sub-node, which accumulates it, and simultaneously receives the second part sent by that paired sub-node and accumulates it onto its own second part; the result is shown in Fig. 1(b);
Step E: let q = q × 2 and repeat step D until q = n; the result is shown in Fig. 1(c); the Reduce-Scatter operation is completed after log(n) steps, at which point each of G_0, G_1, G_2, …, G_{n-1} holds its own reduced portion of Batch_x, i.e. of Input_y [the fraction appears only as an image in the original];
Step F: sub-node G_i is paired with another sub-node [pairing formula shown only as an image], where q = 2^{log(n)}; nodes that have been paired are not paired again;
Step G: the paired sub-nodes interchange the accumulated data of Input_y corresponding to the first or second part and overwrite the original data of that part, obtaining the final data of that part; the result is shown in Fig. 2(b); after one round of data exchange, G_0, G_1 and G_2, G_3 have each obtained the other's accumulated Input_y data;
Step H: each sub-node merges the final data part received from its partner with its own final data part; the result is shown in Fig. 2(c);
Step I: let q take its next value [the update, presumably q = q/2, appears only as an image in the original] and repeat steps G-H until q = 1; the All Gather aggregation operation is completed after log(n) steps; at this point G_0, G_1, G_2, G_3 have all obtained the complete data;
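A quick check of the earlier improved_ring_allreduce() sketch for the n = 4 case of this embodiment (again assuming the XOR-distance pairing stated there):

```python
import numpy as np

# Four simulated nodes, each holding a different gradient vector.
chunks = [np.full(8, float(i)) for i in range(4)]        # node i holds all-i data
result = improved_ring_allreduce(chunks)
assert all(np.allclose(r, 6.0) for r in result)          # 0 + 1 + 2 + 3 = 6 everywhere
```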
the parallel method specifically comprises the following steps:
Step 1: carry out data parallelism of the backbone network, as follows.
Backbone input: data set d = {Batch_0, Batch_1, …, Batch_j}; each sample Batch_j is divided into n-2 parts to obtain subsets Input_i, i ∈ [0, n-3];
Step 1-1: G_0, …, G_{n-3} load their data simultaneously;
Step 1-2: from G_0 to G_{n-3}, features are extracted through convolution operations;
Step 1-3: from G_{n-3} to G_0, the weight gradients and bias gradients of the convolutional layers are computed;
Step 1-4: the improved Ring All Reduce algorithm is used so that G_0, …, G_{n-3} hold the same weight gradients and bias gradients;
Step 1-5: G_0, …, G_{n-3} update the network parameters net data;
Step 1-6: after several iterations, G_0, …, G_{n-3} use the improved Ring All Reduce algorithm to output the complete net data, network weight, bias and bias weight;
Step 2: carry out model parallelism of the fully connected layer, as follows.
The single-node forward-propagation parameters of the fully connected layer comprise the backbone outputs: net data, network weight, bias and bias weight.
The input of the fully connected layer is the product of net data and weight plus the product of bias and bias weight. The fully connected layer is divided according to the output dimension of the backbone network and split equally across the two nodes G_{n-2}, G_{n-1}, giving net data_l, weight_l, bias_l and bias weight_l, l ∈ [n-2, n-1], as the concrete model-parallel forward-propagation inputs;
Step 2-1: the nodes participating in model parallelism, G_{n-2} and G_{n-1}, exchange net data_{n-2} and net data_{n-1} using the improved Ring All Reduce algorithm and merge them to obtain net data_l', l ∈ [n-2, n-1];
Step 2-2: G_{n-2} and G_{n-1} compute with net data_l' and their respective weight_l and bias_l, l ∈ [n-2, n-1], and output G_{n-2} fc_output and G_{n-1} fc_output;
Step 2-3: G_{n-2} and G_{n-1} exchange bias_l, l ∈ [n-2, n-1], using the improved Ring All Reduce algorithm to obtain the aggregated bias_l', which is multiplied by bias weight_l and added to the G_l fc_output from step 2-2 to obtain the new G_{n-2} fc_output' and G_{n-1} fc_output';
Step 2-4: G_{n-2} and G_{n-1} interchange G_{n-2} fc_output' and G_{n-1} fc_output' using the improved Ring All Reduce algorithm to obtain the aggregated final G_{n-2} fc_output'' and G_{n-1} fc_output'';
Step 2-5: in the conventional back propagation of the fully connected layer, the output gradient is multiplied by the feature net data to obtain the weight gradient, the bias is multiplied by the gradient to obtain the bias gradient, and the gradient is multiplied by the weight to obtain the input gradient net data gradient;
For transmission, the parameter gradient is split equally across the model-parallel nodes to obtain gradient_l, l ∈ [n-2, n-1]; back propagation is then carried out with net data_l, weight_l and bias_l, l ∈ [n-2, n-1], as inputs. The specific flow is as follows:
G_{n-2} and G_{n-1} exchange gradient_l and net data_l, l ∈ [n-2, n-1], using the improved Ring All Reduce algorithm and combine them with the gradient and network parameters stored locally to obtain the complete gradient_l' and network parameters net data_l'', l ∈ [n-2, n-1];
Step 2-6: G_{n-2} and G_{n-1} multiply the complete gradient_l' obtained in step 2-5 by the network parameters net data_l'' to obtain the weight gradient_l', l ∈ [n-2, n-1];
Step 2-7: G_{n-2} and G_{n-1} multiply gradient_l' by weight_l to obtain net data gradient_l, l ∈ [n-2, n-1], and then perform the Reduce-Scatter operation;
Step 2-8: G_{n-2} and G_{n-1} exchange bias_l, l ∈ [n-2, n-1], using the improved Ring All Reduce algorithm to obtain bias_l', l ∈ [n-2, n-1];
Step 2-9: each node multiplies the weight gradient_l' obtained in step 2-6 by the bias_l' from step 2-8 to obtain the bias gradient_l', l ∈ [n-2, n-1];
Thus, the whole training process is completed.
Experimental data and experimental setup for this embodiment
The experiments used two common datasets, Cifar10 and mini ImageNet. Cifar10 is a small dataset for recognizing everyday objects. It contains RGB colour images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The images are 32 × 32, with 50000 training images and 10000 test images in the dataset. The full ImageNet contains about 15 million images in more than 20,000 categories. The mini ImageNet used herein consists of the first 100 categories selected from ImageNet, with a total of 129395 training images and 50000 validation images.
All experiments herein used 2080Ti GPUs in a single-machine multi-card environment. With training samples selected randomly, each group of experiments was repeated twice and the results were averaged.
Comparison of Ring All Reduce and its improved algorithm
The speed-up ratio is a common index for measuring parallel-computing performance: for the same workload it is the ratio of the time T_1 consumed by a single node to the time T_n consumed after parallel computation on multiple nodes, and how this ratio changes reflects performance. In the ideal case the speed-up ratio grows in direct proportion to the number of nodes, which is the optimal situation.
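Written explicitly in the notation above, the speed-up ratio for n parallel nodes is:

```latex
S_n = \frac{T_1}{T_n}, \qquad S_n \approx n \ \text{in the ideal (linear) case}
```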
Using the Cifar10 dataset, 100 rounds of single-machine multi-card data-parallel training were run on a two-node host, and the speed-up ratios of the Ring All Reduce algorithm and the improved algorithm relative to the parameter-server algorithm were compared at different Batch Sizes under Vgg16 and Resnet50, so as to measure their transmission performance.
Fig. 3 compares the speed-up ratios of data-parallel training with the two algorithms. For both Vgg16 and Resnet50, the speed-up ratios of the two algorithms decrease somewhat as the Batch Size increases, mainly because the larger data volume lengthens transmission time. The difference between the two algorithms is largest for Vgg16 at a Batch Size of 4, and for Resnet50 it peaks at about 25% at a Batch Size of 16, indicating that each network has an optimal Batch Size at which the improved algorithm performs best. Overall, the improved algorithm achieves better performance than Ring All Reduce, and its speed-up ratio is always greater than 1.
Comparison of Cifar10 data parallelism and the improved parallel method
First, the performance of data parallelism with different numbers of nodes is verified on the Cifar10 dataset by observing training indicators such as convergence and time.
Image classification was tested on the Resnet50 network, with data transmission implemented by the improved Ring All Reduce algorithm described herein. The training and test Batch Size was set to 64, training ran for 6000 iterations and testing for 1000 iterations, the initial learning rate was set to 0.001, and it was reduced by a factor of 10 after 3000 and 5000 iterations, respectively.
Parallel training was performed on clusters of 2, 4 and 8 nodes; the experimental results are shown in Table 1 and Fig. 4.
Table 1: comparison of multi-node Cifar10 data-parallel training loss rates at different iteration counts
[Table 1 appears only as an image in the original publication.]
Table 1 shows how the data-parallel training loss rate changes with the number of iterations for different node counts. Comparing 8 nodes with 2 nodes shows that increasing the number of parallel nodes raises the loss rate to some extent: at 1000 iterations the difference between the two is 11%, but the final difference is small, only 2% after 6000 iterations, indicating that data parallelism has little influence on the final training accuracy.
Fig. 4(a) shows the loss rate curve of the data parallel training, and fig. 4(b) shows the loss variation curve of the test on the test set, from which the following points can be obtained:
the data parallel post-training and test loss rate increases with the number of parallel nodes.
Referring to the training loss rates in Table 1 and the curves in Fig. 4(a), 8-node data parallelism converges more slowly than 2 nodes at the start of training and some differences exist; after 3000 iterations the differences shrink and stay within 3%, and at the final convergence point they do not exceed 2%. This is mainly because the data-exchange cost introduced by additional nodes reduces efficiency somewhat.
Fig. 4(b) shows that, during testing on the test set under data parallelism, at the same iteration count the 4-node curve also converges more slowly than the 8-node curve but reaches a better final convergence point, indicating a nonlinear relationship between the number of nodes and efficiency.
The trained weights were tested on a single card, and the relationship between the number of nodes and the test time on the test set, using time as the reference, is plotted in Fig. 5:
as can be seen from fig. 5, increasing the number of parallel nodes can accelerate training, which also embodies the advantages of data parallel, but it can also be seen that as the number of nodes increases, the slope of the curve increases, the acceleration effect decreases, and as the number of nodes increases, the number of data exchanges increases, which takes longer time[29-30]Eventually, linear acceleration under ideal conditions cannot be achieved.
The experimental results show that multi-node data-parallel training has little influence on the final classification accuracy, but its acceleration effect is limited to some extent, with the speed reduction most obvious for 8-node data parallelism. Therefore, with all other experimental conditions kept the same, 8-node parallel training is used to show the acceleration effect most clearly: the loss-rate comparison after 1000 test iterations on the test set is shown in Fig. 6(a), and the test-time comparison in Fig. 6(b).
As Fig. 6(a) shows, the final convergence of the parallel method proposed herein is essentially the same as that of the data-parallel strategy, so the method has little impact on network accuracy. Fig. 6(b) shows that, compared with data parallelism, the acceleration improves slightly, with the test time shortened from 9 minutes to 7.8 minutes. This is mainly because the content communicated in model parallelism is the feature map while in data parallelism it is the parameters; the strategy herein parallelizes the fully connected layer with the model and the rest with the data. For Cifar10, which has few classes, the total parameter count of the fully connected layer differs little from the feature-map size, so the improvement is limited; when the network's feature map is much smaller than the parameter count, communication performance can be better. The next subsection uses mini ImageNet, which has more categories and a larger data volume, for verification.
Comparison of mini ImageNet data parallelism with the improved method
In this subsection the mini ImageNet dataset is used to further verify the performance of the proposed parallel strategy. Resnet50 is used as the training network, and the experimental environment is a single-machine multi-card GPU cluster, which minimizes the influence of bandwidth on data transmission. With the improved Ring All Reduce algorithm proposed herein as the cluster communication mode, the training times of the data-parallel and improved parallel methods are compared on 1, 2, 4 and 8 nodes to verify that the method further improves acceleration.
The experimental Batch Size is set to 64, and the results after 90 iterations are shown in Table 2, where Acc denotes the training accuracy and T the training completion time.
Table 2: multi-node data parallelism vs. the parallel method herein on the mini ImageNet dataset (Acc: %; T: HH:MM, hours:minutes)
[Table 2 appears only as an image in the original publication.]
Analysis of Table 2 shows that on the mini ImageNet dataset, for both the data-parallel method and the parallel method herein, training time shortens steadily as the number of nodes increases, while accuracy drops to some extent, by about 2% compared with single-card training, matching the conclusion obtained on the Cifar10 dataset. Combining Table 2 and Fig. 7(a), the acceleration of the method provided by the invention is better than that of the data-parallel method.
Comparing the slopes of the polylines in Fig. 7(a) shows that the method proposed herein has a more significant advantage in multi-node parallelism. Fig. 7(b) shows the speed-up ratios of this method when training Cifar10 and mini ImageNet with the same number of nodes: with 2 parallel nodes, the speed-up obtained on Cifar10 is slightly better than on mini ImageNet; as the number of nodes increases, both speed-up ratios keep improving, but the speed-up on mini ImageNet grows by a larger margin, and the gap between the two becomes more obvious.
The above embodiments merely illustrate the principles and effects of the present invention and are illustrative rather than restrictive; it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept of the present invention, and such changes and modifications fall within the protection scope of the present invention.

Claims (1)

1. The GPU parallel method for deep learning based on the improved Ring All Reduce algorithm, which adopts a classification network comprising a backbone network, a fully connected layer and Softmax, with n nodes G_0, …, G_{n-1} participating in parallel training, where n is a power of 2; an even number n-k of the nodes, G_0, …, G_{n-k-1}, are responsible for backbone network data parallelism, and G_{n-k}, …, G_{n-1} are responsible for model parallelism;
wherein the improved Ring All Reduce algorithm uses a total node count n that is a power of 2; input: data set d = {Batch_0, Batch_1, …, Batch_x}, each Batch_x being split into subsets Input_y, y ∈ [0, n-1] [the splitting formula appears only as an image in the original]; output: the gradient T_y of subset Input_y;
The method comprises the following specific steps:
Step A: the input data Input_y and the initialization parameters are loaded onto the corresponding nodes G_0, G_1, G_2, …, G_{n-1};
Step B: each sub-node G_0, G_1, G_2, …, G_{n-1} divides the Input_y it received into n parts;
Step C: each sub-node is paired with another sub-node at the current pairing interval q [the pairing formulas appear only as images in the original]; nodes that have been paired are not paired again;
Step D: during transmission, each sub-node G_i divides the data it stores into a first part and a second part; G_i transmits the first part to its paired sub-node, which accumulates it, and simultaneously receives the second part sent by that paired sub-node and accumulates it onto its own second part;
Step E: let q = q × 2 and repeat step D until q = n; the Reduce-Scatter operation is completed after log(n) steps, at which point each of G_0, G_1, G_2, …, G_{n-1} holds its own reduced portion of Batch_x, i.e. of Input_y [the fraction appears only as an image in the original];
Step F: sub-node G_i is paired with another sub-node [pairing formula shown only as an image], where q = 2^{log(n)}; nodes that have been paired are not paired again;
Step G: the paired sub-nodes interchange the accumulated data of Input_y corresponding to the first or second part and overwrite the original data of that part, obtaining the final data of that part;
Step H: each sub-node merges the final data part received from its partner with its own final data part;
Step I: let q take its next value [the update, presumably q = q/2, appears only as an image in the original] and repeat steps G-H until q = 1; the All Gather aggregation operation is completed after log(n) steps;
the parallel method specifically comprises the following steps:
Step 1: carry out data parallelism of the backbone network. Backbone input: data set d = {Batch_0, Batch_1, …, Batch_j}; each sample Batch_j is divided into n-k parts to obtain subsets Input_j, j ∈ [0, n-k-1];
Step 1-1: G_0, …, G_{n-k-1} load their data simultaneously;
Step 1-2: from G_0 to G_{n-k-1}, features are extracted through convolution operations;
Step 1-3: from G_{n-k-1} to G_0, the weight gradients and bias gradients of the convolutions are computed;
Step 1-4: after the gradients are computed, G_0, …, G_{n-k-1} use the improved Ring All Reduce algorithm to exchange, aggregate and average them and output T_i; at this point G_0, …, G_{n-k-1} hold the same gradient information;
Step 1-5: G_0, …, G_{n-k-1} use T_i to update the network parameters net data;
Step 1-6: after several iterations, G_0, …, G_{n-k-1} use the improved Ring All Reduce algorithm to output the complete net data, network weight, bias and bias weight;
Step 2: carry out model parallelism of the fully connected layer, using the backbone outputs net data, network weight, bias and bias weight;
Step 2-1: the forward propagation of the fully connected layer is the product of net data and weight plus the product of bias and bias weight; the fully connected layer is divided according to the output dimension of the backbone network and split equally across the k nodes G_{n-k}, …, G_{n-2}, G_{n-1}, giving the model-parallel forward-propagation inputs: net data_l, weight_l, bias_l and bias weight_l, l ∈ [n-k, n-1];
the nodes participating in model parallelism, G_{n-k}, …, G_{n-2}, G_{n-1}, exchange net data_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm and merge it to obtain net data_l', l ∈ [n-k, n-1];
Step 2-2: G_{n-k}, …, G_{n-2}, G_{n-1} compute with net data_l' and their respective weight_l and bias_l, l ∈ [n-k, n-1], and output G_l fc_output, l ∈ [n-k, n-1];
Step 2-3: G_{n-k}, …, G_{n-2}, G_{n-1} exchange bias_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm; after aggregation it is multiplied by bias weight_l and added to the G_l fc_output from step 2-2 to obtain a new G_l fc_output', l ∈ [n-k, n-1];
Step 2-4: G_{n-k}, …, G_{n-2}, G_{n-1} interchange their respective G_l fc_output', l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm and merge them to obtain the final G_l fc_output'', l ∈ [n-k, n-1];
Step 2-5: in the back propagation of the fully connected layer, the gradient is multiplied by the feature net data to obtain the weight gradient, the bias is multiplied by the gradient to obtain the bias gradient, and the gradient is multiplied by the weight to obtain the input gradient net data gradient;
for transmission, the parameter gradient is split equally across the model-parallel nodes to obtain gradient_l, l ∈ [n-k, n-1]; back propagation is then carried out with net data_l, weight_l and bias_l, l ∈ [n-k, n-1], as inputs; the specific flow is as follows:
G_{n-k}, …, G_{n-2}, G_{n-1} exchange gradient_l and net data_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm to obtain the complete gradient_l' and network parameters net data_l'', l ∈ [n-k, n-1];
Step 2-6: G_{n-k}, …, G_{n-2}, G_{n-1} multiply the complete gradient_l' obtained in step 2-5 by the network parameters net data_l'' to obtain the weight gradient_l', l ∈ [n-k, n-1];
Step 2-7: the nodes G_{n-k}, …, G_{n-2}, G_{n-1} multiply the complete gradient_l' by weight_l to obtain net data gradient_l, l ∈ [n-k, n-1], and then perform the Reduce-Scatter operation;
Step 2-8: G_{n-k}, …, G_{n-2}, G_{n-1} exchange bias_l, l ∈ [n-k, n-1], using the improved Ring All Reduce algorithm to obtain bias_l', l ∈ [n-k, n-1];
Step 2-9: the nodes G_{n-k}, …, G_{n-2}, G_{n-1} multiply the weight gradient_l' obtained in step 2-6 by the bias_l' from step 2-8 to obtain the bias gradient_l', l ∈ [n-k, n-1];
Thus, the whole training process is completed.
CN202111073054.7A, filed 2021-09-14: Deep learning-oriented GPU parallel method based on improved Ring All Reduce algorithm (Active; granted as CN113961337B)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111073054.7A CN113961337B (en) 2021-09-14 Deep learning-oriented GPU parallel method based on improved Ring All Reduce algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111073054.7A CN113961337B (en) 2021-09-14 Deep learning-oriented GPU parallel method based on improved Ring All Reduce algorithm

Publications (2)

Publication Number Publication Date
CN113961337A 2022-01-21
CN113961337B 2024-05-10


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
US20200118000A1 (en) * 2018-10-10 2020-04-16 NEC Laboratories Europe GmbH Method and system for distributed deep learning
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
US20210211787A1 (en) * 2020-01-03 2021-07-08 Microsoft Technology Licensing, Llc Distributed processing architecture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460457A (en) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks
US20200118000A1 (en) * 2018-10-10 2020-04-16 NEC Laboratories Europe GmbH Method and system for distributed deep learning
CN110379416A (en) * 2019-08-15 2019-10-25 腾讯科技(深圳)有限公司 A kind of neural network language model training method, device, equipment and storage medium
US20210211787A1 (en) * 2020-01-03 2021-07-08 Microsoft Technology Licensing, Llc Distributed processing architecture
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王裕民; 顾乃杰; 张孝慈: "Parallel algorithm for convolutional neural networks in a multi-GPU environment" (多GPU环境下的卷积神经网络并行算法), 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 03, 15 March 2017 (2017-03-15) *

Similar Documents

Publication Publication Date Title
EP3540652B1 (en) Method, device, chip and system for training neural network model
US8543517B2 (en) Distributed decision tree training
CN111224905B (en) Multi-user detection method based on convolution residual error network in large-scale Internet of things
CN108108814A (en) A kind of training method of deep neural network
WO2020233709A1 (en) Model compression method, and device
CN112288087A (en) Neural network pruning method and device, electronic equipment and storage medium
CN106778015A (en) One kind is based on FPGA isomery accelerated gene computational methods in cloud platform
CN116089883B (en) Training method for improving classification degree of new and old categories in existing category increment learning
CN112215199A (en) SAR image ship detection method based on multi-receptive-field and dense feature aggregation network
CN114219824A (en) Visible light-infrared target tracking method and system based on deep network
CN115170874A (en) Self-distillation implementation method based on decoupling distillation loss
Singh et al. Hetconv: Beyond homogeneous convolution kernels for deep cnns
WO2022265573A2 (en) Automatically and efficiently generating search spaces for neural network
Li et al. Dlw-nas: Differentiable light-weight neural architecture search
CN113961337A (en) Improved Ring All Reduce algorithm-based deep learning-oriented GPU parallel method
CN106294429A (en) Repeat data identification method and device
CN113961337B (en) Deep learning-oriented GPU parallel method based on improved Ring All Reduce algorithm
Khoa et al. SplitDyn: Federated split neural network for distributed edge AI applications
CN116167436A (en) Neural network pipeline parallel training method for optimizing model division
CN115630398A (en) Personalized differential privacy protection method, device and system based on small sample data
CN113033653B (en) Edge-cloud cooperative deep neural network model training method
He et al. A fast simulated annealing strategy for community detection in complex networks
CN115147870A (en) Pedestrian re-identification method and device
CN115829029A (en) Channel attention-based self-distillation implementation method
CN115035408A (en) Unmanned aerial vehicle image tree species classification method based on transfer learning and attention mechanism

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant