CN114298277B - Distributed deep learning training method and system based on layer sparsification - Google Patents

Distributed deep learning training method and system based on layer sparsification

Info

Publication number
CN114298277B
CN114298277B
Authority
CN
China
Prior art keywords
layer
list
window
neural network
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111627780.9A
Other languages
Chinese (zh)
Other versions
CN114298277A (en)
Inventor
吕建成
胡宴箐
叶庆
张钟宇
郎九霖
田煜鑫
吕金地
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202111627780.9A
Publication of CN114298277A
Application granted
Publication of CN114298277B
Legal status: Active (current)
Anticipated expiration


Abstract

The application discloses a distributed deep learning training method and system based on layer sparsification, which belong to the technical field of communication sparsification for distributed training and comprise the following steps: obtaining a normalized window center list according to the convergence characteristics of the neural network model; obtaining a layer transmission list by using a layer sparsification method and the normalized window center list; and performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters, thereby completing the distributed deep learning training based on layer sparsification. The application solves the problem that existing training frameworks only sparsify within network layers, effectively increases the degree of sparsification, and reduces communication traffic.

Description

Distributed deep learning training method and system based on layer sparsification
Technical Field
The application belongs to the technical field of distributed training communication sparsification, and particularly relates to a distributed deep learning training method and system based on layer sparsification.
Background
In recent years, as deep learning has continued to evolve, models have become larger and more complex, and training these large, complex models on a single machine is time consuming. To reduce training time, distributed training methods have been proposed to accelerate model training. Distributed training mainly includes two approaches: model parallelism and data parallelism. Model parallelism splits the model parameters across different computing nodes and is difficult to accelerate because it suffers from problems such as unbalanced parameter sizes across layers and strong computational dependence between nodes. Data parallelism, in contrast, splits the training dataset while each computing node maintains a complete copy of the model; in each iteration, every node computes local gradients on different training data and then exchanges them.
When the computing nodes exchange local gradients, distributed scaling is limited by the large number of parameters, limited network bandwidth, and similar factors. To overcome this bottleneck, two different methods of reducing communication traffic have been proposed: sparsification and quantization. Sparsification aims to reduce the number of elements transmitted in each iteration, setting most elements to zero and transmitting only the most valuable gradients for parameter updates so as to preserve training convergence.
In the basic synchronous distributed training framework SSGD (Synchronous Stochastic Gradient Descent), which has no communication compression, each computing node must wait for all nodes to finish transmitting all parameters in the current iteration; the load caused by this excessive communication becomes the biggest bottleneck, and in practice, with limited computing resources, the method can only be applied to the training of relatively small models. The distributed deep learning training framework DGC (Deep Gradient Compression), which performs deep communication compression within each network layer, adds many existing auxiliary techniques to overcome the loss caused by intra-layer gradient sparsification and greatly reduces the gradient traffic between nodes; however, DGC still has to transmit every network layer of the model in each iteration, so a bottleneck remains for networks with greater depth.
Disclosure of Invention
Aiming at the above defects in the prior art, the present application applies the convergence characteristics of neural network models to distributed training and provides a distributed deep learning training method and system based on layer sparsification, which solve the problem that existing training frameworks only sparsify within network layers, effectively increase the degree of sparsification, and reduce communication traffic.
In order to achieve the aim of the application, the application adopts the following technical scheme:
the application provides a distributed deep learning training method based on layer sparsification, which comprises the following steps:
s1, obtaining a normalized window center list according to convergence characteristics of a neural network model;
s2, obtaining a layer transmission list by using a layer sparsification method and a normalized window center list;
and S3, performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters, thereby completing the distributed deep learning training based on layer sparsification.
The beneficial effects of the application are as follows: in the distributed deep learning training method based on layer sparsification, because different layers have different critical learning periods during neural network training, only the network layers that currently need to be learned are selected for communication synchronization while the other layers accumulate their gradients locally; when gradients are transmitted between nodes, only the gradients of a subset of layers are communicated, which effectively reduces the communication traffic.
Further, the step S1 includes the steps of:
s11, setting all layers of the neural network as a layer continuous sequence, and setting the total training times of the neural network;
s12, setting a dynamic window according to convergence characteristics of the neural network model, wherein the dynamic window traverses a layer continuous sequence from back to front along with the increase of the training times of the neural network, and sets the whole traversing times of the dynamic window;
s13, calculating the training times of the neural network and the training times of the rest neural network in the process of single traversal of the dynamic window by the neural network according to the total training times of the neural network and the integral traversal times of the dynamic window;
s14, obtaining a normalized step length of the dynamic window traversing movement according to the training times of the neural network in the process of the dynamic window single traversing of the neural network;
s15, based on the normalization step length, iterating the integral traversal times of the dynamic window and the training times of the neural network in the process of single traversal of the dynamic window on the neural network model to obtain a whole period window center list;
s16, judging whether the training times of the rest neural network are zero, if so, normalizing the whole period window center list to serve as a normalized window center list, otherwise, entering a step S17;
s17, iterating the training times of the residual neural network based on the normalized step length of the dynamic window traversal movement to obtain a residual window center list;
and S18, adding a residual window center list at the tail of the whole period window center list to serve as a normalized window center list.
The beneficial effects of adopting the further scheme are as follows: a dynamic window is established according to the convergence characteristics of the neural network model; in the early stage of training the network layers at the bottom of the model are mainly transmitted, and as the training period increases the key transmission layers move upward. Since the gradients of different layers are not equally important, this provides a foundation for layer sparsification in distributed training, and the multiple traversals effectively prevent the selection probability of some layers from becoming too low.
Further, the specific steps of the step S2 are as follows:
s21, acquiring a warm-up period and a normalized window center list of the neural network;
s22, normalizing the sequence numbers of network layers in the neural network according to a layer sparsification method to obtain a normalized layer sequence number list;
s23, transmitting all parameters of all layers of the neural network when the current training period is in a warm-up period, otherwise, entering step S24;
s24, obtaining a dynamic window list and a sampling list in a current training period window according to the normalized window center list and the normalized layer sequence number list;
s25, obtaining a sampling list outside the window of the current training period according to the normalized layer sequence number list and the dynamic window list;
s26, combining the sampling list in the current training period window and the sampling list outside the current training period window to obtain a layer transmission list.
The beneficial effects of adopting the further scheme are as follows: in the early stage of neural network training the dynamic window sits at the bottommost layers of the model and gradually slides toward the top layers as training proceeds; setting a preset sampling proportion for the network layers inside the dynamic window and another for the layers outside it reduces the amount of transmitted data, an advantage that becomes obvious when the model is deep enough.
Further, the step S24 includes the steps of:
s241, acquiring a window center of a current training period according to the normalized window center list;
s242, taking the window center and the normalized layer sequence number list of the current training period as an expected and independent variable list respectively, and calculating to obtain a standard normal distribution list;
s243, selecting a sequence number of a network layer corresponding to a preset quantity of the head of the standard normal distribution list to obtain a dynamic window list;
s244, randomly and uniformly sampling a dynamic window list to preset a proportion k in And obtaining a sampling list in the window of the current training period.
The beneficial effects of adopting the further scheme are as follows: by randomly and uniformly sampling the dynamic window list, the transmission data volume is effectively reduced, and when the model is deep enough, the advantages are obvious.
Further, the step S25 includes the steps of:
s251, obtaining a dynamic window external list according to the normalized layer sequence number list and the dynamic window list;
s252, randomly and uniformly sampling a preset proportion k for the external list of the dynamic window out And obtaining a sampling list outside the window of the current training period.
The beneficial effects of adopting the further scheme are as follows: by randomly and uniformly sampling the neural network layer outside the dynamic window list, the transmission data volume is effectively reduced, and when the model is deep enough, the advantage is obvious.
Further, the step S3 includes the following steps:
s31, obtaining a neural network layer gradient list through feedforward and feedback calculation of a neural network sample;
s32, traversing each layer in the neural network layer gradient list layer by layer, judging whether each layer is in the layer transmission list, if so, obtaining a plurality of selected layers, and proceeding to a step S33, otherwise, obtaining a plurality of local accumulated gradients;
s33, judging whether each selected layer has intra-layer compression, if so, obtaining a plurality of selected layers with intra-layer compression, and proceeding to step S34, otherwise, obtaining a plurality of selected layer transmission gradients without intra-layer compression;
s34, sequentially carrying out intra-layer sparsification, inter-node communication and decompression synchronization on the inner part of each intra-layer compression selected layer to obtain a plurality of intra-layer compression selected layer transmission gradients;
s35, carrying out global average on each local accumulated gradient, each selected layer transmission gradient without intra-layer compression or each selected layer transmission gradient with intra-layer compression to obtain a complete gradient;
and S36, obtaining weight updating parameters according to the complete gradient, and completing the distributed deep learning training based on layer sparsification.
The beneficial effects of adopting the further scheme are as follows: depending on whether each layer of the neural network is in the layer transmission list and whether intra-layer compression is applied, the transmission gradients are obtained through local accumulation, direct transmission, or inter-node communication respectively; the complete gradient is then obtained by global-average gradient fusion, the weight update parameters are obtained from the complete gradient, and the distributed deep learning training based on layer sparsification is completed.
The application also provides a system of the distributed deep learning training method based on layer sparsification, which comprises:
the normalized window center list acquisition module is used for acquiring a normalized window center list according to the convergence characteristic of the neural network model;
the layer transmission list acquisition module is used for acquiring a layer transmission list by using a layer thinning method and a normalized window center list;
and the distributed deep learning training module based on layer sparsification, which performs distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters and completes the distributed deep learning training based on layer sparsification.
The beneficial effects of the application are as follows: the system of the distributed deep learning training method based on layer sparsification is the system provided in correspondence with that method and is used to realize the distributed deep learning training method based on layer sparsification.
Drawings
Fig. 1 is a flowchart illustrating steps of a distributed deep learning training method based on layer sparsification according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a window center moving along with a training period in an embodiment of the present application.
FIG. 3 is a diagram illustrating dynamic window movement in accordance with an embodiment of the present application.
Fig. 4 is a schematic diagram of a dynamic window list obtained according to a standard normal distribution list in an embodiment of the present application.
FIG. 5 is a diagram illustrating an all_reduce loop transmission method according to an embodiment of the present application.
FIG. 6 is a schematic diagram of the time-consuming training of Resnet110 models in the Cifar10 and Cifar100 datasets using the DGC and LS-DGC frameworks, respectively, in an embodiment of the present application.
FIG. 7 is a schematic diagram of the time consuming training of the Resnet18 model in the Cifar10 dataset with SSGD and LS-SSGD frameworks, respectively, and the time consuming training of the Resnet50 model in the Cifar100 dataset with SSGD and LS-SSGD frameworks, respectively, in an embodiment of the present application.
FIG. 8 is a schematic diagram of the convergence of Resnet18 model on Cifar10 using SSGD and LS-SSGD frameworks, respectively, and the convergence of Resnet50 model on Cifar100 using SSGD and LS-SSGD frameworks, respectively, according to an embodiment of the present application.
Fig. 9 is a block diagram of a system of a distributed deep learning training method based on layer sparsification in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is provided to facilitate understanding of the present application by those skilled in the art, but it should be understood that the present application is not limited to the scope of these embodiments; for those skilled in the art, all applications that make use of the inventive concept fall within the spirit and scope of the present application as defined in the appended claims.
Based on the conclusions about neural network model convergence characteristics obtained from deep representation learning dynamics, this scheme provides a distributed deep learning training method and system based on layer sparsification. During neural network training the critical learning periods of different layers differ, which makes gradient sparsification from the perspective of network layers feasible: the network layers that currently need to be learned are communicated and synchronized while the other layers accumulate locally. Considering that the layers at the bottom of the model are mainly transmitted in the early stage of training, and that the key transmission layers move upward as the training period increases, also supports layer sparsification for distributed training.
Example 1
As shown in fig. 1, an embodiment of the present application provides a distributed deep learning training method based on layer sparsification, including the following steps:
s1, obtaining a normalized window center list according to convergence characteristics of a neural network model;
according to the convergence characteristics of the neural network model, a dynamic window is established; in the early stage of training the network layers at the bottom of the model are mainly transmitted, and as the training period increases the key transmission layers move upward; since the gradients of different layers are not equally important, this provides a foundation for layer sparsification in distributed training;
s2, obtaining a layer transmission list by using a layer sparsification method and a normalized window center list;
in the early stage of neural network training the dynamic window sits at the bottommost layers of the model and gradually slides toward the top layers as training proceeds; setting a preset sampling proportion for the network layers inside the dynamic window and another for the layers outside it reduces the amount of transmitted data, an advantage that becomes obvious when the model is deep enough;
and S3, performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters, thereby completing the distributed deep learning training based on layer sparsification.
Depending on whether each layer of the neural network is in the layer transmission list and whether intra-layer compression is applied, the transmission gradients are obtained through local accumulation, direct transmission, or inter-node communication respectively; the complete gradient is then obtained by global-average gradient fusion, the weight update parameters are obtained from the complete gradient, and the distributed deep learning training based on layer sparsification is completed.
The beneficial effects of the application are as follows: in the distributed deep learning training method based on layer sparsification, because different layers have different critical learning periods during neural network training, only the network layers that currently need to be learned are selected for communication synchronization while the other layers accumulate their gradients locally; when gradients are transmitted between nodes, only the gradients of a subset of layers are communicated, which effectively reduces the communication traffic.
Example 2
For step S1 of embodiment 1, it includes the following substeps S11 to S18:
s11, setting all layers of the neural network as a layer continuous sequence, and setting the total training times of the neural network;
s12, setting a dynamic window according to convergence characteristics of the neural network model, wherein the dynamic window traverses a layer continuous sequence from back to front along with the increase of the training times of the neural network, and sets the whole traversing times of the dynamic window;
s13, calculating the training times of the neural network and the training times of the rest neural network in the process of single traversal of the dynamic window by the neural network according to the total training times of the neural network and the integral traversal times of the dynamic window;
s14, obtaining a normalized step length of the dynamic window traversing movement according to the training times of the neural network in the process of the dynamic window single traversing of the neural network;
s15, based on the normalization step length, iterating the integral traversal times of the dynamic window and the training times of the neural network in the process of single traversal of the dynamic window on the neural network model to obtain a whole period window center list;
s16, judging whether the training times of the rest neural network are zero, if so, normalizing the whole period window center list to serve as a normalized window center list, otherwise, entering a step S17;
s17, iterating the training times of the residual neural network based on the normalized step length of the dynamic window traversal movement to obtain a residual window center list;
s18, adding a remaining window center list at the tail of the whole period window center list to serve as a normalized window center list;
as shown in fig. 2, suppose the neural network model has 300 layers and is trained for 103 epochs, during which the dynamic window makes 4 whole traversals of the model plus 1 partial traversal (the remainder of 103/4); the total number of training epochs is 103, the number of whole traversals is 4, the number of training epochs within a single traversal is 25, the number of remaining epochs is 3, and the normalized step length of the dynamic window movement is 1/24 (counting from 0); because the traversals do not divide the training exactly, a similar iteration is performed over the remaining training epochs to generate the remaining window centers, which are then appended at the tail to obtain the normalized window center list.
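As a concrete illustration of steps S11 to S18, the following Python sketch builds such a normalized window center list. It is not the patented implementation: the 0-to-1 normalization of each traversal, the sweep direction, and the handling of the leftover epochs are assumptions inferred from the example above, and the function name is hypothetical.

def window_center_list(total_epochs, num_traversals):
    # S13: epochs within a single traversal and the leftover epochs
    epochs_per_pass = total_epochs // num_traversals
    remainder = total_epochs % num_traversals
    # S14: normalized step length of the window movement (counting from 0)
    step = 1.0 / max(1, epochs_per_pass - 1)
    centers = []
    # S15: centers for the whole traversals, one center per training epoch
    for _ in range(num_traversals):
        centers.extend(step * i for i in range(epochs_per_pass))
    # S16-S18: a partial traversal for the remaining epochs, appended at the tail
    if remainder:
        centers.extend(step * i for i in range(remainder))
    return centers

With the numbers of this example, window_center_list(103, 4) returns 103 centers: 25 per whole traversal with step 1/24, plus 3 centers for the partial traversal.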
Example 3
For step S2 in embodiment 1, it includes the following substeps S21 to S26:
s21, acquiring a warm-up period and a normalized window center list of the neural network;
the warm-up period is generally set to the first 5 epochs of neural network training, and no sparsification is carried out during the warm-up period to prevent the neural network from moving in the wrong direction;
s22, normalizing the sequence numbers of network layers in the neural network according to a layer sparsification method to obtain a normalized layer sequence number list;
s23, transmitting all parameters of all layers of the neural network when the current training period is in a warm-up period, otherwise, entering step S24;
s24, obtaining a dynamic window list and a sampling list in a current training period window according to the normalized window center list and the normalized layer sequence number list;
the step S24 includes the steps of:
s241, acquiring a window center of a current training period according to the normalized window center list;
s242, taking the window center and the normalized layer sequence number list of the current training period as an expected and independent variable list respectively, and calculating to obtain a standard normal distribution list;
s243, selecting a sequence number of a network layer corresponding to a preset quantity of the head of the standard normal distribution list to obtain a dynamic window list;
s244, randomly and uniformly sampling a dynamic window list to preset a proportion k in Obtaining a sampling list in a window of a current training period;
s25, obtaining a sampling list outside the window of the current training period according to the normalized layer sequence number list and the dynamic window list;
the step S25 includes the steps of:
s251, obtaining a dynamic window external list according to the normalized layer sequence number list and the dynamic window list;
s252, randomly and uniformly sampling a preset proportion k for the external list of the dynamic window out Obtaining a sampling list outside a window of a current training period;
s26, combining the sampling list in the current training period window and the sampling list outside the current training period window to obtain a layer transmission list;
as shown in FIG. 3, in this embodiment the neural network has 20 layers; in the early stage of training the window sits at the lowest layers of the model and gradually slides toward the top layers as training proceeds; the dynamic window list is randomly and uniformly sampled at a preset proportion k_in = 50% and the list outside the dynamic window is randomly and uniformly sampled at a preset proportion k_out, giving an overall model compression ratio of k = 20% and further reducing the amount of transmitted data; the advantage of this approach is evident when the model is deep enough. No sparsification is carried out during the warm-up period, which prevents the model from moving in the wrong direction. The window center of the current training period is obtained from the normalized window center list; the window center of the current training period and the normalized layer serial-number list are taken as the expectation and the argument list respectively, and the standard normal distribution list is calculated; the layer serial numbers corresponding to the top 20% of the standard normal distribution list are then selected as the dynamic window list.
As shown in fig. 4, the peak of the standard normal distribution lies at the window center and only the top 20% is intercepted; the corresponding abscissas form the dynamic window list for the current window center, namely a sequence of network layers, and random uniform sampling within the dynamic window list gives the sampling list inside the window of the current training period; the top-20% dynamic window is then removed and random uniform sampling of the remaining layers gives the sampling list outside the window of the current training period, and merging the two sampling lists gives the layer transmission list.
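The window construction and sampling of steps S21 to S26 can be sketched as follows in Python with NumPy. This is a hypothetical helper rather than the patented code: the 20% window fraction, the unit-variance normal density, and the warm-up handling are assumptions taken from the description above.

import numpy as np

def layer_transmission_list(num_layers, center, k_in, k_out, window_frac=0.2,
                            epoch=None, warmup_epochs=5, rng=None):
    rng = rng or np.random.default_rng()
    # S23: during the warm-up period every layer is transmitted
    if epoch is not None and epoch < warmup_epochs:
        return list(range(num_layers))
    # S22: normalized layer serial numbers in [0, 1]
    layer_pos = np.arange(num_layers) / (num_layers - 1)
    # S242: normal density (up to a constant factor) with the window center as its mean
    density = np.exp(-0.5 * (layer_pos - center) ** 2)
    # S243: the layers with the top `window_frac` densities form the dynamic window
    window_size = max(1, int(round(window_frac * num_layers)))
    window = np.argsort(density)[::-1][:window_size]
    # S244: uniform sampling at proportion k_in inside the window
    inside = rng.choice(window, max(1, int(round(k_in * len(window)))), replace=False)
    # S251/S252: uniform sampling at proportion k_out outside the window
    outside_pool = np.setdiff1d(np.arange(num_layers), window)
    outside = rng.choice(outside_pool, int(round(k_out * len(outside_pool))), replace=False)
    # S26: merge the two sampling lists into the layer transmission list
    return sorted(np.concatenate([inside, outside]).tolist())

For the 20-layer example with the window center near 0, k_in = 0.5 and an illustrative k_out of 0.1 (chosen arbitrarily here), roughly two in-window layers and one or two out-of-window layers would be selected per training period.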
Example 4
For step S3 in embodiment 1, it includes the following substeps S31 to S36:
s31, obtaining a neural network layer gradient list through feedforward and feedback calculation of a neural network sample;
s32, traversing each layer in the neural network layer gradient list layer by layer, judging whether each layer is in the layer transmission list, if so, obtaining a plurality of selected layers, and proceeding to a step S33, otherwise, obtaining a plurality of local accumulated gradients;
s33, judging whether each selected layer has intra-layer compression, if so, obtaining a plurality of selected layers with intra-layer compression, and proceeding to step S34, otherwise, obtaining a plurality of selected layer transmission gradients without intra-layer compression;
s34, sequentially carrying out intra-layer sparsification, inter-node communication and decompression synchronization on the inner part of each intra-layer compression selected layer to obtain a plurality of intra-layer compression selected layer transmission gradients;
s35, carrying out global average on each local accumulated gradient, each selected layer transmission gradient without intra-layer compression or each selected layer transmission gradient with intra-layer compression to obtain a complete gradient;
s36, obtaining weight updating parameters according to the complete gradient, and completing distributed deep learning training based on layer sparsity;
in this embodiment, when intra-layer compression is selected for the layers, layer sparsification is performed before intra-layer sparsification and inter-node communication by the same method as step S2 of the above distributed deep learning training method based on layer sparsification, and the gradient layers to be communicated are selected, which avoids the sparsification overhead of the neural network layers that do not take part in inter-node communication. During the warm-up period all layers are transmitted without layer sparsification to prevent the training from moving in the wrong direction; after the degree of intra-layer sparsification has stabilized, the layer sparsification strategy is invoked to reduce the amount of inter-node communication. As shown in fig. 5, when inter-node communication is carried out for each intra-layer-compressed selected layer, the all_reduce ring (loop) transmission method is adopted: cyclic transmission between nodes is performed first, and after 4 rounds of transmission each node holds the gradient information of all other nodes; the gradients are then averaged to obtain the average gradient, and after decompression the transmission gradients of the intra-layer-compressed selected layers are obtained.
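A minimal sketch of the per-layer decision in steps S31 to S36, written against the PyTorch distributed API used in the experiments below, is given here. The residual bookkeeping, the treatment of parameters as a stand-in for network layers, and the point at which intra-layer compression would be inserted are assumptions, and the function name is hypothetical.

import torch
import torch.distributed as dist

def sparse_layer_step(model, transmission_list, residuals):
    # residuals: dict mapping parameter index -> locally accumulated gradient (or 0.0)
    world_size = dist.get_world_size()
    # S32: traverse the layer gradient list (parameters stand in for layers here)
    for idx, param in enumerate(model.parameters()):
        if param.grad is None:
            continue
        grad = param.grad + residuals.get(idx, 0.0)  # add any locally accumulated gradient
        if idx in transmission_list:
            # S33/S34: an intra-layer sparsification step (e.g. top-k) could be applied
            # here before communication; this sketch sends the dense layer gradient.
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # ring-style inter-node exchange
            param.grad = grad / world_size               # S35: global average -> complete gradient
            residuals[idx] = 0.0
        else:
            # unselected layer: keep accumulating locally and skip this update
            residuals[idx] = grad
            param.grad = None
    # S36: the caller then runs optimizer.step() to update the weights

The process group must have been initialized beforehand (for example with torch.distributed.init_process_group), and the outer training loop supplies the layer transmission list computed as in step S2.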
Example 5
In a practical example of the application, experiments are carried out on two image classification datasets, the simpler Cifar10 dataset and the more complex Cifar100 dataset, to demonstrate the effectiveness of the distributed deep learning training method based on layer sparsification. Cifar10 consists of 50000 training images and 10000 validation images in 10 classes, while Cifar100 contains 100 classes with 500 training images and 100 test images each. All experiments were based on the PyTorch distributed training framework and were run on a machine with 4 GeForce RTX 3090 graphics cards.
First, a training experiment with the layer-sparsified distributed framework (LS-DGC) based on deep gradient compression (DGC) is performed; out of 164 training periods, the warm-up period is still set to 4 periods. Moreover, since DGC already compresses up to 99% within each layer, the layer-level selection ratio is not set too low: this scheme adopts 20% of the model layer sequence as the size of the sliding window, all layers inside the window are selected (k_in = 100%), and k_out = 20% is set outside the window to prevent the gradients outside the window from becoming stale, so the traffic of the layer-sparsified DGC algorithm is reduced to around 36% of the original algorithm.
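The 36% figure can be reproduced by weighting the two sampling ratios by the window size; the following check is a sketch, not code from the patent:

def traffic_ratio(window_frac, k_in, k_out):
    # fraction of network layers transmitted per iteration, relative to sending every layer
    return window_frac * k_in + (1.0 - window_frac) * k_out

print(traffic_ratio(0.20, 1.00, 0.20))  # LS-DGC setting: 0.2*1.0 + 0.8*0.2 = 0.36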
As shown in fig. 6, for training with the Resnet110 model on both datasets the LS-DGC framework takes less time than the DGC framework; although the large degree of intra-layer compression already done by DGC had shortened the time significantly, combining it with the layer-based sparsification algorithm improves the time consumption further. In addition, the time spent by the two methods over the whole period and in the compression-communication and decompression-synchronization stages, together with their share of each period, is compared and analysed on the two datasets; the results are shown in Table 1:
TABLE 1
According to Table 1, the average time spent on compression-communication and decompression-synchronization is greatly reduced, and its share of each training period drops by about half, which further relieves the inter-node communication bottleneck. In addition, since DGC transmits with layer-based pipelined communication, after LS-DGC applies layer sparsification some layers no longer communicate at all, so the communication frequency between nodes is also greatly reduced;
compared with the baseline DGC framework, LS-DGC further reduces the communication volume between nodes to relieve the communication bottleneck, which inevitably affects the convergence speed of the model to some extent; with the number of training periods kept unchanged the LS-DGC accuracy drops slightly, because the model has not yet converged. When the number of training periods is increased appropriately (with the total time consumption unchanged) and the model converges fully, the accuracy improves further and even surpasses the baseline result, as shown in Table 2:
TABLE 2
Method          Cifar10 accuracy    Cifar100 accuracy
DGC (baseline)  93.55%              72.04%
LS-DGC          93.08%              71.74%
LS-DGC (more)   94.15% (↑)          72.55% (↑)
Secondly, a training experiment with the layer-sparsified distributed framework (LS-SSGD) based on synchronous stochastic gradient descent (SSGD) is carried out. To verify the generality of layer sparsification, the experiment is also run on SSGD, which performs no intra-layer compression; on SSGD the in-window selection probability is reduced to k_in = 50% and the out-of-window probability to k_out = 10%, so that the overall traffic of SSGD is reduced to about 18% of the original algorithm. Because there is no intra-layer compression the gradient loss is small; Resnet18 is used to train on the Cifar10 dataset and Resnet50 on the Cifar100 dataset, and both converge well.
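The 18% figure follows from the same weighting, assuming the window again covers 20% of the layers and reusing the traffic_ratio helper sketched above:

print(traffic_ratio(0.20, 0.50, 0.10))  # LS-SSGD setting: 0.2*0.5 + 0.8*0.1 = 0.18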
As shown in fig. 7, in terms of time the LS-SSGD framework again consumes less than SSGD training: both the communication-synchronization share within each period and the overall period time are reduced, with the communication-synchronization time reduced by more than 50%, as shown in Table 3:
TABLE 3
As shown in fig. 8, in terms of accuracy LS-SSGD exceeds the baseline SSGD result within the same number of training periods, and the LS-SSGD framework outperforms the SSGD framework throughout the training process.
The beneficial effects of this scheme are as follows: the convergence characteristics of the neural network model are applied to distributed training and a distributed training framework with layer sparsification of the neural network is provided, which solves the problem that existing training frameworks only sparsify within network layers and further increases the degree of sparsification. The layer-sparsified distributed deep learning frameworks LS-DGC and LS-SSGD proposed in this scheme are realized experimentally by combining the existing deep intra-layer sparsification framework DGC and the framework SSGD without intra-layer sparsification. Experiments on several classification models and several image datasets, analysed and compared in terms of overall time consumption, communication traffic, communication share, accuracy and so on, fully demonstrate the effectiveness and advancement of the method.
Example 6
As shown in fig. 9, the present solution further provides a system of a distributed deep learning training method based on layer sparsification, including:
the normalized window center list acquisition module is used for acquiring a normalized window center list according to the convergence characteristic of the neural network model;
the layer transmission list acquisition module is used for acquiring a layer transmission list by using a layer thinning method and a normalized window center list;
and the distributed deep learning training module based on layer sparsification, which performs distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters and completes the distributed deep learning training based on layer sparsification.
The system of the distributed deep learning training method based on layer sparsification provided in this embodiment can execute the technical scheme of the distributed deep learning training method based on layer sparsification shown in the method embodiments above; its implementation principle and beneficial effects are similar and are not repeated here.
In the embodiments of the present application, functional units may be divided according to the distributed deep learning training method based on layer sparsification; for example, each function may be assigned its own functional unit, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in hardware or as a software functional unit. It should be noted that the division of units in the present application is schematic and merely a logical division; other division manners may be adopted in practice.
In the embodiments of the present application, in order to realize the principle and beneficial effects of the distributed deep learning training method based on layer sparsification, the system of the distributed deep learning training method based on layer sparsification includes hardware structures and/or software modules for executing the corresponding functions. Those skilled in the art will readily appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or driven by computer software depends on the specific application and the design constraints; different methods may be used to implement the described function for each particular application, but such implementations should not be considered beyond the scope of the present application.

Claims (5)

1. The distributed deep learning training method based on layer sparsification is characterized by comprising the following steps of:
s1, obtaining a normalized window center list according to convergence characteristics of a neural network model;
s2, obtaining a layer transmission list by using a layer sparsification method and a normalized window center list;
the specific steps of the step S2 are as follows:
s21, acquiring a warm-up period and a normalized window center list of the neural network;
s22, normalizing the sequence numbers of network layers in the neural network according to a layer sparsification method to obtain a normalized layer sequence number list;
s23, transmitting all parameters of all layers of the neural network when the current training period is in a warm-up period, otherwise, entering step S24;
s24, obtaining a dynamic window list and a sampling list in a current training period window according to the normalized window center list and the normalized layer sequence number list;
s25, obtaining a sampling list outside the window of the current training period according to the normalized layer sequence number list and the dynamic window list;
s26, combining the sampling list in the current training period window and the sampling list outside the current training period window to obtain a layer transmission list;
s3, performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight updating parameters, and completing the distributed deep learning training based on layer sparsity;
the step S3 includes the steps of:
s31, obtaining a neural network layer gradient list through feedforward and feedback calculation of a neural network sample;
s32, traversing each layer in the neural network layer gradient list layer by layer, judging whether each layer is in the layer transmission list, if so, obtaining a plurality of selected layers, and proceeding to a step S33, otherwise, obtaining a plurality of local accumulated gradients;
s33, judging whether each selected layer has intra-layer compression, if so, obtaining a plurality of selected layers with intra-layer compression, and proceeding to step S34, otherwise, obtaining a plurality of selected layer transmission gradients without intra-layer compression;
s34, sequentially carrying out intra-layer sparsification, inter-node communication and decompression synchronization on the inner part of each intra-layer compression selected layer to obtain a plurality of intra-layer compression selected layer transmission gradients;
s35, carrying out global average on each local accumulated gradient, each selected layer transmission gradient without intra-layer compression or each selected layer transmission gradient with intra-layer compression to obtain a complete gradient;
and S36, obtaining weight updating parameters according to the complete gradient, and completing the distributed deep learning training based on layer sparsification.
2. The distributed deep learning training method based on layer sparsification according to claim 1, wherein the step S1 includes the steps of:
s11, setting all layers of the neural network as a layer continuous sequence, and setting the total training times of the neural network;
s12, setting a dynamic window according to convergence characteristics of the neural network model, wherein the dynamic window traverses a layer continuous sequence from back to front along with the increase of the training times of the neural network, and sets the whole traversing times of the dynamic window;
s13, calculating the training times of the neural network and the training times of the rest neural network in the process of single traversal of the dynamic window by the neural network according to the total training times of the neural network and the integral traversal times of the dynamic window;
s14, obtaining a normalized step length of the dynamic window traversing movement according to the training times of the neural network in the process of the dynamic window single traversing of the neural network;
s15, based on the normalization step length, iterating the integral traversal times of the dynamic window and the training times of the neural network in the process of single traversal of the dynamic window on the neural network model to obtain a whole period window center list;
s16, judging whether the training times of the rest neural network are zero, if so, normalizing the whole period window center list to serve as a normalized window center list, otherwise, entering a step S17;
s17, iterating the training times of the residual neural network based on the normalized step length of the dynamic window traversal movement to obtain a residual window center list;
and S18, adding a residual window center list at the tail of the whole period window center list to serve as a normalized window center list.
3. The layer-sparsification-based distributed deep learning training method of claim 1, wherein the step S24 includes the steps of:
s241, acquiring a window center of a current training period according to the normalized window center list;
s242, taking the window center and the normalized layer sequence number list of the current training period as an expected and independent variable list respectively, and calculating to obtain a standard normal distribution list;
s243, selecting a sequence number of a network layer corresponding to a preset quantity of the head of the standard normal distribution list to obtain a dynamic window list;
s244, randomly and uniformly sampling the dynamic window list to preset proportionk in And obtaining a sampling list in the window of the current training period.
4. The layer-sparsification-based distributed deep learning training method of claim 1, wherein the step S25 includes the steps of:
s251, obtaining a dynamic window external list according to the normalized layer sequence number list and the dynamic window list;
s252, randomly and uniformly sampling the external list of the dynamic window by a preset proportionk out And obtaining a sampling list outside the window of the current training period.
5. A system of a distributed deep learning training method based on layer sparsification, comprising:
the normalized window center list acquisition module is used for acquiring a normalized window center list according to the convergence characteristic of the neural network model;
the layer transmission list acquisition module is used for obtaining a layer transmission list by using a layer thinning method and a normalized window center list, and specifically comprises the following steps:
a1, acquiring a warm-up period and a normalized window center list of a neural network;
a2, normalizing the sequence numbers of network layers in the neural network according to a layer sparsification method to obtain a normalized layer sequence number list;
a3, transmitting all parameters of all layers of the neural network when the current training period is in a warm-up period, otherwise, entering a step A4;
a4, obtaining a dynamic window list and a sampling list in a current training period window according to the normalized window center list and the normalized layer sequence number list;
a5, obtaining a sampling list outside the window of the current training period according to the normalized layer sequence number list and the dynamic window list;
a6, merging the sampling list in the current training period window and the sampling list outside the current training period window to obtain a layer transmission list;
the distributed deep learning training module based on layer sparsification carries out distributed deep learning training based on layer sparsity according to a layer transmission list to obtain weight updating parameters and complete the distributed deep learning training based on layer sparsity, and the distributed deep learning training module specifically comprises the following steps:
b1, obtaining a neural network layer gradient list through feedforward and feedback calculation of a neural network sample;
step B2, traversing each layer in the neural network layer gradient list layer by layer, judging whether each layer is in the layer transmission list, if so, obtaining a plurality of selected layers, and entering a step B3, otherwise, obtaining a plurality of local accumulated gradients;
b3, judging whether each selected layer has intra-layer compression, if so, obtaining a plurality of selected layers with intra-layer compression, and entering a step B4, otherwise, obtaining a plurality of selected layer transmission gradients without intra-layer compression;
b4, sequentially carrying out intra-layer sparsification, inter-node communication and decompression synchronization on the inner part of each intra-layer selected layer to obtain a plurality of intra-layer compressed selected layer transmission gradients;
b5, carrying out global average on each local accumulated gradient, each selected layer transmission gradient without intra-layer compression or each selected layer transmission gradient with intra-layer compression to obtain a complete gradient;
and B6, obtaining weight updating parameters according to the complete gradient, and completing the distributed deep learning training based on layer sparsification.
CN202111627780.9A 2021-12-28 2021-12-28 Distributed deep learning training method and system based on layer sparsification Active CN114298277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111627780.9A CN114298277B (en) 2021-12-28 2021-12-28 Distributed deep learning training method and system based on layer sparsification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111627780.9A CN114298277B (en) 2021-12-28 2021-12-28 Distributed deep learning training method and system based on layer sparsification

Publications (2)

Publication Number Publication Date
CN114298277A CN114298277A (en) 2022-04-08
CN114298277B (en) 2023-09-12

Family

ID=80972299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111627780.9A Active CN114298277B (en) 2021-12-28 2021-12-28 Distributed deep learning training method and system based on layer sparsification

Country Status (1)

Country Link
CN (1) CN114298277B (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11356334B2 (en) * 2016-04-15 2022-06-07 Nec Corporation Communication efficient sparse-reduce in distributed machine learning
US10832123B2 (en) * 2016-08-12 2020-11-10 Xilinx Technology Beijing Limited Compression of deep neural networks with proper use of mask
KR102355817B1 (en) * 2017-01-17 2022-01-26 삼성전자 주식회사 Method and apparatus for semi-persistent csi reporting in wireless communication system
CN112534452A (en) * 2018-05-06 2021-03-19 强力交易投资组合2018有限公司 Method and system for improving machines and systems for automatically performing distributed ledger and other transactions in spot and forward markets for energy, computing, storage, and other resources

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102027689A (en) * 2008-05-13 2011-04-20 高通股份有限公司 Repeaters for enhancement of wireless power transfer
WO2018107414A1 (en) * 2016-12-15 2018-06-21 上海寒武纪信息科技有限公司 Apparatus, equipment and method for compressing/decompressing neural network model
CN109409505A (en) * 2018-10-18 2019-03-01 中山大学 A method of the compression gradient for distributed deep learning
CN109951438A (en) * 2019-01-15 2019-06-28 中国科学院信息工程研究所 A kind of communication optimization method and system of distribution deep learning
CN111368996A (en) * 2019-02-14 2020-07-03 谷歌有限责任公司 Retraining projection network capable of delivering natural language representation
CN110532898A (en) * 2019-08-09 2019-12-03 北京工业大学 A kind of physical activity recognition methods based on smart phone Multi-sensor Fusion
CN111325356A (en) * 2019-12-10 2020-06-23 四川大学 Neural network search distributed training system and training method based on evolutionary computation
CN111353582A (en) * 2020-02-19 2020-06-30 四川大学 Particle swarm algorithm-based distributed deep learning parameter updating method
CN111858072A (en) * 2020-08-06 2020-10-30 华中科技大学 Resource management method and system for large-scale distributed deep learning
CN112019651A (en) * 2020-08-26 2020-12-01 重庆理工大学 DGA domain name detection method using depth residual error network and character-level sliding window
CN112738014A (en) * 2020-10-28 2021-04-30 北京工业大学 Industrial control flow abnormity detection method and system based on convolution time sequence network
CN113159287A (en) * 2021-04-16 2021-07-23 中山大学 Distributed deep learning method based on gradient sparsity
CN113554169A (en) * 2021-07-28 2021-10-26 杭州海康威视数字技术股份有限公司 Model optimization method and device, electronic equipment and readable storage medium
CN113837299A (en) * 2021-09-28 2021-12-24 平安科技(深圳)有限公司 Network training method and device based on artificial intelligence and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于InfiniBand的集群分布式并行绘制系统设计";付讯等;《四川大学学报(自然科学版)》;第52卷(第1期);第39-44页 *

Also Published As

Publication number Publication date
CN114298277A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
Wu et al. Fast-convergent federated learning with adaptive weighting
Zhang et al. CSAFL: A clustered semi-asynchronous federated learning framework
CN111382844B (en) Training method and device for deep learning model
CN113222179B (en) Federal learning model compression method based on model sparsification and weight quantification
Li et al. GGS: General gradient sparsification for federated learning in edge computing
CN110856268B (en) Dynamic multichannel access method for wireless network
Jiang et al. Fedmp: Federated learning through adaptive model pruning in heterogeneous edge computing
CN112862088A (en) Distributed deep learning method based on pipeline annular parameter communication
CN114418129B (en) Deep learning model training method and related device
Liu et al. Fedpa: An adaptively partial model aggregation strategy in federated learning
Cao et al. HADFL: Heterogeneity-aware decentralized federated learning framework
Chen et al. Service delay minimization for federated learning over mobile devices
CN116050509A (en) Clustering federal learning method based on momentum gradient descent
CN115797835A (en) Non-supervision video target segmentation algorithm based on heterogeneous Transformer
Zheng et al. Distributed hierarchical deep optimization for federated learning in mobile edge computing
CN114298277B (en) Distributed deep learning training method and system based on layer sparsification
Cai et al. High-efficient hierarchical federated learning on non-IID data with progressive collaboration
CN117421115A (en) Cluster-driven federal learning client selection method with limited resources in Internet of things environment
Deng et al. HSFL: Efficient and privacy-preserving offloading for split and federated learning in IoT services
CN114465900B (en) Data sharing delay optimization method and device based on federal edge learning
CN114707636A (en) Neural network architecture searching method and device, electronic equipment and storage medium
Yu et al. Proximal Policy Optimization-based Federated Client Selection for Internet of Vehicles
Yang et al. On the convergence of hybrid federated learning with server-clients collaborative training
Shahab et al. Population-based evolutionary distributed SGD
CN113688891B (en) Distributed cascade forest method capable of adaptively dividing sub-forest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant