CN114298277B - Distributed deep learning training method and system based on layer sparsification - Google Patents

Distributed deep learning training method and system based on layer sparsification

Info

Publication number
CN114298277B
CN114298277B
Authority
CN
China
Prior art keywords
layer
list
window
neural network
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111627780.9A
Other languages
Chinese (zh)
Other versions
CN114298277A (en)
Inventor
吕建成
胡宴箐
叶庆
张钟宇
郎九霖
田煜鑫
吕金地
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202111627780.9A
Publication of CN114298277A
Application granted
Publication of CN114298277B
Legal status: Active (current)
Anticipated expiration


Abstract

The application discloses a distributed deep learning training method and system based on layer sparsification, which belong to the technical field of communication sparsification for distributed training and comprise the following steps: obtaining a normalized window center list according to the convergence characteristics of the neural network model; obtaining a layer transmission list by using a layer sparsification method and the normalized window center list; and performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters, thereby completing the distributed deep learning training based on layer sparsification. The application solves the problem that existing training frameworks only sparsify within network layers, effectively increases the degree of sparsification, and reduces communication traffic.

Description

Distributed deep learning training method and system based on layer sparsification
Technical Field
The application belongs to the technical field of distributed training communication sparsification, and particularly relates to a distributed deep learning training method and system based on layer sparsification.
Background
In recent years, as deep learning has continued to evolve, models have become larger and more complex, and training these large, complex models on a single machine is time consuming. To reduce training time, distributed training methods have been proposed to accelerate model training. Distributed training mainly includes two approaches: model parallelism and data parallelism. Model parallelism splits the model parameters across different computing nodes and is difficult to accelerate because it suffers from problems such as unbalanced parameter sizes across layers and strong computational dependence between nodes. Data parallelism, in contrast, splits the training dataset while each computing node maintains a complete copy of the model; in each iteration, every node computes local gradients on different training data and then exchanges them.
When the computing nodes exchange local gradients, distributed scaling is limited by the large number of parameters, limited network bandwidth, and similar factors. To overcome this bottleneck, two different methods of reducing communication traffic have been proposed: sparsification and quantization. Sparsification aims to reduce the number of elements transmitted in each iteration, setting most elements to zero and transmitting only the most valuable gradients for parameter updates so as to preserve training convergence.
In the basic synchronous distributed training framework SSGD (Synchronous Stochastic Gradient Descent), which has no communication compression, each computing node must wait for all nodes to finish transmitting all parameters in the current iteration; the load caused by this excessive communication becomes the biggest bottleneck, and in practice, with limited computing resources, the method can only be applied to the training of relatively small models. The distributed deep learning training framework DGC (Deep Gradient Compression), which performs deep communication compression within each network layer, adds many existing auxiliary techniques to overcome the loss caused by intra-layer gradient sparsification and greatly reduces the gradient traffic between nodes; however, DGC still has to transmit every network layer of the model in each iteration, so a bottleneck remains for networks with greater depth.
Disclosure of Invention
Aiming at the above defects in the prior art, the present application applies the convergence characteristics of neural network models to distributed training and provides a distributed deep learning training method and system based on layer sparsification, which solve the problem that existing training frameworks only sparsify within network layers, effectively increase the degree of sparsification, and reduce communication traffic.
In order to achieve the aim of the application, the application adopts the following technical scheme:
the application provides a distributed deep learning training method based on layer sparsification, which comprises the following steps:
s1, obtaining a normalized window center list according to convergence characteristics of a neural network model;
s2, obtaining a layer transmission list by using a layer sparsification method and a normalized window center list;
and S3, performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters, thereby completing the distributed deep learning training based on layer sparsification.
The beneficial effects of the application are as follows: in the distributed deep learning training method based on layer sparsification, because different layers have different critical learning periods during neural network training, only the network layers that currently need to be learned are selected for communication synchronization while the other layers accumulate their gradients locally; when gradients are transmitted between nodes, only the gradients of a subset of layers are communicated, which effectively reduces the communication traffic.
Further, the step S1 includes the steps of:
s11, setting all layers of the neural network as a layer continuous sequence, and setting the total training times of the neural network;
s12, setting a dynamic window according to convergence characteristics of the neural network model, wherein the dynamic window traverses a layer continuous sequence from back to front along with the increase of the training times of the neural network, and sets the whole traversing times of the dynamic window;
s13, calculating the training times of the neural network and the training times of the rest neural network in the process of single traversal of the dynamic window by the neural network according to the total training times of the neural network and the integral traversal times of the dynamic window;
s14, obtaining a normalized step length of the dynamic window traversing movement according to the training times of the neural network in the process of the dynamic window single traversing of the neural network;
s15, based on the normalization step length, iterating the integral traversal times of the dynamic window and the training times of the neural network in the process of single traversal of the dynamic window on the neural network model to obtain a whole period window center list;
s16, judging whether the training times of the rest neural network are zero, if so, normalizing the whole period window center list to serve as a normalized window center list, otherwise, entering a step S17;
s17, iterating the training times of the residual neural network based on the normalized step length of the dynamic window traversal movement to obtain a residual window center list;
and S18, adding a residual window center list at the tail of the whole period window center list to serve as a normalized window center list.
The beneficial effects of adopting the further scheme are as follows: a dynamic window is established according to the convergence characteristics of the neural network model; in the early stage of training the network layers at the bottom of the model are mainly transmitted, and as the training period increases the key transmission layers move upward. Since the gradients of different layers are not equally important, this provides a foundation for layer sparsification in distributed training, and the multiple traversals effectively prevent the selection probability of some layers from becoming too low.
Further, the specific steps of the step S2 are as follows:
s21, acquiring a warm-up period and a normalized window center list of the neural network;
s22, normalizing the sequence numbers of network layers in the neural network according to a layer sparsification method to obtain a normalized layer sequence number list;
s23, transmitting all parameters of all layers of the neural network when the current training period is in a warm-up period, otherwise, entering step S24;
s24, obtaining a dynamic window list and a sampling list in a current training period window according to the normalized window center list and the normalized layer sequence number list;
s25, obtaining a sampling list outside the window of the current training period according to the normalized layer sequence number list and the dynamic window list;
s26, combining the sampling list in the current training period window and the sampling list outside the current training period window to obtain a layer transmission list.
The beneficial effects of adopting the further scheme are as follows: in the early stage of neural network training the dynamic window sits at the bottommost layers of the model and gradually slides toward the top layers as training proceeds; setting a preset sampling proportion for the network layers inside the dynamic window and another for the layers outside it reduces the amount of transmitted data, an advantage that becomes obvious when the model is deep enough.
Further, the step S24 includes the steps of:
s241, acquiring a window center of a current training period according to the normalized window center list;
s242, taking the window center and the normalized layer sequence number list of the current training period as an expected and independent variable list respectively, and calculating to obtain a standard normal distribution list;
s243, selecting a sequence number of a network layer corresponding to a preset quantity of the head of the standard normal distribution list to obtain a dynamic window list;
s244, randomly and uniformly sampling a dynamic window list to preset a proportion k in And obtaining a sampling list in the window of the current training period.
The beneficial effects of adopting the further scheme are as follows: by randomly and uniformly sampling the dynamic window list, the transmission data volume is effectively reduced, and when the model is deep enough, the advantages are obvious.
Further, the step S25 includes the steps of:
s251, obtaining a dynamic window external list according to the normalized layer sequence number list and the dynamic window list;
s252, randomly and uniformly sampling a preset proportion k for the external list of the dynamic window out And obtaining a sampling list outside the window of the current training period.
The beneficial effects of adopting the further scheme are as follows: by randomly and uniformly sampling the neural network layer outside the dynamic window list, the transmission data volume is effectively reduced, and when the model is deep enough, the advantage is obvious.
Further, the step S3 includes the following steps:
s31, obtaining a neural network layer gradient list through feedforward and feedback calculation of a neural network sample;
s32, traversing each layer in the neural network layer gradient list layer by layer, judging whether each layer is in the layer transmission list, if so, obtaining a plurality of selected layers, and proceeding to a step S33, otherwise, obtaining a plurality of local accumulated gradients;
s33, judging whether each selected layer has intra-layer compression, if so, obtaining a plurality of selected layers with intra-layer compression, and proceeding to step S34, otherwise, obtaining a plurality of selected layer transmission gradients without intra-layer compression;
s34, sequentially carrying out intra-layer sparsification, inter-node communication and decompression synchronization on the inner part of each intra-layer compression selected layer to obtain a plurality of intra-layer compression selected layer transmission gradients;
s35, carrying out global average on each local accumulated gradient, each selected layer transmission gradient without intra-layer compression or each selected layer transmission gradient with intra-layer compression to obtain a complete gradient;
and S36, obtaining weight updating parameters according to the complete gradient, and completing the distributed deep learning training based on layer sparsification.
The beneficial effects of adopting the further scheme are as follows: depending on whether each layer of the neural network is in the layer transmission list and whether intra-layer compression is applied, the transmission gradients are obtained through local accumulation, direct transmission, or inter-node communication respectively; the complete gradient is then obtained by global-average gradient fusion, the weight update parameters are obtained from the complete gradient, and the distributed deep learning training based on layer sparsification is completed.
The application also provides a system of the distributed deep learning training method based on layer sparsification, which comprises:
the normalized window center list acquisition module is used for acquiring a normalized window center list according to the convergence characteristic of the neural network model;
the layer transmission list acquisition module is used for acquiring a layer transmission list by using a layer thinning method and a normalized window center list;
and the distributed deep learning training module based on layer sparsification, which performs distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters and completes the distributed deep learning training based on layer sparsification.
The beneficial effects of the application are as follows: the system of the distributed deep learning training method based on layer sparsification is the system provided in correspondence with that method and is used to realize the distributed deep learning training method based on layer sparsification.
Drawings
Fig. 1 is a flowchart illustrating steps of a distributed deep learning training method based on layer sparsification according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a window center moving along with a training period in an embodiment of the present application.
FIG. 3 is a diagram illustrating dynamic window movement in accordance with an embodiment of the present application.
Fig. 4 is a schematic diagram of a dynamic window list obtained according to a standard normal distribution list in an embodiment of the present application.
FIG. 5 is a diagram illustrating an all_reduce loop transmission method according to an embodiment of the present application.
FIG. 6 is a schematic diagram of the time-consuming training of Resnet110 models in the Cifar10 and Cifar100 datasets using the DGC and LS-DGC frameworks, respectively, in an embodiment of the present application.
FIG. 7 is a schematic diagram of the time consuming training of the Resnet18 model in the Cifar10 dataset with SSGD and LS-SSGD frameworks, respectively, and the time consuming training of the Resnet50 model in the Cifar100 dataset with SSGD and LS-SSGD frameworks, respectively, in an embodiment of the present application.
FIG. 8 is a schematic diagram of the convergence of Resnet18 model on Cifar10 using SSGD and LS-SSGD frameworks, respectively, and the convergence of Resnet50 model on Cifar100 using SSGD and LS-SSGD frameworks, respectively, according to an embodiment of the present application.
Fig. 9 is a block diagram of a system of a distributed deep learning training method based on layer sparsification in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is provided to facilitate understanding of the present application by those skilled in the art, but it should be understood that the present application is not limited to the scope of these embodiments; for those skilled in the art, all applications that make use of the inventive concept fall within the spirit and scope of the present application as defined in the appended claims.
Based on the conclusions about neural network model convergence characteristics obtained from deep representation learning dynamics, this scheme provides a distributed deep learning training method and system based on layer sparsification. During neural network training the critical learning periods of different layers differ, which makes gradient sparsification from the perspective of network layers feasible: the network layers that currently need to be learned are communicated and synchronized while the other layers accumulate locally. Considering that the layers at the bottom of the model are mainly transmitted in the early stage of training, and that the key transmission layers move upward as the training period increases, also supports layer sparsification for distributed training.
Example 1
As shown in fig. 1, an embodiment of the present application provides a distributed deep learning training method based on layer sparsification, including the following steps:
s1, obtaining a normalized window center list according to convergence characteristics of a neural network model;
according to the convergence characteristics of the neural network model, a dynamic window is established; in the early stage of training the network layers at the bottom of the model are mainly transmitted, and as the training period increases the key transmission layers move upward; since the gradients of different layers are not equally important, this provides a foundation for layer sparsification in distributed training;
s2, obtaining a layer transmission list by using a layer sparsification method and a normalized window center list;
in the early stage of neural network training the dynamic window sits at the bottommost layers of the model and gradually slides toward the top layers as training proceeds; setting a preset sampling proportion for the network layers inside the dynamic window and another for the layers outside it reduces the amount of transmitted data, an advantage that becomes obvious when the model is deep enough;
and S3, performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters, thereby completing the distributed deep learning training based on layer sparsification.
Depending on whether each layer of the neural network is in the layer transmission list and whether intra-layer compression is applied, the transmission gradients are obtained through local accumulation, direct transmission, or inter-node communication respectively; the complete gradient is then obtained by global-average gradient fusion, the weight update parameters are obtained from the complete gradient, and the distributed deep learning training based on layer sparsification is completed.
The beneficial effects of the application are as follows: in the distributed deep learning training method based on layer sparsification, because different layers have different critical learning periods during neural network training, only the network layers that currently need to be learned are selected for communication synchronization while the other layers accumulate their gradients locally; when gradients are transmitted between nodes, only the gradients of a subset of layers are communicated, which effectively reduces the communication traffic.
Example 2
For step S1 of embodiment 1, it includes the following substeps S11 to S18:
s11, setting all layers of the neural network as a layer continuous sequence, and setting the total training times of the neural network;
s12, setting a dynamic window according to convergence characteristics of the neural network model, wherein the dynamic window traverses a layer continuous sequence from back to front along with the increase of the training times of the neural network, and sets the whole traversing times of the dynamic window;
s13, calculating the training times of the neural network and the training times of the rest neural network in the process of single traversal of the dynamic window by the neural network according to the total training times of the neural network and the integral traversal times of the dynamic window;
s14, obtaining a normalized step length of the dynamic window traversing movement according to the training times of the neural network in the process of the dynamic window single traversing of the neural network;
s15, based on the normalization step length, iterating the integral traversal times of the dynamic window and the training times of the neural network in the process of single traversal of the dynamic window on the neural network model to obtain a whole period window center list;
s16, judging whether the training times of the rest neural network are zero, if so, normalizing the whole period window center list to serve as a normalized window center list, otherwise, entering a step S17;
s17, iterating the training times of the residual neural network based on the normalized step length of the dynamic window traversal movement to obtain a residual window center list;
s18, adding a remaining window center list at the tail of the whole period window center list to serve as a normalized window center list;
as shown in fig. 2, suppose the neural network model has 300 layers and is trained for 103 epochs, during which the dynamic window makes 4 whole traversals of the model plus 1 partial traversal (the remainder of 103/4); the total number of training epochs is 103, the number of whole traversals is 4, the number of training epochs within a single traversal is 25, the number of remaining epochs is 3, and the normalized step length of the dynamic window movement is 1/24 (counting from 0); because the traversals do not divide the training exactly, a similar iteration is performed over the remaining training epochs to generate the remaining window centers, which are then appended at the tail to obtain the normalized window center list.
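As a concrete illustration of steps S11 to S18, the following Python sketch builds such a normalized window center list. It is not the patented implementation: the 0-to-1 normalization of each traversal, the sweep direction, and the handling of the leftover epochs are assumptions inferred from the example above, and the function name is hypothetical.

def window_center_list(total_epochs, num_traversals):
    # S13: epochs within a single traversal and the leftover epochs
    epochs_per_pass = total_epochs // num_traversals
    remainder = total_epochs % num_traversals
    # S14: normalized step length of the window movement (counting from 0)
    step = 1.0 / max(1, epochs_per_pass - 1)
    centers = []
    # S15: centers for the whole traversals, one center per training epoch
    for _ in range(num_traversals):
        centers.extend(step * i for i in range(epochs_per_pass))
    # S16-S18: a partial traversal for the remaining epochs, appended at the tail
    if remainder:
        centers.extend(step * i for i in range(remainder))
    return centers

With the numbers of this example, window_center_list(103, 4) returns 103 centers: 25 per whole traversal with step 1/24, plus 3 centers for the partial traversal.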
Example 3
For step S2 in embodiment 1, it includes the following substeps S21 to S26:
s21, acquiring a warm-up period and a normalized window center list of the neural network;
the warm-up period is generally set to the first 5 epochs of neural network training, and no sparsification is carried out during the warm-up period to prevent the neural network from moving in the wrong direction;
s22, normalizing the sequence numbers of network layers in the neural network according to a layer sparsification method to obtain a normalized layer sequence number list;
s23, transmitting all parameters of all layers of the neural network when the current training period is in a warm-up period, otherwise, entering step S24;
s24, obtaining a dynamic window list and a sampling list in a current training period window according to the normalized window center list and the normalized layer sequence number list;
the step S24 includes the steps of:
s241, acquiring a window center of a current training period according to the normalized window center list;
s242, taking the window center and the normalized layer sequence number list of the current training period as an expected and independent variable list respectively, and calculating to obtain a standard normal distribution list;
s243, selecting a sequence number of a network layer corresponding to a preset quantity of the head of the standard normal distribution list to obtain a dynamic window list;
s244, randomly and uniformly sampling a dynamic window list to preset a proportion k in Obtaining a sampling list in a window of a current training period;
s25, obtaining a sampling list outside the window of the current training period according to the normalized layer sequence number list and the dynamic window list;
the step S25 includes the steps of:
s251, obtaining a dynamic window external list according to the normalized layer sequence number list and the dynamic window list;
s252, randomly and uniformly sampling a preset proportion k for the external list of the dynamic window out Obtaining a sampling list outside a window of a current training period;
s26, combining the sampling list in the current training period window and the sampling list outside the current training period window to obtain a layer transmission list;
as shown in FIG. 3, in this embodiment the neural network has 20 layers; in the early stage of training the window sits at the lowest layers of the model and gradually slides toward the top layers as training proceeds; the dynamic window list is randomly and uniformly sampled at a preset proportion k_in = 50% and the list outside the dynamic window is randomly and uniformly sampled at a preset proportion k_out, giving an overall model compression ratio of k = 20% and further reducing the amount of transmitted data; the advantage of this approach is evident when the model is deep enough. No sparsification is carried out during the warm-up period, which prevents the model from moving in the wrong direction. The window center of the current training period is obtained from the normalized window center list; the window center of the current training period and the normalized layer serial-number list are taken as the expectation and the argument list respectively, and the standard normal distribution list is calculated; the layer serial numbers corresponding to the top 20% of the standard normal distribution list are then selected as the dynamic window list.
As shown in fig. 4, the peak of the standard normal distribution lies at the window center and only the top 20% is intercepted; the corresponding abscissas form the dynamic window list for the current window center, namely a sequence of network layers, and random uniform sampling within the dynamic window list gives the sampling list inside the window of the current training period; the top-20% dynamic window is then removed and random uniform sampling of the remaining layers gives the sampling list outside the window of the current training period, and merging the two sampling lists gives the layer transmission list.
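The window construction and sampling of steps S21 to S26 can be sketched as follows in Python with NumPy. This is a hypothetical helper rather than the patented code: the 20% window fraction, the unit-variance normal density, and the warm-up handling are assumptions taken from the description above.

import numpy as np

def layer_transmission_list(num_layers, center, k_in, k_out, window_frac=0.2,
                            epoch=None, warmup_epochs=5, rng=None):
    rng = rng or np.random.default_rng()
    # S23: during the warm-up period every layer is transmitted
    if epoch is not None and epoch < warmup_epochs:
        return list(range(num_layers))
    # S22: normalized layer serial numbers in [0, 1]
    layer_pos = np.arange(num_layers) / (num_layers - 1)
    # S242: normal density (up to a constant factor) with the window center as its mean
    density = np.exp(-0.5 * (layer_pos - center) ** 2)
    # S243: the layers with the top `window_frac` densities form the dynamic window
    window_size = max(1, int(round(window_frac * num_layers)))
    window = np.argsort(density)[::-1][:window_size]
    # S244: uniform sampling at proportion k_in inside the window
    inside = rng.choice(window, max(1, int(round(k_in * len(window)))), replace=False)
    # S251/S252: uniform sampling at proportion k_out outside the window
    outside_pool = np.setdiff1d(np.arange(num_layers), window)
    outside = rng.choice(outside_pool, int(round(k_out * len(outside_pool))), replace=False)
    # S26: merge the two sampling lists into the layer transmission list
    return sorted(np.concatenate([inside, outside]).tolist())

For the 20-layer example with the window center near 0, k_in = 0.5 and an illustrative k_out of 0.1 (chosen arbitrarily here), roughly two in-window layers and one or two out-of-window layers would be selected per training period.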
Example 4
For step S3 in embodiment 1, it includes the following substeps S31 to S36:
s31, obtaining a neural network layer gradient list through feedforward and feedback calculation of a neural network sample;
s32, traversing each layer in the neural network layer gradient list layer by layer, judging whether each layer is in the layer transmission list, if so, obtaining a plurality of selected layers, and proceeding to a step S33, otherwise, obtaining a plurality of local accumulated gradients;
s33, judging whether each selected layer has intra-layer compression, if so, obtaining a plurality of selected layers with intra-layer compression, and proceeding to step S34, otherwise, obtaining a plurality of selected layer transmission gradients without intra-layer compression;
s34, sequentially carrying out intra-layer sparsification, inter-node communication and decompression synchronization on the inner part of each intra-layer compression selected layer to obtain a plurality of intra-layer compression selected layer transmission gradients;
s35, carrying out global average on each local accumulated gradient, each selected layer transmission gradient without intra-layer compression or each selected layer transmission gradient with intra-layer compression to obtain a complete gradient;
s36, obtaining weight updating parameters according to the complete gradient, and completing distributed deep learning training based on layer sparsity;
in this embodiment, when intra-layer compression is selected for the layers, layer sparsification is performed before intra-layer sparsification and inter-node communication by the same method as step S2 of the above distributed deep learning training method based on layer sparsification, and the gradient layers to be communicated are selected, which avoids the sparsification overhead of the neural network layers that do not take part in inter-node communication. During the warm-up period all layers are transmitted without layer sparsification to prevent the training from moving in the wrong direction; after the degree of intra-layer sparsification has stabilized, the layer sparsification strategy is invoked to reduce the amount of inter-node communication. As shown in fig. 5, when inter-node communication is carried out for each intra-layer-compressed selected layer, the all_reduce ring (loop) transmission method is adopted: cyclic transmission between nodes is performed first, and after 4 rounds of transmission each node holds the gradient information of all other nodes; the gradients are then averaged to obtain the average gradient, and after decompression the transmission gradients of the intra-layer-compressed selected layers are obtained.
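A minimal sketch of the per-layer decision in steps S31 to S36, written against the PyTorch distributed API used in the experiments below, is given here. The residual bookkeeping, the treatment of parameters as a stand-in for network layers, and the point at which intra-layer compression would be inserted are assumptions, and the function name is hypothetical.

import torch
import torch.distributed as dist

def sparse_layer_step(model, transmission_list, residuals):
    # residuals: dict mapping parameter index -> locally accumulated gradient (or 0.0)
    world_size = dist.get_world_size()
    # S32: traverse the layer gradient list (parameters stand in for layers here)
    for idx, param in enumerate(model.parameters()):
        if param.grad is None:
            continue
        grad = param.grad + residuals.get(idx, 0.0)  # add any locally accumulated gradient
        if idx in transmission_list:
            # S33/S34: an intra-layer sparsification step (e.g. top-k) could be applied
            # here before communication; this sketch sends the dense layer gradient.
            dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # ring-style inter-node exchange
            param.grad = grad / world_size               # S35: global average -> complete gradient
            residuals[idx] = 0.0
        else:
            # unselected layer: keep accumulating locally and skip this update
            residuals[idx] = grad
            param.grad = None
    # S36: the caller then runs optimizer.step() to update the weights

The process group must have been initialized beforehand (for example with torch.distributed.init_process_group), and the outer training loop supplies the layer transmission list computed as in step S2.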
Example 5
In a practical example of the application, experiments are carried out on two image classification datasets, the simpler Cifar10 dataset and the more complex Cifar100 dataset, to demonstrate the effectiveness of the distributed deep learning training method based on layer sparsification. Cifar10 consists of 50000 training images and 10000 validation images in 10 classes, while Cifar100 contains 100 classes with 500 training images and 100 test images each. All experiments were based on the PyTorch distributed training framework and were run on a machine with 4 GeForce RTX 3090 graphics cards.
First, a training experiment with the layer-sparsified distributed framework (LS-DGC) based on deep gradient compression (DGC) is performed; out of 164 training periods, the warm-up period is still set to 4 periods. Moreover, since DGC already compresses up to 99% within each layer, the layer-level selection ratio is not set too low: this scheme adopts 20% of the model layer sequence as the size of the sliding window, all layers inside the window are selected (k_in = 100%), and k_out = 20% is set outside the window to prevent the gradients outside the window from becoming stale, so the traffic of the layer-sparsified DGC algorithm is reduced to around 36% of the original algorithm.
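The 36% figure can be reproduced by weighting the two sampling ratios by the window size; the following check is a sketch, not code from the patent:

def traffic_ratio(window_frac, k_in, k_out):
    # fraction of network layers transmitted per iteration, relative to sending every layer
    return window_frac * k_in + (1.0 - window_frac) * k_out

print(traffic_ratio(0.20, 1.00, 0.20))  # LS-DGC setting: 0.2*1.0 + 0.8*0.2 = 0.36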
As shown in fig. 6, for training with the Resnet110 model on both datasets the LS-DGC framework takes less time than the DGC framework; although the large degree of intra-layer compression already done by DGC had shortened the time significantly, combining it with the layer-based sparsification algorithm improves the time consumption further. In addition, the time spent by the two methods over the whole period and in the compression-communication and decompression-synchronization stages, together with their share of each period, is compared and analysed on the two datasets; the results are shown in Table 1:
TABLE 1
According to Table 1, the average time spent on compression-communication and decompression-synchronization is greatly reduced, and its share of each training period drops by about half, which further relieves the inter-node communication bottleneck. In addition, since DGC transmits with layer-based pipelined communication, after LS-DGC applies layer sparsification some layers no longer communicate at all, so the communication frequency between nodes is also greatly reduced;
compared with the baseline DGC framework, LS-DGC further reduces the communication volume between nodes to relieve the communication bottleneck, which inevitably affects the convergence speed of the model to some extent; with the number of training periods kept unchanged the LS-DGC accuracy drops slightly, because the model has not yet converged. When the number of training periods is increased appropriately (with the total time consumption unchanged) and the model converges fully, the accuracy improves further and even surpasses the baseline result, as shown in Table 2:
TABLE 2
Method          Cifar10 accuracy    Cifar100 accuracy
DGC (baseline)  93.55%              72.04%
LS-DGC          93.08%              71.74%
LS-DGC (more)   94.15% (↑)          72.55% (↑)
Secondly, a training experiment with the layer-sparsified distributed framework (LS-SSGD) based on synchronous stochastic gradient descent (SSGD) is carried out. To verify the generality of layer sparsification, the experiment is also run on SSGD, which performs no intra-layer compression; on SSGD the in-window selection probability is reduced to k_in = 50% and the out-of-window probability to k_out = 10%, so that the overall traffic of SSGD is reduced to about 18% of the original algorithm. Because there is no intra-layer compression the gradient loss is small; Resnet18 is used to train on the Cifar10 dataset and Resnet50 on the Cifar100 dataset, and both converge well.
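The 18% figure follows from the same weighting, assuming the window again covers 20% of the layers and reusing the traffic_ratio helper sketched above:

print(traffic_ratio(0.20, 0.50, 0.10))  # LS-SSGD setting: 0.2*0.5 + 0.8*0.1 = 0.18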
As shown in fig. 7, in terms of time the LS-SSGD framework again consumes less than SSGD training: both the communication-synchronization share within each period and the overall period time are reduced, with the communication-synchronization time reduced by more than 50%, as shown in Table 3:
TABLE 3
As shown in fig. 8, in terms of accuracy LS-SSGD exceeds the baseline SSGD result within the same number of training periods, and the LS-SSGD framework outperforms the SSGD framework throughout the training process.
The beneficial effects of this scheme are as follows: the convergence characteristics of the neural network model are applied to distributed training and a distributed training framework with layer sparsification of the neural network is provided, which solves the problem that existing training frameworks only sparsify within network layers and further increases the degree of sparsification. The layer-sparsified distributed deep learning frameworks LS-DGC and LS-SSGD proposed in this scheme are realized experimentally by combining the existing deep intra-layer sparsification framework DGC and the framework SSGD without intra-layer sparsification. Experiments on several classification models and several image datasets, analysed and compared in terms of overall time consumption, communication traffic, communication share, accuracy and so on, fully demonstrate the effectiveness and advancement of the method.
Example 6
As shown in fig. 9, the present solution further provides a system of a distributed deep learning training method based on layer sparsification, including:
the normalized window center list acquisition module is used for acquiring a normalized window center list according to the convergence characteristic of the neural network model;
the layer transmission list acquisition module is used for acquiring a layer transmission list by using a layer thinning method and a normalized window center list;
and the distributed deep learning training module based on layer sparsification, which performs distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight update parameters and completes the distributed deep learning training based on layer sparsification.
The system of the distributed deep learning training method based on layer sparsification provided in this embodiment can execute the technical scheme of the distributed deep learning training method based on layer sparsification shown in the method embodiments above; its implementation principle and beneficial effects are similar and are not repeated here.
In the embodiments of the present application, functional units may be divided according to the distributed deep learning training method based on layer sparsification; for example, each function may be assigned its own functional unit, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in hardware or as a software functional unit. It should be noted that the division of units in the present application is schematic and merely a logical division; other division manners may be adopted in practice.
In the embodiments of the present application, in order to realize the principle and beneficial effects of the distributed deep learning training method based on layer sparsification, the system of the distributed deep learning training method based on layer sparsification includes hardware structures and/or software modules for executing the corresponding functions. Those skilled in the art will readily appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or driven by computer software depends on the specific application and the design constraints; different methods may be used to implement the described function for each particular application, but such implementations should not be considered beyond the scope of the present application.

Claims (5)

1. The distributed deep learning training method based on layer sparsification is characterized by comprising the following steps of:
s1, obtaining a normalized window center list according to convergence characteristics of a neural network model;
s2, obtaining a layer transmission list by using a layer sparsification method and a normalized window center list;
the specific steps of the step S2 are as follows:
s21, acquiring a warm-up period and a normalized window center list of the neural network;
s22, normalizing the sequence numbers of network layers in the neural network according to a layer sparsification method to obtain a normalized layer sequence number list;
s23, transmitting all parameters of all layers of the neural network when the current training period is in a warm-up period, otherwise, entering step S24;
s24, obtaining a dynamic window list and a sampling list in a current training period window according to the normalized window center list and the normalized layer sequence number list;
s25, obtaining a sampling list outside the window of the current training period according to the normalized layer sequence number list and the dynamic window list;
s26, combining the sampling list in the current training period window and the sampling list outside the current training period window to obtain a layer transmission list;
s3, performing distributed deep learning training based on layer sparsification according to the layer transmission list to obtain weight updating parameters, and completing the distributed deep learning training based on layer sparsity;
the step S3 includes the steps of:
s31, obtaining a neural network layer gradient list through feedforward and feedback calculation of a neural network sample;
s32, traversing each layer in the neural network layer gradient list layer by layer, judging whether each layer is in the layer transmission list, if so, obtaining a plurality of selected layers, and proceeding to a step S33, otherwise, obtaining a plurality of local accumulated gradients;
s33, judging whether each selected layer has intra-layer compression, if so, obtaining a plurality of selected layers with intra-layer compression, and proceeding to step S34, otherwise, obtaining a plurality of selected layer transmission gradients without intra-layer compression;
s34, sequentially carrying out intra-layer sparsification, inter-node communication and decompression synchronization on the inner part of each intra-layer compression selected layer to obtain a plurality of intra-layer compression selected layer transmission gradients;
s35, carrying out global average on each local accumulated gradient, each selected layer transmission gradient without intra-layer compression or each selected layer transmission gradient with intra-layer compression to obtain a complete gradient;
and S36, obtaining weight updating parameters according to the complete gradient, and completing the distributed deep learning training based on layer sparsification.
2. The distributed deep learning training method based on layer sparsification according to claim 1, wherein the step S1 includes the steps of:
s11, setting all layers of the neural network as a layer continuous sequence, and setting the total training times of the neural network;
s12, setting a dynamic window according to convergence characteristics of the neural network model, wherein the dynamic window traverses a layer continuous sequence from back to front along with the increase of the training times of the neural network, and sets the whole traversing times of the dynamic window;
s13, calculating the training times of the neural network and the training times of the rest neural network in the process of single traversal of the dynamic window by the neural network according to the total training times of the neural network and the integral traversal times of the dynamic window;
s14, obtaining a normalized step length of the dynamic window traversing movement according to the training times of the neural network in the process of the dynamic window single traversing of the neural network;
s15, based on the normalization step length, iterating the integral traversal times of the dynamic window and the training times of the neural network in the process of single traversal of the dynamic window on the neural network model to obtain a whole period window center list;
s16, judging whether the training times of the rest neural network are zero, if so, normalizing the whole period window center list to serve as a normalized window center list, otherwise, entering a step S17;
s17, iterating the training times of the residual neural network based on the normalized step length of the dynamic window traversal movement to obtain a residual window center list;
and S18, adding a residual window center list at the tail of the whole period window center list to serve as a normalized window center list.
3. The layer-sparsification-based distributed deep learning training method of claim 1, wherein the step S24 includes the steps of:
s241, acquiring a window center of a current training period according to the normalized window center list;
s242, taking the window center and the normalized layer sequence number list of the current training period as an expected and independent variable list respectively, and calculating to obtain a standard normal distribution list;
s243, selecting a sequence number of a network layer corresponding to a preset quantity of the head of the standard normal distribution list to obtain a dynamic window list;
s244, randomly and uniformly sampling the dynamic window list to preset proportionk in And obtaining a sampling list in the window of the current training period.
4. The layer-sparsification-based distributed deep learning training method of claim 1, wherein the step S25 includes the steps of:
s251, obtaining a dynamic window external list according to the normalized layer sequence number list and the dynamic window list;
s252, randomly and uniformly sampling the external list of the dynamic window by a preset proportionk out And obtaining a sampling list outside the window of the current training period.
5. A system of a distributed deep learning training method based on layer sparsification, comprising:
the normalized window center list acquisition module is used for acquiring a normalized window center list according to the convergence characteristic of the neural network model;
the layer transmission list acquisition module is used for obtaining a layer transmission list by using a layer thinning method and a normalized window center list, and specifically comprises the following steps:
a1, acquiring a warm-up period and a normalized window center list of a neural network;
a2, normalizing the sequence numbers of network layers in the neural network according to a layer sparsification method to obtain a normalized layer sequence number list;
a3, transmitting all parameters of all layers of the neural network when the current training period is in a warm-up period, otherwise, entering a step A4;
a4, obtaining a dynamic window list and a sampling list in a current training period window according to the normalized window center list and the normalized layer sequence number list;
a5, obtaining a sampling list outside the window of the current training period according to the normalized layer sequence number list and the dynamic window list;
a6, merging the sampling list in the current training period window and the sampling list outside the current training period window to obtain a layer transmission list;
the distributed deep learning training module based on layer sparsification carries out distributed deep learning training based on layer sparsity according to a layer transmission list to obtain weight updating parameters and complete the distributed deep learning training based on layer sparsity, and the distributed deep learning training module specifically comprises the following steps:
b1, obtaining a neural network layer gradient list through feedforward and feedback calculation of a neural network sample;
step B2, traversing each layer in the neural network layer gradient list layer by layer, judging whether each layer is in the layer transmission list, if so, obtaining a plurality of selected layers, and entering a step B3, otherwise, obtaining a plurality of local accumulated gradients;
b3, judging whether each selected layer has intra-layer compression, if so, obtaining a plurality of selected layers with intra-layer compression, and entering a step B4, otherwise, obtaining a plurality of selected layer transmission gradients without intra-layer compression;
b4, sequentially carrying out intra-layer sparsification, inter-node communication and decompression synchronization on the inner part of each intra-layer selected layer to obtain a plurality of intra-layer compressed selected layer transmission gradients;
b5, carrying out global average on each local accumulated gradient, each selected layer transmission gradient without intra-layer compression or each selected layer transmission gradient with intra-layer compression to obtain a complete gradient;
and B6, obtaining weight updating parameters according to the complete gradient, and completing the distributed deep learning training based on layer sparsification.
CN202111627780.9A 2021-12-28 2021-12-28 Distributed deep learning training method and system based on layer sparsification Active CN114298277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111627780.9A CN114298277B (en) 2021-12-28 2021-12-28 Distributed deep learning training method and system based on layer sparsification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111627780.9A CN114298277B (en) 2021-12-28 2021-12-28 Distributed deep learning training method and system based on layer sparsification

Publications (2)

Publication Number Publication Date
CN114298277A CN114298277A (en) 2022-04-08
CN114298277B (en) 2023-09-12

Family

ID=80972299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111627780.9A Active CN114298277B (en) 2021-12-28 2021-12-28 Distributed deep learning training method and system based on layer sparsification

Country Status (1)

Country Link
CN (1) CN114298277B (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11356334B2 (en) * 2016-04-15 2022-06-07 Nec Corporation Communication efficient sparse-reduce in distributed machine learning
US10832123B2 (en) * 2016-08-12 2020-11-10 Xilinx Technology Beijing Limited Compression of deep neural networks with proper use of mask
KR102355817B1 (en) * 2017-01-17 2022-01-26 삼성전자 주식회사 Method and apparatus for semi-persistent csi reporting in wireless communication system
CN112534452A (en) * 2018-05-06 2021-03-19 强力交易投资组合2018有限公司 Method and system for improving machines and systems for automatically performing distributed ledger and other transactions in spot and forward markets for energy, computing, storage, and other resources

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102027689A (en) * 2008-05-13 2011-04-20 高通股份有限公司 Repeaters for enhancement of wireless power transfer
WO2018107414A1 (en) * 2016-12-15 2018-06-21 上海寒武纪信息科技有限公司 Apparatus, equipment and method for compressing/decompressing neural network model
CN109409505A (en) * 2018-10-18 2019-03-01 中山大学 A method of the compression gradient for distributed deep learning
CN109951438A (en) * 2019-01-15 2019-06-28 中国科学院信息工程研究所 A kind of communication optimization method and system of distribution deep learning
CN111368996A (en) * 2019-02-14 2020-07-03 谷歌有限责任公司 Retraining projection network capable of delivering natural language representation
CN110532898A (en) * 2019-08-09 2019-12-03 北京工业大学 A kind of physical activity recognition methods based on smart phone Multi-sensor Fusion
CN111325356A (en) * 2019-12-10 2020-06-23 四川大学 Neural network search distributed training system and training method based on evolutionary computation
CN111353582A (en) * 2020-02-19 2020-06-30 四川大学 Particle swarm algorithm-based distributed deep learning parameter updating method
CN111858072A (en) * 2020-08-06 2020-10-30 华中科技大学 Resource management method and system for large-scale distributed deep learning
CN112019651A (en) * 2020-08-26 2020-12-01 重庆理工大学 DGA domain name detection method using depth residual error network and character-level sliding window
CN112738014A (en) * 2020-10-28 2021-04-30 北京工业大学 Industrial control flow abnormity detection method and system based on convolution time sequence network
CN113159287A (en) * 2021-04-16 2021-07-23 中山大学 Distributed deep learning method based on gradient sparsity
CN113554169A (en) * 2021-07-28 2021-10-26 杭州海康威视数字技术股份有限公司 Model optimization method and device, electronic equipment and readable storage medium
CN113837299A (en) * 2021-09-28 2021-12-24 平安科技(深圳)有限公司 Network training method and device based on artificial intelligence and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于InfiniBand的集群分布式并行绘制系统设计";付讯等;《四川大学学报(自然科学版)》;第52卷(第1期);第39-44页 *

Also Published As

Publication number Publication date
CN114298277A (en) 2022-04-08

Similar Documents

Publication Publication Date Title
Wu et al. Fast-convergent federated learning with adaptive weighting
Zhang et al. CSAFL: A clustered semi-asynchronous federated learning framework
CN111382844B (en) Training method and device for deep learning model
CN113222179B (en) Federal learning model compression method based on model sparsification and weight quantification
Li et al. GGS: General gradient sparsification for federated learning in edge computing
CN110856268B (en) Dynamic multichannel access method for wireless network
Jiang et al. Fedmp: Federated learning through adaptive model pruning in heterogeneous edge computing
CN112862088A (en) Distributed deep learning method based on pipeline annular parameter communication
CN114418129B (en) Deep learning model training method and related device
Liu et al. Fedpa: An adaptively partial model aggregation strategy in federated learning
Cao et al. HADFL: Heterogeneity-aware decentralized federated learning framework
Chen et al. Service delay minimization for federated learning over mobile devices
CN116050509A (en) Clustering federal learning method based on momentum gradient descent
CN115797835A (en) Non-supervision video target segmentation algorithm based on heterogeneous Transformer
Zheng et al. Distributed hierarchical deep optimization for federated learning in mobile edge computing
CN114298277B (en) Distributed deep learning training method and system based on layer sparsification
Cai et al. High-efficient hierarchical federated learning on non-IID data with progressive collaboration
CN117421115A (en) Cluster-driven federal learning client selection method with limited resources in Internet of things environment
Deng et al. HSFL: Efficient and privacy-preserving offloading for split and federated learning in IoT services
CN114465900B (en) Data sharing delay optimization method and device based on federal edge learning
CN114707636A (en) Neural network architecture searching method and device, electronic equipment and storage medium
Yu et al. Proximal Policy Optimization-based Federated Client Selection for Internet of Vehicles
Yang et al. On the convergence of hybrid federated learning with server-clients collaborative training
Shahab et al. Population-based evolutionary distributed SGD
CN113688891B (en) Distributed cascade forest method capable of adaptively dividing sub-forest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant