WO2022179007A1 - Distributed communication-based neural network training method and apparatus, and storage medium

Distributed communication-based neural network training method and apparatus, and storage medium

Info

Publication number
WO2022179007A1
Authority
WO
WIPO (PCT)
Prior art keywords
gradient
sequence
training
important
synchronized
Prior art date
Application number
PCT/CN2021/100623
Other languages
French (fr)
Chinese (zh)
Inventor
颜子杰
段江飞
孙鹏
张行程
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2022179007A1 publication Critical patent/WO2022179007A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present disclosure relates to the technical field of machine learning, and in particular, to a neural network training method, device and storage medium based on distributed communication.
  • Embodiments of the present application provide a distributed communication-based neural network training method, device, and storage medium, which are used to solve at least one of the above-mentioned technical problems.
  • an embodiment of the present application provides a distributed communication-based neural network training method, which is applied to training nodes, and the method includes:
  • a cumulative gradient sequence is obtained;
  • the second gradient sequence is used to record the gradients that have not participated in synchronization;
  • the parameters of the neural network are adjusted according to the post-synchronization gradient sequence, and the post-synchronization important gradient indication sequence is used as the new important gradient indication sequence.
  • the embodiments of the present application provide a neural network training device based on distributed communication, the device is set on a training node, and includes:
  • a training module is used to train the corresponding neural network of the training node, and the generated gradient is stored in the first gradient sequence;
  • a cumulative gradient acquisition module configured to obtain a cumulative gradient sequence according to the first gradient sequence and the second gradient sequence; the second gradient sequence is used to record gradients that have not yet participated in synchronization;
  • an importance index sequence calculation module configured to calculate and obtain an importance index sequence according to the cumulative gradient sequence
  • a gradient classification module configured to obtain an important gradient indication sequence, and determine an important gradient in the cumulative gradient sequence according to the important gradient indication sequence;
  • a synchronization module configured to obtain the to-be-synchronized information of the training nodes according to the important gradients and the importance index sequence, and to perform synchronization between the training nodes based on the to-be-synchronized information to obtain a post-synchronized gradient sequence and post-synchronized important gradients instruction sequence;
  • an update module configured to use the synchronized important gradient indication sequence as the new important gradient indication sequence
  • a parameter adjustment module configured to adjust the parameters of the neural network according to the synchronized gradient sequence.
  • an embodiment of the present application provides a neural network training method based on distributed communication, which is applied to a training system including multiple training nodes, and the method includes:
  • Each training node trains the corresponding neural network, and saves the generated gradient in the first gradient sequence
  • a cumulative gradient sequence is obtained; the second gradient sequence is used to record the gradients in the training node that have not participated in synchronization;
  • Each training node performs synchronization between the training nodes based on the corresponding information to be synchronized, and obtains a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence;
  • Each training node adjusts the parameters of the neural network according to the post-synchronized gradient sequence, and uses the post-synchronized important gradient indication sequence as the new important gradient indication sequence.
  • an embodiment of the present application provides a computer-readable storage medium, where at least one instruction or at least one program is stored in the computer-readable storage medium, and the at least one instruction or at least one program is loaded and executed by a processor In order to realize the neural network training method based on distributed communication according to any one of the first aspect.
  • an embodiment of the present application provides a training device, including at least one processor, and a memory communicatively connected to the at least one processor; wherein the memory stores a program that can be executed by the at least one processor. Instructions, the at least one processor implements the distributed communication-based neural network training method according to any one of the first aspect by executing the instructions stored in the memory.
  • an embodiment of the present application provides a training system, where the training system includes a plurality of training devices described in the fifth aspect.
  • embodiments of the present application provide a computer program product, where the computer program product may include a computer-readable storage medium on which computer-readable program instructions for enabling a processor to implement various aspects are stored.
  • the important gradient and the related information used to determine the important gradient can be synchronized in one communication process, and the synchronization result can ensure that the important gradient obtained by each training node corresponds to the same neural network parameters, and there is no need for additional communication between nodes on the position of important gradients, which reduces the frequency of node communication and improves the training speed.
  • FIG. 1 shows a schematic diagram of neural network training based on distributed communication in the related art according to an embodiment of the present disclosure
  • FIG. 2 shows a flowchart of a method for training a neural network based on distributed communication according to an embodiment of the present disclosure
  • Fig. 3 shows a feasible structural schematic diagram of a deep neural network according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of the relationship between the first storage space, the second storage space and a neural network structure according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic flowchart of a multi-node implementation of the above-mentioned distributed communication-based neural network training method according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic flowchart of step S20 in the above-mentioned distributed communication-based neural network training method according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic flowchart of step S40 in the above-mentioned distributed communication-based neural network training method according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic flowchart of step S60 in the above-mentioned distributed communication-based neural network training method according to an embodiment of the present disclosure
  • FIG. 9 shows a block diagram of a neural network training apparatus based on distributed communication according to an embodiment of the present disclosure
  • FIG. 10 shows a block diagram of a training device according to an embodiment of the present disclosure
  • FIG. 11 shows a block diagram of another training device according to an embodiment of the present disclosure.
  • FIG. 1 shows a schematic diagram of neural network training based on distributed communication in the related art.
  • Each GPU (Graphics Processing Unit, graphics processor) serves as one training node.
  • the above m training nodes can independently train a neural network with the same structure.
  • the sample data is input into the neural network, and the forward propagation process and the back propagation process are triggered in turn.
  • in the above forward propagation process, the sample data is input to the first layer, and the first layer, the second layer, and so on are triggered in sequence until the nth layer outputs the calculation result.
  • in the back propagation process, the nth layer, the (n-1)th layer, and so on are triggered in sequence until the first layer, with each layer generating gradients in turn.
  • the above m training nodes communicate according to the generated gradient, and adjust the parameters of the corresponding neural network according to the communication result.
  • each training node can adjust the parameters of its corresponding neural network with reference to the gradients generated by other training nodes, thereby improving the training speed. However, the communication frequency between nodes is also high, which makes the communication process itself consume more resources, and each training node may be forced to wait frequently due to communication delays, thereby reducing the training speed. In this case, the communication process of the distributed nodes becomes the bottleneck restricting training efficiency.
  • the embodiments of the present disclosure provide a neural network training method based on distributed communication.
  • the distributed communication-based neural network training method provided by the embodiments of the present disclosure can be applied to any data processing device with a graphics processor (Graphics Processing Unit, GPU).
  • the data processing device can be a terminal, including a personal computer (Personal Computer, PC), minicomputer, medium computer, mainframe, workstation, etc.
  • the data processing device can also be a server. It should be noted that the data processing device may be independent when used for training the neural network, or may exist in the form of a cluster.
  • the distributed communication-based neural network training method provided by the embodiment of the present disclosure may be stored in a data processing device in the form of a computer program, and the data processing device implements the distributed communication-based neural network training method of the present disclosure by running the computer program.
  • the above-mentioned computer program may be an independent computer program, or may be a functional module, a plug-in, or a small program integrated on other computer programs.
  • FIG. 2 shows a flowchart of a method for training a neural network based on distributed communication according to an embodiment of the present disclosure. As shown in FIG. 2 , the above method includes:
  • the embodiments of the present disclosure do not limit the structure of the neural network, and the above-mentioned neural network may be at least one of a deep neural network, a convolutional neural network, and a recurrent neural network.
  • a deep neural network as an example, please refer to FIG. 3 , which shows a schematic structural diagram of a feasible deep neural network.
  • the above-mentioned deep neural network may include a convolutional layer, a pooling layer and a fully connected layer, and one of the above-mentioned convolutional layers is used as the input layer of the neural network, and the above-mentioned fully connected layer is used as the output layer of the above-mentioned deep neural network.
  • convolutional layers and pooling layers can be arranged alternately between the input layer and the above-mentioned output layer.
  • a first storage space (grad_buffer) and a second storage space (left_buffer) may be opened on the training node.
  • the first storage space (grad_buffer) and the second storage space (left_buffer) are contiguous storage spaces.
  • the neural network may have multiple layers, each layer may include multiple parameters, and each parameter in each layer corresponds to a segment of storage interval in the above-mentioned first storage space.
  • the first storage space and the second storage space may have the same structure, and both may be the same size as the storage space occupied by the parameters of the neural network.
  • the first storage space (grad_buffer) may be used to store gradients (first gradient sequences) generated by the training nodes during the training process
  • the second storage space (left_buffer) may be used to store gradients that have not yet participated in synchronization.
  • Layer 1 includes parameter E10, parameter E11, parameter E12 and parameter E13;
  • Layer 2 includes parameter E20, parameter E21, and parameter E22;
  • Layer 3 includes parameter E30, parameter E31 and parameter E32;
  • Layers 1-3 include a total of 10 parameters, then the above-mentioned first storage space and the above-mentioned second storage space both include 10 storage intervals, and each layer generates a gradient corresponding to each parameter in the reverse direction during the training process, specifically:
  • layer 3 generates gradient T30, gradient T31 and gradient T32 corresponding to parameter E30, parameter E31 and parameter E32, respectively;
  • layer 2 generates gradient T20, gradient T21, and gradient T22 corresponding to parameter E20, parameter E21, parameter E22, respectively;
  • layer 1 generates gradient T10, gradient T11, gradient T12, and gradient T13 corresponding to parameter E10, parameter E11, parameter E12, and parameter E13, respectively.
  • the 10 gradients can be sequentially stored in the first storage space in the order of gradient generation, that is, the data in the first storage space (grad_buffer) can be {T30, T31, T32, T20, T21, T22, T10, T11, T12, T13}.
  • the data {T30, T31, T32, T20, T21, T22, T10, T11, T12, T13} of the first storage space (grad_buffer) is the first gradient sequence.
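  • As a minimal illustration of this storage order (all variable names and gradient values below are assumptions, not part of the disclosed embodiments), the gradients can be appended to the first storage space in the order they are generated during back propagation:

```python
import numpy as np

# Hypothetical per-layer gradients for the 3-layer example above.
layer_grads = {
    3: np.array([0.1, -0.2, 0.3]),       # T30, T31, T32
    2: np.array([0.05, 0.0, -0.1]),      # T20, T21, T22
    1: np.array([0.2, 0.1, -0.3, 0.4]),  # T10, T11, T12, T13
}

# Gradients are written in generation order: layer 3, then layer 2, then layer 1.
grad_buffer = np.concatenate([layer_grads[layer] for layer in (3, 2, 1)])
# grad_buffer now holds the first gradient sequence {T30, ..., T13} (10 values).
```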
  • the above-mentioned training node can take out a minimum batch of samples each time, perform training on the minimum batch, and obtain the gradients corresponding to this batch of training according to the training results.
  • the above gradient is stored in the first gradient sequence.
  • the first gradient sequence can be recorded after each batch of training, and the subsequent synchronization between training nodes can be performed after multiple batches of training, which reduces communication frequency and improves training speed.
  • S20 Obtain a cumulative gradient sequence according to the first gradient sequence and the second gradient sequence; the second gradient sequence is used to record gradients that have not yet participated in synchronization.
  • the second gradient sequence may be stored in the second storage space (left_buffer).
  • the second storage space has the same structure as the above-mentioned first storage space, and the storage intervals in the same position correspond to the same neural network parameters.
  • the 10 consecutive storage intervals in the first storage space and the 10 consecutive storage intervals in the second storage space correspond to parameter E30, parameter E31, parameter E32, parameter E20, parameter E21, parameter E22, parameter E10, parameter E11, parameter E12, and parameter E13, respectively.
  • the accumulated gradient sequence may be obtained by adding the data in the same storage interval in the first storage space and the second storage space, and the accumulated gradient sequence may be stored in the first storage space.
  • any storage interval grad_buffer[i] of the first storage space and the corresponding storage interval left_buffer[i] of the second storage space can be obtained, where i takes values from 0 to 9, and grad_buffer[i]+left_buffer[i] is assigned to grad_buffer[i]. The data in the first storage space after the assignment is the above-mentioned cumulative gradient sequence, that is, the above-mentioned cumulative gradient sequence and the above-mentioned first gradient sequence reuse the above-mentioned first storage space.
  • the number of training nodes participating in the collaborative training of the neural network can be large, and because the complexity of the neural network can be very high, the storage space occupied by the generated gradients is correspondingly often large as well. Reusing the first storage space in each training node can therefore greatly save storage resources.
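  • A minimal sketch of this reuse, with assumed names and random placeholder values, shows the accumulation performed in place so that the cumulative gradient sequence occupies the same memory as the first gradient sequence:

```python
import numpy as np

grad_buffer = np.random.randn(10)  # first gradient sequence from the current batch
left_buffer = np.zeros(10)         # second gradient sequence: gradients not yet synchronized

# grad_buffer[i] = grad_buffer[i] + left_buffer[i], done in place: the cumulative
# gradient sequence reuses the first storage space instead of allocating new memory.
grad_buffer += left_buffer
```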
  • the importance index may be calculated corresponding to each accumulated gradient in the accumulated gradient sequence, and the embodiment of the present disclosure does not limit the method for determining the above importance index.
  • the importance index sequence Im can be obtained, and each importance index Im[i] can represent the importance of the gradient generated for the neural network parameter corresponding to grad_buffer[i] obtained above.
  • S40 Acquire an important gradient indication sequence, and determine an important gradient in the above-mentioned cumulative gradient sequence according to the above-mentioned important gradient indication sequence.
  • each important gradient indication value in the important gradient indication sequence may indicate whether the accumulated gradient of the corresponding position is important.
  • the important gradient indication sequence Ip can be obtained, and each important gradient indication value Ip[i] can indicate whether the data recorded by grad_buffer[i] obtained above is an important gradient.
  • if Ip[i] is the first indication value, the data recorded in grad_buffer[i] is determined to be an important gradient;
  • if Ip[i] is the second indication value, the data recorded in grad_buffer[i] is determined to be an unimportant gradient.
  • the important gradient indication sequence Ip may be obtained during the last synchronization process between nodes.
  • grad_buffer[1], grad_buffer[3], and grad_buffer[5] are the important gradients
  • other gradients in grad_buffer are non-important gradients.
  • the second gradient sequence may also be updated according to the insignificant gradient.
  • the non-important gradients do not participate in this synchronization between nodes. Therefore, the non-important gradients can be updated into the second gradient sequence, that is, correspondingly stored in the second storage space (left_buffer). Updating the second gradient sequence according to the unimportant gradients includes: determining the positions of the unimportant gradients in the second gradient sequence, updating the data of the second gradient sequence at those positions to the unimportant gradients, and assigning the data at the other positions a value of 0.
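  • A small sketch of this update (variable names and the chosen important positions are assumptions): the unimportant cumulative gradients are written back into left_buffer and every other position is set to 0:

```python
import numpy as np

cumulative = np.random.randn(10)       # cumulative gradient sequence (grad_buffer)
important = np.zeros(10, dtype=bool)
important[[1, 3, 5]] = True            # grad_buffer[1], [3], [5] are the important gradients

# Unimportant gradients are kept in left_buffer for a later synchronization;
# positions holding important gradients are reset to 0.
left_buffer = np.where(important, 0.0, cumulative)
```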
  • the above-mentioned important gradient and the above-mentioned importance index sequence may be spliced, and the splicing result may be used as the above-mentioned information to be synchronized.
  • the embodiments of the present disclosure do not limit the splicing method.
  • S60 Perform synchronization between training nodes based on the above information to be synchronized, to obtain a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence.
  • AllReduce can be used for communication, and AllReduce is a general term for a class of communication methods that can efficiently communicate between distributed training nodes.
  • each training node can obtain the post-synchronized important gradient indication sequence during each synchronization, and this post-synchronized important gradient indication sequence is used as the important gradient indication sequence applied in the next training. Therefore, the important gradient indication sequence used by each training node during training is the same, that is, the important gradient positions are the same.
  • if training node A determines in step S40 that grad_buffer[1], grad_buffer[3], and grad_buffer[5] are important gradients, the other training nodes will also determine their corresponding grad_buffer[1], grad_buffer[3], and grad_buffer[5] as important gradients. The post-synchronized gradient sequence can therefore be calculated directly from the important gradients of each node without additionally considering the positions of the important gradients during synchronization, and the position information of the important gradients no longer needs to be communicated separately between the training nodes, which reduces the communication frequency.
  • the averages of the important gradients grad_buffer[1], grad_buffer[3], and grad_buffer[5] across the nodes can be used as Td[1], Td[3], and Td[5] of the post-synchronized gradient sequence Td, and the values at the other positions in Td can be assigned the preset gradient value.
  • the importance index sequence after synchronization can be obtained by taking the mean value element by element according to the importance index sequence of each node, and the above-mentioned important gradient indication sequence after synchronization is correspondingly obtained according to the importance index sequence after synchronization.
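  • A minimal local simulation of this synchronization step (names and values are assumptions; an actual implementation would use an AllReduce primitive): the post-synchronized gradient sequence and the average importance index sequence are element-wise means over the nodes' information to be synchronized:

```python
import numpy as np

num_nodes = 3
node_grads = [np.random.randn(10) for _ in range(num_nodes)]  # Wt1, Wt2, Wt3
node_imps = [np.random.rand(2) for _ in range(num_nodes)]     # per-node {Im[0], Im[1]}

synced_grads = sum(node_grads) / num_nodes    # post-synchronized gradient sequence Td
avg_importance = sum(node_imps) / num_nodes   # average importance index sequence {AIm[0], AIm[1]}
```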
  • S70 Adjust the parameters of the above-mentioned neural network according to the above-mentioned post-synchronized gradient sequence, and use the above-mentioned important gradient indication sequence after synchronization as a new above-mentioned important gradient indication sequence.
  • the above-mentioned adjustment of the parameters of the above-mentioned neural network according to the above-mentioned post-synchronized gradient sequence includes:
  • the above-mentioned preset gradient value is used to indicate a situation in which parameter adjustment according to the gradient is not required.
  • the above-mentioned preset gradient value may be 0.
  • the elements in the synchronized gradient sequence may be correspondingly stored in the first storage space, and the first storage space (grad_buffer) may be reused again, thereby reducing storage consumption.
  • when all gradients have been extracted, "extract the next gradient" will fail, which indicates that the adjustment of the parameters of the neural network has been completed. After completion, the first storage space can be cleared, so that when steps S10-S70 are executed iteratively, the first gradient sequence can be recorded in the reused first storage space in step S10.
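  • A hedged sketch of this parameter adjustment (the learning rate and names are assumptions): positions holding the preset gradient value, 0 here, are skipped:

```python
import numpy as np

params = np.random.randn(10)        # flattened neural network parameters
synced_grads = np.random.randn(10)  # post-synchronized gradient sequence
synced_grads[5:] = 0.0              # preset gradient value: no adjustment needed there
learning_rate = 0.01                # hypothetical value

mask = synced_grads != 0.0                          # skip gradients equal to the preset value
params[mask] -= learning_rate * synced_grads[mask]  # adjust only where a real gradient exists
```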
  • the unimportant gradient is indicated by setting the preset gradient value, and the parameter value corresponding to the unimportant gradient does not need to be adjusted, thereby avoiding excessive adjustment of the parameters of the neural network and further improving the convergence speed of the neural network.
  • Steps S10-S70 in this embodiment of the present disclosure show a neural network training method in a single synchronization situation.
  • both the second gradient sequence and the important gradient indication sequence may be updated. According to the update result, the above-mentioned neural network training method for a single synchronization can be performed again, that is, steps S10-S70 can be performed iteratively until a preset training stop condition is reached and a trained neural network is obtained.
  • the embodiment of the present disclosure does not limit the above training stop condition, and the above training stop condition may be that the number of iterations reaches a preset iteration threshold, or that the loss generated by the neural network is less than a preset loss threshold.
  • FIG. 5 shows a schematic flowchart of multi-node executing the above-mentioned neural network training method based on distributed communication.
  • Each training node in FIG. 5 can independently train its own neural network, update the local second gradient sequence during the training process, and obtain information to be synchronized.
  • the information to be synchronized includes important gradients and importance index sequences.
  • Each training node communicates based on the information to be synchronized, and obtains a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence.
  • Each node performs parameter adjustment based on the above-mentioned post-synchronization gradient sequence, and determines the important gradient in the next iteration based on the above-mentioned post-synchronization important gradient indication sequence.
  • Each training node continuously adjusts its own parameters through iteration, and communicates and cooperates between nodes to complete the training of the neural network during each iteration.
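  • The iterative flow can be sketched end to end as a toy simulation; every name, the segment size, the learning rate, and the use of the 2-norm as importance index below are assumptions made only for illustration:

```python
import numpy as np

NUM_NODES, NUM_PARAMS, SEG = 3, 10, 5             # two segments of five gradients each
rng = np.random.default_rng(0)

params = rng.standard_normal(NUM_PARAMS)          # identical parameters on every node
left = [np.zeros(NUM_PARAMS) for _ in range(NUM_NODES)]  # per-node second gradient sequences
indication = np.array([1, 0])                     # initial important gradient indication sequence

for step in range(3):
    to_sync, importances = [], []
    for n in range(NUM_NODES):
        grad = rng.standard_normal(NUM_PARAMS)            # S10: gradients of this batch
        cumulative = grad + left[n]                       # S20: cumulative gradient sequence
        segs = cumulative.reshape(-1, SEG)
        importances.append(np.linalg.norm(segs, axis=1))  # S30: 2-norm per segment
        mask = np.repeat(indication.astype(bool), SEG)    # S40: segment-level importance
        left[n] = np.where(mask, 0.0, cumulative)         # unimportant gradients wait locally
        to_sync.append(np.where(mask, cumulative, 0.0))   # S50: gradient sequence to synchronize
    synced = sum(to_sync) / NUM_NODES                     # S60: post-synchronized gradients
    avg_imp = sum(importances) / NUM_NODES
    threshold = np.sort(avg_imp)[::-1][0]                 # keep the single most important segment
    indication = (avg_imp >= threshold).astype(int)       # new important gradient indication
    params -= 0.01 * synced                               # S70: adjust parameters (0 = no change)
```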
  • the number of training nodes involved in training is often relatively large, and because the complexity of the neural network can be very high, the amount of generated gradient data is also relatively large. Correspondingly, the amount of gradient position information used to record the positions of the gradients is also large.
  • in the related art, the training nodes not only need to synchronize the gradient position information but also the gradients themselves, which generates large communication pressure and consumes many communication resources. Limited by communication resources, many nodes may frequently be forced to wait for synchronization, which reduces the training speed.
  • the distributed communication-based neural network training method can synchronize important gradients and important gradient position information in one communication process, and the synchronization result can ensure that the important gradients obtained by each training node correspond to the same neural network parameters. There is no need to additionally communicate the gradient position information between the training nodes, which can significantly reduce the communication frequency, reduce the consumption of communication resources, reduce the time training nodes spend waiting for synchronization, and improve the training speed.
  • a particularly obvious speed-up effect can be achieved in scenarios where the number of training nodes, the amount of data, and the complexity of the neural network are all high.
  • the speed at which the above-mentioned training node performs the above-mentioned steps may be further improved based on the multi-thread concurrency characteristic, and the embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
  • FIG. 6 shows a schematic flowchart of step S20 in the above-mentioned distributed communication-based neural network training method.
  • as shown in FIG. 6, obtaining the cumulative gradient sequence according to the above-mentioned first gradient sequence and second gradient sequence includes:
  • S21 Segment the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule to obtain a first gradient segment sequence and a second gradient segment sequence; wherein, if the position of a first gradient segment in the first gradient segment sequence is the same as the position of a second gradient segment in the second gradient segment sequence, then the first gradient segment corresponds to the second gradient segment, and both correspond to the same neural network parameters.
  • the embodiments of the present disclosure do not limit specific segmentation rules, as long as the above-mentioned first gradient sequence can be divided into a plurality of first gradient segments to form a first gradient segment sequence.
  • the first gradient sequence and the second gradient sequence have the same structure, and the second gradient sequence is segmented based on the segmentation rule to obtain the second gradient segment sequence.
  • if the position of the first gradient segment in the first gradient segment sequence is the same as the position of the second gradient segment in the second gradient segment sequence, then the first gradient segment and the second gradient segment correspond, and both correspond to the same neural network parameters.
  • the first gradient sequence obtained in step S10 is stored in the first storage space (grad_buffer). The data in grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3], and grad_buffer[4] can form one first gradient segment Tdf[0], and the data in grad_buffer[5], grad_buffer[6], grad_buffer[7], grad_buffer[8], and grad_buffer[9] can form another first gradient segment Tdf[1], obtaining the first gradient segment sequence {Tdf[0], Tdf[1]}.
  • the second gradient sequence is stored in the second storage space (left_buffer). The data in left_buffer[0], left_buffer[1], left_buffer[2], left_buffer[3], and left_buffer[4] forms one second gradient segment Tds[0], and the data in left_buffer[5], left_buffer[6], left_buffer[7], left_buffer[8], and left_buffer[9] forms another second gradient segment Tds[1], obtaining the second gradient segment sequence {Tds[0], Tds[1]}.
  • each data in Tdf[0] corresponds to parameter E30, parameter E31, parameter E32, parameter E20, and parameter E21 in sequence.
  • Tdf[0] corresponds to Tds[0]
  • each data in Tds[0] also corresponds to parameter E30, parameter E31, parameter E32, parameter E20 and parameter E21 in sequence.
  • S22 Set up multiple parallel computing threads, each of which obtains at least one first gradient segment and a second gradient segment corresponding to the foregoing first gradient segment.
  • Tdf[0] and Tds[0] can be sent to computing thread A
  • Tdf[1] and Tds[1] can be sent to computing thread B.
  • for each acquired first gradient segment, each of the above-mentioned computing threads accumulates the first gradient segment and the corresponding second gradient segment to obtain a corresponding accumulated gradient segment.
  • the data in Tdf[0] and Tds[0] can be added element by element to obtain the corresponding cumulative gradient segment.
  • the data of each element in the first gradient segment Tdf[0] is stored in grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3], and grad_buffer[4] in sequence, and the data of each element in the second gradient segment Tds[0] is stored in left_buffer[0], left_buffer[1], left_buffer[2], left_buffer[3], and left_buffer[4] in sequence. The data in the corresponding accumulated gradient segment STd[0] is then grad_buffer[0]+left_buffer[0], grad_buffer[1]+left_buffer[1], grad_buffer[2]+left_buffer[2], grad_buffer[3]+left_buffer[3], and grad_buffer[4]+left_buffer[4].
  • computing thread B can correspondingly obtain STd[1] according to Tdf[1] and Tds[1], and the five accumulated gradients in STd[1] correspond to parameter E22, parameter E10, parameter E11, parameter E12, and parameter E13.
  • calculation thread A can obtain STd[0]
  • calculation thread B can obtain STd[1]
  • the cumulative gradients in STd[0] and the cumulative gradients in STd[1] are arranged in sequence to obtain the above-mentioned cumulative gradient sequence.
  • taking the three-layer neural network described above as an example, grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3], and grad_buffer[4] can be updated according to the five elements in STd[0], and grad_buffer[5], grad_buffer[6], grad_buffer[7], grad_buffer[8], and grad_buffer[9] can be updated according to the five elements in STd[1]. The data in the first storage space then forms the above-mentioned cumulative gradient sequence, which reduces storage consumption by reusing the first storage space.
  • based on the above configuration, the cumulative gradients can be calculated in parallel by segment, improving the calculation speed of the cumulative gradient sequence.
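  • A minimal sketch of this segmented, multi-threaded accumulation (the thread pool, segment size and variable names are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

grad_buffer = np.random.randn(10)   # first gradient sequence (also holds the result)
left_buffer = np.random.randn(10)   # second gradient sequence
SEG = 5                             # segmentation rule: five gradients per segment

def accumulate(start):
    # One worker per (first gradient segment, second gradient segment) pair.
    grad_buffer[start:start + SEG] += left_buffer[start:start + SEG]

with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(accumulate, range(0, len(grad_buffer), SEG)))
# grad_buffer now stores the cumulative gradient sequence formed from STd[0] and STd[1].
```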
  • the importance index sequence may also be calculated with the gradient segment as the granularity, that is, for each of the above-mentioned cumulative gradient segments, the corresponding importance index is calculated, and the importance index sequence is obtained from these calculation results.
  • the corresponding importance index Im[0] can be obtained; for the cumulative gradient segment STd[1], the corresponding importance index Im[1] can be obtained.
  • the obtained importance index can represent the importance of a gradient segment that includes multiple gradients, and in the subsequent synchronization process it can be determined whether the gradient segment is an important gradient segment. The important gradients can thus be determined with the gradient segment as the granularity, completing gradient updates at gradient-segment granularity and realizing sparse gradient updates, which further reduces the amount of communication data and improves the training speed.
  • the above-mentioned importance index sequence includes two importance indexes: Im[0] and Im[1].
  • a statistical value of each cumulative gradient in the cumulative gradient segment may be calculated, and the statistical value may be used as the importance index.
  • the type of the above-mentioned statistical value is not limited in the embodiment of the present disclosure.
  • the above-mentioned statistical value may be a variance, a standard deviation, or a two-norm.
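  • For example, with the 2-norm as the statistical value (a minimal sketch with assumed names; variance or standard deviation would be computed the same way):

```python
import numpy as np

cumulative = np.random.randn(10)                      # cumulative gradient sequence
segments = cumulative.reshape(2, 5)                   # STd[0] and STd[1]
importance_index = np.linalg.norm(segments, axis=1)   # {Im[0], Im[1]} using the 2-norm
# np.var(segments, axis=1) or np.std(segments, axis=1) would be equally valid choices.
```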
  • FIG. 7 shows a schematic flowchart of step S40 of the above-mentioned distributed communication-based neural network training method.
  • as shown in FIG. 7, acquiring the important gradient indication sequence and determining the important gradients in the above-mentioned cumulative gradient sequence according to the important gradient indication sequence includes:
  • each significant gradient indication value in the above-mentioned significant gradient indication sequence also corresponds to an accumulated gradient segment in the accumulated gradient sequence.
  • the above important gradient indication sequence also includes two important gradient indication values Ip[0] and Ip[1].
  • if the above-mentioned important gradient indication value is the first indication value, all the cumulative gradients in the above-mentioned cumulative gradient segment are determined as important gradients, and the above-mentioned cumulative gradient segment is submitted to the communication buffer of the above-mentioned training node; the above-mentioned first indication value indicates that the cumulative gradients in the above-mentioned cumulative gradient segment are all important gradients.
  • 1 may be used as the first indication value, and 0 may be used as the second indication value.
  • if the important gradient indication value Ip[0] is 1, it is determined that each cumulative gradient in the corresponding cumulative gradient segment STd[0] is an important gradient, and the above-mentioned cumulative gradient segment is submitted to the communication buffer of the above-mentioned training node. If the important gradient indication value Ip[1] is 0, the five accumulated gradients in the corresponding accumulated gradient segment STd[1] are all regarded as unimportant gradients.
  • the second gradient sequence may be updated according to the determined unimportant gradient. This step has been described above and will not be repeated here.
  • obtaining the information to be synchronized of the training node according to the important gradient and the sequence of importance indicators including:
  • the cumulative gradient segment STd[0] is submitted to the above-mentioned communication buffer, and the positions of the cumulative gradients in the cumulative gradient segment STd[0] in the above-mentioned cumulative gradient sequence are the first to fifth positions; therefore, the first to fifth positions in the above-mentioned gradient sequence to be synchronized are the five cumulative gradients in the cumulative gradient segment STd[0], and the other positions in the above-mentioned gradient sequence to be synchronized are set to the default value.
  • other positions in the above-mentioned gradient sequence to be synchronized may be set to zero.
  • the above-mentioned gradient sequence to be synchronized includes a total of 10 values, corresponding to parameter E30, parameter E31, parameter E32, parameter E20, parameter E21, parameter E22, parameter E10, parameter E11, parameter E12, and parameter E13, respectively.
  • the above-mentioned importance index sequence {Im[0], Im[1]} may be appended to the above-mentioned gradient sequence to be synchronized to obtain the above-mentioned information to be synchronized.
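  • Putting the last two steps together, a hedged sketch of how the information to be synchronized could be assembled (names, values and the zero default are assumptions consistent with the example above):

```python
import numpy as np

cumulative = np.random.randn(10)                  # cumulative gradient sequence
importance_index = np.array([1.7, 0.4])           # {Im[0], Im[1]} (made-up values)
indication = np.array([1, 0])                     # Ip[0] = 1: segment STd[0] is important

mask = np.repeat(indication.astype(bool), 5)      # expand segment flags to gradient positions
to_sync = np.where(mask, cumulative, 0.0)         # gradient sequence to be synchronized
info_to_sync = np.concatenate([to_sync, importance_index])  # splice the importance indexes on
```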
  • the important gradients can be quickly determined with the gradient segment as the granularity, and the gradient sequence to be synchronized can be obtained.
  • the gradient information to be synchronized corresponding to each parameter can be recorded in the gradient sequence to be synchronized.
  • the splicing result is used as the information to be synchronized, so that the information to be synchronized can include, with a small amount of data, the gradients, the gradient positions, and the importance index sequence used to determine the important gradient positions in the next iteration. This reduces the amount of communication data and also the communication frequency, which can significantly alleviate the communication pressure caused by neural network training in a distributed communication environment.
  • FIG. 8 shows a schematic flowchart of step S60 in the above-mentioned distributed communication-based neural network training method.
  • as shown in FIG. 8, performing synchronization between training nodes based on the above-mentioned information to be synchronized to obtain a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence includes:
  • the gradient sequence to be synchronized of training node 1 is Wt1
  • the gradient sequence to be synchronized of training node 2 is Wt2
  • the gradient sequence to be synchronized of training node 3 is Wt3.
  • the value of Wt1[i]+Wt2[i]+Wt3[i] is the synchronous accumulation gradient corresponding to sequence position i in the above synchronous accumulation gradient sequence.
  • the importance index sequence corresponding to each training node is {Im[0], Im[1]}.
  • the importance index sequences of training node 1, training node 2 and training node 3 are expressed as {Im1[0], Im1[1]}, {Im2[0], Im2[1]} and {Im3[0], Im3[1]}, respectively.
  • AIm[0] = (Im1[0]+Im2[0]+Im3[0])/3
  • AIm[1] = (Im1[1]+Im2[1]+Im3[1])/3.
  • a post-synchronized gradient sequence that can be accurately used to adjust neural network parameters can be calculated, and a post-synchronized important gradient indication sequence that can be used to quickly determine important gradients can be obtained, thereby improving the training speed of the neural network.
  • the above-mentioned calculation of the important gradient indication value corresponding to each average importance index in the above-mentioned average importance index sequence includes:
  • the value at the position pointed to by the threshold index in the sorting result may be used as the threshold of the importance index.
  • the threshold index is 10
  • the value corresponding to the 10th average importance index in the above sorting result is used as the above importance index threshold.
  • the number of segments may be determined according to the preset segmentation rule; a preset compression ratio is obtained; and the product of the compression ratio and the number of segments is used as the threshold indicator.
  • the embodiments of the present disclosure do not limit the specific values of the above-mentioned preset compression ratio and the above-mentioned number of segments. Referring to the foregoing example, the number of segments is only two, while in practical applications there may be far more than two segments; the examples in the embodiments of the present disclosure impose no specific limitation.
  • the above-mentioned first indication value may be used to indicate an important gradient, for example, the first indication value may be 1; the second indication value may be used to indicate an unimportant gradient, for example, the second indication value Can be 0.
  • taking the average importance index sequence {AIm[0], AIm[1]} as an example, the post-synchronized important gradient indication sequence {TIp[0], TIp[1]} can be correspondingly obtained.
  • the post-synchronized important gradient indication sequence {TIp[0], TIp[1]} can be used as the new important gradient indication sequence {Ip[0], Ip[1]}, and is applied to determine the important gradients in the next iteration. How to determine the important gradients based on the important gradient indication sequence {Ip[0], Ip[1]} has been described above and will not be repeated here.
  • the important gradient indication value corresponding to each average importance index can be accurately calculated, and the accuracy of the important gradient judgment can be improved.
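  • A compact sketch of this threshold computation (values and names are made up; treating a value equal to the threshold as important is a tie-breaking assumption here):

```python
import numpy as np

avg_importance = np.array([0.9, 0.2, 1.4, 0.6])    # AIm for four segments (made-up values)
compression_ratio = 0.5                            # preset compression ratio (assumption)
threshold_index = max(1, int(compression_ratio * len(avg_importance)))  # here: 2

sorted_desc = np.sort(avg_importance)[::-1]        # descending sort of average importance
threshold = sorted_desc[threshold_index - 1]       # value pointed to by the threshold index
indication = (avg_importance >= threshold).astype(int)  # 1 = first indication value
# indication is [1, 0, 1, 0]: the two most important segments are marked as important.
```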
  • the writing order of the steps does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • the present disclosure also provides another neural network training method based on distributed communication, which is applied to a training system including a plurality of training nodes, and the method includes:
  • Each training node trains the corresponding neural network, and saves the generated gradient in the first gradient sequence.
  • Each training node performs synchronization between training nodes based on the corresponding information to be synchronized, and obtains a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence.
  • Each training node adjusts the parameters of the neural network according to the post-synchronized gradient sequence, and uses the post-synchronized important gradient indication sequence as the new important gradient indication sequence.
  • FIG. 9 shows a block diagram of a distributed communication-based neural network training apparatus provided according to an embodiment of the present disclosure.
  • the above apparatus is set on a training node and includes:
  • the training module 10 is used to train the neural network corresponding to the above-mentioned training node, and save the generated gradient in the first gradient sequence;
  • the cumulative gradient acquisition module 20 is configured to obtain the cumulative gradient sequence according to the above-mentioned first gradient sequence and the second gradient sequence; the above-mentioned second gradient sequence is used to record the gradients that have not participated in synchronization;
  • the importance index sequence calculation module 30 is configured to calculate and obtain the importance index sequence according to the above-mentioned cumulative gradient sequence
  • the gradient classification module 40 is configured to obtain an important gradient indication sequence, and determine the important gradient in the above-mentioned cumulative gradient sequence according to the above-mentioned important gradient indication sequence;
  • the synchronization module 50 is configured to obtain the to-be-synchronized information of the above-mentioned training nodes according to the above-mentioned important gradients and the above-mentioned importance index sequence, and to perform synchronization between the training nodes based on the above-mentioned to-be-synchronized information to obtain a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence;
  • an update module 60 configured to use the above-mentioned important gradient indication sequence after synchronization as the new above-mentioned important gradient indication sequence
  • the parameter adjustment module 70 is configured to adjust the parameters of the above-mentioned neural network according to the above-mentioned post-synchronized gradient sequence.
  • the above-mentioned cumulative gradient acquisition module includes:
  • a segmentation unit configured to segment the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule, to obtain the first gradient segment sequence and the second gradient segment sequence; wherein, if the position of the first gradient segment in the above-mentioned first gradient segment sequence is the same as the position of the second gradient segment in the above-mentioned second gradient segment sequence, then the above-mentioned first gradient segment corresponds to the above-mentioned second gradient segment, and both correspond to the same neural network parameter;
  • a multi-thread processing unit configured to set up multiple parallel computing threads, each of the computing threads obtaining at least one first gradient segment and a second gradient segment corresponding to the first gradient segment;
  • an accumulating unit used for each of the above-mentioned computing threads to accumulate the above-mentioned first gradient section and the corresponding second gradient section for each obtained first gradient section to obtain a corresponding cumulative gradient section
  • the cumulative gradient sequence obtaining unit is configured to obtain the above cumulative gradient sequence according to the cumulative gradient segments obtained by each of the above calculation threads.
  • the above-mentioned importance index sequence calculation module includes: an importance index calculation unit, used for each of the above calculation threads to calculate the corresponding importance index according to the obtained cumulative gradient segment; the importance index sequence The obtaining unit is configured to obtain the sequence of importance indexes according to the calculation results of the importance indexes of each of the above calculation threads.
  • the above-mentioned gradient classification module is configured to, for each cumulative gradient segment calculated by each of the above-mentioned computing threads, extract a corresponding important gradient indication value from the above-mentioned important gradient indication sequence; if the above-mentioned important gradient indication value is the first indication value, the cumulative gradients in the above-mentioned cumulative gradient segment are all determined as important gradients, and the above-mentioned cumulative gradient segment is submitted to the communication buffer of the above-mentioned training node; the above-mentioned first indication value indicates that the cumulative gradients in the above-mentioned cumulative gradient segment are all important gradients.
  • the above synchronization module includes:
  • the to-be-synchronized gradient sequence acquisition unit is configured to obtain the to-be-synchronized gradient sequence according to the accumulated gradient segments in the communication buffer; wherein the position of each accumulated gradient of each accumulated gradient segment in the communication buffer in the to-be-synchronized gradient sequence is the same as its position in the accumulated gradient sequence, and the other positions in the to-be-synchronized gradient sequence are set to the preset gradient value;
  • the splicing unit is used for splicing the above-mentioned gradient sequence to be synchronized and the above-mentioned importance index sequence to obtain the above-mentioned information to be synchronized.
  • the above synchronization module further includes: a synchronous cumulative gradient sequence acquisition unit, configured to add element by element the to-be-synchronized gradient sequences in the to-be-synchronized information of each of the above training nodes to obtain a synchronous cumulative gradient sequence;
  • a post-synchronized gradient sequence acquisition unit configured to divide each synchronously accumulated gradient in the above-mentioned synchronously-accumulated gradient sequence by the total number of the above-mentioned training nodes to obtain the above-mentioned post-synchronized gradient sequence;
  • the accumulative importance index sequence acquisition unit is used to add the importance index sequences in the information to be synchronized of each of the above training nodes element by element to obtain the accumulated importance index sequence;
  • an average importance index sequence obtaining unit configured to divide each cumulative importance index in the above-mentioned cumulative importance index sequence by the total number of the above-mentioned training nodes to obtain an average importance index sequence
  • the post-synchronized important gradient indication sequence calculation unit is used to calculate the important gradient indication value corresponding to each average importance index in the above-mentioned average importance index sequence, and obtain the post-synchronized important gradient indication sequence.
  • the above-mentioned post-synchronized important gradient indication sequence calculation unit includes: a sorting unit, configured to sort the average importance indexes in the above-mentioned average importance index sequence in descending order to obtain a sorting result; and a threshold index obtaining unit, configured to obtain a threshold index and determine the importance index threshold in the above-mentioned sorting result according to the threshold index. If the above-mentioned average importance index is greater than the above-mentioned importance index threshold, the corresponding important gradient indication value is set to the first indication value; otherwise, the corresponding important gradient indication value is set to the second indication value; wherein the above-mentioned first indication value is used to indicate an important gradient, and the above-mentioned second indication value is used to indicate an unimportant gradient.
  • the above-mentioned threshold index obtaining unit is configured to determine the number of segments according to the above-mentioned preset segmentation rule, obtain a preset compression ratio, and use the product of the above-mentioned compression ratio and the above-mentioned number of segments as the above-mentioned threshold index.
  • the above-mentioned parameter adjustment module is used to sequentially extract the gradients in the above-mentioned synchronized gradient sequence; if the above-mentioned gradient is not equal to the above-mentioned preset gradient value, the corresponding neural network parameter is adjusted according to the above-mentioned gradient; if the above-mentioned gradient is equal to the above-mentioned preset gradient value, the next gradient is extracted.
  • the above-mentioned updating module is further configured to determine the insignificant gradients in the above-mentioned cumulative gradient sequence according to the above-mentioned important gradient indication sequence and update the above-mentioned second gradient sequence according to the above-mentioned insignificant gradients; the above-mentioned apparatus further includes an iteration control module, which is used to iteratively perform distributed communication-based neural network training until a preset training stop condition is reached.
  • the functions or modules included in the apparatuses provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, where at least one instruction or at least one segment of program is stored, and the above-mentioned method is implemented when the at least one instruction or at least one segment of program is loaded and executed by a processor.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure further provides a training device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to perform the above method.
  • the training device may be provided as a terminal, server or other form of device.
  • FIG. 10 shows a block diagram of a training device according to an embodiment of the present disclosure.
  • the training device 800 may be a terminal such as a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, or personal digital assistant.
  • the training device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • Processing component 802 generally controls the overall operation of training device 800, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to perform all or some of the steps of the methods described above. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.
  • Memory 804 is configured to store various types of data to support operation at training device 800 . Examples of such data include instructions for any application or method operating on the training device 800, contact data, phonebook data, messages, pictures, videos, and the like. Memory 804 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • Power supply component 806 provides power to various components of training device 800 .
  • Power supply components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to training device 800 .
  • the multimedia component 808 includes a screen that provides an output interface between the training device 800 described above and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The above-mentioned touch sensor may not only sense the boundary of the touch or swipe action, but also detect the duration and pressure associated with the above-mentioned touch or swipe action.
  • the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. When the training device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.
  • Audio component 810 is configured to output and/or input audio signals.
  • audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when training device 800 is in operating modes, such as call mode, recording mode, and speech recognition mode.
  • the received audio signal may be further stored in memory 804 or transmitted via the communication component 816.
  • audio component 810 also includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.
  • Sensor assembly 814 includes one or more sensors for providing state assessments of various aspects of training device 800 .
  • the sensor component 814 can detect the on/off state of the training device 800 and the relative positioning of components, such as the display and keypad of the training device 800; the sensor component 814 can also detect a change in the position of the training device 800 or of a component of the training device 800, the presence or absence of user contact with the training device 800, the orientation or acceleration/deceleration of the training device 800, and changes in the temperature of the training device 800.
  • Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 816 is configured to facilitate wired or wireless communication between training device 800 and other devices.
  • the training device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G, or a combination thereof.
  • the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 described above also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • training device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
  • a non-volatile computer-readable storage medium is also provided, such as memory 804 comprising computer program instructions executable by processor 820 of training device 800 to perform the above-described method.
  • training device 1900 may be provided as a server.
  • training device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource, represented by memory 1932, for storing instructions executable by processing component 1922, such as an application program.
  • An application program stored in memory 1932 may include one or more modules, each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the above-described methods.
  • the training device 1900 may also include a power supply component 1926 configured to perform power management of the training device 1900, a wired or wireless network interface 1950 configured to connect the training device 1900 to a network, and an input/output (I/O) interface 1958.
  • Training device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
  • a non-volatile computer-readable storage medium is also provided, such as memory 1932 comprising computer program instructions executable by processing component 1922 of training device 1900 to perform the method described above.
  • An embodiment of the present disclosure further provides a training system comprising a plurality of the above-mentioned training devices.
  • the present disclosure may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present disclosure.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, and any suitable combination of the above.
  • Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through electrical wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • Computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., connected through the Internet using an Internet service provider).
  • custom electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized by utilizing state information of the computer readable program instructions, and these electronic circuits can execute the computer readable program instructions to implement various aspects of the present disclosure.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagrams.
  • These computer readable program instructions can also be stored in a computer readable storage medium; these instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer readable medium on which the instructions are stored comprises an article of manufacture including instructions which implement various aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagrams.
  • Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment so as to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions, or can be implemented in a combination of dedicated hardware and computer instructions.

Abstract

The present disclosure relates to a distributed communication-based neural network training method and apparatus, and a storage medium. The method comprises: training a neural network, and saving a generated gradient in a first gradient sequence; obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, the second gradient sequence being used for recording a gradient that has not participated in synchronization; calculating an importance index sequence according to the cumulative gradient sequence; obtaining an important gradient indication sequence, and determining an important gradient in the cumulative gradient sequence according to the important gradient indication sequence; obtaining information to be synchronized of training nodes according to the important gradient and the importance index sequence; performing synchronization among the training nodes on the basis of said information, so as to obtain a synchronized gradient sequence and a synchronized important gradient indication sequence; and adjusting parameters of the neural network according to the synchronized gradient sequence. The present disclosure can reduce the frequency and data volume of node communication, reduce communication overhead, and improve the training speed of the neural network.

Description

Distributed communication-based neural network training method, device and storage medium
This application claims priority to the Chinese patent application with application number 202110221266.9, entitled "Distributed Communication-Based Neural Network Training Method, Device and Storage Medium", filed with the China Patent Office on February 27, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of machine learning, and in particular to a distributed communication-based neural network training method, device and storage medium.
Background
With the development of information technology and the rise of artificial intelligence, neural networks are used ever more widely in daily life, and their variety and complexity keep growing. Traditional single-machine training may require tens of thousands of iterations over several months to converge, and the computing power of a single machine can no longer match the computational demands of neural network training. Distributed training methods can increase computing power by assigning training tasks to multiple nodes in parallel, but training can only be completed through communication among the nodes; the large volume and high frequency of the communication data of each node in turn cause high bandwidth consumption and long communication delays, so that inter-node communication has become a bottleneck for speeding up neural network training.
Summary of the Invention
Embodiments of the present application provide a distributed communication-based neural network training method, device and storage medium, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present application provides a distributed communication-based neural network training method, applied to a training node, the method including:
training the neural network corresponding to the training node, and saving the generated gradients in a first gradient sequence;
obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, where the second gradient sequence is used to record gradients that have not yet participated in synchronization;
calculating an importance index sequence according to the cumulative gradient sequence;
obtaining an important gradient indication sequence, and determining the important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
obtaining information to be synchronized of the training node according to the important gradients and the importance index sequence;
performing synchronization between training nodes based on the information to be synchronized, so as to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence;
adjusting the parameters of the neural network according to the post-synchronization gradient sequence, and using the post-synchronization important gradient indication sequence as the new important gradient indication sequence.
In a second aspect, an embodiment of the present application provides a distributed communication-based neural network training device, the device being arranged on a training node and including:
a training module, configured to train the neural network corresponding to the training node and save the generated gradients in a first gradient sequence;
a cumulative gradient acquisition module, configured to obtain a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, where the second gradient sequence is used to record gradients that have not yet participated in synchronization;
an importance index sequence calculation module, configured to calculate an importance index sequence according to the cumulative gradient sequence;
a gradient classification module, configured to obtain an important gradient indication sequence and determine the important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
a synchronization module, configured to obtain information to be synchronized of the training node according to the important gradients and the importance index sequence, and to perform synchronization between training nodes based on the information to be synchronized, so as to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence;
an update module, configured to use the post-synchronization important gradient indication sequence as the new important gradient indication sequence;
a parameter adjustment module, configured to adjust the parameters of the neural network according to the post-synchronization gradient sequence.
In a third aspect, an embodiment of the present application provides a distributed communication-based neural network training method, applied to a training system including a plurality of training nodes, the method including:
each training node training its corresponding neural network and saving the generated gradients in a first gradient sequence;
obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, where the second gradient sequence is used to record the gradients in the training node that have not yet participated in synchronization;
calculating an importance index sequence according to the cumulative gradient sequence;
obtaining the important gradient indication sequence in the training node, and determining the important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
obtaining information to be synchronized of the training node according to the important gradients and the importance index sequence;
each training node performing synchronization between the training nodes based on its corresponding information to be synchronized, so as to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence;
each training node adjusting the parameters of its neural network according to the post-synchronization gradient sequence, and using the post-synchronization important gradient indication sequence as the new important gradient indication sequence.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing at least one instruction or at least one segment of program, the at least one instruction or at least one segment of program being loaded and executed by a processor to implement the distributed communication-based neural network training method according to any one of the first aspect.
In a fifth aspect, an embodiment of the present application provides a training device including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements the distributed communication-based neural network training method according to any one of the first aspect by executing the instructions stored in the memory.
In a sixth aspect, an embodiment of the present application provides a training system including a plurality of the training devices according to the fifth aspect.
In a seventh aspect, an embodiment of the present application provides a computer program product, which may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement the various aspects.
It can be seen that, in the embodiments of the present application, the important gradients and the related information used to determine the important gradients can be synchronized in a single communication round, and the synchronization result ensures that the important gradients obtained by each training node all correspond to the same neural network parameters, so that no additional inter-node communication about the positions of the important gradients is needed, which lowers the frequency of node communication and increases the training speed.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 shows a schematic diagram of neural network training based on distributed communication in the related art according to an embodiment of the present disclosure;
FIG. 2 shows a flowchart of a distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of one feasible structure of a deep neural network according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of the relationship between the first storage space, the second storage space and the neural network structure according to an embodiment of the present disclosure;
FIG. 5 shows a schematic flowchart of multiple nodes executing the above distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 6 shows a schematic flowchart of step S20 in the above distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 7 shows a schematic flowchart of step S40 of the above distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 8 shows a schematic flowchart of step S60 in the above distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 9 shows a block diagram of a distributed communication-based neural network training device according to an embodiment of the present disclosure;
FIG. 10 shows a block diagram of a training device according to an embodiment of the present disclosure;
FIG. 11 shows a block diagram of another training device according to an embodiment of the present disclosure.
Detailed Description of Embodiments
The technical solutions in the embodiments of this specification will be described clearly and completely below with reference to the drawings in the embodiments of this specification. Obviously, the described embodiments are only some of the embodiments of this specification rather than all of them. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or server comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product or device.
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or superior to other embodiments.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist at the same time, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B and C may mean including any one or more elements selected from the set consisting of A, B and C.
In addition, numerous specific details are given in the following detailed description in order to better illustrate the present disclosure. Those skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some instances, methods, means, elements and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
In practical applications of the related art, in order to increase the training speed of a neural network, the training tasks of the neural network can be assigned to multiple training nodes in parallel, and the training nodes can cooperatively train the neural network based on distributed communication, thereby increasing the training speed of the neural network and shortening the time it takes for the neural network to converge. Please refer to FIG. 1, which shows a schematic diagram of neural network training based on distributed communication in the related art. Each GPU (Graphics Processing Unit) in FIG. 1 is a training node; there are m training nodes in FIG. 1, and these m training nodes can each independently train a neural network with the same structure.
Taking the m training nodes in FIG. 1 cooperatively training a neural network with n layers as an example, in the process of each training node training the neural network, the sample data, after being input into the neural network, triggers a forward propagation process and a back-propagation process in turn. In the forward propagation process, the sample data is input into layer 1, which triggers layer 1, layer 2 and so on until layer n outputs computation results in sequence. In the back-propagation process, layer n, layer n-1 and so on down to layer 1 are triggered to generate gradients in sequence.
The above m training nodes communicate according to the generated gradients and adjust the parameters of their corresponding neural networks according to the communication results. Through distributed training, each training node can adjust the parameters of its own neural network with reference to the gradients generated by the other training nodes, thereby increasing the training speed. However, in the related art the data that needs to be communicated between nodes is large in volume and the communication frequency is high, so the communication process itself consumes considerable resources, and each training node may be forced to remain in a waiting state frequently because of communication delays, which reduces the training speed. In this case the communication process of the distributed nodes becomes the bottleneck restricting training efficiency.
In order to reduce the amount of communication data and the communication frequency in the distributed training process, relieve communication pressure and increase the speed of distributed training of neural networks, embodiments of the present disclosure provide a distributed communication-based neural network training method.
The distributed communication-based neural network training method provided by the embodiments of the present disclosure can be applied to any data processing device with a graphics processing unit (GPU). The data processing device may be a terminal, including a personal computer (PC), a minicomputer, a midrange computer, a mainframe, a workstation and the like; of course, the data processing device may also be a server. It should be noted that, when used to train a neural network, the data processing devices may be independent or may exist in the form of a cluster.
The distributed communication-based neural network training method provided by the embodiments of the present disclosure may be stored in a data processing device in the form of a computer program, and the data processing device implements the distributed communication-based neural network training method of the embodiments of the present disclosure by running the computer program. The computer program may be an independent computer program, or may be a functional module, a plug-in, an applet or the like integrated into another computer program.
The distributed communication-based neural network training method of the embodiments of the present disclosure is described below by taking a data processing device serving as a training node as the execution subject. FIG. 2 shows a flowchart of a distributed communication-based neural network training method according to an embodiment of the present disclosure. As shown in FIG. 2, the method includes:
S10: Train the neural network corresponding to the training node, and save the generated gradients in a first gradient sequence.
The embodiments of the present disclosure do not limit the structure of the neural network; the neural network may be at least one of a deep neural network, a convolutional neural network and a recurrent neural network. Taking a deep neural network as an example, please refer to FIG. 3, which shows a schematic diagram of one feasible structure of a deep neural network. The deep neural network may include convolutional layers, pooling layers and a fully connected layer, with one of the convolutional layers serving as the input layer of the neural network and the fully connected layer serving as the output layer of the deep neural network; convolutional layers and pooling layers may be arranged alternately between the input layer and the output layer.
In a feasible implementation, a first storage space (grad_buffer) and a second storage space (left_buffer) may be allocated on the training node. In one embodiment, both the first storage space (grad_buffer) and the second storage space (left_buffer) are contiguous storage spaces.
Please refer to FIG. 4, which shows a schematic diagram of the relationship between the first storage space, the second storage space and the neural network structure. In the embodiments of the present disclosure the neural network may have multiple layers, each layer may include multiple parameters, and each parameter of each layer corresponds to a segment of the first storage space. The first storage space and the second storage space may have the same structure, and both may be the same size as the storage space occupied by the parameters of the neural network. The first storage space (grad_buffer) may be used to store the gradients generated by the training node during training (the first gradient sequence), and the second storage space (left_buffer) may be used to store the gradients that have not yet participated in synchronization.
Take a neural network with three layers as an example:
layer 1 includes parameter E10, parameter E11, parameter E12 and parameter E13;
layer 2 includes parameter E20, parameter E21 and parameter E22;
layer 3 includes parameter E30, parameter E31 and parameter E32.
Layers 1-3 include 10 parameters in total, so the first storage space and the second storage space each include 10 storage intervals. During training, each layer generates the gradients corresponding to its parameters in reverse order. Specifically:
first, layer 3 generates gradient T30, gradient T31 and gradient T32 corresponding to parameter E30, parameter E31 and parameter E32, respectively;
then, layer 2 generates gradient T20, gradient T21 and gradient T22 corresponding to parameter E20, parameter E21 and parameter E22, respectively;
next, layer 1 generates gradient T10, gradient T11, gradient T12 and gradient T13 corresponding to parameter E10, parameter E11, parameter E12 and parameter E13, respectively.
These 10 gradients can be stored in the first storage space in the order in which they are generated, that is, the data in the first storage space (grad_buffer) may be {T30, T31, T32, T20, T21, T22, T10, T11, T12, T13}. The data {T30, T31, T32, T20, T21, T22, T10, T11, T12, T13} in the first storage space (grad_buffer) is the first gradient sequence.
In one embodiment, the training node may take out a mini-batch of samples each time and perform training on the mini-batch, adjust the parameters of the corresponding neural network during the training process, and obtain, from the training result, the gradient corresponding to each parameter generated by this batch of training, storing these gradients in the first gradient sequence. By training in batches, the first gradient sequence can be recorded after each batch of training and the subsequent synchronization between training nodes can then be performed, while no synchronization between training nodes needs to take place during the training of each batch; training in batches thus reduces the communication frequency and increases the training speed.
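To make the buffer layout above more concrete, the following is a minimal sketch assuming a PyTorch-style model; the names grad_buffer and left_buffer follow the text, while the flattening order and the helper names are illustrative assumptions rather than part of the disclosed method.

```python
import torch

def allocate_buffers(model):
    # first storage space (grad_buffer) and second storage space (left_buffer),
    # both contiguous and the same size as the parameters of the neural network
    total = sum(p.numel() for p in model.parameters())
    grad_buffer = torch.zeros(total)
    left_buffer = torch.zeros(total)
    return grad_buffer, left_buffer

def fill_grad_buffer(model, grad_buffer):
    # copy the gradients produced by one mini-batch into the first gradient
    # sequence; here a fixed flattening order stands in for the
    # generation-order layout {T30, ..., T13} described above
    offset = 0
    for p in model.parameters():
        n = p.numel()
        grad_buffer[offset:offset + n] = p.grad.detach().reshape(-1)
        offset += n
```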
S20: Obtain a cumulative gradient sequence according to the first gradient sequence and the second gradient sequence; the second gradient sequence is used to record the gradients that have not yet participated in synchronization.
In the embodiments of the present disclosure the second gradient sequence may be stored in the second storage space (left_buffer). The second storage space has the same structure as the first storage space, and storage intervals at the same position correspond to the same neural network parameter. Taking the above neural network with three layers as an example, the 10 consecutive storage intervals in the first storage space and the 10 consecutive storage intervals in the second storage space each correspond, in order, to parameter E30, parameter E31, parameter E32, parameter E20, parameter E21, parameter E22, parameter E10, parameter E11, parameter E12 and parameter E13. The data in the storage intervals at the same position in the first storage space and the second storage space can be added to obtain the cumulative gradient sequence, and the cumulative gradient sequence can be stored in the first storage space.
Exemplarily, taking the above neural network with three layers as an example, for any storage interval grad_buffer[i] of the first storage space and the corresponding storage interval left_buffer[i] of the second storage space, where i takes the values 0-9, grad_buffer[i]+left_buffer[i] is assigned to grad_buffer[i]; the data in the first storage space after this assignment is the cumulative gradient sequence, that is, the cumulative gradient sequence and the first gradient sequence reuse the first storage space. In practical scenarios of distributed-communication-based training of neural networks, the number of training nodes participating in cooperative training is large, and since the neural network can be very complex, the gradients generated usually also occupy a large amount of storage; reusing the first storage space in each training node can therefore greatly save storage resources.
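A sketch of the accumulation step, reusing grad_buffer in place as described; it is an element-wise addition over the two buffers from the previous sketch.

```python
def accumulate(grad_buffer, left_buffer):
    # grad_buffer[i] = grad_buffer[i] + left_buffer[i]; the result is the
    # cumulative gradient sequence, stored back into the first storage space
    grad_buffer += left_buffer
    return grad_buffer
```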
S30: Calculate an importance index sequence according to the cumulative gradient sequence.
Specifically, an importance index can be calculated for each cumulative gradient in the cumulative gradient sequence; the embodiments of the present disclosure do not limit the method for determining the importance index. Taking the above neural network with three layers as an example, an importance index sequence Im can be obtained, in which each importance index Im[i] can represent the importance of the gradient generated for the neural network parameter corresponding to grad_buffer[i] obtained above.
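The disclosure leaves the importance metric open; purely as one hedged example, the sketch below uses the absolute value of each cumulative gradient as its importance index Im[i].

```python
def importance_indices(cumulative):
    # Im[i] grows with the magnitude of the accumulated gradient; any other
    # monotone importance measure could be substituted here
    return cumulative.abs()
```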
S40: Acquire the important gradient indication sequence, and determine the important gradients in the cumulative gradient sequence according to the important gradient indication sequence.
In the embodiments of the present disclosure, each important gradient indication value in the important gradient indication sequence can indicate whether the cumulative gradient at the corresponding position is important. Taking the above neural network with three layers as an example, an important gradient indication sequence Ip can be obtained, and each important gradient indication value Ip[i] can indicate whether the data recorded in grad_buffer[i] obtained above is an important gradient. Exemplarily, if Ip[i] is a first indication value, the data recorded in grad_buffer[i] is determined to be an important gradient; if Ip[i] is a second indication value, the data recorded in grad_buffer[i] is determined to be a non-important gradient. In an exemplary implementation, the important gradient indication sequence Ip may be obtained during the previous inter-node synchronization.
Exemplarily, if the values of Ip[1], Ip[3] and Ip[5] in the important gradient indication sequence Ip indicate important gradients, then grad_buffer[1], grad_buffer[3] and grad_buffer[5] are the important gradients, and the other gradients in grad_buffer are non-important gradients.
In an exemplary implementation, after the non-important gradients in the cumulative gradient sequence are determined, the second gradient sequence may also be updated according to the non-important gradients.
In the embodiments of the present disclosure the non-important gradients do not participate in this round of inter-node synchronization; therefore, the non-important gradients can be written back to the second gradient sequence, that is, stored at the corresponding positions of the second storage space (left_buffer). Updating the second gradient sequence according to the non-important gradients includes determining the positions of the non-important gradients in the second gradient sequence, updating the data of the second gradient sequence at those positions to the non-important gradients, and assigning 0 to the data at the other positions.
Taking the above neural network with three layers as an example, if grad_buffer[0], grad_buffer[2], grad_buffer[4], grad_buffer[6], grad_buffer[7], grad_buffer[8] and grad_buffer[9] are all non-important gradients, then grad_buffer[0], grad_buffer[2], grad_buffer[4], grad_buffer[6], grad_buffer[7], grad_buffer[8] and grad_buffer[9] are correspondingly assigned to left_buffer[0], left_buffer[2], left_buffer[4], left_buffer[6], left_buffer[7], left_buffer[8] and left_buffer[9], while left_buffer[1], left_buffer[3] and left_buffer[5] can all be assigned 0.
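The following sketch splits the cumulative gradient sequence using the important gradient indication sequence and writes the non-important gradients back into left_buffer; representing Ip as a boolean mask is an assumption made only for illustration.

```python
def split_and_carry_over(cumulative, Ip, left_buffer):
    # non-important positions keep their accumulated value for the next round,
    # important positions are cleared because they are about to be synchronized
    left_buffer.copy_(cumulative)
    left_buffer[Ip] = 0.0
    important = cumulative[Ip]        # the important gradients to be synchronized
    return important, left_buffer
```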
S50: Obtain the to-be-synchronized information of the training node according to the important gradients and the importance index sequence.
Specifically, the important gradients and the importance index sequence may be spliced together, and the splicing result is used as the to-be-synchronized information. The embodiments of the present disclosure do not limit the splicing method.
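One possible splicing of the important gradients and the importance index sequence into a single payload, so that both are exchanged in one communication round; simple concatenation is only an example, since the disclosure does not fix the splicing method.

```python
def build_payload(important, importance):
    # because every node uses the same indication sequence, the payloads have
    # identical layouts and can be reduced element-wise across nodes
    return torch.cat([important, importance])
```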
S60: Perform synchronization between the training nodes based on the to-be-synchronized information, so as to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence.
Exemplarily, AllReduce can be used for communication; AllReduce is a general term for a class of communication methods that can efficiently carry out communication between distributed training nodes.
In the embodiments of the present disclosure, since each training node obtains the post-synchronization important gradient indication sequence at every synchronization, and this post-synchronization important gradient indication sequence is used as the important gradient indication sequence applied in the next round of training, the important gradient indication sequences used by the training nodes during training are identical, that is, the positions of the important gradients are identical.
Taking training node A as an example, if training node A determines in step S40 that grad_buffer[1], grad_buffer[3] and grad_buffer[5] are the important gradients, the other training nodes will also determine that their corresponding grad_buffer[1], grad_buffer[3] and grad_buffer[5] are important gradients. The post-synchronization gradient sequence can therefore be computed directly from the important gradients of each node, without additionally considering the positions of the gradients being synchronized, and the training nodes no longer need extra communication about the position information of the important gradients, which lowers the communication frequency.
Exemplarily, for the post-synchronization gradient sequence Td, Td[1], Td[3] and Td[5] can be obtained by averaging the important gradients grad_buffer[1], grad_buffer[3] and grad_buffer[5] of the nodes, while the values at the other positions in Td can be assigned a preset gradient value.
In one embodiment, the post-synchronization importance index sequence can be obtained by taking the element-wise mean of the importance index sequences of the nodes, and the post-synchronization important gradient indication sequence is obtained accordingly from the post-synchronization importance index sequence.
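A hedged sketch of the synchronization step using a PyTorch-style all-reduce (torch.distributed); averaging the payload element-wise and deriving the new indication sequence by a top-k rule over the averaged importance indices are illustrative assumptions, since the disclosure only requires that every node derives the same sequence from the same synchronized data.

```python
import torch
import torch.distributed as dist

def synchronize(payload, n_important, total, Ip, k, preset=0.0):
    dist.all_reduce(payload, op=dist.ReduceOp.SUM)
    payload /= dist.get_world_size()              # element-wise mean over the nodes

    avg_important = payload[:n_important]         # averaged important gradients
    avg_importance = payload[n_important:]        # averaged importance index sequence

    synced = torch.full((total,), preset)         # post-synchronization gradient sequence Td
    synced[Ip] = avg_important                    # other positions keep the preset gradient value

    new_Ip = torch.zeros(total, dtype=torch.bool) # post-synchronization indication sequence
    new_Ip[torch.topk(avg_importance, k).indices] = True
    return synced, new_Ip
```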
S70: Adjust the parameters of the neural network according to the post-synchronization gradient sequence, and use the post-synchronization important gradient indication sequence as the new important gradient indication sequence.
Adjusting the parameters of the neural network according to the post-synchronization gradient sequence includes:
S71: Extract the gradients in the post-synchronization gradient sequence in sequence.
S72: If the gradient is not equal to the preset gradient value, adjust the corresponding neural network parameter according to the gradient.
S73: If the gradient is equal to the preset gradient value, extract the next gradient.
In a specific implementation, the preset gradient value is used to indicate the case where no parameter adjustment according to the gradient is needed; in a feasible implementation, the preset gradient value may be 0.
In the embodiments of the present disclosure, the elements of the post-synchronization gradient sequence may be stored at the corresponding positions of the first storage space, reusing the first storage space (grad_buffer) once again and reducing storage consumption. Taking the neural network with three layers described above as an example, grad_buffer[i] is read in sequence; if grad_buffer[i] is not the preset gradient value, the parameter of the neural network corresponding to grad_buffer[i] is adjusted; if grad_buffer[i] is the preset gradient value, i is incremented by 1 and grad_buffer[i] continues to be read. Obviously, if the currently extracted gradient is the last gradient in the post-synchronization gradient sequence, "extracting the next gradient" will fail; this situation indicates that this round of adjustment of the parameters of the neural network has been completed, after which the first storage space can be cleared so that, when steps S10-S70 are executed iteratively, it can be reused to record the first gradient sequence in step S10.
In the embodiments of the present disclosure, non-important gradients are indicated by setting the preset gradient value, and the parameter values corresponding to non-important gradients do not need to be adjusted, thereby avoiding excessive adjustment of the parameters of the neural network and further improving the convergence speed of the neural network.
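A sketch of the parameter adjustment in steps S71-S73, skipping every position whose synchronized gradient equals the preset gradient value (0 here); the plain SGD-style update rule and the learning rate are assumptions, since the disclosure does not prescribe how a parameter is adjusted from its gradient.

```python
import torch

def apply_update(model, synced, lr=0.01, preset=0.0):
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            g = synced[offset:offset + n].reshape(p.shape)
            mask = g != preset        # S72/S73: only adjust where a gradient was synchronized
            p -= lr * g * mask        # positions equal to the preset value contribute nothing
            offset += n
```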
Steps S10-S70 of the embodiments of the present disclosure show the neural network training method for a single synchronization round. In this single-round method, both the second gradient sequence and the important gradient indication sequence are updated, and based on the updated results the single-round method can be executed again; that is, steps S10-S70 can be executed iteratively until a preset training stop condition is reached, at which point the trained neural network is obtained. The embodiments of the present disclosure do not limit the training stop condition; the training stop condition may be that the number of iterations reaches a preset iteration threshold, or that the loss produced by the neural network is smaller than a preset loss threshold.
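The sketches above can be tied together into one per-node iteration loop; the initial all-important indication sequence, the compute_loss argument and the iteration-count stop condition are assumptions used only to make the driver readable in outline.

```python
def train(model, data_loader, compute_loss, max_iters, k):
    grad_buffer, left_buffer = allocate_buffers(model)
    total = grad_buffer.numel()
    Ip = torch.ones(total, dtype=torch.bool)      # assumption: treat every gradient as important at first
    for step, (x, y) in enumerate(data_loader):
        model.zero_grad()
        compute_loss(model, x, y).backward()      # S10: back-propagation produces the gradients
        fill_grad_buffer(model, grad_buffer)
        accumulate(grad_buffer, left_buffer)                                          # S20
        importance = importance_indices(grad_buffer)                                  # S30
        important, left_buffer = split_and_carry_over(grad_buffer, Ip, left_buffer)   # S40
        payload = build_payload(important, importance)                                # S50
        synced, Ip = synchronize(payload, important.numel(), total, Ip, k)            # S60
        apply_update(model, synced)                                                   # S70
        grad_buffer.zero_()                       # clear the first storage space for the next iteration
        if step + 1 >= max_iters:                 # example of a preset training stop condition
            break
```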
Please refer to FIG. 5, which shows a schematic flowchart of multiple nodes executing the above distributed communication-based neural network training method. Each training node in FIG. 5 can independently train its own neural network, update its local second gradient sequence during training, and obtain information to be synchronized, where the information to be synchronized includes the important gradients and the importance index sequence. The training nodes communicate based on the information to be synchronized to obtain the synchronized gradient sequence and the synchronized important gradient indication sequence. Each node adjusts its parameters based on the synchronized gradient sequence and determines the important gradients of the next iteration based on the synchronized important gradient indication sequence. Each training node continuously adjusts its own parameters through iteration and cooperates with the other nodes through inter-node communication in each iteration to complete the training of the neural network.
In a practical scenario of distributed neural network training, the number of participating training nodes is large, and since the neural network can be highly complex, the amount of gradient data produced is also large; correspondingly, the amount of gradient position information used to record the gradient positions is also large. With such large data volumes, the training nodes need to synchronize not only the gradients but also the gradient position information, which creates considerable communication pressure and consumes a large amount of communication resources. Limited by the available communication resources, many nodes may frequently be forced to wait for synchronization, which reduces the training speed. Based on the above configuration, the distributed communication-based neural network training method provided by the embodiment of the present disclosure can synchronize the important gradients and the important gradient position information in a single communication, and the synchronization result can ensure that the important gradients obtained by each training node correspond to the same neural network parameters, without any additional inter-node communication for the gradient position information. This significantly reduces the communication frequency, reduces the consumption of communication resources, shortens the time training nodes spend waiting for synchronization, and improves the training speed; the speed-up is particularly evident in scenarios with a large number of training nodes and a highly complex neural network.
In some feasible implementations, the speed at which the training node executes the above steps can be further improved based on multi-thread concurrency. The embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Please refer to FIG. 6, which shows a schematic flowchart of step S20 in the above distributed communication-based neural network training method, that is, of obtaining the cumulative gradient sequence according to the first gradient sequence and the second gradient sequence, including:
S21: Segment the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule to obtain a first gradient segment sequence and a second gradient segment sequence; wherein, if the position of a first gradient segment in the first gradient segment sequence is the same as the position of a second gradient segment in the second gradient segment sequence, the first gradient segment corresponds to the second gradient segment, and both correspond to the same neural network parameters.
The embodiments of the present disclosure do not limit the specific segmentation rule, as long as the first gradient sequence can be divided into a plurality of first gradient segments to form the first gradient segment sequence. The first gradient sequence and the second gradient sequence have the same structure, so segmenting the second gradient sequence based on the same segmentation rule yields the second gradient segment sequence; and if the position of a first gradient segment in the first gradient segment sequence is the same as the position of a second gradient segment in the second gradient segment sequence, the first gradient segment corresponds to the second gradient segment, and both correspond to the same neural network parameters.
Exemplarily, taking the three-layer neural network described above as an example, the first gradient sequence obtained in step S10 is stored in the first storage space (grad_buffer). The data in grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3] and grad_buffer[4] form one first gradient segment Tdf[0], and the data in grad_buffer[5], grad_buffer[6], grad_buffer[7], grad_buffer[8] and grad_buffer[9] form another first gradient segment Tdf[1], yielding the first gradient segment sequence {Tdf[0], Tdf[1]}. Correspondingly, the second gradient sequence is stored in the second storage space (left_buffer): the data in left_buffer[0], left_buffer[1], left_buffer[2], left_buffer[3] and left_buffer[4] form one second gradient segment Tds[0], and the data in left_buffer[5], left_buffer[6], left_buffer[7], left_buffer[8] and left_buffer[9] form another second gradient segment Tds[1], yielding the second gradient segment sequence {Tds[0], Tds[1]}.
Taking Tdf[0] as an example, the data in Tdf[0] correspond in order to parameter E30, parameter E31, parameter E32, parameter E20 and parameter E21. Since Tdf[0] corresponds to Tds[0], the data in Tds[0] also correspond in order to parameter E30, parameter E31, parameter E32, parameter E20 and parameter E21.
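A minimal Python sketch of such a fixed-length segmentation rule is shown below; the segment length of 5 simply mirrors the example above, and the list-based buffers are stand-ins for whatever storage a real implementation uses.

    def segment(buffer, seg_len):
        # Split a flat gradient buffer into consecutive segments of seg_len elements.
        # Segments at the same index in two buffers of identical structure cover the
        # same parameter positions, so Tdf[k] and Tds[k] always correspond.
        return [buffer[i:i + seg_len] for i in range(0, len(buffer), seg_len)]

    grad_buffer = [0.0] * 10          # stands in for the first gradient sequence
    left_buffer = [0.0] * 10          # stands in for the second gradient sequence

    Tdf = segment(grad_buffer, 5)     # {Tdf[0], Tdf[1]}
    Tds = segment(left_buffer, 5)     # {Tds[0], Tds[1]}, aligned with Tdf by position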
S22: Set up multiple parallel computing threads, each of which obtains at least one first gradient segment and the second gradient segment corresponding to that first gradient segment.
Taking two parallel computing threads as an example, Tdf[0] and Tds[0] can be sent to computing thread A, and Tdf[1] and Tds[1] can be sent to computing thread B.
S23: For each first gradient segment it obtains, each computing thread accumulates the first gradient segment with the corresponding second gradient segment to obtain the corresponding cumulative gradient segment.
Taking computing thread A as an example, the data in Tdf[0] and Tds[0] can be added element by element to obtain the corresponding cumulative gradient segment. Exemplarily, the elements of the first gradient segment Tdf[0] are stored in grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3] and grad_buffer[4], and the elements of the second gradient segment Tds[0] are stored in left_buffer[0], left_buffer[1], left_buffer[2], left_buffer[3] and left_buffer[4]; the data of the corresponding cumulative gradient segment STd[0] are then grad_buffer[0]+left_buffer[0], grad_buffer[1]+left_buffer[1], grad_buffer[2]+left_buffer[2], grad_buffer[3]+left_buffer[3] and grad_buffer[4]+left_buffer[4]. Obviously, the five cumulative gradients in STd[0] correspond to parameter E30, parameter E31, parameter E32, parameter E20 and parameter E21, respectively. In the same way, computing thread B can obtain STd[1] from Tdf[1] and Tds[1]; the five cumulative gradients in STd[1] correspond to parameter E22, parameter E10, parameter E11, parameter E12 and parameter E13, respectively.
S24: Obtain the cumulative gradient sequence according to the cumulative gradient segments obtained by the respective computing threads.
Exemplarily, computing thread A obtains STd[0] and computing thread B obtains STd[1]; arranging the cumulative gradients in STd[0] and the cumulative gradients in STd[1] in order yields the cumulative gradient sequence. Taking the three-layer neural network described above as an example, grad_buffer[0] through grad_buffer[4] can be updated with the five elements of STd[0], and grad_buffer[5] through grad_buffer[9] with the five elements of STd[1], so that the data in the first storage space form the cumulative gradient sequence. The cumulative gradient sequence thus reduces storage consumption by reusing the first storage space. Based on the above configuration, the cumulative gradients can be computed in parallel by segment, improving the calculation speed of the cumulative gradient sequence.
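A minimal sketch of steps S22-S24, assuming Python lists and a thread pool standing in for the parallel computing threads (a real training node would more likely operate on GPU tensors), could look like this:

    from concurrent.futures import ThreadPoolExecutor

    def accumulate_segment(first_seg, second_seg):
        # S23: element-wise sum of a first gradient segment and its matching second gradient segment.
        return [a + b for a, b in zip(first_seg, second_seg)]

    def accumulate(Tdf, Tds, num_threads=2):
        # S22: hand matching segment pairs to parallel workers.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            STd = list(pool.map(accumulate_segment, Tdf, Tds))
        # S24: concatenating the per-segment results (or writing them back into
        # grad_buffer) gives the cumulative gradient sequence.
        cumulative = [g for seg in STd for g in seg]
        return STd, cumulative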
Correspondingly, in step S30, the importance index sequence may also be calculated at the granularity of gradient segments; that is, for each cumulative gradient segment, the corresponding importance index is calculated, and the importance index sequence is obtained from the importance index calculation results of the respective computing threads. Exemplarily, for the cumulative gradient segment STd[0], the corresponding importance index Im[0] can be obtained; for the cumulative gradient segment STd[1], the corresponding importance index Im[1] can be obtained. Based on the above configuration, the obtained importance index can characterize the importance of a gradient segment that includes multiple gradients, so that in the subsequent synchronization process it can be determined whether the segment is an important gradient segment. This makes it convenient to determine important gradients at the granularity of gradient segments and thus perform gradient updates at segment granularity, realizing sparse gradient updates, further reducing the amount of communicated data and improving the training speed.
In an exemplary embodiment, taking the three-layer neural network described above as an example, two cumulative gradient segments STd[0] and STd[1] are obtained; correspondingly, the importance index sequence includes two importance indexes Im[0] and Im[1]. For each cumulative gradient segment, a statistic of the cumulative gradients in that segment may be calculated and used as the importance index. The embodiment of the present disclosure does not limit the type of statistic; exemplarily, the statistic may be the variance, the standard deviation or the two-norm.
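For instance, using the two-norm as the statistic, the importance index of each cumulative gradient segment could be computed as in the following sketch (the variance or standard deviation would be substituted at the same place); the sample values of STd are made up purely for illustration.

    import math

    def importance_index(segment):
        # Importance of a cumulative gradient segment, here taken as its two-norm.
        return math.sqrt(sum(g * g for g in segment))

    # Cumulative gradient segments from S23 (illustrative values only).
    STd = [[0.1, -0.2, 0.05, 0.3, -0.1],
           [0.01, 0.0, 0.02, -0.01, 0.0]]

    Im = [importance_index(seg) for seg in STd]   # one index per segment, e.g. {Im[0], Im[1]}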
Please refer to FIG. 7, which shows a schematic flowchart of step S40 of the above distributed communication-based neural network training method, that is, of determining the important gradients in the cumulative gradient sequence according to the important gradient indication sequence, including:
S41: For each cumulative gradient segment calculated by each computing thread, extract the corresponding important gradient indication value from the important gradient indication sequence.
In an exemplary embodiment, each important gradient indication value in the important gradient indication sequence also corresponds to a cumulative gradient segment in the cumulative gradient sequence. Taking the three-layer neural network described above as an example, two cumulative gradient segments STd[0] and STd[1] are obtained; correspondingly, the important gradient indication sequence also includes two important gradient indication values Ip[0] and Ip[1].
S42: If the important gradient indication value is the first indication value, determine all the cumulative gradients in the cumulative gradient segment as important gradients and submit the cumulative gradient segment to the communication buffer of the training node; the first indication value indicates that all the cumulative gradients in the cumulative gradient segment are important gradients.
S43: If the important gradient indication value is the second indication value, determine all the cumulative gradients in the cumulative gradient segment as non-important gradients; the second indication value indicates that all the cumulative gradients in the cumulative gradient segment are non-important gradients.
Exemplarily, 1 may be used as the first indication value and 0 as the second indication value. Exemplarily, if the important gradient indication value Ip[0] is 1, each cumulative gradient in the corresponding cumulative gradient segment STd[0] is determined to be an important gradient, and the cumulative gradient segment is submitted to the communication buffer of the training node; if the important gradient indication value Ip[1] is 0, the five cumulative gradients in the corresponding cumulative gradient segment STd[1] are all treated as non-important gradients.
In a feasible embodiment, the second gradient sequence may be updated according to the determined non-important gradients; this step has been described above and is not repeated here.
Based on the above configuration, important gradients and non-important gradients can be quickly determined at the granularity of gradient segments, improving the speed at which important gradients are determined.
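The following sketch illustrates steps S41-S43 under the assumptions above (1 as the first indication value, 0 as the second); the dictionary-based communication buffer and the write-back of non-important segments into left_buffer are illustrative choices, not the disclosed data structures.

    FIRST_INDICATION, SECOND_INDICATION = 1, 0

    def classify_segments(STd, Ip, seg_len, left_buffer):
        # S41-S43: for each cumulative gradient segment, read its indication value;
        # important segments go to the communication buffer, non-important segments
        # are written back to the second gradient sequence for a later iteration.
        comm_buffer = {}
        for seg_idx, seg in enumerate(STd):
            if Ip[seg_idx] == FIRST_INDICATION:
                comm_buffer[seg_idx] = seg                 # S42: whole segment will be synchronized
            else:
                start = seg_idx * seg_len                  # S43: keep the segment locally
                left_buffer[start:start + len(seg)] = seg
        return comm_buffer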
In an exemplary embodiment, obtaining the to-be-synchronized information of the training node according to the important gradients and the importance index sequence includes:
S51. Obtain the to-be-synchronized gradient sequence according to the cumulative gradient segments in the communication buffer; wherein the position of each cumulative gradient of each cumulative gradient segment in the communication buffer is the same in the to-be-synchronized gradient sequence as the position of that cumulative gradient in the cumulative gradient sequence, and the other positions in the to-be-synchronized gradient sequence are set to the preset gradient value.
Continuing the example above, only the cumulative gradient segment STd[0] is submitted to the communication buffer, and the corresponding positions of the cumulative gradients in STd[0] within the cumulative gradient sequence are positions 1 to 5; therefore, positions 1 to 5 of the to-be-synchronized gradient sequence are the five cumulative gradients of STd[0], and the other positions in the to-be-synchronized gradient sequence are set to the preset value. Exemplarily, the other positions in the to-be-synchronized gradient sequence may be set to zero. In this example, the to-be-synchronized gradient sequence includes ten values in total, corresponding to parameter E30, parameter E31, parameter E32, parameter E20, parameter E21, parameter E22, parameter E10, parameter E11, parameter E12 and parameter E13, respectively.
S52. Splice the to-be-synchronized gradient sequence and the importance index sequence to obtain the to-be-synchronized information.
Exemplarily, the importance index sequence {Im[0], Im[1]} may be appended after the to-be-synchronized gradient sequence to obtain the to-be-synchronized information.
Based on the above configuration, important gradients can be quickly determined at the granularity of gradient segments to obtain the to-be-synchronized gradient sequence, which records the to-be-synchronized gradient information corresponding to each parameter. Using the splicing result of the to-be-synchronized gradient sequence and the importance indexes as the to-be-synchronized information allows the information to include, with a small data volume, the gradients, the gradient positions, and the importance index sequence used to determine the important gradient positions in the next iteration. This reduces both the amount of communicated data and the communication frequency, and can significantly relieve the communication pressure generated by neural network training in a distributed communication environment.
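A minimal sketch of steps S51-S52, continuing the assumptions of the previous sketches (a segment-indexed dictionary as the communication buffer, 0 as the preset gradient value), is given below; in the running example it would produce ten gradient values followed by the two importance indexes.

    def build_sync_message(comm_buffer, seg_len, total_len, Im, preset=0.0):
        # S51: place each buffered segment at its original positions and fill the
        # remaining positions with the preset gradient value.
        to_sync = [preset] * total_len
        for seg_idx, seg in comm_buffer.items():
            start = seg_idx * seg_len
            to_sync[start:start + len(seg)] = seg
        # S52: append the importance index sequence to obtain the to-be-synchronized information.
        return to_sync + list(Im)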
Please refer to FIG. 8, which shows a schematic flowchart of step S60 in the above distributed communication-based neural network training method, that is, of performing synchronization between training nodes based on the to-be-synchronized information to obtain the synchronized gradient sequence and the synchronized important gradient indication sequence, including:
S61. Add the to-be-synchronized gradient sequences in the to-be-synchronized information of the respective training nodes element by element to obtain a synchronized accumulated gradient sequence.
S62. Divide each synchronized accumulated gradient in the synchronized accumulated gradient sequence by the total number of training nodes to obtain the synchronized gradient sequence.
Suppose three training nodes train the three-layer neural network described above, where the to-be-synchronized gradient sequence of training node 1 is Wt1, that of training node 2 is Wt2, and that of training node 3 is Wt3. For any sequence position i, the value Wt1[i]+Wt2[i]+Wt3[i] is the synchronized accumulated gradient corresponding to position i in the synchronized accumulated gradient sequence. Dividing the synchronized accumulated gradient at each position by 3 yields the synchronized gradient sequence.
S63. Add the importance index sequences in the to-be-synchronized information of the respective training nodes element by element to obtain an accumulated importance index sequence.
S64. Divide each accumulated importance index in the accumulated importance index sequence by the total number of training nodes to obtain an average importance index sequence.
Referring to the previous example, for the neural network with three layers, the importance index sequence of each training node is {Im[0], Im[1]}; to distinguish training node 1, training node 2 and training node 3, their importance index sequences are denoted {Im1[0], Im1[1]}, {Im2[0], Im2[1]} and {Im3[0], Im3[1]}, respectively. If {AIm[0], AIm[1]} denotes the average importance index sequence, then AIm[0] = (Im1[0]+Im2[0]+Im3[0])/3 and AIm[1] = (Im1[1]+Im2[1]+Im3[1])/3.
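Steps S61-S64 can be pictured with the following sketch, which averages the to-be-synchronized messages of all nodes locally; in a real cluster this summation would typically be carried out by a collective operation such as all-reduce rather than by one process holding every message.

    def synchronize(messages, num_segments):
        # Each message is [to-be-synchronized gradients..., importance indexes...].
        n = len(messages)                                  # total number of training nodes
        summed = [sum(vals) for vals in zip(*messages)]    # S61 / S63: element-wise accumulation
        averaged = [v / n for v in summed]                 # S62 / S64: divide by the node count
        synced_gradients = averaged[:-num_segments]        # synchronized gradient sequence
        avg_importance = averaged[-num_segments:]          # average importance index sequence
        return synced_gradients, avg_importance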
S65. Calculate the important gradient indication value corresponding to each average importance index in the average importance index sequence to obtain the synchronized important gradient indication sequence.
Based on the above configuration, a synchronized gradient sequence that can be accurately used to adjust the neural network parameters is calculated, and a synchronized important gradient indication sequence that can be used to quickly judge important gradients is obtained, improving the neural network training speed.
In a feasible embodiment, calculating the important gradient indication value corresponding to each average importance index in the average importance index sequence includes:
S651. Sort the average importance indexes of the average importance index sequence in descending order to obtain a sorting result.
Referring to the previous example, for the average importance index sequence {AIm[0], AIm[1]}, if AIm[0] is less than AIm[1], the sorting result is {AIm[1], AIm[0]}; otherwise the sorting result is {AIm[0], AIm[1]}.
S652. Obtain a threshold index, and determine an importance index threshold in the sorting result according to the threshold index.
In a feasible embodiment, the value at the position pointed to by the threshold index in the sorting result may be used as the importance index threshold. Exemplarily, if the sorting result includes 30 average importance indexes and the threshold index is 10, the value of the 10th average importance index in the sorting result is used as the importance index threshold.
In a feasible embodiment, the number of segments may be determined according to the preset segmentation rule, a preset compression rate may be obtained, and the product of the compression rate and the number of segments may be used as the threshold index. The embodiment of the present disclosure does not limit the specific values of the preset compression rate or the number of segments. The number of segments in the preceding example is only two; in practical applications it can be far more than two, and the examples in the embodiments of the present disclosure impose no specific limitation.
Based on the above configuration, a reasonable threshold index can be obtained, so that a reasonable importance index threshold is determined, which ultimately improves the accuracy of the important gradient judgment.
S653. For each average importance index in the average importance index sequence, if the average importance index is greater than the importance index threshold, set the corresponding important gradient indication value to the first indication value; otherwise, set the corresponding important gradient indication value to the second indication value.
In the embodiment of the present disclosure, the first indication value may be used to indicate important gradients and may exemplarily be 1; the second indication value may be used to indicate non-important gradients and may exemplarily be 0. Taking the average importance index sequence {AIm[0], AIm[1]} as an example, the synchronized important gradient indication sequence {TIp[0], TIp[1]} can be obtained correspondingly. When the above distributed communication-based neural network training method is executed iteratively, the synchronized important gradient indication sequence {TIp[0], TIp[1]} serves as the new important gradient indication sequence {Ip[0], Ip[1]} and is used to determine the important gradients in the next iteration. How to determine important gradients based on the important gradient indication sequence {Ip[0], Ip[1]} has been described above and is not repeated here.
Based on the above configuration, the important gradient indication value corresponding to each average importance index can be accurately calculated, improving the accuracy of the important gradient judgment.
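Steps S651-S653 can be sketched as follows; the compression rate of 0.5 is purely an assumed example value, and the strict "greater than" comparison follows S653.

    def importance_to_indications(avg_importance, compression_rate=0.5):
        # S651: sort the average importance indexes in descending order.
        ranked = sorted(avg_importance, reverse=True)
        # S652: threshold index = compression rate x number of segments; the value at
        # that position in the sorting result is the importance index threshold.
        threshold_index = max(1, int(compression_rate * len(avg_importance)))
        importance_threshold = ranked[threshold_index - 1]
        # S653: indexes strictly greater than the threshold map to the first indication
        # value (1, important); the rest map to the second indication value (0).
        return [1 if im > importance_threshold else 0 for im in avg_importance]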
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
In a feasible embodiment, the present disclosure further provides another distributed communication-based neural network training method, applied to a training system including a plurality of training nodes, the method including:
S100. Each training node trains its corresponding neural network and saves the generated gradients in a first gradient sequence.
S200. Obtain a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence; the second gradient sequence is used to record the gradients in the training node that have not yet participated in synchronization.
S300. Calculate an importance index sequence according to the cumulative gradient sequence.
S400. Acquire the important gradient indication sequence in the training node, and determine the important gradients in the cumulative gradient sequence according to the important gradient indication sequence.
S500. Obtain the to-be-synchronized information of the training node according to the important gradients and the importance index sequence.
S600. The training nodes perform synchronization among themselves based on the corresponding to-be-synchronized information to obtain a synchronized gradient sequence and a synchronized important gradient indication sequence.
S700. Each training node adjusts the parameters of its neural network according to the synchronized gradient sequence and uses the synchronized important gradient indication sequence as the new important gradient indication sequence.
For the steps performed by each training node in the training system described in the embodiments of the present disclosure, reference may be made to the foregoing description, which is not repeated here. It can be understood that the method embodiments mentioned in the present disclosure can be combined with one another to form combined embodiments without violating their principles and logic; due to space limitations, the details are not repeated in the present disclosure.
FIG. 9 shows a block diagram of a distributed communication-based neural network training apparatus provided according to an embodiment of the present disclosure. The apparatus is set on a training node and includes:
a training module 10, configured to train the neural network corresponding to the training node and save the generated gradients in a first gradient sequence;
a cumulative gradient acquisition module 20, configured to obtain a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, the second gradient sequence being used to record gradients that have not yet participated in synchronization;
an importance index sequence calculation module 30, configured to calculate an importance index sequence according to the cumulative gradient sequence;
a gradient classification module 40, configured to acquire an important gradient indication sequence and determine the important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
a synchronization module 50, configured to obtain the to-be-synchronized information of the training node according to the important gradients and the importance index sequence, and to perform synchronization between training nodes based on the to-be-synchronized information to obtain a synchronized gradient sequence and a synchronized important gradient indication sequence;
an update module 60, configured to use the synchronized important gradient indication sequence as the new important gradient indication sequence;
a parameter adjustment module 70, configured to adjust the parameters of the neural network according to the synchronized gradient sequence.
In some possible implementations, the cumulative gradient acquisition module includes:
a segmentation unit, configured to segment the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule to obtain a first gradient segment sequence and a second gradient segment sequence, wherein, if the position of a first gradient segment in the first gradient segment sequence is the same as the position of a second gradient segment in the second gradient segment sequence, the first gradient segment corresponds to the second gradient segment and both correspond to the same neural network parameters;
a multi-thread processing unit, configured to set up multiple parallel computing threads, each computing thread obtaining at least one first gradient segment and the second gradient segment corresponding to that first gradient segment;
an accumulation unit, configured so that each computing thread, for each first gradient segment it obtains, accumulates the first gradient segment with the corresponding second gradient segment to obtain the corresponding cumulative gradient segment;
a cumulative gradient sequence acquisition unit, configured to obtain the cumulative gradient sequence according to the cumulative gradient segments obtained by the respective computing threads.
In some possible implementations, the importance index sequence calculation module includes: an importance index calculation unit, configured so that each computing thread calculates the corresponding importance index according to the cumulative gradient segment it obtains; and an importance index sequence acquisition unit, configured to obtain the importance index sequence according to the importance index calculation results of the respective computing threads.
In some possible implementations, the gradient classification module is configured to, for each cumulative gradient segment calculated by each computing thread, extract the corresponding important gradient indication value from the important gradient indication sequence; if the important gradient indication value is the first indication value, determine all the cumulative gradients in the cumulative gradient segment as important gradients and submit the cumulative gradient segment to the communication buffer of the training node; the first indication value indicates that all the cumulative gradients in the cumulative gradient segment are important gradients.
In some possible implementations, the synchronization module includes:
a to-be-synchronized gradient sequence acquisition unit, configured to obtain the to-be-synchronized gradient sequence according to the cumulative gradient segments in the communication buffer, wherein the position of each cumulative gradient of each cumulative gradient segment in the communication buffer is the same in the to-be-synchronized gradient sequence as the position of that cumulative gradient in the cumulative gradient sequence, and the other positions in the to-be-synchronized gradient sequence are set to the preset gradient value;
a splicing unit, configured to splice the to-be-synchronized gradient sequence and the importance index sequence to obtain the to-be-synchronized information.
In some possible implementations, the synchronization module further includes: a synchronized accumulated gradient sequence acquisition unit, configured to add the to-be-synchronized gradient sequences in the to-be-synchronized information of the respective training nodes element by element to obtain a synchronized accumulated gradient sequence;
a synchronized gradient sequence acquisition unit, configured to divide each synchronized accumulated gradient in the synchronized accumulated gradient sequence by the total number of training nodes to obtain the synchronized gradient sequence;
an accumulated importance index sequence acquisition unit, configured to add the importance index sequences in the to-be-synchronized information of the respective training nodes element by element to obtain an accumulated importance index sequence;
an average importance index sequence acquisition unit, configured to divide each accumulated importance index in the accumulated importance index sequence by the total number of training nodes to obtain an average importance index sequence;
a synchronized important gradient indication sequence calculation unit, configured to calculate the important gradient indication value corresponding to each average importance index in the average importance index sequence to obtain the synchronized important gradient indication sequence.
In some possible implementations, the synchronized important gradient indication sequence calculation unit includes: a sorting unit, configured to sort the average importance indexes of the average importance index sequence in descending order to obtain a sorting result; an index threshold acquisition unit, configured to obtain a threshold index and determine an importance index threshold in the sorting result according to the threshold index; and an important gradient indication value calculation unit, configured to, for each average importance index in the average importance index sequence, set the corresponding important gradient indication value to the first indication value if the average importance index is greater than the importance index threshold, and otherwise set the corresponding important gradient indication value to the second indication value; wherein the first indication value is used to indicate important gradients and the second indication value is used to indicate non-important gradients.
In some possible implementations, the index threshold acquisition unit is configured to determine the number of segments according to the preset segmentation rule, obtain a preset compression rate, and use the product of the compression rate and the number of segments as the threshold index.
In some possible implementations, the parameter adjustment module is configured to sequentially extract the gradients in the synchronized gradient sequence; if a gradient is not equal to the preset gradient value, adjust the corresponding neural network parameter according to the gradient; and if the gradient is equal to the preset gradient value, extract the next gradient.
In some possible implementations, the update module is further configured to determine the non-important gradients in the cumulative gradient sequence according to the important gradient indication sequence and to update the second gradient sequence according to the non-important gradients; the apparatus further includes an iteration control module, configured to iteratively perform the distributed communication-based neural network training until a preset training stop condition is reached.
In some embodiments, the functions or modules of the apparatuses provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for their specific implementation, reference may be made to the descriptions of the above method embodiments, which are not repeated here for brevity.
An embodiment of the present disclosure further provides a computer-readable storage medium, in which at least one instruction or at least one segment of program is stored; the above method is implemented when the at least one instruction or at least one segment of program is loaded and executed by a processor. The computer-readable storage medium may be a non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a training device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the above method. The training device may be provided as a terminal, a server or a device of another form.
FIG. 10 shows a block diagram of a training device according to an embodiment of the present disclosure. For example, the training device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device or a personal digital assistant.
Referring to FIG. 10, the training device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the training device 800, such as operations associated with display, phone calls, data communication, camera operation and recording operation. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the training device 800. Examples of such data include instructions for any application or method operating on the training device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk.
The power supply component 806 provides power to the various components of the training device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the training device 800.
The multimedia component 808 includes a screen providing an output interface between the training device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the training device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the training device 800 is in an operation mode, such as a call mode, a recording mode or a speech recognition mode. The received audio signal may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing state assessments of various aspects of the training device 800. For example, the sensor component 814 can detect the on/off state of the training device 800 and the relative positioning of components, such as the display and the keypad of the training device 800; the sensor component 814 can also detect a change in position of the training device 800 or of a component of the training device 800, the presence or absence of user contact with the training device 800, the orientation or acceleration/deceleration of the training device 800, and temperature changes of the training device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the training device 800 and other devices. The training device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the training device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example a memory 804 including computer program instructions, which are executable by the processor 820 of the training device 800 to complete the above method.
FIG. 11 shows a block diagram of another training device according to an embodiment of the present disclosure. For example, the training device 1900 may be provided as a server. Referring to FIG. 11, the training device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932 for storing instructions executable by the processing component 1922, such as an application program. The application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions to perform the above method.
训练设备1900还可以包括一个电源组件1926被配置为执行训练设备1900的电源管理,一个有线或无线网络接口1950被配置为将训练设备1900连接到网络,和一个输入输出(I/O)接口1958。训练设备1900可以操作基于存储在存储器1932的操作系统,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM或类似。The training device 1900 may also include a power supply assembly 1926 configured to perform power management of the training device 1900, a wired or wireless network interface 1950 configured to connect the training device 1900 to a network, and an input output (I/O) interface 1958 . Training device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
在示例性实施例中,还提供了一种非易失性计算机可读存储介质,例如包括计算机程序指令的存储器1932,上述计算机程序指令可由训练设备1900的处理组件1922执行以完成上述方法。In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as memory 1932 comprising computer program instructions executable by processing component 1922 of training device 1900 to perform the method described above.
在示例性实施例中,还提供了一种训练系统,所述训练系统包括多个上述的训练设备。In an exemplary embodiment, there is also provided a training system comprising a plurality of the above-mentioned training devices.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be customized by utilizing state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions so as to implement various aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or the other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium, and the instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a specific manner, such that the computer-readable medium having the instructions stored thereon includes an article of manufacture including instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps are performed on the computer, the other programmable data processing apparatus, or the other device to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable data processing apparatus, or the other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may also occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein are chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

  1. A neural network training method based on distributed communication, applied to a training node, the method comprising:
    training a neural network corresponding to the training node, and saving generated gradients in a first gradient sequence;
    obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, wherein the second gradient sequence is used to record gradients that have not yet participated in synchronization;
    calculating an importance index sequence according to the cumulative gradient sequence;
    obtaining an important gradient indication sequence, and determining important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
    obtaining to-be-synchronized information of the training node according to the important gradients and the importance index sequence;
    performing synchronization between training nodes based on the to-be-synchronized information to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence; and
    adjusting parameters of the neural network according to the post-synchronization gradient sequence, and using the post-synchronization important gradient indication sequence as a new important gradient indication sequence.
  2. The method according to claim 1, wherein obtaining the cumulative gradient sequence according to the first gradient sequence and the second gradient sequence comprises:
    segmenting the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule to obtain a first gradient segment sequence and a second gradient segment sequence, wherein, if a position of a first gradient segment in the first gradient segment sequence is the same as a position of a second gradient segment in the second gradient segment sequence, the first gradient segment corresponds to the second gradient segment, and both correspond to the same neural network parameters;
    setting up a plurality of parallel computing threads, each computing thread obtaining at least one first gradient segment and a second gradient segment corresponding to the first gradient segment;
    for each obtained first gradient segment, accumulating, by each computing thread, the first gradient segment and the corresponding second gradient segment to obtain a corresponding cumulative gradient segment; and
    obtaining the cumulative gradient sequence according to the cumulative gradient segments obtained by the computing threads.
  3. The method according to claim 2, wherein calculating the importance index sequence according to the cumulative gradient sequence comprises:
    calculating, by each computing thread, a corresponding importance index according to the obtained cumulative gradient segment; and
    obtaining the importance index sequence according to the importance index calculation results of the computing threads.
  4. The method according to claim 2 or 3, wherein determining the important gradients in the cumulative gradient sequence according to the important gradient indication sequence comprises:
    for each cumulative gradient segment calculated by each computing thread, extracting a corresponding important gradient indication value from the important gradient indication sequence; and
    if the important gradient indication value is a first indication value, determining all cumulative gradients in the cumulative gradient segment as important gradients, and submitting the cumulative gradient segment to a communication buffer of the training node, wherein the first indication value indicates that all cumulative gradients in the cumulative gradient segment are important gradients.
  5. The method according to claim 4, wherein obtaining the to-be-synchronized information of the training node according to the important gradients and the importance index sequence comprises:
    obtaining a to-be-synchronized gradient sequence according to the cumulative gradient segments in the communication buffer, wherein a position, in the to-be-synchronized gradient sequence, of each cumulative gradient in each cumulative gradient segment in the communication buffer is the same as a position of the cumulative gradient in the cumulative gradient sequence, and other positions in the to-be-synchronized gradient sequence are set to a preset gradient value; and
    splicing the to-be-synchronized gradient sequence and the importance index sequence to obtain the to-be-synchronized information.
  6. The method according to any one of claims 2 to 5, wherein performing synchronization between training nodes based on the to-be-synchronized information to obtain the post-synchronization gradient sequence and the post-synchronization important gradient indication sequence comprises:
    adding, element by element, the to-be-synchronized gradient sequences in the to-be-synchronized information of the training nodes to obtain a synchronized cumulative gradient sequence;
    dividing each synchronized cumulative gradient in the synchronized cumulative gradient sequence by the total number of training nodes to obtain the post-synchronization gradient sequence;
    adding, element by element, the importance index sequences in the to-be-synchronized information of the training nodes to obtain an accumulated importance index sequence;
    dividing each accumulated importance index in the accumulated importance index sequence by the total number of training nodes to obtain an average importance index sequence; and
    calculating an important gradient indication value corresponding to each average importance index in the average importance index sequence to obtain the post-synchronization important gradient indication sequence.
  7. The method according to claim 6, wherein calculating the important gradient indication value corresponding to each average importance index in the average importance index sequence comprises:
    obtaining a ranking result of the average importance indexes in the average importance index sequence in descending order of the average importance indexes;
    obtaining a threshold index, and determining an importance index threshold in the ranking result according to the threshold index; and
    for each average importance index in the average importance index sequence, if the average importance index is greater than the importance index threshold, setting the corresponding important gradient indication value to a first indication value; otherwise, setting the corresponding important gradient indication value to a second indication value, wherein the first indication value is used to indicate an important gradient, and the second indication value is used to indicate a non-important gradient.
  8. The method according to claim 7, wherein obtaining the threshold index comprises:
    determining the number of segments according to the preset segmentation rule;
    obtaining a preset compression rate; and
    using the product of the compression rate and the number of segments as the threshold index.
  9. The method according to any one of claims 1 to 8, wherein adjusting the parameters of the neural network according to the post-synchronization gradient sequence comprises:
    sequentially extracting gradients in the post-synchronization gradient sequence;
    if a gradient is not equal to a preset gradient value, adjusting the corresponding neural network parameter according to the gradient; and
    if the gradient is equal to the preset gradient value, extracting the next gradient.
  10. The method according to any one of claims 1 to 9, wherein the method further comprises:
    determining non-important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
    updating the second gradient sequence according to the non-important gradients; and
    iteratively performing the neural network training based on distributed communication until a preset training stop condition is reached.
  11. A neural network training method based on distributed communication, applied to a training system comprising a plurality of training nodes, the method comprising:
    training, by each training node, a corresponding neural network, and saving generated gradients in a first gradient sequence;
    obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, wherein the second gradient sequence is used to record gradients in the training node that have not yet participated in synchronization;
    calculating an importance index sequence according to the cumulative gradient sequence;
    obtaining an important gradient indication sequence in the training node, and determining important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
    obtaining to-be-synchronized information of the training node according to the important gradients and the importance index sequence;
    performing, by each training node, synchronization between the training nodes based on the corresponding to-be-synchronized information to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence; and
    adjusting, by each training node, parameters of the neural network according to the post-synchronization gradient sequence, and using the post-synchronization important gradient indication sequence as a new important gradient indication sequence.
  12. A neural network training apparatus based on distributed communication, wherein the apparatus is provided on a training node and comprises:
    a training module, configured to train a neural network corresponding to the training node, and save generated gradients in a first gradient sequence;
    a cumulative gradient acquisition module, configured to obtain a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, wherein the second gradient sequence is used to record gradients that have not yet participated in synchronization;
    an importance index sequence calculation module, configured to calculate an importance index sequence according to the cumulative gradient sequence;
    a gradient classification module, configured to obtain an important gradient indication sequence, and determine important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
    a synchronization module, configured to obtain to-be-synchronized information of the training node according to the important gradients and the importance index sequence, and perform synchronization between training nodes based on the to-be-synchronized information to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence;
    an update module, configured to use the post-synchronization important gradient indication sequence as a new important gradient indication sequence; and
    a parameter adjustment module, configured to adjust parameters of the neural network according to the post-synchronization gradient sequence.
  13. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the neural network training method based on distributed communication according to any one of claims 1 to 10.
  14. A training device, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements the neural network training method based on distributed communication according to any one of claims 1 to 10 by executing the instructions stored in the memory.
  15. A training system, wherein the training system comprises a plurality of training devices according to claim 14.
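
The sketches below are editorial illustrations of individual claimed steps; they are not part of the claims, and every identifier, constant, and metric choice in them is an assumption rather than the applicant's implementation. This first sketch covers the segmented accumulation and per-segment importance computation of claims 2 and 3: both gradient sequences are cut by a fixed segmentation rule, corresponding segments are accumulated by parallel threads, and one importance index per accumulated segment is produced. The L2 norm is only one plausible choice of importance index, and all names (accumulate_segments, segment_size, and so on) are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def accumulate_segments(first_grads, second_grads, segment_size=4, num_threads=4):
    """Split both gradient sequences into segments of segment_size, accumulate
    corresponding segments in parallel threads, and compute one importance
    index per accumulated segment (here its L2 norm, an assumed metric)."""
    assert first_grads.shape == second_grads.shape
    n = len(first_grads)
    bounds = [(s, min(s + segment_size, n)) for s in range(0, n, segment_size)]

    def accumulate_one(bound):
        lo, hi = bound
        segment = first_grads[lo:hi] + second_grads[lo:hi]   # cumulative gradient segment
        return segment, float(np.linalg.norm(segment))        # segment and its importance index

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(accumulate_one, bounds))

    cumulative = np.concatenate([seg for seg, _ in results])
    importance = np.array([imp for _, imp in results])
    return cumulative, importance
```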
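The next sketch illustrates claims 5 and 6 under the same caveats: important segments are copied into a to-be-synchronized gradient sequence whose remaining positions hold an assumed preset value, the importance index sequence is appended, and synchronization is shown as an in-process stand-in (a sum over node payloads divided by the node count) rather than a real collective communication call.

```python
import numpy as np

PRESET_GRADIENT = 0.0  # assumed placeholder for positions that do not take part in this round

def build_sync_payload(cumulative, importance, important_flags, segment_size):
    """Claim 5, sketched: copy important segments into the to-be-synchronized
    sequence, fill every other position with the preset value, then append the
    per-segment importance indexes."""
    to_sync = np.full_like(cumulative, PRESET_GRADIENT)
    for seg_idx, flag in enumerate(important_flags):
        if flag:
            lo = seg_idx * segment_size
            hi = min(lo + segment_size, len(cumulative))
            to_sync[lo:hi] = cumulative[lo:hi]
    return np.concatenate([to_sync, importance])

def synchronize(payloads, num_params):
    """Claim 6, sketched with an in-process stand-in for the collective step:
    element-wise sum of all node payloads divided by the number of nodes."""
    averaged = np.mean(np.stack(payloads), axis=0)
    return averaged[:num_params], averaged[num_params:]
```

In a real deployment the averaging step would be carried out by whatever collective primitive the training framework provides; the stand-in here only makes the arithmetic of the claim visible.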
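This sketch illustrates the indicator-value computation of claims 7 and 8. The averaged importance indexes are ranked in descending order, the threshold index is the product of a preset compression rate and the segment count, and the value at that rank serves as the importance index threshold. Interpreting the threshold index as a rank position is an assumption, as is every identifier.

```python
import numpy as np

def important_gradient_indicators(average_importance, compression_rate,
                                  first_value=1, second_value=0):
    """Claims 7 and 8, sketched: rank the averaged importance indexes in
    descending order, take threshold_index = compression_rate * segment count,
    read the importance threshold at that rank, and emit indicator values."""
    num_segments = len(average_importance)
    threshold_index = max(1, int(compression_rate * num_segments))
    ranked = np.sort(average_importance)[::-1]    # descending order
    threshold = ranked[threshold_index - 1]        # importance index threshold
    return np.where(average_importance > threshold, first_value, second_value)
```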
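The last sketch illustrates claims 9 and 10: positions of the post-synchronization gradient sequence holding the preset value are skipped during the parameter update, and the gradients of non-important segments are carried over as the new second gradient sequence for a later round. The plain SGD step and the learning rate are assumptions introduced only to make the example runnable.

```python
import numpy as np

def apply_update_and_carry_residual(params, post_sync_gradients, cumulative,
                                    indicators, segment_size, lr=0.01,
                                    preset_gradient=0.0):
    """Claim 9, sketched: apply only gradients that differ from the preset value
    (a plain SGD step is assumed). Claim 10, sketched: keep the gradients of
    non-important segments as the new second gradient sequence so they can join
    a later synchronization round."""
    params = params.copy()
    for i, grad in enumerate(post_sync_gradients):
        if grad != preset_gradient:          # skip placeholder positions
            params[i] -= lr * grad
    second = np.zeros_like(cumulative)
    for seg_idx, flag in enumerate(indicators):
        if flag == 0:                        # non-important segment
            lo = seg_idx * segment_size
            hi = min(lo + segment_size, len(cumulative))
            second[lo:hi] = cumulative[lo:hi]
    return params, second
```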
PCT/CN2021/100623 2021-02-27 2021-06-17 Distributed communication-based neural network training method and apparatus, and storage medium WO2022179007A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110221266.9 2021-02-27
CN202110221266.9A CN112766502A (en) 2021-02-27 2021-02-27 Neural network training method and device based on distributed communication and storage medium

Publications (1)

Publication Number Publication Date
WO2022179007A1 (en) 2022-09-01

Family

ID=75704305

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100623 WO2022179007A1 (en) 2021-02-27 2021-06-17 Distributed communication-based neural network training method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN112766502A (en)
WO (1) WO2022179007A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766502A (en) * 2021-02-27 2021-05-07 上海商汤智能科技有限公司 Neural network training method and device based on distributed communication and storage medium
CN113556247B (en) * 2021-06-25 2023-08-01 深圳技术大学 Multi-layer parameter distributed data transmission method, device and readable medium
CN114118381B (en) * 2021-12-03 2024-02-02 中国人民解放军国防科技大学 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication
CN114862656B (en) * 2022-05-18 2023-05-05 北京百度网讯科技有限公司 Multi-GPU-based acquisition method for training cost of distributed deep learning model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075347A1 (en) * 2016-09-15 2018-03-15 Microsoft Technology Licensing, Llc Efficient training of neural networks
CN110619388A (en) * 2019-09-20 2019-12-27 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
WO2020226634A1 (en) * 2019-05-07 2020-11-12 Huawei Technologies Co., Ltd. Distributed synchronous training architecture using stale weights
CN112288083A (en) * 2020-10-21 2021-01-29 周宇浩 Neural network distributed training method, device, equipment and storage medium
CN112766502A (en) * 2021-02-27 2021-05-07 上海商汤智能科技有限公司 Neural network training method and device based on distributed communication and storage medium

Also Published As

Publication number Publication date
CN112766502A (en) 2021-05-07


Legal Events

ENP: Entry into the national phase (Ref document number: 2022531593; Country of ref document: JP; Kind code of ref document: A)
121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21927439; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)