WO2022179007A1 - Distributed communication-based neural network training method and apparatus, and storage medium

Distributed communication-based neural network training method and apparatus, and storage medium

Info

Publication number
WO2022179007A1
Authority
WO
WIPO (PCT)
Prior art keywords
gradient
sequence
training
important
synchronized
Prior art date
Application number
PCT/CN2021/100623
Other languages
French (fr)
Chinese (zh)
Inventor
颜子杰
段江飞
孙鹏
张行程
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2022179007A1 publication Critical patent/WO2022179007A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present disclosure relates to the technical field of machine learning, and in particular, to a neural network training method, device and storage medium based on distributed communication.
  • Embodiments of the present application provide a distributed communication-based neural network training method, device, and storage medium, which are used to solve at least one of the above-mentioned technical problems.
  • an embodiment of the present application provides a distributed communication-based neural network training method, which is applied to training nodes, and the method includes:
  • a cumulative gradient sequence is obtained;
  • the second gradient sequence is used to record the gradients that have not participated in synchronization;
  • the parameters of the neural network are adjusted according to the post-synchronization gradient sequence, and the post-synchronization important gradient indication sequence is used as the new important gradient indication sequence.
  • the embodiments of the present application provide a neural network training device based on distributed communication, the device is set on a training node, and includes:
  • a training module is used to train the corresponding neural network of the training node, and the generated gradient is stored in the first gradient sequence;
  • a cumulative gradient acquisition module configured to obtain a cumulative gradient sequence according to the first gradient sequence and the second gradient sequence; the second gradient sequence is used to record gradients that have not yet participated in synchronization;
  • an importance index sequence calculation module configured to calculate and obtain an importance index sequence according to the cumulative gradient sequence
  • a gradient classification module configured to obtain an important gradient indication sequence, and determine an important gradient in the cumulative gradient sequence according to the important gradient indication sequence;
  • a synchronization module configured to obtain the to-be-synchronized information of the training nodes according to the important gradients and the importance index sequence, and to perform synchronization between the training nodes based on the to-be-synchronized information to obtain a post-synchronized gradient sequence and post-synchronized important gradients instruction sequence;
  • an update module configured to use the synchronized important gradient indication sequence as the new important gradient indication sequence
  • a parameter adjustment module configured to adjust the parameters of the neural network according to the synchronized gradient sequence.
  • an embodiment of the present application provides a neural network training method based on distributed communication, which is applied to a training system including multiple training nodes, and the method includes:
  • Each training node trains the corresponding neural network, and saves the generated gradient in the first gradient sequence
  • a cumulative gradient sequence is obtained; the second gradient sequence is used to record the gradients in the training node that have not participated in synchronization;
  • Each training node performs synchronization between the training nodes based on the corresponding information to be synchronized, and obtains a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence;
  • Each training node adjusts the parameters of the neural network according to the post-synchronized gradient sequence, and uses the post-synchronized important gradient indication sequence as the new important gradient indication sequence.
  • an embodiment of the present application provides a computer-readable storage medium, where at least one instruction or at least one program is stored in the computer-readable storage medium, and the at least one instruction or at least one program is loaded and executed by a processor In order to realize the neural network training method based on distributed communication according to any one of the first aspect.
  • an embodiment of the present application provides a training device, including at least one processor, and a memory communicatively connected to the at least one processor; wherein the memory stores a program that can be executed by the at least one processor. Instructions, the at least one processor implements the distributed communication-based neural network training method according to any one of the first aspect by executing the instructions stored in the memory.
  • an embodiment of the present application provides a training system, where the training system includes a plurality of training devices described in the fifth aspect.
  • embodiments of the present application provide a computer program product, where the computer program product may include a computer-readable storage medium on which computer-readable program instructions for enabling a processor to implement various aspects are stored.
  • the important gradient and the related information used to determine the important gradient can be synchronized in one communication process, and the synchronization result can ensure that the important gradient obtained by each training node corresponds to the same neural network parameters, and there is no need for additional communication between nodes on the position of important gradients, which reduces the frequency of node communication and improves the training speed.
  • FIG. 1 shows a schematic diagram of neural network training based on distributed communication in the related art according to an embodiment of the present disclosure
  • FIG. 2 shows a flowchart of a method for training a neural network based on distributed communication according to an embodiment of the present disclosure
  • Fig. 3 shows a feasible structural schematic diagram of a deep neural network according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of the relationship between the first storage space, the second storage space and a neural network structure according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic flowchart of a multi-node implementation of the above-mentioned distributed communication-based neural network training method according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic flowchart of step S20 in the above-mentioned distributed communication-based neural network training method according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic flowchart of step S40 in the above-mentioned distributed communication-based neural network training method according to an embodiment of the present disclosure
  • FIG. 8 shows a schematic flowchart of step S60 in the above-mentioned distributed communication-based neural network training method according to an embodiment of the present disclosure
  • FIG. 9 shows a block diagram of a neural network training apparatus based on distributed communication according to an embodiment of the present disclosure
  • FIG. 10 shows a block diagram of a training device according to an embodiment of the present disclosure
  • FIG. 11 shows a block diagram of another training device according to an embodiment of the present disclosure.
  • FIG. 1 shows a schematic diagram of neural network training based on distributed communication in the related art.
  • Each GPU (Graphics Processing Unit, graphics processor) serves as one training node.
  • the above m training nodes can independently train a neural network with the same structure.
  • the sample data is input into the neural network, and the forward propagation process and the back propagation process are triggered in turn.
  • in the above forward propagation process, the sample data is input to the first layer, and the first layer, the second layer, and so on are triggered in sequence until the nth layer outputs the calculation result.
  • in the back propagation process, the nth layer, the (n-1)th layer, and so on are triggered in sequence until the first layer, with each layer generating gradients in turn.
  • the above m training nodes communicate according to the generated gradient, and adjust the parameters of the corresponding neural network according to the communication result.
  • each training node can adjust the parameters of its corresponding neural network with reference to the gradients generated by other training nodes, thereby improving the training speed. However, the communication frequency between nodes is also high, which makes the communication process itself consume more resources, and each training node may be forced to wait frequently due to communication delays, thereby reducing the training speed. In this case, the communication process of the distributed nodes becomes the bottleneck restricting training efficiency.
  • the embodiments of the present disclosure provide a neural network training method based on distributed communication.
  • the distributed communication-based neural network training method provided by the embodiments of the present disclosure can be applied to any data processing device with a graphics processor (Graphics Processing Unit, GPU).
  • the data processing device can be a terminal, including a personal computer (Personal Computer, PC), minicomputer, medium computer, mainframe, workstation, etc.
  • the data processing device can also be a server. It should be noted that the data processing device may be independent when used for training the neural network, or may exist in the form of a cluster.
  • the distributed communication-based neural network training method provided by the embodiment of the present disclosure may be stored in a data processing device in the form of a computer program, and the data processing device implements the distributed communication-based neural network training method of the present disclosure by running the computer program.
  • the above-mentioned computer program may be an independent computer program, or may be a functional module, a plug-in, or a small program integrated on other computer programs.
  • FIG. 2 shows a flowchart of a method for training a neural network based on distributed communication according to an embodiment of the present disclosure. As shown in FIG. 2 , the above method includes:
  • the embodiments of the present disclosure do not limit the structure of the neural network, and the above-mentioned neural network may be at least one of a deep neural network, a convolutional neural network, and a recurrent neural network.
  • a deep neural network as an example, please refer to FIG. 3 , which shows a schematic structural diagram of a feasible deep neural network.
  • the above-mentioned deep neural network may include a convolutional layer, a pooling layer and a fully connected layer, and one of the above-mentioned convolutional layers is used as the input layer of the neural network, and the above-mentioned fully connected layer is used as the output layer of the above-mentioned deep neural network.
  • convolutional layers and pooling layers can be arranged alternately between the input layer and the above-mentioned output layer.
  • a first storage space (grad_buffer) and a second storage space (left_buffer) may be opened on the training node.
  • the first storage space (grad_buffer) and the second storage space (left_buffer) are contiguous storage spaces.
  • the neural network may have multiple layers, each layer may include multiple parameters, and each parameter in each layer corresponds to a segment of storage interval in the above-mentioned first storage space.
  • the first storage space and the second storage space may have the same structure, and both may be the same size as the storage space occupied by the parameters of the neural network.
  • the first storage space (grad_buffer) may be used to store gradients (first gradient sequences) generated by the training nodes during the training process
  • the second storage space (left_buffer) may be used to store gradients that have not yet participated in synchronization.
  • Layer 1 includes parameter E10, parameter E11, parameter E12 and parameter E13;
  • Layer 2 includes parameter E20, parameter E21, and parameter E22;
  • Layer 3 includes parameter E30, parameter E31 and parameter E32;
  • Layers 1-3 include a total of 10 parameters, then the above-mentioned first storage space and the above-mentioned second storage space both include 10 storage intervals, and each layer generates a gradient corresponding to each parameter in the reverse direction during the training process, specifically:
  • layer 3 generates gradient T30, gradient T31 and gradient T32 corresponding to parameter E30, parameter E31 and parameter E32, respectively;
  • layer 2 generates gradient T20, gradient T21, and gradient T22 corresponding to parameter E20, parameter E21, parameter E22, respectively;
  • layer 1 generates gradient T10, gradient T11, gradient T12, and gradient T13 corresponding to parameter E10, parameter E11, parameter E12, and parameter E13, respectively.
  • the 10 gradients can be sequentially stored in the first storage space in the order of gradient generation, that is, the data in the first storage space (grad_buffer) can be {T30, T31, T32, T20, T21, T22, T10, T11, T12, T13}.
  • the data {T30, T31, T32, T20, T21, T22, T10, T11, T12, T13} of the first storage space (grad_buffer) is the first gradient sequence.
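  • As a minimal illustration of this storage order (all variable names and gradient values below are assumptions, not part of the disclosed embodiments), the gradients can be appended to the first storage space in the order they are generated during back propagation:

```python
import numpy as np

# Hypothetical per-layer gradients for the 3-layer example above.
layer_grads = {
    3: np.array([0.1, -0.2, 0.3]),       # T30, T31, T32
    2: np.array([0.05, 0.0, -0.1]),      # T20, T21, T22
    1: np.array([0.2, 0.1, -0.3, 0.4]),  # T10, T11, T12, T13
}

# Gradients are written in generation order: layer 3, then layer 2, then layer 1.
grad_buffer = np.concatenate([layer_grads[layer] for layer in (3, 2, 1)])
# grad_buffer now holds the first gradient sequence {T30, ..., T13} (10 values).
```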
  • the above-mentioned training node can take out a minimum batch of samples each time, perform training on the minimum batch, and obtain the gradients corresponding to this batch of training according to the training results.
  • the above gradient is stored in the first gradient sequence.
  • the first gradient sequence can be recorded after each batch of training, and the subsequent synchronization between training nodes can be performed after multiple batches of training, which reduces communication frequency and improves training speed.
  • S20 Obtain a cumulative gradient sequence according to the first gradient sequence and the second gradient sequence; the second gradient sequence is used to record gradients that have not yet participated in synchronization.
  • the second gradient sequence may be stored in the second storage space (left_buffer).
  • the second storage space has the same structure as the above-mentioned first storage space, and the storage intervals in the same position correspond to the same neural network parameters.
  • the 10 consecutive storage intervals in the first storage space and the 10 consecutive storage intervals in the second storage space correspond to parameter E30, parameter E31, parameter E32, parameter E20, parameter E21, parameter E22, parameter E10, parameter E11, parameter E12, and parameter E13, respectively.
  • the accumulated gradient sequence may be obtained by adding the data in the same storage interval in the first storage space and the second storage space, and the accumulated gradient sequence may be stored in the first storage space.
  • any storage interval grad_buffer[i] of the first storage space and the corresponding storage interval left_buffer[i] of the second storage space can be obtained, where i takes values from 0 to 9, and grad_buffer[i]+left_buffer[i] is assigned to grad_buffer[i]. The data in the first storage space after the assignment is the above-mentioned cumulative gradient sequence, that is, the above-mentioned cumulative gradient sequence and the above-mentioned first gradient sequence reuse the above-mentioned first storage space.
  • the number of training nodes participating in the collaborative training of the neural network can be large, and because the complexity of the neural network can be very high, the storage space occupied by the generated gradients is correspondingly often large as well. Reusing the first storage space in each training node can therefore greatly save storage resources.
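  • A minimal sketch of this reuse, with assumed names and random placeholder values, shows the accumulation performed in place so that the cumulative gradient sequence occupies the same memory as the first gradient sequence:

```python
import numpy as np

grad_buffer = np.random.randn(10)  # first gradient sequence from the current batch
left_buffer = np.zeros(10)         # second gradient sequence: gradients not yet synchronized

# grad_buffer[i] = grad_buffer[i] + left_buffer[i], done in place: the cumulative
# gradient sequence reuses the first storage space instead of allocating new memory.
grad_buffer += left_buffer
```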
  • the importance index may be calculated corresponding to each accumulated gradient in the accumulated gradient sequence, and the embodiment of the present disclosure does not limit the method for determining the above importance index.
  • the importance index sequence Im can be obtained, and each importance index Im[i] can represent the importance of the gradient generated for the neural network parameter corresponding to grad_buffer[i] obtained above.
  • S40 Acquire an important gradient indication sequence, and determine an important gradient in the above-mentioned cumulative gradient sequence according to the above-mentioned important gradient indication sequence.
  • each important gradient indication value in the important gradient indication sequence may indicate whether the accumulated gradient of the corresponding position is important.
  • the important gradient indication sequence Ip can be obtained, and each important gradient indication value Ip[i] can indicate whether the data recorded by grad_buffer[i] obtained above is an important gradient.
  • if Ip[i] is the first indication value, the data recorded in grad_buffer[i] is determined to be an important gradient;
  • if Ip[i] is the second indication value, the data recorded in grad_buffer[i] is determined to be an unimportant gradient.
  • the important gradient indication sequence Ip may be obtained during the last synchronization process between nodes.
  • grad_buffer[1], grad_buffer[3], and grad_buffer[5] are the important gradients
  • other gradients in grad_buffer are non-important gradients.
  • the second gradient sequence may also be updated according to the insignificant gradient.
  • the non-important gradients do not participate in this synchronization between nodes. Therefore, the non-important gradients can be updated into the second gradient sequence, that is, correspondingly stored in the second storage space (left_buffer). Updating the second gradient sequence according to the unimportant gradients includes: determining the positions of the unimportant gradients in the second gradient sequence, updating the data of the second gradient sequence at those positions to the unimportant gradients, and assigning the data at the other positions a value of 0.
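  • A small sketch of this update (variable names and the chosen important positions are assumptions): the unimportant cumulative gradients are written back into left_buffer and every other position is set to 0:

```python
import numpy as np

cumulative = np.random.randn(10)       # cumulative gradient sequence (grad_buffer)
important = np.zeros(10, dtype=bool)
important[[1, 3, 5]] = True            # grad_buffer[1], [3], [5] are the important gradients

# Unimportant gradients are kept in left_buffer for a later synchronization;
# positions holding important gradients are reset to 0.
left_buffer = np.where(important, 0.0, cumulative)
```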
  • the above-mentioned important gradient and the above-mentioned importance index sequence may be spliced, and the splicing result may be used as the above-mentioned information to be synchronized.
  • the embodiments of the present disclosure do not limit the splicing method.
  • S60 Perform synchronization between training nodes based on the above information to be synchronized, to obtain a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence.
  • AllReduce can be used for communication, and AllReduce is a general term for a class of communication methods that can efficiently communicate between distributed training nodes.
  • each training node can obtain the post-synchronized important gradient indication sequence during each synchronization, and this post-synchronized important gradient indication sequence is used as the important gradient indication sequence applied in the next training. Therefore, the important gradient indication sequence used by each training node during training is the same, that is, the important gradient positions are the same.
  • if training node A determines in step S40 that grad_buffer[1], grad_buffer[3], and grad_buffer[5] are important gradients, the other training nodes will also determine their corresponding grad_buffer[1], grad_buffer[3], and grad_buffer[5] as important gradients. The post-synchronized gradient sequence can therefore be calculated directly from the important gradients of each node without additionally considering the positions of the important gradients during synchronization, and the position information of the important gradients no longer needs to be communicated separately between the training nodes, which reduces the communication frequency.
  • the averages of the important gradients grad_buffer[1], grad_buffer[3], and grad_buffer[5] across the nodes can be used as Td[1], Td[3], and Td[5] of the post-synchronized gradient sequence Td, and the values at the other positions in Td can be assigned the preset gradient value.
  • the importance index sequence after synchronization can be obtained by taking the mean value element by element according to the importance index sequence of each node, and the above-mentioned important gradient indication sequence after synchronization is correspondingly obtained according to the importance index sequence after synchronization.
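  • A minimal local simulation of this synchronization step (names and values are assumptions; an actual implementation would use an AllReduce primitive): the post-synchronized gradient sequence and the average importance index sequence are element-wise means over the nodes' information to be synchronized:

```python
import numpy as np

num_nodes = 3
node_grads = [np.random.randn(10) for _ in range(num_nodes)]  # Wt1, Wt2, Wt3
node_imps = [np.random.rand(2) for _ in range(num_nodes)]     # per-node {Im[0], Im[1]}

synced_grads = sum(node_grads) / num_nodes    # post-synchronized gradient sequence Td
avg_importance = sum(node_imps) / num_nodes   # average importance index sequence {AIm[0], AIm[1]}
```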
  • S70 Adjust the parameters of the above-mentioned neural network according to the above-mentioned post-synchronized gradient sequence, and use the above-mentioned important gradient indication sequence after synchronization as a new above-mentioned important gradient indication sequence.
  • the above-mentioned adjustment of the parameters of the above-mentioned neural network according to the above-mentioned post-synchronized gradient sequence includes:
  • the above-mentioned preset gradient value is used to indicate a situation in which parameter adjustment according to the gradient is not required.
  • the above-mentioned preset gradient value may be 0.
  • the elements in the synchronized gradient sequence may be correspondingly stored in the first storage space, and the first storage space (grad_buffer) may be reused again, thereby reducing storage consumption.
  • when all gradients have been extracted, "extract the next gradient" will fail, which indicates that the adjustment of the parameters of the neural network has been completed. After completion, the first storage space can be cleared, so that when steps S10-S70 are executed iteratively, the first gradient sequence can be recorded in the reused first storage space in step S10.
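  • A hedged sketch of this parameter adjustment (the learning rate and names are assumptions): positions holding the preset gradient value, 0 here, are skipped:

```python
import numpy as np

params = np.random.randn(10)        # flattened neural network parameters
synced_grads = np.random.randn(10)  # post-synchronized gradient sequence
synced_grads[5:] = 0.0              # preset gradient value: no adjustment needed there
learning_rate = 0.01                # hypothetical value

mask = synced_grads != 0.0                          # skip gradients equal to the preset value
params[mask] -= learning_rate * synced_grads[mask]  # adjust only where a real gradient exists
```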
  • the unimportant gradient is indicated by setting the preset gradient value, and the parameter value corresponding to the unimportant gradient does not need to be adjusted, thereby avoiding excessive adjustment of the parameters of the neural network and further improving the convergence speed of the neural network.
  • Steps S10-S70 in this embodiment of the present disclosure show a neural network training method in a single synchronization situation.
  • both the second gradient sequence and the important gradient indication sequence may be updated. According to the update result, the above-mentioned neural network training method for a single synchronization can be performed again, that is, steps S10-S70 can be performed iteratively until a preset training stop condition is reached and a trained neural network is obtained.
  • the embodiment of the present disclosure does not limit the above training stop condition, and the above training stop condition may be that the number of iterations reaches a preset iteration threshold, or that the loss generated by the neural network is less than a preset loss threshold.
  • FIG. 5 shows a schematic flowchart of multi-node executing the above-mentioned neural network training method based on distributed communication.
  • Each training node in FIG. 5 can independently train its own neural network, update the local second gradient sequence during the training process, and obtain information to be synchronized.
  • the information to be synchronized includes important gradients and importance index sequences.
  • Each training node communicates based on the information to be synchronized, and obtains a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence.
  • Each node performs parameter adjustment based on the above-mentioned post-synchronization gradient sequence, and determines the important gradient in the next iteration based on the above-mentioned post-synchronization important gradient indication sequence.
  • Each training node continuously adjusts its own parameters through iteration, and communicates and cooperates between nodes to complete the training of the neural network during each iteration.
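  • The iterative flow can be sketched end to end as a toy simulation; every name, the segment size, the learning rate, and the use of the 2-norm as importance index below are assumptions made only for illustration:

```python
import numpy as np

NUM_NODES, NUM_PARAMS, SEG = 3, 10, 5             # two segments of five gradients each
rng = np.random.default_rng(0)

params = rng.standard_normal(NUM_PARAMS)          # identical parameters on every node
left = [np.zeros(NUM_PARAMS) for _ in range(NUM_NODES)]  # per-node second gradient sequences
indication = np.array([1, 0])                     # initial important gradient indication sequence

for step in range(3):
    to_sync, importances = [], []
    for n in range(NUM_NODES):
        grad = rng.standard_normal(NUM_PARAMS)            # S10: gradients of this batch
        cumulative = grad + left[n]                       # S20: cumulative gradient sequence
        segs = cumulative.reshape(-1, SEG)
        importances.append(np.linalg.norm(segs, axis=1))  # S30: 2-norm per segment
        mask = np.repeat(indication.astype(bool), SEG)    # S40: segment-level importance
        left[n] = np.where(mask, 0.0, cumulative)         # unimportant gradients wait locally
        to_sync.append(np.where(mask, cumulative, 0.0))   # S50: gradient sequence to synchronize
    synced = sum(to_sync) / NUM_NODES                     # S60: post-synchronized gradients
    avg_imp = sum(importances) / NUM_NODES
    threshold = np.sort(avg_imp)[::-1][0]                 # keep the single most important segment
    indication = (avg_imp >= threshold).astype(int)       # new important gradient indication
    params -= 0.01 * synced                               # S70: adjust parameters (0 = no change)
```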
  • the number of training nodes involved in training is often relatively large, and because the complexity of the neural network can be very high, the amount of generated gradient data is also relatively large. Correspondingly, the amount of gradient position information used to record the positions of the gradients is also large.
  • in the related art, the training nodes not only need to synchronize the gradient position information but also the gradients themselves, which generates large communication pressure and consumes many communication resources. Limited by communication resources, many nodes may frequently be forced to wait for synchronization, which reduces the training speed.
  • the distributed communication-based neural network training method can synchronize important gradients and important gradient position information in one communication process, and the synchronization result can ensure that the important gradients obtained by each training node correspond to the same neural network parameters. There is no need to additionally communicate the gradient position information between the training nodes, which can significantly reduce the communication frequency, reduce the consumption of communication resources, reduce the time training nodes spend waiting for synchronization, and improve the training speed.
  • a particularly obvious speed-up effect can be achieved in scenarios where the number of training nodes, the amount of data, and the complexity of the neural network are all high.
  • the speed at which the above-mentioned training node performs the above-mentioned steps may be further improved based on the multi-thread concurrency characteristic, and the embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
  • FIG. 6 shows a schematic flowchart of step S20 in the above-mentioned distributed communication-based neural network training method.
  • as shown in FIG. 6, obtaining the cumulative gradient sequence according to the above-mentioned first gradient sequence and second gradient sequence includes:
  • S21 Segment the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule to obtain a first gradient segment sequence and a second gradient segment sequence; wherein, if the position of a first gradient segment in the first gradient segment sequence is the same as the position of a second gradient segment in the second gradient segment sequence, then the first gradient segment corresponds to the second gradient segment, and both correspond to the same neural network parameters.
  • the embodiments of the present disclosure do not limit specific segmentation rules, as long as the above-mentioned first gradient sequence can be divided into a plurality of first gradient segments to form a first gradient segment sequence.
  • the first gradient sequence and the second gradient sequence have the same structure, and the second gradient sequence is segmented based on the segmentation rule to obtain the second gradient segment sequence.
  • if the position of the first gradient segment in the first gradient segment sequence is the same as the position of the second gradient segment in the second gradient segment sequence, then the first gradient segment and the second gradient segment correspond, and both correspond to the same neural network parameters.
  • the first gradient sequence obtained in step S10 is stored in the first storage space (grad_buffer). The data in grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3], and grad_buffer[4] can form one first gradient segment Tdf[0], and the data in grad_buffer[5], grad_buffer[6], grad_buffer[7], grad_buffer[8], and grad_buffer[9] can form another first gradient segment Tdf[1], obtaining the first gradient segment sequence {Tdf[0], Tdf[1]}.
  • the second gradient sequence is stored in the second storage space (left_buffer). The data in left_buffer[0], left_buffer[1], left_buffer[2], left_buffer[3], and left_buffer[4] forms one second gradient segment Tds[0], and the data in left_buffer[5], left_buffer[6], left_buffer[7], left_buffer[8], and left_buffer[9] forms another second gradient segment Tds[1], obtaining the second gradient segment sequence {Tds[0], Tds[1]}.
  • each data in Tdf[0] corresponds to parameter E30, parameter E31, parameter E32, parameter E20, and parameter E21 in sequence.
  • Tdf[0] corresponds to Tds[0]
  • each data in Tds[0] also corresponds to parameter E30, parameter E31, parameter E32, parameter E20 and parameter E21 in sequence.
  • S22 Set up multiple parallel computing threads, each of which obtains at least one first gradient segment and a second gradient segment corresponding to the foregoing first gradient segment.
  • Tdf[0] and Tds[0] can be sent to computing thread A
  • Tdf[1] and Tds[1] can be sent to computing thread B.
  • for each acquired first gradient segment, each of the above-mentioned computing threads accumulates the first gradient segment and the corresponding second gradient segment to obtain a corresponding accumulated gradient segment.
  • the data in Tdf[0] and Tds[0] can be added element by element to obtain the corresponding cumulative gradient segment.
  • the data of each element in the first gradient segment Tdf[0] is stored in grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3], and grad_buffer[4] in sequence, and the data of each element in the second gradient segment Tds[0] is stored in left_buffer[0], left_buffer[1], left_buffer[2], left_buffer[3], and left_buffer[4] in sequence. The data in the corresponding accumulated gradient segment STd[0] is then grad_buffer[0]+left_buffer[0], grad_buffer[1]+left_buffer[1], grad_buffer[2]+left_buffer[2], grad_buffer[3]+left_buffer[3], and grad_buffer[4]+left_buffer[4].
  • computing thread B can correspondingly obtain STd[1] according to Tdf[1] and Tds[1], and the five accumulated gradients in STd[1] correspond to parameter E22, parameter E10, parameter E11, parameter E12, and parameter E13.
  • calculation thread A can obtain STd[0]
  • calculation thread B can obtain STd[1]
  • the cumulative gradients in STd[0] and the cumulative gradients in STd[1] are arranged in sequence to obtain the above-mentioned cumulative gradient sequence.
  • taking the three-layer neural network described above as an example, grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3], and grad_buffer[4] can be updated according to the five elements in STd[0], and grad_buffer[5], grad_buffer[6], grad_buffer[7], grad_buffer[8], and grad_buffer[9] can be updated according to the five elements in STd[1]. The data in the first storage space then forms the above-mentioned cumulative gradient sequence, which reduces storage consumption by reusing the first storage space.
  • based on the above configuration, the cumulative gradients can be calculated in parallel by segment, improving the calculation speed of the cumulative gradient sequence.
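  • A minimal sketch of this segmented, multi-threaded accumulation (the thread pool, segment size and variable names are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

grad_buffer = np.random.randn(10)   # first gradient sequence (also holds the result)
left_buffer = np.random.randn(10)   # second gradient sequence
SEG = 5                             # segmentation rule: five gradients per segment

def accumulate(start):
    # One worker per (first gradient segment, second gradient segment) pair.
    grad_buffer[start:start + SEG] += left_buffer[start:start + SEG]

with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(accumulate, range(0, len(grad_buffer), SEG)))
# grad_buffer now stores the cumulative gradient sequence formed from STd[0] and STd[1].
```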
  • the importance index sequence may also be calculated with the gradient segment as the granularity, that is, for each of the above-mentioned cumulative gradient segments, the corresponding importance index is calculated, and the importance index sequence is obtained from these calculation results.
  • the corresponding importance index Im[0] can be obtained; for the cumulative gradient segment STd[1], the corresponding importance index Im[1] can be obtained.
  • the obtained importance index can represent the importance of a gradient segment that includes multiple gradients, and in the subsequent synchronization process it can be determined whether the gradient segment is an important gradient segment. The important gradients can thus be determined with the gradient segment as the granularity, completing gradient updates at gradient-segment granularity and realizing sparse gradient updates, which further reduces the amount of communication data and improves the training speed.
  • the above-mentioned importance index sequence includes two importance indexes: Im[0] and Im[1].
  • a statistical value of each cumulative gradient in the cumulative gradient segment may be calculated, and the statistical value may be used as the importance index.
  • the type of the above-mentioned statistical value is not limited in the embodiment of the present disclosure.
  • the above-mentioned statistical value may be a variance, a standard deviation, or a two-norm.
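  • For example, with the 2-norm as the statistical value (a minimal sketch with assumed names; variance or standard deviation would be computed the same way):

```python
import numpy as np

cumulative = np.random.randn(10)                      # cumulative gradient sequence
segments = cumulative.reshape(2, 5)                   # STd[0] and STd[1]
importance_index = np.linalg.norm(segments, axis=1)   # {Im[0], Im[1]} using the 2-norm
# np.var(segments, axis=1) or np.std(segments, axis=1) would be equally valid choices.
```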
  • FIG. 7 shows a schematic flowchart of step S40 of the above-mentioned distributed communication-based neural network training method.
  • as shown in FIG. 7, acquiring the important gradient indication sequence and determining the important gradients in the above-mentioned cumulative gradient sequence according to the important gradient indication sequence includes:
  • each significant gradient indication value in the above-mentioned significant gradient indication sequence also corresponds to an accumulated gradient segment in the accumulated gradient sequence.
  • the above important gradient indication sequence also includes two important gradient indication values Ip[0] and Ip[1].
  • if the above-mentioned important gradient indication value is the first indication value, all the cumulative gradients in the above-mentioned cumulative gradient segment are determined as important gradients, and the above-mentioned cumulative gradient segment is submitted to the communication buffer of the above-mentioned training node; the above-mentioned first indication value indicates that the cumulative gradients in the above-mentioned cumulative gradient segment are all important gradients.
  • 1 may be used as the first indication value, and 0 may be used as the second indication value.
  • if the important gradient indication value Ip[0] is 1, it is determined that each cumulative gradient in the corresponding cumulative gradient segment STd[0] is an important gradient, and the above-mentioned cumulative gradient segment is submitted to the communication buffer of the above-mentioned training node. If the important gradient indication value Ip[1] is 0, the five accumulated gradients in the corresponding accumulated gradient segment STd[1] are all regarded as unimportant gradients.
  • the second gradient sequence may be updated according to the determined unimportant gradient. This step has been described above and will not be repeated here.
  • obtaining the information to be synchronized of the training node according to the important gradient and the sequence of importance indicators including:
  • the cumulative gradient segment STd[0] is submitted to the above-mentioned communication buffer, and the positions of the cumulative gradients in the cumulative gradient segment STd[0] in the above-mentioned cumulative gradient sequence are the first to fifth positions; therefore, the first to fifth positions in the above-mentioned gradient sequence to be synchronized are the five cumulative gradients in the cumulative gradient segment STd[0], and the other positions in the above-mentioned gradient sequence to be synchronized are set to the default value.
  • other positions in the above-mentioned gradient sequence to be synchronized may be set to zero.
  • the above-mentioned gradient sequence to be synchronized includes a total of 10 values, corresponding to parameter E30, parameter E31, parameter E32, parameter E20, parameter E21, parameter E22, parameter E10, parameter E11, parameter E12, and parameter E13, respectively.
  • the above-mentioned importance index sequence {Im[0], Im[1]} may be appended to the above-mentioned gradient sequence to be synchronized to obtain the above-mentioned information to be synchronized.
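  • Putting the last two steps together, a hedged sketch of how the information to be synchronized could be assembled (names, values and the zero default are assumptions consistent with the example above):

```python
import numpy as np

cumulative = np.random.randn(10)                  # cumulative gradient sequence
importance_index = np.array([1.7, 0.4])           # {Im[0], Im[1]} (made-up values)
indication = np.array([1, 0])                     # Ip[0] = 1: segment STd[0] is important

mask = np.repeat(indication.astype(bool), 5)      # expand segment flags to gradient positions
to_sync = np.where(mask, cumulative, 0.0)         # gradient sequence to be synchronized
info_to_sync = np.concatenate([to_sync, importance_index])  # splice the importance indexes on
```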
  • the important gradients can be quickly determined with the gradient segment as the granularity, and the gradient sequence to be synchronized can be obtained.
  • the gradient information to be synchronized corresponding to each parameter can be recorded in the gradient sequence to be synchronized.
  • the splicing result is used as the information to be synchronized, so that the information to be synchronized can include, with a small amount of data, the gradients, the gradient positions, and the importance index sequence used to determine the important gradient positions in the next iteration. This reduces the amount of communication data and also the communication frequency, which can significantly alleviate the communication pressure caused by neural network training in a distributed communication environment.
  • FIG. 8 shows a schematic flowchart of step S60 in the above-mentioned distributed communication-based neural network training method.
  • as shown in FIG. 8, performing synchronization between training nodes based on the above-mentioned information to be synchronized to obtain a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence includes:
  • the gradient sequence to be synchronized of training node 1 is Wt1
  • the gradient sequence to be synchronized of training node 2 is Wt2
  • the gradient sequence to be synchronized of training node 3 is Wt3.
  • the value of Wt1[i]+Wt2[i]+Wt3[i] is the synchronous accumulation gradient corresponding to sequence position i in the above synchronous accumulation gradient sequence.
  • the importance index sequence corresponding to each training node is {Im[0], Im[1]}.
  • the importance index sequences of training node 1, training node 2 and training node 3 are expressed as {Im1[0], Im1[1]}, {Im2[0], Im2[1]} and {Im3[0], Im3[1]}, respectively.
  • AIm[0] = (Im1[0]+Im2[0]+Im3[0])/3
  • AIm[1] = (Im1[1]+Im2[1]+Im3[1])/3.
  • a post-synchronized gradient sequence that can be accurately used to adjust neural network parameters can be calculated, and a post-synchronized important gradient indication sequence that can be used to quickly determine important gradients can be obtained, thereby improving the training speed of the neural network.
  • the above-mentioned calculation of the important gradient indication value corresponding to each average importance index in the above-mentioned average importance index sequence includes:
  • the value at the position pointed to by the threshold index in the sorting result may be used as the threshold of the importance index.
  • the threshold index is 10
  • the value corresponding to the 10th average importance index in the above sorting result is used as the above importance index threshold.
  • the number of segments may be determined according to the preset segmentation rule; a preset compression ratio is obtained; and the product of the compression ratio and the number of segments is used as the threshold indicator.
  • the embodiments of the present disclosure do not limit the specific values of the above-mentioned preset compression ratio and the above-mentioned number of segments. Referring to the foregoing example, the number of segments is only two, while in practical applications there may be far more than two segments; the examples in the embodiments of the present disclosure impose no specific limitation.
  • the above-mentioned first indication value may be used to indicate an important gradient, for example, the first indication value may be 1; the second indication value may be used to indicate an unimportant gradient, for example, the second indication value Can be 0.
  • taking the average importance index sequence {AIm[0], AIm[1]} as an example, the post-synchronized important gradient indication sequence {TIp[0], TIp[1]} can be correspondingly obtained.
  • the post-synchronized important gradient indication sequence {TIp[0], TIp[1]} can be used as the new important gradient indication sequence {Ip[0], Ip[1]}, and is applied to determine the important gradients in the next iteration. How to determine the important gradients based on the important gradient indication sequence {Ip[0], Ip[1]} has been described above and will not be repeated here.
  • the important gradient indication value corresponding to each average importance index can be accurately calculated, and the accuracy of the important gradient judgment can be improved.
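  • A compact sketch of this threshold computation (values and names are made up; treating a value equal to the threshold as important is a tie-breaking assumption here):

```python
import numpy as np

avg_importance = np.array([0.9, 0.2, 1.4, 0.6])    # AIm for four segments (made-up values)
compression_ratio = 0.5                            # preset compression ratio (assumption)
threshold_index = max(1, int(compression_ratio * len(avg_importance)))  # here: 2

sorted_desc = np.sort(avg_importance)[::-1]        # descending sort of average importance
threshold = sorted_desc[threshold_index - 1]       # value pointed to by the threshold index
indication = (avg_importance >= threshold).astype(int)  # 1 = first indication value
# indication is [1, 0, 1, 0]: the two most important segments are marked as important.
```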
  • the writing order of the steps does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • the present disclosure also provides another neural network training method based on distributed communication, which is applied to a training system including a plurality of training nodes, and the method includes:
  • Each training node trains the corresponding neural network, and saves the generated gradient in the first gradient sequence.
  • Each training node performs synchronization between training nodes based on the corresponding information to be synchronized, and obtains a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence.
  • Each training node adjusts the parameters of the neural network according to the post-synchronized gradient sequence, and uses the post-synchronized important gradient indication sequence as the new important gradient indication sequence.
  • FIG. 9 shows a block diagram of a distributed communication-based neural network training apparatus provided according to an embodiment of the present disclosure.
  • the above apparatus is set on a training node and includes:
  • the training module 10 is used to train the neural network corresponding to the above-mentioned training node, and save the generated gradient in the first gradient sequence;
  • the cumulative gradient acquisition module 20 is configured to obtain the cumulative gradient sequence according to the above-mentioned first gradient sequence and the second gradient sequence; the above-mentioned second gradient sequence is used to record the gradients that have not participated in synchronization;
  • the importance index sequence calculation module 30 is configured to calculate and obtain the importance index sequence according to the above-mentioned cumulative gradient sequence
  • the gradient classification module 40 is configured to obtain an important gradient indication sequence, and determine the important gradient in the above-mentioned cumulative gradient sequence according to the above-mentioned important gradient indication sequence;
  • the synchronization module 50 is configured to obtain the to-be-synchronized information of the above-mentioned training nodes according to the above-mentioned important gradients and the above-mentioned importance index sequence, and to perform synchronization between the training nodes based on the above-mentioned to-be-synchronized information to obtain a post-synchronized gradient sequence and a post-synchronized important gradient indication sequence;
  • an update module 60 configured to use the above-mentioned important gradient indication sequence after synchronization as the new above-mentioned important gradient indication sequence
  • the parameter adjustment module 70 is configured to adjust the parameters of the above-mentioned neural network according to the above-mentioned post-synchronized gradient sequence.
  • the above-mentioned cumulative gradient acquisition module includes:
  • a segmentation unit configured to segment the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule, to obtain the first gradient segment sequence and the second gradient segment sequence; wherein, if the position of the first gradient segment in the above-mentioned first gradient segment sequence is the same as the position of the second gradient segment in the above-mentioned second gradient segment sequence, then the above-mentioned first gradient segment corresponds to the above-mentioned second gradient segment, and both correspond to the same neural network parameter;
  • a multi-thread processing unit configured to set up multiple parallel computing threads, each of the computing threads obtaining at least one first gradient segment and a second gradient segment corresponding to the first gradient segment;
  • an accumulating unit used for each of the above-mentioned computing threads to accumulate the above-mentioned first gradient section and the corresponding second gradient section for each obtained first gradient section to obtain a corresponding cumulative gradient section
  • the cumulative gradient sequence obtaining unit is configured to obtain the above cumulative gradient sequence according to the cumulative gradient segments obtained by each of the above calculation threads.
  • the above-mentioned importance index sequence calculation module includes: an importance index calculation unit, used for each of the above calculation threads to calculate the corresponding importance index according to the obtained cumulative gradient segment; the importance index sequence The obtaining unit is configured to obtain the sequence of importance indexes according to the calculation results of the importance indexes of each of the above calculation threads.
  • the above-mentioned gradient classification module is configured to, for each cumulative gradient segment calculated by each of the above-mentioned computing threads, extract a corresponding important gradient indication value from the above-mentioned important gradient indication sequence; if the above-mentioned important gradient indication value is the first indication value, the cumulative gradients in the above-mentioned cumulative gradient segment are all determined as important gradients, and the above-mentioned cumulative gradient segment is submitted to the communication buffer of the above-mentioned training node; the above-mentioned first indication value indicates that the cumulative gradients in the above-mentioned cumulative gradient segment are all important gradients.
  • the above synchronization module includes:
  • the to-be-synchronized gradient sequence acquisition unit is configured to obtain the to-be-synchronized gradient sequence according to the accumulated gradient segments in the communication buffer; wherein the position of each accumulated gradient of each accumulated gradient segment in the communication buffer in the to-be-synchronized gradient sequence is the same as its position in the accumulated gradient sequence, and the other positions in the to-be-synchronized gradient sequence are set to the preset gradient value;
  • the splicing unit is used for splicing the above-mentioned gradient sequence to be synchronized and the above-mentioned importance index sequence to obtain the above-mentioned information to be synchronized.
  • the above synchronization module further includes: a synchronous cumulative gradient sequence acquisition unit, configured to add element by element the to-be-synchronized gradient sequences in the to-be-synchronized information of each of the above training nodes to obtain a synchronous cumulative gradient sequence;
  • a post-synchronized gradient sequence acquisition unit configured to divide each synchronously accumulated gradient in the above-mentioned synchronously-accumulated gradient sequence by the total number of the above-mentioned training nodes to obtain the above-mentioned post-synchronized gradient sequence;
  • the accumulative importance index sequence acquisition unit is used to add the importance index sequences in the information to be synchronized of each of the above training nodes element by element to obtain the accumulated importance index sequence;
  • an average importance index sequence obtaining unit configured to divide each cumulative importance index in the above-mentioned cumulative importance index sequence by the total number of the above-mentioned training nodes to obtain an average importance index sequence
  • the post-synchronized important gradient indication sequence calculation unit is used to calculate the important gradient indication value corresponding to each average importance index in the above-mentioned average importance index sequence, and obtain the post-synchronized important gradient indication sequence.
  • the above-mentioned post-synchronized important gradient indication sequence calculation unit includes: a sorting unit, configured to sort the average importance indexes in the above-mentioned average importance index sequence in descending order to obtain a sorting result; and a threshold index obtaining unit, configured to obtain a threshold index and determine the importance index threshold in the above-mentioned sorting result according to the threshold index. If the above-mentioned average importance index is greater than the above-mentioned importance index threshold, the corresponding important gradient indication value is set to the first indication value; otherwise, the corresponding important gradient indication value is set to the second indication value; wherein the above-mentioned first indication value is used to indicate an important gradient, and the above-mentioned second indication value is used to indicate an unimportant gradient.
  • the above-mentioned threshold index obtaining unit is configured to determine the number of segments according to the above-mentioned preset segmentation rule, obtain a preset compression ratio, and use the product of the above-mentioned compression ratio and the above-mentioned number of segments as the above-mentioned threshold index.
  • the above-mentioned parameter adjustment module is used to sequentially extract the gradients in the above-mentioned synchronized gradient sequence; if the above-mentioned gradient is not equal to the above-mentioned preset gradient value, the corresponding neural network parameter is adjusted according to the above-mentioned gradient; if the above-mentioned gradient is equal to the above-mentioned preset gradient value, the next gradient is extracted.
  • the above-mentioned updating module is further configured to determine the insignificant gradients in the above-mentioned cumulative gradient sequence according to the above-mentioned important gradient indication sequence and update the above-mentioned second gradient sequence according to the above-mentioned insignificant gradients; the above-mentioned apparatus further includes an iteration control module, which is used to iteratively perform distributed communication-based neural network training until a preset training stop condition is reached.
  • the functions or modules included in the apparatuses provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, where at least one instruction or at least one segment of program is stored, and the above-mentioned method is implemented when the at least one instruction or at least one segment of program is loaded and executed by a processor.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure further provides a training device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to perform the above method.
  • the training device may be provided as a terminal, server or other form of device.
  • FIG. 10 shows a block diagram of a training device according to an embodiment of the present disclosure.
  • the training device 800 may be a terminal such as a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, or personal digital assistant.
  • the training device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • Processing component 802 generally controls the overall operation of training device 800, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to perform all or some of the steps of the methods described above. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.
  • Memory 804 is configured to store various types of data to support operation at training device 800 . Examples of such data include instructions for any application or method operating on the training device 800, contact data, phonebook data, messages, pictures, videos, and the like. Memory 804 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • Power supply component 806 provides power to various components of training device 800 .
  • Power supply components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to training device 800 .
  • the multimedia component 808 includes a screen that provides an output interface between the training device 800 described above and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The above-mentioned touch sensor may not only sense the boundary of the touch or swipe action, but also detect the duration and pressure associated with the above-mentioned touch or swipe action.
  • the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. When the training device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front and rear cameras can be a fixed optical lens system or have focal length and optical zoom capability.
  • Audio component 810 is configured to output and/or input audio signals.
  • audio component 810 includes a microphone (MIC) that is configured to receive external audio signals when training device 800 is in operating modes, such as call mode, recording mode, and speech recognition mode.
  • the received audio signal may be further stored in memory 804 or transmitted via the communication component 816.
  • audio component 810 also includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to: home button, volume buttons, start button, and lock button.
  • Sensor assembly 814 includes one or more sensors for providing state assessments of various aspects of training device 800 .
  • the sensor component 814 can detect the on/off state of the training device 800 and the relative positioning of components, such as the display and keypad of the training device 800; the sensor component 814 can also detect a change in the position of the training device 800 or of a component of the training device 800, the presence or absence of user contact with the training device 800, the orientation or acceleration/deceleration of the training device 800, and changes in the temperature of the training device 800.
  • Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 816 is configured to facilitate wired or wireless communication between training device 800 and other devices.
  • the training device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G, or a combination thereof.
  • the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 described above also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • training device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
  • a non-volatile computer-readable storage medium is also provided, such as memory 804 comprising computer program instructions executable by processor 820 of training device 800 to perform the above-described method.
  • training device 1900 may be provided as a server.
  • training device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource, represented by memory 1932, for storing instructions executable by processing component 1922, such as an application program.
  • An application program stored in memory 1932 may include one or more modules, each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the above-described methods.
  • the training device 1900 may also include a power supply component 1926 configured to perform power management of the training device 1900, a wired or wireless network interface 1950 configured to connect the training device 1900 to a network, and an input/output (I/O) interface 1958.
  • Training device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
  • a non-volatile computer-readable storage medium is also provided, such as memory 1932 comprising computer program instructions executable by processing component 1922 of training device 1900 to perform the method described above.
  • An embodiment of the present disclosure further provides a training system comprising a plurality of the above-mentioned training devices.
  • the present disclosure may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present disclosure.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves on which instructions are stored, and any suitable combination of the above.
  • Computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through electrical wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.
  • Computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., connected through the Internet using an Internet service provider).
  • custom electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized by utilizing state information of the computer readable program instructions, and these electronic circuits can execute the computer readable program instructions to implement various aspects of the present disclosure.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowchart and/or block diagrams.
  • These computer readable program instructions can also be stored in a computer readable storage medium; these instructions cause a computer, programmable data processing apparatus and/or other equipment to operate in a specific manner, so that the computer readable medium on which the instructions are stored comprises an article of manufacture including instructions which implement various aspects of the functions/acts specified in one or more blocks of the flowchart and/or block diagrams.
  • Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment so as to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions, or can be implemented in a combination of dedicated hardware and computer instructions.

Abstract

The present disclosure relates to a distributed communication-based neural network training method and apparatus, and a storage medium. The method comprises: training a neural network, and saving a generated gradient in a first gradient sequence; obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, the second gradient sequence being used for recording a gradient that has not participated in synchronization; calculating an importance index sequence according to the cumulative gradient sequence; obtaining an important gradient indication sequence, and determining an important gradient in the cumulative gradient sequence according to the important gradient indication sequence; obtaining information to be synchronized of training nodes according to the important gradient and the importance index sequence; performing synchronization among the training nodes on the basis of said information, so as to obtain a synchronized gradient sequence and a synchronized important gradient indication sequence; and adjusting parameters of the neural network according to the synchronized gradient sequence. The present disclosure can reduce the frequency and data volume of node communication, reduce communication overhead, and improve the training speed of the neural network.

Description

Distributed communication-based neural network training method, device and storage medium
This application claims priority to the Chinese patent application with application number 202110221266.9, entitled "Distributed Communication-Based Neural Network Training Method, Device and Storage Medium", filed with the China Patent Office on February 27, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of machine learning, and in particular to a distributed communication-based neural network training method, device and storage medium.
Background
With the development of information technology and the rise of artificial intelligence, neural networks are used ever more widely in daily life, and their variety and complexity keep growing. Traditional single-machine training may require tens of thousands of iterations over several months to converge, and the computing power of a single machine can no longer match the computational demands of neural network training. Distributed training methods can increase computing power by assigning training tasks to multiple nodes in parallel, but training can only be completed through communication among the nodes; the large volume and high frequency of the communication data of each node in turn cause high bandwidth consumption and long communication delays, so that inter-node communication has become a bottleneck for speeding up neural network training.
Summary of the Invention
Embodiments of the present application provide a distributed communication-based neural network training method, device and storage medium, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present application provides a distributed communication-based neural network training method, applied to a training node, the method including:
training the neural network corresponding to the training node, and saving the generated gradients in a first gradient sequence;
obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, where the second gradient sequence is used to record gradients that have not yet participated in synchronization;
calculating an importance index sequence according to the cumulative gradient sequence;
obtaining an important gradient indication sequence, and determining the important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
obtaining information to be synchronized of the training node according to the important gradients and the importance index sequence;
performing synchronization between training nodes based on the information to be synchronized, so as to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence;
adjusting the parameters of the neural network according to the post-synchronization gradient sequence, and using the post-synchronization important gradient indication sequence as the new important gradient indication sequence.
In a second aspect, an embodiment of the present application provides a distributed communication-based neural network training device, the device being arranged on a training node and including:
a training module, configured to train the neural network corresponding to the training node and save the generated gradients in a first gradient sequence;
a cumulative gradient acquisition module, configured to obtain a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, where the second gradient sequence is used to record gradients that have not yet participated in synchronization;
an importance index sequence calculation module, configured to calculate an importance index sequence according to the cumulative gradient sequence;
a gradient classification module, configured to obtain an important gradient indication sequence and determine the important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
a synchronization module, configured to obtain information to be synchronized of the training node according to the important gradients and the importance index sequence, and to perform synchronization between training nodes based on the information to be synchronized, so as to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence;
an update module, configured to use the post-synchronization important gradient indication sequence as the new important gradient indication sequence;
a parameter adjustment module, configured to adjust the parameters of the neural network according to the post-synchronization gradient sequence.
In a third aspect, an embodiment of the present application provides a distributed communication-based neural network training method, applied to a training system including a plurality of training nodes, the method including:
each training node training its corresponding neural network and saving the generated gradients in a first gradient sequence;
obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, where the second gradient sequence is used to record the gradients in the training node that have not yet participated in synchronization;
calculating an importance index sequence according to the cumulative gradient sequence;
obtaining the important gradient indication sequence in the training node, and determining the important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
obtaining information to be synchronized of the training node according to the important gradients and the importance index sequence;
each training node performing synchronization between the training nodes based on its corresponding information to be synchronized, so as to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence;
each training node adjusting the parameters of its neural network according to the post-synchronization gradient sequence, and using the post-synchronization important gradient indication sequence as the new important gradient indication sequence.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing at least one instruction or at least one segment of program, the at least one instruction or at least one segment of program being loaded and executed by a processor to implement the distributed communication-based neural network training method according to any one of the first aspect.
In a fifth aspect, an embodiment of the present application provides a training device including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements the distributed communication-based neural network training method according to any one of the first aspect by executing the instructions stored in the memory.
In a sixth aspect, an embodiment of the present application provides a training system including a plurality of the training devices according to the fifth aspect.
In a seventh aspect, an embodiment of the present application provides a computer program product, which may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement the various aspects.
It can be seen that, in the embodiments of the present application, the important gradients and the related information used to determine the important gradients can be synchronized in a single communication round, and the synchronization result ensures that the important gradients obtained by each training node all correspond to the same neural network parameters, so that no additional inter-node communication about the positions of the important gradients is needed, which lowers the frequency of node communication and increases the training speed.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 shows a schematic diagram of neural network training based on distributed communication in the related art according to an embodiment of the present disclosure;
FIG. 2 shows a flowchart of a distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of one feasible structure of a deep neural network according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of the relationship between the first storage space, the second storage space and the neural network structure according to an embodiment of the present disclosure;
FIG. 5 shows a schematic flowchart of multiple nodes executing the above distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 6 shows a schematic flowchart of step S20 in the above distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 7 shows a schematic flowchart of step S40 of the above distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 8 shows a schematic flowchart of step S60 in the above distributed communication-based neural network training method according to an embodiment of the present disclosure;
FIG. 9 shows a block diagram of a distributed communication-based neural network training device according to an embodiment of the present disclosure;
FIG. 10 shows a block diagram of a training device according to an embodiment of the present disclosure;
FIG. 11 shows a block diagram of another training device according to an embodiment of the present disclosure.
Detailed Description of Embodiments
The technical solutions in the embodiments of this specification will be described clearly and completely below with reference to the drawings in the embodiments of this specification. Obviously, the described embodiments are only some of the embodiments of this specification rather than all of them. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, claims and drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or server comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product or device.
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or superior to other embodiments.
The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist at the same time, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B and C may mean including any one or more elements selected from the set consisting of A, B and C.
In addition, numerous specific details are given in the following detailed description in order to better illustrate the present disclosure. Those skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some instances, methods, means, elements and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
In practical applications of the related art, in order to increase the training speed of a neural network, the training tasks of the neural network can be assigned to multiple training nodes in parallel, and the training nodes can cooperatively train the neural network based on distributed communication, thereby increasing the training speed of the neural network and shortening the time it takes for the neural network to converge. Please refer to FIG. 1, which shows a schematic diagram of neural network training based on distributed communication in the related art. Each GPU (Graphics Processing Unit) in FIG. 1 is a training node; there are m training nodes in FIG. 1, and these m training nodes can each independently train a neural network with the same structure.
Taking the m training nodes in FIG. 1 cooperatively training a neural network with n layers as an example, in the process of each training node training the neural network, the sample data, after being input into the neural network, triggers a forward propagation process and a back-propagation process in turn. In the forward propagation process, the sample data is input into layer 1, which triggers layer 1, layer 2 and so on until layer n outputs computation results in sequence. In the back-propagation process, layer n, layer n-1 and so on down to layer 1 are triggered to generate gradients in sequence.
The above m training nodes communicate according to the generated gradients and adjust the parameters of their corresponding neural networks according to the communication results. Through distributed training, each training node can adjust the parameters of its own neural network with reference to the gradients generated by the other training nodes, thereby increasing the training speed. However, in the related art the data that needs to be communicated between nodes is large in volume and the communication frequency is high, so the communication process itself consumes considerable resources, and each training node may be forced to remain in a waiting state frequently because of communication delays, which reduces the training speed. In this case the communication process of the distributed nodes becomes the bottleneck restricting training efficiency.
In order to reduce the amount of communication data and the communication frequency in the distributed training process, relieve communication pressure and increase the speed of distributed training of neural networks, embodiments of the present disclosure provide a distributed communication-based neural network training method.
The distributed communication-based neural network training method provided by the embodiments of the present disclosure can be applied to any data processing device with a graphics processing unit (GPU). The data processing device may be a terminal, including a personal computer (PC), a minicomputer, a midrange computer, a mainframe, a workstation and the like; of course, the data processing device may also be a server. It should be noted that, when used to train a neural network, the data processing devices may be independent or may exist in the form of a cluster.
The distributed communication-based neural network training method provided by the embodiments of the present disclosure may be stored in a data processing device in the form of a computer program, and the data processing device implements the distributed communication-based neural network training method of the embodiments of the present disclosure by running the computer program. The computer program may be an independent computer program, or may be a functional module, a plug-in, an applet or the like integrated into another computer program.
The distributed communication-based neural network training method of the embodiments of the present disclosure is described below by taking a data processing device serving as a training node as the execution subject. FIG. 2 shows a flowchart of a distributed communication-based neural network training method according to an embodiment of the present disclosure. As shown in FIG. 2, the method includes:
S10: Train the neural network corresponding to the training node, and save the generated gradients in a first gradient sequence.
The embodiments of the present disclosure do not limit the structure of the neural network; the neural network may be at least one of a deep neural network, a convolutional neural network and a recurrent neural network. Taking a deep neural network as an example, please refer to FIG. 3, which shows a schematic diagram of one feasible structure of a deep neural network. The deep neural network may include convolutional layers, pooling layers and a fully connected layer, with one of the convolutional layers serving as the input layer of the neural network and the fully connected layer serving as the output layer of the deep neural network; convolutional layers and pooling layers may be arranged alternately between the input layer and the output layer.
In a feasible implementation, a first storage space (grad_buffer) and a second storage space (left_buffer) may be allocated on the training node. In one embodiment, both the first storage space (grad_buffer) and the second storage space (left_buffer) are contiguous storage spaces.
Please refer to FIG. 4, which shows a schematic diagram of the relationship between the first storage space, the second storage space and the neural network structure. In the embodiments of the present disclosure the neural network may have multiple layers, each layer may include multiple parameters, and each parameter of each layer corresponds to a segment of the first storage space. The first storage space and the second storage space may have the same structure, and both may be the same size as the storage space occupied by the parameters of the neural network. The first storage space (grad_buffer) may be used to store the gradients generated by the training node during training (the first gradient sequence), and the second storage space (left_buffer) may be used to store the gradients that have not yet participated in synchronization.
Take a neural network with three layers as an example:
layer 1 includes parameter E10, parameter E11, parameter E12 and parameter E13;
layer 2 includes parameter E20, parameter E21 and parameter E22;
layer 3 includes parameter E30, parameter E31 and parameter E32.
Layers 1-3 include 10 parameters in total, so the first storage space and the second storage space each include 10 storage intervals. During training, each layer generates the gradients corresponding to its parameters in reverse order. Specifically:
first, layer 3 generates gradient T30, gradient T31 and gradient T32 corresponding to parameter E30, parameter E31 and parameter E32, respectively;
then, layer 2 generates gradient T20, gradient T21 and gradient T22 corresponding to parameter E20, parameter E21 and parameter E22, respectively;
next, layer 1 generates gradient T10, gradient T11, gradient T12 and gradient T13 corresponding to parameter E10, parameter E11, parameter E12 and parameter E13, respectively.
These 10 gradients can be stored in the first storage space in the order in which they are generated, that is, the data in the first storage space (grad_buffer) may be {T30, T31, T32, T20, T21, T22, T10, T11, T12, T13}. The data {T30, T31, T32, T20, T21, T22, T10, T11, T12, T13} in the first storage space (grad_buffer) is the first gradient sequence.
In one embodiment, the training node may take out a mini-batch of samples each time and perform training on the mini-batch, adjust the parameters of the corresponding neural network during the training process, and obtain, from the training result, the gradient corresponding to each parameter generated by this batch of training, storing these gradients in the first gradient sequence. By training in batches, the first gradient sequence can be recorded after each batch of training and the subsequent synchronization between training nodes can then be performed, while no synchronization between training nodes needs to take place during the training of each batch; training in batches thus reduces the communication frequency and increases the training speed.
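To make the buffer layout above more concrete, the following is a minimal sketch assuming a PyTorch-style model; the names grad_buffer and left_buffer follow the text, while the flattening order and the helper names are illustrative assumptions rather than part of the disclosed method.

```python
import torch

def allocate_buffers(model):
    # first storage space (grad_buffer) and second storage space (left_buffer),
    # both contiguous and the same size as the parameters of the neural network
    total = sum(p.numel() for p in model.parameters())
    grad_buffer = torch.zeros(total)
    left_buffer = torch.zeros(total)
    return grad_buffer, left_buffer

def fill_grad_buffer(model, grad_buffer):
    # copy the gradients produced by one mini-batch into the first gradient
    # sequence; here a fixed flattening order stands in for the
    # generation-order layout {T30, ..., T13} described above
    offset = 0
    for p in model.parameters():
        n = p.numel()
        grad_buffer[offset:offset + n] = p.grad.detach().reshape(-1)
        offset += n
```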
S20: Obtain a cumulative gradient sequence according to the first gradient sequence and the second gradient sequence; the second gradient sequence is used to record the gradients that have not yet participated in synchronization.
In the embodiments of the present disclosure the second gradient sequence may be stored in the second storage space (left_buffer). The second storage space has the same structure as the first storage space, and storage intervals at the same position correspond to the same neural network parameter. Taking the above neural network with three layers as an example, the 10 consecutive storage intervals in the first storage space and the 10 consecutive storage intervals in the second storage space each correspond, in order, to parameter E30, parameter E31, parameter E32, parameter E20, parameter E21, parameter E22, parameter E10, parameter E11, parameter E12 and parameter E13. The data in the storage intervals at the same position in the first storage space and the second storage space can be added to obtain the cumulative gradient sequence, and the cumulative gradient sequence can be stored in the first storage space.
Exemplarily, taking the above neural network with three layers as an example, for any storage interval grad_buffer[i] of the first storage space and the corresponding storage interval left_buffer[i] of the second storage space, where i takes the values 0-9, grad_buffer[i]+left_buffer[i] is assigned to grad_buffer[i]; the data in the first storage space after this assignment is the cumulative gradient sequence, that is, the cumulative gradient sequence and the first gradient sequence reuse the first storage space. In practical scenarios of distributed-communication-based training of neural networks, the number of training nodes participating in cooperative training is large, and since the neural network can be very complex, the gradients generated usually also occupy a large amount of storage; reusing the first storage space in each training node can therefore greatly save storage resources.
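A sketch of the accumulation step, reusing grad_buffer in place as described; it is an element-wise addition over the two buffers from the previous sketch.

```python
def accumulate(grad_buffer, left_buffer):
    # grad_buffer[i] = grad_buffer[i] + left_buffer[i]; the result is the
    # cumulative gradient sequence, stored back into the first storage space
    grad_buffer += left_buffer
    return grad_buffer
```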
S30: Calculate an importance index sequence according to the cumulative gradient sequence.
Specifically, an importance index can be calculated for each cumulative gradient in the cumulative gradient sequence; the embodiments of the present disclosure do not limit the method for determining the importance index. Taking the above neural network with three layers as an example, an importance index sequence Im can be obtained, in which each importance index Im[i] can represent the importance of the gradient generated for the neural network parameter corresponding to grad_buffer[i] obtained above.
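The disclosure leaves the importance metric open; purely as one hedged example, the sketch below uses the absolute value of each cumulative gradient as its importance index Im[i].

```python
def importance_indices(cumulative):
    # Im[i] grows with the magnitude of the accumulated gradient; any other
    # monotone importance measure could be substituted here
    return cumulative.abs()
```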
S40: Acquire the important gradient indication sequence, and determine the important gradients in the cumulative gradient sequence according to the important gradient indication sequence.
In the embodiments of the present disclosure, each important gradient indication value in the important gradient indication sequence can indicate whether the cumulative gradient at the corresponding position is important. Taking the above neural network with three layers as an example, an important gradient indication sequence Ip can be obtained, and each important gradient indication value Ip[i] can indicate whether the data recorded in grad_buffer[i] obtained above is an important gradient. Exemplarily, if Ip[i] is a first indication value, the data recorded in grad_buffer[i] is determined to be an important gradient; if Ip[i] is a second indication value, the data recorded in grad_buffer[i] is determined to be a non-important gradient. In an exemplary implementation, the important gradient indication sequence Ip may be obtained during the previous inter-node synchronization.
Exemplarily, if the values of Ip[1], Ip[3] and Ip[5] in the important gradient indication sequence Ip indicate important gradients, then grad_buffer[1], grad_buffer[3] and grad_buffer[5] are the important gradients, and the other gradients in grad_buffer are non-important gradients.
In an exemplary implementation, after the non-important gradients in the cumulative gradient sequence are determined, the second gradient sequence may also be updated according to the non-important gradients.
In the embodiments of the present disclosure the non-important gradients do not participate in this round of inter-node synchronization; therefore, the non-important gradients can be written back to the second gradient sequence, that is, stored at the corresponding positions of the second storage space (left_buffer). Updating the second gradient sequence according to the non-important gradients includes determining the positions of the non-important gradients in the second gradient sequence, updating the data of the second gradient sequence at those positions to the non-important gradients, and assigning 0 to the data at the other positions.
Taking the above neural network with three layers as an example, if grad_buffer[0], grad_buffer[2], grad_buffer[4], grad_buffer[6], grad_buffer[7], grad_buffer[8] and grad_buffer[9] are all non-important gradients, then grad_buffer[0], grad_buffer[2], grad_buffer[4], grad_buffer[6], grad_buffer[7], grad_buffer[8] and grad_buffer[9] are correspondingly assigned to left_buffer[0], left_buffer[2], left_buffer[4], left_buffer[6], left_buffer[7], left_buffer[8] and left_buffer[9], while left_buffer[1], left_buffer[3] and left_buffer[5] can all be assigned 0.
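The following sketch splits the cumulative gradient sequence using the important gradient indication sequence and writes the non-important gradients back into left_buffer; representing Ip as a boolean mask is an assumption made only for illustration.

```python
def split_and_carry_over(cumulative, Ip, left_buffer):
    # non-important positions keep their accumulated value for the next round,
    # important positions are cleared because they are about to be synchronized
    left_buffer.copy_(cumulative)
    left_buffer[Ip] = 0.0
    important = cumulative[Ip]        # the important gradients to be synchronized
    return important, left_buffer
```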
S50: Obtain the to-be-synchronized information of the training node according to the important gradients and the importance index sequence.
Specifically, the important gradients and the importance index sequence may be spliced together, and the splicing result is used as the to-be-synchronized information. The embodiments of the present disclosure do not limit the splicing method.
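One possible splicing of the important gradients and the importance index sequence into a single payload, so that both are exchanged in one communication round; simple concatenation is only an example, since the disclosure does not fix the splicing method.

```python
def build_payload(important, importance):
    # because every node uses the same indication sequence, the payloads have
    # identical layouts and can be reduced element-wise across nodes
    return torch.cat([important, importance])
```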
S60: Perform synchronization between the training nodes based on the to-be-synchronized information, so as to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence.
Exemplarily, AllReduce can be used for communication; AllReduce is a general term for a class of communication methods that can efficiently carry out communication between distributed training nodes.
In the embodiments of the present disclosure, since each training node obtains the post-synchronization important gradient indication sequence at every synchronization, and this post-synchronization important gradient indication sequence is used as the important gradient indication sequence applied in the next round of training, the important gradient indication sequences used by the training nodes during training are identical, that is, the positions of the important gradients are identical.
Taking training node A as an example, if training node A determines in step S40 that grad_buffer[1], grad_buffer[3] and grad_buffer[5] are the important gradients, the other training nodes will also determine that their corresponding grad_buffer[1], grad_buffer[3] and grad_buffer[5] are important gradients. The post-synchronization gradient sequence can therefore be computed directly from the important gradients of each node, without additionally considering the positions of the gradients being synchronized, and the training nodes no longer need extra communication about the position information of the important gradients, which lowers the communication frequency.
Exemplarily, for the post-synchronization gradient sequence Td, Td[1], Td[3] and Td[5] can be obtained by averaging the important gradients grad_buffer[1], grad_buffer[3] and grad_buffer[5] of the nodes, while the values at the other positions in Td can be assigned a preset gradient value.
In one embodiment, the post-synchronization importance index sequence can be obtained by taking the element-wise mean of the importance index sequences of the nodes, and the post-synchronization important gradient indication sequence is obtained accordingly from the post-synchronization importance index sequence.
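A hedged sketch of the synchronization step using a PyTorch-style all-reduce (torch.distributed); averaging the payload element-wise and deriving the new indication sequence by a top-k rule over the averaged importance indices are illustrative assumptions, since the disclosure only requires that every node derives the same sequence from the same synchronized data.

```python
import torch
import torch.distributed as dist

def synchronize(payload, n_important, total, Ip, k, preset=0.0):
    dist.all_reduce(payload, op=dist.ReduceOp.SUM)
    payload /= dist.get_world_size()              # element-wise mean over the nodes

    avg_important = payload[:n_important]         # averaged important gradients
    avg_importance = payload[n_important:]        # averaged importance index sequence

    synced = torch.full((total,), preset)         # post-synchronization gradient sequence Td
    synced[Ip] = avg_important                    # other positions keep the preset gradient value

    new_Ip = torch.zeros(total, dtype=torch.bool) # post-synchronization indication sequence
    new_Ip[torch.topk(avg_importance, k).indices] = True
    return synced, new_Ip
```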
S70: Adjust the parameters of the neural network according to the post-synchronization gradient sequence, and use the post-synchronization important gradient indication sequence as the new important gradient indication sequence.
Adjusting the parameters of the neural network according to the post-synchronization gradient sequence includes:
S71: Extract the gradients in the post-synchronization gradient sequence in sequence.
S72: If the gradient is not equal to the preset gradient value, adjust the corresponding neural network parameter according to the gradient.
S73: If the gradient is equal to the preset gradient value, extract the next gradient.
In a specific implementation, the preset gradient value is used to indicate the case where no parameter adjustment according to the gradient is needed; in a feasible implementation, the preset gradient value may be 0.
In the embodiments of the present disclosure, the elements of the post-synchronization gradient sequence may be stored at the corresponding positions of the first storage space, reusing the first storage space (grad_buffer) once again and reducing storage consumption. Taking the neural network with three layers described above as an example, grad_buffer[i] is read in sequence; if grad_buffer[i] is not the preset gradient value, the parameter of the neural network corresponding to grad_buffer[i] is adjusted; if grad_buffer[i] is the preset gradient value, i is incremented by 1 and grad_buffer[i] continues to be read. Obviously, if the currently extracted gradient is the last gradient in the post-synchronization gradient sequence, "extracting the next gradient" will fail; this situation indicates that this round of adjustment of the parameters of the neural network has been completed, after which the first storage space can be cleared so that, when steps S10-S70 are executed iteratively, it can be reused to record the first gradient sequence in step S10.
In the embodiments of the present disclosure, non-important gradients are indicated by setting the preset gradient value, and the parameter values corresponding to non-important gradients do not need to be adjusted, thereby avoiding excessive adjustment of the parameters of the neural network and further improving the convergence speed of the neural network.
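A sketch of the parameter adjustment in steps S71-S73, skipping every position whose synchronized gradient equals the preset gradient value (0 here); the plain SGD-style update rule and the learning rate are assumptions, since the disclosure does not prescribe how a parameter is adjusted from its gradient.

```python
import torch

def apply_update(model, synced, lr=0.01, preset=0.0):
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            g = synced[offset:offset + n].reshape(p.shape)
            mask = g != preset        # S72/S73: only adjust where a gradient was synchronized
            p -= lr * g * mask        # positions equal to the preset value contribute nothing
            offset += n
```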
Steps S10-S70 of the embodiments of the present disclosure show the neural network training method for a single synchronization round. In this single-round method, both the second gradient sequence and the important gradient indication sequence are updated, and based on the updated results the single-round method can be executed again; that is, steps S10-S70 can be executed iteratively until a preset training stop condition is reached, at which point the trained neural network is obtained. The embodiments of the present disclosure do not limit the training stop condition; the training stop condition may be that the number of iterations reaches a preset iteration threshold, or that the loss produced by the neural network is smaller than a preset loss threshold.
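The sketches above can be tied together into one per-node iteration loop; the initial all-important indication sequence, the compute_loss argument and the iteration-count stop condition are assumptions used only to make the driver readable in outline.

```python
def train(model, data_loader, compute_loss, max_iters, k):
    grad_buffer, left_buffer = allocate_buffers(model)
    total = grad_buffer.numel()
    Ip = torch.ones(total, dtype=torch.bool)      # assumption: treat every gradient as important at first
    for step, (x, y) in enumerate(data_loader):
        model.zero_grad()
        compute_loss(model, x, y).backward()      # S10: back-propagation produces the gradients
        fill_grad_buffer(model, grad_buffer)
        accumulate(grad_buffer, left_buffer)                                          # S20
        importance = importance_indices(grad_buffer)                                  # S30
        important, left_buffer = split_and_carry_over(grad_buffer, Ip, left_buffer)   # S40
        payload = build_payload(important, importance)                                # S50
        synced, Ip = synchronize(payload, important.numel(), total, Ip, k)            # S60
        apply_update(model, synced)                                                   # S70
        grad_buffer.zero_()                       # clear the first storage space for the next iteration
        if step + 1 >= max_iters:                 # example of a preset training stop condition
            break
```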
Please refer to FIG. 5, which shows a schematic flowchart of multiple nodes executing the above distributed communication-based neural network training method. Each training node in FIG. 5 can independently train its own neural network, update its local second gradient sequence during training, and obtain information to be synchronized, where the information to be synchronized includes the important gradients and the importance index sequence. The training nodes communicate based on the information to be synchronized to obtain the synchronized gradient sequence and the synchronized important gradient indication sequence. Each node adjusts its parameters based on the synchronized gradient sequence and determines the important gradients of the next iteration based on the synchronized important gradient indication sequence. Each training node continuously adjusts its own parameters through iteration and cooperates with the other nodes through inter-node communication in each iteration to complete the training of the neural network.
In a practical scenario of distributed neural network training, the number of participating training nodes is large, and since the neural network can be highly complex, the amount of gradient data produced is also large; correspondingly, the amount of gradient position information used to record the gradient positions is also large. With such large data volumes, the training nodes need to synchronize not only the gradients but also the gradient position information, which creates considerable communication pressure and consumes a large amount of communication resources. Limited by the available communication resources, many nodes may frequently be forced to wait for synchronization, which reduces the training speed. Based on the above configuration, the distributed communication-based neural network training method provided by the embodiment of the present disclosure can synchronize the important gradients and the important gradient position information in a single communication, and the synchronization result can ensure that the important gradients obtained by each training node correspond to the same neural network parameters, without any additional inter-node communication for the gradient position information. This significantly reduces the communication frequency, reduces the consumption of communication resources, shortens the time training nodes spend waiting for synchronization, and improves the training speed; the speed-up is particularly evident in scenarios with a large number of training nodes and a highly complex neural network.
In some feasible implementations, the speed at which the training node executes the above steps can be further improved based on multi-thread concurrency. The embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Please refer to FIG. 6, which shows a schematic flowchart of step S20 in the above distributed communication-based neural network training method, that is, of obtaining the cumulative gradient sequence according to the first gradient sequence and the second gradient sequence, including:
S21: Segment the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule to obtain a first gradient segment sequence and a second gradient segment sequence; wherein, if the position of a first gradient segment in the first gradient segment sequence is the same as the position of a second gradient segment in the second gradient segment sequence, the first gradient segment corresponds to the second gradient segment, and both correspond to the same neural network parameters.
The embodiments of the present disclosure do not limit the specific segmentation rule, as long as the first gradient sequence can be divided into a plurality of first gradient segments to form the first gradient segment sequence. The first gradient sequence and the second gradient sequence have the same structure, so segmenting the second gradient sequence based on the same segmentation rule yields the second gradient segment sequence; and if the position of a first gradient segment in the first gradient segment sequence is the same as the position of a second gradient segment in the second gradient segment sequence, the first gradient segment corresponds to the second gradient segment, and both correspond to the same neural network parameters.
Exemplarily, taking the three-layer neural network described above as an example, the first gradient sequence obtained in step S10 is stored in the first storage space (grad_buffer). The data in grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3] and grad_buffer[4] form one first gradient segment Tdf[0], and the data in grad_buffer[5], grad_buffer[6], grad_buffer[7], grad_buffer[8] and grad_buffer[9] form another first gradient segment Tdf[1], yielding the first gradient segment sequence {Tdf[0], Tdf[1]}. Correspondingly, the second gradient sequence is stored in the second storage space (left_buffer): the data in left_buffer[0], left_buffer[1], left_buffer[2], left_buffer[3] and left_buffer[4] form one second gradient segment Tds[0], and the data in left_buffer[5], left_buffer[6], left_buffer[7], left_buffer[8] and left_buffer[9] form another second gradient segment Tds[1], yielding the second gradient segment sequence {Tds[0], Tds[1]}.
Taking Tdf[0] as an example, the data in Tdf[0] correspond in order to parameter E30, parameter E31, parameter E32, parameter E20 and parameter E21. Since Tdf[0] corresponds to Tds[0], the data in Tds[0] also correspond in order to parameter E30, parameter E31, parameter E32, parameter E20 and parameter E21.
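A minimal Python sketch of such a fixed-length segmentation rule is shown below; the segment length of 5 simply mirrors the example above, and the list-based buffers are stand-ins for whatever storage a real implementation uses.

    def segment(buffer, seg_len):
        # Split a flat gradient buffer into consecutive segments of seg_len elements.
        # Segments at the same index in two buffers of identical structure cover the
        # same parameter positions, so Tdf[k] and Tds[k] always correspond.
        return [buffer[i:i + seg_len] for i in range(0, len(buffer), seg_len)]

    grad_buffer = [0.0] * 10          # stands in for the first gradient sequence
    left_buffer = [0.0] * 10          # stands in for the second gradient sequence

    Tdf = segment(grad_buffer, 5)     # {Tdf[0], Tdf[1]}
    Tds = segment(left_buffer, 5)     # {Tds[0], Tds[1]}, aligned with Tdf by position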
S22: Set up multiple parallel computing threads, each of which obtains at least one first gradient segment and the second gradient segment corresponding to that first gradient segment.
Taking two parallel computing threads as an example, Tdf[0] and Tds[0] can be sent to computing thread A, and Tdf[1] and Tds[1] can be sent to computing thread B.
S23: For each first gradient segment it obtains, each computing thread accumulates the first gradient segment with the corresponding second gradient segment to obtain the corresponding cumulative gradient segment.
Taking computing thread A as an example, the data in Tdf[0] and Tds[0] can be added element by element to obtain the corresponding cumulative gradient segment. Exemplarily, the elements of the first gradient segment Tdf[0] are stored in grad_buffer[0], grad_buffer[1], grad_buffer[2], grad_buffer[3] and grad_buffer[4], and the elements of the second gradient segment Tds[0] are stored in left_buffer[0], left_buffer[1], left_buffer[2], left_buffer[3] and left_buffer[4]; the data of the corresponding cumulative gradient segment STd[0] are then grad_buffer[0]+left_buffer[0], grad_buffer[1]+left_buffer[1], grad_buffer[2]+left_buffer[2], grad_buffer[3]+left_buffer[3] and grad_buffer[4]+left_buffer[4]. Obviously, the five cumulative gradients in STd[0] correspond to parameter E30, parameter E31, parameter E32, parameter E20 and parameter E21, respectively. In the same way, computing thread B can obtain STd[1] from Tdf[1] and Tds[1]; the five cumulative gradients in STd[1] correspond to parameter E22, parameter E10, parameter E11, parameter E12 and parameter E13, respectively.
S24: Obtain the cumulative gradient sequence according to the cumulative gradient segments obtained by the respective computing threads.
Exemplarily, computing thread A obtains STd[0] and computing thread B obtains STd[1]; arranging the cumulative gradients in STd[0] and the cumulative gradients in STd[1] in order yields the cumulative gradient sequence. Taking the three-layer neural network described above as an example, grad_buffer[0] through grad_buffer[4] can be updated with the five elements of STd[0], and grad_buffer[5] through grad_buffer[9] with the five elements of STd[1], so that the data in the first storage space form the cumulative gradient sequence. The cumulative gradient sequence thus reduces storage consumption by reusing the first storage space. Based on the above configuration, the cumulative gradients can be computed in parallel by segment, improving the calculation speed of the cumulative gradient sequence.
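A minimal sketch of steps S22-S24, assuming Python lists and a thread pool standing in for the parallel computing threads (a real training node would more likely operate on GPU tensors), could look like this:

    from concurrent.futures import ThreadPoolExecutor

    def accumulate_segment(first_seg, second_seg):
        # S23: element-wise sum of a first gradient segment and its matching second gradient segment.
        return [a + b for a, b in zip(first_seg, second_seg)]

    def accumulate(Tdf, Tds, num_threads=2):
        # S22: hand matching segment pairs to parallel workers.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            STd = list(pool.map(accumulate_segment, Tdf, Tds))
        # S24: concatenating the per-segment results (or writing them back into
        # grad_buffer) gives the cumulative gradient sequence.
        cumulative = [g for seg in STd for g in seg]
        return STd, cumulative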
Correspondingly, in step S30, the importance index sequence may also be calculated at the granularity of gradient segments; that is, for each cumulative gradient segment, the corresponding importance index is calculated, and the importance index sequence is obtained from the importance index calculation results of the respective computing threads. Exemplarily, for the cumulative gradient segment STd[0], the corresponding importance index Im[0] can be obtained; for the cumulative gradient segment STd[1], the corresponding importance index Im[1] can be obtained. Based on the above configuration, the obtained importance index can characterize the importance of a gradient segment that includes multiple gradients, so that in the subsequent synchronization process it can be determined whether the segment is an important gradient segment. This makes it convenient to determine important gradients at the granularity of gradient segments and thus perform gradient updates at segment granularity, realizing sparse gradient updates, further reducing the amount of communicated data and improving the training speed.
In an exemplary embodiment, taking the three-layer neural network described above as an example, two cumulative gradient segments STd[0] and STd[1] are obtained; correspondingly, the importance index sequence includes two importance indexes Im[0] and Im[1]. For each cumulative gradient segment, a statistic of the cumulative gradients in that segment may be calculated and used as the importance index. The embodiment of the present disclosure does not limit the type of statistic; exemplarily, the statistic may be the variance, the standard deviation or the two-norm.
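For instance, using the two-norm as the statistic, the importance index of each cumulative gradient segment could be computed as in the following sketch (the variance or standard deviation would be substituted at the same place); the sample values of STd are made up purely for illustration.

    import math

    def importance_index(segment):
        # Importance of a cumulative gradient segment, here taken as its two-norm.
        return math.sqrt(sum(g * g for g in segment))

    # Cumulative gradient segments from S23 (illustrative values only).
    STd = [[0.1, -0.2, 0.05, 0.3, -0.1],
           [0.01, 0.0, 0.02, -0.01, 0.0]]

    Im = [importance_index(seg) for seg in STd]   # one index per segment, e.g. {Im[0], Im[1]}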
Please refer to FIG. 7, which shows a schematic flowchart of step S40 of the above distributed communication-based neural network training method, that is, of determining the important gradients in the cumulative gradient sequence according to the important gradient indication sequence, including:
S41: For each cumulative gradient segment calculated by each computing thread, extract the corresponding important gradient indication value from the important gradient indication sequence.
In an exemplary embodiment, each important gradient indication value in the important gradient indication sequence also corresponds to a cumulative gradient segment in the cumulative gradient sequence. Taking the three-layer neural network described above as an example, two cumulative gradient segments STd[0] and STd[1] are obtained; correspondingly, the important gradient indication sequence also includes two important gradient indication values Ip[0] and Ip[1].
S42: If the important gradient indication value is the first indication value, determine all the cumulative gradients in the cumulative gradient segment as important gradients and submit the cumulative gradient segment to the communication buffer of the training node; the first indication value indicates that all the cumulative gradients in the cumulative gradient segment are important gradients.
S43: If the important gradient indication value is the second indication value, determine all the cumulative gradients in the cumulative gradient segment as non-important gradients; the second indication value indicates that all the cumulative gradients in the cumulative gradient segment are non-important gradients.
Exemplarily, 1 may be used as the first indication value and 0 as the second indication value. Exemplarily, if the important gradient indication value Ip[0] is 1, each cumulative gradient in the corresponding cumulative gradient segment STd[0] is determined to be an important gradient, and the cumulative gradient segment is submitted to the communication buffer of the training node; if the important gradient indication value Ip[1] is 0, the five cumulative gradients in the corresponding cumulative gradient segment STd[1] are all treated as non-important gradients.
In a feasible embodiment, the second gradient sequence may be updated according to the determined non-important gradients; this step has been described above and is not repeated here.
Based on the above configuration, important gradients and non-important gradients can be quickly determined at the granularity of gradient segments, improving the speed at which important gradients are determined.
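The following sketch illustrates steps S41-S43 under the assumptions above (1 as the first indication value, 0 as the second); the dictionary-based communication buffer and the write-back of non-important segments into left_buffer are illustrative choices, not the disclosed data structures.

    FIRST_INDICATION, SECOND_INDICATION = 1, 0

    def classify_segments(STd, Ip, seg_len, left_buffer):
        # S41-S43: for each cumulative gradient segment, read its indication value;
        # important segments go to the communication buffer, non-important segments
        # are written back to the second gradient sequence for a later iteration.
        comm_buffer = {}
        for seg_idx, seg in enumerate(STd):
            if Ip[seg_idx] == FIRST_INDICATION:
                comm_buffer[seg_idx] = seg                 # S42: whole segment will be synchronized
            else:
                start = seg_idx * seg_len                  # S43: keep the segment locally
                left_buffer[start:start + len(seg)] = seg
        return comm_buffer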
In an exemplary embodiment, obtaining the to-be-synchronized information of the training node according to the important gradients and the importance index sequence includes:
S51. Obtain the to-be-synchronized gradient sequence according to the cumulative gradient segments in the communication buffer; wherein the position of each cumulative gradient of each cumulative gradient segment in the communication buffer is the same in the to-be-synchronized gradient sequence as the position of that cumulative gradient in the cumulative gradient sequence, and the other positions in the to-be-synchronized gradient sequence are set to the preset gradient value.
Continuing the example above, only the cumulative gradient segment STd[0] is submitted to the communication buffer, and the corresponding positions of the cumulative gradients in STd[0] within the cumulative gradient sequence are positions 1 to 5; therefore, positions 1 to 5 of the to-be-synchronized gradient sequence are the five cumulative gradients of STd[0], and the other positions in the to-be-synchronized gradient sequence are set to the preset value. Exemplarily, the other positions in the to-be-synchronized gradient sequence may be set to zero. In this example, the to-be-synchronized gradient sequence includes ten values in total, corresponding to parameter E30, parameter E31, parameter E32, parameter E20, parameter E21, parameter E22, parameter E10, parameter E11, parameter E12 and parameter E13, respectively.
S52. Splice the to-be-synchronized gradient sequence and the importance index sequence to obtain the to-be-synchronized information.
Exemplarily, the importance index sequence {Im[0], Im[1]} may be appended after the to-be-synchronized gradient sequence to obtain the to-be-synchronized information.
Based on the above configuration, important gradients can be quickly determined at the granularity of gradient segments to obtain the to-be-synchronized gradient sequence, which records the to-be-synchronized gradient information corresponding to each parameter. Using the splicing result of the to-be-synchronized gradient sequence and the importance indexes as the to-be-synchronized information allows the information to include, with a small data volume, the gradients, the gradient positions, and the importance index sequence used to determine the important gradient positions in the next iteration. This reduces both the amount of communicated data and the communication frequency, and can significantly relieve the communication pressure generated by neural network training in a distributed communication environment.
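A minimal sketch of steps S51-S52, continuing the assumptions of the previous sketches (a segment-indexed dictionary as the communication buffer, 0 as the preset gradient value), is given below; in the running example it would produce ten gradient values followed by the two importance indexes.

    def build_sync_message(comm_buffer, seg_len, total_len, Im, preset=0.0):
        # S51: place each buffered segment at its original positions and fill the
        # remaining positions with the preset gradient value.
        to_sync = [preset] * total_len
        for seg_idx, seg in comm_buffer.items():
            start = seg_idx * seg_len
            to_sync[start:start + len(seg)] = seg
        # S52: append the importance index sequence to obtain the to-be-synchronized information.
        return to_sync + list(Im)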
Please refer to FIG. 8, which shows a schematic flowchart of step S60 in the above distributed communication-based neural network training method, that is, of performing synchronization between training nodes based on the to-be-synchronized information to obtain the synchronized gradient sequence and the synchronized important gradient indication sequence, including:
S61. Add the to-be-synchronized gradient sequences in the to-be-synchronized information of the respective training nodes element by element to obtain a synchronized accumulated gradient sequence.
S62. Divide each synchronized accumulated gradient in the synchronized accumulated gradient sequence by the total number of training nodes to obtain the synchronized gradient sequence.
Suppose three training nodes train the three-layer neural network described above, where the to-be-synchronized gradient sequence of training node 1 is Wt1, that of training node 2 is Wt2, and that of training node 3 is Wt3. For any sequence position i, the value Wt1[i]+Wt2[i]+Wt3[i] is the synchronized accumulated gradient corresponding to position i in the synchronized accumulated gradient sequence. Dividing the synchronized accumulated gradient at each position by 3 yields the synchronized gradient sequence.
S63. Add the importance index sequences in the to-be-synchronized information of the respective training nodes element by element to obtain an accumulated importance index sequence.
S64. Divide each accumulated importance index in the accumulated importance index sequence by the total number of training nodes to obtain an average importance index sequence.
Referring to the previous example, for the neural network with three layers, the importance index sequence of each training node is {Im[0], Im[1]}; to distinguish training node 1, training node 2 and training node 3, their importance index sequences are denoted {Im1[0], Im1[1]}, {Im2[0], Im2[1]} and {Im3[0], Im3[1]}, respectively. If {AIm[0], AIm[1]} denotes the average importance index sequence, then AIm[0] = (Im1[0]+Im2[0]+Im3[0])/3 and AIm[1] = (Im1[1]+Im2[1]+Im3[1])/3.
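Steps S61-S64 can be pictured with the following sketch, which averages the to-be-synchronized messages of all nodes locally; in a real cluster this summation would typically be carried out by a collective operation such as all-reduce rather than by one process holding every message.

    def synchronize(messages, num_segments):
        # Each message is [to-be-synchronized gradients..., importance indexes...].
        n = len(messages)                                  # total number of training nodes
        summed = [sum(vals) for vals in zip(*messages)]    # S61 / S63: element-wise accumulation
        averaged = [v / n for v in summed]                 # S62 / S64: divide by the node count
        synced_gradients = averaged[:-num_segments]        # synchronized gradient sequence
        avg_importance = averaged[-num_segments:]          # average importance index sequence
        return synced_gradients, avg_importance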
S65. Calculate the important gradient indication value corresponding to each average importance index in the average importance index sequence to obtain the synchronized important gradient indication sequence.
Based on the above configuration, a synchronized gradient sequence that can be accurately used to adjust the neural network parameters is calculated, and a synchronized important gradient indication sequence that can be used to quickly judge important gradients is obtained, improving the neural network training speed.
In a feasible embodiment, calculating the important gradient indication value corresponding to each average importance index in the average importance index sequence includes:
S651. Sort the average importance indexes of the average importance index sequence in descending order to obtain a sorting result.
Referring to the previous example, for the average importance index sequence {AIm[0], AIm[1]}, if AIm[0] is less than AIm[1], the sorting result is {AIm[1], AIm[0]}; otherwise the sorting result is {AIm[0], AIm[1]}.
S652. Obtain a threshold index, and determine an importance index threshold in the sorting result according to the threshold index.
In a feasible embodiment, the value at the position pointed to by the threshold index in the sorting result may be used as the importance index threshold. Exemplarily, if the sorting result includes 30 average importance indexes and the threshold index is 10, the value of the 10th average importance index in the sorting result is used as the importance index threshold.
In a feasible embodiment, the number of segments may be determined according to the preset segmentation rule, a preset compression rate may be obtained, and the product of the compression rate and the number of segments may be used as the threshold index. The embodiment of the present disclosure does not limit the specific values of the preset compression rate or the number of segments. The number of segments in the preceding example is only two; in practical applications it can be far more than two, and the examples in the embodiments of the present disclosure impose no specific limitation.
Based on the above configuration, a reasonable threshold index can be obtained, so that a reasonable importance index threshold is determined, which ultimately improves the accuracy of the important gradient judgment.
S653. For each average importance index in the average importance index sequence, if the average importance index is greater than the importance index threshold, set the corresponding important gradient indication value to the first indication value; otherwise, set the corresponding important gradient indication value to the second indication value.
In the embodiment of the present disclosure, the first indication value may be used to indicate important gradients and may exemplarily be 1; the second indication value may be used to indicate non-important gradients and may exemplarily be 0. Taking the average importance index sequence {AIm[0], AIm[1]} as an example, the synchronized important gradient indication sequence {TIp[0], TIp[1]} can be obtained correspondingly. When the above distributed communication-based neural network training method is executed iteratively, the synchronized important gradient indication sequence {TIp[0], TIp[1]} serves as the new important gradient indication sequence {Ip[0], Ip[1]} and is used to determine the important gradients in the next iteration. How to determine important gradients based on the important gradient indication sequence {Ip[0], Ip[1]} has been described above and is not repeated here.
Based on the above configuration, the important gradient indication value corresponding to each average importance index can be accurately calculated, improving the accuracy of the important gradient judgment.
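Steps S651-S653 can be sketched as follows; the compression rate of 0.5 is purely an assumed example value, and the strict "greater than" comparison follows S653.

    def importance_to_indications(avg_importance, compression_rate=0.5):
        # S651: sort the average importance indexes in descending order.
        ranked = sorted(avg_importance, reverse=True)
        # S652: threshold index = compression rate x number of segments; the value at
        # that position in the sorting result is the importance index threshold.
        threshold_index = max(1, int(compression_rate * len(avg_importance)))
        importance_threshold = ranked[threshold_index - 1]
        # S653: indexes strictly greater than the threshold map to the first indication
        # value (1, important); the rest map to the second indication value (0).
        return [1 if im > importance_threshold else 0 for im in avg_importance]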
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
In a feasible embodiment, the present disclosure further provides another distributed communication-based neural network training method, applied to a training system including a plurality of training nodes, the method including:
S100. Each training node trains its corresponding neural network and saves the generated gradients in a first gradient sequence.
S200. Obtain a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence; the second gradient sequence is used to record the gradients in the training node that have not yet participated in synchronization.
S300. Calculate an importance index sequence according to the cumulative gradient sequence.
S400. Acquire the important gradient indication sequence in the training node, and determine the important gradients in the cumulative gradient sequence according to the important gradient indication sequence.
S500. Obtain the to-be-synchronized information of the training node according to the important gradients and the importance index sequence.
S600. The training nodes perform synchronization among themselves based on the corresponding to-be-synchronized information to obtain a synchronized gradient sequence and a synchronized important gradient indication sequence.
S700. Each training node adjusts the parameters of its neural network according to the synchronized gradient sequence and uses the synchronized important gradient indication sequence as the new important gradient indication sequence.
For the steps performed by each training node in the training system described in the embodiments of the present disclosure, reference may be made to the foregoing description, which is not repeated here. It can be understood that the method embodiments mentioned in the present disclosure can be combined with one another to form combined embodiments without violating their principles and logic; due to space limitations, the details are not repeated in the present disclosure.
FIG. 9 shows a block diagram of a distributed communication-based neural network training apparatus provided according to an embodiment of the present disclosure. The apparatus is set on a training node and includes:
a training module 10, configured to train the neural network corresponding to the training node and save the generated gradients in a first gradient sequence;
a cumulative gradient acquisition module 20, configured to obtain a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, the second gradient sequence being used to record gradients that have not yet participated in synchronization;
an importance index sequence calculation module 30, configured to calculate an importance index sequence according to the cumulative gradient sequence;
a gradient classification module 40, configured to acquire an important gradient indication sequence and determine the important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
a synchronization module 50, configured to obtain the to-be-synchronized information of the training node according to the important gradients and the importance index sequence, and to perform synchronization between training nodes based on the to-be-synchronized information to obtain a synchronized gradient sequence and a synchronized important gradient indication sequence;
an update module 60, configured to use the synchronized important gradient indication sequence as the new important gradient indication sequence;
a parameter adjustment module 70, configured to adjust the parameters of the neural network according to the synchronized gradient sequence.
In some possible implementations, the cumulative gradient acquisition module includes:
a segmentation unit, configured to segment the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule to obtain a first gradient segment sequence and a second gradient segment sequence, wherein, if the position of a first gradient segment in the first gradient segment sequence is the same as the position of a second gradient segment in the second gradient segment sequence, the first gradient segment corresponds to the second gradient segment and both correspond to the same neural network parameters;
a multi-thread processing unit, configured to set up multiple parallel computing threads, each computing thread obtaining at least one first gradient segment and the second gradient segment corresponding to that first gradient segment;
an accumulation unit, configured so that each computing thread, for each first gradient segment it obtains, accumulates the first gradient segment with the corresponding second gradient segment to obtain the corresponding cumulative gradient segment;
a cumulative gradient sequence acquisition unit, configured to obtain the cumulative gradient sequence according to the cumulative gradient segments obtained by the respective computing threads.
In some possible implementations, the importance index sequence calculation module includes: an importance index calculation unit, configured so that each computing thread calculates the corresponding importance index according to the cumulative gradient segment it obtains; and an importance index sequence acquisition unit, configured to obtain the importance index sequence according to the importance index calculation results of the respective computing threads.
In some possible implementations, the gradient classification module is configured to, for each cumulative gradient segment calculated by each computing thread, extract the corresponding important gradient indication value from the important gradient indication sequence; if the important gradient indication value is the first indication value, determine all the cumulative gradients in the cumulative gradient segment as important gradients and submit the cumulative gradient segment to the communication buffer of the training node; the first indication value indicates that all the cumulative gradients in the cumulative gradient segment are important gradients.
In some possible implementations, the synchronization module includes:
a to-be-synchronized gradient sequence acquisition unit, configured to obtain the to-be-synchronized gradient sequence according to the cumulative gradient segments in the communication buffer, wherein the position of each cumulative gradient of each cumulative gradient segment in the communication buffer is the same in the to-be-synchronized gradient sequence as the position of that cumulative gradient in the cumulative gradient sequence, and the other positions in the to-be-synchronized gradient sequence are set to the preset gradient value;
a splicing unit, configured to splice the to-be-synchronized gradient sequence and the importance index sequence to obtain the to-be-synchronized information.
In some possible implementations, the synchronization module further includes: a synchronized accumulated gradient sequence acquisition unit, configured to add the to-be-synchronized gradient sequences in the to-be-synchronized information of the respective training nodes element by element to obtain a synchronized accumulated gradient sequence;
a synchronized gradient sequence acquisition unit, configured to divide each synchronized accumulated gradient in the synchronized accumulated gradient sequence by the total number of training nodes to obtain the synchronized gradient sequence;
an accumulated importance index sequence acquisition unit, configured to add the importance index sequences in the to-be-synchronized information of the respective training nodes element by element to obtain an accumulated importance index sequence;
an average importance index sequence acquisition unit, configured to divide each accumulated importance index in the accumulated importance index sequence by the total number of training nodes to obtain an average importance index sequence;
a synchronized important gradient indication sequence calculation unit, configured to calculate the important gradient indication value corresponding to each average importance index in the average importance index sequence to obtain the synchronized important gradient indication sequence.
In some possible implementations, the synchronized important gradient indication sequence calculation unit includes: a sorting unit, configured to sort the average importance indexes of the average importance index sequence in descending order to obtain a sorting result; an index threshold acquisition unit, configured to obtain a threshold index and determine an importance index threshold in the sorting result according to the threshold index; and an important gradient indication value calculation unit, configured to, for each average importance index in the average importance index sequence, set the corresponding important gradient indication value to the first indication value if the average importance index is greater than the importance index threshold, and otherwise set the corresponding important gradient indication value to the second indication value; wherein the first indication value is used to indicate important gradients and the second indication value is used to indicate non-important gradients.
In some possible implementations, the index threshold acquisition unit is configured to determine the number of segments according to the preset segmentation rule, obtain a preset compression rate, and use the product of the compression rate and the number of segments as the threshold index.
In some possible implementations, the parameter adjustment module is configured to sequentially extract the gradients in the synchronized gradient sequence; if a gradient is not equal to the preset gradient value, adjust the corresponding neural network parameter according to the gradient; and if the gradient is equal to the preset gradient value, extract the next gradient.
In some possible implementations, the update module is further configured to determine the non-important gradients in the cumulative gradient sequence according to the important gradient indication sequence and to update the second gradient sequence according to the non-important gradients; the apparatus further includes an iteration control module, configured to iteratively perform the distributed communication-based neural network training until a preset training stop condition is reached.
In some embodiments, the functions or modules of the apparatuses provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for their specific implementation, reference may be made to the descriptions of the above method embodiments, which are not repeated here for brevity.
An embodiment of the present disclosure further provides a computer-readable storage medium, in which at least one instruction or at least one segment of program is stored; the above method is implemented when the at least one instruction or at least one segment of program is loaded and executed by a processor. The computer-readable storage medium may be a non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a training device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the above method. The training device may be provided as a terminal, a server or a device of another form.
FIG. 10 shows a block diagram of a training device according to an embodiment of the present disclosure. For example, the training device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device or a personal digital assistant.
Referring to FIG. 10, the training device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the training device 800, such as operations associated with display, phone calls, data communication, camera operation and recording operation. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the training device 800. Examples of such data include instructions for any application or method operating on the training device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk.
The power supply component 806 provides power to the various components of the training device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the training device 800.
The multimedia component 808 includes a screen providing an output interface between the training device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the training device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the training device 800 is in an operation mode, such as a call mode, a recording mode or a speech recognition mode. The received audio signal may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing state assessments of various aspects of the training device 800. For example, the sensor component 814 can detect the on/off state of the training device 800 and the relative positioning of components, such as the display and the keypad of the training device 800; the sensor component 814 can also detect a change in position of the training device 800 or of a component of the training device 800, the presence or absence of user contact with the training device 800, the orientation or acceleration/deceleration of the training device 800, and temperature changes of the training device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the training device 800 and other devices. The training device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the training device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example a memory 804 including computer program instructions, which are executable by the processor 820 of the training device 800 to complete the above method.
FIG. 11 shows a block diagram of another training device according to an embodiment of the present disclosure. For example, the training device 1900 may be provided as a server. Referring to FIG. 11, the training device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932 for storing instructions executable by the processing component 1922, such as an application program. The application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions to perform the above method.
训练设备1900还可以包括一个电源组件1926被配置为执行训练设备1900的电源管理,一个有线或无线网络接口1950被配置为将训练设备1900连接到网络,和一个输入输出(I/O)接口1958。训练设备1900可以操作基于存储在存储器1932的操作系统,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM或类似。The training device 1900 may also include a power supply assembly 1926 configured to perform power management of the training device 1900, a wired or wireless network interface 1950 configured to connect the training device 1900 to a network, and an input output (I/O) interface 1958 . Training device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
在示例性实施例中,还提供了一种非易失性计算机可读存储介质,例如包括计算机程序指令的存储器1932,上述计算机程序指令可由训练设备1900的处理组件1922执行以完成上述方法。In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as memory 1932 comprising computer program instructions executable by processing component 1922 of training device 1900 to perform the method described above.
在示例性实施例中,还提供了一种训练系统,所述训练系统包括多个上述的训练设备。In an exemplary embodiment, there is also provided a training system comprising a plurality of the above-mentioned training devices.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be customized by utilizing state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions so as to implement various aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or the other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium, and the instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a specific manner, such that the computer-readable medium having the instructions stored thereon includes an article of manufacture including instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps are performed on the computer, the other programmable data processing apparatus, or the other device to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable data processing apparatus, or the other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks may also occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein are chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

  1. A neural network training method based on distributed communication, applied to a training node, the method comprising:
    training a neural network corresponding to the training node, and saving generated gradients in a first gradient sequence;
    obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, wherein the second gradient sequence is used to record gradients that have not yet participated in synchronization;
    calculating an importance index sequence according to the cumulative gradient sequence;
    obtaining an important gradient indication sequence, and determining important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
    obtaining to-be-synchronized information of the training node according to the important gradients and the importance index sequence;
    performing synchronization between training nodes based on the to-be-synchronized information to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence; and
    adjusting parameters of the neural network according to the post-synchronization gradient sequence, and using the post-synchronization important gradient indication sequence as a new important gradient indication sequence.
  2. The method according to claim 1, wherein obtaining the cumulative gradient sequence according to the first gradient sequence and the second gradient sequence comprises:
    segmenting the first gradient sequence and the second gradient sequence respectively based on a preset segmentation rule to obtain a first gradient segment sequence and a second gradient segment sequence, wherein, if a position of a first gradient segment in the first gradient segment sequence is the same as a position of a second gradient segment in the second gradient segment sequence, the first gradient segment corresponds to the second gradient segment, and both correspond to the same neural network parameters;
    setting up a plurality of parallel computing threads, each computing thread obtaining at least one first gradient segment and a second gradient segment corresponding to the first gradient segment;
    for each obtained first gradient segment, accumulating, by each computing thread, the first gradient segment and the corresponding second gradient segment to obtain a corresponding cumulative gradient segment; and
    obtaining the cumulative gradient sequence according to the cumulative gradient segments obtained by the computing threads.
  3. The method according to claim 2, wherein calculating the importance index sequence according to the cumulative gradient sequence comprises:
    calculating, by each computing thread, a corresponding importance index according to the obtained cumulative gradient segment; and
    obtaining the importance index sequence according to the importance index calculation results of the computing threads.
  4. The method according to claim 2 or 3, wherein determining the important gradients in the cumulative gradient sequence according to the important gradient indication sequence comprises:
    for each cumulative gradient segment calculated by each computing thread, extracting a corresponding important gradient indication value from the important gradient indication sequence; and
    if the important gradient indication value is a first indication value, determining all cumulative gradients in the cumulative gradient segment as important gradients, and submitting the cumulative gradient segment to a communication buffer of the training node, wherein the first indication value indicates that all cumulative gradients in the cumulative gradient segment are important gradients.
  5. The method according to claim 4, wherein obtaining the to-be-synchronized information of the training node according to the important gradients and the importance index sequence comprises:
    obtaining a to-be-synchronized gradient sequence according to the cumulative gradient segments in the communication buffer, wherein a position, in the to-be-synchronized gradient sequence, of each cumulative gradient in each cumulative gradient segment in the communication buffer is the same as a position of the cumulative gradient in the cumulative gradient sequence, and other positions in the to-be-synchronized gradient sequence are set to a preset gradient value; and
    splicing the to-be-synchronized gradient sequence and the importance index sequence to obtain the to-be-synchronized information.
  6. The method according to any one of claims 2 to 5, wherein performing synchronization between training nodes based on the to-be-synchronized information to obtain the post-synchronization gradient sequence and the post-synchronization important gradient indication sequence comprises:
    adding, element by element, the to-be-synchronized gradient sequences in the to-be-synchronized information of the training nodes to obtain a synchronized cumulative gradient sequence;
    dividing each synchronized cumulative gradient in the synchronized cumulative gradient sequence by the total number of training nodes to obtain the post-synchronization gradient sequence;
    adding, element by element, the importance index sequences in the to-be-synchronized information of the training nodes to obtain an accumulated importance index sequence;
    dividing each accumulated importance index in the accumulated importance index sequence by the total number of training nodes to obtain an average importance index sequence; and
    calculating an important gradient indication value corresponding to each average importance index in the average importance index sequence to obtain the post-synchronization important gradient indication sequence.
  7. The method according to claim 6, wherein calculating the important gradient indication value corresponding to each average importance index in the average importance index sequence comprises:
    obtaining a ranking result of the average importance indexes in the average importance index sequence in descending order of the average importance indexes;
    obtaining a threshold index, and determining an importance index threshold in the ranking result according to the threshold index; and
    for each average importance index in the average importance index sequence, if the average importance index is greater than the importance index threshold, setting the corresponding important gradient indication value to a first indication value; otherwise, setting the corresponding important gradient indication value to a second indication value, wherein the first indication value is used to indicate an important gradient, and the second indication value is used to indicate a non-important gradient.
  8. The method according to claim 7, wherein obtaining the threshold index comprises:
    determining the number of segments according to the preset segmentation rule;
    obtaining a preset compression rate; and
    using the product of the compression rate and the number of segments as the threshold index.
  9. The method according to any one of claims 1 to 8, wherein adjusting the parameters of the neural network according to the post-synchronization gradient sequence comprises:
    sequentially extracting gradients in the post-synchronization gradient sequence;
    if a gradient is not equal to a preset gradient value, adjusting the corresponding neural network parameter according to the gradient; and
    if the gradient is equal to the preset gradient value, extracting the next gradient.
  10. The method according to any one of claims 1 to 9, wherein the method further comprises:
    determining non-important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
    updating the second gradient sequence according to the non-important gradients; and
    iteratively performing the neural network training based on distributed communication until a preset training stop condition is reached.
  11. A neural network training method based on distributed communication, applied to a training system comprising a plurality of training nodes, the method comprising:
    training, by each training node, a corresponding neural network, and saving generated gradients in a first gradient sequence;
    obtaining a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, wherein the second gradient sequence is used to record gradients in the training node that have not yet participated in synchronization;
    calculating an importance index sequence according to the cumulative gradient sequence;
    obtaining an important gradient indication sequence in the training node, and determining important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
    obtaining to-be-synchronized information of the training node according to the important gradients and the importance index sequence;
    performing, by each training node, synchronization between the training nodes based on the corresponding to-be-synchronized information to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence; and
    adjusting, by each training node, parameters of the neural network according to the post-synchronization gradient sequence, and using the post-synchronization important gradient indication sequence as a new important gradient indication sequence.
  12. A neural network training apparatus based on distributed communication, wherein the apparatus is provided on a training node and comprises:
    a training module, configured to train a neural network corresponding to the training node, and save generated gradients in a first gradient sequence;
    a cumulative gradient acquisition module, configured to obtain a cumulative gradient sequence according to the first gradient sequence and a second gradient sequence, wherein the second gradient sequence is used to record gradients that have not yet participated in synchronization;
    an importance index sequence calculation module, configured to calculate an importance index sequence according to the cumulative gradient sequence;
    a gradient classification module, configured to obtain an important gradient indication sequence, and determine important gradients in the cumulative gradient sequence according to the important gradient indication sequence;
    a synchronization module, configured to obtain to-be-synchronized information of the training node according to the important gradients and the importance index sequence, and perform synchronization between training nodes based on the to-be-synchronized information to obtain a post-synchronization gradient sequence and a post-synchronization important gradient indication sequence;
    an update module, configured to use the post-synchronization important gradient indication sequence as a new important gradient indication sequence; and
    a parameter adjustment module, configured to adjust parameters of the neural network according to the post-synchronization gradient sequence.
  13. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the neural network training method based on distributed communication according to any one of claims 1 to 10.
  14. A training device, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the at least one processor implements the neural network training method based on distributed communication according to any one of claims 1 to 10 by executing the instructions stored in the memory.
  15. A training system, wherein the training system comprises a plurality of training devices according to claim 14.
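
The sketches below are editorial illustrations of individual claimed steps; they are not part of the claims, and every identifier, constant, and metric choice in them is an assumption rather than the applicant's implementation. This first sketch covers the segmented accumulation and per-segment importance computation of claims 2 and 3: both gradient sequences are cut by a fixed segmentation rule, corresponding segments are accumulated by parallel threads, and one importance index per accumulated segment is produced. The L2 norm is only one plausible choice of importance index, and all names (accumulate_segments, segment_size, and so on) are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def accumulate_segments(first_grads, second_grads, segment_size=4, num_threads=4):
    """Split both gradient sequences into segments of segment_size, accumulate
    corresponding segments in parallel threads, and compute one importance
    index per accumulated segment (here its L2 norm, an assumed metric)."""
    assert first_grads.shape == second_grads.shape
    n = len(first_grads)
    bounds = [(s, min(s + segment_size, n)) for s in range(0, n, segment_size)]

    def accumulate_one(bound):
        lo, hi = bound
        segment = first_grads[lo:hi] + second_grads[lo:hi]   # cumulative gradient segment
        return segment, float(np.linalg.norm(segment))        # segment and its importance index

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(accumulate_one, bounds))

    cumulative = np.concatenate([seg for seg, _ in results])
    importance = np.array([imp for _, imp in results])
    return cumulative, importance
```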
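The next sketch illustrates claims 5 and 6 under the same caveats: important segments are copied into a to-be-synchronized gradient sequence whose remaining positions hold an assumed preset value, the importance index sequence is appended, and synchronization is shown as an in-process stand-in (a sum over node payloads divided by the node count) rather than a real collective communication call.

```python
import numpy as np

PRESET_GRADIENT = 0.0  # assumed placeholder for positions that do not take part in this round

def build_sync_payload(cumulative, importance, important_flags, segment_size):
    """Claim 5, sketched: copy important segments into the to-be-synchronized
    sequence, fill every other position with the preset value, then append the
    per-segment importance indexes."""
    to_sync = np.full_like(cumulative, PRESET_GRADIENT)
    for seg_idx, flag in enumerate(important_flags):
        if flag:
            lo = seg_idx * segment_size
            hi = min(lo + segment_size, len(cumulative))
            to_sync[lo:hi] = cumulative[lo:hi]
    return np.concatenate([to_sync, importance])

def synchronize(payloads, num_params):
    """Claim 6, sketched with an in-process stand-in for the collective step:
    element-wise sum of all node payloads divided by the number of nodes."""
    averaged = np.mean(np.stack(payloads), axis=0)
    return averaged[:num_params], averaged[num_params:]
```

In a real deployment the averaging step would be carried out by whatever collective primitive the training framework provides; the stand-in here only makes the arithmetic of the claim visible.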
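This sketch illustrates the indicator-value computation of claims 7 and 8. The averaged importance indexes are ranked in descending order, the threshold index is the product of a preset compression rate and the segment count, and the value at that rank serves as the importance index threshold. Interpreting the threshold index as a rank position is an assumption, as is every identifier.

```python
import numpy as np

def important_gradient_indicators(average_importance, compression_rate,
                                  first_value=1, second_value=0):
    """Claims 7 and 8, sketched: rank the averaged importance indexes in
    descending order, take threshold_index = compression_rate * segment count,
    read the importance threshold at that rank, and emit indicator values."""
    num_segments = len(average_importance)
    threshold_index = max(1, int(compression_rate * num_segments))
    ranked = np.sort(average_importance)[::-1]    # descending order
    threshold = ranked[threshold_index - 1]        # importance index threshold
    return np.where(average_importance > threshold, first_value, second_value)
```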
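The last sketch illustrates claims 9 and 10: positions of the post-synchronization gradient sequence holding the preset value are skipped during the parameter update, and the gradients of non-important segments are carried over as the new second gradient sequence for a later round. The plain SGD step and the learning rate are assumptions introduced only to make the example runnable.

```python
import numpy as np

def apply_update_and_carry_residual(params, post_sync_gradients, cumulative,
                                    indicators, segment_size, lr=0.01,
                                    preset_gradient=0.0):
    """Claim 9, sketched: apply only gradients that differ from the preset value
    (a plain SGD step is assumed). Claim 10, sketched: keep the gradients of
    non-important segments as the new second gradient sequence so they can join
    a later synchronization round."""
    params = params.copy()
    for i, grad in enumerate(post_sync_gradients):
        if grad != preset_gradient:          # skip placeholder positions
            params[i] -= lr * grad
    second = np.zeros_like(cumulative)
    for seg_idx, flag in enumerate(indicators):
        if flag == 0:                        # non-important segment
            lo = seg_idx * segment_size
            hi = min(lo + segment_size, len(cumulative))
            second[lo:hi] = cumulative[lo:hi]
    return params, second
```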
PCT/CN2021/100623 2021-02-27 2021-06-17 Distributed communication-based neural network training method and apparatus, and storage medium WO2022179007A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110221266.9 2021-02-27
CN202110221266.9A CN112766502A (en) 2021-02-27 2021-02-27 Neural network training method and device based on distributed communication and storage medium

Publications (1)

Publication Number Publication Date
WO2022179007A1 (en) 2022-09-01

Family

ID=75704305

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100623 WO2022179007A1 (en) 2021-02-27 2021-06-17 Distributed communication-based neural network training method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN112766502A (en)
WO (1) WO2022179007A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766502A (en) * 2021-02-27 2021-05-07 上海商汤智能科技有限公司 Neural network training method and device based on distributed communication and storage medium
CN113556247B (en) * 2021-06-25 2023-08-01 深圳技术大学 Multi-layer parameter distributed data transmission method, device and readable medium
CN114118381B (en) * 2021-12-03 2024-02-02 中国人民解放军国防科技大学 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication
CN114862656B (en) * 2022-05-18 2023-05-05 北京百度网讯科技有限公司 Multi-GPU-based acquisition method for training cost of distributed deep learning model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075347A1 (en) * 2016-09-15 2018-03-15 Microsoft Technology Licensing, Llc Efficient training of neural networks
CN110619388A (en) * 2019-09-20 2019-12-27 北京金山数字娱乐科技有限公司 Gradient synchronization method and device in distributed training
WO2020226634A1 (en) * 2019-05-07 2020-11-12 Huawei Technologies Co., Ltd. Distributed synchronous training architecture using stale weights
CN112288083A (en) * 2020-10-21 2021-01-29 周宇浩 Neural network distributed training method, device, equipment and storage medium
CN112766502A (en) * 2021-02-27 2021-05-07 上海商汤智能科技有限公司 Neural network training method and device based on distributed communication and storage medium

Also Published As

Publication number Publication date
CN112766502A (en) 2021-05-07


Legal Events

ENP: Entry into the national phase (Ref document number: 2022531593; Country of ref document: JP; Kind code of ref document: A)
121: Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21927439; Country of ref document: EP; Kind code of ref document: A1)
NENP: Non-entry into the national phase (Ref country code: DE)