WO2021140643A1 - Neural network system, neural network learning method, and neural network learning program - Google Patents

Neural network system, neural network learning method, and neural network learning program

Info

Publication number
WO2021140643A1
WO2021140643A1 PCT/JP2020/000644 JP2020000644W WO2021140643A1 WO 2021140643 A1 WO2021140643 A1 WO 2021140643A1 JP 2020000644 W JP2020000644 W JP 2020000644W WO 2021140643 A1 WO2021140643 A1 WO 2021140643A1
Authority
WO
WIPO (PCT)
Prior art keywords
update
neural network
processors
accumulation
gradient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2020/000644
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
檀上匠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to EP20912017.9A (patent EP4089586A4)
Priority to JP2021569687A (patent JP7453563B2)
Priority to PCT/JP2020/000644 (patent WO2021140643A1)
Priority to CN202080092251.0A (patent CN114930350A)
Publication of WO2021140643A1
Priority to US17/832,733 (patent US20220300790A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0499: Feedforward networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/09: Supervised learning
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/098: Distributed learning, e.g. federated learning
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Definitions

  • the present invention relates to a neural network system, a learning method of a neural network system, and a learning program of a neural network system.
  • the neural network has, for example, a configuration in which a plurality of inputs are multiplied by their respective weights, the sum of the resulting products is input to the activation function of a neuron in the output layer, and the output of the activation function is output.
  • a neural network with such a simple configuration is called a simple perceptron.
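  • as an illustration of the simple perceptron described above, the following is a minimal sketch. The function names and the choice of a sigmoid activation are assumptions made for this example and do not come from the source.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activation; chosen here only for illustration (assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def simple_perceptron(inputs, weights, activation=sigmoid):
    """Multiply each input by its weight, sum the products, apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)


# Two inputs whose weighted sum is 1.0 * 0.5 + 2.0 * (-0.25) = 0.0.
y = simple_perceptron([1.0, 2.0], [0.5, -0.25])
```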
  • a neural network having a plurality of layers having the above-mentioned simple configuration and inputting the output of one layer to another layer is called a multi-layer perceptron.
  • a deep neural network has a plurality of hidden layers between an input layer and an output layer, like a multi-layer perceptron.
  • the neural network is abbreviated as NN (Neural Network).
  • NN optimizes parameters such as the above-mentioned weights by learning using a large amount of training data.
  • training data is divided by the number of arithmetic nodes, and the multiple arithmetic nodes each perform learning operations on their respective training data, calculate the gradient of the error function of the NN output with respect to the NN parameters, and calculate the parameter update amount by multiplying the gradient by the learning rate. After that, the arithmetic nodes obtain the average of the update amounts of all nodes, and all the arithmetic nodes update the parameters with that average.
  • the arithmetic time required for learning can be shortened as compared with the case where a single arithmetic node performs learning operations with a single training data.
  • the update amounts calculated by each of the plurality of arithmetic nodes need to be aggregated by adding them, and the aggregated sum needs to be shared by the plurality of arithmetic nodes. This aggregation and sharing requires data communication between the plurality of arithmetic nodes.
  • data parallel distributed learning shortens the calculation time required for learning by processing the training data in parallel, but the reduction in calculation time is limited by the communication processing time between the arithmetic nodes incurred in each learning iteration.
  • an object of the first aspect of the present embodiment is to provide a neural network system, a neural network learning method, and a neural network learning program that improve the throughput of data parallel distributed learning.
  • the first aspect of the present embodiment is a neural network system that includes a memory and a plurality of processors that access the memory, wherein, in each of a plurality of learning iterations, each of the plurality of processors executes the operation of the neural network based on the input of the training data and the parameters in the neural network to calculate the output of the neural network, and calculates the gradient of the difference between the calculated output and the teacher data of the training data with respect to the parameter, or an update amount based on the gradient.
  • in a first case where the accumulation of the gradient or the update amount is not less than a threshold value, the plurality of processors execute a first update process: each processor transmits its calculated accumulation of the gradient or the update amount to other processors in the plurality of processors, the plurality of accumulations are aggregated, and each processor receives the aggregated accumulation and updates the parameter with it.
  • in a second case where the accumulation of the gradient or the update amount is less than the threshold value, the plurality of processors execute a second update process: the accumulations are not aggregated by transmission, and each of the plurality of processors updates its own parameter with the gradient or update amount it calculated.
  • the throughput of data parallel distributed learning is improved.
  • FIG. 1 is a diagram showing a configuration example of an NN system according to the present embodiment.
  • the NN system 1 is, for example, a high-performance computer, and has a main processor (CPU: Central Processing Unit) 10, a main memory 12, a subprocessor module 13, and an interface 18 with a network NET.
  • the subprocessor module 13 has, for example, four arithmetic node modules, and each arithmetic node module has a subprocessor 14 and a memory 16 accessed by the subprocessor.
  • the NN system 1 has auxiliary storage devices 20 to 26, which are large-capacity storage, and the auxiliary storage devices store the NN learning program 20, the NN program 22, the training data 24, and the parameter 26.
  • the NN learning program 20 is executed by the processors 10 and 14, and performs learning processing using the training data.
  • the NN program 22 is executed by the processors 10 and 14, and executes the calculation of the NN model.
  • the training data 24 is a plurality of data items, each having input data and a label that serves as teacher data.
  • Parameter 26 is a plurality of weights in the NN that are optimized by learning, for example.
  • the main processor 10 executes the NN learning program 20 and distributes the NN learning calculation based on the plurality of training data to the plurality of subprocessors 14 and executes them in parallel.
  • the four subprocessors 14 are arithmetic nodes composed of processor chips, and are configured to be able to communicate with each other via the bus 28.
  • the NN system 1 can provide the NN platform to the client terminals 30 and 32 via the network NET.
  • the NN system 1 may have a configuration in which a plurality of computers are arithmetic nodes and the plurality of computers can communicate with each other.
  • FIG. 2 is a diagram showing an example of learning processing in a general NN.
  • the NN in FIG. 2 has an input layer IN_L and three hidden layers, neuron layers NR_L1 to NR_L3.
  • the third neuron layer NR_L3 also serves as an output layer.
  • the outline of the NN learning process is as follows.
  • the processor 10 or 14 reads one training data having an input data and a label from the training data 24 in the storage, and inputs the input data to the input layer IN_L.
  • the processor executes the NN learning program, executes the operation of the first neuron layer NR_L1 using the input training data, and inputs the operation result to the second neuron layer NR_L2. Further, the processor executes the operation of the second neuron layer NR_L2 on that result, inputs the operation result to the third neuron layer NR_L3, and finally executes the operation of the third neuron layer NR_L3 on that result and outputs the operation result. The process in which the three neuron layers NR_L1 to NR_L3 execute their respective operations in order on the input of the input layer IN_L is called forward propagation processing FW.
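  • the layer-by-layer order of forward propagation can be sketched as follows. This is only an illustration: the layer sizes, the sigmoid activation, and all function names are assumptions, not details taken from the source.

```python
import math


def layer_forward(inputs, weights):
    """One neuron layer: each row of weights feeds one neuron (sigmoid assumed)."""
    outputs = []
    for row in weights:
        weighted_sum = sum(x * w for x, w in zip(inputs, row))
        outputs.append(1.0 / (1.0 + math.exp(-weighted_sum)))
    return outputs


def forward_propagation(inputs, layers):
    """Feed the output of each layer into the next, in order."""
    activation = inputs
    for weights in layers:  # NR_L1, NR_L2, NR_L3 in order
        activation = layer_forward(activation, weights)
    return activation


# Hypothetical all-zero weights so that every neuron outputs sigmoid(0).
layers = [
    [[0.0, 0.0], [0.0, 0.0]],  # NR_L1: 2 inputs -> 2 neurons
    [[0.0, 0.0]],              # NR_L2: 2 inputs -> 1 neuron
    [[0.0]],                   # NR_L3 (output layer): 1 input -> 1 neuron
]
out = forward_propagation([1.0, -1.0], layers)
```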
  • the processor calculates the difference E3 between the label (correct answer data) of the training data and the output of the third neuron layer NR_L3, and differentiates the difference E3 with respect to the parameter w in the neuron layer NR_L3 to obtain the gradient ΔE3.
  • the processor calculates the difference E2 in the second neuron layer NR_L2 from the difference E3, and differentiates the difference E2 with respect to the parameter w in the neuron layer NR_L2 to obtain the gradient ΔE2.
  • the processor calculates the difference E1 in the first neuron layer NR_L1 from the difference E2, and differentiates the difference E1 with respect to the parameter w in the neuron layer NR_L1 to obtain the gradient ΔE1.
  • each arithmetic is performed by the input to each layer and a plurality of parameters w.
  • a plurality of parameters w are updated by the gradient method so that the difference between the output estimated by NN based on the input data of the training data and the label (supervised data) is minimized.
  • the parameter update amount is calculated by multiplying the gradient ΔE, obtained by differentiating the difference E with respect to the parameter w, by the learning rate η.
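  • the relation between the gradient, the learning rate, and the update amount can be illustrated with a one-parameter model. This is a sketch under stated assumptions: the model output is w * x, the difference is the squared error, and the negative sign of gradient descent is folded into the update amount so that it can be added to the parameter. None of these choices are specified by the source.

```python
def squared_error(w, x, label):
    """Difference E between the model output w * x and the label."""
    return (w * x - label) ** 2


def gradient(w, x, label):
    """dE/dw for E = (w * x - label) ** 2."""
    return 2.0 * (w * x - label) * x


def update_amount(w, x, label, lr):
    """Gradient times learning rate; sign folded in (assumption) so it is added to w."""
    return -lr * gradient(w, x, label)


w = 0.0
dw = update_amount(w, x=2.0, label=4.0, lr=0.1)
w_new = w + dw  # the parameter moves toward the value that reduces E
```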
  • NNs, and especially deep NNs (hereinafter referred to as DNNs), can improve their accuracy as the amount of training data increases and as the number of learning iterations using the training data increases.
  • the learning time by the NN system increases accordingly. Therefore, in order to shorten the learning time, data parallel type distributed learning in which the learning process is distributed among a plurality of arithmetic nodes is effective.
  • a plurality of arithmetic nodes execute forward propagation processing FW and back propagation processing BW using their respective training data, and each calculates the gradient ΔE of the difference E of its NN with respect to the parameter w. Then, each of the plurality of nodes obtains the gradient ΔE, or the parameter update amount Δw obtained by multiplying the gradient by the learning rate, and these values are shared among the plurality of nodes. Further, the plurality of nodes acquire the average of the gradients ΔE or the average of the update amounts Δw, and update the parameter w of their respective NNs with the average of the update amounts Δw.
  • the process in which each of the plurality of nodes uses one training data item to perform forward propagation and back propagation to obtain the gradient or update amount, and the parameter w of each node is updated based on their average, corresponds to the mini-batch method in which a number of training data items equal to the number of nodes forms one mini-batch.
  • when each node processes a plurality of training data items, the process corresponds to the mini-batch method in which the number of training data items obtained by multiplying the number of nodes by the number of items processed by each node forms one mini-batch.
  • the above-mentioned processing for obtaining the average of the gradients or the update amounts includes Reduce processing and All reduce processing, which are part of MPI (Message Passing Interface), a standard for parallel computing.
  • in this processing, the values held by each of the plurality of arithmetic nodes are aggregated (for example, by addition), and the aggregated value is acquired by all of the plurality of arithmetic nodes. This processing requires data communication between the plurality of arithmetic nodes.
  • FIG. 3 is a diagram showing a processing flow of a plurality of arithmetic nodes in data division type distributed learning.
  • FIG. 4 is a flowchart showing specific processing of a plurality of arithmetic nodes in the data division type distributed learning.
  • four arithmetic nodes ND_1 to ND_4 each perform learning using one training data item, and a certain arithmetic node aggregates the update amounts Δw of the parameter w calculated by each of the four arithmetic nodes.
  • each arithmetic node then updates its parameter with the average value of the aggregated update amounts.
  • the outline of the data division type distributed learning will be described below with reference to FIGS. 3 and 4.
  • the four arithmetic nodes correspond to, for example, the four subprocessors 14 shown in FIG. 1. If the NN system consists of four computers, the four arithmetic nodes correspond to the processors of the four computers.
  • the four arithmetic nodes ND_1 to ND_4 execute the NN learning program and perform the following processing.
  • each of the four arithmetic nodes ND_1 to ND_4 inputs the corresponding one of the training data D1 to D4 (S10).
  • the data D1 to D4 are input to the first neuron layer NR_L1 of each arithmetic node.
  • each arithmetic node executes the forward propagation processing FW and executes the operation of each neuron layer (S11).
  • the data D1 and D2 are handwritten characters "6" and "2".
  • each arithmetic node calculates the difference E1 to E4 between the output OUT of each NN and the label LB which is the teacher data.
  • the output OUTs of the arithmetic nodes ND_1 and ND_2 are "5" and "3", and the labels LB are "6" and "2".
  • the differences E1 to E4 are the sums of squares of the differences between the output OUT and the label LB.
  • when the NN is a model for estimating handwritten numbers, the third neuron layer, which is the output layer, outputs the probabilities that the number in the input data corresponds to each of the numbers 0 to 9.
  • the calculation node calculates the sum of squares of the differences of the respective probabilities as the difference E.
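  • the sum-of-squares difference over the ten output probabilities can be sketched as follows. Encoding the label as a one-hot vector (probability 1 for the correct number, 0 for the others) is an assumption made for this example, as are the function names and the sample probabilities.

```python
def one_hot(label, num_classes=10):
    """Target probabilities: 1.0 for the labelled number, 0.0 elsewhere (assumption)."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def sum_squared_difference(probs, label):
    """Sum over all classes of (output probability - target probability) ** 2."""
    target = one_hot(label, len(probs))
    return sum((p - t) ** 2 for p, t in zip(probs, target))


# Hypothetical output: the NN favours "5" although the label is "6".
probs = [0.0] * 10
probs[5] = 0.6
probs[6] = 0.4
e = sum_squared_difference(probs, label=6)
```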
  • each arithmetic node back-propagates its difference E to each neuron layer (S13), differentiates the propagated difference E with respect to the parameter w of each neuron layer, and calculates the gradient ΔE. Further, each arithmetic node multiplies the gradient ΔE by the learning rate η to calculate the update amount Δw of the parameter w (S14).
  • the four arithmetic nodes ND_1 to ND_4 communicate their respective update amounts Δw to one another via the bus 28, and a certain arithmetic node performs the Reduce process, which aggregates the update amounts Δw1 to Δw4 of the parameters w1 to w4 of all the nodes. Aggregation is, for example, addition (or extraction of the maximum value). Then, the four arithmetic nodes perform the All reduce process, in which they receive the aggregated sum Δw_ad via the bus 28 so that it is shared by all the arithmetic nodes (S15).
  • each arithmetic node divides the aggregated sum Δw_ad by the number of arithmetic nodes, 4, to calculate the average value Δw_ad / 4 of the update amounts Δw, and adds it to its existing parameter w1 to w4 to update the parameter (S16). This completes one learning iteration on the mini-batch. When the iteration is completed, the NN parameters of all arithmetic nodes have been updated to the same value. Each arithmetic node then returns to the process S10 and executes the next learning iteration.
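  • the steps S10 to S16 above can be simulated in a single process as follows. This is a sketch, not the patented implementation: the one-parameter model w * x, the squared-error difference, the sample data, and the sign convention of the update amount are all assumptions, and the communication over the bus is replaced by an ordinary sum.

```python
def local_update_amount(w, x, label, lr):
    """Forward pass, gradient of the squared difference, and update amount."""
    out = w * x                      # forward propagation (S11)
    grad = 2.0 * (out - label) * x   # gradient of the difference (S12, S13)
    return -lr * grad                # update amount (S14), sign folded in


def data_parallel_step(params, batch, lr):
    """One learning iteration over a mini-batch with one item per node."""
    deltas = [local_update_amount(w, x, t, lr) for w, (x, t) in zip(params, batch)]
    total = sum(deltas)              # Reduce and All reduce (S15), simulated
    mean = total / len(params)       # average update amount
    return [w + mean for w in params]  # every node applies the same update (S16)


# Four hypothetical nodes, each with one (input, label) pair generated by w = 2.
params = [0.0, 0.0, 0.0, 0.0]
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
for _ in range(50):
    params = data_parallel_step(params, batch, lr=0.01)
```

  • because every node adds the same average, the parameters of all nodes stay identical after each iteration, which matches the behaviour described for S16.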
  • FIG. 5 is a diagram showing a general example of Reduce processing and All reduce processing.
  • the four arithmetic nodes ND_1 to ND_4 own the values y1 to y4, respectively.
  • each arithmetic node communicates its respective values y1 to y4 via the bus 28.
  • one arithmetic node ND_1 receives all the values y1 to y4 and calculates the aggregated value f(y1, y2, y3, y4) by applying a predetermined function f to the four values.
  • the arithmetic node ND_1 transmits the aggregated value f(y1, y2, y3, y4) to the other arithmetic nodes ND_2 to ND_4 via the bus 28, and all the arithmetic nodes share the aggregated value.
  • the arithmetic nodes ND_2 to ND_4 transmit their respective values y2 to y4 to the arithmetic node ND_1.
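  • the Reduce and All reduce pattern described above can be mimicked in a single process, as in the following sketch. The function names are hypothetical and the lists stand in for values held by separate nodes; a real system would use a communication library rather than plain Python lists.

```python
def reduce_to_root(values, f=sum):
    """Reduce: one root node receives every value and applies the function f."""
    return f(values)


def all_reduce(values, f=sum):
    """All reduce: the root's aggregated value is then shared with every node."""
    aggregated = reduce_to_root(values, f)
    return [aggregated for _ in values]


# Values y1 to y4 owned by four hypothetical nodes.
node_values = [1.0, 2.0, 3.0, 4.0]
shared_sum = all_reduce(node_values)          # aggregation by addition
shared_max = all_reduce(node_values, f=max)   # aggregation by maximum
```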
  • the values may also be aggregated in stages: the arithmetic node ND_4 sends the value y4 to the arithmetic node ND_3, which calculates the sum y3 + y4, and the arithmetic node ND_2 sends the value y2 to the arithmetic node ND_1, which calculates the sum y1 + y2.
  • when the four arithmetic nodes ND_1 to ND_4 own the array data (w1, x1, y1, z1), (w2, x2, y2, z2), (w3, x3, y3, z3), and (w4, x4, y4, z4), respectively, each arithmetic node sends its data w to the arithmetic node ND_1, its data x to the arithmetic node ND_2, its data y to the arithmetic node ND_3, and its data z to the arithmetic node ND_4. Each arithmetic node then aggregates the data it has received.
  • each arithmetic node transmits the calculated aggregated value to the other arithmetic nodes via the bus 28, and all the arithmetic nodes acquire and share the aggregated value. This process is an All reduce process.
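  • the array variant above, in which node i collects and aggregates element i from every node and the partial results are then shared, can be sketched as follows. The function names `reduce_scatter` and `all_gather` are borrowed from common collective-communication terminology and are an assumption here, as is the sample data.

```python
def reduce_scatter(arrays):
    """Node i receives element i from every node and sums what it received."""
    n = len(arrays)
    return [sum(arrays[src][i] for src in range(n)) for i in range(n)]


def all_gather(partials):
    """Every node collects the aggregated values held by all nodes."""
    return [list(partials) for _ in partials]


# Hypothetical array data (w, x, y, z) owned by four nodes.
arrays = [
    [1, 10, 100, 1000],
    [2, 20, 200, 2000],
    [3, 30, 300, 3000],
    [4, 40, 400, 4000],
]
partials = reduce_scatter(arrays)  # node i now holds the sum of element i
shared = all_gather(partials)      # every node holds all four sums
```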
  • FIG. 6 is a diagram showing an example of Reduce processing and All reduce processing when NN learning is performed by data division type distributed learning.
  • at the stage where each arithmetic node has calculated the update amounts Δw1 to Δw4 of the respective parameters w1 to w4, the Reduce process is performed: the update amounts Δw1 to Δw4 of all the arithmetic nodes are transmitted to one arithmetic node ND_1, and the arithmetic node ND_1 calculates the sum Δw_ad of the update amounts Δw1 to Δw4 as the aggregate value using the addition function f.
  • the arithmetic node ND_1 transmits the sum Δw_ad to the other arithmetic nodes ND_2 to ND_4, and all the arithmetic nodes acquire and share the sum.
  • each of the arithmetic nodes ND_1 to ND_4 calculates the average value Δw_av of the update amounts Δw by dividing the sum Δw_ad by the number of arithmetic nodes, 4, in the average processing Average. Then, each of the arithmetic nodes ND_1 to ND_4 adds the average value Δw_av of the update amounts to its existing parameter w1 to w4 in the update process Update.
  • in the above data parallel distributed learning, the plurality of training data of a mini-batch are distributed to the plurality of arithmetic nodes, the arithmetic nodes perform the operations up to the parameter gradient ΔE or the update amount Δw in parallel, and the parameter w of each arithmetic node is updated with the average value of the plurality of update amounts Δw1 to Δw4 calculated from the respective training data.
  • since the parameter w of the NN of every arithmetic node is updated with the average value of the plurality of update amounts, the adverse effect on learning of an exceptional update amount can be suppressed.
  • FIG. 7 is a flowchart of data parallel type distributed learning according to the first embodiment.
  • FIG. 7 shows a flowchart of the learning process of the arithmetic node ND_1 in the flowchart of FIG. 4, and the flowchart of the learning process of the remaining arithmetic nodes ND_2 to ND_4 is omitted.
  • the processes S10 to S16, S16A, and S20 to S23 of the learning process of the arithmetic node ND_1 shown in FIG. 7 are executed in the same way in the remaining arithmetic nodes ND_2 to ND_4.
  • the flowchart of the learning process of the four arithmetic nodes can be obtained by replacing the flowchart of the learning process of each arithmetic node of FIG. 4 with the flowchart of the learning process of the arithmetic node ND_1 of FIG.
  • NN has a large number of parameters w, but first, an example of one parameter w of NN will be described, and then how to be processed for a plurality of parameters w of NN will be described.
  • Reduce processing and All reduce processing are not executed in every learning iteration for the gradient ΔE or the update amount Δw calculated by each of the plurality of arithmetic nodes. That is, when the gradients or update amounts calculated by the plurality of arithmetic nodes are less than the threshold value, the processors do not execute the Reduce and All reduce processing, and each arithmetic node updates its parameter w using the gradient or update amount it calculated. When the gradients or update amounts calculated by the plurality of arithmetic nodes are not less than the threshold value, the processors execute the Reduce and All reduce processing and update each parameter w using the average value of the aggregated gradients or update amounts.
  • the NN system of the present embodiment occasionally skips the communication processing of the Reduce and All reduce processing, while still suppressing the harmful effects of exceptional gradients and update amounts by means of the mini-batch method, and thereby shortens the total processing time of learning.
  • in the following description, the gradient or update amount is referred to simply as the update amount Δw.
  • the Reduce and All reduce processing may instead be performed on the gradient ΔE.
  • the learning process of each arithmetic node will be described with reference to the flowchart of FIG. 7. Although not shown in FIG. 7, immediately after the start of learning, the arithmetic nodes ND_1 to ND_4 reset their cumulative update amounts Δwr1 to Δwr4 to 0. Next, the arithmetic nodes ND_1 to ND_4 input the input data of the training data and execute the forward propagation process, the calculation of the differences E1 to E4, the back propagation process, and the calculation of the update amounts Δw1 to Δw4 of the parameter w (S10 to S14).
  • each arithmetic node adds its calculated update amount Δw1 to Δw4 to its cumulative update amount Δwr1 to Δwr4 (S20).
  • in the first learning iteration, the cumulative update amounts of all the arithmetic nodes have been reset to 0, so the cumulative update amounts Δwr1 to Δwr4 are equal to the update amounts Δw1 to Δw4.
  • each arithmetic node determines whether or not the cumulative update amounts Δwr1 to Δwr4 are less than the threshold value TH (S21).
  • when the cumulative update amounts are less than the threshold value TH (YES in S21), each arithmetic node updates its parameter by adding its own update amount Δw1 to Δw4 to its parameter w1 to w4 (S16A). That is, each arithmetic node updates its parameter with the update amount it calculated itself.
  • when the cumulative update amounts are not less than the threshold value TH (NO in S21), the arithmetic nodes ND_1 to ND_4 send and receive their calculated cumulative update amounts Δwr1 to Δwr4, aggregate the cumulative update amounts of all the arithmetic nodes by adding them (Reduce processing), and share the aggregated cumulative update amount Δwr_ad among all the nodes (All reduce processing) (S15).
  • NO in the determination S21 means that not all of the cumulative update amounts Δwr1 to Δwr4 are less than the threshold value TH, that is, at least one cumulative update amount is equal to or greater than the threshold value TH.
  • YES in the determination S21 means that all of the cumulative update amounts Δwr1 to Δwr4 are less than the threshold value TH.
  • each of the arithmetic nodes ND_1 to ND_4 divides the aggregated cumulative update amount Δwr_ad by the number of arithmetic nodes, 4, to obtain the average value Δwr_ad / 4 of the cumulative update amounts, which serves as the update amount of each of the parameters w1 to w4.
  • the average value Δwr_ad / 4 of the cumulative update amounts is added to the values that the parameters w1 to w4 had before the accumulation started, and each parameter is thereby updated to a common value (S16).
  • each arithmetic node resets its cumulative update amount Δwr1 to Δwr4 to 0 (S22).
  • Each arithmetic node repeats the above learning process while the total number of learnings is less than N (S23).
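  • the flow S10 to S23 of the first embodiment can be simulated in a single process as follows. This is a sketch under stated assumptions, not the patented implementation: the one-parameter model w * x, the squared-error difference, the sample data, and all names are hypothetical, the absolute value of the cumulative update amount is compared with the threshold, and the check that all nodes are below the threshold is done by inspecting every node directly, whereas a real distributed system would need some mechanism to agree on that decision.

```python
def local_update_amount(w, x, label, lr):
    """Update amount of a one-parameter model w * x with squared error (assumed)."""
    return -lr * 2.0 * (w * x - label) * x


def train_with_threshold(batches, lr, threshold, w0=0.0, num_nodes=4):
    params = [w0] * num_nodes        # current parameter of each node
    base = [w0] * num_nodes          # parameter values before accumulation started
    cumulative = [0.0] * num_nodes   # cumulative update amounts, reset to 0
    sync_count = 0                   # number of simulated Reduce / All reduce runs
    for batch in batches:            # one learning iteration per batch
        deltas = [local_update_amount(w, x, t, lr)
                  for w, (x, t) in zip(params, batch)]            # S10 to S14
        cumulative = [c + d for c, d in zip(cumulative, deltas)]  # S20
        if all(abs(c) < threshold for c in cumulative):           # S21: YES
            # Each node applies only its own update amount (S16A).
            params = [w + d for w, d in zip(params, deltas)]
        else:                                                     # S21: NO
            mean = sum(cumulative) / num_nodes                    # S15, simulated
            params = [b + mean for b in base]                     # S16
            base = list(params)
            cumulative = [0.0] * num_nodes                        # S22
            sync_count += 1
    return params, sync_count


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
params, sync_count = train_with_threshold([batch] * 100, lr=0.01, threshold=0.05)
```

  • with a threshold of 0 the sketch synchronizes in every iteration and behaves like ordinary data parallel learning; with a very large threshold it never synchronizes, so the choice of threshold trades communication against divergence between the nodes.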
  • when the cumulative update amounts are less than the threshold value, the arithmetic nodes ND_1 to ND_4 do not perform Reduce processing and All reduce processing, so the time required for communication between the arithmetic nodes for both processes is eliminated, and the time required for the whole learning can be shortened.
  • while the Reduce and All reduce processing are skipped, each arithmetic node calculates and records its cumulative update amount Δwr1 to Δwr4.
  • when a cumulative update amount is not less than the threshold value, the arithmetic nodes aggregate the cumulative update amounts Δwr1 to Δwr4, share the aggregated cumulative update amount Δwr_ad, and update the parameters w1 to w4 as they were before the accumulation with the average value Δwr_ad / 4.
  • when the cumulative update amounts of all arithmetic nodes are less than the threshold value, the variation of the cumulative update amounts among the arithmetic nodes is relatively small, and even if the Reduce and All reduce processing are omitted, the parameter values of the arithmetic nodes do not deviate greatly from one another. However, when not all of the cumulative update amounts are less than the threshold value, that is, when at least one cumulative update amount is equal to or greater than the threshold value, the discrepancy between the parameter values becomes large.
  • in that case, the cumulative update amounts are aggregated, and the pre-accumulation parameters of all arithmetic nodes are updated with the average value, which resets them to a common value.
  • in the above description, each arithmetic node optimizes one parameter w in its NN, and the cumulative update amounts Δwr1 to Δwr4 of the parameters w1 to w4 of the arithmetic nodes are compared with the threshold value.
  • since the parameter update amounts Δw1 to Δw4 may be positive or negative, it is desirable to compare the absolute values of the cumulative update amounts Δwr1 to Δwr4 with the threshold value TH (TH is positive).
  • each arithmetic node individually compares the cumulative update amount of each of the plurality of parameters of its NN with the threshold value TH, determines whether or not the cumulative update amounts of all the parameters of its NN are less than the threshold value TH, and further determines whether or not this holds for all arithmetic nodes.
  • each arithmetic node groups the plurality of parameters of its NN into the plurality of parameters w1, w2, ..., wn of each layer in the NN, and in the determination step S21 compares the maximum of the absolute values of the cumulative update amounts of the plurality of parameters of each layer with the threshold value TH.
  • each arithmetic node determines whether or not the result of the determination S21 is less than the threshold value TH for all of the layers of its NN, and determines whether or not this holds for all arithmetic nodes. Since only the maximum value is compared with the threshold value TH, the throughput of the determination step S21 is improved.
  • each arithmetic node groups the plurality of parameters of its NN into the plurality of parameters w1, w2, ..., wn of each layer in the NN, and in the determination step S21 compares the L1 norm or the L2 norm of the cumulative update amounts of the plurality of parameters of each layer with the threshold value TH.
  • each arithmetic node determines whether or not the result of the determination is less than the threshold value TH for all of the layers of its NN, and also determines whether or not this holds for all arithmetic nodes.
  • the L1 norm is the sum of the absolute values of the cumulative update amounts of the plurality of parameters w1, w2, ..., wn.
  • the L2 norm is the square root of the sum of the squares of the absolute values of the cumulative update amounts of the plurality of parameters w1, w2, ..., wn.
  • the Lp norm is the p-th root of the sum of the p-th powers of the absolute values of the cumulative update amounts of the plurality of parameters w1, w2, ..., wn.
  • the throughput of the determination step S21 is improved because a single value per layer, obtained by converting the cumulative update amounts of the plurality of parameters of each layer into the L1 norm or the L2 norm, is compared with the threshold value.
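  • the per-layer comparison with the threshold value can be sketched as follows for the maximum absolute value and for the L1, L2, and Lp norms. The function names and the sample cumulative update amounts are assumptions made for illustration.

```python
import math


def max_abs(values):
    """Largest absolute cumulative update amount in a layer."""
    return max(abs(v) for v in values)


def l1_norm(values):
    """Sum of absolute values."""
    return sum(abs(v) for v in values)


def l2_norm(values):
    """Square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in values))


def lp_norm(values, p):
    """p-th root of the sum of the p-th powers of the absolute values."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def all_layers_below(cumulative_by_layer, threshold, measure=max_abs):
    """True only if the chosen measure of every layer is below the threshold."""
    return all(measure(layer) < threshold for layer in cumulative_by_layer)


# Hypothetical cumulative update amounts of two layers of one node.
layers = [[0.3, -0.4], [0.0, 0.1]]
skip_by_max = all_layers_below(layers, 0.45, measure=max_abs)
skip_by_l2 = all_layers_below(layers, 0.45, measure=l2_norm)
```

  • the example shows that the choice of measure matters: the same cumulative update amounts can pass the maximum-value test while failing the L2-norm test for the same threshold.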
  • FIG. 8 is a flowchart of data parallel type distributed learning according to the second embodiment.
  • FIG. 8 shows a flowchart of the learning process of the arithmetic node ND_1 in the flowchart of FIG. 4, and the flowchart of the learning process of the remaining arithmetic nodes ND_2 to ND_4 is omitted.
  • the processes S30, S10 to S16, S20 to S23, and S31 to S33 of the learning process of the arithmetic node ND_1 shown in FIG. 8 are executed in the same way in the remaining arithmetic nodes ND_2 to ND_4.
  • in general, immediately after the start of learning, the parameter gradient ΔE is large and the update amounts Δw1 to Δw4 are also large.
  • on the other hand, as learning approaches its end, the parameter gradient ΔE is small and the update amounts Δw1 to Δw4 are also small. Therefore, in the data parallel distributed learning of the first embodiment, immediately after the start of the learning process, the determination process S21 may find every time that the cumulative update amount is not less than the threshold value TH, so that the Reduce and All reduce processing are performed in every iteration. Conversely, as the learning process approaches its end, the determination process S21 may find every time that the cumulative update amount is less than the threshold value TH, so that the Reduce and All reduce processing are not performed at all.
  • (1) From the first to the (D-1)-th learning in the update cycle, each arithmetic node does not perform Reduce processing and Allreduce processing, regardless of the comparison with the threshold value TH.
  • (2) From the D-th to the (U-1)-th learning in the update cycle (U is a positive integer larger than D), if the cumulative update amounts Δwr1 to Δwr4 are less than the threshold value TH, as in the first embodiment, the Reduce processing and Allreduce processing are not performed; if they are not less than the threshold value TH, the Reduce processing and Allreduce processing are performed and each NN parameter is updated with the average value of the cumulative update amounts.
  • (3) In the U-th learning in the update cycle, each arithmetic node performs Reduce processing and Allreduce processing regardless of the comparison with the threshold value TH, and updates each NN parameter with the average value of the cumulative update amounts. (4) Each arithmetic node repeats the parameter update cycle of (1) to (3) above until the total number of learning iterations N is reached.
  • Before the D-th learning in the update cycle, each arithmetic node does not perform Reduce processing and Allreduce processing, so the number of communications is reduced. Further, from the D-th learning onward in the update cycle, based on the comparison between the cumulative update amount of the parameter and the threshold value TH, the smaller the cumulative update amount, the more consecutive times the Reduce processing and Allreduce processing are skipped; conversely, the larger the cumulative update amount, the fewer consecutive times they are skipped.
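The phases of the update cycle described above can be sketched as a small decision function. This is a minimal illustration, not the patent's implementation: the name `should_communicate` and its arguments are hypothetical, `j` is assumed to be the number of learning iterations since the last communication (already incremented for the current iteration), and the per-parameter absolute-value test is only one of the determination variants the text mentions.

```python
def should_communicate(j, d, u, cumulative_updates, threshold):
    """Decide whether Reduce/Allreduce runs in the current learning iteration.

    j: iterations since the last communication, including the current one
    d: first reference number D (below it, communication is always skipped)
    u: second reference number U, u > d (at it, communication is forced)
    cumulative_updates: cumulative update amounts gathered from all nodes
    threshold: threshold value TH
    """
    if j < d:
        # Early in the cycle: skip regardless of the threshold comparison.
        return False
    if j >= u:
        # End of the cycle: communicate regardless of the threshold comparison.
        return True
    # In between: skip only if every cumulative update amount is below TH.
    return not all(abs(c) < threshold for c in cumulative_updates)
```

The lower bound D guarantees a minimum saving in communication, while the upper bound U bounds how long the local parameters can drift apart before they are averaged again.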
  • Each of the four arithmetic nodes performs learning using one piece of training data, and each arithmetic node optimizes one parameter w in its NN.
  • Each arithmetic node stores the cumulative update amounts Δwr1 to Δwr4 by cumulatively adding the update amounts of the parameters calculated in each learning. Each arithmetic node then executes the processes S30, S31-S32, and S33 in addition to the processes of the flowchart of FIG. 7. The description below focuses on these processes.
  • As the initialization process, each arithmetic node resets the learning counter value i and the continuous non-communication counter value j to 0, and resets the cumulative update amounts Δwr1 to Δwr4 of the parameters of the arithmetic nodes to 0.
  • Each arithmetic node inputs the training data and performs forward propagation processing and back propagation processing to calculate the update amounts Δw1 to Δw4 of each parameter (S10-S14).
  • Each arithmetic node adds 1 to each of the counter values i and j, and adds the calculated update amounts Δw1 to Δw4 to the cumulative update amounts Δwr1 to Δwr4 of the parameters to update the cumulative update amounts (S31).
  • Each arithmetic node updates each of the parameters w1 to w4 with the respective update amounts Δw1 to Δw4 (S16A).
  • Each arithmetic node repeats the above processes S10-S14, S31-S32, and S16A while the continuous non-communication counter value j is less than the first reference number D.
  • In the initial learning of the update cycle of (1) to (3), the arithmetic node always skips Reduce processing and Allreduce processing.
  • Each arithmetic node determines whether the cumulative update amounts Δwr1 to Δwr4 of the parameters are all less than the threshold value TH in all the arithmetic nodes.
  • Each arithmetic node updates its parameters w1 to w4 with the respective update amounts Δw1 to Δw4 (S16A).
  • The arithmetic nodes ND_1 to ND_4 execute Reduce processing and Allreduce processing (S15), and update each of the parameters w1 to w4 with the average value Δwr_add / 4 of the cumulative update amounts (S16). Each arithmetic node then resets the continuous non-communication counter value j to 0 and the cumulative update amounts Δwr1 to Δwr4 to 0 (S22A). In this case, the update cycle of (1) to (3) is reset.
  • Each arithmetic node executes Reduce processing and Allreduce processing (S15), updates the parameter with the average value of the cumulative update amounts (S16), and resets the continuous non-communication counter value j and the cumulative update amount to 0 (S22A). This resets the update cycle.
  • In the determination step S21, as in the first embodiment, it may be determined whether the absolute value of the cumulative update amount of each parameter w is less than the threshold value TH, whether the maximum absolute value of the cumulative update amounts of the multiple parameters of each layer is less than the threshold value TH, whether the L1 norm or L2 norm of the cumulative update amounts of the multiple parameters of each layer is less than the threshold value TH, and so on.
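The flow described above (initialization, local update S16A, accumulation S31, and the averaged update S15-S16 with reset S22A) can be simulated in a single process. The sketch below is a toy model under several assumptions that are not in the source: the function name `train` is hypothetical, each node minimizes the quadratic loss 0.5 * (w - target)^2, parameters are updated by subtracting the update amount, Python lists stand in for the arithmetic nodes, and Reduce plus Allreduce is modeled by taking a mean.

```python
def train(targets, w0=0.0, lr=0.1, d=2, u=4, threshold=0.05, n_iters=40):
    """Simulate the update cycle for len(targets) nodes, one parameter each."""
    n = len(targets)
    w = [w0] * n       # local copy of the parameter on each node
    w_synced = w0      # parameter value before the current accumulation
    acc = [0.0] * n    # cumulative update amounts, one per node
    j = 0              # continuous non-communication counter
    comms = 0          # number of Reduce/Allreduce rounds performed
    for _ in range(n_iters):
        # Gradient of the toy loss is (w - target); update amount is lr * gradient.
        dw = [lr * (w[k] - targets[k]) for k in range(n)]
        j += 1
        acc = [a + x for a, x in zip(acc, dw)]
        if j < d:
            communicate = False      # always skip early in the cycle
        elif j >= u:
            communicate = True       # always communicate at the end of the cycle
        else:
            communicate = any(abs(a) >= threshold for a in acc)
        if communicate:
            mean_acc = sum(acc) / n  # stands in for Reduce + Allreduce
            w_synced -= mean_acc     # update the pre-accumulation parameter
            w = [w_synced] * n
            acc = [0.0] * n
            j = 0
            comms += 1
        else:
            # Local update with each node's own update amount.
            w = [wk - x for wk, x in zip(w, dw)]
    return w, comms
```

Calling `train([1.0, 2.0, 3.0, 4.0])` returns the final local parameters and the number of communication rounds, which can never exceed `n_iters // d` and never falls below `n_iters // u`.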
  • FIG. 9 is a diagram showing a change example of the learning update cycle in the second embodiment.
  • FIG. 9 (1) is an example of a change in the case where the update cycle of the present embodiment is not performed.
  • One learning includes forward propagation processing FW, back propagation processing BW, Reduce processing and Allreduce processing CM, and parameter update processing UP for the training data.
  • The update process UP1 in FIG. 9 (1) is a process of updating the parameters with the average value Δw_add / 4 of the update amounts of the NN parameters of all the arithmetic nodes.
  • Each arithmetic node executes the Reduce process and Allreduce process CM and the parameter update process UP1 in every learning.
  • FIG. 9 (2) is an example of a change in the learning update cycle in the second embodiment.
  • the first to fourth learnings in FIG. 9 (2) correspond to the learnings within the above-mentioned update cycle.
  • The update process UP2 in FIG. 9 (2) is a process (S16A) in which each arithmetic node updates each parameter with the update amount Δw of each NN parameter.
  • The update process UP3 in FIG. 9 (2) is a process (S16) in which each arithmetic node updates each parameter with the average value Δwr_add / 4 of the cumulative update amount Δwr of each NN parameter.
  • j D
  • each arithmetic node executes the Reduce process and the All reduce process CM, and executes the update process UP3.
  • the third learning and the fourth learning are the same as the first learning and the second learning, respectively.
  • Since the arithmetic nodes do not execute the Reduce process and Allreduce process CM in every learning, the computation time of the entire learning is reduced by the amount of the skipped processing.
  • The cumulative update amounts Δwr of the parameters were aggregated in the Reduce process and the Allreduce process, and the parameters of each NN were updated with the average value Δwr_add / 4.
  • Alternatively, Reduce processing and Allreduce processing may be performed on the gradient ΔE of the difference. This is because the parameter update amount Δw is calculated by multiplying the gradient ΔE of the difference by the learning rate η, and therefore the cumulative update amount Δwr can be calculated by multiplying the cumulative gradient by the learning rate.
  • When each arithmetic node does not perform Reduce processing and Allreduce processing, it updates the accumulation of the gradient ΔE of the difference; when Reduce processing and Allreduce processing are performed, the cumulative gradients ΔEr of the arithmetic nodes are aggregated. The aggregated value (added value) ΔEr_add of the cumulative gradients ΔEr is then shared by all arithmetic nodes, and the parameter w before accumulation is updated with the average value Δwr_add / 4 of the cumulative update amount, which is obtained by multiplying the average value ΔEr_add / 4 of the aggregated value by the learning rate η.
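The equivalence between accumulating update amounts and accumulating gradients can be checked numerically. The sketch below assumes a learning rate that stays constant within one accumulation window, which is what makes the two quantities equal; the function names are hypothetical and not taken from the source.

```python
import math


def cumulative_update(gradients, lr):
    # Accumulate the per-iteration update amounts lr * gradient.
    total = 0.0
    for g in gradients:
        total += lr * g
    return total


def cumulative_update_from_gradient_sum(gradients, lr):
    # Accumulate the gradients first, then scale once by the learning rate.
    return lr * sum(gradients)


def averaged_update(cumulative_gradients, lr):
    # cumulative_gradients holds one accumulated gradient per arithmetic node.
    # The sum stands in for the aggregated value shared by Allreduce.
    return lr * sum(cumulative_gradients) / len(cumulative_gradients)


grads = [0.5, -0.2, 0.1]
same = math.isclose(
    cumulative_update(grads, 0.01),
    cumulative_update_from_gradient_sum(grads, 0.01),
)
```

Here `same` evaluates to `True`. If the learning rate changed between iterations inside one window, the gradient sum alone would no longer determine the cumulative update amount, so that case would need the per-iteration rates as well.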
  • Each arithmetic node executes an NN operation on one piece of training data in each learning.
  • Each arithmetic node may execute NN operations on a plurality of training data in a plurality of processes.
  • The number of training data in one batch is the number of training data of each arithmetic node multiplied by the number of arithmetic nodes (4 in the above example).
  • Each arithmetic node updates its NN parameter by using the average value of the gradients ΔE of the plurality of differences E calculated by the plurality of processes, or the average value of the update amounts Δw of the parameter.
  • The multiple arithmetic nodes aggregate the accumulations of their respective gradients or update amounts, the average of the aggregated values is shared by all arithmetic nodes, and the parameters of each NN are updated with the average of the aggregated values.
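The batch-size arithmetic and the per-node averaging described above can be written out as follows. Both function names are hypothetical, and the gradients are assumed to be plain floats for a single parameter.

```python
def global_batch_size(samples_per_node, num_nodes):
    # Training data per node multiplied by the number of arithmetic nodes.
    return samples_per_node * num_nodes


def node_gradient(sample_gradients):
    # Average of the gradients computed by the processes running on one node.
    return sum(sample_gradients) / len(sample_gradients)
```

For example, 8 training data per node on 4 nodes gives a batch of 32, and each node would feed the averaged gradient (or the corresponding update amount) into its accumulation.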
  • Deep NNs include, for example, a convolutional NN having a plurality of convolution layers, pooling layers, and a fully connected layer, an autoencoder NN whose input layer and output layer have the same number of nodes, a recurrent NN, and the like.
  • NN system; 10: Main processor; 13: Subprocessor module; 14: Subprocessor, arithmetic node; 20: Neural network learning program; 22: Neural network program; 24: Training data; 26: Parameter w; ND_1 to ND_4: Arithmetic nodes; w1 to w4: Parameters; Δw1 to Δw4: Parameter update amounts; Δwr1 to Δwr4: Cumulative update amounts; Δwr_add: Aggregate value of the cumulative update amounts; Δwr_add / 4: Average value of the cumulative update amounts

PCT/JP2020/000644 2020-01-10 2020-01-10 Neural network system, neural network learning method, and neural network learning program Ceased WO2021140643A1 (ja)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP20912017.9A EP4089586A4 (en) 2020-01-10 2020-01-10 NEURAL NETWORK SYSTEM, NEURAL NETWORK TRAINING METHOD AND NEURAL NETWORK TRAINING PROGRAM
JP2021569687A JP7453563B2 (ja) 2020-01-10 2020-01-10 Neural network system, neural network learning method, and neural network learning program
PCT/JP2020/000644 WO2021140643A1 (ja) 2020-01-10 2020-01-10 Neural network system, neural network learning method, and neural network learning program
CN202080092251.0A CN114930350A (zh) 2020-01-10 2020-01-10 Neural network system, neural network learning method, and neural network learning program
US17/832,733 US20220300790A1 (en) 2020-01-10 2022-06-06 Neural network system, neural network learning method, and neural network learning program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/000644 WO2021140643A1 (ja) 2020-01-10 2020-01-10 Neural network system, neural network learning method, and neural network learning program

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/832,733 Continuation US20220300790A1 (en) 2020-01-10 2022-06-06 Neural network system, neural network learning method, and neural network learning program

Publications (1)

Publication Number Publication Date
WO2021140643A1 true WO2021140643A1 (ja) 2021-07-15

Family

ID=76787793

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/000644 Ceased WO2021140643A1 (ja) 2020-01-10 2020-01-10 Neural network system, neural network learning method, and neural network learning program

Country Status (5)

Country Link
US (1) US20220300790A1
EP (1) EP4089586A4
JP (1) JP7453563B2
CN (1) CN114930350A
WO (1) WO2021140643A1

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169532A (zh) * 2022-07-06 2022-10-11 北京灵汐科技有限公司 Neural network training method and apparatus based on a many-core system, and electronic device

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7413528B2 (ja) * 2021-12-03 2024-01-15 三菱電機株式会社 Trained model generation system, trained model generation method, information processing device, program, and estimation device
US20250106120A1 (en) * 2022-01-13 2025-03-27 Lg Electronics Inc. Method by which reception device performs end-to-end training in wireless communication system, reception device, processing device, storage medium, method by which transmission device performs end-to-end training, and transmission device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012079080A (ja) 2010-10-01 2012-04-19 Nippon Hoso Kyokai <Nhk> Parameter learning device and program therefor
JP2018120470A (ja) 2017-01-26 2018-08-02 日本電気株式会社 Communication system, distributed computing system, node, information sharing method, and program
JP2019109875A (ja) * 2017-12-18 2019-07-04 株式会社東芝 System, program, and method
JP2019212111A (ja) * 2018-06-06 2019-12-12 株式会社Preferred Networks Distributed learning method and distributed learning device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
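KR102732517B1 (ko) * 2018-07-04 2024-11-20 삼성전자주식회사 Method and apparatus for processing parameters in a neural network
CN110795228B (zh) * 2018-08-03 2023-08-25 伊姆西Ip控股有限责任公司 Method and article of manufacture for training a deep learning model, and computing system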
US10776164B2 (en) * 2018-11-30 2020-09-15 EMC IP Holding Company LLC Dynamic composition of data pipeline in accelerator-as-a-service computing environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4089586A4

Also Published As

Publication number Publication date
JP7453563B2 (ja) 2024-03-21
EP4089586A1 (en) 2022-11-16
US20220300790A1 (en) 2022-09-22
EP4089586A4 (en) 2023-02-01
CN114930350A (zh) 2022-08-19
JPWO2021140643A1 2021-07-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912017

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021569687

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020912017

Country of ref document: EP

Effective date: 20220810

WWW Wipo information: withdrawn in national office

Ref document number: 2020912017

Country of ref document: EP