WO2018099084A1 - Neural network model training method, apparatus, chip and system - Google Patents

Neural network model training method, apparatus, chip and system

Info

Publication number
WO2018099084A1
Authority
WO
WIPO (PCT)
Prior art keywords
iteration
module
gradient
server module
working
Prior art date
Application number
PCT/CN2017/092091
Other languages
English (en)
French (fr)
Inventor
张长征
白小龙
涂丹丹
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP17875081.6A priority Critical patent/EP3540652B1/en
Publication of WO2018099084A1 publication Critical patent/WO2018099084A1/zh
Priority to US16/424,760 priority patent/US20190279088A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/04 — Architecture, e.g. interconnection topology
    • G06N3/08 — Learning methods
    • G06N3/084 — Backpropagation, e.g. using gradient descent

Definitions

  • the embodiments of the present application relate to the field of machine learning, and in particular, to a neural network model training method, apparatus, chip, and system.
  • FIG. 1 exemplarily shows a schematic diagram of a distributed training system.
  • a server module set (in English, "servers") 101 and a working module set (in English, "workers") 102 are provided.
  • the server module set may include multiple server modules (each called a "server" in English), and the working module set may include multiple working modules (each called a "worker" in English); a server module is similar to a master node, and a working module can be regarded as a computation executor.
  • the distributed training system includes a plurality of distributed nodes, each of which may include one or more working modules, and may also include one or more server modules.
  • with reference to Figure 1, the signaling interaction between the server modules and the working modules in the distributed training system is described in detail.
  • Figure 1 includes N working modules and P server modules; the N working modules and P server modules are used to train the model parameters in the neural network model.
  • the training of one model parameter is taken as an example:
  • the distributed computing platform is started and the application is deployed; the server module is initialized to obtain the initialized model parameter ω_1, and each working module pulls the global model parameter ω_1 down from the server module;
  • each working module performs the first iteration: it reads sample data and computes a local gradient based on the global model parameter ω_1; working module 1 computes the local gradient Δω_1-1, working module 2 computes the local gradient Δω_2-1, ..., and working module N computes the local gradient Δω_N-1;
  • each working module performs the second iteration: each working module pushes (in English, "push") the local gradients Δω_1-1, Δω_2-1, ..., Δω_N-1 generated in the first iteration up to the server module;
  • the server module calculates the global gradient Δω_1 of the first iteration from the local gradients Δω_1-1, Δω_2-1, ..., Δω_N-1, and each working module pulls (in English, "pull") the global gradient Δω_1 down from the server module; each working module then updates the local model parameter ω_1 to the model parameter ω_2 according to the global gradient Δω_1;
  • each working module then reads sample data and computes a local gradient based on the updated model parameter ω_2: working module 1 computes the local gradient Δω_1-2, working module 2 computes the local gradient Δω_2-2, ..., and working module N computes the local gradient Δω_N-2;
  • each working module again pushes its local gradient up to the server module and again pulls the resulting global gradient down from the server module, and the iterations continue in this way until the current training period is completed.
  • after the last iteration, each working module reports its last updated local model parameters to the server module; the server module determines an average value of the updated local model parameters reported by the working modules and thereby obtains the trained model parameters.
  • this process can be called a training period (in English, "epoch"), and the model parameters can be trained through multiple training periods.
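The synchronous training period described above can be sketched as follows. This is a minimal illustrative simulation, not the patent's implementation: the squared loss, learning rate, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3          # number of working modules
K = 5          # iterations per training period (epoch)
dim = 4        # size of the model parameter vector
lr = 0.1       # learning rate (assumed)

def local_gradient(params, sample):
    """Worker-side gradient of a simple squared loss on one sample (assumed loss)."""
    return 2.0 * (params - sample)

params = np.zeros(dim)            # initialized model parameter ω_1 from the server
for t in range(K):
    samples = rng.normal(size=(N, dim))
    # each working module computes its local gradient from the current parameters ...
    local_grads = [local_gradient(params, samples[n]) for n in range(N)]
    # ... pushes it up to the server module, which forms the global gradient ...
    global_grad = np.mean(local_grads, axis=0)
    # ... and every working module pulls it down and updates the parameters.
    params = params - lr * global_grad

print(params.shape)  # (4,)
```

Note that every iteration here blocks on the push/pull round trip; removing that wait is exactly what the embodiments below address.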
  • in each iteration of this process, each working module first pushes its local gradient up to the server module, waits to pull the global gradient down from the server module, then updates its local model parameters according to the global gradient, and only then computes the next local gradient from the updated local model parameters.
  • the time taken by each iteration therefore includes both the communication time of pushing the local gradient up to the server module and pulling the global gradient down from it, and the computation time of updating the local model parameters and calculating the local gradient; as a result, each iteration takes a long time, which leads to a large delay in training the model parameters.
  • the embodiments of the present application provide a neural network model training method, apparatus, chip, and system, which are used to shorten the training delay of model parameters and improve the efficiency of model parameter training.
  • an embodiment of the present application provides a neural network model training method.
  • the embodiment of the present application is applicable to a training system including a server module and N working modules.
  • the server module and the N working modules are configured to train model parameters in at least one training period, and each training period in the at least one training period includes K iterations.
  • for the i-th iteration of each of the N working modules in each training period (where N and K are integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K), each working module executes the following in parallel: calculating the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, and, in the case where i is smaller than K, calculating the local gradient of the (i+1)-th iteration according to the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration; and pulling the global gradient of the r-th iteration down from the server module and/or pushing the local gradient of the f-th iteration up to the server module, where r and f are each positive integers less than or equal to i.
  • the first process and the second process are executed in parallel in each iteration process in the embodiment of the present application.
  • the first process is a computing process, and specifically includes calculating a model parameter of the (i+1)th iteration, and calculating a local gradient of the (i+1)th iteration.
  • the second process is a communication process, specifically including pulling the global gradient of the rth iteration from the server module and/or pushing the local gradient of the fth iteration to the server module.
  • in the first process, the model parameters of the (i+1)-th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, so that, unlike the prior art, the working module does not have to wait until the global gradient of the i-th iteration has been pulled down from the server module before it can calculate the model parameters of the (i+1)-th iteration; this shortens the duration of one iteration and improves the efficiency of model parameter training.
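As a rough illustration of running the two processes of one iteration in parallel, the sketch below overlaps computation with a background communication thread. The server's reply is simulated locally and all names are illustrative assumptions, not the patent's implementation.

```python
import threading
import queue
import numpy as np

rng = np.random.default_rng(1)
lr, dim, K = 0.1, 4, 5

pulled = queue.Queue()   # global gradients arriving from the (simulated) server module

def communicate(local_grad):
    """Second process: push the local gradient up, pull a global gradient down.
    A real system would talk to the server module here; we simulate the reply."""
    pulled.put(local_grad * 0.9)

params = np.zeros(dim)
local_grad = rng.normal(size=dim)        # local gradient of iteration 1 (assumed)
for i in range(1, K + 1):
    comm = threading.Thread(target=communicate, args=(local_grad.copy(),))
    comm.start()                          # second process: communication
    # first process, overlapped with the communication above:
    params = params - lr * local_grad     # ω_{i+1} from ω_i and the local gradient Δω_i
    sample = rng.normal(size=dim)
    local_grad = 2.0 * (params - sample)  # local gradient of iteration i+1
    comm.join()

print(pulled.qsize())  # K global gradients were pulled while computing
```

A real worker would additionally fold the pulled global gradients back into later updates, as the "first condition" below describes.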
  • optionally, the working module calculates the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration as follows: when the working module determines that a global gradient of a j-th iteration satisfying a first condition has been pulled down from the server module, it calculates the model parameters of the (i+1)-th iteration according to the global gradient of the j-th iteration, the local gradient of the i-th iteration, and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i; the first condition includes that the global gradient of the j-th iteration has not been used for the calculation of the model parameters in any iteration between the first iteration and the i-th iteration.
  • in this way, the model parameters of the (i+1)-th iteration can be calculated from a global gradient of a j-th iteration, already pulled down from the server module, that satisfies the first condition, which improves the accuracy of the calculated model parameters of the (i+1)-th iteration.
  • moreover, the global gradient of the j-th iteration satisfying the first condition is selected from the global gradients that have already been pulled down from the server module, so there is no need to wait for the communication process; this further shortens the iteration duration and improves the efficiency of model parameter training.
  • optionally, when the working module determines that no global gradient satisfying the first condition has been pulled down from the server module, it calculates the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration. In this way, there is no need to wait for the communication process, which further shortens the iteration duration and improves the efficiency of model parameter training.
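A hedged sketch of the two update cases above. How the global and local gradients are combined is an assumption (this summary does not fix a formula); here they are simply averaged when a usable global gradient exists, and all names are illustrative.

```python
def update(params, local_grad, lr=0.1, global_grad=None):
    """One parameter update: uses only the local gradient of iteration i,
    or folds in an available global gradient of iteration j (assumed rule)."""
    if global_grad is None:
        # no global gradient satisfying the first condition was pulled down
        step = local_grad
    else:
        # a global gradient of a jth iteration is available and not yet used
        step = [0.5 * (l + g) for l, g in zip(local_grad, global_grad)]
    return [p - lr * s for p, s in zip(params, step)]

print(update([1.0, 2.0], [0.2, 0.4]))                           # ≈ [0.98, 1.96]
print(update([1.0, 2.0], [0.2, 0.4], global_grad=[0.4, 0.2]))   # ≈ [0.97, 1.97]
```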
  • optionally, the first condition further includes: the global gradient of the j-th iteration is, among all global gradients that have been pulled down from the server module, the one belonging to the latest batch of iterations.
  • updating the model parameters according to the global gradient of the iteration nearest to the current iteration can accelerate the convergence of the model parameters.
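A minimal sketch of this selection rule, assuming the pulled global gradients are kept in a dictionary keyed by iteration number and consumed gradients are tracked in a set (all names are illustrative assumptions):

```python
def pick_global_gradient(pulled, used, i):
    """pulled: dict {iteration j: global gradient}; used: set of iterations
    whose global gradient was already consumed; i: current iteration.
    Returns the newest unused global gradient with j <= i, or (None, None)."""
    candidates = [j for j in pulled if j <= i and j not in used]
    if not candidates:
        return None, None          # fall back to the local-gradient-only update
    j = max(candidates)            # prefer the latest batch of iterations
    used.add(j)                    # mark as used, per the first condition
    return j, pulled[j]

used = set()
pulled = {1: [0.5], 3: [0.2]}
j, g = pick_global_gradient(pulled, used, i=4)
print(j)   # 3: the newest unused global gradient
j2, _ = pick_global_gradient(pulled, used, i=4)
print(j2)  # 1: iteration 3 was consumed, so the next-newest is chosen
```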
  • optionally, the global gradient of the j-th iteration is determined according to the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N.
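One plausible server-side reading of this rule is sketched below. Averaging the reported local gradients is an assumption (the text only says the server module calculates the global gradient from them), and all names are illustrative.

```python
import numpy as np

def try_form_global_gradient(reported, M):
    """reported: local gradients pushed up for iteration j so far.
    The global gradient can be formed once M of the N working modules
    (1 <= M <= N) have reported; with M < N, stragglers are not waited for."""
    if len(reported) < M:
        return None                      # not enough workers have reported yet
    return np.mean(reported, axis=0)     # aggregate whatever has been reported

reported = [np.array([1.0, 3.0])]
assert try_form_global_gradient(reported, M=2) is None
reported.append(np.array([3.0, 1.0]))
g = try_form_global_gradient(reported, M=2)
print(g)  # [2. 2.]
```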
  • optionally, the working module pulls the global gradient of the r-th iteration down from the server module and/or pushes the local gradient of the f-th iteration up to the server module; this includes any of the following: pulling down the global gradient of the r-th iteration; pulling down the global gradient of the r-th iteration and pushing up the local gradient of the (i-1)-th iteration; pulling down the global gradient of the r-th iteration and pushing up the local gradient of the i-th iteration; pushing up only the local gradient of the (i-1)-th iteration; or pushing up only the local gradient of the i-th iteration.
  • in this way, the flexibility of the working module is improved; moreover, the local gradient of the iteration nearest to the current iteration can be pushed up to the server module as far as possible, thereby accelerating the convergence of the model parameters.
  • optionally, the method further includes: after the working module calculates the local gradient of the K-th iteration, it calculates the model parameters of the (K+1)-th iteration according to the local gradient of the K-th iteration and the model parameters of the K-th iteration, and then pushes the model parameters of the (K+1)-th iteration up to the server module; the model parameters of the (K+1)-th iteration are used to enable the server module to determine the model parameters of the first iteration of the next training period according to the model parameters of the (K+1)-th iteration pushed up by each of the N working modules and the iteration count K. In this way, the accuracy of the model parameters across training periods is improved.
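Since the background section mentions that the server determines an average value of the parameters reported by the working modules, one plausible sketch of this end-of-period rule is the following (averaging and all names are assumptions):

```python
import numpy as np

def next_period_initial_params(pushed_params):
    """pushed_params: the (K+1)-th model parameters pushed up by each of the
    N working modules; returns the 1st-iteration parameters of the next
    training period (assumed to be their average)."""
    return np.mean(pushed_params, axis=0)

pushed = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
print(next_period_initial_params(pushed))  # [2. 3.]
```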
  • an embodiment of the present application provides a neural network model training apparatus, where the training apparatus includes N working modules and is applicable to a training system including a server module and the N working modules; the server module and the N working modules are configured to train the model parameters in at least one training period, and each training period in the at least one training period includes K iterations; each of the N working modules includes a communication module and a calculation module; for the i-th iteration of each working module in each training period, N and K are integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K.
  • in the i-th iteration, the communication module and the calculation module of each working module run in parallel: the calculation module is configured to calculate the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, and, in the case where i is less than K, to calculate the local gradient of the (i+1)-th iteration according to the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration;
  • the communication module is configured to pull the global gradient of the r-th iteration down from the server module and/or push the local gradient of the f-th iteration up to the server module, where r and f are each positive integers less than or equal to i.
  • the communication module and the calculation module run in parallel during each iteration: the calculation module executes the first process and the communication module executes the second process.
  • the first process is a computation process, which specifically includes calculating the model parameters of the (i+1)-th iteration and calculating the local gradient of the (i+1)-th iteration; the second process is a communication process, which specifically includes pulling the global gradient of the r-th iteration down from the server module and/or pushing the local gradient of the f-th iteration up to the server module.
  • in the first process, the model parameters of the (i+1)-th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, so that, unlike the prior art, the working module does not have to wait until the global gradient of the i-th iteration has been pulled down from the server module before it can calculate the model parameters of the (i+1)-th iteration; this shortens the duration of one iteration and improves the efficiency of model parameter training.
  • optionally, the calculation module is configured to: when it is determined that a global gradient of a j-th iteration satisfying the first condition has been pulled down from the server module, calculate the model parameters of the (i+1)-th iteration according to the global gradient of the j-th iteration, the local gradient of the i-th iteration, and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i; the first condition includes that the global gradient of the j-th iteration has not been used for the calculation of the model parameters in any iteration between the first iteration and the i-th iteration. In this way, there is no need to wait for the communication process, which further shortens the iteration duration and improves the efficiency of model parameter training.
  • optionally, the calculation module is configured to: when no global gradient of a j-th iteration satisfying the first condition has been pulled down from the server module, calculate the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration.
  • optionally, the first condition further includes: the global gradient of the j-th iteration is, among all global gradients that have been pulled down from the server module, the one belonging to the latest batch of iterations.
  • updating the model parameters according to the global gradient of the iteration nearest to the current iteration can accelerate the convergence of the model parameters.
  • in this way, the model parameters of the (i+1)-th iteration can be calculated from a global gradient of a j-th iteration, already pulled down from the server module, that satisfies the first condition, which improves the accuracy of the calculated model parameters of the (i+1)-th iteration.
  • moreover, the global gradient of the j-th iteration satisfying the first condition is selected from the global gradients that have already been pulled down from the server module, so there is no need to wait for the communication process; this further shortens the iteration duration and improves the efficiency of model parameter training.
  • optionally, the global gradient of the j-th iteration is determined according to the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N.
  • optionally, the communication module is configured to do any of the following: pull the global gradient of the r-th iteration down from the server module; pull the global gradient of the r-th iteration down from the server module and push the local gradient of the (i-1)-th iteration up to the server module; pull the global gradient of the r-th iteration down from the server module and push the local gradient of the i-th iteration up to the server module; push the local gradient of the (i-1)-th iteration up to the server module; or push the local gradient of the i-th iteration up to the server module. In this way, the flexibility of the working module is improved; moreover, the local gradient of the iteration nearest to the current iteration can be pushed up to the server module as far as possible, thereby accelerating the convergence of the model parameters.
  • optionally, the communication module is further configured to: after the calculation module calculates the local gradient of the K-th iteration and calculates the model parameters of the (K+1)-th iteration according to the local gradient of the K-th iteration and the model parameters of the K-th iteration, push the model parameters of the (K+1)-th iteration up to the server module; the model parameters of the (K+1)-th iteration are used to enable the server module to determine the model parameters of the first iteration of the next training period according to the model parameters of the (K+1)-th iteration pushed up by each of the N working modules and the iteration count K. In this way, the accuracy of the model parameters across training periods is improved.
  • an embodiment of the present application provides a neural network model training apparatus, where the training apparatus includes a processor, a memory, and a transceiver, and the processor includes N processor cores; the training apparatus is applicable to a training system including a server module and the N processor cores, where the server module and the N processor cores are configured to train the model parameters in at least one training period, and each training period in the at least one training period includes K iterations;
  • the memory is configured to store instructions; the processor is configured to execute the instructions stored in the memory and to control data transfer between the transceiver and the server module; when the processor executes the instructions stored in the memory, each of the N processor cores is configured to calculate the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, and, in the case where i is less than K, to calculate the local gradient of the (i+1)-th iteration according to the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration;
  • the transceiver is configured to pull the global gradient of the r-th iteration down from the server module and/or push the local gradient of the f-th iteration up to the server module, where r and f are each positive integers less than or equal to i.
  • the memory is used to store the global gradients pulled down from the server module, as well as the calculated local gradients.
  • the transceiver and the processor run in parallel: the processor executes the first process and the transceiver executes the second process.
  • the first process is a computation process, which specifically includes calculating the model parameters of the (i+1)-th iteration and calculating the local gradient of the (i+1)-th iteration; the second process is a communication process, which specifically includes pulling the global gradient of the r-th iteration down from the server module and/or pushing the local gradient of the f-th iteration up to the server module.
  • in the first process, the model parameters of the (i+1)-th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, so that, unlike the prior art, the working module does not have to wait until the global gradient of the i-th iteration has been pulled down from the server module before it can calculate the model parameters of the (i+1)-th iteration; this shortens the duration of one iteration and improves the efficiency of model parameter training.
  • optionally, the processor is configured to: when it is determined that a global gradient of a j-th iteration satisfying the first condition has been pulled down from the server module, calculate the model parameters of the (i+1)-th iteration according to the global gradient of the j-th iteration, the local gradient of the i-th iteration, and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i; the first condition includes that the global gradient of the j-th iteration has not been used for the calculation of the model parameters in any iteration between the first iteration and the i-th iteration. In this way, there is no need to wait for the communication process, which further shortens the iteration duration and improves the efficiency of model parameter training.
  • optionally, the processor is configured to: when no global gradient of a j-th iteration satisfying the first condition has been pulled down from the server module, calculate the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration.
  • optionally, the first condition further includes: the global gradient of the j-th iteration is, among all global gradients that have been pulled down from the server module, the one belonging to the latest batch of iterations.
  • updating the model parameters according to the global gradient of the iteration nearest to the current iteration can accelerate the convergence of the model parameters.
  • in this way, the model parameters of the (i+1)-th iteration can be calculated from a global gradient of a j-th iteration, already pulled down from the server module, that satisfies the first condition, which improves the accuracy of the calculated model parameters of the (i+1)-th iteration.
  • moreover, the global gradient of the j-th iteration satisfying the first condition is selected from the global gradients that have already been pulled down from the server module, so there is no need to wait for the communication process; this further shortens the iteration duration and improves the efficiency of model parameter training.
  • optionally, the global gradient of the j-th iteration is determined according to the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N.
  • optionally, the transceiver is configured to do any of the following: pull the global gradient of the r-th iteration down from the server module; pull the global gradient of the r-th iteration down from the server module and push the local gradient of the (i-1)-th iteration up to the server module; pull the global gradient of the r-th iteration down from the server module and push the local gradient of the i-th iteration up to the server module; push the local gradient of the (i-1)-th iteration up to the server module; or push the local gradient of the i-th iteration up to the server module. In this way, the flexibility of the working module is improved; moreover, the local gradient of the iteration nearest to the current iteration can be pushed up to the server module as far as possible, thereby accelerating the convergence of the model parameters.
  • optionally, the transceiver is further configured to: after the processor calculates the local gradient of the K-th iteration and calculates the model parameters of the (K+1)-th iteration according to the local gradient of the K-th iteration and the model parameters of the K-th iteration, push the model parameters of the (K+1)-th iteration up to the server module; the model parameters of the (K+1)-th iteration are used to enable the server module to determine the model parameters of the first iteration of the next training period according to the model parameters of the (K+1)-th iteration pushed up by each of the N working modules and the iteration count K. In this way, the accuracy of the model parameters across training periods is improved.
  • an embodiment of the present application provides a chip for training a neural network model, where the chip is applicable to a training system including N chips and a server module; the server module and the N chips are configured to train the model parameters in at least one training period, and each training period in the at least one training period includes K iterations; each of the N chips is configured to perform the method performed by the working module in the first aspect above.
  • an embodiment of the present application provides a system for training a neural network model, where the system includes a server module and N working modules; the server module and the N working modules are configured to train the model parameters in at least one training period, and each training period in the at least one training period includes K iterations; for the i-th iteration of each of the N working modules, each working module is configured to execute the following in parallel: calculating the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, and, in the case where i is less than K, calculating the local gradient of the (i+1)-th iteration according to the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration; and pulling the global gradient of the r-th iteration down from the server module and/or pushing the local gradient of the f-th iteration up to the server module, where r and f are each positive integers less than or equal to i, and N and K are integers greater than or equal to 1.
  • an embodiment of the present application provides a computer program product comprising a computer program (which may also be referred to as code, or instructions) that, when executed, causes a computer to perform the method in any possible implementation of the first aspect described above.
  • an embodiment of the present application provides a computer-readable medium storing a computer program (which may also be referred to as code, or instructions) that, when executed on a computer, causes the computer to perform the method in any possible implementation of the first aspect described above.
  • the first process and the second process are executed in parallel in each iteration process in the embodiment of the present application.
  • the first process is a computing process, and specifically includes calculating a model parameter of the (i+1)th iteration, and calculating a local gradient of the (i+1)th iteration.
  • the second process is a communication process, specifically including pulling the global gradient of the rth iteration from the server module and/or pushing the local gradient of the fth iteration to the server module.
  • in the first process, the model parameters of the (i+1)-th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, which avoids the prior-art scheme of waiting for the global gradient of the i-th iteration to be pulled down from the server module before the model parameters of the (i+1)-th iteration can be calculated; this shortens the duration of one iteration and improves the efficiency of model parameter training.
  • FIG. 1 is a schematic diagram of a distributed training system shown in the background art
  • FIG. 2 is a schematic diagram of an application scenario architecture applicable to an embodiment of the present application
  • FIG. 3 is a schematic diagram of a suitable training system according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a neural network model training method according to an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a neural network model training method according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a neural network model training apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a neural network model training apparatus according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a system for training a neural network model according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram showing an application scenario architecture applicable to the embodiment of the present application.
  • as shown in FIG. 2, multiple kinds of raw data may exist, such as the telecommunication data 201, the financial data 202, and the consumer data 203; the big data platform 204 performs data collection, data storage, and data computation on the raw data to obtain the data processed by the big data platform 204.
  • the data mining platform 205 obtains the data processed by the big data platform 204 from the big data platform and performs data mining, for example using logistic regression analysis (in English, Logistic Regression, LR for short), the large-scale topic model Latent Dirichlet Allocation (LDA for short), or deep learning models such as the convolutional neural network (Convolutional Neural Network, CNN for short), the recurrent neural network (Recurrent Neural Network, RNN for short), and the sparse autoencoder (Sparse AutoEncoder, SAE for short), to obtain data mining results.
  • the application platform 206 covers various fields and can perform big data analysis in the telecommunications, financial, consumer, and other fields according to the data mining results determined by the data mining platform 205.
  • embodiments of the present application can be used for training on massive data with distributed parallel computing clusters; suitable algorithms include various deep learning algorithms such as convolutional neural networks (for image, voice, or video processing), recurrent neural networks (for natural language processing), and deep neural networks (for speech processing), as well as large-scale machine learning algorithms.
  • the solution provided by the embodiment of the present application is applied to the data mining platform 205.
• the data mining platform 205 can perform mining analysis on the underlying raw data through deep learning intelligent analysis, and the distributed-architecture-based acceleration of the training process enhances the performance and scalability of the deep learning data mining platform, thereby supporting the decision-making and operation of services on the upper application platform, such as video analytics, image recognition, object detection, and natural language processing.
  • a node in the embodiment of the present application may be a computer device including at least one graphics processing unit (GPU) chip and/or at least one central processing unit (CPU) chip.
• Each GPU chip includes one or more GPU cores, and each CPU chip includes one or more CPU cores. The working module in the embodiment of the present application may include one or more GPU cores, and the server module may include one or more CPU cores.
  • FIG. 3 is a schematic diagram showing a suitable system architecture provided by an embodiment of the present application.
• the embodiment of the present application includes a server module set 307 and a working module set 308. The server module set 307 includes a plurality of server modules, which are respectively a server module 301, a server module 302, and a server module 303; the working module set 308 can include a plurality of working modules, which are respectively a working module 304, a working module 305, and a working module 306.
  • a distributed system architecture includes multiple distributed nodes.
• the specific deployment form of each node includes three types: first, the working modules and the server modules are deployed on the same node, and the number of working modules may be equal or unequal to the number of server modules; second, the working modules and the server modules are deployed on different nodes, and the number of working modules may be equal or unequal to the number of server modules; third, the working modules and the server modules are deployed on different nodes in a mixed manner, that is, at least one of the plurality of nodes has both a working module and a server module, and the number of working modules may be equal or unequal to the number of server modules.
  • the solution provided by the embodiment of the present application is applicable to any specific deployment mode.
  • the server module and the work module are used to train model parameters for at least one training period.
  • Each training cycle (called epoch in English) can include K iterations.
  • Model parameters can be trained through one training cycle or multiple training cycles.
• the following content in the embodiment of the present application describes one training period in detail; the schemes of the other training periods are similar and are not described again.
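The period/iteration structure described above can be sketched as two nested loops. This is an illustrative sketch only; the `step` callback is a stand-in for the per-iteration work of the working and server modules, not a name from the patent.

```python
# A minimal sketch of the training structure described above: model
# parameters are trained over one or more training periods (epochs),
# each consisting of K iterations. `step` is a hypothetical placeholder
# for the per-iteration work of the working and server modules.
def train(num_periods, K, step):
    for period in range(1, num_periods + 1):
        for i in range(1, K + 1):
            step(period, i)
```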
• FIG. 4 exemplarily shows a schematic flowchart of a neural network model training method provided by an embodiment of the present application. The neural network model training method is applicable to a training system including a server module and N working modules; the server module and the N working modules are used to train model parameters in at least one training period, and each training period includes K iterations. The following describes the i-th iteration of each of the N working modules in one training period, wherein N and K are each integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K.
  • the method includes:
• Step 401: each working module performs step 402 and step 403 in parallel, where the working module is any one of the N working modules.
• Step 402: each working module calculates the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration; if i is smaller than K, the working module calculates the local gradient of the (i+1)-th iteration according to the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration.
• Step 403: each working module pulls down the global gradient of the r-th iteration from the server module and/or pushes the local gradient of the f-th iteration to the server module, wherein r and f are each positive integers less than or equal to i.
• this includes several schemes: in scheme one, the working module pulls down the global gradient of the r-th iteration from the server module; in scheme two, the working module pushes the local gradient of the f-th iteration to the server module; in scheme three, the working module pulls down the global gradient of the r-th iteration from the server module and pushes the local gradient of the f-th iteration to the server module.
• pulling down the global gradient of the r-th iteration from the server module specifically includes the working module receiving the global gradient of the r-th iteration sent by the server module, or the working module actively obtaining the global gradient of the r-th iteration from the server module.
• Pushing the local gradient of the f-th iteration to the server module specifically means that the working module sends the local gradient of the f-th iteration to the server module.
  • Step 402 and step 403 are performed in parallel in each iterative process in the embodiment of the present application.
• Step 402 is a first process and step 403 is a second process. The first process is a computing process, which includes calculating the model parameters of the (i+1)-th iteration and calculating the local gradient of the (i+1)-th iteration; the second process is a communication process, which includes pulling down the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module.
• the model parameters of the (i+1)-th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration. This avoids the prior-art requirement that the model parameters of the (i+1)-th iteration can be calculated only after the global gradient of the i-th iteration has been pulled down from the server module, which shortens the duration of one iteration and improves the efficiency of model parameter training.
• moreover, the second process is executed in parallel while the first process is executed, whereas in the prior art the communication process is not performed until the local gradient of the (i+1)-th iteration has been calculated; the duration of one iteration is thus shortened and the efficiency of model parameter training is improved.
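As an illustration of running the two processes in parallel, the sketch below executes a stand-in computing process and a stand-in communication process on separate threads for one iteration. All names (`run_iteration`, the `server` event log, the learning rate `eta`, the placeholder gradient) are illustrative assumptions, not from the patent.

```python
import threading

def run_iteration(i, model, local_grads, server, eta=0.1):
    """One iteration on a working module: compute and communicate in parallel."""
    def compute():
        # First process: update the model parameters with the i-th local
        # gradient, then compute a stand-in local gradient for iteration i+1.
        model["w"] = model["w"] - eta * local_grads[i]
        local_grads[i + 1] = model["w"] * 0.01  # placeholder, not a real gradient

    def communicate():
        # Second process: push a local gradient to the server module and pull
        # down a global gradient from it (recorded here as events).
        server["pushed"].append(("local", i))
        server["pulled"].append(("global", i))

    t_compute = threading.Thread(target=compute)
    t_comm = threading.Thread(target=communicate)
    t_compute.start(); t_comm.start()
    t_compute.join(); t_comm.join()
```

Because the two threads touch disjoint data, neither has to wait for the other, which is the time-window overlap the embodiment describes.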
  • the N working modules and the server module may be located on one node, and the node is a computer device including multiple GPU cores and multiple CPU cores.
• a working module includes one or more GPU cores, and a server module includes one or more CPU cores.
  • communication between the N working modules and the server modules is possible through inter-core communication between the GPU core and the CPU core.
• alternatively, the N working modules and the server modules may be located on different nodes, and communication between them may be implemented through links between the nodes. In this way, communication between each of the N working modules and the server module can be achieved in the embodiment of the present application.
• optionally, when calculating the model parameters of the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, if the working module determines that a global gradient of the j-th iteration satisfying a first condition has been pulled down from the server module, the working module calculates the model parameters of the (i+1)-th iteration according to the global gradient of the j-th iteration, the local gradient of the i-th iteration, and the model parameters of the i-th iteration, wherein j is a positive integer less than or equal to i. The first condition includes that the global gradient of the j-th iteration has not been used for the calculation of model parameters in any iteration between the first iteration and the i-th iteration.
• in this way, the model parameters of the (i+1)-th iteration can be calculated according to a global gradient of the j-th iteration that has been pulled down from the server module and satisfies the first condition, which improves the accuracy of the calculated model parameters. Moreover, the global gradient of the j-th iteration satisfying the first condition is selected from the global gradients that have already been pulled down from the server module, so there is no need to wait for the communication process; this further shortens the duration of one iteration and improves the efficiency of training the model parameters.
• optionally, when calculating the model parameters of the (i+1)-th iteration, if the working module determines that no global gradient of the j-th iteration satisfying the first condition has been pulled down from the server module, the working module calculates the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration. In this way, there is no need to wait for the communication process, which further shortens the duration of one iteration and improves the efficiency of training the model parameters.
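The "first condition" check described above can be sketched as follows: the working module keeps the global gradients it has pulled down, marks the ones already used, and signals a fallback to a local-only update when no unused global gradient is available. Names are illustrative, not from the patent; preferring the highest iteration batch follows the optional extra condition discussed below.

```python
# A minimal sketch of selecting a global gradient that satisfies the first
# condition. `pulled` maps iteration batch j -> global gradient already
# pulled down from the server module; `used` is the set of batches whose
# global gradient was already used in a model-parameter calculation.
def pick_global_gradient(pulled, used):
    """Return (j, gradient) for an unused pulled global gradient, preferring
    the highest iteration batch, or None if none qualifies (local-only update)."""
    candidates = [j for j in pulled if j not in used]
    if not candidates:
        return None
    j = max(candidates)  # prefer the highest iteration batch
    return j, pulled[j]
```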
  • the communication process and the calculation process in the system are two processes that are independent of each other and can be executed in parallel.
• when the working module performs the communication process, the local gradient may be pushed to the server module once or several consecutive times, and the global gradient may be pulled down from the server module once or several consecutive times.
• in the foregoing step 403, in the case that the server module has already calculated the global gradient of the r-th iteration, the working module may pull down the global gradient of the r-th iteration from the server module.
• if the working module has just completed a process of pushing a local gradient to the server module, or it is the working module's turn to push a local gradient to the server module, then the working module can choose to push the local gradient of the f-th iteration to the server module.
• optionally, if the communication process between the working module and the server module is fast enough, then while the working module calculates the model parameters of the (i+1)-th iteration and the local gradient of the (i+1)-th iteration, the working module may pull down the global gradient of the r-th iteration from the server module and then push the local gradient of the f-th iteration to the server module, or push the local gradient of the f-th iteration to the server module and then pull down the global gradient of the r-th iteration from the server module. There is no fixed order between pushing the local gradient of the f-th iteration and pulling down the global gradient of the r-th iteration in the embodiment of the present application. In the above solution, the working module may choose to push the local gradient of the f-th iteration to the server module in various implementations.
  • the working module has successfully pulled down the global gradient of the first iteration, the global gradient of the third iteration, the global gradient of the fourth iteration, and the global gradient of the sixth iteration from the server module.
  • the global gradient of the first iteration has been used.
  • the global gradient of the third iteration, the global gradient of the fourth iteration, and the global gradient of the sixth iteration are not used.
• the ninth iteration process is being performed, and the model parameters of the ninth iteration are to be updated; that is, the (i+1)-th iteration is the ninth iteration.
• the global gradients currently satisfying the first condition are the global gradient of the 3rd iteration, the global gradient of the 4th iteration, and the global gradient of the 6th iteration, any one of which may serve as the global gradient of the j-th iteration. Optionally, the model parameters of the 9th iteration are calculated according to any one of the global gradient of the 3rd iteration, the global gradient of the 4th iteration, and the global gradient of the 6th iteration, together with the local gradient of the 8th iteration and the model parameters of the 8th iteration.
• optionally, the first condition further includes: the global gradient of the j-th iteration is the global gradient of the highest iteration batch among all global gradients that have been pulled down from the server module.
• an iteration batch is the sequence number of an iteration; for example, the iteration batch of the 3rd iteration is 3. The larger the sequence number of an iteration, the higher its iteration batch.
• among the global gradient of the 3rd iteration, the global gradient of the 4th iteration, and the global gradient of the 6th iteration, the iteration with the highest batch is the 6th iteration; therefore, the global gradient of the j-th iteration is preferably determined to be the global gradient of the 6th iteration.
• in this case, the model parameters of the 9th iteration are calculated according to the global gradient of the 6th iteration, the local gradient of the 8th iteration, and the model parameters of the 8th iteration.
• while the working module updates the model parameters of the ninth iteration, the communication process can be executed in parallel. Suppose the working module has calculated the local gradients of the first 8 iterations, and has pushed to the server module the local gradient of the 1st iteration, the local gradient of the 3rd iteration, the local gradient of the 4th iteration, and the local gradient of the 6th iteration.
  • the local gradients that are still not pushed onto the server module include: the local gradient of the 2nd iteration, the local gradient of the 5th iteration, the local gradient of the 7th iteration, and the local gradient of the 8th iteration.
  • the working module can be selected as follows:
• while the working module updates the model parameters of the ninth iteration and calculates the local gradient of the ninth iteration, it executes in parallel: pulling down the global gradient of the r-th iteration from the server module. For example, the working module can pull down the global gradient of the 5th iteration from the server module. That is to say, the working module in the embodiment of the present application can, in parallel, pull down from the server module any global gradient of the r-th iteration that the server module has already calculated.
• optionally, while the working module updates the model parameters of the ninth iteration and calculates the local gradient of the ninth iteration, it executes in parallel: pulling down the global gradient of the r-th iteration from the server module and pushing the local gradient of the f-th iteration to the server module; or only pushing the local gradient of the f-th iteration to the server module.
• pushing the local gradient of the f-th iteration to the server module may occur in the following scenarios b1, b2, b3, and b4:
• in scenario b1, the working module selects any one local gradient from the local gradients not yet pushed to the server module and pushes it to the server module; for example, it selects any one of the local gradient of the 2nd iteration, the local gradient of the 5th iteration, the local gradient of the 7th iteration, and the local gradient of the 8th iteration and pushes it to the server module.
• in scenario b2, the local gradient of the (i-1)-th iteration is pushed to the server module; for example, from the local gradient of the 2nd iteration, the local gradient of the 5th iteration, the local gradient of the 7th iteration, and the local gradient of the 8th iteration, the working module selects the local gradient of the 7th iteration and pushes it to the server module.
• in scenario b3, the local gradient of the i-th iteration is pushed to the server module, that is, the working module selects the local gradient with the highest iteration batch among the local gradients not yet pushed to the server module; for example, from the local gradient of the 2nd iteration, the local gradient of the 5th iteration, the local gradient of the 7th iteration, and the local gradient of the 8th iteration, it selects the local gradient of the 8th iteration and pushes it to the server module.
• in scenario b4, the working module can keep waiting in order to push the local gradient of the (i+1)-th iteration to the server module; that is, the working module waits until the local gradient of the ninth iteration has been determined and then pushes the local gradient of the ninth iteration to the server module.
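Scenarios b1 through b4 amount to different selection rules over the set of not-yet-pushed local gradients. The sketch below encodes them; the strategy names and the function itself are mine for illustration, not terms from the patent (b4 is modeled as returning nothing, since the module waits for the (i+1)-th gradient).

```python
def choose_push(pending, i, strategy):
    """Select which unsent local gradient to push (scenarios b1-b4 above).
    pending: set of iteration batches whose local gradient is not yet pushed;
    i: the iteration whose model parameters are currently being updated."""
    if strategy == "any":        # b1: any unsent local gradient
        return min(pending) if pending else None
    if strategy == "previous":   # b2: the (i-1)-th iteration's local gradient
        return i - 1 if i - 1 in pending else None
    if strategy == "latest":     # b3: the i-th (highest-batch) local gradient
        return i if i in pending else None
    return None                  # b4: wait for the (i+1)-th local gradient
```

With the example above (unsent batches {2, 5, 7, 8} while updating the 9th iteration, so i = 8), b2 selects the 7th iteration's gradient and b3 selects the 8th.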
• in the embodiment of the present application, a local gradient of the f-th iteration that has already been calculated may be selected to be pushed to the server module, or a global gradient of the r-th iteration that the server module has already calculated may be selected to be pulled down from the server module. There is no need to report the local gradient of every iteration calculated by the working module, or to pull down the global gradient of every iteration from the server module, which reduces the amount of communication between the working module and the server module.
• optionally, the global gradient of the j-th iteration is determined according to the local gradients of the j-th iteration reported by M working modules of the N working modules, wherein M is an integer greater than or equal to 1 and less than or equal to N.
• for example, the server module can calculate the global gradient of the j-th iteration according to the local gradients of the j-th iteration reported by 20 working modules.
• optionally, the global gradient of the j-th iteration may be calculated according to the local gradients of the j-th iteration reported by all working modules of the N working modules; that is, the server module may calculate the global gradient of the j-th iteration according to all the local gradients of the j-th iteration reported by the multiple working modules.
• a schematic illustration is given below by way of a few examples.
• for example, the server module averages all the local gradients of the j-th iteration reported by the multiple working modules to obtain the global gradient of the j-th iteration. As another example, the server module multiplies each local gradient of the j-th iteration reported by the multiple working modules by a corresponding weight, and then averages the weighted local gradients to obtain the global gradient of the j-th iteration.
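The two aggregation examples above can be written in a few lines. This is a sketch; the function name and the weights are illustrative assumptions, not from the patent.

```python
# Combine the local gradients of the j-th iteration reported by M working
# modules into a global gradient: a plain average, or a weighted average
# when per-module weights are supplied.
def global_gradient(local_grads, weights=None):
    if weights is None:
        return sum(local_grads) / len(local_grads)
    weighted = [g * w for g, w in zip(local_grads, weights)]
    return sum(weighted) / len(weighted)
```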
• optionally, the method further includes: after calculating the local gradient of the K-th iteration, the working module calculates the model parameters of the (K+1)-th iteration according to the local gradient of the K-th iteration and the model parameters of the K-th iteration, and then pushes the model parameters of the (K+1)-th iteration to the server module.
• the model parameters of the (K+1)-th iteration are used to cause the server module to determine the model parameters of the first iteration of the next training period according to the model parameters of the (K+1)-th iteration pushed to the server module by each of the N working modules and the number of iterations K. In this way, the accuracy of the model parameters of the training period is improved.
• for example, the model parameters of the (K+1)-th iteration pushed to the server module by each of the N working modules are averaged, or the sum of the model parameters of the (K+1)-th iteration pushed to the server module by each of the N working modules is divided by the number of iterations K, to obtain the model parameters trained in the training period.
• afterwards, another training period may be started to continue training the model parameters, in which case the model parameters trained in this training period are determined as the model parameters of the first iteration of the next training period. Alternatively, no further training period is started, and the model parameters trained in this training period are directly determined as the final trained model parameters.
  • the method includes:
  • the work module performs the first iteration.
• the GPU of each working module reads the sample data of the first iteration, performs the local gradient calculation depending on the global model parameter w_1_0, and simultaneously preprocesses the sample data of the second iteration. In this way, the training cycle time can be further shortened. The model parameters of the second iteration are then calculated.
• the working module 1 calculates the local gradient Δw_1_1; the working module 2 calculates the local gradient Δw_2_1; ... the working module n calculates the local gradient Δw_n_1; ... the working module N calculates the local gradient Δw_N_1.
  • the work module performs the second iteration.
  • the working modules are executed in parallel: calculating the model parameters of the second iteration, calculating the local gradient of the second iteration; and pushing the local gradient of the first iteration to the server module.
• the working module may also pull down the global gradient of the first iteration from the server module.
  • the model parameters of the second iteration are determined based on the model parameters of the first iteration and the local gradient of the first iteration. Specifically, there are various determination schemes, such as making the model parameters of the second iteration closer to the final value by error calculation.
• optionally, a formula (1) by which the working module n calculates the model parameters of the (i+1)-th iteration is provided:
• w_n_i = w_n_i-1 − η·Δw_n_i ... Formula (1)
• where w_n_i is the model parameter of the (i+1)-th iteration of the working module n; i is the number of iterations, with i ranging over [1, K] and n ranging over [1, N]; w_n_i-1 is the model parameter of the i-th iteration; Δw_n_i is the local gradient calculated by the working module n in the i-th iteration; and η is the learning rate control factor, which can be determined according to the specific application scenario.
  • model parameters of the second iteration are calculated by the above formula (1).
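Formula (1) above can be written directly in code. This is a sketch with illustrative names; `eta` stands for the learning rate control factor.

```python
# Formula (1): compute the model parameter of the (i+1)-th iteration from the
# i-th iteration's parameter and the i-th local gradient on working module n.
def update_local(w_i, delta_w_i, eta):
    return w_i - eta * delta_w_i
```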
• the GPU of the working module reads the preprocessed sample data of the second iteration, and then performs the following in parallel: the local gradient calculation depending on the model parameters of the second iteration; and preprocessing the sample data of the third iteration.
  • the work module pushes the local gradient of the first iteration to the server module.
• the server module may receive the local gradients Δw_1_1, Δw_2_1, ... Δw_n_1, ... Δw_N_1 of the first iteration reported by the N working modules respectively, and optionally calculates the average of the N local gradients of the first iteration to obtain the global gradient Δw_1 of the first iteration.
• while the calculation proceeds, the local gradient of the first iteration is also pushed to the server module, so that the time windows of the calculation process and the communication process overlap, thereby shortening the training period.
• optionally, the average of the local gradients of the first iteration may instead be calculated according to the local gradients of the first iteration reported by M working modules of the N working modules, to obtain the global gradient of the first iteration.
  • the work module performs the third iteration.
• the working modules execute in parallel: calculating the model parameters of the third iteration and calculating the local gradient of the third iteration; and pulling down the global gradient Δw_1 of the first iteration from the server module.
• while the calculation proceeds, the global gradient of the first iteration is also pulled down from the server module, so that the time windows of the calculation process and the communication process overlap, thereby shortening the time of the training cycle.
• the model parameters of the third iteration are determined based on the model parameters of the second iteration and the local gradient of the second iteration.
  • the model parameters of the third iteration can be determined by the above formula (1).
• the GPU of the working module reads the preprocessed sample data of the 3rd iteration, and then performs the following in parallel: the local gradient calculation depending on the model parameters of the 3rd iteration; and preprocessing the sample data of the 4th iteration.
  • the work module performs the 4th iteration.
• the working modules execute in parallel: calculating the model parameters of the 4th iteration and calculating the local gradient of the 4th iteration; and pushing the local gradient of the 3rd iteration to the server module. Alternatively, the working modules calculate the model parameters of the 4th iteration and the local gradient of the 4th iteration without pushing any local gradient to the server module or pulling down any global gradient from the server module. In this way, the amount of communication between the working module and the server module is reduced.
• the following description takes the case where no local gradient is pushed to the server module and no global gradient is pulled down from the server module as an example.
• the model parameters of the fourth iteration are determined based on the model parameters of the third iteration, the local gradient of the 3rd iteration, and the global gradient of the 1st iteration pulled down from the server module.
• optionally, a formula (2) by which the working module n calculates the model parameters of the fourth iteration is provided:
• w_n_i = w_n_i-1 − η·Δw_n_i − χ·Δw_j ... Formula (2)
• where w_n_i-1 is the model parameter of the i-th iteration of the working module n; Δw_n_i is the local gradient calculated by the working module n in the i-th iteration; Δw_j is the global gradient of the j-th iteration, with j a positive integer less than or equal to i; and η and χ are learning rate control factors, which can each be determined according to the specific application scenario.
• the GPU of the working module reads the preprocessed sample data of the 4th iteration, and then performs the following in parallel: the local gradient calculation depending on the model parameters of the 4th iteration; and preprocessing the sample data of the 5th iteration. In this way, the local gradient is used in the first 3 iterations and the global gradient is used in the 4th iteration, ensuring that the model parameters approach the correct values more quickly and accurately.
  • the work module performs the 5th iteration.
• the working modules execute in parallel: calculating the model parameters of the 5th iteration and calculating the local gradient of the 5th iteration; and pushing the local gradient of the 4th iteration to the server module. Alternatively, they calculate the model parameters of the 5th iteration and the local gradient of the 5th iteration while pushing the local gradient of the 3rd iteration to the server module.
• the following content in the embodiment of the present application is introduced by taking pushing the local gradient of the fourth iteration to the server module as an example.
• the GPU of the working module reads the preprocessed sample data of the 5th iteration, and then performs the following in parallel: the local gradient calculation depending on the model parameters of the 5th iteration; and preprocessing the sample data of the 6th iteration.
• the server module can receive the local gradients Δw_1_4, Δw_2_4, ... Δw_n_4, ... Δw_M_4 of the 4th iteration reported by M working modules respectively, and optionally calculates the average of the M local gradients of the 4th iteration to obtain the global gradient Δw_4 of the 4th iteration.
  • the work module performs the sixth iteration.
• the working modules execute in parallel: calculating the model parameters of the sixth iteration and calculating the local gradient of the sixth iteration; and pulling down the global gradient of the fourth iteration from the server module.
• while the calculation proceeds, the global gradient of the 4th iteration is also pulled down from the server module, so that the time windows of the calculation process and the communication process overlap, thereby shortening the time of the training cycle.
  • the model parameter of the 6th iteration can be determined by the above formula (1).
• the GPU of the working module reads the preprocessed sample data of the 6th iteration, and then performs the following in parallel: the local gradient calculation depending on the model parameters of the 6th iteration; and preprocessing the sample data of the 7th iteration.
• the server module can receive the local gradients Δw_1_4, Δw_2_4, ... Δw_n_4, ... Δw_N_4 of the 4th iteration reported by the N working modules respectively, and optionally calculates the average of the N local gradients of the 4th iteration to obtain the global gradient Δw_4 of the 4th iteration.
  • the work module performs the 7th iteration.
• the working modules execute in parallel: calculating the model parameters of the 7th iteration and calculating the local gradient of the 7th iteration; and pushing the local gradient of the 6th iteration to the server module.
• the model parameters of the 7th iteration are determined by the above formula (2), based on the model parameters of the sixth iteration, the local gradient of the 6th iteration, and the global gradient of the 4th iteration pulled down from the server module.
• the GPU of the working module reads the preprocessed sample data of the 7th iteration, and then performs the following in parallel: the local gradient calculation depending on the model parameters of the 7th iteration; and preprocessing the sample data of the 8th iteration.
• in this way, the local gradient is used in the 5th and 6th iterations, and the global gradient is used in the 7th iteration, ensuring that the model parameters approach the correct values more quickly and accurately.
• after calculating the local gradient of the K-th iteration, the working module calculates the model parameters of the (K+1)-th iteration according to the above formula (1).
• after the server module receives the local model parameters reported by the N working modules, it calculates the global model parameters of the training period; the specific methods include several options, such as averaging. The embodiment of the present application provides a formula (3) by which the server module calculates the global model parameters:
• w_2_0 = (w_1_K + w_2_K + ... + w_n_K + ... + w_N_K)/K ... Formula (3)
• where w_2_0 is the global model parameter, which can also be called the model parameter of the first iteration of the next training period;
• w_n_K is the local model parameter of the working module n, with n ranging over [1, N]; and K is the total number of iterations in the training period.
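Formula (3) above in code form. Note that, following the patent's definition verbatim, the sum of the N working modules' parameters is divided by the iteration count K; the function name is illustrative.

```python
# Formula (3): combine the (K+1)-th-iteration model parameters pushed by the
# working modules into the global model parameter of the training period,
# dividing by the total iteration count K as the patent's formula specifies.
def epoch_model(params, K):
    return sum(params) / K
```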
• in the embodiment of the present application, the source of the sample data may be a local disk corresponding to the working module, or a corresponding distributed storage node, such as the Hadoop Distributed File System (HDFS), S3, or the Google File System (GFS).
  • FIG. 5 exemplarily shows a schematic flowchart of a method for training a neural network model.
  • a server module and two working modules are respectively included, which are a working module 1 and a working module 2.
  • K iterations are included in one training cycle.
  • each working module pushes the local gradient of the first iteration to the server module;
  • each working module pulls down the global gradient of the first iteration from the server module.
  • each working module pushes the local gradient of the 4th iteration to the server module; in the 6th iteration, each working module pulls down the global gradient of the 4th iteration from the server module.
  • on the one hand, the calculation process overlaps with the time window of the communication process, which shortens the duration of the training period and improves the training efficiency of the model parameters; on the other hand, local gradients are pushed to the server module, and global gradients pulled down from it, only in some of the iterations, which avoids pushing and pulling local and global gradients in every iteration and reduces the amount of communication between the working modules and the server module.
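The overlap of the calculation and communication time windows described above can be sketched with a worker-side thread; the callable names are placeholders for the compute and push/pull steps, not the patent's API:

```python
import threading

# Illustrative sketch: communication for an earlier iteration runs on a
# background thread while the current iteration's computation proceeds.
def run_iteration(i, compute, communicate):
    t = threading.Thread(target=communicate, args=(i,))
    t.start()            # push/pull for an earlier iteration...
    result = compute(i)  # ...overlaps with the current computation
    t.join()
    return result

log = []
run_iteration(3, lambda i: log.append(("compute", i)),
                 lambda i: log.append(("communicate", i)))
```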
  • the application scenario of this example is to classify image data sets by deep neural network.
  • the data set of this example is an image recognition database (such as imagenet), a total of 1000 categories, 1.28 million images.
  • the neural network of this example is googlenet, which belongs to a large-scale neural network model.
  • the distributed system of this example includes 4 nodes, and each node includes a server module and a working module: server module 1, server module 2, server module 3, server module 4, working module 1, working module 2, working module 3, and working module 4.
  • each working module corresponds to one K80 GPU card (12 GB memory), and each server module corresponds to one Intel Xeon E5-2620 CPU. Optionally, each working module further corresponds to a part of a CPU used for preprocessing the sample data.
  • Googlenet is currently a relatively common image classification network with high classification accuracy. Take the example of the first iteration:
  • Server module 1 initializes the global model parameters to obtain the model parameters of the first iteration.
  • the model parameters of the first iteration obey the distribution W ~ (0, 0.01), and server module 1 delivers the model parameters of the first iteration to the working modules of the 4 nodes.
  • each working module calculates the gradient based on W ~ (0, 0.01); while the GPU of the working module calculates the gradient,
  • the CPU corresponding to the working module preprocesses the next image, that is, preprocesses the sample data of the second iteration.
  • this example provides an optional calculation formula (4) by which each working module calculates the local gradient of the first iteration:
  • the second iteration is performed.
  • the working modules are executed in parallel: calculating the model parameters of the second iteration, calculating the local gradient of the second iteration; and pushing the local gradient of the first iteration to the server module.
  • the calculation of the global gradient of the first iteration is completed in the server module.
  • the working module can then pull down the global gradient of the first iteration from the server module.
  • w_1_1 = w_1_0 + 0.01·Δw_1_1
  • w_2_1 = w_1_0 + 0.01·Δw_2_1
  • w_3_1 = w_1_0 + 0.01·Δw_3_1
  • w_4_1 = w_1_0 + 0.01·Δw_4_1
  • each working module calculates the local gradient of the second iteration based on the model parameters of the second iteration, and simultaneously pushes its local gradient of the first iteration to server module 1; at the same time, the CPU corresponding to the working module preprocesses the next image, that is, preprocesses the sample data of the third iteration.
  • an optional calculation formula (6) by which each working module calculates the local gradient of the second iteration is provided:
  • Δw_1_2 is the local gradient of the second iteration of the working module 1;
  • Δw_2_2 is the local gradient of the second iteration of the working module 2;
  • Δw_3_2 is the local gradient of the second iteration of the working module 3;
  • Δw_4_2 is the local gradient of the second iteration of the working module 4.
  • the working module executes in parallel: calculating the model parameters of the third iteration and calculating the local gradient of the third iteration; pulling down the global gradient Δw_1 of the first iteration from the server module.
  • the working module calculates the model parameters of each working module in the third iteration according to the above formula (1), where the learning rate in formula (1) is 0.01, obtaining the result shown in formula (7):
  • w_1_2 = w_1_1 + 0.01·Δw_1_2
  • w_2_2 = w_2_1 + 0.01·Δw_2_2
  • w_3_2 = w_3_1 + 0.01·Δw_3_2
  • w_4_2 = w_4_1 + 0.01·Δw_4_2
  • w_4_2 is the model parameter of the 3rd iteration of the working module 4; w_4_1 is the model parameter of the 2nd iteration of the working module 4; Δw_4_2 is the local gradient of the 2nd iteration of the working module 4.
  • the working module calculates the model parameters of each working module in the third iteration according to the above formula (2).
  • w_1_2 = w_1_1 + 0.01·Δw_1_2 + 0.4·Δw_1
  • w_2_2 = w_2_1 + 0.01·Δw_2_2 + 0.4·Δw_1
  • w_3_2 = w_3_1 + 0.01·Δw_3_2 + 0.4·Δw_1
  • w_4_2 = w_4_1 + 0.01·Δw_4_2 + 0.4·Δw_1
  • Δw_1 is the global gradient of the first iteration.
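Under the coefficients used in this example (0.01 for the local gradient, 0.4 for the global gradient), the per-parameter update of formula (8) can be checked numerically; the function name and the input values are illustrative:

```python
# Sketch of formula (8): when the global gradient of the 1st iteration has been
# pulled down, the new parameters add a weighted global-gradient term.
# Coefficients 0.01 and 0.4 are taken from the example above.
def update_with_global(w_prev, local_grad, global_grad, eta=0.01, beta=0.4):
    return w_prev + eta * local_grad + beta * global_grad

# e.g. w_1_2 = w_1_1 + 0.01*Δw_1_2 + 0.4*Δw_1, with made-up scalar values
result = update_with_global(1.0, 10.0, 0.5)  # ≈ 1.0 + 0.1 + 0.2
print(result)
```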
  • each working module calculates the local gradient of the 3rd iteration based on the model parameters of the 3rd iteration, and simultaneously pulls down the global gradient of the 1st iteration from server module 1; at the same time,
  • the CPU corresponding to the working module preprocesses the next image, that is, preprocesses the sample data of the 4th iteration.
  • each working module calculates the local gradient of the third iteration according to formula (9):
  • Δw_1_3 is the local gradient of the third iteration of the working module 1;
  • Δw_2_3 is the local gradient of the third iteration of the working module 2;
  • Δw_3_3 is the local gradient of the third iteration of the working module 3;
  • Δw_4_3 is the local gradient of the third iteration of the working module 4.
  • the third iteration process ends and the fourth iteration process begins.
  • the working module executes in parallel: calculating the model parameters of the 4th iteration and calculating the local gradient of the 4th iteration; pulling down the global gradient Δw_1 of the first iteration from the server module.
  • the working module calculates the model parameters of the respective working modules in the fourth iteration according to the above formula (1).
  • the working module calculates the model parameters of each working module in the fourth iteration according to the above formula (2).
  • the result is as shown in formula (10):
  • w_1_3 = w_1_2 + 0.01·Δw_1_3 + 0.4·Δw_1
  • w_2_3 = w_2_2 + 0.01·Δw_2_3 + 0.4·Δw_1
  • w_3_3 = w_3_2 + 0.01·Δw_3_3 + 0.4·Δw_1
  • w_4_3 = w_4_2 + 0.01·Δw_4_3 + 0.4·Δw_1
  • w_2_3 is the model parameter of the 4th iteration of the working module 2;
  • w_2_2 is the model parameter of the 3rd iteration of the working module 2;
  • Δw_2_3 is the local gradient of the 3rd iteration of the working module 2;
  • w_3_3 is the model parameter of the 4th iteration of the working module 3;
  • w_3_2 is the model parameter of the 3rd iteration of the working module 3;
  • Δw_3_3 is the local gradient of the 3rd iteration of the working module 3;
  • w_4_3 is the model parameter of the 4th iteration of the working module 4;
  • w_4_2 is the model parameter of the 3rd iteration of the working module 4;
  • Δw_4_3 is the local gradient of the 3rd iteration of the working module 4;
  • Δw_1 is the global gradient of the first iteration.
  • the local gradient of the 4th iteration is calculated according to the model parameters of the 4th iteration; the rest of the iterative process is similar to the above content and is not described herein again.
  • the working module pushes each local gradient to server module 1, and server module 1 calculates the global gradient according to each local gradient.
  • the average value of the local gradients may be calculated as the global gradient. Formula (11) is provided in the embodiment of the present application for calculating the global gradient:
  • Δw_1 = (Δw_1_1 + Δw_2_1 + ... + Δw_n_1 + ... + Δw_N_1)/N ... Formula (11)
  • Δw_1 is the global gradient of the first iteration;
  • Δw_1_1 is the local gradient of the first iteration of the working module 1;
  • Δw_2_1 is the local gradient of the first iteration of the working module 2;
  • Δw_n_1 is the local gradient of the first iteration of the working module n, where n ranges over [1, N];
  • Δw_N_1 is the local gradient of the first iteration of the working module N; N is the total number of working modules.
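A quick numeric check of the gradient averaging described above, with the N = 4 working modules of this example; the gradient values are invented for illustration:

```python
# Averaging the 4 workers' first-iteration local gradients into the global
# gradient Δw_1, as described by formula (11). Values are made up.
local_grads = {1: 1.0, 2: 3.0, 3: 5.0, 4: 7.0}  # worker n -> Δw_n_1
global_grad = sum(local_grads.values()) / len(local_grads)
print(global_grad)  # 16.0 / 4 = 4.0
```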
  • the global gradient information is used to adjust the model update of each working module without adding extra communication time overhead, which solves the model consistency and convergence problems caused by weak synchronization in the traditional communication mode.
  • This application effectively overcomes the communication bottleneck caused by large models under the premise of ensuring stable convergence of large-scale distributed neural network models (including deep learning models); it is also the first proposal in the industry to address the communication time overhead of large-scale distributed machine learning.
  • FIG. 6 is a schematic structural diagram of a neural network model training apparatus provided by an embodiment of the present application.
  • the embodiment of the present application provides a training device for a neural network model.
  • the training device includes N working modules, and the training device is applicable to a training system including a server module and N working modules.
  • the server module and the N working modules are configured to train the model parameters in at least one training period, and each training period in the at least one training period includes K iterations; the working module is any one of the N working modules and includes a communication module and a calculation module; for the ith iteration of each of the N working modules in each training period, N and K are integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K.
  • Each of the N work modules includes a communication module 603 and a calculation module 602.
  • a storage module 601 can also be included.
  • the storage module is configured to store information such as the pulled-down global gradients.
  • the communication module 603 and the calculation module 602 of each working module run in parallel;
  • the calculation module 602 is configured to calculate the model parameters of the (i+1)th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, and, if i is less than K, to calculate the local gradient of the (i+1)th iteration according to the model parameters of the (i+1)th iteration and the sample data of the (i+1)th iteration;
  • the communication module 603 is configured to pull down the global gradient of the rth iteration from the server module and/or push the local gradient of the fth iteration to the server module; wherein r and f are positive integers less than or equal to i, respectively.
  • the calculation module and the communication module run in parallel: the calculation module executes the first process, and the communication module executes the second process. The first process is a calculation process, specifically including calculating the model parameters of the (i+1)th iteration and calculating the local gradient of the (i+1)th iteration;
  • the second process is a communication process, specifically including pulling down the global gradient of the rth iteration from the server module and/or pushing the local gradient of the fth iteration to the server module.
  • the model parameters of the (i+1)th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, thereby avoiding the situation in the prior art
  • in which it is necessary to wait to pull down the global gradient of the i-th iteration from the server module before the model parameters of the (i+1)th iteration can be calculated, which shortens the duration of one iteration and improves the efficiency of model parameter training.
  • the calculating module 602 is configured to: in the case of determining that the global gradient of the jth iteration satisfying the first condition has been pulled down from the server module, calculate the model parameters of the (i+1)th iteration according to the global gradient of the jth iteration, the local gradient of the ith iteration, and the model parameters of the ith iteration, where j is a positive integer less than or equal to i. The first condition includes: the global gradient of the jth iteration has not been used in the calculation of the model parameters in any iteration between the first iteration and the ith iteration. In this way, there is no need to wait for the communication process, which further shortens the duration of the iteration and improves the efficiency of training the model parameters.
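A hypothetical sketch of the first-condition bookkeeping: among the global gradients already pulled down, pick the newest one that has not yet been consumed by a model parameter update, falling back to the local-gradient-only update when none qualifies. All names are illustrative, not the patent's API:

```python
# Illustrative selection of a global gradient satisfying the "first condition".
def pick_global_gradient(pulled, used):
    """pulled: {iteration j: global gradient}; used: set of consumed j."""
    candidates = [j for j in pulled if j not in used]
    if not candidates:
        return None, None      # fall back to local-gradient-only update
    j = max(candidates)        # largest iteration number = newest batch
    used.add(j)                # mark as consumed for later iterations
    return j, pulled[j]

pulled = {1: [0.1], 4: [0.2]}
used = {1}                     # gradient of iteration 1 already consumed
print(pick_global_gradient(pulled, used))  # picks iteration 4
```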
  • the calculating module 602 is configured to: in the case of determining that no global gradient of a jth iteration satisfying the first condition has been pulled down from the server module, calculate the model parameters of the (i+1)th iteration according to the local gradient of the i-th iteration and the model parameters of the ith iteration. In this way, updating the model parameters according to the global gradient of the iteration nearest to the current iteration, whenever it is available, accelerates the convergence of the model parameters.
  • the first condition further includes: the global gradient of the jth iteration is the global gradient of the latest iteration batch among all global gradients that have been pulled down from the server module.
  • the global gradient of the jth iteration is determined according to the following: a local gradient of the jth iteration reported by the M working modules of the N working modules; wherein, M is an integer greater than or equal to 1 and less than or equal to N .
  • the global gradient of the jth iteration that satisfies the first condition is selected from the global gradient that has been pulled down from the server module, without waiting for the communication process, further shortening the duration of the iteration, and improving the efficiency of training the model parameters.
  • the communication module 603 is configured to: pull down the global gradient of the rth iteration from the server module; or pull down the global gradient of the rth iteration from the server module and push the local gradient of the (i-1)th iteration to the server module; or pull down the global gradient of the rth iteration from the server module and push the local gradient of the i-th iteration to the server module; or push the local gradient of the (i-1)th iteration to the server module; or push the local gradient of the i-th iteration to the server module. In this way, on the one hand, the flexibility of the working module is improved; on the other hand, the local gradient of the iteration nearest to the current iteration can be pushed to the server module as early as possible, thereby accelerating the convergence of the model parameters.
  • the communication module 603 is further configured to: after calculating the local gradient of the Kth iteration by the calculation module, calculate the model parameters according to the local gradient of the Kth iteration and the Kth iteration After the model parameter of the K+1th iteration, the model parameter of the K+1th iteration is pushed to the server module; wherein the model parameter of the K+1th iteration is used to: make the server module according to each of the N working modules The model parameters of the K+1th iteration pushed by the working module to the server module, and the number of iterations K, determine the model parameters of the first iteration in the next training period. In this way, the accuracy of the model parameters of the training period is improved.
  • the first process and the second process are executed in parallel in each iterative process.
  • the first process is a calculation process, specifically including calculating the model parameters of the (i+1)th iteration and calculating the local gradient of the (i+1)th iteration;
  • the second process is a communication process, specifically including pulling down the global gradient of the rth iteration from the server module and/or pushing the local gradient of the fth iteration to the server module.
  • the model parameters of the (i+1)th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the situation in the prior art where the global gradient of the i-th iteration must be pulled down from the server module before
  • the model parameters of the (i+1)th iteration can be calculated, which shortens the duration of one iteration and improves the efficiency of model parameter training.
  • the division of units in the embodiments of the present application is schematic and is only a logical function division; there may be other division manners in actual implementation.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • FIG. 7 is a schematic structural diagram of a neural network model training apparatus provided by an embodiment of the present application.
  • the embodiment of the present application provides a neural network model training apparatus for performing the above method flow.
  • the training device includes a transceiver 701 and a processor 702.
  • Processor 702 includes N processor cores.
  • a memory 704, and a communication interface 703 are also included.
  • a bus 705 can also be included.
  • the processor, the memory, and the transceiver are connected to each other through a bus.
  • the bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in Figure 7, but it does not mean that there is only one bus or one type of bus.
  • the memory 704 may include a volatile memory such as a random-access memory (RAM); the memory may also include a non-volatile memory such as a flash memory (flash) Memory), hard disk drive (HDD) or solid-state drive (SSD); the memory 704 may also include a combination of the above types of memory.
  • the N processor cores included in the processor 702 may include a GPU or may include a GPU and a CPU.
  • the processor core may further include a hardware chip.
  • the hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (GAL) or any combination.
  • the transceiver is used to implement data transmission between each working module and the server module.
  • the memory is used to store instructions.
  • the memory is further configured to store information such as the pulled-down global gradients.
  • the processor includes N processor cores, and the training device is applicable to a training system including a server module and N processor cores; the server module and the N processor cores are used to train
  • the model parameters in at least one training period, and each training period in the at least one training period includes K iterations. For the ith iteration of each of the N working modules in each training period, N and K are integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K.
  • the transceiver 701 and the processor 702 of each working module operate in parallel;
  • the processor 702 is configured to calculate the model parameters of the (i+1)th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, and, if i is less than K, to calculate the local gradient of the (i+1)th iteration according to the model parameters of the (i+1)th iteration and the sample data of the (i+1)th iteration;
  • the transceiver 701 is configured to pull down the global gradient of the rth iteration from the server module and/or push the local gradient of the fth iteration to the server module; where r and f are positive integers less than or equal to i, respectively.
  • the memory is used to store the global gradients pulled down from the server module, as well as the calculated local gradients.
  • the transceiver and the processor run in parallel: the processor executes the first process, and the transceiver executes the second process. The first process is a calculation process, specifically including calculating the model parameters of the (i+1)th iteration and calculating the local gradient of the (i+1)th iteration;
  • the second process is a communication process, specifically including pulling the global gradient of the rth iteration from the server module and/or pushing the local gradient of the fth iteration to the server module.
  • the model parameters of the (i+1)th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the situation in the prior art where the global gradient of the i-th iteration must be pulled down from the server module before
  • the model parameters of the (i+1)th iteration can be calculated, which shortens the duration of one iteration and improves the efficiency of model parameter training.
  • the processor 702 is configured to: in the case of determining that the global gradient of the jth iteration satisfying the first condition has been pulled down from the server module, calculate the model parameters of the (i+1)th iteration according to the global gradient of the jth iteration, the local gradient of the ith iteration, and the model parameters of the ith iteration, where j is a positive integer less than or equal to i. The first condition includes: the global gradient of the jth iteration has not been used in the calculation of the model parameters in any iteration between the first iteration and the ith iteration. In this way, there is no need to wait for the communication process, which further shortens the duration of the iteration and improves the efficiency of training the model parameters.
  • the processor 702 is configured to: in the case of determining that no global gradient of a jth iteration satisfying the first condition has been pulled down from the server module, calculate the model parameters of the (i+1)th iteration according to the local gradient of the i-th iteration and the model parameters of the ith iteration. In this way, updating the model parameters according to the global gradient of the iteration nearest to the current iteration, whenever it is available, accelerates the convergence of the model parameters.
  • the first condition further includes: the global gradient of the jth iteration is the global gradient of the latest iteration batch among all global gradients that have been pulled down from the server module.
  • the model parameters of the (i+1)th iteration can be calculated according to the global gradient of the jth iteration that has been pulled down from the server module to satisfy the first condition, and the accuracy of calculating the model parameters of the (i+1)th iteration is improved.
  • the global gradient of the jth iteration that satisfies the first condition is selected from the global gradient that has been pulled down from the server module, without waiting for the communication process, further shortening the duration of the iteration, and improving the efficiency of training the model parameters.
  • the global gradient of the jth iteration is determined according to the following: a local gradient of the jth iteration reported by the M working modules of the N working modules; wherein, M is an integer greater than or equal to 1 and less than or equal to N .
  • the transceiver 701 is configured to: pull down the global gradient of the rth iteration from the server module; or pull down the global gradient of the rth iteration from the server module and push the local gradient of the (i-1)th iteration to the server module; or pull down the global gradient of the rth iteration from the server module and push the local gradient of the i-th iteration to the server module; or push the local gradient of the (i-1)th iteration to the server module; or push the local gradient of the i-th iteration to the server module. In this way, on the one hand, the flexibility of the working module is improved; on the other hand, the local gradient of the iteration nearest to the current iteration can be pushed to the server module as early as possible, thereby accelerating the convergence of the model parameters.
  • the transceiver 701 is further configured to: after calculating the local gradient of the Kth iteration by the processor, calculate the model parameters according to the local gradient of the Kth iteration and the Kth iteration After the model parameter of the K+1th iteration, the model parameter of the K+1th iteration is pushed to the server module; wherein the model parameter of the K+1th iteration is used to: make the server module according to each of the N working modules The model parameters of the K+1th iteration pushed by the working module to the server module, and the number of iterations K, determine the model parameters of the first iteration in the next training period. In this way, the accuracy of the model parameters of the training period is improved.
  • the first process and the second process are executed in parallel in each iterative process.
  • the first process is a calculation process, specifically including calculating the model parameters of the (i+1)th iteration and calculating the local gradient of the (i+1)th iteration;
  • the second process is a communication process, specifically including pulling down the global gradient of the rth iteration from the server module and/or pushing the local gradient of the fth iteration to the server module.
  • the model parameters of the (i+1)th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the situation in the prior art where the global gradient of the i-th iteration must be pulled down from the server module before
  • the model parameters of the (i+1)th iteration can be calculated, which shortens the duration of one iteration and improves the efficiency of model parameter training.
  • the embodiment of the present application provides a chip for training a neural network model, where the chip is applicable to
  • a training system including N chips and a server module;
  • the server module and the N chips are configured to train model parameters in at least one training period, each training period in the at least one training period comprising K iterations;
  • Each of the N chips is for performing the method performed by the working module in the first aspect described above.
  • FIG. 8 is a schematic structural diagram of a system for training a neural network model provided by an embodiment of the present application.
  • the embodiment of the present application provides a schematic structural diagram of a system for training a neural network model.
  • the system includes a server module 800 and N working modules (working module 801, working module 802, ..., working module 80n); the server
  • module 800 and the working modules 801 to 80n are configured to train the model parameters in at least one training period, and each training period in the at least one training period includes K iterations;
  • each of the working modules 801 to 80n is used to execute in parallel: calculating, according to the local gradient of the ith
  • iteration and the model parameters of the ith iteration, the model parameters of the (i+1)th iteration, and, in the case where i is smaller than K, calculating the local gradient of the (i+1)th iteration according to the model parameters of the (i+1)th iteration and the sample data of the (i+1)th iteration; pulling down the global gradient of the rth iteration from the server module and/or pushing the local gradient of the fth iteration to the server module; where r and f are positive integers less than or equal to i, N and K are integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K;
  • the server module 800 is configured to calculate the global gradient of the rth iteration according to the received local gradients of the rth iteration pushed by the working modules, and to deliver the global gradient of the rth iteration to each working module; and to receive the local gradients
  • of the fth iteration pushed by the working modules and calculate the global gradient of the fth iteration according to the local gradients of the fth iteration pushed by the working modules.
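The server module's behavior described above can be sketched as follows; this is a toy in-memory stand-in, not the patent's implementation: collect the local gradients pushed per iteration, average them once all N workers have reported, and serve subsequent pulls:

```python
# Toy sketch of the server module: aggregate pushed local gradients into a
# per-iteration global gradient (averaging, as in formula (11)) and serve pulls.
class ServerModule:
    def __init__(self, n_workers):
        self.n = n_workers
        self.pending = {}       # iteration -> list of pushed local gradients
        self.global_grads = {}  # iteration -> computed global gradient

    def push(self, iteration, local_grad):
        self.pending.setdefault(iteration, []).append(local_grad)
        grads = self.pending[iteration]
        if len(grads) == self.n:  # all workers reported for this iteration
            dim = len(grads[0])
            self.global_grads[iteration] = [
                sum(g[d] for g in grads) / self.n for d in range(dim)]

    def pull(self, iteration):
        return self.global_grads.get(iteration)  # None if not ready yet

s = ServerModule(2)
s.push(1, [1.0, 2.0])
s.push(1, [3.0, 4.0])
print(s.pull(1))  # [2.0, 3.0]
```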
  • the first process and the second process are executed in parallel in each iterative process.
  • the first process is a calculation process, specifically including calculating the model parameters of the (i+1)th iteration and calculating the local gradient of the (i+1)th iteration;
  • the second process is a communication process, specifically including pulling down the global gradient of the rth iteration from the server module and/or pushing the local gradient of the fth iteration to the server module.
  • the model parameters of the (i+1)th iteration are calculated according to the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the situation in the prior art where the global gradient of the i-th iteration must be pulled down from the server module before
  • the model parameters of the (i+1)th iteration can be calculated, which shortens the duration of one iteration and improves the efficiency of model parameter training.
  • a computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another computer readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that includes one or more available media.
  • Useful media can be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)).
  • the embodiments of this application may be provided as a method or a computer program product.
  • this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
  • the computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus.
  • the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • these computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing.
  • the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)
  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A neural network model training method, apparatus, chip and system for shortening the training latency of model parameters. Each training epoch includes K iterations. For the i-th iteration of each epoch, each of the N working modules executes in parallel: computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration and, when i is less than K, computing the local gradient of the (i+1)-th iteration from the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration; and pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. The time windows of the computation process and the communication process thus overlap, shortening the training latency of the model parameters.

Description

Neural network model training method, apparatus, chip and system
This application claims priority to Chinese Patent Application No. 201611073994.5, filed with the Chinese Patent Office on November 29, 2016 and entitled "Neural network model training method, apparatus, chip and system", which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of this application relate to the field of machine learning, and in particular to a neural network model training method, apparatus, chip and system.
Background
With the rapid development and widespread application of computer and information technology, industry application data has grown explosively. Industry and enterprise big data, often reaching hundreds of terabytes (TB) or even petabytes (PB), frequently contains deep knowledge and value that small data sets do not. Data analysis driven by large-scale machine learning (including deep learning) is the key technology for turning big data into useful knowledge. Large Internet companies at home and abroad, such as Google, Facebook, Microsoft and Baidu, have set up dedicated research institutes for big-data-based machine learning and artificial intelligence to systematically study machine learning and intelligent computing technologies based on big data.
To deploy large-scale machine learning algorithms on large-scale distributed parallel computing systems, the parameter server computing architecture, combined with an efficient stochastic gradient descent algorithm for training, is currently the most common approach. Figure 1 shows a schematic diagram of a distributed training system. As shown in Figure 1, it includes a server module set (servers) 101 and a working module set (workers) 102. The server module set may include multiple server modules (server), and the working module set may include multiple working modules (worker). A server module is similar to a master node, and a working module may refer to a computation executor. The distributed training system includes multiple distributed nodes; each node may include one or more working modules and may also include one or more server modules.
Taking Figure 1 as an example, the signaling interaction between server modules and working modules in the distributed training system is described in detail. Figure 1 includes N working modules and P server modules, which are used to train the model parameters of a neural network model. This example describes the training of one model parameter:
First, the distributed computing platform is started and the application is deployed. The server module is initialized to obtain the initialized model parameter ω1, and the global model parameter ω1 is pulled from the server module to each working module.
Second, each working module performs the first iteration: it reads in sample data and computes a local gradient based on the global model parameter ω1. Working module 1 computes local gradient Δω1-1, working module 2 computes local gradient Δω2-1, ..., and working module N computes local gradient ΔωN-1.
Third, each working module performs the second iteration: each working module pushes the local gradients Δω1-1, Δω2-1, ..., ΔωN-1 produced in the first iteration up to the server module; the server module computes the global gradient Δω1_1 from the local gradients Δω1-1, Δω2-1, ..., ΔωN-1; the global gradient Δω1_1 is pulled from the server module down to each working module; and each working module updates the local model parameter ω1 to the model parameter ω2 according to the global gradient Δω1_1.
Each working module reads in sample data and computes a local gradient based on the updated model parameter ω2: working module 1 computes local gradient Δω1-2, working module 2 computes local gradient Δω2-2, ..., and working module N computes local gradient ΔωN-2.
Fourth, in the following iterations each working module again pushes its local gradients up to the server module, so that the global gradient can again be pulled from the server module, and each working module updates its local model parameters and computes gradients based on the global gradient pulled from the server module.
Fifth, after several iterations, each working module reports its last updated local model parameters to the server module, and the server module determines an average from the updated local model parameters reported by the working modules to obtain the trained model parameters. This process may be called a training epoch, and the model parameters may be trained through multiple epochs.
As can be seen from the above description, for each model parameter, in each iteration a working module first pushes its local gradient up to the server module, waits until the global gradient of that model parameter has been pulled from the server module, updates the local model parameter according to the global gradient, and then computes the local gradient from the updated local model parameter. The time taken by each iteration therefore includes the communication time for pushing the local gradient and pulling the global gradient, plus the computation time for updating the local model parameter and computing the local gradient. A single iteration takes a long time, causing a large latency in the training of model parameters.
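As an illustration only, the serial prior-art iteration described above can be sketched as follows. The `push` and `pull` callables and the additive update with learning rate `lr` are assumptions for the sketch; the point is that communication and computation happen strictly one after another, so their times add up.

```python
# Hypothetical sketch of one prior-art iteration: push the local gradient,
# wait for the pulled global gradient, then update the local parameters.
def serial_iteration(w, local_grad, push, pull, lr=0.1):
    push(local_grad)        # communication: push local gradient to the server
    global_grad = pull()    # communication: wait for the global gradient
    # computation: update local model parameters with the global gradient
    return [wi + lr * g for wi, g in zip(w, global_grad)]
```

Because `pull()` blocks until the server has aggregated every worker's gradient, the computation step cannot start earlier, which is exactly the latency this application targets.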
Summary
The embodiments of this application provide a neural network model training method, apparatus, chip and system, to shorten the training latency of model parameters and improve the efficiency of model parameter training.
In a first aspect, an embodiment of this application provides a neural network model training method. The embodiment applies to a training system including a server module and N working modules. The server module and the N working modules are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations. For the i-th iteration of one of the N working modules within each epoch, each working module executes in parallel: computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration and, when i is less than K, computing the local gradient of the (i+1)-th iteration from the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration; and pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module.
In the embodiments of this application, a first process and a second process are executed in parallel in each iteration. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters of the (i+1)-th iteration are computed from the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the prior-art requirement to wait for the global gradient of the i-th iteration to be pulled from the server module before the model parameters of the (i+1)-th iteration can be computed, thereby shortening the duration of one iteration and improving the efficiency of model parameter training.
Optionally, the working module computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration includes: when the working module determines that a global gradient of a j-th iteration satisfying a first condition has been pulled from the server module, computing the model parameters of the (i+1)-th iteration from the global gradient of the j-th iteration, the local gradient of the i-th iteration and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i, and the first condition includes: the global gradient of the j-th iteration has not been used in the computation of the model parameters in any iteration from the 1st iteration to the i-th iteration. In this way, the model parameters of the (i+1)-th iteration can be computed from a pulled global gradient of a j-th iteration satisfying the first condition, which improves the accuracy of the computed model parameters of the (i+1)-th iteration; moreover, selecting such a global gradient from those already pulled from the server module requires no waiting for the communication process, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the working module computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration includes: when the working module determines that no global gradient of a j-th iteration satisfying the first condition has been pulled from the server module, computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration. In this way, there is no need to wait for the communication process, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the first condition further includes: the global gradient of the j-th iteration is the global gradient of the highest iteration batch among all global gradients already pulled from the server module. In this way, the model parameters can be updated with the global gradient of the iteration nearest to the current one, accelerating the convergence of the model parameters.
Optionally, the global gradient of the j-th iteration is determined from: the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N. This improves the operational flexibility of the working modules and the server module, and further reduces the communication volume between the working modules and the server module.
Optionally, the working module pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module includes: pulling the global gradient of the r-th iteration from the server module; or pulling the global gradient of the r-th iteration from the server module and pushing the local gradient of the (i-1)-th iteration to the server module; or pulling the global gradient of the r-th iteration from the server module and pushing the local gradient of the i-th iteration to the server module; or pushing the local gradient of the (i-1)-th iteration to the server module; or pushing the local gradient of the i-th iteration to the server module. This improves the flexibility of the working module and also allows, as far as possible, the local gradient of the iteration nearest to the current one to be pushed to the server module, accelerating the convergence of the model parameters.
Optionally, when i equals K, the method further includes: after the working module computes the local gradient of the K-th iteration and then computes the model parameters of the (K+1)-th iteration from the local gradient of the K-th iteration and the model parameters of the K-th iteration, pushing the model parameters of the (K+1)-th iteration to the server module, where the model parameters of the (K+1)-th iteration are used by the server module, together with the iteration count K and the (K+1)-th-iteration model parameters pushed by each of the N working modules, to determine the model parameters of the 1st iteration of the next epoch. This improves the accuracy of the model parameters over the training epochs.
In a second aspect, an embodiment of this application provides an apparatus for neural network model training. The training apparatus includes N working modules and applies to a training system including a server module and the N working modules. The server module and the N working modules are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations. Each of the N working modules includes a communication module and a computation module. For the i-th iteration of each of the N working modules within each epoch, where N and K are each integers greater than or equal to 1 and i is an integer greater than or equal to 1 and less than or equal to K, the communication module and the computation module of each working module run in parallel. The computation module is configured to compute the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration and, when i is less than K, compute the local gradient of the (i+1)-th iteration from the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration. The communication module is configured to pull the global gradient of the r-th iteration from the server module and/or push the local gradient of the f-th iteration to the server module, where r and f are each positive integers less than or equal to i.
In the embodiments of this application, the communication module and the computation module run in parallel in each iteration: the computation module executes a first process and the communication module executes a second process. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters of the (i+1)-th iteration are computed from the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the prior-art requirement to wait for the global gradient of the i-th iteration to be pulled from the server module before the model parameters of the (i+1)-th iteration can be computed, thereby shortening the duration of one iteration and improving the efficiency of model parameter training.
Optionally, the computation module is configured to: when it is determined that a global gradient of a j-th iteration satisfying a first condition has been pulled from the server module, compute the model parameters of the (i+1)-th iteration from the global gradient of the j-th iteration, the local gradient of the i-th iteration and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i, and the first condition includes: the global gradient of the j-th iteration has not been used in the computation of the model parameters in any iteration from the 1st iteration to the i-th iteration. In this way, the model parameters of the (i+1)-th iteration can be computed from a pulled global gradient satisfying the first condition, which improves the accuracy of the computed model parameters, and no waiting for the communication process is needed, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the computation module is configured to: when it is determined that no global gradient of a j-th iteration satisfying the first condition has been pulled from the server module, compute the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration. In this way, there is no need to wait for the communication process, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the first condition further includes: the global gradient of the j-th iteration is the global gradient of the highest iteration batch among all global gradients already pulled from the server module. In this way, the model parameters can be updated with the global gradient of the iteration nearest to the current one, accelerating the convergence of the model parameters.
Optionally, the global gradient of the j-th iteration is determined from: the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N. This improves the operational flexibility of the working modules and the server module, and further reduces the communication volume between the working modules and the server module.
Optionally, the communication module is configured to pull the global gradient of the r-th iteration from the server module; or to pull the global gradient of the r-th iteration from the server module and push the local gradient of the (i-1)-th iteration to the server module; or to pull the global gradient of the r-th iteration from the server module and push the local gradient of the i-th iteration to the server module; or to push the local gradient of the (i-1)-th iteration to the server module; or to push the local gradient of the i-th iteration to the server module. This improves the flexibility of the working module and also allows, as far as possible, the local gradient of the iteration nearest to the current one to be pushed to the server module, accelerating the convergence of the model parameters.
Optionally, when i equals K, the communication module is further configured to: after the computation module computes the local gradient of the K-th iteration and then computes the model parameters of the (K+1)-th iteration from the local gradient of the K-th iteration and the model parameters of the K-th iteration, push the model parameters of the (K+1)-th iteration to the server module, where the model parameters of the (K+1)-th iteration are used by the server module, together with the iteration count K and the (K+1)-th-iteration model parameters pushed by each of the N working modules, to determine the model parameters of the 1st iteration of the next epoch. This improves the accuracy of the model parameters over the training epochs.
In a third aspect, an embodiment of this application provides an apparatus for neural network model training. The training apparatus includes a processor, a memory and a transceiver, where the processor includes N processor cores. The training apparatus applies to a training system including a server module and the N processor cores, where the server module and the N processor cores are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations.
The memory is configured to store instructions; the processor is configured to execute the instructions stored in the memory and control data transmission between the transceiver and the server module. When the processor executes the instructions stored in the memory, each of the N processor cores is configured to:
compute the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration and, when i is less than K, compute the local gradient of the (i+1)-th iteration from the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration.
The transceiver is configured to pull the global gradient of the r-th iteration from the server module and/or push the local gradient of the f-th iteration to the server module, where r and f are each positive integers less than or equal to i.
The memory is configured to store the global gradients pulled from the server module and the computed local gradients.
In the embodiments of this application, the transceiver and the processor run in parallel in each iteration: the processor executes a first process and the transceiver executes a second process. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters of the (i+1)-th iteration are computed from the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the prior-art requirement to wait for the global gradient of the i-th iteration to be pulled from the server module before the model parameters of the (i+1)-th iteration can be computed, thereby shortening the duration of one iteration and improving the efficiency of model parameter training.
Optionally, the processor is configured to: when it is determined that a global gradient of a j-th iteration satisfying a first condition has been pulled from the server module, compute the model parameters of the (i+1)-th iteration from the global gradient of the j-th iteration, the local gradient of the i-th iteration and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i, and the first condition includes: the global gradient of the j-th iteration has not been used in the computation of the model parameters in any iteration from the 1st iteration to the i-th iteration. In this way, the model parameters of the (i+1)-th iteration can be computed from a pulled global gradient satisfying the first condition, which improves the accuracy of the computed model parameters, and no waiting for the communication process is needed, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the processor is configured to: when it is determined that no global gradient of a j-th iteration satisfying the first condition has been pulled from the server module, compute the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration. In this way, there is no need to wait for the communication process, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the first condition further includes: the global gradient of the j-th iteration is the global gradient of the highest iteration batch among all global gradients already pulled from the server module. In this way, the model parameters can be updated with the global gradient of the iteration nearest to the current one, accelerating the convergence of the model parameters.
Optionally, the global gradient of the j-th iteration is determined from: the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N. This improves the operational flexibility of the working modules and the server module, and further reduces the communication volume between the working modules and the server module.
Optionally, the transceiver is configured to pull the global gradient of the r-th iteration from the server module; or to pull the global gradient of the r-th iteration from the server module and push the local gradient of the (i-1)-th iteration to the server module; or to pull the global gradient of the r-th iteration from the server module and push the local gradient of the i-th iteration to the server module; or to push the local gradient of the (i-1)-th iteration to the server module; or to push the local gradient of the i-th iteration to the server module. This improves the flexibility of the working module and also allows, as far as possible, the local gradient of the iteration nearest to the current one to be pushed to the server module, accelerating the convergence of the model parameters.
Optionally, when i equals K, the transceiver is further configured to: after the processor computes the local gradient of the K-th iteration and then computes the model parameters of the (K+1)-th iteration from the local gradient of the K-th iteration and the model parameters of the K-th iteration, push the model parameters of the (K+1)-th iteration to the server module, where the model parameters of the (K+1)-th iteration are used by the server module, together with the iteration count K and the (K+1)-th-iteration model parameters pushed by each of the N working modules, to determine the model parameters of the 1st iteration of the next epoch. This improves the accuracy of the model parameters over the training epochs.
In a fourth aspect, an embodiment of this application provides a chip for neural network model training. The chip applies to a training system including N chips and a server, where the server module and the N chips are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations. Each of the N chips is used to execute the method executed by the working module in the first aspect above.
In a fifth aspect, an embodiment of this application provides a system for neural network model training. The system includes a server module and N working modules, which are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations. For the i-th iteration of one of the N working modules within each epoch, each working module is configured to execute in parallel: computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration and, when i is less than K, computing the local gradient of the (i+1)-th iteration from the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration; and pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module, where r and f are each positive integers less than or equal to i, N and K are each integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K. The server module is configured to compute the global gradient of the r-th iteration from the received local gradients of the r-th iteration pushed by the working modules and pull the global gradient of the r-th iteration down to the working modules, and to receive the local gradients of the f-th iteration pushed by the working modules and compute the global gradient of the f-th iteration from the local gradients of the f-th iteration pushed by the working modules.
In a sixth aspect, a computer program product is provided. The computer program product includes a computer program (which may also be called code or instructions) which, when run, causes a computer to execute the method in any possible implementation of the first aspect above.
In a seventh aspect, a computer-readable medium is provided. The computer-readable medium stores a computer program (which may also be called code or instructions) which, when run on a computer, causes the computer to execute the method in any possible implementation of the first aspect above.
In the embodiments of this application, a first process and a second process are executed in parallel in each iteration. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters of the (i+1)-th iteration are computed from the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the prior-art requirement to wait for the global gradient of the i-th iteration to be pulled from the server module before the model parameters of the (i+1)-th iteration can be computed, thereby shortening the duration of one iteration and improving the efficiency of model parameter training.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the distributed training system shown in the background;
Figure 2 is a schematic diagram of an application scenario architecture to which embodiments of this application apply;
Figure 3 is a schematic diagram of an applicable training system provided by an embodiment of this application;
Figure 4 is a schematic flowchart of a neural network model training method provided by an embodiment of this application;
Figure 5 is a schematic flowchart of a neural network model training method provided by an embodiment of this application;
Figure 6 is a schematic structural diagram of a neural network model training apparatus provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of a neural network model training apparatus provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of a system for neural network model training provided by an embodiment of this application.
Detailed Description
Figure 2 shows a schematic diagram of an application scenario architecture to which embodiments of this application apply. As shown in Figure 2, in a specific implementation there are various kinds of raw data, such as the telecom data 201, financial data 202 and consumer data 203 in Figure 2. The big data platform 204 performs data collection, data storage, data computation and so on with this raw data to obtain data processed by the big data platform 204. The data mining platform 205 obtains the processed data from the big data platform 204 and performs data mining, for example using logistic regression (LR), the large-scale traditional neural network model latent Dirichlet allocation (LDA), or deep learning models such as convolutional neural networks (CNN), recurrent neural networks (RNN) and sparse autoencoders (SAE), to obtain data mining results. The application platform 206 covers various fields and can perform big data analysis for the telecom field, the financial field, the consumer field and other fields based on the data mining results determined by the data mining platform 205.
The embodiments of this application can be used for distributed parallel computing clusters that train massive data. Suitable algorithms include convolutional neural networks (for processing images, speech or video), recursive neural networks (for natural language processing), deep neural networks (for processing speech) and other deep learning algorithms, as well as large-scale machine learning algorithms.
The scheme provided by the embodiments of this application applies to the data mining platform 205. The data mining platform 205 can mine and analyze the underlying raw data through deep-learning intelligent analysis; the accelerated training process of the distributed architecture improves the performance and scalability of the data mining platform based on deep learning training, thereby supporting the decision-making and operations of the upper application platform, such as video analysis, image recognition, object detection, natural language processing and other services of the upper application platform.
In the embodiments of this application, a node may be a computer device including at least one graphics processing unit (GPU) chip and/or at least one central processing unit (CPU) chip. Each GPU chip includes one or more GPU cores, and each CPU chip includes one or more CPU cores. A working module in the embodiments of this application may include one or more GPU cores, and a server module may include one or more CPU cores.
For ease of description, multiple server modules may be called a server module set, and multiple working modules may be called a working module set. Figure 3 shows a schematic diagram of an applicable system architecture provided by an embodiment of this application. As shown in Figure 3, the embodiment includes a server module set 307 and a working module set 308. The server module set 307 includes multiple server modules: server module 301, server module 302, ..., server module 303. The working module set 308 may include multiple working modules: working module 304, working module 305, ..., working module 306.
The distributed system architecture includes multiple distributed nodes. There are three specific deployment forms for the nodes: first, working modules and server modules are deployed on the same node, with the number of working modules equal or unequal to the number of server modules; second, working modules and server modules are deployed on different nodes, with the number of working modules equal or unequal to the number of server modules; third, working modules and server modules are deployed mixed across different nodes, i.e. at least one of the nodes has both working modules and server modules, with the number of working modules equal or unequal to the number of server modules. The scheme provided by the embodiments of this application applies to any of these deployment forms.
The server module and the working modules are used to train model parameters in at least one training epoch. Each training epoch may include K iterations, and the model parameters may be trained through one epoch or multiple epochs. The following focuses on one epoch in detail; the schemes for other epochs are similar and are not repeated.
Based on the above, Figure 4 shows a schematic flowchart of a neural network model training method provided by an embodiment of this application. The method applies to a training system including a server module and N working modules, where the server module and the N working modules are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations. For the i-th iteration of one of the N working modules within each epoch, where N and K are each integers greater than or equal to 1 and i is an integer greater than or equal to 1 and less than or equal to K, as shown in Figure 4, the method includes:
Step 401: each working module executes step 402 and step 403 in parallel;
Step 402: each working module computes the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration and, when i is less than K, computes the local gradient of the (i+1)-th iteration from the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration;
Step 403: each working module pulls the global gradient of the r-th iteration from the server module and/or pushes the local gradient of the f-th iteration to the server module, where r and f are each positive integers less than or equal to i. Specifically, there are several options. In option 1, the working module pulls the global gradient of the r-th iteration from the server module; in option 2, the working module pushes the local gradient of the f-th iteration to the server module; in option 3, the working module both pulls the global gradient of the r-th iteration from the server module and pushes the local gradient of the f-th iteration to the server module. Pulling the global gradient of the r-th iteration from the server module includes the working module receiving the global gradient of the r-th iteration sent by the server module, or the working module actively fetching the global gradient of the r-th iteration from the server module. Pushing the local gradient of the f-th iteration to the server module means the working module sending the local gradient of the f-th iteration to the server module.
In the embodiments of this application, step 402 and step 403 are executed in parallel in each iteration: step 402 is a first process and step 403 is a second process. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. On the one hand, computing the model parameters of the (i+1)-th iteration in the first process from the local gradient of the i-th iteration and the model parameters of the i-th iteration avoids the prior-art requirement of waiting for the global gradient of the i-th iteration to be pulled from the server module before those model parameters can be computed, shortening the duration of one iteration and improving the efficiency of model parameter training.
On the other hand, in the embodiments of this application the second process is executed in parallel with the first process, avoiding the prior-art requirement of waiting until the local gradient of the (i+1)-th iteration has been computed before starting the communication process, further shortening the duration of one iteration and improving the efficiency of model parameter training.
In the embodiments of this application, the N working modules and the server module may reside on one node, the node being a computer device including multiple GPU cores and multiple CPU cores. A working module includes one or more GPU cores and a server module includes one or more CPU cores; in this case, the N working modules and the server module may communicate via inter-core communication between the GPU cores and the CPU cores. When the N working modules and the server module reside on multiple nodes, they may communicate via links between the nodes. In the embodiments of this application, each of the N working modules can communicate with the server module.
Optionally, in step 402 above, the working module computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration includes: when the working module determines that a global gradient of a j-th iteration satisfying a first condition has been pulled from the server module, computing the model parameters of the (i+1)-th iteration from the global gradient of the j-th iteration, the local gradient of the i-th iteration and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i, and the first condition includes: the global gradient of the j-th iteration has not been used in the computation of the model parameters in any iteration from the 1st iteration to the i-th iteration. In this way, the model parameters of the (i+1)-th iteration can be computed from a pulled global gradient of a j-th iteration satisfying the first condition, improving the accuracy of the computed model parameters of the (i+1)-th iteration; moreover, selecting such a global gradient from those already pulled from the server module requires no waiting for the communication process, further shortening the iteration duration and improving the efficiency of model parameter training.
The working module computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration includes: when the working module determines that no global gradient of a j-th iteration satisfying the first condition has been pulled from the server module, computing the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration. In this way, there is no need to wait for the communication process, further shortening the iteration duration and improving the efficiency of model parameter training.
Specifically, in the system the communication process and the computation process are two mutually independent processes and can be executed in parallel. Optionally, when executing the communication process, the working module may push a local gradient once and pull a global gradient once, or push local gradients several times in succession and pull global gradients once or several times in succession. Optionally, in step 403 above, when the server module has already computed the global gradient of the r-th iteration, the working module may pull the global gradient of the r-th iteration from the server module. In another optional scheme, if the working module has just finished pushing a local gradient, or it is the working module's turn to push a local gradient, the working module may choose to push the local gradient of the f-th iteration to the server module. In another optional scheme, when communication between the working module and the server module is fast, the working module may, while computing the model parameters of the (i+1)-th iteration and the local gradient of the (i+1)-th iteration, both pull the global gradient of the r-th iteration from the server module and push the local gradient of the f-th iteration to the server module; in the embodiments of this application there is no required order between pushing the local gradient of the f-th iteration and pulling the global gradient of the r-th iteration. In the above schemes, there are multiple implementations for the working module to choose the local gradient of an f-th iteration to push.
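As an illustration only, the overlap of the two independent processes can be sketched with a background thread for the communication process. The thread-based structure and the callable names are assumptions for the sketch; the embodiment may equally realize the overlap with hardware queues or separate cores.

```python
import threading

def run_iteration(compute_step, comm_step):
    # second process: communication runs in the background
    t = threading.Thread(target=comm_step)
    t.start()
    # first process: computation proceeds without waiting for communication
    result = compute_step()
    t.join()  # the iteration ends when both processes have finished
    return result
```

The iteration's wall-clock time is then roughly the maximum of the two processes' durations rather than their sum, which is the time-window overlap described in the text.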
The above is described in more detail through an example. Suppose the working module has successfully pulled the global gradients of the 1st, 3rd, 4th and 6th iterations from the server module; the global gradient of the 1st iteration has already been used in computing the model parameters of the 2nd iteration, while the global gradients of the 3rd, 4th and 6th iterations have not been used. The current iteration is the 9th, i.e. the model parameters of the 9th iteration are being updated, so the (i+1)-th iteration is the 9th iteration. The global gradients of j-th iterations currently satisfying the first condition are those of the 3rd, 4th and 6th iterations. Optionally, the model parameters of the 9th iteration may be computed from any one of the global gradients of the 3rd, 4th and 6th iterations, together with the local gradient of the 8th iteration and the model parameters of the 8th iteration.
Optionally, the first condition further includes: the global gradient of the j-th iteration is the global gradient of the highest iteration batch among all global gradients already pulled from the server module. In this way, the model parameters can be updated with the global gradient of the iteration nearest to the current one, accelerating the convergence of the model parameters. The iteration batch is the sequence number of the iteration; for example, the batch of the 3rd iteration is the 3rd, and the larger the sequence number, the higher the batch. In the example, among the global gradients of the 3rd, 4th and 6th iterations, the iteration of the highest batch is the 6th iteration, so the j-th iteration is preferably determined to be the 6th iteration. Optionally, the model parameters of the 9th iteration are computed from the global gradient of the 6th iteration, together with the local gradient of the 8th iteration and the model parameters of the 8th iteration.
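A minimal sketch of the selection rule in the example above (the dictionary/set data structures are illustrative assumptions): keep the pulled global gradients keyed by iteration index, drop those already used in an update, and prefer the highest remaining iteration batch.

```python
def select_global_gradient(pulled, used):
    # pulled: {iteration index: global gradient} already pulled from the server
    # used: indices whose global gradient was already used in a parameter update
    candidates = [j for j in pulled if j not in used]   # first condition
    # preferred: highest iteration batch among the unused pulled gradients
    return max(candidates) if candidates else None
```

With the gradients of iterations 1, 3, 4 and 6 pulled and iteration 1 already used, the rule picks iteration 6; when nothing qualifies, it returns `None` and formula (1) below applies instead of formula (2).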
Optionally, the communication process may be executed in parallel with the process of updating the model parameters of the 9th iteration. When the working module updates the model parameters of the 9th iteration, it has already computed the local gradients of the first 8 iterations and has already pushed the local gradients of the 1st, 3rd, 4th and 6th iterations to the server module. The local gradients not yet pushed include those of the 2nd, 5th, 7th and 8th iterations. Optionally, the working module may choose among the following schemes:
Scheme a1: while updating the model parameters of the 9th iteration and computing the local gradient of the 9th iteration, the working module executes in parallel: pulling the global gradient of the r-th iteration from the server module. Suppose the working module has already pushed the local gradient of the 5th iteration and the server module has already computed the global gradient of the 5th iteration, but the working module has not yet pulled it; the working module may then pull the global gradient of the 5th iteration. That is, in the embodiments of this application the working module may in parallel pull a global gradient of an r-th iteration that the server module has already computed.
Scheme a2: while updating the model parameters of the 9th iteration and computing the local gradient of the 9th iteration, the working module executes in parallel: pulling the global gradient of the r-th iteration from the server module and pushing the local gradient of the f-th iteration to the server module; or pushing the local gradient of the f-th iteration to the server module. For pushing the local gradient of the f-th iteration, there are several cases, such as schemes b1, b2, b3 and b4 below:
Scheme b1: the working module determines one local gradient from those not yet pushed and pushes it to the server module, for example any one of the local gradients of the 2nd, 5th, 7th and 8th iterations.
Scheme b2: the working module pushes the local gradient of the (i-1)-th iteration, i.e. it selects the not-yet-pushed local gradient of the second-highest iteration batch and pushes it, for example selecting the local gradient of the 7th iteration from among those of the 2nd, 5th, 7th and 8th iterations.
Scheme b3: the working module pushes the local gradient of the i-th iteration, i.e. it selects the not-yet-pushed local gradient of the highest iteration batch and pushes it, for example selecting the local gradient of the 8th iteration from among those of the 2nd, 5th, 7th and 8th iterations.
Scheme b4: the working module may keep waiting and push the local gradient of the (i+1)-th iteration; that is, it waits until the local gradient of the 9th iteration has been determined and then pushes the local gradient of the 9th iteration to the server module.
As can be seen from the above schemes, in the (i+1)-th iteration of the embodiments of this application, the working module may choose to push an already computed local gradient of an f-th iteration to the server module, or to pull a global gradient of an r-th iteration that the server module has already computed; it does not need to report the local gradient of every iteration it computes, or to pull the global gradient of every iteration from the server module, which reduces the communication volume between the working modules and the server module.
Optionally, the global gradient of the j-th iteration is determined from: the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N. This improves the operational flexibility of the working modules and the server module, and further reduces the communication volume between the working modules and the server module. For example, with 50 working modules in total, N is 50 and M is 20; the global gradient of the j-th iteration may be computed from the local gradients of the j-th iteration reported by 20 working modules. Optionally, the global gradient of the j-th iteration may also be computed from the local gradients of the j-th iteration reported by all N working modules.
Optionally, in the above scheme the server module may compute the global gradient of the j-th iteration from all the local gradients of the j-th iteration reported by the working modules. There are various specific algorithms for computing a global gradient from local gradients, such as averaging, weighted computation, or averaging the few local gradients with the largest weights. A few examples for illustration: the server module may average all the local gradients of the j-th iteration reported by the working modules to obtain the global gradient of the j-th iteration; or the server module may multiply each reported local gradient of the j-th iteration by its corresponding weight and then average the weighted local gradients to obtain the global gradient of the j-th iteration.
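The plain-averaging variant can be sketched as follows; the list-of-lists layout (one gradient vector per reporting working module) is an assumption for the sketch.

```python
def global_gradient(local_grads):
    # local_grads: one local gradient (a list of components) per working module
    n = len(local_grads)
    # element-wise average over the reported local gradients
    return [sum(components) / n for components in zip(*local_grads)]
```

The weighted variant mentioned in the text would simply scale each reported gradient by its weight before the averaging step.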
Optionally, when i equals K, the method further includes: after the working module computes the local gradient of the K-th iteration and then computes the model parameters of the (K+1)-th iteration from the local gradient of the K-th iteration and the model parameters of the K-th iteration, it pushes the model parameters of the (K+1)-th iteration to the server module. The model parameters of the (K+1)-th iteration are used by the server module, together with the iteration count K and the (K+1)-th-iteration model parameters pushed by each of the N working modules, to determine the model parameters of the 1st iteration of the next epoch. This improves the accuracy of the model parameters over the training epochs. For example, the model parameters trained in this epoch may be obtained by averaging the (K+1)-th-iteration model parameters pushed by each of the N working modules, or by dividing the sum of the (K+1)-th-iteration model parameters pushed by each of the N working modules by the iteration count K. Optionally, another epoch may then be started to train the model parameters, in which case the model parameters trained in this epoch are determined as the model parameters of the 1st iteration of the next epoch; or no further epoch is started, and the model parameters trained in this epoch are directly determined as the trained model parameters.
To further describe the scheme provided by the embodiments of this application, the method includes:
The distributed computing platform is started and the application is deployed. The server module is initialized to obtain the initialized model parameters ω1_0, and the model parameters ω1_0 are pulled from the server module to each working module.
The working modules perform the 1st iteration.
The GPUs of the working modules each read in the sample data of the 1st iteration and compute local gradients based on the global model parameters ω1_0, while simultaneously preprocessing the sample data of the 2nd iteration; this further shortens the epoch duration. The model parameters of the 2nd iteration are then computed.
For example, among the N working modules, working module 1 computes local gradient Δω1-1, working module 2 computes local gradient Δω2-1, ..., working module n computes local gradient Δωn-1, ..., and working module N computes local gradient ΔωN-1.
The working modules perform the 2nd iteration.
Optionally, the working module executes in parallel: computing the model parameters of the 2nd iteration and computing the local gradient of the 2nd iteration; and pushing the local gradient of the 1st iteration to the server module. Optionally, after the server module has finished computing the global gradient of the 1st iteration, the working module may pull the global gradient of the 1st iteration from the server module.
Since no global gradient has been pulled from the server module at this point, the model parameters of the 2nd iteration are determined from the model parameters of the 1st iteration and the local gradient of the 1st iteration. Specifically, there are various determination schemes, for example making the model parameters of the 2nd iteration approach the final value more closely through error computation. Optionally, a formula (1) by which working module n computes the model parameters of the (i+1)-th iteration is provided:
w_{n,i} = w_{n,i-1} + η · Δw_{n,i} …… Formula (1)
In formula (1):
w_{n,i} are the model parameters of the (i+1)-th iteration of working module n;
i is the iteration number, with range [1, K]; n has range [1, N];
w_{n,i-1} are the model parameters of the i-th iteration;
Δw_{n,i} is the local gradient computed by working module n in the i-th iteration;
η is a learning-rate control factor, which may be determined according to the specific application scenario.
In this example, the model parameters of the 2nd iteration are computed by formula (1) above.
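Formula (1) can be sketched directly, treating the parameter vector as a list of components; the sign convention follows the formula as written in the text.

```python
def formula_1(w_prev, local_grad, eta):
    # formula (1): w_{n,i} = w_{n,i-1} + eta * Δw_{n,i}
    return [w + eta * g for w, g in zip(w_prev, local_grad)]
```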
The GPUs of the working modules each read in the preprocessed sample data of the 2nd iteration and then execute in parallel: computing local gradients based on the model parameters of the 2nd iteration; and preprocessing the sample data of the 3rd iteration.
The working modules push the local gradients of the 1st iteration to the server module. For example, the server module may receive the N local gradients of the 1st iteration, Δω1-1, Δω2-1, ..., Δωn-1, ..., ΔωN-1, reported by the N working modules and, optionally, compute the average of the N local gradients of the 1st iteration to obtain the global gradient Δω1 of the 1st iteration. Thus, while the local gradients of the 2nd iteration are being computed, the local gradients of the 1st iteration are also pushed to the server module, so that the time windows of the computation process and the communication process overlap, shortening the epoch duration. Optionally, the average of the local gradients of the 1st iteration may also be computed from the M local gradients of the 1st iteration reported by M of the N working modules to obtain the global gradient of the 1st iteration.
The working modules perform the 3rd iteration.
Optionally, when the working module has not yet pulled the global gradient of the 1st iteration, it executes in parallel: computing the model parameters of the 3rd iteration and computing the local gradient of the 3rd iteration; and pulling the global gradient Δω1 of the 1st iteration from the server module. Thus, while computing the model parameters and the local gradient of the 3rd iteration, the global gradient of the 1st iteration is also pulled from the server module, so that the time windows of the computation process and the communication process overlap, shortening the epoch duration.
Since no global gradient has been pulled from the server module at this point, i.e. no global gradient of a j-th iteration satisfying the first condition exists, the model parameters of the 3rd iteration are determined from the model parameters of the 2nd iteration and the local gradient of the 2nd iteration. Optionally, the model parameters of the 3rd iteration may be determined by formula (1) above.
The GPUs of the working modules each read in the preprocessed sample data of the 3rd iteration and then execute in parallel: computing local gradients based on the model parameters of the 3rd iteration; and preprocessing the sample data of the 4th iteration.
The working modules perform the 4th iteration.
Optionally, the working module executes in parallel: computing the model parameters of the 4th iteration and computing the local gradient of the 4th iteration; and pushing the local gradient of the 3rd iteration to the server module. Alternatively, while computing the model parameters and the local gradient of the 4th iteration, it neither pushes a local gradient to the server module nor pulls a global gradient from it, which reduces the communication volume between the working module and the server module. The embodiments here are described taking the case of neither pushing nor pulling as an example.
Since the global gradient of the 1st iteration has been pulled from the server module at this point, and that global gradient has not previously been used in updating the model parameters, the model parameters of the 4th iteration are determined from the model parameters of the 3rd iteration, the local gradient of the 3rd iteration, and the global gradient of the 1st iteration pulled from the server module. Specifically, there are various determination schemes, for example making the model parameters of the 4th iteration approach the final value more closely through error computation. Optionally, a formula (2) by which working module n computes the model parameters of the 4th iteration is provided:
w_{n,i} = w_{n,i-1} + λ · Δw_{n,i} + χ · Δw_j …… Formula (2)
In formula (2):
w_{n,i} are the model parameters of the (i+1)-th iteration of working module n;
n has range [1, N]; i is the iteration number, with range [1, K];
w_{n,i-1} are the model parameters of the i-th iteration of working module n;
Δw_{n,i} is the local gradient computed by working module n in the i-th iteration;
Δw_j is the global gradient of the j-th iteration; j is a positive integer less than or equal to i;
λ and χ are learning-rate control factors, each of which may be determined according to the specific application scenario.
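Formula (2) can be sketched in the same component-wise style as formula (1); the extra term blends in the (possibly stale) global gradient of the j-th iteration.

```python
def formula_2(w_prev, local_grad, global_grad, lam, chi):
    # formula (2): w_{n,i} = w_{n,i-1} + λ·Δw_{n,i} + χ·Δw_j
    return [w + lam * g + chi * gj
            for w, g, gj in zip(w_prev, local_grad, global_grad)]
```

When no unused pulled global gradient is available, the update falls back to formula (1), i.e. the χ term is simply absent.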
The GPUs of the working modules each read in the preprocessed sample data of the 4th iteration and then execute in parallel: computing local gradients based on the model parameters of the 4th iteration; and preprocessing the sample data of the 5th iteration. Thus, local gradients are used for computation in the first 3 iterations and the global gradient is used in the 4th iteration, ensuring that the model parameters approach the correct values more quickly and accurately.
The working modules perform the 5th iteration.
Optionally, the working module executes in parallel: computing the model parameters of the 5th iteration and computing the local gradient of the 5th iteration; and pushing the local gradient of the 4th iteration to the server module. Alternatively, it may in parallel compute the model parameters of the 5th iteration and the local gradient of the 5th iteration while pushing the local gradient of the 3rd iteration to the server module. The following description takes pushing the local gradient of the 4th iteration as an example.
Since only the global gradient of the 1st iteration has been pulled from the server module so far, and the global gradient of the 1st iteration has already been used in computing the model parameters of the 4th iteration, the model parameters of the 5th iteration are determined, as shown in formula (1) above, from the model parameters of the 4th iteration and the local gradient of the 4th iteration.
The GPUs of the working modules each read in the preprocessed sample data of the 5th iteration and then execute in parallel: computing local gradients based on the model parameters of the 5th iteration; and preprocessing the sample data of the 6th iteration. The server module may thus receive the local gradients of the 4th iteration, Δω1-4, Δω2-4, ..., Δωn-4, ..., ΔωN-4, reported by the working modules and, optionally, compute the average of the local gradients of the 4th iteration to obtain the global gradient Δω4 of the 4th iteration.
The working modules perform the 6th iteration.
Optionally, the working module executes in parallel: computing the model parameters of the 6th iteration and computing the local gradient of the 6th iteration; and pulling the global gradient of the 4th iteration from the server module. Thus, while computing the model parameters and the local gradient of the 6th iteration, the global gradient of the 4th iteration is also pulled from the server module, so that the time windows of the computation process and the communication process overlap, shortening the epoch duration.
Optionally, since the working module has not yet successfully pulled the global gradient of the 4th iteration when computing the model parameters of the 6th iteration, the model parameters of the 6th iteration may be determined by formula (1) above.
The GPUs of the working modules each read in the preprocessed sample data of the 6th iteration and then execute in parallel: computing local gradients based on the model parameters of the 6th iteration; and preprocessing the sample data of the 7th iteration.
The working modules perform the 7th iteration.
Optionally, the working module executes in parallel: computing the model parameters of the 7th iteration and computing the local gradient of the 7th iteration; and pushing the local gradient of the 6th iteration to the server module.
Since the global gradient of the 4th iteration has been pulled from the server module at this point, and that global gradient has not previously been used in updating the model parameters, the model parameters of the 7th iteration are determined by formula (2) above from the model parameters of the 6th iteration, the local gradient of the 6th iteration, and the global gradient of the 4th iteration pulled from the server module.
The GPUs of the working modules each read in the preprocessed sample data of the 7th iteration and then execute in parallel: computing local gradients based on the model parameters of the 7th iteration; and preprocessing the sample data of the 8th iteration. Thus, local gradients are used for computation in the 5th and 6th iterations and the global gradient is used in the 7th iteration, ensuring that the model parameters approach the correct values more quickly and accurately.
After several more iterations, until convergence or until the number of iterations meets the requirement, in the last iteration of the current epoch, i.e. the K-th iteration, the working module computes the local gradient of the K-th iteration and then computes the model parameters of the (K+1)-th iteration according to formula (1) above. After receiving the local model parameters reported by each of the N working modules, the server module computes the global model parameters of this epoch. There are various specific methods, such as averaging. An embodiment of this application provides a formula (3) by which the server computes the global model parameters:
w_{2,0} = (w_{1,K} + w_{2,K} + ... + w_{n,K} + ... + w_{N,K}) / K …… Formula (3)
In formula (3):
w_{2,0} are the global model parameters; w_{2,0} may also be called the model parameters of the 1st iteration of the next epoch;
w_{n,K} are the local model parameters of working module n;
n has range [1, N]; K is the total number of iterations in the epoch.
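Formula (3) can be sketched as follows. Note that, as stated in the text, the sum of the N working modules' local model parameters is divided by the iteration count K (not by N); the sketch follows the formula as written.

```python
def formula_3(local_params, K):
    # formula (3): w_{2,0} = (w_{1,K} + ... + w_{N,K}) / K, per the text above
    return [sum(components) / K for components in zip(*local_params)]
```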
In the above example, the source of the sample data may be the local disk corresponding to the working module, or a corresponding distributed storage node, such as the Hadoop Distributed File System (HDFS), S3, or the Google File System (GFS).
Figure 5 shows a schematic flowchart of a neural network model training method. As shown in Figure 5, there are one server module and two working modules, working module 1 and working module 2, and one epoch includes K iterations. In the 2nd iteration, each working module pushes the local gradient of the 1st iteration to the server module; in the 3rd iteration, each working module pulls the global gradient of the 1st iteration from the server module. In the 5th iteration, each working module pushes the local gradient of the 4th iteration to the server module; in the 6th iteration, each working module pulls the global gradient of the 4th iteration from the server module. It can be seen that in the embodiments of this application, on the one hand, the time windows of the computation process and the communication process overlap, shortening the epoch duration and improving the training efficiency of the model parameters; on the other hand, local gradients are pushed and global gradients are pulled only for some iterations, avoiding pushing and pulling them for every iteration and reducing the communication volume between the working modules and the server module.
To further introduce the scheme provided by the embodiments of this application, the following specific example is described in detail. The application scenario of this example is classifying an image data set with a deep neural network. The data set of this example is an image recognition database (such as ImageNet) with 1000 classes and 1.28 million images. The neural network of this example is GoogLeNet, a kind of large-scale neural network model. The distributed system of this example includes 4 nodes, each of which includes one server module and one working module: server module 1, server module 2, server module 3, server module 4, working module 1, working module 2, working module 3 and working module 4. Each working module corresponds to one K80 GPU card (12 GB of video memory), and each server module corresponds to one Intel Xeon E5-2620 CPU. Optionally, each working module also corresponds to a portion of a CPU, used for preprocessing sample data. GoogLeNet is currently a rather common image classification network with high classification accuracy. The 1st iteration is taken as an example:
The first iteration begins.
Server module 1 initializes the global model parameters to obtain the model parameters of the 1st iteration, which follow the distribution W ~ N(0, 0.01), and the model parameters of the 1st iteration are pulled from the server module to the working modules of the 4 nodes.
The data scale processed by each working module in each iteration is set to 256. The four working modules compute gradients based on W ~ N(0, 0.01); while the GPUs of the working modules compute the gradients, the CPUs corresponding to the working modules preprocess the next images, i.e. the sample data of the 2nd iteration. This example provides an optional formula (4) by which each working module computes the local gradient of the 1st iteration:
[Formula (4): image not extracted in this text. Each working module computes its local gradient of the 1st iteration from its 256 samples based on the initial model parameters.]
In formula (4), Δw_{1,1} is the local gradient of the 1st iteration of working module 1; Δw_{2,1} is that of working module 2; Δw_{3,1} is that of working module 3; and Δw_{4,1} is that of working module 4.
The 2nd iteration is performed.
Optionally, the working module executes in parallel: computing the model parameters of the 2nd iteration and computing the local gradient of the 2nd iteration; and pushing the local gradient of the 1st iteration to the server module. Optionally, after the server module has finished computing the global gradient of the 1st iteration, the working module may pull the global gradient of the 1st iteration from the server module.
The model parameters of each working module in the 2nd iteration are computed according to formula (1) above with η = 0.01, giving the result shown in formula (5):
w_{1,1} = w_{1,0} + 0.01·Δw_{1,1}
w_{2,1} = w_{1,0} + 0.01·Δw_{2,1}
w_{3,1} = w_{1,0} + 0.01·Δw_{3,1}
w_{4,1} = w_{1,0} + 0.01·Δw_{4,1} …… Formula (5)
In the 2nd iteration, each working module computes the local gradient of the 2nd iteration based on its own model parameters of the 2nd iteration and simultaneously pushes its local gradient of the 1st iteration to server module 1, while the CPU corresponding to the working module preprocesses the next images, i.e. the sample data of the 3rd iteration. This example provides an optional formula (6) by which each working module computes the local gradient of the 2nd iteration:
[Formula (6): image not extracted in this text. Each working module computes its local gradient of the 2nd iteration from its model parameters of the 2nd iteration.]
In formula (6):
Δw_{1,2} is the local gradient of the 2nd iteration of working module 1;
Δw_{2,2} is the local gradient of the 2nd iteration of working module 2;
Δw_{3,2} is the local gradient of the 2nd iteration of working module 3;
Δw_{4,2} is the local gradient of the 2nd iteration of working module 4.
The 3rd iteration is performed.
Optionally, when the working module has not yet pulled the global gradient of the 1st iteration, it executes in parallel: computing the model parameters of the 3rd iteration and computing the local gradient of the 3rd iteration; and pulling the global gradient Δω1 of the 1st iteration from the server module. When the working module has not yet pulled the global gradient of the 1st iteration, it computes the model parameters of each working module in the 3rd iteration according to formula (1) above with η = 0.01, giving the result shown in formula (7):
w_{1,2} = w_{1,1} + 0.01·Δw_{1,2}
w_{2,2} = w_{2,1} + 0.01·Δw_{2,2}
w_{3,2} = w_{3,1} + 0.01·Δw_{3,2}
w_{4,2} = w_{4,1} + 0.01·Δw_{4,2} …… Formula (7)
In formula (7), for n = 1, 2, 3, 4:
w_{n,2} are the model parameters of the 3rd iteration of working module n;
w_{n,1} are the model parameters of the 2nd iteration of working module n;
Δw_{n,2} is the local gradient of the 2nd iteration of working module n.
Optionally, when the working module has already pulled the global gradient of the 1st iteration, it computes the model parameters of each working module in the 3rd iteration according to formula (2) above with λ = 0.01 and χ = 0.4, giving the result shown in formula (8):
w_{1,2} = w_{1,1} + 0.01·Δw_{1,2} + 0.4·Δw_1
w_{2,2} = w_{2,1} + 0.01·Δw_{2,2} + 0.4·Δw_1
w_{3,2} = w_{3,1} + 0.01·Δw_{3,2} + 0.4·Δw_1
w_{4,2} = w_{4,1} + 0.01·Δw_{4,2} + 0.4·Δw_1 …… Formula (8)
In formula (8), for n = 1, 2, 3, 4:
w_{n,2} are the model parameters of the 3rd iteration of working module n;
w_{n,1} are the model parameters of the 2nd iteration of working module n;
Δw_{n,2} is the local gradient of the 2nd iteration of working module n;
Δw_1 is the global gradient of the 1st iteration.
In the 3rd iteration, each working module computes the local gradient of the 3rd iteration based on its own model parameters of the 3rd iteration and simultaneously pulls the global gradient of the 1st iteration from server module 1, while the CPU corresponding to the working module preprocesses the next images, i.e. the sample data of the 4th iteration. This example provides an optional formula (9) by which each working module computes the local gradient of the 3rd iteration:
[Formula (9): image not extracted in this text. Each working module computes its local gradient of the 3rd iteration from its model parameters of the 3rd iteration.]
In formula (9):
Δw_{1,3} is the local gradient of the 3rd iteration of working module 1;
Δw_{2,3} is the local gradient of the 3rd iteration of working module 2;
Δw_{3,3} is the local gradient of the 3rd iteration of working module 3;
Δw_{4,3} is the local gradient of the 3rd iteration of working module 4.
The 3rd iteration ends and the 4th iteration begins.
Optionally, when the working module has not yet pulled the global gradient of the 1st iteration, it executes in parallel: computing the model parameters of the 4th iteration and computing the local gradient of the 4th iteration; and pulling the global gradient Δω1 of the 1st iteration from the server module.
When the working module has not yet pulled the global gradient of the 1st iteration, it computes the model parameters of each working module in the 4th iteration according to formula (1) above.
Optionally, when the working module has already pulled the global gradient of the 1st iteration, it computes the model parameters of each working module in the 4th iteration according to formula (2) above with λ = 0.01 and χ = 0.4, giving the result shown in formula (10):
w_{1,3} = w_{1,2} + 0.01·Δw_{1,3} + 0.4·Δw_1
w_{2,3} = w_{2,2} + 0.01·Δw_{2,3} + 0.4·Δw_1
w_{3,3} = w_{3,2} + 0.01·Δw_{3,3} + 0.4·Δw_1
w_{4,3} = w_{4,2} + 0.01·Δw_{4,3} + 0.4·Δw_1 …… Formula (10)
In formula (10), for n = 1, 2, 3, 4:
w_{n,3} are the model parameters of the 4th iteration of working module n;
w_{n,2} are the model parameters of the 3rd iteration of working module n;
Δw_{n,3} is the local gradient of the 3rd iteration of working module n;
Δw_1 is the global gradient of the 1st iteration.
The local gradient of the 4th iteration is then computed from the model parameters of the 4th iteration; the remaining iterations are similar to the above and are not repeated here.
Optionally, the working modules push their local gradients to server module 1, and server module 1 computes the global gradient from the local gradients; optionally, the average of the local gradients may be computed as the global gradient. The embodiments of this application provide formula (11) for computing the global gradient:
Δw_1 = (Δw_{1,1} + Δw_{2,1} + ... + Δw_{n,1} + ... + Δw_{N,1}) / N …… Formula (11)
In formula (11):
Δw_1 is the global gradient of the 1st iteration;
Δw_{1,1} is the local gradient of the 1st iteration of working module 1;
Δw_{2,1} is the local gradient of the 1st iteration of working module 2;
Δw_{n,1} is the local gradient of the 1st iteration of working module n, with n in range [1, N];
Δw_{N,1} is the local gradient of the 1st iteration of working module N; N is the total number of working modules.
As can be seen from the above, the embodiments of this application use the global gradient information to regulate the model update of each working module without adding extra communication time overhead, solving the model-consistency convergence problem caused by weak synchronization under conventional communication modes. On the premise of ensuring stable convergence of large-scale distributed neural network models (including deep learning models), this application effectively overcomes the communication bottleneck caused by large models. It is also the first solution proposed in the industry to completely overlap the communication time overhead of large-scale distributed machine learning with its computation time overhead, avoiding the communication bottleneck; ideally, a nearly linear speedup can be achieved.
Figure 6 shows a schematic structural diagram of a neural network model training apparatus provided by an embodiment of this application.
Based on the same concept, an embodiment of this application provides a neural network model training apparatus, as shown in Figure 6. The training apparatus includes N working modules and applies to a training system including a server module and the N working modules. The server module and the N working modules are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations. For the i-th iteration of each of the N working modules within each epoch, where N and K are each integers greater than or equal to 1 and i is an integer greater than or equal to 1 and less than or equal to K, each of the N working modules includes a communication module 603 and a computation module 602. Optionally, a storage module 601 may also be included, used to store information such as the pulled global gradients.
The communication module 603 and the computation module 602 of each working module run in parallel.
The computation module 602 is configured to compute the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration and, when i is less than K, compute the local gradient of the (i+1)-th iteration from the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration.
The communication module 603 is configured to pull the global gradient of the r-th iteration from the server module and/or push the local gradient of the f-th iteration to the server module, where r and f are each positive integers less than or equal to i.
In the embodiments of this application, the communication module and the computation module run in parallel in each iteration: the computation module executes a first process and the communication module executes a second process. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters of the (i+1)-th iteration are computed from the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the prior-art requirement to wait for the global gradient of the i-th iteration to be pulled from the server module before the model parameters of the (i+1)-th iteration can be computed, thereby shortening the duration of one iteration and improving the efficiency of model parameter training.
Optionally, the computation module 602 is configured to: when it is determined that a global gradient of a j-th iteration satisfying a first condition has been pulled from the server module, compute the model parameters of the (i+1)-th iteration from the global gradient of the j-th iteration, the local gradient of the i-th iteration and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i, and the first condition includes: the global gradient of the j-th iteration has not been used in the computation of the model parameters in any iteration from the 1st iteration to the i-th iteration. In this way, the model parameters of the (i+1)-th iteration can be computed from a pulled global gradient satisfying the first condition, which improves the accuracy of the computed model parameters, and no waiting for the communication process is needed, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the computation module 602 is configured to: when it is determined that no global gradient of a j-th iteration satisfying the first condition has been pulled from the server module, compute the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration. In this way, there is no need to wait for the communication process, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the first condition further includes: the global gradient of the j-th iteration is the global gradient of the highest iteration batch among all global gradients already pulled from the server module. In this way, the model parameters can be updated with the global gradient of the iteration nearest to the current one, accelerating the convergence of the model parameters.
Optionally, the global gradient of the j-th iteration is determined from: the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N. This improves the operational flexibility of the working modules and the server module, and further reduces the communication volume between the working modules and the server module.
Optionally, the communication module 603 is configured to pull the global gradient of the r-th iteration from the server module; or to pull the global gradient of the r-th iteration from the server module and push the local gradient of the (i-1)-th iteration to the server module; or to pull the global gradient of the r-th iteration from the server module and push the local gradient of the i-th iteration to the server module; or to push the local gradient of the (i-1)-th iteration to the server module; or to push the local gradient of the i-th iteration to the server module. This improves the flexibility of the working module and also allows, as far as possible, the local gradient of the iteration nearest to the current one to be pushed to the server module, accelerating the convergence of the model parameters.
Optionally, when i equals K, the communication module 603 is further configured to: after the computation module computes the local gradient of the K-th iteration and then computes the model parameters of the (K+1)-th iteration from the local gradient of the K-th iteration and the model parameters of the K-th iteration, push the model parameters of the (K+1)-th iteration to the server module, where the model parameters of the (K+1)-th iteration are used by the server module, together with the iteration count K and the (K+1)-th-iteration model parameters pushed by each of the N working modules, to determine the model parameters of the 1st iteration of the next epoch. This improves the accuracy of the model parameters over the training epochs.
As can be seen from the above: in the embodiments of this application, a first process and a second process are executed in parallel in each iteration. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters of the (i+1)-th iteration are computed from the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the prior-art requirement to wait for the global gradient of the i-th iteration to be pulled from the server module before the model parameters of the (i+1)-th iteration can be computed, thereby shortening the duration of one iteration and improving the efficiency of model parameter training.
It should be noted that the division of units in the embodiments of this application is illustrative and is merely a division by logical function; other division manners are possible in actual implementation. The functional units in the embodiments of this application may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
Figure 7 shows a schematic structural diagram of a neural network model training apparatus provided by an embodiment of this application.
Based on the same concept, an embodiment of this application provides a neural network model training apparatus for executing the above method flow. As shown in Figure 7, the training apparatus includes a transceiver 701 and a processor 702. The processor 702 includes N processor cores. Optionally, a memory 704 and a communication interface 703 may also be included. Optionally, a bus 705 may also be included.
The processor, the memory and the transceiver are connected to one another through the bus. The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is used in Figure 7, but this does not mean there is only one bus or one type of bus.
The memory 704 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 704 may also include a combination of the above kinds of memory.
The N processor cores included in the processor 702 may include GPUs, or may include GPUs and CPUs. A processor core may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL) or any combination thereof.
The transceiver is used to implement data transmission between each working module and the server module.
The memory is used to store instructions; optionally, the memory is also used to store information such as the pulled global gradients.
The processor includes N processor cores, and the training apparatus applies to a training system including a server module and the N processor cores. The server module and the N processor cores are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations. For the i-th iteration of one of the N working modules within each epoch, where N and K are each integers greater than or equal to 1 and i is an integer greater than or equal to 1 and less than or equal to K, the transceiver 701 and the processor 702 of each working module run in parallel.
The processor 702 is configured to compute the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration and, when i is less than K, compute the local gradient of the (i+1)-th iteration from the model parameters of the (i+1)-th iteration and the sample data of the (i+1)-th iteration.
The transceiver 701 is configured to pull the global gradient of the r-th iteration from the server module and/or push the local gradient of the f-th iteration to the server module, where r and f are each positive integers less than or equal to i.
The memory is configured to store the global gradients pulled from the server module and the computed local gradients.
In the embodiments of this application, the transceiver and the processor run in parallel in each iteration: the processor executes a first process and the transceiver executes a second process. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters of the (i+1)-th iteration are computed from the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the prior-art requirement to wait for the global gradient of the i-th iteration to be pulled from the server module before the model parameters of the (i+1)-th iteration can be computed, thereby shortening the duration of one iteration and improving the efficiency of model parameter training.
Optionally, the processor 702 is configured to: when it is determined that a global gradient of a j-th iteration satisfying a first condition has been pulled from the server module, compute the model parameters of the (i+1)-th iteration from the global gradient of the j-th iteration, the local gradient of the i-th iteration and the model parameters of the i-th iteration, where j is a positive integer less than or equal to i, and the first condition includes: the global gradient of the j-th iteration has not been used in the computation of the model parameters in any iteration from the 1st iteration to the i-th iteration. In this way, the model parameters of the (i+1)-th iteration can be computed from a pulled global gradient satisfying the first condition, which improves the accuracy of the computed model parameters, and no waiting for the communication process is needed, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the processor 702 is configured to: when it is determined that no global gradient of a j-th iteration satisfying the first condition has been pulled from the server module, compute the model parameters of the (i+1)-th iteration from the local gradient of the i-th iteration and the model parameters of the i-th iteration. In this way, there is no need to wait for the communication process, further shortening the iteration duration and improving the efficiency of model parameter training.
Optionally, the first condition further includes: the global gradient of the j-th iteration is the global gradient of the highest iteration batch among all global gradients already pulled from the server module. In this way, the model parameters can be updated with the global gradient of the iteration nearest to the current one, accelerating the convergence of the model parameters.
Optionally, the global gradient of the j-th iteration is determined from: the local gradients of the j-th iteration reported by M of the N working modules, where M is an integer greater than or equal to 1 and less than or equal to N. This improves the operational flexibility of the working modules and the server module, and further reduces the communication volume between the working modules and the server module.
Optionally, the transceiver 701 is configured to pull the global gradient of the r-th iteration from the server module; or to pull the global gradient of the r-th iteration from the server module and push the local gradient of the (i-1)-th iteration to the server module; or to pull the global gradient of the r-th iteration from the server module and push the local gradient of the i-th iteration to the server module; or to push the local gradient of the (i-1)-th iteration to the server module; or to push the local gradient of the i-th iteration to the server module. This improves the flexibility of the working module and also allows, as far as possible, the local gradient of the iteration nearest to the current one to be pushed to the server module, accelerating the convergence of the model parameters.
Optionally, when i equals K, the transceiver 701 is further configured to: after the processor computes the local gradient of the K-th iteration and then computes the model parameters of the (K+1)-th iteration from the local gradient of the K-th iteration and the model parameters of the K-th iteration, push the model parameters of the (K+1)-th iteration to the server module, where the model parameters of the (K+1)-th iteration are used by the server module, together with the iteration count K and the (K+1)-th-iteration model parameters pushed by each of the N working modules, to determine the model parameters of the 1st iteration of the next epoch. This improves the accuracy of the model parameters over the training epochs.
As can be seen from the above: in the embodiments of this application, a first process and a second process are executed in parallel in each iteration. The first process is the computation process, which includes computing the model parameters of the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration; the second process is the communication process, which includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters of the (i+1)-th iteration are computed from the local gradient of the i-th iteration and the model parameters of the i-th iteration, avoiding the prior-art requirement to wait for the global gradient of the i-th iteration to be pulled from the server module before the model parameters of the (i+1)-th iteration can be computed, thereby shortening the duration of one iteration and improving the efficiency of model parameter training.
Based on the same concept, an embodiment of this application provides a chip for neural network model training. The chip applies to a training system including N chips and a server, where the server module and the N chips are used to train model parameters in at least one training epoch, and each epoch of the at least one training epoch includes K iterations. Each of the N chips is used to execute the method executed by the working module in the first aspect above.
图8示例性示出了本申请实施例提供的一种用于神经网络模型训练的系统的结构示意图。
基于相同构思,本申请实施例提供一种用于神经网络模型训练的系统的结构示意图,如图8所示,系统包括服务器模块800和N个工作模块801、工作模块802至工作模块80n;服务器模块800和N个工作模块801、工作模块802至工作模块80n用于在至少一个训练周期内训练模型参数,至少一个训练周期内的每个训练周期包括K次迭代;
针对N个工作模块中的一个工作模块在每个训练周期内的第i次迭代,N个工作模块801、工作模块802至工作模块80n中的各个工作模块用于:并行执行:根据第i次迭代的局部梯度和第i次迭代的模型参数计算第i+1次迭代的模型参数,且在i小于K的情况下,根据第i+1次迭代的模型参数以及第i+1次迭代的样本数据,计算第i+1次迭代的局部梯度;从服务器模块下拉第r次迭代的全局梯度和/或向服务器模块上推第f次迭代的局部梯度;其中,r和f分别为小于等于i的正整数;其中,N、K分别为大于等于1的整数,i为大于等于1且小于等于K的整数;
The server module 800 is configured to: compute the global gradient of the r-th iteration according to the received local gradients of the r-th iteration pushed by the working modules, and deliver the global gradient of the r-th iteration to each working module; and receive the local gradients of the f-th iteration pushed by the working modules, and compute the global gradient of the f-th iteration according to them.
It can be seen from the above that, in this embodiment of this application, a first process and a second process are executed in parallel during each iteration. The first process is a computation process, which specifically includes computing the model parameters for the (i+1)-th iteration and computing the local gradient of the (i+1)-th iteration. The second process is a communication process, which specifically includes pulling the global gradient of the r-th iteration from the server module and/or pushing the local gradient of the f-th iteration to the server module. In the first process, the model parameters for the (i+1)-th iteration are computed from the local gradient and the model parameters of the i-th iteration, avoiding the prior-art scheme in which the global gradient of the i-th iteration must first be pulled from the server module before the model parameters for the (i+1)-th iteration can be computed. This shortens the duration of one iteration and improves the efficiency of model parameter training.
In the foregoing embodiments, implementation may be entirely or partially by software, hardware, firmware, or any combination thereof. When software is used, implementation may be entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible by the computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)).
Persons skilled in the art should understand that the embodiments of this application may be provided as a method or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
This application is described with reference to the flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of this application have been described, persons skilled in the art may make further changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of this application.
Obviously, persons skilled in the art can make various changes and variations to this application without departing from its scope. If these modifications and variations fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include them.

Claims (16)

  1. A neural network model training method, wherein the method is applicable to a training system that includes a server module and N working modules, the server module and the N working modules are configured to train model parameters over at least one training cycle, and each training cycle of the at least one training cycle includes K iterations; and for an i-th iteration of one of the N working modules within each training cycle, each working module performs the following in parallel:
    computing model parameters for an (i+1)-th iteration according to a local gradient of the i-th iteration and model parameters of the i-th iteration, and, when i is less than K, computing a local gradient of the (i+1)-th iteration according to the model parameters of the (i+1)-th iteration and sample data of the (i+1)-th iteration; and
    pulling a global gradient of an r-th iteration from the server module and/or pushing a local gradient of an f-th iteration to the server module, wherein r and f are each positive integers less than or equal to i;
    wherein N and K are each integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K.
  2. The method according to claim 1, wherein the computing, by the working module, of the model parameters for the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration comprises:
    when the working module determines that a global gradient of a j-th iteration satisfying a first condition has been pulled from the server module, computing the model parameters for the (i+1)-th iteration according to the global gradient of the j-th iteration, the local gradient of the i-th iteration, and the model parameters of the i-th iteration, wherein j is a positive integer less than or equal to i, and the first condition comprises: the global gradient of the j-th iteration has not been used in computing model parameters in any iteration from the 1st iteration to the i-th iteration; and
    when the working module determines that no global gradient of a j-th iteration satisfying the first condition has been pulled from the server module, computing the model parameters for the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration.
  3. The method according to claim 2, wherein the first condition further comprises: the global gradient of the j-th iteration is the global gradient from the iteration with the highest iteration batch among all global gradients already pulled from the server module.
  4. The method according to claim 2 or 3, wherein the global gradient of the j-th iteration is determined according to:
    local gradients of the j-th iteration reported by M of the N working modules, wherein M is an integer greater than or equal to 1 and less than or equal to N.
  5. The method according to any one of claims 1 to 4, wherein the pulling, by the working module, of the global gradient of the r-th iteration from the server module and/or the pushing of the local gradient of the f-th iteration to the server module comprises two of the following items, or any one of the following items:
    pulling the global gradient of the r-th iteration from the server module;
    pushing the local gradient of the (i-1)-th iteration to the server module; or pushing the local gradient of the i-th iteration to the server module.
  6. The method according to any one of claims 1 to 5, wherein, when i equals K, the method further comprises:
    after the working module computes the local gradient of the K-th iteration, and after computing model parameters for a (K+1)-th iteration according to the local gradient of the K-th iteration and the model parameters of the K-th iteration, pushing the model parameters of the (K+1)-th iteration to the server module;
    wherein the model parameters of the (K+1)-th iteration are used to enable the server module to determine the model parameters for the 1st iteration of a next training cycle according to the model parameters of the (K+1)-th iteration pushed to the server module by each of the N working modules and the iteration count K.
  7. A neural network model training apparatus, wherein the training apparatus includes N working modules and is applicable to a training system that includes a server module and the N working modules; the server module and the N working modules are configured to train model parameters over at least one training cycle; each training cycle of the at least one training cycle includes K iterations; each of the N working modules includes a communication module and a computation module; and for an i-th iteration of one of the N working modules within each training cycle:
    the communication module and the computation module of each working module run in parallel;
    wherein the computation module is configured to compute model parameters for an (i+1)-th iteration according to a local gradient of the i-th iteration and model parameters of the i-th iteration, and, when i is less than K, compute a local gradient of the (i+1)-th iteration according to the model parameters of the (i+1)-th iteration and sample data of the (i+1)-th iteration;
    the communication module is configured to pull a global gradient of an r-th iteration from the server module and/or push a local gradient of an f-th iteration to the server module, wherein r and f are each positive integers less than or equal to i; and
    wherein N and K are each integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K.
  8. The training apparatus according to claim 7, wherein the computation module is configured to:
    when it is determined that a global gradient of a j-th iteration satisfying a first condition has been pulled from the server module, compute the model parameters for the (i+1)-th iteration according to the global gradient of the j-th iteration, the local gradient of the i-th iteration, and the model parameters of the i-th iteration, wherein j is a positive integer less than or equal to i, and the first condition comprises: the global gradient of the j-th iteration has not been used in computing model parameters in any iteration from the 1st iteration to the i-th iteration; and
    when it is determined that no global gradient of a j-th iteration satisfying the first condition has been pulled from the server module, compute the model parameters for the (i+1)-th iteration according to the local gradient of the i-th iteration and the model parameters of the i-th iteration.
  9. The training apparatus according to claim 8, wherein the first condition further comprises: the global gradient of the j-th iteration is the global gradient from the iteration with the highest iteration batch among all global gradients already pulled from the server module.
  10. The training apparatus according to claim 8 or 9, wherein the global gradient of the j-th iteration is determined according to:
    local gradients of the j-th iteration reported by M of the N working modules, wherein M is an integer greater than or equal to 1 and less than or equal to N.
  11. The training apparatus according to any one of claims 7 to 10, wherein the communication module is configured to perform two of the following items, or any one of the following items:
    pulling the global gradient of the r-th iteration from the server module;
    pushing the local gradient of the (i-1)-th iteration to the server module; or pushing the local gradient of the i-th iteration to the server module.
  12. The training apparatus according to any one of claims 7 to 11, wherein, when i equals K, the communication module is further configured to:
    after the computation module computes the local gradient of the K-th iteration, and after computing model parameters for a (K+1)-th iteration according to the local gradient of the K-th iteration and the model parameters of the K-th iteration, push the model parameters of the (K+1)-th iteration to the server module;
    wherein the model parameters of the (K+1)-th iteration are used to enable the server module to determine the model parameters for the 1st iteration of a next training cycle according to the model parameters of the (K+1)-th iteration pushed to the server module by each of the N working modules and the iteration count K.
  13. A neural network model training apparatus, wherein the training apparatus includes a processor, a memory, and a transceiver; the processor includes N processor cores; the training apparatus is applicable to a training system that includes a server module and the N processor cores; the server module and the N processor cores are configured to train model parameters over at least one training cycle; and each training cycle of the at least one training cycle includes K iterations;
    the memory is configured to store instructions; the processor is configured to execute the instructions stored in the memory and to control data transmission between the transceiver and the server module; and when the processor executes the instructions stored in the memory, each of the N processor cores is configured to perform the method performed by the working module in any one of claims 1 to 6.
  14. A chip for neural network model training, wherein the chip is applicable to a training system that includes N chips and a server module; the server module and the N chips are configured to train model parameters over at least one training cycle; and each training cycle of the at least one training cycle includes K iterations;
    each of the N chips is configured to perform the method performed by the working module in any one of claims 1 to 6.
  15. A neural network model training system, wherein the system includes a server module and N working modules; the server module and the N working modules are configured to train model parameters over at least one training cycle; and each training cycle of the at least one training cycle includes K iterations;
    for an i-th iteration of one of the N working modules within each training cycle, each working module is configured to perform the following in parallel: computing model parameters for an (i+1)-th iteration according to a local gradient of the i-th iteration and model parameters of the i-th iteration, and, when i is less than K, computing a local gradient of the (i+1)-th iteration according to the model parameters of the (i+1)-th iteration and sample data of the (i+1)-th iteration; and pulling a global gradient of an r-th iteration from the server module and/or pushing a local gradient of an f-th iteration to the server module, wherein r and f are each positive integers less than or equal to i, N and K are each integers greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to K; and
    the server module is configured to: compute the global gradient of the r-th iteration according to the received local gradients of the r-th iteration pushed by the working modules, and deliver the global gradient of the r-th iteration to each working module; and receive the local gradients of the f-th iteration pushed by the working modules, and compute the global gradient of the f-th iteration according to them.
  16. A computer storage medium, wherein the computer storage medium stores computer-executable instructions that, when invoked by a computer, cause the computer to perform the method according to any one of claims 1 to 6.
PCT/CN2017/092091 2016-11-29 2017-07-06 一种神经网络模型训练方法、装置、芯片和系统 WO2018099084A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP17875081.6A EP3540652B1 (en) 2016-11-29 2017-07-06 Method, device, chip and system for training neural network model
US16/424,760 US20190279088A1 (en) 2016-11-29 2019-05-29 Training method, apparatus, chip, and system for neural network model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611073994.5A CN108122032B (zh) 2016-11-29 2016-11-29 一种神经网络模型训练方法、装置、芯片和系统
CN201611073994.5 2016-11-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/424,760 Continuation US20190279088A1 (en) 2016-11-29 2019-05-29 Training method, apparatus, chip, and system for neural network model

Publications (1)

Publication Number Publication Date
WO2018099084A1 true WO2018099084A1 (zh) 2018-06-07

Family

ID=62225831

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/092091 WO2018099084A1 (zh) 2016-11-29 2017-07-06 一种神经网络模型训练方法、装置、芯片和系统

Country Status (4)

Country Link
US (1) US20190279088A1 (zh)
EP (1) EP3540652B1 (zh)
CN (2) CN110348571B (zh)
WO (1) WO2018099084A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085074A (zh) * 2020-08-25 2020-12-15 腾讯科技(深圳)有限公司 一种模型参数更新系统、方法及装置
CN114912587A (zh) * 2022-06-09 2022-08-16 上海燧原科技有限公司 神经网络分布式训练系统、方法、装置、计算单元及介质
US11531912B2 (en) 2019-04-12 2022-12-20 Samsung Electronics Co., Ltd. Electronic apparatus and server for refining artificial intelligence model, and method of refining artificial intelligence model

Families Citing this family (17)

Publication number Priority date Publication date Assignee Title
CN109919313B (zh) * 2019-01-31 2021-06-08 华为技术有限公司 一种梯度传输的方法及分布式训练系统
WO2020155083A1 (zh) * 2019-02-01 2020-08-06 华为技术有限公司 神经网络的分布式训练方法及装置
KR20200120469A (ko) * 2019-04-12 2020-10-21 삼성전자주식회사 인공지능 모델을 갱신하는 전자 장치, 서버 및 그 동작 방법
CN110084380A (zh) * 2019-05-10 2019-08-02 深圳市网心科技有限公司 一种迭代训练方法、设备、系统及介质
CN112152741B (zh) * 2019-06-28 2021-11-19 华为技术有限公司 信道模型的训练方法及装置
CN110379416B (zh) * 2019-08-15 2021-10-22 腾讯科技(深圳)有限公司 一种神经网络语言模型训练方法、装置、设备及存储介质
US11379727B2 (en) * 2019-11-25 2022-07-05 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for enhancing a distributed medical network
CN110956265A (zh) * 2019-12-03 2020-04-03 腾讯科技(深圳)有限公司 一种模型训练方法和相关装置
US11599671B1 (en) 2019-12-13 2023-03-07 TripleBlind, Inc. Systems and methods for finding a value in a combined list of private values
CN111160535B (zh) * 2019-12-31 2024-01-30 北京计算机技术及应用研究所 基于Hadoop的DGCNN模型加速方法
CN111475313B (zh) * 2020-03-04 2023-06-27 江苏理工学院 适用于卷积神经网络前向传播的消息队列构建方法及装置
CN112016699B (zh) * 2020-08-31 2024-02-02 北京灵汐科技有限公司 一种深度学习模型训练方法、工作节点和参数服务器
CN112015749B (zh) 2020-10-27 2021-02-19 支付宝(杭州)信息技术有限公司 基于隐私保护更新业务模型的方法、装置及系统
CN113052239B (zh) * 2021-03-25 2022-08-02 山东大学 基于梯度方向参数优化的神经网络的图像分类方法及系统
CN113343938B (zh) * 2021-07-16 2023-01-31 浙江大学 一种图像识别方法、装置、设备及计算机可读存储介质
CN114422605A (zh) * 2022-01-12 2022-04-29 重庆邮电大学 一种基于联邦学习的通信梯度自适应压缩方法
CN115660078A (zh) * 2022-12-29 2023-01-31 浪潮电子信息产业股份有限公司 一种分布式计算方法、系统、存储介质和电子设备

Citations (5)

Publication number Priority date Publication date Assignee Title
CN104346629A (zh) * 2014-10-24 2015-02-11 华为技术有限公司 一种模型参数训练方法、装置及系统
CN104978601A (zh) * 2015-06-26 2015-10-14 深圳市腾讯计算机系统有限公司 神经网络模型训练系统和方法
US20150324690A1 (en) * 2014-05-08 2015-11-12 Microsoft Corporation Deep Learning Training System
CN105184367A (zh) * 2014-06-09 2015-12-23 讯飞智元信息科技有限公司 深度神经网络的模型参数训练方法及系统
CN106156807A (zh) * 2015-04-02 2016-11-23 华中科技大学 卷积神经网络模型的训练方法及装置

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
CN104036451B (zh) * 2014-06-20 2018-12-11 深圳市腾讯计算机系统有限公司 基于多图形处理器的模型并行处理方法及装置
US10242313B2 (en) * 2014-07-18 2019-03-26 James LaRue Joint proximity association template for neural networks
CN104463324A (zh) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 一种基于大规模高性能集群的卷积神经网络并行处理方法
CN104598972A (zh) * 2015-01-22 2015-05-06 清华大学 一种大规模数据回归神经网络快速训练方法
US10445641B2 (en) * 2015-02-06 2019-10-15 Deepmind Technologies Limited Distributed training of reinforcement learning systems
US20160267380A1 (en) * 2015-03-13 2016-09-15 Nuance Communications, Inc. Method and System for Training a Neural Network
CN104714852B (zh) * 2015-03-17 2018-05-22 华中科技大学 一种适用于分布式机器学习的参数同步优化方法及其系统
US20160321522A1 (en) * 2015-04-30 2016-11-03 Canon Kabushiki Kaisha Devices, systems, and methods for pairwise multi-task feature learning

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20150324690A1 (en) * 2014-05-08 2015-11-12 Microsoft Corporation Deep Learning Training System
CN105184367A (zh) * 2014-06-09 2015-12-23 讯飞智元信息科技有限公司 深度神经网络的模型参数训练方法及系统
CN104346629A (zh) * 2014-10-24 2015-02-11 华为技术有限公司 一种模型参数训练方法、装置及系统
CN106156807A (zh) * 2015-04-02 2016-11-23 华中科技大学 卷积神经网络模型的训练方法及装置
CN104978601A (zh) * 2015-06-26 2015-10-14 深圳市腾讯计算机系统有限公司 神经网络模型训练系统和方法

Non-Patent Citations (1)

Title
See also references of EP3540652A4

Cited By (5)

Publication number Priority date Publication date Assignee Title
US11531912B2 (en) 2019-04-12 2022-12-20 Samsung Electronics Co., Ltd. Electronic apparatus and server for refining artificial intelligence model, and method of refining artificial intelligence model
CN112085074A (zh) * 2020-08-25 2020-12-15 腾讯科技(深圳)有限公司 一种模型参数更新系统、方法及装置
CN112085074B (zh) * 2020-08-25 2024-05-07 腾讯科技(深圳)有限公司 一种模型参数更新系统、方法及装置
CN114912587A (zh) * 2022-06-09 2022-08-16 上海燧原科技有限公司 神经网络分布式训练系统、方法、装置、计算单元及介质
CN114912587B (zh) * 2022-06-09 2023-05-26 上海燧原科技有限公司 神经网络分布式训练系统、方法、装置、计算单元及介质

Also Published As

Publication number Publication date
CN110348571B (zh) 2024-03-29
CN108122032B (zh) 2020-02-14
EP3540652A4 (en) 2019-11-06
US20190279088A1 (en) 2019-09-12
CN108122032A (zh) 2018-06-05
CN110348571A (zh) 2019-10-18
EP3540652A1 (en) 2019-09-18
EP3540652B1 (en) 2023-06-14

Similar Documents

Publication Publication Date Title
WO2018099084A1 (zh) 一种神经网络模型训练方法、装置、芯片和系统
WO2018099085A1 (zh) 一种神经网络模型的训练方法、装置及芯片
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
US11176487B2 (en) Gradient-based auto-tuning for machine learning and deep learning models
EP4123515A1 (en) Data processing method and data processing device
WO2021089013A1 (zh) 空间图卷积网络的训练方法、电子设备及存储介质
WO2016062044A1 (zh) 一种模型参数训练方法、装置及系统
JP7287397B2 (ja) 情報処理方法、情報処理装置及び情報処理プログラム
US11625614B2 (en) Small-world nets for fast neural network training and execution
CN111047563B (zh) 一种应用于医学超声图像的神经网络构建方法
WO2020238039A1 (zh) 神经网络搜索方法及装置
WO2022083093A1 (zh) 图谱中的概率计算方法、装置、计算机设备及存储介质
WO2023051369A1 (zh) 一种神经网络的获取方法、数据处理方法以及相关设备
US11295236B2 (en) Machine learning in heterogeneous processing systems
US20200226458A1 (en) Optimizing artificial neural network computations based on automatic determination of a batch size
CN114528990A (zh) 一种神经网络搜索方法及系统
CN115412401B (zh) 训练虚拟网络嵌入模型及虚拟网络嵌入的方法和装置
WO2023078009A1 (zh) 一种模型权重获取方法以及相关系统
JP2020003860A (ja) 学習システム、処理装置、処理方法、およびプログラム
US11714992B1 (en) Neural network processing based on subgraph recognition
US10769527B2 (en) Accelerating artificial neural network computations by skipping input values
EP3895024A1 (en) Caching data in artificial neural network computations
WO2024016894A1 (zh) 一种神经网络的训练方法以及相关设备
US20230043584A1 (en) Optimization of memory use for efficient neural network execution
US20230051344A1 (en) Optimization of memory use for efficient neural network execution

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17875081

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017875081

Country of ref document: EP

Effective date: 20190613