CN107341132B - Device and method for executing AdaGrad gradient descent training algorithm - Google Patents
Device and method for executing AdaGrad gradient descent training algorithm
- Publication number
- CN107341132B (application number CN201610280620.4A)
- Authority
- CN
- China
- Prior art keywords
- vector
- unit
- data
- gradient
- updated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
Abstract
A computing device and method, the computing device including a controller unit and a data processing unit. The device first reads a gradient vector and the vector of values to be updated, and at the same time updates the historical gradient values in the cache region with the current gradient values; during each iteration, an update amount is calculated from the current gradient values and the historical gradient values, and the vector to be updated is updated; training continues until the parameter vector to be updated converges. By adopting dedicated equipment, the invention solves the problems of insufficient computational performance of processors and high front-end decoding overhead, accelerates the execution of related applications, and reduces memory access bandwidth.
Description
Technical Field
The invention relates to the technical field of AdaGrad algorithm application, in particular to a device and a method for executing an AdaGrad gradient descent training algorithm.
Background
The AdaGrad algorithm is widely used because it is easy to implement, requires little computation and storage, and adaptively assigns a learning rate to each parameter. Implementing the AdaGrad algorithm with a dedicated device can significantly increase its execution speed.
Currently, one known method of performing the AdaGrad gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general instructions with a general register file and general functional units. One disadvantage of this approach is the low computational performance of a single general-purpose processor; when multiple general-purpose processors execute in parallel, communication among them becomes a performance bottleneck. In addition, the general-purpose processor must decode the operations of the AdaGrad gradient descent algorithm into a long sequence of computation and memory-access instructions, and the processor's front-end decoding incurs a large power consumption overhead.
Another known method of performing the AdaGrad gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose SIMD instructions with a general register file and general stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computing, it has no special support for the operations of the AdaGrad gradient descent algorithm, so a large amount of front-end decoding work is still required, bringing considerable additional overhead. In addition, the GPU has only a small on-chip cache, so data required in the computation (such as historical gradient values) must be repeatedly transferred from off-chip memory; off-chip bandwidth becomes the main performance bottleneck and brings a huge power consumption overhead.
Disclosure of Invention
In view of the above, the present invention provides an apparatus and a method for performing an AdaGrad gradient descent algorithm to solve at least one of the above technical problems.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing an AdaGrad gradient descent algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
the data cache unit is used for storing intermediate variables in the operation process and executing initialization and updating operations on the intermediate variables;
and the data processing module is used for executing arithmetic operations under the control of the controller unit, wherein the arithmetic operations comprise vector addition, vector multiplication, vector division, vector square root and basic arithmetic operations, and for storing intermediate variables into the data cache unit.
The data processing module comprises an operation control submodule, a parallel vector addition operation unit, a parallel vector multiplication operation unit, a parallel vector division operation unit, a parallel vector square root operation unit and a basic operation submodule.
When the data processing module performs an operation on a vector, elements at different positions can be operated on in parallel.
Wherein the data cache unit, upon device initialization, initializes the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² to 0, and at the same time opens up two spaces to store the constants α and ε; the two constant spaces are retained until the entire gradient descent algorithm has finished executing.
Wherein, during each data update, the data cache unit reads the historical gradient sum of squares into the data processing module, where its value is updated, i.e. the square of the current gradient value is added to it, and the result is then written back into the data cache unit;
the data processing module reads the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² and the constants α, ε from the data cache unit, sends the updated value of the sum back to the data cache unit, uses the sum and the constants α, ε to calculate the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate, as sketched below.
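The per-iteration flow just described can be summarized in software. The following is a minimal sketch under stated assumptions: NumPy arrays stand in for the parallel vector units, a Python dict stands in for the data cache unit, and the adaptive learning rate is taken as α/√(Σ + ε) — the formula images are not reproduced in this text, so this form follows the common AdaGrad convention rather than the patent's exact expression.

```python
import numpy as np

def init_cache(dim, alpha=0.01, eps=1e-8):
    # Data cache unit: the historical gradient sum of squares starts at 0;
    # two entries hold the constants alpha and epsilon.
    return {"sq_sum": np.zeros(dim), "alpha": alpha, "eps": eps}

def adagrad_step(W, grad, cache):
    # Data processing module: add the square of the current gradient to the
    # historical sum, then "write it back" to the cache.
    cache["sq_sum"] += grad * grad
    # Adaptive learning rate (assumed form): alpha / sqrt(sq_sum + eps).
    lr = cache["alpha"] / np.sqrt(cache["sq_sum"] + cache["eps"])
    # Element-wise update of the vector to be updated.
    return W - lr * grad
```

In use, `adagrad_step` would be called once per iteration with the current gradient until the parameter vector converges.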
As another aspect of the present invention, the present invention also provides a method of executing an AdaGrad gradient descent algorithm, characterized by comprising the steps of:
step (1), initializing the data cache unit, including setting initial values for the constants α and ε and zeroing the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))²; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control its range, the constant ε is a constant used to ensure that the denominator in the adaptive-learning-rate calculation is non-zero, t is the current iteration number, W_i is the parameter to be updated at the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated at the i-th iteration, and Σ denotes a summation over i = 1 to i = t, i.e. a sum over the squared gradients from the initial to the current iteration, (ΔL(W_1))², (ΔL(W_2))², …, (ΔL(W_t))²;
step (2), reading a parameter vector to be updated and a corresponding gradient vector from an external space;
step (3), the data processing module reads and updates the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² in the data cache unit, and calculates the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε) from the constants α, ε and the historical gradient sum of squares in the data cache unit;
step (4), the data processing module completes the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being calculated as:
W_{t+1} = W_t − α·ΔL(W_t) / √(Σ_{i=1}^{t}(ΔL(W_i))² + ε)
wherein W_t denotes the current, i.e. t-th, parameter to be updated, ΔL(W_t) denotes the gradient value of the current parameter to be updated, and W_{t+1} denotes the updated parameter, which is also the parameter to be updated in the next, i.e. (t+1)-th, iteration (a brief numerical illustration is given after step (5));
and step (5), the data processing unit judges whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step (2) and continues.
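As a purely illustrative numerical example of steps (3) and (4) (the values are hypothetical, not taken from the patent): for a single scalar parameter with α = 1, ε = 10⁻⁸, W_1 = 2, and gradients ΔL(W_1) = 4 and ΔL(W_2) = 3, the first iteration accumulates a squared-gradient sum of 16, giving an adaptive learning rate of 1/√(16 + 10⁻⁸) ≈ 0.25 and W_2 = 2 − 0.25·4 = 1; the second iteration accumulates 16 + 9 = 25, giving a learning rate of 1/√25 = 0.2 and W_3 = 1 − 0.2·3 = 0.4. The growing sum of squared gradients thus automatically shrinks the step size for parameters whose gradients have historically been large.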
And a device for executing the AdaGrad gradient descent algorithm, wherein a program for executing the method is solidified in a controller of the device.
As a further aspect of the present invention, the present invention also provides a method of executing an AdaGrad gradient descent algorithm, characterized by comprising the steps of:
step S1, an IO instruction is pre-stored in the first address of the instruction cache unit;
step S2, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all instructions related to AdaGrad gradient descent calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads in an assignment instruction from the instruction cache unit and, according to the translated microinstruction, zeroes the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² in the data cache unit and initializes the constants α, ε; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control its range, the constant ε is a constant used to ensure that the denominator in the adaptive-learning-rate calculation is non-zero, t is the current iteration number, W_i is the parameter to be updated at the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated at the i-th iteration, and Σ denotes a summation over i = 1 to i = t, i.e. a sum over the squared gradients from the initial to the current iteration, (ΔL(W_1))², (ΔL(W_2))², …, (ΔL(W_t))²;
step S4, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the data access unit reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data cache unit;
step S5, the controller unit reads in a data transfer instruction from the instruction cache unit and, according to the translated microinstruction, the historical gradient sum of squares and the constants α, ε are transmitted from the data cache unit to the data processing unit;
step S6, the controller unit reads in a vector operation instruction from the instruction cache unit and, according to the translated microinstruction, performs the update of the historical gradient sum of squares: the instruction is sent to the operation control submodule, and the operation control submodule issues the corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule computes (ΔL(W_t))², and the vector addition parallel operation submodule adds (ΔL(W_t))² to the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))²;
step S7, the controller unit reads in an instruction from the instruction cache unit and, according to the translated microinstruction, the updated historical gradient sum of squares is transmitted from the data processing unit back to the data cache unit;
step S8, the controller unit reads in an adaptive-learning-rate operation instruction from the instruction cache unit and, according to the translated microinstruction, the operation control submodule controls the relevant operation units to perform the following operations: the vector square root parallel operation submodule computes √(Σ_{i=1}^{t}(ΔL(W_i))² + ε), and the vector division parallel operation submodule computes the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε);
step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control submodule to perform the following operations: the vector multiplication parallel operation submodule computes the product of the adaptive learning rate and the gradient ΔL(W_t), and the vector addition parallel operation submodule subtracts this product from W_t, obtaining the updated parameter vector W_{t+1} (see the sketch after step S11);
step S10, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the updated parameter vector W_{t+1} is transmitted from the data processing unit to the designated address in the external address space through the data access unit;
step S11, the controller unit reads in a convergence judgment instruction from the instruction cache unit and, according to the translated microinstruction, the data processing unit judges whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step S5 and continues.
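To make the instruction-level steps S6–S9 concrete, the sketch below models each parallel vector submodule as a small function and the operation control submodule as the code that sequences them; the function names, the placement of ε and the sign handling in S9 are illustrative assumptions, not the patent's instruction set.

```python
import numpy as np

def vec_mul(a, b):   # vector multiplication parallel operation submodule
    return a * b

def vec_add(a, b):   # vector addition parallel operation submodule
    return a + b

def vec_div(a, b):   # vector division parallel operation submodule
    return a / b

def vec_sqrt(a):     # vector square root parallel operation submodule
    return np.sqrt(a)

def iteration(W_t, grad_t, sq_sum, alpha, eps):
    # S6: square the current gradient and accumulate it into the history.
    sq_sum = vec_add(sq_sum, vec_mul(grad_t, grad_t))
    # S8: square root, then division, giving the adaptive learning rate.
    lr = vec_div(alpha, vec_sqrt(vec_add(sq_sum, eps)))
    # S9: multiply the learning rate by the gradient and subtract from W_t.
    W_next = vec_add(W_t, -vec_mul(lr, grad_t))
    # S7 / S10: sq_sum is written back to the cache, W_next to external space.
    return W_next, sq_sum
```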
And a device for executing the AdaGrad gradient descent algorithm, wherein a program for executing the method is solidified in a controller of the device.
Based on the above technical scheme, the device and method have the following beneficial effects: with this device, the AdaGrad gradient descent algorithm can be implemented and data processing efficiency greatly improved; by adopting dedicated equipment for executing the AdaGrad gradient descent algorithm, the problems of insufficient computational performance and high front-end decoding overhead of general-purpose processors can be solved, accelerating the execution of related applications; meanwhile, the data cache unit avoids repeatedly reading data from memory and reduces the memory access bandwidth.
Drawings
Fig. 1 is an exemplary block diagram of the overall structure of an apparatus for implementing an AdaGrad gradient descent algorithm-related application according to an embodiment of the present invention;
fig. 2 is a block diagram of an example of a data processing module in an apparatus for implementing an application related to an AdaGrad gradient descent algorithm according to an embodiment of the present invention;
fig. 3 is a flowchart of operations for implementing an AdaGrad gradient descent algorithm-related application, according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments. Other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art from the following detailed description.
In this specification, the various embodiments described below which are meant to illustrate the principles of this invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The invention discloses a device for executing an AdaGrad gradient descent algorithm, which comprises a data access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The data access unit can access an external address space, can read and write data to each cache unit in the device and complete loading and storage of the data, and specifically comprises the steps of reading an instruction to the instruction cache unit, reading parameters to be updated and corresponding gradient values from the specified storage units to the data processing unit, and directly writing updated parameter vectors into the external specified space from the data processing module; the instruction cache unit reads the instruction through the data access unit and caches the read instruction; the controller unit reads the instruction from the instruction cache unit, decodes the instruction into a microinstruction for controlling the behavior of other modules and sends the microinstruction to other modules such as a data access unit, a data cache unit, a data processing module and the like; the data cache unit stores some intermediate variables needed in the operation of the device, and initializes and updates the variables; the data processing module performs corresponding operation operations according to the instruction, including vector addition operation, vector multiplication operation, vector division operation, vector square root operation and basic operation.
The apparatus for implementing the AdaGrad gradient descent algorithm according to an embodiment of the present invention may be used to support applications using the AdaGrad gradient descent algorithm. A space is established in the data cache unit to store the sum of squares of historical gradient values; each time gradient descent is performed, the learning rate is calculated using this sum of squares, and the vector to be updated is then updated. The gradient descent operation is repeated until the vector to be updated converges.
The invention also discloses a method for executing the AdaGrad gradient descent algorithm, which comprises the following specific implementation steps:
the initialization operation of the data cache unit is completed through an instruction, and comprises setting initial values for constants alpha and epsilon and squaring and summing historical gradientsAnd setting zero operation. Wherein, the constant alpha is a gain coefficient of the adaptive learning rate and is used for adjusting and controlling the range of the adaptive learning rate, the constant epsilon is a smaller constant and is used for ensuring that the denominator in the calculation of the adaptive learning rate is nonzero, t is the current iteration frequency, Wt′Is a parameter to be updated in the ith iteration, Δ L (W)t′) For the gradient value of the parameter to be updated at the i-th iteration, Σ represents a summation operation that ranges from i-1 to i-t, i.e., for the initial to current gradient squared value (Δ L (W)1))2,(ΔL(W2))2,…,(ΔL(Wt))2And (6) summing.
The data access unit then reads the parameter vector to be updated and the corresponding gradient vector from the external space through an IO instruction.
The data processing module, according to the corresponding instruction, reads and updates the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² in the data cache unit, and calculates the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε) from the constants α, ε and the historical gradient sum of squares in the data cache unit.
The data processing module, according to the corresponding instruction, completes the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being calculated as:
W_{t+1} = W_t − α·ΔL(W_t) / √(Σ_{i=1}^{t}(ΔL(W_i))² + ε)
wherein W_t denotes the current (t-th) parameter to be updated, ΔL(W_t) denotes the gradient value of the current parameter to be updated, and W_{t+1} denotes the updated parameter, which is also the parameter to be updated in the next (t+1-th) iteration.
The data processing unit then judges whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to the step of reading the parameter vector to be updated and continues.
The embodiments of the present invention will be further explained with reference to the accompanying drawings.
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the AdaGrad gradient descent algorithm according to an embodiment of the invention. As shown in fig. 1, the apparatus includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may be implemented by hardware, including but not limited to FPGA, CGRA, ASIC, analog circuit, memristor, etc.
The data access unit 1 can access the external address space and can read and write data to each cache unit in the device to complete the loading and storing of data. This specifically includes reading instructions into the instruction cache unit 2, reading the parameters to be updated from the designated storage units into the data processing unit 5, reading gradient values from the designated external space into the data cache unit 4, and writing the updated parameter vector directly from the data processing module 5 into the designated external space.
The instruction cache unit 2 reads the instruction through the data access unit 1 and caches the read instruction.
The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into micro-instructions that control the behavior of other modules, and sends them to the other modules such as the data access unit 1, the data cache unit 4, the data processing module 5, etc.
The data cache unit 4, upon device initialization, initializes the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² to 0, and at the same time opens up two spaces to store the constants α and ε; the two constant spaces are retained until the entire gradient descent iteration process is finished. During each data update, the historical gradient sum of squares is read into the data processing module 5, its value is updated there, i.e. the square of the current gradient value is added, and it is then written back into the data cache unit 4.
The data processing module 5 reads the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² and the constants α, ε from the data cache unit 4, sends the updated value of the sum back to the data cache unit 4, uses the sum and the constants α, ε to calculate the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
Fig. 2 shows an exemplary block diagram of the data processing module in an apparatus for implementing an AdaGrad gradient descent algorithm-related application according to an embodiment of the present invention. As shown in fig. 2, the data processing module includes an operation control submodule 51, a parallel vector addition operation unit 52, a parallel vector multiplication operation unit 53, a parallel vector division operation unit 54, a parallel vector square root operation unit 55, and a basic operation submodule 56. Because the vector operations in the AdaGrad gradient descent algorithm are element-wise operations, when an operation is performed on a vector, elements at different positions can be processed in parallel.
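Because the per-element results are independent, an update over a long vector can be split across parallel lanes. The sketch below only illustrates that element-wise independence with Python threads over disjoint slices; the lane count, chunking scheme and learning-rate form are assumptions for illustration, not the patented submodule hardware.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def adagrad_update_chunk(W, grad, sq_sum, alpha, eps, sl):
    # Each lane handles an independent slice of the vectors.
    sq_sum[sl] += grad[sl] ** 2
    W[sl] -= alpha * grad[sl] / np.sqrt(sq_sum[sl] + eps)

def parallel_adagrad_update(W, grad, sq_sum, alpha=0.01, eps=1e-8, lanes=4):
    # Split the indices into contiguous chunks, one per "lane".
    bounds = np.linspace(0, len(W), lanes + 1, dtype=int)
    slices = [slice(bounds[i], bounds[i + 1]) for i in range(lanes)]
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        for sl in slices:
            pool.submit(adagrad_update_chunk, W, grad, sq_sum, alpha, eps, sl)
    return W, sq_sum
```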
Fig. 3 shows a general flow diagram of the device performing the operations related to the AdaGrad gradient descent algorithm.
In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
In step S2, the operation starts: the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit 2.
In step S3, the controller unit 3 reads in an assignment instruction from the instruction cache unit 2 and, based on the translated microinstruction, the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² in the data cache unit is zeroed and the constants α, ε are initialized. Wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control its range, the constant ε is a smaller constant used to ensure that the denominator in the adaptive-learning-rate calculation is non-zero, t is the current iteration number, W_i is the parameter to be updated at the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated at the i-th iteration, and Σ denotes a summation over i = 1 to i = t, i.e. a sum over the squared gradients (ΔL(W_1))², (ΔL(W_2))², …, (ΔL(W_t))².
In step S4, the controller unit 3 reads in an IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, the data access unit 1 reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data cache unit 4.
In step S5, the controller unit 3 reads in a data transfer instruction from the instruction cache unit 2 and, based on the translated microinstruction, the historical gradient sum of squares in the data cache unit 4 and the constants α, ε are transmitted to the data processing unit.
In step S6, the controller unit reads in a vector operation instruction from the instruction cache unit 2 and, according to the translated microinstruction, the update of the historical gradient sum of squares is performed: the instruction is sent to the operation control submodule 51, which issues the corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule 53 computes (ΔL(W_t))², and the vector addition parallel operation submodule 52 adds (ΔL(W_t))² to the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))².
In step S7, the controller unit reads in an instruction from the instruction cache unit 2 and, according to the translated microinstruction, the updated historical gradient sum of squares is transmitted from the data processing unit 5 back to the data cache unit 4.
In step S8, the controller unit reads in an adaptive-learning-rate operation instruction from the instruction cache unit 2 and, based on the translated microinstruction, the operation control submodule 51 controls the relevant operation units to perform the following operations: the vector square root parallel operation submodule 55 computes √(Σ_{i=1}^{t}(ΔL(W_i))² + ε), and the vector division parallel operation submodule 54 computes the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε).
In step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control submodule 51 to perform the following operations: the vector multiplication parallel operation submodule 53 computes the product of the adaptive learning rate and the gradient ΔL(W_t), and the vector addition parallel operation submodule 52 subtracts this product from W_t, obtaining the updated parameter vector W_{t+1}.
In step S10, the controller unit reads in an IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, the updated parameter vector W_{t+1} is transmitted from the data processing unit 5 to the designated address in the external address space through the data access unit 1.
In step S11, the controller unit reads in a convergence judgment instruction from the instruction cache unit 2 and, according to the translated microinstruction, the data processing unit judges whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step S5 and continues.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software carried on a non-transitory computer readable medium), or a combination thereof. Although the processes or methods are described above in terms of some sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims (11)
1. A computing device that executes an AdaGrad gradient descent algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
a data processing unit for performing arithmetic operations including a basic arithmetic operation, and vector addition operation, vector multiplication operation, vector division operation, and vector square root operation under the control of the controller unit;
the data cache unit is used for storing the square sum of the historical gradient values in the operation process;
the data cache unit is also used for executing initialization and updating operation on the intermediate variable;
when the computing device executes the AdaGrad gradient descent algorithm, firstly reading a gradient vector and a value vector to be updated, and updating a historical gradient value cached in a data cache unit by using a current gradient value; during each iteration, calculating an updating amount by using the current gradient value and the historical gradient value, and performing updating operation on a vector to be updated; and continuing training until the parameter vector to be updated converges.
2. The computing apparatus of claim 1, wherein the data processing unit includes an operation control sub-module and a basic operation sub-module, and a parallel vector addition operation unit, a parallel vector division operation unit, a parallel vector multiplication operation unit, and a parallel vector square root operation unit.
3. The computing device of claim 2, wherein different positional elements are capable of performing operations in parallel when the data processing unit performs operations on the same vector.
4. The computing device of claim 1, wherein the data cache unit is configured to, upon device initialization, initialize the sum of squares of historical gradient values Σ_{i=1}^{t}(ΔL(W_i))² to 0, and simultaneously open up two spaces to store the constants α and ε, the two constant spaces being kept until the whole gradient descent algorithm has been executed, wherein the adaptive learning rate is α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε).
5. The computing device of claim 1, wherein the data cache unit is configured to, during each data update, read the sum of squares of historical gradient values Σ_{i=1}^{t}(ΔL(W_i))² into the data processing unit, where its value is updated, i.e. the square of the current gradient value is added, and then written back into the data cache unit;
the data processing unit reads the sum of squares of historical gradient values Σ_{i=1}^{t}(ΔL(W_i))² and the constants α, ε from the data cache unit, sends the updated value of the sum back to the data cache unit, uses the sum and the constants α, ε to calculate the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
6. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
the controller unit decodes the read instruction into a microinstruction for controlling the corresponding module and sends the microinstruction to the corresponding module;
the data processing unit executes arithmetic operations under the control of the controller unit, including basic arithmetic operations, and vector addition operations, vector multiplication operations, vector division operations, and vector square root operations;
the controller unit stores the square sum of the historical gradient values generated in the operation process in a data cache unit, and performs initialization and updating operations on the intermediate variables;
when executing the AdaGrad gradient descent algorithm, the controller unit firstly reads the gradient vector and the value vector to be updated, and meanwhile, updates the historical gradient value cached in the data cache unit by using the current gradient value; during each iteration, calculating an updating amount by using the current gradient value and the historical gradient value, and performing updating operation on a vector to be updated; and continuing training until the parameter vector to be updated converges.
7. The computing method of claim 6, wherein different positional elements are capable of performing operations in parallel when the data processing units perform operations on the same vector.
8. The computing method of claim 6, wherein the data cache unit, upon device initialization, initializes the sum of squares of historical gradient values Σ_{i=1}^{t}(ΔL(W_i))² to 0, and simultaneously opens up two spaces to store the constants α and ε, the two constant spaces being kept until the whole gradient descent algorithm has been executed, wherein the adaptive learning rate is α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε).
9. The computing method of claim 6, wherein the data cache unit, during each data update, reads the sum of squares of historical gradient values Σ_{i=1}^{t}(ΔL(W_i))² into the data processing unit, where its value is updated, i.e. the square of the current gradient value is added, and then written back into the data cache unit;
the data processing unit reads the sum of squares of historical gradient values Σ_{i=1}^{t}(ΔL(W_i))² and the constants α, ε from the data cache unit, sends the updated value of the sum back to the data cache unit, uses the sum and the constants α, ε to calculate the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
10. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
step (1), initializing a data cache unit, including setting initial values for the constants α and ε and zeroing the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))²; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control its range, the constant ε is a constant used to ensure that the denominator in the adaptive-learning-rate calculation is non-zero, t is the current iteration number, W_i is the parameter to be updated at the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated at the i-th iteration, and Σ denotes a summation over i = 1 to i = t, i.e. a sum over the squared gradients from the initial to the current iteration, (ΔL(W_1))², (ΔL(W_2))², …, (ΔL(W_t))²;
step (2), reading a parameter vector to be updated and a corresponding gradient vector from an external space;
step (3), the data processing unit reads and updates the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² in the data cache unit, and calculates the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε) from the constants α, ε and the historical gradient sum of squares in the data cache unit;
step (4), the data processing unit completes the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being calculated as:
W_{t+1} = W_t − α·ΔL(W_t) / √(Σ_{i=1}^{t}(ΔL(W_i))² + ε)
wherein W_t denotes the current, i.e. t-th, parameter to be updated, ΔL(W_t) denotes the gradient value of the current parameter to be updated, and W_{t+1} denotes the updated parameter, which is also the parameter to be updated in the next, i.e. (t+1)-th, iteration;
step (5), the data processing unit judges whether the updated parameter vector has converged; if so, the operation ends, otherwise the process returns to step (2) and continues;
the data cache unit stores the square sum of historical gradient values in the operation process.
11. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
step S1, an IO instruction is pre-stored in the first address of the instruction cache unit;
step S2, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all instructions related to AdaGrad gradient descent calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads in an assignment instruction from the instruction cache unit and, according to the translated microinstruction, zeroes the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))² in the data cache unit and initializes the constants α, ε; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control its range, the constant ε is a constant used to ensure that the denominator in the adaptive-learning-rate calculation is non-zero, t is the current iteration number, W_i is the parameter to be updated at the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated at the i-th iteration, and Σ denotes a summation over i = 1 to i = t, i.e. a sum over the squared gradients from the initial to the current iteration, (ΔL(W_1))², (ΔL(W_2))², …, (ΔL(W_t))²;
step S4, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the data access unit reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data cache unit;
step S5, the controller unit reads in a data transfer instruction from the instruction cache unit and, according to the translated microinstruction, the historical gradient sum of squares and the constants α, ε are transmitted from the data cache unit to the data processing unit;
step S6, the controller unit reads in a vector operation instruction from the instruction cache unit and, according to the translated microinstruction, performs the update of the historical gradient sum of squares: the instruction is sent to the operation control submodule, and the operation control submodule issues the corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule computes (ΔL(W_t))², and the vector addition parallel operation submodule adds (ΔL(W_t))² to the historical gradient sum of squares Σ_{i=1}^{t}(ΔL(W_i))²;
step S7, the controller unit reads in an instruction from the instruction cache unit and, according to the translated microinstruction, the updated historical gradient sum of squares is transmitted from the data processing unit back to the data cache unit;
step S8, the controller unit reads in an adaptive-learning-rate operation instruction from the instruction cache unit and, according to the translated microinstruction, the operation control submodule controls the relevant operation units to perform the following operations: the vector square root parallel operation submodule computes √(Σ_{i=1}^{t}(ΔL(W_i))² + ε), and the vector division parallel operation submodule computes the adaptive learning rate α/√(Σ_{i=1}^{t}(ΔL(W_i))² + ε);
step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control submodule to perform the following operations: the vector multiplication parallel operation submodule computes the product of the adaptive learning rate and the gradient ΔL(W_t), and the vector addition parallel operation submodule subtracts this product from W_t, obtaining the updated parameter vector W_{t+1};
step S10, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the updated parameter vector W_{t+1} is transmitted from the data processing unit to the designated address in the external address space through the data access unit;
step S11, the controller unit reads in a convergence judgment instruction from the instruction cache unit and, according to the translated microinstruction, the data processing unit judges whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step S5 and continues;
the data cache unit stores the square sum of historical gradient values in the operation process.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610280620.4A CN107341132B (en) | 2016-04-29 | 2016-04-29 | Device and method for executing AdaGrad gradient descent training algorithm |
PCT/CN2016/081836 WO2017185411A1 (en) | 2016-04-29 | 2016-05-12 | Apparatus and method for executing adagrad gradient descent training algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610280620.4A CN107341132B (en) | 2016-04-29 | 2016-04-29 | Device and method for executing AdaGrad gradient descent training algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107341132A CN107341132A (en) | 2017-11-10 |
CN107341132B true CN107341132B (en) | 2021-06-11 |
Family
ID=60161682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610280620.4A Active CN107341132B (en) | 2016-04-29 | 2016-04-29 | Device and method for executing AdaGrad gradient descent training algorithm |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107341132B (en) |
WO (1) | WO2017185411A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110378480B (en) * | 2019-06-14 | 2022-09-27 | 平安科技(深圳)有限公司 | Model training method and device and computer readable storage medium |
CN111626434B (en) * | 2020-05-15 | 2022-06-07 | 浪潮电子信息产业股份有限公司 | Distributed training parameter updating method, device, equipment and storage medium |
CN112329941B (en) * | 2020-11-04 | 2022-04-12 | 支付宝(杭州)信息技术有限公司 | Deep learning model updating method and device |
CN113238975A (en) * | 2021-06-08 | 2021-08-10 | 中科寒武纪科技股份有限公司 | Memory, integrated circuit and board card for optimizing parameters of deep neural network |
CN116128072B (en) * | 2023-01-20 | 2023-08-25 | 支付宝(杭州)信息技术有限公司 | Training method, device, equipment and storage medium of risk control model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156637A (en) * | 2011-05-04 | 2011-08-17 | 中国人民解放军国防科学技术大学 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
CN104035751A (en) * | 2014-06-20 | 2014-09-10 | 深圳市腾讯计算机系统有限公司 | Graphics processing unit based parallel data processing method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101826142B (en) * | 2010-04-19 | 2011-11-09 | 中国人民解放军信息工程大学 | Reconfigurable elliptic curve cipher processor |
US9477925B2 (en) * | 2012-11-20 | 2016-10-25 | Microsoft Technology Licensing, Llc | Deep neural networks training for speech and pattern recognition |
- 2016-04-29: CN application CN201610280620.4A filed; granted as CN107341132B (active)
- 2016-05-12: PCT application PCT/CN2016/081836 filed; published as WO2017185411A1
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102156637A (en) * | 2011-05-04 | 2011-08-17 | 中国人民解放军国防科学技术大学 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
CN104035751A (en) * | 2014-06-20 | 2014-09-10 | 深圳市腾讯计算机系统有限公司 | Graphics processing unit based parallel data processing method and device |
Non-Patent Citations (2)
Title |
---|
An overview of gradient descent optimization algorithms; Sebastian Ruder; https://ruder.io/optimizing-gradient-descent/; 2016-01-19; pp. 1-26 *
Large Scale Distributed Deep Networks; Jeffrey Dean et al.; NIPS 2012; 2014-10-21; pp. 1-9 *
Also Published As
Publication number | Publication date |
---|---|
CN107341132A (en) | 2017-11-10 |
WO2017185411A1 (en) | 2017-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107341132B (en) | Device and method for executing AdaGrad gradient descent training algorithm | |
US10643129B2 (en) | Apparatus and methods for training in convolutional neural networks | |
US20190370663A1 (en) | Operation method | |
US20190065958A1 (en) | Apparatus and Methods for Training in Fully Connected Layers of Convolutional Networks | |
CN111260025B (en) | Apparatus and method for performing LSTM neural network operation | |
CN110929863B (en) | Apparatus and method for performing LSTM operations | |
WO2017124642A1 (en) | Device and method for executing forward calculation of artificial neural network | |
CN107316078A (en) | Apparatus and method for performing artificial neural network self study computing | |
CN108320018B (en) | Artificial neural network operation device and method | |
WO2017185257A1 (en) | Device and method for performing adam gradient descent training algorithm | |
CN107170019B (en) | Rapid low-storage image compression sensing method | |
US20190065938A1 (en) | Apparatus and Methods for Pooling Operations | |
CN107341540B (en) | Device and method for executing Hessian-Free training algorithm | |
US20200097520A1 (en) | Apparatus and methods for vector operations | |
CN109754062B (en) | Execution method of convolution expansion instruction and related product | |
WO2017185248A1 (en) | Apparatus and method for performing auto-learning operation of artificial neural network | |
CN107644253A (en) | A kind of Neural network optimization based on inverse function, system and electronic equipment | |
WO2017185335A1 (en) | Apparatus and method for executing batch normalization operation | |
US20190130274A1 (en) | Apparatus and methods for backward propagation in neural networks supporting discrete data | |
CN111860814B (en) | Apparatus and method for performing batch normalization operations | |
CN107315570B (en) | Device and method for executing Adam gradient descent training algorithm | |
CN107315569B (en) | Device and method for executing RMSprop gradient descent algorithm | |
WO2017185256A1 (en) | Rmsprop gradient descent algorithm execution apparatus and method | |
CN113595681B (en) | QR decomposition method, system, circuit, equipment and medium based on Givens rotation | |
US20190080241A1 (en) | Apparatus and methods for backward propagation in neural networks supporting discrete data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing Applicant after: Zhongke Cambrian Technology Co., Ltd Address before: 100190 room 644, scientific research complex, No. 6, South Road, Academy of Sciences, Haidian District, Beijing Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd. |
|
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |