CN107341132B - Device and method for executing AdaGrad gradient descent training algorithm - Google Patents

Device and method for executing AdaGrad gradient descent training algorithm

Info

Publication number
CN107341132B
Authority
CN
China
Prior art keywords
vector
unit
data
gradient
updated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610280620.4A
Other languages
Chinese (zh)
Other versions
CN107341132A (en)
Inventor
郭崎
刘少礼
陈天石
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201610280620.4A priority Critical patent/CN107341132B/en
Priority to PCT/CN2016/081836 priority patent/WO2017185411A1/en
Publication of CN107341132A publication Critical patent/CN107341132A/en
Application granted granted Critical
Publication of CN107341132B publication Critical patent/CN107341132B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A computing device and method, the computing device including a controller unit and a data processing unit. The device first reads a gradient vector and the vector of values to be updated, and at the same time updates the historical gradient values in the cache region with the current gradient values; during each iteration, an update amount is calculated from the current and historical gradient values and applied to the vector to be updated; training continues until the parameter vector to be updated converges. By adopting dedicated hardware, the invention overcomes the insufficient arithmetic performance and high front-end decoding overhead of general-purpose processors, accelerates the execution of related applications, and reduces the memory access bandwidth.

Description

Device and method for executing AdaGrad gradient descent training algorithm
Technical Field
The invention relates to the technical field of AdaGrad algorithm application, in particular to a device and a method for executing an AdaGrad gradient descent training algorithm.
Background
The AdaGrad algorithm is widely used because it is easy to implement, requires little computation and storage, and can adaptively assign a learning rate to each parameter. Implementing the AdaGrad algorithm with a dedicated device can significantly increase its execution speed.
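For concreteness, the update rule behind these properties can be stated in a few lines; the following is a minimal NumPy sketch of the textbook AdaGrad step (not of the hardware device described below), assuming the common formulation in which a small constant ε keeps the denominator non-zero.
```python
import numpy as np

def adagrad_update(w, grad, hist_sq, alpha=0.01, eps=1e-8):
    # Minimal reference sketch of one AdaGrad step; names, defaults and the
    # placement of eps are illustrative assumptions, not the patent's circuit.
    hist_sq += grad ** 2                   # accumulate squared gradients per parameter
    lr = alpha / (np.sqrt(hist_sq) + eps)  # per-element adaptive learning rate
    w -= lr * grad                         # each parameter gets its own step size
    return w, hist_sq

# Example: one update of a 4-element parameter vector
w, hist_sq = np.zeros(4), np.zeros(4)
grad = np.array([0.5, -0.1, 0.0, 2.0])
w, hist_sq = adagrad_update(w, grad, hist_sq)
```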
Currently, one known method of performing the AdaGrad gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general instructions using a general register file and general functional units. One disadvantage of this approach is the low computational performance of a single general-purpose processor, and when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the AdaGrad gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding brings a large power consumption overhead.
Another known method of performing the AdaGrad gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose SIMD instructions using a general register file and general stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computing, it has no special support for the operations of the AdaGrad gradient descent algorithm, so a large amount of front-end decoding work is still needed to perform them, bringing considerable additional overhead. In addition, the GPU has only a small on-chip cache, so data required by the operation (such as the historical gradient values) must be repeatedly transferred from off-chip; off-chip bandwidth then becomes the main performance bottleneck and brings a huge power consumption overhead.
Disclosure of Invention
In view of the above, the present invention provides an apparatus and a method for performing an AdaGrad gradient descent algorithm to solve at least one of the above technical problems.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing an AdaGrad gradient descent algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
the data cache unit is used for storing intermediate variables in the operation process and executing initialization and updating operations on the intermediate variables;
and the data processing module is used for executing arithmetic operations under the control of the controller unit, wherein the arithmetic operations comprise vector addition, vector multiplication, vector division, vector square root and basic arithmetic operations, and for storing intermediate variables into the data cache unit.
The data processing module comprises an operation control submodule, a parallel vector addition operation unit, a parallel vector multiplication operation unit, a parallel vector division operation unit, a parallel vector square root operation unit and a basic operation submodule.
When the data processing module performs an operation on the same vector, elements at different positions can be operated on in parallel.
Wherein, at device initialization, the data cache unit initializes the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to 0 and simultaneously opens up two spaces to store the constants α and ε, the two constant spaces being kept until the whole gradient descent algorithm has finished executing.
Wherein, during each data update, the data cache unit reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ into the data processing module, where its value is updated, namely the square of the current gradient value is added, and the result is then written back into the data cache unit;
the data processing module reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data cache unit, updates $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends its value back to the data cache unit, calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
As another aspect of the present invention, the present invention also provides a method of executing an AdaGrad gradient descent algorithm, characterized by comprising the steps of:
step (1), initializing the data cache unit, including setting initial values for the constants α and ε and zeroing the sum of squares of the historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
step (2), reading a parameter vector to be updated and a corresponding gradient vector from an external space;
and (3) the data processing module reads and updates the sum of squares of the historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit, and calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using the constants α and ε and the sum of squares of the historical gradients in the data cache unit;
and (4) the data processing module completes the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being computed as:
$$W_{t+1} = W_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$$
wherein $W_t$ denotes the current, i.e., t-th, parameter to be updated, $\Delta L(W_t)$ denotes the gradient value of the current parameter to be updated, and $W_{t+1}$ denotes the updated parameter, which is also the parameter to be updated in the next, i.e., (t+1)-th, iteration;
and (5) judging whether the updated parameter vector is converged by the data processing unit, if so, finishing the operation, otherwise, turning to the step (2) to continue executing.
And a device for executing the AdaGrad gradient descent algorithm, wherein a program for executing the above method is embedded in a controller of the device.
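For reference only, steps (1) through (5) above map onto the following software sketch of the training loop; the gradient function, tolerance and iteration cap are assumptions introduced purely for illustration, not features of the claimed device.
```python
import numpy as np

def adagrad_train(w0, grad_fn, alpha=0.1, eps=1e-8, tol=1e-6, max_iter=10000):
    # Software analogue of steps (1)-(5); grad_fn(w) returns the gradient ΔL(W).
    w = w0.copy()
    hist_sq = np.zeros_like(w)                  # step (1): zero the historical sum of squares
    for _ in range(max_iter):
        grad = grad_fn(w)                       # step (2): obtain the current gradient vector
        hist_sq += grad ** 2                    # step (3): update the historical sum of squares
        lr = alpha / (np.sqrt(hist_sq) + eps)   #           and the adaptive learning rate
        w_next = w - lr * grad                  # step (4): update the parameter vector
        if np.linalg.norm(w_next - w) < tol:    # step (5): convergence check
            return w_next
        w = w_next
    return w

# Example: minimize f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3)
w_star = adagrad_train(np.zeros(3), lambda w: 2.0 * (w - 3.0))
```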
As a further aspect of the present invention, the present invention also provides a method of executing an AdaGrad gradient descent algorithm, characterized by comprising the steps of:
step S1, an IO instruction is pre-stored in the first address of the instruction cache unit;
step S2, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all instructions related to AdaGrad gradient descent calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads in the assignment instruction from the instruction cache unit and, according to the translated microinstruction, zeroes the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit and initializes the constants α and ε; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
step S4, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the data access unit reads the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data cache unit;
step S5, the controller unit reads in a data transfer instruction from the instruction cache unit and, according to the translated microinstruction, the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data cache unit are transmitted to the data processing unit;
step S6, the controller unit reads in a vector instruction from the instruction cache unit and, according to the translated microinstruction, performs the update of the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$; the instruction is sent to the operation control submodule, which issues corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule computes $(\Delta L(W_t))^2$, and the vector addition parallel operation submodule adds $(\Delta L(W_t))^2$ to the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$;
step S7, the controller unit reads in an instruction from the instruction cache unit and, according to the translated microinstruction, the updated historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is transmitted from the data processing unit back to the data cache unit;
step S8, the controller unit reads in an adaptive learning rate operation instruction from the instruction cache unit and, according to the translated microinstruction, the operation control submodule controls the relevant operation modules to perform the following operations: the vector square root parallel operation submodule computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the vector division parallel operation submodule computes the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$;
step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control submodule to perform the following operations: the vector multiplication parallel operation submodule computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, and the vector addition parallel operation submodule computes $W_{t+1}=W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, obtaining the updated parameter vector $W_{t+1}$;
step S10, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the updated parameter vector $W_{t+1}$ is transmitted from the data processing unit to the designated address in the external address space through the data access unit;
in step S11, the controller unit reads a convergence judgment instruction from the instruction cache unit, and the data processing unit judges whether the updated parameter vector converges according to the translated microinstruction, and if so, the operation ends, otherwise, the processing unit goes to step S5 to continue the execution.
And a device for executing the AdaGrad gradient descent algorithm, wherein a program for executing the above method is embedded in a controller of the device.
Based on the above technical scheme, the device and method have the following beneficial effects: with this device, the AdaGrad gradient descent algorithm can be implemented and data processing efficiency is greatly improved; by adopting dedicated equipment for executing the AdaGrad gradient descent algorithm, the problems of insufficient arithmetic performance and high front-end decoding overhead of general-purpose processors can be solved, accelerating the execution of related applications; meanwhile, the data cache unit avoids repeatedly reading data from memory and reduces the memory access bandwidth.
Drawings
Fig. 1 is an exemplary block diagram of the overall structure of an apparatus for implementing an AdaGrad gradient descent algorithm-related application according to an embodiment of the present invention;
fig. 2 is a block diagram of an example of a data processing module in an apparatus for implementing an application related to an AdaGrad gradient descent algorithm according to an embodiment of the present invention;
fig. 3 is a flowchart of operations for implementing an AdaGrad gradient descent algorithm-related application, according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments. Other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art from the following detailed description.
In this specification, the various embodiments described below which are meant to illustrate the principles of this invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The invention discloses a device for executing an AdaGrad gradient descent algorithm, which comprises a data access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The data access unit can access the external address space and can read and write data to each cache unit in the device to complete the loading and storing of data; specifically, it reads instructions into the instruction cache unit, reads the parameters to be updated and the corresponding gradient values from the designated storage units into the data processing unit, and writes the updated parameter vectors from the data processing module directly into the designated external space. The instruction cache unit reads instructions through the data access unit and caches the read instructions. The controller unit reads instructions from the instruction cache unit, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the data access unit, the data cache unit and the data processing module. The data cache unit stores intermediate variables needed during the operation of the device, and initializes and updates these variables. The data processing module performs the corresponding arithmetic operations according to the instructions, including vector addition, vector multiplication, vector division, vector square root and basic arithmetic operations.
The apparatus for implementing the AdaGrad gradient descent algorithm according to an embodiment of the present invention may be used to support applications that use the AdaGrad gradient descent algorithm. A space is established in the data cache unit to store the sum of squares of the historical gradient values; each time gradient descent is performed, the learning rate is calculated using this sum of squares, and the vector to be updated is then updated. The gradient descent operation is repeated until the vector to be updated converges.
The invention also discloses a method for executing the AdaGrad gradient descent algorithm, which comprises the following specific implementation steps:
the initialization operation of the data cache unit is completed through an instruction, and comprises setting initial values for constants alpha and epsilon and squaring and summing historical gradients
Figure BDA0000978417050000071
And setting zero operation. Wherein, the constant alpha is a gain coefficient of the adaptive learning rate and is used for adjusting and controlling the range of the adaptive learning rate, the constant epsilon is a smaller constant and is used for ensuring that the denominator in the calculation of the adaptive learning rate is nonzero, t is the current iteration frequency, Wt′Is a parameter to be updated in the ith iteration, Δ L (W)t′) For the gradient value of the parameter to be updated at the i-th iteration, Σ represents a summation operation that ranges from i-1 to i-t, i.e., for the initial to current gradient squared value (Δ L (W)1))2,(ΔL(W2))2,…,(ΔL(Wt))2And (6) summing.
And finishing the operation of reading the parameter vector to be updated and the corresponding gradient vector from the external space by the data access unit through the IO instruction.
The data processing module, according to the corresponding instruction, reads and updates the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit, and calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using the constants α and ε and the historical gradient sum of squares in the data cache unit.
The data processing module, according to the corresponding instruction, completes the update of the vector to be updated using the adaptive learning rate and the current gradient value; the update is computed as:
$$W_{t+1} = W_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$$
wherein $W_t$ denotes the current (t-th) parameter to be updated, $\Delta L(W_t)$ denotes the gradient value of the current parameter to be updated, and $W_{t+1}$ denotes the updated parameter, which is also the parameter to be updated in the next, (t+1)-th, iteration.
The data processing unit judges whether the updated parameter vector has converged; if so, the operation ends, otherwise, the process returns to the step of reading the parameter vector to be updated and the corresponding gradient vector from the external space and continues executing.
The embodiments of the present invention will be further explained with reference to the accompanying drawings.
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the AdaGrad gradient descent algorithm according to an embodiment of the invention. As shown in fig. 1, the apparatus includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may be implemented by hardware, including but not limited to FPGA, CGRA, ASIC, analog circuit, memristor, etc.
The data access unit 1 can access the external address space and can read and write data to each cache unit in the device to complete the loading and storing of data. This specifically includes reading instructions into the instruction cache unit 2, reading the parameters to be updated from the designated storage unit into the data processing unit 5, reading the gradient values from the designated external space into the data cache unit 4, and writing the updated parameter vectors from the data processing module 5 directly into the designated external space.
The instruction cache unit 2 reads the instruction through the data access unit 1 and caches the read instruction.
The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the data access unit 1, the data cache unit 4, and the data processing module 5.
At device initialization, the data cache unit 4 initializes the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to 0 and simultaneously opens up two spaces to store the constants α and ε; the two constant spaces are kept until the whole gradient descent iteration process has finished. During each data update, the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is read into the data processing module 5, its value is updated there, i.e. the square of the current gradient value is added, and the result is then written back into the data cache unit 4.
The data processing module 5 reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data cache unit 4, updates $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends its value back to the data cache unit 4, calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
Fig. 2 shows an exemplary block diagram of the data processing module in an apparatus for implementing an AdaGrad gradient descent algorithm-related application according to an embodiment of the present invention. As shown in fig. 2, the data processing module includes an operation control submodule 51, a parallel vector addition operation unit 52, a parallel vector multiplication operation unit 53, a parallel vector division operation unit 54, a parallel vector square root operation unit 55, and a basic operation submodule 56. Because the vector operations in the AdaGrad gradient descent algorithm are element-wise, when an operation is performed on a given vector, elements at different positions can be operated on in parallel.
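Because every operation listed above is element-wise, a software analogue of these submodules is simply a set of vectorized primitives; the sketch below (an assumed illustration, not the actual circuit or its instruction set) shows how an operation control step could dispatch to them.
```python
import numpy as np

# Element-wise primitives standing in for the parallel operation units 52-55;
# each acts on all vector positions independently, which is what allows a
# hardware implementation to process the positions in parallel.
VECTOR_OPS = {
    "vadd":  np.add,       # parallel vector addition unit 52
    "vmul":  np.multiply,  # parallel vector multiplication unit 53
    "vdiv":  np.divide,    # parallel vector division unit 54
    "vsqrt": np.sqrt,      # parallel vector square root unit 55
}

def dispatch(op, *operands):
    # Toy stand-in for the operation control submodule 51 (assumed interface).
    return VECTOR_OPS[op](*operands)

# e.g. squaring a gradient vector with the multiplication unit:
g = np.array([0.5, -0.1, 2.0])
g_squared = dispatch("vmul", g, g)
```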
Fig. 3 shows a general flow chart of the device performing the operations related to the AdaGrad gradient descent algorithm.
In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
In step S2, the operation starts: the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit 2.
In step S3, the controller unit 3 reads in the assignment instruction from the instruction cache unit 2 and, according to the translated microinstruction, the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit is zeroed and the constants α and ε are initialized. Here, the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate; the constant ε is a small constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero; t is the current iteration number; $W_i$ is the parameter to be updated at the i-th iteration; $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration; and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$.
In step S4, the controller unit 3 reads in an IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, the data access unit 1 reads the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data cache unit 4.
In step S5, the controller unit 3 reads in a data transfer instruction from the instruction cache unit 2 and, according to the translated microinstruction, the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data cache unit 4 are transmitted to the data processing unit 5.
In step S6, the controller unit reads in a vector instruction from the instruction cache unit 2 and, according to the translated microinstruction, performs the update of the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$; the instruction is sent to the operation control submodule 51, which issues corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule 53 computes $(\Delta L(W_t))^2$, and the vector addition parallel operation submodule 52 adds $(\Delta L(W_t))^2$ to the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$.
In step S7, the controller unit reads in an instruction from the instruction cache unit 2 and, according to the translated microinstruction, the updated historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is transmitted from the data processing unit 5 back to the data cache unit 4.
In step S8, the controller unit reads in an adaptive learning rate operation instruction from the instruction cache unit 2 and, according to the translated microinstruction, the operation control submodule 51 controls the relevant operation modules to perform the following operations: the vector square root parallel operation submodule 55 computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the vector division parallel operation submodule 54 computes the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$.
In step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control submodule 51 to perform the following operations: the vector multiplication parallel operation submodule 53 computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, and the vector addition parallel operation submodule 52 computes $W_{t+1}=W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, obtaining the updated parameter vector $W_{t+1}$.
In step S10, the controller unit reads in an IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, the updated parameter vector $W_{t+1}$ is transmitted from the data processing unit 5 to the designated address in the external address space through the data access unit 1.
In step S11, the controller unit reads a convergence judgment instruction from the instruction cache unit 2, and the data processing unit judges whether the updated parameter vector converges according to the translated microinstruction, and if so, the operation ends, otherwise, the processing proceeds to step S5 to continue the operation.
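Viewed purely as data flow, the per-iteration work of steps S5 through S10 reduces to the short routine below; the instruction fetching, decoding and IO of steps S1-S4 are abstracted away, and the function signature is an assumption made only for this sketch.
```python
import numpy as np

def adagrad_iteration(w_t, grad_t, hist_sq, alpha, eps):
    # One software pass over steps S6-S9 of the flow above (sketch only).
    hist_sq = hist_sq + grad_t * grad_t    # S6: vmul then vadd update the sum of squares
    lr = alpha / (np.sqrt(hist_sq) + eps)  # S8: vsqrt then vdiv give the adaptive rate
    w_next = w_t - lr * grad_t             # S9: vmul then vadd produce W_{t+1}
    return w_next, hist_sq                 # S7/S10: results written back to cache / external space
# S11 then checks convergence of w_next and, if needed, the flow repeats from S5.
```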
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software carried on a non-transitory computer readable medium), or a combination thereof. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (11)

1. A computing device that executes an AdaGrad gradient descent algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
a data processing unit for performing arithmetic operations including a basic arithmetic operation, and vector addition operation, vector multiplication operation, vector division operation, and vector square root operation under the control of the controller unit;
the data cache unit is used for storing the square sum of the historical gradient values in the operation process;
the data cache unit is also used for executing initialization and updating operation on the intermediate variable;
when the computing device executes the AdaGrad gradient descent algorithm, firstly reading a gradient vector and a value vector to be updated, and updating a historical gradient value cached in a data cache unit by using a current gradient value; during each iteration, calculating an updating amount by using the current gradient value and the historical gradient value, and performing updating operation on a vector to be updated; and continuing training until the parameter vector to be updated converges.
2. The computing apparatus of claim 1, wherein the data processing unit includes an operation control sub-module and a basic operation sub-module, and a parallel vector addition operation unit, a parallel vector division operation unit, a parallel vector multiplication operation unit, and a parallel vector square root operation unit.
3. The computing device of claim 2, wherein different positional elements are capable of performing operations in parallel when the data processing unit performs operations on the same vector.
4. The computing device of claim 1, wherein, upon device initialization, the data cache unit initializes the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to 0 and simultaneously opens up two spaces to store the constants α and ε, the two constant spaces being kept until the whole gradient descent algorithm has been executed, wherein the adaptive learning rate is $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$.
5. The computing device of claim 1, wherein, during each data update, the data cache unit reads the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ into the data processing unit, where its value is updated, namely the square of the current gradient value is added, and the result is then written back into the data cache unit;
the data processing unit reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data cache unit, updates $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends its value back to the data cache unit, calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
6. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
the controller unit decodes the read instruction into a microinstruction for controlling the corresponding module and sends the microinstruction to the corresponding module;
the data processing unit executes arithmetic operations under the control of the controller unit, including basic arithmetic operations, and vector addition operations, vector multiplication operations, vector division operations, and vector square root operations;
the controller unit stores the square sum of the historical gradient values generated in the operation process in a data cache unit, and performs initialization and updating operations on the intermediate variables;
when executing the AdaGrad gradient descent algorithm, the controller unit firstly reads the gradient vector and the value vector to be updated, and meanwhile, updates the historical gradient value cached in the data cache unit by using the current gradient value; during each iteration, calculating an updating amount by using the current gradient value and the historical gradient value, and performing updating operation on a vector to be updated; and continuing training until the parameter vector to be updated converges.
7. The computing method of claim 6, wherein different positional elements are capable of performing operations in parallel when the data processing units perform operations on the same vector.
8. The computing method of claim 6, wherein, upon device initialization, the data cache unit initializes the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to 0 and simultaneously opens up two spaces to store the constants α and ε, the two constant spaces being kept until the whole gradient descent algorithm has been executed, wherein the adaptive learning rate is $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$.
9. The computing method of claim 6, wherein, during each data update, the data cache unit reads the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ into the data processing unit, where its value is updated, namely the square of the current gradient value is added, and the result is then written back into the data cache unit;
the data processing unit reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data cache unit, updates $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends its value back to the data cache unit, calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
10. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
step (1), initializing the data cache unit, including setting initial values for the constants α and ε and zeroing the sum of squares of the historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
step (2), reading a parameter vector to be updated and a corresponding gradient vector from an external space;
and (3) the data processing unit reads and updates the sum of squares of the historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit, and calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using the constants α and ε and the sum of squares of the historical gradients in the data cache unit;
and (4) the data processing unit completes the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being computed as:
$$W_{t+1} = W_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$$
wherein $W_t$ denotes the current, i.e., t-th, parameter to be updated, $\Delta L(W_t)$ denotes the gradient value of the current parameter to be updated, and $W_{t+1}$ denotes the updated parameter, which is also the parameter to be updated in the next, i.e., (t+1)-th, iteration;
step (5), the data processing unit judges whether the updated parameter vector is converged, if so, the operation is finished, otherwise, the step (2) is carried out continuously;
the data cache unit stores the square sum of historical gradient values in the operation process.
11. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
step S1, an IO instruction is pre-stored in the first address of the instruction cache unit;
step S2, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all instructions related to AdaGrad gradient descent calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads in the assignment instruction from the instruction cache unit and, according to the translated microinstruction, zeroes the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit and initializes the constants α and ε; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
step S4, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the data access unit reads the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data cache unit;
step S5, the controller unit reads in a data transfer instruction from the instruction cache unit and, according to the translated microinstruction, the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data cache unit are transmitted to the data processing unit;
step S6, the controller unit reads in a vector instruction from the instruction cache unit and, according to the translated microinstruction, performs the update of the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$; the instruction is sent to the operation control submodule, which issues corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule computes $(\Delta L(W_t))^2$, and the vector addition parallel operation submodule adds $(\Delta L(W_t))^2$ to the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$;
step S7, the controller unit reads in an instruction from the instruction cache unit and, according to the translated microinstruction, the updated historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is transmitted from the data processing unit back to the data cache unit;
step S8, the controller unit reads in an adaptive learning rate operation instruction from the instruction cache unit and, according to the translated microinstruction, the operation control submodule controls the relevant operation modules to perform the following operations: the vector square root parallel operation submodule computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the vector division parallel operation submodule computes the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$;
step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control submodule to perform the following operations: the vector multiplication parallel operation submodule computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, and the vector addition parallel operation submodule computes $W_{t+1}=W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, obtaining the updated parameter vector $W_{t+1}$;
step S10, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the updated parameter vector $W_{t+1}$ is transmitted from the data processing unit to the designated address in the external address space through the data access unit;
step S11, the controller unit reads a convergence judgment instruction from the instruction cache unit, the data processing unit judges whether the updated parameter vector is converged according to the translated microinstruction, if so, the operation is finished, otherwise, the controller unit transfers to the step S5 to continue executing;
the data cache unit stores the square sum of historical gradient values in the operation process.
CN201610280620.4A 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm Active CN107341132B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610280620.4A CN107341132B (en) 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm
PCT/CN2016/081836 WO2017185411A1 (en) 2016-04-29 2016-05-12 Apparatus and method for executing adagrad gradient descent training algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610280620.4A CN107341132B (en) 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm

Publications (2)

Publication Number Publication Date
CN107341132A CN107341132A (en) 2017-11-10
CN107341132B true CN107341132B (en) 2021-06-11

Family

ID=60161682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610280620.4A Active CN107341132B (en) 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm

Country Status (2)

Country Link
CN (1) CN107341132B (en)
WO (1) WO2017185411A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378480B (en) * 2019-06-14 2022-09-27 平安科技(深圳)有限公司 Model training method and device and computer readable storage medium
CN111626434B (en) * 2020-05-15 2022-06-07 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium
CN112329941B (en) * 2020-11-04 2022-04-12 支付宝(杭州)信息技术有限公司 Deep learning model updating method and device
CN113238975A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Memory, integrated circuit and board card for optimizing parameters of deep neural network
CN116128072B (en) * 2023-01-20 2023-08-25 支付宝(杭州)信息技术有限公司 Training method, device, equipment and storage medium of risk control model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826142B (en) * 2010-04-19 2011-11-09 中国人民解放军信息工程大学 Reconfigurable elliptic curve cipher processor
US9477925B2 (en) * 2012-11-20 2016-10-25 Microsoft Technology Licensing, Llc Deep neural networks training for speech and pattern recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An overview of gradient descent optimization algorithms.pdf;SEBASTIAN RUDER;《https://ruder.io/optimizing-gradient-descent/》;20160119;第1-26页 *
Large Scale Distributed Deep Networks;Jeffrey Dean 等;《NIPS 2012》;20141021;第1-9页 *

Also Published As

Publication number Publication date
CN107341132A (en) 2017-11-10
WO2017185411A1 (en) 2017-11-02

Similar Documents

Publication Publication Date Title
CN107341132B (en) Device and method for executing AdaGrad gradient descent training algorithm
US10643129B2 (en) Apparatus and methods for training in convolutional neural networks
US20190370663A1 (en) Operation method
US20190065958A1 (en) Apparatus and Methods for Training in Fully Connected Layers of Convolutional Networks
CN111260025B (en) Apparatus and method for performing LSTM neural network operation
CN110929863B (en) Apparatus and method for performing LSTM operations
WO2017124642A1 (en) Device and method for executing forward calculation of artificial neural network
CN107316078A (en) Apparatus and method for performing artificial neural network self study computing
CN108320018B (en) Artificial neural network operation device and method
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
CN107170019B (en) Rapid low-storage image compression sensing method
US20190065938A1 (en) Apparatus and Methods for Pooling Operations
CN107341540B (en) Device and method for executing Hessian-Free training algorithm
US20200097520A1 (en) Apparatus and methods for vector operations
CN109754062B (en) Execution method of convolution expansion instruction and related product
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
CN107644253A (en) A kind of Neural network optimization based on inverse function, system and electronic equipment
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
US20190130274A1 (en) Apparatus and methods for backward propagation in neural networks supporting discrete data
CN111860814B (en) Apparatus and method for performing batch normalization operations
CN107315570B (en) Device and method for executing Adam gradient descent training algorithm
CN107315569B (en) Device and method for executing RMSprop gradient descent algorithm
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
CN113595681B (en) QR decomposition method, system, circuit, equipment and medium based on Givens rotation
US20190080241A1 (en) Apparatus and methods for backward propagation in neural networks supporting discrete data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100190 room 644, scientific research complex, No. 6, South Road, Academy of Sciences, Haidian District, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant