WO2017185411A1 - Apparatus and method for executing adagrad gradient descent training algorithm - Google Patents

Apparatus and method for executing AdaGrad gradient descent training algorithm

Info

Publication number
WO2017185411A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
instruction
vector
updated
gradient
Application number
PCT/CN2016/081836
Other languages
French (fr)
Chinese (zh)
Inventor
郭崎
刘少礼
陈天石
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Application filed by 北京中科寒武纪科技有限公司
Publication of WO2017185411A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations

Definitions

  • the present invention relates to the field of AdaGrad algorithm application technology, and more particularly to an apparatus and method for performing an AdaGrad gradient descent training algorithm.
  • the gradient descent optimization algorithm is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing.
  • the AdaGrad algorithm is widely used because it is easy to implement, requires little computation and storage space, and can adaptively assign a learning rate to each parameter. Implementing the AdaGrad algorithm with a dedicated device can significantly increase its execution speed.
  • a known method of performing the AdaGrad gradient descent algorithm is to use a general purpose processor.
  • the method supports the above algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units.
  • One of the disadvantages of this approach is that the performance of a single general purpose processor is low.
  • when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck.
  • the general-purpose processor needs to decode the operations of the AdaGrad gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the front-end decoding of the processor brings a large power consumption overhead.
  • Another known method of performing the AdaGrad gradient descent algorithm is to use a graphics processing unit (GPU).
  • the method supports the above algorithm by executing generic SIMD instructions using a general-purpose register file and generic stream processing units. Since the GPU is a device dedicated to graphics operations and scientific computing, with no special support for the operations of the AdaGrad gradient descent algorithm, a large amount of front-end decoding work is still required to perform those operations, which brings a large amount of extra overhead.
  • the GPU has only a small on-chip buffer, and the data required for the operation (such as historical gradient values) needs to be repeatedly transferred from off-chip. The off-chip bandwidth becomes the main performance bottleneck, and brings huge power consumption overhead.
  • the present invention provides an apparatus for performing an AdaGrad gradient descent algorithm, comprising:
  • a controller unit, configured to decode a read instruction into microinstructions that control the corresponding modules, and to send them to the corresponding modules;
  • a data buffer unit, configured to store intermediate variables of the operation process, and to perform initialization and update operations on the intermediate variables;
  • a data processing module, configured to perform arithmetic operations under the control of the controller unit, including vector addition, vector multiplication, vector division, vector square root and basic operations, and to store the intermediate variables in the data buffer unit.
  • the data processing module includes an operation control sub-module, a parallel vector addition operation unit, a parallel vector multiplication operation unit, a parallel vector division operation unit, a parallel vector square root operation unit, and a basic operation sub-module.
  • when the device is initialized, the data buffer unit initializes the sum of squared historical gradient values ∑_{i=1}^{t}(ΔL(W_i))², setting its value to 0, and allocates two spaces to store the constants α and ε, which are kept until the entire gradient descent algorithm has finished executing.
  • during each data update, the data buffer unit reads the sum of squared historical gradient values ∑(ΔL(W_i))² into the data processing module, where its value is updated, that is, the square of the current gradient value is added to it, and the result is then written back to the data buffer unit;
  • the data processing module reads the sum of squared historical gradient values ∑(ΔL(W_i))² and the constants α, ε from the data buffer unit, updates the sum and sends its value back to the data buffer unit, uses the sum and the constants α, ε to calculate the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
  • the present invention also provides a method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
  • Step (1) initializing the data buffer unit, including setting initial values for the constants α, ε and zeroing the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))²;
  • Step (2) reading the parameter vector to be updated and the corresponding gradient vector from the external space;
  • Step (3) the data processing module reads, calculates and updates the sum of squared historical gradients ∑(ΔL(W_i))² in the data buffer unit, and calculates the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)) from the constants α, ε and the sum of squared historical gradients in the data buffer unit;
  • Step (4) the data processing module completes the update operation on the vector to be updated using the adaptive learning rate and the current gradient value, the update being calculated as W_{t+1} = W_t − (α/(ε + √(∑_{i=1}^{t}(ΔL(W_i))²)))·ΔL(W_t) (a small worked example is sketched after this list), where
  • W_t represents the current, i.e. the t-th, parameter to be updated
  • ΔL(W_t) represents the gradient value of the current parameter to be updated
  • W_{t+1} represents the updated parameter, i.e. the parameter to be updated in the next, (t+1)-th, iteration.
  • step (5) the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
  • the present invention also provides a method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
  • Step S1 pre-storing an IO instruction at the first address of the instruction cache unit
  • Step S2 the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the decoded microinstruction, the data access unit reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit;
  • Step S3 the controller unit reads an assignment instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))² in the data buffer unit is set to zero and the constants α, ε are initialized;
  • the constant α is the adaptive learning rate gain coefficient, used to adjust the range of the adaptive learning rate
  • the constant ε is a constant, used to guarantee that the denominator in the adaptive learning rate calculation is non-zero
  • t is the current number of iterations
  • W_i is the parameter to be updated when the i-th iteration is operated
  • ΔL(W_i) is the gradient value of the parameter to be updated when the i-th iteration is operated
  • ∑ represents the summation operation, whose range runs from i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed
  • Step S4 the controller unit reads an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the data access unit reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data buffer unit;
  • Step S5 the controller unit reads a data transfer instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients ∑(ΔL(W_i))² and the constants α, ε in the data buffer unit are transmitted to the data processing unit;
  • Step S6 the controller unit reads a vector instruction from the instruction cache unit, and according to the decoded microinstruction, performs the update operation on the sum of squared historical gradients ∑(ΔL(W_i))²; in this operation, the instruction is sent to the operation control sub-module, which sends corresponding instructions to perform the following operations: the vector multiplication parallel operation sub-module computes (ΔL(W_t))², and the vector addition parallel operation sub-module adds (ΔL(W_t))² to the sum of squared historical gradients ∑(ΔL(W_i))²;
  • Step S7 the controller unit reads an instruction from the instruction cache unit, and according to the decoded microinstruction, the updated sum of squared historical gradients ∑(ΔL(W_i))² is transferred from the data processing unit back to the data buffer unit;
  • Step S8 the controller unit reads an adaptive learning rate operation instruction from the instruction cache unit, and according to the decoded microinstruction, the operation control sub-module controls the relevant operation modules to perform the following operations: the vector square root parallel operation sub-module computes √(∑(ΔL(W_i))²), and the vector division parallel operation sub-module computes the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²));
  • Step S9 the controller unit reads a parameter vector update instruction from the instruction cache unit, and according to the decoded microinstruction, drives the operation control sub-module to perform the following operations: the vector multiplication parallel operation sub-module computes the update amount −(α/(ε + √(∑(ΔL(W_i))²)))·ΔL(W_t), and the vector addition parallel operation sub-module adds it to W_t to obtain the updated parameter vector W_{t+1};
  • Step S10 the controller unit reads an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the updated parameter vector W_{t+1} is transmitted from the data processing unit through the data access unit to a specified address in the external address space;
  • Step S11 the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the decoded microinstruction, the data processing unit determines whether the updated parameter vector converges; if it converges, the operation ends, otherwise the process returns to step S5 and continues.
  • the device and method of the present invention have the following beneficial effects: the AdaGrad gradient descent algorithm can be implemented with this device, and the efficiency of data processing is greatly improved; by using a device dedicated to executing the AdaGrad gradient descent algorithm, the problems of insufficient computational performance of general-purpose processors and large front-end decoding overhead are solved, the execution speed of related applications is accelerated, and the efficiency of data processing is greatly improved.
  • the use of the data buffer unit avoids repeatedly reading data from memory, reducing the memory access bandwidth.
  • FIG. 1 is a block diagram showing an example of an overall structure of an apparatus for implementing an AdaGrad gradient descent algorithm related application, in accordance with an embodiment of the present invention
  • FIG. 2 is a block diagram showing an example of a data processing module in an apparatus for implementing an AdaGrad gradient descent algorithm related application, in accordance with an embodiment of the present invention
  • FIG. 3 is a flow diagram of operations for implementing an AdaGrad gradient descent algorithm related application, in accordance with an embodiment of the present invention.
  • the invention discloses an apparatus for executing an AdaGrad gradient descent algorithm, comprising a data access unit, an instruction cache unit, a controller unit, a data buffer unit and a data processing module.
  • the data access unit can access the external address space, can read and write data to each cache unit in the device, and completes the loading and storing of data, which specifically includes reading instructions into the instruction cache unit, reading the parameters to be updated and the corresponding gradient values from the specified storage units into the data processing unit, and writing the updated parameter vector from the data processing module directly to the externally designated space;
  • the instruction cache unit reads the instruction through the data access unit, and caches the read instruction;
  • the controller unit reads instructions from the instruction cache unit, decodes the instructions into micro-instructions that control the behavior of other modules and transmits them to other modules such as the data access unit, the data buffer unit and the data processing module; the data buffer unit stores the intermediate variables needed during operation, and these variables are initialized and updated;
  • the data processing module performs corresponding operations based on the instructions, including vector addition, vector multiplication, vector division, vector square root, and basic operations.
  • An apparatus for implementing an AdaGrad gradient descent algorithm can be used to support applications using the AdaGrad gradient descent algorithm.
  • a space is allocated in the data buffer unit to store the sum of the squares of the historical gradient values.
  • each time gradient descent is performed, a learning rate is calculated from this sum of squares and used as the learning rate of the gradient descent, and then the update operation on the vector to be updated is performed. The gradient descent operation is repeated until the vector to be updated converges.
  • the invention also discloses a method for executing an AdaGrad gradient descent algorithm, and the specific implementation steps are as follows:
  • Step (1) completes the initialization operation of the data buffer unit through an instruction, including setting initial values for the constants α, ε and zeroing the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))².
  • the constant α is an adaptive learning rate gain coefficient, which is used to adjust the range of the adaptive learning rate.
  • the constant ε is a small constant, which is used to ensure that the denominator in the adaptive learning rate calculation is non-zero, and t is the current number of iterations.
  • W_i is the parameter to be updated when the i-th iteration is operated
  • ΔL(W_i) is the gradient value of the parameter to be updated when the i-th iteration is operated
  • ∑ represents the summation operation, whose range runs from i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed
  • Step (2) by the IO instruction, completes an operation of the data access unit reading the parameter vector to be updated and the corresponding gradient vector from the external space.
  • Step (3) The data processing module reads, calculates and updates the sum of squared historical gradients ∑(ΔL(W_i))² in the data buffer unit according to the corresponding instructions, and calculates the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)) from the constants α, ε and the sum of squared historical gradients in the data buffer unit.
  • Step (4) The data processing module completes the update operation on the vector to be updated using the adaptive learning rate and the current gradient value according to the corresponding instruction, the update being calculated as W_{t+1} = W_t − (α/(ε + √(∑_{i=1}^{t}(ΔL(W_i))²)))·ΔL(W_t), where
  • W_t represents the current (t-th) parameter to be updated
  • ΔL(W_t) represents the gradient value of the current parameter to be updated
  • W_{t+1} represents the updated parameter, i.e. the parameter to be updated in the next ((t+1)-th) iteration.
  • Step (5) The data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
  • the device includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4 and a data processing module 5, all of which can be implemented in hardware, including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits and memristors.
  • the data access unit 1 can access the external address space, can read and write data to each cache unit inside the device, and completes data loading and storage; specifically, this includes reading instructions into the instruction cache unit 2, reading the parameters to be updated from the specified storage units into the data processing unit 5, reading gradient values from the externally designated space into the data buffer unit 4, and writing the updated parameter vector from the data processing module 5 directly to the externally designated space.
  • the instruction cache unit 2 reads the instruction through the data access unit 1 and caches the read instruction.
  • the controller unit 3 reads the instructions from the instruction cache unit 2, decodes the instructions into micro-instructions that control the behavior of other modules, and transmits them to other modules such as the data access unit 1, the data buffer unit 4, the data processing module 5, and the like.
  • the data buffer unit 4 initializes the sum of squared historical gradient values ∑_{i=1}^{t}(ΔL(W_i))² at the time of device initialization, setting its value to 0, and allocates two spaces to store the constants α, ε, which are kept until the entire gradient descent iteration process ends.
  • during each data update, the sum of squared historical gradient values ∑(ΔL(W_i))² is read into the data processing module 5, its value is updated in the data processing module 5, i.e. the square of the current gradient value is added, and it is then written back to the data buffer unit 4.
  • the data processing module 5 reads the sum of squared historical gradient values ∑(ΔL(W_i))² and the constants α, ε from the data buffer unit 4, updates the sum and sends its value back to the data buffer unit 4, uses the sum and the constants α, ε to calculate the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
  • the data processing module includes an operation control sub-module 51, a parallel vector addition unit 52, a parallel vector multiplication unit 53, a parallel vector division unit 54, a parallel vector square root operation unit 55, and a basic operation sub-module 56. Since the vector operations in the AdaGrad gradient descent algorithm are element-wise operations, when an operation is performed on a vector, the elements at different positions can perform operations in parallel.
  • Figure 3 shows the overall flowchart of the apparatus performing the related operations of the AdaGrad gradient descent algorithm.
  • step S1 an IO instruction is pre-stored at the first address of the instruction cache unit 2.
  • step S2 the operation starts: the control unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the decoded microinstruction, the data access unit 1 reads all instructions related to the AdaGrad gradient descent calculation from the external address space and buffers them into the instruction cache unit 2.
  • step S3 the controller unit 3 reads the assignment instruction from the instruction cache unit 2, and according to the decoded microinstruction, the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))² in the data buffer unit is set to zero and the constants α, ε are initialized.
  • the constant α is an adaptive learning rate gain coefficient, which is used to adjust the range of the adaptive learning rate.
  • the constant ε is a small constant, which is used to ensure that the denominator in the adaptive learning rate calculation is non-zero, and t is the current number of iterations.
  • W_i is the parameter to be updated when the i-th iteration is operated
  • ΔL(W_i) is the gradient value of the parameter to be updated when the i-th iteration is operated
  • ∑ represents the summation operation, whose range runs from i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed
  • step S4 the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data access unit 1 reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data buffer unit 4.
  • step S5 the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, and according to the decoded microinstruction, the sum of squared historical gradients ∑(ΔL(W_i))² and the constants α, ε in the data buffer unit 4 are transmitted to the data processing unit.
  • step S6 the controller unit reads a vector instruction from the instruction cache unit 2, and according to the decoded microinstruction, performs the update operation on the sum of squared historical gradients ∑(ΔL(W_i))²; in this operation, the instruction is sent to the operation control sub-module 51, which sends corresponding instructions to perform the following operations: the vector multiplication parallel operation sub-module 53 computes (ΔL(W_t))², and the vector addition parallel operation sub-module adds (ΔL(W_t))² to the sum of squared historical gradients ∑(ΔL(W_i))².
  • step S7 the controller unit reads an instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated sum of squared historical gradients ∑(ΔL(W_i))² is transferred from the data processing unit 5 back to the data buffer unit 4.
  • step S8 the controller unit reads an adaptive learning rate operation instruction from the instruction cache unit 2, and according to the decoded micro-instruction, the operation control sub-module 51 controls the relevant operation modules to perform the following operations: the vector square root parallel operation sub-module 55 computes √(∑(ΔL(W_i))²), and the vector division parallel operation sub-module 54 computes the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)).
  • step S9 the controller unit reads a parameter vector update instruction from the instruction cache unit 2, and according to the decoded micro-instruction, drives the operation control sub-module 51 to perform the following operations: the vector multiplication parallel operation sub-module computes the update amount −(α/(ε + √(∑(ΔL(W_i))²)))·ΔL(W_t), and the vector addition parallel operation sub-module 52 adds it to W_t to obtain the updated parameter vector W_{t+1}.
  • step S10 the controller unit reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated parameter vector W_{t+1} is transmitted from the data processing unit 5 through the data access unit 1 to a specified address in the external address space.
  • step S11 the controller unit reads a convergence determination instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data processing unit determines whether the updated parameter vector converges; if it converges, the operation ends, otherwise the process returns to step S5 and continues.
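A minimal worked sketch of a single update as described above, for illustration only. The numeric values are hypothetical, and the closed form of the learning rate, α/(ε + √(∑(ΔL(W_i))²)), is assumed from the definitions of α and ε given in this document rather than quoted from the patent's formula images:

```python
import numpy as np

# Hypothetical values for one AdaGrad step (illustration only).
alpha, eps = 0.01, 1e-8          # gain coefficient and denominator guard
W_t  = np.array([0.50, -0.30])   # parameter vector to be updated
grad = np.array([0.20,  0.10])   # current gradient ΔL(W_t)
r    = np.array([0.04,  0.09])   # accumulated sum of squared historical gradients

r += grad ** 2                   # add the square of the current gradient
lr = alpha / (eps + np.sqrt(r))  # per-element adaptive learning rate
W_next = W_t - lr * grad         # updated parameter vector W_{t+1}
print(W_next)                    # approximately [ 0.4929289, -0.3031623]
```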

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An apparatus and method for executing an AdaGrad gradient descent training algorithm, the apparatus comprising a controller unit (3), a data cache unit (4), and a data processing module (5). The apparatus first reads a gradient vector and a vector to be updated, and uses the current gradient value to update the historical gradient value in the cache region; at each iteration, the current gradient value and the historical gradient value are used to calculate an update amount and to perform an update operation on the vector to be updated; training continues until the parameter vector to be updated converges. Employing a device dedicated to executing the AdaGrad gradient descent algorithm solves the problems of insufficient computing performance of general-purpose processors and high front-end decoding overhead, and accelerates the execution of related applications; in addition, the use of the data cache unit (4) avoids repeatedly reading data from memory, reducing the memory-access bandwidth.

Description

Apparatus and method for performing AdaGrad gradient descent training algorithm
Technical Field
The present invention relates to the field of AdaGrad algorithm application technology, and more particularly to an apparatus and method for performing an AdaGrad gradient descent training algorithm.
Background Art
The gradient descent optimization algorithm is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing. The AdaGrad algorithm is widely used because it is easy to implement, requires little computation and storage space, and can adaptively assign a learning rate to each parameter. Implementing the AdaGrad algorithm with a dedicated device can significantly increase its execution speed.
Currently, one known method of performing the AdaGrad gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that the computational performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, the general-purpose processor needs to decode the operations of the AdaGrad gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the front-end decoding of the processor brings a large power consumption overhead.
Another known method of performing the AdaGrad gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing generic SIMD instructions using a general-purpose register file and generic stream processing units. Since the GPU is a device dedicated to graphics operations and scientific computing, with no special support for the operations of the AdaGrad gradient descent algorithm, a large amount of front-end decoding work is still required to perform those operations, which brings a large amount of extra overhead. In addition, the GPU has only a small on-chip cache, and the data required for the operation (such as historical gradient values) must be repeatedly transferred from off-chip, so the off-chip bandwidth becomes the main performance bottleneck while incurring a huge power consumption overhead.
Summary of the Invention
In view of the above, it is an object of the present invention to provide an apparatus and method for performing an AdaGrad gradient descent algorithm, so as to solve at least one of the above technical problems.
In order to achieve the above object, as one aspect of the present invention, the present invention provides an apparatus for performing an AdaGrad gradient descent algorithm, comprising:
a controller unit, configured to decode a read instruction into microinstructions that control the corresponding modules, and to send them to the corresponding modules;
a data buffer unit, configured to store intermediate variables of the operation process, and to perform initialization and update operations on the intermediate variables;
a data processing module, configured to perform arithmetic operations under the control of the controller unit, including vector addition, vector multiplication, vector division, vector square root and basic operations, and to store intermediate variables in the data buffer unit.
The data processing module includes an operation control sub-module, a parallel vector addition unit, a parallel vector multiplication unit, a parallel vector division unit, a parallel vector square root unit and a basic operation sub-module.
When the data processing module performs an operation on the same vector, elements at different positions can be operated on in parallel.
When the apparatus is initialized, the data buffer unit initializes the sum of squared historical gradient values ∑_{i=1}^{t}(ΔL(W_i))², setting its value to 0, and allocates two spaces to store the constants α and ε; these two constant spaces are kept until the entire gradient descent algorithm has finished executing.
During each data update, the data buffer unit reads the sum of squared historical gradient values ∑(ΔL(W_i))² into the data processing module, where its value is updated, that is, the square of the current gradient value is added to it, and the result is then written back to the data buffer unit.
The data processing module reads the sum of squared historical gradient values ∑(ΔL(W_i))² and the constants α, ε from the data buffer unit, updates the sum and sends its value back to the data buffer unit, uses the sum and the constants α, ε to calculate the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
As another aspect of the present invention, the present invention also provides a method for performing an AdaGrad gradient descent algorithm, comprising the following steps:
Step (1): initializing the data buffer unit, including setting initial values for the constants α and ε and zeroing the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))², where the constant α is an adaptive learning rate gain coefficient used to adjust the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current number of iterations, W_i is the parameter to be updated in the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated in the i-th iteration, and ∑ denotes the summation operation over the range i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed;
Step (2): reading the parameter vector to be updated and the corresponding gradient vector from the external space;
Step (3): the data processing module reads, calculates and updates the sum of squared historical gradients ∑(ΔL(W_i))² in the data buffer unit, and calculates the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)) from the constants α, ε and the sum of squared historical gradients in the data buffer unit;
Step (4): the data processing module uses the adaptive learning rate and the current gradient value to complete the update operation on the vector to be updated, where the update operation is calculated as follows:
W_{t+1} = W_t − (α/(ε + √(∑_{i=1}^{t}(ΔL(W_i))²)))·ΔL(W_t)
where W_t denotes the current, i.e. t-th, parameter to be updated, ΔL(W_t) denotes the gradient value of the current parameter to be updated, and W_{t+1} denotes the updated parameter, i.e. the parameter to be updated in the next, (t+1)-th, iteration;
Step (5): the data processing unit determines whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step (2) and continues.
Also provided is an apparatus for performing the AdaGrad gradient descent algorithm, in whose controller a program for performing the method described above is solidified.
As still another aspect of the present invention, the present invention also provides a method for performing an AdaGrad gradient descent algorithm, comprising the following steps:
Step S1: an IO instruction is pre-stored at the first address of the instruction cache unit;
Step S2: the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the decoded microinstruction, the data access unit reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit;
Step S3: the controller unit reads an assignment instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))² in the data buffer unit is set to zero and the constants α and ε are initialized; the constant α is an adaptive learning rate gain coefficient used to adjust the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current number of iterations, W_i is the parameter to be updated in the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated in the i-th iteration, and ∑ denotes the summation operation over the range i=1 to i=t, i.e. over the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))²;
Step S4: the controller unit reads an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the data access unit reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data buffer unit;
Step S5: the controller unit reads a data transfer instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients ∑(ΔL(W_i))² and the constants α, ε in the data buffer unit are transferred to the data processing unit;
Step S6: the controller unit reads a vector instruction from the instruction cache unit, and according to the decoded microinstruction, the update operation on the sum of squared historical gradients ∑(ΔL(W_i))² is performed; in this operation, the instruction is sent to the operation control sub-module, which issues corresponding instructions to perform the following operations: the parallel vector multiplication sub-module computes (ΔL(W_t))², and the parallel vector addition sub-module adds (ΔL(W_t))² to the sum of squared historical gradients ∑(ΔL(W_i))²;
Step S7: the controller unit reads an instruction from the instruction cache unit, and according to the decoded microinstruction, the updated sum of squared historical gradients ∑(ΔL(W_i))² is transferred from the data processing unit back to the data buffer unit;
Step S8: the controller unit reads an adaptive learning rate operation instruction from the instruction cache unit, and according to the decoded microinstruction, the operation control sub-module controls the relevant operation modules to perform the following operations: the parallel vector square root sub-module computes √(∑(ΔL(W_i))²), and the parallel vector division sub-module computes the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²));
Step S9: the controller unit reads a parameter vector update instruction from the instruction cache unit and, according to the decoded microinstruction, drives the operation control sub-module to perform the following operations: the parallel vector multiplication sub-module computes the update amount −(α/(ε + √(∑(ΔL(W_i))²)))·ΔL(W_t), and the parallel vector addition sub-module adds it to W_t to obtain the updated parameter vector W_{t+1};
Step S10: the controller unit reads an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the updated parameter vector W_{t+1} is transferred from the data processing unit through the data access unit to a specified address in the external address space;
Step S11: the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the decoded microinstruction, the data processing unit determines whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step S5 and continues.
Also provided is an apparatus for performing the AdaGrad gradient descent algorithm, in whose controller a program for performing the method described above is solidified.
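For illustration, the program solidified in the controller to drive steps S1 to S11 could be sketched as the following host-side pseudo-driver. The instruction mnemonics and the `device` interface are hypothetical and are not defined by the patent; the sketch only conveys the ordering of the instruction stream:

```python
# Hypothetical instruction mnemonics and device interface -- illustration only.
def run_adagrad_device(device, max_iters=10_000):
    device.issue("IO")             # S1/S2: load all AdaGrad-related instructions into the instruction cache
    device.issue("ASSIGN")         # S3: zero the squared-gradient sum, initialise alpha and epsilon
    for _ in range(max_iters):
        device.issue("IO")         # S4: read W_t and the gradient vector into the data buffer unit
        device.issue("TRANSFER")   # S5: move the squared-gradient sum and constants to the data processing unit
        device.issue("VEC_ACC")    # S6: square the current gradient and accumulate it into the sum
        device.issue("WRITEBACK")  # S7: return the updated sum to the data buffer unit
        device.issue("ADA_LR")     # S8: square root and division yield the adaptive learning rate
        device.issue("VEC_UPD")    # S9: multiply and add to obtain W_{t+1}
        device.issue("IO")         # S10: store W_{t+1} to the external address space
        if device.issue("CONVERGED"):  # S11: convergence judgment instruction ends the loop
            break
```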
Based on the above technical solutions, the apparatus and method of the present invention have the following beneficial effects: the AdaGrad gradient descent algorithm can be implemented with this apparatus, and the efficiency of data processing is greatly improved; by using a device dedicated to executing the AdaGrad gradient descent algorithm, the problems of insufficient computational performance of general-purpose processors and large front-end decoding overhead are solved, the execution speed of related applications is accelerated, and the efficiency of data processing is greatly improved; at the same time, the use of the data buffer unit avoids repeatedly reading data from memory and reduces the memory access bandwidth.
Brief Description of the Drawings
FIG. 1 is an example block diagram of the overall structure of an apparatus for implementing applications related to the AdaGrad gradient descent algorithm according to an embodiment of the present invention;
FIG. 2 is an example block diagram of the data processing module in an apparatus for implementing applications related to the AdaGrad gradient descent algorithm according to an embodiment of the present invention;
FIG. 3 is a flowchart of operations for implementing applications related to the AdaGrad gradient descent algorithm according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings. Other aspects, advantages and salient features of the present invention will become apparent to those skilled in the art from the following detailed description.
In this specification, the various embodiments described below for explaining the principles of the present invention are merely illustrative and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes numerous specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, the same reference numerals are used throughout the drawings for similar functions and operations.
The present invention discloses an apparatus for executing the AdaGrad gradient descent algorithm, comprising a data access unit, an instruction cache unit, a controller unit, a data buffer unit and a data processing module. The data access unit can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data; specifically, it reads instructions into the instruction cache unit, reads the parameters to be updated and the corresponding gradient values from the specified storage units into the data processing unit, and writes the updated parameter vector from the data processing module directly to the externally specified space. The instruction cache unit reads instructions through the data access unit and caches the read instructions. The controller unit reads instructions from the instruction cache unit, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the data access unit, the data buffer unit and the data processing module. The data buffer unit stores the intermediate variables needed during operation and initializes and updates these variables. The data processing module performs the corresponding arithmetic operations according to the instructions, including vector addition, vector multiplication, vector division, vector square root and basic operations.
The apparatus implementing the AdaGrad gradient descent algorithm according to an embodiment of the present invention can be used to support applications that use the AdaGrad gradient descent algorithm. A space is allocated in the data buffer unit to store the sum of the squares of the historical gradient values; each time gradient descent is performed, a learning rate is calculated from this sum of squares and used as the learning rate of the gradient descent, and then the update operation on the vector to be updated is performed. The gradient descent operation is repeated until the vector to be updated converges.
The present invention also discloses a method for executing the AdaGrad gradient descent algorithm, whose specific implementation steps are as follows:
Step (1): the initialization of the data buffer unit is completed by an instruction, including setting initial values for the constants α and ε and zeroing the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))². Here the constant α is an adaptive learning rate gain coefficient used to adjust the range of the adaptive learning rate, the constant ε is a small constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current number of iterations, W_i is the parameter to be updated in the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated in the i-th iteration, and ∑ denotes the summation operation over the range i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed.
Step (2): through an IO instruction, the data access unit reads the parameter vector to be updated and the corresponding gradient vector from the external space.
Step (3): according to the corresponding instructions, the data processing module reads, calculates and updates the sum of squared historical gradients ∑(ΔL(W_i))² in the data buffer unit, and calculates the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)) from the constants α, ε and the sum of squared historical gradients in the data buffer unit.
Step (4): according to the corresponding instruction, the data processing module uses the adaptive learning rate and the current gradient value to complete the update operation on the vector to be updated, where the update operation is calculated as follows:
W_{t+1} = W_t − (α/(ε + √(∑_{i=1}^{t}(ΔL(W_i))²)))·ΔL(W_t)
where W_t denotes the current (t-th) parameter to be updated, ΔL(W_t) denotes the gradient value of the current parameter to be updated, and W_{t+1} denotes the updated parameter, i.e. the parameter to be updated in the next ((t+1)-th) iteration.
Step (5): the data processing unit determines whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step (2) and continues.
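Purely for illustration, the computation of steps (1) to (5) can be summarised by the following NumPy sketch of the same algorithm executed in software. The gradient function, the initial parameters and the convergence criterion are assumptions and are not specified by the patent:

```python
import numpy as np

def adagrad_descent(grad_fn, w0, alpha=0.01, eps=1e-8, tol=1e-6, max_iters=10_000):
    """Software sketch of steps (1)-(5): accumulate squared gradients and
    update the parameter vector with a per-element adaptive learning rate."""
    w = np.asarray(w0, dtype=float)
    r = np.zeros_like(w)                 # step (1): sum of squared historical gradients, zeroed
    for _ in range(max_iters):
        g = grad_fn(w)                   # step (2): obtain the current gradient ΔL(W_t)
        r += g ** 2                      # step (3): update the accumulated squared gradients
        lr = alpha / (eps + np.sqrt(r))  # step (3): adaptive learning rate
        w_new = w - lr * g               # step (4): W_{t+1} = W_t - lr * ΔL(W_t)
        if np.linalg.norm(w_new - w) < tol:  # step (5): convergence check (criterion assumed)
            return w_new
        w = w_new
    return w

# Example use on a simple quadratic objective (assumed for illustration):
w_star = adagrad_descent(lambda w: 2.0 * (w - 3.0), w0=np.zeros(4))
```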
The specific scheme of the present invention is further explained below with reference to the accompanying drawings.
FIG. 1 shows an example block diagram of the overall structure of an apparatus for implementing the AdaGrad gradient descent algorithm according to an embodiment of the present invention. As shown in FIG. 1, the apparatus includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4 and a data processing module 5, all of which can be implemented in hardware, including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits and memristors.
The data access unit 1 can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data. Specifically, it reads instructions into the instruction cache unit 2, reads the parameters to be updated from the specified storage units into the data processing unit 5, reads gradient values from the externally specified space into the data buffer unit 4, and writes the updated parameter vector from the data processing module 5 directly to the externally specified space.
The instruction cache unit 2 reads instructions through the data access unit 1 and caches the read instructions.
The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the data access unit 1, the data buffer unit 4 and the data processing module 5.
When the apparatus is initialized, the data buffer unit 4 initializes the sum of squared historical gradient values ∑_{i=1}^{t}(ΔL(W_i))², setting its value to 0, and allocates two spaces to store the constants α and ε; these two constant spaces are kept until the entire gradient descent iteration process ends. During each data update, the sum of squared historical gradient values ∑(ΔL(W_i))² is read into the data processing module 5, its value is updated there, i.e. the square of the current gradient value is added, and it is then written back to the data buffer unit 4.
The data processing module 5 reads the sum of squared historical gradient values ∑(ΔL(W_i))² and the constants α and ε from the data buffer unit 4, updates the sum and sends its value back to the data buffer unit 4, uses the sum and the constants α and ε to calculate the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
FIG. 2 shows an example block diagram of the data processing module in an apparatus for implementing applications related to the AdaGrad gradient descent algorithm according to an embodiment of the present invention. As shown in FIG. 2, the data processing module includes an operation control sub-module 51, a parallel vector addition unit 52, a parallel vector multiplication unit 53, a parallel vector division unit 54, a parallel vector square root unit 55 and a basic operation sub-module 56. Since the vector operations in the AdaGrad gradient descent algorithm are all element-wise operations, when an operation is performed on a vector, elements at different positions can be operated on in parallel.
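Because every vector operation here is element-wise, a single update step decomposes into independent per-element operations that map onto the parallel sub-modules 52 to 55. The following sketch is illustrative only; the function name and the per-element formulation are assumptions, not the patent's implementation:

```python
def adagrad_elementwise(w_i, g_i, r_i, alpha, eps):
    """Per-element AdaGrad kernel: each lane can evaluate this independently,
    which is what allows the vector sub-modules to operate in parallel."""
    sq   = g_i * g_i             # parallel vector multiplication unit 53
    r_i  = r_i + sq              # parallel vector addition unit 52
    root = r_i ** 0.5            # parallel vector square root unit 55
    lr   = alpha / (eps + root)  # parallel vector division unit 54 (with basic operations 56)
    w_i  = w_i - lr * g_i        # multiplication 53 and addition 52 again
    return w_i, r_i

# Applying the kernel independently to every element is equivalent to the
# whole-vector update, e.g.:
# updated = [adagrad_elementwise(w, g, r, 0.01, 1e-8) for w, g, r in zip(W, G, R)]
```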
FIG. 3 shows the overall flowchart of the apparatus performing the related operations of the AdaGrad gradient descent algorithm.
In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
In step S2, the operation starts: the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the decoded microinstruction, the data access unit 1 reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit 2.
在步骤S3,控制器单元4从指令缓存单元2读入赋值指令,根据译出的微指令,数据缓存单元中的历史梯度平方和
Figure PCTCN2016081836-appb-000032
置零,并初始化常数α,ε。其中,常数α为自适应学习率增益系数,用于调节控制自适应学习率的范围,常数ε为一个较小的常数,用于保证自适应学习率计算中的分母非零,t为当前迭代次数,Wi为第i次迭代运算时的待更新参数,ΔL(Wi)为第i次迭代运算时待更新参数的梯度值,∑表示求和操作,其求和范围从i=1至i=t,即对初始至当前的梯度平方值(ΔL(W1))2,(ΔL(W2))2,...,(ΔL(Wt))2求和。
In step S3, the controller unit 4 reads the assignment instruction from the instruction cache unit 2, and according to the translated microinstruction, the sum of the squares of the historical gradients in the data buffer unit.
Figure PCTCN2016081836-appb-000032
Set zero and initialize the constants α, ε. The constant α is an adaptive learning rate gain coefficient, which is used to adjust the range of the adaptive learning rate. The constant ε is a small constant, which is used to ensure that the denominator in the adaptive learning rate calculation is non-zero, and t is the current iteration. The number of times, W i is the parameter to be updated when the i-th iteration is operated, ΔL(W i ) is the gradient value of the parameter to be updated when the i-th iteration is operated, and ∑ represents the summation operation, and the summation range is from i=1 to i=t, that is, the initial to current gradient squared value (ΔL(W 1 )) 2 , (ΔL(W 2 )) 2 , . . . , (ΔL(W t )) 2 is summed.
In step S4, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data access unit 1 reads the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data buffer unit 4.
In step S5, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data buffer unit 4 are transferred to the data processing unit 5.
In step S6, the controller unit 3 reads a vector instruction from the instruction cache unit 2, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is updated. In this operation, the instruction is sent to the operation control sub-module 51, which issues the corresponding instructions to perform the following: the parallel vector multiplication unit 53 computes $(\Delta L(W_t))^2$, and the parallel vector addition unit 52 adds $(\Delta L(W_t))^2$ to the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$.
In step S7, the controller unit 3 reads an instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is transferred from the data processing unit 5 back to the data buffer unit 4.
In step S8, the controller unit 3 reads an adaptive learning rate operation instruction from the instruction cache unit 2, and according to the decoded microinstruction, the operation control sub-module 51 directs the relevant operation units to perform the following: the parallel vector square root unit 55 computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the parallel vector division unit 54 computes the adaptive learning rate $\alpha\big/\bigl(\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon\bigr)$.
In step S9, the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2, and according to the decoded microinstruction, drives the operation control sub-module 51 to perform the following operations: the parallel vector multiplication unit 53 computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t)$, and the parallel vector addition unit 52 computes $W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t)$, giving the updated parameter vector $W_{t+1}$.
In step S10, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated parameter vector $W_{t+1}$ is transferred from the data processing unit 5 through the data access unit 1 to the specified address in the external address space.
In step S11, the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data processing unit 5 judges whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise execution continues from step S5.
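For orientation only, the following is a minimal host-side sketch of the loop that steps S3 through S11 describe, with the instruction fetching, the data transfers between units, and the hardware parallelism omitted. The gradient callback grad_fn, the norm-based convergence test, and the default values of α, ε, the tolerance, and the iteration cap are assumptions introduced for illustration; the patent leaves the concrete convergence criterion to the convergence judgment instruction.

    import numpy as np

    # Simplified software sketch of steps S3 to S11 (assumptions noted above); it
    # omits the instruction cache, the data access unit, and per-unit parallelism.
    def adagrad_train(w, grad_fn, alpha=0.01, eps=1e-8, tol=1e-6, max_iter=1000):
        hist_sq_sum = np.zeros_like(w)                 # step S3: zero the historical sum of squares
        for _ in range(max_iter):
            grad = grad_fn(w)                          # step S4: obtain the current gradient vector
            hist_sq_sum = hist_sq_sum + grad * grad    # step S6: accumulate squared gradients
            lr = alpha / (np.sqrt(hist_sq_sum) + eps)  # step S8: adaptive learning rate
            w_next = w - lr * grad                     # step S9: parameter update
            if np.linalg.norm(w_next - w) < tol:       # step S11: convergence check (assumed form)
                return w_next
            w = w_next
        return w

For example, adagrad_train(np.array([5.0, -3.0]), lambda v: 2.0 * v, alpha=0.5) walks the parameters of the quadratic toward its minimum at the origin, with each coordinate receiving its own effective step size as its gradient history grows.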
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (for example, circuitry or dedicated logic), firmware, software (for example, software carried on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods have been described above in a certain order, it should be understood that some of the described operations can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. It is apparent that various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (9)

  1. An apparatus for performing an AdaGrad gradient descent algorithm, comprising:
    a controller unit, configured to decode read instructions into microinstructions that control the corresponding modules, and to send them to the corresponding modules;
    a data buffer unit, configured to store intermediate variables of the operation process and to perform initialization and update operations on the intermediate variables; and
    a data processing module, configured to perform arithmetic operations under the control of the controller unit, including vector addition, vector multiplication, vector division, vector square root, and basic operations, and to store intermediate variables in the data buffer unit.
  2. The apparatus for performing an AdaGrad gradient descent algorithm according to claim 1, wherein the data processing module comprises an operation control sub-module, a parallel vector addition unit, a parallel vector multiplication unit, a parallel vector division unit, a parallel vector square root unit, and a basic operation sub-module.
  3. The apparatus for performing an AdaGrad gradient descent algorithm according to claim 2, wherein, when the data processing module performs an operation on the same vector, elements at different positions can be operated on in parallel.
  4. The apparatus for performing an AdaGrad gradient descent algorithm according to claim 1, wherein, at apparatus initialization, the data buffer unit initializes the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ by setting its value to 0, and allocates two spaces to store the constants α and ε; these two constant spaces are retained until execution of the entire gradient descent algorithm is complete.
  5. The apparatus for performing an AdaGrad gradient descent algorithm according to claim 1, wherein, during each data update, the data buffer unit has the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ read into the data processing module, where its value is updated by adding the square of the current gradient value, and then written back into the data buffer unit;
    the data processing module reads the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data buffer unit, updates the value of $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends it back to the data buffer unit, computes the adaptive learning rate $\alpha\big/\bigl(\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon\bigr)$ from $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
  6. A method for performing an AdaGrad gradient descent algorithm, comprising the following steps:
    step (1), initializing the data buffer unit, including setting initial values for the constants α and ε and setting the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to zero, where the constant α is the adaptive learning rate gain coefficient, used to adjust the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and ∑ denotes summation over the range i = 1 to i = t, that is, summation of the gradient squares from the first to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
    step (2), reading the parameter vector to be updated and the corresponding gradient vector from the external space;
    step (3), the data processing module reading and updating the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data buffer unit, and computing the adaptive learning rate $\alpha\big/\bigl(\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon\bigr)$ from the constants α, ε and the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data buffer unit;
    step (4), the data processing module completing the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being computed as
    $$W_{t+1} = W_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t),$$
    where $W_t$ denotes the current parameter to be updated, i.e. that of the t-th iteration, $\Delta L(W_t)$ denotes the gradient value of the current parameter to be updated, and $W_{t+1}$ denotes the updated parameter, which is also the parameter to be updated in the next, i.e. the (t+1)-th, iteration;
    step (5), the data processing unit judging whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise execution continues from step (2).
  7. An apparatus for performing an AdaGrad gradient descent algorithm, wherein a program for performing the method according to claim 6 is solidified in a controller of the apparatus.
  8. A method for performing an AdaGrad gradient descent algorithm, comprising the following steps:
    step S1, pre-storing an IO instruction at the first address of the instruction cache unit;
    step S2, the controller unit reading this IO instruction from the first address of the instruction cache unit, and according to the decoded microinstruction, the data access unit reading all instructions related to the AdaGrad gradient descent calculation from the external address space and caching them in the instruction cache unit;
    step S3, the controller unit reading an assignment instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data buffer unit being set to zero and the constants α and ε being initialized, where the constant α is the adaptive learning rate gain coefficient, used to adjust the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and ∑ denotes summation over the range i = 1 to i = t, that is, summation of the gradient squares from the first to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
    step S4, the controller unit reading an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the data access unit reading the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data buffer unit;
    step S5, the controller unit reading a data transfer instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data buffer unit being transferred to the data processing unit;
    step S6, the controller unit reading a vector instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ being updated, in which operation the instruction is sent to the operation control sub-module, which issues the corresponding instructions to perform the following: the parallel vector multiplication sub-module computes $(\Delta L(W_t))^2$, and the parallel vector addition sub-module adds $(\Delta L(W_t))^2$ to the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$;
    step S7, the controller unit reading an instruction from the instruction cache unit, and according to the decoded microinstruction, the updated sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ being transferred from the data processing unit back to the data buffer unit;
    step S8, the controller unit reading an adaptive learning rate operation instruction from the instruction cache unit, and according to the decoded microinstruction, the operation control sub-module directing the relevant operation modules to perform the following: the parallel vector square root sub-module computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the parallel vector division sub-module computes the adaptive learning rate $\alpha\big/\bigl(\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon\bigr)$;
    step S9, the controller unit reading a parameter vector update instruction from the instruction cache unit, and according to the decoded microinstruction, driving the operation control sub-module to perform the following operations: the parallel vector multiplication sub-module computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t)$, and the parallel vector addition sub-module computes $W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t)$, giving the updated parameter vector $W_{t+1}$;
    step S10, the controller unit reading an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the updated parameter vector $W_{t+1}$ being transferred from the data processing unit through the data access unit to the specified address in the external address space;
    step S11, the controller unit reading a convergence judgment instruction from the instruction cache unit, and according to the decoded microinstruction, the data processing unit judging whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise execution continues from step S5.
  9. An apparatus for performing an AdaGrad gradient descent algorithm, wherein a program for performing the method according to claim 8 is solidified in a controller of the apparatus.
PCT/CN2016/081836 2016-04-29 2016-05-12 Apparatus and method for executing adagrad gradient descent training algorithm WO2017185411A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610280620.4A CN107341132B (en) 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm
CN201610280620.4 2016-04-29

Publications (1)

Publication Number Publication Date
WO2017185411A1 true WO2017185411A1 (en) 2017-11-02

Family

ID=60161682

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081836 WO2017185411A1 (en) 2016-04-29 2016-05-12 Apparatus and method for executing adagrad gradient descent training algorithm

Country Status (2)

Country Link
CN (1) CN107341132B (en)
WO (1) WO2017185411A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378480B (en) * 2019-06-14 2022-09-27 平安科技(深圳)有限公司 Model training method and device and computer readable storage medium
CN111626434B (en) * 2020-05-15 2022-06-07 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium
CN113238975A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Memory, integrated circuit and board card for optimizing parameters of deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477925B2 (en) * 2012-11-20 2016-10-25 Microsoft Technology Licensing, Llc Deep neural networks training for speech and pattern recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826142A (en) * 2010-04-19 2010-09-08 中国人民解放军信息工程大学 Reconfigurable elliptic curve cipher processor
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DEAN, J. ET AL.: "Large Scale Distributed Deep Networks", NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, 6 December 2012 (2012-12-06), XP055113684 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329941A (en) * 2020-11-04 2021-02-05 支付宝(杭州)信息技术有限公司 Deep learning model updating method and device
CN116128072A (en) * 2023-01-20 2023-05-16 支付宝(杭州)信息技术有限公司 Training method, device, equipment and storage medium of risk control model
CN116128072B (en) * 2023-01-20 2023-08-25 支付宝(杭州)信息技术有限公司 Training method, device, equipment and storage medium of risk control model

Also Published As

Publication number Publication date
CN107341132A (en) 2017-11-10
CN107341132B (en) 2021-06-11

Similar Documents

Publication Publication Date Title
WO2017185411A1 (en) Apparatus and method for executing adagrad gradient descent training algorithm
US11574195B2 (en) Operation method
WO2017124644A1 (en) Artificial neural network compression encoding device and method
WO2017185391A1 (en) Device and method for performing training of convolutional neural network
WO2017124642A1 (en) Device and method for executing forward calculation of artificial neural network
WO2017124641A1 (en) Device and method for executing reversal training of artificial neural network
WO2017185389A1 (en) Device and method for use in executing matrix multiplication operations
CN111260025B (en) Apparatus and method for performing LSTM neural network operation
WO2017124646A1 (en) Artificial neural network calculating device and method for sparse connection
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
WO2017185394A1 (en) Device and method for performing reversetraining of fully connected layers of neural network
WO2018120016A1 (en) Apparatus for executing lstm neural network operation, and operational method
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
WO2017185336A1 (en) Apparatus and method for executing pooling operation
WO2018113790A1 (en) Operation apparatus and method for artificial neural network
CN109754062B (en) Execution method of convolution expansion instruction and related product
WO2017185413A1 (en) Device and method for executing hessian-free training algorithm
KR20230109791A (en) Packed data alignment plus compute instructions, processors, methods, and systems
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2018112892A1 (en) Device and method for supporting fast artificial neural network operation
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
WO2017177446A1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
CN111860814B (en) Apparatus and method for performing batch normalization operations
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
CN107315569B (en) Device and method for executing RMSprop gradient descent algorithm

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899922

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16899922

Country of ref document: EP

Kind code of ref document: A1