CN107341132B - Device and method for executing AdaGrad gradient descent training algorithm - Google Patents

Device and method for executing AdaGrad gradient descent training algorithm

Info

Publication number
CN107341132B
Authority
CN
China
Prior art keywords
vector
unit
data
gradient
updated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610280620.4A
Other languages
Chinese (zh)
Other versions
CN107341132A (en)
Inventor
郭崎
刘少礼
陈天石
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201610280620.4A priority Critical patent/CN107341132B/en
Priority to PCT/CN2016/081836 priority patent/WO2017185411A1/en
Publication of CN107341132A publication Critical patent/CN107341132A/en
Application granted granted Critical
Publication of CN107341132B publication Critical patent/CN107341132B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A computing device and method, the computing device including a controller unit and a data processing unit. The device first reads a gradient vector and the vector of values to be updated, and at the same time updates the historical gradient values in the cache region with the current gradient values; during each iteration, an update amount is calculated from the current and historical gradient values and applied to the vector to be updated; training continues until the parameter vector to be updated converges. By adopting dedicated hardware, the invention overcomes the insufficient arithmetic performance and high front-end decoding overhead of general-purpose processors, accelerates the execution of related applications, and reduces the memory access bandwidth.

Description

Device and method for executing AdaGrad gradient descent training algorithm
Technical Field
The invention relates to the technical field of AdaGrad algorithm application, in particular to a device and a method for executing an AdaGrad gradient descent training algorithm.
Background
The AdaGrad algorithm is widely used because it is easy to implement, requires little computation and storage, and can adaptively assign a learning rate to each parameter. Implementing the AdaGrad algorithm with a dedicated device can significantly increase its execution speed.
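For concreteness, the update rule behind these properties can be stated in a few lines; the following is a minimal NumPy sketch of the textbook AdaGrad step (not of the hardware device described below), assuming the common formulation in which a small constant ε keeps the denominator non-zero.
```python
import numpy as np

def adagrad_update(w, grad, hist_sq, alpha=0.01, eps=1e-8):
    # Minimal reference sketch of one AdaGrad step; names, defaults and the
    # placement of eps are illustrative assumptions, not the patent's circuit.
    hist_sq += grad ** 2                   # accumulate squared gradients per parameter
    lr = alpha / (np.sqrt(hist_sq) + eps)  # per-element adaptive learning rate
    w -= lr * grad                         # each parameter gets its own step size
    return w, hist_sq

# Example: one update of a 4-element parameter vector
w, hist_sq = np.zeros(4), np.zeros(4)
grad = np.array([0.5, -0.1, 0.0, 2.0])
w, hist_sq = adagrad_update(w, grad, hist_sq)
```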
Currently, one known method of performing the AdaGrad gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general instructions using a general register file and general functional units. One disadvantage of this approach is the low computational performance of a single general-purpose processor, and when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the AdaGrad gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding brings a large power consumption overhead.
Another known method of performing the AdaGrad gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose SIMD instructions using a general register file and general stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computing, it has no special support for the operations of the AdaGrad gradient descent algorithm, so a large amount of front-end decoding work is still needed to perform them, bringing considerable additional overhead. In addition, the GPU has only a small on-chip cache, so data required by the operation (such as the historical gradient values) must be repeatedly transferred from off-chip; off-chip bandwidth then becomes the main performance bottleneck and brings a huge power consumption overhead.
Disclosure of Invention
In view of the above, the present invention provides an apparatus and a method for performing an AdaGrad gradient descent algorithm to solve at least one of the above technical problems.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing an AdaGrad gradient descent algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
the data cache unit is used for storing intermediate variables in the operation process and executing initialization and updating operations on the intermediate variables;
and the data processing module is used for executing arithmetic operations under the control of the controller unit, wherein the arithmetic operations comprise vector addition, vector multiplication, vector division, vector square root and basic arithmetic operations, and for storing intermediate variables into the data cache unit.
The data processing module comprises an operation control submodule, a parallel vector addition operation unit, a parallel vector multiplication operation unit, a parallel vector division operation unit, a parallel vector square root operation unit and a basic operation submodule.
When the data processing module performs an operation on the same vector, elements at different positions can be operated on in parallel.
Wherein, at device initialization, the data cache unit initializes the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to 0 and simultaneously opens up two spaces to store the constants α and ε, the two constant spaces being kept until the whole gradient descent algorithm has finished executing.
Wherein, during each data update, the data cache unit reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ into the data processing module, where its value is updated, namely the square of the current gradient value is added, and the result is then written back into the data cache unit;
the data processing module reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data cache unit, updates $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends its value back to the data cache unit, calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
As another aspect of the present invention, the present invention also provides a method of executing an AdaGrad gradient descent algorithm, characterized by comprising the steps of:
step (1), initializing the data cache unit, including setting initial values for the constants α and ε and zeroing the sum of squares of the historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
step (2), reading a parameter vector to be updated and a corresponding gradient vector from an external space;
and (3) the data processing module reads and updates the sum of squares of the historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit, and calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using the constants α and ε and the sum of squares of the historical gradients in the data cache unit;
and (4) the data processing module completes the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being computed as:
$$W_{t+1} = W_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$$
wherein $W_t$ denotes the current, i.e., t-th, parameter to be updated, $\Delta L(W_t)$ denotes the gradient value of the current parameter to be updated, and $W_{t+1}$ denotes the updated parameter, which is also the parameter to be updated in the next, i.e., (t+1)-th, iteration;
and (5) judging whether the updated parameter vector is converged by the data processing unit, if so, finishing the operation, otherwise, turning to the step (2) to continue executing.
And a device for executing the AdaGrad gradient descent algorithm, wherein a program for executing the above method is embedded in a controller of the device.
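For reference only, steps (1) through (5) above map onto the following software sketch of the training loop; the gradient function, tolerance and iteration cap are assumptions introduced purely for illustration, not features of the claimed device.
```python
import numpy as np

def adagrad_train(w0, grad_fn, alpha=0.1, eps=1e-8, tol=1e-6, max_iter=10000):
    # Software analogue of steps (1)-(5); grad_fn(w) returns the gradient ΔL(W).
    w = w0.copy()
    hist_sq = np.zeros_like(w)                  # step (1): zero the historical sum of squares
    for _ in range(max_iter):
        grad = grad_fn(w)                       # step (2): obtain the current gradient vector
        hist_sq += grad ** 2                    # step (3): update the historical sum of squares
        lr = alpha / (np.sqrt(hist_sq) + eps)   #           and the adaptive learning rate
        w_next = w - lr * grad                  # step (4): update the parameter vector
        if np.linalg.norm(w_next - w) < tol:    # step (5): convergence check
            return w_next
        w = w_next
    return w

# Example: minimize f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3)
w_star = adagrad_train(np.zeros(3), lambda w: 2.0 * (w - 3.0))
```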
As a further aspect of the present invention, the present invention also provides a method of executing an AdaGrad gradient descent algorithm, characterized by comprising the steps of:
step S1, an IO instruction is pre-stored in the first address of the instruction cache unit;
step S2, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all instructions related to AdaGrad gradient descent calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads in the assignment instruction from the instruction cache unit and, according to the translated microinstruction, zeroes the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit and initializes the constants α and ε; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
step S4, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the data access unit reads the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data cache unit;
step S5, the controller unit reads in a data transfer instruction from the instruction cache unit and, according to the translated microinstruction, the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data cache unit are transmitted to the data processing unit;
step S6, the controller unit reads in a vector instruction from the instruction cache unit and, according to the translated microinstruction, performs the update of the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$; the instruction is sent to the operation control submodule, which issues corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule computes $(\Delta L(W_t))^2$, and the vector addition parallel operation submodule adds $(\Delta L(W_t))^2$ to the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$;
step S7, the controller unit reads in an instruction from the instruction cache unit and, according to the translated microinstruction, the updated historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is transmitted from the data processing unit back to the data cache unit;
step S8, the controller unit reads in an adaptive learning rate operation instruction from the instruction cache unit and, according to the translated microinstruction, the operation control submodule controls the relevant operation modules to perform the following operations: the vector square root parallel operation submodule computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the vector division parallel operation submodule computes the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$;
step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control submodule to perform the following operations: the vector multiplication parallel operation submodule computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, and the vector addition parallel operation submodule computes $W_{t+1}=W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, obtaining the updated parameter vector $W_{t+1}$;
step S10, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the updated parameter vector $W_{t+1}$ is transmitted from the data processing unit to the designated address in the external address space through the data access unit;
in step S11, the controller unit reads a convergence judgment instruction from the instruction cache unit, and the data processing unit judges whether the updated parameter vector converges according to the translated microinstruction, and if so, the operation ends, otherwise, the processing unit goes to step S5 to continue the execution.
And a device for executing the AdaGrad gradient descent algorithm, wherein a program for executing the above method is embedded in a controller of the device.
Based on the above technical scheme, the device and method have the following beneficial effects: with this device, the AdaGrad gradient descent algorithm can be implemented and data processing efficiency is greatly improved; by adopting dedicated equipment for executing the AdaGrad gradient descent algorithm, the problems of insufficient arithmetic performance and high front-end decoding overhead of general-purpose processors can be solved, accelerating the execution of related applications; meanwhile, the data cache unit avoids repeatedly reading data from memory and reduces the memory access bandwidth.
Drawings
Fig. 1 is an exemplary block diagram of the overall structure of an apparatus for implementing an AdaGrad gradient descent algorithm-related application according to an embodiment of the present invention;
fig. 2 is a block diagram of an example of a data processing module in an apparatus for implementing an application related to an AdaGrad gradient descent algorithm according to an embodiment of the present invention;
fig. 3 is a flowchart of operations for implementing an AdaGrad gradient descent algorithm-related application, according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments. Other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art from the following detailed description.
In this specification, the various embodiments described below which are meant to illustrate the principles of this invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The invention discloses a device for executing an AdaGrad gradient descent algorithm, which comprises a data access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The data access unit can access the external address space and can read and write data to each cache unit in the device to complete the loading and storing of data; specifically, it reads instructions into the instruction cache unit, reads the parameters to be updated and the corresponding gradient values from the designated storage units into the data processing unit, and writes the updated parameter vectors from the data processing module directly into the designated external space. The instruction cache unit reads instructions through the data access unit and caches the read instructions. The controller unit reads instructions from the instruction cache unit, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the data access unit, the data cache unit and the data processing module. The data cache unit stores intermediate variables needed during the operation of the device, and initializes and updates these variables. The data processing module performs the corresponding arithmetic operations according to the instructions, including vector addition, vector multiplication, vector division, vector square root and basic arithmetic operations.
The apparatus for implementing the AdaGrad gradient descent algorithm according to an embodiment of the present invention may be used to support applications that use the AdaGrad gradient descent algorithm. A space is established in the data cache unit to store the sum of squares of the historical gradient values; each time gradient descent is performed, the learning rate is calculated using this sum of squares, and the vector to be updated is then updated. The gradient descent operation is repeated until the vector to be updated converges.
The invention also discloses a method for executing the AdaGrad gradient descent algorithm, which comprises the following specific implementation steps:
the initialization operation of the data cache unit is completed through an instruction, and comprises setting initial values for constants alpha and epsilon and squaring and summing historical gradients
Figure BDA0000978417050000071
And setting zero operation. Wherein, the constant alpha is a gain coefficient of the adaptive learning rate and is used for adjusting and controlling the range of the adaptive learning rate, the constant epsilon is a smaller constant and is used for ensuring that the denominator in the calculation of the adaptive learning rate is nonzero, t is the current iteration frequency, Wt′Is a parameter to be updated in the ith iteration, Δ L (W)t′) For the gradient value of the parameter to be updated at the i-th iteration, Σ represents a summation operation that ranges from i-1 to i-t, i.e., for the initial to current gradient squared value (Δ L (W)1))2,(ΔL(W2))2,…,(ΔL(Wt))2And (6) summing.
And finishing the operation of reading the parameter vector to be updated and the corresponding gradient vector from the external space by the data access unit through the IO instruction.
The data processing module, according to the corresponding instruction, reads and updates the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit, and calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using the constants α and ε and the historical gradient sum of squares in the data cache unit.
The data processing module, according to the corresponding instruction, completes the update of the vector to be updated using the adaptive learning rate and the current gradient value; the update is computed as:
$$W_{t+1} = W_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$$
wherein $W_t$ denotes the current (t-th) parameter to be updated, $\Delta L(W_t)$ denotes the gradient value of the current parameter to be updated, and $W_{t+1}$ denotes the updated parameter, which is also the parameter to be updated in the next, (t+1)-th, iteration.
The data processing unit judges whether the updated parameter vector has converged; if so, the operation ends, otherwise, the process returns to the step of reading the parameter vector to be updated and the corresponding gradient vector from the external space and continues executing.
The embodiments of the present invention will be further explained with reference to the accompanying drawings.
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the AdaGrad gradient descent algorithm according to an embodiment of the invention. As shown in fig. 1, the apparatus includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may be implemented by hardware, including but not limited to FPGA, CGRA, ASIC, analog circuit, memristor, etc.
The data access unit 1 can access the external address space and can read and write data to each cache unit in the device to complete the loading and storing of data. This specifically includes reading instructions into the instruction cache unit 2, reading the parameters to be updated from the designated storage unit into the data processing unit 5, reading the gradient values from the designated external space into the data cache unit 4, and writing the updated parameter vectors from the data processing module 5 directly into the designated external space.
The instruction cache unit 2 reads the instruction through the data access unit 1 and caches the read instruction.
The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the data access unit 1, the data cache unit 4, and the data processing module 5.
At device initialization, the data cache unit 4 initializes the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to 0 and simultaneously opens up two spaces to store the constants α and ε; the two constant spaces are kept until the whole gradient descent iteration process has finished. During each data update, the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is read into the data processing module 5, its value is updated there, i.e. the square of the current gradient value is added, and the result is then written back into the data cache unit 4.
The data processing module 5 reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data cache unit 4, updates $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends its value back to the data cache unit 4, calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
Fig. 2 shows an exemplary block diagram of the data processing module in an apparatus for implementing an AdaGrad gradient descent algorithm-related application according to an embodiment of the present invention. As shown in fig. 2, the data processing module includes an operation control submodule 51, a parallel vector addition operation unit 52, a parallel vector multiplication operation unit 53, a parallel vector division operation unit 54, a parallel vector square root operation unit 55, and a basic operation submodule 56. Because the vector operations in the AdaGrad gradient descent algorithm are element-wise, when an operation is performed on a given vector, elements at different positions can be operated on in parallel.
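Because every operation listed above is element-wise, a software analogue of these submodules is simply a set of vectorized primitives; the sketch below (an assumed illustration, not the actual circuit or its instruction set) shows how an operation control step could dispatch to them.
```python
import numpy as np

# Element-wise primitives standing in for the parallel operation units 52-55;
# each acts on all vector positions independently, which is what allows a
# hardware implementation to process the positions in parallel.
VECTOR_OPS = {
    "vadd":  np.add,       # parallel vector addition unit 52
    "vmul":  np.multiply,  # parallel vector multiplication unit 53
    "vdiv":  np.divide,    # parallel vector division unit 54
    "vsqrt": np.sqrt,      # parallel vector square root unit 55
}

def dispatch(op, *operands):
    # Toy stand-in for the operation control submodule 51 (assumed interface).
    return VECTOR_OPS[op](*operands)

# e.g. squaring a gradient vector with the multiplication unit:
g = np.array([0.5, -0.1, 2.0])
g_squared = dispatch("vmul", g, g)
```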
Fig. 3 shows a general flow chart of the device performing the operations related to the AdaGrad gradient descent algorithm.
In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
In step S2, the operation starts: the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit 2.
In step S3, the controller unit 3 reads in the assignment instruction from the instruction cache unit 2 and, according to the translated microinstruction, the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit is zeroed and the constants α and ε are initialized. Here, the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate; the constant ε is a small constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero; t is the current iteration number; $W_i$ is the parameter to be updated at the i-th iteration; $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration; and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$.
In step S4, the controller unit 3 reads in an IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, the data access unit 1 reads the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data cache unit 4.
In step S5, the controller unit 3 reads in a data transfer instruction from the instruction cache unit 2 and, according to the translated microinstruction, the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data cache unit 4 are transmitted to the data processing unit 5.
In step S6, the controller unit reads in a vector instruction from the instruction cache unit 2 and, according to the translated microinstruction, performs the update of the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$; the instruction is sent to the operation control submodule 51, which issues corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule 53 computes $(\Delta L(W_t))^2$, and the vector addition parallel operation submodule 52 adds $(\Delta L(W_t))^2$ to the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$.
In step S7, the controller unit reads in an instruction from the instruction cache unit 2 and, according to the translated microinstruction, the updated historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is transmitted from the data processing unit 5 back to the data cache unit 4.
In step S8, the controller unit reads in an adaptive learning rate operation instruction from the instruction cache unit 2 and, according to the translated microinstruction, the operation control submodule 51 controls the relevant operation modules to perform the following operations: the vector square root parallel operation submodule 55 computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the vector division parallel operation submodule 54 computes the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$.
In step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control submodule 51 to perform the following operations: the vector multiplication parallel operation submodule 53 computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, and the vector addition parallel operation submodule 52 computes $W_{t+1}=W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, obtaining the updated parameter vector $W_{t+1}$.
In step S10, the controller unit reads in an IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, the updated parameter vector $W_{t+1}$ is transmitted from the data processing unit 5 to the designated address in the external address space through the data access unit 1.
In step S11, the controller unit reads a convergence judgment instruction from the instruction cache unit 2, and the data processing unit judges whether the updated parameter vector converges according to the translated microinstruction, and if so, the operation ends, otherwise, the processing proceeds to step S5 to continue the operation.
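Viewed purely as data flow, the per-iteration work of steps S5 through S10 reduces to the short routine below; the instruction fetching, decoding and IO of steps S1-S4 are abstracted away, and the function signature is an assumption made only for this sketch.
```python
import numpy as np

def adagrad_iteration(w_t, grad_t, hist_sq, alpha, eps):
    # One software pass over steps S6-S9 of the flow above (sketch only).
    hist_sq = hist_sq + grad_t * grad_t    # S6: vmul then vadd update the sum of squares
    lr = alpha / (np.sqrt(hist_sq) + eps)  # S8: vsqrt then vdiv give the adaptive rate
    w_next = w_t - lr * grad_t             # S9: vmul then vadd produce W_{t+1}
    return w_next, hist_sq                 # S7/S10: results written back to cache / external space
# S11 then checks convergence of w_next and, if needed, the flow repeats from S5.
```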
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software carried on a non-transitory computer readable medium), or a combination thereof. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (11)

1. A computing device that executes an AdaGrad gradient descent algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
a data processing unit for performing arithmetic operations including a basic arithmetic operation, and vector addition operation, vector multiplication operation, vector division operation, and vector square root operation under the control of the controller unit;
the data cache unit is used for storing the square sum of the historical gradient values in the operation process;
the data cache unit is also used for executing initialization and updating operation on the intermediate variable;
when the computing device executes the AdaGrad gradient descent algorithm, firstly reading a gradient vector and a value vector to be updated, and updating a historical gradient value cached in a data cache unit by using a current gradient value; during each iteration, calculating an updating amount by using the current gradient value and the historical gradient value, and performing updating operation on a vector to be updated; and continuing training until the parameter vector to be updated converges.
2. The computing apparatus of claim 1, wherein the data processing unit includes an operation control sub-module and a basic operation sub-module, and a parallel vector addition operation unit, a parallel vector division operation unit, a parallel vector multiplication operation unit, and a parallel vector square root operation unit.
3. The computing device of claim 2, wherein different positional elements are capable of performing operations in parallel when the data processing unit performs operations on the same vector.
4. The computing device of claim 1, wherein, upon device initialization, the data cache unit initializes the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to 0 and simultaneously opens up two spaces to store the constants α and ε, the two constant spaces being kept until the whole gradient descent algorithm has been executed, wherein the adaptive learning rate is $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$.
5. The computing device of claim 1, wherein, during each data update, the data cache unit reads the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ into the data processing unit, where its value is updated, namely the square of the current gradient value is added, and the result is then written back into the data cache unit;
the data processing unit reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data cache unit, updates $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends its value back to the data cache unit, calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
6. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
the controller unit decodes the read instruction into a microinstruction for controlling the corresponding module and sends the microinstruction to the corresponding module;
the data processing unit executes arithmetic operations under the control of the controller unit, including basic arithmetic operations, and vector addition operations, vector multiplication operations, vector division operations, and vector square root operations;
the controller unit stores the square sum of the historical gradient values generated in the operation process in a data cache unit, and performs initialization and updating operations on the intermediate variables;
when executing the AdaGrad gradient descent algorithm, the controller unit firstly reads the gradient vector and the value vector to be updated, and meanwhile, updates the historical gradient value cached in the data cache unit by using the current gradient value; during each iteration, calculating an updating amount by using the current gradient value and the historical gradient value, and performing updating operation on a vector to be updated; and continuing training until the parameter vector to be updated converges.
7. The computing method of claim 6, wherein different positional elements are capable of performing operations in parallel when the data processing units perform operations on the same vector.
8. The computing method of claim 6, wherein, upon device initialization, the data cache unit initializes the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to 0 and simultaneously opens up two spaces to store the constants α and ε, the two constant spaces being kept until the whole gradient descent algorithm has been executed, wherein the adaptive learning rate is $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$.
9. The computing method of claim 6, wherein, during each data update, the data cache unit reads the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ into the data processing unit, where its value is updated, namely the square of the current gradient value is added, and the result is then written back into the data cache unit;
the data processing unit reads the sum of squares of the historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data cache unit, updates $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends its value back to the data cache unit, calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
10. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
step (1), initializing the data cache unit, including setting initial values for the constants α and ε and zeroing the sum of squares of the historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
step (2), reading a parameter vector to be updated and a corresponding gradient vector from an external space;
and (3) the data processing unit reads and updates the sum of squares of the historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit, and calculates the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$ using the constants α and ε and the sum of squares of the historical gradients in the data cache unit;
and (4) the data processing unit completes the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being computed as:
$$W_{t+1} = W_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$$
wherein $W_t$ denotes the current, i.e., t-th, parameter to be updated, $\Delta L(W_t)$ denotes the gradient value of the current parameter to be updated, and $W_{t+1}$ denotes the updated parameter, which is also the parameter to be updated in the next, i.e., (t+1)-th, iteration;
step (5), the data processing unit judges whether the updated parameter vector is converged, if so, the operation is finished, otherwise, the step (2) is carried out continuously;
the data cache unit stores the square sum of historical gradient values in the operation process.
11. A computational method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
step S1, an IO instruction is pre-stored in the first address of the instruction cache unit;
step S2, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all instructions related to AdaGrad gradient descent calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads in the assignment instruction from the instruction cache unit and, according to the translated microinstruction, zeroes the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data cache unit and initializes the constants α and ε; wherein the constant α is the gain coefficient of the adaptive learning rate, used to adjust and control the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the calculation of the adaptive learning rate is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and Σ represents a summation over the range i = 1 to i = t, i.e., summing the squared gradient values from the initial to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
step S4, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the data access unit reads the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data cache unit;
step S5, the controller unit reads in a data transfer instruction from the instruction cache unit and, according to the translated microinstruction, the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data cache unit are transmitted to the data processing unit;
step S6, the controller unit reads in a vector instruction from the instruction cache unit and, according to the translated microinstruction, performs the update of the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$; the instruction is sent to the operation control submodule, which issues corresponding instructions to perform the following operations: the vector multiplication parallel operation submodule computes $(\Delta L(W_t))^2$, and the vector addition parallel operation submodule adds $(\Delta L(W_t))^2$ to the historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$;
step S7, the controller unit reads in an instruction from the instruction cache unit and, according to the translated microinstruction, the updated historical gradient sum of squares $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is transmitted from the data processing unit back to the data cache unit;
step S8, the controller unit reads in an adaptive learning rate operation instruction from the instruction cache unit and, according to the translated microinstruction, the operation control submodule controls the relevant operation modules to perform the following operations: the vector square root parallel operation submodule computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the vector division parallel operation submodule computes the adaptive learning rate $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}$;
step S9, the controller unit reads in a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control submodule to perform the following operations: the vector multiplication parallel operation submodule computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, and the vector addition parallel operation submodule computes $W_{t+1}=W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\epsilon}\cdot\Delta L(W_t)$, obtaining the updated parameter vector $W_{t+1}$;
step S10, the controller unit reads in an IO instruction from the instruction cache unit and, according to the translated microinstruction, the updated parameter vector $W_{t+1}$ is transmitted from the data processing unit to the designated address in the external address space through the data access unit;
step S11, the controller unit reads a convergence judgment instruction from the instruction cache unit, the data processing unit judges whether the updated parameter vector is converged according to the translated microinstruction, if so, the operation is finished, otherwise, the controller unit transfers to the step S5 to continue executing;
the data cache unit stores the square sum of historical gradient values in the operation process.
CN201610280620.4A 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm Active CN107341132B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610280620.4A CN107341132B (en) 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm
PCT/CN2016/081836 WO2017185411A1 (en) 2016-04-29 2016-05-12 Apparatus and method for executing adagrad gradient descent training algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610280620.4A CN107341132B (en) 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm

Publications (2)

Publication Number Publication Date
CN107341132A CN107341132A (en) 2017-11-10
CN107341132B true CN107341132B (en) 2021-06-11

Family

ID=60161682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610280620.4A Active CN107341132B (en) 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm

Country Status (2)

Country Link
CN (1) CN107341132B (en)
WO (1) WO2017185411A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378480B (en) * 2019-06-14 2022-09-27 平安科技(深圳)有限公司 Model training method and device and computer readable storage medium
CN111626434B (en) * 2020-05-15 2022-06-07 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium
CN112329941B (en) * 2020-11-04 2022-04-12 支付宝(杭州)信息技术有限公司 Deep learning model updating method and device
CN113238975A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Memory, integrated circuit and board card for optimizing parameters of deep neural network
CN116128072B (en) * 2023-01-20 2023-08-25 支付宝(杭州)信息技术有限公司 Training method, device, equipment and storage medium of risk control model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826142B (en) * 2010-04-19 2011-11-09 中国人民解放军信息工程大学 Reconfigurable elliptic curve cipher processor
US9477925B2 (en) * 2012-11-20 2016-10-25 Microsoft Technology Licensing, Llc Deep neural networks training for speech and pattern recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An overview of gradient descent optimization algorithms.pdf;SEBASTIAN RUDER;《https://ruder.io/optimizing-gradient-descent/》;20160119;第1-26页 *
Large Scale Distributed Deep Networks;Jeffrey Dean 等;《NIPS 2012》;20141021;第1-9页 *

Also Published As

Publication number Publication date
CN107341132A (en) 2017-11-10
WO2017185411A1 (en) 2017-11-02

Similar Documents

Publication Publication Date Title
CN107341132B (en) Device and method for executing AdaGrad gradient descent training algorithm
US10643129B2 (en) Apparatus and methods for training in convolutional neural networks
US20190370663A1 (en) Operation method
US20190065958A1 (en) Apparatus and Methods for Training in Fully Connected Layers of Convolutional Networks
CN111260025B (en) Apparatus and method for performing LSTM neural network operation
CN110929863B (en) Apparatus and method for performing LSTM operations
WO2017124642A1 (en) Device and method for executing forward calculation of artificial neural network
CN107316078A (en) Apparatus and method for performing artificial neural network self study computing
CN108320018B (en) Artificial neural network operation device and method
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
CN107170019B (en) Rapid low-storage image compression sensing method
US20190065938A1 (en) Apparatus and Methods for Pooling Operations
CN107341540B (en) Device and method for executing Hessian-Free training algorithm
US20200097520A1 (en) Apparatus and methods for vector operations
CN109754062B (en) Execution method of convolution expansion instruction and related product
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
CN107644253A (en) A kind of Neural network optimization based on inverse function, system and electronic equipment
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
US20190130274A1 (en) Apparatus and methods for backward propagation in neural networks supporting discrete data
CN111860814B (en) Apparatus and method for performing batch normalization operations
CN107315570B (en) Device and method for executing Adam gradient descent training algorithm
CN107315569B (en) Device and method for executing RMSprop gradient descent algorithm
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
CN113595681B (en) QR decomposition method, system, circuit, equipment and medium based on Givens rotation
US20190080241A1 (en) Apparatus and methods for backward propagation in neural networks supporting discrete data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100190 room 644, scientific research complex, No. 6, South Road, Academy of Sciences, Haidian District, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant