WO2017185256A1

WO2017185256A1 - Rmsprop gradient descent algorithm execution apparatus and method

Info

Publication number: WO2017185256A1
Application number: PCT/CN2016/080354
Authority: WO
Inventors: 刘少礼; 郭崎; 陈天石; 陈云霁
Original assignee: 北京中科寒武纪科技有限公司
Priority date: 2016-04-27
Filing date: 2016-04-27
Publication date: 2017-11-02

Abstract

Disclosed in the present invention are an RMSprop gradient descent algorithm execution apparatus and method. The apparatus comprises: a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit, and a data processing module. The method comprises: first reading a gradient vector and a to-be-updated value vector, and initializing a mean square vector; and during each iteration, first updating the mean square vector by using the gradient vector, then calculating, by using the mean square vector, a corresponding gradient descent amount during the update, updating the to-be-updated parameter vector, and repeating the process until the to-be-updated vector converges. In the whole process, the mean square vector is always stored in the data cache unit. By means of the present invention, application of an RMSprop gradient descent algorithm can be implemented, and the efficiency of data processing can be greatly improved.

Description

Apparatus and method for performing RMSprop gradient descent algorithm

Technical field

The present invention relates to the field of RMSprop algorithm application technology, and in particular to an apparatus and method for performing an RMSprop gradient descent algorithm, and relates to a hardware implementation of an RMSprop gradient descent optimization algorithm.

Background technique

Gradient descent optimization algorithm is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing. RMSprop algorithm is one of the gradient descent optimization algorithms. Because of its easy implementation, the calculation amount is small, the required storage space is small and Features such as mini-batch data sets are well used for processing, and the use of dedicated devices to implement the RMSprop algorithm can significantly increase the speed of execution.

Currently, a known method of performing the RMSprop gradient descent algorithm is to use a general purpose processor. The method supports the above algorithm by executing general purpose instructions using a general purpose register file and generic functions. One of the disadvantages of this method is that the performance of a single general-purpose processor is low, and when multiple general-purpose processors are executed in parallel, communication between general-purpose processors becomes a performance bottleneck. In addition, the general-purpose processor needs to decode the correlation operation corresponding to the RMSprop algorithm into a long-column operation and a fetch instruction sequence, and the processor front-end decoding brings a large power consumption overhead.

Another known method of performing the RMSprop gradient descent algorithm is to use a graphics processing unit (GPU). The method supports the above algorithm by executing a general single instruction multiple data stream (SIMD) instruction using a general purpose register file and a general stream processing unit. Since the GPU is a device dedicated to performing graphics and computational operations and scientific calculations, without the special support for the RMSprop gradient descent algorithm, it still requires a large amount of front-end decoding to perform the relevant operations in the RMSprop gradient descent algorithm. A lot of extra overhead. In addition, the GPU has only a small on-chip buffer. The intermediate variable data required for the RMSprop gradient descent algorithm, such as the average direction amount, needs to be repeatedly transferred from off-chip. The off-chip bandwidth becomes the main performance bottleneck and brings great work. Cost.

Summary of the invention

In view of this, the main object of the present invention is to provide an apparatus and method for performing an RMSprop gradient descent algorithm, which solves the problem that the general-purpose processor of the data has insufficient performance, and the decoding cost of the previous stage is large, and avoids repeatedly reading into the memory. Take data and reduce the bandwidth of memory access.

To achieve the above object, the present invention provides an apparatus for performing an RMSprop gradient descent algorithm, the apparatus comprising a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4, and a data processing module 5, among them:

The direct memory access unit 1 is configured to access an external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete loading and storing of the data;

The instruction cache unit 2 is configured to read the instruction by the direct memory access unit 1 and cache the read instruction;

The controller unit 3 is configured to read an instruction from the instruction cache unit 2, and decode the read instruction into a micro instruction that controls the behavior of the direct memory access unit 1, the data buffer unit 4, or the data processing module 5.

a data buffer unit 4, configured to cache a mean square matrix during initialization and data update;

The data processing module 5 is configured to update the average direction quantity and the parameter to be updated, and write the updated average direction quantity into the data buffer unit 4, and write the updated parameter to be updated to the outside through the direct memory access unit 1. In the specified space.

In the above solution, the direct memory access unit 1 writes an instruction from the external designated space to the instruction cache unit 2, reads the parameter to be updated and the corresponding gradient value from the external designated space to the data processing module 5, and updates the updated The parameter vector is directly written from the data processing module 5 to the external designated space.

In the above solution, the controller unit 3 decodes the read instruction into a micro-instruction that controls the behavior of the direct memory access unit 1, the data buffer unit 4 or the data processing module 5, and controls the direct memory access unit 1 to be externally specified. The address reads the data and writes the data to the external designated address, and the control data buffer unit 4 acquires an instruction required for the operation from the external designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update operation of the parameter to be updated, and controls The data buffer unit 4 performs data transmission with the data processing module 5.

In the above solution, the data buffer unit 4 initializes the mean square matrix RMS _t at initialization, and reads the mean square matrix RMS _t-1 into the data processing module 5 in each data update process, in the data processing module 5 It is updated to the mean square matrix RMS _t and then written to the data buffer unit 4. During the operation of the device, a copy of the mean square matrix RMS _t is always stored inside the data buffer unit 4.

In the above solution, the data processing module 5 reads the average direction quantity RMS _t-1 from the data buffer unit 4, and reads the parameter vector θ _t-1 and the gradient vector to be updated from the external designated space through the direct memory access unit ₁ .

Α global update step size and direction are updated amount of δ, the direction of the average amount of RMS _t-1 update RMS _t, to be updated by updating the parameter RMS _t θ _t-1 is θ _t, and the data is written back to cache RMS _t In unit 4, θ _t is written back to the external designated space by the direct memory control unit 1.

In the above solution, the data processing module 5 updates the mean direction amount RMS _t-1 to RMS _t according to the formula.

Implemented, the data processing module 5 to be updated vector θ _t-1 is updated according to the formula θ _t

Realized.

In the above solution, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic The operation sub-module 56, wherein the vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55, and the basic operation sub-module 56 are connected in parallel, and the operation control sub-module is connected. 51 is connected in series with the vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55, and the basic operation sub-module 56.

In the above solution, when the device operates on the vector, the vector operations are all element-wise operations, and the same vector performs parallel operations on the same vector when performing certain operations.

To achieve the above object, the present invention also provides a method for performing an RMSprop gradient descent algorithm, the method comprising:

Initialize an average direction RMS ₀ and obtain the parameter θ _t to be updated and the corresponding gradient vector from the specified storage unit

When performing the gradient descent operation, first use the mean direction RMS _t-1 and the gradient vector

And the average direction quantity update rate δ updates the average direction quantity RMS _t , then divides the gradient vector by the square root of the mean direction quantity and multiplies the global update step length α to obtain the corresponding gradient descent quantity, and updates the to-be-updated vector θ _t-1 to θ. _t and output; repeat this process until the vector to be updated converges.

In the above solution, the initial direction quantity RMS _{0 is} initialized, and the parameter vector θ _t to be updated and the corresponding gradient vector are obtained from the specified storage unit.

include:

In step S1, an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2, and the INSTRUCTION_IO instruction is used to drive the direct memory unit 1 to read all instructions related to the RMSprop gradient descent calculation from the external address space.

Step S2, the operation starts, the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction, drives the direct memory access unit 1 to read from the external address space and is related to the RMSprop gradient descent calculation. All instructions, and cache these instructions into the instruction cache unit 2;

In step S3, the controller unit 3 reads a hyperparametric read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2, and drives the direct memory access unit 1 to read the global update step size α from the external space according to the translated microinstruction. The quantity update rate δ, the convergence threshold ct, and then sent to the data processing module 5;

In step S4, the controller unit 3 reads the assignment instruction from the instruction cache unit 2, initializes the average direction amount RMS _t-1 in the drive data buffer unit 4 according to the translated microinstruction, and drives the number of iterations in the data processing unit 5. t is set to 1;

In step S5, the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2, and drives the direct memory access unit 1 to read the parameter vector to be updated θ _t-1 from the external designated space according to the translated micro instruction. And the corresponding gradient vector

Then sent to the data processing module 5;

In step S6, the controller unit 3 reads a data transfer instruction from the instruction buffer unit 2, and transmits the average direction amount RMS _t-1 in the data buffer unit 4 to the data processing unit 5 according to the translated microinstruction.

In the above solution, the average direction RMS _t-1 and the gradient vector are utilized.

And the average direction quantity update rate δ update the average direction quantity RMS _t , which is according to the formula

The implementation specifically includes: the controller unit 3 reads an average direction quantity update instruction from the instruction cache unit 2, and drives the data buffer unit 4 to perform an update operation of the average direction quantity RMS _t-1 according to the translated micro instruction; In the update operation, the average direction quantity update instruction is sent to the operation control sub-module 51, and the operation control sub-module 51 sends a corresponding instruction to perform the following operations: sending the operation instruction 1 (INS_1) to the basic operation sub-module 56, and driving the basic operation sub- The module 56 operates (1-δ), sends the operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, and the drive vector multiplication parallel operation sub-module 53 calculates (1-δ) RMS _t-1 , respectively.

with

among them

with

The calculation of the elements corresponding to the position of the vector exists in a sequential order, and the positions are calculated in parallel between different positions; then, the operation instruction 3 (INS_3) is sent to the vector addition parallel operation sub-module 52, and the drive vector addition parallel operation sub-module 52 calculates

The updated mean direction RMS _{t is obtained} .

After updating the average direction quantity RMS _t and the average direction quantity update rate δ, the controller unit 3 reads a data transmission instruction from the instruction cache unit 2, and according to the translated micro-instruction, the updated average direction quantity RMS _t It is transferred from the data processing unit 5 to the data buffer unit 4.

In the above solution, the gradient vector is divided by the square root of the mean direction quantity and multiplied by the global update step length α to obtain a corresponding gradient decrease amount, and the updated update vector θ _t-1 is θ _t , which is according to the formula.

The implementation specifically includes: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2, and performs an update operation of the parameter vector according to the translated micro instruction; in the update operation, the parameter vector update instruction is sent to The operation control sub-module 51 controls the correlation operation module to perform an operation of: transmitting the operation instruction 4 (INS_4) to the basic operation unit sub-module 56, and driving the basic operation unit sub-module 56 to calculate -α, the number of iterations t plus 1; send operation instruction 5 (INS_5) to vector square root parallel operation sub-module 55, drive vector square root parallel operation sub-module 55 is calculated

The operation operation instruction 6 (INS_6) is sent to the vector multiplication parallel operation sub-module 53, and the drive vector multiplication parallel operation sub-module 53 calculates

After the two operations are completed, the operation instruction 7 (INS_7) is sent to the vector division parallel operation sub-module 54, and the drive vector division parallel operation sub-module 54 is calculated.

Then, the operation instruction 8 (INS_8) is sent to the vector addition parallel operation sub-module 52, and the drive vector addition parallel operation sub-module 52 calculates

Obtaining θ _t ; where θ _t-1 is the value before θ ₀ is not updated at the t-th cycle, and the t-th cycle updates θ _t-1 to θ _t ; the operation control sub-module 51 sends the operation instruction 9 (INS_9) ) to the vector division parallel operation sub-module 54, the drive vector division parallel operation sub-module 54 operates to obtain a vector

The arithmetic control submodule 51 transmits an arithmetic instruction 10 (INS_10), an arithmetic instruction 11 (INS_11) vector addition parallel operation submodule 52, and a basic operation submodule 56, respectively, and calculates sum = ∑ _i temp _i and temp2 = sum / n.

In the above solution, after the update vector θ _t-1 is θ _t , the controller unit 3 further reads a DATABACK_IO instruction from the instruction cache unit 2, and updates the parameter vector according to the translated micro instruction. θ _{t is} transferred from the data processing unit 5 to the external designated space through the direct memory access unit 1.

In the above solution, the step of repeating the process until the vector to be updated converges includes determining whether the vector to be updated converges. The specific determination process is as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2, according to the translation. The micro-instruction, the data processing module 5 determines whether the updated parameter vector converges, and if temp2 < ct, it converges, and the operation ends.

It can be seen from the above technical solutions that the present invention has the following beneficial effects:

1. The apparatus and method for performing the RMSprop gradient descent algorithm provided by the present invention can solve the problem that the general-purpose processor of the data has insufficient performance and the decoding cost of the previous segment is large by adopting a device specially used for performing the RMSprop gradient descent algorithm. Accelerate the execution speed of related applications.

2. Apparatus and method for performing an RMSprop gradient descent algorithm provided by the present invention, The use of the moment vector required for the intermediate process of the data buffer unit temporarily avoids repeatedly reading data into the memory, reduces the IO operation between the device and the external address space, reduces the bandwidth of the memory access, and solves the off-chip bandwidth. This bottleneck.

3. The apparatus and method for performing the RMSprop gradient descent algorithm provided by the present invention, the degree of parallelism is greatly improved because the data processing module performs vector operations using the related parallel operation sub-modules.

4. The apparatus and method for executing the RMSprop gradient descent algorithm provided by the present invention, since the data processing module uses the related parallel operation sub-module to perform vector operation, the parallelism of the operation is high, so the frequency during operation is low, so that the power consumption is low. The overhead is small.

DRAWINGS

For a more complete understanding of the present invention and its advantages, reference will now be made to the following description

1 shows an example block diagram of the overall structure of an apparatus for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.

2 illustrates an example block diagram of a data processing module in an apparatus for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.

3 shows a flow chart of a method for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.

Throughout the drawings, the same devices, components, units, and the like are denoted by the same reference numerals.

detailed description

Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the <RTI

In the present invention, the terms "including" and "containing" and their derivatives are intended to include, but are not limited to, The term "or" is inclusive, meaning and/or.

In the present specification, the following various embodiments for describing the principles of the present invention are merely illustrative and should not be construed as limiting the scope of the invention. The following description of the invention is intended to be understood as The description below includes numerous specific details to assist the understanding, but these details should be considered as merely exemplary. Accordingly, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. In addition, the same reference numerals are used throughout the drawings for similar functions and operations.

An apparatus and method for performing an RMSprop gradient descent algorithm for accelerating the application of an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention. First, an average direction quantity RMS _{0 is} initialized, and the parameter vector θ _t to be updated and the corresponding gradient vector are obtained from the specified storage unit.

Then, for each iteration, first use the previous mean direction RMS _t-1 , gradient vector

And the average direction quantity update rate δ updates the average direction quantity RMS _t , that is,

After that, the gradient vector is divided by the square root of the mean direction amount and multiplied by the global update step size α to obtain the corresponding gradient descent amount, and the vector to be updated is updated, that is,

The entire process is repeated until the vector to be updated converges.

1 shows an example block diagram of the overall structure of an apparatus for implementing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention. As shown in FIG. 1, the device includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4, and a data processing module 5, all of which can be implemented by hardware circuits.

The direct memory access unit 1 is configured to access an external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete loading and storing of data. Specifically, an instruction is written from the external designated space to the instruction cache unit 2, the parameter to be updated and the corresponding gradient value are read from the external designated space to the data processing module 5, and the updated parameter vector is directly written from the data processing module 5. Externally specified space.

Instruction cache unit 2 for reading instructions through direct memory access unit 1 and caching The instruction read.

The controller unit 3 is configured to read an instruction from the instruction cache unit 2, and decode the read instruction into a micro instruction that controls the behavior of the direct memory access unit 1, the data buffer unit 4, or the data processing module 5, and each micro The instruction is sent to the direct memory access unit 1, the data buffer unit 4 or the data processing module 5, and controls the direct memory access unit 1 to read data from the external designated address and write the data to the external designated address, and control the data cache unit 3 to access through the direct memory. The unit 1 acquires an instruction required for an operation from an external designated address, controls the data processing module 5 to perform an update operation of the parameter to be updated, and controls the data buffer unit 4 to perform data transmission with the data processing module 5.

Data cache unit 4 for the initialization and update the data cache during mean square matrix; specifically, data cache unit 4 are initialized in initialization RMS _t square matrix, a square matrix RMS _t are each in the data update process _{The -1 is} read out into the data processing module 5, updated in the data processing module 5 to the mean square matrix RMS _t , and then written to the data buffer unit 4. A copy of the mean square matrix RMS _t is always stored inside the data buffer unit 4 throughout the operation of the device. In the present invention, since the moment vector required for the intermediate process of the data buffer unit is temporarily used, the data is repeatedly read into the memory, the IO operation between the device and the external address space is reduced, and the bandwidth of the memory access is reduced.

The data processing module 5 is configured to update the average direction quantity and the parameter to be updated, and write the updated average direction quantity into the data buffer unit 4, and write the updated parameter to be updated to the outside through the direct memory access unit 1. In the specified space; specifically, the data processing module 5 reads the average direction quantity RMS _t-1 from the data buffer unit 4, and reads the parameter vector θ _t-1 to be updated from the external designated space through the direct memory access unit ₁ , Gradient vector

The global update step size α and the average direction quantity update rate δ. First update the mean direction RMS _t-1 to RMS _t , ie RMS _t =

Then, the parameter θ _t-1 to be updated is updated by RMS _t to be θ _t , that is,

The RMS _{t is} written back to the data buffer unit 4, and θ _t is written back to the external designated space by the direct memory control unit 1. In the present invention, since the data processing module performs vector operations using the associated parallel operation sub-modules, the degree of parallelism is greatly improved, so the frequency of operation is low, and the power consumption overhead is small.

2 illustrates an example block diagram of a data processing module in an apparatus for implementing an RMSprop gradient descent algorithm related application in accordance with an embodiment of the present invention. As shown in FIG. 2, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic The operation sub-module 56, wherein the vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55, and the basic operation sub-module 56 are connected in parallel, and the operation control sub-module is connected. 51 is connected in series with the vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55, and the basic operation sub-module 56. When the device operates on a vector, the vector operations are element-wise operations, and the same vector performs parallel operations on the same vector when performing certain operations.

3 shows a flow chart of a method for performing an RMSprop gradient descent algorithm, including the following steps, in accordance with an embodiment of the present invention:

Step S4, the controller unit 3 from unit 2 is read into the instruction cache assignment instructions, translated in accordance with the microinstruction, the drive data buffer unit 4 in the direction average RMS _t-1 the amount of initialization, data processing and number of iterations of the drive unit 5 t is set to 1;

Then sent to the data processing module 5;

In step S7, the controller unit 3 reads an average direction quantity update instruction from the instruction buffer unit 2, and drives the data buffer unit 4 to perform an update operation of the average direction quantity RMS _t-1 according to the translated microinstruction. In the update operation, the average direction quantity update instruction is sent to the operation control sub-module 51, and the operation control sub-module 51 sends a corresponding instruction to perform the following operations: sending the operation instruction 1 (INS_1) to the basic operation sub-module 56 to drive the basic operation. The sub-module 56 operates (1-δ), sends the operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, and the drive vector multiplication parallel operation sub-module 53 calculates (1-δ) RMS _t-1 , respectively.

with

among them

with

The calculation of the elements corresponding to the position of the vector exists in a sequential order, and the positions are calculated in parallel between different positions. Then, the operation instruction 3 (INS_3) is sent to the vector addition parallel operation sub-module 52, and the drive vector addition parallel operation sub-module 52 calculates

The updated mean direction RMS _{t is obtained} .

In step S8, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, and transfers the updated mean direction amount RMS _t from the data processing unit 5 to the data buffer unit 4 according to the translated microinstruction.

In step S9, the controller unit 3 reads a parameter vector operation instruction from the instruction buffer unit 2, and performs an update operation of the parameter vector according to the translated micro instruction. In the update operation, the parameter vector update instruction is sent to the operation control sub-module 51, and the operation control sub-module 51 controls the correlation operation module to perform the following operations: transmitting the operation instruction 4 (INS_4) to the basic operation unit sub-module 56, and driving the basic operation. The unit sub-module 56 calculates -α, the iteration number t is incremented by one; the arithmetic operation instruction 5 (INS_5) is sent to the vector square root parallel operation sub-module 55, and the drive vector square root parallel operation sub-module 55 is calculated. The operation operation instruction 6 (INS_6) is sent to the vector multiplication parallel operation sub-module 53, and the drive vector multiplication parallel operation sub-module 53 calculates

Step S10, the control unit 3 reads an instruction cache to be updated, the amount of write-back command from the unit 2 (DATABACK_IO), translated in accordance with the microinstruction, the updated parameter vector θ _t from the data processing unit 5 via direct memory access unit 1 Transfer to the external designated space.

In step S11, the controller unit 3 reads a convergence determination instruction from the instruction cache unit 2, and according to the translated microinstruction, the data processing module 5 determines whether the updated parameter vector converges, and if temp2<ct, it converges, and the operation ends. Otherwise, go to step S5 to continue execution.

By adopting a device dedicated to performing the RMSprop gradient descent algorithm, the present invention can solve the problem that the general-purpose processor of the data has insufficient performance and the decoding cost of the previous segment is large, and the execution speed of the related application is accelerated. At the same time, the application of the data cache unit avoids repeatedly reading data into the memory and reducing the bandwidth of the memory access.

The processes or methods depicted in the preceding figures may include hardware (eg, circuitry, dedicated logic, etc.), firmware, software (eg, software embodied on a non-transitory computer readable medium), or both The combined processing logic is executed. Although the processes or methods have been described above in some order, it should be understood that certain operations described can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.

In the foregoing specification, various embodiments of the invention have been described with reference It is apparent that various modifications may be made to the various embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded as

Claims

An apparatus for performing an RMSprop gradient descent algorithm, the device comprising a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a data buffer unit (4), and data processing Module (5), where:

The direct memory access unit (1) is configured to access an external designated space, read and write data to the instruction cache unit (2) and the data processing module (5), and complete loading and storing of the data;

An instruction cache unit (2) for reading an instruction through the direct memory access unit (1) and buffering the read instruction;

The controller unit (3) is configured to read an instruction from the instruction cache unit (2), and decode the read instruction into a control direct memory access unit (1), a data buffer unit (4) or a data processing module (5) Micro-instructions of behavior;

a data buffer unit (4) for buffering the mean square matrix during initialization and data update;

a data processing module (5), configured to update the average direction quantity and the parameter to be updated, and write the updated average direction quantity into the data buffer unit (4), and pass the updated parameter to be updated through the direct memory access unit ( 1) Write to the external specified space.
The apparatus for performing an RMSprop gradient descent algorithm according to claim 1, wherein said direct memory access unit (1) writes an instruction from an external designated space to an instruction cache unit (2), and specifies a space from the outside. The parameter to be updated and the corresponding gradient value are read to the data processing module (5), and the updated parameter vector is directly written from the data processing module (5) to the external designated space.
The apparatus for performing an RMSprop gradient descent algorithm according to claim 1, wherein said controller unit (3) decodes the read instruction into a control direct memory access unit (1), a data buffer unit ( 4) or micro-instruction of data processing module (5) behavior, used to control the direct memory access unit (1) read data from an external specified address and write data to an external specified address, control data cache unit (4) through direct memory The access unit (1) acquires an instruction required for an operation from an external designated address, controls the data processing module (5) to perform an update operation of the parameter to be updated, and controls the data buffer unit (4) to perform data transmission with the data processing module (5).
The apparatus for performing an RMSprop gradient descent algorithm according to claim 1, wherein said data buffer unit (4) initializes a mean square matrix RMS t at initialization, and a mean square matrix in each data update process The RMS t-1 is read out into the data processing module (5), updated in the data processing module (5) to the mean square matrix RMS t , and then written to the data buffer unit (4).
The apparatus for performing RMSprop gradient descent algorithm according to claim 4, wherein, during operation of the device, the data buffer unit (4) inside each square matrix always holds a copy of the RMS t.
The apparatus for performing an RMSprop gradient descent algorithm according to claim 1, wherein said data processing module (5) reads an average direction amount RMS t-1 from a data buffer unit (4) through direct memory The access unit (1) reads the parameter vector θ t-1 to be updated from the external designated space, and the gradient vector
Α global update step size and direction are updated amount of δ, the direction of the average amount of RMS t-1 update RMS t, to be updated by updating the parameter RMS t θ t-1 is θ t, and the data is written back to cache RMS t In unit (4), θ t is written back to the external designated space by the direct memory control unit (1).
The apparatus for performing an RMSprop gradient descent algorithm according to claim 6, wherein said data processing module (5) updates the mean direction amount RMS t-1 to RMS t according to a formula
Realizing, the data processing module (5) updates the vector θ t-1 to be updated to θ t according to the formula
Realized.
The apparatus for performing an RMSprop gradient descent algorithm according to claim 1 or 7, wherein the data processing module (5) comprises an operation control sub-module (51), a vector addition parallel operation sub-module (52), Vector multiplication parallel operation sub-module (53), vector division parallel operation sub-module (54), vector square root parallel operation sub-module (55) and basic operation sub-module (56), wherein vector addition parallel operation sub-module (52), vector The multiplication parallel operation submodule (53), the vector division parallel operation submodule (54), the vector square root parallel operation submodule (55), and the basic operation submodule (56) are connected in parallel, and the operation control submodule (51) and the vector addition respectively Parallel operation submodule (52), vector multiplication parallel operation submodule The block (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55), and the basic operation sub-module (56) are connected in series.
The apparatus for performing an RMSprop gradient descent algorithm according to claim 8, wherein the vector operation is an element-wise operation when the vector is operated, and the different position elements are when the same vector performs an operation. Perform operations in parallel.
A method for performing an RMSprop gradient descent algorithm, applied to the device of any one of claims 1 to 9, characterized in that the method comprises:

Initialize an average direction quantity RMS 0 and obtain the parameter vector θ t to be updated and the corresponding gradient vector from the specified storage unit

When performing the gradient descent operation, first use the mean direction RMS t-1 and the gradient vector
And the average direction quantity update rate δ updates the average direction quantity RMS t , then divides the gradient vector by the square root of the mean direction quantity and multiplies the global update step length α to obtain the corresponding gradient descent quantity, and updates the to-be-updated vector θ t-1 to θ. t and output; repeat this process until the vector to be updated converges.
The method for performing an RMSprop gradient descent algorithm according to claim 10, wherein said initializing an average direction quantity RMS 0 and obtaining a parameter vector θ t to be updated and a corresponding gradient vector from a specified storage unit
include:

In step S1, an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit, and the INSTRUCTION_IO instruction is used to drive the direct memory unit 1 to read all instructions related to the RMSprop gradient descent calculation from the external address space.

Step S2, the operation starts, the controller unit reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit, and drives the direct memory access unit to read all the instructions related to the RMSprop gradient descent calculation from the external address space according to the translated microinstruction. And buffering these instructions into the instruction cache unit;

Step S3, the controller unit reads a hyperparametric read instruction (HYPERPARAMETER_IO) from the instruction cache unit, and drives the direct memory access unit to read the global update step size α from the external space according to the translated microinstruction, and the average direction update rate δ, convergence threshold ct, then After being sent to the data processing module;

Step S4, the controller unit reads the assignment instruction from the instruction cache unit, initializes the average direction quantity RMS t-1 in the drive data buffer unit according to the translated micro instruction, and drives the number of iterations t in the data processing unit to be set to 1;

Step S5, the controller unit reads a parameter read instruction (DATA_IO) from the instruction cache unit, and drives the direct memory access unit to read the parameter vector θ t-1 to be updated from the external designated space according to the translated micro instruction. Gradient vector
Then sent to the data processing module;

In step S6, the controller unit reads a data transfer instruction from the instruction cache unit, and transmits the average direction quantity RMS t-1 in the data buffer unit to the data processing unit according to the translated micro instruction.
A method for performing an RMSprop gradient descent algorithm according to claim 10, wherein said utilizing a mean direction RMS t-1 , a gradient vector
Both direction and amount of the average update rate of the update directions amount δ RMS t, is according to the formula

Realized, including:

The controller unit reads an average direction quantity update instruction from the instruction cache unit, and drives the data buffer unit to perform an update operation of the average direction quantity RMS t-1 according to the translated micro instruction; in the update operation, the direction quantity update The instruction is sent to the operation control sub-module, and the operation control sub-module sends the corresponding instruction to perform the following operations: sending the operation instruction 1 to the basic operation sub-module, driving the basic operation sub-module operation (1-δ), and transmitting the operation instruction 2 to the vector multiplication Running the sub-module in parallel, the drive vector multiplication parallel running sub-module calculates (1-δ) RMS t-1 ,
with
among them
with
The calculation of the elements corresponding to the position of the vector exists in order, and the positions are calculated in parallel; then, the operation instruction 3 to the vector addition parallel operation sub-module is sent, and the drive vector addition parallel operation sub-module is calculated.
The updated mean direction RMS t is obtained .
A method for performing an RMSprop gradient descent algorithm according to claim 12, wherein said utilizing a mean direction RMS t-1 , a gradient vector
And the average direction quantity update rate δ is updated after the average direction quantity RMS t , and further includes:

The controller unit reads a data transfer instruction from the instruction cache unit, and transfers the updated mean direction RMS t from the data processing unit to the data cache unit according to the translated microinstruction.
The method for performing an RMSprop gradient descent algorithm according to claim 10, wherein the dividing the gradient vector by the square root of the mean direction amount and multiplying the global update step size α to obtain a corresponding gradient descent amount is updated. The update vector θ t-1 is θ t , which is according to the formula
Realized, including:

The controller unit reads a parameter vector update instruction from the instruction cache unit, and performs an update operation of the parameter vector according to the translated micro instruction; in the update operation, the parameter vector update instruction is sent to the operation control sub-module, and the operation controller The module control related operation module performs the following operations: sending the operation instruction 4 to the basic operation unit sub-module, driving the basic operation unit sub-module to calculate -α, the iteration number t plus 1; and transmitting the operation instruction 5 to the vector square root parallel operation sub-module, driving Vector square root parallel operation sub-module is calculated
Send operation instruction 6 to vector multiplication parallel operation sub-module, drive vector multiplication parallel operation sub-module
After the two operations are completed, the operation instruction 7 is sent to the vector division parallel operation sub-module, and the drive vector division parallel operation sub-module is calculated.
Then, send the operation instruction 8 to the vector addition parallel operation sub-module, and drive the vector addition to run the sub-module calculation
Obtaining θ t ; where θ t-1 is the value before θ 0 is not updated at the tth cycle, and t t is updating θ t-1 to θ t ; the operation control submodule sends the operation instruction 9 to vector division Parallel operation sub-module, driving vector division parallel operation sub-module operation to obtain vector
The operation control sub-module respectively sends an operation instruction 10, an operation instruction 11 vector addition parallel operation sub-module and a basic operation sub-module, and calculates sum=∑ i temp i and temp2=sum/n.
The method for performing an RMSprop gradient descent algorithm according to claim 14, wherein after the updating the vector to be updated θ t-1 is θ t , the method further comprises:

The controller unit reads a DATABACK_IO instruction from the instruction cache unit, and transmits the updated parameter vector θ t from the data processing unit to the external designated space through the direct memory access unit according to the translated micro instruction.
The method for performing an RMSprop gradient descent algorithm according to claim 10, wherein the step of repeating the process until the vector to be updated converges comprises determining whether the vector to be updated converges, and the specific determining process is as follows:

The controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated micro instruction, the data processing module determines whether the updated parameter vector converges, and if temp2<ct, converges, and the operation ends.