WO2017185257A1 - Device and method for performing Adam gradient descent training algorithm


Info

Publication number
WO2017185257A1
WO2017185257A1 (PCT/CN2016/080357)
Authority
WO
WIPO (PCT)
Prior art keywords
vector
instruction
module
moment
sub
Application number
PCT/CN2016/080357
Other languages
French (fr)
Chinese (zh)
Inventor
郭崎 (Guo Qi)
刘少礼 (Liu Shaoli)
陈天石 (Chen Tianshi)
陈云霁 (Chen Yunji)
Original Assignee
北京中科寒武纪科技有限公司 (Beijing Zhongke Cambricon Technology Co., Ltd.)
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/080357
Publication of WO2017185257A1

Classifications

    • G06N 3/063: Computing arrangements based on biological models; neural networks; physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods

Definitions

  • The present invention relates to the field of Adam algorithm applications, and in particular to an apparatus and method for performing the Adam gradient descent training algorithm; it concerns a hardware implementation of the Adam gradient descent optimization algorithm.
  • The gradient descent optimization algorithm is widely used in fields such as function approximation, optimization, pattern recognition and image processing.
  • The Adam algorithm is one of the gradient descent optimization algorithms. Because it is easy to implement, computationally cheap, modest in its storage requirements, and invariant under symmetric transformations of the gradient, it is widely used, and implementing it on a dedicated device can significantly increase its execution speed.
  • One known method of performing the Adam gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general-purpose instructions through a general-purpose register file and general-purpose functional units.
  • One disadvantage of this approach is that the arithmetic performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck.
  • In addition, a general-purpose processor must decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, so front-end decoding incurs a large power overhead.
  • Another known method of performing the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general single-instruction-multiple-data (SIMD) instructions through a general-purpose register file and general-purpose stream processing units.
  • Since the GPU is a device dedicated to graphics operations and scientific computing, it has no special support for the operations of the Adam gradient descent algorithm; a large amount of front-end decoding work is still required to execute them, which introduces substantial overhead.
  • Moreover, the GPU has only a small on-chip cache, so the data required by the computation (such as the first-order and second-order moment vectors) must be transferred repeatedly from off-chip; off-chip bandwidth becomes the main performance bottleneck and brings a large power cost.
  • The main object of the present invention is to provide an apparatus and method for performing the Adam gradient descent training algorithm, so as to solve the problems of insufficient general-purpose processor performance and high front-end decoding overhead, while avoiding repeated reads of data from memory and reducing the memory-access bandwidth required.
  • The present invention provides an apparatus for performing the Adam gradient descent training algorithm, the apparatus comprising a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
  • the direct memory access unit 1 is configured to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;
  • the instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache them;
  • the controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5;
  • the data cache unit 4 is configured to cache the first-order and second-order moment vectors during initialization and each data update;
  • the data processing module 5 is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated to the external designated space through the direct memory access unit 1.
  • The direct memory access unit 1 writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly back to the external designated space.
  • The controller unit 3 decodes each instruction it reads into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5: it controls the direct memory access unit 1 to read data from and write data to externally designated addresses, controls the data cache unit 4 to obtain the instructions required for the operation from the externally designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update of the parameters to be updated, and controls the data transfer between the data cache unit 4 and the data processing module 5.
  • The data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t at initialization time. During each data update, it reads out the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4.
  • Throughout the operation of the apparatus, the data cache unit 4 always keeps a copy of the first-order moment vector m_t and the second-order moment vector v_t.
  • The data processing module 5 reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit 4 and reads the vector to be updated θ_{t-1} from the external designated space through the direct memory access unit 1; it updates θ_{t-1} to θ_t, writes m_t and v_t into the data cache unit 4, and writes θ_t to the external designated space through the direct memory access unit 1.
  • The data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t \odot g_t$, where $g_t$ denotes the gradient vector read in for the current iteration and $\odot$ denotes element-wise multiplication; it computes the moment estimate vectors $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$ from m_t and v_t; and it updates the vector to be updated θ_{t-1} to θ_t according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$.
  • The data processing module 5 includes an operation control sub-module 51, a vector-addition parallel operation sub-module 52, a vector-multiplication parallel operation sub-module 53, a vector-division parallel operation sub-module 54, a vector-square-root parallel operation sub-module 55, and a basic operation sub-module 56. The sub-modules 52 to 56 are connected in parallel with one another, and the operation control sub-module 51 is connected in series with each of them.
  • When the apparatus operates on vectors, all vector operations are element-wise, and when a given operation is applied to a vector, the elements at different positions are processed in parallel.
  • The present invention also provides a method for performing the Adam gradient descent training algorithm, the method comprising:
  • initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, which includes:
  • In step S1, an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent computation.
  • In step S2, the operation starts: the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent computation from the external address space and cache them in the instruction cache unit 2.
  • In step S3, the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module 5.
  • In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, initializes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing module 5 to 1.
  • In step S5, the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space and send them to the data processing module 5.
  • In step S6, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 to the data processing module 5.
  • The moment-vector update is implemented as follows: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}. The moment vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions: it sends an INS_1 instruction to the basic operation sub-module 56, driving it to compute (1-β1) and (1-β2); it sends an INS_2 instruction to the vector-multiplication parallel operation sub-module 53, driving it to compute the element-wise square g_t⊙g_t; it then sends an INS_3 instruction to the vector-multiplication parallel operation sub-module 53, driving it to compute β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1} and (1-β2)·(g_t⊙g_t) simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively; finally, a_1 and a_2, and b_1 and b_2, are fed as the two pairs of inputs to the vector-addition parallel operation sub-module 52, yielding the updated first-order moment vector m_t = a_1 + a_2 and second-order moment vector v_t = b_1 + b_2.
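  • Read as a dataflow, the four partial products above are mutually independent, which is what lets the vector-multiplication sub-module compute them simultaneously before the two additions. A numpy sketch of this decomposition (variable names are illustrative, and the pairing of the partial products follows the reading given above):

```python
import numpy as np

def update_moments(m_prev, v_prev, g, beta1, beta2):
    """Moment-vector update decomposed as in the INS_1..INS_3 sequence (sketch)."""
    g2 = g * g                # INS_2: element-wise square g_t ⊙ g_t
    a1 = beta1 * m_prev       # INS_3: four independent products, parallel in hardware
    a2 = (1 - beta1) * g
    b1 = beta2 * v_prev
    b2 = (1 - beta2) * g2
    m_t = a1 + a2             # vector-addition sub-module: m_t = a_1 + a_2
    v_t = b1 + b2             # and v_t = b_1 + b_2
    return m_t, v_t
```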
  • The controller unit 3 then reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated first-order moment vector m_t and second-order moment vector v_t from the data processing module 5 to the data cache unit 4.
  • The biased moment estimate vectors m̂_t and v̂_t are obtained from the moment vectors according to the formulas $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$. The implementation is as follows: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to compute the moment estimate vectors. The operation control sub-module 51 sends the instruction INS_4 to the basic operation sub-module 56, driving it to compute the scalars 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; the operation control sub-module 51 then sends the instruction INS_5 to the vector-multiplication parallel operation sub-module 53, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1-β1^t) and of the second-order moment vector v_t with 1/(1-β2^t), yielding the biased estimate vectors m̂_t and v̂_t.
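  • Because the correction factors 1/(1-β1^t) and 1/(1-β2^t) depend only on the iteration count, they are scalars that the basic operation sub-module can produce once per iteration and broadcast into the vector multiplier. A sketch under that assumption (names are illustrative):

```python
def bias_correct(m_t, v_t, t, beta1, beta2):
    """Biased moment estimates, as in the INS_4/INS_5 sequence (sketch)."""
    c1 = 1.0 / (1.0 - beta1 ** t)   # INS_4: scalar factors from the basic sub-module
    c2 = 1.0 / (1.0 - beta2 ** t)
    return c1 * m_t, c2 * v_t       # INS_5: parallel element-wise products -> m̂_t, v̂_t
```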
  • The vector to be updated θ_{t-1} is updated to θ_t according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$. The implementation is as follows: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to perform the following operations: it sends the instruction INS_6 to the basic operation sub-module 56, driving it to compute -α; it sends the instruction INS_7 to the vector-square-root parallel operation sub-module 55, driving it to compute √(v̂_t); it sends the instruction INS_7 to the vector-division parallel operation sub-module 54, driving it to compute m̂_t/√(v̂_t); it sends the instruction INS_8 to the vector-multiplication parallel operation sub-module 53, driving it to compute -α·m̂_t/√(v̂_t); and it sends the instruction INS_9 to the vector-addition parallel operation sub-module 52, driving it to compute θ_t = θ_{t-1} + (-α·m̂_t/√(v̂_t)).
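  • The five instructions form a short chain through the sub-modules, each consuming the previous result. A sketch of the same chain in numpy (names are illustrative):

```python
import numpy as np

def update_parameters(theta_prev, m_hat, v_hat, alpha):
    """Parameter-vector update as the INS_6..INS_9 chain (sketch)."""
    neg_alpha = -alpha           # INS_6: basic operation sub-module 56
    root = np.sqrt(v_hat)        # INS_7: vector-square-root sub-module 55
    ratio = m_hat / root         # INS_7: vector-division sub-module 54
    step = neg_alpha * ratio     # INS_8: vector-multiplication sub-module 53
    return theta_prev + step     # INS_9: vector-addition sub-module 52 -> θ_t
```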
  • Afterwards, the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.
  • The step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged. The determination proceeds as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
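  • The text defines the stopping rule only through the scalar temp2 and the threshold ct; the element-wise formula behind the vector temp is given in a figure omitted from this text. The sketch below therefore assumes temp measures the element-wise relative change of the parameters, which matches the sum-then-average structure described (n is taken as the vector length, and nonzero parameter values are assumed):

```python
import numpy as np

def has_converged(theta_new, theta_old, ct):
    """Convergence judgment (sketch; the exact formula for temp is an assumption)."""
    temp = np.abs(theta_new - theta_old) / np.abs(theta_new)  # assumed form of temp
    temp2 = temp.sum() / temp.size    # sum = Σ_i temp_i ; temp2 = sum / n
    return temp2 < ct                 # converged when temp2 < ct
```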
  • By employing a device dedicated to executing the Adam gradient descent training algorithm, the apparatus and method provided by the present invention solve the problems of insufficient general-purpose processor performance and high front-end decoding overhead, and accelerate the execution of related applications.
  • Because the data cache unit temporarily stores the moment vectors needed by the intermediate steps, the apparatus and method avoid repeatedly reading the data from memory, reduce the IO operations between the device and the external address space, and lower the memory-access bandwidth required.
  • Because the data processing module uses dedicated parallel operation sub-modules for the vector operations, the degree of parallelism is high, so the device can run at a lower clock frequency, which keeps the power overhead small.
  • FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present invention.
  • FIG. 2 shows an example block diagram of a data processing module in an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present invention.
  • FIG. 3 shows a flow chart of a method for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present invention.
  • First, the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α are initialized, and the vector to be updated θ_0 is obtained from the external designated space.
  • During each gradient descent step, the externally supplied gradient vector g_t and the exponential decay rates are used to update the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}; the biased moment estimate vectors m̂_t and v̂_t are then obtained from the moment vectors; finally, the vector to be updated θ_{t-1} is updated to θ_t and output, where θ_{t-1} denotes the value of the parameter vector before the t-th iteration and the t-th iteration updates θ_{t-1} to θ_t. This process is repeated until the vector to be updated converges, as the sketch after this list illustrates.
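  • Tying the above together, a minimal end-to-end sketch of the training loop of FIG. 3 in Python (grad_fn stands in for the externally supplied gradient; the convergence measure is the same assumed form as above and requires nonzero parameter values):

```python
import numpy as np

def adam_train(theta, grad_fn, alpha, beta1, beta2, ct):
    """Full Adam training loop following FIG. 3 (illustrative sketch)."""
    m = np.zeros_like(theta)                      # m_0
    v = np.zeros_like(theta)                      # v_0
    t = 1
    while True:
        g = grad_fn(theta)                        # externally supplied gradient g_t
        m = beta1 * m + (1 - beta1) * g           # update moment vectors
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)              # biased moment estimates
        v_hat = v / (1 - beta2 ** t)
        theta_new = theta - alpha * m_hat / np.sqrt(v_hat)
        temp2 = np.mean(np.abs(theta_new - theta) / np.abs(theta_new))
        theta = theta_new
        t += 1
        if temp2 < ct:                            # convergence judgment
            return theta
```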
  • The apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, all of which can be implemented as hardware circuits.
  • The direct memory access unit 1 is configured to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data. Specifically, it writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly back to the external designated space.
  • The instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache them.
  • The controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5. Each micro-instruction is sent to the direct memory access unit 1, the data cache unit 4, or the data processing module 5; the controller unit controls the direct memory access unit 1 to read data from and write data to externally designated addresses, controls the data cache unit 4 to obtain the instructions required for the operation from the externally designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update of the parameters to be updated, and controls the data transfer between the data cache unit 4 and the data processing module 5.
  • The data cache unit 4 is configured to cache the first-order and second-order moment vectors during initialization and each data update. Specifically, the data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t at initialization; during each data update it reads out m_{t-1} and v_{t-1} and sends them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4. Throughout the operation of the apparatus, the data cache unit 4 always keeps a copy of the first-order moment vector m_t and the second-order moment vector v_t. In the present invention, because the data cache unit temporarily stores the moment vectors needed by the intermediate steps, repeated reads of the data from memory are avoided, the IO operations between the device and the external address space are reduced, and the memory-access bandwidth required is lowered.
  • The data processing module 5 is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated to the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit 4 and reads, through the direct memory access unit 1, the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t by $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t \odot g_t$; computes the moment estimate vectors $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$; and finally updates the vector to be updated θ_{t-1} to θ_t by $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$. The vectors m_t and v_t are written into the data cache unit 4, and θ_t is written to the external designated space through the direct memory access unit 1.
  • The data processing module 5 includes an operation control sub-module 51, a vector-addition parallel operation sub-module 52, a vector-multiplication parallel operation sub-module 53, a vector-division parallel operation sub-module 54, a vector-square-root parallel operation sub-module 55, and a basic operation sub-module 56, where the sub-modules 52 to 56 are connected in parallel with one another and the operation control sub-module 51 is connected in series with each of them. All vector operations are element-wise, and when a given operation is applied to a vector, the elements at different positions are processed in parallel.
  • FIG. 3 shows the flow chart of a method for performing the Adam gradient descent training algorithm according to an embodiment of the present invention; the method specifically includes the following steps:
  • In step S1, an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent computation.
  • In step S2, the operation starts: the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent computation from the external address space and cache them in the instruction cache unit 2.
  • In step S3, the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module 5.
  • In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, initializes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing module 5 to 1.
  • In step S5, the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space and send them to the data processing module 5.
  • In step S6, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 to the data processing module 5.
  • In step S7, the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}. The moment vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1-β1) and (1-β2); it sends operation instruction 2 (INS_2) to the vector-multiplication parallel operation sub-module 53, driving it to compute the element-wise square g_t⊙g_t; it then sends operation instruction 3 (INS_3) to the vector-multiplication parallel operation sub-module 53, driving it to compute β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1} and (1-β2)·(g_t⊙g_t) simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively; finally, a_1 and a_2, and b_1 and b_2, are fed as the two pairs of inputs to the vector-addition parallel operation sub-module 52, yielding the updated first-order moment vector m_t = a_1 + a_2 and second-order moment vector v_t = b_1 + b_2.
  • In step S8, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated first-order moment vector m_t and second-order moment vector v_t from the data processing module 5 into the data cache unit 4.
  • In step S9, the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to compute the moment estimate vectors. The operation control sub-module 51 sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute the scalars 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; it then sends operation instruction 5 (INS_5) to the vector-multiplication parallel operation sub-module 53, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1-β1^t) and of the second-order moment vector v_t with 1/(1-β2^t), yielding the biased estimate vectors m̂_t and v̂_t.
  • In step S10, the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to perform the following operations: it sends operation instruction 6 (INS_6) to the basic operation sub-module 56, driving it to compute -α; it sends operation instruction 7 (INS_7) to the vector-square-root parallel operation sub-module 55, driving it to compute √(v̂_t); it sends operation instruction 7 (INS_7) to the vector-division parallel operation sub-module 54, driving it to compute m̂_t/√(v̂_t); it sends operation instruction 8 (INS_8) to the vector-multiplication parallel operation sub-module 53, driving it to compute -α·m̂_t/√(v̂_t); and it sends operation instruction 9 (INS_9) to the vector-addition parallel operation sub-module 52, driving it to compute θ_t = θ_{t-1} + (-α·m̂_t/√(v̂_t)), obtaining the updated parameter vector θ_t, where θ_{t-1} is the value of the parameter vector before the t-th iteration and the t-th iteration updates θ_{t-1} to θ_t. The operation control sub-module 51 then sends operation instruction 10 (INS_10) to the vector-division parallel operation sub-module 54, driving it to compute a vector temp (its element-wise formula is given in a figure omitted from this text), and sends operation instructions 11 and 12 (INS_11, INS_12) to the vector-addition parallel operation sub-module 52 and the basic operation sub-module 56, which compute sum = Σ_i temp_i and temp2 = sum/n.
  • In step S11, the controller unit 3 reads a write-back instruction (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.
  • In step S12, the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, the data processing module 5 determines whether the updated parameter vector has converged: if temp2 < ct, it has converged and the operation ends; otherwise, the flow returns to step S5 to continue execution.
  • The invention solves the problems of insufficient general-purpose processor performance and high front-end decoding overhead, and accelerates the execution of related applications.
  • The use of the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth required.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A device and method for performing the Adam gradient descent training algorithm, the device comprising a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a data cache unit (4) and a data processing module (5). The method comprises: first reading a gradient vector and the vector of values to be updated, and initializing the first-order moment vector, the second-order moment vector and the corresponding exponential decay rates; at each iteration, updating the first-order and second-order moment vectors using the gradient vector, and computing the first-order and second-order biased estimate vectors; updating the parameters to be updated using the first-order and second-order biased estimate vectors; and continuing training until the vector of parameters to be updated converges. The present invention enables the application of the Adam gradient descent algorithm and greatly improves data processing efficiency.

Description

Apparatus and method for performing Adam gradient descent training algorithm

Technical field

The present invention relates to the field of Adam algorithm applications, and in particular to an apparatus and method for performing the Adam gradient descent training algorithm; it concerns a hardware implementation of the Adam gradient descent optimization algorithm.

Background

The gradient descent optimization algorithm is widely used in fields such as function approximation, optimization, pattern recognition and image processing. The Adam algorithm is one of the gradient descent optimization algorithms; because it is easy to implement, computationally cheap, modest in its storage requirements, and invariant under symmetric transformations of the gradient, it is widely used, and implementing it on a dedicated device can significantly increase its execution speed.

Currently, one known method of performing the Adam gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general-purpose instructions through a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that the arithmetic performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, so front-end decoding incurs a large power overhead.

Another known method of performing the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general single-instruction-multiple-data (SIMD) instructions through a general-purpose register file and general-purpose stream processing units. Since the GPU is a device dedicated to graphics operations and scientific computing, it has no special support for the operations of the Adam gradient descent algorithm; a large amount of front-end decoding work is still required to execute them, which introduces substantial overhead. Moreover, the GPU has only a small on-chip cache, so the data required by the computation (such as the first-order and second-order moment vectors) must be transferred repeatedly from off-chip; off-chip bandwidth becomes the main performance bottleneck and brings a large power cost.

Summary of the invention

In view of this, the main object of the present invention is to provide an apparatus and method for performing the Adam gradient descent training algorithm, so as to solve the problems of insufficient general-purpose processor performance and high front-end decoding overhead, while avoiding repeated reads of data from memory and reducing the memory-access bandwidth required.

To achieve the above object, the present invention provides an apparatus for performing the Adam gradient descent training algorithm, the apparatus comprising a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
the direct memory access unit 1 is configured to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;

the instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache them;

the controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5;

the data cache unit 4 is configured to cache the first-order and second-order moment vectors during initialization and each data update;

the data processing module 5 is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated to the external designated space through the direct memory access unit 1.
In the above solution, the direct memory access unit 1 writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly back to the external designated space.

In the above solution, the controller unit 3 decodes each instruction it reads into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, so as to control the direct memory access unit 1 to read data from and write data to externally designated addresses, control the data cache unit 4 to obtain the instructions required for the operation from the externally designated address through the direct memory access unit 1, control the data processing module 5 to perform the update of the parameters to be updated, and control the data transfer between the data cache unit 4 and the data processing module 5.

In the above solution, the data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t at initialization time; during each data update it reads out the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4.

In the above solution, throughout the operation of the apparatus, the data cache unit 4 always keeps a copy of the first-order moment vector m_t and the second-order moment vector v_t.
In the above solution, the data processing module 5 reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t and computes the moment estimate vectors m̂_t, v̂_t from m_t and v_t; finally, it updates the vector to be updated θ_{t-1} to θ_t, writes m_t and v_t into the data cache unit 4, and writes θ_t to the external designated space through the direct memory access unit 1.
In the above solution, the data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t \odot g_t$, where $g_t$ denotes the gradient vector and $\odot$ denotes element-wise multiplication; it computes the moment estimate vectors m̂_t, v̂_t from m_t and v_t according to the formulas $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$; and it updates the vector to be updated θ_{t-1} to θ_t according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$.
In the above solution, the data processing module 5 includes an operation control sub-module 51, a vector-addition parallel operation sub-module 52, a vector-multiplication parallel operation sub-module 53, a vector-division parallel operation sub-module 54, a vector-square-root parallel operation sub-module 55, and a basic operation sub-module 56, where the sub-modules 52 to 56 are connected in parallel with one another and the operation control sub-module 51 is connected in series with each of them. When the apparatus operates on vectors, all vector operations are element-wise, and when a given operation is applied to a vector, the elements at different positions are processed in parallel.
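The serial connection of the operation control sub-module to each parallel sub-module amounts to a dispatch structure: every micro-instruction selects one sub-module and feeds it operands. The following Python sketch models that structure behaviorally (the operation names and the dispatch table are illustrative, not from the patent; in hardware, every element-wise vector operation runs its lanes in parallel):

```python
import numpy as np

SUBMODULES = {
    "vadd":  lambda x, y: x + y,   # vector-addition parallel sub-module 52
    "vmul":  lambda x, y: x * y,   # vector-multiplication parallel sub-module 53
    "vdiv":  lambda x, y: x / y,   # vector-division parallel sub-module 54
    "vsqrt": np.sqrt,              # vector-square-root parallel sub-module 55
}

def operation_control(op, *operands):
    """Models operation control sub-module 51: route one micro-instruction to
    the sub-module it is wired to (scalar terms such as 1-β1 or -α would come
    from the basic operation sub-module 56)."""
    return SUBMODULES[op](*operands)

# Example: m_t = β1·m_{t-1} + (1-β1)·g_t expressed as dispatched operations:
# m_t = operation_control("vadd",
#           operation_control("vmul", beta1, m_prev),
#           operation_control("vmul", 1 - beta1, g))
```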
To achieve the above object, the present invention also provides a method for performing the Adam gradient descent training algorithm, the method comprising:

initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;
when performing a gradient descent operation, first using the externally supplied gradient value g_t and the exponential decay rates to update the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}, then obtaining the biased moment estimate vectors m̂_t and v̂_t by moment vector operations, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
In the above solution, initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, includes:
In step S1, an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent computation.

In step S2, the operation starts: the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent computation from the external address space and cache them in the instruction cache unit 2;

In step S3, the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module 5;

In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, initializes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing module 5 to 1;
In step S5, the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space and send them to the data processing module 5;
In step S6, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 to the data processing module 5; the sketch below gives a compact reading of this instruction-driven flow.
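Steps S1 to S6, together with the steps that follow, can be read as a small instruction program that the controller unit decodes one entry at a time. A behavioral sketch of that reading (the mnemonics for the assignment, transfer, update and convergence instructions are not named in the text, so those identifiers are assumptions):

```python
PROGRAM = [
    "INSTRUCTION_IO",     # S1/S2: fetch all Adam-related instructions
    "HYPERPARAMETER_IO",  # S3: read α, β1, β2 and the threshold ct
    "ASSIGN",             # S4: initialize m, v and set t = 1 (mnemonic assumed)
    "DATA_IO",            # S5: read θ_{t-1} and the gradient g_t
    "TRANSFER_IN",        # S6: move m_{t-1}, v_{t-1} to the data processing module
    "MOMENT_UPDATE",      # S7: compute m_t, v_t (mnemonic assumed)
    "TRANSFER_OUT",       # S8: write m_t, v_t back to the data cache unit
    "MOMENT_ESTIMATE",    # S9: compute m̂_t, v̂_t (mnemonic assumed)
    "PARAM_UPDATE",       # S10: compute θ_t and temp2 (mnemonic assumed)
    "DATABACK_IO",        # S11: write θ_t to the external designated space
    "CONVERGENCE",        # S12: stop if temp2 < ct, otherwise loop back to S5
]

def run(handlers):
    """Decode-and-dispatch loop of the controller unit (behavioral sketch):
    each instruction is decoded into micro-instructions that drive a unit."""
    for instruction in PROGRAM:
        handlers[instruction]()
```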
In the above solution, updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates is implemented according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t \odot g_t$, and specifically includes: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} held in the data cache unit 4. The moment vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions: it sends the INS_1 instruction to the basic operation sub-module 56, driving it to compute (1-β1) and (1-β2); it sends the INS_2 instruction to the vector-multiplication parallel operation sub-module 53, driving it to compute the element-wise square g_t⊙g_t; it then sends the INS_3 instruction to the vector-multiplication parallel operation sub-module 53, driving it to compute β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1} and (1-β2)·(g_t⊙g_t) simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively; finally, a_1 and a_2, and b_1 and b_2, are fed as the two pairs of inputs to the vector-addition parallel operation sub-module 52, yielding the updated first-order moment vector m_t = a_1 + a_2 and second-order moment vector v_t = b_1 + b_2.
In the above solution, after the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} have been updated using the externally supplied gradient value and the exponential decay rates, the method further includes: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated first-order moment vector m_t and second-order moment vector v_t from the data processing module 5 to the data cache unit 4.
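This transfer pattern is the heart of the bandwidth saving: the moment vectors shuttle only between the data cache unit and the data processing module, never across the external memory interface. A minimal behavioral model (the class and method names are illustrative, not from the patent):

```python
import numpy as np

class DataCacheUnit:
    """Model of data cache unit 4: the moment vectors stay on-chip for the
    whole run, so each iteration only θ and the gradient cross the external
    memory interface (illustrative sketch)."""
    def __init__(self, n):
        self.m = np.zeros(n)   # first-order moment vector, initialized in S4
        self.v = np.zeros(n)   # second-order moment vector

    def read_moments(self):             # S6: send m_{t-1}, v_{t-1} to module 5
        return self.m, self.v

    def write_moments(self, m_t, v_t):  # S8: store the updated copies
        self.m, self.v = m_t, v_t
```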
In the above solution, obtaining the biased moment estimate vectors m̂_t and v̂_t by moment vector operations is implemented according to the formulas $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$, and specifically includes: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to compute the moment estimate vectors. The operation control sub-module 51 sends the instruction INS_4 to the basic operation sub-module 56, driving it to compute the scalars 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; the operation control sub-module 51 then sends the instruction INS_5 to the vector-multiplication parallel operation sub-module 53, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1-β1^t) and of the second-order moment vector v_t with 1/(1-β2^t), yielding the biased estimate vectors m̂_t and v̂_t.
In the above solution, updating the vector to be updated θ_{t-1} to θ_t is implemented according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$, and specifically includes: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to perform the following operations: it sends the instruction INS_6 to the basic operation sub-module 56, driving it to compute -α; it sends the instruction INS_7 to the vector-square-root parallel operation sub-module 55, driving it to compute √(v̂_t); it sends the instruction INS_7 to the vector-division parallel operation sub-module 54, driving it to compute m̂_t/√(v̂_t); it sends the instruction INS_8 to the vector-multiplication parallel operation sub-module 53, driving it to compute -α·m̂_t/√(v̂_t); and it sends the instruction INS_9 to the vector-addition parallel operation sub-module 52, driving it to compute θ_t = θ_{t-1} + (-α·m̂_t/√(v̂_t)), obtaining the updated parameter vector θ_t, where θ_{t-1} is the value of the parameter vector before the t-th iteration and the t-th iteration updates θ_{t-1} to θ_t. The operation control sub-module 51 then sends the instruction INS_10 to the vector-division parallel operation sub-module 54, driving it to compute a vector temp (its element-wise formula is given in a figure omitted from this text), and sends the instructions INS_11 and INS_12 to the vector-addition parallel operation sub-module 52 and the basic operation sub-module 56, which compute sum = Σ_i temp_i and temp2 = sum/n.
In the above solution, after the vector to be updated θ_{t-1} has been updated to θ_t, the method further includes: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.

In the above solution, the step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged. The determination proceeds as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
It can be seen from the above technical solutions that the present invention has the following beneficial effects:

1. By employing a device dedicated to executing the Adam gradient descent training algorithm, the apparatus and method provided by the present invention solve the problems of insufficient general-purpose processor performance and high front-end decoding overhead, and accelerate the execution of related applications.

2. Because the data cache unit temporarily stores the moment vectors needed by the intermediate steps, the apparatus and method avoid repeatedly reading the data from memory, reduce the IO operations between the device and the external address space, and lower the memory-access bandwidth required.

3. Because the data processing module uses dedicated parallel operation sub-modules for the vector operations, the degree of parallelism is greatly improved.

4. Because the vector operations are highly parallel, the device can run at a lower clock frequency, which keeps the power overhead small.
BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:

Fig. 1 shows an example block diagram of the overall structure of an apparatus for performing the Adam gradient descent training algorithm according to an embodiment of the present invention.

Fig. 2 shows an example block diagram of the data processing module in an apparatus for performing the Adam gradient descent training algorithm according to an embodiment of the present invention.

Fig. 3 shows a flow chart of a method for performing the Adam gradient descent training algorithm according to an embodiment of the present invention.

Throughout the drawings, the same devices, components, units, and the like are denoted by the same reference numerals.

DETAILED DESCRIPTION

Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the present invention taken in conjunction with the accompanying drawings.

In the present invention, the terms "include" and "comprise" and their derivatives are intended to be inclusive rather than limiting; the term "or" is inclusive, meaning and/or.

In this specification, the various embodiments described below for explaining the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist in a comprehensive understanding of the exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes numerous specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Furthermore, the same reference numerals are used throughout the drawings for similar functions and operations.
The apparatus and method for performing the Adam gradient descent training algorithm according to embodiments of the present invention serve to accelerate applications of the Adam gradient descent algorithm. First, the first moment vector m_0, the second moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α are initialized, and the vector to be updated θ_0 is obtained from the external designated space. Each time a gradient descent operation is performed, the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates are first used to update the first moment vector m_{t-1} and the second moment vector v_{t-1}, that is,

m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}), v_t = β2·v_{t-1} + (1 - β2)·(∇f(θ_{t-1}))^2;

then the biased moment estimate vectors m̂_t and v̂_t are obtained by moment vector operations, that is,

m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t);

finally, the vector to be updated θ_{t-1} is updated to θ_t and output, that is,

θ_t = θ_{t-1} - α·m̂_t/√v̂_t,

where θ_{t-1} is the value of the parameter vector (initially θ_0) before the t-th update, and the t-th cycle updates θ_{t-1} to θ_t. This process is repeated until the vector to be updated converges.
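For orientation only, one such iteration can be sketched in a few lines of NumPy. This is a software mimic under two assumptions: the gradient is supplied externally, and, exactly as in the formulas above, no epsilon term is added to √v̂_t in the denominator; the default hyper-parameter values are illustrative and are not taken from the patent.

import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999):
    # One Adam iteration as described above; grad is the externally
    # supplied gradient for theta, and t is the iteration count (t >= 1).
    m = beta1 * m + (1 - beta1) * grad                # first moment vector m_t
    v = beta2 * v + (1 - beta2) * grad * grad         # second moment vector v_t
    m_hat = m / (1 - beta1 ** t)                      # biased moment estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / np.sqrt(v_hat)    # no epsilon, per the text
    return theta, m, v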
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the Adam gradient descent algorithm according to an embodiment of the present invention. As shown in Fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, all of which can be implemented by hardware circuits.
The direct memory access unit 1 is configured to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data. Specifically, it writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space to the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly to the external designated space.
The instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache the read instructions.
The controller unit 3 is configured to read instructions from the instruction cache unit 2, decode each read instruction into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, and send the micro-instructions to the respective unit or module, so as to control the direct memory access unit 1 to read data from and write data to externally designated addresses, control the data cache unit 4 to acquire, through the direct memory access unit 1, the instructions required for the operation from the externally designated address, control the data processing module 5 to perform the update operation of the parameters to be updated, and control the data transfer between the data cache unit 4 and the data processing module 5.
The data cache unit 4 is configured to cache the first moment vector and the second moment vector during initialization and each data update. Specifically, at initialization the data cache unit 4 initializes the first moment vector m_t and the second moment vector v_t; during each data update, the data cache unit 4 reads out the first moment vector m_{t-1} and the second moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to the first moment vector m_t and the second moment vector v_t and then written back to the data cache unit 4. During operation of the apparatus, the data cache unit 4 always holds a copy of the first moment vector m_t and the second moment vector v_t. In the present invention, because the data cache unit temporarily stores the moment vectors required by the intermediate process, data need not be repeatedly read from memory, which reduces the IO operations between the apparatus and the external address space and lowers the memory access bandwidth required.
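Purely to illustrate this double-buffering behaviour, a toy software model of the data cache unit could look as follows; the class and method names are invented for the sketch and do not appear in the patent.

import numpy as np

class DataCacheUnit:
    # Toy model of data cache unit 4: it holds the on-chip copies of the
    # first and second moment vectors between iterations.
    def __init__(self, n):
        self.m = np.zeros(n)   # first moment vector, set up at initialization
        self.v = np.zeros(n)   # second moment vector

    def read_moments(self):
        # Read out and sent to the data processing module at each update.
        return self.m, self.v

    def write_moments(self, m, v):
        # Updated m_t, v_t written back after the update.
        self.m, self.v = m, v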
The data processing module 5 is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1. It then updates the moment vectors m_{t-1} and v_{t-1} to m_t and v_t, that is,

m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}), v_t = β2·v_{t-1} + (1 - β2)·(∇f(θ_{t-1}))^2,

computes the moment estimate vectors m̂_t and v̂_t from m_t and v_t, that is,

m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t),

and finally updates the vector to be updated θ_{t-1} to θ_t, that is,

θ_t = θ_{t-1} - α·m̂_t/√v̂_t.

It writes m_t and v_t into the data cache unit 4 and writes θ_t into the external designated space through the direct memory access unit 1. In the present invention, because the data processing module uses the associated parallel operation sub-modules to perform vector operations, the degree of parallelism is greatly improved, so the operating frequency can be kept low, which in turn keeps the power consumption overhead small.
Fig. 2 shows an example block diagram of the data processing module in an apparatus for implementing Adam gradient descent algorithm-related applications according to an embodiment of the present invention. As shown in Fig. 2, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic operation sub-module 56. The vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55, and the basic operation sub-module 56 are connected in parallel with one another, and the operation control sub-module 51 is connected in series with each of them. When the apparatus operates on vectors, all vector operations are element-wise: when an operation is performed on a vector, the elements at different positions are processed in parallel.
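To make the element-wise convention concrete, the following NumPy lines mirror the way each parallel sub-module applies a single operation across all element positions of a vector at once. The mapping to NumPy ufuncs is an analogy for illustration, not a description of the hardware.

import numpy as np

vec_add  = np.add       # vector addition parallel operation sub-module 52
vec_mul  = np.multiply  # vector multiplication parallel operation sub-module 53
vec_div  = np.divide    # vector division parallel operation sub-module 54
vec_sqrt = np.sqrt      # vector square root parallel operation sub-module 55

a = np.array([1.0, 4.0, 9.0])
b = np.array([2.0, 2.0, 2.0])
print(vec_add(a, b), vec_mul(a, b), vec_div(a, b), vec_sqrt(a))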
Fig. 3 shows a flow chart of a method for performing the Adam gradient descent training algorithm according to an embodiment of the present invention, which specifically includes the following steps:
Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space.

Step S2: the operation begins; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the translated micro-instruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space and caches these instructions into the instruction cache unit 2.

Step S3: the controller unit 3 reads a hyper-parameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the translated micro-instruction, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module 5.

Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the translated micro-instruction, drives the initialization of the first moment vector m_{t-1} and the second moment vector v_{t-1} in the data cache unit 4, and sets the iteration count t in the data processing unit 5 to 1.
Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the translated micro-instruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space and send them to the data processing module 5.
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the translated micro-instruction, transfers the first moment vector m_{t-1} and the second moment vector v_{t-1} in the data cache unit 4 to the data processing unit 5.
Step S7: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the translated micro-instruction, drives the data cache unit 4 to perform the update operation of the first moment vector m_{t-1} and the second moment vector v_{t-1}. In this update operation, the moment vector update instruction is sent to the operation control sub-module 51, which sends the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1 - β1) and (1 - β2); it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute (∇f(θ_{t-1}))^2; it then sends operation instruction 3 (INS_3) to the vector multiplication parallel operation sub-module 53, driving it to simultaneously compute β1·m_{t-1}, (1 - β1)·∇f(θ_{t-1}), β2·v_{t-1}, and (1 - β2)·(∇f(θ_{t-1}))^2, with the results denoted a1, a2, b1, and b2 respectively. Then a1 and a2, and b1 and b2, are each sent as the two inputs to the vector addition parallel operation sub-module 52, yielding the updated first moment vector m_t and second moment vector v_t.
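The decomposition of step S7 can be mimicked in software as follows; the intermediate names a1, a2, b1, b2 are taken from the text, while the function name and signature are invented for the sketch (the inputs are understood to be NumPy arrays or floats).

def update_moments(m_prev, v_prev, grad, beta1, beta2):
    # INS_2: vector multiplication sub-module computes the element-wise square.
    g2 = grad * grad
    # INS_3: four products computed simultaneously by the multiply sub-module.
    a1 = beta1 * m_prev
    a2 = (1 - beta1) * grad
    b1 = beta2 * v_prev
    b2 = (1 - beta2) * g2
    # Vector addition sub-module: m_t = a1 + a2 and v_t = b1 + b2.
    return a1 + a2, b1 + b2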
Step S8: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the translated micro-instruction, transfers the updated first moment vector m_t and second moment vector v_t from the data processing unit 5 to the data cache unit 4.
Step S9: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the translated micro-instruction, drives the operation control sub-module 51 to compute the moment estimate vectors. The operation control sub-module 51 sends the corresponding instructions to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute 1/(1 - β1^t) and 1/(1 - β2^t), and the iteration count t is incremented by 1; it sends operation instruction 5 (INS_5) to the vector multiplication parallel operation sub-module 53, driving it to compute in parallel the product of the first moment vector m_t with 1/(1 - β1^t) and the product of the second moment vector v_t with 1/(1 - β2^t), which yields the biased moment estimate vectors m̂_t = m_t/(1 - β1^t) and v̂_t = v_t/(1 - β2^t).
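In the same spirit, step S9 reduces to two scalar bias factors and two vector-scalar products; a minimal sketch with invented names follows.

def estimate_moments(m_t, v_t, t, beta1, beta2):
    # INS_4: basic operation sub-module computes the scalars 1/(1 - beta1^t)
    # and 1/(1 - beta2^t); the hardware then increments t by 1.
    c1 = 1.0 / (1.0 - beta1 ** t)
    c2 = 1.0 / (1.0 - beta2 ** t)
    # INS_5: vector multiplication sub-module forms both products in parallel.
    return m_t * c1, v_t * c2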
Step S10: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the translated micro-instruction, drives the operation control sub-module 51 to perform the following operations: the operation control sub-module 51 sends operation instruction 6 (INS_6) to the basic operation sub-module 56, driving it to compute -α; it sends operation instruction 7 (INS_7) to the vector square root parallel operation sub-module 55, driving it to compute √v̂_t; it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute m̂_t/√v̂_t; it sends operation instruction 8 (INS_8) to the vector multiplication parallel operation sub-module 53, driving it to compute -α·m̂_t/√v̂_t; it sends operation instruction 9 (INS_9) to the vector addition parallel operation sub-module 52, driving it to compute θ_t = θ_{t-1} - α·m̂_t/√v̂_t, which yields the updated parameter vector θ_t, where θ_{t-1} is the value of the parameter vector (initially θ_0) before the t-th update and the t-th cycle updates θ_{t-1} to θ_t; the operation control sub-module 51 sends operation instruction 10 (INS_10) to the vector division parallel operation sub-module 54, driving it to compute the element-wise quotient vector temp; and it sends operation instruction 11 (INS_11) and operation instruction 12 (INS_12) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively, which compute sum = ∑_i temp_i and temp2 = sum/n.
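Step S10 can likewise be followed operation by operation. The sketch below covers INS_6 through INS_9; the INS_10 to INS_12 convergence statistic corresponds to the has_converged sketch given earlier. The function name and signature are invented for the sketch.

import numpy as np

def update_parameters(theta_prev, m_hat, v_hat, alpha):
    neg_alpha = -alpha                 # INS_6: basic operation sub-module
    root = np.sqrt(v_hat)              # INS_7: vector square root sub-module
    ratio = m_hat / root               # INS_7: vector division sub-module
    delta = neg_alpha * ratio          # INS_8: vector multiplication sub-module
    theta_t = theta_prev + delta       # INS_9: vector addition sub-module
    return theta_t, delta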
Step S11: the controller unit 3 reads a write-back instruction for the updated parameters (DATABACK_IO) from the instruction cache unit 2 and, according to the translated micro-instruction, transfers the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
Step S12: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the translated micro-instruction, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends; otherwise, the flow returns to step S5 and continues.
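Putting the pieces together, a hypothetical software rendering of the loop over steps S5 to S12 might read as follows. It assumes the helper sketches above (update_moments, estimate_moments, update_parameters, has_converged) are in scope; grad_fn, ct, and max_iters are illustrative inputs, not names from the patent.

import numpy as np

def adam_train(theta, grad_fn, alpha, beta1, beta2, ct, max_iters=10000):
    # Step S4: initialize the moment vectors; the iteration count starts at 1.
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, max_iters + 1):
        grad = grad_fn(theta)                                          # step S5
        m, v = update_moments(m, v, grad, beta1, beta2)                # step S7
        m_hat, v_hat = estimate_moments(m, v, t, beta1, beta2)         # step S9
        theta_prev = theta
        theta, _ = update_parameters(theta_prev, m_hat, v_hat, alpha)  # step S10
        if has_converged(theta_prev, theta, ct):                       # step S12
            break
    return theta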
By employing a device dedicated to executing the Adam gradient descent training algorithm, the present invention can solve the problems of insufficient operational performance of general-purpose processors and high front-end decoding overhead, and accelerate the execution speed of related applications. At the same time, the use of the data cache unit avoids repeatedly reading data from memory and lowers the memory access bandwidth required.
The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination of both. Although the processes or methods are described above in a certain order, it should be understood that certain of the described operations can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (17)

1. An apparatus for performing the Adam gradient descent training algorithm, characterized in that the apparatus comprises a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a data cache unit (4), and a data processing module (5), wherein:

the direct memory access unit (1) is configured to access the external designated space, read and write data to the instruction cache unit (2) and the data processing module (5), and complete the loading and storing of data;

the instruction cache unit (2) is configured to read instructions through the direct memory access unit (1) and cache the read instructions;

the controller unit (3) is configured to read instructions from the instruction cache unit (2) and decode the read instructions into micro-instructions that control the behavior of the direct memory access unit (1), the data cache unit (4), or the data processing module (5);

the data cache unit (4) is configured to cache the first moment vector and the second moment vector during initialization and each data update;

the data processing module (5) is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit (4), and write the updated vector to be updated into the external designated space through the direct memory access unit (1).
2. The apparatus for performing the Adam gradient descent training algorithm according to claim 1, characterized in that the direct memory access unit (1) writes instructions from the external designated space to the instruction cache unit (2), reads the parameters to be updated and the corresponding gradient values from the external designated space to the data processing module (5), and writes the updated parameter vector from the data processing module (5) directly to the external designated space.
3. The apparatus for performing the Adam gradient descent training algorithm according to claim 1, characterized in that the controller unit (3) decodes the read instructions into micro-instructions that control the behavior of the direct memory access unit (1), the data cache unit (4), or the data processing module (5), so as to control the direct memory access unit (1) to read data from and write data to externally designated addresses, control the data cache unit (4) to acquire, through the direct memory access unit (1), the instructions required for the operation from the externally designated address, control the data processing module (5) to perform the update operation of the parameters to be updated, and control the data transfer between the data cache unit (4) and the data processing module (5).
4. The apparatus for performing the Adam gradient descent training algorithm according to claim 1, characterized in that the data cache unit (4) initializes the first moment vector m_t and the second moment vector v_t at initialization, and during each data update reads out the first moment vector m_{t-1} and the second moment vector v_{t-1} and sends them to the data processing module (5), where they are updated to the first moment vector m_t and the second moment vector v_t and then written back to the data cache unit (4).
5. The apparatus for performing the Adam gradient descent training algorithm according to claim 4, characterized in that, during operation of the apparatus, the data cache unit (4) always holds a copy of the first moment vector m_t and the second moment vector v_t.
6. The apparatus for performing the Adam gradient descent training algorithm according to claim 1, characterized in that the data processing module (5) reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit (4), reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit (1); then updates the moment vectors m_{t-1} and v_{t-1} to m_t and v_t, computes the moment estimate vectors m̂_t and v̂_t from m_t and v_t, and finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t and v_t into the data cache unit (4), and writes θ_t into the external designated space through the direct memory access unit (1).
7. The apparatus for performing the Adam gradient descent training algorithm according to claim 6, characterized in that the data processing module (5) updates the moment vectors m_{t-1} and v_{t-1} to m_t and v_t according to the formulas

m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}), v_t = β2·v_{t-1} + (1 - β2)·(∇f(θ_{t-1}))^2;

the data processing module (5) computes the moment estimate vectors m̂_t and v̂_t from m_t and v_t according to the formulas

m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t);

and the data processing module (5) updates the vector to be updated θ_{t-1} to θ_t according to the formula

θ_t = θ_{t-1} - α·m̂_t/√v̂_t.
8. The apparatus for performing the Adam gradient descent training algorithm according to claim 1 or 7, characterized in that the data processing module (5) comprises an operation control sub-module (51), a vector addition parallel operation sub-module (52), a vector multiplication parallel operation sub-module (53), a vector division parallel operation sub-module (54), a vector square root parallel operation sub-module (55), and a basic operation sub-module (56), wherein the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55), and the basic operation sub-module (56) are connected in parallel with one another, and the operation control sub-module (51) is connected in series with each of the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55), and the basic operation sub-module (56).
9. The apparatus for performing the Adam gradient descent training algorithm according to claim 8, characterized in that, when the apparatus operates on vectors, all vector operations are element-wise operations, and when an operation is performed on a vector, the elements at different positions are processed in parallel.
10. A method for performing the Adam gradient descent training algorithm, applied to the apparatus according to any one of claims 1 to 9, characterized in that the method comprises:

initializing the first moment vector m_0, the second moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;

when performing a gradient descent operation, first updating the first moment vector m_{t-1} and the second moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, then obtaining the biased moment estimate vectors m̂_t and v̂_t by moment vector operations, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; and repeating this process until the vector to be updated converges.
11. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that initializing the first moment vector m_0, the second moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, comprises:

step S1: pre-storing an INSTRUCTION_IO instruction at the first address of the instruction cache unit, the INSTRUCTION_IO instruction being used to drive the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space;

step S2: at the start of the operation, the controller unit reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit and, according to the translated micro-instruction, drives the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space and caches these instructions into the instruction cache unit;

step S3: the controller unit reads a HYPERPARAMETER_IO instruction from the instruction cache unit and, according to the translated micro-instruction, drives the direct memory access unit to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module;

step S4: the controller unit reads an assignment instruction from the instruction cache unit and, according to the translated micro-instruction, drives the initialization of the first moment vector m_{t-1} and the second moment vector v_{t-1} in the data cache unit, and sets the iteration count t in the data processing unit to 1;

step S5: the controller unit reads a DATA_IO instruction from the instruction cache unit and, according to the translated micro-instruction, drives the direct memory access unit to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space and send them to the data processing module;

step S6: the controller unit reads a data transfer instruction from the instruction cache unit and, according to the translated micro-instruction, transfers the first moment vector m_{t-1} and the second moment vector v_{t-1} in the data cache unit to the data processing unit.
12. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that updating the first moment vector m_{t-1} and the second moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates is implemented according to the formulas

m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}), v_t = β2·v_{t-1} + (1 - β2)·(∇f(θ_{t-1}))^2,

and specifically comprises:

the controller unit reads a moment vector update instruction from the instruction cache unit and, according to the translated micro-instruction, drives the data cache unit to perform the update operation of the first moment vector m_{t-1} and the second moment vector v_{t-1}; in this update operation, the moment vector update instruction is sent to the operation control sub-module, which sends the corresponding instructions to perform the following operations: it sends the INS_1 instruction to the basic operation sub-module, driving it to compute (1 - β1) and (1 - β2); it sends the INS_2 instruction to the vector multiplication parallel operation sub-module, driving it to compute (∇f(θ_{t-1}))^2; it then sends the INS_3 instruction to the vector multiplication parallel operation sub-module, driving it to simultaneously compute β1·m_{t-1}, (1 - β1)·∇f(θ_{t-1}), β2·v_{t-1}, and (1 - β2)·(∇f(θ_{t-1}))^2, with the results denoted a1, a2, b1, and b2 respectively; then a1 and a2, and b1 and b2, are each sent as the two inputs to the vector addition parallel operation sub-module, yielding the updated first moment vector m_t and second moment vector v_t.
13. The method for performing the Adam gradient descent training algorithm according to claim 12, characterized in that, after updating the first moment vector m_{t-1} and the second moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, the method further comprises:

the controller unit reads a data transfer instruction from the instruction cache unit and, according to the translated micro-instruction, transfers the updated first moment vector m_t and second moment vector v_t from the data processing unit to the data cache unit.
14. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that obtaining the biased moment estimate vectors m̂_t and v̂_t by moment vector operations is implemented according to the formulas

m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t),

and specifically comprises:

the controller unit reads a moment estimate vector operation instruction from the instruction cache unit and, according to the translated micro-instruction, drives the operation control sub-module to compute the moment estimate vectors; the operation control sub-module sends the corresponding instructions to perform the following operations: it sends the instruction INS_4 to the basic operation sub-module, driving it to compute 1/(1 - β1^t) and 1/(1 - β2^t), and the iteration count t is incremented by 1; it sends the instruction INS_5 to the vector multiplication parallel operation sub-module, driving it to compute in parallel the product of the first moment vector m_t with 1/(1 - β1^t) and the product of the second moment vector v_t with 1/(1 - β2^t), which yields the biased moment estimate vectors m̂_t and v̂_t.
15. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that updating the vector to be updated θ_{t-1} to θ_t is implemented according to the formula

θ_t = θ_{t-1} - α·m̂_t/√v̂_t,

and specifically comprises:

the controller unit reads a parameter vector update instruction from the instruction cache unit and, according to the translated micro-instruction, drives the operation control sub-module to perform the following operations: the operation control sub-module sends the instruction INS_6 to the basic operation sub-module, driving it to compute -α; it sends the instruction INS_7 to the vector square root parallel operation sub-module, driving it to compute √v̂_t; it sends the instruction INS_7 to the vector division parallel operation sub-module, driving it to compute m̂_t/√v̂_t; it sends the instruction INS_8 to the vector multiplication parallel operation sub-module, driving it to compute -α·m̂_t/√v̂_t; it sends the instruction INS_9 to the vector addition parallel operation sub-module, driving it to compute θ_t = θ_{t-1} - α·m̂_t/√v̂_t, which yields the updated parameter vector θ_t, where θ_{t-1} is the value of the parameter vector (initially θ_0) before the t-th update and the t-th cycle updates θ_{t-1} to θ_t; the operation control sub-module sends the instruction INS_10 to the vector division parallel operation sub-module, driving it to compute the element-wise quotient vector temp; and it sends the instructions INS_11 and INS_12 to the vector addition parallel operation sub-module and the basic operation sub-module respectively, which compute sum = ∑_i temp_i and temp2 = sum/n.
16. The method for performing the Adam gradient descent training algorithm according to claim 15, characterized in that, after updating the vector to be updated θ_{t-1} to θ_t, the method further comprises:

the controller unit reads a DATABACK_IO instruction from the instruction cache unit and, according to the translated micro-instruction, transfers the updated parameter vector θ_t from the data processing unit to the external designated space through the direct memory access unit.
17. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that the step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged, the specific determination process being as follows:

the controller unit reads a convergence judgment instruction from the instruction cache unit and, according to the translated micro-instruction, the data processing module determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
PCT/CN2016/080357 2016-04-27 2016-04-27 Device and method for performing adam gradient descent training algorithm WO2017185257A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080357 WO2017185257A1 (en) 2016-04-27 2016-04-27 Device and method for performing adam gradient descent training algorithm

Publications (1)

Publication Number Publication Date
WO2017185257A1 (en) 2017-11-02

Family

ID=60161795

Country Status (1)

Country Link
WO (1) WO2017185257A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325401A1 (en) * 2012-05-29 2013-12-05 Xerox Corporation Adaptive weighted stochastic gradient descent
CN103956992A (en) * 2014-03-26 2014-07-30 复旦大学 Self-adaptive signal processing method based on multi-step gradient decrease
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
CN105184369A (en) * 2015-09-08 2015-12-23 杭州朗和科技有限公司 Depth learning model matrix compression method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931937A (en) * 2020-09-30 2020-11-13 深圳云天励飞技术股份有限公司 Gradient updating method, device and system of image processing model
CN111931937B (en) * 2020-09-30 2021-01-01 深圳云天励飞技术股份有限公司 Gradient updating method, device and system of image processing model
CN112329941A (en) * 2020-11-04 2021-02-05 支付宝(杭州)信息技术有限公司 Deep learning model updating method and device
CN112580507A (en) * 2020-12-18 2021-03-30 合肥高维数据技术有限公司 Deep learning text character detection method based on image moment correction
CN112580507B (en) * 2020-12-18 2024-05-31 合肥高维数据技术有限公司 Deep learning text character detection method based on image moment correction
CN113238975A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Memory, integrated circuit and board card for optimizing parameters of deep neural network
CN116863492A (en) * 2023-09-04 2023-10-10 山东正禾大教育科技有限公司 Mobile digital publishing system
CN116863492B (en) * 2023-09-04 2023-11-21 山东正禾大教育科技有限公司 Mobile digital publishing system

Legal Events

NENP: Non-entry into the national phase; ref country code: DE

121: EP: the EPO has been informed by WIPO that EP was designated in this application; ref document number: 16899770; country of ref document: EP; kind code of ref document: A1

122: EP: PCT application non-entry in European phase; ref document number: 16899770; country of ref document: EP; kind code of ref document: A1