CN107315570B - Device and method for executing Adam gradient descent training algorithm - Google Patents


Info

Publication number
CN107315570B
Authority
CN
China
Prior art keywords
vector
instruction
module
updated
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610269689.7A
Other languages
Chinese (zh)
Other versions
CN107315570A (en)
Inventor
郭崎
刘少礼
陈天石
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201610269689.7A priority Critical patent/CN107315570B/en
Publication of CN107315570A publication Critical patent/CN107315570A/en
Application granted granted Critical
Publication of CN107315570B publication Critical patent/CN107315570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching
    • G06F9/3832Value prediction for operands; operand history buffers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Liquid Crystal Display Device Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The disclosure provides a device and method for executing an Adam gradient descent training algorithm. The device comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The method comprises the following steps: first, a gradient vector and the vector of values to be updated are read in, and the first-order moment vector, the second-order moment vector and the corresponding exponential decay rates are initialized; during each iteration, the first-order moment vector and the second-order moment vector are updated using the gradient vector, the biased first-order and second-order moment estimate vectors are calculated, and the parameter to be updated is updated using these estimate vectors; training continues until the parameter vector to be updated converges. With this method and device, the Adam gradient descent algorithm can be applied and data processing efficiency is greatly improved.

Description

Device and method for executing Adam gradient descent training algorithm
Technical Field
The disclosure relates to the technical field of Adam algorithm applications, and in particular to a device and a method for executing an Adam gradient descent training algorithm, concerning the hardware implementation of the Adam gradient descent optimization algorithm.
Background
The Adam algorithm is one of the gradient descent optimization algorithms. Because it is easy to implement, computationally light, small in required storage space, and invariant to rescaling of the gradients, it is widely used, and implementing it with a dedicated device can remarkably improve its execution speed.
Currently, one known method of performing the Adam gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general instructions with a general register file and general functional units. One disadvantage of this approach is the low computational performance of a single general-purpose processor. When multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding brings a large power consumption overhead.
Another known method of performing the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose single-instruction multiple-data (SIMD) instructions with a general register file and general stream processing units. Because the GPU is a device designed for graphics operations and scientific computing, it has no dedicated support for the operations of the Adam gradient descent algorithm, so a large amount of front-end decoding work is still required to perform them, bringing considerable additional overhead. In addition, the GPU has only a small on-chip cache; data required in the operation (such as the first-order and second-order moment vectors) must be repeatedly transferred off-chip, so off-chip bandwidth becomes the main performance bottleneck while incurring huge power consumption overhead.
Disclosure of Invention
Technical problem to be solved
In view of the above, the present disclosure provides a device and a method for executing an Adam gradient descent training algorithm, to solve the problems of insufficient arithmetic performance and high front-end decoding overhead in general-purpose processors, to avoid repeatedly reading data from memory, and to reduce memory-access bandwidth pressure.
(II) Technical solution
To achieve the above object, the present disclosure provides an apparatus for executing Adam gradient descent training algorithm, the apparatus including a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
the direct memory access unit 1 is used for accessing an external designated space, reading and writing data to the instruction cache unit 2 and the data processing module 5, and completing the loading and storage of the data;
the instruction cache unit 2 is used for reading the instruction through the direct memory access unit 1 and caching the read instruction;
the controller unit 3 is used for reading the instruction from the instruction cache unit 2 and decoding the read instruction into a microinstruction for controlling the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5;
the data cache unit 4 is used for caching the first-order moment vectors and the second-order moment vectors in the initialization and data updating processes;
and the data processing module 5 is used for updating the moment vector, calculating the moment estimation vector, updating the vector to be updated, writing the updated moment vector into the data cache unit 4, and writing the updated vector to be updated into an external designated space through the direct memory access unit 1.
In the above scheme, the direct memory access unit 1 writes an instruction into the instruction cache unit 2 from an external designated space, reads a parameter to be updated and a corresponding gradient value from the external designated space to the data processing module 5, and directly writes an updated parameter vector into the external designated space from the data processing module 5.
In the above scheme, the controller unit 3 decodes the read instruction into microinstructions for controlling the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5, so as to control the direct memory access unit 1 to read data from and write data to the external designated address, control the data cache unit 4 to obtain the data required by the operation from the external designated address through the direct memory access unit 1, control the data processing module 5 to perform the update operation on the parameter to be updated, and control the data transmission between the data cache unit 4 and the data processing module 5.
In the above scheme, the data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t during initialization; during each data update, the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} are read out and sent to the data processing module 5, updated there to m_t and v_t, and then written back into the data cache unit 4.
In the above scheme, during the operation of the device, a copy of the first-order moment vector m_t and the second-order moment vector v_t is always kept in the data cache unit 4.
In the above scheme, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step α and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, and finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit 4, and writes θ_t to the external designated space through the direct memory access unit 1.
In the above scheme, the data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas

    m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1})
    v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})²

computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t according to the formulas

    m̂_t = m_t / (1 - β1^t),  v̂_t = v_t / (1 - β2^t)

and updates the vector to be updated θ_{t-1} to θ_t according to the formula

    θ_t = θ_{t-1} - α·m̂_t / √v̂_t
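As a hedged reference sketch, the three formula groups above can be modeled in plain Python as one update step. The names (`adam_step`, `grad`, etc.) are illustrative, not from the patent, and the ε smoothing term found in other Adam formulations is omitted because the formulas here do not use it:

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999):
    """One Adam update on plain Python lists; every operation is element-wise."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]        # m_t
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]   # v_t
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # biased first-moment estimate
    v_hat = [vi / (1 - beta2 ** t) for vi in v]   # biased second-moment estimate
    theta = [p - alpha * mh / math.sqrt(vh)       # theta_t (no epsilon term here)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

The device applies the same arithmetic, but each list comprehension corresponds to an element-wise parallel operation sub-module rather than a sequential loop.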
In the above solution, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55 and a basic operation sub-module 56; these five operation sub-modules are connected in parallel, and the operation control sub-module 51 is connected in series with each of them. All vector calculations performed by the device are element-wise, and when a given operation is executed on a vector, the elements at different positions are computed in parallel.
To achieve the above object, the present disclosure also provides a method for executing an Adam gradient descent training algorithm, the method comprising:
initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1, β2 and the learning step α, and obtaining the vector to be updated θ_0 from an external designated space;
when the gradient descent operation is performed, updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, then obtaining the biased moment estimate vectors m̂_t and v̂_t through moment vector operations;
finally, updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
In the above scheme, initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1, β2 and the learning step α, and obtaining the vector to be updated θ_0 from an external designated space, comprises the following steps:
In step S1, an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space.
In step S2, when the operation starts, the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space and cache them in the instruction cache unit 2.
In step S3, the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read the global update step α, the exponential decay rates β1, β2 and the convergence threshold c_t from the external designated space, which are then sent to the data processing module 5.
In step S4, the controller unit 3 reads in an assignment instruction from the instruction cache unit 2 and, according to the translated microinstruction, initializes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing module 5 to 1.
In step S5, the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space, which are then sent to the data processing module 5.
In step S6, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the translated microinstruction, writes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 into the data processing module 5.
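Purely as an illustration, steps S1 to S6 can be modeled as a host-side setup routine. The dictionary layout and every name below are assumptions of this sketch, not the device's instruction set:

```python
def run_setup(external):
    """Software model of steps S1-S6; `external` stands in for the external designated space."""
    state = {}
    # S1-S2: INSTRUCTION_IO fetches the instruction stream (modeled implicitly here).
    # S3: HYPERPARAMETER_IO reads the update step, decay rates and threshold c_t.
    state["alpha"] = external["alpha"]
    state["beta1"], state["beta2"] = external["beta1"], external["beta2"]
    state["c_t"] = external["c_t"]
    # S4: the assignment instruction zero-initializes the moment vectors and sets t = 1.
    n = len(external["theta"])
    state["m"], state["v"], state["t"] = [0.0] * n, [0.0] * n, 1
    # S5: DATA_IO reads the parameter vector to be updated and its gradient.
    state["theta"] = list(external["theta"])
    state["grad"] = list(external["grad"])
    # S6: the data transfer instruction moves m and v into the data processing module.
    return state
```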
In the above scheme, updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates is performed according to the formulas

    m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1})
    v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})²

and specifically includes the following: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the data cache unit 4 to supply the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}; a moment vector update instruction is sent to the operation control sub-module 51, which sends the corresponding instructions to perform the following operations: it sends an INS_1 instruction to the basic operation sub-module 56, driving it to calculate (1 - β1) and (1 - β2); it sends an INS_2 instruction to the vector multiplication parallel operation sub-module 53, driving it to calculate the element-wise square ∇f(θ_{t-1})²; it then sends an INS_3 instruction to the vector multiplication parallel operation sub-module 53, driving it to simultaneously calculate β1·m_{t-1}, (1 - β1)·∇f(θ_{t-1}), β2·v_{t-1} and (1 - β2)·∇f(θ_{t-1})², the results being denoted a1, a2, b1 and b2 respectively; finally, a1 and a2, and b1 and b2, are fed as the two inputs of the vector addition parallel operation sub-module 52 to obtain the updated first-order moment vector m_t = a1 + a2 and second-order moment vector v_t = b1 + b2.
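The INS_1 to INS_3 decomposition above can be sketched in software as follows. This is a hedged model: each list comprehension stands in for one element-wise parallel sub-module, and all names other than a1, a2, b1 and b2 are assumptions:

```python
def update_moments(m_prev, v_prev, grad, beta1=0.9, beta2=0.999):
    """Moment update decomposed the way the sub-modules are driven."""
    c1, c2 = 1 - beta1, 1 - beta2               # INS_1: basic operation sub-module
    g2 = [g * g for g in grad]                  # INS_2: element-wise gradient square
    a1 = [beta1 * mi for mi in m_prev]          # INS_3: four products in parallel
    a2 = [c1 * g for g in grad]
    b1 = [beta2 * vi for vi in v_prev]
    b2 = [c2 * s for s in g2]
    m_t = [x + y for x, y in zip(a1, a2)]       # vector addition: m_t = a1 + a2
    v_t = [x + y for x, y in zip(b1, b2)]       # vector addition: v_t = b1 + b2
    return m_t, v_t
```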
In the above scheme, after updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, the method further comprises: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the translated microinstruction, writes the updated first-order moment vector m_t and second-order moment vector v_t from the data processing module 5 into the data cache unit 4.
In the above scheme, obtaining the biased moment estimate vectors m̂_t and v̂_t through moment vector operations is performed according to the formulas

    m̂_t = m_t / (1 - β1^t),  v̂_t = v_t / (1 - β2^t)

and specifically includes the following: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control sub-module 51 to calculate the moment estimate vectors, which sends the corresponding instructions to perform the following operations: the operation control sub-module 51 sends an instruction INS_4 to the basic operation sub-module 56, driving it to calculate 1/(1 - β1^t) and 1/(1 - β2^t) and to increment the iteration count t by 1; the operation control sub-module 51 then sends an instruction INS_5 to the vector multiplication parallel operation sub-module 53, driving it to calculate in parallel the product of the first-order moment vector m_t with 1/(1 - β1^t) and the product of the second-order moment vector v_t with 1/(1 - β2^t), yielding the biased estimate vectors m̂_t and v̂_t.
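A minimal software model of this bias-correction step follows; the scalar names s1 and s2 are assumptions of the sketch:

```python
def bias_correct(m_t, v_t, t, beta1=0.9, beta2=0.999):
    """Scale the moment vectors into the biased moment estimates."""
    s1 = 1.0 / (1 - beta1 ** t)        # INS_4: scalars from the basic operation sub-module
    s2 = 1.0 / (1 - beta2 ** t)        # (the iteration count t is incremented alongside)
    m_hat = [s1 * mi for mi in m_t]    # INS_5: two element-wise products in parallel
    v_hat = [s2 * vi for vi in v_t]
    return m_hat, v_hat
```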
in the above scheme, the vector θ to be updated is updatedt-1Is thetatIs according to the formula
Figure GDA00016557541400000517
The implementation specifically includes: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2, and drives the operation control sub-module 51 to perform the following operations according to the translated microinstruction: the operation control sub-module 51 sends an instruction INS _6 to the basic operation sub-module 56, and drives the basic operation sub-module 56 to calculate-alpha; the operation control sub-module 51 sends an instruction INS _7 to the vector square root parallel operation sub-module 55 to drive the operation to obtain
Figure GDA0001655754140000061
The operation control sub-module 51 sends an instruction INS _7 to the vector division parallel operation sub-module 54 to drive the operation to obtain
Figure GDA0001655754140000062
The operation control sub-module 51 sends an instruction INS _8 to the vector multiplication parallel operation sub-module 53 to drive the operation to obtain
Figure GDA0001655754140000066
The operation control sub-module 51 sends an instruction INS _9 to the vector addition parallel operation sub-module 52 to drive the calculation thereof
Figure GDA0001655754140000064
Obtaining an updated parameter vector thetat(ii) a Wherein, thetat-1Is theta0Not updated before the t-th cycle, the t-th cycle will be θt-1Is updated to thetat(ii) a The operation control sub-module 51 sends an instruction INS _10 to the vector division parallel operation sub-module 54 to drive the operation thereof to obtain a vector
Figure GDA0001655754140000065
The arithmetic control sub-module 51 sends instructions INS _11 and INS _12 to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively to calculate sum sigmat tempt、temp2=sum/n。
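The INS_6 to INS_12 sequence can be sketched as follows. This is a hedged model: the exact operand of the INS_10 division that produces temp_t is not fully recoverable from the text, so the per-element step magnitude is used here as a stand-in convergence measure:

```python
import math

def update_params(theta_prev, m_hat, v_hat, alpha=0.001):
    """Parameter update plus the temp2 convergence metric."""
    root = [math.sqrt(vh) for vh in v_hat]               # INS_7: square-root sub-module
    quot = [mh / r for mh, r in zip(m_hat, root)]        # INS_7: division sub-module
    step = [-alpha * q for q in quot]                    # INS_8: scale by -alpha (from INS_6)
    theta = [p + s for p, s in zip(theta_prev, step)]    # INS_9: theta_t = theta_{t-1} + step
    temp_t = [abs(s) for s in step]                      # INS_10 stand-in (assumption)
    temp2 = sum(temp_t) / len(temp_t)                    # INS_11 / INS_12: sum and sum / n
    return theta, temp2
```

temp2 is then compared against the convergence threshold c_t, as in the determination step below.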
In the above scheme, after updating the vector θ_{t-1} to θ_t, the method further comprises: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, the updated parameter vector θ_t is written from the data processing module 5 to the external designated space through the direct memory access unit 1.
In the above scheme, repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged, and the specific determination process is: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the translated microinstruction, the data processing module 5 judges whether the updated parameter vector has converged; if temp2 < c_t, it has converged and the operation ends.
(III) Advantageous effects
According to the technical scheme, the method has the following beneficial effects:
1. according to the device and the method for executing the Adam gradient descent training algorithm, the device special for executing the Adam gradient descent training algorithm is adopted, the problems that a general processor of data is insufficient in operation performance and the front-section decoding cost is high can be solved, and the execution speed of related applications is accelerated.
2. In the device and method for executing the Adam gradient descent training algorithm, the moment vectors required in the intermediate process are temporarily stored in the data cache unit, which avoids repeatedly reading data from memory, reduces IO operations between the device and the external address space, and reduces memory-access bandwidth pressure.
3. According to the device and the method for executing the Adam gradient descent training algorithm, the data processing module adopts the related parallel operation sub-module to perform vector operation, so that the parallel degree is greatly improved.
4. According to the device and the method for executing the Adam gradient descent training algorithm, the data processing module adopts the related parallel operation sub-module to perform vector operation, the operation parallelism degree is high, the working frequency is low, and the power consumption overhead is low.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
fig. 1 illustrates an example block diagram of the overall structure of an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present disclosure.
Fig. 2 illustrates an example block diagram of a data processing module in an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of this disclosure.
Fig. 3 shows a flow diagram of a method for executing an Adam gradient descent training algorithm in accordance with an embodiment of the present disclosure.
Like devices, components, units, etc. are designated with like reference numerals throughout the drawings.
Detailed Description
Other aspects, advantages, and salient features of the disclosure will become apparent to those skilled in the art from the following detailed description of the disclosed embodiments, which taken in conjunction with the annexed drawings.
In the present disclosure, the terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below which are used to describe the principles of the present disclosure are by way of illustration only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The device and method for executing the Adam gradient descent training algorithm according to embodiments of the disclosure are used to accelerate applications of the Adam gradient descent algorithm. First, the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1, β2 and the learning step α are initialized, and the vector to be updated θ_0 is obtained from an external designated space. Each time the gradient descent operation is performed, the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} are updated using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, i.e.

    m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1})
    v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})²

then the biased moment estimate vectors m̂_t and v̂_t are obtained through moment vector operations, i.e.

    m̂_t = m_t / (1 - β1^t),  v̂_t = v_t / (1 - β2^t)

and finally the vector to be updated θ_{t-1} is updated to θ_t and output, i.e.

    θ_t = θ_{t-1} - α·m̂_t / √v̂_t

where θ_{t-1} is the value of the parameter vector (initially θ_0) before the t-th cycle, and the t-th cycle updates θ_{t-1} to θ_t. This process is repeated until the vector to be updated converges.
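The iteration described above can be sketched as a reference training loop. This is illustrative only: `grad_fn`, the function name and the use of the mean step magnitude as the convergence metric are assumptions of this sketch:

```python
import math

def adam_train(theta, grad_fn, alpha=0.001, beta1=0.9, beta2=0.999,
               c_t=1e-8, max_iter=10000):
    """Iterate the Adam update until the mean step magnitude drops below c_t."""
    n = len(theta)
    m = [0.0] * n                                  # first-order moment vector m_0
    v = [0.0] * n                                  # second-order moment vector v_0
    for t in range(1, max_iter + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        step = [-alpha * mh / math.sqrt(vh) for mh, vh in zip(m_hat, v_hat)]
        theta = [p + s for p, s in zip(theta, step)]
        if sum(abs(s) for s in step) / n < c_t:    # convergence check (stand-in metric)
            break
    return theta
```

For example, minimizing f(x) = (x - 3)² with `grad_fn = lambda th: [2.0 * (th[0] - 3.0)]` moves theta toward 3 at roughly α per iteration early on.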
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the Adam gradient descent algorithm according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may all be implemented by hardware circuits.
And the direct memory access unit 1 is used for accessing an external designated space, reading and writing data to the instruction cache unit 2 and the data processing module 5, and completing the loading and storage of the data. Specifically, an instruction is written into the instruction cache unit 2 from an external designated space, a parameter to be updated and a corresponding gradient value are read from the external designated space to the data processing module 5, and an updated parameter vector is directly written into the external designated space from the data processing module 5.
And the instruction cache unit 2 is used for reading the instruction through the direct memory access unit 1 and caching the read instruction.
The controller unit 3 is configured to read an instruction from the instruction cache unit 2, decode the read instruction into microinstructions for controlling the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5, and send each microinstruction to the direct memory access unit 1, the data cache unit 4 or the data processing module 5, so as to control the direct memory access unit 1 to read data from and write data to the external designated address, control the data cache unit 4 to obtain the data required by the operation from the external designated address through the direct memory access unit 1, control the data processing module 5 to perform the update operation on the parameter to be updated, and control the data transmission between the data cache unit 4 and the data processing module 5.
The data cache unit 4 is used for caching the first-order moment vectors and the second-order moment vectors during initialization and the data update process. Specifically, the data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t during initialization; during each data update, it reads out the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4. During the operation of the device, a copy of the first-order moment vector m_t and the second-order moment vector v_t is always kept in the data cache unit 4. In the disclosure, because the moment vectors required in the intermediate process are temporarily stored in the data cache unit, repeated reading of data from memory is avoided, IO operations between the device and the external address space are reduced, and memory-access bandwidth pressure is reduced.
The data processing module 5 is used for updating the moment vectors, calculating the moment-estimate vectors, updating the vector to be updated, writing the updated moment vectors into the data cache unit 4, and writing the updated vector to be updated into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, its gradient vector ∇f(θ_{t-1}), the update step α and the exponential decay rates β_1 and β_2 from the external designated space through the direct memory access unit 1. It then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, i.e.

m_t = β_1·m_{t-1} + (1 − β_1)·∇f(θ_{t-1})
v_t = β_2·v_{t-1} + (1 − β_2)·∇f(θ_{t-1})²  (element-wise square)

computes the bias-corrected moment-estimate vectors m̂_t, v̂_t from m_t, v_t, i.e.

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

and finally updates the vector to be updated θ_{t-1} to θ_t, i.e.

θ_t = θ_{t-1} − α·m̂_t / √(v̂_t)

It then writes m_t, v_t into the data cache unit 4 and writes θ_t into the external designated space through the direct memory access unit 1. In the disclosure, because the data processing module performs the vector operations with parallel operation sub-modules, the degree of parallelism is greatly improved while the operating frequency and the power consumption overhead remain low.
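Putting the formulas above together, one iteration of the data processing module can be sketched in plain Python. This is a software model of the data flow, not the hardware device; the function name and the list-based vectors are ours, and, like the patent's formulas, it omits the small epsilon that standard Adam adds to the denominator, so v̂_t must stay nonzero here.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999):
    """One Adam update over plain Python lists, mirroring the module's
    data flow: update the moment vectors, bias-correct them, then update
    the parameter vector."""
    # moment vector update: m_t, v_t from m_{t-1}, v_{t-1} and the gradient
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # bias-corrected moment estimates
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # parameter update: theta_t = theta_{t-1} - alpha * m_hat / sqrt(v_hat)
    theta = [ti - alpha * mh / math.sqrt(vh)
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

At t = 1 the bias correction makes m̂_t equal the gradient itself, so the very first step has magnitude exactly α in each coordinate, which is a handy sanity check for the model.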
Fig. 2 illustrates an example block diagram of a data processing module in an apparatus for implementing applications related to the Adam gradient descent algorithm in accordance with an embodiment of this disclosure. As shown in fig. 2, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55 and a basic operation sub-module 56, wherein the vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55 and the basic operation sub-module 56 are connected in parallel, and the operation control sub-module 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and when an operation is applied to a vector, the elements at different positions are processed in parallel.
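The element-wise character of these vector operations can be modeled in a short Python sketch (ours, not from the patent): each index is an independent task, which is exactly what allows the hardware sub-modules to process different positions in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
import operator

def elementwise(op, xs, ys):
    # Each index is independent of the others, so the per-element
    # applications can run concurrently -- modeling how the parallel
    # sub-modules treat a vector.  (A thread pool only illustrates the
    # independence; it brings no speed-up for Python floats.)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(op, xs, ys))

sums = elementwise(operator.add, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
products = elementwise(operator.mul, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```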
Fig. 3 shows a flow chart of a method for executing an Adam gradient descent training algorithm according to an embodiment of the present disclosure, specifically comprising the steps of:
In step S1, an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the instruction prefetch instruction is used to drive the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space.
In step S2, when the operation starts, the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space and cache them in the instruction cache unit 2.
In step S3, the controller unit 3 reads a hyper-parameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2, and according to the translated microinstruction drives the direct memory access unit 1 to read the global update step α, the exponential decay rates β_1 and β_2, and the convergence threshold ct from the external designated space and send them to the data processing module 5.
In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2, and according to the translated microinstruction drives the initialization of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing unit 5 to 1.
In step S5, the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2, and according to the translated microinstruction drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space and send them to the data processing module 5.
In step S6, the controller unit 3 reads a data transmission instruction from the instruction cache unit 2, and according to the translated microinstruction sends the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 to the data processing unit 5.
In step S7, the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2, and according to the translated microinstruction drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}. In the update operation, the moment vector update instruction is sent to the operation control sub-module 51, which issues corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1 − β_1) and (1 − β_2); it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute ∇f(θ_{t-1})², the element-wise square of the gradient vector; it then sends operation instruction 3 (INS_3) to the vector multiplication parallel operation sub-module 53, driving it to compute β_1·m_{t-1}, (1 − β_1)·∇f(θ_{t-1}), β_2·v_{t-1} and (1 − β_2)·∇f(θ_{t-1})² simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively. Then a_1 and a_2, and b_1 and b_2, are sent as pairs of inputs to the vector addition parallel operation sub-module 52, yielding the updated first-order moment vector m_t = a_1 + a_2 and second-order moment vector v_t = b_1 + b_2.
In step S8, the controller unit 3 reads a data transmission instruction from the instruction cache unit 2, and according to the translated microinstruction transmits the updated first-order moment vector m_t and second-order moment vector v_t from the data processing unit 5 into the data cache unit 4.
In step S9, the controller unit 3 reads a moment-estimate vector operation instruction from the instruction cache unit 2, and according to the translated microinstruction drives the operation control sub-module 51 to compute the moment-estimate vectors. The operation control sub-module 51 issues corresponding instructions to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute 1/(1 − β_1^t) and 1/(1 − β_2^t), and the iteration count t is incremented by 1; it then sends operation instruction 5 (INS_5) to the vector multiplication parallel operation sub-module 53, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1 − β_1^t) and of the second-order moment vector v_t with 1/(1 − β_2^t), yielding the bias-corrected estimate vectors m̂_t = m_t/(1 − β_1^t) and v̂_t = v_t/(1 − β_2^t).
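Step S9 can be sketched the same way (a software model; the function name and the INS_4/INS_5 mapping in the comments are ours):

```python
def bias_correct(m_t, v_t, t, beta1, beta2):
    """Step S9: the basic sub-module computes the two correction scalars,
    then the vector-multiply sub-module scales both moment vectors."""
    s1 = 1.0 / (1.0 - beta1 ** t)   # INS_4, basic operation sub-module
    s2 = 1.0 / (1.0 - beta2 ** t)
    m_hat = [x * s1 for x in m_t]   # INS_5, both scalings in parallel
    v_hat = [x * s2 for x in v_t]
    return m_hat, v_hat
```

At t = 1 the scalars are 1/(1 − β_1) and 1/(1 − β_2), so the corrected estimates recover the raw gradient and squared gradient exactly.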
In step S10, the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2, and according to the translated microinstruction drives the operation control sub-module 51 to perform the following operations: it sends operation instruction 6 (INS_6) to the basic operation sub-module 56, driving it to compute −α; it sends operation instruction 7 (INS_7) to the vector square root parallel operation sub-module 55, driving it to compute √(v̂_t); it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute m̂_t/√(v̂_t); it sends operation instruction 8 (INS_8) to the vector multiplication parallel operation sub-module 53, driving it to compute temp = −α·m̂_t/√(v̂_t); and it sends operation instruction 9 (INS_9) to the vector addition parallel operation sub-module 52, driving it to compute θ_{t-1} + temp, obtaining the updated parameter vector θ_t. Here θ_{t-1} is θ_0 before the first cycle, and the t-th cycle updates θ_{t-1} to θ_t. The operation control sub-module 51 then sends operation instruction 10 (INS_10) to the vector division parallel operation sub-module 54, driving it to compute a vector whose elements are denoted temp_i, and sends operation instruction 11 (INS_11) and operation instruction 12 (INS_12) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively to compute sum = Σ_i temp_i and temp2 = sum/n.
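The parameter update of step S10 can be sketched as follows. This is a software model with names of our choosing; in particular, the original formula behind INS_10 is an image lost in extraction, so the convergence statistic temp2 is computed here under our reading that it averages the per-element update magnitudes.

```python
import math

def parameter_update(theta_prev, m_hat, v_hat, alpha):
    """Step S10 decomposition (INS_6..INS_9), plus the convergence
    statistic of INS_10..INS_12 under an assumed per-element reading."""
    neg_alpha = -alpha                                  # INS_6, basic sub-module
    root = [math.sqrt(x) for x in v_hat]                # INS_7, square-root sub-module
    ratio = [m / r for m, r in zip(m_hat, root)]        # INS_7, division sub-module
    temp = [neg_alpha * x for x in ratio]               # INS_8, multiply sub-module
    theta = [t + d for t, d in zip(theta_prev, temp)]   # INS_9, addition sub-module
    # INS_10..INS_12 (assumed form): average magnitude of the update
    temp2 = sum(abs(d) for d in temp) / len(temp)
    return theta, temp2
```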
In step S11, the controller unit 3 reads a to-be-updated vector write-back instruction (DATABACK_IO) from the instruction cache unit 2, and according to the translated microinstruction transmits the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
In step S12, the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2, and according to the translated microinstruction the data processing module 5 judges whether the updated parameter vector has converged: if temp2 < ct, convergence is reached and the operation ends; otherwise, the flow returns to step S5 and continues.
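The loop of steps S5–S12 can be sketched end to end on a scalar example (a software model; the function and stopping rule are our reading of the flow, grad_fn stands in for the externally supplied gradients, and a small epsilon is added inside the square root as a guard that the patent's formulas omit):

```python
import math

def adam_minimize(theta0, grad_fn, alpha=0.1, beta1=0.9, beta2=0.999,
                  ct=1e-6, max_iter=10000):
    """Steps S5-S12 as a loop: repeat the Adam update until the update
    magnitude drops below the convergence threshold ct."""
    theta, m, v, t = theta0, 0.0, 0.0, 1
    for _ in range(max_iter):
        g = grad_fn(theta)                        # step S5: fetch gradient
        m = beta1 * m + (1 - beta1) * g           # step S7: moment update
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)              # step S9: bias correction
        v_hat = v / (1 - beta2 ** t)
        step = -alpha * m_hat / math.sqrt(v_hat + 1e-12)  # step S10
        theta += step
        t += 1
        if abs(step) < ct:                        # step S12: convergence test
            break
    return theta

# minimize f(x) = x^2, whose gradient is 2x
theta = adam_minimize(1.0, lambda x: 2 * x)
```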
By employing a device dedicated to executing the Adam gradient descent training algorithm, the present disclosure addresses the insufficient operation performance of general-purpose processors and the high overhead of front-end decoding, and accelerates the execution of related applications. Meanwhile, the data cache unit avoids repeated reads of data from memory and reduces the memory-access bandwidth required.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination thereof. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order, and some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present disclosure have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (25)

1. An apparatus for performing an Adam gradient descent training algorithm, the apparatus comprising:
the controller unit (3) is used for reading an instruction and decoding the read instruction into a micro instruction for controlling the behavior of the data cache unit (4) or the data processing module (5);
the data caching unit (4) is used for caching the moment vectors in the processes of initialization and data updating;
the data processing module (5) is connected to the controller unit (3) and the data cache unit (4) and is used for reading the vector to be updated and the corresponding gradient value from the external designated space, reading the moment vector from the data cache unit (4), updating the moment vector according to the vector to be updated and the corresponding gradient value, calculating a moment estimation vector, updating the vector to be updated, writing the updated moment vector into the data cache unit (4) and writing the updated vector to be updated into the external designated space;
wherein the data cache unit (4) initializes the first-order moment vector m_t and the second-order moment vector v_t at initialization; during each data update, the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} are read out and sent to the data processing module (5), updated in the data processing module (5) to the first-order moment vector m_t and the second-order moment vector v_t, and then written into the data cache unit (4).
2. The apparatus of claim 1, further comprising:
the direct memory access unit (1) is connected to the data processing module (5) and is used for accessing an external designated space, reading and writing data to the instruction cache unit (2) and the data processing module (5) and completing the loading and storage of the data;
and the instruction cache unit (2) is connected to the direct memory access unit (1) and the controller unit (3) and is used for reading the instruction through the direct memory access unit (1) and caching the read instruction for the controller unit (3) to read.
3. The apparatus according to claim 2, wherein the direct memory access unit (1) writes the instruction from the external designated space to the instruction cache unit (2), reads the parameter vector to be updated and the corresponding gradient value from the external designated space to the data processing module (5), and writes the updated parameter vector from the data processing module (5) directly into the external designated space.
4. The apparatus according to claim 2, wherein the controller unit (3) decodes the read instruction into microinstructions that control the behavior of the direct memory access unit (1), the data cache unit (4) or the data processing module (5), so as to control the direct memory access unit (1) to read data from and write data to the externally designated address, control the data cache unit (4) to obtain the data required by the operation from the externally designated address through the direct memory access unit (1), control the data processing module (5) to perform the update operation on the parameter to be updated, and control the data transmission between the data cache unit (4) and the data processing module (5).
5. The device according to claim 1, wherein during operation of the device the data cache unit (4) always stores a copy of the first-order moment vector m_t and the second-order moment vector v_t.
6. The apparatus according to claim 2, wherein the data processing module (5) reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit (4), and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step α and the exponential decay rates β_1 and β_2 from the external designated space through the direct memory access unit (1); then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment-estimate vectors m̂_t, v̂_t from m_t, v_t, and finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit (4) and writes θ_t into the external designated space through the direct memory access unit (1).
7. The device according to claim 6, characterized in that the data processing module (5) updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas

m_t = β_1·m_{t-1} + (1 − β_1)·∇f(θ_{t-1})
v_t = β_2·v_{t-1} + (1 − β_2)·∇f(θ_{t-1})²

the data processing module (5) computes the moment-estimate vectors m̂_t, v̂_t from m_t, v_t according to the formulas

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

wherein β_1, β_2 are the exponential decay rates; and the data processing module (5) updates the vector to be updated θ_{t-1} to θ_t according to the formula

θ_t = θ_{t-1} − α·m̂_t / √(v̂_t).
8. The apparatus according to claim 1 or 7, wherein the data processing module (5) comprises an operation control sub-module (51), and a vector addition parallel operation sub-module (52), a vector multiplication parallel operation sub-module (53), a vector division parallel operation sub-module (54), a vector square root parallel operation sub-module (55) and a basic operation sub-module (56) connected to the operation control sub-module (51), wherein the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56) are connected in parallel, and the operation control sub-module (51) is connected in series with the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56), respectively.
9. The apparatus of claim 8, wherein when the apparatus operates on vectors, the vector operations are element-wise, and when an operation is applied to a vector, elements at different positions are operated on in parallel.
10. A method of using the apparatus of claim 1, the method comprising:
reading an instruction by adopting a controller unit, and decoding the read instruction into a micro instruction for controlling the behavior of a data cache unit or a data processing module;
caching the moment vector in the initialization and data updating process by adopting a data caching unit;
reading a vector to be updated and a corresponding gradient value from an external designated space by adopting a data processing module, reading a moment vector from a data cache unit, updating the moment vector according to the vector to be updated and the corresponding gradient value, calculating a moment estimation vector, updating the vector to be updated, writing the updated moment vector into the data cache unit, and writing the updated vector to be updated into the external designated space;
the data buffer unit is initializedInitializing first order moment vector mtSecond order moment vector vtThe first order moment vector m is used in each data updating processt-1And a second moment vector vt-1Read out and send to the data processing module, and update to the first moment vector m in the data processing moduletAnd a second moment vector vtAnd then written into the data cache unit.
11. The method of claim 10, wherein the method comprises:
writing an instruction into the instruction cache unit from an external designated space by adopting a direct memory access unit, and caching the read instruction by adopting the instruction cache unit so as to be read by the controller unit;
accessing an external designated space using a direct memory access unit, reading the parameter to be updated and the corresponding gradient value to a data processing module, and
and directly writing the updated parameter vector into an external designated space from the data processing module by adopting a direct memory access unit.
12. The method of claim 11, characterized in that the method comprises:
the controller unit decodes the read instruction into a microinstruction which controls the behavior of the direct memory access unit, the data cache unit or the data processing module,
the direct memory access unit is controlled to read data from and write data to the external designated address,
the control data cache unit obtains the instruction required by the operation from the external designated address through the direct memory access unit,
control the data processing module to perform an update operation on the parameter to be updated, an
And controlling the data buffer unit to transmit data with the data processing module.
13. Method according to claim 12, characterized in that during operation the data cache unit (4) always stores a copy of the first-order moment vector m_t and the second-order moment vector v_t.
14. Method according to claim 11, characterized in that the data processing module (5) reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit (4), and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step α and the exponential decay rates β_1 and β_2 from the external designated space through the direct memory access unit (1); then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment-estimate vectors m̂_t, v̂_t from m_t, v_t, and finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit (4) and writes θ_t into the external designated space through the direct memory access unit (1).
15. Method according to claim 14, characterized in that the data processing module (5) updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas

m_t = β_1·m_{t-1} + (1 − β_1)·∇f(θ_{t-1})
v_t = β_2·v_{t-1} + (1 − β_2)·∇f(θ_{t-1})²

the data processing module (5) computes the moment-estimate vectors m̂_t, v̂_t from m_t, v_t according to the formulas

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

and the data processing module (5) updates the vector to be updated θ_{t-1} to θ_t according to the formula

θ_t = θ_{t-1} − α·m̂_t / √(v̂_t).
16. The method according to claim 10 or 15, characterized in that the data processing module (5) comprises an operation control sub-module (51), and a vector addition parallel operation sub-module (52), a vector multiplication parallel operation sub-module (53), a vector division parallel operation sub-module (54), a vector square root parallel operation sub-module (55) and a basic operation sub-module (56) connected to the operation control sub-module (51), wherein the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56) are connected in parallel, and the operation control sub-module (51) is connected in series with the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56), respectively.
17. The method of claim 10, wherein when operating on vectors, the vector operations are element-wise, and when an operation is applied to a vector, elements at different positions are operated on in parallel.
18. A method of using the apparatus of claim 1, comprising: initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β_1, β_2 and the learning step α, and obtaining the vector to be updated θ_0 from the external designated space;

when performing the gradient descent operation, updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient values ∇f(θ_{t-1}) and the exponential decay rates, then obtaining the bias-corrected moment-estimate vectors m̂_t and v̂_t through moment vector operations, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
19. The method of claim 18, wherein the initializing of the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β_1, β_2 and the learning step α, and the obtaining of the vector to be updated θ_0 from the external designated space, comprise:

an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit, the INSTRUCTION_IO instruction being used to drive the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space;

when the operation starts, the controller unit reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit and, according to the translated microinstruction, drives the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space and caches them in the instruction cache unit;

the controller unit reads a HYPERPARAMETER_IO instruction from the instruction cache unit and, according to the translated microinstruction, drives the direct memory access unit to read the global update step α, the exponential decay rates β_1, β_2 and the convergence threshold ct from the external designated space and send them to the data processing module;

the controller unit reads an assignment instruction from the instruction cache unit and, according to the translated microinstruction, drives the initialization of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit and sets the iteration count t in the data processing unit to 1;

the controller unit reads a DATA_IO instruction from the instruction cache unit and, according to the translated microinstruction, drives the direct memory access unit to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space and send them to the data processing module;

the controller unit reads a data transmission instruction from the instruction cache unit and, according to the translated microinstruction, sends the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit to the data processing unit.
20. The method of claim 18, wherein the updating of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient values ∇f(θ_{t-1}) and the exponential decay rates is realized according to the formulas

m_t = β_1·m_{t-1} + (1 − β_1)·∇f(θ_{t-1})
v_t = β_2·v_{t-1} + (1 − β_2)·∇f(θ_{t-1})²

and specifically includes:

the controller unit reads a moment vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}; in the update operation, the moment vector update instruction is sent to the operation control sub-module, which issues corresponding instructions to perform the following operations: it sends an INS_1 instruction to the basic operation sub-module, driving it to compute (1 − β_1) and (1 − β_2); it sends an INS_2 instruction to the vector multiplication parallel operation sub-module, driving it to compute ∇f(θ_{t-1})²; it then sends an INS_3 instruction to the vector multiplication parallel operation sub-module, driving it to compute β_1·m_{t-1}, (1 − β_1)·∇f(θ_{t-1}), β_2·v_{t-1} and (1 − β_2)·∇f(θ_{t-1})² simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively; then a_1 and a_2, and b_1 and b_2, are sent as pairs of inputs to the vector addition parallel operation sub-module, obtaining the updated first-order moment vector m_t and second-order moment vector v_t.
21. The method of claim 20, wherein after updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient values ∇f(θ_{t-1}) and the exponential decay rates, the method further comprises:

the controller unit reads a data transmission instruction from the instruction cache unit and, according to the translated microinstruction, transmits the updated first-order moment vector m_t and second-order moment vector v_t from the data processing unit to the data cache unit.
22. The method of claim 18, wherein obtaining the bias-corrected moment-estimate vectors m̂_t and v̂_t through moment vector operations is realized according to the formulas

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

and specifically includes:

the controller unit reads a moment-estimate vector operation instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control sub-module to compute the moment-estimate vectors; the operation control sub-module issues corresponding instructions to perform the following operations: it sends an INS_4 instruction to the basic operation sub-module, driving it to compute 1/(1 − β_1^t) and 1/(1 − β_2^t), and the iteration count t is incremented by 1; it then sends an INS_5 instruction to the vector multiplication parallel operation sub-module, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1 − β_1^t) and of the second-order moment vector v_t with 1/(1 − β_2^t), obtaining the bias-corrected estimate vectors m̂_t and v̂_t.
23. The method of claim 18, wherein updating the vector to be updated θ_{t-1} to θ_t is realized according to the formula

θ_t = θ_{t-1} − α·m̂_t / √(v̂_t)

and specifically includes:

the controller unit reads a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control sub-module to perform the following operations: it sends an INS_6 instruction to the basic operation sub-module, driving it to compute −α; it sends an INS_7 instruction to the vector square root parallel operation sub-module, driving it to compute √(v̂_t); it sends an INS_7 instruction to the vector division parallel operation sub-module, driving it to compute m̂_t/√(v̂_t); it sends an INS_8 instruction to the vector multiplication parallel operation sub-module, driving it to compute temp = −α·m̂_t/√(v̂_t); it sends an INS_9 instruction to the vector addition parallel operation sub-module, driving it to compute θ_{t-1} + temp, obtaining the updated parameter vector θ_t; wherein θ_{t-1} is θ_0 before the first cycle, and the t-th cycle updates θ_{t-1} to θ_t; it sends an INS_10 instruction to the vector division parallel operation sub-module, driving it to compute a vector whose elements are denoted temp_i; and it sends instructions INS_11 and INS_12 to the vector addition parallel operation sub-module and the basic operation sub-module respectively to compute sum = Σ_i temp_i and temp2 = sum/n, where i = 1, 2, 3, ..., n, n is the total number of cycles, and temp2 is the moving weighted average of the gradients.
24. The method of claim 23, wherein after updating the vector θ_{t-1} to be updated to θ_t, the method further comprises:
The controller unit reads a DATABACK_IO instruction from the instruction cache unit and, according to the translated microinstruction, transfers the updated parameter vector θ_t from the data processing unit to the externally designated address space through the direct memory access unit.
25. The method according to claim 18, wherein repeating the above process until the vector to be updated converges comprises determining whether the updated vector has converged, the specific determination process being as follows:
The controller unit reads a convergence determination instruction from the instruction cache unit, and the data processing module determines, according to the translated microinstruction, whether the updated parameter vector has converged: if temp2 < ct, convergence is reached and the operation ends; where temp2 is the moving weighted average of the gradients and ct is the convergence threshold.
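The convergence test in the claim reduces to averaging the accumulated per-cycle values and comparing against the threshold ct. A minimal sketch, assuming `temps` holds the per-cycle values temp_i that INS_11 accumulates and INS_12 averages (names are illustrative, not from the patent):

```python
def convergence_check(temps, ct):
    # INS_11: vector addition sub-module accumulates sum = temp_1 + ... + temp_n
    total = sum(temps)
    # INS_12: basic operation sub-module divides by the total number of cycles n
    n = len(temps)
    temp2 = total / n
    # Convergence determination: operation ends when temp2 falls below ct
    return temp2 < ct
```

This mirrors the claim's two-instruction split: one sub-module performs the reduction, the other the scalar division and comparison against the convergence threshold.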
CN201610269689.7A 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm Active CN107315570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610269689.7A CN107315570B (en) 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm

Publications (2)

Publication Number Publication Date
CN107315570A CN107315570A (en) 2017-11-03
CN107315570B true CN107315570B (en) 2021-06-18

Family

ID=60185643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610269689.7A Active CN107315570B (en) 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm

Country Status (1)

Country Link
CN (1) CN107315570B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109782392A (en) * 2019-02-27 2019-05-21 中国科学院光电技术研究所 A kind of fiber-optic coupling method based on modified random paralleling gradient descent algorithm
CN111460528B (en) * 2020-04-01 2022-06-14 支付宝(杭州)信息技术有限公司 Multi-party combined training method and system based on Adam optimization algorithm

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101931416A (en) * 2009-06-24 2010-12-29 中国科学院微电子研究所 Parallel hierarchical decoder for low density parity code (LDPC) in mobile digital multimedia broadcasting system
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN103956992A (en) * 2014-03-26 2014-07-30 复旦大学 Self-adaptive signal processing method based on multi-step gradient decrease
CN104360597A (en) * 2014-11-02 2015-02-18 北京工业大学 Sewage treatment process optimization control method based on multiple gradient descent
CN104376124A (en) * 2014-12-09 2015-02-25 西华大学 Clustering algorithm based on disturbance absorbing principle
CN104978282A (en) * 2014-04-04 2015-10-14 上海芯豪微电子有限公司 Cache system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8862206B2 (en) * 2009-11-12 2014-10-14 Virginia Tech Intellectual Properties, Inc. Extended interior methods and systems for spectral, optical, and photoacoustic imaging


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Adam: A Method for Stochastic Optimization; D. Kingma, J. Ba; ICLR 2015; 2015-01-23; full text *


Similar Documents

Publication Publication Date Title
CN111310904B (en) Apparatus and method for performing convolutional neural network training
CN111353589B (en) Apparatus and method for performing artificial neural network forward operations
US11574195B2 (en) Operation method
CN110929863B (en) Apparatus and method for performing LSTM operations
CN111353588B (en) Apparatus and method for performing artificial neural network reverse training
CN108268939B (en) Apparatus and method for performing LSTM neural network operations
CN111860813B (en) Device and method for performing forward operation of convolutional neural network
WO2017124648A1 (en) Vector computing device
JP5987233B2 (en) Apparatus, method, and system
EP3832499A1 (en) Matrix computing device
WO2018120016A1 (en) Apparatus for executing lstm neural network operation, and operational method
CN107766079B (en) Processor and method for executing instructions on processor
EP3451238A1 (en) Apparatus and method for executing pooling operation
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
CN107341132B (en) Device and method for executing AdaGrad gradient descent training algorithm
CN113222101A (en) Deep learning processing device, method, equipment and storage medium
WO2017185393A1 (en) Apparatus and method for executing inner product operation of vectors
WO2017185392A1 (en) Device and method for performing four fundamental operations of arithmetic of vectors
CN107315570B (en) Device and method for executing Adam gradient descent training algorithm
CN107315569B (en) Device and method for executing RMSprop gradient descent algorithm
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
CN107341540B (en) Device and method for executing Hessian-Free training algorithm
CN111860814B (en) Apparatus and method for performing batch normalization operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant