CN107315570B - Device and method for executing Adam gradient descent training algorithm - Google Patents


Info

Publication number
CN107315570B
Authority
CN
China
Prior art keywords
vector
instruction
module
updated
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610269689.7A
Other languages
Chinese (zh)
Other versions
CN107315570A (en)
Inventor
郭崎
刘少礼
陈天石
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201610269689.7A priority Critical patent/CN107315570B/en
Publication of CN107315570A publication Critical patent/CN107315570A/en
Application granted granted Critical
Publication of CN107315570B publication Critical patent/CN107315570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching
    • G06F9/3832Value prediction for operands; operand history buffers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Liquid Crystal Display Device Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The disclosure provides a device and method for executing an Adam gradient descent training algorithm. The device comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The method comprises the following steps: first, a gradient vector and the vector of values to be updated are read in, and the first-order moment vector, the second-order moment vector and the corresponding exponential decay rates are initialized; during each iteration, the first-order moment vector and the second-order moment vector are updated using the gradient vector, the biased first-order and second-order moment estimate vectors are calculated, and the parameter to be updated is updated using these estimate vectors; training continues until the parameter vector to be updated converges. With this method and device, the Adam gradient descent algorithm can be applied and data processing efficiency is greatly improved.

Description

Device and method for executing Adam gradient descent training algorithm
Technical Field
The disclosure relates to the technical field of Adam algorithm applications, and in particular to a device and a method for executing an Adam gradient descent training algorithm, concerning the hardware implementation of the Adam gradient descent optimization algorithm.
Background
The Adam algorithm is one of the gradient descent optimization algorithms. Because it is easy to implement, computationally light, small in required storage space, and invariant to rescaling of the gradients, it is widely used, and implementing it with a dedicated device can remarkably improve its execution speed.
Currently, one known method of performing the Adam gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general instructions with a general register file and general functional units. One disadvantage of this approach is the low computational performance of a single general-purpose processor. When multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding brings a large power consumption overhead.
Another known method of performing the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose single-instruction multiple-data (SIMD) instructions with a general register file and general stream processing units. Because the GPU is a device designed for graphics operations and scientific computing, it has no dedicated support for the operations of the Adam gradient descent algorithm, so a large amount of front-end decoding work is still required to perform them, bringing considerable additional overhead. In addition, the GPU has only a small on-chip cache; data required in the operation (such as the first-order and second-order moment vectors) must be repeatedly transferred off-chip, so off-chip bandwidth becomes the main performance bottleneck while incurring huge power consumption overhead.
Disclosure of Invention
Technical problem to be solved
In view of the above, the present disclosure provides a device and a method for executing an Adam gradient descent training algorithm, to solve the problems of insufficient arithmetic performance and high front-end decoding overhead in general-purpose processors, to avoid repeatedly reading data from memory, and to reduce memory-access bandwidth pressure.
(II) Technical solution
To achieve the above object, the present disclosure provides an apparatus for executing Adam gradient descent training algorithm, the apparatus including a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
the direct memory access unit 1 is used for accessing an external designated space, reading and writing data to the instruction cache unit 2 and the data processing module 5, and completing the loading and storage of the data;
the instruction cache unit 2 is used for reading the instruction through the direct memory access unit 1 and caching the read instruction;
the controller unit 3 is used for reading the instruction from the instruction cache unit 2 and decoding the read instruction into a microinstruction for controlling the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5;
the data cache unit 4 is used for caching the first-order moment vectors and the second-order moment vectors in the initialization and data updating processes;
and the data processing module 5 is used for updating the moment vector, calculating the moment estimation vector, updating the vector to be updated, writing the updated moment vector into the data cache unit 4, and writing the updated vector to be updated into an external designated space through the direct memory access unit 1.
In the above scheme, the direct memory access unit 1 writes an instruction into the instruction cache unit 2 from an external designated space, reads a parameter to be updated and a corresponding gradient value from the external designated space to the data processing module 5, and directly writes an updated parameter vector into the external designated space from the data processing module 5.
In the above scheme, the controller unit 3 decodes the read instruction into microinstructions for controlling the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5, so as to control the direct memory access unit 1 to read data from and write data to the external designated address, control the data cache unit 4 to obtain the data required by the operation from the external designated address through the direct memory access unit 1, control the data processing module 5 to perform the update operation on the parameter to be updated, and control the data transmission between the data cache unit 4 and the data processing module 5.
In the above scheme, the data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t during initialization; during each data update, the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} are read out and sent to the data processing module 5, updated there to m_t and v_t, and then written back into the data cache unit 4.
In the above scheme, during the operation of the device, a copy of the first-order moment vector m_t and the second-order moment vector v_t is always kept in the data cache unit 4.
In the above scheme, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step α and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, and finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit 4, and writes θ_t to the external designated space through the direct memory access unit 1.
In the above scheme, the data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas

    m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1})
    v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})²

computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t according to the formulas

    m̂_t = m_t / (1 - β1^t),  v̂_t = v_t / (1 - β2^t)

and updates the vector to be updated θ_{t-1} to θ_t according to the formula

    θ_t = θ_{t-1} - α·m̂_t / √v̂_t
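As a hedged reference sketch, the three formula groups above can be modeled in plain Python as one update step. The names (`adam_step`, `grad`, etc.) are illustrative, not from the patent, and the ε smoothing term found in other Adam formulations is omitted because the formulas here do not use it:

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999):
    """One Adam update on plain Python lists; every operation is element-wise."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]        # m_t
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]   # v_t
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # biased first-moment estimate
    v_hat = [vi / (1 - beta2 ** t) for vi in v]   # biased second-moment estimate
    theta = [p - alpha * mh / math.sqrt(vh)       # theta_t (no epsilon term here)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

The device applies the same arithmetic, but each list comprehension corresponds to an element-wise parallel operation sub-module rather than a sequential loop.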
In the above solution, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55 and a basic operation sub-module 56; these five operation sub-modules are connected in parallel, and the operation control sub-module 51 is connected in series with each of them. All vector calculations performed by the device are element-wise, and when a given operation is executed on a vector, the elements at different positions are computed in parallel.
To achieve the above object, the present disclosure also provides a method for executing an Adam gradient descent training algorithm, the method comprising:
initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1, β2 and the learning step α, and obtaining the vector to be updated θ_0 from an external designated space;
when the gradient descent operation is performed, updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, then obtaining the biased moment estimate vectors m̂_t and v̂_t through moment vector operations;
finally, updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
In the above scheme, initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1, β2 and the learning step α, and obtaining the vector to be updated θ_0 from an external designated space, comprises the following steps:
In step S1, an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space.
In step S2, when the operation starts, the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space and cache them in the instruction cache unit 2.
In step S3, the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read the global update step α, the exponential decay rates β1, β2 and the convergence threshold c_t from the external designated space, which are then sent to the data processing module 5.
In step S4, the controller unit 3 reads in an assignment instruction from the instruction cache unit 2 and, according to the translated microinstruction, initializes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing module 5 to 1.
In step S5, the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space, which are then sent to the data processing module 5.
In step S6, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the translated microinstruction, writes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 into the data processing module 5.
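Purely as an illustration, steps S1 to S6 can be modeled as a host-side setup routine. The dictionary layout and every name below are assumptions of this sketch, not the device's instruction set:

```python
def run_setup(external):
    """Software model of steps S1-S6; `external` stands in for the external designated space."""
    state = {}
    # S1-S2: INSTRUCTION_IO fetches the instruction stream (modeled implicitly here).
    # S3: HYPERPARAMETER_IO reads the update step, decay rates and threshold c_t.
    state["alpha"] = external["alpha"]
    state["beta1"], state["beta2"] = external["beta1"], external["beta2"]
    state["c_t"] = external["c_t"]
    # S4: the assignment instruction zero-initializes the moment vectors and sets t = 1.
    n = len(external["theta"])
    state["m"], state["v"], state["t"] = [0.0] * n, [0.0] * n, 1
    # S5: DATA_IO reads the parameter vector to be updated and its gradient.
    state["theta"] = list(external["theta"])
    state["grad"] = list(external["grad"])
    # S6: the data transfer instruction moves m and v into the data processing module.
    return state
```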
In the above scheme, updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates is performed according to the formulas

    m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1})
    v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})²

and specifically includes the following: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the data cache unit 4 to supply the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}; a moment vector update instruction is sent to the operation control sub-module 51, which sends the corresponding instructions to perform the following operations: it sends an INS_1 instruction to the basic operation sub-module 56, driving it to calculate (1 - β1) and (1 - β2); it sends an INS_2 instruction to the vector multiplication parallel operation sub-module 53, driving it to calculate the element-wise square ∇f(θ_{t-1})²; it then sends an INS_3 instruction to the vector multiplication parallel operation sub-module 53, driving it to simultaneously calculate β1·m_{t-1}, (1 - β1)·∇f(θ_{t-1}), β2·v_{t-1} and (1 - β2)·∇f(θ_{t-1})², the results being denoted a1, a2, b1 and b2 respectively; finally, a1 and a2, and b1 and b2, are fed as the two inputs of the vector addition parallel operation sub-module 52 to obtain the updated first-order moment vector m_t = a1 + a2 and second-order moment vector v_t = b1 + b2.
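The INS_1 to INS_3 decomposition above can be sketched in software as follows. This is a hedged model: each list comprehension stands in for one element-wise parallel sub-module, and all names other than a1, a2, b1 and b2 are assumptions:

```python
def update_moments(m_prev, v_prev, grad, beta1=0.9, beta2=0.999):
    """Moment update decomposed the way the sub-modules are driven."""
    c1, c2 = 1 - beta1, 1 - beta2               # INS_1: basic operation sub-module
    g2 = [g * g for g in grad]                  # INS_2: element-wise gradient square
    a1 = [beta1 * mi for mi in m_prev]          # INS_3: four products in parallel
    a2 = [c1 * g for g in grad]
    b1 = [beta2 * vi for vi in v_prev]
    b2 = [c2 * s for s in g2]
    m_t = [x + y for x, y in zip(a1, a2)]       # vector addition: m_t = a1 + a2
    v_t = [x + y for x, y in zip(b1, b2)]       # vector addition: v_t = b1 + b2
    return m_t, v_t
```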
In the above scheme, after updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, the method further comprises: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the translated microinstruction, writes the updated first-order moment vector m_t and second-order moment vector v_t from the data processing module 5 into the data cache unit 4.
In the above scheme, obtaining the biased moment estimate vectors m̂_t and v̂_t through moment vector operations is performed according to the formulas

    m̂_t = m_t / (1 - β1^t),  v̂_t = v_t / (1 - β2^t)

and specifically includes the following: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control sub-module 51 to calculate the moment estimate vectors, which sends the corresponding instructions to perform the following operations: the operation control sub-module 51 sends an instruction INS_4 to the basic operation sub-module 56, driving it to calculate 1/(1 - β1^t) and 1/(1 - β2^t) and to increment the iteration count t by 1; the operation control sub-module 51 then sends an instruction INS_5 to the vector multiplication parallel operation sub-module 53, driving it to calculate in parallel the product of the first-order moment vector m_t with 1/(1 - β1^t) and the product of the second-order moment vector v_t with 1/(1 - β2^t), yielding the biased estimate vectors m̂_t and v̂_t.
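A minimal software model of this bias-correction step follows; the scalar names s1 and s2 are assumptions of the sketch:

```python
def bias_correct(m_t, v_t, t, beta1=0.9, beta2=0.999):
    """Scale the moment vectors into the biased moment estimates."""
    s1 = 1.0 / (1 - beta1 ** t)        # INS_4: scalars from the basic operation sub-module
    s2 = 1.0 / (1 - beta2 ** t)        # (the iteration count t is incremented alongside)
    m_hat = [s1 * mi for mi in m_t]    # INS_5: two element-wise products in parallel
    v_hat = [s2 * vi for vi in v_t]
    return m_hat, v_hat
```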
in the above scheme, the vector θ to be updated is updatedt-1Is thetatIs according to the formula
Figure GDA00016557541400000517
The implementation specifically includes: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2, and drives the operation control sub-module 51 to perform the following operations according to the translated microinstruction: the operation control sub-module 51 sends an instruction INS _6 to the basic operation sub-module 56, and drives the basic operation sub-module 56 to calculate-alpha; the operation control sub-module 51 sends an instruction INS _7 to the vector square root parallel operation sub-module 55 to drive the operation to obtain
Figure GDA0001655754140000061
The operation control sub-module 51 sends an instruction INS _7 to the vector division parallel operation sub-module 54 to drive the operation to obtain
Figure GDA0001655754140000062
The operation control sub-module 51 sends an instruction INS _8 to the vector multiplication parallel operation sub-module 53 to drive the operation to obtain
Figure GDA0001655754140000066
The operation control sub-module 51 sends an instruction INS _9 to the vector addition parallel operation sub-module 52 to drive the calculation thereof
Figure GDA0001655754140000064
Obtaining an updated parameter vector thetat(ii) a Wherein, thetat-1Is theta0Not updated before the t-th cycle, the t-th cycle will be θt-1Is updated to thetat(ii) a The operation control sub-module 51 sends an instruction INS _10 to the vector division parallel operation sub-module 54 to drive the operation thereof to obtain a vector
Figure GDA0001655754140000065
The arithmetic control sub-module 51 sends instructions INS _11 and INS _12 to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively to calculate sum sigmat tempt、temp2=sum/n。
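The INS_6 to INS_12 sequence can be sketched as follows. This is a hedged model: the exact operand of the INS_10 division that produces temp_t is not fully recoverable from the text, so the per-element step magnitude is used here as a stand-in convergence measure:

```python
import math

def update_params(theta_prev, m_hat, v_hat, alpha=0.001):
    """Parameter update plus the temp2 convergence metric."""
    root = [math.sqrt(vh) for vh in v_hat]               # INS_7: square-root sub-module
    quot = [mh / r for mh, r in zip(m_hat, root)]        # INS_7: division sub-module
    step = [-alpha * q for q in quot]                    # INS_8: scale by -alpha (from INS_6)
    theta = [p + s for p, s in zip(theta_prev, step)]    # INS_9: theta_t = theta_{t-1} + step
    temp_t = [abs(s) for s in step]                      # INS_10 stand-in (assumption)
    temp2 = sum(temp_t) / len(temp_t)                    # INS_11 / INS_12: sum and sum / n
    return theta, temp2
```

temp2 is then compared against the convergence threshold c_t, as in the determination step below.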
In the above scheme, after updating the vector θ_{t-1} to θ_t, the method further comprises: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, the updated parameter vector θ_t is written from the data processing module 5 to the external designated space through the direct memory access unit 1.
In the above scheme, repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged, and the specific determination process is: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the translated microinstruction, the data processing module 5 judges whether the updated parameter vector has converged; if temp2 < c_t, it has converged and the operation ends.
(III) Advantageous effects
According to the technical scheme, the method has the following beneficial effects:
1. according to the device and the method for executing the Adam gradient descent training algorithm, the device special for executing the Adam gradient descent training algorithm is adopted, the problems that a general processor of data is insufficient in operation performance and the front-section decoding cost is high can be solved, and the execution speed of related applications is accelerated.
2. In the device and method for executing the Adam gradient descent training algorithm, the moment vectors required in the intermediate process are temporarily stored in the data cache unit, which avoids repeatedly reading data from memory, reduces IO operations between the device and the external address space, and reduces memory-access bandwidth pressure.
3. According to the device and the method for executing the Adam gradient descent training algorithm, the data processing module adopts the related parallel operation sub-module to perform vector operation, so that the parallel degree is greatly improved.
4. According to the device and the method for executing the Adam gradient descent training algorithm, the data processing module adopts the related parallel operation sub-module to perform vector operation, the operation parallelism degree is high, the working frequency is low, and the power consumption overhead is low.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
fig. 1 illustrates an example block diagram of the overall structure of an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present disclosure.
Fig. 2 illustrates an example block diagram of a data processing module in an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of this disclosure.
Fig. 3 shows a flow diagram of a method for executing an Adam gradient descent training algorithm in accordance with an embodiment of the present disclosure.
Like devices, components, units, etc. are designated with like reference numerals throughout the drawings.
Detailed Description
Other aspects, advantages, and salient features of the disclosure will become apparent to those skilled in the art from the following detailed description of the disclosed embodiments, which taken in conjunction with the annexed drawings.
In the present disclosure, the terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below which are used to describe the principles of the present disclosure are by way of illustration only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The device and method for executing the Adam gradient descent training algorithm according to embodiments of the disclosure are used to accelerate applications of the Adam gradient descent algorithm. First, the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1, β2 and the learning step α are initialized, and the vector to be updated θ_0 is obtained from an external designated space. Each time the gradient descent operation is performed, the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} are updated using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, i.e.

    m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1})
    v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})²

then the biased moment estimate vectors m̂_t and v̂_t are obtained through moment vector operations, i.e.

    m̂_t = m_t / (1 - β1^t),  v̂_t = v_t / (1 - β2^t)

and finally the vector to be updated θ_{t-1} is updated to θ_t and output, i.e.

    θ_t = θ_{t-1} - α·m̂_t / √v̂_t

where θ_{t-1} is the value of the parameter vector (initially θ_0) before the t-th cycle, and the t-th cycle updates θ_{t-1} to θ_t. This process is repeated until the vector to be updated converges.
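The iteration described above can be sketched as a reference training loop. This is illustrative only: `grad_fn`, the function name and the use of the mean step magnitude as the convergence metric are assumptions of this sketch:

```python
import math

def adam_train(theta, grad_fn, alpha=0.001, beta1=0.9, beta2=0.999,
               c_t=1e-8, max_iter=10000):
    """Iterate the Adam update until the mean step magnitude drops below c_t."""
    n = len(theta)
    m = [0.0] * n                                  # first-order moment vector m_0
    v = [0.0] * n                                  # second-order moment vector v_0
    for t in range(1, max_iter + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        step = [-alpha * mh / math.sqrt(vh) for mh, vh in zip(m_hat, v_hat)]
        theta = [p + s for p, s in zip(theta, step)]
        if sum(abs(s) for s in step) / n < c_t:    # convergence check (stand-in metric)
            break
    return theta
```

For example, minimizing f(x) = (x - 3)² with `grad_fn = lambda th: [2.0 * (th[0] - 3.0)]` moves theta toward 3 at roughly α per iteration early on.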
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the Adam gradient descent algorithm according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may all be implemented by hardware circuits.
And the direct memory access unit 1 is used for accessing an external designated space, reading and writing data to the instruction cache unit 2 and the data processing module 5, and completing the loading and storage of the data. Specifically, an instruction is written into the instruction cache unit 2 from an external designated space, a parameter to be updated and a corresponding gradient value are read from the external designated space to the data processing module 5, and an updated parameter vector is directly written into the external designated space from the data processing module 5.
And the instruction cache unit 2 is used for reading the instruction through the direct memory access unit 1 and caching the read instruction.
The controller unit 3 is configured to read an instruction from the instruction cache unit 2, decode the read instruction into microinstructions for controlling the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5, and send each microinstruction to the direct memory access unit 1, the data cache unit 4 or the data processing module 5, so as to control the direct memory access unit 1 to read data from and write data to the external designated address, control the data cache unit 4 to obtain the data required by the operation from the external designated address through the direct memory access unit 1, control the data processing module 5 to perform the update operation on the parameter to be updated, and control the data transmission between the data cache unit 4 and the data processing module 5.
The data cache unit 4 is used for caching the first-order moment vectors and the second-order moment vectors during initialization and the data update process. Specifically, the data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t during initialization; during each data update, it reads out the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4. During the operation of the device, a copy of the first-order moment vector m_t and the second-order moment vector v_t is always kept in the data cache unit 4. In the disclosure, because the moment vectors required in the intermediate process are temporarily stored in the data cache unit, repeated reading of data from memory is avoided, IO operations between the device and the external address space are reduced, and memory-access bandwidth pressure is reduced.
The data processing module 5 is used for updating the moment vectors, calculating the moment-estimate vectors, updating the vector to be updated, writing the updated moment vectors into the data cache unit 4, and writing the updated vector to be updated into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, its gradient vector ∇f(θ_{t-1}), the update step α and the exponential decay rates β_1 and β_2 from the external designated space through the direct memory access unit 1. It then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, i.e.

m_t = β_1·m_{t-1} + (1 − β_1)·∇f(θ_{t-1})
v_t = β_2·v_{t-1} + (1 − β_2)·∇f(θ_{t-1})²  (element-wise square)

computes the bias-corrected moment-estimate vectors m̂_t, v̂_t from m_t, v_t, i.e.

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

and finally updates the vector to be updated θ_{t-1} to θ_t, i.e.

θ_t = θ_{t-1} − α·m̂_t / √(v̂_t)

It then writes m_t, v_t into the data cache unit 4 and writes θ_t into the external designated space through the direct memory access unit 1. In the disclosure, because the data processing module performs the vector operations with parallel operation sub-modules, the degree of parallelism is greatly improved while the operating frequency and the power consumption overhead remain low.
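Putting the formulas above together, one iteration of the data processing module can be sketched in plain Python. This is a software model of the data flow, not the hardware device; the function name and the list-based vectors are ours, and, like the patent's formulas, it omits the small epsilon that standard Adam adds to the denominator, so v̂_t must stay nonzero here.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999):
    """One Adam update over plain Python lists, mirroring the module's
    data flow: update the moment vectors, bias-correct them, then update
    the parameter vector."""
    # moment vector update: m_t, v_t from m_{t-1}, v_{t-1} and the gradient
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # bias-corrected moment estimates
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # parameter update: theta_t = theta_{t-1} - alpha * m_hat / sqrt(v_hat)
    theta = [ti - alpha * mh / math.sqrt(vh)
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

At t = 1 the bias correction makes m̂_t equal the gradient itself, so the very first step has magnitude exactly α in each coordinate, which is a handy sanity check for the model.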
Fig. 2 illustrates an example block diagram of a data processing module in an apparatus for implementing applications related to the Adam gradient descent algorithm in accordance with an embodiment of this disclosure. As shown in fig. 2, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55 and a basic operation sub-module 56, wherein the vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55 and the basic operation sub-module 56 are connected in parallel, and the operation control sub-module 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and when an operation is applied to a vector, the elements at different positions are processed in parallel.
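The element-wise character of these vector operations can be modeled in a short Python sketch (ours, not from the patent): each index is an independent task, which is exactly what allows the hardware sub-modules to process different positions in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
import operator

def elementwise(op, xs, ys):
    # Each index is independent of the others, so the per-element
    # applications can run concurrently -- modeling how the parallel
    # sub-modules treat a vector.  (A thread pool only illustrates the
    # independence; it brings no speed-up for Python floats.)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(op, xs, ys))

sums = elementwise(operator.add, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
products = elementwise(operator.mul, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```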
Fig. 3 shows a flow chart of a method for executing an Adam gradient descent training algorithm according to an embodiment of the present disclosure, specifically comprising the steps of:
In step S1, an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the instruction prefetch instruction is used to drive the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space.
In step S2, when the operation starts, the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space and cache them in the instruction cache unit 2.
In step S3, the controller unit 3 reads a hyper-parameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2, and according to the translated microinstruction drives the direct memory access unit 1 to read the global update step α, the exponential decay rates β_1 and β_2, and the convergence threshold ct from the external designated space and send them to the data processing module 5.
In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2, and according to the translated microinstruction drives the initialization of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing unit 5 to 1.
In step S5, the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2, and according to the translated microinstruction drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space and send them to the data processing module 5.
In step S6, the controller unit 3 reads a data transmission instruction from the instruction cache unit 2, and according to the translated microinstruction sends the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 to the data processing unit 5.
In step S7, the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2, and according to the translated microinstruction drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}. In the update operation, the moment vector update instruction is sent to the operation control sub-module 51, which issues corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1 − β_1) and (1 − β_2); it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute ∇f(θ_{t-1})², the element-wise square of the gradient vector; it then sends operation instruction 3 (INS_3) to the vector multiplication parallel operation sub-module 53, driving it to compute β_1·m_{t-1}, (1 − β_1)·∇f(θ_{t-1}), β_2·v_{t-1} and (1 − β_2)·∇f(θ_{t-1})² simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively. Then a_1 and a_2, and b_1 and b_2, are sent as pairs of inputs to the vector addition parallel operation sub-module 52, yielding the updated first-order moment vector m_t = a_1 + a_2 and second-order moment vector v_t = b_1 + b_2.
In step S8, the controller unit 3 reads a data transmission instruction from the instruction cache unit 2, and according to the translated microinstruction transmits the updated first-order moment vector m_t and second-order moment vector v_t from the data processing unit 5 into the data cache unit 4.
In step S9, the controller unit 3 reads a moment-estimate vector operation instruction from the instruction cache unit 2, and according to the translated microinstruction drives the operation control sub-module 51 to compute the moment-estimate vectors. The operation control sub-module 51 issues corresponding instructions to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute 1/(1 − β_1^t) and 1/(1 − β_2^t), and the iteration count t is incremented by 1; it then sends operation instruction 5 (INS_5) to the vector multiplication parallel operation sub-module 53, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1 − β_1^t) and of the second-order moment vector v_t with 1/(1 − β_2^t), yielding the bias-corrected estimate vectors m̂_t = m_t/(1 − β_1^t) and v̂_t = v_t/(1 − β_2^t).
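Step S9 can be sketched the same way (a software model; the function name and the INS_4/INS_5 mapping in the comments are ours):

```python
def bias_correct(m_t, v_t, t, beta1, beta2):
    """Step S9: the basic sub-module computes the two correction scalars,
    then the vector-multiply sub-module scales both moment vectors."""
    s1 = 1.0 / (1.0 - beta1 ** t)   # INS_4, basic operation sub-module
    s2 = 1.0 / (1.0 - beta2 ** t)
    m_hat = [x * s1 for x in m_t]   # INS_5, both scalings in parallel
    v_hat = [x * s2 for x in v_t]
    return m_hat, v_hat
```

At t = 1 the scalars are 1/(1 − β_1) and 1/(1 − β_2), so the corrected estimates recover the raw gradient and squared gradient exactly.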
In step S10, the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2, and according to the translated microinstruction drives the operation control sub-module 51 to perform the following operations: it sends operation instruction 6 (INS_6) to the basic operation sub-module 56, driving it to compute −α; it sends operation instruction 7 (INS_7) to the vector square root parallel operation sub-module 55, driving it to compute √(v̂_t); it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute m̂_t/√(v̂_t); it sends operation instruction 8 (INS_8) to the vector multiplication parallel operation sub-module 53, driving it to compute temp = −α·m̂_t/√(v̂_t); and it sends operation instruction 9 (INS_9) to the vector addition parallel operation sub-module 52, driving it to compute θ_{t-1} + temp, obtaining the updated parameter vector θ_t. Here θ_{t-1} is θ_0 before the first cycle, and the t-th cycle updates θ_{t-1} to θ_t. The operation control sub-module 51 then sends operation instruction 10 (INS_10) to the vector division parallel operation sub-module 54, driving it to compute a vector whose elements are denoted temp_i, and sends operation instruction 11 (INS_11) and operation instruction 12 (INS_12) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively to compute sum = Σ_i temp_i and temp2 = sum/n.
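The parameter update of step S10 can be sketched as follows. This is a software model with names of our choosing; in particular, the original formula behind INS_10 is an image lost in extraction, so the convergence statistic temp2 is computed here under our reading that it averages the per-element update magnitudes.

```python
import math

def parameter_update(theta_prev, m_hat, v_hat, alpha):
    """Step S10 decomposition (INS_6..INS_9), plus the convergence
    statistic of INS_10..INS_12 under an assumed per-element reading."""
    neg_alpha = -alpha                                  # INS_6, basic sub-module
    root = [math.sqrt(x) for x in v_hat]                # INS_7, square-root sub-module
    ratio = [m / r for m, r in zip(m_hat, root)]        # INS_7, division sub-module
    temp = [neg_alpha * x for x in ratio]               # INS_8, multiply sub-module
    theta = [t + d for t, d in zip(theta_prev, temp)]   # INS_9, addition sub-module
    # INS_10..INS_12 (assumed form): average magnitude of the update
    temp2 = sum(abs(d) for d in temp) / len(temp)
    return theta, temp2
```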
In step S11, the controller unit 3 reads a to-be-updated vector write-back instruction (DATABACK_IO) from the instruction cache unit 2, and according to the translated microinstruction transmits the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
In step S12, the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2, and according to the translated microinstruction the data processing module 5 judges whether the updated parameter vector has converged: if temp2 < ct, convergence is reached and the operation ends; otherwise, the flow returns to step S5 and continues.
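The loop of steps S5–S12 can be sketched end to end on a scalar example (a software model; the function and stopping rule are our reading of the flow, grad_fn stands in for the externally supplied gradients, and a small epsilon is added inside the square root as a guard that the patent's formulas omit):

```python
import math

def adam_minimize(theta0, grad_fn, alpha=0.1, beta1=0.9, beta2=0.999,
                  ct=1e-6, max_iter=10000):
    """Steps S5-S12 as a loop: repeat the Adam update until the update
    magnitude drops below the convergence threshold ct."""
    theta, m, v, t = theta0, 0.0, 0.0, 1
    for _ in range(max_iter):
        g = grad_fn(theta)                        # step S5: fetch gradient
        m = beta1 * m + (1 - beta1) * g           # step S7: moment update
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)              # step S9: bias correction
        v_hat = v / (1 - beta2 ** t)
        step = -alpha * m_hat / math.sqrt(v_hat + 1e-12)  # step S10
        theta += step
        t += 1
        if abs(step) < ct:                        # step S12: convergence test
            break
    return theta

# minimize f(x) = x^2, whose gradient is 2x
theta = adam_minimize(1.0, lambda x: 2 * x)
```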
By employing a device dedicated to executing the Adam gradient descent training algorithm, the present disclosure addresses the insufficient operation performance of general-purpose processors and the high overhead of front-end decoding, and accelerates the execution of related applications. Meanwhile, the data cache unit avoids repeated reads of data from memory and reduces the memory-access bandwidth required.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination thereof. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order, and some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present disclosure have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (25)

1. An apparatus for performing an Adam gradient descent training algorithm, the apparatus comprising:
the controller unit (3) is used for reading an instruction and decoding the read instruction into a micro instruction for controlling the behavior of the data cache unit (4) or the data processing module (5);
the data caching unit (4) is used for caching the moment vectors in the processes of initialization and data updating;
the data processing module (5) is connected to the controller unit (3) and the data cache unit (4) and is used for reading the vector to be updated and the corresponding gradient value from the external designated space, reading the moment vector from the data cache unit (4), updating the moment vector according to the vector to be updated and the corresponding gradient value, calculating a moment estimation vector, updating the vector to be updated, writing the updated moment vector into the data cache unit (4) and writing the updated vector to be updated into the external designated space;
wherein the data cache unit (4) initializes the first-order moment vector m_t and the second-order moment vector v_t at initialization; during each data update, the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} are read out and sent to the data processing module (5), updated in the data processing module (5) to the first-order moment vector m_t and the second-order moment vector v_t, and then written into the data cache unit (4).
2. The apparatus of claim 1, further comprising:
the direct memory access unit (1) is connected to the data processing module (5) and is used for accessing an external designated space, reading and writing data to the instruction cache unit (2) and the data processing module (5) and completing the loading and storage of the data;
and the instruction cache unit (2) is connected to the direct memory access unit (1) and the controller unit (3) and is used for reading the instruction through the direct memory access unit (1) and caching the read instruction for the controller unit (3) to read.
3. The apparatus according to claim 2, wherein the direct memory access unit (1) writes the instruction from the external designated space to the instruction cache unit (2), reads the parameter vector to be updated and the corresponding gradient value from the external designated space to the data processing module (5), and writes the updated parameter vector from the data processing module (5) directly into the external designated space.
4. The apparatus according to claim 2, wherein the controller unit (3) decodes the read instruction into microinstructions that control the behavior of the direct memory access unit (1), the data cache unit (4) or the data processing module (5), so as to control the direct memory access unit (1) to read data from and write data to the externally designated address, control the data cache unit (4) to obtain the data required by the operation from the externally designated address through the direct memory access unit (1), control the data processing module (5) to perform the update operation on the parameter to be updated, and control the data transmission between the data cache unit (4) and the data processing module (5).
5. The device according to claim 1, wherein during operation of the device the data cache unit (4) always stores a copy of the first-order moment vector m_t and the second-order moment vector v_t.
6. The apparatus according to claim 2, wherein the data processing module (5) reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit (4), and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step α and the exponential decay rates β_1 and β_2 from the external designated space through the direct memory access unit (1); then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment-estimate vectors m̂_t, v̂_t from m_t, v_t, and finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit (4) and writes θ_t into the external designated space through the direct memory access unit (1).
7. The device according to claim 6, characterized in that the data processing module (5) updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas

m_t = β_1·m_{t-1} + (1 − β_1)·∇f(θ_{t-1})
v_t = β_2·v_{t-1} + (1 − β_2)·∇f(θ_{t-1})²

the data processing module (5) computes the moment-estimate vectors m̂_t, v̂_t from m_t, v_t according to the formulas

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

wherein β_1, β_2 are the exponential decay rates; and the data processing module (5) updates the vector to be updated θ_{t-1} to θ_t according to the formula

θ_t = θ_{t-1} − α·m̂_t / √(v̂_t).
8. The apparatus according to claim 1 or 7, wherein the data processing module (5) comprises an operation control sub-module (51), and a vector addition parallel operation sub-module (52), a vector multiplication parallel operation sub-module (53), a vector division parallel operation sub-module (54), a vector square root parallel operation sub-module (55) and a basic operation sub-module (56) connected to the operation control sub-module (51), wherein the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56) are connected in parallel, and the operation control sub-module (51) is connected in series with the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56), respectively.
9. The apparatus of claim 8, wherein when the apparatus operates on vectors, the vector operations are element-wise, and when an operation is applied to a vector, elements at different positions are operated on in parallel.
10. A method of using the apparatus of claim 1, the method comprising:
reading an instruction by adopting a controller unit, and decoding the read instruction into a micro instruction for controlling the behavior of a data cache unit or a data processing module;
caching the moment vector in the initialization and data updating process by adopting a data caching unit;
reading a vector to be updated and a corresponding gradient value from an external designated space by adopting a data processing module, reading a moment vector from a data cache unit, updating the moment vector according to the vector to be updated and the corresponding gradient value, calculating a moment estimation vector, updating the vector to be updated, writing the updated moment vector into the data cache unit, and writing the updated vector to be updated into the external designated space;
the data buffer unit is initializedInitializing first order moment vector mtSecond order moment vector vtThe first order moment vector m is used in each data updating processt-1And a second moment vector vt-1Read out and send to the data processing module, and update to the first moment vector m in the data processing moduletAnd a second moment vector vtAnd then written into the data cache unit.
11. The method of claim 10, wherein the method comprises:
writing an instruction into the instruction cache unit from an external designated space by adopting a direct memory access unit, and caching the read instruction by adopting the instruction cache unit so as to be read by the controller unit;
accessing an external designated space using a direct memory access unit, reading the parameter to be updated and the corresponding gradient value to a data processing module, and
and directly writing the updated parameter vector into an external designated space from the data processing module by adopting a direct memory access unit.
12. The method of claim 11, characterized in that the method comprises:
the controller unit decodes the read instruction into a microinstruction which controls the behavior of the direct memory access unit, the data cache unit or the data processing module,
the direct memory access unit is controlled to read data from and write data to the external designated address,
the control data cache unit obtains the instruction required by the operation from the external designated address through the direct memory access unit,
control the data processing module to perform an update operation on the parameter to be updated, an
And controlling the data buffer unit to transmit data with the data processing module.
13. Method according to claim 12, characterized in that during operation the data cache unit (4) always stores a copy of the first-order moment vector m_t and the second-order moment vector v_t.
14. Method according to claim 11, characterized in that the data processing module (5) reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit (4), and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step α and the exponential decay rates β_1 and β_2 from the external designated space through the direct memory access unit (1); then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment-estimate vectors m̂_t, v̂_t from m_t, v_t, and finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit (4) and writes θ_t into the external designated space through the direct memory access unit (1).
15. Method according to claim 14, characterized in that the data processing module (5) updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas

m_t = β_1·m_{t-1} + (1 − β_1)·∇f(θ_{t-1})
v_t = β_2·v_{t-1} + (1 − β_2)·∇f(θ_{t-1})²

the data processing module (5) computes the moment-estimate vectors m̂_t, v̂_t from m_t, v_t according to the formulas

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

and the data processing module (5) updates the vector to be updated θ_{t-1} to θ_t according to the formula

θ_t = θ_{t-1} − α·m̂_t / √(v̂_t).
16. The method according to claim 10 or 15, characterized in that the data processing module (5) comprises an operation control sub-module (51), and a vector addition parallel operation sub-module (52), a vector multiplication parallel operation sub-module (53), a vector division parallel operation sub-module (54), a vector square root parallel operation sub-module (55) and a basic operation sub-module (56) connected to the operation control sub-module (51), wherein the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56) are connected in parallel, and the operation control sub-module (51) is connected in series with the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56), respectively.
17. The method of claim 10, wherein when operating on vectors, the vector operations are element-wise, and when an operation is applied to a vector, elements at different positions are operated on in parallel.
18. A method of using the apparatus of claim 1, comprising: initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β_1, β_2 and the learning step α, and obtaining the vector to be updated θ_0 from the external designated space;

when performing the gradient descent operation, updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient values ∇f(θ_{t-1}) and the exponential decay rates, then obtaining the bias-corrected moment-estimate vectors m̂_t and v̂_t through moment vector operations, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
19. The method of claim 18, wherein the initializing of the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β_1, β_2 and the learning step α, and the obtaining of the vector to be updated θ_0 from the external designated space, comprise:

an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit, the INSTRUCTION_IO instruction being used to drive the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space;

when the operation starts, the controller unit reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit and, according to the translated microinstruction, drives the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space and caches them in the instruction cache unit;

the controller unit reads a HYPERPARAMETER_IO instruction from the instruction cache unit and, according to the translated microinstruction, drives the direct memory access unit to read the global update step α, the exponential decay rates β_1, β_2 and the convergence threshold ct from the external designated space and send them to the data processing module;

the controller unit reads an assignment instruction from the instruction cache unit and, according to the translated microinstruction, drives the initialization of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit and sets the iteration count t in the data processing unit to 1;

the controller unit reads a DATA_IO instruction from the instruction cache unit and, according to the translated microinstruction, drives the direct memory access unit to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space and send them to the data processing module;

the controller unit reads a data transmission instruction from the instruction cache unit and, according to the translated microinstruction, sends the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit to the data processing unit.
20. The method of claim 18, wherein the updating of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient values ∇f(θ_{t-1}) and the exponential decay rates is realized according to the formulas

m_t = β_1·m_{t-1} + (1 − β_1)·∇f(θ_{t-1})
v_t = β_2·v_{t-1} + (1 − β_2)·∇f(θ_{t-1})²

and specifically includes:

the controller unit reads a moment vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}; in the update operation, the moment vector update instruction is sent to the operation control sub-module, which issues corresponding instructions to perform the following operations: it sends an INS_1 instruction to the basic operation sub-module, driving it to compute (1 − β_1) and (1 − β_2); it sends an INS_2 instruction to the vector multiplication parallel operation sub-module, driving it to compute ∇f(θ_{t-1})²; it then sends an INS_3 instruction to the vector multiplication parallel operation sub-module, driving it to compute β_1·m_{t-1}, (1 − β_1)·∇f(θ_{t-1}), β_2·v_{t-1} and (1 − β_2)·∇f(θ_{t-1})² simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively; then a_1 and a_2, and b_1 and b_2, are sent as pairs of inputs to the vector addition parallel operation sub-module, obtaining the updated first-order moment vector m_t and second-order moment vector v_t.
21. The method of claim 20, wherein after updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient values ∇f(θ_{t-1}) and the exponential decay rates, the method further comprises:

the controller unit reads a data transmission instruction from the instruction cache unit and, according to the translated microinstruction, transmits the updated first-order moment vector m_t and second-order moment vector v_t from the data processing unit to the data cache unit.
22. The method of claim 18, wherein obtaining the bias-corrected moment-estimate vectors m̂_t and v̂_t through moment vector operations is realized according to the formulas

m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)

and specifically includes:

the controller unit reads a moment-estimate vector operation instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control sub-module to compute the moment-estimate vectors; the operation control sub-module issues corresponding instructions to perform the following operations: it sends an INS_4 instruction to the basic operation sub-module, driving it to compute 1/(1 − β_1^t) and 1/(1 − β_2^t), and the iteration count t is incremented by 1; it then sends an INS_5 instruction to the vector multiplication parallel operation sub-module, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1 − β_1^t) and of the second-order moment vector v_t with 1/(1 − β_2^t), obtaining the bias-corrected estimate vectors m̂_t and v̂_t.
23. The method of claim 18, wherein updating the vector to be updated θ_{t-1} to θ_t is realized according to the formula

θ_t = θ_{t-1} − α·m̂_t / √(v̂_t)

and specifically includes:

the controller unit reads a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control sub-module to perform the following operations: it sends an INS_6 instruction to the basic operation sub-module, driving it to compute −α; it sends an INS_7 instruction to the vector square root parallel operation sub-module, driving it to compute √(v̂_t); it sends an INS_7 instruction to the vector division parallel operation sub-module, driving it to compute m̂_t/√(v̂_t); it sends an INS_8 instruction to the vector multiplication parallel operation sub-module, driving it to compute temp = −α·m̂_t/√(v̂_t); it sends an INS_9 instruction to the vector addition parallel operation sub-module, driving it to compute θ_{t-1} + temp, obtaining the updated parameter vector θ_t; wherein θ_{t-1} is θ_0 before the first cycle, and the t-th cycle updates θ_{t-1} to θ_t; it sends an INS_10 instruction to the vector division parallel operation sub-module, driving it to compute a vector whose elements are denoted temp_i; and it sends instructions INS_11 and INS_12 to the vector addition parallel operation sub-module and the basic operation sub-module respectively to compute sum = Σ_i temp_i and temp2 = sum/n, where i = 1, 2, 3, ..., n, n is the total number of cycles, and temp2 is the moving weighted average of the gradients.
24. The method of claim 23, wherein after updating the vector θ_{t-1} to be updated to θ_t, the method further comprises:
The controller unit reads a DATABACK_IO instruction from the instruction cache unit and, according to the translated microinstruction, transfers the updated parameter vector θ_t from the data processing unit to the externally designated address space through the direct memory access unit.
25. The method according to claim 18, wherein repeating the above process until the vector to be updated converges comprises determining whether the updated vector has converged, the specific determination process being as follows:
The controller unit reads a convergence determination instruction from the instruction cache unit, and the data processing module determines, according to the translated microinstruction, whether the updated parameter vector has converged: if temp2 < ct, convergence is reached and the operation ends; where temp2 is the moving weighted average of the gradients and ct is the convergence threshold.
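The convergence test in the claim reduces to averaging the accumulated per-cycle values and comparing against the threshold ct. A minimal sketch, assuming `temps` holds the per-cycle values temp_i that INS_11 accumulates and INS_12 averages (names are illustrative, not from the patent):

```python
def convergence_check(temps, ct):
    # INS_11: vector addition sub-module accumulates sum = temp_1 + ... + temp_n
    total = sum(temps)
    # INS_12: basic operation sub-module divides by the total number of cycles n
    n = len(temps)
    temp2 = total / n
    # Convergence determination: operation ends when temp2 falls below ct
    return temp2 < ct
```

This mirrors the claim's two-instruction split: one sub-module performs the reduction, the other the scalar division and comparison against the convergence threshold.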
CN201610269689.7A 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm Active CN107315570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610269689.7A CN107315570B (en) 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm

Publications (2)

Publication Number Publication Date
CN107315570A CN107315570A (en) 2017-11-03
CN107315570B true CN107315570B (en) 2021-06-18

Family

ID=60185643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610269689.7A Active CN107315570B (en) 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm

Country Status (1)

Country Link
CN (1) CN107315570B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109782392A (en) * 2019-02-27 2019-05-21 中国科学院光电技术研究所 A kind of fiber-optic coupling method based on modified random paralleling gradient descent algorithm
CN111460528B (en) * 2020-04-01 2022-06-14 支付宝(杭州)信息技术有限公司 Multi-party combined training method and system based on Adam optimization algorithm

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101931416A (en) * 2009-06-24 2010-12-29 中国科学院微电子研究所 Parallel hierarchical decoder for low density parity code (LDPC) in mobile digital multimedia broadcasting system
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN103956992A (en) * 2014-03-26 2014-07-30 复旦大学 Self-adaptive signal processing method based on multi-step gradient decrease
CN104360597A (en) * 2014-11-02 2015-02-18 北京工业大学 Sewage treatment process optimization control method based on multiple gradient descent
CN104376124A (en) * 2014-12-09 2015-02-25 西华大学 Clustering algorithm based on disturbance absorbing principle
CN104978282A (en) * 2014-04-04 2015-10-14 上海芯豪微电子有限公司 Cache system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8862206B2 (en) * 2009-11-12 2014-10-14 Virginia Tech Intellectual Properties, Inc. Extended interior methods and systems for spectral, optical, and photoacoustic imaging


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Adam: A Method for Stochastic Optimization; D. Kingma, J. Ba; ICLR 2015; 2015-01-23; full text *


Similar Documents

Publication Publication Date Title
CN111310904B (en) Apparatus and method for performing convolutional neural network training
CN111353589B (en) Apparatus and method for performing artificial neural network forward operations
US11574195B2 (en) Operation method
CN110929863B (en) Apparatus and method for performing LSTM operations
CN111353588B (en) Apparatus and method for performing artificial neural network reverse training
CN108268939B (en) Apparatus and method for performing LSTM neural network operations
CN111860813B (en) Device and method for performing forward operation of convolutional neural network
WO2017124648A1 (en) Vector computing device
JP5987233B2 (en) Apparatus, method, and system
EP3832499A1 (en) Matrix computing device
WO2018120016A1 (en) Apparatus for executing lstm neural network operation, and operational method
CN107766079B (en) Processor and method for executing instructions on processor
EP3451238A1 (en) Apparatus and method for executing pooling operation
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
CN107341132B (en) Device and method for executing AdaGrad gradient descent training algorithm
CN113222101A (en) Deep learning processing device, method, equipment and storage medium
WO2017185393A1 (en) Apparatus and method for executing inner product operation of vectors
WO2017185392A1 (en) Device and method for performing four fundamental operations of arithmetic of vectors
CN107315570B (en) Device and method for executing Adam gradient descent training algorithm
CN107315569B (en) Device and method for executing RMSprop gradient descent algorithm
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
CN107341540B (en) Device and method for executing Hessian-Free training algorithm
CN111860814B (en) Apparatus and method for performing batch normalization operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

GR01 Patent grant