CN107315570B - Device and method for executing Adam gradient descent training algorithm - Google Patents
Device and method for executing Adam gradient descent training algorithm
- Publication number: CN107315570B
- Application number: CN201610269689.7A
- Authority: CN (China)
- Prior art keywords: vector, instruction, module, updated, unit
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3802—Instruction prefetching
- G06F9/3814—Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
- G06F9/3832—Value prediction for operands; operand history buffers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
Abstract
The disclosure provides a device and a method for executing an Adam gradient descent training algorithm. The device comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The method comprises the following steps: first, a gradient vector and the vector to be updated are read in, and the first-order moment vector, the second-order moment vector and the corresponding exponential decay rates are initialized; during each iteration, the first-order moment vector and the second-order moment vector are updated using the gradient vector, the biased first-order and second-order moment estimate vectors are computed, and the vector to be updated is updated using these two moment estimate vectors; training continues until the vector to be updated converges. With the device and method, applications of the Adam gradient descent algorithm can be realized and data processing efficiency is greatly improved.
Description
Technical Field
The disclosure relates to the technical field of Adam algorithm applications, and in particular to a device and a method for executing the Adam gradient descent training algorithm, covering applications of hardware implementations of the Adam gradient descent optimization algorithm.
Background
The Adam algorithm is one of the gradient descent optimization algorithms. It is widely used because it is easy to implement, requires little computation and storage space, and its updates are invariant to rescaling of the gradients. Implementing the Adam algorithm with a dedicated device can therefore significantly improve its execution speed.
Currently, one known method of executing the Adam gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general instructions through a general-purpose register file and general-purpose functional units. One disadvantage of this approach is the low computational performance of a single general-purpose processor. When multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding brings a large power consumption overhead.
Another known method of executing the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose single-instruction multiple-data (SIMD) instructions through a general-purpose register file and general-purpose stream processing units. Because the GPU is a device specialized for graphics, image and scientific computation, it has no dedicated support for the operations of the Adam gradient descent algorithm, so a large amount of front-end decoding work is still required, which brings substantial additional overhead. In addition, the GPU has only a small on-chip cache, so data required in the operation (such as the first-order and second-order moment vectors) must be repeatedly transferred off-chip; off-chip bandwidth then becomes the main performance bottleneck while incurring a huge power consumption overhead.
Disclosure of Invention
(I) Technical problem to be solved
In view of the above, the present disclosure provides a device and a method for executing the Adam gradient descent training algorithm, to solve the problems that general-purpose processors have insufficient computational performance for this kind of data and that their front-end decoding overhead is high, and to avoid repeatedly reading data from memory, thereby reducing the memory-access bandwidth required.
(II) Technical scheme
To achieve the above object, the present disclosure provides an apparatus for executing Adam gradient descent training algorithm, the apparatus including a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
the direct memory access unit 1 is used for accessing an external designated space, reading and writing data to the instruction cache unit 2 and the data processing module 5, and completing the loading and storage of the data;
the instruction cache unit 2 is used for reading the instruction through the direct memory access unit 1 and caching the read instruction;
the controller unit 3 is used for reading the instruction from the instruction cache unit 2 and decoding the read instruction into a microinstruction for controlling the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5;
the data cache unit 4 is used for caching the first-order moment vectors and the second-order moment vectors in the initialization and data updating processes;
and the data processing module 5 is used for updating the moment vector, calculating the moment estimation vector, updating the vector to be updated, writing the updated moment vector into the data cache unit 4, and writing the updated vector to be updated into an external designated space through the direct memory access unit 1.
In the above scheme, the direct memory access unit 1 writes an instruction into the instruction cache unit 2 from an external designated space, reads a parameter to be updated and a corresponding gradient value from the external designated space to the data processing module 5, and directly writes an updated parameter vector into the external designated space from the data processing module 5.
In the above scheme, the controller unit 3 decodes the read instruction into microinstructions that control the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5, so as to control the direct memory access unit 1 to read data from and write data to the external designated address, control the data cache unit 4 to obtain the data required for the operation from the external designated address through the direct memory access unit 1, control the data processing module 5 to perform the update operation on the parameter to be updated, and control the data transmission between the data cache unit 4 and the data processing module 5.
In the above scheme, the data cache unit 4 initializes the first-order moment vector $m_t$ and the second-order moment vector $v_t$ during initialization; during each data update, it reads out the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ and sends them to the data processing module 5, where they are updated to the first-order moment vector $m_t$ and the second-order moment vector $v_t$ and then written back into the data cache unit 4.
In the above scheme, a copy of the first-order moment vector $m_t$ and the second-order moment vector $v_t$ is always kept in the data cache unit 4 during operation of the device.
In the above scheme, the data processing module 5 reads the moment vectors $m_{t-1}$, $v_{t-1}$ from the data cache unit 4, and reads the vector to be updated $\theta_{t-1}$, its gradient vector $\nabla f(\theta_{t-1})$, the update step $\alpha$ and the exponential decay rates $\beta_1$ and $\beta_2$ from the external designated space through the direct memory access unit 1; it then updates the moment vectors $m_{t-1}$, $v_{t-1}$ to $m_t$, $v_t$, computes the moment estimate vectors $\hat{m}_t$, $\hat{v}_t$ through $m_t$, $v_t$, and finally updates the vector to be updated $\theta_{t-1}$ to $\theta_t$, writes $m_t$, $v_t$ into the data cache unit 4, and writes $\theta_t$ to the external designated space through the direct memory access unit 1.
In the above scheme, the data processing module 5 updates the moment vectors $m_{t-1}$, $v_{t-1}$ to $m_t$, $v_t$ according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_{t-1})$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(\theta_{t-1})^2$; the data processing module 5 computes the moment estimate vectors $\hat{m}_t$, $\hat{v}_t$ through $m_t$, $v_t$ according to the formulas $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$; and the data processing module 5 updates the vector to be updated $\theta_{t-1}$ to $\theta_t$ according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$.
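For concreteness, a minimal NumPy sketch of one update step defined by these formulas follows. The function name adam_step and its signature are illustrative, not part of the patent, and the small stabilizing constant commonly added to the denominator in software implementations is omitted to match the formula above:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha, beta1, beta2):
    """One Adam update step following the formulas in the text."""
    m = beta1 * m + (1 - beta1) * grad              # m_t from m_{t-1} and the gradient
    v = beta2 * v + (1 - beta2) * grad * grad       # v_t, with an element-wise square
    m_hat = m / (1 - beta1 ** t)                    # biased first-order moment estimate
    v_hat = v / (1 - beta2 ** t)                    # biased second-order moment estimate
    theta = theta - alpha * m_hat / np.sqrt(v_hat)  # parameter update
    return theta, m, v
```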
In the above scheme, the data processing module 5 comprises an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55 and a basic operation sub-module 56. The sub-modules 52 to 56 are connected in parallel with one another, and the operation control sub-module 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and elements at different positions of the same vector are computed in parallel.
To achieve the above object, the present disclosure also provides a method for executing an Adam gradient descent training algorithm, the method comprising:
initializing the first-order moment vector $m_0$, the second-order moment vector $v_0$, the exponential decay rates $\beta_1$, $\beta_2$ and the learning step $\alpha$, and obtaining the vector to be updated $\theta_0$ from the external designated space;
during each gradient descent iteration, updating the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ using the externally supplied gradient value $\nabla f(\theta_{t-1})$ and the exponential decay rates, then obtaining the biased moment estimate vectors $\hat{m}_t$ and $\hat{v}_t$ through the moment vector operation, and finally updating the vector to be updated $\theta_{t-1}$ to $\theta_t$ and outputting it; this process is repeated until the vector to be updated converges.
In the above scheme, initializing the first-order moment vector $m_0$, the second-order moment vector $v_0$, the exponential decay rates $\beta_1$, $\beta_2$ and the learning step $\alpha$, and obtaining the vector to be updated $\theta_0$ from the external designated space comprises the following steps:
in step S1, an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space;
in step S2, when the operation starts, the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space and caches them in the instruction cache unit 2;
in step S3, the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read the global update step $\alpha$, the exponential decay rates $\beta_1$, $\beta_2$ and the convergence threshold $c_t$ from the external designated space, which are then sent to the data processing module 5;
in step S4, the controller unit 3 reads in an assignment instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ in the data cache unit 4 to be initialized, and drives the iteration count $t$ in the data processing unit 5 to be set to 1;
in step S5, the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read the parameter vector to be updated $\theta_{t-1}$ and the corresponding gradient vector $\nabla f(\theta_{t-1})$ from the external designated space, which are then sent to the data processing module 5;
in step S6, the controller unit 3 reads a data transmission instruction from the instruction cache unit 2 and, according to the translated microinstruction, transfers the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ from the data cache unit 4 into the data processing unit 5.
In the above scheme, updating the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ using the externally supplied gradient value $\nabla f(\theta_{t-1})$ and the exponential decay rates is realized according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_{t-1})$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(\theta_{t-1})^2$, and specifically includes: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the update of the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ held in the data cache unit 4; the moment vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions to perform the following operations: it sends an INS_1 instruction to the basic operation sub-module 56, driving it to compute $(1-\beta_1)$ and $(1-\beta_2)$; it sends an INS_2 instruction to the vector multiplication parallel operation sub-module 53, driving it to compute $\nabla f(\theta_{t-1})^2$; it then sends an INS_3 instruction to the vector multiplication parallel operation sub-module 53, driving it to compute $\beta_1 m_{t-1}$, $(1-\beta_1)\nabla f(\theta_{t-1})$, $\beta_2 v_{t-1}$ and $(1-\beta_2)\nabla f(\theta_{t-1})^2$ simultaneously, the results being denoted $a_1$, $a_2$, $b_1$ and $b_2$ respectively; then $a_1$ and $a_2$, and $b_1$ and $b_2$, are each fed as a pair of inputs to the vector addition parallel operation sub-module 52, yielding the updated first-order moment vector $m_t$ and second-order moment vector $v_t$.
In the above scheme, after updating the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ using the externally supplied gradient value $\nabla f(\theta_{t-1})$ and the exponential decay rates, the method further comprises: the controller unit 3 reads a data transmission instruction from the instruction cache unit 2 and, according to the translated microinstruction, transfers the updated first-order moment vector $m_t$ and second-order moment vector $v_t$ from the data processing unit 5 into the data cache unit 4.
In the above scheme, obtaining the biased moment estimate vectors $\hat{m}_t$ and $\hat{v}_t$ through the moment vector operation is realized according to the formulas $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$, and specifically includes: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control sub-module 51 to compute the moment estimate vectors; the operation control sub-module 51 issues the corresponding instructions to perform the following operations: it sends an instruction INS_4 to the basic operation sub-module 56, driving it to compute $1/(1-\beta_1^t)$ and $1/(1-\beta_2^t)$ and to increment the iteration count $t$ by 1; it sends an instruction INS_5 to the vector multiplication parallel operation sub-module 53, driving it to compute in parallel the product of the first-order moment vector $m_t$ with $1/(1-\beta_1^t)$ and the product of the second-order moment vector $v_t$ with $1/(1-\beta_2^t)$, yielding the biased moment estimate vectors $\hat{m}_t$ and $\hat{v}_t$.
In the above scheme, updating the vector to be updated $\theta_{t-1}$ to $\theta_t$ is realized according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$, and specifically includes: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control sub-module 51 to perform the following operations: it sends an instruction INS_6 to the basic operation sub-module 56, driving it to compute $-\alpha$; it sends an instruction INS_7 to the vector square root parallel operation sub-module 55, driving it to compute $\sqrt{\hat{v}_t}$; it sends an instruction INS_7 to the vector division parallel operation sub-module 54, driving it to compute $\hat{m}_t/\sqrt{\hat{v}_t}$; it sends an instruction INS_8 to the vector multiplication parallel operation sub-module 53, driving it to compute $-\alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$; it sends an instruction INS_9 to the vector addition parallel operation sub-module 52, driving it to compute $\theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$, which yields the updated parameter vector $\theta_t$, where $\theta_{t-1}$ is the value not yet updated before the $t$-th cycle and the $t$-th cycle updates $\theta_{t-1}$ to $\theta_t$; it sends an instruction INS_10 to the vector division parallel operation sub-module 54, driving it to compute the vector $temp_t$; and it sends instructions INS_11 and INS_12 to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively, to compute $sum = \sum temp_t$ and $temp2 = sum/n$.
In the above scheme, after updating the vector to be updated $\theta_{t-1}$ to $\theta_t$, the method further comprises: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the translated microinstruction, writes the updated parameter vector $\theta_t$ from the data processing unit 5 to the external designated space through the direct memory access unit 1.
In the above scheme, repeating this process until the vector to be updated converges includes judging whether the vector to be updated has converged; the specific judging process is: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the translated microinstruction, the data processing module 5 judges whether the updated parameter vector has converged; if $temp2 < c_t$, it has converged and the operation ends.
(III) Advantageous effects
It can be seen from the above technical scheme that the device and method of the present disclosure have the following beneficial effects:
1. The device and method for executing the Adam gradient descent training algorithm use a device dedicated to executing this algorithm, which solves the problems that general-purpose processors have insufficient computational performance for this kind of data and that front-end decoding overhead is high, and accelerates the execution of related applications.
2. Because the data cache unit temporarily stores the moment vectors needed in the intermediate steps, repeated reads of the same data from memory are avoided, IO operations between the device and the external address space are reduced, and the memory-access bandwidth required is lowered.
3. Because the data processing module performs vector operations on dedicated parallel operation sub-modules, the degree of parallelism is greatly improved.
4. Because of the high degree of operation parallelism, the device can run at a low working frequency, so the power consumption overhead is low.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
fig. 1 illustrates an example block diagram of the overall structure of an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present disclosure.
Fig. 2 illustrates an example block diagram of a data processing module in an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of this disclosure.
Fig. 3 shows a flow diagram of a method for executing an Adam gradient descent training algorithm in accordance with an embodiment of the present disclosure.
Like devices, components, units, etc. are designated with like reference numerals throughout the drawings.
Detailed Description
Other aspects, advantages, and salient features of the disclosure will become apparent to those skilled in the art from the following detailed description of the disclosed embodiments, which taken in conjunction with the annexed drawings.
In the present disclosure, the terms "include" and "comprise," as well as derivatives thereof, mean inclusion without limitation; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below which are used to describe the principles of the present disclosure are by way of illustration only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The device and the method for executing the Adam gradient descent training algorithm according to embodiments of the disclosure are used to accelerate applications of the Adam gradient descent algorithm. First, the first-order moment vector $m_0$, the second-order moment vector $v_0$, the exponential decay rates $\beta_1$, $\beta_2$ and the learning step $\alpha$ are initialized, and the vector to be updated $\theta_0$ is obtained from the external designated space. In each gradient descent iteration, the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ are updated using the externally supplied gradient value $\nabla f(\theta_{t-1})$ and the exponential decay rates, i.e. $m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_{t-1})$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(\theta_{t-1})^2$; the biased moment estimate vectors are then obtained through the moment vector operation, i.e. $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$; finally the vector to be updated $\theta_{t-1}$ is updated to $\theta_t$ and output, i.e. $\theta_t = \theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$, where $\theta_{t-1}$ is the value not yet updated before the $t$-th cycle and the $t$-th cycle updates $\theta_{t-1}$ to $\theta_t$. This process is repeated until the vector to be updated converges.
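As an illustration of this iterative flow only, the following sketch runs the same update in a loop until convergence. The gradient function grad_fn, the iteration cap max_iter, and the concrete form of the convergence statistic are assumptions, not taken from the patent:

```python
import numpy as np

def adam_train(theta, grad_fn, alpha, beta1, beta2, c_t, max_iter=100000):
    theta = np.asarray(theta, dtype=float)
    m = np.zeros_like(theta)   # first-order moment vector m_0
    v = np.zeros_like(theta)   # second-order moment vector v_0
    for t in range(1, max_iter + 1):
        g = grad_fn(theta)                   # externally supplied gradient
        m = beta1 * m + (1 - beta1) * g      # update m_{t-1} -> m_t
        v = beta2 * v + (1 - beta2) * g * g  # update v_{t-1} -> v_t
        m_hat = m / (1 - beta1 ** t)         # biased moment estimates
        v_hat = v / (1 - beta2 ** t)
        new_theta = theta - alpha * m_hat / np.sqrt(v_hat)
        # convergence test: mean per-element change below the threshold c_t
        # (an assumed concrete form of the patent's temp2 < c_t criterion)
        if np.mean(np.abs(new_theta - theta)) < c_t:
            return new_theta
        theta = new_theta
    return theta
```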
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the Adam gradient descent algorithm according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may all be implemented by hardware circuits.
And the direct memory access unit 1 is used for accessing an external designated space, reading and writing data to the instruction cache unit 2 and the data processing module 5, and completing the loading and storage of the data. Specifically, an instruction is written into the instruction cache unit 2 from an external designated space, a parameter to be updated and a corresponding gradient value are read from the external designated space to the data processing module 5, and an updated parameter vector is directly written into the external designated space from the data processing module 5.
And the instruction cache unit 2 is used for reading the instruction through the direct memory access unit 1 and caching the read instruction.
The controller unit 3 is configured to read instructions from the instruction cache unit 2, decode each read instruction into microinstructions that control the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5, and send each microinstruction to the corresponding unit: it controls the direct memory access unit 1 to read data from and write data to the external designated address, controls the data cache unit 4 to obtain the data required for the operation from the external designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update operation on the parameter to be updated, and controls the data transmission between the data cache unit 4 and the data processing module 5.
The data cache unit 4 is used for caching the first-order moment vectors and second-order moment vectors during initialization and during the data update process. Specifically, the data cache unit 4 initializes the first-order moment vector $m_t$ and the second-order moment vector $v_t$ during initialization; during each data update it reads out the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ and sends them to the data processing module 5, where they are updated to the first-order moment vector $m_t$ and the second-order moment vector $v_t$ and then written back into the data cache unit 4. During operation of the device, a copy of the first-order moment vector $m_t$ and the second-order moment vector $v_t$ is always kept in the data cache unit 4. In the disclosure, because the moment vectors needed in the intermediate steps are temporarily stored in the data cache unit, repeated reads of the same data from memory are avoided, IO operations between the device and the external address space are reduced, and the memory-access bandwidth required is lowered.
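A toy sketch of this role of the data cache unit follows; the class and method names are invented for illustration, and the real unit is of course a hardware circuit rather than an object:

```python
import numpy as np

class DataCache:
    """On-chip buffer that keeps the working copies of m_t and v_t."""

    def __init__(self, n):
        self.m = np.zeros(n)  # first-order moment vector, set up at initialization
        self.v = np.zeros(n)  # second-order moment vector, set up at initialization

    def read(self):
        # each update cycle: hand m_{t-1}, v_{t-1} to the data processing module
        return self.m, self.v

    def write(self, m_t, v_t):
        # after the update: write m_t, v_t back, with no off-chip traffic
        self.m, self.v = m_t, v_t
```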
The data processing module 5 is used for updating the moment vectors, computing the moment estimate vectors, updating the vector to be updated, writing the updated moment vectors into the data cache unit 4, and writing the updated vector to be updated into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors $m_{t-1}$, $v_{t-1}$ from the data cache unit 4, and reads the vector to be updated $\theta_{t-1}$, its gradient vector $\nabla f(\theta_{t-1})$, the update step $\alpha$ and the exponential decay rates $\beta_1$ and $\beta_2$ from the external designated space through the direct memory access unit 1; it then updates the moment vectors $m_{t-1}$, $v_{t-1}$ to $m_t$, $v_t$, i.e. $m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_{t-1})$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(\theta_{t-1})^2$; computes the moment estimate vectors through $m_t$, $v_t$, i.e. $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$; and finally updates the vector to be updated $\theta_{t-1}$ to $\theta_t$, i.e. $\theta_t = \theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$, writes $m_t$, $v_t$ into the data cache unit 4, and writes $\theta_t$ to the external designated space through the direct memory access unit 1. In the disclosure, because the data processing module performs vector operations on dedicated parallel operation sub-modules, the degree of parallelism is greatly improved, the working frequency can be low, and the power consumption overhead is small.
Fig. 2 illustrates an example block diagram of the data processing module in a device for executing Adam gradient descent algorithm applications according to an embodiment of this disclosure. As shown in Fig. 2, the data processing module 5 comprises an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55 and a basic operation sub-module 56. The sub-modules 52 to 56 are connected in parallel with one another, and the operation control sub-module 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and elements at different positions of the same vector are computed in parallel.
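The element-wise, sub-module-per-operation organization can be pictured with the following dispatch sketch; the table, the names and the routing are assumptions for illustration, with NumPy's vectorized operations standing in for the hardware's parallel element lanes:

```python
import numpy as np

# one entry per parallel sub-module of the data processing module
SUBMODULES = {
    "vadd":  np.add,       # vector addition parallel operation sub-module 52
    "vmul":  np.multiply,  # vector multiplication parallel operation sub-module 53
    "vdiv":  np.divide,    # vector division parallel operation sub-module 54
    "vsqrt": np.sqrt,      # vector square root parallel operation sub-module 55
}

def dispatch(op, *operands):
    # the operation control sub-module routes each micro-instruction to one
    # sub-module; every vector operation is applied element-wise in parallel
    return SUBMODULES[op](*operands)

# example: element-wise m_hat / sqrt(v_hat)
# update = dispatch("vdiv", m_hat, dispatch("vsqrt", v_hat))
```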
Fig. 3 shows a flow chart of a method for executing an Adam gradient descent training algorithm according to an embodiment of the present disclosure, specifically comprising the steps of:
In step S1, an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; this instruction is used to drive the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space.
In step S2, when the operation starts, the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space and caches them in the instruction cache unit 2;
In step S3, the controller unit 3 reads in a hyper-parameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read the global update step $\alpha$, the exponential decay rates $\beta_1$, $\beta_2$ and the convergence threshold $c_t$ from the external designated space, which are then sent to the data processing module 5;
In step S4, the controller unit 3 reads in an assignment instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ in the data cache unit 4 to be initialized, and drives the iteration count $t$ in the data processing unit 5 to be set to 1;
In step S5, the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the translated microinstruction, drives the direct memory access unit 1 to read the parameter vector to be updated $\theta_{t-1}$ and the corresponding gradient vector $\nabla f(\theta_{t-1})$ from the external designated space, which are then sent to the data processing module 5;
In step S6, the controller unit 3 reads a data transmission instruction from the instruction cache unit 2 and, according to the translated microinstruction, transfers the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ from the data cache unit 4 to the data processing unit 5.
In step S7, the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the update of the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ held in the data cache unit 4. In this update operation, the moment vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute $(1-\beta_1)$ and $(1-\beta_2)$; it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute $\nabla f(\theta_{t-1})^2$; it then sends operation instruction 3 (INS_3) to the vector multiplication parallel operation sub-module 53, driving it to compute $\beta_1 m_{t-1}$, $(1-\beta_1)\nabla f(\theta_{t-1})$, $\beta_2 v_{t-1}$ and $(1-\beta_2)\nabla f(\theta_{t-1})^2$ simultaneously, the results being denoted $a_1$, $a_2$, $b_1$ and $b_2$ respectively; then $a_1$ and $a_2$, and $b_1$ and $b_2$, are each fed as a pair of inputs to the vector addition parallel operation sub-module 52, yielding the updated first-order moment vector $m_t$ and second-order moment vector $v_t$.
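A sketch of how step S7 decomposes onto the sub-modules is given below; the INS_1 to INS_3 mapping follows the text, while the Python rendering itself is an assumption:

```python
def update_moments(m_prev, v_prev, g, beta1, beta2):
    c1, c2 = 1 - beta1, 1 - beta2  # INS_1: basic operation sub-module 56
    g2 = g * g                     # INS_2: vector multiplication sub-module 53
    # INS_3: four products computed simultaneously on the multiplication unit
    a1, a2 = beta1 * m_prev, c1 * g
    b1, b2 = beta2 * v_prev, c2 * g2
    # vector addition sub-module 52 takes (a1, a2) and (b1, b2) as input pairs
    return a1 + a2, b1 + b2        # m_t, v_t
```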
In step S8, the controller unit 3 reads a data transmission instruction from the instruction cache unit 2 and, according to the translated microinstruction, transfers the updated first-order moment vector $m_t$ and second-order moment vector $v_t$ from the data processing unit 5 into the data cache unit 4.
In step S9, the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control sub-module 51 to compute the moment estimate vectors; the operation control sub-module 51 issues the corresponding instructions to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute $1/(1-\beta_1^t)$ and $1/(1-\beta_2^t)$ and to increment the iteration count $t$ by 1; it sends operation instruction 5 (INS_5) to the vector multiplication parallel operation sub-module 53, driving it to compute in parallel the product of the first-order moment vector $m_t$ with $1/(1-\beta_1^t)$ and the product of the second-order moment vector $v_t$ with $1/(1-\beta_2^t)$, yielding the biased moment estimate vectors $\hat{m}_t$ and $\hat{v}_t$.
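Step S9 in the same illustrative style; INS_4 and INS_5 follow the text, while the Python form is an assumption:

```python
def moment_estimates(m_t, v_t, beta1, beta2, t):
    # INS_4: basic operation sub-module 56 computes the two scalar factors
    s1 = 1.0 / (1 - beta1 ** t)
    s2 = 1.0 / (1 - beta2 ** t)
    # INS_5: vector multiplication sub-module 53 scales both vectors in parallel
    return s1 * m_t, s2 * v_t      # biased estimates m_hat_t, v_hat_t
```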
In step S10, the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the translated microinstruction, drives the operation control sub-module 51 to perform the following operations: it sends operation instruction 6 (INS_6) to the basic operation sub-module 56, driving it to compute $-\alpha$; it sends operation instruction 7 (INS_7) to the vector square root parallel operation sub-module 55, driving it to compute $\sqrt{\hat{v}_t}$; it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute $\hat{m}_t/\sqrt{\hat{v}_t}$; it sends operation instruction 8 (INS_8) to the vector multiplication parallel operation sub-module 53, driving it to compute $-\alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$; it sends operation instruction 9 (INS_9) to the vector addition parallel operation sub-module 52, driving it to compute $\theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$, which yields the updated parameter vector $\theta_t$, where $\theta_{t-1}$ is the value not yet updated before the $t$-th cycle and the $t$-th cycle updates $\theta_{t-1}$ to $\theta_t$; it sends operation instruction 10 (INS_10) to the vector division parallel operation sub-module 54, driving it to compute the vector $temp_t$; and it sends operation instruction 11 (INS_11) and operation instruction 12 (INS_12) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively, to compute $sum = \sum temp_t$ and $temp2 = sum/n$.
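Step S10 can be sketched as follows; INS_6 to INS_12 follow the text, while the concrete form of $temp_t$ is not spelled out in the description, so the per-element relative change used here is an assumption:

```python
import numpy as np

def update_parameters(theta_prev, m_hat, v_hat, alpha):
    neg_alpha = -alpha                     # INS_6: basic operation sub-module 56
    root = np.sqrt(v_hat)                  # INS_7: vector square root sub-module 55
    quot = m_hat / root                    # INS_7: vector division sub-module 54
    step = neg_alpha * quot                # INS_8: vector multiplication sub-module 53
    theta_t = theta_prev + step            # INS_9: vector addition sub-module 52
    temp_t = np.abs(step / theta_prev)     # INS_10: assumed relative-change vector
    temp2 = np.sum(temp_t) / theta_t.size  # INS_11/INS_12: sum, then divide by n
    return theta_t, temp2                  # temp2 < c_t decides convergence in S12
```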
In step S11, the controller unit 3 reads a to-be-updated vector write-back instruction (DATABACK_IO) from the instruction cache unit 2 and, according to the translated microinstruction, writes the updated parameter vector $\theta_t$ from the data processing unit 5 to the external designated space through the direct memory access unit 1.
In step S12, the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the translated microinstruction, the data processing module 5 judges whether the updated parameter vector has converged: if $temp2 < c_t$, it has converged and the operation ends; otherwise the process returns to step S5 and continues.
By using a device dedicated to executing the Adam gradient descent training algorithm, the present disclosure solves the problems that general-purpose processors have insufficient computational performance for this kind of data and that front-end decoding overhead is high, and accelerates the execution of related applications. At the same time, the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth required.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination thereof. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present disclosure have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims (25)
1. An apparatus for performing an Adam gradient descent training algorithm, the apparatus comprising:
the controller unit (3) is used for reading an instruction and decoding the read instruction into a micro instruction for controlling the behavior of the data cache unit (4) or the data processing module (5);
the data caching unit (4) is used for caching the moment vectors in the processes of initialization and data updating;
the data processing module (5) is connected to the controller unit (3) and the data cache unit (4) and is used for reading the vector to be updated and the corresponding gradient value from the external designated space, reading the moment vector from the data cache unit (4), updating the moment vector according to the vector to be updated and the corresponding gradient value, calculating a moment estimation vector, updating the vector to be updated, writing the updated moment vector into the data cache unit (4) and writing the updated vector to be updated into the external designated space;
wherein the data cache unit (4) initializes the first-order moment vector $m_t$ and the second-order moment vector $v_t$ during initialization; during each data update it reads out the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ and sends them to the data processing module (5), where they are updated to the first-order moment vector $m_t$ and the second-order moment vector $v_t$ and then written back into the data cache unit (4).
2. The apparatus of claim 1, further comprising:
the direct memory access unit (1) is connected to the data processing module (5) and is used for accessing an external designated space, reading and writing data to the instruction cache unit (2) and the data processing module (5) and completing the loading and storage of the data;
and the instruction cache unit (2) is connected to the direct memory access unit (1) and the controller unit (3) and is used for reading the instruction through the direct memory access unit (1) and caching the read instruction for the controller unit (3) to read.
3. The apparatus according to claim 2, wherein the direct memory access unit (1) writes the instruction from the external designated space to the instruction cache unit (2), reads the parameter vector to be updated and the corresponding gradient value from the external designated space to the data processing module (5), and writes the updated parameter vector from the data processing module (5) directly into the external designated space.
4. The apparatus according to claim 2, wherein the controller unit (3) decodes the read instruction into microinstructions that control the behavior of the direct memory access unit (1), the data cache unit (4) or the data processing module (5), so as to control the direct memory access unit (1) to read data from and write data to the external designated address, control the data cache unit (4) to obtain the data required for the operation from the external designated address through the direct memory access unit (1), control the data processing module (5) to perform the update operation of the parameter to be updated, and control the data transmission between the data cache unit (4) and the data processing module (5).
5. The device according to claim 1, wherein a copy of the first-order moment vector $m_t$ and the second-order moment vector $v_t$ is always kept in the data cache unit (4) during operation of the device.
6. The apparatus according to claim 2, wherein the data processing module (5) reads the moment vectors $m_{t-1}$, $v_{t-1}$ from the data cache unit (4), and reads the vector to be updated $\theta_{t-1}$, its gradient vector $\nabla f(\theta_{t-1})$, the update step $\alpha$ and the exponential decay rates $\beta_1$ and $\beta_2$ from the external designated space through the direct memory access unit (1); it then updates the moment vectors $m_{t-1}$, $v_{t-1}$ to $m_t$, $v_t$, computes the moment estimate vectors $\hat{m}_t$, $\hat{v}_t$ through $m_t$, $v_t$, and finally updates the vector to be updated $\theta_{t-1}$ to $\theta_t$, writes $m_t$, $v_t$ into the data cache unit (4), and writes $\theta_t$ into the external designated space through the direct memory access unit (1).
7. The device according to claim 6, wherein the data processing module (5) updates the moment vectors $m_{t-1}$, $v_{t-1}$ to $m_t$, $v_t$ according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_{t-1})$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(\theta_{t-1})^2$; the data processing module (5) computes the moment estimate vectors $\hat{m}_t$, $\hat{v}_t$ through $m_t$, $v_t$ according to the formulas $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$, where $\beta_1$, $\beta_2$ are the exponential decay rates; and the data processing module (5) updates the vector to be updated $\theta_{t-1}$ to $\theta_t$ according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$.
8. The apparatus according to claim 1 or 7, wherein the data processing module (5) comprises an operation control sub-module (51), and a vector addition parallel operation sub-module (52), a vector multiplication parallel operation sub-module (53), a vector division parallel operation sub-module (54), a vector square root parallel operation sub-module (55) and a basic operation sub-module (56) connected to the operation control sub-module (51), wherein the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56) are connected in parallel with one another, and the operation control sub-module (51) is connected in series with each of the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56).
9. The apparatus of claim 8, wherein, when the apparatus operates on vectors, the vector operations are element-wise operations, and elements at different positions of the same vector are operated on in parallel.
10. A method of using the apparatus of claim 1, the method comprising:
using the controller unit to read an instruction and decode the read instruction into microinstructions that control the behavior of the data cache unit or the data processing module;
using the data cache unit to cache the moment vectors during initialization and during the data update process;
using the data processing module to read the vector to be updated and the corresponding gradient value from the external designated space, read the moment vectors from the data cache unit, update the moment vectors according to the vector to be updated and the corresponding gradient value, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit, and write the updated vector to be updated into the external designated space;
the data buffer unit is initializedInitializing first order moment vector mtSecond order moment vector vtThe first order moment vector m is used in each data updating processt-1And a second moment vector vt-1Read out and send to the data processing module, and update to the first moment vector m in the data processing moduletAnd a second moment vector vtAnd then written into the data cache unit.
11. The method of claim 10, wherein the method comprises:
using the direct memory access unit to write instructions from the external designated space into the instruction cache unit, and using the instruction cache unit to cache the read instructions for the controller unit to read;
using the direct memory access unit to access the external designated space and read the parameter to be updated and the corresponding gradient value into the data processing module; and
using the direct memory access unit to write the updated parameter vector from the data processing module directly into the external designated space.
12. The method of claim 11, characterized in that the method comprises:
the controller unit decodes the read instruction into microinstructions that control the behavior of the direct memory access unit, the data cache unit or the data processing module,
controls the direct memory access unit to read data from and write data to the external designated address,
controls the data cache unit to obtain the data required for the operation from the external designated address through the direct memory access unit,
controls the data processing module to perform the update operation on the parameter to be updated, and
controls the data transmission between the data cache unit and the data processing module.
13. Method according to claim 12, characterized in that, during operation, a copy of the first-order moment vector $m_t$ and the second-order moment vector $v_t$ is always kept in the data cache unit (4).
14. Method according to claim 11, characterized in that the data processing module (5) reads the moment vectors $m_{t-1}$, $v_{t-1}$ from the data cache unit (4), and reads the vector to be updated $\theta_{t-1}$, its gradient vector $\nabla f(\theta_{t-1})$, the update step $\alpha$ and the exponential decay rates $\beta_1$ and $\beta_2$ from the external designated space through the direct memory access unit (1); it then updates the moment vectors $m_{t-1}$, $v_{t-1}$ to $m_t$, $v_t$, computes the moment estimate vectors $\hat{m}_t$, $\hat{v}_t$ through $m_t$, $v_t$, and finally updates the vector to be updated $\theta_{t-1}$ to $\theta_t$, writes $m_t$, $v_t$ into the data cache unit (4), and writes $\theta_t$ into the external designated space through the direct memory access unit (1).
15. Method according to claim 14, characterized in that the data processing module (5) updates the moment vectors $m_{t-1}$, $v_{t-1}$ to $m_t$, $v_t$ according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_{t-1})$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(\theta_{t-1})^2$; the data processing module (5) computes the moment estimate vectors $\hat{m}_t$, $\hat{v}_t$ through $m_t$, $v_t$ according to the formulas $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$; and the data processing module (5) updates the vector to be updated $\theta_{t-1}$ to $\theta_t$ according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$.
16. The method according to claim 10 or 15, characterized in that the data processing module (5) comprises an operation control sub-module (51), and a vector addition parallel operation sub-module (52), a vector multiplication parallel operation sub-module (53), a vector division parallel operation sub-module (54), a vector square root parallel operation sub-module (55) and a basic operation sub-module (56) connected to the operation control sub-module (51), wherein the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56) are connected in parallel with one another, and the operation control sub-module (51) is connected in series with each of the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55) and the basic operation sub-module (56).
17. The method of claim 10, wherein, when operating on vectors, the vector operations are element-wise operations, and elements at different positions of the same vector are operated on in parallel.
18. A method of using the apparatus of claim 1, comprising: initializing the first-order moment vector $m_0$, the second-order moment vector $v_0$, the exponential decay rates $\beta_1$, $\beta_2$ and the learning step $\alpha$, and obtaining the vector to be updated $\theta_0$ from the external designated space;
during each gradient descent iteration, updating the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ using the externally supplied gradient value $\nabla f(\theta_{t-1})$ and the exponential decay rates, then obtaining the biased moment estimate vectors $\hat{m}_t$ and $\hat{v}_t$ through the moment vector operation, and finally updating the vector to be updated $\theta_{t-1}$ to $\theta_t$ and outputting it; this process is repeated until the vector to be updated converges.
19. The method of claim 18, wherein initializing the first-order moment vector $m_0$, the second-order moment vector $v_0$, the exponential decay rates $\beta_1$, $\beta_2$ and the learning step $\alpha$, and obtaining the vector to be updated $\theta_0$ from the external designated space comprises the following steps:
an INSTRUCTION_IO instruction is stored in advance at the first address of the instruction cache unit, and the INSTRUCTION_IO instruction is used to drive the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space;
when the operation starts, the controller unit reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit and, according to the translated microinstruction, drives the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space and caches them in the instruction cache unit;
the controller unit reads a HYPERPARAMETER_IO instruction from the instruction cache unit and, according to the translated microinstruction, drives the direct memory access unit to read the global update step $\alpha$, the exponential decay rates $\beta_1$, $\beta_2$ and the convergence threshold $c_t$ from the external designated space, which are then sent to the data processing module;
the controller unit reads in the assignment instruction from the instruction cache unit and drives the first-order moment vector m in the data cache unit according to the translated microinstructiont-1And vt-1Initializing and driving the number of iterations t in the data processing unit to be set to 1;
the controller unit reads a DATA _ IO instruction from the instruction cache unit, and drives the direct memory access unit to read the parameter vector theta to be updated from the external designated space according to the translated microinstructiont-1And corresponding gradient vectorThen sending the data into a data processing module;
the controller unit reads a data transmission instruction from the instruction cache unit, and caches the first-order moment vector m in the data cache unit according to the translated microinstructiont-1And a second moment vector vt-1To the data processing unit.
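In software terms, the instruction sequence above performs roughly the following (a loose analogue; the dict standing in for the external designated space and all numeric values are illustrative assumptions):

```python
import numpy as np

# Hypothetical stand-in for the external designated space.
external = {"alpha": 0.001, "beta1": 0.9, "beta2": 0.999, "ct": 1e-6,
            "theta": np.array([0.5, -1.2]), "grad": np.array([0.1, -0.3])}

# HYPERPARAMETER_IO: fetch step size, decay rates and convergence threshold.
alpha, beta1, beta2, ct = (external[k] for k in ("alpha", "beta1", "beta2", "ct"))

# assignment instruction: initialize the moment vectors and the iteration count.
m = np.zeros_like(external["theta"])
v = np.zeros_like(external["theta"])
t = 1

# DATA_IO: fetch the parameter vector to be updated and its gradient.
theta, grad = external["theta"], external["grad"]
```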
20. The method of claim 18, wherein updating the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ using the externally supplied gradient value $\nabla f(\theta_{t-1})$ and the exponential decay rates is implemented according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(\theta_{t-1})$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\nabla f(\theta_{t-1})^2$, and specifically includes:
the controller unit reads a moment vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the update of the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ held in the data cache unit; the moment vector update instruction is sent to the operation control sub-module, which issues the corresponding instructions to perform the following operations: it sends an INS_1 instruction to the basic operation sub-module, driving it to calculate $(1-\beta_1)$ and $(1-\beta_2)$; it sends an INS_2 instruction to the vector multiplication parallel operation sub-module, driving it to obtain the element-wise square $\nabla f(\theta_{t-1})^2$; it then sends an INS_3 instruction to the vector multiplication parallel operation sub-module, driving it to simultaneously calculate $\beta_1 m_{t-1}$, $\beta_2 v_{t-1}$, $(1-\beta_1)\nabla f(\theta_{t-1})$ and $(1-\beta_2)\nabla f(\theta_{t-1})^2$, the results being denoted $a_1$, $a_2$, $b_1$ and $b_2$ respectively; $a_1$ and $b_1$, and $a_2$ and $b_2$, are then each taken as two inputs and sent to the vector addition parallel operation sub-module, yielding the updated first-order moment vector $m_t = a_1 + b_1$ and second-order moment vector $v_t = a_2 + b_2$.
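Read as software, this decomposition is roughly the following (a sketch; the four INS_3 products run in one vectorized pass here rather than on parallel hardware lanes):

```python
import numpy as np

def update_moments(m_prev, v_prev, grad, beta1, beta2):
    c1, c2 = 1 - beta1, 1 - beta2     # INS_1: basic operation sub-module scalars
    grad_sq = grad * grad             # INS_2: element-wise gradient square
    a1 = beta1 * m_prev               # INS_3: four products, nominally in parallel
    a2 = beta2 * v_prev
    b1 = c1 * grad
    b2 = c2 * grad_sq
    return a1 + b1, a2 + b2           # vector addition: m_t, v_t
```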
21. The method of claim 20, wherein, after the first-order moment vector $m_{t-1}$ and the second-order moment vector $v_{t-1}$ are updated using the externally supplied gradient value $\nabla f(\theta_{t-1})$ and the exponential decay rates, the method further comprises:
the controller unit reads a data transmission instruction from the instruction cache unit and, according to the translated microinstruction, transmits the updated first-order moment vector $m_t$ and second-order moment vector $v_t$ from the data processing unit to the data cache unit.
22. The method of claim 18, wherein obtaining the biased moment estimate vectors $\hat{m}_t$ and $\hat{v}_t$ through the moment vector operation is implemented according to the formulas $\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$, and specifically includes:
the controller unit reads a moment estimate vector operation instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control sub-module to calculate the moment estimate vectors; the operation control sub-module issues the corresponding instructions to perform the following operations: it sends an instruction INS_4 to the basic operation sub-module, driving it to calculate $1/(1-\beta_1^t)$ and $1/(1-\beta_2^t)$ and to increment the iteration count t by 1; it then sends an instruction INS_5 to the vector multiplication parallel operation sub-module, driving it to calculate in parallel the product of the first-order moment vector $m_t$ with $1/(1-\beta_1^t)$ and the product of the second-order moment vector $v_t$ with $1/(1-\beta_2^t)$, obtaining the biased moment estimate vectors $\hat{m}_t$ and $\hat{v}_t$.
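In software terms (names are ours; the two scalar correction factors mirror the products the claim recites):

```python
def moment_estimates(m_t, v_t, t, beta1, beta2):
    # INS_4: basic operation sub-module computes the scalar correction factors
    # (the claim also increments the iteration count t here; omitted in this pure function)
    s1 = 1.0 / (1.0 - beta1 ** t)
    s2 = 1.0 / (1.0 - beta2 ** t)
    # INS_5: vector multiplication sub-module scales both moment vectors in parallel
    return m_t * s1, v_t * s2         # m_hat, v_hat (NumPy arrays expected)
```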
23. The method of claim 18, wherein updating the vector to be updated $\theta_{t-1}$ to $\theta_t$ is implemented according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$, and specifically includes:
the controller unit reads a parameter vector update instruction from the instruction cache unit and, according to the translated microinstruction, drives the operation control sub-module to perform the following operations: the operation control sub-module sends an instruction INS_6 to the basic operation sub-module, driving it to calculate $-\alpha$; it sends an instruction INS_7 to the vector square root parallel operation sub-module, driving it to obtain $\sqrt{\hat{v}_t}$; it sends an instruction INS_7 to the vector division parallel operation sub-module, driving it to obtain $\hat{m}_t/\sqrt{\hat{v}_t}$; it sends an instruction INS_8 to the vector multiplication parallel operation sub-module, driving it to obtain $-\alpha\,\hat{m}_t/\sqrt{\hat{v}_t}$; it sends an instruction INS_9 to the vector addition parallel operation sub-module, driving it to calculate $\theta_{t-1} + (-\alpha\,\hat{m}_t/\sqrt{\hat{v}_t})$ and obtain the updated parameter vector $\theta_t$, where $\theta_{t-1}$ is the vector before the t-th cycle's update ($\theta_0$ when not yet updated) and the t-th cycle updates $\theta_{t-1}$ to $\theta_t$; the operation control sub-module sends an instruction INS_10 to the vector division parallel operation sub-module, driving it to obtain a vector temp with elements $temp_i$; and the operation control sub-module sends instructions INS_11 and INS_12 to the vector addition parallel operation sub-module and the basic operation sub-module respectively, to calculate $sum = \sum_i temp_i$ and $temp2 = sum/n$, where $i = 1, 2, 3, \ldots, n$, n is the total number of cycles, and temp2 is the moving weighted average of the gradients.
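A minimal sketch of the INS_6 to INS_9 chain (the temp/temp2 bookkeeping for the convergence test is left out here; note again that no ε term appears in the recited denominator):

```python
import numpy as np

def update_theta(theta_prev, m_hat, v_hat, alpha):
    neg_alpha = -alpha            # INS_6: basic operation sub-module
    root = np.sqrt(v_hat)         # INS_7: vector square root sub-module
    ratio = m_hat / root          # INS_7: vector division sub-module
    step = neg_alpha * ratio      # INS_8: vector multiplication sub-module
    return theta_prev + step      # INS_9: vector addition sub-module -> theta_t
```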
24. The method of claim 23, wherein, after updating the vector to be updated $\theta_{t-1}$ to $\theta_t$, the method further comprises:
the controller unit reads a DATABACK_IO instruction from the instruction cache unit and, according to the translated microinstruction, transmits the updated parameter vector $\theta_t$ from the data processing unit to the external designated space through the direct memory access unit.
25. The method according to claim 18, wherein repeating the process until the vector to be updated converges includes judging whether the updated vector has converged, the judgment proceeding as follows:
the controller unit reads a convergence judgment instruction from the instruction cache unit; according to the translated microinstruction, the data processing module judges whether the updated parameter vector has converged: if temp2 < ct, convergence is reached and the operation ends, where temp2 is the moving weighted average of the gradients and ct is the convergence threshold.
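As a sketch of this judgment (what exactly the INS_10 vector temp contains is not fully specified above, so it is left as an input here):

```python
import numpy as np

def convergence_check(temp, n, ct):
    """Claim-25 style judgment: sum the temp vector, average over n,
    and compare against the convergence threshold ct."""
    temp2 = np.sum(temp) / n      # INS_11 / INS_12: sum, then divide by n
    return temp2 < ct             # True -> converged, operation ends
```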
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201610269689.7A (CN107315570B) | 2016-04-27 | 2016-04-27 | Device and method for executing Adam gradient descent training algorithm
Publications (2)

Publication Number | Publication Date
---|---
CN107315570A (en) | 2017-11-03
CN107315570B (en) | 2021-06-18
Family
ID=60185643
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201610269689.7A (CN107315570B, Active) | Device and method for executing Adam gradient descent training algorithm | 2016-04-27 | 2016-04-27
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107315570B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109782392A (en) * | 2019-02-27 | 2019-05-21 | 中国科学院光电技术研究所 | A kind of fiber-optic coupling method based on modified random paralleling gradient descent algorithm |
CN111460528B (en) * | 2020-04-01 | 2022-06-14 | 支付宝(杭州)信息技术有限公司 | Multi-party combined training method and system based on Adam optimization algorithm |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101931416A (en) * | 2009-06-24 | 2010-12-29 | 中国科学院微电子研究所 | Parallel hierarchical decoder for low density parity code (LDPC) in mobile digital multimedia broadcasting system |
CN102156637A (en) * | 2011-05-04 | 2011-08-17 | 中国人民解放军国防科学技术大学 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
CN103956992A (en) * | 2014-03-26 | 2014-07-30 | 复旦大学 | Self-adaptive signal processing method based on multi-step gradient decrease |
CN104360597A (en) * | 2014-11-02 | 2015-02-18 | 北京工业大学 | Sewage treatment process optimization control method based on multiple gradient descent |
CN104376124A (en) * | 2014-12-09 | 2015-02-25 | 西华大学 | Clustering algorithm based on disturbance absorbing principle |
CN104978282A (en) * | 2014-04-04 | 2015-10-14 | 上海芯豪微电子有限公司 | Cache system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8862206B2 (en) * | 2009-11-12 | 2014-10-14 | Virginia Tech Intellectual Properties, Inc. | Extended interior methods and systems for spectral, optical, and photoacoustic imaging |
Non-Patent Citations (1)

Title
---
D. Kingma, J. Ba, "Adam: A Method for Stochastic Optimization," ICLR 2015, 2015-01-23, full text *
Similar Documents

Publication | Title
---|---
CN111310904B (en) | Apparatus and method for performing convolutional neural network training
CN111353589B (en) | Apparatus and method for performing artificial neural network forward operations
US11574195B2 (en) | Operation method
CN110929863B (en) | Apparatus and method for performing LSTM operations
CN111353588B (en) | Apparatus and method for performing artificial neural network reverse training
CN108268939B (en) | Apparatus and method for performing LSTM neural network operations
CN111860813B (en) | Device and method for performing forward operation of convolutional neural network
WO2017124648A1 (en) | Vector computing device
JP5987233B2 (en) | Apparatus, method, and system
EP3832499A1 (en) | Matrix computing device
WO2018120016A1 (en) | Apparatus for executing LSTM neural network operation, and operational method
CN107766079B (en) | Processor and method for executing instructions on processor
EP3451238A1 (en) | Apparatus and method for executing pooling operation
WO2017185257A1 (en) | Device and method for performing Adam gradient descent training algorithm
CN107341132B (en) | Device and method for executing AdaGrad gradient descent training algorithm
CN113222101A (en) | Deep learning processing device, method, equipment and storage medium
WO2017185393A1 (en) | Apparatus and method for executing inner product operation of vectors
WO2017185392A1 (en) | Device and method for performing four fundamental operations of arithmetic of vectors
CN107315570B (en) | Device and method for executing Adam gradient descent training algorithm
CN107315569B (en) | Device and method for executing RMSprop gradient descent algorithm
WO2017185248A1 (en) | Apparatus and method for performing auto-learning operation of artificial neural network
WO2017185335A1 (en) | Apparatus and method for executing batch normalization operation
WO2017185256A1 (en) | RMSprop gradient descent algorithm execution apparatus and method
CN107341540B (en) | Device and method for executing Hessian-Free training algorithm
CN111860814B (en) | Apparatus and method for performing batch normalization operations
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| CB02 | Change of applicant information | Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing. Applicant after: Zhongke Cambrian Technology Co., Ltd. Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing. Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.
| GR01 | Patent grant |