CN107315570A - Device and method for executing an Adam gradient descent training algorithm - Google Patents
Device and method for executing an Adam gradient descent training algorithm Download PDF Info
- Publication number
- CN107315570A CN107315570A CN201610269689.7A CN201610269689A CN107315570A CN 107315570 A CN107315570 A CN 107315570A CN 201610269689 A CN201610269689 A CN 201610269689A CN 107315570 A CN107315570 A CN 107315570A
- Authority
- CN
- China
- Prior art keywords
- vector
- submodule
- unit
- instruction
- updated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000004422 calculation algorithm Methods 0.000 title claims abstract description 57
- 238000000034 method Methods 0.000 title claims abstract description 49
- 230000007423 decrease Effects 0.000 title claims abstract description 30
- 238000012545 processing Methods 0.000 claims abstract description 74
- 239000000872 buffer Substances 0.000 claims abstract description 54
- 230000008569 process Effects 0.000 claims description 15
- 230000005540 biological transmission Effects 0.000 claims description 9
- 230000006399 behavior Effects 0.000 claims description 6
- 238000010586 diagram Methods 0.000 description 4
- 230000006870 function Effects 0.000 description 4
- 238000005457 optimization Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000001052 transient effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3802—Instruction prefetching
- G06F9/3814—Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
- G06F9/3832—Value prediction for operands; operand history buffers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Liquid Crystal Display Device Control (AREA)
Abstract
The invention discloses a device and method for executing the Adam gradient descent training algorithm. The device comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The method comprises: first reading the gradient vector and the vector to be updated, while initializing the first-moment and second-moment vectors and the corresponding exponential decay rates; in each iteration, updating the first-moment and second-moment vectors with the gradient vector, computing the first-moment and second-moment estimate vectors, and using these estimate vectors to update the parameter to be updated; training continues until the parameter vector to be updated converges. With the invention, the Adam gradient descent algorithm can be applied and the efficiency of data processing is substantially improved.
Description
Technical field
The present invention relates to the technical field of Adam algorithm applications, and in particular to a device and method for executing an Adam gradient descent training algorithm, and to related applications of hardware implementations of Adam gradient optimization algorithms.
Background technology
Gradient optimization algorithms are widely used in fields such as function approximation, optimization computation, pattern recognition and image processing. As one of the gradient optimization algorithms, the Adam algorithm is widely used because it is easy to implement, requires little computation and little memory, and is consistent under symmetric transformations of the gradient; implementing the Adam algorithm with a dedicated device can significantly improve its execution speed.
At present, one known method of executing the Adam gradient descent algorithm is to use a general-purpose processor, which supports the above algorithm by executing general-purpose instructions with general-purpose register files and general-purpose functional units. One disadvantage of this method is that the computational performance of a single general-purpose processor is relatively low, and when multiple general-purpose processors execute in parallel, the communication between them in turn becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations associated with the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the front-end decoding of the processor incurs a considerable power overhead.
Another known method of executing the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the above algorithm by executing general-purpose single-instruction multiple-data (SIMD) instructions with general-purpose register files and general-purpose stream processing units. Because the GPU is a device dedicated to graphics and scientific computing, it has no dedicated support for the operations of the Adam gradient descent algorithm; a large amount of front-end decoding work is still needed before the operations of the algorithm can be executed, which brings a large overhead. In addition, the GPU has only a small on-chip cache, so the data needed by the computation (such as the first-moment vector and the second-moment vector) must be transferred from off-chip repeatedly; the off-chip bandwidth becomes the main performance bottleneck and also brings a huge power overhead.
Summary of the invention
(1) Technical problem to be solved
In view of this, the main object of the present invention is to provide a device and method for executing an Adam gradient descent training algorithm, so as to solve the problems of insufficient computational performance and large front-end decoding overhead of general-purpose processors for such data, and to avoid reading data from memory repeatedly, thereby reducing the memory-access bandwidth.
(2) Technical solution
To achieve the above object, the invention provides a device for executing an Adam gradient descent training algorithm. The device comprises a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4 and a data processing module 5, wherein:
the direct memory access unit 1 is used to access an external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;
the instruction cache unit 2 is used to read instructions through the direct memory access unit 1 and cache the instructions that are read;
the controller unit 3 is used to read instructions from the instruction cache unit 2 and decode the read instructions into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5;
the data cache unit 4 is used to cache the first-moment vector and the second-moment vector during initialization and during each data update;
the data processing module 5 is used to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated into the external designated space through the direct memory access unit 1.
In the above solution, the direct memory access unit 1 writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly back to the external designated space.
In the above solution, the controller unit 3 decodes the read instructions into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5, so as to control the direct memory access unit 1 to read data from the external designated address and write data to the external designated address, control the data cache unit 4 to obtain, through the direct memory access unit 1, the instructions needed by the operation from the external designated address, control the data processing module 5 to perform the update operation on the parameters to be updated, and control the data transfer between the data cache unit 4 and the data processing module 5.
In the above solution, the data cache unit 4 initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update it reads the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to the first-moment vector m_t and the second-moment vector v_t and then written back into the data cache unit 4.
In the above solution, throughout the operation of the device, the data cache unit 4 always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it.
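As an illustration of this design point, the following sketch models, in software, a data cache that keeps m_t and v_t resident across iterations so that only θ and the gradient cross the boundary to the external space; the class names, the transfer counter and the hyperparameter values are assumptions made for the example, not part of the patent.

```python
import numpy as np

class OnChipMomentCache:
    """Software model of the data cache unit: m and v stay resident across iterations."""
    def __init__(self, size):
        self.m = np.zeros(size)
        self.v = np.zeros(size)

class ExternalSpace:
    """Software model of the external designated space; counts off-chip transfers."""
    def __init__(self, theta):
        self.theta = theta
        self.transfers = 0
    def read_theta_and_grad(self, grad_fn):
        self.transfers += 1
        return self.theta.copy(), grad_fn(self.theta)
    def write_theta(self, theta):
        self.transfers += 1
        self.theta = theta

def iterate(external, cache, grad_fn, t, alpha=0.001, beta1=0.9, beta2=0.999):
    theta, g = external.read_theta_and_grad(grad_fn)   # off-chip traffic: theta and gradient only
    cache.m = beta1 * cache.m + (1 - beta1) * g        # m, v updated in place, on-chip
    cache.v = beta2 * cache.v + (1 - beta2) * g * g
    m_hat = cache.m / (1 - beta1 ** t)
    v_hat = cache.v / (1 - beta2 ** t)
    external.write_theta(theta - alpha * m_hat / np.sqrt(v_hat))

ext, cache = ExternalSpace(np.array([1.0, -2.0])), OnChipMomentCache(2)
for t in range(1, 4):
    iterate(ext, cache, lambda th: 2 * th, t)
print(ext.theta, ext.transfers)  # m and v never crossed the off-chip boundary
```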
In the above solution, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step size α and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit 4, and writes θ_t into the external designated space through the direct memory access unit 1.
In the above solution, the data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}) and v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})² (element-wise square of the gradient), computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t according to the formulas m̂_t = m_t/(1 - β1^t) and v̂_t = v_t/(1 - β2^t), and updates the vector to be updated θ_{t-1} to θ_t according to the formula θ_t = θ_{t-1} - α·m̂_t/√v̂_t.
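For reference, the update performed by the data processing module is the standard Adam step, and the formulas above can be restated as the following NumPy sketch; the ε term of the original Adam formulation is omitted because the operations described in this document do not mention it.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha, beta1, beta2):
    """One Adam update as described above: returns (theta_t, m_t, v_t)."""
    m = beta1 * m + (1 - beta1) * grad              # first-moment vector m_t
    v = beta2 * v + (1 - beta2) * grad * grad       # second-moment vector v_t
    m_hat = m / (1 - beta1 ** t)                    # moment estimate vector m_hat_t
    v_hat = v / (1 - beta2 ** t)                    # moment estimate vector v_hat_t
    theta = theta - alpha * m_hat / np.sqrt(v_hat)  # updated parameter vector theta_t
    return theta, m, v
```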
In the above solution, the data processing module 5 comprises an operation control submodule 51, a parallel vector addition submodule 52, a parallel vector multiplication submodule 53, a parallel vector division submodule 54, a parallel vector square-root submodule 55 and a basic operation submodule 56, wherein the parallel vector addition submodule 52, the parallel vector multiplication submodule 53, the parallel vector division submodule 54, the parallel vector square-root submodule 55 and the basic operation submodule 56 are connected in parallel with each other, and the operation control submodule 51 is connected in series with each of the parallel vector addition submodule 52, the parallel vector multiplication submodule 53, the parallel vector division submodule 54, the parallel vector square-root submodule 55 and the basic operation submodule 56. When the device operates on vectors, the vector operations are element-wise, and when a given operation is performed on a vector, the elements at different positions are processed in parallel.
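The element-wise, position-parallel behavior described above can be pictured with the following software sketch, which splits a vector into chunks and applies the same element-wise operation to every chunk concurrently; the chunking scheme, the number of lanes and the thread pool are assumptions chosen for illustration.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def elementwise_parallel(op, x, y=None, lanes=4):
    """Apply an element-wise op to a vector, processing different positions in parallel lanes."""
    chunks = np.array_split(np.arange(len(x)), lanes)
    out = np.empty_like(x)

    def work(idx):
        out[idx] = op(x[idx]) if y is None else op(x[idx], y[idx])

    with ThreadPoolExecutor(max_workers=lanes) as pool:
        list(pool.map(work, chunks))  # each lane handles a disjoint slice of positions
    return out

# Example: parallel element-wise addition and square root, as the submodules would perform.
a, b = np.arange(8.0), np.ones(8)
print(elementwise_parallel(np.add, a, b))
print(elementwise_parallel(np.sqrt, a))
```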
To achieve the above object, the invention also provides a method for executing an Adam gradient descent training algorithm, the method comprising:
initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2 and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;
when performing the gradient descent operation, first updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} with the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, then obtaining the moment estimate vectors m̂_t and v̂_t from the moment vectors, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
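Putting the method together, the iteration can be summarized by the following sketch of a training loop that repeats the Adam step until the parameter vector converges; the gradient function, the convergence measure and the hyperparameter values are placeholders chosen for illustration rather than values prescribed by the patent.

```python
import numpy as np

def adam_train(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, ct=1e-8, max_iter=10000):
    """Repeat the Adam update until the parameter vector to be updated converges."""
    theta = theta0.copy()
    m = np.zeros_like(theta)           # first-moment vector m_0
    v = np.zeros_like(theta)           # second-moment vector v_0
    for t in range(1, max_iter + 1):
        g = grad_fn(theta)             # externally supplied gradient of f at theta_{t-1}
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        step = alpha * m_hat / np.sqrt(v_hat)
        theta -= step
        if np.mean(np.abs(step)) < ct:  # placeholder convergence test against threshold ct
            break
    return theta

# Usage example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
print(adam_train(lambda th: 2 * th, np.array([1.0, -2.0, 3.0])))
```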
In the above solution, initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2 and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, comprises:
Step S1: an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read from the external address space all the instructions related to the Adam gradient descent algorithm.
Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read from the external address space all the instructions related to the Adam gradient descent algorithm and cache them in the instruction cache unit 2.
Step S3: the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1, β2 and the convergence threshold ct from the external designated space, which are then sent to the data processing module 5.
Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the first-moment vector m_{t-1} and second-moment vector v_{t-1} in the data cache unit 4 to be initialized, and sets the iteration count t in the data processing unit 5 to 1.
Step S5: the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space, which are then sent to the data processing module 5.
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit 4 to the data processing unit 5.
In the above solution, updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} with the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates is realized according to the formulas m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}) and v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})², and specifically comprises: the controller unit 3 reads a moment-vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the data cache unit 4 to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}. In this update operation, the moment-vector update instruction is sent to the operation control submodule 51, which issues the corresponding instructions to perform the following operations: it sends instruction INS_1 to the basic operation submodule 56, driving it to compute (1 - β1) and (1 - β2); it sends instruction INS_2 to the parallel vector multiplication submodule 53, driving it to compute ∇f(θ_{t-1})²; it then sends instruction INS_3 to the parallel vector multiplication submodule 53, driving it to compute β1·m_{t-1}, β2·v_{t-1}, (1 - β1)·∇f(θ_{t-1}) and (1 - β2)·∇f(θ_{t-1})² in parallel, the results being denoted a1, a2, b1 and b2 respectively; a1 and b1, and a2 and b2, are then fed as pairs of inputs to the parallel vector addition submodule 52, yielding the updated first-moment vector m_t and second-moment vector v_t.
In the above solution, after updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} with the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, the method further comprises: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, sends the updated first-moment vector m_t and second-moment vector v_t from the data processing unit 5 to the data cache unit 4.
In the above solution, obtaining the moment estimate vectors m̂_t and v̂_t from the moment vectors is realized according to the formulas m̂_t = m_t/(1 - β1^t) and v̂_t = v_t/(1 - β2^t), and specifically comprises: the controller unit 3 reads a moment-estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the operation control submodule 51 to carry out the computation of the moment estimate vectors; the operation control submodule 51 issues the corresponding instructions to perform the following operations: it sends instruction INS_4 to the basic operation submodule 56, driving it to compute 1/(1 - β1^t) and 1/(1 - β2^t), and the iteration count t is incremented by 1; it sends instruction INS_5 to the parallel vector multiplication submodule 53, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1 - β1^t) and of the second-moment vector v_t with 1/(1 - β2^t), obtaining the moment estimate vectors m̂_t and v̂_t.
In the above solution, updating the vector to be updated θ_{t-1} to θ_t is realized according to the formula θ_t = θ_{t-1} - α·m̂_t/√v̂_t, and specifically comprises: the controller unit 3 reads a parameter-vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the operation control submodule 51 to carry out the following operations: the operation control submodule 51 sends instruction INS_6 to the basic operation submodule 56, driving it to compute -α; it sends instruction INS_7 to the parallel vector square-root submodule 55, driving it to compute √v̂_t; it sends instruction INS_7 to the parallel vector division submodule 54, driving it to compute m̂_t/√v̂_t; it sends instruction INS_8 to the parallel vector multiplication submodule 53, driving it to compute -α·m̂_t/√v̂_t; it sends instruction INS_9 to the parallel vector addition submodule 52, driving it to compute θ_{t-1} - α·m̂_t/√v̂_t, obtaining the updated parameter vector θ_t; here θ_{t-1} is the value held by the parameter vector (initialized as θ_0) before the t-th iteration, and the t-th iteration updates θ_{t-1} to θ_t. The operation control submodule 51 then sends instruction INS_10 to the parallel vector division submodule 54, driving it to compute the vector temp, and sends instructions INS_11 and INS_12 to the parallel vector addition submodule 52 and the basic operation submodule 56 respectively, computing sum = Σ_i temp_i and temp2 = sum/n.
In the above solution, after updating the vector to be updated θ_{t-1} to θ_t, the method further comprises: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, sends the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
In the above solution, the step of repeating this process until the vector to be updated converges comprises judging whether the vector to be updated has converged, the specific judgment being as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, the data processing module 5 judges whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
(3) Beneficial effects
It can be seen from the above technical solutions that the present invention has the following advantages:
1. With the device and method for executing the Adam gradient descent training algorithm provided by the present invention, by using a device dedicated to executing the Adam gradient descent training algorithm, the problems of insufficient computational performance of general-purpose processors for such data and large front-end decoding overhead can be solved, and the execution speed of related applications is accelerated.
2. With the device and method for executing the Adam gradient descent training algorithm provided by the present invention, since the data cache unit temporarily stores the moment vectors needed in the intermediate process, repeated reads of data from memory are avoided, the IO operations between the device and the external address space are reduced, and the memory-access bandwidth is reduced.
3. With the device and method for executing the Adam gradient descent training algorithm provided by the present invention, since the data processing module performs vector operations with the corresponding parallel operation submodules, the degree of parallelism is greatly improved.
4. With the device and method for executing the Adam gradient descent training algorithm provided by the present invention, since the data processing module performs vector operations with the corresponding parallel operation submodules and the degree of parallelism of the operations is high, the working frequency can be kept relatively low, so that the power overhead is small.
Brief description of the drawings
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows an example block diagram of the overall structure of the device for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Fig. 2 shows an example block diagram of the data processing module in the device for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Fig. 3 shows a flowchart of the method for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Throughout the drawings, the same devices, components, units, etc. are denoted by the same reference numerals.
Embodiment
Other aspects, advantages and salient features of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the present invention, taken in conjunction with the accompanying drawings.
In the present invention, the terms "comprising" and "containing" and their derivatives are intended to be inclusive and not limiting; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments used below to describe the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to help a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but these details should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. In addition, descriptions of well-known functions and structures are omitted for clarity and brevity. Throughout the drawings, the same reference numerals are used for the same functions and operations.
The device and method for executing the Adam gradient descent training algorithm according to embodiments of the present invention accelerate the application of the Adam gradient descent algorithm. First, the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2 and the learning step size α are initialized, and the vector to be updated θ_0 is obtained from the external designated space. When performing the gradient descent operation, in each iteration the first-moment vector m_{t-1} and the second-moment vector v_{t-1} are first updated with the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, i.e. m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}) and v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})²; then the moment estimate vectors m̂_t and v̂_t are obtained from the moment vectors, i.e. m̂_t = m_t/(1 - β1^t) and v̂_t = v_t/(1 - β2^t); finally, the vector to be updated θ_{t-1} is updated to θ_t and output, i.e. θ_t = θ_{t-1} - α·m̂_t/√v̂_t, where θ_{t-1} is the value held by the parameter vector (initialized as θ_0) before the t-th iteration, and the t-th iteration updates θ_{t-1} to θ_t. This process is repeated until the vector to be updated converges.
Fig. 1 shows an example block diagram of the overall structure of the device for implementing the Adam gradient descent algorithm according to an embodiment of the present invention. As shown in Fig. 1, the device comprises a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4 and a data processing module 5, all of which can be implemented by hardware circuits.
The direct memory access unit 1 is used to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data. Specifically, it writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly back to the external designated space.
The instruction cache unit 2 is used to read instructions through the direct memory access unit 1 and to cache the instructions that are read.
The controller unit 3 is used to read instructions from the instruction cache unit 2, decode the read instructions into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4 or the data processing module 5, and send each micro-instruction to the direct memory access unit 1, the data cache unit 4 or the data processing module 5; it controls the direct memory access unit 1 to read data from the external designated address and write data to the external designated address, controls the data cache unit 4 to obtain, through the direct memory access unit 1, the instructions needed by the operation from the external designated address, controls the data processing module 5 to perform the update operation on the parameters to be updated, and controls the data transfer between the data cache unit 4 and the data processing module 5.
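To make the control flow concrete, the following is a minimal software sketch of a controller loop that decodes an instruction stream and dispatches each decoded micro-instruction to the unit that handles it; the opcode names and the handler functions are hypothetical stand-ins, not the patent's actual instruction encoding.

```python
# Minimal sketch (hypothetical encoding): a controller loop that reads instructions from the
# instruction cache and dispatches each decoded micro-instruction to the unit responsible for it.

def run_controller(instruction_cache, units):
    """instruction_cache: list of (opcode, operands); units: dict mapping opcode -> handler."""
    for opcode, operands in instruction_cache:
        units[opcode](*operands)  # decoded micro-instruction drives the DMA unit, cache, or processor

# Usage example with stub handlers standing in for the DMA unit, data cache and data processing module.
program = [("HYPERPARAMETER_IO", (0x1000,)), ("DATA_IO", (0x2000,)), ("MOMENT_UPDATE", ())]
units = {
    "HYPERPARAMETER_IO": lambda addr: print(f"DMA unit: read hyperparameters from {hex(addr)}"),
    "DATA_IO":           lambda addr: print(f"DMA unit: read parameters and gradient from {hex(addr)}"),
    "MOMENT_UPDATE":     lambda: print("data processing module: update m and v"),
}
run_controller(program, units)
```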
The data cache unit 4 is used to cache the first-moment vector and the second-moment vector during initialization and during each data update. Specifically, the data cache unit 4 initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update, it reads the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to the first-moment vector m_t and the second-moment vector v_t and then written back into the data cache unit 4. Throughout the operation of the device, the data cache unit 4 always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it. In the present invention, since the data cache unit temporarily stores the moment vectors needed in the intermediate process, repeated reads of data from memory are avoided, the IO operations between the device and the external address space are reduced, and the memory-access bandwidth is reduced.
The data processing module 5 is used to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step size α and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, i.e. m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}) and v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})²; computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, i.e. m̂_t = m_t/(1 - β1^t) and v̂_t = v_t/(1 - β2^t); and finally updates the vector to be updated θ_{t-1} to θ_t, i.e. θ_t = θ_{t-1} - α·m̂_t/√v̂_t, writes m_t, v_t into the data cache unit 4, and writes θ_t into the external designated space through the direct memory access unit 1. In the present invention, since the data processing module performs vector operations with the corresponding parallel operation submodules, the degree of parallelism is greatly improved, so the working frequency can be kept relatively low, which in turn keeps the power overhead small.
Fig. 2 shows an example block diagram of the data processing module in the device for implementing applications related to the Adam gradient descent algorithm according to an embodiment of the present invention. As shown in Fig. 2, the data processing module 5 comprises an operation control submodule 51, a parallel vector addition submodule 52, a parallel vector multiplication submodule 53, a parallel vector division submodule 54, a parallel vector square-root submodule 55 and a basic operation submodule 56, wherein the parallel vector addition submodule 52, the parallel vector multiplication submodule 53, the parallel vector division submodule 54, the parallel vector square-root submodule 55 and the basic operation submodule 56 are connected in parallel with each other, and the operation control submodule 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and when a given operation is performed on a vector, the elements at different positions are processed in parallel.
Fig. 3 shows a flowchart of the method for executing the Adam gradient descent training algorithm according to an embodiment of the present invention, which specifically comprises the following steps:
Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read from the external address space all the instructions related to the Adam gradient descent algorithm.
Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read from the external address space all the instructions related to the Adam gradient descent algorithm and cache them in the instruction cache unit 2.
Step S3: the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1, β2 and the convergence threshold ct from the external designated space, which are then sent to the data processing module 5.
Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the first-moment vector m_{t-1} and second-moment vector v_{t-1} in the data cache unit 4 to be initialized, and sets the iteration count t in the data processing unit 5 to 1.
Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space, which are then sent to the data processing module 5.
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit 4 to the data processing unit 5.
Step S7: the controller unit 3 reads a moment-vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the data cache unit 4 to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}. In this update operation, the moment-vector update instruction is sent to the operation control submodule 51, which issues the corresponding instructions to perform the following operations: operation instruction 1 (INS_1) is sent to the basic operation submodule 56, driving it to compute (1 - β1) and (1 - β2); operation instruction 2 (INS_2) is sent to the parallel vector multiplication submodule 53, driving it to compute ∇f(θ_{t-1})²; operation instruction 3 (INS_3) is then sent to the parallel vector multiplication submodule 53, driving it to compute β1·m_{t-1}, β2·v_{t-1}, (1 - β1)·∇f(θ_{t-1}) and (1 - β2)·∇f(θ_{t-1})² in parallel, the results being denoted a1, a2, b1 and b2 respectively; a1 and b1, and a2 and b2, are then fed as pairs of inputs to the parallel vector addition submodule 52, yielding the updated first-moment vector m_t and second-moment vector v_t.
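A software analogue of step S7, decomposing the moment update into the scalar, multiplication and addition stages described above, might look like the following sketch; the intermediate names a1, a2, b1 and b2 follow the text, and the pairing a1 + b1 and a2 + b2 is the one required for the results to equal m_t and v_t.

```python
import numpy as np

def step_s7_update_moments(m_prev, v_prev, grad, beta1, beta2):
    """Moment update decomposed as in step S7 (INS_1..INS_3 followed by vector additions)."""
    c1, c2 = 1 - beta1, 1 - beta2   # INS_1: basic operation submodule computes (1 - beta1), (1 - beta2)
    grad_sq = grad * grad           # INS_2: vector multiplication submodule squares the gradient
    a1 = beta1 * m_prev             # INS_3: four products computed in parallel
    a2 = beta2 * v_prev
    b1 = c1 * grad
    b2 = c2 * grad_sq
    m_t = a1 + b1                   # vector addition submodule: updated first-moment vector m_t
    v_t = a2 + b2                   # vector addition submodule: updated second-moment vector v_t
    return m_t, v_t

m_t, v_t = step_s7_update_moments(np.zeros(3), np.zeros(3), np.array([0.1, -0.2, 0.3]), 0.9, 0.999)
print(m_t, v_t)
```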
Step S8: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, sends the updated first-moment vector m_t and second-moment vector v_t from the data processing unit 5 to the data cache unit 4.
Step S9: the controller unit 3 reads a moment-estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the operation control submodule 51 to carry out the computation of the moment estimate vectors; the operation control submodule 51 issues the corresponding instructions to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation submodule 56, driving it to compute 1/(1 - β1^t) and 1/(1 - β2^t), and the iteration count t is incremented by 1; it sends operation instruction 5 (INS_5) to the parallel vector multiplication submodule 53, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1 - β1^t) and of the second-moment vector v_t with 1/(1 - β2^t), obtaining the moment estimate vectors m̂_t and v̂_t.
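Step S9 can be mirrored in software as below; the scalars 1/(1 - β1^t) and 1/(1 - β2^t) attributed here to the basic operation submodule are inferred from the bias-correction formulas, since the exact expressions are not reproduced in this text.

```python
import numpy as np

def step_s9_bias_correct(m_t, v_t, beta1, beta2, t):
    """Moment estimates as in step S9: scalar reciprocals (INS_4), then parallel products (INS_5)."""
    s1 = 1.0 / (1 - beta1 ** t)   # INS_4: basic operation submodule, first-moment correction factor
    s2 = 1.0 / (1 - beta2 ** t)   # INS_4: second-moment correction factor
    m_hat = m_t * s1              # INS_5: vector multiplication submodule, computed in parallel
    v_hat = v_t * s2
    return m_hat, v_hat, t + 1    # iteration count t is incremented alongside

print(step_s9_bias_correct(np.array([0.01, -0.02]), np.array([1e-4, 4e-4]), 0.9, 0.999, 1))
```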
Step S10: the controller unit 3 reads a parameter-vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the operation control submodule 51 to carry out the following operations: the operation control submodule 51 sends operation instruction 6 (INS_6) to the basic operation submodule 56, driving it to compute -α; it sends operation instruction 7 (INS_7) to the parallel vector square-root submodule 55, driving it to compute √v̂_t; it sends operation instruction 7 (INS_7) to the parallel vector division submodule 54, driving it to compute m̂_t/√v̂_t; it sends operation instruction 8 (INS_8) to the parallel vector multiplication submodule 53, driving it to compute -α·m̂_t/√v̂_t; it sends operation instruction 9 (INS_9) to the parallel vector addition submodule 52, driving it to compute θ_{t-1} - α·m̂_t/√v̂_t, obtaining the updated parameter vector θ_t; here θ_{t-1} is the value held by the parameter vector (initialized as θ_0) before the t-th iteration, and the t-th iteration updates θ_{t-1} to θ_t. The operation control submodule 51 then sends operation instruction 10 (INS_10) to the parallel vector division submodule 54, driving it to compute the vector temp, and sends operation instruction 11 (INS_11) and operation instruction 12 (INS_12) to the parallel vector addition submodule 52 and the basic operation submodule 56 respectively, computing sum = Σ_i temp_i and temp2 = sum/n.
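Step S10 strings the remaining submodule operations together. The sketch below follows the INS_6 to INS_9 sequence for the parameter update and, because the exact expression behind the vector temp is not reproduced in this text, uses the element-wise relative change of the parameters as an assumed stand-in for the quantity whose mean gives temp2.

```python
import numpy as np

def step_s10_update_params(theta_prev, m_hat, v_hat, alpha):
    """Parameter update and convergence metric as in step S10 (INS_6..INS_12)."""
    neg_alpha = -alpha                 # INS_6: basic operation submodule
    root = np.sqrt(v_hat)              # INS_7: vector square-root submodule
    ratio = m_hat / root               # INS_7: vector division submodule
    scaled = neg_alpha * ratio         # INS_8: vector multiplication submodule
    theta_t = theta_prev + scaled      # INS_9: vector addition submodule -> theta_t
    # INS_10..INS_12: the element-wise definition of temp is not given in this text, so a
    # relative parameter change is assumed here purely for illustration.
    temp = np.abs(theta_t - theta_prev) / (np.abs(theta_prev) + 1e-12)
    temp2 = np.sum(temp) / temp.size   # sum = sum_i temp_i, temp2 = sum / n
    return theta_t, temp2

theta_t, temp2 = step_s10_update_params(np.array([1.0, -2.0]), np.array([0.1, -0.1]),
                                        np.array([0.01, 0.01]), alpha=0.001)
print(theta_t, temp2)  # converged when temp2 < ct
```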
Step S11: the controller unit 3 reads a write-back instruction (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, sends the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
Step S12: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, the data processing module 5 judges whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends, otherwise execution continues from step S5.
By using a device dedicated to executing the Adam gradient descent training algorithm, the present invention can solve the problems of insufficient computational performance of general-purpose processors for such data and large front-end decoding overhead, and accelerates the execution speed of related applications. At the same time, the use of the data cache unit avoids reading data from memory repeatedly and reduces the memory-access bandwidth.
The processes or methods depicted in the preceding figures can be performed by processing logic comprising hardware (for example, circuits, dedicated logic, etc.), firmware, software (for example, software embodied in a non-transitory computer-readable medium), or a combination of both. Although the processes or methods are described above in terms of certain ordered operations, it should be understood that some of the described operations can be performed in a different order. Moreover, some operations can be performed in parallel rather than sequentially.
In the foregoing specification, various embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made to each embodiment without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
Claims (17)
1. A device for executing an Adam gradient descent training algorithm, characterized in that the device comprises a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a data cache unit (4) and a data processing module (5), wherein:
the direct memory access unit (1) is used to access an external designated space, read and write data to the instruction cache unit (2) and the data processing module (5), and complete the loading and storing of data;
the instruction cache unit (2) is used to read instructions through the direct memory access unit (1) and cache the instructions that are read;
the controller unit (3) is used to read instructions from the instruction cache unit (2) and decode the read instructions into micro-instructions that control the behavior of the direct memory access unit (1), the data cache unit (4) or the data processing module (5);
the data cache unit (4) is used to cache the first-moment vector and the second-moment vector during initialization and during each data update;
the data processing module (5) is used to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit (4), and write the updated vector to be updated into the external designated space through the direct memory access unit (1).
2. The device for executing an Adam gradient descent training algorithm according to claim 1, characterized in that the direct memory access unit (1) writes instructions from the external designated space to the instruction cache unit (2), reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module (5), and writes the updated parameter vector from the data processing module (5) directly back to the external designated space.
3. The device for executing an Adam gradient descent training algorithm according to claim 1, characterized in that the controller unit (3) decodes the read instructions into micro-instructions that control the behavior of the direct memory access unit (1), the data cache unit (4) or the data processing module (5), so as to control the direct memory access unit (1) to read data from the external designated address and write data to the external designated address, control the data cache unit (4) to obtain, through the direct memory access unit (1), the instructions needed by the operation from the external designated address, control the data processing module (5) to perform the update operation on the parameters to be updated, and control the data transfer between the data cache unit (4) and the data processing module (5).
4. The device for executing an Adam gradient descent training algorithm according to claim 1, characterized in that the data cache unit (4) initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update it reads the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and sends them to the data processing module (5), where they are updated to the first-moment vector m_t and the second-moment vector v_t and then written back into the data cache unit (4).
5. The device for executing an Adam gradient descent training algorithm according to claim 4, characterized in that throughout the operation of the device, the data cache unit (4) always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it.
6. The device for executing an Adam gradient descent training algorithm according to claim 1, characterized in that the data processing module (5) reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit (4), and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step size α and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit (1); it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit (4), and writes θ_t into the external designated space through the direct memory access unit (1).
7. The device for executing an Adam gradient descent training algorithm according to claim 6, characterized in that the data processing module (5) updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}) and v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})², computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t according to the formulas m̂_t = m_t/(1 - β1^t) and v̂_t = v_t/(1 - β2^t), and updates the vector to be updated θ_{t-1} to θ_t according to the formula θ_t = θ_{t-1} - α·m̂_t/√v̂_t.
8. The device for executing an Adam gradient descent training algorithm according to claim 1 or 7, characterized in that the data processing module (5) comprises an operation control submodule (51), a parallel vector addition submodule (52), a parallel vector multiplication submodule (53), a parallel vector division submodule (54), a parallel vector square-root submodule (55) and a basic operation submodule (56), wherein the parallel vector addition submodule (52), the parallel vector multiplication submodule (53), the parallel vector division submodule (54), the parallel vector square-root submodule (55) and the basic operation submodule (56) are connected in parallel with each other, and the operation control submodule (51) is connected in series with each of the parallel vector addition submodule (52), the parallel vector multiplication submodule (53), the parallel vector division submodule (54), the parallel vector square-root submodule (55) and the basic operation submodule (56).
9. The device for executing an Adam gradient descent training algorithm according to claim 8, characterized in that when the device operates on vectors, the vector operations are element-wise, and when a given operation is performed on a vector, the elements at different positions are processed in parallel.
10. A method for executing an Adam gradient descent training algorithm, applied to the device of any one of claims 1 to 9, characterized in that the method comprises:
initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2 and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;
when performing the gradient descent operation, first updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} with the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, then obtaining the moment estimate vectors m̂_t and v̂_t from the moment vectors, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
11. The method for executing an Adam gradient descent training algorithm according to claim 10, characterized in that initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2 and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, comprises:
step S1: an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit; the INSTRUCTION_IO instruction is used to drive the direct memory access unit to read from the external address space all the instructions related to the Adam gradient descent algorithm;
step S2: the operation starts; the controller unit reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit and, according to the decoded micro-instruction, drives the direct memory access unit to read from the external address space all the instructions related to the Adam gradient descent algorithm and cache them in the instruction cache unit;
step S3: the controller unit reads a HYPERPARAMETER_IO instruction from the instruction cache unit and, according to the decoded micro-instruction, drives the direct memory access unit to read the global update step size α, the exponential decay rates β1, β2 and the convergence threshold ct from the external designated space, which are then sent to the data processing module;
step S4: the controller unit reads an assignment instruction from the instruction cache unit and, according to the decoded micro-instruction, drives the first-moment vector m_{t-1} and second-moment vector v_{t-1} in the data cache unit to be initialized, and sets the iteration count t in the data processing unit to 1;
step S5: the controller unit reads a DATA_IO instruction from the instruction cache unit and, according to the decoded micro-instruction, drives the direct memory access unit to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space, which are then sent to the data processing module;
step S6: the controller unit reads a data transfer instruction from the instruction cache unit and, according to the decoded micro-instruction, transfers the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit to the data processing unit.
12. The method for executing an Adam gradient descent training algorithm according to claim 10, characterized in that updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} with the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates is realized according to the formulas m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}) and v_t = β2·v_{t-1} + (1 - β2)·∇f(θ_{t-1})², and specifically comprises:
the controller unit reads a moment-vector update instruction from the instruction cache unit and, according to the decoded micro-instruction, drives the data cache unit to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}; in this update operation, the moment-vector update instruction is sent to the operation control submodule, which issues the corresponding instructions to perform the following operations: it sends instruction INS_1 to the basic operation submodule, driving it to compute (1 - β1) and (1 - β2); it sends instruction INS_2 to the parallel vector multiplication submodule, driving it to compute ∇f(θ_{t-1})²; it then sends instruction INS_3 to the parallel vector multiplication submodule, driving it to compute β1·m_{t-1}, β2·v_{t-1}, (1 - β1)·∇f(θ_{t-1}) and (1 - β2)·∇f(θ_{t-1})² in parallel, the results being denoted a1, a2, b1 and b2 respectively; a1 and b1, and a2 and b2, are then fed as pairs of inputs to the parallel vector addition submodule, yielding the updated first-moment vector m_t and second-moment vector v_t.
13. The method for executing an Adam gradient descent training algorithm according to claim 12, characterized in that after updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} with the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, the method further comprises:
the controller unit reads a data transfer instruction from the instruction cache unit and, according to the decoded micro-instruction, sends the updated first-moment vector m_t and second-moment vector v_t from the data processing unit to the data cache unit.
14. The method for executing an Adam gradient descent training algorithm according to claim 10, characterized in that obtaining the moment estimate vectors m̂_t and v̂_t from the moment vectors is realized according to the formulas m̂_t = m_t/(1 - β1^t) and v̂_t = v_t/(1 - β2^t), and specifically comprises:
the controller unit reads a moment-estimate vector operation instruction from the instruction cache unit and, according to the decoded micro-instruction, drives the operation control submodule to carry out the computation of the moment estimate vectors; the operation control submodule issues the corresponding instructions to perform the following operations: it sends instruction INS_4 to the basic operation submodule, driving it to compute 1/(1 - β1^t) and 1/(1 - β2^t), and the iteration count t is incremented by 1; it sends instruction INS_5 to the parallel vector multiplication submodule, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1 - β1^t) and of the second-moment vector v_t with 1/(1 - β2^t), obtaining the moment estimate vectors m̂_t and v̂_t.
15. The method for executing an Adam gradient descent training algorithm according to claim 10, characterized in that updating the vector to be updated θ_{t-1} to θ_t is realized according to the formula θ_t = θ_{t-1} - α·m̂_t/√v̂_t, and specifically comprises:
the controller unit reads a parameter-vector update instruction from the instruction cache unit and, according to the decoded micro-instruction, drives the operation control submodule to carry out the following operations: the operation control submodule sends instruction INS_6 to the basic operation submodule, driving it to compute -α; it sends instruction INS_7 to the parallel vector square-root submodule, driving it to compute √v̂_t; it sends instruction INS_7 to the parallel vector division submodule, driving it to compute m̂_t/√v̂_t; it sends instruction INS_8 to the parallel vector multiplication submodule, driving it to compute -α·m̂_t/√v̂_t; it sends instruction INS_9 to the parallel vector addition submodule, driving it to compute θ_{t-1} - α·m̂_t/√v̂_t, obtaining the updated parameter vector θ_t; here θ_{t-1} is the value held by the parameter vector (initialized as θ_0) before the t-th iteration, and the t-th iteration updates θ_{t-1} to θ_t; the operation control submodule then sends instruction INS_10 to the parallel vector division submodule, driving it to compute the vector temp, and sends instructions INS_11 and INS_12 to the parallel vector addition submodule and the basic operation submodule respectively, computing sum = Σ_i temp_i and temp2 = sum/n.
16. The method for executing an Adam gradient descent training algorithm according to claim 15, characterized in that after updating the vector to be updated θ_{t-1} to θ_t, the method further comprises:
the controller unit reads a DATABACK_IO instruction from the instruction cache unit and, according to the decoded micro-instruction, sends the updated parameter vector θ_t from the data processing unit to the external designated space through the direct memory access unit.
17. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that the step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged, the specific determination process being as follows:
the controller unit reads a convergence decision instruction from the instruction cache unit and, according to the translated micro-instruction, the data processing module determines whether the updated parameter vector has converged; if temp2 < ct, the vector has converged and the operation ends.
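For completeness, a sketch of the convergence decision in claim 17. The exact definition of the vector temp is not recoverable from the text above, so the element-wise relative parameter change used below is only an illustrative assumption; the threshold name ct follows the claim, while its default value is arbitrary.

```python
import numpy as np

def has_converged(theta_t, theta_prev, ct=1e-6):
    """Convergence test in the spirit of claim 17: converged if temp2 < ct."""
    # Assumed definition of temp (illustrative): element-wise relative change.
    temp = np.abs(theta_t - theta_prev) / (np.abs(theta_prev) + 1e-12)
    temp2 = temp.sum() / temp.size     # sum = sum_i temp_i, temp2 = sum / n
    return temp2 < ct
```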
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610269689.7A CN107315570B (en) | 2016-04-27 | 2016-04-27 | Device and method for executing Adam gradient descent training algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610269689.7A CN107315570B (en) | 2016-04-27 | 2016-04-27 | Device and method for executing Adam gradient descent training algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107315570A true CN107315570A (en) | 2017-11-03 |
CN107315570B CN107315570B (en) | 2021-06-18 |
Family
ID=60185643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610269689.7A Active CN107315570B (en) | 2016-04-27 | 2016-04-27 | Device and method for executing Adam gradient descent training algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107315570B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101931416A (en) * | 2009-06-24 | 2010-12-29 | 中国科学院微电子研究所 | Parallel layered decoder of LDPC code in mobile digital multimedia broadcasting system |
US20110282181A1 (en) * | 2009-11-12 | 2011-11-17 | Ge Wang | Extended interior methods and systems for spectral, optical, and photoacoustic imaging |
CN102156637A (en) * | 2011-05-04 | 2011-08-17 | 中国人民解放军国防科学技术大学 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
CN103956992A (en) * | 2014-03-26 | 2014-07-30 | 复旦大学 | Self-adaptive signal processing method based on multi-step gradient decrease |
CN104978282A (en) * | 2014-04-04 | 2015-10-14 | 上海芯豪微电子有限公司 | Cache system and method |
CN104360597A (en) * | 2014-11-02 | 2015-02-18 | 北京工业大学 | Sewage treatment process optimization control method based on multiple gradient descent |
CN104376124A (en) * | 2014-12-09 | 2015-02-25 | 西华大学 | Clustering algorithm based on disturbance absorbing principle |
Non-Patent Citations (1)
Title |
---|
D. Kingma, J. Ba: "Adam: A Method for Stochastic Optimization", ICLR 2015 *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109782392A (en) * | 2019-02-27 | 2019-05-21 | 中国科学院光电技术研究所 | A kind of fiber-optic coupling method based on modified random paralleling gradient descent algorithm |
CN111460528A (en) * | 2020-04-01 | 2020-07-28 | 支付宝(杭州)信息技术有限公司 | Multi-party combined training method and system based on Adam optimization algorithm |
CN111460528B (en) * | 2020-04-01 | 2022-06-14 | 支付宝(杭州)信息技术有限公司 | Multi-party combined training method and system based on Adam optimization algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN107315570B (en) | 2021-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11580386B2 (en) | Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system | |
US20190370664A1 (en) | Operation method | |
CN106991477B (en) | Artificial neural network compression coding device and method | |
KR102304216B1 (en) | Vector computing device | |
US10346507B2 (en) | Symmetric block sparse matrix-vector multiplication | |
Dong et al. | LU factorization of small matrices: Accelerating batched DGETRF on the GPU | |
CN103049241B (en) | A kind of method improving CPU+GPU isomery device calculated performance | |
WO2020046859A1 (en) | Systems and methods for neural network convolutional layer matrix multiplication using cache memory | |
WO2017124642A1 (en) | Device and method for executing forward calculation of artificial neural network | |
CN108090565A (en) | Accelerated method is trained in a kind of convolutional neural networks parallelization | |
Jung et al. | Implementing an interior point method for linear programs on a CPU-GPU system | |
WO2017185257A1 (en) | Device and method for performing adam gradient descent training algorithm | |
CN107341132B (en) | Device and method for executing AdaGrad gradient descent training algorithm | |
Ezzatti et al. | Using graphics processors to accelerate the computation of the matrix inverse | |
CN107315570A (en) | It is a kind of to be used to perform the device and method that Adam gradients decline training algorithm | |
Falch et al. | Register caching for stencil computations on GPUs | |
CN107341540B (en) | Device and method for executing Hessian-Free training algorithm | |
CN107315569A (en) | A kind of device and method for being used to perform RMSprop gradient descent algorithms | |
Wiggers et al. | Implementing the conjugate gradient algorithm on multi-core systems | |
Kuzelewski et al. | GPU-based acceleration of computations in elasticity problems solving by parametric integral equations system | |
WO2017185256A1 (en) | Rmsprop gradient descent algorithm execution apparatus and method | |
Zhang et al. | A high performance real-time edge detection system with NEON | |
CN202093573U (en) | Parallel acceleration device used in industrial CT image reconstruction | |
Li et al. | Quantum computer simulation on gpu cluster incorporating data locality | |
CN113780539A (en) | Neural network data processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing; Applicant after: Zhongke Cambrian Technology Co.,Ltd.; Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing; Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd. |
| GR01 | Patent grant | |
| TG01 | Patent term adjustment | |