CN107315570A - Device and method for executing the Adam gradient descent training algorithm - Google Patents

Device and method for executing the Adam gradient descent training algorithm

Info

Publication number
CN107315570A
Authority
CN
China
Prior art keywords
vector
submodule
unit
instruction
updated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610269689.7A
Other languages
Chinese (zh)
Other versions
CN107315570B (en)
Inventor
郭崎
刘少礼
陈天石
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd filed Critical Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201610269689.7A priority Critical patent/CN107315570B/en
Publication of CN107315570A publication Critical patent/CN107315570A/en
Application granted granted Critical
Publication of CN107315570B publication Critical patent/CN107315570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching
    • G06F9/3832Value prediction for operands; operand history buffers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Liquid Crystal Display Device Control (AREA)

Abstract

The invention discloses a device and method for executing the Adam gradient descent training algorithm. The device comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit, and a data processing module. The method comprises: first reading the gradient vector and the vector of values to be updated, and initializing the first-moment and second-moment vectors together with the corresponding exponential decay rates; in each iteration, updating the first-moment and second-moment vectors with the gradient vector, computing the biased first-moment and second-moment estimate vectors, and using them to update the parameters to be updated; training continues until the parameter vector to be updated converges. With the present invention, the Adam gradient descent algorithm can be applied efficiently and the data-processing efficiency is substantially improved.

Description

Device and method for executing the Adam gradient descent training algorithm
Technical field
The present invention relates to the technical field of Adam algorithm applications, and more particularly to a device and method for executing the Adam gradient descent training algorithm; it concerns hardware implementations of the Adam gradient optimization algorithm and their related applications.
Background art
Gradient optimization algorithms are widely used in fields such as function approximation, optimization computation, pattern recognition, and image processing. Adam, as one of the gradient optimization algorithms, is widely used because it is easy to implement, has a low computational cost and small memory requirements, and its updates are invariant to rescaling of the gradient; implementing the Adam algorithm with a dedicated device can significantly improve its execution speed.
At present, one known way to execute the Adam gradient descent algorithm is to use a general-purpose processor, which supports the above algorithm by executing general-purpose instructions with general-purpose register files and general-purpose functional units. One drawback of this approach is that the arithmetic performance of a single general-purpose processor is relatively low, and when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck. In addition, a general-purpose processor has to decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the front-end decoding of the processor brings a considerable power overhead.
Another known way to execute the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the above algorithm by executing general single-instruction multiple-data (SIMD) instructions with general-purpose register files and general-purpose stream processing units. Because the GPU is a device specialized for graphics and scientific computing, it has no dedicated support for the operations of the Adam gradient descent algorithm; a large amount of front-end decoding work is still needed before those operations can be executed, which brings a large overhead. In addition, the GPU has only a small on-chip cache, so the data needed for the computation (such as the first-moment vector and the second-moment vector) have to be moved on and off chip repeatedly; the off-chip bandwidth becomes the main performance bottleneck and also brings a huge power overhead.
Summary of the invention
(1) Technical problem to be solved
In view of this, a primary object of the present invention is to provide a device and method for executing the Adam gradient descent training algorithm, so as to solve the problems of the insufficient arithmetic performance of general-purpose processors for such data processing and of their large front-end decoding overhead, and to avoid repeatedly reading data from memory, thereby reducing the memory-access bandwidth.
(2) Technical solution
To achieve the above object, the present invention provides a device for executing the Adam gradient descent training algorithm. The device comprises a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
the direct memory access unit 1 is used to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;
the instruction cache unit 2 is used to read instructions through the direct memory access unit 1 and to cache the instructions that are read;
the controller unit 3 is used to read instructions from the instruction cache unit 2 and to decode them into microinstructions that control the behaviour of the direct memory access unit 1, the data cache unit 4, or the data processing module 5;
the data cache unit 4 is used to cache the first-moment vector and the second-moment vector during initialization and during each data update;
the data processing module 5 is used to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated into the external designated space through the direct memory access unit 1.
In the above solution, the direct memory access unit 1 writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly into the external designated space.
In the above solution, the controller unit 3 decodes the instructions that are read into microinstructions controlling the behaviour of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, so as to control the direct memory access unit 1 to read data from the external designated address and write data back to the external designated address, to control the data cache unit 4 to obtain, through the direct memory access unit 1, the instructions needed for the operation from the external designated address, to control the data processing module 5 to perform the update operation on the parameters to be updated, and to control the data transfer between the data cache unit 4 and the data processing module 5.
In the above solution, the data cache unit 4 initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update it reads out the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and delivers them to the data processing module 5, where the first-moment vector m_t and the second-moment vector v_t are updated and then written back into the data cache unit 4.
In the above solution, while the device is running, the data cache unit 4 always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it.
In the above solution, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit 4, and writes θ_t into the external designated space through the direct memory access unit 1.
In the above solution, the data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t², where g_t is the gradient vector; it computes the moment estimate vectors from m_t, v_t according to m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t); and it updates the vector to be updated θ_{t-1} to θ_t according to θ_t = θ_{t-1} - α·m̂_t/√(v̂_t).
In the above solution, the data processing module 5 comprises an operation control submodule 51, a vector-addition parallel operation submodule 52, a vector-multiplication parallel operation submodule 53, a vector-division parallel operation submodule 54, a vector-square-root parallel operation submodule 55, and a basic operation submodule 56, wherein the vector-addition parallel operation submodule 52, the vector-multiplication parallel operation submodule 53, the vector-division parallel operation submodule 54, the vector-square-root parallel operation submodule 55, and the basic operation submodule 56 are connected in parallel, and the operation control submodule 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and for a given operation the elements at different positions of the same vector are processed in parallel.
To achieve the above object, the present invention also provides a method for executing the Adam gradient descent training algorithm, the method comprising:
initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;
when performing a gradient descent operation, first updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates, then obtaining the biased first-moment and second-moment estimate vectors m̂_t and v̂_t from the moment vectors, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
In the above solution, initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, comprises:
Step S1: an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent algorithm.
Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent algorithm from the external address space and cache them in the instruction cache unit 2;
Step S3: the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1, β2, and the convergence threshold ct from the external designated space, and send them to the data processing module 5;
Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the initialization of the first-moment vector m_{t-1} and the second-moment vector v_{t-1} in the data cache unit 4, and sets the iteration count t in the data processing module 5 to 1;
Step S5: the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space, and send them to the data processing module 5;
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded microinstruction, transfers the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit 4 into the data processing module 5.
In the above solution, updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates is performed according to the formulas m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t², and specifically includes: the controller unit 3 reads a moment-vector update instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the data cache unit 4 to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}. In this update operation, the moment-vector update instruction is sent to the operation control submodule 51, which issues the corresponding instructions to perform the following operations: it sends instruction INS_1 to the basic operation submodule 56, driving it to compute (1-β1) and (1-β2); it sends instruction INS_2 to the vector-multiplication parallel operation submodule 53, driving it to compute the element-wise square of the gradient, g_t²; it then sends instruction INS_3 to the vector-multiplication parallel operation submodule 53, driving it to compute simultaneously β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1}, and (1-β2)·g_t², the results being denoted a1, a2, b1, and b2 respectively; then a1 and a2, and b1 and b2, are fed respectively as the two inputs to the vector-addition parallel operation submodule 52, yielding the updated first-moment vector m_t and second-moment vector v_t.
In the above solution, after the first-moment vector m_{t-1} and the second-moment vector v_{t-1} have been updated using the externally supplied gradient value g_t and the exponential decay rates, the method further comprises: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded microinstruction, sends the updated first-moment vector m_t and second-moment vector v_t from the data processing module 5 into the data cache unit 4.
In the above solution, obtaining the biased moment estimate vectors m̂_t and v̂_t from the moment vectors is performed according to the formulas m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t), and specifically includes: the controller unit 3 reads a moment-estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the operation control submodule 51 to compute the moment estimate vectors; the operation control submodule 51 issues the corresponding instructions to perform the following operations: it sends instruction INS_4 to the basic operation submodule 56, driving it to compute 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; it sends instruction INS_5 to the vector-multiplication parallel operation submodule 53, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1-β1^t) and of the second-moment vector v_t with 1/(1-β2^t), yielding the biased moment estimate vectors m̂_t and v̂_t.
In the above solution, updating the vector to be updated θ_{t-1} to θ_t is performed according to the formula θ_t = θ_{t-1} - α·m̂_t/√(v̂_t), and specifically includes: the controller unit 3 reads a parameter-vector update instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the operation control submodule 51 to perform the following operations: the operation control submodule 51 sends instruction INS_6 to the basic operation submodule 56, driving it to compute -α; it sends instruction INS_7 to the vector-square-root parallel operation submodule 55, driving it to compute √(v̂_t); it sends instruction INS_7 to the vector-division parallel operation submodule 54, driving it to compute m̂_t/√(v̂_t); it sends instruction INS_8 to the vector-multiplication parallel operation submodule 53, driving it to compute -α·m̂_t/√(v̂_t); it sends instruction INS_9 to the vector-addition parallel operation submodule 52, driving it to compute θ_{t-1} + (-α·m̂_t/√(v̂_t)), yielding the updated parameter vector θ_t, where θ_{t-1} is the value of θ before the t-th iteration updates it, and in the t-th iteration θ_{t-1} is updated to θ_t; the operation control submodule 51 then sends instruction INS_10 to the vector-division parallel operation submodule 54, driving it to compute the element-wise vector temp used for the convergence test; finally, it sends instructions INS_11 and INS_12 to the vector-addition parallel operation submodule 52 and the basic operation submodule 56 respectively, which compute sum = Σ_i temp_i and temp2 = sum/n.
In the above solution, after the vector to be updated θ_{t-1} has been updated to θ_t, the method further comprises: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded microinstruction, sends the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.
In the above solution, the step of repeating this process until the vector to be updated converges includes judging whether the vector to be updated has converged, as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded microinstruction, the data processing module 5 judges whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
(3) Beneficial effects
It can be seen from the above technical solutions that the present invention has the following advantages:
1. The device and method provided by the present invention for executing the Adam gradient descent training algorithm use a device dedicated to executing this algorithm, which solves the problems of the insufficient arithmetic performance of general-purpose processors and of their large front-end decoding overhead, and accelerates the execution of related applications.
2. Because a data cache unit is used to temporarily store the moment vectors needed in the intermediate steps, repeated reads from memory are avoided, the I/O operations between the device and the external address space are reduced, and the memory-access bandwidth is reduced.
3. Because the data processing module performs vector operations with the corresponding parallel operation submodules, the degree of parallelism is greatly increased.
4. Because the data processing module performs vector operations with the corresponding parallel operation submodules and the degree of parallelism of the operations is high, the device can run at a relatively low frequency, so the power overhead is small.
Brief description of the drawings
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows an example block diagram of the overall structure of the device for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Fig. 2 shows an example block diagram of the data processing module in the device for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Fig. 3 shows a flowchart of the method for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Throughout the drawings, the same devices, components, units, etc. are denoted by the same reference numerals.
Detailed description of the embodiments
From the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings, other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art.
In the present invention, the terms "comprising" and "containing" and their derivatives are intended to be inclusive and non-limiting; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below to explain the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist in a comprehensive understanding of the exemplary embodiments of the invention defined by the claims and their equivalents. The description includes various specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Throughout the drawings, the same reference numerals are used for the same functions and operations.
The device and method according to embodiments of the present invention for executing the Adam gradient descent training algorithm accelerate the application of the Adam gradient descent algorithm. First, the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α are initialized, and the vector to be updated θ_0 is obtained from the external designated space. In each gradient descent operation, the first-moment vector m_{t-1} and the second-moment vector v_{t-1} are first updated using the externally supplied gradient value g_t and the exponential decay rates, i.e. m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t²; then the biased moment estimate vectors m̂_t and v̂_t are obtained from the moment vectors, i.e. m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t); finally the vector to be updated θ_{t-1} is updated to θ_t and output, i.e. θ_t = θ_{t-1} - α·m̂_t/√(v̂_t), where θ_{t-1} is the value of θ before the t-th iteration updates it to θ_t. This process is repeated until the vector to be updated converges.
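For reference, the three formulas above can be collected into a short software sketch. This is a minimal NumPy sketch of the mathematics only, not of the hardware device; following this description, no epsilon term is added to the denominator, and the argument g stands for the gradient vector read from the external designated space.

```python
import numpy as np

def adam_step(theta, m, v, g, t, alpha, beta1, beta2):
    """One Adam iteration as described above (software reference only)."""
    m = beta1 * m + (1 - beta1) * g            # update first-moment vector m_t
    v = beta2 * v + (1 - beta2) * g * g        # update second-moment vector v_t
    m_hat = m / (1 - beta1 ** t)               # biased first-moment estimate
    v_hat = v / (1 - beta2 ** t)               # biased second-moment estimate
    theta = theta - alpha * m_hat / np.sqrt(v_hat)   # parameter update (no epsilon term here)
    return theta, m, v
```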
Fig. 1 shows an example block diagram of the overall structure of the device for implementing the Adam gradient descent algorithm according to an embodiment of the present invention. As shown in Fig. 1, the device comprises a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, all of which can be implemented as hardware circuits.
The direct memory access unit 1 is used to access the external designated space, to read and write data to the instruction cache unit 2 and the data processing module 5, and to complete the loading and storing of data. Specifically, it writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly into the external designated space.
The instruction cache unit 2 is used to read instructions through the direct memory access unit 1 and to cache the instructions that are read.
The controller unit 3 is used to read instructions from the instruction cache unit 2, decode them into microinstructions that control the behaviour of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, and send each microinstruction to the corresponding unit; it controls the direct memory access unit 1 to read data from the external designated address and write data back to the external designated address, controls the data cache unit 4 to obtain, through the direct memory access unit 1, the instructions needed for the operation from the external designated address, controls the data processing module 5 to perform the update operation on the parameters to be updated, and controls the data transfer between the data cache unit 4 and the data processing module 5.
The data cache unit 4 is used to cache the first-moment vector and the second-moment vector during initialization and during each data update. Specifically, the data cache unit 4 initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update it reads out the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and delivers them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4. While the device is running, the data cache unit 4 always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it. In the present invention, because the data cache unit temporarily stores the moment vectors needed in the intermediate steps, repeated reads from memory are avoided, the I/O operations between the device and the external address space are reduced, and the memory-access bandwidth is reduced.
The data processing module 5 is used to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, i.e. m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t²; it computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, i.e. m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t); finally it updates the vector to be updated θ_{t-1} to θ_t, i.e. θ_t = θ_{t-1} - α·m̂_t/√(v̂_t), writes m_t, v_t into the data cache unit 4, and writes θ_t into the external designated space through the direct memory access unit 1. In the present invention, because the data processing module performs vector operations with the corresponding parallel operation submodules, the degree of parallelism is greatly increased; the operating frequency can therefore be kept relatively low, which in turn keeps the power overhead small.
Fig. 2 shows an example block diagram of the data processing module in the device for implementing Adam gradient descent related applications according to an embodiment of the present invention. As shown in Fig. 2, the data processing module 5 comprises an operation control submodule 51, a vector-addition parallel operation submodule 52, a vector-multiplication parallel operation submodule 53, a vector-division parallel operation submodule 54, a vector-square-root parallel operation submodule 55, and a basic operation submodule 56, wherein the vector-addition parallel operation submodule 52, the vector-multiplication parallel operation submodule 53, the vector-division parallel operation submodule 54, the vector-square-root parallel operation submodule 55, and the basic operation submodule 56 are connected in parallel, and the operation control submodule 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and for a given operation the elements at different positions of the same vector are processed in parallel.
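As a rough software analogue of this structure, the operation control submodule can be viewed as a sequencer that dispatches element-wise vector primitives to the parallel operation submodules. The sketch below is illustrative only; the primitive names and the dispatch function are assumptions, not terms from the patent.

```python
import numpy as np

# Illustrative element-wise primitives, one per parallel operation submodule.
PRIMITIVES = {
    "VADD":  lambda a, b: a + b,      # vector-addition submodule 52
    "VMUL":  lambda a, b: a * b,      # vector-multiplication submodule 53
    "VDIV":  lambda a, b: a / b,      # vector-division submodule 54
    "VSQRT": lambda a: np.sqrt(a),    # vector-square-root submodule 55
}

def dispatch(op, *operands):
    """Stand-in for the operation control submodule 51: it only sequences
    primitives; each primitive touches every vector element independently,
    which is what lets the hardware process the elements in parallel."""
    return PRIMITIVES[op](*operands)

# Example: the element-wise quantity x / sqrt(y) as two dispatched primitives.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 9.0, 16.0])
out = dispatch("VDIV", x, dispatch("VSQRT", y))
```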
Fig. 3 shows a flowchart of the method for executing the Adam gradient descent training algorithm according to an embodiment of the present invention, which specifically comprises the following steps:
Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent algorithm.
Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent algorithm from the external address space and cache them in the instruction cache unit 2;
Step S3: the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1, β2, and the convergence threshold ct from the external designated space, and then send them to the data processing module 5;
Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the initialization of the first-moment vector m_{t-1} and the second-moment vector v_{t-1} in the data cache unit 4, and sets the iteration count t in the data processing module 5 to 1;
Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space, and then send them to the data processing module 5;
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded microinstruction, transfers the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit 4 into the data processing module 5.
Step S7: the controller unit 3 reads a moment-vector update instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the data cache unit 4 to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}. In this update operation, the moment-vector update instruction is sent to the operation control submodule 51, which issues the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation submodule 56, driving it to compute (1-β1) and (1-β2); it sends operation instruction 2 (INS_2) to the vector-multiplication parallel operation submodule 53, driving it to compute the element-wise square of the gradient, g_t²; it then sends operation instruction 3 (INS_3) to the vector-multiplication parallel operation submodule 53, driving it to compute simultaneously β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1}, and (1-β2)·g_t², the results being denoted a1, a2, b1, and b2 respectively; then a1 and a2, and b1 and b2, are fed respectively as the two inputs to the vector-addition parallel operation submodule 52, yielding the updated first-moment vector m_t and second-moment vector v_t.
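By way of illustration, step S7 can be mirrored in software as follows. This is a sketch under the assumptions that g denotes the gradient vector and that a1, a2 are the two first-moment terms and b1, b2 the two second-moment terms; plain NumPy operations stand in for the submodules.

```python
import numpy as np

def update_moments(m_prev, v_prev, g, beta1, beta2):
    # INS_1: basic operation submodule 56 computes the scalars (1 - beta1), (1 - beta2)
    c1, c2 = 1.0 - beta1, 1.0 - beta2
    # INS_2: vector-multiplication submodule 53 computes the element-wise square of g
    g2 = g * g
    # INS_3: submodule 53 computes the four products in parallel
    a1, a2 = beta1 * m_prev, c1 * g      # first-moment terms
    b1, b2 = beta2 * v_prev, c2 * g2     # second-moment terms
    # vector-addition submodule 52 sums each pair of terms
    m_t = a1 + a2
    v_t = b1 + b2
    return m_t, v_t
```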
Step S8: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded microinstruction, sends the updated first-moment vector m_t and second-moment vector v_t from the data processing module 5 into the data cache unit 4.
Step S9: the controller unit 3 reads a moment-estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the operation control submodule 51 to compute the moment estimate vectors; the operation control submodule 51 issues the corresponding instructions to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation submodule 56, driving it to compute 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; it sends operation instruction 5 (INS_5) to the vector-multiplication parallel operation submodule 53, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1-β1^t) and of the second-moment vector v_t with 1/(1-β2^t), yielding the biased moment estimate vectors m̂_t and v̂_t.
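Step S9 admits the same kind of sketch: the scalar reciprocals correspond to the basic operation submodule 56 and the two scaled vectors to the vector-multiplication parallel operation submodule 53 (t is the iteration count used for the correction factors; incrementing it is left to the caller).

```python
def moment_estimates(m_t, v_t, t, beta1, beta2):
    # INS_4: basic operation submodule 56 computes 1/(1 - beta1^t) and 1/(1 - beta2^t)
    s1 = 1.0 / (1.0 - beta1 ** t)
    s2 = 1.0 / (1.0 - beta2 ** t)
    # INS_5: vector-multiplication submodule 53 scales both moment vectors in parallel
    m_hat = m_t * s1     # biased first-moment estimate vector
    v_hat = v_t * s2     # biased second-moment estimate vector
    return m_hat, v_hat
```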
Step S10: the controller unit 3 reads a parameter-vector update instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the operation control submodule 51 to perform the following operations: the operation control submodule 51 sends operation instruction 6 (INS_6) to the basic operation submodule 56, driving it to compute -α; it sends operation instruction 7 (INS_7) to the vector-square-root parallel operation submodule 55, driving it to compute √(v̂_t); it sends operation instruction 7 (INS_7) to the vector-division parallel operation submodule 54, driving it to compute m̂_t/√(v̂_t); it sends operation instruction 8 (INS_8) to the vector-multiplication parallel operation submodule 53, driving it to compute -α·m̂_t/√(v̂_t); it sends operation instruction 9 (INS_9) to the vector-addition parallel operation submodule 52, driving it to compute θ_{t-1} + (-α·m̂_t/√(v̂_t)), yielding the updated parameter vector θ_t, where θ_{t-1} is the value of θ before the t-th iteration updates it, and in the t-th iteration θ_{t-1} is updated to θ_t; the operation control submodule 51 then sends operation instruction 10 (INS_10) to the vector-division parallel operation submodule 54, driving it to compute the element-wise vector temp used for the convergence test; finally, it sends operation instruction 11 (INS_11) and operation instruction 12 (INS_12) to the vector-addition parallel operation submodule 52 and the basic operation submodule 56 respectively, which compute sum = Σ_i temp_i and temp2 = sum/n.
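Step S10 can be sketched in the same way. The parameter update follows θ_t = θ_{t-1} - α·m̂_t/√(v̂_t); the operands of the division that produces the convergence vector temp are not spelled out in the text, so the sketch below uses the element-wise relative parameter change as an assumed stand-in for that metric.

```python
import numpy as np

def update_parameters(theta_prev, m_hat, v_hat, alpha):
    # INS_6: basic operation submodule 56 computes -alpha
    neg_alpha = -alpha
    # INS_7: square-root submodule 55 computes sqrt(v_hat); division submodule 54 computes m_hat / sqrt(v_hat)
    step = m_hat / np.sqrt(v_hat)
    # INS_8: multiplication submodule 53 scales the step by -alpha
    delta = neg_alpha * step
    # INS_9: addition submodule 52 forms theta_t = theta_{t-1} + delta
    theta_t = theta_prev + delta
    # INS_10..INS_12: convergence metric; the division operands below are an
    # assumption (relative change per element, averaged over the n elements)
    temp = np.abs(theta_t - theta_prev) / (np.abs(theta_prev) + 1e-12)
    temp2 = temp.sum() / temp.size       # sum = sum_i temp_i, temp2 = sum / n
    return theta_t, temp2
```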
Step S11: the controller unit 3 reads a write-back instruction for the quantity to be updated (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded microinstruction, sends the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.
Step S12: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded microinstruction, the data processing module 5 judges whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends; otherwise, execution continues from step S5.
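Taken together, steps S5 to S12 amount to the loop sketched below. This is a software reference only: grad_fn is an assumed callback standing in for the gradient read from the external designated space, zero initialization of the moment vectors is assumed for step S4, and the convergence metric follows the assumption made for step S10.

```python
import numpy as np

def adam_train(theta0, grad_fn, alpha, beta1, beta2, ct, max_iters=100000):
    theta = theta0.copy()
    m = np.zeros_like(theta)    # step S4: initialize first-moment vector (zeros assumed)
    v = np.zeros_like(theta)    #          and second-moment vector
    for t in range(1, max_iters + 1):
        g = grad_fn(theta)                               # step S5: fetch gradient
        m = beta1 * m + (1 - beta1) * g                  # step S7: update moment vectors
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)                     # step S9: biased moment estimates
        v_hat = v / (1 - beta2 ** t)
        new_theta = theta - alpha * m_hat / np.sqrt(v_hat)   # step S10: parameter update
        temp2 = np.mean(np.abs(new_theta - theta) / (np.abs(theta) + 1e-12))
        theta = new_theta                                 # step S11: write back
        if temp2 < ct:                                    # step S12: convergence test
            break
    return theta
```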
By using a device dedicated to executing the Adam gradient descent training algorithm, the present invention solves the problems of the insufficient arithmetic performance of general-purpose processors for such data and of their large front-end decoding overhead, and accelerates the execution of related applications. At the same time, the use of the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth.
The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination thereof. Although the processes or methods are described above in a certain order, it should be understood that some of the described operations can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, various embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. Obviously, various modifications may be made to each embodiment without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (17)

1. A device for executing the Adam gradient descent training algorithm, characterized in that the device comprises a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a data cache unit (4), and a data processing module (5), wherein:
the direct memory access unit (1) is configured to access the external designated space, read and write data to the instruction cache unit (2) and the data processing module (5), and complete the loading and storing of data;
the instruction cache unit (2) is configured to read instructions through the direct memory access unit (1) and cache the instructions that are read;
the controller unit (3) is configured to read instructions from the instruction cache unit (2) and decode them into microinstructions that control the behaviour of the direct memory access unit (1), the data cache unit (4), or the data processing module (5);
the data cache unit (4) is configured to cache the first-moment vector and the second-moment vector during initialization and during each data update;
the data processing module (5) is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit (4), and write the updated vector to be updated into the external designated space through the direct memory access unit (1).
2. The device for executing the Adam gradient descent training algorithm according to claim 1, characterized in that the direct memory access unit (1) writes instructions from the external designated space to the instruction cache unit (2), reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module (5), and writes the updated parameter vector from the data processing module (5) directly into the external designated space.
3. The device for executing the Adam gradient descent training algorithm according to claim 1, characterized in that the controller unit (3) decodes the instructions that are read into microinstructions controlling the behaviour of the direct memory access unit (1), the data cache unit (4), or the data processing module (5), so as to control the direct memory access unit (1) to read data from the external designated address and write data to the external designated address, control the data cache unit (4) to obtain, through the direct memory access unit (1), the instructions needed for the operation from the external designated address, control the data processing module (5) to perform the update operation on the parameters to be updated, and control the data transfer between the data cache unit (4) and the data processing module (5).
4. The device for executing the Adam gradient descent training algorithm according to claim 1, characterized in that the data cache unit (4) initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update it reads out the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and delivers them to the data processing module (5), where they are updated to m_t and v_t and then written back into the data cache unit (4).
5. The device for executing the Adam gradient descent training algorithm according to claim 4, characterized in that, while the device is running, the data cache unit (4) always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it.
6. The device for executing the Adam gradient descent training algorithm according to claim 1, characterized in that the data processing module (5) reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit (4), and reads the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit (1); it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit (4), and writes θ_t into the external designated space through the direct memory access unit (1).
7. The device for executing the Adam gradient descent training algorithm according to claim 6, characterized in that the data processing module (5) updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t², computes the moment estimate vectors according to m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t), and updates the vector to be updated θ_{t-1} to θ_t according to θ_t = θ_{t-1} - α·m̂_t/√(v̂_t).
8. The device for executing the Adam gradient descent training algorithm according to claim 1 or 7, characterized in that the data processing module (5) comprises an operation control submodule (51), a vector-addition parallel operation submodule (52), a vector-multiplication parallel operation submodule (53), a vector-division parallel operation submodule (54), a vector-square-root parallel operation submodule (55), and a basic operation submodule (56), wherein the vector-addition parallel operation submodule (52), the vector-multiplication parallel operation submodule (53), the vector-division parallel operation submodule (54), the vector-square-root parallel operation submodule (55), and the basic operation submodule (56) are connected in parallel, and the operation control submodule (51) is connected in series with each of the vector-addition parallel operation submodule (52), the vector-multiplication parallel operation submodule (53), the vector-division parallel operation submodule (54), the vector-square-root parallel operation submodule (55), and the basic operation submodule (56).
9. The device for executing the Adam gradient descent training algorithm according to claim 8, characterized in that, when the device operates on vectors, the vector operations are element-wise, and for a given operation the elements at different positions of the same vector are processed in parallel.
10. A method for executing the Adam gradient descent training algorithm, applied to the device according to any one of claims 1 to 9, characterized in that the method comprises:
initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;
when performing a gradient descent operation, first updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates, then obtaining the biased moment estimate vectors m̂_t and v̂_t from the moment vectors, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; and repeating this process until the vector to be updated converges.
11. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, comprises:
step S1: pre-storing an INSTRUCTION_IO instruction at the first address of the instruction cache unit, the INSTRUCTION_IO instruction being used to drive the direct memory access unit to read, from the external address space, all instructions related to the Adam gradient descent algorithm;
step S2: starting the operation, the controller unit reading the INSTRUCTION_IO instruction from the first address of the instruction cache unit and, according to the decoded microinstruction, driving the direct memory access unit to read all instructions related to the Adam gradient descent algorithm from the external address space and cache them in the instruction cache unit;
step S3: the controller unit reading a HYPERPARAMETER_IO instruction from the instruction cache unit and, according to the decoded microinstruction, driving the direct memory access unit to read the global update step size α, the exponential decay rates β1, β2, and the convergence threshold ct from the external designated space, and sending them to the data processing module;
step S4: the controller unit reading an assignment instruction from the instruction cache unit and, according to the decoded microinstruction, driving the initialization of the first-moment vector m_{t-1} and the second-moment vector v_{t-1} in the data cache unit, and setting the iteration count t in the data processing module to 1;
step S5: the controller unit reading a DATA_IO instruction from the instruction cache unit and, according to the decoded microinstruction, driving the direct memory access unit to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space and send them to the data processing module;
step S6: the controller unit reading a data transfer instruction from the instruction cache unit and, according to the decoded microinstruction, transferring the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit into the data processing module.
12. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates is performed according to the formulas m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t², and specifically comprises:
the controller unit reading a moment-vector update instruction from the instruction cache unit and, according to the decoded microinstruction, driving the data cache unit to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}; in this update operation, the moment-vector update instruction is sent to the operation control submodule, which issues the corresponding instructions to perform the following operations: sending instruction INS_1 to the basic operation submodule to drive it to compute (1-β1) and (1-β2); sending instruction INS_2 to the vector-multiplication parallel operation submodule to drive it to compute the element-wise square of the gradient, g_t²; then sending instruction INS_3 to the vector-multiplication parallel operation submodule to drive it to compute simultaneously β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1}, and (1-β2)·g_t², the results being denoted a1, a2, b1, and b2 respectively; then feeding a1 and a2, and b1 and b2, respectively as the two inputs to the vector-addition parallel operation submodule, to obtain the updated first-moment vector m_t and second-moment vector v_t.
13. The method for executing the Adam gradient descent training algorithm according to claim 12, characterized in that, after updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates, the method further comprises:
the controller unit reading a data transfer instruction from the instruction cache unit and, according to the decoded microinstruction, sending the updated first-moment vector m_t and second-moment vector v_t from the data processing module into the data cache unit.
14. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that obtaining the biased moment estimate vectors m̂_t and v̂_t from the moment vectors is performed according to the formulas m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t), and specifically comprises:
the controller unit reading a moment-estimate vector operation instruction from the instruction cache unit and, according to the decoded microinstruction, driving the operation control submodule to compute the moment estimate vectors; the operation control submodule issues the corresponding instructions to perform the following operations: it sends instruction INS_4 to the basic operation submodule, driving it to compute 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; it sends instruction INS_5 to the vector-multiplication parallel operation submodule, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1-β1^t) and of the second-moment vector v_t with 1/(1-β2^t), obtaining the biased moment estimate vectors m̂_t and v̂_t.
15. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that updating the vector to be updated θ_{t-1} to θ_t is performed according to the formula θ_t = θ_{t-1} - α·m̂_t/√(v̂_t), and specifically comprises:
the controller unit reading a parameter-vector update instruction from the instruction cache unit and, according to the decoded microinstruction, driving the operation control submodule to perform the following operations: the operation control submodule sends instruction INS_6 to the basic operation submodule, driving it to compute -α; it sends instruction INS_7 to the vector-square-root parallel operation submodule, driving it to compute √(v̂_t); it sends instruction INS_7 to the vector-division parallel operation submodule, driving it to compute m̂_t/√(v̂_t); it sends instruction INS_8 to the vector-multiplication parallel operation submodule, driving it to compute -α·m̂_t/√(v̂_t); it sends instruction INS_9 to the vector-addition parallel operation submodule, driving it to compute θ_{t-1} + (-α·m̂_t/√(v̂_t)), obtaining the updated parameter vector θ_t, wherein θ_{t-1} is the value of θ before the t-th iteration updates it, and in the t-th iteration θ_{t-1} is updated to θ_t; the operation control submodule then sends instruction INS_10 to the vector-division parallel operation submodule, driving it to compute the element-wise vector temp used for the convergence test; and it sends instructions INS_11 and INS_12 to the vector-addition parallel operation submodule and the basic operation submodule respectively, which compute sum = Σ_i temp_i and temp2 = sum/n.
16. The method for executing the Adam gradient descent training algorithm according to claim 15, characterized in that, after updating the vector to be updated θ_{t-1} to θ_t, the method further comprises:
the controller unit reading a DATABACK_IO instruction from the instruction cache unit and, according to the decoded microinstruction, sending the updated parameter vector θ_t from the data processing module to the external designated space through the direct memory access unit.
17. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that the step of repeating this process until the vector to be updated converges includes judging whether the vector to be updated has converged, as follows:
the controller unit reads a convergence judgment instruction from the instruction cache unit and, according to the decoded microinstruction, the data processing module judges whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
CN201610269689.7A 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm Active CN107315570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610269689.7A CN107315570B (en) 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm

Publications (2)

Publication Number Publication Date
CN107315570A (en) 2017-11-03
CN107315570B CN107315570B (en) 2021-06-18

Family

ID=60185643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610269689.7A Active CN107315570B (en) 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm

Country Status (1)

Country Link
CN (1) CN107315570B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101931416A (en) * 2009-06-24 2010-12-29 中国科学院微电子研究所 Parallel layered decoder of LDPC code in mobile digital multimedia broadcasting system
US20110282181A1 (en) * 2009-11-12 2011-11-17 Ge Wang Extended interior methods and systems for spectral, optical, and photoacoustic imaging
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN103956992A (en) * 2014-03-26 2014-07-30 复旦大学 Self-adaptive signal processing method based on multi-step gradient decrease
CN104978282A (en) * 2014-04-04 2015-10-14 上海芯豪微电子有限公司 Cache system and method
CN104360597A (en) * 2014-11-02 2015-02-18 北京工业大学 Sewage treatment process optimization control method based on multiple gradient descent
CN104376124A (en) * 2014-12-09 2015-02-25 西华大学 Clustering algorithm based on disturbance absorbing principle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
D. KINGMA, J. BA: "Adam: A Method for Stochastic Optimization", ICLR 2015 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109782392A (en) * 2019-02-27 2019-05-21 中国科学院光电技术研究所 A kind of fiber-optic coupling method based on modified random paralleling gradient descent algorithm
CN111460528A (en) * 2020-04-01 2020-07-28 支付宝(杭州)信息技术有限公司 Multi-party combined training method and system based on Adam optimization algorithm
CN111460528B (en) * 2020-04-01 2022-06-14 支付宝(杭州)信息技术有限公司 Multi-party combined training method and system based on Adam optimization algorithm

Also Published As

Publication number Publication date
CN107315570B (en) 2021-06-18

Similar Documents

Publication Publication Date Title
US11580386B2 (en) Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system
US20190370664A1 (en) Operation method
CN106991477B (en) Artificial neural network compression coding device and method
KR102304216B1 (en) Vector computing device
US10346507B2 (en) Symmetric block sparse matrix-vector multiplication
Dong et al. LU factorization of small matrices: Accelerating batched DGETRF on the GPU
CN103049241B Method for improving the computing performance of a CPU+GPU heterogeneous device
WO2020046859A1 (en) Systems and methods for neural network convolutional layer matrix multiplication using cache memory
WO2017124642A1 (en) Device and method for executing forward calculation of artificial neural network
CN108090565A Parallelized training acceleration method for convolutional neural networks
Jung et al. Implementing an interior point method for linear programs on a CPU-GPU system
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
CN107341132B (en) Device and method for executing AdaGrad gradient descent training algorithm
Ezzatti et al. Using graphics processors to accelerate the computation of the matrix inverse
CN107315570A (en) It is a kind of to be used to perform the device and method that Adam gradients decline training algorithm
Falch et al. Register caching for stencil computations on GPUs
CN107341540B (en) Device and method for executing Hessian-Free training algorithm
CN107315569A Device and method for executing RMSprop gradient descent algorithm
Wiggers et al. Implementing the conjugate gradient algorithm on multi-core systems
Kuzelewski et al. GPU-based acceleration of computations in elasticity problems solving by parametric integral equations system
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
Zhang et al. A high performance real-time edge detection system with NEON
CN202093573U (en) Parallel acceleration device used in industrial CT image reconstruction
Li et al. Quantum computer simulation on gpu cluster incorporating data locality
CN113780539A (en) Neural network data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing
Applicant after: Zhongke Cambrian Technology Co.,Ltd.
Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing
Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd.
GR01 Patent grant
TG01 Patent term adjustment