CN107315570A - Device and method for executing the Adam gradient descent training algorithm - Google Patents

Device and method for executing the Adam gradient descent training algorithm

Info

Publication number
CN107315570A
Authority
CN
China
Prior art keywords
vector
submodule
unit
instruction
updated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610269689.7A
Other languages
Chinese (zh)
Other versions
CN107315570B (en)
Inventor
郭崎
刘少礼
陈天石
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd filed Critical Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201610269689.7A priority Critical patent/CN107315570B/en
Publication of CN107315570A publication Critical patent/CN107315570A/en
Application granted granted Critical
Publication of CN107315570B publication Critical patent/CN107315570B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching
    • G06F9/3832Value prediction for operands; operand history buffers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Liquid Crystal Display Device Control (AREA)

Abstract

The invention discloses a device and method for executing the Adam gradient descent training algorithm. The device comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit, and a data processing module. The method comprises: first reading the gradient vector and the vector of values to be updated, and initializing the first-moment and second-moment vectors together with the corresponding exponential decay rates; in each iteration, updating the first-moment and second-moment vectors with the gradient vector, computing the biased first-moment and second-moment estimate vectors, and using them to update the parameters to be updated; training continues until the parameter vector to be updated converges. With the present invention, the Adam gradient descent algorithm can be applied efficiently and the data-processing efficiency is substantially improved.

Description

Device and method for executing the Adam gradient descent training algorithm
Technical field
The present invention relates to the technical field of Adam algorithm applications, and more particularly to a device and method for executing the Adam gradient descent training algorithm; it concerns hardware implementations of the Adam gradient optimization algorithm and their related applications.
Background art
Gradient optimization algorithms are widely used in fields such as function approximation, optimization computation, pattern recognition, and image processing. Adam, as one of the gradient optimization algorithms, is widely used because it is easy to implement, has a low computational cost and small memory requirements, and its updates are invariant to rescaling of the gradient; implementing the Adam algorithm with a dedicated device can significantly improve its execution speed.
At present, one known way to execute the Adam gradient descent algorithm is to use a general-purpose processor, which supports the above algorithm by executing general-purpose instructions with general-purpose register files and general-purpose functional units. One drawback of this approach is that the arithmetic performance of a single general-purpose processor is relatively low, and when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck. In addition, a general-purpose processor has to decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the front-end decoding of the processor brings a considerable power overhead.
Another known way to execute the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the above algorithm by executing general single-instruction multiple-data (SIMD) instructions with general-purpose register files and general-purpose stream processing units. Because the GPU is a device specialized for graphics and scientific computing, it has no dedicated support for the operations of the Adam gradient descent algorithm; a large amount of front-end decoding work is still needed before those operations can be executed, which brings a large overhead. In addition, the GPU has only a small on-chip cache, so the data needed for the computation (such as the first-moment vector and the second-moment vector) have to be moved on and off chip repeatedly; the off-chip bandwidth becomes the main performance bottleneck and also brings a huge power overhead.
Summary of the invention
(1) Technical problem to be solved
In view of this, a primary object of the present invention is to provide a device and method for executing the Adam gradient descent training algorithm, so as to solve the problems of the insufficient arithmetic performance of general-purpose processors for such data processing and of their large front-end decoding overhead, and to avoid repeatedly reading data from memory, thereby reducing the memory-access bandwidth.
(2) Technical solution
To achieve the above object, the present invention provides a device for executing the Adam gradient descent training algorithm. The device comprises a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
the direct memory access unit 1 is used to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;
the instruction cache unit 2 is used to read instructions through the direct memory access unit 1 and to cache the instructions that are read;
the controller unit 3 is used to read instructions from the instruction cache unit 2 and to decode them into microinstructions that control the behaviour of the direct memory access unit 1, the data cache unit 4, or the data processing module 5;
the data cache unit 4 is used to cache the first-moment vector and the second-moment vector during initialization and during each data update;
the data processing module 5 is used to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated into the external designated space through the direct memory access unit 1.
In the above solution, the direct memory access unit 1 writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly into the external designated space.
In the above solution, the controller unit 3 decodes the instructions that are read into microinstructions controlling the behaviour of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, so as to control the direct memory access unit 1 to read data from the external designated address and write data back to the external designated address, to control the data cache unit 4 to obtain, through the direct memory access unit 1, the instructions needed for the operation from the external designated address, to control the data processing module 5 to perform the update operation on the parameters to be updated, and to control the data transfer between the data cache unit 4 and the data processing module 5.
In the above solution, the data cache unit 4 initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update it reads out the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and delivers them to the data processing module 5, where the first-moment vector m_t and the second-moment vector v_t are updated and then written back into the data cache unit 4.
In the above solution, while the device is running, the data cache unit 4 always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it.
In the above solution, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit 4, and writes θ_t into the external designated space through the direct memory access unit 1.
In the above solution, the data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t², where g_t is the gradient vector; it computes the moment estimate vectors from m_t, v_t according to m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t); and it updates the vector to be updated θ_{t-1} to θ_t according to θ_t = θ_{t-1} - α·m̂_t/√(v̂_t).
In the above solution, the data processing module 5 comprises an operation control submodule 51, a vector-addition parallel operation submodule 52, a vector-multiplication parallel operation submodule 53, a vector-division parallel operation submodule 54, a vector-square-root parallel operation submodule 55, and a basic operation submodule 56, wherein the vector-addition parallel operation submodule 52, the vector-multiplication parallel operation submodule 53, the vector-division parallel operation submodule 54, the vector-square-root parallel operation submodule 55, and the basic operation submodule 56 are connected in parallel, and the operation control submodule 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and for a given operation the elements at different positions of the same vector are processed in parallel.
To achieve the above object, the present invention also provides a method for executing the Adam gradient descent training algorithm, the method comprising:
initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;
when performing a gradient descent operation, first updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates, then obtaining the biased first-moment and second-moment estimate vectors m̂_t and v̂_t from the moment vectors, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
In the above solution, initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, comprises:
Step S1: an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent algorithm.
Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent algorithm from the external address space and cache them in the instruction cache unit 2;
Step S3: the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1, β2, and the convergence threshold ct from the external designated space, and send them to the data processing module 5;
Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the initialization of the first-moment vector m_{t-1} and the second-moment vector v_{t-1} in the data cache unit 4, and sets the iteration count t in the data processing module 5 to 1;
Step S5: the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space, and send them to the data processing module 5;
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded microinstruction, transfers the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit 4 into the data processing module 5.
In the above solution, updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates is performed according to the formulas m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t², and specifically includes: the controller unit 3 reads a moment-vector update instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the data cache unit 4 to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}. In this update operation, the moment-vector update instruction is sent to the operation control submodule 51, which issues the corresponding instructions to perform the following operations: it sends instruction INS_1 to the basic operation submodule 56, driving it to compute (1-β1) and (1-β2); it sends instruction INS_2 to the vector-multiplication parallel operation submodule 53, driving it to compute the element-wise square of the gradient, g_t²; it then sends instruction INS_3 to the vector-multiplication parallel operation submodule 53, driving it to compute simultaneously β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1}, and (1-β2)·g_t², the results being denoted a1, a2, b1, and b2 respectively; then a1 and a2, and b1 and b2, are fed respectively as the two inputs to the vector-addition parallel operation submodule 52, yielding the updated first-moment vector m_t and second-moment vector v_t.
In the above solution, after the first-moment vector m_{t-1} and the second-moment vector v_{t-1} have been updated using the externally supplied gradient value g_t and the exponential decay rates, the method further comprises: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded microinstruction, sends the updated first-moment vector m_t and second-moment vector v_t from the data processing module 5 into the data cache unit 4.
In the above solution, obtaining the biased moment estimate vectors m̂_t and v̂_t from the moment vectors is performed according to the formulas m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t), and specifically includes: the controller unit 3 reads a moment-estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the operation control submodule 51 to compute the moment estimate vectors; the operation control submodule 51 issues the corresponding instructions to perform the following operations: it sends instruction INS_4 to the basic operation submodule 56, driving it to compute 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; it sends instruction INS_5 to the vector-multiplication parallel operation submodule 53, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1-β1^t) and of the second-moment vector v_t with 1/(1-β2^t), yielding the biased moment estimate vectors m̂_t and v̂_t.
In the above solution, updating the vector to be updated θ_{t-1} to θ_t is performed according to the formula θ_t = θ_{t-1} - α·m̂_t/√(v̂_t), and specifically includes: the controller unit 3 reads a parameter-vector update instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the operation control submodule 51 to perform the following operations: the operation control submodule 51 sends instruction INS_6 to the basic operation submodule 56, driving it to compute -α; it sends instruction INS_7 to the vector-square-root parallel operation submodule 55, driving it to compute √(v̂_t); it sends instruction INS_7 to the vector-division parallel operation submodule 54, driving it to compute m̂_t/√(v̂_t); it sends instruction INS_8 to the vector-multiplication parallel operation submodule 53, driving it to compute -α·m̂_t/√(v̂_t); it sends instruction INS_9 to the vector-addition parallel operation submodule 52, driving it to compute θ_{t-1} + (-α·m̂_t/√(v̂_t)), yielding the updated parameter vector θ_t, where θ_{t-1} is the value of θ before the t-th iteration updates it, and in the t-th iteration θ_{t-1} is updated to θ_t; the operation control submodule 51 then sends instruction INS_10 to the vector-division parallel operation submodule 54, driving it to compute the element-wise vector temp used for the convergence test; finally, it sends instructions INS_11 and INS_12 to the vector-addition parallel operation submodule 52 and the basic operation submodule 56 respectively, which compute sum = Σ_i temp_i and temp2 = sum/n.
In the above solution, after the vector to be updated θ_{t-1} has been updated to θ_t, the method further comprises: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded microinstruction, sends the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.
In the above solution, the step of repeating this process until the vector to be updated converges includes judging whether the vector to be updated has converged, as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded microinstruction, the data processing module 5 judges whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
(3) Beneficial effects
It can be seen from the above technical solutions that the present invention has the following advantages:
1. The device and method provided by the present invention for executing the Adam gradient descent training algorithm use a device dedicated to executing this algorithm, which solves the problems of the insufficient arithmetic performance of general-purpose processors and of their large front-end decoding overhead, and accelerates the execution of related applications.
2. Because a data cache unit is used to temporarily store the moment vectors needed in the intermediate steps, repeated reads from memory are avoided, the I/O operations between the device and the external address space are reduced, and the memory-access bandwidth is reduced.
3. Because the data processing module performs vector operations with the corresponding parallel operation submodules, the degree of parallelism is greatly increased.
4. Because the data processing module performs vector operations with the corresponding parallel operation submodules and the degree of parallelism of the operations is high, the device can run at a relatively low frequency, so the power overhead is small.
Brief description of the drawings
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows an example block diagram of the overall structure of the device for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Fig. 2 shows an example block diagram of the data processing module in the device for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Fig. 3 shows a flowchart of the method for executing the Adam gradient descent training algorithm according to an embodiment of the present invention.
Throughout the drawings, the same devices, components, units, etc. are denoted by the same reference numerals.
Detailed description of the embodiments
From the following detailed description of exemplary embodiments of the present invention with reference to the accompanying drawings, other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art.
In the present invention, the terms "comprising" and "containing" and their derivatives are intended to be inclusive and non-limiting; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below to explain the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist in a comprehensive understanding of the exemplary embodiments of the invention defined by the claims and their equivalents. The description includes various specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Throughout the drawings, the same reference numerals are used for the same functions and operations.
The device and method according to embodiments of the present invention for executing the Adam gradient descent training algorithm accelerate the application of the Adam gradient descent algorithm. First, the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α are initialized, and the vector to be updated θ_0 is obtained from the external designated space. In each gradient descent operation, the first-moment vector m_{t-1} and the second-moment vector v_{t-1} are first updated using the externally supplied gradient value g_t and the exponential decay rates, i.e. m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t²; then the biased moment estimate vectors m̂_t and v̂_t are obtained from the moment vectors, i.e. m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t); finally the vector to be updated θ_{t-1} is updated to θ_t and output, i.e. θ_t = θ_{t-1} - α·m̂_t/√(v̂_t), where θ_{t-1} is the value of θ before the t-th iteration updates it to θ_t. This process is repeated until the vector to be updated converges.
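For reference, the three formulas above can be collected into a short software sketch. This is a minimal NumPy sketch of the mathematics only, not of the hardware device; following this description, no epsilon term is added to the denominator, and the argument g stands for the gradient vector read from the external designated space.

```python
import numpy as np

def adam_step(theta, m, v, g, t, alpha, beta1, beta2):
    """One Adam iteration as described above (software reference only)."""
    m = beta1 * m + (1 - beta1) * g            # update first-moment vector m_t
    v = beta2 * v + (1 - beta2) * g * g        # update second-moment vector v_t
    m_hat = m / (1 - beta1 ** t)               # biased first-moment estimate
    v_hat = v / (1 - beta2 ** t)               # biased second-moment estimate
    theta = theta - alpha * m_hat / np.sqrt(v_hat)   # parameter update (no epsilon term here)
    return theta, m, v
```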
Fig. 1 shows an example block diagram of the overall structure of the device for implementing the Adam gradient descent algorithm according to an embodiment of the present invention. As shown in Fig. 1, the device comprises a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, all of which can be implemented as hardware circuits.
The direct memory access unit 1 is used to access the external designated space, to read and write data to the instruction cache unit 2 and the data processing module 5, and to complete the loading and storing of data. Specifically, it writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly into the external designated space.
The instruction cache unit 2 is used to read instructions through the direct memory access unit 1 and to cache the instructions that are read.
The controller unit 3 is used to read instructions from the instruction cache unit 2, decode them into microinstructions that control the behaviour of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, and send each microinstruction to the corresponding unit; it controls the direct memory access unit 1 to read data from the external designated address and write data back to the external designated address, controls the data cache unit 4 to obtain, through the direct memory access unit 1, the instructions needed for the operation from the external designated address, controls the data processing module 5 to perform the update operation on the parameters to be updated, and controls the data transfer between the data cache unit 4 and the data processing module 5.
The data cache unit 4 is used to cache the first-moment vector and the second-moment vector during initialization and during each data update. Specifically, the data cache unit 4 initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update it reads out the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and delivers them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4. While the device is running, the data cache unit 4 always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it. In the present invention, because the data cache unit temporarily stores the moment vectors needed in the intermediate steps, repeated reads from memory are avoided, the I/O operations between the device and the external address space are reduced, and the memory-access bandwidth is reduced.
The data processing module 5 is used to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, i.e. m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t²; it computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, i.e. m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t); finally it updates the vector to be updated θ_{t-1} to θ_t, i.e. θ_t = θ_{t-1} - α·m̂_t/√(v̂_t), writes m_t, v_t into the data cache unit 4, and writes θ_t into the external designated space through the direct memory access unit 1. In the present invention, because the data processing module performs vector operations with the corresponding parallel operation submodules, the degree of parallelism is greatly increased; the operating frequency can therefore be kept relatively low, which in turn keeps the power overhead small.
Fig. 2 shows an example block diagram of the data processing module in the device for implementing Adam gradient descent related applications according to an embodiment of the present invention. As shown in Fig. 2, the data processing module 5 comprises an operation control submodule 51, a vector-addition parallel operation submodule 52, a vector-multiplication parallel operation submodule 53, a vector-division parallel operation submodule 54, a vector-square-root parallel operation submodule 55, and a basic operation submodule 56, wherein the vector-addition parallel operation submodule 52, the vector-multiplication parallel operation submodule 53, the vector-division parallel operation submodule 54, the vector-square-root parallel operation submodule 55, and the basic operation submodule 56 are connected in parallel, and the operation control submodule 51 is connected in series with each of them. When the device operates on vectors, the vector operations are element-wise, and for a given operation the elements at different positions of the same vector are processed in parallel.
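As a rough software analogue of this structure, the operation control submodule can be viewed as a sequencer that dispatches element-wise vector primitives to the parallel operation submodules. The sketch below is illustrative only; the primitive names and the dispatch function are assumptions, not terms from the patent.

```python
import numpy as np

# Illustrative element-wise primitives, one per parallel operation submodule.
PRIMITIVES = {
    "VADD":  lambda a, b: a + b,      # vector-addition submodule 52
    "VMUL":  lambda a, b: a * b,      # vector-multiplication submodule 53
    "VDIV":  lambda a, b: a / b,      # vector-division submodule 54
    "VSQRT": lambda a: np.sqrt(a),    # vector-square-root submodule 55
}

def dispatch(op, *operands):
    """Stand-in for the operation control submodule 51: it only sequences
    primitives; each primitive touches every vector element independently,
    which is what lets the hardware process the elements in parallel."""
    return PRIMITIVES[op](*operands)

# Example: the element-wise quantity x / sqrt(y) as two dispatched primitives.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 9.0, 16.0])
out = dispatch("VDIV", x, dispatch("VSQRT", y))
```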
Fig. 3 shows a flowchart of the method for executing the Adam gradient descent training algorithm according to an embodiment of the present invention, which specifically comprises the following steps:
Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent algorithm.
Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent algorithm from the external address space and cache them in the instruction cache unit 2;
Step S3: the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1, β2, and the convergence threshold ct from the external designated space, and then send them to the data processing module 5;
Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the initialization of the first-moment vector m_{t-1} and the second-moment vector v_{t-1} in the data cache unit 4, and sets the iteration count t in the data processing module 5 to 1;
Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded microinstruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space, and then send them to the data processing module 5;
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded microinstruction, transfers the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit 4 into the data processing module 5.
Step S7: the controller unit 3 reads a moment-vector update instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the data cache unit 4 to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}. In this update operation, the moment-vector update instruction is sent to the operation control submodule 51, which issues the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation submodule 56, driving it to compute (1-β1) and (1-β2); it sends operation instruction 2 (INS_2) to the vector-multiplication parallel operation submodule 53, driving it to compute the element-wise square of the gradient, g_t²; it then sends operation instruction 3 (INS_3) to the vector-multiplication parallel operation submodule 53, driving it to compute simultaneously β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1}, and (1-β2)·g_t², the results being denoted a1, a2, b1, and b2 respectively; then a1 and a2, and b1 and b2, are fed respectively as the two inputs to the vector-addition parallel operation submodule 52, yielding the updated first-moment vector m_t and second-moment vector v_t.
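By way of illustration, step S7 can be mirrored in software as follows. This is a sketch under the assumptions that g denotes the gradient vector and that a1, a2 are the two first-moment terms and b1, b2 the two second-moment terms; plain NumPy operations stand in for the submodules.

```python
import numpy as np

def update_moments(m_prev, v_prev, g, beta1, beta2):
    # INS_1: basic operation submodule 56 computes the scalars (1 - beta1), (1 - beta2)
    c1, c2 = 1.0 - beta1, 1.0 - beta2
    # INS_2: vector-multiplication submodule 53 computes the element-wise square of g
    g2 = g * g
    # INS_3: submodule 53 computes the four products in parallel
    a1, a2 = beta1 * m_prev, c1 * g      # first-moment terms
    b1, b2 = beta2 * v_prev, c2 * g2     # second-moment terms
    # vector-addition submodule 52 sums each pair of terms
    m_t = a1 + a2
    v_t = b1 + b2
    return m_t, v_t
```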
Step S8: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded microinstruction, sends the updated first-moment vector m_t and second-moment vector v_t from the data processing module 5 into the data cache unit 4.
Step S9: the controller unit 3 reads a moment-estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the operation control submodule 51 to compute the moment estimate vectors; the operation control submodule 51 issues the corresponding instructions to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation submodule 56, driving it to compute 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; it sends operation instruction 5 (INS_5) to the vector-multiplication parallel operation submodule 53, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1-β1^t) and of the second-moment vector v_t with 1/(1-β2^t), yielding the biased moment estimate vectors m̂_t and v̂_t.
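Step S9 admits the same kind of sketch: the scalar reciprocals correspond to the basic operation submodule 56 and the two scaled vectors to the vector-multiplication parallel operation submodule 53 (t is the iteration count used for the correction factors; incrementing it is left to the caller).

```python
def moment_estimates(m_t, v_t, t, beta1, beta2):
    # INS_4: basic operation submodule 56 computes 1/(1 - beta1^t) and 1/(1 - beta2^t)
    s1 = 1.0 / (1.0 - beta1 ** t)
    s2 = 1.0 / (1.0 - beta2 ** t)
    # INS_5: vector-multiplication submodule 53 scales both moment vectors in parallel
    m_hat = m_t * s1     # biased first-moment estimate vector
    v_hat = v_t * s2     # biased second-moment estimate vector
    return m_hat, v_hat
```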
Step S10: the controller unit 3 reads a parameter-vector update instruction from the instruction cache unit 2 and, according to the decoded microinstruction, drives the operation control submodule 51 to perform the following operations: the operation control submodule 51 sends operation instruction 6 (INS_6) to the basic operation submodule 56, driving it to compute -α; it sends operation instruction 7 (INS_7) to the vector-square-root parallel operation submodule 55, driving it to compute √(v̂_t); it sends operation instruction 7 (INS_7) to the vector-division parallel operation submodule 54, driving it to compute m̂_t/√(v̂_t); it sends operation instruction 8 (INS_8) to the vector-multiplication parallel operation submodule 53, driving it to compute -α·m̂_t/√(v̂_t); it sends operation instruction 9 (INS_9) to the vector-addition parallel operation submodule 52, driving it to compute θ_{t-1} + (-α·m̂_t/√(v̂_t)), yielding the updated parameter vector θ_t, where θ_{t-1} is the value of θ before the t-th iteration updates it, and in the t-th iteration θ_{t-1} is updated to θ_t; the operation control submodule 51 then sends operation instruction 10 (INS_10) to the vector-division parallel operation submodule 54, driving it to compute the element-wise vector temp used for the convergence test; finally, it sends operation instruction 11 (INS_11) and operation instruction 12 (INS_12) to the vector-addition parallel operation submodule 52 and the basic operation submodule 56 respectively, which compute sum = Σ_i temp_i and temp2 = sum/n.
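Step S10 can be sketched in the same way. The parameter update follows θ_t = θ_{t-1} - α·m̂_t/√(v̂_t); the operands of the division that produces the convergence vector temp are not spelled out in the text, so the sketch below uses the element-wise relative parameter change as an assumed stand-in for that metric.

```python
import numpy as np

def update_parameters(theta_prev, m_hat, v_hat, alpha):
    # INS_6: basic operation submodule 56 computes -alpha
    neg_alpha = -alpha
    # INS_7: square-root submodule 55 computes sqrt(v_hat); division submodule 54 computes m_hat / sqrt(v_hat)
    step = m_hat / np.sqrt(v_hat)
    # INS_8: multiplication submodule 53 scales the step by -alpha
    delta = neg_alpha * step
    # INS_9: addition submodule 52 forms theta_t = theta_{t-1} + delta
    theta_t = theta_prev + delta
    # INS_10..INS_12: convergence metric; the division operands below are an
    # assumption (relative change per element, averaged over the n elements)
    temp = np.abs(theta_t - theta_prev) / (np.abs(theta_prev) + 1e-12)
    temp2 = temp.sum() / temp.size       # sum = sum_i temp_i, temp2 = sum / n
    return theta_t, temp2
```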
Step S11: the controller unit 3 reads a write-back instruction for the quantity to be updated (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded microinstruction, sends the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.
Step S12: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded microinstruction, the data processing module 5 judges whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends; otherwise, execution continues from step S5.
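Taken together, steps S5 to S12 amount to the loop sketched below. This is a software reference only: grad_fn is an assumed callback standing in for the gradient read from the external designated space, zero initialization of the moment vectors is assumed for step S4, and the convergence metric follows the assumption made for step S10.

```python
import numpy as np

def adam_train(theta0, grad_fn, alpha, beta1, beta2, ct, max_iters=100000):
    theta = theta0.copy()
    m = np.zeros_like(theta)    # step S4: initialize first-moment vector (zeros assumed)
    v = np.zeros_like(theta)    #          and second-moment vector
    for t in range(1, max_iters + 1):
        g = grad_fn(theta)                               # step S5: fetch gradient
        m = beta1 * m + (1 - beta1) * g                  # step S7: update moment vectors
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)                     # step S9: biased moment estimates
        v_hat = v / (1 - beta2 ** t)
        new_theta = theta - alpha * m_hat / np.sqrt(v_hat)   # step S10: parameter update
        temp2 = np.mean(np.abs(new_theta - theta) / (np.abs(theta) + 1e-12))
        theta = new_theta                                 # step S11: write back
        if temp2 < ct:                                    # step S12: convergence test
            break
    return theta
```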
By using a device dedicated to executing the Adam gradient descent training algorithm, the present invention solves the problems of the insufficient arithmetic performance of general-purpose processors for such data and of their large front-end decoding overhead, and accelerates the execution of related applications. At the same time, the use of the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth.
The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination thereof. Although the processes or methods are described above in a certain order, it should be understood that some of the described operations can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, various embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. Obviously, various modifications may be made to each embodiment without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (17)

1. A device for executing the Adam gradient descent training algorithm, characterized in that the device comprises a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a data cache unit (4), and a data processing module (5), wherein:
the direct memory access unit (1) is configured to access the external designated space, read and write data to the instruction cache unit (2) and the data processing module (5), and complete the loading and storing of data;
the instruction cache unit (2) is configured to read instructions through the direct memory access unit (1) and cache the instructions that are read;
the controller unit (3) is configured to read instructions from the instruction cache unit (2) and decode them into microinstructions that control the behaviour of the direct memory access unit (1), the data cache unit (4), or the data processing module (5);
the data cache unit (4) is configured to cache the first-moment vector and the second-moment vector during initialization and during each data update;
the data processing module (5) is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit (4), and write the updated vector to be updated into the external designated space through the direct memory access unit (1).
2. The device for executing the Adam gradient descent training algorithm according to claim 1, characterized in that the direct memory access unit (1) writes instructions from the external designated space to the instruction cache unit (2), reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module (5), and writes the updated parameter vector from the data processing module (5) directly into the external designated space.
3. The device for executing the Adam gradient descent training algorithm according to claim 1, characterized in that the controller unit (3) decodes the instructions that are read into microinstructions controlling the behaviour of the direct memory access unit (1), the data cache unit (4), or the data processing module (5), so as to control the direct memory access unit (1) to read data from the external designated address and write data to the external designated address, control the data cache unit (4) to obtain, through the direct memory access unit (1), the instructions needed for the operation from the external designated address, control the data processing module (5) to perform the update operation on the parameters to be updated, and control the data transfer between the data cache unit (4) and the data processing module (5).
4. The device for executing the Adam gradient descent training algorithm according to claim 1, characterized in that the data cache unit (4) initializes the first-moment vector m_t and the second-moment vector v_t during initialization; during each data update it reads out the first-moment vector m_{t-1} and the second-moment vector v_{t-1} and delivers them to the data processing module (5), where they are updated to m_t and v_t and then written back into the data cache unit (4).
5. The device for executing the Adam gradient descent training algorithm according to claim 4, characterized in that, while the device is running, the data cache unit (4) always keeps a copy of the first-moment vector m_t and the second-moment vector v_t inside it.
6. The device for executing the Adam gradient descent training algorithm according to claim 1, characterized in that the data processing module (5) reads the moment vectors m_{t-1}, v_{t-1} from the data cache unit (4), and reads the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit (1); it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t, computes the moment estimate vectors m̂_t, v̂_t from m_t, v_t, finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t, v_t into the data cache unit (4), and writes θ_t into the external designated space through the direct memory access unit (1).
7. The device for executing the Adam gradient descent training algorithm according to claim 6, characterized in that the data processing module (5) updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t², computes the moment estimate vectors according to m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t), and updates the vector to be updated θ_{t-1} to θ_t according to θ_t = θ_{t-1} - α·m̂_t/√(v̂_t).
8. The device for executing the Adam gradient descent training algorithm according to claim 1 or 7, characterized in that the data processing module (5) comprises an operation control submodule (51), a vector-addition parallel operation submodule (52), a vector-multiplication parallel operation submodule (53), a vector-division parallel operation submodule (54), a vector-square-root parallel operation submodule (55), and a basic operation submodule (56), wherein the vector-addition parallel operation submodule (52), the vector-multiplication parallel operation submodule (53), the vector-division parallel operation submodule (54), the vector-square-root parallel operation submodule (55), and the basic operation submodule (56) are connected in parallel, and the operation control submodule (51) is connected in series with each of the vector-addition parallel operation submodule (52), the vector-multiplication parallel operation submodule (53), the vector-division parallel operation submodule (54), the vector-square-root parallel operation submodule (55), and the basic operation submodule (56).
9. The device for executing the Adam gradient descent training algorithm according to claim 8, characterized in that, when the device operates on vectors, the vector operations are element-wise, and for a given operation the elements at different positions of the same vector are processed in parallel.
10. A method for executing the Adam gradient descent training algorithm, applied to the device according to any one of claims 1 to 9, characterized in that the method comprises:
initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;
when performing a gradient descent operation, first updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates, then obtaining the biased moment estimate vectors m̂_t and v̂_t from the moment vectors, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; and repeating this process until the vector to be updated converges.
11. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that initializing the first-moment vector m_0, the second-moment vector v_0, the exponential decay rates β1, β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, comprises:
step S1: pre-storing an INSTRUCTION_IO instruction at the first address of the instruction cache unit, the INSTRUCTION_IO instruction being used to drive the direct memory access unit to read, from the external address space, all instructions related to the Adam gradient descent algorithm;
step S2: starting the operation, the controller unit reading the INSTRUCTION_IO instruction from the first address of the instruction cache unit and, according to the decoded microinstruction, driving the direct memory access unit to read all instructions related to the Adam gradient descent algorithm from the external address space and cache them in the instruction cache unit;
step S3: the controller unit reading a HYPERPARAMETER_IO instruction from the instruction cache unit and, according to the decoded microinstruction, driving the direct memory access unit to read the global update step size α, the exponential decay rates β1, β2, and the convergence threshold ct from the external designated space, and sending them to the data processing module;
step S4: the controller unit reading an assignment instruction from the instruction cache unit and, according to the decoded microinstruction, driving the initialization of the first-moment vector m_{t-1} and the second-moment vector v_{t-1} in the data cache unit, and setting the iteration count t in the data processing module to 1;
step S5: the controller unit reading a DATA_IO instruction from the instruction cache unit and, according to the decoded microinstruction, driving the direct memory access unit to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space and send them to the data processing module;
step S6: the controller unit reading a data transfer instruction from the instruction cache unit and, according to the decoded microinstruction, transferring the first-moment vector m_{t-1} and the second-moment vector v_{t-1} from the data cache unit into the data processing module.
12. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates is performed according to the formulas m_t = β1·m_{t-1} + (1-β1)·g_t and v_t = β2·v_{t-1} + (1-β2)·g_t², and specifically comprises:
the controller unit reading a moment-vector update instruction from the instruction cache unit and, according to the decoded microinstruction, driving the data cache unit to carry out the update of the first-moment vector m_{t-1} and the second-moment vector v_{t-1}; in this update operation, the moment-vector update instruction is sent to the operation control submodule, which issues the corresponding instructions to perform the following operations: sending instruction INS_1 to the basic operation submodule to drive it to compute (1-β1) and (1-β2); sending instruction INS_2 to the vector-multiplication parallel operation submodule to drive it to compute the element-wise square of the gradient, g_t²; then sending instruction INS_3 to the vector-multiplication parallel operation submodule to drive it to compute simultaneously β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1}, and (1-β2)·g_t², the results being denoted a1, a2, b1, and b2 respectively; then feeding a1 and a2, and b1 and b2, respectively as the two inputs to the vector-addition parallel operation submodule, to obtain the updated first-moment vector m_t and second-moment vector v_t.
13. The method for executing the Adam gradient descent training algorithm according to claim 12, characterized in that, after updating the first-moment vector m_{t-1} and the second-moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates, the method further comprises:
the controller unit reading a data transfer instruction from the instruction cache unit and, according to the decoded microinstruction, sending the updated first-moment vector m_t and second-moment vector v_t from the data processing module into the data cache unit.
14. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that obtaining the biased moment estimate vectors m̂_t and v̂_t from the moment vectors is performed according to the formulas m̂_t = m_t/(1-β1^t) and v̂_t = v_t/(1-β2^t), and specifically comprises:
the controller unit reading a moment-estimate vector operation instruction from the instruction cache unit and, according to the decoded microinstruction, driving the operation control submodule to compute the moment estimate vectors; the operation control submodule issues the corresponding instructions to perform the following operations: it sends instruction INS_4 to the basic operation submodule, driving it to compute 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; it sends instruction INS_5 to the vector-multiplication parallel operation submodule, driving it to compute in parallel the products of the first-moment vector m_t with 1/(1-β1^t) and of the second-moment vector v_t with 1/(1-β2^t), obtaining the biased moment estimate vectors m̂_t and v̂_t.
15. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that updating the vector to be updated θ_{t-1} to θ_t is performed according to the formula θ_t = θ_{t-1} - α·m̂_t/√(v̂_t), and specifically comprises:
the controller unit reading a parameter-vector update instruction from the instruction cache unit and, according to the decoded microinstruction, driving the operation control submodule to perform the following operations: the operation control submodule sends instruction INS_6 to the basic operation submodule, driving it to compute -α; it sends instruction INS_7 to the vector-square-root parallel operation submodule, driving it to compute √(v̂_t); it sends instruction INS_7 to the vector-division parallel operation submodule, driving it to compute m̂_t/√(v̂_t); it sends instruction INS_8 to the vector-multiplication parallel operation submodule, driving it to compute -α·m̂_t/√(v̂_t); it sends instruction INS_9 to the vector-addition parallel operation submodule, driving it to compute θ_{t-1} + (-α·m̂_t/√(v̂_t)), obtaining the updated parameter vector θ_t, wherein θ_{t-1} is the value of θ before the t-th iteration updates it, and in the t-th iteration θ_{t-1} is updated to θ_t; the operation control submodule then sends instruction INS_10 to the vector-division parallel operation submodule, driving it to compute the element-wise vector temp used for the convergence test; and it sends instructions INS_11 and INS_12 to the vector-addition parallel operation submodule and the basic operation submodule respectively, which compute sum = Σ_i temp_i and temp2 = sum/n.
16. The method for executing the Adam gradient descent training algorithm according to claim 15, characterized in that, after updating the vector to be updated θ_{t-1} to θ_t, the method further comprises:
the controller unit reading a DATABACK_IO instruction from the instruction cache unit and, according to the decoded microinstruction, sending the updated parameter vector θ_t from the data processing module to the external designated space through the direct memory access unit.
17. The method for executing the Adam gradient descent training algorithm according to claim 10, characterized in that the step of repeating this process until the vector to be updated converges includes judging whether the vector to be updated has converged, as follows:
the controller unit reads a convergence judgment instruction from the instruction cache unit and, according to the decoded microinstruction, the data processing module judges whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
CN201610269689.7A 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm Active CN107315570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610269689.7A CN107315570B (en) 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm

Publications (2)

Publication Number Publication Date
CN107315570A (en) 2017-11-03
CN107315570B CN107315570B (en) 2021-06-18

Family

ID=60185643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610269689.7A Active CN107315570B (en) 2016-04-27 2016-04-27 Device and method for executing Adam gradient descent training algorithm

Country Status (1)

Country Link
CN (1) CN107315570B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101931416A (en) * 2009-06-24 2010-12-29 中国科学院微电子研究所 Parallel layered decoder of LDPC code in mobile digital multimedia broadcasting system
US20110282181A1 (en) * 2009-11-12 2011-11-17 Ge Wang Extended interior methods and systems for spectral, optical, and photoacoustic imaging
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN103956992A (en) * 2014-03-26 2014-07-30 复旦大学 Self-adaptive signal processing method based on multi-step gradient decrease
CN104978282A (en) * 2014-04-04 2015-10-14 上海芯豪微电子有限公司 Cache system and method
CN104360597A (en) * 2014-11-02 2015-02-18 北京工业大学 Sewage treatment process optimization control method based on multiple gradient descent
CN104376124A (en) * 2014-12-09 2015-02-25 西华大学 Clustering algorithm based on disturbance absorbing principle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
D. KINGMA, J. BA: "Adam: A Method for Stochastic Optimization", ICLR 2015 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109782392A (en) * 2019-02-27 2019-05-21 中国科学院光电技术研究所 A kind of fiber-optic coupling method based on modified random paralleling gradient descent algorithm
CN111460528A (en) * 2020-04-01 2020-07-28 支付宝(杭州)信息技术有限公司 Multi-party combined training method and system based on Adam optimization algorithm
CN111460528B (en) * 2020-04-01 2022-06-14 支付宝(杭州)信息技术有限公司 Multi-party combined training method and system based on Adam optimization algorithm

Also Published As

Publication number Publication date
CN107315570B (en) 2021-06-18

Similar Documents

Publication Publication Date Title
US11580386B2 (en) Convolutional layer acceleration unit, embedded system having the same, and method for operating the embedded system
US20190370664A1 (en) Operation method
CN106991477B (en) Artificial neural network compression coding device and method
KR102304216B1 (en) Vector computing device
US10346507B2 (en) Symmetric block sparse matrix-vector multiplication
Dong et al. LU factorization of small matrices: Accelerating batched DGETRF on the GPU
CN103049241B Method for improving the computing performance of a CPU+GPU heterogeneous device
WO2020046859A1 (en) Systems and methods for neural network convolutional layer matrix multiplication using cache memory
WO2017124642A1 (en) Device and method for executing forward calculation of artificial neural network
CN108090565A Parallelized training acceleration method for convolutional neural networks
Jung et al. Implementing an interior point method for linear programs on a CPU-GPU system
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
CN107341132B (en) Device and method for executing AdaGrad gradient descent training algorithm
Ezzatti et al. Using graphics processors to accelerate the computation of the matrix inverse
CN107315570A (en) It is a kind of to be used to perform the device and method that Adam gradients decline training algorithm
Falch et al. Register caching for stencil computations on GPUs
CN107341540B (en) Device and method for executing Hessian-Free training algorithm
CN107315569A Device and method for executing RMSprop gradient descent algorithm
Wiggers et al. Implementing the conjugate gradient algorithm on multi-core systems
Kuzelewski et al. GPU-based acceleration of computations in elasticity problems solving by parametric integral equations system
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
Zhang et al. A high performance real-time edge detection system with NEON
CN202093573U (en) Parallel acceleration device used in industrial CT image reconstruction
Li et al. Quantum computer simulation on gpu cluster incorporating data locality
CN113780539A (en) Neural network data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing
Applicant after: Zhongke Cambrian Technology Co.,Ltd.
Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing
Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd.
GR01 Patent grant
TG01 Patent term adjustment