WO2017185257A1 - Device and method for performing Adam gradient descent training algorithm


Info

Publication number
WO2017185257A1
WO2017185257A1 (PCT/CN2016/080357)
Authority
WO
WIPO (PCT)
Prior art keywords
vector
instruction
module
moment
sub
Application number
PCT/CN2016/080357
Other languages
French (fr)
Chinese (zh)
Inventor
郭崎 (Guo Qi)
刘少礼 (Liu Shaoli)
陈天石 (Chen Tianshi)
陈云霁 (Chen Yunji)
Original Assignee
北京中科寒武纪科技有限公司 (Beijing Zhongke Cambricon Technology Co., Ltd.)
Application filed by 北京中科寒武纪科技有限公司
Priority to PCT/CN2016/080357
Publication of WO2017185257A1

Classifications

    • G06N 3/063: Computing arrangements based on biological models; neural networks; physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods

Definitions

  • The present invention relates to the field of Adam algorithm applications, and in particular to an apparatus and method for performing the Adam gradient descent training algorithm; it concerns a hardware implementation of the Adam gradient descent optimization algorithm.
  • The gradient descent optimization algorithm is widely used in fields such as function approximation, optimization, pattern recognition and image processing.
  • The Adam algorithm is one of the gradient descent optimization algorithms. Because it is easy to implement, computationally cheap, modest in its storage requirements, and invariant under symmetric transformations of the gradient, it is widely used, and implementing it on a dedicated device can significantly increase its execution speed.
  • One known method of performing the Adam gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general-purpose instructions through a general-purpose register file and general-purpose functional units.
  • One disadvantage of this approach is that the arithmetic performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck.
  • In addition, a general-purpose processor must decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, so front-end decoding incurs a large power overhead.
  • Another known method of performing the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general single-instruction-multiple-data (SIMD) instructions through a general-purpose register file and general-purpose stream processing units.
  • Since the GPU is a device dedicated to graphics operations and scientific computing, it has no special support for the operations of the Adam gradient descent algorithm; a large amount of front-end decoding work is still required to execute them, which introduces substantial overhead.
  • Moreover, the GPU has only a small on-chip cache, so the data required by the computation (such as the first-order and second-order moment vectors) must be transferred repeatedly from off-chip; off-chip bandwidth becomes the main performance bottleneck and brings a large power cost.
  • The main object of the present invention is to provide an apparatus and method for performing the Adam gradient descent training algorithm, so as to solve the problems of insufficient general-purpose processor performance and high front-end decoding overhead, while avoiding repeated reads of data from memory and reducing the memory-access bandwidth required.
  • The present invention provides an apparatus for performing the Adam gradient descent training algorithm, the apparatus comprising a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
  • the direct memory access unit 1 is configured to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;
  • the instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache them;
  • the controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5;
  • the data cache unit 4 is configured to cache the first-order and second-order moment vectors during initialization and each data update;
  • the data processing module 5 is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated to the external designated space through the direct memory access unit 1.
  • The direct memory access unit 1 writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly back to the external designated space.
  • The controller unit 3 decodes each instruction it reads into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5: it controls the direct memory access unit 1 to read data from and write data to externally designated addresses, controls the data cache unit 4 to obtain the instructions required for the operation from the externally designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update of the parameters to be updated, and controls the data transfer between the data cache unit 4 and the data processing module 5.
  • The data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t at initialization time. During each data update, it reads out the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4.
  • Throughout the operation of the apparatus, the data cache unit 4 always keeps a copy of the first-order moment vector m_t and the second-order moment vector v_t.
  • The data processing module 5 reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit 4 and reads the vector to be updated θ_{t-1} from the external designated space through the direct memory access unit 1; it updates θ_{t-1} to θ_t, writes m_t and v_t into the data cache unit 4, and writes θ_t to the external designated space through the direct memory access unit 1.
  • The data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t \odot g_t$, where $g_t$ denotes the gradient vector read in for the current iteration and $\odot$ denotes element-wise multiplication; it computes the moment estimate vectors $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$ from m_t and v_t; and it updates the vector to be updated θ_{t-1} to θ_t according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$.
  • The data processing module 5 includes an operation control sub-module 51, a vector-addition parallel operation sub-module 52, a vector-multiplication parallel operation sub-module 53, a vector-division parallel operation sub-module 54, a vector-square-root parallel operation sub-module 55, and a basic operation sub-module 56. The sub-modules 52 to 56 are connected in parallel with one another, and the operation control sub-module 51 is connected in series with each of them.
  • When the apparatus operates on vectors, all vector operations are element-wise, and when a given operation is applied to a vector, the elements at different positions are processed in parallel.
  • The present invention also provides a method for performing the Adam gradient descent training algorithm, the method comprising:
  • initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, which includes:
  • In step S1, an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent computation.
  • In step S2, the operation starts: the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent computation from the external address space and cache them in the instruction cache unit 2.
  • In step S3, the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module 5.
  • In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, initializes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing module 5 to 1.
  • In step S5, the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space and send them to the data processing module 5.
  • In step S6, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 to the data processing module 5.
  • The moment-vector update is implemented as follows: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}. The moment vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions: it sends an INS_1 instruction to the basic operation sub-module 56, driving it to compute (1-β1) and (1-β2); it sends an INS_2 instruction to the vector-multiplication parallel operation sub-module 53, driving it to compute the element-wise square g_t⊙g_t; it then sends an INS_3 instruction to the vector-multiplication parallel operation sub-module 53, driving it to compute β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1} and (1-β2)·(g_t⊙g_t) simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively; finally, a_1 and a_2, and b_1 and b_2, are fed as the two pairs of inputs to the vector-addition parallel operation sub-module 52, yielding the updated first-order moment vector m_t = a_1 + a_2 and second-order moment vector v_t = b_1 + b_2.
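  • Read as a dataflow, the four partial products above are mutually independent, which is what lets the vector-multiplication sub-module compute them simultaneously before the two additions. A numpy sketch of this decomposition (variable names are illustrative, and the pairing of the partial products follows the reading given above):

```python
import numpy as np

def update_moments(m_prev, v_prev, g, beta1, beta2):
    """Moment-vector update decomposed as in the INS_1..INS_3 sequence (sketch)."""
    g2 = g * g                # INS_2: element-wise square g_t ⊙ g_t
    a1 = beta1 * m_prev       # INS_3: four independent products, parallel in hardware
    a2 = (1 - beta1) * g
    b1 = beta2 * v_prev
    b2 = (1 - beta2) * g2
    m_t = a1 + a2             # vector-addition sub-module: m_t = a_1 + a_2
    v_t = b1 + b2             # and v_t = b_1 + b_2
    return m_t, v_t
```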
  • The controller unit 3 then reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated first-order moment vector m_t and second-order moment vector v_t from the data processing module 5 to the data cache unit 4.
  • The biased moment estimate vectors m̂_t and v̂_t are obtained from the moment vectors according to the formulas $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$. The implementation is as follows: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to compute the moment estimate vectors. The operation control sub-module 51 sends the instruction INS_4 to the basic operation sub-module 56, driving it to compute the scalars 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; the operation control sub-module 51 then sends the instruction INS_5 to the vector-multiplication parallel operation sub-module 53, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1-β1^t) and of the second-order moment vector v_t with 1/(1-β2^t), yielding the biased estimate vectors m̂_t and v̂_t.
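  • Because the correction factors 1/(1-β1^t) and 1/(1-β2^t) depend only on the iteration count, they are scalars that the basic operation sub-module can produce once per iteration and broadcast into the vector multiplier. A sketch under that assumption (names are illustrative):

```python
def bias_correct(m_t, v_t, t, beta1, beta2):
    """Biased moment estimates, as in the INS_4/INS_5 sequence (sketch)."""
    c1 = 1.0 / (1.0 - beta1 ** t)   # INS_4: scalar factors from the basic sub-module
    c2 = 1.0 / (1.0 - beta2 ** t)
    return c1 * m_t, c2 * v_t       # INS_5: parallel element-wise products -> m̂_t, v̂_t
```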
  • The vector to be updated θ_{t-1} is updated to θ_t according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$. The implementation is as follows: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to perform the following operations: it sends the instruction INS_6 to the basic operation sub-module 56, driving it to compute -α; it sends the instruction INS_7 to the vector-square-root parallel operation sub-module 55, driving it to compute √(v̂_t); it sends the instruction INS_7 to the vector-division parallel operation sub-module 54, driving it to compute m̂_t/√(v̂_t); it sends the instruction INS_8 to the vector-multiplication parallel operation sub-module 53, driving it to compute -α·m̂_t/√(v̂_t); and it sends the instruction INS_9 to the vector-addition parallel operation sub-module 52, driving it to compute θ_t = θ_{t-1} + (-α·m̂_t/√(v̂_t)).
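  • The five instructions form a short chain through the sub-modules, each consuming the previous result. A sketch of the same chain in numpy (names are illustrative):

```python
import numpy as np

def update_parameters(theta_prev, m_hat, v_hat, alpha):
    """Parameter-vector update as the INS_6..INS_9 chain (sketch)."""
    neg_alpha = -alpha           # INS_6: basic operation sub-module 56
    root = np.sqrt(v_hat)        # INS_7: vector-square-root sub-module 55
    ratio = m_hat / root         # INS_7: vector-division sub-module 54
    step = neg_alpha * ratio     # INS_8: vector-multiplication sub-module 53
    return theta_prev + step     # INS_9: vector-addition sub-module 52 -> θ_t
```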
  • Afterwards, the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.
  • The step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged. The determination proceeds as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
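  • The text defines the stopping rule only through the scalar temp2 and the threshold ct; the element-wise formula behind the vector temp is given in a figure omitted from this text. The sketch below therefore assumes temp measures the element-wise relative change of the parameters, which matches the sum-then-average structure described (n is taken as the vector length, and nonzero parameter values are assumed):

```python
import numpy as np

def has_converged(theta_new, theta_old, ct):
    """Convergence judgment (sketch; the exact formula for temp is an assumption)."""
    temp = np.abs(theta_new - theta_old) / np.abs(theta_new)  # assumed form of temp
    temp2 = temp.sum() / temp.size    # sum = Σ_i temp_i ; temp2 = sum / n
    return temp2 < ct                 # converged when temp2 < ct
```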
  • By employing a device dedicated to executing the Adam gradient descent training algorithm, the apparatus and method provided by the present invention solve the problems of insufficient general-purpose processor performance and high front-end decoding overhead, and accelerate the execution of related applications.
  • Because the data cache unit temporarily stores the moment vectors needed by the intermediate steps, the apparatus and method avoid repeatedly reading the data from memory, reduce the IO operations between the device and the external address space, and lower the memory-access bandwidth required.
  • Because the data processing module uses dedicated parallel operation sub-modules for the vector operations, the degree of parallelism is high, so the device can run at a lower clock frequency, which keeps the power overhead small.
  • FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present invention.
  • FIG. 2 shows an example block diagram of a data processing module in an apparatus for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present invention.
  • FIG. 3 shows a flow chart of a method for performing an Adam gradient descent training algorithm in accordance with an embodiment of the present invention.
  • First, the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α are initialized, and the vector to be updated θ_0 is obtained from the external designated space.
  • During each gradient descent step, the externally supplied gradient vector g_t and the exponential decay rates are used to update the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}; the biased moment estimate vectors m̂_t and v̂_t are then obtained from the moment vectors; finally, the vector to be updated θ_{t-1} is updated to θ_t and output, where θ_{t-1} denotes the value of the parameter vector before the t-th iteration and the t-th iteration updates θ_{t-1} to θ_t. This process is repeated until the vector to be updated converges, as the sketch after this list illustrates.
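  • Tying the above together, a minimal end-to-end sketch of the training loop of FIG. 3 in Python (grad_fn stands in for the externally supplied gradient; the convergence measure is the same assumed form as above and requires nonzero parameter values):

```python
import numpy as np

def adam_train(theta, grad_fn, alpha, beta1, beta2, ct):
    """Full Adam training loop following FIG. 3 (illustrative sketch)."""
    m = np.zeros_like(theta)                      # m_0
    v = np.zeros_like(theta)                      # v_0
    t = 1
    while True:
        g = grad_fn(theta)                        # externally supplied gradient g_t
        m = beta1 * m + (1 - beta1) * g           # update moment vectors
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)              # biased moment estimates
        v_hat = v / (1 - beta2 ** t)
        theta_new = theta - alpha * m_hat / np.sqrt(v_hat)
        temp2 = np.mean(np.abs(theta_new - theta) / np.abs(theta_new))
        theta = theta_new
        t += 1
        if temp2 < ct:                            # convergence judgment
            return theta
```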
  • The apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, all of which can be implemented as hardware circuits.
  • The direct memory access unit 1 is configured to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data. Specifically, it writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly back to the external designated space.
  • The instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache them.
  • The controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5. Each micro-instruction is sent to the direct memory access unit 1, the data cache unit 4, or the data processing module 5; the controller unit controls the direct memory access unit 1 to read data from and write data to externally designated addresses, controls the data cache unit 4 to obtain the instructions required for the operation from the externally designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update of the parameters to be updated, and controls the data transfer between the data cache unit 4 and the data processing module 5.
  • The data cache unit 4 is configured to cache the first-order and second-order moment vectors during initialization and each data update. Specifically, the data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t at initialization; during each data update it reads out m_{t-1} and v_{t-1} and sends them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4. Throughout the operation of the apparatus, the data cache unit 4 always keeps a copy of the first-order moment vector m_t and the second-order moment vector v_t. In the present invention, because the data cache unit temporarily stores the moment vectors needed by the intermediate steps, repeated reads of the data from memory are avoided, the IO operations between the device and the external address space are reduced, and the memory-access bandwidth required is lowered.
  • The data processing module 5 is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated to the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit 4 and reads, through the direct memory access unit 1, the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t by $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t \odot g_t$; computes the moment estimate vectors $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$; and finally updates the vector to be updated θ_{t-1} to θ_t by $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$. The vectors m_t and v_t are written into the data cache unit 4, and θ_t is written to the external designated space through the direct memory access unit 1.
  • The data processing module 5 includes an operation control sub-module 51, a vector-addition parallel operation sub-module 52, a vector-multiplication parallel operation sub-module 53, a vector-division parallel operation sub-module 54, a vector-square-root parallel operation sub-module 55, and a basic operation sub-module 56, where the sub-modules 52 to 56 are connected in parallel with one another and the operation control sub-module 51 is connected in series with each of them. All vector operations are element-wise, and when a given operation is applied to a vector, the elements at different positions are processed in parallel.
  • FIG. 3 shows the flow chart of a method for performing the Adam gradient descent training algorithm according to an embodiment of the present invention; the method specifically includes the following steps:
  • In step S1, an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent computation.
  • In step S2, the operation starts: the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent computation from the external address space and cache them in the instruction cache unit 2.
  • In step S3, the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module 5.
  • In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, initializes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing module 5 to 1.
  • In step S5, the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space and send them to the data processing module 5.
  • In step S6, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 to the data processing module 5.
  • In step S7, the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}. The moment vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1-β1) and (1-β2); it sends operation instruction 2 (INS_2) to the vector-multiplication parallel operation sub-module 53, driving it to compute the element-wise square g_t⊙g_t; it then sends operation instruction 3 (INS_3) to the vector-multiplication parallel operation sub-module 53, driving it to compute β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1} and (1-β2)·(g_t⊙g_t) simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively; finally, a_1 and a_2, and b_1 and b_2, are fed as the two pairs of inputs to the vector-addition parallel operation sub-module 52, yielding the updated first-order moment vector m_t = a_1 + a_2 and second-order moment vector v_t = b_1 + b_2.
  • In step S8, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated first-order moment vector m_t and second-order moment vector v_t from the data processing module 5 into the data cache unit 4.
  • In step S9, the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to compute the moment estimate vectors. The operation control sub-module 51 sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute the scalars 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; it then sends operation instruction 5 (INS_5) to the vector-multiplication parallel operation sub-module 53, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1-β1^t) and of the second-order moment vector v_t with 1/(1-β2^t), yielding the biased estimate vectors m̂_t and v̂_t.
  • In step S10, the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to perform the following operations: it sends operation instruction 6 (INS_6) to the basic operation sub-module 56, driving it to compute -α; it sends operation instruction 7 (INS_7) to the vector-square-root parallel operation sub-module 55, driving it to compute √(v̂_t); it sends operation instruction 7 (INS_7) to the vector-division parallel operation sub-module 54, driving it to compute m̂_t/√(v̂_t); it sends operation instruction 8 (INS_8) to the vector-multiplication parallel operation sub-module 53, driving it to compute -α·m̂_t/√(v̂_t); and it sends operation instruction 9 (INS_9) to the vector-addition parallel operation sub-module 52, driving it to compute θ_t = θ_{t-1} + (-α·m̂_t/√(v̂_t)), obtaining the updated parameter vector θ_t, where θ_{t-1} is the value of the parameter vector before the t-th iteration and the t-th iteration updates θ_{t-1} to θ_t. The operation control sub-module 51 then sends operation instruction 10 (INS_10) to the vector-division parallel operation sub-module 54, driving it to compute a vector temp (its element-wise formula is given in a figure omitted from this text), and sends operation instructions 11 and 12 (INS_11, INS_12) to the vector-addition parallel operation sub-module 52 and the basic operation sub-module 56, which compute sum = Σ_i temp_i and temp2 = sum/n.
  • In step S11, the controller unit 3 reads a write-back instruction (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.
  • In step S12, the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, the data processing module 5 determines whether the updated parameter vector has converged: if temp2 < ct, it has converged and the operation ends; otherwise, the flow returns to step S5 to continue execution.
  • The invention solves the problems of insufficient general-purpose processor performance and high front-end decoding overhead, and accelerates the execution of related applications.
  • The use of the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth required.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A device and method for performing the Adam gradient descent training algorithm, the device comprising a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a data cache unit (4) and a data processing module (5). The method comprises: first reading a gradient vector and the vector of values to be updated, and initializing the first-order moment vector, the second-order moment vector and the corresponding exponential decay rates; at each iteration, updating the first-order and second-order moment vectors using the gradient vector, and computing the first-order and second-order biased estimate vectors; updating the parameters to be updated using the first-order and second-order biased estimate vectors; and continuing training until the vector of parameters to be updated converges. The present invention enables the application of the Adam gradient descent algorithm and greatly improves data processing efficiency.

Description

Apparatus and method for performing Adam gradient descent training algorithm

Technical field

The present invention relates to the field of Adam algorithm applications, and in particular to an apparatus and method for performing the Adam gradient descent training algorithm; it concerns a hardware implementation of the Adam gradient descent optimization algorithm.

Background

The gradient descent optimization algorithm is widely used in fields such as function approximation, optimization, pattern recognition and image processing. The Adam algorithm is one of the gradient descent optimization algorithms; because it is easy to implement, computationally cheap, modest in its storage requirements, and invariant under symmetric transformations of the gradient, it is widely used, and implementing it on a dedicated device can significantly increase its execution speed.

Currently, one known method of performing the Adam gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general-purpose instructions through a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that the arithmetic performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the Adam gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, so front-end decoding incurs a large power overhead.

Another known method of performing the Adam gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general single-instruction-multiple-data (SIMD) instructions through a general-purpose register file and general-purpose stream processing units. Since the GPU is a device dedicated to graphics operations and scientific computing, it has no special support for the operations of the Adam gradient descent algorithm; a large amount of front-end decoding work is still required to execute them, which introduces substantial overhead. Moreover, the GPU has only a small on-chip cache, so the data required by the computation (such as the first-order and second-order moment vectors) must be transferred repeatedly from off-chip; off-chip bandwidth becomes the main performance bottleneck and brings a large power cost.

Summary of the invention

In view of this, the main object of the present invention is to provide an apparatus and method for performing the Adam gradient descent training algorithm, so as to solve the problems of insufficient general-purpose processor performance and high front-end decoding overhead, while avoiding repeated reads of data from memory and reducing the memory-access bandwidth required.

To achieve the above object, the present invention provides an apparatus for performing the Adam gradient descent training algorithm, the apparatus comprising a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
the direct memory access unit 1 is configured to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;

the instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache them;

the controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5;

the data cache unit 4 is configured to cache the first-order and second-order moment vectors during initialization and each data update;

the data processing module 5 is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated to the external designated space through the direct memory access unit 1.
In the above solution, the direct memory access unit 1 writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly back to the external designated space.

In the above solution, the controller unit 3 decodes each instruction it reads into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, so as to control the direct memory access unit 1 to read data from and write data to externally designated addresses, control the data cache unit 4 to obtain the instructions required for the operation from the externally designated address through the direct memory access unit 1, control the data processing module 5 to perform the update of the parameters to be updated, and control the data transfer between the data cache unit 4 and the data processing module 5.

In the above solution, the data cache unit 4 initializes the first-order moment vector m_t and the second-order moment vector v_t at initialization time; during each data update it reads out the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to m_t and v_t and then written back into the data cache unit 4.

In the above solution, throughout the operation of the apparatus, the data cache unit 4 always keeps a copy of the first-order moment vector m_t and the second-order moment vector v_t.
In the above solution, the data processing module 5 reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector g_t, the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1; it then updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t and computes the moment estimate vectors m̂_t, v̂_t from m_t and v_t; finally, it updates the vector to be updated θ_{t-1} to θ_t, writes m_t and v_t into the data cache unit 4, and writes θ_t to the external designated space through the direct memory access unit 1.
In the above solution, the data processing module 5 updates the moment vectors m_{t-1}, v_{t-1} to m_t, v_t according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t \odot g_t$, where $g_t$ denotes the gradient vector and $\odot$ denotes element-wise multiplication; it computes the moment estimate vectors m̂_t, v̂_t from m_t and v_t according to the formulas $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$; and it updates the vector to be updated θ_{t-1} to θ_t according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$.
In the above solution, the data processing module 5 includes an operation control sub-module 51, a vector-addition parallel operation sub-module 52, a vector-multiplication parallel operation sub-module 53, a vector-division parallel operation sub-module 54, a vector-square-root parallel operation sub-module 55, and a basic operation sub-module 56, where the sub-modules 52 to 56 are connected in parallel with one another and the operation control sub-module 51 is connected in series with each of them. When the apparatus operates on vectors, all vector operations are element-wise, and when a given operation is applied to a vector, the elements at different positions are processed in parallel.
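The serial connection of the operation control sub-module to each parallel sub-module amounts to a dispatch structure: every micro-instruction selects one sub-module and feeds it operands. The following Python sketch models that structure behaviorally (the operation names and the dispatch table are illustrative, not from the patent; in hardware, every element-wise vector operation runs its lanes in parallel):

```python
import numpy as np

SUBMODULES = {
    "vadd":  lambda x, y: x + y,   # vector-addition parallel sub-module 52
    "vmul":  lambda x, y: x * y,   # vector-multiplication parallel sub-module 53
    "vdiv":  lambda x, y: x / y,   # vector-division parallel sub-module 54
    "vsqrt": np.sqrt,              # vector-square-root parallel sub-module 55
}

def operation_control(op, *operands):
    """Models operation control sub-module 51: route one micro-instruction to
    the sub-module it is wired to (scalar terms such as 1-β1 or -α would come
    from the basic operation sub-module 56)."""
    return SUBMODULES[op](*operands)

# Example: m_t = β1·m_{t-1} + (1-β1)·g_t expressed as dispatched operations:
# m_t = operation_control("vadd",
#           operation_control("vmul", beta1, m_prev),
#           operation_control("vmul", 1 - beta1, g))
```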
To achieve the above object, the present invention also provides a method for performing the Adam gradient descent training algorithm, the method comprising:

initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;
when performing a gradient descent operation, first using the externally supplied gradient value g_t and the exponential decay rates to update the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1}, then obtaining the biased moment estimate vectors m̂_t and v̂_t by moment vector operations, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; this process is repeated until the vector to be updated converges.
In the above solution, initializing the first-order moment vector m_0, the second-order moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, includes:
In step S1, an INSTRUCTION_IO instruction is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read, from the external address space, all instructions related to the Adam gradient descent computation.

In step S2, the operation starts: the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent computation from the external address space and cache them in the instruction cache unit 2;

In step S3, the controller unit 3 reads a HYPERPARAMETER_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module 5;

In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, initializes the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing module 5 to 1;
In step S5, the controller unit 3 reads a DATA_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector g_t from the external designated space and send them to the data processing module 5;
In step S6, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} from the data cache unit 4 to the data processing module 5; the sketch below gives a compact reading of this instruction-driven flow.
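Steps S1 to S6, together with the steps that follow, can be read as a small instruction program that the controller unit decodes one entry at a time. A behavioral sketch of that reading (the mnemonics for the assignment, transfer, update and convergence instructions are not named in the text, so those identifiers are assumptions):

```python
PROGRAM = [
    "INSTRUCTION_IO",     # S1/S2: fetch all Adam-related instructions
    "HYPERPARAMETER_IO",  # S3: read α, β1, β2 and the threshold ct
    "ASSIGN",             # S4: initialize m, v and set t = 1 (mnemonic assumed)
    "DATA_IO",            # S5: read θ_{t-1} and the gradient g_t
    "TRANSFER_IN",        # S6: move m_{t-1}, v_{t-1} to the data processing module
    "MOMENT_UPDATE",      # S7: compute m_t, v_t (mnemonic assumed)
    "TRANSFER_OUT",       # S8: write m_t, v_t back to the data cache unit
    "MOMENT_ESTIMATE",    # S9: compute m̂_t, v̂_t (mnemonic assumed)
    "PARAM_UPDATE",       # S10: compute θ_t and temp2 (mnemonic assumed)
    "DATABACK_IO",        # S11: write θ_t to the external designated space
    "CONVERGENCE",        # S12: stop if temp2 < ct, otherwise loop back to S5
]

def run(handlers):
    """Decode-and-dispatch loop of the controller unit (behavioral sketch):
    each instruction is decoded into micro-instructions that drive a unit."""
    for instruction in PROGRAM:
        handlers[instruction]()
```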
In the above solution, updating the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} using the externally supplied gradient value g_t and the exponential decay rates is implemented according to the formulas $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t \odot g_t$, and specifically includes: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the update of the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} held in the data cache unit 4. The moment vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions: it sends the INS_1 instruction to the basic operation sub-module 56, driving it to compute (1-β1) and (1-β2); it sends the INS_2 instruction to the vector-multiplication parallel operation sub-module 53, driving it to compute the element-wise square g_t⊙g_t; it then sends the INS_3 instruction to the vector-multiplication parallel operation sub-module 53, driving it to compute β1·m_{t-1}, (1-β1)·g_t, β2·v_{t-1} and (1-β2)·(g_t⊙g_t) simultaneously, the results being denoted a_1, a_2, b_1 and b_2 respectively; finally, a_1 and a_2, and b_1 and b_2, are fed as the two pairs of inputs to the vector-addition parallel operation sub-module 52, yielding the updated first-order moment vector m_t = a_1 + a_2 and second-order moment vector v_t = b_1 + b_2.
In the above solution, after the first-order moment vector m_{t-1} and the second-order moment vector v_{t-1} have been updated using the externally supplied gradient value and the exponential decay rates, the method further includes: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated first-order moment vector m_t and second-order moment vector v_t from the data processing module 5 to the data cache unit 4.
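This transfer pattern is the heart of the bandwidth saving: the moment vectors shuttle only between the data cache unit and the data processing module, never across the external memory interface. A minimal behavioral model (the class and method names are illustrative, not from the patent):

```python
import numpy as np

class DataCacheUnit:
    """Model of data cache unit 4: the moment vectors stay on-chip for the
    whole run, so each iteration only θ and the gradient cross the external
    memory interface (illustrative sketch)."""
    def __init__(self, n):
        self.m = np.zeros(n)   # first-order moment vector, initialized in S4
        self.v = np.zeros(n)   # second-order moment vector

    def read_moments(self):             # S6: send m_{t-1}, v_{t-1} to module 5
        return self.m, self.v

    def write_moments(self, m_t, v_t):  # S8: store the updated copies
        self.m, self.v = m_t, v_t
```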
In the above solution, obtaining the biased moment estimate vectors m̂_t and v̂_t by moment vector operations is implemented according to the formulas $\hat m_t = m_t/(1-\beta_1^t)$ and $\hat v_t = v_t/(1-\beta_2^t)$, and specifically includes: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to compute the moment estimate vectors. The operation control sub-module 51 sends the instruction INS_4 to the basic operation sub-module 56, driving it to compute the scalars 1/(1-β1^t) and 1/(1-β2^t), and the iteration count t is incremented by 1; the operation control sub-module 51 then sends the instruction INS_5 to the vector-multiplication parallel operation sub-module 53, driving it to compute in parallel the products of the first-order moment vector m_t with 1/(1-β1^t) and of the second-order moment vector v_t with 1/(1-β2^t), yielding the biased estimate vectors m̂_t and v̂_t.
In the above solution, updating the vector to be updated θ_{t-1} to θ_t is implemented according to the formula $\theta_t = \theta_{t-1} - \alpha\,\hat m_t/\sqrt{\hat v_t}$, and specifically includes: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, drives the operation control sub-module 51 to perform the following operations: it sends the instruction INS_6 to the basic operation sub-module 56, driving it to compute -α; it sends the instruction INS_7 to the vector-square-root parallel operation sub-module 55, driving it to compute √(v̂_t); it sends the instruction INS_7 to the vector-division parallel operation sub-module 54, driving it to compute m̂_t/√(v̂_t); it sends the instruction INS_8 to the vector-multiplication parallel operation sub-module 53, driving it to compute -α·m̂_t/√(v̂_t); and it sends the instruction INS_9 to the vector-addition parallel operation sub-module 52, driving it to compute θ_t = θ_{t-1} + (-α·m̂_t/√(v̂_t)), obtaining the updated parameter vector θ_t, where θ_{t-1} is the value of the parameter vector before the t-th iteration and the t-th iteration updates θ_{t-1} to θ_t. The operation control sub-module 51 then sends the instruction INS_10 to the vector-division parallel operation sub-module 54, driving it to compute a vector temp (its element-wise formula is given in a figure omitted from this text), and sends the instructions INS_11 and INS_12 to the vector-addition parallel operation sub-module 52 and the basic operation sub-module 56, which compute sum = Σ_i temp_i and temp2 = sum/n.
In the above solution, after the vector to be updated θ_{t-1} has been updated to θ_t, the method further includes: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, transfers the updated parameter vector θ_t from the data processing module 5 to the external designated space through the direct memory access unit 1.

In the above solution, the step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged. The determination proceeds as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the decoded micro-instructions, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
It can be seen from the above technical solutions that the present invention has the following beneficial effects:

1. By employing a device dedicated to executing the Adam gradient descent training algorithm, the apparatus and method provided by the present invention solve the problems of insufficient general-purpose processor performance and high front-end decoding overhead, and accelerate the execution of related applications.

2. Because the data cache unit temporarily stores the moment vectors needed by the intermediate steps, the apparatus and method avoid repeatedly reading the data from memory, reduce the IO operations between the device and the external address space, and lower the memory-access bandwidth required.

3. Because the data processing module uses dedicated parallel operation sub-modules for the vector operations, the degree of parallelism is greatly improved.

4. Because the vector operations are highly parallel, the device can run at a lower clock frequency, which keeps the power overhead small.
BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:

Fig. 1 shows an example block diagram of the overall structure of an apparatus for performing the Adam gradient descent training algorithm according to an embodiment of the present invention.

Fig. 2 shows an example block diagram of the data processing module in an apparatus for performing the Adam gradient descent training algorithm according to an embodiment of the present invention.

Fig. 3 shows a flow chart of a method for performing the Adam gradient descent training algorithm according to an embodiment of the present invention.

Throughout the drawings, the same devices, components, units, and the like are denoted by the same reference numerals.

DETAILED DESCRIPTION

Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the present invention taken in conjunction with the accompanying drawings.

In the present invention, the terms "include" and "comprise" and their derivatives are intended to be inclusive rather than limiting; the term "or" is inclusive, meaning and/or.

In this specification, the various embodiments described below for explaining the principles of the present invention are illustrative only and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist in a comprehensive understanding of the exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes numerous specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Furthermore, the same reference numerals are used throughout the drawings for similar functions and operations.
The apparatus and method for performing the Adam gradient descent training algorithm according to embodiments of the present invention serve to accelerate applications of the Adam gradient descent algorithm. First, the first moment vector m_0, the second moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α are initialized, and the vector to be updated θ_0 is obtained from the external designated space. Each time a gradient descent operation is performed, the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates are first used to update the first moment vector m_{t-1} and the second moment vector v_{t-1}, that is,

m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}), v_t = β2·v_{t-1} + (1 - β2)·(∇f(θ_{t-1}))^2;

then the biased moment estimate vectors m̂_t and v̂_t are obtained by moment vector operations, that is,

m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t);

finally, the vector to be updated θ_{t-1} is updated to θ_t and output, that is,

θ_t = θ_{t-1} - α·m̂_t/√v̂_t,

where θ_{t-1} is the value of the parameter vector (initially θ_0) before the t-th update, and the t-th cycle updates θ_{t-1} to θ_t. This process is repeated until the vector to be updated converges.
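For orientation only, one such iteration can be sketched in a few lines of NumPy. This is a software mimic under two assumptions: the gradient is supplied externally, and, exactly as in the formulas above, no epsilon term is added to √v̂_t in the denominator; the default hyper-parameter values are illustrative and are not taken from the patent.

import numpy as np

def adam_step(theta, m, v, grad, t, alpha=0.001, beta1=0.9, beta2=0.999):
    # One Adam iteration as described above; grad is the externally
    # supplied gradient for theta, and t is the iteration count (t >= 1).
    m = beta1 * m + (1 - beta1) * grad                # first moment vector m_t
    v = beta2 * v + (1 - beta2) * grad * grad         # second moment vector v_t
    m_hat = m / (1 - beta1 ** t)                      # biased moment estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / np.sqrt(v_hat)    # no epsilon, per the text
    return theta, m, v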
Fig. 1 shows an example block diagram of the overall structure of an apparatus for implementing the Adam gradient descent algorithm according to an embodiment of the present invention. As shown in Fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, all of which can be implemented by hardware circuits.
The direct memory access unit 1 is configured to access the external designated space, read and write data to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data. Specifically, it writes instructions from the external designated space to the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space to the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly to the external designated space.
The instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache the read instructions.
The controller unit 3 is configured to read instructions from the instruction cache unit 2, decode each read instruction into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, and send the micro-instructions to the respective unit or module, so as to control the direct memory access unit 1 to read data from and write data to externally designated addresses, control the data cache unit 4 to acquire, through the direct memory access unit 1, the instructions required for the operation from the externally designated address, control the data processing module 5 to perform the update operation of the parameters to be updated, and control the data transfer between the data cache unit 4 and the data processing module 5.
The data cache unit 4 is configured to cache the first moment vector and the second moment vector during initialization and each data update. Specifically, at initialization the data cache unit 4 initializes the first moment vector m_t and the second moment vector v_t; during each data update, the data cache unit 4 reads out the first moment vector m_{t-1} and the second moment vector v_{t-1} and sends them to the data processing module 5, where they are updated to the first moment vector m_t and the second moment vector v_t and then written back to the data cache unit 4. During operation of the apparatus, the data cache unit 4 always holds a copy of the first moment vector m_t and the second moment vector v_t. In the present invention, because the data cache unit temporarily stores the moment vectors required by the intermediate process, data need not be repeatedly read from memory, which reduces the IO operations between the apparatus and the external address space and lowers the memory access bandwidth required.
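Purely to illustrate this double-buffering behaviour, a toy software model of the data cache unit could look as follows; the class and method names are invented for the sketch and do not appear in the patent.

import numpy as np

class DataCacheUnit:
    # Toy model of data cache unit 4: it holds the on-chip copies of the
    # first and second moment vectors between iterations.
    def __init__(self, n):
        self.m = np.zeros(n)   # first moment vector, set up at initialization
        self.v = np.zeros(n)   # second moment vector

    def read_moments(self):
        # Read out and sent to the data processing module at each update.
        return self.m, self.v

    def write_moments(self, m, v):
        # Updated m_t, v_t written back after the update.
        self.m, self.v = m, v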
The data processing module 5 is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit 4, and write the updated vector to be updated into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit 4, and reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit 1. It then updates the moment vectors m_{t-1} and v_{t-1} to m_t and v_t, that is,

m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}), v_t = β2·v_{t-1} + (1 - β2)·(∇f(θ_{t-1}))^2,

computes the moment estimate vectors m̂_t and v̂_t from m_t and v_t, that is,

m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t),

and finally updates the vector to be updated θ_{t-1} to θ_t, that is,

θ_t = θ_{t-1} - α·m̂_t/√v̂_t.

It writes m_t and v_t into the data cache unit 4 and writes θ_t into the external designated space through the direct memory access unit 1. In the present invention, because the data processing module uses the associated parallel operation sub-modules to perform vector operations, the degree of parallelism is greatly improved, so the operating frequency can be kept low, which in turn keeps the power consumption overhead small.
Fig. 2 shows an example block diagram of the data processing module in an apparatus for implementing Adam gradient descent algorithm-related applications according to an embodiment of the present invention. As shown in Fig. 2, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic operation sub-module 56. The vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55, and the basic operation sub-module 56 are connected in parallel with one another, and the operation control sub-module 51 is connected in series with each of them. When the apparatus operates on vectors, all vector operations are element-wise: when an operation is performed on a vector, the elements at different positions are processed in parallel.
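To make the element-wise convention concrete, the following NumPy lines mirror the way each parallel sub-module applies a single operation across all element positions of a vector at once. The mapping to NumPy ufuncs is an analogy for illustration, not a description of the hardware.

import numpy as np

vec_add  = np.add       # vector addition parallel operation sub-module 52
vec_mul  = np.multiply  # vector multiplication parallel operation sub-module 53
vec_div  = np.divide    # vector division parallel operation sub-module 54
vec_sqrt = np.sqrt      # vector square root parallel operation sub-module 55

a = np.array([1.0, 4.0, 9.0])
b = np.array([2.0, 2.0, 2.0])
print(vec_add(a, b), vec_mul(a, b), vec_div(a, b), vec_sqrt(a))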
Fig. 3 shows a flow chart of a method for performing the Adam gradient descent training algorithm according to an embodiment of the present invention, which specifically includes the following steps:
Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space.

Step S2: the operation begins; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the translated micro-instruction, drives the direct memory access unit 1 to read all instructions related to the Adam gradient descent calculation from the external address space and caches these instructions into the instruction cache unit 2.

Step S3: the controller unit 3 reads a hyper-parameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the translated micro-instruction, drives the direct memory access unit 1 to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module 5.

Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the translated micro-instruction, drives the initialization of the first moment vector m_{t-1} and the second moment vector v_{t-1} in the data cache unit 4, and sets the iteration count t in the data processing unit 5 to 1.
Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the translated micro-instruction, drives the direct memory access unit 1 to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space and send them to the data processing module 5.
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the translated micro-instruction, transfers the first moment vector m_{t-1} and the second moment vector v_{t-1} in the data cache unit 4 to the data processing unit 5.
Step S7: the controller unit 3 reads a moment vector update instruction from the instruction cache unit 2 and, according to the translated micro-instruction, drives the data cache unit 4 to perform the update operation of the first moment vector m_{t-1} and the second moment vector v_{t-1}. In this update operation, the moment vector update instruction is sent to the operation control sub-module 51, which sends the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1 - β1) and (1 - β2); it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute (∇f(θ_{t-1}))^2; it then sends operation instruction 3 (INS_3) to the vector multiplication parallel operation sub-module 53, driving it to simultaneously compute β1·m_{t-1}, (1 - β1)·∇f(θ_{t-1}), β2·v_{t-1}, and (1 - β2)·(∇f(θ_{t-1}))^2, with the results denoted a1, a2, b1, and b2 respectively. Then a1 and a2, and b1 and b2, are each sent as the two inputs to the vector addition parallel operation sub-module 52, yielding the updated first moment vector m_t and second moment vector v_t.
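The decomposition of step S7 can be mimicked in software as follows; the intermediate names a1, a2, b1, b2 are taken from the text, while the function name and signature are invented for the sketch (the inputs are understood to be NumPy arrays or floats).

def update_moments(m_prev, v_prev, grad, beta1, beta2):
    # INS_2: vector multiplication sub-module computes the element-wise square.
    g2 = grad * grad
    # INS_3: four products computed simultaneously by the multiply sub-module.
    a1 = beta1 * m_prev
    a2 = (1 - beta1) * grad
    b1 = beta2 * v_prev
    b2 = (1 - beta2) * g2
    # Vector addition sub-module: m_t = a1 + a2 and v_t = b1 + b2.
    return a1 + a2, b1 + b2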
Step S8: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the translated micro-instruction, transfers the updated first moment vector m_t and second moment vector v_t from the data processing unit 5 to the data cache unit 4.
Step S9: the controller unit 3 reads a moment estimate vector operation instruction from the instruction cache unit 2 and, according to the translated micro-instruction, drives the operation control sub-module 51 to compute the moment estimate vectors. The operation control sub-module 51 sends the corresponding instructions to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute 1/(1 - β1^t) and 1/(1 - β2^t), and the iteration count t is incremented by 1; it sends operation instruction 5 (INS_5) to the vector multiplication parallel operation sub-module 53, driving it to compute in parallel the product of the first moment vector m_t with 1/(1 - β1^t) and the product of the second moment vector v_t with 1/(1 - β2^t), which yields the biased moment estimate vectors m̂_t = m_t/(1 - β1^t) and v̂_t = v_t/(1 - β2^t).
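In the same spirit, step S9 reduces to two scalar bias factors and two vector-scalar products; a minimal sketch with invented names follows.

def estimate_moments(m_t, v_t, t, beta1, beta2):
    # INS_4: basic operation sub-module computes the scalars 1/(1 - beta1^t)
    # and 1/(1 - beta2^t); the hardware then increments t by 1.
    c1 = 1.0 / (1.0 - beta1 ** t)
    c2 = 1.0 / (1.0 - beta2 ** t)
    # INS_5: vector multiplication sub-module forms both products in parallel.
    return m_t * c1, v_t * c2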
Step S10: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and, according to the translated micro-instruction, drives the operation control sub-module 51 to perform the following operations: the operation control sub-module 51 sends operation instruction 6 (INS_6) to the basic operation sub-module 56, driving it to compute -α; it sends operation instruction 7 (INS_7) to the vector square root parallel operation sub-module 55, driving it to compute √v̂_t; it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute m̂_t/√v̂_t; it sends operation instruction 8 (INS_8) to the vector multiplication parallel operation sub-module 53, driving it to compute -α·m̂_t/√v̂_t; it sends operation instruction 9 (INS_9) to the vector addition parallel operation sub-module 52, driving it to compute θ_t = θ_{t-1} - α·m̂_t/√v̂_t, which yields the updated parameter vector θ_t, where θ_{t-1} is the value of the parameter vector (initially θ_0) before the t-th update and the t-th cycle updates θ_{t-1} to θ_t; the operation control sub-module 51 sends operation instruction 10 (INS_10) to the vector division parallel operation sub-module 54, driving it to compute the element-wise quotient vector temp; and it sends operation instruction 11 (INS_11) and operation instruction 12 (INS_12) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively, which compute sum = ∑_i temp_i and temp2 = sum/n.
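Step S10 can likewise be followed operation by operation. The sketch below covers INS_6 through INS_9; the INS_10 to INS_12 convergence statistic corresponds to the has_converged sketch given earlier. The function name and signature are invented for the sketch.

import numpy as np

def update_parameters(theta_prev, m_hat, v_hat, alpha):
    neg_alpha = -alpha                 # INS_6: basic operation sub-module
    root = np.sqrt(v_hat)              # INS_7: vector square root sub-module
    ratio = m_hat / root               # INS_7: vector division sub-module
    delta = neg_alpha * ratio          # INS_8: vector multiplication sub-module
    theta_t = theta_prev + delta       # INS_9: vector addition sub-module
    return theta_t, delta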
Step S11: the controller unit 3 reads a write-back instruction for the updated parameters (DATABACK_IO) from the instruction cache unit 2 and, according to the translated micro-instruction, transfers the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
Step S12: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2 and, according to the translated micro-instruction, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends; otherwise, the flow returns to step S5 and continues.
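Putting the pieces together, a hypothetical software rendering of the loop over steps S5 to S12 might read as follows. It assumes the helper sketches above (update_moments, estimate_moments, update_parameters, has_converged) are in scope; grad_fn, ct, and max_iters are illustrative inputs, not names from the patent.

import numpy as np

def adam_train(theta, grad_fn, alpha, beta1, beta2, ct, max_iters=10000):
    # Step S4: initialize the moment vectors; the iteration count starts at 1.
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, max_iters + 1):
        grad = grad_fn(theta)                                          # step S5
        m, v = update_moments(m, v, grad, beta1, beta2)                # step S7
        m_hat, v_hat = estimate_moments(m, v, t, beta1, beta2)         # step S9
        theta_prev = theta
        theta, _ = update_parameters(theta_prev, m_hat, v_hat, alpha)  # step S10
        if has_converged(theta_prev, theta, ct):                       # step S12
            break
    return theta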
By employing a device dedicated to executing the Adam gradient descent training algorithm, the present invention can solve the problems of insufficient operational performance of general-purpose processors and high front-end decoding overhead, and accelerate the execution speed of related applications. At the same time, the use of the data cache unit avoids repeatedly reading data from memory and lowers the memory access bandwidth required.
The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination of both. Although the processes or methods are described above in a certain order, it should be understood that certain of the described operations can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (17)

1. An apparatus for performing the Adam gradient descent training algorithm, characterized in that the apparatus comprises a direct memory access unit (1), an instruction cache unit (2), a controller unit (3), a data cache unit (4), and a data processing module (5), wherein:

the direct memory access unit (1) is configured to access the external designated space, read and write data to the instruction cache unit (2) and the data processing module (5), and complete the loading and storing of data;

the instruction cache unit (2) is configured to read instructions through the direct memory access unit (1) and cache the read instructions;

the controller unit (3) is configured to read instructions from the instruction cache unit (2) and decode the read instructions into micro-instructions that control the behavior of the direct memory access unit (1), the data cache unit (4), or the data processing module (5);

the data cache unit (4) is configured to cache the first moment vector and the second moment vector during initialization and each data update;

the data processing module (5) is configured to update the moment vectors, compute the moment estimate vectors, update the vector to be updated, write the updated moment vectors into the data cache unit (4), and write the updated vector to be updated into the external designated space through the direct memory access unit (1).
2. The apparatus for performing the Adam gradient descent training algorithm according to claim 1, characterized in that the direct memory access unit (1) writes instructions from the external designated space to the instruction cache unit (2), reads the parameters to be updated and the corresponding gradient values from the external designated space to the data processing module (5), and writes the updated parameter vector from the data processing module (5) directly to the external designated space.
3. The apparatus for performing the Adam gradient descent training algorithm according to claim 1, characterized in that the controller unit (3) decodes the read instructions into micro-instructions that control the behavior of the direct memory access unit (1), the data cache unit (4), or the data processing module (5), so as to control the direct memory access unit (1) to read data from and write data to externally designated addresses, control the data cache unit (4) to acquire, through the direct memory access unit (1), the instructions required for the operation from the externally designated address, control the data processing module (5) to perform the update operation of the parameters to be updated, and control the data transfer between the data cache unit (4) and the data processing module (5).
4. The apparatus for performing the Adam gradient descent training algorithm according to claim 1, characterized in that the data cache unit (4) initializes the first moment vector m_t and the second moment vector v_t at initialization, and during each data update reads out the first moment vector m_{t-1} and the second moment vector v_{t-1} and sends them to the data processing module (5), where they are updated to the first moment vector m_t and the second moment vector v_t and then written back to the data cache unit (4).
5. The apparatus for performing the Adam gradient descent training algorithm according to claim 4, characterized in that, during operation of the apparatus, the data cache unit (4) always holds a copy of the first moment vector m_t and the second moment vector v_t.
6. The apparatus for performing the Adam gradient descent training algorithm according to claim 1, characterized in that the data processing module (5) reads the moment vectors m_{t-1} and v_{t-1} from the data cache unit (4), reads the vector to be updated θ_{t-1}, the gradient vector ∇f(θ_{t-1}), the update step size α, and the exponential decay rates β1 and β2 from the external designated space through the direct memory access unit (1); then updates the moment vectors m_{t-1} and v_{t-1} to m_t and v_t, computes the moment estimate vectors m̂_t and v̂_t from m_t and v_t, and finally updates the vector to be updated θ_{t-1} to θ_t, writes m_t and v_t into the data cache unit (4), and writes θ_t into the external designated space through the direct memory access unit (1).
7. The apparatus for performing the Adam gradient descent training algorithm according to claim 6, characterized in that the data processing module (5) updates the moment vectors m_{t-1} and v_{t-1} to m_t and v_t according to the formulas

m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}), v_t = β2·v_{t-1} + (1 - β2)·(∇f(θ_{t-1}))^2;

the data processing module (5) computes the moment estimate vectors m̂_t and v̂_t from m_t and v_t according to the formulas

m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t);

and the data processing module (5) updates the vector to be updated θ_{t-1} to θ_t according to the formula

θ_t = θ_{t-1} - α·m̂_t/√v̂_t.
8. The apparatus for performing the Adam gradient descent training algorithm according to claim 1 or 7, characterized in that the data processing module (5) comprises an operation control sub-module (51), a vector addition parallel operation sub-module (52), a vector multiplication parallel operation sub-module (53), a vector division parallel operation sub-module (54), a vector square root parallel operation sub-module (55), and a basic operation sub-module (56), wherein the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55), and the basic operation sub-module (56) are connected in parallel with one another, and the operation control sub-module (51) is connected in series with each of the vector addition parallel operation sub-module (52), the vector multiplication parallel operation sub-module (53), the vector division parallel operation sub-module (54), the vector square root parallel operation sub-module (55), and the basic operation sub-module (56).
9. The apparatus for performing the Adam gradient descent training algorithm according to claim 8, characterized in that, when the apparatus operates on vectors, all vector operations are element-wise operations, and when an operation is performed on a vector, the elements at different positions are processed in parallel.
10. A method for performing the Adam gradient descent training algorithm, applied to the apparatus according to any one of claims 1 to 9, characterized in that the method comprises:

initializing the first moment vector m_0, the second moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space;

when performing a gradient descent operation, first updating the first moment vector m_{t-1} and the second moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, then obtaining the biased moment estimate vectors m̂_t and v̂_t by moment vector operations, and finally updating the vector to be updated θ_{t-1} to θ_t and outputting it; and repeating this process until the vector to be updated converges.
11. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that initializing the first moment vector m_0, the second moment vector v_0, the exponential decay rates β1 and β2, and the learning step size α, and obtaining the vector to be updated θ_0 from the external designated space, comprises:

step S1: pre-storing an INSTRUCTION_IO instruction at the first address of the instruction cache unit, the INSTRUCTION_IO instruction being used to drive the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space;

step S2: at the start of the operation, the controller unit reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit and, according to the translated micro-instruction, drives the direct memory access unit to read all instructions related to the Adam gradient descent calculation from the external address space and caches these instructions into the instruction cache unit;

step S3: the controller unit reads a HYPERPARAMETER_IO instruction from the instruction cache unit and, according to the translated micro-instruction, drives the direct memory access unit to read the global update step size α, the exponential decay rates β1 and β2, and the convergence threshold ct from the external designated space and send them to the data processing module;

step S4: the controller unit reads an assignment instruction from the instruction cache unit and, according to the translated micro-instruction, drives the initialization of the first moment vector m_{t-1} and the second moment vector v_{t-1} in the data cache unit, and sets the iteration count t in the data processing unit to 1;

step S5: the controller unit reads a DATA_IO instruction from the instruction cache unit and, according to the translated micro-instruction, drives the direct memory access unit to read the parameter vector to be updated θ_{t-1} and the corresponding gradient vector ∇f(θ_{t-1}) from the external designated space and send them to the data processing module;

step S6: the controller unit reads a data transfer instruction from the instruction cache unit and, according to the translated micro-instruction, transfers the first moment vector m_{t-1} and the second moment vector v_{t-1} in the data cache unit to the data processing unit.
12. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that updating the first moment vector m_{t-1} and the second moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates is implemented according to the formulas

m_t = β1·m_{t-1} + (1 - β1)·∇f(θ_{t-1}), v_t = β2·v_{t-1} + (1 - β2)·(∇f(θ_{t-1}))^2,

and specifically comprises:

the controller unit reads a moment vector update instruction from the instruction cache unit and, according to the translated micro-instruction, drives the data cache unit to perform the update operation of the first moment vector m_{t-1} and the second moment vector v_{t-1}; in this update operation, the moment vector update instruction is sent to the operation control sub-module, which sends the corresponding instructions to perform the following operations: it sends the INS_1 instruction to the basic operation sub-module, driving it to compute (1 - β1) and (1 - β2); it sends the INS_2 instruction to the vector multiplication parallel operation sub-module, driving it to compute (∇f(θ_{t-1}))^2; it then sends the INS_3 instruction to the vector multiplication parallel operation sub-module, driving it to simultaneously compute β1·m_{t-1}, (1 - β1)·∇f(θ_{t-1}), β2·v_{t-1}, and (1 - β2)·(∇f(θ_{t-1}))^2, with the results denoted a1, a2, b1, and b2 respectively; then a1 and a2, and b1 and b2, are each sent as the two inputs to the vector addition parallel operation sub-module, yielding the updated first moment vector m_t and second moment vector v_t.
13. The method for performing the Adam gradient descent training algorithm according to claim 12, characterized in that, after updating the first moment vector m_{t-1} and the second moment vector v_{t-1} using the externally supplied gradient value ∇f(θ_{t-1}) and the exponential decay rates, the method further comprises:

the controller unit reads a data transfer instruction from the instruction cache unit and, according to the translated micro-instruction, transfers the updated first moment vector m_t and second moment vector v_t from the data processing unit to the data cache unit.
14. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that obtaining the biased moment estimate vectors m̂_t and v̂_t by moment vector operations is implemented according to the formulas

m̂_t = m_t/(1 - β1^t), v̂_t = v_t/(1 - β2^t),

and specifically comprises:

the controller unit reads a moment estimate vector operation instruction from the instruction cache unit and, according to the translated micro-instruction, drives the operation control sub-module to compute the moment estimate vectors; the operation control sub-module sends the corresponding instructions to perform the following operations: it sends the instruction INS_4 to the basic operation sub-module, driving it to compute 1/(1 - β1^t) and 1/(1 - β2^t), and the iteration count t is incremented by 1; it sends the instruction INS_5 to the vector multiplication parallel operation sub-module, driving it to compute in parallel the product of the first moment vector m_t with 1/(1 - β1^t) and the product of the second moment vector v_t with 1/(1 - β2^t), which yields the biased moment estimate vectors m̂_t and v̂_t.
15. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that updating the vector to be updated θ_{t-1} to θ_t is implemented according to the formula

θ_t = θ_{t-1} - α·m̂_t/√v̂_t,

and specifically comprises:

the controller unit reads a parameter vector update instruction from the instruction cache unit and, according to the translated micro-instruction, drives the operation control sub-module to perform the following operations: the operation control sub-module sends the instruction INS_6 to the basic operation sub-module, driving it to compute -α; it sends the instruction INS_7 to the vector square root parallel operation sub-module, driving it to compute √v̂_t; it sends the instruction INS_7 to the vector division parallel operation sub-module, driving it to compute m̂_t/√v̂_t; it sends the instruction INS_8 to the vector multiplication parallel operation sub-module, driving it to compute -α·m̂_t/√v̂_t; it sends the instruction INS_9 to the vector addition parallel operation sub-module, driving it to compute θ_t = θ_{t-1} - α·m̂_t/√v̂_t, which yields the updated parameter vector θ_t, where θ_{t-1} is the value of the parameter vector (initially θ_0) before the t-th update and the t-th cycle updates θ_{t-1} to θ_t; the operation control sub-module sends the instruction INS_10 to the vector division parallel operation sub-module, driving it to compute the element-wise quotient vector temp; and it sends the instructions INS_11 and INS_12 to the vector addition parallel operation sub-module and the basic operation sub-module respectively, which compute sum = ∑_i temp_i and temp2 = sum/n.
16. The method for performing the Adam gradient descent training algorithm according to claim 15, characterized in that, after updating the vector to be updated θ_{t-1} to θ_t, the method further comprises:

the controller unit reads a DATABACK_IO instruction from the instruction cache unit and, according to the translated micro-instruction, transfers the updated parameter vector θ_t from the data processing unit to the external designated space through the direct memory access unit.
17. The method for performing the Adam gradient descent training algorithm according to claim 10, characterized in that the step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged, the specific determination process being as follows:

the controller unit reads a convergence judgment instruction from the instruction cache unit and, according to the translated micro-instruction, the data processing module determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
PCT/CN2016/080357 2016-04-27 2016-04-27 Device and method for performing adam gradient descent training algorithm WO2017185257A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080357 WO2017185257A1 (en) 2016-04-27 2016-04-27 Device and method for performing adam gradient descent training algorithm

Publications (1)

Publication Number Publication Date
WO2017185257A1 (en) 2017-11-02

Family

ID=60161795

Country Status (1)

Country Link
WO (1) WO2017185257A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325401A1 (en) * 2012-05-29 2013-12-05 Xerox Corporation Adaptive weighted stochastic gradient descent
CN103956992A (en) * 2014-03-26 2014-07-30 复旦大学 Self-adaptive signal processing method based on multi-step gradient decrease
CN105184366A (en) * 2015-09-15 2015-12-23 中国科学院计算技术研究所 Time-division-multiplexing general neural network processor
CN105184369A (en) * 2015-09-08 2015-12-23 杭州朗和科技有限公司 Depth learning model matrix compression method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931937A (en) * 2020-09-30 2020-11-13 深圳云天励飞技术股份有限公司 Gradient updating method, device and system of image processing model
CN111931937B (en) * 2020-09-30 2021-01-01 深圳云天励飞技术股份有限公司 Gradient updating method, device and system of image processing model
CN112329941A (en) * 2020-11-04 2021-02-05 支付宝(杭州)信息技术有限公司 Deep learning model updating method and device
CN112580507A (en) * 2020-12-18 2021-03-30 合肥高维数据技术有限公司 Deep learning text character detection method based on image moment correction
CN112580507B (en) * 2020-12-18 2024-05-31 合肥高维数据技术有限公司 Deep learning text character detection method based on image moment correction
CN113238975A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Memory, integrated circuit and board card for optimizing parameters of deep neural network
CN116863492A (en) * 2023-09-04 2023-10-10 山东正禾大教育科技有限公司 Mobile digital publishing system
CN116863492B (en) * 2023-09-04 2023-11-21 山东正禾大教育科技有限公司 Mobile digital publishing system

Legal Events

NENP: Non-entry into the national phase; ref country code: DE

121: EP: the EPO has been informed by WIPO that EP was designated in this application; ref document number: 16899770; country of ref document: EP; kind code of ref document: A1

122: EP: PCT application non-entry in European phase; ref document number: 16899770; country of ref document: EP; kind code of ref document: A1