WO2017185413A1 - Device and method for executing hessian-free training algorithm - Google Patents

Device and method for executing hessian-free training algorithm Download PDF

Info

Publication number
WO2017185413A1
WO2017185413A1 PCT/CN2016/081842 CN2016081842W WO2017185413A1 WO 2017185413 A1 WO2017185413 A1 WO 2017185413A1 CN 2016081842 W CN2016081842 W CN 2016081842W WO 2017185413 A1 WO2017185413 A1 WO 2017185413A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
module
updated
unit
hessian
Prior art date
Application number
PCT/CN2016/081842
Other languages
French (fr)
Chinese (zh)
Inventor
张士锦
郭崎
陈天石
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司 filed Critical 北京中科寒武纪科技有限公司
Publication of WO2017185413A1 publication Critical patent/WO2017185413A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and method for performing a Hessian-Free training algorithm.
  • the gradient descent method is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing.
  • the training method of the mainstream neural network is the gradient descent method (in combination with the BP algorithm), but this method ignores the curvature information of the error function, which is not only prone to excessively flat parameter changes, but also cannot converge to the local best advantage, and cannot It is very good to deal with the error function of "morbid curvature" (such as Rosenbrock function).
  • Rosenbrock function such as Rosenbrock function.
  • the Hessian-Free training algorithm solves this problem very well, and through some detail improvements, there is no case where the amount of computation increases with respect to the square of the number of parameters (linear growth like the gradient descent method).
  • a known method of performing the Hessian-Free training algorithm is to use a general purpose processor.
  • the method supports the above algorithm by executing general purpose instructions using a general purpose register file and generic functions.
  • One of the disadvantages of this approach is that the performance of a single general purpose processor is low.
  • communication between general-purpose processors becomes a performance bottleneck.
  • the general-purpose processor needs to decode the correlation operation corresponding to the Hessian-Free training algorithm into a long-column operation and a fetch instruction sequence, and the processor front-end decoding brings a large power consumption overhead.
  • Another known method of performing the Hessian-Free training algorithm is to use a graphics processing unit (GPU).
  • the method supports the above algorithm by executing a generic SIMD instruction using a general purpose register file and a generic stream processing unit.
  • the GPU is a device specially used for performing graphics and computational operations and scientific calculations, without the special support for the Hessian-Free training algorithm related operations, a large amount of front-end decoding work is still required to perform related operations in the Hessian-Free training algorithm. A lot of extra overhead.
  • the GPU has only a small on-chip cache, and the data required in the operation (such as Gauss-Newton matrix) needs to be repeatedly transferred from off-chip. The off-chip bandwidth becomes the main performance bottleneck, and brings huge power consumption overhead.
  • the present invention provides an apparatus for executing a Hessian-Free training algorithm, comprising:
  • controller unit configured to decode the read instruction into a micro instruction that controls the corresponding module, and send the instruction to the corresponding module;
  • a data buffer unit configured to store intermediate variables in the operation process, and perform initialization and update operations on the intermediate variables
  • a data processing module configured to perform an operation operation under the control of the controller unit, and store the intermediate variable in the data cache unit.
  • the data processing module includes an operation control sub-module, a gradient operation sub-module, a damping term operation sub-module, a Gauss-Newton matrix operation sub-module, a conjugate gradient method operation sub-module, and a basic operation sub-module; wherein the basic The operation sub-module performs basic operations such as adding, subtracting, multiplying, and dividing between matrices and vectors;
  • the gradient operation submodule, the damping term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate gradient method operation submodule are all capable of calling the basic operation submodule, and the gradient operation submodule according to the situation
  • the damping term operation sub-module, the Gauss-Newton matrix operation sub-module, and the conjugate gradient method operation sub-module are allowed to be called each other.
  • the data buffer unit initializes a second-order estimate of f( ⁇ ) at device initialization Before the update of the nth update parameter vector ⁇ n is started, Read out into the data processing module and get the update vector in the data processing module Write again;
  • is the parameter vector to be updated
  • ⁇ n is the nth parameter vector to be updated
  • f( ⁇ ) is the error function, that is, the function that deviates from the actual value of the measured result
  • ⁇ n is the update vector
  • ⁇ n+1 ⁇ n + ⁇ n .
  • the data buffer unit is initialized In the step of initializing the gradient Gauss-Newtonian matrix G f , damping coefficient ⁇ and damping function among them,
  • the gradient F refers to the value of the gradient at ⁇ n
  • G f is the f ⁇ n at the Gauss - Newton matrix
  • damping function The value of the function at ⁇ n is determined according to the training model
  • the damping coefficient ⁇ is obtained by the LM-type heuristic method
  • the data processing module reads from the data cache unit Reading the parameter vector ⁇ n to be updated from the external specified space; obtaining the update vector ⁇ n in the module, updating ⁇ n to ⁇ n+1 , corresponding Updated to followed by Writing to the data buffer unit, writing ⁇ n+1 into the external designated space; wherein ⁇ n+1 is the n+1th parameter vector to be updated, Is a second-order estimate of f( ⁇ +1).
  • the present invention also provides a method for executing a Hessian-Free training algorithm, comprising the following steps:
  • Step (1) through the instruction, complete the initialization operation of the data buffer unit, that is, initialize the second-order estimation of f( ⁇ )
  • is the parameter vector to be updated
  • ⁇ n is the nth parameter vector to be updated
  • f( ⁇ ) is the error function, that is, the function that deviates from the actual value of the measured result
  • ⁇ n is the update vector
  • ⁇ n +1 ⁇ n + ⁇ n ;
  • Step (2) through the IO instruction, completing an operation of the data access unit reading the parameter vector to be updated from the external space;
  • Step (3) the data processing module performs a second-order Taylor expansion on the error function f( ⁇ ) at ⁇ n according to the corresponding instruction, and adds a damping term Obtain an estimate of f( ⁇ ) around ⁇ n which is
  • G f is a Gauss-Newton matrix of f at ⁇ n ;
  • the damping coefficient ⁇ is obtained by the LM-like heuristic method;
  • Step (4) the data processing module performs a preconditioned conjugate gradient method to obtain ⁇ n according to the corresponding instruction.
  • the minimum value is reached and ⁇ n is updated to ⁇ n+1 .
  • the specific update operation is:
  • ⁇ n+1 ⁇ n + ⁇ n ;
  • step (5) the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
  • the step of completing the initialization operation of the data buffer unit in the step (1) includes: Gauss-Newtonian matrix G f , damping coefficient ⁇ and damping function Perform a zeroing operation.
  • step (3) when performing RNN training, the damping function
  • G S is the Gauss-Newton matrix of S at ⁇ n
  • is a predetermined positive number
  • the preconditioned conjugate gradient method is performed to obtain ⁇ n as described in step (4).
  • the pre-conditioned conjugate gradient method only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector involved in the operation is passed. Do implicit approximation calculations.
  • the present invention also provides a method for executing a Hessian-Free training algorithm, which comprises the following steps:
  • step S1 an IO instruction is pre-stored at the first address of the instruction cache unit.
  • Step S2 the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all the instructions related to the Hessian-Free calculation from the external address space. And cache it into the instruction cache unit;
  • Step S3 the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the initial parameter vector ⁇ 0 to be updated from the external space into the data processing module;
  • Step S5 the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the parameter vector ⁇ n to be updated from the external space and transmits it to the data processing module;
  • Step S6 the controller unit reads, from the instruction buffer unit, an instruction for performing second-order estimation of the error function near the current parameter vector value, and performs a second-order estimation of f( ⁇ ) near ⁇ n according to the translated micro-instruction.
  • the instruction is sent to the operation control sub-module, and the operation control sub-module sends the corresponding instruction to perform the following operations: calculation using the gradient operation sub-module Using the Gauss-Newton submodule and the matrix multiplication in the basic operation submodule, the Gauss-Newton matrix G f of f at ⁇ n is obtained; the LM heuristic method is used to obtain the damping by the damping term operation submodule and the basic operation submodule. Coefficient ⁇ , which in turn gives the damping term Finally, pass get The expression is stored in the data cache unit; where the damping function Is a value of ⁇ n predetermined by a predetermined function according to the training model;
  • Step S7 the controller unit reads a data transfer instruction from the instruction cache unit, according to the translated microinstruction, Transfer from the data buffer unit to the data processing unit;
  • Step S8 the controller unit reads a parameter update operation instruction from the instruction cache unit, and performs ⁇ n using a preconditioned conjugate gradient method according to the translated micro instruction. The minimum value is reached, and ⁇ n is updated to ⁇ n+1 ; the data access unit reads the parameter vector ⁇ n to be updated from the external space and passes it to the data processing module; the operation control sub-module controls the related operation module to perform the following operations; The update vector ⁇ n is obtained by using the conjugate gradient operation sub-module and the basic operation sub-module; finally, ⁇ n is updated to ⁇ n+1 by vector addition in the basic operation sub-module;
  • Step S9 the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the updated parameter vector ⁇ n+1 is transmitted from the data processing unit to the external designated space through the data access unit;
  • Step S10 the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction, the data processing unit determines whether the updated parameter vector ⁇ n+1 converges: if convergence, the operation ends; otherwise, the number of iterations The value of n is incremented by 1, and the process returns to step S5.
  • the present invention also provides an apparatus for executing a Hessian-Free training algorithm in which a program for executing the method of executing the Hessian-Free training algorithm as described above is solidified.
  • the device and method of the present invention have the following beneficial effects: the device can implement the Hessian-Free training algorithm, and complete training on various neural networks, such as auto-encoders and convolutional nerves. Network (RNN), etc.; by using a device dedicated to the implementation of the Hessian-Free training algorithm, it is possible to solve the problem that the general-purpose processor of the data is insufficient in performance, and the decoding cost of the previous stage is large, and the execution speed of the related application is accelerated; The application of the cache unit avoids repeatedly reading data into the memory and reduces the bandwidth of the memory access.
  • RNN convolutional nerves. Network
  • FIG. 1 is a block diagram showing an overall structure of an apparatus for implementing a Hessian-Free training algorithm related application according to an embodiment of the present invention
  • FIG. 2 is a block diagram showing an example of a data processing module in an apparatus for implementing a Hessian-Free training algorithm related application, in accordance with an embodiment of the present invention
  • FIG. 3 is a flowchart of operations for implementing a Hessian-Free training algorithm related application according to an embodiment of the invention.
  • the invention discloses an apparatus for executing a Hessian-Free training algorithm, comprising an instruction buffer unit, an instruction decoding unit, a data access unit, a data processing module and a data buffer module.
  • the device can implement the Hessian-Free training algorithm and complete training on various neural networks, such as auto-encoders and convolutional neural networks (RNN).
  • RNN convolutional neural networks
  • a second-order Taylor expansion is performed on the error function (objective function), and a damping term is added as an estimate of the objective function, followed by the current gradient, Gauss-Newtonian matrix, damping function, and damping coefficient ( Dampling constant)
  • the preconditioned conjugate gradient method (PreconditioningCG-Minimize) is used to obtain the update vector, and the parameters to be updated are updated. The iteration continues until the parameter vector to be updated converges.
  • the apparatus of the present invention includes a direct memory control unit, an instruction cache unit, a controller unit, a data buffer unit, and a data processing module.
  • the data access unit can access the external address space, can read and write data to each cache unit in the device, and complete loading and storing of the data, and specifically includes reading the instruction to the instruction cache unit, and reading the parameter to be updated from the specified storage unit.
  • the updated parameter vector is directly written from the data processing module to the external designated space;
  • the instruction cache unit reads the instruction through the data access unit, and caches the read instruction;
  • the controller unit slave instruction Reading instructions in the cache unit, decoding the instructions into micro-instructions that control the behavior of other modules and transmitting them to other modules such as data access units, data buffer units, data processing modules, etc.; Intermediate variables, and initialize and update these variables;
  • the data processing module performs corresponding arithmetic operations according to the instructions.
  • the present invention also discloses a method for executing a Hessian-Free training algorithm, which includes the following steps:
  • Step (1) through the instruction, complete the initialization operation of the data buffer unit, that is, initialize the second-order estimation of f( ⁇ ) Specifically, it is the gradient to it.
  • Gauss-Newtonian matrix G f , damping coefficient ⁇ and damping function Perform a zeroing operation.
  • Step (2) through the IO instruction, completes an operation of the data access unit reading the parameter vector to be updated from the external space.
  • Step (3) the data processing module performs a second-order Taylor expansion on the error function f( ⁇ ) at ⁇ n according to the corresponding instruction, and adds a damping term Obtain an estimate of f( ⁇ ) around ⁇ n which is
  • G f is the Gauss-Newton matrix of f at ⁇ n ; ⁇ n is the update vector; the damping coefficient ⁇ is obtained by the method of Levenburg-Marquardt style heuristics; the damping function Is the value of the function at ⁇ n predetermined according to the training model, such as when performing RNN training, S is similar to f, which is a distance function, G S is the Gauss-Newton matrix of S at ⁇ n , and ⁇ (weighting constant) is a predetermined positive number.
  • Step (4) the data processing module performs a preconditioned conjugate gradient method to obtain ⁇ n according to the corresponding instruction.
  • the minimum value is reached and the operation of ⁇ n is updated to ⁇ n+1 .
  • the update operation is as follows:
  • ⁇ n+1 ⁇ n + ⁇ n ;
  • step (5) the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
  • the apparatus of the Hessian-Free training algorithm implemented according to an embodiment of the present invention can be used to support applications using the Hessian-Free training algorithm.
  • a second-order estimation of a spatial storage error function near each of the parameters to be updated is opened.
  • an update vector is calculated by using the second-order estimation, and then updated.
  • Vector update operation Repeat the above steps until the vector to be updated converges.
  • the apparatus includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4, and a data processing module 5, all of which can be implemented by hardware circuits.
  • the data access unit 1 can access the external address space, can read and write data to each cache unit inside the device, and complete data loading and storage. Specifically, the instruction is read from the instruction cache unit 2, the parameter to be updated is read from the specified storage unit to the data processing unit 5, and the updated parameter vector is directly written from the data processing module 5 to the external designated space.
  • the instruction cache unit 2 reads the instruction through the data access unit 1 and caches the read instruction.
  • the controller unit 3 reads the instructions from the instruction cache unit 2, decodes the instructions into micro-instructions that control the behavior of other modules, and transmits them to other modules such as the data access unit 1, the data buffer unit 4, the data processing module 5, and the like.
  • Data buffer unit 4 is initialized at device initialization Specifically, it is to initialize the gradient Gauss-Newtonian matrix G f , damping coefficient ⁇ and damping function Before the update of the nth update parameter vector ⁇ n starts, Read out into the data processing module 5.
  • the update vector ⁇ n is obtained in the data processing module 5, and ⁇ n is updated to ⁇ n+1 , corresponding Updated to followed by Write to data cache unit 4 (new data will be overwritten by the previous corresponding data) for the next use.
  • the data processing module 5 reads from the data buffer unit 4
  • the parameter vector ⁇ n to be updated is read from the external designated space by the data access unit 1.
  • the update vector ⁇ n is obtained in the module, and ⁇ n is updated to ⁇ n+1 , corresponding Updated to followed by It is written to the data buffer unit 4, and ⁇ n+1 is written into the external designated space through the data access unit 1.
  • the data processing module includes an operation control sub-module 51, a gradient operation sub-module 52, a damping term operation sub-module 53, a Gauss-Newton matrix operation sub-module 54, a conjugate gradient method operation sub-module 55, and a basic operation unit.
  • Module 56 the basic operation sub-module 56 performs basic operations such as matrix and vector multiplication, and the 52, 53, 54, 55 sub-modules will call 56 sub-modules, and depending on the situation, these modules are also allowed to call each other. .
  • Figure 3 shows a general flow diagram of an apparatus for performing correlation operations in accordance with the Hessian-Free training algorithm.
  • step S1 an IO instruction is pre-stored at the first address of the instruction cache unit 2.
  • step S2 the operation starts, the control unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads all the Hessian-Free calculations from the external address space. All instructions are buffered into the instruction cache unit 2.
  • step S3 the controller unit 3 reads an IO instruction from the instruction buffer unit 2, and according to the translated microinstruction, the data access unit 1 reads the initial parameter vector ⁇ 0 to be updated from the external space into the data processing module 5.
  • step S4 the controller unit 3 reads the assignment instruction from the instruction cache unit 2, according to the translated microinstruction, in the data buffer unit 4. Initialization, the number n of iterations in the data processing unit 5 is set to zero.
  • step S5 the controller unit 3 reads an IO command from the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads the parameter vector ⁇ n to be updated from the external space and passes it to the data processing module 5.
  • step S6 the controller unit 3 reads, from the instruction buffer unit 2, an instruction for performing second-order estimation of the error function near the current parameter vector value, and based on the translated micro-instruction, performs f( ⁇ ) near ⁇ n . Second order estimate Operation.
  • the instruction is sent to the arithmetic control sub-module 51, and the arithmetic control sub-module 51 sends the corresponding instruction to perform the following operations: using the gradient operation sub-module 52 to calculate Using the matrix multiplication operation in the Gauss-Newton operation sub-module 54 and the basic operation sub-module 56, the Gauss-Newton matrix Gf of f at ⁇ n is obtained; the LM inspiration is performed by the damping term operation sub-module 53 and the basic operation sub-module 56. Method to obtain the damping coefficient ⁇ , and then obtain the damping term Finally, get The expression is stored in the data cache unit 4.
  • step S7 the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, according to the translated microinstruction, It is transferred from the data buffer unit 4 to the data processing unit 5.
  • step S8 the controller unit 3 reads a parameter update operation instruction from the instruction cache unit 2, and based on the translated micro-instruction, performs ⁇ n using a preconditioned conjugate gradient method. The minimum value is reached and the operation of ⁇ n is updated to ⁇ n+1 .
  • the data access unit 1 reads the parameter vector ⁇ n to be updated from the external space and passes it to the data processing module 5.
  • the operation control sub-module 51 controls the correlation operation module to perform an operation of obtaining the update vector ⁇ n using the conjugate gradient operation sub-module 55 and the basic operation sub-module 56. Among them, according to the damping function The expression may also need to call the Gauss Newton-arithmetic module (such as the RNN example mentioned earlier).
  • ⁇ n is updated to ⁇ n+1 using vector addition in the basic operation sub-module 56.
  • step S9 the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction, the updated parameter vector ⁇ n+1 is transmitted from the data processing unit 5 through the data access unit 1 to the external designated space. .
  • step S10 the controller unit reads a convergence determination instruction from the instruction cache unit 2, and according to the translated microinstruction, the data processing unit determines whether the updated parameter vector ⁇ n+1 converges: if convergence, the operation ends; otherwise, The value of the number of iterations n is incremented by 1, and the process returns to step S5.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

Provided are a device and method for executing an Hessian-Free training algorithm. The device comprises a data access unit, a controller unit, a data processing unit, and a data buffer module. The device can be used to realize an Hessian-Free training algorithm, thereby performing training on various neural networks such as an automatic encoder and a convolutional neural network (RNN). In each iteration, a second order Taylor series expansion is performed on an error function (objective function), and a damping term is added, so as to approximate the objective function. Afterwards, a preconditioned conjugate gradient method is used to obtain an updated vector according to a current gradient, a Gauss-Newton matrix, a damping function, and a damping coefficient so as to update a parameter to be updated. The iteration is repeated until a vector of the parameter to be updated converges.

Description

一种用于执行Hessian-Free训练算法的装置和方法Apparatus and method for executing Hessian-Free training algorithm 技术领域Technical field
本发明涉及神经网络运算技术领域,更具体地涉及一种用于执行Hessian-Free训练算法的装置和方法。The present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and method for performing a Hessian-Free training algorithm.
背景技术Background technique
梯度下降法在函数逼近、优化计算、模式识别和图像处理等领域被广泛应用。目前主流的神经网络的训练方法是梯度下降法(结合BP算法),但是这种方法忽略了误差函数的曲率信息,不仅容易出现参数变化过度平缓,从而无法收敛到局部最优点的情况,而且无法很好的处理“病态曲率”的误差函数(比如Rosenbrock函数)。Hessian-Free训练算法很好的解决了这个问题,并且通过一些细节改进,使之没有出现运算量关于参数数量平方增长(和梯度下降法一样是线性增长)的情况。The gradient descent method is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing. At present, the training method of the mainstream neural network is the gradient descent method (in combination with the BP algorithm), but this method ignores the curvature information of the error function, which is not only prone to excessively flat parameter changes, but also cannot converge to the local best advantage, and cannot It is very good to deal with the error function of "morbid curvature" (such as Rosenbrock function). The Hessian-Free training algorithm solves this problem very well, and through some detail improvements, there is no case where the amount of computation increases with respect to the square of the number of parameters (linear growth like the gradient descent method).
目前,一种执行Hessian-Free训练算法的已知方法是使用通用处理器。该方法通过使用通用寄存器堆和通用功能部件执行通用指令来支持上述算法。该方法的缺点之一是单个通用处理器的运算性能较低。而多个通用处理器并行执行时,通用处理器之间相互通信又成为了性能瓶颈。另外,通用处理器需要把Hessian-Free训练算法对应的相关运算译码成一长列运算及访存指令序列,处理器前端译码带来了较大的功耗开销。Currently, a known method of performing the Hessian-Free training algorithm is to use a general purpose processor. The method supports the above algorithm by executing general purpose instructions using a general purpose register file and generic functions. One of the disadvantages of this approach is that the performance of a single general purpose processor is low. When multiple general-purpose processors are executed in parallel, communication between general-purpose processors becomes a performance bottleneck. In addition, the general-purpose processor needs to decode the correlation operation corresponding to the Hessian-Free training algorithm into a long-column operation and a fetch instruction sequence, and the processor front-end decoding brings a large power consumption overhead.
另一种执行Hessian-Free训练算法的已知方法是使用图形处理器(GPU)。该方法通过使用通用寄存器堆和通用流处理单元执行通用SIMD指令来支持上述算法。由于GPU是专门用来执行图形图像运算以及科学计算的设备,没有对Hessian-Free训练算法相关运算的专门支持,仍然需要大量的前端译码工作才能执行Hessian-Free训练算法中相关的运算,带来了大量的额外开销。另外,GPU只有较小的片上缓存,运算中所需数据(如高斯-牛顿矩阵等)需要反复从片外搬运,片外带宽成为了主要性能瓶颈,同时带来了巨大的功耗开销。 Another known method of performing the Hessian-Free training algorithm is to use a graphics processing unit (GPU). The method supports the above algorithm by executing a generic SIMD instruction using a general purpose register file and a generic stream processing unit. Since the GPU is a device specially used for performing graphics and computational operations and scientific calculations, without the special support for the Hessian-Free training algorithm related operations, a large amount of front-end decoding work is still required to perform related operations in the Hessian-Free training algorithm. A lot of extra overhead. In addition, the GPU has only a small on-chip cache, and the data required in the operation (such as Gauss-Newton matrix) needs to be repeatedly transferred from off-chip. The off-chip bandwidth becomes the main performance bottleneck, and brings huge power consumption overhead.
发明内容Summary of the invention
有鉴于此,本发明的目的在于提供一种用于执行Hessian-Free训练算法的装置和方法,以期解决上述技术问题中的至少之一。In view of the above, it is an object of the present invention to provide an apparatus and method for executing a Hessian-Free training algorithm with a view to solving at least one of the above technical problems.
为了实现上述目的,作为本发明的一个方面,本发明提供了一种用于执行Hessian-Free训练算法的装置,包括:In order to achieve the above object, as an aspect of the present invention, the present invention provides an apparatus for executing a Hessian-Free training algorithm, comprising:
控制器单元,用于将读取的指令译码为控制相应模块的微指令,并将其发送给相应模块;a controller unit, configured to decode the read instruction into a micro instruction that controls the corresponding module, and send the instruction to the corresponding module;
数据缓存单元,用于存储运算过程中的中间变量,并对所述中间变量执行初始化及更新操作;a data buffer unit, configured to store intermediate variables in the operation process, and perform initialization and update operations on the intermediate variables;
数据处理模块,用于在所述控制器单元的控制下执行运算操作,并将中间变量存储于所述数据缓存单元中。And a data processing module, configured to perform an operation operation under the control of the controller unit, and store the intermediate variable in the data cache unit.
其中,所述数据处理模块包括运算控制子模块、梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块及基本运算子模块;其中,所述基本运算子模块进行的是矩阵、向量之间的加、减、乘和除等基础运算;The data processing module includes an operation control sub-module, a gradient operation sub-module, a damping term operation sub-module, a Gauss-Newton matrix operation sub-module, a conjugate gradient method operation sub-module, and a basic operation sub-module; wherein the basic The operation sub-module performs basic operations such as adding, subtracting, multiplying, and dividing between matrices and vectors;
作为优选,所述梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块均能够调用所述基本运算子模块,且根据情况所述梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块之间允许互相调用。Preferably, the gradient operation submodule, the damping term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate gradient method operation submodule are all capable of calling the basic operation submodule, and the gradient operation submodule according to the situation The damping term operation sub-module, the Gauss-Newton matrix operation sub-module, and the conjugate gradient method operation sub-module are allowed to be called each other.
其中,所述数据缓存单元在装置初始化时初始化f(θ)的二阶估计
Figure PCTCN2016081842-appb-000001
在第n次待更新参数向量θn的更新开始前,将
Figure PCTCN2016081842-appb-000002
读出到数据处理模块中,并在所述数据处理模块中得到更新向量后将
Figure PCTCN2016081842-appb-000003
再次写入;其中,θ为待更新参数向量,θn为第n次待更新参数向量,f(θ)为误差函数,即衡量结果的实际值与预测值偏离的函数;δn是更新向量,且θn+1=θnn
Wherein the data buffer unit initializes a second-order estimate of f(θ) at device initialization
Figure PCTCN2016081842-appb-000001
Before the update of the nth update parameter vector θ n is started,
Figure PCTCN2016081842-appb-000002
Read out into the data processing module and get the update vector in the data processing module
Figure PCTCN2016081842-appb-000003
Write again; where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the function that deviates from the actual value of the measured result; δ n is the update vector And θ n+1 = θ n + δ n .
其中,所述数据缓存单元在初始化
Figure PCTCN2016081842-appb-000004
的步骤中,初始化其中的梯度
Figure PCTCN2016081842-appb-000005
高斯-牛顿矩阵Gf、阻尼系数λ和阻尼函数
Figure PCTCN2016081842-appb-000006
其中,
Figure PCTCN2016081842-appb-000007
所述梯度
Figure PCTCN2016081842-appb-000008
指f在θn处的梯度值,Gf是f在θn处的高斯-牛顿矩阵;阻尼函数
Figure PCTCN2016081842-appb-000009
是根据训练模型预先确定好的函数在θn处的值;阻尼系数λ通过LM式启发式方法求得;
Wherein the data buffer unit is initialized
Figure PCTCN2016081842-appb-000004
In the step of initializing the gradient
Figure PCTCN2016081842-appb-000005
Gauss-Newtonian matrix G f , damping coefficient λ and damping function
Figure PCTCN2016081842-appb-000006
among them,
Figure PCTCN2016081842-appb-000007
The gradient
Figure PCTCN2016081842-appb-000008
F refers to the value of the gradient at θ n, G f is the f θ n at the Gauss - Newton matrix; damping function
Figure PCTCN2016081842-appb-000009
The value of the function at θ n is determined according to the training model; the damping coefficient λ is obtained by the LM-type heuristic method;
所述数据处理模块从所述数据缓存单元中读取
Figure PCTCN2016081842-appb-000010
从外部指定空间中读取待更新参数向量θn;在模块内得到更新向量δn,将θn更新为θn+1,对应的
Figure PCTCN2016081842-appb-000011
更新为
Figure PCTCN2016081842-appb-000012
然后将
Figure PCTCN2016081842-appb-000013
写入至所述数据缓存单元,将θn+1写入到外部指定空间中;其中,θn+1为第n+1次待更新参数向量,
Figure PCTCN2016081842-appb-000014
为f(θ+1)的二阶估计。
The data processing module reads from the data cache unit
Figure PCTCN2016081842-appb-000010
Reading the parameter vector θ n to be updated from the external specified space; obtaining the update vector δ n in the module, updating θ n to θ n+1 , corresponding
Figure PCTCN2016081842-appb-000011
Updated to
Figure PCTCN2016081842-appb-000012
followed by
Figure PCTCN2016081842-appb-000013
Writing to the data buffer unit, writing θ n+1 into the external designated space; wherein θ n+1 is the n+1th parameter vector to be updated,
Figure PCTCN2016081842-appb-000014
Is a second-order estimate of f(θ+1).
作为本发明的另一个方面,本发明还提供了一种执行Hessian-Free训练算法的方法,包括以下步骤:As another aspect of the present invention, the present invention also provides a method for executing a Hessian-Free training algorithm, comprising the following steps:
步骤(1),通过指令,完成数据缓存单元的初始化操作,即初始化f(θ)的二阶估计
Figure PCTCN2016081842-appb-000015
其中,θ为待更新参数向量,θn为第n次待更新参数向量,f(θ)为误差函数,即衡量结果的实际值与预测值偏离的函数;δn是更新向量,且θn+1=θnn
Step (1), through the instruction, complete the initialization operation of the data buffer unit, that is, initialize the second-order estimation of f(θ)
Figure PCTCN2016081842-appb-000015
Where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, f(θ) is the error function, that is, the function that deviates from the actual value of the measured result; δ n is the update vector, and θ n +1 = θ n + δ n ;
步骤(2),通过IO指令,完成数据访问单元从外部空间读取待更新参数向量的操作;Step (2), through the IO instruction, completing an operation of the data access unit reading the parameter vector to be updated from the external space;
步骤(3),数据处理模块根据相应指令,在θn处对误差函数f(θ)进行二阶泰勒展开,并添加阻尼项
Figure PCTCN2016081842-appb-000016
得到f(θ)在θn附近的估计
Figure PCTCN2016081842-appb-000017
Step (3), the data processing module performs a second-order Taylor expansion on the error function f(θ) at θ n according to the corresponding instruction, and adds a damping term
Figure PCTCN2016081842-appb-000016
Obtain an estimate of f(θ) around θ n
Figure PCTCN2016081842-appb-000017
which is
Figure PCTCN2016081842-appb-000018
Figure PCTCN2016081842-appb-000018
其中,Gf是f在θn处的高斯-牛顿矩阵;阻尼系数λ通过LM式启发式方法求得;阻尼函数
Figure PCTCN2016081842-appb-000019
是根据训练模型预先确定好的函数在θn处的值;
Where G f is a Gauss-Newton matrix of f at θ n ; the damping coefficient λ is obtained by the LM-like heuristic method; the damping function
Figure PCTCN2016081842-appb-000019
Is a value of θ n predetermined by a predetermined function according to the training model;
步骤(4),数据处理模块根据相应的指令,进行有预条件的共轭梯度法求δn使得
Figure PCTCN2016081842-appb-000020
达到最小值,并把θn更新为θn+1,具体更新操作为:
Step (4), the data processing module performs a preconditioned conjugate gradient method to obtain δ n according to the corresponding instruction.
Figure PCTCN2016081842-appb-000020
The minimum value is reached and θ n is updated to θ n+1 . The specific update operation is:
θn+1=θnnθ n+1 = θ n + δ n ;
步骤(5),数据处理单元判断更新后的参数向量是否收敛,若收敛,运算结束,否则,转到步骤(2)处继续执行。In step (5), the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
其中,步骤(1)中所述完成数据缓存单元的初始化操作的步骤包括:对梯度
Figure PCTCN2016081842-appb-000021
高斯-牛顿矩阵Gf、阻尼系数λ和阻尼函数
Figure PCTCN2016081842-appb-000022
进行置零操作。
The step of completing the initialization operation of the data buffer unit in the step (1) includes:
Figure PCTCN2016081842-appb-000021
Gauss-Newtonian matrix G f , damping coefficient λ and damping function
Figure PCTCN2016081842-appb-000022
Perform a zeroing operation.
其中,步骤(3)中当进行RNN训练时,阻尼函数
Figure PCTCN2016081842-appb-000023
Wherein, in step (3), when performing RNN training, the damping function
Figure PCTCN2016081842-appb-000023
其中S和f均是距离函数,GS是S在θn处的高斯-牛顿矩阵,μ是一个预先确定好的正数。Where S and f are both distance functions, G S is the Gauss-Newton matrix of S at θ n , and μ is a predetermined positive number.
其中,步骤(4)所述进行有预条件的共轭梯度法求δn使得
Figure PCTCN2016081842-appb-000024
达到最小值的步骤中,在进行有预条件的共轭梯度法过程中,只用“mini-batch”而不是所有样本,且其中涉及的高斯-牛顿矩阵乘向量的运算都是通过
Figure PCTCN2016081842-appb-000025
做隐式近似计算。
Wherein, the preconditioned conjugate gradient method is performed to obtain δ n as described in step (4).
Figure PCTCN2016081842-appb-000024
In the step of reaching the minimum value, in the pre-conditioned conjugate gradient method, only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector involved in the operation is passed.
Figure PCTCN2016081842-appb-000025
Do implicit approximation calculations.
作为本发明的再一个方面,本发明还提供了一种执行Hessian-Free训练算法的方法,其特征在于,包括以下步骤:As still another aspect of the present invention, the present invention also provides a method for executing a Hessian-Free training algorithm, which comprises the following steps:
步骤S1,在指令缓存单元的首地址处预先存入一条IO指令。In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.
步骤S2,运算开始,控制器单元从指令缓存单元的首地址读取该条IO指令,根据译出的微指令,数据访问单元从外部地址空间读取所有与Hessian-Free计算有关的所有指令,并将其缓存入指令缓存单元中;Step S2, the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all the instructions related to the Hessian-Free calculation from the external address space. And cache it into the instruction cache unit;
步骤S3,控制器单元从指令缓存单元读入一条IO指令,根据译出的微指令,数据访问单元从外部空间读取初始待更新参数向量θ0到数据处理模块中;Step S3, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the initial parameter vector θ 0 to be updated from the external space into the data processing module;
步骤S4,控制器单元从指令缓存单元读入赋值指令,根据译出的微指令,数据缓存单元中的
Figure PCTCN2016081842-appb-000026
初始化,数据处理单元中的迭代次数n被设置为0;其中,θ为待更新参数向量,θn为第n次待更新参数向量, f(θ)为误差函数,即衡量结果的实际值与预测值偏离的函数;δn是更新向量,且θn+1=θnn
Figure PCTCN2016081842-appb-000027
为f(θ)的二阶估计;
Step S4, the controller unit reads the assignment instruction from the instruction cache unit, according to the translated microinstruction, in the data cache unit
Figure PCTCN2016081842-appb-000026
Initialization, the number of iterations n in the data processing unit is set to 0; where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the actual value of the measurement result and a function of the predicted value deviation; δ n is the update vector, and θ n+1 = θ n + δ n ;
Figure PCTCN2016081842-appb-000027
a second order estimate of f(θ);
步骤S5,控制器单元从指令缓存单元读入一条IO指令,根据译出的微指令,数据访问单元从外部空间读取待更新参数向量θn传入到数据处理模块中;Step S5, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the parameter vector θ n to be updated from the external space and transmits it to the data processing module;
步骤S6,控制器单元从指令缓存单元读入一条对误差函数在当前参数向量值附近进行二阶估计的指令,根据译出的微指令,进行求f(θ)在θn附近的二阶估计
Figure PCTCN2016081842-appb-000028
的操作;该操作中,指令被送至运算控制子模块,运算控制子模块发送相应指令进行以下操作:利用梯度运算子模块计算
Figure PCTCN2016081842-appb-000029
利用高斯-牛顿运算子模块和基本运算子模块中的矩阵乘法运算,得到f在θn处的高斯-牛顿矩阵Gf;利用阻尼项运算子模块和基本运算子模块执行LM启发式方法得到阻尼系数λ,进而得到阻尼项
Figure PCTCN2016081842-appb-000030
最后,通过
Figure PCTCN2016081842-appb-000031
得到
Figure PCTCN2016081842-appb-000032
的表达式存入数据缓存单元;其中,阻尼函数
Figure PCTCN2016081842-appb-000033
是根据训练模型预先确定好的函数在θn处的值;
Step S6, the controller unit reads, from the instruction buffer unit, an instruction for performing second-order estimation of the error function near the current parameter vector value, and performs a second-order estimation of f(θ) near θ n according to the translated micro-instruction.
Figure PCTCN2016081842-appb-000028
Operation; in this operation, the instruction is sent to the operation control sub-module, and the operation control sub-module sends the corresponding instruction to perform the following operations: calculation using the gradient operation sub-module
Figure PCTCN2016081842-appb-000029
Using the Gauss-Newton submodule and the matrix multiplication in the basic operation submodule, the Gauss-Newton matrix G f of f at θ n is obtained; the LM heuristic method is used to obtain the damping by the damping term operation submodule and the basic operation submodule. Coefficient λ, which in turn gives the damping term
Figure PCTCN2016081842-appb-000030
Finally, pass
Figure PCTCN2016081842-appb-000031
get
Figure PCTCN2016081842-appb-000032
The expression is stored in the data cache unit; where the damping function
Figure PCTCN2016081842-appb-000033
Is a value of θ n predetermined by a predetermined function according to the training model;
步骤S7,控制器单元从指令缓存单元读取一条数据传输指令,根据译出的微指令,将
Figure PCTCN2016081842-appb-000034
从数据缓存单元传送到数据处理单元中;
Step S7, the controller unit reads a data transfer instruction from the instruction cache unit, according to the translated microinstruction,
Figure PCTCN2016081842-appb-000034
Transfer from the data buffer unit to the data processing unit;
步骤S8,控制器单元从指令缓存单元读取一条参数更新运算指令,根据译出的微指令,进行用有预条件的共轭梯度法求δn使得
Figure PCTCN2016081842-appb-000035
达到最小值,并把θn更新为θn+1的操作;数据访问单元从外部空间读取待更新参数向量θn传入到数据处理模块中;运算控制子模块控制相关运算模块进行如下操作:利用共轭梯度运算子模块和基本运算子模块得到更新向量δn;最后,利用基本运算子模块中的向量加法将θn更新为θn+1
Step S8, the controller unit reads a parameter update operation instruction from the instruction cache unit, and performs δ n using a preconditioned conjugate gradient method according to the translated micro instruction.
Figure PCTCN2016081842-appb-000035
The minimum value is reached, and θ n is updated to θ n+1 ; the data access unit reads the parameter vector θ n to be updated from the external space and passes it to the data processing module; the operation control sub-module controls the related operation module to perform the following operations; The update vector δ n is obtained by using the conjugate gradient operation sub-module and the basic operation sub-module; finally, θ n is updated to θ n+1 by vector addition in the basic operation sub-module;
步骤S9,控制器单元从指令缓存单元读取一条IO指令,根据译出 的微指令,更新后的参数向量θn+1从数据处理单元通过数据访问单元传送至外部指定空间;Step S9, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the updated parameter vector θ n+1 is transmitted from the data processing unit to the external designated space through the data access unit;
步骤S10,控制器单元从指令缓存单元读取一条收敛判断指令,根据译出的微指令,数据处理单元判断更新后的参数向量θn+1是否收敛:若收敛,运算结束;否则,迭代次数n的值增长1,转回执行步骤S5。Step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction, the data processing unit determines whether the updated parameter vector θ n+1 converges: if convergence, the operation ends; otherwise, the number of iterations The value of n is incremented by 1, and the process returns to step S5.
作为本发明的还一个方面,本发明还提供了一种执行Hessian-Free训练算法的装置,所述装置的控制器中固化有执行如上所述的执行Hessian-Free训练算法的方法的程序。As still another aspect of the present invention, the present invention also provides an apparatus for executing a Hessian-Free training algorithm in which a program for executing the method of executing the Hessian-Free training algorithm as described above is solidified.
基于上述技术方案可知,本发明的装置和方法具有如下有益效果:使用该装置可以实现Hessian-Free训练算法,完成对各种神经网络的训练,如自动编码器(auto-encoders)、卷积神经网络(RNN)等;通过采用专门用于执行Hessian-Free训练算法的设备,可以解决数据的通用处理器运算性能不足,前段译码开销大的问题,加速相关应用的执行速度;同时,对数据缓存单元的应用,避免了反复向内存读取数据,降低了内存访问的带宽。Based on the above technical solutions, the device and method of the present invention have the following beneficial effects: the device can implement the Hessian-Free training algorithm, and complete training on various neural networks, such as auto-encoders and convolutional nerves. Network (RNN), etc.; by using a device dedicated to the implementation of the Hessian-Free training algorithm, it is possible to solve the problem that the general-purpose processor of the data is insufficient in performance, and the decoding cost of the previous stage is large, and the execution speed of the related application is accelerated; The application of the cache unit avoids repeatedly reading data into the memory and reduces the bandwidth of the memory access.
附图说明DRAWINGS
图1为根据本发明一实施例的用于实现Hessian-Free训练算法相关应用的装置的整体结构示例框图;1 is a block diagram showing an overall structure of an apparatus for implementing a Hessian-Free training algorithm related application according to an embodiment of the present invention;
图2为根据本发明一实施例的用于实现Hessian-Free训练算法相关应用的装置中数据处理模块的示例框图;2 is a block diagram showing an example of a data processing module in an apparatus for implementing a Hessian-Free training algorithm related application, in accordance with an embodiment of the present invention;
图3为根据本发明一实施例的用于实现Hessian-Free训练算法相关应用的运算流程图。FIG. 3 is a flowchart of operations for implementing a Hessian-Free training algorithm related application according to an embodiment of the invention.
具体实施方式detailed description
为使本发明的目的、技术方案和优点更加清楚明白,以下结合具体实施例,并参照附图,对本发明作进一步的详细说明。通过以下详细描 述,本发明的其它方面、优势和突出特征对于本领域技术人员将变得显而易见。The present invention will be further described in detail below with reference to the specific embodiments of the invention, Through the following detailed description Other aspects, advantages, and salient features of the invention will be apparent to those skilled in the art.
在本说明书中,下述用于描述本发明原理的各种实施例只是说明,不应该以任何方式解释为限制发明的范围。参照附图的下述描述用于帮助全面理解由权利要求及其等同物限定的本发明的示例性实施例。下述描述包括多种具体细节来帮助理解,但这些细节应认为仅仅是示例性的。因此,本领域普通技术人员应认识到,在不背离本发明的范围和精神的情况下,可以对本文中描述的实施例进行多种改变和修改。此外,为了清楚和简洁起见,省略了公知功能和结构的描述。此外,贯穿附图,相同参考数字用于相似功能和操作。In the present specification, the following various embodiments for describing the principles of the present invention are merely illustrative and should not be construed as limiting the scope of the invention. The following description of the invention is intended to be understood as The description below includes numerous specific details to assist the understanding, but these details should be considered as merely exemplary. Accordingly, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. In addition, the same reference numerals are used throughout the drawings for similar functions and operations.
本发明公开了一种用于执行Hessian-Free训练算法的装置,包括指令缓存单元、指令译码单元、数据访问单元、数据处理模块和数据缓存模块。使用该装置可以实现Hessian-Free训练算法,完成对各种神经网络的训练,如自动编码器(auto-encoders)、卷积神经网络(RNN)等。每次迭代时,对误差函数(目标函数)做二阶泰勒展开,并且添加阻尼项,作为目标函数的估计,之后根据当前的梯度、高斯-牛顿矩阵、阻尼函数(dampling function)和阻尼系数(dampling constant),用有预条件的共轭梯度法(PreconditioningCG-Minimize)求得更新向量,更新待更新参数。持续迭代直至待更新参数向量收敛。The invention discloses an apparatus for executing a Hessian-Free training algorithm, comprising an instruction buffer unit, an instruction decoding unit, a data access unit, a data processing module and a data buffer module. The device can implement the Hessian-Free training algorithm and complete training on various neural networks, such as auto-encoders and convolutional neural networks (RNN). At each iteration, a second-order Taylor expansion is performed on the error function (objective function), and a damping term is added as an estimate of the objective function, followed by the current gradient, Gauss-Newtonian matrix, damping function, and damping coefficient ( Dampling constant), the preconditioned conjugate gradient method (PreconditioningCG-Minimize) is used to obtain the update vector, and the parameters to be updated are updated. The iteration continues until the parameter vector to be updated converges.
更具体地,本发明的装置包括直接内存控制单元、指令缓存单元、控制器单元、数据缓存单元和数据处理模块。其中,数据访问单元能够访问外部地址空间,可以向装置内部各个缓存单元读写数据,完成数据的加载和存储,具体包括向指令缓存单元读取指令,从指定存储单元之间读取待更新参数和对应的梯度值到数据处理单元,将更新后的参数向量从数据处理模块直接写入外部指定空间;指令缓存单元通过数据访问单元读取指令,并缓存读入的指令;控制器单元从指令缓存单元中读取指令,将指令译码为控制其他模块行为的微指令并将其发送给其他模块如数据访问单元、数据缓存单元、数据处理模块等;数据缓存单元存储装置运行中需要的一些中间变量,并对这些变量做初始化及更新操作; 数据处理模块根据指令做相应的运算操作。More specifically, the apparatus of the present invention includes a direct memory control unit, an instruction cache unit, a controller unit, a data buffer unit, and a data processing module. The data access unit can access the external address space, can read and write data to each cache unit in the device, and complete loading and storing of the data, and specifically includes reading the instruction to the instruction cache unit, and reading the parameter to be updated from the specified storage unit. And the corresponding gradient value to the data processing unit, the updated parameter vector is directly written from the data processing module to the external designated space; the instruction cache unit reads the instruction through the data access unit, and caches the read instruction; the controller unit slave instruction Reading instructions in the cache unit, decoding the instructions into micro-instructions that control the behavior of other modules and transmitting them to other modules such as data access units, data buffer units, data processing modules, etc.; Intermediate variables, and initialize and update these variables; The data processing module performs corresponding arithmetic operations according to the instructions.
此外,本发明还公开了一种执行Hessian-Free训练算法的方法,包括以下步骤:In addition, the present invention also discloses a method for executing a Hessian-Free training algorithm, which includes the following steps:
步骤(1),通过指令,完成数据缓存单元的初始化操作,即初始化f(θ)的二阶估计
Figure PCTCN2016081842-appb-000036
具体来说,是对其中的梯度
Figure PCTCN2016081842-appb-000037
高斯-牛顿矩阵Gf、阻尼系数λ和阻尼函数
Figure PCTCN2016081842-appb-000038
进行置零操作。
Step (1), through the instruction, complete the initialization operation of the data buffer unit, that is, initialize the second-order estimation of f(θ)
Figure PCTCN2016081842-appb-000036
Specifically, it is the gradient to it.
Figure PCTCN2016081842-appb-000037
Gauss-Newtonian matrix G f , damping coefficient λ and damping function
Figure PCTCN2016081842-appb-000038
Perform a zeroing operation.
步骤(2),通过IO指令,完成数据访问单元从外部空间读取待更新参数向量的操作。Step (2), through the IO instruction, completes an operation of the data access unit reading the parameter vector to be updated from the external space.
步骤(3),数据处理模块根据相应指令,在θn处对误差函数f(θ)进行二阶泰勒展开,并添加阻尼项
Figure PCTCN2016081842-appb-000039
得到f(θ)在θn附近的估计
Figure PCTCN2016081842-appb-000040
Step (3), the data processing module performs a second-order Taylor expansion on the error function f(θ) at θ n according to the corresponding instruction, and adds a damping term
Figure PCTCN2016081842-appb-000039
Obtain an estimate of f(θ) around θ n
Figure PCTCN2016081842-appb-000040
which is
Figure PCTCN2016081842-appb-000041
Figure PCTCN2016081842-appb-000041
其中,Gf是f在θn处的高斯-牛顿矩阵;δn是更新向量;阻尼系数λ是用LM式启发式(Levenburg-Marquardt style heuristics)的方法求得;阻尼函数
Figure PCTCN2016081842-appb-000042
是根据训练模型预先确定好的函数在θn处的值,比如进行RNN训练时,
Figure PCTCN2016081842-appb-000043
S和f类似,是距离函数,GS是S在θn处的高斯-牛顿矩阵,μ(weighting constant)是一个预先确定好的正数。
Where G f is the Gauss-Newton matrix of f at θ n ; δ n is the update vector; the damping coefficient λ is obtained by the method of Levenburg-Marquardt style heuristics; the damping function
Figure PCTCN2016081842-appb-000042
Is the value of the function at θ n predetermined according to the training model, such as when performing RNN training,
Figure PCTCN2016081842-appb-000043
S is similar to f, which is a distance function, G S is the Gauss-Newton matrix of S at θ n , and μ (weighting constant) is a predetermined positive number.
步骤(4),数据处理模块根据相应的指令,进行有预条件的共轭梯度法求δn使得
Figure PCTCN2016081842-appb-000044
达到最小值,并把θn更新为θn+1的操作。更新操作如下:
Step (4), the data processing module performs a preconditioned conjugate gradient method to obtain δ n according to the corresponding instruction.
Figure PCTCN2016081842-appb-000044
The minimum value is reached and the operation of θ n is updated to θ n+1 . The update operation is as follows:
θn+1=θnnθ n+1 = θ n + δ n ;
值得一提的是,在进行用有预条件的共轭梯度法过程中,只用“mini-batch”而不是所有样本,并且其中涉及的高斯-牛顿矩阵乘向量 的运算都是通过
Figure PCTCN2016081842-appb-000045
做隐式近似计算(Pearlmutter的R{}-method)。这样既提高了大数据学习的效率或者说提高数据运算模块的运算效率,也避免了运算量随参数数量平方增长的情况。
It is worth mentioning that in the process of using the preconditioned conjugate gradient method, only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector involved in the operation is passed.
Figure PCTCN2016081842-appb-000045
Do an implicit approximation (Pearlmutter's R{}-method). This not only improves the efficiency of big data learning or improves the computational efficiency of the data operation module, but also avoids the case where the amount of computation increases with the square of the parameter number.
步骤(5),数据处理单元判断更新后的参数向量是否收敛,若收敛,运算结束,否则,转到步骤(2)处继续执行。In step (5), the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
根据本发明实施例实现的Hessian-Free训练算法的装置,可以用以支持使用Hessian-Free训练算法方面的应用。在数据缓存单元开辟一个空间存储误差函数在每一代待更新参数附近的二阶估计,在每次进行有预条件的共轭梯度法时,利用该二阶估计计算一个更新向量,然后进行对待更新向量的更新操作。重复进行上述步骤,直至待更新向量收敛。The apparatus of the Hessian-Free training algorithm implemented according to an embodiment of the present invention can be used to support applications using the Hessian-Free training algorithm. In the data buffer unit, a second-order estimation of a spatial storage error function near each of the parameters to be updated is opened. Each time a preconditioned conjugate gradient method is performed, an update vector is calculated by using the second-order estimation, and then updated. Vector update operation. Repeat the above steps until the vector to be updated converges.
下面结合附图的本发明的技术方案进行进一步阐述说明。The technical solutions of the present invention with reference to the accompanying drawings are further explained below.
图1示出了根据本发明一实施例的用于实现Hessian-Free训练算法的装置的整体结构示例框图。如图1所示,该装置包括数据访问单元1、指令缓存单元2、控制器单元3、数据缓存单元4和数据处理模块5,均可以通过硬件电路实现。1 shows an example block diagram of an overall structure of an apparatus for implementing a Hessian-Free training algorithm in accordance with an embodiment of the present invention. As shown in FIG. 1, the apparatus includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4, and a data processing module 5, all of which can be implemented by hardware circuits.
数据访问单元1能够访问外部地址空间,可以向装置内部各个缓存单元读写数据,完成数据的加载和存储。具体包括向指令缓存单元2读取指令,从指定存储单元之间读取待更新参数到数据处理单元5,将更新后的参数向量从数据处理模块5直接写入外部指定空间。The data access unit 1 can access the external address space, can read and write data to each cache unit inside the device, and complete data loading and storage. Specifically, the instruction is read from the instruction cache unit 2, the parameter to be updated is read from the specified storage unit to the data processing unit 5, and the updated parameter vector is directly written from the data processing module 5 to the external designated space.
指令缓存单元2通过数据访问单元1读取指令,并缓存读入的指令。The instruction cache unit 2 reads the instruction through the data access unit 1 and caches the read instruction.
控制器单元3从指令缓存单元2中读取指令,将指令译码为控制其他模块行为的微指令并将其发送给其他模块如数据访问单元1、数据缓存单元4、数据处理模块5等。The controller unit 3 reads the instructions from the instruction cache unit 2, decodes the instructions into micro-instructions that control the behavior of other modules, and transmits them to other modules such as the data access unit 1, the data buffer unit 4, the data processing module 5, and the like.
数据缓存单元4在装置初始化时初始化
Figure PCTCN2016081842-appb-000046
具体来说,是初始化其中的梯度
Figure PCTCN2016081842-appb-000047
高斯-牛顿矩阵Gf、阻尼系数λ和阻尼函数
Figure PCTCN2016081842-appb-000048
在第n次待更新参数向量θn的更新开始前,会将
Figure PCTCN2016081842-appb-000049
读出到数据处理模块5中。在数据处理模块5中得到更新向量δn,将θn更新为θn+1,对应的
Figure PCTCN2016081842-appb-000050
更新为
Figure PCTCN2016081842-appb-000051
然后将
Figure PCTCN2016081842-appb-000052
写入至数据缓存单元4(新的数据将以前的相对应数据覆盖),用于下次使用。
Data buffer unit 4 is initialized at device initialization
Figure PCTCN2016081842-appb-000046
Specifically, it is to initialize the gradient
Figure PCTCN2016081842-appb-000047
Gauss-Newtonian matrix G f , damping coefficient λ and damping function
Figure PCTCN2016081842-appb-000048
Before the update of the nth update parameter vector θ n starts,
Figure PCTCN2016081842-appb-000049
Read out into the data processing module 5. The update vector δ n is obtained in the data processing module 5, and θ n is updated to θ n+1 , corresponding
Figure PCTCN2016081842-appb-000050
Updated to
Figure PCTCN2016081842-appb-000051
followed by
Figure PCTCN2016081842-appb-000052
Write to data cache unit 4 (new data will be overwritten by the previous corresponding data) for the next use.
数据处理模块5从数据缓存单元4中读取
Figure PCTCN2016081842-appb-000053
通过数据访问单元1从外部指定空间中读取待更新参数向量θn。在模块内得到更新向量δn,将θn更新为θn+1,对应的
Figure PCTCN2016081842-appb-000054
更新为
Figure PCTCN2016081842-appb-000055
然后将
Figure PCTCN2016081842-appb-000056
写入至数据缓存单元4,将θn+1通过数据访问单元1写入到外部指定空间中。
The data processing module 5 reads from the data buffer unit 4
Figure PCTCN2016081842-appb-000053
The parameter vector θ n to be updated is read from the external designated space by the data access unit 1. The update vector δ n is obtained in the module, and θ n is updated to θ n+1 , corresponding
Figure PCTCN2016081842-appb-000054
Updated to
Figure PCTCN2016081842-appb-000055
followed by
Figure PCTCN2016081842-appb-000056
It is written to the data buffer unit 4, and θ n+1 is written into the external designated space through the data access unit 1.
图2示出了根据本发明实施例的用于实现Hessian-Free训练算法相关应用的装置中数据处理模块的示例框图。如图2所示,数据处理模块包括运算控制子模块51、梯度运算子模块52、阻尼项运算子模块53、高斯-牛顿矩阵运算子模块54、共轭梯度法运算子模块55以及基本运算子模块56。其中,基本运算子模块56进行的是矩阵、向量之间的加乘等基础运算,52、53、54、55子模块都会调用56子模块,并且根据情况,这些模块互相之间也允许互相调用。2 illustrates an example block diagram of a data processing module in an apparatus for implementing a Hessian-Free training algorithm related application in accordance with an embodiment of the present invention. As shown in FIG. 2, the data processing module includes an operation control sub-module 51, a gradient operation sub-module 52, a damping term operation sub-module 53, a Gauss-Newton matrix operation sub-module 54, a conjugate gradient method operation sub-module 55, and a basic operation unit. Module 56. Among them, the basic operation sub-module 56 performs basic operations such as matrix and vector multiplication, and the 52, 53, 54, 55 sub-modules will call 56 sub-modules, and depending on the situation, these modules are also allowed to call each other. .
图3示出了根据Hessian-Free训练算法进行相关运算的装置的总体流程图。Figure 3 shows a general flow diagram of an apparatus for performing correlation operations in accordance with the Hessian-Free training algorithm.
在步骤S1,在指令缓存单元2的首地址处预先存入一条IO指令。In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
在步骤S2,运算开始,控制单元3从指令缓存单元2的首地址读取该条IO指令,根据译出的微指令,数据访问单元1从外部地址空间读取所有与Hessian-Free计算有关的所有指令,并将其缓存入指令缓存单元2中。In step S2, the operation starts, the control unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads all the Hessian-Free calculations from the external address space. All instructions are buffered into the instruction cache unit 2.
在步骤S3,控制器单元3从指令缓存单元2读入一条IO指令,根据译出的微指令,数据访问单元1从外部空间读取初始待更新参数向量θ0到数据处理模块5中。In step S3, the controller unit 3 reads an IO instruction from the instruction buffer unit 2, and according to the translated microinstruction, the data access unit 1 reads the initial parameter vector θ 0 to be updated from the external space into the data processing module 5.
在步骤S4,控制器单元3从指令缓存单元2读入赋值指令,根据译出的微指令,数据缓存单元4中的
Figure PCTCN2016081842-appb-000057
初始化,数据处理单元5中的 迭代次数n被设置为0。
In step S4, the controller unit 3 reads the assignment instruction from the instruction cache unit 2, according to the translated microinstruction, in the data buffer unit 4.
Figure PCTCN2016081842-appb-000057
Initialization, the number n of iterations in the data processing unit 5 is set to zero.
在步骤S5,控制器单元3从指令缓存单元2读入一条IO指令,根据译出的微指令,数据访问单元1从外部空间读取待更新参数向量θn传入到数据处理模块5中。In step S5, the controller unit 3 reads an IO command from the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads the parameter vector θ n to be updated from the external space and passes it to the data processing module 5.
在步骤S6,控制器单元3从指令缓存单元2读入一条对误差函数在当前参数向量值附近进行二阶估计的指令,根据译出的微指令,进行求f(θ)在θn附近的二阶估计
Figure PCTCN2016081842-appb-000058
的操作。该操作中,指令被送至运算控制子模块51,运算控制子模块51发送相应指令进行以下操作:利用梯度运算子模块52计算
Figure PCTCN2016081842-appb-000059
利用高斯-牛顿运算子模块54和基本运算子模块56中的矩阵乘法运算,得到f在θn处的高斯-牛顿矩阵Gf;利用阻尼项运算子模块53和基本运算子模块56执行LM启发式方法得到阻尼系数λ,进而得到阻尼项
Figure PCTCN2016081842-appb-000060
最后,得到
Figure PCTCN2016081842-appb-000061
的表达式存入数据缓存单元4。
In step S6, the controller unit 3 reads, from the instruction buffer unit 2, an instruction for performing second-order estimation of the error function near the current parameter vector value, and based on the translated micro-instruction, performs f(θ) near θ n . Second order estimate
Figure PCTCN2016081842-appb-000058
Operation. In this operation, the instruction is sent to the arithmetic control sub-module 51, and the arithmetic control sub-module 51 sends the corresponding instruction to perform the following operations: using the gradient operation sub-module 52 to calculate
Figure PCTCN2016081842-appb-000059
Using the matrix multiplication operation in the Gauss-Newton operation sub-module 54 and the basic operation sub-module 56, the Gauss-Newton matrix Gf of f at θ n is obtained; the LM inspiration is performed by the damping term operation sub-module 53 and the basic operation sub-module 56. Method to obtain the damping coefficient λ, and then obtain the damping term
Figure PCTCN2016081842-appb-000060
Finally, get
Figure PCTCN2016081842-appb-000061
The expression is stored in the data cache unit 4.
在步骤S7,控制器单元3从指令缓存单元2读取一条数据传输指令,根据译出的微指令,将
Figure PCTCN2016081842-appb-000062
从数据缓存单元4传送到数据处理单元5中。
In step S7, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, according to the translated microinstruction,
Figure PCTCN2016081842-appb-000062
It is transferred from the data buffer unit 4 to the data processing unit 5.
在步骤S8,控制器单元3从指令缓存单元2读取一条参数更新运算指令,根据译出的微指令,进行用有预条件的共轭梯度法求δn使得
Figure PCTCN2016081842-appb-000063
达到最小值,并把θn更新为θn+1的操作。数据访问单元1从外部空间读取待更新参数向量θn传入到数据处理模块5中。运算控制子模块51控制相关运算模块进行如下操作:利用共轭梯度运算子模块55和基本运算子模块56得到更新向量δn。这其中,根据阻尼函数
Figure PCTCN2016081842-appb-000064
的表达式,也可能需要调用高斯牛顿-运算模块(比如之前提到的RNN的例子)。最后,利用基本运算子模块56中的向量加法将θn更新为θn+1
In step S8, the controller unit 3 reads a parameter update operation instruction from the instruction cache unit 2, and based on the translated micro-instruction, performs δ n using a preconditioned conjugate gradient method.
Figure PCTCN2016081842-appb-000063
The minimum value is reached and the operation of θ n is updated to θ n+1 . The data access unit 1 reads the parameter vector θ n to be updated from the external space and passes it to the data processing module 5. The operation control sub-module 51 controls the correlation operation module to perform an operation of obtaining the update vector δ n using the conjugate gradient operation sub-module 55 and the basic operation sub-module 56. Among them, according to the damping function
Figure PCTCN2016081842-appb-000064
The expression may also need to call the Gauss Newton-arithmetic module (such as the RNN example mentioned earlier). Finally, θ n is updated to θ n+1 using vector addition in the basic operation sub-module 56.
在步骤S9,控制器单元3从指令缓存单元2读取一条IO指令,根据译出的微指令,更新后的参数向量θn+1从数据处理单元5通过数据访问单元1传送至外部指定空间。In step S9, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction, the updated parameter vector θ n+1 is transmitted from the data processing unit 5 through the data access unit 1 to the external designated space. .
在步骤S10,控制器单元从指令缓存单元2读取一条收敛判断指令,根据译出的微指令,数据处理单元判断更新后的参数向量θn+1是否收敛: 若收敛,运算结束;否则,迭代次数n的值增长1,转回执行步骤S5。In step S10, the controller unit reads a convergence determination instruction from the instruction cache unit 2, and according to the translated microinstruction, the data processing unit determines whether the updated parameter vector θ n+1 converges: if convergence, the operation ends; otherwise, The value of the number of iterations n is incremented by 1, and the process returns to step S5.
前面的附图中所描绘的进程或方法可通过包括硬件(例如,电路、专用逻辑等)、固件、软件(例如,被承载在非瞬态计算机可读介质上的软件),或两者的组合的处理逻辑来执行。虽然上文按照某些顺序操作描述了进程或方法,但是,应该理解,所描述的某些操作能以不同顺序来执行。此外,可并行地而非顺序地执行一些操作。The processes or methods depicted in the preceding figures may be by hardware (eg, circuitry, dedicated logic, etc.), firmware, software (eg, software carried on a non-transitory computer readable medium), or both. The combined processing logic is executed. Although the processes or methods have been described above in some order, it should be understood that certain operations described can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
在前述的说明书中,参考其特定示例性实施例描述了本发明的各实施例。显然,可对各实施例做出各种修改,而不悖离所附权利要求所述的本发明的更广泛的精神和范围。相应地,说明书和附图应当被认为是说明性的,而不是限制性的。 In the foregoing specification, various embodiments of the invention have been described with reference It is apparent that various modifications may be made to the various embodiments without departing from the spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded as

Claims (10)

  1. 一种用于执行Hessian-Free训练算法的装置,其特征在于,包括:An apparatus for executing a Hessian-Free training algorithm, comprising:
    控制器单元,用于将读取的指令译码为控制相应模块的微指令,并将其发送给相应模块;a controller unit, configured to decode the read instruction into a micro instruction that controls the corresponding module, and send the instruction to the corresponding module;
    数据缓存单元,用于存储运算过程中的中间变量,并对所述中间变量执行初始化及更新操作;a data buffer unit, configured to store intermediate variables in the operation process, and perform initialization and update operations on the intermediate variables;
    数据处理模块,用于在所述控制器单元的控制下执行运算操作,并将中间变量存储于所述数据缓存单元中。And a data processing module, configured to perform an operation operation under the control of the controller unit, and store the intermediate variable in the data cache unit.
  2. 如权利要求1所述的用于执行Hessian-Free训练算法的装置,其特征在于,所述数据处理模块包括运算控制子模块、梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块及基本运算子模块;其中,所述基本运算子模块进行的是矩阵、向量之间的加、减、乘和除等基础运算;The apparatus for executing a Hessian-Free training algorithm according to claim 1, wherein the data processing module comprises an operation control sub-module, a gradient operation sub-module, a damping term operation sub-module, and a Gauss-Newton matrix operator. a module, a conjugate gradient method operation sub-module and a basic operation sub-module; wherein the basic operation sub-module performs basic operations such as addition, subtraction, multiplication and division between a matrix and a vector;
    作为优选,所述梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块均能够调用所述基本运算子模块,且根据情况所述梯度运算子模块、阻尼项运算子模块、高斯-牛顿矩阵运算子模块、共轭梯度法运算子模块之间允许互相调用。Preferably, the gradient operation submodule, the damping term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate gradient method operation submodule are all capable of calling the basic operation submodule, and the gradient operation submodule according to the situation The damping term operation sub-module, the Gauss-Newton matrix operation sub-module, and the conjugate gradient method operation sub-module are allowed to be called each other.
  3. 如权利要求1所述的用于执行Hessian-Free训练算法的装置,其特征在于,所述数据缓存单元在装置初始化时初始化f(θ)的二阶估计
    Figure PCTCN2016081842-appb-100001
    在第n次待更新参数向量θn的更新开始前,将
    Figure PCTCN2016081842-appb-100002
    读出到数据处理模块中,并在所述数据处理模块中得到更新向量后将
    Figure PCTCN2016081842-appb-100003
    再次写入;其中,θ为待更新参数向量,θn为第n次待更新参数向量,f(θ)为误差函数,即衡量结果的实际值与预测值偏离的函数;δn是更新向量,且θn+1=θnn
    The apparatus for executing a Hessian-Free training algorithm according to claim 1, wherein said data buffer unit initializes a second-order estimate of f(θ) at device initialization.
    Figure PCTCN2016081842-appb-100001
    Before the update of the nth update parameter vector θ n is started,
    Figure PCTCN2016081842-appb-100002
    Read out into the data processing module and get the update vector in the data processing module
    Figure PCTCN2016081842-appb-100003
    Write again; where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the function that deviates from the actual value of the measured result; δ n is the update vector And θ n+1 = θ n + δ n .
  4. 如权利要求3所述的用于执行Hessian-Free训练算法的装置,其特征在于,所述数据缓存单元在初始化
    Figure PCTCN2016081842-appb-100004
    的步骤中,初始化其中的梯度
    Figure PCTCN2016081842-appb-100005
    高斯-牛顿矩阵Gf、阻尼系数λ和阻尼函数
    Figure PCTCN2016081842-appb-100006
    其中,
    Figure PCTCN2016081842-appb-100007
    所述梯度
    Figure PCTCN2016081842-appb-100008
    指f在θn处的梯度值,Gf是f在θn处的高斯-牛顿矩阵;阻尼函数
    Figure PCTCN2016081842-appb-100009
    是根据训练模型预先确定好的函数在θn处的值;阻尼系数λ通过LM式启发式方法求得;
    The apparatus for executing a Hessian-Free training algorithm according to claim 3, wherein said data buffer unit is initialized
    Figure PCTCN2016081842-appb-100004
    In the step of initializing the gradient
    Figure PCTCN2016081842-appb-100005
    Gauss-Newtonian matrix G f , damping coefficient λ and damping function
    Figure PCTCN2016081842-appb-100006
    among them,
    Figure PCTCN2016081842-appb-100007
    The gradient
    Figure PCTCN2016081842-appb-100008
    F refers to the value of the gradient at θ n, G f is the f θ n at the Gauss - Newton matrix; damping function
    Figure PCTCN2016081842-appb-100009
    The value of the function at θ n is determined according to the training model; the damping coefficient λ is obtained by the LM-type heuristic method;
    所述数据处理模块从所述数据缓存单元中读取
    Figure PCTCN2016081842-appb-100010
    从外部指定空间中读取待更新参数向量θn;在模块内得到更新向量δn,将θn更新为θn+1,对应的
    Figure PCTCN2016081842-appb-100011
    更新为
    Figure PCTCN2016081842-appb-100012
    然后将
    Figure PCTCN2016081842-appb-100013
    写入至所述数据缓存单元,将θn+1写入到外部指定空间中;其中,θn+1为第n+1次待更新参数向量,
    Figure PCTCN2016081842-appb-100014
    为f(θ+1)的二阶估计。
    The data processing module reads from the data cache unit
    Figure PCTCN2016081842-appb-100010
    Reading the parameter vector θ n to be updated from the external specified space; obtaining the update vector δ n in the module, updating θ n to θ n+1 , corresponding
    Figure PCTCN2016081842-appb-100011
    Updated to
    Figure PCTCN2016081842-appb-100012
    followed by
    Figure PCTCN2016081842-appb-100013
    Writing to the data buffer unit, writing θ n+1 into the external designated space; wherein θ n+1 is the n+1th parameter vector to be updated,
    Figure PCTCN2016081842-appb-100014
    Is a second-order estimate of f(θ+1).
  5. 一种执行Hessian-Free训练算法的方法,其特征在于,包括以下步骤:A method for performing a Hessian-Free training algorithm, comprising the steps of:
    步骤(1),通过指令,完成数据缓存单元的初始化操作,即初始化f(θ)的二阶估计
    Figure PCTCN2016081842-appb-100015
    其中,θ为待更新参数向量,θn为第n次待更新参数向量,f(θ)为误差函数,即衡量结果的实际值与预测值偏离的函数;δn是更新向量,且θn+1=θnn
    Step (1), through the instruction, complete the initialization operation of the data buffer unit, that is, initialize the second-order estimation of f(θ)
    Figure PCTCN2016081842-appb-100015
    Where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, f(θ) is the error function, that is, the function that deviates from the actual value of the measured result; δ n is the update vector, and θ n +1 = θ n + δ n ;
    步骤(2),通过IO指令,完成数据访问单元从外部空间读取待更新参数向量的操作;Step (2), through the IO instruction, completing an operation of the data access unit reading the parameter vector to be updated from the external space;
    步骤(3),数据处理模块根据相应指令,在θn处对误差函数f(θ)进行二阶泰勒展开,并添加阻尼项
    Figure PCTCN2016081842-appb-100016
    得到f(θ)在θn附近的估计
    Figure PCTCN2016081842-appb-100017
    Step (3), the data processing module performs a second-order Taylor expansion on the error function f(θ) at θ n according to the corresponding instruction, and adds a damping term
    Figure PCTCN2016081842-appb-100016
    Obtain an estimate of f(θ) around θ n
    Figure PCTCN2016081842-appb-100017
    which is
    Figure PCTCN2016081842-appb-100018
    Figure PCTCN2016081842-appb-100018
    其中,Gf是f在θn处的高斯-牛顿矩阵;阻尼系数λ通过LM式启发式方法求得;阻尼函数
    Figure PCTCN2016081842-appb-100019
    是根据训练模型预先确定好的函数在θn处的值;
    Where G f is a Gauss-Newton matrix of f at θ n ; the damping coefficient λ is obtained by the LM-like heuristic method; the damping function
    Figure PCTCN2016081842-appb-100019
    Is a value of θ n predetermined by a predetermined function according to the training model;
    步骤(4),数据处理模块根据相应的指令,进行有预条件的共轭梯度法求δn使得
    Figure PCTCN2016081842-appb-100020
    达到最小值,并把θn更新为θn+1,具体更新操作为:
    Step (4), the data processing module performs a preconditioned conjugate gradient method to obtain δ n according to the corresponding instruction.
    Figure PCTCN2016081842-appb-100020
    The minimum value is reached and θ n is updated to θ n+1 . The specific update operation is:
    θn+1=θnnθ n+1 = θ n + δ n ;
    步骤(5),数据处理单元判断更新后的参数向量是否收敛,若收敛,运算结束,否则,转到步骤(2)处继续执行。In step (5), the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
  6. 如权利要求5所述的执行Hessian-Free训练算法的方法,其特征在于,步骤(1)中所述完成数据缓存单元的初始化操作的步骤包括:对梯度
    Figure PCTCN2016081842-appb-100021
    高斯-牛顿矩阵Gf、阻尼系数λ和阻尼函数
    Figure PCTCN2016081842-appb-100022
    进行置零操作。
    The method for performing a Hessian-Free training algorithm according to claim 5, wherein the step of completing the initialization operation of the data buffer unit in the step (1) comprises:
    Figure PCTCN2016081842-appb-100021
    Gauss-Newtonian matrix G f , damping coefficient λ and damping function
    Figure PCTCN2016081842-appb-100022
    Perform a zeroing operation.
  7. 如权利要求5所述的执行Hessian-Free训练算法的方法,其特征在于,步骤(3)中当进行RNN训练时,The method for performing a Hessian-Free training algorithm according to claim 5, wherein when performing RNN training in step (3),
    阻尼函数
    Figure PCTCN2016081842-appb-100023
    Damping function
    Figure PCTCN2016081842-appb-100023
    其中S和f均是距离函数,GS是S在θn处的高斯-牛顿矩阵,μ是一个预先确定好的正数。Where S and f are both distance functions, G S is the Gauss-Newton matrix of S at θ n , and μ is a predetermined positive number.
  8. 如权利要求5所述的执行Hessian-Free训练算法的方法,其特征在于,步骤(4)所述进行有预条件的共轭梯度法求δn使得
    Figure PCTCN2016081842-appb-100024
    达到最小值的步骤中,在进行有预条件的共轭梯度法过程中,只用“mini-batch”而不是所有样本,且其中涉及的高斯-牛顿矩阵乘向量的运算都是通过
    Figure PCTCN2016081842-appb-100025
    做隐式近似计算。
    A method of executing a Hessian-Free training algorithm according to claim 5, wherein said pre-conditional conjugate gradient method is performed to obtain δ n as described in step (4).
    Figure PCTCN2016081842-appb-100024
    In the step of reaching the minimum value, in the pre-conditioned conjugate gradient method, only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector involved in the operation is passed.
    Figure PCTCN2016081842-appb-100025
    Do implicit approximation calculations.
  9. 一种执行Hessian-Free训练算法的方法,其特征在于,包括以下步骤:A method for performing a Hessian-Free training algorithm, comprising the steps of:
    步骤S1,在指令缓存单元的首地址处预先存入一条IO指令。In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.
    步骤S2,运算开始,控制器单元从指令缓存单元的首地址读取该条IO指令,根据译出的微指令,数据访问单元从外部地址空间读取所有与Hessian-Free计算有关的所有指令,并将其缓存入指令缓存单元中;Step S2, the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all the instructions related to the Hessian-Free calculation from the external address space. And cache it into the instruction cache unit;
    步骤S3,控制器单元从指令缓存单元读入一条IO指令,根据译出的微指令,数据访问单元从外部空间读取初始待更新参数向量θ0到数据处理模块中; Step S3, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the initial parameter vector θ 0 to be updated from the external space into the data processing module;
    步骤S4,控制器单元从指令缓存单元读入赋值指令,根据译出的微指令,数据缓存单元中的
    Figure PCTCN2016081842-appb-100026
    初始化,数据处理单元中的迭代次数n被设置为0;其中,θ为待更新参数向量,θn为第n次待更新参数向量,f(θ)为误差函数,即衡量结果的实际值与预测值偏离的函数;δn是更新向量,且θn+1=θnn
    Figure PCTCN2016081842-appb-100027
    为f(θ)的二阶估计;
    Step S4, the controller unit reads the assignment instruction from the instruction cache unit, according to the translated microinstruction, in the data cache unit
    Figure PCTCN2016081842-appb-100026
    Initialization, the number n of iterations in the data processing unit is set to 0; where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the actual value of the measurement result and a function of the predicted value deviation; δ n is the update vector, and θ n+1 = θ n + δ n ;
    Figure PCTCN2016081842-appb-100027
    a second order estimate of f(θ);
    步骤S5,控制器单元从指令缓存单元读入一条IO指令,根据译出的微指令,数据访问单元从外部空间读取待更新参数向量θn传入到数据处理模块中;Step S5, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the parameter vector θ n to be updated from the external space and transmits it to the data processing module;
    步骤S6,控制器单元从指令缓存单元读入一条对误差函数在当前参数向量值附近进行二阶估计的指令,根据译出的微指令,进行求f(θ)在θn附近的二阶估计
    Figure PCTCN2016081842-appb-100028
    的操作;该操作中,指令被送至运算控制子模块,运算控制子模块发送相应指令进行以下操作:利用梯度运算子模块计算
    Figure PCTCN2016081842-appb-100029
    利用高斯-牛顿运算子模块和基本运算子模块中的矩阵乘法运算,得到f在θn处的高斯-牛顿矩阵Gf;利用阻尼项运算子模块和基本运算子模块执行LM启发式方法得到阻尼系数λ,进而得到阻尼项
    Figure PCTCN2016081842-appb-100030
    最后,通过
    Step S6, the controller unit reads, from the instruction buffer unit, an instruction for performing second-order estimation of the error function near the current parameter vector value, and performs a second-order estimation of f(θ) near θ n according to the translated micro-instruction.
    Figure PCTCN2016081842-appb-100028
    Operation; in this operation, the instruction is sent to the operation control sub-module, and the operation control sub-module sends the corresponding instruction to perform the following operations: calculation using the gradient operation sub-module
    Figure PCTCN2016081842-appb-100029
    Using the Gauss-Newton submodule and the matrix multiplication in the basic operation submodule, the Gauss-Newton matrix G f of f at θ n is obtained; the LM heuristic method is used to obtain the damping by the damping term operation submodule and the basic operation submodule. Coefficient λ, which in turn gives the damping term
    Figure PCTCN2016081842-appb-100030
    Finally, pass
    Figure PCTCN2016081842-appb-100031
    得到
    Figure PCTCN2016081842-appb-100032
    的表达式存入数据缓存单元;其中,阻尼函数
    Figure PCTCN2016081842-appb-100033
    是根据训练模型预先确定好的函数在θn处的值;
    Figure PCTCN2016081842-appb-100031
    get
    Figure PCTCN2016081842-appb-100032
    The expression is stored in the data cache unit; where the damping function
    Figure PCTCN2016081842-appb-100033
    Is a value of θ n predetermined by a predetermined function according to the training model;
    步骤S7,控制器单元从指令缓存单元读取一条数据传输指令,根据译出的微指令,将
    Figure PCTCN2016081842-appb-100034
    从数据缓存单元传送到数据处理单元中;
    Step S7, the controller unit reads a data transfer instruction from the instruction cache unit, according to the translated microinstruction,
    Figure PCTCN2016081842-appb-100034
    Transfer from the data buffer unit to the data processing unit;
    步骤S8,控制器单元从指令缓存单元读取一条参数更新运算指令,根据译出的微指令,进行用有预条件的共轭梯度法求δn使得
    Figure PCTCN2016081842-appb-100035
    达到最小值,并把θn更新为θn+1的操作;数据访问单元从外部空间读取待更新参数向量θn传入到数据处理模块中;运算控制子模块控制相关运算模 块进行如下操作:利用共轭梯度运算子模块和基本运算子模块得到更新向量δn;最后,利用基本运算子模块中的向量加法将θn更新为θn+1
    Step S8, the controller unit reads a parameter update operation instruction from the instruction cache unit, and performs δ n using a preconditioned conjugate gradient method according to the translated micro instruction.
    Figure PCTCN2016081842-appb-100035
    The minimum value is reached, and θ n is updated to θ n+1 ; the data access unit reads the parameter vector θ n to be updated from the external space and passes it to the data processing module; the operation control sub-module controls the related operation module to perform the following operations; The update vector δ n is obtained by using the conjugate gradient operation sub-module and the basic operation sub-module; finally, θ n is updated to θ n+1 by vector addition in the basic operation sub-module;
    步骤S9,控制器单元从指令缓存单元读取一条IO指令,根据译出的微指令,更新后的参数向量θn+1从数据处理单元通过数据访问单元传送至外部指定空间;Step S9, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the updated parameter vector θ n+1 is transmitted from the data processing unit to the external designated space through the data access unit;
    步骤S10,控制器单元从指令缓存单元读取一条收敛判断指令,根据译出的微指令,数据处理单元判断更新后的参数向量θn+1是否收敛:若收敛,运算结束;否则,迭代次数n的值增长1,转回执行步骤S5。Step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction, the data processing unit determines whether the updated parameter vector θ n+1 converges: if convergence, the operation ends; otherwise, the number of iterations The value of n is incremented by 1, and the process returns to step S5.
  10. 一种执行Hessian-Free训练算法的装置,所述装置的控制器中固化有执行如权利要求5至9任意一项所述的执行Hessian-Free训练算法的方法的程序。 An apparatus for executing a Hessian-Free training algorithm in which a program for performing the Hessian-Free training algorithm according to any one of claims 5 to 9 is solidified.
PCT/CN2016/081842 2016-04-29 2016-05-12 Device and method for executing hessian-free training algorithm WO2017185413A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610283885.XA CN107341540B (en) 2016-04-29 2016-04-29 Device and method for executing Hessian-Free training algorithm
CN201610283885.X 2016-04-29

Publications (1)

Publication Number Publication Date
WO2017185413A1 true WO2017185413A1 (en) 2017-11-02

Family

ID=60160584

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081842 WO2017185413A1 (en) 2016-04-29 2016-05-12 Device and method for executing hessian-free training algorithm

Country Status (2)

Country Link
CN (1) CN107341540B (en)
WO (1) WO2017185413A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934208A (en) * 2019-04-22 2019-06-25 江苏邦融微电子有限公司 A kind of hardware-accelerated system and method for fingerprint recognition

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626434B (en) * 2020-05-15 2022-06-07 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1658550A (en) * 2004-04-16 2005-08-24 威盛电子股份有限公司 Apparatus and method for performing cipher operation
CN1834898A (en) * 2005-05-16 2006-09-20 威盛电子股份有限公司 Microprocessor apparatus and method for modular exponentiation
US20140067738A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization
WO2016037351A1 (en) * 2014-09-12 2016-03-17 Microsoft Corporation Computing system for training neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
US9601109B2 (en) * 2013-12-06 2017-03-21 International Business Machines Corporation Systems and methods for accelerating hessian-free optimization for deep neural networks by implicit preconditioning and sampling

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1658550A (en) * 2004-04-16 2005-08-24 威盛电子股份有限公司 Apparatus and method for performing cipher operation
CN1834898A (en) * 2005-05-16 2006-09-20 威盛电子股份有限公司 Microprocessor apparatus and method for modular exponentiation
US20140067738A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization
WO2016037351A1 (en) * 2014-09-12 2016-03-17 Microsoft Corporation Computing system for training neural networks

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934208A (en) * 2019-04-22 2019-06-25 江苏邦融微电子有限公司 A kind of hardware-accelerated system and method for fingerprint recognition

Also Published As

Publication number Publication date
CN107341540A (en) 2017-11-10
CN107341540B (en) 2021-07-20

Similar Documents

Publication Publication Date Title
WO2017124641A1 (en) Device and method for executing reversal training of artificial neural network
KR102331978B1 (en) Device and method for executing forward calculation of artificial neural network
CN110929863B (en) Apparatus and method for performing LSTM operations
CN111260025B (en) Apparatus and method for performing LSTM neural network operation
CN111860812B (en) Apparatus and method for performing convolutional neural network training
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
WO2017185389A1 (en) Device and method for use in executing matrix multiplication operations
WO2017124647A1 (en) Matrix calculation apparatus
WO2017124648A1 (en) Vector computing device
WO2017185411A1 (en) Apparatus and method for executing adagrad gradient descent training algorithm
WO2018024093A1 (en) Operation unit, method and device capable of supporting operation data of different bit widths
WO2017185336A1 (en) Apparatus and method for executing pooling operation
US10831861B2 (en) Apparatus and methods for vector operations
CN111027690B (en) Combined processing device, chip and method for performing deterministic reasoning
WO2018113790A1 (en) Operation apparatus and method for artificial neural network
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
WO2017185413A1 (en) Device and method for executing hessian-free training algorithm
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
CN107315570B (en) Device and method for executing Adam gradient descent training algorithm
CN107315569B (en) Device and method for executing RMSprop gradient descent algorithm

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899924

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16899924

Country of ref document: EP

Kind code of ref document: A1