CN107341540A - A kind of apparatus and method for performing Hessian-Free training algorithms - Google Patents

A kind of apparatus and method for performing Hessian-Free training algorithms

Info

Publication number
CN107341540A
CN107341540A (application number CN201610283885.XA; granted as CN107341540B)
Authority
CN
China
Prior art keywords
theta
instruction
updated
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610283885.XA
Other languages
Chinese (zh)
Other versions
CN107341540B (en)
Inventor
张士锦
郭崎
陈天石
陈云霁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Cambrian Technology Co Ltd
Original Assignee
Beijing Zhongke Cambrian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Cambrian Technology Co Ltd filed Critical Beijing Zhongke Cambrian Technology Co Ltd
Priority to CN201610283885.XA priority Critical patent/CN107341540B/en
Priority to PCT/CN2016/081842 priority patent/WO2017185413A1/en
Publication of CN107341540A publication Critical patent/CN107341540A/en
Application granted granted Critical
Publication of CN107341540B publication Critical patent/CN107341540B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)

Abstract

An apparatus and method for performing a Hessian-Free training algorithm. The apparatus comprises a direct memory access unit, a controller unit, a data processing module and a data cache module. The Hessian-Free training algorithm can be implemented with the apparatus to complete the training of various neural networks, such as auto-encoders, convolutional neural networks and recurrent neural networks (RNNs). In each iteration, a second-order Taylor expansion of the error function (objective function) is taken and a damping term is added, serving as an estimate of the objective function; then, from the current gradient, Gauss-Newton matrix, damping function and damping coefficient, an update vector is obtained by the preconditioned conjugate gradient method and the parameters to be updated are updated. Iteration continues until the parameter vector to be updated converges.

Description

Device and method for executing Hessian-Free training algorithm
Technical Field
The invention relates to the technical field of neural network operation, in particular to a device and a method for executing a Hessian-Free training algorithm.
Background
The gradient descent method is widely used in function approximation, optimization, pattern recognition, image processing and other fields. At present the mainstream training method for neural networks is gradient descent (combined with the back-propagation algorithm), but this method ignores the curvature information of the error function, so parameter updates easily become too small to converge to a local optimum, and error functions with "pathological curvature" (such as the Rosenbrock function) are handled poorly. The Hessian-Free training algorithm solves this problem well and, with some refinements of detail, keeps the amount of computation from growing quadratically with the number of parameters (it grows linearly, as with the gradient descent method).
Currently, one known way to execute the Hessian-Free training algorithm is to use a general-purpose processor, which supports the algorithm by executing general instructions with a general register file and general functional units. One disadvantage of this approach is the low computational performance of a single general-purpose processor. When multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, a general-purpose processor must decode the operations of the Hessian-Free training algorithm into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs a large power overhead.
Another known way to execute the Hessian-Free training algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose SIMD instructions with a general register file and general stream processing units. Because the GPU is a device dedicated to graphics and scientific computation, it has no special support for the operations of the Hessian-Free training algorithm, and a large amount of front-end decoding is still required to perform them, bringing considerable extra overhead. Moreover, the GPU has only a small on-chip cache, and the data needed during computation (such as the Gauss-Newton matrix) must be repeatedly transferred from off-chip; off-chip bandwidth becomes the main performance bottleneck and brings a huge power overhead.
Disclosure of Invention
In view of the above, it is an object of the present invention to provide an apparatus and method for performing a Hessian-Free training algorithm, which aim to address at least one of the above technical problems.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing a Hessian-Free training algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
the data cache unit is used for storing intermediate variables in the operation process and executing initialization and updating operations on the intermediate variables;
and the data processing module is used for executing operation under the control of the controller unit and storing the intermediate variable in the data cache unit.
The data processing module comprises an operation control submodule, a gradient operation submodule, a damping term operation submodule, a Gauss-Newton matrix operation submodule, a conjugate gradient method operation submodule and a basic operation submodule; the basic operation submodule performs basic addition and multiplication operations between matrices and vectors;
preferably, the gradient operation submodule, the damping term operation submodule, the gauss-newton matrix operation submodule and the conjugate gradient method operation submodule can call the basic operation submodule, and the gradient operation submodule, the damping term operation submodule, the gauss-newton matrix operation submodule and the conjugate gradient method operation submodule are allowed to call each other according to the situation.
Wherein the data cache unit initializes the second-order estimate f̃(δ_n) of f(θ) at device initialization, reads f̃(δ_n) out to the data processing module before the n-th update of the parameter vector θ_n to be updated begins, and writes the updated estimate back after the update vector has been obtained in the data processing module; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, i.e. a function of the deviation between the actual and predicted values of the measured result, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n.
Wherein, when the data cache unit initializes f̃(δ_n), it initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ) contained therein; wherein the gradient ∇f(θ_n) denotes the gradient of f at θ_n, and G_f is the Gauss-Newton matrix of f at θ_n; the damping function R_{θ_n}(θ) is the value taken around θ_n by a function predetermined according to the training model; the damping coefficient λ is obtained by a Levenberg-Marquardt (LM) style heuristic;
the data processing module reads f̃(δ_n) from the data cache unit and reads the parameter vector θ_n to be updated from an externally designated space; the update vector δ_n is obtained within the module, θ_n is updated to θ_{n+1}, and f̃(δ_n) is correspondingly updated to f̃(δ_{n+1}); f̃(δ_{n+1}) is then written back to the data cache unit, and θ_{n+1} is written to the externally designated space; wherein θ_{n+1} is the parameter vector to be updated at the (n+1)-th iteration and f̃(δ_{n+1}) is the second-order estimate of f(θ) around θ_{n+1}.
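For illustration only, the damped second-order estimate that the data cache unit holds can be written as a small Python/NumPy sketch: given the gradient ∇f(θ_n), a routine that applies the Gauss-Newton matrix G_f to a vector, the damping coefficient λ and a damping function R, it evaluates f̃(δ). The function and parameter names are invented for the illustration and do not describe a software interface of the apparatus.

```python
import numpy as np

def damped_quadratic_model(f_theta_n, grad, gf_matvec, lam, damping_fn):
    """Return a callable evaluating the damped second-order estimate
    f~(delta) = f(theta_n) + grad^T delta + delta^T G_f delta / 2 + lam * R(delta).

    grad       : gradient of the error function f at theta_n
    gf_matvec  : function delta -> G_f @ delta (Gauss-Newton matrix product)
    lam        : damping coefficient (from the LM-style heuristic)
    damping_fn : damping function R, chosen according to the training model
    """
    def model(delta):
        return (f_theta_n
                + grad @ delta
                + 0.5 * delta @ gf_matvec(delta)
                + lam * damping_fn(delta))
    return model

# Illustrative use with a tiny explicit Gauss-Newton matrix:
G = np.array([[2.0, 0.3], [0.3, 1.0]])
model = damped_quadratic_model(
    f_theta_n=1.25,
    grad=np.array([0.4, -0.2]),
    gf_matvec=lambda d: G @ d,
    lam=1e-2,
    damping_fn=lambda d: 0.5 * d @ d,   # simple Tikhonov-style damping, assumed
)
print(model(np.array([0.1, 0.05])))
```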
As another aspect of the present invention, the present invention also provides a method of performing a Hessian-Free training algorithm, comprising the steps of:
step (1): the initialization of the data cache unit is completed through an instruction, namely the second-order estimate f̃(δ_n) of f(θ) is initialized; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, i.e. a function of the deviation between the actual and predicted values of the measured result, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n;
Step (2), completing the operation of reading the parameter vector to be updated from the external space by the direct memory access unit through an IO instruction;
step (3): according to the corresponding instruction, the data processing module performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λR_{θ_n}(θ) to obtain an estimate f̃(δ_n) of f(θ) near θ_n, namely
f̃(δ_n) = M_{θ_n}(θ) + λR_{θ_n}(θ) = f(θ_n) + ∇f(θ_n)^T δ_n + δ_n^T G_f δ_n / 2 + λR_{θ_n}(θ);
wherein G_f is the Gauss-Newton matrix of f at θ_n; the damping coefficient λ is obtained by an LM-style heuristic; the damping function R_{θ_n}(θ) is the value taken around θ_n by a function predetermined according to the training model;
step (4): according to the corresponding instruction, the data processing module solves for δ_n by the preconditioned conjugate gradient method so that f̃(δ_n) reaches a minimum, and updates θ_n to θ_{n+1}; the specific update operation is:
θ_{n+1} = θ_n + δ_n;
and step (5): the data processing unit judges whether the updated parameter vector has converged; if so, the operation ends, otherwise the process returns to step (2) and continues.
Wherein, completing the initialization of the data cache unit in step (1) comprises: setting the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ) to zero.
Wherein, when RNN training is performed in step (3), the damping function R_{θ_n}(θ) is built from the Gauss-Newton matrix G_S of S at θ_n weighted by μ, where S and f are both distance functions, G_S is the Gauss-Newton matrix of S at θ_n, and μ is a predetermined positive number.
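The expression of this RNN damping function is not reproduced in the text above; only S, G_S and μ are named. As a hedged illustration, the structural-damping form common in the Hessian-Free literature is one plausible reading, sketched below in Python/NumPy; the quadratic form itself is an assumption, not quoted from the patent.

```python
import numpy as np

def structural_damping(gs_matvec, mu):
    """Assumed structural-damping term R(delta) = mu * delta^T G_S delta / 2,
    with G_S the Gauss-Newton matrix of the distance function S at theta_n
    and mu a predetermined positive weighting constant."""
    def R(delta):
        return 0.5 * mu * delta @ gs_matvec(delta)
    return R

# Example with a small explicit G_S:
G_S = np.array([[1.0, 0.2], [0.2, 0.5]])
R = structural_damping(lambda d: G_S @ d, mu=0.3)
print(R(np.array([0.1, -0.2])))
```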
Wherein, in the step of solving for δ_n by the preconditioned conjugate gradient method in step (4) so that f̃(δ_n) reaches a minimum, only a mini-batch rather than all samples is used while the preconditioned conjugate gradient method runs, and the Gauss-Newton matrix-vector products involved are computed implicitly and approximately rather than by forming the Gauss-Newton matrix explicitly.
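A minimal sketch of the preconditioned conjugate gradient solve of step (4), in Python/NumPy: it minimises the damped quadratic model by solving the linear system whose matrix is applied only through a matrix-vector product closure (in the device this product would be evaluated implicitly over a mini-batch). The Jacobi preconditioner, the iteration limit and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def preconditioned_cg(curvature_matvec, grad, precond_solve,
                      max_iters=250, tol=1e-10):
    """Solve (damped curvature) @ delta = -grad by preconditioned CG.
    curvature_matvec : v -> (G_f + damping) @ v, supplied as a closure so the
                       matrix never has to be formed explicitly
    precond_solve    : r -> M^{-1} r for a preconditioner M
    """
    delta = np.zeros_like(grad)
    r = -grad - curvature_matvec(delta)   # residual; equals -grad at delta = 0
    z = precond_solve(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iters):
        Ap = curvature_matvec(p)
        alpha = rz / (p @ Ap)
        delta += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = precond_solve(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return delta

# Illustration: damped curvature A = G_f + lam * I with a Jacobi preconditioner.
G_f = np.array([[4.0, 1.0], [1.0, 3.0]])
lam = 0.1
grad = np.array([1.0, -2.0])
diag = np.diag(G_f) + lam
delta_n = preconditioned_cg(lambda v: G_f @ v + lam * v, grad,
                            lambda r: r / diag)
print(delta_n)   # update vector delta_n; theta_{n+1} = theta_n + delta_n
```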
As a further aspect of the present invention, there is also provided a method of performing a Hessian-Free training algorithm, characterized by comprising the steps of:
in step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.
Step S2, when the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated micro instruction, the direct memory access unit reads all instructions related to Hessian-Free calculation from the external address space and caches the instructions into the instruction cache unit;
step S3: the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the direct memory access unit reads the initial parameter vector θ_0 to be updated from the external space into the data processing module;
step S4: the controller unit reads an assignment instruction from the instruction cache unit, and according to the translated microinstruction the data cache unit initializes f̃(δ_n), and the iteration count n in the data processing unit is set to 0; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, i.e. a function of the deviation between the actual and predicted values of the measured result, δ_n is the update vector, θ_{n+1} = θ_n + δ_n, and f̃(δ_n) is the second-order estimate of f(θ);
step S5: the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the direct memory access unit reads the parameter vector θ_n to be updated from the external space into the data processing module;
step S6: the controller unit reads, from the instruction cache unit, an instruction for second-order estimation of the error function near the current parameter vector value, and according to the translated microinstruction performs the operation of computing the second-order estimate f̃(δ_n) of f(θ) near θ_n; in this operation the instruction is sent to the operation control submodule, and the operation control submodule issues corresponding instructions to perform the following operations: compute ∇f(θ_n) with the gradient operation submodule; obtain the Gauss-Newton matrix G_f of f at θ_n using the Gauss-Newton operation submodule and the matrix multiplication of the basic operation submodule; execute the LM heuristic with the damping term operation submodule and the basic operation submodule to obtain the damping coefficient λ and hence the damping term λR_{θ_n}(θ); finally obtain from f(θ_n) + ∇f(θ_n)^T δ_n + δ_n^T G_f δ_n / 2 + λR_{θ_n}(θ) the expression of f̃(δ_n), which is stored in the data cache unit; wherein the damping function R_{θ_n}(θ) is the value taken around θ_n by a function predetermined according to the training model;
step S7: the controller unit reads a data transfer instruction from the instruction cache unit, and according to the translated microinstruction f̃(δ_n) is transferred from the data cache unit to the data processing unit;
step S8: the controller unit reads a parameter-update operation instruction from the instruction cache unit, and according to the translated microinstruction performs the operation of solving for δ_n by the preconditioned conjugate gradient method so that f̃(δ_n) reaches a minimum and of updating θ_n to θ_{n+1}; the direct memory access unit reads the parameter vector θ_n to be updated from the external space into the data processing module; the operation control submodule controls the relevant operation modules to perform the following operations: obtain the update vector δ_n using the conjugate gradient operation submodule and the basic operation submodule; finally update θ_n to θ_{n+1} by vector addition in the basic operation submodule;
step S9: the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the updated parameter vector θ_{n+1} is transferred from the data processing unit to the externally designated space through the direct memory access unit;
step S10: the controller unit reads a convergence-judgment instruction from the instruction cache unit, and according to the translated microinstruction the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if it has, the operation ends; otherwise the iteration count n is incremented by 1 and the process returns to step S5.
As yet another aspect of the present invention, there is also provided a device for performing the Hessian-Free training algorithm, having a program fixed in its controller for performing the method of executing the Hessian-Free training algorithm described above.
Based on the above technical scheme, the apparatus and method of the invention have the following beneficial effects: the apparatus can implement the Hessian-Free training algorithm and complete the training of various neural networks, such as auto-encoders, convolutional neural networks and recurrent neural networks (RNNs); by adopting dedicated equipment for executing the Hessian-Free training algorithm, the problems of insufficient computational performance of general-purpose processors and the high cost of front-end decoding can be solved, and the execution of related applications is accelerated; at the same time, the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth required.
Drawings
FIG. 1 is a block diagram illustrating an exemplary overall architecture of an apparatus for implementing a Hessian-Free training algorithm-related application in accordance with an embodiment of the present invention;
FIG. 2 is an exemplary block diagram of a data processing module in an apparatus for implementing a Hessian-Free training algorithm-related application in accordance with an embodiment of the present invention;
FIG. 3 is a flowchart illustrating operations for implementing a relevant application of the Hessian-Free training algorithm according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments. Other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art from the following detailed description.
In this specification, the various embodiments described below which are meant to illustrate the principles of this invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The invention discloses an apparatus for executing the Hessian-Free training algorithm, comprising an instruction cache unit, an instruction decoding unit, a direct memory access unit, a data processing module and a data cache module. The apparatus can implement the Hessian-Free training algorithm and complete the training of various neural networks, such as auto-encoders, convolutional neural networks and recurrent neural networks (RNNs). In each iteration, a second-order Taylor expansion of the error function (objective function) is performed and a damping term is added, serving as an estimate of the objective function; then, from the current gradient, Gauss-Newton matrix, damping function and damping coefficient (damping constant), an update vector is obtained by the preconditioned conjugate gradient method (preconditioned CG minimization), and the parameters to be updated are updated. Iteration continues until the parameter vector to be updated converges.
More specifically, the apparatus of the invention comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The direct memory access unit can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data; specifically, it reads instructions into the instruction cache unit, reads the parameters to be updated and the corresponding gradient values from designated storage units into the data processing unit, and writes updated parameter vectors from the data processing module directly into the externally designated space. The instruction cache unit reads instructions through the direct memory access unit and caches them. The controller unit reads instructions from the instruction cache unit, decodes them into microinstructions that control the behaviour of the other modules, and sends the microinstructions to the other modules such as the direct memory access unit, the data cache unit and the data processing module. The data cache unit stores intermediate variables needed during operation and initializes and updates these variables. The data processing module performs the corresponding operations according to the instructions.
In addition, the invention also discloses a method for executing the Hessian-Free training algorithm, which comprises the following steps:
step (1), finishing the initialization operation of the data cache unit through an instruction, namely initializing the second-order estimation of f (theta)In particular to the gradient thereinGauss-Newton matrix GfDamping coefficient lambda and damping functionA zero operation is performed.
Step (2): through an IO instruction, the direct memory access unit completes the operation of reading the parameter vector to be updated from the external space.
Step (3): according to the corresponding instruction, the data processing module performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λR_{θ_n}(θ) to obtain an estimate f̃(δ_n) of f(θ) near θ_n, namely
f̃(δ_n) = M_{θ_n}(θ) + λR_{θ_n}(θ) = f(θ_n) + ∇f(θ_n)^T δ_n + δ_n^T G_f δ_n / 2 + λR_{θ_n}(θ);
wherein G_f is the Gauss-Newton matrix of f at θ_n; δ_n is the update vector; the damping coefficient λ is obtained by a Levenberg-Marquardt style heuristic; the damping function R_{θ_n}(θ) is the value taken around θ_n by a function predetermined according to the training model. For example, when RNN training is performed, it is built from the Gauss-Newton matrix G_S of S at θ_n weighted by μ, where S, like f, is a distance function and μ (a weighting constant) is a predetermined positive number.
Step (4): according to the corresponding instruction, the data processing module solves for δ_n by the preconditioned conjugate gradient method so that f̃(δ_n) reaches a minimum, and updates θ_n to θ_{n+1}. The update operation is:
θ_{n+1} = θ_n + δ_n.
It is worth mentioning that, while the preconditioned conjugate gradient method runs, only a mini-batch rather than all samples is used, and the Gauss-Newton matrix-vector products involved are computed implicitly and approximately (Pearlmutter's R-operator method) rather than by forming the Gauss-Newton matrix explicitly; a sketch of such an implicit product is given after this method description. This improves the efficiency of learning on large data sets and the operating efficiency of the data operation module, and prevents the amount of computation from growing quadratically with the number of parameters.
Step (5): the data processing unit judges whether the updated parameter vector has converged; if so, the operation ends, otherwise the process returns to step (2) and continues.
The apparatus for the Hessian-Free training algorithm implemented according to the embodiment of the invention can be used to support applications of the Hessian-Free training algorithm. A space is opened in the data cache unit to store the second-order estimate of the error function near each iterate of the parameters to be updated; each time the preconditioned conjugate gradient method is carried out, the update vector is calculated from this second-order estimate, and the vector to be updated is then updated. This is repeated until the vector to be updated converges.
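The implicit Gauss-Newton matrix-vector product mentioned above (the Pearlmutter R-operator method) can be illustrated with automatic differentiation. The JAX sketch below shows the identity G_f v = Jᵀ H_L (J v) computed with one forward-mode and one reverse-mode pass, never forming G_f; the network, loss and all names are invented for the illustration and say nothing about the device's hardware realisation.

```python
import jax
import jax.numpy as jnp

def gauss_newton_matvec(net_fn, loss_fn, theta, v):
    """Compute G_f @ v = J^T H_L (J v) implicitly, where J is the Jacobian of
    the network output w.r.t. the parameters and H_L is the Hessian of the
    loss w.r.t. the network output (Pearlmutter-style R-operator)."""
    out, jv = jax.jvp(net_fn, (theta,), (v,))              # J v
    _, hl_jv = jax.jvp(jax.grad(loss_fn), (out,), (jv,))   # H_L (J v)
    _, vjp_fn = jax.vjp(net_fn, theta)
    return vjp_fn(hl_jv)[0]                                # J^T H_L (J v)

# Tiny illustration: a linear "network" on one mini-batch with squared error.
x = jnp.array([[1.0, 2.0], [0.5, -1.0]])    # mini-batch inputs
y = jnp.array([1.0, 0.0])                   # targets
net = lambda theta: x @ theta               # network output
loss = lambda out: 0.5 * jnp.sum((out - y) ** 2)
theta = jnp.array([0.3, -0.2])
print(gauss_newton_matvec(net, loss, theta, jnp.array([1.0, 0.0])))
```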
The technical solution of the present invention will be further explained with reference to the accompanying drawings.
Fig. 1 shows an overall structural example block diagram of an apparatus for implementing the Hessian-Free training algorithm according to an embodiment of the present invention. As shown in fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may all be implemented by hardware circuits.
The direct memory access unit 1 can access an external address space, and can read and write data to each cache unit in the device to complete loading and storing of the data. The method specifically comprises the steps of reading an instruction from an instruction cache unit 2, reading a parameter to be updated from a designated storage unit to a data processing unit 5, and directly writing the updated parameter vector into an external designated space from a data processing module 5.
The instruction cache unit 2 reads the instructions through the direct memory access unit 1 and caches the read instructions.
The controller unit 3 reads the instruction from the instruction cache unit 2, decodes the instruction into a microinstruction for controlling the behavior of other modules, and sends the microinstruction to other modules such as the direct memory access unit 1, the data cache unit 4, the data processing module 5, and the like.
The data cache unit 4 initializes f̃(δ_n) at device initialization; specifically, it initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ) contained therein. Before the n-th update of the parameter vector θ_n to be updated begins, f̃(δ_n) is read out into the data processing module 5. After the update vector δ_n is obtained in the data processing module 5, θ_n is updated to θ_{n+1} and f̃(δ_n) is correspondingly updated to f̃(δ_{n+1}); f̃(δ_{n+1}) is then written to the data cache unit 4 (the new data overwriting the previous corresponding data) for the next use.
The data processing module 5 reads f̃(δ_n) from the data cache unit 4 and reads the parameter vector θ_n to be updated from the externally designated space through the direct memory access unit 1. The update vector δ_n is obtained within the module, θ_n is updated to θ_{n+1}, and f̃(δ_n) is correspondingly updated to f̃(δ_{n+1}); f̃(δ_{n+1}) is then written to the data cache unit 4, and θ_{n+1} is written to the externally designated space through the direct memory access unit 1.
Fig. 2 shows an example block diagram of the data processing module in an apparatus for implementing applications related to the Hessian-Free training algorithm according to an embodiment of the present invention. As shown in fig. 2, the data processing module includes an operation control submodule 51, a gradient operation submodule 52, a damping term operation submodule 53, a Gauss-Newton matrix operation submodule 54, a conjugate gradient method operation submodule 55, and a basic operation submodule 56. The basic operation submodule 56 performs basic operations such as addition and multiplication between matrices and vectors; submodules 52, 53, 54 and 55 call submodule 56, and the submodules are also allowed to call one another as the situation requires.
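As a purely illustrative software analogue of this submodule structure (not a description of the hardware), the Python/NumPy sketch below shows the delegation pattern: specialised operations are built on a shared basic-operation helper, and one specialised operation may call another; every class and function name is invented for the illustration.

```python
import numpy as np

class BasicOps:                          # analogue of basic operation submodule 56
    """Shared matrix/vector primitives used by the other submodules."""
    @staticmethod
    def matvec(M, v):
        return M @ v
    @staticmethod
    def add(x, y):
        return x + y

class GaussNewtonOps:                    # analogue of submodule 54
    """Forms G_f v = J^T (H_L (J v)) by delegating every product to BasicOps.
    J and H_L are explicit here only to keep the illustration short; the
    device evaluates the product implicitly."""
    def __init__(self, J, HL):
        self.J, self.HL = J, HL
    def matvec(self, v):
        jv = BasicOps.matvec(self.J, v)
        return BasicOps.matvec(self.J.T, BasicOps.matvec(self.HL, jv))

class DampingOps:                        # analogue of submodule 53
    """Adds a simple Tikhonov-style damping contribution lam * v (assumed);
    a structural-damping term would instead call a second GaussNewtonOps
    built from the distance function S, mirroring submodules calling each
    other."""
    def __init__(self, gn_ops, lam):
        self.gn_ops, self.lam = gn_ops, lam
    def damped_matvec(self, v):
        return BasicOps.add(self.gn_ops.matvec(v), self.lam * v)

# Example: a 2x3 Jacobian, identity loss Hessian, damping coefficient 0.1.
J = np.array([[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]])
ops = DampingOps(GaussNewtonOps(J, np.eye(2)), lam=0.1)
print(ops.damped_matvec(np.array([1.0, 0.0, 0.0])))
```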
Fig. 3 shows the overall operation flowchart of the apparatus operating according to the Hessian-Free training algorithm.
In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
In step S2, the operation starts and the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2; according to the translated microinstruction, the direct memory access unit 1 reads all instructions related to the Hessian-Free calculation from the external address space and caches them in the instruction cache unit 2.
In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction the direct memory access unit 1 reads the initial parameter vector θ_0 to be updated from the external space into the data processing module 5.
In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2, and according to the translated microinstruction the data cache unit 4 initializes f̃(δ_n); the iteration count n in the data processing unit 5 is set to 0.
In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction the direct memory access unit 1 reads the parameter vector θ_n to be updated from the external space into the data processing module 5.
In step S6, the controller unit 3 reads, from the instruction cache unit 2, an instruction for second-order estimation of the error function near the current parameter vector value, and according to the translated microinstruction performs the operation of computing the second-order estimate f̃(δ_n) of f(θ) near θ_n. In this operation, instructions are sent to the operation control submodule 51, and the operation control submodule 51 issues corresponding instructions to perform the following operations: compute ∇f(θ_n) with the gradient operation submodule 52; obtain the Gauss-Newton matrix G_f of f at θ_n using the Gauss-Newton operation submodule 54 and the matrix multiplication of the basic operation submodule 56; execute the LM heuristic with the damping term operation submodule 53 and the basic operation submodule 56 to obtain the damping coefficient λ and hence the damping term λR_{θ_n}(θ); finally obtain the expression of f̃(δ_n) and store it in the data cache unit 4.
In step S7, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, and according to the translated microinstruction f̃(δ_n) is transferred from the data cache unit 4 to the data processing unit 5.
In step S8, the controller unit 3 reads a parameter-update operation instruction from the instruction cache unit 2, and according to the translated microinstruction performs the operation of solving for δ_n by the preconditioned conjugate gradient method so that f̃(δ_n) reaches a minimum and of updating θ_n to θ_{n+1}. The direct memory access unit 1 reads the parameter vector θ_n to be updated from the external space into the data processing module 5. The operation control submodule 51 controls the relevant operation modules to perform the following operations: obtain the update vector δ_n using the conjugate gradient operation submodule 55 and the basic operation submodule 56; depending on the damping function, the Gauss-Newton operation submodule may also need to be invoked (as in the RNN example mentioned above); finally, θ_n is updated to θ_{n+1} by vector addition in the basic operation submodule 56.
In step S9, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction the updated parameter vector θ_{n+1} is transferred from the data processing unit 5 to the externally designated space through the direct memory access unit 1.
In step S10, the controller unit reads a convergence-judgment instruction from the instruction cache unit 2, and according to the translated microinstruction the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if it has, the operation ends; otherwise the iteration count n is incremented by 1 and the process returns to step S5. A software sketch of this outer loop is given below.
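Read as ordinary software, steps S4 to S10 amount to the outer loop sketched below in Python/NumPy. The Levenberg-Marquardt style rule for adjusting λ is not spelled out above, so the reduction-ratio rule shown here, the simple damping term, the direct linear solve standing in for the preconditioned conjugate gradient method, and the convergence test are all assumptions made only for the illustration.

```python
import numpy as np

def hessian_free_train(f, grad_f, gauss_newton_at, theta0,
                       lam=1.0, max_outer=100, tol=1e-6):
    """Software analogue of the outer loop S4-S10. For brevity the inner
    solve uses an explicit damped matrix and np.linalg.solve; the device
    instead runs preconditioned CG with implicit Gauss-Newton products
    (see the sketches earlier in this description)."""
    theta = theta0.copy()
    for n in range(max_outer):                   # S5: fetch theta_n
        g = grad_f(theta)                        # S6: gradient at theta_n
        G = gauss_newton_at(theta)               # S6: Gauss-Newton matrix
        A = G + lam * np.eye(len(g))             # damped curvature
                                                 # (assumes R(d) = ||d||^2 / 2)
        delta = np.linalg.solve(A, -g)           # S8: stands in for PCG
        pred_red = -(g @ delta + 0.5 * delta @ (A @ delta))
        actual_red = f(theta) - f(theta + delta)
        rho = actual_red / pred_red if pred_red > 0 else -np.inf
        if rho < 0.25:                           # LM-style heuristic for lam
            lam *= 1.5                           # (standard rule, assumed)
        elif rho > 0.75:
            lam *= 2.0 / 3.0
        theta = theta + delta                    # S8/S9: theta_{n+1}
        if np.linalg.norm(delta) < tol:          # S10: convergence check
            break
    return theta

# Illustration on a quadratic error function f(t) = 0.5 t^T Q t - b^T t.
Q = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])
theta_star = hessian_free_train(
    f=lambda t: 0.5 * t @ Q @ t - b @ t,
    grad_f=lambda t: Q @ t - b,
    gauss_newton_at=lambda t: Q,
    theta0=np.zeros(2),
)
print(theta_star)   # approaches the solution of Q theta = b
```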
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software carried on a non-transitory computer readable medium), or a combination thereof. Although the processes or methods are described above in terms of some sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (10)

1. An apparatus for performing a Hessian-Free training algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
the data cache unit is used for storing intermediate variables in the operation process and executing initialization and updating operations on the intermediate variables;
and the data processing module is used for executing operation under the control of the controller unit and storing the intermediate variable in the data cache unit.
2. The apparatus of claim 1, wherein the data processing module comprises an operation control submodule, a gradient operation submodule, a damping term operation submodule, a Gauss-Newton matrix operation submodule, a conjugate gradient method operation submodule and a basic operation submodule; the basic operation submodule performs basic addition and multiplication operations between matrices and vectors;
preferably, the gradient operation submodule, the damping term operation submodule, the gauss-newton matrix operation submodule and the conjugate gradient method operation submodule can call the basic operation submodule, and the gradient operation submodule, the damping term operation submodule, the gauss-newton matrix operation submodule and the conjugate gradient method operation submodule are allowed to call each other according to the situation.
3. The apparatus for performing a Hessian-Free training algorithm as in claim 1, wherein the data cache unit initializes the second-order estimate f̃(δ_n) of f(θ) at device initialization, reads f̃(δ_n) out to the data processing module before the n-th update of the parameter vector θ_n to be updated begins, and writes the updated estimate back after the update vector has been obtained in the data processing module; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, i.e. a function of the deviation between the actual and predicted values of the measured result, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n.
4. The apparatus for performing a Hessian-Free training algorithm as in claim 3, wherein, when the data cache unit initializes f̃(δ_n), it initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ) contained therein; wherein the gradient ∇f(θ_n) denotes the gradient of f at θ_n, G_f is the Gauss-Newton matrix of f at θ_n, the damping function R_{θ_n}(θ) is the value taken around θ_n by a function predetermined according to the training model, and the damping coefficient λ is obtained by an LM-style heuristic;
the data processing module reads f̃(δ_n) from the data cache unit and reads the parameter vector θ_n to be updated from an externally designated space; the update vector δ_n is obtained within the module, θ_n is updated to θ_{n+1}, and f̃(δ_n) is correspondingly updated to f̃(δ_{n+1}); f̃(δ_{n+1}) is then written to the data cache unit, and θ_{n+1} is written to the externally designated space; wherein θ_{n+1} is the parameter vector to be updated at the (n+1)-th iteration and f̃(δ_{n+1}) is the second-order estimate of f(θ) around θ_{n+1}.
5. A method of performing a Hessian-Free training algorithm, comprising the steps of:
step (1): the initialization of the data cache unit is completed through an instruction, namely the second-order estimate f̃(δ_n) of f(θ) is initialized; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, i.e. a function of the deviation between the actual and predicted values of the measured result, δ_n is the update vector, and θ_{n+1} = θ_n + δ_n;
Step (2), completing the operation of reading the parameter vector to be updated from the external space by the direct memory access unit through an IO instruction;
step (3): according to the corresponding instruction, the data processing module performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λR_{θ_n}(θ) to obtain an estimate f̃(δ_n) of f(θ) near θ_n, namely
f̃(δ_n) = M_{θ_n}(θ) + λR_{θ_n}(θ) = f(θ_n) + ∇f(θ_n)^T δ_n + δ_n^T G_f δ_n / 2 + λR_{θ_n}(θ);
wherein G_f is the Gauss-Newton matrix of f at θ_n; the damping coefficient λ is obtained by an LM-style heuristic; the damping function R_{θ_n}(θ) is the value taken around θ_n by a function predetermined according to the training model;
step (4): according to the corresponding instruction, the data processing module solves for δ_n by the preconditioned conjugate gradient method so that f̃(δ_n) reaches a minimum, and updates θ_n to θ_{n+1}; the specific update operation is:
θ_{n+1} = θ_n + δ_n;
and step (5): the data processing unit judges whether the updated parameter vector has converged; if so, the operation ends, otherwise the process returns to step (2) and continues.
6. The method of performing a Hessian-Free training algorithm as claimed in claim 5, wherein completing the initialization of the data cache unit in step (1) comprises: setting the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ) to zero.
7. The method of performing a Hessian-Free training algorithm as claimed in claim 5, wherein, when RNN training is performed in step (3), the damping function R_{θ_n}(θ) is built from the Gauss-Newton matrix G_S of S at θ_n weighted by μ, where S and f are both distance functions, G_S is the Gauss-Newton matrix of S at θ_n, and μ is a predetermined positive number.
8. The method of performing a Hessian-Free training algorithm as claimed in claim 5, wherein, in the step of solving for δ_n by the preconditioned conjugate gradient method in step (4) so that f̃(δ_n) reaches a minimum, only a mini-batch rather than all samples is used while the preconditioned conjugate gradient method runs, and the Gauss-Newton matrix-vector products involved are computed implicitly and approximately rather than by forming the Gauss-Newton matrix explicitly.
9. A method of performing a Hessian-Free training algorithm, comprising the steps of:
in step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.
Step S2, when the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated micro instruction, the direct memory access unit reads all instructions related to Hessian-Free calculation from the external address space and caches the instructions into the instruction cache unit;
step S3: the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the direct memory access unit reads the initial parameter vector θ_0 to be updated from the external space into the data processing module;
step S4: the controller unit reads an assignment instruction from the instruction cache unit, and according to the translated microinstruction the data cache unit initializes f̃(δ_n), and the iteration count n in the data processing unit is set to 0; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, i.e. a function of the deviation between the actual and predicted values of the measured result, δ_n is the update vector, θ_{n+1} = θ_n + δ_n, and f̃(δ_n) is the second-order estimate of f(θ);
step S5: the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the direct memory access unit reads the parameter vector θ_n to be updated from the external space into the data processing module;
step S6: the controller unit reads, from the instruction cache unit, an instruction for second-order estimation of the error function near the current parameter vector value, and according to the translated microinstruction performs the operation of computing the second-order estimate f̃(δ_n) of f(θ) near θ_n; in this operation the instruction is sent to the operation control submodule, and the operation control submodule issues corresponding instructions to perform the following operations: compute ∇f(θ_n) with the gradient operation submodule; obtain the Gauss-Newton matrix G_f of f at θ_n using the Gauss-Newton operation submodule and the matrix multiplication of the basic operation submodule; execute the LM heuristic with the damping term operation submodule and the basic operation submodule to obtain the damping coefficient λ and hence the damping term λR_{θ_n}(θ); finally obtain from f(θ_n) + ∇f(θ_n)^T δ_n + δ_n^T G_f δ_n / 2 + λR_{θ_n}(θ) the expression of f̃(δ_n), which is stored in the data cache unit; wherein the damping function R_{θ_n}(θ) is the value taken around θ_n by a function predetermined according to the training model;
step S7: the controller unit reads a data transfer instruction from the instruction cache unit, and according to the translated microinstruction f̃(δ_n) is transferred from the data cache unit to the data processing unit;
step S8: the controller unit reads a parameter-update operation instruction from the instruction cache unit, and according to the translated microinstruction performs the operation of solving for δ_n by the preconditioned conjugate gradient method so that f̃(δ_n) reaches a minimum and of updating θ_n to θ_{n+1}; the direct memory access unit reads the parameter vector θ_n to be updated from the external space into the data processing module; the operation control submodule controls the relevant operation modules to perform the following operations: obtain the update vector δ_n using the conjugate gradient operation submodule and the basic operation submodule; finally update θ_n to θ_{n+1} by vector addition in the basic operation submodule;
step S9: the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the updated parameter vector θ_{n+1} is transferred from the data processing unit to the externally designated space through the direct memory access unit;
step S10: the controller unit reads a convergence-judgment instruction from the instruction cache unit, and according to the translated microinstruction the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if it has, the operation ends; otherwise the iteration count n is incremented by 1 and the process returns to step S5.
10. A device for performing a Hessian-Free training algorithm, the device having a controller in which a program is embedded for performing the method of performing the Hessian-Free training algorithm of any one of claims 5 to 9.
CN201610283885.XA 2016-04-29 2016-04-29 Device and method for executing Hessian-Free training algorithm Active CN107341540B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610283885.XA CN107341540B (en) 2016-04-29 2016-04-29 Device and method for executing Hessian-Free training algorithm
PCT/CN2016/081842 WO2017185413A1 (en) 2016-04-29 2016-05-12 Device and method for executing hessian-free training algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610283885.XA CN107341540B (en) 2016-04-29 2016-04-29 Device and method for executing Hessian-Free training algorithm

Publications (2)

Publication Number Publication Date
CN107341540A true CN107341540A (en) 2017-11-10
CN107341540B CN107341540B (en) 2021-07-20

Family

ID=60160584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610283885.XA Active CN107341540B (en) 2016-04-29 2016-04-29 Device and method for executing Hessian-Free training algorithm

Country Status (2)

Country Link
CN (1) CN107341540B (en)
WO (1) WO2017185413A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626434A (en) * 2020-05-15 2020-09-04 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium
CN112990958A (en) * 2021-01-19 2021-06-18 腾讯科技(深圳)有限公司 Data processing method, data processing device, storage medium and computer equipment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934208B (en) * 2019-04-22 2024-07-23 江苏邦融微电子有限公司 Hardware acceleration system and method for fingerprint identification
US11444483B2 (en) * 2020-01-14 2022-09-13 Hitachi Energy Switzerland Ag Adaptive state estimation for power systems


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1658550A (en) * 2004-04-16 2005-08-24 威盛电子股份有限公司 Apparatus and method for performing cipher operation
CN1834898A (en) * 2005-05-16 2006-09-20 威盛电子股份有限公司 Microprocessor apparatus and method for modular exponentiation
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
US20140067738A1 (en) * 2012-08-28 2014-03-06 International Business Machines Corporation Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization
US20150161987A1 (en) * 2013-12-06 2015-06-11 International Business Machines Corporation Systems and methods for accelerating hessian-free optimization for deep neural networks by implicit preconditioning and sampling
WO2016037351A1 (en) * 2014-09-12 2016-03-17 Microsoft Corporation Computing system for training neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JAMES MARTENS: "Deep learning via Hessian-free optimization", 《PROCEEDINGS OF THE 27TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING》 *
JAMES MARTENS et al.: "Training Deep and Recurrent Networks with Hessian-Free Optimization", 《NEURAL NETWORKS: TRICKS OF THE TRADE》 *
RYAN KIROS: "Training Neural Networks with Stochastic Hessian-Free Optimization", 《ARXIV:1301.3641V3 [CS.LG]》 *
YUNJI CHEN et al.: "DaDianNao: A Machine-Learning Supercomputer", 《2014 47TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626434A (en) * 2020-05-15 2020-09-04 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium
CN111626434B (en) * 2020-05-15 2022-06-07 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium
CN112990958A (en) * 2021-01-19 2021-06-18 腾讯科技(深圳)有限公司 Data processing method, data processing device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN107341540B (en) 2021-07-20
WO2017185413A1 (en) 2017-11-02

Similar Documents

Publication Publication Date Title
KR102175044B1 (en) Apparatus and method for running artificial neural network reverse training
CN111310904B (en) Apparatus and method for performing convolutional neural network training
CN107341542B (en) Apparatus and method for performing recurrent neural networks and LSTM operations
CN111353589B (en) Apparatus and method for performing artificial neural network forward operations
CN111260025B (en) Apparatus and method for performing LSTM neural network operation
CN109376861B (en) Apparatus and method for performing full connectivity layer neural network training
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
CN107341540B (en) Device and method for executing Hessian-Free training algorithm
CN107341132B (en) Device and method for executing AdaGrad gradient descent training algorithm
CN108334944B (en) Artificial neural network operation device and method
WO2017185336A1 (en) Apparatus and method for executing pooling operation
US20200097520A1 (en) Apparatus and methods for vector operations
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
CN111860814B (en) Apparatus and method for performing batch normalization operations
CN107315570B (en) Device and method for executing Adam gradient descent training algorithm
CN107315569B (en) Device and method for executing RMSprop gradient descent algorithm
CN111860772B (en) Device and method for executing artificial neural network mapping operation
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant after: Zhongke Cambrian Technology Co.,Ltd.

Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing

Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant
TG01 Patent term adjustment