CN107341540A - A kind of apparatus and method for performing Hessian-Free training algorithms - Google Patents
Info
- Publication number
- CN107341540A (application CN201610283885.XA)
- Authority
- CN
- China
- Prior art keywords
- theta
- instruction
- updated
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
An apparatus and method for performing the Hessian-Free training algorithm; the apparatus includes a direct memory access unit, a controller unit, a data processing module and a data cache module. The Hessian-Free training algorithm can be realized with the apparatus to complete the training of various neural networks, such as auto-encoders, convolutional neural networks and recurrent neural networks (RNN). During each iteration, a second-order Taylor expansion of the error function (objective function) is computed and a damping term is added, serving as an estimate of the objective function; then, according to the current gradient, the Gauss-Newton matrix, the damping function and the damping coefficient, an update vector is obtained with a preconditioned conjugate gradient method and the parameters to be updated are updated. Iteration continues until the parameter vector to be updated converges.
Description
Technical Field
The invention relates to the technical field of neural network operation, in particular to a device and a method for executing a Hessian-Free training algorithm.
Background
The gradient descent method is widely applied in the fields of function approximation, optimization, pattern recognition, image processing and the like. At present, the mainstream training method for neural networks is the gradient descent method (combined with the BP algorithm), but this method ignores the curvature information of the error function, so parameter changes easily become excessively gentle and fail to converge to a local optimum, and error functions with "pathological curvature" (such as the Rosenbrock function) cannot be handled well. The Hessian-Free training algorithm solves this problem well and, with some refinements of detail, keeps the amount of computation from growing quadratically with the number of parameters (it grows linearly, as with the gradient descent method).
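As a concrete example of the pathological curvature mentioned above, the Rosenbrock function can be written down directly; the short Python sketch below (an illustration only, not part of the claimed device) evaluates it and its gradient.

```python
import numpy as np

def rosenbrock(x, y, a=1.0, b=100.0):
    # Classic Rosenbrock "banana" function; its minimum is at (a, a^2).
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    # Analytic gradient; along the curved valley the curvature differs by
    # orders of magnitude between directions, which stalls plain gradient descent.
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return np.array([dx, dy])

print(rosenbrock(0.0, 0.0), rosenbrock_grad(0.0, 0.0))  # 1.0 [-2.  0.]
```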
Currently, one known method of executing the Hessian-Free training algorithm is to use a general-purpose processor, which supports the algorithm by executing general instructions with a general-purpose register file and general-purpose functional units. One disadvantage of this approach is the low computational performance of a single general-purpose processor; when multiple general-purpose processors execute in parallel, the communication between them becomes a performance bottleneck. In addition, the general-purpose processor needs to decode the operations corresponding to the Hessian-Free training algorithm into a long sequence of arithmetic and memory access instructions, and the front-end decoding of the processor brings a large power consumption overhead.
Another known method of executing the Hessian-Free training algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units. Because the GPU is a device dedicated to graphics operations and scientific computing, it has no special support for the operations of the Hessian-Free training algorithm, and a large amount of front-end decoding work is still required to perform them, which brings considerable additional overhead. In addition, the GPU has only a small on-chip cache, so data required in the computation (such as the Gauss-Newton matrix) must be repeatedly transferred from off-chip memory, and the off-chip bandwidth becomes the main performance bottleneck while bringing a huge power consumption overhead.
Disclosure of Invention
In view of the above, it is an object of the present invention to provide an apparatus and method for performing a Hessian-Free training algorithm, which aim to address at least one of the above technical problems.
To achieve the above object, as one aspect of the present invention, there is provided an apparatus for performing a Hessian-Free training algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
the data cache unit is used for storing intermediate variables in the operation process and executing initialization and updating operations on the intermediate variables;
and the data processing module is used for executing operation under the control of the controller unit and storing the intermediate variable in the data cache unit.
The data processing module comprises an operation control sub-module, a gradient operation sub-module, a damping term operation sub-module, a Gauss-Newton matrix operation sub-module, a conjugate gradient method operation sub-module and a basic operation sub-module; the basic operation sub-module performs basic operations such as addition and multiplication between matrices and vectors;
preferably, the gradient operation submodule, the damping term operation submodule, the gauss-newton matrix operation submodule and the conjugate gradient method operation submodule can call the basic operation submodule, and the gradient operation submodule, the damping term operation submodule, the gauss-newton matrix operation submodule and the conjugate gradient method operation submodule are allowed to call each other according to the situation.
Wherein the data cache unit initializes the second-order estimate f~(δ_n) of f(θ) at device initialization; before the n-th update of the parameter vector θ_n to be updated starts, f~(δ_n) is read out to the data processing module, and after the update vector is obtained in the data processing module, the updated f~(δ_{n+1}) is written back; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, namely a function of the deviation between the actual value of the measurement result and the predicted value; δ_n is the update vector, and θ_{n+1} = θ_n + δ_n.
Wherein, in the step of initializing f~(δ_n), the data cache unit initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ); wherein the gradient ∇f(θ_n) refers to the gradient value of f at θ_n, and G_f is the Gauss-Newton matrix of f at θ_n; the damping function R_{θ_n}(θ) is determined in advance according to the training model and evaluated at θ_n; the damping coefficient λ is obtained by an LM-type heuristic method;
the data processing module reads f~(δ_n) from the data cache unit and reads the parameter vector θ_n to be updated from the external designated space; it obtains the update vector δ_n within the module, updates θ_n to θ_{n+1} and, correspondingly, f~(δ_n) to f~(δ_{n+1}); it then writes f~(δ_{n+1}) into the data cache unit and writes θ_{n+1} into the external designated space; wherein θ_{n+1} is the (n+1)-th parameter vector to be updated, and f~(δ_{n+1}) is the second-order estimate of f(θ) for the (n+1)-th iteration.
As another aspect of the present invention, the present invention also provides a method of performing a Hessian-Free training algorithm, comprising the steps of:
step (1), completing the initialization operation of the data cache unit through an instruction, namely initializing the second-order estimate f~(δ_n) of f(θ); wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, namely a function of the deviation between the actual value of the measurement result and the predicted value; δ_n is the update vector, and θ_{n+1} = θ_n + δ_n;
Step (2), completing the operation of reading the parameter vector to be updated from the external space by the direct memory access unit through an IO instruction;
step (3), according to corresponding instructions, the data processing module performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λR_{θ_n}(θ), obtaining an estimate f~(δ_n) of f(θ) near θ_n, namely

f~(δ_n) = M_{θ_n}(θ) + λR_{θ_n}(θ) = f(θ_n) + ∇f(θ_n)^T δ_n + (1/2) δ_n^T G_f δ_n + λR_{θ_n}(θ);

wherein G_f is the Gauss-Newton matrix of f at θ_n; the damping coefficient λ is obtained by an LM-type heuristic method (an illustrative sketch of such a heuristic is given after these steps); the damping function R_{θ_n}(θ) is determined in advance according to the training model and evaluated at θ_n;
step (4), according to the corresponding instruction, the data processing module uses the preconditioned conjugate gradient method to solve for δ_n so that f~(δ_n) reaches a minimum value, and θ_n is updated to θ_{n+1}; the specific update operation is:

θ_{n+1} = θ_n + δ_n;
step (5), the data processing unit judges whether the updated parameter vector has converged; if so, the operation is finished; otherwise, the process returns to step (2) and continues.
Wherein the step of completing the initialization operation of the data cache unit in step (1) comprises: performing a zeroing operation on the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ).
Wherein, when RNN training is performed in step (3), the damping function R_{θ_n}(θ) is constructed from a second distance function S, where S and f are both distance functions, G_S is the Gauss-Newton matrix of S at θ_n, and μ is a predetermined positive number (weighting constant).
Wherein, in the step (4) of solving for δ_n with the preconditioned conjugate gradient method so that f~(δ_n) reaches a minimum, only a mini-batch rather than all samples is used during the preconditioned conjugate gradient method, and the Gauss-Newton matrix-vector products involved are computed implicitly and approximately, without forming G_f explicitly.
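The LM-type heuristic for the damping coefficient λ is not spelled out above; the minimal Python sketch below shows one common Levenberg-Marquardt-style rule as an illustration only. The function name, the thresholds (1/4, 3/4) and the scale factors are assumptions, not values taken from the patent.

```python
def lm_update_damping(lam, f_old, f_new, model_reduction,
                      low=0.25, high=0.75, grow=1.5, shrink=2.0 / 3.0):
    """One Levenberg-Marquardt-style adjustment of the damping coefficient.

    rho compares the actual change of the error function with the change
    predicted by the damped quadratic model; lam is raised when the model
    is untrustworthy and lowered when it tracks the true function well.
    """
    # model_reduction = f~(delta_n) - f~(0), expected to be negative.
    rho = (f_new - f_old) / model_reduction
    if rho > high:        # model predicted the decrease well: damp less
        lam *= shrink
    elif rho < low:       # model over-promised: damp more
        lam *= grow
    return lam
```

The idea is that λ is raised when the damped quadratic model over-predicts the actual decrease of the error function and lowered when the prediction is reliable, which matches the role the damping coefficient plays in step (3) above.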
As a further aspect of the present invention, there is also provided a method of performing a Hessian-Free training algorithm, characterized by comprising the steps of:
in step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.
Step S2, when the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated micro instruction, the direct memory access unit reads all instructions related to Hessian-Free calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the direct memory access unit reads the initial parameter vector θ_0 to be updated from the external space into the data processing module;
step S4, the controller unit reads an assignment instruction from the instruction cache unit, and according to the translated microinstruction the data cache unit initializes f~(δ_n), and the iteration number n in the data processing unit is set to 0; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, namely a function of the deviation between the actual value of the measurement result and the predicted value; δ_n is the update vector, θ_{n+1} = θ_n + δ_n, and f~(δ_n) is the second-order estimate of f(θ);
step S5, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the direct memory access unit reads the parameter vector θ_n to be updated from the external space and transmits it into the data processing module;
step S6, the controller unit reads, from the instruction cache unit, an instruction for second-order estimation of the error function near the current parameter vector value, and according to the translated microinstruction performs the operation of computing the second-order estimate f~(δ_n) of f(θ) near θ_n; in this operation, the instruction is sent to the operation control submodule, and the operation control submodule sends corresponding instructions to perform the following operations: the gradient operation submodule is used to calculate ∇f(θ_n); the matrix multiplication in the Gauss-Newton operation submodule and the basic operation submodule is used to obtain the Gauss-Newton matrix G_f of f at θ_n; the damping term operation submodule and the basic operation submodule are used to execute the LM heuristic method to obtain the damping coefficient λ and thus the damping term λR_{θ_n}(θ); finally, the expression of f~(δ_n) = M_{θ_n}(θ) + λR_{θ_n}(θ) is obtained and stored in the data cache unit; wherein the damping function R_{θ_n}(θ) is determined in advance according to the training model and evaluated at θ_n;
in step S7, the controller unit reads a data transfer instruction from the instruction cache unit, and according to the translated microinstruction f~(δ_n) is transmitted from the data cache unit to the data processing unit;
step S8, the controller unit reads a parameter updating operation instruction from the instruction cache unit, and according to the translated microinstruction performs the operation of solving for δ_n with the preconditioned conjugate gradient method so that f~(δ_n) reaches a minimum value, and of updating θ_n to θ_{n+1}; the direct memory access unit reads the parameter vector θ_n to be updated from the external space and transmits it into the data processing module; the operation control sub-module controls the relevant operation modules to perform the following operations: the update vector δ_n is obtained using the conjugate gradient operation submodule and the basic operation submodule; finally, θ_n is updated to θ_{n+1} by vector addition in the basic operation submodule;
step S9, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the updated parameter vector θ_{n+1} is transmitted from the data processing unit to the external designated space through the direct memory access unit;
in step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if it has converged, the operation is ended; otherwise, the value of the iteration number n is increased by 1, and the process returns to step S5.
As yet another aspect of the present invention, there is also provided a device for performing the Hessian-Free training algorithm, having a program solidified in a controller thereof for performing the method of performing the Hessian-Free training algorithm as described above.
Based on the above technical scheme, the device and the method have the following beneficial effects: the device can realize the Hessian-Free training algorithm and complete the training of various neural networks, such as auto-encoders, convolutional neural networks, recurrent neural networks (RNN) and the like; by adopting dedicated equipment for executing the Hessian-Free training algorithm, the problems of insufficient computational performance of general-purpose processors and high front-end decoding overhead can be solved, and the execution of related applications is accelerated; meanwhile, the data cache unit avoids repeatedly reading data from memory and reduces the memory access bandwidth required.
Drawings
FIG. 1 is a block diagram illustrating an exemplary overall architecture of an apparatus for implementing a Hessian-Free training algorithm-related application in accordance with an embodiment of the present invention;
FIG. 2 is an exemplary block diagram of a data processing module in an apparatus for implementing a Hessian-Free training algorithm-related application in accordance with an embodiment of the present invention;
fig. 3 is a flowchart illustrating operations for implementing a relevant application of the Hessian-Free training algorithm according to an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments. Other aspects, advantages, and salient features of the invention will become apparent to those skilled in the art from the following detailed description.
In this specification, the various embodiments described below which are meant to illustrate the principles of this invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
The invention discloses a device for executing the Hessian-Free training algorithm, which comprises an instruction cache unit, an instruction decoding unit, a direct memory access unit, a data processing module and a data cache module. The device can realize the Hessian-Free training algorithm and complete the training of various neural networks, such as auto-encoders, convolutional neural networks, recurrent neural networks (RNN) and the like. During each iteration, a second-order Taylor expansion of the error function (target function) is performed and a damping term is added, to serve as an estimate of the target function; then, according to the current gradient, the Gauss-Newton matrix, the damping function and the damping coefficient (damping constant), an update vector is obtained by a preconditioned conjugate gradient method (preconditioned CG), and the parameters to be updated are updated. The iteration continues until the parameter vector to be updated converges.
More specifically, the device of the invention comprises a direct memory access unit, an instruction cache unit, a controller unit, a data cache unit and a data processing module. The direct memory access unit can access an external address space, can read and write data to each cache unit in the device and complete loading and storage of the data, and specifically comprises the steps of reading an instruction to the instruction cache unit, reading parameters to be updated and corresponding gradient values from specified storage units to the data processing unit, and directly writing updated parameter vectors into the external specified space from the data processing module; the instruction cache unit reads the instruction through the direct memory access unit and caches the read instruction; the controller unit reads the instruction from the instruction cache unit, decodes the instruction into a microinstruction for controlling the behavior of other modules and sends the microinstruction to other modules such as the direct memory access unit, the data cache unit, the data processing module and the like; the data cache unit stores some intermediate variables needed in the operation of the device, and initializes and updates the variables; and the data processing module performs corresponding operations according to the instruction.
In addition, the invention also discloses a method for executing the Hessian-Free training algorithm, which comprises the following steps:
step (1), finishing the initialization operation of the data cache unit through an instruction, namely initializing the second-order estimation of f (theta)In particular to the gradient thereinGauss-Newton matrix GfDamping coefficient lambda and damping functionA zero operation is performed.
And (2) completing the operation of reading the parameter vector to be updated from the external space by the direct memory access unit through the IO instruction.
Step (3), according to corresponding instructions, the data processing module performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λR_{θ_n}(θ) to obtain an estimate f~(δ_n) of f(θ) near θ_n, namely

f~(δ_n) = M_{θ_n}(θ) + λR_{θ_n}(θ) = f(θ_n) + ∇f(θ_n)^T δ_n + (1/2) δ_n^T G_f δ_n + λR_{θ_n}(θ)

wherein G_f is the Gauss-Newton matrix of f at θ_n; δ_n is the update vector; the damping coefficient λ is obtained by an LM-type heuristic method (Levenberg-Marquardt style heuristics); the damping function R_{θ_n}(θ) is determined in advance according to the training model and evaluated at θ_n. For example, when RNN training is performed, the damping function is constructed from a second distance function S, where S and f are similar and are both distance functions, G_S is the Gauss-Newton matrix of S at θ_n, and μ (a weighting constant) is a predetermined positive number. (An illustrative software sketch of this quadratic model is given after step (5) below.)
Step (4), according to the corresponding instruction, the data processing module uses the preconditioned conjugate gradient method to solve for δ_n so that f~(δ_n) reaches a minimum value, and θ_n is updated to θ_{n+1}. The update operation is as follows:

θ_{n+1} = θ_n + δ_n;
It is worth mentioning that in the implementation of the preconditioned conjugate gradient method, only a mini-batch rather than all samples is used, and the Gauss-Newton matrix-vector products involved are computed implicitly and approximately (Pearlmutter's R-operator method) without forming G_f explicitly; this is also illustrated in the sketch below. As a result, the efficiency of learning on large data sets and the operating efficiency of the data operation module are improved, and the amount of computation is prevented from growing with the square of the number of parameters.
Step (5), the data processing unit judges whether the updated parameter vector has converged; if so, the operation is finished; otherwise, the process returns to step (2) and continues.
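To make steps (3) and (4) concrete, the following minimal Python sketch evaluates the damped quadratic model f~(δ) for a least-squares error function on a mini-batch. It is an illustration under stated assumptions rather than the hardware implementation: Tikhonov damping R_{θ_n}(δ) = ||δ||^2 / 2 is assumed, and an explicit residual Jacobian J stands in for Pearlmutter's R-operator, so the Gauss-Newton product is formed matrix-free as G_f v = J^T (J v).

```python
import numpy as np

def make_gauss_newton_vp(J):
    """Matrix-free Gauss-Newton product for a least-squares error
    f(theta) = 0.5 * ||r(theta)||^2: G_f v = J^T (J v), with J the Jacobian of r."""
    return lambda v: J.T @ (J @ v)

def quadratic_model(f_val, grad, gv, delta, lam):
    """Damped quadratic model
    f~(delta) = f + grad^T delta + 0.5 * delta^T G_f delta + lam * R(delta),
    with Tikhonov damping R(delta) = 0.5 * ||delta||^2 assumed."""
    return (f_val + grad @ delta + 0.5 * delta @ gv(delta)
            + 0.5 * lam * (delta @ delta))

# Toy mini-batch: residuals r = X @ theta - y, so the Jacobian of r is X.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.normal(size=32)
theta = np.zeros(5)
r = X @ theta - y
f_val, grad = 0.5 * (r @ r), X.T @ r
gv = make_gauss_newton_vp(X)
print(quadratic_model(f_val, grad, gv, np.zeros(5), lam=1.0))  # equals f_val at delta = 0
```

Because only products G_f v are ever needed, the matrix G_f itself never has to be stored, which is what keeps the cost linear in the number of parameters.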
The device implemented according to the embodiment of the invention can be used to support applications of the Hessian-Free training algorithm. A space is opened in the data cache unit to store the second-order estimate of the error function near the parameters to be updated at each iteration; each time the preconditioned conjugate gradient method is carried out, an update vector is calculated using this second-order estimate, and the vector to be updated is then updated. These steps are repeated until the vector to be updated converges.
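A minimal software sketch of the iteration just described is given below: a preconditioned conjugate gradient solver minimizes the damped quadratic model using only curvature-vector products, and one outer step applies θ_{n+1} = θ_n + δ_n. The diagonal (Jacobi) preconditioner, the Tikhonov damping assumed above, and helper names such as grad_fn and gn_vp_fn are illustrative assumptions, not identifiers from the patent.

```python
import numpy as np

def preconditioned_cg(Av, b, M_inv, iters=250, tol=1e-10):
    """Solve A x = b by preconditioned conjugate gradients, where A is only
    available through the product Av(v) (here A = G_f + lam * I) and M_inv
    is the elementwise inverse of a diagonal (Jacobi) preconditioner."""
    x = np.zeros_like(b)
    r = b - Av(x)
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(iters):
        Ap = Av(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def hessian_free_step(theta, grad_fn, gn_vp_fn, lam, precond_diag):
    """One outer iteration: with Tikhonov damping, minimizing f~(delta)
    amounts to solving (G_f + lam * I) delta = -grad, then theta += delta."""
    g = grad_fn(theta)
    Av = lambda v: gn_vp_fn(theta, v) + lam * v
    delta = preconditioned_cg(Av, -g, M_inv=1.0 / precond_diag)
    return theta + delta, delta
```

With the Tikhonov assumption, minimizing f~(δ) is equivalent to solving (G_f + λI)δ = -∇f(θ_n), which is exactly the linear system handed to the conjugate gradient solver; in practice gn_vp_fn would be built from a mini-batch only, as noted above.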
The technical solution of the present invention will be further explained with reference to the accompanying drawings.
Fig. 1 shows an overall structural example block diagram of an apparatus for implementing the Hessian-Free training algorithm according to an embodiment of the present invention. As shown in fig. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, which may all be implemented by hardware circuits.
The direct memory access unit 1 can access an external address space, and can read and write data to each cache unit in the device to complete loading and storing of the data. The method specifically comprises the steps of reading an instruction from an instruction cache unit 2, reading a parameter to be updated from a designated storage unit to a data processing unit 5, and directly writing the updated parameter vector into an external designated space from a data processing module 5.
The instruction cache unit 2 reads the instructions through the direct memory access unit 1 and caches the read instructions.
The controller unit 3 reads the instruction from the instruction cache unit 2, decodes the instruction into a microinstruction for controlling the behavior of other modules, and sends the microinstruction to other modules such as the direct memory access unit 1, the data cache unit 4, the data processing module 5, and the like.
The data cache unit 4 initializes f~(δ_n) at the time of device initialization; in particular, it initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ). Before the n-th update of the parameter vector θ_n to be updated begins, f~(δ_n) is read out into the data processing module 5. After the update vector δ_n is obtained in the data processing module 5, θ_n is updated to θ_{n+1} and, correspondingly, f~(δ_n) to f~(δ_{n+1}), which is then written to the data cache unit 4 (the new data overwrites the previous corresponding data) for the next use.
The data processing module 5 reads f~(δ_n) from the data cache unit 4 and reads the parameter vector θ_n to be updated from the external designated space through the direct memory access unit 1. It obtains the update vector δ_n within the module, updates θ_n to θ_{n+1} and, correspondingly, f~(δ_n) to f~(δ_{n+1}); it then writes f~(δ_{n+1}) to the data cache unit 4 and writes θ_{n+1} to the external designated space via the direct memory access unit 1.
Fig. 2 shows an example block diagram of a data processing module in an apparatus for implementing a Hessian-Free training algorithm-related application in accordance with an embodiment of the present invention. As shown in fig. 2, the data processing module includes an operation control sub-module 51, a gradient operation sub-module 52, a damping term operation sub-module 53, a Gauss-Newton matrix operation sub-module 54, a conjugate gradient method operation sub-module 55, and a basic operation sub-module 56. The basic operation sub-module 56 performs basic operations such as addition and multiplication between matrices and vectors; the sub-modules 52, 53, 54 and 55 call the sub-module 56, and these sub-modules are also allowed to call one another as the situation requires.
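A rough software analogue of this sub-module structure is sketched below with hypothetical names (none of them come from the patent): the gradient, Gauss-Newton, damping-term and conjugate-gradient sub-modules delegate their matrix and vector arithmetic to a shared basic-operation sub-module and may call one another, mirroring the call relationships of Fig. 2; a least-squares error with residual Jacobian J and Tikhonov damping is assumed for the arithmetic.

```python
class BasicOps:
    """Basic operation sub-module 56: matrix/vector addition and multiplication."""
    def matvec(self, A, v):
        return A @ v
    def add(self, a, b):
        return a + b
    def scale(self, v, s):
        return s * v

class DataProcessingModule:
    """Hypothetical software analogue of the data processing module of Fig. 2."""
    def __init__(self):
        self.ops = BasicOps()                       # shared by every sub-module

    def gradient(self, J, r):                       # gradient sub-module 52
        return self.ops.matvec(J.T, r)              # grad = J^T r for least squares

    def gauss_newton_vp(self, J, v):                # Gauss-Newton sub-module 54
        return self.ops.matvec(J.T, self.ops.matvec(J, v))

    def damping_term(self, delta, lam):             # damping-term sub-module 53
        return 0.5 * lam * float(delta @ delta)     # Tikhonov form assumed

    def curvature_vp(self, J, v, lam):              # used by the CG sub-module 55,
        # which calls the Gauss-Newton sub-module, as the description allows
        return self.ops.add(self.gauss_newton_vp(J, v), self.ops.scale(v, lam))
```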
Fig. 3 shows a general flow diagram of the apparatus for correlation according to the Hessian-Free training algorithm.
In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
In step S2, the operation starts, and the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2; according to the translated microinstruction, the direct memory access unit 1 reads all instructions related to the Hessian-Free calculation from the external address space and caches them in the instruction cache unit 2.
In step S3, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction the direct memory access unit 1 reads the initial parameter vector θ_0 to be updated from the external space into the data processing module 5.
In step S4, the controller unit 3 reads an assignment instruction from the instruction cache unit 2, and according to the translated microinstruction the data cache unit 4 initializes f~(δ_n); the number of iterations n in the data processing unit 5 is set to 0.
In step S5, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction the direct memory access unit 1 reads the parameter vector θ_n to be updated from the external space into the data processing module 5.
In step S6, the controller unit 3 reads, from the instruction cache unit 2, an instruction for second-order estimation of the error function near the current parameter vector value, and according to the translated microinstruction performs the operation of computing the second-order estimate f~(δ_n) of f(θ) near θ_n. In this operation, the instruction is sent to the operation control submodule 51, which sends corresponding instructions to perform the following operations: the gradient operation submodule 52 is used to calculate ∇f(θ_n); the matrix multiplication in the Gauss-Newton operation submodule 54 and the basic operation submodule 56 is used to obtain the Gauss-Newton matrix G_f of f at θ_n; the damping term operation submodule 53 and the basic operation submodule 56 are used to execute the LM heuristic method to obtain the damping coefficient λ and thus the damping term λR_{θ_n}(θ); finally, the expression of f~(δ_n) is obtained and stored in the data cache unit 4.
In step S7, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, and according to the translated microinstruction f~(δ_n) is transferred from the data cache unit 4 to the data processing unit 5.
In step S8, the controller unit 3 reads a parameter update operation instruction from the instruction cache unit 2 and, according to the translated microinstruction, performs the operation of solving for δ_n with the preconditioned conjugate gradient method so that f~(δ_n) reaches a minimum value, and of updating θ_n to θ_{n+1}. The direct memory access unit 1 reads the parameter vector θ_n to be updated from the external space into the data processing module 5. The operation control sub-module 51 controls the relevant operation modules to perform the following operations: the update vector δ_n is obtained using the conjugate gradient operation submodule 55 and the basic operation submodule 56; depending on the damping function R_{θ_n}(θ), the Gauss-Newton operation submodule may also need to be invoked (as in the previously mentioned RNN example). Finally, θ_n is updated to θ_{n+1} using vector addition in the basic operation submodule 56.
In step S9, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction the updated parameter vector θ_{n+1} is transferred from the data processing unit 5 to the external designated space via the direct memory access unit 1.
In step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit 2, and according to the translated microinstruction the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if it has converged, the operation ends; otherwise, the value of the iteration number n is increased by 1, and the process returns to step S5.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software carried on a non-transitory computer readable medium), or a combination of both. Although the processes or methods are described above in terms of some sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the invention have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of the invention as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Claims (10)
1. An apparatus for performing a Hessian-Free training algorithm, comprising:
the controller unit is used for decoding the read instruction into a microinstruction for controlling the corresponding module and sending the microinstruction to the corresponding module;
the data cache unit is used for storing intermediate variables in the operation process and executing initialization and updating operations on the intermediate variables;
and the data processing module is used for executing operation under the control of the controller unit and storing the intermediate variable in the data cache unit.
2. The apparatus of claim 1, wherein the data processing module comprises an operation control sub-module, a gradient operation sub-module, a damping term operation sub-module, a Gauss-Newton matrix operation sub-module, a conjugate gradient method operation sub-module and a basic operation sub-module; the basic operation sub-module performs basic operations such as addition and multiplication between matrices and vectors;
preferably, the gradient operation submodule, the damping term operation submodule, the gauss-newton matrix operation submodule and the conjugate gradient method operation submodule can call the basic operation submodule, and the gradient operation submodule, the damping term operation submodule, the gauss-newton matrix operation submodule and the conjugate gradient method operation submodule are allowed to call each other according to the situation.
3. The apparatus for performing a Hessian-Free training algorithm as in claim 1, wherein the data cache unit initializes the second-order estimate f~(δ_n) of f(θ) at device initialization; before the n-th update of the parameter vector θ_n to be updated starts, f~(δ_n) is read out to the data processing module, and after the update vector is obtained in the data processing module, the updated f~(δ_{n+1}) is written back; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, namely a function of the deviation between the actual value of the measurement result and the predicted value; δ_n is the update vector, and θ_{n+1} = θ_n + δ_n.
4. The apparatus for performing a Hessian-Free training algorithm as in claim 3, wherein, in the step of initializing f~(δ_n), the data cache unit initializes the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ); wherein the gradient ∇f(θ_n) refers to the gradient value of f at θ_n, and G_f is the Gauss-Newton matrix of f at θ_n; the damping function R_{θ_n}(θ) is determined in advance according to the training model and evaluated at θ_n; the damping coefficient λ is obtained by an LM-type heuristic method;
the data processing module reads f~(δ_n) from the data cache unit and reads the parameter vector θ_n to be updated from the external designated space; it obtains the update vector δ_n within the module, updates θ_n to θ_{n+1} and, correspondingly, f~(δ_n) to f~(δ_{n+1}); it then writes f~(δ_{n+1}) into the data cache unit and writes θ_{n+1} into the external designated space; wherein θ_{n+1} is the (n+1)-th parameter vector to be updated, and f~(δ_{n+1}) is the second-order estimate of f(θ) for the (n+1)-th iteration.
5. A method of performing a Hessian-Free training algorithm, comprising the steps of:
step (1), completing the initialization operation of the data cache unit through an instruction, namely initializing the second-order estimate f~(δ_n) of f(θ); wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, namely a function of the deviation between the actual value of the measurement result and the predicted value; δ_n is the update vector, and θ_{n+1} = θ_n + δ_n;
Step (2), completing the operation of reading the parameter vector to be updated from the external space by the direct memory access unit through an IO instruction;
step (3), according to corresponding instructions, the data processing module performs a second-order Taylor expansion of the error function f(θ) at θ_n and adds a damping term λR_{θ_n}(θ), obtaining an estimate f~(δ_n) of f(θ) near θ_n, namely

f~(δ_n) = M_{θ_n}(θ) + λR_{θ_n}(θ) = f(θ_n) + ∇f(θ_n)^T δ_n + (1/2) δ_n^T G_f δ_n + λR_{θ_n}(θ);

wherein G_f is the Gauss-Newton matrix of f at θ_n; the damping coefficient λ is obtained by an LM-type heuristic method; the damping function R_{θ_n}(θ) is determined in advance according to the training model and evaluated at θ_n;
step (4), according to the corresponding instruction, the data processing module uses the preconditioned conjugate gradient method to solve for δ_n so that f~(δ_n) reaches a minimum value, and θ_n is updated to θ_{n+1}; the specific update operation is:

θ_{n+1} = θ_n + δ_n;
step (5), the data processing unit judges whether the updated parameter vector has converged; if so, the operation is finished; otherwise, the process returns to step (2) and continues.
6. The method of performing a Hessian-Free training algorithm as claimed in claim 5, wherein said step of completing the initialization operation of the data cache unit in step (1) comprises: performing a zeroing operation on the gradient ∇f(θ_n), the Gauss-Newton matrix G_f, the damping coefficient λ and the damping function R_{θ_n}(θ).
7. The method of performing a Hessian-Free training algorithm as claimed in claim 5, wherein, when RNN training is performed in step (3), the damping function R_{θ_n}(θ) is constructed from a second distance function S, where S and f are both distance functions, G_S is the Gauss-Newton matrix of S at θ_n, and μ is a predetermined positive number.
8. The method of performing a Hessian-Free training algorithm as claimed in claim 5, wherein, in the step (4) of solving for δ_n with the preconditioned conjugate gradient method so that f~(δ_n) reaches a minimum, only a mini-batch rather than all samples is used during the preconditioned conjugate gradient method, and the Gauss-Newton matrix-vector products involved are computed implicitly and approximately, without forming G_f explicitly.
9. A method of performing a Hessian-Free training algorithm, comprising the steps of:
in step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.
Step S2, when the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated micro instruction, the direct memory access unit reads all instructions related to Hessian-Free calculation from the external address space and caches the instructions into the instruction cache unit;
step S3, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the direct memory access unit reads the initial parameter vector θ_0 to be updated from the external space into the data processing module;
step S4, the controller unit reads an assignment instruction from the instruction cache unit, and according to the translated microinstruction the data cache unit initializes f~(δ_n), and the iteration number n in the data processing unit is set to 0; wherein θ is the parameter vector to be updated, θ_n is the parameter vector to be updated at the n-th iteration, f(θ) is the error function, namely a function of the deviation between the actual value of the measurement result and the predicted value; δ_n is the update vector, θ_{n+1} = θ_n + δ_n, and f~(δ_n) is the second-order estimate of f(θ);
step S5, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the direct memory access unit reads the parameter vector θ_n to be updated from the external space and transmits it into the data processing module;
step S6, the controller unit reads, from the instruction cache unit, an instruction for second-order estimation of the error function near the current parameter vector value, and according to the translated microinstruction performs the operation of computing the second-order estimate f~(δ_n) of f(θ) near θ_n; in this operation, the instruction is sent to the operation control submodule, and the operation control submodule sends corresponding instructions to perform the following operations: the gradient operation submodule is used to calculate ∇f(θ_n); the matrix multiplication in the Gauss-Newton operation submodule and the basic operation submodule is used to obtain the Gauss-Newton matrix G_f of f at θ_n; the damping term operation submodule and the basic operation submodule are used to execute the LM heuristic method to obtain the damping coefficient λ and thus the damping term λR_{θ_n}(θ); finally, the expression of f~(δ_n) = M_{θ_n}(θ) + λR_{θ_n}(θ) is obtained and stored in the data cache unit; wherein the damping function R_{θ_n}(θ) is determined in advance according to the training model and evaluated at θ_n;
in step S7, the controller unit reads a data transfer instruction from the instruction cache unit, and according to the translated microinstruction f~(δ_n) is transmitted from the data cache unit to the data processing unit;
step S8, the controller unit reads a parameter updating operation instruction from the instruction cache unit, and according to the translated microinstruction performs the operation of solving for δ_n with the preconditioned conjugate gradient method so that f~(δ_n) reaches a minimum value, and of updating θ_n to θ_{n+1}; the direct memory access unit reads the parameter vector θ_n to be updated from the external space and transmits it into the data processing module; the operation control sub-module controls the relevant operation modules to perform the following operations: the update vector δ_n is obtained using the conjugate gradient operation submodule and the basic operation submodule; finally, θ_n is updated to θ_{n+1} by vector addition in the basic operation submodule;
step S9, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction the updated parameter vector θ_{n+1} is transmitted from the data processing unit to the external designated space through the direct memory access unit;
in step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction the data processing unit judges whether the updated parameter vector θ_{n+1} has converged: if it has converged, the operation is ended; otherwise, the value of the iteration number n is increased by 1, and the process returns to step S5.
10. A device for performing a Hessian-Free training algorithm, the device having a controller having a program embedded therein for performing the method of performing the Hessian-Free training algorithm of any of claims 5 to 9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610283885.XA CN107341540B (en) | 2016-04-29 | 2016-04-29 | Device and method for executing Hessian-Free training algorithm |
PCT/CN2016/081842 WO2017185413A1 (en) | 2016-04-29 | 2016-05-12 | Device and method for executing hessian-free training algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610283885.XA CN107341540B (en) | 2016-04-29 | 2016-04-29 | Device and method for executing Hessian-Free training algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107341540A true CN107341540A (en) | 2017-11-10 |
CN107341540B CN107341540B (en) | 2021-07-20 |
Family
ID=60160584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610283885.XA Active CN107341540B (en) | 2016-04-29 | 2016-04-29 | Device and method for executing Hessian-Free training algorithm |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107341540B (en) |
WO (1) | WO2017185413A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111626434A (en) * | 2020-05-15 | 2020-09-04 | 浪潮电子信息产业股份有限公司 | Distributed training parameter updating method, device, equipment and storage medium |
CN112990958A (en) * | 2021-01-19 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Data processing method, data processing device, storage medium and computer equipment |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934208B (en) * | 2019-04-22 | 2024-07-23 | 江苏邦融微电子有限公司 | Hardware acceleration system and method for fingerprint identification |
US11444483B2 (en) * | 2020-01-14 | 2022-09-13 | Hitachi Energy Switzerland Ag | Adaptive state estimation for power systems |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1658550A (en) * | 2004-04-16 | 2005-08-24 | 威盛电子股份有限公司 | Apparatus and method for performing cipher operation |
CN1834898A (en) * | 2005-05-16 | 2006-09-20 | 威盛电子股份有限公司 | Microprocessor apparatus and method for modular exponentiation |
CN102156637A (en) * | 2011-05-04 | 2011-08-17 | 中国人民解放军国防科学技术大学 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
US20140067738A1 (en) * | 2012-08-28 | 2014-03-06 | International Business Machines Corporation | Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization |
US20150161987A1 (en) * | 2013-12-06 | 2015-06-11 | International Business Machines Corporation | Systems and methods for accelerating hessian-free optimization for deep neural networks by implicit preconditioning and sampling |
WO2016037351A1 (en) * | 2014-09-12 | 2016-03-17 | Microsoft Corporation | Computing system for training neural networks |
-
2016
- 2016-04-29 CN CN201610283885.XA patent/CN107341540B/en active Active
- 2016-05-12 WO PCT/CN2016/081842 patent/WO2017185413A1/en active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1658550A (en) * | 2004-04-16 | 2005-08-24 | 威盛电子股份有限公司 | Apparatus and method for performing cipher operation |
CN1834898A (en) * | 2005-05-16 | 2006-09-20 | 威盛电子股份有限公司 | Microprocessor apparatus and method for modular exponentiation |
CN102156637A (en) * | 2011-05-04 | 2011-08-17 | 中国人民解放军国防科学技术大学 | Vector crossing multithread processing method and vector crossing multithread microprocessor |
US20140067738A1 (en) * | 2012-08-28 | 2014-03-06 | International Business Machines Corporation | Training Deep Neural Network Acoustic Models Using Distributed Hessian-Free Optimization |
US20150161987A1 (en) * | 2013-12-06 | 2015-06-11 | International Business Machines Corporation | Systems and methods for accelerating hessian-free optimization for deep neural networks by implicit preconditioning and sampling |
WO2016037351A1 (en) * | 2014-09-12 | 2016-03-17 | Microsoft Corporation | Computing system for training neural networks |
Non-Patent Citations (4)
Title |
---|
JAMES MARTENS: "Deep learning via Hessian-free optimization", 《PROCEEDINGS OF THE 27TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING》 * |
JAMES MARTENS等: "Training Deep and Recurrent Networks with Hessian-Free Optimization", 《NEURAL NETWORKS: TRICKS OFTHE TRADE》 * |
RYAN KIROS: "Training Neural Networks with Stochastic Hessian-Free Optimization", 《ARXIV:1301.3641V3 [CS.LG]》 * |
YUNJI CHEN等: "DaDianNao: A Machine-Learning Supercomputer", 《2014 47TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111626434A (en) * | 2020-05-15 | 2020-09-04 | 浪潮电子信息产业股份有限公司 | Distributed training parameter updating method, device, equipment and storage medium |
CN111626434B (en) * | 2020-05-15 | 2022-06-07 | 浪潮电子信息产业股份有限公司 | Distributed training parameter updating method, device, equipment and storage medium |
CN112990958A (en) * | 2021-01-19 | 2021-06-18 | 腾讯科技(深圳)有限公司 | Data processing method, data processing device, storage medium and computer equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107341540B (en) | 2021-07-20 |
WO2017185413A1 (en) | 2017-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102175044B1 (en) | Apparatus and method for running artificial neural network reverse training | |
CN111310904B (en) | Apparatus and method for performing convolutional neural network training | |
CN107341542B (en) | Apparatus and method for performing recurrent neural networks and LSTM operations | |
CN111353589B (en) | Apparatus and method for performing artificial neural network forward operations | |
CN111260025B (en) | Apparatus and method for performing LSTM neural network operation | |
CN109376861B (en) | Apparatus and method for performing full connectivity layer neural network training | |
WO2017185347A1 (en) | Apparatus and method for executing recurrent neural network and lstm computations | |
CN107341540B (en) | Device and method for executing Hessian-Free training algorithm | |
CN107341132B (en) | Device and method for executing AdaGrad gradient descent training algorithm | |
CN108334944B (en) | Artificial neural network operation device and method | |
WO2017185336A1 (en) | Apparatus and method for executing pooling operation | |
US20200097520A1 (en) | Apparatus and methods for vector operations | |
WO2017185257A1 (en) | Device and method for performing adam gradient descent training algorithm | |
WO2017185248A1 (en) | Apparatus and method for performing auto-learning operation of artificial neural network | |
WO2017185335A1 (en) | Apparatus and method for executing batch normalization operation | |
CN111860814B (en) | Apparatus and method for performing batch normalization operations | |
CN107315570B (en) | Device and method for executing Adam gradient descent training algorithm | |
CN107315569B (en) | Device and method for executing RMSprop gradient descent algorithm | |
CN111860772B (en) | Device and method for executing artificial neural network mapping operation | |
WO2017185256A1 (en) | Rmsprop gradient descent algorithm execution apparatus and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing Applicant after: Zhongke Cambrian Technology Co.,Ltd. Address before: 100190 room 644, comprehensive research building, No. 6 South Road, Haidian District Academy of Sciences, Beijing Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd. |
|
GR01 | Patent grant | ||
TG01 | Patent term adjustment |