WO2017185413A1

WO2017185413A1 - Device and method for executing hessian-free training algorithm

Info

Publication number: WO2017185413A1
Application number: PCT/CN2016/081842
Authority: WO
Inventors: 张士锦; 郭崎; 陈天石; 陈云霁
Original assignee: 北京中科寒武纪科技有限公司
Priority date: 2016-04-29
Filing date: 2016-05-12
Publication date: 2017-11-02
Also published as: CN107341540A; CN107341540B

Abstract

Provided are a device and method for executing an Hessian-Free training algorithm. The device comprises a data access unit, a controller unit, a data processing unit, and a data buffer module. The device can be used to realize an Hessian-Free training algorithm, thereby performing training on various neural networks such as an automatic encoder and a convolutional neural network (RNN). In each iteration, a second order Taylor series expansion is performed on an error function (objective function), and a damping term is added, so as to approximate the objective function. Afterwards, a preconditioned conjugate gradient method is used to obtain an updated vector according to a current gradient, a Gauss-Newton matrix, a damping function, and a damping coefficient so as to update a parameter to be updated. The iteration is repeated until a vector of the parameter to be updated converges.

Description

Apparatus and method for executing Hessian-Free training algorithm

Technical field

The present invention relates to the field of neural network computing technologies, and more particularly to an apparatus and method for performing a Hessian-Free training algorithm.

Background technique

The gradient descent method is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing. At present, the training method of the mainstream neural network is the gradient descent method (in combination with the BP algorithm), but this method ignores the curvature information of the error function, which is not only prone to excessively flat parameter changes, but also cannot converge to the local best advantage, and cannot It is very good to deal with the error function of "morbid curvature" (such as Rosenbrock function). The Hessian-Free training algorithm solves this problem very well, and through some detail improvements, there is no case where the amount of computation increases with respect to the square of the number of parameters (linear growth like the gradient descent method).

Currently, a known method of performing the Hessian-Free training algorithm is to use a general purpose processor. The method supports the above algorithm by executing general purpose instructions using a general purpose register file and generic functions. One of the disadvantages of this approach is that the performance of a single general purpose processor is low. When multiple general-purpose processors are executed in parallel, communication between general-purpose processors becomes a performance bottleneck. In addition, the general-purpose processor needs to decode the correlation operation corresponding to the Hessian-Free training algorithm into a long-column operation and a fetch instruction sequence, and the processor front-end decoding brings a large power consumption overhead.

Another known method of performing the Hessian-Free training algorithm is to use a graphics processing unit (GPU). The method supports the above algorithm by executing a generic SIMD instruction using a general purpose register file and a generic stream processing unit. Since the GPU is a device specially used for performing graphics and computational operations and scientific calculations, without the special support for the Hessian-Free training algorithm related operations, a large amount of front-end decoding work is still required to perform related operations in the Hessian-Free training algorithm. A lot of extra overhead. In addition, the GPU has only a small on-chip cache, and the data required in the operation (such as Gauss-Newton matrix) needs to be repeatedly transferred from off-chip. The off-chip bandwidth becomes the main performance bottleneck, and brings huge power consumption overhead.

Summary of the invention

In view of the above, it is an object of the present invention to provide an apparatus and method for executing a Hessian-Free training algorithm with a view to solving at least one of the above technical problems.

In order to achieve the above object, as an aspect of the present invention, the present invention provides an apparatus for executing a Hessian-Free training algorithm, comprising:

a controller unit, configured to decode the read instruction into a micro instruction that controls the corresponding module, and send the instruction to the corresponding module;

a data buffer unit, configured to store intermediate variables in the operation process, and perform initialization and update operations on the intermediate variables;

And a data processing module, configured to perform an operation operation under the control of the controller unit, and store the intermediate variable in the data cache unit.

The data processing module includes an operation control sub-module, a gradient operation sub-module, a damping term operation sub-module, a Gauss-Newton matrix operation sub-module, a conjugate gradient method operation sub-module, and a basic operation sub-module; wherein the basic The operation sub-module performs basic operations such as adding, subtracting, multiplying, and dividing between matrices and vectors;

Preferably, the gradient operation submodule, the damping term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate gradient method operation submodule are all capable of calling the basic operation submodule, and the gradient operation submodule according to the situation The damping term operation sub-module, the Gauss-Newton matrix operation sub-module, and the conjugate gradient method operation sub-module are allowed to be called each other.

Wherein the data buffer unit initializes a second-order estimate of f(θ) at device initialization

Before the update of the nth update parameter vector θ _n is started,

Read out into the data processing module and get the update vector in the data processing module

Write again; where θ is the parameter vector to be updated, θ _n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the function that deviates from the actual value of the measured result; δ _n is the update vector And θ _n+1 = θ _n + δ _n .

Wherein the data buffer unit is initialized

In the step of initializing the gradient

Gauss-Newtonian matrix G _f , damping coefficient λ and damping function

among them,

The gradient

F refers to the value of the gradient at θ _n, G _f is the f θ _n at the Gauss - Newton matrix; damping function

The value of the function at θ _n is determined according to the training model; the damping coefficient λ is obtained by the LM-type heuristic method;

The data processing module reads from the data cache unit

Reading the parameter vector θ _n to be updated from the external specified space; obtaining the update vector δ _n in the module, updating θ _n to θ _n+1 , corresponding

Updated to

followed by

Writing to the data buffer unit, writing θ _n+1 into the external designated space; wherein θ _n+1 is the n+1th parameter vector to be updated,

Is a second-order estimate of f(θ+1).

As another aspect of the present invention, the present invention also provides a method for executing a Hessian-Free training algorithm, comprising the following steps:

Step (1), through the instruction, complete the initialization operation of the data buffer unit, that is, initialize the second-order estimation of f(θ)

Where θ is the parameter vector to be updated, θ _n is the nth parameter vector to be updated, f(θ) is the error function, that is, the function that deviates from the actual value of the measured result; δ _n is the update vector, and θ _{n +1} = θ _n + δ _n ;

Step (2), through the IO instruction, completing an operation of the data access unit reading the parameter vector to be updated from the external space;

Step (3), the data processing module performs a second-order Taylor expansion on the error function f(θ) at θ _n according to the corresponding instruction, and adds a damping term

Obtain an estimate of f(θ) around θ _n

which is

Where G _f is a Gauss-Newton matrix of f at θ _n ; the damping coefficient λ is obtained by the LM-like heuristic method; the damping function

Is a value of θ _n predetermined by a predetermined function according to the training model;

Step (4), the data processing module performs a preconditioned conjugate gradient method to obtain δ _n according to the corresponding instruction.

The minimum value is reached and θ _{n is} updated to θ _n+1 . The specific update operation is:

θ _n+1 = θ _n + δ _n ;

In step (5), the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.

The step of completing the initialization operation of the data buffer unit in the step (1) includes:

Gauss-Newtonian matrix G _f , damping coefficient λ and damping function

Perform a zeroing operation.

Wherein, in step (3), when performing RNN training, the damping function

Where S and f are both distance functions, G _S is the Gauss-Newton matrix of S at θ _n , and μ is a predetermined positive number.

Wherein, the preconditioned conjugate gradient method is performed to obtain δ _{n as} described in step (4).

In the step of reaching the minimum value, in the pre-conditioned conjugate gradient method, only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector involved in the operation is passed.

Do implicit approximation calculations.

As still another aspect of the present invention, the present invention also provides a method for executing a Hessian-Free training algorithm, which comprises the following steps:

In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.

Step S2, the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all the instructions related to the Hessian-Free calculation from the external address space. And cache it into the instruction cache unit;

Step S3, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the initial parameter vector θ ₀ to be updated from the external space into the data processing module;

Step S4, the controller unit reads the assignment instruction from the instruction cache unit, according to the translated microinstruction, in the data cache unit

Initialization, the number of iterations n in the data processing unit is set to 0; where θ is the parameter vector to be updated, θ _n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the actual value of the measurement result and a function of the predicted value deviation; δ _n is the update vector, and θ _n+1 = θ _n + δ _n ;

a second order estimate of f(θ);

Step S5, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the parameter vector θ _n to be updated from the external space and transmits it to the data processing module;

Step S6, the controller unit reads, from the instruction buffer unit, an instruction for performing second-order estimation of the error function near the current parameter vector value, and performs a second-order estimation of f(θ) near θ _n according to the translated micro-instruction.

Operation; in this operation, the instruction is sent to the operation control sub-module, and the operation control sub-module sends the corresponding instruction to perform the following operations: calculation using the gradient operation sub-module

Using the Gauss-Newton submodule and the matrix multiplication in the basic operation submodule, the Gauss-Newton matrix G _{f of} f at θ _n is obtained; the LM heuristic method is used to obtain the damping by the damping term operation submodule and the basic operation submodule. Coefficient λ, which in turn gives the damping term

Finally, pass

get

The expression is stored in the data cache unit; where the damping function

Step S7, the controller unit reads a data transfer instruction from the instruction cache unit, according to the translated microinstruction,

Transfer from the data buffer unit to the data processing unit;

Step S8, the controller unit reads a parameter update operation instruction from the instruction cache unit, and performs δ _n using a preconditioned conjugate gradient method according to the translated micro instruction.

The minimum value is reached, and θ _{n is} updated to θ _n+1 ; the data access unit reads the parameter vector θ _n to be updated from the external space and passes it to the data processing module; the operation control sub-module controls the related operation module to perform the following operations; The update vector δ _{n is} obtained by using the conjugate gradient operation sub-module and the basic operation sub-module; finally, θ _{n is} updated to θ _n+1 by vector addition in the basic operation sub-module;

Step S9, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the updated parameter vector θ _{n+1 is} transmitted from the data processing unit to the external designated space through the data access unit;

Step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction, the data processing unit determines whether the updated parameter vector θ _n+1 converges: if convergence, the operation ends; otherwise, the number of iterations The value of n is incremented by 1, and the process returns to step S5.

As still another aspect of the present invention, the present invention also provides an apparatus for executing a Hessian-Free training algorithm in which a program for executing the method of executing the Hessian-Free training algorithm as described above is solidified.

Based on the above technical solutions, the device and method of the present invention have the following beneficial effects: the device can implement the Hessian-Free training algorithm, and complete training on various neural networks, such as auto-encoders and convolutional nerves. Network (RNN), etc.; by using a device dedicated to the implementation of the Hessian-Free training algorithm, it is possible to solve the problem that the general-purpose processor of the data is insufficient in performance, and the decoding cost of the previous stage is large, and the execution speed of the related application is accelerated; The application of the cache unit avoids repeatedly reading data into the memory and reduces the bandwidth of the memory access.

DRAWINGS

1 is a block diagram showing an overall structure of an apparatus for implementing a Hessian-Free training algorithm related application according to an embodiment of the present invention;

2 is a block diagram showing an example of a data processing module in an apparatus for implementing a Hessian-Free training algorithm related application, in accordance with an embodiment of the present invention;

FIG. 3 is a flowchart of operations for implementing a Hessian-Free training algorithm related application according to an embodiment of the invention.

detailed description

The present invention will be further described in detail below with reference to the specific embodiments of the invention, Through the following detailed description Other aspects, advantages, and salient features of the invention will be apparent to those skilled in the art.

In the present specification, the following various embodiments for describing the principles of the present invention are merely illustrative and should not be construed as limiting the scope of the invention. The following description of the invention is intended to be understood as The description below includes numerous specific details to assist the understanding, but these details should be considered as merely exemplary. Accordingly, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. In addition, the same reference numerals are used throughout the drawings for similar functions and operations.

The invention discloses an apparatus for executing a Hessian-Free training algorithm, comprising an instruction buffer unit, an instruction decoding unit, a data access unit, a data processing module and a data buffer module. The device can implement the Hessian-Free training algorithm and complete training on various neural networks, such as auto-encoders and convolutional neural networks (RNN). At each iteration, a second-order Taylor expansion is performed on the error function (objective function), and a damping term is added as an estimate of the objective function, followed by the current gradient, Gauss-Newtonian matrix, damping function, and damping coefficient ( Dampling constant), the preconditioned conjugate gradient method (PreconditioningCG-Minimize) is used to obtain the update vector, and the parameters to be updated are updated. The iteration continues until the parameter vector to be updated converges.

More specifically, the apparatus of the present invention includes a direct memory control unit, an instruction cache unit, a controller unit, a data buffer unit, and a data processing module. The data access unit can access the external address space, can read and write data to each cache unit in the device, and complete loading and storing of the data, and specifically includes reading the instruction to the instruction cache unit, and reading the parameter to be updated from the specified storage unit. And the corresponding gradient value to the data processing unit, the updated parameter vector is directly written from the data processing module to the external designated space; the instruction cache unit reads the instruction through the data access unit, and caches the read instruction; the controller unit slave instruction Reading instructions in the cache unit, decoding the instructions into micro-instructions that control the behavior of other modules and transmitting them to other modules such as data access units, data buffer units, data processing modules, etc.; Intermediate variables, and initialize and update these variables; The data processing module performs corresponding arithmetic operations according to the instructions.

In addition, the present invention also discloses a method for executing a Hessian-Free training algorithm, which includes the following steps:

Specifically, it is the gradient to it.

Gauss-Newtonian matrix G _f , damping coefficient λ and damping function

Perform a zeroing operation.

Step (2), through the IO instruction, completes an operation of the data access unit reading the parameter vector to be updated from the external space.

Obtain an estimate of f(θ) around θ _n

which is

Where G _f is the Gauss-Newton matrix of f at θ _n ; δ _n is the update vector; the damping coefficient λ is obtained by the method of Levenburg-Marquardt style heuristics; the damping function

Is the value of the function at θ _n predetermined according to the training model, such as when performing RNN training,

S is similar to f, which is a distance function, G _S is the Gauss-Newton matrix of S at θ _n , and μ (weighting constant) is a predetermined positive number.

The minimum value is reached and the operation of θ _{n is} updated to θ _n+1 . The update operation is as follows:

θ _n+1 = θ _n + δ _n ;

It is worth mentioning that in the process of using the preconditioned conjugate gradient method, only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector involved in the operation is passed.

Do an implicit approximation (Pearlmutter's R{}-method). This not only improves the efficiency of big data learning or improves the computational efficiency of the data operation module, but also avoids the case where the amount of computation increases with the square of the parameter number.

The apparatus of the Hessian-Free training algorithm implemented according to an embodiment of the present invention can be used to support applications using the Hessian-Free training algorithm. In the data buffer unit, a second-order estimation of a spatial storage error function near each of the parameters to be updated is opened. Each time a preconditioned conjugate gradient method is performed, an update vector is calculated by using the second-order estimation, and then updated. Vector update operation. Repeat the above steps until the vector to be updated converges.

The technical solutions of the present invention with reference to the accompanying drawings are further explained below.

1 shows an example block diagram of an overall structure of an apparatus for implementing a Hessian-Free training algorithm in accordance with an embodiment of the present invention. As shown in FIG. 1, the apparatus includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4, and a data processing module 5, all of which can be implemented by hardware circuits.

The data access unit 1 can access the external address space, can read and write data to each cache unit inside the device, and complete data loading and storage. Specifically, the instruction is read from the instruction cache unit 2, the parameter to be updated is read from the specified storage unit to the data processing unit 5, and the updated parameter vector is directly written from the data processing module 5 to the external designated space.

The instruction cache unit 2 reads the instruction through the data access unit 1 and caches the read instruction.

The controller unit 3 reads the instructions from the instruction cache unit 2, decodes the instructions into micro-instructions that control the behavior of other modules, and transmits them to other modules such as the data access unit 1, the data buffer unit 4, the data processing module 5, and the like.

Data buffer unit 4 is initialized at device initialization

Specifically, it is to initialize the gradient

Gauss-Newtonian matrix G _f , damping coefficient λ and damping function

Before the update of the nth update parameter vector θ _n starts,

Read out into the data processing module 5. The update vector δ _n is obtained in the data processing module 5, and θ _{n is} updated to θ _n+1 , corresponding

Updated to

followed by

Write to data cache unit 4 (new data will be overwritten by the previous corresponding data) for the next use.

The data processing module 5 reads from the data buffer unit 4

The parameter vector θ _n to be updated is read from the external designated space by the data access unit 1. The update vector δ _n is obtained in the module, and θ _{n is} updated to θ _n+1 , corresponding

Updated to

followed by

It is written to the data buffer unit 4, and θ _n+1 is written into the external designated space through the data access unit 1.

2 illustrates an example block diagram of a data processing module in an apparatus for implementing a Hessian-Free training algorithm related application in accordance with an embodiment of the present invention. As shown in FIG. 2, the data processing module includes an operation control sub-module 51, a gradient operation sub-module 52, a damping term operation sub-module 53, a Gauss-Newton matrix operation sub-module 54, a conjugate gradient method operation sub-module 55, and a basic operation unit. Module 56. Among them, the basic operation sub-module 56 performs basic operations such as matrix and vector multiplication, and the 52, 53, 54, 55 sub-modules will call 56 sub-modules, and depending on the situation, these modules are also allowed to call each other. .

Figure 3 shows a general flow diagram of an apparatus for performing correlation operations in accordance with the Hessian-Free training algorithm.

In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.

In step S2, the operation starts, the control unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads all the Hessian-Free calculations from the external address space. All instructions are buffered into the instruction cache unit 2.

In step S3, the controller unit 3 reads an IO instruction from the instruction buffer unit 2, and according to the translated microinstruction, the data access unit 1 reads the initial parameter vector θ ₀ to be updated from the external space into the data processing module 5.

In step S4, the controller unit 3 reads the assignment instruction from the instruction cache unit 2, according to the translated microinstruction, in the data buffer unit 4.

Initialization, the number n of iterations in the data processing unit 5 is set to zero.

In step S5, the controller unit 3 reads an IO command from the instruction cache unit 2, and according to the translated microinstruction, the data access unit 1 reads the parameter vector θ _n to be updated from the external space and passes it to the data processing module 5.

In step S6, the controller unit 3 reads, from the instruction buffer unit 2, an instruction for performing second-order estimation of the error function near the current parameter vector value, and based on the translated micro-instruction, performs f(θ) near θ _n . Second order estimate

Operation. In this operation, the instruction is sent to the arithmetic control sub-module 51, and the arithmetic control sub-module 51 sends the corresponding instruction to perform the following operations: using the gradient operation sub-module 52 to calculate

Using the matrix multiplication operation in the Gauss-Newton operation sub-module 54 and the basic operation sub-module 56, the Gauss-Newton matrix Gf of _f at θ _n is obtained; the LM inspiration is performed by the damping term operation sub-module 53 and the basic operation sub-module 56. Method to obtain the damping coefficient λ, and then obtain the damping term

Finally, get

The expression is stored in the data cache unit 4.

In step S7, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, according to the translated microinstruction,

It is transferred from the data buffer unit 4 to the data processing unit 5.

In step S8, the controller unit 3 reads a parameter update operation instruction from the instruction cache unit 2, and based on the translated micro-instruction, performs δ _n using a preconditioned conjugate gradient method.

The minimum value is reached and the operation of θ _{n is} updated to θ _n+1 . The data access unit 1 reads the parameter vector θ _n to be updated from the external space and passes it to the data processing module 5. The operation control sub-module 51 controls the correlation operation module to perform an operation of obtaining the update vector δ _n using the conjugate gradient operation sub-module 55 and the basic operation sub-module 56. Among them, according to the damping function

The expression may also need to call the Gauss Newton-arithmetic module (such as the RNN example mentioned earlier). Finally, θ _{n is} updated to θ _n+1 using vector addition in the basic operation sub-module 56.

In step S9, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the translated microinstruction, the updated parameter vector θ _{n+1 is} transmitted from the data processing unit 5 through the data access unit 1 to the external designated space. .

In step S10, the controller unit reads a convergence determination instruction from the instruction cache unit 2, and according to the translated microinstruction, the data processing unit determines whether the updated parameter vector θ _n+1 converges: if convergence, the operation ends; otherwise, The value of the number of iterations n is incremented by 1, and the process returns to step S5.

The processes or methods depicted in the preceding figures may be by hardware (eg, circuitry, dedicated logic, etc.), firmware, software (eg, software carried on a non-transitory computer readable medium), or both. The combined processing logic is executed. Although the processes or methods have been described above in some order, it should be understood that certain operations described can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.

In the foregoing specification, various embodiments of the invention have been described with reference It is apparent that various modifications may be made to the various embodiments without departing from the spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded as

Claims

An apparatus for executing a Hessian-Free training algorithm, comprising:

a controller unit, configured to decode the read instruction into a micro instruction that controls the corresponding module, and send the instruction to the corresponding module;

a data buffer unit, configured to store intermediate variables in the operation process, and perform initialization and update operations on the intermediate variables;

And a data processing module, configured to perform an operation operation under the control of the controller unit, and store the intermediate variable in the data cache unit.
The apparatus for executing a Hessian-Free training algorithm according to claim 1, wherein the data processing module comprises an operation control sub-module, a gradient operation sub-module, a damping term operation sub-module, and a Gauss-Newton matrix operator. a module, a conjugate gradient method operation sub-module and a basic operation sub-module; wherein the basic operation sub-module performs basic operations such as addition, subtraction, multiplication and division between a matrix and a vector;

Preferably, the gradient operation submodule, the damping term operation submodule, the Gauss-Newton matrix operation submodule, and the conjugate gradient method operation submodule are all capable of calling the basic operation submodule, and the gradient operation submodule according to the situation The damping term operation sub-module, the Gauss-Newton matrix operation sub-module, and the conjugate gradient method operation sub-module are allowed to be called each other.
The apparatus for executing a Hessian-Free training algorithm according to claim 1, wherein said data buffer unit initializes a second-order estimate of f(θ) at device initialization.
Before the update of the nth update parameter vector θ n is started,
Read out into the data processing module and get the update vector in the data processing module
Write again; where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the function that deviates from the actual value of the measured result; δ n is the update vector And θ n+1 = θ n + δ n .
The apparatus for executing a Hessian-Free training algorithm according to claim 3, wherein said data buffer unit is initialized
In the step of initializing the gradient
Gauss-Newtonian matrix G f , damping coefficient λ and damping function
among them,
The gradient
F refers to the value of the gradient at θ n, G f is the f θ n at the Gauss - Newton matrix; damping function
The value of the function at θ n is determined according to the training model; the damping coefficient λ is obtained by the LM-type heuristic method;

The data processing module reads from the data cache unit
Reading the parameter vector θ n to be updated from the external specified space; obtaining the update vector δ n in the module, updating θ n to θ n+1 , corresponding
Updated to
followed by
Writing to the data buffer unit, writing θ n+1 into the external designated space; wherein θ n+1 is the n+1th parameter vector to be updated,
Is a second-order estimate of f(θ+1).
A method for performing a Hessian-Free training algorithm, comprising the steps of:

Step (1), through the instruction, complete the initialization operation of the data buffer unit, that is, initialize the second-order estimation of f(θ)
Where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, f(θ) is the error function, that is, the function that deviates from the actual value of the measured result; δ n is the update vector, and θ n +1 = θ n + δ n ;

Step (2), through the IO instruction, completing an operation of the data access unit reading the parameter vector to be updated from the external space;

Step (3), the data processing module performs a second-order Taylor expansion on the error function f(θ) at θ n according to the corresponding instruction, and adds a damping term
Obtain an estimate of f(θ) around θ n
which is

Where G f is a Gauss-Newton matrix of f at θ n ; the damping coefficient λ is obtained by the LM-like heuristic method; the damping function
Is a value of θ n predetermined by a predetermined function according to the training model;

Step (4), the data processing module performs a preconditioned conjugate gradient method to obtain δ n according to the corresponding instruction.
The minimum value is reached and θ n is updated to θ n+1 . The specific update operation is:

θ n+1 = θ n + δ n ;

In step (5), the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
The method for performing a Hessian-Free training algorithm according to claim 5, wherein the step of completing the initialization operation of the data buffer unit in the step (1) comprises:
Gauss-Newtonian matrix G f , damping coefficient λ and damping function
Perform a zeroing operation.
The method for performing a Hessian-Free training algorithm according to claim 5, wherein when performing RNN training in step (3),

Damping function

Where S and f are both distance functions, G S is the Gauss-Newton matrix of S at θ n , and μ is a predetermined positive number.
A method of executing a Hessian-Free training algorithm according to claim 5, wherein said pre-conditional conjugate gradient method is performed to obtain δ n as described in step (4).
In the step of reaching the minimum value, in the pre-conditioned conjugate gradient method, only "mini-batch" is used instead of all samples, and the Gauss-Newton matrix multiplication vector involved in the operation is passed.
Do implicit approximation calculations.
A method for performing a Hessian-Free training algorithm, comprising the steps of:

In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit.

Step S2, the operation starts, the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the translated microinstruction, the data access unit reads all the instructions related to the Hessian-Free calculation from the external address space. And cache it into the instruction cache unit;

Step S3, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the initial parameter vector θ 0 to be updated from the external space into the data processing module;

Step S4, the controller unit reads the assignment instruction from the instruction cache unit, according to the translated microinstruction, in the data cache unit
Initialization, the number n of iterations in the data processing unit is set to 0; where θ is the parameter vector to be updated, θ n is the nth parameter vector to be updated, and f(θ) is the error function, that is, the actual value of the measurement result and a function of the predicted value deviation; δ n is the update vector, and θ n+1 = θ n + δ n ;
a second order estimate of f(θ);

Step S5, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the data access unit reads the parameter vector θ n to be updated from the external space and transmits it to the data processing module;

Step S6, the controller unit reads, from the instruction buffer unit, an instruction for performing second-order estimation of the error function near the current parameter vector value, and performs a second-order estimation of f(θ) near θ n according to the translated micro-instruction.
Operation; in this operation, the instruction is sent to the operation control sub-module, and the operation control sub-module sends the corresponding instruction to perform the following operations: calculation using the gradient operation sub-module
Using the Gauss-Newton submodule and the matrix multiplication in the basic operation submodule, the Gauss-Newton matrix G f of f at θ n is obtained; the LM heuristic method is used to obtain the damping by the damping term operation submodule and the basic operation submodule. Coefficient λ, which in turn gives the damping term
Finally, pass

get
The expression is stored in the data cache unit; where the damping function
Is a value of θ n predetermined by a predetermined function according to the training model;

Step S7, the controller unit reads a data transfer instruction from the instruction cache unit, according to the translated microinstruction,
Transfer from the data buffer unit to the data processing unit;

Step S8, the controller unit reads a parameter update operation instruction from the instruction cache unit, and performs δ n using a preconditioned conjugate gradient method according to the translated micro instruction.
The minimum value is reached, and θ n is updated to θ n+1 ; the data access unit reads the parameter vector θ n to be updated from the external space and passes it to the data processing module; the operation control sub-module controls the related operation module to perform the following operations; The update vector δ n is obtained by using the conjugate gradient operation sub-module and the basic operation sub-module; finally, θ n is updated to θ n+1 by vector addition in the basic operation sub-module;

Step S9, the controller unit reads an IO instruction from the instruction cache unit, and according to the translated microinstruction, the updated parameter vector θ n+1 is transmitted from the data processing unit to the external designated space through the data access unit;

Step S10, the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the translated microinstruction, the data processing unit determines whether the updated parameter vector θ n+1 converges: if convergence, the operation ends; otherwise, the number of iterations The value of n is incremented by 1, and the process returns to step S5.
An apparatus for executing a Hessian-Free training algorithm in which a program for performing the Hessian-Free training algorithm according to any one of claims 5 to 9 is solidified.