CN114548400A - Rapid flexible holomorphic embedding neural network wide-area optimization training method - Google Patents
- Publication number: CN114548400A (application CN202210125273.3A)
- Authority: CN (China)
- Prior art keywords: training, function, neural network, activation function, data
- Prior art date: 2022-02-10
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/08 Computing arrangements based on biological models; neural networks; learning methods
- G06N3/084 Learning methods; backpropagation, e.g. using gradient descent
- G06N3/045 Architecture, e.g. interconnection topology; combinations of networks
- G06N3/048 Activation functions
- G06F17/13 Complex mathematical operations for solving equations; differential equations
Abstract
The invention provides a fast, flexible, holomorphic-embedding neural network wide-area optimization training method comprising the following steps: step 1, determine the differential equation to be solved and sample in its domain of definition to obtain training data and test data; step 2, construct a neural network model and embed activation function layers based on piecewise rational approximation; step 3, tune the hyper-parameters and train the neural network model; step 4, perform model prediction; if the prediction result meets the requirement, the model training has succeeded and training ends; otherwise, return to step 3. An activation function constructed by the piecewise rational approximation method outperforms common activation functions in both training time and training accuracy, providing a powerful tool for quickly and accurately solving the high-dimensional partial differential equations involved in practical engineering computation tasks.
Description
Technical Field
The invention relates to the technical field of information science and engineering computation, and in particular to a fast, flexible, holomorphic-embedding neural network wide-area optimization training method.
Background
Partial differential equations are widely used throughout natural science and engineering, for example in oil and gas exploration, bridge design, and machine manufacturing. In complex scenarios, however, analytical solutions are rarely available, so numerical methods such as finite differences, finite elements, and finite volumes are commonly used. These traditional methods must divide the region into many grid cells to approximate the solution space of the partial differential equation; when the dimension is high the number of grid cells explodes and the computational cost becomes enormous. Solving the partial differential equation with a neural network (NN) requires no mesh generation: random samples from the domain serve as model inputs, avoiding the curse of dimensionality.
In the past decade, deep neural networks (DNNs) have developed into a fundamental technology and key tool of machine learning. Research has found that their performance exceeds that of traditional statistical learning techniques (such as kernel methods, support vector machines, and random forests) in many practical applications, including image classification, speech recognition, image segmentation, and medical imaging.
A neural network is a complex network system formed by a large number of simple processing units (neurons) that are widely interconnected. It reflects many basic features of human brain function and is a highly complex nonlinear dynamical learning system. A neural network has four basic features:
(i) Nonlinearity: nonlinear relationships are a common property of nature, and the intelligence of the brain is itself a nonlinear phenomenon. An artificial neuron is in one of two states, activated or inhibited, and this behavior is mathematically a nonlinear relationship. Networks of thresholded neurons perform better, with improved fault tolerance and storage capacity.
(ii) Non-locality: a neural network is typically formed from many widely connected neurons. The overall behavior of the system depends not only on the characteristics of the individual neurons but chiefly on the interactions and interconnections between units; the large number of connections between cells simulates the non-locality of the brain. Associative memory is a representative example.
(iii) Non-constancy: an artificial neural network has adaptive, self-organizing, and self-learning capabilities. It not only processes information that may itself change in many ways; the nonlinear dynamical system also evolves continuously while processing it. Iterative processes are often used to describe the evolution of such dynamic systems.
(iv) Non-convexity: under certain conditions the evolution of a system depends on a particular state function, for example an energy function whose extrema correspond to comparatively stable states of the system. Non-convexity means the function has multiple extrema, so the system has multiple stable equilibria, which leads to diversity in the system's evolution.
The activation function plays an important role in an artificial neural network model's ability to learn and represent complex, generally highly nonlinear, relationships: it introduces nonlinearity into the network. In a neuron, the inputs are weighted, summed, and passed through a function, and that function is the activation function. By introducing a nonlinear factor into each neuron, the activation function enables the neural network to approximate arbitrary nonlinear functions and thus to be applied to many nonlinear models.
There is no clear guiding theory for choosing an activation function; the usual choices are the ReLU function, the Sigmoid function, and the hyperbolic tangent function. Existing activation functions are typically one of these three or variants of them (e.g., with one or two trainable parameters). Their advantages and disadvantages are:
(i) The ReLU function is the most commonly used activation function in modern neural networks, and most feed-forward networks use it by default. Its advantages are fast convergence and freedom from gradient saturation and vanishing gradients in the region x > 0. Its drawbacks are equally evident: because ReLU is identically zero on the negative axis, neurons can die; the gradient of such a neuron and of everything behind it is permanently zero, so they cannot be updated in subsequent training rounds. Moreover, since the second and higher derivatives of ReLU are zero on both the positive and negative axes, the neural network model cannot be trained effectively in some special applications (such as solving differential equations with a neural network).
(ii) The Sigmoid function has the advantages that its output lies in (0, 1), optimization is stable, and the function is continuous and easy to differentiate; its disadvantage is saturation: when the absolute value of the input is large, the output barely responds to changes in the input.
(iii) The hyperbolic tangent function can be regarded as a variant of the Sigmoid function and still suffers from gradient saturation.
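As a concrete illustration of point (i) above (not part of the patent itself), the following minimal PyTorch sketch shows that the second derivative of ReLU vanishes everywhere it is defined, while tanh retains nonzero curvature, which is exactly the property needed when a network output must be differentiated twice with respect to its input:

```python
import torch

x = torch.linspace(-2.0, 2.0, 5, requires_grad=True)

def second_derivative(f, x):
    y = f(x)
    g1 = torch.autograd.grad(y.sum(), x, create_graph=True)[0]  # dy/dx
    g2 = torch.autograd.grad(g1.sum(), x)[0]                    # d2y/dx2
    return g2

print(second_derivative(torch.relu, x))  # zeros everywhere: no curvature signal
print(second_derivative(torch.tanh, x))  # smooth, nonzero second derivative
```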
Disclosure of Invention
In order to solve these problems, the invention provides a fast, flexible, holomorphic-embedding neural network wide-area optimization training method with strong expressive power, good smoothness, and convenient computation.
In order to achieve the purpose, the invention is realized by the following technical scheme:
The invention relates to a fast, flexible, holomorphic-embedding neural network wide-area optimization training method, which comprises the following steps:
step 1, determining a differential equation to be solved, and sampling in a defined domain to obtain training data and test data;
step 2, constructing a neural network model, and embedding an activation function layer based on piecewise rational approximation;
step 3, adjusting the hyper-parameters and training a neural network model;
step 4, model prediction is carried out, if the prediction result meets the requirement, the model training is successful, and the training is finished;
otherwise, returning to the step 3.
The invention is further improved in that: the differential equation in step 1 is a Burgers equation.
The invention is further improved in that: the neural network model constructed in the step 2 comprises an input layer, four full-connection layers, four activation function layers and an output layer.
The invention is further improved in that: the construction of the activation function of the piecewise rational approximation in step 2 is as follows:
Suppose that at some point $x_0$ the function $f(x)$ is approximated using a single-point Padé approximation of the form

$$r_{[L/M]}(x)=\frac{p_0+p_1x+\cdots+p_Lx^L}{1+q_1x+\cdots+q_Mx^M},\tag{1}$$

where $p_k$ and $q_k$ are coefficients to be determined, $L$ is the highest order of $x$ in the numerator, and $M$ is the highest order of $x$ in the denominator. With $L+M$ held constant, take $L=M$; the numerator and denominator are then solved as follows. Let $L=M=n$ and write $c_k$ for the $k$-th Taylor coefficient of $f$ at $x_0$. First solve the linear system $Aq=b$ to obtain $(q_1,q_2,q_3,\dots,q_n)$, where

$$A=\begin{pmatrix}c_n&c_{n-1}&\cdots&c_1\\c_{n+1}&c_n&\cdots&c_2\\\vdots&\vdots&\ddots&\vdots\\c_{2n-1}&c_{2n-2}&\cdots&c_n\end{pmatrix},\qquad b=-\begin{pmatrix}c_{n+1}\\c_{n+2}\\\vdots\\c_{2n}\end{pmatrix},\tag{2}$$

with $c_k=0$ for $k<0$. The values $(p_0,p_1,p_2,\dots,p_n)$ are then obtained from

$$p_k=c_k+\sum_{j=1}^{\min(k,n)}q_j\,c_{k-j},\qquad k=0,1,\dots,n.\tag{3}$$
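The single-point construction above can be sketched in a few lines of Python; this is a hedged illustration of equations (1)-(3) with illustrative names, not code from the patent. Given the Taylor coefficients $c_0,\dots,c_{L+M}$ of $f$ at $x_0$, it solves the linear system for the denominator and reads off the numerator:

```python
import numpy as np
from math import factorial

def pade_single_point(c, L, M):
    """c[k]: k-th Taylor coefficient of f at x0. Returns (p, q) with q[0] = 1."""
    # Denominator: sum_{j=1}^{M} q_j * c_{L+i-j} = -c_{L+i},  i = 1..M  (eq. (2))
    A = np.array([[c[L + i - j] if L + i - j >= 0 else 0.0
                   for j in range(1, M + 1)] for i in range(1, M + 1)])
    b = -np.array([c[L + i] for i in range(1, M + 1)])
    q = np.concatenate(([1.0], np.linalg.solve(A, b)))
    # Numerator: p_k = c_k + sum_{j=1}^{min(k,M)} q_j * c_{k-j}  (eq. (3))
    p = np.array([sum(q[j] * c[k - j] for j in range(min(k, M) + 1))
                  for k in range(L + 1)])
    return p, q

# Sanity check: f = exp reproduces the classic [2/2] Padé approximant,
# (1 + x/2 + x^2/12) / (1 - x/2 + x^2/12).
c = [1.0 / factorial(k) for k in range(5)]
p, q = pade_single_point(c, 2, 2)
```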
The multi-point Padé approximation is a generalization of the single-point Padé approximation. Let the approximated function be $f(x)$; if its values are known at the $n+1$ interpolation points $x_0,x_1,x_2,\dots,x_n$, there is a rational expression

$$r_{[L/M]}(x)=\frac{u_{[L/M]}(x)}{v_{[L/M]}(x)},\tag{4}$$

where $L+M=n$, $u_{[L/M]}(x)$ is a polynomial of highest order $L$,

$$u_{[L/M]}(x)=a_0+a_1x+\cdots+a_Lx^L,\tag{5}$$

and $v_{[L/M]}(x)$ is a polynomial of highest order $M$,

$$v_{[L/M]}(x)=b_0+b_1x+\cdots+b_Mx^M.\tag{6}$$

Here $u_{[L/M]}(x)$ and $v_{[L/M]}(x)$ are polynomials that are constructed by means of divided differences.

First, the divided differences of $f(x)$ are defined recursively as

$$f[x_i]=f(x_i),\qquad f[x_i,x_{i+1},\dots,x_j]=\frac{f[x_{i+1},\dots,x_j]-f[x_i,\dots,x_{j-1}]}{x_j-x_i}.\tag{7}$$

Let $f_{i,j}$ denote $f[x_i,x_{i+1},\dots,x_j]$, $j\ge i$; then $u_{[L/M]}(x)$ can be computed from these divided differences by equation (8), and $v_{[L/M]}(x)$ likewise by equation (9).
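The divided-difference table in equation (7) can be tabulated with the standard recursion; a small illustrative sketch, assuming distinct nodes:

```python
def divided_differences(x, y):
    """F[i][j] = f[x_i, ..., x_j] for i <= j, built by the recursion in eq. (7)."""
    n = len(x)
    F = [[0.0] * n for _ in range(n)]
    for i in range(n):
        F[i][i] = y[i]                                   # f[x_i] = f(x_i)
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            F[i][j] = (F[i + 1][j] - F[i][j - 1]) / (x[j] - x[i])
    return F
```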
The piecewise Padé approximation used by the invention is a special form of the multi-point Padé approximation: each interpolation point is given together with the function value and the derivative values from first order to $m$-th order at that point, and each segment is constructed from a multi-point Padé approximation.

Let the approximated function be $f(x)$, and suppose that at the $n+1$ interpolation points $x_0,x_1,x_2,\dots,x_n$ the function values and derivatives up to order $m$ are prescribed:

$$f^{(l)}(x_k)=a_k^{(l)},\qquad l=0,1,\dots,m,\quad k=0,1,\dots,n.\tag{10}$$

Take any interval $[x_k,x_{k+1}]$ and construct on it the Padé approximation expression

$$r^{k}_{[L/M]}(x)=\frac{u^{k}_{[L/M]}(x)}{v^{k}_{[L/M]}(x)},\tag{11}$$

where $L+M+1=n$, and the expressions for $u^{k}_{[L/M]}(x)$ and $v^{k}_{[L/M]}(x)$ are those of equations (8) and (9). The specific calculation considers the equivalent (confluent) set formed by $2m+2$ points,

$$z_0=z_1=\cdots=z_m=x_k,\qquad z_{m+1}=z_{m+2}=\cdots=z_{2m+1}=x_{k+1},\tag{12}$$

and, following equations (8) and (9), requires the divided differences $f_{i,j}=f[z_i,z_{i+1},\dots,z_j]$, $0\le i\le j\le 2m+1$.

From the properties of divided differences and equation (10):

$$f_{i,j}=\frac{a_k^{(j-i)}}{(j-i)!},\qquad 0\le i\le j\le m,\tag{13}$$

$$f_{i,j}=\frac{a_{k+1}^{(j-i)}}{(j-i)!},\qquad m+1\le i\le j\le 2m+1.\tag{14}$$

When $0\le i\le m$ and $m+1\le j\le 2m+1$, the recurrence formula

$$f_{i,j}=\frac{f_{i+1,j}-f_{i,j-1}}{x_{k+1}-x_k}\tag{15}$$

applies; within it, when $i+1\ge m+1$ the term $f_{i+1,j}$ is computed directly from equation (14), and when $j-1\le m$ the term $f_{i,j-1}$ is computed directly from equation (13).

Substituting the computed $f_{i,j}$ into equations (8) and (9) yields $u^{k}_{[L/M]}(x)$ and $v^{k}_{[L/M]}(x)$, and hence $r^{k}_{[L/M]}(x)$. The function constructed by the piecewise Padé approximation is then expressed as

$$r_{[L/M]}(x)=r^{k}_{[L/M]}(x),\qquad x\in[x_k,x_{k+1}),\quad k=0,1,\dots,n-1.\tag{16}$$
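Evaluating the piecewise function of equation (16) amounts to locating the segment containing each input and applying that segment's rational approximant. A hedged NumPy sketch, assuming `segments[k]` holds ascending numerator/denominator coefficients (p, q) for the interval $[x_k,x_{k+1}]$ and that each segment is expanded around its left knot:

```python
import numpy as np

def piecewise_pade_eval(x, knots, segments):
    """knots: sorted interpolation points; segments[k] = (p, q) on [knots[k], knots[k+1]]."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    idx = np.clip(np.searchsorted(knots, x, side="right") - 1, 0, len(segments) - 1)
    out = np.empty_like(x)
    for k, (p, q) in enumerate(segments):
        mask = idx == k
        t = x[mask] - knots[k]                 # local variable on segment k
        out[mask] = np.polyval(p[::-1], t) / np.polyval(q[::-1], t)
    return out
```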
the invention is further improved in that: setting the training round as N in the step 3, wherein the training steps are as follows:
step 3.1, inputting the training data into a neural network, and executing step 3.2;
step 3.2, data in the module is transmitted in the forward direction, and data Hn×mInputting the data into an activation function layer, and executing the next step;
step 3.3, activating the hyper-parameter x of the function layer0,x1,x2,…,xnAnd trainable parameters The piecewise function is obtained according to equations (10) - (16) as interpolation points and derivative values from zero order to m orderForming a piecewise activation function r[L/M](x);
Step 3.4, data Hn×mAfter passing through the activation function r[L/M](x) To obtain an output Zn×mExpressed as:
to obtain an output Zn×m;
3.5, continuing forward propagation of the data until a next activation function layer is encountered, jumping to the step 3.3, and otherwise, executing the step 3.6;
step 3.6, obtaining a training result, calculating the value of a loss function, and automatically performing back propagation and updating neural network weight and trainable parameters by a framework; if the current round is less than or equal to N, a new batch of training data is taken, and the step 3.2 is skipped; otherwise, the model training process is ended.
The invention is further improved in that: step 4, model prediction is carried out, if the prediction result meets the requirement, the model training is successful, and the training is finished; otherwise, returning to the step 3.
The beneficial effects of the invention are: the invention provides an activation function based on segmentation rational approximation according to the idea of fast flexible all-pure embedding (FFHE). Firstly, initializing function points, function values and various derivative values,
and constructing a segmented activation function by using a segmented rational approximation method. Its advantages are as follows:
(i) more potent expression: the expression capability of the piecewise function is stronger than that of the ordinary function, and a solid theoretical foundation is provided. It has been demonstrated in the literature that under the Lipschitz condition, the property of point-by-point nonlinearity is linked to the global Lipschitz constant of the network by introducing a boundary, and then regularization is performed using the boundary to derive a representation theorem indicating that the optimal configuration is implemented by a depth spline network, wherein each activation function is a piecewise linear spline function with adaptive nodes.
(ii) Better smoothness: other commonly used activation functions such as ReLu function, PReLu function and piecewise linear spline are only piecewise first-order derivative, which is limited in some scenes, for example, solving a differential equation by using a neural network usually requires a second-order derivative or even a higher-order derivative of a network output to an input, and the activation function which is only first-order derivative causes a gradient to be zero and cannot effectively update parameters.
(iii) More flexible and easy to calculate: the activation function based on the piecewise rational approximation sets an initialization function point, a function value and each order derivative, takes the function value and each order derivative as parameters which can be adjusted along with the training of the neural network, and the adaptive adjustment of the parameters enables the back propagation of the neural network to be updated towards the steepest direction, so that the activation function needs fewer turns than other activation functions to achieve the expected precision.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
FIG. 2 is a flow chart of neural network model training based on a piecewise rational approximation activation function.
Fig. 3 is a schematic diagram of a neural network model structure of the present invention.
Fig. 4 is a schematic diagram of the structure of the PINNs model.
Fig. 5 is a graph of the LeakyReLu, ReLu, Tanh, and FFHE activation function training.
Detailed Description
Embodiments of the invention will be described in detail below with reference to the drawings, and for the sake of clarity, many implementation details will be set forth in the following description. It should be understood, however, that these implementation details should not be used to limit the invention. That is, details of these implementations are not necessary in some embodiments of the invention.
As shown in FIGS. 1-3, the invention is a fast, flexible, holomorphic-embedding neural network wide-area optimization training method comprising the following steps:
step 1, determining a differential equation to be solved, and sampling in a defined domain to obtain training data and test data; step 2, constructing a neural network model, and embedding an activation function layer based on piecewise rational approximation;
step 3, adjusting the hyper-parameters and training a neural network model;
step 4, model prediction is carried out, if the prediction result meets the requirement, the model training is successful, and the training is finished; otherwise, returning to the step 3.
The differential equation to be solved in step 1 is the Burgers equation. The Burgers equation is a very useful mathematical model for many physical problems, such as shock waves, shallow-water waves, and traffic flow mechanics, and is an important mathematical model for describing diffusion phenomena in the physical world. It is a nonlinear partial differential equation that simulates the propagation and reflection of shock waves, defined as follows:
$$u_t+uu_x-(0.01/\pi)\,u_{xx}=0,\qquad x\in[-1,1],\quad t\in[0,1],$$
$$u(0,x)=-\sin(\pi x),$$
$$u(t,-1)=u(t,1)=0.$$
The equation is a time-dependent, one-dimensional partial differential equation with an initial condition and boundary conditions.
The PINNs model is used in step 2; its approximate structure is shown in FIG. 4, with the independent variables x and t of the differential equation as inputs and the dependent variable u as output. In the figure, NN(x, t; θ) denotes a fully connected neural network, where θ are the weights of its hidden layers. The PDE(λ) part of the figure represents the composition of the loss function of the neural network model. The loss function of PINNs is divided into two parts: one is the initial-condition and boundary part, and the other is the equation itself.
Taking the Burgers equation as an example, let the number of sampling points on the boundary and initial conditions be $N_u$ and the number of interior collocation points be $N_f$. The first part of the loss function is the MSE of the model output on the initial and boundary conditions:

$$MSE_u=\frac{1}{N_u}\sum_{i=1}^{N_u}\bigl|u(t_u^i,x_u^i)-u^i\bigr|^2.$$

The second part of the loss function is the MSE of the model output on the equation itself. Let

$$\gamma=u_t+uu_x-(0.01/\pi)\,u_{xx}.$$

Then:

$$MSE_f=\frac{1}{N_f}\sum_{i=1}^{N_f}\bigl|\gamma(t_f^i,x_f^i)\bigr|^2.$$

The final loss function is the sum of the two:

$$MSE=MSE_u+MSE_f.$$
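In a framework with automatic differentiation, both loss terms are direct to implement: the derivatives $u_t$, $u_x$, $u_{xx}$ in $\gamma$ come from autograd rather than from a mesh. A hedged PyTorch sketch with illustrative names, assuming `net` maps (t, x) to u:

```python
import torch

def pinns_loss(net, t_u, x_u, u_data, t_f, x_f):
    # MSE_u: network output vs. known initial/boundary values.
    mse_u = torch.mean((net(t_u, x_u) - u_data) ** 2)

    # MSE_f: residual gamma = u_t + u*u_x - (0.01/pi)*u_xx at collocation points.
    t_f = t_f.clone().requires_grad_(True)
    x_f = x_f.clone().requires_grad_(True)
    u = net(t_f, x_f)
    u_t = torch.autograd.grad(u.sum(), t_f, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x_f, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x_f, create_graph=True)[0]
    gamma = u_t + u * u_x - (0.01 / torch.pi) * u_xx
    mse_f = torch.mean(gamma ** 2)

    return mse_u + mse_f
```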
As shown in FIG. 3, the invention uses a fully connected PINNs network with four hidden layers of 20 neurons each. Sampling the interior, the boundary, and the initial condition yields 25600 (x, t) data pairs in total; using Latin hypercube sampling, 10000 (x, t) pairs are drawn from the interior and 100 pairs from the boundary and initial condition, giving 10100 data pairs in total as the model's training data. The remaining (x, t) data pairs are used as test data for the model.
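The sampling and architecture just described might look as follows; this is a sketch with assumed names, using SciPy's Latin hypercube sampler and a generic tanh placeholder where the patent installs its piecewise-rational activation layers:

```python
import torch
import torch.nn as nn
from scipy.stats import qmc

# 10000 interior collocation points over x in [-1, 1], t in [0, 1].
sampler = qmc.LatinHypercube(d=2, seed=0)
pts = qmc.scale(sampler.random(10000), [-1.0, 0.0], [1.0, 1.0])
x_f = torch.tensor(pts[:, 0:1], dtype=torch.float32)
t_f = torch.tensor(pts[:, 1:2], dtype=torch.float32)

class PINN(nn.Module):
    """Four hidden layers of 20 neurons; inputs (t, x), output u."""
    def __init__(self, width=20, depth=4):
        super().__init__()
        layers, dim = [], 2
        for _ in range(depth):
            layers += [nn.Linear(dim, width), nn.Tanh()]  # stand-in activation
            dim = width
        layers.append(nn.Linear(dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=-1))
```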
Each fully connected hidden layer is followed by an activation function layer based on piecewise rational approximation; in this embodiment each such layer has six trainable parameters. In general, each activation function layer has $n+1$ hyper-parameters $x_0,x_1,x_2,\dots,x_n$ representing the interpolation points and $(m+1)(n+1)$ trainable parameters $a_k^{(l)}$ representing the derivative values from zero order to $m$-th order.
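A hedged sketch of such a layer: the knots are fixed buffers, the per-knot derivative values up to order m are trainable, and `build_coeffs` is a caller-supplied hypothetical standing in for the per-segment coefficient construction of equations (10)-(16):

```python
import torch
import torch.nn as nn

class PiecewisePadeActivation(nn.Module):
    def __init__(self, knots, m, build_coeffs):
        super().__init__()
        self.register_buffer("knots", torch.as_tensor(knots, dtype=torch.float32))
        # (n+1) x (m+1) trainable derivative values a_k^(0..m), one row per knot.
        self.derivs = nn.Parameter(0.1 * torch.randn(len(knots), m + 1))
        self.build_coeffs = build_coeffs  # hypothetical: eqs. (10)-(16) -> (P, Q)

    def forward(self, h):
        # P, Q: ascending coefficients per segment, shapes (n_seg, L+1), (n_seg, M+1).
        P, Q = self.build_coeffs(self.knots, self.derivs)
        idx = torch.clamp(torch.searchsorted(self.knots, h.detach()) - 1,
                          0, P.shape[0] - 1)
        t = h - self.knots[idx]               # local variable on each segment
        num = sum(P[idx, k] * t ** k for k in range(P.shape[1]))
        den = sum(Q[idx, k] * t ** k for k in range(Q.shape[1]))
        return num / den
```

Because the derivative values sit inside the forward graph, backpropagation updates them together with the network weights, which is what step 3.6 below relies on.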
The invention designs the activation function according to the idea of fast flexible holomorphic embedding (FFHE), combined with the mathematics of piecewise rational approximation. The Padé approximation is a method of constructing rational-function approximations; it is more accurate than a truncated Taylor series, and it often converges even where the Taylor series does not. In addition, when constructing an interpolating function, piecewise interpolation is generally adopted to avoid the Runge phenomenon caused by high-order polynomials; the interpolation result then depends only on a few surrounding points, finally forming a composite piecewise function.
The construction process of the activation function based on piecewise rational approximation in step 2 is as described above; the details are given in equations (10)-(16).
In step 3 the maximum number of training rounds is set to N, and training the neural network model specifically comprises the following steps:
step 3.1, input the training data into the neural network, and execute step 3.2;
step 3.2, propagate the data forward through the model until the data $H_{n\times m}$ enters an activation function layer, then execute the next step;
step 3.3, take the activation function layer's hyper-parameters $x_0,x_1,x_2,\dots,x_n$ as interpolation points and its trainable parameters $a_k^{(l)}$ ($l=0,1,\dots,m$; $k=0,1,\dots,n$) as the derivative values from zero order to $m$-th order, and obtain the piecewise function according to equations (10)-(16), forming the piecewise activation function $r_{[L/M]}(x)$;
step 3.4, pass the data $H_{n\times m}$ through the activation function $r_{[L/M]}(x)$, applied element-wise as $Z_{n\times m}=r_{[L/M]}(H_{n\times m})$, to obtain the output $Z_{n\times m}$;
step 3.5, continue forward propagation of the data; when the next activation function layer is encountered, jump to step 3.3, otherwise execute step 3.6;
step 3.6, obtain the training result, compute the value of the loss function, and let the framework automatically back-propagate and update the neural network weights and trainable parameters; if the current round is less than or equal to N, take a new batch of training data and jump to step 3.2; otherwise, end the model training process.
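Putting steps 3.1-3.6 together, a minimal training-loop sketch reusing the `PINN` and `pinns_loss` sketches above; the initial-condition tensors here are built directly from u(0, x) = -sin(πx) for illustration:

```python
import torch

x_u = torch.linspace(-1.0, 1.0, 100).reshape(-1, 1)  # points on the initial line t = 0
t_u = torch.zeros_like(x_u)
u_data = -torch.sin(torch.pi * x_u)                   # u(0, x) = -sin(pi x)

model = PINN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)

for epoch in range(7000):                             # maximum training round N
    optimizer.zero_grad()
    loss = pinns_loss(model, t_u, x_u, u_data, t_f, x_f)
    loss.backward()       # framework back-propagates automatically (step 3.6)
    optimizer.step()      # updates weights and trainable activation parameters
```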
In step 4, model prediction is carried out; if the prediction result meets the requirement, the model training has succeeded and training ends; otherwise, return to step 3.
7000 rounds of training were performed with the learning rate set to 0.002. The LeakyReLU and ReLU activation functions trained worst; their training curves in FIG. 5 are the top two, almost coincident, curves. The activation function constructed by piecewise Padé approximation averaged 4.307 s per hundred training rounds, versus 3.532 s for the Tanh function; however, the piecewise Padé activation reached a training error of 9.4067E-04 by round 1500, whereas Tanh needed the full 7000 rounds just to bring the training error down to 9.1780E-04. That is, the FFHE method needs only about one fifth of the training rounds required by Tanh to reach the same error level, and over a full 7000 rounds it yields results more than two orders of magnitude (100 times) more accurate than Tanh. The activation function constructed by the FFHE (piecewise Padé approximation) method is therefore superior to common activation functions in both training time and training accuracy, and the method provides a powerful tool for quickly and accurately solving the high-dimensional partial differential equations involved in practical engineering computation tasks.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (6)
1. A fast, flexible, holomorphic-embedding neural network wide-area optimization training method, characterized by comprising the following steps:
step 1, determining a differential equation to be solved, and sampling in a defined domain to obtain training data and test data;
step 2, constructing a neural network model, and embedding an activation function layer based on piecewise rational approximation;
step 3, adjusting the hyper-parameters and training a neural network model;
step 4, model prediction is carried out, if the prediction result meets the requirement, the model training is successful, and the training is finished; otherwise, returning to the step 3.
2. The method of claim 1, wherein the differential equation in step 1 is the Burgers equation.
3. The method of claim 1, wherein the neural network model constructed in step 2 comprises an input layer, four fully connected layers, four activation function layers, and an output layer.
4. The method of claim 1, wherein the activation function of the piecewise rational approximation in step 2 is constructed as follows:
suppose that at some point $x_0$ the function $f(x)$ is approximated using a single-point Padé approximation of the form

$$r_{[L/M]}(x)=\frac{p_0+p_1x+\cdots+p_Lx^L}{1+q_1x+\cdots+q_Mx^M},\tag{1}$$

where $p_k$ and $q_k$ are coefficients to be determined, $L$ is the highest order of $x$ in the numerator, and $M$ is the highest order of $x$ in the denominator; with $L+M$ held constant, take $L=M$; the numerator and denominator are then solved as follows: let $L=M=n$ and write $c_k$ for the $k$-th Taylor coefficient of $f$ at $x_0$; first solve the linear system $Aq=b$ to obtain $(q_1,q_2,q_3,\dots,q_n)$, where

$$A=\begin{pmatrix}c_n&c_{n-1}&\cdots&c_1\\c_{n+1}&c_n&\cdots&c_2\\\vdots&\vdots&\ddots&\vdots\\c_{2n-1}&c_{2n-2}&\cdots&c_n\end{pmatrix},\qquad b=-\begin{pmatrix}c_{n+1}\\c_{n+2}\\\vdots\\c_{2n}\end{pmatrix},\tag{2}$$

with $c_k=0$ for $k<0$; the values $(p_0,p_1,p_2,\dots,p_n)$ are then obtained from

$$p_k=c_k+\sum_{j=1}^{\min(k,n)}q_j\,c_{k-j},\qquad k=0,1,\dots,n;\tag{3}$$

the multi-point Padé approximation is a generalization of the single-point Padé approximation: let the approximated function be $f(x)$; if its values are known at the $n+1$ interpolation points $x_0,x_1,x_2,\dots,x_n$, there is a rational expression

$$r_{[L/M]}(x)=\frac{u_{[L/M]}(x)}{v_{[L/M]}(x)},\tag{4}$$

where $L+M=n$, $u_{[L/M]}(x)$ is a polynomial of highest order $L$,

$$u_{[L/M]}(x)=a_0+a_1x+\cdots+a_Lx^L,\tag{5}$$

and $v_{[L/M]}(x)$ is a polynomial of highest order $M$,

$$v_{[L/M]}(x)=b_0+b_1x+\cdots+b_Mx^M;\tag{6}$$

here $u_{[L/M]}(x)$ and $v_{[L/M]}(x)$ are polynomials constructed by means of divided differences; first, the divided differences of $f(x)$ are defined recursively as

$$f[x_i]=f(x_i),\qquad f[x_i,x_{i+1},\dots,x_j]=\frac{f[x_{i+1},\dots,x_j]-f[x_i,\dots,x_{j-1}]}{x_j-x_i};\tag{7}$$

let $f_{i,j}$ denote $f[x_i,x_{i+1},\dots,x_j]$, $j\ge i$; then $u_{[L/M]}(x)$ can be computed from these divided differences by equation (8), and $v_{[L/M]}(x)$ likewise by equation (9);
the piecewise Padé approximation used is a special form of the multi-point Padé approximation: each interpolation point is given together with the function value and the derivative values from first order to $m$-th order at that point, and each segment is constructed from a multi-point Padé approximation;
let the approximated function be $f(x)$, and suppose that at the $n+1$ interpolation points $x_0,x_1,x_2,\dots,x_n$ the function values and derivatives up to order $m$ are prescribed:

$$f^{(l)}(x_k)=a_k^{(l)},\qquad l=0,1,\dots,m,\quad k=0,1,\dots,n;\tag{10}$$

take any interval $[x_k,x_{k+1}]$ and construct on it the Padé approximation expression

$$r^{k}_{[L/M]}(x)=\frac{u^{k}_{[L/M]}(x)}{v^{k}_{[L/M]}(x)},\tag{11}$$

where $L+M+1=n$, and the expressions for $u^{k}_{[L/M]}(x)$ and $v^{k}_{[L/M]}(x)$ are those of equations (8) and (9); the specific calculation considers the equivalent set formed by $2m+2$ points,

$$z_0=z_1=\cdots=z_m=x_k,\qquad z_{m+1}=z_{m+2}=\cdots=z_{2m+1}=x_{k+1},\tag{12}$$

and, following equations (8) and (9), requires the divided differences $f_{i,j}=f[z_i,z_{i+1},\dots,z_j]$, $0\le i\le j\le 2m+1$; from the properties of divided differences and equation (10):

$$f_{i,j}=\frac{a_k^{(j-i)}}{(j-i)!},\qquad 0\le i\le j\le m,\tag{13}$$

$$f_{i,j}=\frac{a_{k+1}^{(j-i)}}{(j-i)!},\qquad m+1\le i\le j\le 2m+1;\tag{14}$$

when $0\le i\le m$ and $m+1\le j\le 2m+1$, the recurrence formula

$$f_{i,j}=\frac{f_{i+1,j}-f_{i,j-1}}{x_{k+1}-x_k}\tag{15}$$

applies; within it, when $i+1\ge m+1$ the term $f_{i+1,j}$ is computed directly from equation (14), and when $j-1\le m$ the term $f_{i,j-1}$ is computed directly from equation (13);
substituting the computed $f_{i,j}$ into equations (8) and (9) yields $u^{k}_{[L/M]}(x)$ and $v^{k}_{[L/M]}(x)$, and hence $r^{k}_{[L/M]}(x)$; the function constructed by the piecewise Padé approximation is then expressed as

$$r_{[L/M]}(x)=r^{k}_{[L/M]}(x),\qquad x\in[x_k,x_{k+1}),\quad k=0,1,\dots,n-1.\tag{16}$$
5. The method of claim 1, wherein in step 3 the number of training rounds is set to N and the training steps are as follows:
step 3.1, input the training data into the neural network, and execute step 3.2;
step 3.2, propagate the data forward through the model until the data $H_{n\times m}$ enters an activation function layer, then execute the next step;
step 3.3, take the activation function layer's hyper-parameters $x_0,x_1,x_2,\dots,x_n$ as interpolation points and its trainable parameters $a_k^{(l)}$ ($l=0,1,\dots,m$; $k=0,1,\dots,n$) as the derivative values from zero order to $m$-th order, and obtain the piecewise function according to equations (10)-(16), forming the piecewise activation function $r_{[L/M]}(x)$;
step 3.4, pass the data $H_{n\times m}$ through the activation function $r_{[L/M]}(x)$, applied element-wise as $Z_{n\times m}=r_{[L/M]}(H_{n\times m})$, to obtain the output $Z_{n\times m}$;
step 3.5, continue forward propagation of the data; when the next activation function layer is encountered, jump to step 3.3, otherwise execute step 3.6;
step 3.6, obtain the training result, compute the value of the loss function, and let the framework automatically back-propagate and update the neural network weights and trainable parameters; if the current round is less than or equal to N, take a new batch of training data and jump to step 3.2; otherwise, end the model training process.
6. The method of claim 1, wherein in step 4 model prediction is performed; if the prediction result meets the requirement, the model training is successful and training ends; otherwise, return to step 3.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210125273.3A CN114548400A (en) | 2022-02-10 | 2022-02-10 | Rapid flexible full-pure embedded neural network wide area optimization training method |
PCT/CN2022/094901 WO2023151201A1 (en) | 2022-02-10 | 2022-05-25 | Fast and flexible holomorphic embedding type neural network wide-area optimization training method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210125273.3A CN114548400A (en) | 2022-02-10 | 2022-02-10 | Rapid flexible full-pure embedded neural network wide area optimization training method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114548400A true CN114548400A (en) | 2022-05-27 |
Family
ID=81672897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210125273.3A Pending CN114548400A (en) | 2022-02-10 | 2022-02-10 | Rapid flexible full-pure embedded neural network wide area optimization training method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114548400A (en) |
WO (1) | WO2023151201A1 (en) |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11175641B2 (en) * | 2018-08-10 | 2021-11-16 | Cornell University | Processing platform with holomorphic embedding functionality for power control and other applications |
CN112597700B (en) * | 2020-12-15 | 2022-09-27 | 北京理工大学 | Aircraft trajectory simulation method based on neural network |
CN112784496A (en) * | 2021-01-29 | 2021-05-11 | 上海明略人工智能(集团)有限公司 | Method and device for predicting motion parameters of hydrodynamics and storage medium |
CN113183146B (en) * | 2021-02-04 | 2024-02-09 | 中山大学 | Mechanical arm motion planning method based on rapid and flexible full-pure embedding idea |
CN113489014B (en) * | 2021-07-19 | 2023-06-02 | 中山大学 | Quick and flexible full-pure embedded power system optimal power flow evaluation method |
CN114239698A (en) * | 2021-11-26 | 2022-03-25 | 中国空间技术研究院 | Data processing method, device and equipment |
CN114385969A (en) * | 2022-01-12 | 2022-04-22 | 温州大学 | Neural network method for solving differential equations |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116700049A (en) * | 2023-07-12 | 2023-09-05 | 山东大学 | Multi-energy network digital twin real-time simulation system and method based on data driving |
CN116700049B (en) * | 2023-07-12 | 2024-05-28 | 山东大学 | Multi-energy network digital twin real-time simulation system and method based on data driving |
Also Published As
Publication number | Publication date |
---|---|
WO2023151201A1 (en) | 2023-08-17 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |