CN113485099A - Online learning control method of nonlinear discrete time system - Google Patents

Online learning control method of nonlinear discrete time system

Info

Publication number
CN113485099A
Authority
CN
China
Prior art keywords
network
optimal
input
evaluation
function
Prior art date
Legal status
Granted
Application number
CN202011635930.6A
Other languages
Chinese (zh)
Other versions
CN113485099B (en)
Inventor
李新兴
查文中
王雪源
王蓉
Current Assignee
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN202011635930.6A priority Critical patent/CN113485099B/en
Publication of CN113485099A publication Critical patent/CN113485099A/en
Application granted granted Critical
Publication of CN113485099B publication Critical patent/CN113485099B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B13/027: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0205: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system
    • G05B13/021: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system in which a variable is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an online learning control method for a nonlinear discrete-time system, comprising a behavior policy selection step, an optimal Q-function definition step, an evaluation network and execution network introduction step, an estimation error calculation step, and a final optimal weight calculation step. The invention realizes real-time online learning of the optimal controller without repeated iteration between policy evaluation and policy improvement. By adopting an off-policy learning mechanism, it effectively overcomes the insufficient exploration of the state-policy space in the direct heuristic dynamic programming method; the execution network and the evaluation network can use activation functions of any form; and online learning of the optimal controller is achieved without a system model, requiring only the state data generated by the behavior policy.

Description

Online learning control method of nonlinear discrete time system
Technical Field
The invention relates to the field of industrial production control, in particular to an online learning control method for a nonlinear discrete time system.
Background
In industrial production, engineering technicians often need to optimize the controllers of control objects such as robots, unmanned aerial vehicles and unmanned vehicles so as to meet certain control specifications. These control objects tend to exhibit strong nonlinearity, which makes controller optimization difficult. From the perspective of optimal control, obtaining the optimal controller requires solving the complex Hamilton-Jacobi-Bellman (HJB) equation, which is a nonlinear partial differential equation and is very difficult to solve. Traditional dynamic programming, variational methods, spectral methods and the like often face severe limitations in practical applications due to their extremely high computational complexity.
Adaptive dynamic programming is a new intelligent control approach that has emerged in recent years; it combines reinforcement learning, neural-network approximation, dynamic programming and adaptive control, can realize online learning of the optimal controller, and effectively alleviates the high computational complexity of traditional methods. For the optimal control problem of nonlinear discrete-time systems, Jennie Si and Yu-Tsung Wang first proposed the direct heuristic dynamic programming algorithm in the paper "Online learning control by association and reinforcement". The algorithm adopts the basic idea of generalized policy iteration and, by introducing two neural networks (an execution network and an evaluation network), can realize real-time online learning of the optimal controller and the optimal value function. Through continuous development in recent years, the convergence and stability analysis of the algorithm now has a certain theoretical basis. Although direct heuristic dynamic programming can realize online adaptive optimal control, it still has the following shortcomings: 1) the algorithm adopts an on-policy learning mechanism, explores the state-policy space insufficiently, and easily falls into a locally optimal solution; 2) the activation functions of the execution network and the evaluation network are hyperbolic tangent functions, and all existing convergence and stability analysis results are based on the hyperbolic tangent function, so they are not applicable to other types of activation functions.
Therefore, how to overcome the above disadvantages of the direct heuristic dynamic programming method, so that the convergence and stability analysis results are no longer limited to the hyperbolic tangent function, has become a technical problem to be urgently solved in the prior art.
Disclosure of Invention
The invention aims to provide an online learning control method for a nonlinear discrete-time system that has better exploration capability over the state-policy space, so that the activation functions of the execution network and the evaluation network can be chosen freely and are no longer limited to hyperbolic tangent functions. Compared with iterative methods such as policy iteration or value iteration, the method realizes online learning of the optimal controller, requires no system model, and needs only the state data generated by the behavior policy.
In order to achieve the purpose, the invention adopts the following technical scheme:
an online learning control method of a nonlinear discrete time system comprises the following steps:
behavior policy selection step S110:
selecting a behavior policy u based on the characteristics of the controlled object and existing experience, wherein the behavior policy is the control policy actually applied to the controlled object during the learning process and is mainly used to generate the system state data required for learning;
optimal Q-function definition step S120:
the following optimal Q-function is defined:

Q*(x_k, u_k) = U(x_k, u_k) + V*(x_{k+1})

where U(x_k, u_k) is the utility (stage cost) of the value function to be minimized and V* is the optimal value function. Its physical meaning is: at time k the behavior policy u is applied, while at all subsequent times the optimal control policy u*, i.e. the target policy, is applied. By the definition of the optimal Q-function, the above equation can be equivalently expressed as:

Q*(x_k, u_k) = U(x_k, u_k) + Q*(x_{k+1}, u*(x_{k+1}))

The optimal control u*(x_k) can be expressed as:

u*(x_k) = arg min_u Q*(x_k, u)

Unlike the linear case, Q*(x_k, u_k) and u*(x_k) are nonlinear functions of (x_k, u_k) and x_k, respectively.
evaluation network and execution network introduction step S130:
an evaluation network and an execution network are introduced to approximate Q*(x_k, u_k) and u*(x_k) online, respectively, wherein both the evaluation network and the execution network are neural networks;
the evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*. Assume that the number of activation functions in the evaluation network is N_c, and let Q̂*(x_k, u_k) be the best approximation of Q*(x_k, u_k) by the evaluation network in the least-squares sense; it can be expressed as:

Q̂*(x_k, u_k) = W_c^T φ_c(θ_1(k)),   θ_1(k) = (W_c^0)^T [x_k; u_k]

where W_c is the weight from the hidden layer to the output layer, φ_c(·) = [φ_c^1(·), …, φ_c^{N_c}(·)]^T is the set of all activation functions in the hidden layer of the evaluation network, W_c^0 = [w_c^1, …, w_c^{N_c}] is the weight from the input layer to the hidden layer of the evaluation network, w_c^i is the weight corresponding to the i-th activation function, θ_1(k) collects the activation-function input values corresponding to (x_k, u_k), and θ_1^i(k) = (w_c^i)^T [x_k; u_k] is the input value of the i-th activation function;
let the number of activation functions in the execution network be N_a, and let û*(x_k) be the best approximation of u*(x_k) by the execution network in the least-squares sense; it can be expressed as:

û*(x_k) = W_a^T φ_a(σ(k)),   σ(k) = (W_a^0)^T x_k

The input of the execution network is the system state, W_a is the weight from the hidden layer to the output layer, φ_a(·) = [φ_a^1(·), …, φ_a^{N_a}(·)]^T is the set of hidden-layer activation functions of the execution network, W_a^0 = [w_a^1, …, w_a^{N_a}] is the weight from the input layer to the hidden layer, w_a^i is the weight corresponding to the i-th activation function, σ(k) collects the activation-function input values corresponding to x_k, σ^i(k) = (w_a^i)^T x_k is the input value of the i-th activation function, and for x_{k+1} one has σ(k+1) = (W_a^0)^T x_{k+1}.
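For illustration only, the two network structures described above can be built as single-hidden-layer networks with fixed random input-to-hidden weights W_c^0, W_a^0 and adjustable hidden-to-output weights W_c, W_a. All dimensions, the random initialization and the choice of tanh in the sketch below are assumptions for the example and are not prescribed by the text.

```python
import numpy as np

# Minimal single-hidden-layer evaluation (critic) and execution (actor) networks
# with fixed random input-to-hidden weights, as described above.

rng = np.random.default_rng(0)
n, m = 2, 1          # state and input dimensions (assumed)
Nc, Na = 20, 20      # number of hidden activation functions (assumed)

Wc0 = rng.uniform(-1.0, 1.0, size=(n + m, Nc))   # critic input-to-hidden weights (held fixed)
Wa0 = rng.uniform(-1.0, 1.0, size=(n, Na))       # actor  input-to-hidden weights (held fixed)
Wc = np.zeros(Nc)                                # critic hidden-to-output weights (learned)
Wa = np.zeros((Na, m))                           # actor  hidden-to-output weights (learned)

phi = np.tanh                                    # any smooth activation could be used here

def critic(x, u):
    """Q-hat(x, u) = Wc^T phi(theta), theta = Wc0^T [x; u]."""
    theta = Wc0.T @ np.concatenate([x, u])
    return Wc @ phi(theta)

def actor(x):
    """u-hat(x) = Wa^T phi(sigma), sigma = Wa0^T x."""
    sigma = Wa0.T @ x
    return Wa.T @ phi(sigma)
```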
Estimation error calculation step S140:
replacing the exact values Q*(x_k, u_k) and u*(x_k) with the best approximations Q̂*(x_k, u_k) and û*(x_k) in the Q-function Bellman equation yields the following estimation error:

e_k = W_c^T φ_c(θ_1(k)) - U(x_k, u_k) - W_c^T φ_c(θ_2(k+1))

where θ_2(k+1) denotes the activation-function input values of the evaluation network when its input is (x_{k+1}, û*(x_{k+1})), i.e. θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û*(x_{k+1})].
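The estimation error above can be computed directly from two critic evaluations and the utility. A minimal sketch follows; the quadratic utility U(x, u) = x^T Q x + u^T R u is a common choice assumed here for illustration, and all names are illustrative.

```python
import numpy as np

# Bellman (temporal-difference) residual e_k for given critic weights, assuming
# a quadratic utility; the quadratic form and all variable names are assumptions.

def utility(x, u, Q, R):
    return float(x @ Q @ x + u @ R @ u)

def bellman_residual(Wc, Wc0, phi, x_k, u_k, x_next, u_next_hat, Q, R):
    theta1 = Wc0.T @ np.concatenate([x_k, u_k])            # critic input for (x_k, u_k)
    theta2 = Wc0.T @ np.concatenate([x_next, u_next_hat])  # critic input for (x_{k+1}, u-hat(x_{k+1}))
    return Wc @ phi(theta1) - utility(x_k, u_k, Q, R) - Wc @ phi(theta2)
```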
Optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online. Assume that at time k, with k ≥ l and l the time at which the behavior policy starts to generate state data (i.e. learning is performed only after the behavior policy has begun generating data), the estimates of W_c and W_a held by the evaluation network and the execution network are Ŵ_c(k) and Ŵ_a(k), respectively. The output of the execution network at time k can be expressed as:

û(x_k) = Ŵ_a^T(k) φ_a(σ(k))

Before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot provide its estimate of W_a at time k+1, so the estimate of W_a at time k+1 still adopts Ŵ_a(k), and the output of the execution network at time k+1 is:

û(x_{k+1}) = Ŵ_a^T(k) φ_a(σ(k+1))

Similarly, when the input is (x_k, u_k), the output of the evaluation network is:

Q̂(x_k, u_k) = Ŵ_c^T(k) φ_c(θ_1(k))

When the input is (x_{k+1}, û(x_{k+1})), the output of the evaluation network is:

Q̂(x_{k+1}, û(x_{k+1})) = Ŵ_c^T(k) φ_c(θ_2(k+1))

where θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û(x_{k+1})]. Likewise, before the state x_{k+1} is generated, the evaluation network cannot provide its estimate of W_c at time k+1, so the estimate of W_c at time k+1 also adopts Ŵ_c(k). Replacing the true values with the estimated values yields the following estimation error:

e_k = Ŵ_c^T(k) φ_c(θ_1(k)) - U(x_k, u_k) - Ŵ_c^T(k) φ_c(θ_2(k+1))

The weight Ŵ_c(k) of the evaluation network is adjusted by a gradient-descent method; the weight Ŵ_a(k) of the execution network is trained by an importance-weighting method, and a modified gradient-descent method is used to adjust Ŵ_a(k) online. When the weight Ŵ_c of the evaluation network and the weight Ŵ_a of the execution network have converged, the output of the execution network is an approximation of the optimal controller.
Optionally, in the evaluation network and execution network introduction step S130,
for the evaluation network, the input-layer-to-hidden-layer weight W_c^0 is set to a constant value, so that only the weight from the hidden layer to the output layer is adjusted;
for the execution network, W_a^0 is likewise set to a constant value, and only the weight from the hidden layer to the output layer is adjusted.
Optionally, in the optimal weight calculating step S150:
the weight of the evaluation network is adjusted by the following gradient-descent method:

Ŵ_c(k+1) = Ŵ_c(k) + α Δφ_c(k) e_k / ρ_c(k)

where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) - φ_c(θ_1(k)) is the regression vector, and ρ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is a normalization term.
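A minimal sketch of this normalized gradient-descent adjustment is given below. The sign of the correction term follows from differentiating e_k^2/2 with the definitions above, and the default learning rate is an arbitrary assumption of the sketch.

```python
import numpy as np

# Normalized gradient-descent update of the evaluation-network weights using the
# regression vector delta_phi = phi_c(theta2(k+1)) - phi_c(theta1(k)) and the
# normalization term rho_c = (1 + delta_phi^T delta_phi)^2.

def critic_update(Wc, phi_theta1, phi_theta2, e_k, alpha=0.1):
    delta_phi = phi_theta2 - phi_theta1
    rho_c = (1.0 + delta_phi @ delta_phi) ** 2
    return Wc + alpha * e_k * delta_phi / rho_c
```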
The weight of the execution network is trained by the importance-weighting method, and a modified gradient-descent method is used to adjust Ŵ_a(k) online, as described in step S150 of the detailed embodiment below.
Optionally, in the behavior policy selection step S110, the behavior policy is u_k = u'_k + n_k, where u'_k is any feasible control policy, selected according to the characteristics of the controlled system and experience, and n_k is an exploration noise; n_k may be a sinusoidal or cosinusoidal signal containing sufficiently many frequencies, or a random signal with bounded amplitude.
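As one hedged example of such a behavior policy, the sketch below adds a multi-frequency sinusoidal signal and a bounded random term to an assumed stabilizing feedback law u'_k = -K x_k; the gain K, the frequencies and the amplitudes are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Behavior policy u_k = u'_k + n_k: a base feedback policy plus exploration noise
# built from several sinusoids and a bounded random term.

rng = np.random.default_rng(1)
K = np.array([[0.5, 0.2]])             # assumed stabilizing feedback gain for u' = -K x
freqs = np.array([0.7, 1.3, 2.9, 5.1]) # assumed exploration frequencies

def behavior_policy(x, k, noise_amp=0.2):
    u_base = -K @ x
    n_k = noise_amp * (np.sin(freqs * k).sum() / len(freqs)
                       + rng.uniform(-1.0, 1.0))
    return u_base + n_k
```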
Optionally, the evaluation network and the execution network are single-hidden-layer feedforward neural networks; the input of the evaluation network used to approximate the Q-function is the state together with the control input, while the input of the execution network is the system state and its output is an m-dimensional vector.
Optionally, the evaluation network and the execution network only adjust the weights from the hidden layer to the output layer, and the weights from the input layer to the hidden layer are randomly generated before the learning process starts and are kept unchanged in the learning process.
Optionally, the activation function of the evaluation network and the execution network is one of a hyperbolic tangent function, a Sigmoid function, a linear rectifier, and a polynomial function.
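Since the analysis is not tied to the hyperbolic tangent, the hidden-layer activation can be swapped freely. For illustration, the four families named above could be provided as interchangeable callables; the function names below are assumptions of the sketch, not identifiers from the patent.

```python
import numpy as np

# Interchangeable activation functions for the evaluation and execution networks.
def tanh(z): return np.tanh(z)                      # hyperbolic tangent
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))     # Sigmoid function
def relu(z): return np.maximum(z, 0.0)              # linear rectifier
def poly(z, p=2): return z ** p                     # polynomial activation (degree assumed)
```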
The invention further discloses a storage medium for storing computer executable instructions, which is characterized in that:
the computer-executable instructions, when executed by a processor, perform the online learning control method of a nonlinear discrete-time system.
The invention has the following advantages:
1. the invention provides an online learning control method suitable for a general nonlinear discrete time system, which can realize real-time online learning of an optimal controller without repeated iteration between strategy evaluation and strategy improvement;
2. the invention adopts an off-policy learning mechanism, which effectively overcomes the insufficient exploration of the state-policy space in the direct heuristic dynamic programming method; in addition, the execution network and the evaluation network may use activation functions of any form.
3. Compared with the classical direct heuristic dynamic programming method, the online learning method provided by the patent has better exploration capability on a state-strategy space, and the types of activation functions of the execution network and the evaluation network can be selected at will and are not limited to hyperbolic tangent functions; compared with an iterative method such as strategy iteration or value iteration, the method can realize online learning of the optimal controller, does not need a system model, and only needs state data generated by a behavior strategy.
Drawings
FIG. 1 is a flow chart of an online learning control method for a non-linear discrete time system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an evaluation network of an online learning control method of a nonlinear discrete time system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an execution network of an online learning control method for a nonlinear discrete-time system according to an embodiment of the present invention;
fig. 4 is an algorithm diagram of an online learning control method of a nonlinear discrete-time system according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The present invention first considers the following optimal control problem for a nonlinear discrete-time system. Consider the discrete-time system:

x_{k+1} = F(x_k, u_k),  x_0 = x(0)

where x_k is the system state and u_k is the system input. The system function F(x_k, u_k) is Lipschitz continuous on a compact set Ω containing the origin and satisfies F(0,0) = 0. It is assumed that the system is stabilizable on Ω, i.e. there exists a control sequence u_1, …, u_k, … such that x_k → 0. In addition, the system function F(x_k, u_k) is assumed to be unknown. The goal of optimal control of the nonlinear system is to find a feasible control policy that stabilizes the system while minimizing the following value function:

V(x_k) = Σ_{i=k}^{∞} U(x_i, u_i)

where U(x_i, u_i) ≥ 0 is the utility (stage-cost) function. According to the Bellman optimality principle, the optimal control policy u* satisfies the following Bellman equation:

V*(x_k) = min_{u_k} [ U(x_k, u_k) + V*(x_{k+1}) ],  s.t. x_{k+1} = F(x_k, u_k)

Thus, the optimal controller u* has the following expression:

u*(x_k) = arg min_{u_k} [ U(x_k, u_k) + V*(x_{k+1}) ]

Substituting the above equation into the Bellman equation yields the following HJB equation:

V*(x_k) = U(x_k, u*(x_k)) + V*(F(x_k, u*(x_k)))
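For intuition only, the value function above can be approximated for a known simulation model by truncating the infinite sum. The sketch below assumes a quadratic utility and a known F purely for illustration; the patented method itself never requires F and uses only measured state data.

```python
import numpy as np

# Finite-horizon approximation of V(x_k) = sum_i U(x_i, u_i) by rolling an
# assumed-known model forward under a given policy (illustration only).

def rollout_cost(F, policy, x0, Q, R, horizon=200):
    x, cost = np.asarray(x0, dtype=float), 0.0
    for _ in range(horizon):
        u = policy(x)
        cost += float(x @ Q @ x + u @ R @ u)   # utility U(x, u), assumed quadratic
        x = F(x, u)
    return cost
```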
thus, referring to fig. 1, there is shown an online learning control method of a nonlinear discrete-time system according to the present invention, comprising the steps of:
behavior policy selection step S110:
According to the characteristics of the controlled object, a behavior policy u is selected based on existing experience. The behavior policy is the control policy actually applied to the controlled object during the learning process and is mainly used to generate the system state data required for learning.
After the behavior strategy is selected, the optimal controller is to be learned online.
Optimal Q-function definition step S120:
the following optimal Q-function is defined:

Q*(x_k, u_k) = U(x_k, u_k) + V*(x_{k+1})

Its physical meaning is: at time k the behavior policy u is applied, while at all subsequent times the optimal control policy u*, i.e. the target policy, is applied. By the definition of the optimal Q-function, the above equation can be equivalently expressed as:

Q*(x_k, u_k) = U(x_k, u_k) + Q*(x_{k+1}, u*(x_{k+1}))

The optimal control u*(x_k) can be expressed as:

u*(x_k) = arg min_u Q*(x_k, u)

Unlike the linear case, Q*(x_k, u_k) and u*(x_k) are nonlinear functions of (x_k, u_k) and x_k, respectively.
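Because arg min_u Q*(x_k, u) has no closed form for a general nonlinear Q-function, a near-greedy action can be extracted numerically. The coarse grid search below is only an illustrative sketch with an assumed critic signature; in the patented method the execution network, not a search, supplies this minimizer.

```python
import numpy as np

# Extracting a (near-)greedy scalar action u* = argmin_u Q*(x, u) from a critic
# by searching a finite grid of candidate actions (illustration only).

def greedy_action(critic, x, u_grid=np.linspace(-1.0, 1.0, 201)):
    costs = [critic(x, np.atleast_1d(u)) for u in u_grid]
    return np.atleast_1d(u_grid[int(np.argmin(costs))])
```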
Evaluation network and execution network introduction step S130:
Considering that a feedforward neural network can approximate a smooth or continuous nonlinear function to arbitrary precision when it contains a sufficient number of activation functions, an evaluation network and an execution network are introduced to approximate Q*(x_k, u_k) and u*(x_k) online, respectively; both the evaluation network and the execution network are neural networks.
The evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*. Assume that the number of activation functions in the evaluation network is N_c, and let Q̂*(x_k, u_k) be the best approximation of Q*(x_k, u_k) by the evaluation network in the least-squares sense; it can be expressed as:

Q̂*(x_k, u_k) = W_c^T φ_c(θ_1(k)),   θ_1(k) = (W_c^0)^T [x_k; u_k]

where W_c is the weight from the hidden layer to the output layer, φ_c(·) = [φ_c^1(·), …, φ_c^{N_c}(·)]^T is the set of all activation functions in the hidden layer of the evaluation network, W_c^0 = [w_c^1, …, w_c^{N_c}] is the weight from the input layer to the hidden layer of the evaluation network, w_c^i is the weight corresponding to the i-th activation function, θ_1(k) collects the activation-function input values corresponding to (x_k, u_k), and θ_1^i(k) = (w_c^i)^T [x_k; u_k] is the input value of the i-th activation function.
In the present invention W_c^0 is set to a constant value; therefore, only the weight from the hidden layer to the output layer needs to be adjusted.
Similarly, let the number of activation functions in the execution network be N_a, and let û*(x_k) be the best approximation of u*(x_k) by the execution network in the least-squares sense; it can be expressed as:

û*(x_k) = W_a^T φ_a(σ(k)),   σ(k) = (W_a^0)^T x_k

The input of the execution network is the system state, W_a is the weight from the hidden layer to the output layer, φ_a(·) = [φ_a^1(·), …, φ_a^{N_a}(·)]^T is the set of hidden-layer activation functions of the execution network, W_a^0 = [w_a^1, …, w_a^{N_a}] is the weight from the input layer to the hidden layer, w_a^i is the weight corresponding to the i-th activation function, σ(k) collects the activation-function input values corresponding to x_k, σ^i(k) = (w_a^i)^T x_k is the input value of the i-th activation function, and for x_{k+1} one has σ(k+1) = (W_a^0)^T x_{k+1}.
The present invention likewise sets W_a^0 to a constant value, so that only the weight from the hidden layer to the output layer is adjusted.
Estimation error calculation step S140:
Replacing the exact values Q*(x_k, u_k) and u*(x_k) with the best approximations Q̂*(x_k, u_k) and û*(x_k) in the Q-function Bellman equation yields the following estimation error:

e_k = W_c^T φ_c(θ_1(k)) - U(x_k, u_k) - W_c^T φ_c(θ_2(k+1))

where θ_2(k+1) denotes the activation-function input values of the evaluation network when its input is (x_{k+1}, û*(x_{k+1})), i.e. θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û*(x_{k+1})].
Optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online. Assume that at time k, with k ≥ l and l the time at which the behavior policy starts to generate state data (i.e. learning is performed only after the behavior policy has begun generating data), the estimates of W_c and W_a held by the evaluation network and the execution network are Ŵ_c(k) and Ŵ_a(k), respectively. The output of the execution network at time k can be expressed as:

û(x_k) = Ŵ_a^T(k) φ_a(σ(k))

Before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot provide its estimate of W_a at time k+1, so the estimate of W_a at time k+1 still adopts Ŵ_a(k), and the output of the execution network at time k+1 is:

û(x_{k+1}) = Ŵ_a^T(k) φ_a(σ(k+1))

Similarly, when the input is (x_k, u_k), the output of the evaluation network is:

Q̂(x_k, u_k) = Ŵ_c^T(k) φ_c(θ_1(k))

When the input is (x_{k+1}, û(x_{k+1})), the output of the evaluation network is:

Q̂(x_{k+1}, û(x_{k+1})) = Ŵ_c^T(k) φ_c(θ_2(k+1))

where θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û(x_{k+1})]. Likewise, before the state x_{k+1} is generated, the evaluation network cannot provide its estimate of W_c at time k+1, so the estimate of W_c at time k+1 also adopts Ŵ_c(k). Replacing the true values with the estimated values yields the following estimation error:

e_k = Ŵ_c^T(k) φ_c(θ_1(k)) - U(x_k, u_k) - Ŵ_c^T(k) φ_c(θ_2(k+1))
For the evaluation network, the goal of online learning is to drive the estimation error e_k to zero; the weight of the evaluation network is therefore adjusted by the following gradient-descent method:

Ŵ_c(k+1) = Ŵ_c(k) + α Δφ_c(k) e_k / ρ_c(k)

where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) - φ_c(θ_1(k)) is the regression vector, and ρ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is a normalization term.
The weight Ŵ_a(k) of the execution network is trained by the importance-weighting method. The objective function of the execution network is defined as:

E_a(k) = (1/2) e_a(k)^T e_a(k)

where the prediction error e_a(k) of the execution network is defined as:

e_a(k) = Q̂(x_{k+1}, û(x_{k+1})) - U_c

Here U_c is the desired ultimate objective value; in the present invention U_c = 0, i.e. during learning the execution network is to make Q̂(x_{k+1}, û(x_{k+1})) as small as possible. Likewise, the following modified gradient-descent method is adopted to adjust Ŵ_a(k) online:

Ŵ_a(k+1) = Ŵ_a(k) - (β / ρ_a(k)) ∂E_a(k)/∂Ŵ_a(k)

where ∂E_a(k)/∂Ŵ_a(k) is obtained by the chain rule through the evaluation network, β > 0 is the learning rate of the execution network, and ρ_a(k) = (1 + φ_a(σ(k+1))^T φ_a(σ(k+1)))^2 is a normalization term.
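To make the chain-rule structure of this update explicit, one hedged realization is sketched below: the scalar actor error is backpropagated through the evaluation network to the hidden-to-output weights of the execution network, and the step is normalized as described. The specific formula, the tanh activation and its derivative are assumptions of this sketch consistent with the text, not necessarily the exact patented rule.

```python
import numpy as np

# One possible realization of the modified, normalized gradient-descent update of
# the execution-network weights, backpropagating e_a = Q_hat - U_c through the critic.

def actor_update(Wa, Wc, Wc0, Wa0, x_next, Uc=0.0, beta=0.05):
    n = x_next.shape[0]
    sigma = Wa0.T @ x_next                   # actor hidden-layer input sigma(k+1)
    phi_a = np.tanh(sigma)
    u_hat = Wa.T @ phi_a                     # actor output u_hat(x_{k+1})

    theta2 = Wc0.T @ np.concatenate([x_next, u_hat])
    phi_c = np.tanh(theta2)
    dphi_c = 1.0 - phi_c ** 2                # tanh derivative
    e_a = Wc @ phi_c - Uc                    # actor error e_a(k)

    dQ_du = Wc0[n:, :] @ (Wc * dphi_c)       # dQ_hat / d u_hat  (shape m)
    rho_a = (1.0 + phi_a @ phi_a) ** 2       # normalization term
    grad = np.outer(phi_a, dQ_du) * e_a      # dE_a / dWa  (shape Na x m)
    return Wa - beta * grad / rho_a
```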
As can be seen from the training process of the evaluation network and the execution network, all state data used in the learning process are generated by the behavior policy u. When the weight Ŵ_c of the evaluation network and the weight Ŵ_a of the execution network have converged, the output of the execution network is an approximation of the optimal controller.
For the behavior policy:
In a specific embodiment, during the online learning of the optimal controller all of the state data used are generated by the behavior policy u. To ensure that the algorithm has sufficient exploration capability over the state-policy space, the state data generated by the behavior policy must be rich enough and satisfy a persistent-excitation-type condition to guarantee convergence of the algorithm. The behavior policy in the invention is u_k = u'_k + n_k, where u'_k is any feasible control policy, usually selected according to the characteristics of the controlled system and experience, and n_k is an exploration noise; n_k may be a sinusoidal or cosinusoidal signal containing sufficiently many frequencies, or a random signal with bounded amplitude.
For the evaluation network and the execution network:
The evaluation network and the execution network both adopt a single-hidden-layer feedforward neural network. The input of the evaluation network, which approximates the Q-function, is the state together with the control input, and its output is a scalar; the input of the execution network is the system state, and its output is an m-dimensional vector. During learning, both neural networks adjust only the weights from the hidden layer to the output layer; the weights from the input layer to the hidden layer are randomly generated before learning starts and are kept unchanged during learning. The hidden-layer activation functions of both neural networks can be chosen from common hyperbolic tangent functions, Sigmoid functions, linear rectifiers, polynomial functions and the like.
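Putting the pieces together, the following self-contained sketch mirrors the online learning loop of Fig. 4 on a toy second-order plant. The plant model appears only to generate state data, standing in for the real, unknown system; every dimension, gain, learning rate and the quadratic utility are assumptions made for illustration and do not come from the patent.

```python
import numpy as np

# End-to-end sketch: the behavior policy drives the plant, and at every step the
# critic and actor hidden-to-output weights are updated from (x_k, u_k, x_{k+1}).

rng = np.random.default_rng(2)
n, m, Nc, Na = 2, 1, 24, 24
Wc0 = rng.uniform(-1, 1, (n + m, Nc)); Wa0 = rng.uniform(-1, 1, (n, Na))
Wc = np.zeros(Nc); Wa = np.zeros((Na, m))
Qx, Ru = np.eye(n), 0.1 * np.eye(m)
alpha, beta, phi = 0.1, 0.05, np.tanh

F = lambda x, u: np.array([0.9 * x[0] + 0.1 * x[1],                   # assumed plant, used
                           -0.2 * np.sin(x[0]) + 0.8 * x[1] + u[0]])  # only to generate data

x = np.array([1.0, -0.5])
for k in range(2000):
    u = -0.3 * x[:m] + 0.3 * np.sin(0.7 * k) + 0.1 * rng.uniform(-1, 1, m)  # behavior policy
    x_next = F(x, u)

    u_hat_next = Wa.T @ phi(Wa0.T @ x_next)                    # actor estimate at k+1 (uses Wa(k))
    th1 = Wc0.T @ np.concatenate([x, u])
    th2 = Wc0.T @ np.concatenate([x_next, u_hat_next])
    e = Wc @ phi(th1) - (x @ Qx @ x + u @ Ru @ u) - Wc @ phi(th2)

    dphi = phi(th2) - phi(th1)                                 # critic regression vector
    Wc = Wc + alpha * e * dphi / (1 + dphi @ dphi) ** 2        # normalized critic update

    pa = phi(Wa0.T @ x_next)
    dQ_du = Wc0[n:, :] @ (Wc * (1 - phi(th2) ** 2))            # backprop through critic
    Wa = Wa - beta * (Wc @ phi(th2)) * np.outer(pa, dQ_du) / (1 + pa @ pa) ** 2
    x = x_next
```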
Referring to fig. 2 and fig. 3, schematic diagrams of the evaluation network and the execution network are shown, respectively.
Of course, the evaluation network and the execution network of the present invention can also be selected as a feedforward neural network with a plurality of hidden layers, and the weights of all the connection layers can also be adjusted in the learning process.
Referring to fig. 4, a schematic diagram of the online learning control method of the present invention is shown.
The invention further discloses a storage medium for storing computer executable instructions, which is characterized in that:
the computer-executable instructions, when executed by a processor, perform the above-described online learning control method of a nonlinear discrete-time system.
The invention has the following advantages:
1. the invention provides an online learning control method suitable for a general nonlinear discrete time system, which can realize real-time online learning of an optimal controller without repeated iteration between strategy evaluation and strategy improvement;
2. the invention adopts an off-policy learning mechanism, which effectively overcomes the insufficient exploration of the state-policy space in the direct heuristic dynamic programming method; in addition, the execution network and the evaluation network may use activation functions of any form.
3. Compared with the classical direct heuristic dynamic programming method, the online learning method provided by the patent has better exploration capability on a state-strategy space, and the types of activation functions of the execution network and the evaluation network can be selected at will and are not limited to hyperbolic tangent functions; compared with an iterative method such as strategy iteration or value iteration, the method can realize online learning of the optimal controller, does not need a system model, and only needs state data generated by a behavior strategy.
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, which may be centralized on a single computing device, or alternatively, they may be implemented using program code executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or separately fabricated into various integrated circuit modules, or multiple ones of them fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. An online learning control method of a nonlinear discrete time system comprises the following steps:
behavior policy selection step S110:
selecting a behavior policy u based on the characteristics of the controlled object and existing experience, wherein the behavior policy is the control policy actually applied to the controlled object during the learning process and is mainly used to generate the system state data required for learning;
optimal Q-function definition step S120:
the following optimal Q-function is defined:

Q*(x_k, u_k) = U(x_k, u_k) + V*(x_{k+1})

where U(x_k, u_k) is the utility (stage cost) of the value function to be minimized and V* is the optimal value function; its physical meaning is that at time k the behavior policy u is applied, while at all subsequent times the optimal control policy u*, i.e. the target policy, is applied; by the definition of the optimal Q-function, the above equation can be equivalently expressed as:

Q*(x_k, u_k) = U(x_k, u_k) + Q*(x_{k+1}, u*(x_{k+1}))

the optimal control u*(x_k) can be expressed as:

u*(x_k) = arg min_u Q*(x_k, u)

and, unlike the linear case, Q*(x_k, u_k) and u*(x_k) are nonlinear functions of (x_k, u_k) and x_k, respectively;
evaluation network and execution network introduction step S130:
an evaluation network and an execution network are introduced to approximate Q*(x_k, u_k) and u*(x_k) online, respectively, wherein both the evaluation network and the execution network are neural networks;
the evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*; assume that the number of activation functions in the evaluation network is N_c, and let Q̂*(x_k, u_k) be the best approximation of Q*(x_k, u_k) by the evaluation network in the least-squares sense; it can be expressed as:

Q̂*(x_k, u_k) = W_c^T φ_c(θ_1(k)),   θ_1(k) = (W_c^0)^T [x_k; u_k]

where W_c is the weight from the hidden layer to the output layer, φ_c(·) = [φ_c^1(·), …, φ_c^{N_c}(·)]^T is the set of all activation functions in the hidden layer of the evaluation network, W_c^0 = [w_c^1, …, w_c^{N_c}] is the weight from the input layer to the hidden layer of the evaluation network, w_c^i is the weight corresponding to the i-th activation function, θ_1(k) collects the activation-function input values corresponding to (x_k, u_k), and θ_1^i(k) = (w_c^i)^T [x_k; u_k] is the input value of the i-th activation function;
let the number of activation functions in the execution network be N_a, and let û*(x_k) be the best approximation of u*(x_k) by the execution network in the least-squares sense; it can be expressed as:

û*(x_k) = W_a^T φ_a(σ(k)),   σ(k) = (W_a^0)^T x_k

the input of the execution network is the system state, W_a is the weight from the hidden layer to the output layer, φ_a(·) = [φ_a^1(·), …, φ_a^{N_a}(·)]^T is the set of hidden-layer activation functions of the execution network, W_a^0 = [w_a^1, …, w_a^{N_a}] is the weight from the input layer to the hidden layer, w_a^i is the weight corresponding to the i-th activation function, σ(k) collects the activation-function input values corresponding to x_k, σ^i(k) = (w_a^i)^T x_k is the input value of the i-th activation function, and for x_{k+1} one has σ(k+1) = (W_a^0)^T x_{k+1};
Estimation error calculation step S140:
replacing the exact values Q*(x_k, u_k) and u*(x_k) with the best approximations Q̂*(x_k, u_k) and û*(x_k) in the Q-function Bellman equation yields the following estimation error:

e_k = W_c^T φ_c(θ_1(k)) - U(x_k, u_k) - W_c^T φ_c(θ_2(k+1))

where θ_2(k+1) denotes the activation-function input values of the evaluation network when its input is (x_{k+1}, û*(x_{k+1})), i.e. θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û*(x_{k+1})];
Optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online; assume that at time k, with k ≥ l and l the time at which the behavior policy starts to generate state data (i.e. learning is performed only after the behavior policy has begun generating data), the estimates of W_c and W_a held by the evaluation network and the execution network are Ŵ_c(k) and Ŵ_a(k), respectively; the output of the execution network at time k can be expressed as:

û(x_k) = Ŵ_a^T(k) φ_a(σ(k))

before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot provide its estimate of W_a at time k+1, so the estimate of W_a at time k+1 still adopts Ŵ_a(k), and the output of the execution network at time k+1 is:

û(x_{k+1}) = Ŵ_a^T(k) φ_a(σ(k+1))

similarly, when the input is (x_k, u_k), the output of the evaluation network is:

Q̂(x_k, u_k) = Ŵ_c^T(k) φ_c(θ_1(k))

when the input is (x_{k+1}, û(x_{k+1})), the output of the evaluation network is:

Q̂(x_{k+1}, û(x_{k+1})) = Ŵ_c^T(k) φ_c(θ_2(k+1))

where θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û(x_{k+1})]; likewise, before the state x_{k+1} is generated, the evaluation network cannot provide its estimate of W_c at time k+1, so the estimate of W_c at time k+1 also adopts Ŵ_c(k); replacing the true values with the estimated values yields the following estimation error:

e_k = Ŵ_c^T(k) φ_c(θ_1(k)) - U(x_k, u_k) - Ŵ_c^T(k) φ_c(θ_2(k+1))

the weight Ŵ_c(k) of the evaluation network is adjusted by a gradient-descent method; the weight Ŵ_a(k) of the execution network is trained by an importance-weighting method, and a modified gradient-descent method is used to adjust Ŵ_a(k) online;
when the weight Ŵ_c of the evaluation network and the weight Ŵ_a of the execution network have converged, the output of the execution network is an approximation of the optimal controller.
2. The online learning control method according to claim 1, characterized in that:
in the evaluation network and execution network introduction step S130,
for the evaluation network, the input-layer-to-hidden-layer weight W_c^0 is set to a constant value, so that only the weight from the hidden layer to the output layer is adjusted;
for the execution network, W_a^0 is likewise set to a constant value, and only the weight from the hidden layer to the output layer is adjusted.
3. The online learning control method according to claim 2, characterized in that:
in the optimal weight calculation step S150:
the weight of the evaluation network is adjusted by the following gradient-descent method:

Ŵ_c(k+1) = Ŵ_c(k) + α Δφ_c(k) e_k / ρ_c(k)

where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) - φ_c(θ_1(k)) is the regression vector, and ρ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is a normalization term.
4. The online learning control method according to claim 3, characterized in that:
in the behavior policy selection step S110, the behavior policy is u_k = u'_k + n_k, where u'_k is any feasible control policy, selected according to the characteristics of the controlled system and experience, and n_k is an exploration noise; n_k is a sinusoidal or cosinusoidal signal containing sufficiently many frequencies, or a random signal with bounded amplitude.
5. The online learning control method according to claim 3, characterized in that:
the evaluation network and the execution network are single-hidden-layer feedforward neural networks; the input of the evaluation network used to approximate the Q-function is the state together with the control input, while the input of the execution network is the system state and its output is an m-dimensional vector.
6. The online learning control method according to claim 5, characterized in that:
the evaluation network and the execution network only adjust the weights from the hidden layer to the output layer, and the weights from the input layer to the hidden layer are randomly generated before the learning process is started and are kept unchanged in the learning process.
7. The online learning control method according to claim 5, characterized in that:
the activation function of the evaluation network and the execution network is one of a hyperbolic tangent function, a Sigmoid function, a linear rectifier, and a polynomial function.
8. A storage medium for storing computer-executable instructions, characterized in that:
the computer executable instructions, when executed by a processor, perform a method of online learning control of a non-linear discrete time system as claimed in any one of claims 1 to 7.
CN202011635930.6A 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system Active CN113485099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011635930.6A CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011635930.6A CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Publications (2)

Publication Number Publication Date
CN113485099A true CN113485099A (en) 2021-10-08
CN113485099B CN113485099B (en) 2023-09-22

Family

ID=77933336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011635930.6A Active CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Country Status (1)

Country Link
CN (1) CN113485099B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117111620A (en) * 2023-10-23 2023-11-24 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009186317A (en) * 2008-02-06 2009-08-20 Mitsubishi Electric Corp Radar control system
CN107436424A (en) * 2017-09-08 2017-12-05 中国电子科技集团公司信息科学研究院 A kind of more radar dynamic regulating methods and device based on information gain
CN110214264A (en) * 2016-12-23 2019-09-06 御眼视觉技术有限公司 The navigation system of restricted responsibility with application
CN110462544A (en) * 2017-03-20 2019-11-15 御眼视觉技术有限公司 The track of autonomous vehicle selects
CN110826026A (en) * 2020-01-13 2020-02-21 江苏万链区块链技术研究院有限公司 Method and system for publication based on block chain technology and associated copyright protection
CN111142383A (en) * 2019-12-30 2020-05-12 中国电子科技集团公司信息科学研究院 Online learning method for optimal controller of nonlinear system
CN111812973A (en) * 2020-05-21 2020-10-23 天津大学 Event trigger optimization control method of discrete time nonlinear system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009186317A (en) * 2008-02-06 2009-08-20 Mitsubishi Electric Corp Radar control system
CN110214264A (en) * 2016-12-23 2019-09-06 御眼视觉技术有限公司 The navigation system of restricted responsibility with application
CN110462544A (en) * 2017-03-20 2019-11-15 御眼视觉技术有限公司 The track of autonomous vehicle selects
CN107436424A (en) * 2017-09-08 2017-12-05 中国电子科技集团公司信息科学研究院 A kind of more radar dynamic regulating methods and device based on information gain
CN111142383A (en) * 2019-12-30 2020-05-12 中国电子科技集团公司信息科学研究院 Online learning method for optimal controller of nonlinear system
CN110826026A (en) * 2020-01-13 2020-02-21 江苏万链区块链技术研究院有限公司 Method and system for publication based on block chain technology and associated copyright protection
CN111812973A (en) * 2020-05-21 2020-10-23 天津大学 Event trigger optimization control method of discrete time nonlinear system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Hong-Gui Han et al.: "Real-Time Model Predictive Control Using a Self-Organizing Neural Network", IEEE Transactions on Neural Networks and Learning Systems *
J. Si et al.: "Online learning control by association and reinforcement", Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Neural Computing: New Challenges and Perspectives for the New Millennium *
Li, X. X. et al.: "Off-Policy Q-Learning for Infinite Horizon LQR Problem with Unknown Dynamics", 27th IEEE International Symposium on Industrial Electronics (ISIE) *
Zhang Zhenning et al.: "Adaptive optimal attitude control of a reentry vehicle", Journal of Astronautics *
Xu Tengju et al.: "Load balancing strategy based on D2D communication mechanism in heterogeneous networks", Computer Engineering and Design *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117111620A (en) * 2023-10-23 2023-11-24 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system
CN117111620B (en) * 2023-10-23 2024-03-29 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system

Also Published As

Publication number Publication date
CN113485099B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
Li et al. Adaptive optimized backstepping control-based RL algorithm for stochastic nonlinear systems with state constraints and its application
Lucia et al. A deep learning-based approach to robust nonlinear model predictive control
CA2414707C (en) Computer method and apparatus for constraining a non-linear approximator of an empirical process
Zhang et al. Adaptive neural tracking control of pure-feedback nonlinear systems with unknown gain signs and unmodeled dynamics
Jiang et al. Robust adaptive dynamic programming
Suykens et al. Robust local stability of multilayer recurrent neural networks
Wang et al. Adaptive neural finite-time containment control for nonlower triangular nonlinear multi-agent systems with dynamics uncertainties
CN112405542B (en) Musculoskeletal robot control method and system based on brain inspiring multitask learning
Li et al. Policy iteration based Q-learning for linear nonzero-sum quadratic differential games
Jia et al. Optimization of control parameters based on genetic algorithms for spacecraft attitude tracking with input constraints
Grancharova et al. Computation, approximation and stability of explicit feedback min–max nonlinear model predictive control
Ibrahim et al. Regulated Kalman filter based training of an interval type-2 fuzzy system and its evaluation
Kosmatopoulos Control of unknown nonlinear systems with efficient transient performance using concurrent exploitation and exploration
CN113485099A (en) Online learning control method of nonlinear discrete time system
Chen et al. Novel adaptive neural networks control with event-triggered for uncertain nonlinear system
CN114740710A (en) Random nonlinear multi-agent reinforcement learning optimization formation control method
CN115167102A (en) Reinforced learning self-adaptive PID control method based on parallel dominant motion evaluation
CN111142383B (en) Online learning method for optimal controller of nonlinear system
Sabes et al. Reinforcement learning by probability matching
Song et al. Adaptive dynamic programming: single and multiple controllers
Wang et al. Model-free nonlinear robust control design via online critic learning
Fu et al. Adaptive optimal control of unknown nonlinear systems with different time scales
Chen et al. Adaptive fuzzy PD+ control for attitude maneuver of rigid spacecraft
Lewis et al. Neural network control of robot arms and nonlinear systems
CN114063458A (en) Preset performance control method of non-triangular structure system independent of initial conditions

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
GR01 Patent grant