CN113485099A - Online learning control method of nonlinear discrete time system - Google Patents

Online learning control method of nonlinear discrete time system

Info

Publication number
CN113485099A
Authority
CN
China
Prior art keywords
network
optimal
input
evaluation
function
Prior art date
Legal status
Granted
Application number
CN202011635930.6A
Other languages
Chinese (zh)
Other versions
CN113485099B (en)
Inventor
李新兴
查文中
王雪源
王蓉
Current Assignee
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN202011635930.6A priority Critical patent/CN113485099B/en
Publication of CN113485099A publication Critical patent/CN113485099A/en
Application granted granted Critical
Publication of CN113485099B publication Critical patent/CN113485099B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B13/027: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0205: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system
    • G05B13/021: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric not using a model or a simulator of the controlled system in which a variable is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an online learning control method for a nonlinear discrete-time system, comprising a behavior policy selection step, an optimal Q-function definition step, an evaluation network and execution network introduction step, an estimation error calculation step, and a final optimal weight calculation step. The invention realizes real-time online learning of the optimal controller without repeated iteration between policy evaluation and policy improvement. By adopting an off-policy learning mechanism, it effectively overcomes the insufficient exploration of the state-policy space in the direct heuristic dynamic programming method; the execution network and the evaluation network can use activation functions of any form; and online learning of the optimal controller is achieved without a system model, requiring only the state data generated by the behavior policy.

Description

Online learning control method of nonlinear discrete time system
Technical Field
The invention relates to the field of industrial production control, in particular to an online learning control method for a nonlinear discrete time system.
Background
In industrial production, engineering technicians often need to optimize the controllers of control objects such as robots, unmanned aerial vehicles and unmanned vehicles so as to meet certain control specifications. These control objects tend to exhibit strong nonlinearity, which makes controller optimization difficult. From the perspective of optimal control, obtaining the optimal controller requires solving the complex Hamilton-Jacobi-Bellman (HJB) equation, which is a nonlinear partial differential equation and is very difficult to solve. Traditional dynamic programming, variational methods, spectral methods and the like often face severe limitations in practical applications due to their extremely high computational complexity.
Adaptive dynamic programming is a new intelligent control approach that has emerged in recent years; it combines reinforcement learning, neural-network approximation, dynamic programming and adaptive control, can realize online learning of the optimal controller, and effectively alleviates the high computational complexity of traditional methods. For the optimal control problem of nonlinear discrete-time systems, Jennie Si and Yu-Tsung Wang first proposed the direct heuristic dynamic programming algorithm in the paper "Online learning control by association and reinforcement". The algorithm adopts the basic idea of generalized policy iteration and, by introducing two neural networks (an execution network and an evaluation network), can realize real-time online learning of the optimal controller and the optimal value function. Through continuous development in recent years, the convergence and stability analysis of the algorithm now has a certain theoretical basis. Although direct heuristic dynamic programming can realize online adaptive optimal control, it still has the following shortcomings: 1) the algorithm adopts an on-policy learning mechanism, explores the state-policy space insufficiently, and easily falls into a locally optimal solution; 2) the activation functions of the execution network and the evaluation network are hyperbolic tangent functions, and all existing convergence and stability analysis results are based on the hyperbolic tangent function, so they are not applicable to other types of activation functions.
Therefore, how to overcome the above disadvantages of the direct heuristic dynamic programming method, so that the convergence and stability analysis results are no longer limited to the hyperbolic tangent function, has become a technical problem to be urgently solved in the prior art.
Disclosure of Invention
The invention aims to provide an online learning control method for a nonlinear discrete-time system that has better exploration capability over the state-policy space, so that the activation functions of the execution network and the evaluation network can be chosen freely and are no longer limited to hyperbolic tangent functions. Compared with iterative methods such as policy iteration or value iteration, the method realizes online learning of the optimal controller, requires no system model, and needs only the state data generated by the behavior policy.
In order to achieve the purpose, the invention adopts the following technical scheme:
an online learning control method of a nonlinear discrete time system comprises the following steps:
behavior policy selection step S110:
selecting a behavior policy u based on the characteristics of the controlled object and existing experience, wherein the behavior policy is the control policy actually applied to the controlled object during the learning process and is mainly used to generate the system state data required for learning;
optimal Q-function definition step S120:
the following optimal Q-function is defined:

Q*(x_k, u_k) = U(x_k, u_k) + V*(x_{k+1})

where U(x_k, u_k) is the utility (stage cost) of the value function to be minimized and V* is the optimal value function. Its physical meaning is: at time k the behavior policy u is applied, while at all subsequent times the optimal control policy u*, i.e. the target policy, is applied. By the definition of the optimal Q-function, the above equation can be equivalently expressed as:

Q*(x_k, u_k) = U(x_k, u_k) + Q*(x_{k+1}, u*(x_{k+1}))

The optimal control u*(x_k) can be expressed as:

u*(x_k) = arg min_u Q*(x_k, u)

Unlike the linear case, Q*(x_k, u_k) and u*(x_k) are nonlinear functions of (x_k, u_k) and x_k, respectively.
evaluation network and execution network introduction step S130:
an evaluation network and an execution network are introduced to approximate Q*(x_k, u_k) and u*(x_k) online, respectively, wherein both the evaluation network and the execution network are neural networks;
the evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*. Assume that the number of activation functions in the evaluation network is N_c, and let Q̂*(x_k, u_k) be the best approximation of Q*(x_k, u_k) by the evaluation network in the least-squares sense; it can be expressed as:

Q̂*(x_k, u_k) = W_c^T φ_c(θ_1(k)),   θ_1(k) = (W_c^0)^T [x_k; u_k]

where W_c is the weight from the hidden layer to the output layer, φ_c(·) = [φ_c^1(·), …, φ_c^{N_c}(·)]^T is the set of all activation functions in the hidden layer of the evaluation network, W_c^0 = [w_c^1, …, w_c^{N_c}] is the weight from the input layer to the hidden layer of the evaluation network, w_c^i is the weight corresponding to the i-th activation function, θ_1(k) collects the activation-function input values corresponding to (x_k, u_k), and θ_1^i(k) = (w_c^i)^T [x_k; u_k] is the input value of the i-th activation function;
let the number of activation functions in the execution network be N_a, and let û*(x_k) be the best approximation of u*(x_k) by the execution network in the least-squares sense; it can be expressed as:

û*(x_k) = W_a^T φ_a(σ(k)),   σ(k) = (W_a^0)^T x_k

The input of the execution network is the system state, W_a is the weight from the hidden layer to the output layer, φ_a(·) = [φ_a^1(·), …, φ_a^{N_a}(·)]^T is the set of hidden-layer activation functions of the execution network, W_a^0 = [w_a^1, …, w_a^{N_a}] is the weight from the input layer to the hidden layer, w_a^i is the weight corresponding to the i-th activation function, σ(k) collects the activation-function input values corresponding to x_k, σ^i(k) = (w_a^i)^T x_k is the input value of the i-th activation function, and for x_{k+1} one has σ(k+1) = (W_a^0)^T x_{k+1}.
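For illustration only, the two network structures described above can be built as single-hidden-layer networks with fixed random input-to-hidden weights W_c^0, W_a^0 and adjustable hidden-to-output weights W_c, W_a. All dimensions, the random initialization and the choice of tanh in the sketch below are assumptions for the example and are not prescribed by the text.

```python
import numpy as np

# Minimal single-hidden-layer evaluation (critic) and execution (actor) networks
# with fixed random input-to-hidden weights, as described above.

rng = np.random.default_rng(0)
n, m = 2, 1          # state and input dimensions (assumed)
Nc, Na = 20, 20      # number of hidden activation functions (assumed)

Wc0 = rng.uniform(-1.0, 1.0, size=(n + m, Nc))   # critic input-to-hidden weights (held fixed)
Wa0 = rng.uniform(-1.0, 1.0, size=(n, Na))       # actor  input-to-hidden weights (held fixed)
Wc = np.zeros(Nc)                                # critic hidden-to-output weights (learned)
Wa = np.zeros((Na, m))                           # actor  hidden-to-output weights (learned)

phi = np.tanh                                    # any smooth activation could be used here

def critic(x, u):
    """Q-hat(x, u) = Wc^T phi(theta), theta = Wc0^T [x; u]."""
    theta = Wc0.T @ np.concatenate([x, u])
    return Wc @ phi(theta)

def actor(x):
    """u-hat(x) = Wa^T phi(sigma), sigma = Wa0^T x."""
    sigma = Wa0.T @ x
    return Wa.T @ phi(sigma)
```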
Estimation error calculation step S140:
replacing the exact values Q*(x_k, u_k) and u*(x_k) with the best approximations Q̂*(x_k, u_k) and û*(x_k) in the Q-function Bellman equation yields the following estimation error:

e_k = W_c^T φ_c(θ_1(k)) - U(x_k, u_k) - W_c^T φ_c(θ_2(k+1))

where θ_2(k+1) denotes the activation-function input values of the evaluation network when its input is (x_{k+1}, û*(x_{k+1})), i.e. θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û*(x_{k+1})].
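The estimation error above can be computed directly from two critic evaluations and the utility. A minimal sketch follows; the quadratic utility U(x, u) = x^T Q x + u^T R u is a common choice assumed here for illustration, and all names are illustrative.

```python
import numpy as np

# Bellman (temporal-difference) residual e_k for given critic weights, assuming
# a quadratic utility; the quadratic form and all variable names are assumptions.

def utility(x, u, Q, R):
    return float(x @ Q @ x + u @ R @ u)

def bellman_residual(Wc, Wc0, phi, x_k, u_k, x_next, u_next_hat, Q, R):
    theta1 = Wc0.T @ np.concatenate([x_k, u_k])            # critic input for (x_k, u_k)
    theta2 = Wc0.T @ np.concatenate([x_next, u_next_hat])  # critic input for (x_{k+1}, u-hat(x_{k+1}))
    return Wc @ phi(theta1) - utility(x_k, u_k, Q, R) - Wc @ phi(theta2)
```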
Optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online. Assume that at time k, with k ≥ l and l the time at which the behavior policy starts to generate state data (i.e. learning is performed only after the behavior policy has begun generating data), the estimates of W_c and W_a held by the evaluation network and the execution network are Ŵ_c(k) and Ŵ_a(k), respectively. The output of the execution network at time k can be expressed as:

û(x_k) = Ŵ_a^T(k) φ_a(σ(k))

Before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot provide its estimate of W_a at time k+1, so the estimate of W_a at time k+1 still adopts Ŵ_a(k), and the output of the execution network at time k+1 is:

û(x_{k+1}) = Ŵ_a^T(k) φ_a(σ(k+1))

Similarly, when the input is (x_k, u_k), the output of the evaluation network is:

Q̂(x_k, u_k) = Ŵ_c^T(k) φ_c(θ_1(k))

When the input is (x_{k+1}, û(x_{k+1})), the output of the evaluation network is:

Q̂(x_{k+1}, û(x_{k+1})) = Ŵ_c^T(k) φ_c(θ_2(k+1))

where θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û(x_{k+1})]. Likewise, before the state x_{k+1} is generated, the evaluation network cannot provide its estimate of W_c at time k+1, so the estimate of W_c at time k+1 also adopts Ŵ_c(k). Replacing the true values with the estimated values yields the following estimation error:

e_k = Ŵ_c^T(k) φ_c(θ_1(k)) - U(x_k, u_k) - Ŵ_c^T(k) φ_c(θ_2(k+1))

The weight Ŵ_c(k) of the evaluation network is adjusted by a gradient-descent method; the weight Ŵ_a(k) of the execution network is trained by an importance-weighting method, and a modified gradient-descent method is used to adjust Ŵ_a(k) online. When the weight Ŵ_c of the evaluation network and the weight Ŵ_a of the execution network have converged, the output of the execution network is an approximation of the optimal controller.
Optionally, in the evaluation network and execution network introduction step S130,
for the evaluation network, the input-layer-to-hidden-layer weight W_c^0 is set to a constant value, so that only the weight from the hidden layer to the output layer is adjusted;
for the execution network, W_a^0 is likewise set to a constant value, and only the weight from the hidden layer to the output layer is adjusted.
Optionally, in the optimal weight calculating step S150:
the weight of the evaluation network is adjusted by the following gradient-descent method:

Ŵ_c(k+1) = Ŵ_c(k) + α Δφ_c(k) e_k / ρ_c(k)

where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) - φ_c(θ_1(k)) is the regression vector, and ρ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is a normalization term.
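A minimal sketch of this normalized gradient-descent adjustment is given below. The sign of the correction term follows from differentiating e_k^2/2 with the definitions above, and the default learning rate is an arbitrary assumption of the sketch.

```python
import numpy as np

# Normalized gradient-descent update of the evaluation-network weights using the
# regression vector delta_phi = phi_c(theta2(k+1)) - phi_c(theta1(k)) and the
# normalization term rho_c = (1 + delta_phi^T delta_phi)^2.

def critic_update(Wc, phi_theta1, phi_theta2, e_k, alpha=0.1):
    delta_phi = phi_theta2 - phi_theta1
    rho_c = (1.0 + delta_phi @ delta_phi) ** 2
    return Wc + alpha * e_k * delta_phi / rho_c
```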
The weight of the execution network is trained by the importance-weighting method, and a modified gradient-descent method is used to adjust Ŵ_a(k) online, as described in step S150 of the detailed embodiment below.
Optionally, in the behavior policy selection step S110, the behavior policy is u_k = u'_k + n_k, where u'_k is any feasible control policy, selected according to the characteristics of the controlled system and experience, and n_k is an exploration noise; n_k may be a sinusoidal or cosinusoidal signal containing sufficiently many frequencies, or a random signal with bounded amplitude.
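As one hedged example of such a behavior policy, the sketch below adds a multi-frequency sinusoidal signal and a bounded random term to an assumed stabilizing feedback law u'_k = -K x_k; the gain K, the frequencies and the amplitudes are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Behavior policy u_k = u'_k + n_k: a base feedback policy plus exploration noise
# built from several sinusoids and a bounded random term.

rng = np.random.default_rng(1)
K = np.array([[0.5, 0.2]])             # assumed stabilizing feedback gain for u' = -K x
freqs = np.array([0.7, 1.3, 2.9, 5.1]) # assumed exploration frequencies

def behavior_policy(x, k, noise_amp=0.2):
    u_base = -K @ x
    n_k = noise_amp * (np.sin(freqs * k).sum() / len(freqs)
                       + rng.uniform(-1.0, 1.0))
    return u_base + n_k
```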
Optionally, the evaluation network and the execution network are single-hidden-layer feedforward neural networks; the input of the evaluation network used to approximate the Q-function is the state together with the control input, while the input of the execution network is the system state and its output is an m-dimensional vector.
Optionally, the evaluation network and the execution network only adjust the weights from the hidden layer to the output layer, and the weights from the input layer to the hidden layer are randomly generated before the learning process starts and are kept unchanged in the learning process.
Optionally, the activation function of the evaluation network and the execution network is one of a hyperbolic tangent function, a Sigmoid function, a linear rectifier, and a polynomial function.
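Since the analysis is not tied to the hyperbolic tangent, the hidden-layer activation can be swapped freely. For illustration, the four families named above could be provided as interchangeable callables; the function names below are assumptions of the sketch, not identifiers from the patent.

```python
import numpy as np

# Interchangeable activation functions for the evaluation and execution networks.
def tanh(z): return np.tanh(z)                      # hyperbolic tangent
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))     # Sigmoid function
def relu(z): return np.maximum(z, 0.0)              # linear rectifier
def poly(z, p=2): return z ** p                     # polynomial activation (degree assumed)
```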
The invention further discloses a storage medium for storing computer executable instructions, which is characterized in that:
the computer-executable instructions, when executed by a processor, perform the online learning control method of a nonlinear discrete-time system.
The invention has the following advantages:
1. the invention provides an online learning control method suitable for a general nonlinear discrete time system, which can realize real-time online learning of an optimal controller without repeated iteration between strategy evaluation and strategy improvement;
2. the invention adopts an off-policy learning mechanism, which effectively overcomes the insufficient exploration of the state-policy space in the direct heuristic dynamic programming method; in addition, the execution network and the evaluation network may use activation functions of any form.
3. Compared with the classical direct heuristic dynamic programming method, the online learning method provided by the patent has better exploration capability on a state-strategy space, and the types of activation functions of the execution network and the evaluation network can be selected at will and are not limited to hyperbolic tangent functions; compared with an iterative method such as strategy iteration or value iteration, the method can realize online learning of the optimal controller, does not need a system model, and only needs state data generated by a behavior strategy.
Drawings
FIG. 1 is a flow chart of an online learning control method for a non-linear discrete time system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an evaluation network of an online learning control method of a nonlinear discrete time system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an execution network of an online learning control method for a nonlinear discrete-time system according to an embodiment of the present invention;
fig. 4 is an algorithm diagram of an online learning control method of a nonlinear discrete-time system according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The present invention first considers the following optimal control problem for a nonlinear discrete-time system. Consider the discrete-time system:

x_{k+1} = F(x_k, u_k),  x_0 = x(0)

where x_k is the system state and u_k is the system input. The system function F(x_k, u_k) is Lipschitz continuous on a compact set Ω containing the origin and satisfies F(0,0) = 0. It is assumed that the system is stabilizable on Ω, i.e. there exists a control sequence u_1, …, u_k, … such that x_k → 0. In addition, the system function F(x_k, u_k) is assumed to be unknown. The goal of optimal control of the nonlinear system is to find a feasible control policy that stabilizes the system while minimizing the following value function:

V(x_k) = Σ_{i=k}^{∞} U(x_i, u_i)

where U(x_i, u_i) ≥ 0 is the utility (stage-cost) function. According to the Bellman optimality principle, the optimal control policy u* satisfies the following Bellman equation:

V*(x_k) = min_{u_k} [ U(x_k, u_k) + V*(x_{k+1}) ],  s.t. x_{k+1} = F(x_k, u_k)

Thus, the optimal controller u* has the following expression:

u*(x_k) = arg min_{u_k} [ U(x_k, u_k) + V*(x_{k+1}) ]

Substituting the above equation into the Bellman equation yields the following HJB equation:

V*(x_k) = U(x_k, u*(x_k)) + V*(F(x_k, u*(x_k)))
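For intuition only, the value function above can be approximated for a known simulation model by truncating the infinite sum. The sketch below assumes a quadratic utility and a known F purely for illustration; the patented method itself never requires F and uses only measured state data.

```python
import numpy as np

# Finite-horizon approximation of V(x_k) = sum_i U(x_i, u_i) by rolling an
# assumed-known model forward under a given policy (illustration only).

def rollout_cost(F, policy, x0, Q, R, horizon=200):
    x, cost = np.asarray(x0, dtype=float), 0.0
    for _ in range(horizon):
        u = policy(x)
        cost += float(x @ Q @ x + u @ R @ u)   # utility U(x, u), assumed quadratic
        x = F(x, u)
    return cost
```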
thus, referring to fig. 1, there is shown an online learning control method of a nonlinear discrete-time system according to the present invention, comprising the steps of:
behavior policy selection step S110:
According to the characteristics of the controlled object, a behavior policy u is selected based on existing experience. The behavior policy is the control policy actually applied to the controlled object during the learning process and is mainly used to generate the system state data required for learning.
After the behavior strategy is selected, the optimal controller is to be learned online.
Optimal Q-function definition step S120:
the following optimal Q-function is defined:

Q*(x_k, u_k) = U(x_k, u_k) + V*(x_{k+1})

Its physical meaning is: at time k the behavior policy u is applied, while at all subsequent times the optimal control policy u*, i.e. the target policy, is applied. By the definition of the optimal Q-function, the above equation can be equivalently expressed as:

Q*(x_k, u_k) = U(x_k, u_k) + Q*(x_{k+1}, u*(x_{k+1}))

The optimal control u*(x_k) can be expressed as:

u*(x_k) = arg min_u Q*(x_k, u)

Unlike the linear case, Q*(x_k, u_k) and u*(x_k) are nonlinear functions of (x_k, u_k) and x_k, respectively.
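Because arg min_u Q*(x_k, u) has no closed form for a general nonlinear Q-function, a near-greedy action can be extracted numerically. The coarse grid search below is only an illustrative sketch with an assumed critic signature; in the patented method the execution network, not a search, supplies this minimizer.

```python
import numpy as np

# Extracting a (near-)greedy scalar action u* = argmin_u Q*(x, u) from a critic
# by searching a finite grid of candidate actions (illustration only).

def greedy_action(critic, x, u_grid=np.linspace(-1.0, 1.0, 201)):
    costs = [critic(x, np.atleast_1d(u)) for u in u_grid]
    return np.atleast_1d(u_grid[int(np.argmin(costs))])
```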
Evaluation network and execution network introduction step S130:
Considering that a feedforward neural network can approximate a smooth or continuous nonlinear function to arbitrary precision when it contains a sufficient number of activation functions, an evaluation network and an execution network are introduced to approximate Q*(x_k, u_k) and u*(x_k) online, respectively; both the evaluation network and the execution network are neural networks.
The evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*. Assume that the number of activation functions in the evaluation network is N_c, and let Q̂*(x_k, u_k) be the best approximation of Q*(x_k, u_k) by the evaluation network in the least-squares sense; it can be expressed as:

Q̂*(x_k, u_k) = W_c^T φ_c(θ_1(k)),   θ_1(k) = (W_c^0)^T [x_k; u_k]

where W_c is the weight from the hidden layer to the output layer, φ_c(·) = [φ_c^1(·), …, φ_c^{N_c}(·)]^T is the set of all activation functions in the hidden layer of the evaluation network, W_c^0 = [w_c^1, …, w_c^{N_c}] is the weight from the input layer to the hidden layer of the evaluation network, w_c^i is the weight corresponding to the i-th activation function, θ_1(k) collects the activation-function input values corresponding to (x_k, u_k), and θ_1^i(k) = (w_c^i)^T [x_k; u_k] is the input value of the i-th activation function.
In the present invention W_c^0 is set to a constant value; therefore, only the weight from the hidden layer to the output layer needs to be adjusted.
Similarly, let the number of activation functions in the execution network be N_a, and let û*(x_k) be the best approximation of u*(x_k) by the execution network in the least-squares sense; it can be expressed as:

û*(x_k) = W_a^T φ_a(σ(k)),   σ(k) = (W_a^0)^T x_k

The input of the execution network is the system state, W_a is the weight from the hidden layer to the output layer, φ_a(·) = [φ_a^1(·), …, φ_a^{N_a}(·)]^T is the set of hidden-layer activation functions of the execution network, W_a^0 = [w_a^1, …, w_a^{N_a}] is the weight from the input layer to the hidden layer, w_a^i is the weight corresponding to the i-th activation function, σ(k) collects the activation-function input values corresponding to x_k, σ^i(k) = (w_a^i)^T x_k is the input value of the i-th activation function, and for x_{k+1} one has σ(k+1) = (W_a^0)^T x_{k+1}.
The present invention likewise sets W_a^0 to a constant value, so that only the weight from the hidden layer to the output layer is adjusted.
Estimation error calculation step S140:
Replacing the exact values Q*(x_k, u_k) and u*(x_k) with the best approximations Q̂*(x_k, u_k) and û*(x_k) in the Q-function Bellman equation yields the following estimation error:

e_k = W_c^T φ_c(θ_1(k)) - U(x_k, u_k) - W_c^T φ_c(θ_2(k+1))

where θ_2(k+1) denotes the activation-function input values of the evaluation network when its input is (x_{k+1}, û*(x_{k+1})), i.e. θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û*(x_{k+1})].
Optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online. Assume that at time k, with k ≥ l and l the time at which the behavior policy starts to generate state data (i.e. learning is performed only after the behavior policy has begun generating data), the estimates of W_c and W_a held by the evaluation network and the execution network are Ŵ_c(k) and Ŵ_a(k), respectively. The output of the execution network at time k can be expressed as:

û(x_k) = Ŵ_a^T(k) φ_a(σ(k))

Before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot provide its estimate of W_a at time k+1, so the estimate of W_a at time k+1 still adopts Ŵ_a(k), and the output of the execution network at time k+1 is:

û(x_{k+1}) = Ŵ_a^T(k) φ_a(σ(k+1))

Similarly, when the input is (x_k, u_k), the output of the evaluation network is:

Q̂(x_k, u_k) = Ŵ_c^T(k) φ_c(θ_1(k))

When the input is (x_{k+1}, û(x_{k+1})), the output of the evaluation network is:

Q̂(x_{k+1}, û(x_{k+1})) = Ŵ_c^T(k) φ_c(θ_2(k+1))

where θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û(x_{k+1})]. Likewise, before the state x_{k+1} is generated, the evaluation network cannot provide its estimate of W_c at time k+1, so the estimate of W_c at time k+1 also adopts Ŵ_c(k). Replacing the true values with the estimated values yields the following estimation error:

e_k = Ŵ_c^T(k) φ_c(θ_1(k)) - U(x_k, u_k) - Ŵ_c^T(k) φ_c(θ_2(k+1))
For the evaluation network, the goal of online learning is to drive the estimation error e_k to zero; the weight of the evaluation network is therefore adjusted by the following gradient-descent method:

Ŵ_c(k+1) = Ŵ_c(k) + α Δφ_c(k) e_k / ρ_c(k)

where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) - φ_c(θ_1(k)) is the regression vector, and ρ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is a normalization term.
The weight Ŵ_a(k) of the execution network is trained by the importance-weighting method. The objective function of the execution network is defined as:

E_a(k) = (1/2) e_a(k)^T e_a(k)

where the prediction error e_a(k) of the execution network is defined as:

e_a(k) = Q̂(x_{k+1}, û(x_{k+1})) - U_c

Here U_c is the desired ultimate objective value; in the present invention U_c = 0, i.e. during learning the execution network is to make Q̂(x_{k+1}, û(x_{k+1})) as small as possible. Likewise, the following modified gradient-descent method is adopted to adjust Ŵ_a(k) online:

Ŵ_a(k+1) = Ŵ_a(k) - (β / ρ_a(k)) ∂E_a(k)/∂Ŵ_a(k)

where ∂E_a(k)/∂Ŵ_a(k) is obtained by the chain rule through the evaluation network, β > 0 is the learning rate of the execution network, and ρ_a(k) = (1 + φ_a(σ(k+1))^T φ_a(σ(k+1)))^2 is a normalization term.
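To make the chain-rule structure of this update explicit, one hedged realization is sketched below: the scalar actor error is backpropagated through the evaluation network to the hidden-to-output weights of the execution network, and the step is normalized as described. The specific formula, the tanh activation and its derivative are assumptions of this sketch consistent with the text, not necessarily the exact patented rule.

```python
import numpy as np

# One possible realization of the modified, normalized gradient-descent update of
# the execution-network weights, backpropagating e_a = Q_hat - U_c through the critic.

def actor_update(Wa, Wc, Wc0, Wa0, x_next, Uc=0.0, beta=0.05):
    n = x_next.shape[0]
    sigma = Wa0.T @ x_next                   # actor hidden-layer input sigma(k+1)
    phi_a = np.tanh(sigma)
    u_hat = Wa.T @ phi_a                     # actor output u_hat(x_{k+1})

    theta2 = Wc0.T @ np.concatenate([x_next, u_hat])
    phi_c = np.tanh(theta2)
    dphi_c = 1.0 - phi_c ** 2                # tanh derivative
    e_a = Wc @ phi_c - Uc                    # actor error e_a(k)

    dQ_du = Wc0[n:, :] @ (Wc * dphi_c)       # dQ_hat / d u_hat  (shape m)
    rho_a = (1.0 + phi_a @ phi_a) ** 2       # normalization term
    grad = np.outer(phi_a, dQ_du) * e_a      # dE_a / dWa  (shape Na x m)
    return Wa - beta * grad / rho_a
```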
As can be seen from the training process of the evaluation network and the execution network, all state data used in the learning process are generated by the behavior policy u. When the weight Ŵ_c of the evaluation network and the weight Ŵ_a of the execution network have converged, the output of the execution network is an approximation of the optimal controller.
For the behavior policy:
In a specific embodiment, during the online learning of the optimal controller all of the state data used are generated by the behavior policy u. To ensure that the algorithm has sufficient exploration capability over the state-policy space, the state data generated by the behavior policy must be rich enough and satisfy a persistent-excitation-type condition to guarantee convergence of the algorithm. The behavior policy in the invention is u_k = u'_k + n_k, where u'_k is any feasible control policy, usually selected according to the characteristics of the controlled system and experience, and n_k is an exploration noise; n_k may be a sinusoidal or cosinusoidal signal containing sufficiently many frequencies, or a random signal with bounded amplitude.
For the evaluation network and the execution network:
The evaluation network and the execution network both adopt a single-hidden-layer feedforward neural network. The input of the evaluation network, which approximates the Q-function, is the state together with the control input, and its output is a scalar; the input of the execution network is the system state, and its output is an m-dimensional vector. During learning, both neural networks adjust only the weights from the hidden layer to the output layer; the weights from the input layer to the hidden layer are randomly generated before learning starts and are kept unchanged during learning. The hidden-layer activation functions of both neural networks can be chosen from common hyperbolic tangent functions, Sigmoid functions, linear rectifiers, polynomial functions and the like.
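Putting the pieces together, the following self-contained sketch mirrors the online learning loop of Fig. 4 on a toy second-order plant. The plant model appears only to generate state data, standing in for the real, unknown system; every dimension, gain, learning rate and the quadratic utility are assumptions made for illustration and do not come from the patent.

```python
import numpy as np

# End-to-end sketch: the behavior policy drives the plant, and at every step the
# critic and actor hidden-to-output weights are updated from (x_k, u_k, x_{k+1}).

rng = np.random.default_rng(2)
n, m, Nc, Na = 2, 1, 24, 24
Wc0 = rng.uniform(-1, 1, (n + m, Nc)); Wa0 = rng.uniform(-1, 1, (n, Na))
Wc = np.zeros(Nc); Wa = np.zeros((Na, m))
Qx, Ru = np.eye(n), 0.1 * np.eye(m)
alpha, beta, phi = 0.1, 0.05, np.tanh

F = lambda x, u: np.array([0.9 * x[0] + 0.1 * x[1],                   # assumed plant, used
                           -0.2 * np.sin(x[0]) + 0.8 * x[1] + u[0]])  # only to generate data

x = np.array([1.0, -0.5])
for k in range(2000):
    u = -0.3 * x[:m] + 0.3 * np.sin(0.7 * k) + 0.1 * rng.uniform(-1, 1, m)  # behavior policy
    x_next = F(x, u)

    u_hat_next = Wa.T @ phi(Wa0.T @ x_next)                    # actor estimate at k+1 (uses Wa(k))
    th1 = Wc0.T @ np.concatenate([x, u])
    th2 = Wc0.T @ np.concatenate([x_next, u_hat_next])
    e = Wc @ phi(th1) - (x @ Qx @ x + u @ Ru @ u) - Wc @ phi(th2)

    dphi = phi(th2) - phi(th1)                                 # critic regression vector
    Wc = Wc + alpha * e * dphi / (1 + dphi @ dphi) ** 2        # normalized critic update

    pa = phi(Wa0.T @ x_next)
    dQ_du = Wc0[n:, :] @ (Wc * (1 - phi(th2) ** 2))            # backprop through critic
    Wa = Wa - beta * (Wc @ phi(th2)) * np.outer(pa, dQ_du) / (1 + pa @ pa) ** 2
    x = x_next
```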
Referring to fig. 2 and fig. 3, schematic diagrams of the evaluation network and the execution network are shown, respectively.
Of course, the evaluation network and the execution network of the present invention can also be selected as a feedforward neural network with a plurality of hidden layers, and the weights of all the connection layers can also be adjusted in the learning process.
Referring to fig. 4, a schematic diagram of the online learning control method of the present invention is shown.
The invention further discloses a storage medium for storing computer executable instructions, which is characterized in that:
the computer-executable instructions, when executed by a processor, perform the above-described online learning control method of a nonlinear discrete-time system.
The invention has the following advantages:
1. the invention provides an online learning control method suitable for a general nonlinear discrete time system, which can realize real-time online learning of an optimal controller without repeated iteration between strategy evaluation and strategy improvement;
2. the invention adopts an off-policy learning mechanism, which effectively overcomes the insufficient exploration of the state-policy space in the direct heuristic dynamic programming method; in addition, the execution network and the evaluation network may use activation functions of any form.
3. Compared with the classical direct heuristic dynamic programming method, the online learning method provided by the patent has better exploration capability on a state-strategy space, and the types of activation functions of the execution network and the evaluation network can be selected at will and are not limited to hyperbolic tangent functions; compared with an iterative method such as strategy iteration or value iteration, the method can realize online learning of the optimal controller, does not need a system model, and only needs state data generated by a behavior strategy.
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, which may be centralized on a single computing device, or alternatively, they may be implemented using program code executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or separately fabricated into various integrated circuit modules, or multiple ones of them fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. An online learning control method of a nonlinear discrete time system comprises the following steps:
behavior policy selection step S110:
selecting a behavior policy u based on the characteristics of the controlled object and existing experience, wherein the behavior policy is the control policy actually applied to the controlled object during the learning process and is mainly used to generate the system state data required for learning;
optimal Q-function definition step S120:
the following optimal Q-function is defined:

Q*(x_k, u_k) = U(x_k, u_k) + V*(x_{k+1})

where U(x_k, u_k) is the utility (stage cost) of the value function to be minimized and V* is the optimal value function; its physical meaning is that at time k the behavior policy u is applied, while at all subsequent times the optimal control policy u*, i.e. the target policy, is applied; by the definition of the optimal Q-function, the above equation can be equivalently expressed as:

Q*(x_k, u_k) = U(x_k, u_k) + Q*(x_{k+1}, u*(x_{k+1}))

the optimal control u*(x_k) can be expressed as:

u*(x_k) = arg min_u Q*(x_k, u)

and, unlike the linear case, Q*(x_k, u_k) and u*(x_k) are nonlinear functions of (x_k, u_k) and x_k, respectively;
evaluation network and execution network introduction step S130:
an evaluation network and an execution network are introduced to approximate Q*(x_k, u_k) and u*(x_k) online, respectively, wherein both the evaluation network and the execution network are neural networks;
the evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*; assume that the number of activation functions in the evaluation network is N_c, and let Q̂*(x_k, u_k) be the best approximation of Q*(x_k, u_k) by the evaluation network in the least-squares sense; it can be expressed as:

Q̂*(x_k, u_k) = W_c^T φ_c(θ_1(k)),   θ_1(k) = (W_c^0)^T [x_k; u_k]

where W_c is the weight from the hidden layer to the output layer, φ_c(·) = [φ_c^1(·), …, φ_c^{N_c}(·)]^T is the set of all activation functions in the hidden layer of the evaluation network, W_c^0 = [w_c^1, …, w_c^{N_c}] is the weight from the input layer to the hidden layer of the evaluation network, w_c^i is the weight corresponding to the i-th activation function, θ_1(k) collects the activation-function input values corresponding to (x_k, u_k), and θ_1^i(k) = (w_c^i)^T [x_k; u_k] is the input value of the i-th activation function;
let the number of activation functions in the execution network be N_a, and let û*(x_k) be the best approximation of u*(x_k) by the execution network in the least-squares sense; it can be expressed as:

û*(x_k) = W_a^T φ_a(σ(k)),   σ(k) = (W_a^0)^T x_k

the input of the execution network is the system state, W_a is the weight from the hidden layer to the output layer, φ_a(·) = [φ_a^1(·), …, φ_a^{N_a}(·)]^T is the set of hidden-layer activation functions of the execution network, W_a^0 = [w_a^1, …, w_a^{N_a}] is the weight from the input layer to the hidden layer, w_a^i is the weight corresponding to the i-th activation function, σ(k) collects the activation-function input values corresponding to x_k, σ^i(k) = (w_a^i)^T x_k is the input value of the i-th activation function, and for x_{k+1} one has σ(k+1) = (W_a^0)^T x_{k+1};
Estimation error calculation step S140:
replacing the exact values Q*(x_k, u_k) and u*(x_k) with the best approximations Q̂*(x_k, u_k) and û*(x_k) in the Q-function Bellman equation yields the following estimation error:

e_k = W_c^T φ_c(θ_1(k)) - U(x_k, u_k) - W_c^T φ_c(θ_2(k+1))

where θ_2(k+1) denotes the activation-function input values of the evaluation network when its input is (x_{k+1}, û*(x_{k+1})), i.e. θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û*(x_{k+1})];
Optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online; assume that at time k, with k ≥ l and l the time at which the behavior policy starts to generate state data (i.e. learning is performed only after the behavior policy has begun generating data), the estimates of W_c and W_a held by the evaluation network and the execution network are Ŵ_c(k) and Ŵ_a(k), respectively; the output of the execution network at time k can be expressed as:

û(x_k) = Ŵ_a^T(k) φ_a(σ(k))

before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot provide its estimate of W_a at time k+1, so the estimate of W_a at time k+1 still adopts Ŵ_a(k), and the output of the execution network at time k+1 is:

û(x_{k+1}) = Ŵ_a^T(k) φ_a(σ(k+1))

similarly, when the input is (x_k, u_k), the output of the evaluation network is:

Q̂(x_k, u_k) = Ŵ_c^T(k) φ_c(θ_1(k))

when the input is (x_{k+1}, û(x_{k+1})), the output of the evaluation network is:

Q̂(x_{k+1}, û(x_{k+1})) = Ŵ_c^T(k) φ_c(θ_2(k+1))

where θ_2^i(k+1) = (w_c^i)^T [x_{k+1}; û(x_{k+1})]; likewise, before the state x_{k+1} is generated, the evaluation network cannot provide its estimate of W_c at time k+1, so the estimate of W_c at time k+1 also adopts Ŵ_c(k); replacing the true values with the estimated values yields the following estimation error:

e_k = Ŵ_c^T(k) φ_c(θ_1(k)) - U(x_k, u_k) - Ŵ_c^T(k) φ_c(θ_2(k+1))

the weight Ŵ_c(k) of the evaluation network is adjusted by a gradient-descent method; the weight Ŵ_a(k) of the execution network is trained by an importance-weighting method, and a modified gradient-descent method is used to adjust Ŵ_a(k) online;
when the weight Ŵ_c of the evaluation network and the weight Ŵ_a of the execution network have converged, the output of the execution network is an approximation of the optimal controller.
2. The online learning control method according to claim 1, characterized in that:
in the evaluation network and execution network introduction step S130,
for the evaluation network, the input-layer-to-hidden-layer weight W_c^0 is set to a constant value, so that only the weight from the hidden layer to the output layer is adjusted;
for the execution network, W_a^0 is likewise set to a constant value, and only the weight from the hidden layer to the output layer is adjusted.
3. The online learning control method according to claim 2, characterized in that:
in the optimal weight calculation step S150:
the weight of the evaluation network is adjusted by the following gradient-descent method:

Ŵ_c(k+1) = Ŵ_c(k) + α Δφ_c(k) e_k / ρ_c(k)

where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) - φ_c(θ_1(k)) is the regression vector, and ρ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is a normalization term.
4. The online learning control method according to claim 3, characterized in that:
in the behavior policy selection step S110, the behavior policy is u_k = u'_k + n_k, where u'_k is any feasible control policy, selected according to the characteristics of the controlled system and experience, and n_k is an exploration noise; n_k is a sinusoidal or cosinusoidal signal containing sufficiently many frequencies, or a random signal with bounded amplitude.
5. The online learning control method according to claim 3, characterized in that:
the evaluation network and the execution network are single-hidden-layer feedforward neural networks; the input of the evaluation network used to approximate the Q-function is the state together with the control input, while the input of the execution network is the system state and its output is an m-dimensional vector.
6. The online learning control method according to claim 5, characterized in that:
the evaluation network and the execution network only adjust the weights from the hidden layer to the output layer, and the weights from the input layer to the hidden layer are randomly generated before the learning process is started and are kept unchanged in the learning process.
7. The online learning control method according to claim 5, characterized in that:
the activation function of the evaluation network and the execution network is one of a hyperbolic tangent function, a Sigmoid function, a linear rectifier, and a polynomial function.
8. A storage medium for storing computer-executable instructions, characterized in that:
the computer executable instructions, when executed by a processor, perform a method of online learning control of a non-linear discrete time system as claimed in any one of claims 1 to 7.
CN202011635930.6A 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system Active CN113485099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011635930.6A CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011635930.6A CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Publications (2)

Publication Number Publication Date
CN113485099A true CN113485099A (en) 2021-10-08
CN113485099B CN113485099B (en) 2023-09-22

Family

ID=77933336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011635930.6A Active CN113485099B (en) 2020-12-31 2020-12-31 Online learning control method of nonlinear discrete time system

Country Status (1)

Country Link
CN (1) CN113485099B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117111620A (en) * 2023-10-23 2023-11-24 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009186317A (en) * 2008-02-06 2009-08-20 Mitsubishi Electric Corp Radar control system
CN107436424A (en) * 2017-09-08 2017-12-05 中国电子科技集团公司信息科学研究院 A kind of more radar dynamic regulating methods and device based on information gain
CN110214264A (en) * 2016-12-23 2019-09-06 御眼视觉技术有限公司 The navigation system of restricted responsibility with application
CN110462544A (en) * 2017-03-20 2019-11-15 御眼视觉技术有限公司 The track of autonomous vehicle selects
CN110826026A (en) * 2020-01-13 2020-02-21 江苏万链区块链技术研究院有限公司 Method and system for publication based on block chain technology and associated copyright protection
CN111142383A (en) * 2019-12-30 2020-05-12 中国电子科技集团公司信息科学研究院 Online learning method for optimal controller of nonlinear system
CN111812973A (en) * 2020-05-21 2020-10-23 天津大学 Event trigger optimization control method of discrete time nonlinear system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009186317A (en) * 2008-02-06 2009-08-20 Mitsubishi Electric Corp Radar control system
CN110214264A (en) * 2016-12-23 2019-09-06 御眼视觉技术有限公司 The navigation system of restricted responsibility with application
CN110462544A (en) * 2017-03-20 2019-11-15 御眼视觉技术有限公司 The track of autonomous vehicle selects
CN107436424A (en) * 2017-09-08 2017-12-05 中国电子科技集团公司信息科学研究院 A kind of more radar dynamic regulating methods and device based on information gain
CN111142383A (en) * 2019-12-30 2020-05-12 中国电子科技集团公司信息科学研究院 Online learning method for optimal controller of nonlinear system
CN110826026A (en) * 2020-01-13 2020-02-21 江苏万链区块链技术研究院有限公司 Method and system for publication based on block chain technology and associated copyright protection
CN111812973A (en) * 2020-05-21 2020-10-23 天津大学 Event trigger optimization control method of discrete time nonlinear system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Hong-Gui Han et al.: "Real-Time Model Predictive Control Using a Self-Organizing Neural Network", IEEE Transactions on Neural Networks and Learning Systems *
J. Si et al.: "Online learning control by association and reinforcement", Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Neural Computing: New Challenges and Perspectives for the New Millennium *
Li, X. X. et al.: "Off-Policy Q-Learning for Infinite Horizon LQR Problem with Unknown Dynamics", 27th IEEE International Symposium on Industrial Electronics (ISIE) *
Zhang Zhenning et al.: "Adaptive optimal attitude control of a reentry vehicle", Journal of Astronautics *
Xu Tengju et al.: "Load balancing strategy based on D2D communication mechanism in heterogeneous networks", Computer Engineering and Design *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117111620A (en) * 2023-10-23 2023-11-24 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system
CN117111620B (en) * 2023-10-23 2024-03-29 山东省科学院海洋仪器仪表研究所 Autonomous decision-making method for task allocation of heterogeneous unmanned system

Also Published As

Publication number Publication date
CN113485099B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
Li et al. Adaptive optimized backstepping control-based RL algorithm for stochastic nonlinear systems with state constraints and its application
Lucia et al. A deep learning-based approach to robust nonlinear model predictive control
CA2414707C (en) Computer method and apparatus for constraining a non-linear approximator of an empirical process
Zhang et al. Adaptive neural tracking control of pure-feedback nonlinear systems with unknown gain signs and unmodeled dynamics
Jiang et al. Robust adaptive dynamic programming
Suykens et al. Robust local stability of multilayer recurrent neural networks
Wang et al. Adaptive neural finite-time containment control for nonlower triangular nonlinear multi-agent systems with dynamics uncertainties
CN112405542B (en) Musculoskeletal robot control method and system based on brain inspiring multitask learning
Li et al. Policy iteration based Q-learning for linear nonzero-sum quadratic differential games
Jia et al. Optimization of control parameters based on genetic algorithms for spacecraft attitude tracking with input constraints
Grancharova et al. Computation, approximation and stability of explicit feedback min–max nonlinear model predictive control
Ibrahim et al. Regulated Kalman filter based training of an interval type-2 fuzzy system and its evaluation
Kosmatopoulos Control of unknown nonlinear systems with efficient transient performance using concurrent exploitation and exploration
CN113485099A (en) Online learning control method of nonlinear discrete time system
Chen et al. Novel adaptive neural networks control with event-triggered for uncertain nonlinear system
CN114740710A (en) Random nonlinear multi-agent reinforcement learning optimization formation control method
CN115167102A (en) Reinforced learning self-adaptive PID control method based on parallel dominant motion evaluation
CN111142383B (en) Online learning method for optimal controller of nonlinear system
Sabes et al. Reinforcement learning by probability matching
Song et al. Adaptive dynamic programming: single and multiple controllers
Wang et al. Model-free nonlinear robust control design via online critic learning
Fu et al. Adaptive optimal control of unknown nonlinear systems with different time scales
Chen et al. Adaptive fuzzy PD+ control for attitude maneuver of rigid spacecraft
Lewis et al. Neural network control of robot arms and nonlinear systems
CN114063458A (en) Preset performance control method of non-triangular structure system independent of initial conditions

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
GR01 Patent grant