CN113485099A - Online learning control method of nonlinear discrete time system - Google Patents
- Publication number: CN113485099A (application CN202011635930.6A)
- Authority: CN (China)
- Legal status: Granted (status as listed by Google Patents; an assumption, not a legal conclusion)
Classifications
- G05B13/0265, G05B13/027: Adaptive control systems, electric, the criterion being a learning criterion, using neural networks only
- G05B13/0205, G05B13/021: Adaptive control systems, electric, not using a model or a simulator of the controlled system, in which a variable is automatically adjusted to optimise the performance
- Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Abstract
The invention discloses an online learning control method for a nonlinear discrete-time system, comprising a behavior-strategy selection step, an optimal Q-function definition step, a step introducing an evaluation network and an execution network, an estimation-error calculation step, and a final optimal-weight calculation step. The invention realizes real-time online learning of the optimal controller without repeated iteration between strategy evaluation and strategy improvement. By adopting an off-policy learning mechanism, it effectively overcomes the insufficient exploration of the state-strategy space suffered by the direct heuristic dynamic programming method; the execution network and the evaluation network may use activation functions of any form; and online learning of the optimal controller requires no system model, only the state data generated by the behavior strategy.
Description
Technical Field
The invention relates to the field of industrial production control, in particular to an online learning control method for a nonlinear discrete time system.
Background
In industrial production, engineering technicians often need to optimize the controllers of control objects such as robots, unmanned aerial vehicles and unmanned vehicles so as to meet certain control indexes. Because these control objects tend to exhibit strong nonlinearity, optimizing the controller is difficult. From the perspective of optimal control, obtaining the optimal controller requires solving the Hamilton-Jacobi-Bellman (HJB) equation, which is a nonlinear partial differential equation and is very difficult to solve. Traditional dynamic programming, variational methods, spectral methods and the like often face great limitations in practical application due to their extremely high computational complexity.
Adaptive dynamic programming is a new intelligent control algorithm that has emerged in recent years. It fuses reinforcement learning, neural-network approximation, dynamic programming and adaptive control, can realize online learning of the optimal controller, and effectively alleviates the high computational complexity of traditional methods. For the optimal control problem of nonlinear discrete-time systems, Jennie Si and Yu-Tsung Wang first proposed the direct heuristic dynamic programming algorithm in the paper "Online learning control by association and reinforcement". The algorithm adopts the basic idea of generalized policy iteration and, by introducing two neural networks (an execution network and an evaluation network), realizes real-time online learning of the optimal controller and the optimal value function. Through continuous development in recent years, the convergence and stability analysis of the algorithm now has a certain theoretical basis. Although direct heuristic dynamic programming can realize online adaptive optimal control, it still has the following defects: 1) the algorithm adopts an on-policy learning mechanism, explores the state-policy space insufficiently, and easily falls into a locally optimal solution; 2) the activation functions of the execution network and the evaluation network are hyperbolic tangent functions, and all existing convergence and stability results are based on them, so the results do not apply to other types of activation function.
Therefore, how to overcome the above disadvantages of the direct heuristic dynamic programming method, so that convergence and stability results are no longer limited to the hyperbolic tangent function, has become a technical problem to be solved urgently in the prior art.
Disclosure of Invention
The invention aims to provide an online learning control method for a nonlinear discrete-time system that has better exploration capability over the state-strategy space, so that the activation functions of the execution network and the evaluation network can be of any type and are not limited to hyperbolic tangent functions. Compared with iterative methods such as policy iteration or value iteration, the method realizes online learning of the optimal controller, needs no system model, and needs only the state data generated by the behavior strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
an online learning control method of a nonlinear discrete time system comprises the following steps:
behavior policy selection step S110:
selecting a behavior strategy u by using the existing experience according to the characteristics of a controlled object, wherein the behavior strategy is a control strategy which is actually applied to the controlled object in the learning process and is mainly used for generating system state data required in the learning process;
optimal Q-function definition step S120:
the following optimal Q-function is defined:
its physical meaning is as follows: at time k the behavior strategy u is applied, and at all subsequent times the optimal control strategy u* (i.e. the target strategy) is applied. By the definition of the optimal Q-function, the above equation can be equivalently expressed as:
evaluating the network and performing a network introduction step S130:
an evaluation network and an execution network, both neural networks, are introduced to approximate Q*(x_k, u_k) and the optimal controller online, respectively;
the evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*. Let the number of activation functions in the evaluation network be N_c. The best least-squares approximation of Q*(x_k, u_k) by the evaluation network can be expressed as:
where W_c is the hidden-to-output weight matrix, φ_c(·) is the set of all activation functions in the hidden layer of the evaluation network, and W_c^0 is the input-to-hidden weight matrix, whose i-th column is the weight corresponding to the i-th activation function; (W_c^0)^T (x_k, u_k) collects the input values of the activation functions, its i-th component being the input value of the i-th activation function;
similarly, let the number of activation functions in the execution network be N_a. The best least-squares approximation of the optimal controller by the execution network can be expressed as:
the input of the execution network is the system state; W_a is the hidden-to-output weight matrix, φ_a(·) is the set of activation functions in the hidden layer of the execution network, and W_a^0 is the input-to-hidden weight matrix, whose i-th column is the weight corresponding to the i-th activation function; (W_a^0)^T x_k collects the input values of the activation functions, its i-th component being the input value of the i-th activation function, and the same holds for x_{k+1};
Estimation error calculation step S140:
replacing the exact value Q*(x_k, u_k) and the optimal controller u* by their best approximations yields the following estimation error:
where θ_1(k) denotes the input values of the activation functions of the evaluation network when the input is (x_k, u_k), and θ_2(k+1) those when the input involves the successor state x_{k+1};
Optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online. Assume that at time k the estimates of W_c and W_a held by the evaluation network and the execution network are Ŵ_c(k) and Ŵ_a(k), with learning started at some time l ≤ k, i.e. the learning process begins after the behavior strategy starts to generate state data. The output of the execution network at time k may then be expressed as:
before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot give the estimate of W_a for time k+1, so the estimate Ŵ_a(k) is still used at time k+1. The output of the execution network at time k+1 is:
similarly, when the input is (x_k, u_k), the output of the evaluation network is:
likewise, before the state x_{k+1} is generated, the evaluation network cannot give the estimate of W_c for time k+1, so the estimate Ŵ_c(k) is also used at time k+1. Therefore:
replacing the true values with the estimated values yields the following estimation errors:
the weights Ŵ_a of the execution network are trained by an importance-weighting method and adjusted online by a modified gradient descent method;
when the weights Ŵ_c of the evaluation network and Ŵ_a of the execution network have converged, the output of the execution network is an approximation of the optimal controller.
Alternatively, in the evaluating network and performing network introduction step S130,
for the evaluation network, the input-to-hidden weights W_c^0 are set to constant values, and only the hidden-to-output weights are adjusted;
for the execution network, W_a^0 is likewise set to constant values, and only the hidden-to-output weights are adjusted.
Optionally, in the optimal weight calculating step S150:
the weights of the evaluation network are adjusted by the following gradient descent method:
where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) − φ_c(θ_1(k)) is the regression vector, and Φ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is the normalization term;
the weights of the execution network are trained by an importance-weighting method and adjusted online by the following modified gradient descent method:
Optionally, in the behavior-strategy selection step S110, the behavior strategy is u_k = u'_k + n_k, where u' is any feasible control strategy, selected based on the characteristics of the controlled system and on experience, and n_k is exploration noise; n_k may be a sine or cosine signal containing sufficiently many frequencies, or a random signal of bounded amplitude.
Optionally, the evaluation network and the execution network are single-hidden-layer feedforward neural networks; the input of the evaluation network, which approximates the Q-function, is the state together with the control input, while the input of the execution network is the system state and its output is an m-dimensional vector.
Optionally, the evaluation network and the execution network only adjust the weights from the hidden layer to the output layer, and the weights from the input layer to the hidden layer are randomly generated before the learning process starts and are kept unchanged in the learning process.
Optionally, the activation function of the evaluation network and the execution network is one of a hyperbolic tangent function, a Sigmoid function, a linear rectifier, and a polynomial function.
The invention further discloses a storage medium for storing computer executable instructions, which is characterized in that:
the computer-executable instructions, when executed by a processor, perform the above online learning control method for a nonlinear discrete-time system.
The invention has the following advantages:
1. the invention provides an online learning control method suitable for a general nonlinear discrete time system, which can realize real-time online learning of an optimal controller without repeated iteration between strategy evaluation and strategy improvement;
2. the invention adopts an off-policy learning mechanism, effectively overcoming the insufficient exploration of the state-strategy space by the direct heuristic dynamic programming method; in addition, the execution network and the evaluation network may use activation functions of any form.
3. Compared with the classical direct heuristic dynamic programming method, the online learning method provided by the patent has better exploration capability on a state-strategy space, and the types of activation functions of the execution network and the evaluation network can be selected at will and are not limited to hyperbolic tangent functions; compared with an iterative method such as strategy iteration or value iteration, the method can realize online learning of the optimal controller, does not need a system model, and only needs state data generated by a behavior strategy.
Drawings
FIG. 1 is a flow chart of an online learning control method for a non-linear discrete time system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an evaluation network of an online learning control method of a nonlinear discrete time system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an implementation network of an online learning control method for a non-linear discrete time system according to an embodiment of the present invention;
fig. 4 is an algorithm diagram of an online learning control method of a nonlinear discrete-time system according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The invention considers the following optimal control problem for nonlinear discrete-time systems. Consider the discrete-time system:
xk+1=F(xk,uk),x0=x(0)
where x_k is the system state and u_k is the control input. The system function F(x_k, u_k) is Lipschitz continuous on a compact set Ω and satisfies F(0,0) = 0. The system is assumed to be stabilizable on Ω, i.e. there exists a control sequence u_1, …, u_k, … such that x_k → 0. In addition, the system function F(x_k, u_k) is assumed to be unknown. The goal of optimal control of the nonlinear system is to find a feasible control strategy that stabilizes the system while minimizing the following value function:
According to Bellman's principle of optimality, the optimal control strategy u* satisfies the following Bellman equation:
s.t. x_{k+1} = F(x_k, u_k)
Thus, the optimal controller u* has the following expression:
Substituting the above expression into the Bellman equation yields the following HJB equation:
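The equation images from the original publication do not survive in this text. A plausible reconstruction of the omitted formulas, assuming the usual undiscounted quadratic stage cost r(x_k, u_k) (the symbols r, Q and R are assumed notation, not taken from the patent), is:

```latex
% Value function to be minimized (stage cost assumed quadratic)
V(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i), \qquad
r(x_i, u_i) = x_i^{\top} Q x_i + u_i^{\top} R u_i
% Bellman equation satisfied by the optimal value function
V^{*}(x_k) = \min_{u_k} \bigl\{ r(x_k, u_k) + V^{*}(x_{k+1}) \bigr\},
\quad \text{s.t. } x_{k+1} = F(x_k, u_k)
% Optimal controller
u^{*}(x_k) = \arg\min_{u_k} \bigl\{ r(x_k, u_k) + V^{*}(F(x_k, u_k)) \bigr\}
% Resulting HJB equation
V^{*}(x_k) = r\bigl(x_k, u^{*}(x_k)\bigr) + V^{*}\bigl(F(x_k, u^{*}(x_k))\bigr)
```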
thus, referring to fig. 1, there is shown an online learning control method of a nonlinear discrete-time system according to the present invention, comprising the steps of:
behavior policy selection step S110:
according to the characteristics of the controlled object, the behavior strategy u is selected by using the existing experience, the behavior strategy is a control strategy which is actually applied to the controlled object in the learning process, and the behavior strategy is mainly used for generating system state data required in the learning process.
After the behavior strategy is selected, the optimal controller is to be learned online.
Optimal Q-function definition step S120:
the following optimal Q-function is defined:
its physical meaning is as follows: at time k the behavior strategy u is applied, and at all subsequent times the optimal control strategy u* (i.e. the target strategy) is applied. By the definition of the optimal Q-function, the above equation can be equivalently expressed as:
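The defining equation itself is an image in the original and is not reproduced here. A standard reconstruction consistent with the stated physical meaning (apply u at time k, then the target strategy u* at all subsequent times; the stage cost r is assumed notation) is:

```latex
% Optimal Q-function: behavior action now, target policy afterwards
Q^{*}(x_k, u_k) = r(x_k, u_k) + \sum_{i=k+1}^{\infty} r\bigl(x_i, u^{*}(x_i)\bigr)
                = r(x_k, u_k) + V^{*}(x_{k+1})
% Equivalent Bellman form used for learning
Q^{*}(x_k, u_k) = r(x_k, u_k) + Q^{*}\bigl(x_{k+1}, u^{*}(x_{k+1})\bigr),
\qquad u^{*}(x_k) = \arg\min_{u} Q^{*}(x_k, u)
```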
Evaluating the network and performing a network introduction step S130:
Considering that a feedforward neural network with sufficiently many activation functions can approximate a smooth or continuous nonlinear function to any precision, an evaluation network and an execution network, both neural networks, are introduced to approximate Q*(x_k, u_k) and the optimal controller online, respectively;
the evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*. Let the number of activation functions in the evaluation network be N_c. The best least-squares approximation of Q*(x_k, u_k) by the evaluation network can be expressed as:
where W_c is the hidden-to-output weight matrix, φ_c(·) is the set of all activation functions in the hidden layer of the evaluation network, and W_c^0 is the input-to-hidden weight matrix, whose i-th column is the weight corresponding to the i-th activation function; (W_c^0)^T (x_k, u_k) collects the input values of the activation functions, its i-th component being the input value of the i-th activation function;
In the invention, W_c^0 is set to a constant value; therefore, only the hidden-to-output weights need to be adjusted.
Similarly, let the number of activation functions in the execution network be N_a. The best least-squares approximation of the optimal controller by the execution network can be expressed as:
The input of the execution network is the system state; W_a is the hidden-to-output weight matrix, φ_a(·) is the set of activation functions in the hidden layer of the execution network, and W_a^0 is the input-to-hidden weight matrix, whose i-th column is the weight corresponding to the i-th activation function; (W_a^0)^T x_k collects the input values of the activation functions, its i-th component being the input value of the i-th activation function, and the same holds for x_{k+1}.
W_a^0 is likewise set to a constant value, and only the hidden-to-output weights are adjusted.
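The two networks described above can be sketched as follows. This is an illustrative reconstruction, not code from the patent: the class name, state/input dimensions and hidden sizes are placeholders, and tanh is used only as one of the admissible activation functions.

```python
import numpy as np

rng = np.random.default_rng(0)

class SingleHiddenLayerNet:
    """Output W^T phi(W0^T z): W0 is fixed and random, only W is trained,
    matching the patent's description of the evaluation/execution networks."""
    def __init__(self, in_dim, hidden, out_dim, activation=np.tanh):
        self.W0 = rng.standard_normal((in_dim, hidden))  # input-to-hidden, frozen
        self.W = np.zeros((hidden, out_dim))             # hidden-to-output, trained
        self.act = activation

    def features(self, z):
        # Hidden-layer outputs phi(theta), with theta = W0^T z
        return self.act(self.W0.T @ z)

    def __call__(self, z):
        return self.W.T @ self.features(z)

# Dimensions and hidden sizes below are illustrative placeholders.
n, m = 2, 1            # state and control dimensions
Nc, Na = 20, 10        # numbers of activation functions
critic = SingleHiddenLayerNet(n + m, Nc, 1)  # evaluation network: (x_k, u_k) -> Q
actor = SingleHiddenLayerNet(n, Na, m)       # execution network: x_k -> u
```

Any activation named in the patent (Sigmoid, linear rectifier, polynomial) can be passed in place of `np.tanh`.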
Estimation error calculation step S140:
replacing the exact value Q*(x_k, u_k) and the optimal controller u* by their best approximations yields the following estimation error:
where θ_1(k) denotes the input values of the activation functions of the evaluation network when the input is (x_k, u_k), and θ_2(k+1) those when the input involves the successor state x_{k+1};
Optimal weight calculation step S150:
the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network are learned online. Assume that at time k the estimates of W_c and W_a held by the evaluation network and the execution network are Ŵ_c(k) and Ŵ_a(k), with learning started at some time l ≤ k, i.e. the learning process begins after the behavior strategy starts to generate state data. The output of the execution network at time k may then be expressed as:
before the behavior policy u_k generates the next state x_{k+1}, the execution network cannot give the estimate of W_a for time k+1, so the estimate Ŵ_a(k) is still used at time k+1. The output of the execution network at time k+1 is:
similarly, when the input is (x_k, u_k), the output of the evaluation network is:
likewise, before the state x_{k+1} is generated, the evaluation network cannot give the estimate of W_c for time k+1, so the estimate Ŵ_c(k) is also used at time k+1. Therefore:
replacing the true value with the estimated value yields the following estimation error:
For the evaluation network, the goal of online learning is to make the estimation error e_k as small as possible, so the weights of the evaluation network are adjusted by the following gradient descent method:
where α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) − φ_c(θ_1(k)) is the regression vector, and Φ_c(k) = (1 + Δφ_c(k)^T Δφ_c(k))^2 is the normalization term.
The weights Ŵ_a of the execution network are trained by an importance-weighting method. The objective function of the execution network is defined through its prediction error e_a(k) relative to a desired final objective U_c; in the invention U_c = 0, i.e. during learning the execution network is to minimize the objective as far as possible. Likewise, the following modified gradient descent method is adopted to adjust Ŵ_a online:
where β > 0 is the learning rate of the execution network and Φ_a(k) = (1 + φ_a(θ_4(k))^T φ_a(θ_4(k)))^2 is the normalization term.
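The two normalized gradient-descent steps can be sketched as below. This is a reconstruction under assumed sign conventions and residual form (the patent's equations are images and are not reproduced): `critic_update` takes a step on the Bellman residual using the regression vector Δφ_c(k) and the normalization term (1 + Δφ_c^T Δφ_c)^2 named in the text, while `actor_update` is a hedged stand-in for the importance-weighted actor step toward U_c = 0.

```python
import numpy as np

def critic_update(Wc, phi_k, phi_k1, r_k, alpha=0.1):
    """One normalized gradient step on the Bellman residual
    e_k = r_k + Wc^T phi(theta2(k+1)) - Wc^T phi(theta1(k))  (assumed form)."""
    dphi = phi_k1 - phi_k                      # regression vector Delta phi_c(k)
    e_k = r_k + Wc @ phi_k1 - Wc @ phi_k       # scalar Bellman residual
    norm = (1.0 + dphi @ dphi) ** 2            # normalization term Phi_c(k)
    return Wc - alpha * e_k * dphi / norm

def actor_update(Wa, phi_a_k, e_a, beta=0.1):
    """One normalized gradient step driving the actor's prediction error
    e_a(k) toward the desired objective U_c = 0 (assumed form)."""
    norm = (1.0 + phi_a_k @ phi_a_k) ** 2      # normalization term Phi_a(k)
    return Wa - beta * np.outer(phi_a_k, e_a) / norm
```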
As can be seen from the training processes of the evaluation network and the execution network, all state data used during learning are generated by the behavior strategy u; once the weights Ŵ_c of the evaluation network and Ŵ_a of the execution network have converged, the output of the execution network is an approximation of the optimal controller.
For the behavior policy:
In a specific embodiment, all state data used during the online learning of the optimal controller are generated by the behavior strategy u. To ensure that the algorithm has sufficient exploration capability over the state-policy space, the state data generated by the behavior strategy must be rich enough and satisfy a persistent excitation condition, which guarantees convergence of the algorithm. The behavior strategy in the invention is u_k = u'_k + n_k, where u' is any feasible control strategy, typically selected based on the characteristics of the controlled system and on experience, and n_k is exploration noise; n_k may be a sine or cosine signal containing sufficiently many frequencies, or a random signal of bounded amplitude.
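The behavior strategy u_k = u'_k + n_k might be implemented as in the sketch below; the base feedback law, the frequencies and the amplitudes are illustrative placeholders, not values from the patent.

```python
import numpy as np

def behavior_policy(x_k, k, rng, n_freqs=10, amp=0.2):
    """u_k = u'_k + n_k: a placeholder feedback law u' plus exploration
    noise mixing several sinusoids (for persistent excitation) with a
    bounded random component."""
    u_base = -0.5 * x_k[0:1]                          # u'_k (placeholder)
    freqs = np.linspace(0.1, 2.0, n_freqs)
    n_k = amp * np.sum(np.sin(freqs * k)) / n_freqs   # multi-frequency excitation
    n_k = n_k + amp * rng.uniform(-1.0, 1.0, size=1)  # bounded random part
    return u_base + n_k
```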
For the evaluation network and the execution network:
The evaluation network and the execution network both adopt a feedforward neural network with a single hidden layer. The input of the evaluation network, which approximates the Q-function, is the state together with the control input, and its output is a scalar; the input of the execution network is the system state, and its output is an m-dimensional vector. During learning, both networks adjust only the hidden-to-output weights; the input-to-hidden weights are randomly generated before learning starts and are kept fixed throughout. The hidden-layer activation functions of both networks may be chosen among common functions such as the hyperbolic tangent, the Sigmoid function, linear rectifiers and polynomial functions.
Referring to fig. 2 and fig. 3, schematic diagrams of the evaluation network and the execution network are shown, respectively.
Of course, the evaluation network and the execution network of the present invention can also be selected as a feedforward neural network with a plurality of hidden layers, and the weights of all the connection layers can also be adjusted in the learning process.
Referring to fig. 4, a schematic diagram of the online learning control method of the present invention is shown.
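Putting the steps together, the learning scheme of fig. 4 might look like the self-contained sketch below. Everything here is an illustrative reconstruction: the plant F, the stage cost r, the network sizes, the learning rates, and the finite-difference actor step are assumptions rather than the patent's exact (unreproduced) equations; only the overall structure — the behavior policy generates the data while the critic and actor take normalized gradient steps — follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def F(x, u):                      # plant (placeholder stable nonlinear system)
    return 0.8 * np.sin(x) + 0.5 * u

def r(x, u):                      # stage cost (assumed quadratic)
    return x * x + u * u

Nc, Na = 25, 15                      # hidden sizes (placeholders)
W0c = rng.standard_normal((2, Nc))   # critic input weights, fixed
W0a = rng.standard_normal((1, Na))   # actor input weights, fixed
Wc = np.zeros(Nc)                    # trained critic output weights
Wa = np.zeros(Na)                    # trained actor output weights
phi_c = lambda x, u: np.tanh(W0c.T @ np.array([x, u]))
phi_a = lambda x: np.tanh(W0a.T @ np.array([x]))

alpha, beta = 0.05, 0.05
x = 1.0
for k in range(2000):
    # Behavior policy: base law plus sinusoidal/random exploration noise.
    u = -0.3 * x + 0.3 * np.sin(0.7 * k) + 0.05 * rng.uniform(-1.0, 1.0)
    x1 = F(x, u)                          # state data generated off-policy
    u1 = Wa @ phi_a(x1)                   # target-policy action at x_{k+1}
    # Critic: normalized gradient step on the Bellman residual (assumed form).
    p0, p1 = phi_c(x, u), phi_c(x1, u1)
    e = r(x, u) + Wc @ p1 - Wc @ p0
    dphi = p1 - p0
    Wc -= alpha * e * dphi / (1.0 + dphi @ dphi) ** 2
    # Actor: normalized step down a finite-difference gradient of Q in u.
    ua = Wa @ phi_a(x)
    dq_du = (Wc @ phi_c(x, ua + 1e-4) - Wc @ phi_c(x, ua)) / 1e-4
    pa = phi_a(x)
    Wa -= beta * dq_du * pa / (1.0 + pa @ pa) ** 2
    x = x1
```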
The invention further discloses a storage medium for storing computer executable instructions, which is characterized in that:
the computer-executable instructions, when executed by a processor, perform the above online learning control method for a nonlinear discrete-time system.
The invention has the following advantages:
1. the invention provides an online learning control method suitable for a general nonlinear discrete time system, which can realize real-time online learning of an optimal controller without repeated iteration between strategy evaluation and strategy improvement;
2. the invention adopts an off-policy learning mechanism, effectively overcoming the insufficient exploration of the state-strategy space by the direct heuristic dynamic programming method; in addition, the execution network and the evaluation network may use activation functions of any form.
3. Compared with the classical direct heuristic dynamic programming method, the online learning method provided by the patent has better exploration capability on a state-strategy space, and the types of activation functions of the execution network and the evaluation network can be selected at will and are not limited to hyperbolic tangent functions; compared with an iterative method such as strategy iteration or value iteration, the method can realize online learning of the optimal controller, does not need a system model, and only needs state data generated by a behavior strategy.
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, which may be centralized on a single computing device, or alternatively, they may be implemented using program code executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or separately fabricated into various integrated circuit modules, or multiple ones of them fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. An online learning control method of a nonlinear discrete time system comprises the following steps:
behavior policy selection step S110:
selecting a behavior strategy u by using the existing experience according to the characteristics of a controlled object, wherein the behavior strategy is a control strategy which is actually applied to the controlled object in the learning process and is mainly used for generating system state data required in the learning process;
optimal Q-function definition step S120:
the following optimal Q-function is defined:
its physical meaning is as follows: at time k the behavior strategy u is applied, and at all subsequent times the optimal control strategy u* (i.e. the target strategy) is applied. By the definition of the optimal Q-function, the above equation can be equivalently expressed as:
evaluating the network and performing a network introduction step S130:
introducing an evaluation network and an execution network to perform online approximation of Q*(x_k, u_k) and of the optimal controller, respectively, wherein the evaluation network and the execution network are both neural networks;
the evaluation network is used to learn the optimal Q-function Q*(x_k, u_k), and the execution network is used to learn the optimal controller u*; assuming that the number of activation functions in the evaluation network is N_c, the best approximation of Q*(x_k, u_k) by the evaluation network in the least-squares sense can be expressed as:
wherein W_c is the weight matrix from the hidden layer to the output layer, φ_c(·) is the set of all activation functions in the hidden layer of the evaluation network, and W_c^0 is the weight matrix from the input layer to the hidden layer of the evaluation network, the i-th row of which is the weight corresponding to the i-th activation function; θ_1(k) represents the input values of the respective activation functions corresponding to (x_k, u_k), the i-th component of which is the input value of the i-th activation function;
let the number of activation functions in the execution network be N_a; the best approximation of the optimal controller u* by the execution network in the least-squares sense can be expressed as:
the input of the execution network is the system state, wherein W_a is the weight matrix from the hidden layer to the output layer, φ_a(·) is the set of hidden-layer activation functions of the execution network, and W_a^0 is the weight matrix from the input layer to the hidden layer, the i-th row of which is the weight corresponding to the i-th activation function; the input values of the respective activation functions corresponding to x_k are defined analogously, the i-th component being the input value of the i-th activation function; for x_{k+1} there is correspondingly:
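As an illustration of the two networks described above — not the patent's exact construction — the following sketch assumes tanh activations, small illustrative dimensions, and randomly generated, fixed input-to-hidden weights (consistent with claim 6); all variable names and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

n_state, n_ctrl = 2, 1      # hypothetical dimensions of x_k and u_k
N_c, N_a = 12, 10           # numbers of hidden activation functions

# Input-to-hidden weights W_c^0, W_a^0: generated randomly once and kept fixed.
W_c0 = rng.standard_normal((N_c, n_state + n_ctrl))
W_a0 = rng.standard_normal((N_a, n_state))

# Hidden-to-output weights W_c, W_a: the only trainable parameters.
W_c = np.zeros(N_c)
W_a = np.zeros((N_a, n_ctrl))

def critic(x, u, W_c):
    """Evaluation network: Q-hat(x, u) = W_c^T phi_c(W_c^0 [x; u])."""
    theta = W_c0 @ np.concatenate([x, u])   # activation-function input values
    return W_c @ np.tanh(theta)

def actor(x, W_a):
    """Execution network: u-hat(x) = W_a^T phi_a(W_a^0 x)."""
    return W_a.T @ np.tanh(W_a0 @ x)

x = np.array([0.5, -0.2])
u = np.array([0.1])
q = critic(x, u, W_c)        # scalar Q-value estimate
u_hat = actor(x, W_a)        # m-dimensional control estimate
```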
Estimation error calculation step S140:
replacing the exact values Q*(x_k, u_k) and u* with their optimal approximations, the following estimation error can be obtained:
wherein θ_2(k+1) denotes the input values of the respective activation functions of the evaluation network at the input corresponding to time k+1, i.e.
Optimal weight calculation step S150:
online learning is performed on the optimal weight W_c of the evaluation network and the optimal weight W_a of the execution network; assume that at time k the evaluation network and the execution network hold estimates of W_c and W_a respectively, where l ≤ k, i.e. the learning process is carried out after the behavior strategy starts to generate state data; the output of the execution network at time k may then be expressed as:
before the action policy u_k generates the next state x_{k+1}, the execution network cannot give the estimate of W_a for time k+1, so the estimate of W_a at time k+1 still adopts the estimate at time k; the output of the execution network at time k+1 is:
similarly, when the input is (x_k, u_k), the output of the evaluation network is:
wherein ,also, in generating state xk+1Previously, the evaluation network could not give the k +1 time pair WcSo that the network pair W is evaluated at time k +1cIs also taken as an estimate ofTherefore, there are:
replacing the true values with the estimated values yields the following estimation errors:
the weights of the execution network are trained using an importance weighting method and adjusted online using a modified gradient descent method.
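The importance-weighted update for the execution network is not spelled out above; purely as a hedged sketch of a common actor update in adaptive dynamic programming (not the patent's exact rule), the following descends the critic's Q-estimate through the chain rule — the learning rate beta, the dimensions, and the tanh activations are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_state, n_ctrl, N_c, N_a = 2, 1, 12, 10
W_c0 = rng.standard_normal((N_c, n_state + n_ctrl))  # fixed critic input weights
W_a0 = rng.standard_normal((N_a, n_state))           # fixed actor input weights
W_c = rng.standard_normal(N_c) * 0.1                 # current critic output weights
W_a = np.zeros((N_a, n_ctrl))                        # current actor output weights
beta = 0.01                                          # actor learning rate (assumed)

def actor_step(x, W_a, W_c):
    """One hedged actor update: gradient descent of Q-hat with respect to W_a."""
    h_a = np.tanh(W_a0 @ x)            # actor hidden-layer output
    u = W_a.T @ h_a                    # current control estimate u-hat(x)
    theta = W_c0 @ np.concatenate([x, u])
    dphi = 1.0 - np.tanh(theta) ** 2   # tanh derivative at hidden inputs
    # dQ/du via chain rule through the critic's control columns
    dQ_du = (W_c * dphi) @ W_c0[:, n_state:]   # shape (n_ctrl,)
    # dQ/dW_a = h_a (outer product) dQ/du
    return W_a - beta * np.outer(h_a, dQ_du)

x = np.array([0.5, -0.2])
W_a_new = actor_step(x, W_a, W_c)
```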
2. The online learning control method according to claim 1, characterized in that:
in the evaluation network and execution network introduction step S130,
for the evaluation network, W_c^0 is set to a constant value, and only the weights from the hidden layer to the output layer are adjusted;
for the execution network, W_a^0 is set to a constant value, and only the weights from the hidden layer to the output layer are adjusted.
3. The online learning control method according to claim 2, characterized in that:
in the optimal weight calculation step S150:
for evaluating the weight of the network, the following gradient descent method is adopted for adjustment, and the method specifically comprises the following steps:
wherein α > 0 is the learning rate of the evaluation network, Δφ_c(k) = φ_c(θ_2(k+1)) − φ_c(θ_1(k)) is the regression vector, and (1 + Δφ_c(k)^T Δφ_c(k))^2 is a normalization term.
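As a hedged sketch consistent with the regression vector and normalization term defined above (the temporal-difference-style error e(k) and the utility term U(x_k, u_k) are assumptions, not reproduced from the original), one normalized gradient-descent step on the evaluation-network weights might look like:

```python
import numpy as np

def critic_update(W_c_hat, phi1, phi2, utility, alpha=0.1):
    """One normalized-gradient critic step (hedged reconstruction).

    phi1    -- phi_c(theta_1(k)): critic hidden-layer output at (x_k, u_k)
    phi2    -- phi_c(theta_2(k+1)): critic hidden-layer output at time k+1
    utility -- assumed one-step cost U(x_k, u_k)
    alpha   -- learning rate alpha > 0
    """
    dphi = phi2 - phi1                                 # regression vector
    e = W_c_hat @ phi1 - (utility + W_c_hat @ phi2)    # assumed TD-style error
    rho = (1.0 + dphi @ dphi) ** 2                     # normalization term
    # descend e^2/2: de/dW = -dphi, so the descent direction is +dphi * e
    return W_c_hat + alpha * dphi * e / rho

W = np.zeros(4)
W1 = critic_update(W, np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0]),
                   utility=1.0)
```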
4. The online learning control method according to claim 3, characterized in that:
in the action policy selection step S110, the action policy is: u_k = u'_k + n_k, wherein u' is any feasible control strategy selected according to the characteristics of the controlled system and prior experience, and n_k is an exploration noise; n_k may be a sinusoidal or cosine signal containing sufficiently many frequencies, or a random signal of limited amplitude.
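A probing-noise signal of the kind described — a sum of sinusoids at several frequencies — can be sketched as follows; the frequencies, amplitude, and sampling period are illustrative values, not taken from the patent:

```python
import numpy as np

def exploration_noise(k, dt=0.01, freqs=(1.0, 3.7, 9.3, 17.1), amp=0.1, seed=0):
    """Exploration noise n_k: sum of sinusoids at several frequencies.

    All parameter values are illustrative; random phases decorrelate the terms.
    """
    rng = np.random.default_rng(seed)
    phases = rng.uniform(0, 2 * np.pi, size=len(freqs))
    t = k * dt
    return amp * sum(np.sin(2 * np.pi * f * t + p)
                     for f, p in zip(freqs, phases))

# Noise sequence added to the base controller output: u_k = u'_k + n_k
n = [exploration_noise(k) for k in range(100)]
```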
5. The online learning control method according to claim 3, characterized in that:
the evaluation network and the execution network are single-hidden-layer feedforward neural networks; the input of the evaluation network used to approximate the Q-function is the state and the control input, the input of the execution network is the system state, and its output is an m-dimensional vector.
6. The online learning control method according to claim 5, characterized in that:
the evaluation network and the execution network adjust only the weights from the hidden layer to the output layer; the weights from the input layer to the hidden layer are randomly generated before the learning process starts and are kept unchanged during the learning process.
7. The online learning control method according to claim 5, characterized in that:
the activation functions of the evaluation network and the execution network are one of a hyperbolic tangent function, a Sigmoid function, a rectified linear function, and a polynomial function.
8. A storage medium for storing computer-executable instructions, characterized in that:
the computer executable instructions, when executed by a processor, perform a method of online learning control of a non-linear discrete time system as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011635930.6A CN113485099B (en) | 2020-12-31 | 2020-12-31 | Online learning control method of nonlinear discrete time system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011635930.6A CN113485099B (en) | 2020-12-31 | 2020-12-31 | Online learning control method of nonlinear discrete time system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113485099A true CN113485099A (en) | 2021-10-08 |
CN113485099B CN113485099B (en) | 2023-09-22 |
Family
ID=77933336
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011635930.6A Active CN113485099B (en) | 2020-12-31 | 2020-12-31 | Online learning control method of nonlinear discrete time system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113485099B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117111620A (en) * | 2023-10-23 | 2023-11-24 | 山东省科学院海洋仪器仪表研究所 | Autonomous decision-making method for task allocation of heterogeneous unmanned system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2009186317A (en) * | 2008-02-06 | 2009-08-20 | Mitsubishi Electric Corp | Radar control system |
CN107436424A (en) * | 2017-09-08 | 2017-12-05 | 中国电子科技集团公司信息科学研究院 | A kind of more radar dynamic regulating methods and device based on information gain |
CN110214264A (en) * | 2016-12-23 | 2019-09-06 | 御眼视觉技术有限公司 | The navigation system of restricted responsibility with application |
CN110462544A (en) * | 2017-03-20 | 2019-11-15 | 御眼视觉技术有限公司 | The track of autonomous vehicle selects |
CN110826026A (en) * | 2020-01-13 | 2020-02-21 | 江苏万链区块链技术研究院有限公司 | Method and system for publication based on block chain technology and associated copyright protection |
CN111142383A (en) * | 2019-12-30 | 2020-05-12 | 中国电子科技集团公司信息科学研究院 | Online learning method for optimal controller of nonlinear system |
CN111812973A (en) * | 2020-05-21 | 2020-10-23 | 天津大学 | Event trigger optimization control method of discrete time nonlinear system |
- 2020-12-31 CN CN202011635930.6A patent/CN113485099B/en active Active
Non-Patent Citations (5)
Title |
---|
HONG-GUI HAN等: "Real-Time Model Predictive Control Using a Self-Organizing Neural Network", 《IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS》 * |
J. SI等: "Online learning control by association and reinforcement", 《PROCEEDINGS OF THE IEEE-INNS-ENNS INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS. IJCNN 2000. NEURAL COMPUTING: NEW CHALLENGES AND PERSPECTIVES FOR THE NEW MILLENNIUM》 * |
LI, XX等: "Off-Policy Q-Learning for Infinite Horizon LQR Problem with Unknown Dynamics", 《27TH IEEE INTERNATIONAL SYMPOSIUM ON INDUSTRIAL ELECTRONICS (ISIE)》 * |
张振宁等: "再入飞行器自适应最优姿态控制", 《宇航学报》 * |
许腾驹等: "异构网络下基于D2D通信机制的负载均衡策略", 《计算机工程与设计》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117111620A (en) * | 2023-10-23 | 2023-11-24 | 山东省科学院海洋仪器仪表研究所 | Autonomous decision-making method for task allocation of heterogeneous unmanned system |
CN117111620B (en) * | 2023-10-23 | 2024-03-29 | 山东省科学院海洋仪器仪表研究所 | Autonomous decision-making method for task allocation of heterogeneous unmanned system |
Also Published As
Publication number | Publication date |
---|---|
CN113485099B (en) | 2023-09-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | Adaptive optimized backstepping control-based RL algorithm for stochastic nonlinear systems with state constraints and its application | |
Lucia et al. | A deep learning-based approach to robust nonlinear model predictive control | |
CA2414707C (en) | Computer method and apparatus for constraining a non-linear approximator of an empirical process | |
Zhang et al. | Adaptive neural tracking control of pure-feedback nonlinear systems with unknown gain signs and unmodeled dynamics | |
Jiang et al. | Robust adaptive dynamic programming | |
Suykens et al. | Robust local stability of multilayer recurrent neural networks | |
Wang et al. | Adaptive neural finite-time containment control for nonlower triangular nonlinear multi-agent systems with dynamics uncertainties | |
CN112405542B (en) | Musculoskeletal robot control method and system based on brain inspiring multitask learning | |
Li et al. | Policy iteration based Q-learning for linear nonzero-sum quadratic differential games | |
Jia et al. | Optimization of control parameters based on genetic algorithms for spacecraft attitude tracking with input constraints | |
Grancharova et al. | Computation, approximation and stability of explicit feedback min–max nonlinear model predictive control | |
Ibrahim et al. | Regulated Kalman filter based training of an interval type-2 fuzzy system and its evaluation | |
Kosmatopoulos | Control of unknown nonlinear systems with efficient transient performance using concurrent exploitation and exploration | |
CN113485099A (en) | Online learning control method of nonlinear discrete time system | |
Chen et al. | Novel adaptive neural networks control with event-triggered for uncertain nonlinear system | |
CN114740710A (en) | Random nonlinear multi-agent reinforcement learning optimization formation control method | |
CN115167102A (en) | Reinforced learning self-adaptive PID control method based on parallel dominant motion evaluation | |
CN111142383B (en) | Online learning method for optimal controller of nonlinear system | |
Sabes et al. | Reinforcement learning by probability matching | |
Song et al. | Adaptive dynamic programming: single and multiple controllers | |
Wang et al. | Model-free nonlinear robust control design via online critic learning | |
Fu et al. | Adaptive optimal control of unknown nonlinear systems with different time scales | |
Chen et al. | Adaptive fuzzy PD+ control for attitude maneuver of rigid spacecraft | |
Lewis et al. | Neural network control of robot arms and nonlinear systems | |
CN114063458A (en) | Preset performance control method of non-triangular structure system independent of initial conditions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||