CN108181816A

CN108181816A - A synchronous policy update optimal control method based on online data

Info

Publication number: CN108181816A
Application number: CN201810010374.XA
Authority: CN
Inventors: 魏阿龙; 刘春生; 孙景亮
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2018-01-05
Filing date: 2018-01-05
Publication date: 2018-06-19

Abstract

The present invention relates to a synchronous policy update optimal control method based on online data, and belongs to the field of intelligent control and optimal control. The method comprises the following steps: 1. initialize the system state, determine the activation functions of three neural networks (NNs), and assign arbitrary initial values to their weights; set the data acquisition length and the iteration stopping criterion; 2. select an arbitrary control input and disturbance noise and apply them to the system; 3. sample the current system state, the control input, and the noise input at a fixed rate, and compute the intermediate variables required by the algorithm; 4. judge whether the data are valid; if so, proceed to the next step; otherwise, jump to step 1; 5. update the weights of the three NNs; 6. judge whether the iteration stopping criterion is met; if so, output the result; otherwise, jump to step 5. The proposed method removes the dependence of traditional control methods on a model of the controlled plant, relieves the heavy burden of solving for the controller, and increases the robustness of the system.

Description

A synchronous policy update optimal control method based on online data

Technical field

The present invention relates to a synchronous policy update optimal control method based on online data, and belongs to the field of intelligent control and optimal control.

Background art

Dynamic programming (DP) is a systematic approach to dynamic optimization and optimal control problems. It relies on the principle of optimality and finds the optimal control policy through a cost function. This cost function must satisfy the Hamilton-Jacobi-Bellman (HJB) equation (for optimal control) or the Hamilton-Jacobi-Isaacs (HJI) equation (for differential games). In many control applications, the system is subject to disturbances that degrade control performance. $H_\infty$ control provides a powerful tool for reducing the influence of such disturbances. According to differential game theory, finding the $H_\infty$ controller is equivalent to solving a two-player zero-sum game (ZSG), in which the controller tries to minimize the performance index under the largest possible disturbance. However, because of the inherent nonlinearity of the HJB and HJI equations, obtaining their analytical solutions is nearly impossible.

Recently, a newly proposed class of algorithms called adaptive/approximate dynamic programming (ADP) has been used to solve various optimal control problems, including ZSG problems. Its basic idea is to estimate the cost function with a function approximation structure and to solve the DP problem forward in time, thereby avoiding the "curse of dimensionality". It offers a convenient and effective solution for nonlinear optimal control and differential games. In practical engineering applications, an accurate dynamic model of the controlled plant is usually unknown. Some researchers identify the unknown dynamics with a neural network (NN) and then apply ADP to the identifier network to find the optimal solution. However, the identification error of the network harms the optimality of the final controller. Training the identifier network also increases the computational cost and the learning time. An optimal control method that does not depend on a system model at all is therefore highly desirable.

It should also be noted that online policy iteration (on-policy learning) is a popular approach to control design. The advantages of off-policy updating over on-policy iteration are discussed below.

To compute the value function of a target control policy μ, an on-policy method needs system data generated by applying that same policy μ. In practice, however, an on-policy learning algorithm uses data generated by an approximation of the target control policy (rather than the actual target policy μ) to learn the value function and iteratively compute the optimal control, which can seriously distort the direction of policy learning. The estimated value function may be highly inaccurate for states that are not adequately represented (that is, states not visited under the optimal control). In other words, on-policy learning uses "inaccurate" data to learn the value function, which increases the accumulated error. This is known as "insufficient exploration" and is an especially acute problem in online iterative methods.

In industry, scientific and technological progress shows two prominent features. First, production systems are growing in scale and their operation is becoming more complex, so more and more real systems face the difficulty that an industrial process model accurate enough for control design is hard to establish. Second, large amounts of data are stored during industrial production but are not used effectively. Data-based off-policy iterative optimal control of nonlinear systems is therefore both significant and challenging.

Summary of the invention

To address the difficulty of establishing an industrial process model accurate enough for control design, and the fact that the large amounts of data generated in industrial processes are not used effectively, the present invention proposes a synchronous policy update optimal control method based on online data.

The present invention adopts the following technical solution to solve its technical problem:

A synchronous policy update optimal control method based on online data, comprising the following steps:

Step 1: Initialize the system state, determine the activation functions of three NNs, and assign arbitrary initial values to their weights; set the data acquisition length and the iteration stopping criterion;

Step 2: Select an arbitrary control input and disturbance noise and apply them to the system;

Step 3: Sample the current system state, the control input, and the noise input at a fixed rate, and compute the intermediate variables required by the algorithm;

Step 4: Judge whether the data are valid; if so, proceed to the next step; otherwise, jump to step 1;

Step 5: Update the weights of the three NNs;

Step 6: Judge whether the iteration stopping criterion is met; if so, output the result; otherwise, jump to step 5.

The beneficial effects of the present invention are as follows:

1. The proposed method removes the dependence of traditional control methods on a plant model, relieves the heavy burden of solving for the controller (mainly the solution of partial differential equations), and increases the robustness of the system.

2. The proposed method is easy to implement. It runs in two stages: online data acquisition and offline synchronous updating. Any admissible policy can be applied to the system, which makes the system safer.

3. The proposed off-policy update algorithm can be carried out on data generated by other, non-target policies rather than necessarily by the target policy, which increases the capacity for "exploration" during learning. In addition, because no error is introduced while the data are generated, the accumulated error is reduced.

4. The solution finally obtained by the present invention is the globally optimal solution.

Description of the drawings

Fig. 1 is a flow chart of the synchronous policy update optimal control method based on online data according to the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawing.

The invention discloses a synchronous policy update optimal control method based on online data that does not depend on model information of the controlled plant and increases the robustness of the system. The invention also takes the suppression of system disturbance noise into account.

The system model studied in the present invention can be expressed as $\dot{x} = f(x) + g(x)u + k(x)v$, where $x \in \mathbb{R}^n$ is the n-dimensional state variable, $u \in \mathbb{R}^m$ is the m-dimensional control input, $v \in \mathbb{R}^q$ is the q-dimensional external noise disturbance, $f(x) \in \mathbb{R}^n$ is the system dynamics, $g(x) \in \mathbb{R}^{n \times m}$ is the control input matrix, $k(x) \in \mathbb{R}^{n \times q}$ is the noise input matrix, and the initial state of the system is $x = x_0$.
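As a rough illustration of the input-affine model described above, the following sketch simulates one step of $\dot{x} = f(x) + g(x)u + k(x)v$. The particular functions `f`, `g`, `k`, the two-state dimensions, and the forward-Euler integrator are all assumptions chosen for the example; the patent does not specify a plant, and the method itself is meant to work without knowing these functions.

```python
import numpy as np

# Hypothetical 2-state plant; f, g, k are illustrative only, not from the patent.
def f(x):
    return np.array([x[1], -x[0] - 0.5 * x[1]])

def g(x):
    return np.array([[0.0], [1.0]])   # n x m with n = 2, m = 1

def k(x):
    return np.array([[0.0], [0.5]])   # n x q with n = 2, q = 1

def xdot(x, u, v):
    """Right-hand side dx/dt = f(x) + g(x) u + k(x) v."""
    return f(x) + g(x) @ u + k(x) @ v

def euler_step(x, u, v, dt):
    """One forward-Euler step of the input-affine system."""
    return x + dt * xdot(x, u, v)

x0 = np.array([1.0, 0.0])
x1 = euler_step(x0, np.array([0.0]), np.array([0.0]), 0.01)
```

In a real application the state would come from measurements of the plant rather than from a simulator like this one.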

The purpose of $H_\infty$ control is to find a controller that minimizes the performance index function $J = \int_0^\infty r(x, u, v)\,dt$, where $r(x, u, v) = Q(x) + u^T R u - \gamma^2 v^T v$, $Q(x) \ge 0$ is positive semidefinite, $R > 0$ is a positive definite matrix, and $\gamma \ge \gamma^* \ge 0$. The scalar $\gamma$ is the noise suppression gain, and $\gamma^*$ denotes the minimum value of $\gamma$ for which an optimal solution exists.
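The utility $r(x, u, v) = Q(x) + u^T R u - \gamma^2 v^T v$ can be evaluated on sampled data as in the sketch below. The default choice $Q(x) = x^T x$ and the rectangle-rule approximation of the integral are assumptions made for illustration; the patent only requires $Q(x) \ge 0$ and $R > 0$.

```python
import numpy as np

def utility(x, u, v, R, gamma, Q=lambda x: float(x @ x)):
    """r(x,u,v) = Q(x) + u^T R u - gamma^2 v^T v; Q defaults to x^T x (an assumption)."""
    return Q(x) + float(u @ R @ u) - gamma**2 * float(v @ v)

def accumulated_cost(xs, us, vs, R, gamma, dt):
    """Rectangle-rule approximation of the integral of r over sampled trajectories."""
    return dt * sum(utility(x, u, v, R, gamma) for x, u, v in zip(xs, us, vs))

# Example: x = [1, 2], u = [1], v = [1], R = [[2]], gamma = 2 gives 5 + 2 - 4 = 3.
r_example = utility(np.array([1.0, 2.0]), np.array([1.0]), np.array([1.0]),
                    np.array([[2.0]]), 2.0)
```

The negative sign on the disturbance term is what turns the minimization into a game: the control lowers the cost while the disturbance raises it.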

The present invention aims to solve for the optimal state feedback controller. Given a control $u = u(x(t))$ and a noise disturbance $v = v(x(t))$, the corresponding value function is $V(x(t)) = \int_t^\infty r(x(\tau), u(\tau), v(\tau))\,d\tau$.

Based on differential game theory, finding the $H_\infty$ controller is equivalent to solving a two-player ZSG, in which the control signal tries to minimize the performance index and the noise disturbance tries to maximize it. The optimal value function corresponding to the optimal state feedback control $u^*$ and disturbance $v^*$ is $V^*(x) = \min_u \max_v \int_t^\infty r(x, u, v)\,d\tau$, and $(u^*, v^*)$ is the saddle-point solution satisfying the Nash condition $\min_u \max_v V = \max_v \min_u V$.
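For reference, the standard HJI equation and saddle-point policies for an input-affine system $\dot{x} = f(x) + g(x)u + k(x)v$ with utility $r = Q(x) + u^T R u - \gamma^2 v^T v$ can be written as below. This is the textbook form, obtained from the stationarity conditions of the Hamiltonian under the assumption that $R$ is symmetric; it is given here as background and is not quoted from the patent.

```latex
\begin{aligned}
H(x, \nabla V, u, v) &= Q(x) + u^T R u - \gamma^2 v^T v
  + \nabla V^T \bigl( f(x) + g(x)u + k(x)v \bigr), \\
u^*(x) &= -\tfrac{1}{2} R^{-1} g(x)^T \nabla V^*(x), \\
v^*(x) &= \tfrac{1}{2\gamma^2} k(x)^T \nabla V^*(x), \\
0 &= Q(x) + \nabla V^{*T} f(x)
  - \tfrac{1}{4} \nabla V^{*T} g(x) R^{-1} g(x)^T \nabla V^*
  + \tfrac{1}{4\gamma^2} \nabla V^{*T} k(x) k(x)^T \nabla V^* .
\end{aligned}
```

The last line is a nonlinear partial differential equation in $V^*$ that depends on $f$, $g$, and $k$, which is why a data-based method that avoids solving it directly is attractive.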

Fig. 1 shows the flow chart of the synchronous policy update optimal control method based on online data according to the present invention. The off-policy synchronous update algorithm first collects system operating data online and then performs offline iterative learning. The specific steps are as follows:

Step 1: Initialization. Select an arbitrary admissible control policy $u'$, a disturbance $v'$, and their corresponding exploration noises $e_u$, $e_v$, to be applied to the system. Set the data acquisition length $L$ and the sampling period $\Delta t$; the online data sampling time is then $T = L \cdot \Delta t$. Specify the iteration stopping criterion $\varepsilon$ (the error threshold between two successive iterations). The value function, the control law, and the disturbance are each represented by an NN. The NN activation functions can be chosen freely and are typically hyperbolic tangent functions tanh(·), polynomial functions, and the like. $N_1$, $N_2$, $N_3$ denote the numbers of activation functions of the corresponding NNs, each of which has its own weight matrix (for the value function NN, the control policy NN, and the disturbance policy NN, respectively). The NN initial weights are assigned arbitrarily; $m$ is the dimension of the control input and $q$ is the dimension of the noise disturbance input.
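One way to picture the three NNs of step 1 is as linear-in-weights approximators, as in the sketch below. The form $W^T \phi(x)$, the specific activation vectors `phi_poly` and `phi_tanh`, and the sizes $N_1 = 3$, $N_2 = N_3 = 2$ are assumptions for illustration; the source text does not reproduce the exact expressions.

```python
import numpy as np

def phi_poly(x):
    """Hypothetical polynomial activation vector for a 2-state system (N1 = 3)."""
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def phi_tanh(x):
    """Hypothetical tanh activation vector (N2 = N3 = 2)."""
    return np.tanh(x)

rng = np.random.default_rng(0)
N1, N2, N3, m, q = 3, 2, 2, 1, 1
W1 = rng.standard_normal(N1)         # value-function NN weights, arbitrary initial values
W2 = rng.standard_normal((N2, m))    # control-policy NN weights
W3 = rng.standard_normal((N3, q))    # disturbance-policy NN weights

def V_hat(x):
    return W1 @ phi_poly(x)          # scalar value estimate

def u_hat(x):
    return W2.T @ phi_tanh(x)        # m-dimensional control

def v_hat(x):
    return W3.T @ phi_tanh(x)        # q-dimensional disturbance
```

Because each output is linear in its weight matrix, the weights can later be found by least squares rather than by gradient training.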

Step 2: Apply the control $u = u' + e_u$ and the disturbance $v = v' + e_v$ selected in the previous step to the system.

Step 3: Collect data and compute the related intermediate variables.

Collect $\{\delta_1, \dots, \delta_6\}$ online in real time; their specific forms are as follows:

where $t_0$ is the time at which sampling starts, $t_1 = t_0 + \Delta t$, $t_{L-1} = t_0 + (L-1)\Delta t$, and $t_L = t_0 + L\Delta t$.
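The fixed-rate sampling schedule $t_j = t_0 + j\Delta t$ can be sketched as follows. The toy scalar plant, the sinusoidal exploration signals, and the helper names `sampling_grid` and `collect` are hypothetical; the quantities $\delta_1, \dots, \delta_6$ themselves are not computed here because their definitions are not reproduced in the source text.

```python
import numpy as np

def sampling_grid(t0, dt, L):
    """Return t_0, t_1, ..., t_L with t_j = t0 + j*dt."""
    return t0 + dt * np.arange(L + 1)

def collect(step, x0, u_fn, v_fn, t0, dt, L):
    """Record (x, u, v) at t_0..t_{L-1} while driving the system with u_fn and v_fn."""
    ts = sampling_grid(t0, dt, L)
    xs, us, vs = [], [], []
    x = np.asarray(x0, dtype=float)
    for t in ts[:-1]:
        u, v = u_fn(x, t), v_fn(x, t)
        xs.append(x.copy())
        us.append(u)
        vs.append(v)
        x = step(x, u, v, dt)
    return ts, np.array(xs), np.array(us), np.array(vs)

# Toy scalar system dx/dt = -x + u + v with forward Euler (illustrative only).
step = lambda x, u, v, dt: x + dt * (-x + u + v)
u_fn = lambda x, t: -0.5 * x + 0.1 * np.sin(3.0 * t)          # u' plus exploration noise e_u
v_fn = lambda x, t: 0.05 * np.cos(5.0 * t) * np.ones_like(x)  # v' plus exploration noise e_v
ts, xs, us, vs = collect(step, [1.0], u_fn, v_fn, t0=0.0, dt=0.01, L=200)
```

With `dt=0.01` and `L=200`, the total sampling time is $T = L \cdot \Delta t = 2$ time units.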

When the data collection time reaches $T$, stop sampling and compute the following two variables:

where $\otimes$ denotes the Kronecker product operator, vec(·) is the column-stacking (vectorization) operator for matrices, $I_{K_2}$ is the $K_2$-dimensional identity matrix, $I_{K_3}$ is the $K_3$-dimensional identity matrix, $W_{2,i}$ is the weight matrix of the control policy NN at the i-th iteration, and $W_{3,i}$ is the weight matrix of the disturbance policy NN at the i-th iteration. The three weight matrices are combined into a single new weight vector.
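The Kronecker product and the vec(·) operator usually appear together because of the identity $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$, which turns an expression that is linear in a weight matrix into one that is linear in a weight vector. The sketch below checks that identity numerically; the matrix sizes are arbitrary, and the link to the patent's own (unreproduced) formulas is an assumption.

```python
import numpy as np

def vec(M):
    """Column-stacking operator: stack the columns of M into one vector."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# Identity vec(A X B) = (B^T kron A) vec(X).
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)

# Special case: phi^T W u = (u^T kron phi^T) vec(W), linear in the stacked weights.
phi = rng.standard_normal(3)
W = rng.standard_normal((3, 2))
u = rng.standard_normal(2)
scalar_direct = phi @ W @ u
scalar_kron = np.kron(u, phi) @ vec(W)
```

Stacking all unknown weights into one vector in this way is what allows them to be solved for simultaneously by least squares.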

Step 4: Judge whether the pseudoinverse required for the update exists, i.e., whether the corresponding matrix is invertible. If it exists, proceed to the next step. Otherwise, reset the exploration noises $e_u$, $e_v$ and jump to step 2.
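A data-validity test of this kind can be sketched as a rank check: a regression matrix has a unique least-squares solution exactly when it has full column rank, which is equivalent to its Gram matrix being invertible. The name `Phi` and the function below are hypothetical, since the symbol used in the patent is not reproduced in the source text.

```python
import numpy as np

def has_unique_least_squares_solution(Phi, tol=None):
    """True when Phi has full column rank, i.e. Phi^T Phi is invertible."""
    Phi = np.atleast_2d(Phi)
    return bool(np.linalg.matrix_rank(Phi, tol=tol) == Phi.shape[1])

# Rank-deficient data (second column is twice the first): collect new data.
bad = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
# Sufficiently rich data: proceed to the weight update.
good = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

When the check fails, the collected data are not informative enough, which is why the exploration noise is reset before sampling again.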

Step 5: Use the iterative formula to synchronously update the weights of the value function NN, the control policy NN, and the disturbance policy NN;

Step 6: Judge from the convergence criterion whether the iteration has converged. If it has, output the result: the optimal controller (together with the maximum disturbance), in which $W_{2,*}$ and $W_{3,*}$ denote the final values at convergence. Otherwise, jump to step 5 and continue updating.
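The loop formed by steps 5 and 6 can be sketched as a repeated least-squares update that stops when successive weight vectors differ by no more than $\varepsilon$. The patent's actual update formula is not reproduced in the source text, so `build_regression` below is a hypothetical stand-in, and the norm-of-difference stopping test is an assumption based on the description of $\varepsilon$ as the error threshold between two successive iterations.

```python
import numpy as np

def iterate_until_converged(build_regression, W0, eps=1e-8, max_iter=1000):
    """Repeat W <- argmin ||Phi(W) W' - Y(W)|| until ||W_new - W|| <= eps."""
    W = np.asarray(W0, dtype=float)
    for i in range(max_iter):
        Phi, Y = build_regression(W)                       # depends on current weights
        W_new, *_ = np.linalg.lstsq(Phi, Y, rcond=None)    # least-squares solution
        if np.linalg.norm(W_new - W) <= eps:
            return W_new, i + 1
        W = W_new
    raise RuntimeError("no convergence within max_iter iterations")

# Toy stand-in regression (not the patent's): the fixed point is 2*c.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c = np.array([1.0, -2.0])
build = lambda W: (A, A @ (0.5 * W + c))
W_star, n_iter = iterate_until_converged(build, np.zeros(2))
```

In this toy case each iteration halves the distance to the fixed point, so the loop ends after a few dozen iterations; the convergence behavior of the patented update would depend on its own formula and data.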

It is worth noting that the value function NN weights $W_{1,i}$ play only an auxiliary role in the iteration: they are not used in the iterative update itself and appear only as part of the least-squares solution.

The method steps described above explain the objectives, technical solutions, and beneficial effects of the present invention in further detail. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A synchronous policy update optimal control method based on online data, characterized by comprising the following steps:

Step 3: Sample the current system state, the control input, and the noise input at a fixed rate, and compute the intermediate variables required by the algorithm;

Step 5: Update the weights of the three NNs;

2. The synchronous policy update optimal control method based on online data according to claim 1, characterized in that, for the three NN activation functions described in step 1, $N_1$, $N_2$, $N_3$ denote the numbers of the corresponding NN activation functions, and $R > 0$ is a positive definite matrix.

3. The synchronous policy update optimal control method based on online data according to claim 2, characterized in that the activation function is the hyperbolic tangent function tanh(·).

4. The synchronous policy update optimal control method based on online data according to claim 2, characterized in that the activation function is a polynomial function.

5. The synchronous policy update optimal control method based on online data according to claim 1, characterized in that the data acquisition length described in step 1 is L.