CN110879531B - Data-driven adaptive optimal control method and medium for a stochastic disturbance system - Google Patents


Info

Publication number
CN110879531B
CN110879531B
Authority
CN
China
Prior art keywords
data
optimal
state
control
driven
Prior art date
Legal status
Active
Application number
CN201911154069.9A
Other languages
Chinese (zh)
Other versions
CN110879531A (en)
Inventor
甘明刚
马千兆
张蒙
陈杰
窦丽华
邓方
白永强
Current Assignee
Beijing Institute of Technology BIT
Chongqing Innovation Center of Beijing University of Technology
Original Assignee
Beijing Institute of Technology BIT
Chongqing Innovation Center of Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT and Chongqing Innovation Center of Beijing University of Technology
Priority to CN201911154069.9A
Publication of CN110879531A
Application granted
Publication of CN110879531B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems ... electric
    • G05B13/04 - Adaptive control systems ... electric, involving the use of models or simulators
    • G05B13/042 - Adaptive control systems ... in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention discloses a data-driven adaptive optimal control method and medium for a stochastic disturbance system. The method comprises a problem formulation part, a design part for a data-driven optimal state observer, and an off-policy data-driven ADP control part for the stochastic disturbance system; the invention describes these three parts in detail. According to the method, a data-driven optimal state observer is designed, and off-policy data-driven ADP control of the stochastic disturbance system is performed. The data-driven ADP method is applied for the first time to a system whose state is completely unmeasurable; model-free LQG control is generalized to continuous-time systems; non-matching noise outside the control signal channel and independent noise uncorrelated with the state and control signal are considered in the ADP design; and a novel off-policy data-driven ADP control method and medium for a stochastic disturbance system are provided, which avoid the burden of repeatedly reading and updating control signals and significantly reduce the amount of computation.

Description

Data-driven adaptive optimal control method and medium for a stochastic disturbance system
Technical Field
The invention relates to systems disturbed by random noise, and in particular to model-free stochastic optimal control. Such systems arise in many fields, including industrial and agricultural production, electric power systems, chemical processes, machine manufacturing, transportation, aerospace, and artificial intelligence.
Background
Uncertainty in a practical system may come from noise in signals such as inputs and states. The optimal control problem for systems disturbed by random noise has therefore received sustained attention. In the literature, such problems are usually handled with H2 or H-infinity robust control methods, whose main realization is to model the disturbance input with some deterministic model and then design state-feedback control accordingly. In engineering practice, however, external disturbances rarely evolve in the way they are modeled. On the other hand, existing H2 and H-infinity results are mostly model-based. A practical control system may suffer not only from noise interference but also from uncertainty due to an unknown model. Research on model-free stochastic optimal control therefore has significant theoretical and practical value.
The adaptive dynamic programming (ADP) method offers a new route to model-free stochastic optimal control. Stochastic optimal control results based on reinforcement learning or ADP have appeared in recent years, but they consider only the "matched" noise entering through the control signal channel, require the control signal to be read and updated many times, and carry a heavy computational burden. In a practical system, however, the noise sources may fall into different categories. A further concern is that the system state is sometimes not directly available, whereas existing data-driven reinforcement learning or ADP methods require the state to be at least partially known. Within the MBC framework, some researchers have handled a completely unmeasurable state by treating the output as the state; in the presence of measurement noise, however, this approach inevitably degrades the performance of the control system.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a data-driven adaptive optimal control method for a stochastic disturbance system, which solves the control problem for systems whose state is completely unmeasurable.
Another object of the present invention is to provide a storage medium for the data-driven adaptive optimal control method for a stochastic disturbance system.
The purpose of the invention is realized by the following technical scheme:
a data-driven self-adaptive optimization control method of a random disturbance system comprises a problem description part, a design part of a data-driven optimal state observer and an ADP control part of a random disturbance system driven by different strategy data;
for the problem description section:
a stochastic disturbance system is given together with its associated output equation; the goal is to solve for the optimal linear control, i.e. to design a stochastic optimal control policy that minimizes a cost function;
for the design part of the data-driven optimal state observer:
aiming at the completely unmeasurable system state, a data-driven optimal state observer is designed; an observer system is formed from the stochastic disturbance system, the output equation, and the observer;
the optimal control policy of the system is designed using the observed state;
using the idea of data-driven ADP, a data-driven algorithm is designed on the observer system to solve for the optimal observation gain;
for the off-policy data-driven ADP control part of the stochastic disturbance system:
the optimal observer supplies online state information of the observer system, on which a data-driven ADP algorithm is designed to finally obtain the stochastic optimal control.
As a preferred mode, for the problem formulation part:

given a stochastic disturbance system described by a stochastic differential equation (1) [the equation is given as an image in the original; it combines a linear drift in the state x and control u with state-dependent, control-dependent and independent Wiener noise terms], and the output equation associated with it

y = Cx + υ (2)

wherein x ∈ R^n, u ∈ R^m and y ∈ R^p denote the system state, the control input and the output, respectively; the system matrices, including A ∈ R^{n×n} and B ∈ R^{n×m}, are unknown constant matrices; w and υ are mutually uncorrelated zero-mean Wiener processes whose covariance matrices are W and V, respectively; N1 and N2 are non-negative integers; ξ_i and η_j are zero-mean Wiener processes whose moment conditions are given as images in the original, where ρ_ij, σ_ij > 0 are known constants;

given the above system, the goal is to solve for the optimal linear control u* = −K*x, wherein K* ∈ R^{m×n} is the stochastic optimal control policy to be designed, such that the quadratic cost function [given as an image in the original] is minimized, where the weighting matrices satisfy Q = Qᵀ ≥ 0 and R = Rᵀ > 0.
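For intuition about the criterion, in the purely additive-noise case the steady-state average cost of a stabilizing linear policy u = −Kx has the closed form tr((Q + KᵀRK)X), where X is the stationary state covariance solving (A − BK)X + X(A − BK)ᵀ + W = 0. A minimal sketch under illustrative matrices (multiplicative noise omitted; this is not the patent's data-driven computation, which needs no model):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Illustrative model (not from the patent): damped double integrator with noise.
A = np.array([[0.0, 1.0], [0.0, -0.5]])
B = np.array([[0.0], [1.0]])
W = 0.01 * np.eye(2)          # additive process-noise covariance
Q = np.eye(2)
R = np.array([[1.0]])

def average_cost(K):
    """Steady-state average cost of u = -Kx under additive noise only."""
    Ac = A - B @ K
    assert np.all(np.linalg.eigvals(Ac).real < 0), "K must be stabilizing"
    # Stationary covariance: Ac X + X Ac^T + W = 0
    X = solve_continuous_lyapunov(Ac, -W)
    return np.trace((Q + K.T @ R @ K) @ X)

# Optimal gain from the control algebraic Riccati equation.
P = solve_continuous_are(A, B, Q, R)
K_star = np.linalg.solve(R, B.T @ P)

print(average_cost(K_star))        # cost of the optimal policy
print(average_cost(2.0 * K_star))  # a detuned policy costs more
```

The comparison at the end illustrates that K* is a genuine minimizer: any other stabilizing gain yields a strictly larger average cost.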
As a preferred mode, for the design part of the data-driven optimal state observer:

aiming at the completely unmeasurable system state, an observer (8) is designed [the observer equation is given as an image in the original; it is driven by the output-injection term L(y − Cx̂)], wherein x̂ ∈ R^n denotes the observed value of the state and L ∈ R^{n×p} is the observation gain to be solved for; the observation error is e = x − x̂;

from (1), (2) and (8) the error dynamics (9) are obtained [the equation is given as an image in the original; the drift matrix of e is A − LC];

the evolution of the error e is independent of the state x. Therefore, an optimal observer of the unknown state is designed first, and the optimal control policy of the system is then designed using the observed state.
Preferably, the method comprises the following steps: define the aggregated noise covariance matrices as in (10) and (11) [the definitions are given as images in the original]. According to LQG control theory, the optimal state observation gain L* can be expressed as

L* = S*CᵀV⁻¹ (12)

where S* is the unique symmetric positive definite solution of the algebraic Riccati equation (13) [given as an image in the original];

for the optimal observer to exist, the following assumption is made: the error system is mean-square stabilizable and strictly observable [the precise statement is given as images in the original], where e is its state;

given an initial observation gain L0 ∈ R^{n×p} such that A − L0C is a Hurwitz matrix, let S_k (k = 0, 1, 2, …) be the solution of the Lyapunov equation (15) [given as an image in the original], where L_k (k = 1, 2, 3, …) is given by

L_k = S_{k−1}CᵀV⁻¹ (16)

Then A − L_kC is a Hurwitz matrix, and the sequences {S_k} and {L_k} converge to S* and L*, respectively.

Using the idea of data-driven ADP, a data-driven algorithm is designed on the system (9) to solve for the optimal observation gain L*.
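When the model is known, the iteration of the Lyapunov equation (15) and gain update (16) is the observer-side counterpart of Kleinman's algorithm, and its limit can be checked against the filter algebraic Riccati equation. A minimal numeric sketch, with all matrices illustrative stand-ins (the patent's point is that the same limit is later reached from data alone):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Illustrative model (not from the patent).
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
C = np.array([[1.0, 0.0]])
W = 0.1 * np.eye(2)            # process-noise covariance
V = np.array([[0.01]])         # measurement-noise covariance

L = np.zeros((2, 1))           # L0 = 0 is admissible because A is already Hurwitz
for k in range(30):
    Ak = A - L @ C
    # Lyapunov step (15): Ak S + S Ak^T + L V L^T + W = 0
    S = solve_continuous_lyapunov(Ak, -(L @ V @ L.T + W))
    # Gain update (16): L_{k+1} = S_k C^T V^{-1}
    L = S @ C.T @ np.linalg.inv(V)

# Limit check: S* solves the filter ARE  A S + S A^T - S C^T V^{-1} C S + W = 0
S_star = solve_continuous_are(A.T, C.T, W, V)
L_star = S_star @ C.T @ np.linalg.inv(V)
print(np.max(np.abs(L - L_star)))   # -> ~0
```

At the fixed point, substituting L = SCᵀV⁻¹ back into the Lyapunov equation recovers the Riccati equation exactly, which is why the sequence converges to (S*, L*).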
Preferably, the method comprises the following steps: the observation gain is fixed at L0 during the data-acquisition and learning stage, where A − L0C is a Hurwitz matrix. When the independent noise is zero, an exploration noise ζdt is added to the right-hand side of system (9) to guarantee the persistent excitation condition; when independent noise is present, the persistent excitation condition is satisfied automatically and no exploration noise needs to be added, i.e. ζ = 0;

define L̃_k := L_k − L0 (k = 0, 1, 2, …) and the associated closed-loop quantity [given as an image in the original]; let dσ1 [its definition is given as an image in the original] be measurable. Integrating both sides of system (9) along the trajectory yields the data equation (19) [given as an image in the original];

define the data matrices [their expressions are given as images in the original] over predefined sampling instants satisfying t_0 < t_1 < … < t_{r1}. Using these expressions, (19) can be transformed into a more compact form [given as an image in the original], where Ψ_ek and Ω_ek are defined as shown [images in the original];

for the given L0 making A − L0C a Hurwitz matrix, if the rank condition [given as an image in the original] holds, then S_k and S_{k+1} exist, and the computed sequences {S_k} and {L_k} converge to S* and L*, respectively;

as long as the number of samples r1 (the subscript of the last predefined sampling instant in formula (24), i.e. the amount of collected data) is chosen large enough, the rank condition is guaranteed to hold, and S_k and L_{k+1} can then be solved iteratively [the iteration formula is given as an image in the original]. A threshold κ1 is set as the loop-termination condition: when ‖S_k − S_{k−1}‖ ≤ κ1 the loop is terminated, and the current L_k is the resulting optimal observation gain.
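Each iteration of the data-driven stage reduces to a linear least-squares problem of the form Ψθ = Ω, and the rank condition is precisely the requirement that Ψ have full column rank so that the solution is unique. A generic sketch of that solve step, in which `Psi` and `Omega` are random placeholders standing in for Ψ_ek and Ω_ek rather than matrices built from collected trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data matrix: r1 sampled rows, one column per unknown.
r1, n_unknowns = 40, 6
Psi = rng.standard_normal((r1, n_unknowns))   # stands in for Psi_ek
theta_true = rng.standard_normal(n_unknowns)
Omega = Psi @ theta_true                      # stands in for Omega_ek

# Rank condition: enough independent samples for a unique LS solution.
assert np.linalg.matrix_rank(Psi) == n_unknowns, "collect more samples (increase r1)"

theta, *_ = np.linalg.lstsq(Psi, Omega, rcond=None)
print(np.max(np.abs(theta - theta_true)))     # -> ~0 when the rank condition holds
```

Checking the rank before solving mirrors the patent's requirement that r1 be large enough; with too few (or insufficiently exciting) samples, the least-squares system is underdetermined and the iterates are not unique.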
Preferably, for the off-policy data-driven ADP control part of the stochastic disturbance system:

when the independent noise is nonzero, the persistent excitation condition is satisfied automatically, and the control law in the learning stage is set to

u0 = −K0x (30)

where K0 is an admissible control policy of the system. When the independent noise is zero, an exploration noise [given as an image in the original] is added to the right-hand side of (30);

applying Itô's lemma yields (32) [given as an image in the original]. Define the auxiliary quantities shown as images in the original, and let dσ2 be measurable; then (33) follows [given as an image in the original]. Integrating both sides of equation (33) along the trajectory of system (1) further yields (34) [given as an image in the original];

define the data matrices [their expressions are given as images in the original] over predefined sampling instants satisfying t_0 < t_1 < … < t_{r2}. Using these expressions, (34) can be transformed into a compact form [given as an image in the original], where Ψ_xk and Ω_xk are defined as shown [images in the original]; from these, the iteration formula (41) is obtained [given as an image in the original];

given an initial admissible control policy K0, if the rank condition [given as an image in the original] holds (it can be ensured by choosing r2, the subscript of the last sampling instant in formula (38), large enough), then (P_k, K_{k+1}) has a unique solution and the computed sequences converge to their optimal values;

a threshold κ2 is set as the loop-termination condition: when ‖P_k − P_{k−1}‖ ≤ κ2 the loop is terminated, and u = −K_kx is the resulting stochastic optimal control input.
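On the control side, the off-policy recursion reproduces from data the model-based policy iteration in which P_k solves a Lyapunov equation under the current gain and K_{k+1} = R⁻¹BᵀP_k, with the ‖P_k − P_{k−1}‖ ≤ κ2 stopping rule. A sketch of that model-based counterpart under illustrative matrices (the data-driven version reaches the same limit without knowing A and B):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Illustrative model (not from the patent); open-loop unstable.
A = np.array([[0.0, 1.0], [1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.array([[3.0, 1.0]])    # K0: an admissible (stabilizing) initial policy
kappa2 = 1e-9                 # loop-termination threshold
P_prev = None
while True:
    Ak = A - B @ K
    # Policy evaluation: Ak^T P + P Ak + Q + K^T R K = 0
    P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    # Policy improvement: K_{k+1} = R^{-1} B^T P_k
    K = np.linalg.solve(R, B.T @ P)
    if P_prev is not None and np.linalg.norm(P - P_prev) <= kappa2:
        break
    P_prev = P

P_star = solve_continuous_are(A, B, Q, R)  # limit: control Riccati solution
print(np.max(np.abs(P - P_star)))          # -> ~0
```

Convergence is quadratic from any stabilizing K0, which is why the threshold test on successive P_k iterates is a practical stopping rule.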
A computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to perform the above method.
The beneficial effects of the invention are:
according to the method, the optimal state observer is driven by design data, and different strategy data driving ADP control of a random disturbance system is performed. The data driving ADP method is firstly used for a system with completely unmeasurable state; model-less LQG control is generalized to continuous time systems; non-matching noise outside a control signal channel and independent noise independent of a state and a control signal are considered in ADP design; a novel different strategy data driving ADP control method and medium for a random disturbance system are provided, the burden of repeatedly reading and updating control signals is avoided, and the calculation amount is obviously reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a top view of the experimental scene;
FIG. 2 is a flow chart of the data-driven adaptive optimal control algorithm for the stochastic disturbance system;
FIG. 3 illustrates the meaning of d(t) and θ(t);
FIG. 4 is a diagram of the trajectories of the tip movements;
FIG. 5 shows the tip velocity and motive force in the zero force field;
FIG. 6 shows the tip velocity and force in the VF before learning;
FIG. 7 shows the tip velocity and force in the VF after learning;
FIG. 8 shows the tip velocity and force after the VF is removed.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the present invention; all other embodiments obtained by a person skilled in the art without inventive effort on the basis of these embodiments fall within the scope of the present invention. The following detailed description is therefore not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention.
Example one
The invention provides a data-driven adaptive optimal control method for a stochastic disturbance system, comprising a problem formulation part, a design part for a data-driven optimal state observer, and an off-policy data-driven ADP control part for the stochastic disturbance system;
for the problem formulation part:
a stochastic disturbance system is given together with its associated output equation; the goal is to solve for the optimal linear control, i.e. to design a stochastic optimal control policy that minimizes a cost function;
for the design part of the data-driven optimal state observer:
aiming at the completely unmeasurable system state, a data-driven optimal state observer is designed; an observer system is formed from the stochastic disturbance system, the output equation, and the observer;
the optimal control policy of the system is designed using the observed state;
using the idea of data-driven ADP, a data-driven algorithm is designed on the observer system to solve for the optimal observation gain;
for the off-policy data-driven ADP control part of the stochastic disturbance system:
the optimal observer supplies online state information of the observer system, on which a data-driven ADP algorithm is designed to finally obtain the stochastic optimal control.
Example two
In accordance with the above embodiment, the invention provides a computer-readable storage medium on which a computer program is stored; the computer program is executed by a processor to perform the method described above.
The present invention may employ a computer program product embodied on one or more storage media (including disk storage, CD-ROM, optical storage) having computer program code embodied therein.
The present invention has been described with reference to a method according to an embodiment of the invention. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means.
EXAMPLE III
In accordance with the above embodiments, the present invention provides an application example: a simulation of the learning mechanism of the central nervous system.
This example demonstrates the effectiveness of the above method by simulating arm-motion control experiments of the central nervous system (CNS) under external force-field disturbance. The subject moves the manipulator tip forward in the horizontal plane to the target position by arm movements, as shown in FIG. 1. Two torque motors mounted in the base of the manipulator can generate a required force field, applying a corresponding disturbance force to the arm through the mechanical linkage and the handle at the tip. The data-driven adaptive optimal control (AOC) method shown in FIG. 2 is used here to simulate the learning mechanism of the CNS.
1) Simulation setup
The dynamic behavior of the system can be described by the point-mass arm model [the equations are given as images in the original; the model comprises the kinematics dp = υdt, the force balance m dυ = (a − bυ + F)dt, and the first-order actuator dynamics τ da = (u − a)dt + dη], wherein the two-dimensional vectors p = [p_x, p_y]ᵀ and υ = [υ_x, υ_y]ᵀ denote the position and velocity of the tip, respectively; a = [a_x, a_y]ᵀ is the actuator state, i.e. the force applied to the tip by the subject; u = [u_x, u_y]ᵀ is the control signal of the CNS; m is the mass of the hand; b is the viscosity constant; τ is a time constant; dη is the control-dependent noise, given by the expression shown [image in the original], where η1 and η2 are two Wiener processes and c1, c2 are positive numbers measuring the noise amplitude; F is the external force generated by a velocity-dependent force field (VF), whose value is set as shown [image in the original], where χ is a constant scaling factor positively correlated with the subject's strength.

The values of the physical parameters of the system are listed in Table 1. The state vector is taken as [pᵀ, υᵀ, aᵀ]ᵀ. Since the output is noisy, the state cannot be measured directly and is instead obtained by the observer; here C = I_6 is set. Note that N1 = 0 and N2 = 2. The covariance matrix of the independent noise is taken as W = 0.001I_6, and V = 0.015 diag(1, 1, 1, 1, 10, 10). The initial observation gain and control policy are set to L0 = 10I_6 and K0 = [100I_2, 10I_2, 10I_2], respectively.
TABLE 1. Partial physical parameters of the arm-motion model [the parameter values are given as an image in the original].
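As a structural sanity check of the arm model described above, its deterministic part (noise, force field and observer omitted) can be integrated with a forward-Euler scheme; under a simple stabilizing feedback, assumed here purely for illustration, the tip settles at the commanded target. All numeric values are stand-ins, not the Table 1 parameters:

```python
import numpy as np

# Illustrative parameters (stand-ins, not the Table 1 values).
m, b, tau = 1.0, 10.0, 0.05
dt, T = 0.001, 3.0
target = np.array([0.0, 0.25])        # forward reach in the horizontal plane

p = np.zeros(2)   # tip position
v = np.zeros(2)   # tip velocity
a = np.zeros(2)   # actuator state (force applied to the tip)

for _ in range(int(T / dt)):
    u = 30.0 * (target - p) - 5.0 * v    # simple stabilizing feedback (illustrative)
    p = p + v * dt                        # dp = v dt
    v = v + (a - b * v) / m * dt          # m dv = (a - b v) dt  (F = 0, no field)
    a = a + (u - a) / tau * dt            # tau da = (u - a) dt  (noise omitted)

print(np.linalg.norm(p - target))         # small: the tip reaches the target
```

The closed-loop poles of this deterministic model are all in the left half-plane for the chosen gains, so the position error decays to near zero well within the 3 s horizon.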
2) Weight matrix selection
The weight matrix R = 0.01I_2. Q is task-dependent, i.e. the CNS can select different Q matrices for different tasks. For example, when a force-field disturbance is found in the x-axis direction, the CNS can increase the weight of that direction to enhance the stiffness of the system (i.e. the magnitude of the restoring force per unit of trajectory deflection). An appropriate Q can therefore be selected by data fitting over the existing tests (each test is called a trial). Note that Q contains 21 independent elements. To reduce redundancy, its form is set to Q = diag(Q0, 0.01Q0, 0.0005Q0), where Q0 is a task-dependent symmetric 2×2 matrix. Let g*(t) be the ideal motion trajectory, i.e. a straight path, and g(t) the actual trajectory of the last trial. Taking g*(t) as the origin, the polar-coordinate pair (d(t), θ(t)) of g(t) is obtained, as shown in FIG. 3.

At the time t_m when d(t) reaches its maximum, d_m = d(t_m) and θ_m = θ(t_m); the CNS then determines the value of Q0 by the model shown [given as an image in the original], where ω0, ω1, ω2 are empirical constants, taken in this example as ω0 = 5×10^5, ω1 = 5×10^4, ω2 = 10^5. For a fixed external force field, d_m and θ_m are constant; the CNS modulates d_m and θ_m only when changes in the external force field are observed, adapting Q to the new task requirements.
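The reduced parameterization of Q can be assembled directly from Q0 as a block diagonal: with Q0 symmetric 2×2, the full 6×6 Q carries 3 free elements instead of 21. A sketch, with placeholder Q0 values rather than values fitted from trials:

```python
import numpy as np
from scipy.linalg import block_diag

# Illustrative Q0 (a task-dependent symmetric 2x2 matrix; placeholder values).
Q0 = np.array([[8.0e5, 1.0e5],
               [1.0e5, 5.0e5]])
assert np.allclose(Q0, Q0.T), "Q0 must be symmetric"

# Q = diag(Q0, 0.01*Q0, 0.0005*Q0): position, velocity and actuator weights.
Q = block_diag(Q0, 0.01 * Q0, 0.0005 * Q0)

print(Q.shape)                 # (6, 6)
print(np.allclose(Q, Q.T))     # True: Q inherits symmetry from Q0
```

The fixed scale factors 0.01 and 0.0005 keep the velocity and actuator weights tied to the position weight, so only Q0 needs to be adapted between tasks.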
3) Simulation result
The trajectory of the tip movement in the simulation is shown in fig. 4.
Five trials were performed in a zero force field (null field) using an initial admissible control strategy (top-left panel); then VF was suddenly applied to the subject and trials were performed (top-right panel); after the first trial in VF, the weight-matrix parameters
(given as an image in the original)
were obtained. After the CNS had learned via the data-driven AOC algorithm, further trials were performed (bottom-left panel). When VF is suddenly removed, i.e., the zero-force-field condition is restored, the after-effect end-point trajectory deviates clearly to the right (bottom-right panel), showing that the previous learning does produce a stable compensation effect that remains valid until a new learning task is performed.
The end-point velocity and force states in the above four stages are shown in FIGS. 5-8, respectively. It can be seen that the initial velocity profile in the y-direction (i.e., the target direction) is approximately a bell curve. After VF is applied, the force components in the positive x-direction and negative y-direction increase significantly, indicating that the subject is disturbed by a large external force (whose direction is related to the initial motion direction) and must increase the applied force to compensate. After learning, the subject generates a stable compensation force that effectively counteracts the influence of the external force field.
While preferred embodiments of the present invention have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. The above description is only of preferred embodiments and is not intended to limit the invention; any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention should be included within its scope of protection.

Claims (5)

1. A data-driven self-adaptive optimization control method for a random disturbance system, characterized by comprising a problem-description part, a design part of a data-driven optimal state observer, and an off-policy data-driven ADP control part for the random disturbance system;
for the problem-description part:
a random disturbance system is given, and the output equation associated with it is obtained; the optimal linear control is solved for, minimizing a cost function through the stochastic optimal control strategy to be designed;
for the design part of the data-driven optimal state observer:
for the completely unmeasurable system state, a data-driven optimal state observer is designed; a state-design system is obtained from the random disturbance system, the output equation and the observer;
the optimal control strategy of the state-design system is obtained by observation;
using the idea of data-driven ADP, a data-driven algorithm is designed on the state-design system to solve for the optimal observation gain;
for the off-policy data-driven ADP control part of the random disturbance system:
online state information of the state-design system is obtained using the optimal observer, a data-driven ADP algorithm is further designed, and the stochastic optimal control is finally obtained;
the observation gain in the data acquisition and learning stage is fixed to L0Wherein
Figure FDA0003600446480000012
Is a Hurwitz matrix; when the independent noise is zero, in order to ensure the continuous excitation condition, the exploration noise zeta dt is added to the right side of the system (9); when independent noise exists, the continuous excitation condition is automatically met, and then exploration noise does not need to be added, namely zeta is equal to 0;
definition of Lk:=Lk-L0(k is 0,1,2 …) and
Figure FDA0003600446480000011
let d sigma1Can be measured if
Figure FDA0003600446480000021
Integrating the two sides of the system (9) along the trajectory to obtain
Figure FDA0003600446480000022
define
(the data matrices given as images in the original)
where
(the sampling instants shown as an image in the original)
are predefined sampling instants satisfying
(the condition given as an image in the original);
using the above expressions, (19) can be transformed into the more compact form
(equation given as an image in the original)
where Ψek and Ωek are respectively defined as
(formulas given as images in the original);
for a given L0 such that A − L0C is a Hurwitz matrix, if the rank condition
(given as an image in the original)
holds, then Sk and Lk+1 exist, and the computed sequences
{Sk} and {Lk}
converge to S* and L*, respectively;
as long as the number of samples r1 (the subscript of the preset sampling instants in formula (24), representing the amount of collected data) is chosen large enough, the rank condition can be satisfied; at that point
(the equation given as an image in the original)
can be solved iteratively for Sk and Lk+1; a threshold κ1 is set as the loop-termination condition, and when ||Sk − Sk−1|| ≤ κ1 holds the loop is terminated, at which time Lk is the optimal observation gain obtained;
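The iteration just described (solve for Sk, update Lk+1, stop when ||Sk − Sk−1|| ≤ κ1) has a model-based analogue built on the Lyapunov recursion of equations (15)-(16). The sketch below shows that analogue for the additive-noise case, with hypothetical system matrices; the patent's algorithm itself is data-driven and needs no model:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Hypothetical system (additive noise only): dx = Ax dt + dw, y = Cx + v
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
C = np.array([[1.0, 0.0]])
W = 0.01 * np.eye(2)      # process-noise covariance
V = np.array([[0.1]])     # measurement-noise covariance

def observer_gain_iteration(L0, kappa1=1e-9, max_iter=100):
    """Policy iteration for the observation gain:
    solve (A - L_k C) S_k + S_k (A - L_k C)^T + W + L_k V L_k^T = 0,
    then L_{k+1} = S_k C^T V^{-1}, until ||S_k - S_{k-1}|| <= kappa1."""
    L, S_prev = L0, None
    for _ in range(max_iter):
        Ac = A - L @ C
        # solve_continuous_lyapunov(a, q) solves a X + X a^T = q
        S = solve_continuous_lyapunov(Ac, -(W + L @ V @ L.T))
        L = S @ C.T @ np.linalg.inv(V)
        if S_prev is not None and np.linalg.norm(S - S_prev) <= kappa1:
            break
        S_prev = S
    return S, L

# L0 = 0 is admissible here because A itself is Hurwitz
S_opt, L_opt = observer_gain_iteration(np.zeros((2, 1)))
```

The fixed point of this recursion is the solution of the filter algebraic Riccati equation, so the resulting gain matches L* = S*CᵀV⁻¹ from LQG theory.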
for the off-policy data-driven ADP control part of the random disturbance system:
when the independent noise is not zero, the persistent-excitation condition is satisfied automatically, and the control law of the learning stage is set as
u0 = −K0x (30)
where K0 is an admissible control strategy of the system; when the independent noise is zero, exploration noise θ is added to the right-hand side of (30);
by Itô's lemma, one obtains
(formula given as an image in the original)
define
(the quantities given as images in the original)
with dσ2 measurable; one has
(formula given as an image in the original)
integrating both sides of equation (33) along the trajectory of system (1) further gives
(formula given as an image in the original)
define
(the data matrices given as images in the original)
where
(the sampling instants shown as an image in the original)
are predefined sampling instants satisfying
(the condition given as an image in the original);
using the above expressions, (34) can be transformed into the compact form
(equation given as an image in the original)
where Ψxk and Ωxk are respectively defined as
(formulas given as images in the original);
which further gives
(formula given as an image in the original)
given an initial admissible control strategy K0, if the rank condition
(given as an image in the original)
holds, then by choosing r2 large enough, (Pk, Kk+1) is obtained as the unique solution, and
(given as an image in the original);
a threshold κ2 is set as the loop-termination condition; when ||Pk − Pk−1|| ≤ κ2 holds the loop is terminated, at which time u = −Kkx is the resulting stochastic optimal control input.
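The stopping rule ||Pk − Pk−1|| ≤ κ2 mirrors Kleinman-type policy iteration, which the data-driven scheme reproduces from trajectory data instead of from the model. A minimal model-based sketch, under the assumption of a known, hypothetical double-integrator model (not the arm model of the description):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # hypothetical double integrator
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

def kleinman(K0, kappa2=1e-10, max_iter=100):
    """Policy iteration: P_k solves (A-BK_k)^T P + P (A-BK_k) + Q + K_k^T R K_k = 0,
    then K_{k+1} = R^{-1} B^T P_k; stop when ||P_k - P_{k-1}|| <= kappa2."""
    K, P_prev = K0, None
    for _ in range(max_iter):
        Ak = A - B @ K
        # solve_continuous_lyapunov(a, q) solves a X + X a^T = q
        P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
        K = np.linalg.inv(R) @ B.T @ P
        if P_prev is not None and np.linalg.norm(P - P_prev) <= kappa2:
            break
        P_prev = P
    return P, K

# K0 = [1, 1] is admissible: eig(A - B K0) = roots of s^2 + s + 1, all in the left half-plane
P_opt, K_opt = kleinman(np.array([[1.0, 1.0]]))
```

The iterates converge to the stabilizing solution of the continuous-time algebraic Riccati equation, so the final u = −K_opt x is the optimal feedback for this deterministic special case.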
2. The data-driven adaptive optimization control method for the random disturbance system according to claim 1, wherein for the problem description part:
given a random disturbance system described by the stochastic differential equation
(equation (1), given as an image in the original)
And the output equation associated therewith
y=Cx+υ (2)
where
x, u and y (dimensions given as images in the original)
respectively denote the system state, control input and output;
A and B (dimensions given as images in the original)
are unknown constant matrices;
w and υ (given as images in the original)
are uncorrelated zero-mean Wiener processes whose covariance matrices are respectively denoted
W and V (given as images in the original);
N1 and N2 are non-negative integers; ξi and ηj (given as an image in the original) are zero-mean Wiener processes satisfying
(the conditions given as images in the original)
where ρij, σij > 0 are known constants;
given the above system, the goal is to solve for the optimal linear control u* = −K*x, where
K* (given as an image in the original)
is the stochastic optimal control strategy to be designed, such that the cost function
(given as an image in the original),
namely
(given as an image in the original),
is minimized, where
Q = Q^T ≥ 0 and R = R^T > 0 (given as an image in the original);
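For intuition, in the deterministic special case (no noise terms) the cost of a fixed stabilizing feedback u = −Kx can be evaluated through a Lyapunov equation rather than by simulating trajectories. A brief sketch with hypothetical matrices:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Deterministic special case: dx/dt = Ax + Bu with u = -Kx.
# For stabilizing K, the cost J = ∫ (x^T Q x + u^T R u) dt equals x0^T P x0,
# where P solves (A - BK)^T P + P (A - BK) + Q + K^T R K = 0.
A = np.array([[0.0, 1.0], [0.0, 0.0]])   # hypothetical matrices
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[1.0, 1.0]])               # stabilizing feedback gain

Ac = A - B @ K
# solve_continuous_lyapunov(a, q) solves a X + X a^T = q
P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))

x0 = np.array([1.0, 0.0])
J = x0 @ P @ x0                          # closed-loop cost from this initial state
```

Minimizing J over K is exactly the LQR problem whose solution the claimed method recovers from data.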
3. The data-driven adaptive optimization control method of the random disturbance system according to claim 1, wherein for the design part of the data-driven optimal state observer:
for the completely unmeasurable system state, an observer is designed as
(equation (8), given as an image in the original)
where
x̂ (given as an image in the original)
denotes the observed value of the state, and
L (given as an image in the original)
is the observation gain to be solved for; the observation error is written as
e = x − x̂ (given as an image in the original);
from (1), (2) and (8) one obtains:
(equation (9), given as an image in the original)
the evolution of the error e is independent of the state x; therefore the optimal observer of the unknown state is designed first, and the optimal control strategy of the system is then designed using the observed state.
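The separation just noted (error dynamics independent of the state) can be illustrated in the deterministic limit, where the observation error obeys ė = (A − LC)e and decays whenever A − LC is Hurwitz. A brief sketch with hypothetical matrices:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -3.0]])  # hypothetical system matrix
C = np.array([[1.0, 0.0]])
L = np.array([[1.0], [1.0]])              # observation gain; A - LC is Hurwitz here

# Forward-Euler simulation of the error dynamics e' = (A - LC) e
Ae = A - L @ C
e = np.array([1.0, -1.0])                 # initial observation error
dt = 0.001
for _ in range(20000):                    # simulate 20 seconds
    e = e + dt * (Ae @ e)
```

Because the error converges regardless of x, the observer gain and the feedback gain can be designed in two separate stages, as the claim describes.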
4. The data-driven adaptive optimization control method of the random disturbance system according to claim 3, wherein:
definition of
Figure FDA0003600446480000072
Figure FDA0003600446480000073
According to the LQG control theory, the optimal state observation gain L*Can be expressed as
L*=S*CTV-1 (12)
Wherein S*Is composed of
Figure FDA0003600446480000074
A unique symmetric positive solution of;
for the optimal observer to exist, the following assumption is made:
(the system given as an image in the original)
is mean-square stabilizable and strictly observable, where
(given as an image in the original)
is its state;
given an initial observation gain
L0 (given as an image in the original)
such that
A − L0C
is a Hurwitz matrix, if
Sk (given as an image in the original)
is the solution of the Lyapunov equation
(equation (15), given as an image in the original)
where Lk (k = 1, 2, 3, …) is given by
Lk = Sk−1CᵀV⁻¹ (16)
then
A − LkC
is a Hurwitz matrix, and the sequences
{Sk} and {Lk}
converge to S* and L*, respectively;
using the idea of data-driven ADP, a data-driven algorithm is designed on system (9) to solve for the optimal observation gain L*.
5. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, performs the method according to any one of claims 1-4.
CN201911154069.9A 2019-11-22 2019-11-22 Data-driven self-adaptive optimization control method and medium for random disturbance system Active CN110879531B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911154069.9A CN110879531B (en) 2019-11-22 2019-11-22 Data-driven self-adaptive optimization control method and medium for random disturbance system


Publications (2)

Publication Number Publication Date
CN110879531A CN110879531A (en) 2020-03-13
CN110879531B true CN110879531B (en) 2022-06-24

Family

ID=69730443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911154069.9A Active CN110879531B (en) 2019-11-22 2019-11-22 Data-driven self-adaptive optimization control method and medium for random disturbance system

Country Status (1)

Country Link
CN (1) CN110879531B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111665719B (en) * 2020-06-11 2022-11-22 大连海事大学 Supply ship synchronous control algorithm with timeliness and stability

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104019520A (en) * 2014-05-20 2014-09-03 天津大学 Data drive control method for minimum energy consumption of refrigerating system on basis of SPSA
CN107273445A (en) * 2017-05-26 2017-10-20 电子科技大学 The apparatus and method that missing data mixes multiple interpolation in a kind of big data analysis
CN107807069A (en) * 2017-10-25 2018-03-16 中国石油大学(华东) The adaptive tracking control method and its system of a kind of offshore spilled oil

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0719969D0 (en) * 2007-10-12 2007-11-21 Cambridge Entpr Ltd Substance monitoring and control in human or animal bodies


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Gang et al., "Research on fault diagnosis of gyroscopes for aircraft inertial navigation", Computer Simulation, 2019, vol. 36, no. 3, pp. 32-38, 44. *

Also Published As

Publication number Publication date
CN110879531A (en) 2020-03-13

Similar Documents

Publication Publication Date Title
Doerr et al. Model-based policy search for automatic tuning of multivariate PID controllers
Qi et al. Stable indirect adaptive control based on discrete-time T–S fuzzy model
CN109375512B (en) Prediction control method for ensuring closed loop stability of inverted pendulum system based on RBF-ARX model
Kersten et al. State-space transformations of uncertain systems with purely real and conjugate-complex eigenvalues into a cooperative form
Yang et al. Adaptive backstepping terminal sliding mode control method based on recurrent neural networks for autonomous underwater vehicle
Guan et al. Ship steering control based on quantum neural network
CN110879531B (en) Data-driven self-adaptive optimization control method and medium for random disturbance system
CN111273677B (en) Autonomous underwater robot speed and heading control method based on reinforcement learning technology
Rahman et al. Neural ordinary differential equations for nonlinear system identification
CN112571420A (en) Dual-function model prediction control method under unknown parameters
Kim et al. TOAST: Trajectory Optimization and Simultaneous Tracking Using Shared Neural Network Dynamics
Abadi et al. Chattering-free adaptive finite-time sliding mode control for trajectory tracking of MEMS gyroscope
Takahashi Remarks on a recurrent quaternion neural network with application to servo control systems
Yang et al. Robust control of a class of under-actuated mechanical systems with model uncertainty
Wu et al. Date-Driven Tracking Control via Fuzzy-State Observer for AUV under Uncertain Disturbance and Time-Delay
Zhu et al. Online parameter estimation for uncertain robot manipulators with fixed-time convergence
Loria Uniform global position feedback tracking control of mechanical systems without friction
Fan et al. Differential Dynamic Programming for time-delayed systems
Zheng et al. Identification for nonlinear singularly perturbed system using recurrent high-order multi-time scales neural network
Azarfar et al. Adaptive control for nonlinear singular systems
Xia et al. Three-Dimensional Trajectory Tracking for a Heterogeneous XAUV via Finite-Time Robust Nonlinear Control and Optimal Rudder Allocation
Żak Neural Controlling of Remotely Operated Underwater Vehicle
Balakhnov et al. Robust explicit model predictive control for hybrid linear systems with parameter uncertainties
Rawat et al. Trajectory Control of Robotic Manipulator using Metaheuristic Algorithms
Takahashi et al. Remarks on an Echo State Network–Based Optimal Predictive Control Using a Metaheuristics Optimisation Approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant