CN113325804B - Q learning extended state observer design method of motion control system - Google Patents


Info

Publication number
CN113325804B
Authority
CN
China
Prior art keywords
learning
state
time
action
extended state
Prior art date
Legal status
Active
Application number
CN202110637860.6A
Other languages
Chinese (zh)
Other versions
CN113325804A (en
Inventor
薛文超 (Xue Wenchao)
汤国杰 (Tang Guojie)
方海涛 (Fang Haitao)
Current Assignee
Academy of Mathematics and Systems Science of CAS
Original Assignee
Academy of Mathematics and Systems Science of CAS
Priority date
Filing date
Publication date
Application filed by Academy of Mathematics and Systems Science of CAS
Priority to CN202110637860.6A
Publication of CN113325804A
Application granted
Publication of CN113325804B

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00: Programme-control systems
    • G05B19/02: Programme-control systems electric
    • G05B19/18: Numerical control [NC], i.e. automatically operating machines, in particular machine tools, e.g. in a manufacturing environment, so as to execute positioning, movement or co-ordinated operations by means of programme data in numerical form
    • G05B19/414: Structure of the control system, e.g. common controller or multiprocessor systems, interface to servo, programmable interface controller
    • G05B19/4142: Structure of the control system characterised by the use of a microprocessor
    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00: Program-control systems
    • G05B2219/30: Nc systems
    • G05B2219/34: Director, elements to supervisory
    • G05B2219/34013: Servocontroller

Abstract

The invention provides a Q learning extended state observer design method for a motion control system, comprising the following steps: 1. design an extended state observer according to a discrete form of the mathematical model of the motion control system; 2. design a Q learning algorithm; 3. adjust the parameters of the extended state observer through Q learning. The system may be the hybrid system common in practice, in which the plant is continuous and the output is sampled, and a corresponding discrete ESO structure and parameter design method is provided directly. No noise or disturbance model is needed to optimize the parameters of the extended state observer; the parameters are adjusted in real time in a data-driven manner, so that, compared with the traditional constant-gain ESO, internal uncertain dynamics and external disturbances are tracked more accurately. The adjustment ranges of the four parameters of the Q learning part are given quantitatively and explicitly; the parameter-adjustment theory guarantees the stability of the extended state observer and reduces the cost of tuning parameters in actual engineering.

Description

Q learning extended state observer design method of motion control system
Technical Field
The invention belongs to the technical field of extended-observer design methods for motion control systems, and particularly relates to an extended state observer (ESO) and a Q learning parameter-adjustment technique for a motion control system.
Background
In the past decades, estimating the state of a motion control system by designing an observer and implementing state feedback in the control law has proved to be an effective control method. However, most observers still have limitations: they rely on system models, can only estimate states and a single external disturbance, and cannot handle complex uncertainties comprising internal unknown dynamics and external disturbances, as described in reference [1]. In view of these problems, the Chinese scholar Han Jingqing proposed the extended state observer, which, without depending on a model, estimates the internal uncertain dynamics and the external disturbance together as an extended state, the "total disturbance", as described in reference [2]. For the parameter-adjustment problem of the ESO, reference [3] proposed the widely used "bandwidth method" for the linear extended state observer; references [4] and [5] respectively analyzed the linear and the nonlinear ESO with constant parameters and gave parameter-adjustment methods that guarantee the stability of the ESO; and reference [6] gave a Kalman-gain optimization adjustment method for the ESO using the statistical characteristics of the noise and the range of the uncertainty variation.
At present, ESO parameters are mainly adjusted with constant gains or with optimization methods that rely on model information. The characteristics of an actual motion control system and of its external environment change over time, so the noise characteristics and the uncertain dynamics also change, and the assumption that model information is known is difficult to satisfy. Therefore, the design of the extended state observer of a motion control system needs to be independent of model information, realize a gain-adjustment method that learns and optimizes online from data, and achieve fast and accurate estimation of the state and the uncertainty.
Disclosure of Invention
The technical problem solved by the invention is as follows: for a motion control system, an extended state observer whose parameters are adjusted by a Q learning algorithm is designed to achieve effective real-time estimation of the system state and the total disturbance; real-time measurement data of the system drive the real-time optimization of the gain of the extended state observer, thereby enhancing its estimation capability and steady-state performance.
Consider the following mathematical model of a motion control system:
dx1(t)/dt = x2(t)
dx2(t)/dt = b·u(t) + d(x(t), t)
dd(x(t), t)/dt = c(t)
y_i(kh) = x_i(kh) + v_i(kh),  i = 1, 2,  k = 1, 2, ...   (1.1)
where t denotes time; x1(t) ∈ R denotes the position of the moving object at time t; x2(t) ∈ R denotes the velocity of the moving object at time t; b denotes the input gain; u(t) ∈ R denotes the input of the system at time t; d(x(t), t) ∈ R denotes the total disturbance, composed of the uncertain dynamics inside the system and the external disturbance, at time t; and c(t) ∈ R denotes the derivative of the total disturbance at time t. x1(kh) and x2(kh) denote the position and velocity of the moving object at time t = kh, where h is the sampling period and k indexes the k-th sample; y_i(kh) denotes the measured value of x_i(kh), and v_i(kh) is the measurement noise of the corresponding channel (i = 1, 2).
Considering that the state of the motion control system can be measured, the design goal is to enable the extended state observer to adjust its parameters according to real-time data, so that the total disturbance d(x(t), t) is tracked quickly and accurately while the sensitivity to noise is reduced.
The technical solution of the invention comprises the following three steps:
step (I): the extended state observer is designed according to the discrete form of (1.1):
since the sampling and control inputs are discrete in a real system, a discrete approximation form of (1.1) needs to be considered:
x1((k+1)h) = x1(kh) + h·x2(kh)
x2((k+1)h) = x2(kh) + h·(b·u(kh) + d(x(kh), kh))
d(x((k+1)h), (k+1)h) = d(x(kh), kh) + h·c(kh)
y_i((k+1)h) = x_i((k+1)h) + v_i((k+1)h),  i = 1, 2   (1.2)
x1((k+1)h) and x2((k+1)h) denote the position and velocity of the moving object at time t = (k+1)h, where h is the sampling period and k+1 indexes the (k+1)-th sample; y_i((k+1)h) represents the measured value of x_i((k+1)h), and v_i((k+1)h) is the measurement noise of the corresponding channel (i = 1, 2). b represents the input gain, u(kh) ∈ R represents the input of the system at time t = kh, d(x(kh), kh) ∈ R represents the total disturbance composed of the uncertain dynamics inside the system and the external disturbance at time t = kh, and c(kh) ∈ R represents the derivative of the total disturbance at time t = kh.
According to (1.2), the linear extended state observer is designed as follows:
x̂2((k+1)h) = x̂2(kh) + h·(b·u(kh) + d̂(kh) + β1(kh)·(y2(kh) - x̂2(kh)))
d̂((k+1)h) = d̂(kh) + h·(ĉ(kh) + β2(kh)·(y2(kh) - x̂2(kh)))
ĉ((k+1)h) = ĉ(kh) + h·β3(kh)·(y2(kh) - x̂2(kh))   (1.3)
where β1(kh), β2(kh), β3(kh) are the observer gains, designed as
β1(kh) = 3ω(kh),  β2(kh) = 3ω²(kh),  β3(kh) = ω³(kh),  ω(kh) > 0.   (1.4)
x̂2(kh), d̂(kh) and ĉ(kh) respectively represent the estimated values of x2(kh), d(kh) and c(kh), and ω(kh) is referred to as the "observer bandwidth" at time t = kh, which is the parameter to be adjusted.
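As a concrete illustration of step (I), the following minimal Python sketch performs one update of the discrete linear extended state observer, assuming the forward-Euler form of (1.3) reconstructed above with the velocity measurement y2 driving the innovation; the function and variable names are illustrative, not part of the patent.

```python
import numpy as np

def eso_step(z, y2, u, b, omega, h):
    """One step of the discrete linear ESO of step (I).

    z     : current estimates [x2_hat, d_hat, c_hat]
    y2    : velocity measurement y2(kh)
    u     : control input u(kh)
    b     : input gain
    omega : observer bandwidth omega(kh)
    h     : sampling period
    """
    # Bandwidth parameterization of the observer gains, eq. (1.4)
    beta1, beta2, beta3 = 3.0 * omega, 3.0 * omega ** 2, omega ** 3
    x2_hat, d_hat, c_hat = z
    e = y2 - x2_hat  # innovation: measured velocity minus estimated velocity
    x2_next = x2_hat + h * (b * u + d_hat + beta1 * e)
    d_next = d_hat + h * (c_hat + beta2 * e)
    c_next = c_hat + h * (beta3 * e)
    return np.array([x2_next, d_next, c_next])
```

A larger ω makes the estimates converge faster but amplifies the measurement noise, which is exactly the trade-off the Q learning part is designed to manage.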
Step (II): designing a Q learning algorithm:
The Q learning algorithm comprises four main components: state, action, reward, and state-action value function. Based on the current state of the system, the corresponding action is selected according to the value of the state-action value function, the corresponding reward is obtained, and the value of the state-action value function is updated according to the reward.
A state space S and an action space Λ are designed according to the actual situation, where Λ is a bounded set of real numbers whose maximum and minimum values are denoted a_max and a_min, respectively. For all states s ∈ S and actions a ∈ Λ, the corresponding state-action value function Q(s, a) is initialized to 0, and the discount factor γ ∈ (0, 1) and a learning-rate sequence {α_n} satisfying the following conditions are selected:
α_n ∈ (0, 1],  Σ_n α_n = ∞,  Σ_n α_n² < ∞.   (1.5)
For the state-action value function Q, the following update criterion is employed:
Q_{n+1}(s_j, a_j) = Q_n(s_j, a_j) + α_n·[ r_j + γ·max_{a′ ∈ Λ} Q_n(s′, a′) - Q_n(s_j, a_j) ],   (1.6)
where the subscript n denotes that action a_j is selected in state s_j for the n-th time, s′ represents the next state reached after selecting action a, and the subscript j indicates the j-th Q learning. While the system is running, Q learning is performed once every q samples, and the bandwidth from t = jqh to t = (j+1)qh is a constant, denoted
ω_j = ω(t),  t ∈ [jqh, (j+1)qh),  j = 1, 2, ...   (1.7)
At the j-th Q learning, the state of the system is calculated by the following formulas, an action is selected, and the reward is calculated:
State: the state at the j-th Q learning is defined as s_j = [s_{j,1}, s_{j,2}], where s_{j,1} and s_{j,2} are defined as:
[Formula image (1.8) not reproduced: definitions of s_{j,1} and s_{j,2}]
Action: the action a_j at the j-th Q learning is selected according to the following rule:
[Formula image (1.9) not reproduced: action-selection rule]
Reward: the reward obtained at the j-th Q learning is calculated as:
[Formula image (1.10) not reproduced: reward calculation]
where λ ∈ (0, 1) is the reward parameter; the normalized values of s_{j+1,2} and r_{j,2} enter the reward, with s_{j+1,2} as defined in (1.8) and r_{j,2} representing the variance of the total-disturbance estimate d̂.
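To make the update criterion (1.6) and the learning-rate condition (1.5) concrete, the sketch below maintains a tabular Q function over a discretized state space and the bounded action set Λ. The discretization, the visit-count learning rate α_n = 1/n, and the greedy selection are illustrative assumptions rather than the exact rules (1.8)-(1.9) of the patent, which are given only as formula images.

```python
from collections import defaultdict

class QTable:
    def __init__(self, actions, gamma=0.9):
        self.actions = list(actions)      # bounded action set Lambda
        self.gamma = gamma                # discount factor in (0, 1)
        self.q = defaultdict(float)       # Q(s, a), initialized to 0
        self.visits = defaultdict(int)    # n: times action a was taken in state s

    def best_action(self, s):
        # Greedy action with respect to the current Q values
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s_next):
        # Learning rate alpha_n = 1/n satisfies (1.5):
        # the sum of alpha_n diverges while the sum of alpha_n^2 converges.
        self.visits[(s, a)] += 1
        alpha = 1.0 / self.visits[(s, a)]
        target = r + self.gamma * max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] += alpha * (target - self.q[(s, a)])
```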
Step (III): adjusting the parameters of the extended state observer through Q learning
When the system runs to time t = jqh, j = 1, 2, ..., the state s_j is calculated according to (1.8) and the action a_j is selected according to (1.9); the observer bandwidth is then adjusted by the following rule:
[Formula image (1.12) not reproduced: bandwidth-adjustment rule]
where ω_max and ω_min are the bandwidth upper and lower limits, set in advance.
After the bandwidth is adjusted, the reward r_j and the next state s_{j+1} are calculated according to (1.10), and the Q value function is updated according to formula (1.6).
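Since the adjustment rule (1.12) is available only as a formula image, the following sketch assumes the natural form: the selected action is added to the current bandwidth and the result is saturated to the preset limits ω_min and ω_max. This is an assumption consistent with the bounded action set and the preset bandwidth limits, not a verbatim reproduction of (1.12).

```python
def adjust_bandwidth(omega_j, a_j, omega_min, omega_max):
    """Assumed form of rule (1.12): apply the action additively, then clip
    the bandwidth to the preset limits [omega_min, omega_max]."""
    omega_next = omega_j + a_j
    return min(max(omega_next, omega_min), omega_max)
```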
In order to ensure the stability of the extended state observer, ω_max, ω_min, a_max and a_min must satisfy the following conditions:
[Formula image (1.13) not reproduced: stability conditions on ω_max, ω_min, a_max and a_min]
where ε, M and m are intermediate variables, calculated as follows:
[Formula image not reproduced: definitions of the intermediate variables ε, M and m]
compared with the prior art, the invention has the advantages that:
1. The system may be the hybrid system common in practice, in which the plant is continuous and the output is sampled, and a corresponding discrete ESO structure and parameter design method is provided directly;
2. No noise or disturbance model is needed to optimize the parameters of the extended state observer; the parameters are adjusted in real time in a data-driven manner, so that, compared with the traditional constant-gain ESO, internal uncertain dynamics and external disturbances are tracked more accurately;
3. The adjustment ranges of the four parameters of the Q learning part are given quantitatively and explicitly. The parameter-adjustment theory guarantees the stability of the extended state observer and reduces the cost of tuning parameters in actual engineering.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is the disturbance-estimation error curve of extended state observers with different bandwidths (first class of total disturbance).
FIG. 3 is the disturbance-estimation error curve of extended state observers with different bandwidths (second class of total disturbance).
FIG. 4 is the disturbance-estimation error curve of extended state observers with different bandwidths (third class of total disturbance).
Description of the symbols
t: running time of the motion control system, t ∈ [0, ∞);
h: sampling period of the motion control system, h ∈ R;
x1(t): position of the moving object at time t, x1(t) ∈ R;
x2(t): velocity of the moving object at time t, x2(t) ∈ R;
u(t): control input of the motion control system at time t, u(t) ∈ R;
y_i(kh): output of the motion control system at time t = kh, y_i(kh) ∈ R, i = 1, 2, k = 1, 2, ...;
v_i(kh): measurement noise of the motion control system at time t = kh, v_i(kh) ∈ R, i = 1, 2, k = 1, 2, ...;
d(x(t), t): sum of the uncertain dynamics inside the motion control system and the external disturbance at time t;
c(t): derivative of d(x(t), t);
x̂2: estimate of x2;
d̂: estimate of d;
ĉ: estimate of c;
ω: bandwidth of the extended state observer, the parameter to be adjusted;
ω_max, ω_min: upper and lower limits of the bandwidth;
s_j: state at the j-th Q learning;
a_j: action selected at the j-th Q learning;
r_j: reward obtained after the action is selected at the j-th Q learning;
a_max, a_min: upper and lower limits of the action a_j.
Detailed Description
To test the applicability of the Q-learning extended state observer for motion control systems, we performed simulation experiments. Consider the following motion control system:
[Formula image not reproduced: the motion control system used in the simulation, an instance of model (1.1)]
and three classes of "total disturbance":
[Formula image not reproduced: the three classes of total disturbance d(x(t), t) used in the simulation]
The first class is a constant disturbance, independent of both time and the system state; the second class is a piecewise-linear disturbance that depends on time only; the third class is a composite of nonlinear dynamics and a periodic external disturbance.
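For reference, the three classes of disturbance can be emulated as in the sketch below. Only the constant value 0.15 of the first class is stated later in the text; the breakpoints of the piecewise-linear class and the specific nonlinearity and frequency of the composite class are illustrative stand-ins, since the disturbance formulas are available only as an image.

```python
import numpy as np

def disturbance_constant(x, t):
    # Class 1: constant, independent of time and state (0.15 per the text below)
    return 0.15

def disturbance_piecewise_linear(x, t):
    # Class 2: piecewise linear in time only (breakpoints are illustrative)
    return 0.5 * t if t < 5.0 else 2.5 - 0.2 * (t - 5.0)

def disturbance_composite(x, t):
    # Class 3: nonlinear state dynamics plus a periodic external disturbance
    # (the particular nonlinearity and frequency are illustrative)
    return -0.1 * np.sin(x[0]) * x[1] + 0.2 * np.sin(2.0 * np.pi * t)
```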
Designing an extended state observer according to the step (I):
[Formula image not reproduced: extended state observer of the form (1.3) used in the simulation]
where
β1 = 3ω,  β2 = 3ω²,  β3 = ω³,  ω > 0.
[Formula image not reproduced]
According to formula (1.13) in step (III), the parameters of the Q learning extended state observer are designed as
ω_min = 0.5,  a_min = -0.5,  h = 0.001,  q = 100,   (1.19)
with the upper limits ω_max and a_max given in formula images that are not reproduced here.
The action set in the simulation is selected as {-0.5, 0, 0.5}, and the other parameters are set as
h = 0.001,  q = 100,  λ = 0.4,  γ = 0.9,   (1.20)
with one further parameter of (1.20) given in a formula image that is not reproduced here.
Every 100 samples, the current state s_j is calculated according to the formulas in step (II), the action a_j is selected, and the observer bandwidth is then adjusted according to (1.12); at all other sampling instants the observer bandwidth remains unchanged. When the bandwidth is adjusted, the reward of the last action is calculated from (1.10) and the Q value is updated according to (1.6) in step (III); when calculating the reward function, normalization is performed by dividing the current value by its order of magnitude.
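Putting the pieces together, a minimal simulation loop under the reported settings (h = 0.001, q = 100, γ = 0.9, initial bandwidth 5, action set {-0.5, 0, 0.5}) could look as follows; it reuses the eso_step, QTable and adjust_bandwidth sketches given earlier. The plant integration, the noise level, the upper bandwidth limit, the state and reward construction, and the ε-greedy exploration are illustrative assumptions, since formulas (1.8)-(1.10) and the exact simulated system are available only as images.

```python
import numpy as np

def simulate(disturbance, T=20.0, h=1e-3, q=100, b=1.0,
             omega0=5.0, omega_min=0.5, omega_max=10.0, noise_std=0.01, seed=0):
    rng = np.random.default_rng(seed)
    actions = (-0.5, 0.0, 0.5)
    table = QTable(actions, gamma=0.9)

    x = np.zeros(2)              # true position and velocity
    z = np.zeros(3)              # ESO estimates [x2_hat, d_hat, c_hat]
    omega, s, a = omega0, None, None
    d_hist, errors = [], []

    for k in range(int(T / h)):
        t = k * h
        u = 0.0                  # open-loop input for the estimation test
        d = disturbance(x, t)
        # plant: forward-Euler integration of the continuous model (1.1)
        x = x + h * np.array([x[1], b * u + d])
        y2 = x[1] + noise_std * rng.standard_normal()

        z = eso_step(z, y2, u, b, omega, h)
        d_hist.append(z[1])
        errors.append(z[1] - d)  # disturbance-estimation error

        if (k + 1) % q == 0:     # one Q learning step every q samples
            window = np.array(errors[-q:])
            s_next = (round(float(np.mean(window)), 2),
                      round(float(np.var(d_hist[-q:])), 2))   # assumed state features
            r = -abs(s_next[0]) - 0.4 * s_next[1]             # assumed reward shape
            if s is not None:
                table.update(s, a, r, s_next)                 # reward of the last action
            s = s_next
            a = table.best_action(s) if rng.random() > 0.1 else actions[rng.integers(3)]
            omega = adjust_bandwidth(omega, a, omega_min, omega_max)
    return np.array(errors)
```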
FIGS. 2 to 4 show the simulation results of the Q learning extended state observer under the three classes of "total disturbance", where the initial bandwidth of the Q learning extended state observer is set to 5 and the results are compared with extended state observers of constant bandwidth.
In the first case, d(x(t), t) is a constant disturbance of 0.15. When the bandwidth is 10, the observer is strongly affected by noise, so the estimation error keeps oscillating within [-0.5, 0.5]; when the bandwidth is 2, the observer reaches steady state after about 10 s, with a steady-state estimation error within [-0.01, 0.01]; when the bandwidth is adjusted through Q learning, the observer reaches steady state within 1 s, with a steady-state estimation error within [-0.04, 0.04]. The Q learning ESO adjusts poorly in the initial stage because of insufficient data, but finally obtains good performance through the Q learning adjustment in the later stage.
In the second case, d(x(t), t) is a piecewise-linear disturbance. When the bandwidth is 10, the observer tracks the disturbance quickly where it changes fast, but where the disturbance changes slowly the influence of noise is obvious and the estimation error stays within [-0.5, 0.5]; when the bandwidth is 2, the observer estimates well where the disturbance changes slowly, but its tracking capability is insufficient where the disturbance changes fast, giving a large tracking error, with an overall tracking error within [-0.5, 0.5]. When the bandwidth is adjusted through Q learning, the observer achieves a better balance between fast disturbance tracking and noise suppression; the poor adjustment caused by insufficient data in the initial stage is eliminated, the estimation error stays within [-0.2, 0.2] after steady state is reached at 10 s, and the estimation effect is the best.
In the third case, the uncertain dynamics in the system are a composite of the nonlinear system dynamics and a periodic external disturbance. When the bandwidth is 10, the influence of noise is obvious and the estimation error stays within [-0.5, 0.5]; when the bandwidth is 2, the tracking error is large because the disturbance changes quickly, with an overall tracking error within [-0.5, 0.5]; when the bandwidth is adjusted through Q learning, the poor adjustment caused by insufficient data in the initial stage is eliminated, the estimation error stays within [-0.2, 0.2] after steady state is reached at 1.5 s, and the estimation effect is the best.
Reference to the literature
[1] Chi-Tsong Chen. Linear System Theory and Design [M]. Holt, Rinehart and Winston, 1984.
[2] Han Jingqing. A class of extended state observer for uncertain objects [J]. Control and Decision, 1995(1): 85-88.
[3] Gao Z. Scaling and bandwidth-parameterization based controller tuning [C]// IEEE, 2003.
[4] Xue W, Yi H. Performance analysis of active disturbance rejection tracking control for a class of uncertain LTI systems [J]. ISA Transactions, 2015, 58: 133-154.
[5] Guo B Z, Zhao Z L. On the convergence of an extended state observer for nonlinear systems with uncertainty [J]. Systems & Control Letters, 2011, 60(6): 420-430.
[6] Bai W, Xue W, Huang Y, et al. On extended state based Kalman filter design for a class of nonlinear time-varying uncertain systems [J]. Science China Information Sciences, 2018, 61(4): 1-16.

Claims (2)

1. A Q learning extended state observer design method of a motion control system is based on the following motion control system mathematical model:
dx1(t)/dt = x2(t)
dx2(t)/dt = b·u(t) + d(x(t), t)
dd(x(t), t)/dt = c(t)
y_i(kh) = x_i(kh) + v_i(kh),  i = 1, 2,  k = 1, 2, ...   (1.1)
where t denotes time; x1(t) ∈ R denotes the position of the moving object at time t; x2(t) ∈ R denotes the velocity of the moving object at time t; b denotes the input gain; u(t) ∈ R denotes the input of the system at time t; d(x(t), t) ∈ R denotes the total disturbance, composed of the uncertain dynamics inside the system and the external disturbance, at time t; and c(t) ∈ R denotes the derivative of the total disturbance at time t; x1(kh) and x2(kh) denote the position and velocity of the moving object at time t = kh, where h is the sampling period and k indexes the k-th sample; y_i(kh) denotes the measured value of x_i(kh), and v_i(kh) is the measurement noise of the corresponding channel, i = 1, 2;
the method is characterized by comprising the following three steps:
step one: the extended state observer is designed according to the discrete form of equation (1.1):
since the sampling and control inputs are discrete in a real system, a discrete approximation of equation (1.1) needs to be considered:
x1((k+1)h) = x1(kh) + h·x2(kh)
x2((k+1)h) = x2(kh) + h·(b·u(kh) + d(x(kh), kh))
d(x((k+1)h), (k+1)h) = d(x(kh), kh) + h·c(kh)
y_i((k+1)h) = x_i((k+1)h) + v_i((k+1)h),  i = 1, 2   (1.2)
x1((k+1)h) and x2((k+1)h) denote the position and velocity of the moving object at time t = (k+1)h, where h is the sampling period and k+1 indexes the (k+1)-th sample; y_i((k+1)h) represents the measured value of x_i((k+1)h), and v_i((k+1)h) is the measurement noise of the corresponding channel; b represents the input gain, u(kh) ∈ R represents the input of the system at time t = kh, d(x(kh), kh) ∈ R represents the total disturbance composed of the uncertain dynamics inside the system and the external disturbance at time t = kh, and c(kh) ∈ R represents the derivative of the total disturbance at time t = kh;
according to equation (1.2), the linear extended state observer is designed as follows:
x̂2((k+1)h) = x̂2(kh) + h·(b·u(kh) + d̂(kh) + β1(kh)·(y2(kh) - x̂2(kh)))
d̂((k+1)h) = d̂(kh) + h·(ĉ(kh) + β2(kh)·(y2(kh) - x̂2(kh)))
ĉ((k+1)h) = ĉ(kh) + h·β3(kh)·(y2(kh) - x̂2(kh))   (1.3)
where β1(kh), β2(kh), β3(kh) are the observer gains, designed as
β1(kh) = 3ω(kh),  β2(kh) = 3ω²(kh),  β3(kh) = ω³(kh),  ω(kh) > 0.   (1.4)
x̂2(kh), d̂(kh) and ĉ(kh) respectively represent the estimated values of x2(kh), d(kh) and c(kh), and ω(kh) is referred to as the observer bandwidth at time t = kh, which is the parameter to be adjusted;
step two: designing a Q learning algorithm:
the Q learning algorithm comprises four components: state, action, reward and state-action value function;
based on the current state of the system, the corresponding action is selected according to the value of the state-action value function, the corresponding reward is obtained, and the value of the state-action value function is updated according to the reward;
a state space S and an action space Λ are designed according to the actual situation, where Λ is a bounded set of real numbers whose maximum and minimum values are denoted a_max and a_min, respectively; for all states s ∈ S and actions a ∈ Λ, the corresponding state-action value function Q(s, a) is initialized to 0, and the discount factor γ ∈ (0, 1) and a learning-rate sequence {α_n} satisfying the following conditions are selected:
α_n ∈ (0, 1],  Σ_n α_n = ∞,  Σ_n α_n² < ∞.   (1.5)
For the state-action value function Q, the following update criterion is employed:
Q_{n+1}(s_j, a_j) = Q_n(s_j, a_j) + α_n·[ r_j + γ·max_{a′ ∈ Λ} Q_n(s′, a′) - Q_n(s_j, a_j) ],   (1.6)
where the subscript n denotes that action a_j is selected in state s_j for the n-th time; s′ represents the next state reached after selecting action a; the subscript j indicates that Q learning is performed for the j-th time; while the system is running, Q learning is performed once every q samples, and the bandwidth from t = jqh to t = (j+1)qh is a constant, denoted
ω_j = ω(t),  t ∈ [jqh, (j+1)qh),  j = 1, 2, ...   (1.7)
at the j-th Q learning, the state of the system is calculated by the following formulas, an action is selected, and the reward is calculated:
state: the state at the j-th Q learning is defined as s_j = [s_{j,1}, s_{j,2}], where s_{j,1} and s_{j,2} are defined as:
[Formula image (1.8) not reproduced: definitions of s_{j,1} and s_{j,2}]
action: the action a_j at the j-th Q learning is selected according to the following rule:
[Formula image (1.9) not reproduced: action-selection rule]
reward: the reward obtained at the j-th Q learning is calculated as:
[Formula image (1.10) not reproduced: reward calculation]
where λ ∈ (0, 1) is the reward parameter; the normalized values of s_{j+1,2} and r_{j,2} enter the reward, with s_{j+1,2} as defined in formula (1.8) and r_{j,2} representing the variance of the total-disturbance estimate d̂;
step three: adjusting the parameters of the extended state observer through Q learning
when the system runs to time t = jqh, j = 1, 2, ..., the state s_j is calculated according to formula (1.8) and the action a_j is selected according to formula (1.9); the observer bandwidth is then adjusted by the following rule:
[Formula image (1.12) not reproduced: bandwidth-adjustment rule]
where ω_max and ω_min are the bandwidth upper and lower limits, set in advance;
after the bandwidth is adjusted, the reward r_j and the next state s_{j+1} are calculated according to formula (1.10), and the Q value function is updated according to formula (1.6).
2. The method of claim 1, wherein, in order to ensure the stability of the extended state observer, ω_max, ω_min, a_max and a_min satisfy the following conditions:
[Formula image (1.13) not reproduced: stability conditions on ω_max, ω_min, a_max and a_min]
where ε, M and m are intermediate variables, calculated as follows:
[Formula image not reproduced: definitions of the intermediate variables ε, M and m]
CN202110637860.6A 2021-06-08 2021-06-08 Q learning extended state observer design method of motion control system Active CN113325804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110637860.6A CN113325804B (en) 2021-06-08 2021-06-08 Q learning extended state observer design method of motion control system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110637860.6A CN113325804B (en) 2021-06-08 2021-06-08 Q learning extended state observer design method of motion control system

Publications (2)

Publication Number Publication Date
CN113325804A (en) 2021-08-31
CN113325804B (en) 2022-03-29

Family

ID=77420322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110637860.6A Active CN113325804B (en) 2021-06-08 2021-06-08 Q learning extended state observer design method of motion control system

Country Status (1)

Country Link
CN (1) CN113325804B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109932905A (en) * 2019-03-08 2019-06-25 辽宁石油化工大学 A kind of optimal control method of the Observer State Feedback based on non-strategy
CN111278704A (en) * 2018-03-20 2020-06-12 御眼视觉技术有限公司 System and method for navigating a vehicle
CN112000009A (en) * 2020-07-27 2020-11-27 南京理工大学 Material transfer device reinforcement learning control method based on state and disturbance estimation
CN112508172A (en) * 2020-11-23 2021-03-16 北京邮电大学 Space flight measurement and control adaptive modulation method based on Q learning and SRNN model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6538766B2 (en) * 2017-07-18 2019-07-03 ファナック株式会社 Machine learning apparatus, servo motor control apparatus, servo motor control system, and machine learning method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111278704A (en) * 2018-03-20 2020-06-12 御眼视觉技术有限公司 System and method for navigating a vehicle
CN109932905A (en) * 2019-03-08 2019-06-25 辽宁石油化工大学 A kind of optimal control method of the Observer State Feedback based on non-strategy
CN112000009A (en) * 2020-07-27 2020-11-27 南京理工大学 Material transfer device reinforcement learning control method based on state and disturbance estimation
CN112508172A (en) * 2020-11-23 2021-03-16 北京邮电大学 Space flight measurement and control adaptive modulation method based on Q learning and SRNN model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Q-learning-based optimal state estimation and control for discrete-time systems with unknown parameters; Li Jinna et al.; Control and Decision; 2020-12-31; Vol. 35, No. 12: 2889-2897 *

Also Published As

Publication number Publication date
CN113325804A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN108900346B (en) Wireless network flow prediction method based on LSTM network
CN111413872B (en) Air cavity pressure rapid active disturbance rejection method based on extended state observer
CN111812975A (en) Generalized predictive control method for pumped storage unit speed regulation system based on fuzzy model identification
CN114742278A (en) Building energy consumption prediction method and system based on improved LSTM
CN113325804B (en) Q learning extended state observer design method of motion control system
CN115759322A (en) Urban rail transit passenger flow prediction and influence analysis method
CN111240201B (en) Disturbance suppression control method
CN109946979A (en) A kind of self-adapting regulation method of servo-system sensitivity function
CN114355848A (en) Tension detection and intelligent control system
Mao et al. Auxiliary model-based iterative estimation algorithms for nonlinear systems using the covariance matrix adaptation strategy
CN105677936B (en) The adaptive recurrence multistep forecasting method of electromechanical combined transmission system demand torque
CN116300440A (en) DC-DC converter control method based on TD3 reinforcement learning algorithm
Xu et al. Data-driven plant-model mismatch quantification for MIMO MPC systems with feedforward control path
CN113988415A (en) Medium-and-long-term power load prediction method
CN112733372B (en) Fuzzy logic strong tracking method for load modeling
Song et al. Real-time adjustment way of reservoir schedule forecasting projects based on improved variable oblivion factor least square arithmetic coupling Kalman filters
Yao et al. Approaches to model and control nonlinear systems by RBF neural networks
CN114567288B (en) Distribution collaborative nonlinear system state estimation method based on variable decibels
CN116316590A (en) Confidence domain policy optimized power system stabilizer active disturbance rejection control method
CN114666230B (en) Correction factor-based network flow gray prediction method
Sun et al. Model Free Adaptive Control Algorithm based on GRU network
Zheng et al. Green Simulation Based Policy Optimization with Partial Historical Trajectory Reuse
CN114822025B (en) Traffic flow combined prediction method
CN112859588B (en) Control device and method for rapidly reducing lead bismuth fast reactor waste heat discharge temperature
Carloni A norm-optimal Kalman iterative learning control for precise UAV trajectory tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant