CN110908280B

CN110908280B - Optimization control method for trolley-two-stage inverted pendulum system

Info

Publication number: CN110908280B
Application number: CN201911043225.4A
Authority: CN
Inventors: 卢荣华; 陈特欢
Original assignee: Ningbo University
Current assignee: Ningbo University
Priority date: 2019-10-30
Filing date: 2019-10-30
Publication date: 2023-01-03
Anticipated expiration: 2039-10-30
Also published as: CN110908280A

Abstract

The embodiment of the invention discloses a vehicle-two-stage inverted pendulum system optimization control method, which comprises the following steps: s10, setting a Gaussian kernel function, and training values of hyper-parameters of the Gaussian kernel function by taking a current state control pair as input and a state variable as output; obtaining the relation between the distribution of the state control pairs at the ith moment and the state distribution at the (i + 1) th moment by combining probability distribution functions, namely a Gaussian process model; s20, determining an optimal control interval and a better discrete control sequence; s30, after a Gaussian process model and an optimal control interval are obtained, a total objective function value and a gradient value are obtained by combining iterative calculation of the Gaussian process model on the formula Gaussian process model and initial condition variation; and S40, calling an optimization solver based on the gradient, taking the better discrete control sequence obtained by learning as an initial guess of optimization control, and carrying out iterative solution to obtain an optimal control force sequence through calculation of the gradient value and the total objective function value.

Description

An Optimal Control Method for Cart-Secondary Inverted Pendulum System

技术领域technical field

本发明涉及小车-二级倒立摆系统的控制领域，具体涉及一种小车-二级倒立摆系统优化控制方法。The invention relates to the control field of a trolley-two-stage inverted pendulum system, in particular to an optimal control method for a trolley-two-stage inverted pendulum system.

背景技术Background technique

小车-二级倒立摆系统是一个经典的快速、多变量、非线性、不稳定系统，是控制领域经典的控制对象。许多控制算法，包括PID，模糊PID，鲁棒控制等均已在此系统中实施。然而，设计控制的前提是建模。目前对于小车-二级倒立摆系统的控制均基于机理模型，是一种通过物理原理来推导的确定性模型。而模型参数涉及小车的尺寸、倒立摆的尺寸等等。The trolley-two-stage inverted pendulum system is a classic fast, multivariable, nonlinear and unstable system, and it is a classic control object in the field of control. Many control algorithms, including PID, fuzzy PID, robust control, etc. have been implemented in this system. However, a prerequisite for designing controls is modeling. At present, the control of the trolley-two-stage inverted pendulum system is based on the mechanism model, which is a deterministic model derived from physical principles. The model parameters involve the size of the trolley, the size of the inverted pendulum, and so on.

随着智能算法，尤其是强化学习等算法的发展，控制模型已慢慢从确定的机理模型转为无模型控制。而无模型控制有一定的弊端，如学习次数太多，控制效果难以定量分析，控制器设计效率慢等。With the development of intelligent algorithms, especially algorithms such as reinforcement learning, the control model has gradually changed from a definite mechanism model to model-free control. However, model-free control has certain disadvantages, such as too many learning times, difficult quantitative analysis of control effects, and slow controller design efficiency.

发明内容Contents of the invention

鉴于以上存在的技术问题，本发明用于提供一种小车-二级倒立摆系统优化控制方法。In view of the technical problems above, the present invention provides an optimal control method for the trolley-two-stage inverted pendulum system.

为解决上述技术问题，本发明采用如下的技术方案：In order to solve the problems of the technologies described above, the present invention adopts the following technical solutions:

一种小车-二级倒立摆系统优化控制方法，应用于包括小车，第一根倒立摆和第二根倒立摆的小车-二级倒立摆系统，包括以下步骤：A dolly-two-stage inverted pendulum system optimization control method, applied to a dolly-two-stage inverted pendulum system comprising a dolly, a first inverted pendulum and a second inverted pendulum, comprising the following steps:

S10，确定小车-二级倒立摆系统的6个状态，在施加一定的作用力后，获得6个状态随时间的变化情况；设置高斯核函数，通过以当前状态控制对为输入，状态变化量为输出，训练出高斯核函数超参数的取值；通过联合概率分布函数，得到第i时刻状态控制对的分布与第i+1时刻状态分布的关系，即高斯过程模型；S10, determine the 6 states of the trolley-two-stage inverted pendulum system, and obtain the changes of the 6 states over time after applying a certain force; set the Gaussian kernel function, and by taking the current state control pair as input, the state change amount As the output, the value of the Gaussian kernel function hyperparameter is trained; through the joint probability distribution function, the relationship between the distribution of the state control pair at the i-th moment and the state distribution at the i+1th moment is obtained, that is, the Gaussian process model;

S20，设置小车-二级倒立摆系统的状态、时间步长和控制变量的离散取值；定义成本函数的期望形式，设定奖励惩罚机制，给定每个时间步的Q值的更新规则；学习过程前期不断地缩小控制量的取值范围，学习过程后期不断地平移控制量的取值范围；当收敛到产生成本函数的较小值，确定最优的控制区间与较优的离散控制序列；S20, setting the state of the car-two-stage inverted pendulum system, the time step and the discrete value of the control variable; defining the expected form of the cost function, setting the reward and punishment mechanism, and specifying the update rule of the Q value of each time step; In the early stage of the learning process, the value range of the control variable is continuously narrowed, and in the later stage of the learning process, the value range of the control variable is continuously translated; when it converges to a smaller value of the cost function, the optimal control interval and the better discrete control sequence are determined. ;

S30，获得高斯过程模型和最优控制区间后，对公式高斯过程模型和初始条件变分，得到Δμ_i和Δη_i的迭代形式，结合高斯过程模型的迭代计算，得到总的目标函数值与梯度值；S30, after the Gaussian process model and the optimal control interval are obtained, the formula Gaussian process model and the initial condition variation are obtained to obtain the iterative forms of _Δμi and _Δηi , combined with the iterative calculation of the Gaussian process model, the total objective function value and gradient are obtained value;

S40，调用基于梯度的优化求解器，并将学习得到的较优的离散控制序列作为优化控制的初始猜测，通过梯度值和总的目标函数值的计算，迭代求解得到最优的控制力序列。S40, call the gradient-based optimization solver, and use the learned better discrete control sequence as the initial guess of the optimization control, calculate the gradient value and the total objective function value, and iteratively solve to obtain the optimal control force sequence.

优选地，S10中，确定小车-二级倒立摆系统的6个状态分别为：小车的位移x₁、小车的速度x₂、第一根倒立摆的角速度x₃和角度x₅、第二根倒立摆的角速度x₄和角度x₆，令x＝[x₁x₂ ... x₆]^T，T为矩阵转置，并设作用力为u(t)，初始时刻(0时刻)小车与两根倒立摆都处于静止状态，即x₁(0)＝0,x₂(0)＝0,x₃(0)＝0,x₄(0)＝0,x₅(0)＝π,x₆(0)＝π.整个控制时间尺度为T＝1.2秒，每次控制作用为0.02秒，整个控制时间需要60个控制序列，即离散时刻i＝0,1,...,60，对应时刻的状态和控制就是x(i)和u(i)。Preferably, in S10, the six states of the trolley-two-stage inverted pendulum system are determined to be: the displacement x ₁ of the trolley, the velocity x ₂ of the trolley, the angular velocity x ₃ and the angle x ₅ of the first inverted pendulum, the second Angular velocity x ₄ and angle x ₆ of the inverted pendulum, set x=[x ₁ x ₂ ... x ₆ ] ^T , T is the matrix transposition, and set the force as u(t), the initial moment (moment 0) the car All are at rest with two inverted pendulums, namely x ₁ (0)=0, x ₂ (0)=0, x ₃ (0)=0, x ₄ (0)=0, x ₅ (0)=π ,x ₆ (0)=π. The whole control time scale is T=1.2 seconds, each control action is 0.02 seconds, and the whole control time needs 60 control sequences, that is, the discrete time i=0,1,...,60 , the state and control at the corresponding moment are x(i) and u(i).

优选地，高斯建模过程如下：高斯过程所要学习的映射关系，是当前的状态控制对xp＝{x(0)，u(0)，...，x(59)，u(59)}到状态变化量Δx＝{Δx(0)，...，Δx(59)}之间的关系，其中符号Δ是状态的变化量，Δx(i)＝x(i+1)-x(i)，Preferably, the Gaussian modeling process is as follows: the mapping relationship to be learned by the Gaussian process is the current state control pair xp={x(0), u(0),..., x(59), u(59)} To the relationship between the state change Δx={Δx(0),...,Δx(59)}, where the symbol Δ is the state change, Δx(i)=x(i+1)-x(i ),

根据高斯过程建模，对于给定的1个状态控制对xp^*＝{x^*，u^*}，它与60个状态控制对xp＝{x(0)，u(0)，...，x(59)，u(59)}之间的关系如下所示：According to Gaussian process modeling, for a given 1 state control pair xp ^* ={x ^* ,u ^* }, it is related to 60 state control pairs xp={x(0),u(0),..., The relationship between x(59), u(59)} is as follows:

式中，

表示高斯概率分布，I为单位矩阵，σ_w为一与噪声有关的超参数，K，k为高斯核函数。当向量与向量操作时，记为K；当数与数或数与向量操作时，记为k；f是未知的小车-二级倒立摆系统的动态模型，核函数一般定义平方指数协方差函数In the formula,

Represents a Gaussian probability distribution, I is an identity matrix, σ _w is a hyperparameter related to noise, K, k is a Gaussian kernel function. When a vector operates with a vector, it is denoted as K; when a number and a number or a number and a vector are operated, it is denoted as k; f is the dynamic model of the unknown trolley-two-stage inverted pendulum system, and the kernel function generally defines a square exponential covariance function

式中，y_m，y_n可以是固定的数值或是矩阵；σ_f与方差有关的超参数；W为权值矩阵，只在对角线上有数值，且这些数值都为超参数，通过具体的控制状态对的输入和状态变化量的输出，优化得到超参数的具体数值；In the formula, y _m , y _n can be fixed values or matrices; σ _f is a hyperparameter related to variance; W is a weight matrix, which only has values on the diagonal, and these values are all hyperparameters. Specifically control the input of the state pair and the output of the state change, and optimize the specific value of the hyperparameter;

通过联合概率分布(1)获得f(x(i)，u(i))的均值

和方差

Obtain the mean of f(x(i), u(i)) through the joint probability distribution (1)

and variance

由此，根据第i时刻状态控制对的分布，通过准确计算，得到第i+1时刻状态分布的预测值：Thus, according to the distribution of the state control pair at the i-th moment, the predicted value of the state distribution at the i+1th moment is obtained through accurate calculation:

式中，

是状态和相应的状态转换之间的协方差，可由匹配方法获得，从而得到高斯过程模型，即第i时刻状态控制对的分布，计算第i+1时刻状态的分布。In the formula,

is the covariance between the state and the corresponding state transition, which can be obtained by the matching method, so as to obtain the Gaussian process model, that is, the distribution of the state-control pair at the i-th moment, and calculate the distribution of the state at the i+1-th moment.

优选地，S20具体包括：Preferably, S20 specifically includes:

在小车-二级倒立摆系统的控制过程中，最佳地设计控制力，使得二级倒立摆系统在T时刻满足第一根倒立摆和第二根倒立摆的角度为2π，首先采用强化学习来尽可能地靠近全局最优的控制区间和获得较优的离散控制序列；In the control process of the trolley-two-stage inverted pendulum system, the control force is optimally designed so that the two-stage inverted pendulum system satisfies the angle of the first inverted pendulum and the second inverted pendulum at time T. Firstly, reinforcement learning is used To get as close as possible to the global optimal control interval and obtain a better discrete control sequence;

在强化学习中，小车-二级倒立摆系统优化控制系统满足M＝<X，U，P，R，γ>马尔可夫决策过程(离散过程)。X，U，P，R，γ各自的定义如下：In reinforcement learning, the optimal control system of the trolley-two-stage inverted pendulum system satisfies M=<X, U, P, R, γ> Markov decision process (discrete process). X, U, P, R, γ are defined as follows:

X是小车-二级倒立摆系统的6个状态向量x，即X is the six state vectors x of the trolley-two-stage inverted pendulum system, namely

X＝<x(j)>＝<μ_j，η_j>，j＝0，1，…，N， (5)X=<x(j)>=<μ _j , η _j >, j=0, 1, ..., N, (5)

式中，<>表示集合，j是时间步长，取N＝10，In the formula, <> represents a set, j is the time step, and N=10,

U是包含所有可能操作的动作空间，即施加的力的离散取值(有限个数)，将合理的控制离散取值设为a_m，那么U is the action space containing all possible operations, that is, the discrete value of the applied force (limited number), and the reasonable control discrete value is set to a _m , then

U＝{a_m}，m＝1，2，...，M， (6)U={a _m }, m=1, 2, . . . , M, (6)

式中，M为控制量所取的个数，每次取M＝100，P是状态转移概率函数，P[x(j+1)＝x’|x(j)＝x，u(j)＝a]＝P(x，a，x’)，In the formula, M is the number of control variables, each time M=100, P is the state transition probability function, P[x(j+1)=x'|x(j)=x,u(j) = a] = P(x, a, x'),

R是在每个时间步j＝0，1，...，N的状态和动作的奖励，学习控制的目的是使得二级倒立摆系统在T时刻满足第一根倒立摆和第二根倒立摆的角度为2π，奖励函数可用控制成本函数来定义，第j个时刻的控制成本函数定义为：R is the state and action reward at each time step j = 0, 1, ..., N. The purpose of learning control is to make the two-stage inverted pendulum system satisfy the first inverted pendulum and the second inverted pendulum at time T. The angle of the pendulum is 2π, the reward function can be defined by the control cost function, and the control cost function at the jth moment is defined as:

式中，

和Z分别为状态和控制的权值系数，根据实际情况提前给定；x^tg(j)是目标位置；In the formula,

and Z are the weight coefficients of state and control respectively, which are given in advance according to the actual situation; x ^tg (j) is the target position;

由于状态量x_j含有均值和方差，两边求期望，可以得到如下第j个时刻的控制成本函数Since the state quantity x _j contains the mean and variance, and the expectation is calculated on both sides, the following control cost function at the jth moment can be obtained

其中，L代表目标函数，当远超过或远不到2π时，应该给小车-二级倒立摆系统优化控制系统一个惩罚(奖励函数的负值)；当接近2π时，应该给小车-二级倒立摆系统优化控制系统一个奖励，接近2π与远离2π的指标C_j＝x(j)-Z_j，其中Z_j为各个时刻接近2π趋势的设定值Among them, L represents the objective function. When it is far beyond or far less than 2π, a penalty (the negative value of the reward function) should be given to the optimal control system of the trolley-secondary inverted pendulum system; when it is close to 2π, the trolley-secondary Inverted pendulum system optimization control system is a reward, close to 2π and away from 2π index C _j = x(j)-Z _j , where Z _j is the set value of the trend close to 2π at each moment

式中，取λ为π/10；In the formula, take λ as π/10;

γ是折扣系数，假设随着时间的推移，相应的奖励会打折，每个时间步的奖励是基于前一步的奖励R_j和折扣系数γ(0≤γ≤1)，取γ为0.85；累积奖励表示为γ is the discount coefficient, assuming that the corresponding reward will be discounted as time goes by, the reward of each time step is based on the previous step’s reward R _j and the discount coefficient γ (0≤γ≤1), take γ as 0.85; accumulate Reward expressed as

采用Q学习算法，在每个离散时间步骤，小车-二级倒立摆系统优化控制系统观察当前状态并对最大化奖励的状态采取行动；使用规则更新每个时间步的Q值Using the Q-learning algorithm, at each discrete time step, the trolley-two-stage inverted pendulum system optimization control system observes the current state and acts on the state that maximizes the reward; uses a rule to update the Q value at each time step

Q^j+1(x(j)，a_j)＝Q^j(x(j)，a_j)+∈×[R_j+1+γ(maxQ^J(x(j+1)，a_j+1)-Q^J(x(j).a_j))]，(11)Q ^j+1 (x(j), a _j )=Q ^j (x(j), a _j )+∈×[R _j+1 +γ(maxQ ^J (x(j+1), a _j+1 )-Q ^J (x(j).a _j ))], (11)

其中，Q^j(x(j)，a_j)是时刻j的状态x(j)和动作a_j的Q值，

是描述

算法中利用可能性的学习率，在大量训练下探索所有组合训练，Q学习算法应该为所有状态-动作组合生成Q值，在每个状态下，选择具有最大Q值的动作作为最佳动作，进一步提出控制区间自适应的强化学习方法，用来在学习过程中不断地缩小控制量的取值范围，Among them, Q ^j (x(j), a _j ) is the Q value of state x(j) and action a _j at time j,

is the description

The learning rate of the possibility is used in the algorithm to explore all combination training under a large number of trainings. The Q-learning algorithm should generate Q-values for all state-action combinations. In each state, the action with the largest Q-value is selected as the best action. Furthermore, a self-adaptive reinforcement learning method for the control interval is proposed, which is used to continuously narrow the value range of the control variable during the learning process.

argmin_INE[L(x，a)]，argmin _IN E[L(x,a)],

IN＝[min{a_m}，max{a_m}]， (12)IN=[min _{am }, max _{am }], (12)

其中，E[L(x，a)]是每个时刻代价之和(总代价)，IN表示自适应地优化选择控制量的区间，设置更大范围的控制区间来开始训练过程，根据成本函数的结果，选择产生最小成本的动作空间中的控制区间作为下一次学习的新的控制区间，在此过程中，控制区间在缩小的过程中，同时保持控制的离散数量M不变，控制越来越精细，即每个控制区间的动作数量是恒定的，每个区间逐渐减小并且间隔逐渐减小，也就是第l次学习的控制区间和第l+1次学习的控制区间满足Among them, E[L(x, a)] is the sum of the cost at each moment (total cost), IN means to adaptively optimize the selection of the interval of the control amount, set a larger range of control interval to start the training process, according to the cost function As a result, the control interval in the action space that produces the minimum cost is selected as the new control interval for the next learning. In the process, the control interval is shrinking, while keeping the discrete number M of control unchanged, and the control is getting more and more The more refined, that is, the number of actions in each control interval is constant, each interval gradually decreases and the interval gradually decreases, that is, the control interval of the lth learning and the control interval of the l+1th learning satisfy

[min{a_m}，max{a_m}](l)就等于在控制区间训练l次后的(12)式中的IN，[min{a _m }, max{a _m }](l) is equal to IN in formula (12) after training l times in the control interval,

当一组动作收敛到某个控制区间时，小车-二级倒立摆系统优化控制系统开始通过移动控制区间继续试验相邻值，也就是第n次学习的控制区间和第n+1次学习的控制区间满足When a set of actions converges to a certain control interval, the trolley-two-stage inverted pendulum system optimization control system starts to continue to test adjacent values by moving the control interval, that is, the control interval of the nth learning and the n+1th learning The control interval satisfies

是平移参数，

is the translation parameter,

优化以成本函数来自适应选择控制空间的过程是迭代的，并且一旦控制区间收敛到产生成本函数的较小值的时间间隔，则控制区间成为最优的控制区间，并确定较优的离散控制序列。Optimization The process of adaptively selecting the control space with a cost function is iterative, and once the control interval converges to the time interval that produces the smaller value of the cost function, the control interval becomes the optimal control interval and determines the optimal discrete control sequence .

优选地，为了得到最优的控制序列，提出小车-二级倒立摆系统高斯过程模型的优化控制，首先，对公式(4)变分得到Preferably, in order to obtain the optimal control sequence, the optimal control of the Gaussian process model of the trolley-two-stage inverted pendulum system is proposed. First, the variation of formula (4) is obtained

其中，Δ表示微变量，由于初始条件是固定值，因此初始条件的变分都为0，即Δμ₀＝0，Δη₀＝0，由于Δu_i可以是任意数值，为了形式简单，将其设置成Δu_i＝1，得到Δμ_i和Δη_i的迭代形式Among them, Δ represents a micro variable. Since the initial condition is a fixed value, the variation of the initial condition is 0, that is, Δμ ₀ = 0, Δη ₀ = 0. Since Δu _i can be any value, for the sake of simplicity, it is set to into Δu _i = 1, the iterative form of Δμ _i and Δη _i is obtained

总的目标函数的期望值The expected value of the overall objective function

对第i步的期望E[L(x(i)，u_i)]做变分得到Change the expectation E[L(x(i), u _i )] of the i-th step to get

整个时间区间内的总的目标函数的变分，即目标函数的梯度变为The variation of the total objective function over the entire time interval, that is, the gradient of the objective function becomes

其中，Δμ_i和Δη_i由公式(16)给出，μ_i和u_i由公式(4)给出。Among them, Δμ _i and Δη _i are given by formula (16), and μ _i and u _i are given by formula (4).

本发明的有益效果是：The beneficial effects of the present invention are:

(1)针对小车-二级倒立摆系统，提出数据驱动的高斯过程模型，区别于传统的机理建模等确定性模型，采用均值和方差来表示小车-二级倒立摆系统的运行状态，更贴近系统的实际运动过程。(1) For the trolley-two-stage inverted pendulum system, a data-driven Gaussian process model is proposed, which is different from traditional deterministic models such as mechanism modeling. Close to the actual motion process of the system.

(2)针对小车-二级倒立摆系统(不确定性系统)，设计控制区间自适应的强化学习与优化控制，将学习与控制方法推广到不确定系统领域。(2) Aiming at the trolley-two-stage inverted pendulum system (uncertain system), design the reinforcement learning and optimal control of control interval adaptation, and extend the learning and control method to the field of uncertain systems.

(3)考虑到传统强化学习，如Q Learning在学习效率上的问题，提出控制区间自适应的强化学习，即不断地缩小控制决策的范围，自适应地选择在一个最佳的控制区间，并确定较优的离散控制序列。(3) Considering the learning efficiency of traditional reinforcement learning, such as Q Learning, a control interval adaptive reinforcement learning is proposed, that is, continuously narrowing the scope of control decisions, adaptively selecting an optimal control interval, and Determine the optimal discrete control sequence.

(4)针对优化问题容易陷入局部最优的问题，通过所提的强化学习来确定最优的控制区间，并以得到较优的离散控制序列为初始猜测，采用优化控制来确定最优的控制曲线(值)，确保最大限度地搜索到全局最优解。(4) Aiming at the problem that the optimization problem is easy to fall into local optimum, the optimal control interval is determined through the proposed reinforcement learning, and the optimal control is determined by using the optimal control sequence as the initial guess Curve (value), to ensure that the global optimal solution is searched to the maximum extent.

(5)针对传统强化学习控制量必须是有限个的问题，在强化学习决策后，又采用控制区间连续的优化控制算法，最终得到最优的控制输入。(5) For the problem that the control quantity of traditional reinforcement learning must be finite, after the reinforcement learning decision-making, the optimal control algorithm with continuous control interval is adopted to finally obtain the optimal control input.

(6)结合实验给出的本发明实验效果，Q学习前的控制区间为[-250,250]，Q学习后的控制区间为[-116,93]。优化控制后，最优的目标函数值为9338，给出最优控制下各个时刻的目标函数均值和方差图及最优控制曲线。(6) According to the experimental results of the present invention given in the experiment, the control interval before Q learning is [-250, 250], and the control interval after Q learning is [-116, 93]. After optimal control, the optimal value of the objective function is 9338, and the mean value and variance diagram of the objective function and the optimal control curve at each moment under the optimal control are given.

附图说明Description of drawings

图1为本发明小车-二级倒立摆系统的实验设备简化图。Fig. 1 is a simplified diagram of the experimental equipment of the trolley-two-stage inverted pendulum system of the present invention.

图2为本发明小车-二级倒立摆系统的高斯过程建模流程图。Fig. 2 is a Gaussian process modeling flow chart of the trolley-two-stage inverted pendulum system of the present invention.

图3为本发明小车-二级倒立摆系统的控制区间自适应的强化学习流程图。Fig. 3 is a flowchart of reinforcement learning for control interval adaptation of the trolley-two-stage inverted pendulum system of the present invention.

图4为本发明小车-二级倒立摆系统的优化控制流程图。Fig. 4 is an optimized control flow chart of the trolley-two-stage inverted pendulum system of the present invention.

图5为本发明小车-二级倒立摆系统的高斯过程建模、自适应区间的强化学习与优化控制的流程图。Fig. 5 is a flow chart of Gaussian process modeling, reinforcement learning and optimal control of the adaptive interval of the trolley-two-stage inverted pendulum system of the present invention.

图6为本发明小车-二级倒立摆系统在初始猜测情况下各个时刻的目标函数均值和方差图。Fig. 6 is a diagram of the mean value and variance of the objective function at each moment in the case of the initial guess of the trolley-two-stage inverted pendulum system of the present invention.

图7为本发明小车-二级倒立摆系统在最优控制下各个时刻的目标函数均值和方差图。Fig. 7 is a diagram of the mean value and variance of the objective function at each moment of the trolley-two-stage inverted pendulum system of the present invention under optimal control.

图8为本发明小车-二级倒立摆系统的最优控制曲线。Fig. 8 is the optimal control curve of the trolley-two-stage inverted pendulum system of the present invention.

图9为本发明小车-二级倒立摆系统第一根倒立摆的角度均值的变化图。Fig. 9 is a change diagram of the angle mean value of the first inverted pendulum of the trolley-two-stage inverted pendulum system of the present invention.

图10为本发明小车-二级倒立摆系统第二根倒立摆的角度均值的变化图。Fig. 10 is a change diagram of the angle mean value of the second inverted pendulum of the trolley-two-stage inverted pendulum system of the present invention.

具体实施方式detailed description

如图1所示：本发明小车-二级倒立摆系统的实验设备简化图，依次为小车1，第1根第2根立摆，第二根倒立摆3等组成。右边的箭头是对小车施加的力，也就是系统的控制输入。弯转的箭头代表倒立摆的旋转角度。As shown in Figure 1: a simplified diagram of the experimental equipment of the trolley-two-stage inverted pendulum system of the present invention, which is composed of trolley 1, the first and second vertical pendulums, and the second inverted pendulum 3 and so on. The arrow on the right is the force applied to the cart, which is the control input to the system. The curved arrow represents the angle of rotation of the inverted pendulum.

本发明实施例提供的小车-二级倒立摆系统优化控制方法，应用于如图1所示的小车-二级倒立摆系统，具体包括以下步骤：The trolley-two-stage inverted pendulum system optimization control method provided by the embodiment of the present invention is applied to the trolley-two-stage inverted pendulum system as shown in Figure 1, and specifically includes the following steps:

S30，获得高斯过程模型(4)和最优控制区间后，对公式高斯过程模型和初始条件变分，得到Δμ_i和Δη_i的迭代形式。结合高斯过程模型的迭代计算，得到总的目标函数值(17)与梯度值(19)。S30, after the Gaussian process model (4) and the optimal control interval are obtained, the formula Gaussian process model and initial conditions are varied to obtain iterative forms of Δμ _i and Δη _i . Combined with the iterative calculation of the Gaussian process model, the total objective function value (17) and gradient value (19) are obtained.

S40，调用基于梯度的优化求解器，如matlab中的SQP，并将学习得到的较优的离散控制序列作为优化控制的初始猜测，通过梯度值(19)和总的目标函数值(17)的计算，迭代求解得到最优的控制力序列。S40, call a gradient-based optimization solver, such as SQP in matlab, and use the learned better discrete control sequence as the initial guess of the optimization control, through the gradient value (19) and the total objective function value (17) Calculate and iteratively solve to obtain the optimal control force sequence.

如图2所示，本发明小车-二级倒立摆系统的高斯过程建模步骤：本发明小车-二级倒立摆系统包括6个状态，分别为小车的位移x₁、小车的速度x₂、第一根倒立摆的角速度x₃和角度x₅、第二根倒立摆的角速度x₄和角度x₆。令x＝[x₁ x₂ ... x₆]^·，·为矩阵转置，并设作用力为u(t)。初始时刻(0时刻)小车与两根倒立摆都处于静止状态，即x₁(0)＝0,x₂(0)＝0,x₃(0)＝0,x₄(0)＝0,x₅(0)＝π,x₆(0)＝π.整个控制时间尺度为T＝1.2秒，每次控制作用为0.02秒。因此整个控制时间需要60个控制序列，即离散时刻i＝0,1,...,60，对应时刻的状态和控制就是x(i)和u(i)。As shown in Figure 2, the Gaussian process modeling steps of the trolley-two-stage inverted pendulum system of the present invention: the trolley-two-stage inverted pendulum system of the present invention includes 6 states, which are respectively the displacement x ₁ of the trolley, the speed x ₂ of the trolley, Angular velocity x ₃ and angle x ₅ of the first inverted pendulum, angular velocity x ₄ and angle x ₆ of the second inverted pendulum. Let x＝[x ₁ x ₂ ... x ₆ ] ^· , · be the matrix transpose, and let the force be u(t). The car and the two inverted pendulums are at rest at the initial moment (moment 0), that is, x ₁ (0)=0, x ₂ (0)=0, x ₃ (0)=0, x ₄ (0)=0, x ₅ (0)=π, x ₆ (0)=π. The whole control time scale is T=1.2 seconds, and each control action is 0.02 seconds. Therefore, the entire control time requires 60 control sequences, that is, discrete time i=0, 1,..., 60, and the state and control at the corresponding time are x(i) and u(i).

高斯过程建模的过程：系统的动态过程，其实就是当前时刻状态和当前控制，到一下时刻状态/状态变化量的过程。因此，高斯过程所要学习的映射关系，就是当前的状态控制对xp＝{x(0),u(0),...,x(59),u(59)}到状态变化量Δx＝{Δx(0),...,Δx(59)}之间的关系，其中符号Δ是状态的变化量，Δx(i)＝x(i+1)-x(i)。The process of Gaussian process modeling: the dynamic process of the system is actually the process of the current state and current control, and the state/state change at the next moment. Therefore, the mapping relationship to be learned by the Gaussian process is the current state control pair xp={x(0),u(0),...,x(59),u(59)} to the state change Δx={ The relationship between Δx(0),...,Δx(59)}, where the symbol Δ is the state change, Δx(i)=x(i+1)-x(i).

根据高斯过程建模，对于给定的1个状态控制对xp^*＝{x^*,u^*}，它与60个状态控制对xp＝{x(0),u(0),...,x(59),u(59)}之间的关系如下所示According to Gaussian process modeling, for a given 1 state control pair xp ^* = {x ^* , u ^* }, it is related to 60 state control pairs xp = {x(0),u(0),..., The relationship between x(59), u(59)} is as follows

式中，

表示高斯概率分布，I为单位矩阵，σ_w为一与噪声有关的超参数，K,k为高斯核函数。当向量与向量操作是，记为K。当数与数或数与向量操作时，记为k。f是未知的小车-二级倒立摆系统的动态模型。核函数一般定义平方指数协方差函数In the formula,

Represents a Gaussian probability distribution, I is the identity matrix, σ _w is a hyperparameter related to noise, and K, k are Gaussian kernel functions. When vectors and vectors are operated, it is recorded as K. When numbers and numbers or numbers and vectors are operated, it is recorded as k. f is the dynamic model of the unknown cart-two-stage inverted pendulum system. The kernel function generally defines the square exponential covariance function

式中，y_m，y_n可以是固定的数值或是矩阵；σ_f与方差有关的超参数；W为权值矩阵，只在对角线上有数值，且这些数值都为超参数。通过具体的控制状态对的输入和状态变化量的输出，优化得到超参数的具体数值。In the formula, y _m and y _n can be fixed values or matrices; σ _f is a hyperparameter related to variance; W is a weight matrix, which only has values on the diagonal, and these values are all hyperparameters. Through the specific input of the control state pair and the output of the state change, the specific value of the hyperparameter is optimized.

通过联合概率分布(1)获得f(x(i)，u(i))的均值

和方差

Obtain the mean of f(x(i), u(i)) through the joint probability distribution (1)

and variance

由此，根据第i时刻状态控制对的分布，通过准确计算，得到第i+1时刻状态分布的预测值Thus, according to the distribution of the state control pair at the i-th moment, the predicted value of the state distribution at the i+1th moment can be obtained through accurate calculation

式中，

是状态和相应的状态转换之间的协方差，可由匹配方法获得。从而，得到了高斯过程模型，即第i时刻状态控制对的分布，计算第i+1时刻状态的分布。In the formula,

is the covariance between states and corresponding state transitions, which can be obtained by the matching method. Thus, the Gaussian process model is obtained, that is, the distribution of the state-control pair at the i-th moment, and the distribution of the state at the i+1-th moment is calculated.

本发明小车-二级倒立摆系统的控制区间自适应的强化学习步骤：在小车-二级倒立摆系统的控制过程中，最佳地设计控制力，使得二级倒立摆系统在T时刻满足第一根倒立摆和第二根倒立摆的角度为2π，即第1根立摆和第2根立摆都垂直。由于传统的优化容易陷入局部最优，首先采用强化学习来尽可能地靠近全局最优的控制区间和获得较优的离散控制序列。The self-adaptive reinforcement learning step of the control interval of the trolley-two-stage inverted pendulum system in the present invention: in the control process of the trolley-two-stage inverted pendulum system, optimally design the control force so that the two-stage inverted pendulum system meets the first The angle between one inverted pendulum and the second inverted pendulum is 2π, that is, the first vertical pendulum and the second vertical pendulum are both vertical. Since traditional optimization is easy to fall into local optimum, reinforcement learning is firstly used to get as close as possible to the global optimal control interval and obtain a better discrete control sequence.

式中，<>表示集合，j是时间步长，考虑到步长的设置会严重影响学习计算效率，不同于高斯过程60份的时间离散数目，这里取N＝10，即时间离散10份。In the formula, <> represents the set, and j is the time step size. Considering that the setting of the step size will seriously affect the learning calculation efficiency, which is different from the 60 time discrete numbers of the Gaussian process, N=10 is taken here, that is, the time discrete is 10 times.

U＝{a_m}，m＝1，2，...，M， (6)U={a _m }, m=1, 2, . . . , M, (6)

式中，M为控制量所取的个数，每次取M＝100。P是状态转移概率函数，P[x(j+1)＝x’|x(j)＝x，u(j)＝a]＝P(x，a，x’)。In the formula, M is the number of controlled variables, M = 100 each time. P is a state transition probability function, P[x(j+1)=x'|x(j)=x, u(j)=a]=P(x, a, x').

R是在每个时间步j＝0，1，...，N的状态和动作的奖励，学习控制的目的是使得二级倒立摆系统在T时刻满足第一根倒立摆和第二根倒立摆的角度为2π。本专利的奖励函数可用控制成本函数来定义。第j个时刻的控制成本函数定义为：R is the state and action reward at each time step j=0, 1, ..., N. The purpose of learning control is to make the two-stage inverted pendulum system satisfy the first inverted pendulum and the second inverted pendulum at time T. The angle of the pendulum is 2π. The reward function of this patent can be defined with a control cost function. The control cost function at the jth moment is defined as:

式中，

和Z分别为状态和控制的权值系数，根据实际情况提前给定；x^tg(j)是目标位置。In the formula,

and Z are the weight coefficients of state and control respectively, which are given in advance according to the actual situation; x ^tg (j) is the target position.

其中，L代表目标函数。当远超过或远不到2π时，应该给小车-二级倒立摆系统优化控制系统一个惩罚(奖励函数的负值)。当接近2π时，应该给小车-二级倒立摆系统优化控制系统一个奖励。考虑到这一点，接近2π与远离2π的指标C_j＝x(j)-Z_j，其中Z_j为各个时刻接近2π趋势的设定值Among them, L represents the objective function. When far exceeding or far less than 2π, a penalty (negative value of the reward function) should be given to the optimal control system of the trolley-two-stage inverted pendulum system. When approaching 2π, a reward should be given to the optimal control system of the trolley-two-stage inverted pendulum system. Taking this into consideration, the index C _j = x(j)-Z _j that is close to 2π and far from 2π, where Z _j is the set value of the trend close to 2π at each moment

式中，取λ为π/10。In the formula, take λ as π/10.

γ是折扣系数。假设随着时间的推移，相应的奖励会打折。因此，每个时间步的奖励是基于前一步的奖励R_j和折扣系数γ(0≤γ≤1)，取γ为0.85。累积奖励表示为γ is the discount factor. Assume that the corresponding rewards are discounted over time. Therefore, the reward at each time step is based on the previous step’s reward R _j and the discount coefficient γ (0≤γ≤1), taking γ as 0.85. The cumulative reward is expressed as

采用Q学习算法。在每个离散时间步骤，小车-二级倒立摆系统优化控制系统观察当前状态并对最大化奖励的状态采取行动。使用规则更新每个时间步的Q值Using the Q-learning algorithm. At each discrete time step, the cart-two-stage inverted pendulum system optimization control system observes the current state and takes action on the state that maximizes the reward. Use a rule to update the Q value at each time step

其中，Q^j(x(j)，a_j)是时刻j的状态x(j)和动作a_j的Q值。

是描述

算法中利用可能性的学习率。在大量训练下探索所有组合训练，Q学习算法应该为所有状态-动作组合生成Q值。在每个状态下，选择具有最大Q值的动作作为最佳动作。Among them, Q ^j (x(j), a _j ) is the Q value of state x(j) and action a _j at time j.

is the description

The learning rate at which likelihood is exploited in the algorithm. Exploring all combinations under a large training set, the Q-learning algorithm should generate Q-values for all state-action combinations. In each state, the action with the largest Q value is selected as the best action.

然而，通过上述X，U，P，R，γ设定，Q的学习非常的慢，主要由于施加的力的离散取值范围非常广，而且离散的力的间隔点又非常多，导致Q列表的维数非常高，容易造成维数灾难。但是如果力的离散点设置很少，又会导致学习效果不佳，难以得到较好的控制策略。控制量的取值范围(控制区间)和间隔严重地限制了Q学习的效率。因此，提出控制区间自适应的强化学习方法，用来在学习过程中不断地缩小控制量的取值范围。However, through the above settings of X, U, P, R, and γ, the learning of Q is very slow, mainly because the range of discrete values of the applied force is very wide, and there are many interval points between the discrete forces, resulting in the Q list The dimensionality of is very high, which is easy to cause the curse of dimensionality. However, if there are few discrete points of force, the learning effect will be poor and it will be difficult to obtain a better control strategy. The value range (control interval) and interval of the control quantity severely limit the efficiency of Q-learning. Therefore, a reinforcement learning method with adaptive control interval is proposed, which is used to continuously narrow the value range of the control variable during the learning process.

argmin_INE[L(x，a)]，argmin _IN E[L(x,a)],

IN＝[min{a_m}，max{a_m}]， (12)IN=[min _{am }, max _{am }], (12)

其中，E[L(x，a)]是每个时刻代价之和(总代价)，IN表示自适应地优化选择控制量的区间。为了尽可能地近似全局最优解，设置更大范围的控制区间来开始训练过程。根据成本函数的结果，选择产生最小成本的动作空间中的控制区间作为下一次学习的新的控制区间。在此过程中，控制区间在缩小的过程中，同时保持控制的离散数量M不变，控制越来越精细，即每个控制区间的动作数量是恒定的，每个区间逐渐减小并且间隔逐渐减小。也就是第l次学习的控制区间和第l+1次学习的控制区间满足Among them, E[L(x, a)] is the sum of the cost at each moment (total cost), and IN represents the interval for adaptively optimizing the selection of the control amount. In order to approximate the global optimal solution as much as possible, a larger control interval is set to start the training process. According to the result of the cost function, the control interval in the action space that produces the minimum cost is selected as the new control interval for the next learning. In this process, the control interval is in the process of shrinking, while keeping the discrete number M of control constant, the control becomes more and more refined, that is, the number of actions in each control interval is constant, and each interval gradually decreases and the interval gradually decreases. decrease. That is, the control interval of the lth learning and the control interval of the l+1th learning satisfy

[min{a_m}，max{a_m}](l)就等于在控制区间训练l次后的(12)式中的IN，这可以看成是控制区间自适应地“缩小”步骤。[min{a _m }, max{a _m }](l) is equal to IN in formula (12) after training l times in the control interval, which can be regarded as an adaptive "shrinking" step of the control interval.

是平移参数。这可以看成是控制区间自适应地“平移”步骤。

is the translation parameter. This can be seen as an adaptive "translation" step of the control interval.

优化以成本函数来自适应选择控制空间的过程是迭代的。并且一旦控制区间收敛到产生成本函数的较小值的时间间隔，则控制区间成为最优的控制区间，并确定较优的离散控制序列。The process of optimization to adapt the selection control space with a cost function is iterative. And once the control interval converges to the time interval that produces the smaller value of the cost function, the control interval becomes the optimal control interval, and a better discrete control sequence is determined.

本发明小车-二级倒立摆系统的优化控制步骤：针对高斯过程模型，利用自适应控制区间的强化学习只能得到有限的控制决策集合，跟具体控制量的离散程度有关，没法得到最优的连续控制集合中的控制序列。为了得到最优的控制序列，提出小车-二级倒立摆系统高斯过程模型的优化控制。首先，对公式(4)变分得到The optimal control steps of the trolley-two-stage inverted pendulum system of the present invention: for the Gaussian process model, only a limited control decision set can be obtained by using the reinforcement learning of the adaptive control interval, which is related to the discrete degree of the specific control quantity, and the optimal control cannot be obtained. A control sequence in the continuous control set of . In order to obtain the optimal control sequence, the optimal control of the Gaussian process model of the trolley-two-stage inverted pendulum system is proposed. First, the variation of formula (4) is obtained

其中，Δ表示微变量。由于初始条件是固定值，因此初始条件的变分都为0，即Δμ₀＝0，Δη₀＝0。由于Δu_i可以是任意数值，为了形式简单，可将Δu_i设置成Δu_i＝1，得到Δμ_i和Δη_i的迭代形式Among them, Δ represents the microvariable. Since the initial condition is a fixed value, the variation of the initial condition is 0, that is, Δμ ₀ =0, Δη ₀ =0. Since Δu _i can be any value, in order to simplify the form, Δu _i can be set to Δu _i = 1, and the iterative form of Δu _i and Δη _i can be obtained

其中，Δμ_i和Δη_i由公式(16)给出，μ_i和u_i由公式(4)给出。对以上设置的小车-二级倒立摆系统优化控制方法进行实施验证，小车的参数数据为：小车质量为0.5kg，第1根和第二根倒立摆的质量为0.5kg，长度为0.1m，小车与地面的摩擦系数为0.1。状态权值系数

为[0 1 1 1 15 15]^·。控制权值系数Z为0.01。Q学习前的控制区间为[-250，250]，Q学习后的控制区间为[-116，93]。优化控制后，最优的目标函数值为9338。图6为本发明小车-二级倒立摆系统在初始猜测情况下各个时刻的目标函数均值和方差图。图7为系统在最优控制下各个时刻的目标函数均值和方差图。图8为系统的最优控制曲线。图9为系统第一根倒立摆的角度均值的变化图。图10为系统第二根倒立摆的角度均值的变化图。Among them, Δμ _i and Δη _i are given by formula (16), and μ _i and u _i are given by formula (4). Carry out verification on the optimal control method of the trolley-two-stage inverted pendulum system set above. The parameter data of the trolley is: the mass of the trolley is 0.5kg, the mass of the first and second inverted pendulum is 0.5kg, and the length is 0.1m. The coefficient of friction between the car and the ground is 0.1. State weight coefficient

is [0 1 1 1 15 15] ^· . The control weight coefficient Z is 0.01. The control interval before Q learning is [-250, 250], and the control interval after Q learning is [-116, 93]. After optimizing the control, the optimal objective function value is 9338. Fig. 6 is a diagram of the mean value and variance of the objective function at each moment in the case of the initial guess of the trolley-two-stage inverted pendulum system of the present invention. Figure 7 is the mean value and variance diagram of the objective function at each moment of the system under optimal control. Figure 8 is the optimal control curve of the system. Fig. 9 is a change diagram of the angle mean value of the first inverted pendulum of the system. Fig. 10 is a change diagram of the mean value of the angle of the second inverted pendulum of the system.

Claims

1. a dolly-two-stage inverted pendulum system optimization control method, is applied to comprise dolly, the dolly-two-stage inverted pendulum system of the first inverted pendulum and the second inverted pendulum, is characterized in that, comprises the following steps:

S10, determine the 6 states of the trolley-two-stage inverted pendulum system, and obtain the changes of the 6 states over time after applying a certain force; set the Gaussian kernel function, and by taking the current state control pair as input, the state change amount As the output, the value of the Gaussian kernel function hyperparameter is trained; through the joint probability distribution function, the relationship between the distribution of the state control pair at the i-th moment and the state distribution at the i+1th moment is obtained, that is, the Gaussian process model;

S20, setting the state of the car-two-stage inverted pendulum system, the time step and the discrete value of the control variable; defining the expected form of the cost function, setting the reward and punishment mechanism, and specifying the update rule of the Q value of each time step; In the early stage of the learning process, the value range of the control variable is continuously narrowed, and in the later stage of the learning process, the value range of the control variable is continuously translated; when it converges to a smaller value of the cost function, the optimal control interval and the better discrete control sequence are determined. ;

S30, after the Gaussian process model and the optimal control interval are obtained, the formula Gaussian process model and the initial condition variation are obtained to obtain the iterative forms of _Δμi and _Δηi , combined with the iterative calculation of the Gaussian process model, the total objective function value and gradient are obtained value;

S40, call the gradient-based optimization solver, and use the learned better discrete control sequence as the initial guess of the optimization control, and iteratively solve to obtain the optimal control force sequence through the calculation of the gradient value and the total objective function value;

In the control process of the trolley-two-stage inverted pendulum system, the control force is optimally designed so that the two-stage inverted pendulum system satisfies the angle of the first inverted pendulum and the second inverted pendulum at time T. Firstly, reinforcement learning is used To get as close as possible to the global optimal control interval and obtain a better discrete control sequence;

In reinforcement learning, the optimal control system of the trolley-two-stage inverted pendulum system satisfies M=<X, U, P, R, γ> Markov decision process, and the respective definitions of X, U, P, R, and γ are as follows:

X is the six state vectors x of the trolley-two-stage inverted pendulum system, namely

X=<x(j)>=<μ _j ,η _j >,j=0,1,...,N, (5)

In the formula, <> means set, j is the time step, take N=10,

U is the action space containing all possible operations, that is, the discrete value of the applied force, and the reasonable control discrete value is set to a _m , then

U={a _m },m=1,2,...,M, (6)

In the formula, M is the number of control variables, each time M=100, P is the state transition probability function, P[x(j+1)=x'|x(j)=x,u(j) =a]=P(x,a,x'),

R is the state and action reward at each time step j=0,1,...,N. The purpose of learning control is to make the two-stage inverted pendulum system satisfy the first inverted pendulum and the second inverted pendulum at time T. The angle of the pendulum is 2π, the reward function can be defined by the control cost function, and the control cost function at the jth moment is defined as:

In the formula,

and R' are the weight coefficients of state and control respectively, which are given in advance according to the actual situation; x ^tg (j) is the target position;

Since the state quantity x _j contains the mean and variance, and the expectation is calculated on both sides, the following control cost function at the jth moment can be obtained

Among them, L represents the objective function. When it is far beyond or far less than 2π, a penalty should be given to the optimal control system of the trolley-two-stage inverted pendulum system; when it is close to 2π, a penalty should be given to the optimal control system of the trolley-two-stage inverted pendulum system Reward, the index C _j =x(j)-Z _j close to 2π and away from 2π, where Z _j is the set value of the trend close to 2π at each moment

In the formula, take λ as π/10;

γ is the discount coefficient, assuming that the corresponding reward will be discounted as time goes by, the reward of each time step is based on the reward R _j of the previous step and the discount coefficient γ0≤γ≤1, take γ as 0.85; the cumulative reward is expressed as

Using a Q-learning algorithm, at each discrete time step, the trolley-two-stage inverted pendulum system optimization control system observes the current state and acts on the state that maximizes reward; uses a rule to update the Q value at each time step

Q ^j+1 (x(j),a _j )＝Q ^j (x(j),a _j )+∈×[R _j+1 +γ(maxQ ^j (x(j+1),a _j+1 )-Q ^j (x(j).a _j ))],

(11)

where Q ^j (x(j),a _j ) is the Q value of state x(j) and action a _j at time j, ∈ is the learning rate describing the possibility of exploitation in ∈-greed algorithm, exploring under a large number of training For all combinatorial training, ∈ is the learning rate describing the exploited likelihood in the ∈-greed algorithm. To explore all combinatorial training under a large number of trainings, the Q-learning algorithm should generate Q-values for all state-action combinations. The action with the largest Q value is taken as the best action;

S20 specifically includes:

The Q learning algorithm should generate Q values for all state-action combinations. In each state, select the action with the largest Q value as the best action, and further propose an adaptive reinforcement learning method for the control interval, which is used to continuously improve the learning process. narrow the value range of the control variable,

argmin _IN E[L(x,a)],

IN＝[min{a _m },max{a _m }], (12)

Among them, E[L(x,a)] is the sum of the cost at each moment, IN means to adaptively optimize the selection of the interval of the control amount, set a larger range of control interval to start the training process, and according to the result of the cost function, select The control interval in the action space that produces the minimum cost is used as the new control interval for the next learning. In the process, the control interval is shrinking while keeping the discrete quantity M of control unchanged, and the control becomes more and more refined, namely The number of actions in each control interval is constant, each interval gradually decreases and the interval gradually decreases, that is, the control interval of the lth learning and the control interval of the l+1th learning satisfy

[min{a _m },max{a _m }](l) is equal to IN in formula (12) after training l times in the control interval,

When a set of actions converges to a certain control interval, the optimal control method of the trolley-two-stage inverted pendulum system starts to continue to test adjacent values by moving the control interval, that is, the control interval of the nth learning and the n+1th learning The control interval satisfies

is the translation parameter,

Optimization The process of adaptively selecting the control space with a cost function is iterative, and once the control interval converges to the time interval that produces the smaller value of the cost function, the control interval becomes the optimal control interval and determines the optimal discrete control sequence .

2. the dolly-two-stage inverted pendulum system optimal control method as claimed in claim 1, is characterized in that, in S10, determine the 6 states of dolly-two-stage inverted pendulum system to be respectively: the displacement x ₁ of dolly, the displacement x of dolly Velocity x ₂ , angular velocity x ₃ and angle x ₅ of the first inverted pendulum, angular velocity x ₄ and angle x ₆ of the second inverted pendulum, let

is the matrix transposition, and set the force as u(t), the car and the two inverted pendulums are at rest at the initial moment, that is, x ₁ (0)=0, x ₂ (0)=0, x ₃ (0) = 0, x ₄ (0) = 0, x ₅ (0) = π, x ₆ (0) = π, the entire control time scale is T = 1.2 seconds, each control action is 0.02 seconds, the entire control time needs 60 A control sequence, that is, discrete time i=0,1,...,60, the state and control at the corresponding time are x(i) and u(i).

3. dolly-two-stage inverted pendulum system optimization control method as claimed in claim 2, is characterized in that, Gaussian modeling process is as follows:

The mapping relationship to be learned by the Gaussian process is the current state control pair xp={x(0),u(0),...,x(59),u(59)} to the state change Δx={Δx( 0),...,Δx(59)}, where the symbol Δ is the state change, Δx(i)=x(i+1)-x(i),

According to Gaussian process modeling, for a given 1 state control pair xp ^* = {x ^* , u ^* }, it is related to 60 state control pairs xp = {x(0),u(0),..., The relationship between x(59), u(59)} is as follows:

In the formula, GP represents the Gaussian probability distribution, I is the identity matrix, σ _w is a hyperparameter related to noise, K, k is the Gaussian kernel function, when the vector operates with the vector, it is recorded as K; when the number and the number or the number When operating with a vector, it is denoted as k; f is the dynamic model of the unknown car-two-stage inverted pendulum system, and the kernel function generally defines the square exponential covariance function

In the formula, y _m and y _n are fixed values or matrices; σ _f is a hyperparameter related to variance; W is a weight matrix, which only has values on the diagonal, and these values are all hyperparameters. The input of the control state pair and the output of the state change amount are optimized to obtain the specific value of the hyperparameter;

Obtain the mean of f(x(i),u(i)) through the joint probability distribution (1)

and variance

Thus, according to the distribution of the state control pair at the i-th moment, the predicted value of the state distribution at the i+1th moment is obtained through accurate calculation:

In the formula,

is the covariance between the state and the corresponding state transition, obtained by the matching method, so as to obtain the Gaussian process model, that is, the distribution of the state-control pair at the i-th moment, and calculate the distribution of the state at the i+1-th moment.

4. dolly-two-stage inverted pendulum system optimal control method as claimed in claim 3 is characterized in that, in order to obtain optimum control sequence, proposes the optimal control of dolly-two-stage inverted pendulum system Gaussian process model, at first, to The variation of formula (4) gives

Among them, Δ represents a micro variable. Since the initial condition is a fixed value, the variation of the initial condition is 0, that is, Δμ ₀ = 0, Δη ₀ = 0. Since Δu _i is an arbitrary value, for the sake of simplicity, it is set as Δu _i =1, get the iterative form of Δμ _i and Δη _i

The expected value of the overall objective function

Change the expectation E[L(x(i),u _i )] of the i-th step to get

The variation of the total objective function over the entire time interval, that is, the gradient of the objective function becomes

Among them, Δμ _i and Δη _i are given by formula (16), and μ _i and η _i are given by formula (4).