CN110908280B - Optimal control method for a cart-double inverted pendulum system - Google Patents

Optimal control method for a cart-double inverted pendulum system

Info

Publication number
CN110908280B
Authority
CN
China
Prior art keywords
control
inverted pendulum
state
value
interval
Prior art date
Legal status
Active
Application number
CN201911043225.4A
Other languages
Chinese (zh)
Other versions
CN110908280A (en)
Inventor
卢荣华 (Lu Ronghua)
陈特欢 (Chen Tehuan)
Current Assignee
Ningbo University
Original Assignee
Ningbo University
Priority date
Filing date
Publication date
Application filed by Ningbo University
Priority to CN201911043225.4A
Publication of CN110908280A
Application granted
Publication of CN110908280B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance


Abstract

An embodiment of the invention discloses an optimal control method for a cart-double inverted pendulum system, comprising the following steps. S10: set a Gaussian kernel function and train the values of its hyperparameters, taking the current state-control pairs as inputs and the state increments as outputs; through the joint probability distribution, obtain the relationship between the distribution of the state-control pair at instant i and the state distribution at instant i+1, i.e., the Gaussian process model. S20: determine the optimal control interval and a near-optimal discrete control sequence. S30: after obtaining the Gaussian process model and the optimal control interval, take variations of the Gaussian process model and the initial conditions and, combined with iterative evaluation of the Gaussian process model, obtain the total objective function value and the gradient value. S40: call a gradient-based optimization solver, use the learned near-optimal discrete control sequence as the initial guess for the optimal control, and iteratively solve for the optimal control force sequence through evaluation of the gradient and the total objective function.

Description

An Optimal Control Method for a Cart-Double Inverted Pendulum System

Technical Field

The invention relates to the field of control of cart-double inverted pendulum systems, and in particular to an optimal control method for a cart-double inverted pendulum system.

Background

The cart-double inverted pendulum system is a classic fast, multivariable, nonlinear, unstable system and a canonical plant in the control field. Many control algorithms, including PID, fuzzy PID, and robust control, have been implemented on this system. However, a prerequisite for control design is modeling. Current control of the cart-double inverted pendulum system is based on mechanistic models, i.e., deterministic models derived from physical principles, whose parameters involve the dimensions of the cart, the dimensions of the inverted pendulums, and so on.

With the development of intelligent algorithms, especially reinforcement learning, control has gradually shifted from deterministic mechanistic models toward model-free control. Model-free control, however, has drawbacks: it requires too many learning trials, its control performance is hard to analyze quantitatively, and controller design is inefficient.

Summary of the Invention

In view of the above technical problems, the present invention provides an optimal control method for a cart-double inverted pendulum system.

To solve the above technical problems, the present invention adopts the following technical solution:

An optimal control method for a cart-double inverted pendulum system, applied to a cart-double inverted pendulum system comprising a cart, a first inverted pendulum, and a second inverted pendulum, comprising the following steps:

S10: determine the six states of the cart-double inverted pendulum system and, after applying a given force, obtain the evolution of the six states over time; set a Gaussian kernel function and, taking the current state-control pairs as inputs and the state increments as outputs, train the values of the Gaussian kernel hyperparameters; through the joint probability distribution, obtain the relationship between the distribution of the state-control pair at instant i and the state distribution at instant i+1, i.e., the Gaussian process model;

S20: set the states of the cart-double inverted pendulum system, the time step, and the discrete values of the control variable; define the expected form of the cost function, set the reward/penalty mechanism, and specify the update rule for the Q value at each time step; continuously shrink the value range of the control in the early phase of learning, and continuously translate it in the later phase; when the process converges to a small value of the cost function, determine the optimal control interval and a near-optimal discrete control sequence;

S30: after obtaining the Gaussian process model and the optimal control interval, take variations of the Gaussian process model and the initial conditions to obtain the iterative forms of Δμ_i and Δη_i; combined with iterative evaluation of the Gaussian process model, obtain the total objective function value and the gradient value;

S40: call a gradient-based optimization solver, use the learned near-optimal discrete control sequence as the initial guess for the optimal control, and, through evaluation of the gradient and the total objective function, iteratively solve for the optimal control force sequence.
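For orientation, the four steps assemble into the following high-level loop. This is only an illustrative Python sketch: `gp_rollout_cost` is a placeholder standing in for the Gaussian-process rollout objective built in S10/S30, and the interval and initial guess shown here stand in for the outputs of the S20 learning stage.

```python
import numpy as np
from scipy.optimize import minimize

def gp_rollout_cost(u_seq):
    # placeholder objective: the real method rolls the Gaussian process
    # model forward over the horizon and sums the expected stage costs
    return float(np.sum(u_seq ** 2))

u_init = np.zeros(60)                    # S20 supplies a learned discrete sequence
interval = (-116.0, 93.0)                # S20 supplies the optimal control interval
res = minimize(gp_rollout_cost, u_init,  # S40: gradient-based refinement (SQP-like)
               method="SLSQP", bounds=[interval] * len(u_init))
u_opt = res.x                            # optimal control force sequence
```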

Preferably, in S10, the six states of the cart-double inverted pendulum system are determined as: the cart displacement x_1, the cart velocity x_2, the angular velocity x_3 and angle x_5 of the first inverted pendulum, and the angular velocity x_4 and angle x_6 of the second inverted pendulum. Let x = [x_1 x_2 ... x_6]^T, where T denotes the matrix transpose, and let the applied force be u(t). At the initial instant (instant 0) the cart and both inverted pendulums are at rest, i.e., x_1(0)=0, x_2(0)=0, x_3(0)=0, x_4(0)=0, x_5(0)=π, x_6(0)=π. The total control horizon is T = 1.2 seconds with a control action every 0.02 seconds, so the horizon requires 60 control actions, i.e., discrete instants i = 0, 1, ..., 60, with state x(i) and control u(i) at the corresponding instant.

Preferably, the Gaussian modeling process is as follows. The mapping the Gaussian process learns is from the current state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} to the state increments Δx = {Δx(0), ..., Δx(59)}, where the symbol Δ denotes the state increment, Δx(i) = x(i+1) - x(i).

According to Gaussian process modeling, for a given state-control pair xp* = {x*, u*}, its relationship with the 60 state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} is as follows:

[Δx; f(xp*)] ~ GP(0, [K(xp, xp) + σ_w^2 I, k(xp, xp*); k(xp*, xp), k(xp*, xp*)])   (1)

where GP denotes a Gaussian probability distribution, I is the identity matrix, σ_w is a noise-related hyperparameter, and K, k are Gaussian kernel functions: the kernel is written K when operating on a vector and a vector, and k when operating on a scalar and a scalar or a scalar and a vector. f is the unknown dynamics model of the cart-double inverted pendulum system. The kernel function is generally defined as the squared-exponential covariance function

k(y_m, y_n) = σ_f^2 exp(-(y_m - y_n)^T W^{-1} (y_m - y_n) / 2)   (2)

where y_m and y_n may be fixed values or matrices; σ_f is a variance-related hyperparameter; and W is a weight matrix with nonzero entries only on its diagonal, these entries also being hyperparameters. Using the specific state-control pairs as inputs and the state increments as outputs, the specific values of the hyperparameters are obtained by optimization;

Through the joint probability distribution (1), the mean E[f_i] and the variance var[f_i] of f_i = f(x(i), u(i)) are obtained:

E[f_i] = k(xp_i, xp) [K(xp, xp) + σ_w^2 I]^{-1} Δx,
var[f_i] = k(xp_i, xp_i) - k(xp_i, xp) [K(xp, xp) + σ_w^2 I]^{-1} k(xp, xp_i),   (3)

where xp_i = {x(i), u(i)}.

Thus, from the distribution of the state-control pair at instant i, the predicted state distribution at instant i+1 is obtained by exact computation:

μ_{i+1} = μ_i + E[f_i],
η_{i+1} = η_i + var[f_i] + cov[x(i), f_i] + cov[f_i, x(i)],   (4)

where cov[x(i), f_i] is the covariance between the state and the corresponding state transition, obtainable by the moment-matching method. This yields the Gaussian process model: from the distribution of the state-control pair at instant i, the distribution of the state at instant i+1 is computed.

Preferably, S20 specifically comprises:

In the control process of the cart-double inverted pendulum system, the control force is designed optimally so that at time T the angles of the first and second inverted pendulums are both 2π. Reinforcement learning is first used to approach the globally optimal control interval as closely as possible and to obtain a near-optimal discrete control sequence;

In reinforcement learning, the optimization control system of the cart-double inverted pendulum system satisfies a Markov decision process (a discrete process) M = <X, U, P, R, γ>, where X, U, P, R, and γ are defined as follows:

X is the set of six-dimensional state vectors x of the cart-double inverted pendulum system, i.e.

X = <x(j)> = <μ_j, η_j>, j = 0, 1, ..., N,   (5)

where <> denotes a set, j is the time step, and N = 10.

U is the action space containing all possible actions, i.e., the discrete values of the applied force (a finite number). Denoting the admissible discrete control values by a_m,

U = {a_m}, m = 1, 2, ..., M,   (6)

where M is the number of control values taken, here M = 100. P is the state-transition probability function, P[x(j+1) = x' | x(j) = x, u(j) = a] = P(x, a, x').

R is the reward for the state and action at each time step j = 0, 1, ..., N. The purpose of learning control is that at time T the angles of the first and second inverted pendulums equal 2π. The reward function can be defined through the control cost function; the control cost function at instant j is defined as:

L_j = (x(j) - x_tg(j))^T Q_x (x(j) - x_tg(j)) + u(j)^T Z u(j)   (7)

where Q_x and Z are the weight coefficients of the state and the control, respectively, given in advance according to the actual situation, and x_tg(j) is the target position;

Since the state x_j has both a mean μ_j and a variance η_j, taking expectations of both sides gives the following control cost function at instant j:

E[L_j] = (μ_j - x_tg(j))^T Q_x (μ_j - x_tg(j)) + tr(Q_x η_j) + u(j)^T Z u(j)   (8)

where L denotes the objective function. When the angle far exceeds or falls far short of 2π, the optimization control system of the cart-double inverted pendulum system should receive a penalty (a negative reward); when the angle is close to 2π, it should receive a reward. The index of closeness to or distance from 2π is C_j = x(j) - Z_j, where Z_j is the setpoint of the approach-to-2π trend at each instant

Z_j = π + jλ, j = 0, 1, ..., N,   (9)

where λ is taken as π/10;

γ is the discount factor. Rewards are assumed to be discounted over time: the reward at each time step is based on the previous reward R_j and the discount factor γ (0 ≤ γ ≤ 1), with γ taken as 0.85. The cumulative reward is expressed as

G = Σ_{j=0}^{N} γ^j R_j.   (10)

A Q-learning algorithm is adopted: at each discrete time step, the optimization control system of the cart-double inverted pendulum system observes the current state and acts on the state that maximizes the reward. The Q value at each time step is updated using the rule

Q_{j+1}(x(j), a_j) = Q_j(x(j), a_j) + ε × [R_{j+1} + γ (max_{a_{j+1}} Q_j(x(j+1), a_{j+1}) - Q_j(x(j), a_j))],   (11)

where Q_j(x(j), a_j) is the Q value of state x(j) and action a_j at time j, and ε is the learning rate describing the exploitation probability in the ε-greedy algorithm. Exploring all combinations under extensive training, the Q-learning algorithm should generate Q values for all state-action combinations; in each state, the action with the largest Q value is selected as the best action. A control-interval-adaptive reinforcement learning method is further proposed to continuously narrow the value range of the control during learning,

argmin_{IN} E[L(x, a)],
IN = [min{a_m}, max{a_m}],   (12)

where E[L(x, a)] is the sum of the costs at all instants (the total cost), and IN denotes the interval from which the control values are adaptively selected. A wider control interval is set to start the training process; according to the cost function results, the control interval in the action space that produces the minimum cost is selected as the new control interval for the next round of learning. During this process the control interval shrinks while the discrete number of controls M is kept constant, so the control becomes progressively finer: the number of actions per interval is constant while each interval, and the spacing within it, gradually decreases. That is, the control interval of the l-th learning round and that of the (l+1)-th round satisfy

[min{a_m}, max{a_m}]^{(l+1)} ⊂ [min{a_m}, max{a_m}]^{(l)},   (13)

where [min{a_m}, max{a_m}]^{(l)} equals IN in (12) after l rounds of training on the control interval;

When a set of actions converges to a certain control interval, the optimization control system of the cart-double inverted pendulum system begins to test adjacent values by translating the control interval; that is, the control interval of the n-th learning round and that of the (n+1)-th round satisfy

[min{a_m}, max{a_m}]^{(n+1)} = [min{a_m} + ζ, max{a_m} + ζ]^{(n)},   (14)

where ζ is the translation parameter.

The process of adaptively selecting the control space by optimizing the cost function is iterative; once the control interval converges to an interval that produces a small value of the cost function, it becomes the optimal control interval, and the near-optimal discrete control sequence is determined.

Preferably, to obtain the optimal control sequence, optimal control of the Gaussian process model of the cart-double inverted pendulum system is proposed. First, taking the variation of equation (4) gives

Δμ_{i+1} = Δμ_i + (∂E[f_i]/∂μ_i) Δμ_i + (∂E[f_i]/∂η_i) Δη_i + (∂E[f_i]/∂u_i) Δu_i,
Δη_{i+1} = Δη_i + (∂var[f_i]/∂μ_i) Δμ_i + (∂var[f_i]/∂η_i) Δη_i + (∂var[f_i]/∂u_i) Δu_i,   (15)

where Δ denotes an infinitesimal variation. Since the initial conditions are fixed values, their variations are zero, i.e., Δμ_0 = 0, Δη_0 = 0. Since Δu_i can be any value, for simplicity of form it is set to Δu_i = 1, giving the iterative forms of Δμ_i and Δη_i

Δμ_{i+1} = Δμ_i + (∂E[f_i]/∂μ_i) Δμ_i + (∂E[f_i]/∂η_i) Δη_i + ∂E[f_i]/∂u_i,
Δη_{i+1} = Δη_i + (∂var[f_i]/∂μ_i) Δμ_i + (∂var[f_i]/∂η_i) Δη_i + ∂var[f_i]/∂u_i.   (16)

The expected value of the total objective function is

E[J] = Σ_{i=0}^{59} E[L(x(i), u_i)].   (17)

Taking the variation of the expectation E[L(x(i), u_i)] at step i gives

ΔE[L(x(i), u_i)] = (∂E[L]/∂μ_i) Δμ_i + (∂E[L]/∂η_i) Δη_i + ∂E[L]/∂u_i.   (18)

The variation of the total objective function over the whole time interval, i.e., the gradient of the objective function, becomes

∇E[J] = Σ_{i=0}^{59} ΔE[L(x(i), u_i)],   (19)

where Δμ_i and Δη_i are given by equation (16), and μ_i and η_i are given by equation (4).

The beneficial effects of the present invention are:

(1) For the cart-double inverted pendulum system, a data-driven Gaussian process model is proposed. Unlike traditional deterministic models such as mechanistic models, it uses a mean and a variance to represent the operating state of the system, which is closer to the actual motion of the system.

(2) For the cart-double inverted pendulum system (an uncertain system), control-interval-adaptive reinforcement learning and optimal control are designed, extending the learning and control method to the field of uncertain systems.

(3) Considering the learning-efficiency problems of traditional reinforcement learning such as Q-learning, control-interval-adaptive reinforcement learning is proposed: the range of control decisions is continuously narrowed, an optimal control interval is selected adaptively, and a near-optimal discrete control sequence is determined.

(4) Since optimization problems easily fall into local optima, the proposed reinforcement learning determines the optimal control interval and, with the resulting near-optimal discrete control sequence as the initial guess, optimal control determines the optimal control curve (values), ensuring the global optimum is sought to the greatest extent.

(5) Since traditional reinforcement learning requires the control values to be finite in number, after the reinforcement learning decision an optimal control algorithm over a continuous control interval is applied, finally yielding the optimal control input.

(6) In the experiments of the present invention, the control interval before Q-learning is [-250, 250] and after Q-learning is [-116, 93]. After optimal control, the optimal objective function value is 9338; the mean and variance of the objective function at each instant under optimal control and the optimal control curve are given in the figures.

Brief Description of the Drawings

Fig. 1 is a simplified diagram of the experimental equipment of the cart-double inverted pendulum system of the present invention.

Fig. 2 is a flowchart of the Gaussian process modeling of the cart-double inverted pendulum system of the present invention.

Fig. 3 is a flowchart of the control-interval-adaptive reinforcement learning of the cart-double inverted pendulum system of the present invention.

Fig. 4 is a flowchart of the optimal control of the cart-double inverted pendulum system of the present invention.

Fig. 5 is a flowchart of the Gaussian process modeling, interval-adaptive reinforcement learning, and optimal control of the cart-double inverted pendulum system of the present invention.

Fig. 6 shows the mean and variance of the objective function at each instant of the cart-double inverted pendulum system of the present invention for the initial guess.

Fig. 7 shows the mean and variance of the objective function at each instant of the cart-double inverted pendulum system of the present invention under optimal control.

Fig. 8 is the optimal control curve of the cart-double inverted pendulum system of the present invention.

Fig. 9 shows the evolution of the mean angle of the first inverted pendulum of the cart-double inverted pendulum system of the present invention.

Fig. 10 shows the evolution of the mean angle of the second inverted pendulum of the cart-double inverted pendulum system of the present invention.

Detailed Description

As shown in Fig. 1, the simplified experimental setup of the cart-double inverted pendulum system of the present invention consists of the cart 1, the first inverted pendulum 2, and the second inverted pendulum 3. The arrow on the right is the force applied to the cart, i.e., the control input of the system. The curved arrows represent the rotation angles of the inverted pendulums.

The optimal control method for a cart-double inverted pendulum system provided by an embodiment of the present invention is applied to the cart-double inverted pendulum system shown in Fig. 1, and specifically comprises the following steps:

S10: determine the six states of the cart-double inverted pendulum system and, after applying a given force, obtain the evolution of the six states over time; set a Gaussian kernel function and, taking the current state-control pairs as inputs and the state increments as outputs, train the values of the Gaussian kernel hyperparameters; through the joint probability distribution, obtain the relationship between the distribution of the state-control pair at instant i and the state distribution at instant i+1, i.e., the Gaussian process model;

S20: set the states of the cart-double inverted pendulum system, the time step, and the discrete values of the control variable; define the expected form of the cost function, set the reward/penalty mechanism, and specify the update rule for the Q value at each time step; continuously shrink the value range of the control in the early phase of learning, and continuously translate it in the later phase; when the process converges to a small value of the cost function, determine the optimal control interval and a near-optimal discrete control sequence;

S30: after obtaining the Gaussian process model (4) and the optimal control interval, take variations of the Gaussian process model and the initial conditions to obtain the iterative forms of Δμ_i and Δη_i. Combined with iterative evaluation of the Gaussian process model, this yields the total objective function value (17) and the gradient value (19).

S40: call a gradient-based optimization solver, such as SQP in MATLAB, and use the learned near-optimal discrete control sequence as the initial guess for the optimal control; through evaluation of the gradient (19) and the total objective function (17), iteratively solve for the optimal control force sequence.

As shown in Fig. 2, the Gaussian process modeling steps of the cart-double inverted pendulum system of the present invention are as follows. The system comprises six states: the cart displacement x_1, the cart velocity x_2, the angular velocity x_3 and angle x_5 of the first inverted pendulum, and the angular velocity x_4 and angle x_6 of the second inverted pendulum. Let x = [x_1 x_2 ... x_6]^T, where T denotes the matrix transpose, and let the applied force be u(t). At the initial instant (instant 0) the cart and both inverted pendulums are at rest, i.e., x_1(0)=0, x_2(0)=0, x_3(0)=0, x_4(0)=0, x_5(0)=π, x_6(0)=π. The total control horizon is T = 1.2 seconds with a control action every 0.02 seconds; hence the horizon requires 60 control actions, i.e., discrete instants i = 0, 1, ..., 60, with state x(i) and control u(i) at the corresponding instant.
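For illustration, the state vector, initial conditions, and time discretization described above can be set up as follows; the variable names are chosen for this sketch and are not from the patent.

```python
import numpy as np

# state x = [x1 x2 x3 x4 x5 x6]^T: cart position and velocity, angular
# velocities of pendulums 1 and 2, angles of pendulums 1 and 2
x0 = np.array([0.0, 0.0, 0.0, 0.0, np.pi, np.pi])  # at rest, both angles at pi

T_total, dt = 1.2, 0.02       # horizon and control period, in seconds
n_steps = int(T_total / dt)   # 60 control actions, instants i = 0..60
u = np.zeros(n_steps)         # control force sequence u(i)
```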

The Gaussian process modeling procedure: the dynamics of the system is simply the mapping from the state and control at the current instant to the state (or state increment) at the next instant. Hence the mapping the Gaussian process learns is from the current state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} to the state increments Δx = {Δx(0), ..., Δx(59)}, where the symbol Δ denotes the state increment, Δx(i) = x(i+1) - x(i).

According to Gaussian process modeling, for a given state-control pair xp* = {x*, u*}, its relationship with the 60 state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} is as follows:

[Δx; f(xp*)] ~ GP(0, [K(xp, xp) + σ_w^2 I, k(xp, xp*); k(xp*, xp), k(xp*, xp*)])   (1)

where GP denotes a Gaussian probability distribution, I is the identity matrix, σ_w is a noise-related hyperparameter, and K, k are Gaussian kernel functions: the kernel is written K when operating on a vector and a vector, and k when operating on a scalar and a scalar or a scalar and a vector. f is the unknown dynamics model of the cart-double inverted pendulum system. The kernel function is generally defined as the squared-exponential covariance function

k(y_m, y_n) = σ_f^2 exp(-(y_m - y_n)^T W^{-1} (y_m - y_n) / 2)   (2)

where y_m and y_n may be fixed values or matrices; σ_f is a variance-related hyperparameter; and W is a weight matrix with nonzero entries only on its diagonal, these entries also being hyperparameters. Using the specific state-control pairs as inputs and the state increments as outputs, the specific values of the hyperparameters are obtained by optimization.
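A minimal Python sketch of the squared-exponential kernel of equation (2), assuming W stores squared length scales on its diagonal; names and shapes are illustrative.

```python
import numpy as np

def se_kernel(A, B, sigma_f, lengthscales):
    # squared-exponential covariance of equation (2); A is (n, d), B is
    # (m, d), and lengthscales holds the square roots of W's diagonal
    D = (A[:, None, :] - B[None, :, :]) / lengthscales
    return sigma_f ** 2 * np.exp(-0.5 * np.sum(D ** 2, axis=-1))

# example: 60 x 60 covariance matrix between 7-dimensional
# state-control pairs (6 states + 1 control)
K = se_kernel(np.random.randn(60, 7), np.random.randn(60, 7),
              sigma_f=1.0, lengthscales=np.ones(7))
```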

Through the joint probability distribution (1), the mean E[f_i] and the variance var[f_i] of f_i = f(x(i), u(i)) are obtained:

E[f_i] = k(xp_i, xp) [K(xp, xp) + σ_w^2 I]^{-1} Δx,
var[f_i] = k(xp_i, xp_i) - k(xp_i, xp) [K(xp, xp) + σ_w^2 I]^{-1} k(xp, xp_i),   (3)

where xp_i = {x(i), u(i)}.

Thus, from the distribution of the state-control pair at instant i, the predicted state distribution at instant i+1 is obtained by exact computation:

μ_{i+1} = μ_i + E[f_i],
η_{i+1} = η_i + var[f_i] + cov[x(i), f_i] + cov[f_i, x(i)],   (4)

where cov[x(i), f_i] is the covariance between the state and the corresponding state transition, obtainable by the moment-matching method. This yields the Gaussian process model: from the distribution of the state-control pair at instant i, the distribution of the state at instant i+1 is computed.

Control-interval-adaptive reinforcement learning steps of the cart-double inverted pendulum system of the present invention: in the control process, the control force is designed optimally so that at time T the angles of the first and second inverted pendulums are 2π, i.e., both pendulums are upright. Since traditional optimization easily falls into local optima, reinforcement learning is first used to approach the globally optimal control interval as closely as possible and to obtain a near-optimal discrete control sequence.

In reinforcement learning, the optimization control system of the cart-double inverted pendulum system satisfies a Markov decision process (a discrete process) M = <X, U, P, R, γ>, where X, U, P, R, and γ are defined as follows:

X is the set of six-dimensional state vectors x of the cart-double inverted pendulum system, i.e.

X = <x(j)> = <μ_j, η_j>, j = 0, 1, ..., N,   (5)

where <> denotes a set and j is the time step. Since the step-size setting strongly affects learning efficiency, unlike the 60 time-discretization points of the Gaussian process, here N = 10 is taken, i.e., time is discretized into 10 steps.

U is the action space containing all possible actions, i.e., the discrete values of the applied force (a finite number). Denoting the admissible discrete control values by a_m,

U = {a_m}, m = 1, 2, ..., M,   (6)

where M is the number of control values taken, here M = 100. P is the state-transition probability function, P[x(j+1) = x' | x(j) = x, u(j) = a] = P(x, a, x').

R is the reward for the state and action at each time step j = 0, 1, ..., N. The purpose of learning control is that at time T the angles of the first and second inverted pendulums equal 2π. The reward function of this patent can be defined through the control cost function. The control cost function at instant j is defined as:

L_j = (x(j) - x_tg(j))^T Q_x (x(j) - x_tg(j)) + u(j)^T Z u(j)   (7)

where Q_x and Z are the weight coefficients of the state and the control, respectively, given in advance according to the actual situation, and x_tg(j) is the target position.

Since the state x_j has both a mean μ_j and a variance η_j, taking expectations of both sides gives the following control cost function at instant j:

E[L_j] = (μ_j - x_tg(j))^T Q_x (μ_j - x_tg(j)) + tr(Q_x η_j) + u(j)^T Z u(j)   (8)

where L denotes the objective function. When the angle far exceeds or falls far short of 2π, the optimization control system of the cart-double inverted pendulum system should receive a penalty (a negative reward); when the angle approaches 2π, it should receive a reward. With this in mind, the index of closeness to or distance from 2π is C_j = x(j) - Z_j, where Z_j is the setpoint of the approach-to-2π trend at each instant

Z_j = π + jλ, j = 0, 1, ..., N,   (9)

where λ is taken as π/10.
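The expected stage cost of equations (7)-(8) can be sketched as follows, assuming a diagonal state weight Q_x and a diagonal state variance, so that the expectation of the quadratic term picks up a trace correction; the function and argument names are assumptions of this example.

```python
import numpy as np

def expected_stage_cost(mu_j, eta_j, u_j, x_tg, Qx_diag, Z):
    # E[(x - x_tg)^T Qx (x - x_tg)] + Z u^2 for x ~ N(mu_j, diag(eta_j)):
    # the mean term plus the trace term tr(Qx diag(eta_j)), eq. (8)
    err = mu_j - x_tg
    return float(err @ (Qx_diag * err) + Qx_diag @ eta_j + Z * u_j ** 2)
```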

γ is the discount factor. Rewards are assumed to be discounted over time: the reward at each time step is based on the previous reward R_j and the discount factor γ (0 ≤ γ ≤ 1), with γ taken as 0.85. The cumulative reward is expressed as

G = Σ_{j=0}^{N} γ^j R_j.   (10)

A Q-learning algorithm is adopted. At each discrete time step, the optimization control system of the cart-double inverted pendulum system observes the current state and acts on the state that maximizes the reward. The Q value at each time step is updated using the rule

Q_{j+1}(x(j), a_j) = Q_j(x(j), a_j) + ε × [R_{j+1} + γ (max_{a_{j+1}} Q_j(x(j+1), a_{j+1}) - Q_j(x(j), a_j))],   (11)

where Q_j(x(j), a_j) is the Q value of state x(j) and action a_j at time j, and ε is the learning rate describing the exploitation probability in the ε-greedy algorithm. Exploring all combinations under extensive training, the Q-learning algorithm should generate Q values for all state-action combinations. In each state, the action with the largest Q value is selected as the best action.
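A tabular sketch of the update rule (11); the state discretization, the table shape, and the learning-rate value are assumptions of this example.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, lr=0.1, gamma=0.85):
    # equation (11): Q(s,a) += lr * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += lr * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q

Q = np.zeros((1000, 100))  # assumed discretized states x M=100 actions
Q = q_update(Q, s=0, a=3, r=-1.0, s_next=1)
```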

However, with the above settings of X, U, P, R, and γ, Q-learning is very slow, mainly because the range of discrete values of the applied force is very wide and the discrete force levels are very numerous, so the Q table has a very high dimension, easily causing the curse of dimensionality. Yet if too few discrete force levels are set, learning performs poorly and a good control policy is hard to obtain. The value range (control interval) and spacing of the control severely limit the efficiency of Q-learning. Therefore, a control-interval-adaptive reinforcement learning method is proposed to continuously narrow the value range of the control during learning,

argmin_{IN} E[L(x, a)],
IN = [min{a_m}, max{a_m}],   (12)

where E[L(x, a)] is the sum of the costs at all instants (the total cost), and IN denotes the interval from which the control values are adaptively selected. To approximate the global optimal solution as closely as possible, a wider control interval is set to start the training process. According to the cost function results, the control interval in the action space that produces the minimum cost is selected as the new control interval for the next round of learning. During this process the control interval shrinks while the discrete number of controls M is kept constant, so the control becomes progressively finer: the number of actions per interval is constant while each interval, and the spacing within it, gradually decreases. That is, the control interval of the l-th learning round and that of the (l+1)-th round satisfy

[min{a_m}, max{a_m}]^{(l+1)} ⊂ [min{a_m}, max{a_m}]^{(l)},   (13)

where [min{a_m}, max{a_m}]^{(l)} equals IN in (12) after l rounds of training on the control interval. This can be regarded as the adaptive "shrinking" step of the control interval.

When a set of actions converges to a certain control interval, the optimization control system of the cart-double inverted pendulum system begins to test adjacent values by translating the control interval; that is, the control interval of the n-th learning round and that of the (n+1)-th round satisfy

[min{a_m}, max{a_m}]^{(n+1)} = [min{a_m} + ζ, max{a_m} + ζ]^{(n)},   (14)

where ζ is the translation parameter. This can be regarded as the adaptive "translation" step of the control interval.

The process of adaptively selecting the control space by optimizing the cost function is iterative. Once the control interval converges to an interval that produces a small value of the cost function, it becomes the optimal control interval, and a near-optimal discrete control sequence is determined.
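The shrink and translate steps of equations (13)-(14) can be sketched as follows; centring the shrunken interval on the best-performing action and the shrink factor of 0.5 are assumptions of this example.

```python
import numpy as np

def shrink_interval(lo, hi, best_a, factor=0.5):
    # eq. (13): the new interval is contained in the old one; here it is
    # centred on the best-performing action with an assumed shrink factor
    half = 0.5 * factor * (hi - lo)
    return max(lo, best_a - half), min(hi, best_a + half)

def translate_interval(lo, hi, zeta):
    # eq. (14): once converged, test adjacent values by shifting the
    # interval by the translation parameter zeta
    return lo + zeta, hi + zeta

actions = np.linspace(-250, 250, 100)  # eq. (6): M = 100 discrete forces
lo, hi = shrink_interval(-250, 250, best_a=-10.0)
```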

Optimal control steps of the cart-double inverted pendulum system of the present invention: for the Gaussian process model, reinforcement learning over an adaptive control interval can only yield a finite set of control decisions, tied to the discretization of the control, and cannot deliver the optimal control sequence from the continuous control set. To obtain the optimal control sequence, optimal control of the Gaussian process model of the cart-double inverted pendulum system is proposed. First, taking the variation of equation (4) gives

Δμ_{i+1} = Δμ_i + (∂E[f_i]/∂μ_i) Δμ_i + (∂E[f_i]/∂η_i) Δη_i + (∂E[f_i]/∂u_i) Δu_i,
Δη_{i+1} = Δη_i + (∂var[f_i]/∂μ_i) Δμ_i + (∂var[f_i]/∂η_i) Δη_i + (∂var[f_i]/∂u_i) Δu_i,   (15)

where Δ denotes an infinitesimal variation. Since the initial conditions are fixed values, their variations are zero, i.e., Δμ_0 = 0, Δη_0 = 0. Since Δu_i can be any value, for simplicity of form Δu_i can be set to Δu_i = 1, giving the iterative forms of Δμ_i and Δη_i

Δμ_{i+1} = Δμ_i + (∂E[f_i]/∂μ_i) Δμ_i + (∂E[f_i]/∂η_i) Δη_i + ∂E[f_i]/∂u_i,
Δη_{i+1} = Δη_i + (∂var[f_i]/∂μ_i) Δμ_i + (∂var[f_i]/∂η_i) Δη_i + ∂var[f_i]/∂u_i.   (16)

The expected value of the total objective function is

E[J] = Σ_{i=0}^{59} E[L(x(i), u_i)].   (17)

Taking the variation of the expectation E[L(x(i), u_i)] at step i gives

ΔE[L(x(i), u_i)] = (∂E[L]/∂μ_i) Δμ_i + (∂E[L]/∂η_i) Δη_i + ∂E[L]/∂u_i.   (18)

The variation of the total objective function over the whole time interval, i.e., the gradient of the objective function, becomes

∇E[J] = Σ_{i=0}^{59} ΔE[L(x(i), u_i)],   (19)

where Δμ_i and Δη_i are given by equation (16), and μ_i and η_i are given by equation (4). The optimal control method for the cart-double inverted pendulum system set out above was implemented and verified. The parameters are: cart mass 0.5 kg; the first and second inverted pendulums each of mass 0.5 kg and length 0.1 m; friction coefficient between cart and ground 0.1. The state weight coefficients Q_x are [0 1 1 1 15 15]^T, and the control weight coefficient Z is 0.01. The control interval before Q-learning is [-250, 250]; after Q-learning it is [-116, 93]. After optimal control, the optimal objective function value is 9338. Fig. 6 shows the mean and variance of the objective function at each instant for the initial guess. Fig. 7 shows the mean and variance of the objective function at each instant under optimal control. Fig. 8 shows the optimal control curve of the system. Fig. 9 shows the evolution of the mean angle of the first inverted pendulum, and Fig. 10 that of the second inverted pendulum.
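Steps S30-S40 can be sketched as follows: the objective (17) is assembled by rolling the Gaussian process model forward and summing the expected stage costs, then handed to a gradient-based solver. The patent uses MATLAB's SQP with the analytic gradient (19); this sketch substitutes scipy's SLSQP and lets the solver estimate the gradient numerically, with `gp_step` and `expected_stage_cost` as sketched earlier.

```python
import numpy as np
from scipy.optimize import minimize

def total_objective(u_seq, mu0, eta0, step_fn, cost_fn):
    # equation (17): sum of expected stage costs along the GP rollout
    mu, eta, J = mu0.copy(), eta0, 0.0
    for u in u_seq:
        J += cost_fn(mu, eta, u)
        mu, eta = step_fn(mu, eta, u)
    return J

# res = minimize(total_objective, u_init,           # u_init from S20
#                args=(mu0, eta0, gp_step_fn, cost_fn),
#                method="SLSQP", bounds=[(-116, 93)] * 60)
# u_opt = res.x                                     # optimal control forces
```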

Claims (4)

1. An optimal control method for a cart-double inverted pendulum system, applied to a cart-double inverted pendulum system comprising a cart, a first inverted pendulum, and a second inverted pendulum, characterized by comprising the following steps:

S10: determining the six states of the cart-double inverted pendulum system and, after applying a given force, obtaining the evolution of the six states over time; setting a Gaussian kernel function and, taking the current state-control pairs as inputs and the state increments as outputs, training the values of the Gaussian kernel hyperparameters; through the joint probability distribution, obtaining the relationship between the distribution of the state-control pair at instant i and the state distribution at instant i+1, i.e., the Gaussian process model;

S20: setting the states of the cart-double inverted pendulum system, the time step, and the discrete values of the control variable; defining the expected form of the cost function, setting the reward/penalty mechanism, and specifying the update rule for the Q value at each time step; continuously shrinking the value range of the control in the early phase of learning and continuously translating it in the later phase; when the process converges to a small value of the cost function, determining the optimal control interval and a near-optimal discrete control sequence;

S30: after obtaining the Gaussian process model and the optimal control interval, taking variations of the Gaussian process model and the initial conditions to obtain the iterative forms of Δμ_i and Δη_i and, combined with iterative evaluation of the Gaussian process model, obtaining the total objective function value and the gradient value;

S40: calling a gradient-based optimization solver, using the learned near-optimal discrete control sequence as the initial guess for the optimal control, and, through evaluation of the gradient and the total objective function, iteratively solving for the optimal control force sequence;

in the control process of the cart-double inverted pendulum system, the control force is designed optimally so that at time T the angles of the first and second inverted pendulums are 2π, and reinforcement learning is first used to approach the globally optimal control interval as closely as possible and to obtain a near-optimal discrete control sequence;

in reinforcement learning, the optimization control system of the cart-double inverted pendulum system satisfies a Markov decision process M = <X, U, P, R, γ>, where X, U, P, R, and γ are defined as follows:

X is the set of six-dimensional state vectors x of the cart-double inverted pendulum system, i.e.

X = <x(j)> = <μ_j, η_j>, j = 0, 1, ..., N,   (5)

where <> denotes a set, j is the time step, and N = 10;

U is the action space containing all possible actions, i.e., the discrete values of the applied force; denoting the admissible discrete control values by a_m,

U = {a_m}, m = 1, 2, ..., M,   (6)

where M is the number of control values taken, here M = 100; P is the state-transition probability function, P[x(j+1) = x' | x(j) = x, u(j) = a] = P(x, a, x');

R is the reward for the state and action at each time step j = 0, 1, ..., N; the purpose of learning control is that at time T the angles of the first and second inverted pendulums equal 2π; the reward function can be defined through the control cost function, and the control cost function at instant j is defined as:
L_j = (x(j) - x_tg(j))^T Q_x (x(j) - x_tg(j)) + u(j)^T R' u(j)   (7)
where Q_x and R' are the weight coefficients of the state and the control, respectively, given in advance according to the actual situation, and x_tg(j) is the target position;
since the state x_j has both a mean μ_j and a variance η_j, taking expectations of both sides gives the following control cost function at instant j:
E[L_j] = (μ_j - x_tg(j))^T Q_x (μ_j - x_tg(j)) + tr(Q_x η_j) + u(j)^T R' u(j)   (8)
where L denotes the objective function; when the angle far exceeds or falls far short of 2π, the optimization control system of the cart-double inverted pendulum system should receive a penalty; when it is close to 2π, it should receive a reward; the index of closeness to or distance from 2π is C_j = x(j) - Z_j, where Z_j is the setpoint of the approach-to-2π trend at each instant
Z_j = π + jλ, j = 0, 1, ..., N,   (9)
where λ is taken as π/10; γ is the discount factor: assuming the corresponding rewards are discounted over time, the reward at each time step is based on the previous reward R_j and the discount factor γ (0 ≤ γ ≤ 1), with γ taken as 0.85; the cumulative reward is expressed as
G = Σ_{j=0}^{N} γ^j R_j   (10)
a Q-learning algorithm is adopted: at each discrete time step, the optimization control system of the cart-double inverted pendulum system observes the current state and acts on the state that maximizes the reward; the Q value at each time step is updated using the rule

Q_{j+1}(x(j), a_j) = Q_j(x(j), a_j) + ε × [R_{j+1} + γ (max_{a_{j+1}} Q_j(x(j+1), a_{j+1}) - Q_j(x(j), a_j))],   (11)

where Q_j(x(j), a_j) is the Q value of state x(j) and action a_j at time j, and ε is the learning rate describing the exploitation probability in the ε-greedy algorithm, exploring all combinations under extensive training; the Q-learning algorithm should generate Q values for all state-action combinations, and in each state the action with the largest Q value is selected as the best action;

S20 specifically comprises:

the Q-learning algorithm should generate Q values for all state-action combinations; in each state, the action with the largest Q value is selected as the best action; a control-interval-adaptive reinforcement learning method is further proposed to continuously narrow the value range of the control during learning,

argmin_{IN} E[L(x, a)],
IN = [min{a_m}, max{a_m}],   (12)

where E[L(x, a)] is the sum of the costs at all instants, and IN denotes the interval from which the control values are adaptively selected; a wider control interval is set to start the training process and, according to the cost function results, the control interval in the action space producing the minimum cost is selected as the new control interval for the next round of learning; during this process the control interval shrinks while the discrete number of controls M is kept constant, so the control becomes progressively finer, i.e., the number of actions per interval is constant while each interval and its spacing gradually decrease; that is, the control interval of the l-th learning round and that of the (l+1)-th round satisfy
S20 specifically includes:

on the basis of the Q values generated for all state–action combinations, a reinforcement learning method with an adaptive control interval is further proposed, which continuously narrows the value range of the control quantity during learning:

$$\operatorname*{argmin}_{IN} E[L(x,a)],\qquad IN=[\min\{a_m\},\max\{a_m\}],\tag{12}$$

where E[L(x,a)] is the sum of the costs at all time instants and IN denotes the interval of control values that is optimized adaptively. A wide control interval is set to start the training process; according to the results of the cost function, the control interval in the action space that produces the minimum cost is selected as the new control interval for the next round of learning. During this process the control interval shrinks while the discrete number M of controls is kept unchanged, so the control becomes finer and finer: the number of actions in each control interval is constant, and both the interval and the spacing between actions gradually decrease. That is, the control interval of the l-th learning round and that of the (l+1)-th round satisfy

$$[\min\{a_m\},\max\{a_m\}]^{(l+1)}\subset[\min\{a_m\},\max\{a_m\}]^{(l)},\tag{13}$$
where $[\min\{a_m\},\max\{a_m\}]^{(l)}$ is exactly IN in formula (12) after l rounds of training on the control interval.

When a set of actions converges to a certain control interval, the trolley–two-stage inverted pendulum optimal control method continues to test adjacent values by translating the control interval; that is, the control interval of the n-th learning round and that of the (n+1)-th round satisfy
$$[\min\{a_m\},\max\{a_m\}]^{(n+1)}=[\min\{a_m\},\max\{a_m\}]^{(n)}+\delta,\tag{14}$$

where δ is the translation parameter.
The process of adaptively selecting the control space by optimizing the cost function is iterative; once the control interval converges to an interval that yields a small value of the cost function, that interval becomes the optimal control interval, and the better discrete control sequence is determined.
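A minimal sketch of this interval adaptation follows. The patent only requires that IN^(l+1) ⊂ IN^(l) with the action count M fixed, and that converged intervals are translated as in (14); the contraction rule, M, and the shrink factor below are assumptions:

```python
import numpy as np

M = 11          # discrete number of controls per interval (kept constant); value assumed
SHRINK = 0.5    # contraction factor per learning round; value assumed

def actions(lo: float, hi: float) -> np.ndarray:
    # M evenly spaced control values in the current interval IN = [lo, hi].
    return np.linspace(lo, hi, M)

def next_interval(lo: float, hi: float, best: float) -> tuple[float, float]:
    # Shrink the interval around the minimum-cost action so that
    # IN^(l+1) ⊂ IN^(l) while the action count M stays fixed (finer spacing).
    half = (hi - lo) * SHRINK / 2
    return max(lo, best - half), min(hi, best + half)

def translate_interval(lo: float, hi: float, delta: float) -> tuple[float, float]:
    # Once converged, probe adjacent values: IN^(n+1) = IN^(n) + δ as in (14).
    return lo + delta, hi + delta
```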
2. The trolley–two-stage inverted pendulum system optimal control method according to claim 1, wherein in S10 the six states of the trolley–two-stage inverted pendulum system are determined as: the displacement x_1 of the trolley, the velocity x_2 of the trolley, the angular velocity x_3 and angle x_5 of the first inverted pendulum, and the angular velocity x_4 and angle x_6 of the second inverted pendulum. Let

$$x(t)=[x_1,x_2,x_3,x_4,x_5,x_6]^{\mathrm T},$$

where the superscript T denotes the matrix transpose, and let the applied force be u(t). At the initial moment the trolley and both inverted pendulums are at rest, i.e. x_1(0)=0, x_2(0)=0, x_3(0)=0, x_4(0)=0, x_5(0)=π, x_6(0)=π. The whole control time scale is T = 1.2 seconds and each control action lasts 0.02 seconds, so the whole control horizon needs 60 control actions, i.e. discrete times i = 0, 1, ..., 60, with x(i) and u(i) the state and control at the corresponding time.
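The setup in claim 2 translates directly into a state vector and time grid; a minimal sketch (variable names are ours):

```python
import numpy as np

N_STEPS = 60   # 60 control actions of 0.02 s over T = 1.2 s (claim 2)
DT = 0.02

# Initial condition from claim 2: trolley and both pendulums at rest,
# with both pendulum angles at π.
x0 = np.array([0.0, 0.0, 0.0, 0.0, np.pi, np.pi])

t = np.arange(N_STEPS + 1) * DT   # discrete times i = 0, 1, ..., 60
u = np.zeros(N_STEPS)             # control force sequence u(i), to be optimized
```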
3. The trolley–two-stage inverted pendulum system optimal control method according to claim 2, wherein the Gaussian modeling process is as follows:

the mapping to be learned by the Gaussian process is the relationship between the current state–control pairs xp = {x(0),u(0), ..., x(59),u(59)} and the state changes Δx = {Δx(0), ..., Δx(59)}, where the symbol Δ denotes the state change, Δx(i) = x(i+1) − x(i).

According to Gaussian process modeling, for a given state–control pair xp* = {x*, u*}, its relationship with the 60 state–control pairs xp = {x(0),u(0), ..., x(59),u(59)} is as follows:
$$\begin{bmatrix}\Delta x\\ f(xp^*)\end{bmatrix}\sim GP\!\left(0,\ \begin{bmatrix}K(xp,xp)+\sigma_w^2 I & k(xp,xp^*)\\ k(xp^*,xp) & k(xp^*,xp^*)\end{bmatrix}\right)\tag{1}$$
where GP denotes a Gaussian probability distribution, I is the identity matrix, and σ_w is a noise-related hyperparameter; K and k are Gaussian kernel functions, written K for vector–vector operations and k for scalar–scalar or scalar–vector operations; f is the unknown dynamic model of the trolley–two-stage inverted pendulum system. The kernel function is generally defined as the squared-exponential covariance function

$$k(y_m,y_n)=\sigma_f^2\exp\!\left(-\tfrac{1}{2}(y_m-y_n)^{\mathrm T}W^{-1}(y_m-y_n)\right),$$
where y_m, y_n are fixed values or matrices; σ_f is a variance-related hyperparameter; and W is a weight matrix with nonzero entries only on its diagonal, these entries all being hyperparameters. The specific values of the hyperparameters are obtained by optimization, taking the concrete state–control pairs as input and the state changes as output.

From the joint probability distribution (1), the mean μ_f and variance η_f of f(x(i),u(i)) are obtained:
$$\mu_f=k(xp^*,xp)\left[K(xp,xp)+\sigma_w^2 I\right]^{-1}\Delta x,$$

$$\eta_f=k(xp^*,xp^*)-k(xp^*,xp)\left[K(xp,xp)+\sigma_w^2 I\right]^{-1}k(xp,xp^*).$$
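A minimal sketch of the squared-exponential kernel and the posterior mean/variance above, assuming the standard Gaussian process regression forms (the source shows the formulas only as images, so the exact layout is reconstructed):

```python
import numpy as np

def se_kernel(ym: np.ndarray, yn: np.ndarray,
              sigma_f: float, w_diag: np.ndarray) -> float:
    # Squared-exponential covariance with diagonal weight matrix W;
    # sigma_f, sigma_w and the diagonal of W are the trained hyperparameters.
    d = ym - yn
    return sigma_f**2 * np.exp(-0.5 * float(d @ (d / w_diag)))

def gp_predict(xp: np.ndarray, dx: np.ndarray, xp_star: np.ndarray,
               sigma_f: float, sigma_w: float, w_diag: np.ndarray):
    # Posterior mean and variance of the state change at the test pair
    # xp_star, from the n training pairs xp (rows) and observed changes dx.
    n = len(xp)
    K = np.array([[se_kernel(a, b, sigma_f, w_diag) for b in xp] for a in xp])
    k_star = np.array([se_kernel(xp_star, a, sigma_f, w_diag) for a in xp])
    A = K + sigma_w**2 * np.eye(n)
    mu_f = k_star @ np.linalg.solve(A, dx)
    eta_f = se_kernel(xp_star, xp_star, sigma_f, w_diag) \
            - k_star @ np.linalg.solve(A, k_star)
    return mu_f, eta_f
```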
Thus, from the distribution of the state–control pair at time i, the predicted value of the state distribution at time i+1 is obtained through exact computation:

$$\mu_{i+1}=\mu_i+\mu_f,$$
$$\eta_{i+1}=\eta_i+\eta_f+\operatorname{cov}\!\left(x(i),f(x(i),u(i))\right)+\operatorname{cov}\!\left(f(x(i),u(i)),x(i)\right),\tag{4}$$
where cov(x(i), f(x(i),u(i))) is the covariance between the state and the corresponding state transition, obtained by the moment-matching method. This yields the Gaussian process model: from the distribution of the state–control pair at time i, the distribution of the state at time i+1 is computed.
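A minimal sketch of one propagation step of (4), with scalar mean/variance for clarity; the cross-covariance term is taken as given, since the patent obtains it by moment matching, which is not reproduced here:

```python
def propagate(mu_i: float, eta_i: float,
              mu_f: float, eta_f: float, cov_xf: float) -> tuple[float, float]:
    # One step of (4): state mean and variance at time i+1 from time i.
    # cov_xf stands in for cov(x(i), f(x(i), u(i))) from moment matching.
    mu_next = mu_i + mu_f
    eta_next = eta_i + eta_f + 2.0 * cov_xf
    return mu_next, eta_next
```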
4. The trolley–two-stage inverted pendulum system optimal control method according to claim 3, wherein, in order to obtain the optimal control sequence, optimal control of the Gaussian process model of the trolley–two-stage inverted pendulum system is proposed. First, taking the variation of formula (4) gives
$$\Delta\mu_{i+1}=\Delta\mu_i+\frac{\partial\mu_f}{\partial\mu_i}\Delta\mu_i+\frac{\partial\mu_f}{\partial\eta_i}\Delta\eta_i+\frac{\partial\mu_f}{\partial u_i}\Delta u_i,$$

$$\Delta\eta_{i+1}=\Delta\eta_i+\frac{\partial\eta_f}{\partial\mu_i}\Delta\mu_i+\frac{\partial\eta_f}{\partial\eta_i}\Delta\eta_i+\frac{\partial\eta_f}{\partial u_i}\Delta u_i,\tag{15}$$
where Δ denotes a small variation. Since the initial condition is a fixed value, the variations of the initial condition are 0, i.e. Δμ_0 = 0, Δη_0 = 0. Since Δu_i is an arbitrary value, for simplicity of form it is set to Δu_i = 1, giving the iterative form of Δμ_i and Δη_i:
$$\Delta\mu_{i+1}=\Delta\mu_i+\frac{\partial\mu_f}{\partial\mu_i}\Delta\mu_i+\frac{\partial\mu_f}{\partial\eta_i}\Delta\eta_i+\frac{\partial\mu_f}{\partial u_i},$$

$$\Delta\eta_{i+1}=\Delta\eta_i+\frac{\partial\eta_f}{\partial\mu_i}\Delta\mu_i+\frac{\partial\eta_f}{\partial\eta_i}\Delta\eta_i+\frac{\partial\eta_f}{\partial u_i}.\tag{16}$$
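A minimal sketch of iterating recursion (16) from Δμ_0 = Δη_0 = 0, with Δu_i = 1; the partial derivatives must be supplied by the Gaussian process model and are abstracted behind a callback here, with scalars used for clarity:

```python
def sensitivity_recursion(partials, n_steps: int) -> list[tuple[float, float]]:
    # `partials(i)` returns the six partial derivatives of mu_f and eta_f
    # at step i: (dmu/dmu_i, dmu/deta_i, dmu/du_i, deta/dmu_i, deta/deta_i, deta/du_i).
    dmu, deta = 0.0, 0.0   # Δμ0 = 0, Δη0 = 0 (fixed initial condition)
    history = []
    for i in range(n_steps):
        a, b, c, d, e, f = partials(i)
        # Both updates use the step-i values of (dmu, deta); the tuple on the
        # right is evaluated before assignment, implementing (16).
        dmu, deta = (dmu + a * dmu + b * deta + c,
                     deta + d * dmu + e * deta + f)
        history.append((dmu, deta))
    return history
```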
The expected value of the total objective function is
$$E[L_{\mathrm{total}}]=\sum_{i=0}^{60}E[L(x(i),u_i)].$$
Taking the variation of the expectation E[L(x(i),u_i)] at the i-th step gives
$$\Delta E[L(x(i),u_i)]=\frac{\partial E[L(x(i),u_i)]}{\partial\mu_i}\Delta\mu_i+\frac{\partial E[L(x(i),u_i)]}{\partial\eta_i}\Delta\eta_i,$$
and the variation of the total objective function over the whole time interval, i.e. the gradient of the objective function, becomes
$$\Delta E[L_{\mathrm{total}}]=\sum_{i=0}^{60}\left(\frac{\partial E[L(x(i),u_i)]}{\partial\mu_i}\Delta\mu_i+\frac{\partial E[L(x(i),u_i)]}{\partial\eta_i}\Delta\eta_i\right),$$
where Δμ_i and Δη_i are given by formula (16), and μ_i and η_i are given by formula (4).
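A minimal sketch of the gradient-based solve in S40, warm-started with the discrete control sequence learned in S20 and constrained to the optimal control interval. The objective/gradient callback is a stand-in: in the method it would iterate the Gaussian process model (4) and the recursion (16); a simple quadratic is used here so the sketch runs end to end, and the interval bounds are assumed:

```python
import numpy as np
from scipy.optimize import minimize

N = 60  # 60 control actions (claim 2)

def total_cost_and_grad(u: np.ndarray) -> tuple[float, np.ndarray]:
    # Stand-in for E[L_total] and its gradient, which the patent computes by
    # iterating the GP model (4) and the variational recursion (16).
    cost = 0.5 * float(u @ u)
    grad = u
    return cost, grad

def solve(u_init: np.ndarray, lo: float, hi: float) -> np.ndarray:
    # Gradient-based optimization solver: iterate from the initial guess to
    # the optimal control force sequence within [lo, hi].
    res = minimize(total_cost_and_grad, u_init, jac=True,
                   method="L-BFGS-B", bounds=[(lo, hi)] * len(u_init))
    return res.x

u_opt = solve(np.zeros(N), lo=-10.0, hi=10.0)  # interval bounds assumed
```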