CN110908280B - Optimization control method for trolley-two-stage inverted pendulum system - Google Patents
Optimization control method for trolley-two-stage inverted pendulum system
- Publication number
- CN110908280B (application CN201911043225.4A)
- Authority
- CN
- China
- Prior art keywords
- control
- inverted pendulum
- state
- value
- interval
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 88
- 238000005457 optimization Methods 0.000 title claims abstract description 25
- 230000006870 function Effects 0.000 claims abstract description 76
- 230000008569 process Effects 0.000 claims abstract description 68
- 238000012549 training Methods 0.000 claims abstract description 14
- 238000004364 calculation method Methods 0.000 claims abstract description 10
- 238000005315 distribution function Methods 0.000 claims abstract description 4
- 230000009471 action Effects 0.000 claims description 30
- 230000008859 change Effects 0.000 claims description 20
- 230000002787 reinforcement Effects 0.000 claims description 20
- 239000013598 vector Substances 0.000 claims description 12
- 239000011159 matrix material Substances 0.000 claims description 9
- 230000003044 adaptive effect Effects 0.000 claims description 7
- 230000007423 decrease Effects 0.000 claims description 7
- 230000007246 mechanism Effects 0.000 claims description 6
- 230000007704 transition Effects 0.000 claims description 6
- 238000006073 displacement reaction Methods 0.000 claims description 4
- 238000013519 translation Methods 0.000 claims description 4
- 238000013507 mapping Methods 0.000 claims description 3
- 238000012360 testing method Methods 0.000 claims description 3
- 230000001186 cumulative effect Effects 0.000 claims description 2
- 230000017105 transposition Effects 0.000 claims description 2
- 238000010586 diagram Methods 0.000 description 11
- 238000013461 design Methods 0.000 description 3
- 230000006978 adaptation Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000011217 control strategy Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000004445 quantitative analysis Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Images
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of control of trolley-two-stage inverted pendulum systems, and in particular to an optimization control method for a trolley-two-stage inverted pendulum system.
Background Art
The trolley-two-stage inverted pendulum system is a classic fast, multivariable, nonlinear and unstable system, and a classic controlled object in the control field. Many control algorithms, including PID, fuzzy PID and robust control, have been implemented on this system. However, modeling is a prerequisite for controller design. At present, control of the trolley-two-stage inverted pendulum system is based on mechanism models, i.e. deterministic models derived from physical principles, whose parameters involve the dimensions of the trolley, the dimensions of the inverted pendulums, and so on.
With the development of intelligent algorithms, especially reinforcement learning, control has gradually shifted from deterministic mechanism models to model-free control. However, model-free control has certain drawbacks, such as requiring a large number of learning trials, control effects that are difficult to analyze quantitatively, and low controller design efficiency.
Summary of the Invention
In view of the above technical problems, the present invention provides an optimization control method for a trolley-two-stage inverted pendulum system.
To solve the above technical problems, the present invention adopts the following technical solution:
An optimization control method for a trolley-two-stage inverted pendulum system, applied to a trolley-two-stage inverted pendulum system comprising a trolley, a first inverted pendulum and a second inverted pendulum, comprising the following steps:
S10: determine the six states of the trolley-two-stage inverted pendulum system and, after applying a certain force, record how the six states change over time; set a Gaussian kernel function and, taking the current state-control pair as input and the state change as output, train the values of the Gaussian kernel hyperparameters; through the joint probability distribution function, obtain the relationship between the distribution of the state-control pair at time i and the state distribution at time i+1, i.e. the Gaussian process model;
S20: set the states of the trolley-two-stage inverted pendulum system, the time step and the discrete values of the control variable; define the expected form of the cost function, set the reward and penalty mechanism, and specify the update rule of the Q value at each time step; in the early stage of the learning process continuously narrow the value range of the control variable, and in the later stage continuously translate it; when the process converges to a smaller value of the cost function, determine the optimal control interval and a better discrete control sequence;
S30: after obtaining the Gaussian process model and the optimal control interval, take variations of the Gaussian process model and of the initial conditions to obtain the iterative forms of Δμi and Δηi; combined with the iterative calculation of the Gaussian process model, obtain the total objective function value and the gradient value;
S40: call a gradient-based optimization solver, take the better discrete control sequence obtained by learning as the initial guess of the optimization control, and, through calculation of the gradient value and the total objective function value, iteratively solve for the optimal control force sequence.
Preferably, in S10, the six states of the trolley-two-stage inverted pendulum system are determined as: the displacement x1 of the trolley, the velocity x2 of the trolley, the angular velocity x3 and angle x5 of the first inverted pendulum, and the angular velocity x4 and angle x6 of the second inverted pendulum. Let x = [x1 x2 ... x6]^T, where T denotes the matrix transpose, and let the applied force be u(t). At the initial moment (time 0) the trolley and the two inverted pendulums are at rest, i.e. x1(0)=0, x2(0)=0, x3(0)=0, x4(0)=0, x5(0)=π, x6(0)=π. The whole control horizon is T=1.2 seconds and each control action lasts 0.02 seconds, so the whole control horizon requires 60 control steps, i.e. discrete times i=0,1,...,60, with the state and control at the corresponding time being x(i) and u(i).
Preferably, the Gaussian modeling process is as follows: the mapping to be learned by the Gaussian process is the relationship between the current state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} and the state changes Δx = {Δx(0), ..., Δx(59)}, where the symbol Δ denotes the state change, Δx(i) = x(i+1) − x(i).
According to Gaussian process modeling, for a given state-control pair xp* = {x*, u*}, its relationship with the 60 state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} is given by the joint Gaussian distribution of formula (1),
where 𝒩 denotes a Gaussian probability distribution, I is the identity matrix, σw is a noise-related hyperparameter, and K, k are Gaussian kernel functions: K is written when the operation is between vectors, and k when it is between scalars or between a scalar and a vector; f is the unknown dynamic model of the trolley-two-stage inverted pendulum system. The kernel function is generally defined as the squared-exponential covariance function of formula (2),
where ym, yn may be fixed values or matrices; σf is a hyperparameter related to the variance; W is a weight matrix with values only on its diagonal, and these values are also hyperparameters. By taking the recorded state-control pairs as input and the state changes as output, the specific values of the hyperparameters are obtained by optimization;
The mean and variance of f(x(i), u(i)) are obtained through the joint probability distribution (1);
Thus, from the distribution of the state-control pair at time i, the predicted value of the state distribution at time i+1 is obtained through accurate calculation, as given by formula (4),
where the cross-covariance between the state and the corresponding state transition can be obtained by the moment-matching method. This yields the Gaussian process model: given the distribution of the state-control pair at time i, the distribution of the state at time i+1 is computed.
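By way of illustration, a minimal Python sketch of the Gaussian process one-step model described above is given below: a squared-exponential kernel with hyperparameters σf, W and noise σw, fitted to (state, control) → state-change pairs, and used to predict the mean and variance of the next state. The class and variable names, the fixed hyperparameter values, the toy training data, and the simplified uncertainty propagation (the state/transition cross-covariance term of formula (4) is omitted) are assumptions of this sketch, not the exact formulation of the invention.

```python
import numpy as np

def se_kernel(A, B, sigma_f, W_diag):
    """Squared-exponential covariance between the rows of A and B."""
    D = (A[:, None, :] - B[None, :, :]) / W_diag        # length-scaled differences
    return sigma_f**2 * np.exp(-0.5 * np.sum(D**2, axis=2))

class GPStateModel:
    """GP mapping (state, control) pairs to state changes, one shared kernel for all outputs."""
    def __init__(self, XU, dX, sigma_f=1.0, sigma_w=0.01, W_diag=None):
        self.XU, self.dX = XU, dX                        # training inputs / outputs
        self.sigma_f, self.sigma_w = sigma_f, sigma_w
        self.W_diag = np.ones(XU.shape[1]) if W_diag is None else W_diag
        K = se_kernel(XU, XU, sigma_f, self.W_diag)
        self.K_inv = np.linalg.inv(K + sigma_w**2 * np.eye(len(XU)))

    def predict_delta(self, xu):
        """Posterior mean and variance of the state change for one test input xu."""
        k_star = se_kernel(xu[None, :], self.XU, self.sigma_f, self.W_diag)
        mean = (k_star @ self.K_inv @ self.dX)[0]
        var = self.sigma_f**2 - float(k_star @ self.K_inv @ k_star.T)
        return mean, max(var, 0.0)

    def step(self, mu_x, eta_x, u):
        """Propagate an approximate state distribution one step: x(i+1) = x(i) + Δx."""
        mu_d, var_d = self.predict_delta(np.append(mu_x, u))
        return mu_x + mu_d, eta_x + var_d                # cross-covariance term omitted

# Usage with the 6-dimensional state of the trolley-two-stage inverted pendulum (toy data):
rng = np.random.default_rng(0)
XU = rng.normal(size=(60, 7))                            # 60 recorded (state, control) pairs
dX = rng.normal(scale=0.1, size=(60, 6))                 # 60 recorded state changes
gp = GPStateModel(XU, dX)
mu1, eta1 = gp.step(np.array([0, 0, 0, 0, np.pi, np.pi]), np.zeros(6), u=1.0)
```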
Preferably, S20 specifically includes:
In the control process of the trolley-two-stage inverted pendulum system, the control force is designed optimally so that at time T the angles of the first inverted pendulum and the second inverted pendulum both equal 2π; reinforcement learning is first used to get as close as possible to the globally optimal control interval and to obtain a better discrete control sequence;
In reinforcement learning, the optimization control system of the trolley-two-stage inverted pendulum system satisfies the Markov decision process (a discrete process) M = <X, U, P, R, γ>. X, U, P, R and γ are defined as follows:
X is the set of the six-dimensional state vectors x of the trolley-two-stage inverted pendulum system, i.e.
X = <x(j)> = <μj, ηj>, j = 0, 1, ..., N, (5)
where <> denotes a set, j is the time step, and N = 10;
U is the action space containing all possible operations, i.e. the discrete values (a finite number) of the applied force; denoting the reasonable discrete control values by am, then
U = {am}, m = 1, 2, ..., M, (6)
where M is the number of discrete control values, taken as M = 100 each time; P is the state transition probability function, P[x(j+1) = x′ | x(j) = x, u(j) = a] = P(x, a, x′);
R is the reward of the state and action at each time step j = 0, 1, ..., N. The purpose of learning the control is that the two-stage inverted pendulum system satisfies, at time T, that the angles of the first and second inverted pendulums equal 2π. The reward function can be defined by a control cost function; the control cost function at time j is defined by formula (7),
where the state weight coefficients and the control weight coefficient Z are given in advance according to the actual situation, and xtg(j) is the target position;
Since the state xj has a mean and a variance, taking expectations on both sides gives the expected control cost function at time j, formula (8),
where L denotes the objective function. When the angle is far above or far below 2π, the optimization control system of the trolley-two-stage inverted pendulum system should be given a penalty (a negative reward); when the angle is close to 2π, it should be given a reward. The index of being close to or far from 2π is Cj = x(j) − Zj, where Zj is the set value of the approach-to-2π trend at each moment;
in formula (9), λ is taken as π/10;
γ is the discount coefficient. It is assumed that rewards are discounted as time goes on; the reward at each time step is based on the reward Rj of the previous step and the discount coefficient γ (0 ≤ γ ≤ 1), with γ taken as 0.85. The cumulative reward is expressed by formula (10);
A Q-learning algorithm is adopted. At each discrete time step, the optimization control system of the trolley-two-stage inverted pendulum system observes the current state and takes the action that maximizes the reward; the Q value at each time step is updated using the rule
Qj+1(x(j), aj) = Qj(x(j), aj) + ε × [Rj+1 + γ(max Qj(x(j+1), aj+1) − Qj(x(j), aj))], (11)
where Qj(x(j), aj) is the Q value of state x(j) and action aj at time j, and ε is the learning rate describing the probability of exploitation in the algorithm. With a large amount of training exploring all combinations, the Q-learning algorithm generates Q values for all state-action combinations, and in each state the action with the largest Q value is selected as the best action. A reinforcement learning method with an adaptive control interval is further proposed, which continuously narrows the value range of the control variable during the learning process:
argmin_IN E[L(x, a)],
IN = [min{am}, max{am}], (12)
where E[L(x, a)] is the sum of the costs at all moments (the total cost), and IN denotes the adaptively optimized interval from which the control values are selected. A control interval with a larger range is set to start the training process; according to the result of the cost function, the control interval in the action space that produces the minimum cost is selected as the new control interval for the next round of learning. During this process the control interval shrinks while the number of discrete control values M stays constant, so the control becomes finer and finer: the number of actions in each control interval is constant, and each interval and its spacing gradually decrease. That is, the control interval of the l-th round of learning and the control interval of the (l+1)-th round of learning satisfy formula (13),
where [min{am}, max{am}](l) equals IN in formula (12) after l rounds of training on the control interval;
When a set of actions converges to a certain control interval, the optimization control system of the trolley-two-stage inverted pendulum system starts to test adjacent values by moving the control interval; that is, the control interval of the n-th round of learning and the control interval of the (n+1)-th round of learning satisfy formula (14),
in which the amount of movement is a translation parameter;
The process of adaptively selecting the control space by optimizing the cost function is iterative, and once the control interval converges to an interval that produces a smaller value of the cost function, that control interval becomes the optimal control interval, and a better discrete control sequence is determined.
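As a sketch of the adaptive-interval mechanism of formulas (12)–(14) — keeping M discrete actions while first shrinking and then translating the interval — the following Python fragment may serve. The episode-cost evaluator, the shrink factor, the shift size and the loop counts are placeholders introduced for the sketch and are not values specified by the invention.

```python
import numpy as np

def adapt_control_interval(evaluate_cost, lo=-250.0, hi=250.0, M=100,
                           n_shrink=8, n_shift=4, shrink=0.6):
    """Adaptively narrow, then translate, the control interval [lo, hi].

    evaluate_cost(actions) is assumed to run one round of Q-learning over the given
    discrete action set and return (total_expected_cost, best_action_sequence).
    """
    best_cost, best_seq = np.inf, None
    # "Shrinking" step: keep M actions, re-centre a narrower interval on the cheapest actions.
    for _ in range(n_shrink):
        actions = np.linspace(lo, hi, M)
        cost, seq = evaluate_cost(actions)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
        centre = float(np.mean(seq))
        half = 0.5 * (hi - lo) * shrink
        lo, hi = centre - half, centre + half
    # "Translation" step: once converged, try adjacent intervals of the same width.
    width = hi - lo
    for direction in (-1.0, 1.0):
        for k in range(1, n_shift + 1):
            shift = direction * k * 0.25 * width
            actions = np.linspace(lo + shift, hi + shift, M)
            cost, seq = evaluate_cost(actions)
            if cost < best_cost:
                best_cost, best_seq, lo, hi = cost, seq, lo + shift, hi + shift
    return (lo, hi), best_seq, best_cost

# Example with a dummy evaluator that prefers forces near +40 N (illustrative only):
dummy = lambda a: (float(np.mean((a - 40.0)**2)), a[np.argsort(np.abs(a - 40.0))[:10]])
interval, seq, cost = adapt_control_interval(dummy)
```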
Preferably, in order to obtain the optimal control sequence, optimization control of the Gaussian process model of the trolley-two-stage inverted pendulum system is proposed. First, taking the variation of formula (4) gives formula (15),
where Δ denotes a small variation. Since the initial conditions are fixed values, their variations are 0, i.e. Δμ0 = 0, Δη0 = 0; and since Δui can take any value, for simplicity of form it is set to Δui = 1, giving the iterative forms of Δμi and Δηi in formula (16);
The expected value of the total objective function is given by formula (17);
Taking the variation of the expectation E[L(x(i), ui)] at step i gives formula (18);
The variation of the total objective function over the entire time horizon, i.e. the gradient of the objective function, becomes formula (19),
where Δμi and Δηi are given by formula (16), and μi and ηi are given by formula (4).
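The following sketch assembles the total expected objective of formula (17) over the GP rollout and a gradient for the solver of S40. For brevity the gradient is computed here by finite differences rather than by the analytic recursion of formulas (15)–(19); gp.step and cost_fn are assumed to behave like the Gaussian process one-step model and expected step cost sketched earlier.

```python
import numpy as np

def total_expected_cost(gp, cost_fn, x0, u_seq):
    """Roll the GP model forward and accumulate the expected cost over all steps."""
    mu, eta = x0, np.zeros_like(x0)
    J = 0.0
    for u in u_seq:
        mu, eta = gp.step(mu, eta, u)          # one-step state distribution (formula (4))
        J += cost_fn(mu, eta, u)               # expected step cost (formula (8))
    return J

def objective_gradient(gp, cost_fn, x0, u_seq, h=1e-4):
    """Finite-difference stand-in for the analytic gradient of formula (19)."""
    u_seq = np.asarray(u_seq, dtype=float)
    base = total_expected_cost(gp, cost_fn, x0, u_seq)
    grad = np.zeros_like(u_seq)
    for i in range(len(u_seq)):
        u_pert = u_seq.copy()
        u_pert[i] += h
        grad[i] = (total_expected_cost(gp, cost_fn, x0, u_pert) - base) / h
    return grad
```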
The beneficial effects of the present invention are:
(1) For the trolley-two-stage inverted pendulum system, a data-driven Gaussian process model is proposed. Unlike traditional deterministic models such as mechanism models, it uses a mean and a variance to represent the operating state of the trolley-two-stage inverted pendulum system, which is closer to the actual motion of the system.
(2) For the trolley-two-stage inverted pendulum system (an uncertain system), reinforcement learning with an adaptive control interval and optimization control are designed, extending the learning and control methods to the field of uncertain systems.
(3) Considering the learning-efficiency problems of traditional reinforcement learning such as Q-learning, reinforcement learning with an adaptive control interval is proposed: the range of control decisions is continuously narrowed, an optimal control interval is selected adaptively, and a better discrete control sequence is determined.
(4) To address the tendency of optimization problems to fall into local optima, the proposed reinforcement learning is used to determine the optimal control interval, the better discrete control sequence obtained is used as the initial guess, and optimization control is then used to determine the optimal control curve (values), ensuring that the global optimum is searched for to the greatest possible extent.
(5) To address the restriction that the control values in traditional reinforcement learning must be finite in number, an optimization control algorithm over a continuous control interval is applied after the reinforcement learning decision, finally yielding the optimal control input.
(6) In the experimental results of the present invention, the control interval before Q-learning is [-250, 250], and the control interval after Q-learning is [-116, 93]. After optimization control, the optimal objective function value is 9338; the mean and variance of the objective function at each moment under optimal control and the optimal control curve are given.
Description of the Drawings
Fig. 1 is a simplified diagram of the experimental equipment of the trolley-two-stage inverted pendulum system of the present invention.
Fig. 2 is a flowchart of Gaussian process modeling of the trolley-two-stage inverted pendulum system of the present invention.
Fig. 3 is a flowchart of reinforcement learning with an adaptive control interval for the trolley-two-stage inverted pendulum system of the present invention.
Fig. 4 is a flowchart of optimization control of the trolley-two-stage inverted pendulum system of the present invention.
Fig. 5 is a flowchart of Gaussian process modeling, reinforcement learning with an adaptive interval, and optimization control of the trolley-two-stage inverted pendulum system of the present invention.
Fig. 6 shows the mean and variance of the objective function at each moment under the initial guess for the trolley-two-stage inverted pendulum system of the present invention.
Fig. 7 shows the mean and variance of the objective function at each moment under optimal control for the trolley-two-stage inverted pendulum system of the present invention.
Fig. 8 is the optimal control curve of the trolley-two-stage inverted pendulum system of the present invention.
Fig. 9 shows the change of the mean angle of the first inverted pendulum of the trolley-two-stage inverted pendulum system of the present invention.
Fig. 10 shows the change of the mean angle of the second inverted pendulum of the trolley-two-stage inverted pendulum system of the present invention.
Detailed Description
As shown in Fig. 1, the simplified diagram of the experimental equipment of the trolley-two-stage inverted pendulum system of the present invention consists of the trolley 1, the first inverted pendulum 2 and the second inverted pendulum 3. The arrow on the right is the force applied to the trolley, i.e. the control input of the system; the curved arrows represent the rotation angles of the inverted pendulums.
The optimization control method for a trolley-two-stage inverted pendulum system provided by an embodiment of the present invention is applied to the trolley-two-stage inverted pendulum system shown in Fig. 1 and specifically includes the following steps:
S10: determine the six states of the trolley-two-stage inverted pendulum system and, after applying a certain force, record how the six states change over time; set a Gaussian kernel function and, taking the current state-control pair as input and the state change as output, train the values of the Gaussian kernel hyperparameters; through the joint probability distribution function, obtain the relationship between the distribution of the state-control pair at time i and the state distribution at time i+1, i.e. the Gaussian process model;
S20: set the states of the trolley-two-stage inverted pendulum system, the time step and the discrete values of the control variable; define the expected form of the cost function, set the reward and penalty mechanism, and specify the update rule of the Q value at each time step; in the early stage of the learning process continuously narrow the value range of the control variable, and in the later stage continuously translate it; when the process converges to a smaller value of the cost function, determine the optimal control interval and a better discrete control sequence;
S30: after obtaining the Gaussian process model (formula (4)) and the optimal control interval, take variations of the Gaussian process model and of the initial conditions to obtain the iterative forms of Δμi and Δηi; combined with the iterative calculation of the Gaussian process model, obtain the total objective function value (17) and the gradient value (19).
S40: call a gradient-based optimization solver, such as SQP in MATLAB, take the better discrete control sequence obtained by learning as the initial guess of the optimization control, and, through calculation of the gradient value (19) and the total objective function value (17), iteratively solve for the optimal control force sequence.
As shown in Fig. 2, the Gaussian process modeling steps of the trolley-two-stage inverted pendulum system of the present invention are as follows. The trolley-two-stage inverted pendulum system of the present invention comprises six states: the displacement x1 of the trolley, the velocity x2 of the trolley, the angular velocity x3 and angle x5 of the first inverted pendulum, and the angular velocity x4 and angle x6 of the second inverted pendulum. Let x = [x1 x2 ... x6]^T, where T denotes the matrix transpose, and let the applied force be u(t). At the initial moment (time 0) the trolley and the two inverted pendulums are at rest, i.e. x1(0)=0, x2(0)=0, x3(0)=0, x4(0)=0, x5(0)=π, x6(0)=π. The whole control horizon is T=1.2 seconds and each control action lasts 0.02 seconds. Therefore the whole control horizon requires 60 control steps, i.e. discrete times i=0,1,...,60, with the state and control at the corresponding time being x(i) and u(i).
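For illustration, the state vector, initial condition and 0.02-second control grid described above can be set up as follows. The random excitation used to generate training data and the placeholder cart_pendulum_step, which stands in for the physical rig or a simulator, are assumptions of this sketch.

```python
import numpy as np

dt, n_steps = 0.02, 60                                   # 1.2 s horizon, 60 control steps
x0 = np.array([0.0, 0.0, 0.0, 0.0, np.pi, np.pi])        # [x1 x2 x3 x4 x5 x6] at time 0

def cart_pendulum_step(x, u, dt):
    """Placeholder for the real trolley/double-pendulum dynamics or the test rig."""
    raise NotImplementedError

def collect_rollout(x0, force_sequence):
    """Apply a force sequence and record (state, control) -> state-change training pairs."""
    states, controls, deltas = [x0], [], []
    x = x0
    for u in force_sequence:
        x_next = cart_pendulum_step(x, u, dt)
        controls.append(u)
        deltas.append(x_next - x)
        states.append(x_next)
        x = x_next
    XU = np.hstack([np.array(states[:-1]), np.array(controls)[:, None]])
    return XU, np.array(deltas)                          # inputs and outputs for the GP model

forces = np.random.uniform(-250.0, 250.0, size=n_steps)  # exploratory force sequence
```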
The Gaussian process modeling process: the dynamics of the system is in fact the process from the state and control at the current moment to the state (state change) at the next moment. Therefore, the mapping to be learned by the Gaussian process is the relationship between the current state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} and the state changes Δx = {Δx(0), ..., Δx(59)}, where the symbol Δ denotes the state change, Δx(i) = x(i+1) − x(i).
According to Gaussian process modeling, for a given state-control pair xp* = {x*, u*}, its relationship with the 60 state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} is given by the joint Gaussian distribution of formula (1),
where 𝒩 denotes a Gaussian probability distribution, I is the identity matrix, σw is a noise-related hyperparameter, and K, k are Gaussian kernel functions: K is written when the operation is between vectors, and k when it is between scalars or between a scalar and a vector. f is the unknown dynamic model of the trolley-two-stage inverted pendulum system. The kernel function is generally defined as the squared-exponential covariance function of formula (2),
where ym, yn may be fixed values or matrices; σf is a hyperparameter related to the variance; W is a weight matrix with values only on its diagonal, and these values are also hyperparameters. By taking the recorded state-control pairs as input and the state changes as output, the specific values of the hyperparameters are obtained by optimization.
The mean and variance of f(x(i), u(i)) are obtained through the joint probability distribution (1).
Thus, from the distribution of the state-control pair at time i, the predicted value of the state distribution at time i+1 is obtained through accurate calculation, as given by formula (4),
where the cross-covariance between the state and the corresponding state transition can be obtained by the moment-matching method. Thus the Gaussian process model is obtained: given the distribution of the state-control pair at time i, the distribution of the state at time i+1 is computed.
Reinforcement learning steps with an adaptive control interval for the trolley-two-stage inverted pendulum system of the present invention: in the control process of the trolley-two-stage inverted pendulum system, the control force is designed optimally so that at time T the angles of the first inverted pendulum and the second inverted pendulum both equal 2π, i.e. both pendulums are vertical. Since traditional optimization easily falls into local optima, reinforcement learning is first used to get as close as possible to the globally optimal control interval and to obtain a better discrete control sequence.
In reinforcement learning, the optimization control system of the trolley-two-stage inverted pendulum system satisfies the Markov decision process (a discrete process) M = <X, U, P, R, γ>. X, U, P, R and γ are defined as follows:
X is the set of the six-dimensional state vectors x of the trolley-two-stage inverted pendulum system, i.e.
X = <x(j)> = <μj, ηj>, j = 0, 1, ..., N, (5)
where <> denotes a set and j is the time step. Considering that the step-size setting seriously affects the learning efficiency, and unlike the 60 time-discretization points of the Gaussian process, here N = 10 is taken, i.e. the time is discretized into 10 steps.
U is the action space containing all possible operations, i.e. the discrete values (a finite number) of the applied force; denoting the reasonable discrete control values by am, then
U = {am}, m = 1, 2, ..., M, (6)
where M is the number of discrete control values, taken as M = 100 each time. P is the state transition probability function, P[x(j+1) = x′ | x(j) = x, u(j) = a] = P(x, a, x′).
R is the reward of the state and action at each time step j = 0, 1, ..., N. The purpose of learning the control is that the two-stage inverted pendulum system satisfies, at time T, that the angles of the first and second inverted pendulums equal 2π. The reward function of this patent can be defined by a control cost function; the control cost function at time j is defined by formula (7),
where the state weight coefficients and the control weight coefficient Z are given in advance according to the actual situation, and xtg(j) is the target position.
Since the state xj has a mean and a variance, taking expectations on both sides gives the expected control cost function at time j, formula (8),
where L denotes the objective function. When the angle is far above or far below 2π, the optimization control system of the trolley-two-stage inverted pendulum system should be given a penalty (a negative reward); when the angle is close to 2π, it should be given a reward. With this in mind, the index of being close to or far from 2π is Cj = x(j) − Zj, where Zj is the set value of the approach-to-2π trend at each moment;
in formula (9), λ is taken as π/10.
γ is the discount coefficient. It is assumed that rewards are discounted as time goes on. Therefore the reward at each time step is based on the reward Rj of the previous step and the discount coefficient γ (0 ≤ γ ≤ 1), with γ taken as 0.85. The cumulative reward is expressed by formula (10).
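Because each state is carried as a mean μj and a covariance ηj, the expectation of a quadratic control cost can be evaluated in closed form, which is the kind of quantity formulas (7), (8) and (10) describe. The sketch below assumes a quadratic cost with a diagonal state-weight matrix and a scalar control weight Z; the numerical weights are the example values quoted at the end of the description, and the exact cost expression of the invention may differ.

```python
import numpy as np

state_weights = np.diag([0.0, 1.0, 1.0, 1.0, 15.0, 15.0])   # example state weight coefficients
Z, gamma = 0.01, 0.85                                        # control weight, discount coefficient

def expected_step_cost(mu, eta, u, x_target):
    """E[(x - xtg)^T S (x - xtg)] + Z*u^2 for x ~ N(mu, eta), with S = state_weights."""
    d = mu - x_target
    return float(d @ state_weights @ d + np.trace(state_weights @ eta) + Z * u**2)

def discounted_return(rewards):
    """Cumulative reward with per-step discounting: sum over j of gamma^j * R_j."""
    return float(sum(gamma**j * r for j, r in enumerate(rewards)))

# Example: both pendulum angles should reach 2*pi; the displacement is not penalized here.
x_target = np.array([0.0, 0.0, 0.0, 0.0, 2 * np.pi, 2 * np.pi])
c = expected_step_cost(np.array([0, 0, 0, 0, np.pi, np.pi]), 0.01 * np.eye(6), 5.0, x_target)
```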
A Q-learning algorithm is adopted. At each discrete time step, the optimization control system of the trolley-two-stage inverted pendulum system observes the current state and takes the action that maximizes the reward. The Q value at each time step is updated using the rule
Qj+1(x(j), aj) = Qj(x(j), aj) + ε × [Rj+1 + γ(max Qj(x(j+1), aj+1) − Qj(x(j), aj))], (11)
where Qj(x(j), aj) is the Q value of state x(j) and action aj at time j, and ε is the learning rate describing the probability of exploitation in the algorithm. With a large amount of training exploring all combinations, the Q-learning algorithm generates Q values for all state-action combinations. In each state, the action with the largest Q value is selected as the best action.
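A minimal tabular realization of the Q-value update of formula (11) is sketched below, using the standard grouping R + γ·max Q′ − Q. The state discretization needed to index the table, the ε-greedy exploration schedule and the episode count are assumptions added for the sketch; step(x, a) is assumed to return the next state and reward from the Gaussian process model and the cost function above.

```python
import numpy as np

def q_learning(discretize, step, actions, n_states, n_steps=10,
               episodes=2000, lr=0.1, gamma=0.85, explore=0.2, seed=0):
    """Tabular Q-learning: Q(s,a) += lr * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, len(actions)))
    for _ in range(episodes):
        x = np.array([0.0, 0.0, 0.0, 0.0, np.pi, np.pi])       # initial state of each episode
        for _ in range(n_steps):
            s = discretize(x)
            a_idx = (rng.integers(len(actions)) if rng.random() < explore
                     else int(np.argmax(Q[s])))
            x_next, reward = step(x, actions[a_idx])            # GP model + reward function
            s_next = discretize(x_next)
            Q[s, a_idx] += lr * (reward + gamma * np.max(Q[s_next]) - Q[s, a_idx])
            x = x_next
    return Q, np.argmax(Q, axis=1)                              # Q table and greedy action per state
```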
However, with the above settings of X, U, P, R and γ, Q-learning is very slow, mainly because the range of discrete values of the applied force is very wide and there are very many discrete force points, so the dimension of the Q table is very high, which easily causes the curse of dimensionality. On the other hand, if too few discrete force points are set, the learning effect is poor and it is difficult to obtain a good control strategy. The value range (control interval) and spacing of the control values severely limit the efficiency of Q-learning. Therefore, a reinforcement learning method with an adaptive control interval is proposed, which continuously narrows the value range of the control variable during the learning process:
argmin_IN E[L(x, a)],
IN = [min{am}, max{am}], (12)
where E[L(x, a)] is the sum of the costs at all moments (the total cost), and IN denotes the adaptively optimized interval from which the control values are selected. In order to approximate the global optimum as closely as possible, a control interval with a larger range is set to start the training process. According to the result of the cost function, the control interval in the action space that produces the minimum cost is selected as the new control interval for the next round of learning. During this process the control interval shrinks while the number of discrete control values M stays constant, so the control becomes finer and finer: the number of actions in each control interval is constant, and each interval and its spacing gradually decrease. That is, the control interval of the l-th round of learning and the control interval of the (l+1)-th round of learning satisfy formula (13).
[min{am}, max{am}](l) equals IN in formula (12) after l rounds of training on the control interval; this can be regarded as the adaptive "shrinking" step of the control interval.
When a set of actions converges to a certain control interval, the optimization control system of the trolley-two-stage inverted pendulum system starts to test adjacent values by moving the control interval; that is, the control interval of the n-th round of learning and the control interval of the (n+1)-th round of learning satisfy formula (14),
in which the amount of movement is a translation parameter. This can be regarded as the adaptive "translation" step of the control interval.
The process of adaptively selecting the control space by optimizing the cost function is iterative. Once the control interval converges to an interval that produces a smaller value of the cost function, that control interval becomes the optimal control interval, and a better discrete control sequence is determined.
Optimization control steps of the trolley-two-stage inverted pendulum system of the present invention: for the Gaussian process model, reinforcement learning with an adaptive control interval can only yield a finite set of control decisions, which depends on how finely the control values are discretized, and cannot yield the optimal control sequence from the continuous control set. In order to obtain the optimal control sequence, optimization control of the Gaussian process model of the trolley-two-stage inverted pendulum system is proposed. First, taking the variation of formula (4) gives formula (15),
where Δ denotes a small variation. Since the initial conditions are fixed values, their variations are 0, i.e. Δμ0 = 0, Δη0 = 0. Since Δui can take any value, for simplicity of form Δui is set to Δui = 1, giving the iterative forms of Δμi and Δηi in formula (16).
The expected value of the total objective function is given by formula (17).
Taking the variation of the expectation E[L(x(i), ui)] at step i gives formula (18).
The variation of the total objective function over the entire time horizon, i.e. the gradient of the objective function, becomes formula (19),
where Δμi and Δηi are given by formula (16), and μi and ηi are given by formula (4). The optimization control method for the trolley-two-stage inverted pendulum system set up above was implemented and verified. The parameters of the trolley are: the mass of the trolley is 0.5 kg, the masses of the first and second inverted pendulums are 0.5 kg, their lengths are 0.1 m, and the friction coefficient between the trolley and the ground is 0.1. The state weight coefficients are [0 1 1 1 15 15]^T and the control weight coefficient Z is 0.01. The control interval before Q-learning is [-250, 250], and the control interval after Q-learning is [-116, 93]. After optimization control, the optimal objective function value is 9338. Fig. 6 shows the mean and variance of the objective function at each moment under the initial guess for the trolley-two-stage inverted pendulum system of the present invention. Fig. 7 shows the mean and variance of the objective function at each moment of the system under optimal control. Fig. 8 is the optimal control curve of the system. Fig. 9 shows the change of the mean angle of the first inverted pendulum of the system. Fig. 10 shows the change of the mean angle of the second inverted pendulum of the system.
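Finally, step S40 can be reproduced in outline with any gradient-based nonlinear programming solver; the description mentions SQP in MATLAB, and scipy's SLSQP is used below as a stand-in. expected_total_cost is assumed to roll the Gaussian process model forward over the 60 steps and return the expected objective of formula (17), for example as in the total_expected_cost sketch given earlier; the analytic gradient of formula (19) is replaced here by the solver's internal finite-difference gradient for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def refine_control_sequence(expected_total_cost, u_init, interval):
    """Refine the discrete control sequence found by Q-learning into a continuous
    optimal force sequence constrained to the learned control interval."""
    lo, hi = interval
    result = minimize(
        expected_total_cost,                  # rolls the GP model forward and returns E[J]
        x0=np.asarray(u_init, dtype=float),   # warm start: best sequence from Q-learning
        method="SLSQP",                       # gradient-based solver of the SQP family
        bounds=[(lo, hi)] * len(u_init),
        options={"maxiter": 200, "ftol": 1e-6},
    )
    return result.x, result.fun

# Usage sketch: the interval and the initial sequence come from the adaptive Q-learning stage,
# e.g. u_opt, J_opt = refine_control_sequence(expected_total_cost, u_init, (-116.0, 93.0))
```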
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911043225.4A CN110908280B (en) | 2019-10-30 | 2019-10-30 | Optimization control method for trolley-two-stage inverted pendulum system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911043225.4A CN110908280B (en) | 2019-10-30 | 2019-10-30 | Optimization control method for trolley-two-stage inverted pendulum system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110908280A CN110908280A (en) | 2020-03-24 |
CN110908280B true CN110908280B (en) | 2023-01-03 |
Family
ID=69814671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911043225.4A Active CN110908280B (en) | 2019-10-30 | 2019-10-30 | Optimization control method for trolley-two-stage inverted pendulum system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110908280B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111580392B (en) * | 2020-07-14 | 2021-06-15 | 江南大学 | Finite frequency range robust iterative learning control method of series inverted pendulum |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105549384A (en) * | 2015-09-01 | 2016-05-04 | 中国矿业大学 | Inverted pendulum control method based on neural network and reinforced learning |
CN110134011A (en) * | 2019-04-23 | 2019-08-16 | 浙江工业大学 | An adaptive iterative learning inversion control method for inverted pendulum |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011068222A (en) * | 2009-09-24 | 2011-04-07 | Honda Motor Co Ltd | Control device of inverted pendulum type vehicle |
- 2019-10-30: CN CN201911043225.4A patent/CN110908280B/en, status: Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105549384A (en) * | 2015-09-01 | 2016-05-04 | 中国矿业大学 | Inverted pendulum control method based on neural network and reinforced learning |
CN110134011A (en) * | 2019-04-23 | 2019-08-16 | 浙江工业大学 | An adaptive iterative learning inversion control method for inverted pendulum |
Non-Patent Citations (1)
Title |
---|
H∞ Robust Optimal Control of a Linear Double Inverted Pendulum System; Wang Chunping et al.; Mechanical & Electrical Engineering (机电工程); 2017-05-20 (Issue 05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110908280A (en) | 2020-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lin et al. | An ensemble learning velocity prediction-based energy management strategy for a plug-in hybrid electric vehicle considering driving pattern adaptive reference SOC | |
CN108520155B (en) | Vehicle behavior simulation method based on neural network | |
Chiou et al. | A PSO-based adaptive fuzzy PID-controllers | |
CN110119844A (en) | Introduce robot motion's decision-making technique, the system, device of Feeling control mechanism | |
CN104133372B (en) | Room temperature control algolithm based on fuzzy neural network | |
CN113408648A (en) | Unit combination calculation method combined with deep learning | |
CN110286586A (en) | A Hybrid Modeling Method for Magnetorheological Damper | |
Su et al. | Prediction model of hot metal temperature for blast furnace based on improved multi-layer extreme learning machine | |
CN113420508B (en) | Unit combination calculation method based on LSTM | |
Zhai et al. | Deep q-learning with prioritized sampling | |
CN116892866B (en) | Rocket sublevel recovery track planning method, rocket sublevel recovery track planning equipment and storage medium | |
CN113641907A (en) | A method and device for hyperparameter adaptive depth recommendation based on evolutionary algorithm | |
CN116050242A (en) | Transition prediction method, device, equipment and medium | |
CN110908280B (en) | Optimization control method for trolley-two-stage inverted pendulum system | |
Chen et al. | Improve the accuracy of recurrent fuzzy system design using an efficient continuous ant colony optimization | |
Zhou et al. | Auxiliary task-based deep reinforcement learning for quantum control | |
Bisong et al. | Training a Neural Network | |
CN115453880A (en) | A method for training generative models for state prediction based on adversarial neural networks | |
Chan et al. | Evolutionary computation for on-line and off-line parameter tuning of evolving fuzzy neural networks | |
Morales | Deep Reinforcement Learning | |
Yin et al. | Hashing over predicted future frames for informed exploration of deep reinforcement learning | |
CN111445005A (en) | Neural network control method and reinforcement learning system based on reinforcement learning | |
Peng et al. | Effective policy gradient search for reinforcement learning through NEAT based feature extraction | |
Xiong et al. | Robust PETS reinforcement learning algorithm for uncertain environment | |
Chaturvedi | Factors affecting the performance of artificial neural network models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |