CN110908280B - Optimization control method for trolley-two-stage inverted pendulum system - Google Patents
Optimization control method for trolley-two-stage inverted pendulum system
- Publication number
- CN110908280B (application CN201911043225.4A)
- Authority
- CN
- China
- Prior art keywords
- control
- state
- inverted pendulum
- value
- interval
- Prior art date
- Legal status: Active
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
The embodiment of the invention discloses an optimization control method for a trolley-two-stage inverted pendulum system, which comprises the following steps: S10, setting a Gaussian kernel function and training the values of its hyper-parameters with the current state-control pairs as inputs and the state increments as outputs; obtaining, through the joint probability distribution, the relation between the distribution of the state-control pair at time i and the state distribution at time i+1, i.e. the Gaussian process model; S20, determining the optimal control interval and a good discrete control sequence; S30, after the Gaussian process model and the optimal control interval are obtained, taking variations of the Gaussian process model and of the initial conditions and, combined with the iterative evaluation of the Gaussian process model, obtaining the total objective function value and the gradient value; and S40, calling a gradient-based optimization solver, taking the discrete control sequence obtained by learning as the initial guess for the optimal control, and solving iteratively, using the computed gradient and total objective function values, to obtain the optimal control force sequence.
Description
Technical Field
The invention relates to the field of control of trolley-two-stage inverted pendulum systems, and in particular to an optimization control method for a trolley-two-stage inverted pendulum system.
Background
The trolley-two-stage inverted pendulum system is a classic fast, multivariable, nonlinear and unstable system and a benchmark control object in the control field. Many control algorithms, including PID, fuzzy PID and robust control, have been implemented on this system. However, controller design is premised on modeling. At present, control of the trolley-two-stage inverted pendulum system is based on a mechanism model, i.e. a deterministic model derived from physical principles, whose parameters involve the dimensions of the trolley, the dimensions of the inverted pendulums, and so on.
With the development of intelligent algorithms, in particular reinforcement learning, control has gradually shifted from deterministic mechanistic models toward model-free control. Model-free control has its own drawbacks, such as requiring too many learning trials, making quantitative analysis of the control effect difficult, and slowing controller design.
Disclosure of Invention
In view of the above technical problems, the present invention aims to provide an optimization control method for the trolley-two-stage inverted pendulum system.
In order to solve the technical problems, the invention adopts the following technical scheme:
An optimization control method for a trolley-two-stage inverted pendulum system, applied to a trolley-two-stage inverted pendulum system comprising a trolley, a first inverted pendulum and a second inverted pendulum, comprises the following steps:
S10, determining the 6 states of the trolley-two-stage inverted pendulum system and recording how these 6 states change over time after a given force is applied; setting a Gaussian kernel function and training the values of its hyper-parameters with the current state-control pairs as inputs and the state increments as outputs; obtaining, through the joint probability distribution, the relation between the distribution of the state-control pair at time i and the state distribution at time i+1, i.e. the Gaussian process model;
S20, setting the states, the time step and the discrete values of the control variable of the trolley-two-stage inverted pendulum system; defining the expected form of the cost function, setting a reward-penalty mechanism, and giving the update rule for the Q value at each time step; continuously shrinking the value range of the control quantity in the early stage of the learning process and continuously translating it in the later stage; once the process converges to an interval yielding a small cost function value, determining the optimal control interval and a good discrete control sequence;
S30, after obtaining the Gaussian process model and the optimal control interval, taking variations of the Gaussian process model and of the initial conditions to obtain the iterative form of Δμ_i and Δη_i, and, combined with the iterative evaluation of the Gaussian process model, obtaining the total objective function value and the gradient value;
and S40, calling a gradient-based optimization solver, taking the discrete control sequence obtained by learning as the initial guess for the optimal control, and solving iteratively, using the computed gradient and total objective function values, to obtain the optimal control force sequence.
Preferably, in S10, the 6 states of the trolley-two-stage inverted pendulum system are determined as: the displacement x_1 of the trolley, the velocity x_2 of the trolley, the angular velocity x_3 and angle x_5 of the first inverted pendulum, and the angular velocity x_4 and angle x_6 of the second inverted pendulum. Let x = [x_1 x_2 ... x_6]^T, where ^T denotes the matrix transpose, and let the applied force be u(t). The trolley and the two inverted pendulums are at rest at the initial time (time 0), i.e. x_1(0)=0, x_2(0)=0, x_3(0)=0, x_4(0)=0, x_5(0)=π, x_6(0)=π. The overall control horizon is T = 1.2 seconds with each control action lasting 0.02 seconds, so the whole control horizon requires 60 control actions, i.e. discrete times i = 0, 1, ..., 60, with state x(i) and control u(i) at time i.
Preferably, the Gaussian process modeling proceeds as follows: the mapping to be learned by the Gaussian process is the relation between the current state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} and the state increments Δx = {Δx(0), ..., Δx(59)}, where the symbol Δ denotes a state increment, Δx(i) = x(i+1) - x(i).
According to Gaussian process modeling, for a given new state-control pair xp* = {x*, u*}, the relation between it and the 60 state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} is as follows:
in the formula (I), the compound is shown in the specification,representing a Gaussian probability distribution, I being the identity matrix, σ w Is a hyper-parameter related to noise, and K, K are Gaussian kernel functions. When the vector is operated with the vector, the operation is marked as K; when the number and the number or the number and the vector are operated, the operation is marked as k; f is the dynamic model of the unknown vehicle-two-stage inverted pendulum system, and the kernel function generally defines the square exponential covariance function
In the formula, y m ,y n May be a fixed number or a matrix; sigma f A hyper-parameter related to variance; w is a weight matrix, only values exist on a diagonal line, the values are all hyper-parameters, and the specific values of the hyper-parameters are obtained through optimization through the input of specific control state pairs and the output of state variable quantities;
f (x (i)) is obtained by combining probability distributions (1), u (u)i) Mean value of)Sum variance
Therefore, according to the distribution of the state control pairs at the ith time, the predicted value of the state distribution at the (i + 1) th time is obtained through accurate calculation:
in the formula (I), the compound is shown in the specification,the covariance between the state and the corresponding state transition can be obtained by a matching method, so as to obtain a Gaussian process model, namely the distribution of the state control pair at the ith moment, and calculate the distribution of the state at the (i + 1) th moment.
Preferably, S20 specifically includes:
In the control process of the trolley-two-stage inverted pendulum system, the control force is designed optimally so that the angles of the first and second inverted pendulums both reach 2π at time T; reinforcement learning is first adopted to get as close as possible to the globally optimal control interval and to obtain a good discrete control sequence;
in reinforcement learning, the optimization control system of the trolley-two-stage inverted pendulum system satisfies a (discrete) Markov decision process M = <X, U, P, R, γ>, where X, U, P, R, γ are defined as follows:
X is the set of the 6-dimensional state vectors x of the trolley-two-stage inverted pendulum system, i.e.
X = <x(j)> = <μ_j, η_j>, j = 0, 1, ..., N,    (5)
where < > denotes a set, j is the time step and N = 10;
U is the action space containing all admissible actions, i.e. applied forces, as a finite set of discrete values; denoting a reasonable discrete control value by a_m, then
U = {a_m}, m = 1, 2, ..., M,    (6)
where M is the number of discrete control values, taken as M = 100 each time; P is the state transition probability function, P[x(j+1) = x' | x(j) = x, u(j) = a] = P(x, a, x');
R is the reward for the state and action at each time step j = 0, 1, ..., N. The objective of the learning control is to make the angles of the first and second inverted pendulums equal 2π at time T, so the reward function can be defined through a control cost function; the control cost function at time j is defined as:
where the weight coefficients of the state and of the control (the latter denoted Z) are given in advance according to the actual situation; x_tg(j) is the target position;
since the state quantity x_j includes a mean and a variance, taking expectations on both sides gives the expected control cost at time j
where L denotes the objective function. When the angle far exceeds or is far below 2π, a penalty (a negative reward) should be given to the optimization control system of the trolley-two-stage inverted pendulum system; when it approaches 2π, a reward should be given. The index measuring closeness to or distance from 2π is C_j = x(j) - Z_j, where Z_j is the set value at each time describing the desired trend toward 2π
where λ is taken as π/10;
γ is the discount coefficient; the reward is assumed to be discounted over time, so the reward at each time step is based on the previous reward R_j and the discount coefficient γ (0 ≤ γ ≤ 1), taken as γ = 0.85; the cumulative reward is expressed as
A Q-learning algorithm is adopted: at each discrete time step, the optimization control system of the trolley-two-stage inverted pendulum system observes the current state and takes the action that maximizes the reward; the Q value is updated at each time step using the rule
Q_{j+1}(x(j), a_j) = Q_j(x(j), a_j) + ε × [R_{j+1} + γ(max_{a_{j+1}} Q_j(x(j+1), a_{j+1}) - Q_j(x(j), a_j))],    (11)
where Q_j(x(j), a_j) is the Q value of state x(j) and action a_j at time j, and ε is the learning rate describing the exploitation probability in the ε-greedy algorithm. With enough training to explore all combinations, the Q-learning algorithm should generate Q values for all state-action combinations, and in each state the action with the largest Q value is selected as the best action. A control interval adaptive reinforcement learning method is further proposed to continuously narrow the value range of the control quantity during learning,
argmin_IN E[L(x, a)],
IN = [min{a_m}, max{a_m}],    (12)
where E[L(x, a)] is the sum of the costs at all times (the total cost) and IN denotes the interval from which the control quantity is adaptively selected. A wide control interval is set to start the training process, and, according to the result of the cost function, the sub-interval of the action space yielding the minimum cost is selected as the new control interval for the next round of learning. In this process the control interval shrinks while the number of discrete controls M is kept unchanged, so the control becomes finer and finer: the number of actions per control interval is constant while each interval and its spacing gradually decrease. That is, the control interval of the l-th learning round and that of the (l+1)-th round satisfy
where [min{a_m}, max{a_m}](l) equals the IN of equation (12) after the l-th round of control-interval training;
once a group of actions converges to a certain control interval, the optimization control system of the trolley-two-stage inverted pendulum system starts to continuously test adjacent values by translating the control interval, i.e. the control interval of the n-th learning round and that of the (n+1)-th round satisfy
The process of adaptively selecting the control space by optimizing the cost function is iterative; once the control interval converges to an interval that yields a small value of the cost function, that interval becomes the optimal control interval and a good discrete control sequence is determined.
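As an illustration of the interval adaptation just described, here is a minimal sketch in Python. It is an assumption-laden outline, not the patented learner: `episode_cost` is a hypothetical callback standing in for one Q-learning pass over the Gaussian process model, returning the expected total cost E[L] and the greedy action sequence; the shrink and shift factors are illustrative values, and the toy usage at the end replaces the pendulum model with a quadratic cost.

```python
# Sketch of control-interval adaptive reinforcement learning: M stays fixed
# while the interval is shrunk in early rounds and translated in later rounds.
import numpy as np

def adapt_control_interval(episode_cost, lo=-250.0, hi=250.0, M=100,
                           n_rounds=20, shrink=0.7, shift=0.2):
    best_seq, best_cost = None, np.inf
    for r in range(n_rounds):
        actions = np.linspace(lo, hi, M)              # M discrete actions a_m in [lo, hi]
        cost, seq = episode_cost(actions)             # one Q-learning pass -> (E[L], greedy sequence)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
        centre = float(np.mean(seq))
        if r < n_rounds // 2:
            # early phase: shrink the interval around the actions actually chosen
            half = shrink * (hi - lo) / 2.0
            lo, hi = centre - half, centre + half
        else:
            # late phase: translate the interval to keep testing adjacent values
            step = shift * (hi - lo) * np.sign(centre - 0.5 * (lo + hi))
            lo, hi = lo + step, hi + step
    return (lo, hi), best_seq

# Toy usage: a quadratic stand-in cost whose minimiser lies near u = -10.
toy = lambda acts: (float(np.mean((acts + 10.0) ** 2)),
                    acts[np.argsort(np.abs(acts + 10.0))[:5]])
interval, warm_start = adapt_control_interval(toy)
```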
Preferably, in order to obtain the optimal control sequence, optimization control of the Gaussian process model of the trolley-two-stage inverted pendulum system is provided; first, taking the variation of formula (4) gives
where Δ denotes an infinitesimal variation. Since the initial conditions are fixed, their variations are zero, i.e. Δμ_0 = 0, Δη_0 = 0. Because Δu_i can take any value, it is set to Δu_i = 1 for simplicity of form, yielding the iterative form of Δμ_i and Δη_i
The expected value of the overall objective function is
Taking the variation of the expectation E[L(x(i), u_i)] at step i gives
The variation of the overall objective function over the whole time horizon, i.e. the gradient of the objective function, becomes
where Δμ_i and Δη_i are given by equation (16), and μ_i and η_i are given by equation (4).
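To illustrate how the gradient (19) is assembled from the iterated variations (16), the sketch below accumulates the sensitivities of the state means by the chain rule through the one-step Gaussian process map. It is a simplified assumption: only the mean sensitivities Δμ_i are propagated (the variance sensitivities Δη_i would be handled analogously), and `dmu_dmu`, `dmu_du`, `dcost_dmu` are hypothetical placeholders for the partial derivatives appearing in equations (16) and (19), which the text does not reproduce.

```python
# Forward accumulation of d mu_i / d u_k through the one-step mean map
# mu_{i+1} = m(mu_i, u_i), then contraction with the cost derivatives.
import numpy as np

def objective_gradient(mu, u, dmu_dmu, dmu_du, dcost_dmu):
    """Return dJ/du_k for J = sum_i cost(mu_i) by forward chain rule."""
    n = len(u)
    grad = np.zeros(n)
    for k in range(n):                       # sensitivity with respect to u(k)
        d_mu = np.zeros(mu.shape[1])         # d mu_i / d u_k is zero for i <= k (fixed x(0))
        for i in range(k, n):
            d_mu = dmu_dmu(mu[i], u[i]) @ d_mu
            if i == k:
                d_mu = d_mu + dmu_du(mu[i], u[i])
            grad[k] += dcost_dmu(mu[i + 1]) @ d_mu
    return grad

# Toy usage with a linear one-step map mu_{i+1} = A mu_i + b u_i and quadratic cost.
A, b = 0.9 * np.eye(6), np.ones(6)
mu = np.ones((61, 6))                        # stand-in for the rolled-out state means
g = objective_gradient(mu, np.zeros(60),
                       dmu_dmu=lambda m, u: A,
                       dmu_du=lambda m, u: b,
                       dcost_dmu=lambda m: 2.0 * m)
```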
The beneficial effects of the invention are:
(1) A data-driven Gaussian process model is provided for the trolley-two-stage inverted pendulum system. Unlike traditional deterministic models such as mechanism models, it represents the running state of the system by its mean and variance, which is closer to the actual motion process of the system.
(2) For the trolley-two-stage inverted pendulum system (an uncertain system), control-interval adaptive reinforcement learning and optimization control are designed, extending the learning and control method to the field of uncertain systems.
(3) Considering the learning efficiency problem of conventional reinforcement learning such as Q-learning, control interval adaptive reinforcement learning is proposed: the range of control decisions is continuously narrowed, the optimal control interval is adaptively selected, and a good discrete control sequence is determined.
(4) To address the tendency of the optimization problem to fall into local optima, the optimal control interval is determined by the proposed reinforcement learning, and the optimal control curve (values) is then determined by optimization control using the obtained discrete control sequence as the initial guess, ensuring to the greatest extent that the globally optimal solution is found.
(5) To address the limitation that traditional reinforcement learning uses only a finite set of control values, after the reinforcement learning decision an optimal control algorithm over a continuous control interval is adopted to obtain the final optimal control input.
(6) The experimental effect of the invention is illustrated: the control interval before Q-learning is [-250, 250] and after Q-learning is [-116, 93]. After optimization control the optimal objective function value is 9338, and plots of the objective function mean and variance at each time under optimal control and of the optimal control curve are provided.
Drawings
FIG. 1 is a simplified diagram of experimental equipment for a two-stage inverted pendulum system of the present invention.
FIG. 2 is a Gaussian process modeling flow chart of the vehicle-two stage inverted pendulum system of the present invention.
FIG. 3 is a flow chart of the control interval adaptive reinforcement learning of the vehicle-two-stage inverted pendulum system of the present invention.
FIG. 4 is an optimized control flow chart of the vehicle-two-stage inverted pendulum system of the present invention.
Fig. 5 is a flow chart of gaussian process modeling, reinforcement learning of adaptive intervals and optimization control of the trolley-secondary inverted pendulum system of the invention.
FIG. 6 is a graph of the mean and variance of the objective function at various times under initial guessing for the vehicle-two stage inverted pendulum system of the present invention.
FIG. 7 is a graph of the mean and variance of the objective function of the vehicle-two stage inverted pendulum system of the present invention at various times under optimal control.
Fig. 8 is an optimal control curve for the trolley-two stage inverted pendulum system of the present invention.
FIG. 9 is a graph showing the variation of the angle-averaged value of the first inverted pendulum of the vehicle-two-stage inverted pendulum system of the present invention.
FIG. 10 is a graph showing the variation of the angle-averaged value of the second inverted pendulum of the vehicle-two-stage inverted pendulum system of the present invention.
Detailed Description
As shown in fig. 1, the simplified diagram of the experimental equipment of the trolley-two-stage inverted pendulum system of the invention comprises, in order, a trolley 1, a first inverted pendulum 2 and a second inverted pendulum 3. The arrow on the right is the force applied to the trolley, i.e. the control input of the system. The curved arrows represent the rotation angles of the inverted pendulums.
The optimization control method of the trolley-secondary inverted pendulum system provided by the embodiment of the invention is applied to the trolley-secondary inverted pendulum system shown in figure 1, and specifically comprises the following steps:
S10, determining the 6 states of the trolley-two-stage inverted pendulum system and recording how these 6 states change over time after a given force is applied; setting a Gaussian kernel function and training the values of its hyper-parameters with the current state-control pairs as inputs and the state increments as outputs; obtaining, through the joint probability distribution, the relation between the distribution of the state-control pair at time i and the state distribution at time i+1, i.e. the Gaussian process model;
S20, setting the states, the time step and the discrete values of the control variable of the trolley-two-stage inverted pendulum system; defining the expected form of the cost function, setting a reward-penalty mechanism, and giving the update rule for the Q value at each time step; continuously shrinking the value range of the control quantity in the early stage of the learning process and continuously translating it in the later stage; once the process converges to an interval yielding a small cost function value, determining the optimal control interval and a good discrete control sequence;
S30, after obtaining the Gaussian process model (4) and the optimal control interval, taking variations of the Gaussian process model and of the initial conditions to obtain the iterative form of Δμ_i and Δη_i, and, combined with the iterative evaluation of the Gaussian process model, obtaining the total objective function value (17) and the gradient value (19);
and S40, calling a gradient-based optimization solver, such as SQP in MATLAB, taking the discrete control sequence obtained by learning as the initial guess for the optimal control, and solving iteratively, using the gradient (19) and the total objective function value (17), to obtain the optimal control force sequence.
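For readers working in Python rather than MATLAB, a comparable warm-started call to a gradient-based solver might look like the sketch below. It is an assumption, not the patented implementation: `total_cost` and `total_gradient` are hypothetical wrappers around the objective (17) and gradient (19) obtained by rolling the Gaussian process model forward, SciPy's SLSQP stands in for MATLAB's SQP, and the bounds use the learned interval [-116, 93] from the experiment as an example.

```python
# Gradient-based refinement of the control sequence, warm-started from the
# discrete sequence found by reinforcement learning.
import numpy as np
from scipy.optimize import minimize

def solve_optimal_control(total_cost, total_gradient, u_warm_start,
                          u_min=-116.0, u_max=93.0):
    bounds = [(u_min, u_max)] * len(u_warm_start)    # continuous optimal control interval
    res = minimize(total_cost, u_warm_start, jac=total_gradient,
                   method='SLSQP', bounds=bounds,
                   options={'maxiter': 200, 'ftol': 1e-8})
    return res.x, res.fun                            # optimal force sequence and objective value

# Toy usage with a quadratic stand-in objective over 60 control steps.
u0 = np.zeros(60)
f = lambda u: float(np.sum((u - 1.0) ** 2))
g = lambda u: 2.0 * (u - 1.0)
u_opt, cost = solve_optimal_control(f, g, u0)
```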
As shown in FIG. 2, the Gaussian process modeling steps of the trolley-two-stage inverted pendulum system of the invention are as follows: the trolley-two-stage inverted pendulum system has 6 states, namely the displacement x_1 of the trolley, the velocity x_2 of the trolley, the angular velocity x_3 and angle x_5 of the first inverted pendulum, and the angular velocity x_4 and angle x_6 of the second inverted pendulum. Let x = [x_1 x_2 ... x_6]^T, where ^T denotes the matrix transpose, and let the applied force be u(t). The trolley and the two inverted pendulums are at rest at the initial time (time 0), i.e. x_1(0)=0, x_2(0)=0, x_3(0)=0, x_4(0)=0, x_5(0)=π, x_6(0)=π. The overall control horizon is T = 1.2 seconds, with each control action lasting 0.02 seconds. Thus the total control horizon requires 60 control actions, i.e. discrete times i = 0, 1, ..., 60, with state x(i) and control u(i) at time i.
The Gaussian process modeling proceeds as follows: the dynamics of the system map the current state and current control to the state (or state change) at the next time. Therefore, the mapping to be learned by the Gaussian process is the relation between the current state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} and the state increments Δx = {Δx(0), ..., Δx(59)}, where the symbol Δ denotes a state increment and Δx(i) = x(i+1) - x(i).
According to Gaussian process modeling, for a given new state-control pair xp* = {x*, u*}, the relation between it and the 60 state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} is as follows
where GP denotes a Gaussian probability distribution, I is the identity matrix, σ_w is the noise-related hyper-parameter, and K, k are Gaussian kernel functions: K is written when operating on a vector with a vector, and k when operating on a scalar with a scalar or a scalar with a vector. f is the unknown dynamic model of the trolley-two-stage inverted pendulum system. The kernel is taken to be the squared exponential covariance function
where y_m, y_n may be fixed values or matrices; σ_f is the variance-related hyper-parameter; W is a weight matrix that is non-zero only on its diagonal, and these diagonal entries are all hyper-parameters. The specific values of the hyper-parameters are obtained by optimization from the inputs (the state-control pairs) and the outputs (the state increments).
Therefore, from the distribution of the state-control pair at time i, the predicted state distribution at time i+1 is computed exactly
where the covariance between the state and the corresponding state transition can be obtained by a matching method. This yields the Gaussian process model: given the distribution of the state-control pair at time i, the distribution of the state at time i+1 is computed.
The control interval adaptive reinforcement learning steps of the trolley-two-stage inverted pendulum system of the invention are as follows: in the control process, the control force is designed optimally so that the angles of the first and second inverted pendulums both equal 2π at time T, i.e. both pendulums are upright. Since traditional optimization easily falls into local optima, reinforcement learning is first adopted to get as close as possible to the globally optimal control interval and to obtain a good discrete control sequence.
In reinforcement learning, the optimization control system of the trolley-two-stage inverted pendulum system satisfies a (discrete) Markov decision process M = <X, U, P, R, γ>, where X, U, P, R, γ are defined as follows:
X is the set of the 6-dimensional state vectors x of the trolley-two-stage inverted pendulum system, i.e.
X = <x(j)> = <μ_j, η_j>, j = 0, 1, ..., N,    (5)
where < > denotes a set and j is the time step. Since the choice of step size strongly affects the learning efficiency, the time discretization differs from the 60 intervals used in the Gaussian process: here N = 10, i.e. time is divided into 10 intervals.
U is the action space containing all admissible actions, i.e. applied forces, as a finite set of discrete values; denoting a reasonable discrete control value by a_m, then
U = {a_m}, m = 1, 2, ..., M,    (6)
where M is the number of discrete control values, taken as M = 100 each time. P is the state transition probability function, P[x(j+1) = x' | x(j) = x, u(j) = a] = P(x, a, x').
R is the reward for the state and action at each time step j = 0, 1, ..., N. The reward function of this patent may be defined through a control cost function; the control cost function at time j is defined as:
where the weight coefficients of the state and of the control (the latter denoted Z) are given in advance according to the actual situation; x_tg(j) is the target position.
Since the state quantity x_j includes a mean and a variance, taking expectations on both sides gives the expected control cost at time j
where L denotes the objective function. When the angle far exceeds or is far below 2π, a penalty (a negative reward) should be given to the optimization control system of the trolley-two-stage inverted pendulum system; when it approaches 2π, a reward is given. In view of this, the index measuring closeness to or distance from 2π is C_j = x(j) - Z_j, where Z_j is the set value at each time describing the desired trend toward 2π
where λ is taken as π/10.
γ is the discount coefficient. It is assumed that over time the corresponding rewards are discounted; thus the reward at each time step is based on the previous reward R_j and the discount coefficient γ (0 ≤ γ ≤ 1), taken as γ = 0.85. The cumulative reward is expressed as
A Q-learning algorithm is employed. At each discrete time step, the optimization control system of the trolley-two-stage inverted pendulum system observes the current state and takes the action that maximizes the reward. The Q value is updated at each time step using the rule
Q_{j+1}(x(j), a_j) = Q_j(x(j), a_j) + ε × [R_{j+1} + γ(max_{a_{j+1}} Q_j(x(j+1), a_{j+1}) - Q_j(x(j), a_j))],    (11)
where Q_j(x(j), a_j) is the Q value of state x(j) and action a_j at time j, and ε is the learning rate describing the exploitation probability in the ε-greedy algorithm. With enough training to explore all state-action combinations, the Q-learning algorithm should generate Q values for all of them, and in each state the action with the largest Q value is selected as the best action.
However, with this setting of X, U, P, R and γ, Q-learning is very slow, mainly because the discrete range of the applied force is very wide and contains many discrete values, so the Q table has a very high dimension and easily suffers from the curse of dimensionality. On the other hand, if too few discrete force values are used, the learning effect is poor and a good control strategy is hard to obtain. The range (control interval) and spacing of the control quantity therefore severely limit the efficiency of Q-learning. Hence a control interval adaptive reinforcement learning method is proposed to continuously shrink the value range of the control quantity during learning.
argmin_IN E[L(x, a)],
IN = [min{a_m}, max{a_m}],    (12)
where E[L(x, a)] is the sum of the costs at all times (the total cost) and IN denotes the interval from which the control quantity is adaptively selected. To approximate the globally optimal solution as closely as possible, a wide control interval is set to start the training process. According to the result of the cost function, the sub-interval of the action space yielding the minimum cost is selected as the new control interval for the next round of learning. In this process the control interval shrinks while the number of discrete controls M is kept constant, so the control becomes finer and finer: the number of actions per control interval is constant while each interval and its spacing gradually decrease. That is, the control interval of the l-th learning round and that of the (l+1)-th round satisfy
where [min{a_m}, max{a_m}](l) equals the IN of equation (12) after the l-th round of control-interval training; this can be seen as an adaptive "shrinking" step of the control interval.
Once a group of actions converges to a certain control interval, the optimization control system of the trolley-two-stage inverted pendulum system starts to continuously test adjacent values by translating the control interval, i.e. the control interval of the n-th learning round and that of the (n+1)-th round satisfy
The process of adaptively selecting the control space by optimizing the cost function is iterative; once the control interval converges to an interval that yields a small value of the cost function, that interval becomes the optimal control interval and a good discrete control sequence is determined.
The optimization control steps of the trolley-two-stage inverted pendulum system of the invention are as follows: for the Gaussian process model, reinforcement learning with an adaptive control interval can only yield a finite control decision set, which depends on how finely the control quantity is discretized, and cannot yield the optimal control sequence within a continuous control set. In order to obtain the optimal control sequence, optimization control of the Gaussian process model of the trolley-two-stage inverted pendulum system is provided. First, taking the variation of formula (4) gives
where Δ denotes an infinitesimal variation. Since the initial conditions are fixed, their variations are zero, i.e. Δμ_0 = 0, Δη_0 = 0. Because Δu_i can take any value, it is set to Δu_i = 1 for simplicity of form, yielding the iterative form of Δμ_i and Δη_i
The expected value of the overall objective function is
Taking the variation of the expectation E[L(x(i), u_i)] at step i gives
The variation of the overall objective function over the whole time horizon, i.e. the gradient of the objective function, becomes
where Δμ_i and Δη_i are given by equation (16), and μ_i and η_i are given by equation (4). The optimization control method of the trolley-two-stage inverted pendulum system was implemented and verified with the following parameters: the mass of the trolley is 0.5 kg, the masses of the first and second inverted pendulums are 0.5 kg each with length 0.1 m, and the friction coefficient between the trolley and the ground is 0.1. The state weight coefficient is [0 1 1 1 1 5]^T and the control weight coefficient Z is 0.01. The control interval before Q-learning is [-250, 250] and after Q-learning it is [-116, 93]. After optimization control, the optimal objective function value is 9338. FIG. 6 shows the mean and variance of the objective function at each time under the initial guess, and FIG. 7 shows them under optimal control. FIG. 8 is the optimal control curve of the system. FIG. 9 shows the evolution of the mean angle of the first inverted pendulum, and FIG. 10 that of the second inverted pendulum.
Claims (4)
1. An optimization control method for a trolley-two-stage inverted pendulum system, applied to a trolley-two-stage inverted pendulum system comprising a trolley, a first inverted pendulum and a second inverted pendulum, characterized by comprising the following steps:
S10, determining the 6 states of the trolley-two-stage inverted pendulum system and recording how these 6 states change over time after a given force is applied; setting a Gaussian kernel function and training the values of its hyper-parameters with the current state-control pairs as inputs and the state increments as outputs; obtaining, through the joint probability distribution, the relation between the distribution of the state-control pair at time i and the state distribution at time i+1, i.e. the Gaussian process model;
S20, setting the states, the time step and the discrete values of the control variable of the trolley-two-stage inverted pendulum system; defining the expected form of the cost function, setting a reward-penalty mechanism, and giving the update rule for the Q value at each time step; continuously shrinking the value range of the control quantity in the early stage of the learning process and continuously translating it in the later stage; once the process converges to an interval yielding a small cost function value, determining the optimal control interval and a good discrete control sequence;
S30, after obtaining the Gaussian process model and the optimal control interval, taking variations of the Gaussian process model and of the initial conditions to obtain the iterative form of Δμ_i and Δη_i, and, combined with the iterative evaluation of the Gaussian process model, obtaining the total objective function value and the gradient value;
S40, calling a gradient-based optimization solver, taking the discrete control sequence obtained by learning as the initial guess for the optimal control, and solving iteratively, using the computed gradient and total objective function values, to obtain the optimal control force sequence;
in the control process of the trolley-two-stage inverted pendulum system, the control force is designed optimally so that the angles of the first and second inverted pendulums both reach 2π at time T; reinforcement learning is first adopted to get as close as possible to the globally optimal control interval and to obtain a good discrete control sequence;
in reinforcement learning, the optimization control system of the trolley-two-stage inverted pendulum system satisfies a Markov decision process M = <X, U, P, R, γ>, where X, U, P, R, γ are defined as follows:
X is the set of the 6-dimensional state vectors x of the trolley-two-stage inverted pendulum system, i.e.
X = <x(j)> = <μ_j, η_j>, j = 0, 1, ..., N,    (5)
where < > denotes a set, j is the time step and N = 10;
U is the action space containing all admissible actions, i.e. applied forces, as a finite set of discrete values; denoting a reasonable discrete control value by a_m, then
U = {a_m}, m = 1, 2, ..., M,    (6)
where M is the number of discrete control values, M = 100; P is the state transition probability function, P[x(j+1) = x' | x(j) = x, u(j) = a] = P(x, a, x');
R is the reward for the state and action at each time step j = 0, 1, ..., N; the objective of the learning control is to make the angles of the first and second inverted pendulums equal 2π at time T, so the reward function can be defined through a control cost function, and the control cost function at time j is defined as:
where the weight coefficients of the state and of the control (the latter denoted R') are given in advance according to the actual situation; x_tg(j) is the target position;
since the state quantity x_j includes a mean and a variance, taking expectations on both sides gives the expected control cost at time j
where L denotes the objective function; when the angle far exceeds or is far below 2π, a penalty is given to the optimization control system of the trolley-two-stage inverted pendulum system; when it approaches 2π, a reward is given, and the index measuring closeness to or distance from 2π is C_j = x(j) - Z_j, where Z_j is the set value at each time describing the desired trend toward 2π
where λ is taken as π/10;
γ is the discount coefficient; the reward is assumed to be discounted over time, so the reward at each time step is based on the previous reward R_j and the discount coefficient γ, 0 ≤ γ ≤ 1, taken as γ = 0.85; the cumulative reward is expressed as
a Q-learning algorithm is adopted: at each discrete time step, the optimization control system of the trolley-two-stage inverted pendulum system observes the current state and takes the action that maximizes the reward; the Q value is updated at each time step using the rule
Q_{j+1}(x(j), a_j) = Q_j(x(j), a_j) + ε × [R_{j+1} + γ(max_{a_{j+1}} Q_j(x(j+1), a_{j+1}) - Q_j(x(j), a_j))],    (11)
where Q_j(x(j), a_j) is the Q value of state x(j) and action a_j at time j, and ε is the learning rate describing the exploitation probability in the ε-greedy algorithm; with enough training to explore all combinations, the Q-learning algorithm generates Q values for all state-action combinations, and in each state the action with the largest Q value is selected as the best action;
S20 specifically comprises the following steps:
the Q-learning algorithm should generate Q values for all state-action combinations, and in each state the action with the largest Q value is selected as the best action; a control interval adaptive reinforcement learning method is further provided to continuously shrink the value range of the control quantity during learning,
argmin_IN E[L(x, a)],
IN = [min{a_m}, max{a_m}],    (12)
where E[L(x, a)] is the sum of the costs at all times and IN denotes the interval from which the control quantity is adaptively selected; a wide control interval is set to start the training process, and, according to the result of the cost function, the sub-interval of the action space yielding the minimum cost is selected as the new control interval for the next round of learning; in this process the control interval shrinks while the number of discrete controls M is kept unchanged, so the control becomes finer and finer, i.e. the number of actions per control interval is constant while each interval and its spacing gradually decrease; that is, the control interval of the l-th learning round and that of the (l+1)-th round satisfy
where [min{a_m}, max{a_m}](l) equals the IN of equation (12) after the l-th round of control-interval training,
once a group of actions converges to a certain control interval, the optimization control method of the trolley-two-stage inverted pendulum system starts to continuously test adjacent values by translating the control interval, i.e. the control interval of the n-th learning round and that of the (n+1)-th round satisfy
the process of adaptively selecting the control space by optimizing the cost function is iterative; once the control interval converges to an interval that yields a small value of the cost function, that interval becomes the optimal control interval and a good discrete control sequence is determined.
2. The optimization control method of the trolley-two-stage inverted pendulum system according to claim 1, wherein in S10 the 6 states of the trolley-two-stage inverted pendulum system are determined as: the displacement x_1 of the trolley, the velocity x_2 of the trolley, the angular velocity x_3 and angle x_5 of the first inverted pendulum, and the angular velocity x_4 and angle x_6 of the second inverted pendulum; let x = [x_1 x_2 ... x_6]^T, where ^T denotes the matrix transpose, and let the applied force be u(t); the trolley and the two inverted pendulums are at rest at the initial time, i.e. x_1(0)=0, x_2(0)=0, x_3(0)=0, x_4(0)=0, x_5(0)=π, x_6(0)=π; the overall control horizon is T = 1.2 seconds, each control action lasts 0.02 seconds, so the whole control horizon requires 60 control actions, i.e. discrete times i = 0, 1, ..., 60, with state x(i) and control u(i) at time i.
3. The optimization control method of the trolley-two-stage inverted pendulum system according to claim 2, wherein the Gaussian process modeling proceeds as follows:
the mapping to be learned by the Gaussian process is the relation between the current state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} and the state increments Δx = {Δx(0), ..., Δx(59)}, where the symbol Δ denotes a state increment, Δx(i) = x(i+1) - x(i);
according to Gaussian process modeling, for a given new state-control pair xp* = {x*, u*}, the relation between it and the 60 state-control pairs xp = {x(0), u(0), ..., x(59), u(59)} is as follows:
where GP denotes a Gaussian probability distribution, I is the identity matrix, σ_w is the noise-related hyper-parameter, and K, k are Gaussian kernel functions, written K when operating on a vector with a vector and k when operating on a scalar with a scalar or a scalar with a vector; f is the unknown dynamic model of the trolley-two-stage inverted pendulum system, and the kernel is taken to be the squared exponential covariance function
where y_m, y_n are fixed values or matrices; σ_f is the variance-related hyper-parameter; W is a weight matrix that is non-zero only on its diagonal, and these diagonal entries are all hyper-parameters; the specific values of the hyper-parameters are obtained by optimization from the inputs (the state-control pairs) and the outputs (the state increments);
therefore, from the distribution of the state-control pair at time i, the predicted state distribution at time i+1 is computed exactly:
where the covariance between the states and the corresponding state transitions is obtained by a matching method, thereby obtaining the Gaussian process model, i.e. given the distribution of the state-control pair at time i, the distribution of the state at time i+1 is computed.
4. The optimization control method of the trolley-two-stage inverted pendulum system according to claim 3, wherein, in order to obtain the optimal control sequence, optimization control of the Gaussian process model of the trolley-two-stage inverted pendulum system is proposed; first, taking the variation of formula (4) gives
where Δ denotes an infinitesimal variation; since the initial conditions are fixed, their variations are zero, i.e. Δμ_0 = 0, Δη_0 = 0; because Δu_i can take any value, it is set to Δu_i = 1 for simplicity of form, yielding the iterative form of Δμ_i and Δη_i
the expected value of the overall objective function is
taking the variation of the expectation E[L(x(i), u_i)] at step i gives
the variation of the overall objective function over the whole time horizon, i.e. the gradient of the objective function, becomes
where Δμ_i and Δη_i are given by equation (16), and μ_i and η_i are given by equation (4).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911043225.4A CN110908280B (en) | 2019-10-30 | 2019-10-30 | Optimization control method for trolley-two-stage inverted pendulum system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110908280A (en) | 2020-03-24 |
CN110908280B (en) | 2023-01-03 |
Family
ID=69814671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911043225.4A Active CN110908280B (en) | 2019-10-30 | 2019-10-30 | Optimization control method for trolley-two-stage inverted pendulum system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110908280B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111580392B * | 2020-07-14 | 2021-06-15 | Jiangnan University | Finite frequency range robust iterative learning control method of series inverted pendulum |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011068222A (en) * | 2009-09-24 | 2011-04-07 | Honda Motor Co Ltd | Control device of inverted pendulum type vehicle |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105549384A (en) * | 2015-09-01 | 2016-05-04 | China University of Mining and Technology | Inverted pendulum control method based on neural network and reinforcement learning |
CN110134011A (en) * | 2019-04-23 | 2019-08-16 | Zhejiang University of Technology | Adaptive iterative learning backstepping control method for an inverted pendulum |
Non-Patent Citations (1)
Title |
---|
H∞ robust optimal control of a linear double inverted pendulum system; Wang Chunping et al.; Journal of Mechanical & Electrical Engineering; 2017-05-20 (No. 05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110908280A (en) | 2020-03-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||