CN110531620B - Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model - Google Patents

Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model

Info

Publication number
CN110531620B
CN110531620B (application CN201910823151.XA)
Authority
CN
China
Prior art keywords
state
function
gaussian process
model
trolley
Prior art date
Legal status
Active
Application number
CN201910823151.XA
Other languages
Chinese (zh)
Other versions
CN110531620A (en)
Inventor
钟珊
陈雪梅
应文豪
伏玉琛
龚声蓉
钱振江
Current Assignee
Changshu Institute of Technology
Original Assignee
Changshu Institute of Technology
Priority date
Filing date
Publication date
Application filed by Changshu Institute of Technology filed Critical Changshu Institute of Technology
Priority to CN201910823151.XA priority Critical patent/CN110531620B/en
Publication of CN110531620A publication Critical patent/CN110531620A/en
Application granted granted Critical
Publication of CN110531620B publication Critical patent/CN110531620B/en

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model. The method learns a value function and a policy from online samples generated by a physical system simulator, and at the same time uses those online samples to learn a Gaussian-process-based model of the environment dynamics. Once the dynamics model reaches a given accuracy, it is used for offline planning, which together with the online learning accelerates the convergence of the algorithm. The method therefore obtains the optimal control law for the trolley hill-climbing system more quickly.

Description

Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model
Technical Field
The invention relates to adaptive control methods for physical systems, and in particular to an adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model.
Background
The trolley hill-climbing system is shown in Figure 1: the trolley sits in the valley between two hills, and its goal is the five-pointed-star position at the top of the right-hand hill. Because its engine is not powerful enough, the trolley cannot reach the goal simply by accelerating up the right slope; it must first drive up the left slope so that it gains enough forward momentum to reach the goal on the right with sufficient acceleration. Adaptive control of this system means choosing the trolley's acceleration at every time step so that it reaches the right-hand goal in the shortest time. This is an optimal control problem over a continuous state space and a continuous action space. Such a control problem can in general be modeled as a Markov decision process (MDP): all possible states of the physical system form the state space, all possible actions form the action space, the probability distribution over the next state reached after applying an action in the current state is the transition function, and the feedback received from the environment after applying an action in the current state is called the reward function.
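For concreteness, the state and action spaces described above can be written down directly; the numerical bounds below are those given later in the detailed description, and the class and method names are illustrative only (a sketch, not part of the patent text).

import dataclasses

@dataclasses.dataclass
class TrolleyHillMDP:
    """State and action bounds of the trolley hill-climbing MDP."""
    w_min: float = -1.2    # x-axis limit at the top of the left hill
    w_max: float = 0.5     # x-axis limit at the top of the right hill (goal position)
    v_min: float = -0.07   # velocity bounds
    v_max: float = 0.07
    u_min: float = -1.0    # acceleration (action) bounds
    u_max: float = 1.0

    def is_goal(self, w: float) -> bool:
        # the target is the five-pointed-star position on the right hill
        return w >= self.w_max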
Once the physical system has been modeled as an MDP, the optimal policy, i.e. the optimal control law of the physical system, can be obtained with reinforcement learning. Reinforcement learning methods fall into two categories: model-free methods and model-based methods. Model-free methods learn the value function and the policy directly from samples obtained as the agent interacts with the environment. They are simple and fast, but each sample is used once for learning and then discarded, so sample efficiency is extremely low. Model-based methods can learn the value function and the policy by planning with a dynamics model, without requiring real samples, and therefore use samples far more efficiently; their drawback is that the optimal solution is obtained by repeatedly iterating the Bellman equation, which makes them computationally expensive.
In most practical physical systems the model is unknown. To take advantage of model-based planning, one must first learn a model and then plan with it. However, most physical systems are continuous rather than discrete, and even when the model is known it cannot be used directly for iterative solution of the Bellman equation. Moreover, when the learned model is not accurate enough, the quality of the planning suffers directly.
Disclosure of Invention
The invention aims to provide an adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model. The method learns a value function and a policy from online samples generated by a physical system simulator, and at the same time uses those online samples to learn a Gaussian-process-based model of the environment dynamics. Once the dynamics model reaches a given accuracy, it is used for planning to generate simulated samples, and the value function and the policy are learned from the simulated samples together with the online samples; this accelerates the convergence of the algorithm, so that the optimal control law of the system is obtained more quickly.
The technical scheme of the invention is as follows. An adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model comprises the following steps:
Step (1): initialize the model. Set the state space X and the action space U of the environment. A state is represented by the two-dimensional vector x = (w, v) ∈ X, where w is the horizontal position of the trolley and v is its horizontal velocity; the action the trolley can execute is an acceleration u ∈ U. The temporary variables of the Gaussian process approximation model, i.e. the state transition function, are a vector p, a variable d = 0, a variable s = 0 and a matrix P; φ(x) denotes the feature function corresponding to state x, and φ(x, u) the feature function of the state-action pair (x, u);
Step (2): initialize the hyper-parameters. Set the discount rate γ, the decay factor λ, the maximum number of episodes E, the exploration variance σ² of the Gaussian function, the diagonal elements σ_i² (1 ≤ i ≤ k) of the matrix ΔN_k, the maximum number of time steps T per episode, the learning rate α of the value function and the policy, the current episode number e = 1, the value function parameter vector ν, the policy parameter vector θ, the Gaussian process approximation model parameter vector β, and the maximum number of planning steps K;
Step (3): initialize the ranges of the state space and the action space of the trolley hill-climbing system and the conditions for control success or failure; set the current time step t = 1 and the current state x = x_1;
Step (4): take the current optimal action u* as the mean of a Gaussian function and the exploration variance σ² specified in step (2) as its variance, build the Gaussian distribution N(u*, σ²), and use it to generate the action u_t to be executed;
Step (5): in the current state x_t, execute the action u_t determined in step (4), obtain the next state x_{t+1} of the trolley from the dynamics equation of the system, and obtain the immediate reward r_{t+1} from the reward function, forming the sample (x_t, u_t, x_{t+1}, r_{t+1});
Step (6): use the sample to compute the TD error of the value function: δ_t = r_{t+1} + γV(x_{t+1}; ν_t) - V(x_t; ν_t);
Step (7): update the eligibility trace e_{t+1} of the value function;
Step (8): update the value function parameter vector: ν_{t+1} ← ν_t + αδ_t e_{t+1};
Step (9): update the policy parameter vector: θ_{t+1} ← θ_t + αδ_t(u* - u_t);
Step (10): use the sample to update the model intermediate quantities p_{t+1}, d_{t+1}, s_{t+1} and P_{t+1};
Step (11): use the current sample to update the state transition function parameter vector β_{t+1};
Step (12): update the current state x = x_{t+1} and judge whether the state component w_{t+1} of x_{t+1} satisfies the control-success condition:
    if it does, let e = e + 1 and judge whether the current episode number satisfies e = E:
        if it does, go to step (19);
        otherwise, go to step (13);
Step (13): initialize the planning counter k = 1 and the initial state of the planning process x'_k = x_1;
Step (14): in the current state x'_k, select the action u_k to be executed according to step (4), and then predict the next state from the Gaussian process approximation model, where Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T is the state feature matrix at planning step k, β_k is the parameter vector of the Gaussian process model, and ΔN_k ∈ R^{k×k} is the noise matrix whose position components up to planning step k satisfy a Gaussian distribution;
Step (15): update the eligibility trace according to the Gaussian process approximation model;
Step (16): update the value function parameters from the simulated samples generated by the Gaussian process approximation model: ν_{k+1} ← ν_k + αδ_k e_{k+1};
Step (17): update the policy parameters from the simulated samples generated by the Gaussian process approximation model: θ_{k+1} ← θ_k + αδ_k Δu_k;
Step (18): judge the current planning counter k:
    if k = K, update the current time step t = t + 1 and judge it:
        if the current time step has not reached the maximum time step T, return to step (4);
        otherwise, update the current episode e = e + 1 and judge it:
            if the current episode satisfies e = E, go to step (19);
            otherwise, go to step (3);
    otherwise, let k = k + 1 and go to step (14);
Step (19): output the optimal policy. The trolley now starts from the initial state x_1 and, from any state x_t, adopts the optimal policy h*(x_t) to obtain the corresponding optimal action, until the target state is reached.
Further, the optimal action in step (4) is solved as u* = θ_t^T φ(x_t), where φ(x_t) is the feature corresponding to state x_t and θ_t denotes the policy parameter vector corresponding to time step t.
Further, in step (5), given the current state x = x_t = (w_t, v_t), where w_t is the position component and v_t the velocity component, the next state can be written as x_{t+1} = (w_{t+1}, v_{t+1}); the velocity component of the next time step is obtained from v_{t+1} = v_t + 0.001u_t + g·cos(3w_t) and the position component of the next time step from w_{t+1} = w_t + v_{t+1}, with g = -0.0025 the gravitational acceleration term. The reward function is: if the next state x_{t+1} is the target state, r_{t+1} = 0; otherwise r_{t+1} = -1.
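These dynamics and the reward translate directly into a transition function; enforcing the state bounds of the detailed description by clipping is an added assumption (the patent gives the bounds but not how they are enforced).

import numpy as np

G = -0.0025  # gravitational term from the specification

def trolley_step(w_t: float, v_t: float, u_t: float):
    """One transition of the trolley hill-climbing system, as in step (5)."""
    v_next = v_t + 0.001 * u_t + G * np.cos(3.0 * w_t)
    v_next = float(np.clip(v_next, -0.07, 0.07))       # velocity bounds (assumed clipping)
    w_next = float(np.clip(w_t + v_next, -1.2, 0.5))   # position bounds (assumed clipping)
    reached_goal = w_next >= 0.5
    r_next = 0.0 if reached_goal else -1.0             # reward: 0 at the target, -1 otherwise
    return w_next, v_next, r_next, reached_goal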
Further, the state value function in step (6) is expressed as V(x_t; ν_t) = ν_t^T φ(x_t), where ν_t denotes the value function parameter vector corresponding to state x_t, V(x_{t+1}; ν_t) denotes the value function of state x_{t+1}, φ(x_t) is the feature corresponding to state x_t, and r_{t+1} is the reward obtained by executing action u_t in state x_t.
Further, the eligibility trace update formula in step (7) updates the eligibility trace e_t corresponding to state x_t into the eligibility trace e_{t+1} corresponding to state x_{t+1}.
Further, the value function parameter update in step (8) is ν_{t+1} ← ν_t + αδ_t e_{t+1}, where ν_t is the value function parameter vector corresponding to state x_t.
Further, the policy parameter update in step (9) is θ_{t+1} ← θ_t + αδ_tΔu with Δu = u* - u_t, where δ_t is the TD error of the value function computed in step (6).
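A sketch of the critic and actor updates of steps (7) to (9). The accumulating-trace form e ← γλe + φ(x_t) is an assumption, since the patent gives the trace update only as a formula image; scaling the policy update by φ(x_t) for a vector-valued θ is likewise an assumption (step (9) states θ_{t+1} ← θ_t + αδ_t(u* - u_t) without spelling out the feature dependence). The default hyper-parameter values are those of the embodiment.

import numpy as np

def critic_actor_update(nu, theta, e, phi_x, delta, u_star, u_t,
                        alpha=0.6, gamma=0.95, lam=0.4):
    """Steps (7)-(9): eligibility trace, value-parameter and policy-parameter updates."""
    e = gamma * lam * e + phi_x                              # assumed TD(lambda) accumulating trace
    nu = nu + alpha * delta * e                              # step (8): nu <- nu + alpha*delta*e
    theta = theta + alpha * delta * (u_star - u_t) * phi_x   # step (9), scaled by phi(x_t) (assumption)
    return nu, theta, e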
Further, the model intermediate quantities p_{t+1}, d_{t+1}, s_{t+1} and P_{t+1} in step (10) are each updated from the current sample, where u_{t+1} denotes the action to be executed in state x_{t+1}, obtained according to step (4), u_t denotes the action executed in state x_t at time step t, and σ_t is the standard deviation of the Gaussian process approximation model at time step t.
Further, the state transition function parameter vector in step (11) is updated to β_{t+1}, computed from the model intermediate quantities p_{t+1}, d_{t+1} and s_{t+1} obtained in step (10), where β_t is the parameter vector of the Gaussian process approximation model, i.e. of the state transition function, corresponding to time step t.
Further, the next state obtained in step (14) from the Gaussian process approximation model is predicted using Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T, the state feature matrix at planning step k, where x'_1 is the initial state of the planning process, x'_k is the state reached after k planning steps starting from x'_1, β_k is the parameter vector of the Gaussian process model, and ΔN_k ∈ R^{k×k} is a noise matrix whose position components up to planning step k satisfy a Gaussian distribution, i.e. ΔN_k = diag(σ_1², σ_2², ..., σ_k²).
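The recursive updates of p, d, s, P and β in steps (10), (11) and (14) are given only as formula images; as a stand-in, the sketch below fits a linear-in-features transition model by noise-weighted regularized least squares, which is the standard Gaussian-process/ridge-regression mean predictor for a model of the form x'_{k+1} ≈ β^T φ(x'_k, u_k) with diagonal noise ΔN_k = diag(σ_i²). It illustrates the roles of Φ_k, β and ΔN_k, not the patent's exact recursion.

import numpy as np

def fit_transition_model(Phi_k: np.ndarray, X_next: np.ndarray, noise_diag: np.ndarray) -> np.ndarray:
    """Stand-in for steps (10)-(11): estimate beta from features and observed next states.

    Phi_k:      (k, m) matrix whose rows are phi(x'_i, u_i)
    X_next:     (k, 2) matrix of observed next states (w, v)
    noise_diag: (k,)   diagonal of Delta_N_k, i.e. the sigma_i^2
    """
    W = np.diag(1.0 / noise_diag)                              # weight samples by inverse noise
    A = Phi_k.T @ W @ Phi_k + 1e-6 * np.eye(Phi_k.shape[1])    # small ridge term (assumption)
    return np.linalg.solve(A, Phi_k.T @ W @ X_next)            # (m, 2) parameter matrix beta

def predict_next_state(beta: np.ndarray, phi_xu: np.ndarray) -> np.ndarray:
    # step (14): mean prediction of the next planning state from the learned model
    return phi_xu @ beta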
The technical scheme provided by the invention has the following advantages: it establishes an approximation method suited to solving the optimal policy of the trolley hill-climbing system, namely an approximate solution of the continuous-control value function and of the Bellman equation corresponding to the policy; and it improves the learning accuracy of the model, so that once the model meets a given accuracy requirement it can be used for planning to generate simulated samples, which accelerates the convergence of the optimal control method.
Drawings
FIG. 1 is a schematic view of a cart mountain climbing system;
FIG. 2 is a schematic flow chart of the method of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are not to be construed as limiting the invention thereto.
Referring to Figures 1 and 2, the dynamics model of the trolley hill-climbing system addressed by the invention is set up as follows. At any time step t the state of the trolley is x_t = (w_t, v_t), where w_t is the position of the trolley on the x axis at time step t; the x-axis limit at the top of the left hill is -1.2 and at the top of the right hill is 0.5, so the position satisfies -1.2 ≤ w_t ≤ 0.5. v_t is the velocity of the trolley and satisfies -0.07 ≤ v_t ≤ 0.07. u_t is the action applied to the trolley, i.e. its acceleration at time step t, with -1 ≤ u_t ≤ 1; a positive value corresponds to accelerating and a negative value to braking. At the initial time step, i.e. t = 1, the initial state of the trolley is x_1 = (w_1, v_1) = (-0.5, 0). The target position of the trolley is the five-pointed-star position on the right, i.e. w_t = 0.5.
Suppose the current state of the trolley is known to be x_t = (w_t, v_t). Then at the next time step t + 1 the next state x_{t+1} = (w_{t+1}, v_{t+1}) of the trolley is given by
v_{t+1} = v_t + 0.001u_t + g·cos(3w_t),
w_{t+1} = w_t + v_{t+1},
where g = -0.0025 is the gravitational acceleration term and u_t is the action applied to the trolley, i.e. its acceleration at time step t.
Solution objective of the problem: from the dynamics model above it can be seen that, given the current state x_t = (w_t, v_t), as long as the policy is known, i.e. the acceleration u_t at every time step, the trolley state at the next time step can be computed, and so on until the goal is reached.
The optimality of a policy can be measured by the time needed to reach the goal: the less time required, the better the policy. To express this, a reward function is introduced: r_{t+1} = 0 if the next state x_{t+1} is the target state, and r_{t+1} = -1 otherwise.
from this reward function, it can be seen that if the more time steps the vehicle takes to reach the target from the initial position, the less the accumulated reward value; conversely, if less time steps are required to reach the goal, then the more prize values are accumulated.
The optimization goal of the algorithm is therefore to maximize the accumulated reward, since fewer time steps to reach the target position means a better policy.
The accumulated reward can be approximated by a V-value function. The V value corresponding to the state x_t at time step t can be expressed as V(x_t; ν_t) = ν_t^T φ(x_t), where ν_t is the V-value function parameter vector corresponding to time step t and φ(x_t) is the feature of state x_t.
Solving for the optimal policy means finding the policy that maximizes the accumulated reward, namely h*(x_t) = argmax_h V^h(x_t; ν_t), where V^h(x_t; ν_t) denotes the V-value function under policy h.
Once the optimal policy has been obtained, h*(x) can be used to obtain the optimal acceleration of the trolley at any time step t.
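Once a policy is available, its quality can be read off by rolling it out from the initial state and counting the steps to the goal; the sketch below uses the initial state (-0.5, 0) and the 3000-step limit of the embodiment, and takes the policy and the environment step as callables so that it stays independent of the exact formulas.

def rollout(policy, env_step, w0: float = -0.5, v0: float = 0.0, max_steps: int = 3000):
    """Roll a policy out until the goal is reached or the step limit is hit.

    policy:   callable (w, v) -> acceleration u in [-1, 1]
    env_step: callable (w, v, u) -> (w_next, v_next, reward, done)
    """
    w, v, total_reward = w0, v0, 0.0
    for t in range(max_steps):
        u = policy(w, v)
        w, v, r, done = env_step(w, v, u)
        total_reward += r
        if done:
            return t + 1, total_reward  # steps needed and accumulated reward
    return max_steps, total_reward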
referring to fig. 2, the adaptive control method for the mountain climbing system of the trolley based on the gaussian process approximation model in the embodiment includes the following steps:
step (1) initializing a model, and setting a variable in a state space X of an environment to be a 2-dimensional directionQuantity xt=(wt,vt), wt∈[-1.2,+0.5]Is the position of the trolley in the horizontal direction, vt∈[-0.07,+0.07]Indicating the speed of the trolley in the horizontal direction. Temporary variables in Gaussian process approximation model (state transition function) are vectors
Figure GDA0002554377500000064
Variable d is 0, variable s is 0 and matrix
Figure GDA0002554377500000065
Figure GDA0002554377500000066
For state x, the corresponding feature function, and φ (x, u) the feature function of the state action pair (x, u). The motion that the trolley can perform is acceleration ut∈[-1,1]The reward function is set to:
Figure GDA0002554377500000067
step (2) initializing an environment, setting the discount rate gamma to be 0.95, the attenuation factor lambda to be 0.4, the maximum knot number E to be 500, and the exploration variance sigma of the Gaussian function20.9, the matrix Δ N is initializedkRespective element σ on the middle diagonali 2(1 ≦ i ≦ k) is a random number between 0.1 and 0.8, the maximum time step T included in each episode is 3000, the learning rate of the value function and the learning rate of the policy α are 0.6, the current episode number e is 1, the value function parameter vector is a vector of values
Figure GDA0002554377500000068
Policy parameter vector
Figure GDA0002554377500000069
Gaussian process approximation model parameter vector
Figure GDA00025543775000000610
Planning the maximum times K to be 100;
step (3) initializing a physical system of the trolley for going up the hill,the initial state is defined as x1=(w1,v1) Initializing the ranges of the state space and the action space of the trolley ascending system (-0.5,0), controlling the conditions of success or failure, achieving the target state w-0.5 or the current plot number equal to the maximum plot number E-E, the current time step t-1, and the current state x-x1
Step (4): take the current optimal action u* as the mean of a Gaussian function and the exploration variance σ² specified in step (2) as its variance, build the Gaussian distribution N(u*, σ²), and use it to generate the action u_t to be executed.
Step (5): in the current state x_t, execute the action u_t determined in step (4), obtain the next state x_{t+1} of the physical system and the immediate reward r_{t+1}, forming the sample (x_t, u_t, x_{t+1}, r_{t+1}). The next state x_{t+1} = (w_{t+1}, v_{t+1}) is computed as
v_{t+1} = v_t + 0.001u_t + g·cos(3w_t),
w_{t+1} = w_t + v_{t+1},
where g = -0.0025 is the gravitational acceleration term. After the current action has been executed, the reward r_{t+1} is generated by the reward function: r_{t+1} = 0 if the next state is the target state, and r_{t+1} = -1 otherwise.
Step (6): use the sample to compute the TD error of the value function: δ_t = r_{t+1} + γV(x_{t+1}; ν_t) - V(x_t; ν_t).
Step (7): update the eligibility trace of the value function; the initial eligibility trace vector defaults to 0. The feature φ(x_t) of a state is built from Gaussian radial basis functions whose centers are two-dimensional vectors, so that the dimension of the feature vector is the product of the numbers of center points in the position direction and in the velocity direction. Eleven center points are selected in the position direction: {-1.05, -0.9, -0.75, -0.6, -0.45, -0.3, -0.15, 0, 0.15, 0.3, 0.45}; ten center points are selected in the velocity direction: {-0.058, -0.046, -0.034, -0.022, -0.01, 0.002, 0.014, 0.026, 0.038, 0.05}. The variances of the position and of the velocity are taken as σ_w² and σ_v², respectively.
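A sketch of this feature construction over the 11 × 10 grid of centers, assuming the usual Gaussian radial-basis form exp(-½[(w - c_w)²/σ_w² + (v - c_v)²/σ_v²]); the variance values below are placeholders, since the specification gives them only as formula images.

import numpy as np

POS_CENTERS = np.array([-1.05, -0.9, -0.75, -0.6, -0.45, -0.3, -0.15, 0.0, 0.15, 0.3, 0.45])
VEL_CENTERS = np.array([-0.058, -0.046, -0.034, -0.022, -0.01, 0.002, 0.014, 0.026, 0.038, 0.05])
SIGMA_W2, SIGMA_V2 = 0.15 ** 2, 0.012 ** 2   # placeholder variances, not given in the text

def phi_state(w: float, v: float) -> np.ndarray:
    """Gaussian RBF feature of a state (w, v); dimension 11 * 10 = 110, as in step (7)."""
    cw, cv = np.meshgrid(POS_CENTERS, VEL_CENTERS, indexing="ij")
    activations = np.exp(-0.5 * ((w - cw) ** 2 / SIGMA_W2 + (v - cv) ** 2 / SIGMA_V2))
    return activations.ravel()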
step (8) updating the value function parameter vt+1:νt+1←νttet+1
Step (9) updating strategy parameter thetat+1:θt+1←θtt(u*-ut);
Step (10) uses the sample to update the model intermediate formula pt+1、dt+1、st+1And Pt+1Wherein the state action pairs yt=(xt,ut) Is phi (x)t,ut) Can be expressed as:
Figure GDA0002554377500000079
wherein the content of the first and second substances,
Figure GDA00025543775000000710
the center of the Gaussian radial basis function is represented as a three-dimensional vector, and the dimension of the three-dimensional vector is the product of the number of center points of the position direction, the speed direction and the action direction. The number of central points selected in the position direction is 11, and the central points of the positions are { -1.05, -0.9, -0.75, -0.6, -0.45, -0.3, -0.15, 0, 0.15, 0.3, 0.45 }; the number of central points selected in the speed direction is 10, and the speed central points are { -0.058, -0.046, -0.034, -0.022, -0.01, 0.002, 0.014, 0.026, 0.038 and 0.05 }; the number of the central points selected by the action directions is 5, and the central points of the action directions are as follows: { -1, -0.5,0,0.5, 1}. The variance of position, velocity and motion is taken separately
Figure GDA00025543775000000711
And
Figure GDA00025543775000000712
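The state-action feature φ(x, u) extends the same construction with the five action centers; as above, the Gaussian radial-basis form and the variance values are assumptions.

import numpy as np

POS_CENTERS = np.array([-1.05, -0.9, -0.75, -0.6, -0.45, -0.3, -0.15, 0.0, 0.15, 0.3, 0.45])
VEL_CENTERS = np.array([-0.058, -0.046, -0.034, -0.022, -0.01, 0.002, 0.014, 0.026, 0.038, 0.05])
ACT_CENTERS = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
S_W2, S_V2, S_U2 = 0.15 ** 2, 0.012 ** 2, 0.5 ** 2   # placeholder variances

def phi_state_action(w: float, v: float, u: float) -> np.ndarray:
    """Gaussian RBF feature of a state-action pair; dimension 11 * 10 * 5 = 550, as in step (10)."""
    cw, cv, cu = np.meshgrid(POS_CENTERS, VEL_CENTERS, ACT_CENTERS, indexing="ij")
    activations = np.exp(-0.5 * ((w - cw) ** 2 / S_W2
                                 + (v - cv) ** 2 / S_V2
                                 + (u - cu) ** 2 / S_U2))
    return activations.ravel()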
Step (11): use the current sample to update the state transition function parameter vector β_{t+1}.
Step (12): update the current state x = x_{t+1} and judge whether the state component w_{t+1} of x_{t+1} has reached 0.5:
    if it has, let e = e + 1 and judge whether the current episode number e has reached the maximum value E:
        if it has, go to step (19);
        otherwise, go to step (3);
Step (13): initialize the planning counter k = 1 and the initial state of the planning process x'_k = x_1.
Step (14): in the current state x'_k, select the action u_k to be executed according to step (4), and then predict the next state from the Gaussian process model, where Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T is the state feature matrix at planning step k, β_k is the parameter vector of the Gaussian process model, and ΔN_k ∈ R^{k×k} is the noise matrix whose position components up to planning step k satisfy a Gaussian distribution, with ΔN_k = diag(σ_1², σ_2², ..., σ_k²).
Step (15): update the eligibility trace according to the Gaussian process model.
Step (16): update the value function parameters from the simulated samples generated by the Gaussian process model: ν_{k+1} ← ν_k + αδ_k e_{k+1}.
Step (17): update the policy parameters from the simulated samples generated by the Gaussian process model: θ_{k+1} ← θ_k + αδ_k Δu_k.
Step (18): judge the current planning counter k:
    if k = K, update the current time step t = t + 1 and judge it:
        if the current time step has not reached the maximum time step T, return to step (4);
        otherwise, update the current episode e = e + 1 and judge it:
            if the current episode satisfies e = E, go to step (19);
            otherwise, go to step (3);
    otherwise, let k = k + 1 and continue with step (14);
Step (19): output the optimal policy. The trolley now starts from the initial state x_1 and, from any state x_t, adopts the optimal policy h*(x_t) to obtain the corresponding optimal action, until the target state is reached.
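Putting the steps of the embodiment together, the control loop has a Dyna-like structure: act online, update the critic, the actor and the transition model, then run up to K planning steps with the learned model. The sketch below shows only that outer structure; every per-step formula is passed in as a callable, the signatures are illustrative, and refitting the model at every step is done here purely for simplicity.

import numpy as np

def train(env_step, phi_s, phi_sa, select_action, td_err, ac_update,
          fit_model, predict_next,
          n_episodes: int = 500, max_steps: int = 3000, n_plan: int = 100,
          dim_s: int = 110):
    """Dyna-style outer loop of the embodiment (structure only, formulas supplied by callables)."""
    nu, theta = np.zeros(dim_s), np.zeros(dim_s)
    data_phi, data_next, beta = [], [], None
    for episode in range(n_episodes):
        w, v = -0.5, 0.0                                                   # initial state x_1
        e = np.zeros(dim_s)
        for t in range(max_steps):
            u = select_action(theta, phi_s(w, v))                          # step (4)
            w2, v2, r, done = env_step(w, v, u)                            # step (5)
            delta = td_err(r, nu, phi_s(w, v), phi_s(w2, v2))              # step (6)
            nu, theta, e = ac_update(nu, theta, e, phi_s(w, v), delta, u)  # steps (7)-(9)
            data_phi.append(phi_sa(w, v, u)); data_next.append([w2, v2])   # sample for the model
            beta = fit_model(np.array(data_phi), np.array(data_next))      # steps (10)-(11)
            if done:
                break                                                      # step (12): new episode
            pw, pv = -0.5, 0.0                                             # step (13): x'_1 = x_1
            for _ in range(n_plan):                                        # steps (14)-(18)
                pu = select_action(theta, phi_s(pw, pv))
                pw, pv = predict_next(beta, phi_sa(pw, pv, pu))            # model-based planning step
                # the value and policy updates on this simulated sample (steps (15)-(17)) go here
            w, v = w2, v2
    return nu, theta, beta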

Claims (8)

1. An adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model, characterized by comprising the following steps:
step (1): initializing a model: setting a state space X and an action space U of the environment, wherein a state is represented by a two-dimensional vector x = (w, v) ∈ X, w is the horizontal position of the trolley and v is its horizontal velocity, the action the trolley can execute is an acceleration u ∈ U, the temporary variables of the Gaussian process approximation model, i.e. the state transition function, are a vector p, a variable d = 0, a variable s = 0 and a matrix P, φ(x) is the feature function corresponding to state x, and φ(x, u) is the feature function of the state-action pair (x, u);
step (2): initializing the hyper-parameters: setting the discount rate γ, the decay factor λ, the maximum number of episodes E, the exploration variance σ² of the Gaussian function, the diagonal elements σ_i² (1 ≤ i ≤ k) of the matrix ΔN_k, the maximum number of time steps T per episode, the learning rate α of the value function and the policy, the current episode number e = 1, the value function parameter vector ν, the policy parameter vector θ, the Gaussian process approximation model parameter vector β, and the maximum number of planning steps K;
step (3): initializing the ranges of the state space and the action space of the trolley hill-climbing system and the conditions for control success or failure, and setting the current time step t = 1 and the current state x = x_1;
step (4): taking the current optimal action u* as the mean of a Gaussian function and the exploration variance σ² specified in step (2) as its variance, building the Gaussian distribution N(u*, σ²), and using it to generate the action u_t to be executed;
step (5): in the current state x_t, executing the action u_t determined in step (4), obtaining the next state x_{t+1} of the trolley from the dynamics equation of the system and the immediate reward r_{t+1} from the reward function, forming the sample (x_t, u_t, x_{t+1}, r_{t+1});
step (6): computing the TD error of the value function using the sample: δ_t = r_{t+1} + γV(x_{t+1}; ν_t) - V(x_t; ν_t), wherein ν_t denotes the value function parameter vector corresponding to state x_t, V(x_{t+1}; ν_t) denotes the value function of state x_{t+1}, and V(x_t; ν_t) denotes the value function of state x_t;
step (7): updating the eligibility trace e_{t+1} of the value function;
step (8): updating the value function parameter vector: ν_{t+1} ← ν_t + αδ_t e_{t+1};
step (9): updating the policy parameter vector: θ_{t+1} ← θ_t + αδ_t(u* - u_t);
step (10): using the sample to update the model intermediate quantities p_{t+1}, d_{t+1}, s_{t+1} and P_{t+1}, wherein u_{t+1} denotes the action to be executed in state x_{t+1}, obtained according to step (4), u_t denotes the action executed in state x_t at time step t, and σ_t is the standard deviation of the Gaussian process approximation model at time step t;
step (11): using the current sample to update the state transition function parameter vector β_{t+1};
step (12): updating the current state x = x_{t+1} and judging whether the state component w_{t+1} of x_{t+1} satisfies the control-success condition:
    if yes, letting e = e + 1 and judging whether the current episode satisfies e = E:
        if so, going to step (19);
        otherwise, going to step (13);
step (13): initializing the planning counter k = 1 and the initial state of the planning process x'_k = x_1;
step (14): in the current state x'_k, selecting the action u_k to be executed according to step (4), and then predicting the next state according to the Gaussian process approximation model, wherein Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T is the state feature matrix at planning step k, β_k is the parameter vector of the Gaussian process model, and ΔN_k ∈ R^{k×k} is a noise matrix whose position components up to planning step k satisfy a Gaussian distribution;
step (15): updating the eligibility trace according to the Gaussian process approximation model;
step (16): updating the value function parameters from the simulated samples generated by the Gaussian process approximation model: ν_{k+1} ← ν_k + αδ_k e_{k+1}, wherein δ_k is the TD error of the value function;
step (17): updating the policy parameters from the simulated samples generated by the Gaussian process approximation model: θ_{k+1} ← θ_k + αδ_k Δu_k, wherein Δu_k = u* - u_k, u* is the current optimal action, and u_k denotes the action to be executed, generated using the Gaussian distribution N(u*, σ²);
step (18): judging the current planning counter k:
    if k = K, updating the current time step t = t + 1 and judging it:
        if the current time step has not reached the maximum time step T, returning to step (4);
        otherwise, updating the current episode e = e + 1 and judging it:
            if the current episode satisfies e = E, going to step (19);
            otherwise, going to step (3);
    otherwise, letting k = k + 1 and going to step (14);
step (19): outputting the optimal policy, whereby the trolley starts from the initial state x_1 and, from any state x_t, adopts the optimal policy h*(x_t) to obtain the corresponding optimal action, until the target state is reached.
2. The adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model according to claim 1, wherein the optimal action in step (4) is solved as u* = θ_t^T φ(x_t), wherein φ(x_t) is the feature corresponding to state x_t and θ_t denotes the policy parameter vector corresponding to time step t.
3. The adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model according to claim 1, wherein in step (5), given the current state x = x_t = (w_t, v_t), with w_t the position component and v_t the velocity component, the next state can be written as x_{t+1} = (w_{t+1}, v_{t+1}), wherein the velocity component of the next time step is solved from v_{t+1} = v_t + 0.001u_t + g·cos(3w_t) and the position component of the next time step from w_{t+1} = w_t + v_{t+1}, with g = -0.0025 the gravitational acceleration term, and the reward function is: if the next state x_{t+1} is the target state, r_{t+1} = 0, otherwise r_{t+1} = -1.
4. The adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model according to claim 1, wherein the state value function in step (6) is expressed as V(x_t; ν_t) = ν_t^T φ(x_t), wherein ν_t denotes the value function parameter vector corresponding to state x_t, V(x_{t+1}; ν_t) denotes the value function corresponding to state x_{t+1}, φ(x_t) is the feature corresponding to state x_t, and r_{t+1} is the reward obtained by executing action u_t in state x_t.
5. The adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model according to claim 1, wherein the eligibility trace update formula in step (7) updates the eligibility trace e_t corresponding to state x_t into the eligibility trace e_{t+1} corresponding to state x_{t+1}.
6. The adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model according to claim 1, wherein the policy parameter update in step (9) is θ_{t+1} ← θ_t + αδ_t(u* - u_t), wherein δ_t is the TD error of the value function computed in step (6).
7. The adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model according to claim 1, wherein in step (11) the state transition function parameter vector β_{t+1} is computed from the model intermediate quantities p_{t+1}, d_{t+1} and s_{t+1} obtained in step (10), wherein β_t is the parameter vector of the Gaussian process approximation model, i.e. of the state transition function, corresponding to time step t.
8. The adaptive control method for a trolley hill-climbing system based on a Gaussian process approximation model according to claim 1, wherein the next state obtained in step (14) from the Gaussian process approximation model is predicted using Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T, the state feature matrix at planning step k, wherein x'_1 is the initial state of the planning process, x'_k is the state reached after k planning steps starting from x'_1, β_k is the parameter vector of the Gaussian process model, and ΔN_k ∈ R^{k×k} is a noise matrix whose position components up to planning step k satisfy a Gaussian distribution, i.e. ΔN_k = diag(σ_1², σ_2², ..., σ_k²).
CN201910823151.XA 2019-09-02 2019-09-02 Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model Active CN110531620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910823151.XA CN110531620B (en) 2019-09-02 2019-09-02 Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910823151.XA CN110531620B (en) 2019-09-02 2019-09-02 Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model

Publications (2)

Publication Number Publication Date
CN110531620A CN110531620A (en) 2019-12-03
CN110531620B true CN110531620B (en) 2020-09-18

Family

ID=68666154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910823151.XA Active CN110531620B (en) 2019-09-02 2019-09-02 Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model

Country Status (1)

Country Link
CN (1) CN110531620B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN104932267B (en) * 2015-06-04 2017-10-03 曲阜师范大学 A kind of neural network lea rning control method of use eligibility trace
CN108549232B (en) * 2018-05-08 2019-11-08 常熟理工学院 A kind of room air self-adaptation control method based on approximate model planning

Also Published As

Publication number Publication date
CN110531620A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
WO2021135554A1 (en) Method and device for planning global path of unmanned vehicle
CN110136481B (en) Parking strategy based on deep reinforcement learning
CN110989576B (en) Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN112162555B (en) Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN108161934B (en) Method for realizing robot multi-axis hole assembly by utilizing deep reinforcement learning
CN106950956B (en) Vehicle track prediction system integrating kinematics model and behavior cognition model
Leottau et al. Decentralized reinforcement learning of robot behaviors
CN110615003B (en) Cruise control system based on strategy gradient online learning algorithm and design method
CN111679660B (en) Unmanned deep reinforcement learning method integrating human-like driving behaviors
US20230367934A1 (en) Method and apparatus for constructing vehicle dynamics model and method and apparatus for predicting vehicle state information
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN110083167A (en) A kind of path following method and device of mobile robot
Gómez et al. Optimal motion planning by reinforcement learning in autonomous mobile vehicles
CN111783994A (en) Training method and device for reinforcement learning
Yang et al. Longitudinal tracking control of vehicle platooning using DDPG-based PID
CN114153213A (en) Deep reinforcement learning intelligent vehicle behavior decision method based on path planning
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN113741533A (en) Unmanned aerial vehicle intelligent decision-making system based on simulation learning and reinforcement learning
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN110531620B (en) Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model
CN116127853A (en) Unmanned driving overtaking decision method based on DDPG (distributed data base) with time sequence information fused
CN115857548A (en) Terminal guidance law design method based on deep reinforcement learning
Guo et al. Modeling, learning and prediction of longitudinal behaviors of human-driven vehicles by incorporating internal human DecisionMaking process using inverse model predictive control
CN115743178A (en) Automatic driving method and system based on scene self-adaptive recognition
CN111562740B (en) Automatic control method based on multi-target reinforcement learning algorithm utilizing gradient

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant