CN110531620B - Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model - Google Patents
- Publication number
- CN110531620B (application number CN201910823151.XA)
- Authority
- CN
- China
- Prior art keywords
- state
- function
- gaussian process
- model
- trolley
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses an adaptive control method, based on a Gaussian process approximate model, for a trolley hill-climbing ("mountain car") system. The method learns a value function and a strategy from online samples generated by a physical-system simulator, and at the same time uses those online samples to learn a Gaussian-process-based model of the environment dynamics. Once the dynamics model reaches a given accuracy, the Gaussian-process-based model can be used for offline planning, which together with online learning promotes convergence of the algorithm. The method therefore obtains the optimal control method for the trolley hill-climbing system more quickly.
Description
Technical Field
The invention relates to adaptive control methods for physical systems, and in particular to an adaptive control method, based on a Gaussian process approximate model, for a trolley hill-climbing system.
Background
The trolley hill-climbing system is shown in figure 1: the trolley sits in the valley between two hills, and its target is the five-pointed-star position at the top of the right hill. The trolley's engine is too weak for it to accelerate straight up to the target; it must first back up the left slope so that it has enough forward momentum, and then accelerate to reach the target on the right. Adaptive control of this system means choosing the trolley's acceleration at every time step so that it reaches the right-hand target in the shortest time. This is an optimal control problem over continuous state and action spaces. The control problem of a physical system can generally be modeled as a Markov decision process (MDP): all possible states of the physical system form the state space, all possible actions form the action space, the probability distribution over the next state reached after applying an action in the current state is the transition function, and the environmental feedback obtained after applying an action in the current state is called the reward function.
After the physical system is modeled as an MDP, the optimal strategy can be solved with a reinforcement learning method, which yields the optimal control method for the physical system. Reinforcement learning methods fall into two categories: model-free methods and model-based methods. Model-free methods learn the value function and the strategy directly from samples obtained by the agent's interaction with the environment. They are simple and fast, but each sample is used once to update the value function and strategy and then discarded, so sample utilization is extremely low. Model-based methods can learn the value function and the strategy by planning with a dynamics model, without requiring real samples, and therefore use samples far more efficiently; their drawback is that the optimal solution is obtained by repeatedly iterating the Bellman equation, which gives them a higher computational complexity.
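"Iterating the Bellman equation" can be illustrated on a toy discrete MDP (our example, unrelated to the trolley system): value iteration repeatedly applies the Bellman optimality backup until the values stop changing.

```python
# Toy 3-state, 2-action MDP (our example, unrelated to the trolley system).
# P[s][a] = list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, -1)], 1: [(1.0, 1, -1)]},
    1: {0: [(1.0, 0, -1)], 1: [(1.0, 2, 0)]},
    2: {0: [(1.0, 2, 0)], 1: [(1.0, 2, 0)]},  # state 2 is the absorbing goal
}
gamma = 0.95
V = [0.0, 0.0, 0.0]
for _ in range(100):
    # Bellman optimality backup: V(s) <- max_a sum_s' P(s'|s,a) [r + gamma V(s')]
    V = [max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
         for s in P]
```

For a continuous state space such as the trolley's, this tabular sweep over all states is impossible, which is why the method described below relies on function approximation.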
In most practical physical systems, however, the model is unknown. To exploit model-based planning one must first learn a model and then plan with it. But most physical systems are continuous rather than discrete, so even a known model cannot be used directly in the iterative solution of the Bellman equation; and when the learned model is not accurate enough, the quality of planning suffers directly.
Disclosure of Invention
The invention aims to provide an adaptive control method, based on a Gaussian process approximate model, for the trolley hill-climbing system. The method learns a value function and a strategy from online samples generated by a physical-system simulator, and in the process also uses the online samples to learn a Gaussian-process-based model of the environment dynamics. Once the dynamics model reaches a given accuracy, it is used in planning to generate simulated samples; the simulated samples and the online samples jointly train the value function and the strategy, which promotes convergence of the algorithm and yields the optimal control method for the system more quickly.
The technical scheme of the invention is as follows: a self-adaptive control method of a trolley climbing system based on a Gaussian process approximate model comprises the following steps:
step (1) initializing the model: set the state space X and action space U of the environment, where a state is represented by the two-dimensional vector x = (w, v) ∈ X, w being the position of the trolley in the horizontal direction and v its speed in the horizontal direction, and the action the trolley can execute is an acceleration u ∈ U; the temporary variables of the Gaussian process approximate model, i.e. the state transition function, are a vector p, a variable d = 0, a variable s = 0 and a matrix P; φ(x) is the feature function of state x, and φ(x, u) is the feature function of the state-action pair (x, u);
step (2) initializing the hyper-parameters: set the discount rate γ, the decay factor λ, the maximum episode number E, the exploration variance σ² of the Gaussian function, the diagonal elements σ_i² (1 ≤ i ≤ k) of the matrix ΔN_k, the maximum time step T contained in each episode, the learning rate α of the value function and the strategy, the current episode number e = 1, the value-function parameter vector ν, the policy parameter vector θ, the Gaussian process approximate-model parameter vector β, and the maximum planning count K;
step (3) initializing the trolley hill-climbing system: set the ranges of its state space and action space and the conditions of control success or failure; set the current time step t = 1 and the current state x = x_1;
Step (4) with the current optimal action u* as the mean and the exploration variance σ² specified in step (2) as the variance, build the Gaussian distribution N(u*, σ²) and use it to generate the action u_t to be performed;
Step (5) in the current state x_t, execute the action u_t determined in step (4), obtain the next state x_{t+1} of the trolley from the dynamic equation of the system and the immediate reward r_{t+1} from the reward function, forming the sample (x_t, u_t, x_{t+1}, r_{t+1});
Step (6) compute the TD error δ_t of the value function from the sample: δ_t = r_{t+1} + γV(x_{t+1}, ν_t) − V(x_t, ν_t);
Step (8) update the value-function parameter ν_{t+1}: ν_{t+1} ← ν_t + α_t δ_t e_{t+1};
Step (9) update the strategy parameter θ_{t+1}: θ_{t+1} ← θ_t + α_t(u* − u_t);
Step (10) use the sample to update the model intermediate quantities p_{t+1}, d_{t+1}, s_{t+1} and P_{t+1};
step (12) updating the current state: x ← x_{t+1}, and judging whether the state component w_{t+1} of x_{t+1} satisfies the control-success condition:
if it does,
let e = e + 1 and judge whether the current episode satisfies e = E:
if it does,
turn to step (19);
otherwise,
turn to step (13);
step (13) initialize the planning count k = 1 and the initial state of the planning process x'_k = x_1;
Step (14) in the current state x'_k, select the action u_k to be executed according to step (4), and then predict the next state with the Gaussian process approximate model, where Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T is the state feature matrix at planning step k, β_k is the model parameter of the Gaussian process, and ΔN_k ∈ R^{k×k} is a noise matrix whose position components up to step k satisfy a Gaussian distribution;
step (16) update the value-function parameters with the simulated samples generated by the Gaussian process approximate model: ν_{k+1} ← ν_k + α_k δ_k e_{k+1};
Step (17) update the strategy parameters with the simulated samples generated by the Gaussian process approximate model: θ_{k+1} ← θ_k + α_k Δu_k;
Step (18) judging the current planning count k:
if k = K,
update the current time step t ← t + 1 and judge the current time step t:
if the current time step has not reached the maximum time step T,
continue from step (4);
otherwise,
update the current episode e ← e + 1 and judge the current episode:
if the current episode satisfies e = E,
turn to step (19);
otherwise,
turn to step (3);
otherwise,
let k = k + 1 and turn to step (14);
Step (19) output the optimal strategy. The trolley then starts from its initial state; from any state x_t, the optimal strategy yields the corresponding optimal action, and this is repeated until the target state is reached.
Further, the optimal action in step (4) is solved as u* = θ_t^T φ(x_t), where φ(x_t) is the feature corresponding to state x_t and θ_t represents the strategy parameters corresponding to time step t.
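As a sketch (ours, not code from the patent), this linear deterministic strategy is a dot product between the strategy parameter vector θ_t and the state features; the clipping to the bounded action range [−1, 1] is our addition.

```python
def optimal_action(theta, phi_x):
    """u* = theta^T phi(x_t): linear strategy over state features.

    theta : strategy parameter vector (one entry per feature)
    phi_x : feature vector of the current state x_t
    """
    u = sum(th * f for th, f in zip(theta, phi_x))
    # The action space is bounded, so clip to [-1, 1] (our assumption).
    return max(-1.0, min(1.0, u))
```

For example, optimal_action([0.5, -0.25], [1.0, 2.0]) evaluates the dot product 0.5·1 − 0.25·2 = 0, while a raw output beyond ±1 is clipped to the maximum acceleration.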
Further, in step (5), given the current state x_t = (w_t, v_t), where w_t is the position component and v_t the velocity component, the next state can be represented as x_{t+1} = (w_{t+1}, v_{t+1}), where the velocity component of the next time step is solved from v_{t+1} = v_t + 0.001u_t + g·cos(3w_t) and the position component from w_{t+1} = w_t + v_{t+1}, with g = −0.0025 the gravity constant; the reward function is: if the next state x_{t+1} is the target state, then r_{t+1} = 0, otherwise r_{t+1} = −1;
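These dynamics can be exercised numerically. The sketch below (our illustration; the sign-of-velocity "energy pumping" heuristic is a well-known baseline for this system, not the patent's learned strategy, and the clipping of w and v to their stated ranges is our assumption) simulates the equations above and reproduces the behavior described in the background: the trolley first swings left of its start before it can crest the right hill at w = 0.5.

```python
import math

def step(w, v, u, g=-0.0025):
    """One step of the dynamics: v' = v + 0.001*u + g*cos(3w); w' = w + v'."""
    v = min(max(v + 0.001 * u + g * math.cos(3 * w), -0.07), 0.07)  # velocity range
    w = min(max(w + v, -1.2), 0.5)                                  # position range
    return w, v

w, v = -0.5, 0.0              # initial state x_1
lowest_w = w
for t in range(3000):         # maximum time steps of one episode
    u = 1.0 if v >= 0 else -1.0   # push in the direction of motion
    w, v = step(w, v, u)
    lowest_w = min(lowest_w, w)
    if w >= 0.5:              # target reached
        break
```

In this simulation the trolley reaches the target well before the 3000-step limit, and lowest_w drops below the start position −0.5: the trolley backs up before succeeding, exactly as the background section describes.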
Further, the state-value function in step (6) is V(x_t, ν_t) = ν_t^T φ(x_t), where ν_t is the value-function parameter vector corresponding to state x_t, V(x_{t+1}, ν_t) is the value function corresponding to state x_{t+1}, φ(x_t) is the feature corresponding to state x_t, and r_{t+1} is the reward obtained by executing action u_t in state x_t.
Further, step (7) updates the eligibility trace of the value function, where e_t represents the eligibility trace corresponding to state x_t and e_{t+1} the eligibility trace corresponding to state x_{t+1}.
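The patent text does not reproduce the trace formula itself, so the sketch below assumes the standard accumulating trace of linear TD(λ), e ← γλe + φ(x_t), and combines steps (6) through (8) into one critic update (our reconstruction, with γ, λ, α set to the values given later in the embodiment):

```python
gamma, lam, alpha = 0.95, 0.4, 0.6   # values from the embodiment

def td_lambda_update(nu, e, phi_x, phi_x_next, r_next):
    """One linear TD(lambda) critic update: TD error, trace, then parameters."""
    v_x      = sum(n * f for n, f in zip(nu, phi_x))       # V(x_t, nu_t)
    v_x_next = sum(n * f for n, f in zip(nu, phi_x_next))  # V(x_{t+1}, nu_t)
    delta = r_next + gamma * v_x_next - v_x                # step (6): TD error
    e  = [gamma * lam * ei + fi for ei, fi in zip(e, phi_x)]   # step (7): trace (assumed form)
    nu = [ni + alpha * delta * ei for ni, ei in zip(nu, e)]    # step (8): parameter update
    return nu, e, delta
```

The trace lets one TD error update the parameters of all recently visited states at once, which speeds credit assignment along the trolley's trajectory.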
Further, the value-function parameter update in step (8) is ν_{t+1} ← ν_t + α_t δ_t e_{t+1}, where ν_t is the value-function parameter vector corresponding to state x_t.
Further, the strategy parameter update in step (9) is θ_{t+1} ← θ_t + α_t Δu_t, where Δu_t = u* − u_t and δ_t is the TD error of the value function computed in step (6).
Further, the update formulas for the model intermediate quantities p_{t+1}, d_{t+1}, s_{t+1} and P_{t+1} in step (10) are, respectively:
wherein u_{t+1} denotes the action, obtained according to step (4), to be performed in state x_{t+1}, u_t denotes the action, obtained according to step (4), performed in state x_t at time step t, and σ_t is the standard deviation of the Gaussian process approximate model at time step t.
Further, the state-transition-function parameter in step (11) is computed from p_{t+1}, d_{t+1} and s_{t+1}, the model intermediate variables found according to step (10); β_t is the parameter vector of the Gaussian process approximate model, i.e. the state transition function, corresponding to time step t.
Further, the next state obtained in step (14) from the Gaussian process approximate model is given by the model's prediction, where Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T is the state feature matrix at planning step k, x'_1 is the initial state of planning, x'_k is the state reached after k planning steps starting from x'_1, β_k is the model parameter of the Gaussian process, and ΔN_k ∈ R^{k×k} is a noise matrix whose position components up to k planning steps satisfy a Gaussian distribution.
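The prediction can be sketched with standard Gaussian-process regression (our illustration: a squared-exponential kernel plays the role of Φ_kΦ_k^T, the noise matrix is added to the kernel matrix, and the length scale and noise level are assumed values; the patent's exact recursive computation of β_k is not reproduced here). The toy below fits the GP to three (w, v, u) → v' samples of the dynamics and predicts a nearby point:

```python
import numpy as np

def rbf(a, b, ell=0.3):
    """Squared-exponential kernel (length scale ell is an assumed value)."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * ell ** 2))

def gp_predict(X, y, x_star, noise=1e-2):
    """Posterior mean and variance of GP regression at x_star."""
    K = np.array([[rbf(xi, xj) for xj in X] for xi in X])   # kernel matrix
    k_star = np.array([rbf(x_star, xi) for xi in X])
    a = np.linalg.solve(K + noise * np.eye(len(X)), y)      # (K + noise I)^{-1} y
    mean = k_star @ a
    var = rbf(x_star, x_star) - k_star @ np.linalg.solve(K + noise * np.eye(len(X)), k_star)
    return mean, var

# Training inputs (w, v, u) and targets v' = v + 0.001*u - 0.0025*cos(3*w)
X = np.array([[-0.5, 0.0, 1.0], [-0.5, 0.01, 1.0], [-0.45, 0.0, -1.0]])
y = np.array([v + 0.001 * u - 0.0025 * np.cos(3 * w) for w, v, u in X])
mean, var = gp_predict(X, y, np.array([-0.5, 0.005, 1.0]))
```

Near the training data the posterior mean tracks the true next velocity closely and the posterior variance is small; far from the data the variance grows, which is what makes the model's accuracy checkable before it is trusted for planning.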
The technical scheme provided by the invention has the following advantages: it establishes an approximation method suited to solving the optimal strategy of the trolley hill-climbing system, i.e., it approximately solves the Bellman equation for the value function and the strategy under continuous control; and it improves the learning accuracy of the model, so that once the model meets a given accuracy requirement it is used in planning to generate simulated samples, which promotes convergence of the optimal control method.
Drawings
FIG. 1 is a schematic view of a cart mountain climbing system;
FIG. 2 is a schematic flow chart of the method of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are not to be construed as limiting the invention thereto.
Referring to figs. 1 and 2, the dynamic model of the trolley hill-climbing system addressed by the present invention is initially set up as follows: at any time step t, the state of the trolley is x_t = (w_t, v_t), where w_t is the position of the trolley on the x axis at time step t; the x-axis limit at the top of the left hill is −1.2 and at the top of the right hill is 0.5, so the position satisfies −1.2 ≤ w_t ≤ 0.5; v_t is the speed of the trolley, with −0.07 ≤ v_t ≤ 0.07; u_t, the action applied to the trolley, i.e. the acceleration of the trolley corresponding to time step t, satisfies −1 ≤ u_t ≤ 1, where a positive value corresponds to the accelerator and a negative value to the brake. At the initial time step, i.e. t = 1, the initial state of the trolley is x_1 = (w_1, v_1) = (−0.5, 0). The target position of the trolley is the five-pointed-star position on the right, i.e. w_t = 0.5.
Suppose the current state of the trolley x_t = (w_t, v_t) is known; then at the next time step t + 1 the next state of the trolley x_{t+1} = (w_{t+1}, v_{t+1}) is given by v_{t+1} = v_t + 0.001u_t + g·cos(3w_t) and w_{t+1} = w_t + v_{t+1},
where g = −0.0025 is the gravity constant and u_t is the action exerted on the trolley, i.e. the acceleration of the trolley corresponding to time step t.
Solution objective of the problem: from the dynamic model given above it can be seen that, given the current state x_t = (w_t, v_t), as long as the strategy is known, i.e. the acceleration u_t at each time step, the state of the trolley at the next time step can be obtained, and so on until the goal is achieved.
The optimality of a strategy can be measured by the time required to reach the goal: the less time required, the better the corresponding strategy. To capture this, a reward function is introduced:
from this reward function, it can be seen that if the more time steps the vehicle takes to reach the target from the initial position, the less the accumulated reward value; conversely, if less time steps are required to reach the goal, then the more prize values are accumulated.
Since fewer time steps to reach the target position means a better strategy, the optimization goal of the algorithm is to maximize the accumulated reward.
The accumulated reward can be approximated by a V-value function; the V value corresponding to state x_t at time step t can be expressed as V(x_t, ν_t) = ν_t^T φ(x_t),
where ν_t is the V-value-function parameter vector corresponding to time step t and φ(x_t) is the feature of state x_t.
Solving the optimal strategy means finding the strategy that maximizes the accumulated reward, namely:
h(x_t)* = argmax_h V^h(x_t, ν_t)
where V^h(x_t, ν_t) represents the V-value function under the adopted strategy h.
When the optimal strategy has been obtained, h(x)* yields the optimal acceleration of the trolley for any time step t as follows:
referring to fig. 2, the adaptive control method for the mountain climbing system of the trolley based on the gaussian process approximation model in the embodiment includes the following steps:
step (1) initializing the model: the variables of the environment's state space X form the 2-dimensional vector x_t = (w_t, v_t), where w_t ∈ [−1.2, +0.5] is the position of the trolley in the horizontal direction and v_t ∈ [−0.07, +0.07] is the speed of the trolley in the horizontal direction. The temporary variables of the Gaussian process approximate model (the state transition function) are a vector p, a variable d = 0, a variable s = 0 and a matrix P; φ(x) is the feature function of state x, and φ(x, u) is the feature function of the state-action pair (x, u). The action the trolley can perform is the acceleration u_t ∈ [−1, 1], and the reward function is set to:
step (2) initializing the environment: set the discount rate γ = 0.95, the decay factor λ = 0.4, the maximum episode number E = 500, and the exploration variance of the Gaussian function σ² = 0.9; initialize each diagonal element σ_i² (1 ≤ i ≤ k) of the matrix ΔN_k to a random number between 0.1 and 0.8; the maximum time step T contained in each episode is 3000, the learning rate α of the value function and the strategy is 0.6, and the current episode number e = 1; initialize the value-function parameter vector ν, the policy parameter vector θ and the Gaussian process approximate-model parameter vector β; the maximum planning count K = 100;
step (3) initializing the physical system of the trolley hill-climb: the initial state is x_1 = (w_1, v_1) = (−0.5, 0); initialize the ranges of the state space and action space of the system and the conditions of control success or failure, namely reaching the target state w = 0.5 (success) or the current episode number reaching the maximum episode number, e = E (failure); the current time step t = 1 and the current state x = x_1;
Step (4) with the current optimal action u* as the mean and the exploration variance σ² specified in step (2) as the variance, build the Gaussian distribution N(u*, σ²) and use it to generate the action u_t to be performed;
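Step (4) can be sketched as follows (our illustration; σ² = 0.9 is the exploration variance from step (2), and the clipping of the sampled action to the range [−1, 1] is our assumption):

```python
import random

_rng = random.Random(0)  # fixed seed for reproducibility

def explore(u_star, sigma2=0.9):
    """Sample the executed action from N(u*, sigma^2), clipped to [-1, 1]."""
    u = _rng.gauss(u_star, sigma2 ** 0.5)
    return max(-1.0, min(1.0, u))
```

Sampling around the current greedy action u* keeps exploration local: most samples stay near u*, while the variance still allows occasional large deviations that try out quite different accelerations.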
Step (5) in the current state x_t, execute the action u_t determined in step (4), obtain the next state x_{t+1} of the physical system and the immediate reward r_{t+1}, forming the sample (x_t, u_t, x_{t+1}, r_{t+1}); in computing the next state x_{t+1} = (w_{t+1}, v_{t+1}), the next position w_{t+1} and velocity v_{t+1} are calculated as v_{t+1} = v_t + 0.001u_t + g·cos(3w_t) and w_{t+1} = w_t + v_{t+1},
where g = −0.0025 is the gravity constant.
The reward r_{t+1} after execution of the current action is generated according to the reward function:
Step (6) compute the TD error δ_t of the value function from the sample: δ_t = r_{t+1} + γV(x_{t+1}, ν_t) − V(x_t, ν_t);
Step (7) updating the eligibility trace of the value function; the initial eligibility-trace vector defaults to 0, and the feature φ(x_t) is calculated from Gaussian radial basis functions whose centers form a two-dimensional grid; the dimension of the feature is the product of the numbers of center points in the position and velocity directions. The number of center points selected in the position direction is 11: {−1.05, −0.9, −0.75, −0.6, −0.45, −0.3, −0.15, 0, 0.15, 0.3, 0.45}; the number of center points selected in the velocity direction is 10: {−0.058, −0.046, −0.034, −0.022, −0.01, 0.002, 0.014, 0.026, 0.038, 0.05}; the variances in the position and velocity directions are taken separately.
step (8) updating the value-function parameter ν_{t+1}: ν_{t+1} ← ν_t + α_t δ_t e_{t+1};
Step (9) updating the strategy parameter θ_{t+1}: θ_{t+1} ← θ_t + α_t(u* − u_t);
Step (10) uses the sample to update the model intermediate quantities p_{t+1}, d_{t+1}, s_{t+1} and P_{t+1}, where the feature φ(x_t, u_t) of the state-action pair y_t = (x_t, u_t) is expressed through Gaussian radial basis functions whose centers form a three-dimensional grid; the dimension of the feature is the product of the numbers of center points in the position, velocity and action directions. The number of center points selected in the position direction is 11: {−1.05, −0.9, −0.75, −0.6, −0.45, −0.3, −0.15, 0, 0.15, 0.3, 0.45}; the number of center points selected in the velocity direction is 10: {−0.058, −0.046, −0.034, −0.022, −0.01, 0.002, 0.014, 0.026, 0.038, 0.05}; the number of center points selected in the action direction is 5: {−1, −0.5, 0, 0.5, 1}. The variances in the position, velocity and action directions are taken separately.
step (12) updating the current state: x ← x_{t+1}, and judging whether the state component w_{t+1} of x_{t+1} satisfies w_{t+1} = 0.5:
if yes, let e = e + 1 and judge whether the current episode e has reached the maximum value E:
if it has,
turn to step (19);
otherwise,
turn to step (3);
step (13) initialize the planning count k = 1 and the initial state of the planning process x'_k = x_1;
Step (14) in the current state x'_k, select the action u_k to be executed according to step (4), then predict the next state from the Gaussian process model, where Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T is the state feature matrix at planning step k, β_k is the model parameter of the Gaussian process, and ΔN_k ∈ R^{k×k} is a noise matrix satisfying a Gaussian distribution in the position components up to step k, with values:
step (16) updating the value-function parameters according to the simulated samples generated by the Gaussian process model: ν_{k+1} ← ν_k + α_k δ_k e_{k+1};
Step (17) updating the strategy parameters according to the simulated samples generated by the Gaussian process model: θ_{k+1} ← θ_k + α_k Δu_k;
Step (18) judging the current planning count k:
if k = K, update the current time step t ← t + 1 and judge the time step t:
if the current time step has not reached the maximum time step T,
continue from step (4);
otherwise,
update the current episode e ← e + 1 and judge the current episode:
if the current episode satisfies e = E,
turn to step (19);
otherwise,
turn to step (3);
otherwise,
let k = k + 1 and continue from step (14).
Claims (8)
1. A self-adaptive control method of a trolley hill climbing system based on a Gaussian process approximate model is characterized by comprising the following steps:
initializing a model: set the state space X and action space U of the environment, where a state is represented by the two-dimensional vector x = (w, v) ∈ X, w being the position of the trolley in the horizontal direction and v its speed in the horizontal direction, and the action the trolley can execute is an acceleration u ∈ U; the temporary variables of the Gaussian process approximate model, i.e. the state transition function, are a vector p, a variable d = 0, a variable s = 0 and a matrix P; φ(x) is the feature function of state x, and φ(x, u) is the feature function of the state-action pair (x, u);
initializing the hyper-parameters: set the discount rate γ, the decay factor λ, the maximum episode number E, the exploration variance σ² of the Gaussian function, the diagonal elements σ_i² (1 ≤ i ≤ k) of the matrix ΔN_k, the maximum time step T contained in each episode, the learning rate α of the value function and the strategy, the current episode number e = 1, the value-function parameter vector ν, the policy parameter vector θ, the Gaussian process approximate-model parameter vector β, and the maximum planning count K;
initializing the trolley hill-climbing system: set the ranges of its state space and action space and the conditions of control success or failure; set the current time step t = 1 and the current state x = x_1;
Step (4) with the current optimal action u* as the mean and the exploration variance σ² specified in step (2) as the variance, build the Gaussian distribution N(u*, σ²) and use it to generate the action u_t to be performed;
Step (5) in the current state x_t, execute the action u_t determined in step (4), obtain the next state x_{t+1} of the trolley from the dynamic equation of the system and the immediate reward r_{t+1} from the reward function, forming the sample (x_t, u_t, x_{t+1}, r_{t+1});
Step (6) compute the TD error δ_t of the value function from the sample: δ_t = r_{t+1} + γV(x_{t+1}, ν_t) − V(x_t, ν_t), where ν_t represents the value-function parameters corresponding to state x_t, V(x_{t+1}, ν_t) represents the value function corresponding to state x_{t+1}, and V(x_t, ν_t) represents the value function corresponding to state x_t;
Step (8) update the value-function parameter ν_{t+1}: ν_{t+1} ← ν_t + α_t δ_t e_{t+1};
Step (9) update the strategy parameter θ_{t+1}: θ_{t+1} ← θ_t + α_t(u* − u_t);
Step (10) use the sample to update the model intermediate quantities p_{t+1}, d_{t+1}, s_{t+1} and P_{t+1},
wherein u_{t+1} denotes the action, obtained according to step (4), to be performed in state x_{t+1}, u_t denotes the action, obtained according to step (4), performed in state x_t at time step t, and σ_t is the standard deviation of the Gaussian process approximate model at time step t;
step (12) updating the current state: x ← x_{t+1}, and judging whether the state component w_{t+1} of x_{t+1} satisfies the control-success condition:
if yes, let e = e + 1 and judge whether the current episode satisfies e = E:
if so, turn to step (19);
otherwise, turn to step (13);
step (13) initialize the planning count k = 1 and the initial state of the planning process x'_k = x_1;
Step (14) in the current state x'_k, select the action u_k to be executed according to step (4), and then predict the next state with the Gaussian process approximate model, where Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T is the state feature matrix at planning step k, β_k is the model parameter of the Gaussian process, and ΔN_k ∈ R^{k×k} is a noise matrix whose position components up to step k satisfy a Gaussian distribution;
step (16) updating the value-function parameters according to the simulated samples generated by the Gaussian process approximate model: ν_{k+1} ← ν_k + α_k δ_k e_{k+1}, where δ_k is the TD error of the value function;
step (17) updating the strategy parameters according to the simulated samples generated by the Gaussian process approximate model: θ_{k+1} ← θ_k + α_k Δu_k, where Δu_k = u* − u_k, u* is the current optimal action, and u_k denotes the action to be performed generated with the Gaussian distribution N(u*, σ²);
step (18) judging the current planning count k:
if k = K,
update the current time step t ← t + 1 and judge the current time step t:
if the current time step has not reached the maximum time step T,
continue from step (4);
otherwise,
update the current episode e ← e + 1 and judge the current episode:
if the current episode satisfies e = E,
turn to step (19);
otherwise,
turn to step (3);
otherwise,
let k = k + 1 and turn to step (14).
2. The adaptive control method for the trolley hill-climbing system based on a Gaussian process approximate model as claimed in claim 1, wherein the optimal action in step (4) is solved as u* = θ_t^T φ(x_t), where φ(x_t) is the feature corresponding to state x_t and θ_t represents the strategy parameters corresponding to time step t.
3. The adaptive control method for the trolley hill-climbing system based on a Gaussian process approximate model as claimed in claim 1, wherein in step (5), given the current state x_t = (w_t, v_t), w_t being the position component and v_t the velocity component, the next state can be represented as x_{t+1} = (w_{t+1}, v_{t+1}), where the velocity component of the next time step is solved from v_{t+1} = v_t + 0.001u_t + g·cos(3w_t) and the position component from w_{t+1} = w_t + v_{t+1}, with g = −0.0025 the gravity constant; the reward function is: if the next state x_{t+1} is the target state, then r_{t+1} = 0, otherwise r_{t+1} = −1.
4. The adaptive control method for the trolley hill-climbing system based on a Gaussian process approximate model as claimed in claim 1, wherein the state-value function in step (6) is V(x_t, ν_t) = ν_t^T φ(x_t), where ν_t represents the value-function parameters corresponding to state x_t, V(x_{t+1}, ν_t) is the value function corresponding to state x_{t+1}, φ(x_t) is the feature corresponding to state x_t, and r_{t+1} is the reward obtained by executing action u_t in state x_t.
5. The adaptive control method for the trolley hill-climbing system based on a Gaussian process approximate model as claimed in claim 1, wherein step (7) updates the eligibility trace, with e_t representing the eligibility trace corresponding to state x_t and e_{t+1} the eligibility trace corresponding to state x_{t+1}.
6. The adaptive control method for the trolley hill-climbing system based on a Gaussian process approximate model as claimed in claim 1, wherein the strategy parameter update in step (9) is θ_{t+1} ← θ_t + α_t(u* − u_t), where δ_t is the TD error of the value function computed in step (6).
7. The adaptive control method for the trolley hill-climbing system based on a Gaussian process approximate model as claimed in claim 1, wherein in step (11) the state-transition-function parameter vector is computed, where p_{t+1}, d_{t+1} and s_{t+1} are obtained according to step (10), and β_t is the parameter vector of the Gaussian process approximate model, i.e. the state transition function, corresponding to time step t.
8. The adaptive control method for the trolley hill-climbing system based on a Gaussian process approximate model as claimed in claim 1, wherein the next state obtained in step (14) based on the Gaussian process approximate model is given by the model's prediction, where Φ_k = (φ(x'_1, u_1), φ(x'_2, u_2), ..., φ(x'_k, u_k))^T is the state feature matrix at planning step k, x'_1 is the initial state of planning, x'_k is the state reached after k planning steps starting from x'_1, β_k is the model parameter of the Gaussian process, and ΔN_k ∈ R^{k×k} is a noise matrix whose position components up to k planning steps satisfy a Gaussian distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910823151.XA CN110531620B (en) | 2019-09-02 | 2019-09-02 | Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110531620A CN110531620A (en) | 2019-12-03 |
CN110531620B true CN110531620B (en) | 2020-09-18 |
Family
ID=68666154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910823151.XA Active CN110531620B (en) | 2019-09-02 | 2019-09-02 | Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110531620B (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929281A (en) * | 2012-11-05 | 2013-02-13 | 西南科技大学 | Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment |
CN104932267B (en) * | 2015-06-04 | 2017-10-03 | 曲阜师范大学 | A kind of neural network lea rning control method of use eligibility trace |
CN108549232B (en) * | 2018-05-08 | 2019-11-08 | 常熟理工学院 | A kind of room air self-adaptation control method based on approximate model planning |
- 2019-09-02: CN application CN201910823151.XA filed; granted as CN110531620B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN110531620A (en) | 2019-12-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021135554A1 (en) | Method and device for planning global path of unmanned vehicle | |
CN110136481B (en) | Parking strategy based on deep reinforcement learning | |
CN110989576B (en) | Target following and dynamic obstacle avoidance control method for differential slip steering vehicle | |
CN112162555B (en) | Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet | |
CN108161934B (en) | Method for realizing robot multi-axis hole assembly by utilizing deep reinforcement learning | |
CN106950956B (en) | Vehicle track prediction system integrating kinematics model and behavior cognition model | |
Leottau et al. | Decentralized reinforcement learning of robot behaviors | |
CN110615003B (en) | Cruise control system based on strategy gradient online learning algorithm and design method | |
CN111679660B (en) | Unmanned deep reinforcement learning method integrating human-like driving behaviors | |
US20230367934A1 (en) | Method and apparatus for constructing vehicle dynamics model and method and apparatus for predicting vehicle state information | |
CN112286218B (en) | Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient | |
CN110083167A (en) | A kind of path following method and device of mobile robot | |
Gómez et al. | Optimal motion planning by reinforcement learning in autonomous mobile vehicles | |
CN111783994A (en) | Training method and device for reinforcement learning | |
Yang et al. | Longitudinal tracking control of vehicle platooning using DDPG-based PID | |
CN114153213A (en) | Deep reinforcement learning intelligent vehicle behavior decision method based on path planning | |
CN116679719A (en) | Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy | |
CN113741533A (en) | Unmanned aerial vehicle intelligent decision-making system based on simulation learning and reinforcement learning | |
CN115373415A (en) | Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning | |
CN110531620B (en) | Adaptive control method of mountain climbing system of trolley based on Gaussian process approximate model | |
CN116127853A (en) | Unmanned driving overtaking decision method based on DDPG (distributed data base) with time sequence information fused | |
CN115857548A (en) | Terminal guidance law design method based on deep reinforcement learning | |
Guo et al. | Modeling, learning and prediction of longitudinal behaviors of human-driven vehicles by incorporating internal human decision-making process using inverse model predictive control | |
CN115743178A (en) | Automatic driving method and system based on scene self-adaptive recognition | |
CN111562740B (en) | Automatic control method based on multi-target reinforcement learning algorithm utilizing gradient |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||