CN107885086B - Autonomous navigation device control parameter online tuning method based on MCMC-optimized Q-learning - Google Patents
- Publication number
- CN107885086B CN201711144395.2A CN201711144395A
- Authority
- CN
- China
- Prior art keywords
- moment
- navigation device
- autonomous navigation
- control parameter
- movement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Abstract
The invention discloses an online tuning method for the control parameters of an autonomous navigation device based on MCMC-optimized Q-learning, comprising the following steps: first, the possible changes of the craft's PID control parameters are enumerated according to the actual situation to obtain a parameter-adjustment action set, and the PID control parameters are initialized from control experience with the craft; then an action is selected at random and applied to the device, the value Q* of each action obtained by the Q-learning algorithm together with MCMC sampling determines the action taken at the next moment, and the learning factor l of the Q-learning algorithm is adjusted over time by an SPSA step-size adjustment algorithm; finally, repeated parameter adjustment yields the optimal control parameters under the present circumstances. The invention solves the overshoot and delay problems of the autonomous navigation device during navigation, so that it adapts rapidly to environmental change and reaches its destination quickly and stably.
Description
Technical field
The invention belongs to the field of online tuning of autonomous navigation device control parameters, and specifically concerns a method for adjusting the control parameters of an autonomous navigation device.
Background art
Autonomous navigation means that the craft reaches a manually specified destination on the water surface by planning its own path and continuously correcting its own course. It has important application value in areas such as water-quality inspection.
At present, traditional autonomous navigation devices use fixed PID parameters obtained from extensive engineering experience with autonomous navigation projects. When the fixed control parameters do not suit the current environment, overshoot and response delay arise during autonomous navigation. Especially in changeable environments, a fixed set of control parameters may respond well to individual environmental states but cannot satisfy all of them, and the control parameters must be changed manually whenever the environment changes, which is inconvenient.
There are also methods that adjust the control parameters with fuzzy algorithms or annealing algorithms. These introduce a degree of control-parameter self-correction, but since they are not intelligent control algorithms, they still cannot adjust the control parameters of the autonomous navigation device to their optimal values quickly in changeable environments.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention provides an online tuning method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, which solves the overshoot and delay problems of the device during navigation, so that it adapts rapidly to environmental change and reaches its destination quickly and stably.
In order to achieve the above object, the technical scheme adopted by the invention is as follows:
The online tuning method for autonomous navigation device control parameters based on MCMC-optimized Q-learning of the present invention comprises the following steps:
Step 1: according to the control precision σ of the autonomous navigation device, obtain from formula (1) the adjustment steps Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d:
In formula (1), X_p, X_i and X_d respectively denote the threshold ranges of the three PID control parameters k_p, k_i and k_d;
Step 2: combine the adjustment steps Δk_p, Δk_i and Δk_d to obtain the parameter-change action set of the autonomous navigation device, denoted A = {a_1, a_2, ..., a_n, ..., a_N}, where a_n denotes the n-th control-parameter adjustment action in the set; its components Δk_p^n, Δk_i^n and Δk_d^n respectively denote the proportional, integral and derivative adjustments of the n-th action, n = 1, 2, ..., N;
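As an illustrative sketch (not part of the patent text), the combination in step 2 can be enumerated in Python. The up/hold/down directions per parameter follow the embodiment's example, giving N = 3^3 = 27 actions; the step sizes passed in are hypothetical stand-ins for the values formula (1) produces.

```python
from itertools import product

def build_action_set(dkp, dki, dkd):
    """Enumerate the parameter-change action set A of step 2.

    Each PID parameter may move up by its adjustment step, stay, or
    move down (the three transition activities described in the
    embodiment), so the combined set has 3 ** 3 = 27 actions."""
    directions = (-1, 0, 1)
    return [(sp * dkp, si * dki, sd * dkd)
            for sp, si, sd in product(directions, repeat=3)]

# Hypothetical step sizes; the real ones come from formula (1).
actions = build_action_set(1.0, 0.5, 0.1)
print(len(actions))   # 27
```

Treating the three adjustments jointly keeps the action space at a single set A, which is what lets one Q-value and one transition matrix cover all parameter changes at once.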
Step 3: set time t = 1 and randomly select a control-parameter adjustment action a″_n^{t−1} to apply to the autonomous navigation device;
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1];
Initialize the three PID control parameters k_p, k_i and k_d from control experience with the autonomous navigation device;
Initialize the value-function estimate Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) of the Q-learning algorithm at time t−1, where e_{t−1} denotes the error of the device at time t−1 and Δe_{t−1} its error rate of change; together e_{t−1} and Δe_{t−1} form the environment state at time t−1;
Step 4: according to the number N of control-parameter adjustment actions in the action set A of the autonomous navigation device, initialize the transition matrix of the decision process in the Q-learning algorithm using formula (2):
In formula (2), p(a_m | a_n) denotes the probability of transferring at time t−1 from adjustment action a_n to adjustment action a_m, with its initial value set at t = 1;
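Formula (2) itself is not reproduced in this text; assuming, as is natural for an uninformative start over N actions, that every action is initially equally likely to follow every other, the initialization can be sketched as:

```python
def init_transition_matrix(n_actions):
    """Uniform initial transition matrix: every adjustment action is
    equally likely to follow every other at t = 1 (assumed form of
    formula (2), which is an image in the source)."""
    p = 1.0 / n_actions
    return [[p] * n_actions for _ in range(n_actions)]

P = init_transition_matrix(27)
print(abs(sum(P[0]) - 1.0) < 1e-9)   # True: each row is a distribution
```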
Step 5: obtain the decision process at time t using MCMC-optimized Q-learning;
Step 5.1: compute with formula (3) the value of the n-th adjustment action of time t under the environment state:
In formula (3), w_j(t−1) denotes the weight of the j-th hidden node of the BP neural network at time t−1, j = 1, 2, ..., nh, where nh denotes the number of hidden nodes of the BP neural network; y_j(t−1) denotes the output of the j-th hidden node at time t−1, given by formula (4):
In formula (4), o_j(t−1) denotes the input of the j-th hidden node at time t−1, given by formula (5):
In formula (5), w_ij(t−1) denotes the weight from the i-th input node to the j-th hidden node at time t−1, and x_i(t−1) denotes the i-th input at time t−1, i = 1, 2, ..., ni, where ni denotes the number of input nodes of the BP neural network;
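Formulas (3)-(5) describe an ordinary feed-forward pass of the BP network. The hidden activation function is not reproduced in this excerpt, so a sigmoid is assumed in this sketch; the weight values used below are the initial ones the embodiment suggests.

```python
import math

def bp_value(x, w_in, w_out):
    """Value estimate of an action under the environment state via the
    BP network of formulas (3)-(5).

    x     : inputs (error, error rate, action encoding), ni values
    w_in  : ni x nh input-to-hidden weights w_ij
    w_out : nh hidden-to-output weights w_j
    The hidden activation is assumed sigmoid (not given in the excerpt)."""
    q = 0.0
    for j in range(len(w_out)):
        o_j = sum(x_i * w_in[i][j] for i, x_i in enumerate(x))  # formula (5)
        y_j = 1.0 / (1.0 + math.exp(-o_j))                      # formula (4)
        q += w_out[j] * y_j                                     # formula (3)
    return q

# Embodiment's initial weights: ni = 3 inputs, nh = 5 hidden nodes,
# w_ij = 0.8, w_j = 1.
ni, nh = 3, 5
w_in = [[0.8] * nh for _ in range(ni)]
w_out = [1.0] * nh
print(bp_value((0.0, 0.0, 0.0), w_in, w_out))   # 2.5 (all sigmoids at 0.5)
```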
Step 5.2: obtain the control-parameter adjustment action a″_n^t of the autonomous navigation device at time t by sampling with the MCMC algorithm;
Step 5.2.1: according to the value of the n-th adjustment action of time t under the environment state and the action a″_n^{t−1} chosen at time t−1, update the transition probability matrix of the decision process using formula (6):
In formula (6), the numerator denotes the value of the n-th adjustment action of time t and the denominator denotes the sum of the values of all actions of time t, n = 1, 2, ..., N; the resulting entry denotes the probability of transferring at time t from the n-th adjustment action to the m-th adjustment action;
Step 5.2.2: set the sampling counter c = 0, 1, 2, ..., C;
Step 5.2.3: draw the c-th sample from the transition probability matrix of time t, and obtain from formula (7) the acceptance rate of the (c+1)-th sample of the MCMC algorithm at time t:
In formula (7), p_{c+1}(a′_n^t) denotes the probability of the action obtained by the (c+1)-th sample at time t, and p_c(a′_n^t) denotes the probability of the action a′_n^t obtained by the c-th sample; when c = 0, the distribution p_c(a′_n^t) of the action obtained by the c-th sample is taken as the uniform distribution;
Step 5.2.4: draw a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sample; otherwise reject it and carry a′_n^t forward as the (c+1)-th action;
Step 5.2.5: update with formula (8) the probability distribution p_{c+1}(a′_n^t) of the action a′_n^t obtained by the (c+1)-th sample at time t:
In formula (8), the numerator and denominator of the distribution p_c(a′_n^t) of the c-th sample are carried over, taking their initial values when c = 0;
Step 5.2.6: assign c+1 to c and test whether c > C holds; if so, execute step 5.2.7; otherwise return to step 5.2.3 and continue in sequence;
Step 5.2.7: draw the (C+1)-th sample from the transition probability matrix of time t to obtain the control-parameter adjustment action a″_n^t of the autonomous navigation device at time t, and take the value-function estimate Q′(e_t, Δe_t, a″_n^t) of time t to be the value Q*(e_t, Δe_t, a″_n^t) of that action;
Step 6: compute with formula (9) the behaviour return value r(e_t, Δe_t, a″_n^t) of the adjustment action a″_n^t taken by the autonomous navigation device at time t:
r(e_t, Δe_t, a″_n^t) = α × (e_t − e_{t−1}) + β × (Δe_t − Δe_{t−1})   (9)
In formula (9), α and β respectively denote the error return parameter and the error-rate return parameter, with 0 < α < 1, 0 < β < 1 and α + β = 1;
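Formula (9) is a weighted difference of successive errors and error rates; a direct transcription, with the embodiment's suggested α = 0.8 and β = 0.2 as defaults:

```python
def reward(e_t, e_prev, de_t, de_prev, alpha=0.8, beta=0.2):
    """Behaviour return value of formula (9):
    r = alpha * (e_t - e_prev) + beta * (de_t - de_prev),
    with alpha + beta = 1 (alpha weights the error change,
    beta the error-rate change)."""
    return alpha * (e_t - e_prev) + beta * (de_t - de_prev)

print(round(reward(0.2, 0.5, 0.1, 0.3), 6))   # -0.28
```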
Step 7: update with formula (10) the value-function estimate Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) of time t−1 into the final value-function value Q(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) of time t−1:
Q(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) = Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) + l_t ΔQ(e_{t−1}, Δe_{t−1}, a″_n^{t−1})   (10)
In formula (10), ΔQ(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) denotes the value-function difference, given by:
ΔQ(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) = r(e_t, Δe_t, a″_n^t) + γQ′(e_t, Δe_t, a″_n^t) − Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1})   (11)
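Formulas (10)-(11) form a TD-style update of the previous step's value estimate; transcribed directly:

```python
def q_update(q_prev, q_curr, r, lt, gamma):
    """Formulas (10)-(11): update the previous-step value estimate.

    q_prev : Q'(e_{t-1}, de_{t-1}, a_{t-1})
    q_curr : Q'(e_t, de_t, a_t)
    r      : behaviour return value of formula (9)
    lt     : learning factor, gamma : discount factor"""
    delta = r + gamma * q_curr - q_prev   # formula (11)
    return q_prev + lt * delta            # formula (10)

print(round(q_update(1.0, 2.0, 0.5, lt=0.1, gamma=0.5), 6))   # 1.05
```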
Step 8: assign t+1 to t and test whether t > t_max holds, where t_max denotes the set maximum number of iterations; if so, execute step 9; otherwise adjust the learning factor l_t with formula (12), following the SPSA step-size adjustment algorithm as time t varies:
In formula (12), l is the learning-factor value at time t = 1, and μ and λ are the nonnegative constants of the SPSA step-size adjustment algorithm;
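Formula (12) itself is an image in the source; a standard SPSA-style decaying gain sequence l_t = l / (t + μ)^λ is assumed in this sketch, using the embodiment's constants μ = 0.3 and λ = 1.2. The qualitative behaviour, large early and shrinking as t grows, matches the description in the text.

```python
def learning_factor(t, l0, mu=0.3, lam=1.2):
    """Decaying learning factor l_t.  Formula (12) is not reproduced
    in the source, so an SPSA-style gain sequence
    l_t = l0 / (t + mu) ** lam is assumed here."""
    return l0 / (t + mu) ** lam

schedule = [learning_factor(t, 1.0) for t in range(1, 31)]
print(schedule[0] > schedule[10] > schedule[-1])   # True: monotone decay
```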
Step 9: test whether the final value-function values of two successive moments satisfy |Q(e_t, Δe_t, a″_n^t) − Q(e_{t−1}, Δe_{t−1}, a″_n^{t−1})| < ε; if so, the PID control parameters of the autonomous navigation device are tuned, and go to step 11; otherwise execute step 10;
Step 10: test whether t exceeds the allotted time; if so, go to step 3 and reselect an initial adjustment action a″_n^{t−1} with which to tune the PID control parameters; otherwise go to step 5 and continue tuning the PID control parameters;
Step 11: set t = 1;
Step 12: the autonomous navigation device acquires the environment state e_t and Δe_t of time t and tests whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, execute step 13; otherwise return to step 11; here e_min and Δe_min respectively denote the minimum environment-state error and error rate that the device allows;
Step 13: assign t+1 to t and test whether t > T holds; if so, execute step 3; otherwise return to step 12; T denotes the time constant with which the device adapts to the speed of environmental change.
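The stopping and re-triggering tests of steps 9 and 12 can be sketched as two simple predicates; ε, e_min and Δe_min are tuning choices of the practitioner, not values fixed by the patent.

```python
def tuning_finished(q_t, q_prev, eps):
    """Step 9: tuning ends when two successive final value-function
    values differ by less than eps."""
    return abs(q_t - q_prev) < eps

def needs_retuning(e_t, de_t, e_min, de_min):
    """Step 12: re-enter tuning when the error or error rate leaves
    the band the device allows."""
    return abs(e_t) > abs(e_min) or abs(de_t) > abs(de_min)

print(tuning_finished(2.501, 2.500, 0.01))   # True
print(needs_retuning(0.05, 0.0, 0.1, 0.1))   # False: inside the band
```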
Compared with the prior art, the invention has the following benefits:
1. The invention applies the Q-learning algorithm to online tuning of the autonomous navigation control parameters, and introduces into it an MCMC sampling algorithm and an SPSA step-size adjustment algorithm, so that the craft adapts to environmental change during autonomous navigation and anticipates the voyage conditions of the next moment. This solves the overshoot and delay problems, makes the navigation process steadier, and makes parameter adjustment rapid especially under changing weather, giving the method broad application prospects in the field of autonomous navigation.
2. The invention introduces Q-learning to associate the control effect with the environment state: the return value fed back by the environment determines the quality of each parameter-adjustment action, and the adjustment gradually approaches the direction that improves the parameters. This solves the overshoot and response-delay problems that arise during navigation and moves the control parameters quickly to the values optimal for the changed environment, so that the craft adapts rapidly.
3. The invention introduces MCMC sampling into the traditional Q-learning algorithm for optimization: the parameter-adjustment policy of the current moment no longer simply takes the single action with the largest behaviour value, but estimates the overall probability distribution through the transition probabilities between actions. This avoids falling into a local optimum when Q-learning selects an action, and so yields the optimal adjustment policy during the navigation of the autonomous navigation device.
4. The invention sets the action probability distribution of the initial sampling moment of the MCMC algorithm to the uniform distribution, so that early in the run the sampling covers the actions generally; later, the action distribution is updated with each sampled action, increasing the probability mass of the actions actually sampled, which improves the correctness of the action sampled at each moment.
5. The invention adjusts the variation of the learning factor l of the traditional Q-learning algorithm with an SPSA step-size adjustment algorithm. The setting of the algorithm's parameters delimits the speed and the interval of the variation of l, so that l changes with a certain regularity during Q-learning and the parameter adjustment of the device becomes more accurate.
Brief description of the drawings
Fig. 1 is the principle block diagram of the autonomous navigation device control parameter online tuning method based on MCMC-optimized Q-learning;
Fig. 2 shows the MCMC optimization steps within the Q-learning algorithm of the invention;
Fig. 3 is the flow chart of the online tuning method based on MCMC-optimized Q-learning;
Fig. 4 is the schematic diagram of solving the action value function with a BP neural network;
Fig. 5 compares, across different experiments, the time consumed by the navigation process of the method of the invention and of the traditional fixed-PID-parameter method;
Fig. 6 compares the real-time error e_t of the two methods while the environment remains constant during navigation;
Fig. 7 compares the real-time error e_t of the two methods while the environment is changing during navigation;
Fig. 8 compares the real-time error e_t of the two methods after the environment has changed during navigation.
Specific embodiment
In the present embodiment, the principle of the autonomous navigation device control parameter online tuning method based on MCMC-optimized Q-learning is shown in Fig. 1: the device receives in real time the error e_t and error rate Δe_t of the current environment, the MCMC-optimized Q-learning algorithm decides in real time the parameter-adjustment action a_n of the next moment, and the optimal control-parameter values under the current environment are obtained once the final value-function value of the Q-learning algorithm no longer changes. The MCMC optimization steps within the Q-learning algorithm are shown in Fig. 2. The method belongs to the field of online tuning of autonomous navigation device control parameters and adapts the device to the current environment by changing its control parameters.
As shown in Fig. 3, the control-parameter online tuning method proceeds as follows:
Step 1: the PID control parameters comprise the proportional parameter k_p, the integral parameter k_i and the derivative parameter k_d. The proportional parameter k_p speeds up the response of the system and improves its regulation accuracy; the integral parameter k_i eliminates the steady-state error of the system; the derivative parameter k_d improves the dynamic characteristics of the system;
According to the control precision σ of the autonomous navigation device, obtain from formula (1) the adjustment steps Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d:
In formula (1), X_p, X_i and X_d respectively denote the threshold ranges of the three PID control parameters k_p, k_i and k_d;
For example, with σ = 0.1, X_p ∈ [10, 20], X_i ∈ [1, 6] and X_d ∈ [1, 2], formula (1) gives for Δk_p the transition actions "increase by 1", "keep constant" and "decrease by 1"; the transition actions of Δk_i and Δk_d follow similarly;
Because of the uncertainty of the environment, the traditional fixed-PID-parameter method brings overshoot and response-delay problems to the autonomous navigation device during navigation, and the PID parameters must be modified manually to suit each different environment. To address these problems, a Q-learning algorithm is introduced here to adjust the PID control parameters online in real time.
The Q-learning algorithm is an intelligent learning algorithm proposed by Chris Watkins in 1989 that combines TD algorithms with dynamic programming; Watkins' work advanced the rapid development of reinforcement learning. Q-learning is a value-iteration reinforcement-learning algorithm independent of any model of the real system; it usefully combines the theory of dynamic programming with the psychology of animal learning, and serves to solve sequential optimal decision problems with delayed return.
Step 2: since Q-learning must decide changes to the control parameters of the autonomous navigation device, splitting the PID adjustment into three separate actions would increase the computational complexity of the Q-learning algorithm. The adjustment steps Δk_p, Δk_i and Δk_d are therefore combined into the parameter-change action set of the device, denoted A = {a_1, a_2, ..., a_n, ..., a_N}, where a_n denotes the n-th control-parameter adjustment action in the set, with components Δk_p^n, Δk_i^n and Δk_d^n denoting respectively the proportional, integral and derivative adjustments of the n-th action, n = 1, 2, ..., N;
Step 3: set time t = 1 and randomly select a control-parameter adjustment action a″_n^{t−1} to apply to the autonomous navigation device;
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1];
The learning factor l_t of the Q-learning algorithm changes as time t changes. In its early stage the algorithm must draw large learning gains from the sample data, so the initial l_t is a fairly large positive number; as t increases the device no longer needs a large learning value, so l_t is gradually decreased. The discount factor γ controls how much the device weighs short-term against long-term results. Consider the two extremes: with γ = 0 the device considers only the return value of the current environment, and with γ = 1 only the return values of future moments. The discount factor is therefore set according to the actual demand of the device, and γ = 0.5 is generally taken to weigh the current moment and future moments together;
Initialize the three PID control parameters k_p, k_i and k_d from control experience with the autonomous navigation device; this experimental system initially sets k_p = 2.5, k_i = 0.5 and k_d = 0.2;
Initialize the value-function estimate Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) of the Q-learning algorithm at time t−1, where e_{t−1} denotes the error of the device at time t−1 and Δe_{t−1} its error rate; together e_{t−1} and Δe_{t−1} form the environment state at time t−1;
At time t = 1, set the value-function estimate Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) = 0, the error e_{t−1} = 0 and the error rate Δe_{t−1} = 0;
Step 4: in Q-learning the autonomous navigation device must select the action with the largest value-function value in order to obtain the largest immediate return, but it must also try different actions as far as possible, so that the optimal policy over all actions can be obtained. If the device always selected the action with the highest value, the following shortcoming could arise: if the optimal policy has not yet been acquired during the early experience-gathering stage, the later learning stage can never obtain it.
Therefore the MCMC sampling algorithm is introduced into Q-learning to choose the action of each decision moment. The MCMC sampling algorithm obtains, by sampling the action transition matrix, samples that follow the action probability distribution, so that the action chosen at each moment can be sampled accurately even when the probability distribution is unknown.
According to the number N of control-parameter adjustment actions in the action set A of the device, initialize the transition matrix of the decision process in the Q-learning algorithm using formula (2):
In formula (2), p(a_m | a_n) denotes the probability of transferring at time t−1 from adjustment action a_n to adjustment action a_m, with its initial value set at t = 1;
Step 5: obtain the decision process at time t using MCMC-optimized Q-learning;
Step 5.1: the BP neural network can approximate arbitrary nonlinear functions and plays an important role in solving generalization problems over large, continuous state spaces; the principle of solving the action value function with a BP neural network is shown in Fig. 4. Compute with formula (3) the value of the n-th adjustment action of time t under the environment state:
In formula (3), w_j(t−1) denotes the weight of the j-th hidden node of the BP neural network at time t−1, j = 1, 2, ..., nh, where nh denotes the number of hidden nodes; y_j(t−1) denotes the output of the j-th hidden node at time t−1, given by formula (4):
In formula (4), o_j(t−1) denotes the input of the j-th hidden node at time t−1, given by formula (5):
In formula (5), w_ij(t−1) denotes the weight from the i-th input node to the j-th hidden node at time t−1, and x_i(t−1) denotes the i-th input at time t−1, i = 1, 2, ..., ni, where ni denotes the number of input nodes of the BP neural network;
For example, ni = 3 means the BP network has three input nodes, receiving respectively the error e_{t−1}, the error rate Δe_{t−1} and the action; nh = 5 means it contains five hidden nodes; in general, more hidden nodes give higher computational accuracy but also greater computational complexity. At time t = 1 the hidden-layer weights are set to w_j(t−1) = 1, j = 1, 2, ..., nh, and the input-layer weights to w_ij(t−1) = 0.8, i = 1, 2, ..., ni;
Step 5.2: obtain the control-parameter adjustment action a″_n^t of the autonomous navigation device at time t by sampling with the MCMC algorithm;
Step 5.2.1: according to the value of the n-th adjustment action of time t under the environment state and the action a″_n^{t−1} chosen at time t−1, update the transition probability matrix of the decision process using formula (6):
In formula (6), the numerator denotes the value of the n-th adjustment action of time t and the denominator denotes the sum of the values of all actions of time t, n = 1, 2, ..., N; the resulting entry denotes the probability of transferring at time t from the n-th adjustment action to the m-th adjustment action;
Step 5.2.2: set the sampling counter c = 0, 1, 2, ..., C;
Step 5.2.3: draw the c-th sample from the transition probability matrix of time t, and obtain from formula (7) the acceptance rate of the (c+1)-th sample of the MCMC algorithm at time t:
In formula (7), p_{c+1}(a′_n^t) denotes the probability of the action obtained by the (c+1)-th sample at time t, and p_c(a′_n^t) denotes the probability of the action a′_n^t obtained by the c-th sample; when c = 0, the distribution p_c(a′_n^t) of the action obtained by the c-th sample is taken as the uniform distribution;
From formula (7) it can be seen that at time t the quantities p_c(a′_n^t) and the transition probabilities are fixed values; the larger the probability of the action drawn by the (c+1)-th sample, the larger the acceptance rate, and conversely the smaller. Since the MCMC sampling algorithm obtains, by sampling the action transition probability matrix, samples that follow the action probability distribution p_c(a′_n^t), the action probability distribution p(a_n) at the start of the algorithm may be set arbitrarily. Here the distribution of the action a′_n^t at the start of sampling is set to the uniform distribution, giving the craft the same sampling probability for every action and guaranteeing the correctness of the action sampling of the Q-learning algorithm at each moment;
Step 5.2.4: draw a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sample; otherwise reject it and carry a′_n^t forward as the (c+1)-th action;
For example, with random acceptance value u = 0.5: if the acceptance rate obtained from formula (7) is smaller than u, this sample is considered a failure and the sampled action value a′_n^t remains unchanged; if it is not smaller than u, the sample is considered a success and the sampled action value becomes the newly drawn action.
Step 5.2.5: update with formula (8) the probability distribution p_{c+1}(a′_n^t) of the action a′_n^t obtained by the (c+1)-th sample at time t:
In formula (8), the numerator and denominator of the distribution p_c(a′_n^t) of the c-th sample are carried over, taking their initial values when c = 0;
Step 5.2.6: assign c+1 to c and test whether c > C holds; if so, execute step 5.2.7; otherwise return to step 5.2.3 and continue in sequence;
Step 5.2.7: draw the (C+1)-th sample from the transition probability matrix of time t to obtain the control-parameter adjustment action a″_n^t of the autonomous navigation device at time t, and take the value-function estimate Q′(e_t, Δe_t, a″_n^t) of time t to be the value Q*(e_t, Δe_t, a″_n^t) of that action;
According to the MCMC algorithm, the probability distribution p_c(a′_n^t) of the action a′_n^t essentially levels off once the sampling count c reaches 100, so C = 100 is generally set; the sampling count C may also be set according to the precision of the craft's systems;
Step 6: compute with formula (9) the behaviour return value r(e_t, Δe_t, a″_n^t) of the adjustment action a″_n^t taken by the autonomous navigation device at time t:
r(e_t, Δe_t, a″_n^t) = α × (e_t − e_{t−1}) + β × (Δe_t − Δe_{t−1})   (9)
In formula (9), α and β respectively denote the error return parameter and the error-rate return parameter, with 0 < α < 1, 0 < β < 1 and α + β = 1;
Behavior act return value r (et,Δet,a″n t) illustrate that t moment parameter regulation acts a "n tAct on autonomous navigation
The operating condition of aircraft after device, if the ambient condition returned is deteriorated, behavior act return value r (e at this timet,Δet,a
″n t) it is a negative, indicate punishment;If the ambient condition returned improves, at this time behavior act return value r (et,Δet,
a″n t) it is a positive number, indicate reward;If the ambient condition returned does not change, at this time behavior act return value r (et,
Δet,a″n t) it is zero, it indicates to keep;The ambient condition of autonomous navigation device includes error etWith Δ et, so according to importance
Difference introduces α and β ambient condition return parameter to determine the influence degree of different conditions, generally setting α=0.8, β=0.2;
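As an illustration, the behavior return of formula (9) with the suggested α = 0.8, β = 0.2 can be computed as follows; the function name and signature are assumptions for this sketch, not taken from the patent.

```python
def behavior_reward(e_t, e_prev, de_t, de_prev, alpha=0.8, beta=0.2):
    """Behavior return value of formula (9):
    r = alpha*(e_t - e_prev) + beta*(de_t - de_prev), with alpha + beta = 1.
    In the text's interpretation, negative r is a punishment, positive r
    a reward, and zero r means 'hold'."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * (e_t - e_prev) + beta * (de_t - de_prev)
```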
Step 7: update, using formula (10), the value function estimate Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) at time t−1 to the final value function value Q(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) at time t−1:
Q(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) = Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) + l_t ΔQ(e_{t−1}, Δe_{t−1}, a″_n^{t−1})   (10)
In formula (10), ΔQ(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) denotes the final value function difference, with:
ΔQ(e_{t−1}, Δe_{t−1}, a″_n^{t−1}) = r(e_t, Δe_t, a″_n^t) + γ Q′(e_t, Δe_t, a″_n^t) − Q′(e_{t−1}, Δe_{t−1}, a″_n^{t−1})   (11)
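Formulas (10) and (11) form an ordinary temporal-difference update with learning factor l_t and discount factor γ; a minimal sketch (names assumed, not the patent's code):

```python
def q_update(q_prev_est, q_curr_est, r, l_t, gamma):
    """Temporal-difference update of formulas (10)-(11):
    dQ = r + gamma * Q'(t) - Q'(t-1)      (formula (11))
    Q(t-1) = Q'(t-1) + l_t * dQ           (formula (10))"""
    delta = r + gamma * q_curr_est - q_prev_est
    return q_prev_est + l_t * delta
```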
Step 8: assign t+1 to t and judge whether t > t_max holds, where t_max denotes the set maximum number of iterations; if so, execute step 9. Otherwise, following the variation of the SPSA step-length adjusting algorithm with time t, adjust the learning factor l_t using formula (12).
In formula (12), l is the learning factor value at time t = 1, and μ and λ are the nonnegative constants of the SPSA step-length adjusting algorithm.
Introducing the SPSA step-length adjusting algorithm gives the learning factor l_t of Q-learning a regular pattern of variation; by setting the nonnegative parameters μ and λ of the algorithm, the rate at which l_t varies and its interval range are defined, which makes the parameter adjustment of the navigation device more accurate. Generally, t_max = 30, μ = 0.3 and λ = 1.2 are set.
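Formula (12) itself was an image and is not recoverable from the extracted text, so the exact decay law cannot be reproduced here. Purely as an illustration, the sketch below uses one common SPSA-style decaying gain schedule built from the named quantities l, μ and λ; the specific form l_t = l / (μ + t)^λ is an assumption, not the patent's formula.

```python
def learning_factor(t, l=0.5, mu=0.3, lam=1.2):
    """HYPOTHETICAL stand-in for formula (12), which is unavailable.
    A standard SPSA-style gain schedule: l_t = l / (mu + t)**lam, with
    l the learning factor at t = 1 (up to normalization) and mu, lam >= 0.
    The factor decays monotonically, damping the Q-update as t grows."""
    return l / (mu + t) ** lam
```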
Step 9: judge whether the final value function values of two consecutive moments satisfy |Q(e_t, Δe_t, a″_n^t) − Q(e_{t−1}, Δe_{t−1}, a″_n^{t−1})| < ε; if so, the adjustment of the PID control parameters of the autonomous navigation device is finished, and go to step 11; otherwise, execute step 10.
ε is a very small positive number used to determine whether the PID control parameter adjustment is finished, and it is related to the control precision of the navigation device: the smaller ε is, the higher the precision of autonomous navigation and the closer the obtained PID control parameters are to the optimal values; ε = 0.2 is generally set.
Step 10: judge whether t exceeds the prescribed time; if so, go to step 3 and reselect an initial control parameter adjusting action a″_n^{t−1} to adjust the PID control parameters of the autonomous navigation device; otherwise, go to step 5 and continue the PID control parameter adjustment.
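Steps 9 and 10 together form the stopping logic of the tuning loop. A sketch under the stated ε = 0.2, identifying the "prescribed time" of step 10 with a fixed limit (the default t_limit = 30 mirrors t_max but is an assumption, as are all names):

```python
def tuning_decision(q_curr, q_prev, t, eps=0.2, t_limit=30):
    """Decide the tuner's next move (steps 9-10).
    'done'     -- |Q(t) - Q(t-1)| < eps: parameters converged, go to the
                  monitoring phase of step 11.
    'restart'  -- allowed tuning time exceeded: re-pick an initial
                  adjusting action (back to step 3).
    'continue' -- keep adjusting via step 5."""
    if abs(q_curr - q_prev) < eps:
        return 'done'
    if t > t_limit:
        return 'restart'
    return 'continue'
```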
Step 11: let t = 1.
Step 12: the autonomous navigation device collects the ambient condition e_t and Δe_t at time t and judges whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, execute step 13; otherwise, return to step 11. Here e_min and Δe_min denote the minimum ambient condition error and error rate allowed by the autonomous navigation device; e_min = 0.1 and Δe_min = 0.05 are generally set.
Step 13: assign t+1 to t and judge whether t > T holds; if so, execute step 3; otherwise, return to step 12. Here T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
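Steps 11 through 13 keep the tuned controller in place until the observed error leaves the allowed band. The trigger test of step 12 can be sketched as follows, with e_min = 0.1 and Δe_min = 0.05 taken from the text and the function name an assumption:

```python
def needs_retuning(e_t, de_t, e_min=0.1, de_min=0.05):
    """Step 12: return True when the error or error rate exceeds the
    minimum allowed band, i.e. |e_t| > |e_min| or |de_t| > |de_min|,
    which sends the method back into the tuning loop (step 13 / step 3)."""
    return abs(e_t) > abs(e_min) or abs(de_t) > abs(de_min)
```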
Experimental results:
The method of this patent and the traditional fixed-PID-parameter method were applied to autonomous navigation devices simultaneously, and several groups of comparative experiments were carried out; in each experiment the two autonomous navigation devices were guaranteed to start at the same time from the same starting point toward the same terminal. Fig. 5 compares the time consumed by the navigation process; Fig. 6, Fig. 7 and Fig. 8 compare the real-time error e_t during navigation.
In the time-consumption comparison, three groups of comparative experiments were taken; each group was run 50 times and the results averaged. The first group compares the arrival times of the two autonomous navigation devices when the current environment is stable; the second group compares the arrival times when the environment changes suddenly during navigation; the third group compares the arrival times after the environment has changed. As shown in Fig. 5, in the initially stable state the PID control parameters used by the device with the fixed-parameter method are close to optimal, so its elapsed time is roughly the same as that of the device using the method of this patent. When the environment changes suddenly during navigation, the arrival times of both devices lengthen, but the time consumed by the device using the method of this patent is clearly much smaller than that of the device using the traditional method, and the extra time of the former arises mainly while the control parameters are being adjusted. After the environmental change, the device using the method of this patent has already adjusted its control parameters to the optimal values under the current environment, so its consumed time returns to the same level as before the change, whereas the device using the traditional method, whose control parameters are no longer optimal under the new environment, keeps consuming ever more time; when the environmental change is violent, the device using the traditional method may fail to reach the specified destination at all.
In the real-time error e_t comparison, the same three groups of experiments were taken, each group again run 50 times with the results averaged. Fig. 6 shows the comparison with the initial environment unchanged: the real-time errors e_t of the two devices vary in roughly the same way. Fig. 7 shows the comparison when the environment changes suddenly at the 7th second of the navigation process: at the sudden change the real-time error e_t of both devices increases greatly, but for the device using the method of this patent e_t rapidly falls back close to 0 after a period of navigation parameter adjustment, whereas for the device using the traditional method e_t cannot fall back to 0 and keeps fluctuating within an error range. Fig. 8 shows the comparison after the environmental change: the variation law of e_t for the device using the method of this patent is almost the same as before the change, whereas for the device using the traditional method e_t again cannot fall back to 0 and keeps fluctuating within an error range.
Combining the two kinds of comparison results over the three groups of experiments, the method of this patent achieves a better autonomous navigation effect than the traditional fixed-PID-control-parameter method in a changeable environment, and at the same time solves the overshoot and response-delay problems of the autonomous navigation device caused by control parameters that are not optimal under the current environment.
Claims (1)
1. An on-line control method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, characterized by comprising the following steps:
Step 1: according to the control precision σ of the autonomous navigation device, obtain, using formula (1), the adjustment parameters Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d of the autonomous navigation device.
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the autonomous navigation device.
Step 2: combine the adjustment parameters Δk_p, Δk_i and Δk_d to obtain the parameter variation action set of the autonomous navigation device, denoted A = {a_1, a_2, …, a_n, …, a_N}, where a_n denotes the n-th control parameter adjusting action in the parameter variation action set and is composed of the proportional adjustment parameter, the integral adjustment parameter and the differential adjustment parameter corresponding to the n-th action, n = 1, 2, …, N.
Step 3: set time t = 1 and randomly choose a control parameter adjusting action to act on the autonomous navigation device; initialize the relevant parameters of the Q-learning algorithm, namely the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1]; initialize the three PID control parameters k_p, k_i and k_d according to the control experience of the autonomous navigation device; and initialize the value function estimate of the Q-learning algorithm at time t−1, where e_{t−1} denotes the error of the autonomous navigation device at time t−1, Δe_{t−1} denotes its error rate at time t−1, and e_{t−1} and Δe_{t−1} form the ambient condition at time t−1.
Step 4: according to the number N of control parameter adjusting actions in the parameter variation action set A of the autonomous navigation device, initialize, using formula (2), the transfer matrix of the decision process of the Q-learning algorithm.
In formula (2), each entry denotes the transition probability at time t−1 from one control parameter adjusting action to another; when t = 1, the entries are initialized uniformly.
Step 5: obtain the decision process at time t using the MCMC-optimized Q-learning algorithm.
Step 5.1: calculate, using formula (3), the value function value of the n-th control parameter adjusting action at time t under the ambient condition.
In formula (3), w_j(t−1) denotes the weight of the j-th hidden layer node of the BP neural network at time t−1, j = 1, 2, …, nh, where nh denotes the number of hidden layer nodes of the BP neural network; y_j(t−1) denotes the output of the j-th hidden layer node at time t−1 and is given by formula (4).
In formula (4), o_j(t−1) denotes the input of the j-th hidden layer node at time t−1 and is given by formula (5).
In formula (5), w_ij(t−1) denotes the weight from the i-th input layer node to the j-th hidden layer node of the BP neural network at time t−1, and x_i(t−1) denotes the input of the i-th input layer node at time t−1, i = 1, 2, …, ni, where ni denotes the number of input layer nodes of the BP neural network.
Step 5.2: obtain, by sampling with the MCMC algorithm, the control parameter adjusting action of the autonomous navigation device at time t.
Step 5.2.1: according to the value function value of the n-th control parameter adjusting action at time t under the ambient condition and the action chosen at time t−1, update the transition probability matrix of the decision process using formula (6).
In formula (6), the numerator denotes the value function value of the n-th control parameter adjusting action at time t, the denominator denotes the sum of the value function values of all actions at time t, n = 1, 2, …, N, and each entry denotes the transition probability at time t from the n-th control parameter adjusting action to the m-th control parameter adjusting action.
Step 5.2.2: set the sampling number c = 0, 1, 2, …, C.
Step 5.2.3: perform the c-th sampling on the transition probability matrix at time t, and obtain, using formula (7), the acceptance rate of the (c+1)-th sampling of the MCMC algorithm at time t.
In formula (7), the two probability values are those of the action obtained by the (c+1)-th sampling and the action obtained by the c-th sampling at time t, respectively; when c = 0, the probability distribution of the action obtained by the c-th sampling is generally taken to be the uniform distribution.
Step 5.2.4: sample a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sampling; otherwise, do not accept it and assign the action of the c-th sampling to the (c+1)-th sampling.
Step 5.2.5: update, using formula (8), the probability distribution of the action obtained by the (c+1)-th sampling at time t.
In formula (8), the two quantities denote, respectively, the denominator and the numerator of the probability distribution of the action obtained by the c-th sampling at time t; when c = 0, both are initialized accordingly.
Step 5.2.6: assign c+1 to c and judge whether c > C holds; if so, execute step 5.2.7; otherwise, return to step 5.2.3 and continue in sequence.
Step 5.2.7: perform the (C+1)-th sampling on the transition probability matrix at time t to obtain the control parameter adjusting action of the autonomous navigation device at time t, and let the value function estimate at time t be the value function value of that control parameter adjusting action.
Step 6: obtain, using formula (9), the behavior return value of the control parameter adjusting action of the autonomous navigation device at time t.
In formula (9), α and β denote the error return parameter and the error-rate return parameter respectively, with 0 < α < 1, 0 < β < 1, and α + β = 1.
Step 7: update, using formula (10), the value function estimate at time t−1 to the final value function value at time t−1.
In formula (10), ΔQ denotes the final value function difference, given by formula (11).
Step 8: assign t+1 to t and judge whether t > t_max holds, where t_max denotes the set maximum number of iterations; if so, execute step 9; otherwise, following the variation of the SPSA step-length adjusting algorithm with time t, adjust the learning factor l_t using formula (12).
In formula (12), l is the learning factor value at time t = 1, and μ and λ are the nonnegative constants of the SPSA step-length adjusting algorithm.
Step 9: judge whether the final value function values of two consecutive moments satisfy the convergence condition; if so, the adjustment of the PID control parameters of the autonomous navigation device is finished, and go to step 11; otherwise, execute step 10.
Step 10: judge whether t exceeds the prescribed time; if so, go to step 3 and reselect an initial control parameter adjusting action to adjust the PID control parameters of the autonomous navigation device; otherwise, go to step 5 and continue the PID control parameter adjustment.
Step 11: let t = 1.
Step 12: the autonomous navigation device collects the ambient condition e_t and Δe_t at time t and judges whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, execute step 13; otherwise, return to step 11. Here e_min and Δe_min denote the minimum ambient condition error and error rate allowed by the autonomous navigation device.
Step 13: assign t+1 to t and judge whether t > T holds; if so, execute step 3; otherwise, return to step 12. Here T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711144395.2A CN107885086B (en) | 2017-11-17 | 2017-11-17 | Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107885086A CN107885086A (en) | 2018-04-06 |
CN107885086B true CN107885086B (en) | 2019-10-25 |
Family
ID=61777810
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108710289B (en) * | 2018-05-18 | 2021-11-09 | 厦门理工学院 | Relay base quality optimization method based on improved SPSA |
CN109696830B (en) * | 2019-01-31 | 2021-12-03 | 天津大学 | Reinforced learning self-adaptive control method of small unmanned helicopter |
EP3725471A1 (en) * | 2019-04-16 | 2020-10-21 | Robert Bosch GmbH | Configuring a system which interacts with an environment |
CN114237267B (en) * | 2021-11-02 | 2023-11-24 | 中国人民解放军海军航空大学航空作战勤务学院 | Flight maneuver decision assisting method based on reinforcement learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102819264A (en) * | 2012-07-30 | 2012-12-12 | 山东大学 | Path planning Q-learning initial method of mobile robot |
CN105700526A (en) * | 2016-01-13 | 2016-06-22 | 华北理工大学 | On-line sequence limit learning machine method possessing autonomous learning capability |
CN106950956A (en) * | 2017-03-22 | 2017-07-14 | 合肥工业大学 | The wheelpath forecasting system of fusional movement model and behavior cognitive model |
CN107038477A (en) * | 2016-08-10 | 2017-08-11 | 哈尔滨工业大学深圳研究生院 | A kind of neutral net under non-complete information learns the estimation method of combination with Q |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2178745B1 (en) * | 2007-08-14 | 2012-02-29 | Propeller Control Aps | Efficiency optimizing propeller speed control for ships |
Non-Patent Citations (1)
Title |
---|
An Introduction to MCMC for Machine Learning; Christophe Andrieu et al.; Machine Learning; 2003-12-31; pp. 5-37 *
Legal Events

Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |