CN107885086B - Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study - Google Patents


Info

Publication number
CN107885086B
CN107885086B (application CN201711144395.2A)
Authority
CN
China
Prior art keywords
moment
navigation device
autonomous navigation
control parameter
movement
Prior art date
Legal status
Active
Application number
CN201711144395.2A
Other languages
Chinese (zh)
Other versions
CN107885086A (en)
Inventor
夏娜
柴煜奇
杜华争
陈斌
Current Assignee
Hefei Polytechnic University
Original Assignee
Hefei Polytechnic University
Priority date
Application filed by Hefei Polytechnic University
Priority to CN201711144395.2A
Publication of CN107885086A
Application granted
Publication of CN107885086B


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems that are electric
    • G05B13/04: Adaptive control systems that are electric and involve the use of models or simulators
    • G05B13/042: Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention discloses an online tuning method, based on Q-learning optimized by MCMC sampling, for the control parameters of an autonomous navigation device, comprising the following steps: first, the possible changes of the device's PID control parameters are enumerated according to the practical situation to obtain a parameter-adjustment action set, and the PID control parameters are initialized from control experience with the device; then an action is selected at random and applied to the device, the value Q* of each action is obtained with the Q-learning algorithm, the action to take at the next moment is obtained by MCMC sampling, and the learning factor l of the Q-learning algorithm is adjusted over time with an SPSA step-size adjustment algorithm; finally, repeated parameter adjustment yields the optimal control parameters under the present circumstances. The invention solves the overshoot and delay problems of the autonomous navigation device during navigation, so that the device adapts rapidly to changes in the environment and arrives at its destination quickly and stably.

Description

Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study
Technical field
The invention belongs to the field of online tuning of autonomous navigation device control parameters, and is specifically a method for adjusting the control parameters of an autonomous navigation device.
Background technique
Autonomous navigation means that a vehicle on the water surface, given an artificially specified destination, plans its own path and reaches the destination through continuous self-adjustment. It has important application value in water quality inspection and related fields.
At present, traditional autonomous navigation devices use a fixed-PID-parameter method, in which the control parameters are fixed values acquired from extensive engineering experience with autonomous navigation projects. When the fixed control parameters do not suit the current environment, the device suffers from overshoot and delayed response during navigation. Especially under changeable conditions, fixed control parameters may respond well to individual environmental states but cannot satisfy all of them, and the parameters must be changed manually whenever the environment changes, which is inconvenient for the use of the device.
There are also methods that adjust the control parameters with fuzzy algorithms or annealing algorithms. These introduce a control-parameter self-correction mechanism to a certain extent, but since they are not intelligent control algorithms, they still cannot adjust the control parameters of an autonomous navigation device quickly to their optimal values under changeable environments.
Summary of the invention
To remedy the above shortcomings of the prior art, the present invention provides an online tuning method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, so as to solve the overshoot and delay problems of the device during navigation and enable it to adapt rapidly to changes in the environment and arrive at its destination quickly and stably.
To achieve the above object, the invention adopts the following technical scheme:
The online tuning method of the present invention for autonomous navigation device control parameters, based on MCMC-optimized Q-learning, is characterized by the following steps:
Step 1: according to the control precision σ of the autonomous navigation device, obtain with formula (1) the adjustment increments Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d.
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the autonomous navigation device.
Step 2: combine the adjustment increments Δk_p, Δk_i and Δk_d to obtain the parameter-adjustment action set of the autonomous navigation device, denoted A = {a_1, a_2, …, a_n, …, a_N}, where a_n denotes the n-th parameter-adjustment action in the set and comprises Δk_p^n, the proportional adjustment of the n-th action, Δk_i^n, its integral adjustment, and Δk_d^n, its differential adjustment, n = 1, 2, …, N.
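One plausible reading of this combination step can be sketched in code. This is an assumption, not the patent's definitive construction: each increment is taken to act positively, be held at zero, or act negatively (as the embodiment's example for Δk_p suggests), giving N = 3^3 = 27 actions; the function name and increment values are illustrative.

```python
from itertools import product

def build_action_set(dkp, dki, dkd):
    """Enumerate the parameter-adjustment action set A.

    Each PID increment is applied positively, held, or applied
    negatively, giving N = 3**3 = 27 combined actions, each a tuple
    (delta_kp, delta_ki, delta_kd)."""
    return [(sp * dkp, si * dki, sd * dkd)
            for sp, si, sd in product((1, 0, -1), repeat=3)]

actions = build_action_set(dkp=1.0, dki=0.5, dkd=0.1)
print(len(actions))  # 27
```

Grouping the three increments into one combined action set keeps the Q-learning state-action table small, as the specification argues later.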
Step 3: set time t = 1 and randomly select a parameter-adjustment action a″_n^{t-1} to act on the autonomous navigation device.
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1].
Initialize the three PID control parameters k_p, k_i and k_d from control experience with the autonomous navigation device.
Initialize the value-function estimate Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the autonomous navigation device at time t-1 and Δe_{t-1} its rate of change; together e_{t-1} and Δe_{t-1} form the environment state at time t-1.
Step 4: according to the number N of parameter-adjustment actions in the set A of the autonomous navigation device, initialize with formula (2) the transition matrix P^{t-1} of the decision process in the Q-learning algorithm.
In formula (2), P_{nm}^{t-1} denotes the probability of transferring from parameter-adjustment action a_n to parameter-adjustment action a_m at time t-1; when t = 1, every transition probability is equal.
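Formula (2) itself is not reproduced in the text; since the matrix is initialized purely from the number of actions N, a uniform row-stochastic matrix is the natural reading, sketched here under that assumption:

```python
import numpy as np

def init_transition_matrix(N):
    """At t = 1, every action is assumed to transfer to every action
    with the same probability 1/N, so each row sums to 1."""
    return np.full((N, N), 1.0 / N)

P = init_transition_matrix(27)
```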
Step 5: obtain the decision process at time t using the MCMC-optimized Q-learning algorithm.
Step 5.1: compute with formula (3) the value Q*(e_t, Δe_t, a_n^t) of the n-th parameter-adjustment action a_n^t at time t under the environment state.
In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, …, nh, where nh denotes the number of hidden nodes of the BP neural network; y_j(t-1) denotes the output of the j-th hidden node at time t-1, and is given by formula (4).
In formula (4), o_j(t-1) denotes the input of the j-th hidden node at time t-1, and is given by formula (5).
In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node of the BP neural network at time t-1, and x_i(t-1) denotes the input of the i-th input node, i = 1, 2, …, ni, where ni denotes the number of input nodes of the BP neural network.
Step 5.2: obtain the parameter-adjustment action a″_n^t of the autonomous navigation device at time t by MCMC sampling.
Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th parameter-adjustment action at time t and the action a″_n^{t-1} chosen at time t-1, update the transition probability matrix P^t of the decision process with formula (6).
In formula (6), Q*(e_t, Δe_t, a_m^t) denotes the value of the m-th parameter-adjustment action at time t, the sum over n = 1, 2, …, N of Q*(e_t, Δe_t, a_n^t) denotes the total value of all actions at time t, and P_{nm}^t denotes the probability of transferring from the n-th parameter-adjustment action a_n^t to the m-th parameter-adjustment action a_m^t at time t.
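Formula (6) is likewise not reproduced; the quantities it names (each action's value and the sum of all action values at time t) suggest a value-normalized update, sketched here as an assumption. It requires positive values to yield a valid distribution; a softmax over the values would be the usual safeguard otherwise.

```python
import numpy as np

def update_transition_matrix(q_values):
    """Hypothetical reading of formula (6): the probability of moving to
    action m is its value divided by the sum of all action values, so
    higher-valued actions receive higher transition probability.  The
    same row is used from every current action n."""
    q = np.asarray(q_values, dtype=float)
    row = q / q.sum()
    return np.tile(row, (len(q), 1))

P_t = update_transition_matrix([1.0, 3.0])
```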
Step 5.2.2: set the sampling count c = 0, 1, 2, …, C.
Step 5.2.3: draw the c-th sample from the transition probability matrix P^t of time t, and obtain with formula (7) the acceptance rate of the (c+1)-th sample of the MCMC algorithm at time t.
In formula (7), p_{c+1}(a_n^t) denotes the probability of the action obtained by the (c+1)-th sample at time t, and p_c(a′_n^t) denotes the probability of the action a′_n^t obtained by the c-th sample at time t; when c = 0, the probability distribution p_c(a′_n^t) of the action obtained by the c-th sample is set to the uniform distribution, i.e. p_0(a′_n^t) = 1/N.
Step 5.2.4: draw a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sample, otherwise reject it and assign a′_n^t to the (c+1)-th sample.
Step 5.2.5: update with formula (8) the probability distribution p_{c+1}(a′_n^t) of the action obtained by the (c+1)-th sample at time t.
In formula (8), the denominator and numerator of the distribution p_c(a′_n^t) of the c-th sample are carried forward to form p_{c+1}(a′_n^t), with their initial values set at c = 0.
Step 5.2.6: assign c + 1 to c and judge whether c > C holds; if so, execute step 5.2.7; otherwise, return to step 5.2.3 and continue in sequence.
Step 5.2.7: draw the (C+1)-th sample from the transition probability matrix P^t of time t to obtain the parameter-adjustment action a″_n^t of the autonomous navigation device at time t, and let the value-function estimate Q′(e_t, Δe_t, a″_n^t) at time t equal the value Q*(e_t, Δe_t, a″_n^t) of that action.
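Since formulas (7) and (8) are not reproduced in the text, steps 5.2.2 to 5.2.7 can only be illustrated with the textbook Metropolis acceptance rule that they follow in outline: propose an action, accept it with a probability formed from the ratio of the two actions' probabilities, otherwise keep the current one. All names here are illustrative.

```python
import random

def metropolis_sample(p_target, proposal_probs, steps=100, seed=0):
    """Metropolis-style sketch of the inner sampling loop: at each of
    `steps` iterations a candidate action is drawn from the proposal
    distribution and accepted with probability
    min(1, p_target[candidate] / p_target[current]); the action held
    after the final iteration is returned, as in step 5.2.7."""
    rng = random.Random(seed)
    current = rng.randrange(len(p_target))
    for _ in range(steps):
        cand = rng.choices(range(len(p_target)), weights=proposal_probs)[0]
        accept = min(1.0, p_target[cand] / p_target[current])
        if rng.random() <= accept:  # u <= acceptance rate: keep the sample
            current = cand
    return current
```

With enough iterations, the empirical distribution of kept samples approaches the target action distribution, which is the property the patent relies on when the true distribution is unknown.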
Step 6: obtain with formula (9) the reward r(e_t, Δe_t, a″_n^t) of the parameter-adjustment action a″_n^t of the autonomous navigation device at time t:
r(e_t, Δe_t, a″_n^t) = α × (e_t − e_{t-1}) + β × (Δe_t − Δe_{t-1})    (9)
In formula (9), α and β denote the error reward parameter and the error-rate reward parameter respectively, with 0 < α < 1, 0 < β < 1 and α + β = 1.
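Formula (9) is given in full, so it can be transcribed directly; the weights α = 0.8 and β = 0.2 suggested in the embodiment are used as defaults.

```python
def reward(e_t, e_prev, de_t, de_prev, alpha=0.8, beta=0.2):
    """Formula (9): the reward is the weighted change of the error and of
    the error rate between times t-1 and t, with alpha + beta = 1."""
    assert 0 < alpha < 1 and 0 < beta < 1 and abs(alpha + beta - 1.0) < 1e-9
    return alpha * (e_t - e_prev) + beta * (de_t - de_prev)
```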
Step 7: update with formula (10) the value-function estimate Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) at time t-1 into the final value Q(e_{t-1}, Δe_{t-1}, a″_n^{t-1}):
Q(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) = Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) + l_t ΔQ(e_{t-1}, Δe_{t-1}, a″_n^{t-1})    (10)
In formula (10), ΔQ(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) denotes the value-function difference, given by:
ΔQ(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) = r(e_t, Δe_t, a″_n^t) + γ Q′(e_t, Δe_t, a″_n^t) − Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1})    (11)
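Formulas (10) and (11) together form a standard temporal-difference update and can be transcribed directly:

```python
def q_update(q_prev, r_t, q_curr, l_t, gamma=0.5):
    """Formulas (10)-(11): the value-function difference is the reward
    plus the discounted estimate at time t minus the estimate at time
    t-1 (formula 11); the final value corrects the old estimate by the
    learning factor l_t times that difference (formula 10)."""
    delta_q = r_t + gamma * q_curr - q_prev
    return q_prev + l_t * delta_q
```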
Step 8: assign t + 1 to t and judge whether t > t_max holds, where t_max denotes the set maximum number of iterations; if so, execute step 9. Otherwise adjust the learning factor l_t with formula (12), following the SPSA step-size adjustment algorithm as time t changes.
In formula (12), l is the value of the learning factor at the t = 1 moment, and μ and λ are the nonnegative constants of the SPSA step-size adjustment algorithm.
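Formula (12) is not reproduced in the text. SPSA gain sequences are conventionally of the form l / (t + μ)^λ, which matches the stated ingredients (the initial value l and nonnegative constants μ and λ) and the stated behaviour (a large early learning factor that decays with t); the sketch below assumes that form, and its constants are illustrative.

```python
def learning_factor(t, l=1.0, mu=10.0, lam=0.602):
    """Hypothetical SPSA-style gain sequence for formula (12):
    l_t = l / (t + mu)**lam.  The exponent 0.602 is a value commonly
    used in the SPSA literature; the factor stays positive and shrinks
    monotonically as t grows."""
    return l / (t + mu) ** lam
```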
Step 9: judge whether |Q(e_t, Δe_t, a″_n^t) − Q(e_{t-1}, Δe_{t-1}, a″_n^{t-1})| < ε holds for the final value functions of two consecutive moments; if so, the PID control parameters of the autonomous navigation device have finished adjusting, go to step 11; otherwise, execute step 10.
Step 10: judge whether t exceeds the stipulated time; if so, go to step 3 and reselect an initial parameter-adjustment action a″_n^{t-1} to adjust the PID control parameters of the autonomous navigation device; otherwise, go to step 5 and continue adjusting them.
Step 11: let t = 1.
Step 12: the autonomous navigation device acquires the environment state e_t and Δe_t at time t and judges whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, execute step 13; otherwise, return to step 11. Here e_min and Δe_min denote the minimum environment-state error and error rate that the autonomous navigation device tolerates.
Step 13: assign t + 1 to t and judge whether t > T holds; if so, execute step 3; otherwise, return to step 12. Here T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
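Steps 11 to 13 amount to a monitoring loop that retriggers tuning (step 3) when the error stays outside the tolerated band for T consecutive ticks. A sketch follows, with read_state and retune as hypothetical callbacks standing in for the device interface:

```python
def monitor(read_state, e_min, de_min, T, retune):
    """Watch the environment state (steps 11-13).  Whenever |e| or |de|
    exceeds the tolerated minima the counter t advances; once t > T the
    retuning of step 3 is triggered.  A reading back inside the band
    resets t to 1 (step 11)."""
    t = 1
    while True:
        e, de = read_state()
        if abs(e) > abs(e_min) or abs(de) > abs(de_min):
            t += 1
            if t > T:
                retune()
                return
        else:
            t = 1
```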
Compared with the prior art, the invention has the following benefits:
1. The invention uses a Q-learning algorithm to tune the autonomous navigation control parameters online, and introduces an MCMC sampling algorithm and an SPSA step-size adjustment algorithm into the Q-learning algorithm. The device thus adapts to environmental changes during autonomous navigation and anticipates the navigation conditions of the next moment in advance, which solves the overshoot and delay problems, makes the voyage steadier, and makes parameter adjustment rapid especially under changing weather; the method has broad application prospects in the field of autonomous navigation.
2. By introducing Q-learning, the control effect is associated with the environment state: the reward fed back by the environment determines the quality of each parameter-adjustment action, and the adjustment gradually approaches the direction that improves performance. This solves the overshoot and delayed-response problems during navigation and drives the control parameters quickly to the values optimal for the changed environment, so that the device rapidly adapts to environmental change.
3. The invention introduces MCMC sampling to optimize the traditional Q-learning algorithm: instead of always taking the single action with the maximum action value, the adjustment policy at the current moment estimates the overall probability distribution through the transition probabilities between actions. This avoids falling into local optima when Q-learning selects actions, and yields the optimal adjustment policy during the navigation of the autonomous navigation device.
4. The action probability distribution at the initial sampling moment of the MCMC algorithm is set to the uniform distribution, so MCMC sampling explores actions broadly in the early period of the algorithm's operation; in the later period, the distribution is updated with every sampled action, increasing the probability assigned to each action actually sampled and thereby improving the correctness of the sampling at each moment.
5. The invention varies the learning factor l of the traditional Q-learning algorithm with an SPSA step-size adjustment algorithm: the settings of its parameters define the speed and range of the variation of l, giving the change of l a certain regularity during Q-learning and making the parameter adjustment of the autonomous navigation device more accurate.
Detailed description of the invention
Fig. 1 is the principle block diagram of the online tuning method of the present invention for autonomous navigation device control parameters based on MCMC-optimized Q-learning;
Fig. 2 shows the MCMC optimization steps within the Q-learning algorithm of the present invention;
Fig. 3 is the flow chart of the online tuning method of the present invention for autonomous navigation device control parameters based on MCMC-optimized Q-learning;
Fig. 4 is a schematic diagram of solving the action value function with a BP neural network;
Fig. 5 compares the time consumed by the navigation process of the autonomous navigation device under the method of the invention and under the traditional fixed-PID-parameter method in different experiments;
Fig. 6 compares the real-time error e_t of the method of the invention and the traditional fixed-PID-parameter method when the environment remains constant during navigation;
Fig. 7 compares the real-time error e_t of the two methods while the environment is changing during navigation;
Fig. 8 compares the real-time error e_t of the two methods after the environment has changed during navigation.
Specific embodiment
In this embodiment, the principle of the online tuning method for autonomous navigation device control parameters based on MCMC-optimized Q-learning is shown in Fig. 1. The autonomous navigation device receives in real time the error e_t and error rate Δe_t of the current environment; the MCMC-optimized Q-learning algorithm decides in real time the parameter-adjustment action a_n of the next moment; finally, when the final value function of the Q-learning algorithm no longer changes, the optimal control parameters under the current environment are obtained. The MCMC optimization steps within the Q-learning algorithm are shown in Fig. 2. The method is applied to the field of online tuning of autonomous navigation device control parameters, and adapts to the current environment by changing the control parameters of the device.
As shown in Fig. 3, the online tuning method for the control parameters proceeds as follows:
Step 1: the PID control parameters comprise the proportional parameter k_p, the integral parameter k_i and the differential parameter k_d. The role of k_p is to speed up the response of the system and improve its regulation precision; the role of k_i is to eliminate the steady-state error of the system; the role of k_d is to improve the dynamic characteristics of the system.
According to the control precision σ of the autonomous navigation device, the adjustment increments Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d are obtained with formula (1).
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the autonomous navigation device.
For example, with σ = 0.1, X_p ∈ [10, 20], X_i ∈ [1, 6] and X_d ∈ [1, 2], formula (1) gives the transition activities of Δk_p as a positive increase of 1, no change, or a reverse decrease of 1; the transition activities of Δk_i and Δk_d are obtained similarly.
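Formula (1) itself is not reproduced, but the worked example (σ = 0.1 with X_p ∈ [10, 20] yielding an increment of 1) is consistent with taking the increment as the control precision times the width of the threshold range; the sketch below assumes that reconstruction.

```python
def adjustment_increment(sigma, x_min, x_max):
    """Hypothetical reconstruction of formula (1): the adjustment
    increment is the control precision sigma times the width of the
    parameter's threshold range [x_min, x_max]."""
    return sigma * (x_max - x_min)

dkp = adjustment_increment(0.1, 10, 20)  # 1.0, matching the example
dki = adjustment_increment(0.1, 1, 6)    # 0.5
dkd = adjustment_increment(0.1, 1, 2)    # 0.1
```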
Because of the uncertainty of the environment, the traditional fixed-PID-parameter method brings overshoot and delayed-response problems to the autonomous navigation device during navigation, and the PID parameters must be modified manually to adapt to different environments. To address these problems, a Q-learning algorithm is introduced here to adjust the PID control parameters online in real time.
Q-learning is an intelligent learning algorithm proposed by Chris Watkins in 1989, combining temporal-difference (TD) methods with dynamic programming; Watkins' work advanced the rapid development of reinforcement learning. Q-learning is a model-free, value-iteration reinforcement learning algorithm that usefully combines the theory of dynamic programming with the psychology of animal learning, and is suited to sequential optimal decision problems with delayed reward.
Step 2: Q-learning must decide changes to the control parameters of the autonomous navigation device. If the PID adjustment were treated as three separate actions, the computational complexity of the Q-learning algorithm would increase, so the adjustment increments Δk_p, Δk_i and Δk_d are combined to obtain the parameter-adjustment action set of the device, denoted A = {a_1, a_2, …, a_n, …, a_N}, where a_n denotes the n-th parameter-adjustment action in the set and comprises the proportional adjustment Δk_p^n, the integral adjustment Δk_i^n and the differential adjustment Δk_d^n of the n-th action, n = 1, 2, …, N.
Step 3: set time t = 1 and randomly select a parameter-adjustment action a″_n^{t-1} to act on the autonomous navigation device.
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1].
The learning factor l_t changes with time t. In its early period, Q-learning needs to obtain a large learning value from the sample data, so the initial l_t is a relatively large positive number; as t increases, the autonomous navigation device no longer needs a large learning value, so l_t is gradually decreased. The discount factor γ controls how much the device weighs short-term against long-term results. Considering the two extremes: with γ = 0 the device considers only the reward of the current environment, and with γ = 1 only the reward of future moments. The discount factor is therefore set according to the actual demand of the device, and γ = 0.5 is generally taken to weigh the current and future moments together.
Initialize the three PID control parameters k_p, k_i and k_d from control experience with the autonomous navigation device; for example, this experimental system initially sets k_p = 2.5, k_i = 0.5 and k_d = 0.2.
Initialize the value-function estimate Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the device at time t-1 and Δe_{t-1} its rate of change; together they form the environment state at time t-1.
At the t = 1 moment, set Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) = 0, e_{t-1} = 0 and Δe_{t-1} = 0.
Step 4: in Q-learning, the autonomous navigation device must select the action with the maximum value so as to obtain the maximum immediate reward, but it must also select different actions as far as possible so that the optimal policy can be obtained in view of all actions. If the device always selected the action with the peak value, a drawback would follow: if, in the early experience-gathering stage, the device has not yet acquired the optimal policy, the later learning stages could never obtain it.
Therefore the MCMC sampling algorithm is introduced into Q-learning to decide the action chosen at each moment. By sampling the action transition matrix, MCMC sampling obtains sample values that satisfy the action probability distribution; even when the distribution is unknown, the action chosen at each moment can be sampled accurately.
According to the number N of parameter-adjustment actions in the set A of the autonomous navigation device, the transition matrix P^{t-1} of the decision process in the Q-learning algorithm is initialized with formula (2).
In formula (2), P_{nm}^{t-1} denotes the probability of transferring from parameter-adjustment action a_n to parameter-adjustment action a_m at time t-1; when t = 1, every transition probability is equal.
Step 5: obtain the decision process at time t using the MCMC-optimized Q-learning algorithm.
Step 5.1: a BP neural network can approximate arbitrary nonlinear functions and plays a significant role in solving evolution problems over large-scale and continuous state spaces; the principle of solving the action value function with a BP neural network is shown in Fig. 4. Compute with formula (3) the value Q*(e_t, Δe_t, a_n^t) of the n-th parameter-adjustment action at time t under the environment state.
In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, …, nh, where nh denotes the number of hidden nodes; y_j(t-1) denotes the output of the j-th hidden node at time t-1, and is given by formula (4).
In formula (4), o_j(t-1) denotes the input of the j-th hidden node at time t-1, and is given by formula (5).
In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node at time t-1, and x_i(t-1) denotes the input of the i-th input node, i = 1, 2, …, ni, where ni denotes the number of input nodes of the BP neural network.
For example, ni = 3 means the BP neural network has 3 inputs: the error e_{t-1}, the error rate Δe_{t-1} and the action input; nh = 5 means it contains five hidden nodes. In general, more hidden nodes give higher computational accuracy but also greater computational complexity. At the t = 1 moment, the hidden-layer weights are set to w_j(t-1) = 1, j = 1, 2, …, nh, and the input-layer weights to w_ij(t-1) = 0.8, i = 1, 2, …, ni.
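With the sizes and initial weights given in this example (ni = 3, nh = 5, input weights 0.8, hidden weights 1), the forward pass of formulas (3) to (5) can be sketched as follows. The hidden-node activation is not reproduced in the text, so a sigmoid is assumed:

```python
import math

def q_value(inputs, w_in, w_hidden):
    """Forward pass of the value network.  Each hidden node j receives
    o_j = sum_i w_in[i][j] * x_i (formula 5), emits y_j = sigmoid(o_j)
    (formula 4, activation assumed), and the value is the weighted sum
    Q* = sum_j w_hidden[j] * y_j (formula 3)."""
    q = 0.0
    for j in range(len(w_hidden)):
        o_j = sum(w_in[i][j] * x for i, x in enumerate(inputs))
        y_j = 1.0 / (1.0 + math.exp(-o_j))
        q += w_hidden[j] * y_j
    return q

# Sizes from the embodiment: ni = 3 inputs (error, error rate, action),
# nh = 5 hidden nodes, input weights 0.8, hidden weights 1.
w_in = [[0.8] * 5 for _ in range(3)]
w_hidden = [1.0] * 5
q0 = q_value([0.0, 0.0, 0.0], w_in, w_hidden)  # 5 * sigmoid(0) = 2.5
```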
Step 5.2: obtain the parameter-adjustment action a″_n^t of the autonomous navigation device at time t by MCMC sampling.
Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th parameter-adjustment action at time t and the action a″_n^{t-1} chosen at time t-1, update the transition probability matrix P^t of the decision process with formula (6).
In formula (6), Q*(e_t, Δe_t, a_m^t) denotes the value of the m-th parameter-adjustment action at time t, the sum over n = 1, 2, …, N of Q*(e_t, Δe_t, a_n^t) denotes the total value of all actions at time t, and P_{nm}^t denotes the probability of transferring from the n-th action a_n^t to the m-th action a_m^t at time t.
Step 5.2.2: set the sampling count c = 0, 1, 2, …, C.
Step 5.2.3: draw the c-th sample from the transition probability matrix P^t of time t, and obtain with formula (7) the acceptance rate of the (c+1)-th sample of the MCMC algorithm at time t.
In formula (7), p_{c+1}(a_n^t) denotes the probability of the action obtained by the (c+1)-th sample at time t, and p_c(a′_n^t) denotes the probability of the action a′_n^t obtained by the c-th sample; when c = 0, the probability distribution p_c(a′_n^t) of the sampled action is set to the uniform distribution, i.e. p_0(a′_n^t) = 1/N.
From formula (7) it can be seen that, at time t, p_c(a′_n^t) and the transition probabilities are fixed values, so the larger the probability of the action drawn by the (c+1)-th sample, the larger the acceptance rate, and conversely the smaller.
Since the MCMC sampling algorithm obtains, by sampling the transition probability matrix, sample values that satisfy the action distribution p_c(a′_n^t), the action distribution p(a_n) at the start of MCMC sampling can be set arbitrarily. Setting the initial distribution of the sampled action uniform, p_0(a′_n^t) = 1/N, gives the device the same sampling probability for every action and guarantees the correctness of the action sampling of Q-learning at each moment.
Step 5.2.4: draw a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sample, otherwise reject it and assign a′_n^t to the (c+1)-th sample.
For example, with a random acceptance value u = 0.5: if the acceptance rate obtained from formula (7) is smaller than u, this sample is considered a failure and the sampled action a′_n^t remains unchanged; if it is not smaller than u, the sample succeeds and the sampled action becomes the newly drawn one.
Step 5.2.5, the t moment the c+1 times obtained movement a ' of sampling is updated using formula (8)n tProbability distribution pc+1 (a′n t):
In formula (8),Indicate the t moment the c times obtained movement a ' of samplingn tProbability distribution pc(a′n t) denominator;Indicate the t moment the c times obtained movement a ' of samplingn tProbability distribution pc(a′n t) molecule;As c=0, enable
Step 5.2.6, it enables c+1 be assigned to c, and judges whether c > C is true, if so, 5.2.7 is thened follow the steps, it is no Then, return step 5.2.3 sequence executes;
Step 5.2.7, to the transition probability matrix of t momentThe C+1 times sampling is carried out, t moment autonomous navigation device is obtained Control parameter adjusting act a "n t, and enable t moment value function estimated value Q ' (et,Δet,a″n t) it is autonomous navigation described in t moment The control parameter adjusting of device acts a "n tValue function value Q*(et,Δet,a″n t);
According to the MCMC algorithm, by the time the sampling number c reaches 100 the probability distribution pc(a′n^t) of the action a′n^t has essentially stabilized, so C = 100 is generally set; the sampling number C can also be chosen according to the precision of the aircraft system.
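Steps 5.2.2–5.2.7 amount to running a short Markov chain over the N candidate actions until its distribution stabilizes (the text suggests C = 100), then taking one final draw. A compact sketch under stated assumptions: the helper name `sample_action` is hypothetical, and a normalized exponential of the value functions stands in for the transition row of formula (6), which is not reproduced in this text.

```python
import math
import random

def sample_action(q_values, C=100, rng=None):
    """Run C Metropolis steps over the action set and return the
    index of the final, (C+1)-th draw, as in steps 5.2.2-5.2.7.

    q_values -- value function values Q*(et, det, a_n^t), one per action
    """
    rng = rng or random.Random()
    # Target row built from the value functions (stand-in for formula (6)).
    weights = [math.exp(q) for q in q_values]
    total = sum(weights)
    target = [w / total for w in weights]      # stationary distribution
    n = len(q_values)
    current = rng.randrange(n)                 # uniform start, as in step 5.2.1's note
    for _ in range(C):
        candidate = rng.randrange(n)           # propose an action uniformly
        accept = min(1.0, target[candidate] / target[current])
        if rng.random() <= accept:             # step 5.2.4 accept/reject
            current = candidate
    return current

rng = random.Random(0)
draws = [sample_action([0.1, 0.2, 2.0], C=100, rng=rng) for _ in range(200)]
# The highest-valued action (index 2) should dominate the draws.
assert draws.count(2) > 100
```

Because the chain's stationary distribution weights actions by their value function values, actions with larger Q values are drawn more often, which is the exploration/exploitation balance the MCMC optimization provides to Q-learning.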
Step 6, obtain the behavior action return value r(et, Δet, a″n^t) of the control parameter adjustment action a″n^t of the autonomous navigation device at time t using formula (9):
r(et, Δet, a″n^t) = α×(et − et−1) + β×(Δet − Δet−1)   (9)
In formula (9), α and β respectively denote the error return parameter and the error-rate return parameter, with 0 < α < 1, 0 < β < 1, and α + β = 1;
The behavior action return value r(et, Δet, a″n^t) reflects the operating condition of the autonomous navigation device after the parameter adjustment action a″n^t is applied at time t: if the returned environment state worsens, the return value is negative, indicating punishment; if it improves, the return value is positive, indicating reward; if it is unchanged, the return value is zero, indicating hold. The environment state of the autonomous navigation device comprises the error et and the error rate Δet, so the environment-state return parameters α and β are introduced to weight the influence of the different states according to their importance; α = 0.8 and β = 0.2 are generally set.
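Formula (9) can be written out directly. The sketch below implements the formula verbatim as printed; note that the sign convention (which direction of error change counts as punishment) depends on how et is defined, so the helper name and the reading of the output follow the surrounding text rather than any definitive implementation.

```python
def action_return(e_t, e_prev, de_t, de_prev, alpha=0.8, beta=0.2):
    """Behavior action return value of formula (9), implemented verbatim:
    r = alpha*(e_t - e_prev) + beta*(de_t - de_prev), with alpha + beta = 1.
    A negative value is read as punishment, positive as reward, zero as hold.
    """
    assert abs(alpha + beta - 1.0) < 1e-9  # constraint stated in the text
    return alpha * (e_t - e_prev) + beta * (de_t - de_prev)

# Unchanged environment state -> zero return ("hold").
assert action_return(0.3, 0.3, 0.1, 0.1) == 0.0
# alpha = 0.8 weights the error term four times as heavily as the rate term.
assert abs(action_return(1.0, 0.0, 1.0, 0.0) - 1.0) < 1e-9
```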
Step 7, update the value function estimate Q′(et−1, Δet−1, a″n^{t−1}) at time t−1 to the final value function value Q(et−1, Δet−1, a″n^{t−1}) at time t−1 using formula (10):
Q(et−1, Δet−1, a″n^{t−1}) = Q′(et−1, Δet−1, a″n^{t−1}) + lt·ΔQ(et−1, Δet−1, a″n^{t−1})   (10)
In formula (10), ΔQ(et−1, Δet−1, a″n^{t−1}) denotes the final value function difference, given by:
ΔQ(et−1, Δet−1, a″n^{t−1}) = r(et, Δet, a″n^t) + γ·Q′(et, Δet, a″n^t) − Q′(et−1, Δet−1, a″n^{t−1})   (11)
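Formulas (10) and (11) together are the standard Q-learning update with learning factor lt and discount factor γ; a minimal sketch (function name is illustrative):

```python
def q_update(q_prev, q_curr, reward, l_t, gamma):
    """Update the time t-1 estimate Q'(e_{t-1}, de_{t-1}, a''_{t-1})
    to the final value of formula (10), using the temporal-difference
    term of formula (11).
    """
    delta_q = reward + gamma * q_curr - q_prev     # formula (11)
    return q_prev + l_t * delta_q                  # formula (10)

# With reward 1, gamma 0.9, current estimate 2.0 and previous 1.0,
# the TD term is 1 + 0.9*2.0 - 1.0 = 1.8; with l_t = 0.5 the update is 1.9.
assert abs(q_update(1.0, 2.0, 1.0, 0.5, 0.9) - 1.9) < 1e-9
```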
Step 8, assign t+1 to t and judge whether t > tmax holds; if so, execute step 9, where tmax denotes the set maximum number of iterations; otherwise, following the SPSA step-length adjustment algorithm, adjust the learning factor lt with the variation of time t using formula (12):
In formula (12), l is the learning factor value at time t = 1, and μ and λ are non-negative constants in the SPSA step-length adjustment algorithm;
Introducing the SPSA step-length adjustment algorithm gives the learning factor lt in Q-learning a regular pattern of variation; by setting the non-negative parameters μ and λ of the SPSA step-length adjustment algorithm, the speed and interval range of the variation of lt are defined, which makes the aircraft parameter adjustment more accurate. Generally tmax = 30, μ = 0.3 and λ = 1.2 are set.
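Formula (12) itself is not reproduced in this text. The standard SPSA gain sequence l_t = l/(t+λ)^μ is a plausible form consistent with the description (learning factor l at t = 1, non-negative constants μ and λ, regular decay in t), so the sketch below should be read as an assumption, not the patent's exact formula:

```python
def learning_factor(t, l=0.5, mu=0.3, lam=1.2):
    """Assumed SPSA-style step-length schedule: l_t = l / (t + lam)**mu.
    mu and lam are the non-negative SPSA constants; mu=0.3 and lam=1.2
    are the values the text suggests, l=0.5 is an illustrative choice.
    """
    return l / (t + lam) ** mu

# The schedule decays monotonically, so later Q updates are gentler,
# but the factor never reaches zero within the t_max = 30 iterations.
steps = [learning_factor(t) for t in range(1, 31)]
assert all(a > b for a, b in zip(steps, steps[1:]))
assert steps[-1] > 0
```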
Step 9, judge whether |Q(et, Δet, a″n^t) − Q(et−1, Δet−1, a″n^{t−1})| < ε holds for the final value function values at two consecutive moments; if so, the PID control parameter adjustment of the autonomous navigation device is finished, and go to step 11; otherwise, execute step 10;
ε is a very small positive number used to determine whether the PID control parameter adjustment is finished, and is related to the control precision of the aircraft: the smaller ε is, the higher the precision of autonomous navigation and the closer the obtained PID control parameters are to the optimal values; ε = 0.2 is generally set.
Step 10, judge whether t exceeds the specified time; if so, go to step 3 and reselect an initial control parameter adjustment action a″n^{t−1} to adjust the PID control parameters of the autonomous navigation device; otherwise, go to step 5 and continue the PID control parameter adjustment;
Step 11, set t = 1;
Step 12, the autonomous navigation device acquires the environment state et and Δet at time t and judges whether |et| > |emin| or |Δet| > |Δemin| holds; if so, execute step 13; otherwise return to step 11. Here emin and Δemin respectively denote the minimum environment-state error and error rate allowed by the autonomous navigation device; generally emin = 0.1 and Δemin = 0.05 are set;
Step 13, assign t+1 to t and judge whether t > T holds; if so, execute step 3; otherwise return to step 12. Here T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
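Steps 11–13 form a monitoring loop: after the PID parameters converge, the device watches the environment error and re-enters the adjustment procedure (step 3) only when the error or error rate stays above the allowed minimums for longer than the time constant T. A schematic sketch; the function name and the sample-stream representation are hypothetical:

```python
def needs_retuning(samples, e_min=0.1, de_min=0.05, T=10):
    """Scan (e_t, de_t) samples as in steps 11-13 and report whether
    the adjustment procedure should restart: the error or error rate
    must remain above the allowed minimums until the counter t
    exceeds the time constant T.
    """
    t = 0
    for e, de in samples:
        if abs(e) > e_min or abs(de) > de_min:   # step 12 threshold test
            t += 1                                # step 13: advance t
            if t > T:
                return True                       # go to step 3: retune
        else:
            t = 0                                 # step 11: reset t
    return False

# A brief disturbance is ignored; a sustained one triggers retuning.
assert needs_retuning([(0.2, 0.0)] * 5 + [(0.0, 0.0)] * 5, T=10) is False
assert needs_retuning([(0.2, 0.0)] * 12, T=10) is True
```

The time constant T thus acts as a debounce: transient error spikes do not restart the (relatively costly) MCMC-optimized Q-learning adjustment.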
Experimental result:
The method of this patent and the traditional fixed PID parameter method were applied to autonomous navigation devices simultaneously, and multiple groups of comparative experiments were carried out; in each experiment the two groups of autonomous navigation devices were guaranteed to start from the same starting point at the same time and reach the same end point. Fig. 5 compares the time consumed by the navigation process; Fig. 6, Fig. 7 and Fig. 8 compare the real-time error et during the navigation process.
In the time-consumption comparison, three groups of comparative experiments were taken; each group was carried out 50 times and the results were averaged. The first group compares the arrival times of the two groups of autonomous navigation devices when the current environment is stable; the second group compares the arrival times when the environment changes suddenly during navigation; the third group compares the arrival times after the environmental change. As shown in Fig. 5, in the initially stable environment the PID control parameters used by the device with the fixed PID parameter method are close to the optimal parameters, so its elapsed time is roughly the same as that of the device using this patent's method. When the environment changes suddenly during navigation, although the arrival times of both groups become longer, the device using this patent's method clearly consumes much less time than the device using the traditional method; the extra time consumed by this patent's method occurs mainly during the adjustment of the control parameters. After the environmental change, the device using this patent's method has already adjusted the control parameters under the current environment to their optimal values, so its consumed time returns to the same level as before the change, while the device using the traditional method continues to consume more time because its control parameters are no longer optimal under the new environment; when the environmental change is severe, the device using the traditional method may fail to reach the specified destination.
In the real-time error et comparison, the same three groups of comparative experiments were taken; each group was likewise carried out 50 times and the results were averaged. Fig. 6 shows the comparison with the initial environment unchanged: the real-time error et of the two groups of autonomous navigation devices varies in roughly the same way. Fig. 7 shows the comparison when the environment changes suddenly at the 7th second of the navigation process: the real-time error et of both groups increases greatly at the sudden change, but after a period of navigation-parameter adjustment the error of the device using this patent's method quickly drops back close to 0, while the error of the device using the traditional method cannot be reduced to 0 and keeps fluctuating within an error range. Fig. 8 shows the comparison after the environmental change: the variation law of et for the device using this patent's method is almost the same as before the change, while the error of the device using the traditional method still cannot be reduced to 0 and keeps fluctuating within an error range.
Combining the two kinds of comparison results over the three groups of experiments above, this patent's method achieves a better autonomous navigation effect than the traditional fixed PID control parameter method in a changeable environment, and at the same time solves the problems of overshoot and response delay of the autonomous navigation device caused by control parameters that are not optimal under the current environment.

Claims (1)

1. An on-line control method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, characterized by comprising the following steps:
Step 1, according to the control precision σ of the autonomous navigation device, obtain the adjustment parameters Δkp, Δki and Δkd of the three PID control parameters kp, ki and kd of the autonomous navigation device using formula (1):
In formula (1), Xp, Xi and Xd respectively denote the threshold ranges of the three PID control parameters kp, ki and kd of the autonomous navigation device;
Step 2, combine the adjustment parameters Δkp, Δki and Δkd to obtain the parameter variation action set of the autonomous navigation device, denoted A = {a1, a2, …, an, …, aN}, where an denotes the n-th control parameter adjustment action in the parameter variation action set and consists of the proportional adjustment parameter, the integral adjustment parameter and the differential adjustment parameter corresponding to the n-th action, n = 1, 2, …, N;
Step 3, set time t = 1, randomly select a control parameter adjustment action and apply it to the autonomous navigation device;
Initialize the relevant parameters in the Q-learning algorithm: the learning factor lt at time t and the discount factor γ, with lt > 0 and γ ∈ [0, 1];
Initialize the three PID control parameters kp, ki and kd according to the control experience of the autonomous navigation device;
Initialize the value function estimate of the Q-learning algorithm at time t−1, where et−1 denotes the error of the autonomous navigation device at time t−1 and Δet−1 denotes its error rate at time t−1, and et−1 and Δet−1 together form the environment state at time t−1;
Step 4, according to the number N of control parameter adjustment actions in the parameter variation action set A of the autonomous navigation device, initialize the transition matrix of the decision process in the Q-learning algorithm using formula (2):
In formula (2), each entry denotes the transition probability at time t−1 from one control parameter adjustment action to another, and when t = 1 every entry equals 1/N;
Step 5, obtain the decision process at time t using the MCMC-optimized Q-learning algorithm;
Step 5.1, calculate the value function value of the n-th control parameter adjustment action at time t under the environment state using formula (3):
In formula (3), wj(t−1) denotes the weight of the j-th hidden-layer node of the BP neural network at time t−1, j = 1, 2, …, nh, where nh denotes the number of hidden-layer nodes of the BP neural network; yj(t−1) denotes the output of the j-th hidden-layer node at time t−1, given by formula (4):
In formula (4), oj(t−1) denotes the input of the j-th hidden-layer node at time t−1, given by formula (5):
In formula (5), wij(t−1) denotes the weight from the i-th input-layer node to the j-th hidden-layer node of the BP neural network at time t−1, and xi(t−1) denotes the input of the i-th input-layer node at time t−1, i = 1, 2, …, ni, where ni denotes the number of input-layer nodes of the BP neural network;
Step 5.2, obtain the control parameter adjustment action of the autonomous navigation device at time t by MCMC sampling;
Step 5.2.1, according to the value function value of the n-th control parameter adjustment action at time t under the environment state and the action chosen at time t−1, update the transition probability matrix of the decision process using formula (6):
In formula (6), the numerator denotes the value function value of the n-th control parameter adjustment action at time t, the denominator denotes the sum of the value function values of all actions at time t, n = 1, 2, …, N, and each entry denotes the transition probability at time t from the n-th control parameter adjustment action to the m-th control parameter adjustment action;
Step 5.2.2, set the sampling number c = 0, 1, 2, …, C;
Step 5.2.3, sample the transition probability matrix at time t, and obtain the acceptance rate of the (c+1)-th sampling at time t in the MCMC algorithm using formula (7):
In formula (7), the numerator denotes the probability value of the action obtained by the (c+1)-th sampling at time t and the denominator denotes the probability value of the action obtained by the c-th sampling; when c = 0, the probability distribution of the action obtained by the c-th sampling at time t is set to the uniform distribution, i.e. 1/N;
Step 5.2.4, sample a random acceptance rate u from the uniform distribution Uniform[0, 1] and compare u with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sampling; otherwise reject the action obtained by the (c+1)-th sampling and keep the action obtained by the c-th sampling;
Step 5.2.5, update the probability distribution of the action obtained by the (c+1)-th sampling at time t using formula (8):
In formula (8), the two auxiliary quantities denote the denominator and the numerator, respectively, of the probability distribution of the action obtained by the c-th sampling at time t; when c = 0 they are initialized accordingly;
Step 5.2.6, assign c+1 to c and judge whether c > C holds; if so, execute step 5.2.7; otherwise return to step 5.2.3 and continue in sequence;
Step 5.2.7, perform the (C+1)-th sampling of the transition probability matrix at time t to obtain the control parameter adjustment action of the autonomous navigation device at time t, and take the value function value of that action as the value function estimate at time t;
Step 6, obtain the behavior action return value of the control parameter adjustment action of the autonomous navigation device at time t using formula (9):
In formula (9), α and β respectively denote the error return parameter and the error-rate return parameter, with 0 < α < 1, 0 < β < 1, and α + β = 1;
Step 7, update the value function estimate at time t−1 to the final value function value at time t−1 using formula (10):
In formula (10), ΔQ denotes the final value function difference, given by formula (11):
Step 8, assign t+1 to t and judge whether t > tmax holds; if so, execute step 9, where tmax denotes the set maximum number of iterations; otherwise, following the SPSA step-length adjustment algorithm, adjust the learning factor lt with the variation of time t using formula (12):
In formula (12), l is the learning factor value at time t = 1, and μ and λ are non-negative constants in the SPSA step-length adjustment algorithm;
Step 9, judge whether the final value function values at two consecutive moments differ by less than ε; if so, the PID control parameter adjustment of the autonomous navigation device is finished, and go to step 11; otherwise, execute step 10;
Step 10, judge whether t exceeds the specified time; if so, go to step 3 and reselect an initial control parameter adjustment action to adjust the PID control parameters of the autonomous navigation device; otherwise, go to step 5 and continue the PID control parameter adjustment;
Step 11, set t = 1;
Step 12, the autonomous navigation device acquires the environment state et and Δet at time t and judges whether |et| > |emin| or |Δet| > |Δemin| holds; if so, execute step 13; otherwise return to step 11, where emin and Δemin respectively denote the minimum environment-state error and error rate allowed by the autonomous navigation device;
Step 13, assign t+1 to t and judge whether t > T holds; if so, execute step 3; otherwise return to step 12, where T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
CN201711144395.2A 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study Active CN107885086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711144395.2A CN107885086B (en) 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study


Publications (2)

Publication Number Publication Date
CN107885086A CN107885086A (en) 2018-04-06
CN107885086B true CN107885086B (en) 2019-10-25

Family

ID=61777810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711144395.2A Active CN107885086B (en) 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study

Country Status (1)

Country Link
CN (1) CN107885086B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710289B (en) * 2018-05-18 2021-11-09 厦门理工学院 Relay base quality optimization method based on improved SPSA
CN109696830B (en) * 2019-01-31 2021-12-03 天津大学 Reinforced learning self-adaptive control method of small unmanned helicopter
EP3725471A1 (en) * 2019-04-16 2020-10-21 Robert Bosch GmbH Configuring a system which interacts with an environment
CN114237267B (en) * 2021-11-02 2023-11-24 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision assisting method based on reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
CN105700526A (en) * 2016-01-13 2016-06-22 华北理工大学 On-line sequence limit learning machine method possessing autonomous learning capability
CN106950956A (en) * 2017-03-22 2017-07-14 合肥工业大学 The wheelpath forecasting system of fusional movement model and behavior cognitive model
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2178745B1 (en) * 2007-08-14 2012-02-29 Propeller Control Aps Efficiency optimizing propeller speed control for ships


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHRISTOPHE ANDRIEU et al., "An Introduction to MCMC for Machine Learning", Machine Learning, 2003, pp. 5–37. * Cited by examiner

Also Published As

Publication number Publication date
CN107885086A (en) 2018-04-06

Similar Documents

Publication Publication Date Title
CN107885086B (en) Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study
CN110427261A (en) A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN112114521B (en) Intelligent prediction control entry guidance method for spacecraft
CN107767022A (en) A kind of Dynamic Job-shop Scheduling rule intelligent selecting method of creation data driving
CN111176807A (en) Multi-satellite cooperative task planning method
CN106056127A (en) GPR (gaussian process regression) online soft measurement method with model updating
CN114169543A (en) Federal learning algorithm based on model obsolescence and user participation perception
Yuan et al. Actor-critic deep reinforcement learning for energy minimization in UAV-aided networks
CN116523079A (en) Reinforced learning-based federal learning optimization method and system
Goldenshluger et al. A note on performance limitations in bandit problems with side information
Bui et al. Clustered bandits
Cassano et al. Distributed value-function learning with linear convergence rates
CN116880191A (en) Intelligent control method of process industrial production system based on time sequence prediction
CN114039366B (en) Power grid secondary frequency modulation control method and device based on peacock optimization algorithm
CN115310775A (en) Multi-agent reinforcement learning rolling scheduling method, device, equipment and storage medium
CN111582567B (en) Wind power probability prediction method based on hierarchical integration
CN111796519B (en) Automatic control method of multi-input multi-output system based on extreme learning machine
CN109657778B (en) Improved multi-swarm global optimal-based adaptive pigeon swarm optimization method
Fagan et al. Dynamic multi-agent reinforcement learning for control optimization
Wongsai et al. A Reinforcement learning for criminal’s escape path prediction
Lei Optimization of intelligent neural network prediction based on particle swarm
Wang et al. Convergence-Based Exploration Algorithm for Reinforcement Learning
CN113270867B (en) Automatic adjustment method for weak power grid tide without solution
CN114637209A (en) Method for controlling neural network inverse controller based on reinforcement learning
Fu et al. Research on Multi-Agent Reinforcement Learning Traffic Control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant