CN107885086B - Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study - Google Patents


Info

Publication number
CN107885086B
CN107885086B (application CN201711144395.2A)
Authority
CN
China
Prior art keywords
moment
navigation device
autonomous navigation
control parameter
movement
Prior art date
Legal status
Active
Application number
CN201711144395.2A
Other languages
Chinese (zh)
Other versions
CN107885086A (en)
Inventor
夏娜
柴煜奇
杜华争
陈斌
Current Assignee
Hefei Polytechnic University
Original Assignee
Hefei Polytechnic University
Priority date
Application filed by Hefei Polytechnic University
Priority to CN201711144395.2A
Publication of CN107885086A
Application granted
Publication of CN107885086B


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems that are electric
    • G05B13/04: Adaptive control systems that are electric and involve the use of models or simulators
    • G05B13/042: Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention discloses an online tuning method, based on Q-learning optimized by MCMC sampling, for the control parameters of an autonomous navigation device, comprising the following steps: first, the possible changes of the device's PID control parameters are enumerated according to the practical situation to obtain a parameter-adjustment action set, and the PID control parameters are initialized from control experience with the device; then an action is selected at random and applied to the device, the value Q* of each action is obtained with the Q-learning algorithm, the action to take at the next moment is obtained by MCMC sampling, and the learning factor l of the Q-learning algorithm is adjusted over time with an SPSA step-size adjustment algorithm; finally, repeated parameter adjustment yields the optimal control parameters under the present circumstances. The invention solves the overshoot and delay problems of the autonomous navigation device during navigation, so that the device adapts rapidly to changes in the environment and arrives at its destination quickly and stably.

Description

Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study
Technical field
The invention belongs to the field of online tuning of autonomous navigation device control parameters, and is specifically a method for adjusting the control parameters of an autonomous navigation device.
Background technique
Autonomous navigation means that a vehicle on the water surface, given an artificially specified destination, plans its own path and reaches the destination through continuous self-adjustment. It has important application value in water quality inspection and related fields.
At present, traditional autonomous navigation devices use a fixed-PID-parameter method, in which the control parameters are fixed values acquired from extensive engineering experience with autonomous navigation projects. When the fixed control parameters do not suit the current environment, the device suffers from overshoot and delayed response during navigation. Especially under changeable conditions, fixed control parameters may respond well to individual environmental states but cannot satisfy all of them, and the parameters must be changed manually whenever the environment changes, which is inconvenient for the use of the device.
There are also methods that adjust the control parameters with fuzzy algorithms or annealing algorithms. These introduce a control-parameter self-correction mechanism to a certain extent, but since they are not intelligent control algorithms, they still cannot adjust the control parameters of an autonomous navigation device quickly to their optimal values under changeable environments.
Summary of the invention
To remedy the above shortcomings of the prior art, the present invention provides an online tuning method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, so as to solve the overshoot and delay problems of the device during navigation and enable it to adapt rapidly to changes in the environment and arrive at its destination quickly and stably.
To achieve the above object, the invention adopts the following technical scheme:
The online tuning method of the present invention for autonomous navigation device control parameters, based on MCMC-optimized Q-learning, is characterized by the following steps:
Step 1: according to the control precision σ of the autonomous navigation device, obtain with formula (1) the adjustment increments Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d.
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the autonomous navigation device.
Step 2: combine the adjustment increments Δk_p, Δk_i and Δk_d to obtain the parameter-adjustment action set of the autonomous navigation device, denoted A = {a_1, a_2, …, a_n, …, a_N}, where a_n denotes the n-th parameter-adjustment action in the set and comprises Δk_p^n, the proportional adjustment of the n-th action, Δk_i^n, its integral adjustment, and Δk_d^n, its differential adjustment, n = 1, 2, …, N.
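One plausible reading of this combination step can be sketched in code. This is an assumption, not the patent's definitive construction: each increment is taken to act positively, be held at zero, or act negatively (as the embodiment's example for Δk_p suggests), giving N = 3^3 = 27 actions; the function name and increment values are illustrative.

```python
from itertools import product

def build_action_set(dkp, dki, dkd):
    """Enumerate the parameter-adjustment action set A.

    Each PID increment is applied positively, held, or applied
    negatively, giving N = 3**3 = 27 combined actions, each a tuple
    (delta_kp, delta_ki, delta_kd)."""
    return [(sp * dkp, si * dki, sd * dkd)
            for sp, si, sd in product((1, 0, -1), repeat=3)]

actions = build_action_set(dkp=1.0, dki=0.5, dkd=0.1)
print(len(actions))  # 27
```

Grouping the three increments into one combined action set keeps the Q-learning state-action table small, as the specification argues later.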
Step 3: set time t = 1 and randomly select a parameter-adjustment action a″_n^{t-1} to act on the autonomous navigation device.
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1].
Initialize the three PID control parameters k_p, k_i and k_d from control experience with the autonomous navigation device.
Initialize the value-function estimate Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the autonomous navigation device at time t-1 and Δe_{t-1} its rate of change; together e_{t-1} and Δe_{t-1} form the environment state at time t-1.
Step 4: according to the number N of parameter-adjustment actions in the set A of the autonomous navigation device, initialize with formula (2) the transition matrix P^{t-1} of the decision process in the Q-learning algorithm.
In formula (2), P_{nm}^{t-1} denotes the probability of transferring from parameter-adjustment action a_n to parameter-adjustment action a_m at time t-1; when t = 1, every transition probability is equal.
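Formula (2) itself is not reproduced in the text; since the matrix is initialized purely from the number of actions N, a uniform row-stochastic matrix is the natural reading, sketched here under that assumption:

```python
import numpy as np

def init_transition_matrix(N):
    """At t = 1, every action is assumed to transfer to every action
    with the same probability 1/N, so each row sums to 1."""
    return np.full((N, N), 1.0 / N)

P = init_transition_matrix(27)
```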
Step 5: obtain the decision process at time t using the MCMC-optimized Q-learning algorithm.
Step 5.1: compute with formula (3) the value Q*(e_t, Δe_t, a_n^t) of the n-th parameter-adjustment action a_n^t at time t under the environment state.
In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, …, nh, where nh denotes the number of hidden nodes of the BP neural network; y_j(t-1) denotes the output of the j-th hidden node at time t-1, and is given by formula (4).
In formula (4), o_j(t-1) denotes the input of the j-th hidden node at time t-1, and is given by formula (5).
In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node of the BP neural network at time t-1, and x_i(t-1) denotes the input of the i-th input node, i = 1, 2, …, ni, where ni denotes the number of input nodes of the BP neural network.
Step 5.2: obtain the parameter-adjustment action a″_n^t of the autonomous navigation device at time t by MCMC sampling.
Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th parameter-adjustment action at time t and the action a″_n^{t-1} chosen at time t-1, update the transition probability matrix P^t of the decision process with formula (6).
In formula (6), Q*(e_t, Δe_t, a_m^t) denotes the value of the m-th parameter-adjustment action at time t, the sum over n = 1, 2, …, N of Q*(e_t, Δe_t, a_n^t) denotes the total value of all actions at time t, and P_{nm}^t denotes the probability of transferring from the n-th parameter-adjustment action a_n^t to the m-th parameter-adjustment action a_m^t at time t.
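Formula (6) is likewise not reproduced; the quantities it names (each action's value and the sum of all action values at time t) suggest a value-normalized update, sketched here as an assumption. It requires positive values to yield a valid distribution; a softmax over the values would be the usual safeguard otherwise.

```python
import numpy as np

def update_transition_matrix(q_values):
    """Hypothetical reading of formula (6): the probability of moving to
    action m is its value divided by the sum of all action values, so
    higher-valued actions receive higher transition probability.  The
    same row is used from every current action n."""
    q = np.asarray(q_values, dtype=float)
    row = q / q.sum()
    return np.tile(row, (len(q), 1))

P_t = update_transition_matrix([1.0, 3.0])
```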
Step 5.2.2: set the sampling count c = 0, 1, 2, …, C.
Step 5.2.3: draw the c-th sample from the transition probability matrix P^t of time t, and obtain with formula (7) the acceptance rate of the (c+1)-th sample of the MCMC algorithm at time t.
In formula (7), p_{c+1}(a_n^t) denotes the probability of the action obtained by the (c+1)-th sample at time t, and p_c(a′_n^t) denotes the probability of the action a′_n^t obtained by the c-th sample at time t; when c = 0, the probability distribution p_c(a′_n^t) of the action obtained by the c-th sample is set to the uniform distribution, i.e. p_0(a′_n^t) = 1/N.
Step 5.2.4: draw a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sample, otherwise reject it and assign a′_n^t to the (c+1)-th sample.
Step 5.2.5: update with formula (8) the probability distribution p_{c+1}(a′_n^t) of the action obtained by the (c+1)-th sample at time t.
In formula (8), the denominator and numerator of the distribution p_c(a′_n^t) of the c-th sample are carried forward to form p_{c+1}(a′_n^t), with their initial values set at c = 0.
Step 5.2.6: assign c + 1 to c and judge whether c > C holds; if so, execute step 5.2.7; otherwise, return to step 5.2.3 and continue in sequence.
Step 5.2.7: draw the (C+1)-th sample from the transition probability matrix P^t of time t to obtain the parameter-adjustment action a″_n^t of the autonomous navigation device at time t, and let the value-function estimate Q′(e_t, Δe_t, a″_n^t) at time t equal the value Q*(e_t, Δe_t, a″_n^t) of that action.
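Since formulas (7) and (8) are not reproduced in the text, steps 5.2.2 to 5.2.7 can only be illustrated with the textbook Metropolis acceptance rule that they follow in outline: propose an action, accept it with a probability formed from the ratio of the two actions' probabilities, otherwise keep the current one. All names here are illustrative.

```python
import random

def metropolis_sample(p_target, proposal_probs, steps=100, seed=0):
    """Metropolis-style sketch of the inner sampling loop: at each of
    `steps` iterations a candidate action is drawn from the proposal
    distribution and accepted with probability
    min(1, p_target[candidate] / p_target[current]); the action held
    after the final iteration is returned, as in step 5.2.7."""
    rng = random.Random(seed)
    current = rng.randrange(len(p_target))
    for _ in range(steps):
        cand = rng.choices(range(len(p_target)), weights=proposal_probs)[0]
        accept = min(1.0, p_target[cand] / p_target[current])
        if rng.random() <= accept:  # u <= acceptance rate: keep the sample
            current = cand
    return current
```

With enough iterations, the empirical distribution of kept samples approaches the target action distribution, which is the property the patent relies on when the true distribution is unknown.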
Step 6: obtain with formula (9) the reward r(e_t, Δe_t, a″_n^t) of the parameter-adjustment action a″_n^t of the autonomous navigation device at time t:
r(e_t, Δe_t, a″_n^t) = α × (e_t − e_{t-1}) + β × (Δe_t − Δe_{t-1})    (9)
In formula (9), α and β denote the error reward parameter and the error-rate reward parameter respectively, with 0 < α < 1, 0 < β < 1 and α + β = 1.
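Formula (9) is given in full, so it can be transcribed directly; the weights α = 0.8 and β = 0.2 suggested in the embodiment are used as defaults.

```python
def reward(e_t, e_prev, de_t, de_prev, alpha=0.8, beta=0.2):
    """Formula (9): the reward is the weighted change of the error and of
    the error rate between times t-1 and t, with alpha + beta = 1."""
    assert 0 < alpha < 1 and 0 < beta < 1 and abs(alpha + beta - 1.0) < 1e-9
    return alpha * (e_t - e_prev) + beta * (de_t - de_prev)
```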
Step 7: update with formula (10) the value-function estimate Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) at time t-1 into the final value Q(e_{t-1}, Δe_{t-1}, a″_n^{t-1}):
Q(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) = Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) + l_t ΔQ(e_{t-1}, Δe_{t-1}, a″_n^{t-1})    (10)
In formula (10), ΔQ(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) denotes the value-function difference, given by:
ΔQ(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) = r(e_t, Δe_t, a″_n^t) + γ Q′(e_t, Δe_t, a″_n^t) − Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1})    (11)
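Formulas (10) and (11) together form a standard temporal-difference update and can be transcribed directly:

```python
def q_update(q_prev, r_t, q_curr, l_t, gamma=0.5):
    """Formulas (10)-(11): the value-function difference is the reward
    plus the discounted estimate at time t minus the estimate at time
    t-1 (formula 11); the final value corrects the old estimate by the
    learning factor l_t times that difference (formula 10)."""
    delta_q = r_t + gamma * q_curr - q_prev
    return q_prev + l_t * delta_q
```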
Step 8: assign t + 1 to t and judge whether t > t_max holds, where t_max denotes the set maximum number of iterations; if so, execute step 9. Otherwise adjust the learning factor l_t with formula (12), following the SPSA step-size adjustment algorithm as time t changes.
In formula (12), l is the value of the learning factor at the t = 1 moment, and μ and λ are the nonnegative constants of the SPSA step-size adjustment algorithm.
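Formula (12) is not reproduced in the text. SPSA gain sequences are conventionally of the form l / (t + μ)^λ, which matches the stated ingredients (the initial value l and nonnegative constants μ and λ) and the stated behaviour (a large early learning factor that decays with t); the sketch below assumes that form, and its constants are illustrative.

```python
def learning_factor(t, l=1.0, mu=10.0, lam=0.602):
    """Hypothetical SPSA-style gain sequence for formula (12):
    l_t = l / (t + mu)**lam.  The exponent 0.602 is a value commonly
    used in the SPSA literature; the factor stays positive and shrinks
    monotonically as t grows."""
    return l / (t + mu) ** lam
```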
Step 9: judge whether |Q(e_t, Δe_t, a″_n^t) − Q(e_{t-1}, Δe_{t-1}, a″_n^{t-1})| < ε holds for the final value functions of two consecutive moments; if so, the PID control parameters of the autonomous navigation device have finished adjusting, go to step 11; otherwise, execute step 10.
Step 10: judge whether t exceeds the stipulated time; if so, go to step 3 and reselect an initial parameter-adjustment action a″_n^{t-1} to adjust the PID control parameters of the autonomous navigation device; otherwise, go to step 5 and continue adjusting them.
Step 11: let t = 1.
Step 12: the autonomous navigation device acquires the environment state e_t and Δe_t at time t and judges whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, execute step 13; otherwise, return to step 11. Here e_min and Δe_min denote the minimum environment-state error and error rate that the autonomous navigation device tolerates.
Step 13: assign t + 1 to t and judge whether t > T holds; if so, execute step 3; otherwise, return to step 12. Here T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
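Steps 11 to 13 amount to a monitoring loop that retriggers tuning (step 3) when the error stays outside the tolerated band for T consecutive ticks. A sketch follows, with read_state and retune as hypothetical callbacks standing in for the device interface:

```python
def monitor(read_state, e_min, de_min, T, retune):
    """Watch the environment state (steps 11-13).  Whenever |e| or |de|
    exceeds the tolerated minima the counter t advances; once t > T the
    retuning of step 3 is triggered.  A reading back inside the band
    resets t to 1 (step 11)."""
    t = 1
    while True:
        e, de = read_state()
        if abs(e) > abs(e_min) or abs(de) > abs(de_min):
            t += 1
            if t > T:
                retune()
                return
        else:
            t = 1
```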
Compared with the prior art, the invention has the following benefits:
1. The invention uses a Q-learning algorithm to tune the autonomous navigation control parameters online, and introduces an MCMC sampling algorithm and an SPSA step-size adjustment algorithm into the Q-learning algorithm. The device thus adapts to environmental changes during autonomous navigation and anticipates the navigation conditions of the next moment in advance, which solves the overshoot and delay problems, makes the voyage steadier, and makes parameter adjustment rapid especially under changing weather; the method has broad application prospects in the field of autonomous navigation.
2. By introducing Q-learning, the control effect is associated with the environment state: the reward fed back by the environment determines the quality of each parameter-adjustment action, and the adjustment gradually approaches the direction that improves performance. This solves the overshoot and delayed-response problems during navigation and drives the control parameters quickly to the values optimal for the changed environment, so that the device rapidly adapts to environmental change.
3. The invention introduces MCMC sampling to optimize the traditional Q-learning algorithm: instead of always taking the single action with the maximum action value, the adjustment policy at the current moment estimates the overall probability distribution through the transition probabilities between actions. This avoids falling into local optima when Q-learning selects actions, and yields the optimal adjustment policy during the navigation of the autonomous navigation device.
4. The action probability distribution at the initial sampling moment of the MCMC algorithm is set to the uniform distribution, so MCMC sampling explores actions broadly in the early period of the algorithm's operation; in the later period, the distribution is updated with every sampled action, increasing the probability assigned to each action actually sampled and thereby improving the correctness of the sampling at each moment.
5. The invention varies the learning factor l of the traditional Q-learning algorithm with an SPSA step-size adjustment algorithm: the settings of its parameters define the speed and range of the variation of l, giving the change of l a certain regularity during Q-learning and making the parameter adjustment of the autonomous navigation device more accurate.
Detailed description of the invention
Fig. 1 is the principle block diagram of the online tuning method of the present invention for autonomous navigation device control parameters based on MCMC-optimized Q-learning;
Fig. 2 shows the MCMC optimization steps within the Q-learning algorithm of the present invention;
Fig. 3 is the flow chart of the online tuning method of the present invention for autonomous navigation device control parameters based on MCMC-optimized Q-learning;
Fig. 4 is a schematic diagram of solving the action value function with a BP neural network;
Fig. 5 compares the time consumed by the navigation process of the autonomous navigation device under the method of the invention and under the traditional fixed-PID-parameter method in different experiments;
Fig. 6 compares the real-time error e_t of the method of the invention and the traditional fixed-PID-parameter method when the environment remains constant during navigation;
Fig. 7 compares the real-time error e_t of the two methods while the environment is changing during navigation;
Fig. 8 compares the real-time error e_t of the two methods after the environment has changed during navigation.
Specific embodiment
In this embodiment, the principle of the online tuning method for autonomous navigation device control parameters based on MCMC-optimized Q-learning is shown in Fig. 1. The autonomous navigation device receives in real time the error e_t and error rate Δe_t of the current environment; the MCMC-optimized Q-learning algorithm decides in real time the parameter-adjustment action a_n of the next moment; finally, when the final value function of the Q-learning algorithm no longer changes, the optimal control parameters under the current environment are obtained. The MCMC optimization steps within the Q-learning algorithm are shown in Fig. 2. The method is applied to the field of online tuning of autonomous navigation device control parameters, and adapts to the current environment by changing the control parameters of the device.
As shown in Fig. 3, the online tuning method for the control parameters proceeds as follows:
Step 1: the PID control parameters comprise the proportional parameter k_p, the integral parameter k_i and the differential parameter k_d. The role of k_p is to speed up the response of the system and improve its regulation precision; the role of k_i is to eliminate the steady-state error of the system; the role of k_d is to improve the dynamic characteristics of the system.
According to the control precision σ of the autonomous navigation device, the adjustment increments Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d are obtained with formula (1).
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the autonomous navigation device.
For example, with σ = 0.1, X_p ∈ [10, 20], X_i ∈ [1, 6] and X_d ∈ [1, 2], formula (1) gives the transition activities of Δk_p as a positive increase of 1, no change, or a reverse decrease of 1; the transition activities of Δk_i and Δk_d are obtained similarly.
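Formula (1) itself is not reproduced, but the worked example (σ = 0.1 with X_p ∈ [10, 20] yielding an increment of 1) is consistent with taking the increment as the control precision times the width of the threshold range; the sketch below assumes that reconstruction.

```python
def adjustment_increment(sigma, x_min, x_max):
    """Hypothetical reconstruction of formula (1): the adjustment
    increment is the control precision sigma times the width of the
    parameter's threshold range [x_min, x_max]."""
    return sigma * (x_max - x_min)

dkp = adjustment_increment(0.1, 10, 20)  # 1.0, matching the example
dki = adjustment_increment(0.1, 1, 6)    # 0.5
dkd = adjustment_increment(0.1, 1, 2)    # 0.1
```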
Because of the uncertainty of the environment, the traditional fixed-PID-parameter method brings overshoot and delayed-response problems to the autonomous navigation device during navigation, and the PID parameters must be modified manually to adapt to different environments. To address these problems, a Q-learning algorithm is introduced here to adjust the PID control parameters online in real time.
Q-learning is an intelligent learning algorithm proposed by Chris Watkins in 1989, combining temporal-difference (TD) methods with dynamic programming; Watkins' work advanced the rapid development of reinforcement learning. Q-learning is a model-free, value-iteration reinforcement learning algorithm that usefully combines the theory of dynamic programming with the psychology of animal learning, and is suited to sequential optimal decision problems with delayed reward.
Step 2: Q-learning must decide changes to the control parameters of the autonomous navigation device. If the PID adjustment were treated as three separate actions, the computational complexity of the Q-learning algorithm would increase, so the adjustment increments Δk_p, Δk_i and Δk_d are combined to obtain the parameter-adjustment action set of the device, denoted A = {a_1, a_2, …, a_n, …, a_N}, where a_n denotes the n-th parameter-adjustment action in the set and comprises the proportional adjustment Δk_p^n, the integral adjustment Δk_i^n and the differential adjustment Δk_d^n of the n-th action, n = 1, 2, …, N.
Step 3: set time t = 1 and randomly select a parameter-adjustment action a″_n^{t-1} to act on the autonomous navigation device.
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1].
The learning factor l_t changes with time t. In its early period, Q-learning needs to obtain a large learning value from the sample data, so the initial l_t is a relatively large positive number; as t increases, the autonomous navigation device no longer needs a large learning value, so l_t is gradually decreased. The discount factor γ controls how much the device weighs short-term against long-term results. Considering the two extremes: with γ = 0 the device considers only the reward of the current environment, and with γ = 1 only the reward of future moments. The discount factor is therefore set according to the actual demand of the device, and γ = 0.5 is generally taken to weigh the current and future moments together.
Initialize the three PID control parameters k_p, k_i and k_d from control experience with the autonomous navigation device; for example, this experimental system initially sets k_p = 2.5, k_i = 0.5 and k_d = 0.2.
Initialize the value-function estimate Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the device at time t-1 and Δe_{t-1} its rate of change; together they form the environment state at time t-1.
At the t = 1 moment, set Q′(e_{t-1}, Δe_{t-1}, a″_n^{t-1}) = 0, e_{t-1} = 0 and Δe_{t-1} = 0.
Step 4: in Q-learning, the autonomous navigation device must select the action with the maximum value so as to obtain the maximum immediate reward, but it must also select different actions as far as possible so that the optimal policy can be obtained in view of all actions. If the device always selected the action with the peak value, a drawback would follow: if, in the early experience-gathering stage, the device has not yet acquired the optimal policy, the later learning stages could never obtain it.
Therefore the MCMC sampling algorithm is introduced into Q-learning to decide the action chosen at each moment. By sampling the action transition matrix, MCMC sampling obtains sample values that satisfy the action probability distribution; even when the distribution is unknown, the action chosen at each moment can be sampled accurately.
According to the number N of parameter-adjustment actions in the set A of the autonomous navigation device, the transition matrix P^{t-1} of the decision process in the Q-learning algorithm is initialized with formula (2).
In formula (2), P_{nm}^{t-1} denotes the probability of transferring from parameter-adjustment action a_n to parameter-adjustment action a_m at time t-1; when t = 1, every transition probability is equal.
Step 5: obtain the decision process at time t using the MCMC-optimized Q-learning algorithm.
Step 5.1: a BP neural network can approximate arbitrary nonlinear functions and plays a significant role in solving evolution problems over large-scale and continuous state spaces; the principle of solving the action value function with a BP neural network is shown in Fig. 4. Compute with formula (3) the value Q*(e_t, Δe_t, a_n^t) of the n-th parameter-adjustment action at time t under the environment state.
In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, …, nh, where nh denotes the number of hidden nodes; y_j(t-1) denotes the output of the j-th hidden node at time t-1, and is given by formula (4).
In formula (4), o_j(t-1) denotes the input of the j-th hidden node at time t-1, and is given by formula (5).
In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node at time t-1, and x_i(t-1) denotes the input of the i-th input node, i = 1, 2, …, ni, where ni denotes the number of input nodes of the BP neural network.
For example, ni = 3 means the BP neural network has 3 inputs: the error e_{t-1}, the error rate Δe_{t-1} and the action input; nh = 5 means it contains five hidden nodes. In general, more hidden nodes give higher computational accuracy but also greater computational complexity. At the t = 1 moment, the hidden-layer weights are set to w_j(t-1) = 1, j = 1, 2, …, nh, and the input-layer weights to w_ij(t-1) = 0.8, i = 1, 2, …, ni.
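With the sizes and initial weights given in this example (ni = 3, nh = 5, input weights 0.8, hidden weights 1), the forward pass of formulas (3) to (5) can be sketched as follows. The hidden-node activation is not reproduced in the text, so a sigmoid is assumed:

```python
import math

def q_value(inputs, w_in, w_hidden):
    """Forward pass of the value network.  Each hidden node j receives
    o_j = sum_i w_in[i][j] * x_i (formula 5), emits y_j = sigmoid(o_j)
    (formula 4, activation assumed), and the value is the weighted sum
    Q* = sum_j w_hidden[j] * y_j (formula 3)."""
    q = 0.0
    for j in range(len(w_hidden)):
        o_j = sum(w_in[i][j] * x for i, x in enumerate(inputs))
        y_j = 1.0 / (1.0 + math.exp(-o_j))
        q += w_hidden[j] * y_j
    return q

# Sizes from the embodiment: ni = 3 inputs (error, error rate, action),
# nh = 5 hidden nodes, input weights 0.8, hidden weights 1.
w_in = [[0.8] * 5 for _ in range(3)]
w_hidden = [1.0] * 5
q0 = q_value([0.0, 0.0, 0.0], w_in, w_hidden)  # 5 * sigmoid(0) = 2.5
```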
Step 5.2: obtain the parameter-adjustment action a″_n^t of the autonomous navigation device at time t by MCMC sampling.
Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th parameter-adjustment action at time t and the action a″_n^{t-1} chosen at time t-1, update the transition probability matrix P^t of the decision process with formula (6).
In formula (6), Q*(e_t, Δe_t, a_m^t) denotes the value of the m-th parameter-adjustment action at time t, the sum over n = 1, 2, …, N of Q*(e_t, Δe_t, a_n^t) denotes the total value of all actions at time t, and P_{nm}^t denotes the probability of transferring from the n-th action a_n^t to the m-th action a_m^t at time t.
Step 5.2.2: set the sampling count c = 0, 1, 2, …, C.
Step 5.2.3: draw the c-th sample from the transition probability matrix P^t of time t, and obtain with formula (7) the acceptance rate of the (c+1)-th sample of the MCMC algorithm at time t.
In formula (7), p_{c+1}(a_n^t) denotes the probability of the action obtained by the (c+1)-th sample at time t, and p_c(a′_n^t) denotes the probability of the action a′_n^t obtained by the c-th sample; when c = 0, the probability distribution p_c(a′_n^t) of the sampled action is set to the uniform distribution, i.e. p_0(a′_n^t) = 1/N.
From formula (7) it can be seen that, at time t, p_c(a′_n^t) and the transition probabilities are fixed values, so the larger the probability of the action drawn by the (c+1)-th sample, the larger the acceptance rate, and conversely the smaller.
Since the MCMC sampling algorithm obtains, by sampling the transition probability matrix, sample values that satisfy the action distribution p_c(a′_n^t), the action distribution p(a_n) at the start of MCMC sampling can be set arbitrarily. Setting the initial distribution of the sampled action uniform, p_0(a′_n^t) = 1/N, gives the device the same sampling probability for every action and guarantees the correctness of the action sampling of Q-learning at each moment.
Step 5.2.4: draw a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sample, otherwise reject it and assign a′_n^t to the (c+1)-th sample.
For example, with a random acceptance value u = 0.5: if the acceptance rate obtained from formula (7) is smaller than u, this sample is considered a failure and the sampled action a′_n^t remains unchanged; if it is not smaller than u, the sample succeeds and the sampled action becomes the newly drawn one.
Step 5.2.5, the t moment the c+1 times obtained movement a ' of sampling is updated using formula (8)n tProbability distribution pc+1 (a′n t):
In formula (8),Indicate the t moment the c times obtained movement a ' of samplingn tProbability distribution pc(a′n t) denominator;Indicate the t moment the c times obtained movement a ' of samplingn tProbability distribution pc(a′n t) molecule;As c=0, enable
Step 5.2.6, it enables c+1 be assigned to c, and judges whether c > C is true, if so, 5.2.7 is thened follow the steps, it is no Then, return step 5.2.3 sequence executes;
Step 5.2.7, to the transition probability matrix of t momentThe C+1 times sampling is carried out, t moment autonomous navigation device is obtained Control parameter adjusting act a "n t, and enable t moment value function estimated value Q ' (et,Δet,a″n t) it is autonomous navigation described in t moment The control parameter adjusting of device acts a "n tValue function value Q*(et,Δet,a″n t);
According to the MCMC algorithm, by the time the sampling number c reaches 100 the probability distribution pc(a′n^t) of the action a′n^t has essentially stabilized, so C = 100 is generally set; the sampling number C can also be chosen according to the precision of the aircraft system.
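Steps 5.2.2–5.2.7 amount to running a short Markov chain over the N candidate actions until its distribution stabilizes (the text suggests C = 100), then taking one final draw. A compact sketch under stated assumptions: the helper name `sample_action` is hypothetical, and a normalized exponential of the value functions stands in for the transition row of formula (6), which is not reproduced in this text.

```python
import math
import random

def sample_action(q_values, C=100, rng=None):
    """Run C Metropolis steps over the action set and return the
    index of the final, (C+1)-th draw, as in steps 5.2.2-5.2.7.

    q_values -- value function values Q*(et, det, a_n^t), one per action
    """
    rng = rng or random.Random()
    # Target row built from the value functions (stand-in for formula (6)).
    weights = [math.exp(q) for q in q_values]
    total = sum(weights)
    target = [w / total for w in weights]      # stationary distribution
    n = len(q_values)
    current = rng.randrange(n)                 # uniform start, as in step 5.2.1's note
    for _ in range(C):
        candidate = rng.randrange(n)           # propose an action uniformly
        accept = min(1.0, target[candidate] / target[current])
        if rng.random() <= accept:             # step 5.2.4 accept/reject
            current = candidate
    return current

rng = random.Random(0)
draws = [sample_action([0.1, 0.2, 2.0], C=100, rng=rng) for _ in range(200)]
# The highest-valued action (index 2) should dominate the draws.
assert draws.count(2) > 100
```

Because the chain's stationary distribution weights actions by their value function values, actions with larger Q values are drawn more often, which is the exploration/exploitation balance the MCMC optimization provides to Q-learning.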
Step 6, obtain the behavior action return value r(et, Δet, a″n^t) of the control parameter adjustment action a″n^t of the autonomous navigation device at time t using formula (9):
r(et, Δet, a″n^t) = α×(et − et−1) + β×(Δet − Δet−1)   (9)
In formula (9), α and β respectively denote the error return parameter and the error-rate return parameter, with 0 < α < 1, 0 < β < 1, and α + β = 1;
The behavior action return value r(et, Δet, a″n^t) reflects the operating condition of the autonomous navigation device after the parameter adjustment action a″n^t is applied at time t: if the returned environment state worsens, the return value is negative, indicating punishment; if it improves, the return value is positive, indicating reward; if it is unchanged, the return value is zero, indicating hold. The environment state of the autonomous navigation device comprises the error et and the error rate Δet, so the environment-state return parameters α and β are introduced to weight the influence of the different states according to their importance; α = 0.8 and β = 0.2 are generally set.
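Formula (9) can be written out directly. The sketch below implements the formula verbatim as printed; note that the sign convention (which direction of error change counts as punishment) depends on how et is defined, so the helper name and the reading of the output follow the surrounding text rather than any definitive implementation.

```python
def action_return(e_t, e_prev, de_t, de_prev, alpha=0.8, beta=0.2):
    """Behavior action return value of formula (9), implemented verbatim:
    r = alpha*(e_t - e_prev) + beta*(de_t - de_prev), with alpha + beta = 1.
    A negative value is read as punishment, positive as reward, zero as hold.
    """
    assert abs(alpha + beta - 1.0) < 1e-9  # constraint stated in the text
    return alpha * (e_t - e_prev) + beta * (de_t - de_prev)

# Unchanged environment state -> zero return ("hold").
assert action_return(0.3, 0.3, 0.1, 0.1) == 0.0
# alpha = 0.8 weights the error term four times as heavily as the rate term.
assert abs(action_return(1.0, 0.0, 1.0, 0.0) - 1.0) < 1e-9
```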
Step 7, update the value function estimate Q′(et−1, Δet−1, a″n^{t−1}) at time t−1 to the final value function value Q(et−1, Δet−1, a″n^{t−1}) at time t−1 using formula (10):
Q(et−1, Δet−1, a″n^{t−1}) = Q′(et−1, Δet−1, a″n^{t−1}) + lt·ΔQ(et−1, Δet−1, a″n^{t−1})   (10)
In formula (10), ΔQ(et−1, Δet−1, a″n^{t−1}) denotes the final value function difference, given by:
ΔQ(et−1, Δet−1, a″n^{t−1}) = r(et, Δet, a″n^t) + γ·Q′(et, Δet, a″n^t) − Q′(et−1, Δet−1, a″n^{t−1})   (11)
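Formulas (10) and (11) together are the standard Q-learning update with learning factor lt and discount factor γ; a minimal sketch (function name is illustrative):

```python
def q_update(q_prev, q_curr, reward, l_t, gamma):
    """Update the time t-1 estimate Q'(e_{t-1}, de_{t-1}, a''_{t-1})
    to the final value of formula (10), using the temporal-difference
    term of formula (11).
    """
    delta_q = reward + gamma * q_curr - q_prev     # formula (11)
    return q_prev + l_t * delta_q                  # formula (10)

# With reward 1, gamma 0.9, current estimate 2.0 and previous 1.0,
# the TD term is 1 + 0.9*2.0 - 1.0 = 1.8; with l_t = 0.5 the update is 1.9.
assert abs(q_update(1.0, 2.0, 1.0, 0.5, 0.9) - 1.9) < 1e-9
```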
Step 8, assign t+1 to t and judge whether t > tmax holds; if so, execute step 9, where tmax denotes the set maximum number of iterations; otherwise, following the SPSA step-length adjustment algorithm, adjust the learning factor lt with the variation of time t using formula (12):
In formula (12), l is the learning factor value at time t = 1, and μ and λ are non-negative constants in the SPSA step-length adjustment algorithm;
Introducing the SPSA step-length adjustment algorithm gives the learning factor lt in Q-learning a regular pattern of variation; by setting the non-negative parameters μ and λ of the SPSA step-length adjustment algorithm, the speed and interval range of the variation of lt are defined, which makes the aircraft parameter adjustment more accurate. Generally tmax = 30, μ = 0.3 and λ = 1.2 are set.
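Formula (12) itself is not reproduced in this text. The standard SPSA gain sequence l_t = l/(t+λ)^μ is a plausible form consistent with the description (learning factor l at t = 1, non-negative constants μ and λ, regular decay in t), so the sketch below should be read as an assumption, not the patent's exact formula:

```python
def learning_factor(t, l=0.5, mu=0.3, lam=1.2):
    """Assumed SPSA-style step-length schedule: l_t = l / (t + lam)**mu.
    mu and lam are the non-negative SPSA constants; mu=0.3 and lam=1.2
    are the values the text suggests, l=0.5 is an illustrative choice.
    """
    return l / (t + lam) ** mu

# The schedule decays monotonically, so later Q updates are gentler,
# but the factor never reaches zero within the t_max = 30 iterations.
steps = [learning_factor(t) for t in range(1, 31)]
assert all(a > b for a, b in zip(steps, steps[1:]))
assert steps[-1] > 0
```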
Step 9, judge whether |Q(et, Δet, a″n^t) − Q(et−1, Δet−1, a″n^{t−1})| < ε holds for the final value function values at two consecutive moments; if so, the PID control parameter adjustment of the autonomous navigation device is finished, and go to step 11; otherwise, execute step 10;
ε is a very small positive number used to determine whether the PID control parameter adjustment is finished, and is related to the control precision of the aircraft: the smaller ε is, the higher the precision of autonomous navigation and the closer the obtained PID control parameters are to the optimal values; ε = 0.2 is generally set.
Step 10, judge whether t exceeds the specified time; if so, go to step 3 and reselect an initial control parameter adjustment action a″n^{t−1} to adjust the PID control parameters of the autonomous navigation device; otherwise, go to step 5 and continue the PID control parameter adjustment;
Step 11, set t = 1;
Step 12, the autonomous navigation device acquires the environment state et and Δet at time t and judges whether |et| > |emin| or |Δet| > |Δemin| holds; if so, execute step 13; otherwise return to step 11. Here emin and Δemin respectively denote the minimum environment-state error and error rate allowed by the autonomous navigation device; generally emin = 0.1 and Δemin = 0.05 are set;
Step 13, assign t+1 to t and judge whether t > T holds; if so, execute step 3; otherwise return to step 12. Here T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
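Steps 11–13 form a monitoring loop: after the PID parameters converge, the device watches the environment error and re-enters the adjustment procedure (step 3) only when the error or error rate stays above the allowed minimums for longer than the time constant T. A schematic sketch; the function name and the sample-stream representation are hypothetical:

```python
def needs_retuning(samples, e_min=0.1, de_min=0.05, T=10):
    """Scan (e_t, de_t) samples as in steps 11-13 and report whether
    the adjustment procedure should restart: the error or error rate
    must remain above the allowed minimums until the counter t
    exceeds the time constant T.
    """
    t = 0
    for e, de in samples:
        if abs(e) > e_min or abs(de) > de_min:   # step 12 threshold test
            t += 1                                # step 13: advance t
            if t > T:
                return True                       # go to step 3: retune
        else:
            t = 0                                 # step 11: reset t
    return False

# A brief disturbance is ignored; a sustained one triggers retuning.
assert needs_retuning([(0.2, 0.0)] * 5 + [(0.0, 0.0)] * 5, T=10) is False
assert needs_retuning([(0.2, 0.0)] * 12, T=10) is True
```

The time constant T thus acts as a debounce: transient error spikes do not restart the (relatively costly) MCMC-optimized Q-learning adjustment.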
Experimental result:
The method of this patent and the traditional fixed PID parameter method were applied to autonomous navigation devices simultaneously, and multiple groups of comparative experiments were carried out; in each experiment the two groups of autonomous navigation devices were guaranteed to start from the same starting point at the same time and reach the same end point. Fig. 5 compares the time consumed by the navigation process; Fig. 6, Fig. 7 and Fig. 8 compare the real-time error et during the navigation process.
In the time-consumption comparison, three groups of comparative experiments were taken; each group was carried out 50 times and the results were averaged. The first group compares the arrival times of the two groups of autonomous navigation devices when the current environment is stable; the second group compares the arrival times when the environment changes suddenly during navigation; the third group compares the arrival times after the environmental change. As shown in Fig. 5, in the initially stable environment the PID control parameters used by the device with the fixed PID parameter method are close to the optimal parameters, so its elapsed time is roughly the same as that of the device using this patent's method. When the environment changes suddenly during navigation, although the arrival times of both groups become longer, the device using this patent's method clearly consumes much less time than the device using the traditional method; the extra time consumed by this patent's method occurs mainly during the adjustment of the control parameters. After the environmental change, the device using this patent's method has already adjusted the control parameters under the current environment to their optimal values, so its consumed time returns to the same level as before the change, while the device using the traditional method continues to consume more time because its control parameters are no longer optimal under the new environment; when the environmental change is severe, the device using the traditional method may fail to reach the specified destination.
In the real-time error et comparison, the same three groups of comparative experiments were taken; each group was likewise carried out 50 times and the results were averaged. Fig. 6 shows the comparison with the initial environment unchanged: the real-time error et of the two groups of autonomous navigation devices varies in roughly the same way. Fig. 7 shows the comparison when the environment changes suddenly at the 7th second of the navigation process: the real-time error et of both groups increases greatly at the sudden change, but after a period of navigation-parameter adjustment the error of the device using this patent's method quickly drops back close to 0, while the error of the device using the traditional method cannot be reduced to 0 and keeps fluctuating within an error range. Fig. 8 shows the comparison after the environmental change: the variation law of et for the device using this patent's method is almost the same as before the change, while the error of the device using the traditional method still cannot be reduced to 0 and keeps fluctuating within an error range.
Combining the two kinds of comparison results over the three groups of experiments above, this patent's method achieves a better autonomous navigation effect than the traditional fixed PID control parameter method in a changeable environment, and at the same time solves the problems of overshoot and response delay of the autonomous navigation device caused by control parameters that are not optimal under the current environment.

Claims (1)

1. An on-line control method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, characterized by comprising the following steps:
Step 1, according to the control precision σ of the autonomous navigation device, obtain the adjustment parameters Δkp, Δki and Δkd of the three PID control parameters kp, ki and kd of the autonomous navigation device using formula (1):
In formula (1), Xp, Xi and Xd respectively denote the threshold ranges of the three PID control parameters kp, ki and kd of the autonomous navigation device;
Step 2, combine the adjustment parameters Δkp, Δki and Δkd to obtain the parameter variation action set of the autonomous navigation device, denoted A = {a1, a2, …, an, …, aN}, where an denotes the n-th control parameter adjustment action in the parameter variation action set and consists of the proportional adjustment parameter, the integral adjustment parameter and the differential adjustment parameter corresponding to the n-th action, n = 1, 2, …, N;
Step 3, set time t = 1, randomly select a control parameter adjustment action and apply it to the autonomous navigation device;
Initialize the relevant parameters in the Q-learning algorithm: the learning factor lt at time t and the discount factor γ, with lt > 0 and γ ∈ [0, 1];
Initialize the three PID control parameters kp, ki and kd according to the control experience of the autonomous navigation device;
Initialize the value function estimate of the Q-learning algorithm at time t−1, where et−1 denotes the error of the autonomous navigation device at time t−1 and Δet−1 denotes its error rate at time t−1, and et−1 and Δet−1 together form the environment state at time t−1;
Step 4, according to the number N of control parameter adjustment actions in the parameter variation action set A of the autonomous navigation device, initialize the transition matrix of the decision process in the Q-learning algorithm using formula (2):
In formula (2), each entry denotes the transition probability at time t−1 from one control parameter adjustment action to another, and when t = 1 every entry equals 1/N;
Step 5, obtain the decision process at time t using the MCMC-optimized Q-learning algorithm;
Step 5.1, calculate the value function value of the n-th control parameter adjustment action at time t under the environment state using formula (3):
In formula (3), wj(t−1) denotes the weight of the j-th hidden-layer node of the BP neural network at time t−1, j = 1, 2, …, nh, where nh denotes the number of hidden-layer nodes of the BP neural network; yj(t−1) denotes the output of the j-th hidden-layer node at time t−1, given by formula (4):
In formula (4), oj(t−1) denotes the input of the j-th hidden-layer node at time t−1, given by formula (5):
In formula (5), wij(t−1) denotes the weight from the i-th input-layer node to the j-th hidden-layer node of the BP neural network at time t−1, and xi(t−1) denotes the input of the i-th input-layer node at time t−1, i = 1, 2, …, ni, where ni denotes the number of input-layer nodes of the BP neural network;
Step 5.2, obtain the control parameter adjustment action of the autonomous navigation device at time t by MCMC sampling;
Step 5.2.1, according to the value function value of the n-th control parameter adjustment action at time t under the environment state and the action chosen at time t−1, update the transition probability matrix of the decision process using formula (6):
In formula (6), the numerator denotes the value function value of the n-th control parameter adjustment action at time t, the denominator denotes the sum of the value function values of all actions at time t, n = 1, 2, …, N, and each entry denotes the transition probability at time t from the n-th control parameter adjustment action to the m-th control parameter adjustment action;
Step 5.2.2, set the sampling number c = 0, 1, 2, …, C;
Step 5.2.3, sample the transition probability matrix at time t, and obtain the acceptance rate of the (c+1)-th sampling at time t in the MCMC algorithm using formula (7):
In formula (7), the numerator denotes the probability value of the action obtained by the (c+1)-th sampling at time t and the denominator denotes the probability value of the action obtained by the c-th sampling; when c = 0, the probability distribution of the action obtained by the c-th sampling at time t is set to the uniform distribution, i.e. 1/N;
Step 5.2.4, sample a random acceptance rate u from the uniform distribution Uniform[0, 1] and compare u with the acceptance rate; if u does not exceed the acceptance rate, accept the action obtained by the (c+1)-th sampling; otherwise reject the action obtained by the (c+1)-th sampling and keep the action obtained by the c-th sampling;
Step 5.2.5, update the probability distribution of the action obtained by the (c+1)-th sampling at time t using formula (8):
In formula (8), the two auxiliary quantities denote the denominator and the numerator, respectively, of the probability distribution of the action obtained by the c-th sampling at time t; when c = 0 they are initialized accordingly;
Step 5.2.6, assign c+1 to c and judge whether c > C holds; if so, execute step 5.2.7; otherwise return to step 5.2.3 and continue in sequence;
Step 5.2.7, perform the (C+1)-th sampling of the transition probability matrix at time t to obtain the control parameter adjustment action of the autonomous navigation device at time t, and take the value function value of that action as the value function estimate at time t;
Step 6, obtain the behavior action return value of the control parameter adjustment action of the autonomous navigation device at time t using formula (9):
In formula (9), α and β respectively denote the error return parameter and the error-rate return parameter, with 0 < α < 1, 0 < β < 1, and α + β = 1;
Step 7, update the value function estimate at time t−1 to the final value function value at time t−1 using formula (10):
In formula (10), ΔQ denotes the final value function difference, given by formula (11):
Step 8, assign t+1 to t and judge whether t > tmax holds; if so, execute step 9, where tmax denotes the set maximum number of iterations; otherwise, following the SPSA step-length adjustment algorithm, adjust the learning factor lt with the variation of time t using formula (12):
In formula (12), l is the learning factor value at time t = 1, and μ and λ are non-negative constants in the SPSA step-length adjustment algorithm;
Step 9, judge whether the final value function values at two consecutive moments differ by less than ε; if so, the PID control parameter adjustment of the autonomous navigation device is finished, and go to step 11; otherwise, execute step 10;
Step 10, judge whether t exceeds the specified time; if so, go to step 3 and reselect an initial control parameter adjustment action to adjust the PID control parameters of the autonomous navigation device; otherwise, go to step 5 and continue the PID control parameter adjustment;
Step 11, set t = 1;
Step 12, the autonomous navigation device acquires the environment state et and Δet at time t and judges whether |et| > |emin| or |Δet| > |Δemin| holds; if so, execute step 13; otherwise return to step 11, where emin and Δemin respectively denote the minimum environment-state error and error rate allowed by the autonomous navigation device;
Step 13, assign t+1 to t and judge whether t > T holds; if so, execute step 3; otherwise return to step 12, where T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
CN201711144395.2A 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study Active CN107885086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711144395.2A CN107885086B (en) 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study


Publications (2)

Publication Number Publication Date
CN107885086A CN107885086A (en) 2018-04-06
CN107885086B true CN107885086B (en) 2019-10-25

Family

ID=61777810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711144395.2A Active CN107885086B (en) 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study

Country Status (1)

Country Link
CN (1) CN107885086B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710289B (en) * 2018-05-18 2021-11-09 厦门理工学院 Relay base quality optimization method based on improved SPSA
CN109696830B (en) * 2019-01-31 2021-12-03 天津大学 Reinforced learning self-adaptive control method of small unmanned helicopter
EP3725471A1 (en) * 2019-04-16 2020-10-21 Robert Bosch GmbH Configuring a system which interacts with an environment
CN114237267B (en) * 2021-11-02 2023-11-24 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision assisting method based on reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
CN105700526A (en) * 2016-01-13 2016-06-22 华北理工大学 On-line sequence limit learning machine method possessing autonomous learning capability
CN106950956A (en) * 2017-03-22 2017-07-14 合肥工业大学 The wheelpath forecasting system of fusional movement model and behavior cognitive model
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2178745B1 (en) * 2007-08-14 2012-02-29 Propeller Control Aps Efficiency optimizing propeller speed control for ships


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHRISTOPHE ANDRIEU et al., "An Introduction to MCMC for Machine Learning", Machine Learning, 2003, pp. 5–37. * Cited by examiner

Also Published As

Publication number Publication date
CN107885086A (en) 2018-04-06

Similar Documents

Publication Publication Date Title
CN107885086B (en) Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study
CN110427261A (en) A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN112114521B (en) Intelligent prediction control entry guidance method for spacecraft
CN107767022A (en) A kind of Dynamic Job-shop Scheduling rule intelligent selecting method of creation data driving
CN111176807A (en) Multi-satellite cooperative task planning method
CN106056127A (en) GPR (gaussian process regression) online soft measurement method with model updating
CN114169543A (en) Federal learning algorithm based on model obsolescence and user participation perception
Yuan et al. Actor-critic deep reinforcement learning for energy minimization in UAV-aided networks
CN116523079A (en) Reinforced learning-based federal learning optimization method and system
Goldenshluger et al. A note on performance limitations in bandit problems with side information
Bui et al. Clustered bandits
Cassano et al. Distributed value-function learning with linear convergence rates
CN116880191A (en) Intelligent control method of process industrial production system based on time sequence prediction
CN114039366B (en) Power grid secondary frequency modulation control method and device based on peacock optimization algorithm
CN115310775A (en) Multi-agent reinforcement learning rolling scheduling method, device, equipment and storage medium
CN111582567B (en) Wind power probability prediction method based on hierarchical integration
CN111796519B (en) Automatic control method of multi-input multi-output system based on extreme learning machine
CN109657778B (en) Improved multi-swarm global optimal-based adaptive pigeon swarm optimization method
Fagan et al. Dynamic multi-agent reinforcement learning for control optimization
Wongsai et al. A Reinforcement learning for criminal’s escape path prediction
Lei Optimization of intelligent neural network prediction based on particle swarm
Wang et al. Convergence-Based Exploration Algorithm for Reinforcement Learning
CN113270867B (en) Automatic adjustment method for weak power grid tide without solution
CN114637209A (en) Method for controlling neural network inverse controller based on reinforcement learning
Fu et al. Research on Multi-Agent Reinforcement Learning Traffic Control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant