CN107885086A - Online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning - Google Patents

Online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning

Info

Publication number
CN107885086A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711144395.2A
Other languages
Chinese (zh)
Other versions
CN107885086B (en)
Inventor
夏娜
柴煜奇
杜华争
陈斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201711144395.2A priority Critical patent/CN107885086B/en
Publication of CN107885086A publication Critical patent/CN107885086A/en
Application granted granted Critical
Publication of CN107885086B publication Critical patent/CN107885086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, comprising the following steps. First, the possible changes of the device's PID control parameters are enumerated according to the actual situation to obtain a set of parameter-adjustment actions, and the PID control parameters are initialized from the control experience of the device. Then one action is selected at random and applied to the device; from the value Q* of each action obtained by the Q-learning algorithm, an MCMC sampling algorithm draws the action to be taken at the next moment, and the learning factor l of the Q-learning algorithm is adjusted over time with an SPSA step-size adjustment algorithm. Finally, repeated adjustment of the control parameters yields the optimal control parameters for the current environment. The invention solves the overshoot and delay problems of an autonomous navigation device during navigation, so that the device adapts rapidly to changes in the environment and reaches its destination quickly and stably.

Description

Online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning
Technical field
The invention belongs to the field of online tuning of autonomous navigation device control parameters, and specifically relates to a method for adjusting the control parameters of an autonomous navigation device.
Background technology
Autonomous navigation means that the device is assigned a destination on the water surface, plans its own path, and reaches the destination through continuous self-adjustment. It has important application value in water-quality inspection, surface clean-up and similar tasks.
At present, traditional autonomous navigation devices use the fixed-PID-parameter method, in which the control parameters of the device are fixed values obtained from extensive engineering experience of autonomous navigation projects. When the fixed control parameters do not suit the current environment, autonomous navigation suffers from overshoot and delayed response; in particular, in changeable environments, fixed control parameters may respond well to individual environmental states but cannot satisfy all of them, and the control parameters must be changed manually when the environment changes, which is inconvenient for the use of the device.
Control-parameter adjustment for navigation devices is also often performed with fuzzy algorithms or annealing algorithms. These methods introduce a self-correcting mechanism for the control parameters to some extent, but because they are not themselves intelligent control algorithms, in changeable environments they still cannot quickly adjust the control parameters of the autonomous navigation device to the optimal values.
The content of the invention
To overcome the above shortcomings of the prior art, the present invention provides an online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, so as to solve the overshoot and time-delay problems of the device during navigation and to enable it to adapt rapidly to changes in the environment and reach its destination quickly and stably.
In order to achieve the above object, the technical solution adopted in the present invention is:
The online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning according to the present invention is characterized by comprising the following steps:
Step 1: according to the control accuracy α of the autonomous navigation device, obtain from formula (1) the adjustment parameters Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d of the device.
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the device respectively;
Step 2: combine the adjustment parameters Δk_p, Δk_i and Δk_d to obtain the parameter-change action set of the device, denoted A = {a_1, a_2, ···, a_n, ···, a_N}, where a_n denotes the n-th control-parameter adjustment action in the set, a_n = (Δk_p^n, Δk_i^n, Δk_d^n); Δk_p^n denotes the proportional adjustment parameter of the n-th action, Δk_i^n the integral adjustment parameter, and Δk_d^n the derivative adjustment parameter, n = 1, 2, ..., N;
Step 3: set time t = 1 and randomly select one control-parameter adjustment action, applying it to the autonomous navigation device;
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1];
Initialize the three PID control parameters k_p, k_i and k_d from the control experience of the device;
Initialize the value-function estimate of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the device at time t-1, Δe_{t-1} denotes the error change rate of the device at time t-1, and e_{t-1} together with Δe_{t-1} forms the environment state at time t-1;
Step 4: according to the number N of control-parameter adjustment actions in the parameter-change action set A of the device, initialize the transition matrix p_{nm}^{t-1} of the decision process in the Q-learning algorithm using formula (2).
In formula (2), p(a_m^{t-1} | a_n^{t-1}) denotes the probability of transferring at time t-1 from control-parameter adjustment action a_n^{t-1} to control-parameter adjustment action a_m^{t-1}; at t = 1 every transition probability is set to the same initial value;
Step 5: use MCMC to optimize the Q-learning decision process and obtain the decision at time t;
Step 5.1: compute from formula (3) the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state.
In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, ..., nh; nh denotes the number of hidden nodes of the BP neural network; y_j(t-1) denotes the output of the j-th hidden node at time t-1 and is given by formula (4).
In formula (4), o_j(t-1) denotes the input of the j-th hidden node at time t-1 and is given by formula (5).
In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node of the BP neural network at time t-1, x_i(t-1) denotes the i-th input of the BP neural network at time t-1, i = 1, 2, ..., ni, and ni denotes the number of input nodes of the BP neural network;
Step 5.2: use the MCMC algorithm to sample the control-parameter adjustment action of the device at time t.
Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state and the action a_n^{t-1} chosen at time t-1, update the transition probability matrix p_{nm}^t of the decision process using formula (6).
In formula (6), Q_{nt}^* denotes the value of the n-th control-parameter adjustment action a_n^t at time t, i.e. Q_{nt}^* = Q*(e_t, Δe_t, a_n^t); ΣQ_{nt}^* denotes the sum of the values of all actions at time t, n = 1, 2, ..., N; p(a_m^t | a_n^t) denotes the probability of transferring at time t from the n-th control-parameter adjustment action a_n^t to the m-th control-parameter adjustment action a_m^t;
Step 5.2.2: set the sampling index c = 0, 1, 2, ..., C.
Step 5.2.3: draw the c-th sample from the transition probability matrix p_{nm}^t at time t, and obtain from formula (7) the acceptance rate α_{c+1}(a_n'^t, a_m'^t) of the (c+1)-th sample of the MCMC algorithm at time t.
In formula (7), p_c(a_m'^t) denotes the probability of the action a_m'^t obtained by the (c+1)-th sample at time t, and p_c(a_n'^t) denotes the probability of the action a_n'^t obtained by the c-th sample at time t; when c = 0, the probability distribution p_c(a_n'^t) of the action obtained by the c-th sample is set to the equiprobable distribution.
Step 5.2.4: sample a random acceptance value u from the uniform distribution Uniform[0, 1] and compare u with the acceptance rate α_{c+1}(a_n'^t, a_m'^t); if the acceptance rate exceeds u, accept the action a_m'^t obtained by the (c+1)-th sample; otherwise do not accept it and assign a_n'^t to a_m'^t.
Step 5.2.5: update from formula (8) the probability distribution p_{c+1}(a_n'^t) of the action obtained by the (c+1)-th sample at time t.
In formula (8), σ_c^t denotes the denominator of the probability distribution p_c(a_n'^t) of the action obtained by the c-th sample at time t, and d_{n,c}^t denotes its numerator; the values at c = 0 follow from the equiprobable initial distribution.
Step 5.2.6: assign c+1 to c and check whether c > C; if so, go to step 5.2.7, otherwise return to step 5.2.3.
Step 5.2.7: draw the (C+1)-th sample from the transition probability matrix p_{nm}^t at time t to obtain the control-parameter adjustment action of the device at time t, and take the value of that action as the value-function estimate at time t;
Step 6: obtain from formula (9) the reward of the control-parameter adjustment action of the device at time t.
In formula (9), α and β denote the error reward parameter and the error-change-rate reward parameter respectively, with 0 < α < 1, 0 < β < 1 and α + β = 1;
Step 7: update from formula (10) the value-function estimate at time t-1 to the final value function at time t-1.
In formula (10), the final value-function difference term is given by formula (11);
Step 8: assign t+1 to t and check whether t > t_max; if so, go to step 9, otherwise adjust the learning factor l_t as t increases using the SPSA step-size adjustment algorithm of formula (12), where t_max denotes the preset maximum number of iterations.
In formula (12), l is the learning-factor value at time t = 1, and μ and λ are non-negative constants of the SPSA step-size adjustment algorithm;
Step 9: check whether the final value functions of two consecutive moments differ by less than the small positive number ε; if so, the adjustment of the device's PID control parameters is finished, and jump to step 11; otherwise perform step 10.
Step 10: check whether t exceeds the specified time; if so, jump to step 3, reselect an initial control-parameter adjustment action and adjust the PID control parameters of the device; otherwise jump to step 5 and continue adjusting the PID control parameters;
Step 11: set t = 1.
Step 12: the device collects the environment state e_t and Δe_t at time t and checks whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, perform step 13, otherwise return to step 11; here e_min and Δe_min denote respectively the minimum environment-state error and error change rate that the device allows.
Step 13: assign t+1 to t and check whether t > T holds; if so, perform step 3, otherwise return to step 12; here T denotes the time constant with which the device adapts to the speed of environmental change.
Compared with the prior art, beneficial effects of the present invention are:
1. The present invention uses a Q-learning algorithm to adjust the navigation control parameters of the device online, and introduces an MCMC sampling algorithm and an SPSA step-size adjustment algorithm into the Q-learning algorithm, so that during autonomous navigation the device adapts to changes in the environment and anticipates the navigation conditions of the next moment. This solves the overshoot and time-delay problems of the device, makes the navigation process smoother, and adjusts the parameters rapidly, particularly when the weather changes, giving the method broad application prospects in the field of autonomous navigation.
2. The invention introduces a Q-learning algorithm that associates the control effect of the device with the environment state. The reward fed back by the environment judges the quality of each parameter-adjustment action, so the parameter adjustment gradually approaches the direction of improvement. This solves the overshoot and delayed-response problems that the device encounters during navigation and quickly moves the control parameters to the optimal values as the environment changes, so that the device adapts rapidly.
3. The invention introduces an MCMC sampling algorithm into the traditional Q-learning algorithm for optimization. The policy for choosing the parameter-adjustment action at the current moment no longer simply takes the single action with the maximum action value; instead, the overall probability distribution is estimated through the transition probabilities between actions. This avoids being trapped in local optima when the Q-learning algorithm selects actions, and yields the optimal adjustment policy during navigation of the device.
4. The invention sets the action probability distribution at the initial sampling moment of the MCMC sampling algorithm to an equiprobable distribution, so that the sampling of actions is unbiased in the early stage of the algorithm. As the algorithm runs, the action probability distribution is updated after every sample and the probability of each sampled action is increased, which improves the correctness of the action sampled at every moment.
5. For the change of the learning factor l of the traditional Q-learning algorithm, the invention adopts an SPSA step-size adjustment algorithm. By setting the parameters of the SPSA step-size adjustment algorithm, the speed and interval of change of the learning factor l are bounded, so that l changes with a certain regularity during Q-learning and the parameter adjustment of the device becomes more accurate.
Brief description of the drawings
Fig. 1 is a block diagram of the principle of the online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning according to the present invention;
Fig. 2 shows the MCMC optimization steps within the Q-learning algorithm of the present invention;
Fig. 3 is a flow chart of the online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning according to the present invention;
Fig. 4 is a schematic diagram of solving the action value function with a BP neural network;
Fig. 5 compares the time consumed by the navigation process of the device under different experiments for the method of the invention and the traditional fixed-PID-parameter method;
Fig. 6 compares the real-time error e_t of the method of the invention and the traditional fixed-PID-parameter method when the environment does not change during navigation;
Fig. 7 compares the real-time error e_t of the two methods when the environment changes during navigation;
Fig. 8 compares the real-time error e_t of the two methods after the environment has changed.
Embodiment
In this embodiment, the principle of the online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning is shown in Fig. 1: the device receives in real time the error e_t and error change rate Δe_t of the current environment, the MCMC-optimized Q-learning algorithm decides in real time the parameter-adjustment action a_n for the next moment, and the optimal control parameters for the current environment are obtained when the final value function of the Q-learning algorithm no longer changes. The MCMC optimization steps within the Q-learning algorithm are shown in Fig. 2. The method belongs to the field of online tuning of autonomous navigation device control parameters and adapts to the current environment by changing the control parameters of the device.
As shown in Fig. 3, the online adjustment method for the control parameters of the device proceeds as follows:
Step 1: the PID control parameters comprise the proportional parameter k_p, the integral parameter k_i and the derivative parameter k_d. The proportional parameter k_p speeds up the response of the system and improves its regulation accuracy, the integral parameter k_i eliminates the steady-state error of the system, and the derivative parameter k_d improves the dynamic characteristics of the system.
According to the control accuracy α of the device, the adjustment parameters Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d are obtained from formula (1).
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the device respectively.
For example, with α = 0.1, X_p ∈ [10, 20], X_i ∈ [1, 6] and X_d ∈ [1, 2], formula (1) gives the adjustment actions of Δk_p as: increase by 1, keep unchanged, or decrease by 1; the adjustment actions of Δk_i and Δk_d are obtained similarly.
Traditional autonomous navigation devices use the fixed-PID-parameter method. Because of the uncertainty of the environment, this method brings overshoot and delayed response during navigation, and the PID parameters must be modified manually to adapt to different environments. To address these problems, a Q-learning algorithm is introduced here to adjust the PID control parameters online in real time.
Q-learning is an intelligent learning algorithm proposed by Chris Watkins in 1989 that combines TD (temporal-difference) methods with dynamic programming; Watkins' work advanced the rapid development of reinforcement learning. Q-learning is a value-iteration reinforcement learning algorithm that is independent of any model of the real system; it usefully combines the theory of dynamic programming with the psychology of animal learning, and is used to solve sequential optimal decision problems with delayed rewards.
Step 2: in Q-learning the control parameters of the device are changed by decision-making; if the PID adjustment were split into three separate actions, the computational complexity of the Q-learning algorithm would increase. The adjustment parameters Δk_p, Δk_i and Δk_d are therefore combined to form the parameter-change action set of the device, denoted A = {a_1, a_2, ···, a_n, ···, a_N}, where a_n denotes the n-th control-parameter adjustment action in the set, a_n = (Δk_p^n, Δk_i^n, Δk_d^n); Δk_p^n denotes the proportional adjustment parameter of the n-th action, Δk_i^n the integral adjustment parameter and Δk_d^n the derivative adjustment parameter, n = 1, 2, ..., N. A sketch of this construction follows.
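Below is a minimal sketch of steps 1 and 2 in Python. The scalar step sizes used for X_p, X_i and X_d are illustrative assumptions taken from the example ranges above; the patent itself only fixes the three-way structure {+αX, 0, -αX} of formula (1) and the combination of the three adjustments into one action set.

    import itertools

    # Illustrative values only (not fixed by the patent): alpha = 0.1 and
    # scalar step magnitudes drawn from the example threshold ranges above.
    alpha, X_p, X_i, X_d = 0.1, 10.0, 5.0, 1.0

    # Formula (1): each adjustment parameter may increase, stay unchanged, or decrease.
    dk_p = [alpha * X_p, 0.0, -alpha * X_p]
    dk_i = [alpha * X_i, 0.0, -alpha * X_i]
    dk_d = [alpha * X_d, 0.0, -alpha * X_d]

    # Step 2: combine the three adjustment parameters into a single action set A,
    # so that one Q-learning decision adjusts all three PID parameters at once.
    A = list(itertools.product(dk_p, dk_i, dk_d))   # N = 27 actions
    print(len(A), A[0])                             # 27 (1.0, 0.5, 0.1)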
Step 3: set time t = 1 and randomly select one control-parameter adjustment action, applying it to the autonomous navigation device.
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1].
The learning factor l_t of the Q-learning algorithm changes over time t. In the early stage of Q-learning a large amount has to be learned from the sample data, so the initial learning factor l_t is a relatively large positive number; as t increases, the device no longer needs such a large learning value, so l_t gradually decreases. The discount factor γ controls how much the device weighs short-term against long-term results. Considering the two extremes: when γ = 0 the device only considers the reward of the current environment, and when γ = 1 it only looks at the reward of future moments. γ is therefore set according to the actual demands of the device; typically γ = 0.5 is taken so that the current moment and future moments are weighed together.
Initialize the three PID control parameters k_p, k_i and k_d from the control experience of the device; for example, this experimental system initially sets the three control parameters to k_p = 2.5, k_i = 0.5 and k_d = 0.2.
Initialize the value-function estimate of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the device at time t-1, Δe_{t-1} denotes the error change rate of the device at time t-1, and e_{t-1} together with Δe_{t-1} forms the environment state at time t-1.
At time t = 1 the value-function estimate is set to its initial value, with error e_{t-1} = 0 and error change rate Δe_{t-1} = 0.
In step 4, the Q-learning algorithm must not only let the device select the action with the maximum value, so as to obtain the maximum immediate reward; it must also let the device try different actions as much as possible, so that all actions are taken into account and the optimal policy can be obtained. If the device always selected the action with the highest value, the following drawback would arise: if the optimal strategy has not yet been acquired during the early, experience-gathering stage, it can never be obtained in the later learning stage.
Therefore an MCMC sampling algorithm is introduced into the Q-learning algorithm to decide the action chosen at each moment. The MCMC sampling algorithm samples the action transition matrix to obtain samples that follow the action probability distribution, so the action chosen at each moment can be sampled accurately even when the probability distribution is unknown.
According to the number N of control-parameter adjustment actions in the parameter-change action set A of the device, the transition matrix p_{nm}^{t-1} of the decision process in the Q-learning algorithm is initialized using formula (2).
In formula (2), p(a_m^{t-1} | a_n^{t-1}) denotes the probability of transferring at time t-1 from control-parameter adjustment action a_n^{t-1} to control-parameter adjustment action a_m^{t-1}; at t = 1 every transition probability is set to the same initial value.
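A minimal sketch of this initialization, under the assumption (suggested by the equiprobable starting distribution used in step 5.2) that at t = 1 every entry of the transition matrix is simply set to 1/N:

    import numpy as np

    N = 27                           # number of actions in A (3 x 3 x 3 combinations)
    # Assumed uniform initialization at t = 1: every action is equally likely
    # to be followed by every other action.
    P = np.full((N, N), 1.0 / N)
    assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution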
Step 5: use MCMC to optimize the Q-learning decision process and obtain the decision at time t.
Step 5.1: a BP neural network can approximate arbitrary nonlinear functions and plays an important role in handling large and continuous state spaces; the principle of solving the action value function with a BP neural network is shown in Fig. 4. The value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state is computed from formula (3).
In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, ..., nh; nh denotes the number of hidden nodes of the BP neural network; y_j(t-1) denotes the output of the j-th hidden node at time t-1 and is given by formula (4).
In formula (4), o_j(t-1) denotes the input of the j-th hidden node at time t-1 and is given by formula (5).
In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node of the BP neural network at time t-1, x_i(t-1) denotes the i-th input of the BP neural network at time t-1, i = 1, 2, ..., ni, and ni denotes the number of input nodes of the BP neural network.
For example, ni = 3 means the BP neural network has three input nodes, namely the error e_{t-1}, the error change rate Δe_{t-1} and the action input; nh = 5 means there are five hidden nodes. In general, more hidden nodes give higher accuracy but also higher computational complexity. At time t = 1 the hidden-layer weights are set to w_j(t-1) = 1, j = 1, 2, ..., nh, and the input-layer weights to w_ij(t-1) = 0.8, i = 1, 2, ..., ni.
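A minimal sketch of the forward pass of formulas (3)-(5), using the example sizes ni = 3 and nh = 5 and the t = 1 weights above; how the action is encoded as a single network input is an assumption, since the patent does not spell it out:

    import numpy as np

    ni, nh = 3, 5                          # input and hidden node counts (example values)
    w_in = np.full((ni, nh), 0.8)          # w_ij(t-1): input-to-hidden weights at t = 1
    w_hid = np.ones(nh)                    # w_j(t-1): hidden-layer weights at t = 1

    def q_value(e, de, action_code):
        """Q*(e_t, delta e_t, a_n^t) from formulas (3)-(5)."""
        x = np.array([e, de, action_code])        # x_i(t-1): the three network inputs
        o = x @ w_in                              # formula (5): hidden inputs o_j(t-1)
        y = (1 - np.exp(o)) / (1 + np.exp(o))     # formula (4): hidden outputs y_j(t-1)
        return float(w_hid @ y)                   # formula (3): weighted sum over hidden nodes

    print(q_value(e=0.4, de=0.1, action_code=0.0))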
Step 5.2: use the MCMC algorithm to sample the control-parameter adjustment action of the device at time t.
Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state and the action a_n^{t-1} chosen at time t-1, update the transition probability matrix p_{nm}^t of the decision process using formula (6).
In formula (6), Q_{nt}^* denotes the value of the n-th control-parameter adjustment action a_n^t at time t, i.e. Q_{nt}^* = Q*(e_t, Δe_t, a_n^t); ΣQ_{nt}^* denotes the sum of the values of all actions at time t, n = 1, 2, ..., N; p(a_m^t | a_n^t) denotes the probability of transferring at time t from the n-th control-parameter adjustment action a_n^t to the m-th control-parameter adjustment action a_m^t.
Step 5.2.2: set the sampling index c = 0, 1, 2, ..., C.
Step 5.2.3: draw the c-th sample from the transition probability matrix p_{nm}^t at time t, and obtain from formula (7) the acceptance rate α_{c+1}(a_n'^t, a_m'^t) of the (c+1)-th sample of the MCMC algorithm at time t.
In formula (7), p_c(a_m'^t) denotes the probability of the action a_m'^t obtained by the (c+1)-th sample at time t, and p_c(a_n'^t) denotes the probability of the action a_n'^t obtained by the c-th sample at time t; when c = 0, the probability distribution p_c(a_n'^t) of the action obtained by the c-th sample is set to the equiprobable distribution.
It can be seen from formula (7) that p_c(a_n'^t) and the transition probabilities at time t are fixed values, so the larger the probability of the action a_m'^t proposed by the (c+1)-th sample, the larger the acceptance rate, and conversely the smaller that probability, the smaller the acceptance rate.
Because the MCMC sampling algorithm obtains samples that follow the action probability distribution p(a_n) by sampling the action transition probability matrix p_{nm}^t, the action probability distribution can be set arbitrarily when sampling starts. At the start of sampling the probability distribution of the actions is set to the equiprobable distribution, so that every action of the device has the same probability of being sampled, which guarantees the correctness of the action sampled by the Q-learning algorithm at each moment.
Step 5.2.4: sample a random acceptance value u from the uniform distribution Uniform[0, 1] and compare u with the acceptance rate α_{c+1}(a_n'^t, a_m'^t); if the acceptance rate exceeds u, accept the action a_m'^t obtained by the (c+1)-th sample; otherwise do not accept it and assign a_n'^t to a_m'^t.
For example, with a random acceptance value u = 0.5: if the acceptance rate obtained from formula (7) is smaller than u, the sample is regarded as failed and the sampled action value stays unchanged; if the acceptance rate obtained from formula (7) is larger than u, the sample is regarded as successful and the sampled action value becomes a_m'^t.
Step 5.2.5: update from formula (8) the probability distribution p_{c+1}(a_n'^t) of the action obtained by the (c+1)-th sample at time t.
In formula (8), σ_c^t denotes the denominator of the probability distribution p_c(a_n'^t) of the action obtained by the c-th sample at time t, and d_{n,c}^t denotes its numerator; the values at c = 0 follow from the equiprobable initial distribution.
Step 5.2.6: assign c+1 to c and check whether c > C; if so, go to step 5.2.7, otherwise return to step 5.2.3.
Step 5.2.7: draw the (C+1)-th sample from the transition probability matrix p_{nm}^t at time t to obtain the control-parameter adjustment action of the device at time t, and take the value of that action as the value-function estimate at time t.
According to the MCMC algorithm, when the number of samples c reaches about 100 the action probability distribution has essentially become stationary, so C = 100 is generally set; the number of samples C can also be chosen according to the precision of the navigation system.
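A compact sketch of the sampling loop of steps 5.2.2-5.2.7 (formulas (7) and (8)). It assumes that the count-based update of formula (8) can be kept as per-action counters d with a shared denominator sigma; the variable names and this bookkeeping are illustrative rather than prescribed by the patent:

    import numpy as np

    rng = np.random.default_rng(0)

    def mcmc_sample_action(P, C=100):
        """Sample the next action index from transition matrix P (N x N), steps 5.2.2-5.2.7."""
        N = P.shape[0]
        d = np.ones(N)            # per-action counters: p_0 is the equiprobable distribution
        sigma = float(N)          # shared denominator, so p_0(a_n) = 1/N
        a_n = rng.integers(N)     # current sampled action
        for c in range(C + 1):
            a_m = rng.choice(N, p=P[a_n])                      # propose from the current action's row
            p_n, p_m = d[a_n] / sigma, d[a_m] / sigma
            # formula (7): acceptance rate of the proposed action
            accept = min(p_m * P[a_m, a_n] / (p_n * P[a_n, a_m]), 1.0)
            if rng.uniform() < accept:                         # step 5.2.4
                a_n = a_m                                      # accept the proposal
            d[a_n] += 1.0                                      # formula (8): raise the weight of
            sigma += 1.0                                       # the retained action
        return a_n

    # Usage with the uniform matrix from the sketch above:
    N = 27
    P = np.full((N, N), 1.0 / N)
    print(mcmc_sample_action(P))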
Step 6: obtain from formula (9) the reward of the control-parameter adjustment action of the device at time t.
In formula (9), α and β denote the error reward parameter and the error-change-rate reward parameter respectively, with 0 < α < 1, 0 < β < 1 and α + β = 1.
The reward describes how the device runs after the parameter-adjustment action at time t has been applied: if the returned environment state becomes worse, the reward is a negative number, representing punishment; if it improves, the reward is a positive number, representing reward; if it does not change, the reward is zero, representing holding. The environment state of the device consists of the error e_t and Δe_t, so the state-reward parameters α and β are introduced to weigh the influence of the two states according to their importance; typically α = 0.8 and β = 0.2 are set.
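Formula (9) itself is not reproduced in this text, so the sketch below only illustrates one plausible weighted form consistent with the description above (positive when the weighted error measure improves, negative when it worsens, zero when unchanged); it is an assumption, not the patented expression:

    def reward(e_prev, de_prev, e_now, de_now, alpha=0.8, beta=0.2):
        """Illustrative stand-in for formula (9): improvement of the weighted error
        measure gives a positive reward, worsening a negative one, no change zero."""
        before = alpha * abs(e_prev) + beta * abs(de_prev)
        after = alpha * abs(e_now) + beta * abs(de_now)
        return before - after        # > 0 reward, < 0 punishment, 0 holding

    print(reward(0.5, 0.2, 0.3, 0.1))   # error decreased -> positive reward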
Step 7: update from formula (10) the value-function estimate at time t-1 to the final value function at time t-1.
In formula (10), the final value-function difference term is given by formula (11).
Step 8: assign t+1 to t and check whether t > t_max; if so, go to step 9, otherwise adjust the learning factor l_t as t increases using the SPSA step-size adjustment algorithm of formula (12), where t_max denotes the preset maximum number of iterations.
In formula (12), l is the learning-factor value at time t = 1, and μ and λ are non-negative constants of the SPSA step-size adjustment algorithm.
Introducing the SPSA step-size adjustment algorithm makes the learning factor l_t of Q-learning change with a certain regularity; by setting the non-negative parameters μ and λ of the algorithm, the speed and interval of change of l_t are bounded, which makes the parameter adjustment of the device more accurate. Typically t_max = 30, μ = 0.3 and λ = 1.2 are set.
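Formulas (10)-(12) are likewise not reproduced in this text. The sketch below therefore only shows the standard temporal-difference shape that the description of steps 7-8 suggests, together with one common SPSA-style decaying step size; both expressions, and the starting value l0, are assumptions rather than the patented formulas:

    def learning_factor(t, l0=0.8, mu=0.3, lam=1.2):
        """Assumed SPSA-style decaying schedule for l_t: starts near l0 at t = 1
        and shrinks as t grows (the exact form of formula (12) is not shown above)."""
        return l0 / (t + lam) ** mu

    def td_update(q_prev, r_t, q_now, t, gamma=0.5):
        """Assumed temporal-difference form of formulas (10)-(11):
        Q(t-1) <- Q(t-1) + l_t * (r_t + gamma * Q(t) - Q(t-1))."""
        delta = r_t + gamma * q_now - q_prev       # value-function difference term
        return q_prev + learning_factor(t) * delta

    print(td_update(q_prev=0.2, r_t=0.1, q_now=0.4, t=3))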
Step 9: check whether the final value functions of two consecutive moments differ by less than ε; if so, the adjustment of the device's PID control parameters is finished, and jump to step 11; otherwise perform step 10.
ε is a small positive number used to judge whether the adjustment of the PID control parameters is finished, and is related to the control accuracy of the device. The smaller ε is, the higher the precision of autonomous navigation and the closer the obtained PID control parameters are to the optimal values; typically ε = 0.2 is set.
Step 10: check whether t exceeds the specified time; if so, jump to step 3, reselect an initial control-parameter adjustment action and adjust the PID control parameters of the device; otherwise jump to step 5 and continue adjusting the PID control parameters.
Step 11: set t = 1.
Step 12: the device collects the environment state e_t and Δe_t at time t and checks whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, perform step 13, otherwise return to step 11. Here e_min and Δe_min denote respectively the minimum environment-state error and error change rate that the device allows; for example, e_min = 0.1 and Δe_min = 0.05 are generally set.
Step 13: assign t+1 to t and check whether t > T holds; if so, perform step 3, otherwise return to step 12. Here T denotes the time constant with which the device adapts to the speed of environmental change.
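A minimal sketch of the monitoring loop of steps 11-13 run over a simulated error stream: re-tuning (a jump back to step 3) is triggered only after the error or its change rate stays outside the allowed band for more than T consecutive samples. The values e_min and Δe_min follow the example above, while T and the error stream are purely illustrative:

    e_min, de_min, T = 0.1, 0.05, 5
    errors = [0.02, 0.03, 0.25, 0.30, 0.28, 0.26, 0.27, 0.29]   # simulated e_t readings

    t, e_prev, retune = 1, 0.0, False
    for e in errors:
        de = e - e_prev                          # stand-in for the change rate delta e_t
        e_prev = e
        if abs(e) > e_min or abs(de) > de_min:   # step 12: error outside the allowed band
            t += 1                               # step 13
            if t > T:                            # outside the band for longer than T
                retune = True                    # -> jump back to step 3 and re-tune the PID
                break
        else:
            t = 1                                # step 11: environment acceptable, reset the clock
    print(retune)                                # True for this simulated stream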
Experimental results:
The method of this patent and the traditional fixed-PID-parameter method were applied to autonomous navigation devices at the same time, and several groups of comparison experiments were carried out, ensuring that in each experiment the two groups of devices set out simultaneously from the same starting point towards the same destination. Fig. 5 shows the comparison of the time consumed by the navigation process; Fig. 6, Fig. 7 and Fig. 8 show the comparison of the real-time error e_t during navigation.
In the time-consumption comparison, three groups of experiments were run, each repeated 50 times with the results averaged. The first group compares the arrival times of the two groups of devices when the current environment is stable; the second group compares the arrival times when the environment changes suddenly during navigation; the third group compares the arrival times after the environment has changed. As Fig. 5 shows, in the initially stable environment the PID control parameters used by the device with the fixed-PID-parameter method are close to the optimal parameters, so the time consumed is roughly the same as that of the device using the method of this patent. When the environment changes suddenly during navigation, the arrival times of both groups become longer, but the device using the method of this patent clearly consumes much less time than the device using the traditional method, and its additional time is spent mainly while the control parameters are being adjusted. After the environment has changed, the device using the method of this patent has already adjusted its control parameters to the optimal values for the current environment, so the time consumed returns to the same level as before the change, whereas the device using the traditional method no longer has optimal control parameters in the new environment, so the time it consumes keeps growing; when the environmental change is severe, the device using the traditional method may even fail to reach the specified destination.
In the comparison of the real-time error e_t, the same three groups of experiments were run, each repeated 50 times with the results averaged. Fig. 6 shows the comparison with the initial environment unchanged: the real-time error e_t of the two groups of devices evolves in roughly the same way. Fig. 7 shows the comparison when the environment changes suddenly in the 7th second of navigation: the real-time error e_t of both groups increases greatly at the sudden change, but for the device using the method of this patent e_t drops rapidly back towards 0 after a period of navigation-parameter adjustment, while for the device using the traditional method e_t cannot return to 0 and keeps fluctuating within an error band. Fig. 8 shows the comparison after the environment has changed: the real-time error e_t of the device using the method of this patent follows essentially the same pattern as before the change, while that of the device using the traditional method cannot return to 0 and keeps fluctuating within an error band.
Combining the two kinds of comparison results over the three groups of experiments shows that, compared with the traditional fixed-PID-parameter method, the method of this patent achieves better autonomous navigation in changeable environments and solves the overshoot and delayed-response problems caused by control parameters that are not optimal for the current environment.

Claims (1)

  1. An online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, characterized by comprising the following steps:
    Step 1: according to the control accuracy α of the autonomous navigation device, obtain from formula (1) the adjustment parameters Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d of the device:
    $$\Delta k_p=\begin{cases}\alpha X_p\\[2pt]0\\[2pt]-\alpha X_p\end{cases}\qquad \Delta k_i=\begin{cases}\alpha X_i\\[2pt]0\\[2pt]-\alpha X_i\end{cases}\qquad \Delta k_d=\begin{cases}\alpha X_d\\[2pt]0\\[2pt]-\alpha X_d\end{cases}\qquad(1)$$
    In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the device respectively;
    Step 2: combine the adjustment parameters Δk_p, Δk_i and Δk_d to obtain the parameter-change action set of the device, denoted A = {a_1, a_2, ···, a_n, ···, a_N}, where a_n denotes the n-th control-parameter adjustment action in the set, a_n = (Δk_p^n, Δk_i^n, Δk_d^n); Δk_p^n denotes the proportional adjustment parameter of the n-th action, Δk_i^n the integral adjustment parameter and Δk_d^n the derivative adjustment parameter, n = 1, 2, ..., N;
    Step 3: set time t = 1 and randomly select one control-parameter adjustment action, applying it to the autonomous navigation device;
    Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1];
    Initialize the three PID control parameters k_p, k_i and k_d from the control experience of the device;
    Initialize the value-function estimate of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the device at time t-1, Δe_{t-1} denotes the error change rate of the device at time t-1, and e_{t-1} together with Δe_{t-1} forms the environment state at time t-1;
    Step 4: according to the number N of control-parameter adjustment actions in the parameter-change action set A of the device, initialize the transition matrix p_{nm}^{t-1} of the decision process in the Q-learning algorithm using formula (2):
    $$p_{nm}^{t-1}=\begin{bmatrix}
    p(a_0^{t-1}\mid a_0^{t-1}) & p(a_1^{t-1}\mid a_0^{t-1}) & \cdots & p(a_m^{t-1}\mid a_0^{t-1}) & \cdots & p(a_N^{t-1}\mid a_0^{t-1})\\
    p(a_0^{t-1}\mid a_1^{t-1}) & p(a_1^{t-1}\mid a_1^{t-1}) & \cdots & p(a_m^{t-1}\mid a_1^{t-1}) & \cdots & p(a_N^{t-1}\mid a_1^{t-1})\\
    \vdots & \vdots & & \vdots & & \vdots\\
    p(a_0^{t-1}\mid a_n^{t-1}) & p(a_1^{t-1}\mid a_n^{t-1}) & \cdots & p(a_m^{t-1}\mid a_n^{t-1}) & \cdots & p(a_N^{t-1}\mid a_n^{t-1})\\
    \vdots & \vdots & & \vdots & & \vdots\\
    p(a_0^{t-1}\mid a_N^{t-1}) & p(a_1^{t-1}\mid a_N^{t-1}) & \cdots & p(a_m^{t-1}\mid a_N^{t-1}) & \cdots & p(a_N^{t-1}\mid a_N^{t-1})
    \end{bmatrix}\qquad(2)$$
    In formula (2), p(a_m^{t-1} | a_n^{t-1}) denotes the probability of transferring at time t-1 from control-parameter adjustment action a_n^{t-1} to control-parameter adjustment action a_m^{t-1}; at t = 1 every transition probability is set to the same initial value;
    Step 5: use MCMC to optimize the Q-learning decision process and obtain the decision at time t;
    Step 5.1: compute from formula (3) the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state:
    $$Q^{*}(e_t,\Delta e_t,a_n^{t})=\sum_{j=1}^{nh} w_j(t-1)\,y_j(t-1)\qquad(3)$$
    In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, ..., nh; nh denotes the number of hidden nodes of the BP neural network; y_j(t-1) denotes the output of the j-th hidden node of the BP neural network at time t-1, and:
    $$y_j(t-1)=\frac{1-e^{\,o_j(t-1)}}{1+e^{\,o_j(t-1)}}\qquad(4)$$
    In formula (4), o_j(t-1) denotes the input of the j-th hidden node of the BP neural network at time t-1, and:
    $$o_j(t-1)=\sum_{i=1}^{ni} w_{ij}(t-1)\,x_i(t-1)\qquad(5)$$
    In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node of the BP neural network at time t-1, x_i(t-1) denotes the i-th input of the BP neural network at time t-1, i = 1, 2, ..., ni, and ni denotes the number of input nodes of the BP neural network;
    Step 5.2: use the MCMC algorithm to sample the control-parameter adjustment action of the autonomous navigation device at time t;
    Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state and the action a_n^{t-1} chosen at time t-1, update the transition probability matrix p_{nm}^t of the decision process using formula (6):
    $$p_{nm}^{t}=\begin{bmatrix}
    p(a_0^{t}\mid a_0^{t}) & p(a_1^{t}\mid a_0^{t}) & \cdots & p(a_m^{t}\mid a_0^{t}) & \cdots & p(a_N^{t}\mid a_0^{t})\\
    p(a_0^{t}\mid a_1^{t}) & p(a_1^{t}\mid a_1^{t}) & \cdots & p(a_m^{t}\mid a_1^{t}) & \cdots & p(a_N^{t}\mid a_1^{t})\\
    \vdots & \vdots & & \vdots & & \vdots\\
    \dfrac{Q_{1t}^{*}}{\sum Q_{nt}^{*}} & \dfrac{Q_{2t}^{*}}{\sum Q_{nt}^{*}} & \cdots & \dfrac{Q_{mt}^{*}}{\sum Q_{nt}^{*}} & \cdots & \dfrac{Q_{Nt}^{*}}{\sum Q_{nt}^{*}}\\
    \vdots & \vdots & & \vdots & & \vdots\\
    p(a_0^{t}\mid a_n^{t}) & p(a_1^{t}\mid a_n^{t}) & \cdots & p(a_m^{t}\mid a_n^{t}) & \cdots & p(a_N^{t}\mid a_n^{t})\\
    \vdots & \vdots & & \vdots & & \vdots\\
    p(a_0^{t}\mid a_N^{t}) & p(a_1^{t}\mid a_N^{t}) & \cdots & p(a_m^{t}\mid a_N^{t}) & \cdots & p(a_N^{t}\mid a_N^{t})
    \end{bmatrix}\qquad(6)$$
    In formula (6),Represent n-th of control parameter regulation of t actionValue function value, i.e., Represent the summation of the value function value of t everything, n=1,2 ..., N;Represent t from n-th of control Parameter regulation action processedIt is transferred to m-th of control parameter regulation actionTransition probability;
Step 5.2.2, set the sampling number c = 0, 1, 2, …, C;
Step 5.2.3, carry out the (c+1)-th sampling of the transition probability matrix $p_{nm}^{t}$ of time t, and use formula (7) to obtain the acceptance rate $\alpha_{c+1}(a_n'^{t}, a_m'^{t})$ of the (c+1)-th sampling at time t in the MCMC algorithm:
$$
\alpha_{c+1}(a_n'^{t},a_m'^{t})=\min\left\{\frac{p_c(a_m'^{t})\times p(a_n'^{t}\mid a_m'^{t})}{p_c(a_n'^{t})\times p(a_m'^{t}\mid a_n'^{t})},\,1\right\}\tag{7}
$$
In formula (7), $p_c(a_m'^{t})$ denotes the probability of the action $a_m'^{t}$ obtained by the (c+1)-th sampling at time t, and $p_c(a_n'^{t})$ denotes the probability of the action $a_n'^{t}$ obtained by the c-th sampling at time t; when c = 0, the probability distribution $p_0(a_n'^{t})$ of the action obtained by the c-th sampling at time t is taken as the uniform (equal-probability) distribution, i.e. $p_0(a_n'^{t}) = 1/N$;
Step 5.2.4, draw a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate $\alpha_{c+1}(a_n'^{t}, a_m'^{t})$; if $u \le \alpha_{c+1}(a_n'^{t}, a_m'^{t})$, accept the action $a_m'^{t}$ obtained by the (c+1)-th sampling; otherwise, do not accept the action obtained by the (c+1)-th sampling and assign $a_n'^{t}$ to $a_m'^{t}$;
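A minimal sketch of one Metropolis–Hastings accept/reject decision of steps 5.2.3–5.2.4 using the acceptance rate of formula (7); `accept_step`, `p_c` and `P` are assumed names, with `P[n, m]` standing for $p(a_m^{t}\mid a_n^{t})$:

```python
import numpy as np

def accept_step(n, m, p_c, P, rng=None):
    """One accept/reject decision of steps 5.2.3-5.2.4.

    n, m : indices of the current action a'_n and the proposed action a'_m
    p_c  : current probability distribution p_c(.) over the actions
    P    : transition matrix of formula (6), P[i, j] = p(a_j | a_i)
    Returns the index of the action kept for the next iteration.
    """
    rng = rng or np.random.default_rng()
    # acceptance rate of formula (7)
    alpha = min(1.0, (p_c[m] * P[m, n]) / (p_c[n] * P[n, m]))
    u = rng.uniform(0.0, 1.0)            # random acceptance value u ~ Uniform[0, 1]
    return m if u <= alpha else n        # rejection keeps the current action a'_n
```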
Step 5.2.5, use formula (8) to update the probability distribution $p_{c+1}(a_n'^{t})$ of the action obtained by the (c+1)-th sampling at time t:
$$
p_{c+1}(a_n'^{t})=\begin{cases}\dfrac{d_{n,c}^{t}+1}{\sigma_c^{t}+1}, & a_n'^{t}=a_m'^{t}\\[2ex]\dfrac{d_{n,c}^{t}}{\sigma_c^{t}+1}, & a_n'^{t}\neq a_m'^{t}\end{cases}\tag{8}
$$
In formula (8), $\sigma_c^{t}$ denotes the denominator of the probability distribution $p_c(a_n'^{t})$ of the action obtained by the c-th sampling at time t, and $d_{n,c}^{t}$ denotes the numerator of that probability distribution; when c = 0, let $d_{n,0}^{t} = 1$ and $\sigma_0^{t} = N$, n = 1, 2, …, N, which yields the uniform initial distribution $p_0(a_n'^{t}) = 1/N$;
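The count-based update of formula (8) can be kept as one numerator per action plus a shared denominator; the sketch below uses assumed names (`d`, `sigma`, `accepted`) and the initialization d = 1, σ = N so that $p_0$ is uniform:

```python
import numpy as np

def update_distribution(d, sigma, accepted):
    """Probability update of formula (8).

    d        : array of numerators d^t_{n,c} (initialize to ones for c = 0)
    sigma    : denominator sigma^t_c (initialize to N for c = 0)
    accepted : index of the action kept at the (c+1)-th sampling
    Returns the updated (d, sigma, p_{c+1}).
    """
    d = d.copy()
    d[accepted] += 1                 # only the kept action gains a count
    sigma += 1                       # the denominator grows by one every sample
    return d, sigma, d / sigma       # p_{c+1}(a'_n) = d^t_{n,c+1} / sigma^t_{c+1}
```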
Step 5.2.6, assign c + 1 to c and judge whether c > C holds; if so, perform step 5.2.7; otherwise, return to step 5.2.3;
Step 5.2.7, carry out the (C+1)-th sampling of the transition probability matrix $p_{nm}^{t}$ of time t to obtain the control parameter adjustment action $a_n''^{t}$ of the autonomous navigation device at time t, and let the value function estimate $Q'(e_t, \Delta e_t, a_n''^{t})$ of time t be the value function value $Q^{*}$ of the control parameter adjustment action $a_n''^{t}$ of the autonomous navigation device at time t;
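Putting steps 5.2.2–5.2.7 together, the sketch below (reusing the assumed helpers `transition_matrix`, `accept_step` and `update_distribution` from the snippets above) runs the chain for C iterations and then draws one more sample as the action actually applied at time t; this is one interpretation of the procedure, not the patent's reference implementation:

```python
import numpy as np

def select_action(q_values, C, rng=None):
    """MCMC action selection of steps 5.2.2-5.2.7 (sketch)."""
    rng = rng or np.random.default_rng()
    N = len(q_values)
    P = transition_matrix(q_values)            # formula (6)
    d, sigma = np.ones(N), float(N)            # gives the uniform p_0(.) = 1/N
    p_c = d / sigma
    current = rng.integers(N)                  # arbitrary starting action
    for _ in range(C):                         # chain iterations (c = 0, ..., C in the patent)
        proposal = rng.choice(N, p=P[current])           # draw from p(. | a'_n)
        current = accept_step(current, proposal, p_c, P, rng)
        d, sigma, p_c = update_distribution(d, sigma, current)
    final = rng.choice(N, p=P[current])        # the (C+1)-th sampling: action a''_n
    return final, q_values[final]              # Q'(e_t, de_t, a''_n) = Q* of a''_n
```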
Step 6, use formula (9) to obtain the behavior return value $r(e_t, \Delta e_t, a_n''^{t})$ of the control parameter adjustment action $a_n''^{t}$ of the autonomous navigation device at time t:
$$
r(e_t,\Delta e_t,a_n''^{t})=\alpha\times(e_t-e_{t-1})+\beta\times(\Delta e_t-\Delta e_{t-1})\tag{9}
$$
In formula (9), α and β denote the error return parameter and the error-rate return parameter respectively, with 0 < α < 1, 0 < β < 1, and α + β = 1;
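A one-function sketch of the return value of formula (9); the parameter names and the example weights α = β = 0.5 are assumptions:

```python
def reward(e_t, de_t, e_prev, de_prev, alpha=0.5, beta=0.5):
    """Behavior return value r(e_t, de_t, a''_n) of formula (9).

    alpha, beta : error and error-rate return parameters,
                  0 < alpha, beta < 1 and alpha + beta = 1.
    """
    return alpha * (e_t - e_prev) + beta * (de_t - de_prev)
```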
Step 7, use formula (10) to update the value function estimate $Q'(e_{t-1}, \Delta e_{t-1}, a_n''^{t-1})$ of time t−1 into the final value function value $Q(e_{t-1}, \Delta e_{t-1}, a_n''^{t-1})$ of time t−1:
$$
Q(e_{t-1},\Delta e_{t-1},a_n''^{t-1})=Q'(e_{t-1},\Delta e_{t-1},a_n''^{t-1})+l_t\,\Delta Q(e_{t-1},\Delta e_{t-1},a_n''^{t-1})\tag{10}
$$
In formula (10), $\Delta Q(e_{t-1}, \Delta e_{t-1}, a_n''^{t-1})$ denotes the final value function difference, and:
$$
\Delta Q(e_{t-1},\Delta e_{t-1},a_n''^{t-1})=r(e_t,\Delta e_t,a_n''^{t})+\gamma Q'(e_t,\Delta e_t,a_n''^{t})-Q'(e_{t-1},\Delta e_{t-1},a_n''^{t-1})\tag{11}
$$
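Formulas (10) and (11) together form a temporal-difference style Q-learning update; the dictionary-backed sketch below assumes the state–action triples (e, Δe, a) have already been discretized into hashable keys, which the patent does not spell out here:

```python
def q_update(Q, key_prev, key_curr, r, l_t, gamma):
    """Q-learning update of formulas (10)-(11).

    Q        : dict mapping discretized (e, de, action) keys to Q values
    key_prev : (e_{t-1}, de_{t-1}, a''_{t-1})
    key_curr : (e_t, de_t, a''_t)
    r        : return value from formula (9)
    l_t      : learning factor from formula (12)
    gamma    : discount factor
    """
    delta = r + gamma * Q.get(key_curr, 0.0) - Q.get(key_prev, 0.0)   # formula (11)
    Q[key_prev] = Q.get(key_prev, 0.0) + l_t * delta                  # formula (10)
    return Q[key_prev]
```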
Step 8, assign t + 1 to t and judge whether t > $t_{max}$ holds; if so, perform step 9; otherwise, adjust the learning factor $l_t$ with formula (12) according to the SPSA step-size adjustment algorithm as t changes over time, where $t_{max}$ denotes the maximum number of iterations that has been set:
$$
l_t=\frac{1}{(t+\mu)^{\lambda}}\tag{12}
$$
In formula (12), $l_t$ is the learning factor value at time t, and μ and λ are nonnegative constants in the SPSA step-size adjustment algorithm;
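The decaying learning factor of formula (12) is a one-liner; the default values of μ and λ below are illustrative only:

```python
def learning_factor(t, mu=1.0, lam=0.6):
    """SPSA-style step-size schedule of formula (12): l_t = 1 / (t + mu)**lam."""
    return 1.0 / (t + mu) ** lam
```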
Step 9, judge whether the final value function values of two consecutive time instants, $Q(e_{t-1}, \Delta e_{t-1}, a_n''^{t-1})$ and $Q(e_t, \Delta e_t, a_n''^{t})$, are equal; if so, the adjustment of the PID control parameters of the autonomous navigation device is finished, and jump to step 11; otherwise, perform step 10;
Step 10, judge whether t exceeds the specified time; if it does, jump to step 3, reselect the initial control parameter adjustment action and adjust the PID control parameters of the autonomous navigation device; otherwise, jump to step 5 and continue the PID control parameter adjustment of the autonomous navigation device;
    Step 11, make t=1;
Step 12, the autonomous navigation device collects the environment states $e_t$ and $\Delta e_t$ at time t, and judges whether $|e_t| > |e_{min}|$ or $|\Delta e_t| > |\Delta e_{min}|$ holds; if so, perform step 13; otherwise, return to step 11; where $e_{min}$ and $\Delta e_{min}$ denote respectively the minimum environment state error and error rate allowed by the autonomous navigation device;
Step 13, assign t + 1 to t and judge whether t > T holds; if so, perform step 3; otherwise, return to step 12; where T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
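Steps 11–13 amount to a monitoring loop that restarts the tuning (from step 3) once the environment state error or error rate stays outside the allowed band for more than T consecutive samples; a hedged sketch in which `read_state` and `retune` are placeholder callbacks, not an interface defined by the patent:

```python
def monitor(read_state, retune, e_min, de_min, T):
    """Sketch of steps 11-13.

    read_state() : returns the current environment state (e_t, de_t)
    retune()     : re-runs the PID parameter adjustment from step 3 onwards
    """
    t = 1                                        # step 11
    while True:
        e_t, de_t = read_state()                 # step 12
        if abs(e_t) > abs(e_min) or abs(de_t) > abs(de_min):
            t += 1                               # step 13
            if t > T:
                retune()                         # environment changed: back to step 3
                t = 1
        else:
            t = 1                                # error within bounds: back to step 11
```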
CN201711144395.2A 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study Active CN107885086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711144395.2A CN107885086B (en) 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study

Publications (2)

Publication Number Publication Date
CN107885086A true CN107885086A (en) 2018-04-06
CN107885086B CN107885086B (en) 2019-10-25

Family

ID=61777810

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110208377A1 (en) * 2007-08-14 2011-08-25 Propeller Control Aps Efficiency optimizing propeller speed control for ships
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
CN105700526A (en) * 2016-01-13 2016-06-22 华北理工大学 On-line sequence limit learning machine method possessing autonomous learning capability
CN106950956A (en) * 2017-03-22 2017-07-14 合肥工业大学 The wheelpath forecasting system of fusional movement model and behavior cognitive model
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHRISTOPHE ANDRIEU 等: "An Introduction to MCMC for Machine Learning", 《MACHINE LEARNING》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710289A (en) * 2018-05-18 2018-10-26 厦门理工学院 A method of the relay base quality optimization based on modified SPSA
CN109696830A (en) * 2019-01-31 2019-04-30 天津大学 The reinforcement learning adaptive control method of small-sized depopulated helicopter
CN109696830B (en) * 2019-01-31 2021-12-03 天津大学 Reinforced learning self-adaptive control method of small unmanned helicopter
CN111830822A (en) * 2019-04-16 2020-10-27 罗伯特·博世有限公司 System for configuring interaction with environment
CN114237267A (en) * 2021-11-02 2022-03-25 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision auxiliary method based on reinforcement learning
CN114237267B (en) * 2021-11-02 2023-11-24 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision assisting method based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN107885086A (en) Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study
CN110427261A (en) A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN109828552B (en) Intermittent process fault monitoring and diagnosing method based on width learning system
CN104616060A (en) Method for predicating contamination severity of insulator based on BP neural network and fuzzy logic
CN103971160B (en) particle swarm optimization method based on complex network
CN106056127A (en) GPR (gaussian process regression) online soft measurement method with model updating
CN109218744B (en) A kind of adaptive UAV Video of bit rate based on DRL spreads transmission method
CN109214579B (en) BP neural network-based saline-alkali soil stability prediction method and system
CN112766603B (en) Traffic flow prediction method, system, computer equipment and storage medium
WO2023035727A1 (en) Industrial process soft-measurement method based on federated incremental stochastic configuration network
CN105843189A (en) Simplified simulation model based high efficient scheduling rule choosing method for use in semiconductor production lines
Hu et al. Adaptive exploration strategy with multi-attribute decision-making for reinforcement learning
CN111582567B (en) Wind power probability prediction method based on hierarchical integration
Mellios et al. A multivariate analysis of the daily water demand of Skiathos Island, Greece, implementing the artificial neuro-fuzzy inference system (ANFIS)
Liu et al. Accelerate mini-batch machine learning training with dynamic batch size fitting
Li et al. Hyper-parameter tuning of federated learning based on particle swarm optimization
Remmerswaal et al. Combined MPC and reinforcement learning for traffic signal control in urban traffic networks
Han et al. Multi-step prediction for the network traffic based on echo state network optimized by quantum-behaved fruit fly optimization algorithm
Li et al. Graph reinforcement learning-based cnn inference offloading in dynamic edge computing
Al-Lawati et al. Anytime minibatch with stale gradients
CN111796519B (en) Automatic control method of multi-input multi-output system based on extreme learning machine
CN109636609A (en) Stock recommended method and system based on two-way length memory models in short-term
Cui On asymptotics of t-type regression estimation in multiple linear model
Zhou et al. Decentralized adaptive optimal control for massive multi-agent systems using mean field game with self-organizing neural networks
Yin et al. FedSCS: Client selection for federated learning under system heterogeneity and client fairness with a Stackelberg game approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant