CN107885086A - Online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning - Google Patents

Online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning

Info

Publication number
CN107885086A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711144395.2A
Other languages
Chinese (zh)
Other versions
CN107885086B (en)
Inventor
夏娜
柴煜奇
杜华争
陈斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN201711144395.2A priority Critical patent/CN107885086B/en
Publication of CN107885086A publication Critical patent/CN107885086A/en
Application granted granted Critical
Publication of CN107885086B publication Critical patent/CN107885086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, comprising the following steps. First, the possible changes of the device's PID control parameters are enumerated according to the actual situation to obtain a set of parameter-adjustment actions, and the PID control parameters are initialized from the control experience of the device. Then one action is selected at random and applied to the device; from the value Q* of each action obtained by the Q-learning algorithm, an MCMC sampling algorithm draws the action to be taken at the next moment, and the learning factor l of the Q-learning algorithm is adjusted over time with an SPSA step-size adjustment algorithm. Finally, repeated adjustment of the control parameters yields the optimal control parameters for the current environment. The invention solves the overshoot and delay problems of an autonomous navigation device during navigation, so that the device adapts rapidly to changes in the environment and reaches its destination quickly and stably.

Description

Online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning
Technical field
The invention belongs to the field of online tuning of autonomous navigation device control parameters, and specifically relates to a method for adjusting the control parameters of an autonomous navigation device.
Background technology
Autonomous navigation means that the device is assigned a destination on the water surface, plans its own path, and reaches the destination through continuous self-adjustment. It has important application value in water-quality inspection, surface clean-up and similar tasks.
At present, traditional autonomous navigation devices use the fixed-PID-parameter method, in which the control parameters of the device are fixed values obtained from extensive engineering experience of autonomous navigation projects. When the fixed control parameters do not suit the current environment, autonomous navigation suffers from overshoot and delayed response; in particular, in changeable environments, fixed control parameters may respond well to individual environmental states but cannot satisfy all of them, and the control parameters must be changed manually when the environment changes, which is inconvenient for the use of the device.
Control-parameter adjustment for navigation devices is also often performed with fuzzy algorithms or annealing algorithms. These methods introduce a self-correcting mechanism for the control parameters to some extent, but because they are not themselves intelligent control algorithms, in changeable environments they still cannot quickly adjust the control parameters of the autonomous navigation device to the optimal values.
The content of the invention
To overcome the above shortcomings of the prior art, the present invention provides an online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, so as to solve the overshoot and time-delay problems of the device during navigation and to enable it to adapt rapidly to changes in the environment and reach its destination quickly and stably.
In order to achieve the above object, the technical solution adopted in the present invention is:
The online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning according to the present invention is characterized by comprising the following steps:
Step 1: according to the control accuracy α of the autonomous navigation device, obtain from formula (1) the adjustment parameters Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d of the device.
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the device respectively;
Step 2: combine the adjustment parameters Δk_p, Δk_i and Δk_d to obtain the parameter-change action set of the device, denoted A = {a_1, a_2, ···, a_n, ···, a_N}, where a_n denotes the n-th control-parameter adjustment action in the set, a_n = (Δk_p^n, Δk_i^n, Δk_d^n); Δk_p^n denotes the proportional adjustment parameter of the n-th action, Δk_i^n the integral adjustment parameter, and Δk_d^n the derivative adjustment parameter, n = 1, 2, ..., N;
Step 3: set time t = 1 and randomly select one control-parameter adjustment action, applying it to the autonomous navigation device;
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1];
Initialize the three PID control parameters k_p, k_i and k_d from the control experience of the device;
Initialize the value-function estimate of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the device at time t-1, Δe_{t-1} denotes the error change rate of the device at time t-1, and e_{t-1} together with Δe_{t-1} forms the environment state at time t-1;
Step 4: according to the number N of control-parameter adjustment actions in the parameter-change action set A of the device, initialize the transition matrix p_{nm}^{t-1} of the decision process in the Q-learning algorithm using formula (2).
In formula (2), p(a_m^{t-1} | a_n^{t-1}) denotes the probability of transferring at time t-1 from control-parameter adjustment action a_n^{t-1} to control-parameter adjustment action a_m^{t-1}; at t = 1 every transition probability is set to the same initial value;
Step 5: use MCMC to optimize the Q-learning decision process and obtain the decision at time t;
Step 5.1: compute from formula (3) the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state.
In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, ..., nh; nh denotes the number of hidden nodes of the BP neural network; y_j(t-1) denotes the output of the j-th hidden node at time t-1 and is given by formula (4).
In formula (4), o_j(t-1) denotes the input of the j-th hidden node at time t-1 and is given by formula (5).
In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node of the BP neural network at time t-1, x_i(t-1) denotes the i-th input of the BP neural network at time t-1, i = 1, 2, ..., ni, and ni denotes the number of input nodes of the BP neural network;
Step 5.2: use the MCMC algorithm to sample the control-parameter adjustment action of the device at time t.
Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state and the action a_n^{t-1} chosen at time t-1, update the transition probability matrix p_{nm}^t of the decision process using formula (6).
In formula (6), Q_{nt}^* denotes the value of the n-th control-parameter adjustment action a_n^t at time t, i.e. Q_{nt}^* = Q*(e_t, Δe_t, a_n^t); ΣQ_{nt}^* denotes the sum of the values of all actions at time t, n = 1, 2, ..., N; p(a_m^t | a_n^t) denotes the probability of transferring at time t from the n-th control-parameter adjustment action a_n^t to the m-th control-parameter adjustment action a_m^t;
Step 5.2.2: set the sampling index c = 0, 1, 2, ..., C.
Step 5.2.3: draw the c-th sample from the transition probability matrix p_{nm}^t at time t, and obtain from formula (7) the acceptance rate α_{c+1}(a_n'^t, a_m'^t) of the (c+1)-th sample of the MCMC algorithm at time t.
In formula (7), p_c(a_m'^t) denotes the probability of the action a_m'^t obtained by the (c+1)-th sample at time t, and p_c(a_n'^t) denotes the probability of the action a_n'^t obtained by the c-th sample at time t; when c = 0, the probability distribution p_c(a_n'^t) of the action obtained by the c-th sample is set to the equiprobable distribution.
Step 5.2.4: sample a random acceptance value u from the uniform distribution Uniform[0, 1] and compare u with the acceptance rate α_{c+1}(a_n'^t, a_m'^t); if the acceptance rate exceeds u, accept the action a_m'^t obtained by the (c+1)-th sample; otherwise do not accept it and assign a_n'^t to a_m'^t.
Step 5.2.5: update from formula (8) the probability distribution p_{c+1}(a_n'^t) of the action obtained by the (c+1)-th sample at time t.
In formula (8), σ_c^t denotes the denominator of the probability distribution p_c(a_n'^t) of the action obtained by the c-th sample at time t, and d_{n,c}^t denotes its numerator; the values at c = 0 follow from the equiprobable initial distribution.
Step 5.2.6: assign c+1 to c and check whether c > C; if so, go to step 5.2.7, otherwise return to step 5.2.3.
Step 5.2.7: draw the (C+1)-th sample from the transition probability matrix p_{nm}^t at time t to obtain the control-parameter adjustment action of the device at time t, and take the value of that action as the value-function estimate at time t;
Step 6: obtain from formula (9) the reward of the control-parameter adjustment action of the device at time t.
In formula (9), α and β denote the error reward parameter and the error-change-rate reward parameter respectively, with 0 < α < 1, 0 < β < 1 and α + β = 1;
Step 7: update from formula (10) the value-function estimate at time t-1 to the final value function at time t-1.
In formula (10), the final value-function difference term is given by formula (11);
Step 8: assign t+1 to t and check whether t > t_max; if so, go to step 9, otherwise adjust the learning factor l_t as t increases using the SPSA step-size adjustment algorithm of formula (12), where t_max denotes the preset maximum number of iterations.
In formula (12), l is the learning-factor value at time t = 1, and μ and λ are non-negative constants of the SPSA step-size adjustment algorithm;
Step 9: check whether the final value functions of two consecutive moments differ by less than the small positive number ε; if so, the adjustment of the device's PID control parameters is finished, and jump to step 11; otherwise perform step 10.
Step 10: check whether t exceeds the specified time; if so, jump to step 3, reselect an initial control-parameter adjustment action and adjust the PID control parameters of the device; otherwise jump to step 5 and continue adjusting the PID control parameters;
Step 11: set t = 1.
Step 12: the device collects the environment state e_t and Δe_t at time t and checks whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, perform step 13, otherwise return to step 11; here e_min and Δe_min denote respectively the minimum environment-state error and error change rate that the device allows.
Step 13: assign t+1 to t and check whether t > T holds; if so, perform step 3, otherwise return to step 12; here T denotes the time constant with which the device adapts to the speed of environmental change.
Compared with the prior art, beneficial effects of the present invention are:
1. The present invention uses a Q-learning algorithm to adjust the navigation control parameters of the device online, and introduces an MCMC sampling algorithm and an SPSA step-size adjustment algorithm into the Q-learning algorithm, so that during autonomous navigation the device adapts to changes in the environment and anticipates the navigation conditions of the next moment. This solves the overshoot and time-delay problems of the device, makes the navigation process smoother, and adjusts the parameters rapidly, particularly when the weather changes, giving the method broad application prospects in the field of autonomous navigation.
2. The invention introduces a Q-learning algorithm that associates the control effect of the device with the environment state. The reward fed back by the environment judges the quality of each parameter-adjustment action, so the parameter adjustment gradually approaches the direction of improvement. This solves the overshoot and delayed-response problems that the device encounters during navigation and quickly moves the control parameters to the optimal values as the environment changes, so that the device adapts rapidly.
3. The invention introduces an MCMC sampling algorithm into the traditional Q-learning algorithm for optimization. The policy for choosing the parameter-adjustment action at the current moment no longer simply takes the single action with the maximum action value; instead, the overall probability distribution is estimated through the transition probabilities between actions. This avoids being trapped in local optima when the Q-learning algorithm selects actions, and yields the optimal adjustment policy during navigation of the device.
4. The invention sets the action probability distribution at the initial sampling moment of the MCMC sampling algorithm to an equiprobable distribution, so that the sampling of actions is unbiased in the early stage of the algorithm. As the algorithm runs, the action probability distribution is updated after every sample and the probability of each sampled action is increased, which improves the correctness of the action sampled at every moment.
5. For the change of the learning factor l of the traditional Q-learning algorithm, the invention adopts an SPSA step-size adjustment algorithm. By setting the parameters of the SPSA step-size adjustment algorithm, the speed and interval of change of the learning factor l are bounded, so that l changes with a certain regularity during Q-learning and the parameter adjustment of the device becomes more accurate.
Brief description of the drawings
Fig. 1 is a block diagram of the principle of the online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning according to the present invention;
Fig. 2 shows the MCMC optimization steps within the Q-learning algorithm of the present invention;
Fig. 3 is a flow chart of the online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning according to the present invention;
Fig. 4 is a schematic diagram of solving the action value function with a BP neural network;
Fig. 5 compares the time consumed by the navigation process of the device under different experiments for the method of the invention and the traditional fixed-PID-parameter method;
Fig. 6 compares the real-time error e_t of the method of the invention and the traditional fixed-PID-parameter method when the environment does not change during navigation;
Fig. 7 compares the real-time error e_t of the two methods when the environment changes during navigation;
Fig. 8 compares the real-time error e_t of the two methods after the environment has changed.
Embodiment
In this embodiment, the principle of the online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning is shown in Fig. 1: the device receives in real time the error e_t and error change rate Δe_t of the current environment, the MCMC-optimized Q-learning algorithm decides in real time the parameter-adjustment action a_n for the next moment, and the optimal control parameters for the current environment are obtained when the final value function of the Q-learning algorithm no longer changes. The MCMC optimization steps within the Q-learning algorithm are shown in Fig. 2. The method belongs to the field of online tuning of autonomous navigation device control parameters and adapts to the current environment by changing the control parameters of the device.
As shown in Fig. 3, the online adjustment method for the control parameters of the device proceeds as follows:
Step 1: the PID control parameters comprise the proportional parameter k_p, the integral parameter k_i and the derivative parameter k_d. The proportional parameter k_p speeds up the response of the system and improves its regulation accuracy, the integral parameter k_i eliminates the steady-state error of the system, and the derivative parameter k_d improves the dynamic characteristics of the system.
According to the control accuracy α of the device, the adjustment parameters Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d are obtained from formula (1).
In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the device respectively.
For example, with α = 0.1, X_p ∈ [10, 20], X_i ∈ [1, 6] and X_d ∈ [1, 2], formula (1) gives the adjustment actions of Δk_p as: increase by 1, keep unchanged, or decrease by 1; the adjustment actions of Δk_i and Δk_d are obtained similarly.
Traditional autonomous navigation devices use the fixed-PID-parameter method. Because of the uncertainty of the environment, this method brings overshoot and delayed response during navigation, and the PID parameters must be modified manually to adapt to different environments. To address these problems, a Q-learning algorithm is introduced here to adjust the PID control parameters online in real time.
Q-learning is an intelligent learning algorithm proposed by Chris Watkins in 1989 that combines TD (temporal-difference) methods with dynamic programming; Watkins' work advanced the rapid development of reinforcement learning. Q-learning is a value-iteration reinforcement learning algorithm that is independent of any model of the real system; it usefully combines the theory of dynamic programming with the psychology of animal learning, and is used to solve sequential optimal decision problems with delayed rewards.
Step 2: in Q-learning the control parameters of the device are changed by decision-making; if the PID adjustment were split into three separate actions, the computational complexity of the Q-learning algorithm would increase. The adjustment parameters Δk_p, Δk_i and Δk_d are therefore combined to form the parameter-change action set of the device, denoted A = {a_1, a_2, ···, a_n, ···, a_N}, where a_n denotes the n-th control-parameter adjustment action in the set, a_n = (Δk_p^n, Δk_i^n, Δk_d^n); Δk_p^n denotes the proportional adjustment parameter of the n-th action, Δk_i^n the integral adjustment parameter and Δk_d^n the derivative adjustment parameter, n = 1, 2, ..., N. A sketch of this construction follows.
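Below is a minimal sketch of steps 1 and 2 in Python. The scalar step sizes used for X_p, X_i and X_d are illustrative assumptions taken from the example ranges above; the patent itself only fixes the three-way structure {+αX, 0, -αX} of formula (1) and the combination of the three adjustments into one action set.

    import itertools

    # Illustrative values only (not fixed by the patent): alpha = 0.1 and
    # scalar step magnitudes drawn from the example threshold ranges above.
    alpha, X_p, X_i, X_d = 0.1, 10.0, 5.0, 1.0

    # Formula (1): each adjustment parameter may increase, stay unchanged, or decrease.
    dk_p = [alpha * X_p, 0.0, -alpha * X_p]
    dk_i = [alpha * X_i, 0.0, -alpha * X_i]
    dk_d = [alpha * X_d, 0.0, -alpha * X_d]

    # Step 2: combine the three adjustment parameters into a single action set A,
    # so that one Q-learning decision adjusts all three PID parameters at once.
    A = list(itertools.product(dk_p, dk_i, dk_d))   # N = 27 actions
    print(len(A), A[0])                             # 27 (1.0, 0.5, 0.1)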
Step 3: set time t = 1 and randomly select one control-parameter adjustment action, applying it to the autonomous navigation device.
Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1].
The learning factor l_t of the Q-learning algorithm changes over time t. In the early stage of Q-learning a large amount has to be learned from the sample data, so the initial learning factor l_t is a relatively large positive number; as t increases, the device no longer needs such a large learning value, so l_t gradually decreases. The discount factor γ controls how much the device weighs short-term against long-term results. Considering the two extremes: when γ = 0 the device only considers the reward of the current environment, and when γ = 1 it only looks at the reward of future moments. γ is therefore set according to the actual demands of the device; typically γ = 0.5 is taken so that the current moment and future moments are weighed together.
Initialize the three PID control parameters k_p, k_i and k_d from the control experience of the device; for example, this experimental system initially sets the three control parameters to k_p = 2.5, k_i = 0.5 and k_d = 0.2.
Initialize the value-function estimate of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the device at time t-1, Δe_{t-1} denotes the error change rate of the device at time t-1, and e_{t-1} together with Δe_{t-1} forms the environment state at time t-1.
At time t = 1 the value-function estimate is set to its initial value, with error e_{t-1} = 0 and error change rate Δe_{t-1} = 0.
In step 4, the Q-learning algorithm must not only let the device select the action with the maximum value, so as to obtain the maximum immediate reward; it must also let the device try different actions as much as possible, so that all actions are taken into account and the optimal policy can be obtained. If the device always selected the action with the highest value, the following drawback would arise: if the optimal strategy has not yet been acquired during the early, experience-gathering stage, it can never be obtained in the later learning stage.
Therefore an MCMC sampling algorithm is introduced into the Q-learning algorithm to decide the action chosen at each moment. The MCMC sampling algorithm samples the action transition matrix to obtain samples that follow the action probability distribution, so the action chosen at each moment can be sampled accurately even when the probability distribution is unknown.
According to the number N of control-parameter adjustment actions in the parameter-change action set A of the device, the transition matrix p_{nm}^{t-1} of the decision process in the Q-learning algorithm is initialized using formula (2).
In formula (2), p(a_m^{t-1} | a_n^{t-1}) denotes the probability of transferring at time t-1 from control-parameter adjustment action a_n^{t-1} to control-parameter adjustment action a_m^{t-1}; at t = 1 every transition probability is set to the same initial value.
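A minimal sketch of this initialization, under the assumption (suggested by the equiprobable starting distribution used in step 5.2) that at t = 1 every entry of the transition matrix is simply set to 1/N:

    import numpy as np

    N = 27                           # number of actions in A (3 x 3 x 3 combinations)
    # Assumed uniform initialization at t = 1: every action is equally likely
    # to be followed by every other action.
    P = np.full((N, N), 1.0 / N)
    assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution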
Step 5: use MCMC to optimize the Q-learning decision process and obtain the decision at time t.
Step 5.1: a BP neural network can approximate arbitrary nonlinear functions and plays an important role in handling large and continuous state spaces; the principle of solving the action value function with a BP neural network is shown in Fig. 4. The value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state is computed from formula (3).
In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, ..., nh; nh denotes the number of hidden nodes of the BP neural network; y_j(t-1) denotes the output of the j-th hidden node at time t-1 and is given by formula (4).
In formula (4), o_j(t-1) denotes the input of the j-th hidden node at time t-1 and is given by formula (5).
In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node of the BP neural network at time t-1, x_i(t-1) denotes the i-th input of the BP neural network at time t-1, i = 1, 2, ..., ni, and ni denotes the number of input nodes of the BP neural network.
For example, ni = 3 means the BP neural network has three input nodes, namely the error e_{t-1}, the error change rate Δe_{t-1} and the action input; nh = 5 means there are five hidden nodes. In general, more hidden nodes give higher accuracy but also higher computational complexity. At time t = 1 the hidden-layer weights are set to w_j(t-1) = 1, j = 1, 2, ..., nh, and the input-layer weights to w_ij(t-1) = 0.8, i = 1, 2, ..., ni.
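A minimal sketch of the forward pass of formulas (3)-(5), using the example sizes ni = 3 and nh = 5 and the t = 1 weights above; how the action is encoded as a single network input is an assumption, since the patent does not spell it out:

    import numpy as np

    ni, nh = 3, 5                          # input and hidden node counts (example values)
    w_in = np.full((ni, nh), 0.8)          # w_ij(t-1): input-to-hidden weights at t = 1
    w_hid = np.ones(nh)                    # w_j(t-1): hidden-layer weights at t = 1

    def q_value(e, de, action_code):
        """Q*(e_t, delta e_t, a_n^t) from formulas (3)-(5)."""
        x = np.array([e, de, action_code])        # x_i(t-1): the three network inputs
        o = x @ w_in                              # formula (5): hidden inputs o_j(t-1)
        y = (1 - np.exp(o)) / (1 + np.exp(o))     # formula (4): hidden outputs y_j(t-1)
        return float(w_hid @ y)                   # formula (3): weighted sum over hidden nodes

    print(q_value(e=0.4, de=0.1, action_code=0.0))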
Step 5.2: use the MCMC algorithm to sample the control-parameter adjustment action of the device at time t.
Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state and the action a_n^{t-1} chosen at time t-1, update the transition probability matrix p_{nm}^t of the decision process using formula (6).
In formula (6), Q_{nt}^* denotes the value of the n-th control-parameter adjustment action a_n^t at time t, i.e. Q_{nt}^* = Q*(e_t, Δe_t, a_n^t); ΣQ_{nt}^* denotes the sum of the values of all actions at time t, n = 1, 2, ..., N; p(a_m^t | a_n^t) denotes the probability of transferring at time t from the n-th control-parameter adjustment action a_n^t to the m-th control-parameter adjustment action a_m^t.
Step 5.2.2: set the sampling index c = 0, 1, 2, ..., C.
Step 5.2.3: draw the c-th sample from the transition probability matrix p_{nm}^t at time t, and obtain from formula (7) the acceptance rate α_{c+1}(a_n'^t, a_m'^t) of the (c+1)-th sample of the MCMC algorithm at time t.
In formula (7), p_c(a_m'^t) denotes the probability of the action a_m'^t obtained by the (c+1)-th sample at time t, and p_c(a_n'^t) denotes the probability of the action a_n'^t obtained by the c-th sample at time t; when c = 0, the probability distribution p_c(a_n'^t) of the action obtained by the c-th sample is set to the equiprobable distribution.
It can be seen from formula (7) that p_c(a_n'^t) and the transition probabilities at time t are fixed values, so the larger the probability of the action a_m'^t proposed by the (c+1)-th sample, the larger the acceptance rate, and conversely the smaller that probability, the smaller the acceptance rate.
Because the MCMC sampling algorithm obtains samples that follow the action probability distribution p(a_n) by sampling the action transition probability matrix p_{nm}^t, the action probability distribution can be set arbitrarily when sampling starts. At the start of sampling the probability distribution of the actions is set to the equiprobable distribution, so that every action of the device has the same probability of being sampled, which guarantees the correctness of the action sampled by the Q-learning algorithm at each moment.
Step 5.2.4: sample a random acceptance value u from the uniform distribution Uniform[0, 1] and compare u with the acceptance rate α_{c+1}(a_n'^t, a_m'^t); if the acceptance rate exceeds u, accept the action a_m'^t obtained by the (c+1)-th sample; otherwise do not accept it and assign a_n'^t to a_m'^t.
For example, with a random acceptance value u = 0.5: if the acceptance rate obtained from formula (7) is smaller than u, the sample is regarded as failed and the sampled action value stays unchanged; if the acceptance rate obtained from formula (7) is larger than u, the sample is regarded as successful and the sampled action value becomes a_m'^t.
Step 5.2.5: update from formula (8) the probability distribution p_{c+1}(a_n'^t) of the action obtained by the (c+1)-th sample at time t.
In formula (8), σ_c^t denotes the denominator of the probability distribution p_c(a_n'^t) of the action obtained by the c-th sample at time t, and d_{n,c}^t denotes its numerator; the values at c = 0 follow from the equiprobable initial distribution.
Step 5.2.6: assign c+1 to c and check whether c > C; if so, go to step 5.2.7, otherwise return to step 5.2.3.
Step 5.2.7: draw the (C+1)-th sample from the transition probability matrix p_{nm}^t at time t to obtain the control-parameter adjustment action of the device at time t, and take the value of that action as the value-function estimate at time t.
According to the MCMC algorithm, when the number of samples c reaches about 100 the action probability distribution has essentially become stationary, so C = 100 is generally set; the number of samples C can also be chosen according to the precision of the navigation system.
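A compact sketch of the sampling loop of steps 5.2.2-5.2.7 (formulas (7) and (8)). It assumes that the count-based update of formula (8) can be kept as per-action counters d with a shared denominator sigma; the variable names and this bookkeeping are illustrative rather than prescribed by the patent:

    import numpy as np

    rng = np.random.default_rng(0)

    def mcmc_sample_action(P, C=100):
        """Sample the next action index from transition matrix P (N x N), steps 5.2.2-5.2.7."""
        N = P.shape[0]
        d = np.ones(N)            # per-action counters: p_0 is the equiprobable distribution
        sigma = float(N)          # shared denominator, so p_0(a_n) = 1/N
        a_n = rng.integers(N)     # current sampled action
        for c in range(C + 1):
            a_m = rng.choice(N, p=P[a_n])                      # propose from the current action's row
            p_n, p_m = d[a_n] / sigma, d[a_m] / sigma
            # formula (7): acceptance rate of the proposed action
            accept = min(p_m * P[a_m, a_n] / (p_n * P[a_n, a_m]), 1.0)
            if rng.uniform() < accept:                         # step 5.2.4
                a_n = a_m                                      # accept the proposal
            d[a_n] += 1.0                                      # formula (8): raise the weight of
            sigma += 1.0                                       # the retained action
        return a_n

    # Usage with the uniform matrix from the sketch above:
    N = 27
    P = np.full((N, N), 1.0 / N)
    print(mcmc_sample_action(P))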
Step 6: obtain from formula (9) the reward of the control-parameter adjustment action of the device at time t.
In formula (9), α and β denote the error reward parameter and the error-change-rate reward parameter respectively, with 0 < α < 1, 0 < β < 1 and α + β = 1.
The reward describes how the device runs after the parameter-adjustment action at time t has been applied: if the returned environment state becomes worse, the reward is a negative number, representing punishment; if it improves, the reward is a positive number, representing reward; if it does not change, the reward is zero, representing holding. The environment state of the device consists of the error e_t and Δe_t, so the state-reward parameters α and β are introduced to weigh the influence of the two states according to their importance; typically α = 0.8 and β = 0.2 are set.
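Formula (9) itself is not reproduced in this text, so the sketch below only illustrates one plausible weighted form consistent with the description above (positive when the weighted error measure improves, negative when it worsens, zero when unchanged); it is an assumption, not the patented expression:

    def reward(e_prev, de_prev, e_now, de_now, alpha=0.8, beta=0.2):
        """Illustrative stand-in for formula (9): improvement of the weighted error
        measure gives a positive reward, worsening a negative one, no change zero."""
        before = alpha * abs(e_prev) + beta * abs(de_prev)
        after = alpha * abs(e_now) + beta * abs(de_now)
        return before - after        # > 0 reward, < 0 punishment, 0 holding

    print(reward(0.5, 0.2, 0.3, 0.1))   # error decreased -> positive reward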
Step 7: update from formula (10) the value-function estimate at time t-1 to the final value function at time t-1.
In formula (10), the final value-function difference term is given by formula (11).
Step 8: assign t+1 to t and check whether t > t_max; if so, go to step 9, otherwise adjust the learning factor l_t as t increases using the SPSA step-size adjustment algorithm of formula (12), where t_max denotes the preset maximum number of iterations.
In formula (12), l is the learning-factor value at time t = 1, and μ and λ are non-negative constants of the SPSA step-size adjustment algorithm.
Introducing the SPSA step-size adjustment algorithm makes the learning factor l_t of Q-learning change with a certain regularity; by setting the non-negative parameters μ and λ of the algorithm, the speed and interval of change of l_t are bounded, which makes the parameter adjustment of the device more accurate. Typically t_max = 30, μ = 0.3 and λ = 1.2 are set.
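Formulas (10)-(12) are likewise not reproduced in this text. The sketch below therefore only shows the standard temporal-difference shape that the description of steps 7-8 suggests, together with one common SPSA-style decaying step size; both expressions, and the starting value l0, are assumptions rather than the patented formulas:

    def learning_factor(t, l0=0.8, mu=0.3, lam=1.2):
        """Assumed SPSA-style decaying schedule for l_t: starts near l0 at t = 1
        and shrinks as t grows (the exact form of formula (12) is not shown above)."""
        return l0 / (t + lam) ** mu

    def td_update(q_prev, r_t, q_now, t, gamma=0.5):
        """Assumed temporal-difference form of formulas (10)-(11):
        Q(t-1) <- Q(t-1) + l_t * (r_t + gamma * Q(t) - Q(t-1))."""
        delta = r_t + gamma * q_now - q_prev       # value-function difference term
        return q_prev + learning_factor(t) * delta

    print(td_update(q_prev=0.2, r_t=0.1, q_now=0.4, t=3))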
Step 9: check whether the final value functions of two consecutive moments differ by less than ε; if so, the adjustment of the device's PID control parameters is finished, and jump to step 11; otherwise perform step 10.
ε is a small positive number used to judge whether the adjustment of the PID control parameters is finished, and is related to the control accuracy of the device. The smaller ε is, the higher the precision of autonomous navigation and the closer the obtained PID control parameters are to the optimal values; typically ε = 0.2 is set.
Step 10: check whether t exceeds the specified time; if so, jump to step 3, reselect an initial control-parameter adjustment action and adjust the PID control parameters of the device; otherwise jump to step 5 and continue adjusting the PID control parameters.
Step 11: set t = 1.
Step 12: the device collects the environment state e_t and Δe_t at time t and checks whether |e_t| > |e_min| or |Δe_t| > |Δe_min| holds; if so, perform step 13, otherwise return to step 11. Here e_min and Δe_min denote respectively the minimum environment-state error and error change rate that the device allows; for example, e_min = 0.1 and Δe_min = 0.05 are generally set.
Step 13: assign t+1 to t and check whether t > T holds; if so, perform step 3, otherwise return to step 12. Here T denotes the time constant with which the device adapts to the speed of environmental change.
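A minimal sketch of the monitoring loop of steps 11-13 run over a simulated error stream: re-tuning (a jump back to step 3) is triggered only after the error or its change rate stays outside the allowed band for more than T consecutive samples. The values e_min and Δe_min follow the example above, while T and the error stream are purely illustrative:

    e_min, de_min, T = 0.1, 0.05, 5
    errors = [0.02, 0.03, 0.25, 0.30, 0.28, 0.26, 0.27, 0.29]   # simulated e_t readings

    t, e_prev, retune = 1, 0.0, False
    for e in errors:
        de = e - e_prev                          # stand-in for the change rate delta e_t
        e_prev = e
        if abs(e) > e_min or abs(de) > de_min:   # step 12: error outside the allowed band
            t += 1                               # step 13
            if t > T:                            # outside the band for longer than T
                retune = True                    # -> jump back to step 3 and re-tune the PID
                break
        else:
            t = 1                                # step 11: environment acceptable, reset the clock
    print(retune)                                # True for this simulated stream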
Experimental results:
The method of this patent and the traditional fixed-PID-parameter method were applied to autonomous navigation devices at the same time, and several groups of comparison experiments were carried out, ensuring that in each experiment the two groups of devices set out simultaneously from the same starting point towards the same destination. Fig. 5 shows the comparison of the time consumed by the navigation process; Fig. 6, Fig. 7 and Fig. 8 show the comparison of the real-time error e_t during navigation.
In the time-consumption comparison, three groups of experiments were run, each repeated 50 times with the results averaged. The first group compares the arrival times of the two groups of devices when the current environment is stable; the second group compares the arrival times when the environment changes suddenly during navigation; the third group compares the arrival times after the environment has changed. As Fig. 5 shows, in the initially stable environment the PID control parameters used by the device with the fixed-PID-parameter method are close to the optimal parameters, so the time consumed is roughly the same as that of the device using the method of this patent. When the environment changes suddenly during navigation, the arrival times of both groups become longer, but the device using the method of this patent clearly consumes much less time than the device using the traditional method, and its additional time is spent mainly while the control parameters are being adjusted. After the environment has changed, the device using the method of this patent has already adjusted its control parameters to the optimal values for the current environment, so the time consumed returns to the same level as before the change, whereas the device using the traditional method no longer has optimal control parameters in the new environment, so the time it consumes keeps growing; when the environmental change is severe, the device using the traditional method may even fail to reach the specified destination.
In the comparison of the real-time error e_t, the same three groups of experiments were run, each repeated 50 times with the results averaged. Fig. 6 shows the comparison with the initial environment unchanged: the real-time error e_t of the two groups of devices evolves in roughly the same way. Fig. 7 shows the comparison when the environment changes suddenly in the 7th second of navigation: the real-time error e_t of both groups increases greatly at the sudden change, but for the device using the method of this patent e_t drops rapidly back towards 0 after a period of navigation-parameter adjustment, while for the device using the traditional method e_t cannot return to 0 and keeps fluctuating within an error band. Fig. 8 shows the comparison after the environment has changed: the real-time error e_t of the device using the method of this patent follows essentially the same pattern as before the change, while that of the device using the traditional method cannot return to 0 and keeps fluctuating within an error band.
Combining the two kinds of comparison results over the three groups of experiments shows that, compared with the traditional fixed-PID-parameter method, the method of this patent achieves better autonomous navigation in changeable environments and solves the overshoot and delayed-response problems caused by control parameters that are not optimal for the current environment.

Claims (1)

  1. An online adjustment method for autonomous navigation device control parameters based on MCMC-optimized Q-learning, characterized by comprising the following steps:
    Step 1: according to the control accuracy α of the autonomous navigation device, obtain from formula (1) the adjustment parameters Δk_p, Δk_i and Δk_d of the three PID control parameters k_p, k_i and k_d of the device:
    $$\Delta k_p=\begin{cases}\alpha X_p\\[2pt]0\\[2pt]-\alpha X_p\end{cases}\qquad \Delta k_i=\begin{cases}\alpha X_i\\[2pt]0\\[2pt]-\alpha X_i\end{cases}\qquad \Delta k_d=\begin{cases}\alpha X_d\\[2pt]0\\[2pt]-\alpha X_d\end{cases}\qquad(1)$$
    In formula (1), X_p, X_i and X_d denote the threshold ranges of the three PID control parameters k_p, k_i and k_d of the device respectively;
    Step 2: combine the adjustment parameters Δk_p, Δk_i and Δk_d to obtain the parameter-change action set of the device, denoted A = {a_1, a_2, ···, a_n, ···, a_N}, where a_n denotes the n-th control-parameter adjustment action in the set, a_n = (Δk_p^n, Δk_i^n, Δk_d^n); Δk_p^n denotes the proportional adjustment parameter of the n-th action, Δk_i^n the integral adjustment parameter and Δk_d^n the derivative adjustment parameter, n = 1, 2, ..., N;
    Step 3: set time t = 1 and randomly select one control-parameter adjustment action, applying it to the autonomous navigation device;
    Initialize the relevant parameters of the Q-learning algorithm: the learning factor l_t at time t and the discount factor γ, with l_t > 0 and γ ∈ [0, 1];
    Initialize the three PID control parameters k_p, k_i and k_d from the control experience of the device;
    Initialize the value-function estimate of the Q-learning algorithm at time t-1, where e_{t-1} denotes the error of the device at time t-1, Δe_{t-1} denotes the error change rate of the device at time t-1, and e_{t-1} together with Δe_{t-1} forms the environment state at time t-1;
    Step 4: according to the number N of control-parameter adjustment actions in the parameter-change action set A of the device, initialize the transition matrix p_{nm}^{t-1} of the decision process in the Q-learning algorithm using formula (2):
    $$p_{nm}^{t-1}=\begin{bmatrix}
    p(a_0^{t-1}\mid a_0^{t-1}) & p(a_1^{t-1}\mid a_0^{t-1}) & \cdots & p(a_m^{t-1}\mid a_0^{t-1}) & \cdots & p(a_N^{t-1}\mid a_0^{t-1})\\
    p(a_0^{t-1}\mid a_1^{t-1}) & p(a_1^{t-1}\mid a_1^{t-1}) & \cdots & p(a_m^{t-1}\mid a_1^{t-1}) & \cdots & p(a_N^{t-1}\mid a_1^{t-1})\\
    \vdots & \vdots & & \vdots & & \vdots\\
    p(a_0^{t-1}\mid a_n^{t-1}) & p(a_1^{t-1}\mid a_n^{t-1}) & \cdots & p(a_m^{t-1}\mid a_n^{t-1}) & \cdots & p(a_N^{t-1}\mid a_n^{t-1})\\
    \vdots & \vdots & & \vdots & & \vdots\\
    p(a_0^{t-1}\mid a_N^{t-1}) & p(a_1^{t-1}\mid a_N^{t-1}) & \cdots & p(a_m^{t-1}\mid a_N^{t-1}) & \cdots & p(a_N^{t-1}\mid a_N^{t-1})
    \end{bmatrix}\qquad(2)$$
    In formula (2), p(a_m^{t-1} | a_n^{t-1}) denotes the probability of transferring at time t-1 from control-parameter adjustment action a_n^{t-1} to control-parameter adjustment action a_m^{t-1}; at t = 1 every transition probability is set to the same initial value;
    Step 5: use MCMC to optimize the Q-learning decision process and obtain the decision at time t;
    Step 5.1: compute from formula (3) the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state:
    $$Q^{*}(e_t,\Delta e_t,a_n^{t})=\sum_{j=1}^{nh} w_j(t-1)\,y_j(t-1)\qquad(3)$$
    In formula (3), w_j(t-1) denotes the weight of the j-th hidden node of the BP neural network at time t-1, j = 1, 2, ..., nh; nh denotes the number of hidden nodes of the BP neural network; y_j(t-1) denotes the output of the j-th hidden node of the BP neural network at time t-1, and:
    $$y_j(t-1)=\frac{1-e^{\,o_j(t-1)}}{1+e^{\,o_j(t-1)}}\qquad(4)$$
    In formula (4), o_j(t-1) denotes the input of the j-th hidden node of the BP neural network at time t-1, and:
    $$o_j(t-1)=\sum_{i=1}^{ni} w_{ij}(t-1)\,x_i(t-1)\qquad(5)$$
    In formula (5), w_ij(t-1) denotes the weight from the i-th input node to the j-th hidden node of the BP neural network at time t-1, x_i(t-1) denotes the i-th input of the BP neural network at time t-1, i = 1, 2, ..., ni, and ni denotes the number of input nodes of the BP neural network;
    Step 5.2: use the MCMC algorithm to sample the control-parameter adjustment action of the autonomous navigation device at time t;
    Step 5.2.1: according to the value Q*(e_t, Δe_t, a_n^t) of the n-th control-parameter adjustment action a_n^t at time t under the environment state and the action a_n^{t-1} chosen at time t-1, update the transition probability matrix p_{nm}^t of the decision process using formula (6):
    $$p_{nm}^{t}=\begin{bmatrix}
    p(a_0^{t}\mid a_0^{t}) & p(a_1^{t}\mid a_0^{t}) & \cdots & p(a_m^{t}\mid a_0^{t}) & \cdots & p(a_N^{t}\mid a_0^{t})\\
    p(a_0^{t}\mid a_1^{t}) & p(a_1^{t}\mid a_1^{t}) & \cdots & p(a_m^{t}\mid a_1^{t}) & \cdots & p(a_N^{t}\mid a_1^{t})\\
    \vdots & \vdots & & \vdots & & \vdots\\
    \dfrac{Q_{1t}^{*}}{\sum Q_{nt}^{*}} & \dfrac{Q_{2t}^{*}}{\sum Q_{nt}^{*}} & \cdots & \dfrac{Q_{mt}^{*}}{\sum Q_{nt}^{*}} & \cdots & \dfrac{Q_{Nt}^{*}}{\sum Q_{nt}^{*}}\\
    \vdots & \vdots & & \vdots & & \vdots\\
    p(a_0^{t}\mid a_n^{t}) & p(a_1^{t}\mid a_n^{t}) & \cdots & p(a_m^{t}\mid a_n^{t}) & \cdots & p(a_N^{t}\mid a_n^{t})\\
    \vdots & \vdots & & \vdots & & \vdots\\
    p(a_0^{t}\mid a_N^{t}) & p(a_1^{t}\mid a_N^{t}) & \cdots & p(a_m^{t}\mid a_N^{t}) & \cdots & p(a_N^{t}\mid a_N^{t})
    \end{bmatrix}\qquad(6)$$
    In formula (6),Represent n-th of control parameter regulation of t actionValue function value, i.e., Represent the summation of the value function value of t everything, n=1,2 ..., N;Represent t from n-th of control Parameter regulation action processedIt is transferred to m-th of control parameter regulation actionTransition probability;
Step 5.2.2, set the sampling number c = 0, 1, 2, …, C;
Step 5.2.3, carry out the (c+1)-th sampling of the transition probability matrix $p_{nm}^{t}$ of time t, and use formula (7) to obtain the acceptance rate $\alpha_{c+1}(a_n'^{t}, a_m'^{t})$ of the (c+1)-th sampling at time t in the MCMC algorithm:
$$
\alpha_{c+1}(a_n'^{t},a_m'^{t})=\min\left\{\frac{p_c(a_m'^{t})\times p(a_n'^{t}\mid a_m'^{t})}{p_c(a_n'^{t})\times p(a_m'^{t}\mid a_n'^{t})},\,1\right\}\tag{7}
$$
In formula (7), $p_c(a_m'^{t})$ denotes the probability of the action $a_m'^{t}$ obtained by the (c+1)-th sampling at time t, and $p_c(a_n'^{t})$ denotes the probability of the action $a_n'^{t}$ obtained by the c-th sampling at time t; when c = 0, the probability distribution $p_0(a_n'^{t})$ of the action obtained by the c-th sampling at time t is taken as the uniform (equal-probability) distribution, i.e. $p_0(a_n'^{t}) = 1/N$;
Step 5.2.4, draw a random acceptance value u from the uniform distribution Uniform[0, 1] and compare it with the acceptance rate $\alpha_{c+1}(a_n'^{t}, a_m'^{t})$; if $u \le \alpha_{c+1}(a_n'^{t}, a_m'^{t})$, accept the action $a_m'^{t}$ obtained by the (c+1)-th sampling; otherwise, do not accept the action obtained by the (c+1)-th sampling and assign $a_n'^{t}$ to $a_m'^{t}$;
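A minimal sketch of one Metropolis–Hastings accept/reject decision of steps 5.2.3–5.2.4 using the acceptance rate of formula (7); `accept_step`, `p_c` and `P` are assumed names, with `P[n, m]` standing for $p(a_m^{t}\mid a_n^{t})$:

```python
import numpy as np

def accept_step(n, m, p_c, P, rng=None):
    """One accept/reject decision of steps 5.2.3-5.2.4.

    n, m : indices of the current action a'_n and the proposed action a'_m
    p_c  : current probability distribution p_c(.) over the actions
    P    : transition matrix of formula (6), P[i, j] = p(a_j | a_i)
    Returns the index of the action kept for the next iteration.
    """
    rng = rng or np.random.default_rng()
    # acceptance rate of formula (7)
    alpha = min(1.0, (p_c[m] * P[m, n]) / (p_c[n] * P[n, m]))
    u = rng.uniform(0.0, 1.0)            # random acceptance value u ~ Uniform[0, 1]
    return m if u <= alpha else n        # rejection keeps the current action a'_n
```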
Step 5.2.5, use formula (8) to update the probability distribution $p_{c+1}(a_n'^{t})$ of the action obtained by the (c+1)-th sampling at time t:
$$
p_{c+1}(a_n'^{t})=\begin{cases}\dfrac{d_{n,c}^{t}+1}{\sigma_c^{t}+1}, & a_n'^{t}=a_m'^{t}\\[2ex]\dfrac{d_{n,c}^{t}}{\sigma_c^{t}+1}, & a_n'^{t}\neq a_m'^{t}\end{cases}\tag{8}
$$
In formula (8), $\sigma_c^{t}$ denotes the denominator of the probability distribution $p_c(a_n'^{t})$ of the action obtained by the c-th sampling at time t, and $d_{n,c}^{t}$ denotes the numerator of that probability distribution; when c = 0, let $d_{n,0}^{t} = 1$ and $\sigma_0^{t} = N$, n = 1, 2, …, N, which yields the uniform initial distribution $p_0(a_n'^{t}) = 1/N$;
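The count-based update of formula (8) can be kept as one numerator per action plus a shared denominator; the sketch below uses assumed names (`d`, `sigma`, `accepted`) and the initialization d = 1, σ = N so that $p_0$ is uniform:

```python
import numpy as np

def update_distribution(d, sigma, accepted):
    """Probability update of formula (8).

    d        : array of numerators d^t_{n,c} (initialize to ones for c = 0)
    sigma    : denominator sigma^t_c (initialize to N for c = 0)
    accepted : index of the action kept at the (c+1)-th sampling
    Returns the updated (d, sigma, p_{c+1}).
    """
    d = d.copy()
    d[accepted] += 1                 # only the kept action gains a count
    sigma += 1                       # the denominator grows by one every sample
    return d, sigma, d / sigma       # p_{c+1}(a'_n) = d^t_{n,c+1} / sigma^t_{c+1}
```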
Step 5.2.6, assign c + 1 to c and judge whether c > C holds; if so, perform step 5.2.7; otherwise, return to step 5.2.3;
Step 5.2.7, carry out the (C+1)-th sampling of the transition probability matrix $p_{nm}^{t}$ of time t to obtain the control parameter adjustment action $a_n''^{t}$ of the autonomous navigation device at time t, and let the value function estimate $Q'(e_t, \Delta e_t, a_n''^{t})$ of time t be the value function value $Q^{*}$ of the control parameter adjustment action $a_n''^{t}$ of the autonomous navigation device at time t;
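Putting steps 5.2.2–5.2.7 together, the sketch below (reusing the assumed helpers `transition_matrix`, `accept_step` and `update_distribution` from the snippets above) runs the chain for C iterations and then draws one more sample as the action actually applied at time t; this is one interpretation of the procedure, not the patent's reference implementation:

```python
import numpy as np

def select_action(q_values, C, rng=None):
    """MCMC action selection of steps 5.2.2-5.2.7 (sketch)."""
    rng = rng or np.random.default_rng()
    N = len(q_values)
    P = transition_matrix(q_values)            # formula (6)
    d, sigma = np.ones(N), float(N)            # gives the uniform p_0(.) = 1/N
    p_c = d / sigma
    current = rng.integers(N)                  # arbitrary starting action
    for _ in range(C):                         # chain iterations (c = 0, ..., C in the patent)
        proposal = rng.choice(N, p=P[current])           # draw from p(. | a'_n)
        current = accept_step(current, proposal, p_c, P, rng)
        d, sigma, p_c = update_distribution(d, sigma, current)
    final = rng.choice(N, p=P[current])        # the (C+1)-th sampling: action a''_n
    return final, q_values[final]              # Q'(e_t, de_t, a''_n) = Q* of a''_n
```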
Step 6, use formula (9) to obtain the behavior return value $r(e_t, \Delta e_t, a_n''^{t})$ of the control parameter adjustment action $a_n''^{t}$ of the autonomous navigation device at time t:
$$
r(e_t,\Delta e_t,a_n''^{t})=\alpha\times(e_t-e_{t-1})+\beta\times(\Delta e_t-\Delta e_{t-1})\tag{9}
$$
In formula (9), α and β denote the error return parameter and the error-rate return parameter respectively, with 0 < α < 1, 0 < β < 1, and α + β = 1;
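A one-function sketch of the return value of formula (9); the parameter names and the example weights α = β = 0.5 are assumptions:

```python
def reward(e_t, de_t, e_prev, de_prev, alpha=0.5, beta=0.5):
    """Behavior return value r(e_t, de_t, a''_n) of formula (9).

    alpha, beta : error and error-rate return parameters,
                  0 < alpha, beta < 1 and alpha + beta = 1.
    """
    return alpha * (e_t - e_prev) + beta * (de_t - de_prev)
```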
Step 7, use formula (10) to update the value function estimate $Q'(e_{t-1}, \Delta e_{t-1}, a_n''^{t-1})$ of time t−1 into the final value function value $Q(e_{t-1}, \Delta e_{t-1}, a_n''^{t-1})$ of time t−1:
$$
Q(e_{t-1},\Delta e_{t-1},a_n''^{t-1})=Q'(e_{t-1},\Delta e_{t-1},a_n''^{t-1})+l_t\,\Delta Q(e_{t-1},\Delta e_{t-1},a_n''^{t-1})\tag{10}
$$
In formula (10), $\Delta Q(e_{t-1}, \Delta e_{t-1}, a_n''^{t-1})$ denotes the final value function difference, and:
$$
\Delta Q(e_{t-1},\Delta e_{t-1},a_n''^{t-1})=r(e_t,\Delta e_t,a_n''^{t})+\gamma Q'(e_t,\Delta e_t,a_n''^{t})-Q'(e_{t-1},\Delta e_{t-1},a_n''^{t-1})\tag{11}
$$
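Formulas (10) and (11) together form a temporal-difference style Q-learning update; the dictionary-backed sketch below assumes the state–action triples (e, Δe, a) have already been discretized into hashable keys, which the patent does not spell out here:

```python
def q_update(Q, key_prev, key_curr, r, l_t, gamma):
    """Q-learning update of formulas (10)-(11).

    Q        : dict mapping discretized (e, de, action) keys to Q values
    key_prev : (e_{t-1}, de_{t-1}, a''_{t-1})
    key_curr : (e_t, de_t, a''_t)
    r        : return value from formula (9)
    l_t      : learning factor from formula (12)
    gamma    : discount factor
    """
    delta = r + gamma * Q.get(key_curr, 0.0) - Q.get(key_prev, 0.0)   # formula (11)
    Q[key_prev] = Q.get(key_prev, 0.0) + l_t * delta                  # formula (10)
    return Q[key_prev]
```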
Step 8, assign t + 1 to t and judge whether t > $t_{max}$ holds; if so, perform step 9; otherwise, adjust the learning factor $l_t$ with formula (12) according to the SPSA step-size adjustment algorithm as t changes over time, where $t_{max}$ denotes the maximum number of iterations that has been set:
$$
l_t=\frac{1}{(t+\mu)^{\lambda}}\tag{12}
$$
In formula (12), $l_t$ is the learning factor value at time t, and μ and λ are nonnegative constants in the SPSA step-size adjustment algorithm;
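The decaying learning factor of formula (12) is a one-liner; the default values of μ and λ below are illustrative only:

```python
def learning_factor(t, mu=1.0, lam=0.6):
    """SPSA-style step-size schedule of formula (12): l_t = 1 / (t + mu)**lam."""
    return 1.0 / (t + mu) ** lam
```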
Step 9, judge whether the final value function values of two consecutive time instants, $Q(e_{t-1}, \Delta e_{t-1}, a_n''^{t-1})$ and $Q(e_t, \Delta e_t, a_n''^{t})$, are equal; if so, the adjustment of the PID control parameters of the autonomous navigation device is finished, and jump to step 11; otherwise, perform step 10;
Step 10, judge whether t exceeds the specified time; if it does, jump to step 3, reselect the initial control parameter adjustment action and adjust the PID control parameters of the autonomous navigation device; otherwise, jump to step 5 and continue the PID control parameter adjustment of the autonomous navigation device;
    Step 11, make t=1;
Step 12, the autonomous navigation device collects the environment states $e_t$ and $\Delta e_t$ at time t, and judges whether $|e_t| > |e_{min}|$ or $|\Delta e_t| > |\Delta e_{min}|$ holds; if so, perform step 13; otherwise, return to step 11; where $e_{min}$ and $\Delta e_{min}$ denote respectively the minimum environment state error and error rate allowed by the autonomous navigation device;
Step 13, assign t + 1 to t and judge whether t > T holds; if so, perform step 3; otherwise, return to step 12; where T denotes the time constant with which the autonomous navigation device adapts to the speed of environmental change.
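Steps 11–13 amount to a monitoring loop that restarts the tuning (from step 3) once the environment state error or error rate stays outside the allowed band for more than T consecutive samples; a hedged sketch in which `read_state` and `retune` are placeholder callbacks, not an interface defined by the patent:

```python
def monitor(read_state, retune, e_min, de_min, T):
    """Sketch of steps 11-13.

    read_state() : returns the current environment state (e_t, de_t)
    retune()     : re-runs the PID parameter adjustment from step 3 onwards
    """
    t = 1                                        # step 11
    while True:
        e_t, de_t = read_state()                 # step 12
        if abs(e_t) > abs(e_min) or abs(de_t) > abs(de_min):
            t += 1                               # step 13
            if t > T:
                retune()                         # environment changed: back to step 3
                t = 1
        else:
            t = 1                                # error within bounds: back to step 11
```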
CN201711144395.2A 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study Active CN107885086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711144395.2A CN107885086B (en) 2017-11-17 2017-11-17 Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study

Publications (2)

Publication Number Publication Date
CN107885086A true CN107885086A (en) 2018-04-06
CN107885086B CN107885086B (en) 2019-10-25

Family

ID=61777810

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110208377A1 (en) * 2007-08-14 2011-08-25 Propeller Control Aps Efficiency optimizing propeller speed control for ships
CN102819264A (en) * 2012-07-30 2012-12-12 山东大学 Path planning Q-learning initial method of mobile robot
CN105700526A (en) * 2016-01-13 2016-06-22 华北理工大学 On-line sequence limit learning machine method possessing autonomous learning capability
CN106950956A (en) * 2017-03-22 2017-07-14 合肥工业大学 The wheelpath forecasting system of fusional movement model and behavior cognitive model
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHRISTOPHE ANDRIEU 等: "An Introduction to MCMC for Machine Learning", 《MACHINE LEARNING》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710289A (en) * 2018-05-18 2018-10-26 厦门理工学院 A method of the relay base quality optimization based on modified SPSA
CN109696830A (en) * 2019-01-31 2019-04-30 天津大学 The reinforcement learning adaptive control method of small-sized depopulated helicopter
CN109696830B (en) * 2019-01-31 2021-12-03 天津大学 Reinforced learning self-adaptive control method of small unmanned helicopter
CN111830822A (en) * 2019-04-16 2020-10-27 罗伯特·博世有限公司 System for configuring interaction with environment
CN114237267A (en) * 2021-11-02 2022-03-25 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision auxiliary method based on reinforcement learning
CN114237267B (en) * 2021-11-02 2023-11-24 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision assisting method based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN107885086A (en) Autonomous navigation device control parameter on-line control method based on MCMC optimization Q study
CN110427261A (en) A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN109828552B (en) Intermittent process fault monitoring and diagnosing method based on width learning system
CN104616060A (en) Method for predicating contamination severity of insulator based on BP neural network and fuzzy logic
CN103971160B (en) particle swarm optimization method based on complex network
CN106056127A (en) GPR (gaussian process regression) online soft measurement method with model updating
CN109218744B (en) A kind of adaptive UAV Video of bit rate based on DRL spreads transmission method
CN109214579B (en) BP neural network-based saline-alkali soil stability prediction method and system
CN112766603B (en) Traffic flow prediction method, system, computer equipment and storage medium
WO2023035727A1 (en) Industrial process soft-measurement method based on federated incremental stochastic configuration network
CN105843189A (en) Simplified simulation model based high efficient scheduling rule choosing method for use in semiconductor production lines
Hu et al. Adaptive exploration strategy with multi-attribute decision-making for reinforcement learning
CN111582567B (en) Wind power probability prediction method based on hierarchical integration
Mellios et al. A multivariate analysis of the daily water demand of Skiathos Island, Greece, implementing the artificial neuro-fuzzy inference system (ANFIS)
Liu et al. Accelerate mini-batch machine learning training with dynamic batch size fitting
Li et al. Hyper-parameter tuning of federated learning based on particle swarm optimization
Remmerswaal et al. Combined MPC and reinforcement learning for traffic signal control in urban traffic networks
Han et al. Multi-step prediction for the network traffic based on echo state network optimized by quantum-behaved fruit fly optimization algorithm
Li et al. Graph reinforcement learning-based cnn inference offloading in dynamic edge computing
Al-Lawati et al. Anytime minibatch with stale gradients
CN111796519B (en) Automatic control method of multi-input multi-output system based on extreme learning machine
CN109636609A (en) Stock recommended method and system based on two-way length memory models in short-term
Cui On asymptotics of t-type regression estimation in multiple linear model
Zhou et al. Decentralized adaptive optimal control for massive multi-agent systems using mean field game with self-organizing neural networks
Yin et al. FedSCS: Client selection for federated learning under system heterogeneity and client fairness with a Stackelberg game approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant