CN108803321A - Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning - Google Patents

Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning Download PDF

Info

Publication number
CN108803321A
CN108803321A (application CN201810535773.8A)
Authority
CN
China
Prior art keywords
auv
network
strategy
evaluation
trajectory tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810535773.8A
Other languages
Chinese (zh)
Other versions
CN108803321B (en)
Inventor
宋士吉
石文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201810535773.8A priority Critical patent/CN108803321B/en
Publication of CN108803321A publication Critical patent/CN108803321A/en
Application granted granted Critical
Publication of CN108803321B publication Critical patent/CN108803321B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric
    • G05B13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion; electric; involving the use of models or simulators
    • G05B13/042 - Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The present invention proposes an autonomous underwater vehicle (AUV) trajectory tracking control method based on deep reinforcement learning, belonging to the fields of deep reinforcement learning and intelligent control. The AUV trajectory tracking control problem is first defined; the Markov decision process model of the AUV trajectory tracking problem is then established; next, a hybrid policy-critic network consisting of multiple policy networks and multiple critic (evaluation) networks is constructed; finally, the target policy of AUV trajectory tracking control is solved with the constructed hybrid policy-critic network. For the multiple critic networks, the performance of each critic is assessed through a defined expected Bellman absolute error, and only the worst-performing critic is updated at each time step; for the multiple policy networks, one policy network is randomly selected at each time step and updated with the deterministic policy gradient. The final learned policy is the mean of all policy networks. The present invention is not easily affected by poor historical AUV tracking trajectories and achieves high precision.

Description

Autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning
Technical field
The invention belongs to the fields of deep reinforcement learning and intelligent control, and relates to an autonomous underwater vehicle (AUV) trajectory tracking control method based on deep reinforcement learning.
Background technology
Progress in deep-sea science depends heavily on deep-sea exploration technology and equipment. Because the deep-sea environment is complex and its conditions are extreme, deep-sea work-class autonomous underwater vehicles (AUVs) are currently the main means of replacing or assisting humans in deep-sea detection, observation and sampling. For task scenarios that humans cannot reach for on-site operation, such as marine resource exploration, seabed survey and marine charting, guaranteeing the autonomy and controllability of the AUV's underwater motion is the most basic and important functional requirement and the prerequisite for accomplishing any complex mission. However, many offshore applications of AUVs (such as trajectory tracking control and target tracking control) are extremely challenging, mainly because of three characteristics of AUV systems. First, as a multi-input multi-output system, the dynamics and kinematics model of an AUV (hereinafter referred to as the model) is complex, exhibiting strong nonlinearity, tight coupling, input or state constraints and time-varying behavior. Second, uncertainty exists in the model parameters or in the hydrodynamic environment, which makes AUV system modeling difficult. Third, most current AUVs are underactuated systems, i.e. the number of degrees of freedom exceeds the number of independent actuators (each independent actuator corresponds to one degree of freedom). In general, the model and parameters of an AUV are determined by a combination of mathematical-physical reasoning, numerical simulation and full-scale experiments, and the uncertain parts of the model are characterized as well as possible. The complexity of the model makes the AUV control problem extremely difficult. Moreover, as AUV application scenarios keep expanding, higher requirements are placed on the precision and stability of AUV motion control, and improving the control performance of AUVs in various motion scenarios has become an important research direction.
Over the past decades, researchers have designed and validated various AUV motion control methods for different application scenarios such as trajectory tracking, waypoint tracking, path planning and formation control. A representative example is the model-based output feedback control method proposed by Refsnes et al., which uses two decoupled system models: a three-degree-of-freedom current-induced hull model that captures the ocean current load, and a five-degree-of-freedom model that describes the system dynamics. In addition, Healey et al. designed a tracking control method based on state feedback, which assumes a fixed propulsion speed, linearizes the system model, and uses three decoupled models: a surge model, a horizontal steering model (sway and yaw) and a vertical model (heave and pitch). However, all of these methods decouple or linearize the system model, so it is difficult for them to meet the high-precision control requirements of an AUV in specific application scenarios.
Because of the limitations of the above classical motion control methods and the powerful self-learning capability of reinforcement learning, researchers have in recent years shown great interest in intelligent control methods represented by reinforcement learning. Various intelligent control methods based on reinforcement learning techniques (such as Q-learning, direct policy search, actor-critic networks and adaptive reinforcement learning) have continuously been proposed and successfully applied to different complex applications, such as robot motion control, unmanned aerial vehicle flight control, hypersonic vehicle tracking control and traffic signal light control. The core idea of reinforcement-learning-based control is to optimize the performance of the control system without prior knowledge. For AUV systems, many researchers have designed reinforcement-learning-based control methods and verified their feasibility in practice. For the autonomous underwater cable tracking control problem, El-Fakdi et al. used direct policy search to learn the state/action mapping, but this method is only applicable when both the state space and the action space are discrete. For continuous action spaces, Paula et al. used a radial basis function network to approximate the policy function; however, because the function approximation capability of radial basis function networks is weak, this method cannot guarantee high tracking control accuracy.
In recent years, with the development of deep neural network (DNN) training techniques such as batch learning, experience replay and batch normalization, deep reinforcement learning has shown excellent performance in complex tasks such as robot motion control, autonomous ground vehicle motion control, quadrotor control and automated driving. In particular, the recently proposed deep Q-network (DQN) has demonstrated human-level control performance in many extremely challenging tasks. However, DQN cannot handle problems that have both a high-dimensional state space and a continuous action space. Building on DQN, the deep deterministic policy gradient (DDPG) algorithm was further proposed to realize continuous control. However, DDPG estimates the target value of the critic network with a target critic network, so the critic network cannot effectively evaluate the policy learned by the policy network, and the learned action-value function has a large variance. Consequently, when DDPG is applied to the AUV trajectory tracking control problem, it cannot satisfy the requirements of high tracking control accuracy and stable learning.
Summary of the invention
The purpose of the present invention is to propose an AUV trajectory tracking control method based on deep reinforcement learning. The method adopts a hybrid policy-critic network structure and uses multi pseudo Q-learning and the deterministic policy gradient to train the critic networks and the policy networks, respectively. It overcomes the problems of previous reinforcement-learning-based methods, such as low control accuracy, inability to realize continuous control and unstable learning, and achieves high-precision AUV trajectory tracking control with stable learning.
To achieve the above goal, the present invention adopts the following technical solution:
An autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning, comprising the following steps:
1) Defining the autonomous underwater vehicle (AUV) trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control objective. The specific steps are as follows:
1-1) Determining the AUV system input
Let the AUV system input vector be τ_k=[ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and the rudder angle of the AUV respectively, and the subscript k denotes the k-th time step; the admissible ranges of ξ_k and δ_k are bounded by the maximum propeller thrust and the maximum rudder angle, respectively;
1-2) Determining the AUV system output
Let the AUV system output vector be η_k=[x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial coordinate frame I-XYZ at the k-th time step, and ψ_k is the angle between the AUV heading direction and the X axis at the k-th time step;
1-3) Defining the trajectory tracking control error
A reference trajectory d_k is chosen according to the desired path of the AUV; the AUV trajectory tracking control error of the k-th time step is defined as:
1-4) Establishing the AUV trajectory tracking control objective
For the reference trajectory d_k in step 1-3), an objective function of the following form is selected:
where γ is the discount factor and H is a weight matrix;
The objective of AUV trajectory tracking control is to find an optimal system input sequence τ* that minimizes the objective function P_0(τ) of the initial time, computed as follows:
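The formula images of the original publication are not reproduced in this text. As a hedged reconstruction consistent with the definitions above (the patent's exact expressions may differ), the tracking error, the discounted quadratic objective and the optimal input sequence can be written as:

e_k = \eta_k - d_k, \qquad
P_k(\tau) = \sum_{i=k}^{K} \gamma^{\,i-k}\, e_i^{\top} H\, e_i, \qquad
\tau^{*} = \arg\min_{\tau} P_0(\tau)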
2) Establishing the Markov decision process model of the AUV trajectory tracking problem
A Markov decision process model is built for the AUV trajectory tracking problem of step 1); the specific steps are as follows:
2-1) Defining the state vector
The velocity vector of the AUV system is defined as φ_k=[u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along the heading direction and perpendicular to the heading direction at the k-th time step, respectively, and χ_k is the angular velocity of the AUV heading at the k-th time step;
According to the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), the state vector of the k-th time step is defined as follows:
2-2) Defining the action vector
The action vector of the k-th time step is defined as the AUV system input vector of that time step, i.e. a_k=τ_k;
2-3) Defining the reward function
The reward function of the k-th time step characterizes the effect of taking action a_k in state s_k. According to the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function of the k-th time step is defined as follows:
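The reward formula itself is likewise not reproduced here. One plausible form, mirroring the objective of step 1-4) (an assumption for illustration, and possibly including an additional penalty on the control effort a_k), is:

r_{k+1} = -\, e_k^{\top} H\, e_k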
2-4) Converting the AUV trajectory tracking control objective τ* established in step 1-4) into the AUV trajectory tracking control objective under the reinforcement learning framework
A policy π is defined as the probability of selecting each possible action in a given state; the action-value function is then defined as follows:
where the expectation is taken over the reward function, the states and the actions; K is the maximum time step;
The action-value function describes the expected cumulative discounted reward obtained by following policy π from the current state onward. Therefore, under the reinforcement learning framework, the AUV trajectory tracking control objective is to learn, through interaction with the environment in which the AUV operates, an optimal target policy π* that maximizes the action value of the initial time, computed as follows:
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector;
The solution of the AUV trajectory tracking control objective τ* established in step 1-4) is thus converted into the solution of π*;
2-5) Simplifying the AUV trajectory tracking control objective under the reinforcement learning framework
The action-value function of step 2-4) is solved through the following iterative Bellman equation:
If the policy π is deterministic, i.e. there is a one-to-one mapping from the AUV state vector space to the AUV action vector space, denoted μ, the above iterative Bellman equation reduces to:
For the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*;
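In standard reinforcement learning notation, the action-value function of step 2-4), its iterative Bellman equation, and the deterministic simplification of step 2-5) take the following form; this is a hedged reconstruction of the omitted formulas, not necessarily the patent's exact expressions:

Q^{\pi}(s_k, a_k) = \mathbb{E}\Big[\sum_{i=k}^{K} \gamma^{\,i-k}\, r_{i+1} \,\Big|\, s_k, a_k\Big], \qquad
Q^{\pi}(s_k, a_k) = \mathbb{E}\big[r_{k+1} + \gamma\, \mathbb{E}_{a_{k+1}\sim\pi}\, Q^{\pi}(s_{k+1}, a_{k+1})\big], \qquad
Q^{\mu}(s_k, a_k) = \mathbb{E}\big[r_{k+1} + \gamma\, Q^{\mu}\big(s_{k+1}, \mu(s_{k+1})\big)\big]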
3) Constructing the hybrid policy-critic network
The deterministic optimal target policy μ* and the corresponding optimal action-value function are estimated by constructing a hybrid policy-critic network. Constructing the hybrid policy-critic network comprises three parts: constructing the policy networks, constructing the critic networks, and determining the target policy. The specific steps are as follows:
3-1) Constructing the policy networks
The hybrid policy-critic network structure estimates the deterministic optimal target policy μ* by constructing n policy networks, where θ_p denotes the weight parameters of the p-th policy network, p=1,…,n. Each policy network is realized by a fully connected deep neural network consisting of an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k and the output is the action vector a_k;
3-2) Constructing the critic networks
The hybrid policy-critic network structure estimates the optimal action-value function by constructing m critic (evaluation) networks, where w_q denotes the weight parameters of the q-th critic network, q=1,…,m. Each critic network is realized by a fully connected deep neural network consisting of an input layer, two hidden layers and an output layer. The inputs of each critic network are the state vector s_k and the action vector a_k; the state vector s_k enters the network at the input layer, while the action vector a_k enters at the first hidden layer. The output of each critic network is the value of taking action a_k in state s_k;
3-3) Determining the target policy
According to the constructed hybrid policy-critic network, the target policy μ_f(s_k) of AUV trajectory tracking control learned at the k-th time step is defined as the mean of the outputs of the n policy networks, computed as follows:
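Written out, with μ(s_k|θ_p) denoting the output of the p-th policy network (notation added here for clarity), the averaging stated above is:

\mu_f(s_k) = \frac{1}{n} \sum_{p=1}^{n} \mu(s_k \mid \theta_p)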
4) Solving the target policy μ_f(s_k) of AUV trajectory tracking control; the specific steps are as follows:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps K of each iteration, the mini-batch size N drawn by experience replay, the learning rate α_ω of each critic network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H of the reward function;
4-2) Initializing the hybrid policy-critic network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks and the m critic networks; randomly select one policy network from the n policy networks and denote it as the d-th policy network, d=1,…,n;
Construct an experience replay buffer R with maximum capacity B, initialized to be empty;
4-3) The iterations start and the hybrid policy-critic network is trained; the iteration counter is initialized as episode=1;
4-4) Set the current time step k=0, randomly initialize the AUV state variable s_0, and let the state variable of the current time step be s_k=s_0; generate an exploration noise Noise_k;
4-5) Determine the action vector a_k of the current time step from the n current policy networks and the exploration noise Noise_k as:
4-6) The AUV executes action a_k in the current state s_k, receives the reward r_{k+1} according to step 2-3), and observes a new state s_{k+1}; e_k=(s_k, a_k, r_{k+1}, s_{k+1}) is recorded as an experience sample. If the number of samples in the replay buffer R has reached the maximum capacity B, the sample added earliest is deleted first and the experience sample e_k is then stored in R; otherwise the experience sample e_k is stored in R directly;
A experience samples are selected from the replay buffer R as follows: when the number of samples in R does not exceed N, all experience samples currently in R are selected; when the number of samples in R exceeds N, N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) are randomly drawn from R;
4-7) The expected Bellman absolute error EBAE_q of each critic network is computed from the A selected experience samples to characterize the performance of each critic network; the formula is as follows:
The critic network with the worst performance is selected; its index is obtained through the following formula and denoted c:
4-8) For each experience sample, the action vector of the next time step is obtained from the c-th critic network through the following greedy policy:
4-9) The target value of the c-th critic network is computed by multi pseudo Q-learning; the formula is as follows:
4-10) The loss function L(w_c) of the c-th critic network is computed; the formula is as follows:
4-11) The weight parameters of the c-th critic network are updated through the derivative of the loss function L(w_c) with respect to the weight parameters w_c; the formula is as follows:
The weight parameters of the remaining critic networks remain unchanged;
4-12) A policy network is randomly selected from the n policy networks to reset the d-th policy network;
4-13) The deterministic policy gradient of the d-th policy network is computed from the updated c-th critic network, and the weight parameters θ_d of the d-th policy network are updated accordingly; the respective formulas are as follows:
The weight parameters of the remaining policy networks remain unchanged;
4-14) Let k=k+1 and test k: if k<K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15);
4-15) Let episode=episode+1 and test episode: if episode<M, return to step 4-4) and the AUV performs the next iteration; otherwise, go to step 4-16);
4-16) The iterations end and the training of the hybrid policy-critic network terminates; the outputs of the n policy networks at the end of the iterations are combined through the formula of step 3-3) to obtain the final target policy μ_f(s_k) of AUV trajectory tracking control, and trajectory tracking control of the AUV is realized with this target policy.
Features and beneficial effects of the present invention:
The method proposed by the present invention uses multiple policy networks and multiple critic networks. For the multiple critic networks, the performance of each critic is assessed through a defined expected Bellman absolute error, and only the worst-performing critic is updated at each time step. Unlike existing reinforcement-learning-based control methods, the present invention proposes multi pseudo Q-learning to compute more accurate critic target values; this can resolve the overestimation problem of the action-value function and can stabilize the learning process without using target critic networks. For the multiple policy networks, one policy network is randomly selected at each time step and updated with the deterministic policy gradient. The final learned policy is the mean of all policy networks.
1) The AUV trajectory tracking control method proposed by the present invention does not depend on a model; it autonomously learns, from the data sampled by the AUV during motion, a target policy that achieves the optimal control objective. The method requires no assumptions about the AUV model, is particularly suitable for AUVs working in complex deep-sea environments, and has high practical application value.
2) The method of the present invention uses multi pseudo Q-learning to obtain more accurate critic target values than existing methods, which both reduces the variance of the action-value function approximated by the critic networks and resolves the overestimation problem of the action-value function, thereby obtaining a better target policy and realizing high-precision AUV trajectory tracking control.
3) The method of the present invention decides which critic network to update at each time step based on the expected Bellman absolute error; this update rule weakens the influence of poor critic networks and thus guarantees fast convergence of the learning process.
4) Because multiple critic networks are used, the learning process of the method of the present invention is not easily affected by poor historical AUV tracking trajectories; the method is robust and the learning process is stable.
5) The method of the present invention combines reinforcement learning with deep neural networks and has a strong self-learning capability; it can realize high-precision adaptive control of an AUV in an uncertain deep-sea environment and has good application prospects in scenarios such as AUV trajectory tracking and underwater obstacle avoidance.
Description of the drawings
Fig. 1 is a performance comparison of the method proposed by the present invention and the existing DDPG method; figure (a) compares the learning curves and figure (b) compares the AUV trajectory tracking results.
Fig. 2 is a performance comparison of the method proposed by the present invention and the neural network PID method; figure (a) compares the coordinate trajectory tracking of the AUV along the X and Y directions and figure (b) compares the tracking errors of the AUV in the X and Y directions.
Detailed description of the embodiments
The autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning proposed by the present invention is further described below with reference to the drawings and specific embodiments.
The present invention proposes an autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning, which mainly comprises four parts: defining the AUV trajectory tracking control problem, establishing the Markov decision process model of the AUV trajectory tracking problem, constructing the hybrid policy-critic network structure, and solving the target policy of AUV trajectory tracking control.
1) Defining the AUV trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control objective. The specific steps are as follows:
1-1) Determining the AUV system input
Let the AUV system input vector be τ_k=[ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and the rudder angle of the AUV respectively, and the subscript k denotes the k-th time step, i.e. the value at time kt, where t is the time step length (the same applies below); the admissible ranges of ξ_k and δ_k are bounded by the maximum propeller thrust and the maximum rudder angle, which are determined according to the type of propeller used by the AUV.
1-2) Determining the AUV system output
Let the AUV system output vector be η_k=[x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial coordinate frame I-XYZ at the k-th time step, and ψ_k is the angle between the AUV heading direction and the X axis at the k-th time step.
1-3) Defining the trajectory tracking control error
A reference trajectory d_k is chosen according to the desired path of the AUV; the AUV trajectory tracking control error of the k-th time step is defined as:
1-4) Establishing the AUV trajectory tracking control objective
For the reference trajectory d_k in step 1-3), an objective function of the following form is selected:
where γ is the discount factor and H is a weight matrix;
The objective of AUV trajectory tracking control is to find an optimal system input sequence τ* that minimizes the objective function P_0(τ) of the initial time, computed as follows:
2) Establishing the Markov decision process model of the AUV trajectory tracking problem
The Markov decision process (MDP) is the foundation of reinforcement learning theory, so an MDP model must be built for the AUV trajectory tracking problem of step 1). The essential elements of reinforcement learning are the agent, the environment, the states, the actions and the reward function; the goal of the agent is to learn, through interaction with the environment in which the AUV operates, an optimal sequence of actions (or control inputs) that maximizes the cumulative reward (or, equivalently, minimizes the cumulative tracking control error), thereby solving the AUV trajectory tracking objective. The specific steps are as follows:
2-1) Defining the state vector
The velocity vector of the AUV system is defined as φ_k=[u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along the heading direction and perpendicular to the heading direction at the k-th time step, respectively, and χ_k is the angular velocity of the AUV heading at the k-th time step.
According to the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), the state vector of the k-th time step is defined as follows:
2-2) Defining the action vector
The action vector of the k-th time step is defined as the AUV system input vector of that time step, i.e.: a_k=τ_k.
2-3) Defining the reward function
The reward function of the k-th time step characterizes the effect of taking action a_k in state s_k. According to the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function of the k-th time step is defined as follows:
2-4) Converting the AUV trajectory tracking control objective τ* established in step 1-4) into the AUV trajectory tracking control objective under the reinforcement learning framework
A policy π is defined as the probability of selecting each possible action in a given state; the action-value function is then defined as follows:
where the expectation is taken over the reward function, the states and the actions (the same applies below); K is the maximum time step;
The action-value function describes the expected cumulative discounted reward obtained by following policy π from the current state onward. Therefore, under the reinforcement learning framework, the AUV trajectory tracking control objective (i.e. the objective of the agent) is to learn, through interaction with the environment in which the AUV operates, an optimal target policy π* that maximizes the action value of the initial time, i.e.:
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector.
Therefore, the solution of the AUV trajectory tracking control objective τ* established in step 1-4) can be converted into the solution of π*.
2-5) Simplifying the AUV trajectory tracking control objective under the reinforcement learning framework
Similar to dynamic programming, many reinforcement learning methods solve the action-value function of step 2-4) using the following iterative Bellman equation:
Assuming that the policy π is deterministic, i.e. there is a one-to-one mapping from the AUV state vector space to the AUV action vector space, denoted μ, the above iterative Bellman equation can be reduced to:
In addition, for the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*.
3) Constructing the hybrid policy-critic network
From step 2-5) it follows that the core of solving the AUV trajectory tracking problem with reinforcement learning is how to solve the deterministic optimal target policy μ* and the corresponding optimal action-value function. The method of the present invention uses a hybrid policy-critic network to estimate μ* and the optimal action-value function, respectively. Constructing the hybrid policy-critic network comprises three parts: constructing the policy networks, constructing the critic networks, and determining the target policy. The specific steps are as follows:
3-1) Constructing the policy networks
The hybrid policy-critic network structure estimates the deterministic optimal target policy μ* by constructing n policy networks (n is chosen to balance the tracking control accuracy of the algorithm of the present invention against the network training speed; its value should be neither too large nor too small), where θ_p denotes the weight parameters of the p-th policy network, p=1,…,n. Each policy network is realized by a fully connected deep neural network consisting of an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k, the output is the action vector a_k, and the two hidden layers contain 400 and 300 units respectively.
3-2) Constructing the critic networks
The hybrid policy-critic network structure estimates the optimal action-value function by constructing m critic networks (m is selected on the same basis as the number of policy networks described above), where w_q denotes the weight parameters of the q-th critic network, q=1,…,m. Each critic network is realized by a fully connected deep neural network consisting of an input layer, two hidden layers and an output layer, with the two hidden layers containing 400 and 300 units respectively. The inputs of each critic network are the state vector s_k and the action vector a_k; the state vector s_k enters the network at the input layer, while the action vector a_k enters at the first hidden layer. The output of each critic network is the value of taking action a_k in state s_k.
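A minimal sketch of the two network types in PyTorch, matching the layer sizes stated above (hidden layers of 400 and 300 units, with the action injected at the first hidden layer of the critic). The tanh output scaling to the actuator limits is an assumption added for illustration and is not taken from the patent.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # Fully connected policy (actor) network: state -> action.
    def __init__(self, state_dim, action_dim, action_bound):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)
        # Assumed scaling to the actuator limits, e.g. [xi_max, delta_max].
        self.action_bound = torch.as_tensor(action_bound, dtype=torch.float32)

    def forward(self, s):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(h))
        return torch.tanh(self.out(h)) * self.action_bound

class CriticNet(nn.Module):
    # Fully connected critic (evaluation) network: the state enters at the
    # input layer and the action is injected at the first hidden layer.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
        return self.out(h)  # estimated value of taking action a in state s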
3-3) Determining the target policy
According to the constructed hybrid policy-critic network, the target policy μ_f(s_k) of AUV trajectory tracking control learned at the k-th time step is defined as the mean of the outputs of the n policy networks, computed as follows:
4) Solving the target policy μ_f(s_k) of AUV trajectory tracking control; the specific steps are as follows:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps K of each iteration, the mini-batch size N drawn by experience replay, the learning rate α_ω of each critic network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H of the reward function. In this embodiment, M=1500, K=1000 (with time step length t=0.2 s), N=64, α_ω=0.01 for each critic network, α_θ=0.001 for each policy network, γ=0.99, and H=[0.001, 0; 0, 0.001];
4-2) Initializing the hybrid policy-critic network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks and the m critic networks; randomly select one policy network from the n policy networks and denote it as the d-th policy network (d=1,…,n);
Construct an experience replay buffer R with maximum capacity B (B=10000 in this embodiment), initialized to be empty;
4-3) The iterations start and the hybrid policy-critic network is trained; the iteration counter is initialized as episode=1;
4-4) Set the current time step k=0, randomly initialize the AUV state variable s_0, and let the state variable of the current time step be s_k=s_0; generate an exploration noise Noise_k (this embodiment uses Ornstein-Uhlenbeck exploration noise);
4-5) Determine the action vector a_k of the current time step from the n current policy networks and the exploration noise Noise_k as:
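A sketch of discrete-time Ornstein-Uhlenbeck exploration noise and of one plausible way to form the exploration action of step 4-5), namely averaging the n policy outputs and adding the noise. The averaging and the parameter values theta and sigma are assumptions for illustration, since the exact formula is not reproduced above; policy_nets is assumed to be a list of callables mapping a state array to an action array.

import numpy as np

class OUNoise:
    # Discrete-time Ornstein-Uhlenbeck process, a common exploration noise
    # for deterministic-policy-gradient methods.
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.full(dim, mu)

    def sample(self):
        dx = self.theta * (self.mu - self.x) + self.sigma * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x

def exploration_action(policy_nets, s_k, noise):
    # Assumed form: mean of the n policy outputs plus exploration noise.
    a_k = np.mean([p(s_k) for p in policy_nets], axis=0)
    return a_k + noise.sample()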
4-6) The AUV executes action a_k in the current state s_k, receives the reward r_{k+1} according to step 2-3), and observes a new state s_{k+1}; e_k=(s_k, a_k, r_{k+1}, s_{k+1}) is recorded as an experience sample. If the number of samples in the replay buffer R has reached the maximum capacity B, the sample added earliest is deleted first and the experience sample e_k is then stored in R; otherwise the experience sample e_k is stored in R directly;
A experience samples (A ≤ N) are selected from the replay buffer R as follows: when the number of samples in R does not exceed N, all experience samples currently in R are selected; when the number of samples in R exceeds N, N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) are randomly drawn from R, where l is the time step at which the selected experience sample was collected;
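A minimal replay-buffer sketch matching step 4-6) and the sample selection above: bounded capacity B with first-in-first-out eviction, and a draw of A = min(N, current size) samples. Names and structure are illustrative.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):              # capacity corresponds to B
        self.buffer = deque(maxlen=capacity)   # the deque drops the oldest sample automatically

    def store(self, s, a, r_next, s_next):
        self.buffer.append((s, a, r_next, s_next))

    def sample(self, batch_size):              # batch_size corresponds to N
        if len(self.buffer) <= batch_size:
            return list(self.buffer)           # A = current buffer size
        return random.sample(self.buffer, batch_size)  # A = N random samples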
4-7) The expected Bellman absolute error EBAE_q of each critic network is computed from the A selected experience samples to characterize the performance of each critic network; the formula is as follows:
The critic network with the worst performance is selected; its index is obtained through the following formula and denoted c:
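The EBAE formula is not reproduced above; the sketch below assumes one natural reading, the mean absolute one-step Bellman residual of each critic over the A sampled transitions, with the critic having the largest EBAE treated as the worst. Both the residual form and the use of the averaged policy for the next action are assumptions; critics and policy_nets are callables as in the earlier sketches.

import numpy as np

def ebae(critic, policy_nets, batch, gamma):
    # Assumed: mean absolute one-step Bellman residual of one critic over the batch.
    errors = []
    for (s, a, r_next, s_next) in batch:
        a_next = np.mean([p(s_next) for p in policy_nets], axis=0)
        residual = r_next + gamma * critic(s_next, a_next) - critic(s, a)
        errors.append(abs(float(residual)))
    return float(np.mean(errors))

def worst_critic_index(critics, policy_nets, batch, gamma):
    # The critic with the largest expected Bellman absolute error is updated (step 4-7).
    scores = [ebae(q, policy_nets, batch, gamma) for q in critics]
    return int(np.argmax(scores))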
4-8) For each experience sample, the action vector of the next time step is obtained from the c-th critic network through the following greedy policy:
4-9) The target value of the c-th critic network is computed by multi pseudo Q-learning; the formula is as follows:
4-10) The loss function L(w_c) of the c-th critic network is computed; the formula is as follows:
4-11) The weight parameters of the c-th critic network are updated through the derivative of the loss function L(w_c) with respect to the weight parameters w_c; the formula is as follows:
The weight parameters of the remaining critic networks remain unchanged;
4-12) A policy network is randomly selected from the n policy networks to reset the d-th policy network;
4-13) The deterministic policy gradient of the d-th policy network is computed from the updated c-th critic network, and the weight parameters θ_d of the d-th policy network are updated accordingly; the respective formulas are as follows:
The weight parameters of the remaining policy networks remain unchanged.
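A sketch of steps 4-8) to 4-13) under explicit assumptions, since the corresponding formulas are not reproduced here: the greedy next action is taken from the policy ensemble, the multi pseudo Q-learning target is assumed to average the m critics at that action, critic c is fitted to the target by a gradient step on a squared loss, and the randomly selected policy network d is updated with the deterministic policy gradient through critic c. These concrete forms are illustrative readings of the steps above, not the patent's literal equations; the networks are the PyTorch modules from the earlier sketch and the batch is assumed to be pre-stacked into tensors.

import torch

def update_networks(critics, policies, c, d, batch, gamma, alpha_w, alpha_theta):
    # batch: tensors s (A, state_dim), a (A, act_dim), r_next (A, 1), s_next (A, state_dim)
    s, a, r_next, s_next = batch

    # Steps 4-8)/4-9) (assumed form): greedy next action from the policy ensemble,
    # target value averaged over the m critics ("multi pseudo Q-learning" reading).
    with torch.no_grad():
        a_next = torch.mean(torch.stack([p(s_next) for p in policies]), dim=0)
        y = r_next + gamma * torch.mean(torch.stack([q(s_next, a_next) for q in critics]), dim=0)

    # Steps 4-10)/4-11): squared loss of critic c against the target, gradient step on w_c only.
    critic_opt = torch.optim.SGD(critics[c].parameters(), lr=alpha_w)
    critic_opt.zero_grad()
    loss = torch.mean((y - critics[c](s, a)) ** 2)
    loss.backward()
    critic_opt.step()

    # Step 4-13): deterministic policy gradient of policy d through the updated critic c
    # (ascending the critic value is implemented as descending its negative).
    policy_opt = torch.optim.SGD(policies[d].parameters(), lr=alpha_theta)
    policy_opt.zero_grad()
    policy_loss = -torch.mean(critics[c](s, policies[d](s)))
    policy_loss.backward()
    policy_opt.step()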
4-14) Let k=k+1 and test k: if k<K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15).
4-15) Let episode=episode+1 and test episode: if episode<M, return to step 4-4) and the AUV performs the next iteration; otherwise, go to step 4-16).
4-16) The iterations end and the training of the hybrid policy-critic network terminates; the outputs of the n policy networks at the end of the iterations are combined through the formula of step 3-3) to obtain the final target policy μ_f(s_k) of AUV trajectory tracking control, and trajectory tracking control of the AUV is realized with this target policy.
Validation of the effectiveness of the embodiment of the present invention
The performance evaluation of the AUV trajectory tracking control method based on deep reinforcement learning proposed by the present invention (hereinafter referred to as MPQ-DPG) is as follows. All comparison experiments are based on the widely used REMUS autonomous underwater vehicle, whose maximum propeller thrust and maximum rudder angle are 86 N and 0.24 rad respectively, and use the following reference trajectory:
In addition, in the embodiments of the present invention, the number of critic networks m is identical to the number of policy networks n, and both are uniformly denoted n hereinafter.
1) Comparative analysis of MPQ-DPG and the existing DDPG method
Fig. 1 compares the trajectory tracking control method based on deep reinforcement learning proposed by the present invention (MPQ-DPG) with the existing DDPG method in terms of the learning curves during training and the trajectory tracking results. The learning curves in figure (a) are obtained from five independent experiments, and Ref in figure (b) denotes the reference trajectory.
The following conclusions can be drawn from the analysis of Fig. 1:
a) Compared with the DDPG method, MPQ-DPG has better learning stability, because MPQ-DPG uses multiple critic networks and policy networks, which reduces the influence of poor samples on learning stability.
b) The average cumulative reward to which the MPQ-DPG method finally converges is clearly higher than that of the DDPG method, which indicates that the tracking control accuracy of MPQ-DPG is clearly higher than that of DDPG.
c) From Fig. 1(b) it can be observed that the tracking trajectory obtained by the MPQ-DPG method almost coincides with the reference trajectory, which indicates that MPQ-DPG can realize high-precision AUV tracking control.
d) As the numbers of policy networks and critic networks increase, the tracking control accuracy of the MPQ-DPG method gradually improves, but the improvement is no longer obvious for n>4.
2) Comparative analysis of the MPQ-DPG method and the existing neural network PID method
Fig. 2 compares the MPQ-DPG method proposed by the present invention for underwater unmanned vehicle trajectory tracking control with the neural network PID method in terms of the coordinate tracking curves and the coordinate tracking errors. In the figure, Ref denotes the reference coordinate trajectory, PIDNN denotes the neural network PID algorithm, and n=4.
Analysis of Fig. 2 shows that the tracking performance of the neural network PID control method is significantly worse than that of the MPQ-DPG method proposed by the present invention. In addition, the tracking errors in Fig. 2(b) show that the MPQ-DPG method achieves faster error convergence; in particular, in the initial stage the MPQ-DPG method still achieves fast, high-precision tracking, whereas the response time of the neural network PID method is considerably longer than that of MPQ-DPG and its tracking error converges poorly.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any change, modification, substitution, combination or simplification made without departing from the spirit and principles of the present invention shall be regarded as an equivalent replacement and is included within the scope of protection of the present invention.

Claims (1)

1. An autonomous underwater vehicle trajectory tracking control method based on deep reinforcement learning, characterized in that the method comprises the following steps:
1) Defining the autonomous underwater vehicle (AUV) trajectory tracking control problem
Defining the AUV trajectory tracking control problem comprises four parts: determining the AUV system input, determining the AUV system output, defining the trajectory tracking control error, and establishing the AUV trajectory tracking control objective; the specific steps are as follows:
1-1) Determining the AUV system input
Let the AUV system input vector be τ_k=[ξ_k, δ_k]^T, where ξ_k and δ_k are the propeller thrust and the rudder angle of the AUV respectively, and the subscript k denotes the k-th time step; the admissible ranges of ξ_k and δ_k are bounded by the maximum propeller thrust and the maximum rudder angle, respectively;
1-2) Determining the AUV system output
Let the AUV system output vector be η_k=[x_k, y_k, ψ_k]^T, where x_k and y_k are the coordinates of the AUV along the X and Y axes of the inertial coordinate frame I-XYZ at the k-th time step, and ψ_k is the angle between the AUV heading direction and the X axis at the k-th time step;
1-3) Defining the trajectory tracking control error
A reference trajectory d_k is chosen according to the desired path of the AUV; the AUV trajectory tracking control error of the k-th time step is defined as:
1-4) Establishing the AUV trajectory tracking control objective
For the reference trajectory d_k in step 1-3), an objective function of the following form is selected:
where γ is the discount factor and H is a weight matrix;
The objective of AUV trajectory tracking control is to find an optimal system input sequence τ* that minimizes the objective function P_0(τ) of the initial time, computed as follows:
2) Establishing the Markov decision process model of the AUV trajectory tracking problem
A Markov decision process model is built for the AUV trajectory tracking problem of step 1); the specific steps are as follows:
2-1) Defining the state vector
The velocity vector of the AUV system is defined as φ_k=[u_k, v_k, χ_k]^T, where u_k and v_k are the linear velocities of the AUV along the heading direction and perpendicular to the heading direction at the k-th time step, respectively, and χ_k is the angular velocity of the AUV heading at the k-th time step;
According to the AUV system output vector η_k determined in step 1-2) and the reference trajectory defined in step 1-3), the state vector of the k-th time step is defined as follows:
2-2) Defining the action vector
The action vector of the k-th time step is defined as the AUV system input vector of that time step, i.e. a_k=τ_k;
2-3) Defining the reward function
The reward function of the k-th time step characterizes the effect of taking action a_k in state s_k; according to the trajectory tracking control error e_k defined in step 1-3) and the action vector a_k defined in step 2-2), the AUV reward function of the k-th time step is defined as follows:
2-4) Converting the AUV trajectory tracking control objective τ* established in step 1-4) into the AUV trajectory tracking control objective under the reinforcement learning framework
A policy π is defined as the probability of selecting each possible action in a given state; the action-value function is then defined as follows:
where the expectation is taken over the reward function, the states and the actions; K is the maximum time step;
The action-value function describes the expected cumulative discounted reward obtained by following policy π from the current state onward; therefore, under the reinforcement learning framework, the AUV trajectory tracking control objective is to learn, through interaction with the environment in which the AUV operates, an optimal target policy π* that maximizes the action value of the initial time, computed as follows:
where p(s_0) is the distribution of the initial state s_0 and a_0 is the initial action vector;
The solution of the AUV trajectory tracking control objective τ* established in step 1-4) is thus converted into the solution of π*;
2-5) Simplifying the AUV trajectory tracking control objective under the reinforcement learning framework
The action-value function of step 2-4) is solved through the following iterative Bellman equation:
If the policy π is deterministic, i.e. there is a one-to-one mapping from the AUV state vector space to the AUV action vector space, denoted μ, the above iterative Bellman equation is reduced to:
For the deterministic policy μ, the optimal target policy π* of step 2-4) reduces to the deterministic optimal target policy μ*;
3) Constructing the hybrid policy-critic network
The deterministic optimal target policy μ* and the corresponding optimal action-value function are estimated by constructing a hybrid policy-critic network; constructing the hybrid policy-critic network comprises three parts: constructing the policy networks, constructing the critic networks, and determining the target policy; the specific steps are as follows:
3-1) Constructing the policy networks
The hybrid policy-critic network structure estimates the deterministic optimal target policy μ* by constructing n policy networks, where θ_p denotes the weight parameters of the p-th policy network, p=1,…,n; each policy network is realized by a fully connected deep neural network consisting of an input layer, two hidden layers and an output layer; the input of each policy network is the state vector s_k and the output is the action vector a_k;
3-2) Constructing the critic networks
The hybrid policy-critic network structure estimates the optimal action-value function by constructing m critic networks, where w_q denotes the weight parameters of the q-th critic network, q=1,…,m; each critic network is realized by a fully connected deep neural network consisting of an input layer, two hidden layers and an output layer; the inputs of each critic network are the state vector s_k and the action vector a_k, the state vector s_k entering the network at the input layer and the action vector a_k entering at the first hidden layer; the output of each critic network is the value of taking action a_k in state s_k;
3-3) Determining the target policy
According to the constructed hybrid policy-critic network, the target policy μ_f(s_k) of AUV trajectory tracking control learned at the k-th time step is defined as the mean of the outputs of the n policy networks, computed as follows:
4) Solving the target policy μ_f(s_k) of AUV trajectory tracking control; the specific steps are as follows:
4-1) Parameter setting
Set the maximum number of iterations M, the maximum number of time steps K of each iteration, the mini-batch size N drawn by experience replay, the learning rate α_ω of each critic network, the learning rate α_θ of each policy network, the discount factor γ, and the weight matrix H of the reward function;
4-2) Initializing the hybrid policy-critic network
Randomly initialize the weight parameters θ_p and w_q of the n policy networks and the m critic networks; randomly select one policy network from the n policy networks and denote it as the d-th policy network, d=1,…,n;
Construct an experience replay buffer R with maximum capacity B, initialized to be empty;
4-3) The iterations start and the hybrid policy-critic network is trained; the iteration counter is initialized as episode=1;
4-4) Set the current time step k=0, randomly initialize the AUV state variable s_0, and let the state variable of the current time step be s_k=s_0; generate an exploration noise Noise_k;
4-5) Determine the action vector a_k of the current time step from the n current policy networks and the exploration noise Noise_k as:
4-6) The AUV executes action a_k in the current state s_k, receives the reward r_{k+1} according to step 2-3), and observes a new state s_{k+1}; e_k=(s_k, a_k, r_{k+1}, s_{k+1}) is recorded as an experience sample. If the number of samples in the replay buffer R has reached the maximum capacity B, the sample added earliest is deleted first and the experience sample e_k is then stored in R; otherwise the experience sample e_k is stored in R directly;
A experience samples are selected from the replay buffer R as follows: when the number of samples in R does not exceed N, all experience samples currently in R are selected; when the number of samples in R exceeds N, N experience samples (s_l, a_l, r_{l+1}, s_{l+1}) are randomly drawn from R;
4-7) The expected Bellman absolute error EBAE_q of each critic network is computed from the A selected experience samples to characterize the performance of each critic network; the formula is as follows:
The critic network with the worst performance is selected; its index is obtained through the following formula and denoted c:
4-8) For each experience sample, the action vector of the next time step is obtained from the c-th critic network through the following greedy policy:
4-9) The target value of the c-th critic network is computed by multi pseudo Q-learning; the formula is as follows:
4-10) The loss function L(w_c) of the c-th critic network is computed; the formula is as follows:
4-11) The weight parameters of the c-th critic network are updated through the derivative of the loss function L(w_c) with respect to the weight parameters w_c; the formula is as follows:
The weight parameters of the remaining critic networks remain unchanged;
4-12) A policy network is randomly selected from the n policy networks to reset the d-th policy network;
4-13) The deterministic policy gradient of the d-th policy network is computed from the updated c-th critic network, and the weight parameters θ_d of the d-th policy network are updated accordingly; the respective formulas are as follows:
The weight parameters of the remaining policy networks remain unchanged;
4-14) Let k=k+1 and test k: if k<K, return to step 4-5) and the AUV continues to track the reference trajectory; otherwise, go to step 4-15);
4-15) Let episode=episode+1 and test episode: if episode<M, return to step 4-4) and the AUV performs the next iteration; otherwise, go to step 4-16);
4-16) The iterations end and the training of the hybrid policy-critic network terminates; the outputs of the n policy networks at the end of the iterations are combined through the formula of step 3-3) to obtain the final target policy μ_f(s_k) of AUV trajectory tracking control, and trajectory tracking control of the AUV is realized with this target policy.
CN201810535773.8A 2018-05-30 2018-05-30 Autonomous underwater vehicle track tracking control method based on deep reinforcement learning Active CN108803321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810535773.8A CN108803321B (en) 2018-05-30 2018-05-30 Autonomous underwater vehicle track tracking control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810535773.8A CN108803321B (en) 2018-05-30 2018-05-30 Autonomous underwater vehicle track tracking control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108803321A true CN108803321A (en) 2018-11-13
CN108803321B CN108803321B (en) 2020-07-10

Family

ID=64089259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810535773.8A Active CN108803321B (en) 2018-05-30 2018-05-30 Autonomous underwater vehicle track tracking control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108803321B (en)

Cited By (73)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109361700A (en) * 2018-12-06 2019-02-19 郑州航空工业管理学院 A kind of unmanned plane self-organizing network system adaptive recognition protocol frame
CN109696830A (en) * 2019-01-31 2019-04-30 天津大学 The reinforcement learning adaptive control method of small-sized depopulated helicopter
CN109719721A (en) * 2018-12-26 2019-05-07 北京化工大学 A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN109765916A (en) * 2019-03-26 2019-05-17 武汉欣海远航科技研发有限公司 A kind of unmanned surface vehicle path following control device design method
CN109828463A (en) * 2019-02-18 2019-05-31 哈尔滨工程大学 A kind of adaptive wave glider bow of ocean current interference is to control method
CN109828467A (en) * 2019-03-01 2019-05-31 大连海事大学 A kind of the unmanned boat intensified learning controller architecture and design method of data-driven
CN109870162A (en) * 2019-04-04 2019-06-11 北京航空航天大学 A kind of unmanned plane during flying paths planning method based on competition deep learning network
CN109960259A (en) * 2019-02-15 2019-07-02 青岛大学 A kind of unmanned guiding vehicle paths planning method of the multiple agent intensified learning based on gradient gesture
CN110045614A (en) * 2019-05-16 2019-07-23 河海大学常州校区 A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning
CN110083064A (en) * 2019-04-29 2019-08-02 辽宁石油化工大学 A kind of network optimal track control method based on non-strategy Q- study
CN110321666A (en) * 2019-08-09 2019-10-11 重庆理工大学 Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110362089A (en) * 2019-08-02 2019-10-22 大连海事大学 A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Learn isolated intersection traffic signal control method, system, device based on deeply
CN110673602A (en) * 2019-10-24 2020-01-10 驭势科技(北京)有限公司 Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment
CN110716574A (en) * 2019-09-29 2020-01-21 哈尔滨工程大学 UUV real-time collision avoidance planning method based on deep Q network
CN110806759A (en) * 2019-11-12 2020-02-18 清华大学 Aircraft route tracking method based on deep reinforcement learning
CN110806756A (en) * 2019-09-10 2020-02-18 西北工业大学 Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111027677A (en) * 2019-12-02 2020-04-17 西安电子科技大学 Multi-maneuvering-target tracking method based on depth certainty strategy gradient DDPG
CN111061277A (en) * 2019-12-31 2020-04-24 歌尔股份有限公司 Unmanned vehicle global path planning method and device
CN111091710A (en) * 2019-12-18 2020-05-01 上海天壤智能科技有限公司 Traffic signal control method, system and medium
CN111240345A (en) * 2020-02-11 2020-06-05 哈尔滨工程大学 Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN111310384A (en) * 2020-01-16 2020-06-19 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111580544A (en) * 2020-03-25 2020-08-25 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
TWI706238B (en) * 2018-12-18 2020-10-01 大陸商北京航跡科技有限公司 Systems and methods for autonomous driving
CN111736617A (en) * 2020-06-09 2020-10-02 哈尔滨工程大学 Speed observer-based benthonic underwater robot preset performance track tracking control method
CN111813143A (en) * 2020-06-09 2020-10-23 天津大学 Underwater glider intelligent control system and method based on reinforcement learning
CN111856936A (en) * 2020-07-21 2020-10-30 天津蓝鳍海洋工程有限公司 Control method for underwater high-flexibility operation platform with cable
CN112100834A (en) * 2020-09-06 2020-12-18 西北工业大学 Underwater glider attitude control method based on deep reinforcement learning
CN112132263A (en) * 2020-09-11 2020-12-25 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning
CN112148025A (en) * 2020-09-24 2020-12-29 东南大学 Unmanned aerial vehicle stability control algorithm based on integral compensation reinforcement learning
CN112162555A (en) * 2020-09-23 2021-01-01 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112179367A (en) * 2020-09-25 2021-01-05 广东海洋大学 Intelligent autonomous navigation method based on deep reinforcement learning
CN112241176A (en) * 2020-10-16 2021-01-19 哈尔滨工程大学 Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment
CN112462792A (en) * 2020-12-09 2021-03-09 哈尔滨工程大学 Underwater robot motion control method based on Actor-Critic algorithm
CN112506210A (en) * 2020-12-04 2021-03-16 东南大学 Unmanned aerial vehicle control method for autonomous target tracking
US10955853B2 (en) 2018-12-18 2021-03-23 Beijing Voyager Technology Co., Ltd. Systems and methods for autonomous driving
CN112558465A (en) * 2020-12-03 2021-03-26 大连海事大学 Unknown unmanned ship finite time reinforcement learning control method with input limitation
CN112698572A (en) * 2020-12-22 2021-04-23 西安交通大学 Structural vibration control method, medium and equipment based on reinforcement learning
CN112929900A (en) * 2021-01-21 2021-06-08 华侨大学 MAC protocol for realizing time domain interference alignment based on deep reinforcement learning in underwater acoustic network
CN113029123A (en) * 2021-03-02 2021-06-25 西北工业大学 Multi-AUV collaborative navigation method based on reinforcement learning
CN113052372A (en) * 2021-03-17 2021-06-29 哈尔滨工程大学 Dynamic AUV tracking path planning method based on deep reinforcement learning
CN113095463A (en) * 2021-03-31 2021-07-09 南开大学 Robot confrontation method based on evolution reinforcement learning
CN113095500A (en) * 2021-03-31 2021-07-09 南开大学 Robot tracking method based on multi-agent reinforcement learning
CN113359448A (en) * 2021-06-03 2021-09-07 清华大学 Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics
CN113370205A (en) * 2021-05-08 2021-09-10 浙江工业大学 Baxter mechanical arm track tracking control method based on machine learning
CN113467248A (en) * 2021-07-22 2021-10-01 南京大学 Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning
CN113595768A (en) * 2021-07-07 2021-11-02 西安电子科技大学 Distributed cooperative transmission algorithm for guaranteeing control performance of mobile information physical system
CN113821035A (en) * 2021-09-22 2021-12-21 北京邮电大学 Unmanned ship trajectory tracking control method and device
CN113829351A (en) * 2021-10-13 2021-12-24 广西大学 Collaborative control method of mobile mechanical arm based on reinforcement learning
CN113885330A (en) * 2021-10-26 2022-01-04 哈尔滨工业大学 Information physical system safety control method based on deep reinforcement learning
CN114020001A (en) * 2021-12-17 2022-02-08 中国科学院国家空间科学中心 Mars unmanned aerial vehicle intelligent control method based on depth certainty strategy gradient learning
CN114089633A (en) * 2021-11-19 2022-02-25 江苏科技大学 Multi-motor coupling drive control device and method for underwater robot
CN114357884A (en) * 2022-01-05 2022-04-15 厦门宇昊软件有限公司 Reaction temperature control method and system based on deep reinforcement learning
CN114527642A (en) * 2022-03-03 2022-05-24 东北大学 AGV automatic PID parameter adjusting method based on deep reinforcement learning
CN114721408A (en) * 2022-04-18 2022-07-08 哈尔滨理工大学 Underwater robot path tracking method based on reinforcement learning
CN114839884A (en) * 2022-07-05 2022-08-02 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114967713A (en) * 2022-07-28 2022-08-30 山东大学 Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN114954840A (en) * 2022-05-30 2022-08-30 武汉理工大学 Stability changing control method, system and device for stability changing ship and storage medium
CN114995137A (en) * 2022-06-01 2022-09-02 哈尔滨工业大学 Rope-driven parallel robot control method based on deep reinforcement learning
CN115330276A (en) * 2022-10-13 2022-11-11 北京云迹科技股份有限公司 Method and device for robot to automatically select elevator based on reinforcement learning
CN115562345A (en) * 2022-10-28 2023-01-03 北京理工大学 Unmanned aerial vehicle detection track planning method based on deep reinforcement learning
CN115657683A (en) * 2022-11-14 2023-01-31 中国电子科技集团公司第十研究所 Unmanned and cableless submersible real-time obstacle avoidance method capable of being used for inspection task
WO2023019536A1 (en) * 2021-08-20 2023-02-23 上海电气电站设备有限公司 Deep reinforcement learning-based photovoltaic module intelligent sun tracking method
CN115826594A (en) * 2023-02-23 2023-03-21 北京航空航天大学 Unmanned underwater vehicle switching topology formation control method independent of dynamic model parameters
CN115857556A (en) * 2023-01-30 2023-03-28 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning
CN115855226A (en) * 2023-02-24 2023-03-28 青岛科技大学 Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion
CN116295449A (en) * 2023-05-25 2023-06-23 吉林大学 Method and device for indicating path of autonomous underwater vehicle
CN116578102A (en) * 2023-07-13 2023-08-11 清华大学 Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium
CN116827685A (en) * 2023-08-28 2023-09-29 成都乐超人科技有限公司 Dynamic defense strategy method of micro-service system based on deep reinforcement learning
CN114089633B (en) * 2021-11-19 2024-04-26 江苏科技大学 Multi-motor coupling driving control device and method for underwater robot

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120188365A1 (en) * 2009-07-20 2012-07-26 Precitec Kg Laser processing head and method for compensating for the change in focus position in a laser processing head
KR101545731B1 (en) * 2014-04-30 2015-08-20 인하대학교 산학협력단 System and method for video tracking
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN107368076A (en) * 2017-07-31 2017-11-21 中南大学 Robot motion's pathdepth learns controlling planning method under a kind of intelligent environment
CN107856035A (en) * 2017-11-06 2018-03-30 深圳市唯特视科技有限公司 A kind of robustness dynamic motion method based on intensified learning and whole body controller

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI ZHOU et al., "AUV Based Source Seeking with Estimated Gradients", Journal of Systems Science & Complexity *
RUNSHENG YU et al., "Deep Reinforcement Learning Based Optimal Trajectory Tracking Control of Autonomous Underwater Vehicle", Proceedings of the 36th Chinese Control Conference *
DUAN Yong et al., "Evolutionary Reinforcement Learning and Its Application in Robot Path Tracking", Control and Decision *
MA Qiongxiong et al., "Optimal Trajectory Control of Underwater Robots Based on Deep Reinforcement Learning", Journal of South China Normal University (Natural Science Edition) *

Cited By (110)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109361700A (en) * 2018-12-06 2019-02-19 郑州航空工业管理学院 A kind of unmanned plane self-organizing network system adaptive recognition protocol frame
US10955853B2 (en) 2018-12-18 2021-03-23 Beijing Voyager Technology Co., Ltd. Systems and methods for autonomous driving
US11669097B2 (en) 2018-12-18 2023-06-06 Beijing Voyager Technology Co., Ltd. Systems and methods for autonomous driving
TWI706238B (en) * 2018-12-18 2020-10-01 大陸商北京航跡科技有限公司 Systems and methods for autonomous driving
CN109719721A (en) * 2018-12-26 2019-05-07 北京化工大学 A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait
CN109719721B (en) * 2018-12-26 2020-07-24 北京化工大学 Adaptive gait autonomous emerging method of snake-like search and rescue robot
CN109726866A (en) * 2018-12-27 2019-05-07 浙江农林大学 Unmanned boat paths planning method based on Q learning neural network
CN109696830A (en) * 2019-01-31 2019-04-30 天津大学 The reinforcement learning adaptive control method of small-sized depopulated helicopter
CN109696830B (en) * 2019-01-31 2021-12-03 天津大学 Reinforced learning self-adaptive control method of small unmanned helicopter
CN109960259B (en) * 2019-02-15 2021-09-24 青岛大学 Multi-agent reinforcement learning unmanned guided vehicle path planning method based on gradient potential
CN109960259A (en) * 2019-02-15 2019-07-02 青岛大学 A kind of unmanned guiding vehicle paths planning method of the multiple agent intensified learning based on gradient gesture
CN109828463A (en) * 2019-02-18 2019-05-31 哈尔滨工程大学 A kind of adaptive wave glider bow of ocean current interference is to control method
CN109828467A (en) * 2019-03-01 2019-05-31 大连海事大学 A kind of the unmanned boat intensified learning controller architecture and design method of data-driven
CN109765916A (en) * 2019-03-26 2019-05-17 武汉欣海远航科技研发有限公司 A kind of unmanned surface vehicle path following control device design method
CN109870162A (en) * 2019-04-04 2019-06-11 北京航空航天大学 A kind of unmanned plane during flying paths planning method based on competition deep learning network
CN110083064B (en) * 2019-04-29 2022-02-15 辽宁石油化工大学 Network optimal tracking control method based on non-strategy Q-learning
CN110083064A (en) * 2019-04-29 2019-08-02 辽宁石油化工大学 A kind of network optimal tracking control method based on non-strategy Q-learning
CN110045614A (en) * 2019-05-16 2019-07-23 河海大学常州校区 A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning
CN110428615B (en) * 2019-07-12 2021-06-22 中国科学院自动化研究所 Single intersection traffic signal control method, system and device based on deep reinforcement learning
CN110428615A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 Isolated intersection traffic signal control method, system and device based on deep reinforcement learning
CN110362089A (en) * 2019-08-02 2019-10-22 大连海事大学 A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN110321666A (en) * 2019-08-09 2019-10-11 重庆理工大学 Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm
CN110321666B (en) * 2019-08-09 2022-05-03 重庆理工大学 Multi-robot path planning method based on priori knowledge and DQN algorithm
CN110333739A (en) * 2019-08-21 2019-10-15 哈尔滨工程大学 A kind of AUV conduct programming and method of controlling operation based on intensified learning
CN110333739B (en) * 2019-08-21 2020-07-31 哈尔滨工程大学 AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning
CN110806756B (en) * 2019-09-10 2022-08-02 西北工业大学 Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN110806756A (en) * 2019-09-10 2020-02-18 西北工业大学 Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN110716574B (en) * 2019-09-29 2023-05-02 哈尔滨工程大学 UUV real-time collision avoidance planning method based on deep Q network
CN110716574A (en) * 2019-09-29 2020-01-21 哈尔滨工程大学 UUV real-time collision avoidance planning method based on deep Q network
CN110673602A (en) * 2019-10-24 2020-01-10 驭势科技(北京)有限公司 Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment
CN110673602B (en) * 2019-10-24 2022-11-25 驭势科技(北京)有限公司 Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment
CN110806759A (en) * 2019-11-12 2020-02-18 清华大学 Aircraft route tracking method based on deep reinforcement learning
CN110989576B (en) * 2019-11-14 2022-07-12 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN110989576A (en) * 2019-11-14 2020-04-10 北京理工大学 Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
CN111027677A (en) * 2019-12-02 2020-04-17 西安电子科技大学 Multi-maneuvering-target tracking method based on depth certainty strategy gradient DDPG
CN111091710A (en) * 2019-12-18 2020-05-01 上海天壤智能科技有限公司 Traffic signal control method, system and medium
CN111061277A (en) * 2019-12-31 2020-04-24 歌尔股份有限公司 Unmanned vehicle global path planning method and device
US11747155B2 (en) 2019-12-31 2023-09-05 Goertek Inc. Global path planning method and device for an unmanned vehicle
CN111310384A (en) * 2020-01-16 2020-06-19 香港中文大学(深圳) Wind field cooperative control method, terminal and computer readable storage medium
CN111240345A (en) * 2020-02-11 2020-06-05 哈尔滨工程大学 Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN111240345B (en) * 2020-02-11 2023-04-07 哈尔滨工程大学 Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN111580544A (en) * 2020-03-25 2020-08-25 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN111736617B (en) * 2020-06-09 2022-11-04 哈尔滨工程大学 Track tracking control method for preset performance of benthonic underwater robot based on speed observer
CN111813143B (en) * 2020-06-09 2022-04-19 天津大学 Underwater glider intelligent control system and method based on reinforcement learning
CN111813143A (en) * 2020-06-09 2020-10-23 天津大学 Underwater glider intelligent control system and method based on reinforcement learning
CN111736617A (en) * 2020-06-09 2020-10-02 哈尔滨工程大学 Speed observer-based benthonic underwater robot preset performance track tracking control method
CN111856936A (en) * 2020-07-21 2020-10-30 天津蓝鳍海洋工程有限公司 Control method for underwater high-flexibility operation platform with cable
CN111856936B (en) * 2020-07-21 2023-06-02 天津蓝鳍海洋工程有限公司 Control method for cabled underwater high-flexibility operation platform
CN112100834A (en) * 2020-09-06 2020-12-18 西北工业大学 Underwater glider attitude control method based on deep reinforcement learning
CN112132263A (en) * 2020-09-11 2020-12-25 大连理工大学 Multi-agent autonomous navigation method based on reinforcement learning
CN112162555B (en) * 2020-09-23 2021-07-16 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112162555A (en) * 2020-09-23 2021-01-01 燕山大学 Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet
CN112148025A (en) * 2020-09-24 2020-12-29 东南大学 Unmanned aerial vehicle stability control algorithm based on integral compensation reinforcement learning
CN112179367A (en) * 2020-09-25 2021-01-05 广东海洋大学 Intelligent autonomous navigation method based on deep reinforcement learning
CN112179367B (en) * 2020-09-25 2023-07-04 广东海洋大学 Intelligent autonomous navigation method based on deep reinforcement learning
CN112241176B (en) * 2020-10-16 2022-10-28 哈尔滨工程大学 Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment
CN112241176A (en) * 2020-10-16 2021-01-19 哈尔滨工程大学 Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment
CN112558465A (en) * 2020-12-03 2021-03-26 大连海事大学 Unknown unmanned ship finite time reinforcement learning control method with input limitation
CN112506210A (en) * 2020-12-04 2021-03-16 东南大学 Unmanned aerial vehicle control method for autonomous target tracking
CN112506210B (en) * 2020-12-04 2022-12-27 东南大学 Unmanned aerial vehicle control method for autonomous target tracking
CN112462792A (en) * 2020-12-09 2021-03-09 哈尔滨工程大学 Underwater robot motion control method based on Actor-Critic algorithm
CN112698572A (en) * 2020-12-22 2021-04-23 西安交通大学 Structural vibration control method, medium and equipment based on reinforcement learning
CN112929900A (en) * 2021-01-21 2021-06-08 华侨大学 MAC protocol for realizing time domain interference alignment based on deep reinforcement learning in underwater acoustic network
CN112929900B (en) * 2021-01-21 2022-08-02 华侨大学 MAC protocol for realizing time domain interference alignment based on deep reinforcement learning in underwater acoustic network
CN113029123A (en) * 2021-03-02 2021-06-25 西北工业大学 Multi-AUV collaborative navigation method based on reinforcement learning
CN113052372A (en) * 2021-03-17 2021-06-29 哈尔滨工程大学 Dynamic AUV tracking path planning method based on deep reinforcement learning
CN113052372B (en) * 2021-03-17 2022-08-02 哈尔滨工程大学 Dynamic AUV tracking path planning method based on deep reinforcement learning
CN113095500B (en) * 2021-03-31 2023-04-07 南开大学 Robot tracking method based on multi-agent reinforcement learning
CN113095500A (en) * 2021-03-31 2021-07-09 南开大学 Robot tracking method based on multi-agent reinforcement learning
CN113095463A (en) * 2021-03-31 2021-07-09 南开大学 Robot confrontation method based on evolution reinforcement learning
CN113370205A (en) * 2021-05-08 2021-09-10 浙江工业大学 Baxter mechanical arm track tracking control method based on machine learning
CN113370205B (en) * 2021-05-08 2022-06-17 浙江工业大学 Baxter mechanical arm track tracking control method based on machine learning
CN113359448A (en) * 2021-06-03 2021-09-07 清华大学 Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics
CN113595768A (en) * 2021-07-07 2021-11-02 西安电子科技大学 Distributed cooperative transmission algorithm for guaranteeing control performance of mobile information physical system
CN113467248A (en) * 2021-07-22 2021-10-01 南京大学 Fault-tolerant control method for unmanned aerial vehicle sensor during fault based on reinforcement learning
WO2023019536A1 (en) * 2021-08-20 2023-02-23 上海电气电站设备有限公司 Deep reinforcement learning-based photovoltaic module intelligent sun tracking method
CN113821035A (en) * 2021-09-22 2021-12-21 北京邮电大学 Unmanned ship trajectory tracking control method and device
CN113829351B (en) * 2021-10-13 2023-08-01 广西大学 Cooperative control method of mobile mechanical arm based on reinforcement learning
CN113829351A (en) * 2021-10-13 2021-12-24 广西大学 Collaborative control method of mobile mechanical arm based on reinforcement learning
CN113885330A (en) * 2021-10-26 2022-01-04 哈尔滨工业大学 Information physical system safety control method based on deep reinforcement learning
CN113885330B (en) * 2021-10-26 2022-06-17 哈尔滨工业大学 Information physical system safety control method based on deep reinforcement learning
CN114089633B (en) * 2021-11-19 2024-04-26 江苏科技大学 Multi-motor coupling driving control device and method for underwater robot
CN114089633A (en) * 2021-11-19 2022-02-25 江苏科技大学 Multi-motor coupling drive control device and method for underwater robot
CN114020001A (en) * 2021-12-17 2022-02-08 中国科学院国家空间科学中心 Mars unmanned aerial vehicle intelligent control method based on depth certainty strategy gradient learning
CN114357884A (en) * 2022-01-05 2022-04-15 厦门宇昊软件有限公司 Reaction temperature control method and system based on deep reinforcement learning
CN114527642A (en) * 2022-03-03 2022-05-24 东北大学 AGV automatic PID parameter adjusting method based on deep reinforcement learning
CN114527642B (en) * 2022-03-03 2024-04-02 东北大学 Method for automatically adjusting PID parameters by AGV based on deep reinforcement learning
CN114721408A (en) * 2022-04-18 2022-07-08 哈尔滨理工大学 Underwater robot path tracking method based on reinforcement learning
CN114954840B (en) * 2022-05-30 2023-09-05 武汉理工大学 Method, system and device for controlling stability of ship
CN114954840A (en) * 2022-05-30 2022-08-30 武汉理工大学 Stability changing control method, system and device for stability changing ship and storage medium
CN114995137B (en) * 2022-06-01 2023-04-28 哈尔滨工业大学 Rope-driven parallel robot control method based on deep reinforcement learning
CN114995137A (en) * 2022-06-01 2022-09-02 哈尔滨工业大学 Rope-driven parallel robot control method based on deep reinforcement learning
CN114839884B (en) * 2022-07-05 2022-09-30 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114839884A (en) * 2022-07-05 2022-08-02 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114967713A (en) * 2022-07-28 2022-08-30 山东大学 Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN114967713B (en) * 2022-07-28 2022-11-29 山东大学 Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN115330276A (en) * 2022-10-13 2022-11-11 北京云迹科技股份有限公司 Method and device for robot to automatically select elevator based on reinforcement learning
CN115330276B (en) * 2022-10-13 2023-01-06 北京云迹科技股份有限公司 Method and device for robot to automatically select elevator based on reinforcement learning
CN115562345A (en) * 2022-10-28 2023-01-03 北京理工大学 Unmanned aerial vehicle detection track planning method based on deep reinforcement learning
CN115657683A (en) * 2022-11-14 2023-01-31 中国电子科技集团公司第十研究所 Unmanned and cableless submersible real-time obstacle avoidance method capable of being used for inspection task
CN115857556A (en) * 2023-01-30 2023-03-28 中国人民解放军96901部队 Unmanned aerial vehicle collaborative detection planning method based on reinforcement learning
CN115826594A (en) * 2023-02-23 2023-03-21 北京航空航天大学 Unmanned underwater vehicle switching topology formation control method independent of dynamic model parameters
CN115855226B (en) * 2023-02-24 2023-05-30 青岛科技大学 Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion
CN115855226A (en) * 2023-02-24 2023-03-28 青岛科技大学 Multi-AUV cooperative underwater data acquisition method based on DQN and matrix completion
CN116295449A (en) * 2023-05-25 2023-06-23 吉林大学 Method and device for indicating path of autonomous underwater vehicle
CN116295449B (en) * 2023-05-25 2023-09-12 吉林大学 Method and device for indicating path of autonomous underwater vehicle
CN116578102B (en) * 2023-07-13 2023-09-19 清华大学 Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium
CN116578102A (en) * 2023-07-13 2023-08-11 清华大学 Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium
CN116827685A (en) * 2023-08-28 2023-09-29 成都乐超人科技有限公司 Dynamic defense strategy method of micro-service system based on deep reinforcement learning
CN116827685B (en) * 2023-08-28 2023-11-14 成都乐超人科技有限公司 Dynamic defense strategy method of micro-service system based on deep reinforcement learning

Also Published As

Publication number Publication date
CN108803321B (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN108803321A (en) Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN107748566B (en) Underwater autonomous robot fixed depth control method based on reinforcement learning
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN109655066A (en) A kind of unmanned plane path planning method based on the Q(λ) algorithm
CN111240345B (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN108319293A (en) A kind of UUV real-time collision-free planning method based on LSTM networks
CN110362089A (en) A method of the unmanned boat independent navigation based on deeply study and genetic algorithm
CN107255923A (en) Underactuated unmanned boat trajectory tracking control method based on identified RBF ICA CMAC neural networks
CN109634307A (en) A kind of compound trajectory tracking control method of UAV navigation
CN113052372B (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN109189103B (en) Under-actuated AUV trajectory tracking control method with transient performance constraint
CN115016496A (en) Water surface unmanned ship path tracking method based on deep reinforcement learning
CN108334677A (en) A kind of UUV real-time collision-free planning method based on GRU networks
CN106338919A (en) USV (Unmanned Surface Vehicle) track tracking control method based on enhanced learning type intelligent algorithm
CN111240356A (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN114115262B (en) Multi-AUV actuator saturation cooperative formation control system and method based on azimuth information
Fang et al. Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning
CN110658814A (en) Self-adaptive ship motion modeling method applied to ship motion control
CN113741449A (en) Multi-agent control method for air-sea cooperative observation task
Zang et al. Standoff tracking control of underwater glider to moving target
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
Jin et al. Soft formation control for unmanned surface vehicles under environmental disturbance using multi-task reinforcement learning
Meng et al. A Fully-Autonomous Framework of Unmanned Surface Vehicles in Maritime Environments Using Gaussian Process Motion Planning
Song et al. Surface path tracking method of autonomous surface underwater vehicle based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant