CN109726866A

CN109726866A - Unmanned boat paths planning method based on Q learning neural network

Info

Publication number: CN109726866A
Application number: CN201811612058.6A
Authority: CN
Inventors: 冯海林; 吕扬民; 方益明; 周国模
Original assignee: Zhejiang A&F University ZAFU
Current assignee: Zhejiang A&F University ZAFU
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2019-05-07

Abstract

The invention discloses unmanned boat paths planning method of the kind based on Q learning neural network, comprising the following steps: a), initializes memory block D；B), Q network, state, movement initial value are initialized；C), training objective is set at random；D), random selection acts a_t, obtain currently rewarding r_t, subsequent time state s_t+1, by (s_t,a_t,r_t,s_t+1) be stored in the D of memory block；E), stochastical sampling batch of data is trained from the D of memory block, i.e. a batch (s_t,a_t,r_t,s_t+1), the state when USV reaches target position, or when more than every wheel maximum time is regarded as end-state；If f), s_t+1It is not end-state, then return step d, if s_t+1It is end-state, then updates Q network parameter, and return step d, algorithm terminates after repeating n wheel；G), target is set, carries out path planning with the Q network after training, until USV reaches target position.Decision-making time of the invention is short, route is more optimized, can satisfy the requirement of real-time planned online.

Description

Unmanned boat paths planning method based on Q learning neural network

Technical field

The invention belongs to unmanned ship field of intelligent control, are specifically related to a kind of nothing based on Q learning neural network People's ship paths planning method.

Background technique

Water quality monitoring is the main method of water quality assessment and prevention water pollution.With increasing for industrial wastewater, water pollution The problem of getting worse, the demand of water pollution dynamic monitoring is very urgent.But because traditional water quality monitoring method step It is various, time-consuming, but the data diversity that gets, accuracy are much unsatisfactory for the demand of decision.It is more according to the above problem Kind of water quality monitoring method is suggested, and if Cao Lijie et al. is proposed by establishing sensor network, obtains more that accurately water quality is anti- Drill model.Field et al. proposes to obtain the water quality parameter distribution in monitoring waters to satellite data progress inverting by water quality model Figure.But above method can not neatly replace monitoring waters, and project amount is big, step is various, in contrast water quality monitoring without Not by the influence of topography, energy continuity carries out multinomial water quality parameter monitoring in situ, makes in people's hull product small easy to carry, monitoring field Monitoring result has more diversity and accuracy.

Unmanned ship (Unmanned Surface Vehicle, USV) is that one kind can be under unknown water environment certainly Main navigation, and the Water surface motion platform completed various tasks, because extensive by application field, research contents be related to automatic Pilot, Automatic obstacle avoiding, navigation planning and pattern-recognition etc. are many-sided.It can not only be used to the clearance of military field, scout and anti- Latent operation etc. can be also used for hydrometeorology detection, environmental monitoring and search and rescue waterborne of civil field etc..But by In the mobility of water, Various Complex landform can be flowed through, staff can not detect, such as when water flows through cave；Or again due to Weather it is changeable, if waters is chronically at foggy days, keep staff unsighted, can not accurately to USV real-time operation, The autonomous navigation that can use USV reaches target water level and is detected, and autonomous navigation function passage path planning technology is subject to It realizes.

USV Path Planning Technique refers to USV in operation waters, and according to certain performance indicator, (such as distance is most short, the time It is most short etc.) search obtains a nothing from starting point to target point and touches path, and it is core component in USV airmanship, even more Represent USV intelligent level standard.Currently used planing method mainly have particle swarm algorithm, A* algorithm, Visual Graph method, Artificial Potential Field Method, ant group algorithm etc., but its method is chiefly used under the conditions of known environment.

Preferable solution has been obtained currently for the trajectory planning problem under known environment, but USV is in unknown waters It is unable to get the environmental information that will monitor waters before job execution task, the road based on known environment information can not be passed through Diameter planing method goes planning USV navigation path.Secondly because monitoring water environment is complicated, sensor information is more, the meter of system Operator workload is big, and it is poor to cause USV there are real-times, the disadvantages of oscillation before barrier.Therefore USV path planning is badly in need of research calculation Method is simple, strong real-time, and can control the path planning algorithm of the indeterminacy phenomenon in system, and it is therefore necessary to introduce to have The method of independent learning ability, wherein the path planning based on Q learning algorithm is suitable for the path planning in circumstances not known.It is existing Guo Na et al. is on the basis of traditional Q-learning algorithm in research, carries out movement selection using simulated annealing method, solve to explore with The equilibrium problem utilized.Chen Zili et al.] it proposes that genetic algorithm is used to establish new Q value table to carry out static global path rule It draws.Dong Peifang et al. Artificial Potential Field Method is added in Q learning algorithm, using gravitation potential field as initial environment prior information, then it is right Environment is successively searched for, and Q value iteration is accelerated.

Notification number is that the Chinese patent of CN108106623A discloses a kind of unmanned vehicle paths planning method based on flow field, The following steps are included: establishing flow field calculation model according to the barrier in the starting point of vehicle, terminal and environment；With front wheel angle For input quantity, coordinate and course angle are quantity of state, establish vehicle kinematics model；Using vehicle kinematics model as rolling Equation, the rolling time horizon optimization problem for solving flow field are obtained using flow field velocity vector distribution as the guidance information of path planning To planning path, wherein optimized amount is front wheel angle, optimization aim include vehicle movement with flow field movement reach it is consistent and It does not collide with barrier during vehicle movement, constraint condition is that front wheel angle is no more than steering wheel hard-over.The party Method can find connection source and terminal, smooth and avoidance path in complicated landform.Under the premise of avoidance, path Slickness and completeness obtain preferable effect simultaneously.But this method needs to know the position of environment terrain and barrier, cannot Path planning is carried out for unknown field.

Summary of the invention

Goal of the invention: in order to overcome the deficiencies in the prior art, the invention proposes one kind to be based on BP neural network Q learn intensified learning path planning algorithm, with neural network fitting Q learning method in Q function, enable it to continuous System mode as input, and by experience replay and be arranged target network method significantly improve network in the training process Convergence rate.By experiment simulation, the feasibility of modified two-step method planing method presented here is demonstrated.

Technical solution: to achieve the above object, a kind of unmanned boat path planning based on Q learning neural network of the invention A kind of method, comprising the following steps: unmanned boat paths planning method based on Q learning neural network, which is characterized in that including Following steps:

A), memory block D is initialized；

B), Q network, state, movement initial value are initialized；It include following element: S, A, P in Q network_s,α, R, wherein S table Show the set of system mode locating for USV, A indicates the set for the movement that USV can take, P_s,αIndicate that systematic state transfer is general Rate, R indicate reward function；

C), training objective is set at random；

D), random selection acts a_t, obtain currently rewarding r_t, subsequent time state s_t+1, by (s_t,a_t,r_t,s_t+1) be stored to In storage area D；

E), stochastical sampling batch of data is trained from the D of memory block, i.e. a batch (s_t,a_t,r_t,s_t+1), when USV reaches Target position, or more than it is every wheel maximum time when state be regarded as end-state；

If f), s_t+1It is not end-state, then return step d, if s_t+1It is end-state, then updates Q network parameter, And return step d, algorithm terminates after repeating n wheel；

G), target is set, carries out path planning with the Q network after training, until USV reaches target position.

Preferably, memory block D is experience replay memory block in step a), for storing USV navigation process soldier acquisition Training sample, the presence of experience replay not being continuous in time between multiple samples when each training.

Preferably, the algorithmic rule of Q network are as follows:

Q(s_t,a_t)=Q (s_t,a_t)+αδ'_t

Wherein, function Q (s_t, a_t) it is in state s_tShi Zhihang acts a_t, α is learning rate, δ '_tFor TD (0) deviation, TD (0) 0 in indicate is 1 step, the values seen forward under current state more are as follows:

δ'_t=R (s_t)+γV(s_t+1)-Q(s_t,a_t)

Wherein, γ is discount factor, and R (s) is reward function, and V (s) is value function, value function Alternatively, it is also possible to which TD (0) deviation is defined as

δ_t+1=R (s_t+1)+γV(s_t+2)-V(s_t+1)

Wherein, δ_t+1For TD (0) deviation, R (s) is reward function, and V (s) is value function,

Come to carry out discount to the TD deviation in step in future using another discount factor λ ∈ [0,1],

Wherein, function Q (s_t,a_t) it is in state s_tShi Zhihang acts a_t, α is learning rate,For the deviation of TD (λ), TD (λ) is to see that λ is walked forward under current state more；

The deviation of TD (λ) hereinIt is defined as

Wherein, δ '_tThe deviation for the study of represent over,The deviation of multistep study is carried out, γ is discount factor, λ For discount factor, and λ ∈ [0,1], δ_t+iFor the deviation learnt now.

Preferably, by η^t(s a) is defined as characteristic function: t moment (s, a) occur, then return to 1, otherwise return to 0, To put it more simply, ignore learning efficiency, (s a) defines one with trace e to each_t(s,a)

So it is in moment t online updating

Q (s, a)=Q (s, a)+α [δ '_tη^t(s,a)+

δ_te_t(s,a)]

Wherein, (s is a) the execution movement a in state s to function Q, and α is learning rate, η^t(s a) is characterized function, e_t(s, A) for trace, δ '_tThe deviation for the study of represent over, δ₁For the deviation learnt now, δ₁It is by adding up back Report the deviation δ ' of R (s) and current estimation V (s)_t, and update to obtain multiplied by learning rate with the deviation.

Preferably, the overall income expectation harvested when intensified learning wishes to run system is maximum, i.e. E (R (s₀)+γ R(s₁)+γ²R(s₂)+...) it is maximum, need to find an optimal policy π thus, so that when USV carries out decision and movement according to π When, the total revenue of acquisition is maximum,

The objective function of intensified learning is one of:

V^π(s)=E (R (s₀)+γR(s₁)+

γ²R(s₂)+…|s₀=s, π)

Q^π(s, a)=E (R (s₀)+γR(s₁)+

γ²R(s₂)+…|s₀=s, a₀=a, π)

Wherein, V^π(s) it indicates at current original state s, can be obtained expectation according to the decision movement of tactful π and receive Benefit；And Q^π(s a) is indicated to take movement a at current state s, be moved in the state of all later all in accordance with the decision of tactful π It can be obtained expected revenus, E (R (s₀)+γR(s₁)+γ²R(s₂)+...) it is the overall income phase harvested when system operation It hopes, R (s_t) indicate t moment reward function, γ is discount factor；

Purpose is exactly to find optimal policy π in Q study^*, so that

Preferably, definitionQ^π(s, a) refer in state s execution act a, and it The decision all done according to optimal policy in the state of all afterwards moves the expected revenus size that can be harvested, it is assumed that Q^*(s,a) It is known that so can be easily by Q^*(s a) generates π^*As long as making to each sIt sets up, In this way, the problem of seeking optimal policy, which translates into, seeks Q^*(s, a), due to having:

Q^*(s, a)=R (s₀)+γE(R(s₁)+

γR(s₂)+…|s₁,a₁)

Q^π(s, a) indicate taken at current state s movement a, it is all later in the state of all in accordance with tactful π decision Movement can be obtained expected revenus, E (R (s₀)+γR(s₁)+γ²R(s₂)+...) it is the overall income harvested when system operation It is expected that R (s_t) indicate t moment reward function, γ is discount factor；

And a₁By π^*It determines, then:

a₁Indicate the movement taken under optimal policy, π^*(s₁) indicate optimal policy, Q^π(s a) is indicated in current state s Under take movement a, it is all later in the state of all in accordance with tactful π decision movement can be obtained expected revenus,

Then according to the graceful equation of Buddhist script written on pattra leaves, Q function is iterated and is found out.

Preferably, the graceful equation of Buddhist script written on pattra leaves is with recursive formal definition Q^*(s, a), so that Q function can be iterated It finds out, the graceful equation of Buddhist script written on pattra leaves are as follows:

Wherein, Q^π(s, a) indicate taken at current state s movement a, it is all later in the state of all in accordance with tactful π's Decision movement can be obtained expected revenus, R (s₀) represent reward function, η¹(s a) is characterized function, e₁(s, a) indicate with Trace, δ '_tThe deviation for the study of represent over, δ₁For the deviation learnt now, δ₁It is by accumulative return R (s) With the deviation δ ' of current estimation V (s)_t, and update to obtain multiplied by learning rate with the deviation.

Preferably, reward function is divided into 3 kinds, the first is rewarded at a distance from target position for USV；Second Kind is that USV arrival target position is rewarded；The third is punished for USV and barrier collision；Specifically:

Preferably, the value range for repeating to take turns number n is 3000-5000 in step f.

The utility model has the advantages that

The present invention compared with the prior art, has the advantage, that

1, the method for intensified learning of the invention solve water quality monitoring unmanned boat when unknown waters carries out water quality monitoring oneself Leading bit path planning problem is fitted Q function by BP neural network, so that trained strategy can be according to current The real time information of barrier makes a policy in environment.

2, method of the invention can be such that water quality monitoring unmanned boat is cooked up in circumstances not known according to different states feasible Path, and the decision-making time is short, route is more optimized, can satisfy the requirement of real-time planned online, to overcome traditional Q The disadvantage that paths planning method is computationally intensive, convergence rate is slow is practised, it can first time monitoring problem waters.

3, the present invention is enabled it to the Q function in neural network fitting Q learning method with continuous system mode work To input, and the convergence rate of network in the training process is significantly improved by experience replay and setting target network method.

4, the present invention improves traditional Q-learning, realizes Q value iteration, the output pair of network using BP neural network Should each movement Q value, the state of the corresponding description environment of the input of network.

5, the present invention returns to different reward values for different situations, so that USV exists by the design of reward function When study and exploration more efficiently.

Detailed description of the invention

Fig. 1 is；Overall flow figure of the invention；

Fig. 2 is；The analogous diagram of complex water areas landform；

Fig. 3 is: complex water areas landform is actually reached a little absolute error figure figure at a distance from target point；

Fig. 4 is: the analogous diagram in simple concentric circles labyrinth；

Fig. 5 is: simple concentric circles labyrinth is actually reached a little absolute error figure figure at a distance from target point；

Fig. 6 is: the analogous diagram of complex maze；

Fig. 7 is: complex maze is actually reached a little absolute error figure figure at a distance from target point；

Fig. 8 is: the simulation result diagram of East Lake background；

Fig. 9 is: East Lake background is actually reached a little absolute error figure figure at a distance from target point；

Figure 10 is: the number of iterations figure of East Lake background.

Specific embodiment

The present invention will be further explained with reference to the accompanying drawings and examples.

Embodiment one:

A kind of unmanned boat paths planning method based on Q learning neural network of the present embodiment, comprising the following steps: a kind of Unmanned boat paths planning method based on Q learning neural network, which comprises the following steps:

A), memory block D is initialized；

C), training objective is set at random；

Wherein, D is experience replay memory block, for storing USV navigation process and acquiring training sample.Experience replay In the presence of so that not being continuously, to minimize the correlation between sample in time between multiple samples when training every time Enhance the stability and accuracy of sample；

Embodiment two:

A kind of unmanned boat paths planning method based on Q learning neural network of the present embodiment is based on embodiment one, tradition Q learning algorithm specifically:

Q study is to describe problem, Ma Erke based on markov decision process (Markov Decision Process) Husband's decision process includes 4 elements: S, A, P_s,a,R.Wherein S indicates system mode set locating for USV, i.e., USV is current State and current environment state, such as the size and location of barrier；A indicates the set of actions that USV can take, i.e. USV The direction of rotation；P_s,aExpression system model, i.e. systematic state transfer probability, P (s'| s a) is described at current state s, After execution acts a, system reaches the probability of state s；R indicates reward function, has current state and the movement taken to determine It is fixed.Q study is regarded as and finds strategy and makes overall merit maximum increment type planning, the thought of Q study be do not consider environment because Element, but directly one Q function that can be iterated to calculate of optimization, defined function Q (s_t, a_t) it is in state s_tShi Zhihang acts a_t, And accoumulation of discount reinforcement value when hereafter optimal action sequence executes, it may be assumed that

In formula, s_tFor t moment USV state in which, s_t+1For subsequent time USV state in which, a_tFor holding for t moment Action is made, and γ is discount factor, value 0≤γ≤1；R(s_t) it is reward function, value is positive number or negative.In initial rank In section study, Q value may be improperly to reflect strategy defined in them, initial Q₀(s, a) for all state and Movement is it is assumed that provide.Assuming that the possible set of actions A of state set s, USV of given environment is selectively more, data volume Greatly, a large amount of system memory space is needed to go to store, and can not be extensive.In order to overcome drawbacks described above, traditional Q-learning is carried out It improves, realizes that Q value iteration, the input correspondence of the Q value of the corresponding each movement of the output of network, network are retouched using BP neural network State the state of environment.

Improved Q learning path planning algorithm

Q (λ) algorithm is to use for reference TD (λ) algorithm to generate, and allows data constantly to transmit by the thought of backtracking, so that a certain The movement decision of state is influenced by its succeeding state.If the following a certain decision π is the decision of a failure, when Preceding decision will also undertake corresponding punishment, this influence can be appended to current decision；If the following a certain decision π is one A correct decision equally also will affect in current decision then current decision can also be rewarded accordingly.In conjunction with It can be improved convergence speed of the algorithm after improvement, meet the practicability of study.Improved Q (λ) algorithm updates rule

Q(s_t,a_t)=Q (s_t,a_t)+αδ'_t (2)

Wherein, function Q (s_t, a_t) it is in state s_tShi Zhihang acts a_t, α is learning rate, is TD (0) deviation, TD (0) In 0 indicate is 1 step, the values seen forward under current state more are as follows:

δ'_t=R (s_t)+γV(s_t+1)-Q(s_t,a_t) (3)

δ_t+1=R (s_t+1)+γV(s_t+2)-V(s_t+1) (4)

Wherein, δ_t+1For TD (0) deviation, R (s) is reward function, and V (s) is value function, and what 0 in TD (0) indicated is It is forward under current state to see 1 step more.

Wherein also come to carry out discount to the TD deviation in step in future using discount factor λ ∈ [0,1].

Wherein, function Q (s_t,a_t) it is in state s_tShi Zhihang acts a_t, α is learning rate,

Introducing a new parameter lambda herein can be accomplished by this new parameter in the feelings for not increasing computation complexity The prediction that all step numbers are comprehensively considered under condition is for controlling weight as γ parameter before.TD (λ) is current It is forward under state to see that λ is walked more.

The deviation of TD (λ) hereinIt is defined as

Wherein, δ '_tThe deviation that the study of represent over obtains,Carry out multistep study deviation, γ be discount because Son, λ are discount factor, and λ ∈ [0,1], δ_t+iFor the deviation learnt now.

Embodiment three

A kind of unmanned boat paths planning method based on Q learning neural network of the present embodiment is based on embodiment two, as long as The TD deviation in future is unknown, and update above can not just carry out.But them can be gradually calculated by using with trace. Below by η^t(s a) is defined as characteristic function: in t moment, (s a) occurs, then returns to 1, otherwise return to 0.To put it more simply, ignoring Learning efficiency, to it is each (s, a) define one with trace e_t(s,a)

So it is in moment t online updating

Q (s, a)=Q (s, a)+α [δ '_tη^t(s,a)+

δ_te_t(s,a)] (8)

Example IV:

A kind of unmanned boat paths planning method based on Q learning neural network of the present embodiment is based on embodiment three, strengthens The overall income expectation that study harvests when wishing to run system is maximum, needs to find an optimal policy π thus, so that working as When USV carries out decision and movement according to π, the total revenue of acquisition is maximum.In general, the objective function of intensified learning be it is following wherein One of:

V^π(s)=E (R (s₀)+γR(s₁)+

γ²R(s₂)+…|s₀=s, π)

Q^π(s, a)=E (R (s₀)+γR(s₁)+

γ²R(s₂)+…|s₀=s, a₀=a, π) (9)

Wherein, V^π(s) it indicates at current original state s, can be obtained expectation according to the decision movement of tactful π and receive Benefit；And Q^π(s a) is indicated to take movement a at current state s, be moved in the state of all later all in accordance with the decision of tactful π It can be obtained expected revenus, E (R (s₀)+γR(s₁)+γ²R(s₂)+...) it is the overall income phase harvested when system operation It hopes, R (st) indicates the reward function of t moment, and γ is discount factor.

Purpose is exactly to find optimal policy π in Q study^*, so that

Embodiment five:

A kind of unmanned boat paths planning method based on Q learning neural network of the present embodiment is based on example IV, definitionQ^π(s, a) refer in state s execution act a, and it is all later in the state of all according to most The decision that dominant strategy is done moves the expected revenus size that can be harvested.Assuming that Q^*(s, a) it is known that so can be easily By Q^*(s a) generates π^*As long as making to each sIt sets up.In this way, the problem of seeking optimal policy It translates into and seeks Q^*(s,a).Due to having:

Q^*(s, a)=R (s₀)+γE(R(s₁)+

γR(s₂)+…|s₁,a₁) (10)

Wherein, Q^π(s, a) indicate taken at current state s movement a, it is all later in the state of all in accordance with tactful π's Decision movement can be obtained expected revenus, E (R (s₀)+γR(s₁)+γ²R(s₂)+...) it is the totality harvested when system operation Profit expectation, R (s_t) indicate t moment reward function, γ is discount factor；

And a₁By π^*It determines, then:

Wherein, a₁Indicate the movement taken under optimal policy, π^*(s₁) indicate optimal policy, Q^π(s a) is indicated current Taken under state s movement a, it is all later in the state of all in accordance with tactful π decision movement can be obtained expected revenus, so Afterwards according to the graceful equation of Buddhist script written on pattra leaves, Q function is iterated and is found out.

Embodiment six

A kind of unmanned boat paths planning method based on Q learning neural network of the present embodiment is based on embodiment five, Buddhist script written on pattra leaves Graceful equation is with recursive formal definition Q^*(s a) is found out so that Q function can be iterated, the graceful equation of Buddhist script written on pattra leaves are as follows:

Q function is to save and update in table form in traditional Q-learning algorithm, but in USV obstacle-avoiding route planning, Since barrier is likely to occur any position in space, Q function is difficult to describe in continuous space in table form The barrier of appearance.Therefore herein on Q learning foundation, depth Q study is fitted Q function with BP neural network, inputs shape State s is continuous variable.In general, being difficult to restrain with learning process when nonlinear function approximation Q function, experience replay is used thus Improve study stability with the method for target network.

Embodiment seven:

A kind of unmanned boat paths planning method based on Q learning neural network of the present embodiment is based on embodiment six, strong During chemistry is practised, the design of reward function directly affects the quality of learning effect.In general, reward function corresponds to people to a certain The description of business is incorporated in study by the priori knowledge that the design of reward function can solve task.In USV path planning, Wish to avoid safely bumping against with barrier during USV is also wanted to while reaching target position as early as possible in navigation.This Reward function is divided into 3 kinds by text, the first is rewarded at a distance from target position for USV；Second is USV arrival target Position is rewarded；The third is punished for USV and barrier collision.Reward function are as follows:

From magnitude, first and second kind of reward value is bigger than first reward value.Because coming for USV avoidance task It says, main target is exactly avoiding obstacles and gets to target position, rather than only shortens USV at a distance from target position, The reason of this is added be, if only target position is reached to USV and USV knocks barrier and rewarded and punished, It is 0 that a large amount of step reward will be so had in motion process, this meeting is so that USV will not be improved in most cases Strategy, learning efficiency are low.This reward is added and is equivalent to the priori knowledge that joined people to this task, so that USV is learning With explore when more efficiently.

Embodiment eight:

In order to examine the path planning algorithm designed herein, emulation experiment is carried out on Matlab2014a software herein. In an experiment, simulated environment is the region of 20*20, and discount factor γ value is 0.9, and memory block D is sized to 40000, circulation Number 1000, neural network first layer have 64 neurons, and the second layer has 32 neurons.In trained each round, whenever When USV bumps against barrier or the arrival target position USV, which is all immediately finished, and returns to a reward.

For the accuracy for verifying context of methods, it will be tested using labyrinth landform, three kinds of different terrains will be designed herein Carry out the comparison of algorithm, respectively complex water areas landform (as shown in Figure 2), simple concentric circles labyrinth landform is (such as Fig. 4 institute Show), complex maze landform (as shown in Figure 6).This paper innovatory algorithm is emulated with traditional Q-learning algorithm in the above landform, The innovatory algorithm route that blue represents it can be seen from path profile compares the route of traditional Q-learning algorithm simulating, path length It is shorter, it is simpler and more direct.By be actually reached a little at a distance from target point it can be seen from absolute error figure innovatory algorithm than traditional Q Habit algorithm shifts to an earlier date one third and carries out convergence stabilization.

Embodiment nine:

Experiment simulation is carried out by taking the actual environment background of Linan East Lake waters as an example, as seen from Figure 8, USV is in simulation process In have no that appearance and barrier bump against and path is simple and fast.Fig. 9 is standard error curve, Figure 10 is learning curve, You Shangtu Find out, when frequency of training reaches 56 times, curve tends to be steady, illustrate substantially to cook up a safe and efficient whole route, USV can avoiding obstacles arrival target position in most cases at this time.It therefore deduces that, based on BP neural network Q learning algorithm is improved than traditional Q-learning algorithm, learns convergence rate faster, path is more optimized.

The above is only a preferred embodiment of the present invention, it should be pointed out that: those skilled in the art are come It says, various improvements and modifications may be made without departing from the principle of the present invention, these improvements and modifications should also regard For protection scope of the present invention.

Claims

1. a kind of unmanned boat paths planning method based on Q learning neural network, which comprises the following steps:

A), memory block D is initialized；

B), Q network, state, movement initial value are initialized；It include following element: S, A, P in Q network_s,α, R, wherein wherein S table Show the set of system mode locating for USV, A indicates the set for the movement that USV can take, P_s,αIndicate that systematic state transfer is general Rate, R indicate reward function；

C), training objective is set at random；

D), random selection acts a_t, obtain currently rewarding r_t, subsequent time state s_t+1, by (s_t,a_t,r_t,s_t+1) it is stored to memory block In D；

2. the unmanned boat paths planning method according to claim 1 based on Q learning neural network, which is characterized in that step It is rapid a) in, memory block D be experience replay memory block, for store USV navigation process soldier acquire training sample.

3. the unmanned boat paths planning method according to claim 1 based on Q learning neural network, which is characterized in that Q net The algorithmic rule of network are as follows:

Q(s_t,a_t)=Q (s_t,a_t)+αδ′_t

Wherein, function Q (s_t, a_t) it is in state s_tShi Zhihang acts a_t, α is learning rate, δ '_tFor TD (0) deviation, in TD (0) 0 indicate is 1 step, the values seen forward under current state more are as follows:

δ′_t=R (s_t)+γV(s_t+1)-Q(s_t,a_t)

Wherein, γ is discount factor, and R (s) is reward function, and V (s) is value function, value functionIn addition, TD (0) deviation can also be defined as

δ_t+1=R (s_t+1)+γV(s_t+2)-V(s_t+1)

Q(s_t,a_t)=Q (s_t,a_t)+αδ_t ^λ

Wherein, function Q (s_t,a_t) it is in state s_tShi Zhihang acts a_t, α is learning rate, δ_t ^λFor the deviation of TD (λ), TD (λ) It is to see that λ is walked forward under current state more；

The deviation δ of TD (λ) herein_t ^λIt is defined as

Wherein, δ '_tThe deviation that the study of represent over obtains, δ_t ^λThe deviation of multistep study is carried out, γ is discount factor, λ For discount factor, and λ ∈ [0,1], δ_t+iFor the deviation learnt now.

4. the unmanned boat paths planning method according to claim 1 based on Q learning neural network, which is characterized in that by η^t (s a) is defined as characteristic function: t moment (s a) occurs, then returns to 1, otherwise return to 0, to put it more simply, ignore learning efficiency, To it is each (s, a) define one with trace e_t(s,a)

So it is in moment t online updating

Q (s, a)=Q (s, a)+α [δ '_tη^t(s,a)+

δ_te_t(s,a)]

Wherein, (s is a) the execution movement a in state s to function Q, and α is learning rate, η^t(s a) is characterized function, e_t(s a) is With trace, δ '_tThe deviation for the study of represent over, δ₁For the deviation learnt now.

5. the unmanned boat paths planning method according to claim 4 based on Q learning neural network, which is characterized in that strong Chemistry practises the overall income expectation maximum harvested when wishing to run system, needs to find an optimal policy π thus, so that working as When USV carries out decision and movement according to π, the total revenue of acquisition is maximum,

The objective function of intensified learning is one of:

V^π(s)=E (R (s₀)+γR(s₁)+

γ²R(s₂)+…|s₀=s, π)

Q^π(s, a)=E (R (s₀)+γR(s₁)+

γ²R(s₂)+…|s₀=s, a₀=a, π)

Wherein, V^π(s) it indicates at current original state s, can be obtained expected revenus according to the decision movement of tactful π；And Q^π (s a) indicates to take acting a at current state s, can obtain in the state of all later all in accordance with the decision movement of tactful π Expected revenus, E (R (s₀)+γR(s₁)+γ²R(s₂)+...) it is the overall income expectation harvested when system operation, R (s_t) indicate The reward function of t moment, γ are discount factor；

Purpose is exactly to find optimal policy π in Q study^*, so that

6. the unmanned boat paths planning method according to claim 5 based on Q learning neural network, which is characterized in that fixed JusticeQ^π(s, a) refer in state s execution act a, and it is all later in the state of all bases The decision that optimal policy is done moves the expected revenus size that can be harvested, it is assumed that Q^*(s, a) it is known that so can be easily By Q^*(s a) generates π^*As long as making to each sIt sets up, in this way, the problem of seeking optimal policy It translates into and seeks Q^*(s, a), due to having:

Q^*(s, a)=R (s₀)+γE(R(s₁)+

γR(s₂)+…|s₁,a₁)

Q^π(s, a) indicate taken at current state s movement a, it is all later in the state of all in accordance with tactful π decision move institute Obtainable expected revenus, E (R (s₀)+γR(s₁)+γ²R(s₂)+...) it is the overall income expectation harvested when system operation, R (s_t) indicate t moment reward function, γ is discount factor；

And a₁By π^*It determines, then:

a₁Indicate the movement taken under optimal policy, π^*(s₁) indicate optimal policy, Q^π(s a) indicates to take at current state s Act a, it is all later in the state of all in accordance with tactful π decision movement can be obtained expected revenus,

7. the unmanned boat paths planning method according to claim 6 based on Q learning neural network, which is characterized in that shellfish The graceful equation of text is with recursive formal definition Q^*(s a) is found out so that Q function can be iterated, the graceful equation of Buddhist script written on pattra leaves are as follows:

Wherein, Q^π(s a) is indicated to take movement a at current state s, be transported in the state of all later all in accordance with the decision of tactful π It is dynamic to can be obtained expected revenus, R (s₀) represent reward function, η¹(s a) is characterized function, e₁(s a) is indicated with trace, δ '_t The deviation that the study of represent over obtains, δ₁For the deviation learnt now, δ₁It is R (s) and current by adding up to return Estimate the deviation δ ' of V (s)_t, and update to obtain multiplied by learning rate with the deviation.

8. the unmanned boat paths planning method according to claim 6 or 7 based on Q learning neural network, which is characterized in that Reward function is divided into 3 kinds, the first is rewarded at a distance from target position for USV；Second is USV arrival target position It is rewarded；The third is punished for USV and barrier collision；Specifically:

9. the unmanned boat paths planning method according to claim 1 based on Q learning neural network, which is characterized in that step In rapid f, the value range for repeating to take turns number n is 3000-5000.