CN109655066A - UAV path planning method based on the Q(λ) algorithm - Google Patents
UAV path planning method based on the Q(λ) algorithm Download PDF Info
- Publication number
- CN109655066A CN109655066A CN201910071929.6A CN201910071929A CN109655066A CN 109655066 A CN109655066 A CN 109655066A CN 201910071929 A CN201910071929 A CN 201910071929A CN 109655066 A CN109655066 A CN 109655066A
- Authority
- CN
- China
- Prior art keywords
- state
- unmanned plane
- value
- space
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 39
- 230000009187 flying Effects 0.000 claims abstract description 19
- 230000008569 process Effects 0.000 claims abstract description 18
- 230000007704 transition Effects 0.000 claims abstract description 10
- 238000004387 environmental modeling Methods 0.000 claims abstract description 8
- 238000010276 construction Methods 0.000 claims abstract description 7
- 238000013461 design Methods 0.000 claims abstract description 6
- 230000006870 function Effects 0.000 claims description 29
- 238000004364 calculation method Methods 0.000 claims description 23
- 230000009471 action Effects 0.000 claims description 17
- 238000012546 transfer Methods 0.000 claims description 13
- 230000006399 behavior Effects 0.000 claims description 7
- 230000008859 change Effects 0.000 claims description 6
- 230000007613 environmental effect Effects 0.000 claims description 4
- 238000011156 evaluation Methods 0.000 claims description 3
- 239000011159 matrix material Substances 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 238000009825 accumulation Methods 0.000 claims description 2
- 238000011217 control strategy Methods 0.000 claims description 2
- 238000005516 engineering process Methods 0.000 description 3
- 230000003993 interaction Effects 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 230000004888 barrier function Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000012804 iterative process Methods 0.000 description 1
- 230000006386 memory function Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000010355 oscillation Effects 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
- 230000004044 response Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/20—Instruments for performing navigational calculations
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
Landscapes
- Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Aviation & Aerospace Engineering (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
- Catching Or Destruction (AREA)
Abstract
The present invention provides a UAV path planning method based on the Q(λ) algorithm, comprising an environmental modeling step, a Markov decision process model initialization step, a Q(λ) algorithm iteration step, and a step of computing the optimal path from the state value function. First, a grid space is initialized according to the UAV's minimum track-segment length, grid space coordinates are mapped to waypoints, and circular and polygonal threat areas are represented. A Markov decision process model is then established, covering the representation of the UAV flight action space, the design of the state transition probability, and the construction of the reward function. Iterative computation is then performed on the constructed model with the Q(λ) algorithm, and the optimal path that lets the UAV safely avoid the threat areas is computed from the converged state value function. By combining traditional Q-learning with eligibility traces, the present invention improves the convergence speed and accuracy of the value function and guides the UAV to avoid threat areas and plan its path autonomously.
Description
Technical field
The present invention relates to UAVs, and specifically to a UAV path planning method. It belongs to the technical field of heuristic algorithms.
Background art
UAV path planning is an important component of UAV mission planning and a key stage in enabling a UAV to execute tasks autonomously. It requires planning, in an environment whose information is fully known, partially known, or entirely unknown, a flight track from a start point to a target point that circumvents threat areas and obstacles, is safe, reliable, and collision-free, and simultaneously satisfies various constraints. According to how much battlefield environment information is available to the UAV, path planning is divided into global path planning and local path planning.
In practical applications, if the UAV can obtain global environmental knowledge, dynamic programming can be used to realize path planning. However, as battlefield environments grow more complex and uncertain, the UAV has little prior knowledge of the environment, so in practice it must have a strong ability to adapt to dynamic environments. In this situation, techniques that rely on sensor information to perceive threat-area information in real time and perform local path planning show great superiority.
Current local path planning techniques suffer from problems such as easily falling into local minima or local oscillation, high algorithmic time cost, large computational storage requirements, and rules that are difficult to determine. Behavior-based UAV path planning methods have become a current research hot spot; their essence is to map the environmental states perceived by sensors to actuator actions. In behavior-based methods, however, designing state feature vectors and obtaining supervised samples are often extremely difficult in real complex environments. These problems therefore urgently need to be solved.
Summary of the invention
The object of the present invention is to provide a UAV path planning method based on the Q(λ) algorithm. It combines Q-learning with eligibility traces, assigns quantized reward and punishment signals to the environmental states perceived by sensors, and, through continuous interaction with the environment, guides the UAV to plan its path autonomously and safely avoid threat areas. The method responds quickly to changes in the external environment, has the advantages of speed and real-time operation, and improves the UAV's adaptability in unknown or partially unknown environments.
The present invention provides a UAV path planning method based on the Q(λ) algorithm, characterized by comprising the following steps:
Step 1, environmental modeling: using the environmental information collected by sensors, identify the threat areas; model the UAV flight environment with the grid method, discretize the continuous space, generate a uniform grid according to the set space size, and take the grid vertices as the discretized waypoints;
Step 2, initialize the Markov decision process model: initialize a Markov decision process model suitable for solving the UAV path planning problem. The model is expressed by the four-tuple <S, A, P, R>, where S is the state space of the UAV, A is the action space of the UAV, P is the state transition matrix, and R is the reward function. Initializing the Markov decision process model comprises representing the UAV flight action space, designing the state transition probability, and constructing the reward function;
Step 3, iterate with the Q(λ) algorithm on the established model: on the basis of the model established in Steps 1 and 2, perform iterative computation with the Q(λ) algorithm, which combines the Q-learning algorithm with eligibility traces. A state-action value function Q(s, a) is introduced to characterize the value of the UAV taking action a in state s, and a Q table is established to store the value of each state-action pair <s, a>. An eligibility trace function E(s, a) is introduced to express the causal relationship between the terminal state and the state-action pair <s, a>. The Q and E values are initialized first; then, in each learning cycle, the action a taken in state s is chosen by the Boltzmann selection policy. After executing action a and transferring to the next state s′, the value of Q(s, a) is updated by the Q-value update formula, and the E values of all state-action pairs are updated by the E-value update formula. When the terminal state is reached, the current learning cycle ends; after the maximum number of learning cycles is reached, the Q(λ) algorithm iteration process terminates;
Step 4, compute the optimal path from the state value function: after Step 3, a converged state value function is obtained. At each state s, the action a* with the maximum Q value is selected; after taking a*, this deterministic policy is followed until the terminal state is reached. Finally, the nodes in the grid are mapped to longitude and latitude, yielding the optimal path.
As a further refinement of the present invention, the specific steps of the Step 1 environmental modeling are as follows:
Step 1.1: initialize the grid space according to the UAV's minimum track-segment length.
The UAV flies in straight lines between waypoints and changes flight attitude at certain waypoints according to the track requirements. The minimum track-segment length is the shortest distance the UAV must fly straight before it can start changing its flight attitude. Using the minimum track-segment length as the step size yields a discrete grid space that satisfies the UAV's own constraints.
Let the longitude/latitude of the UAV start position be S = (lonS, latS), the longitude/latitude of the target point be T = (lonT, latT), the minimum track-segment length be dmin, and the grid size be m*n. With dmin set as the grid step, m and n are calculated as follows:
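The m, n formula itself is elided in the published text. As an illustration only, a minimal sketch under the assumption that the grid simply spans the start-to-target extent along each axis, divided by the step dmin (the exact formula in the patent may differ):

```python
import math

def grid_size(lon_s, lat_s, lon_t, lat_t, d_min):
    """Assumed form of the m*n grid-size computation: the start-to-target
    extent along each axis divided by the grid step d_min."""
    m = math.ceil(abs(lon_t - lon_s) / d_min)
    n = math.ceil(abs(lat_t - lat_s) / d_min)
    return m, n
```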
Step 1.2: map grid space coordinates to waypoints.
With the grid vertices as the discretized waypoints, a coordinate in the grid space is expressed as (x, y). Let the longitude/latitude corresponding to the grid origin (0, 0) be (lono, lato); then the waypoint longitude/latitude (lonxy, latxy) corresponding to (x, y) is calculated as: lonxy = lono + dmin*x, latxy = lato + dmin*y.
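The Step 1.2 mapping can be sketched directly from the formula above; note that it treats dmin as a step already expressed in degrees of arc:

```python
def grid_to_waypoint(x, y, lon_o, lat_o, d_min):
    """Map grid node (x, y) to its waypoint (lonxy, latxy) per Step 1.2:
    lonxy = lono + dmin*x, latxy = lato + dmin*y."""
    return (lon_o + d_min * x, lat_o + d_min * y)
```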
Step 1.3: represent the threat-area information.
The UAV must account for the spatial positions of threat sources during flight. According to the threat-source category, threat areas are divided into circular areas and polygonal areas. In the grid space, a node containing a threat area is labeled 1 and represents a no-fly region; a node containing no threat area is labeled 0 and represents a flyable region. For a circular threat area, let the center be (lonc, latc) and the radius be r (km). For each node (x, y) in the grid, the distance dxyo from the corresponding waypoint to the threat-area center is calculated with the haversine formula, which computes the distance between two points on a sphere from their longitude/latitude coordinates.
If dxyo ≤ r, the node corresponding to (x, y) is labeled 1; otherwise it is labeled 0. For a polygonal threat area, a horizontal ray is cast from the waypoint (lonxy, latxy) to the right (or to the left), and the number of intersections between the ray and the polygonal region is counted. If the number of intersections is odd, the waypoint lies inside the polygonal threat area, and the node (x, y) is labeled 1; if the number of intersections is even, the waypoint lies outside the polygonal threat area, and the node is labeled 0.
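The two labeling tests of Step 1.3 can be sketched as follows: a haversine distance check for circular threat areas, and a ray-casting parity check for polygonal ones (the Earth-radius constant is an assumption; the patent does not give one):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in km."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def in_circular_threat(lon, lat, lon_c, lat_c, r_km):
    """Label 1 if the waypoint lies inside the circular threat area."""
    return 1 if haversine_km(lon, lat, lon_c, lat_c) <= r_km else 0

def in_polygon_threat(lon, lat, polygon):
    """Ray casting: cast a ray to the right and count edge crossings;
    an odd count means the waypoint is inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > lon:
                inside = not inside
    return 1 if inside else 0
```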
As a further refinement of the present invention, the specific steps of the Step 2 Markov decision process model initialization are as follows:
Step 2.1: represent the UAV flight action space.
With the grid vertices as waypoints, each vertex has eight transfer directions to neighboring vertices (except boundary points). The transfer directions are restricted according to the UAV's own constraints and the distribution of threats in the space, and the UAV's behavior is generalized into a discrete action space. Discretizing the heading state at 45° intervals yields 8 discrete heading states. Based on these discretized heading states, 5 UAV flight actions are defined: fly straight (denoted 0), turn right 45° (denoted 1), turn left 45° (denoted 2), turn right 90° (denoted 3), and turn left 90° (denoted 4). The action space is thus expressed as A = [0, 1, 2, 3, 4], each number denoting one action.
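The heading discretization and five manoeuvres of Step 2.1 can be sketched as heading-index arithmetic; the convention that a right turn increments the heading index is an assumption for illustration:

```python
# Step 2.1 sketch: heading discretised at 45-degree intervals (8 states,
# indexed 0..7) and the five manoeuvres A = [0, 1, 2, 3, 4] expressed as
# heading changes: straight, right 45, left 45, right 90, left 90.
ACTIONS = {0: 0, 1: +1, 2: -1, 3: +2, 4: -2}

def apply_action(heading, action):
    """Return the heading state (0..7) after executing a manoeuvre."""
    return (heading + ACTIONS[action]) % 8
```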
Step 2.2: design the state transition probability.
The state transition probability is the conditional probability of reaching another waypoint state after the UAV executes an action in a given waypoint state; it is denoted P(s′|s, a) and represents the probability that executing action a in state s transfers the UAV to state s′.
At the early stage of learning the UAV knows nothing about the environment and easily enters a threat area; entering a threat area ends the current learning cycle, so the exploration of the environment stays confined near the initial state. Therefore, when the action taken by the UAV would lead it into a threat area or out of the state space, no state transition occurs, i.e., the UAV state does not change; under all other conditions the UAV transfers with probability 100% to the state the action points to. Let the state space of the UAV be S and the threat-area space be O; then P(s′|s, a) is calculated as follows:
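The deterministic transition rule of Step 2.2 can be sketched as a simple guard on the pointed-to node (grid bounds and threat set as arguments are illustrative choices):

```python
def transition(s, s_pointed, threat_nodes, m, n):
    """Step 2.2 transition: if the node the action points to lies in a
    threat area or outside the m*n grid, the state does not change;
    otherwise the pointed-to state is reached with probability 1."""
    x, y = s_pointed
    if not (0 <= x < m and 0 <= y < n) or s_pointed in threat_nodes:
        return s  # no state transition
    return s_pointed
```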
Step 2.3: construct the reward function.
The UAV obtains an immediate reward each time it transfers from one waypoint state to the next; the learning objective of the Q(λ) algorithm is to maximize the accumulated immediate reward. The construction of the reward function must consider the various indices that affect track performance, including the distance to the target point, flight safety, and threat degree. The immediate reward obtained when the UAV takes action a in state s and transfers to state s′ is calculated by the following formula, where w1, w2, w3 are weighting coefficients and fd, fo, fa are normalized route evaluation factors:
fd expresses visibility and is taken as the inverse of the distance from state s′ to the target point. With the longitude/latitude of s′ being s′ = (lons′, lats′) and the target point being T = (lonT, latT), fd is calculated as follows:
fo expresses the threat degree of the threat areas toward state s′, where Io denotes the set of threat areas that threaten the current state transfer of the UAV and foi denotes the threat degree of threat area oi toward s′. With the longitude/latitude of threat area oi given, foi is calculated as follows:
fa expresses the penalty term on the UAV flight action; the maneuver the UAV takes is a key factor affecting UAV flight safety. According to the UAV flight action space defined in Step 2.1, fa is treated as a discrete function:
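The reward formulas themselves are elided in the published text. A minimal sketch follows, assuming a weighted sum of the three factors; the particular weights, the inverse-distance form of each per-threat term, and the manoeuvre penalty values are all illustrative assumptions, since the patent only states that fd is an inverse distance, fo aggregates per-threat degrees, and fa is discrete:

```python
def reward(d_to_target_km, threat_dists_km, action, w=(0.6, 0.3, 0.1)):
    """Sketch of the Step 2.3 reward as a weighted sum of the three route
    evaluation factors; fo, fa, and the weights are assumptions."""
    fd = 1.0 / d_to_target_km                    # inverse distance to target
    fo = -sum(1.0 / d for d in threat_dists_km)  # assumed per-threat term
    fa = [0.0, -0.5, -0.5, -1.0, -1.0][action]   # assumed manoeuvre penalty
    w1, w2, w3 = w
    return w1 * fd + w2 * fo + w3 * fa
```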
As a further refinement of the present invention, the specific steps of Step 3, iterating with the Q(λ) algorithm on the established model, are as follows:
Step 3.1: initialize the Q table.
Each state-action pair Q(s, a) in the Q table is given an initial Q value. Let Q(s, ~) denote the initial value of all state-action pairs at state s and sT denote the terminal state; then Q(s, a) is calculated as follows:
If s is the terminal state, the initial Q value is 0; otherwise the Q value is set to the inverse of the distance between s and sT. With the coordinate of state s being (x, y) and that of sT being (xT, yT), the distance dssT is calculated as follows:
Step 3.2: initialize the E values.
At the start of each learning cycle, the E value E(s, a) of every state-action pair <s, a> is initialized to 0.
Step 3.3: select actions with the Boltzmann distribution strategy.
In each learning cycle, the initial state is set first; then actions are selected according to the Boltzmann distribution strategy to perform state transfers. The probability p(a|s) of taking action a in state s is calculated as p(a|s) = exp(Q(s, a)/T) / Σa′ exp(Q(s, a′)/T), where T is the temperature coefficient controlling the exploration intensity of the policy. A larger temperature coefficient can be used at the early stage of learning to guarantee a strong exploration ability, and the temperature coefficient is gradually decreased afterwards. Action a is then selected according to p(a|s) by the roulette-wheel method, and E(s, a) is incremented by one.
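The Boltzmann selection with roulette-wheel sampling in Step 3.3 can be sketched as follows (the random source is passed in explicitly for reproducibility):

```python
import math
import random

def boltzmann_select(q_row, temperature, rng):
    """Step 3.3: sample an action index from the Boltzmann distribution
    p(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T) via roulette wheel."""
    prefs = [math.exp(q / temperature) for q in q_row]
    total = sum(prefs)
    r, cum = rng.random(), 0.0
    for a, p in enumerate(prefs):
        cum += p / total
        if r <= cum:
            return a
    return len(q_row) - 1  # guard against floating-point underflow
```

With a low temperature the highest-valued action dominates; a high temperature approaches uniform exploration.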
Step 3.4: update the Q values.
The UAV takes the action a selected in Step 3.3 at state s, transfers to state s′, and obtains the immediate reward r. The update formula for Q(s, a) is:
Q(s, a) = Q(s, a) + α * (r + γ * maxa Q(s′, a) − Q(s, a)) * E(s, a)
where α is the learning rate, γ is the discount factor expressing the weight given to future rewards, and maxa Q(s′, a) is the maximum Q value at state s′.
Step 3.5: update the E values.
The update formula for all state-action pairs is E(s, a) = λ * E(s, a), where λ is a weight parameter. If state s′ is the terminal state, the current learning cycle ends and the next learning cycle begins; otherwise the UAV transfers to state s′ and returns to Step 3.3 to continue the learning process.
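One iteration of Steps 3.4-3.5 can be sketched as follows. The code implements the formulas exactly as stated in the text (the visited pair's Q is adjusted by the TD error scaled by its eligibility, then every trace decays by λ); note that the standard Watkins Q(λ) instead applies the TD error to all pairs and decays traces by γλ, so this is the text's variant, not the textbook one:

```python
def q_lambda_step(Q, E, s, a, r, s_next, alpha, gamma, lam):
    """One Step 3.4-3.5 iteration, following the text's stated formulas:
    Q(s,a) += alpha * (r + gamma * max_a Q(s',a) - Q(s,a)) * E(s,a),
    then E(s,a) = lam * E(s,a) for all pairs."""
    E[(s, a)] = E.get((s, a), 0.0) + 1.0          # trace incremented on visit
    td_error = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * td_error * E[(s, a)]
    for key in E:                                  # decay all traces
        E[key] *= lam
```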
As a further refinement of the present invention, the specific steps of Step 4, computing the optimal path from the state value function, are as follows:
Step 4.1: perform state transfers with the deterministic policy.
After Step 3, the state value Q has converged. The initial state s is set first; the action a* with the maximum Q value at state s is selected by a* = argmaxa∈A Q(s, a), and the corresponding state transfer is performed. After taking action a and transferring to the next state s′, the deterministic policy continues to select actions until the terminal state is reached.
Step 4.2: map the grid space to waypoint longitude/latitude coordinates.
The optimal path coordinates in the grid obtained in Step 4.1 are mapped to waypoint longitude/latitude coordinates by the formula in Step 1.2, yielding the optimal path of the UAV.
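The greedy path extraction of Step 4.1 can be sketched as follows (the step function standing in for the deterministic state transfer, and the length cap, are illustrative):

```python
def extract_path(Q, start, terminal, step_fn, max_len=100):
    """Step 4.1: follow the deterministic policy a* = argmax_a Q(s, a)
    from the start state until the terminal state is reached."""
    path, s = [start], start
    while s != terminal and len(path) < max_len:
        a_star = max(range(len(Q[s])), key=lambda a: Q[s][a])
        s = step_fn(s, a_star)
        path.append(s)
    return path
```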
Compared with the prior art, the above technical scheme of the present invention has the following technical effects:
1. The UAV's minimum track-segment length is used as the discretization step, taking the UAV's own constraints into account. This remedies the lack of basis in the discretization step of environmental modeling and produces a discrete planning space that fully exploits the UAV's flight capability.
2. The state transition probability is set so that when an action taken by the UAV would lead it into a threat area, no state transition occurs: the UAV keeps its current state and continues the current learning cycle. This overcomes the drawback that UAV-environment interaction at the early stage of learning stays confined near the initial state, and improves the convergence speed of the algorithm.
3. The Q-learning algorithm needs no global environmental knowledge; through trial-and-error interaction with the environment, it approaches the optimal policy by optimizing the action-value function. It suits UAVs operating in dynamic environments that are unknown or partially unknown, and guides the UAV to plan its path autonomously.
4. During iteration, the traditional Q-learning algorithm looks only one step ahead from the current state. Introducing the eligibility trace function into Q-learning takes the predictions over all step counts into account, making the computation of the value function more accurate. Moreover, it supports efficient online updating: Q values can be updated without waiting for a learning cycle to end, previous learning data can be discarded, and the convergence of the algorithm is accelerated.
Brief description of the drawings
Fig. 1: discrete UAV actions and their transfer results in the grid space.
Fig. 2: flow chart of the algorithm iteration within each learning cycle.
Specific embodiments
The present invention is further explained below with reference to the drawings.
For convenience of narration, the main variables in the algorithm are briefly defined:
The longitude/latitude of the UAV start position is S = (lonS, latS), the longitude/latitude of the target point is T = (lonT, latT), the size of the grid space is m*n, and a point in the grid space has coordinate (x, y). The Markov model is expressed by the four-tuple <S, A, P, R>, where S is the UAV state space, A is the UAV action space, R is the reward function, and P is the state transition probability matrix.
The present invention proposes a UAV path planning method based on the Q(λ) algorithm, comprising an environmental modeling step, a Markov decision process model initialization step, a Q(λ) algorithm iteration step, and a step of computing the optimal path from the state value function.
The specific steps are as follows:
Step 1) environmental modeling:
Step 1.1) set the step size of the grid space to the UAV's minimum track-segment length dmin;
Step 1.2) compute the grid space size m*n according to the corresponding formula;
Step 1.3) map the grid space coordinates to waypoint longitude/latitude coordinates according to lonxy = lono + dmin*x, latxy = lato + dmin*y, where (lono, lato) is the longitude/latitude corresponding to the grid origin (0, 0);
Step 1.4) in the grid space, label nodes containing a threat area 1, representing no-fly regions, and label nodes containing no threat area 0, representing flyable regions;
Step 2) Markov decision process model initialization:
Step 2.1) according to the UAV transfer directions shown in Fig. 1, define 5 UAV flight actions: fly straight (denoted 0), turn right 45° (denoted 1), turn left 45° (denoted 2), turn right 90° (denoted 3), and turn left 90° (denoted 4). The UAV flight action space is expressed as A = [0, 1, 2, 3, 4], each number denoting one action;
Step 2.2) set the state transition probability so that when an action taken by the UAV would lead it into a threat area or out of the state space, no state transition occurs, i.e., the UAV state does not change; under all other conditions the UAV transfers with probability 100% to the state the action points to. The state transition probability is calculated as follows, where O is the threat-area space:
Step 2.3) the immediate reward obtained when the UAV takes action a at state s and transfers to state s′ is calculated by the following formula, where w1, w2, w3 are weighting coefficients and fd, fo, fa are normalized route evaluation factors:
Step 2.4) fd expresses visibility and is taken as the inverse of the distance from state s′ to the target point. With the longitude/latitude of s′ being s′ = (lons′, lats′) and the target point being T = (lonT, latT), fd is calculated as follows:
Step 2.5) fo expresses the threat degree of the threat areas toward state s′, where Io denotes the set of threat areas that threaten the current state transfer of the UAV and foi denotes the threat degree of threat area oi toward s′. With the longitude/latitude of threat area oi given, foi is calculated as follows:
Step 2.6) fa expresses the penalty term on the UAV flight action; the maneuver the UAV takes is a key factor affecting UAV flight safety. According to the UAV flight action space defined in Step 2.1, fa is treated as a discrete function;
Step 3) iterate on the established model with the Q(λ) algorithm; the iteration flow of the algorithm within each learning cycle is shown in Fig. 2:
Step 3.1) give each state-action pair Q(s, a) in the Q table an initial Q value. Q(s, ~) denotes the initial value of all state-action pairs at state s and sT denotes the terminal state; Q(s, a) is calculated as follows:
Step 3.2) at the start of each learning cycle, initialize the E value E(s, a) of every state-action pair <s, a> to 0;
Step 3.3) set the initial state;
Step 3.4) select an action according to the Boltzmann distribution strategy; the probability p(a|s) of taking action a in state s is calculated as follows:
Step 3.5) update Q(s, a) according to the formula:
Q(s, a) = Q(s, a) + α * (r + γ * maxa Q(s′, a) − Q(s, a)) * E(s, a)
Step 3.6) update the E values according to E(s, a) = λ * E(s, a);
Step 3.7) take action a and transfer to the next state s′. If s′ is the terminal state, the current learning cycle ends; return to Step 3.2) and begin the next learning cycle. Otherwise return to Step 3.4) and continue the iteration.
Step 4) compute the optimal path from the state value function:
Step 4.1) after Step 3), the state value Q has converged. Set the initial state s first, select the action a* with the maximum Q value at state s by a* = argmaxa∈A Q(s, a), and perform the state transfer. After taking action a and transferring to the next state s′, continue selecting actions with the deterministic policy until the terminal state is reached;
Step 4.2) map the optimal path coordinates in the grid obtained in Step 4.1) to waypoint longitude/latitude coordinates according to the formula in Step 1.3), yielding the optimal path of the UAV.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person familiar with the art can readily conceive of transformations or replacements within the technical scope disclosed by the present invention, and all such changes shall fall within the scope of the present invention. The scope of protection of the present invention shall therefore be subject to the scope of protection specified in the claims.
Claims (5)
1. being based on the unmanned plane paths planning method of Q (λ) algorithm, it is characterised in that: the following steps are included:
Step 1, environmental modeling: environmental information is acquired using sensor, identifies threatening area, using Grid Method by unmanned plane during flying
Environment is modeled, and by continuous spatial discretization, uniform grid chart is generated according to the space size of setting, by grid vertex
As the way point after discrete;
Step 2, initialize markov decision process model: initialization is suitable for solving the Ma Er of the unmanned plane path planning
Section husband decision process model, the markov decision process model four-tuple<S, A, P, R>expression, S are locating for unmanned plane
State space, A be unmanned plane motion space, P is state-transition matrix, and R is reward function, markov decision process mould
Type initialization includes the construction of the expression to unmanned plane during flying motion space, the design of state transition probability and reward function;
Step 3, it on the model established, is calculated using Q (λ) algorithm iteration: in the model basis that step 1 and step 2 are established
On, calculating is iterated using Q (λ) algorithm for combining Q-learning algorithm and effectiveness to track;It introduces state action and is worth letter
Number Q (s takes the value of movement a a) to characterize unmanned plane in state s, establishes Q table and stores each state action to<s, a>valence
Value;Introduce effectiveness tracking function E (s, a) indicate final state and state behavior to<s, a>causality;Q value is carried out first
It is initialized with E value, then in each learning cycle, the movement a that is taken under s state by Boltzmann policy selection;It holds
After action is transferred to NextState s' as a, Q (s, value a), and update by E value more new formula are updated by Q value more new formula
The E value of all state actions pair, when reaching final state, when secondary learning cycle terminates, until reaching maximum learning cycle number
Afterwards, Q (λ) algorithm iteration calculating process terminates;
Step 4, optimal path is calculated according to state value function: obtains convergent state value function after step 3, then may be used
To select the movement a* with maximum Q value at state s, continue after taking movement a* using deterministic strategy, until reaching
Node in grid is finally mapped to longitude and latitude and then obtains optimal path by final state.
2. The UAV path planning method based on the Q (λ) algorithm according to claim 1, wherein the environmental modeling of step 1 comprises the following specific steps:
Step 1.1, initialising the grid space according to the UAV's minimum track-segment length;
The UAV flies in straight lines between waypoints and changes flight attitude at certain waypoints as the track requires; the minimum track-segment length is the shortest distance the UAV must fly straight before it may begin changing attitude; taking the minimum track-segment length as the grid step yields a discrete grid space that satisfies the UAV's own constraints;
The longitude/latitude coordinates of the UAV start position are S = (lonS, latS) and those of the target point are T = (lonT, latT); the minimum track-segment length is dmin and the grid space has size m*n; with dmin as the grid step, m and n are calculated as follows:
Step 1.2, mapping grid-space coordinates to waypoints;
The grid vertices serve as the discretised waypoints, and coordinates in the grid space are written (x, y); if the grid origin (0, 0) corresponds to the longitude/latitude coordinates (lono, lato), the waypoint coordinates (lonxy, latxy) corresponding to node (x, y) are calculated as: lonxy = lono + dmin*x, latxy = lato + dmin*y.
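The mapping of step 1.2 is simple enough to sketch directly; the Python function below assumes the origin coordinates and grid step are given in decimal degrees (illustrative values, not from the patent):

```python
def grid_to_waypoint(x, y, lon_o, lat_o, d_min):
    """Map grid node (x, y) to waypoint lon/lat per step 1.2:
    lon_xy = lon_o + d_min * x, lat_xy = lat_o + d_min * y."""
    return (lon_o + d_min * x, lat_o + d_min * y)
```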
Step 1.3, representing the threat-area information;
The UAV must account for the spatial positions of threat sources during flight; according to the threat-source type, threat areas are divided into circular regions and polygonal regions; in the grid space, a node inside a threat area is labelled 1, denoting a no-fly region, and a node outside every threat area is labelled 0, denoting a flyable region; for a circular threat area with centre coordinates (lonc, latc) and radius r (km), the distance dxyo from the waypoint corresponding to each grid node (x, y) to the centre of the threat area is calculated with the haversine formula, which gives the distance between two points on a sphere from their longitude/latitude coordinates;
If dxyo ≤ r, node (x, y) is labelled 1, otherwise 0; for a polygonal threat area, a horizontal ray is drawn from the waypoint (lonxy, latxy) to the right (or left) and the number of its intersections with the polygonal region is counted: if the number of intersections is odd, the waypoint lies inside the polygonal threat area and node (x, y) is labelled 1; if the number of intersections is even, the waypoint lies outside the polygonal threat area and the node is labelled 0.
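The circular-threat labelling of step 1.3 can be sketched with the standard haversine formula; the polygon ray-casting test is omitted here, and the function names are illustrative rather than taken from the patent:

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance (km) between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = phi2 - phi1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def label_circular_threat(lon, lat, lon_c, lat_c, r_km):
    """Label a waypoint 1 (no-fly) if it lies within r_km of the circle centre, else 0."""
    return 1 if haversine_km(lon, lat, lon_c, lat_c) <= r_km else 0
```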
3. The UAV path planning method based on the Q (λ) algorithm according to claim 2, wherein the Markov decision process model initialisation of step 2 comprises the following specific steps:
Step 2.1, representing the UAV flight action space;
With the grid vertices as waypoints, each vertex has eight transfer directions to other vertices (except boundary points); the transfer directions are restricted according to the UAV's own constraints and the distribution of threats in space, so the UAV's behaviour is generalised to a discrete action space: the heading state is discretised at 45° intervals into 8 discrete states, and over these 5 flight actions are defined: fly straight, denoted 0; turn right 45°, denoted 1; turn left 45°, denoted 2; turn right 90°, denoted 3; and turn left 90°, denoted 4; the action space is thus expressed as A = [0, 1, 2, 3, 4], each number denoting one action;
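The heading discretisation and five-action space of step 2.1 can be sketched as a lookup table; the signed heading-change encoding below is an assumption consistent with the claim's action labels:

```python
HEADINGS = [i * 45 for i in range(8)]                   # 8 discrete headings: 0°..315°
HEADING_CHANGE = {0: 0, 1: 45, 2: -45, 3: 90, 4: -90}   # action code -> heading change (deg)

def next_heading(heading_deg, action):
    """Apply an action's heading change, wrapping the result into [0, 360)."""
    return (heading_deg + HEADING_CHANGE[action]) % 360
```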
Step 2.2, designing the state transition probability;
The state transition probability is the conditional probability that, after the UAV executes an action in one waypoint state, it reaches another waypoint state; it represents the probability that the UAV, executing action a in state s, transfers to state s';
In the early learning stage the UAV has no knowledge of the environment and easily enters threat areas; entering a threat area ends the learning episode, so exploration stays confined near the initial state; therefore, when an action taken by the UAV would cause it to enter a threat area or to leave the state space, no state transfer occurs, i.e. the UAV's state does not change; in all other cases the UAV transfers with probability 100% to the state the action points to; with UAV state space S and threat-area space O, the transition probability is calculated as follows:
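The transition rule of step 2.2 (stay put on an illegal move, deterministic transfer otherwise) can be sketched as follows; `move` is a hypothetical kinematics helper, not part of the patent:

```python
def transition(state, action, move, state_space, threat_space):
    """Step 2.2 rule: if the targeted node lies outside the state space S or
    inside the threat space O, the state does not change; otherwise the UAV
    transfers with probability 1 to the node the action points to."""
    nxt = move(state, action)
    if nxt not in state_space or nxt in threat_space:
        return state
    return nxt
```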
Step 2.3, constructing the reward function;
The UAV obtains an immediate reward each time it transfers from one waypoint state to the next, and the learning objective of the Q (λ) algorithm is to maximise the accumulated immediate reward; the construction of the reward function must consider the indices that affect track performance, including distance to the target point, flight safety and threat degree; the immediate reward obtained when the UAV takes action a in state s and transfers to state s' is calculated by the following formula, where w1, w2, w3 are weighting coefficients and fd, fo, fa are the normalised track evaluation factors;
fd denotes visibility of the target, taken as the inverse of the distance from state s' to the target point; with s' = (lons', lats') and target point T = (lonT, latT), fd is calculated as follows:
fo denotes the threat degree imposed on state s' by the threat areas, where Io denotes the set of threat areas that threaten the UAV's current state transfer and each term denotes the threat degree of threat area oi on s'; with the longitude/latitude coordinates of threat area oi given, the calculation formula is as follows:
fa denotes the penalty on the UAV's flight action; the manoeuvre the UAV takes is a key factor affecting flight safety; according to the UAV flight action space set in step 2.1, fa is treated as a discrete function:
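The claim's combining formula for the reward is not reproduced in this text, so the sketch below is only one plausible form under stated assumptions: goal proximity (fd) is rewarded while threat exposure (fo) and manoeuvring (fa) are penalised, with the weights of step 2.3:

```python
def immediate_reward(f_d, f_o, f_a, w1=1.0, w2=1.0, w3=1.0):
    """Assumed combination of the step 2.3 factors: a weighted sum in which
    f_o and f_a enter with negative sign. The exact formula and signs are
    assumptions, since the claim's formula image is not shown here."""
    return w1 * f_d - w2 * f_o - w3 * f_a
```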
4. The UAV path planning method based on the Q (λ) algorithm according to claim 3, wherein the iterative calculation with the Q (λ) algorithm on the established model in step 3 comprises the following specific steps:
Step 3.1, initialising the Q table;
Each state-action pair Q (s, a) in the Q table is initialised; Q (s, ~) denotes the initial value of all state-action pairs in state s, and sT denotes the terminal state; Q (s, a) is calculated as follows:
If s is the terminal state, the initial Q value is 0; otherwise the Q value is set to the inverse of the distance between s and sT; with s at grid coordinates (x, y) and sT at (xT, yT), the distance dssT is calculated as follows:
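Step 3.1 can be sketched directly; because the claim's distance formula is not reproduced in this text, Euclidean grid distance is assumed here:

```python
import math

def init_q_table(states, actions, s_T):
    """Step 3.1: Q is 0 at the terminal state s_T and the inverse of the
    distance to s_T elsewhere (Euclidean grid distance assumed)."""
    q = {}
    for s in states:
        for a in actions:
            q[(s, a)] = 0.0 if s == s_T else 1.0 / math.hypot(s[0] - s_T[0], s[1] - s_T[1])
    return q
```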
Step 3.2, initialising the E values;
At the start of each learning episode, the eligibility trace E (s, a) of every state-action pair <s, a> is initialised to 0;
Step 3.3, selecting actions with the Boltzmann distribution strategy;
In each learning episode, the initial state is set first, and then actions are selected according to the Boltzmann distribution strategy to perform state transfers; the probability p(a | s) of taking action a in state s is calculated as follows:
Here T is the temperature coefficient, which controls the exploration intensity of the policy; a larger temperature coefficient can be used in the early learning stage to guarantee strong exploratory ability, and the temperature coefficient is gradually reduced afterwards; action a is then sampled according to p(a | s) by the roulette-wheel method, and E (s, a) is incremented by one;
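The Boltzmann selection of step 3.3 can be sketched as a softmax over the Q values of the current state, sampled by the roulette-wheel method; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the claim:

```python
import math
import random

def boltzmann_select(q_row, temperature, rng=random.random):
    """Step 3.3: p(a|s) proportional to exp(Q(s, a)/T), sampled by the
    roulette-wheel method over the cumulative weights."""
    m = max(q_row.values())
    weights = {a: math.exp((v - m) / temperature) for a, v in q_row.items()}
    total = sum(weights.values())
    r, acc = rng() * total, 0.0
    for a, w in weights.items():
        acc += w
        if r <= acc:
            return a
    return a  # guard against floating-point shortfall
```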
Step 3.4, updating the Q values;
The UAV takes the action a selected in step 3.3 at state s, transfers to state s' and obtains the immediate reward r; Q (s, a) is then updated by:
Q (s, a) = Q (s, a) + α * (r + γ * maxa Q (s', a) − Q (s, a)) * E (s, a)
where α is the learning rate, γ is the discount factor expressing the weight given to future reward, and maxa Q (s', a) is the maximum Q value in state s';
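The step 3.4 update can be sketched literally as written in the claim, with the update of Q(s, a) scaled by the eligibility trace E(s, a):

```python
def q_update(q, e, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Step 3.4: Q(s, a) += alpha * (r + gamma * max_b Q(s', b) - Q(s, a)) * E(s, a)."""
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)]) * e[(s, a)]
    return q[(s, a)]
```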
Step 3.5, updating the E values;
All eligibility traces E (s, a) are updated by the formula E (s, a) = λ * E (s, a), where λ is the trace-decay parameter; if state s' is the terminal state, the current learning episode ends and the next learning episode begins; otherwise the UAV transfers to state s' and returns to step 3.3 to continue the learning process.
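Steps 3.2 through 3.5 form one learning episode, which can be sketched end to end as follows; `step` and `reward` are hypothetical helpers standing in for the transition rule of step 2.2 and the reward of step 2.3, and the Q update is applied only to the current pair, as the claim states it:

```python
import math
import random

def run_episode(q, states, actions, step, reward, s0, s_T,
                alpha=0.1, gamma=0.9, lam=0.8, temperature=1.0, rng=None):
    """One Q(lambda) learning episode: traces reset to zero (step 3.2),
    Boltzmann action selection (step 3.3), trace-scaled Q update (step 3.4),
    and decay of every trace by lambda (step 3.5)."""
    rng = rng or random.Random()
    e = {(s, a): 0.0 for s in states for a in actions}      # step 3.2
    s = s0
    while s != s_T:
        w = [math.exp(q[(s, a)] / temperature) for a in actions]
        a = rng.choices(actions, weights=w)[0]              # step 3.3
        e[(s, a)] += 1.0
        s_next = step(s, a)
        r = reward(s, a, s_next)
        target = r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)]) * e[(s, a)]   # step 3.4
        for k in e:                                         # step 3.5
            e[k] *= lam
        s = s_next
    return q
```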
5. The UAV path planning method based on the Q (λ) algorithm according to claim 4, wherein the calculation of the optimal path from the state value function in step 4 comprises the following specific steps:
Step 4.1, performing state transfers with the deterministic policy;
After step 3, the state-action values Q have converged; the initial state s is set first, the action a* with the maximum Q value in state s is selected by a* = argmaxa∈A Q (s, a), and the corresponding state transfer is carried out; after transferring to the next state s', the deterministic policy continues to select actions until the terminal state is reached;
Step 4.2, mapping the grid space to waypoint longitude/latitude coordinates;
The optimal path coordinates in the grid obtained in step 4.1 are mapped to waypoint longitude/latitude coordinates by the formula of step 1.2, yielding the UAV's optimal path.
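The greedy path extraction of steps 4.1 and 4.2 can be sketched as follows; `step` is the deterministic transition rule, and the returned grid nodes would then be mapped to lon/lat with the step 1.2 formula (a `max_len` guard, not in the claim, prevents infinite loops on an unconverged table):

```python
def extract_path(q, s0, s_T, actions, step, max_len=10000):
    """Steps 4.1-4.2: follow a* = argmax_a Q(s, a) from the start state
    until the terminal state is reached, collecting the visited nodes."""
    path, s = [s0], s0
    while s != s_T and len(path) < max_len:
        a_star = max(actions, key=lambda b: q[(s, b)])
        s = step(s, a_star)
        path.append(s)
    return path
```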
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910071929.6A CN109655066B (en) | 2019-01-25 | 2019-01-25 | Unmanned aerial vehicle path planning method based on Q (lambda) algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109655066A true CN109655066A (en) | 2019-04-19 |
CN109655066B CN109655066B (en) | 2022-05-17 |
Family
ID=66121623
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910071929.6A Active CN109655066B (en) | 2019-01-25 | 2019-01-25 | Unmanned aerial vehicle path planning method based on Q (lambda) algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109655066B (en) |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134140A (en) * | 2019-05-23 | 2019-08-16 | 南京航空航天大学 | A kind of unmanned plane paths planning method based on potential function award DQN under the unknown continuous state of environmental information |
CN110324805A (en) * | 2019-07-03 | 2019-10-11 | 东南大学 | A kind of radio sensor network data collection method of unmanned plane auxiliary |
CN110320931A (en) * | 2019-06-20 | 2019-10-11 | 西安爱生技术集团公司 | Unmanned plane avoidance Route planner based on Heading control rule |
CN110428115A (en) * | 2019-08-13 | 2019-11-08 | 南京理工大学 | Maximization system benefit method under dynamic environment based on deeply study |
CN110673637A (en) * | 2019-10-08 | 2020-01-10 | 福建工程学院 | Unmanned aerial vehicle pseudo path planning method based on deep reinforcement learning |
CN110726416A (en) * | 2019-10-23 | 2020-01-24 | 西安工程大学 | Reinforced learning path planning method based on obstacle area expansion strategy |
CN110879610A (en) * | 2019-10-24 | 2020-03-13 | 北京航空航天大学 | Reinforced learning method for autonomous optimizing track planning of solar unmanned aerial vehicle |
CN111006693A (en) * | 2019-12-12 | 2020-04-14 | 中国人民解放军陆军工程大学 | Intelligent aircraft track planning system and method thereof |
CN111026157A (en) * | 2019-12-18 | 2020-04-17 | 四川大学 | Intelligent aircraft guiding method based on reward remodeling reinforcement learning |
CN111123963A (en) * | 2019-12-19 | 2020-05-08 | 南京航空航天大学 | Unknown environment autonomous navigation system and method based on reinforcement learning |
CN111160755A (en) * | 2019-12-26 | 2020-05-15 | 西北工业大学 | DQN-based real-time scheduling method for aircraft overhaul workshop |
CN111328023A (en) * | 2020-01-18 | 2020-06-23 | 重庆邮电大学 | Mobile equipment multitask competition unloading method based on prediction mechanism |
CN111340324A (en) * | 2019-09-25 | 2020-06-26 | 中国人民解放军国防科技大学 | Multilayer multi-granularity cluster task planning method based on sequential distribution |
CN111399541A (en) * | 2020-03-30 | 2020-07-10 | 西北工业大学 | Unmanned aerial vehicle whole-region reconnaissance path planning method of unsupervised learning type neural network |
CN111479216A (en) * | 2020-04-10 | 2020-07-31 | 北京航空航天大学 | Unmanned aerial vehicle cargo conveying method based on UWB positioning |
CN111538059A (en) * | 2020-05-11 | 2020-08-14 | 东华大学 | Self-adaptive rapid dynamic positioning system and method based on improved Boltzmann machine |
CN111612162A (en) * | 2020-06-02 | 2020-09-01 | 中国人民解放军军事科学院国防科技创新研究院 | Reinforced learning method and device, electronic equipment and storage medium |
CN111736461A (en) * | 2020-06-30 | 2020-10-02 | 西安电子科技大学 | Unmanned aerial vehicle task collaborative allocation method based on Q learning |
CN111880563A (en) * | 2020-07-17 | 2020-11-03 | 西北工业大学 | Multi-unmanned aerial vehicle task decision method based on MADDPG |
CN112130124A (en) * | 2020-09-18 | 2020-12-25 | 北京北斗天巡科技有限公司 | Rapid calibration and error processing method for unmanned aerial vehicle management and control equipment in civil aviation airport |
CN112356031A (en) * | 2020-11-11 | 2021-02-12 | 福州大学 | On-line planning method based on Kernel sampling strategy under uncertain environment |
CN112525213A (en) * | 2021-02-10 | 2021-03-19 | 腾讯科技(深圳)有限公司 | ETA prediction method, model training method, device and storage medium |
CN113033815A (en) * | 2021-02-07 | 2021-06-25 | 广州杰赛科技股份有限公司 | Intelligent valve cooperation control method, device, equipment and storage medium |
CN113093803A (en) * | 2021-04-03 | 2021-07-09 | 西北工业大学 | Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm |
CN113176786A (en) * | 2021-04-23 | 2021-07-27 | 成都凯天通导科技有限公司 | Q-Learning-based hypersonic aircraft dynamic path planning method |
CN113867369A (en) * | 2021-12-03 | 2021-12-31 | 中国人民解放军陆军装甲兵学院 | Robot path planning method based on alternating current learning seagull algorithm |
CN114020009A (en) * | 2021-10-20 | 2022-02-08 | 中国航空工业集团公司洛阳电光设备研究所 | Terrain penetration planning method for small-sized fixed-wing unmanned aerial vehicle |
CN114115340A (en) * | 2021-11-15 | 2022-03-01 | 南京航空航天大学 | Airspace cooperative control method based on reinforcement learning |
CN114153213A (en) * | 2021-12-01 | 2022-03-08 | 吉林大学 | Deep reinforcement learning intelligent vehicle behavior decision method based on path planning |
CN115562357A (en) * | 2022-11-23 | 2023-01-03 | 南京邮电大学 | Intelligent path planning method for unmanned aerial vehicle cluster |
WO2024020923A1 (en) * | 2022-07-27 | 2024-02-01 | 苏州泽达兴邦医药科技有限公司 | Granulation process for traditional chinese medicine production, and process strategy calculation method |
CN117806340A (en) * | 2023-11-24 | 2024-04-02 | 中国电子科技集团公司第十五研究所 | Airspace training flight path automatic planning method and device based on reinforcement learning |
CN117928559A (en) * | 2024-01-26 | 2024-04-26 | 兰州理工大学 | Unmanned aerial vehicle path planning method under threat avoidance based on reinforcement learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108170147A (en) * | 2017-12-31 | 2018-06-15 | 南京邮电大学 | UAV mission planning method based on a self-organising neural network |
CN108171315A (en) * | 2017-12-27 | 2018-06-15 | 南京邮电大学 | Multi-UAV task allocation method based on SMC particle swarm algorithms |
CN108319286A (en) * | 2018-03-12 | 2018-07-24 | 西北工业大学 | UAV air combat manoeuvring decision method based on reinforcement learning |
CN108413959A (en) * | 2017-12-13 | 2018-08-17 | 南京航空航天大学 | UAV path planning based on improved chaos ant colony optimisation |
US20180308371A1 (en) * | 2017-04-19 | 2018-10-25 | Beihang University | Joint search method for uav multiobjective path planning in urban low altitude environment |
2019-01-25: CN201910071929.6A granted as CN109655066B (status: active)
Non-Patent Citations (3)
Title |
---|
YANG GAO等: "Multi-UAV Task Allocation Based on Improved Algorithm of Multi-objective Particle Swarm Optimization", 《2018 INTERNATIONAL CONFERENCE ON CYBER-ENABLED DISTRIBUTED COMPUTING AND KNOWLEDGE DISCOVERY (CYBERC)》 * |
HAO Chuanchuan et al.: "Q-learning-based three-dimensional trajectory planning algorithm for UAVs", Journal of Shanghai Jiao Tong University * |
CHEN Xia et al.: "Three-dimensional UAV trajectory planning using an improved neural network", Electronics Optics & Control * |
Also Published As
Publication number | Publication date |
---|---|
CN109655066B (en) | 2022-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109655066A (en) | Unmanned aerial vehicle path planning method based on the Q (λ) algorithm | |
Wang et al. | Autonomous navigation of UAVs in large-scale complex environments: A deep reinforcement learning approach | |
Singla et al. | Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge | |
Choi et al. | Unmanned aerial vehicles using machine learning for autonomous flight; state-of-the-art | |
Sun et al. | Motion planning for mobile robots—Focusing on deep reinforcement learning: A systematic review | |
Dong et al. | A review of mobile robot motion planning methods: from classical motion planning workflows to reinforcement learning-based architectures | |
CN106483852B (en) | Stratospheric airship control method based on the Q-Learning algorithm and a neural network | |
CN110362089A (en) | Autonomous navigation method for unmanned surface vehicles based on deep reinforcement learning and a genetic algorithm | |
Xie et al. | Learning with stochastic guidance for robot navigation | |
CN113268074B (en) | Unmanned aerial vehicle flight path planning method based on joint optimization | |
CN109597425A (en) | UAV navigation and obstacle avoidance method based on reinforcement learning | |
Alkowatly et al. | Bioinspired autonomous visual vertical control of a quadrotor unmanned aerial vehicle | |
Li et al. | A behavior-based mobile robot navigation method with deep reinforcement learning | |
Valasek et al. | Intelligent motion video guidance for unmanned air system ground target surveillance | |
Xue et al. | A UAV navigation approach based on deep reinforcement learning in large cluttered 3D environments | |
CN116679711A (en) | Robot obstacle avoidance method based on model-based reinforcement learning and model-free reinforcement learning | |
Wu et al. | Multi-objective reinforcement learning for autonomous drone navigation in urban areas with wind zones | |
Li et al. | A warm-started trajectory planner for fixed-wing unmanned aerial vehicle formation | |
Olaz et al. | Quadcopter neural controller for take-off and landing in windy environments | |
CN117387635A (en) | Unmanned aerial vehicle navigation method based on deep reinforcement learning and PID controller | |
Zhang et al. | A state-decomposition DDPG algorithm for UAV autonomous navigation in 3D complex environments | |
Hua et al. | A novel learning-based trajectory generation strategy for a quadrotor | |
Cui | Multi-target points path planning for fixed-wing unmanned aerial vehicle performing reconnaissance missions | |
Qu et al. | USV Path Planning Under Marine Environment Simulation Using DWA and Safe Reinforcement Learning | |
Mahé et al. | Trajectory-control using deep system identification and model predictive control for drone control under uncertain load |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||