CN109655066B - Unmanned aerial vehicle path planning method based on Q (lambda) algorithm - Google Patents

Unmanned aerial vehicle path planning method based on Q (lambda) algorithm

Info

Publication number
CN109655066B
CN109655066B
Authority
CN
China
Prior art keywords
state
aerial vehicle
unmanned aerial
action
threat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910071929.6A
Other languages
Chinese (zh)
Other versions
CN109655066A (en)
Inventor
张迎周
竺殊荣
高扬
孙仪
张灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910071929.6A
Publication of CN109655066A
Application granted
Publication of CN109655066B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft

Abstract

The invention provides an unmanned aerial vehicle path planning method based on the Q(λ) algorithm, comprising an environment modeling step, a Markov decision process model initialization step, a Q(λ) algorithm iterative computation step, and a step of computing the optimal path from the state cost function. First, a grid space is initialized according to the minimum track segment length of the unmanned aerial vehicle, grid space coordinates are mapped to waypoints, and circular and polygonal threat regions are represented. Then the Markov decision model is established, comprising the representation of the unmanned aerial vehicle flight action space, the design of the state transition probability and the construction of the reward function. The Q(λ) algorithm is then iterated on the established model, and the optimal path along which the unmanned aerial vehicle can safely avoid the threat regions is computed from the finally converged state cost function. By combining the traditional Q-learning algorithm with utility tracking (eligibility traces), the speed and accuracy of value function convergence are improved, and the unmanned aerial vehicle is guided to avoid threat areas and perform autonomous path planning.

Description

Unmanned aerial vehicle path planning method based on Q (lambda) algorithm
Technical Field
The invention relates to an unmanned aerial vehicle, in particular to a path planning method for the unmanned aerial vehicle, and belongs to the technical field of heuristic algorithms.
Background
Unmanned aerial vehicle path planning is an important component of unmanned aerial vehicle mission planning and an important stage in realizing autonomous task execution by the unmanned aerial vehicle. It requires planning a flight path from a starting point to a target point in an environment whose information is known, partially known or completely unknown, so that threat zones and obstacles are bypassed, the flight path is safe, reliable and collision-free, and various constraint conditions are satisfied. According to how much information about the battlefield environment has been acquired, path planning is divided into global path planning and local path planning.
In practical applications, if the unmanned aerial vehicle can obtain global knowledge of the environment, path planning can be achieved using dynamic programming. However, as the complexity and uncertainty of the battlefield environment increase, the unmanned aerial vehicle often has little prior knowledge of the environment, so in practice it must have a strong ability to adapt to dynamic environments. In this case, local path planning techniques that rely on sensor information to perceive threat area information in real time show great advantages.
The existing local path planning techniques suffer from problems such as easily falling into local minima or local oscillation, high algorithmic time cost, large computer storage requirements, and rules that are difficult to determine. Behavior-based unmanned aerial vehicle path planning has become a hotspot of current research; its essence is to map the environmental state sensed by the sensors to the actions of the actuators, but in real complex environments the design of the state feature vector and the acquisition of supervised samples are often very difficult. Therefore, these problems need to be solved.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle path planning method based on the Q(λ) algorithm, which combines Q learning with utility tracking (eligibility traces), assigns quantized reward and punishment signals to the environment states sensed by the sensors, and, through continuous interaction with the environment, guides the unmanned aerial vehicle to plan its path autonomously and safely avoid threat regions. The method responds quickly to changes in the external environment, has the advantages of being fast and real-time, and improves the adaptability of the unmanned aerial vehicle in unknown or partially unknown environments.
The invention provides an unmanned aerial vehicle path planning method based on the Q(λ) algorithm, characterized by comprising the following steps:
step 1, environment modeling: identifying a threat area by utilizing environmental information acquired by a sensor, modeling the flight environment of the unmanned aerial vehicle by using a grid method, discretizing continuous space, generating a uniform grid map according to the set space size, and taking the grid vertex as a discrete waypoint;
step 2, initializing a Markov decision process model: initializing a Markov decision process model suitable for solving the unmanned aerial vehicle path plan, wherein the Markov decision process model can be represented by a quadruple < S, A, P and R >, S is a state space where the unmanned aerial vehicle is located, A is an action space of the unmanned aerial vehicle, P is a state transition matrix, R is a reward function, and the Markov decision process model initialization comprises representation of the flight action space of the unmanned aerial vehicle, design of the state transition probability and construction of the reward function;
Step 3, performing iterative calculation on the established model with the Q(λ) algorithm: on the basis of the models established in steps 1 and 2, iterative computation is performed with the Q(λ) algorithm, which combines the Q-learning algorithm with utility tracking; a state-action value function Q(s, a) is introduced to represent the value of the unmanned aerial vehicle taking action a in state s, and a Q table is established to store the value of each state-action pair <s, a>; a utility tracking function E(s, a) is introduced to represent the causal relationship between the termination state and the state-action pair <s, a>; the Q values and E values are initialized, and then in each learning period the action a taken in state s is selected by the Boltzmann strategy; after action a transfers the unmanned aerial vehicle to the next state s', the value of Q(s, a) is updated by the Q value update formula and the E values of all state-action pairs are updated by the E value update formula; when the termination state is reached the current learning period ends, and when the maximum number of learning periods is reached the iterative calculation process of the Q(λ) algorithm ends;
Step 4, calculating the optimal path according to the state cost function: from the converged state cost function obtained in step 3, the action a with the maximum Q value is selected in state s, and the deterministic strategy is followed after each action until the termination state is reached; finally the grid nodes are mapped to longitude and latitude coordinates to obtain the optimal path.
As a further limitation of the invention: the step 1 of environment modeling specifically comprises the following steps:
step 1.1, initializing a grid space according to the minimum track segment length of the unmanned aerial vehicle;
The unmanned aerial vehicle flies along straight lines between waypoints and changes its flight attitude at certain waypoints according to track requirements; the minimum track segment length is the shortest distance the unmanned aerial vehicle is required to fly straight before changing its flight attitude. Setting the step length according to the minimum track segment length of the unmanned aerial vehicle yields a discrete grid space that satisfies the unmanned aerial vehicle's own constraints;
Setting the longitude and latitude coordinates of the starting position of the unmanned aerial vehicle as S = (lon_S, lat_S), the longitude and latitude coordinates of the target point as T = (lon_T, lat_T), the minimum track segment length of the unmanned aerial vehicle as d_min, and the size of the grid space as m × n; with d_min set as the grid step length, the calculation formulas for m and n are as follows:
[formula image: calculation of m and n from S, T and d_min]
step 1.2, mapping the grid space coordinate into a waypoint;
Taking the grid vertices as discrete waypoints, coordinates in the grid space are expressed as (x, y). Let the longitude and latitude coordinates corresponding to the origin (0, 0) of the grid space be (lon_o, lat_o); the longitude and latitude coordinates (lon_xy, lat_xy) of the waypoint corresponding to (x, y) are calculated as follows: lon_xy = lon_o + d_min * x, lat_xy = lat_o + d_min * y.
Step 1.3 representation of threat zone information;
The spatial position of threat sources must be considered during flight. According to the type of threat source, threat areas are divided into circular areas and polygonal areas; in the grid space, a node containing a threat area is marked 1 and represents a no-fly area, and a node without a threat area is marked 0 and represents a flyable area. For a circular threat zone, let the coordinates of the zone center be (lon_c, lat_c) and the radius of the threat zone be r (km). For each node (x, y) in the grid, the distance d_xyo from the waypoint corresponding to the node to the center of the threat zone is calculated with the haversine formula, which computes the distance between two points on a sphere from their longitude and latitude coordinates;
d_xyo = 2R · arcsin( sqrt( sin²((lat_xy - lat_c)/2) + cos(lat_c) · cos(lat_xy) · sin²((lon_xy - lon_c)/2) ) ), where R is the Earth radius and angles are in radians;
If d_xyo < r, the node corresponding to (x, y) is labeled 1, otherwise it is labeled 0. For a polygonal threat region, a ray is cast from the waypoint (lon_xy, lat_xy) in the horizontal direction to the right (or left) and the number of intersection points between the ray and the polygonal area is counted; if the number of intersections is odd, the waypoint lies inside the polygonal threat area and the node is marked 1, and if the number of intersections is even, the waypoint lies outside the polygonal threat area and the node is marked 0.
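As an illustration of steps 1.2 and 1.3, the sketch below maps grid coordinates to waypoints and marks no-fly nodes for circular and polygonal threat zones. It is a minimal Python sketch under stated assumptions: the helper names (grid_to_waypoint, build_occupancy, etc.) and the Earth radius value are illustrative and not taken from the patent; the circular test uses the haversine distance and the polygonal test uses the ray-casting rule described above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed value; the patent does not state the radius used

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points (the haversine formula of step 1.3)."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def grid_to_waypoint(x, y, lon_o, lat_o, d_min):
    """Step 1.2: map grid coordinates (x, y) to waypoint lon/lat with the patent's formula."""
    return lon_o + d_min * x, lat_o + d_min * y

def in_polygon(lon, lat, vertices):
    """Step 1.3: cast a ray to the right; an odd number of crossings means the point is inside."""
    inside = False
    n = len(vertices)
    for i in range(n):
        (lon1, lat1), (lon2, lat2) = vertices[i], vertices[(i + 1) % n]
        if (lat1 > lat) != (lat2 > lat):
            lon_cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon_cross > lon:
                inside = not inside
    return inside

def build_occupancy(m, n, lon_o, lat_o, d_min, circles, polygons):
    """Mark each node 1 (no-fly) if it lies in any threat zone, else 0 (flyable)."""
    grid = [[0] * n for _ in range(m)]
    for x in range(m):
        for y in range(n):
            lon, lat = grid_to_waypoint(x, y, lon_o, lat_o, d_min)
            in_circle = any(haversine_km(lon, lat, lon_c, lat_c) < r
                            for (lon_c, lat_c, r) in circles)
            in_poly = any(in_polygon(lon, lat, verts) for verts in polygons)
            grid[x][y] = 1 if (in_circle or in_poly) else 0
    return grid
```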
As a further limitation of the invention: the step 2 of initializing the Markov decision process model specifically comprises the following steps:
Step 2.1 Representation of the unmanned aerial vehicle flight action space
Taking the grid vertices as waypoints in the grid space, there are eight transfer directions from one vertex to another (except for boundary points). The transfer direction is limited to a certain extent according to the unmanned aerial vehicle's own constraints and the threat distribution of the space; the behavior of the unmanned aerial vehicle is generalized into a discrete action space, and the heading state is discretized at intervals of 45 degrees, giving 8 discrete states. According to the discretized heading states, 5 flight actions of the unmanned aerial vehicle are set: straight flight is represented by the numeral 0, a 45-degree right turn by 1, a 45-degree left turn by 2, a 90-degree right turn by 3, and a 90-degree left turn by 4, so the action space is represented as A = {0, 1, 2, 3, 4}, where each numeral represents an action;
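The discretized headings and the five flight actions can be illustrated as follows. This is a hedged sketch of the transfer behaviour shown in fig. 1: the association of actions 1-4 with 45-degree and 90-degree turns follows the 45-degree heading discretization described above, and the absolute orientation of heading index 0 and the sign convention for left/right turns are assumptions made only for the example.

```python
# 8 discrete headings at 45-degree intervals: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE
# (the absolute orientation of heading 0 is an assumption; only relative turns matter here)
HEADING_STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

# Action space A = {0,1,2,3,4}: 0 straight, 1 right 45 deg, 2 left 45 deg, 3 right 90 deg, 4 left 90 deg
ACTION_TURN = {0: 0, 1: -1, 2: +1, 3: -2, 4: +2}  # turn expressed in 45-degree increments

def step(x, y, heading, action):
    """Apply one flight action: update the heading, then move one grid step along it."""
    new_heading = (heading + ACTION_TURN[action]) % 8
    dx, dy = HEADING_STEPS[new_heading]
    return x + dx, y + dy, new_heading
```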
step 2.2 design State transition probability
The state transition probability is the conditional probability that the unmanned aerial vehicle reaches another waypoint state after executing an action in a given waypoint state; P_{ss'}^a denotes the probability that the unmanned aerial vehicle, in state s, performs action a and transitions to state s';
In the early learning stage, the unmanned aerial vehicle knows nothing about the environment and easily enters a threat area; entering a threat area ends the learning period, which restricts exploration of the environment to the neighborhood of the initial state. Therefore, when the action taken by the unmanned aerial vehicle would lead it into a threat area or out of the state space, no state transition occurs, i.e., the state of the unmanned aerial vehicle does not change; in all other cases the unmanned aerial vehicle transitions with probability 1 (100%) to the state the action points to. Let the state space of the unmanned aerial vehicle be S and the threat area space be O; then P_{ss'}^a is calculated as:
P_{ss'}^a = 1 if s' is the state that action a points to and s' ∈ S \ O; P_{ss}^a = 1 (no transition) if the state that action a points to lies in O or outside S; P_{ss'}^a = 0 otherwise.
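A sketch of this transition rule follows, reusing the step() helper from the action-space sketch after step 2.1 and the occupancy grid from step 1.3 (1 = no-fly). The function name and the (x, y, heading) state layout are illustrative assumptions, not the patent's notation.

```python
def transition(state, action, grid, m, n):
    """Step 2.2: deterministic transition; stay in place if the action would enter a
    threat node or leave the m x n state space, otherwise move to the pointed-to node."""
    x, y, heading = state
    nx, ny, nh = step(x, y, heading, action)            # candidate next waypoint
    out_of_space = not (0 <= nx < m and 0 <= ny < n)
    into_threat = (not out_of_space) and grid[nx][ny] == 1
    if out_of_space or into_threat:
        return state           # P(s -> s) = 1: no state change
    return (nx, ny, nh)        # P(s -> s') = 1: move to the state the action points to
```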
step 2.3 construction of reward function
The unmanned aerial vehicle obtains an instant reward each time it transfers to the next waypoint state. The learning objective of the Q(λ) algorithm is to maximize the accumulated instant reward, and the construction of the reward function must consider the various indices that influence track performance, including the distance to the target point, flight safety and threat degree;
R_{ss'}^a denotes the instant reward function obtained when the unmanned aerial vehicle takes action a in state s and transfers to state s', computed as a weighted combination of the normalized track evaluation factors, where w_1, w_2, w_3 are the weighting coefficients and f_d, f_o, f_a are the normalized track evaluation factors:
[formula image: R_{ss'}^a as a weighted combination of w_1·f_d, w_2·f_o and w_3·f_a]
f_d expresses visibility and is taken as the reciprocal of the distance from state s' to the target point; the longitude and latitude coordinates of s' are s' = (lon_s', lat_s') and those of the target point are T = (lon_T, lat_T), and f_d is calculated as:
f_d = 1 / d(s', T), where d(s', T) is the haversine distance between (lon_s', lat_s') and (lon_T, lat_T);
f_o represents the degree of threat posed by the threat zones to state s'; it is accumulated over the set I_o of threat zones that threaten the current state transition of the unmanned aerial vehicle, where each threat zone o_i has longitude and latitude coordinates (lon_oi, lat_oi) and contributes its degree of threat to s'. The calculation formula is as follows:
[formula image: f_o as the sum over o_i ∈ I_o of the degree of threat of o_i to s', computed from the position of s' and the centre coordinates (lon_oi, lat_oi) of o_i]
f_a represents the penalty term for the flight action of the unmanned aerial vehicle; the flight action taken is a key factor affecting the flight safety of the unmanned aerial vehicle. According to the unmanned aerial vehicle flight action space set in step 2.1, f_a is treated as a discrete function of the action:
[formula image: f_a as a discrete function assigning a value to each of the 5 flight actions]
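The following sketch only illustrates the structure of the reward: a weighted combination of the normalized factors f_d, f_o and f_a. Because the exact formula, the per-zone threat term and the discrete f_a values appear only in figures not reproduced here, the signs of the weights, the reciprocal-distance threat term and the numeric penalties below are assumptions for illustration; haversine_km is the helper defined in the step 1.3 sketch.

```python
def f_d(lon_s2, lat_s2, lon_t, lat_t):
    """Visibility factor: reciprocal of the (haversine) distance from s' to the target."""
    return 1.0 / max(haversine_km(lon_s2, lat_s2, lon_t, lat_t), 1e-6)

def f_o(lon_s2, lat_s2, threat_centers):
    """Threat factor: summed threat degree of the zones in I_o. The per-zone term used here
    (reciprocal distance to the zone centre) is an assumption; the patent's exact
    expression is given only in a figure."""
    return sum(1.0 / max(haversine_km(lon_s2, lat_s2, lon_c, lat_c), 1e-6)
               for (lon_c, lat_c) in threat_centers)

# Assumed discrete action penalties: larger turns penalized more (illustrative values only).
F_A = {0: 0.0, 1: 0.25, 2: 0.25, 3: 0.5, 4: 0.5}

def reward(lon_s2, lat_s2, lon_t, lat_t, threat_centers, action,
           w1=1.0, w2=1.0, w3=1.0):
    """Instant reward for reaching s': reward closeness to the target, penalize
    threat exposure and aggressive manoeuvres (the signs are an assumption)."""
    return w1 * f_d(lon_s2, lat_s2, lon_t, lat_t) \
         - w2 * f_o(lon_s2, lat_s2, threat_centers) \
         - w3 * F_A[action]
```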
as a further limitation of the invention: in the step 3, on the established model, the specific steps of iterative computation by using a Q (lambda) algorithm are as follows:
step 3.1 initialize Q-table
Each state-action pair Q(s, a) in the Q table is initialized. Q(s, ·) denotes the initial value of all state-action pairs in state s and s_T denotes the termination state; Q(s, a) is initialized as follows:
Q(s, a) = 0 if s = s_T, otherwise Q(s, a) = 1 / d_{s,s_T};
that is, if s is the termination state the initial Q value is 0, otherwise the Q value is set to the reciprocal of the distance between s and s_T. With (x, y) the coordinates of state s and (x_T, y_T) the coordinates of state s_T, d_{s,s_T} is calculated as:
d_{s,s_T} = sqrt((x - x_T)² + (y - y_T)²).
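A small sketch of this initialization, assuming (as described above) that d_{s,s_T} is the Euclidean distance over grid coordinates; the function name and state layout are illustrative.

```python
import math

def init_q_table(states, actions, x_t, y_t):
    """Step 3.1: Q = 0 for the termination state, otherwise the reciprocal of the
    grid distance from s to s_T (Euclidean over grid coordinates)."""
    q = {}
    for s in states:
        x, y = s[0], s[1]                      # grid coordinates of state s
        d = math.hypot(x - x_t, y - y_t)
        q[s] = {a: (0.0 if d == 0 else 1.0 / d) for a in actions}
    return q
```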
step 3.2 initialize E value
At the beginning of each learning cycle, initializing the E value E (s, a) of all state-action pairs < s, a > to 0;
step 3.3 uses the Boltzmann distribution strategy for action selection.
In each learning period, firstly setting an initial state, and then selecting an action according to a Boltzmann distribution strategy to perform state transition; the probability p (a | s) of taking action a in the s state is calculated as:
Figure BDA0001957555760000063
where T is the temperature coefficient that controls the exploration intensity of the strategy. A larger temperature coefficient may be used in the early stages of learning to ensure greater exploration, followed by a gradual reduction of the temperature coefficient. An action a is then selected by the roulette-wheel method according to p(a|s), and the value of E(s, a) is incremented by one;
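A sketch of the Boltzmann selection and roulette-wheel draw of step 3.3. The max-shift inside the exponential is a standard numerical-stability trick added here; it is not part of the patent's formula.

```python
import math
import random

def boltzmann_probs(q_row, temperature):
    """p(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T); shifted by the max for stability."""
    q_max = max(q_row.values())
    exps = {a: math.exp((q - q_max) / temperature) for a, q in q_row.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def roulette_select(probs):
    """Roulette-wheel selection: draw a uniform number and walk the cumulative distribution."""
    r = random.random()
    cum = 0.0
    for a, p in probs.items():
        cum += p
        if r <= cum:
            return a
    return a  # guard against floating-point round-off
```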
step 3.4 update Q value
The unmanned aerial vehicle takes the action a selected in step 3.3 in state s and transfers to state s', obtaining the instant reward r; the update formula of Q(s, a) is then:
Q(s, a) = Q(s, a) + α * (r + γ * max_a Q(s', a) - Q(s, a)) * E(s, a)
where α is the learning rate, γ is the discount factor representing the degree of importance attached to future rewards, and max_a Q(s', a) is the maximum Q value in state s';
step 3.5 update E value
E(s, a) = λ * E(s, a) for all state-action pairs, where λ is the weight parameter. When the state s' is a termination state, the learning period ends and the next learning period begins; otherwise the unmanned aerial vehicle transfers to state s' and the process returns to step 3.3 to continue learning;
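Putting steps 3.1-3.5 together, the following sketch runs the Q(λ) learning periods, assuming the helpers sketched above (init_q_table, boltzmann_probs, roulette_select) and externally supplied transition and reward functions; for brevity, states here are plain grid coordinates. Following the text literally, only the current pair's Q value is updated (weighted by its trace) and every trace is then decayed by λ; many Q(λ) implementations instead apply the update to all pairs and also decay by γ, so treat this as one reading of the description rather than the definitive algorithm.

```python
def q_lambda(states, actions, start, terminal, transition_fn, reward_fn,
             episodes=500, max_steps=10000, alpha=0.1, gamma=0.9, lam=0.8, temperature=1.0):
    """One reading of step 3: Q(lambda) iteration with Boltzmann exploration."""
    q = init_q_table(states, actions, terminal[0], terminal[1])         # step 3.1
    for _ in range(episodes):
        e = {s: {a: 0.0 for a in actions} for s in states}              # step 3.2: reset traces
        s = start
        for _ in range(max_steps):
            if s == terminal:                                           # learning period ends
                break
            a = roulette_select(boltzmann_probs(q[s], temperature))     # step 3.3
            e[s][a] += 1.0                                              # bump the trace of <s, a>
            s_next = transition_fn(s, a)
            r = reward_fn(s, a, s_next)
            delta = r + gamma * max(q[s_next].values()) - q[s][a]
            q[s][a] += alpha * delta * e[s][a]                          # step 3.4: Q update
            for s_ in states:                                           # step 3.5: decay all traces
                for a_ in actions:
                    e[s_][a_] *= lam
            s = s_next
    return q
```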
as a further limitation of the invention: the step 4 of calculating the optimal path according to the state cost function specifically comprises the following steps:
step 4.1 State transition Using deterministic policy
After step 3, the state-action value Q has converged. First an initial state s is set, the action a with the maximum Q value in state s is selected, and the state transition is performed; the action is selected as a* = argmax_{a∈A} Q(s, a). After taking action a and transferring to the next state s', the deterministic policy continues to select actions in the same way until the termination state is reached;
step 4.2 mapping the grid space into the latitude and longitude coordinates of the waypoints
The optimal path coordinates in the grid obtained in step 4.1 are mapped to the longitude and latitude coordinates of the waypoints according to the formula in step 1.2, yielding the optimal path of the unmanned aerial vehicle.
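A sketch of step 4 under the same assumptions as above: the greedy action a* = argmax_a Q(s, a) is followed from the start state to the termination state, and the grid coordinates are then mapped to waypoint longitude/latitude with the step 1.2 formula. Function and parameter names are illustrative.

```python
def extract_path(q, start, terminal, transition_fn, lon_o, lat_o, d_min, max_steps=10000):
    """Step 4: follow the deterministic greedy policy a* = argmax_a Q(s, a) from the
    start state to the termination state, then map grid nodes to lon/lat waypoints."""
    path, s = [start], start
    for _ in range(max_steps):
        if s == terminal:
            break
        a = max(q[s], key=q[s].get)        # a* = argmax_a Q(s, a)
        s = transition_fn(s, a)
        path.append(s)
    # step 1.2 mapping: grid coordinates -> waypoint longitude/latitude
    return [(lon_o + d_min * x, lat_o + d_min * y) for (x, y) in path]
```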
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. the minimum track segment length of the unmanned aerial vehicle is used as the discretization step length, the self-restraint of the unmanned aerial vehicle is considered, the defect that the discretization process of environment modeling lacks basis is overcome, and the discretization planning space capable of fully exerting the flight capacity of the unmanned aerial vehicle is obtained;
2. when the state transition probability is set, when the unmanned aerial vehicle enters a threat area due to actions taken by the unmanned aerial vehicle, the unmanned aerial vehicle does not generate state transition, the current state is kept unchanged, the learning of the current period is continued, the defect that the interaction between the unmanned aerial vehicle and the environment is limited to be near the initial state at the initial stage of learning is overcome, and the convergence speed of the algorithm is improved;
3. The Q learning algorithm does not need to acquire global knowledge of the environment; it interacts with the environment continuously in a trial-and-error manner and approaches the optimal strategy by optimizing the behavior value function, which suits the case where the unmanned aerial vehicle's environment is unknown or partially unknown in a dynamic setting and guides the unmanned aerial vehicle in autonomous path planning;
4. The traditional Q learning algorithm looks only one step ahead of the current state during iteration; by introducing the utility tracking (eligibility trace) function into Q learning, the predictions of all subsequent steps are considered comprehensively, making the calculation of the value function more accurate. The method also updates effectively online: the Q value update does not have to wait until the end of a learning period, earlier learning data can be discarded, and the convergence of the algorithm is accelerated.
Drawings
Fig. 1 shows the discrete actions of the unmanned aerial vehicle and the transfer results thereof in the grid space.
Fig. 2 is a flow chart of algorithm iteration within each learning cycle.
Detailed Description
The invention is further explained below with reference to the drawings.
For convenience of description, the main variables in the algorithm are simply defined:
The longitude and latitude coordinates of the starting point of the unmanned aerial vehicle are S = (lon_S, lat_S), the longitude and latitude coordinates of the target point are T = (lon_T, lat_T), the size of the grid space is m × n, and the coordinates of points in the grid space are (x, y). The Markov model is expressed by the quadruple <S, A, P, R>, where S is the state space of the unmanned aerial vehicle, A is the action space of the unmanned aerial vehicle, R is the reward function, and P is the state transition probability matrix.
The invention provides an unmanned aerial vehicle path planning method based on a Q (lambda) algorithm, which comprises an environment modeling step, a Markov decision process model initialization step, a Q (lambda) algorithm iterative computation step and an optimal path computation step according to a state cost function;
the method comprises the following specific steps:
step 1) environmental modeling step
Step 1.1) setting the step length of the grid space to the minimum track segment length d_min of the unmanned aerial vehicle;
Step 1.2) calculating the size m × n of the grid space according to the formula [formula image: calculation of m and n from S, T and d_min];
Step 1.3) mapping grid space coordinates to waypoint longitude and latitude coordinates according to the formulas lon_xy = lon_o + d_min * x and lat_xy = lat_o + d_min * y, where (lon_o, lat_o) are the longitude and latitude coordinates corresponding to the grid space origin (0, 0);
Step 1.4) in the grid space, marking nodes containing a threat zone as 1, representing a no-fly area, and marking nodes without a threat zone as 0, representing a flyable area;
step 2) Markov decision process model initialization
Step 2.1) setting 5 flying actions of the unmanned aerial vehicle according to the unmanned aerial vehicle transfer direction shown in fig. 1, wherein the direct flight is represented by a numeral 0, the right turn is represented by 1, the left turn is represented by 2, the right turn is represented by 90, and the left turn is represented by 4, the flying action space of the unmanned aerial vehicle is represented by a ═ 0,1,2,3,4, and each numeral represents an action;
Step 2.2) the state transition probability is set as follows: when the action taken by the unmanned aerial vehicle would lead it into a threat area or out of the state space, no state transition occurs, i.e., the state of the unmanned aerial vehicle does not change; in all other cases the unmanned aerial vehicle transitions with probability 1 to the state the action points to. The state transition probability is calculated as:
P_{ss'}^a = 1 if s' is the state that action a points to and s' ∈ S \ O; P_{ss}^a = 1 if the state that action a points to lies in O or outside S; P_{ss'}^a = 0 otherwise,
where O is the threat zone space;
Step 2.3) the instant reward function R_{ss'}^a obtained when the unmanned aerial vehicle takes action a in state s and transfers to state s' is computed as a weighted combination of the normalized track evaluation factors [formula image: R_{ss'}^a as a weighted combination of w_1·f_d, w_2·f_o and w_3·f_a], where w_1, w_2, w_3 are the weighting coefficients and f_d, f_o, f_a are the normalized track evaluation factors;
Step 2.4) f_d expresses visibility and is taken as the reciprocal of the distance from state s' to the target point; the longitude and latitude coordinates of s' are s' = (lon_s', lat_s') and those of the target point are T = (lon_T, lat_T), and f_d is calculated as f_d = 1 / d(s', T), where d(s', T) is the haversine distance between s' and T;
Step 2.5) f_o represents the degree of threat posed by the threat zones to state s'; it is accumulated over the set I_o of threat zones that threaten the current state transition of the unmanned aerial vehicle, where each threat zone o_i has longitude and latitude coordinates (lon_oi, lat_oi) and contributes its degree of threat to s' [formula image: f_o as the sum over o_i ∈ I_o of the degree of threat of o_i to s'];
Step 2.6) f_a represents the penalty term for the flight action of the unmanned aerial vehicle; the flight action taken is a key factor affecting flight safety. According to the unmanned aerial vehicle flight action space set in step 2.1, f_a is treated as a discrete function of the action [formula image: f_a as a discrete function assigning a value to each of the 5 flight actions];
step 3) performing iterative computation on the established model by using a Q (lambda) algorithm, wherein the iterative flow of the algorithm in each learning period is shown in FIG. 2;
Step 3.1) Q value initialization is performed for each state-action pair Q(s, a) in the Q table; with s_T denoting the termination state, Q(s, a) is initialized to 0 if s = s_T and to the reciprocal of the distance between s and s_T otherwise;
step 3.2) at the beginning of each learning cycle, initializing the E value E (s, a) of all state action pairs < s, a > to 0;
step 3.3) setting an initial state;
Step 3.4) selecting an action according to the Boltzmann distribution strategy, where the probability p(a|s) of taking action a in state s is calculated as p(a|s) = exp(Q(s, a) / T) / Σ_{a'∈A} exp(Q(s, a') / T);
step 3.5) according to the formula:
Q(s, a) = Q(s, a) + α * (r + γ * max_a Q(s', a) - Q(s, a)) * E(s, a)
updating Q (s, a);
Step 3.6) updating the E value according to the formula E(s, a) = λ * E(s, a);
Step 3.7) taking action a and transferring to the next state s'; if s' is a termination state, the learning period ends and the process returns to step 3.2) to enter the next learning period; otherwise it returns to step 3.4) to continue the iteration.
Step 4), calculating an optimal path according to the state cost function:
Step 4.1) after step 3), the state-action value Q has converged. An initial state s is set first, the action a with the maximum Q value in state s is selected, and the state transition is performed; the action is selected as a* = argmax_{a∈A} Q(s, a). After taking action a and transferring to the next state s', the deterministic policy continues to select actions until the termination state is reached;
Step 4.2) mapping the optimal path coordinates in the grid obtained in step 4.1) to the longitude and latitude coordinates of the waypoints according to the formula in step 1.3), so as to obtain the optimal path of the unmanned aerial vehicle.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can understand that the modifications or substitutions within the technical scope of the present invention are included in the scope of the present invention, and therefore, the scope of the present invention should be subject to the protection scope of the claims.

Claims (5)

1. An unmanned aerial vehicle path planning method based on a Q (lambda) algorithm is characterized in that: the method comprises the following steps:
step 1, environment modeling: acquiring environment information by using a sensor, identifying a threat area, modeling the flight environment of the unmanned aerial vehicle by using a grid method, discretizing continuous space, generating a uniform grid map according to the set space size, and taking the grid vertex as a discrete waypoint;
step 2, initializing a Markov decision process model: initializing a Markov decision process model suitable for solving the unmanned aerial vehicle path plan, wherein the Markov decision process model is represented by a quadruple < S, A, P and R >, S is a state space where the unmanned aerial vehicle is located, A is an action space of the unmanned aerial vehicle, P is a state transition matrix, and R is a reward function;
Step 3, performing iterative calculation on the established model with the Q(λ) algorithm: on the basis of the models established in steps 1 and 2, iterative computation is performed with the Q(λ) algorithm, which combines the Q-learning algorithm with utility tracking; a state-action value function Q(s, a) is introduced to represent the value of the unmanned aerial vehicle taking action a in state s, and a Q table is established to store the value of each state-action pair <s, a>; a utility tracking function E(s, a) is introduced to represent the causal relationship between the termination state and the state-action pair <s, a>; the Q values and E values are initialized, and then in each learning period the action a taken in state s is selected by the Boltzmann strategy; after action a transfers the unmanned aerial vehicle to the next state s', the value of Q(s, a) is updated by the Q value update formula and the E values of all state-action pairs are updated by the E value update formula; when the termination state is reached the current learning period ends, and when the maximum number of learning periods is reached the iterative calculation process of the Q(λ) algorithm ends;
Step 4, calculating the optimal path according to the state cost function: from the converged state cost function obtained in step 3, the action a with the maximum Q value is selected in state s, and the deterministic strategy is followed after each action until the termination state is reached; finally the grid nodes are mapped to longitude and latitude coordinates to obtain the optimal path.
2. The unmanned aerial vehicle path planning method based on Q (lambda) algorithm of claim 1, wherein: the step 1 of environment modeling specifically comprises the following steps:
step 1.1, initializing a grid space according to the minimum track segment length of the unmanned aerial vehicle;
The unmanned aerial vehicle flies along straight lines between waypoints and changes its flight attitude at certain waypoints according to track requirements; the minimum track segment length is the shortest distance the unmanned aerial vehicle is required to fly straight before changing its flight attitude. Setting the step length according to the minimum track segment length of the unmanned aerial vehicle yields a discrete grid space that satisfies the unmanned aerial vehicle's own constraints;
Setting the longitude and latitude coordinates of the starting position of the unmanned aerial vehicle as S = (lon_S, lat_S), the longitude and latitude coordinates of the target point as T = (lon_T, lat_T), the minimum track segment length of the unmanned aerial vehicle as d_min, and the size of the grid space as m × n; with d_min set as the grid step length, the calculation formulas for m and n are as follows:
[formula image: calculation of m and n from S, T and d_min]
step 1.2, mapping the grid space coordinate into a waypoint;
Taking the grid vertices as discrete waypoints, coordinates in the grid space are expressed as (x, y). Let the longitude and latitude coordinates corresponding to the origin (0, 0) of the grid space be (lon_o, lat_o); the longitude and latitude coordinates (lon_xy, lat_xy) of the waypoint corresponding to (x, y) are calculated as follows: lon_xy = lon_o + d_min * x, lat_xy = lat_o + d_min * y;
Step 1.3 representation of threat zone information;
The spatial position of threat sources must be considered during flight. According to the type of threat source, threat areas are divided into circular areas and polygonal areas; in the grid space, a node containing a threat area is marked 1 and represents a no-fly area, and a node without a threat area is marked 0 and represents a flyable area. For a circular threat zone, let the coordinates of the zone center be (lon_c, lat_c) and the radius of the threat zone be r (km). For each node (x, y) in the grid, the distance d_xyo from the waypoint corresponding to the node to the center of the threat zone is calculated with the haversine formula, which computes the distance between two points on a sphere from their longitude and latitude coordinates;
d_xyo = 2R · arcsin( sqrt( sin²((lat_xy - lat_c)/2) + cos(lat_c) · cos(lat_xy) · sin²((lon_xy - lon_c)/2) ) ), where R is the Earth radius and angles are in radians;
If d_xyo < r, the node corresponding to (x, y) is labeled 1, otherwise it is labeled 0. For a polygonal threat region, a ray is cast from the waypoint (lon_xy, lat_xy) in the horizontal direction to the right or left and the number of intersection points between the ray and the polygonal area is counted; if the number of intersections is odd, the (x, y) node lies inside the polygonal threat area and is marked 1, and if the number of intersections is even, the node lies outside the polygonal threat area and is marked 0.
3. The unmanned aerial vehicle path planning method based on the Q (λ) algorithm of claim 2, wherein: the step 2 of initializing the Markov decision process model specifically comprises the following steps:
step 2.1 shows the flight action space of the unmanned aerial vehicle
Taking the grid vertexes as route points in the grid space, wherein eight transfer directions are shared from one vertex to another vertex except for boundary points; limiting the transfer direction to a certain extent according to the self-restraint of the unmanned aerial vehicle and the threat distribution of the space, generalizing the behavior of the unmanned aerial vehicle into a discrete action space, and discretizing the course state at an interval of 45 degrees to obtain 8 discrete states; according to the set discretized course state, 5 flight actions of the unmanned aerial vehicle are set, the straight flight is represented by a numeral 0, the right turn is represented by 1, the left turn is represented by 2, the right turn is represented by 3, and the left turn is represented by 4, so that the action space is represented by A ═ 0,1,2,3,4, and each numeral represents an action;
step 2.2 design State transition probability
The state transition probability is the conditional probability that the unmanned aerial vehicle reaches another waypoint state after executing an action in a given waypoint state; P_{ss'}^a denotes the probability that the unmanned aerial vehicle, in state s, performs action a and transitions to state s';
In the early learning stage, the unmanned aerial vehicle knows nothing about the environment and easily enters a threat area; entering a threat area ends the learning period, which restricts exploration of the environment to the neighborhood of the initial state. Therefore, when the action taken by the unmanned aerial vehicle would lead it into a threat area or out of the state space, no state transition occurs, i.e., the state of the unmanned aerial vehicle does not change; in all other cases the unmanned aerial vehicle transitions with probability 1 (100%) to the state the action points to. Let the state space of the unmanned aerial vehicle be S and the threat area space be O; then P_{ss'}^a is calculated as:
P_{ss'}^a = 1 if s' is the state that action a points to and s' ∈ S \ O; P_{ss}^a = 1 (no transition) if the state that action a points to lies in O or outside S; P_{ss'}^a = 0 otherwise.
step 2.3 construction of reward function
The unmanned aerial vehicle obtains an instant reward each time it transfers to the next waypoint state. The learning objective of the Q(λ) algorithm is to maximize the accumulated instant reward, and the construction of the reward function must consider the various indices that influence track performance, including the distance to the target point, flight safety and threat degree;
R_{ss'}^a denotes the instant reward function obtained when the unmanned aerial vehicle takes action a in state s and transfers to state s', computed as a weighted combination of the normalized track evaluation factors, where w_1, w_2, w_3 are the weighting coefficients and f_d, f_o, f_a are the normalized track evaluation factors:
[formula image: R_{ss'}^a as a weighted combination of w_1·f_d, w_2·f_o and w_3·f_a]
f_d expresses visibility and is taken as the reciprocal of the distance from state s' to the target point; the longitude and latitude coordinates of s' are s' = (lon_s', lat_s') and those of the target point are T = (lon_T, lat_T), and f_d is calculated as:
f_d = 1 / d(s', T), where d(s', T) is the haversine distance between (lon_s', lat_s') and (lon_T, lat_T);
f_o represents the degree of threat posed by the threat zones to state s'; it is accumulated over the set I_o of threat zones that threaten the current state transition of the unmanned aerial vehicle, where each threat zone o_i has longitude and latitude coordinates (lon_oi, lat_oi) and contributes its degree of threat to s'. The calculation formula is as follows:
[formula image: f_o as the sum over o_i ∈ I_o of the degree of threat of o_i to s', computed from the position of s' and the centre coordinates (lon_oi, lat_oi) of o_i]
f_a represents the penalty term for the flight action of the unmanned aerial vehicle; the flight action taken is a key factor affecting the flight safety of the unmanned aerial vehicle. According to the unmanned aerial vehicle flight action space set in step 2.1, f_a is treated as a discrete function of the action:
[formula image: f_a as a discrete function assigning a value to each of the 5 flight actions]
4. the unmanned aerial vehicle path planning method based on Q (λ) algorithm of claim 3, wherein: in the step 3, on the established model, the specific steps of iterative computation by using a Q (lambda) algorithm are as follows:
step 3.1 initialize Q-table
Each state-action pair Q(s, a) in the Q table is initialized. Q(s, ·) denotes the initial value of all state-action pairs in state s and s_T denotes the termination state; Q(s, a) is initialized as follows:
Q(s, a) = 0 if s = s_T, otherwise Q(s, a) = 1 / d_{s,s_T};
that is, if s is the termination state the initial Q value is 0, otherwise the Q value is set to the reciprocal of the distance between s and s_T. With (x, y) the coordinates of state s and (x_T, y_T) the coordinates of state s_T, d_{s,s_T} is calculated as:
d_{s,s_T} = sqrt((x - x_T)² + (y - y_T)²);
step 3.2 initialize E value
At the beginning of each learning cycle, initializing the E value E (s, a) of all state-action pairs < s, a > to 0;
step 3.3, selecting actions by using a Boltzmann distribution strategy;
In each learning period, an initial state is first set, and actions are then selected according to the Boltzmann distribution strategy to perform state transitions; the probability p(a|s) of taking action a in state s is calculated as:
p(a|s) = exp(Q(s, a) / T) / Σ_{a'∈A} exp(Q(s, a') / T)
wherein T is the temperature coefficient that controls the exploration intensity of the strategy; a larger temperature coefficient is used in the initial learning stage to ensure stronger exploration, and the temperature coefficient is then gradually reduced; an action a is then selected by the roulette-wheel method according to p(a|s), and the value of E(s, a) is incremented by one;
step 3.4 updating Q value
The unmanned aerial vehicle takes the action a selected in step 3.3 in state s and transfers to state s', obtaining the instant reward r; the update formula of Q(s, a) is then:
Q(s, a) = Q(s, a) + α * (r + γ * max_a Q(s', a) - Q(s, a)) * E(s, a)
where α is the learning rate, γ is the discount factor representing the degree of importance attached to future rewards, and max_a Q(s', a) is the maximum Q value in state s';
step 3.5 update E value
E(s, a) = λ * E(s, a) for all state-action pairs, where λ is the weight parameter; when the state s' is a termination state, the learning period ends and the next learning period begins; otherwise the unmanned aerial vehicle transfers to state s' and returns to step 3.3 to continue the learning process.
5. The unmanned aerial vehicle path planning method based on Q (λ) algorithm of claim 4, wherein: the step 4 of calculating the optimal path according to the state cost function specifically comprises the following steps:
step 4.1 State transition Using deterministic policy
After step 3, the state-action value Q has converged. First an initial state s is set, the action a with the maximum Q value in state s is selected, and the state transition is performed; the action is selected as a* = argmax_{a∈A} Q(s, a). After taking action a and transferring to the next state s', the deterministic policy continues to select actions in the same way until the termination state is reached;
step 4.2 mapping the grid space into the latitude and longitude coordinates of the waypoints
The optimal path coordinates in the grid obtained in step 4.1 are mapped to the longitude and latitude coordinates of the waypoints according to the formula in step 1.2, yielding the optimal path of the unmanned aerial vehicle.
CN201910071929.6A 2019-01-25 2019-01-25 Unmanned aerial vehicle path planning method based on Q (lambda) algorithm Active CN109655066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910071929.6A CN109655066B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle path planning method based on Q (lambda) algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910071929.6A CN109655066B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle path planning method based on Q (lambda) algorithm

Publications (2)

Publication Number Publication Date
CN109655066A CN109655066A (en) 2019-04-19
CN109655066B true CN109655066B (en) 2022-05-17

Family

ID=66121623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910071929.6A Active CN109655066B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle path planning method based on Q (lambda) algorithm

Country Status (1)

Country Link
CN (1) CN109655066B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134140B (en) * 2019-05-23 2022-01-11 南京航空航天大学 Unmanned aerial vehicle path planning method based on potential function reward DQN under continuous state of unknown environmental information
CN110320931A (en) * 2019-06-20 2019-10-11 西安爱生技术集团公司 Unmanned plane avoidance Route planner based on Heading control rule
CN110324805B (en) * 2019-07-03 2022-03-08 东南大学 Unmanned aerial vehicle-assisted wireless sensor network data collection method
CN110428115A (en) * 2019-08-13 2019-11-08 南京理工大学 Maximization system benefit method under dynamic environment based on deeply study
CN111340324B (en) * 2019-09-25 2022-06-07 中国人民解放军国防科技大学 Multilayer multi-granularity cluster task planning method based on sequential distribution
CN110673637B (en) * 2019-10-08 2022-05-13 福建工程学院 Unmanned aerial vehicle pseudo path planning method based on deep reinforcement learning
CN110726416A (en) * 2019-10-23 2020-01-24 西安工程大学 Reinforced learning path planning method based on obstacle area expansion strategy
CN110879610B (en) * 2019-10-24 2021-08-13 北京航空航天大学 Reinforced learning method for autonomous optimizing track planning of solar unmanned aerial vehicle
CN111006693B (en) * 2019-12-12 2021-12-21 中国人民解放军陆军工程大学 Intelligent aircraft track planning system and method thereof
CN111026157B (en) * 2019-12-18 2020-07-28 四川大学 Intelligent aircraft guiding method based on reward remodeling reinforcement learning
CN111123963B (en) * 2019-12-19 2021-06-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111160755B (en) * 2019-12-26 2023-08-18 西北工业大学 Real-time scheduling method for aircraft overhaul workshop based on DQN
CN111328023B (en) * 2020-01-18 2021-02-09 重庆邮电大学 Mobile equipment multitask competition unloading method based on prediction mechanism
CN111399541B (en) * 2020-03-30 2022-07-15 西北工业大学 Unmanned aerial vehicle whole-region reconnaissance path planning method of unsupervised learning type neural network
CN111479216B (en) * 2020-04-10 2021-06-01 北京航空航天大学 Unmanned aerial vehicle cargo conveying method based on UWB positioning
CN111538059B (en) * 2020-05-11 2022-11-11 东华大学 Self-adaptive rapid dynamic positioning system and method based on improved Boltzmann machine
CN111612162B (en) * 2020-06-02 2021-08-27 中国人民解放军军事科学院国防科技创新研究院 Reinforced learning method and device, electronic equipment and storage medium
CN111736461B (en) * 2020-06-30 2021-05-04 西安电子科技大学 Unmanned aerial vehicle task collaborative allocation method based on Q learning
CN111880563B (en) * 2020-07-17 2022-07-15 西北工业大学 Multi-unmanned aerial vehicle task decision method based on MADDPG
CN112130124B (en) * 2020-09-18 2023-11-24 郑州市混沌信息技术有限公司 Quick calibration and error processing method for unmanned aerial vehicle management and control equipment in civil aviation airport
CN112356031B (en) * 2020-11-11 2022-04-01 福州大学 On-line planning method based on Kernel sampling strategy under uncertain environment
CN113033815A (en) * 2021-02-07 2021-06-25 广州杰赛科技股份有限公司 Intelligent valve cooperation control method, device, equipment and storage medium
CN112525213B (en) * 2021-02-10 2021-05-14 腾讯科技(深圳)有限公司 ETA prediction method, model training method, device and storage medium
CN113093803B (en) * 2021-04-03 2022-10-14 西北工业大学 Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm
CN113176786A (en) * 2021-04-23 2021-07-27 成都凯天通导科技有限公司 Q-Learning-based hypersonic aircraft dynamic path planning method
CN114020009B (en) * 2021-10-20 2024-03-29 中国航空工业集团公司洛阳电光设备研究所 Small fixed-wing unmanned aerial vehicle terrain burst prevention planning method
CN114115340A (en) * 2021-11-15 2022-03-01 南京航空航天大学 Airspace cooperative control method based on reinforcement learning
CN114153213A (en) * 2021-12-01 2022-03-08 吉林大学 Deep reinforcement learning intelligent vehicle behavior decision method based on path planning
CN113867369B (en) * 2021-12-03 2022-03-22 中国人民解放军陆军装甲兵学院 Robot path planning method based on alternating current learning seagull algorithm
CN115192452A (en) * 2022-07-27 2022-10-18 苏州泽达兴邦医药科技有限公司 Traditional Chinese medicine production granulation process and process strategy calculation method
CN115562357B (en) * 2022-11-23 2023-03-14 南京邮电大学 Intelligent path planning method for unmanned aerial vehicle cluster

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106970648B (en) * 2017-04-19 2019-05-14 北京航空航天大学 Unmanned plane multi-goal path plans combined method for searching under the environment of city low latitude
CN108413959A (en) * 2017-12-13 2018-08-17 南京航空航天大学 Based on the Path Planning for UAV for improving Chaos Ant Colony Optimization
CN108171315B (en) * 2017-12-27 2021-11-19 南京邮电大学 Multi-unmanned aerial vehicle task allocation method based on SMC particle swarm algorithm
CN108170147B (en) * 2017-12-31 2020-10-16 南京邮电大学 Unmanned aerial vehicle task planning method based on self-organizing neural network
CN108319286B (en) * 2018-03-12 2020-09-22 西北工业大学 Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning

Also Published As

Publication number Publication date
CN109655066A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN109655066B (en) Unmanned aerial vehicle path planning method based on Q (lambda) algorithm
Singla et al. Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge
Zhu et al. Chaotic predator–prey biogeography-based optimization approach for UCAV path planning
Liu et al. Adaptive sensitivity decision based path planning algorithm for unmanned aerial vehicle with improved particle swarm optimization
CN107450593B (en) Unmanned aerial vehicle autonomous navigation method and system
Sharma et al. Path planning for multiple targets interception by the swarm of UAVs based on swarm intelligence algorithms: A review
CN112435275A (en) Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN109597425A (en) Navigation of Pilotless Aircraft and barrier-avoiding method based on intensified learning
CN110926477A (en) Unmanned aerial vehicle route planning and obstacle avoidance method
US20210325891A1 (en) Graph construction and execution ml techniques
CN113268074B (en) Unmanned aerial vehicle flight path planning method based on joint optimization
Haghighi et al. Multi-objective cooperated path planning of multiple unmanned aerial vehicles based on revisit time
Lu et al. Real-time perception-limited motion planning using sampling-based MPC
Chen et al. Risk-aware trajectory sampling for quadrotor obstacle avoidance in dynamic environments
Xue et al. A uav navigation approach based on deep reinforcement learning in large cluttered 3d environments
Saha et al. Real-time robot path planning around complex obstacle patterns through learning and transferring options
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
Rottmann et al. Adaptive autonomous control using online value iteration with gaussian processes
Lu et al. Flight with limited field of view: A parallel and gradient-free strategy for micro aerial vehicle
Hao et al. A search and rescue robot search method based on flower pollination algorithm and Q-learning fusion algorithm
Chronis et al. Dynamic Navigation in Unconstrained Environments Using Reinforcement Learning Algorithms
Quinones-Ramirez et al. Robot path planning using deep reinforcement learning
Niu et al. 3D real-time dynamic path planning for UAV based on improved interfered fluid dynamical system and artificial neural network
KR20220090732A (en) Method and system for determining action of device for given state using model trained based on risk measure parameter
Liao Control, Planning, and Learning for Multi-UAV Cooperative Hunting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant