CN109655066B - Unmanned aerial vehicle path planning method based on Q (lambda) algorithm - Google Patents

Unmanned aerial vehicle path planning method based on Q (lambda) algorithm

Info

Publication number
CN109655066B
CN109655066B
Authority
CN
China
Prior art keywords
state
aerial vehicle
unmanned aerial
action
threat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910071929.6A
Other languages
Chinese (zh)
Other versions
CN109655066A (en)
Inventor
张迎周
竺殊荣
高扬
孙仪
张灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910071929.6A
Publication of CN109655066A
Application granted
Publication of CN109655066B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft

Abstract

The invention provides an unmanned aerial vehicle path planning method based on the Q(λ) algorithm, comprising an environment modeling step, a Markov decision process model initialization step, a Q(λ) algorithm iterative computation step, and a step of computing the optimal path from the state cost function. First, a grid space is initialized according to the minimum track segment length of the unmanned aerial vehicle, grid space coordinates are mapped to waypoints, and circular and polygonal threat regions are represented. Then the Markov decision model is established, comprising the representation of the unmanned aerial vehicle flight action space, the design of the state transition probability and the construction of the reward function. The Q(λ) algorithm is then iterated on the established model, and the optimal path along which the unmanned aerial vehicle can safely avoid the threat regions is computed from the finally converged state cost function. By combining the traditional Q-learning algorithm with utility tracking (eligibility traces), the speed and accuracy of value function convergence are improved, and the unmanned aerial vehicle is guided to avoid threat areas and perform autonomous path planning.

Description

Unmanned aerial vehicle path planning method based on Q (lambda) algorithm
Technical Field
The invention relates to an unmanned aerial vehicle, in particular to a path planning method for the unmanned aerial vehicle, and belongs to the technical field of heuristic algorithms.
Background
Unmanned aerial vehicle path planning is an important component of unmanned aerial vehicle mission planning and an important stage in realizing autonomous task execution by the unmanned aerial vehicle. It requires planning a flight path from a starting point to a target point in an environment whose information is known, partially known or completely unknown, so that threat zones and obstacles are bypassed, the flight path is safe, reliable and collision-free, and various constraint conditions are satisfied. According to how much information about the battlefield environment has been acquired, path planning is divided into global path planning and local path planning.
In practical applications, if the unmanned aerial vehicle can obtain global knowledge of the environment, path planning can be achieved using dynamic programming. However, as the complexity and uncertainty of the battlefield environment increase, the unmanned aerial vehicle often has little prior knowledge of the environment, so in practice it must have a strong ability to adapt to dynamic environments. In this case, local path planning techniques that rely on sensor information to perceive threat area information in real time show great advantages.
The existing local path planning techniques suffer from problems such as easily falling into local minima or local oscillation, high algorithmic time cost, large computer storage requirements, and rules that are difficult to determine. Behavior-based unmanned aerial vehicle path planning has become a hotspot of current research; its essence is to map the environmental state sensed by the sensors to the actions of the actuators, but in real complex environments the design of the state feature vector and the acquisition of supervised samples are often very difficult. Therefore, these problems need to be solved.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle path planning method based on the Q(λ) algorithm, which combines Q learning with utility tracking (eligibility traces), assigns quantized reward and punishment signals to the environment states sensed by the sensors, and, through continuous interaction with the environment, guides the unmanned aerial vehicle to plan its path autonomously and safely avoid threat regions. The method responds quickly to changes in the external environment, has the advantages of being fast and real-time, and improves the adaptability of the unmanned aerial vehicle in unknown or partially unknown environments.
The invention provides an unmanned aerial vehicle path planning method based on the Q(λ) algorithm, characterized by comprising the following steps:
step 1, environment modeling: identifying a threat area by utilizing environmental information acquired by a sensor, modeling the flight environment of the unmanned aerial vehicle by using a grid method, discretizing continuous space, generating a uniform grid map according to the set space size, and taking the grid vertex as a discrete waypoint;
step 2, initializing a Markov decision process model: initializing a Markov decision process model suitable for solving the unmanned aerial vehicle path plan, wherein the Markov decision process model can be represented by a quadruple < S, A, P and R >, S is a state space where the unmanned aerial vehicle is located, A is an action space of the unmanned aerial vehicle, P is a state transition matrix, R is a reward function, and the Markov decision process model initialization comprises representation of the flight action space of the unmanned aerial vehicle, design of the state transition probability and construction of the reward function;
Step 3, performing iterative calculation on the established model with the Q(λ) algorithm: on the basis of the models established in steps 1 and 2, iterative computation is performed with the Q(λ) algorithm, which combines the Q-learning algorithm with utility tracking; a state-action value function Q(s, a) is introduced to represent the value of the unmanned aerial vehicle taking action a in state s, and a Q table is established to store the value of each state-action pair <s, a>; a utility tracking function E(s, a) is introduced to represent the causal relationship between the termination state and the state-action pair <s, a>; the Q values and E values are initialized, and then in each learning period the action a taken in state s is selected by the Boltzmann strategy; after action a transfers the unmanned aerial vehicle to the next state s', the value of Q(s, a) is updated by the Q value update formula and the E values of all state-action pairs are updated by the E value update formula; when the termination state is reached the current learning period ends, and when the maximum number of learning periods is reached the iterative calculation process of the Q(λ) algorithm ends;
Step 4, calculating the optimal path according to the state cost function: from the converged state cost function obtained in step 3, the action a with the maximum Q value is selected in state s, and the deterministic strategy is followed after each action until the termination state is reached; finally the grid nodes are mapped to longitude and latitude coordinates to obtain the optimal path.
As a further limitation of the invention: the step 1 of environment modeling specifically comprises the following steps:
step 1.1, initializing a grid space according to the minimum track segment length of the unmanned aerial vehicle;
The unmanned aerial vehicle flies along straight lines between waypoints and changes its flight attitude at certain waypoints according to track requirements; the minimum track segment length is the shortest distance the unmanned aerial vehicle is required to fly straight before changing its flight attitude. Setting the step length according to the minimum track segment length of the unmanned aerial vehicle yields a discrete grid space that satisfies the unmanned aerial vehicle's own constraints;
Setting the longitude and latitude coordinates of the starting position of the unmanned aerial vehicle as S = (lon_S, lat_S), the longitude and latitude coordinates of the target point as T = (lon_T, lat_T), the minimum track segment length of the unmanned aerial vehicle as d_min, and the size of the grid space as m × n; with d_min set as the grid step length, the calculation formulas for m and n are as follows:
[formula image: calculation of m and n from S, T and d_min]
step 1.2, mapping the grid space coordinate into a waypoint;
Taking the grid vertices as discrete waypoints, coordinates in the grid space are expressed as (x, y). Let the longitude and latitude coordinates corresponding to the origin (0, 0) of the grid space be (lon_o, lat_o); the longitude and latitude coordinates (lon_xy, lat_xy) of the waypoint corresponding to (x, y) are calculated as follows: lon_xy = lon_o + d_min * x, lat_xy = lat_o + d_min * y.
Step 1.3 representation of threat zone information;
The spatial position of threat sources must be considered during flight. According to the type of threat source, threat areas are divided into circular areas and polygonal areas; in the grid space, a node containing a threat area is marked 1 and represents a no-fly area, and a node without a threat area is marked 0 and represents a flyable area. For a circular threat zone, let the coordinates of the zone center be (lon_c, lat_c) and the radius of the threat zone be r (km). For each node (x, y) in the grid, the distance d_xyo from the waypoint corresponding to the node to the center of the threat zone is calculated with the haversine formula, which computes the distance between two points on a sphere from their longitude and latitude coordinates;
d_xyo = 2R · arcsin( sqrt( sin²((lat_xy - lat_c)/2) + cos(lat_c) · cos(lat_xy) · sin²((lon_xy - lon_c)/2) ) ), where R is the Earth radius and angles are in radians;
If d_xyo < r, the node corresponding to (x, y) is labeled 1, otherwise it is labeled 0. For a polygonal threat region, a ray is cast from the waypoint (lon_xy, lat_xy) in the horizontal direction to the right (or left) and the number of intersection points between the ray and the polygonal area is counted; if the number of intersections is odd, the waypoint lies inside the polygonal threat area and the node is marked 1, and if the number of intersections is even, the waypoint lies outside the polygonal threat area and the node is marked 0.
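As an illustration of steps 1.2 and 1.3, the sketch below maps grid coordinates to waypoints and marks no-fly nodes for circular and polygonal threat zones. It is a minimal Python sketch under stated assumptions: the helper names (grid_to_waypoint, build_occupancy, etc.) and the Earth radius value are illustrative and not taken from the patent; the circular test uses the haversine distance and the polygonal test uses the ray-casting rule described above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed value; the patent does not state the radius used

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points (the haversine formula of step 1.3)."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def grid_to_waypoint(x, y, lon_o, lat_o, d_min):
    """Step 1.2: map grid coordinates (x, y) to waypoint lon/lat with the patent's formula."""
    return lon_o + d_min * x, lat_o + d_min * y

def in_polygon(lon, lat, vertices):
    """Step 1.3: cast a ray to the right; an odd number of crossings means the point is inside."""
    inside = False
    n = len(vertices)
    for i in range(n):
        (lon1, lat1), (lon2, lat2) = vertices[i], vertices[(i + 1) % n]
        if (lat1 > lat) != (lat2 > lat):
            lon_cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon_cross > lon:
                inside = not inside
    return inside

def build_occupancy(m, n, lon_o, lat_o, d_min, circles, polygons):
    """Mark each node 1 (no-fly) if it lies in any threat zone, else 0 (flyable)."""
    grid = [[0] * n for _ in range(m)]
    for x in range(m):
        for y in range(n):
            lon, lat = grid_to_waypoint(x, y, lon_o, lat_o, d_min)
            in_circle = any(haversine_km(lon, lat, lon_c, lat_c) < r
                            for (lon_c, lat_c, r) in circles)
            in_poly = any(in_polygon(lon, lat, verts) for verts in polygons)
            grid[x][y] = 1 if (in_circle or in_poly) else 0
    return grid
```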
As a further limitation of the invention: the step 2 of initializing the Markov decision process model specifically comprises the following steps:
Step 2.1 Representation of the unmanned aerial vehicle flight action space
Taking the grid vertices as waypoints in the grid space, there are eight transfer directions from one vertex to another (except for boundary points). The transfer direction is limited to a certain extent according to the unmanned aerial vehicle's own constraints and the threat distribution of the space; the behavior of the unmanned aerial vehicle is generalized into a discrete action space, and the heading state is discretized at intervals of 45 degrees, giving 8 discrete states. According to the discretized heading states, 5 flight actions of the unmanned aerial vehicle are set: straight flight is represented by the numeral 0, a 45-degree right turn by 1, a 45-degree left turn by 2, a 90-degree right turn by 3, and a 90-degree left turn by 4, so the action space is represented as A = {0, 1, 2, 3, 4}, where each numeral represents an action;
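The discretized headings and the five flight actions can be illustrated as follows. This is a hedged sketch of the transfer behaviour shown in fig. 1: the association of actions 1-4 with 45-degree and 90-degree turns follows the 45-degree heading discretization described above, and the absolute orientation of heading index 0 and the sign convention for left/right turns are assumptions made only for the example.

```python
# 8 discrete headings at 45-degree intervals: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE
# (the absolute orientation of heading 0 is an assumption; only relative turns matter here)
HEADING_STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

# Action space A = {0,1,2,3,4}: 0 straight, 1 right 45 deg, 2 left 45 deg, 3 right 90 deg, 4 left 90 deg
ACTION_TURN = {0: 0, 1: -1, 2: +1, 3: -2, 4: +2}  # turn expressed in 45-degree increments

def step(x, y, heading, action):
    """Apply one flight action: update the heading, then move one grid step along it."""
    new_heading = (heading + ACTION_TURN[action]) % 8
    dx, dy = HEADING_STEPS[new_heading]
    return x + dx, y + dy, new_heading
```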
step 2.2 design State transition probability
The state transition probability is the conditional probability that the unmanned aerial vehicle reaches another waypoint state after executing an action in a given waypoint state; P_{ss'}^a denotes the probability that the unmanned aerial vehicle, in state s, performs action a and transitions to state s';
In the early learning stage, the unmanned aerial vehicle knows nothing about the environment and easily enters a threat area; entering a threat area ends the learning period, which restricts exploration of the environment to the neighborhood of the initial state. Therefore, when the action taken by the unmanned aerial vehicle would lead it into a threat area or out of the state space, no state transition occurs, i.e., the state of the unmanned aerial vehicle does not change; in all other cases the unmanned aerial vehicle transitions with probability 1 (100%) to the state the action points to. Let the state space of the unmanned aerial vehicle be S and the threat area space be O; then P_{ss'}^a is calculated as:
P_{ss'}^a = 1 if s' is the state that action a points to and s' ∈ S \ O; P_{ss}^a = 1 (no transition) if the state that action a points to lies in O or outside S; P_{ss'}^a = 0 otherwise.
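A sketch of this transition rule follows, reusing the step() helper from the action-space sketch after step 2.1 and the occupancy grid from step 1.3 (1 = no-fly). The function name and the (x, y, heading) state layout are illustrative assumptions, not the patent's notation.

```python
def transition(state, action, grid, m, n):
    """Step 2.2: deterministic transition; stay in place if the action would enter a
    threat node or leave the m x n state space, otherwise move to the pointed-to node."""
    x, y, heading = state
    nx, ny, nh = step(x, y, heading, action)            # candidate next waypoint
    out_of_space = not (0 <= nx < m and 0 <= ny < n)
    into_threat = (not out_of_space) and grid[nx][ny] == 1
    if out_of_space or into_threat:
        return state           # P(s -> s) = 1: no state change
    return (nx, ny, nh)        # P(s -> s') = 1: move to the state the action points to
```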
step 2.3 construction of reward function
The unmanned aerial vehicle obtains an instant reward each time it transfers to the next waypoint state. The learning objective of the Q(λ) algorithm is to maximize the accumulated instant reward, and the construction of the reward function must consider the various indices that influence track performance, including the distance to the target point, flight safety and threat degree;
R_{ss'}^a denotes the instant reward function obtained when the unmanned aerial vehicle takes action a in state s and transfers to state s', computed as a weighted combination of the normalized track evaluation factors, where w_1, w_2, w_3 are the weighting coefficients and f_d, f_o, f_a are the normalized track evaluation factors:
[formula image: R_{ss'}^a as a weighted combination of w_1·f_d, w_2·f_o and w_3·f_a]
f_d expresses visibility and is taken as the reciprocal of the distance from state s' to the target point; the longitude and latitude coordinates of s' are s' = (lon_s', lat_s') and those of the target point are T = (lon_T, lat_T), and f_d is calculated as:
f_d = 1 / d(s', T), where d(s', T) is the haversine distance between (lon_s', lat_s') and (lon_T, lat_T);
f_o represents the degree of threat posed by the threat zones to state s'; it is accumulated over the set I_o of threat zones that threaten the current state transition of the unmanned aerial vehicle, where each threat zone o_i has longitude and latitude coordinates (lon_oi, lat_oi) and contributes its degree of threat to s'. The calculation formula is as follows:
[formula image: f_o as the sum over o_i ∈ I_o of the degree of threat of o_i to s', computed from the position of s' and the centre coordinates (lon_oi, lat_oi) of o_i]
f_a represents the penalty term for the flight action of the unmanned aerial vehicle; the flight action taken is a key factor affecting the flight safety of the unmanned aerial vehicle. According to the unmanned aerial vehicle flight action space set in step 2.1, f_a is treated as a discrete function of the action:
[formula image: f_a as a discrete function assigning a value to each of the 5 flight actions]
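The following sketch only illustrates the structure of the reward: a weighted combination of the normalized factors f_d, f_o and f_a. Because the exact formula, the per-zone threat term and the discrete f_a values appear only in figures not reproduced here, the signs of the weights, the reciprocal-distance threat term and the numeric penalties below are assumptions for illustration; haversine_km is the helper defined in the step 1.3 sketch.

```python
def f_d(lon_s2, lat_s2, lon_t, lat_t):
    """Visibility factor: reciprocal of the (haversine) distance from s' to the target."""
    return 1.0 / max(haversine_km(lon_s2, lat_s2, lon_t, lat_t), 1e-6)

def f_o(lon_s2, lat_s2, threat_centers):
    """Threat factor: summed threat degree of the zones in I_o. The per-zone term used here
    (reciprocal distance to the zone centre) is an assumption; the patent's exact
    expression is given only in a figure."""
    return sum(1.0 / max(haversine_km(lon_s2, lat_s2, lon_c, lat_c), 1e-6)
               for (lon_c, lat_c) in threat_centers)

# Assumed discrete action penalties: larger turns penalized more (illustrative values only).
F_A = {0: 0.0, 1: 0.25, 2: 0.25, 3: 0.5, 4: 0.5}

def reward(lon_s2, lat_s2, lon_t, lat_t, threat_centers, action,
           w1=1.0, w2=1.0, w3=1.0):
    """Instant reward for reaching s': reward closeness to the target, penalize
    threat exposure and aggressive manoeuvres (the signs are an assumption)."""
    return w1 * f_d(lon_s2, lat_s2, lon_t, lat_t) \
         - w2 * f_o(lon_s2, lat_s2, threat_centers) \
         - w3 * F_A[action]
```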
as a further limitation of the invention: in the step 3, on the established model, the specific steps of iterative computation by using a Q (lambda) algorithm are as follows:
step 3.1 initialize Q-table
Each state-action pair Q(s, a) in the Q table is initialized. Q(s, ·) denotes the initial value of all state-action pairs in state s and s_T denotes the termination state; Q(s, a) is initialized as follows:
Q(s, a) = 0 if s = s_T, otherwise Q(s, a) = 1 / d_{s,s_T};
that is, if s is the termination state the initial Q value is 0, otherwise the Q value is set to the reciprocal of the distance between s and s_T. With (x, y) the coordinates of state s and (x_T, y_T) the coordinates of state s_T, d_{s,s_T} is calculated as:
d_{s,s_T} = sqrt((x - x_T)² + (y - y_T)²).
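A small sketch of this initialization, assuming (as described above) that d_{s,s_T} is the Euclidean distance over grid coordinates; the function name and state layout are illustrative.

```python
import math

def init_q_table(states, actions, x_t, y_t):
    """Step 3.1: Q = 0 for the termination state, otherwise the reciprocal of the
    grid distance from s to s_T (Euclidean over grid coordinates)."""
    q = {}
    for s in states:
        x, y = s[0], s[1]                      # grid coordinates of state s
        d = math.hypot(x - x_t, y - y_t)
        q[s] = {a: (0.0 if d == 0 else 1.0 / d) for a in actions}
    return q
```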
step 3.2 initialize E value
At the beginning of each learning cycle, initializing the E value E (s, a) of all state-action pairs < s, a > to 0;
step 3.3 uses the Boltzmann distribution strategy for action selection.
In each learning period, firstly setting an initial state, and then selecting an action according to a Boltzmann distribution strategy to perform state transition; the probability p (a | s) of taking action a in the s state is calculated as:
Figure BDA0001957555760000063
where T is the temperature coefficient that controls the exploration intensity of the strategy. A larger temperature coefficient may be used in the early stages of learning to ensure greater exploration, followed by a gradual reduction of the temperature coefficient. An action a is then selected by the roulette-wheel method according to p(a|s), and the value of E(s, a) is incremented by one;
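A sketch of the Boltzmann selection and roulette-wheel draw of step 3.3. The max-shift inside the exponential is a standard numerical-stability trick added here; it is not part of the patent's formula.

```python
import math
import random

def boltzmann_probs(q_row, temperature):
    """p(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T); shifted by the max for stability."""
    q_max = max(q_row.values())
    exps = {a: math.exp((q - q_max) / temperature) for a, q in q_row.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def roulette_select(probs):
    """Roulette-wheel selection: draw a uniform number and walk the cumulative distribution."""
    r = random.random()
    cum = 0.0
    for a, p in probs.items():
        cum += p
        if r <= cum:
            return a
    return a  # guard against floating-point round-off
```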
step 3.4 update Q value
The unmanned aerial vehicle takes the action a selected in step 3.3 in state s and transfers to state s', obtaining the instant reward r; the update formula of Q(s, a) is then:
Q(s, a) = Q(s, a) + α * (r + γ * max_a Q(s', a) - Q(s, a)) * E(s, a)
where α is the learning rate, γ is the discount factor representing the degree of importance attached to future rewards, and max_a Q(s', a) is the maximum Q value in state s';
step 3.5 update E value
E(s, a) = λ * E(s, a) for all state-action pairs, where λ is the weight parameter. When the state s' is a termination state, the learning period ends and the next learning period begins; otherwise the unmanned aerial vehicle transfers to state s' and the process returns to step 3.3 to continue learning;
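Putting steps 3.1-3.5 together, the following sketch runs the Q(λ) learning periods, assuming the helpers sketched above (init_q_table, boltzmann_probs, roulette_select) and externally supplied transition and reward functions; for brevity, states here are plain grid coordinates. Following the text literally, only the current pair's Q value is updated (weighted by its trace) and every trace is then decayed by λ; many Q(λ) implementations instead apply the update to all pairs and also decay by γ, so treat this as one reading of the description rather than the definitive algorithm.

```python
def q_lambda(states, actions, start, terminal, transition_fn, reward_fn,
             episodes=500, max_steps=10000, alpha=0.1, gamma=0.9, lam=0.8, temperature=1.0):
    """One reading of step 3: Q(lambda) iteration with Boltzmann exploration."""
    q = init_q_table(states, actions, terminal[0], terminal[1])         # step 3.1
    for _ in range(episodes):
        e = {s: {a: 0.0 for a in actions} for s in states}              # step 3.2: reset traces
        s = start
        for _ in range(max_steps):
            if s == terminal:                                           # learning period ends
                break
            a = roulette_select(boltzmann_probs(q[s], temperature))     # step 3.3
            e[s][a] += 1.0                                              # bump the trace of <s, a>
            s_next = transition_fn(s, a)
            r = reward_fn(s, a, s_next)
            delta = r + gamma * max(q[s_next].values()) - q[s][a]
            q[s][a] += alpha * delta * e[s][a]                          # step 3.4: Q update
            for s_ in states:                                           # step 3.5: decay all traces
                for a_ in actions:
                    e[s_][a_] *= lam
            s = s_next
    return q
```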
as a further limitation of the invention: the step 4 of calculating the optimal path according to the state cost function specifically comprises the following steps:
step 4.1 State transition Using deterministic policy
After step 3, the state-action value Q has converged. First an initial state s is set, the action a with the maximum Q value in state s is selected, and the state transition is performed; the action is selected as a* = argmax_{a∈A} Q(s, a). After taking action a and transferring to the next state s', the deterministic policy continues to select actions in the same way until the termination state is reached;
step 4.2 mapping the grid space into the latitude and longitude coordinates of the waypoints
The optimal path coordinates in the grid obtained in step 4.1 are mapped to the longitude and latitude coordinates of the waypoints according to the formula in step 1.2, yielding the optimal path of the unmanned aerial vehicle.
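A sketch of step 4 under the same assumptions as above: the greedy action a* = argmax_a Q(s, a) is followed from the start state to the termination state, and the grid coordinates are then mapped to waypoint longitude/latitude with the step 1.2 formula. Function and parameter names are illustrative.

```python
def extract_path(q, start, terminal, transition_fn, lon_o, lat_o, d_min, max_steps=10000):
    """Step 4: follow the deterministic greedy policy a* = argmax_a Q(s, a) from the
    start state to the termination state, then map grid nodes to lon/lat waypoints."""
    path, s = [start], start
    for _ in range(max_steps):
        if s == terminal:
            break
        a = max(q[s], key=q[s].get)        # a* = argmax_a Q(s, a)
        s = transition_fn(s, a)
        path.append(s)
    # step 1.2 mapping: grid coordinates -> waypoint longitude/latitude
    return [(lon_o + d_min * x, lat_o + d_min * y) for (x, y) in path]
```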
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. the minimum track segment length of the unmanned aerial vehicle is used as the discretization step length, the self-restraint of the unmanned aerial vehicle is considered, the defect that the discretization process of environment modeling lacks basis is overcome, and the discretization planning space capable of fully exerting the flight capacity of the unmanned aerial vehicle is obtained;
2. when the state transition probability is set, when the unmanned aerial vehicle enters a threat area due to actions taken by the unmanned aerial vehicle, the unmanned aerial vehicle does not generate state transition, the current state is kept unchanged, the learning of the current period is continued, the defect that the interaction between the unmanned aerial vehicle and the environment is limited to be near the initial state at the initial stage of learning is overcome, and the convergence speed of the algorithm is improved;
3. The Q learning algorithm does not need to acquire global knowledge of the environment; it interacts with the environment continuously in a trial-and-error manner and approaches the optimal strategy by optimizing the behavior value function, which suits the case where the unmanned aerial vehicle's environment is unknown or partially unknown in a dynamic setting and guides the unmanned aerial vehicle in autonomous path planning;
4. The traditional Q learning algorithm looks only one step ahead of the current state during iteration; by introducing the utility tracking (eligibility trace) function into Q learning, the predictions of all subsequent steps are considered comprehensively, making the calculation of the value function more accurate. The method also updates effectively online: the Q value update does not have to wait until the end of a learning period, earlier learning data can be discarded, and the convergence of the algorithm is accelerated.
Drawings
Fig. 1 shows the discrete actions of the unmanned aerial vehicle and the transfer results thereof in the grid space.
Fig. 2 is a flow chart of algorithm iteration within each learning cycle.
Detailed Description
The invention is further explained below with reference to the drawings.
For convenience of description, the main variables in the algorithm are simply defined:
The longitude and latitude coordinates of the starting point of the unmanned aerial vehicle are S = (lon_S, lat_S), the longitude and latitude coordinates of the target point are T = (lon_T, lat_T), the size of the grid space is m × n, and the coordinates of points in the grid space are (x, y). The Markov model is expressed by the quadruple <S, A, P, R>, where S is the state space of the unmanned aerial vehicle, A is the action space of the unmanned aerial vehicle, R is the reward function, and P is the state transition probability matrix.
The invention provides an unmanned aerial vehicle path planning method based on a Q (lambda) algorithm, which comprises an environment modeling step, a Markov decision process model initialization step, a Q (lambda) algorithm iterative computation step and an optimal path computation step according to a state cost function;
the method comprises the following specific steps:
step 1) environmental modeling step
Step 1.1) setting the step length of the grid space to the minimum track segment length d_min of the unmanned aerial vehicle;
Step 1.2) calculating the size m × n of the grid space according to the formula [formula image: calculation of m and n from S, T and d_min];
Step 1.3) mapping grid space coordinates to waypoint longitude and latitude coordinates according to the formulas lon_xy = lon_o + d_min * x and lat_xy = lat_o + d_min * y, where (lon_o, lat_o) are the longitude and latitude coordinates corresponding to the grid space origin (0, 0);
Step 1.4) in the grid space, marking nodes containing a threat zone as 1, representing a no-fly area, and marking nodes without a threat zone as 0, representing a flyable area;
step 2) Markov decision process model initialization
Step 2.1) setting 5 flying actions of the unmanned aerial vehicle according to the unmanned aerial vehicle transfer direction shown in fig. 1, wherein the direct flight is represented by a numeral 0, the right turn is represented by 1, the left turn is represented by 2, the right turn is represented by 90, and the left turn is represented by 4, the flying action space of the unmanned aerial vehicle is represented by a ═ 0,1,2,3,4, and each numeral represents an action;
Step 2.2) the state transition probability is set as follows: when the action taken by the unmanned aerial vehicle would lead it into a threat area or out of the state space, no state transition occurs, i.e., the state of the unmanned aerial vehicle does not change; in all other cases the unmanned aerial vehicle transitions with probability 1 to the state the action points to. The state transition probability is calculated as:
P_{ss'}^a = 1 if s' is the state that action a points to and s' ∈ S \ O; P_{ss}^a = 1 if the state that action a points to lies in O or outside S; P_{ss'}^a = 0 otherwise,
where O is the threat zone space;
Step 2.3) the instant reward function R_{ss'}^a obtained when the unmanned aerial vehicle takes action a in state s and transfers to state s' is computed as a weighted combination of the normalized track evaluation factors [formula image: R_{ss'}^a as a weighted combination of w_1·f_d, w_2·f_o and w_3·f_a], where w_1, w_2, w_3 are the weighting coefficients and f_d, f_o, f_a are the normalized track evaluation factors;
Step 2.4) f_d expresses visibility and is taken as the reciprocal of the distance from state s' to the target point; the longitude and latitude coordinates of s' are s' = (lon_s', lat_s') and those of the target point are T = (lon_T, lat_T), and f_d is calculated as f_d = 1 / d(s', T), where d(s', T) is the haversine distance between s' and T;
Step 2.5) f_o represents the degree of threat posed by the threat zones to state s'; it is accumulated over the set I_o of threat zones that threaten the current state transition of the unmanned aerial vehicle, where each threat zone o_i has longitude and latitude coordinates (lon_oi, lat_oi) and contributes its degree of threat to s' [formula image: f_o as the sum over o_i ∈ I_o of the degree of threat of o_i to s'];
Step 2.6) f_a represents the penalty term for the flight action of the unmanned aerial vehicle; the flight action taken is a key factor affecting flight safety. According to the unmanned aerial vehicle flight action space set in step 2.1, f_a is treated as a discrete function of the action [formula image: f_a as a discrete function assigning a value to each of the 5 flight actions];
step 3) performing iterative computation on the established model by using a Q (lambda) algorithm, wherein the iterative flow of the algorithm in each learning period is shown in FIG. 2;
Step 3.1) Q value initialization is performed for each state-action pair Q(s, a) in the Q table; with s_T denoting the termination state, Q(s, a) is initialized to 0 if s = s_T and to the reciprocal of the distance between s and s_T otherwise;
step 3.2) at the beginning of each learning cycle, initializing the E value E (s, a) of all state action pairs < s, a > to 0;
step 3.3) setting an initial state;
Step 3.4) selecting an action according to the Boltzmann distribution strategy, where the probability p(a|s) of taking action a in state s is calculated as p(a|s) = exp(Q(s, a) / T) / Σ_{a'∈A} exp(Q(s, a') / T);
step 3.5) according to the formula:
Q(s, a) = Q(s, a) + α * (r + γ * max_a Q(s', a) - Q(s, a)) * E(s, a)
updating Q (s, a);
Step 3.6) updating the E value according to the formula E(s, a) = λ * E(s, a);
Step 3.7) taking action a and transferring to the next state s'; if s' is a termination state, the learning period ends and the process returns to step 3.2) to enter the next learning period; otherwise it returns to step 3.4) to continue the iteration.
Step 4), calculating an optimal path according to the state cost function:
Step 4.1) after step 3), the state-action value Q has converged. An initial state s is set first, the action a with the maximum Q value in state s is selected, and the state transition is performed; the action is selected as a* = argmax_{a∈A} Q(s, a). After taking action a and transferring to the next state s', the deterministic policy continues to select actions until the termination state is reached;
Step 4.2) mapping the optimal path coordinates in the grid obtained in step 4.1) to the longitude and latitude coordinates of the waypoints according to the formula in step 1.3), so as to obtain the optimal path of the unmanned aerial vehicle.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can understand that the modifications or substitutions within the technical scope of the present invention are included in the scope of the present invention, and therefore, the scope of the present invention should be subject to the protection scope of the claims.

Claims (5)

1. An unmanned aerial vehicle path planning method based on a Q (lambda) algorithm is characterized in that: the method comprises the following steps:
step 1, environment modeling: acquiring environment information by using a sensor, identifying a threat area, modeling the flight environment of the unmanned aerial vehicle by using a grid method, discretizing continuous space, generating a uniform grid map according to the set space size, and taking the grid vertex as a discrete waypoint;
step 2, initializing a Markov decision process model: initializing a Markov decision process model suitable for solving the unmanned aerial vehicle path plan, wherein the Markov decision process model is represented by a quadruple < S, A, P and R >, S is a state space where the unmanned aerial vehicle is located, A is an action space of the unmanned aerial vehicle, P is a state transition matrix, and R is a reward function;
Step 3, performing iterative calculation on the established model with the Q(λ) algorithm: on the basis of the models established in steps 1 and 2, iterative computation is performed with the Q(λ) algorithm, which combines the Q-learning algorithm with utility tracking; a state-action value function Q(s, a) is introduced to represent the value of the unmanned aerial vehicle taking action a in state s, and a Q table is established to store the value of each state-action pair <s, a>; a utility tracking function E(s, a) is introduced to represent the causal relationship between the termination state and the state-action pair <s, a>; the Q values and E values are initialized, and then in each learning period the action a taken in state s is selected by the Boltzmann strategy; after action a transfers the unmanned aerial vehicle to the next state s', the value of Q(s, a) is updated by the Q value update formula and the E values of all state-action pairs are updated by the E value update formula; when the termination state is reached the current learning period ends, and when the maximum number of learning periods is reached the iterative calculation process of the Q(λ) algorithm ends;
Step 4, calculating the optimal path according to the state cost function: from the converged state cost function obtained in step 3, the action a with the maximum Q value is selected in state s, and the deterministic strategy is followed after each action until the termination state is reached; finally the grid nodes are mapped to longitude and latitude coordinates to obtain the optimal path.
2. The unmanned aerial vehicle path planning method based on Q (lambda) algorithm of claim 1, wherein: the step 1 of environment modeling specifically comprises the following steps:
step 1.1, initializing a grid space according to the minimum track segment length of the unmanned aerial vehicle;
The unmanned aerial vehicle flies along straight lines between waypoints and changes its flight attitude at certain waypoints according to track requirements; the minimum track segment length is the shortest distance the unmanned aerial vehicle is required to fly straight before changing its flight attitude. Setting the step length according to the minimum track segment length of the unmanned aerial vehicle yields a discrete grid space that satisfies the unmanned aerial vehicle's own constraints;
Setting the longitude and latitude coordinates of the starting position of the unmanned aerial vehicle as S = (lon_S, lat_S), the longitude and latitude coordinates of the target point as T = (lon_T, lat_T), the minimum track segment length of the unmanned aerial vehicle as d_min, and the size of the grid space as m × n; with d_min set as the grid step length, the calculation formulas for m and n are as follows:
[formula image: calculation of m and n from S, T and d_min]
step 1.2, mapping the grid space coordinate into a waypoint;
Taking the grid vertices as discrete waypoints, coordinates in the grid space are expressed as (x, y). Let the longitude and latitude coordinates corresponding to the origin (0, 0) of the grid space be (lon_o, lat_o); the longitude and latitude coordinates (lon_xy, lat_xy) of the waypoint corresponding to (x, y) are calculated as follows: lon_xy = lon_o + d_min * x, lat_xy = lat_o + d_min * y;
Step 1.3 representation of threat zone information;
The spatial position of threat sources must be considered during flight. According to the type of threat source, threat areas are divided into circular areas and polygonal areas; in the grid space, a node containing a threat area is marked 1 and represents a no-fly area, and a node without a threat area is marked 0 and represents a flyable area. For a circular threat zone, let the coordinates of the zone center be (lon_c, lat_c) and the radius of the threat zone be r (km). For each node (x, y) in the grid, the distance d_xyo from the waypoint corresponding to the node to the center of the threat zone is calculated with the haversine formula, which computes the distance between two points on a sphere from their longitude and latitude coordinates;
d_xyo = 2R · arcsin( sqrt( sin²((lat_xy - lat_c)/2) + cos(lat_c) · cos(lat_xy) · sin²((lon_xy - lon_c)/2) ) ), where R is the Earth radius and angles are in radians;
If d_xyo < r, the node corresponding to (x, y) is labeled 1, otherwise it is labeled 0. For a polygonal threat region, a ray is cast from the waypoint (lon_xy, lat_xy) in the horizontal direction to the right or left and the number of intersection points between the ray and the polygonal area is counted; if the number of intersections is odd, the (x, y) node lies inside the polygonal threat area and is marked 1, and if the number of intersections is even, the node lies outside the polygonal threat area and is marked 0.
3. The unmanned aerial vehicle path planning method based on the Q (λ) algorithm of claim 2, wherein: the step 2 of initializing the Markov decision process model specifically comprises the following steps:
step 2.1 shows the flight action space of the unmanned aerial vehicle
Taking the grid vertexes as route points in the grid space, wherein eight transfer directions are shared from one vertex to another vertex except for boundary points; limiting the transfer direction to a certain extent according to the self-restraint of the unmanned aerial vehicle and the threat distribution of the space, generalizing the behavior of the unmanned aerial vehicle into a discrete action space, and discretizing the course state at an interval of 45 degrees to obtain 8 discrete states; according to the set discretized course state, 5 flight actions of the unmanned aerial vehicle are set, the straight flight is represented by a numeral 0, the right turn is represented by 1, the left turn is represented by 2, the right turn is represented by 3, and the left turn is represented by 4, so that the action space is represented by A ═ 0,1,2,3,4, and each numeral represents an action;
step 2.2 design State transition probability
The state transition probability is the conditional probability that the unmanned aerial vehicle reaches another waypoint state after executing an action in a given waypoint state; P_{ss'}^a denotes the probability that the unmanned aerial vehicle, in state s, performs action a and transitions to state s';
In the early learning stage, the unmanned aerial vehicle knows nothing about the environment and easily enters a threat area; entering a threat area ends the learning period, which restricts exploration of the environment to the neighborhood of the initial state. Therefore, when the action taken by the unmanned aerial vehicle would lead it into a threat area or out of the state space, no state transition occurs, i.e., the state of the unmanned aerial vehicle does not change; in all other cases the unmanned aerial vehicle transitions with probability 1 (100%) to the state the action points to. Let the state space of the unmanned aerial vehicle be S and the threat area space be O; then P_{ss'}^a is calculated as:
P_{ss'}^a = 1 if s' is the state that action a points to and s' ∈ S \ O; P_{ss}^a = 1 (no transition) if the state that action a points to lies in O or outside S; P_{ss'}^a = 0 otherwise.
step 2.3 construction of reward function
The unmanned aerial vehicle obtains an instant reward each time it transfers to the next waypoint state. The learning objective of the Q(λ) algorithm is to maximize the accumulated instant reward, and the construction of the reward function must consider the various indices that influence track performance, including the distance to the target point, flight safety and threat degree;
R_{ss'}^a denotes the instant reward function obtained when the unmanned aerial vehicle takes action a in state s and transfers to state s', computed as a weighted combination of the normalized track evaluation factors, where w_1, w_2, w_3 are the weighting coefficients and f_d, f_o, f_a are the normalized track evaluation factors:
[formula image: R_{ss'}^a as a weighted combination of w_1·f_d, w_2·f_o and w_3·f_a]
f_d expresses visibility and is taken as the reciprocal of the distance from state s' to the target point; the longitude and latitude coordinates of s' are s' = (lon_s', lat_s') and those of the target point are T = (lon_T, lat_T), and f_d is calculated as:
f_d = 1 / d(s', T), where d(s', T) is the haversine distance between (lon_s', lat_s') and (lon_T, lat_T);
f_o represents the degree of threat posed by the threat zones to state s'; it is accumulated over the set I_o of threat zones that threaten the current state transition of the unmanned aerial vehicle, where each threat zone o_i has longitude and latitude coordinates (lon_oi, lat_oi) and contributes its degree of threat to s'. The calculation formula is as follows:
[formula image: f_o as the sum over o_i ∈ I_o of the degree of threat of o_i to s', computed from the position of s' and the centre coordinates (lon_oi, lat_oi) of o_i]
f_a represents the penalty term for the flight action of the unmanned aerial vehicle; the flight action taken is a key factor affecting the flight safety of the unmanned aerial vehicle. According to the unmanned aerial vehicle flight action space set in step 2.1, f_a is treated as a discrete function of the action:
[formula image: f_a as a discrete function assigning a value to each of the 5 flight actions]
4. the unmanned aerial vehicle path planning method based on Q (λ) algorithm of claim 3, wherein: in the step 3, on the established model, the specific steps of iterative computation by using a Q (lambda) algorithm are as follows:
step 3.1 initialize Q-table
Each state-action pair Q(s, a) in the Q table is initialized. Q(s, ·) denotes the initial value of all state-action pairs in state s and s_T denotes the termination state; Q(s, a) is initialized as follows:
Q(s, a) = 0 if s = s_T, otherwise Q(s, a) = 1 / d_{s,s_T};
that is, if s is the termination state the initial Q value is 0, otherwise the Q value is set to the reciprocal of the distance between s and s_T. With (x, y) the coordinates of state s and (x_T, y_T) the coordinates of state s_T, d_{s,s_T} is calculated as:
d_{s,s_T} = sqrt((x - x_T)² + (y - y_T)²);
step 3.2 initialize E value
At the beginning of each learning cycle, initializing the E value E (s, a) of all state-action pairs < s, a > to 0;
step 3.3, selecting actions by using a Boltzmann distribution strategy;
In each learning period, an initial state is first set, and actions are then selected according to the Boltzmann distribution strategy to perform state transitions; the probability p(a|s) of taking action a in state s is calculated as:
p(a|s) = exp(Q(s, a) / T) / Σ_{a'∈A} exp(Q(s, a') / T)
wherein T is the temperature coefficient that controls the exploration intensity of the strategy; a larger temperature coefficient is used in the initial learning stage to ensure stronger exploration, and the temperature coefficient is then gradually reduced; an action a is then selected by the roulette-wheel method according to p(a|s), and the value of E(s, a) is incremented by one;
step 3.4 updating Q value
The unmanned aerial vehicle takes the action a selected in step 3.3 in state s and transfers to state s', obtaining the instant reward r; the update formula of Q(s, a) is then:
Q(s, a) = Q(s, a) + α * (r + γ * max_a Q(s', a) - Q(s, a)) * E(s, a)
where α is the learning rate, γ is the discount factor representing the degree of importance attached to future rewards, and max_a Q(s', a) is the maximum Q value in state s';
step 3.5 update E value
E(s, a) = λ * E(s, a) for all state-action pairs, where λ is the weight parameter; when the state s' is a termination state, the learning period ends and the next learning period begins; otherwise the unmanned aerial vehicle transfers to state s' and returns to step 3.3 to continue the learning process.
5. The unmanned aerial vehicle path planning method based on Q (λ) algorithm of claim 4, wherein: the step 4 of calculating the optimal path according to the state cost function specifically comprises the following steps:
step 4.1 State transition Using deterministic policy
After step 3, the state-action value Q has converged. First an initial state s is set, the action a with the maximum Q value in state s is selected, and the state transition is performed; the action is selected as a* = argmax_{a∈A} Q(s, a). After taking action a and transferring to the next state s', the deterministic policy continues to select actions in the same way until the termination state is reached;
step 4.2 mapping the grid space into the latitude and longitude coordinates of the waypoints
The optimal path coordinates in the grid obtained in step 4.1 are mapped to the longitude and latitude coordinates of the waypoints according to the formula in step 1.2, yielding the optimal path of the unmanned aerial vehicle.
CN201910071929.6A 2019-01-25 2019-01-25 Unmanned aerial vehicle path planning method based on Q (lambda) algorithm Active CN109655066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910071929.6A CN109655066B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle path planning method based on Q (lambda) algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910071929.6A CN109655066B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle path planning method based on Q (lambda) algorithm

Publications (2)

Publication Number Publication Date
CN109655066A CN109655066A (en) 2019-04-19
CN109655066B true CN109655066B (en) 2022-05-17

Family

ID=66121623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910071929.6A Active CN109655066B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle path planning method based on Q (lambda) algorithm

Country Status (1)

Country Link
CN (1) CN109655066B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134140B (en) * 2019-05-23 2022-01-11 南京航空航天大学 Unmanned aerial vehicle path planning method based on potential function reward DQN under continuous state of unknown environmental information
CN110320931A (en) * 2019-06-20 2019-10-11 西安爱生技术集团公司 Unmanned plane avoidance Route planner based on Heading control rule
CN110324805B (en) * 2019-07-03 2022-03-08 东南大学 Unmanned aerial vehicle-assisted wireless sensor network data collection method
CN110428115A (en) * 2019-08-13 2019-11-08 南京理工大学 Maximization system benefit method under dynamic environment based on deeply study
CN111340324B (en) * 2019-09-25 2022-06-07 中国人民解放军国防科技大学 Multilayer multi-granularity cluster task planning method based on sequential distribution
CN110673637B (en) * 2019-10-08 2022-05-13 福建工程学院 Unmanned aerial vehicle pseudo path planning method based on deep reinforcement learning
CN110726416A (en) * 2019-10-23 2020-01-24 西安工程大学 Reinforced learning path planning method based on obstacle area expansion strategy
CN110879610B (en) * 2019-10-24 2021-08-13 北京航空航天大学 Reinforced learning method for autonomous optimizing track planning of solar unmanned aerial vehicle
CN111006693B (en) * 2019-12-12 2021-12-21 中国人民解放军陆军工程大学 Intelligent aircraft track planning system and method thereof
CN111026157B (en) * 2019-12-18 2020-07-28 四川大学 Intelligent aircraft guiding method based on reward remodeling reinforcement learning
CN111123963B (en) * 2019-12-19 2021-06-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111160755B (en) * 2019-12-26 2023-08-18 西北工业大学 Real-time scheduling method for aircraft overhaul workshop based on DQN
CN111328023B (en) * 2020-01-18 2021-02-09 重庆邮电大学 Mobile equipment multitask competition unloading method based on prediction mechanism
CN111399541B (en) * 2020-03-30 2022-07-15 西北工业大学 Unmanned aerial vehicle whole-region reconnaissance path planning method of unsupervised learning type neural network
CN111479216B (en) * 2020-04-10 2021-06-01 北京航空航天大学 Unmanned aerial vehicle cargo conveying method based on UWB positioning
CN111538059B (en) * 2020-05-11 2022-11-11 东华大学 Self-adaptive rapid dynamic positioning system and method based on improved Boltzmann machine
CN111612162B (en) * 2020-06-02 2021-08-27 中国人民解放军军事科学院国防科技创新研究院 Reinforced learning method and device, electronic equipment and storage medium
CN111736461B (en) * 2020-06-30 2021-05-04 西安电子科技大学 Unmanned aerial vehicle task collaborative allocation method based on Q learning
CN111880563B (en) * 2020-07-17 2022-07-15 西北工业大学 Multi-unmanned aerial vehicle task decision method based on MADDPG
CN112130124B (en) * 2020-09-18 2023-11-24 郑州市混沌信息技术有限公司 Quick calibration and error processing method for unmanned aerial vehicle management and control equipment in civil aviation airport
CN112356031B (en) * 2020-11-11 2022-04-01 福州大学 On-line planning method based on Kernel sampling strategy under uncertain environment
CN113033815A (en) * 2021-02-07 2021-06-25 广州杰赛科技股份有限公司 Intelligent valve cooperation control method, device, equipment and storage medium
CN112525213B (en) * 2021-02-10 2021-05-14 腾讯科技(深圳)有限公司 ETA prediction method, model training method, device and storage medium
CN113093803B (en) * 2021-04-03 2022-10-14 西北工业大学 Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm
CN113176786A (en) * 2021-04-23 2021-07-27 成都凯天通导科技有限公司 Q-Learning-based hypersonic aircraft dynamic path planning method
CN114020009B (en) * 2021-10-20 2024-03-29 中国航空工业集团公司洛阳电光设备研究所 Small fixed-wing unmanned aerial vehicle terrain burst prevention planning method
CN114115340A (en) * 2021-11-15 2022-03-01 南京航空航天大学 Airspace cooperative control method based on reinforcement learning
CN114153213A (en) * 2021-12-01 2022-03-08 吉林大学 Deep reinforcement learning intelligent vehicle behavior decision method based on path planning
CN113867369B (en) * 2021-12-03 2022-03-22 中国人民解放军陆军装甲兵学院 Robot path planning method based on alternating current learning seagull algorithm
CN115192452A (en) * 2022-07-27 2022-10-18 苏州泽达兴邦医药科技有限公司 Traditional Chinese medicine production granulation process and process strategy calculation method
CN115562357B (en) * 2022-11-23 2023-03-14 南京邮电大学 Intelligent path planning method for unmanned aerial vehicle cluster

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106970648B (en) * 2017-04-19 2019-05-14 北京航空航天大学 Unmanned plane multi-goal path plans combined method for searching under the environment of city low latitude
CN108413959A (en) * 2017-12-13 2018-08-17 南京航空航天大学 Based on the Path Planning for UAV for improving Chaos Ant Colony Optimization
CN108171315B (en) * 2017-12-27 2021-11-19 南京邮电大学 Multi-unmanned aerial vehicle task allocation method based on SMC particle swarm algorithm
CN108170147B (en) * 2017-12-31 2020-10-16 南京邮电大学 Unmanned aerial vehicle task planning method based on self-organizing neural network
CN108319286B (en) * 2018-03-12 2020-09-22 西北工业大学 Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning

Also Published As

Publication number Publication date
CN109655066A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN109655066B (en) Unmanned aerial vehicle path planning method based on Q (lambda) algorithm
Singla et al. Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge
Zhu et al. Chaotic predator–prey biogeography-based optimization approach for UCAV path planning
Liu et al. Adaptive sensitivity decision based path planning algorithm for unmanned aerial vehicle with improved particle swarm optimization
CN107450593B (en) Unmanned aerial vehicle autonomous navigation method and system
Sharma et al. Path planning for multiple targets interception by the swarm of UAVs based on swarm intelligence algorithms: A review
CN112435275A (en) Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN109597425A (en) Navigation of Pilotless Aircraft and barrier-avoiding method based on intensified learning
CN110926477A (en) Unmanned aerial vehicle route planning and obstacle avoidance method
US20210325891A1 (en) Graph construction and execution ml techniques
CN113268074B (en) Unmanned aerial vehicle flight path planning method based on joint optimization
Haghighi et al. Multi-objective cooperated path planning of multiple unmanned aerial vehicles based on revisit time
Lu et al. Real-time perception-limited motion planning using sampling-based MPC
Chen et al. Risk-aware trajectory sampling for quadrotor obstacle avoidance in dynamic environments
Xue et al. A uav navigation approach based on deep reinforcement learning in large cluttered 3d environments
Saha et al. Real-time robot path planning around complex obstacle patterns through learning and transferring options
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
Rottmann et al. Adaptive autonomous control using online value iteration with gaussian processes
Lu et al. Flight with limited field of view: A parallel and gradient-free strategy for micro aerial vehicle
Hao et al. A search and rescue robot search method based on flower pollination algorithm and Q-learning fusion algorithm
Chronis et al. Dynamic Navigation in Unconstrained Environments Using Reinforcement Learning Algorithms
Quinones-Ramirez et al. Robot path planning using deep reinforcement learning
Niu et al. 3D real-time dynamic path planning for UAV based on improved interfered fluid dynamical system and artificial neural network
KR20220090732A (en) Method and system for determining action of device for given state using model trained based on risk measure parameter
Liao Control, Planning, and Learning for Multi-UAV Cooperative Hunting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant