CN112824998A - Multi-unmanned aerial vehicle collaborative route planning method and device in Markov decision process


Info

Publication number
CN112824998A
CN112824998A (application CN201911139552.XA)
Authority
CN
China
Prior art keywords: unmanned aerial vehicle, state, drone, collaborative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911139552.XA
Other languages
Chinese (zh)
Inventor
刘蓉
肖颖峰
张衡
梁瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Changkong Technology Co ltd
Nanjing Pukou High-Tech Industrial Development Zone Management Committee
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing Changkong Technology Co ltd
Nanjing Pukou High-Tech Industrial Development Zone Management Committee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Changkong Technology Co ltd, Nanjing Pukou High-Tech Industrial Development Zone Management Committee and Nanjing University of Aeronautics and Astronautics
Priority to CN201911139552.XA
Publication of CN112824998A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/12: Target-seeking control

Abstract

The invention discloses a multi-unmanned aerial vehicle collaborative route planning method and device based on a Markov decision process. The method comprises the following steps: constructing a Markov process model for the multi-UAV collaborative route planning task from the state space, the action space, the state transfer function and a pre-constructed reward function of the Markov decision process; and executing search strategy iteration on the Markov process model based on a pre-constructed evaluation function, searching for the action sequence that maximizes the evaluation function value, and thereby planning the optimal multi-UAV collaborative route. The invention also introduces radar threats into the reward function for unmanned aerial vehicle flight, and reasonably designs the combat environment and the number of state spaces of the multiple unmanned aerial vehicles. Reasonable and effective flight paths can be planned quickly for the multiple unmanned aerial vehicles while the radar threat cost of the routes is greatly reduced, improving the safety of the unmanned aerial vehicles when executing tasks in a complex environment.

Description

Multi-unmanned aerial vehicle collaborative route planning method and device in Markov decision process
Technical Field
The invention relates to the technical field of unmanned aerial vehicle route planning, in particular to a multi-unmanned aerial vehicle collaborative route planning method and device in a Markov decision process.
Background
With the development of aviation technology, cooperative combat using multiple unmanned aerial vehicles in complex and changeable environments has been widely applied. Research on unmanned aerial vehicle route planning methods reduces the burden and inconvenience of manually planning routes, and at the same time makes full use of known terrain, threat and other information to produce a global route that satisfies the vehicle's own constraints and the task requirements, providing a technical guarantee for low-altitude penetration and hidden flight of the unmanned aerial vehicle. The route planning method is therefore a key component of an unmanned aerial vehicle system: it is an important premise for realizing autonomous flight, an important basis for ensuring that the unmanned aerial vehicle smoothly completes its tasks and accurately strikes enemy targets, and a powerful guarantee for realizing automatic control of the unmanned aerial vehicle. Research on route planning can also improve the overall level of current mission planning and has important practical significance for its further study; it further improves the survival probability of the unmanned aerial vehicle, provides a powerful basis for determining the operational value of a route, and has strong engineering application value and practical significance for the development of unmanned aerial vehicles in China. How to rapidly plan a flight path meeting the constraint conditions is also the key to realizing autonomous planning for the unmanned aerial vehicle.
At present, research on route planning at home and abroad mainly focuses on the route planning algorithm, which plays a decisive role in the autonomous flight, accurate tracking and striking of the unmanned aerial vehicle and determines the efficiency of route planning and even the survival probability of the unmanned aerial vehicle. In flight path planning, different tasks call for different route planning algorithms. When executing a simple reconnaissance task, only one global route needs to be planned from the obtained information, and the unmanned aerial vehicle only needs to load the global route before taking off. When attacking enemy targets, dynamic threats from the enemy often appear, and the route must then be adjusted dynamically on the basis of the global reference route so as to avoid those threats.
At present, common multi-unmanned aerial vehicle collaborative route planning methods at home and abroad include the ant colony algorithm, the genetic algorithm, the A* algorithm and the like. The ant colony algorithm has strong robustness and good information feedback capability, but its convergence rate is low and it easily falls into local optima. The genetic algorithm is robust because it does not depend on the characteristics of the model, but in a complex battlefield environment its convergence is slow, so the path search time is long. The A* algorithm is simple and easy for engineers to implement, but its computation is heavy and its planning time long.
Disclosure of Invention
The invention aims to overcome at least one defect of the prior-art multi-unmanned aerial vehicle collaborative route planning methods.
In one aspect, the invention provides a multi-unmanned aerial vehicle collaborative route planning method based on an improved Markov decision process, which comprises the following steps:
determining starting points, task target points and radar threat areas of the multiple unmanned aerial vehicles in a flight scene according to the collaborative route planning task of the multiple unmanned aerial vehicles;
determining all state spaces, action spaces and state transfer functions in the motion process of the multiple unmanned aerial vehicles according to the starting points, the task target points and the radar threat areas of the multiple unmanned aerial vehicles;
constructing a Markov decision process model under a multi-unmanned-aerial-vehicle collaborative route planning task according to a state space, an action space, a state transfer function and a pre-constructed reward function in the Markov decision process;
and executing search strategy iteration based on a Markov decision process model based on a pre-constructed evaluation function, and searching an action sequence which enables the evaluation function value to be maximum, thereby planning the optimal multi-unmanned aerial vehicle collaborative air route.
Before the unmanned aerial vehicle takes off, the planned optimal multi-unmanned aerial vehicle collaborative air path is loaded to execute the collaborative flight task.
According to the multi-unmanned-aerial-vehicle collaborative route planning method based on the improved Markov decision process, preferably, modeling is carried out on the flight environment of the unmanned aerial vehicle, the task environment is initialized, a grid method is adopted to carry out two-dimensional space modeling on the flight environment of the unmanned aerial vehicle, influences such as terrain obstacles and severe weather are ignored for simplifying the model, only enemy radar threats are considered, and the flight scene comprises starting points of the multi-unmanned aerial vehicle, task target points and radar threat areas.
The method for planning the collaborative route of the multiple unmanned aerial vehicles based on the improved Markov decision process preferably further comprises the steps of calculating a comprehensive navigation cost function value of the multiple unmanned aerial vehicles based on a pre-constructed comprehensive navigation cost function of the multiple unmanned aerial vehicles in the search strategy iteration process, and taking a path with the minimum comprehensive navigation cost function value of the multiple unmanned aerial vehicles as an optimal collaborative route of the multiple unmanned aerial vehicles;
the method comprises the following steps of presetting a multi-unmanned aerial vehicle comprehensive navigation cost function for evaluating performance indexes of a planned route, and specifically:
for a single unmanned aerial vehicle, the route cost mainly comprises the fuel cost, the threat cost and the like. For multi-unmanned aerial vehicle collaborative route planning, the route cost must not only account for the navigation cost of a single vehicle but also satisfy the multi-vehicle cooperative navigation cost. The comprehensive navigation cost of the multiple unmanned aerial vehicles is described by the following cost equation:
J_i = W_1 J_{l,i} + W_2 J_{r,i} + W_3 J_t  (1)
in the formula: w1、W2And W3Weights for fuel cost, threat cost and synergy cost, Jl,iRepresenting the fuel cost when the length of the path segment under the ith path segment is l, and relating to the flight range of the unmanned aerial vehicle; j. the design is a squarer,iRepresenting the threat cost when the radar threat under the ith navigation route section is r; j. the design is a squaretIt varies with the time of flight of the drone, at a synergistic cost. According to the calculation formula (1) of the comprehensive navigation cost function of the multiple unmanned aerial vehicles, the comprehensive navigation cost of the planned route can be calculated, and the comprehensive navigation is selected to be smallerThe air route is taken as the final air route planned by the algorithm, so that the safety of the unmanned aerial vehicle for executing tasks in a complex environment is ensured. In the method for collaborative route planning of multiple drones based on the improved markov decision process, preferably, the markov decision process Model (MDP) is used as the following four-tuple M ═ for the markov decision process model<S,A,P,R>To show that:
s represents a finite set of system states, including finite state points of the drone flight environment. An environment model of the drone is established in a two-dimensional coordinate system according to step S11, where different coordinate points in the environment model represent different states of the drone, and each state corresponds to an element in the set S of state spaces.
A represents the finite set of actions available to the drone. The flight of the unmanned aerial vehicle is a continuous process, but in multi-UAV route planning, once the starting point and target point of each unmanned aerial vehicle are set, the unmanned aerial vehicle is regarded as a particle during planning. Since the flight environment of the drone is built with the grid method, the drone is defined to have 8 executable actions, a = 1, 2, 3, …, 8. These actions divide the entire 360° equally, so the angle between two adjacent actions is 45°.
P is the state transition function, indicating the probability that the drone in state s_t, after performing action a_t ∈ A, transitions to state s_{t+1}. The state transition probabilities may change with the target state, threat conditions and so on. Given the current state of the drone and the executed action, the distribution of the state transition probabilities largely determines the drone's action selection at the next moment. The state transition probability can be expressed as:

P(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a)  (2)

∑_{s′∈S} P(s′|s, a) = 1  (3)

where s ∈ S denotes an instance of the drone state, a ∈ A denotes an instance of the drone action, s_t denotes the state of the drone at time t, and a_t denotes the action selected by the drone at time t.
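For illustration, the four-tuple components above can be held in code as plain mappings. The sketch below, a minimal assumption-laden example rather than the patent's implementation, stores the transition function P of formula (2) as a nested dictionary and checks the normalization constraint of formula (3); all names and the toy probabilities are illustrative.

```python
from typing import Dict, Tuple

State = Tuple[int, int]   # grid coordinate (x, y), one element of S
Action = int              # one of the 8 executable actions, a = 1..8

# P[(s, a)] maps each successor state s' to P(s' | s, a), as in formula (2)
TransitionFn = Dict[Tuple[State, Action], Dict[State, float]]

def check_normalized(P: TransitionFn, tol: float = 1e-9) -> None:
    """Verify formula (3): successor probabilities sum to 1 for every (s, a)."""
    for (s, a), successors in P.items():
        total = sum(successors.values())
        if abs(total - 1.0) > tol:
            raise ValueError(f"P(.|{s},{a}) sums to {total}, not 1")

# Toy example: from state (0, 0), action 1 usually moves one cell east.
P: TransitionFn = {((0, 0), 1): {(1, 0): 0.9, (1, 1): 0.05, (1, -1): 0.05}}
check_normalized(P)
```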
The unmanned aerial vehicle takes safely reaching the target point as its task goal, so when flying from the initial point to the target point, the movement direction of the unmanned aerial vehicle is guided by the direction of the target point. The angle between the line connecting the target point to the unmanned aerial vehicle and the x direction is defined as θ, and the unmanned aerial vehicle can be controlled to continuously adjust its actions according to the position of the target point so as to move toward it. According to θ, the 360° space around the target point is divided into 8 position states at 45° intervals, which are discretized into the target point position space T_state. The discretization rule is as follows:
[Formula (4), the discretization rule mapping θ to the 8 position states of T_state, is rendered only as an image in the original.]
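Since formula (4) survives only as an image, the following sketch shows one plausible way to discretize the bearing θ into the 8 position states of T_state at 45° intervals; the exact bin boundaries and state numbering are assumptions, not the patent's rule.

```python
import math

def target_position_state(ux: float, uy: float, tx: float, ty: float) -> int:
    """Discretize the bearing theta from the UAV (ux, uy) to the target (tx, ty)
    into one of 8 position states at 45-degree intervals (assumed binning)."""
    theta = math.degrees(math.atan2(ty - uy, tx - ux)) % 360.0
    # Assumed convention: state 1 is centered on theta = 0; bins are 45 deg wide.
    return int(((theta + 22.5) % 360.0) // 45.0) + 1

assert target_position_state(0, 0, 10, 0) == 1   # target due east  -> state 1
assert target_position_state(0, 0, 0, 10) == 3   # target due north -> state 3
```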
When the position of the target point is known, in order to control the unmanned aerial vehicle to move toward the target point, the executable actions of the unmanned aerial vehicle are limited: the drone gives the action toward the grid cell in the target direction with high probability, and may enter the adjacent cells with a certain, lower probability. When the drone is within a given position space of the target point, it has 5 executable actions, each with a different probability, so over the 8 position spaces there are 5 × 8 = 40 action output states. Those skilled in the art can design the state transition probabilities of the executable actions of the drone according to the actual mission.
R is the reward function, representing the immediate reward obtained given the current state and action of the drone. In a Markov model system, the reward function is a penalty or reward signal fed back by the environment after the UAV makes a motion decision and interacts with the environment. It represents the quality of the action taken by the unmanned aerial vehicle in a given state, and is an important basis for guiding the unmanned aerial vehicle to make flight decisions and safely avoid obstacles. The reward function is designed in advance for the problems of safety and of tending toward the target point during route planning, and reward function models with a model-free uniform structure, R_movegoal and R_avoidobstacle, are introduced:
[Formulas (5) and (6), defining R_movegoal and R_avoidobstacle, are rendered only as images in the original.]
R_movegoal is the reward function model during normal flight of the unmanned aerial vehicle, and R_avoidobstacle is the reward function model when the unmanned aerial vehicle encounters a threat;
In state s, the reward function R(s, a) obtained by the drone selecting action a is expressed as follows:

R(s, a) = R_movegoal + R_avoidobstacle  (7)
In multi-UAV collaborative route planning, the unmanned aerial vehicles are exposed to radar threats throughout the flight; although the basic Markov model algorithm can plan effective paths for the unmanned aerial vehicles, those paths may still be detected by radar. Therefore, in order to further reduce the probability of the unmanned aerial vehicle being detected by radar, a radar threat model R_threat with a non-uniform structure is proposed and introduced into the reward function:
[Formula (8), defining R_threat, is rendered only as an image in the original.]
where R_threat is the radar threat reward function model during unmanned aerial vehicle flight, giving a negative reward for the radar threats encountered in flight; L is the length of the path segment after the unmanned aerial vehicle makes its action decision; N is the number of radar threats; and d_{k/4,i} (k = 1, 2, 3) is the distance between the k/4 point of the segment and the i-th radar threat. In state s, the reward function R(s, a) obtained by the drone selecting action a is expressed as follows:

R(s, a) = R_movegoal + R_avoidobstacle + R_threat  (9)
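As an illustration of the composed reward of formula (9), the Python sketch below treats R_movegoal and R_avoidobstacle as precomputed inputs (their defining formulas (5) and (6) appear only as images above) and assumes an inverse-distance form for R_threat sampled at the three quarter points d_{k/4,i} named in the text; the fourth-power decay is an assumption, not the patent's formula (8).

```python
import math

def radar_threat_reward(segment_start, segment_end, radars):
    """Assumed form of R_threat: a negative reward that grows as the quarter
    points (k/4, k = 1, 2, 3) of the flight segment approach each of the N
    radar threats, following the quantities described around formula (8)."""
    (x0, y0), (x1, y1) = segment_start, segment_end
    L = math.hypot(x1 - x0, y1 - y0)      # segment length after the action decision
    penalty = 0.0
    for rx, ry in radars:                  # the N radar threats
        for k in (1, 2, 3):                # the k/4 points of the segment
            px = x0 + k / 4 * (x1 - x0)
            py = y0 + k / 4 * (y1 - y0)
            d = max(math.hypot(px - rx, py - ry), 1e-6)   # d_{k/4,i}
            penalty += L / d ** 4          # assumed inverse-4th-power decay
    return -penalty

def total_reward(r_movegoal, r_avoidobstacle, segment_start, segment_end, radars):
    """Formula (9): R(s, a) = R_movegoal + R_avoidobstacle + R_threat.
    The first two terms come from formulas (5)-(6), which are images in the
    original, so they enter here as precomputed inputs."""
    return r_movegoal + r_avoidobstacle + radar_threat_reward(
        segment_start, segment_end, radars)
```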
the invention combines radar threat cost and a Markov model, reasonably designs the Markov decision model, and provides a multi-unmanned-aerial-vehicle air route planning algorithm based on the improved Markov decision model. Under the complex multi-threat environment, the flight path planning is carried out for multiple unmanned aerial vehicles, so that reasonable and effective flight paths can be quickly planned for the multiple unmanned aerial vehicles, meanwhile, the threat cost and the comprehensive cost of the paths of the multiple unmanned aerial vehicles are greatly reduced, and the safety of the unmanned aerial vehicles in executing tasks under the complex environment is improved.
Preferably, the goal of multi-UAV collaborative route planning based on the Markov decision model is to plan effective routes for the unmanned aerial vehicles through interaction between the drones' actions and the flight environment and final decision generation. The unmanned aerial vehicle agent selects and executes an action a according to the current environment state s, so that its state transfers from s to s′ while it obtains the reward R, and this cycle repeats until the target state is finally reached. That is, multi-UAV collaborative route planning amounts to finding the optimal strategy π*: executing a search strategy according to the current state of the drone and searching for the action sequence that maximizes the expected reward, i.e., the evaluation function V^π(s).
The optimal strategy π* satisfies V*(s) = max_π V^π(s) for all states s ∈ S, and the evaluation function corresponding to the optimal strategy π* is called the optimal evaluation function V*(s). The generation process of the optimal strategy is called strategy iteration. The optimal strategy π* and the maximum reward V*(s) can be found using dynamic programming. In the infinite-horizon discounted model, the evaluation function V^π(s) can be described as:
V^π(s) = E[∑_{t=0}^{∞} γ^t R_t]  (10)
where γ is the discount factor and γ^t is the discount factor at time t; γ is taken as 0.9. R_t is the reward function value at time t, s is the state of the unmanned aerial vehicle at time t = 0, and s′ is the state of the unmanned aerial vehicle at the next moment. The above formula can then be rewritten recursively as:
V^π(s) = R(s, π(s)) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)  (11)

The above formula gives a method for calculating the evaluation function corresponding to a strategy, and defines the state-action value function Q^π(s, a) as an intermediate variable in solving the evaluation function. Given the initial state s and the current action a of the drone, the drone will move to the next state s′ at the next moment with probability P(s′|s, a) and follow this rule thereafter; the state-action value function Q^π(s, a) can then be expressed as:
Q^π(s, a) = R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)  (12)
where R(s, a) is the reward obtained when the unmanned aerial vehicle selects action a in state s.
At this point, the optimal strategy π*(s) of the MDP can be expressed as:
π*(s) = arg max_{a∈A} Q^π(s, a) = arg max_{a∈A} {R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)}  (13)
Accordingly, the optimal evaluation function V*(s) can be expressed as:
V*(s) = max_{a∈A} {R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)}  (14)
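The strategy iteration described by formulas (10)-(14) can be sketched as a value-iteration loop over a finite state set. The sketch below is illustrative only: the data layout, stopping threshold and function names are assumptions, not part of the patent.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]
Action = int

def solve_mdp(states: List[State],
              actions: List[Action],
              P: Dict[Tuple[State, Action], Dict[State, float]],
              R: Dict[Tuple[State, Action], float],
              gamma: float = 0.9,
              eps: float = 1e-6):
    """Iterate V(s) = max_a {R(s,a) + gamma * sum_s' P(s'|s,a) V(s')} as in
    formula (14), then read off pi*(s) = argmax_a Q(s,a) as in formula (13)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                      # in-place Bellman backup
        if delta < eps:                      # assumed convergence threshold
            break
    pi = {s: max(actions,
                 key=lambda a: R[(s, a)] + gamma *
                 sum(p * V[s2] for s2, p in P[(s, a)].items()))
          for s in states}
    return V, pi
```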
In another aspect, the invention provides a multi-unmanned aerial vehicle collaborative route planning device based on an improved Markov decision process, comprising an unmanned aerial vehicle collaborative system modeling module, a Markov process model building module and a multi-unmanned aerial vehicle collaborative route planning module. The unmanned aerial vehicle collaborative system modeling module is used for determining the starting points, task target points and radar threat areas of the multiple unmanned aerial vehicles in the flight scene according to the multi-UAV collaborative route planning task;
the Markov process model building module is used for determining all state spaces, action spaces and state transfer functions in the motion process of the multiple unmanned aerial vehicles according to the starting points, the task target points and the radar threat regions of the multiple unmanned aerial vehicles; constructing a Markov decision process model under a multi-unmanned-aerial-vehicle collaborative route planning task according to a state space, an action space, a state transfer function and a pre-constructed reward function in the Markov decision process;
the multi-unmanned aerial vehicle collaborative route planning module is used for executing search strategy iteration based on a Markov decision process model based on a pre-constructed evaluation function, and searching an action sequence enabling the evaluation function value to be maximum, so that an optimal multi-unmanned aerial vehicle collaborative route is planned.
Further, the pre-constructed reward function R comprises R_movegoal for normal flight of the unmanned aerial vehicle and R_avoidobstacle for when the unmanned aerial vehicle encounters a threat, expressed as follows:
[Formulas (5) and (6), defining R_movegoal and R_avoidobstacle, are rendered only as images in the original.]
R_movegoal is the reward function model during normal flight of the unmanned aerial vehicle, and R_avoidobstacle is the reward function model when the unmanned aerial vehicle encounters a threat;
In state s, the reward function R(s, a) obtained by the drone selecting action a is expressed as follows:

R(s, a) = R_movegoal + R_avoidobstacle
Still further, the reward function also comprises the radar threat reward function R_threat during unmanned aerial vehicle flight, expressed as follows:
[Formula (8), defining R_threat, is rendered only as an image in the original.]
where R_threat is the radar threat reward function during unmanned aerial vehicle flight; L is the length of the path segment after the unmanned aerial vehicle makes its action decision; N is the number of radar threats; and d_{k/4,i} (k = 1, 2, 3) is the distance between the k/4 point of the segment and the i-th radar threat;
In state s, the reward function R(s, a) obtained by the drone selecting action a is expressed as follows:

R(s, a) = R_movegoal + R_avoidobstacle + R_threat
the invention achieves the following beneficial technical effects:
the invention provides a multi-unmanned aerial vehicle route planning algorithm based on an improved Markov decision process model, aiming at the problem that a multi-unmanned aerial vehicle is easily affected by environmental threats when executing combat tasks in a complex environment. According to the method, starting points, task target points and radar threat areas of the multiple unmanned aerial vehicles in a flight scene are determined according to the collaborative route planning task of the multiple unmanned aerial vehicles; determining all state spaces, action spaces and state transfer functions in the motion process of the multiple unmanned aerial vehicles according to the starting points, the task target points and the radar threat areas of the multiple unmanned aerial vehicles; constructing a reward function, and constructing a Markov process model under the collaborative route planning task of the multiple unmanned aerial vehicles according to a state space, an action space, a state transfer function and the reward function in the Markov decision process; and constructing an evaluation function, executing search strategy iteration based on Markov decision based on the evaluation function, and searching an action sequence which enables the evaluation function value to be maximum, thereby planning the optimal multi-unmanned aerial vehicle collaborative air route.
The invention also introduces the radar threat into the reward function during unmanned aerial vehicle flight; using the discretized radar threat information, the algorithm reasonably designs the combat environment and the number of state spaces of the multiple unmanned aerial vehicles, discretizes the target point position space, and accordingly assigns the state transition probabilities reasonably;
Combining the radar threat with the Markov decision process model, a radar threat model with a non-uniform structure is proposed and introduced on the basis of the reward function with a model-free uniform structure, establishing an improved Markov decision process model;
according to the invention, by setting the multi-unmanned aerial vehicle comprehensive navigation cost function, the multi-unmanned aerial vehicle comprehensive navigation cost function value is calculated after searching the action sequence which enables the evaluation function value to be maximum, and the path with the minimum multi-unmanned aerial vehicle comprehensive navigation cost function value is taken as the optimal unmanned aerial vehicle collaborative airway, so that not only can a reasonable and effective flight path be planned for the multi-unmanned aerial vehicle rapidly, but also the threat cost and the airway comprehensive cost of the multi-unmanned aerial vehicle airway are greatly reduced, and the safety of the unmanned aerial vehicle in executing tasks in a complex environment is improved.
Drawings
FIG. 1 is an algorithmic flow chart of a method of route planning in accordance with an embodiment of the present invention;
FIG. 2 is a diagram of a multi-UAV environment model according to an embodiment of the present invention;
FIG. 3 is a diagram of basic operation of a drone in accordance with an embodiment of the present invention;
FIG. 4 is a schematic illustration of a position state of an embodiment of the present invention;
FIG. 5 is a diagram of simulation results in a simple environment according to an embodiment of the present invention;
FIG. 6 is a diagram of simulation results in a complex environment according to an embodiment of the present invention, wherein (a) is a single-target simulation result diagram in a complex environment and (b) is a multi-target simulation result diagram in a complex environment.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
As shown in fig. 1, a multi-unmanned aerial vehicle collaborative route planning method based on an improved Markov decision process includes the following steps:
step S1, constructing the flight environment of the unmanned aerial vehicle according to the nature of the multi-unmanned aerial vehicle collaborative route planning task, establishing a route evaluation system, and determining the starting points, the task target points and the radar threat areas of the multi-unmanned aerial vehicle in the flight scene; the method comprises the following steps:
modeling the flight environment of the unmanned aerial vehicle and initializing the task environment: a grid method is used to model the flight environment in two dimensions, with each grid cell 5 km in size. To simplify the model, the influence of terrain obstacles, severe weather and the like is ignored and only enemy radar threats are considered. The flight scene comprises the starting points, task target points and radar threat areas of the multiple unmanned aerial vehicles; the multi-UAV environment model is shown in FIG. 2;
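A minimal sketch of this environment initialization, assuming the 5 km grid cell stated above; the class layout is illustrative, the sample start and target coordinates are taken from the simulation scenario described later ((4,40), (60,5), (90,90) with target (50,50) and radar radius 2), and the radar positions are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

CELL_KM = 5.0  # each grid cell is 5 km in size, per the text

@dataclass
class Environment:
    width: int                                    # grid columns
    height: int                                   # grid rows
    starts: List[Tuple[int, int]]                 # one start cell per UAV
    targets: List[Tuple[int, int]]                # task target cells
    radars: List[Tuple[int, int, float]] = field(default_factory=list)
    # each radar threat is (x, y, detection radius in cells)

    def in_radar_threat(self, x: int, y: int) -> bool:
        """True if cell (x, y) lies inside any radar threat area."""
        return any((x - rx) ** 2 + (y - ry) ** 2 <= r ** 2
                   for rx, ry, r in self.radars)

# Placeholder scenario; the embodiment's experiments use 64 radar threats.
env = Environment(width=100, height=100,
                  starts=[(4, 40), (60, 5), (90, 90)],
                  targets=[(50, 50)],
                  radars=[(30, 30, 2.0), (70, 60, 2.0)])
```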
step S2, determining all state spaces, action spaces and state transfer functions in the motion process of the multiple unmanned aerial vehicles according to the starting points, the task target points and the radar threat areas of the multiple unmanned aerial vehicles;
constructing a Markov decision process model under a multi-unmanned-aerial-vehicle collaborative route planning task according to a state space, an action space, a state transfer function and a pre-constructed reward function in the Markov decision process;
and executing search strategy iteration based on a Markov decision process model based on a pre-constructed evaluation function, and searching an action sequence which enables the evaluation function value to be maximum, thereby planning the optimal multi-unmanned aerial vehicle collaborative air route.
The method specifically comprises the following steps;
and step S21, combining the radar threat cost with a Markov decision process model, and respectively constructing the model aiming at a state space, an action space, a state transfer function and a reward function in the Markov decision process.
The Markov decision process model (MDP) is represented by the following four-tuple M = <S, A, P, R>:
s represents a finite set of system states, including finite state points of the drone flight environment. An environment model of the drone is established in a two-dimensional coordinate system according to step S11, where different coordinate points in the environment model represent different states of the drone, and each state corresponds to an element in the set S of state spaces.
A represents the finite set of actions available to the drone. The flight of the unmanned aerial vehicle is a continuous process, but in multi-UAV route planning, once the starting point and target point of each unmanned aerial vehicle are set, the unmanned aerial vehicle is regarded as a particle during planning. Since the flight environment of the drone is built with the grid method, the drone is defined to have 8 executable actions, a = 1, 2, 3, …, 8. These actions divide the entire 360° equally, so the angle between two adjacent actions is 45°. The basic action diagram of the drone is shown in FIG. 3.
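On the grid, the 8 executable actions spaced 45° apart map naturally to unit cell offsets. In the sketch below, which action index corresponds to which heading is an assumed convention, since FIG. 3 is not reproduced here.

```python
import math

# Action a = 1..8 at 45-degree intervals, assumed to start due east.
ACTION_DELTAS = {
    a: (round(math.cos(math.radians(45 * (a - 1)))),
        round(math.sin(math.radians(45 * (a - 1)))))
    for a in range(1, 9)
}
# e.g. action 1 -> (1, 0), action 2 -> (1, 1), action 5 -> (-1, 0)

def apply_action(x: int, y: int, a: int) -> Tuple[int, int]:
    """Move one grid cell in the direction of action a."""
    dx, dy = ACTION_DELTAS[a]
    return x + dx, y + dy

from typing import Tuple  # placed here for clarity; normally at the top
```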
P is the state transition function, indicating the probability that the drone in state s_t, after performing action a_t ∈ A, transitions to state s_{t+1}. The state transition probabilities may change with the target state, threat conditions and so on. Given the current state of the drone and the executed action, the distribution of the state transition probabilities largely determines the drone's action selection at the next moment. The state transition probability can be expressed as:

P(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a)  (2)

∑_{s′∈S} P(s′|s, a) = 1  (3)

where s ∈ S denotes an instance of the drone state, a ∈ A denotes an instance of the drone action, s_t denotes the state of the drone at time t, and a_t denotes the action selected by the drone at time t.
The unmanned aerial vehicle takes safely reaching the target point as its task goal, so when flying from the initial point to the target point, the movement direction of the unmanned aerial vehicle is guided by the direction of the target point. The angle between the line connecting the target point to the unmanned aerial vehicle and the x direction is defined as θ, and the unmanned aerial vehicle can be controlled to continuously adjust its actions according to the position of the target point so as to move toward it. According to θ, the 360° space around the target point can be divided at 45° intervals and discretized into 8 position states, as shown in FIG. 4. The discretization rule for the target point position space T_state is as follows:
[Formula (4), the discretization rule mapping θ to the 8 position states of T_state, is rendered only as an image in the original.]
When the position of the target point is known, in order to control the unmanned aerial vehicle to move toward the target point, the executable actions of the unmanned aerial vehicle are limited: the drone gives the action toward the grid cell in the target direction with high probability, and may enter the adjacent cells with a certain, lower probability. When the drone is in a given position space of the target point, it has 5 executable actions, each with a different probability, so over the 8 position spaces there are 5 × 8 = 40 action output states (the basic action diagram of the drone is shown in FIG. 3 and the position state schematic in FIG. 4). The partial state transition probability design for the executable actions of the drone in this embodiment is shown in Table 1; a code sketch of such an assignment follows the table.
TABLE 1 Partial state transition probability design

[Table 1 is rendered only as an image in the original.]
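Because Table 1 survives only as an image, the probabilities in the sketch below are placeholders; it merely illustrates the stated design: five executable actions per position state, with the action toward the target cell carrying the highest probability and its neighbouring actions lower ones.

```python
def transition_probs(position_state: int) -> dict:
    """Probabilities over the 5 executable actions for one of the 8 position
    states. Values are placeholders, since Table 1 is an image in the
    original; the alignment of position-state and action indices is assumed."""
    toward = position_state            # assumed: state k points along action k

    def rot(a: int, k: int) -> int:
        """Rotate action index a by k steps of 45 degrees (1..8, circular)."""
        return (a - 1 + k) % 8 + 1

    return {
        toward: 0.60,                  # head straight for the target cell
        rot(toward, 1): 0.15,          # adjacent actions, lower probability
        rot(toward, -1): 0.15,
        rot(toward, 2): 0.05,
        rot(toward, -2): 0.05,
    }

assert abs(sum(transition_probs(1).values()) - 1.0) < 1e-12
```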
R is the reward function, representing the immediate reward obtained given the current state and action of the drone. In a Markov model system, the reward function is a penalty or reward signal fed back by the environment after the UAV makes a motion decision and interacts with the environment. It represents the quality of the action taken by the unmanned aerial vehicle in a given state, and is an important basis for guiding the unmanned aerial vehicle to make flight decisions and safely avoid obstacles. A reward function is designed for the problems of safety and of tending toward the target point during route planning, and reward function models with a model-free uniform structure, R_movegoal and R_avoidobstacle, are constructed in advance:
[Formulas (5) and (6), defining R_movegoal and R_avoidobstacle, are rendered only as images in the original.]
R_movegoal is the reward function model during normal flight of the unmanned aerial vehicle, and R_avoidobstacle is the reward function model when the unmanned aerial vehicle encounters a threat.
In state s, the reward function R(s, a) obtained by the drone selecting action a is expressed as follows:

R(s, a) = R_movegoal + R_avoidobstacle  (7)
In multi-UAV collaborative route planning, the unmanned aerial vehicles are exposed to radar threats throughout the flight; although the basic Markov model algorithm can plan effective paths for the unmanned aerial vehicles, those paths may still be detected by radar. Therefore, in order to further reduce the probability of the unmanned aerial vehicle being detected by radar, a radar threat model R_threat with a non-uniform structure is proposed and introduced into the reward function:
[Formula (8), defining R_threat, is rendered only as an image in the original.]
where R_threat is the radar threat reward function model during unmanned aerial vehicle flight, giving a negative reward for the radar threats encountered in flight; L is the length of the path segment after the unmanned aerial vehicle makes its action decision; N is the number of radar threats; and d_{k/4,i} (k = 1, 2, 3) is the distance between the k/4 point of the segment and the i-th radar threat. Once the current state s and the executed action a of the unmanned aerial vehicle are given, R_movegoal and R_avoidobstacle can be determined through formulas (5) and (6) from the change in distance between the unmanned aerial vehicle and the target point and between the unmanned aerial vehicle and the obstacles, and R_threat can be obtained through formula (8) from the distance relation between the unmanned aerial vehicle and the radars. In state s, the reward obtained by the drone selecting action a is:

R(s, a) = R_movegoal + R_avoidobstacle + R_threat  (9)
step S22, a search strategy is executed based on a pre-constructed evaluation function for the established markov decision process model, and an action sequence that maximizes the evaluation function is searched for.
The goal of multi-UAV route planning based on the Markov decision model is to plan effective routes for the unmanned aerial vehicles through interaction between the drones' actions and the flight environment and final decision generation. The unmanned aerial vehicle agent selects and executes an action a according to the current environment state s, so that its state transfers from s to s′ while it obtains the reward R, and this cycle repeats until the target state is finally reached. That is, multi-UAV collaborative route planning amounts to finding the optimal strategy π*: executing a search strategy according to the current state of the drone and searching for the action sequence that maximizes the expected reward, i.e., the evaluation function V^π(s).
The optimal strategy π* satisfies V*(s) = max_π V^π(s) for all states s ∈ S, and the evaluation function corresponding to the optimal strategy π* is called the optimal evaluation function V*(s).

The generation process of the optimal strategy is called strategy iteration. The optimal strategy π* and the maximum reward V*(s) can be found using dynamic programming. In the infinite-horizon discounted model, the evaluation function V^π(s) can be described as:
V^π(s) = E[∑_{t=0}^{∞} γ^t R_t]  (10)
where γ is the discount factor and γ^t is the discount factor at time t; γ is taken as 0.9. R_t is the reward value at time t, s is the state of the unmanned aerial vehicle at time t = 0, and s′ is the state of the unmanned aerial vehicle at the next moment. The above formula can then be rewritten recursively as:
V^π(s) = R(s, π(s)) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)  (11)
The above formula gives a method for calculating the evaluation function corresponding to a strategy, and defines the state-action value function Q^π(s, a) as an intermediate variable in solving the evaluation function. Given the initial state s and the current action a of the drone, the drone will move to the next state s′ at the next moment with probability P(s′|s, a) and follow this rule thereafter; the state-action value function Q^π(s, a) can then be expressed as:
Q^π(s, a) = R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)  (12)
where R(s, a) is the reward obtained when the unmanned aerial vehicle selects action a in state s.
At this point, the optimal strategy π*(s) of the MDP can be expressed as:
π*(s) = arg max_{a∈A} Q^π(s, a) = arg max_{a∈A} {R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)}  (13)
Accordingly, the optimal evaluation function V*(s) can be expressed as:
V*(s) = max_{a∈A} {R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)}  (14)
on the basis of the embodiment, a multi-unmanned aerial vehicle comprehensive navigation cost function is further set for evaluating performance indexes of a planned route, a multi-unmanned aerial vehicle comprehensive navigation cost function value is calculated in a search strategy iteration process, and a path with the minimum multi-unmanned aerial vehicle comprehensive navigation cost function value is used as an optimal unmanned aerial vehicle collaborative route;
for a single unmanned aerial vehicle, the route cost mainly comprises the fuel cost, the threat cost and the like. For multi-unmanned aerial vehicle collaborative route planning, the route cost must not only account for the navigation cost of a single vehicle but also satisfy the multi-vehicle cooperative navigation cost. The comprehensive navigation cost of the multiple unmanned aerial vehicles is described by the following cost equation:
J_i = W_1 J_{l,i} + W_2 J_{r,i} + W_3 J_t  (1)
in the formula: w1、W2And W3Weights for fuel cost, threat cost and synergy cost, Jr,iRepresenting fuel cost and relating to the flight range of the unmanned aerial vehicle; j. the design is a squarer,iAt a threat cost; j. the design is a squaretIt varies with the time of flight of the drone, at a synergistic cost.
Simulation analysis:
FIG. 5 shows the simulation results in a simple environment according to an embodiment of the present invention. To obtain Matlab simulation results for the basic ant colony algorithm and the basic MDP (Markov decision process) model planning algorithm in a simple environment, the relevant parameters are initialized as follows: the departure points are set to (4,40), (60,5) and (90,90), the target point is set to (50,50), the radar threat radius is uniformly 2 (in kilometers), and the number of radar threats is 64. Table 2 gives the experimental data of the basic ant colony algorithm and the basic MDP model planning algorithm in the simple environment. As seen from FIG. 5, both algorithms can plan a feasible path: the solid line represents the feasible path planned by the basic MDP model planning algorithm, and the dotted line the path planned by the basic ant colony algorithm. As seen from Table 2, in the same target environment both algorithms plan feasible paths for the unmanned aerial vehicles, but the basic MDP model planning algorithm has a shorter planning time, and its planned route has a smaller threat cost and comprehensive route cost.
TABLE 2 Experimental data of the basic ant colony algorithm and the basic MDP model planning algorithm

Experimental method            Planning time/ms    Threat cost    Comprehensive cost
Basic ant colony algorithm     352                 470            324
Basic MDP model algorithm      223                 358            263
FIG. 6 shows the simulation results in a complex environment according to an embodiment of the present invention, where (a) is the single-target simulation result diagram and (b) the multi-target simulation result diagram. To obtain Matlab simulation results for the basic MDP model algorithm and the improved MDP model algorithm (here, the improved MDP model algorithm is a specific embodiment of the radar threat model with a non-uniform structure proposed and introduced into the reward function in the present invention) on a single-target task and a multi-target task in a complex environment, the relevant parameters are initialized as follows: the single-target mission departure points are set to (4,40), (60,5) and (90,90) with target point (50,50); the multi-target mission departure points are set to (10,10), (60,5) and (85,5) with target points (20,80), (50,85) and (70,85); the radar threat radii are not uniform; and the number of radar threats is 64. Table 3 gives the experimental data of the basic MDP model algorithm and the improved MDP model planning algorithm in the complex environment. From Table 3 it can be seen that the improved MDP model algorithm plans reasonable and effective flight paths for the multiple unmanned aerial vehicles while greatly reducing the route threat cost and the comprehensive cost, improving the safety of the unmanned aerial vehicles executing tasks in the complex environment.
TABLE 3 Experimental data of the basic MDP model algorithm and the improved MDP model planning algorithm

[Table 3 is rendered only as an image in the original.]
Embodiment: a multi-unmanned aerial vehicle collaborative route planning device based on an improved Markov decision process comprises an unmanned aerial vehicle collaborative system modeling module, a Markov process model building module and a multi-unmanned aerial vehicle collaborative route planning module. The unmanned aerial vehicle collaborative system modeling module is used for determining the starting points, task target points and radar threat areas of the multiple unmanned aerial vehicles in the flight scene according to the multi-UAV collaborative route planning task;
the Markov process model building module is used for determining all state spaces, action spaces and state transfer functions in the motion process of the multiple unmanned aerial vehicles according to the starting points, the task target points and the radar threat regions of the multiple unmanned aerial vehicles; constructing a reward function, and constructing a Markov process model under the collaborative route planning task of the multiple unmanned aerial vehicles according to a state space, an action space, a state transfer function and the reward function in the Markov decision process;
the multi-unmanned aerial vehicle collaborative route planning module is used for constructing an evaluation function, executing search strategy iteration based on Markov decision based on the evaluation function, and searching an action sequence enabling the evaluation function value to be maximum, so that an optimal multi-unmanned aerial vehicle collaborative route is planned.
The Markov decision process model (MDP) constructed by the Markov process model building module is represented by the four-tuple M = <S, A, P, R>:
s represents a finite set of system states, including finite state points of the drone flight environment. An environment model of the drone is established in a two-dimensional coordinate system according to step S11, where different coordinate points in the environment model represent different states of the drone, and each state corresponds to an element in the set S of state spaces.
A represents the finite set of actions available to the drone. The flight of the unmanned aerial vehicle is a continuous process, but in multi-UAV route planning, once the starting point and target point of each unmanned aerial vehicle are set, the unmanned aerial vehicle is regarded as a particle during planning. Since the flight environment of the drone is built with the grid method, the drone is defined to have 8 executable actions, a = 1, 2, 3, …, 8. These actions divide the entire 360° equally, so the angle between two adjacent actions is 45°.
P is the state transition function, indicating the probability that the drone in state s_t, after performing action a_t ∈ A, transitions to state s_{t+1}. The state transition probabilities may change with the target state, threat conditions and so on. Given the current state of the drone and the executed action, the distribution of the state transition probabilities largely determines the drone's action selection at the next moment. The state transition probability can be expressed as:

P(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a)  (2)

∑_{s′∈S} P(s′|s, a) = 1  (3)

where s ∈ S denotes an instance of the drone state, a ∈ A denotes an instance of the drone action, s_t denotes the state of the drone at time t, and a_t denotes the action selected by the drone at time t.
The unmanned aerial vehicle takes safely reaching the target point as its task goal, so when flying from the initial point to the target point, the movement direction of the unmanned aerial vehicle is guided by the direction of the target point. The angle between the line connecting the target point to the unmanned aerial vehicle and the x direction is defined as θ, and the unmanned aerial vehicle can be controlled to continuously adjust its actions according to the position of the target point so as to move toward it. According to θ, the 360° space around the target point is divided into 8 position states at 45° intervals, which are discretized into the target point position space T_state. The discretization rule is as follows:
[Formula (4), the discretization rule mapping θ to the 8 position states of T_state, is rendered only as an image in the original.]
When the position of the target point is known, in order to control the unmanned aerial vehicle to move toward the target point, the executable actions of the unmanned aerial vehicle are limited: the drone gives the action toward the grid cell in the target direction with high probability, and may enter the adjacent cells with a certain, lower probability. When the drone is within a given position space of the target point, it has 5 executable actions, each with a different probability, so over the 8 position spaces there are 5 × 8 = 40 action output states. The partial state transition probability design for the executable actions of the drone is shown in Table 1.
The Markov process model building module includes a reward function building module. R is the reward function, representing the immediate reward obtained given the current state and action of the drone. In a Markov model system, the reward function is a penalty or reward signal fed back by the environment after the UAV makes a motion decision and interacts with the environment. It represents the quality of the action taken by the unmanned aerial vehicle in a given state, and is an important basis for guiding the unmanned aerial vehicle to make flight decisions and safely avoid obstacles. The reward function is designed for the problems of safety and of tending toward the target point during route planning, and the reward function building module introduces reward function models with a model-free uniform structure, R_movegoal and R_avoidobstacle:
[Formulas (5) and (6), defining R_movegoal and R_avoidobstacle, are rendered only as images in the original.]
R_movegoal is the reward function model during normal flight of the unmanned aerial vehicle, and R_avoidobstacle is the reward function model when the unmanned aerial vehicle encounters a threat;
In state s, the reward function R(s, a) obtained by the drone selecting action a is expressed as follows:

R(s, a) = R_movegoal + R_avoidobstacle  (7)
In multi-UAV collaborative route planning, the unmanned aerial vehicles are exposed to radar threats throughout the flight; although the basic Markov model algorithm can plan effective paths for the unmanned aerial vehicles, those paths may still be detected by radar. Therefore, in order to further reduce the probability of the unmanned aerial vehicle being detected by radar, the reward function building module proposes and introduces the radar threat model R_threat with a non-uniform structure into the reward function:
[Formula (8), defining R_threat, is rendered only as an image in the original.]
where R_threat is the radar threat reward function model during unmanned aerial vehicle flight, giving a negative reward for the radar threats encountered in flight; L is the length of the path segment after the unmanned aerial vehicle makes its action decision; N is the number of radar threats; and d_{k/4,i} (k = 1, 2, 3) is the distance between the k/4 point of the segment and the i-th radar threat. In state s, the reward function R(s, a) obtained by the drone selecting action a is expressed as follows:

R(s, a) = R_movegoal + R_avoidobstacle + R_threat  (9)
and the multi-unmanned aerial vehicle collaborative route planning module is used for planning the effective route of the unmanned aerial vehicle by interacting the unmanned aerial vehicle action with the flight environment and finally generating a decision based on the Markov decision model. The unmanned aerial vehicle main part selects and executes the action a according to the current environment state s, so that the unmanned aerial vehicle state is transferred to s' from s, meanwhile, the reward R is obtained, and the circulation is repeated until the target state is finally reached. Namely, the cooperative route planning of multiple unmanned aerial vehicles is to find the optimal strategy pi*I.e. according to the current state of the drone, a search strategy is implemented, searching for the desired reward, i.e. the evaluation function Vπ(s) maximum sequence of actions.
The optimal strategy π* satisfies V*(s) = max_π V^π(s) for all states s ∈ S, and the evaluation function corresponding to the optimal strategy π* is called the optimal evaluation function V*(s). The generation process of the optimal strategy is called strategy iteration. The optimal strategy π* and the maximum reward V*(s) can be found using dynamic programming. In the infinite-horizon discounted model, the evaluation function V^π(s) can be described as:
V^π(s) = E[∑_{t=0}^{∞} γ^t R_t]  (10)
where γ is the discount factor and γ^t is the discount factor at time t; γ is taken as 0.9. R_t is the reward function value at time t, s is the state of the unmanned aerial vehicle at time t = 0, and s′ is the state of the unmanned aerial vehicle at the next moment. The above formula can then be rewritten recursively as:
V^π(s) = R(s, π(s)) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)  (11)
The above formula gives a method for calculating the evaluation function corresponding to a strategy, and defines the state-action value function Q^π(s, a) as an intermediate variable in solving the evaluation function. Given the initial state s and the current action a of the drone, the drone will move to the next state s′ at the next moment with probability P(s′|s, a) and follow this rule thereafter; the state-action value function Q^π(s, a) can then be expressed as:
Q^π(s, a) = R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)  (12)

where R(s, a) is the reward obtained when the drone selects action a in state s.
At this point, the optimal strategy π*(s) of the MDP can be expressed as:
π*(s) = arg max_{a∈A} Q^π(s, a) = arg max_{a∈A} {R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)}  (13)
Accordingly, the optimal evaluation function V*(s) can be expressed as:

V*(s) = max_{a∈A} {R(s, a) + γ ∑_{s′∈S} P(s′|s, a) V^π(s′)}  (14)
the invention provides a multi-unmanned aerial vehicle route planning algorithm based on an improved Markov decision process model, aiming at the characteristics of the multi-unmanned aerial vehicle collaborative route planning problem and the problem that the multi-unmanned aerial vehicle is easily affected by environmental threats when the multi-unmanned aerial vehicle executes combat missions in a complex environment. According to the method, starting points, task target points and radar threat areas of the multiple unmanned aerial vehicles in a flight scene are determined according to the collaborative route planning task of the multiple unmanned aerial vehicles; determining all state spaces, action spaces and state transfer functions in the motion process of the multiple unmanned aerial vehicles according to the starting points, the task target points and the radar threat areas of the multiple unmanned aerial vehicles; constructing a reward function, and constructing a Markov process model under the collaborative route planning task of the multiple unmanned aerial vehicles according to a state space, an action space, a state transfer function and the reward function in the Markov decision process; and constructing an evaluation function, executing search strategy iteration based on Markov decision based on the evaluation function, and searching an action sequence which enables the evaluation function value to be maximum, thereby planning the optimal multi-unmanned aerial vehicle collaborative air route.
To further reduce the probability of the unmanned aerial vehicle being detected by radar, a radar threat model with a non-uniform structure is proposed and introduced into the reward function; the radar threat cost is combined with the Markov model, the Markov decision model is designed accordingly, and a multi-unmanned aerial vehicle route planning algorithm based on the improved Markov decision model is obtained. Flight route planning is carried out for multiple unmanned aerial vehicles in a complex multi-threat environment, and simulation results show that route planning based on the improved Markov decision model not only quickly plans reasonable and effective flight paths for the multiple unmanned aerial vehicles, but also greatly reduces their radar threat cost and comprehensive route cost, improving the safety of the unmanned aerial vehicles when executing tasks in complex environments.
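Because the exact radar threat formula is rendered as an image in the source text, the following Python sketch only illustrates the general shape of such a segment-sampled threat reward. The 1/4, 1/2 and 3/4 sample points, the N radars and the distances d_{k/4,i} come from the text; the inverse-fourth-power decay and the weight parameter are assumptions borrowed from common radar-threat models, not the patent's formula:

```python
import math

def radar_threat_reward(p0, p1, radars, weight=1.0):
    """Assumed radar threat reward for one route segment from p0 to p1.

    radars: list of (x, y) radar positions (the N radar threats).
    The 1/d**4 decay is a stand-in for the patent's unrecovered formula.
    """
    (x0, y0), (x1, y1) = p0, p1
    L = math.hypot(x1 - x0, y1 - y0)              # segment length L after the action decision
    threat = 0.0
    for k in (1, 2, 3):                           # sample points at k/4, k = 1, 2, 3
        px = x0 + k / 4 * (x1 - x0)
        py = y0 + k / 4 * (y1 - y0)
        for rx, ry in radars:
            d = math.hypot(px - rx, py - ry)      # distance d_{k/4,i} to the ith radar
            threat += 1.0 / max(d, 1e-9) ** 4     # assumed exposure decay
    return -weight * (L / 3.0) * threat           # negative reward: threat acts as a cost
```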
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A multi-unmanned aerial vehicle collaborative route planning method based on an improved Markov decision process is characterized by comprising the following steps:
determining starting points, task target points and radar threat areas of the multiple unmanned aerial vehicles in a flight scene according to the collaborative route planning task of the multiple unmanned aerial vehicles;
determining all state spaces, action spaces and state transfer functions in the motion process of the multiple unmanned aerial vehicles according to the starting points, the task target points and the radar threat areas of the multiple unmanned aerial vehicles;
constructing a Markov decision process model under a multi-unmanned-aerial-vehicle collaborative route planning task according to a state space, an action space, a state transfer function and a pre-constructed reward function in the Markov decision process;
and executing search strategy iteration based on a Markov decision process model based on a pre-constructed evaluation function, and searching an action sequence which enables the evaluation function value to be maximum, thereby planning the optimal multi-unmanned aerial vehicle collaborative air route.
2. The multi-unmanned aerial vehicle collaborative route planning method based on the improved Markov decision process according to claim 1, wherein during the search strategy iteration the method further comprises calculating a comprehensive navigation cost function value of the multiple unmanned aerial vehicles based on a pre-constructed multi-unmanned aerial vehicle comprehensive navigation cost function, and taking the path with the minimum comprehensive navigation cost function value as the optimal multi-unmanned aerial vehicle collaborative route;
the multi-unmanned aerial vehicle comprehensive navigation cost function calculation formula is as follows:
J_i=W_1 J_{l,i}+W_2 J_{r,i}+W_3 J_t
in the formula: J_i is the comprehensive navigation cost of the multiple unmanned aerial vehicles on the ith route segment; W_1, W_2 and W_3 are the weights of the fuel cost, the threat cost and the cooperation cost respectively; J_{l,i} represents the fuel cost when the segment length of the ith route segment is l; J_{r,i} represents the threat cost when the radar threat on the ith route segment is r; J_t is the cooperation cost.
3. The multi-unmanned aerial vehicle collaborative route planning method based on the improved Markov decision process according to claim 1, wherein the Markov decision process model is represented by the four-tuple M = <S, A, P, R>:
S represents a finite set of system states, including the finite state points of the unmanned aerial vehicle flight environment;
A represents a finite set of actions available to the unmanned aerial vehicle;
P is the state transition probability function, representing the probability that, when the agent is in state s_t and performs action a_t ∈ A, it transitions to state s_{t+1};
R is the reward function, representing the immediate reward obtainable given the current state and action of the unmanned aerial vehicle.
4. The multi-unmanned aerial vehicle collaborative route planning method based on the improved Markov decision process according to claim 3, wherein the state transition probability function is expressed as:
P(s′|s,a)=P(s_{t+1}=s′|s_t=s, a_t=a)
where s denotes an instance of the unmanned aerial vehicle state, a denotes an instance of the unmanned aerial vehicle action, s_t denotes the state of the unmanned aerial vehicle at time t, and a_t denotes the action selected by the unmanned aerial vehicle at time t.
5. The multi-unmanned aerial vehicle collaborative route planning method based on the improved Markov decision process according to claim 1, wherein the reward function R comprises a reward function Rmovegoal for normal flight of the unmanned aerial vehicle and a reward function Ravoidobstacle for when the unmanned aerial vehicle encounters a threat, expressed as follows:
[formula images for Rmovegoal and Ravoidobstacle not reproduced in the source text]
In state s, the reward function R(s,a) obtained when the unmanned aerial vehicle selects action a is:
R(s,a)=Rmovegoal+Ravoidobstacle
6. The multi-unmanned aerial vehicle collaborative route planning method based on the improved Markov decision process according to claim 5, wherein the reward function further comprises a radar threat reward function Rthreat applied while the unmanned aerial vehicle is in flight, expressed as follows:
[formula image for Rthreat not reproduced in the source text]
where L is the length of the route segment after the unmanned aerial vehicle makes its action decision, N is the number of radar threats, and d_{k/4,i}, k=1, 2, 3, is the distance between the k/4 point of the route segment and the ith radar threat;
In state s, the reward function R(s,a) obtained when the unmanned aerial vehicle selects action a is:
R(s,a)=Rmovegoal+Ravoidobstacle+Rthreat
7. The multi-unmanned aerial vehicle collaborative route planning method based on the improved Markov decision process according to claim 1, wherein the pre-constructed evaluation function is expressed as:
V*(s)=max_{a∈A}{R(s,a)+γ∑_{s′∈S}P(s′|s,a)Vπ(s′)}
wherein V*(s) represents the evaluation function corresponding to the optimal strategy π*, called the optimal evaluation function; A represents the finite set of actions available to the unmanned aerial vehicle; R(s,a) is the reward function value obtained when the unmanned aerial vehicle selects action a in state s; γ is the discount factor; Vπ(s′) is the evaluation function of state s′ under strategy π;
s denotes an instance of the unmanned aerial vehicle state, a denotes an instance of the unmanned aerial vehicle action, and P(s′|s,a) is the state transition function.
8. A multi-unmanned aerial vehicle collaborative route planning device based on an improved Markov decision process is characterized by comprising an unmanned aerial vehicle collaborative system modeling module, a Markov process model building module and a multi-unmanned aerial vehicle collaborative route planning module; the unmanned aerial vehicle collaborative system modeling module is used for determining starting points, task target points and radar threat areas of the multiple unmanned aerial vehicles in a flight scene according to the multiple unmanned aerial vehicle collaborative route planning task;
the Markov process model building module is used for determining all state spaces, action spaces and state transfer functions in the motion process of the multiple unmanned aerial vehicles according to the starting points, the task target points and the radar threat regions of the multiple unmanned aerial vehicles; constructing a Markov decision process model under a multi-unmanned-aerial-vehicle collaborative route planning task according to a state space, an action space, a state transfer function and a pre-constructed reward function in the Markov decision process;
the multi-unmanned aerial vehicle collaborative route planning module is used for executing search strategy iteration of a Markov decision process model based on a pre-constructed evaluation function, searching an action sequence enabling the evaluation function value to be maximum, and planning an optimal multi-unmanned aerial vehicle collaborative route.
9. The multi-unmanned aerial vehicle collaborative route planning device according to claim 8, wherein the pre-constructed reward function R comprises a reward function Rmovegoal for normal flight of the unmanned aerial vehicle and a reward function Ravoidobstacle for when the unmanned aerial vehicle encounters a threat, expressed as follows:
[formula images for Rmovegoal and Ravoidobstacle not reproduced in the source text]
In state s, the reward function R(s,a) obtained when the unmanned aerial vehicle selects action a is:
R(s,a)=Rmovegoal+Ravoidobstacle
10. The multi-unmanned aerial vehicle collaborative route planning device according to claim 9, wherein the reward function further comprises a radar threat reward function Rthreat applied while the unmanned aerial vehicle is in flight, expressed as follows:
[formula image for Rthreat not reproduced in the source text]
where L is the length of the route segment after the unmanned aerial vehicle makes its action decision, N is the number of radar threats, and d_{k/4,i}, k=1, 2, 3, is the distance between the k/4 point of the route segment and the ith radar threat;
In state s, the reward function R(s,a) obtained when the unmanned aerial vehicle selects action a is:
R(s,a)=Rmovegoal+Ravoidobstacle+Rthreat
CN201911139552.XA 2019-11-20 2019-11-20 Multi-unmanned aerial vehicle collaborative route planning method and device in Markov decision process Pending CN112824998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911139552.XA CN112824998A (en) 2019-11-20 2019-11-20 Multi-unmanned aerial vehicle collaborative route planning method and device in Markov decision process

Publications (1)

Publication Number Publication Date
CN112824998A true CN112824998A (en) 2021-05-21

Family

ID=75906673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911139552.XA Pending CN112824998A (en) 2019-11-20 2019-11-20 Multi-unmanned aerial vehicle collaborative route planning method and device in Markov decision process

Country Status (1)

Country Link
CN (1) CN112824998A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107883962A (en) * 2017-11-08 2018-04-06 南京航空航天大学 A kind of dynamic Route planner of multi-rotor unmanned aerial vehicle under three-dimensional environment
CN108594858A (en) * 2018-07-16 2018-09-28 河南大学 The unmanned plane searching method and device of Markov moving target

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANG YU et al.: "Collision-Free Trajectory Generation and Tracking for UAVs Using Markov Decision Process in a Cluttered Environment", Journal of Intelligent & Robotic Systems *
XIAO Zuolin et al.: "Research on Multi-UCAV Cooperative Mission Planning Technology", Tactical Missile Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113703483A (en) * 2021-08-31 2021-11-26 湖南苍树航天科技有限公司 Multi-UAV collaborative trajectory planning method and system, equipment and storage medium
CN113703483B (en) * 2021-08-31 2024-04-09 湖南苍树航天科技有限公司 Multi-UAV collaborative trajectory planning method, system, equipment and storage medium
CN114200963A (en) * 2022-02-17 2022-03-18 佛山科学技术学院 Unmanned aerial vehicle autonomous mission planning method and device in dynamic environment and storage medium
CN114200963B (en) * 2022-02-17 2022-05-10 佛山科学技术学院 Unmanned aerial vehicle autonomous mission planning method and device under dynamic environment and storage medium
CN114722946A (en) * 2022-04-12 2022-07-08 中国人民解放军国防科技大学 Unmanned aerial vehicle asynchronous action and cooperation strategy synthesis method based on probability model detection
CN117075596A (en) * 2023-05-24 2023-11-17 陕西科技大学 Method and system for planning complex task path of robot under uncertain environment and motion
CN117075596B (en) * 2023-05-24 2024-04-26 陕西科技大学 Method and system for planning complex task path of robot under uncertain environment and motion
CN116882607A (en) * 2023-07-11 2023-10-13 中国人民解放军军事科学院系统工程研究院 Key node identification method based on path planning task
CN116882607B (en) * 2023-07-11 2024-02-02 中国人民解放军军事科学院系统工程研究院 Key node identification method based on path planning task

Similar Documents

Publication Publication Date Title
CN112824998A (en) Multi-unmanned aerial vehicle collaborative route planning method and device in Markov decision process
Zhang et al. Cooperative and geometric learning algorithm (CGLA) for path planning of UAVs with limited information
Duan et al. Imperialist competitive algorithm optimized artificial neural networks for UCAV global path planning
CN107168380B (en) Multi-step optimization method for coverage of unmanned aerial vehicle cluster area based on ant colony algorithm
CN101286071B (en) Multiple no-manned plane three-dimensional formation reconfiguration method based on particle swarm optimization and genetic algorithm
CN108153328B (en) Multi-missile collaborative track planning method based on segmented Bezier curve
CN109917806B (en) Unmanned aerial vehicle cluster formation control method based on non-inferior solution pigeon swarm optimization
CN106705970A (en) Multi-UAV(Unmanned Aerial Vehicle) cooperation path planning method based on ant colony algorithm
CN110687923A (en) Unmanned aerial vehicle long-distance tracking flight method, device, equipment and storage medium
Liu et al. Adaptive path planning for unmanned aerial vehicles based on bi-level programming and variable planning time interval
CN110908395A (en) Improved unmanned aerial vehicle flight path real-time planning method
CN102880186A (en) Flight path planning method based on sparse A* algorithm and genetic algorithm
CN112783213B (en) Multi-unmanned aerial vehicle cooperative wide-area moving target searching method based on hybrid mechanism
CN112947592A (en) Reentry vehicle trajectory planning method based on reinforcement learning
CN114740883B (en) Coordinated point reconnaissance task planning cross-layer joint optimization method
Saito et al. A LiDAR based mobile area decision method for TLS-DQN: improving control for AAV mobility
CN115562357A (en) Intelligent path planning method for unmanned aerial vehicle cluster
Zhong et al. Method of multi-UAVs cooperative search for Markov moving targets
Zhang et al. Design of the fruit fly optimization algorithm based path planner for UAV in 3D environments
Wei et al. UCAV formation online collaborative trajectory planning using hp adaptive pseudospectral method
Zhou et al. Multi-UAVs formation autonomous control method based on RQPSO-FSM-DMPC
Yu et al. UAV path planning using GSO-DE algorithm
Fu et al. Obstacle avoidance and collision avoidance of UAV swarm based on improved VFH algorithm and information sharing strategy
Zollars et al. Optimal Path Planning for SUAS Target Observation through Constrained Urban Environments using Simplex Methods
Li et al. A path planning for one UAV based on geometric algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210521