CN115457809A - Multi-agent reinforcement learning-based flight path planning method under opposite support scene - Google Patents

Multi-agent reinforcement learning-based flight path planning method under opposite support scene

Info

Publication number
CN115457809A
Authority
CN
China
Prior art keywords
airplane
opposite support
reinforcement learning
agent
situation information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210955706.8A
Other languages
Chinese (zh)
Inventor
瞿崇晓
靳慧泉
焦文明
朱燎原
夏少杰
范长军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN202210955706.8A priority Critical patent/CN115457809A/en
Publication of CN115457809A publication Critical patent/CN115457809A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 5/00 - Traffic control systems for aircraft, e.g. air-traffic control [ATC]
    • G08G 5/003 - Flight plan management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G08 - SIGNALLING
    • G08G - TRAFFIC CONTROL SYSTEMS
    • G08G 5/00 - Traffic control systems for aircraft, e.g. air-traffic control [ATC]
    • G08G 5/0073 - Surveillance aids

Abstract

The invention discloses a multi-agent reinforcement learning-based flight path planning method in an opposite support scene, applied to real-time simulation deduction and comprising the following steps: initializing the opposite support simulation engine and instantiating the entities; acquiring the current situation information of all entities and the environment; performing situation conversion on the acquired situation information; deciding the maneuvering direction of each own aircraft with a pre-trained reinforcement learning model, based on the situation information converted into the relative coordinate system; performing action conversion with a maneuvering model according to the maneuvering direction of each own aircraft to generate the position information of each own aircraft at the next moment; obtaining the situation information of all entities and the environment at the next moment from the position information of each aircraft at the next moment; and repeating these steps until the opposite support task is completed, then outputting a complete route containing the position information of each aircraft at each moment. The invention improves the efficiency and environmental adaptability of flight path planning in the opposite support scene.

Description

Multi-agent reinforcement learning-based flight path planning method under opposite support scene
Technical Field
The invention belongs to the technical field of path planning, and particularly relates to a multi-agent reinforcement learning-based flight path planning method in an opposite support scene.
Background
Opposite support is a combat action in which fixed-wing or rotary-wing aircraft attack enemy targets in close proximity to friendly ground forces; the combat process requires close maneuver and fire coordination between the ground troops and the supporting aircraft. Flight path planning in the opposite support scene refers to finding a flight path for a moving body from an initial point to a target point that satisfies given constraints and is optimal with respect to a set performance index. In a modern opposite support scene, real-time flight path planning must not only achieve terrain following, terrain avoidance and threat avoidance, but also make flexible maneuvering and behavior decisions in real time according to the rapidly and drastically changing battlefield situation. To better address these problems, a real-time flight path planning method is needed that can cope with complex and variable battlefield environments.
Reinforcement learning takes environmental feedback as input in order to adapt to the environment. Its main idea is to optimize decisions through the interaction and trial-and-error of an agent with the environment, guided by an evaluative feedback signal. Through reinforcement learning an agent can adapt to an unknown environment and acquire self-learning capability; training gives the agent a degree of intelligence that strengthens its ability to solve the flight path planning problem in the opposite support scene. Compared with a single-agent system, a multi-agent system can make the most of individual perception, improve efficiency through cooperation, avoid duplication and conflict, and complete tasks that a single-agent system cannot.
Existing technical solutions include: an unmanned aerial vehicle task allocation and flight path planning cooperative control method (authors Sungxian, ziming, dong Cheng ran in China), research on flight path planning in low-altitude penetration of an aircraft (authors leaf and text, jiang Wen Zhi, ma Deng Wu, yu Fu Shen), multi-UAV cooperative patrol flight path planning in a complex environment (authors Liyou Song, qiaofu), UCAV low-altitude penetration flight path planning in a complex battlefield environment (authors Zhouyi, yaoto Kai, wuqi Ke), and multi-UAV cooperative reconnaissance flight path planning based on an improved genetic algorithm (Youjiang Wai, youjiang river, etc.).
The main disadvantages of existing real-time flight path planning methods are as follows. First, real-time performance is low: published flight path planning methods generally plan in advance and output a complete track, so once the battlefield environment changes, the track cannot be re-planned efficiently and quickly in an actual scene. Second, the prior art requires global information for flight path planning; when the scene scale and search space are large, the decision step is long, the hysteresis is strong, and the adaptability is relatively poor.
Disclosure of Invention
The invention aims to provide a multi-agent reinforcement learning-based flight path planning method in an opposite support scene, so as to improve the efficiency and environmental adaptability of flight path planning in the opposite support scene.
In order to achieve the above purpose, the technical solution adopted by the invention is as follows:
A multi-agent reinforcement learning-based flight path planning method in an opposite support scene is used for planning the flight paths of aircraft; when applied to real-time simulation deduction, the method comprises the following steps:
step 1, initializing an opposite support simulation engine, and instantiating the entities corresponding to the opposite support scene;
step 2, acquiring the current situation information of all entities and the environment;
step 3, carrying out situation conversion on the obtained situation information, the situation conversion comprising: taking an own aircraft as the coordinate origin, establishing a relative coordinate system whose axes are parallel to those of the earth coordinate system, and converting the situation information in the earth coordinate system into situation information in the relative coordinate system;
step 4, deciding the maneuvering direction of each own aircraft with a pre-trained reinforcement learning model, based on the situation information converted into the relative coordinate system;
step 5, performing action conversion through a maneuvering model constructed from flight dynamics according to the maneuvering direction of each own aircraft, to generate the position information of each own aircraft at the next moment;
step 6, updating the remaining entities and the environment in the opposite support scene according to the position information of each own aircraft at the next moment, as the situation information of all entities and the environment at the next moment;
step 7, repeating steps 2-6 until the opposite support task is completed, and outputting a complete route containing the position information of each own aircraft at each moment.
Several preferred options are provided below; they are not additional limitations on the above general solution but merely further additions or preferences, and each option can be combined with the above general solution, or with other options, individually, provided there is no technical or logical contradiction.
Preferably, before the situation conversion, situation extraction is carried out on the situation information, and the situation information obtained based on the situation extraction is used as the input of a pre-trained reinforcement learning model;
the situation extraction includes extracting situation information of the aircraft and entities related to the ground facility and extracting relative relationships among the entities.
Preferably, taking an own aircraft as the coordinate origin comprises: taking the first own aircraft as the coordinate origin.
Preferably, the multi-agent reinforcement learning-based flight path planning method in the opposite support scenario is applied to model training, and includes:
step 1, initializing an opposite support simulation engine, and instantiating an entity corresponding to an opposite support scene;
step 2, acquiring current situation information of all entities and the environment;
step 3, deciding the maneuvering direction of each airplane in the own party by the reinforcement learning model based on the current situation information converted into the relative coordinate system, and generating the position information of each airplane in the own party at the next moment through the maneuvering model;
step 4, updating the observation of the whole environment, and calculating a reward and punishment value according to a reward function of the flight path planning;
and 5, updating parameters of the reinforcement learning model, returning to the step 2 and continuing to execute until the training is finished.
Preferably, the specific training process of the reinforcement learning model is as follows:
1) Initializing neural network parameters of Actor/Critic;
2) Assigning the parameters of the neural network to corresponding target network parameters;
3) Initializing a playback buffer R, and storing data obtained by training;
4) Initializing an opposite support simulation engine and instantiating an entity corresponding to an opposite support scene;
5) Initializing an aircraft track planning process N to be used as an exploration space to obtain an initial state S;
6) Acquiring current situation information of all entities and environments;
7) Selecting an entity control action according to the current strategy u and the exploration noise;
8) After the action is performed, a reward is fed back, a discounted reward is given for the decision at the future time t+1, and a new state S_{t+1} is obtained; a negative reward is given when a single entity is damaged; a negative reward is given when an entity enters the enemy's detection area; and a positive reward is added when an entity approaches the enemy's target position;
9) The policy network stores the state transition (S_t, a_t, r_t, S_{t+1}) into the playback buffer R as the data set for training the network; wherein S_t is the situation information of the agent at the current time t, a_t is the decision quantity executed by the agent at the current time t, and r_t is the reward obtained after the agent executes the decision action at the current time t;
10) Randomly sampling M state-transition records from the playback buffer R as mini-batch training data, and updating the network parameters by minimizing the loss function of the value network;
11) Updating the policy gradient of the Actor network, and updating the Critic network parameters;
12) Returning to step 6) and continuing until the training is finished, and after the training is finished storing the generated model in a file with the .h5 suffix.
Preferably, the specific implementation process of steps 7) to 9) is as follows:
Real-time flight path planning can be described as the tuple

⟨X, U_1, …, U_n, f, r_1, …, r_n⟩

wherein X is the situation information of the environment, U_i is the selectable behavior space of the ith agent, 1 ≤ i ≤ n, the combination of the selectable behavior spaces of all agents is U = U_1 × U_2 × … × U_n, f: X × U × X → [0, 1] is the state transition function, and r_t^i is the reward of the ith agent at the current time t;

in a multi-agent system, the state transition depends on the joint behavior of all agents U_k = (U_1k, U_2k, …, U_nk), U_k ∈ U, where U_ik is the behavior of the ith agent; the strategy h is formed by combining the strategies of each aircraft, and the reward of a single aircraft depends on the joint behavior of all aircraft; the real-time flight path planning of the aircraft is a fully cooperative stochastic game, in which all reward functions are identical.
The multi-agent reinforcement learning-based flight path planning method in the opposite support scene improves the generalization of flight path planning through input situation optimization, so that the trained model performs better in different environments. The reward is designed specifically for the opposite support scene: in addition to the win/lose reward, damage rewards and process rewards such as the relative distance to the target are added, so that the reward guides model training and improves the training effect.
Drawings
FIG. 1 is a flowchart of a multi-agent reinforcement learning-based track planning method applied to model training in an opposite support scenario;
FIG. 2 is a flowchart of the multi-agent reinforcement learning-based track planning method applied to simulation deduction in the opposite support scenario.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
This embodiment provides a multi-agent reinforcement learning-based flight path planning method in an opposite support scene. In an opposite support scene with complex terrain, a changing situation and incomplete information, the method solves the problems of information sharing and strategy cooperation among multiple aircraft in simulated confrontation, can perform real-time flight path planning without global information, and improves the efficiency and environmental adaptability of flight path planning in the opposite support scene.
The multi-agent reinforcement learning-based flight path planning method in the opposite support scene of this embodiment mainly comprises two stages: training the model for real-time flight path planning, and loading the trained algorithm model into the complex and changing real-time battlefield for real-time simulation deduction.
The real-time flight path planning problem adopts a distributed reinforcement learning algorithm. Each aircraft makes decisions independently; a single aircraft is an agent; cooperation among the aircraft is accomplished through information sharing; and each aircraft interacts with the environment using the incomplete situation information it acquires, optimizing its own objective so that a local or global optimum is finally reached. All aircraft have the same objective, cooperate to complete the mission while avoiding collisions, and maximize the joint profit.
For ease of understanding, this embodiment takes as an example a real-time combat simulation in the opposite support scene against an enemy target that our own ground forces cannot attack; the model training part of the flight path planning is as follows:
step 1, initializing a simulation environment of an opposite support scene, and instantiating an agent.
Step 2, processing the situation information according to the observation information and the target state of the environment.
Step 3, based on the current situation information converted into the relative coordinate system, deciding the maneuvering direction of each own aircraft with the reinforcement learning model, and generating the position information of each own aircraft at the next moment through the maneuvering model.
Step 4, updating the observation of the whole environment, and calculating the reward and punishment value according to the reward function of the flight path planning.
Step 5, repeating steps 2 to 4, storing a certain amount of training sample data for algorithm training, and updating the model parameters until the training is finished.
In the real-time aircraft flight path planning problem, the aircraft can be regarded as an agent, the environment in which the aircraft navigates is the environment interacting with the agent, and the indices used in traditional methods to evaluate results, such as path length, combat loss and energy consumption, can be regarded as rewards returned to the agent by the environment. This embodiment improves the generalization of flight path planning through input situation optimization, and the situation information is processed as follows:
and after initialization, obtaining the current situation information of a plurality of airplanes and the environment, and processing the situation information. The use of absolute coordinates as input in the situational information may induce the neural network to "remember" the features of each location in the map, which, while improving the performance of the strategy on a particular map, may also severely compromise its generalization capability. Therefore, the present embodiment uses the relative coordinates, and uses the first airplane as the origin of coordinates, and recalculates the coordinates of each airplane and other objects in the environment, so as to make the strategy learn more general knowledge from the perspective of the intelligent agent, and reduce overfitting to a specific map as much as possible. Binary coding is used for different airplane types and missile types, and compared with one-hot coding, the method can reduce the input dimension of training and improve the training efficiency.
In model training oriented to global flight path planning, the state variables in the environment are mainly static terrain data, together with the initial and terminal states and the real-time state information of the aircraft, and the reward is mainly measured by the time to reach the target, the degree of aircraft damage and the path length.
As shown in fig. 1, the overall process of model training is as follows:
1) Initializing the Actor/Critic neural network parameters θ^Q and θ^u.
2) Copying the network parameters to the corresponding target network parameters: θ^Q′ ← θ^Q, θ^u′ ← θ^u.
3) And initializing a playback buffer R and storing the data obtained by training.
4) Initializing the real-time opposite support simulation engine; the model instantiates the corresponding entities, including aircraft models, terrain, obstacles, etc., according to the scenario information of the initialization function.
5) Initializing the aircraft flight path planning process N as the exploration space, which comprises acceleration, sideslip angular velocity and pitch angular velocity, and obtaining the initial state S.
6) And processing the situation according to the observation information and the target state of the environment.
7) Selecting an aircraft control action a_t according to the current strategy u and the exploration noise.
8) After performing the action, a reward r_t is fed back: r_t = R_t + γR_{t+1}, where γ is the discount factor of the discounted reward, R_t is the reward for the current action, and R_{t+1} is the reward for the action at the future time t+1. The reward at the current time t includes the survival reward of each aircraft and the team cooperation reward of each aircraft; a discounted reward is given for the decision at the future time t+1, and the new state S_{t+1} is obtained.
9) The policy network stores the state transition (S_t, a_t, r_t, S_{t+1}) into the playback buffer R as the data set for training the network.
10) Randomly sampling M state-transition records from R as mini-batch training data, and updating the network parameters by minimizing the loss function of the value network, defined as

loss = (1/N) Σ_t ( y_t − Q(s_t, a_t | θ^Q) )²

where N is the number of samples, Q(s_t, a_t | θ^Q) is the value function that scores the execution of action a_t, and y_t can be regarded as the label:

y_t = r_t + γ Q′(s_{t+1}, u′(s_{t+1} | θ^u′) | θ^Q′)

where Q′(s_{t+1}, u′(s_{t+1} | θ^u′) | θ^Q′) is the value function scoring the given action a_{t+1}.
11) Updating the policy gradient of the Actor network:

∇_{θ^u} J ≈ (1/N) Σ_t ∇_a Q(s, a | θ^Q)|_{s=s_t, a=u(s_t)} ∇_{θ^u} u(s | θ^u)|_{s=s_t}

and updating the Critic and Actor target network parameters by soft update:

θ^Q′ ← τθ^Q + (1 − τ)θ^Q′
θ^u′ ← τθ^u + (1 − τ)θ^u′
12) Repeating steps 6)-11) until the training is finished, and storing the generated model in a file with the .h5 suffix so that it can be called later.
Here, the state s_t is the real-time environmental situation observed by the agent in the current state, including all aircraft state information, obstacle locations, etc.; a_t is the decision variable executed by the agent under the current environmental situation, including acceleration, sideslip angular velocity and pitch angular velocity; r_t is the immediate reward obtained after the agent executes the decision action, including the individual profit of single-aircraft damage and the team profit; and s_{t+1} is the real-time environmental situation returned by the environment after the agent executes the decision action and enters the next state. The agent interacts with the environment through a series of s, a and r to maximize the cumulative reward.
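For concreteness, the following is a minimal single-agent PyTorch-style sketch of the update in steps 10)-11) (critic loss, actor policy gradient, soft target update), assuming a DDPG-style Actor/Critic. The network sizes, learning rates, dimensions and variable names are illustrative assumptions and are not taken from the patent.

```python
# Minimal DDPG-style update sketch for steps 10)-11); all hyperparameters are assumptions.
import random
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, GAMMA, TAU = 32, 3, 0.99, 0.005  # action = (acceleration, sideslip rate, pitch rate)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, critic = mlp(OBS_DIM, ACT_DIM), mlp(OBS_DIM + ACT_DIM, 1)        # step 1)
actor_tgt, critic_tgt = mlp(OBS_DIM, ACT_DIM), mlp(OBS_DIM + ACT_DIM, 1)
actor_tgt.load_state_dict(actor.state_dict())                            # step 2): copy to targets
critic_tgt.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay = []                                                               # step 3): playback buffer R

def update(batch_size=64):
    batch = random.sample(replay, batch_size)                             # step 10): sample mini-batch
    s, a, r, s2 = (torch.tensor([b[i] for b in batch], dtype=torch.float32) for i in range(4))
    with torch.no_grad():                                                 # y_t = r_t + gamma * Q'(s_{t+1}, u'(s_{t+1}))
        y = r.unsqueeze(1) + GAMMA * critic_tgt(torch.cat([s2, actor_tgt(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()          # step 11): policy gradient
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):           # soft update of target networks
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```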
In this embodiment, at the start of training, steps 1)-5) of the overall model training process are executed, corresponding to step 1 of the model training part; within each battlefield episode, steps 6)-7) of the overall process are executed to make the maneuvering direction decisions, corresponding to steps 2 and 3 of the model training part; then, within the episode, steps 8) and 9) of the overall process are executed for reward and data collection, corresponding to step 4 of the model training part; finally, steps 10)-12) of the overall process are executed for algorithm training and model updating, corresponding to step 5 of the model training part, until model training is finished.
In the model training of this embodiment, the real-time flight path planning of the aircraft can be described as the tuple

⟨X, U_1, …, U_n, f, r_1, …, r_n⟩

where X is the environment state, including the aircraft models, terrain, obstacles, etc.; U_i is the selectable behavior space of the ith agent, 1 ≤ i ≤ n, comprising acceleration, sideslip angular velocity and pitch angular velocity; the combination of the selectable behavior spaces of all agents is U = U_1 × U_2 × … × U_n; and f: X × U × X → [0, 1] is the state transition function.

r_t^i is the reward of the ith agent at the current time t, and specifically includes the individual profit of single-aircraft damage and the team profit. The reward is specified as follows: a negative reward is given when a single aircraft is damaged; a negative reward is given when an aircraft enters the enemy's detection area; and a positive reward is added when an aircraft approaches the enemy target position.
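A minimal sketch of this reward shaping is given below. The reward magnitudes, thresholds and argument names are assumptions for illustration only and are not values specified by the patent.

```python
# Illustrative per-step reward shaping; magnitudes are assumed, not from the patent.
def step_reward(damaged: bool, in_enemy_detection_area: bool,
                dist_to_target: float, prev_dist_to_target: float) -> float:
    r = 0.0
    if damaged:                                # damage of a single aircraft: negative reward
        r -= 10.0
    if in_enemy_detection_area:                # entering the enemy detection area: negative reward
        r -= 1.0
    if dist_to_target < prev_dist_to_target:   # approaching the enemy target: positive reward
        r += 0.1
    return r
```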
In a multi-agent system, the state transition depends on the joint behavior of all agents U_k = (U_1k, U_2k, …, U_nk), U_k ∈ U. The strategy h is formed by combining the strategies of each aircraft, and the reward of a single aircraft depends on the joint behavior of all aircraft. The reward expectation under the current strategy is

R_i(h) = E[ Σ_t γ^t r_t^i | h ]

where R_i(h) is the reward expectation of the ith agent under the current strategy h, γ is the discount factor of the reward, i.e. the future reward is multiplied by a discount factor with a value in (0, 1) reflecting the importance of the future reward, and γ^t is the discount factor applied at the current time t.
The Q function of each aircraft depends on the strategies and behaviors of all aircraft and has the form

Q_i^h(x, U) = E[ Σ_t γ^t r_t^i | x_0 = x, U_0 = U, h ]
the aircraft real-time flight path planning belongs to a full cooperation random game, and in the full cooperation random game, all reward functions are the same. All airplanes have the same target, a relatively large reward can be obtained when the target is successfully hit, and the damage degree of the airplane, the length of the path and the maneuvering time are all factors of reward setting. The behavior space is the combination of all the airplane behavior spaces, and the goal of maximizing the profit is achieved by optimizing the common behavior. There is a collaborative problem with independently decision-making multi-agent systems, where a single agent maximizes its own revenue by optimizing the same objectives.
In the practical application of the multi-agent reinforcement learning algorithm, global flight path planning, real-time obstacle avoidance and emergency handling are performed separately, and a set of models is trained for the real-time flight path planning of the aircraft at different stages.
As shown in fig. 2, taking the real-time opposite support simulation against the enemy target as an example, the part of the invention that loads the trained algorithm model into the real-time opposite support scene for real-time simulation deduction is as follows:
step 1, initializing an opposite support simulation engine, and instantiating an entity corresponding to an opposite support scene.
Step 2, acquiring the current situation information of all entities and the environment.
Step 3, carrying out situation conversion on the obtained situation information: taking an own aircraft as the coordinate origin, establishing a relative coordinate system whose axes are parallel to those of the earth coordinate system, and converting the situation information in the earth coordinate system into situation information in the relative coordinate system.
Step 4, deciding the maneuvering direction of each own aircraft with the pre-trained reinforcement learning model, based on the situation information converted into the relative coordinate system.
Step 5, performing action conversion through the maneuvering model constructed from flight dynamics according to the maneuvering direction of each own aircraft, to generate the position information of each own aircraft at the next moment.
Step 6, updating the remaining entities and the environment in the opposite support scene according to the position information of each own aircraft at the next moment, as the situation information of all entities and the environment at the next moment, i.e. the situation information for the next decision.
Step 7, repeating steps 2-6 until the opposite support task is completed, and outputting a complete route containing the position information of each own aircraft at each moment.
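The deduction loop of steps 1-7 can be summarized by the following Python sketch. The engine, policy and maneuver-model interfaces (get_situation, set_position, policy.decide, maneuver_step, etc.) are hypothetical names standing in for the opposite support simulation engine, the pre-trained reinforcement learning model and the flight-dynamics maneuvering model; they are assumptions, not APIs defined by the patent.

```python
# High-level sketch of the real-time deduction loop (steps 1-7); all interfaces are assumed.
def plan_routes(engine, policy, maneuver_step, to_relative_frame):
    engine.reset()                                     # step 1: initialize engine, instantiate entities
    routes = {ac_id: [] for ac_id in engine.own_aircraft_ids()}
    while not engine.task_completed():                 # step 7: repeat until the support task is done
        situation = engine.get_situation()             # step 2: current situation of entities/environment
        rel_situation = to_relative_frame(situation)   # step 3: convert to the relative coordinate system
        directions = policy.decide(rel_situation)      # step 4: maneuvering direction per own aircraft
        for ac_id, direction in directions.items():    # step 5: maneuvering model -> next-moment position
            next_pos = maneuver_step(situation, ac_id, direction)
            routes[ac_id].append(next_pos)
            engine.set_position(ac_id, next_pos)       # step 6: update entities and environment
        engine.advance()
    return routes                                      # complete route: per-aircraft position at each moment
```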
In order to improve the decision efficiency of the model, in another embodiment, situation extraction is carried out on situation information before situation conversion, and the situation information obtained based on the situation extraction is used as the input of a pre-trained reinforcement learning model; the situation extraction includes extracting situation information of the aircraft, entities related to the ground facility and extracting relative relationships between the entities.
The invention provides a multi-agent reinforcement learning-based flight path planning method in an opposite support scene. Used in opposite support scenes with complex terrain, a changing situation and incomplete information, it solves the problems of information sharing and strategy cooperation among multiple aircraft in opposite support simulated confrontation, can perform real-time flight path planning without global information, and improves the efficiency and environmental adaptability of flight path planning in the opposite support scene.
Compared with existing flight path planning schemes, the real-time flight path planning method based on multi-agent reinforcement learning can interact dynamically with the environment in real time, take into account changes in the situational environment and cooperation among the multiple agents, avoid task conflicts, and improve solving efficiency and reliability; it can plan paths without global information and therefore has stronger adaptability; and it can overcome the hysteresis of traditional flight path planning algorithms in environments with large scene scale and large search space.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is specific and detailed, but not to be construed as limiting the scope of the invention. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, and these changes and modifications are all within the scope of the invention. Therefore, the protection scope of the present invention should be subject to the appended claims.

Claims (6)

1. A multi-agent reinforcement learning-based flight path planning method in an opposite support scene is used for planning flight paths of airplanes, and is characterized in that the multi-agent reinforcement learning-based flight path planning method in the opposite support scene is applied to real-time simulation deduction and comprises the following steps:
step 1, initializing an opposite support simulation engine, and instantiating the entities corresponding to the opposite support scene;
step 2, acquiring the situation information of all entities and the environment;
step 3, carrying out situation conversion on the obtained situation information, the situation conversion comprising: taking an own aircraft as the coordinate origin, establishing a relative coordinate system whose axes are parallel to those of the earth coordinate system, and converting the situation information in the earth coordinate system into situation information in the relative coordinate system;
step 4, deciding the maneuvering direction of each own aircraft with a pre-trained reinforcement learning model, based on the situation information converted into the relative coordinate system;
step 5, performing action conversion through a maneuvering model constructed from flight dynamics according to the maneuvering direction of each own aircraft, to generate the position information of each own aircraft at the next moment;
step 6, updating the remaining entities and the environment in the opposite support scene according to the position information of each own aircraft at the next moment, as the situation information of all entities and the environment at the next moment;
step 7, repeating steps 2-6 until the opposite support task is completed, and outputting a complete route of each own aircraft containing the position information at each moment.
2. The multi-agent reinforcement learning-based track planning method in the opposite support scenario as claimed in claim 1, wherein the situation information is subjected to situation extraction before the situation conversion, and the situation information obtained based on the situation extraction is used as an input of a pre-trained reinforcement learning model;
the situation extraction includes extracting situation information of the aircraft and entities related to the ground facility and extracting relative relationships among the entities.
3. The multi-agent reinforcement learning-based track planning method for opposing support scenario as claimed in claim 1, wherein the method of taking an airplane in own right as the origin of coordinates comprises: the first self-plane is taken as the origin of coordinates.
4. The multi-agent reinforcement learning-based track planning method for the opposite support scenario of claim 1, wherein the multi-agent reinforcement learning-based track planning method for the opposite support scenario is applied to model training and comprises:
step 1, initializing an opposite support simulation engine, and instantiating an entity corresponding to an opposite support scene;
step 2, acquiring current situation information of all entities and the environment;
step 3, deciding the maneuvering direction of each airplane in the own party by the reinforcement learning model based on the current situation information converted into the relative coordinate system, and generating the position information of each airplane in the own party at the next moment through the maneuvering model;
step 4, updating the observation of the whole environment, and calculating a reward and punishment value according to a reward function of the flight path planning;
and 5, updating parameters of the reinforcement learning model, and returning to the step 2 to continue executing until the training is finished.
5. The multi-agent reinforcement learning-based track planning method in the opposite support scenario as claimed in claim 4, wherein the reinforcement learning model is trained by the following steps:
1) Initializing neural network parameters of Actor/Critic;
2) Assigning the parameters of the neural network to corresponding target network parameters;
3) Initializing a playback buffer R, and storing data obtained by training;
4) Initializing an opposite support simulation engine and instantiating an entity corresponding to an opposite support scene;
5) Initializing an aircraft track planning process N to be used as an exploration space to obtain an initial state S;
6) Acquiring current situation information of all entities and environments;
7) Selecting an entity control action according to the current strategy u and the exploration noise;
8) After the action is performed, a reward is fed back, a discounted reward is given for the decision at the future time t+1, and a new state S_{t+1} is obtained; a negative reward is given when a single entity is damaged; a negative reward is given when an entity enters the enemy's detection area; and a positive reward is added when an entity approaches the enemy's target position;
9) The policy network stores the state transition (S_t, a_t, r_t, S_{t+1}) into the playback buffer R as the data set for training the network; wherein S_t is the situation information of the agent at the current time t, a_t is the decision quantity executed by the agent at the current time t, and r_t is the reward obtained after the agent executes the decision action at the current time t;
10) Randomly sampling M state-transition records from the playback buffer R as mini-batch training data, and updating the network parameters by minimizing the loss function of the value network;
11) Updating the policy gradient of the Actor network, and updating the Critic network parameters;
12) Returning to step 6) and continuing until the training is finished, and after the training is finished storing the generated model in a file with the .h5 suffix.
6. The multi-agent reinforcement learning-based track planning method for the opposite support scenario as claimed in claim 5, wherein the specific implementation processes of steps 7) -9) are as follows:
real-time flight path planning can be described as the tuple

⟨X, U_1, …, U_n, f, r_1, …, r_n⟩

wherein X is the situation information of the environment, U_i is the selectable behavior space of the ith agent, 1 ≤ i ≤ n, the combination of the selectable behavior spaces of all agents is U = U_1 × U_2 × … × U_n, f: X × U × X → [0, 1] is the state transition function, and r_t^i is the reward of the ith agent at the current time t;

in a multi-agent system, the state transition depends on the joint behavior of all agents U_k = (U_1k, U_2k, …, U_nk), U_k ∈ U, wherein U_ik is the behavior of the ith agent; the strategy h is formed by combining the strategies of each aircraft, and the reward of a single aircraft depends on the joint behavior of all aircraft; the real-time flight path planning of the aircraft is a fully cooperative stochastic game, in which all reward functions are identical.
CN202210955706.8A 2022-08-10 2022-08-10 Multi-agent reinforcement learning-based flight path planning method under opposite support scene Pending CN115457809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210955706.8A CN115457809A (en) 2022-08-10 2022-08-10 Multi-agent reinforcement learning-based flight path planning method under opposite support scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210955706.8A CN115457809A (en) 2022-08-10 2022-08-10 Multi-agent reinforcement learning-based flight path planning method under opposite support scene

Publications (1)

Publication Number Publication Date
CN115457809A true CN115457809A (en) 2022-12-09

Family

ID=84296122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210955706.8A Pending CN115457809A (en) 2022-08-10 2022-08-10 Multi-agent reinforcement learning-based flight path planning method under opposite support scene

Country Status (1)

Country Link
CN (1) CN115457809A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116661503A (en) * 2023-08-02 2023-08-29 中国人民解放军96901部队 Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN116661503B (en) * 2023-08-02 2023-10-13 中国人民解放军96901部队 Cluster track automatic planning method based on multi-agent safety reinforcement learning

Similar Documents

Publication Publication Date Title
CN111240353B (en) Unmanned aerial vehicle collaborative air combat decision method based on genetic fuzzy tree
Hu et al. Application of deep reinforcement learning in maneuver planning of beyond-visual-range air combat
CN113893539B (en) Cooperative fighting method and device for intelligent agent
Li et al. A Multi-UCAV cooperative occupation method based on weapon engagement zones for beyond-visual-range air combat
CN113625740B (en) Unmanned aerial vehicle air combat game method based on transfer learning pigeon swarm optimization
CN114330115B (en) Neural network air combat maneuver decision-making method based on particle swarm search
CN111461294B (en) Intelligent aircraft brain cognitive learning method facing dynamic game
CN113625569B (en) Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN114638339A (en) Intelligent agent task allocation method based on deep reinforcement learning
Fu et al. The overview for UAV air-combat decision method
Yue et al. Deep reinforcement learning for UAV intelligent mission planning
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
Lei et al. Moving time UCAV maneuver decision based on the dynamic relational weight algorithm and trajectory prediction
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
CN113741186B (en) Double-aircraft air combat decision-making method based on near-end strategy optimization
CN115457809A (en) Multi-agent reinforcement learning-based flight path planning method under opposite support scene
CN114528750A (en) Intelligent air combat imitation learning training sample generation method based on self-game model
Zhu et al. Mastering air combat game with deep reinforcement learning
Lei et al. A new machine learning framework for air combat intelligent virtual opponent
CN115357051A (en) Deformation and maneuvering integrated avoidance and defense method
CN115186378A (en) Real-time solution method for tactical control distance in air combat simulation environment
Wang et al. Research on naval air defense intelligent operations on deep reinforcement learning
CN114815864A (en) Hypersonic aircraft track planning method based on reinforcement learning
Baykal et al. An Evolutionary Reinforcement Learning Approach for Autonomous Maneuver Decision in One-to-One Short-Range Air Combat
Zhang et al. Autonomous Defense of Unmanned Aerial Vehicles Against Missile Attacks Using a GRU-Based PPO Algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination