CN113791634A - Multi-aircraft air combat decision method based on multi-agent reinforcement learning - Google Patents

Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Info

Publication number
CN113791634A
Authority
CN
China
Prior art keywords
machine
blue
unmanned aerial
aerial vehicle
red
Prior art date
Legal status
Granted
Application number
CN202110964271.9A
Other languages
Chinese (zh)
Other versions
CN113791634B (en)
Inventor
刘小雄
尹逸
苏玉展
秦斌
韦大正
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202110964271.9A priority Critical patent/CN113791634B/en
Publication of CN113791634A publication Critical patent/CN113791634A/en
Application granted granted Critical
Publication of CN113791634B publication Critical patent/CN113791634B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104 Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a multi-aircraft air combat decision method based on multi-agent reinforcement learning. The method first establishes a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, and a situation judgment and target distribution model for the unmanned aerial vehicle; it then adopts the MAPPO algorithm as the multi-agent reinforcement learning algorithm and designs a corresponding return function for the specific air combat environment; finally, the constructed UAV models are combined with the multi-agent reinforcement learning algorithm to produce the final multi-aircraft cooperative air combat decision method based on multi-agent reinforcement learning. The method effectively addresses the problems that traditional multi-agent cooperative air combat approaches are computationally expensive and struggle to respond in real time to a rapidly changing battlefield situation.

Description

Multi-aircraft air combat decision method based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of unmanned aerial vehicles, and particularly relates to a multi-aircraft air combat decision method.
Background
UAV decision-making aims to let the unmanned aerial vehicle exploit an advantage, or turn a disadvantage into an advantage, during an engagement; the key research problem is to design an efficient autonomous decision mechanism. Autonomous decision-making of an unmanned combat aircraft is the mechanism by which it makes a tactical plan or selects flight actions in real time according to the actual combat environment, and the quality of this decision mechanism reflects the intelligence level of the unmanned fighter in modern air combat. The inputs of the autonomous decision mechanism are the various parameters related to air combat, such as the aircraft's flight parameters, weapon parameters, three-dimensional scene parameters and the relative relationship between the two sides; the decision process is the information processing and computation carried out inside the system; and the output is the tactical plan or the specific flight actions produced by the decision.
At present, air combat tactical decision methods can basically be divided into two categories. The first is traditional rule-based, non-learning strategies, mainly including differential game methods, expert systems, influence diagram methods and matrix game algorithms; their decision strategies are generally fixed and cannot fully cover the complex, rapidly changing multi-aircraft air combat problem. The second is self-learning strategies based on intelligent algorithms, mainly including artificial immune systems, genetic algorithms, transfer learning, approximate dynamic programming and reinforcement learning, which optimize the structure and parameters of the decision model through their own experience. Self-learning strategies are highly adaptive and can cope with an air battlefield whose situation is complex and changeable.
With the development of air combat technology, modern UAV air combat is no longer limited to the one-aircraft-versus-one-aircraft engagements of the past; formation cooperation implies a many-to-many attack mode, and mutual covering and cooperative attack among UAVs have become important components of multi-aircraft air combat decision-making.
The difficulty of multi-agent, multi-aircraft tactical decision-making is mainly reflected in (1) cooperation among multiple heterogeneous agents, (2) real-time confrontation and action persistence, (3) incomplete-information games with strong uncertainty, and (4) a huge search space with multiple complex tasks. With the breakthrough and development of artificial intelligence technology centered on deep reinforcement learning, a new technical approach has opened up for the intelligentization of command information systems, bringing a new solution to complex multi-agent air combat decision-making.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-aircraft air combat decision method based on multi-agent reinforcement learning. The method first establishes a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model for the unmanned aerial vehicle; it then adopts the MAPPO algorithm as the multi-agent reinforcement learning algorithm and designs a corresponding return function for the specific air combat environment; finally, the constructed UAV models are combined with the multi-agent reinforcement learning algorithm to produce the final multi-aircraft cooperative air combat decision method based on multi-agent reinforcement learning. The method effectively addresses the problems that traditional multi-agent cooperative air combat approaches are computationally expensive and struggle to respond in real time to a rapidly changing battlefield situation.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: the unmanned aerial vehicles of the two parties of the battle are assumed to be the unmanned aerial vehicle of the same party and the unmanned aerial vehicle of the opposite party, the unmanned aerial vehicle of the same party is the red machine, and the unmanned aerial vehicle of the opposite party is the blue machine; establishing a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle;
step 2: adopting an MAPPO algorithm as a multi-agent reinforcement learning algorithm, and designing a corresponding return function on the basis of a specific air combat environment;
and step 3: and (3) combining the unmanned aerial vehicle model constructed in the step (1) with the multi-agent reinforcement learning algorithm in the step (2) to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning.
Further, in the step 1, an airplane model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle are established, and the specific steps are as follows:
step 1-1: establishing an airplane model of the unmanned aerial vehicle;
step 1-1-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr], i.e. the speed Vr, pitch angle γr, roll angle φr and three-axis position (xr, yr, hr) of the unmanned aerial vehicle;
step 1-1-2: constructing the six-degree-of-freedom model and the seven actions of the unmanned aerial vehicle; the actions are encoded through the tangential overload, normal overload and roll angle of the UAV, i.e. these control quantities in equation (1) represent the action taken at each moment of the simulation, and after encoding the action set contains seven actions: constant level flight, acceleration, deceleration, left turn, right turn, pull-up and dive;
[equation (1), the kinematic model of the UAV, is reproduced only as an image in the original]
where v denotes the speed of the drone, Nx the tangential overload, θ the pitch angle, ψ the yaw angle, Nz the normal overload, φ the roll angle, t the state-update time, and g the gravitational acceleration;
step 1-1-3: inputting the action to be executed by the unmanned aerial vehicle;
step 1-1-4: computing the state of the aircraft after it executes the action by Runge-Kutta integration;
step 1-1-5: updating the state of the aircraft (a sketch of this aircraft model is given below);
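Equation (1) and the Runge-Kutta update above are reproduced only as images in the original document. As a minimal illustration of how such an aircraft model can be implemented, the Python sketch below assumes the standard point-mass kinematics driven by tangential overload Nx, normal overload Nz and roll angle, integrated with a classic fourth-order Runge-Kutta step; the action table, overload values and 0.1 s step size are illustrative assumptions, not the patent's exact parameters.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

# Seven discrete actions encoded as (Nx, Nz, roll); the values are illustrative assumptions.
ACTIONS = {
    "level": (0.0, 1.0, 0.0),          # constant level flight
    "acc":   (2.0, 1.0, 0.0),          # accelerate
    "dec":   (-2.0, 1.0, 0.0),         # decelerate
    "left":  (0.0, 3.0, -np.pi / 3),   # left turn
    "right": (0.0, 3.0, +np.pi / 3),   # right turn
    "pull":  (0.0, 3.0, 0.0),          # pull up
    "dive":  (0.0, -1.0, 0.0),         # dive
}

def dynamics(state, action):
    """Point-mass kinematics: state = [v, theta, psi, x, y, h]."""
    v, theta, psi, x, y, h = state
    nx, nz, phi = ACTIONS[action]
    dv = G * (nx - np.sin(theta))                          # tangential overload drives speed
    dtheta = (G / v) * (nz * np.cos(phi) - np.cos(theta))  # normal overload and roll drive pitch
    dpsi = G * nz * np.sin(phi) / (v * np.cos(theta))      # roll component drives heading
    dx = v * np.cos(theta) * np.cos(psi)
    dy = v * np.cos(theta) * np.sin(psi)
    dh = v * np.sin(theta)
    return np.array([dv, dtheta, dpsi, dx, dy, dh])

def rk4_step(state, action, dt=0.1):
    """Classic 4th-order Runge-Kutta update of the aircraft state."""
    k1 = dynamics(state, action)
    k2 = dynamics(state + 0.5 * dt * k1, action)
    k3 = dynamics(state + 0.5 * dt * k2, action)
    k4 = dynamics(state + dt * k3, action)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

For example, rk4_step(np.array([200.0, 0.0, 0.0, 0.0, 0.0, 3000.0]), "right") advances a level-flying UAV at 200 m/s through one right-turn step.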
step 1-2: constructing a missile model;
step 1-2-1: determining the missile performance parameters: the maximum off-axis launch angle, the maximum and minimum attack distances DMmax and DMmin, the maximum and minimum no-escape distances DMkmax and DMkmin, and the cone angle (the angle symbols are given only as images in the original).
The missile attack area is assumed to be static, and only the maximum attack distance, the maximum no-escape distance and the cone angle are considered; the attack area is denoted Areaack and satisfies condition (2), which is given only as an image in the original, where dt denotes the distance from the red machine to the blue machine, qt the line-of-sight angle from the red machine to the blue machine, and pos(target) the position of the blue machine; the no-escape area is denoted Areadead and satisfies condition (3), likewise given only as an image;
when the blue machine enters an attack area of the red machine, the blue machine is destroyed with a certain probability;
step 1-2-2: dividing an attack area;
when the corresponding angle condition (given only as an image in the original) holds and DMkmin < d < DMkmax, the blue machine is in zone 5 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMkmin, the blue machine is in zone 1 of the attack area;
when the corresponding angle condition holds and DMkmax < d < DMmax, the blue machine is in zone 4 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMmax, the blue machine is in zone 2 or zone 3 of the attack area; which of the two it is in is judged from the relative position of the red machine and the blue machine, given by equation (4) (reproduced only as an image in the original),
where Δx, Δy, Δz denote the distance differences between the red machine and the blue machine along the x-axis, y-axis and z-axis respectively, xb, yb, zb denote the position of the blue machine along the x-axis, y-axis and z-axis, and xr, yr, zr the position of the red machine along the x-axis, y-axis and z-axis;
if the test derived from equation (4) holds, the blue machine lies to the right of the red machine, i.e. in zone 3 of the attack area; otherwise it lies to the left of the red machine, i.e. in zone 2 of the attack area;
in summary, the attack area is divided as given by equation (5), which is reproduced only as an image in the original;
step 1-2-3: when the blue machine is in zone 5, it is in the no-escape area of the red machine and the missile hit probability is maximum; when the blue machine is in the other zones, the hit probability is a function with values between 0 and 1, related to the distance, the departure angle, the deviation angle and the flight direction; when the hit probability is less than 0.3, the missile is considered unable to hit and cannot be launched at that moment; the specific kill probability is given by equation (6), reproduced only as an image in the original, where pa denotes the kill probability related to the blue machine's maneuver, pd the kill probability related to the distance, and position(aircraft_aim) the attack-area zone in which the blue machine is located;
step 1-2-4: the specific steps for launching the missile are as follows:
step 1-2-4-1: inputting the distance d, the departure angle AA, the deviation angle ATA, the position and the speed of the red machine and the blue machine;
step 1-2-4-2: constructing a missile model, and setting the number of missiles;
step 1-2-4-3: judging whether the blue machine is in the attack area of the red machine according to the distance d and the deviation angle ATA;
step 1-2-4-4: when the blue machine is in the attack area of the red machine, judging which part of the attack area the blue machine is in;
step 1-2-4-5: judging the speed direction of the blue machine relative to the red machine;
step 1-2-4-6: calculating the hit rate of the missile at the moment;
step 1-2-4-7: judging whether the missile hits (a sketch of this launch logic is given below);
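Equations (2) to (6) and the zone geometry of fig. 1 are available only as images, so the Python sketch below is a hedged illustration of steps 1-2-4-1 to 1-2-4-7: classify the attack-area zone, combine a distance term pd with a maneuver/angle term pa, and apply the 0.3 launch threshold of step 1-2-3. The distance and angle thresholds and the probability curves are assumptions, not the patent's values.

```python
import math

# Illustrative missile parameters (assumptions, not the patent's values).
DM_MIN, DM_MAX = 1_000.0, 12_000.0      # min/max attack distance (m)
DMK_MIN, DMK_MAX = 2_000.0, 6_000.0     # min/max no-escape distance (m)
CONE_ANGLE = math.radians(30.0)         # attack-cone half angle
MAX_OFF_AXIS = math.radians(60.0)       # max off-axis launch angle

def attack_zone(d, ata, bearing_sign):
    """Return the attack-area zone (1-5) the blue machine is in, or 0 if outside.
    Zones 1/5/4 lie inside the cone at short/medium/long range; zones 2/3 lie
    left/right of the cone but inside the maximum off-axis angle (assumed layout)."""
    if not (DM_MIN < d < DM_MAX) or abs(ata) > MAX_OFF_AXIS:
        return 0
    if abs(ata) <= CONE_ANGLE:
        if DMK_MIN < d < DMK_MAX:
            return 5                     # no-escape zone
        return 1 if d <= DMK_MIN else 4
    return 3 if bearing_sign > 0 else 2  # blue machine right / left of the red machine

def hit_probability(d, aa, ata, bearing_sign):
    """Combine a distance term pd and a maneuver/angle term pa into a kill probability."""
    zone = attack_zone(d, ata, bearing_sign)
    if zone == 0:
        return 0.0
    if zone == 5:
        return 1.0                       # maximum probability inside the no-escape zone
    pd = max(0.0, 1.0 - abs(d - DMK_MIN) / (DM_MAX - DM_MIN))   # assumed distance falloff
    pa = max(0.0, 1.0 - (abs(aa) + abs(ata)) / math.pi)         # assumed angle falloff
    return pd * pa

def try_launch(d, aa, ata, bearing_sign, missiles_left):
    """Launch only if a missile remains and the hit probability reaches 0.3 (step 1-2-3)."""
    p = hit_probability(d, aa, ata, bearing_sign)
    if missiles_left <= 0 or p < 0.3:
        return False, p
    return True, p
```

For instance, a blue machine 4 km away inside the attack cone falls into the no-escape zone 5, and try_launch fires immediately if a missile remains.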
step 1-3: a neural network normalization model;
step 1-3-1: inputting state variables of the unmanned aerial vehicle;
step 1-3-2: normalizing the velocity (the normalization formula is given only as an image in the original);
step 1-3-3: normalizing the angles (formula given only as an image);
step 1-3-4: normalizing the positions (formula given only as an image);
step 1-3-5: taking the difference of the normalized positions of the red machine and the blue machine;
step 1-3-6: outputting the normalized data (see the sketch below);
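The normalization formulas of steps 1-3-2 to 1-3-4 are given only as images; the sketch below assumes simple range scaling of speed, angles and positions (the scale constants are placeholders rather than the patent's values) and forms the red-blue position difference of step 1-3-5.

```python
import numpy as np

# Assumed scale constants for normalization (placeholders).
V_MAX = 400.0         # m/s
POS_SCALE = 50_000.0  # m
H_SCALE = 10_000.0    # m

def normalize_state(state):
    """state = [v, pitch, yaw, x, y, h] mapped to values roughly in [-1, 1] for the networks."""
    v, pitch, yaw, x, y, h = state
    return np.array([
        v / V_MAX,       # normalized speed
        pitch / np.pi,   # normalized angles
        yaw / np.pi,
        x / POS_SCALE,   # normalized positions
        y / POS_SCALE,
        h / H_SCALE,
    ])

def relative_observation(red_state, blue_state):
    """Step 1-3-5: append the difference of the normalized red and blue positions to the red state."""
    red_n, blue_n = normalize_state(red_state), normalize_state(blue_state)
    return np.concatenate([red_n, red_n[3:] - blue_n[3:]])
```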
step 1-4: constructing a battlefield environment model;
step 1-5: situation judgment and target distribution model;
step 1-5-1: inputting the states of the red machine and the blue machine, including the speed, the pitch angle, the yaw angle and the triaxial position;
step 1-5-2: calculating the respective angle advantage from the pitch angle and the yaw angle (the angle-advantage formula is given only as an image in the original), where φt is the target entry angle and φf is the target azimuth angle;
step 1-5-3: calculating the respective distance advantage from the three-axis positions (formula given only as an image);
step 1-5-4: calculating the respective energy advantage from the velocity and the altitude in the three-axis position (formula given only as an image);
step 1-5-5: calculating the comprehensive advantage S = C1Sa + C2Sr + C3Eg by combining the angle, distance and energy advantages, where C1, C2 and C3 are weighting coefficients;
step 1-5-6: sorting the targets by comprehensive advantage to generate the target allocation matrix;
step 1-5-7: outputting the target allocation according to the target allocation matrix (see the sketch below).
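The advantage formulas of steps 1-5-2 to 1-5-4 appear only as images, so the sketch below treats the angle, distance and energy advantages as already-computed inputs and illustrates the composite advantage of step 1-5-5 together with one possible greedy target-allocation matrix for steps 1-5-6 and 1-5-7; the weighting values C1 to C3 and the greedy assignment rule are assumptions.

```python
import numpy as np

C1, C2, C3 = 0.4, 0.3, 0.3  # weighting coefficients (assumed values)

def composite_advantage(angle_adv, dist_adv, energy_adv):
    """S = C1*Sa + C2*Sr + C3*Eg (step 1-5-5)."""
    return C1 * angle_adv + C2 * dist_adv + C3 * energy_adv

def build_allocation(advantage_matrix):
    """advantage_matrix[i][j]: composite advantage of red machine i against blue machine j.
    Returns a 0/1 allocation matrix assigning each red machine to its best free target."""
    adv = np.asarray(advantage_matrix, dtype=float)
    n_red, _ = adv.shape
    allocation = np.zeros_like(adv, dtype=int)
    taken = set()
    # Greedy: red machines choose in decreasing order of their best advantage.
    for i in sorted(range(n_red), key=lambda r: -adv[r].max()):
        order = np.argsort(-adv[i])
        target = next((j for j in order if j not in taken), order[0])
        allocation[i, target] = 1
        taken.add(int(target))
    return allocation
```

For a 2V2 engagement, build_allocation([[0.7, 0.4], [0.2, 0.6]]) assigns red machine 1 to blue machine 1 and red machine 2 to blue machine 2.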
Further, in the step 2, an MAPPO algorithm is adopted as a multi-agent reinforcement learning algorithm, a centralized training and distributed execution framework is combined with a PPO algorithm to form the MAPPO algorithm, and a corresponding reward function is designed on the basis of a specific air combat environment, and the specific steps are as follows:
the return function consists of four sub-return functions, namely a height return function, a speed return function, an angle return function and a distance return function; the method comprises the following specific steps:
step 2-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr];
step 2-2: calculating the height difference Δh = hr - hb and the height-difference reward r_h, where hr and hb are the altitudes of the red machine and the blue machine in meters (the piecewise definition of r_h is given only as an image in the original);
step 2-3: calculating the red machine's altitude safety reward r_h_self (definition given only as an image);
step 2-4: calculating the total altitude reward Rh = r_h + r_h_self;
step 2-5: calculating the speed difference Δv = vr - vb and the speed-difference reward r_v, where vr and vb are the speeds of the red machine and the blue machine in meters per second (definition given only as an image);
step 2-6: calculating the red machine's speed safety reward r_v_self (definition given only as an image);
step 2-7: calculating the total speed reward Rv = r_v + r_v_self;
step 2-8: calculating the departure angle AA and the deviation angle ATA between the red machine and the blue machine;
step 2-9: calculating the angle reward Ra (formula given only as an image);
step 2-10: calculating the distance between the red machine and the blue machine; when the deviation angle ATA is less than 60 degrees, the distance reward Rd is obtained (formula given only as an image);
step 2-11: setting different weights and summing the rewards to obtain the continuous reward Rc = a1·Ra + a2·Rh + a3·Rv + a4·Rd, where a1, a2, a3 and a4 are the respective weights (a sketch of this reward computation is given below).
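The piecewise definitions of r_h, r_h_self, r_v, r_v_self, Ra and Rd are reproduced only as images in the original, so the sketch below shows one plausible way to assemble the continuous reward Rc of step 2-11; the saturating shapes, thresholds and the weights a1 to a4 are illustrative assumptions.

```python
import math

def height_reward(h_r, h_b):
    """Rh = r_h + r_h_self: favour a height advantage and penalise unsafe low altitude."""
    r_h = math.tanh((h_r - h_b) / 1000.0)        # assumed height-difference shaping
    r_h_self = -1.0 if h_r < 1000.0 else 0.0     # assumed low-altitude safety penalty
    return r_h + r_h_self

def speed_reward(v_r, v_b):
    """Rv = r_v + r_v_self: favour a speed advantage and penalise near-stall speed."""
    r_v = math.tanh((v_r - v_b) / 50.0)
    r_v_self = -1.0 if v_r < 80.0 else 0.0
    return r_v + r_v_self

def angle_reward(aa, ata):
    """Ra: best when both the departure angle AA and the deviation angle ATA are small."""
    return 1.0 - (abs(aa) + abs(ata)) / (2.0 * math.pi)

def distance_reward(d, ata):
    """Rd: only granted when ATA is below 60 degrees (step 2-10); decays with distance."""
    if abs(ata) >= math.radians(60.0):
        return 0.0
    return math.exp(-d / 5000.0)

def continuous_reward(red, blue, aa, ata, d, a=(0.4, 0.2, 0.2, 0.2)):
    """Rc = a1*Ra + a2*Rh + a3*Rv + a4*Rd (step 2-11); red/blue are (speed, altitude) tuples."""
    a1, a2, a3, a4 = a
    return (a1 * angle_reward(aa, ata)
            + a2 * height_reward(red[1], blue[1])
            + a3 * speed_reward(red[0], blue[0])
            + a4 * distance_reward(d, ata))
```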
Further, in step 3, the unmanned aerial vehicle model constructed in step 1 and the multi-agent reinforcement learning algorithm in step 2 are combined to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning, which is specifically as follows:
step 3-1: the multi-agent reinforcement learning algorithm consists of a strategy network and a value network, wherein the value network is responsible for evaluating the action selected by the strategy network so as to guide the updating of the strategy network; the input of the value network is the speed, the pitch angle, the yaw angle, the position in the x direction, the position in the y direction, the height and the selected action of the unmanned aerial vehicle, the friend aircraft and the enemy aircraft at the last moment; the input of the strategy network is the speed, the pitch angle, the yaw angle, the position in the x direction, the position in the y direction and the height of the unmanned aerial vehicle, and the output of the strategy network is selected action;
step 3-2: first, initial actions are selected according to the initial parameters of the policy networks of the red machine and the blue machine, and these actions are executed in the battlefield environment model to obtain a new state; the rewards are then calculated, and the states, rewards and actions of the red machine and the blue machine are normalized, packed and stored in the experience replay library of the multi-agent reinforcement learning algorithm. After enough data have been stored, the value networks of the red machine and the blue machine sample the experience replay library, the states of the red machine and the blue machine are combined, and the policy networks update their policies; each UAV then feeds its own state into its policy network, the policy network selects the UAV's action from that state, the UAV executes the action to produce new data, and the cycle repeats.
The invention has the following beneficial effects:
(1) The method effectively addresses the problems that traditional multi-agent cooperative air combat is computationally expensive and struggles to respond in real time to a rapidly changing battlefield situation.
(2) The multi-aircraft cooperative air combat decision algorithm based on multi-agent reinforcement learning formed by the method effectively handles the difficulties of cooperation among heterogeneous agents, real-time confrontation and action persistence, a huge search space, and multiple complex tasks in multi-agent decision-making.
(3) The algorithm comprises a battlefield environment construction module, a normalization module, a reinforcement learning module, an aircraft module, a missile module, a reward module and a target allocation module, and a decision model can be established from the battlefield environment and situation information.
(4) The invention can realize multi-aircraft air combat decision output; the reinforcement learning algorithm can be trained independently for different scenarios, and the decision algorithm offers well-defined input/output interfaces and rapid modular porting.
Drawings
Fig. 1 is a schematic cross-sectional view of an attack area of an unmanned aerial vehicle according to the present invention.
Fig. 2 is a flowchart of a battlefield environment module of the present invention.
FIG. 3 is a multi-agent multi-aircraft air combat decision algorithm design framework according to the present invention.
FIG. 4 is a diagram showing the relationship between modules in the method of the present invention.
Fig. 5 is the initial engagement situation of the 2V2 air combat in the embodiment of the present invention.
FIG. 6 is a diagram of the speed curves of both sides of the air combat in the embodiment of the invention.
FIG. 7 is a diagram of the altitude curves of both sides of the air combat in the embodiment of the invention.
FIG. 8 is a diagram of the situation curves of both sides of the air combat in the embodiment of the invention.
FIG. 9 is a diagram of the flight trajectories of both sides of the air combat in the embodiment of the invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
A multi-airplane air combat decision method based on multi-agent reinforcement learning comprises the following steps:
step 1: the unmanned aerial vehicles of the two parties of the battle are assumed to be the unmanned aerial vehicle of the same party and the unmanned aerial vehicle of the opposite party, the unmanned aerial vehicle of the same party is the red machine, and the unmanned aerial vehicle of the opposite party is the blue machine; establishing a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle;
step 2: adopting an MAPPO algorithm as a multi-agent reinforcement learning algorithm, and designing a corresponding return function on the basis of a specific air combat environment;
and step 3: and (3) combining the unmanned aerial vehicle model constructed in the step (1) with the multi-agent reinforcement learning algorithm in the step (2) to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning.
Further, in the step 1, an airplane model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle are established, and the specific steps are as follows:
step 1-1: establishing an airplane model of the unmanned aerial vehicle;
firstly, the six-degree-of-freedom model of the unmanned aerial vehicle is constructed from the three-dimensional kinematic equations in the ground inertial coordinate system; then the seven actions of the aircraft are constructed from the tangential overload, normal overload and roll angle of the UAV, and whenever the aircraft executes one of these actions, its state after the action is updated by Runge-Kutta integration;
step 1-1-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr], i.e. the speed Vr, pitch angle γr, roll angle φr and three-axis position (xr, yr, hr) of the unmanned aerial vehicle;
step 1-1-2: constructing the six-degree-of-freedom model and the seven actions of the unmanned aerial vehicle (equation (1), the UAV kinematic model, is given only as an image in the original);
step 1-1-3: inputting the action to be executed by the unmanned aerial vehicle;
step 1-1-4: computing the state of the aircraft after it executes the action by Runge-Kutta integration;
step 1-1-5: updating the state of the airplane;
step 1-2: constructing a missile model;
step 1-2-1: determining the missile performance parameters: the maximum off-axis launch angle, the maximum and minimum attack distances DMmax and DMmin, the maximum and minimum no-escape distances DMkmax and DMkmin, and the cone angle (the angle symbols are given only as images in the original).
In order to simplify the problem, the missile attack area is assumed to be static, and only the maximum attack distance, the maximum no-escape distance and the cone angle are considered; the attack area is denoted Areaack and satisfies condition (2), which is given only as an image in the original, where qt denotes the line-of-sight angle from the red machine to the blue machine and pos(target) the position of the blue machine; the no-escape area is denoted Areadead and satisfies condition (3), likewise given only as an image;
when the blue machine enters an attack area of the red machine, the blue machine is destroyed with a certain probability;
to better determine this probability, the attack zone is further analyzed as shown in FIG. 1.
Step 1-2-2: dividing an attack area;
when the corresponding angle condition (given only as an image in the original) holds and DMkmin < d < DMkmax, the blue machine is in zone 5 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMkmin, the blue machine is in zone 1 of the attack area;
when the corresponding angle condition holds and DMkmax < d < DMmax, the blue machine is in zone 4 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMmax, the blue machine is in zone 2 or zone 3 of the attack area; which of the two it is in is judged from the relative position of the red machine and the blue machine, given by equation (4) (reproduced only as an image in the original);
if the test derived from equation (4) holds, the blue machine lies to the right of the red machine, i.e. in zone 3 of the attack area; otherwise it lies to the left of the red machine, i.e. in zone 2 of the attack area;
in summary, the attack area is divided as given by equation (5), which is reproduced only as an image in the original;
step 1-2-3: when the blue machine is in zone 5, it is in the no-escape area of the red machine and the missile hit probability is maximum; when the blue machine is in the other zones, the hit probability is a function with values between 0 and 1, related to the distance, the departure angle, the deviation angle and the flight direction; when the hit probability is less than 0.3, the missile is considered unable to hit and cannot be launched at that moment; the specific kill probability is given by equation (6), reproduced only as an image in the original;
step 1-2-4: the specific steps for launching the missile are as follows:
step 1-2-4-1: inputting the distance d, the departure angle AA, the deviation angle ATA, the position and the speed of the red machine and the blue machine;
step 1-2-4-2: constructing a missile model, and setting the number of missiles;
step 1-2-4-3: judging whether the blue machine is in the attack area of the red machine according to the distance d and the deviation angle ATA;
step 1-2-4-4: when the blue machine is in the attack area of the red machine, judging which part of the attack area the blue machine is in;
step 1-2-4-5: judging the speed direction of the blue machine relative to the red machine;
step 1-2-4-6: calculating the hit rate of the missile at the moment;
step 1-2-4-7: judging whether the missile is hit;
step 1-3: a neural network normalization model;
normalization can ensure that when the input of each layer of the neural network keeps the same distribution gradient and is reduced, the model is converged to a correct place, and the gradient updating direction is deviated under different dimensions. And normalization to a reasonable range favors model generalization.
Step 1-3-1: inputting state variables of the unmanned aerial vehicle;
step 1-3-2: normalizing the velocity (the normalization formula is given only as an image in the original);
step 1-3-3: normalizing the angles (formula given only as an image);
step 1-3-4: normalizing the positions (formula given only as an image);
step 1-3-5: taking the difference of the normalized positions of the red machine and the blue machine;
step 1-3-6: outputting the normalized data;
step 1-4: constructing a battlefield environment model;
step 1-5: situation judgment and target distribution model;
and the situation judgment and target distribution model constructs a comprehensive advantage function by analyzing distance threat, angle advantage and energy advantage so as to construct an air war threat degree model. And then, calculating a target distribution matrix according to a target distribution matrix criterion after data fusion according to all information obtained by the long machine. And then selecting a tactical caution degree or risk degree coefficient according to the target distribution matrix, and representing the balance of the pilot on attacking and avoiding the danger problem.
Step 1-5-1: inputting the states of the red machine and the blue machine, including the speed, the pitch angle, the yaw angle and the triaxial position;
step 1-5-2: calculating the respective angle advantage from the pitch angle and the yaw angle (the angle-advantage formula is given only as an image in the original), where φt is the target entry angle and φf is the target azimuth angle;
step 1-5-3: calculating the respective distance advantage from the three-axis positions (formula given only as an image);
step 1-5-4: calculating the respective energy advantage from the velocity and the altitude in the three-axis position (formula given only as an image);
step 1-5-5: calculating the comprehensive advantage by combining the angle, distance and energy advantages;
step 1-5-6: sorting the targets by comprehensive advantage to generate the target allocation matrix;
step 1-5-7: outputting the target allocation according to the target allocation matrix.
Further, in the step 2, a MAPPO algorithm is adopted as a multi-agent reinforcement learning algorithm, and a corresponding reward function is designed on the basis of a specific air combat environment, and the specific steps are as follows:
MAPPO algorithm:
because the state, the action space of multimachine air battle scene are huge, and the space that single unmanned aerial vehicle can explore is limited, and the sample availability factor is not high. In addition, as a typical multi-machine system, in the problem of multi-machine cooperative air combat, the strategy of a single unmanned aerial vehicle is not only dependent on the feedback of the strategy and environment of the single unmanned aerial vehicle, but also influenced by the actions of other unmanned aerial vehicles and the cooperative relationship with the unmanned aerial vehicles, so that an experience sharing mechanism is designed, and the experience sharing mechanism comprises two aspects of sharing a sample experience base and sharing network parameters. The shared sample experience library is obtained by storing global environment situation information, action decision information of the unmanned aerial vehicle, environment situation information after the unmanned aerial vehicle executes a new action and an award value fed back by the environment aiming at the action into an experience playback library according to a quadruple form, and information of each unmanned aerial vehicle is stored into the same experience playback library according to the form. When network parameters are updated, samples are extracted from the experience playback library, loss values of the samples generated by different unmanned aerial vehicles under the Actor network and the Critic network are calculated respectively, then updating gradients of the two neural networks are obtained, gradient values calculated by the samples of the different unmanned aerial vehicles are weighted, and a global gradient formula can be obtained. As shown in fig. 3, the whole framework of the multi-machine collaborative air combat decision framework based on deep reinforcement learning includes seven modules, which are a battlefield environment construction module, a normalization module, a reinforcement learning module, an airplane module, a missile module, a reward module and a target distribution module. The input quantity of the framework is real-time battlefield situation information, and the output quantity is an action decision scheme of the controlled entity. After the original battlefield situation information is input into the framework, the original battlefield situation information is firstly processed by the situation processing module, and after data is cleaned, screened, extracted, packaged, normalized and represented in a format, the data is transmitted to the deep reinforcement learning module; the deep reinforcement learning module receives situation information data and outputs action decisions; the strategy network receives the action decision output of the deep reinforcement learning module, decodes and packages the action decision output into an operation instruction acceptable for the platform environment, and controls the corresponding unit; meanwhile, the new environment situation and the reward value obtained by executing the new action are packaged and stored in the experience storage module together with the environment situation information and the action decision scheme of the decision-making in the step, and when the network is to be trained, the sample data are extracted from the experience base and are transmitted to the neural network training module for training.
The return function consists of four sub-return functions: a height return, a speed return, an angle return and a distance return. These four returns reflect the potential-energy advantage, the kinetic-energy advantage and the hit probability within the attack area during air combat, and together summarize the whole air combat environment. The reward function reflects the red machine's positioning relative to the opponent at the current moment and guides the aircraft to fly toward higher reward values, i.e. toward a more favourable situation. The specific steps are as follows:
step 2-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr];
step 2-2: calculating the height difference Δh = hr - hb and the height-difference reward r_h (the piecewise definition of r_h is given only as an image in the original);
step 2-3: calculating the red machine's altitude safety reward r_h_self (definition given only as an image);
step 2-4: calculating the total altitude reward Rh = r_h + r_h_self;
step 2-5: calculating the speed difference Δv = vr - vb and the speed-difference reward r_v (definition given only as an image);
step 2-6: calculating the red machine's speed safety reward r_v_self (definition given only as an image);
step 2-7: calculating the total speed reward Rv = r_v + r_v_self;
step 2-8: calculating the departure angle AA and the deviation angle ATA between the red machine and the blue machine;
step 2-9: calculating the angle reward Ra (formula given only as an image);
step 2-10: calculating the distance between the red machine and the blue machine; when the deviation angle ATA is less than 60 degrees, the distance reward Rd is obtained (formula given only as an image);
step 2-11: setting different weights and summing the rewards to obtain the continuous reward Rc = a1·Ra + a2·Rh + a3·Rv + a4·Rd, where a1, a2, a3 and a4 are the respective weights.
Further, in step 3, the unmanned aerial vehicle model constructed in step 1 and the multi-agent reinforcement learning algorithm in step 2 are combined to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning, which is specifically as follows:
step 3-1: the relation between the model constructed in the step 1 and the MAPPO algorithm and the designed reporting function in the step 2 is shown in the attached figure 4, the multi-agent reinforcement learning algorithm is composed of a strategy network and a value network, and the value network is responsible for evaluating the action selected by the strategy network so as to guide the updating of the strategy network; the input of the value network is the speed, the pitch angle, the yaw angle, the position in the x direction, the position in the y direction, the height and the selected action of the unmanned aerial vehicle, the friend aircraft and the enemy aircraft at the last moment; the input of the strategy network is the speed, the pitch angle, the yaw angle, the position in the x direction, the position in the y direction and the height of the unmanned aerial vehicle, and the output of the strategy network is selected action;
step 3-2: first, initial actions are selected according to the initial parameters of the policy networks of the red machine and the blue machine, and these actions are executed in the battlefield environment model to obtain a new state; the rewards are then calculated, and the states, rewards and actions of the red machine and the blue machine are normalized, packed and stored in the experience replay library of the multi-agent reinforcement learning algorithm. After enough data have been stored, the value networks of the red machine and the blue machine sample the experience replay library, the states of the red machine and the blue machine are combined, and the policy networks update their policies; each UAV then feeds its own state into its policy network, the policy network selects the UAV's action from that state, the UAV executes the action to produce new data, and the cycle repeats.
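The interaction loop of step 3-2 can be laid out as the sketch below, reusing the shared buffer idea sketched earlier; env, the select_action/evaluate/update interfaces and the batch size are hypothetical placeholders introduced for illustration, not names taken from the patent.

```python
def train(env, policies, values, buffer, n_episodes=1000, batch_size=1024):
    """Interaction loop of step 3-2 for the red and blue UAVs (hypothetical interfaces).
    policies[i] maps UAV i's own normalized state to one of the seven actions;
    values[i] scores the combined states/actions to guide the policy update."""
    for _ in range(n_episodes):
        states = env.reset()                           # initial red/blue states
        done = False
        while not done:
            # Each UAV selects an action from its own (local) observation.
            actions = [pi.select_action(s) for pi, s in zip(policies, states)]
            next_states, rewards, done = env.step(actions)
            # Normalized states, actions and rewards are packed into the shared buffer.
            for i, (s, a, ns, r) in enumerate(zip(states, actions, next_states, rewards)):
                buffer.push(i, s, a, ns, r)
            states = next_states
        if len(buffer.buffer) >= batch_size:
            batch = buffer.sample(batch_size)
            # Centralized value networks evaluate the sampled joint data;
            # the policy networks are then updated from those evaluations.
            for pi, v in zip(policies, values):
                advantages = v.evaluate(batch)
                pi.update(batch, advantages)
                v.update(batch)
```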
The specific embodiment is as follows:
the situation of double-aircraft in wartime is shown in fig. 5, four airplanes are on the same plane, a red aircraft 1 and a red aircraft 2 are respectively positioned right in front of a blue aircraft 1 and a blue aircraft 2, the blue aircraft 1 and the blue aircraft 2 have a tendency to be close to a combined attack area of the red aircraft 1 and the red aircraft 2, and the red aircraft 1 and the red aircraft 2 also have a tendency to be close to a combined attack area of the blue aircraft 1 and the blue aircraft 2. So that the red machine 1 and the red machine 2 are in the same potential as the blue machine 1 and the blue machine 2.
After the training was completed, the number of wins in the red and blue after 1000 trials is shown in table 1. It can be found that the winning rate of the red square is 51.8 percent and the winning rate of the blue square is 48.2 percent.
TABLE 1 number of wins in Red and blue
Situation(s) Number of times
Red machine 1 hits basket machine 1 226
Red machine 1 hits basket machine 2 129
Red machine 2 hits basket machine 1 0
Red machine 2 hits basket machine 2 163
Blue machine 1 hitting red machine 1 330
Blue machine 1 hitting red machine 2 0
Blue machine 2 hitting red machine 1 152
Blue machine 2 hits red machine 2 0
The analysis was performed by using a red machine 1 and a middle blue machine 1 as an example.
The action selected by red machine 1 is [ right, right, right, right, acc, acc, acc, acc, acc, acc).
The action selected by Red machine 2 is [ right, right, acc, right, acc, acc, acc, acc, acc, acc ].
The action selected by the blue machine 1 is [ right, right, right, right, acc, acc, acc, acc, acc, acc ].
The action selected by blue machine 2 is [ right, right, right, right, acc, acc, acc, acc, acc, acc ] is.
The simulation result graphs are shown in fig. 6-8, wherein the solid line represents red machine 1, the dotted line represents red machine 2, the dotted line represents blue machine 1, and the dotted curve represents blue machine 2. As shown in fig. 6, the speed of blue machine 2 is highest with the greatest speed advantage, and the speed of red machine 1 and red machine 2 is far less than that of blue machine 1 and blue machine 2. As can be seen from fig. 7, the blue aircraft 1 and the blue aircraft 2 are not as superior in height to the red aircraft 1 and the red aircraft 2, and as can be seen from fig. 8, the red aircraft 1, the red aircraft 2, the blue aircraft 1 and the blue aircraft 2 are flying safely, so that the initial situations thereof are all positive, as the air war carries out the pinching of the red aircraft 1 and the blue aircraft 2 by the blue aircraft 1 and the blue aircraft 2, the situations of the blue aircraft 1 and the blue aircraft 2 gradually rise, the situation of the red aircraft gradually worsens, then the two red aircraft also start the pinching of the blue aircraft 2, the situation of the blue aircraft falls, the situation of the red aircraft rises, finally, the blue aircraft finishes the pinching of the red aircraft 1 first, and the blue aircraft 2 launches a missile, successfully hits the situations of the red aircraft 1 and the blue aircraft 2, and masters the battlefield initiative.
Fig. 9 is a trajectory diagram of four drones.
All the simulation results together demonstrate the effectiveness of the multi-aircraft cooperative air combat decision algorithm based on multi-agent reinforcement learning designed by the invention. The algorithm effectively addresses the problems that traditional multi-agent cooperative air combat is computationally expensive and struggles to respond in real time to a rapidly changing battlefield situation, and it also handles the difficulties of cooperation among heterogeneous agents, real-time confrontation and action persistence, a huge search space and multiple complex tasks in multi-agent decision-making. A decision model can be established from the battlefield environment and situation information; multi-aircraft air combat decision output can be realized; the reinforcement learning algorithm can be trained independently for different scenarios; and the decision algorithm offers well-defined input/output interfaces and rapid modular porting.

Claims (4)

1. A multi-airplane air combat decision method based on multi-agent reinforcement learning is characterized by comprising the following steps:
step 1: the unmanned aerial vehicles of the two parties of the battle are assumed to be the unmanned aerial vehicle of the same party and the unmanned aerial vehicle of the opposite party, the unmanned aerial vehicle of the same party is the red machine, and the unmanned aerial vehicle of the opposite party is the blue machine; establishing a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle;
step 2: adopting an MAPPO algorithm as a multi-agent reinforcement learning algorithm, and designing a corresponding return function on the basis of a specific air combat environment;
and step 3: and (3) combining the unmanned aerial vehicle model constructed in the step (1) with the multi-agent reinforcement learning algorithm in the step (2) to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning.
2. The multi-aircraft air combat decision method based on multi-agent reinforcement learning as claimed in claim 1, wherein in step 1, an unmanned aerial vehicle model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment and target distribution model are established, and the specific steps are as follows:
step 1-1: establishing an airplane model of the unmanned aerial vehicle;
step 1-1-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr], i.e. the speed Vr, pitch angle γr, roll angle φr and three-axis position (xr, yr, hr) of the unmanned aerial vehicle;
Step 1-1-2: constructing a six-degree-of-freedom model and seven actions of the unmanned aerial vehicle; the actions are coded by selecting the tangential overload, normal overload and roll angle of the unmanned aerial vehicle, namely in the formula (1)
Figure FDA0003223429920000013
Coming watchDisplaying actions taken at each moment in the simulation, wherein the actions comprise seven actions of constant level flight, acceleration, deceleration, left turning, right turning, upward pulling and downward diving after being coded;
Figure FDA0003223429920000011
where v represents the speed of the drone, NxRepresenting tangential overload of the drone, theta representing pitch angle of the drone, psi representing yaw angle of the drone, NzIndicating a normal overload of the drone,
Figure FDA0003223429920000012
the roll angle of the unmanned aerial vehicle is represented, t represents the updating time of the state of the unmanned aerial vehicle, and g represents the gravity acceleration;
step 1-1-3: inputting the action to be executed by the unmanned aerial vehicle;
step 1-1-4: resolving the state of the airplane after the airplane executes the action through the Longge Kutta;
step 1-1-5: updating the state of the airplane;
step 1-2: constructing a missile model;
step 1-2-1: determining the missile performance parameters: the maximum off-axis launch angle, the maximum and minimum attack distances DMmax and DMmin, the maximum and minimum no-escape distances DMkmax and DMkmin, and the cone angle (the angle symbols are given only as images in the original).
The missile attack area is assumed to be static, and only the maximum attack distance, the maximum no-escape distance and the cone angle are considered; the attack area is denoted Areaack and satisfies condition (2), which is given only as an image in the original, where dt denotes the distance from the red machine to the blue machine, qt the line-of-sight angle from the red machine to the blue machine, and pos(target) the position of the blue machine; the no-escape area is denoted Areadead and satisfies condition (3), likewise given only as an image;
when the blue machine enters an attack area of the red machine, the blue machine is destroyed with a certain probability;
step 1-2-2: dividing an attack area;
when the corresponding angle condition (given only as an image in the original) holds and DMkmin < d < DMkmax, the blue machine is in zone 5 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMkmin, the blue machine is in zone 1 of the attack area;
when the corresponding angle condition holds and DMkmax < d < DMmax, the blue machine is in zone 4 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMmax, the blue machine is in zone 2 or zone 3 of the attack area; which of the two it is in is judged from the relative position of the red machine and the blue machine, given by equation (4) (reproduced only as an image in the original),
where Δx, Δy, Δz denote the distance differences between the red machine and the blue machine along the x-axis, y-axis and z-axis respectively, xb, yb, zb denote the position of the blue machine along the x-axis, y-axis and z-axis, and xr, yr, zr the position of the red machine along the x-axis, y-axis and z-axis;
if the test derived from equation (4) holds, the blue machine lies to the right of the red machine, i.e. in zone 3 of the attack area; otherwise it lies to the left of the red machine, i.e. in zone 2 of the attack area;
in summary, the attack area is divided as given by equation (5), which is reproduced only as an image in the original;
step 1-2-3: when the blue machine is in zone 5, it is in the no-escape area of the red machine and the missile hit probability is maximum; when the blue machine is in the other zones, the hit probability is a function with values between 0 and 1, related to the distance, the departure angle, the deviation angle and the flight direction; when the hit probability is less than 0.3, the missile is considered unable to hit and cannot be launched at that moment; the specific kill probability is given by equation (6), reproduced only as an image in the original, where pa denotes the kill probability related to the blue machine's maneuver, pd the kill probability related to the distance, and position(aircraft_aim) the attack-area zone in which the blue machine is located;
step 1-2-4: the specific steps for launching the missile are as follows:
step 1-2-4-1: inputting the distance d, the departure angle AA, the deviation angle ATA, the position and the speed of the red machine and the blue machine;
step 1-2-4-2: constructing a missile model, and setting the number of missiles;
step 1-2-4-3: judging whether the blue machine is in the attack area of the red machine according to the distance d and the deviation angle ATA;
step 1-2-4-4: when the blue machine is in the attack area of the red machine, judging which part of the attack area the blue machine is in;
step 1-2-4-5: judging the speed direction of the blue machine relative to the red machine;
step 1-2-4-6: calculating the hit rate of the missile at the moment;
step 1-2-4-7: judging whether the missile is hit;
step 1-3: a neural network normalization model;
step 1-3-1: inputting state variables of the unmanned aerial vehicle;
step 1-3-2: normalizing the velocity (the normalization formula is given only as an image in the original);
step 1-3-3: normalizing the angles (formula given only as an image);
step 1-3-4: normalizing the positions (formula given only as an image);
step 1-3-5: taking the difference of the normalized positions of the red machine and the blue machine;
step 1-3-6: outputting the normalized data;
step 1-4: constructing a battlefield environment model;
step 1-5: situation judgment and target distribution model;
step 1-5-1: inputting the states of the red machine and the blue machine, including the speed, the pitch angle, the yaw angle and the triaxial position;
step 1-5-2: calculating the respective angle advantage from the pitch angle and the yaw angle (the angle-advantage formula is given only as an image in the original), where φt is the target entry angle and φf is the target azimuth angle;
step 1-5-3: calculating the respective distance advantage from the three-axis positions (formula given only as an image);
step 1-5-4: calculating the respective energy advantage from the velocity and the altitude in the three-axis position (formula given only as an image);
step 1-5-5: calculating the comprehensive advantage S = C1Sa + C2Sr + C3Eg by combining the angle, distance and energy advantages, where C1, C2 and C3 are weighting coefficients;
step 1-5-6: sorting the targets by comprehensive advantage to generate the target allocation matrix;
step 1-5-7: outputting the target allocation according to the target allocation matrix.
3. The multi-aircraft air combat decision method based on multi-agent reinforcement learning as claimed in claim 2, wherein in the step 2, a MAPPO algorithm is adopted as the multi-agent reinforcement learning algorithm, a centralized training and distributed execution framework is combined with a PPO algorithm to form the MAPPO algorithm, and a corresponding reward function is designed on the basis of a specific air combat environment, and the specific steps are as follows:
the return function consists of four sub-return functions, namely a height return function, a speed return function, an angle return function and a distance return function; the method comprises the following specific steps:
step 2-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr];
step 2-2: calculating the height difference Δh = hr - hb and the height-difference reward r_h, where hr and hb are the altitudes of the red machine and the blue machine in meters (the piecewise definition of r_h is given only as an image in the original);
step 2-3: calculating the red machine's altitude safety reward r_h_self (definition given only as an image);
step 2-4: calculating the total altitude reward Rh = r_h + r_h_self;
step 2-5: calculating the speed difference Δv = vr - vb and the speed-difference reward r_v, where vr and vb are the speeds of the red machine and the blue machine in meters per second (definition given only as an image);
step 2-6: calculating the red machine's speed safety reward r_v_self (definition given only as an image);
step 2-7: calculating the total speed reward Rv = r_v + r_v_self;
step 2-8: calculating the departure angle AA and the deviation angle ATA between the red machine and the blue machine;
step 2-9: calculating the angle reward Ra (formula given only as an image);
step 2-10: calculating the distance between the red machine and the blue machine; when the deviation angle ATA is less than 60 degrees, the distance reward Rd is obtained (formula given only as an image);
step 2-11: setting different weights and summing the rewards to obtain the continuous reward Rc = a1·Ra + a2·Rh + a3·Rv + a4·Rd, where a1, a2, a3 and a4 are the respective weights.
4. The multi-aircraft air combat decision method based on multi-agent reinforcement learning as claimed in claim 3, wherein in step 3 the unmanned aerial vehicle model constructed in step 1 is combined with the multi-agent reinforcement learning algorithm of step 2 to generate the final multi-aircraft cooperative air combat decision method based on multi-agent reinforcement learning, specifically as follows:
step 3-1: the multi-agent reinforcement learning algorithm consists of a policy network and a value network; the value network evaluates the actions selected by the policy network and thereby guides the updating of the policy network. The inputs of the value network are the speed, pitch angle, yaw angle, x-position, y-position and altitude of the own unmanned aerial vehicle, the friendly aircraft and the enemy aircraft, together with the actions selected at the previous time step; the inputs of the policy network are the speed, pitch angle, yaw angle, x-position, y-position and altitude of the own unmanned aerial vehicle, and its output is the selected action;
step 3-2: first, the policy networks of the red machine and the blue machine select initial actions from their initial parameters, and these actions are executed in the battlefield environment model to obtain new states; the rewards are then calculated, and the states, rewards and actions of the red machine and the blue machine are normalized, packed and stored in the experience replay buffer of the multi-agent reinforcement learning algorithm. Once enough data have been stored, the value networks of the red machine and the blue machine sample from the replay buffer and evaluate the combined states of both sides, and the policy networks update their policies; each unmanned aerial vehicle then feeds its own state into its policy network, the policy network selects an action from that state, the unmanned aerial vehicle executes the action to generate new data, and the loop repeats (a minimal sketch of these networks and the training loop is given below).
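Steps 3-1 and 3-2 describe a MAPPO-style actor-critic arrangement: decentralized policy networks acting on each aircraft's own state, a centralized value network that sees the states of all aircraft plus the previously selected actions, and a rollout loop that fills an experience buffer before updating. The PyTorch sketch below illustrates only that structure; the layer sizes, the 6-dimensional per-aircraft state [V, θ, ψ, x, y, h], the discrete maneuver count, and the omission of the PPO clipping and advantage computation are all simplifying assumptions.

import torch
import torch.nn as nn

STATE_DIM = 6          # [V, theta, psi, x, y, h] of one aircraft (assumed)
N_AGENTS = 4           # e.g. 2 red + 2 blue aircraft (assumed)
N_ACTIONS = 7          # size of a discrete maneuver library (assumed)

class PolicyNet(nn.Module):
    """Decentralized actor: own state -> distribution over maneuvers (step 3-1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.Tanh(),
                                 nn.Linear(128, N_ACTIONS))
    def forward(self, own_state):
        return torch.distributions.Categorical(logits=self.net(own_state))

class ValueNet(nn.Module):
    """Centralized critic: states of all aircraft plus previous actions -> value."""
    def __init__(self):
        super().__init__()
        # joint state + previous actions (one scalar index per agent, a simplification)
        in_dim = N_AGENTS * STATE_DIM + N_AGENTS
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.Tanh(),
                                 nn.Linear(256, 1))
    def forward(self, joint_state, last_actions):
        return self.net(torch.cat([joint_state, last_actions], dim=-1))

# Minimal rollout step in the spirit of step 3-2 (environment and PPO update omitted).
policy, value = PolicyNet(), ValueNet()
own_state = torch.randn(STATE_DIM)                   # normalized state of one UAV
joint_state = torch.randn(N_AGENTS * STATE_DIM)      # red + blue states combined
last_actions = torch.zeros(N_AGENTS)                 # actions of the previous step

dist = policy(own_state)               # actor selects a maneuver from its own state
action = dist.sample()
v = value(joint_state, last_actions)   # critic evaluates the joint situation
# (states, actions, rewards) would be packed into the experience replay buffer,
# sampled in batches, and used to update both networks.
print(int(action), float(v))

A full implementation would additionally store log-probabilities and rewards in the buffer and apply the clipped PPO surrogate loss when updating the policy networks.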
CN202110964271.9A 2021-08-22 2021-08-22 Multi-agent reinforcement learning-based multi-machine air combat decision method Active CN113791634B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110964271.9A CN113791634B (en) 2021-08-22 2021-08-22 Multi-agent reinforcement learning-based multi-machine air combat decision method

Publications (2)

Publication Number Publication Date
CN113791634A true CN113791634A (en) 2021-12-14
CN113791634B CN113791634B (en) 2024-02-02

Family

ID=78876259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110964271.9A Active CN113791634B (en) 2021-08-22 2021-08-22 Multi-agent reinforcement learning-based multi-machine air combat decision method

Country Status (1)

Country Link
CN (1) CN113791634B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020000399A1 (en) * 2018-06-29 2020-01-02 东莞理工学院 Multi-agent deep reinforcement learning proxy method based on intelligent grid
WO2020024097A1 (en) * 2018-07-30 2020-02-06 东莞理工学院 Deep reinforcement learning-based adaptive game algorithm
CN110404264A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) It is a kind of based on the virtually non-perfect information game strategy method for solving of more people, device, system and the storage medium self played a game
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CUI Wenhua; LI Dong; TANG Yubo; LIU Shaojun: "A decision-making method framework for wargaming based on deep reinforcement learning", National Defense Technology, no. 02 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114371729A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN114492058B (en) * 2022-02-07 2023-02-03 清华大学 Multi-agent confrontation scene oriented defense situation assessment method and device
CN114492059A (en) * 2022-02-07 2022-05-13 清华大学 Multi-agent confrontation scene situation assessment method and device based on field energy
CN114492058A (en) * 2022-02-07 2022-05-13 清华大学 Multi-agent confrontation scene oriented defense situation assessment method and device
CN114492059B (en) * 2022-02-07 2023-02-28 清华大学 Multi-agent confrontation scene situation assessment method and device based on field energy
CN114578838A (en) * 2022-03-01 2022-06-03 哈尔滨逐宇航天科技有限责任公司 Reinforced learning active disturbance rejection attitude control method suitable for aircrafts of various configurations
CN114578838B (en) * 2022-03-01 2022-09-16 哈尔滨逐宇航天科技有限责任公司 Reinforced learning active disturbance rejection attitude control method suitable for aircrafts of various configurations
CN115113642A (en) * 2022-06-02 2022-09-27 中国航空工业集团公司沈阳飞机设计研究所 Multi-unmanned aerial vehicle space-time key feature self-learning cooperative confrontation decision-making method
CN115047907A (en) * 2022-06-10 2022-09-13 中国电子科技集团公司第二十八研究所 Air isomorphic formation command method based on multi-agent PPO algorithm
CN115047907B (en) * 2022-06-10 2024-05-07 中国电子科技集团公司第二十八研究所 Air isomorphic formation command method based on multi-agent PPO algorithm
CN115484205A (en) * 2022-07-12 2022-12-16 北京邮电大学 Deterministic network routing and queue scheduling method and device
CN115484205B (en) * 2022-07-12 2023-12-01 北京邮电大学 Deterministic network routing and queue scheduling method and device
CN116679742A (en) * 2023-04-11 2023-09-01 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116679742B (en) * 2023-04-11 2024-04-02 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116187787A (en) * 2023-04-25 2023-05-30 中国人民解放军96901部队 Intelligent planning method for cross-domain allocation problem of combat resources
CN116187787B (en) * 2023-04-25 2023-09-12 中国人民解放军96901部队 Intelligent planning method for cross-domain allocation problem of combat resources
CN116880186A (en) * 2023-07-13 2023-10-13 四川大学 Data-driven self-adaptive dynamic programming air combat decision method
CN116880186B (en) * 2023-07-13 2024-04-16 四川大学 Data-driven self-adaptive dynamic programming air combat decision method
CN116909155B (en) * 2023-09-14 2023-11-24 中国人民解放军国防科技大学 Unmanned aerial vehicle autonomous maneuver decision-making method and device based on continuous reinforcement learning
CN116909155A (en) * 2023-09-14 2023-10-20 中国人民解放军国防科技大学 Unmanned aerial vehicle autonomous maneuver decision-making method and device based on continuous reinforcement learning
CN117313561A (en) * 2023-11-30 2023-12-29 中国科学院自动化研究所 Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN117313561B (en) * 2023-11-30 2024-02-13 中国科学院自动化研究所 Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method

Also Published As

Publication number Publication date
CN113791634B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
Yang et al. Maneuver decision of UAV in short-range air combat based on deep reinforcement learning
CN111880563B (en) Multi-unmanned aerial vehicle task decision method based on MADDPG
CN111240353B (en) Unmanned aerial vehicle collaborative air combat decision method based on genetic fuzzy tree
CN111859541B (en) PMADDPG multi-unmanned aerial vehicle task decision method based on transfer learning improvement
CN112198892B (en) Multi-unmanned aerial vehicle intelligent cooperative penetration countermeasure method
CN110928329A (en) Multi-aircraft track planning method based on deep Q learning algorithm
CN115291625A (en) Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN105678030B (en) Divide the air-combat tactics team emulation mode of shape based on expert system and tactics tactics
CN113893539B (en) Cooperative fighting method and device for intelligent agent
Zhang et al. Maneuver decision-making of deep learning for UCAV thorough azimuth angles
CN114063644B (en) Unmanned fighter plane air combat autonomous decision-making method based on pigeon flock reverse countermeasure learning
CN115951709A (en) Multi-unmanned aerial vehicle air combat strategy generation method based on TD3
Bae et al. Deep reinforcement learning-based air-to-air combat maneuver generation in a realistic environment
CN114638339A (en) Intelligent agent task allocation method based on deep reinforcement learning
CN113435598A (en) Knowledge-driven intelligent strategy deduction decision method
Wu et al. Visual range maneuver decision of unmanned combat aerial vehicle based on fuzzy reasoning
CN117313561B (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN113741186B (en) Double-aircraft air combat decision-making method based on near-end strategy optimization
Wang et al. Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction
Duan et al. Autonomous maneuver decision for unmanned aerial vehicle via improved pigeon-inspired optimization
Zhu et al. Mastering air combat game with deep reinforcement learning
Kong et al. Multi-ucav air combat in short-range maneuver strategy generation using reinforcement learning and curriculum learning
Wang et al. Research on naval air defense intelligent operations on deep reinforcement learning
CN114330093A (en) Multi-platform collaborative intelligent confrontation decision-making method for aviation soldiers based on DQN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant