CN113791634A - Multi-aircraft air combat decision method based on multi-agent reinforcement learning - Google Patents

Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Info

Publication number
CN113791634A
Authority
CN
China
Prior art keywords
machine
blue
unmanned aerial
aerial vehicle
red
Prior art date
Legal status
Granted
Application number
CN202110964271.9A
Other languages
Chinese (zh)
Other versions
CN113791634B (en)
Inventor
刘小雄
尹逸
苏玉展
秦斌
韦大正
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202110964271.9A priority Critical patent/CN113791634B/en
Publication of CN113791634A publication Critical patent/CN113791634A/en
Application granted granted Critical
Publication of CN113791634B publication Critical patent/CN113791634B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104 Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a multi-aircraft air combat decision method based on multi-agent reinforcement learning. The method first establishes a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, and a situation judgment and target distribution model for the unmanned aerial vehicle; it then adopts the MAPPO algorithm as the multi-agent reinforcement learning algorithm and designs a corresponding return function for the specific air combat environment; finally, the constructed UAV models are combined with the multi-agent reinforcement learning algorithm to produce the final multi-aircraft cooperative air combat decision method based on multi-agent reinforcement learning. The method effectively addresses the problems that traditional multi-agent cooperative air combat approaches are computationally expensive and struggle to respond in real time to a rapidly changing battlefield situation.

Description

Multi-aircraft air combat decision method based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of unmanned aerial vehicles, and particularly relates to a multi-aircraft air combat decision method.
Background
UAV decision-making aims to let the unmanned aerial vehicle exploit an advantage, or turn a disadvantage into an advantage, during an engagement; the key research problem is to design an efficient autonomous decision mechanism. Autonomous decision-making of an unmanned combat aircraft is the mechanism by which it makes a tactical plan or selects flight actions in real time according to the actual combat environment, and the quality of this decision mechanism reflects the intelligence level of the unmanned fighter in modern air combat. The inputs of the autonomous decision mechanism are the various parameters related to air combat, such as the aircraft's flight parameters, weapon parameters, three-dimensional scene parameters and the relative relationship between the two sides; the decision process is the information processing and computation carried out inside the system; and the output is the tactical plan or the specific flight actions produced by the decision.
At present, air combat tactical decision methods can basically be divided into two categories. The first is traditional rule-based, non-learning strategies, mainly including differential game methods, expert systems, influence diagram methods and matrix game algorithms; their decision strategies are generally fixed and cannot fully cover the complex, rapidly changing multi-aircraft air combat problem. The second is self-learning strategies based on intelligent algorithms, mainly including artificial immune systems, genetic algorithms, transfer learning, approximate dynamic programming and reinforcement learning, which optimize the structure and parameters of the decision model through their own experience. Self-learning strategies are highly adaptive and can cope with an air battlefield whose situation is complex and changeable.
With the development of air combat technology, modern UAV air combat is no longer limited to the one-aircraft-versus-one-aircraft engagements of the past; formation cooperation implies a many-to-many attack mode, and mutual covering and cooperative attack among UAVs have become important components of multi-aircraft air combat decision-making.
The difficulty of multi-agent, multi-aircraft tactical decision-making is mainly reflected in (1) cooperation among multiple heterogeneous agents, (2) real-time confrontation and action persistence, (3) incomplete-information games with strong uncertainty, and (4) a huge search space with multiple complex tasks. With the breakthrough and development of artificial intelligence technology centered on deep reinforcement learning, a new technical approach has opened up for the intelligentization of command information systems, bringing a new solution to complex multi-agent air combat decision-making.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-aircraft air combat decision method based on multi-agent reinforcement learning. The method first establishes a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model for the unmanned aerial vehicle; it then adopts the MAPPO algorithm as the multi-agent reinforcement learning algorithm and designs a corresponding return function for the specific air combat environment; finally, the constructed UAV models are combined with the multi-agent reinforcement learning algorithm to produce the final multi-aircraft cooperative air combat decision method based on multi-agent reinforcement learning. The method effectively addresses the problems that traditional multi-agent cooperative air combat approaches are computationally expensive and struggle to respond in real time to a rapidly changing battlefield situation.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1: the unmanned aerial vehicles of the two parties of the battle are assumed to be the unmanned aerial vehicle of the same party and the unmanned aerial vehicle of the opposite party, the unmanned aerial vehicle of the same party is the red machine, and the unmanned aerial vehicle of the opposite party is the blue machine; establishing a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle;
step 2: adopting an MAPPO algorithm as a multi-agent reinforcement learning algorithm, and designing a corresponding return function on the basis of a specific air combat environment;
and step 3: and (3) combining the unmanned aerial vehicle model constructed in the step (1) with the multi-agent reinforcement learning algorithm in the step (2) to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning.
Further, in the step 1, an airplane model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle are established, and the specific steps are as follows:
step 1-1: establishing an airplane model of the unmanned aerial vehicle;
step 1-1-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr], i.e. the speed Vr, pitch angle γr, roll angle φr and three-axis position (xr, yr, hr) of the unmanned aerial vehicle;
step 1-1-2: constructing the six-degree-of-freedom model and the seven actions of the unmanned aerial vehicle; the actions are encoded through the tangential overload, normal overload and roll angle of the UAV, i.e. these control quantities in equation (1) represent the action taken at each moment of the simulation, and after encoding the action set contains seven actions: constant level flight, acceleration, deceleration, left turn, right turn, pull-up and dive;
[equation (1), the kinematic model of the UAV, is reproduced only as an image in the original]
where v denotes the speed of the drone, Nx the tangential overload, θ the pitch angle, ψ the yaw angle, Nz the normal overload, φ the roll angle, t the state-update time, and g the gravitational acceleration;
step 1-1-3: inputting the action to be executed by the unmanned aerial vehicle;
step 1-1-4: computing the state of the aircraft after it executes the action by Runge-Kutta integration;
step 1-1-5: updating the state of the aircraft (a sketch of this aircraft model is given below);
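Equation (1) and the Runge-Kutta update above are reproduced only as images in the original document. As a minimal illustration of how such an aircraft model can be implemented, the Python sketch below assumes the standard point-mass kinematics driven by tangential overload Nx, normal overload Nz and roll angle, integrated with a classic fourth-order Runge-Kutta step; the action table, overload values and 0.1 s step size are illustrative assumptions, not the patent's exact parameters.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

# Seven discrete actions encoded as (Nx, Nz, roll); the values are illustrative assumptions.
ACTIONS = {
    "level": (0.0, 1.0, 0.0),          # constant level flight
    "acc":   (2.0, 1.0, 0.0),          # accelerate
    "dec":   (-2.0, 1.0, 0.0),         # decelerate
    "left":  (0.0, 3.0, -np.pi / 3),   # left turn
    "right": (0.0, 3.0, +np.pi / 3),   # right turn
    "pull":  (0.0, 3.0, 0.0),          # pull up
    "dive":  (0.0, -1.0, 0.0),         # dive
}

def dynamics(state, action):
    """Point-mass kinematics: state = [v, theta, psi, x, y, h]."""
    v, theta, psi, x, y, h = state
    nx, nz, phi = ACTIONS[action]
    dv = G * (nx - np.sin(theta))                          # tangential overload drives speed
    dtheta = (G / v) * (nz * np.cos(phi) - np.cos(theta))  # normal overload and roll drive pitch
    dpsi = G * nz * np.sin(phi) / (v * np.cos(theta))      # roll component drives heading
    dx = v * np.cos(theta) * np.cos(psi)
    dy = v * np.cos(theta) * np.sin(psi)
    dh = v * np.sin(theta)
    return np.array([dv, dtheta, dpsi, dx, dy, dh])

def rk4_step(state, action, dt=0.1):
    """Classic 4th-order Runge-Kutta update of the aircraft state."""
    k1 = dynamics(state, action)
    k2 = dynamics(state + 0.5 * dt * k1, action)
    k3 = dynamics(state + 0.5 * dt * k2, action)
    k4 = dynamics(state + dt * k3, action)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

For example, rk4_step(np.array([200.0, 0.0, 0.0, 0.0, 0.0, 3000.0]), "right") advances a level-flying UAV at 200 m/s through one right-turn step.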
step 1-2: constructing a missile model;
step 1-2-1: determining the missile performance parameters: the maximum off-axis launch angle, the maximum and minimum attack distances DMmax and DMmin, the maximum and minimum no-escape distances DMkmax and DMkmin, and the cone angle (the angle symbols are given only as images in the original).
The missile attack area is assumed to be static, and only the maximum attack distance, the maximum no-escape distance and the cone angle are considered; the attack area is denoted Areaack and satisfies condition (2), which is given only as an image in the original, where dt denotes the distance from the red machine to the blue machine, qt the line-of-sight angle from the red machine to the blue machine, and pos(target) the position of the blue machine; the no-escape area is denoted Areadead and satisfies condition (3), likewise given only as an image;
when the blue machine enters an attack area of the red machine, the blue machine is destroyed with a certain probability;
step 1-2-2: dividing an attack area;
when the corresponding angle condition (given only as an image in the original) holds and DMkmin < d < DMkmax, the blue machine is in zone 5 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMkmin, the blue machine is in zone 1 of the attack area;
when the corresponding angle condition holds and DMkmax < d < DMmax, the blue machine is in zone 4 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMmax, the blue machine is in zone 2 or zone 3 of the attack area; which of the two it is in is judged from the relative position of the red machine and the blue machine, given by equation (4) (reproduced only as an image in the original),
where Δx, Δy, Δz denote the distance differences between the red machine and the blue machine along the x-axis, y-axis and z-axis respectively, xb, yb, zb denote the position of the blue machine along the x-axis, y-axis and z-axis, and xr, yr, zr the position of the red machine along the x-axis, y-axis and z-axis;
if the test derived from equation (4) holds, the blue machine lies to the right of the red machine, i.e. in zone 3 of the attack area; otherwise it lies to the left of the red machine, i.e. in zone 2 of the attack area;
in summary, the attack area is divided as given by equation (5), which is reproduced only as an image in the original;
step 1-2-3: when the blue machine is in zone 5, it is in the no-escape area of the red machine and the missile hit probability is maximum; when the blue machine is in the other zones, the hit probability is a function with values between 0 and 1, related to the distance, the departure angle, the deviation angle and the flight direction; when the hit probability is less than 0.3, the missile is considered unable to hit and cannot be launched at that moment; the specific kill probability is given by equation (6), reproduced only as an image in the original, where pa denotes the kill probability related to the blue machine's maneuver, pd the kill probability related to the distance, and position(aircraft_aim) the attack-area zone in which the blue machine is located;
step 1-2-4: the specific steps for launching the missile are as follows:
step 1-2-4-1: inputting the distance d, the departure angle AA, the deviation angle ATA, the position and the speed of the red machine and the blue machine;
step 1-2-4-2: constructing a missile model, and setting the number of missiles;
step 1-2-4-3: judging whether the blue machine is in the attack area of the red machine according to the distance d and the deviation angle ATA;
step 1-2-4-4: when the blue machine is in the attack area of the red machine, judging which part of the attack area the blue machine is in;
step 1-2-4-5: judging the speed direction of the blue machine relative to the red machine;
step 1-2-4-6: calculating the hit rate of the missile at the moment;
step 1-2-4-7: judging whether the missile hits (a sketch of this launch logic is given below);
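Equations (2) to (6) and the zone geometry of fig. 1 are available only as images, so the Python sketch below is a hedged illustration of steps 1-2-4-1 to 1-2-4-7: classify the attack-area zone, combine a distance term pd with a maneuver/angle term pa, and apply the 0.3 launch threshold of step 1-2-3. The distance and angle thresholds and the probability curves are assumptions, not the patent's values.

```python
import math

# Illustrative missile parameters (assumptions, not the patent's values).
DM_MIN, DM_MAX = 1_000.0, 12_000.0      # min/max attack distance (m)
DMK_MIN, DMK_MAX = 2_000.0, 6_000.0     # min/max no-escape distance (m)
CONE_ANGLE = math.radians(30.0)         # attack-cone half angle
MAX_OFF_AXIS = math.radians(60.0)       # max off-axis launch angle

def attack_zone(d, ata, bearing_sign):
    """Return the attack-area zone (1-5) the blue machine is in, or 0 if outside.
    Zones 1/5/4 lie inside the cone at short/medium/long range; zones 2/3 lie
    left/right of the cone but inside the maximum off-axis angle (assumed layout)."""
    if not (DM_MIN < d < DM_MAX) or abs(ata) > MAX_OFF_AXIS:
        return 0
    if abs(ata) <= CONE_ANGLE:
        if DMK_MIN < d < DMK_MAX:
            return 5                     # no-escape zone
        return 1 if d <= DMK_MIN else 4
    return 3 if bearing_sign > 0 else 2  # blue machine right / left of the red machine

def hit_probability(d, aa, ata, bearing_sign):
    """Combine a distance term pd and a maneuver/angle term pa into a kill probability."""
    zone = attack_zone(d, ata, bearing_sign)
    if zone == 0:
        return 0.0
    if zone == 5:
        return 1.0                       # maximum probability inside the no-escape zone
    pd = max(0.0, 1.0 - abs(d - DMK_MIN) / (DM_MAX - DM_MIN))   # assumed distance falloff
    pa = max(0.0, 1.0 - (abs(aa) + abs(ata)) / math.pi)         # assumed angle falloff
    return pd * pa

def try_launch(d, aa, ata, bearing_sign, missiles_left):
    """Launch only if a missile remains and the hit probability reaches 0.3 (step 1-2-3)."""
    p = hit_probability(d, aa, ata, bearing_sign)
    if missiles_left <= 0 or p < 0.3:
        return False, p
    return True, p
```

For instance, a blue machine 4 km away inside the attack cone falls into the no-escape zone 5, and try_launch fires immediately if a missile remains.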
step 1-3: a neural network normalization model;
step 1-3-1: inputting state variables of the unmanned aerial vehicle;
step 1-3-2: normalizing the velocity (the normalization formula is given only as an image in the original);
step 1-3-3: normalizing the angles (formula given only as an image);
step 1-3-4: normalizing the positions (formula given only as an image);
step 1-3-5: taking the difference of the normalized positions of the red machine and the blue machine;
step 1-3-6: outputting the normalized data (see the sketch below);
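The normalization formulas of steps 1-3-2 to 1-3-4 are given only as images; the sketch below assumes simple range scaling of speed, angles and positions (the scale constants are placeholders rather than the patent's values) and forms the red-blue position difference of step 1-3-5.

```python
import numpy as np

# Assumed scale constants for normalization (placeholders).
V_MAX = 400.0         # m/s
POS_SCALE = 50_000.0  # m
H_SCALE = 10_000.0    # m

def normalize_state(state):
    """state = [v, pitch, yaw, x, y, h] mapped to values roughly in [-1, 1] for the networks."""
    v, pitch, yaw, x, y, h = state
    return np.array([
        v / V_MAX,       # normalized speed
        pitch / np.pi,   # normalized angles
        yaw / np.pi,
        x / POS_SCALE,   # normalized positions
        y / POS_SCALE,
        h / H_SCALE,
    ])

def relative_observation(red_state, blue_state):
    """Step 1-3-5: append the difference of the normalized red and blue positions to the red state."""
    red_n, blue_n = normalize_state(red_state), normalize_state(blue_state)
    return np.concatenate([red_n, red_n[3:] - blue_n[3:]])
```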
step 1-4: constructing a battlefield environment model;
step 1-5: situation judgment and target distribution model;
step 1-5-1: inputting the states of the red machine and the blue machine, including the speed, the pitch angle, the yaw angle and the triaxial position;
step 1-5-2: calculating the respective angle advantage from the pitch angle and the yaw angle (the angle-advantage formula is given only as an image in the original), where φt is the target entry angle and φf is the target azimuth angle;
step 1-5-3: calculating the respective distance advantage from the three-axis positions (formula given only as an image);
step 1-5-4: calculating the respective energy advantage from the velocity and the altitude in the three-axis position (formula given only as an image);
step 1-5-5: calculating the comprehensive advantage S = C1Sa + C2Sr + C3Eg by combining the angle, distance and energy advantages, where C1, C2 and C3 are weighting coefficients;
step 1-5-6: sorting the targets by comprehensive advantage to generate the target allocation matrix;
step 1-5-7: outputting the target allocation according to the target allocation matrix (see the sketch below).
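The advantage formulas of steps 1-5-2 to 1-5-4 appear only as images, so the sketch below treats the angle, distance and energy advantages as already-computed inputs and illustrates the composite advantage of step 1-5-5 together with one possible greedy target-allocation matrix for steps 1-5-6 and 1-5-7; the weighting values C1 to C3 and the greedy assignment rule are assumptions.

```python
import numpy as np

C1, C2, C3 = 0.4, 0.3, 0.3  # weighting coefficients (assumed values)

def composite_advantage(angle_adv, dist_adv, energy_adv):
    """S = C1*Sa + C2*Sr + C3*Eg (step 1-5-5)."""
    return C1 * angle_adv + C2 * dist_adv + C3 * energy_adv

def build_allocation(advantage_matrix):
    """advantage_matrix[i][j]: composite advantage of red machine i against blue machine j.
    Returns a 0/1 allocation matrix assigning each red machine to its best free target."""
    adv = np.asarray(advantage_matrix, dtype=float)
    n_red, _ = adv.shape
    allocation = np.zeros_like(adv, dtype=int)
    taken = set()
    # Greedy: red machines choose in decreasing order of their best advantage.
    for i in sorted(range(n_red), key=lambda r: -adv[r].max()):
        order = np.argsort(-adv[i])
        target = next((j for j in order if j not in taken), order[0])
        allocation[i, target] = 1
        taken.add(int(target))
    return allocation
```

For a 2V2 engagement, build_allocation([[0.7, 0.4], [0.2, 0.6]]) assigns red machine 1 to blue machine 1 and red machine 2 to blue machine 2.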
Further, in the step 2, an MAPPO algorithm is adopted as a multi-agent reinforcement learning algorithm, a centralized training and distributed execution framework is combined with a PPO algorithm to form the MAPPO algorithm, and a corresponding reward function is designed on the basis of a specific air combat environment, and the specific steps are as follows:
the return function consists of four sub-return functions, namely a height return function, a speed return function, an angle return function and a distance return function; the method comprises the following specific steps:
step 2-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr];
step 2-2: calculating the height difference Δh = hr - hb and the height-difference reward r_h, where hr and hb are the altitudes of the red machine and the blue machine in meters (the piecewise definition of r_h is given only as an image in the original);
step 2-3: calculating the red machine's altitude safety reward r_h_self (definition given only as an image);
step 2-4: calculating the total altitude reward Rh = r_h + r_h_self;
step 2-5: calculating the speed difference Δv = vr - vb and the speed-difference reward r_v, where vr and vb are the speeds of the red machine and the blue machine in meters per second (definition given only as an image);
step 2-6: calculating the red machine's speed safety reward r_v_self (definition given only as an image);
step 2-7: calculating the total speed reward Rv = r_v + r_v_self;
step 2-8: calculating the departure angle AA and the deviation angle ATA between the red machine and the blue machine;
step 2-9: calculating the angle reward Ra (formula given only as an image);
step 2-10: calculating the distance between the red machine and the blue machine; when the deviation angle ATA is less than 60 degrees, the distance reward Rd is obtained (formula given only as an image);
step 2-11: setting different weights and summing the rewards to obtain the continuous reward Rc = a1·Ra + a2·Rh + a3·Rv + a4·Rd, where a1, a2, a3 and a4 are the respective weights (a sketch of this reward computation is given below).
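The piecewise definitions of r_h, r_h_self, r_v, r_v_self, Ra and Rd are reproduced only as images in the original, so the sketch below shows one plausible way to assemble the continuous reward Rc of step 2-11; the saturating shapes, thresholds and the weights a1 to a4 are illustrative assumptions.

```python
import math

def height_reward(h_r, h_b):
    """Rh = r_h + r_h_self: favour a height advantage and penalise unsafe low altitude."""
    r_h = math.tanh((h_r - h_b) / 1000.0)        # assumed height-difference shaping
    r_h_self = -1.0 if h_r < 1000.0 else 0.0     # assumed low-altitude safety penalty
    return r_h + r_h_self

def speed_reward(v_r, v_b):
    """Rv = r_v + r_v_self: favour a speed advantage and penalise near-stall speed."""
    r_v = math.tanh((v_r - v_b) / 50.0)
    r_v_self = -1.0 if v_r < 80.0 else 0.0
    return r_v + r_v_self

def angle_reward(aa, ata):
    """Ra: best when both the departure angle AA and the deviation angle ATA are small."""
    return 1.0 - (abs(aa) + abs(ata)) / (2.0 * math.pi)

def distance_reward(d, ata):
    """Rd: only granted when ATA is below 60 degrees (step 2-10); decays with distance."""
    if abs(ata) >= math.radians(60.0):
        return 0.0
    return math.exp(-d / 5000.0)

def continuous_reward(red, blue, aa, ata, d, a=(0.4, 0.2, 0.2, 0.2)):
    """Rc = a1*Ra + a2*Rh + a3*Rv + a4*Rd (step 2-11); red/blue are (speed, altitude) tuples."""
    a1, a2, a3, a4 = a
    return (a1 * angle_reward(aa, ata)
            + a2 * height_reward(red[1], blue[1])
            + a3 * speed_reward(red[0], blue[0])
            + a4 * distance_reward(d, ata))
```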
Further, in step 3, the unmanned aerial vehicle model constructed in step 1 and the multi-agent reinforcement learning algorithm in step 2 are combined to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning, which is specifically as follows:
step 3-1: the multi-agent reinforcement learning algorithm consists of a strategy network and a value network, wherein the value network is responsible for evaluating the action selected by the strategy network so as to guide the updating of the strategy network; the input of the value network is the speed, the pitch angle, the yaw angle, the position in the x direction, the position in the y direction, the height and the selected action of the unmanned aerial vehicle, the friend aircraft and the enemy aircraft at the last moment; the input of the strategy network is the speed, the pitch angle, the yaw angle, the position in the x direction, the position in the y direction and the height of the unmanned aerial vehicle, and the output of the strategy network is selected action;
step 3-2: first, initial actions are selected according to the initial parameters of the policy networks of the red machine and the blue machine, and these actions are executed in the battlefield environment model to obtain a new state; the rewards are then calculated, and the states, rewards and actions of the red machine and the blue machine are normalized, packed and stored in the experience replay library of the multi-agent reinforcement learning algorithm. After enough data have been stored, the value networks of the red machine and the blue machine sample the experience replay library, the states of the red machine and the blue machine are combined, and the policy networks update their policies; each UAV then feeds its own state into its policy network, the policy network selects the UAV's action from that state, the UAV executes the action to produce new data, and the cycle repeats.
The invention has the following beneficial effects:
(1) The method effectively addresses the problems that traditional multi-agent cooperative air combat is computationally expensive and struggles to respond in real time to a rapidly changing battlefield situation.
(2) The multi-aircraft cooperative air combat decision algorithm based on multi-agent reinforcement learning formed by the method effectively handles the difficulties of cooperation among heterogeneous agents, real-time confrontation and action persistence, a huge search space, and multiple complex tasks in multi-agent decision-making.
(3) The algorithm comprises a battlefield environment construction module, a normalization module, a reinforcement learning module, an aircraft module, a missile module, a reward module and a target allocation module, and a decision model can be established from the battlefield environment and situation information.
(4) The invention can realize multi-aircraft air combat decision output; the reinforcement learning algorithm can be trained independently for different scenarios, and the decision algorithm offers well-defined input/output interfaces and rapid modular porting.
Drawings
Fig. 1 is a schematic cross-sectional view of an attack area of an unmanned aerial vehicle according to the present invention.
Fig. 2 is a flowchart of a battlefield environment module of the present invention.
FIG. 3 is a multi-agent multi-aircraft air combat decision algorithm design framework according to the present invention.
FIG. 4 is a diagram showing the relationship between modules in the method of the present invention.
Fig. 5 is the initial engagement situation of the 2V2 air combat in the embodiment of the present invention.
FIG. 6 is a diagram of the speed curves of both sides of the air combat in the embodiment of the invention.
FIG. 7 is a diagram of the altitude curves of both sides of the air combat in the embodiment of the invention.
FIG. 8 is a diagram of the situation curves of both sides of the air combat in the embodiment of the invention.
FIG. 9 is a diagram of the flight trajectories of both sides of the air combat in the embodiment of the invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
A multi-airplane air combat decision method based on multi-agent reinforcement learning comprises the following steps:
step 1: the unmanned aerial vehicles of the two parties of the battle are assumed to be the unmanned aerial vehicle of the same party and the unmanned aerial vehicle of the opposite party, the unmanned aerial vehicle of the same party is the red machine, and the unmanned aerial vehicle of the opposite party is the blue machine; establishing a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle;
step 2: adopting an MAPPO algorithm as a multi-agent reinforcement learning algorithm, and designing a corresponding return function on the basis of a specific air combat environment;
and step 3: and (3) combining the unmanned aerial vehicle model constructed in the step (1) with the multi-agent reinforcement learning algorithm in the step (2) to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning.
Further, in the step 1, an airplane model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle are established, and the specific steps are as follows:
step 1-1: establishing an airplane model of the unmanned aerial vehicle;
firstly, the six-degree-of-freedom model of the unmanned aerial vehicle is constructed from the three-dimensional kinematic equations in the ground inertial coordinate system; then the seven actions of the aircraft are constructed from the tangential overload, normal overload and roll angle of the UAV, and whenever the aircraft executes one of these actions, its state after the action is updated by Runge-Kutta integration;
step 1-1-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr], i.e. the speed Vr, pitch angle γr, roll angle φr and three-axis position (xr, yr, hr) of the unmanned aerial vehicle;
step 1-1-2: constructing the six-degree-of-freedom model and the seven actions of the unmanned aerial vehicle (equation (1), the UAV kinematic model, is given only as an image in the original);
step 1-1-3: inputting the action to be executed by the unmanned aerial vehicle;
step 1-1-4: computing the state of the aircraft after it executes the action by Runge-Kutta integration;
step 1-1-5: updating the state of the airplane;
step 1-2: constructing a missile model;
step 1-2-1: determining the missile performance parameters: the maximum off-axis launch angle, the maximum and minimum attack distances DMmax and DMmin, the maximum and minimum no-escape distances DMkmax and DMkmin, and the cone angle (the angle symbols are given only as images in the original).
In order to simplify the problem, the missile attack area is assumed to be static, and only the maximum attack distance, the maximum no-escape distance and the cone angle are considered; the attack area is denoted Areaack and satisfies condition (2), which is given only as an image in the original, where qt denotes the line-of-sight angle from the red machine to the blue machine and pos(target) the position of the blue machine; the no-escape area is denoted Areadead and satisfies condition (3), likewise given only as an image;
when the blue machine enters an attack area of the red machine, the blue machine is destroyed with a certain probability;
to better determine this probability, the attack zone is further analyzed as shown in FIG. 1.
Step 1-2-2: dividing an attack area;
when the corresponding angle condition (given only as an image in the original) holds and DMkmin < d < DMkmax, the blue machine is in zone 5 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMkmin, the blue machine is in zone 1 of the attack area;
when the corresponding angle condition holds and DMkmax < d < DMmax, the blue machine is in zone 4 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMmax, the blue machine is in zone 2 or zone 3 of the attack area; which of the two it is in is judged from the relative position of the red machine and the blue machine, given by equation (4) (reproduced only as an image in the original);
if the test derived from equation (4) holds, the blue machine lies to the right of the red machine, i.e. in zone 3 of the attack area; otherwise it lies to the left of the red machine, i.e. in zone 2 of the attack area;
in summary, the attack area is divided as given by equation (5), which is reproduced only as an image in the original;
step 1-2-3: when the blue machine is in zone 5, it is in the no-escape area of the red machine and the missile hit probability is maximum; when the blue machine is in the other zones, the hit probability is a function with values between 0 and 1, related to the distance, the departure angle, the deviation angle and the flight direction; when the hit probability is less than 0.3, the missile is considered unable to hit and cannot be launched at that moment; the specific kill probability is given by equation (6), reproduced only as an image in the original;
step 1-2-4: the specific steps for launching the missile are as follows:
step 1-2-4-1: inputting the distance d, the departure angle AA, the deviation angle ATA, the position and the speed of the red machine and the blue machine;
step 1-2-4-2: constructing a missile model, and setting the number of missiles;
step 1-2-4-3: judging whether the blue machine is in the attack area of the red machine according to the distance d and the deviation angle ATA;
step 1-2-4-4: when the blue machine is in the attack area of the red machine, judging which part of the attack area the blue machine is in;
step 1-2-4-5: judging the speed direction of the blue machine relative to the red machine;
step 1-2-4-6: calculating the hit rate of the missile at the moment;
step 1-2-4-7: judging whether the missile is hit;
step 1-3: a neural network normalization model;
normalization can ensure that when the input of each layer of the neural network keeps the same distribution gradient and is reduced, the model is converged to a correct place, and the gradient updating direction is deviated under different dimensions. And normalization to a reasonable range favors model generalization.
Step 1-3-1: inputting state variables of the unmanned aerial vehicle;
step 1-3-2: normalizing the velocity (the normalization formula is given only as an image in the original);
step 1-3-3: normalizing the angles (formula given only as an image);
step 1-3-4: normalizing the positions (formula given only as an image);
step 1-3-5: taking the difference of the normalized positions of the red machine and the blue machine;
step 1-3-6: outputting the normalized data;
step 1-4: constructing a battlefield environment model;
step 1-5: situation judgment and target distribution model;
and the situation judgment and target distribution model constructs a comprehensive advantage function by analyzing distance threat, angle advantage and energy advantage so as to construct an air war threat degree model. And then, calculating a target distribution matrix according to a target distribution matrix criterion after data fusion according to all information obtained by the long machine. And then selecting a tactical caution degree or risk degree coefficient according to the target distribution matrix, and representing the balance of the pilot on attacking and avoiding the danger problem.
Step 1-5-1: inputting the states of the red machine and the blue machine, including the speed, the pitch angle, the yaw angle and the triaxial position;
step 1-5-2: calculating the respective angle advantage from the pitch angle and the yaw angle (the angle-advantage formula is given only as an image in the original), where φt is the target entry angle and φf is the target azimuth angle;
step 1-5-3: calculating the respective distance advantage from the three-axis positions (formula given only as an image);
step 1-5-4: calculating the respective energy advantage from the velocity and the altitude in the three-axis position (formula given only as an image);
step 1-5-5: calculating the comprehensive advantage by combining the angle, distance and energy advantages;
step 1-5-6: sorting the targets by comprehensive advantage to generate the target allocation matrix;
step 1-5-7: outputting the target allocation according to the target allocation matrix.
Further, in the step 2, a MAPPO algorithm is adopted as a multi-agent reinforcement learning algorithm, and a corresponding reward function is designed on the basis of a specific air combat environment, and the specific steps are as follows:
MAPPO algorithm:
because the state, the action space of multimachine air battle scene are huge, and the space that single unmanned aerial vehicle can explore is limited, and the sample availability factor is not high. In addition, as a typical multi-machine system, in the problem of multi-machine cooperative air combat, the strategy of a single unmanned aerial vehicle is not only dependent on the feedback of the strategy and environment of the single unmanned aerial vehicle, but also influenced by the actions of other unmanned aerial vehicles and the cooperative relationship with the unmanned aerial vehicles, so that an experience sharing mechanism is designed, and the experience sharing mechanism comprises two aspects of sharing a sample experience base and sharing network parameters. The shared sample experience library is obtained by storing global environment situation information, action decision information of the unmanned aerial vehicle, environment situation information after the unmanned aerial vehicle executes a new action and an award value fed back by the environment aiming at the action into an experience playback library according to a quadruple form, and information of each unmanned aerial vehicle is stored into the same experience playback library according to the form. When network parameters are updated, samples are extracted from the experience playback library, loss values of the samples generated by different unmanned aerial vehicles under the Actor network and the Critic network are calculated respectively, then updating gradients of the two neural networks are obtained, gradient values calculated by the samples of the different unmanned aerial vehicles are weighted, and a global gradient formula can be obtained. As shown in fig. 3, the whole framework of the multi-machine collaborative air combat decision framework based on deep reinforcement learning includes seven modules, which are a battlefield environment construction module, a normalization module, a reinforcement learning module, an airplane module, a missile module, a reward module and a target distribution module. The input quantity of the framework is real-time battlefield situation information, and the output quantity is an action decision scheme of the controlled entity. After the original battlefield situation information is input into the framework, the original battlefield situation information is firstly processed by the situation processing module, and after data is cleaned, screened, extracted, packaged, normalized and represented in a format, the data is transmitted to the deep reinforcement learning module; the deep reinforcement learning module receives situation information data and outputs action decisions; the strategy network receives the action decision output of the deep reinforcement learning module, decodes and packages the action decision output into an operation instruction acceptable for the platform environment, and controls the corresponding unit; meanwhile, the new environment situation and the reward value obtained by executing the new action are packaged and stored in the experience storage module together with the environment situation information and the action decision scheme of the decision-making in the step, and when the network is to be trained, the sample data are extracted from the experience base and are transmitted to the neural network training module for training.
The return function consists of four sub-return functions: a height return, a speed return, an angle return and a distance return. These four returns reflect the potential-energy advantage, the kinetic-energy advantage and the hit probability within the attack area during air combat, and together summarize the whole air combat environment. The reward function reflects the red machine's positioning relative to the opponent at the current moment and guides the aircraft to fly toward higher reward values, i.e. toward a more favourable situation. The specific steps are as follows:
step 2-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr];
step 2-2: calculating the height difference Δh = hr - hb and the height-difference reward r_h (the piecewise definition of r_h is given only as an image in the original);
step 2-3: calculating the red machine's altitude safety reward r_h_self (definition given only as an image);
step 2-4: calculating the total altitude reward Rh = r_h + r_h_self;
step 2-5: calculating the speed difference Δv = vr - vb and the speed-difference reward r_v (definition given only as an image);
step 2-6: calculating the red machine's speed safety reward r_v_self (definition given only as an image);
step 2-7: calculating the total speed reward Rv = r_v + r_v_self;
step 2-8: calculating the departure angle AA and the deviation angle ATA between the red machine and the blue machine;
step 2-9: calculating the angle reward Ra (formula given only as an image);
step 2-10: calculating the distance between the red machine and the blue machine; when the deviation angle ATA is less than 60 degrees, the distance reward Rd is obtained (formula given only as an image);
step 2-11: setting different weights and summing the rewards to obtain the continuous reward Rc = a1·Ra + a2·Rh + a3·Rv + a4·Rd, where a1, a2, a3 and a4 are the respective weights.
Further, in step 3, the unmanned aerial vehicle model constructed in step 1 and the multi-agent reinforcement learning algorithm in step 2 are combined to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning, which is specifically as follows:
step 3-1: the relation between the model constructed in the step 1 and the MAPPO algorithm and the designed reporting function in the step 2 is shown in the attached figure 4, the multi-agent reinforcement learning algorithm is composed of a strategy network and a value network, and the value network is responsible for evaluating the action selected by the strategy network so as to guide the updating of the strategy network; the input of the value network is the speed, the pitch angle, the yaw angle, the position in the x direction, the position in the y direction, the height and the selected action of the unmanned aerial vehicle, the friend aircraft and the enemy aircraft at the last moment; the input of the strategy network is the speed, the pitch angle, the yaw angle, the position in the x direction, the position in the y direction and the height of the unmanned aerial vehicle, and the output of the strategy network is selected action;
step 3-2: first, initial actions are selected according to the initial parameters of the policy networks of the red machine and the blue machine, and these actions are executed in the battlefield environment model to obtain a new state; the rewards are then calculated, and the states, rewards and actions of the red machine and the blue machine are normalized, packed and stored in the experience replay library of the multi-agent reinforcement learning algorithm. After enough data have been stored, the value networks of the red machine and the blue machine sample the experience replay library, the states of the red machine and the blue machine are combined, and the policy networks update their policies; each UAV then feeds its own state into its policy network, the policy network selects the UAV's action from that state, the UAV executes the action to produce new data, and the cycle repeats.
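The interaction loop of step 3-2 can be laid out as the sketch below, reusing the shared buffer idea sketched earlier; env, the select_action/evaluate/update interfaces and the batch size are hypothetical placeholders introduced for illustration, not names taken from the patent.

```python
def train(env, policies, values, buffer, n_episodes=1000, batch_size=1024):
    """Interaction loop of step 3-2 for the red and blue UAVs (hypothetical interfaces).
    policies[i] maps UAV i's own normalized state to one of the seven actions;
    values[i] scores the combined states/actions to guide the policy update."""
    for _ in range(n_episodes):
        states = env.reset()                           # initial red/blue states
        done = False
        while not done:
            # Each UAV selects an action from its own (local) observation.
            actions = [pi.select_action(s) for pi, s in zip(policies, states)]
            next_states, rewards, done = env.step(actions)
            # Normalized states, actions and rewards are packed into the shared buffer.
            for i, (s, a, ns, r) in enumerate(zip(states, actions, next_states, rewards)):
                buffer.push(i, s, a, ns, r)
            states = next_states
        if len(buffer.buffer) >= batch_size:
            batch = buffer.sample(batch_size)
            # Centralized value networks evaluate the sampled joint data;
            # the policy networks are then updated from those evaluations.
            for pi, v in zip(policies, values):
                advantages = v.evaluate(batch)
                pi.update(batch, advantages)
                v.update(batch)
```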
The specific embodiment is as follows:
the situation of double-aircraft in wartime is shown in fig. 5, four airplanes are on the same plane, a red aircraft 1 and a red aircraft 2 are respectively positioned right in front of a blue aircraft 1 and a blue aircraft 2, the blue aircraft 1 and the blue aircraft 2 have a tendency to be close to a combined attack area of the red aircraft 1 and the red aircraft 2, and the red aircraft 1 and the red aircraft 2 also have a tendency to be close to a combined attack area of the blue aircraft 1 and the blue aircraft 2. So that the red machine 1 and the red machine 2 are in the same potential as the blue machine 1 and the blue machine 2.
After the training was completed, the number of wins in the red and blue after 1000 trials is shown in table 1. It can be found that the winning rate of the red square is 51.8 percent and the winning rate of the blue square is 48.2 percent.
TABLE 1 number of wins in Red and blue
Situation(s) Number of times
Red machine 1 hits basket machine 1 226
Red machine 1 hits basket machine 2 129
Red machine 2 hits basket machine 1 0
Red machine 2 hits basket machine 2 163
Blue machine 1 hitting red machine 1 330
Blue machine 1 hitting red machine 2 0
Blue machine 2 hitting red machine 1 152
Blue machine 2 hits red machine 2 0
The analysis was performed by using a red machine 1 and a middle blue machine 1 as an example.
The action selected by red machine 1 is [ right, right, right, right, acc, acc, acc, acc, acc, acc).
The action selected by Red machine 2 is [ right, right, acc, right, acc, acc, acc, acc, acc, acc ].
The action selected by the blue machine 1 is [ right, right, right, right, acc, acc, acc, acc, acc, acc ].
The action selected by blue machine 2 is [ right, right, right, right, acc, acc, acc, acc, acc, acc ] is.
The simulation result graphs are shown in fig. 6-8, wherein the solid line represents red machine 1, the dotted line represents red machine 2, the dotted line represents blue machine 1, and the dotted curve represents blue machine 2. As shown in fig. 6, the speed of blue machine 2 is highest with the greatest speed advantage, and the speed of red machine 1 and red machine 2 is far less than that of blue machine 1 and blue machine 2. As can be seen from fig. 7, the blue aircraft 1 and the blue aircraft 2 are not as superior in height to the red aircraft 1 and the red aircraft 2, and as can be seen from fig. 8, the red aircraft 1, the red aircraft 2, the blue aircraft 1 and the blue aircraft 2 are flying safely, so that the initial situations thereof are all positive, as the air war carries out the pinching of the red aircraft 1 and the blue aircraft 2 by the blue aircraft 1 and the blue aircraft 2, the situations of the blue aircraft 1 and the blue aircraft 2 gradually rise, the situation of the red aircraft gradually worsens, then the two red aircraft also start the pinching of the blue aircraft 2, the situation of the blue aircraft falls, the situation of the red aircraft rises, finally, the blue aircraft finishes the pinching of the red aircraft 1 first, and the blue aircraft 2 launches a missile, successfully hits the situations of the red aircraft 1 and the blue aircraft 2, and masters the battlefield initiative.
Fig. 9 is a trajectory diagram of four drones.
All the simulation results together demonstrate the effectiveness of the multi-aircraft cooperative air combat decision algorithm based on multi-agent reinforcement learning designed by the invention. The algorithm effectively addresses the problems that traditional multi-agent cooperative air combat is computationally expensive and struggles to respond in real time to a rapidly changing battlefield situation, and it also handles the difficulties of cooperation among heterogeneous agents, real-time confrontation and action persistence, a huge search space and multiple complex tasks in multi-agent decision-making. A decision model can be established from the battlefield environment and situation information; multi-aircraft air combat decision output can be realized; the reinforcement learning algorithm can be trained independently for different scenarios; and the decision algorithm offers well-defined input/output interfaces and rapid modular porting.

Claims (4)

1. A multi-airplane air combat decision method based on multi-agent reinforcement learning is characterized by comprising the following steps:
step 1: the unmanned aerial vehicles of the two parties of the battle are assumed to be the unmanned aerial vehicle of the same party and the unmanned aerial vehicle of the opposite party, the unmanned aerial vehicle of the same party is the red machine, and the unmanned aerial vehicle of the opposite party is the blue machine; establishing a six-degree-of-freedom model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment model and a target distribution model of the unmanned aerial vehicle;
step 2: adopting an MAPPO algorithm as a multi-agent reinforcement learning algorithm, and designing a corresponding return function on the basis of a specific air combat environment;
and step 3: and (3) combining the unmanned aerial vehicle model constructed in the step (1) with the multi-agent reinforcement learning algorithm in the step (2) to generate a final multi-machine cooperative air combat decision method based on multi-agent reinforcement learning.
2. The multi-aircraft air combat decision method based on multi-agent reinforcement learning as claimed in claim 1, wherein in step 1, an unmanned aerial vehicle model, a missile model, a neural network normalization model, a battlefield environment model, a situation judgment and target distribution model are established, and the specific steps are as follows:
step 1-1: establishing an airplane model of the unmanned aerial vehicle;
step 1-1-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr], i.e. the speed Vr, pitch angle γr, roll angle φr and three-axis position (xr, yr, hr) of the unmanned aerial vehicle;
Step 1-1-2: constructing a six-degree-of-freedom model and seven actions of the unmanned aerial vehicle; the actions are coded by selecting the tangential overload, normal overload and roll angle of the unmanned aerial vehicle, namely in the formula (1)
Figure FDA0003223429920000013
Coming watchDisplaying actions taken at each moment in the simulation, wherein the actions comprise seven actions of constant level flight, acceleration, deceleration, left turning, right turning, upward pulling and downward diving after being coded;
Figure FDA0003223429920000011
where v represents the speed of the drone, NxRepresenting tangential overload of the drone, theta representing pitch angle of the drone, psi representing yaw angle of the drone, NzIndicating a normal overload of the drone,
Figure FDA0003223429920000012
the roll angle of the unmanned aerial vehicle is represented, t represents the updating time of the state of the unmanned aerial vehicle, and g represents the gravity acceleration;
step 1-1-3: inputting the action to be executed by the unmanned aerial vehicle;
step 1-1-4: resolving the state of the airplane after the airplane executes the action through the Longge Kutta;
step 1-1-5: updating the state of the airplane;
step 1-2: constructing a missile model;
step 1-2-1: determining the missile performance parameters: the maximum off-axis launch angle, the maximum and minimum attack distances DMmax and DMmin, the maximum and minimum no-escape distances DMkmax and DMkmin, and the cone angle (the angle symbols are given only as images in the original).
The missile attack area is assumed to be static, and only the maximum attack distance, the maximum no-escape distance and the cone angle are considered; the attack area is denoted Areaack and satisfies condition (2), which is given only as an image in the original, where dt denotes the distance from the red machine to the blue machine, qt the line-of-sight angle from the red machine to the blue machine, and pos(target) the position of the blue machine; the no-escape area is denoted Areadead and satisfies condition (3), likewise given only as an image;
when the blue machine enters an attack area of the red machine, the blue machine is destroyed with a certain probability;
step 1-2-2: dividing an attack area;
when the corresponding angle condition (given only as an image in the original) holds and DMkmin < d < DMkmax, the blue machine is in zone 5 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMkmin, the blue machine is in zone 1 of the attack area;
when the corresponding angle condition holds and DMkmax < d < DMmax, the blue machine is in zone 4 of the attack area;
when the corresponding angle condition holds and DMmin < d < DMmax, the blue machine is in zone 2 or zone 3 of the attack area; which of the two it is in is judged from the relative position of the red machine and the blue machine, given by equation (4) (reproduced only as an image in the original),
where Δx, Δy, Δz denote the distance differences between the red machine and the blue machine along the x-axis, y-axis and z-axis respectively, xb, yb, zb denote the position of the blue machine along the x-axis, y-axis and z-axis, and xr, yr, zr the position of the red machine along the x-axis, y-axis and z-axis;
if the test derived from equation (4) holds, the blue machine lies to the right of the red machine, i.e. in zone 3 of the attack area; otherwise it lies to the left of the red machine, i.e. in zone 2 of the attack area;
in summary, the attack area is divided as given by equation (5), which is reproduced only as an image in the original;
step 1-2-3: when the blue machine is in zone 5, it is in the no-escape area of the red machine and the missile hit probability is maximum; when the blue machine is in the other zones, the hit probability is a function with values between 0 and 1, related to the distance, the departure angle, the deviation angle and the flight direction; when the hit probability is less than 0.3, the missile is considered unable to hit and cannot be launched at that moment; the specific kill probability is given by equation (6), reproduced only as an image in the original, where pa denotes the kill probability related to the blue machine's maneuver, pd the kill probability related to the distance, and position(aircraft_aim) the attack-area zone in which the blue machine is located;
step 1-2-4: the specific steps for launching the missile are as follows:
step 1-2-4-1: inputting the distance d, the departure angle AA, the deviation angle ATA, the position and the speed of the red machine and the blue machine;
step 1-2-4-2: constructing a missile model, and setting the number of missiles;
step 1-2-4-3: judging whether the blue machine is in the attack area of the red machine according to the distance d and the deviation angle ATA;
step 1-2-4-4: when the blue machine is in the attack area of the red machine, judging which part of the attack area the blue machine is in;
step 1-2-4-5: judging the speed direction of the blue machine relative to the red machine;
step 1-2-4-6: calculating the hit rate of the missile at the moment;
step 1-2-4-7: judging whether the missile is hit;
step 1-3: a neural network normalization model;
step 1-3-1: inputting state variables of the unmanned aerial vehicle;
step 1-3-2: normalizing the velocity (the normalization formula is given only as an image in the original);
step 1-3-3: normalizing the angles (formula given only as an image);
step 1-3-4: normalizing the positions (formula given only as an image);
step 1-3-5: taking the difference of the normalized positions of the red machine and the blue machine;
step 1-3-6: outputting the normalized data;
step 1-4: constructing a battlefield environment model;
step 1-5: situation judgment and target distribution model;
step 1-5-1: inputting the states of the red machine and the blue machine, including the speed, the pitch angle, the yaw angle and the triaxial position;
step 1-5-2: calculating the respective angle advantage from the pitch angle and the yaw angle (the angle-advantage formula is given only as an image in the original), where φt is the target entry angle and φf is the target azimuth angle;
step 1-5-3: calculating the respective distance advantage from the three-axis positions (formula given only as an image);
step 1-5-4: calculating the respective energy advantage from the velocity and the altitude in the three-axis position (formula given only as an image);
step 1-5-5: calculating the comprehensive advantage S = C1Sa + C2Sr + C3Eg by combining the angle, distance and energy advantages, where C1, C2 and C3 are weighting coefficients;
step 1-5-6: sorting the targets by comprehensive advantage to generate the target allocation matrix;
step 1-5-7: outputting the target allocation according to the target allocation matrix.
3. The multi-aircraft air combat decision method based on multi-agent reinforcement learning as claimed in claim 2, wherein in the step 2, a MAPPO algorithm is adopted as the multi-agent reinforcement learning algorithm, a centralized training and distributed execution framework is combined with a PPO algorithm to form the MAPPO algorithm, and a corresponding reward function is designed on the basis of a specific air combat environment, and the specific steps are as follows:
the return function consists of four sub-return functions, namely a height return function, a speed return function, an angle return function and a distance return function; the method comprises the following specific steps:
step 2-1: inputting the UAV state Sr = [Vr, γr, φr, xr, yr, hr];
step 2-2: calculating the height difference Δh = hr - hb and the height-difference reward r_h, where hr and hb are the altitudes of the red machine and the blue machine in meters (the piecewise definition of r_h is given only as an image in the original);
step 2-3: calculating the red machine's altitude safety reward r_h_self (definition given only as an image);
step 2-4: calculating the total altitude reward Rh = r_h + r_h_self;
step 2-5: calculating the speed difference Δv = vr - vb and the speed-difference reward r_v, where vr and vb are the speeds of the red machine and the blue machine in meters per second (definition given only as an image);
step 2-6: calculating the red machine's speed safety reward r_v_self (definition given only as an image);
step 2-7: calculating the total speed reward Rv = r_v + r_v_self;
step 2-8: calculating the departure angle AA and the deviation angle ATA between the red machine and the blue machine;
step 2-9: calculating the angle reward Ra (formula given only as an image);
step 2-10: calculating the distance between the red machine and the blue machine; when the deviation angle ATA is less than 60 degrees, the distance reward Rd is obtained (formula given only as an image);
step 2-11: setting different weights and summing the rewards to obtain the continuous reward Rc = a1·Ra + a2·Rh + a3·Rv + a4·Rd, where a1, a2, a3 and a4 are the respective weights.
4. The multi-aircraft air combat decision method based on multi-agent reinforcement learning as claimed in claim 3, wherein in step 3 the unmanned aerial vehicle model constructed in step 1 is combined with the multi-agent reinforcement learning algorithm of step 2 to generate the final multi-aircraft cooperative air combat decision method based on multi-agent reinforcement learning, specifically as follows:
step 3-1: the multi-agent reinforcement learning algorithm consists of a policy network and a value network; the value network evaluates the actions selected by the policy network and thereby guides the updating of the policy network. The inputs of the value network are the speed, pitch angle, yaw angle, x-position, y-position and altitude of the own unmanned aerial vehicle, the friendly aircraft and the enemy aircraft, together with the actions selected at the previous time step; the inputs of the policy network are the speed, pitch angle, yaw angle, x-position, y-position and altitude of the own unmanned aerial vehicle, and its output is the selected action;
step 3-2: first, the policy networks of the red machine and the blue machine select initial actions from their initial parameters, and these actions are executed in the battlefield environment model to obtain new states; the rewards are then calculated, and the states, rewards and actions of the red machine and the blue machine are normalized, packed and stored in the experience replay buffer of the multi-agent reinforcement learning algorithm. Once enough data have been stored, the value networks of the red machine and the blue machine sample from the replay buffer and evaluate the combined states of both sides, and the policy networks update their policies; each unmanned aerial vehicle then feeds its own state into its policy network, the policy network selects an action from that state, the unmanned aerial vehicle executes the action to generate new data, and the loop repeats (a minimal sketch of these networks and the training loop is given below).
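Steps 3-1 and 3-2 describe a MAPPO-style actor-critic arrangement: decentralized policy networks acting on each aircraft's own state, a centralized value network that sees the states of all aircraft plus the previously selected actions, and a rollout loop that fills an experience buffer before updating. The PyTorch sketch below illustrates only that structure; the layer sizes, the 6-dimensional per-aircraft state [V, θ, ψ, x, y, h], the discrete maneuver count, and the omission of the PPO clipping and advantage computation are all simplifying assumptions.

import torch
import torch.nn as nn

STATE_DIM = 6          # [V, theta, psi, x, y, h] of one aircraft (assumed)
N_AGENTS = 4           # e.g. 2 red + 2 blue aircraft (assumed)
N_ACTIONS = 7          # size of a discrete maneuver library (assumed)

class PolicyNet(nn.Module):
    """Decentralized actor: own state -> distribution over maneuvers (step 3-1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.Tanh(),
                                 nn.Linear(128, N_ACTIONS))
    def forward(self, own_state):
        return torch.distributions.Categorical(logits=self.net(own_state))

class ValueNet(nn.Module):
    """Centralized critic: states of all aircraft plus previous actions -> value."""
    def __init__(self):
        super().__init__()
        # joint state + previous actions (one scalar index per agent, a simplification)
        in_dim = N_AGENTS * STATE_DIM + N_AGENTS
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.Tanh(),
                                 nn.Linear(256, 1))
    def forward(self, joint_state, last_actions):
        return self.net(torch.cat([joint_state, last_actions], dim=-1))

# Minimal rollout step in the spirit of step 3-2 (environment and PPO update omitted).
policy, value = PolicyNet(), ValueNet()
own_state = torch.randn(STATE_DIM)                   # normalized state of one UAV
joint_state = torch.randn(N_AGENTS * STATE_DIM)      # red + blue states combined
last_actions = torch.zeros(N_AGENTS)                 # actions of the previous step

dist = policy(own_state)               # actor selects a maneuver from its own state
action = dist.sample()
v = value(joint_state, last_actions)   # critic evaluates the joint situation
# (states, actions, rewards) would be packed into the experience replay buffer,
# sampled in batches, and used to update both networks.
print(int(action), float(v))

A full implementation would additionally store log-probabilities and rewards in the buffer and apply the clipped PPO surrogate loss when updating the policy networks.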
CN202110964271.9A 2021-08-22 2021-08-22 Multi-agent reinforcement learning-based multi-machine air combat decision method Active CN113791634B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110964271.9A CN113791634B (en) 2021-08-22 2021-08-22 Multi-agent reinforcement learning-based multi-machine air combat decision method

Publications (2)

Publication Number Publication Date
CN113791634A true CN113791634A (en) 2021-12-14
CN113791634B CN113791634B (en) 2024-02-02

Family

ID=78876259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110964271.9A Active CN113791634B (en) 2021-08-22 2021-08-22 Multi-agent reinforcement learning-based multi-machine air combat decision method

Country Status (1)

Country Link
CN (1) CN113791634B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020000399A1 (en) * 2018-06-29 2020-01-02 东莞理工学院 Multi-agent deep reinforcement learning proxy method based on intelligent grid
WO2020024097A1 (en) * 2018-07-30 2020-02-06 东莞理工学院 Deep reinforcement learning-based adaptive game algorithm
CN110404264A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) It is a kind of based on the virtually non-perfect information game strategy method for solving of more people, device, system and the storage medium self played a game
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CUI Wenhua; LI Dong; TANG Yubo; LIU Shaojun: "A decision-making method framework for wargaming based on deep reinforcement learning", National Defense Technology, no. 02 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114371729A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN114492058B (en) * 2022-02-07 2023-02-03 清华大学 Multi-agent confrontation scene oriented defense situation assessment method and device
CN114492059A (en) * 2022-02-07 2022-05-13 清华大学 Multi-agent confrontation scene situation assessment method and device based on field energy
CN114492058A (en) * 2022-02-07 2022-05-13 清华大学 Multi-agent confrontation scene oriented defense situation assessment method and device
CN114492059B (en) * 2022-02-07 2023-02-28 清华大学 Multi-agent confrontation scene situation assessment method and device based on field energy
CN114578838A (en) * 2022-03-01 2022-06-03 哈尔滨逐宇航天科技有限责任公司 Reinforced learning active disturbance rejection attitude control method suitable for aircrafts of various configurations
CN114578838B (en) * 2022-03-01 2022-09-16 哈尔滨逐宇航天科技有限责任公司 Reinforced learning active disturbance rejection attitude control method suitable for aircrafts of various configurations
CN115113642A (en) * 2022-06-02 2022-09-27 中国航空工业集团公司沈阳飞机设计研究所 Multi-unmanned aerial vehicle space-time key feature self-learning cooperative confrontation decision-making method
CN115047907A (en) * 2022-06-10 2022-09-13 中国电子科技集团公司第二十八研究所 Air isomorphic formation command method based on multi-agent PPO algorithm
CN115047907B (en) * 2022-06-10 2024-05-07 中国电子科技集团公司第二十八研究所 Air isomorphic formation command method based on multi-agent PPO algorithm
CN115484205A (en) * 2022-07-12 2022-12-16 北京邮电大学 Deterministic network routing and queue scheduling method and device
CN115484205B (en) * 2022-07-12 2023-12-01 北京邮电大学 Deterministic network routing and queue scheduling method and device
CN116679742A (en) * 2023-04-11 2023-09-01 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116679742B (en) * 2023-04-11 2024-04-02 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116187787A (en) * 2023-04-25 2023-05-30 中国人民解放军96901部队 Intelligent planning method for cross-domain allocation problem of combat resources
CN116187787B (en) * 2023-04-25 2023-09-12 中国人民解放军96901部队 Intelligent planning method for cross-domain allocation problem of combat resources
CN116880186A (en) * 2023-07-13 2023-10-13 四川大学 Data-driven self-adaptive dynamic programming air combat decision method
CN116880186B (en) * 2023-07-13 2024-04-16 四川大学 Data-driven self-adaptive dynamic programming air combat decision method
CN116909155B (en) * 2023-09-14 2023-11-24 中国人民解放军国防科技大学 Unmanned aerial vehicle autonomous maneuver decision-making method and device based on continuous reinforcement learning
CN116909155A (en) * 2023-09-14 2023-10-20 中国人民解放军国防科技大学 Unmanned aerial vehicle autonomous maneuver decision-making method and device based on continuous reinforcement learning
CN117313561A (en) * 2023-11-30 2023-12-29 中国科学院自动化研究所 Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN117313561B (en) * 2023-11-30 2024-02-13 中国科学院自动化研究所 Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method

Also Published As

Publication number Publication date
CN113791634B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
Yang et al. Maneuver decision of UAV in short-range air combat based on deep reinforcement learning
CN111880563B (en) Multi-unmanned aerial vehicle task decision method based on MADDPG
CN111240353B (en) Unmanned aerial vehicle collaborative air combat decision method based on genetic fuzzy tree
CN111859541B (en) PMADDPG multi-unmanned aerial vehicle task decision method based on transfer learning improvement
CN112198892B (en) Multi-unmanned aerial vehicle intelligent cooperative penetration countermeasure method
CN110928329A (en) Multi-aircraft track planning method based on deep Q learning algorithm
CN115291625A (en) Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN105678030B (en) Divide the air-combat tactics team emulation mode of shape based on expert system and tactics tactics
CN113893539B (en) Cooperative fighting method and device for intelligent agent
Zhang et al. Maneuver decision-making of deep learning for UCAV thorough azimuth angles
CN114063644B (en) Unmanned fighter plane air combat autonomous decision-making method based on pigeon flock reverse countermeasure learning
CN115951709A (en) Multi-unmanned aerial vehicle air combat strategy generation method based on TD3
Bae et al. Deep reinforcement learning-based air-to-air combat maneuver generation in a realistic environment
CN114638339A (en) Intelligent agent task allocation method based on deep reinforcement learning
CN113435598A (en) Knowledge-driven intelligent strategy deduction decision method
Wu et al. Visual range maneuver decision of unmanned combat aerial vehicle based on fuzzy reasoning
CN117313561B (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN113741186B (en) Double-aircraft air combat decision-making method based on near-end strategy optimization
Wang et al. Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction
Duan et al. Autonomous maneuver decision for unmanned aerial vehicle via improved pigeon-inspired optimization
Zhu et al. Mastering air combat game with deep reinforcement learning
Kong et al. Multi-ucav air combat in short-range maneuver strategy generation using reinforcement learning and curriculum learning
Wang et al. Research on naval air defense intelligent operations on deep reinforcement learning
CN114330093A (en) Multi-platform collaborative intelligent confrontation decision-making method for aviation soldiers based on DQN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant