CN112947581A - Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning - Google Patents


Info

Publication number: CN112947581A (granted as CN112947581B)
Application number: CN202110318644.5A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: unmanned aerial vehicle, target, UAV, action
Inventors: 杨啟明, 张建东, 史国庆, 吴勇, 朱岩, 张耀中
Assignee (current and original): Northwestern Polytechnical University
Application filed by: Northwestern Polytechnical University
Legal status: Active, granted

Classifications

    • G — Physics
    • G05 — Controlling; Regulating
    • G05D — Systems for controlling or regulating non-electric variables
    • G05D 1/00 — Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D 1/10 — Simultaneous control of position or course in three dimensions
    • G05D 1/101 — Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D 1/104 — Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircraft, e.g. formation flying
    • Y02T 10/10, 10/40 — Climate change mitigation technologies related to transportation (internal combustion engine vehicles; engine management systems)


Abstract

The invention discloses a multi-UAV cooperative air combat maneuver decision method based on multi-agent reinforcement learning, which solves the problem of autonomous maneuver decision-making in simulated many-versus-many cooperative air combat with multiple unmanned aerial vehicles. The method comprises the following steps: building a motion model of the UAV platform; evaluating the multi-aircraft air combat situation on the basis of attack zones, distances and angle factors, and analyzing the state space, action space and reward value of the multi-aircraft air combat maneuver decision; and designing a target allocation method and a strategy coordination mechanism for the cooperative air combat, in which the behavior feedback of each UAV with respect to target allocation, situational advantage and safe collision avoidance is defined through the distribution of reward values, so that strategy cooperation is achieved after training. The invention effectively improves the ability of multiple UAVs to make autonomous cooperative air combat maneuver decisions, exhibits strong cooperativity and autonomous optimization, and continuously improves the decision-making level of the UAV formation through continued simulation and learning.

Description

Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of unmanned aerial vehicles, and particularly relates to a multi-unmanned aerial vehicle collaborative air combat maneuver decision method.
Background
At present, unmanned aerial vehicles (UAVs) can perform tasks such as reconnaissance, surveillance and ground attack, and play an increasingly important role in modern warfare. However, limited by their level of intelligence, UAVs are not yet capable of autonomous air combat maneuver decision-making, especially autonomous cooperative air combat with multiple UAVs. Raising the intelligence level of UAVs so that they can complete air combat maneuvers autonomously according to the situational environment and flight control commands is therefore a main current research direction.
For a UAV, autonomous air combat maneuver decision-making essentially means establishing a mapping from the air combat situation to maneuver actions and executing the appropriate maneuver in each situation. Because the air combat situation is more complex than that of other tasks, the situation space of the air combat task can hardly be covered completely by manual pre-programming, and it is even more difficult to compute and generate the optimal maneuver decision.
At present, research on UAV air combat maneuver decision-making is mostly carried out in the 1v1 single-aircraft confrontation scenario, whereas in actual air combat several UAVs basically fight as a cooperating formation. Multi-aircraft cooperative air combat involves three aspects, namely air combat situation assessment, multi-target allocation and maneuver decision-making, and the cooperative air combat is a tightly coupled process of these three parts. Compared with the maneuver decision of single-aircraft confrontation, multi-aircraft cooperative air combat must consider tactical cooperation in addition to the enlarged force scale, so the problem is considerably more complex.
Research on multi-aircraft cooperative air combat decision-making can be divided into centralized and distributed approaches. In the centralized approach, a single center computes the actions of all UAVs in the formation; the models are complex and suffer from high computational difficulty and insufficient real-time performance. The idea of the distributed approach is that each UAV in the formation computes its own maneuver action on the basis of a target allocation, which reduces the complexity of the model, while cooperation of the formation task is realized through the target allocation. Most existing distributed cooperative air combat decision methods first perform target allocation and then convert the many-versus-many air combat into one-versus-one engagements according to the allocation result; such methods cannot fully exploit the multi-target attack capability and the tactical cooperation of formation combat, and fail to achieve the effect of 1+1 > 2.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-UAV cooperative air combat maneuver decision method based on multi-agent reinforcement learning, which solves the problem of autonomous maneuver decision-making in simulated many-versus-many cooperative air combat with multiple unmanned aerial vehicles. The method comprises the following steps: building a motion model of the UAV platform; evaluating the multi-aircraft air combat situation on the basis of attack zones, distances and angle factors, and analyzing the state space, action space and reward value of the multi-aircraft air combat maneuver decision; and designing a target allocation method and a strategy coordination mechanism for the cooperative air combat, in which the behavior feedback of each UAV with respect to target allocation, situational advantage and safe collision avoidance is defined through the distribution of reward values, so that strategy cooperation is achieved after training. The invention effectively improves the ability of multiple UAVs to make autonomous cooperative air combat maneuver decisions, exhibits strong cooperativity and autonomous optimization, and continuously improves the decision-making level of the UAV formation through continued simulation and learning.
The technical solution adopted by the invention to solve this problem comprises the following steps:
step 1: establishing a multi-machine air combat environment model, and defining a state space, an action space and a reward value for each unmanned aerial vehicle to make a maneuver decision in the multi-machine cooperative air combat process;
Step 1-1: in the ground coordinate system, the ox axis points due east, the oy axis due north, and the oz axis vertically upward. The kinematic model of the UAV in the ground coordinate system is given by equation (1) (the standard three-degree-of-freedom point-mass form, consistent with FIG. 1):

$$\begin{cases}\dot{x}=v\cos\gamma\sin\psi\\ \dot{y}=v\cos\gamma\cos\psi\\ \dot{z}=v\sin\gamma\end{cases}\qquad(1)$$

In the ground coordinate system, the dynamic model of the UAV is given by equation (2):

$$\begin{cases}\dot{v}=g\,(n_x-\sin\gamma)\\ \dot{\gamma}=\dfrac{g}{v}\,(n_z\cos\mu-\cos\gamma)\\ \dot{\psi}=\dfrac{g\,n_z\sin\mu}{v\cos\gamma}\end{cases}\qquad(2)$$

where (x, y, z) is the position of the UAV in the ground coordinate system, v is the UAV speed, and $\dot{x}$, $\dot{y}$, $\dot{z}$ are the components of v along the x, y and z axes; the flight path angle γ is the angle between the velocity v and the horizontal plane o-x-y; the heading angle ψ is the angle between the projection v' of the velocity v onto the o-x-y plane and the oy axis; g is the gravitational acceleration. $[n_x, n_z, \mu]$ are the control variables used to steer the UAV: $n_x$ is the overload along the velocity direction, representing thrust and deceleration; $n_z$ is the overload in the pitch direction, i.e. the normal overload; μ is the roll angle about the UAV velocity vector. The UAV speed is controlled through $n_x$, and the direction of the velocity vector is controlled through $n_z$ and μ, so that the UAV performs maneuvering actions;
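For illustration, a minimal Python sketch of this three-degree-of-freedom point-mass model is given below, integrating equations (1) and (2) (as reconstructed above) with a simple Euler step; the function and parameter names are illustrative and not part of the patent.

```python
import numpy as np

G = 9.81  # gravitational acceleration [m/s^2]

def step_uav(state, action, dt=0.1):
    """Advance the 3-DOF point-mass UAV model by one Euler step.

    state  = (x, y, z, v, gamma, psi)   position [m], speed [m/s], path/heading angles [rad]
    action = (nx, nz, mu)               tangential overload, normal overload, roll angle [rad]
    """
    x, y, z, v, gamma, psi = state
    nx, nz, mu = action

    # Kinematics, equation (1): ox = east, oy = north, oz = up
    x_dot = v * np.cos(gamma) * np.sin(psi)
    y_dot = v * np.cos(gamma) * np.cos(psi)
    z_dot = v * np.sin(gamma)

    # Dynamics, equation (2): overloads steer the speed and the velocity direction
    v_dot = G * (nx - np.sin(gamma))
    gamma_dot = (G / v) * (nz * np.cos(mu) - np.cos(gamma))
    psi_dot = G * nz * np.sin(mu) / (v * np.cos(gamma))

    return (x + x_dot * dt, y + y_dot * dt, z + z_dot * dt,
            v + v_dot * dt, gamma + gamma_dot * dt, psi + psi_dot * dt)
```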
Step 1-2: the missile is assumed to have tail-attack capability only. In the interception zone of the missile, $v_U$ and $v_T$ denote the velocities of the UAV and of the target, respectively; D is the distance vector, describing the relative position between the UAV and the target; $\alpha_U$ and $\alpha_T$ denote the angle between the UAV velocity vector and the distance vector D and the angle between the target velocity vector and D, respectively.
Let the maximum interception distance of the missile be $D_m$ and let its field-of-view angle be given; the interception zone of the missile is then a conical region Ω. The maneuvering goal of the UAV in air combat is to drive the target into its own interception zone $\Omega_U$ while preventing itself from entering the interception zone $\Omega_T$ of the target;
According to the definition of the missile interception zone, if the target lies inside the interception zone of the UAV's own missile, the UAV can launch a weapon at the target and is therefore at an advantage. The advantage value $\eta_U$ obtained when the UAV can intercept the target is defined as:

$$\eta_U=\begin{cases}Re, & (x_T,y_T,z_T)\in\Omega_U\\ 0, & \text{otherwise}\end{cases}\qquad(3)$$

where $(x_T, y_T, z_T)$ are the position coordinates of the target and Re is a positive number.

The advantage value $\eta_T$ obtained when the target can intercept the UAV is defined analogously:

$$\eta_T=\begin{cases}Re, & (x_U,y_U,z_U)\in\Omega_T\\ 0, & \text{otherwise}\end{cases}$$

where $(x_U, y_U, z_U)$ are the position coordinates of the UAV.

In the air combat, the advantage value $\eta_A$ obtained by the UAV from the interception opportunity is defined as:

$$\eta_A=\eta_U-\eta_T\qquad(4)$$

The advantage value $\eta_B$ obtained from the angle and distance parameters of the two sides is defined by equation (5) [equation image not reproduced in the source], with the following stated properties: when the UAV tails the target, the advantage value is $\eta_B=1$; when the UAV is tailed by the target, the advantage value is $\eta_B=-1$; and when the distance between the UAV and the target exceeds the maximum interception distance of the missile, the advantage value decays exponentially with distance.

Combining equations (4) and (5), the situation assessment function η of the air combat in which the UAV is engaged is obtained as:

$$\eta=\eta_A+\eta_B\qquad(6)$$
Step 1-3: the geometric relationship of the air combat situation at any moment is completely determined by the information contained in the UAV position vector, the UAV velocity vector, the target position vector and the target velocity vector expressed in the same coordinate system, so the description of the air combat situation consists of the following five parts:

1) velocity information of the UAV, including the speed $v_U$, the flight path angle $\gamma_U$ and the heading angle $\psi_U$;

2) velocity information of the target, including the speed $v_T$, the flight path angle $\gamma_T$ and the heading angle $\psi_T$;

3) the relative position between the UAV and the target, expressed by the distance vector D: the modulus of the distance vector $D=\|\mathbf{D}\|$, the angle $\gamma_D$ between D and the horizontal plane o-x-y, and the angle $\psi_D$ between the projection of D onto the horizontal plane o-x-y and the oy axis; the relative position is thus represented by D, $\gamma_D$ and $\psi_D$;

4) the relative motion between the UAV and the target, comprising the angle $\alpha_U$ between the UAV velocity vector and the distance vector D and the angle $\alpha_T$ between the target velocity vector and D;

5) the altitude $z_U$ of the UAV and the altitude $z_T$ of the target.

Based on variables 1) to 5) above, the 1v1 air combat situation at any moment is completely characterized, so the state space of the 1v1 maneuver decision model is the 13-dimensional vector space s:

$$s=[v_U,\gamma_U,\psi_U,\;v_T,\gamma_T,\psi_T,\;D,\gamma_D,\psi_D,\;\alpha_U,\alpha_T,\;z_U,z_T]\qquad(7)$$

The situation assessment function η is adopted as the reward value R of the air combat maneuver decision, so that the effect of an action on the air combat situation is reflected through the situation assessment function, i.e. R = η;
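A short sketch of how the 13-dimensional state vector of equation (7) and the situation reward R = η might be assembled is given below. The interception-zone test and the exponential distance decay of $\eta_B$ are simplified, assumed forms (equation (5) is not reproduced in the source); all names and default values are illustrative.

```python
import numpy as np

def relative_state(uav, tgt):
    """Build the 13-dimensional 1v1 state vector of equation (7).

    uav, tgt: dicts with keys 'pos' (np.array [m]), 'v' (speed, m/s),
              'gamma' (path angle, rad), 'psi' (heading angle, rad).
    """
    d = tgt['pos'] - uav['pos']                  # distance vector D (UAV -> target)
    dist = np.linalg.norm(d)
    gamma_d = np.arcsin(d[2] / dist)             # elevation of D above the o-x-y plane
    psi_d = np.arctan2(d[0], d[1])               # azimuth of D measured from oy (north)

    def vel_vec(a):                              # velocity vector from (v, gamma, psi)
        return a['v'] * np.array([np.cos(a['gamma']) * np.sin(a['psi']),
                                  np.cos(a['gamma']) * np.cos(a['psi']),
                                  np.sin(a['gamma'])])

    vu, vt = vel_vec(uav), vel_vec(tgt)
    alpha_u = np.arccos(np.clip(vu @ d / (np.linalg.norm(vu) * dist), -1, 1))
    alpha_t = np.arccos(np.clip(vt @ d / (np.linalg.norm(vt) * dist), -1, 1))

    return np.array([uav['v'], uav['gamma'], uav['psi'],
                     tgt['v'], tgt['gamma'], tgt['psi'],
                     dist, gamma_d, psi_d, alpha_u, alpha_t,
                     uav['pos'][2], tgt['pos'][2]])

def situation_reward(dist, alpha_u, alpha_t, d_max=3000.0, fov=np.radians(30), re=5.0):
    """Situation reward R = eta = eta_A + eta_B (simplified stand-in for eqs (3)-(6))."""
    eta_u = re if (dist <= d_max and alpha_u <= fov) else 0.0           # target inside UAV's cone
    eta_t = re if (dist <= d_max and np.pi - alpha_t <= fov) else 0.0   # UAV inside target's cone
    eta_b = 1.0 - (alpha_u + alpha_t) / np.pi                           # +1 tailing, -1 being tailed (assumed form)
    if dist > d_max:
        eta_b *= np.exp((d_max - dist) / d_max)                         # assumed exponential decay
    return (eta_u - eta_t) + eta_b
```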
Step 1-4: in the multi-aircraft air combat, let the number of UAVs be n, denoted $UAV_i$ (i = 1, 2, …, n), and the number of targets be m, denoted $Target_j$ (j = 1, 2, …, m); the number of targets is assumed not to exceed the number of UAVs, i.e. m ≤ n.

Denote the relative state between any pair $UAV_i$ and $Target_j$ by $s_{ij}$, and the relative state between $UAV_i$ and any other friendly aircraft $UAV_k$ by $s_{ik}$. The observation state of any $UAV_i$ in the multi-aircraft air combat is then:

$$S_i=[\,\cup s_{ij}\mid j=1,2,\dots,m,\;\cup s_{ik}\mid k=1,2,\dots,n\ (k\neq i)\,]\qquad(8)$$

In the multi-aircraft air combat, each UAV makes its own maneuver decision according to its situation in the air combat environment. According to the UAV dynamic model of equation (2), flight is controlled through the three variables $n_x$, $n_z$ and μ, so the action space of $UAV_i$ is $A_i=[n_{xi},n_{zi},\mu_i]$.

In the multi-aircraft cooperative air combat, the situation assessment values $\eta^A$ and $\eta^B$ between each UAV and each target are calculated according to equations (4) and (5), respectively; the situation assessment values between $UAV_i$ and $Target_j$ are denoted $\eta^A_{ij}$ and $\eta^B_{ij}$. In addition, the influence of the relative state between $UAV_i$ and its friendly aircraft $UAV_k$ on its own situation must be considered, so the situation assessment function between $UAV_i$ and friendly aircraft $UAV_k$ is defined as:

$$\eta_{ik}=\begin{cases}-P, & D_{ik}<D_{safe}\\ 0, & D_{ik}\ge D_{safe}\end{cases}\qquad(9)$$

where $D_{ik}$ is the distance between $UAV_i$ and its friendly aircraft $UAV_k$, $D_{safe}$ is the minimum safe distance between two UAVs, and P is a positive number.
Step 2: establishing a multi-machine cooperative target distribution method, and determining a target distribution rule during reinforcement learning training;
Step 2-1: in the air combat, n UAVs fight against m targets, with n ≥ m. According to equation (6), the situation assessment value of $UAV_i$ (i = 1, 2, …, n) with respect to $Target_j$ (j = 1, 2, …, m) is denoted $\eta_{ij}$.

Let the target allocation matrix be $X=[x_{ij}]$, where $x_{ij}=1$ denotes that $Target_j$ is allocated to $UAV_i$ and $x_{ij}=0$ denotes that $Target_j$ is not allocated to $UAV_i$. Each UAV can simultaneously launch missiles at no more than L targets located in its attack zone, i.e.

$$\sum_{j=1}^{m}x_{ij}\le L,\quad i=1,2,\dots,n$$

Meanwhile, no target may be left unengaged during the battle, i.e. every target is allocated at least one UAV to attack it:

$$\sum_{i=1}^{n}x_{ij}\ge 1,\quad j=1,2,\dots,m$$

All UAVs are required to take part in the engagement, so that

$$\sum_{j=1}^{m}x_{ij}\ge 1,\quad i=1,2,\dots,n$$

Taking maximization of the situational advantage of the UAVs over the targets as the objective, the target allocation model is established as:

$$\max_{X}\ \sum_{i=1}^{n}\sum_{j=1}^{m}\eta_{ij}\,x_{ij}\qquad(10)$$

subject to the three constraints above.
Step 2-2: in the target allocation process, targets inside an attack zone are allocated first, and targets outside all attack zones are allocated afterwards, so the target allocation method is divided into the following two parts:

Step 2-2-1: preferentially allocate targets located inside an attack zone.

Taking $\eta^A_{ij}$ and $\eta^B_{ij}$ as elements, two n × m matrices $H_A$ and $H_B$ are constructed:

$$H_A=[\eta^A_{ij}]_{n\times m},\qquad H_B=[\eta^B_{ij}]_{n\times m}$$

From equation (3), if $Target_j$ is inside the attack zone of $UAV_i$, then $\eta^A_{ij}=Re$; otherwise $\eta^A_{ij}<Re$. Accordingly, $x_{ij}$ is set to 1 for every pair (i, j) with $\eta^A_{ij}=Re$, i.e. for every target that already lies inside the attack zone of a UAV. During this allocation, if the number x of targets inside the attack zone of $UAV_i$ exceeds the maximum number of targets the UAV can attack, i.e. x > L, the corresponding element values of $UAV_i$ in the matrix $H_B$ are sorted and the L targets with the largest element values are allocated to $UAV_i$;
Step 2-2-2: allocating targets located outside the attack area;
for UAViIf a target within its attack zone has already been allocated, it can no longer be allocated a target outside the attack zone; and for a plurality of targets outside the attack area, the unmanned aerial vehicle cannot makeManeuvering to enable a plurality of targets to be in the attack area, and therefore when the targets are outside the attack area, only one target can be allocated to the unmanned aerial vehicle; therefore, after the target allocation in the attack area is completed, the remaining target allocation work is changed into a process of allocating 1 target to the unallocated unmanned aerial vehicle, and the allocation is realized by adopting the hungarian algorithm, which specifically comprises the following steps:
first, a matrix X is allocated according to the current target [ X ]ij]n×mIs prepared from HBAll of x inijDeleting the ith row and the jth column where the 1 is positioned to obtain a matrix
Figure BDA00029918686100000610
Based on
Figure BDA00029918686100000611
The allocation result is calculated by adopting the Hungarian algorithm, because n is more than or equal to m, and L>0, adopting a margin complementing method to complete the Hungarian algorithm, realizing target distribution, and ordering corresponding xij=1;
After the above two steps are completed, the allocation of all the targets is completed, and a target allocation matrix X ═ X is obtainedij]n×m
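A compact sketch of this two-stage allocation is given below, using scipy's linear_sum_assignment as the Hungarian-algorithm solver for the targets outside the attack zones; the tie-breaking for leftover UAVs and the comparison against Re are illustrative assumptions, not taken from the patent.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_targets(H_A, H_B, Re, L=2):
    """Two-stage target allocation of step 2-2.

    H_A, H_B : n x m matrices of eta^A_ij and eta^B_ij values.
    Returns the n x m allocation matrix X.
    """
    n, m = H_A.shape
    X = np.zeros((n, m), dtype=int)

    # Stage 1: allocate every target already inside a UAV's attack zone
    # (eta^A_ij == Re), keeping at most L targets per UAV, ranked by eta^B.
    for i in range(n):
        in_zone = np.where(np.isclose(H_A[i], Re))[0]
        if len(in_zone) > L:
            in_zone = in_zone[np.argsort(H_B[i, in_zone])[::-1][:L]]
        X[i, in_zone] = 1

    # Stage 2: assign one remaining target to each still-unassigned UAV with the
    # Hungarian algorithm on the reduced eta^B matrix (negated to maximize advantage).
    free_uavs = [i for i in range(n) if X[i].sum() == 0]
    free_tgts = [j for j in range(m) if X[:, j].sum() == 0]
    if free_uavs and free_tgts:
        sub = H_B[np.ix_(free_uavs, free_tgts)]
        rows, cols = linear_sum_assignment(-sub)
        for r, c in zip(rows, cols):
            X[free_uavs[r], free_tgts[c]] = 1

    # Any UAV still without a target (possible when n > m) attacks its best eta^B
    # target, so that every UAV takes part in the engagement (illustrative choice).
    for i in range(n):
        if X[i].sum() == 0:
            X[i, np.argmax(H_B[i])] = 1
    return X
```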
And step 3: designing a multi-machine cooperative maneuver strategy learning algorithm and determining a reinforcement learning training logic;
the multi-machine cooperative maneuver strategy learning algorithm comprises a strategy coordination mechanism and a strategy learning mechanism:
step 3-1: designing a strategy coordination mechanism;
The air combat confrontation is regarded as a competitive game between the n UAVs and the m targets, and a model is established within the framework of a stochastic game. A stochastic game can be represented by a tuple $(S, A_1,\dots,A_n, B_1,\dots,B_m, T, R_1,\dots,R_n)$, where S is the state space of the current game, shared by all agents; the action space of $UAV_i$ is $A_i$ and the action space of $Target_j$ is $B_j$; $T: S\times A^n\times B^m \to S$ is the deterministic transition function of the environment; and $R_i: S\times A^n\times B^m \to \mathbb{R}$ is the reward value function of $UAV_i$. The action spaces of the UAVs within each formation in the cooperative air combat are identical, i.e. $A_i = A$ for every $UAV_i$ and $B_j = B$ for every $Target_j$.

The global reward value of the UAV formation is defined as the average of the reward values of the individual UAVs:

$$r(s,a,b)=\frac{1}{n}\sum_{i=1}^{n}r_i(s,a,b)\qquad(11)$$

where r(s, a, b) is the reward value obtained by the UAV formation when, at time t, the environment state is s, the UAV formation takes the joint action a ∈ A^n and the target formation takes the joint action b ∈ B^m;
The goal of the UAV formation is to learn a strategy that maximizes the expected discounted accumulation of reward values $\mathbb{E}\big[\sum_{t\ge 0}\lambda^{t}\,r_t\big]$, where 0 < λ ≤ 1 is the discount factor. The stochastic game is thereby transformed into a Markov decision problem:

$$Q^{*}(s,a)=r(s,a)+\lambda\,Q^{*}\!\big(s',a_{\theta}(s')\big)\qquad(12)$$

where $Q^{*}(\cdot)$ is the state–action value function for executing action a in state s, r(s, a) is the reward value for executing action a in state s, θ denotes the network parameters of the policy function, s' is the state at the next time step, and $a_\theta$ is the parameterized policy function;
The reward value function of each individual UAV is defined as:

$$r_i(s,a,b)=\sum_{j=1}^{m}x_{ij}\,\eta_{ij}+\sum_{k\ne i}\eta_{ik}\qquad(13)$$

where $r_i(s,a,b)$ is the reward value obtained by $UAV_i$ when, at time t, the environment state is s, the UAV formation takes joint action a ∈ A^n and the target formation takes joint action b ∈ B^m; the first term characterizes the situational advantage of $UAV_i$ relative to the target(s) allocated to it, and the second term is a penalty term constraining the distance between $UAV_i$ and its friendly aircraft;
Based on equation (13), for the n individual UAVs there are n Bellman equations of the form (14), in which the policy functions $a_\theta$ share the same parameters θ:

$$Q_i^{*}(s,a)=r_i(s,a)+\lambda\,Q_i^{*}\!\big(s',a_{\theta}(s')\big),\quad i=1,2,\dots,n\qquad(14)$$

where $Q_i^{*}(s,a)$ is the state–action value function of $UAV_i$ for executing action a in state s, and $r_i(s,a)$ is the reward value obtained by $UAV_i$ for executing action a in state s;
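The following sketch combines the allocation matrix with the pairwise advantage values to produce the per-UAV reward of equation (13), including the friendly-distance penalty of equation (9); as above, the exact functional forms are reconstructions and the parameter values are illustrative.

```python
import numpy as np

def uav_rewards(X, eta, uav_positions, d_safe=100.0, P=5.0):
    """Per-UAV reward r_i of equation (13).

    X             : n x m allocation matrix.
    eta           : n x m matrix of situation values eta_ij = eta^A_ij + eta^B_ij (equation (6)).
    uav_positions : n x 3 array of UAV positions [m].
    """
    n = X.shape[0]
    rewards = np.zeros(n)
    for i in range(n):
        # situational advantage over the target(s) allocated to UAV_i
        rewards[i] = np.sum(X[i] * eta[i])
        # collision-avoidance penalty, equation (9)
        for k in range(n):
            if k != i and np.linalg.norm(uav_positions[i] - uav_positions[k]) < d_safe:
                rewards[i] -= P
    return rewards
```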
step 3-2: designing a strategy learning mechanism;
A bidirectional recurrent neural network (BRNN) is adopted to establish the multi-UAV maneuver decision model.
The multi-UAV air combat maneuver decision model consists of an Actor network and a Critic network: the Actor network is formed by connecting the Actor networks of the individual UAVs through the BRNN, and the Critic network is formed by connecting the Critic networks of the individual UAVs through the BRNN. The hidden layers of the policy network (Actor) and the Q network (Critic) of the single-UAV decision model are set as BRNN recurrent units in the multi-UAV air combat maneuver decision model, and the BRNN is then unrolled according to the number of UAVs. The input of the multi-UAV air combat maneuver decision model is the current air combat situation, and the output is the action value of each UAV;
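As one possible realization of this architecture, the sketch below links the per-UAV hidden layers with a bidirectional GRU, with the UAV index playing the role of the sequence dimension — one common way to implement a BRNN. The layer sizes, the choice of GRU cells and the Tanh output squashing are assumptions, not prescribed by the patent.

```python
import torch
import torch.nn as nn

class BRNNActor(nn.Module):
    """Formation Actor: per-UAV encoders linked by a bidirectional GRU over the UAV axis."""

    def __init__(self, obs_dim, act_dim=3, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Bidirectional recurrence across the n UAVs realizes the communication network.
        self.brnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, act_dim), nn.Tanh())

    def forward(self, obs):
        # obs: (batch, n_uavs, obs_dim) -> actions in [-1, 1]: (batch, n_uavs, act_dim),
        # later rescaled to the physical ranges of (n_x, n_z, mu).
        h = self.encoder(obs)
        h, _ = self.brnn(h)
        return self.head(h)

class BRNNCritic(nn.Module):
    """Formation Critic: per-UAV Q values from (observation, action) pairs linked by a BRNN."""

    def __init__(self, obs_dim, act_dim=3, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        self.brnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, obs, act):
        h = self.encoder(torch.cat([obs, act], dim=-1))
        h, _ = self.brnn(h)
        return self.head(h).squeeze(-1)   # (batch, n_uavs) individual Q_i values
```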
The objective function of $UAV_i$ is defined as $J_i(\theta)=\mathbb{E}_{s\sim\rho^{a_\theta}}\big[\sum_{t\ge 0}\lambda^{t}r_i\big]$, the expected accumulation of the individual reward values $r_i$, where $\rho^{a_\theta}$ denotes the stationary state distribution obtained under the state transition function T when the action policy $a_\theta$ is adopted in the ergodic Markov decision process. The objective function of the n UAVs is therefore written as J(θ):

$$J(\theta)=\sum_{i=1}^{n}J_i(\theta)\qquad(15)$$

According to the multi-agent deterministic policy gradient theorem, for the objective function J(θ) of the n UAVs in equation (15), the gradient with respect to the policy network parameters θ is

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s\sim\rho^{a_\theta}}\Big[\sum_{i=1}^{n}\nabla_{\theta}a_{\theta}(s_i)\,\nabla_{a_i}Q_i^{a_\theta}(s,a)\big|_{a=a_{\theta}(s)}\Big]\qquad(16)$$

A parameterized Critic function $Q_{\xi}(s,a)$ is used to estimate the state–action function $Q_i^{a_\theta}(s,a)$ in equation (16). When training the Critic, the sum-of-squares loss is used, and the gradient of the parameterized Critic function $Q_{\xi}(s,a)$ is given by equation (17), where ξ are the parameters of the Q network:

$$\nabla_{\xi}L(\xi)=\mathbb{E}\Big[\big(Q_{\xi}(s,a)-y\big)\,\nabla_{\xi}Q_{\xi}(s,a)\Big],\qquad y=r_i(s,a)+\lambda\,Q_{\xi'}\big(s',a_{\theta'}(s')\big)\qquad(17)$$
The Actor and Critic networks are optimized by stochastic gradient descent based on equations (16) and (17); during the interactive learning process, the parameters are updated with the data obtained by trial and error, completing the learning and optimization of the cooperative air combat strategy;
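A minimal update step corresponding to equations (16) and (17), written against the Actor and Critic classes sketched above, might look as follows; the optimizer choice and hyper-parameters are assumptions.

```python
import torch

def update(actor, critic, target_actor, target_critic, batch,
           actor_opt, critic_opt, lam=0.99):
    """One stochastic-gradient update of the online Actor and Critic (eqs (16), (17))."""
    obs, act, rew, next_obs = batch   # tensors: (M, n, obs_dim), (M, n, 3), (M, n), (M, n, obs_dim)

    # Critic: sum-of-squares loss against the target Q value (equations (17)/(18))
    with torch.no_grad():
        target_q = rew + lam * target_critic(next_obs, target_actor(next_obs))
    critic_loss = ((critic(obs, act) - target_q) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, ascend the individual Q_i values (equation (16))
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```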
Step 3-3: according to the above strategy coordination mechanism and strategy learning mechanism, the reinforcement learning training process of the multi-UAV cooperative air combat maneuver decision model is determined as follows:

Step 3-3-1: initialization: determine the forces and situation of both sides of the air combat, arranging n UAVs against m targets with n ≥ m. Randomly initialize the parameters θ of the Actor online network and the parameters ξ of the Critic online network, then assign the parameters of the Actor and Critic online networks to the corresponding target networks, i.e. θ' ← θ and ξ' ← ξ, where θ' and ξ' are the parameters of the Actor and Critic target networks, respectively. Initialize an experience pool $R_1$ for storing the experience data obtained through exploratory interaction, and initialize a random process ε used for the exploration of action values;

Step 3-3-2: determine the initial state of training, i.e. the relative situation of the two sides at the start of the air combat. Set the initial position and velocity information of every UAV in the UAV formation and in the target formation, i.e. determine the (x, y, z, v, γ, ψ) information of each aircraft, and compute the initial air combat state $s_1$ according to the definition of the state space; let t = 1;
Step 3-3-3: multi-episode training is carried out repeatedly from the initial state, and in each single-episode air combat simulation the following operations are executed:

First, according to the current air combat state $s_t$, the target allocation matrix $X_t$ is computed with the target allocation method described above. Then each $UAV_i$ generates an action value $a_t^i$ according to the state $s_t$ and the random process ε and executes it; at the same time, each $Target_j$ in the target formation executes an action $b_t^j$. After execution, the state transitions to $s_{t+1}$, and the reward value $r_t^i$ of each UAV is calculated according to equation (13). The transition $(s_t, a_t, b_t, r_t, s_{t+1})$ is stored as one piece of experience data in the experience pool $R_1$. During learning, a batch of M pieces of experience data $(s_k, a_k, b_k, r_k, s_{k+1})$, k = 1, …, M, is randomly sampled from the experience pool $R_1$, and the target Q value of each UAV is calculated, i.e. for each of the M pieces of data:

$$y_k^i=r_k^i+\lambda\,Q_{\xi'}\big(s_{k+1},a_{\theta'}(s_{k+1})\big)\qquad(18)$$

The gradient estimate of the Critic is calculated according to equation (17):

$$\Delta\xi=\frac{1}{M}\sum_{k=1}^{M}\big(Q_{\xi}(s_k,a_k)-y_k\big)\,\nabla_{\xi}Q_{\xi}(s_k,a_k)\qquad(19)$$

The gradient estimate of the Actor is calculated according to equation (16):

$$\Delta\theta=\frac{1}{M}\sum_{k=1}^{M}\sum_{i=1}^{n}\nabla_{\theta}a_{\theta}(s_k^i)\,\nabla_{a_i}Q_{\xi}(s_k,a)\big|_{a=a_{\theta}(s_k)}\qquad(20)$$

The online network parameters of the Actor and the Critic are updated with an optimizer using the obtained gradient estimates Δξ and Δθ. After the online networks have been optimized, the target network parameters are updated by soft update:

$$\theta'\leftarrow\kappa\,\theta+(1-\kappa)\,\theta',\qquad \xi'\leftarrow\kappa\,\xi+(1-\kappa)\,\xi'$$

where κ ∈ (0, 1);

Step 3-3-4: after a single episode of simulation is finished, if the set maximum number of episodes has been reached, the reinforcement learning training is stopped; otherwise t is increased by 1 and step 3-3-3 is executed again.
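The training procedure of steps 3-3-1 to 3-3-4 can be summarized in the following loop, written against the classes and update function sketched above; the environment object, its methods, the exploration noise and the episode length are illustrative placeholders rather than elements of the patent.

```python
import copy, random, collections
import numpy as np
import torch

def train(env, actor, critic, episodes=1000, steps=200, M=64, kappa=0.01):
    """Reinforcement learning training loop of step 3-3 (sketch)."""
    target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)  # theta' <- theta, xi' <- xi
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    replay = collections.deque(maxlen=100_000)                                 # experience pool R1

    for ep in range(episodes):
        s = env.reset()                                  # initial air combat state s1
        for t in range(steps):
            X = env.allocate_targets(s)                  # target allocation matrix X_t (step 2)
            with torch.no_grad():
                a = actor(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).squeeze(0).numpy()
            a = np.clip(a + np.random.normal(0, 0.1, a.shape), -1, 1)   # random exploration process
            s_next, r, done = env.step(a, X)             # targets act inside env.step; r from eq (13)
            replay.append((s, a, r, s_next))
            s = s_next

            if len(replay) >= M:                         # sample a batch of M experiences
                batch = [torch.as_tensor(np.stack(x), dtype=torch.float32)
                         for x in zip(*random.sample(replay, M))]
                update(actor, critic, target_actor, target_critic, batch, actor_opt, critic_opt)
                # soft update of the target networks with factor kappa
                for tp, p in zip(target_actor.parameters(), actor.parameters()):
                    tp.data.mul_(1 - kappa).add_(kappa * p.data)
                for tp, p in zip(target_critic.parameters(), critic.parameters()):
                    tp.data.mul_(1 - kappa).add_(kappa * p.data)
            if done:
                break
```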
The invention has the following beneficial effects:
Based on a multi-agent reinforcement learning method, the invention establishes a method for generating multi-UAV cooperative air combat maneuver decision strategies. A bidirectional recurrent neural network is adopted to build a communication network that connects the separate UAVs into a formation-level cooperative decision network, and a multi-UAV cooperative air combat maneuver decision model is established under the Actor-Critic architecture, unifying the learning of individual UAV behaviors with the overall combat objective of the formation. Unlike approaches that decompose a multi-aircraft air combat into several 1v1 air combats, the multi-UAV cooperative air combat maneuver decision model established by the invention can obtain cooperative air combat maneuver strategies through autonomous learning and realize tactical coordination during the air combat, so as to achieve a situational advantage for the formation as a whole and defeat the opponents.
Drawings
FIG. 1 is a three-degree-of-freedom particle motion model of the unmanned aerial vehicle.
FIG. 2 is a one-to-one close-up air combat situation diagram of the present invention.
FIG. 3 is a diagram showing the result of the maneuver decision of the UAV under the condition of uniform velocity and linear flight.
FIG. 4 is a model structure of the multi-unmanned aerial vehicle collaborative air combat maneuver decision based on the bidirectional cyclic neural network.
FIG. 5 is a schematic diagram of an air combat simulated maneuver trajectory based on learned strategies after training is completed.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The invention aims to provide a method for generating a multi-unmanned aerial vehicle collaborative air combat autonomous maneuver decision based on multi-agent reinforcement learning.
The invention realizes consistency of state understanding among the UAVs through a communication network. According to the characteristics of multi-target attack, the reinforcement learning reward value of each UAV is calculated by combining the target allocation with the air combat situation assessment value, and the individual reinforcement learning process is guided through the reward of each UAV, so that the tactical objectives of the formation are closely combined with the learning objective of each single UAV and a cooperative tactical maneuver strategy is generated. Tactical coordination is realized during the air combat, the situational advantage of the formation as a whole is achieved, and the opponents are defeated.
A multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning comprises the following steps:
step 1: establishing a multi-machine air combat environment model, and defining a state space, an action space and a reward value for each unmanned aerial vehicle to make a maneuver decision in the multi-machine cooperative air combat process;
Step 1-1: in the ground coordinate system, the ox axis points due east, the oy axis due north, and the oz axis vertically upward. The kinematic model of the UAV in the ground coordinate system is given by equation (1) (the standard three-degree-of-freedom point-mass form):

$$\begin{cases}\dot{x}=v\cos\gamma\sin\psi\\ \dot{y}=v\cos\gamma\cos\psi\\ \dot{z}=v\sin\gamma\end{cases}\qquad(1)$$

In the ground coordinate system, the dynamic model of the UAV is given by equation (2):

$$\begin{cases}\dot{v}=g\,(n_x-\sin\gamma)\\ \dot{\gamma}=\dfrac{g}{v}\,(n_z\cos\mu-\cos\gamma)\\ \dot{\psi}=\dfrac{g\,n_z\sin\mu}{v\cos\gamma}\end{cases}\qquad(2)$$

where (x, y, z) is the position of the UAV in the ground coordinate system, v is the UAV speed, and $\dot{x}$, $\dot{y}$, $\dot{z}$ are the components of v along the x, y and z axes; the flight path angle γ is the angle between the velocity v and the horizontal plane o-x-y; the heading angle ψ is the angle between the projection v' of the velocity v onto the o-x-y plane and the oy axis; g is the gravitational acceleration. $[n_x, n_z, \mu]$ are the control variables used to steer the UAV: $n_x$ is the overload along the velocity direction, representing thrust and deceleration; $n_z$ is the overload in the pitch direction, i.e. the normal overload; μ is the roll angle about the UAV velocity vector. The UAV speed is controlled through $n_x$, and the direction of the velocity vector is controlled through $n_z$ and μ, so that the UAV performs maneuvering actions, as shown in FIG. 1;
Step 1-2: the missile is assumed to have tail-attack capability only. In the interception zone of the missile, $v_U$ and $v_T$ denote the velocities of the UAV and of the target, respectively; D is the distance vector, describing the relative position between the UAV and the target; $\alpha_U$ and $\alpha_T$ denote the angle between the UAV velocity vector and the distance vector D and the angle between the target velocity vector and D, respectively.
Let the maximum interception distance of the missile be $D_m$ and let its field-of-view angle be given; the interception zone of the missile is then a conical region Ω. The maneuvering goal of the UAV in air combat is to drive the target into its own interception zone $\Omega_U$ while preventing itself from entering the interception zone $\Omega_T$ of the target;
According to the definition of the missile interception zone, if the target lies inside the interception zone of the UAV's own missile, the UAV can launch a weapon at the target and is therefore at an advantage. The advantage value $\eta_U$ obtained when the UAV can intercept the target is defined as:

$$\eta_U=\begin{cases}Re, & (x_T,y_T,z_T)\in\Omega_U\\ 0, & \text{otherwise}\end{cases}\qquad(3)$$

where $(x_T, y_T, z_T)$ are the position coordinates of the target, and Re is a large positive number that can be adjusted manually according to the training effect in order to guide the training of the model.

The advantage value $\eta_T$ obtained when the target can intercept the UAV is defined analogously:

$$\eta_T=\begin{cases}Re, & (x_U,y_U,z_U)\in\Omega_T\\ 0, & \text{otherwise}\end{cases}$$

where $(x_U, y_U, z_U)$ are the position coordinates of the UAV.

In the air combat, the advantage value $\eta_A$ obtained by the UAV from the interception opportunity is defined as:

$$\eta_A=\eta_U-\eta_T\qquad(4)$$

In addition, because the field-of-view angle of an aircraft gun and of some missiles is small, a launch condition can only be formed when tailing the opponent, so the requirement on the angular relationship is severe. The advantage value $\eta_B$ obtained from the angle and distance parameters of the two sides is defined by equation (5) [equation image not reproduced in the source], with the following stated properties: when the UAV tails the target, the advantage value is $\eta_B=1$; when the UAV is tailed by the target, the advantage value is $\eta_B=-1$; and when the distance between the UAV and the target exceeds the maximum interception distance of the missile, the advantage value decays exponentially with distance.

Combining equations (4) and (5), the situation assessment function η of the air combat in which the UAV is engaged is obtained as:

$$\eta=\eta_A+\eta_B\qquad(6)$$
Step 1-3: the state of the air combat maneuver decision model consists of a set of variables that completely describe the air combat situation. As shown in FIG. 2, the geometric relationship of the air combat situation at any moment is completely determined by the information contained in the UAV position vector, the UAV velocity vector, the target position vector and the target velocity vector expressed in the same coordinate system, so the description of the air combat situation consists of the following five parts:

1) velocity information of the UAV, including the speed $v_U$, the flight path angle $\gamma_U$ and the heading angle $\psi_U$;

2) velocity information of the target, including the speed $v_T$, the flight path angle $\gamma_T$ and the heading angle $\psi_T$;

3) the relative position between the UAV and the target, expressed by the distance vector D: the modulus of the distance vector $D=\|\mathbf{D}\|$, the angle $\gamma_D$ between D and the horizontal plane o-x-y, and the angle $\psi_D$ between the projection of D onto the horizontal plane o-x-y and the oy axis; the relative position is thus represented by D, $\gamma_D$ and $\psi_D$;

4) the relative motion between the UAV and the target, comprising the angle $\alpha_U$ between the UAV velocity vector and the distance vector D and the angle $\alpha_T$ between the target velocity vector and D;

5) the altitude $z_U$ of the UAV and the altitude $z_T$ of the target.

Based on variables 1) to 5) above, the 1v1 air combat situation at any moment is completely characterized, so the state space of the 1v1 maneuver decision model is the 13-dimensional vector space s:

$$s=[v_U,\gamma_U,\psi_U,\;v_T,\gamma_T,\psi_T,\;D,\gamma_D,\psi_D,\;\alpha_U,\alpha_T,\;z_U,z_T]\qquad(7)$$

The situation assessment function η is adopted as the reward value R of the air combat maneuver decision, so that the effect of an action on the air combat situation is reflected through the situation assessment function, i.e. R = η;
Step 1-4: in the multi-aircraft air combat, let the number of UAVs be n, denoted $UAV_i$ (i = 1, 2, …, n), and the number of targets be m, denoted $Target_j$ (j = 1, 2, …, m); the number of targets is assumed not to exceed the number of UAVs, i.e. m ≤ n.

As shown in FIG. 3, as the number of UAVs and targets in a multi-aircraft air combat increases, each UAV must take the relative states of all other aircraft (targets as well as friendly aircraft) into account when making its maneuver decision. The relative situation between one UAV and another aircraft in the air combat can be fully described by the 13 variables of equation (7). Denote the relative state between any pair $UAV_i$ and $Target_j$ by $s_{ij}$, and the relative state between $UAV_i$ and any other friendly aircraft $UAV_k$ by $s_{ik}$. The observation state of any $UAV_i$ in the multi-aircraft air combat is then:

$$S_i=[\,\cup s_{ij}\mid j=1,2,\dots,m,\;\cup s_{ik}\mid k=1,2,\dots,n\ (k\neq i)\,]\qquad(8)$$

In the multi-aircraft air combat, each UAV makes its own maneuver decision according to its situation in the air combat environment. According to the UAV dynamic model of equation (2), flight is controlled through the three variables $n_x$, $n_z$ and μ, so the action space of $UAV_i$ is $A_i=[n_{xi},n_{zi},\mu_i]$.

In the multi-aircraft cooperative air combat, the situation assessment values $\eta^A$ and $\eta^B$ between each UAV and each target are calculated according to equations (4) and (5), respectively; the situation assessment values between $UAV_i$ and $Target_j$ are denoted $\eta^A_{ij}$ and $\eta^B_{ij}$. In addition, the relative state between $UAV_i$ and its friendly aircraft $UAV_k$ must also be considered: if the distance to a friendly aircraft is too small, the risk of collision increases, so the situation assessment function between $UAV_i$ and friendly aircraft $UAV_k$ is defined as:

$$\eta_{ik}=\begin{cases}-P, & D_{ik}<D_{safe}\\ 0, & D_{ik}\ge D_{safe}\end{cases}\qquad(9)$$

where $D_{ik}$ is the distance between $UAV_i$ and its friendly aircraft $UAV_k$, $D_{safe}$ is the minimum safe distance between two UAVs, and P is a large positive number.
Step 2: establishing a multi-machine cooperative target distribution method, and determining a target distribution rule during reinforcement learning training;
In the multi-aircraft cooperative air combat, from the overall perspective of the engagement, the maximum advantage of the UAV formation means that every enemy aircraft can be attacked by a weapon of the formation; however, each UAV can only maneuver against one target at a time, so multi-aircraft cooperative air combat also requires target allocation to be carried out at the same time as maneuver decision-making, realizing cooperation of the tactical strategies.

Step 2-1: in the air combat, n UAVs fight against m targets, with n ≥ m. According to equation (6), the situation assessment value of $UAV_i$ (i = 1, 2, …, n) with respect to $Target_j$ (j = 1, 2, …, m) is denoted $\eta_{ij}$.

Let the target allocation matrix be $X=[x_{ij}]$, where $x_{ij}=1$ denotes that $Target_j$ is allocated to $UAV_i$ and $x_{ij}=0$ denotes that $Target_j$ is not allocated to $UAV_i$. During the multi-aircraft air combat, several targets may be inside the attack zone of one UAV at the same time, so the multi-target attack capability of the UAV must be considered in the target allocation; each UAV is designed to be able to launch missiles at no more than L targets in its attack zone simultaneously, i.e.

$$\sum_{j=1}^{m}x_{ij}\le L,\quad i=1,2,\dots,n$$

Meanwhile, no target may be left unengaged during the battle, i.e. every target is allocated at least one UAV to attack it:

$$\sum_{i=1}^{n}x_{ij}\ge 1,\quad j=1,2,\dots,m$$

All UAVs are required to take part in the engagement, so that

$$\sum_{j=1}^{m}x_{ij}\ge 1,\quad i=1,2,\dots,n$$

Taking maximization of the situational advantage of the UAVs over the targets as the objective, the target allocation model is established as:

$$\max_{X}\ \sum_{i=1}^{n}\sum_{j=1}^{m}\eta_{ij}\,x_{ij}\qquad(10)$$

subject to the three constraints above.
Step 2-2: the UAV performs a series of maneuvers in the air combat so that a target enters its attack zone and a weapon can be launched at it. In the target allocation process, targets inside an attack zone are allocated first, and targets outside all attack zones are allocated afterwards, so the target allocation method is divided into the following two parts:

Step 2-2-1: preferentially allocate targets located inside an attack zone.

Taking $\eta^A_{ij}$ and $\eta^B_{ij}$ as elements, two n × m matrices $H_A$ and $H_B$ are constructed:

$$H_A=[\eta^A_{ij}]_{n\times m},\qquad H_B=[\eta^B_{ij}]_{n\times m}$$

From equation (3), if $Target_j$ is inside the attack zone of $UAV_i$, then $\eta^A_{ij}=Re$; otherwise $\eta^A_{ij}<Re$. Accordingly, $x_{ij}$ is set to 1 for every pair (i, j) with $\eta^A_{ij}=Re$, i.e. for every target that already lies inside the attack zone of a UAV. During this allocation, if the number x of targets inside the attack zone of $UAV_i$ exceeds the maximum number of targets the UAV can attack, i.e. x > L, the corresponding element values of $UAV_i$ in the matrix $H_B$ are sorted and the L targets with the largest element values are allocated to $UAV_i$;
Step 2-2-2: allocating targets located outside the attack area;
for UAViIf a target within its attack zone has already been allocated, it can no longer be allocated a target outside the attack zone; for a plurality of targets outside the attack area, the unmanned aerial vehicle cannot maneuver so that the targets are in the attack area, and therefore when the targets are outside the attack area, only one target can be allocated to the unmanned aerial vehicle; therefore, after the target allocation in the attack area is completed, the remaining target allocation work is changed into a process of allocating 1 target to the unallocated unmanned aerial vehicle, and the allocation is realized by adopting the hungarian algorithm, which specifically comprises the following steps:
first, a matrix X is allocated according to the current target [ X ]ij]n×mIs prepared from HBAll of x inijDeleting the ith row and the jth column where the 1 is positioned to obtain a matrix
Figure BDA0002991868610000161
Based on
Figure BDA0002991868610000162
The allocation result is calculated by adopting the Hungarian algorithm, because n is more than or equal to m, and L>0, adopting a margin complementing method to complete the Hungarian algorithm, realizing target distribution, and ordering corresponding xij=1;
After the above two steps are completed, the allocation of all the targets is completed, and a target allocation matrix X ═ X is obtainedij]n×m
And step 3: designing a multi-machine cooperative maneuver strategy learning algorithm and determining a reinforcement learning training logic;
the multi-machine cooperative maneuver strategy learning algorithm comprises a strategy coordination mechanism and a strategy learning mechanism:
step 3-1: designing a strategy coordination mechanism;
The air combat confrontation is regarded as a competitive game between the n UAVs and the m targets, and a model is established within the framework of a stochastic game. A stochastic game can be represented by a tuple $(S, A_1,\dots,A_n, B_1,\dots,B_m, T, R_1,\dots,R_n)$, where S is the state space of the current game, shared by all agents; the action space of $UAV_i$ is $A_i$ and the action space of $Target_j$ is $B_j$; $T: S\times A^n\times B^m \to S$ is the deterministic transition function of the environment; and $R_i: S\times A^n\times B^m \to \mathbb{R}$ is the reward value function of $UAV_i$. The action spaces of the UAVs within each formation in the cooperative air combat are identical, i.e. $A_i = A$ for every $UAV_i$ and $B_j = B$ for every $Target_j$.

Whether the UAVs are at an advantage in the cooperative air combat confrontation is evaluated according to the situation of all the UAVs. The global reward value of the UAV formation is therefore defined as the average of the reward values of the individual UAVs:

$$r(s,a,b)=\frac{1}{n}\sum_{i=1}^{n}r_i(s,a,b)\qquad(11)$$

where r(s, a, b) is the reward value obtained by the UAV formation when, at time t, the environment state is s, the UAV formation takes the joint action a ∈ A^n and the target formation takes the joint action b ∈ B^m;
The goal of the UAV formation is to learn a strategy that maximizes the expected discounted accumulation of reward values $\mathbb{E}\big[\sum_{t\ge 0}\lambda^{t}\,r_t\big]$, where 0 < λ ≤ 1 is the discount factor. The stochastic game is thereby transformed into a Markov decision problem:

$$Q^{*}(s,a)=r(s,a)+\lambda\,Q^{*}\!\big(s',a_{\theta}(s')\big)\qquad(12)$$

where $Q^{*}(\cdot)$ is the state–action value function for executing action a in state s, r(s, a) is the reward value for executing action a in state s, θ denotes the network parameters of the policy function, s' is the state at the next time step, and $a_\theta$ is the parameterized policy function;
The global reward value defined by equation (11) can reflect the situation of the UAV formation as a whole, but it cannot reflect the contribution of an individual UAV to the cooperation of the formation. In fact, global coordination is driven by the goals of each individual; therefore, the reward value function of each individual UAV is defined as:

$$r_i(s,a,b)=\sum_{j=1}^{m}x_{ij}\,\eta_{ij}+\sum_{k\ne i}\eta_{ik}\qquad(13)$$

where $r_i(s,a,b)$ is the reward value obtained by $UAV_i$ when, at time t, the environment state is s, the UAV formation takes joint action a ∈ A^n and the target formation takes joint action b ∈ B^m; the first term characterizes the situational advantage of $UAV_i$ relative to the target(s) allocated to it, and the second term is a penalty term constraining the distance between $UAV_i$ and its friendly aircraft.

Based on equation (13), for the n individual UAVs there are n Bellman equations of the form (14), in which the policy functions $a_\theta$ share the same parameters θ:

$$Q_i^{*}(s,a)=r_i(s,a)+\lambda\,Q_i^{*}\!\big(s',a_{\theta}(s')\big),\quad i=1,2,\dots,n\qquad(14)$$

where $Q_i^{*}(s,a)$ is the state–action value function of $UAV_i$ for executing action a in state s, and $r_i(s,a)$ is the reward value obtained by $UAV_i$ for executing action a in state s;
In the learning and training process, the behavior feedback of each UAV with respect to target allocation, situational advantage and safe collision avoidance is defined through the distribution of reward values; after training, strategy cooperation is achieved, and the behavior of each UAV tacitly coordinates with the behaviors of the other friendly aircraft without requiring centralized target allocation.
Step 3-2: designing a strategy learning mechanism;
the premise of realizing collective cooperation based on multi-agent reinforcement learning is that information interaction among individuals, so that a bidirectional cyclic neural network BRNN is adopted to establish a multi-unmanned aerial vehicle maneuvering decision model, the information interaction among unmanned aerial vehicles is ensured, and the coordination of a formation maneuvering strategy is realized;
the model is established as shown in fig. 4, the multi-unmanned aerial vehicle air combat maneuver decision model is composed of an Actor network and a criticic network, wherein the Actor network is formed by connecting Actor networks of all unmanned aerial vehicle individuals through BRNN, and the criticic network is formed by connecting criticic networks of all unmanned aerial vehicle individuals through BRNN; setting hidden layers in strategy networks Actor and Q networks Critic in a single unmanned aerial vehicle decision model into a BRNN circulating unit in a multi-unmanned aerial vehicle air combat maneuver decision model, and then expanding the BRNN according to the number of unmanned aerial vehicles; the input of the air combat maneuver decision model of the multiple unmanned aerial vehicles is the current air combat situation, and action values of all the unmanned aerial vehicles are output;
since the model is built based on BRNN, it is learned for network parametersThe idea is to expand the network into n (number of drones) sub-networks to calculate the inverse gradient and then update the network parameters using a time-based back propagation algorithm. Gradient at Q of each individual droneiThe functions and the strategy functions are propagated, and when the model is learned, the individual reward value of each unmanned aerial vehicle influences the action of each unmanned aerial vehicle, so that the generated gradient information is reversely propagated, and the model parameters are updated.
The objective function of $UAV_i$ is defined as $J_i(\theta)=\mathbb{E}_{s\sim\rho^{a_\theta}}\big[\sum_{t\ge 0}\lambda^{t}r_i\big]$, the expected accumulation of the individual reward values $r_i$, where $\rho^{a_\theta}$ denotes the state distribution obtained under the state transition function T when the action policy $a_\theta$ is adopted; this distribution is generally stationary in the ergodic Markov decision process. The objective function of the n UAVs is therefore written as J(θ):

$$J(\theta)=\sum_{i=1}^{n}J_i(\theta)\qquad(15)$$

According to the multi-agent deterministic policy gradient theorem, for the objective function J(θ) of the n UAVs in equation (15), the gradient with respect to the policy network parameters θ is

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s\sim\rho^{a_\theta}}\Big[\sum_{i=1}^{n}\nabla_{\theta}a_{\theta}(s_i)\,\nabla_{a_i}Q_i^{a_\theta}(s,a)\big|_{a=a_{\theta}(s)}\Big]\qquad(16)$$

A parameterized Critic function $Q_{\xi}(s,a)$ is used to estimate the state–action function $Q_i^{a_\theta}(s,a)$ in equation (16). When training the Critic, the sum-of-squares loss is used, and the gradient of the parameterized Critic function $Q_{\xi}(s,a)$ is given by equation (17), where ξ are the parameters of the Q network:

$$\nabla_{\xi}L(\xi)=\mathbb{E}\Big[\big(Q_{\xi}(s,a)-y\big)\,\nabla_{\xi}Q_{\xi}(s,a)\Big],\qquad y=r_i(s,a)+\lambda\,Q_{\xi'}\big(s',a_{\theta'}(s')\big)\qquad(17)$$

The Actor and Critic networks are optimized by stochastic gradient descent based on equations (16) and (17); during the interactive learning process, the parameters are updated with the data obtained by trial and error, completing the learning and optimization of the cooperative air combat strategy;
Step 3-3: according to the above strategy coordination mechanism and strategy learning mechanism, the reinforcement learning training process of the multi-UAV cooperative air combat maneuver decision model is determined as follows:

Step 3-3-1: initialization: determine the forces and situation of both sides of the air combat, arranging n UAVs against m targets with n ≥ m. Randomly initialize the parameters θ of the Actor online network and the parameters ξ of the Critic online network, then assign the parameters of the Actor and Critic online networks to the corresponding target networks, i.e. θ' ← θ and ξ' ← ξ, where θ' and ξ' are the parameters of the Actor and Critic target networks, respectively. Initialize an experience pool $R_1$ for storing the experience data obtained through exploratory interaction, and initialize a random process ε used for the exploration of action values;

Step 3-3-2: determine the initial state of training, i.e. the relative situation of the two sides at the start of the air combat. Set the initial position and velocity information of every UAV in the UAV formation and in the target formation, i.e. determine the (x, y, z, v, γ, ψ) information of each aircraft, and compute the initial air combat state $s_1$ according to the definition of the state space; let t = 1;
step 3-3-3: repeatedly carry out multi-episode training from the initial state, executing the following operations in each episode of the air combat simulation:
Firstly, according to the current air combat state st, the target allocation matrix Xt is calculated with the target allocation method; then each UAVi generates an action value ait = aθ(st) + εt from the state st and the random process ε and executes it, while each Targetj in the target formation executes its own action bjt. After execution the state transitions to st+1, and the reward value rit of each UAV is calculated according to equation (13). The transition (st, at, rt, st+1) is stored as one piece of experience data in the experience pool R1. During learning, a batch of M pieces of experience data (sk, ak, rk, sk+1), k = 1, ..., M, is randomly sampled from the experience pool R1, and for each of the M pieces of data the target Q value yik = rik + λ Qξ'(sk+1, aθ'(sk+1)) of each UAV is calculated, where Qξ' and aθ' denote the target Critic and target Actor networks.
The gradient estimate Δξ of the Critic is calculated according to equation (17), and the gradient estimate Δθ of the Actor is calculated according to equation (16). The online network parameters of the Actor and Critic are updated with an optimizer using the obtained gradient estimates Δξ and Δθ; after the online network optimization is completed, the target network parameters are updated by soft update, i.e. θ' ← κθ + (1 - κ)θ' and ξ' ← κξ + (1 - κ)ξ', where κ ∈ (0, 1).
step 3-3-4: after an episode of simulation ends, if the set maximum number of episodes has been reached, the reinforcement learning training is stopped; otherwise t is increased by 1 and step 3-3-3 is executed again.
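The overall procedure of steps 3-3-1 to 3-3-4 can be summarized, under the same assumptions, by the sketch below. Here env, allocate_targets and ou_noise are hypothetical stand-ins for the air combat simulation, the target allocation method of step 2 and the OU exploration process, and update_step is the sketch given above; episode handling and tensor shapes are simplified.

```python
import copy
import random
import collections
import numpy as np
import torch

def train(env, actor, critic, actor_opt, critic_opt, allocate_targets, ou_noise,
          episodes=1000, max_steps=500, batch_size=512, gamma=0.95, kappa=0.005):
    """Episode-based training loop in the spirit of steps 3-3-1 to 3-3-4."""
    target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)  # theta'<-theta, xi'<-xi
    replay = collections.deque(maxlen=10**6)                                   # experience pool R1
    for _ in range(episodes):                                                  # one episode per loop
        s = env.reset()                                                        # initial air-combat state s1
        for _ in range(max_steps):
            X = allocate_targets(s)                                            # target allocation matrix Xt
            with torch.no_grad():
                a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy()
            a = a + ou_noise()                                                 # exploration: a_theta(s) + eps
            s_next, r, done = env.step(a, X)                                   # targets move inside env
            replay.append((s, a, r, s_next))                                   # store one transition
            if len(replay) >= batch_size:
                sample = random.sample(replay, batch_size)
                batch = [torch.as_tensor(np.stack(x), dtype=torch.float32)
                         for x in zip(*sample)]
                update_step(actor, critic, target_actor, target_critic,
                            batch, actor_opt, critic_opt, gamma)
                with torch.no_grad():                                          # soft update of target nets
                    for p, tp in zip(actor.parameters(), target_actor.parameters()):
                        tp.mul_(1 - kappa).add_(p, alpha=kappa)
                    for p, tp in zip(critic.parameters(), target_critic.parameters()):
                        tp.mul_(1 - kappa).add_(p, alpha=kappa)
            s = s_next
            if done:
                break
```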
The specific embodiment is as follows:
The method is applied to a two-UAV formation and specifically comprises the following steps:
1. Designing the multi-aircraft air combat environment model.
In the multi-aircraft air combat, the number of UAVs is set to 2, denoted UAVi (i = 1, 2), and the number of targets is 2, denoted Targetj (j = 1, 2).
The observation state Si of any UAVi is calculated according to step 1.
During the multi-aircraft air combat, each UAV makes its own maneuver decision according to its situation in the air combat environment; according to the UAV dynamics model of equation (2), flight is controlled through the three variables nx, nz and μ, so the action space of UAVi is Ai = [nxi, nzi, μi].
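As a rough illustration of how the motion and dynamics models drive the simulation, the sketch below integrates the standard three-degree-of-freedom point-mass form implied by equations (1) and (2) with a simple Euler step; the step size dt and the clamping of the speed to [vmin, vmax] are assumptions taken from the embodiment's parameter settings rather than prescribed values.

```python
import math

def uav_step(state, action, dt=0.1, g=9.81, v_min=90.0, v_max=400.0):
    """One Euler step of the point-mass UAV model of Eqs. (1)-(2).

    state  = (x, y, z, v, gamma, psi); action = (nx, nz, mu).
    """
    x, y, z, v, gamma, psi = state
    nx, nz, mu = action
    # Eq. (1): kinematics in the ground frame (ox east, oy north, oz up)
    x += v * math.cos(gamma) * math.sin(psi) * dt
    y += v * math.cos(gamma) * math.cos(psi) * dt
    z += v * math.sin(gamma) * dt
    # Eq. (2): dynamics driven by the control variables [nx, nz, mu]
    v_new = v + g * (nx - math.sin(gamma)) * dt
    gamma += g / v * (nz * math.cos(mu) - math.cos(gamma)) * dt
    psi += g * nz * math.sin(mu) / (v * math.cos(gamma)) * dt
    v = min(max(v_new, v_min), v_max)        # keep the speed inside [v_min, v_max]
    return (x, y, z, v, gamma, psi)
```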
In the multi-aircraft cooperative air combat, the situation assessment values ηA and ηB between each UAV and each target are calculated according to equations (4) and (5) respectively; the situation assessment values of UAVi with respect to Targetj are recorded as ηAij and ηBij.
In addition, the influence of the friendly aircraft UAVk on UAVi should also be considered: if the distance to the friendly aircraft is too small, the risk of collision increases, so the evaluation function of UAVi with respect to its friendly aircraft UAVk is defined as shown in equation (9).
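The following small helpers sketch how the individual evaluation terms described here could be computed; the exact functional form of ηB, the sign convention of ηA and the decay constant are assumptions made to match the verbal description (tail-chase advantage, exponential attenuation beyond the interception distance, penalty below the safe distance), not the patent's exact formulas.

```python
import math

def eta_a(target_in_own_zone, uav_in_target_zone, Re=5.0):
    """Interception-opportunity advantage in the spirit of Eq. (4): bonus when the
    target is inside the UAV's missile zone, penalty when the UAV is inside the
    target's zone (sign convention assumed)."""
    return (Re if target_in_own_zone else 0.0) - (Re if uav_in_target_zone else 0.0)

def eta_b(alpha_u, alpha_t, dist, d_max=3000.0):
    """Angle/distance advantage in the spirit of Eq. (5): +1 when tailing,
    -1 when tailed, exponential decay beyond the maximum interception distance."""
    angle_term = 1.0 - (alpha_u + alpha_t) / math.pi
    decay = 1.0 if dist <= d_max else math.exp((d_max - dist) / d_max)
    return angle_term * decay

def eta_friend(dist_to_friend, d_safe=200.0, P=10.0):
    """Eq. (9): penalize flying closer than the minimum safe distance to a wingman."""
    return -P if dist_to_friend < d_safe else 0.0
```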
2. Designing the multi-aircraft cooperative target allocation method.
Two UAVs fight against 2 targets. According to equation (6), the situation evaluation value of UAVi (i = 1, 2) relative to Targetj (j = 1, 2) is ηij, and the target allocation matrix X = [xij]n×m is obtained according to step 2.
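One possible implementation of the two-stage allocation (targets inside an attack zone first, then a Hungarian assignment for the remaining UAVs and targets) is sketched below using scipy's linear_sum_assignment; the greedy in-zone selection capped at L targets and the direct handling of a rectangular cost matrix are simplifications of the margin-complementing procedure described in step 2.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_targets(eta, in_zone, L=1):
    """Two-stage target allocation sketch.

    eta[i, j]     situation value of UAV i with respect to target j
    in_zone[i, j] True when target j lies inside UAV i's attack zone
    """
    n, m = eta.shape
    X = np.zeros((n, m), dtype=int)
    # Stage 1: allocate targets already inside an attack zone, at most L per UAV,
    # preferring the largest situation values
    for i in range(n):
        candidates = np.flatnonzero(in_zone[i])
        for j in sorted(candidates, key=lambda j: -eta[i, j])[:L]:
            X[i, j] = 1
    # Stage 2: Hungarian assignment (maximizing eta) for unassigned UAVs and targets
    free_uavs = np.flatnonzero(X.sum(axis=1) == 0)
    free_tgts = np.flatnonzero(X.sum(axis=0) == 0)
    if free_uavs.size and free_tgts.size:
        cost = -eta[np.ix_(free_uavs, free_tgts)]      # negate to maximize advantage
        rows, cols = linear_sum_assignment(cost)       # accepts rectangular matrices
        X[free_uavs[rows], free_tgts[cols]] = 1
    return X

# Usage for the 2-vs-2 engagement of this embodiment (values are illustrative)
# X = allocate_targets(eta=np.array([[1.2, 0.4], [0.3, 0.9]]),
#                      in_zone=np.zeros((2, 2), dtype=bool))
```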
3. Designing the multi-aircraft cooperative maneuver strategy learning algorithm.
The UAVs are trained by reinforcement learning in an air combat scenario in which the UAVs and the target aircraft fly toward each other head-on and the targets fly in uniform rectilinear motion.
The background of the multi-UAV cooperative air combat is set as close-range air combat, and the parameters of the air combat environment model are set as follows: maximum missile interception distance Dmax = 3 km, missile field-of-view angle φmax, minimum safe distance between two UAVs Dsafe = 200 m; the dominance value Re obtained for intercepting a target is 5 and the penalty value P is 10. In the UAV motion model, the maximum speed is vmax = 400 m/s, the minimum speed is vmin = 90 m/s, and the control parameters are nx ∈ [-1, 2], nz ∈ [0, 8] and μ ∈ [-π, π].
The Actor network of the maneuver decision model is divided into an input layer, a hidden layer and an output layer. The input layer takes the air combat state as input. The hidden part has 2 layers: layer 1 consists of 400 LSTM neurons in each of the forward and backward directions and is unfolded according to the number of UAVs in the bidirectional recurrent neural network structure to form the communication layer; layer 2 consists of 100 neurons with a tanh activation function, and its parameters are randomly initialized from the uniform distribution [-3×10^-4, 3×10^-4]. The output layer outputs the 3 control quantities with a tanh activation function, and its parameters are randomly initialized from the uniform distribution [-2×10^-5, 2×10^-5]; the tanh output range is linearly rescaled to [1, 2], [0, 8] and [-π, π] respectively.
The Critic network of the maneuver decision model is likewise divided into an input layer, a hidden layer and an output layer. The input layer takes the air combat state and the 3 action values of the UAV as input. The hidden part has 2 layers: layer 1 consists of 500 LSTM neurons in each of the forward and backward directions and is unfolded according to the number of UAVs in the bidirectional recurrent neural network structure to form the communication layer; layer 2 consists of 150 neurons with a tanh activation function, and its parameters are randomly initialized from the uniform distribution [-3×10^-4, 3×10^-4]. The output layer outputs a single Q value with a tanh activation function, and its parameters are randomly initialized from the uniform distribution [-2×10^-4, 2×10^-4]. Both the Actor and Critic models use the Adam optimizer; the learning rate of the Actor network is set to 0.001 and that of the Critic network to 0.0001. The discount factor λ is 0.95 and the soft-update factor κ of the target networks is 0.005. The random process ε for action-value exploration uses the OU (Ornstein-Uhlenbeck) process. The size of the experience replay pool R1 is set to 10^6 and the batch size to 512.
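As a concrete illustration of the BRNN structure described above, the PyTorch sketch below unrolls a bidirectional LSTM over the UAV dimension as the communication layer and rescales a tanh head to the three control quantities. The class name, the observation dimension used in the usage line and the omission of the prescribed weight initialization are assumptions; a matching Critic would be built analogously, taking the state together with the action values as input and producing one Q value per UAV.

```python
import math
import torch
import torch.nn as nn

class BRNNActor(nn.Module):
    """BRNN Actor sketch: a bidirectional LSTM over the UAV dimension acts as the
    communication layer; a tanh head emits [nx, nz, mu] for every UAV."""
    def __init__(self, obs_dim, hidden=400, head=100):
        super().__init__()
        self.comm = nn.LSTM(obs_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, head), nn.Tanh(),
                                  nn.Linear(head, 3), nn.Tanh())
        # linear rescaling of the tanh output (-1, 1) to the control ranges in the text
        self.register_buffer("lo", torch.tensor([1.0, 0.0, -math.pi]))
        self.register_buffer("hi", torch.tensor([2.0, 8.0,  math.pi]))

    def forward(self, obs):                  # obs: [batch, n_uav, obs_dim]
        h, _ = self.comm(obs)                # information flows across the UAVs both ways
        u = self.head(h)                     # [batch, n_uav, 3] in (-1, 1)
        return self.lo + (u + 1.0) * 0.5 * (self.hi - self.lo)

# Usage: a 2-UAV formation, each observing 3 relative states of 13 dimensions (assumed)
actor = BRNNActor(obs_dim=39)
actions = actor(torch.randn(1, 2, 39))       # -> tensor of shape [1, 2, 3]
```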
FIG. 5 shows the air combat maneuver trajectories simulated with the learned strategy after training is complete. As can be seen in the figure, at the initial moment UAV1 and UAV2 fly head-on toward Target1 and Target2 respectively. According to the target allocation algorithm, UAV1 and UAV2 select Target1 and Target2 respectively as their attack targets for the maneuvering engagement. While approaching their respective targets, the two UAVs adjust heading and altitude to avoid a possible collision at the crossing point. Around the moment of meeting the targets, UAV1 turns to the right and UAV2 turns to the left, realizing a crossing cover; after turning in opposite directions, the two UAVs exchange their attack targets instead of continuing to turn and chase the targets initially allocated to them, which embodies tactical coordination. This demonstrates that, through reinforcement learning training, the two-UAV formation learns an air combat maneuver strategy that realizes tactical coordination between the two aircraft and gains the advantage in the air combat, rather than decomposing the multi-aircraft air combat into several 1v1 engagements.

Claims (1)

1. A multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning is characterized by comprising the following steps:
step 1: establishing a multi-machine air combat environment model, and defining a state space, an action space and a reward value for each unmanned aerial vehicle to make a maneuver decision in the multi-machine cooperative air combat process;
step 1-1: in a ground coordinate system, an ox axis is taken as the true east, an oy axis is taken as the true north, and an oz axis is taken as the vertical direction; the motion model of the unmanned aerial vehicle in the ground coordinate system is shown as the formula (1):
dx/dt = v cosγ sinψ, dy/dt = v cosγ cosψ, dz/dt = v sinγ   (1)
in the ground coordinate system, the dynamic model of the unmanned aerial vehicle is shown as formula (2):
dv/dt = g(nx - sinγ), dγ/dt = (g/v)(nz cosμ - cosγ), dψ/dt = g nz sinμ / (v cosγ)   (2)
wherein (x, y, z) represents the position of the UAV in the ground coordinate system, v represents the UAV speed, and dx/dt, dy/dt and dz/dt represent the components of the speed v along the three coordinate axes x, y and z respectively; the flight path angle γ represents the angle between the UAV speed v and the horizontal plane o-x-y; the heading angle ψ represents the angle between the projection v' of the UAV speed v on the o-x-y plane and the oy axis; g represents the gravitational acceleration; [nx, nz, μ] are the control variables used to control the UAV maneuver: nx is the overload in the UAV speed direction, representing the thrust and deceleration action of the UAV; nz is the overload in the pitch direction, i.e. the normal overload; μ is the roll angle around the UAV velocity vector; nx controls the magnitude of the UAV speed, while nz and μ control the direction of the velocity vector, thereby controlling the UAV to perform maneuvers;
step 1-2: the missile is set to have only a tail-attack capability; in the missile interception zone, vU and vT represent the speeds of the UAV and the target respectively; D is the distance vector representing the positional relation between the UAV and the target; αU and αT represent the angle between the UAV velocity vector and the distance vector D and the angle between the target velocity vector and the distance vector D, respectively;
the maximum interception distance of the missile is set as Dm and its field-of-view angle as φmax; the interception zone of the missile is a conical region Ω; the maneuvering goal of the UAV in the air combat is to make the target enter its own interception zone ΩU while avoiding entering the target's interception zone ΩT;
According to the definition of the missile interception area, if the target is in the interception area of the missile of the own party, the fact that the own party can launch a weapon to attack the target and the own party is in advantage is shown, and the advantage value eta when the unmanned aerial vehicle intercepts the target is definedUComprises the following steps:
ηU = Re if (xT, yT, zT) ∈ ΩU, and ηU = 0 otherwise   (3)
wherein (xT, yT, zT) represents the position coordinates of the target and Re is a positive number;
the advantage value ηT obtained when the target intercepts the UAV is defined as:
ηT = -Re if (xU, yU, zU) ∈ ΩT, and ηT = 0 otherwise
wherein (xU, yU, zU) represents the position coordinates of the UAV;
in the air combat, the advantage value ηA obtained by the UAV based on the interception opportunity is defined as:
ηA = ηU + ηT   (4)
the advantage value ηB obtained based on the angle and distance parameters of the two sides is defined as:
ηB = 1 - (αU + αT)/π when ||D|| ≤ Dm, and ηB = [1 - (αU + αT)/π] exp((Dm - ||D||)/Dm) when ||D|| > Dm   (5)
the above formula shows that when the UAV tails the target, the advantage value is ηB = 1; when the UAV is tailed by the target, the advantage value is ηB = -1; when the distance between the UAV and the target is larger than the maximum interception distance of the missile, the advantage value is attenuated according to an exponential function;
by integrating formulas (4) and (5), the situation assessment function eta of the air war in which the unmanned aerial vehicle is located is obtained as follows:
η = ηA + ηB   (6)
step 1-3: the geometric relation of the air combat situation at any moment is completely determined by the information contained in the UAV position vector, the UAV velocity vector, the target position vector and the target velocity vector in the same coordinate system, so the description of the air combat situation consists of the following 5 aspects:
1) speed information of the UAV, including the speed magnitude vU, the track angle γU and the heading angle ψU;
2) speed information of the target, including the speed magnitude vT, the track angle γT and the heading angle ψT;
3) the relative position relation between the UAV and the target, represented by the distance vector D; the modulus of the distance vector is D = ||D||, γD represents the angle between the distance vector D and the horizontal plane o-x-y, and ψD represents the angle between the projection of the distance vector D on the horizontal plane o-x-y and the oy axis, so the relative position relation between the UAV and the target is represented by D, γD and ψD;
4) the relative motion relation between the UAV and the target, including the angle αU between the UAV velocity vector and the distance vector D and the angle αT between the target velocity vector and the distance vector D;
5) the altitude information zU of the UAV and the altitude information zT of the target;
Based on the variables 1) to 5) above, the 1v1 air battle situation at any time can be completely characterized, so the state space of the 1v1 maneuver decision model is a 13-dimensional vector space s:
s = [vU, γU, ψU, vT, γT, ψT, D, γD, ψD, αU, αT, zU, zT]   (7)
the situation evaluation function η is adopted as the reward value R of the air combat maneuver decision, so that the effect of an action value on the air combat situation is reflected through the situation evaluation function, i.e. R = η;
step 1-4: in the multi-aircraft air combat, the number of UAVs is set to n, denoted UAVi (i = 1, 2, ..., n), and the number of targets is m, denoted Targetj (j = 1, 2, ..., m); the number of targets is set to be not greater than the number of UAVs, i.e. m ≤ n;
the relative state between any UAVi and Targetj is recorded as sij, and the relative state between UAVi and any other friendly aircraft UAVk is recorded as sik; the observation state of any UAVi in the multi-aircraft air combat is:
Si=[∪sij|j=1,2...,m,∪sik|k=1,2,...,n(k≠i)] (8)
in the process of multi-aircraft air combat, each UAV makes its own maneuver decision according to its situation in the air combat environment; according to the UAV dynamics model shown in equation (2), flight is controlled through the three variables nx, nz and μ, so the action space of UAVi is Ai = [nxi, nzi, μi];
in the multi-aircraft cooperative air combat, the situation assessment values ηA and ηB between each UAV and each target are calculated according to equations (4) and (5) respectively; the situation assessment values of UAVi with respect to Targetj are recorded as ηAij and ηBij;
in addition, the influence of the relative state between UAVi and its friendly aircraft UAVk on its own situation is considered, so the situation assessment function of UAVi with respect to its friendly aircraft UAVk is defined as:
ηik = -P if Dik < Dsafe, and ηik = 0 otherwise   (9)
wherein Dik is the distance between UAVi and its friendly aircraft UAVk, Dsafe is the minimum safe distance between two UAVs, and P is a positive number;
Step 2: establishing a multi-machine cooperative target distribution method, and determining a target distribution rule during reinforcement learning training;
step 2-1: in the air combat, n UAVs fight against m targets, with n ≥ m; according to equation (6), the situation evaluation value of UAVi (i = 1, 2, ..., n) with respect to Targetj (j = 1, 2, ..., m) is recorded as ηij;
let the target allocation matrix be X = [xij], where xij = 1 denotes that Targetj is allocated to UAVi and xij = 0 denotes that Targetj is not allocated to UAVi; each UAV can launch missiles at most at L targets located in its attack zone simultaneously, i.e. Σj xij ≤ L;
at the same time, no target may be left unengaged, i.e. each target should be assigned at least one UAV to attack it, so Σi xij ≥ 1;
all UAVs are required to take part in the combat, so Σj xij ≥ 1;
taking the maximization of the UAVs' situational advantage over the targets as the objective, the target allocation model is established as:
max Σi Σj ηij xij, subject to the above constraints
step 2-2: in the target allocation process, targets in an attack area are allocated firstly, and then targets outside the attack area are allocated, so that the target allocation method is divided into the following two parts:
step 2-2-1: preferentially distributing targets located in the attack area;
with ηAij and ηBij as elements, two n × m dimensional matrices HA and HB are constructed:
HA = [ηAij]n×m, HB = [ηBij]n×m
according to equation (3), if Targetj lies inside the attack zone of UAVi then ηAij takes its positive value, and otherwise it does not; accordingly, a matrix whose zero elements correspond to the targets lying inside an attack zone is constructed from HA, and xij = 1 is set at the positions of all its zero elements; if, during this allocation, the number x of targets inside the attack zone of UAVi exceeds the maximum number of targets the UAV can attack simultaneously, i.e. x > L, the corresponding element values of UAVi in the HB matrix are sorted and the L targets with the largest element values are allocated to UAVi;
Step 2-2-2: allocating targets located outside the attack area;
for UAVi, if a target inside its attack zone has already been allocated to it, it is not allocated any target outside the attack zone; for several targets outside the attack zone, the UAV cannot maneuver so as to place them all inside its attack zone, so when the targets are outside the attack zone only one target can be allocated to each UAV; therefore, after the allocation of targets inside the attack zones is completed, the remaining allocation work becomes the process of allocating 1 target to each unallocated UAV, which is realized with the Hungarian algorithm, specifically as follows:
first, according to the current target allocation matrix X = [xij]n×m, for every xij = 1 the i-th row and the j-th column in HB are deleted, giving a reduced matrix H̄B;
based on H̄B, the allocation result is calculated with the Hungarian algorithm; since n ≥ m and L > 0, the matrix is padded (a margin-complementing method) so that the Hungarian algorithm can be completed, the target allocation is realized, and the corresponding xij are set to 1;
after the above two steps are completed, all targets have been allocated and the target allocation matrix X = [xij]n×m is obtained;
step 3: designing a multi-aircraft cooperative maneuver strategy learning algorithm and determining the reinforcement learning training logic;
the multi-machine cooperative maneuver strategy learning algorithm comprises a strategy coordination mechanism and a strategy learning mechanism:
step 3-1: designing a strategy coordination mechanism;
the air combat confrontation is regarded as a competitive game between the n UAVs and the m targets, and the model is established within the framework of a stochastic game; a stochastic game can be represented by a tuple ⟨S, A1, ..., An, B1, ..., Bm, T, r1, ..., rn⟩, where S represents the state space of the current game, shared by all agents; the action space of UAVi is defined as Ai and the action space of Targeti as Bi; T: S × An × Bm → S denotes the deterministic transition function of the environment; ri: S × An × Bm → R represents the reward value function of UAVi; the action spaces of the UAVs within each formation in the cooperative air combat are identical, i.e. for UAVi and Targetj respectively Ai = A and Bi = B;
the global reward value of the UAV formation is defined as the average of the reward values of the individual UAVs, i.e.:
r(s, a, b) = (1/n) Σi ri(s, a, b)   (11)
wherein r(s, a, b) represents the reward value obtained by the UAV formation at time t when the environment state is s, the UAV formation takes action a ∈ An and the target formation takes action b ∈ Bm;
the goal of the UAV formation is to learn a strategy that maximizes the expectation of the discounted accumulation of reward values E[Σt λ^t r(st, at, bt)], where 0 < λ ≤ 1 is the discount factor; the stochastic game is thereby transformed into a Markov decision problem:
Q*(s, a) = r(s, a) + λ Q*(s', aθ(s'))   (12)
wherein Q*(·) represents the state-action value function for executing action a in state s, r(s, a) represents the reward value for executing action a in state s, θ represents the network parameters of the policy function, s' represents the state at the next time, and aθ represents the parameterized policy function;
the reward value function of each UAV is defined as:
ri(s, a, b) = Σj xij ηij + Σk≠i ηik   (13)
wherein ri(s, a, b) represents the reward value obtained by UAVi at time t when the environment state is s, the UAV formation takes action a ∈ An and the target formation takes action b ∈ Bm; the term Σj xij ηij characterizes the situational advantage value of UAVi relative to the targets allocated to it, and the term Σk≠i ηik is a penalty term constraining the distance between UAVi and its friendly aircraft;
based on equation (13), for the n individual UAVs there are n Bellman equations as shown in equation (14), in which the policy functions aθ share the same parameters θ:
Qi*(s, a) = ri(s, a) + λ Qi*(s', aθ(s')),  i = 1, 2, ..., n   (14)
wherein Qi*(s, a) represents the state-action value function of UAVi for executing action a in state s, and ri(s, a) denotes the reward value obtained by UAVi for executing action a in state s;
step 3-2: designing a strategy learning mechanism;
a multi-UAV maneuver decision model is established using a bidirectional recurrent neural network (BRNN);
the multi-UAV air combat maneuver decision model consists of an Actor network and a Critic network: the Actor network is formed by connecting the Actor networks of the individual UAVs through the BRNN, and the Critic network is formed by connecting the Critic networks of the individual UAVs through the BRNN; the hidden layers of the policy network Actor and the Q network Critic in the single-UAV decision model are set as BRNN recurrent units in the multi-UAV air combat maneuver decision model, and the BRNN is then unfolded according to the number of UAVs; the input of the multi-UAV air combat maneuver decision model is the current air combat situation, and the output is the action value of each UAV;
the objective function of UAVi is defined as
Ji(θ) = Es~ρ[ Σt λ^t ri(st, at) ]
representing the expectation of the accumulated individual reward value ri, where ρ denotes the state distribution obtained by adopting the action policy aθ under the state transition function T; this state distribution is stationary in an ergodic Markov decision process, so the objective function of the n UAVs is recorded as J(θ):
J(θ) = (1/n) Σi Ji(θ)   (15)
according to the multi-agent deterministic policy gradient theorem, for the target function J (theta) of the n drones described in equation (15), the gradient of the policy network parameter theta is
∇θJ(θ) = Es~ρ[ (1/n) Σi ∇θ aθ(Si) ∇ai Qi(s, a) |a=aθ(s) ]   (16)
a parameterized Critic function Qξ(s, a) is used to estimate the state-action value function Qi(s, a) in equation (16);
when the Critic is trained, a sum-of-squares loss function is adopted; the gradient of the parameterized Critic function Qξ(s, a) is shown in equation (17), where ξ is the parameter of the Q network:
∇ξL(ξ) = E[ (Qξ(s, a) - y) ∇ξ Qξ(s, a) ],  y = r + λ Qξ'(s', aθ'(s'))   (17)
the Actor and Critic networks are optimized by stochastic gradient descent based on equations (16) and (17); in the interactive learning process, the parameters are updated with the data obtained through trial and error, completing the learning and optimization of the cooperative air combat strategy;
step 3-3: according to the strategy coordination mechanism and the strategy learning mechanism, the reinforcement learning training process of the multi-UAV cooperative air combat maneuver decision model is determined as follows:
step 3-3-1: firstly, initialization is carried out: determine the forces and situations of both air combat sides, arranging n UAVs and m targets for the air combat confrontation, with n ≥ m; randomly initialize the online network parameter θ of the Actor and the online network parameter ξ of the Critic, then assign the Actor and Critic online network parameters to the corresponding target networks, i.e. θ' ← θ and ξ' ← ξ, where θ' and ξ' are the parameters of the Actor and Critic target networks respectively; initialize an experience pool R1 for storing the experience data obtained from exploratory interaction; initialize a random process ε for realizing the exploration of action values;
step 3-3-2: determine the initial state of training, i.e. the relative situation of the two sides at the beginning of the air combat; set the initial position and speed information of each UAV in the UAV formation and in the target formation, i.e. determine the (x, y, z, v, γ, ψ) information of each aircraft, and calculate the initial air combat state s1 according to the definition of the state space; let t = 1;
step 3-3-3: repeatedly carry out multi-episode training from the initial state, executing the following operations in each episode of the air combat simulation:
firstly, according to the current air combat state st, the target allocation matrix Xt is calculated with the target allocation method; then each UAVi generates an action value ait = aθ(st) + εt from the state st and the random process ε and executes it, while each Targetj in the target formation executes its own action bjt; after execution the state transitions to st+1, and the reward value rit of each UAV is calculated according to equation (13); the transition (st, at, rt, st+1) is stored as one piece of experience data in the experience pool R1; during learning, a batch of M pieces of experience data (sk, ak, rk, sk+1), k = 1, ..., M, is randomly sampled from the experience pool R1, and for each of the M pieces of data the target Q value yik = rik + λ Qξ'(sk+1, aθ'(sk+1)) of each UAV is calculated, where Qξ' and aθ' denote the target Critic and target Actor networks;
the gradient estimate Δξ of the Critic is calculated according to equation (17), and the gradient estimate Δθ of the Actor is calculated according to equation (16); the online network parameters of the Actor and Critic are updated with an optimizer using the obtained gradient estimates Δξ and Δθ; after the online network optimization is completed, the target network parameters are updated by soft update, i.e. θ' ← κθ + (1 - κ)θ' and ξ' ← κξ + (1 - κ)ξ', where κ ∈ (0, 1);
step 3-3-4: after an episode of simulation ends, if the set maximum number of episodes has been reached, the reinforcement learning training is stopped; otherwise t is increased by 1 and step 3-3-3 is executed again.
CN202110318644.5A 2021-03-25 2021-03-25 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning Active CN112947581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110318644.5A CN112947581B (en) 2021-03-25 2021-03-25 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110318644.5A CN112947581B (en) 2021-03-25 2021-03-25 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN112947581A true CN112947581A (en) 2021-06-11
CN112947581B CN112947581B (en) 2022-07-05

Family

ID=76226772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110318644.5A Active CN112947581B (en) 2021-03-25 2021-03-25 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN112947581B (en)

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255234A (en) * 2021-06-28 2021-08-13 北京航空航天大学 Method for carrying out online target distribution on missile groups
CN113566831A (en) * 2021-09-26 2021-10-29 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster navigation method, device and equipment based on human-computer interaction
CN113625739A (en) * 2021-08-25 2021-11-09 中国航空工业集团公司沈阳飞机设计研究所 Expert system optimization method based on heuristic maneuver selection algorithm
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN113805569A (en) * 2021-09-23 2021-12-17 北京理工大学 Multi-agent technology-based countermeasure system, method, terminal and storage medium
CN113867178A (en) * 2021-10-26 2021-12-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN113893539A (en) * 2021-12-09 2022-01-07 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN113962012A (en) * 2021-07-23 2022-01-21 中国科学院自动化研究所 Unmanned aerial vehicle countermeasure strategy optimization method and device
CN114167899A (en) * 2021-12-27 2022-03-11 北京联合大学 Unmanned aerial vehicle swarm cooperative countermeasure decision-making method and system
CN114167756A (en) * 2021-12-08 2022-03-11 北京航空航天大学 Autonomous learning and semi-physical simulation verification method for cooperative air combat decision of multiple unmanned aerial vehicles
CN114239392A (en) * 2021-12-09 2022-03-25 南通大学 Unmanned aerial vehicle decision model training method, using method, equipment and medium
CN114326826A (en) * 2022-01-11 2022-04-12 北方工业大学 Multi-unmanned aerial vehicle formation transformation method and system
CN114330115A (en) * 2021-10-27 2022-04-12 中国空气动力研究与发展中心计算空气动力研究所 Neural network air combat maneuver decision method based on particle swarm search
CN114727407A (en) * 2022-05-12 2022-07-08 中国科学院自动化研究所 Resource allocation method, device and equipment
CN114815882A (en) * 2022-04-08 2022-07-29 北京航空航天大学 Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning
CN115097864A (en) * 2022-06-27 2022-09-23 中国人民解放军海军航空大学 Multi-machine formation task allocation method
CN115113642A (en) * 2022-06-02 2022-09-27 中国航空工业集团公司沈阳飞机设计研究所 Multi-unmanned aerial vehicle space-time key feature self-learning cooperative confrontation decision-making method
CN115238832A (en) * 2022-09-22 2022-10-25 中国人民解放军空军预警学院 CNN-LSTM-based air formation target intention identification method and system
CN115268481A (en) * 2022-07-06 2022-11-01 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle countermeasure strategy decision method and system
CN115470894A (en) * 2022-10-31 2022-12-13 中国人民解放军国防科技大学 Unmanned aerial vehicle knowledge model time-sharing calling method and device based on reinforcement learning
CN115755956A (en) * 2022-11-03 2023-03-07 南京航空航天大学 Unmanned aerial vehicle maneuver decision method and system driven by knowledge and data in cooperation
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction
CN116047984A (en) * 2023-03-07 2023-05-02 北京全路通信信号研究设计院集团有限公司 Consistency tracking control method, device, equipment and medium of multi-agent system
CN116149348A (en) * 2023-04-17 2023-05-23 四川汉科计算机信息技术有限公司 Air combat maneuver system, control method and defense system control method
CN116227361A (en) * 2023-03-06 2023-06-06 中国人民解放军32370部队 Intelligent body decision method and device
CN116489193A (en) * 2023-05-04 2023-07-25 中国人民解放军陆军工程大学 Combat network self-adaptive combination method, device, equipment and medium
CN116679742A (en) * 2023-04-11 2023-09-01 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116736883A (en) * 2023-05-23 2023-09-12 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN116893690A (en) * 2023-07-25 2023-10-17 西安爱生技术集团有限公司 Unmanned aerial vehicle evasion attack input data calculation method based on reinforcement learning
CN116974297A (en) * 2023-06-27 2023-10-31 北京五木恒润科技有限公司 Conflict resolution method and device based on multi-objective optimization, medium and electronic equipment
CN117111640A (en) * 2023-10-24 2023-11-24 中国人民解放军国防科技大学 Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment
CN117168468A (en) * 2023-11-03 2023-12-05 安徽大学 Multi-unmanned-ship deep reinforcement learning collaborative navigation method based on near-end strategy optimization
CN117162102A (en) * 2023-10-30 2023-12-05 南京邮电大学 Independent near-end strategy optimization training acceleration method for robot joint action
CN117313561A (en) * 2023-11-30 2023-12-29 中国科学院自动化研究所 Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN113962012B (en) * 2021-07-23 2024-05-24 中国科学院自动化研究所 Unmanned aerial vehicle countermeasure strategy optimization method and device


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007080584A2 (en) * 2006-01-11 2007-07-19 Carmel-Haifa University Economic Corp. Ltd. Uav decision and control system
CN108319286A (en) * 2018-03-12 2018-07-24 西北工业大学 A kind of unmanned plane Air Combat Maneuvering Decision Method based on intensified learning
CN111260031A (en) * 2020-01-14 2020-06-09 西北工业大学 Unmanned aerial vehicle cluster target defense method based on deep reinforcement learning
CN111523177A (en) * 2020-04-17 2020-08-11 西安科为实业发展有限责任公司 Air combat countermeasure autonomous decision method and system based on intelligent learning
CN112180967A (en) * 2020-04-26 2021-01-05 北京理工大学 Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN111880563A (en) * 2020-07-17 2020-11-03 西北工业大学 Multi-unmanned aerial vehicle task decision method based on MADDPG
CN111880565A (en) * 2020-07-22 2020-11-03 电子科技大学 Q-Learning-based cluster cooperative countermeasure method
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents
CN112051863A (en) * 2020-09-25 2020-12-08 南京大学 Unmanned aerial vehicle autonomous anti-reconnaissance and enemy attack avoidance method
CN112182977A (en) * 2020-10-12 2021-01-05 中国人民解放军国防科技大学 Control method and system for cooperative game confrontation of unmanned cluster

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WEIREN KONG,等: "Maneuver Strategy Generation of UCAV for within Visual Range Air Combat Based on Multi-Agent Reinforcement Learning and Target Position Prediction", 《MDPI》 *
DING Linjing, et al.: "Maneuver decision of UAV in air combat based on reinforcement learning", Avionics Technology *
LIU Qiang, et al.: "Research on group confrontation strategy based on deep reinforcement learning", Intelligent Computer and Applications *
XIE Jianfeng, et al.: "Research on UAV air combat maneuver decision based on reinforced genetic algorithm", Journal of Northwestern Polytechnical University *

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255234A (en) * 2021-06-28 2021-08-13 北京航空航天大学 Method for carrying out online target distribution on missile groups
CN113962012B (en) * 2021-07-23 2024-05-24 中国科学院自动化研究所 Unmanned aerial vehicle countermeasure strategy optimization method and device
CN113962012A (en) * 2021-07-23 2022-01-21 中国科学院自动化研究所 Unmanned aerial vehicle countermeasure strategy optimization method and device
CN113791634B (en) * 2021-08-22 2024-02-02 西北工业大学 Multi-agent reinforcement learning-based multi-machine air combat decision method
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN113625739A (en) * 2021-08-25 2021-11-09 中国航空工业集团公司沈阳飞机设计研究所 Expert system optimization method based on heuristic maneuver selection algorithm
CN113805569A (en) * 2021-09-23 2021-12-17 北京理工大学 Multi-agent technology-based countermeasure system, method, terminal and storage medium
CN113805569B (en) * 2021-09-23 2024-03-26 北京理工大学 Countermeasure system, method, terminal and storage medium based on multi-agent technology
CN113566831A (en) * 2021-09-26 2021-10-29 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster navigation method, device and equipment based on human-computer interaction
CN113867178A (en) * 2021-10-26 2021-12-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN113867178B (en) * 2021-10-26 2022-05-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN114330115A (en) * 2021-10-27 2022-04-12 中国空气动力研究与发展中心计算空气动力研究所 Neural network air combat maneuver decision method based on particle swarm search
CN114167756A (en) * 2021-12-08 2022-03-11 北京航空航天大学 Autonomous learning and semi-physical simulation verification method for cooperative air combat decision of multiple unmanned aerial vehicles
CN114167756B (en) * 2021-12-08 2023-06-02 北京航空航天大学 Multi-unmanned aerial vehicle collaborative air combat decision autonomous learning and semi-physical simulation verification method
CN114239392A (en) * 2021-12-09 2022-03-25 南通大学 Unmanned aerial vehicle decision model training method, using method, equipment and medium
CN113893539B (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN113893539A (en) * 2021-12-09 2022-01-07 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN114167899A (en) * 2021-12-27 2022-03-11 北京联合大学 Unmanned aerial vehicle swarm cooperative countermeasure decision-making method and system
CN114167899B (en) * 2021-12-27 2023-05-26 北京联合大学 Unmanned plane bee colony collaborative countermeasure decision-making method and system
CN114326826A (en) * 2022-01-11 2022-04-12 北方工业大学 Multi-unmanned aerial vehicle formation transformation method and system
CN114815882A (en) * 2022-04-08 2022-07-29 北京航空航天大学 Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning
CN114727407B (en) * 2022-05-12 2022-08-26 中国科学院自动化研究所 Resource allocation method, device and equipment
CN114727407A (en) * 2022-05-12 2022-07-08 中国科学院自动化研究所 Resource allocation method, device and equipment
CN115113642A (en) * 2022-06-02 2022-09-27 中国航空工业集团公司沈阳飞机设计研究所 Multi-unmanned aerial vehicle space-time key feature self-learning cooperative confrontation decision-making method
CN115097864A (en) * 2022-06-27 2022-09-23 中国人民解放军海军航空大学 Multi-machine formation task allocation method
CN115268481A (en) * 2022-07-06 2022-11-01 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle countermeasure strategy decision method and system
CN115238832B (en) * 2022-09-22 2022-12-02 中国人民解放军空军预警学院 CNN-LSTM-based air formation target intention identification method and system
CN115238832A (en) * 2022-09-22 2022-10-25 中国人民解放军空军预警学院 CNN-LSTM-based air formation target intention identification method and system
CN115470894A (en) * 2022-10-31 2022-12-13 中国人民解放军国防科技大学 Unmanned aerial vehicle knowledge model time-sharing calling method and device based on reinforcement learning
CN115755956A (en) * 2022-11-03 2023-03-07 南京航空航天大学 Unmanned aerial vehicle maneuver decision method and system driven by knowledge and data in cooperation
CN115755956B (en) * 2022-11-03 2023-12-15 南京航空航天大学 Knowledge and data collaborative driving unmanned aerial vehicle maneuvering decision method and system
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction
CN116227361A (en) * 2023-03-06 2023-06-06 中国人民解放军32370部队 Intelligent body decision method and device
CN116227361B (en) * 2023-03-06 2023-08-15 中国人民解放军32370部队 Intelligent body decision method and device
CN116047984A (en) * 2023-03-07 2023-05-02 北京全路通信信号研究设计院集团有限公司 Consistency tracking control method, device, equipment and medium of multi-agent system
CN116679742A (en) * 2023-04-11 2023-09-01 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116679742B (en) * 2023-04-11 2024-04-02 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116149348B (en) * 2023-04-17 2023-06-23 四川汉科计算机信息技术有限公司 Air combat maneuver system, control method and defense system control method
CN116149348A (en) * 2023-04-17 2023-05-23 四川汉科计算机信息技术有限公司 Air combat maneuver system, control method and defense system control method
CN116489193B (en) * 2023-05-04 2024-01-23 中国人民解放军陆军工程大学 Combat network self-adaptive combination method, device, equipment and medium
CN116489193A (en) * 2023-05-04 2023-07-25 中国人民解放军陆军工程大学 Combat network self-adaptive combination method, device, equipment and medium
CN116736883A (en) * 2023-05-23 2023-09-12 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN116736883B (en) * 2023-05-23 2024-03-08 天津大学 Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN116974297B (en) * 2023-06-27 2024-01-26 北京五木恒润科技有限公司 Conflict resolution method and device based on multi-objective optimization, medium and electronic equipment
CN116974297A (en) * 2023-06-27 2023-10-31 北京五木恒润科技有限公司 Conflict resolution method and device based on multi-objective optimization, medium and electronic equipment
CN116893690A (en) * 2023-07-25 2023-10-17 西安爱生技术集团有限公司 Unmanned aerial vehicle evasion attack input data calculation method based on reinforcement learning
CN117111640A (en) * 2023-10-24 2023-11-24 中国人民解放军国防科技大学 Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment
CN117111640B (en) * 2023-10-24 2024-01-16 中国人民解放军国防科技大学 Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment
CN117162102A (en) * 2023-10-30 2023-12-05 南京邮电大学 Independent near-end strategy optimization training acceleration method for robot joint action
CN117168468A (en) * 2023-11-03 2023-12-05 安徽大学 Multi-unmanned-ship deep reinforcement learning collaborative navigation method based on near-end strategy optimization
CN117168468B (en) * 2023-11-03 2024-02-06 安徽大学 Multi-unmanned-ship deep reinforcement learning collaborative navigation method based on near-end strategy optimization
CN117313561B (en) * 2023-11-30 2024-02-13 中国科学院自动化研究所 Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN117313561A (en) * 2023-11-30 2023-12-29 中国科学院自动化研究所 Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method

Also Published As

Publication number Publication date
CN112947581B (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN112947581B (en) Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN111880563B (en) Multi-unmanned aerial vehicle task decision method based on MADDPG
Yang et al. Maneuver decision of UAV in short-range air combat based on deep reinforcement learning
WO2021174765A1 (en) Control system based on multi-unmanned-aerial-vehicle collaborative game confrontation
CN108319286B (en) Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning
Jiandong et al. UAV cooperative air combat maneuver decision based on multi-agent reinforcement learning
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN112902767B (en) Multi-missile time collaborative missile guidance method and system
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN112198892B (en) Multi-unmanned aerial vehicle intelligent cooperative penetration countermeasure method
CN113095481A (en) Air combat maneuver method based on parallel self-game
CN114489144B (en) Unmanned aerial vehicle autonomous maneuver decision method and device and unmanned aerial vehicle
CN112906233B (en) Distributed near-end strategy optimization method based on cognitive behavior knowledge and application thereof
CN111859541B (en) PMADDPG multi-unmanned aerial vehicle task decision method based on transfer learning improvement
CN114063644B (en) Unmanned fighter plane air combat autonomous decision-making method based on pigeon flock reverse countermeasure learning
CN114460959A (en) Unmanned aerial vehicle group cooperative autonomous decision-making method and device based on multi-body game
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
CN114167756B (en) Multi-unmanned aerial vehicle collaborative air combat decision autonomous learning and semi-physical simulation verification method
CN113741186B (en) Double-aircraft air combat decision-making method based on near-end strategy optimization
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
Wu et al. Visual range maneuver decision of unmanned combat aerial vehicle based on fuzzy reasoning
Duan et al. Autonomous maneuver decision for unmanned aerial vehicle via improved pigeon-inspired optimization
CN116796843A (en) Unmanned aerial vehicle many-to-many chase game method based on PSO-M3DDPG
CN116225065A (en) Unmanned plane collaborative pursuit method of multi-degree-of-freedom model for multi-agent reinforcement learning
Guo et al. Maneuver decision of UAV in air combat based on deterministic policy gradient

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant