CN112947581A

CN112947581A - Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning

Info

Publication number: CN112947581A
Application number: CN202110318644.5A
Authority: CN
Inventors: 杨啟明; 张建东; 史国庆; 吴勇; 朱岩; 张耀中
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2021-03-25
Filing date: 2021-03-25
Publication date: 2021-06-11
Anticipated expiration: 2041-03-25
Also published as: CN112947581B

Abstract

The invention discloses a multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning, which solves the autonomous decision problem of maneuver action in multi-unmanned aerial vehicle collaborative air combat in simulation multi-to-multi air combat. The method comprises the following steps: creating a motion model of the unmanned aerial vehicle platform; evaluating the situation of the multi-aircraft air combat based on attack areas, distances and angle factors, and analyzing the state space, action space and reward value of the maneuvering decision of the multi-aircraft air combat; a target distribution method and a strategy coordination mechanism in the cooperative air combat are designed, behavior feedback of each unmanned aerial vehicle in target distribution, situation advantages and safe collision avoidance is defined through distribution of reward values, and strategy cooperation is achieved after training. The invention can effectively improve the capability of multiple unmanned aerial vehicles for carrying out collaborative air combat maneuver autonomous decision, has stronger cooperativity and autonomous optimization, and continuously improves the decision level of unmanned aerial vehicle formation in continuous simulation and learning.

Description

Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning

Technical Field

The invention belongs to the technical field of unmanned aerial vehicles, and particularly relates to a multi-unmanned aerial vehicle collaborative air combat maneuver decision method.

Background

At present, unmanned aerial vehicles can complete tasks such as reconnaissance, monitoring and ground attack, and play an increasingly difficult role in modern war. But because of the restriction of intelligent level, unmanned aerial vehicle can not carry out autonomic air battle maneuver decision yet at present, especially the autonomic collaborative air battle of many unmanned aerial vehicles. Therefore, promote unmanned aerial vehicle's intelligent level, let unmanned aerial vehicle can be according to the situation environment and the maneuver in the automatic control command completion air battle is current main research direction.

The unmanned aerial vehicle can complete the maneuver autonomous decision of the air combat, and the essence of the maneuver autonomous decision is to complete the mapping from the air combat situation to the maneuver and execute the corresponding maneuver under different situations. Because the situation of air battle is more complex than other tasks, the situation space of the air battle task is difficult to be completely covered by a manual pre-programming method, and the optimal maneuver decision is more difficult to calculate and generate.

At present, the air combat maneuver decision research of unmanned aerial vehicles is developed under the single-machine confrontation scene of 1v1, and in the actual air combat, a plurality of unmanned aerial vehicles are basically formed into a formation cooperative combat. The multi-machine cooperative air combat relates to three aspects of air combat situation assessment, multi-target distribution and maneuver decision, the cooperative air combat is a closely-connected coupling process of the three parts, and compared with the maneuver decision of single-machine confrontation, the multi-machine cooperative air combat needs to consider tactical cooperation besides the enlargement of the force quantity scale, so that the problem is more complex.

The research on the multi-machine collaborative air combat decision can be divided into centralized type and distributed type, the centralized type is that a center calculates the actions of all unmanned aerial vehicles in a formation, and the models are complex and have the problems of high calculation difficulty and insufficient real-time performance. The idea of the distributed method is that each unmanned aerial vehicle in the formation calculates respective maneuvering action by itself on the basis of target allocation, so that the complexity of the model is reduced, and the cooperation of the formation task is realized through the target allocation. The existing distributed cooperative air combat decision-making method mostly adopts the conditions that target distribution is firstly carried out, and then many-to-many air combat is converted into one-to-one according to the target distribution result, so that the method cannot well play the tactical cooperation of multi-target attack capability and formation combat, and cannot achieve the effect of 1+1> 2.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning, and solves the autonomous decision problem of maneuver action in multi-unmanned aerial vehicle collaborative air combat in simulation multi-to-multi air combat. The method comprises the following steps: creating a motion model of the unmanned aerial vehicle platform; evaluating the situation of the multi-aircraft air combat based on attack areas, distances and angle factors, and analyzing the state space, action space and reward value of the maneuvering decision of the multi-aircraft air combat; a target distribution method and a strategy coordination mechanism in the cooperative air combat are designed, behavior feedback of each unmanned aerial vehicle in target distribution, situation advantages and safe collision avoidance is defined through distribution of reward values, and strategy cooperation is achieved after training. The invention can effectively improve the capability of multiple unmanned aerial vehicles for carrying out collaborative air combat maneuver autonomous decision, has stronger cooperativity and autonomous optimization, and continuously improves the decision level of unmanned aerial vehicle formation in continuous simulation and learning.

The technical scheme adopted by the invention for solving the technical problem comprises the following steps:

step 1: establishing a multi-machine air combat environment model, and defining a state space, an action space and a reward value for each unmanned aerial vehicle to make a maneuver decision in the multi-machine cooperative air combat process;

step 1-1: in a ground coordinate system, an ox axis is taken as the true east, an oy axis is taken as the true north, and an oz axis is taken as the vertical direction; the motion model of the unmanned aerial vehicle in the ground coordinate system is shown as the formula (1):

in the ground coordinate system, the dynamic model of the unmanned aerial vehicle is shown as formula (2):

wherein (x, y, z) represents the position of the drone in the ground coordinate system, v represents the drone velocity,

and

respectively representing the values of the speed v of the unmanned aerial vehicle on three coordinate axes of xyz; the flight path angle gamma represents an included angle between the speed v of the unmanned aerial vehicle and the horizontal plane o-x-y; the heading angle psi represents an included angle between a projection v' of the unmanned aerial vehicle speed v on an o-x-y plane and an oy axis, and g represents gravity acceleration; [ n ] of_x,n_z,μ]Is a control variable, n, for controlling the unmanned aerial vehicle to maneuver_xThe overload in the speed direction of the unmanned aerial vehicle represents the thrust and deceleration action of the unmanned aerial vehicle; n is_zIndicating the overload of the pitching direction of the unmanned aerial vehicle, namely normal overload; μ is the roll angle around the drone velocity vector; through n_xControlling the speed of the unmanned plane by n_zAnd mu, controlling the direction of the speed vector of the unmanned aerial vehicle, and further controlling the unmanned aerial vehicle to perform maneuvering action;

step 1-2: setting the missile to have the tail attack capability only; in the interception area of the missile by v_UAnd v_TRepresenting the speed of the drone and the target, respectively; d is a distance vector and represents the position relation between the unmanned aerial vehicle and the target; alpha is alpha_UAnd alpha_TRespectively representing an included angle between the speed vector of the unmanned aerial vehicle and the distance vector D and an included angle between the target speed vector and the distance vector D;

setting the maximum interception distance of the missile as D_mAngle of view of

The interception area of the missile is a conical area omega; the maneuvering target of the unmanned aerial vehicle in the air war is to make the target enter the interception area omega of the unmanned aerial vehicle_USimultaneously, the unmanned aerial vehicle is prevented from entering a target interception area omega_T；

According to the definition of the missile interception area, if the target is in the interception area of the missile of the own partyAnd the area shows that the own party can launch weapons to attack the target, the own party is in advantage, and the advantage value eta of the unmanned aerial vehicle when the unmanned aerial vehicle intercepts the target is defined_UComprises the following steps:

wherein (x)_T，y_T，z_T) Representing position coordinates of the target; re is a positive number;

defining a dominance η obtained by a target intercepting unmanned aerial vehicle_TComprises the following steps:

wherein (x)_U，y_U，z_U) Representing position coordinates of the drone;

in the air battle, the advantage value eta obtained by the unmanned aerial vehicle based on the interception opportunity_AIs defined as:

η_A＝η_U-η_T (4)

defining an advantage value eta obtained based on angle parameters and distance parameters of two parties_BComprises the following steps:

the above formula shows that when the unmanned aerial vehicle tailors the target, the dominance value is eta _B1 is ═ 1; when the unmanned aerial vehicle is tailed by the target, the advantage value is eta_B-1; when the distance between the unmanned aerial vehicle and the target is larger than the maximum interception distance of the missile, the advantage value is attenuated according to an exponential function;

by integrating formulas (4) and (5), the situation assessment function eta of the air war in which the unmanned aerial vehicle is located is obtained as follows:

η＝η_A+η_B (6)

step 1-3: the geometric relationship of the air combat situation at any moment is completely determined by information contained in an unmanned aerial vehicle position vector, an unmanned aerial vehicle speed vector, a target position vector and a target speed vector in the same coordinate system, so that the description of the air combat situation is composed of the following 5 aspects:

1) speed information of unmanned aerial vehicle, including speed magnitude v_UTrack angle gamma_UAnd heading angle psi_U；

2) Speed information of the object, including the magnitude v of the speed_TTrack angle gamma_TAnd heading angle psi_T；

3) The relative position relation between the unmanned aerial vehicle and the target is represented by a distance vector D; distance vector modulo D | | | D | |, γ_DRepresenting the angle of the distance vector D with the horizontal plane o-x-y,. phi_DThe included angle between the projection vector of the distance vector D on the horizontal plane o-x-y and the oy axis is shown, and the relative position relation between the unmanned aerial vehicle and the target is D and gamma_DAnd psi_DRepresents;

4) the relative motion relation between the unmanned aerial vehicle and the target comprises an included angle alpha between a speed vector of the unmanned aerial vehicle and a distance vector D_UAnd the angle alpha between the target velocity vector and the distance vector D_T；

5) Height information z of unmanned aerial vehicle_UAnd height information z of the target_T；

Based on the variables 1) to 5) above, the 1v1 air battle situation at any time can be completely characterized, so the state space of the 1v1 maneuver decision model is a 13-dimensional vector space s:

s＝[v_U,γ_U,ψ_U,v_T,γ_T,ψ_T,D,γ_D,ψ_D,α_U,α_T,z_U,z_T] (7)

adopting a situation evaluation function eta as an air war maneuver decision reward value R, and reflecting the action of an action value on the air war situation through the situation evaluation function, wherein the R is eta;

step 1-4: in the multi-airplane air battle, the number of the unmanned aerial vehicles is set to be n and respectively recorded as UAVs_i(i is 1,2, …, n), the number of targets is m, and each is denoted as Target_j(j ═ 1,2, …, m), the number of set targets is not greater than the number of drones, i.e. m ≦ n;

remember any two UAVs_iAnd Target_jRelative state therebetween is

UAV_iUAV for interacting with any one of the other aircraft_kThe relative state between them is recorded as

Any UAV in multi-aircraft air combat_iThe observation state of (a) is:

S_i＝[∪s_ij|_j＝1,2...,m,∪s_ik|_{k＝1,2,...,n(k≠i)}] (8)

in the process of multi-machine air combat, each unmanned aerial vehicle makes own maneuvering decision according to the situation of the unmanned aerial vehicle in the air combat environment, and the unmanned aerial vehicle passes through n according to the unmanned aerial vehicle dynamics model shown in the formula (2)_x、n_zAnd μ three variables control flight, therefore UAV_iHas a motion space of A_i＝[n_xi,n_zi,μ_i]；

In the multi-machine collaborative air battle, the situation assessment value eta between each unmanned aerial vehicle and each target is respectively calculated according to the formula (4) and the formula (5)_AAnd η_BRecording UAV_iAnd Target_jHas a situation evaluation value of

And

in addition to that, consider a UAV_iFriend machine UAV_kThe relative state of (A) on the self-situation, thus defining the UAV_iFriend machine UAV_kThe situation assessment function of (1) is:

wherein D_ikUAV for unmanned aerial vehicle_iFriend machine UAV_kA distance between, D_safeIs the minimum safe distance between two unmanned planes, and P is a positive number.

Step 2: establishing a multi-machine cooperative target distribution method, and determining a target distribution rule during reinforcement learning training;

step 2-1: in the air battle, n unmanned aerial vehicles are arranged to fight m targets, and n is more than or equal to m; according to equation (6), UAV_i(i-1, 2, …, n) vs. Target_jThe situation evaluation value of (j ═ 1,2, …, m) is

Let the target allocation matrix be X ═ X_ij]，x_ij1 denotes Target_jTo UAV_i，x_ij0 denotes Target_jNot allocated to UAV_i(ii) a Let each drone be able to launch missiles at most simultaneously on L targets located in its attack zone, i.e.

Meanwhile, during the battle, the target is prevented from being omitted and the attack is abandoned, namely, each target is at least allocated with one unmanned aerial vehicle to attack, so that the unmanned aerial vehicle can be used for preventing the attack

All unmanned aerial vehicles are required to be put into combat, so that

With the situation advantage maximization of the unmanned aerial vehicle to the target as a target, establishing a target distribution model as follows:

step 2-2: in the target allocation process, targets in an attack area are allocated firstly, and then targets outside the attack area are allocated, so that the target allocation method is divided into the following two parts:

step 2-2-1: preferentially distributing targets located in the attack area;

to be provided with

And

constructing two n x m dimensional matrices H for elements_AAnd H_B，

From equation (3) if Target_jIn the UAV_iIn the attack area of (1), then

Otherwise

Thus, let

Order to

X of corresponding positions of all zero elements _ij1 is ═ 1; during the distribution process, if at unmanned aerial vehicle UAV_iThe target number x in the attack area exceeds the maximum attack target number of the unmanned aerial vehicle, namely x>L, then use UAV_iAt H_BSorting corresponding element values in the matrix, selecting L targets with the maximum element values to be allocated to the UAV_i；

Step 2-2-2: allocating targets located outside the attack area;

for UAV_iIf a target within its attack zone has already been allocated, it can no longer be allocated a target outside the attack zone; and for a plurality of targets outside the attack area, the unmanned aerial vehicle cannot makeManeuvering to enable a plurality of targets to be in the attack area, and therefore when the targets are outside the attack area, only one target can be allocated to the unmanned aerial vehicle; therefore, after the target allocation in the attack area is completed, the remaining target allocation work is changed into a process of allocating 1 target to the unallocated unmanned aerial vehicle, and the allocation is realized by adopting the hungarian algorithm, which specifically comprises the following steps:

first, a matrix X is allocated according to the current target [ X ]_ij]_n×mIs prepared from H_BAll of x in_ijDeleting the ith row and the jth column where the 1 is positioned to obtain a matrix

Based on

The allocation result is calculated by adopting the Hungarian algorithm, because n is more than or equal to m, and L>0, adopting a margin complementing method to complete the Hungarian algorithm, realizing target distribution, and ordering corresponding x_ij＝1；

After the above two steps are completed, the allocation of all the targets is completed, and a target allocation matrix X ═ X is obtained_ij]_n×m；

And step 3: designing a multi-machine cooperative maneuver strategy learning algorithm and determining a reinforcement learning training logic;

the multi-machine cooperative maneuver strategy learning algorithm comprises a strategy coordination mechanism and a strategy learning mechanism:

step 3-1: designing a strategy coordination mechanism;

the air combat confrontation is regarded as a competitive game between n unmanned aerial vehicles and m targets, a model is established based on a framework of a random game, and one random game can use one tuple

To represent; s represents the state space of the current game, and all agents can be shared; UAV_iIs defined as A_i，Target_iIs defined as B_i；T:S×Aⁿ×B^m→ S denotes the deterministic transfer function of the environment,

representing a UAV_iA reward value function of; the action spaces of drones in respective convoy in the collaborative air battle are the same, i.e. for UAV_iAnd Target_jRespectively have A_iA and B_i＝B；

Defining the global reward value of the formation of the unmanned aerial vehicles as the average value of the reward values of the unmanned aerial vehicles, namely:

wherein r (s, a, b) represents that at the time t, the environment state is s, and the unmanned aerial vehicles form to take action a belongs to AⁿThe target formation takes action B ∈ B^mUnder the condition of (1), the unmanned aerial vehicles form the acquired reward value;

the goal of drone formation is to learn a strategy to make the expectation of a discount accumulation of reward values

Maximum, wherein 0<λ ≦ 1 is a discount factor; the random game is transformed into a markov decision problem:

wherein Q^*(. cndot.) represents a state-action value function for executing action a in state s, r (s, a) represents a reward value for executing action a in state s, θ represents a network parameter of the policy function, s' represents a state at a next time, a_θRepresenting a parameterized policy function;

the reward value function for each drone is defined as:

wherein r is_i(s, a, b) represents that at the time t, the environment state is s, and the unmanned aerial vehicle formation takes action as a belongs to AⁿThe target formation takes action B ∈ B^mIn case of (A), UAV_iThe value of the prize earned, wherein

Characterizing UAVs_iRelative to the situational dominance value of the target assigned to it,

is a penalty term to constrain the UAV_iDistance from friend machine;

based on the formula (13), for n unmanned aerial vehicle individuals, there are n bellman equations shown as the formula (14), wherein the policy function a_θWith the same parameters θ:

wherein the content of the first and second substances,

UAV representing an unmanned aerial vehicle_iState-action value function for performing action a in state s, r_i(s, a) denotes unmanned aerial vehicle UAV_iA reward value obtained by performing action a in state s;

step 3-2: designing a strategy learning mechanism;

establishing a multi-unmanned aerial vehicle maneuvering decision model by adopting a bidirectional circulation neural network BRNN;

the multi-unmanned aerial vehicle air combat maneuver decision model consists of an Actor network and a Critic network, wherein the Actor network is formed by connecting Actor networks of all unmanned aerial vehicle individuals through BRNN, and the Critic network is formed by connecting Critic networks of all unmanned aerial vehicle individuals through BRNN; setting hidden layers in strategy networks Actor and Q networks Critic in a single unmanned aerial vehicle decision model into a BRNN circulating unit in a multi-unmanned aerial vehicle air combat maneuver decision model, and then expanding the BRNN according to the number of unmanned aerial vehicles; the input of the multi-unmanned aerial vehicle air combat maneuver decision model is the current air combat situation, and the action value of each unmanned aerial vehicle is output;

defining UAVs_iHas an objective function of

Representing individual prize values r_iThe expectation of the accumulation of (a) is,

indicating the adoption of an action policy a under a state transition function T_θThe obtained state distribution is stable in the traversal markov decision process, so that the target functions of n unmanned planes are recorded as J (theta):

according to the multi-agent deterministic policy gradient theorem, for the target function J (theta) of the n drones described in equation (15), the gradient of the policy network parameter theta is

Using parameterized Critic function Q^ξ(s, a) to estimate the state-action function in equation (16)

When the Critic is trained, a square sum loss function is adopted to calculate a parameterized Critic function Q^ξThe gradient of (s, a) is shown as equation (17), where ξ is a parameter of the Q network:

optimizing the Actor and Critic networks by adopting a random gradient descent method based on the formulas (16) and (17); in the interactive learning process, the learning optimization of the collaborative air combat strategy is completed through the data updating parameters obtained by trial and error;

step 3-3: according to the strategy coordination mechanism and the strategy learning mechanism, the reinforcement learning training process for determining the multi-unmanned aerial vehicle collaborative air combat maneuver decision model is as follows:

step 3-3-1: firstly, initialization is carried out: determining the forces and situations of both air combat parties, and arranging n unmanned aerial vehicles and m targets for air combat confrontation, wherein n is more than or equal to m; randomly initializing an online network parameter theta of the Actor and a parameter xi of the online network of Critic, and then respectively assigning the parameters of the Actor and the Critic online networks to the parameters of corresponding target networks, namely theta '← theta, xi' ← xi, theta 'and xi' are the parameters of the Actor and the Critic target networks respectively; initializing an experience pool R₁The system is used for storing experience data obtained by probe interaction; initializing a random process epsilon for realizing the exploration of action values;

step 3-3-2: determining an initial state of training, namely determining the relative situation of two parties at the beginning of an air battle; setting initial position information and speed information of each unmanned aerial vehicle in the unmanned aerial vehicle formation and target formation, namely determining (x, y, z, v, gamma, psi) information of each unmanned aerial vehicle, and calculating to obtain an air war initial state s according to the definition of a state space¹(ii) a Let t equal 1;

step 3-3-3: and (3) repeatedly carrying out multi-screen training according to the initial state, and executing the following operations in each single-screen air combat simulation:

firstly according to the current air war state s^tCalculating a target distribution matrix X based on the target distribution method^t(ii) a Then each UAV_iAccording to state s^tAnd a random process ∈ generating action values

And executing, at the same time, each Target in the Target formation_iPerforming an action

State transition to s after execution^t+1Calculating the value of the prize award according to equation (13)

Transfer a process variable

Stored as a piece of experience data in an experience pool R₁Performing the following steps; during learning, from experience pool R₁Randomly sampling a batch of M pieces of empirical data

Calculating the target Q value of each unmanned aerial vehicle, namely for each piece of M data, the following steps are carried out:

the gradient estimate for Critic was calculated according to equation (17) as follows:

the gradient estimation value of Actor is calculated according to equation (16) as follows:

updating online network parameters of Actor and Critic by adopting an optimizer according to the obtained gradient estimation values delta xi and delta theta; after the online network optimization is completed, the target network parameters are updated in a soft update mode, namely

Wherein κ ∈ (0, 1);

step 3-3-4: and after the single-screen simulation is finished, if the simulation reaches the set maximum screen number, stopping the reinforcement learning training, otherwise adding 1 to t, and repeatedly executing the step 3-3-3.

The invention has the following beneficial effects:

the invention is based on a multi-agent reinforcement learning method, establishes a method for generating a multi-unmanned aerial vehicle cooperative air combat maneuver decision strategy, adopts a bidirectional cyclic neural network to establish a communication network, connects discrete unmanned aerial vehicles into a formation cooperative decision network, establishes a multi-unmanned aerial vehicle cooperative air combat maneuver decision model under an Actor-critic architecture, and realizes the unification of the learning of individual behaviors of the unmanned aerial vehicles and the overall combat target of the formation. Different from the mode that a multi-airplane air battle is decomposed into a plurality of 1v1 air battles, the multi-unmanned-airplane collaborative air battle maneuver decision model established by the invention can obtain collaborative air battle maneuver strategies through autonomous learning, and tactical coordination is realized in the air battle process, so that the situation advantage of the whole formation operation is achieved and opponents are defeated.

Drawings

FIG. 1 is a three-degree-of-freedom particle motion model of the unmanned aerial vehicle.

FIG. 2 is a one-to-one close-up air combat situation diagram of the present invention.

FIG. 3 is a diagram showing the result of the maneuver decision of the UAV under the condition of uniform velocity and linear flight.

FIG. 4 is a model structure of the multi-unmanned aerial vehicle collaborative air combat maneuver decision based on the bidirectional cyclic neural network.

FIG. 5 is a schematic diagram of an air combat simulated maneuver trajectory based on learned strategies after training is completed.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

The invention aims to provide a method for generating a multi-unmanned aerial vehicle collaborative air combat autonomous maneuver decision based on multi-agent reinforcement learning.

The invention realizes the consistency of the state understanding of the unmanned aerial vehicles through the communication network. According to the characteristics of multi-target attack, the reinforcement learning reward value of each unmanned aerial vehicle is calculated by combining target distribution and the air combat situation evaluation value, and the individual reinforcement learning process is guided through the reward of each unmanned aerial vehicle, so that the tactical targets of the formation are closely combined with the learning target of a single unmanned aerial vehicle, and a collaborative tactical maneuver strategy is generated. Tactical coordination is realized in the air combat process, the situation advantage of the whole formation combat is achieved, and the opponents are combed.

A multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning comprises the following steps:

and

respectively representing the values of the speed v of the unmanned aerial vehicle on three coordinate axes of xyz; the flight path angle gamma represents an included angle between the speed v of the unmanned aerial vehicle and the horizontal plane o-x-y; the heading angle psi represents an included angle between a projection v' of the unmanned aerial vehicle speed v on an o-x-y plane and an oy axis, and g represents gravity acceleration; [ n ] of_x,n_z,μ]Is a control variable, n, for controlling the unmanned aerial vehicle to maneuver_xThe overload in the speed direction of the unmanned aerial vehicle represents the thrust and deceleration action of the unmanned aerial vehicle; n is_zIndicating the overload of the pitching direction of the unmanned aerial vehicle, namely normal overload; μ is the roll angle around the drone velocity vector; through n_xControlling the speed of the unmanned plane by n_zAnd mu, controlling the direction of the speed vector of the unmanned aerial vehicle, and further controlling the unmanned aerial vehicle to perform maneuvering action; as shown in fig. 1;

setting the maximum interception distance of the missile as D_mAngle of view of

According to the definition of the missile interception area, if the target is in the interception area of the missile of the own party, the fact that the own party can launch a weapon to attack the target and the own party is in advantage is shown, and the advantage value eta when the unmanned aerial vehicle intercepts the target is defined_UComprises the following steps:

wherein (x)_T,y_T,z_T) Representing position coordinates of the target; re represents a large positive number, and can be manually adjusted according to the training effect to guide the training effect of the model;

defining dominance values obtained by target intercepting unmanned aerial vehicleη_TComprises the following steps:

wherein (x)_U，y_U，z_U) Representing position coordinates of the drone;

η_A＝η_U-η_T (4)

in addition, in the air battle, because the field angle of an aerogun and some missiles is small, the launching condition can be formed only under the condition of tailgating, the requirement on the angle relationship is severe, and the dominant value eta obtained based on the angle parameter and the distance parameter of the aerogun and the missiles is defined_BComprises the following steps:

η＝η_A+η_B (6)

step 1-3: the state of the air combat maneuver decision model is composed of a set of variables capable of completely describing the air combat situation, as shown in fig. 2, the geometric relationship of the air combat situation at any moment is completely determined by the information contained in the unmanned aerial vehicle position vector, the unmanned aerial vehicle speed vector, the target position vector and the target speed vector in the same coordinate system, so that the description of the air combat situation is composed of the following 5 aspects:

s＝[v_U,γ_U,ψ_U,v_T,γ_T,ψ_T,D,γ_D,ψ_D,α_U,α_T,z_U,z_T] (7)

as shown in FIG. 3, as the number of drones and targets increases in a multi-aircraft air battle, each drone needs to take maneuver decisions into consideration with other stationsThere is the relative state of the drone (target versus friend). The relative situation of a drone and another drone in an air battle can be fully described by the 13 variables described in equation (7). Remember any two UAVs_iAnd Target_jRelative state therebetween is

Any UAV in multi-aircraft air combat_iThe observation state of (a) is:

S_i＝[∪s_ij|_j＝1,2...,m,∪s_ik|_{k＝1,2,...,n(k≠i)}] (8)

And

in addition to that, consider a UAV_iFriend machine UAV_kIf the distance from the friend aircraft is too close, the risk of collision is increased, so that the UAV is defined_iFriend machine UAV_kThe situation assessment function of (1) is:

wherein D_ikUAV for unmanned aerial vehicle_iFriend machine UAV_kA distance between, D_safeFor the minimum safe distance between two drones, P is a large positive number.

in the multi-machine cooperative air combat, from the overall perspective of the air combat, the maximum advantage of the formation of the unmanned aerial vehicles in the air combat means that each enemy plane can be attacked by the weapon of the unmanned aerial vehicle, but each unmanned aerial vehicle can only maneuver against one target at the same time, so that the multi-machine cooperative air combat also needs to carry out target distribution at the same time when maneuvering decision is carried out, and cooperation of tactical strategies is realized.

Let the target allocation matrix be X ═ X_ij]，x _ij1 denotes Target_jTo UAV_i，x_ij0 denotes Target_jNot allocated to UAV_i(ii) a In the process of multi-aircraft air combat, the situation that a plurality of targets are simultaneously in the attack area of one unmanned aerial vehicle exists, so that the multi-target attack capability of the unmanned aerial vehicle needs to be considered in target distribution, and each unmanned aerial vehicle is designed to be capable of launching missiles to L targets in the attack area at most, namely

All unmanned aerial vehicles are required to be put into combat, so that

step 2-2: the unmanned aerial vehicle performs a series of maneuvers in the air war to enable a target to enter an attack area and launch weapons to the target, the target in the attack area is firstly distributed in the target distribution process, and then the target outside the attack area is distributed, so that the target distribution method is divided into the following two parts:

step 2-2-1: preferentially distributing targets located in the attack area;

to be provided with

And

constructing two n x m dimensional matrices H for elements_AAnd H_B，

From equation (3) if Target_jIn the UAV_iIn the attack area of (1), then

Otherwise

Thus, let

Order to

Step 2-2-2: allocating targets located outside the attack area;

for UAV_iIf a target within its attack zone has already been allocated, it can no longer be allocated a target outside the attack zone; for a plurality of targets outside the attack area, the unmanned aerial vehicle cannot maneuver so that the targets are in the attack area, and therefore when the targets are outside the attack area, only one target can be allocated to the unmanned aerial vehicle; therefore, after the target allocation in the attack area is completed, the remaining target allocation work is changed into a process of allocating 1 target to the unallocated unmanned aerial vehicle, and the allocation is realized by adopting the hungarian algorithm, which specifically comprises the following steps:

Based on

step 3-1: designing a strategy coordination mechanism;

Whether the unmanned aerial vehicle is superior in confrontation in the collaborative air battle is evaluated according to the situation of all the unmanned aerial vehicles. Defining the global reward value of the formation of the unmanned aerial vehicles as the average value of the reward values of the unmanned aerial vehicles, namely:

the global reward value defined by the formula (11) can reflect the situation of the whole formation of the unmanned aerial vehicles, but the global reward value cannot reflect the action of the unmanned aerial vehicle individuals in the formation cooperation. In fact, global coordination is driven by the goals of each individual, and therefore, the reward value function for each drone is defined as:

is a penalty term to constrain the UAV_iDistance from friend machine;

wherein the content of the first and second substances,

in the learning and training process, behavior feedback of each unmanned aerial vehicle in target distribution, situation advantages and safety collision avoidance is defined through distribution of reward values, strategy cooperation is achieved after training, behavior of each unmanned aerial vehicle can be acquiescent with behaviors of other friends, and centralized target distribution is not needed.

Step 3-2: designing a strategy learning mechanism;

the premise of realizing collective cooperation based on multi-agent reinforcement learning is that information interaction among individuals, so that a bidirectional cyclic neural network BRNN is adopted to establish a multi-unmanned aerial vehicle maneuvering decision model, the information interaction among unmanned aerial vehicles is ensured, and the coordination of a formation maneuvering strategy is realized;

the model is established as shown in fig. 4, the multi-unmanned aerial vehicle air combat maneuver decision model is composed of an Actor network and a criticic network, wherein the Actor network is formed by connecting Actor networks of all unmanned aerial vehicle individuals through BRNN, and the criticic network is formed by connecting criticic networks of all unmanned aerial vehicle individuals through BRNN; setting hidden layers in strategy networks Actor and Q networks Critic in a single unmanned aerial vehicle decision model into a BRNN circulating unit in a multi-unmanned aerial vehicle air combat maneuver decision model, and then expanding the BRNN according to the number of unmanned aerial vehicles; the input of the air combat maneuver decision model of the multiple unmanned aerial vehicles is the current air combat situation, and action values of all the unmanned aerial vehicles are output;

since the model is built based on BRNN, it is learned for network parametersThe idea is to expand the network into n (number of drones) sub-networks to calculate the inverse gradient and then update the network parameters using a time-based back propagation algorithm. Gradient at Q of each individual drone_iThe functions and the strategy functions are propagated, and when the model is learned, the individual reward value of each unmanned aerial vehicle influences the action of each unmanned aerial vehicle, so that the generated gradient information is reversely propagated, and the model parameters are updated.

Defining UAVs_iHas an objective function of

indicating the adoption of an action policy a under a state transition function T_θThe obtained state distribution is generally stable in the traversal markov decision process, so the target functions of n unmanned planes are recorded as J (theta):

Transfer a process variable

Calculating target Q values for individual drones, i.e. for each of the M pieces of data, there is

Wherein κ ∈ (0, 1);

The specific embodiment is as follows:

the method is used for unmanned aerial vehicle dual-machine formation, and specifically comprises the following steps:

1. and designing a multi-machine air combat environment model.

In the multi-airplane air battle, the number of the unmanned aerial vehicles is set to be 2 and respectively recorded as UAVs_i(i is 1,2) and the number of targets is 2, each of which is denoted as Target_j(j＝1,2)。

Calculating according to the step 1 to obtain any UAV_iIs observed state S_i；

In the process of multi-airplane air combat, each unmanned aerial vehicle makes own maneuvering decision according to the situation of the unmanned aerial vehicle in the air combat environment, and the unmanned aerial vehicle passes through n according to the unmanned aerial vehicle dynamics model shown in the formula (2)_x，n_zAnd μ three variables control flight, therefore UAV_iHas a motion space of A_i＝[n_xi,n_zi,μ_i]。

And

in addition to this, UAVs should also be considered_iFriend machine UAV_kIf the distance from the friend aircraft is too close, the risk of collision is increased, so that the UAV is defined_iFriend machine UAV_kThe evaluation function of (2) is shown in equation (9).

2. Designing a multi-machine cooperative target distribution method.

Two unmanned aerial vehicles fight 2 targets. UAV according to formula (6)_i(i ═ 1,2) relative Target_jThe situation evaluation value of (j ═ 1,2) is

Obtaining the target distribution matrix X ═ X from the step 2_ij]_n×m。

3. And designing a multi-machine cooperative maneuver strategy learning algorithm.

The unmanned aerial vehicle is subjected to reinforcement learning training in an air battle scene that the unmanned aerial vehicle and the target aircraft fly in opposite directions and the target flies in uniform-speed linear motion.

The air war background of the multi-unmanned aerial vehicle collaborative air war is set to be a short distance air war, and the parameters of the air war environment model are set as follows. Maximum interception distance D of missile_max3km, field of view angle

Minimum safe distance D between two unmanned aerial vehicles_safe200m, the dominance value Re of the target acquisition is 5, the penalty value P is 10, and the maximum speed v is set in the motion model of the unmanned aerial vehicle_max400m/s, minimum velocity v_minControl parameter n of 90m/s_x∈[-1,2]，n_z∈[0,8]，μ∈[-π,π]。

The Actor network of the maneuvering decision model is divided into an input layer, a hidden layer and an output layer, wherein the input layer inputs an air combat state, the hidden layer is divided into 2 layers, the 1 st layer is composed of 400 LSTM neurons in the forward direction and the reverse direction respectively, the layer is expanded according to the number of unmanned aerial vehicles and a bidirectional circulation neural network structure to form a communication layer, the 2 nd layer is composed of 100 neurons, a tanh activation function is adopted, and parameters are uniformly distributed [ -3 × 10 [^-4,3×10^-4]Random initialization, outputting 3 control quantities by output layer, adopting tanh activating function, uniformly distributing parameter [ -2 × 10 [)^-5,2×10^-5]Random initialization, output range [0,1 ] of tanh by linear adjustment]Are respectively adjusted to [1,2]、[0,8]And [ - π, π]。

Critic network of maneuver decision modelsThe unmanned aerial vehicle is also divided into an input layer, a hidden layer and an output layer, wherein the input layer inputs an air combat state and 3 action values of the unmanned aerial vehicle, the hidden layer is divided into 2 layers, the 1 st layer consists of 500 LSTM neurons in forward and reverse directions, the layer is expanded according to the number of the unmanned aerial vehicles and a bidirectional circulation neural network structure to form a communication layer, the 2 nd layer consists of 150 neurons, a tanh activation function is adopted, and parameters are uniformly distributed [ -3 × 10 [ -3 [^-4,3×10^-4]Random initialization, output layer outputs 1Q value, tan h activation function is adopted, and parameters are uniformly distributed [ -2 × 10 [)^-4,2×10^-4]And (4) random initialization. The Actor and ciptic models both adopt Adam optimizers, the learning rate of the Actor network is set to be 0.001, and the learning rate of the criticc network is set to be 0.0001. The discount factor λ is 0.95 and the soft update factor k of the target network is 0.005. The random process epsilon of action value exploration employs the OU process. The size of the empirical playback space R is set to 10⁶The size of batch is set to 512.

FIG. 5 is an air combat simulated maneuver trajectory based on learned strategies after training is complete. And simulating a maneuver trajectory based on the air war of the learned strategy. As can be seen in the figure, at an initial moment, UAVs 1 and 2 fly in opposite directions facing targets 1 and 2 respectively, according to the target assignment algorithm, UAV1 and UAV2 select target 1 and target 2, respectively, as targets of attack for maneuvering engagement, in the process of approaching to respective targets, the course and the height are adjusted to avoid possible collision in the intersection, before and after meeting with the target, the UAV1 turns to the right side, and the UAV2 turns to the left side, so that the cross shield is realized, after the two unmanned aerial vehicles turn towards the opposite direction, respective attack targets are exchanged instead of continuously turning to chase the respective initially distributed targets, tactical coordination is embodied, and it is proved that through reinforcement learning training, the unmanned aerial vehicles can learn to obtain an air combat maneuvering strategy by two-machine formation, tactical coordination between two machines is realized, advantages are obtained in air combat, and multi-machine air combat is not decomposed into a plurality of 1v1 countermeasures.

Claims

1. A multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning is characterized by comprising the following steps:

and

step 1-2: setting missilesOnly have a tail attack capability; in the interception area of the missile by v_UAnd v_TRepresenting the speed of the drone and the target, respectively; d is a distance vector and represents the position relation between the unmanned aerial vehicle and the target; alpha is alpha_UAnd alpha_TRespectively representing an included angle between the speed vector of the unmanned aerial vehicle and the distance vector D and an included angle between the target speed vector and the distance vector D;

setting the maximum interception distance of the missile as D_mAngle of view of

wherein (x)_U，y_U，z_U) Representing position coordinates of the drone;

η_A＝η_U-η_T (4)

the above formula shows that when the unmanned aerial vehicle tailors the target, the dominance value is eta_B1 is ═ 1; when the unmanned aerial vehicle is tailed by the target, the advantage value is eta_B-1; when the distance between the unmanned aerial vehicle and the target is larger than the maximum interception distance of the missile, the advantage value is attenuated according to an exponential function;

η＝η_A+η_B (6)

s＝[v_U,γ_U,ψ_U,v_T,γ_T,ψ_T,D,γ_D,ψ_D,α_U,α_T,z_U,z_T] (7)

remember any two UAVs_iAnd Target_jRelative state therebetween is

Any UAV in multi-aircraft air combat_iThe observation state of (a) is:

S_i＝[∪s_ij|_j＝1,2...,m,∪s_ik|_{k＝1,2,...,n(k≠i)}] (8)

In-multiple-machine cooperative air battleIn the method, the situation assessment value eta between each unmanned aerial vehicle and each target is calculated according to the formula (4) and the formula (5)_AAnd η_BRecording UAV_iAnd Target_jHas a situation evaluation value of

And

At the same time, the war time is to be avoidedTargets are omitted and the attack is abandoned, i.e. each target should be assigned at least one drone to attack, so

All unmanned aerial vehicles are required to be put into combat, so that

step 2-2-1: preferentially distributing targets located in the attack area;

to be provided with

And

constructing two n x m dimensional matrices H for elements_AAnd H_B，

From equation (3) if Target_jIn the UAV_iIn the attack area of (1), then

Otherwise

Thus, let

Order to

X of corresponding positions of all zero elements_ij1 is ═ 1; during the distribution process, if at unmanned aerial vehicle UAV_iThe target number x in the attack area exceeds the maximum attack target number of the unmanned aerial vehicle, namely x>L, then use UAV_iAt H_BSorting corresponding element values in the matrix, selecting L targets with the maximum element values to be allocated to the UAV_i；

Step 2-2-2: allocating targets located outside the attack area;

Based on

Complete the above two stepsAfter the step (b), the distribution of all targets is completed, and a target distribution matrix X ═ X is obtained_ij]_n×m；

step 3-1: designing a strategy coordination mechanism;

wherein r (s, a, b) represents that at the time t, the environment state is s, and the unmanned aerial vehicles form to take action a belongs to AⁿThe target formation takes action B ∈ B^mIn case of (2), unmanned plane knittingThe value of the reward earned by the team;

the reward value function for each drone is defined as:

is a penalty term to constrain the UAV_iDistance from friend machine;

wherein the content of the first and second substances,

step 3-2: designing a strategy learning mechanism;

defining UAVs_iHas an objective function of

step 3-3-1: firstly, initialization is carried out: determining the forces and situations of both air combat parties, and arranging n unmanned aerial vehicles and m targets for air combat confrontation, wherein n is more than or equal to m; randomly initializing an online network parameter theta of the Actor and a parameter xi of the online network of Critic, and then respectively assigning the parameters of the Actor and the Critic online networks to the parameters of corresponding target networks, namely theta '← theta, xi' ← xi, theta 'and xi' are the parameters of the Actor and the Critic target networks respectively; initializing an experience pool R₁The system is used for storing experience data obtained by probe interaction; initialize a randomThe machine process epsilon is used for realizing the exploration of action values;

Transfer a process variable

Wherein κ ∈ (0, 1);