CN117171984A - Air combat maneuver decision method based on deep reinforcement learning - Google Patents

Air combat maneuver decision method based on deep reinforcement learning

Info

Publication number
CN117171984A
CN117171984A (application number CN202311071553.1A)
Authority
CN
China
Prior art keywords
aircraft
air combat
reinforcement learning
deep reinforcement
maneuver decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311071553.1A
Other languages
Chinese (zh)
Inventor
陈宇哲
李秋妮
宋祺
焦城阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Engineering University of PLA
Original Assignee
Air Force Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Engineering University of PLA filed Critical Air Force Engineering University of PLA
Priority to CN202311071553.1A
Publication of CN117171984A
Legal status: Pending

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an air combat maneuver decision method based on deep reinforcement learning, which comprises the following steps: 1. constructing a one-on-one air combat adversarial scenario between aircraft A and aircraft B; 2. constructing the three-dimensional relative situation, reward function and action space of the two aircraft; 3. establishing and training an air combat maneuver decision model based on deep reinforcement learning; 4. making maneuver decisions based on the air combat maneuver decision model. By adding a long short-term memory (LSTM) network layer on top of the PPO algorithm, the invention improves the ability of aircraft A to perceive the temporal characteristics of maneuver sequences, and by randomly perturbing the actions output by aircraft A during air combat according to the noise sampling frequency parameter it improves the ability of aircraft A to find the optimal maneuver decision, effectively solving the problem that traditional reinforcement learning algorithms have insufficient perception of the temporal characteristics of air combat maneuvers.

Description

Air combat maneuver decision method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of air combat decision-making, and particularly relates to an air combat maneuver decision method based on deep reinforcement learning.
Background
As an important military force in modern warfare, an air force must make air combat maneuver decisions whether the platform is a manned fighter or an unmanned aerial vehicle: a manned fighter relies on the pilot as the main decision-maker and on human-in-the-loop control, while an unmanned aerial vehicle relies mainly on intelligent algorithms. Making the air combat process intelligent is the key route to future intelligent air combat, and it runs through the whole observe-judge-decide-act loop of air combat. Intelligent air combat decision-making will greatly change the mode and form of future warfare and have a disruptive influence on its development. Intelligent air combat decision-making simulates the decisions made to control a fighter under various air combat conditions and is the core intelligent module of an intelligent combat aircraft. Because the reaction speed of such an aircraft exceeds that of any human pilot and is not constrained by pilot physiology, it has an advantage in seizing victory and carrying out active attacks. However, implementing intelligent air combat decision-making is very complex: it involves dynamic, real-time factors and a large solution space, which poses a significant challenge.
Air combat can be classified into short-range, medium-range and long-range air combat. Although air weapon technology has made significant progress and the air combat battlefield has extended from close range to medium and long range, close-range air combat cannot be neglected, and the related technology is evolving rapidly. At present, traditional air combat maneuver decision methods mainly include expert systems, influence diagrams, matrix games, differential games and the like; their common shortcomings are complex computation, poor real-time performance and heavy dependence on human expert knowledge.
Disclosure of Invention
Aiming at the above shortcomings of the prior art, the invention provides an air combat maneuver decision method based on deep reinforcement learning. On the basis of the PPO algorithm, a long short-term memory (LSTM) network layer is added to improve the ability of aircraft A to perceive the temporal characteristics of maneuvers, and the actions output by aircraft A in air combat are randomly perturbed according to the noise sampling frequency parameter to improve its ability to find the optimal maneuver decision. The method effectively solves the problem that traditional reinforcement learning algorithms have insufficient perception of the temporal characteristics of air combat maneuvers, and is convenient to popularize and use.
In order to solve the above technical problems, the invention adopts the following technical scheme: an air combat maneuver decision method based on deep reinforcement learning, characterized by comprising the following steps:
step one, constructing a one-on-one air combat adversarial scenario between aircraft A and aircraft B;
step two, constructing the three-dimensional relative situation, reward function and action space of aircraft A and aircraft B;
step three, establishing and training an air combat maneuver decision model based on deep reinforcement learning;
step four, making maneuver decisions based on the air combat maneuver decision model.
The above air combat maneuver decision method based on deep reinforcement learning is characterized in that: in step one, a one-on-one air combat adversarial scenario is constructed on an air combat simulation platform; in the scenario, aircraft A is the aircraft whose maneuvers are to be decided, and aircraft B is controlled by the simulation platform.
The air combat maneuver decision method based on the deep reinforcement learning is characterized by comprising the following steps of: in the second step, three-dimensional relative situation of the two-party aircraft A and B is constructedWherein z is r Altitude, z of the A-square aircraft b Is the altitude of the second aircraft, delta h is the relative altitude of the first and second aircraft, V r Is the velocity vector of the A-square aircraft, V b For the speed of the B-planeVector, deltav is the absolute value difference of the speeds of the first and second aircrafts, d is the distance vector of the first and second aircrafts, and AA and ATA are the two-aircraft disengaging angle and the disengaging angle observed from the first angle respectively;
according to the formulaConstructing a reward function R, wherein ∈>For the angle bonus function R a Normalized results,/-> Awarding a function R for speed v Normalized results,/-> For a height bonus function R h Normalized results,/-> For distance rewarding function R d Normalized results,/->D is the range of the missile on the A-square aircraft, D min Is the minimum value of the range of the missile on the first aircraft, d max For the maximum range of missiles on a square aircraft, < +.>To win or lose the bonus function R end Normalized results,/->k 1 、k 2 、k 3 And k 4 Respectively-> And->Weight coefficient, k of (2) 1 、k 2 、k 3 And k 4 Are all nonnegative numbers and k 1 +k 2 +k 3 +k 4 =1;
An action space [Δψ, ΔV, ΔH] of aircraft A relative to the previous step is constructed, where Δψ is the change in heading angle of aircraft A relative to the previous step, discretized within (-20°, +20°), ΔV is the change in speed of aircraft A relative to the previous step, discretized within (-100 km/h, +100 km/h), and ΔH is the change in altitude of aircraft A relative to the previous step, discretized within (-500 m, +500 m).
The above air combat maneuver decision method based on deep reinforcement learning is characterized in that: in step three, the air combat maneuver decision model based on deep reinforcement learning is established and trained as follows:
step 301, a single-layer LSTM network layer is established and used as the initial layer of the PPO algorithm network, so as to build the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning;
step 302, the three-dimensional relative situation of aircraft A and aircraft B is taken as the input parameter, and the action-space result of aircraft A relative to the previous step is output as the action to execute;
wherein the action-space result of aircraft A relative to the previous step, i.e. the new action a_t, is obtained according to the formula a_t = π(S_t; θ_π) + ε(S_t; θ_ε), where π(S_t; θ_π) is the action result given by the LSTM-PPO model for the three-dimensional relative situation S_t of aircraft A and aircraft B under the policy hyper-parameters θ_π, ε(S_t; θ_ε) is the value of the noise function for the three-dimensional relative situation S_t and the noise parameters θ_ε, and the noise parameters θ_ε ~ N(0, σ²) are sampled from the Gaussian distribution N(0, σ²) at the beginning of the episode;
step 303, the reward function value is calculated;
step 304, a noise sampling frequency is set; in any episode, when the set number of time steps has not yet been reached, step 305 is executed, and when the set number of time steps is reached, step 306 is executed;
step 305, the three-dimensional relative situation of aircraft A and aircraft B continues to be taken as the input parameter and the action-space result of aircraft A relative to the previous step as the output action for training, and the reward function value is calculated in the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning;
step 306, the noise is resampled and the policy hyper-parameters θ_π of the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning are updated;
step 307, steps 302 to 306 are repeated in a loop until the reward function value is no smaller than the set threshold, thereby completing the training of the air combat maneuver decision model based on deep reinforcement learning.
The above air combat maneuver decision method based on deep reinforcement learning is characterized in that: in step four, a maneuver-decision test is performed on aircraft A using the air combat maneuver decision model based on deep reinforcement learning trained in step 307.
Compared with the prior art, the invention has the following advantages:
(1) The method effectively solves the problem that traditional reinforcement learning algorithms have insufficient perception of the temporal characteristics of air combat maneuvers.
(2) The air combat maneuver decision method based on deep reinforcement learning formed by the method shows good adversarial performance in 1v1 close-range air combat simulation.
(3) The method can output maneuver decisions for single-aircraft air combat and can be trained for different scenarios or different aircraft types using the deep reinforcement learning algorithm.
(4) The method has good compatibility and can be quickly ported to different simulation environments and algorithms.
The technical scheme of the present invention is further described in detail below through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the three-dimensional relative situation of the two aircraft.
FIG. 3 is a schematic diagram of the exploration strategy based on generalized state-dependent exploration according to the present invention.
Fig. 4 shows the combat area set for the 1v1 air combat in the embodiment of the present invention.
FIG. 5 is a comparison curve of episode step length for the method of the present invention versus the conventional PPO algorithm.
FIG. 6 is a comparison curve of reward for the method of the present invention versus the conventional PPO algorithm.
Fig. 7 is a graph of the change in win rate of the air combat aircraft.
Fig. 8 is a first diagram of the maneuver decision trajectory of the air combat aircraft.
Fig. 9 is a second diagram of the maneuver decision trajectory of the air combat aircraft.
Detailed Description
As shown in fig. 1 to 9, the air combat maneuver decision method based on deep reinforcement learning of the present invention comprises the following steps:
step one, constructing a one-on-one air combat adversarial scenario between aircraft A and aircraft B;
step two, constructing the three-dimensional relative situation, reward function and action space of aircraft A and aircraft B;
step three, establishing and training an air combat maneuver decision model based on deep reinforcement learning;
step four, making maneuver decisions based on the air combat maneuver decision model.
In this embodiment, in step one, a one-on-one air combat adversarial scenario is constructed on an air combat simulation platform; in the scenario, aircraft A is the aircraft whose maneuvers are to be decided, and aircraft B is controlled by the simulation platform.
It should be noted that the Mozi wargaming system is used as the air combat simulation platform. The Mozi system can carry out tactical- and campaign-level simulation and deduction, and provides a Python-based AI development kit that supports the development of military intelligent agents. In the simulation environment, a side A and a side B are added and set as hostile to each other, with the awareness level and training level of both sides set to ordinary. An air force base is added for each side, a fighter is added to each base, and each fighter carries 4 short-range air-to-air missiles. The two fighters are positioned so that the distance between them is 98 km and the initial altitude is 10973 m, and the setup is saved as the scenario.
In step two, the three-dimensional relative situation of aircraft A and aircraft B is constructed as S = [z_r, z_b, Δh, V_r, V_b, Δv, d, AA, ATA], where z_r is the altitude of aircraft A, z_b is the altitude of aircraft B, Δh is the relative altitude of the two aircraft, V_r is the velocity vector of aircraft A, V_b is the velocity vector of aircraft B, Δv is the absolute difference of the two aircraft's speeds, d is the distance vector between the two aircraft, and AA and ATA are respectively the aspect angle and the antenna train angle between the two aircraft as observed from aircraft A;
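For illustration only (not part of the claimed method), the following Python sketch shows one way such a relative-situation vector could be assembled from the raw states of the two aircraft; the input fields and the angle computations are assumptions, since the patent does not give the exact formulas for AA and ATA, and scalar speeds and the line-of-sight distance are used in place of the full vectors for compactness.

```python
import numpy as np

def relative_situation(pos_r, vel_r, pos_b, vel_b):
    """Build an illustrative relative-situation vector
    [z_r, z_b, dh, |V_r|, |V_b|, dv, |d|, AA, ATA].

    pos_* and vel_* are 3-D position (x, y, z in metres) and velocity arrays.
    AA and ATA are computed here as the angles between each velocity vector
    and the line of sight from A to B (an assumed convention).
    """
    z_r, z_b = pos_r[2], pos_b[2]
    dh = z_r - z_b
    v_r, v_b = np.linalg.norm(vel_r), np.linalg.norm(vel_b)
    dv = abs(v_r - v_b)
    los = pos_b - pos_r                      # line-of-sight vector from A to B
    dist = np.linalg.norm(los)

    def angle_deg(u, w):
        cos = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w) + 1e-9)
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    ata = angle_deg(vel_r, los)              # angle off A's nose to the target
    aa = angle_deg(vel_b, los)               # angle between B's velocity and the line of sight
    return np.array([z_r, z_b, dh, v_r, v_b, dv, dist, aa, ata])
```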
The reward function R is constructed according to the formula R = k_1·R_a' + k_2·R_v' + k_3·R_h' + k_4·R_d' + R_end', where R_a' is the normalized result of the angle reward function R_a, R_v' is the normalized result of the speed reward function R_v, R_h' is the normalized result of the altitude reward function R_h, R_d' is the normalized result of the distance reward function R_d, in which D is the range of the missiles carried by aircraft A and d_min and d_max are the minimum and maximum values of that range, R_end' is the normalized result of the win/lose reward function R_end, and k_1, k_2, k_3 and k_4 are respectively the weight coefficients of R_a', R_v', R_h' and R_d'; k_1, k_2, k_3 and k_4 are all non-negative and k_1 + k_2 + k_3 + k_4 = 1;
An action space [Δψ, ΔV, ΔH] of aircraft A relative to the previous step is constructed, where Δψ is the change in heading angle of aircraft A relative to the previous step, discretized within (-20°, +20°), ΔV is the change in speed of aircraft A relative to the previous step, discretized within (-100 km/h, +100 km/h), and ΔH is the change in altitude of aircraft A relative to the previous step, discretized within (-500 m, +500 m).
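As a minimal sketch of how such an incremental action could be applied to the command state of aircraft A (the state fields and the speed/altitude floors below are illustrative assumptions, only the ±20°, ±100 km/h and ±500 m bounds come from the description):

```python
from dataclasses import dataclass

@dataclass
class AircraftCommand:
    heading_deg: float   # current commanded heading (0-360 degrees)
    speed_kmh: float     # current commanded speed (km/h)
    altitude_m: float    # current commanded altitude (m)

def apply_action(cmd: AircraftCommand, d_psi: float, d_v: float, d_h: float) -> AircraftCommand:
    """Apply an incremental action [d_psi, d_v, d_h] to the previous command,
    clipping each increment to the action-space bounds described above."""
    d_psi = max(-20.0, min(20.0, d_psi))
    d_v = max(-100.0, min(100.0, d_v))
    d_h = max(-500.0, min(500.0, d_h))
    return AircraftCommand(
        heading_deg=(cmd.heading_deg + d_psi) % 360.0,
        speed_kmh=max(200.0, cmd.speed_kmh + d_v),      # illustrative lower bound
        altitude_m=max(100.0, cmd.altitude_m + d_h),    # illustrative lower bound
    )
```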
In this embodiment, k_1, k_2, k_3 and k_4 are set to 0.5, 0.2, 0.2 and 0.1 respectively.
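The following Python sketch illustrates how the weighted reward could be assembled; the individual terms are passed in already normalized, since the patent does not spell out each normalization, and adding the win/lose term unweighted is an assumption consistent with only four weights being defined.

```python
def total_reward(r_angle, r_speed, r_height, r_dist, r_end,
                 k=(0.5, 0.2, 0.2, 0.1)):
    """Combine normalized reward terms into the scalar reward R.

    r_angle, r_speed, r_height, r_dist, r_end are the normalized values of
    R_a, R_v, R_h, R_d and R_end; k holds the weights k_1..k_4, which must be
    non-negative and sum to 1 (the default uses the embodiment values above).
    """
    k1, k2, k3, k4 = k
    assert all(ki >= 0 for ki in k) and abs(sum(k) - 1.0) < 1e-9
    return k1 * r_angle + k2 * r_speed + k3 * r_height + k4 * r_dist + r_end
```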
In this embodiment, in step three, the air combat maneuver decision model based on deep reinforcement learning is established and trained as follows:
step 301, a single-layer LSTM network layer is established and used as the initial layer of the PPO algorithm network, so as to build the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning;
step 302, the three-dimensional relative situation of aircraft A and aircraft B is taken as the input parameter, and the action-space result of aircraft A relative to the previous step is output as the action to execute;
wherein the action-space result of aircraft A relative to the previous step, i.e. the new action a_t, is obtained according to the formula a_t = π(S_t; θ_π) + ε(S_t; θ_ε), where π(S_t; θ_π) is the action result given by the LSTM-PPO model for the three-dimensional relative situation S_t of aircraft A and aircraft B under the policy hyper-parameters θ_π, ε(S_t; θ_ε) is the value of the noise function for the three-dimensional relative situation S_t and the noise parameters θ_ε, and the noise parameters θ_ε ~ N(0, σ²) are sampled from the Gaussian distribution N(0, σ²) at the beginning of the episode;
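A minimal sketch of the action perturbation in step 302 is given below, assuming the noise function is a simple linear mapping of the state; this follows the generalized state-dependent exploration (gSDE) idea, in which the noise weights θ_ε are drawn from N(0, σ²) and kept fixed until they are resampled, so exploration is a deterministic function of the state between resampling points. The σ value and dimensions are assumptions.

```python
import numpy as np

class StateDependentNoise:
    """Noise epsilon(S_t; theta_eps) with theta_eps ~ N(0, sigma^2), resampled only periodically."""

    def __init__(self, state_dim, action_dim, sigma=0.1, rng=None):
        self.sigma = sigma
        self.rng = rng or np.random.default_rng()
        self.shape = (state_dim, action_dim)
        self.resample()

    def resample(self):
        # theta_eps is drawn once and reused, so the same state always gets the
        # same perturbation until the next resampling point.
        self.theta_eps = self.rng.normal(0.0, self.sigma, size=self.shape)

    def __call__(self, state):
        return np.asarray(state) @ self.theta_eps

def act(policy, noise, state):
    """a_t = pi(S_t; theta_pi) + epsilon(S_t; theta_eps): policy output plus state-dependent noise."""
    return policy(state) + noise(state)
```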
step 303, the reward function value is calculated;
step 304, a noise sampling frequency is set; in any episode, when the set number of time steps has not yet been reached, step 305 is executed, and when the set number of time steps is reached, step 306 is executed;
step 305, the three-dimensional relative situation of aircraft A and aircraft B continues to be taken as the input parameter and the action-space result of aircraft A relative to the previous step as the output action for training, and the reward function value is calculated in the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning;
step 306, the noise is resampled and the policy hyper-parameters θ_π of the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning are updated;
step 307, steps 302 to 306 are repeated in a loop until the reward function value is no smaller than the set threshold, thereby completing the training of the air combat maneuver decision model based on deep reinforcement learning.
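Steps 302 to 307 can be summarized by the following training-loop sketch; the environment interface (reset/step), the model's policy and update calls and the reward threshold are assumptions standing in for the Mozi platform and the LSTM-PPO implementation, not the actual API.

```python
def train(env, model, noise, resample_every=2, reward_threshold=0.0, max_episodes=1000):
    """Loop steps 302-306 until the episode reward reaches the threshold (step 307)."""
    for episode in range(max_episodes):
        state = env.reset()
        noise.resample()                                 # theta_eps drawn at the start of the episode
        episode_reward, done, step = 0.0, False, 0
        while not done:
            action = model.policy(state) + noise(state)  # step 302: perturbed policy action
            state, reward, done, _ = env.step(action)    # step 303: reward from the environment
            episode_reward += reward
            step += 1
            if step % resample_every == 0:               # steps 304/306: sampling frequency reached
                noise.resample()
                model.update()                           # update policy hyper-parameters theta_pi
        if episode_reward >= reward_threshold:           # step 307 stopping criterion
            break
    return model
```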
It should be noted that a single-layer LSTM network layer is established, and the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning comprises a value network and a policy network; the input dimension of the networks is 5, the output dimension of the value network is 1, and the output dimension of the policy network is 3. The number of hidden-layer nodes and the activation function are set in the model, with the ReLU function selected as the activation function, and the gSDE (generalized state-dependent exploration) policy-exploration method is added so that the actions output by the aircraft in air combat are randomly perturbed according to the sampling frequency parameter, improving the aircraft's ability to find the optimal maneuver decision. The sampling frequency is set to 2 steps, i.e. in any episode, every time aircraft A executes 2 steps the noise is resampled and the policy hyper-parameters of the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning are updated.
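A minimal PyTorch sketch of such an LSTM-PPO network structure (an LSTM initial layer feeding separate policy and value heads) is given below; the hidden size and exact head layout are assumptions, and only the input dimension of 5 and the output dimensions of 3 and 1 come from the description above.

```python
import torch.nn as nn

class LSTMPPONet(nn.Module):
    """LSTM initial layer shared by a policy head (dim 3) and a value head (dim 1)."""

    def __init__(self, state_dim=5, action_dim=3, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, num_layers=1, batch_first=True)
        self.policy_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, action_dim)
        )
        self.value_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, states, hidden=None):
        # states: (batch, seq_len, state_dim) sequence of relative-situation vectors
        features, hidden = self.lstm(states, hidden)
        last = features[:, -1, :]             # features of the most recent time step
        return self.policy_head(last), self.value_head(last), hidden
```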
The decision interval is set to 5 seconds, the maximum episode length is set to 100 steps, and the termination and win/lose conditions of each episode are as follows (an illustrative check covering these conditions is sketched after the list):
(1) Aircraft B is shot down and aircraft A survives: aircraft A is judged the winner and the episode ends.
(2) Aircraft A is shot down and aircraft B survives: aircraft B is judged the winner and the episode ends.
(3) Both aircraft are shot down at the same time: a draw is judged and the episode ends.
(4) Aircraft A is more than 200 km from its base: the aircraft is considered unable to return and the episode ends.
(5) As shown in fig. 4, the two bases are the diagonal vertices of a rectangle (dashed box), and the combat zone is this rectangle expanded outward by 5 km (solid box); an aircraft that flies beyond the solid box is considered to have left the combat zone, and the episode ends.
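The sketch below evaluates the five termination conditions in code form; the argument names (alive flags, distance to base, in-zone flags) are illustrative assumptions about how the simulation state would be exposed.

```python
from enum import Enum

class Outcome(Enum):
    ONGOING = 0
    A_WINS = 1
    B_WINS = 2
    DRAW = 3
    TERMINATED = 4   # out of area / cannot return / step limit reached

def episode_outcome(a_alive, b_alive, a_dist_to_base_km, a_in_zone, b_in_zone,
                    step, max_steps=100):
    """Evaluate the termination and win/lose conditions listed above."""
    if not a_alive and not b_alive:
        return Outcome.DRAW                  # condition (3): both shot down
    if not b_alive:
        return Outcome.A_WINS                # condition (1): B shot down, A survives
    if not a_alive:
        return Outcome.B_WINS                # condition (2): A shot down, B survives
    if a_dist_to_base_km > 200:
        return Outcome.TERMINATED            # condition (4): A cannot return to base
    if not a_in_zone or not b_in_zone:
        return Outcome.TERMINATED            # condition (5): an aircraft left the combat zone
    if step >= max_steps:
        return Outcome.TERMINATED            # maximum episode length reached
    return Outcome.ONGOING
```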
The number of training steps is set to 50000 and the number of test episodes to 100, and the hyper-parameters of the LSTM-PPO algorithm are set for training as follows:
LSTM-PPO algorithm hyper-parameters:
In this embodiment, in step four, a maneuver-decision test is performed on aircraft A using the air combat maneuver decision model based on deep reinforcement learning trained in step 307.
After 50000 training steps, the training results are analysed. Fig. 5 shows the episode-step-length curve during training: the average episode length rises gradually from about 25 steps at the beginning and stabilizes at about 35 steps. The increase in step length indicates that aircraft A performs more actions per episode on average in the later stage of training, and that its strategy has become more complex and more likely to produce a better maneuvering method. Fig. 6 shows the reward curve of aircraft A: the reward obtained by aircraft A rises gradually from an initial negative value and finally remains stable within a certain range, indicating that the deep reinforcement learning algorithm has converged and learned a better strategy. Fig. 7 shows the change in the win rate of the air combat aircraft, which increases gradually from 0 to about 70%, demonstrating good combat performance.
In one episode, as shown in fig. 8 and fig. 9, the trajectories of both fighters can be seen: facing the oncoming enemy aircraft, aircraft A adopts a circling maneuver or a lateral cut, quickly forms an attack posture behind and to the side of the enemy aircraft, occupies the dominant position in the engagement, establishes missile-launch conditions first, and obtains an air combat situation more favorable to side A.
The invention can output maneuver decisions for single-aircraft air combat and can be trained for different scenarios or different aircraft types using the deep reinforcement learning algorithm; the air combat maneuver decision method based on deep reinforcement learning enables aircraft A to actively seize the dominant position and shows good adversarial performance; the LSTM-PPO deep reinforcement learning algorithm effectively solves the problem that traditional reinforcement learning algorithms have insufficient perception of the temporal characteristics of air combat maneuvers in one-on-one maneuver decision-making; the method has good compatibility and can be quickly ported to other simulation environments and algorithms.
The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and any simple modification, variation and equivalent structural changes made to the above embodiment according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.

Claims (5)

1. An air combat maneuver decision method based on deep reinforcement learning, characterized by comprising the following steps:
step one, constructing a one-on-one air combat adversarial scenario between aircraft A and aircraft B;
step two, constructing the three-dimensional relative situation, reward function and action space of aircraft A and aircraft B;
step three, establishing and training an air combat maneuver decision model based on deep reinforcement learning;
step four, making maneuver decisions based on the air combat maneuver decision model.
2. The air combat maneuver decision method based on deep reinforcement learning according to claim 1, characterized in that: in step one, the one-on-one air combat adversarial scenario is constructed on an air combat simulation platform; in the scenario, aircraft A is the aircraft whose maneuvers are to be decided, and aircraft B is controlled by the simulation platform.
3. The air combat maneuver decision method based on deep reinforcement learning according to claim 2, characterized in that: in step two, the three-dimensional relative situation of aircraft A and aircraft B is constructed as S = [z_r, z_b, Δh, V_r, V_b, Δv, d, AA, ATA], where z_r is the altitude of aircraft A, z_b is the altitude of aircraft B, Δh is the relative altitude of the two aircraft, V_r is the velocity vector of aircraft A, V_b is the velocity vector of aircraft B, Δv is the absolute difference of the two aircraft's speeds, d is the distance vector between the two aircraft, and AA and ATA are respectively the aspect angle and the antenna train angle between the two aircraft as observed from aircraft A;
the reward function R is constructed according to the formula R = k_1·R_a' + k_2·R_v' + k_3·R_h' + k_4·R_d' + R_end', where R_a' is the normalized result of the angle reward function R_a, R_v' is the normalized result of the speed reward function R_v, R_h' is the normalized result of the altitude reward function R_h, R_d' is the normalized result of the distance reward function R_d, in which D is the range of the missiles carried by aircraft A and d_min and d_max are the minimum and maximum values of that range, R_end' is the normalized result of the win/lose reward function R_end, and k_1, k_2, k_3 and k_4 are respectively the weight coefficients of R_a', R_v', R_h' and R_d'; k_1, k_2, k_3 and k_4 are all non-negative and k_1 + k_2 + k_3 + k_4 = 1;
an action space [Δψ, ΔV, ΔH] of aircraft A relative to the previous step is constructed, where Δψ is the change in heading angle of aircraft A relative to the previous step, discretized within (-20°, +20°), ΔV is the change in speed of aircraft A relative to the previous step, discretized within (-100 km/h, +100 km/h), and ΔH is the change in altitude of aircraft A relative to the previous step, discretized within (-500 m, +500 m).
4. The air combat maneuver decision method based on deep reinforcement learning according to claim 3, characterized in that: in step three, the air combat maneuver decision model based on deep reinforcement learning is established and trained as follows:
step 301, a single-layer LSTM network layer is established and used as the initial layer of the PPO algorithm network, so as to build the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning;
step 302, the three-dimensional relative situation of aircraft A and aircraft B is taken as the input parameter, and the action-space result of aircraft A relative to the previous step is output as the action to execute;
wherein the action-space result of aircraft A relative to the previous step, i.e. the new action a_t, is obtained according to the formula a_t = π(S_t; θ_π) + ε(S_t; θ_ε), where π(S_t; θ_π) is the action result given by the LSTM-PPO model for the three-dimensional relative situation S_t of aircraft A and aircraft B under the policy hyper-parameters θ_π, ε(S_t; θ_ε) is the value of the noise function for the three-dimensional relative situation S_t and the noise parameters θ_ε, and the noise parameters θ_ε ~ N(0, σ²) are sampled from the Gaussian distribution N(0, σ²) at the beginning of the episode;
step 303, the reward function value is calculated;
step 304, a noise sampling frequency is set; in any episode, when the set number of time steps has not yet been reached, step 305 is executed, and when the set number of time steps is reached, step 306 is executed;
step 305, the three-dimensional relative situation of aircraft A and aircraft B continues to be taken as the input parameter and the action-space result of aircraft A relative to the previous step as the output action for training, and the reward function value is calculated in the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning;
step 306, the noise is resampled and the policy hyper-parameters θ_π of the air combat maneuver decision model LSTM-PPO based on deep reinforcement learning are updated;
step 307, steps 302 to 306 are repeated in a loop until the reward function value is no smaller than the set threshold, thereby completing the training of the air combat maneuver decision model based on deep reinforcement learning.
5. The air combat maneuver decision method based on deep reinforcement learning according to claim 4, characterized in that: in step four, a maneuver-decision test is performed on aircraft A using the air combat maneuver decision model based on deep reinforcement learning trained in step 307.
CN202311071553.1A 2023-08-24 2023-08-24 Air combat maneuver decision method based on deep reinforcement learning Pending CN117171984A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311071553.1A CN117171984A (en) 2023-08-24 2023-08-24 Air combat maneuver decision method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311071553.1A CN117171984A (en) 2023-08-24 2023-08-24 Air combat maneuver decision method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN117171984A true CN117171984A (en) 2023-12-05

Family

ID=88935865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311071553.1A Pending CN117171984A (en) 2023-08-24 2023-08-24 Air combat maneuver decision method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN117171984A (en)

Similar Documents

Publication Publication Date Title
CN113095481B (en) Air combat maneuver method based on parallel self-game
CN113791634B (en) Multi-agent reinforcement learning-based multi-machine air combat decision method
CN108629422A (en) A kind of intelligent body learning method of knowledge based guidance-tactics perception
CN113893539B (en) Cooperative fighting method and device for intelligent agent
CN113221444B (en) Behavior simulation training method for air intelligent game
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
CN113052289B (en) Method for selecting cluster hitting position of unmanned ship based on game theory
CN110673488A (en) Double DQN unmanned aerial vehicle concealed access method based on priority random sampling strategy
CN116127848A (en) Multi-unmanned aerial vehicle collaborative tracking method based on deep reinforcement learning
CN114638339A (en) Intelligent agent task allocation method based on deep reinforcement learning
CN116187777A (en) Unmanned aerial vehicle air combat autonomous decision-making method based on SAC algorithm and alliance training
CN113962012A (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN113222106A (en) Intelligent military chess deduction method based on distributed reinforcement learning
CN112306070A (en) Multi-AUV dynamic maneuver decision method based on interval information game
CN113282100A (en) Unmanned aerial vehicle confrontation game training control method based on reinforcement learning
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
CN117171984A (en) Air combat maneuver decision method based on deep reinforcement learning
CN113741186A (en) Double-machine air combat decision method based on near-end strategy optimization
CN113705828B (en) Battlefield game strategy reinforcement learning training method based on cluster influence degree
CN116432030A (en) Air combat multi-intention strategy autonomous generation method based on deep reinforcement learning
CN116520884A (en) Unmanned plane cluster countermeasure strategy optimization method based on hierarchical reinforcement learning
CN116415646A (en) Course-based reinforcement learning single-machine air combat decision-making method
CN116360500A (en) Missile burst prevention method capable of getting rid of controllable distance
Kong et al. Multi-ucav air combat in short-range maneuver strategy generation using reinforcement learning and curriculum learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination