CN115933717A - Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning


Info

Publication number: CN115933717A
Application number: CN202211189814.5A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: training, unmanned aerial vehicle, reward, air combat
Other languages: Chinese (zh)
Inventors: 段海滨 (Duan Haibin), 郑志强 (Zheng Zhiqiang), 霍梦真 (Huo Mengzhen), 魏晨 (Wei Chen), 邓亦敏 (Deng Yimin)
Current and original assignee: Beihang University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Beihang University
Priority to CN202211189814.5A
Publication of CN115933717A

Abstract

The invention discloses an unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning. The method comprises the following implementation steps: step one, designing the air combat environment; step two, combining the reinforcement learning algorithm with the environment; step three, air combat training; step four, displaying the training results; step five, migrating the training results. The invention supports switching of the reinforcement learning algorithm and partial customization of the air combat simulation environment, adopts a staged, hierarchical training scheme, improves the operability and success rate of training, and simplifies the training and verification process of the designed algorithm.

Description

Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
Technical Field
The invention relates to an unmanned aerial vehicle air combat maneuver decision algorithm training system, in particular to an unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning, and belongs to the technical field of unmanned aerial vehicles.
Background
Unmanned aerial vehicles (UAVs) have been widely used in military fields such as reconnaissance, search, and confrontation. UAV air combat is one form of confrontation and, owing to its flexibility, variability, and high-dimensional complexity, remains a major challenge. UAV air combat maneuver decision-making aims to let the red-side (friendly) UAV continuously select appropriate maneuvers according to the situations of the red and blue (enemy) sides during air combat, using methods such as mathematical optimization and artificial intelligence, so that the UAV occupies a favorable situation or reverses an unfavorable one, attacks the blue-side UAV until it is shot down, and thereby wins. Because UAV air combat maneuver decision-making faces a highly dynamic, strongly coupled environment, research has shifted from traditional methods such as expert systems, influence diagrams, fuzzy logic, and differential games toward intelligent algorithms such as intelligent optimization algorithms and neural networks. The traditional methods involve heavy engineering effort, produce rather rigid decision results, and have complicated implementation details, so they cannot cope well with complex air combat tasks. Because of the complexity of the air combat model and the large amount of computation required for its solution, intelligent optimization algorithms find it difficult to select an appropriate maneuver from the candidate maneuver library in real time. In summary, UAV air combat maneuver decision-making is still a difficult problem requiring in-depth research.
Artificial intelligence technology represented by reinforcement learning, especially deep reinforcement learning, has become a research hotspot in recent years. It has been applied successfully in the field of games, has great application prospects in intelligent decision-making and related fields, and offers a new way to address the UAV air combat maneuver decision problem. However, the current related research is rather scattered, and there is no relatively unified UAV air combat environment to serve as a reinforcement learning training platform. The invention provides an unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning, aiming to provide a procedural UAV air combat maneuver decision training method, a reward design scheme, and a training flow scheme, and to achieve standardized and efficient air combat decision training based on reinforcement learning algorithms.
Disclosure of Invention
To address the complexity of UAV air combat maneuver decisions and the difficulty of training reinforcement learning algorithms, the invention provides an unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning. The system allows convenient switching of the reinforcement learning algorithm, supports partial customization of the air combat simulation environment, and adopts a staged, hierarchical training scheme, thereby improving the operability and success rate of training and simplifying the training and verification process of the designed algorithm.
The unmanned aerial vehicle intelligent air combat maneuver decision training system based on deep reinforcement learning is composed of five parts, namely a reinforcement learning algorithm layer, an unmanned aerial vehicle air combat environment layer, an air combat maneuver decision algorithm training layer, a log recording and effect displaying layer and a migration and expansion layer, and is shown in figure 1. Each of which is described in detail below.
1) Reinforcement learning algorithm layer. There are many kinds of reinforcement learning algorithms; the invention focuses on deep reinforcement learning algorithms under the Actor-Critic (AC) framework. Algorithms under the AC framework come in many varieties, but all interact with the environment through an action network (Actor) and obtain Observation vectors and Rewards from the environment. The reinforcement learning algorithm layer supports user-defined deep reinforcement learning algorithms under the AC framework and is highly extensible, but a user-designed algorithm must satisfy the environment interaction requirements. Specifically, a deep reinforcement learning algorithm used with the invention must provide: an action-selection function that receives the observation vector given by the air combat environment and outputs the selected action; and a call to the interaction function provided by the air combat environment, whose input is an action and whose output is a new observation vector, a reward, an episode-termination flag, and other information. By default the invention uses the proximal policy optimization (PPO) algorithm, from which the user can migrate to a custom algorithm. To ease such migration, the invention provides five submodules: a neural network module for designing neural networks, an experience processing module for storing and retrieving experience during training, an algorithm design module that defines the deep reinforcement learning algorithm, a hyper-parameter module for designing and assigning the algorithm's hyper-parameters, and a custom algorithm module in which the user defines their own algorithm. The algorithm design module contains the default PPO algorithm (for multi-agent problems the default is a multi-agent PPO algorithm), and the user can select certain implementation details of the algorithm as required. The neural network module and the experience processing module serve as auxiliary design tools: the neural network module contains neural networks of various forms so that the user can conveniently choose among them, and the experience processing module stores and processes the experience data generated during training and is provided separately so that the user can modify it as needed. The user can rely on these two auxiliary design modules, or independently implement the required algorithm in full. The hyper-parameter module gathers the adjustable and configurable parameters of the deep reinforcement learning algorithm, such as the algorithm's hyper-parameters and the sizes of the neural network hidden layers, making it convenient to tune parameters and select different algorithm structures. In particular, when a user designs their own algorithm in the custom algorithm module, it must satisfy the design requirements of that module, i.e. the operational requirements of the hyper-parameter module and of other modules such as the environment definition module, the training log data recording module, and the network parameter migration module. Meanwhile, the adjustable parameters should be placed in the hyper-parameter module as far as possible to facilitate tuning and selection.
A concrete implementation can refer to the PPO implementation given in the algorithm design module. When the deep reinforcement learning algorithm is run, the system first loads the hyper-parameter module and reads its parameter settings, thereby choosing whether to call the default PPO algorithm in the algorithm design module or the algorithm in the custom algorithm module; the parameters in the current hyper-parameter module and the neural network structure are saved to a text file so that the training result can be reproduced later. In particular, to improve the generality of the system, the invention does not prescribe the reinforcement learning algorithm layer in detail; instead it divides the functionality finely, provides a feasible algorithm together with the two auxiliary design tools, and on this basis offers the custom algorithm module, giving the user more choices and facilitating the user's design work.
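As an illustration of the interface requirements above, the following is a minimal Python/PyTorch sketch of an Actor network and an action-selection function for a discrete maneuver action space; the class and function names are illustrative assumptions, not the system's actual modules, and the log-probability output is an extra convenience for PPO-style updates.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Action network: maps an observation vector to a categorical
    distribution over the discrete maneuver actions."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return Categorical(logits=self.net(obs))

def select_action(actor, obs):
    """Action-selection function required by the environment interface:
    receives the observation vector given by the air-combat environment
    and outputs the selected action (plus its log-probability for PPO)."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    dist = actor(obs)
    action = dist.sample()
    return action.item(), dist.log_prob(action).item()
```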
2) Unmanned aerial vehicle air combat environment layer. This layer is the UAV air combat environment used to respond to the actions given by the action network, and it mainly comprises an environment definition module, an environment parameter setting module, a blue-side maneuver strategy module, and a UAV motion module. The environment definition module defines the air combat rules, the state space, the action space, the reward function, and so on. It mainly receives the red-side UAV maneuver given by the deep reinforcement learning algorithm and the blue-side UAV maneuver given by the blue-side maneuver strategy module, calls the UAV motion model to obtain the state information (including position and velocity) of each UAV at the next moment, obtains the next state through the designed state space, and feeds the UAV state information into the designed air combat rules to determine the environment information, e.g. whether either side has hit the ground or left the area, or whether an attack can be carried out and what its result is; it then computes, from the designed reward function, the reward the red-side UAV obtains in that state, and finally returns the reward and transitions to the next state. The action space must be customized under the premise of matching the UAV motion model and the designed deep reinforcement learning algorithm; by default a discrete action space of 15 discrete actions is adopted. The environment parameter setting module is an independent settings file containing the environment information, the reward function parameters, the number of UAVs on each side, the performance parameters of the UAVs on both sides, their initial positions and velocities, the maneuver strategy, and so on. In the invention, the UAV models of both sides in the air combat are defined in the UAV motion module, and the blue-side maneuver decision algorithm is written into the blue-side maneuver strategy module. The UAV motion model and the blue-side maneuver strategy module support a certain degree of user customization. For the model given by the default formula of the UAV motion module, the user can make custom modifications, but the model must remain matched with the action space. For the blue-side maneuver strategy module, the invention provides a MinMax air combat decision algorithm based on situation evaluation as the default blue-side maneuver decision algorithm, and a number of setting parameters for the maneuver decision algorithm are given in the environment parameter setting module, so that the user can choose a fixed action, a random action, the MinMax action, or a mixture of the three (in which case the selection probabilities of the three are specified); meanwhile, when the default action space design is selected, the user can choose whether the blue-side maneuvers are restricted to actions in the horizontal plane or drawn from the full action space.
When the system runs and calls the UAV air combat environment layer, it first calls the environment parameter setting module to read the configured parameters and calls the environment definition module to generate the confrontation environment; then, upon receiving the red-side action given by the deep reinforcement learning algorithm, it calls the blue-side maneuver strategy module and the UAV motion module, processes everything according to the designed environment, and finally returns the updated state, reward, and environment information.
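The interaction just described can be pictured as a Gym-style interface. The sketch below is only an assumption about how such a contract might look in Python; the method names are hypothetical, and the internal hooks are empty placeholders standing in for the environment definition, blue-side strategy, and UAV motion modules.

```python
class AirCombatEnv:
    """Sketch of the interaction contract between the RL algorithm and the
    air-combat environment (method names are illustrative)."""

    def reset(self):
        """Initialize a new engagement and return the first observation."""
        self.state = self._initial_state()
        return self._observation()

    def step(self, red_action):
        """Apply the red-side action, let the blue-side strategy and the UAV
        motion model respond, then return (obs, reward, done, info)."""
        blue_action = self._blue_strategy()
        self._propagate(red_action, blue_action)
        done, info = self._apply_rules()
        return self._observation(), self._reward(info), done, info

    # The hooks below are placeholders to be filled in by the concrete
    # environment definition / blue-strategy / motion modules.
    def _initial_state(self): return {}
    def _observation(self): return []
    def _blue_strategy(self): return 0
    def _propagate(self, a_red, a_blue): pass
    def _apply_rules(self): return False, {}
    def _reward(self, info): return 0.0
```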
3) Air combat maneuver decision algorithm training layer. This layer is used to train the deep reinforcement learning algorithm and mainly comprises a training parameter setting module and a training flow module. The training parameter setting module contains the training settings, such as the number of training episodes, the training frequency, and the data storage frequency. To address the difficulty of training and the slow convergence of air combat maneuver decision algorithms based on deep reinforcement learning, the training flow module adopts a staged, hierarchical training scheme; such a scheme usually consists of several training schemes (training scheme (1), training scheme (2), and so on) to complete training for a complex scenario, in which case the schemes are executed in sequence according to the scheme flow. When the system runs and calls the air combat maneuver decision algorithm training layer, it first reads the parameter settings in the training parameter setting module and then calls the training flow module to obtain the requirements of the current training stage, which in turn affects the settings of the hyper-parameter module and the environment parameter setting module.
4) Log recording and effect display layer. The layer is used for log data recording and air combat training result display and mainly comprises a log data recording module, a scene display module and a simulation data storage module. The log data recording module is used for recording data in the training process, is used for explaining the quality of a training result, and is controlled by the training parameter setting module of the air combat algorithm training layer. The scene display module is mainly used for loading and training to obtain a reinforcement learning algorithm network model and displaying the red and blue air battle tracks under corresponding setting conditions. The simulation data storage module is used for recording data information such as position, attitude, speed and the like of the air combat process displayed in the scene. When the system is operated to call the log record and the effect display layer, firstly, a log data recording module is called to complete related setting and initialization and prepare for recording data; and after the training is finished, calling the scene display module or the simulation data storage module as required to finish the training.
5) Migration and expansion layer. This layer is mainly used to migrate the trained neural network parameters and mainly comprises a network parameter migration module. The module supports displaying the structure of the trained neural network and saves the network parameters of a user-specified part to a file; when needed, the user can call it to load those parameters into a newly specified position in another network, thereby migrating and extending the training results. When the system runs and calls the migration and expansion layer, a migration scheme is provided to connect two adjacent training schemes when the training flow contains several schemes, i.e. the neural network parameters obtained from the previous training stage are migrated to initialize the network parameters of the next stage.
The unmanned aerial vehicle intelligent air combat maneuver decision training method based on deep reinforcement learning is shown in Fig. 5. Overall, when the system runs it first executes, in order, the air combat maneuver decision algorithm training layer to determine the current training settings and training scheme, the reinforcement learning algorithm layer to initialize the algorithm to be trained, the UAV air combat environment layer to initialize the confrontation environment, and the log recording and effect display layer to initialize the data-recording settings; this corresponds to step (1) of the air combat process designed in step three below. Formal training of the initialized algorithm, confrontation environment, and training scheme is then carried out according to the air combat process designed in step three. Finally, after training is finished, the system decides, according to the training scheme, whether to call the migration and expansion layer to determine a migration scheme and thus carry out further subsequent training.
The main components of the unmanned aerial vehicle intelligent air combat maneuver decision training system based on deep reinforcement learning are shown in fig. 1, and the specific implementation steps are as follows:
Step one: unmanned aerial vehicle intelligent air combat environment design
The air combat environment design is the first step of UAV air combat maneuver decision training and mainly comprises designing the UAV models for the red and blue sides, designing the blue-side air combat strategy, and designing the air combat rules.
(1) Unmanned aerial vehicle model design
In order to accelerate the training process and focus on the maneuver strategy, the invention by default uses a simplified UAV model, as shown in formula (1), corresponding to the UAV motion module. The control commands of the model are the tangential overload $n_f$, the normal overload $n_z$, and the roll angle $\phi$, i.e. $u = [n_f, n_z, \phi]^T$; the state quantities are the three-axis position $[x, y, z]$ of the UAV, the velocity magnitude $V$, the pitch angle $\theta$, and the yaw angle $\psi$.

[Formula (1): simplified UAV motion model; the equation is given as an image in the original document.]

where $g$ is the gravitational acceleration. At initialization, $x \in [0, \mathrm{Length}]$, $y \in [0, \mathrm{Width}]$, $z \in [0, \mathrm{Height}]$, $V \in [V_{min}, V_{max}]$, $\theta \in [-\pi/2, \pi/2]$, and $\psi \in [0, 2\pi)$. During flight these range constraints must still be satisfied, except that $x$ and $y$ are allowed to leave their ranges.
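Because formula (1) itself is only available as an image, the following Python sketch implements a commonly used three-degree-of-freedom point-mass model that is consistent with the stated inputs $u = [n_f, n_z, \phi]^T$ and states $[x, y, z, V, \theta, \psi]$; the exact equations are therefore an assumption, not a reproduction of the patent's formula.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def uav_step(state, u, dt):
    """One Euler-integration step of an assumed 3-DOF point-mass UAV model.
    state = (x, y, z, V, theta, psi); u = (n_f, n_z, phi)."""
    x, y, z, V, theta, psi = state
    n_f, n_z, phi = u
    dx = V * math.cos(theta) * math.cos(psi)
    dy = V * math.cos(theta) * math.sin(psi)
    dz = V * math.sin(theta)
    dV = G * (n_f - math.sin(theta))
    dtheta = (G / V) * (n_z * math.cos(phi) - math.cos(theta))
    dpsi = G * n_z * math.sin(phi) / (V * math.cos(theta))
    return (x + dx * dt, y + dy * dt, z + dz * dt,
            V + dV * dt, theta + dtheta * dt, psi + dpsi * dt)
```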
In particular, to improve the training success rate and simplify the problem appropriately, the invention adopts a discrete action space A: the control command u is restricted to certain values, these values form the action space, and at each decision moment the UAV selects a command from this action space to execute. The details are described in step two.
(2) Blue-side strategy design
By default, the invention uses the idea of the minimum-maximum (MinMax) decision algorithm and a situation evaluation function to compute individual payoffs, and on this basis makes the blue-side maneuver decisions; this corresponds to the blue-side maneuver strategy module. In particular, the situation evaluation function is a component of the blue-side maneuver decision module and can be customized by the user as needed, but it must be designed on the basis of the provided state information of each UAV: its inputs are the state $x_i$ of a currently surviving red-side individual i and the state $x_k$ of a currently surviving blue-side individual k ($x_\alpha = [x_\alpha, y_\alpha, z_\alpha, v_{x\alpha}, v_{y\alpha}, v_{z\alpha}]^T$, where $\alpha = i, k$), and its output is a scalar called the payoff. Importantly, the invention provides a maneuver decision strategy for the blue-side UAV based on the MinMax decision algorithm; to reflect fairness, the blue-side control commands are selected from the same discrete action space as the red side's.
The basic idea of the MinMax decision algorithm is as follows: under the assumption that the information of both sides is completely known, the decision maker traverses all of its own optional actions in turn; for each such action it traverses and deduces all of the opponent's optional actions and takes the maximum of the opponent's corresponding payoffs; it then selects, as the current decision result, its own action for which this maximum opponent payoff is smallest.
In particular, for many-to-many air combat scenarios, the invention by default takes k to be the blue-side UAV nearest to UAV i.
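A minimal Python sketch of this MinMax idea over the shared discrete action set is given below; the `predict` (one-step state prediction) and `evaluate` (situation evaluation) helpers are assumed user-supplied functions, and the names are illustrative rather than the blue-side maneuver strategy module's actual API.

```python
def minmax_decision(own_state, opp_state, actions, predict, evaluate):
    """MinMax maneuver decision over a shared discrete action set.
    For each candidate own action, assume the opponent replies with the
    action that is worst for the decision maker (equivalently, best for
    the opponent under a zero-sum payoff), then pick the own action whose
    worst-case payoff is largest."""
    best_action, best_worst = None, float("-inf")
    for a_own in actions:
        own_next = predict(own_state, a_own)
        # worst case over all possible opponent replies
        worst = min(evaluate(own_next, predict(opp_state, a_opp))
                    for a_opp in actions)
        if worst > best_worst:
            best_worst, best_action = worst, a_own
    return best_action
```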
(3) Air combat rule design
The air combat rules are the basis that both the red and blue sides must obey and by which judgments are made during the engagement, and they correspond to part of the environment definition module. The rules in the invention fall into two categories: physical rules and judgment rules. Physical rules are rules close to real-world physics, including the continuity and constraints of UAV motion, a collision when the distance between two aircraft is less than a limit distance, a crash when a UAV's height above the ground drops below zero, destruction of a UAV after it has been attacked to a certain degree, weapon-imposed limits on attacks, and so on; these are basic rules that must be considered and configured in training. Judgment rules are the necessary preconditions for determining outcomes, such as the attack situation condition, the defeat condition, and the conditions for obtaining rewards; these settings can vary.
In particular, for the attack situation condition rule, the condition under which UAV i can attack UAV k is defined as

[Formula (2): attack condition of UAV i on UAV k; the inequality is given as an image in the original document.]

where $D_{ik}$ is the distance between them, $D_{att,min}$ and $D_{att,max}$ define the attack distance range, $\varphi_{ik}$ is the target azimuth angle, and $q_{ik}$ is the target entry angle, as shown in Fig. 2. When the attack condition is satisfied, UAV i is considered to hit UAV k with attack success probability $p_{att}$, and only one attack can be carried out per decision period; when the number of hits exceeds a certain value, UAV k is destroyed. Meanwhile, for the defeat condition rule, a Blood attribute is introduced for each UAV: after each hit, the damage is classified into several degrees according to probability and the corresponding amount is deducted from the Blood value, and when the Blood value drops below 0 the UAV is considered defeated. When a UAV collides or hits the ground, its Blood value is set directly to -1. The conditions for obtaining rewards are explained in step two together with the design of the reward function.
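The following Python sketch illustrates how the attack-condition check and the Blood bookkeeping described above might be coded; the angular thresholds and the damage amounts are placeholders, since the corresponding values appear only in the patent's image-rendered formulas.

```python
import random

def can_attack(D_ik, phi_ik, q_ik, D_att_min, D_att_max, phi_att, q_att):
    """Attack condition of UAV i on UAV k: distance within the attack range
    and both the target azimuth angle and the target entry angle within
    their thresholds (the thresholds phi_att and q_att are assumptions)."""
    return (D_att_min <= D_ik <= D_att_max
            and abs(phi_ik) <= phi_att
            and abs(q_ik) <= q_att)

def apply_attack(blood_k, p_att, damage_levels=(30, 60, 100)):
    """If the attack succeeds with probability p_att, deduct a randomly
    chosen damage amount from UAV k's Blood value; Blood below 0 means the
    UAV is destroyed. The damage levels here are illustrative only."""
    if random.random() < p_att:
        blood_k -= random.choice(damage_levels)
    return blood_k
```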
Step two: reinforcement learning algorithm and environment combination
The reinforcement learning algorithm to be used is combined with the UAV intelligent air combat environment designed by the invention. The reinforcement learning algorithm corresponds to the reinforcement learning algorithm layer. This layer provides several auxiliary design modules, and the PPO algorithm can be used by default; the user can also customize the reinforcement learning algorithm as needed, with the related hyper-parameters placed in the hyper-parameter module. In particular, the invention provides a scheme for combining the reinforcement learning algorithm with the UAV air combat environment: the action space is designed in the environment definition module so as to match the reinforcement learning algorithm, and the state space and reward function are then designed as the main part of the interaction function. In this way, the action-selection function of the reinforcement learning algorithm selects actions according to the observation (corresponding to the state space) given by the environment; after receiving the action, the environment's interaction function updates the state, obtains the new observation and reward, and returns them to the reinforcement learning algorithm, thereby connecting the reinforcement learning algorithm and the environment. Step two therefore consists of the state space design, the action space design, and the reward function design, corresponding to the environment definition module, which are described in detail below.
Preconditions are given first. Let the set of red-side UAVs be $R = \{1, 2, \ldots, N_r\}$ and the set of blue-side UAVs be $B = \{1, 2, \ldots, N_b\}$; the coordinate system is a north-east coordinate system, and the maximum values of the axes of the expected air combat space are, in order, Length, Width, and Height.
(1) State space design
The UAV state space comprises three components: the UAV's own state, its state relative to the blue-side UAVs, and its state relative to the other red-side UAVs. The own state $s^{self}_i$ ($i \in R$) has dimension 3 and consists of the current three-axis position $[x_i, y_i, z_i]$ of the UAV. The state relative to a blue-side UAV, $s^{B}_{ik}$ ($i \in R$ and $k \in B$), has dimension 8 and consists of, in order, the length $D_{ik}$ of the three-dimensional vector pointing from red UAV i to blue UAV k, its angle $D_{\theta,ik}$ with the horizontal plane, the angle $D_{\psi,ik}$ between its projection on the horizontal plane and the x-axis, the three-axis velocity difference $[\Delta v_{x,ik}, \Delta v_{y,ik}, \Delta v_{z,ik}]$ between UAVs i and k, the target azimuth angle $\varphi_{ik}$, and the target entry angle $q_{ik}$. The state relative to another red-side UAV, $s^{R}_{ij}$ ($i, j \in R$ and $i \neq j$), has dimension 6 and consists of, in order, the length $D_{ij}$ of the three-dimensional vector pointing from red UAV i to red UAV j, its angle $D_{\theta,ij}$ with the horizontal plane, the angle $D_{\psi,ij}$ between its projection on the horizontal plane and the x-axis, and the three-axis velocity difference $[\Delta v_{x,ij}, \Delta v_{y,ij}, \Delta v_{z,ij}]$ between UAVs i and j. The state space of UAV i is therefore the concatenation of these components. In particular, when the number of red-side UAVs $N_r$ is 1, the state space reduces to the own state and the states relative to the blue-side UAVs.
Furthermore, in order to accelerate training, the invention normalizes the state space using the following formula.

[Formula (3): state-space normalization; the equation is given as an image in the original document.]

where $s_t$ is the t-th element of the state space, $a$ and $b$ are parameters, $\eta_t$ is the corresponding threshold, and $\hat{s}_t$ is the corresponding normalized element. When $s_t$ is one of $x_i, y_i, D_{ij}, D_{ik}$, $\eta_t = \mathrm{Length}$; when $s_t$ is $z_i$, $\eta_t = \mathrm{Height}$; when $s_t$ is one of $D_{\theta,ij}, D_{\theta,ik}$, $\eta_t = \pi/2$; when $s_t$ is one of $D_{\psi,ij}, D_{\psi,ik}$, $\eta_t = 2\pi$; when $s_t$ is one of $\varphi_{ik}, q_{ik}$, $\eta_t = \pi$; when $s_t$ is one of $\Delta v_{x,ij}, \Delta v_{y,ij}, \Delta v_{z,ij}, \Delta v_{x,ik}, \Delta v_{y,ik}, \Delta v_{z,ik}$, $\eta_t = V_{max} - V_{min}$, where $V_{max}$ and $V_{min}$ are the maximum and minimum UAV speeds.
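As a rough illustration, the sketch below builds a normalized 11-dimensional observation by dividing each element by its threshold $\eta_t$; because formula (3) is only given as an image, the parameters a and b are omitted here, so this is an assumption about the intent of the normalization rather than its exact form.

```python
import math

def normalize_state(s, thresholds):
    """Divide each state element by its corresponding threshold eta_t.
    Simplified stand-in for formula (3), whose exact form (with the
    parameters a and b) is only given as an image in the source."""
    return [s_t / eta_t for s_t, eta_t in zip(s, thresholds)]

# Example thresholds for the 11-dimensional one-on-one state space
# (assuming Length = Width = 5000 m, Height = 3000 m, V in [50, 200] m/s):
LENGTH, WIDTH, HEIGHT = 5000.0, 5000.0, 3000.0
V_MIN, V_MAX = 50.0, 200.0
ETA = [LENGTH, WIDTH, HEIGHT,                        # own position x, y, z
       LENGTH, math.pi / 2, 2 * math.pi,             # D, D_theta, D_psi
       V_MAX - V_MIN, V_MAX - V_MIN, V_MAX - V_MIN,  # velocity differences
       math.pi, math.pi]                             # azimuth and entry angle
```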
(2) Design of motion space
The action space design can be customized by the user provided it matches the UAV motion model and the designed deep reinforcement learning algorithm; the following default action space design is also provided. The invention adopts a discrete action space by default to improve the feasibility of training. Complex maneuvers in UAV air combat can be composed from a few simple maneuvers, and encoding these simple maneuvers yields the corresponding control commands; this establishes the action space, simplifies the problem, and justifies the use of a discrete action space. Specifically, the commonly used maneuvers comprise five basic maneuvers: level flight, left turn, right turn, pull-up, and dive. To improve flexibility, the invention expands each basic maneuver into three sub-actions (constant speed, acceleration, and deceleration), giving a discrete action space of 15 maneuver actions. Furthermore, three parameters $m_1$, $m_2$, and $n$ are used to encode the 15 actions uniformly, where $m_1$ and $m_2$ are two expected tangential overload values with $m_2 < 0 < m_1$, and $n > 0$ is the expected normal overload value, as shown in the following table.

[Table: encoding of the 15 discrete maneuver actions in terms of $m_1$, $m_2$, and $n$; the table is given as an image in the original document.]

Importantly, when designing an air combat scenario the user can determine the parameters $m_1$, $m_2$, and $n$ according to the performance of the selected UAV, thereby establishing the action space.
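Since the encoding table is rendered as an image, the sketch below reconstructs a plausible 15-action table under common conventions (constant speed $n_f = 0$, acceleration $n_f = m_1$, deceleration $n_f = m_2$; level flight $n_z = 1$; turns, pull-up, and dive using $n_z = \pm n$ with the roll angle selecting the turn direction); it is an illustration, not the patent's actual table.

```python
import math

def build_action_space(m1, m2, n):
    """Illustrative 15-action discrete space: 5 basic maneuvers
    (level flight, left turn, right turn, pull up, dive) x 3 speed modes
    (constant, accelerate, decelerate). Each entry is (n_f, n_z, phi).
    The exact encoding in the patent is given only as an image."""
    speed_modes = [0.0, m1, m2]           # constant, accelerate, decelerate
    basic = [
        (1.0, 0.0),                       # level flight: n_z = 1, phi = 0
        (n,  -math.acos(1.0 / n)),        # left turn
        (n,   math.acos(1.0 / n)),        # right turn
        (n,   0.0),                       # pull up
        (-n,  0.0),                       # dive
    ]
    return [(nf, nz, phi) for (nz, phi) in basic for nf in speed_modes]

ACTIONS = build_action_space(m1=2.4, m2=-1.2, n=2.8)  # values from the example
```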
(3) Reward function design
The design of the reward function is important for speeding up the training process. The invention designs three rewards: a process reward $r_1$, an event reward $r_2$, and an ending reward $r_3$. In particular, to design the reward function the invention introduces an advantage region, as shown in Fig. 3. For a red-side UAV, the advantage region is a cone-shaped zone behind the tail of a blue-side UAV; that is, UAV i is considered to be in the advantage region with respect to UAV k when their relationship satisfies the following condition.

[Formula (4): advantage-region condition; the inequality is given as an image in the original document.]

where $D_{adv0}$ and $D_{adv}$ define the distance range of the advantage region and $\alpha_{adv}$ is its angle threshold. In particular, the advantage-region condition is looser than the attack condition, which serves to guide and motivate the UAV.
The design of the reward function of the present invention is described in detail below.
(1) Process reward. This reward is mainly used for guidance and is generated at every decision moment; situations more favorable to the red side receive larger rewards, which accelerates training. In particular, for many-to-many air combat scenarios, the nearest blue-side UAV is selected as the target when computing the process reward. In the invention the process reward consists of three parts: an angle reward $r_a$, a distance reward $r_d$, and a height reward $r_h$:

[Formula (5): angle reward $r_a$; the expression is given as an image in the original document.]

$r_d = r_{d1} + r_{d2}$ (6)

where $r_{d1} = 0.15$ when the relative distance decreases and $r_{d1} = 0$ otherwise; $r_{d2}$ is computed by:

[Formula (7): distance reward component $r_{d2}$; the expression is given as an image in the original document.]

The height reward $r_h$ is given by

[Formula (8): height reward $r_h$; the expression is given as an image in the original document.]

where $\Delta h = z_i - z_k$ is the height difference between UAV i and UAV k, and $H_{att}$ is the optimal attack height.

The total process reward is therefore

$r_1 = \varepsilon_1 \cdot r_a + \varepsilon_2 \cdot r_d + \varepsilon_3 \cdot r_h$ (9)

where $\varepsilon_1$, $\varepsilon_2$, and $\varepsilon_3$ are weighting coefficients.
(2) Event reward

The event reward $r_2$ is triggered by the occurrence of specific situations and includes an advantage-region reward $r_{adv}$, an attack reward $r_{att}$, and a destruction reward $r_{dam}$. In the invention, $r_{att}$ and $r_{dam}$ are set to constants, while $r_{adv}$ is given by the following formula.

[Formula (10): advantage-region reward $r_{adv}$; the expression is given as an image in the original document.]

For red-side UAV i and blue-side UAV k: when the condition of formula (4) is satisfied, UAV i is awarded $r_{adv}$; when formula (2) is satisfied and the attack succeeds with probability $p_{att}$, UAV i is awarded $r_{att}$; when the Blood value of UAV k falls below 0 after the attack, UAV i is awarded $r_{dam}$.
(3) Ending reward

The ending reward $r_3$ is given at the end of the current engagement. The engagement ends when all UAVs of one side have Blood values below 0 or the maximum simulation time $t_{max}$ is reached. The red side wins if all blue-side Blood values fall below 0 within the maximum simulation time while at least one surviving red-side UAV still has Blood above 0. The ending reward is

[Formula (11): ending reward $r_3$; the expression is given as an image in the original document.]

where $r_{win1}$, $r_{win2}$, $r_{win3}$, and $r_{loss}$ are set positive constants, $Blood_0$ is the UAV's initial Blood value, and $t_{end}$ is the time at which the engagement ends. In particular, for many-to-many air combat scenarios, the ending reward is given to all surviving red-side UAVs.

Finally, the overall reward is

$rwd = k_{rwd} \cdot (r_1 + r_2) + r_3$ (12)

where $k_{rwd}$ is a weighting factor used to balance the process and event rewards against the ending reward, highlighting the final outcome.
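A short Python sketch of how the reward components combine is given below; only the overall composition of formula (12) and the example weights follow the text, while the individual component formulas (5)-(11) are image-rendered in the source, so the component values are assumed to be computed elsewhere.

```python
def process_reward(r_a, r_d, r_h, eps=(0.18, 0.72, 0.1)):
    """Weighted sum of the angle, distance and height rewards (formula (9));
    the default weights are those used in the worked example."""
    e1, e2, e3 = eps
    return e1 * r_a + e2 * r_d + e3 * r_h

def total_reward(process_r, event_r, end_r, k_rwd=0.01):
    """Overall reward of formula (12): the process and event rewards are
    scaled by k_rwd so that the ending reward dominates."""
    return k_rwd * (process_r + event_r) + end_r
```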
Step three: air combat training
After the air combat environment is designed and combined with the reinforcement learning algorithm, air combat training can be carried out. The air combat training comprises an air combat process and a training scheme, and corresponds to an air combat maneuver decision algorithm training layer, which is specifically described as follows.
(1) Air combat process design
To meet the requirements of UAV air combat maneuver decision training, the invention designs a general air combat process in which the red side adopts a deep reinforcement learning algorithm under the AC framework, as shown in Fig. 4 and explained below.
(1) Algorithm initialization. Set the air combat environment parameters as required, such as the area size, the number of UAVs, the reward function parameters, and the form of the state space; initialize the reinforcement learning algorithm, e.g. its hyper-parameters.

(2) Episode initialization. Initialize the UAV states for the current episode and reset the related variables; according to the designed state space, compute the observation vector of each red-side UAV and pass it to the reinforcement learning algorithm.

(3) UAV state update. Because the red and blue sides may adopt different strategies, their updates are carried out separately. Each red-side UAV calls its corresponding Actor network to obtain an action, i.e. the control inputs of the UAV motion model, from its observation vector; meanwhile the reinforcement learning algorithm collects the related data (depending on the algorithm adopted); the UAV then executes the action, substituting it into the motion model to obtain its new state (position, velocity, attitude). Each blue-side UAV obtains its own state information and that of the other UAVs and passes it to the maneuver decision strategy in use (which information is used depends on the given strategy) to obtain an action, which it then executes to update its state. In particular, after every state update the collision and ground-impact rules must be checked; a UAV judged to satisfy either rule is considered crashed and stops. For air combat with a decision period, an action is executed for one period while the state is updated several times within it.

(4) Situation judgment. According to the new states of the red and blue sides, judge whether the attack condition is satisfied, whether an attack succeeds and causes damage, whether a UAV is in the advantage region, whether the end condition is met, and so on.

(5) Reward and new observation vector computation. Compute the reward obtained by each red-side UAV from the situation judgment results, the states of all UAVs, and the designed reward function; compute the observation vector of each red-side UAV at this moment according to the designed state space.

(6) Reinforcement learning algorithm update. The environment passes the rewards, the new observation vectors, whether the end condition is met, and other information to the reinforcement learning algorithm; the algorithm organizes the data, stores experience, and updates its parameters according to its own settings, all of which depend on the selected algorithm.

(7) If the current episode satisfies the end condition, go to step (8); otherwise go to step (3).

(8) If the training end condition is met, save the training results and display or replay them as required; otherwise go to step (2) and start a new episode.
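The eight steps above map naturally onto a standard episode loop. The sketch below assumes the Gym-style environment interface sketched earlier and a hypothetical `agent` object wrapping the action-selection, experience-storage, and update functions; it illustrates the flow rather than the system's actual training code.

```python
def train(env, agent, n_episodes, max_steps):
    """Episode loop mirroring steps (2)-(8) of the air-combat process:
    initialize the engagement, alternate action selection and environment
    updates, store experience, and update the algorithm."""
    for episode in range(n_episodes):                     # step (2)
        obs = env.reset()
        for t in range(max_steps):
            action, logp = agent.select_action(obs)       # step (3)
            next_obs, reward, done, info = env.step(action)  # steps (3)-(5)
            agent.store(obs, action, logp, reward, done)  # step (6)
            obs = next_obs
            if done:                                      # step (7)
                break
        agent.update()                                    # step (6)
    agent.save()                                          # step (8)
```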
(2) Training protocol design
To address the difficulty of training and the slow convergence of air combat algorithms based on reinforcement learning, the invention adopts a staged, hierarchical training scheme and provides a default scheme flow for the user's reference. If the training results are good, the user can skip some of the stages; otherwise new stages can be added. The air combat algorithm designed by the user is trained by default with the following training scheme flow.

Training scheme (1): the red side starts in a dominant position (i.e. in the blue side's tail region) and the blue side flies in a straight line;

Training scheme (2): the red side starts in a dominant position and the blue side uses a hybrid maneuver strategy (random maneuvers or straight flight chosen by probability);

Training scheme (3): the red side starts in a dominant position and the blue side uses the specified maneuver decision algorithm;

Training scheme (4): the red and blue sides start head-on in positions of equal status, and the blue side uses the specified maneuver decision algorithm;

Training scheme (5) (optional): the blue side starts in the dominant position (i.e. in the red side's tail region).
In particular, the invention also provides other training options for the user to choose from. For UAV initialization, four schemes are provided: random initialization within a designated range; discretizing the designated range into 100x100x100 cells and initializing positions on the grid points; designating a set of initialization position points and randomly selecting from it at each initialization; and designating a fixed initial position for each UAV. The red and blue sides can each choose among these independently, without affecting one another; the dominant-position relationship at initialization is determined by the initialization ranges of the two sides and is unrelated to the particular initialization method chosen. For the blue-side UAV strategy, the invention provides six maneuver decision schemes ranging from simple to complex: executing a fixed maneuver (e.g. straight flight or a continuous left turn); random actions over the full action space (i.e. randomly selecting among the 15 actions); random horizontal-plane actions (i.e. randomly selecting among the first 9 actions of the action space); the hybrid maneuver strategy; a horizontal-plane maneuver decision strategy based on the MinMax decision algorithm (i.e. selecting among the first 9 actions according to the MinMax algorithm); and a full-action-space maneuver decision strategy based on the MinMax decision algorithm (i.e. selecting among all 15 actions according to the MinMax algorithm). The user can choose according to the training requirements. It should be noted that, after the user has prepared their own training flow, each scheme may be run many times, with the parameters of the algorithm and the reward function adjusted during training, so that the next stage of training only begins once the training goal of the current stage has been reached.
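The staged scheme flow and the initialization and blue-side strategy options above lend themselves to a simple declarative configuration; the structure below is purely illustrative, and its field names and probabilities are assumptions rather than the environment parameter setting module's actual format.

```python
# Illustrative staged training configuration (field names are assumptions).
TRAINING_SCHEMES = [
    {"name": "scheme_1", "red_init": "blue_tail_region",
     "blue_strategy": "straight_flight"},
    {"name": "scheme_2", "red_init": "blue_tail_region",
     "blue_strategy": "mixed",            # random maneuver or straight flight
     "mix_probs": {"random": 0.5, "straight": 0.5}},
    {"name": "scheme_3", "red_init": "blue_tail_region",
     "blue_strategy": "minmax_full_action_space"},
    {"name": "scheme_4", "red_init": "head_on_equal",
     "blue_strategy": "minmax_full_action_space"},
]

def run_schemes(schemes, train_one, migrate):
    """Run the schemes in order, migrating the trained network parameters
    from each stage to initialize the next (see the migration layer)."""
    params = None
    for cfg in schemes:
        params = train_one(cfg, init_params=params)
        params = migrate(params)
    return params
```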
In addition, during training the log data recording module records in real time the data that characterize training quality, such as the change in obtained reward and the change in neural-network error, so that the effectiveness of the current training process can be evaluated later.
Step four: demonstration of training results
After the training is finished, the training result can be displayed, and the corresponding scene display module and the simulation data storage module are used. The display is divided into three types, and the user selects according to the requirement.
First, display of the training-process data, mainly using the TensorBoard tool, including the reward changes and the neural-network error changes during training.

Second, replay display of the training results: the stored designated network parameters are loaded and, using the same environment settings and blue-side strategy as when they were saved, a complete air combat engagement is run from UAV initialization through step-by-step confrontation to the final outcome; the UAV trajectories are finally drawn as three-dimensional curves to display the air combat process. Meanwhile, the velocity and attitude changes during the engagement can be plotted from the training-process data according to the user's needs.

Third, refined display of the training results: the red and blue sides are shown with three-dimensional UAV models in a dynamic window, and the whole UAV air combat process is displayed dynamically.
Step five: migration of training results
Training parameter migration between different scenarios is mainly carried out to accelerate the training progress. The neural network parameter migration scheme adopted by the invention can provide the parameters of user-specified network layers for migration and corresponds to the migration and expansion layer. This part is optional for a user whose training flow, selected in the training scheme design of step three, contains only one scheme, and is available for such a user to extend as needed. For the staged, hierarchical training scheme of step three, however, the training flow is divided into several stages, and migration of the training results is then necessary. Because the invention trains step by step from simple to complex, after a given training scheme is finished, continuing with the next scheme requires inheriting and reusing the training result: all network parameters of the Actor network obtained by the completed scheme are migrated to the next scheme, the neural network parameters to be trained are initialized with the migrated parameters, and the new training then proceeds on this basis. For the case of extending one-on-one air combat to many-on-many air combat, if migration is needed the Actor networks must be designed with sufficient structural similarity so that the parameters of the specified layers can be migrated. It should be noted that migration does not necessarily lead to good results, but if the similarity between the environments before and after migration is strictly controlled, a certain acceleration effect can usually be achieved. According to the training scheme, when one stage of training is completed and the next stage is required, the parameters of the final result of the previous stage can be loaded, thereby transferring the trained result to the next stage of training.
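A minimal PyTorch sketch of this kind of partial parameter migration is given below; matching layers by name prefix is an assumption about how the "specified layers" are identified, not the network parameter migration module's actual mechanism.

```python
import torch

def migrate_parameters(trained_actor, new_actor, layer_names=None, path=None):
    """Save the trained Actor's parameters (optionally only the specified
    layers) and load them into the network used for the next training
    stage; remaining layers keep their fresh initialization."""
    state = trained_actor.state_dict()
    if layer_names is not None:
        state = {k: v for k, v in state.items()
                 if any(k.startswith(name) for name in layer_names)}
    if path is not None:
        torch.save(state, path)      # keep a file copy for later reuse
    missing, unexpected = new_actor.load_state_dict(state, strict=False)
    return missing, unexpected
```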
The unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning support switching of the reinforcement learning algorithm and partial customization of the air combat simulation environment, adopt a staged, hierarchical training scheme, improve the operability and success rate of training, and simplify the training and verification process of the designed algorithm.
Drawings
FIG. 1 is a structural diagram of an unmanned aerial vehicle intelligent air combat maneuver decision training system based on deep reinforcement learning
FIG. 2 is a schematic diagram of the relative relationship between the UAVs in air combat
FIG. 3 is a schematic diagram of the distribution of attack regions and dominant regions
FIG. 4 is a schematic diagram of the red- and blue-side UAV air combat process
FIG. 5 is a flowchart of an example training scheme
FIG. 6 training result reward convergence curve
FIG. 7 three-dimensional trajectory plot of training results
FIG. 8 is a graph of the variation of yaw angle reproduced from the training results
The reference numbers and symbols in the figures are as follows:
x, y, z-unmanned aerial vehicle position under ground coordinate system
t is time
yaw-yaw angle
Detailed Description
The effectiveness of the unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning provided by the invention is verified through a specific example, wherein the flow of the example is shown in fig. 5 and specifically described as follows.
The method comprises the following steps: air combat environment design
(1) Unmanned aerial vehicle model design
The following unmanned aerial vehicle motion model is adopted:
[The UAV motion model used in the example; the equation is given as an image in the original document.]

For the constraints: Length = Width = 5000 m, Height = 3000 m, $V_{min}$ = 50 m/s, $V_{max}$ = 200 m/s.
(2) Blue-side strategy design

The default blue-side strategy of the invention is adopted, i.e. the blue-side UAV maneuver decision strategy based on the MinMax decision algorithm.
(3) Air combat rule design

The invention's default attack judgment condition is used, with $D_{att,min}$ = 45 m and $D_{att,max}$ = 850 m; the initial Blood value is $Blood_0$ = 300. Attack damage is divided into three levels, as shown in the following formula.

[Attack damage formula; the expression is given as an image in the original document.]

where r is a random number uniformly distributed between 0 and 1.
Step two: reinforcement learning algorithm and environment combination
The invention's default PPO algorithm is used for one-on-one air combat training, i.e. $N_r = N_b = 1$.
(1) State space design

The invention's default state space design is used. The state consists of the red-side UAV's own three-axis position together with the relative position, velocity, and angle relation between the red and blue UAVs: in order, the length $D_{11}$ of the three-dimensional vector pointing from the red UAV to the blue UAV, its angle $D_{\theta,11}$ with the horizontal plane, the angle $D_{\psi,11}$ between its projection on the horizontal plane and the x-axis, the three-axis velocity difference $[\Delta v_{x,11}, \Delta v_{y,11}, \Delta v_{z,11}]$ between the two UAVs, the target azimuth angle $\varphi_{11}$, and the target entry angle $q_{11}$. The dimension of the designed state space is 11. The state-space normalization designed by the invention is also adopted:

[Formula (3): state-space normalization; the equation is given as an image in the original document.]

where $s_t$ is the t-th element of the state space, $a$ and $b$ are parameters, $\eta_t$ is the corresponding threshold, and $\hat{s}_t$ is the corresponding normalized element. The corresponding thresholds $\eta_t$ are Length, Width, Height, Length, $\pi/2$, $2\pi$, $V_{max} - V_{min}$, $V_{max} - V_{min}$, $V_{max} - V_{min}$, $\pi$, and $\pi$, 11 in total. Furthermore, $a = 8$ and $b = 4$.
(2) Design of motion space
The 15-action discrete action space designed by the invention is adopted, with the parameters set in this example to $m_1 = 2.4$, $m_2 = -1.2$, and $n = 2.8$.
(3) Reward function design
The reward function design of the invention is adopted, with the related parameters set as follows: $D_{adv0}$ = 45 m and $D_{adv}$ = 1400 m as the distance range of the advantage region, and $\alpha_{adv} = \pi/3$ as the angle threshold of the advantage region. For the process reward, $H_{att}$ = 100 m is taken as the optimal attack height, and the weighting coefficients are $\varepsilon_1 = 0.18$, $\varepsilon_2 = 0.72$, and $\varepsilon_3 = 0.1$. For the event reward, the attack reward is $r_{att} = 0.08$ and the destruction reward is $r_{dam} = 0$. For the ending reward, $r_{win1} = 5$, $r_{win2} = 4$, $r_{win3} = 1$, and $r_{loss} = 10$. The balancing weight is $k_{rwd} = 0.01$.
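For convenience, the example's settings from steps one and two can be collected into a single parameter dictionary, mirroring the role of the environment parameter setting module; the field names below are illustrative, while the values are those given in this example.

```python
import math

# Parameter values used in the worked example (field names are illustrative).
EXAMPLE_PARAMS = {
    "area": {"length": 5000.0, "width": 5000.0, "height": 3000.0},   # m
    "speed": {"v_min": 50.0, "v_max": 200.0},                        # m/s
    "attack": {"d_min": 45.0, "d_max": 850.0, "blood_0": 300},
    "advantage_region": {"d_min": 45.0, "d_max": 1400.0, "alpha": math.pi / 3},
    "action_space": {"m1": 2.4, "m2": -1.2, "n": 2.8},
    "process_reward": {"h_att": 100.0, "eps": (0.18, 0.72, 0.1)},
    "event_reward": {"r_att": 0.08, "r_dam": 0.0},
    "end_reward": {"r_win1": 5, "r_win2": 4, "r_win3": 1, "r_loss": 10},
    "balance_weight": 0.01,
}
```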
Step three: air combat training
(1) Air combat process
And carrying out air combat training according to the designed air combat process.
(2) Training scheme
In this example, the following training scheme procedure is employed:
the training scheme (1) that the red side is initially in a dominant position (namely in the tail area of the blue side) and flies in a blue side straight line;
in the training scheme (2), a red party is initially in a dominant position, and a blue party adopts a specified maneuvering decision algorithm;
and (3) the blue and red parties are in mutual equal status initially face to face, and the blue party adopts a specified maneuver decision algorithm.
Step four: training result display
After training is completed, the results are displayed. This example performs the first two kinds of display: the reward results during training, and the replay of the three-dimensional air combat trajectory together with the yaw-angle change during the engagement, as shown in Figs. 6, 7 and 8.
Step five: migration of training results
Since this example adopts a staged training scheme, after training scheme (1) of step three is completed, the training result is migrated to training scheme (2) so that training continues under scheme (2); after training scheme (2) is completed, the training result is migrated to training scheme (3) so that training continues under scheme (3).

Claims (7)

1. An unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning, characterized in that the method comprises the following steps:
Step one: unmanned aerial vehicle intelligent air combat environment design
Designing the unmanned aerial vehicle models for the red and blue sides, the blue-side air combat strategy, and the air combat rules;
step two: reinforcement learning algorithm and environment combination
Combining the reinforcement learning algorithm to be used with the unmanned aerial vehicle intelligent air combat environment in the step one, namely, performing action space design in an environment definition module to enable the action space design to be matched with the reinforcement learning algorithm, and then performing state space design and reward function design to serve as a main part of an interaction function; step two, specifically, the state space design, the action space design and the reward function design are included;
step three: air combat training
The air combat training comprises the air combat process and the training scheme; during training, the log data recording module records in real time the data that characterize training quality, such as the change in obtained reward and the change in neural-network error, so that the effectiveness of the current training process can be evaluated later;
step four: training result display
the display is divided into three types: first, display of data from the training process using the TensorBoard tool, including the reward changes and the neural network error changes during training;
second, reproduction display of the training results, which loads saved designated network parameters and uses the same environment settings and blue-side strategy as when they were saved, so as to run a complete air combat process step by step from unmanned aerial vehicle initialization through engagement until the outcome is finally decided, and finally represents the unmanned aerial vehicle trajectories as three-dimensional curves to show the air combat process; meanwhile, according to user requirements, plots of the speed and attitude changes during the air combat process are drawn from the training data;
third, refined display of the training results, in which the red and blue sides are shown as three-dimensional unmanned aerial vehicle models in a dynamic window and the whole unmanned aerial vehicle air combat process is displayed dynamically;
step five: migration of training results
the migration scheme adopted is neural network parameter migration, which provides the parameters of designated network layers for migration according to user requirements.
2. The unmanned aerial vehicle intelligent air combat maneuver decision training method based on deep reinforcement learning of claim 1, wherein the state space is designed as follows:
the unmanned aerial vehicle state space comprises three components: the unmanned aerial vehicle's own state, its state relative to the blue unmanned aerial vehicles, and its state relative to the other red unmanned aerial vehicles;
the own state (i ∈ R) has dimension 3 and consists of the current three-axis position [x_i, y_i, z_i] of the unmanned aerial vehicle;
the state relative to a blue unmanned aerial vehicle (i ∈ R and k ∈ B) has dimension 8 and consists, in order, of the length D_ik of the three-dimensional vector pointing from the position of red unmanned aerial vehicle i to the position of unmanned aerial vehicle k, the angle D_θ,ik between that vector and the horizontal plane, the angle D_ψ,ik between its projection on the horizontal plane and the x-axis, the three-axis velocity differences [Δv_x,ik, Δv_y,ik, Δv_z,ik] of unmanned aerial vehicles i and k, the target azimuth angle, and the target entry angle q_ik;
the state relative to another red unmanned aerial vehicle (i, j ∈ R and i ≠ j) has dimension 6 and consists, in order, of the length D_ij of the three-dimensional vector pointing from the position of red unmanned aerial vehicle i to the position of unmanned aerial vehicle j, the angle D_θ,ij between that vector and the horizontal plane, the angle D_ψ,ij between its projection on the horizontal plane and the x-axis, and the three-axis velocity differences [Δv_x,ij, Δv_y,ij, Δv_z,ij] of unmanned aerial vehicles i and j;
the state space of unmanned aerial vehicle i is composed of these components, with a simplified form when the number of red unmanned aerial vehicles n_r is 1 [state-space expressions given as equation images];
to accelerate training, the state space is normalized element-wise according to the following formula [equation image], in which the t-th element of the state space is mapped to its normalized value using the parameters a and b and the corresponding threshold η_t;
the thresholds are as follows: when the element is one of x_i, y_i, D_ij, D_ik, η_t = Length; when the element is z_i, η_t = Height; when the element is one of D_θ,ij, D_θ,ik, η_t = π/2; when the element is one of D_ψ,ij, D_ψ,ik, η_t = 2π; when the element is the target azimuth angle or q_ik, η_t = π; when the element is one of Δv_x,ij, Δv_y,ij, Δv_z,ij, Δv_x,ik, Δv_y,ik, Δv_z,ik, η_t = V_max − V_min, wherein V_max and V_min are the maximum and minimum unmanned aerial vehicle speeds.
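The normalization formula itself appears in the original only as an equation image. Assuming a simple linear scaling of each element by its threshold η_t with parameters a and b, a Python sketch might look as follows; the function names and the linear form are assumptions, while the thresholds follow the enumeration above for one red unmanned aerial vehicle observing one blue unmanned aerial vehicle.

```python
import numpy as np

def normalize_state(state: np.ndarray, thresholds: np.ndarray,
                    a: float = 1.0, b: float = 0.0) -> np.ndarray:
    """Scale every state element by its threshold eta_t (assumed linear form)."""
    return a * state / thresholds + b

def build_thresholds(length: float, height: float,
                     v_max: float, v_min: float) -> np.ndarray:
    """Thresholds for one red UAV observing one blue UAV (3 + 8 elements)."""
    dv = v_max - v_min
    return np.array([
        length, length, height,      # x_i, y_i, z_i
        length,                      # D_ik
        np.pi / 2, 2 * np.pi,        # D_theta_ik, D_psi_ik
        dv, dv, dv,                  # three-axis velocity differences
        np.pi, np.pi,                # target azimuth angle, target entry angle q_ik
    ])
```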
3. The unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning of claim 1, wherein the reward function is designed as follows:
three rewards are designed, namely the process reward r_1, the event reward r_2 and the ending reward r_3; a dominant region is introduced for designing the reward function; for a red unmanned aerial vehicle, the dominant region is located in a conical region behind the tail of a blue unmanned aerial vehicle, i.e., unmanned aerial vehicle i is considered to be in the dominant region with respect to unmanned aerial vehicle k when the relationship between them satisfies condition (2) [dominant-region condition on D_ik, D_adv0, D_adv and α_adv, given as an equation image],
wherein D_adv0 and D_adv define the distance range of the dominant region and α_adv is the angle threshold of the dominant region;
(1) process reward: for a many-to-many air combat scenario, the nearest blue unmanned aerial vehicle is selected as the target when computing the process reward; the process reward comprises three parts, namely the angle reward r_a, the distance reward r_d and the height reward r_h; the angle reward r_a is given by equation (3) [equation image];
r_d = r_d1 + r_d2 (4)
wherein r_d1 = 0.15 when the relative distance decreases and r_d1 = 0 otherwise; r_d2 is calculated by equation (5) [equation image];
for the height reward r_h, equation (6) applies [equation image],
wherein Δh = z_i − z_k is the height difference between unmanned aerial vehicle i and unmanned aerial vehicle k, and H_att is the optimal attack height;
the total process reward is therefore
r_1 = ε_1·r_a + ε_2·r_d + ε_3·r_h (7)
wherein ε_1, ε_2 and ε_3 are weighting coefficients;
(2) event reward
Event reward r 2 Triggered by the occurrence of a particular situation, including a preferential area reward r adv Attack award r att And a destruction reward r dam (ii) a Will r is att And r dam Is set to be constant, and r adv As shown in the following formula;
Figure QLYQS_24
for the red unmanned aerial vehicle i and the blue unmanned aerial vehicle k, when the condition of the formula (2) is met, r is awarded to the unmanned aerial vehicle i adv (ii) a When the attack condition is reached, the success rate p of the attack is satisfied att Then give the reward r to the unmanned plane i att (ii) a When the Blood value of the unmanned plane k is smaller than 0 after the attack, r is awarded to the unmanned plane i dam
(3) ending reward
the ending reward r_3 is given when the current air combat ends; the air combat ends when the Blood values of all unmanned aerial vehicles on one side fall below 0 or when the maximum simulation time t_max is reached; the red side wins when all blue Blood values fall below 0 within the maximum simulation time while at least one red unmanned aerial vehicle still has Blood greater than 0; the ending reward is given by equation (9) [equation image],
wherein r_win1, r_win2, r_win3 and r_loss are set positive constants, Blood_0 is the initial Blood value of an unmanned aerial vehicle, and t_end is the air combat ending time; for a many-to-many air combat scenario, the ending reward is given to all surviving red unmanned aerial vehicles;
the finally obtained reward is
rwd = k_rwd·(r_1 + r_2) + r_3 (10)
wherein k_rwd is a weighting coefficient used to balance the process reward and the event reward against the ending reward, so as to highlight the ending reward.
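Since several component rewards in this claim are defined by equations that appear only as images, the Python sketch below reproduces just the aggregation structure stated explicitly in the text (equations (4), (7) and (10)). The function signature, the summation of triggered event rewards into r_2, and the default weights are assumptions.

```python
def total_reward(r_a, r_d1, r_d2, r_h, r_adv, r_att, r_dam, r_end,
                 eps=(0.18, 0.72, 0.10), k_rwd=0.01):
    """Aggregate the reward terms per equations (4), (7) and (10).

    The terms r_a, r_d2, r_h, r_adv and r_end themselves are defined by
    equations given as images in the claim; summing the triggered event
    rewards into r_2 is likewise an assumption of this sketch.
    """
    r_d = r_d1 + r_d2                                    # equation (4)
    r_1 = eps[0] * r_a + eps[1] * r_d + eps[2] * r_h     # equation (7)
    r_2 = r_adv + r_att + r_dam                          # event reward
    return k_rwd * (r_1 + r_2) + r_end                   # equation (10)
```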
4. The unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning of claim 1, wherein the training scheme is designed as follows:
a layered, hierarchical training scheme is adopted, and a default training scheme is provided for the user as a reference; if the training results are good, the user may skip some training stages, otherwise new stages may be added; by default, the air combat algorithm designed by the user is trained using the following training scheme flow (a minimal sketch of this staged flow is given after this claim):
in the first training scheme, the red side is initially in a dominant position, namely in the blue side's tail area, and the blue side flies in a straight line;
in the second training scheme, the red side is initially in a dominant position, and the blue side maneuvers with a hybrid maneuver strategy; the hybrid maneuver strategy chooses between random maneuvers and straight-line flight according to a probability;
in the third training scheme, the red side is initially in a dominant position, and the blue side adopts a specified maneuver decision algorithm;
in the fourth training scheme, the red and blue sides start face to face in mutually equal positions, and the blue side adopts a specified maneuver decision algorithm;
in the fifth training scheme, the blue side is initially in a dominant position, namely in the red side's tail area; the fifth training scheme is optional;
it should be noted that, after the user prepares his or her own training flow, each scheme can be run multiple times, and the parameters of the relevant algorithm and of the reward function can be modified during training, so as to ensure that the subsequent training flow continues once the training progress target is reached.
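The sketch referenced above is given here as a purely illustrative description of the staged flow; the scheme dictionaries, field names and the train_one_scheme interface are all hypothetical.

```python
# Hypothetical description of the staged training flow; the scheme fields,
# their values and the train_one_scheme interface are all assumed.
TRAINING_SCHEMES = [
    {"name": "scheme1", "red_init": "advantage", "blue_policy": "straight_flight"},
    {"name": "scheme2", "red_init": "advantage", "blue_policy": "hybrid"},
    {"name": "scheme3", "red_init": "advantage", "blue_policy": "specified_algorithm"},
    {"name": "scheme4", "red_init": "head_on",   "blue_policy": "specified_algorithm"},
    # scheme 5 (blue initially dominant) is optional
]

def run_curriculum(train_one_scheme, agent, schemes=TRAINING_SCHEMES):
    """Run the training schemes in order, carrying the agent forward each time.

    Each scheme may in practice be repeated several times with adjusted
    algorithm and reward parameters before moving on, as noted above.
    """
    for scheme in schemes:
        agent = train_one_scheme(agent, scheme)
    return agent
```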
5. The unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning of claim 1, characterized in that other training schemes are also provided for the user to choose from:
for unmanned aerial vehicle initialization, four initialization schemes are provided: random initialization within a specified initialization range; discretization of the specified initialization range into 100x100x100 cells with positions initialized on the grid points; a specified set of initialization position points from which a position is randomly selected at each initialization; and a specified initialization position for each unmanned aerial vehicle; the red and blue sides select their initialization schemes independently without affecting each other;
for the blue-side unmanned aerial vehicle strategy, six maneuver decision schemes from simple to complex are provided, namely a fixed maneuver action, random actions over the full maneuver space, random horizontal-plane actions, a hybrid maneuver strategy, a horizontal-plane maneuver decision strategy based on the MinMax decision algorithm, and a full-maneuver-space maneuver decision strategy based on the MinMax decision algorithm; the user selects among them according to the training requirements.
6. An unmanned aerial vehicle intelligent air combat maneuver decision training system based on deep reinforcement learning, characterized in that the system consists of five parts: a reinforcement learning algorithm layer, an unmanned aerial vehicle air combat environment layer, an air combat maneuver decision algorithm training layer, a log recording and effect display layer, and a migration expansion layer;
1) The reinforcement learning algorithm layer adopts a deep reinforcement learning algorithm and supports adding user-defined algorithms under the actor-critic (AC) framework; the deep reinforcement learning algorithm used by the user is required to provide an action selection function, which receives the observation vector given by the air combat environment and outputs the selected action, and to call the interaction function provided by the air combat environment, whose input is an action and whose output is the new observation vector, the reward, the episode-end flag and other information; the proximal policy optimization (PPO) algorithm is used by default, and the user ports a custom algorithm following the PPO implementation (one possible shape of this interface is sketched after this claim);
2) The unmanned aerial vehicle air combat environment layer is the unmanned aerial vehicle air combat environment, which responds to the actions given by the action network and mainly comprises four sub-modules, namely an environment definition module, an environment parameter setting module, a blue-side maneuver strategy module and an unmanned aerial vehicle motion module; the environment definition module defines the air combat rules, the state space, the action space and the reward function; it mainly receives the red unmanned aerial vehicle maneuver given by the deep reinforcement learning algorithm and the blue unmanned aerial vehicle maneuver given by the blue-side maneuver strategy module, calls the unmanned aerial vehicle motion model to obtain the unmanned aerial vehicle state information at the next moment, obtains the state at the next moment through the designed state space, and applies the designed air combat rules to the unmanned aerial vehicle state information to determine the environment information;
3) The air combat maneuver decision algorithm training layer is used for training a deep reinforcement learning algorithm and mainly comprises a training parameter setting module and a training flow module;
the training flow module trains the air combat maneuver decision algorithm using the layered, hierarchical training scheme, which comprises multiple training schemes executed sequentially according to the scheme flow; when the system runs and calls the air combat maneuver decision algorithm training layer, the parameter settings in the training parameter setting module are read first, and the training flow module is then called to obtain the requirements of the current training, which in turn determine the parameter settings of the hyper-parameter module and the environment parameter setting module;
4) The log recording and effect display layer is used for log data recording and for displaying the air combat training results, and mainly comprises a log data recording module, a scene display module and a simulation data storage module;
5) The migration expansion layer is mainly used to migrate the trained neural network parameters; a migration scheme is specified when the system runs and calls the migration expansion layer; when the training scheme comprises multiple training schemes, the migration scheme links two adjacent training schemes, i.e., the neural network parameters obtained from the previous training are migrated to initialize the network parameters of the next training.
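The interface sketch referenced in item 1) is given below to make the required action-selection and interaction functions concrete; the class and method names and the (observation, reward, done, info) return tuple are assumptions modelled on common reinforcement learning environment APIs, not the patented interface itself.

```python
import numpy as np

class AirCombatEnvInterface:
    """Contract the air combat environment layer is expected to expose.

    The method names and the (observation, reward, done, info) return tuple
    are assumptions of this sketch.
    """

    def reset(self) -> np.ndarray:
        """Initialize both sides and return the first observation vector."""
        raise NotImplementedError

    def step(self, action: int):
        """Interaction function: apply the selected action and return the new
        observation vector, the reward, the episode-end flag and extra info."""
        raise NotImplementedError


def select_action(policy_net, observation: np.ndarray) -> int:
    """Action-selection function required of the algorithm layer (sketch).

    Greedy selection is used here for simplicity; PPO would normally sample
    from the policy distribution instead.
    """
    logits = policy_net(observation)
    return int(np.argmax(logits))
```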
7. The unmanned aerial vehicle intelligent air combat maneuver decision training system based on deep reinforcement learning of claim 6, wherein the action space needs to be customized so as to match the unmanned aerial vehicle motion model and the designed deep reinforcement learning algorithm, and a discrete action space consisting of 15 discrete actions is adopted by default;
the environment parameter setting module is an independent settings file and comprises the environment information, the reward function parameters, the number of unmanned aerial vehicles on each side, the performance parameters of both sides' unmanned aerial vehicles, the initial positions and speeds of both sides' unmanned aerial vehicles, and the maneuver strategy adopted.
CN202211189814.5A 2022-09-28 2022-09-28 Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning Pending CN115933717A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211189814.5A CN115933717A (en) 2022-09-28 2022-09-28 Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211189814.5A CN115933717A (en) 2022-09-28 2022-09-28 Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115933717A true CN115933717A (en) 2023-04-07

Family

ID=86651407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211189814.5A Pending CN115933717A (en) 2022-09-28 2022-09-28 Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115933717A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116909155A (en) * 2023-09-14 2023-10-20 中国人民解放军国防科技大学 Unmanned aerial vehicle autonomous maneuver decision-making method and device based on continuous reinforcement learning
CN116909155B (en) * 2023-09-14 2023-11-24 中国人民解放军国防科技大学 Unmanned aerial vehicle autonomous maneuver decision-making method and device based on continuous reinforcement learning
CN117162102A (en) * 2023-10-30 2023-12-05 南京邮电大学 Independent near-end strategy optimization training acceleration method for robot joint action

Similar Documents

Publication Publication Date Title
CN110991545B (en) Multi-agent confrontation oriented reinforcement learning training optimization method and device
CN113791634B (en) Multi-agent reinforcement learning-based multi-machine air combat decision method
CN115933717A (en) Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
CN113050686B (en) Combat strategy optimization method and system based on deep reinforcement learning
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN113095481A (en) Air combat maneuver method based on parallel self-game
CN113221444B (en) Behavior simulation training method for air intelligent game
CN112215350B (en) Method and device for controlling agent based on reinforcement learning
CN104102522B (en) The artificial emotion driving method of intelligent non-player roles in interactive entertainment
CN112198892B (en) Multi-unmanned aerial vehicle intelligent cooperative penetration countermeasure method
CN113962012B (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN111856925A (en) State trajectory-based confrontation type imitation learning method and device
CN113625569B (en) Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN112561032B (en) Multi-agent reinforcement learning method and system based on population training
CN115755956B (en) Knowledge and data collaborative driving unmanned aerial vehicle maneuvering decision method and system
Wang et al. Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN116430888A (en) Multi-unmanned aerial vehicle air combat strategy generation method, device and computer equipment
CN114167756B (en) Multi-unmanned aerial vehicle collaborative air combat decision autonomous learning and semi-physical simulation verification method
Rao et al. A modified random network distillation algorithm and its application in USVs naval battle simulation
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
CN114037048A (en) Belief consistency multi-agent reinforcement learning method based on variational cycle network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination