CN114035602A - Airplane maneuvering control method based on layered reinforcement learning

Airplane maneuvering control method based on layered reinforcement learning

Info

Publication number
CN114035602A
CN114035602A (application CN202110904677.8A)
Authority
CN
China
Prior art keywords
missile
command
probability
intelligent agent
embedding vector
Prior art date
Legal status
Pending
Application number
CN202110904677.8A
Other languages
Chinese (zh)
Inventor
杨晟琦
朴海音
孙智孝
彭宣淇
韩玥
樊松源
孙阳
于津
田明俊
金琳乘
Current Assignee
Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Original Assignee
Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Priority date
Filing date
Publication date
Application filed by Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Priority to CN202110904677.8A
Publication of CN114035602A

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/107: Simultaneous control of position or course in three dimensions specially adapted for missiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Aiming, Guidance, Guns With A Light Source, Armor, Camouflage, And Targets (AREA)

Abstract

The application relates to the technical field of flight control, and in particular to an airplane maneuvering control method based on layered reinforcement learning. The method comprises the following steps: step S1, obtaining an action embedding vector of the agent computed by a neural network; step S2, outputting a horizontal-angle probability list, a vertical-angle probability list and a shooting probability list from the action embedding vector; and step S3, sampling from these probability lists to determine the agent's horizontal control mode, vertical control mode and whether to shoot, and controlling the agent accordingly. By combining horizontal tactical maneuver intentions with vertical three-dimensional maneuver intentions, the application can generate a large number of more diverse and flexible maneuver patterns.

Description

Airplane maneuvering control method based on layered reinforcement learning
Technical Field
The application relates to the technical field of flight control, in particular to an airplane maneuvering control method based on layered reinforcement learning.
Background
Beyond-visual-range air combat is the main form of modern air combat. It typically unfolds as the fighter selecting targets for detection and tracking, launching and guiding missiles, and evading missiles launched by the opponent. This process involves a large number of maneuver decisions: how to maneuver into a favorable attack position, and how to maneuver evasively so as to avoid being hit by enemy missiles, are key problems in beyond-visual-range air combat. In recent years, how to make an unmanned agent produce tactical decision-making behavior comparable to that of a human pilot has become a hot topic in research on unmanned autonomous air combat. Existing AI air combat methods mainly comprise rule-based expert system methods, hybrid methods combining probability models/fuzzy logic with computational intelligence, and machine learning and deep reinforcement learning methods. These methods each perform well in different respects. Rule-based expert system methods depend entirely on an air combat rule database defined in advance by human pilots; everything must be designed beforehand, there is no capacity for self-evolution, and the agent's behavior has an obvious upper limit. Hybrid methods based on probability models/fuzzy logic and computational intelligence require experts to construct a probabilistic reasoning network or to design heuristic objective functions, cannot cover all air combat states, and are very complex and difficult to design. Machine learning methods rely heavily on large amounts of real air combat data, which are often scarce or even unavailable, and tend to limit the agent's performance to the range of capabilities that the data can provide. Deep reinforcement learning methods automatically generate air combat tactics through self-play reinforcement learning training without supervision by human knowledge, but the resulting maneuver style is fixed and greatly lacks the diversity and flexibility of a human pilot.
Disclosure of Invention
In order to solve the above problems, the application provides an airplane maneuvering control method based on layered reinforcement learning, in which the maneuver of an aircraft is decomposed into a horizontal-dimension maneuver angle and a vertical-dimension maneuver angle; by combining horizontal tactical maneuver intentions with vertical three-dimensional maneuver intentions, a large number of more diverse and flexible maneuver patterns can be generated.
The application relates to an airplane maneuvering control method based on layered reinforcement learning, used for the maneuver control of two teams of aircraft agents during a game process. The method comprises the following steps:
step S1, obtaining an action embedding vector of the agent computed by a neural network;
step S2, outputting a horizontal-angle probability list, a vertical-angle probability list and a shooting probability list from the action embedding vector, wherein the horizontal-angle probability list comprises a plurality of probability values corresponding to a plurality of preset horizontal control commands, the vertical-angle probability list comprises a plurality of probability values corresponding to a plurality of preset vertical control commands, and the shooting probability list comprises two probability values corresponding to shooting and not shooting; and
step S3, sampling from the probability lists, determining the agent's horizontal control mode, vertical control mode and whether to shoot, and controlling the agent accordingly.
Preferably, step S1 further includes:
step S11, acquiring the overall air combat state and dividing it into the absolute state quantity of the current agent, which characterizes the agent's own attributes, the relative state quantities between the current agent and other agents, the missile state quantities of teammate agents, and the missile state quantities of opponent agents;
step S12, determining a global embedding vector from the overall air combat state, a relative observation embedding vector from the relative state quantities, a friendly missile embedding vector from the missile state quantities of the teammate agents, and an enemy missile embedding vector from the missile state quantities of the opponent agents; and
step S13, forming the action embedding vector by concatenating the global embedding vector, the relative observation embedding vector, the friendly missile embedding vector and the enemy missile embedding vector.
Preferably, in step S11, the absolute state quantities of the current agent include true airspeed, current altitude, climb rate, three-axis attitude angles, normal overload, fire-control radar lock signal, electronic warning device warning state, and number of remaining air-to-air missiles; the relative state quantities include relative distance, closing rate, relative altitude difference, target entry angle, own beam angle and attack zone information; the missile state quantities include missile speed, current altitude, missile-target distance, missile-target closing rate, remaining hit time, and the entry angle and beam angle between missile and target; the missile state quantities of the opponent agents include only the missile state quantities that threaten the current agent.
Preferably, in step S2, there are six preset horizontal control commands: a hold command that holds the current heading, an attack command that points at the target, an attack command offset by ±30°, an attack command offset by ±50°, a defense command offset by ±90°, and a defense command offset by ±180°.
Preferably, in step S2, there are six preset vertical control commands: a hold command that holds the current heading, an attack command that points at the target, a +30° climb attack command, a +60° climb attack command, a -30° dive defense command, and a -60° dive defense command.
Preferably, after step S3, the method further comprises:
step S4, determining a command speed probability list and a command overload probability list from the sampling results corresponding to the horizontal control mode and the vertical control mode together with the action embedding vector, and sampling from the command speed probability list and the command overload probability list respectively to obtain the command speed and the command overload of the agent.
Preferably, the command speed probability list includes a plurality of probabilities corresponding to a plurality of speed values, the command overload probability list includes a plurality of probabilities corresponding to a plurality of overload values, and the speed values and overload values are obtained by discretizing the command speed and the command overload.
By combining horizontal tactical maneuver intentions with vertical three-dimensional maneuver intentions, the application can generate a large number of more diverse and flexible maneuver patterns.
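By way of illustration only, the following Python sketch shows how the sampling of steps S3 and S4 can be carried out once the probability lists are available. The function name, the use of NumPy and the argument layout are assumptions made for this sketch, not details specified by the application.

```python
import numpy as np

def sample_decision(p_hor, p_ver, p_shoot, p_speed, p_overload, rng=None):
    """Draw one index from each probability list (steps S3 and S4).

    The returned indices select the preset horizontal control command,
    the preset vertical control command, the shoot flag (0 = no shot,
    1 = shoot), the discretized command speed and the discretized
    command overload, respectively."""
    rng = rng or np.random.default_rng()
    return (rng.choice(len(p_hor), p=p_hor),
            rng.choice(len(p_ver), p=p_ver),
            rng.choice(len(p_shoot), p=p_shoot),
            rng.choice(len(p_speed), p=p_speed),
            rng.choice(len(p_overload), p=p_overload))
```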
Drawings
FIG. 1 is a flow chart of an airplane maneuver control method based on hierarchical reinforcement learning according to the present application.
Detailed Description
In order to make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are some, but not all, of the embodiments of the present application. The embodiments described below with reference to the drawings are illustrative, are intended to explain the present application, and should not be construed as limiting it. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present application. Embodiments of the present application are described in detail below with reference to the drawings.
The application provides an airplane maneuvering control method based on layered reinforcement learning. The way in which micro-level maneuver actions, viewed along several dimensions, combine to shape the aircraft's overall macro-level tactical intent is referred to here as maneuver semantics. This maneuver-semantic information is used to improve the agent's policy network, so that the agent can exhibit more flexible tactical behavior. The method comprises the following procedures:
a) extracting air combat features as the input of the neural network;
b) designing a semantic maneuver controller as the decision output of the neural network;
c) constructing a hierarchical network for the forward propagation of the neural network.
The specific example steps are as follows:
1) Extracting air combat features.
The air combat state S is divided into the absolute state quantity of the current agent, the relative state quantities between the current agent and other agents, the missile state quantities of teammate agents, and the missile state quantities of opponent agents (denoted here as $s^{\text{self}}$, $s^{\text{rel}}$, $s^{\text{am}}$ and $s^{\text{om}}$, respectively).
The absolute state quantity $s^{\text{self}}$ consists of the following elements: true airspeed $tas$, current altitude $h$, climb rate $\dot h$, three-axis attitude angles $\psi$, $\theta$, $\phi$, normal overload $n_n$, fire-control radar lock signal $lo$, electronic warning device warning state $wan$, and number of remaining air-to-air missiles $m_{\text{left}}$. The absolute state characterizes the attributes of the agent itself.
The relative state quantity $s^{\text{rel}}$ consists of the following elements: relative distance $r$, closing rate $\dot r$, relative altitude difference $\Delta h$, target entry angle $AA$ (the angle between the target's velocity vector and the line-of-sight vector), own beam angle $BA$ (the angle between the own aircraft's nose direction and the line-of-sight vector), and attack zone information $DLZ$. The relative state characterizes the situational relation between the current agent and a target. Together, the absolute and relative states provide the agent with information about the whole battlefield and supply feature information for the agent's attack decisions and cooperation with teammates.
A missile state quantity consists of the following elements: missile speed $v_m$, current altitude $h_m$, missile-target distance $r_m$, missile-target closing rate $\dot r_m$, remaining hit time $T_{go}$, and the entry angle $AA_m$ and beam angle $BA_m$ between missile and target. $s^{\text{am}}$ is composed of the missile state quantities of all teammate agents and provides feature information for the agent's cooperative guidance decisions. $s^{\text{om}}$ differs in that it contains only the missile state quantities that threaten the current agent, and provides feature information for the agent's defensive decisions.
For a given agent, the overall air combat state can therefore be written as:
$s^{\text{self}} = [\,tas,\ h,\ \dot h,\ \psi,\ \theta,\ \phi,\ n_n,\ lo,\ wan,\ m_{\text{left}}\,]$
$s^{\text{rel}} = [\,r,\ \dot r,\ \Delta h,\ AA,\ BA,\ DLZ\,]$
$s^{m} = [\,v_m,\ h_m,\ r_m,\ \dot r_m,\ T_{go},\ AA_m,\ BA_m\,]$
$S = \{\, s^{\text{self}},\ s^{\text{rel}},\ s^{\text{am}},\ s^{\text{om}} \,\}$
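As a purely illustrative data layout for the four observation groups described above, the following sketch packs each group into a flat NumPy feature vector; the function names, field ordering and use of NumPy are assumptions for the sketch rather than the application's actual encoding.

```python
import numpy as np

def own_state(tas, h, h_dot, psi, theta, phi, n_n, lock, warn, m_left):
    """Absolute state of the current agent: true airspeed, altitude, climb
    rate, three-axis attitude angles, normal overload, fire-control radar
    lock signal, warning-device state, remaining air-to-air missiles."""
    return np.array([tas, h, h_dot, psi, theta, phi, n_n, lock, warn, m_left],
                    dtype=np.float32)

def relative_state(r, r_dot, dh, aa, ba, dlz):
    """Relative state w.r.t. one other agent: distance, closing rate,
    altitude difference, target entry angle AA, own beam angle BA, DLZ."""
    return np.array([r, r_dot, dh, aa, ba, dlz], dtype=np.float32)

def missile_state(v_m, h_m, r_m, r_m_dot, t_go, aa_m, ba_m):
    """State of one missile: speed, altitude, missile-target distance,
    closing rate, remaining hit time, entry angle and beam angle."""
    return np.array([v_m, h_m, r_m, r_m_dot, t_go, aa_m, ba_m],
                    dtype=np.float32)

# The overall air combat state S collects the four groups: the own state,
# the relative observations, the teammate missile states and the
# threatening opponent missile states.
```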
(2) Designing a semantic maneuver controller.
The application decomposes the maneuver of a fighter into several micro-level maneuvers along the horizontal and vertical dimensions, see Table 1. The horizontal-dimension maneuver takes the line of sight as reference and uses the angle between the aircraft's velocity direction and the line of sight as the control command: at 0° the aircraft flies toward the target, and the sign of the other angles is resolved toward the fastest turning direction. Angles of 0°, ±30° and ±50° mean that the agent is currently using an attack strategy, which usually occurs when it must close on a target in order to launch a missile, or when it flies a biased maneuver after launch to keep guiding the missile; these carry attack semantics. The ±90° and ±180° cases correspond to defensive maneuvers that exploit the Doppler notch of the opponent's fire-control radar and to end-game turn-away escape maneuvers; these carry defense semantics. The vertical-dimension maneuver takes the horizontal plane as reference and uses the angle between the aircraft's velocity direction and the horizontal as the control command: positive angles are climbs and negative angles are dives. The purpose of a climbing maneuver is to enter a region of thinner air so as to reduce the energy loss of a launched missile, carrying attack semantics. A diving maneuver takes the aircraft into denser air, increases the energy consumption of an incoming missile and improves survival probability, carrying defense semantics. See Table 1 for details.
TABLE 1 Semantic maneuver list (reconstructed from the command definitions above)

No.  Horizontal maneuver (angle to line of sight)  Semantics  Vertical maneuver (angle to horizontal)  Semantics
1    hold current heading                          -          hold current heading                     -
2    attack, point at target (0°)                  attack     attack, point at target                  attack
3    biased attack, ±30°                           attack     climb attack, +30°                       attack
4    biased attack, ±50°                           attack     climb attack, +60°                       attack
5    defense, ±90°                                 defense    dive defense, -30°                       defense
6    defense, ±180°                                defense    dive defense, -60°                       defense
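A minimal sketch of how the semantic maneuver commands of Table 1 could be stored as lookup tables is given below; the numeric indexing follows the order used in the worked example later in this description, and the dictionary layout is an illustrative assumption.

```python
# Semantic maneuver lookup tables (indexing follows the worked example).
# Horizontal commands: angle between the aircraft's velocity vector and the
# line of sight; the sign of nonzero angles follows the fastest turn direction.
HORIZONTAL_MANEUVERS = {
    1: ("hold current heading", None),
    2: ("attack, point at target", 0.0),
    3: ("biased attack", 30.0),     # +/-30 deg
    4: ("biased attack", 50.0),     # +/-50 deg
    5: ("defense", 90.0),           # +/-90 deg, radar-notch maneuver
    6: ("defense", 180.0),          # +/-180 deg, end-game escape
}

# Vertical commands: angle between the velocity vector and the horizontal;
# positive angles are climbs, negative angles are dives.
VERTICAL_MANEUVERS = {
    1: ("hold current heading", None),
    2: ("attack, point at target", None),  # pitch toward the target
    3: ("climb attack", +30.0),
    4: ("climb attack", +60.0),
    5: ("dive defense", -30.0),
    6: ("dive defense", -60.0),
}
```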
(3) Constructing a hierarchical network.
As described above, the application characterizes air combat with four types of observations: the absolute observation, the relative observation, the friendly-missile observation and the enemy-missile observation. Together, these four observation types form the global observation state of the air combat and contain almost all of the battlefield information, so estimating the state value of each time step from the global observation makes the value estimate more accurate. The global observation is fed into the network $f_{O2E}$ to generate the corresponding global embedding vector $e_g$, and the state value of the current time step is then output through the network $f_{E2V}$, as shown on the left side of FIG. 1:
$e_g = f_{O2E}(S)$
$V(S) = f_{E2V}(e_g)$
On the other hand, as noted above, the relative observation, the friendly-missile observation and the enemy-missile observation each carry their own state semantics, which are closely related to tactical air combat decisions, so the application makes full use of this information when extracting the policy. The three observations are each passed through their own $f_{O2E}$ network to generate the corresponding embedding vectors: the relative observation embedding vector $e_r$, the friendly-missile embedding vector $e_{am}$ and the enemy-missile embedding vector $e_{om}$, as shown at the bottom right of FIG. 1. They are concatenated with the global embedding vector $e_g$ to form the comprehensive action embedding vector, i.e. the hidden state vector $e_{tot}$ shown in FIG. 1. The hidden state vector integrates the feature information of the global observation and of each semantic observation, which benefits policy generation. The application divides the policy into a maneuver policy and a fire policy, and therefore outputs three decision actions: the horizontal maneuver mode $a_{hor}$, the vertical maneuver mode $a_{ver}$ and the firing command $a_{shoot}$, where the selection ranges of the horizontal and vertical maneuvers are given in Table 1, and $a_{shoot} \in \{0, 1\}$ with 0 meaning no shot and 1 meaning shoot. The hidden state vector is passed through the embedding-layer network $f_{action}$ of each action to generate the respective action embedding vector $e_a$, and the selection probability of each action is proportional to the exponential of that embedding, $\pi(a\,|\,s) \propto \exp(f_{action}(e_{tot}))$, with the probability of each action computed by the Softmax activation function. The following formulas describe the forward propagation of the policy network:
$e_r = f_{O2E}^{rel}(s^{\text{rel}})$
$e_{am} = f_{O2E}^{am}(s^{\text{am}})$
$e_{om} = f_{O2E}^{om}(s^{\text{om}})$
$e_{tot} = \mathrm{Concat}(e_g, e_r, e_{am}, e_{om})$
$e_a = f_{action}(e_{tot})$
$\pi(a\,|\,s) = \mathrm{Softmax}(f_{action}(e_{tot}))$
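The following PyTorch-style sketch illustrates one possible realization of the forward pass just described: per-observation embedding networks, a value head on the global embedding, the concatenated hidden state $e_{tot}$, and the three action heads with Softmax outputs. The class name, layer widths, activation functions and the use of single linear layers for each $f_{O2E}$ network are assumptions made for illustration; the application does not specify them.

```python
import torch
import torch.nn as nn

class HierarchicalAirCombatPolicy(nn.Module):
    """Illustrative sketch of the upper-level policy/value network.
    Observation sizes and hidden widths are placeholders."""

    def __init__(self, d_global, d_rel, d_am, d_om, d_emb=32,
                 n_hor=6, n_ver=6):
        super().__init__()
        # f_O2E networks, one per observation group (here: single layers)
        self.f_o2e_global = nn.Linear(d_global, d_emb)
        self.f_o2e_rel = nn.Linear(d_rel, d_emb)
        self.f_o2e_am = nn.Linear(d_am, d_emb)
        self.f_o2e_om = nn.Linear(d_om, d_emb)
        # value head f_E2V on the global embedding
        self.f_e2v = nn.Linear(d_emb, 1)
        # action heads on the concatenated hidden state e_tot
        self.f_hor = nn.Linear(4 * d_emb, n_hor)
        self.f_ver = nn.Linear(4 * d_emb, n_ver)
        self.f_shoot = nn.Linear(4 * d_emb, 2)

    def forward(self, s_global, s_rel, s_am, s_om):
        # s_global is the full global observation; the others are the
        # semantic observation groups described in the text.
        e_g = torch.relu(self.f_o2e_global(s_global))
        e_r = torch.relu(self.f_o2e_rel(s_rel))
        e_am = torch.relu(self.f_o2e_am(s_am))
        e_om = torch.relu(self.f_o2e_om(s_om))
        value = self.f_e2v(e_g)                        # V(S)
        e_tot = torch.cat([e_g, e_r, e_am, e_om], -1)  # hidden state e_tot
        pi_hor = torch.softmax(self.f_hor(e_tot), -1)      # pi(a_hor | s)
        pi_ver = torch.softmax(self.f_ver(e_tot), -1)      # pi(a_ver | s)
        pi_shoot = torch.softmax(self.f_shoot(e_tot), -1)  # pi(a_shoot | s)
        return value, e_tot, pi_hor, pi_ver, pi_shoot
```

Estimating V(S) from the global embedding while the action heads also see the semantic embeddings mirrors the split between value estimation and policy extraction described above.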
in addition, in order to enable the maneuvering action to be suitable for richer situations, the control of the command speed and the command overload is added in the maneuvering mode. In order to reduce the learning complexity, a few common values are selected, and the instruction speed and the instruction overload are discretized, namely v epsilon [ v ∈ [ v [ ]1,…,vn],nn∈[nn1,…,nnm]. Thus, even under the same upper-layer mechanical strategy, the intelligent agent can select different instruction speeds and instruction overload to flexibly deal with different air combat situations. Since the selection of the two lower-layer instructions is closely related to the maneuver strategy of the upper layer, the application therefore adopts a layered idea to handle the selection of the two instruction actions. Converting the currently selected horizontal maneuver number and the currently selected vertical maneuver number into a one-hot vector form, and etotPassing in together instruction generation network fsteerAnd then outputting the instruction speed and the instruction overload through the softmax activating function. The forward propagation process of the underlying policy network is defined as follows:
$e_v,\ e_{nn} = f_{steer}(e_{tot},\ T_{\text{one-hot}}(a_{hor}),\ T_{\text{one-hot}}(a_{ver}))$
$\pi(v\,|\,s) = \mathrm{Softmax}(e_v), \qquad \pi(n_n\,|\,s) = \mathrm{Softmax}(e_{nn})$
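In the same illustrative spirit, the lower-level command network $f_{steer}$ can be sketched as follows; it conditions on $e_{tot}$ together with the one-hot encodings of the selected horizontal and vertical maneuvers and outputs the command-speed and command-overload distributions. The hidden width, discretization sizes and PyTorch layering are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SteerHead(nn.Module):
    """Illustrative lower-level command network f_steer: produces the
    command-speed and command-overload distributions from e_tot and the
    one-hot encodings of the selected horizontal/vertical maneuvers."""

    def __init__(self, d_tot, n_hor=6, n_ver=6, n_speed=2, n_overload=2):
        super().__init__()
        self.n_hor, self.n_ver = n_hor, n_ver
        self.f_steer = nn.Linear(d_tot + n_hor + n_ver, 64)
        self.speed_head = nn.Linear(64, n_speed)        # produces e_v
        self.overload_head = nn.Linear(64, n_overload)  # produces e_nn

    def forward(self, e_tot, a_hor, a_ver):
        # a_hor, a_ver are integer (LongTensor) maneuver indices, 0-based
        x = torch.cat([e_tot,
                       F.one_hot(a_hor, self.n_hor).float(),   # T_one-hot(a_hor)
                       F.one_hot(a_ver, self.n_ver).float()],  # T_one-hot(a_ver)
                      dim=-1)
        h = torch.relu(self.f_steer(x))
        pi_speed = torch.softmax(self.speed_head(h), dim=-1)        # pi(v | s)
        pi_overload = torch.softmax(self.overload_head(h), dim=-1)  # pi(n_n | s)
        return pi_speed, pi_overload
```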
the following is a specific example.
(1) The overall air combat state S is input into the neural network to compute $e_g = f_{O2E}(S)$; assume the output vector is $e_g = [10, 22, 21]$. $e_g$ is then passed through $V(S) = f_{E2V}(e_g)$ to compute and output the value of the current state, here $V(S) = 34$. The relative state quantity is passed through its $f_{O2E}$ network, giving an assumed output vector $e_r = [9, 2, 11]$; the missile state quantity of the teammate agents gives an assumed output vector $e_{am} = [11, 23, 67]$; and the missile state quantity of the opponent agents gives an assumed output vector $e_{om} = [54, 3, 7]$. $e_g$, $e_r$, $e_{am}$ and $e_{om}$ are concatenated to form the action embedding vector of step S1:
$e_{tot} = \mathrm{Concat}(e_g, e_r, e_{am}, e_{om})$,
which yields the 12-dimensional vector $e_{tot} = [10, 22, 21, 9, 2, 11, 11, 23, 67, 54, 3, 7]$.
(2) In step S2, referring to FIG. 1, the generated action embedding vector $e_{tot}$ is first used to compute $e_{hor} = f_{Hor}(e_{tot})$; assume the output is $e_{hor} = [34, 21, 1]$. The horizontal-angle probabilities are then computed as $\pi(a_{hor}\,|\,s) = \mathrm{Softmax}(e_{hor})$ and are assumed to be $[0.1, 0.1, 0.2, 0.4, 0.1, 0.1]$. In step S3, sampling from this probability list selects the 4th horizontal angle, and the horizontal control mode is determined from Table 1 to be the ±50° biased attack. As shown in FIG. 1, the horizontal maneuver number 4 also has to be converted into the one-hot vector $T_{\text{one-hot}}(a_{hor}) = [0, 0, 0, 1, 0, 0]$ for use in controlling the agent's speed and overload;
similarly, still referring to FIG. 1, in step S2 the generated action embedding vector $e_{tot}$ is used to compute $e_{ver} = f_{Ver}(e_{tot})$; assume the output is $e_{ver} = [4, 2, 12]$. The vertical-angle probabilities are then computed as $\pi(a_{ver}\,|\,s) = \mathrm{Softmax}(e_{ver})$ and are assumed to be $[0.3, 0, 0.2, 0.2, 0.2, 0.1]$. In step S3, sampling from this probability list selects the 1st vertical angle, and the vertical control mode is determined from Table 1 to be holding the current heading. As shown in FIG. 1, the vertical maneuver number 1 also has to be converted into the one-hot vector $T_{\text{one-hot}}(a_{ver}) = [1, 0, 0, 0, 0, 0]$ for use in controlling the agent's speed and overload;
similarly, still referring to FIG. 1, in step S2 the generated action embedding vector $e_{tot}$ is used to compute $e_{shoot} = f_{Shoot}(e_{tot})$; assume the output is $e_{shoot} = [2, 52, 12]$. The shooting-action probabilities are then computed as $\pi(a_{shoot}\,|\,s) = \mathrm{Softmax}(e_{shoot})$ and are assumed to be $[0.3, 0.7]$. In step S3, sampling from this probability list selects the 2nd shooting action, and a firing command is formed, i.e. the agent shoots.
In step S4, the $e_{tot}$ generated in (1) and the $T_{\text{one-hot}}(a_{hor})$ and $T_{\text{one-hot}}(a_{ver})$ generated in (2) are used to compute the command speed and the command overload according to the following formula:
$e_v,\ e_{nn} = f_{steer}(e_{tot},\ T_{\text{one-hot}}(a_{hor}),\ T_{\text{one-hot}}(a_{ver}))$
Assume the outputs are $e_v = [12, 2, 42]$ and $e_{nn} = [14, 23, 4]$. Then $\pi(v\,|\,s) = \mathrm{Softmax}(e_v)$ and $\pi(n_n\,|\,s) = \mathrm{Softmax}(e_{nn})$ are computed to determine the command speed probability list and the command overload probability list. Assume the command speed probabilities are $[0.6, 0.4]$ and the command overload probabilities are $[0.2, 0.8]$, and assume that there are two command speed options $[200, 300]$ m/s and two command overload options $[5, 6]$ g. Sampling from these probabilities selects command speed option 1, i.e. 200 m/s, and command overload option 2, i.e. an overload of 6 g.
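Purely to illustrate the sampling chain of this example, the snippet below plugs in the probability lists and option values assumed in steps (2) to (4) above and draws one decision; the concrete draws naturally depend on the random seed, and the code itself is not part of the application.

```python
import numpy as np

rng = np.random.default_rng(0)

# Probability lists assumed in the worked example above.
p_hor = [0.1, 0.1, 0.2, 0.4, 0.1, 0.1]   # horizontal commands 1..6
p_ver = [0.3, 0.0, 0.2, 0.2, 0.2, 0.1]   # vertical commands 1..6
p_shoot = [0.3, 0.7]                      # [no shot, shoot]
p_speed = [0.6, 0.4]                      # over speed options [200, 300] m/s
p_overload = [0.2, 0.8]                   # over overload options [5, 6] g

a_hor = rng.choice(6, p=p_hor) + 1        # e.g. 4 -> ±50° biased attack
a_ver = rng.choice(6, p=p_ver) + 1        # e.g. 1 -> hold current heading
a_shoot = rng.choice(2, p=p_shoot)        # e.g. 1 -> shoot
v_cmd = [200, 300][rng.choice(2, p=p_speed)]      # e.g. 200 m/s
nn_cmd = [5, 6][rng.choice(2, p=p_overload)]      # e.g. 6 g

one_hot_hor = np.eye(6)[a_hor - 1]        # T_one-hot(a_hor)
one_hot_ver = np.eye(6)[a_ver - 1]        # T_one-hot(a_ver)
print(a_hor, a_ver, a_shoot, v_cmd, nn_cmd)
```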
The airplane maneuvering control method based on hierarchical reinforcement learning (HRLMC) decomposes the learning of maneuver policies into two aspects, tactical maneuver intention learning and three-dimensional maneuver intention learning. Through self-play deep reinforcement learning, the agent can learn flexible and diverse maneuver policies to deal with different air combat situations, which enhances the robustness of the algorithm. The whole learning process involves no hand-written human rules. By combining horizontal tactical maneuver intentions with vertical three-dimensional maneuver intentions, the application can generate a large number of more diverse and flexible maneuver patterns.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. An airplane maneuvering control method based on layered reinforcement learning, used for the maneuver control of two teams of aircraft agents during a game process, characterized by comprising the following steps:
step S1, obtaining an action embedding vector of the agent computed by a neural network;
step S2, outputting a horizontal-angle probability list, a vertical-angle probability list and a shooting probability list from the action embedding vector, wherein the horizontal-angle probability list comprises a plurality of probability values corresponding to a plurality of preset horizontal control commands, the vertical-angle probability list comprises a plurality of probability values corresponding to a plurality of preset vertical control commands, and the shooting probability list comprises two probability values corresponding to shooting and not shooting; and
step S3, sampling from the probability lists, determining the agent's horizontal control mode, vertical control mode and whether to shoot, and controlling the agent accordingly.
2. The airplane maneuvering control method based on layered reinforcement learning according to claim 1, wherein step S1 further comprises:
step S11, acquiring the overall air combat state and dividing it into the absolute state quantity of the current agent, which characterizes the agent's own attributes, the relative state quantities between the current agent and other agents, the missile state quantities of teammate agents, and the missile state quantities of opponent agents;
step S12, determining a global embedding vector from the overall air combat state, a relative observation embedding vector from the relative state quantities, a friendly missile embedding vector from the missile state quantities of the teammate agents, and an enemy missile embedding vector from the missile state quantities of the opponent agents; and
step S13, forming the action embedding vector by concatenating the global embedding vector, the relative observation embedding vector, the friendly missile embedding vector and the enemy missile embedding vector.
3. The airplane maneuvering control method based on layered reinforcement learning according to claim 2, wherein in step S11 the absolute state quantities of the current agent include true airspeed, current altitude, climb rate, three-axis attitude angles, normal overload, fire-control radar lock signal, electronic warning device warning state, and number of remaining air-to-air missiles; the relative state quantities include relative distance, closing rate, relative altitude difference, target entry angle, own beam angle and attack zone information; the missile state quantities include missile speed, current altitude, missile-target distance, missile-target closing rate, remaining hit time, and the entry angle and beam angle between missile and target; and the missile state quantities of the opponent agents include the missile state quantities that threaten the current agent.
4. The airplane maneuvering control method based on layered reinforcement learning according to claim 1, wherein the preset horizontal control commands in step S2 are six, namely a hold command that holds the current heading, an attack command that points at the target, an attack command offset by ±30°, an attack command offset by ±50°, a defense command offset by ±90°, and a defense command offset by ±180°.
5. The airplane maneuvering control method based on layered reinforcement learning according to claim 1, wherein the preset vertical control commands in step S2 are six, namely a hold command that holds the current heading, an attack command that points at the target, a +30° climb attack command, a +60° climb attack command, a -30° dive defense command, and a -60° dive defense command.
6. The airplane maneuvering control method based on layered reinforcement learning according to claim 1, further comprising, after step S3:
step S4, determining a command speed probability list and a command overload probability list from the sampling results corresponding to the horizontal control mode and the vertical control mode together with the action embedding vector, and sampling from the command speed probability list and the command overload probability list respectively to obtain the command speed and the command overload of the agent.
7. The airplane maneuvering control method based on layered reinforcement learning according to claim 1, wherein the command speed probability list includes a plurality of probabilities corresponding to a plurality of speed values, the command overload probability list includes a plurality of probabilities corresponding to a plurality of overload values, and the speed values and overload values are obtained by discretizing the command speed and the command overload.
CN202110904677.8A 2021-08-07 2021-08-07 Airplane maneuvering control method based on layered reinforcement learning Pending CN114035602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110904677.8A CN114035602A (en) 2021-08-07 2021-08-07 Airplane maneuvering control method based on layered reinforcement learning


Publications (1)

Publication Number Publication Date
CN114035602A true CN114035602A (en) 2022-02-11

Family

ID=80139840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110904677.8A Pending CN114035602A (en) 2021-08-07 2021-08-07 Airplane maneuvering control method based on layered reinforcement learning

Country Status (1)

Country Link
CN (1) CN114035602A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004051485A1 (en) * 2002-12-05 2004-06-17 Nir Padan Dynamic guidance for close-in maneuvering air combat
CN111027143A (en) * 2019-12-18 2020-04-17 四川大学 Shipboard aircraft approach guiding method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAIYIN PIAO et al.: "Beyond-Visual-Range Air Combat Tactics Auto-Generation by Reinforcement Learning", International Joint Conference on Neural Networks (IJCNN), pages 1-8 *
SUN Chu et al.: "Autonomous maneuver decision-making method for UAV based on reinforcement learning" (基于强化学习的无人机自主机动决策方法), Fire Control & Command Control (火力与指挥控制), vol. 44, no. 4, pages 142-149 *

Similar Documents

Publication Publication Date Title
CN113536528B (en) Early warning aircraft tactical behavior simulation method and system under non-convoy condition
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
Hu et al. Application of deep reinforcement learning in maneuver planning of beyond-visual-range air combat
Li et al. Deep reinforcement learning with application to air confrontation intelligent decision-making of manned/unmanned aerial vehicle cooperative system
CN109063819B (en) Bayesian network-based task community identification method
CN113435598B (en) Knowledge-driven intelligent strategy deduction decision method
CN114460959A (en) Unmanned aerial vehicle group cooperative autonomous decision-making method and device based on multi-body game
CN115951709A (en) Multi-unmanned aerial vehicle air combat strategy generation method based on TD3
Santoso et al. State-of-the-art integrated guidance and control systems in unmanned vehicles: A review
CN115903865A (en) Aircraft near-distance air combat maneuver decision implementation method
Bae et al. Deep reinforcement learning-based air-to-air combat maneuver generation in a realistic environment
CN115993835A (en) Target maneuver intention prediction-based short-distance air combat maneuver decision method and system
Qiu et al. One-to-one air-combat maneuver strategy based on improved TD3 algorithm
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
Xu et al. Autonomous decision-making for dogfights based on a tactical pursuit point approach
Dahlbom et al. Detection of hostile aircraft behaviors using dynamic bayesian networks
Duan et al. Autonomous maneuver decision for unmanned aerial vehicle via improved pigeon-inspired optimization
CN114035602A (en) Airplane maneuvering control method based on layered reinforcement learning
CN115859778A (en) Air combat maneuver decision method based on DCL-GWOO algorithm
Meng et al. One-to-one close air combat maneuver decision method based on target maneuver intention prediction
Wang et al. Over-the-Horizon Air Combat Environment Modeling and Deep Reinforcement Learning Application
Zhang et al. Intelligent Close Air Combat Design based on MA-POCA Algorithm
Scukins et al. Monte carlo tree search and convex optimization for decision support in beyond-visual-range air combat
Stilman et al. Adapting the linguistic geometry—abstract board games approach to air operations
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination