CN113221444A - Behavior simulation training method for air intelligent game

Behavior simulation training method for air intelligent game

Info

Publication number
CN113221444A
Authority
CN
China
Prior art keywords
reward
intelligent
action
agent
training
Prior art date
Legal status
Granted
Application number
CN202110425153.0A
Other languages
Chinese (zh)
Other versions
CN113221444B (en)
Inventor
包骐豪
朱燎原
夏少杰
瞿崇晓
Current Assignee
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN202110425153.0A priority Critical patent/CN113221444B/en
Publication of CN113221444A publication Critical patent/CN113221444A/en
Application granted granted Critical
Publication of CN113221444B publication Critical patent/CN113221444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a behavior simulation training method for air intelligent gaming, which comprises the following steps: S1, constructing an intelligent agent game decision model; S2, determining the environment state and the action space, and shaping a continuous non-sparse reward function for each action; S3, conducting air game matches in the model and executing the following steps: S31, generating the next environment state according to the executed action and obtaining the corresponding reward, iterating cyclically in turn to realize the maximum accumulated reward; S32, realizing reverse reinforcement learning based on expert behaviors and obtaining a target reward function; S33, calculating the similarity between each intelligent agent's behavior and the expert behavior; S34, obtaining a comprehensive reward; and S4, training the intelligent agent game decision model. The method improves the traditionally inefficient reward-function design process and the random exploration process during model training, so that the reward function is interpretable and open to human intervention, the decision level and convergence speed of the intelligent agent are improved, and the cold-start problem of model training is solved.

Description

Behavior simulation training method for air intelligent game
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a behavior simulation training method for an air intelligent game.
Background
Future air gaming needs to acquire the most accurate information from various detection systems at any time and place to realize an information advantage, and, more importantly, to realize a decision advantage by exploiting technologies such as machine learning, artificial intelligence and cloud computing. To better mine information for decision advantage, exert gaming effectiveness and ensure air superiority, an aerial auxiliary decision support system matched with the pilot is needed in addition to the pilot's excellent flying skills and the commander's good command ability. As an artificial intelligence assistant system, the aerial auxiliary decision support system can provide reference decision schemes in highly dynamic and complex confrontation environments, reduce the pilot's decision burden, better mine information to realize decision advantages, exert gaming effectiveness and ensure air superiority.
However, existing aerial auxiliary decision support systems are relatively backward: the number of sensing parameters or targets that can be handled simultaneously is limited, and the robustness, timeliness and accuracy of decision support are poor. Moreover, in game decision-making, the high decision dimensionality makes the training model difficult to converge, so training a practical intelligent agent takes a long time or may fail to produce an effective decision-making agent at all. In aerial intelligent self-play confrontation, sparse rewards and a complex, inefficient reward design process keep the intelligent agent's decision level low and the training time long; meanwhile, rewards must be manually customized for each scenario, which incurs high labor cost, poor reusability and a cold-start problem in algorithm training.
Disclosure of Invention
The invention aims to provide a behavior simulation training method oriented to air intelligent gaming, which improves the traditionally inefficient reward-function design process and the random exploration process during model training, so that the reward function is clearly and explicitly designed and interpretable and the behavior control of the intelligent agent can be manually intervened, rapidly improves the decision level and convergence speed of the intelligent agent, gives the intelligent agent the ability to imitate complex expert behaviors, reduces labor cost, and solves the cold-start problem of model training.
To achieve this purpose, the invention adopts the following technical scheme:
The invention provides a behavior simulation training method oriented to air intelligent gaming, which comprises the following steps:
S1, constructing an air confrontation simulation environment and red and blue intelligent agents, constructing an intelligent agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each intelligent agent;
S2, determining the environment state and the action space of each intelligent agent, wherein the action space comprises at least one action a_t, and shaping a continuous non-sparse reward function R_t for each action a_t;
S3, carrying out air game matches in the intelligent agent game decision model, wherein the match duration is T, and each intelligent agent, according to the current environment state S_t, t = 0, 1, 2, ..., T, performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward function R_t, and this is iterated cyclically in turn to realize the maximum accumulated reward R_b;
S32, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning, and obtaining a target reward function R*(s) = w*·φ(s), wherein w* is a reward function weight vector with h components, w* ∈ R^h, ||w*|| ≤ 1, and φ(s) is a situation information vector, φ: S → [0,1]^h;
S33, calculating the similarity R_ε between each intelligent agent's behavior and the expert behavior;
S34, obtaining the comprehensive reward r_t, where the comprehensive reward r_t is given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
wherein w_b is the weight coefficient of the maximum accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε;
S4, training the intelligent agent game decision model and judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise continuing the cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold.
Preferably, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
Preferably, the action space comprises at least one action a_t among target pursuit, target avoidance, tangentially escaping enemy locking, cross attack and serpentine maneuver.
Preferably, the continuous non-sparse reward function R_t of each action a_t is shaped as follows:
1) the reward function R_t1 for target pursuit satisfies the following condition:
R_t1: piecewise expression (equation image in the original) combining Δθ_c, a′, A and B
2) the reward function R_t2 for target avoidance satisfies the following condition:
R_t2 = Δθ_f + a′ + A + B
3) the reward function R_t3 for tangentially escaping enemy locking satisfies the following condition:
R_t3 = -Δθ_t + a′ + A + B
4) the reward function R_t4 for cross attack satisfies the following condition:
R_t4: piecewise expression (equation image in the original) switching on the sign of x_a - x and combining Δθ_c or Δθ_j, a′, A and B
5) the reward function R_t5 for serpentine maneuver satisfies the following condition:
R_t5: piecewise expression (equation image in the original) alternating with period T_1 and combining Δθ_s, the radial velocity component, A and B
wherein Δθ_c = |θ - δ| - ε, Δθ_f = |θ - δ|, Δθ_t = |θ - δ| - 90°, Δθ_j = |θ - δ′|, Δθ_s = |θ - δ| - (90° - σ); θ is the absolute orientation angle of the red agent, θ_e is the absolute orientation angle of the blue agent closest to the red agent, x is the distance between the red agent and the closest blue agent, x_a is the distance between the red agent's closest friendly aircraft and the closest blue agent, δ is the angle of the line connecting the red agent and the closest blue agent, δ′ is the angle of the line connecting the red agent and its closest friendly aircraft, ε is the deviation angle between the red agent and the closest blue agent, a′ is the acceleration of the red agent, A is the difference between the number of blue agents lost and the number of red agents lost, B is the number of games won by the red agent, v is the speed of the red agent, v_a is the speed of the red agent's closest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter, σ ∈ [0°, 90°], and T_1 is the period of maneuvering-direction change in the serpentine maneuver, 0 < T_1 ≤ T.
Preferably, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0, calculating the feature expectation μ(π_0) corresponding to the initial action strategy π_0, and setting i = 0, where 0 ≤ i ≤ m and m is the number of game rounds;
S323, calculating the feature error between the current action strategy and the expert action strategy by a min-max algorithm,
e_i = max_{||w|| ≤ 1} min_{j ∈ {0, ..., i-1}} w^T (μ_E - μ_j),
obtaining the current reward function weight vector w_i as the w attaining this maximum, and updating the reward function weight vector w* to (w_i)^T, wherein μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the action strategy π_j obtained so far, and T denotes transposition;
S324, judging whether the feature error e_i is smaller than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, calculating the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation μ(π_i), setting i = i + 1, and returning to step S323 to iterate.
Preferably, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = E[ Σ_{t=0}^{T} γ^t φ(s_t) | π_i ]
where γ is the discount factor and φ(s_t)|π_i is the situation information vector of the battlefield situation at time t under the action strategy π_i.
Preferably, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning further comprises the following steps:
generating the battlefield situation set {s_t^(i)}, t = 0, 1, ..., T, from the action strategy corresponding to each game round, and calculating the global feature expectation μ_E over the m rounds of air games by the following formula:
μ_E = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t φ(s_t^(i))
wherein s_t^(i) is the environment state at time t of the i-th round, and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th round.
Preferably, in step S33, the similarity R_ε between each intelligent agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, wherein μ(S_t|θ^μ) denotes the action generated by the strategy μ under the neural network parameters θ^μ and the environment state S_t;
S332, generating the action a_e to be imitated by the expert behavior strategy according to the current environment state S_t;
S333, calculating the similarity R_ε between the intelligent agent behavior and the expert behavior by the Kullback-Leibler divergence method, with the following formulas:
for a discrete action a_t:
R_ε = -Σ_{x ∈ X} a_t(x)·log( a_t(x) / a_e(x) )
wherein a_t(x) denotes the probability that the intelligent agent performs behavior x, a_e(x) denotes the probability that the expert performs behavior x, and X is the behavior set;
for a continuous action a_t:
R_ε = -∫_X a_t(x)·log( a_t(x) / a_e(x) ) dx
wherein a_t(x) denotes the probability density of the intelligent agent performing behavior x, a_e(x) denotes the probability density of the expert performing behavior x, and X is the behavior set.
Preferably, in step S4, training the intelligent agent game decision model comprises the following steps:
S41, initializing the online network parameters θ^μ of the actor network and the online network parameters θ^Q of the critic network in the intelligent agent game decision model;
S42, copying the online network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ and θ^{Q′} ← θ^Q;
S43, initializing the replay buffer R;
S44, for the L-th game round (L = 1, 2, ..., n), executing the following steps:
S441, initializing the random process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, ..., T, executing the following steps:
a) each intelligent agent executes its corresponding action, and the environment state transitions, giving the next environment state S_{t+1} and the comprehensive reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay buffer R to be used as the training data set;
c) randomly sampling U arrays (s_L, a_L, r_L, s_{L+1}) from the replay buffer R as mini-batch training data, with the label
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′})
wherein μ′(s_{L+1}|θ^{μ′}) denotes the action generated by the target strategy μ′ under the parameters θ^{μ′} and the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) denotes the value of the target Q function under the parameters θ^{Q′}, the environment state s_{L+1} and that action;
d) minimizing the loss function of the critic network, the loss function being
Loss = (1/U)·Σ_L ( y_L - Q(s_L, a_L|θ^Q) )²
e) updating the actor network by the policy gradient:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
wherein Q(s, a|θ^Q) denotes the value of the Q function under the parameters θ^Q, the environment state s and the executed action a, and ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a; μ(s|θ^μ) denotes the action generated by the strategy μ under the parameters θ^μ and the environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) updating the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network:
θ^{Q′} ← τ·θ^Q + (1 - τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1 - τ)·θ^{μ′}
wherein τ is an adjustable coefficient, τ ∈ [0,1];
g) judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise returning to step S44 for cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold.
Preferably, the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by a gradient descent method.
Compared with the prior art, the invention has the beneficial effects that:
1) Key elements of the air scenario of the intelligent agent game decision model are synthesized as reward factors and continuous non-sparse reward functions are shaped, so that the reward functions are clearly and explicitly designed and interpretable, and the behavior control of the intelligent agent can be manually intervened;
2) the reverse reinforcement learning algorithm based on expert behaviors brings the intelligent agent's combat effect close to that of the expert, and the solution of the target reward function is automated while remaining interpretable;
3) the decision simulation method based on expert behaviors can rapidly improve the decision level of the intelligent agent, solve the problem of difficult, slow or even impossible convergence, and fit linear and nonlinear reward functions so that the intelligent agent can imitate complex expert behaviors, thereby reducing labor cost;
4) continuous non-sparse shaped rewards, reverse reinforcement learning rewards and behavior-imitation similarity rewards are considered simultaneously, their weights can be adjusted according to user requirements, and the intelligent decision level of the intelligent agent is improved;
5) the behavior simulation method solves the cold-start problem of training, improves the traditionally inefficient reward-function design process and the random exploration process during initial training, optimizes decision-making capability, and quickly yields a decision-making intelligent agent with a high intelligence level and practical value.
Drawings
FIG. 1 is a schematic diagram of the cyclic interaction between an intelligent agent and the air confrontation simulation environment according to the present invention;
FIG. 2 is a block diagram of a method for reverse reinforcement learning of expert behavior according to the present invention;
FIG. 3 is a flow chart of the expert intelligence based behavioral decision simulation method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-3, a behavior simulation training method for air intelligent gaming comprises the following steps:
S1, constructing an air confrontation simulation environment and red and blue intelligent agents, constructing an intelligent agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each intelligent agent.
In one embodiment, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
The intelligent agent game decision model is constructed by the DDPG (Deep Deterministic Policy Gradient) deep reinforcement learning algorithm. The confrontation training of the intelligent agent game decision model links the constructed air confrontation simulation environment with the intelligent agents, here red and blue airplanes, so that the intelligent agents and the air confrontation simulation environment interact with and influence each other, as shown in FIG. 1. It should be noted that the intelligent agent can also be changed according to the application scenario, for example to a missile, and the intelligent agent game decision model can also be constructed based on the DQN algorithm or another existing deep reinforcement learning algorithm.
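As an aid to understanding, the following minimal Python sketch illustrates the cyclic agent-environment interaction of step S1. It is not the patent's implementation: the environment is a trivial stand-in with a Gym-style reset/step interface, and all class and method names (AirCombatEnv, Agent.act, Agent.observe) are illustrative assumptions.

```python
# Minimal sketch of the agent-environment interaction loop of step S1 (placeholder simulator).
import random

class AirCombatEnv:
    """Stand-in air confrontation environment with a Gym-style interface."""
    def reset(self):
        return {"red_pos": 0.0, "blue_pos": 10.0}            # toy initial state

    def step(self, actions):
        next_state = {"red_pos": actions["red"], "blue_pos": 10.0 - actions["blue"]}
        gap = abs(next_state["red_pos"] - next_state["blue_pos"])
        rewards = {"red": -gap, "blue": gap}                  # toy rewards
        return next_state, rewards, gap < 1.0                 # done when the gap closes

class Agent:
    def act(self, state):
        return random.uniform(0.0, 10.0)                      # placeholder policy

    def observe(self, state, action, reward, next_state):
        pass                                                  # store the transition for training

def run_match(env, agents, max_steps=100):
    """One game match: each agent acts, the environment transitions, rewards are fed back."""
    state = env.reset()
    for _ in range(max_steps):
        actions = {name: agent.act(state) for name, agent in agents.items()}
        next_state, rewards, done = env.step(actions)
        for name, agent in agents.items():
            agent.observe(state, actions[name], rewards[name], next_state)
        state = next_state
        if done:
            break

run_match(AirCombatEnv(), {"red": Agent(), "blue": Agent()})
```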
S2, determining the environment state and the action space of each intelligent agent, wherein the action space comprises at least one action a_t, and shaping a continuous non-sparse reward function R_t for each action a_t.
In one embodiment, the action space comprises at least one action a_t among target pursuit, target avoidance, tangentially escaping enemy locking, cross attack and serpentine maneuver. The number and types of the optional actions a_t in the action space can be adjusted according to actual needs, and other actions known in the prior art can also be adopted.
In one embodiment, the continuous non-sparse reward function R_t of each action a_t is shaped as follows:
1) the reward function R_t1 for target pursuit satisfies the following condition:
R_t1: piecewise expression (equation image in the original) combining Δθ_c, a′, A and B
2) the reward function R_t2 for target avoidance satisfies the following condition:
R_t2 = Δθ_f + a′ + A + B
3) the reward function R_t3 for tangentially escaping enemy locking satisfies the following condition:
R_t3 = -Δθ_t + a′ + A + B
4) the reward function R_t4 for cross attack satisfies the following condition:
R_t4: piecewise expression (equation image in the original) switching on the sign of x_a - x and combining Δθ_c or Δθ_j, a′, A and B
5) the reward function R_t5 for serpentine maneuver satisfies the following condition:
R_t5: piecewise expression (equation image in the original) alternating with period T_1 and combining Δθ_s, the radial velocity component, A and B
wherein Δθ_c = |θ - δ| - ε, Δθ_f = |θ - δ|, Δθ_t = |θ - δ| - 90°, Δθ_j = |θ - δ′|, Δθ_s = |θ - δ| - (90° - σ); θ is the absolute orientation angle of the red agent, θ_e is the absolute orientation angle of the blue agent closest to the red agent, x is the distance between the red agent and the closest blue agent, x_a is the distance between the red agent's closest friendly aircraft and the closest blue agent, δ is the angle of the line connecting the red agent and the closest blue agent, δ′ is the angle of the line connecting the red agent and its closest friendly aircraft, ε is the deviation angle between the red agent and the closest blue agent, a′ is the acceleration of the red agent, A is the difference between the number of blue agents lost and the number of red agents lost, B is the number of games won by the red agent, v is the speed of the red agent, v_a is the speed of the red agent's closest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter, σ ∈ [0°, 90°], and T_1 is the period of maneuvering-direction change in the serpentine maneuver, 0 < T_1 ≤ T.
Specifically, a north-east coordinate system is adopted: θ is the angle between the nose direction of the red agent and due north, θ_e is the angle between the nose direction of the blue agent closest to the red agent and due north, δ is the angle between due north and the line connecting the red agent and the closest blue agent, δ′ is the angle between due north and the line connecting the red agent and its closest friendly aircraft, and ε is the angle between the nose direction of the red agent and the line connecting the red agent and the closest blue agent. The spatial motion of an air-combat agent has six degrees of freedom; when shaping the reward functions, the application considers the yaw angle and the acceleration of the intelligent agent. The reward-function shaping is oriented to the actions of target pursuit, target avoidance, tangentially escaping enemy locking, cross attack and serpentine maneuver, with the red agent taken as the local aircraft and the blue agent as the enemy aircraft (enemy for short), wherein:
Target pursuit: the local aircraft takes the closest enemy aircraft as relative reference, and the difference Δθ_c = |θ - δ| - ε between the local absolute orientation angle θ and the line angle δ to the closest enemy aircraft is one of the reward factors. When pursuing a target, keeping the enemy aircraft within the local radar lock range helps to quickly take the tail position and break away while compressing the enemy missile's envelope, so the smaller Δθ_c is, the larger the reward. The local acceleration a′ is another reward factor: when the local absolute orientation angle θ and the closest enemy orientation angle θ_e point in the same direction, i.e. |θ - θ_e| ≤ 180°, a′ is a positive reward, and when |θ - θ_e| > 180°, a′ is a negative reward. The relative hit count A and the local win/loss outcome B are also considered: A is the difference between the number of blue agents lost and the number of red agents lost, giving a positive reward when A > 0, no reward when A = 0, and a negative reward when A < 0; B gives a positive reward for a win, no reward for a draw, and a negative reward for a loss. Combining these factors, the target pursuit reward function R_t1 is as follows:
R_t1: piecewise expression (equation image in the original) combining Δθ_c, a′, A and B
Target avoidance: the local aircraft takes the closest enemy aircraft attacking it as relative reference, and the difference Δθ_f = |θ - δ| between the local absolute orientation angle θ and the line angle δ to that enemy aircraft is one of the reward factors. When avoiding the target, in order to compress the envelope of the enemy missile within the enemy's effective range, the smaller Δθ_f is, the larger the reward. The local acceleration a′ is another reward factor; because the distance to the enemy must be opened quickly while evading, a′ is always a positive reward. The relative hit count A and the local win/loss outcome B are also considered: A is the difference between the number of blue agents lost and the number of red agents lost, giving a positive reward when A > 0, no reward when A = 0, and a negative reward when A < 0; B gives a positive reward for a win, no reward for a draw, and a negative reward for a loss. Combining these factors, the target avoidance reward function R_t2 is as follows:
R_t2 = Δθ_f + a′ + A + B
Tangentially escaping enemy locking: the local aircraft takes the enemy aircraft that most recently locked it as relative reference, and the difference Δθ_t = |θ - δ| - 90° between the local absolute orientation angle θ and the line angle δ to that enemy aircraft is one of the reward factors. Since this action escapes enemy radar lock by exploiting the Doppler effect, the optimum angle is 90°, so the closer the difference angle |θ - δ| is to 90°, the better, and the closer Δθ_t is to 0, the larger the reward. The local acceleration a′ is another reward factor; because the distance to the enemy must be opened quickly while escaping, a′ is always a positive reward. The relative hit count A and the local win/loss outcome B are considered in the same way as above. Combining these factors, the reward function R_t3 for tangentially escaping enemy locking is as follows:
R_t3 = -Δθ_t + a′ + A + B
cross attack: the machine gives different rewards according to the position of the nearest friend machine (namely the nearest red party intelligent agent to the machine) by taking the nearest enemy machine as a reference.
If xaX is more than or equal to 0, the machine is positioned in the front of the nearest friend machine to execute attack tactics, x is the distance between the red square intelligent agent and the nearest blue square intelligent agent, namely the distance between the machine and the nearest enemy plane, xaNearest friend machine and nearest blue square for red square intelligent agentThe distance between the intelligent bodies, namely the distance between the nearest friend plane and the nearest friend plane, takes the deviation angle epsilon of the local machine relative to the nearest friend plane into consideration, and the difference delta theta between the absolute orientation angle theta of the local machine and the line angle delta between the local machine and the nearest friend plane is calculatedcOne of the reward factors is | θ - δ | - ε. When the enemy missile is hit by a target, the enemy missile is kept in the self radar locking range of the self-body, so that the enemy missile is beneficial to being quickly arranged at the tail and kept away, and meanwhile, the envelope of the enemy missile is compressed to delta thetacThe smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle thetaeIn the same direction, i.e. | theta-thetaeWhen | < 180 °, a' is positive reward, | theta-thetaeIf > 180 deg., a' is a negative reward. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A is 0, the reward is not provided, A is smaller than 0, the negative reward is provided, meanwhile, the local win-lose condition B wins as the positive reward, the tactic tie is not provided, and the tactic failure is the negative reward.
If xaAnd x is less than 0, the machine is positioned behind the nearest friend machine to execute the tactics of following the previous machine and opening the interference. The difference delta theta between the absolute orientation angle theta of the machine and the connection angle delta' between the machine and the nearest friend machinejOne of the reward factors is | θ - δ' |. Because when following the nearest friend machine, the nearest friend machine is considered to be kept in the radar locking range of the machine, so that the distance between the nearest friend machine and the team member is kept to play a role of concealing and interfering the enemy, delta thetajThe smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle thetaeIn the same direction, i.e. | theta-thetaeWhen | < 180 °, a' is positive reward, | theta-thetaeI > 180 deg., a ' is a negative reward, while a ' is proportional to the speed difference with the nearest friend, i.e. a ' ═ a (v-v)a) Where v is the local velocity, vaα is a weight coefficient for the nearest friend speed. The relative hit number A and the local win-loss condition B are one of the considered factors, the relative hit number A is the difference between the lost number of the blue intelligent agent and the lost number of the red intelligent agent, when the relative hit number A is more than 0, the reward is positive, A is 0, the reward is not present, A is less than 0, the reward is negative,meanwhile, the winning of the victory or defeat situation B of the machine is positive reward, the tie of the war is no reward, and the failure of the war is negative reward. Synthesizing a cross attack reward function R according to the factorst4The following were used:
Figure BDA0003029048020000111
s-shaped maneuvering: the machine takes the nearest enemy plane as relative reference, every T1The maneuvering direction is changed in a time period of 0 < T1≤T,2kT1≤t≤(2k+1)T1,
Figure BDA0003029048020000112
The difference delta theta between the absolute orientation angle theta of the plane itself and the line angle delta between the plane itself and the nearest enemy planesθ - δ - (90 ° - σ) is one of the reward factors. Sigma is an adjustable angle parameter, and sigma belongs to [0 DEG, 90 DEG ]]The radial direction is indicated by an arrow with the local machine as a starting point and the nearest enemy as an end point, and the radial direction is a velocity component in the radial direction. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A is 0, the reward is not provided, A is smaller than 0, the negative reward is provided, meanwhile, the local win-lose condition B wins as the positive reward, the tactic tie is not provided, and the tactic failure is the negative reward. Synthesizing a snake-shaped maneuvering reward function R according to the factorst5The following were used:
Figure BDA0003029048020000121
Because the intelligent agent's actions are continuous, such as the aircraft's turning angle and acceleration, the reward function serves as the evaluation of the behavior, and a continuous non-sparse reward function can be placed in one-to-one correspondence with the behaviors so as to give the agent feedback at every time point of the real-time confrontation. The shaping of the continuous non-sparse reward functions is clear, explicit and interpretable, and the behavior control of the intelligent agent can be adjusted by human intervention.
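For illustration, the two shaped rewards whose expressions are fully given above (target avoidance R_t2 and tangential escape R_t3) can be coded directly. The sketch below is a hedged example, not taken from the patent; angles are assumed to be in degrees and A, B are supplied by the caller.

```python
def reward_target_avoidance(theta, delta, a_prime, A, B):
    """R_t2 = d_theta_f + a' + A + B, with d_theta_f = |theta - delta| (degrees)."""
    d_theta_f = abs(theta - delta)
    return d_theta_f + a_prime + A + B

def reward_tangential_escape(theta, delta, a_prime, A, B):
    """R_t3 = -d_theta_t + a' + A + B, with d_theta_t = |theta - delta| - 90 degrees."""
    d_theta_t = abs(theta - delta) - 90.0
    return -d_theta_t + a_prime + A + B

# Example: own heading 30 deg, line angle to the locking enemy 100 deg, accelerating,
# one more blue loss than red loss, no decided outcome yet.
print(reward_tangential_escape(theta=30.0, delta=100.0, a_prime=2.0, A=1.0, B=0.0))
```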
S3, carrying out air game matches in the intelligent agent game decision model, wherein the match duration is T, and each intelligent agent, according to the current environment state S_t, t = 0, 1, 2, ..., T, performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward function R_t, and this is iterated cyclically in turn to realize the maximum accumulated reward R_b.
S32, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning, and obtaining the target reward function R*(s) = w*·φ(s), wherein w* is a reward function weight vector with h components, w* ∈ R^h, ||w*|| ≤ 1, and φ(s) is a situation information vector, φ: S → [0,1]^h.
Each intelligent agent is controlled based on expert behaviors to play the air game, realizing reverse reinforcement learning and automatically solving the target reward function. A series of air decision feature elements is defined, with φ: S → [0,1]^h; for example, the feature elements include the agents' basic spatial coordinates and motion indices (speed and turning angle), the states of the agent's fire-control system, radar and jamming pod, and so on. These feature indices are related to the reward, and to facilitate fast optimization this relation is expressed as a linear combination, giving the target reward function R*(s) = w*·φ(s), wherein w* is a reward function weight vector with h components, w* ∈ R^h, ||w*|| ≤ 1, and φ(s) is the situation information vector, so that the importance of each feature quantity is represented in a normalized space. When the expert behaviors are used to control the agents in the air game, a series of battlefield situations is generated over time until the game ends; the set of battlefield situations corresponding to the i-th game round is {s_t^(i)}, t = 0, 1, ..., T.
Assuming that there are p red airplanes and q blue airplanes, the collected situation information vector φ(s) may selectively include the following elements (a small illustrative sketch of assembling such a vector is given after this list):
a) the relative distances between the airplanes;
b) the relative angles between the airplanes;
c) the p + q absolute heading angles of the airplanes;
d) the p + q longitude and latitude coordinates of the airplanes;
e) pq missile threat points, representing the threat posed by a missile in flight to an airplane (proportional to the flight time and inversely proportional to the distance to the target airplane);
f) the p + q speeds of the airplanes;
g) the p + q relative positions of the airplanes;
h) whether the local aircraft is the current attacking aircraft;
i) the missile states of the q enemy airplanes at the previous moment;
j) the missile states of our p airplanes at the previous moment.
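As referenced above, the following is a small illustrative sketch of assembling such a situation information vector φ(s) and normalizing it into [0,1]^h, as required by φ: S → [0,1]^h. The particular feature names and normalization bounds are assumptions for the example, not values from the patent.

```python
import numpy as np

# Assumed normalization bounds for a few illustrative features (not from the patent).
FEATURE_BOUNDS = {
    "relative_distance": (0.0, 100_000.0),   # metres
    "relative_angle":    (0.0, 360.0),       # degrees
    "heading":           (0.0, 360.0),
    "speed":             (0.0, 600.0),       # m/s
    "missile_threat":    (0.0, 1.0),
    "is_attacker":       (0.0, 1.0),
}

def situation_vector(raw_features):
    """Map raw situation features to phi(s) in [0,1]^h by min-max normalization."""
    phi = []
    for name, value in raw_features.items():
        lo, hi = FEATURE_BOUNDS[name]
        phi.append(np.clip((value - lo) / (hi - lo), 0.0, 1.0))
    return np.array(phi)

phi_s = situation_vector({"relative_distance": 25_000.0, "relative_angle": 135.0,
                          "heading": 90.0, "speed": 300.0,
                          "missile_threat": 0.4, "is_attacker": 1.0})
```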
In one embodiment, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0, calculating the feature expectation μ(π_0) corresponding to the initial action strategy π_0, and setting i = 0, where 0 ≤ i ≤ m and m is the number of game rounds.
In one embodiment, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = E[ Σ_{t=0}^{T} γ^t φ(s_t) | π_i ]
where γ is the discount factor and φ(s_t)|π_i is the situation information vector of the battlefield situation at time t under the action strategy π_i.
In one embodiment, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning further comprises the following steps:
generating the battlefield situation set {s_t^(i)}, t = 0, 1, ..., T, from the action strategy corresponding to each game round, and calculating the global feature expectation μ_E over the m rounds of air games by the following formula:
μ_E = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t φ(s_t^(i))
wherein s_t^(i) is the environment state at time t of the i-th round, and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th round.
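A minimal sketch of the feature-expectation computation just described, under the assumption that each recorded game round is stored as an array of situation vectors φ(s_t) for t = 0..T; the helper names are illustrative, not the patent's.

```python
import numpy as np

def feature_expectation(phi_trajectory, gamma):
    """Discounted feature expectation of one game round: sum_t gamma^t * phi(s_t).
    phi_trajectory has shape (T+1, h), one situation vector per time step."""
    phis = np.asarray(phi_trajectory, dtype=float)
    discounts = gamma ** np.arange(phis.shape[0])
    return (discounts[:, None] * phis).sum(axis=0)

def global_feature_expectation(phi_trajectories, gamma):
    """Average of the per-round feature expectations over the m recorded rounds."""
    return np.mean([feature_expectation(traj, gamma) for traj in phi_trajectories], axis=0)

# Example with m = 2 short rounds and h = 3 features.
rounds = [np.random.rand(5, 3), np.random.rand(5, 3)]
mu_E = global_feature_expectation(rounds, gamma=0.9)
```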
S323, calculating characteristic errors of the current action strategy and the expert action strategy by adopting a minimum maximum algorithm
Figure BDA0003029048020000143
And obtain the current reward function weight vector
Figure BDA0003029048020000144
Updating the reward function weight vector w*Is (w)i)TWherein, muECharacteristic expectation of expert behavior strategy; mu.sjFor current action policyThe characteristic is expected, T is transposition;
s324, judging the characteristic error eiWhether the error is less than the error threshold value or not, if so, outputting the target reward function R*(s)=(wi)TPhi(s), otherwise, calculating the current optimal action strategy pi based on the deep reinforcement learning algorithmiAnd the update characteristic is expected to be mu (pi)i) If i is set to i +1, the process returns to step S323 to iterate. The error e is set in the present embodiment for the sake of accuracy and time consumption of calculationiIs 10-5If several iterations are performed, ei<10-5Then the error e is determinediLess than the error threshold.
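The following sketch illustrates the iteration of steps S321-S324. Instead of solving the min-max (max-margin) problem exactly, it uses the closely related projection variant of apprenticeship learning (Abbeel and Ng), which is easier to show compactly; the reinforcement-learning step is abstracted as a user-supplied callback. All names are illustrative, and this is not the patent's code.

```python
import numpy as np

def apprenticeship_irl(mu_expert, mu_initial, rl_solver, feature_expectation_of,
                       eps=1e-5, max_iters=50):
    """Projection-variant apprenticeship learning.
    rl_solver(w) must return a policy that is (approximately) optimal for reward w . phi(s);
    feature_expectation_of(policy) must return its discounted feature expectation mu(pi)."""
    mu_bar = np.asarray(mu_initial, dtype=float)       # projection point built from past policies
    for _ in range(max_iters):
        w = np.asarray(mu_expert) - mu_bar             # current reward-weight direction
        error = np.linalg.norm(w)                      # feature error e_i
        if error < eps:
            break
        policy = rl_solver(w)                          # RL step with reward w . phi(s)
        mu_i = feature_expectation_of(policy)
        d = mu_i - mu_bar                              # project mu_expert onto the new segment
        mu_bar = mu_bar + (d @ (np.asarray(mu_expert) - mu_bar)) / (d @ d) * d
    return w / max(np.linalg.norm(w), 1e-12), error    # normalized weight vector, final error
```

In use, rl_solver would wrap the deep reinforcement learning training of step S324 and feature_expectation_of would roll out the learned policy in the simulation to estimate μ(π_i).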
The reverse reinforcement learning algorithm based on expert behaviors brings the intelligent agent's combat effect close to that of the expert, and the solution of the target reward function is automated while remaining interpretable.
S33, calculating the similarity R_ε between each intelligent agent's behavior and the expert behavior.
In one embodiment, in step S33, the similarity R_ε between each intelligent agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, wherein μ(S_t|θ^μ) denotes the action generated by the strategy μ under the neural network parameters θ^μ and the environment state S_t;
S332, generating the action a_e to be imitated by the expert behavior strategy according to the current environment state S_t;
S333, calculating the similarity R_ε between the intelligent agent behavior and the expert behavior by the Kullback-Leibler divergence method, with the following formulas:
for a discrete action a_t:
R_ε = -Σ_{x ∈ X} a_t(x)·log( a_t(x) / a_e(x) )
wherein a_t(x) denotes the probability that the intelligent agent performs behavior x, a_e(x) denotes the probability that the expert performs behavior x, and X is the behavior set; for example, if X = {fly north, fly south, fly east, fly west}, then "fly north" is one behavior;
for a continuous action a_t:
R_ε = -∫_X a_t(x)·log( a_t(x) / a_e(x) ) dx
wherein a_t(x) denotes the probability density of the intelligent agent performing behavior x, a_e(x) denotes the probability density of the expert performing behavior x, x is a behavior and X is the behavior set; for example, X = {fly at an angle of x degrees to the horizontal, x ∈ [0°, 180°]}.
The DDPG algorithm outputs a specific action according to the current environment state and the action strategy and executes it, continuously providing action commands according to the real-time situation until the game ends. Each time the DDPG algorithm outputs a behavior, the expert behavior strategy makes a decision in the same situation, but this decision is not applied to the environment state; that is, the expert behavior does not take effect and only serves as feedback for the DDPG action decision, and the similarity between the expert behavior and the intelligent agent's action is calculated. A high similarity indicates that the current intelligent agent game decision model imitates the expert behaviors well.
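A hedged sketch of the similarity computation in step S333, assuming discrete action probabilities over the same behavior set X for the agent and the expert, and 1-D Gaussian action distributions for the continuous case (for which the KL divergence has a closed form). The negative sign, so that more similar behavior yields a larger reward, is an assumption consistent with R_ε being a similarity term.

```python
import numpy as np

def kl_similarity_discrete(p_agent, p_expert, eps=1e-8):
    """R_eps = -KL(agent || expert) for discrete action distributions over the same set X."""
    p = np.clip(np.asarray(p_agent, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_expert, dtype=float), eps, 1.0)
    return -float(np.sum(p * np.log(p / q)))

def kl_similarity_gaussian(mu_a, sigma_a, mu_e, sigma_e):
    """R_eps = -KL(N(mu_a, sigma_a^2) || N(mu_e, sigma_e^2)), closed form for 1-D Gaussians."""
    kl = np.log(sigma_e / sigma_a) + (sigma_a**2 + (mu_a - mu_e)**2) / (2.0 * sigma_e**2) - 0.5
    return -float(kl)

# Example: the agent's action distribution vs. the expert's over X = {N, S, E, W}.
print(kl_similarity_discrete([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]))
```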
S34, obtaining the comprehensive reward r_t, where the comprehensive reward r_t is given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
wherein w_b is the weight coefficient of the maximum accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε.
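As a small illustrative sketch, the three reward terms can be combined as follows; the weight values shown are placeholders, not values from the patent.

```python
def comprehensive_reward(R_b, R_star, R_eps, w_b=0.4, w_i=0.3, w_eps=0.3):
    """r_t = w_b * R_b + w_i * R*(s) + w_eps * R_eps; the weights are user-adjustable."""
    return w_b * R_b + w_i * R_star + w_eps * R_eps
```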
S4, training the intelligent agent game decision model and judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise continuing the cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold.
The application mainly adopts the DDPG algorithm to carry out imitation training of expert agent decision behaviors, so as to obtain an intelligent agent game decision model capable of producing expert-like behaviors. The behavior simulation training process is one in which the agent generates actions according to the situation to obtain returns and continuously accumulates experience, so that the generated actions shift toward higher returns until they reach a stable high return and converge, thereby obtaining a final intelligent agent game decision model with a high intelligence level for use in air gaming.
In one embodiment, in step S4, training the intelligent agent game decision model comprises the following steps:
S41, initializing the online network parameters θ^μ of the actor network and the online network parameters θ^Q of the critic network in the intelligent agent game decision model; the actor network outputs behaviors according to the battlefield situation, and the critic network outputs scores according to the battlefield situation and the behaviors.
S42, copying the online network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ and θ^{Q′} ← θ^Q;
S43, initializing the replay buffer R;
S44, for the L-th game round (L = 1, 2, ..., n), executing the following steps:
S441, initializing the random process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, ..., T, executing the following steps:
a) each intelligent agent executes its corresponding action, and the environment state transitions, giving the next environment state S_{t+1} and the comprehensive reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay buffer R to be used as the training data set;
c) randomly sampling U arrays (s_L, a_L, r_L, s_{L+1}) from the replay buffer R as mini-batch training data, with the label
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′})
wherein μ′(s_{L+1}|θ^{μ′}) denotes the action generated by the target strategy μ′ under the parameters θ^{μ′} and the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) denotes the value of the target Q function under the parameters θ^{Q′}, the environment state s_{L+1} and that action;
d) minimizing the loss function of the critic network, the loss function being
Loss = (1/U)·Σ_L ( y_L - Q(s_L, a_L|θ^Q) )²
e) updating the actor network by the policy gradient:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
wherein Q(s, a|θ^Q) denotes the value of the Q function under the parameters θ^Q, the environment state s and the executed action a, and ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a; μ(s|θ^μ) denotes the action generated by the strategy μ under the parameters θ^μ and the environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) updating the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network:
θ^{Q′} ← τ·θ^Q + (1 - τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1 - τ)·θ^{μ′}
wherein τ is an adjustable coefficient, τ ∈ [0,1];
g) judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise returning to step S44 for cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold. For example, if the upper limit of the preset reward threshold is 100 and the user sets the reward threshold to 80 according to actual requirements, the training is terminated when the comprehensive reward r_t exceeds 80; otherwise the process returns to step S44 for cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t exceeds 80.
In this way, the intelligent agent behaviors are trained for decision-making with the deep reinforcement learning algorithm, and the deep network is continuously updated with the similarity between expert behaviors and agent behaviors as feedback. By training the intelligent agent game decision model, decision experience is obtained from expert prior knowledge, and as the similarity between the agent behaviors and the expert behaviors increases during training, the level of behavior imitation gradually improves, so that a final intelligent agent game decision model with a high intelligence level is obtained for use in air gaming.
In one embodiment, the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by a gradient descent method. Each update changes the return of the generated actions, and the overall trend of the return is upward.
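The following PyTorch sketch condenses steps c) through f) into one update call: forming the label y_L from the target networks, minimizing the critic's mean-squared loss, updating the actor with the deterministic policy gradient, and soft-updating the target networks with the coefficient τ. It is a hedged illustration under the assumption of small fully connected networks and a pre-sampled mini-batch, not the patent's implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

state_dim, action_dim = 10, 2                      # illustrative sizes, not from the patent
actor,  actor_target  = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
critic, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor_target.load_state_dict(actor.state_dict())   # step S42: copy online -> target parameters
critic_target.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch                        # mini-batch sampled from the replay buffer
    with torch.no_grad():                          # c) label y_L = r_L + gamma * Q'(s', mu'(s'))
        y = r + gamma * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
    # d) minimize the critic loss (1/U) * sum_L (y_L - Q(s_L, a_L))^2
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # e) actor update by the deterministic policy gradient (maximize Q, i.e. minimize -Q)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # f) soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'
    with torch.no_grad():
        for net, net_t in ((actor, actor_target), (critic, critic_target)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)

U = 32                                             # example mini-batch of random tensors
ddpg_update((torch.randn(U, state_dim), torch.randn(U, action_dim),
             torch.randn(U, 1), torch.randn(U, state_dim)))
```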
In conclusion, the application synthesizes key elements of the air scenario of the intelligent agent game decision model as reward factors and shapes continuous non-sparse reward functions, so that the reward functions are clearly and explicitly designed and interpretable and the behavior control of the intelligent agent can be manually intervened; the reverse reinforcement learning algorithm based on expert behaviors brings the intelligent agent's combat effect close to that of the expert, and the solution of the target reward function is automated while remaining interpretable; the decision simulation method based on expert behaviors can rapidly improve the decision level of the intelligent agent, solve the problem of difficult, slow or even impossible convergence, and fit linear and nonlinear reward functions so that the intelligent agent can imitate complex expert behaviors, thereby reducing labor cost; continuous non-sparse shaped rewards, reverse reinforcement learning rewards and behavior-imitation similarity rewards are considered simultaneously, their weights can be adjusted according to user requirements, and the intelligent decision level of the intelligent agent is improved; the behavior simulation method solves the cold-start problem of training, improves the traditionally inefficient reward-function design process and the random exploration process during initial training, optimizes decision-making capability, and quickly yields a decision-making intelligent agent with a high intelligence level and practical value.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments only express several specific and detailed implementations of the present application and should not be understood as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A behavior simulation training method for air intelligent gaming, characterized in that the behavior simulation training method for air intelligent gaming comprises the following steps:
S1, constructing an air confrontation simulation environment and red and blue intelligent agents, constructing an intelligent agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each intelligent agent;
S2, determining the environment state and the action space of each intelligent agent, wherein the action space comprises at least one action a_t, and shaping a continuous non-sparse reward function R_t for each said action a_t;
S3, carrying out air game matches in the intelligent agent game decision model, wherein the match duration is T, and each intelligent agent, according to the current environment state S_t, t = 0, 1, 2, ..., T, performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward function R_t, and this is iterated cyclically in turn to realize the maximum accumulated reward R_b;
S32, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning, and obtaining a target reward function R*(s) = w*·φ(s), wherein w* is a reward function weight vector with h components, w* ∈ R^h, ||w*|| ≤ 1, and φ(s) is a situation information vector, φ: S → [0,1]^h;
S33, calculating the similarity R_ε between each intelligent agent's behavior and the expert behavior;
S34, obtaining the comprehensive reward r_t, where the comprehensive reward r_t is given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
wherein w_b is the weight coefficient of said maximum accumulated reward R_b, w_i is the weight coefficient of said target reward function R*(s), and w_ε is the weight coefficient of said similarity R_ε;
S4, training the intelligent agent game decision model and judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise continuing the cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold.
2. The behavior simulation training method for an air intelligent game according to claim 1, wherein the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
3. The behavior simulation training method for an air intelligent game according to claim 1, wherein the action space comprises at least one action a_t selected from target pursuit, target avoidance, tangential escape from enemy lock, cross attack, and serpentine maneuver.
4. The behavior simulation training method for an air intelligent game according to claim 3, wherein the continuous non-sparse reward function R_t shaped for each of said actions a_t is as follows:
1) the reward function R_t1 for target pursuit satisfies the following condition:
[formula provided as an image (FDA0003029048010000021) in the original filing]
2) the reward function R_t2 for target avoidance satisfies the following condition:
R_t2 = Δθ_f + a′ + A + B
3) the reward function R_t3 for tangential escape from enemy lock satisfies the following condition:
R_t3 = −Δθ_t + a′ + A + B
4) the reward function R_t4 for cross attack satisfies the following condition:
[formula provided as an image (FDA0003029048010000022) in the original filing]
5) the reward function R_t5 for the serpentine maneuver satisfies the following condition:
[formula provided as an image (FDA0003029048010000023) in the original filing]
wherein Δθ_c = |θ − δ| − ε, Δθ_f = |θ − δ|, Δθ_t = |θ − δ| − 90°, Δθ_j = |θ − δ′|, Δθ_s = θ − δ − (90° − σ); θ is the absolute orientation angle of the red agent, θ_e is the absolute orientation angle of the blue agent nearest to the red agent, x is the distance between the red agent and the nearest blue agent, x_a is the distance between the red agent's nearest friendly aircraft and the nearest blue agent, δ is the line angle between the red agent and the nearest blue agent, δ′ is the line angle between the red agent and its nearest friendly aircraft, ε is the deviation angle between the red agent and the nearest blue agent, a′ is the acceleration of the red agent, A is the difference between the number of blue agents lost and the number of red agents lost, B is the number of rounds won by the red side, v is the speed of the red agent, v_a is the speed of the red agent's nearest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter with σ ∈ [0°, 90°], and T_1 is the direction-change period of the serpentine maneuver, 0 < T_1 ≤ T (an illustrative sketch of the target-avoidance and tangential-escape rewards follows this claim).
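For illustration, a minimal Python sketch of the two shaped rewards whose formulas survive in the text of claim 4 (the target-avoidance reward R_t2 and the tangential-escape reward R_t3); function and argument names are assumptions, and the pursuit, cross-attack and serpentine rewards are omitted because their formulas are only given as images in the original filing.

```python
# Sketch of the two shaped rewards given explicitly in claim 4. Angles are assumed
# to be in degrees; theta is the red agent's absolute orientation angle, delta the
# line angle to the nearest blue agent, accel the acceleration a', loss_diff the
# quantity A and wins the quantity B. Names are illustrative.

def reward_target_avoidance(theta, delta, accel, loss_diff, wins):
    # R_t2 = delta_theta_f + a' + A + B, with delta_theta_f = |theta - delta|
    return abs(theta - delta) + accel + loss_diff + wins

def reward_tangential_escape(theta, delta, accel, loss_diff, wins):
    # R_t3 = -delta_theta_t + a' + A + B, with delta_theta_t = |theta - delta| - 90
    return -(abs(theta - delta) - 90.0) + accel + loss_diff + wins
```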
5. The behavior simulation training method for an air intelligent game according to claim 1, wherein controlling each agent to play the air game based on expert behaviors so as to realize inverse reinforcement learning comprises the following steps:
s321, constructing the target reward function R*(s)=w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and calculating its corresponding feature expectation μ(π_0); setting i = 0, with 0 ≤ i ≤ m, where m is the number of game rounds;
S323, calculating the feature error between the current action strategy and the expert action strategy using a min-max algorithm:
e_i = max_{‖w‖≤1} min_{j∈{0,1,…,i}} wᵀ·(μ_E − μ_j),
obtaining the current reward function weight vector w_i as the maximizing w, and updating the reward function weight vector as w* = (w_i)ᵀ, where μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the j-th action strategy, and T denotes transposition;
S324, judging whether the feature error e_i is less than the error threshold; if so, outputting the target reward function R*(s) = (w_i)ᵀ·φ(s); otherwise, calculating the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation to μ(π_i), setting i = i + 1, and returning to step S323 to iterate (an illustrative sketch of this loop follows this claim).
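For illustration, the following hedged Python sketch mirrors the expert-behavior inverse reinforcement learning loop of claim 5 (steps S321 to S324). The inner min-max weight search is approximated by a projection-style step; a faithful implementation of the claim's min-max formulation would use a QP/SVM solver. The callables rl_train_policy and estimate_feature_expectation are hypothetical stand-ins for the deep reinforcement learning training and rollout code.

```python
import numpy as np

def apprenticeship_irl(mu_expert, initial_policy, rl_train_policy,
                       estimate_feature_expectation, eps=1e-2, max_iters=50):
    """Return a reward weight vector w such that R*(s) = w . phi(s)."""
    mu_list = [estimate_feature_expectation(initial_policy)]      # mu(pi_0)
    w_i = None
    for _ in range(max_iters):
        # Projection-style approximation of the min-max weight search:
        # point the weight vector at the expert feature expectation.
        gaps = [np.asarray(mu_expert) - np.asarray(mu) for mu in mu_list]
        best_gap = min(gaps, key=np.linalg.norm)
        e_i = np.linalg.norm(best_gap)                            # feature error e_i
        w_i = best_gap / (e_i + 1e-12)                            # ||w_i||_2 <= 1
        if e_i < eps:                                             # S324: below threshold
            break
        # S324 (else branch): train a policy optimal for reward w_i . phi(s),
        # then record its feature expectation and iterate.
        policy_i = rl_train_policy(reward_weights=w_i)
        mu_list.append(estimate_feature_expectation(policy_i))
    return w_i
```

On termination, the returned weight vector defines the target reward R*(s) = w_i·φ(s), as in step S324.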
6. The behavior simulation training method for an air intelligent game according to claim 5, wherein the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t·φ(s_t)|π_i,
where γ is the discount factor and φ(s_t)|π_i is the battlefield situation at time t corresponding to the action strategy π_i.
7. The behavior simulation training method for an air intelligent game according to claim 6, wherein controlling each agent to play the air game based on expert behaviors so as to realize inverse reinforcement learning further comprises:
generating the battlefield situation set {s_0^i, s_1^i, …, s_T^i} according to the action strategy corresponding to each round, and calculating the global feature expectation μ̄ over the m rounds of the air game according to the following formula:
μ̄ = (1/m)·Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t·φ(s_t^i),
where s_t^i is the environment state at time t of the i-th round and φ(s_t^i) is the situation information vector corresponding to the battlefield situation at time t of the i-th round (an illustrative sketch of these feature expectations follows this claim).
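For illustration, a short Python sketch of the discounted feature expectation of claim 6 and the global, round-averaged feature expectation of claim 7, assuming the battlefield situations are supplied as lists of situation vectors φ(s_t); names and the default discount factor are assumptions.

```python
import numpy as np

def feature_expectation(phi_trajectory, gamma=0.99):
    # mu(pi) = sum_t gamma^t * phi(s_t) over one round
    return sum((gamma ** t) * np.asarray(phi_t)
               for t, phi_t in enumerate(phi_trajectory))

def global_feature_expectation(phi_trajectories, gamma=0.99):
    # Average the discounted feature sums over the m recorded rounds.
    m = len(phi_trajectories)
    return sum(feature_expectation(traj, gamma) for traj in phi_trajectories) / m
```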
8. The behavior simulation training method for an air intelligent game according to claim 1, wherein in step S33 the similarity R_ε between each agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, where μ(S_t|θ^μ) denotes the action generated by the strategy μ under the neural network parameters θ^μ in the environment state S_t;
S332, generating an action a_e as the imitation target using the expert behavior strategy according to the current environment state S_t;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior using the Kullback-Leibler divergence according to the following formulas (an illustrative sketch follows this claim):
for a discrete action a_t,
R_ε = −Σ_{x∈X} a_t(x)·log(a_t(x)/a_e(x)),
where a_t(x) denotes the probability that the agent performs behavior x, a_e(x) denotes the probability that the expert performs behavior x, and X is the behavior set;
for a continuous action a_t,
R_ε = −∫_X a_t(x)·log(a_t(x)/a_e(x)) dx,
where a_t(x) denotes the probability density of the agent performing behavior x, a_e(x) denotes the probability density of the expert performing behavior x, and X is the behavior set.
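For illustration, a minimal Python sketch of the discrete-action similarity term of claim 8; the negated-KL sign convention (so that closer agreement with the expert yields a larger reward) and the probability smoothing constant are assumptions of this sketch.

```python
import numpy as np

def similarity_reward(agent_probs, expert_probs, eps=1e-12):
    """Negated KL divergence KL(agent || expert) over the discrete behaviour set X."""
    p = np.asarray(agent_probs, dtype=float) + eps
    q = np.asarray(expert_probs, dtype=float) + eps
    kl = np.sum(p * np.log(p / q))
    return -kl  # larger when the agent's action distribution matches the expert's

# Example over a five-action behaviour set
# (pursuit, avoidance, tangential escape, cross attack, serpentine maneuver).
r_eps = similarity_reward([0.2, 0.2, 0.2, 0.2, 0.2], [0.6, 0.1, 0.1, 0.1, 0.1])
```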
9. The behavior simulation training method for an air intelligent game according to claim 1, wherein in step S4 the training of the agent game decision model comprises the following steps:
S41, initializing the online neural network parameters θ^μ of the actor network and the online neural network parameters θ^Q of the critic network in the agent game decision model;
S42, copying the online neural network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^μ′ of the actor network and θ^Q′ of the critic network, i.e. θ^μ′ ← θ^μ and θ^Q′ ← θ^Q;
S43, initializing the replay buffer R;
S44, for the L-th round, L = 1, 2, …, n, executing the following steps:
S441, initializing the random process noise N and obtaining an initial environment state S;
S442, for each time step t = 1, 2, …, T, executing the following steps:
a) each agent executes its corresponding action, and the environment state transition is performed to obtain the next environment state S_{t+1} and the composite reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay buffer R to serve as the training data set;
c) randomly sampling U arrays (s_L, a_L, r_L, s_{L+1}) from the replay buffer R as a mini-batch of training data and computing the label
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^μ′)|θ^Q′),
where μ′(s_{L+1}|θ^μ′) denotes the action generated by the target policy μ′ under the parameters θ^μ′ in the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^μ′)|θ^Q′) denotes the value of the target Q function under the parameters θ^Q′ in the environment state s_{L+1} when that action is performed;
d) minimizing the loss function of the critic network, the loss function being
Loss = (1/U)·Σ_L (y_L − Q(s_L, a_L|θ^Q))²;
e) updating the actor network with the policy gradient:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L},
where Q(s, a|θ^Q) denotes the value of the Q function under the parameters θ^Q in the environment state s when action a is executed, ∇_a Q(s, a|θ^Q) is the partial derivative of Q(s, a|θ^Q) with respect to a, μ(s|θ^μ) denotes the action generated by the strategy μ under the parameters θ^μ in the environment state s, and ∇_{θ^μ} μ(s|θ^μ) is the partial derivative of μ(s|θ^μ) with respect to θ^μ;
f) updating the target network parameters θ^μ′ of the actor network and θ^Q′ of the critic network:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′,
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′,
where τ is an adjustable coefficient, 0 < τ < 1;
g) judging whether the composite reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent game decision model; otherwise, returning to step S44 for iterative training until the maximum number of training rounds n is reached or the composite reward r_t exceeds the reward threshold (an illustrative sketch of steps c) to f) follows this claim).
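For illustration, a hedged PyTorch sketch of one pass through steps c) to f) of claim 9; the actor, critic and target networks, their optimizers and the replay batch are assumed to exist with compatible shapes, so this is a sketch of the update rule rather than the patented implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # c) label y_L = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # d) minimise the critic loss (1/U) * sum (y - Q(s, a))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # e) actor update via the deterministic policy gradient
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # f) soft update of the target network parameters with coefficient tau
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```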
10. The behavior simulation training method for an air intelligent game according to claim 9, wherein the target network parameters θ^μ′ of the actor network and θ^Q′ of the critic network are updated by a gradient descent method.
CN202110425153.0A 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game Active CN113221444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110425153.0A CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110425153.0A CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Publications (2)

Publication Number Publication Date
CN113221444A true CN113221444A (en) 2021-08-06
CN113221444B CN113221444B (en) 2023-01-03

Family

ID=77088029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110425153.0A Active CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Country Status (1)

Country Link
CN (1) CN113221444B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112052511A (en) * 2020-06-15 2020-12-08 成都蓉奥科技有限公司 Air combat maneuver strategy generation technology based on deep random game
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867178A (en) * 2021-10-26 2021-12-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN113867178B (en) * 2021-10-26 2022-05-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN114021737A (en) * 2021-11-04 2022-02-08 中国电子科技集团公司信息科学研究院 Game-based reinforcement learning method, system, terminal and storage medium
CN114021737B (en) * 2021-11-04 2023-08-22 中国电子科技集团公司信息科学研究院 Reinforced learning method, system, terminal and storage medium based on game
CN114423046A (en) * 2021-12-03 2022-04-29 中国人民解放军空军工程大学 Cooperative communication interference decision method
CN113893539A (en) * 2021-12-09 2022-01-07 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN113893539B (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN115470710A (en) * 2022-09-26 2022-12-13 北京鼎成智造科技有限公司 Air game simulation method and device
CN115648204A (en) * 2022-09-26 2023-01-31 吉林大学 Training method, device, equipment and storage medium of intelligent decision model
CN115470710B (en) * 2022-09-26 2023-06-06 北京鼎成智造科技有限公司 Air game simulation method and device

Also Published As

Publication number Publication date
CN113221444B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
CN113221444B (en) Behavior simulation training method for air intelligent game
CN113093802B (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN113095481B (en) Air combat maneuver method based on parallel self-game
CN113791634B (en) Multi-agent reinforcement learning-based multi-machine air combat decision method
CN113050686A (en) Combat strategy optimization method and system based on deep reinforcement learning
Wang et al. Improving maneuver strategy in air combat by alternate freeze games with a deep reinforcement learning algorithm
CN113893539B (en) Cooperative fighting method and device for intelligent agent
CN114460959A (en) Unmanned aerial vehicle group cooperative autonomous decision-making method and device based on multi-body game
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
CN116185059A (en) Unmanned aerial vehicle air combat autonomous evasion maneuver decision-making method based on deep reinforcement learning
CN113159266B (en) Air combat maneuver decision method based on sparrow searching neural network
CN113962012A (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
CN111773722B (en) Method for generating maneuver strategy set for avoiding fighter plane in simulation environment
CN114721424A (en) Multi-unmanned aerial vehicle cooperative countermeasure method, system and storage medium
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN113282100A (en) Unmanned aerial vehicle confrontation game training control method based on reinforcement learning
CN115933717A (en) Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
CN114371634B (en) Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
CN117291254A (en) Agent task allocation training method based on imitation learning and safety reinforcement learning
CN116225065A (en) Unmanned plane collaborative pursuit method of multi-degree-of-freedom model for multi-agent reinforcement learning
Kong et al. Multi-ucav air combat in short-range maneuver strategy generation using reinforcement learning and curriculum learning
Wang et al. Research on autonomous decision-making of UCAV based on deep reinforcement learning
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant