CN113221444B - Behavior simulation training method for air intelligent game - Google Patents

Behavior simulation training method for air intelligent game

Info

Publication number
CN113221444B
Authority
CN
China
Prior art keywords
agent
reward
action
intelligent
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110425153.0A
Other languages
Chinese (zh)
Other versions
CN113221444A (en)
Inventor
包骐豪
朱燎原
夏少杰
瞿崇晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by CETC 52 Research Institute
Priority to CN202110425153.0A
Publication of CN113221444A
Application granted
Publication of CN113221444B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a behavior simulation training method for air intelligent gaming, comprising the following steps: S1, constructing an intelligent-agent game decision model; S2, determining the environment state and action space, and shaping a continuous non-sparse reward function for each action; S3, conducting air game engagements in the model and executing the following steps: S31, generating the next environment state according to the executed action and obtaining the reward, iterating in a loop to achieve the maximum cumulative reward; S32, realizing inverse reinforcement learning based on expert behavior to obtain a target reward function; S33, calculating the similarity between the agents' behavior and the expert behavior; S34, obtaining a composite reward; and S4, training the intelligent-agent game decision model. The method improves the traditionally inefficient reward-function design process and the random exploration process of model training, so that the reward function is interpretable and open to human intervention, the agent's decision level and convergence speed are improved, and the cold-start problem of model training is solved.

Description

Behavior simulation training method for air intelligent game
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a behavior simulation training method for an air intelligent game.
Background
In future air gaming, the most accurate information must be acquired from various detection systems at any time and place to establish an information advantage; more importantly, a decision advantage must be achieved by exploiting technologies such as machine learning, artificial intelligence and cloud computing. To better mine information for a decision advantage, exert gaming effectiveness and secure air superiority, an airborne decision-support system matched to the pilot is needed in addition to the pilot's flying skill and the commander's command ability. As an artificial-intelligence assistant, such an airborne decision-support system can provide reference decision schemes in a highly dynamic, complex confrontation environment, reduce the pilot's decision burden, better mine information to achieve a decision advantage, exert gaming effectiveness and secure air superiority.
However, existing airborne decision-support systems are relatively backward: the number of sensing parameters or targets that can be handled simultaneously is limited, and the robustness, timeliness and accuracy of the decision support are poor. Moreover, in game decision-making an excessively high decision dimensionality makes the training model hard to converge, so that training a practically useful agent takes a long time, or an effective decision-making agent cannot be obtained at all. In air intelligent self-play confrontation, sparse rewards and a complex, inefficient reward-design process lead to a low agent decision level and long training times; in addition, rewards are hand-crafted for each scenario, which is labor-intensive, poorly reusable, and leaves the algorithm training with a cold-start problem.
Disclosure of Invention
The invention aims to provide a behavior simulation training method for air intelligent gaming that improves the traditionally inefficient reward-function design process and the random exploration process during model training, so that the reward function is clearly and explicitly shaped and interpretable while the agent's behavior can still be steered by human intervention; the method rapidly improves the agent's decision level and convergence speed, gives the agent the ability to imitate complex expert behaviors, reduces labor cost, and solves the cold-start problem of model training.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
The invention provides a behavior simulation training method for air intelligent gaming, comprising the following steps:
S1, constructing an air confrontation simulation environment and red-side and blue-side agents, constructing an intelligent-agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each agent;
S2, determining the environment state and the action space of each agent, the action space comprising at least one action a_t, and shaping a continuous non-sparse reward function R_t for each action a_t;
S3, conducting air game engagements in the intelligent-agent game decision model, the engagement duration being T; according to the current environment state S_t, t = 0, 1, 2, ..., T, each agent performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to produce the next environment state S_{t+1} and the reward given by the corresponding reward function R_t; iterating in this loop to achieve the maximum cumulative reward R_b;
S32, controlling each agent to play the air game based on expert behavior, realizing inverse reinforcement learning and obtaining a target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector with h components, ‖w*‖ ≤ 1, and φ(s) is the situation information vector, φ: S → [0,1]^h;
S33, calculating the similarity R_ε between each agent's behavior and the expert behavior;
S34, obtaining the composite reward r_t, given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
where w_b is the weight coefficient of the maximum cumulative reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε;
S4, training the intelligent-agent game decision model and judging whether the composite reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final intelligent-agent game decision model; otherwise, continuing the iterative training in a loop until the maximum number of training engagements n is reached or the composite reward r_t exceeds the reward threshold.
Preferably, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
Preferably, the action space comprises at least one action a_t selected from target pursuit, target evasion, tangential escape from enemy lock, cross attack and serpentine maneuver.
Preferably, the continuous non-sparse reward function R_t shaped for each action a_t is as follows:
1) The reward function R_t1 for target pursuit satisfies:
[formula given as an image in the original publication; it combines Δθ_c, a′, A and B]
2) The reward function R_t2 for target evasion satisfies:
R_t2 = Δθ_f + a′ + A + B
3) The reward function R_t3 for tangential escape from enemy lock satisfies:
R_t3 = −Δθ_t + a′ + A + B
4) The reward function R_t4 for cross attack satisfies:
[formula given as an image in the original publication; it combines Δθ_c or Δθ_j (depending on the sign of x_a − x), a′, A and B]
5) The reward function R_t5 for the serpentine maneuver satisfies:
[formula given as an image in the original publication; it combines Δθ_s, A and B]
where Δθ_c = |θ−δ| − ε, Δθ_f = |θ−δ|, Δθ_t = |θ−δ| − 90°, Δθ_j = |θ−δ′|, Δθ_s = θ − δ − (90° − σ), θ is the absolute orientation angle of the red-side agent, θ_e is the absolute orientation angle of the blue-side agent nearest to the red-side agent, x is the distance between the red-side agent and the nearest blue-side agent, x_a is the distance between the red-side agent's nearest friendly aircraft and the nearest blue-side agent, δ is the line-of-sight angle between the red-side agent and the nearest blue-side agent, δ′ is the line-of-sight angle between the red-side agent and its nearest friendly aircraft, ε is the deviation angle between the red-side agent and the nearest blue-side agent, a′ is the acceleration of the red-side agent, A is the number of blue-side agents lost minus the number of red-side agents lost, B is the win/loss outcome of the red-side agent, v is the speed of the red-side agent, v_a is the speed of the red-side agent's nearest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter with σ ∈ [0°, 90°], and T_1 is the period with which the maneuver direction changes in the serpentine maneuver, 0 < T_1 ≤ T.
Preferably, controlling each agent to play the air game based on expert behavior to realize inverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and computing its corresponding feature expectation μ(π_0), with i = 0, 0 ≤ i ≤ m, where m is the number of engagements;
S323, calculating the feature error between the current action strategy and the expert action strategy with a min-max algorithm,
e_i = max_{w:‖w‖≤1} min_{j∈{0,...,i−1}} w^T(μ_E − μ_j),
taking the maximizing w as the current reward-function weight vector w_i and updating the reward-function weight vector w* to (w_i)^T, where μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the current action strategy, and T denotes transposition;
S324, judging whether the feature error e_i is less than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, computing the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation to μ(π_i), setting i = i + 1 and returning to step S323 for the next iteration.
Preferably, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t φ(s_t)|π_i
where γ is the discount factor and φ(s_t)|π_i is the situation information vector of the battlefield situation at time t under action strategy π_i.
Preferably, controlling each agent to play the air game based on expert behavior to realize inverse reinforcement learning further comprises the following steps:
generating, for each engagement, the set of battlefield situations {s_t^(i) | t = 0, 1, ..., T} produced by the corresponding action strategy, and calculating the global feature expectation μ̂ of the m air-game engagements by the formula:
μ̂ = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t φ(s_t^(i))
where s_t^(i) is the environment state at time t of the i-th engagement and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th engagement.
Preferably, in step S33, the similarity R_ε between each agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, where μ(S_t|θ^μ) denotes the action generated by strategy μ with neural-network parameters θ^μ in environment state S_t;
S332, generating an action a_e as the imitation target from the current environment state S_t using the expert behavior strategy;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior with the Kullback-Leibler divergence, by the following formulas:
for a discrete action a_t, R_ε is computed from the divergence
D_KL(a_t ‖ a_e) = Σ_{x∈X} a_t(x) log( a_t(x) / a_e(x) )
where a_t(x) is the probability that the agent performs behavior x, a_e(x) is the probability that the expert performs behavior x, and X is the behavior set;
for a continuous action a_t, R_ε is computed from the divergence
D_KL(a_t ‖ a_e) = ∫_X a_t(x) log( a_t(x) / a_e(x) ) dx
where a_t(x) is the probability density of the agent performing behavior x, a_e(x) is the probability density of the expert performing behavior x, and X is the behavior set.
Preferably, in step S4, training the intelligent-agent game decision model comprises the following steps:
S41, initializing the online neural-network parameters θ^μ of the actor network and the online neural-network parameters θ^Q of the critic network in the intelligent-agent game decision model;
S42, copying the online neural-network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ, θ^{Q′} ← θ^Q;
S43, initializing a replay buffer R;
S44, performing the following steps for the L-th engagement (L = 1, 2, ..., n):
S441, initializing the random-process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, ..., T:
a) each agent executes its corresponding action and the environment state transitions, yielding the next environment state S_{t+1} and the composite reward r_t;
b) the environment-state transition (S_t, a_t, r_t, S_{t+1}) is stored as a tuple in the replay buffer R to serve as the training data set;
c) U tuples are randomly sampled from the replay buffer R as mini-batch training data, and the label is computed as
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′})
where μ′(s_{L+1}|θ^{μ′}) denotes the action generated by the target strategy μ′ with parameters θ^{μ′} in environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) is the value of the target Q function with parameters θ^{Q′} for environment state s_{L+1} and that action;
d) the loss function of the critic network is minimized, the loss function being
Loss = (1/U)·Σ_L ( y_L − Q(s_L, a_L|θ^Q) )²
e) the policy gradient of the actor network is updated:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
where Q(s, a|θ^Q) denotes the value of the Q function with parameters θ^Q for environment state s and executed action a, ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a, μ(s|θ^μ) denotes the action generated by strategy μ with parameters θ^μ in environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated:
θ^{Q′} ← τ·θ^Q + (1−τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1−τ)·θ^{μ′}
where τ is an adjustable coefficient, τ ∈ [0,1];
g) the composite reward r_t is compared with the reward threshold; if it is greater than the reward threshold, the training terminates and the final intelligent-agent game decision model is obtained; otherwise, the process returns to step S44 for further iterative training until the maximum number of training engagements n is reached or the composite reward r_t exceeds the reward threshold.
Preferably, the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by a gradient-descent method.
Compared with the prior art, the invention has the beneficial effects that:
1) Key elements of the air scenario of the intelligent-agent game decision model are combined as reward factors and shaped into continuous non-sparse reward functions, so that the reward design is clear, explicit and interpretable, while the agent's behavior can still be steered by human intervention;
2) The inverse-reinforcement-learning algorithm based on expert behavior brings the agent's combat effectiveness close to that of the expert, and the solution of the target reward function can be automated while remaining interpretable;
3) The intelligent decision simulation method based on expert behavior rapidly raises the agent's decision level, overcomes slow or even failed convergence, and fits linear and nonlinear reward functions so that the agent can imitate complex expert behaviors, thereby reducing labor cost;
4) Continuous non-sparse shaped rewards, inverse-reinforcement-learning rewards and behavior-imitation similarity rewards are considered simultaneously, and their weights can be adjusted to user requirements, raising the agent's intelligent decision level;
5) The behavior simulation method solves the cold-start problem of training, improves the traditionally inefficient reward-function design process and the random exploration process of early training, optimizes decision-making capability, and quickly yields a decision-making agent of high intelligence level and practical value.
Drawings
FIG. 1 is a schematic diagram of the cyclic interaction between an agent and the air confrontation simulation environment according to the present invention;
FIG. 2 is a block diagram of the expert-behavior-based inverse reinforcement learning method of the present invention;
FIG. 3 is a flow chart of the expert-behavior-based behavioral decision simulation method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-3, a behavior simulation training method for air intelligent gaming comprises the following steps:
S1, the air confrontation simulation environment and the red-side and blue-side agents are constructed, the intelligent-agent game decision model is constructed based on a deep reinforcement learning algorithm, and cyclic interaction between the air confrontation simulation environment and each agent is achieved.
In one embodiment, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
The intelligent-agent game decision model is constructed with the DDPG (Deep Deterministic Policy Gradient) deep reinforcement learning algorithm. For the adversarial training of the model, the constructed air confrontation simulation environment is linked with the agents, here red-side and blue-side aircraft, so that the agents and the air confrontation simulation environment influence each other interactively, as shown in Fig. 1. It should be noted that the agent can also change with the application scenario, e.g. it can be a missile, and the intelligent-agent game decision model can also be constructed based on the DQN algorithm or other prior-art deep reinforcement learning algorithms.
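As an illustration of the cyclic interaction shown in Fig. 1, the following minimal Python sketch drives one engagement of an agent against a simulated environment; the environment class and the agent's act/observe interface are hypothetical stand-ins introduced here for illustration and are not part of the patent.

```python
def run_engagement(env, agent, max_steps):
    """Hypothetical interaction loop: state -> action -> next state + reward."""
    state = env.reset()                                   # initial environment state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                         # a_t chosen from the action space
        next_state, reward, done = env.step(action)       # S_{t+1} and the shaped reward
        agent.observe(state, action, reward, next_state)  # store the transition for training
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```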
S2, the environment state and the action space of each agent are determined, the action space comprising at least one action a_t, and a continuous non-sparse reward function R_t is shaped for each action a_t.
In an embodiment, the action space comprises at least one action a_t selected from target pursuit, target evasion, tangential escape from enemy lock, cross attack and serpentine maneuver. The number and types of optional actions a_t in the action space can be adjusted to actual needs, and other prior-art actions can also be adopted.
In one embodiment, the continuous non-sparse reward function R_t shaped for each action a_t is as follows:
1) The reward function R_t1 for target pursuit satisfies:
[formula given as an image in the original publication; it combines Δθ_c, a′, A and B]
2) The reward function R_t2 for target evasion satisfies:
R_t2 = Δθ_f + a′ + A + B
3) The reward function R_t3 for tangential escape from enemy lock satisfies:
R_t3 = −Δθ_t + a′ + A + B
4) The reward function R_t4 for cross attack satisfies:
[formula given as an image in the original publication; it combines Δθ_c or Δθ_j (depending on the sign of x_a − x), a′, A and B]
5) The reward function R_t5 for the serpentine maneuver satisfies:
[formula given as an image in the original publication; it combines Δθ_s, A and B]
where Δθ_c = |θ−δ| − ε, Δθ_f = |θ−δ|, Δθ_t = |θ−δ| − 90°, Δθ_j = |θ−δ′|, Δθ_s = θ − δ − (90° − σ), θ is the absolute orientation angle of the red-side agent, θ_e is the absolute orientation angle of the blue-side agent nearest to the red-side agent, x is the distance between the red-side agent and the nearest blue-side agent, x_a is the distance between the red-side agent's nearest friendly aircraft and the nearest blue-side agent, δ is the line-of-sight angle between the red-side agent and the nearest blue-side agent, δ′ is the line-of-sight angle between the red-side agent and its nearest friendly aircraft, ε is the deviation angle between the red-side agent and the nearest blue-side agent, a′ is the acceleration of the red-side agent, A is the number of blue-side agents lost minus the number of red-side agents lost, B is the win/loss outcome of the red-side agent, v is the speed of the red-side agent, v_a is the speed of the red-side agent's nearest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter with σ ∈ [0°, 90°], and T_1 is the period with which the maneuver direction changes in the serpentine maneuver, 0 < T_1 ≤ T.
Specifically, a north-east coordinate system is adopted: θ is the angle between the nose heading of the red-side agent and due north, θ_e is the angle between the nose heading of the blue-side agent nearest to the red-side agent and due north, δ is the angle between the line from the red-side agent to the nearest blue-side agent and due north, δ′ is the angle between the line from the red-side agent to its nearest friendly aircraft and due north, and ε is the angle between the line from the red-side agent to the nearest blue-side agent and the red-side agent's nose heading. The spatial motion of an air-combat agent has six degrees of freedom; when shaping the reward functions, this application considers the agent's yaw angle and acceleration. The reward shaping covers the actions of target pursuit, target evasion, tangential escape from enemy lock, cross attack and serpentine maneuver, with the red-side agent as the own aircraft and the blue-side agents as enemy aircraft (enemy for short), where:
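To make the angle quantities concrete, the sketch below derives δ, Δθ_c, Δθ_f and Δθ_t from positions and headings in a local north-east plane. It is one possible reading of the definitions above; the (east, north) coordinate convention and the wrap-to-[−180°, 180°) convention are assumptions, not taken from the patent.

```python
import math

def bearing_deg(own_xy, target_xy):
    """Angle of the line own -> target measured clockwise from due north (degrees).
    Positions are (east, north) pairs in a local plane."""
    d_east = target_xy[0] - own_xy[0]
    d_north = target_xy[1] - own_xy[1]
    return math.degrees(math.atan2(d_east, d_north)) % 360.0

def wrap180(angle_deg):
    """Wrap an angle difference into [-180, 180) degrees (assumed convention)."""
    return (angle_deg + 180.0) % 360.0 - 180.0

def angle_factors(theta, own_xy, enemy_xy, epsilon):
    """Return (delta, d_theta_c, d_theta_f, d_theta_t) per the definitions above."""
    delta = bearing_deg(own_xy, enemy_xy)      # line angle to the nearest enemy
    diff = abs(wrap180(theta - delta))         # |theta - delta|
    d_theta_c = diff - epsilon                 # pursuit factor
    d_theta_f = diff                           # evasion factor
    d_theta_t = diff - 90.0                    # tangential-escape factor
    return delta, d_theta_c, d_theta_f, d_theta_t
```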
target pursuit: the local machine takes the nearest enemy plane as relative reference, and the difference value delta theta between the absolute orientation angle theta of the local machine and the line angle delta between the local machine and the nearest enemy plane is considered c = theta-delta-epsilon is one of the reward factors. When the enemy missile is hit by the target, the enemy missile is kept in the radar locking range of the enemy missile, so that the enemy missile is convenient to put the tail and avoid quickly, and meanwhile, the envelope of the enemy missile is compressed by delta theta c The smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle theta e In the same direction, i.e. | theta-theta e When | < 180 °, a' is positive reward, | theta-theta e If > 180 deg., a' is a negative reward. Relative hit number A and local win-lose conditionB is one of the consideration factors, the relative hit quantity A is the difference between the lost quantity of the blue party intelligent agent and the lost quantity of the red party intelligent agent, when the relative hit quantity A is larger than 0, the positive reward is given, A =0, the no reward is given, A is smaller than 0, the negative reward is given, meanwhile, the game winning in the local winning and negating condition B is the positive reward, the game tie is the no reward, and the game losing is the negative reward. Synthesizing a target pursuit reward function R according to the factors t1 The following were used:
Figure BDA0003029048020000091
target avoidance: the local machine takes the nearest enemy attacking the local machine as relative reference, and the difference value delta theta between the absolute orientation angle theta of the local machine and the nearest enemy connecting line angle delta between the local machine and the local machine f And (= theta-delta) is one of the reward factors. When avoiding the target, in order to compress the envelope of the missile of the enemy aircraft, the delta theta within the effective range of the enemy aircraft is avoided f The smaller the prize. The local acceleration a 'is one of the reward factors because of the need to quickly zoom in and out of range of the enemy while evading, and the factor a' is always a positive reward. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A =0, the reward is not available, A is smaller than 0, the negative reward is available, meanwhile, the local win-lose condition B win is positive reward, the tactic tie is no reward, and the tactic failure is negative reward. Synthesizing a target evasion reward function R according to the factors t2 The following:
R t2 =Δθ f +a′+A+B
Tangential escape from enemy lock: the own aircraft takes the enemy aircraft that most recently locked it as relative reference, and the difference Δθ_t = |θ − δ| − 90° between the own absolute orientation angle θ and the line angle δ to that enemy is one of the reward factors. Because the tangential escape exploits the Doppler effect to break the enemy radar lock, the optimal angle is 90°: the closer |θ − δ| is to 90°, i.e. the closer Δθ_t is to 0, the larger the reward. The own acceleration a′ is another reward factor; since the distance to the enemy must be increased rapidly during escape, a′ is always a positive reward. The relative hit count A and the own win/loss outcome B are further factors, defined and rewarded as above. The tangential-escape reward function R_t3 synthesized from these factors is:
R_t3 = −Δθ_t + a′ + A + B
cross attack: the machine gives different rewards according to the position of the nearest friend machine (namely the nearest red party intelligent agent to the machine) by taking the nearest enemy machine as a reference.
If x a X is more than or equal to 0, the machine is positioned in the front of the nearest friend machine to execute attack tactics, x is the distance between the red square intelligent agent and the nearest blue square intelligent agent, namely the distance between the machine and the nearest enemy plane, x a The distance between the nearest friend machine of the red intelligent agent and the nearest blue intelligent agent, namely the distance between the nearest friend machine and the nearest enemy machine is considered, and the difference value delta theta between the absolute orientation angle theta of the local machine and the connection angle delta between the local machine and the nearest enemy machine is considered when the local machine has an offset angle epsilon relative to the nearest enemy machine c And (= theta-delta-epsilon) is one of the reward factors. When the enemy missile is hit by the target, the enemy missile is kept in the radar locking range of the enemy missile, so that the enemy missile is convenient to put the tail and avoid quickly, and meanwhile, the envelope of the enemy missile is compressed by delta theta c The smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle theta e In the same direction, i.e. | theta-theta e When | < 180 °, a' is positive reward, | theta-theta e If > 180 deg., a' is a negative reward. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A =0, the reward is not available, A is smaller than 0, the negative reward is available, meanwhile, the local win-lose condition B win is positive reward, the tactic tie is no reward, and the tactic failure is negative reward.
If x a And x is less than 0, the machine is positioned behind the nearest friend machine to execute the tactics of following the previous machine and opening the interference. The difference delta theta between the absolute orientation angle theta of the machine and the connection angle delta' between the machine and the nearest friend machine j = theta-delta' | is one of the reward factors. When following the nearest friend machine, the distance between the nearest friend machine and the team friend is kept to play a role in hiding and disturbing the enemy machine by considering that the nearest friend machine is kept in the self radar locking range of the machine j The smaller the prize. The local acceleration a' is one of the reward factors, at the local absolute heading angle theta and the nearest enemy heading angle theta e In the same direction, i.e. | theta-theta e When | < 180 °, a' is positive reward, | theta-theta e If > 180 deg., a ' is a negative reward, while a ' is proportional to the speed difference with the nearest friend, i.e. a ' = α (v-v) a ) Where v is the local velocity, v a α is a weight coefficient for the nearest friend speed. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue party intelligent agent and the loss quantity of the red party intelligent agent, when the relative hit quantity A is larger than 0, the positive reward is given, A =0, the no reward is given, A is smaller than 0, the negative reward is given, meanwhile, the local win-lose condition B winning in the game is the positive reward, the game is no reward in the game tie, and the game failure is the negative reward. Synthesizing a cross attack reward function R according to the factors t4 The following were used:
Figure BDA0003029048020000111
Serpentine maneuver: the own aircraft takes the nearest enemy aircraft as relative reference and changes its maneuver direction every period T_1, with 0 < T_1 ≤ T; during the intervals 2kT_1 ≤ t ≤ (2k+1)T_1 (and correspondingly in the alternate intervals, as given by a formula shown as an image in the original publication), the difference Δθ_s = θ − δ − (90° − σ) between the own absolute orientation angle θ and the line angle δ to the nearest enemy is one of the reward factors. σ is an adjustable angle parameter, σ ∈ [0°, 90°], used to control how quickly the own aircraft closes on the nearest enemy: the larger σ, the larger the own radial speed, where the radial direction is the arrow from the own aircraft to the nearest enemy and the radial speed is the velocity component along it. The relative hit count A and the own win/loss outcome B are further factors, defined and rewarded as above. The serpentine-maneuver reward function R_t5 synthesized from these factors is:
[formula given as an image in the original publication; it combines Δθ_s, A and B]
Because the agent's actions are continuous (e.g. the aircraft's turn angle and acceleration) and the reward function serves as the evaluation of behavior, a continuous non-sparse reward function can be used in one-to-one correspondence with behavior so as to give the agent feedback at every time point during the real-time confrontation. The shaping of the continuous non-sparse reward function is clear, explicit and interpretable, and at the same time the agent's behavior control can be adjusted by human intervention.
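As an example of how such a shaped reward translates into code, the sketch below implements the target-evasion reward R_t2 = Δθ_f + a′ + A + B. The +1/0/−1 encoding of the win/loss term B is an assumption made for illustration.

```python
def target_evasion_reward(theta, delta, accel, blue_losses, red_losses, outcome):
    """R_t2 = d_theta_f + a' + A + B, following the shaping described above.

    outcome: +1 win, 0 draw, -1 loss (assumed encoding of the win/loss term B).
    """
    d_theta_f = abs(theta - delta)      # angular deviation from the line to the nearest enemy
    A = blue_losses - red_losses        # relative hit count
    B = float(outcome)                  # win/loss term
    return d_theta_f + accel + A + B
```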
S3, air game engagements are conducted in the intelligent-agent game decision model, the engagement duration being T; according to the current environment state S_t, t = 0, 1, 2, ..., T, each agent performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to produce the next environment state S_{t+1} and the reward given by the corresponding reward function R_t; iterating in this loop to achieve the maximum cumulative reward R_b.
S32, each agent is controlled to play the air game based on expert behavior, realizing inverse reinforcement learning and obtaining the target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector with h components, ‖w*‖ ≤ 1, and φ(s) is the situation information vector, φ: S → [0,1]^h.
Each agent is controlled to play the air game based on expert behavior to realize inverse reinforcement learning, and the target reward function is solved automatically. Suppose there is a series of air decision feature elements, φ: S → [0,1]^h; for example, the feature elements comprise the agent's basic spatial coordinates and motion indices (speed, turn angle), as well as the agent's fire-control system state, radar state, jamming-pod state, and so on. These feature indices are correlated with the reward; for rapid optimization this correlation is expressed as a linear combination, giving the target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector with h components and ‖w*‖ ≤ 1, and φ(s) is the situation information vector, so that the importance of each feature quantity is expressed in a normalized space. When the agents are controlled by the expert behavior to play the air game, a series of battlefield situations is generated; the set of battlefield situations corresponding to the i-th engagement is {s_t^(i) | t = 0, 1, ..., T}.
Assuming that there are p red-side aircraft and q blue-side aircraft, the collected situation information vector φ(s) may optionally include the following elements (a sketch of assembling such a vector is given after this list):
a) the pairwise relative distances between the aircraft;
b) the pairwise relative angles between the aircraft;
c) the absolute heading angles of the p + q aircraft;
d) the longitude and latitude coordinates of the p + q aircraft;
e) pq missile threat points, representing the threat posed by a missile in flight to an aircraft (the threat is proportional to the flight time and inversely proportional to the distance to the target aircraft);
f) the speeds of the p + q aircraft;
g) the relative positions of the p + q aircraft;
h) whether the own aircraft is the current attacking aircraft;
i) the missile states of the q enemy aircraft at the previous moment;
j) the missile states of the p friendly aircraft at the previous moment.
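A minimal sketch of how a situation information vector φ(s) of this kind might be assembled and squashed into [0,1]^h is given below; the dictionary keys and normalization bounds are illustrative assumptions and not prescribed by the patent.

```python
import numpy as np

def situation_vector(red, blue, max_range=200_000.0, max_speed=700.0):
    """Stack a few of the listed features and normalize them into [0, 1] (illustrative only).

    red, blue: lists of dicts with assumed keys 'pos' (east, north in m),
    'heading' (deg from north) and 'speed' (m/s)."""
    feats = []
    planes = red + blue
    for i in range(len(planes)):
        for j in range(i + 1, len(planes)):
            d = np.linalg.norm(np.subtract(planes[i]["pos"], planes[j]["pos"]))
            feats.append(min(d / max_range, 1.0))        # pairwise relative distance
    for p in planes:
        feats.append((p["heading"] % 360.0) / 360.0)     # absolute heading angle
        feats.append(min(p["speed"] / max_speed, 1.0))   # speed
    return np.asarray(feats, dtype=np.float32)           # phi(s) in [0, 1]^h
```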
In one embodiment, controlling each agent to play the air game based on expert behavior to realize inverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and computing its corresponding feature expectation μ(π_0), with i = 0, 0 ≤ i ≤ m, where m is the number of engagements;
In one embodiment, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t φ(s_t)|π_i
where γ is the discount factor and φ(s_t)|π_i is the situation information vector of the battlefield situation at time t under action strategy π_i.
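Under the assumption that the expectation is estimated from a single recorded trajectory of situation vectors, the formula above can be computed as follows.

```python
import numpy as np

def feature_expectation(phi_trajectory, gamma):
    """mu(pi) = sum_t gamma^t * phi(s_t) for one engagement played under policy pi."""
    phi_trajectory = np.asarray(phi_trajectory, dtype=np.float64)  # shape (T+1, h)
    discounts = gamma ** np.arange(len(phi_trajectory))
    return (discounts[:, None] * phi_trajectory).sum(axis=0)
```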
In one embodiment, controlling each agent to play the air game based on expert behavior to realize inverse reinforcement learning further comprises the following steps:
generating, for each engagement, the set of battlefield situations {s_t^(i) | t = 0, 1, ..., T} produced by the corresponding action strategy, and calculating the global feature expectation μ̂ of the m air-game engagements by the formula:
μ̂ = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t φ(s_t^(i))
where s_t^(i) is the environment state at time t of the i-th engagement and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th engagement.
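Averaging per-engagement feature expectations over the m recorded engagements then gives the global estimate; a short sketch reusing the feature_expectation helper from the previous sketch:

```python
def global_feature_expectation(phi_trajectories, gamma):
    """Estimate of the (expert) feature expectation from m recorded engagements."""
    mus = [feature_expectation(traj, gamma) for traj in phi_trajectories]
    return sum(mus) / len(mus)
```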
S323, calculating the feature error between the current action strategy and the expert action strategy with a min-max algorithm,
e_i = max_{w:‖w‖≤1} min_{j∈{0,...,i−1}} w^T(μ_E − μ_j),
taking the maximizing w as the current reward-function weight vector w_i and updating the reward-function weight vector w* to (w_i)^T, where μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the current action strategy, and T denotes transposition;
S324, judging whether the feature error e_i is less than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, computing the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation to μ(π_i), setting i = i + 1 and returning to step S323 for the next iteration. For accuracy within acceptable computation time, the error threshold for e_i is set to 10⁻⁵ in this embodiment: after several iterations, once e_i < 10⁻⁵ the feature error e_i is judged to be below the threshold (a code sketch of one such weight/error update is given below).
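As a sketch of the weight/error update, the code below uses the projection variant of apprenticeship learning (Abbeel and Ng) as a QP-free stand-in for the max-min step described above; it is not the patent's exact procedure, but it produces a weight vector w and a feature error e that play the same roles.

```python
import numpy as np

def projection_irl_step(mu_E, mu_list, mu_bar_prev):
    """One weight/error update via the projection variant of apprenticeship learning
    (a stand-in for the max-min step in S323).

    mu_E: expert feature expectation; mu_list[-1]: feature expectation of the latest
    policy; mu_bar_prev: previous projected point (None on the first call)."""
    mu_latest = mu_list[-1]
    if mu_bar_prev is None:
        mu_bar = mu_latest
    else:
        d = mu_latest - mu_bar_prev
        num = np.dot(d, mu_E - mu_bar_prev)
        mu_bar = mu_bar_prev + (num / np.dot(d, d)) * d   # orthogonal projection step
    w = mu_E - mu_bar                                     # current reward weight vector
    e = np.linalg.norm(w)                                 # feature error e_i
    return w, e, mu_bar
```

The outer loop would stop once e falls below the error threshold (10⁻⁵ in the embodiment above) and return R*(s) = wᵀφ(s).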
Through the expert-behavior-based inverse reinforcement learning algorithm, the agent's combat effectiveness is brought close to that of the expert, and the solution of the target reward function can be automated while remaining interpretable.
S33, the similarity R_ε between each agent's behavior and the expert behavior is calculated.
In one embodiment, in step S33, the similarity R_ε between each agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, where μ(S_t|θ^μ) denotes the action generated by strategy μ with neural-network parameters θ^μ in environment state S_t;
S332, generating an action a_e as the imitation target from the current environment state S_t using the expert behavior strategy;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior with the Kullback-Leibler divergence, by the following formulas:
for a discrete action a_t, R_ε is computed from the divergence
D_KL(a_t ‖ a_e) = Σ_{x∈X} a_t(x) log( a_t(x) / a_e(x) )
where a_t(x) is the probability that the agent performs behavior x, a_e(x) is the probability that the expert performs behavior x, x is a behavior and X is the behavior set; for example, if X = {fly north, fly south, fly east, fly west}, then "fly north" is a behavior;
for a continuous action a_t, R_ε is computed from the divergence
D_KL(a_t ‖ a_e) = ∫_X a_t(x) log( a_t(x) / a_e(x) ) dx
where a_t(x) is the probability density of the agent performing behavior x, a_e(x) is the probability density of the expert performing behavior x, x is a behavior and X is the behavior set, for example X = {fly at x degrees from the horizontal, x ∈ [0°, 180°]}.
The DDPG algorithm outputs a concrete action according to the current environment state and action strategy and executes it, and keeps producing action commands according to the real-time situation until the engagement ends. Every time the DDPG algorithm outputs a behavior, the expert behavior strategy makes a decision in the same situation, but this decision is not applied to the environment state; that is, the expert behavior does not take effect and serves only as feedback for the DDPG action decision, and the similarity between the expert behavior and the agent's action is calculated. A high similarity means the current intelligent-agent game decision model imitates the expert behavior well.
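For a discrete behavior set, the KL-based similarity can be computed as below. Treating R_ε as the negative KL divergence, so that closer imitation yields a larger reward, is an interpretation of the text; the small constant added for numerical stability is likewise an implementation assumption.

```python
import numpy as np

def kl_similarity(agent_probs, expert_probs, eps=1e-12):
    """R_eps interpreted as -KL(a_t || a_e) over a discrete behavior set X."""
    p = np.asarray(agent_probs, dtype=np.float64) + eps
    q = np.asarray(expert_probs, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return -float(np.sum(p * np.log(p / q)))

# Example: nearly identical distributions give a similarity close to 0 (strong imitation).
print(kl_similarity([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]))
```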
S34, the composite reward r_t is obtained, given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
where w_b is the weight coefficient of the maximum cumulative reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε.
S4, the intelligent-agent game decision model is trained and the composite reward r_t is compared with the reward threshold; if it is greater than the reward threshold, the training terminates and the final intelligent-agent game decision model is obtained; otherwise, the iterative training continues in a loop until the maximum number of training engagements n is reached or the composite reward r_t exceeds the reward threshold.
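The composite reward itself is a plain weighted sum; a one-line sketch follows, with example weight values that are assumptions rather than values taken from the patent.

```python
def composite_reward(R_b, R_star, R_eps, w_b=0.4, w_i=0.4, w_eps=0.2):
    """r_t = w_b*R_b + w_i*R*(s) + w_eps*R_eps (the example weights are assumptions)."""
    return w_b * R_b + w_i * R_star + w_eps * R_eps
```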
The application mainly uses the DDPG algorithm for expert-agent decision-behavior simulation training so as to obtain an intelligent-agent game decision model capable of producing expert-like behavior. The behavior simulation training process is one in which the agent generates actions according to the situation to obtain returns and continuously accumulates experience, the generated actions shifting toward those with high return until a stable high return is reached and the model converges, thereby obtaining the final high-intelligence intelligent-agent game decision model for air gaming.
In an embodiment, in step S4, training the intelligent-agent game decision model comprises the following steps:
S41, initializing the online neural-network parameters θ^μ of the actor network and the online neural-network parameters θ^Q of the critic network in the intelligent-agent game decision model; the actor network outputs a behavior according to the battlefield situation, and the critic network scores the behavior according to the battlefield situation and the behavior.
S42, copying the online neural-network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ, θ^{Q′} ← θ^Q;
S43, initializing the replay buffer R;
S44, performing the following steps for the L-th engagement (L = 1, 2, ..., n):
S441, initializing the random-process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, ..., T:
a) each agent executes its corresponding action and the environment state transitions, yielding the next environment state S_{t+1} and the composite reward r_t;
b) the environment-state transition (S_t, a_t, r_t, S_{t+1}) is stored as a tuple in the replay buffer R to serve as the training data set;
c) U tuples are randomly sampled from the replay buffer R as mini-batch training data, and the label is computed as
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′})
where μ′(s_{L+1}|θ^{μ′}) denotes the action generated by the target strategy μ′ with parameters θ^{μ′} in environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) is the value of the target Q function with parameters θ^{Q′} for environment state s_{L+1} and that action;
d) the loss function of the critic network is minimized, the loss function being
Loss = (1/U)·Σ_L ( y_L − Q(s_L, a_L|θ^Q) )²
e) the policy gradient of the actor network is updated:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
where Q(s, a|θ^Q) denotes the value of the Q function with parameters θ^Q for environment state s and executed action a, ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a, μ(s|θ^μ) denotes the action generated by strategy μ with parameters θ^μ in environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated:
θ^{Q′} ← τ·θ^Q + (1−τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1−τ)·θ^{μ′}
where τ is an adjustable coefficient, τ ∈ [0,1];
g) the composite reward r_t is compared with the reward threshold; if it is greater than the reward threshold, the training terminates and the final intelligent-agent game decision model is obtained; otherwise, the process returns to step S44 for further iterative training until the maximum number of training engagements n is reached or the composite reward r_t exceeds the reward threshold. For example, if the preset upper limit of the reward threshold is 100 and the user sets the reward threshold to 80 according to actual needs, then training terminates when the composite reward r_t exceeds 80; otherwise the process returns to step S44 for iterative training until the maximum number of training engagements n is reached or the composite reward r_t exceeds 80.
The agent's behavior is decision-trained with the deep reinforcement learning algorithm, the similarity between expert behavior and agent behavior is used as feedback, and the deep networks are continuously updated; the mini-batch update in steps c) through f) is sketched in code below. By training the intelligent-agent game decision model, decision experience can be obtained from expert prior knowledge; as the similarity between agent behavior and expert behavior increases during training, the level of behavior imitation gradually improves, and the final high-intelligence intelligent-agent game decision model is obtained for air gaming.
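The mini-batch update in steps c) through f) follows the standard DDPG update. The PyTorch sketch below shows one such update with the soft target-network update of step f); the framework choice, network sizes, learning rates and τ are illustrative assumptions and are not prescribed by the patent.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    """Small fully connected network (layer sizes are illustrative)."""
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

class DDPGUpdater:
    def __init__(self, s_dim, a_dim, gamma=0.99, tau=0.005, lr=1e-3):
        self.actor = mlp(s_dim, a_dim)             # online actor mu(s | theta_mu)
        self.critic = mlp(s_dim + a_dim, 1)        # online critic Q(s, a | theta_Q)
        self.actor_tgt = mlp(s_dim, a_dim)         # target actor mu'
        self.critic_tgt = mlp(s_dim + a_dim, 1)    # target critic Q'
        self.actor_tgt.load_state_dict(self.actor.state_dict())    # step S42: copy online -> target
        self.critic_tgt.load_state_dict(self.critic.state_dict())
        self.opt_a = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.opt_c = torch.optim.Adam(self.critic.parameters(), lr=lr)
        self.gamma, self.tau = gamma, tau

    def update(self, s, a, r, s2):
        """One mini-batch update; s, a, r, s2 are tensors of shape
        (U, s_dim), (U, a_dim), (U, 1), (U, s_dim) sampled from the replay buffer."""
        # c) label y = r + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
        with torch.no_grad():
            y = r + self.gamma * self.critic_tgt(torch.cat([s2, self.actor_tgt(s2)], dim=1))
        # d) minimize the critic loss (1/U) * sum (y - Q(s, a))^2
        critic_loss = nn.functional.mse_loss(self.critic(torch.cat([s, a], dim=1)), y)
        self.opt_c.zero_grad(); critic_loss.backward(); self.opt_c.step()
        # e) actor update along the deterministic policy gradient
        actor_loss = -self.critic(torch.cat([s, self.actor(s)], dim=1)).mean()
        self.opt_a.zero_grad(); actor_loss.backward(); self.opt_a.step()
        # f) soft update: theta' <- tau*theta + (1 - tau)*theta'
        for tgt, src in ((self.actor_tgt, self.actor), (self.critic_tgt, self.critic)):
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.data.mul_(1.0 - self.tau).add_(self.tau * p.data)
```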
In one embodiment, the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated with a gradient-descent method. The return of the generated actions changes with each update, and the overall trend is an increase in return.
In conclusion, the method combines key elements of the air scenario of the intelligent-agent game decision model as reward factors and shapes continuous non-sparse reward functions, so that the reward design is clear, explicit and interpretable while the agent's behavior control can still be manually intervened; the expert-behavior-based inverse reinforcement learning algorithm brings the agent's combat effectiveness close to that of the expert, and the solution of the target reward function can be automated while remaining interpretable; the expert-behavior-based decision simulation method rapidly raises the agent's decision level, overcomes slow or even failed convergence, and fits linear and nonlinear reward functions so that the agent can imitate complex expert behaviors, thereby reducing labor cost; continuous non-sparse shaped rewards, inverse-reinforcement-learning rewards and behavior-imitation similarity rewards are considered simultaneously, and their weights can be adjusted to user requirements, raising the agent's intelligent decision level; the cold-start problem of training is solved by the behavior simulation method, the traditionally inefficient reward-function design process and the random exploration process of early training are improved, the decision-making capability is optimized, and a decision-making agent of high intelligence level and practical value can be obtained quickly.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments only express several specific and detailed implementations of the present application and should not be understood as limiting the scope of the claims. It should be noted that several variations and improvements can be made by a person skilled in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

1. A behavior simulation training method facing to air intelligent game is characterized in that: the behavior simulation training method for the air intelligent game comprises the following steps:
s1, constructing an air countermeasure simulation environment and an agent of both red and blue, constructing an agent game decision model based on a deep reinforcement learning algorithm, and realizing circular interaction between the air countermeasure simulation environment and each agent;
s2, determining the environment state and the action space of each agent, wherein the action space comprises at least one action a t Shaping each of said actions a t Is continuous non-sparse reward function R t
S3, performing air game match in the intelligent agent game decision model, wherein the game match time is T, and each intelligent agent performs game match according to the current environment state S t T =0,1,2, \8230, T, performing the following steps:
s31, determining action a required to be executed t The action performed a t Generating a next environment state S after acting on the air countermeasure simulation environment t+1 And obtaining the corresponding reward function R t In turn, the iteration is circulated to realize the maximum accumulated reward R b
S32, controlling each intelligent agent to carry out air game based on expert behaviors to realize reverse reinforcement learning, and obtaining a target reward function R * (s)=w * Phi(s), wherein w * Is a reward function weight vector, h in total,
Figure FDA0003029048010000011
‖w * | < 1, phi(s) is a situation information vector phi: → [0,1] h
S33, calculating similarity R of behaviors of the agents and the expert behaviors ε
S34, obtaining the comprehensive reward r_t, which satisfies the following formula:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε,
wherein w_b is the weight coefficient of the maximized accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε;
S4, training the agent game decision model and judging whether the comprehensive reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent game decision model; otherwise, performing loop iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t exceeds the reward threshold.
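For illustration only, the following Python sketch shows how the composite reward of step S34 and the termination test of step S4 could be computed; the weight values, the reward threshold and the episode budget are placeholders, not values taken from the patent.

```python
# Minimal sketch of the composite reward r_t = w_b*R_b + w_i*R*(s) + w_eps*R_eps
# (step S34) and the stopping rule of step S4. All numeric defaults are assumptions.

def composite_reward(r_shaped: float, r_irl: float, r_similarity: float,
                     w_b: float = 0.4, w_i: float = 0.3, w_eps: float = 0.3) -> float:
    """Weighted sum of the shaped reward R_b, the IRL reward R*(s) and the
    behaviour-similarity reward R_eps; the weights are user-adjustable."""
    return w_b * r_shaped + w_i * r_irl + w_eps * r_similarity

def training_finished(r_t: float, episode: int,
                      reward_threshold: float = 100.0, max_episodes: int = 10_000) -> bool:
    """Stop once the composite reward exceeds the threshold or the episode budget n is spent."""
    return r_t > reward_threshold or episode >= max_episodes
```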
2. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
3. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that the action space comprises at least one of the actions a_t: target pursuit, target avoidance, tangential escape from enemy lock, cross attack, and snake maneuver.
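As a simple data-structure illustration, the discrete action space of claim 3 could be represented as an enumeration; the identifier names below are hypothetical, not taken from the patent.

```python
from enum import Enum, auto

class AirCombatAction(Enum):
    """The five tactical actions listed in claim 3 (names are illustrative)."""
    TARGET_PURSUIT = auto()
    TARGET_AVOIDANCE = auto()
    TANGENTIAL_ESCAPE_FROM_LOCK = auto()
    CROSS_ATTACK = auto()
    SNAKE_MANEUVER = auto()
```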
4. The behavior simulation training method for air intelligent gaming according to claim 3, characterized in that the continuous non-sparse reward function R_t shaped for each action a_t is as follows:
1) the reward function R_t1 of target pursuit satisfies the condition given by the corresponding formula (reproduced only as an image in the original publication);
2) the reward function R_t2 of target avoidance satisfies: R_t2 = Δθ_f + a′ + A + B;
3) the reward function R_t3 of tangential escape from enemy lock satisfies: R_t3 = −Δθ_t + a′ + A + B;
4) the reward function R_t4 of cross attack satisfies the condition given by the corresponding formula (reproduced only as an image in the original publication);
5) the reward function R_t5 of the snake maneuver satisfies the condition given by the corresponding formula (reproduced only as an image in the original publication);
wherein Δθ_c = |θ − δ| − ε, Δθ_f = |θ − δ|, Δθ_t = |θ − δ| − 90°, Δθ_j = |θ − δ′|, Δθ_s = |θ − δ| − (90° − σ); θ is the absolute orientation angle of the red agent; θ_e is the absolute orientation angle of the blue agent nearest to the red agent; x is the distance between the red agent and the nearest blue agent; x_a is the distance between the red agent's nearest friendly aircraft and the nearest blue agent; δ is the angle between the red agent and the nearest blue agent, and δ′ and ε are likewise angles defined with respect to the nearest blue agent; a′ is the acceleration of the red agent; A is the difference between the blue side's losses and the red side's losses; B is the number of red-side wins; v is the speed of the blue agent and v_a is the speed of the red agent's nearest friendly aircraft; α is a weight coefficient; σ is an adjustable angle parameter with σ ∈ [0°, 90°]; T_1 is the period of movement-direction change in the snake maneuver, 0 < T_1 ≤ T.
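The two shaped rewards whose formulas survive in the text, R_t2 and R_t3, can be written directly as functions of the claimed symbols. The sketch below assumes angles in degrees; the remaining rewards (R_t1, R_t4, R_t5) are omitted because their formulas appear only as images, and the variable names are illustrative rather than taken from the patent.

```python
# Hedged sketch of the target-avoidance reward R_t2 = Δθ_f + a' + A + B and the
# tangential-escape reward R_t3 = -Δθ_t + a' + A + B, following the symbol
# definitions of claim 4 (theta and delta in degrees).

def reward_target_avoidance(theta: float, delta: float, acceleration: float,
                            loss_difference: float, red_wins: float) -> float:
    delta_theta_f = abs(theta - delta)                 # Δθ_f = |θ − δ|
    return delta_theta_f + acceleration + loss_difference + red_wins

def reward_tangential_escape(theta: float, delta: float, acceleration: float,
                             loss_difference: float, red_wins: float) -> float:
    delta_theta_t = abs(theta - delta) - 90.0          # Δθ_t = |θ − δ| − 90°
    return -delta_theta_t + acceleration + loss_difference + red_wins
```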
5. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that controlling each agent to play the air game based on expert behaviors to realize inverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and calculating the feature expectation μ(π_0) corresponding to the initial action strategy π_0, with i = 0, 0 ≤ i ≤ m, where m is the number of game rounds;
S323, calculating the feature error e_i between the current action strategy and the expert action strategy by a min-max algorithm (the formula appears only as an image in the original publication), obtaining the current reward function weight vector w_i, and updating the reward function weight vector w* to (w_i)^T, wherein μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the current action strategy, and T denotes the transpose;
S324, judging whether the feature error e_i is smaller than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, calculating the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation to μ(π_i), setting i = i + 1, and returning to step S323 to iterate.
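Since the min-max formula of step S323 appears only as an image, the sketch below follows the standard projection variant of apprenticeship learning (Abbeel–Ng style) as an assumed reading of the claim: the weight vector points from a running estimate of the agent's feature expectation toward the expert's, and the residual norm plays the role of the feature error e_i.

```python
from typing import Optional
import numpy as np

def irl_projection_step(mu_expert: np.ndarray, mu_new: np.ndarray,
                        mu_bar_prev: Optional[np.ndarray] = None):
    """One inverse-RL iteration: returns (w_i, e_i, mu_bar).

    mu_expert   : expert feature expectation mu_E
    mu_new      : feature expectation mu(pi_i) of the latest agent policy
    mu_bar_prev : running projection from the previous iteration (None on the first call)
    """
    if mu_bar_prev is None:
        mu_bar = mu_new                                   # first iteration: mu_bar = mu(pi_0)
    else:
        d = mu_new - mu_bar_prev                          # project mu_E onto the new direction
        mu_bar = mu_bar_prev + (d @ (mu_expert - mu_bar_prev)) / (d @ d + 1e-12) * d
    w = mu_expert - mu_bar
    error = float(np.linalg.norm(w))                      # feature error e_i
    return w / (error + 1e-12), error, mu_bar             # normalising keeps ||w|| <= 1
```

The outer loop would then train a new policy π_i against the reward w·φ(s) with the chosen deep RL algorithm, recompute μ(π_i), and call the step again until the error drops below the threshold of step S324.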
6. The behavior simulation training method for air intelligent gaming according to claim 5, characterized in that the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t·φ(s_t) | π_i,
where γ is the discount factor and φ(s_t)|π_i is the battlefield situation at time t under the action strategy π_i.
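The discounted feature expectation of claim 6 can be estimated from a single rollout of a policy as a discounted sum of situation vectors; the summation form below is an assumed reconstruction, since the patent's own formula is shown only as an image.

```python
import numpy as np

def feature_expectation(phi_trajectory: list[np.ndarray], gamma: float = 0.99) -> np.ndarray:
    """Discounted sum of situation vectors: mu(pi) ≈ Σ_t γ^t · φ(s_t),
    where phi_trajectory[t] is φ(s_t) observed at step t under the policy."""
    return sum((gamma ** t) * phi_t for t, phi_t in enumerate(phi_trajectory))
```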
7. The behavior simulation training method for air intelligent gaming according to claim 6, characterized in that controlling each agent to play the air game based on expert behaviors to realize inverse reinforcement learning further comprises the following steps:
generating a battlefield situation set { s_t^{(i)} | i = 1, …, m; t = 0, …, T } according to the action strategy corresponding to each game round, and calculating the global feature expectation of the m rounds of air games according to the following formula:
global feature expectation = (1/m)·Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t·φ(s_t^{(i)}),
wherein s_t^{(i)} is the environment state at time t of the i-th round and φ(s_t^{(i)}) is the situation information vector corresponding to the battlefield situation at time t of the i-th round.
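Averaging the same discounted sums over the m recorded game rounds gives the global feature expectation of claim 7; the averaging form is an assumption, since the symbol and formula appear only as images in the original publication.

```python
import numpy as np

def global_feature_expectation(games: list[list[np.ndarray]], gamma: float = 0.99) -> np.ndarray:
    """games[i][t] is the situation vector φ(s_t^{(i)}) at step t of the i-th round;
    returns (1/m) · Σ_i Σ_t γ^t · φ(s_t^{(i)})."""
    per_game = [sum((gamma ** t) * phi for t, phi in enumerate(traj)) for traj in games]
    return sum(per_game) / len(games)
```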
8. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that, in step S33, the similarity R_ε between the behaviors of each agent and the expert behaviors is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, wherein μ(S_t|θ^μ) represents the action generated by the strategy μ with neural network parameters θ^μ in the environment state S_t;
S332, according to the current environment state S_t, generating an action a_e as the imitation object using the expert behavior strategy;
S333, calculating the similarity R_ε between the agent behaviors and the expert behaviors by the Kullback-Leibler divergence method, as follows:
for a discrete action a_t, R_ε is given by the discrete KL-divergence formula (reproduced only as an image in the original publication), wherein a_t(x) represents the probability of the agent performing behavior x, a_e(x) represents the probability of the expert performing behavior x, and X is the behavior set;
for a continuous action a_t, R_ε is given by the continuous KL-divergence formula (reproduced only as an image in the original publication), wherein a_t(x) represents the probability density of the agent's behavior x, a_e(x) represents the probability density of the expert's behavior x, and X is the behavior set.
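For the discrete case of claim 8, a behaviour-similarity reward based on the Kullback–Leibler divergence could look like the sketch below; the direction of the divergence and the sign convention (negating so that a larger value means closer to the expert) are assumptions, since the patent's formulas are shown only as images. For continuous actions the sum becomes an integral over the densities (closed form is available for Gaussians).

```python
import numpy as np

def kl_similarity(agent_probs: np.ndarray, expert_probs: np.ndarray, eps: float = 1e-12) -> float:
    """R_eps ≈ -D_KL(a_e || a_t) over the behaviour set X, where
    agent_probs[x] = a_t(x) and expert_probs[x] = a_e(x)."""
    a_t = np.clip(agent_probs, eps, 1.0)
    a_e = np.clip(expert_probs, eps, 1.0)
    kl = float(np.sum(a_e * np.log(a_e / a_t)))
    return -kl          # 0 means identical behaviour; more negative means less similar
```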
9. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that, in step S4, training the agent game decision model comprises the following steps:
S41, initializing the online neural network parameters θ^μ of the actor network and the online neural network parameters θ^Q of the critic network in the agent game decision model;
S42, copying the online neural network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ and θ^{Q′} ← θ^Q;
S43, initializing the replay cache R;
S44, for the L-th game round (L = 1, 2, …, n), executing the following steps:
S441, initializing the random process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, …, T, executing the following steps:
a) each agent executes its corresponding action and the environment state transitions, obtaining the next environment state S_{t+1} and the comprehensive reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay cache R to serve as the training data set;
c) randomly sampling U arrays from the replay cache R as mini-batch training data and computing the label
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}),
wherein μ′(s_{L+1}|θ^{μ′}) represents the action generated by the target strategy μ′ with parameters θ^{μ′} in the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) represents the value of the target Q function with parameters θ^{Q′} in the environment state s_{L+1} when executing that action;
d) minimizing the loss function of the critic network, the loss function being
L = (1/U)·Σ_L (y_L − Q(s_L, a_L|θ^Q))²;
e) updating the policy gradient of the actor network:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L},
wherein Q(s, a|θ^Q) represents the value of the Q function with parameters θ^Q in the environment state s when executing the action a, ∇_a Q(s, a|θ^Q) represents the partial derivative of Q(s, a|θ^Q) with respect to a, μ(s|θ^μ) represents the action generated by the strategy μ with parameters θ^μ in the environment state s, and ∇_{θ^μ} μ(s|θ^μ) represents the partial derivative of μ(s|θ^μ) with respect to θ^μ;
f) updating the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network:
θ^{Q′} ← τ·θ^Q + (1 − τ)·θ^{Q′},
θ^{μ′} ← τ·θ^μ + (1 − τ)·θ^{μ′},
wherein τ is an adjustable coefficient, τ ≪ 1;
g) judging whether the comprehensive reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent game decision model; otherwise, returning to step S44 for loop iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t exceeds the reward threshold.
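Steps c) to f) of claim 9 correspond to one standard DDPG update. The PyTorch sketch below is an illustration under assumed interfaces (the actor maps states to actions, the critic maps state-action pairs to values, and `batch` holds the sampled transitions); it is not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma: float = 0.99, tau: float = 0.005):
    """One update over a mini-batch (S_t, a_t, r_t, S_{t+1}) sampled from the replay cache R."""
    s, a, r, s_next = batch

    # Step c): label y_L = r_L + γ · Q'(s_{L+1}, μ'(s_{L+1}|θ^{μ'}) | θ^{Q'})
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Step d): minimise the critic loss (1/U) Σ (y_L − Q(s_L, a_L|θ^Q))²
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step e): deterministic policy gradient — maximise Q(s, μ(s)) with respect to θ^μ
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step f): soft update θ' ← τ·θ + (1 − τ)·θ' for both target networks
    for target, online in ((critic_target, critic), (actor_target, actor)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```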
10. The behavior simulation training method for air intelligent gaming according to claim 9, characterized in that the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by a gradient descent method.
CN202110425153.0A 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game Active CN113221444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110425153.0A CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110425153.0A CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Publications (2)

Publication Number Publication Date
CN113221444A CN113221444A (en) 2021-08-06
CN113221444B true CN113221444B (en) 2023-01-03

Family

ID=77088029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110425153.0A Active CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Country Status (1)

Country Link
CN (1) CN113221444B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867178B (en) * 2021-10-26 2022-05-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN114021737B (en) * 2021-11-04 2023-08-22 中国电子科技集团公司信息科学研究院 Reinforced learning method, system, terminal and storage medium based on game
CN113893539B (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN115648204A (en) * 2022-09-26 2023-01-31 吉林大学 Training method, device, equipment and storage medium of intelligent decision model
CN115470710B (en) * 2022-09-26 2023-06-06 北京鼎成智造科技有限公司 Air game simulation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291890B (en) * 2020-05-13 2021-01-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112052511A (en) * 2020-06-15 2020-12-08 成都蓉奥科技有限公司 Air combat maneuver strategy generation technology based on deep random game
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Also Published As

Publication number Publication date
CN113221444A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN113221444B (en) Behavior simulation training method for air intelligent game
CN113791634B (en) Multi-agent reinforcement learning-based multi-machine air combat decision method
CN113050686B (en) Combat strategy optimization method and system based on deep reinforcement learning
CN113093802B (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN113396428B (en) Learning system, computer program product and method for multi-agent application
CN113095481A (en) Air combat maneuver method based on parallel self-game
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
CN113893539B (en) Cooperative fighting method and device for intelligent agent
CN115755956B (en) Knowledge and data collaborative driving unmanned aerial vehicle maneuvering decision method and system
CN116185059A (en) Unmanned aerial vehicle air combat autonomous evasion maneuver decision-making method based on deep reinforcement learning
CN115933717A (en) Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
CN114167756B (en) Multi-unmanned aerial vehicle collaborative air combat decision autonomous learning and semi-physical simulation verification method
CN113741500A (en) Unmanned aerial vehicle air combat maneuver decision method for imitating Harris eagle intelligent predation optimization
Bae et al. Deep reinforcement learning-based air-to-air combat maneuver generation in a realistic environment
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN111830848A (en) Unmanned aerial vehicle super-maneuvering flight performance simulation training system and method
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
CN113741186A (en) Double-machine air combat decision method based on near-end strategy optimization
CN116225065A (en) Unmanned plane collaborative pursuit method of multi-degree-of-freedom model for multi-agent reinforcement learning
CN115457809A (en) Multi-agent reinforcement learning-based flight path planning method under opposite support scene
Wang et al. Research on autonomous decision-making of UCAV based on deep reinforcement learning
CN113093803B (en) Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV
Wang et al. Research on naval air defense intelligent operations on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant