CN113221444A - Behavior simulation training method for air intelligent game

Behavior simulation training method for air intelligent game

Info

Publication number
CN113221444A
Authority
CN
China
Prior art keywords
reward
intelligent
action
agent
training
Prior art date
Legal status
Granted
Application number
CN202110425153.0A
Other languages
Chinese (zh)
Other versions
CN113221444B (en)
Inventor
包骐豪
朱燎原
夏少杰
瞿崇晓
Current Assignee
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 52 Research Institute filed Critical CETC 52 Research Institute
Priority to CN202110425153.0A priority Critical patent/CN113221444B/en
Publication of CN113221444A publication Critical patent/CN113221444A/en
Application granted granted Critical
Publication of CN113221444B publication Critical patent/CN113221444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a behavior simulation training method for air intelligent gaming, which comprises the following steps: S1, constructing an intelligent agent game decision model; S2, determining the environment state and the action space, and shaping a continuous non-sparse reward function for each action; S3, conducting air game matches in the model and executing the following steps: S31, generating the next environment state according to the executed action and obtaining the corresponding reward, iterating cyclically in turn to realize the maximum accumulated reward; S32, realizing reverse reinforcement learning based on expert behaviors and obtaining a target reward function; S33, calculating the similarity between each intelligent agent's behavior and the expert behavior; S34, obtaining a comprehensive reward; and S4, training the intelligent agent game decision model. The method improves the traditionally inefficient reward-function design process and the random exploration process during model training, so that the reward function is interpretable and open to human intervention, the decision level and convergence speed of the intelligent agent are improved, and the cold-start problem of model training is solved.

Description

Behavior simulation training method for air intelligent game
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a behavior simulation training method for an air intelligent game.
Background
Future air gaming needs to acquire the most accurate information from various detection systems at any time and place to realize an information advantage, and, more importantly, to realize a decision advantage by exploiting technologies such as machine learning, artificial intelligence and cloud computing. To better mine information for decision advantage, exert gaming effectiveness and ensure air superiority, an aerial auxiliary decision support system matched with the pilot is needed in addition to the pilot's excellent flying skills and the commander's good command ability. As an artificial intelligence assistant system, the aerial auxiliary decision support system can provide reference decision schemes in highly dynamic and complex confrontation environments, reduce the pilot's decision burden, better mine information to realize decision advantages, exert gaming effectiveness and ensure air superiority.
However, existing aerial auxiliary decision support systems are relatively backward: the number of sensing parameters or targets that can be handled simultaneously is limited, and the robustness, timeliness and accuracy of decision support are poor. Moreover, in game decision-making, the high decision dimensionality makes the training model difficult to converge, so training a practical intelligent agent takes a long time or may fail to produce an effective decision-making agent at all. In aerial intelligent self-play confrontation, sparse rewards and a complex, inefficient reward design process keep the intelligent agent's decision level low and the training time long; meanwhile, rewards must be manually customized for each scenario, which incurs high labor cost, poor reusability and a cold-start problem in algorithm training.
Disclosure of Invention
The invention aims to provide a behavior simulation training method oriented to air intelligent gaming, which improves the traditionally inefficient reward-function design process and the random exploration process during model training, so that the reward function is clearly and explicitly designed and interpretable and the behavior control of the intelligent agent can be manually intervened, rapidly improves the decision level and convergence speed of the intelligent agent, gives the intelligent agent the ability to imitate complex expert behaviors, reduces labor cost, and solves the cold-start problem of model training.
To achieve this purpose, the invention adopts the following technical scheme:
The invention provides a behavior simulation training method oriented to air intelligent gaming, which comprises the following steps:
S1, constructing an air confrontation simulation environment and red and blue intelligent agents, constructing an intelligent agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each intelligent agent;
S2, determining the environment state and the action space of each intelligent agent, wherein the action space comprises at least one action a_t, and shaping a continuous non-sparse reward function R_t for each action a_t;
S3, carrying out air game matches in the intelligent agent game decision model, wherein the match duration is T, and each intelligent agent, according to the current environment state S_t, t = 0, 1, 2, ..., T, performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward function R_t, and this is iterated cyclically in turn to realize the maximum accumulated reward R_b;
S32, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning, and obtaining a target reward function R*(s) = w*·φ(s), wherein w* is a reward function weight vector with h components, w* ∈ R^h, ||w*|| ≤ 1, and φ(s) is a situation information vector, φ: S → [0,1]^h;
S33, calculating the similarity R_ε between each intelligent agent's behavior and the expert behavior;
S34, obtaining the comprehensive reward r_t, where the comprehensive reward r_t is given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
wherein w_b is the weight coefficient of the maximum accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε;
S4, training the intelligent agent game decision model and judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise continuing the cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold.
Preferably, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
Preferably, the action space comprises at least one action a_t among target pursuit, target avoidance, tangentially escaping enemy locking, cross attack and serpentine maneuver.
Preferably, the continuous non-sparse reward function R_t of each action a_t is shaped as follows:
1) the reward function R_t1 for target pursuit satisfies the following condition:
R_t1: piecewise expression (equation image in the original) combining Δθ_c, a′, A and B
2) the reward function R_t2 for target avoidance satisfies the following condition:
R_t2 = Δθ_f + a′ + A + B
3) the reward function R_t3 for tangentially escaping enemy locking satisfies the following condition:
R_t3 = -Δθ_t + a′ + A + B
4) the reward function R_t4 for cross attack satisfies the following condition:
R_t4: piecewise expression (equation image in the original) switching on the sign of x_a - x and combining Δθ_c or Δθ_j, a′, A and B
5) the reward function R_t5 for serpentine maneuver satisfies the following condition:
R_t5: piecewise expression (equation image in the original) alternating with period T_1 and combining Δθ_s, the radial velocity component, A and B
wherein Δθ_c = |θ - δ| - ε, Δθ_f = |θ - δ|, Δθ_t = |θ - δ| - 90°, Δθ_j = |θ - δ′|, Δθ_s = |θ - δ| - (90° - σ); θ is the absolute orientation angle of the red agent, θ_e is the absolute orientation angle of the blue agent closest to the red agent, x is the distance between the red agent and the closest blue agent, x_a is the distance between the red agent's closest friendly aircraft and the closest blue agent, δ is the angle of the line connecting the red agent and the closest blue agent, δ′ is the angle of the line connecting the red agent and its closest friendly aircraft, ε is the deviation angle between the red agent and the closest blue agent, a′ is the acceleration of the red agent, A is the difference between the number of blue agents lost and the number of red agents lost, B is the number of games won by the red agent, v is the speed of the red agent, v_a is the speed of the red agent's closest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter, σ ∈ [0°, 90°], and T_1 is the period of maneuvering-direction change in the serpentine maneuver, 0 < T_1 ≤ T.
Preferably, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0, calculating the feature expectation μ(π_0) corresponding to the initial action strategy π_0, and setting i = 0, where 0 ≤ i ≤ m and m is the number of game rounds;
S323, calculating the feature error between the current action strategy and the expert action strategy by a min-max algorithm,
e_i = max_{||w|| ≤ 1} min_{j ∈ {0, ..., i-1}} w^T (μ_E - μ_j),
obtaining the current reward function weight vector w_i as the w attaining this maximum, and updating the reward function weight vector w* to (w_i)^T, wherein μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the action strategy π_j obtained so far, and T denotes transposition;
S324, judging whether the feature error e_i is smaller than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, calculating the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation μ(π_i), setting i = i + 1, and returning to step S323 to iterate.
Preferably, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = E[ Σ_{t=0}^{T} γ^t φ(s_t) | π_i ]
where γ is the discount factor and φ(s_t)|π_i is the situation information vector of the battlefield situation at time t under the action strategy π_i.
Preferably, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning further comprises the following steps:
generating the battlefield situation set {s_t^(i)}, t = 0, 1, ..., T, from the action strategy corresponding to each game round, and calculating the global feature expectation μ_E over the m rounds of air games by the following formula:
μ_E = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t φ(s_t^(i))
wherein s_t^(i) is the environment state at time t of the i-th round, and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th round.
Preferably, in step S33, the similarity R_ε between each intelligent agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, wherein μ(S_t|θ^μ) denotes the action generated by the strategy μ under the neural network parameters θ^μ and the environment state S_t;
S332, generating the action a_e to be imitated by the expert behavior strategy according to the current environment state S_t;
S333, calculating the similarity R_ε between the intelligent agent behavior and the expert behavior by the Kullback-Leibler divergence method, with the following formulas:
for a discrete action a_t:
R_ε = -Σ_{x ∈ X} a_t(x)·log( a_t(x) / a_e(x) )
wherein a_t(x) denotes the probability that the intelligent agent performs behavior x, a_e(x) denotes the probability that the expert performs behavior x, and X is the behavior set;
for a continuous action a_t:
R_ε = -∫_X a_t(x)·log( a_t(x) / a_e(x) ) dx
wherein a_t(x) denotes the probability density of the intelligent agent performing behavior x, a_e(x) denotes the probability density of the expert performing behavior x, and X is the behavior set.
Preferably, in step S4, training the intelligent agent game decision model comprises the following steps:
S41, initializing the online network parameters θ^μ of the actor network and the online network parameters θ^Q of the critic network in the intelligent agent game decision model;
S42, copying the online network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ and θ^{Q′} ← θ^Q;
S43, initializing the replay buffer R;
S44, for the L-th game round (L = 1, 2, ..., n), executing the following steps:
S441, initializing the random process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, ..., T, executing the following steps:
a) each intelligent agent executes its corresponding action, and the environment state transitions, giving the next environment state S_{t+1} and the comprehensive reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay buffer R to be used as the training data set;
c) randomly sampling U arrays (s_L, a_L, r_L, s_{L+1}) from the replay buffer R as mini-batch training data, with the label
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′})
wherein μ′(s_{L+1}|θ^{μ′}) denotes the action generated by the target strategy μ′ under the parameters θ^{μ′} and the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) denotes the value of the target Q function under the parameters θ^{Q′}, the environment state s_{L+1} and that action;
d) minimizing the loss function of the critic network, the loss function being
Loss = (1/U)·Σ_L ( y_L - Q(s_L, a_L|θ^Q) )²
e) updating the actor network by the policy gradient:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
wherein Q(s, a|θ^Q) denotes the value of the Q function under the parameters θ^Q, the environment state s and the executed action a, and ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a; μ(s|θ^μ) denotes the action generated by the strategy μ under the parameters θ^μ and the environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) updating the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network:
θ^{Q′} ← τ·θ^Q + (1 - τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1 - τ)·θ^{μ′}
wherein τ is an adjustable coefficient, τ ∈ [0,1];
g) judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise returning to step S44 for cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold.
Preferably, the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by a gradient descent method.
Compared with the prior art, the invention has the beneficial effects that:
1) Key elements of the air scenario of the intelligent agent game decision model are synthesized as reward factors and continuous non-sparse reward functions are shaped, so that the reward functions are clearly and explicitly designed and interpretable, and the behavior control of the intelligent agent can be manually intervened;
2) the reverse reinforcement learning algorithm based on expert behaviors brings the intelligent agent's combat effect close to that of the expert, and the solution of the target reward function is automated while remaining interpretable;
3) the decision simulation method based on expert behaviors can rapidly improve the decision level of the intelligent agent, solve the problem of difficult, slow or even impossible convergence, and fit linear and nonlinear reward functions so that the intelligent agent can imitate complex expert behaviors, thereby reducing labor cost;
4) continuous non-sparse shaped rewards, reverse reinforcement learning rewards and behavior-imitation similarity rewards are considered simultaneously, their weights can be adjusted according to user requirements, and the intelligent decision level of the intelligent agent is improved;
5) the behavior simulation method solves the cold-start problem of training, improves the traditionally inefficient reward-function design process and the random exploration process during initial training, optimizes decision-making capability, and quickly yields a decision-making intelligent agent with a high intelligence level and practical value.
Drawings
FIG. 1 is a schematic diagram of the cyclic interaction between an intelligent agent and the air confrontation simulation environment according to the present invention;
FIG. 2 is a block diagram of a method for reverse reinforcement learning of expert behavior according to the present invention;
FIG. 3 is a flow chart of the expert intelligence based behavioral decision simulation method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-3, a behavior simulation training method for air intelligent gaming comprises the following steps:
S1, constructing an air confrontation simulation environment and red and blue intelligent agents, constructing an intelligent agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each intelligent agent.
In one embodiment, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
The intelligent agent game decision model is constructed by the DDPG (Deep Deterministic Policy Gradient) deep reinforcement learning algorithm. The confrontation training of the intelligent agent game decision model links the constructed air confrontation simulation environment with the intelligent agents, here red and blue airplanes, so that the intelligent agents and the air confrontation simulation environment interact with and influence each other, as shown in FIG. 1. It should be noted that the intelligent agent can also be changed according to the application scenario, for example to a missile, and the intelligent agent game decision model can also be constructed based on the DQN algorithm or another existing deep reinforcement learning algorithm.
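As an aid to understanding, the following minimal Python sketch illustrates the cyclic agent-environment interaction of step S1. It is not the patent's implementation: the environment is a trivial stand-in with a Gym-style reset/step interface, and all class and method names (AirCombatEnv, Agent.act, Agent.observe) are illustrative assumptions.

```python
# Minimal sketch of the agent-environment interaction loop of step S1 (placeholder simulator).
import random

class AirCombatEnv:
    """Stand-in air confrontation environment with a Gym-style interface."""
    def reset(self):
        return {"red_pos": 0.0, "blue_pos": 10.0}            # toy initial state

    def step(self, actions):
        next_state = {"red_pos": actions["red"], "blue_pos": 10.0 - actions["blue"]}
        gap = abs(next_state["red_pos"] - next_state["blue_pos"])
        rewards = {"red": -gap, "blue": gap}                  # toy rewards
        return next_state, rewards, gap < 1.0                 # done when the gap closes

class Agent:
    def act(self, state):
        return random.uniform(0.0, 10.0)                      # placeholder policy

    def observe(self, state, action, reward, next_state):
        pass                                                  # store the transition for training

def run_match(env, agents, max_steps=100):
    """One game match: each agent acts, the environment transitions, rewards are fed back."""
    state = env.reset()
    for _ in range(max_steps):
        actions = {name: agent.act(state) for name, agent in agents.items()}
        next_state, rewards, done = env.step(actions)
        for name, agent in agents.items():
            agent.observe(state, actions[name], rewards[name], next_state)
        state = next_state
        if done:
            break

run_match(AirCombatEnv(), {"red": Agent(), "blue": Agent()})
```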
S2, determining the environment state and the action space of each intelligent agent, wherein the action space comprises at least one action a_t, and shaping a continuous non-sparse reward function R_t for each action a_t.
In one embodiment, the action space comprises at least one action a_t among target pursuit, target avoidance, tangentially escaping enemy locking, cross attack and serpentine maneuver. The number and types of the optional actions a_t in the action space can be adjusted according to actual needs, and other actions known in the prior art can also be adopted.
In one embodiment, the continuous non-sparse reward function R_t of each action a_t is shaped as follows:
1) the reward function R_t1 for target pursuit satisfies the following condition:
R_t1: piecewise expression (equation image in the original) combining Δθ_c, a′, A and B
2) the reward function R_t2 for target avoidance satisfies the following condition:
R_t2 = Δθ_f + a′ + A + B
3) the reward function R_t3 for tangentially escaping enemy locking satisfies the following condition:
R_t3 = -Δθ_t + a′ + A + B
4) the reward function R_t4 for cross attack satisfies the following condition:
R_t4: piecewise expression (equation image in the original) switching on the sign of x_a - x and combining Δθ_c or Δθ_j, a′, A and B
5) the reward function R_t5 for serpentine maneuver satisfies the following condition:
R_t5: piecewise expression (equation image in the original) alternating with period T_1 and combining Δθ_s, the radial velocity component, A and B
wherein Δθ_c = |θ - δ| - ε, Δθ_f = |θ - δ|, Δθ_t = |θ - δ| - 90°, Δθ_j = |θ - δ′|, Δθ_s = |θ - δ| - (90° - σ); θ is the absolute orientation angle of the red agent, θ_e is the absolute orientation angle of the blue agent closest to the red agent, x is the distance between the red agent and the closest blue agent, x_a is the distance between the red agent's closest friendly aircraft and the closest blue agent, δ is the angle of the line connecting the red agent and the closest blue agent, δ′ is the angle of the line connecting the red agent and its closest friendly aircraft, ε is the deviation angle between the red agent and the closest blue agent, a′ is the acceleration of the red agent, A is the difference between the number of blue agents lost and the number of red agents lost, B is the number of games won by the red agent, v is the speed of the red agent, v_a is the speed of the red agent's closest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter, σ ∈ [0°, 90°], and T_1 is the period of maneuvering-direction change in the serpentine maneuver, 0 < T_1 ≤ T.
Specifically, a north-east coordinate system is adopted: θ is the angle between the nose direction of the red agent and due north, θ_e is the angle between the nose direction of the blue agent closest to the red agent and due north, δ is the angle between due north and the line connecting the red agent and the closest blue agent, δ′ is the angle between due north and the line connecting the red agent and its closest friendly aircraft, and ε is the angle between the nose direction of the red agent and the line connecting the red agent and the closest blue agent. The spatial motion of an air-combat agent has six degrees of freedom; when shaping the reward functions, the application considers the yaw angle and the acceleration of the intelligent agent. The reward-function shaping is oriented to the actions of target pursuit, target avoidance, tangentially escaping enemy locking, cross attack and serpentine maneuver, with the red agent taken as the local aircraft and the blue agent as the enemy aircraft (enemy for short), wherein:
Target pursuit: the local aircraft takes the closest enemy aircraft as relative reference, and the difference Δθ_c = |θ - δ| - ε between the local absolute orientation angle θ and the line angle δ to the closest enemy aircraft is one of the reward factors. When pursuing a target, keeping the enemy aircraft within the local radar lock range helps to quickly take the tail position and break away while compressing the enemy missile's envelope, so the smaller Δθ_c is, the larger the reward. The local acceleration a′ is another reward factor: when the local absolute orientation angle θ and the closest enemy orientation angle θ_e point in the same direction, i.e. |θ - θ_e| ≤ 180°, a′ is a positive reward, and when |θ - θ_e| > 180°, a′ is a negative reward. The relative hit count A and the local win/loss outcome B are also considered: A is the difference between the number of blue agents lost and the number of red agents lost, giving a positive reward when A > 0, no reward when A = 0, and a negative reward when A < 0; B gives a positive reward for a win, no reward for a draw, and a negative reward for a loss. Combining these factors, the target pursuit reward function R_t1 is as follows:
R_t1: piecewise expression (equation image in the original) combining Δθ_c, a′, A and B
Target avoidance: the local aircraft takes the closest enemy aircraft attacking it as relative reference, and the difference Δθ_f = |θ - δ| between the local absolute orientation angle θ and the line angle δ to that enemy aircraft is one of the reward factors. When avoiding the target, in order to compress the envelope of the enemy missile within the enemy's effective range, the smaller Δθ_f is, the larger the reward. The local acceleration a′ is another reward factor; because the distance to the enemy must be opened quickly while evading, a′ is always a positive reward. The relative hit count A and the local win/loss outcome B are also considered: A is the difference between the number of blue agents lost and the number of red agents lost, giving a positive reward when A > 0, no reward when A = 0, and a negative reward when A < 0; B gives a positive reward for a win, no reward for a draw, and a negative reward for a loss. Combining these factors, the target avoidance reward function R_t2 is as follows:
R_t2 = Δθ_f + a′ + A + B
Tangentially escaping enemy locking: the local aircraft takes the enemy aircraft that most recently locked it as relative reference, and the difference Δθ_t = |θ - δ| - 90° between the local absolute orientation angle θ and the line angle δ to that enemy aircraft is one of the reward factors. Since this action escapes enemy radar lock by exploiting the Doppler effect, the optimum angle is 90°, so the closer the difference angle |θ - δ| is to 90°, the better, and the closer Δθ_t is to 0, the larger the reward. The local acceleration a′ is another reward factor; because the distance to the enemy must be opened quickly while escaping, a′ is always a positive reward. The relative hit count A and the local win/loss outcome B are considered in the same way as above. Combining these factors, the reward function R_t3 for tangentially escaping enemy locking is as follows:
R_t3 = -Δθ_t + a′ + A + B
cross attack: the machine gives different rewards according to the position of the nearest friend machine (namely the nearest red party intelligent agent to the machine) by taking the nearest enemy machine as a reference.
If xaX is more than or equal to 0, the machine is positioned in the front of the nearest friend machine to execute attack tactics, x is the distance between the red square intelligent agent and the nearest blue square intelligent agent, namely the distance between the machine and the nearest enemy plane, xaNearest friend machine and nearest blue square for red square intelligent agentThe distance between the intelligent bodies, namely the distance between the nearest friend plane and the nearest friend plane, takes the deviation angle epsilon of the local machine relative to the nearest friend plane into consideration, and the difference delta theta between the absolute orientation angle theta of the local machine and the line angle delta between the local machine and the nearest friend plane is calculatedcOne of the reward factors is | θ - δ | - ε. When the enemy missile is hit by a target, the enemy missile is kept in the self radar locking range of the self-body, so that the enemy missile is beneficial to being quickly arranged at the tail and kept away, and meanwhile, the envelope of the enemy missile is compressed to delta thetacThe smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle thetaeIn the same direction, i.e. | theta-thetaeWhen | < 180 °, a' is positive reward, | theta-thetaeIf > 180 deg., a' is a negative reward. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A is 0, the reward is not provided, A is smaller than 0, the negative reward is provided, meanwhile, the local win-lose condition B wins as the positive reward, the tactic tie is not provided, and the tactic failure is the negative reward.
If xaAnd x is less than 0, the machine is positioned behind the nearest friend machine to execute the tactics of following the previous machine and opening the interference. The difference delta theta between the absolute orientation angle theta of the machine and the connection angle delta' between the machine and the nearest friend machinejOne of the reward factors is | θ - δ' |. Because when following the nearest friend machine, the nearest friend machine is considered to be kept in the radar locking range of the machine, so that the distance between the nearest friend machine and the team member is kept to play a role of concealing and interfering the enemy, delta thetajThe smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle thetaeIn the same direction, i.e. | theta-thetaeWhen | < 180 °, a' is positive reward, | theta-thetaeI > 180 deg., a ' is a negative reward, while a ' is proportional to the speed difference with the nearest friend, i.e. a ' ═ a (v-v)a) Where v is the local velocity, vaα is a weight coefficient for the nearest friend speed. The relative hit number A and the local win-loss condition B are one of the considered factors, the relative hit number A is the difference between the lost number of the blue intelligent agent and the lost number of the red intelligent agent, when the relative hit number A is more than 0, the reward is positive, A is 0, the reward is not present, A is less than 0, the reward is negative,meanwhile, the winning of the victory or defeat situation B of the machine is positive reward, the tie of the war is no reward, and the failure of the war is negative reward. Synthesizing a cross attack reward function R according to the factorst4The following were used:
Figure BDA0003029048020000111
s-shaped maneuvering: the machine takes the nearest enemy plane as relative reference, every T1The maneuvering direction is changed in a time period of 0 < T1≤T,2kT1≤t≤(2k+1)T1,
Figure BDA0003029048020000112
The difference delta theta between the absolute orientation angle theta of the plane itself and the line angle delta between the plane itself and the nearest enemy planesθ - δ - (90 ° - σ) is one of the reward factors. Sigma is an adjustable angle parameter, and sigma belongs to [0 DEG, 90 DEG ]]The radial direction is indicated by an arrow with the local machine as a starting point and the nearest enemy as an end point, and the radial direction is a velocity component in the radial direction. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A is 0, the reward is not provided, A is smaller than 0, the negative reward is provided, meanwhile, the local win-lose condition B wins as the positive reward, the tactic tie is not provided, and the tactic failure is the negative reward. Synthesizing a snake-shaped maneuvering reward function R according to the factorst5The following were used:
Figure BDA0003029048020000121
Because the intelligent agent's actions are continuous, such as the aircraft's turning angle and acceleration, the reward function serves as the evaluation of the behavior, and a continuous non-sparse reward function can be placed in one-to-one correspondence with the behaviors so as to give the agent feedback at every time point of the real-time confrontation. The shaping of the continuous non-sparse reward functions is clear, explicit and interpretable, and the behavior control of the intelligent agent can be adjusted by human intervention.
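For illustration, the two shaped rewards whose expressions are fully given above (target avoidance R_t2 and tangential escape R_t3) can be coded directly. The sketch below is a hedged example, not taken from the patent; angles are assumed to be in degrees and A, B are supplied by the caller.

```python
def reward_target_avoidance(theta, delta, a_prime, A, B):
    """R_t2 = d_theta_f + a' + A + B, with d_theta_f = |theta - delta| (degrees)."""
    d_theta_f = abs(theta - delta)
    return d_theta_f + a_prime + A + B

def reward_tangential_escape(theta, delta, a_prime, A, B):
    """R_t3 = -d_theta_t + a' + A + B, with d_theta_t = |theta - delta| - 90 degrees."""
    d_theta_t = abs(theta - delta) - 90.0
    return -d_theta_t + a_prime + A + B

# Example: own heading 30 deg, line angle to the locking enemy 100 deg, accelerating,
# one more blue loss than red loss, no decided outcome yet.
print(reward_tangential_escape(theta=30.0, delta=100.0, a_prime=2.0, A=1.0, B=0.0))
```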
S3, carrying out air game matches in the intelligent agent game decision model, wherein the match duration is T, and each intelligent agent, according to the current environment state S_t, t = 0, 1, 2, ..., T, performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward function R_t, and this is iterated cyclically in turn to realize the maximum accumulated reward R_b.
S32, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning, and obtaining the target reward function R*(s) = w*·φ(s), wherein w* is a reward function weight vector with h components, w* ∈ R^h, ||w*|| ≤ 1, and φ(s) is a situation information vector, φ: S → [0,1]^h.
Each intelligent agent is controlled based on expert behaviors to play the air game, realizing reverse reinforcement learning and automatically solving the target reward function. A series of air decision feature elements is defined, with φ: S → [0,1]^h; for example, the feature elements include the agents' basic spatial coordinates and motion indices (speed and turning angle), the states of the agent's fire-control system, radar and jamming pod, and so on. These feature indices are related to the reward, and to facilitate fast optimization this relation is expressed as a linear combination, giving the target reward function R*(s) = w*·φ(s), wherein w* is a reward function weight vector with h components, w* ∈ R^h, ||w*|| ≤ 1, and φ(s) is the situation information vector, so that the importance of each feature quantity is represented in a normalized space. When the expert behaviors are used to control the agents in the air game, a series of battlefield situations is generated over time until the game ends; the set of battlefield situations corresponding to the i-th game round is {s_t^(i)}, t = 0, 1, ..., T.
Assuming that there are p red airplanes and q blue airplanes, the collected situation information vector φ(s) may selectively include the following elements (a small illustrative sketch of assembling such a vector is given after this list):
a) the relative distances between the airplanes;
b) the relative angles between the airplanes;
c) the p + q absolute heading angles of the airplanes;
d) the p + q longitude and latitude coordinates of the airplanes;
e) pq missile threat points, representing the threat posed by a missile in flight to an airplane (proportional to the flight time and inversely proportional to the distance to the target airplane);
f) the p + q speeds of the airplanes;
g) the p + q relative positions of the airplanes;
h) whether the local aircraft is the current attacking aircraft;
i) the missile states of the q enemy airplanes at the previous moment;
j) the missile states of our p airplanes at the previous moment.
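As referenced above, the following is a small illustrative sketch of assembling such a situation information vector φ(s) and normalizing it into [0,1]^h, as required by φ: S → [0,1]^h. The particular feature names and normalization bounds are assumptions for the example, not values from the patent.

```python
import numpy as np

# Assumed normalization bounds for a few illustrative features (not from the patent).
FEATURE_BOUNDS = {
    "relative_distance": (0.0, 100_000.0),   # metres
    "relative_angle":    (0.0, 360.0),       # degrees
    "heading":           (0.0, 360.0),
    "speed":             (0.0, 600.0),       # m/s
    "missile_threat":    (0.0, 1.0),
    "is_attacker":       (0.0, 1.0),
}

def situation_vector(raw_features):
    """Map raw situation features to phi(s) in [0,1]^h by min-max normalization."""
    phi = []
    for name, value in raw_features.items():
        lo, hi = FEATURE_BOUNDS[name]
        phi.append(np.clip((value - lo) / (hi - lo), 0.0, 1.0))
    return np.array(phi)

phi_s = situation_vector({"relative_distance": 25_000.0, "relative_angle": 135.0,
                          "heading": 90.0, "speed": 300.0,
                          "missile_threat": 0.4, "is_attacker": 1.0})
```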
In one embodiment, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0, calculating the feature expectation μ(π_0) corresponding to the initial action strategy π_0, and setting i = 0, where 0 ≤ i ≤ m and m is the number of game rounds.
In one embodiment, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = E[ Σ_{t=0}^{T} γ^t φ(s_t) | π_i ]
where γ is the discount factor and φ(s_t)|π_i is the situation information vector of the battlefield situation at time t under the action strategy π_i.
In one embodiment, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning further comprises the following steps:
generating the battlefield situation set {s_t^(i)}, t = 0, 1, ..., T, from the action strategy corresponding to each game round, and calculating the global feature expectation μ_E over the m rounds of air games by the following formula:
μ_E = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t φ(s_t^(i))
wherein s_t^(i) is the environment state at time t of the i-th round, and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th round.
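A minimal sketch of the feature-expectation computation just described, under the assumption that each recorded game round is stored as an array of situation vectors φ(s_t) for t = 0..T; the helper names are illustrative, not the patent's.

```python
import numpy as np

def feature_expectation(phi_trajectory, gamma):
    """Discounted feature expectation of one game round: sum_t gamma^t * phi(s_t).
    phi_trajectory has shape (T+1, h), one situation vector per time step."""
    phis = np.asarray(phi_trajectory, dtype=float)
    discounts = gamma ** np.arange(phis.shape[0])
    return (discounts[:, None] * phis).sum(axis=0)

def global_feature_expectation(phi_trajectories, gamma):
    """Average of the per-round feature expectations over the m recorded rounds."""
    return np.mean([feature_expectation(traj, gamma) for traj in phi_trajectories], axis=0)

# Example with m = 2 short rounds and h = 3 features.
rounds = [np.random.rand(5, 3), np.random.rand(5, 3)]
mu_E = global_feature_expectation(rounds, gamma=0.9)
```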
S323, calculating characteristic errors of the current action strategy and the expert action strategy by adopting a minimum maximum algorithm
Figure BDA0003029048020000143
And obtain the current reward function weight vector
Figure BDA0003029048020000144
Updating the reward function weight vector w*Is (w)i)TWherein, muECharacteristic expectation of expert behavior strategy; mu.sjFor current action policyThe characteristic is expected, T is transposition;
s324, judging the characteristic error eiWhether the error is less than the error threshold value or not, if so, outputting the target reward function R*(s)=(wi)TPhi(s), otherwise, calculating the current optimal action strategy pi based on the deep reinforcement learning algorithmiAnd the update characteristic is expected to be mu (pi)i) If i is set to i +1, the process returns to step S323 to iterate. The error e is set in the present embodiment for the sake of accuracy and time consumption of calculationiIs 10-5If several iterations are performed, ei<10-5Then the error e is determinediLess than the error threshold.
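The following sketch illustrates the iteration of steps S321-S324. Instead of solving the min-max (max-margin) problem exactly, it uses the closely related projection variant of apprenticeship learning (Abbeel and Ng), which is easier to show compactly; the reinforcement-learning step is abstracted as a user-supplied callback. All names are illustrative, and this is not the patent's code.

```python
import numpy as np

def apprenticeship_irl(mu_expert, mu_initial, rl_solver, feature_expectation_of,
                       eps=1e-5, max_iters=50):
    """Projection-variant apprenticeship learning.
    rl_solver(w) must return a policy that is (approximately) optimal for reward w . phi(s);
    feature_expectation_of(policy) must return its discounted feature expectation mu(pi)."""
    mu_bar = np.asarray(mu_initial, dtype=float)       # projection point built from past policies
    for _ in range(max_iters):
        w = np.asarray(mu_expert) - mu_bar             # current reward-weight direction
        error = np.linalg.norm(w)                      # feature error e_i
        if error < eps:
            break
        policy = rl_solver(w)                          # RL step with reward w . phi(s)
        mu_i = feature_expectation_of(policy)
        d = mu_i - mu_bar                              # project mu_expert onto the new segment
        mu_bar = mu_bar + (d @ (np.asarray(mu_expert) - mu_bar)) / (d @ d) * d
    return w / max(np.linalg.norm(w), 1e-12), error    # normalized weight vector, final error
```

In use, rl_solver would wrap the deep reinforcement learning training of step S324 and feature_expectation_of would roll out the learned policy in the simulation to estimate μ(π_i).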
The reverse reinforcement learning algorithm based on expert behaviors brings the intelligent agent's combat effect close to that of the expert, and the solution of the target reward function is automated while remaining interpretable.
S33, calculating the similarity R_ε between each intelligent agent's behavior and the expert behavior.
In one embodiment, in step S33, the similarity R_ε between each intelligent agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, wherein μ(S_t|θ^μ) denotes the action generated by the strategy μ under the neural network parameters θ^μ and the environment state S_t;
S332, generating the action a_e to be imitated by the expert behavior strategy according to the current environment state S_t;
S333, calculating the similarity R_ε between the intelligent agent behavior and the expert behavior by the Kullback-Leibler divergence method, with the following formulas:
for a discrete action a_t:
R_ε = -Σ_{x ∈ X} a_t(x)·log( a_t(x) / a_e(x) )
wherein a_t(x) denotes the probability that the intelligent agent performs behavior x, a_e(x) denotes the probability that the expert performs behavior x, and X is the behavior set; for example, if X = {fly north, fly south, fly east, fly west}, then "fly north" is one behavior;
for a continuous action a_t:
R_ε = -∫_X a_t(x)·log( a_t(x) / a_e(x) ) dx
wherein a_t(x) denotes the probability density of the intelligent agent performing behavior x, a_e(x) denotes the probability density of the expert performing behavior x, x is a behavior and X is the behavior set; for example, X = {fly at an angle of x degrees to the horizontal, x ∈ [0°, 180°]}.
The DDPG algorithm outputs a specific action according to the current environment state and the action strategy and executes it, continuously providing action commands according to the real-time situation until the game ends. Each time the DDPG algorithm outputs a behavior, the expert behavior strategy makes a decision in the same situation, but this decision is not applied to the environment state; that is, the expert behavior does not take effect and only serves as feedback for the DDPG action decision, and the similarity between the expert behavior and the intelligent agent's action is calculated. A high similarity indicates that the current intelligent agent game decision model imitates the expert behaviors well.
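A hedged sketch of the similarity computation in step S333, assuming discrete action probabilities over the same behavior set X for the agent and the expert, and 1-D Gaussian action distributions for the continuous case (for which the KL divergence has a closed form). The negative sign, so that more similar behavior yields a larger reward, is an assumption consistent with R_ε being a similarity term.

```python
import numpy as np

def kl_similarity_discrete(p_agent, p_expert, eps=1e-8):
    """R_eps = -KL(agent || expert) for discrete action distributions over the same set X."""
    p = np.clip(np.asarray(p_agent, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_expert, dtype=float), eps, 1.0)
    return -float(np.sum(p * np.log(p / q)))

def kl_similarity_gaussian(mu_a, sigma_a, mu_e, sigma_e):
    """R_eps = -KL(N(mu_a, sigma_a^2) || N(mu_e, sigma_e^2)), closed form for 1-D Gaussians."""
    kl = np.log(sigma_e / sigma_a) + (sigma_a**2 + (mu_a - mu_e)**2) / (2.0 * sigma_e**2) - 0.5
    return -float(kl)

# Example: the agent's action distribution vs. the expert's over X = {N, S, E, W}.
print(kl_similarity_discrete([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]))
```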
S34, obtaining the comprehensive reward r_t, where the comprehensive reward r_t is given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
wherein w_b is the weight coefficient of the maximum accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε.
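As a small illustrative sketch, the three reward terms can be combined as follows; the weight values shown are placeholders, not values from the patent.

```python
def comprehensive_reward(R_b, R_star, R_eps, w_b=0.4, w_i=0.3, w_eps=0.3):
    """r_t = w_b * R_b + w_i * R*(s) + w_eps * R_eps; the weights are user-adjustable."""
    return w_b * R_b + w_i * R_star + w_eps * R_eps
```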
S4, training the intelligent agent game decision model and judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise continuing the cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold.
The application mainly adopts the DDPG algorithm to carry out imitation training of expert agent decision behaviors, so as to obtain an intelligent agent game decision model capable of producing expert-like behaviors. The behavior simulation training process is one in which the agent generates actions according to the situation to obtain returns and continuously accumulates experience, so that the generated actions shift toward higher returns until they reach a stable high return and converge, thereby obtaining a final intelligent agent game decision model with a high intelligence level for use in air gaming.
In one embodiment, in step S4, training the intelligent agent game decision model comprises the following steps:
S41, initializing the online network parameters θ^μ of the actor network and the online network parameters θ^Q of the critic network in the intelligent agent game decision model; the actor network outputs behaviors according to the battlefield situation, and the critic network outputs scores according to the battlefield situation and the behaviors.
S42, copying the online network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ and θ^{Q′} ← θ^Q;
S43, initializing the replay buffer R;
S44, for the L-th game round (L = 1, 2, ..., n), executing the following steps:
S441, initializing the random process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, ..., T, executing the following steps:
a) each intelligent agent executes its corresponding action, and the environment state transitions, giving the next environment state S_{t+1} and the comprehensive reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay buffer R to be used as the training data set;
c) randomly sampling U arrays (s_L, a_L, r_L, s_{L+1}) from the replay buffer R as mini-batch training data, with the label
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′})
wherein μ′(s_{L+1}|θ^{μ′}) denotes the action generated by the target strategy μ′ under the parameters θ^{μ′} and the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) denotes the value of the target Q function under the parameters θ^{Q′}, the environment state s_{L+1} and that action;
d) minimizing the loss function of the critic network, the loss function being
Loss = (1/U)·Σ_L ( y_L - Q(s_L, a_L|θ^Q) )²
e) updating the actor network by the policy gradient:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
wherein Q(s, a|θ^Q) denotes the value of the Q function under the parameters θ^Q, the environment state s and the executed action a, and ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a; μ(s|θ^μ) denotes the action generated by the strategy μ under the parameters θ^μ and the environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) updating the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network:
θ^{Q′} ← τ·θ^Q + (1 - τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1 - τ)·θ^{μ′}
wherein τ is an adjustable coefficient, τ ∈ [0,1];
g) judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise returning to step S44 for cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold. For example, if the upper limit of the preset reward threshold is 100 and the user sets the reward threshold to 80 according to actual requirements, the training is terminated when the comprehensive reward r_t exceeds 80; otherwise the process returns to step S44 for cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t exceeds 80.
In this way, the intelligent agent behaviors are trained for decision-making with the deep reinforcement learning algorithm, and the deep network is continuously updated with the similarity between expert behaviors and agent behaviors as feedback. By training the intelligent agent game decision model, decision experience is obtained from expert prior knowledge, and as the similarity between the agent behaviors and the expert behaviors increases during training, the level of behavior imitation gradually improves, so that a final intelligent agent game decision model with a high intelligence level is obtained for use in air gaming.
In one embodiment, the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by a gradient descent method. Each update changes the return of the generated actions, and the overall trend of the return is upward.
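The following PyTorch sketch condenses steps c) through f) into one update call: forming the label y_L from the target networks, minimizing the critic's mean-squared loss, updating the actor with the deterministic policy gradient, and soft-updating the target networks with the coefficient τ. It is a hedged illustration under the assumption of small fully connected networks and a pre-sampled mini-batch, not the patent's implementation.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

state_dim, action_dim = 10, 2                      # illustrative sizes, not from the patent
actor,  actor_target  = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
critic, critic_target = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
actor_target.load_state_dict(actor.state_dict())   # step S42: copy online -> target parameters
critic_target.load_state_dict(critic.state_dict())
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch                        # mini-batch sampled from the replay buffer
    with torch.no_grad():                          # c) label y_L = r_L + gamma * Q'(s', mu'(s'))
        y = r + gamma * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
    # d) minimize the critic loss (1/U) * sum_L (y_L - Q(s_L, a_L))^2
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # e) actor update by the deterministic policy gradient (maximize Q, i.e. minimize -Q)
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # f) soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'
    with torch.no_grad():
        for net, net_t in ((actor, actor_target), (critic, critic_target)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)

U = 32                                             # example mini-batch of random tensors
ddpg_update((torch.randn(U, state_dim), torch.randn(U, action_dim),
             torch.randn(U, 1), torch.randn(U, state_dim)))
```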
In conclusion, the application synthesizes key elements of the air scenario of the intelligent agent game decision model as reward factors and shapes continuous non-sparse reward functions, so that the reward functions are clearly and explicitly designed and interpretable and the behavior control of the intelligent agent can be manually intervened; the reverse reinforcement learning algorithm based on expert behaviors brings the intelligent agent's combat effect close to that of the expert, and the solution of the target reward function is automated while remaining interpretable; the decision simulation method based on expert behaviors can rapidly improve the decision level of the intelligent agent, solve the problem of difficult, slow or even impossible convergence, and fit linear and nonlinear reward functions so that the intelligent agent can imitate complex expert behaviors, thereby reducing labor cost; continuous non-sparse shaped rewards, reverse reinforcement learning rewards and behavior-imitation similarity rewards are considered simultaneously, their weights can be adjusted according to user requirements, and the intelligent decision level of the intelligent agent is improved; the behavior simulation method solves the cold-start problem of training, improves the traditionally inefficient reward-function design process and the random exploration process during initial training, optimizes decision-making capability, and quickly yields a decision-making intelligent agent with a high intelligence level and practical value.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments only express several specific and detailed implementations of the present application and should not be understood as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A behavior simulation training method for air intelligent gaming, characterized in that the behavior simulation training method for air intelligent gaming comprises the following steps:
S1, constructing an air confrontation simulation environment and red and blue intelligent agents, constructing an intelligent agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each intelligent agent;
S2, determining the environment state and the action space of each intelligent agent, wherein the action space comprises at least one action a_t, and shaping a continuous non-sparse reward function R_t for each said action a_t;
S3, carrying out air game matches in the intelligent agent game decision model, wherein the match duration is T, and each intelligent agent, according to the current environment state S_t, t = 0, 1, 2, ..., T, performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward function R_t, and this is iterated cyclically in turn to realize the maximum accumulated reward R_b;
S32, controlling each intelligent agent to carry out air gaming based on expert behaviors to realize reverse reinforcement learning, and obtaining a target reward function R*(s) = w*·φ(s), wherein w* is a reward function weight vector with h components, w* ∈ R^h, ||w*|| ≤ 1, and φ(s) is a situation information vector, φ: S → [0,1]^h;
S33, calculating the similarity R_ε between each intelligent agent's behavior and the expert behavior;
S34, obtaining the comprehensive reward r_t, where the comprehensive reward r_t is given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
wherein w_b is the weight coefficient of said maximum accumulated reward R_b, w_i is the weight coefficient of said target reward function R*(s), and w_ε is the weight coefficient of said similarity R_ε;
S4, training the intelligent agent game decision model and judging whether the comprehensive reward r_t is larger than the reward threshold; if so, terminating the training to obtain the final intelligent agent game decision model, otherwise continuing the cyclic iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t is greater than the reward threshold.
2. The behavior simulation training method for an air intelligent game according to claim 1, wherein the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
3. The behavior simulation training method for an air intelligent game according to claim 1, wherein the action space comprises at least one action a_t selected from target pursuit, target avoidance, tangential escape from enemy lock, cross attack, and serpentine maneuver.
4. The behavior simulation training method for an air intelligent game according to claim 3, wherein the continuous non-sparse reward function R_t shaped for each of said actions a_t is as follows:
1) the reward function R_t1 for target pursuit satisfies the following condition:
[formula provided as an image (FDA0003029048010000021) in the original filing]
2) the reward function R_t2 for target avoidance satisfies the following condition:
R_t2 = Δθ_f + a′ + A + B
3) the reward function R_t3 for tangential escape from enemy lock satisfies the following condition:
R_t3 = −Δθ_t + a′ + A + B
4) the reward function R_t4 for cross attack satisfies the following condition:
[formula provided as an image (FDA0003029048010000022) in the original filing]
5) the reward function R_t5 for the serpentine maneuver satisfies the following condition:
[formula provided as an image (FDA0003029048010000023) in the original filing]
wherein Δθ_c = |θ − δ| − ε, Δθ_f = |θ − δ|, Δθ_t = |θ − δ| − 90°, Δθ_j = |θ − δ′|, Δθ_s = θ − δ − (90° − σ); θ is the absolute orientation angle of the red agent, θ_e is the absolute orientation angle of the blue agent nearest to the red agent, x is the distance between the red agent and the nearest blue agent, x_a is the distance between the red agent's nearest friendly aircraft and the nearest blue agent, δ is the line angle between the red agent and the nearest blue agent, δ′ is the line angle between the red agent and its nearest friendly aircraft, ε is the deviation angle between the red agent and the nearest blue agent, a′ is the acceleration of the red agent, A is the difference between the number of blue agents lost and the number of red agents lost, B is the number of rounds won by the red side, v is the speed of the red agent, v_a is the speed of the red agent's nearest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter with σ ∈ [0°, 90°], and T_1 is the direction-change period of the serpentine maneuver, 0 < T_1 ≤ T (an illustrative sketch of the target-avoidance and tangential-escape rewards follows this claim).
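For illustration, a minimal Python sketch of the two shaped rewards whose formulas survive in the text of claim 4 (the target-avoidance reward R_t2 and the tangential-escape reward R_t3); function and argument names are assumptions, and the pursuit, cross-attack and serpentine rewards are omitted because their formulas are only given as images in the original filing.

```python
# Sketch of the two shaped rewards given explicitly in claim 4. Angles are assumed
# to be in degrees; theta is the red agent's absolute orientation angle, delta the
# line angle to the nearest blue agent, accel the acceleration a', loss_diff the
# quantity A and wins the quantity B. Names are illustrative.

def reward_target_avoidance(theta, delta, accel, loss_diff, wins):
    # R_t2 = delta_theta_f + a' + A + B, with delta_theta_f = |theta - delta|
    return abs(theta - delta) + accel + loss_diff + wins

def reward_tangential_escape(theta, delta, accel, loss_diff, wins):
    # R_t3 = -delta_theta_t + a' + A + B, with delta_theta_t = |theta - delta| - 90
    return -(abs(theta - delta) - 90.0) + accel + loss_diff + wins
```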
5. The behavior simulation training method for an air intelligent game according to claim 1, wherein controlling each agent to play the air game based on expert behaviors so as to realize inverse reinforcement learning comprises the following steps:
s321, constructing the target reward function R*(s)=w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and calculating its corresponding feature expectation μ(π_0); setting i = 0, with 0 ≤ i ≤ m, where m is the number of game rounds;
S323, calculating the feature error between the current action strategy and the expert action strategy using a min-max algorithm:
e_i = max_{‖w‖≤1} min_{j∈{0,1,…,i}} wᵀ·(μ_E − μ_j),
obtaining the current reward function weight vector w_i as the maximizing w, and updating the reward function weight vector as w* = (w_i)ᵀ, where μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the j-th action strategy, and T denotes transposition;
S324, judging whether the feature error e_i is less than the error threshold; if so, outputting the target reward function R*(s) = (w_i)ᵀ·φ(s); otherwise, calculating the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation to μ(π_i), setting i = i + 1, and returning to step S323 to iterate (an illustrative sketch of this loop follows this claim).
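For illustration, the following hedged Python sketch mirrors the expert-behavior inverse reinforcement learning loop of claim 5 (steps S321 to S324). The inner min-max weight search is approximated by a projection-style step; a faithful implementation of the claim's min-max formulation would use a QP/SVM solver. The callables rl_train_policy and estimate_feature_expectation are hypothetical stand-ins for the deep reinforcement learning training and rollout code.

```python
import numpy as np

def apprenticeship_irl(mu_expert, initial_policy, rl_train_policy,
                       estimate_feature_expectation, eps=1e-2, max_iters=50):
    """Return a reward weight vector w such that R*(s) = w . phi(s)."""
    mu_list = [estimate_feature_expectation(initial_policy)]      # mu(pi_0)
    w_i = None
    for _ in range(max_iters):
        # Projection-style approximation of the min-max weight search:
        # point the weight vector at the expert feature expectation.
        gaps = [np.asarray(mu_expert) - np.asarray(mu) for mu in mu_list]
        best_gap = min(gaps, key=np.linalg.norm)
        e_i = np.linalg.norm(best_gap)                            # feature error e_i
        w_i = best_gap / (e_i + 1e-12)                            # ||w_i||_2 <= 1
        if e_i < eps:                                             # S324: below threshold
            break
        # S324 (else branch): train a policy optimal for reward w_i . phi(s),
        # then record its feature expectation and iterate.
        policy_i = rl_train_policy(reward_weights=w_i)
        mu_list.append(estimate_feature_expectation(policy_i))
    return w_i
```

On termination, the returned weight vector defines the target reward R*(s) = w_i·φ(s), as in step S324.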
6. The behavior simulation training method for an air intelligent game according to claim 5, wherein the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t·φ(s_t)|π_i,
where γ is the discount factor and φ(s_t)|π_i is the battlefield situation at time t corresponding to the action strategy π_i.
7. The behavior simulation training method for an air intelligent game according to claim 6, wherein controlling each agent to play the air game based on expert behaviors so as to realize inverse reinforcement learning further comprises:
generating the battlefield situation set {s_0^i, s_1^i, …, s_T^i} according to the action strategy corresponding to each round, and calculating the global feature expectation μ̄ over the m rounds of the air game according to the following formula:
μ̄ = (1/m)·Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t·φ(s_t^i),
where s_t^i is the environment state at time t of the i-th round and φ(s_t^i) is the situation information vector corresponding to the battlefield situation at time t of the i-th round (an illustrative sketch of these feature expectations follows this claim).
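For illustration, a short Python sketch of the discounted feature expectation of claim 6 and the global, round-averaged feature expectation of claim 7, assuming the battlefield situations are supplied as lists of situation vectors φ(s_t); names and the default discount factor are assumptions.

```python
import numpy as np

def feature_expectation(phi_trajectory, gamma=0.99):
    # mu(pi) = sum_t gamma^t * phi(s_t) over one round
    return sum((gamma ** t) * np.asarray(phi_t)
               for t, phi_t in enumerate(phi_trajectory))

def global_feature_expectation(phi_trajectories, gamma=0.99):
    # Average the discounted feature sums over the m recorded rounds.
    m = len(phi_trajectories)
    return sum(feature_expectation(traj, gamma) for traj in phi_trajectories) / m
```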
8. The behavior simulation training method for an air intelligent game according to claim 1, wherein in step S33 the similarity R_ε between each agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, where μ(S_t|θ^μ) denotes the action generated by the strategy μ under the neural network parameters θ^μ in the environment state S_t;
S332, generating an action a_e as the imitation target using the expert behavior strategy according to the current environment state S_t;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior using the Kullback-Leibler divergence according to the following formulas (an illustrative sketch follows this claim):
for a discrete action a_t,
R_ε = −Σ_{x∈X} a_t(x)·log(a_t(x)/a_e(x)),
where a_t(x) denotes the probability that the agent performs behavior x, a_e(x) denotes the probability that the expert performs behavior x, and X is the behavior set;
for a continuous action a_t,
R_ε = −∫_X a_t(x)·log(a_t(x)/a_e(x)) dx,
where a_t(x) denotes the probability density of the agent performing behavior x, a_e(x) denotes the probability density of the expert performing behavior x, and X is the behavior set.
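For illustration, a minimal Python sketch of the discrete-action similarity term of claim 8; the negated-KL sign convention (so that closer agreement with the expert yields a larger reward) and the probability smoothing constant are assumptions of this sketch.

```python
import numpy as np

def similarity_reward(agent_probs, expert_probs, eps=1e-12):
    """Negated KL divergence KL(agent || expert) over the discrete behaviour set X."""
    p = np.asarray(agent_probs, dtype=float) + eps
    q = np.asarray(expert_probs, dtype=float) + eps
    kl = np.sum(p * np.log(p / q))
    return -kl  # larger when the agent's action distribution matches the expert's

# Example over a five-action behaviour set
# (pursuit, avoidance, tangential escape, cross attack, serpentine maneuver).
r_eps = similarity_reward([0.2, 0.2, 0.2, 0.2, 0.2], [0.6, 0.1, 0.1, 0.1, 0.1])
```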
9. The behavior simulation training method for an air intelligent game according to claim 1, wherein in step S4 the training of the agent game decision model comprises the following steps:
S41, initializing the online neural network parameters θ^μ of the actor network and the online neural network parameters θ^Q of the critic network in the agent game decision model;
S42, copying the online neural network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^μ′ of the actor network and θ^Q′ of the critic network, i.e. θ^μ′ ← θ^μ and θ^Q′ ← θ^Q;
S43, initializing the replay buffer R;
S44, for the L-th round, L = 1, 2, …, n, executing the following steps:
S441, initializing the random process noise N and obtaining an initial environment state S;
S442, for each time step t = 1, 2, …, T, executing the following steps:
a) each agent executes its corresponding action, and the environment state transition is performed to obtain the next environment state S_{t+1} and the composite reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay buffer R to serve as the training data set;
c) randomly sampling U arrays (s_L, a_L, r_L, s_{L+1}) from the replay buffer R as a mini-batch of training data and computing the label
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^μ′)|θ^Q′),
where μ′(s_{L+1}|θ^μ′) denotes the action generated by the target policy μ′ under the parameters θ^μ′ in the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^μ′)|θ^Q′) denotes the value of the target Q function under the parameters θ^Q′ in the environment state s_{L+1} when that action is performed;
d) minimizing the loss function of the critic network, the loss function being
Loss = (1/U)·Σ_L (y_L − Q(s_L, a_L|θ^Q))²;
e) updating the actor network with the policy gradient:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L},
where Q(s, a|θ^Q) denotes the value of the Q function under the parameters θ^Q in the environment state s when action a is executed, ∇_a Q(s, a|θ^Q) is the partial derivative of Q(s, a|θ^Q) with respect to a, μ(s|θ^μ) denotes the action generated by the strategy μ under the parameters θ^μ in the environment state s, and ∇_{θ^μ} μ(s|θ^μ) is the partial derivative of μ(s|θ^μ) with respect to θ^μ;
f) updating the target network parameters θ^μ′ of the actor network and θ^Q′ of the critic network:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′,
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′,
where τ is an adjustable coefficient, 0 < τ < 1;
g) judging whether the composite reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent game decision model; otherwise, returning to step S44 for iterative training until the maximum number of training rounds n is reached or the composite reward r_t exceeds the reward threshold (an illustrative sketch of steps c) to f) follows this claim).
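For illustration, a hedged PyTorch sketch of one pass through steps c) to f) of claim 9; the actor, critic and target networks, their optimizers and the replay batch are assumed to exist with compatible shapes, so this is a sketch of the update rule rather than the patented implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # c) label y_L = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # d) minimise the critic loss (1/U) * sum (y - Q(s, a))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # e) actor update via the deterministic policy gradient
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # f) soft update of the target network parameters with coefficient tau
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```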
10. The behavior simulation training method for an air intelligent game according to claim 9, wherein the target network parameters θ^μ′ of the actor network and θ^Q′ of the critic network are updated by a gradient descent method.
CN202110425153.0A 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game Active CN113221444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110425153.0A CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110425153.0A CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Publications (2)

Publication Number Publication Date
CN113221444A true CN113221444A (en) 2021-08-06
CN113221444B CN113221444B (en) 2023-01-03

Family

ID=77088029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110425153.0A Active CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Country Status (1)

Country Link
CN (1) CN113221444B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291890A (en) * 2020-05-13 2020-06-16 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112052511A (en) * 2020-06-15 2020-12-08 成都蓉奥科技有限公司 Air combat maneuver strategy generation technology based on deep random game
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867178A (en) * 2021-10-26 2021-12-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN113867178B (en) * 2021-10-26 2022-05-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN114021737A (en) * 2021-11-04 2022-02-08 中国电子科技集团公司信息科学研究院 Game-based reinforcement learning method, system, terminal and storage medium
CN114021737B (en) * 2021-11-04 2023-08-22 中国电子科技集团公司信息科学研究院 Reinforced learning method, system, terminal and storage medium based on game
CN114423046A (en) * 2021-12-03 2022-04-29 中国人民解放军空军工程大学 Cooperative communication interference decision method
CN113893539A (en) * 2021-12-09 2022-01-07 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN113893539B (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN115470710A (en) * 2022-09-26 2022-12-13 北京鼎成智造科技有限公司 Air game simulation method and device
CN115648204A (en) * 2022-09-26 2023-01-31 吉林大学 Training method, device, equipment and storage medium of intelligent decision model
CN115470710B (en) * 2022-09-26 2023-06-06 北京鼎成智造科技有限公司 Air game simulation method and device

Also Published As

Publication number Publication date
CN113221444B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
CN113221444B (en) Behavior simulation training method for air intelligent game
CN113093802B (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN113095481B (en) Air combat maneuver method based on parallel self-game
CN113791634B (en) Multi-agent reinforcement learning-based multi-machine air combat decision method
CN113050686A (en) Combat strategy optimization method and system based on deep reinforcement learning
Wang et al. Improving maneuver strategy in air combat by alternate freeze games with a deep reinforcement learning algorithm
CN113893539B (en) Cooperative fighting method and device for intelligent agent
CN114460959A (en) Unmanned aerial vehicle group cooperative autonomous decision-making method and device based on multi-body game
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
CN116185059A (en) Unmanned aerial vehicle air combat autonomous evasion maneuver decision-making method based on deep reinforcement learning
CN113159266B (en) Air combat maneuver decision method based on sparrow searching neural network
CN113962012A (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
CN111773722B (en) Method for generating maneuver strategy set for avoiding fighter plane in simulation environment
CN114721424A (en) Multi-unmanned aerial vehicle cooperative countermeasure method, system and storage medium
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN113282100A (en) Unmanned aerial vehicle confrontation game training control method based on reinforcement learning
CN115933717A (en) Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
CN114371634B (en) Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
CN117291254A (en) Agent task allocation training method based on imitation learning and safety reinforcement learning
CN116225065A (en) Unmanned plane collaborative pursuit method of multi-degree-of-freedom model for multi-agent reinforcement learning
Kong et al. Multi-ucav air combat in short-range maneuver strategy generation using reinforcement learning and curriculum learning
Wang et al. Research on autonomous decision-making of UCAV based on deep reinforcement learning
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant