CN113221444B - Behavior simulation training method for air intelligent game - Google Patents

Behavior simulation training method for air intelligent game

Info

Publication number
CN113221444B
Authority
CN
China
Prior art keywords
agent
reward
action
intelligent
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110425153.0A
Other languages
Chinese (zh)
Other versions
CN113221444A (en)
Inventor
包骐豪
朱燎原
夏少杰
瞿崇晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETC 52 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by CETC 52 Research Institute
Priority to CN202110425153.0A
Publication of CN113221444A
Application granted
Publication of CN113221444B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a behavior simulation training method for air intelligent gaming, comprising the following steps: S1, constructing an intelligent-agent game decision model; S2, determining the environment state and action space, and shaping a continuous non-sparse reward function for each action; S3, conducting air game engagements in the model and executing the following steps: S31, generating the next environment state according to the executed action and obtaining the reward, iterating in a loop to achieve the maximum cumulative reward; S32, realizing inverse reinforcement learning based on expert behavior to obtain a target reward function; S33, calculating the similarity between the agents' behavior and the expert behavior; S34, obtaining a composite reward; and S4, training the intelligent-agent game decision model. The method improves the traditionally inefficient reward-function design process and the random exploration process of model training, so that the reward function is interpretable and open to human intervention, the agent's decision level and convergence speed are improved, and the cold-start problem of model training is solved.

Description

Behavior simulation training method for air intelligent game
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a behavior simulation training method for an air intelligent game.
Background
In future air gaming, the most accurate information must be acquired from various detection systems at any time and place to establish an information advantage; more importantly, a decision advantage must be achieved by exploiting technologies such as machine learning, artificial intelligence and cloud computing. To better mine information for a decision advantage, exert gaming effectiveness and secure air superiority, an airborne decision-support system matched to the pilot is needed in addition to the pilot's flying skill and the commander's command ability. As an artificial-intelligence assistant, such an airborne decision-support system can provide reference decision schemes in a highly dynamic, complex confrontation environment, reduce the pilot's decision burden, better mine information to achieve a decision advantage, exert gaming effectiveness and secure air superiority.
However, existing airborne decision-support systems are relatively backward: the number of sensing parameters or targets that can be handled simultaneously is limited, and the robustness, timeliness and accuracy of the decision support are poor. Moreover, in game decision-making an excessively high decision dimensionality makes the training model hard to converge, so that training a practically useful agent takes a long time, or an effective decision-making agent cannot be obtained at all. In air intelligent self-play confrontation, sparse rewards and a complex, inefficient reward-design process lead to a low agent decision level and long training times; in addition, rewards are hand-crafted for each scenario, which is labor-intensive, poorly reusable, and leaves the algorithm training with a cold-start problem.
Disclosure of Invention
The invention aims to provide a behavior simulation training method for air intelligent gaming that improves the traditionally inefficient reward-function design process and the random exploration process during model training, so that the reward function is clearly and explicitly shaped and interpretable while the agent's behavior can still be steered by human intervention; the method rapidly improves the agent's decision level and convergence speed, gives the agent the ability to imitate complex expert behaviors, reduces labor cost, and solves the cold-start problem of model training.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
The invention provides a behavior simulation training method for air intelligent gaming, comprising the following steps:
S1, constructing an air confrontation simulation environment and red-side and blue-side agents, constructing an intelligent-agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each agent;
S2, determining the environment state and the action space of each agent, the action space comprising at least one action a_t, and shaping a continuous non-sparse reward function R_t for each action a_t;
S3, conducting air game engagements in the intelligent-agent game decision model, the engagement duration being T; according to the current environment state S_t, t = 0, 1, 2, ..., T, each agent performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to produce the next environment state S_{t+1} and the reward given by the corresponding reward function R_t; iterating in this loop to achieve the maximum cumulative reward R_b;
S32, controlling each agent to play the air game based on expert behavior, realizing inverse reinforcement learning and obtaining a target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector with h components, ‖w*‖ ≤ 1, and φ(s) is the situation information vector, φ: S → [0,1]^h;
S33, calculating the similarity R_ε between each agent's behavior and the expert behavior;
S34, obtaining the composite reward r_t, given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
where w_b is the weight coefficient of the maximum cumulative reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε;
S4, training the intelligent-agent game decision model and judging whether the composite reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final intelligent-agent game decision model; otherwise, continuing the iterative training in a loop until the maximum number of training engagements n is reached or the composite reward r_t exceeds the reward threshold.
Preferably, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
Preferably, the action space comprises at least one action a_t selected from target pursuit, target evasion, tangential escape from enemy lock, cross attack and serpentine maneuver.
Preferably, the continuous non-sparse reward function R_t shaped for each action a_t is as follows:
1) The reward function R_t1 for target pursuit satisfies:
[formula given as an image in the original publication; it combines Δθ_c, a′, A and B]
2) The reward function R_t2 for target evasion satisfies:
R_t2 = Δθ_f + a′ + A + B
3) The reward function R_t3 for tangential escape from enemy lock satisfies:
R_t3 = −Δθ_t + a′ + A + B
4) The reward function R_t4 for cross attack satisfies:
[formula given as an image in the original publication; it combines Δθ_c or Δθ_j (depending on the sign of x_a − x), a′, A and B]
5) The reward function R_t5 for the serpentine maneuver satisfies:
[formula given as an image in the original publication; it combines Δθ_s, A and B]
where Δθ_c = |θ−δ| − ε, Δθ_f = |θ−δ|, Δθ_t = |θ−δ| − 90°, Δθ_j = |θ−δ′|, Δθ_s = θ − δ − (90° − σ), θ is the absolute orientation angle of the red-side agent, θ_e is the absolute orientation angle of the blue-side agent nearest to the red-side agent, x is the distance between the red-side agent and the nearest blue-side agent, x_a is the distance between the red-side agent's nearest friendly aircraft and the nearest blue-side agent, δ is the line-of-sight angle between the red-side agent and the nearest blue-side agent, δ′ is the line-of-sight angle between the red-side agent and its nearest friendly aircraft, ε is the deviation angle between the red-side agent and the nearest blue-side agent, a′ is the acceleration of the red-side agent, A is the number of blue-side agents lost minus the number of red-side agents lost, B is the win/loss outcome of the red-side agent, v is the speed of the red-side agent, v_a is the speed of the red-side agent's nearest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter with σ ∈ [0°, 90°], and T_1 is the period with which the maneuver direction changes in the serpentine maneuver, 0 < T_1 ≤ T.
Preferably, controlling each agent to play the air game based on expert behavior to realize inverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and computing its corresponding feature expectation μ(π_0), with i = 0, 0 ≤ i ≤ m, where m is the number of engagements;
S323, calculating the feature error between the current action strategy and the expert action strategy with a min-max algorithm,
e_i = max_{w:‖w‖≤1} min_{j∈{0,...,i−1}} w^T(μ_E − μ_j),
taking the maximizing w as the current reward-function weight vector w_i and updating the reward-function weight vector w* to (w_i)^T, where μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the current action strategy, and T denotes transposition;
S324, judging whether the feature error e_i is less than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, computing the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation to μ(π_i), setting i = i + 1 and returning to step S323 for the next iteration.
Preferably, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t φ(s_t)|π_i
where γ is the discount factor and φ(s_t)|π_i is the situation information vector of the battlefield situation at time t under action strategy π_i.
Preferably, controlling each agent to play the air game based on expert behavior to realize inverse reinforcement learning further comprises the following steps:
generating, for each engagement, the set of battlefield situations {s_t^(i) | t = 0, 1, ..., T} produced by the corresponding action strategy, and calculating the global feature expectation μ̂ of the m air-game engagements by the formula:
μ̂ = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t φ(s_t^(i))
where s_t^(i) is the environment state at time t of the i-th engagement and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th engagement.
Preferably, in step S33, the similarity R_ε between each agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, where μ(S_t|θ^μ) denotes the action generated by strategy μ with neural-network parameters θ^μ in environment state S_t;
S332, generating an action a_e as the imitation target from the current environment state S_t using the expert behavior strategy;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior with the Kullback-Leibler divergence, by the following formulas:
for a discrete action a_t, R_ε is computed from the divergence
D_KL(a_t ‖ a_e) = Σ_{x∈X} a_t(x) log( a_t(x) / a_e(x) )
where a_t(x) is the probability that the agent performs behavior x, a_e(x) is the probability that the expert performs behavior x, and X is the behavior set;
for a continuous action a_t, R_ε is computed from the divergence
D_KL(a_t ‖ a_e) = ∫_X a_t(x) log( a_t(x) / a_e(x) ) dx
where a_t(x) is the probability density of the agent performing behavior x, a_e(x) is the probability density of the expert performing behavior x, and X is the behavior set.
Preferably, in step S4, training the intelligent-agent game decision model comprises the following steps:
S41, initializing the online neural-network parameters θ^μ of the actor network and the online neural-network parameters θ^Q of the critic network in the intelligent-agent game decision model;
S42, copying the online neural-network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ, θ^{Q′} ← θ^Q;
S43, initializing a replay buffer R;
S44, performing the following steps for the L-th engagement (L = 1, 2, ..., n):
S441, initializing the random-process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, ..., T:
a) each agent executes its corresponding action and the environment state transitions, yielding the next environment state S_{t+1} and the composite reward r_t;
b) the environment-state transition (S_t, a_t, r_t, S_{t+1}) is stored as a tuple in the replay buffer R to serve as the training data set;
c) U tuples are randomly sampled from the replay buffer R as mini-batch training data, and the label is computed as
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′})
where μ′(s_{L+1}|θ^{μ′}) denotes the action generated by the target strategy μ′ with parameters θ^{μ′} in environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) is the value of the target Q function with parameters θ^{Q′} for environment state s_{L+1} and that action;
d) the loss function of the critic network is minimized, the loss function being
Loss = (1/U)·Σ_L ( y_L − Q(s_L, a_L|θ^Q) )²
e) the policy gradient of the actor network is updated:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
where Q(s, a|θ^Q) denotes the value of the Q function with parameters θ^Q for environment state s and executed action a, ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a, μ(s|θ^μ) denotes the action generated by strategy μ with parameters θ^μ in environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated:
θ^{Q′} ← τ·θ^Q + (1−τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1−τ)·θ^{μ′}
where τ is an adjustable coefficient, τ ∈ [0,1];
g) the composite reward r_t is compared with the reward threshold; if it is greater than the reward threshold, the training terminates and the final intelligent-agent game decision model is obtained; otherwise, the process returns to step S44 for further iterative training until the maximum number of training engagements n is reached or the composite reward r_t exceeds the reward threshold.
Preferably, the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by a gradient-descent method.
Compared with the prior art, the invention has the beneficial effects that:
1) Key elements of the air scenario of the intelligent-agent game decision model are combined as reward factors and shaped into continuous non-sparse reward functions, so that the reward design is clear, explicit and interpretable, while the agent's behavior can still be steered by human intervention;
2) The inverse-reinforcement-learning algorithm based on expert behavior brings the agent's combat effectiveness close to that of the expert, and the solution of the target reward function can be automated while remaining interpretable;
3) The intelligent decision simulation method based on expert behavior rapidly raises the agent's decision level, overcomes slow or even failed convergence, and fits linear and nonlinear reward functions so that the agent can imitate complex expert behaviors, thereby reducing labor cost;
4) Continuous non-sparse shaped rewards, inverse-reinforcement-learning rewards and behavior-imitation similarity rewards are considered simultaneously, and their weights can be adjusted to user requirements, raising the agent's intelligent decision level;
5) The behavior simulation method solves the cold-start problem of training, improves the traditionally inefficient reward-function design process and the random exploration process of early training, optimizes decision-making capability, and quickly yields a decision-making agent of high intelligence level and practical value.
Drawings
FIG. 1 is a schematic diagram of the cyclic interaction between an agent and the air confrontation simulation environment according to the present invention;
FIG. 2 is a block diagram of the expert-behavior-based inverse reinforcement learning method of the present invention;
FIG. 3 is a flow chart of the expert-behavior-based behavioral decision simulation method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-3, a behavior simulation training method for air intelligent gaming comprises the following steps:
S1, the air confrontation simulation environment and the red-side and blue-side agents are constructed, the intelligent-agent game decision model is constructed based on a deep reinforcement learning algorithm, and cyclic interaction between the air confrontation simulation environment and each agent is achieved.
In one embodiment, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
The intelligent-agent game decision model is constructed with the DDPG (Deep Deterministic Policy Gradient) deep reinforcement learning algorithm. For the adversarial training of the model, the constructed air confrontation simulation environment is linked with the agents, here red-side and blue-side aircraft, so that the agents and the air confrontation simulation environment influence each other interactively, as shown in Fig. 1. It should be noted that the agent can also change with the application scenario, e.g. it can be a missile, and the intelligent-agent game decision model can also be constructed based on the DQN algorithm or other prior-art deep reinforcement learning algorithms.
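As an illustration of the cyclic interaction shown in Fig. 1, the following minimal Python sketch drives one engagement of an agent against a simulated environment; the environment class and the agent's act/observe interface are hypothetical stand-ins introduced here for illustration and are not part of the patent.

```python
def run_engagement(env, agent, max_steps):
    """Hypothetical interaction loop: state -> action -> next state + reward."""
    state = env.reset()                                   # initial environment state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                         # a_t chosen from the action space
        next_state, reward, done = env.step(action)       # S_{t+1} and the shaped reward
        agent.observe(state, action, reward, next_state)  # store the transition for training
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```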
S2, the environment state and the action space of each agent are determined, the action space comprising at least one action a_t, and a continuous non-sparse reward function R_t is shaped for each action a_t.
In an embodiment, the action space comprises at least one action a_t selected from target pursuit, target evasion, tangential escape from enemy lock, cross attack and serpentine maneuver. The number and types of optional actions a_t in the action space can be adjusted to actual needs, and other prior-art actions can also be adopted.
In one embodiment, the continuous non-sparse reward function R_t shaped for each action a_t is as follows:
1) The reward function R_t1 for target pursuit satisfies:
[formula given as an image in the original publication; it combines Δθ_c, a′, A and B]
2) The reward function R_t2 for target evasion satisfies:
R_t2 = Δθ_f + a′ + A + B
3) The reward function R_t3 for tangential escape from enemy lock satisfies:
R_t3 = −Δθ_t + a′ + A + B
4) The reward function R_t4 for cross attack satisfies:
[formula given as an image in the original publication; it combines Δθ_c or Δθ_j (depending on the sign of x_a − x), a′, A and B]
5) The reward function R_t5 for the serpentine maneuver satisfies:
[formula given as an image in the original publication; it combines Δθ_s, A and B]
where Δθ_c = |θ−δ| − ε, Δθ_f = |θ−δ|, Δθ_t = |θ−δ| − 90°, Δθ_j = |θ−δ′|, Δθ_s = θ − δ − (90° − σ), θ is the absolute orientation angle of the red-side agent, θ_e is the absolute orientation angle of the blue-side agent nearest to the red-side agent, x is the distance between the red-side agent and the nearest blue-side agent, x_a is the distance between the red-side agent's nearest friendly aircraft and the nearest blue-side agent, δ is the line-of-sight angle between the red-side agent and the nearest blue-side agent, δ′ is the line-of-sight angle between the red-side agent and its nearest friendly aircraft, ε is the deviation angle between the red-side agent and the nearest blue-side agent, a′ is the acceleration of the red-side agent, A is the number of blue-side agents lost minus the number of red-side agents lost, B is the win/loss outcome of the red-side agent, v is the speed of the red-side agent, v_a is the speed of the red-side agent's nearest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter with σ ∈ [0°, 90°], and T_1 is the period with which the maneuver direction changes in the serpentine maneuver, 0 < T_1 ≤ T.
Specifically, a north-east coordinate system is adopted: θ is the angle between the nose heading of the red-side agent and due north, θ_e is the angle between the nose heading of the blue-side agent nearest to the red-side agent and due north, δ is the angle between the line from the red-side agent to the nearest blue-side agent and due north, δ′ is the angle between the line from the red-side agent to its nearest friendly aircraft and due north, and ε is the angle between the line from the red-side agent to the nearest blue-side agent and the red-side agent's nose heading. The spatial motion of an air-combat agent has six degrees of freedom; when shaping the reward functions, this application considers the agent's yaw angle and acceleration. The reward shaping covers the actions of target pursuit, target evasion, tangential escape from enemy lock, cross attack and serpentine maneuver, with the red-side agent as the own aircraft and the blue-side agents as enemy aircraft (enemy for short), where:
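To make the angle quantities concrete, the sketch below derives δ, Δθ_c, Δθ_f and Δθ_t from positions and headings in a local north-east plane. It is one possible reading of the definitions above; the (east, north) coordinate convention and the wrap-to-[−180°, 180°) convention are assumptions, not taken from the patent.

```python
import math

def bearing_deg(own_xy, target_xy):
    """Angle of the line own -> target measured clockwise from due north (degrees).
    Positions are (east, north) pairs in a local plane."""
    d_east = target_xy[0] - own_xy[0]
    d_north = target_xy[1] - own_xy[1]
    return math.degrees(math.atan2(d_east, d_north)) % 360.0

def wrap180(angle_deg):
    """Wrap an angle difference into [-180, 180) degrees (assumed convention)."""
    return (angle_deg + 180.0) % 360.0 - 180.0

def angle_factors(theta, own_xy, enemy_xy, epsilon):
    """Return (delta, d_theta_c, d_theta_f, d_theta_t) per the definitions above."""
    delta = bearing_deg(own_xy, enemy_xy)      # line angle to the nearest enemy
    diff = abs(wrap180(theta - delta))         # |theta - delta|
    d_theta_c = diff - epsilon                 # pursuit factor
    d_theta_f = diff                           # evasion factor
    d_theta_t = diff - 90.0                    # tangential-escape factor
    return delta, d_theta_c, d_theta_f, d_theta_t
```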
target pursuit: the local machine takes the nearest enemy plane as relative reference, and the difference value delta theta between the absolute orientation angle theta of the local machine and the line angle delta between the local machine and the nearest enemy plane is considered c = theta-delta-epsilon is one of the reward factors. When the enemy missile is hit by the target, the enemy missile is kept in the radar locking range of the enemy missile, so that the enemy missile is convenient to put the tail and avoid quickly, and meanwhile, the envelope of the enemy missile is compressed by delta theta c The smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle theta e In the same direction, i.e. | theta-theta e When | < 180 °, a' is positive reward, | theta-theta e If > 180 deg., a' is a negative reward. Relative hit number A and local win-lose conditionB is one of the consideration factors, the relative hit quantity A is the difference between the lost quantity of the blue party intelligent agent and the lost quantity of the red party intelligent agent, when the relative hit quantity A is larger than 0, the positive reward is given, A =0, the no reward is given, A is smaller than 0, the negative reward is given, meanwhile, the game winning in the local winning and negating condition B is the positive reward, the game tie is the no reward, and the game losing is the negative reward. Synthesizing a target pursuit reward function R according to the factors t1 The following were used:
Figure BDA0003029048020000091
target avoidance: the local machine takes the nearest enemy attacking the local machine as relative reference, and the difference value delta theta between the absolute orientation angle theta of the local machine and the nearest enemy connecting line angle delta between the local machine and the local machine f And (= theta-delta) is one of the reward factors. When avoiding the target, in order to compress the envelope of the missile of the enemy aircraft, the delta theta within the effective range of the enemy aircraft is avoided f The smaller the prize. The local acceleration a 'is one of the reward factors because of the need to quickly zoom in and out of range of the enemy while evading, and the factor a' is always a positive reward. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A =0, the reward is not available, A is smaller than 0, the negative reward is available, meanwhile, the local win-lose condition B win is positive reward, the tactic tie is no reward, and the tactic failure is negative reward. Synthesizing a target evasion reward function R according to the factors t2 The following:
R t2 =Δθ f +a′+A+B
Tangential escape from enemy lock: the own aircraft takes the enemy aircraft that most recently locked it as relative reference, and the difference Δθ_t = |θ − δ| − 90° between the own absolute orientation angle θ and the line angle δ to that enemy is one of the reward factors. Because the tangential escape exploits the Doppler effect to break the enemy radar lock, the optimal angle is 90°: the closer |θ − δ| is to 90°, i.e. the closer Δθ_t is to 0, the larger the reward. The own acceleration a′ is another reward factor; since the distance to the enemy must be increased rapidly during escape, a′ is always a positive reward. The relative hit count A and the own win/loss outcome B are further factors, defined and rewarded as above. The tangential-escape reward function R_t3 synthesized from these factors is:
R_t3 = −Δθ_t + a′ + A + B
cross attack: the machine gives different rewards according to the position of the nearest friend machine (namely the nearest red party intelligent agent to the machine) by taking the nearest enemy machine as a reference.
If x a X is more than or equal to 0, the machine is positioned in the front of the nearest friend machine to execute attack tactics, x is the distance between the red square intelligent agent and the nearest blue square intelligent agent, namely the distance between the machine and the nearest enemy plane, x a The distance between the nearest friend machine of the red intelligent agent and the nearest blue intelligent agent, namely the distance between the nearest friend machine and the nearest enemy machine is considered, and the difference value delta theta between the absolute orientation angle theta of the local machine and the connection angle delta between the local machine and the nearest enemy machine is considered when the local machine has an offset angle epsilon relative to the nearest enemy machine c And (= theta-delta-epsilon) is one of the reward factors. When the enemy missile is hit by the target, the enemy missile is kept in the radar locking range of the enemy missile, so that the enemy missile is convenient to put the tail and avoid quickly, and meanwhile, the envelope of the enemy missile is compressed by delta theta c The smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle theta e In the same direction, i.e. | theta-theta e When | < 180 °, a' is positive reward, | theta-theta e If > 180 deg., a' is a negative reward. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A =0, the reward is not available, A is smaller than 0, the negative reward is available, meanwhile, the local win-lose condition B win is positive reward, the tactic tie is no reward, and the tactic failure is negative reward.
If x a And x is less than 0, the machine is positioned behind the nearest friend machine to execute the tactics of following the previous machine and opening the interference. The difference delta theta between the absolute orientation angle theta of the machine and the connection angle delta' between the machine and the nearest friend machine j = theta-delta' | is one of the reward factors. When following the nearest friend machine, the distance between the nearest friend machine and the team friend is kept to play a role in hiding and disturbing the enemy machine by considering that the nearest friend machine is kept in the self radar locking range of the machine j The smaller the prize. The local acceleration a' is one of the reward factors, at the local absolute heading angle theta and the nearest enemy heading angle theta e In the same direction, i.e. | theta-theta e When | < 180 °, a' is positive reward, | theta-theta e If > 180 deg., a ' is a negative reward, while a ' is proportional to the speed difference with the nearest friend, i.e. a ' = α (v-v) a ) Where v is the local velocity, v a α is a weight coefficient for the nearest friend speed. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue party intelligent agent and the loss quantity of the red party intelligent agent, when the relative hit quantity A is larger than 0, the positive reward is given, A =0, the no reward is given, A is smaller than 0, the negative reward is given, meanwhile, the local win-lose condition B winning in the game is the positive reward, the game is no reward in the game tie, and the game failure is the negative reward. Synthesizing a cross attack reward function R according to the factors t4 The following were used:
Figure BDA0003029048020000111
Serpentine maneuver: the own aircraft takes the nearest enemy aircraft as relative reference and changes its maneuver direction every period T_1, with 0 < T_1 ≤ T; during the intervals 2kT_1 ≤ t ≤ (2k+1)T_1 (and correspondingly in the alternate intervals, as given by a formula shown as an image in the original publication), the difference Δθ_s = θ − δ − (90° − σ) between the own absolute orientation angle θ and the line angle δ to the nearest enemy is one of the reward factors. σ is an adjustable angle parameter, σ ∈ [0°, 90°], used to control how quickly the own aircraft closes on the nearest enemy: the larger σ, the larger the own radial speed, where the radial direction is the arrow from the own aircraft to the nearest enemy and the radial speed is the velocity component along it. The relative hit count A and the own win/loss outcome B are further factors, defined and rewarded as above. The serpentine-maneuver reward function R_t5 synthesized from these factors is:
[formula given as an image in the original publication; it combines Δθ_s, A and B]
Because the agent's actions are continuous (e.g. the aircraft's turn angle and acceleration) and the reward function serves as the evaluation of behavior, a continuous non-sparse reward function can be used in one-to-one correspondence with behavior so as to give the agent feedback at every time point during the real-time confrontation. The shaping of the continuous non-sparse reward function is clear, explicit and interpretable, and at the same time the agent's behavior control can be adjusted by human intervention.
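As an example of how such a shaped reward translates into code, the sketch below implements the target-evasion reward R_t2 = Δθ_f + a′ + A + B. The +1/0/−1 encoding of the win/loss term B is an assumption made for illustration.

```python
def target_evasion_reward(theta, delta, accel, blue_losses, red_losses, outcome):
    """R_t2 = d_theta_f + a' + A + B, following the shaping described above.

    outcome: +1 win, 0 draw, -1 loss (assumed encoding of the win/loss term B).
    """
    d_theta_f = abs(theta - delta)      # angular deviation from the line to the nearest enemy
    A = blue_losses - red_losses        # relative hit count
    B = float(outcome)                  # win/loss term
    return d_theta_f + accel + A + B
```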
S3, air game engagements are conducted in the intelligent-agent game decision model, the engagement duration being T; according to the current environment state S_t, t = 0, 1, 2, ..., T, each agent performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to produce the next environment state S_{t+1} and the reward given by the corresponding reward function R_t; iterating in this loop to achieve the maximum cumulative reward R_b.
S32, each agent is controlled to play the air game based on expert behavior, realizing inverse reinforcement learning and obtaining the target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector with h components, ‖w*‖ ≤ 1, and φ(s) is the situation information vector, φ: S → [0,1]^h.
Each agent is controlled to play the air game based on expert behavior to realize inverse reinforcement learning, and the target reward function is solved automatically. Suppose there is a series of air decision feature elements, φ: S → [0,1]^h; for example, the feature elements comprise the agent's basic spatial coordinates and motion indices (speed, turn angle), as well as the agent's fire-control system state, radar state, jamming-pod state, and so on. These feature indices are correlated with the reward; for rapid optimization this correlation is expressed as a linear combination, giving the target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector with h components and ‖w*‖ ≤ 1, and φ(s) is the situation information vector, so that the importance of each feature quantity is expressed in a normalized space. When the agents are controlled by the expert behavior to play the air game, a series of battlefield situations is generated; the set of battlefield situations corresponding to the i-th engagement is {s_t^(i) | t = 0, 1, ..., T}.
Assuming that there are p red-side aircraft and q blue-side aircraft, the collected situation information vector φ(s) may optionally include the following elements (a sketch of assembling such a vector is given after this list):
a) the pairwise relative distances between the aircraft;
b) the pairwise relative angles between the aircraft;
c) the absolute heading angles of the p + q aircraft;
d) the longitude and latitude coordinates of the p + q aircraft;
e) pq missile threat points, representing the threat posed by a missile in flight to an aircraft (the threat is proportional to the flight time and inversely proportional to the distance to the target aircraft);
f) the speeds of the p + q aircraft;
g) the relative positions of the p + q aircraft;
h) whether the own aircraft is the current attacking aircraft;
i) the missile states of the q enemy aircraft at the previous moment;
j) the missile states of the p friendly aircraft at the previous moment.
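A minimal sketch of how a situation information vector φ(s) of this kind might be assembled and squashed into [0,1]^h is given below; the dictionary keys and normalization bounds are illustrative assumptions and not prescribed by the patent.

```python
import numpy as np

def situation_vector(red, blue, max_range=200_000.0, max_speed=700.0):
    """Stack a few of the listed features and normalize them into [0, 1] (illustrative only).

    red, blue: lists of dicts with assumed keys 'pos' (east, north in m),
    'heading' (deg from north) and 'speed' (m/s)."""
    feats = []
    planes = red + blue
    for i in range(len(planes)):
        for j in range(i + 1, len(planes)):
            d = np.linalg.norm(np.subtract(planes[i]["pos"], planes[j]["pos"]))
            feats.append(min(d / max_range, 1.0))        # pairwise relative distance
    for p in planes:
        feats.append((p["heading"] % 360.0) / 360.0)     # absolute heading angle
        feats.append(min(p["speed"] / max_speed, 1.0))   # speed
    return np.asarray(feats, dtype=np.float32)           # phi(s) in [0, 1]^h
```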
In one embodiment, controlling each agent to play the air game based on expert behavior to realize inverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and computing its corresponding feature expectation μ(π_0), with i = 0, 0 ≤ i ≤ m, where m is the number of engagements;
In one embodiment, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t φ(s_t)|π_i
where γ is the discount factor and φ(s_t)|π_i is the situation information vector of the battlefield situation at time t under action strategy π_i.
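Under the assumption that the expectation is estimated from a single recorded trajectory of situation vectors, the formula above can be computed as follows.

```python
import numpy as np

def feature_expectation(phi_trajectory, gamma):
    """mu(pi) = sum_t gamma^t * phi(s_t) for one engagement played under policy pi."""
    phi_trajectory = np.asarray(phi_trajectory, dtype=np.float64)  # shape (T+1, h)
    discounts = gamma ** np.arange(len(phi_trajectory))
    return (discounts[:, None] * phi_trajectory).sum(axis=0)
```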
In one embodiment, controlling each agent to play the air game based on expert behavior to realize inverse reinforcement learning further comprises the following steps:
generating, for each engagement, the set of battlefield situations {s_t^(i) | t = 0, 1, ..., T} produced by the corresponding action strategy, and calculating the global feature expectation μ̂ of the m air-game engagements by the formula:
μ̂ = (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t φ(s_t^(i))
where s_t^(i) is the environment state at time t of the i-th engagement and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th engagement.
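Averaging per-engagement feature expectations over the m recorded engagements then gives the global estimate; a short sketch reusing the feature_expectation helper from the previous sketch:

```python
def global_feature_expectation(phi_trajectories, gamma):
    """Estimate of the (expert) feature expectation from m recorded engagements."""
    mus = [feature_expectation(traj, gamma) for traj in phi_trajectories]
    return sum(mus) / len(mus)
```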
S323, calculating the feature error between the current action strategy and the expert action strategy with a min-max algorithm,
e_i = max_{w:‖w‖≤1} min_{j∈{0,...,i−1}} w^T(μ_E − μ_j),
taking the maximizing w as the current reward-function weight vector w_i and updating the reward-function weight vector w* to (w_i)^T, where μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the current action strategy, and T denotes transposition;
S324, judging whether the feature error e_i is less than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, computing the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation to μ(π_i), setting i = i + 1 and returning to step S323 for the next iteration. For accuracy within acceptable computation time, the error threshold for e_i is set to 10⁻⁵ in this embodiment: after several iterations, once e_i < 10⁻⁵ the feature error e_i is judged to be below the threshold (a code sketch of one such weight/error update is given below).
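As a sketch of the weight/error update, the code below uses the projection variant of apprenticeship learning (Abbeel and Ng) as a QP-free stand-in for the max-min step described above; it is not the patent's exact procedure, but it produces a weight vector w and a feature error e that play the same roles.

```python
import numpy as np

def projection_irl_step(mu_E, mu_list, mu_bar_prev):
    """One weight/error update via the projection variant of apprenticeship learning
    (a stand-in for the max-min step in S323).

    mu_E: expert feature expectation; mu_list[-1]: feature expectation of the latest
    policy; mu_bar_prev: previous projected point (None on the first call)."""
    mu_latest = mu_list[-1]
    if mu_bar_prev is None:
        mu_bar = mu_latest
    else:
        d = mu_latest - mu_bar_prev
        num = np.dot(d, mu_E - mu_bar_prev)
        mu_bar = mu_bar_prev + (num / np.dot(d, d)) * d   # orthogonal projection step
    w = mu_E - mu_bar                                     # current reward weight vector
    e = np.linalg.norm(w)                                 # feature error e_i
    return w, e, mu_bar
```

The outer loop would stop once e falls below the error threshold (10⁻⁵ in the embodiment above) and return R*(s) = wᵀφ(s).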
Through the expert-behavior-based inverse reinforcement learning algorithm, the agent's combat effectiveness is brought close to that of the expert, and the solution of the target reward function can be automated while remaining interpretable.
S33, the similarity R_ε between each agent's behavior and the expert behavior is calculated.
In one embodiment, in step S33, the similarity R_ε between each agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, where μ(S_t|θ^μ) denotes the action generated by strategy μ with neural-network parameters θ^μ in environment state S_t;
S332, generating an action a_e as the imitation target from the current environment state S_t using the expert behavior strategy;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior with the Kullback-Leibler divergence, by the following formulas:
for a discrete action a_t, R_ε is computed from the divergence
D_KL(a_t ‖ a_e) = Σ_{x∈X} a_t(x) log( a_t(x) / a_e(x) )
where a_t(x) is the probability that the agent performs behavior x, a_e(x) is the probability that the expert performs behavior x, x is a behavior and X is the behavior set; for example, if X = {fly north, fly south, fly east, fly west}, then "fly north" is a behavior;
for a continuous action a_t, R_ε is computed from the divergence
D_KL(a_t ‖ a_e) = ∫_X a_t(x) log( a_t(x) / a_e(x) ) dx
where a_t(x) is the probability density of the agent performing behavior x, a_e(x) is the probability density of the expert performing behavior x, x is a behavior and X is the behavior set, for example X = {fly at x degrees from the horizontal, x ∈ [0°, 180°]}.
The DDPG algorithm outputs a concrete action according to the current environment state and action strategy and executes it, and keeps producing action commands according to the real-time situation until the engagement ends. Every time the DDPG algorithm outputs a behavior, the expert behavior strategy makes a decision in the same situation, but this decision is not applied to the environment state; that is, the expert behavior does not take effect and serves only as feedback for the DDPG action decision, and the similarity between the expert behavior and the agent's action is calculated. A high similarity means the current intelligent-agent game decision model imitates the expert behavior well.
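For a discrete behavior set, the KL-based similarity can be computed as below. Treating R_ε as the negative KL divergence, so that closer imitation yields a larger reward, is an interpretation of the text; the small constant added for numerical stability is likewise an implementation assumption.

```python
import numpy as np

def kl_similarity(agent_probs, expert_probs, eps=1e-12):
    """R_eps interpreted as -KL(a_t || a_e) over a discrete behavior set X."""
    p = np.asarray(agent_probs, dtype=np.float64) + eps
    q = np.asarray(expert_probs, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return -float(np.sum(p * np.log(p / q)))

# Example: nearly identical distributions give a similarity close to 0 (strong imitation).
print(kl_similarity([0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]))
```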
S34, the composite reward r_t is obtained, given by:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
where w_b is the weight coefficient of the maximum cumulative reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε.
S4, the intelligent-agent game decision model is trained and the composite reward r_t is compared with the reward threshold; if it is greater than the reward threshold, the training terminates and the final intelligent-agent game decision model is obtained; otherwise, the iterative training continues in a loop until the maximum number of training engagements n is reached or the composite reward r_t exceeds the reward threshold.
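The composite reward itself is a plain weighted sum; a one-line sketch follows, with example weight values that are assumptions rather than values taken from the patent.

```python
def composite_reward(R_b, R_star, R_eps, w_b=0.4, w_i=0.4, w_eps=0.2):
    """r_t = w_b*R_b + w_i*R*(s) + w_eps*R_eps (the example weights are assumptions)."""
    return w_b * R_b + w_i * R_star + w_eps * R_eps
```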
The application mainly uses the DDPG algorithm for expert-agent decision-behavior simulation training so as to obtain an intelligent-agent game decision model capable of producing expert-like behavior. The behavior simulation training process is one in which the agent generates actions according to the situation to obtain returns and continuously accumulates experience, the generated actions shifting toward those with high return until a stable high return is reached and the model converges, thereby obtaining the final high-intelligence intelligent-agent game decision model for air gaming.
In an embodiment, in step S4, training the intelligent-agent game decision model comprises the following steps:
S41, initializing the online neural-network parameters θ^μ of the actor network and the online neural-network parameters θ^Q of the critic network in the intelligent-agent game decision model; the actor network outputs a behavior according to the battlefield situation, and the critic network scores the behavior according to the battlefield situation and the behavior.
S42, copying the online neural-network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ, θ^{Q′} ← θ^Q;
S43, initializing the replay buffer R;
S44, performing the following steps for the L-th engagement (L = 1, 2, ..., n):
S441, initializing the random-process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, ..., T:
a) each agent executes its corresponding action and the environment state transitions, yielding the next environment state S_{t+1} and the composite reward r_t;
b) the environment-state transition (S_t, a_t, r_t, S_{t+1}) is stored as a tuple in the replay buffer R to serve as the training data set;
c) U tuples are randomly sampled from the replay buffer R as mini-batch training data, and the label is computed as
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′})
where μ′(s_{L+1}|θ^{μ′}) denotes the action generated by the target strategy μ′ with parameters θ^{μ′} in environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) is the value of the target Q function with parameters θ^{Q′} for environment state s_{L+1} and that action;
d) the loss function of the critic network is minimized, the loss function being
Loss = (1/U)·Σ_L ( y_L − Q(s_L, a_L|θ^Q) )²
e) the policy gradient of the actor network is updated:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
where Q(s, a|θ^Q) denotes the value of the Q function with parameters θ^Q for environment state s and executed action a, ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a, μ(s|θ^μ) denotes the action generated by strategy μ with parameters θ^μ in environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated:
θ^{Q′} ← τ·θ^Q + (1−τ)·θ^{Q′}
θ^{μ′} ← τ·θ^μ + (1−τ)·θ^{μ′}
where τ is an adjustable coefficient, τ ∈ [0,1];
g) the composite reward r_t is compared with the reward threshold; if it is greater than the reward threshold, the training terminates and the final intelligent-agent game decision model is obtained; otherwise, the process returns to step S44 for further iterative training until the maximum number of training engagements n is reached or the composite reward r_t exceeds the reward threshold. For example, if the preset upper limit of the reward threshold is 100 and the user sets the reward threshold to 80 according to actual needs, then training terminates when the composite reward r_t exceeds 80; otherwise the process returns to step S44 for iterative training until the maximum number of training engagements n is reached or the composite reward r_t exceeds 80.
The agent's behavior is decision-trained with the deep reinforcement learning algorithm, the similarity between expert behavior and agent behavior is used as feedback, and the deep networks are continuously updated; the mini-batch update in steps c) through f) is sketched in code below. By training the intelligent-agent game decision model, decision experience can be obtained from expert prior knowledge; as the similarity between agent behavior and expert behavior increases during training, the level of behavior imitation gradually improves, and the final high-intelligence intelligent-agent game decision model is obtained for air gaming.
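The mini-batch update in steps c) through f) follows the standard DDPG update. The PyTorch sketch below shows one such update with the soft target-network update of step f); the framework choice, network sizes, learning rates and τ are illustrative assumptions and are not prescribed by the patent.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    """Small fully connected network (layer sizes are illustrative)."""
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

class DDPGUpdater:
    def __init__(self, s_dim, a_dim, gamma=0.99, tau=0.005, lr=1e-3):
        self.actor = mlp(s_dim, a_dim)             # online actor mu(s | theta_mu)
        self.critic = mlp(s_dim + a_dim, 1)        # online critic Q(s, a | theta_Q)
        self.actor_tgt = mlp(s_dim, a_dim)         # target actor mu'
        self.critic_tgt = mlp(s_dim + a_dim, 1)    # target critic Q'
        self.actor_tgt.load_state_dict(self.actor.state_dict())    # step S42: copy online -> target
        self.critic_tgt.load_state_dict(self.critic.state_dict())
        self.opt_a = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.opt_c = torch.optim.Adam(self.critic.parameters(), lr=lr)
        self.gamma, self.tau = gamma, tau

    def update(self, s, a, r, s2):
        """One mini-batch update; s, a, r, s2 are tensors of shape
        (U, s_dim), (U, a_dim), (U, 1), (U, s_dim) sampled from the replay buffer."""
        # c) label y = r + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
        with torch.no_grad():
            y = r + self.gamma * self.critic_tgt(torch.cat([s2, self.actor_tgt(s2)], dim=1))
        # d) minimize the critic loss (1/U) * sum (y - Q(s, a))^2
        critic_loss = nn.functional.mse_loss(self.critic(torch.cat([s, a], dim=1)), y)
        self.opt_c.zero_grad(); critic_loss.backward(); self.opt_c.step()
        # e) actor update along the deterministic policy gradient
        actor_loss = -self.critic(torch.cat([s, self.actor(s)], dim=1)).mean()
        self.opt_a.zero_grad(); actor_loss.backward(); self.opt_a.step()
        # f) soft update: theta' <- tau*theta + (1 - tau)*theta'
        for tgt, src in ((self.actor_tgt, self.actor), (self.critic_tgt, self.critic)):
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.data.mul_(1.0 - self.tau).add_(self.tau * p.data)
```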
In one embodiment, the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated with a gradient-descent method. The return of the generated actions changes with each update, and the overall trend is an increase in return.
In conclusion, the method combines key elements of the air scenario of the intelligent-agent game decision model as reward factors and shapes continuous non-sparse reward functions, so that the reward design is clear, explicit and interpretable while the agent's behavior control can still be manually intervened; the expert-behavior-based inverse reinforcement learning algorithm brings the agent's combat effectiveness close to that of the expert, and the solution of the target reward function can be automated while remaining interpretable; the expert-behavior-based decision simulation method rapidly raises the agent's decision level, overcomes slow or even failed convergence, and fits linear and nonlinear reward functions so that the agent can imitate complex expert behaviors, thereby reducing labor cost; continuous non-sparse shaped rewards, inverse-reinforcement-learning rewards and behavior-imitation similarity rewards are considered simultaneously, and their weights can be adjusted to user requirements, raising the agent's intelligent decision level; the cold-start problem of training is solved by the behavior simulation method, the traditionally inefficient reward-function design process and the random exploration process of early training are improved, the decision-making capability is optimized, and a decision-making agent of high intelligence level and practical value can be obtained quickly.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments only express several specific and detailed implementations of the present application and should not be understood as limiting the scope of the claims. It should be noted that several variations and improvements can be made by a person skilled in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

1. A behavior simulation training method facing to air intelligent game is characterized in that: the behavior simulation training method for the air intelligent game comprises the following steps:
s1, constructing an air countermeasure simulation environment and an agent of both red and blue, constructing an agent game decision model based on a deep reinforcement learning algorithm, and realizing circular interaction between the air countermeasure simulation environment and each agent;
s2, determining the environment state and the action space of each agent, wherein the action space comprises at least one action a t Shaping each of said actions a t Is continuous non-sparse reward function R t
S3, performing air game match in the intelligent agent game decision model, wherein the game match time is T, and each intelligent agent performs game match according to the current environment state S t T =0,1,2, \8230, T, performing the following steps:
s31, determining action a required to be executed t The action performed a t Generating a next environment state S after acting on the air countermeasure simulation environment t+1 And obtaining the corresponding reward function R t In turn, the iteration is circulated to realize the maximum accumulated reward R b
S32, controlling each intelligent agent to carry out air game based on expert behaviors to realize reverse reinforcement learning, and obtaining a target reward function R * (s)=w * Phi(s), wherein w * Is a reward function weight vector, h in total,
Figure FDA0003029048010000011
‖w * | < 1, phi(s) is a situation information vector phi: → [0,1] h
S33, calculating similarity R of behaviors of the agents and the expert behaviors ε
S34, obtaining the comprehensive reward r_t, which satisfies the following formula:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε,
wherein w_b is the weight coefficient of the maximized accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε;
S4, training the agent game decision model and judging whether the comprehensive reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent game decision model; otherwise, performing loop iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t exceeds the reward threshold.
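For illustration only, the following Python sketch shows how the composite reward of step S34 and the termination test of step S4 could be computed; the weight values, the reward threshold and the episode budget are placeholders, not values taken from the patent.

```python
# Minimal sketch of the composite reward r_t = w_b*R_b + w_i*R*(s) + w_eps*R_eps
# (step S34) and the stopping rule of step S4. All numeric defaults are assumptions.

def composite_reward(r_shaped: float, r_irl: float, r_similarity: float,
                     w_b: float = 0.4, w_i: float = 0.3, w_eps: float = 0.3) -> float:
    """Weighted sum of the shaped reward R_b, the IRL reward R*(s) and the
    behaviour-similarity reward R_eps; the weights are user-adjustable."""
    return w_b * r_shaped + w_i * r_irl + w_eps * r_similarity

def training_finished(r_t: float, episode: int,
                      reward_threshold: float = 100.0, max_episodes: int = 10_000) -> bool:
    """Stop once the composite reward exceeds the threshold or the episode budget n is spent."""
    return r_t > reward_threshold or episode >= max_episodes
```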
2. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
3. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that the action space comprises at least one of the actions a_t: target pursuit, target avoidance, tangential escape from enemy lock, cross attack, and snake maneuver.
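As a simple data-structure illustration, the discrete action space of claim 3 could be represented as an enumeration; the identifier names below are hypothetical, not taken from the patent.

```python
from enum import Enum, auto

class AirCombatAction(Enum):
    """The five tactical actions listed in claim 3 (names are illustrative)."""
    TARGET_PURSUIT = auto()
    TARGET_AVOIDANCE = auto()
    TANGENTIAL_ESCAPE_FROM_LOCK = auto()
    CROSS_ATTACK = auto()
    SNAKE_MANEUVER = auto()
```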
4. The behavior simulation training method for air intelligent gaming according to claim 3, characterized in that the continuous non-sparse reward function R_t shaped for each action a_t is as follows:
1) the reward function R_t1 of target pursuit satisfies the condition given by the corresponding formula (reproduced only as an image in the original publication);
2) the reward function R_t2 of target avoidance satisfies: R_t2 = Δθ_f + a′ + A + B;
3) the reward function R_t3 of tangential escape from enemy lock satisfies: R_t3 = −Δθ_t + a′ + A + B;
4) the reward function R_t4 of cross attack satisfies the condition given by the corresponding formula (reproduced only as an image in the original publication);
5) the reward function R_t5 of the snake maneuver satisfies the condition given by the corresponding formula (reproduced only as an image in the original publication);
wherein Δθ_c = |θ − δ| − ε, Δθ_f = |θ − δ|, Δθ_t = |θ − δ| − 90°, Δθ_j = |θ − δ′|, Δθ_s = |θ − δ| − (90° − σ); θ is the absolute orientation angle of the red agent; θ_e is the absolute orientation angle of the blue agent nearest to the red agent; x is the distance between the red agent and the nearest blue agent; x_a is the distance between the red agent's nearest friendly aircraft and the nearest blue agent; δ is the angle between the red agent and the nearest blue agent, and δ′ and ε are likewise angles defined with respect to the nearest blue agent; a′ is the acceleration of the red agent; A is the difference between the blue side's losses and the red side's losses; B is the number of red-side wins; v is the speed of the blue agent and v_a is the speed of the red agent's nearest friendly aircraft; α is a weight coefficient; σ is an adjustable angle parameter with σ ∈ [0°, 90°]; T_1 is the period of movement-direction change in the snake maneuver, 0 < T_1 ≤ T.
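The two shaped rewards whose formulas survive in the text, R_t2 and R_t3, can be written directly as functions of the claimed symbols. The sketch below assumes angles in degrees; the remaining rewards (R_t1, R_t4, R_t5) are omitted because their formulas appear only as images, and the variable names are illustrative rather than taken from the patent.

```python
# Hedged sketch of the target-avoidance reward R_t2 = Δθ_f + a' + A + B and the
# tangential-escape reward R_t3 = -Δθ_t + a' + A + B, following the symbol
# definitions of claim 4 (theta and delta in degrees).

def reward_target_avoidance(theta: float, delta: float, acceleration: float,
                            loss_difference: float, red_wins: float) -> float:
    delta_theta_f = abs(theta - delta)                 # Δθ_f = |θ − δ|
    return delta_theta_f + acceleration + loss_difference + red_wins

def reward_tangential_escape(theta: float, delta: float, acceleration: float,
                             loss_difference: float, red_wins: float) -> float:
    delta_theta_t = abs(theta - delta) - 90.0          # Δθ_t = |θ − δ| − 90°
    return -delta_theta_t + acceleration + loss_difference + red_wins
```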
5. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that controlling each agent to play the air game based on expert behaviors to realize inverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and calculating the feature expectation μ(π_0) corresponding to the initial action strategy π_0, with i = 0, 0 ≤ i ≤ m, where m is the number of game rounds;
S323, calculating the feature error e_i between the current action strategy and the expert action strategy by a min-max algorithm (the formula appears only as an image in the original publication), obtaining the current reward function weight vector w_i, and updating the reward function weight vector w* to (w_i)^T, wherein μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the current action strategy, and T denotes the transpose;
S324, judging whether the feature error e_i is smaller than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, calculating the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation to μ(π_i), setting i = i + 1, and returning to step S323 to iterate.
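Since the min-max formula of step S323 appears only as an image, the sketch below follows the standard projection variant of apprenticeship learning (Abbeel–Ng style) as an assumed reading of the claim: the weight vector points from a running estimate of the agent's feature expectation toward the expert's, and the residual norm plays the role of the feature error e_i.

```python
from typing import Optional
import numpy as np

def irl_projection_step(mu_expert: np.ndarray, mu_new: np.ndarray,
                        mu_bar_prev: Optional[np.ndarray] = None):
    """One inverse-RL iteration: returns (w_i, e_i, mu_bar).

    mu_expert   : expert feature expectation mu_E
    mu_new      : feature expectation mu(pi_i) of the latest agent policy
    mu_bar_prev : running projection from the previous iteration (None on the first call)
    """
    if mu_bar_prev is None:
        mu_bar = mu_new                                   # first iteration: mu_bar = mu(pi_0)
    else:
        d = mu_new - mu_bar_prev                          # project mu_E onto the new direction
        mu_bar = mu_bar_prev + (d @ (mu_expert - mu_bar_prev)) / (d @ d + 1e-12) * d
    w = mu_expert - mu_bar
    error = float(np.linalg.norm(w))                      # feature error e_i
    return w / (error + 1e-12), error, mu_bar             # normalising keeps ||w|| <= 1
```

The outer loop would then train a new policy π_i against the reward w·φ(s) with the chosen deep RL algorithm, recompute μ(π_i), and call the step again until the error drops below the threshold of step S324.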
6. The behavior simulation training method for air intelligent gaming according to claim 5, characterized in that the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t·φ(s_t) | π_i,
where γ is the discount factor and φ(s_t)|π_i is the battlefield situation at time t under the action strategy π_i.
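The discounted feature expectation of claim 6 can be estimated from a single rollout of a policy as a discounted sum of situation vectors; the summation form below is an assumed reconstruction, since the patent's own formula is shown only as an image.

```python
import numpy as np

def feature_expectation(phi_trajectory: list[np.ndarray], gamma: float = 0.99) -> np.ndarray:
    """Discounted sum of situation vectors: mu(pi) ≈ Σ_t γ^t · φ(s_t),
    where phi_trajectory[t] is φ(s_t) observed at step t under the policy."""
    return sum((gamma ** t) * phi_t for t, phi_t in enumerate(phi_trajectory))
```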
7. The behavior simulation training method for air intelligent gaming according to claim 6, characterized in that controlling each agent to play the air game based on expert behaviors to realize inverse reinforcement learning further comprises the following steps:
generating a battlefield situation set { s_t^{(i)} | i = 1, …, m; t = 0, …, T } according to the action strategy corresponding to each game round, and calculating the global feature expectation of the m rounds of air games according to the following formula:
global feature expectation = (1/m)·Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t·φ(s_t^{(i)}),
wherein s_t^{(i)} is the environment state at time t of the i-th round and φ(s_t^{(i)}) is the situation information vector corresponding to the battlefield situation at time t of the i-th round.
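Averaging the same discounted sums over the m recorded game rounds gives the global feature expectation of claim 7; the averaging form is an assumption, since the symbol and formula appear only as images in the original publication.

```python
import numpy as np

def global_feature_expectation(games: list[list[np.ndarray]], gamma: float = 0.99) -> np.ndarray:
    """games[i][t] is the situation vector φ(s_t^{(i)}) at step t of the i-th round;
    returns (1/m) · Σ_i Σ_t γ^t · φ(s_t^{(i)})."""
    per_game = [sum((gamma ** t) * phi for t, phi in enumerate(traj)) for traj in games]
    return sum(per_game) / len(games)
```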
8. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that, in step S33, the similarity R_ε between the behaviors of each agent and the expert behaviors is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, wherein μ(S_t|θ^μ) represents the action generated by the strategy μ with neural network parameters θ^μ in the environment state S_t;
S332, according to the current environment state S_t, generating an action a_e as the imitation object using the expert behavior strategy;
S333, calculating the similarity R_ε between the agent behaviors and the expert behaviors by the Kullback-Leibler divergence method, as follows:
for a discrete action a_t, R_ε is given by the discrete KL-divergence formula (reproduced only as an image in the original publication), wherein a_t(x) represents the probability of the agent performing behavior x, a_e(x) represents the probability of the expert performing behavior x, and X is the behavior set;
for a continuous action a_t, R_ε is given by the continuous KL-divergence formula (reproduced only as an image in the original publication), wherein a_t(x) represents the probability density of the agent's behavior x, a_e(x) represents the probability density of the expert's behavior x, and X is the behavior set.
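For the discrete case of claim 8, a behaviour-similarity reward based on the Kullback–Leibler divergence could look like the sketch below; the direction of the divergence and the sign convention (negating so that a larger value means closer to the expert) are assumptions, since the patent's formulas are shown only as images. For continuous actions the sum becomes an integral over the densities (closed form is available for Gaussians).

```python
import numpy as np

def kl_similarity(agent_probs: np.ndarray, expert_probs: np.ndarray, eps: float = 1e-12) -> float:
    """R_eps ≈ -D_KL(a_e || a_t) over the behaviour set X, where
    agent_probs[x] = a_t(x) and expert_probs[x] = a_e(x)."""
    a_t = np.clip(agent_probs, eps, 1.0)
    a_e = np.clip(expert_probs, eps, 1.0)
    kl = float(np.sum(a_e * np.log(a_e / a_t)))
    return -kl          # 0 means identical behaviour; more negative means less similar
```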
9. The behavior simulation training method for air intelligent gaming according to claim 1, characterized in that, in step S4, training the agent game decision model comprises the following steps:
S41, initializing the online neural network parameters θ^μ of the actor network and the online neural network parameters θ^Q of the critic network in the agent game decision model;
S42, copying the online neural network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network, i.e. θ^{μ′} ← θ^μ and θ^{Q′} ← θ^Q;
S43, initializing the replay cache R;
S44, for the L-th game round (L = 1, 2, …, n), executing the following steps:
S441, initializing the random process noise N and obtaining the initial environment state S;
S442, for time t = 1, 2, …, T, executing the following steps:
a) each agent executes its corresponding action and the environment state transitions, obtaining the next environment state S_{t+1} and the comprehensive reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay cache R to serve as the training data set;
c) randomly sampling U arrays from the replay cache R as mini-batch training data and computing the label
y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}),
wherein μ′(s_{L+1}|θ^{μ′}) represents the action generated by the target strategy μ′ with parameters θ^{μ′} in the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′}) | θ^{Q′}) represents the value of the target Q function with parameters θ^{Q′} in the environment state s_{L+1} when executing that action;
d) minimizing the loss function of the critic network, the loss function being
L = (1/U)·Σ_L (y_L − Q(s_L, a_L|θ^Q))²;
e) updating the policy gradient of the actor network:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L},
wherein Q(s, a|θ^Q) represents the value of the Q function with parameters θ^Q in the environment state s when executing the action a, ∇_a Q(s, a|θ^Q) represents the partial derivative of Q(s, a|θ^Q) with respect to a, μ(s|θ^μ) represents the action generated by the strategy μ with parameters θ^μ in the environment state s, and ∇_{θ^μ} μ(s|θ^μ) represents the partial derivative of μ(s|θ^μ) with respect to θ^μ;
f) updating the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network:
θ^{Q′} ← τ·θ^Q + (1 − τ)·θ^{Q′},
θ^{μ′} ← τ·θ^μ + (1 − τ)·θ^{μ′},
wherein τ is an adjustable coefficient, τ ≪ 1;
g) judging whether the comprehensive reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent game decision model; otherwise, returning to step S44 for loop iterative training until the maximum number of training rounds n is reached or the comprehensive reward r_t exceeds the reward threshold.
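Steps c) to f) of claim 9 correspond to one standard DDPG update. The PyTorch sketch below is an illustration under assumed interfaces (the actor maps states to actions, the critic maps state-action pairs to values, and `batch` holds the sampled transitions); it is not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, batch, gamma: float = 0.99, tau: float = 0.005):
    """One update over a mini-batch (S_t, a_t, r_t, S_{t+1}) sampled from the replay cache R."""
    s, a, r, s_next = batch

    # Step c): label y_L = r_L + γ · Q'(s_{L+1}, μ'(s_{L+1}|θ^{μ'}) | θ^{Q'})
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Step d): minimise the critic loss (1/U) Σ (y_L − Q(s_L, a_L|θ^Q))²
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step e): deterministic policy gradient — maximise Q(s, μ(s)) with respect to θ^μ
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step f): soft update θ' ← τ·θ + (1 − τ)·θ' for both target networks
    for target, online in ((critic_target, critic), (actor_target, actor)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```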
10. The behavior simulation training method for air intelligent gaming according to claim 9, characterized in that the target network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by a gradient descent method.
CN202110425153.0A 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game Active CN113221444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110425153.0A CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110425153.0A CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Publications (2)

Publication Number Publication Date
CN113221444A CN113221444A (en) 2021-08-06
CN113221444B true CN113221444B (en) 2023-01-03

Family

ID=77088029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110425153.0A Active CN113221444B (en) 2021-04-20 2021-04-20 Behavior simulation training method for air intelligent game

Country Status (1)

Country Link
CN (1) CN113221444B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867178B (en) * 2021-10-26 2022-05-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN114021737B (en) * 2021-11-04 2023-08-22 中国电子科技集团公司信息科学研究院 Reinforced learning method, system, terminal and storage medium based on game
CN113893539B (en) * 2021-12-09 2022-03-25 中国电子科技集团公司第十五研究所 Cooperative fighting method and device for intelligent agent
CN115648204A (en) * 2022-09-26 2023-01-31 吉林大学 Training method, device, equipment and storage medium of intelligent decision model
CN115470710B (en) * 2022-09-26 2023-06-06 北京鼎成智造科技有限公司 Air game simulation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291890B (en) * 2020-05-13 2021-01-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN112052511A (en) * 2020-06-15 2020-12-08 成都蓉奥科技有限公司 Air combat maneuver strategy generation technology based on deep random game
CN112052456A (en) * 2020-08-31 2020-12-08 浙江工业大学 Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents

Also Published As

Publication number Publication date
CN113221444A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN113221444B (en) Behavior simulation training method for air intelligent game
CN113791634B (en) Multi-agent reinforcement learning-based multi-machine air combat decision method
CN113050686B (en) Combat strategy optimization method and system based on deep reinforcement learning
CN113093802B (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN113396428B (en) Learning system, computer program product and method for multi-agent application
CN113095481A (en) Air combat maneuver method based on parallel self-game
CN113282061A (en) Unmanned aerial vehicle air game countermeasure solving method based on course learning
CN113893539B (en) Cooperative fighting method and device for intelligent agent
CN115755956B (en) Knowledge and data collaborative driving unmanned aerial vehicle maneuvering decision method and system
CN116185059A (en) Unmanned aerial vehicle air combat autonomous evasion maneuver decision-making method based on deep reinforcement learning
CN115933717A (en) Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
CN114167756B (en) Multi-unmanned aerial vehicle collaborative air combat decision autonomous learning and semi-physical simulation verification method
CN113741500A (en) Unmanned aerial vehicle air combat maneuver decision method for imitating Harris eagle intelligent predation optimization
Bae et al. Deep reinforcement learning-based air-to-air combat maneuver generation in a realistic environment
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN111830848A (en) Unmanned aerial vehicle super-maneuvering flight performance simulation training system and method
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
CN113741186A (en) Double-machine air combat decision method based on near-end strategy optimization
CN116225065A (en) Unmanned plane collaborative pursuit method of multi-degree-of-freedom model for multi-agent reinforcement learning
CN115457809A (en) Multi-agent reinforcement learning-based flight path planning method under opposite support scene
Wang et al. Research on autonomous decision-making of UCAV based on deep reinforcement learning
CN113093803B (en) Unmanned aerial vehicle air combat motion control method based on E-SAC algorithm
Lu et al. Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV
Wang et al. Research on naval air defense intelligent operations on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant