CN113221444A - Behavior simulation training method for air intelligent game - Google Patents
- Publication number
- CN113221444A (application CN202110425153.0A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F30/00—Computer-aided design [CAD]
        - G06F30/20—Design optimisation, verification or simulation
          - G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
      - G06N20/00—Machine learning
Abstract
The invention discloses a behavior simulation training method for air intelligent gaming, comprising the following steps: S1, constructing an agent gaming decision model; S2, determining the environment state and the action space, and shaping a continuous non-sparse reward function for each action; S3, playing air games in the model and executing the following steps: S31, generating the next environment state from the executed action and obtaining the reward, iterating in a loop to maximize the accumulated reward; S32, realizing inverse reinforcement learning based on expert behaviors and obtaining a target reward function; S33, calculating the similarity between each agent's behavior and the expert behavior; S34, obtaining a composite reward; and S4, training the agent gaming decision model. The method improves the traditionally inefficient reward-function design process and the random exploration process of model training, making the reward function interpretable and open to human intervention, raising the agent's decision level and convergence speed, and solving the cold-start problem of model training.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a behavior simulation training method for air intelligent gaming.
Background
In future air gaming, the most accurate information must be acquired from various detection systems at any time and place to achieve information superiority; more importantly, decision superiority is achieved using technologies such as machine learning, artificial intelligence and cloud computing. To better mine information for decision superiority, exert gaming effectiveness and secure air superiority, an air decision support system matched to the pilot is needed in addition to the pilot's flying skills and the commander's command ability. As an artificial intelligence assistant, such a system can provide reference decision schemes in a highly dynamic, complex confrontation environment and reduce the pilot's decision burden.
However, existing air decision support systems are relatively backward: the number of sensing parameters or targets that can be handled simultaneously is limited, and the robustness, timeliness and accuracy of decision support are poor. Moreover, in gaming decisions the training model is hard to converge because the decision dimensionality is too high, and training a practical agent takes very long or may fail entirely to produce an effective decision agent. In air intelligent self-play confrontation, sparse rewards and a complex, inefficient reward design stage lead to low agent decision levels and long training times; at the same time, rewards are hand-crafted per scenario, with high labor cost and poor reusability, and algorithm training suffers from a cold-start problem.
Disclosure of Invention
The invention aims to provide a behavior simulation training method for air intelligent gaming that improves the traditionally inefficient reward-function design process and the random exploration process during model training, so that the reward function is designed clearly and explicitly and is interpretable; at the same time, the agent's behavior control can be intervened manually, the agent's decision level and convergence speed are raised rapidly, the agent gains the ability to imitate complex expert behaviors, labor cost is reduced, and the cold-start problem of model training is solved.
In order to achieve this purpose, the invention adopts the following technical scheme:
The invention provides a behavior simulation training method for air intelligent gaming, comprising the following steps:
S1, constructing an air confrontation simulation environment and red and blue agents, constructing an agent gaming decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each agent;
S2, determining the environment state and the action space of each agent, where the action space comprises at least one action a_t, and shaping for each action a_t a continuous non-sparse reward function R_t;
S3, carrying out air gaming matches in the agent gaming decision model, with match duration T; according to the current environment state S_t, t = 0, 1, 2, …, T, each agent performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward R_t; iterating this loop in turn to maximize the accumulated reward R_b;
S32, controlling each agent to play the air game based on expert behaviors to realize inverse reinforcement learning, and obtaining a target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector of dimension h, ||w*|| ≤ 1, and φ(s) is the situation-information vector, φ: S → [0,1]^h;
S33, calculating the similarity R_ε between each agent's behavior and the expert behavior;
S34, obtaining the composite reward r_t:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
where w_b is the weight coefficient of the maximum accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε;
S4, training the agent gaming decision model, and judging whether the composite reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent gaming decision model; otherwise, continuing the iterative training until the maximum number of training rounds n is reached or the composite reward r_t exceeds the reward threshold.
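The composite reward of step S34 is a weighted sum of the three reward terms. A minimal sketch follows; the default weight values and the inputs in the example are illustrative assumptions, not values specified by the patent:

```python
def composite_reward(r_max_cumulative: float,
                     r_target: float,
                     r_similarity: float,
                     w_b: float = 0.4, w_i: float = 0.3, w_eps: float = 0.3) -> float:
    """r_t = w_b * R_b + w_i * R*(s) + w_eps * R_eps (step S34).

    The three weight coefficients are user-adjustable; the defaults here
    are assumptions for illustration only.
    """
    return w_b * r_max_cumulative + w_i * r_target + w_eps * r_similarity

# Training (step S4) terminates once r_t exceeds a reward threshold.
r_t = composite_reward(1.0, 0.5, -0.2)
```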
Preferably, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
Preferably, the action space comprises at least one action a_t among target pursuit, target avoidance, tangentially escaping enemy lock, cross attack and snake maneuver.
Preferably, each action a_t is shaped with a continuous non-sparse reward function R_t as follows:
1) the reward function R_t1 for target pursuit satisfies:
R_t1 = -Δθ_c + a' + A + B
2) the reward function R_t2 for target avoidance satisfies:
R_t2 = Δθ_f + a' + A + B
3) the reward function R_t3 for tangentially escaping enemy lock satisfies:
R_t3 = -Δθ_t + a' + A + B
4) the reward function R_t4 for cross attack is shaped from the quantities Δθ_c, Δθ_j, x, x_a, v, v_a, the weight α and the terms a', A and B defined below;
5) the reward function R_t5 for the snake maneuver is shaped from Δθ_s, a', A and B, with maneuvering-direction change period T_1;
wherein Δθ_c = |θ - δ| - ε, Δθ_f = |θ - δ|, Δθ_t = |θ - δ| - 90°, Δθ_j = |θ - δ'|, Δθ_s = |θ - δ| - (90° - σ); θ is the absolute orientation angle of the red agent; θ_e is the absolute orientation angle of the blue agent nearest to the red agent; x is the distance between the red agent and the nearest blue agent; x_a is the distance between the red agent's nearest friendly aircraft and the nearest blue agent; δ is the angle of the line joining the red agent and the nearest blue agent; δ' is the angle of the line joining the red agent and its nearest friendly aircraft; ε is the deviation angle of the red agent relative to the nearest blue agent; a' is the acceleration of the red agent; A is the difference between the number of blue agents lost and the number of red agents lost; B is the win/lose outcome of the red side; v is the speed of the red agent; v_a is the speed of the red agent's nearest friendly aircraft; α is a weight coefficient; σ is an adjustable angle parameter, σ ∈ [0°, 90°]; T_1 is the change period of the maneuvering direction in the snake maneuver, 0 < T_1 ≤ T.
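The two reward formulas given explicitly above (target avoidance and tangential escape) can be sketched as follows; the treatment of angles as plain degree differences and the integer hit/win terms A and B are simplifying assumptions:

```python
def delta_theta_f(theta: float, delta: float) -> float:
    """Δθ_f = |θ - δ|, angles in degrees."""
    return abs(theta - delta)

def delta_theta_t(theta: float, delta: float) -> float:
    """Δθ_t = |θ - δ| - 90°, so it is 0 at the Doppler-optimal 90° aspect."""
    return abs(theta - delta) - 90.0

def reward_target_avoidance(theta: float, delta: float,
                            accel: float, A: int, B: int) -> float:
    """R_t2 = Δθ_f + a' + A + B: reward grows as the nose turns away."""
    return delta_theta_f(theta, delta) + accel + A + B

def reward_tangential_escape(theta: float, delta: float,
                             accel: float, A: int, B: int) -> float:
    """R_t3 = -Δθ_t + a' + A + B: largest when |θ - δ| is near 90°."""
    return -delta_theta_t(theta, delta) + accel + A + B
```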
Preferably, the method for controlling each agent to play the air game based on expert behaviors to realize inverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action policy π_0 and calculating its feature expectation μ(π_0); setting i = 0, with 0 ≤ i ≤ m, where m is the number of battle rounds;
S323, calculating, with a min-max algorithm, the feature error e_i = max_{||w|| ≤ 1} min_{0 ≤ j ≤ i} w^T(μ_E - μ_j) between the current action policy and the expert action policy, obtaining the current reward-function weight vector w_i, and updating the reward-function weight vector w* to (w_i)^T, where μ_E is the feature expectation of the expert behavior policy, μ_j is the feature expectation of the j-th action policy, and T denotes transposition;
S324, judging whether the feature error e_i is less than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, computing the current optimal action policy π_i with the deep reinforcement learning algorithm, updating the feature expectation μ(π_i), setting i = i + 1, and returning to step S323 to iterate.
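One S323-style weight update can be sketched as follows. Under an L2 constraint ||w|| ≤ 1, the margin-maximizing w points along μ_E minus the nearest policy feature expectation; this projection-style simplification over individual policies stands in for the full max-min optimization of apprenticeship-style inverse RL, so it is a sketch, not the patent's exact solver:

```python
import numpy as np

def irl_weight_update(mu_expert: np.ndarray, mu_history: list):
    """Return (w_i, e_i): reward weights with ||w|| = 1 pointing from the
    closest policy feature expectation toward the expert's, and the
    feature error e_i (the remaining gap)."""
    gaps = [mu_expert - mu for mu in mu_history]
    worst = min(gaps, key=lambda g: np.linalg.norm(g))  # closest policy so far
    norm = float(np.linalg.norm(worst))
    if norm == 0.0:
        return np.zeros_like(mu_expert), 0.0
    return worst / norm, norm

# Loop sketch (S322-S324): stop once e_i falls below an error threshold,
# otherwise train a new policy under R*(s) = w_i . phi(s) and repeat.
mu_E = np.array([0.8, 0.6, 0.1])
mu_hist = [np.array([0.2, 0.1, 0.5])]
w, e = irl_weight_update(mu_E, mu_hist)
```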
Preferably, the feature expectation μ(π_i) satisfies the following formula:
μ(π_i) = Σ_{t=0}^{T} γ^t·φ(s_t)|π_i
where γ is the discount factor and φ(s_t)|π_i is the battlefield situation vector at time t under action policy π_i.
Preferably, controlling each agent to play the air game based on expert behaviors to realize inverse reinforcement learning further comprises the following steps:
generating a battlefield situation set {s_t^(i)} from the action policy of each round, and calculating the global feature expectation μ_E over the m rounds of air gaming:
μ_E = (1/m)·Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t·φ(s_t^(i))
where s_t^(i) is the environment state at time t of the i-th round and φ(s_t^(i)) is the corresponding situation-information vector.
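The discounted feature expectation and its average over rounds can be sketched directly; the clipping map `phi` below is an illustrative stand-in for the situation-information vector φ: S → [0,1]^h:

```python
import numpy as np

def feature_expectation(trajectory: list, phi, gamma: float = 0.99) -> np.ndarray:
    """mu(pi) = sum_t gamma^t * phi(s_t) for one rollout of states."""
    return sum((gamma ** t) * phi(s) for t, s in enumerate(trajectory))

def global_feature_expectation(trajectories: list, phi, gamma: float = 0.99) -> np.ndarray:
    """mu_E: average the per-round feature expectations over the m rounds."""
    return sum(feature_expectation(tr, phi, gamma) for tr in trajectories) / len(trajectories)

# Assumed phi: map a raw state into the situation vector in [0,1]^h.
phi = lambda s: np.clip(np.asarray(s, dtype=float), 0.0, 1.0)
mu = feature_expectation([[0.5, 1.0], [0.2, 0.4]], phi, gamma=0.5)
```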
Preferably, in step S33, the similarity R_ε between each agent's behavior and the expert behavior is calculated as follows:
S331, determining an action a_t = μ(S_t|θ^μ) + N_t according to the current policy μ and the exploration noise N_t, where μ(S_t|θ^μ) is the action produced by policy μ with neural-network parameters θ^μ in environment state S_t;
S332, generating, from the current environment state S_t, the action a_e to be imitated using the expert behavior policy;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior with the Kullback-Leibler divergence:
for a discrete action a_t:
R_ε = -Σ_{x∈X} a_t(x)·log( a_t(x) / a_e(x) )
where a_t(x) is the probability that the agent performs behavior x, a_e(x) is the probability that the expert performs behavior x, and X is the behavior set;
for a continuous action a_t:
R_ε = -∫_X a_t(x)·log( a_t(x) / a_e(x) ) dx
where a_t(x) is the probability density of the agent performing behavior x, a_e(x) is the probability density of the expert performing behavior x, and X is the behavior set.
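For the discrete case, the similarity term can be sketched as a negative KL divergence; the negative sign (identical distributions score 0, diverging ones increasingly negative) and the epsilon floor against log(0) are assumptions consistent with using R_ε as a reward:

```python
import numpy as np

def kl_similarity(p_agent, p_expert, eps: float = 1e-12) -> float:
    """R_eps = -D_KL(a_t || a_e) for discrete action distributions.

    Probabilities are floored at `eps` so the logarithm stays finite;
    that floor is a numerical-stability assumption, not from the patent.
    """
    p = np.clip(np.asarray(p_agent, dtype=float), eps, None)
    q = np.clip(np.asarray(p_expert, dtype=float), eps, None)
    return -float(np.sum(p * np.log(p / q)))

# Identical behaviour distributions -> maximal similarity (0.0).
same = kl_similarity([0.7, 0.3], [0.7, 0.3])
```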
Preferably, in step S4, training the agent gaming decision model comprises the following steps:
S41, initializing the online neural network parameters θ^μ of the actor network and θ^Q of the critic network in the agent gaming decision model;
S42, copying the online parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target network parameters θ^{μ'} of the actor network and θ^{Q'} of the critic network;
S43, initializing a replay buffer R;
S44, for the L-th round (L = 1, 2, …, n), executing the following steps:
S441, initializing the random process noise N and obtaining an initial environment state S;
S442, for each time step t = 1, 2, …, T, executing the following steps:
a) each agent executes its corresponding action, and the environment state transitions to the next environment state S_{t+1} with composite reward r_t;
b) storing the environment state transition (S_t, a_t, r_t, S_{t+1}) as an array in the replay buffer R, to serve as the training data set;
c) randomly sampling U arrays from the replay buffer R as a mini-batch of training data and computing the target value
y_L = r_L + γ·Q'(s_{L+1}, μ'(s_{L+1}|θ^{μ'}) | θ^{Q'})
where μ'(s_{L+1}|θ^{μ'}) is the action produced by the target policy μ' with parameters θ^{μ'} in environment state s_{L+1}, and Q'(·|θ^{Q'}) is the value of the target Q function with parameters θ^{Q'};
d) updating the online critic network by minimizing the loss (y_L - Q(s_L, a_L|θ^Q))² averaged over the mini-batch;
e) updating the actor network with the policy gradient:
∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}
where Q(s, a|θ^Q) is the value of the Q function with parameters θ^Q in environment state s under action a, ∇_a Q(s, a|θ^Q) is its partial derivative with respect to a, μ(s|θ^μ) is the action produced by policy μ with parameters θ^μ in environment state s, and ∇_{θ^μ} μ(s|θ^μ) is its partial derivative with respect to θ^μ;
f) updating the target network parameters of the actor network and the critic network:
θ^{Q'} ← τ·θ^Q + (1 - τ)·θ^{Q'}
θ^{μ'} ← τ·θ^μ + (1 - τ)·θ^{μ'}
where τ is an adjustable coefficient, τ ∈ [0, 1];
g) judging whether the composite reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent gaming decision model; otherwise, returning to step S44 for iterative training until the maximum number of training rounds n is reached or the composite reward r_t exceeds the reward threshold.
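The target computation of step c) and the soft update of step f) are small enough to sketch framework-free; representing parameters as plain dictionaries of arrays and the optional terminal-state mask are assumptions for illustration:

```python
import numpy as np

def td_target(r, q_next, gamma: float = 0.99, done=None):
    """Step c): y_L = r_L + gamma * Q'(s_{L+1}, mu'(s_{L+1})).

    `done` is an optional terminal-state mask (an assumption, common in
    practice) that zeroes the bootstrap term at episode ends."""
    mask = 1.0 if done is None else 1.0 - done
    return r + gamma * mask * q_next

def soft_update(online: dict, target: dict, tau: float = 0.005) -> dict:
    """Step f): theta' <- tau * theta + (1 - tau) * theta' for every
    parameter tensor, so targets slowly track the online networks."""
    return {k: tau * online[k] + (1.0 - tau) * target[k] for k in online}

new_target = soft_update({"w": np.array([1.0])}, {"w": np.array([0.0])}, tau=0.1)
```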
Preferably, the target network parameters of the actor network and the target network parameters of the critic network are updated using the gradient descent method.
Compared with the prior art, the invention has the following beneficial effects:
1) key elements of the air scenario of the agent gaming decision model are integrated as reward factors to shape a continuous non-sparse reward function, so that the reward function is designed clearly and explicitly, is interpretable, and allows manual intervention in the agent's behavior control;
2) the inverse reinforcement learning algorithm based on expert behaviors brings the agent's combat effectiveness close to that of expert combat, automates the solution of the target reward function, and remains interpretable;
3) the behavior simulation method based on expert behaviors quickly raises the agent's decision level, addresses slow, difficult or failed convergence, and fits linear and non-linear reward functions so that the agent can imitate complex expert behaviors, reducing labor cost;
4) the continuous non-sparse shaped reward, the inverse reinforcement learning reward and the behavior-simulation similarity reward are considered simultaneously, with weights adjustable to user needs, improving the agent's intelligent decision level;
5) the behavior simulation method solves the cold-start problem of training, improves the traditionally inefficient reward-function design process and the random exploration process at the start of training, optimizes decision-making capability, and quickly yields a decision agent of high intelligence level and practical value.
Drawings
FIG. 1 is a schematic diagram of the cyclical interaction of an agent of the present invention with an airborne countermeasure simulation environment;
FIG. 2 is a block diagram of a method for reverse reinforcement learning of expert behavior according to the present invention;
FIG. 3 is a flow chart of the expert intelligence based behavioral decision simulation method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-3, a behavior simulation training method for air intelligent gaming comprises the following steps:
S1, constructing an air confrontation simulation environment and red and blue agents, constructing an agent gaming decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each agent.
In one embodiment, the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
The agent gaming decision model is constructed with the DDPG (Deep Deterministic Policy Gradient) deep reinforcement learning algorithm. The adversarial training of the agent gaming decision model links the constructed air confrontation simulation environment with the agents, here red and blue aircraft, so that the agents and the air confrontation simulation environment influence each other interactively, as shown in fig. 1. Note that the agent can also change with the application scenario, for example it can be a missile, and the agent gaming decision model can also be built on the DQN algorithm or other existing deep reinforcement learning algorithms.
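The cyclic interaction of fig. 1 follows the standard reinforcement-learning loop: the agent acts, the simulation returns the next state and reward, and transitions are collected for training. A gym-style sketch, where the `env` and `agent` interfaces are assumptions for illustration, not interfaces defined by the patent:

```python
def run_episode(env, agent, max_steps: int):
    """One round of agent/environment cyclic interaction (fig. 1)."""
    state = env.reset()
    transitions, total_reward = [], 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
        if done:
            break
    return transitions, total_reward

class _DemoEnv:
    """Toy stand-in for the air confrontation simulation environment."""
    def reset(self):
        return 0
    def step(self, action):
        nxt = action + 1
        return nxt, 1.0, nxt >= 3  # next state, reward, done

class _GreedyAgent:
    def act(self, state):
        return state

transitions, total = run_episode(_DemoEnv(), _GreedyAgent(), max_steps=10)
```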
S2, determining the environment state and the action space of each agent, where the action space comprises at least one action a_t, and shaping for each action a_t a continuous non-sparse reward function R_t.
In one embodiment, the action space includes at least one action a_t among target pursuit, target avoidance, tangentially escaping enemy lock, cross attack and snake maneuver. The number and types of optional actions a_t in the action space can be adjusted to actual needs, and other existing actions can also be adopted.
In one embodiment, each action a_t is shaped with a continuous non-sparse reward function R_t as follows:
1) the reward function R_t1 for target pursuit satisfies:
R_t1 = -Δθ_c + a' + A + B
2) the reward function R_t2 for target avoidance satisfies:
R_t2 = Δθ_f + a' + A + B
3) the reward function R_t3 for tangentially escaping enemy lock satisfies:
R_t3 = -Δθ_t + a' + A + B
4) the reward function R_t4 for cross attack is shaped from the quantities Δθ_c, Δθ_j, x, x_a, v, v_a, the weight α and the terms a', A and B defined below;
5) the reward function R_t5 for the snake maneuver is shaped from Δθ_s, a', A and B, with maneuvering-direction change period T_1;
wherein Δθ_c = |θ - δ| - ε, Δθ_f = |θ - δ|, Δθ_t = |θ - δ| - 90°, Δθ_j = |θ - δ'|, Δθ_s = |θ - δ| - (90° - σ); θ is the absolute orientation angle of the red agent; θ_e is the absolute orientation angle of the blue agent nearest to the red agent; x is the distance between the red agent and the nearest blue agent; x_a is the distance between the red agent's nearest friendly aircraft and the nearest blue agent; δ is the angle of the line joining the red agent and the nearest blue agent; δ' is the angle of the line joining the red agent and its nearest friendly aircraft; ε is the deviation angle of the red agent relative to the nearest blue agent; a' is the acceleration of the red agent; A is the difference between the number of blue agents lost and the number of red agents lost; B is the win/lose outcome of the red side; v is the speed of the red agent; v_a is the speed of the red agent's nearest friendly aircraft; α is a weight coefficient; σ is an adjustable angle parameter, σ ∈ [0°, 90°]; T_1 is the change period of the maneuvering direction in the snake maneuver.
Specifically, in the adopted northeast coordinate system, θ is the angle between the nose direction of the red agent and due north; θ_e is the angle between the nose direction of the blue agent nearest the red agent and due north; δ is the angle between due north and the line joining the red agent and the nearest blue agent; δ' is the angle between due north and the line joining the red agent and its nearest friendly aircraft; and ε is the angle between the red agent's nose direction and the line joining the red agent and the nearest blue agent. The spatial motion of an air-combat agent has six degrees of freedom; when shaping the reward function, this application considers the agent's yaw angle and acceleration. The reward-shaping stage covers the actions of target pursuit, target avoidance, tangential escape from enemy lock, cross attack and snake maneuver, with the red agent as the local aircraft and the blue agents as enemy aircraft (enemy for short), wherein:
target pursuit: the local machine takes the nearest enemy plane as relative reference, and the difference value delta theta between the absolute orientation angle theta of the local machine and the line angle delta between the local machine and the nearest enemy plane is consideredcOne of the reward factors is | θ - δ | - ε. When the enemy missile is hit by a target, the enemy missile is kept in the self radar locking range of the self-body, so that the enemy missile is beneficial to being quickly arranged at the tail and kept away, and meanwhile, the envelope of the enemy missile is compressed to delta thetacThe smaller the prize. The local acceleration a' is one of reward factors, at the local absolute orientation angle theta and the nearest enemy orientation angle thetaeIn the same direction, i.e. | theta-thetaeWhen | < 180 °, a' is positive reward, | theta-thetaeIf > 180 deg., a' is a negative reward. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A is 0, the reward is not provided, A is smaller than 0, the negative reward is provided, meanwhile, the local win-lose condition B wins as the positive reward, the tactic tie is not provided, and the tactic failure is the negative reward. Synthesizing a target pursuit reward function R according to the factorst1The following were used:
target avoidance: the local machine takes the nearest enemy attacking the local machine as relative reference, and the difference value delta theta between the absolute orientation angle theta of the local machine and the nearest enemy connecting line angle delta between the local machine and the local machinefOne of the reward factors is | θ - δ |. When avoiding the target, in order to compress the envelope of the missile of the enemy aircraft, the delta theta within the effective range of the enemy aircraft is avoidedfThe smaller the prize. The local acceleration a' is one of the reward factors because of the need to quickly zoom up and enemy while evadingDistance, factor a', is always a positive reward. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A is 0, the reward is not provided, A is smaller than 0, the negative reward is provided, meanwhile, the local win-lose condition B wins as the positive reward, the tactic tie is not provided, and the tactic failure is the negative reward. Synthesizing a target evasion reward function R according to the factorst2The following were used:
Rt2=Δθf+a′+A+B
tangential escape from enemy locking: the machine takes the enemy which has recently locked itself as relative reference, and the difference value delta theta between the absolute orientation angle theta of the machine and the connection angle delta between the machine and the enemy which has recently locked itselftOne of the reward factors is | θ - δ | -90 °. Since the tangential escape enemy locking action is based on Doppler effect escape enemy radar locking, the optimum angle is 90 °, so the closer the difference angle | θ - δ | to 90 °, the better, Δ θ |tThe closer to 0 the prize is. The local acceleration a' is one of the reward factors, and is always a positive reward because of the fast zooming-up and enemy distance required to escape. The relative hit quantity A and the local win-lose condition B are one of the considered factors, the relative hit quantity A is the difference between the loss quantity of the blue intelligent agent and the loss quantity of the red intelligent agent, when the relative hit quantity A is larger than 0, the reward is positive, A is 0, the reward is not provided, A is smaller than 0, the negative reward is provided, meanwhile, the local win-lose condition B wins as the positive reward, the tactic tie is not provided, and the tactic failure is the negative reward. Synthesizing a tangential escape enemy locking reward function R according to the factorst3The following were used:
R_t3 = −Δθ_t + a′ + A + B
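As a concrete illustration, the two reward functions above can be sketched as follows. This is a minimal Python sketch: the angle normalization, the absolute value taken for Δθ_t, and all function names are assumptions, since the source gives only the symbolic formulas.

```python
import math  # kept for parity with later numeric sketches


def angle_diff_deg(a: float, b: float) -> float:
    """Smallest absolute heading difference in degrees, in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def target_evasion_reward(theta: float, delta: float, a_prime: float,
                          A: int, B: int) -> float:
    """R_t2 = dtheta_f + a' + A + B with dtheta_f = |theta - delta|."""
    dtheta_f = angle_diff_deg(theta, delta)
    return dtheta_f + a_prime + A + B


def tangential_escape_reward(theta: float, delta: float, a_prime: float,
                             A: int, B: int) -> float:
    """R_t3 = -dtheta_t + a' + A + B; dtheta_t = ||theta - delta| - 90| is
    taken as an absolute gap (assumption), so it is 0 at the optimal 90 deg."""
    dtheta_t = abs(angle_diff_deg(theta, delta) - 90.0)
    return -dtheta_t + a_prime + A + B
```

With these conventions, a heading difference of exactly 90° makes the Δθ_t penalty vanish, matching the Doppler-escape rationale in the text.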
cross attack: taking the nearest enemy aircraft as the reference, the local aircraft is given different rewards according to the position of its nearest friendly aircraft (i.e. the red agent closest to the local aircraft).
If x_a − x ≥ 0, the local aircraft is in front of its nearest friendly aircraft and executes the attack tactic, where x is the distance between the red agent and the nearest blue agent (i.e. the distance between the local aircraft and the nearest enemy aircraft), and x_a is the distance between the red agent's nearest friendly aircraft and that nearest blue agent. Taking into account the deviation angle ε of the local aircraft relative to the nearest enemy, the difference Δθ_c = |θ − δ| − ε between the local absolute orientation angle θ and the connecting-line angle δ between the local aircraft and the nearest enemy is one of the reward factors. When attacking a target, keeping the enemy within the local radar lock range helps to quickly get on its tail and stay there while compressing the enemy missile's envelope, so the smaller Δθ_c, the larger the reward. The local acceleration a′ is another reward factor: when the local absolute orientation angle θ and the nearest enemy orientation angle θ_e point in the same direction, i.e. |θ − θ_e| < 180°, a′ is a positive reward; when |θ − θ_e| > 180°, a′ is a negative reward.
The relative hit number A and the local win-loss situation B are also considered, where A is the difference between the number of blue agents lost and the number of red agents lost; when A > 0 the reward is positive, when A = 0 there is no reward, and when A < 0 the reward is negative. Meanwhile, for the local win-loss situation B, a victory is a positive reward, a draw gives no reward, and a defeat is a negative reward.
If x_a − x < 0, the local aircraft is behind its nearest friendly aircraft and executes the tactic of following the lead aircraft and switching on jamming. The difference Δθ_j = |θ − δ′| between the local absolute orientation angle θ and the connecting-line angle δ′ between the local aircraft and the nearest friendly aircraft is one of the reward factors. When following the nearest friendly aircraft, it should be kept within the local radar lock range so that the formation distance is maintained and the local aircraft can conceal itself and jam the enemy; therefore the smaller Δθ_j, the larger the reward. The local acceleration a′ is another reward factor: when the local absolute orientation angle θ and the nearest enemy orientation angle θ_e point in the same direction, i.e. |θ − θ_e| < 180°, a′ is a positive reward; when |θ − θ_e| > 180°, a′ is a negative reward. Meanwhile, a′ is proportional to the speed difference with the nearest friendly aircraft, i.e. a′ = α(v − v_a), where v is the local speed, v_a is the nearest friendly aircraft's speed, and α is a weight coefficient. The relative hit number A and the local win-loss situation B are also considered, with the same reward conventions as above: A > 0 gives a positive reward, A = 0 no reward, A < 0 a negative reward; a victory in B is a positive reward, a draw no reward, a defeat a negative reward. Combining the above factors gives the cross attack reward function R_t4.
s-shaped maneuvering: the local aircraft takes the nearest enemy aircraft as the relative reference and changes its maneuvering direction every time period T_1, with 0 < T_1 ≤ T and 2kT_1 ≤ t ≤ (2k+1)T_1. The difference Δθ_s = |θ − δ| − (90° − σ) between the local absolute orientation angle θ and the connecting-line angle δ between the local aircraft and the nearest enemy is one of the reward factors, where σ is an adjustable angle parameter, σ ∈ [0°, 90°]; the radial direction is the arrow starting at the local aircraft and ending at the nearest enemy, and the radial speed is the velocity component in that direction. The relative hit number A and the local win-loss situation B are also considered, with the same reward conventions as above: A > 0 gives a positive reward, A = 0 no reward, A < 0 a negative reward; a victory in B is a positive reward, a draw no reward, a defeat a negative reward. Combining the above factors gives the serpentine maneuver reward function R_t5.
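The branch condition of the cross attack and the periodic direction change of the serpentine maneuver can be sketched as follows; the function names and the ±1 encoding of the two turn directions are illustrative assumptions, not notation from the source.

```python
def cross_attack_branch(x: float, x_a: float) -> str:
    """Select the cross-attack sub-tactic from the two distances in the text:
    x   -- own distance to the nearest enemy,
    x_a -- nearest friendly aircraft's distance to that same enemy.
    x_a - x >= 0 means the local aircraft is in front and attacks; otherwise
    it follows the lead aircraft and switches on jamming."""
    return "attack" if x_a - x >= 0 else "follow_and_jam"


def serpentine_heading_sign(t: float, T1: float) -> int:
    """S-shaped manoeuvre: flip the turn direction every period T1, so the
    intervals [2k*T1, (2k+1)*T1] and [(2k+1)*T1, (2k+2)*T1] alternate."""
    return 1 if int(t // T1) % 2 == 0 else -1
```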
Because the agent's actions are continuous (for example, the aircraft's turning angle and acceleration) and the reward function serves as the evaluation of behavior, a continuous non-sparse reward function can be adopted in one-to-one correspondence with the behavior, so as to give the agent feedback at every time point of the real-time confrontation. The modeling of the continuous non-sparse reward function is clear and explicit and therefore interpretable, and the behavior control of the agent can be adjusted by human intervention.
S3, carrying out air game match in the intelligent agent game decision model, wherein the match time is T, and each intelligent agent carries out game match according to the current environment state StT0, 1, 2.. T, performing the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward function R_t; the loop is iterated in turn to realize the maximum accumulated reward R_b;
S32, controlling each agent to conduct the air game based on expert behaviors to realize reverse reinforcement learning, and obtaining the target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector with h components, ‖w*‖ ≤ 1, and φ(s) is the situation information vector, φ: S → [0,1]^h;
Each agent is controlled to conduct the air game based on expert behaviors to realize reverse reinforcement learning, and the target reward function is resolved automatically. A series of air-decision feature elements exists, with φ: S → [0,1]^h; for example, the feature elements include the agent's basic spatial coordinates, motion indexes (speed and turning angle), fire-control system state, radar state, jamming-pod state, and so on. These feature indexes are related to the reward; to facilitate fast optimization, the correlation is expressed as a linear combination, giving the target reward function R*(s) = w*·φ(s), where w* is the reward-function weight vector with h components, ‖w*‖ ≤ 1, and φ(s) is the situation information vector, so that the importance of each feature quantity is represented in a normalized space. When expert behaviors are adopted to control the agents in the air game, a series of battlefield situations is generated over time until the battle ends, and the set of battlefield situations corresponding to the i-th game is recorded. Assuming there are p red aircraft and q blue aircraft, the collected situation information vector φ(s) may selectively include the following elements:
c) the absolute heading angles of the p + q aircraft;
d) the longitude and latitude coordinates of the p + q aircraft;
e) the p·q missile threat degrees, representing the threat of each missile in flight to each aircraft (proportional to the flight time and inversely proportional to the distance to the target aircraft);
f) the speeds of the p + q aircraft;
g) the relative positions of the p + q aircraft;
h) whether the local aircraft is the current attacking aircraft;
i) the missile states of the q enemy aircraft at the previous moment;
j) the missile states of our p aircraft at the previous moment.
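The assembly of a normalized situation information vector φ(s) ∈ [0,1]^h from a subset of the listed elements, together with the linear target reward R*(s) = w*·φ(s), might look like the following sketch. The choice of elements, the normalization constants, and the function names are all assumptions.

```python
import math


def build_situation_vector(headings, speeds, missile_threats,
                           heading_max=360.0, speed_max=2000.0, threat_max=1.0):
    """Assemble a normalised situation vector phi(s) in [0,1]^h from headings,
    speeds, and missile threat degrees (a subset of elements c)-j)).
    The normalisation constants are illustrative assumptions."""
    phi = []
    phi += [h % 360.0 / heading_max for h in headings]                 # p+q headings
    phi += [min(max(v, 0.0), speed_max) / speed_max for v in speeds]   # p+q speeds
    phi += [min(max(m, 0.0), threat_max) / threat_max
            for m in missile_threats]                                  # p*q threats
    return phi


def linear_reward(w, phi):
    """Target reward R*(s) = w . phi(s), with ||w|| <= 1 as stated in the text."""
    assert math.sqrt(sum(x * x for x in w)) <= 1.0 + 1e-9
    return sum(wi * pi for wi, pi in zip(w, phi))
```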
In one embodiment, the method for controlling each agent to conduct air gaming based on expert behaviors to realize reverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and calculating its corresponding feature expectation μ(π_0); set i = 0, with 0 ≤ i ≤ m, where m is the number of battle games;
In one embodiment, the feature expectation μ(π_i) satisfies the following formula:

μ(π_i) = Σ_{t=0}^{T} γ^t · φ(s_t)|π_i

where γ is the discount factor and φ(s_t)|π_i is the battlefield situation at time t under the action strategy π_i.
In one embodiment, the method for controlling each agent to play the air game based on expert behaviors to realize reverse reinforcement learning further comprises the following steps:
Battlefield situation sets are generated according to the action strategy corresponding to each game, and the global feature expectation of the m games of air gaming is calculated as:

μ̂ = (1/m) · Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t · φ(s_t^(i))

wherein s_t^(i) is the environment state at time t of the i-th game, and φ(s_t^(i)) is the situation information vector corresponding to the battlefield situation at time t of the i-th game.
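The feature expectation μ(π_i) and the m-game global feature expectation can be computed as in the following pure-Python sketch; the discount factor value is an assumption.

```python
def feature_expectation(phi_trajectory, gamma=0.9):
    """Discounted feature expectation mu(pi) = sum_t gamma^t * phi(s_t)
    for one episode, given the trajectory of situation vectors phi(s_t)."""
    h = len(phi_trajectory[0])
    mu = [0.0] * h
    for t, phi_t in enumerate(phi_trajectory):
        for k in range(h):
            mu[k] += (gamma ** t) * phi_t[k]
    return mu


def global_feature_expectation(episodes, gamma=0.9):
    """Average the per-episode feature expectations over m battle games."""
    m = len(episodes)
    h = len(episodes[0][0])
    mu_hat = [0.0] * h
    for ep in episodes:
        mu = feature_expectation(ep, gamma)
        for k in range(h):
            mu_hat[k] += mu[k] / m
    return mu_hat
```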
S323, calculating the feature error between the current action strategy and the expert action strategy with the min-max algorithm,

e_i = max_{‖w‖≤1} min_{0≤j<i} w^T · (μ_E − μ_j)

obtaining the current reward-function weight vector w_i as the maximizing w, and updating the reward-function weight vector w* to (w_i)^T, wherein μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the current action strategy, and T denotes transposition;
S324, judging whether the feature error e_i is smaller than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, calculating the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation μ(π_i), setting i = i + 1, and returning to step S323 to iterate. Balancing calculation accuracy against time consumption, this embodiment sets the error threshold for e_i to 10⁻⁵: after several iterations, once e_i < 10⁻⁵, the feature error is judged to be smaller than the error threshold.
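One step of the weight and error update in S323 might be sketched as follows. The source names a min-max solver without giving it; this sketch substitutes the simpler projection variant of apprenticeship learning (Abbeel and Ng), so the exact update rule here is an assumption.

```python
import math


def update_weights(mu_E, mu_list):
    """One weight/error update: take the policy feature expectation mu_j closest
    to the expert's mu_E; the new weight vector points from that mu_j toward
    mu_E (so ||w_i|| = 1 <= 1), and the feature error e_i is the remaining
    distance. This is the projection variant, standing in for the min-max step."""
    best = min(mu_list, key=lambda mu: math.dist(mu, mu_E))
    e_i = math.dist(mu_E, best)
    if e_i == 0.0:
        return [0.0] * len(mu_E), 0.0
    w_i = [(a - b) / e_i for a, b in zip(mu_E, best)]
    return w_i, e_i
```

Iterating this until e_i falls below the 10⁻⁵ threshold mirrors the termination test in S324.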
Through the reverse reinforcement learning algorithm based on expert behaviors, the agent's combat effect approaches the expert's combat effect, the resolving process of the target reward function can be automated, and interpretability is retained.
S33, calculating the similarity R_ε between each agent's behavior and the expert behavior.
In one embodiment, in step S33, the similarity R_ε between each agent's behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, where μ(S_t|θ^μ) represents the action generated by the strategy μ under the neural-network parameters θ^μ in the environment state S_t;
S332, according to the current environment state S_t, generating an action a_e as the imitation object using the expert behavior strategy;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior using the Kullback-Leibler divergence method; the calculation is as follows:
For a discrete action a_t, the similarity is computed from the Kullback-Leibler divergence

D_KL(a_t ‖ a_e) = Σ_{x∈X} a_t(x) · ln( a_t(x) / a_e(x) )

wherein a_t(x) represents the probability that the agent performs behavior x, a_e(x) represents the probability that the expert performs behavior x, and X is the behavior set; for example, if X = {fly north, fly south, fly east, fly west}, then flying north is one behavior.
For a continuous action a_t, the sum is replaced by an integral over the behavior set:

D_KL(a_t ‖ a_e) = ∫_X a_t(x) · ln( a_t(x) / a_e(x) ) dx

wherein a_t(x) represents the probability density of the agent performing behavior x, a_e(x) represents the probability density of the expert performing behavior x, x is a behavior and X is the behavior set; for example, X = {flight at an angle of x degrees to the horizontal plane, x ∈ [0°, 180°]}.
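The Kullback-Leibler computations behind R_ε can be sketched as follows. The exp(−D_KL) mapping from divergence to a similarity score, and the Gaussian model for the continuous case, are assumptions not fixed by the source.

```python
import math


def kl_discrete(p, q):
    """D_KL(p || q) for discrete action distributions over the same behaviour set X."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_gaussian(mu_a, sigma_a, mu_e, sigma_e):
    """Closed-form D_KL between two 1-D Gaussians -- a common stand-in when the
    continuous densities a_t(x) and a_e(x) are modelled as Gaussian (assumption)."""
    return (math.log(sigma_e / sigma_a)
            + (sigma_a ** 2 + (mu_a - mu_e) ** 2) / (2.0 * sigma_e ** 2) - 0.5)


def similarity(p, q):
    """One plausible similarity score: R_eps = exp(-D_KL), so identical
    distributions give 1 and divergent ones approach 0."""
    return math.exp(-kl_discrete(p, q))
```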
The DDPG algorithm outputs and executes a concrete action according to the current environment state and the action strategy, and continuously provides action-instruction output according to the real-time situation until the battle ends. Each time the DDPG algorithm outputs a behavior, the expert behavior strategy makes a decision in the same situation, but that decision is not applied to the environment state; that is, the expert behavior does not take effect and serves only as feedback for the DDPG action decision, and the similarity between the expert behavior and the agent's action is calculated. A high similarity indicates that the current agent game decision model imitates the expert behavior well.
S34, obtaining the composite reward r_t, calculated as:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
wherein w_b is the weight coefficient of the maximum accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε.
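The composite reward of S34 is a plain weighted sum; the example weights below are illustrative, since the text says they are user-adjustable.

```python
def composite_reward(R_b, R_star, R_eps, w_b=0.4, w_i=0.3, w_eps=0.3):
    """r_t = w_b*R_b + w_i*R*(s) + w_eps*R_eps; weights are assumptions."""
    return w_b * R_b + w_i * R_star + w_eps * R_eps
```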
S4, training the agent game decision model and judging whether the composite reward r_t is greater than the reward threshold; if so, the training is stopped to obtain the final agent game decision model; otherwise, loop-iterative training is performed until the maximum number of training games n is reached or the composite reward r_t is greater than the reward threshold.
The method mainly adopts the DDPG algorithm to perform imitation training of expert-agent decision behaviors, so as to obtain an agent game decision model capable of reproducing expert behaviors. The behavior imitation training process is one in which the agent generates actions according to the situation to obtain returns and continuously accumulates experience; the generated actions shift toward those obtaining high returns until a stable high return is reached and the model converges, yielding a final high-intelligence agent game decision model for air gaming.
In one embodiment, in step S4, the training of the agent game decision model comprises the following steps:
S41, initializing the online neural-network parameters θ^μ of the actor network and θ^Q of the critic network in the agent game decision model; the actor network outputs behaviors according to the battlefield situation, and the critic network outputs scores according to the battlefield situation and the behaviors.
S42, copying the online neural-network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network;
S43, initializing a playback buffer R;
S44, for the l-th game (l = 1, 2, …, n), executing the following steps:
S441, initializing the random-process noise N and obtaining the initial environment state S;
S442, for each time t = 1, 2, …, T, executing the following steps:
a) each agent executes its corresponding action, and the environment state transitions to obtain the next environment state S_{t+1} and the composite reward r_t;
b) storing the environment-state transition process (S_t, a_t, r_t, S_{t+1}) as an array in the playback buffer R to serve as the training data set;
c) randomly sampling U arrays from the playback buffer R as mini-batch training data, with the labels y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′})|θ^{Q′}), wherein μ′(s_{L+1}|θ^{μ′}) represents the action generated by the target strategy μ′ under the parameters θ^{μ′} in the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′})|θ^{Q′}) represents the value of the target Q function under the parameters θ^{Q′} in the environment state s_{L+1} when executing that action;
d) updating the critic network by minimizing the mini-batch loss L = (1/U)·Σ_L (y_L − Q(s_L, a_L|θ^Q))²;
e) updating the actor network with the sampled policy gradient:

∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}

wherein Q(s, a|θ^Q) denotes the value of the Q function under the parameters θ^Q in the environment state s when executing the action a, and ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a; μ(s|θ^μ) denotes the action generated by the strategy μ under the parameters θ^μ in the environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) updating the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network by the soft update

θ^{Q′} ← τ·θ^Q + (1 − τ)·θ^{Q′},  θ^{μ′} ← τ·θ^μ + (1 − τ)·θ^{μ′}

wherein τ is an adjustable coefficient, τ ∈ [0, 1];
g) judging whether the composite reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent game decision model; otherwise, returning to step S44 for loop-iterative training until the maximum number of training games n is reached or the composite reward r_t is greater than the reward threshold. For example, if the preset upper limit of the reward threshold is 100 and the user sets the reward threshold to 80 according to actual requirements, then the training is terminated when the composite reward r_t is greater than 80; otherwise the process returns to step S44 for loop-iterative training until the maximum number of training games n is reached or the composite reward r_t is greater than 80.
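The inner DDPG bookkeeping of steps c) through f) — mini-batch sampling, TD targets y_L, and the soft target-network update — can be sketched as follows. Function-approximator details are abstracted behind callables, and the γ and τ values are assumptions.

```python
import random


def sample_minibatch(replay_buffer, U, seed=None):
    """Uniformly sample U stored (s, a, r, s_next) transitions from buffer R."""
    rng = random.Random(seed)
    return rng.sample(replay_buffer, U)


def td_targets(batch, q_target, mu_target, gamma=0.99):
    """y_L = r_L + gamma * Q'(s_{L+1}, mu'(s_{L+1})) for each sampled transition;
    q_target and mu_target stand in for the target critic and target actor."""
    return [r + gamma * q_target(s_next, mu_target(s_next))
            for (_s, _a, r, s_next) in batch]


def soft_update(target_params, online_params, tau=0.01):
    """theta' <- tau*theta + (1 - tau)*theta' for every parameter, tau in [0, 1]."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

A small τ keeps the target networks changing slowly, which is the usual stabilizing role of the soft update in DDPG.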
The method performs decision training of agent behaviors with a deep reinforcement learning algorithm, continuously updating the deep network using the similarity between expert behaviors and agent behaviors as feedback. By training the agent game decision model, decision experience can be obtained from expert prior knowledge; as the similarity between agent behaviors and expert behaviors improves during training, the level of behavior imitation gradually rises, yielding a final high-intelligence agent game decision model for air gaming.
In one embodiment, the target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network are updated by gradient descent. The return of the generated action changes with each update, and the overall trend of the return is upward.
In conclusion, the method synthesizes the key elements of the air scenario into reward factors for the agent game decision model; by forming continuous non-sparse reward functions, the reward design is clear, explicit, and interpretable, while human intervention in the agent's behavior control remains possible. The reverse reinforcement learning algorithm based on expert behaviors brings the agent's combat effect close to the expert's combat effect, automates the resolving of the target reward function, and retains interpretability. The intelligent decision imitation method based on expert behaviors can quickly raise the agent's decision level, solve the problem of slow, difficult, or even impossible convergence, and fit linear and nonlinear reward functions so that the agent gains the capability of imitating complex expert behaviors, reducing labor cost. Meanwhile, continuous non-sparse modeled rewards, reverse reinforcement learning rewards, and behavior-imitation similarity rewards are all considered, with weights adjustable to user requirements, improving the agent's decision intelligence. The behavior imitation method solves the cold-start problem of training, improves the traditionally inefficient reward-function design process and the random exploration phase of initial training, optimizes decision capability, and quickly yields a decision agent of high intelligence and practical value.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several specific and detailed implementations of the present application and should not therefore be understood as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A behavior simulation training method for the air intelligent game, characterized in that the behavior simulation training method for the air intelligent game comprises the following steps:
S1, constructing an air confrontation simulation environment and the red and blue agents, constructing an agent game decision model based on a deep reinforcement learning algorithm, and realizing cyclic interaction between the air confrontation simulation environment and each agent;
S2, determining the environment state and the action space of each agent, wherein the action space comprises at least one action a_t, and shaping for each said action a_t a continuous non-sparse reward function R_t;
S3, performing air game matches in the agent game decision model, wherein the match duration is T, and each agent, according to the current environment state S_t, t = 0, 1, 2, …, T, performs the following steps:
S31, determining the action a_t to be executed; the executed action a_t acts on the air confrontation simulation environment to generate the next environment state S_{t+1} and obtain the corresponding reward function R_t; the loop is iterated in turn to realize the maximum accumulated reward R_b;
S32, controlling each agent to conduct the air game based on expert behaviors to realize reverse reinforcement learning, and obtaining the target reward function R*(s) = w*·φ(s), wherein w* is the reward-function weight vector with h components, ‖w*‖ ≤ 1, and φ(s) is the situation information vector, φ: S → [0,1]^h;
S33, calculating the similarity R_ε between each agent behavior and the expert behavior;
S34, obtaining the composite reward r_t, the composite reward r_t being calculated as:
r_t = w_b·R_b + w_i·R*(s) + w_ε·R_ε
wherein w_b is the weight coefficient of the maximum accumulated reward R_b, w_i is the weight coefficient of the target reward function R*(s), and w_ε is the weight coefficient of the similarity R_ε;
S4, training the agent game decision model and judging whether the composite reward r_t is greater than the reward threshold; if so, terminating the training to obtain the final agent game decision model; otherwise, performing loop-iterative training until the maximum number of training games n is reached or the composite reward r_t is greater than the reward threshold.
2. The method for behavioral simulation training for intelligent gaming over the air as recited in claim 1, wherein: the deep reinforcement learning algorithm is a DQN algorithm or a DDPG algorithm.
3. The method for behavioral simulation training for intelligent gaming over the air as recited in claim 1, wherein: the action space comprises at least one action a_t among target pursuit, target avoidance, tangential escape from enemy locking, cross attack, and serpentine maneuver.
4. The method for behavioral simulation training for intelligent gaming over the air as recited in claim 3, wherein: said shaping for each said action a_t a continuous non-sparse reward function R_t is as follows:
1) the reward function R_t1 of the target pursuit satisfies the following condition:
2) the reward function R_t2 of the target avoidance satisfies the following condition:
R_t2 = Δθ_f + a′ + A + B
3) the reward function R_t3 of the tangential escape from enemy locking satisfies the following condition:
R_t3 = −Δθ_t + a′ + A + B
4) the reward function R_t4 of the cross attack satisfies the following condition:
5) the reward function R_t5 of the serpentine maneuver satisfies the following condition:
wherein Δθ_c = |θ − δ| − ε, Δθ_f = |θ − δ|, Δθ_t = |θ − δ| − 90°, Δθ_j = |θ − δ′|, and Δθ_s = |θ − δ| − (90° − σ); θ is the absolute orientation angle of the red agent, θ_e is the absolute orientation angle of the blue agent nearest to the red agent, x is the distance between the red agent and the nearest blue agent, x_a is the distance between the red agent's nearest friendly aircraft and the nearest blue agent, δ is the connecting-line angle between the red agent and the nearest blue agent, δ′ is the connecting-line angle between the red agent and the nearest friendly aircraft, ε is the deviation angle between the red agent and the nearest blue agent, a′ is the acceleration of the red agent, A is the difference between the number of blue agents lost and the number of red agents lost, B is the number of games won by the red agent, v is the speed of the red agent, v_a is the speed of the red agent's nearest friendly aircraft, α is a weight coefficient, σ is an adjustable angle parameter with σ ∈ [0°, 90°], and T_1 is the maneuvering-direction change period in the serpentine maneuver, 0 < T_1 ≤ T.
5. The method for behavioral simulation training for intelligent gaming over the air as recited in claim 1, wherein: controlling each agent to conduct the air game based on expert behaviors to realize reverse reinforcement learning comprises the following steps:
S321, constructing the target reward function R*(s) = w*·φ(s);
S322, randomly acquiring an initial action strategy π_0 and calculating the feature expectation μ(π_0) corresponding to the initial action strategy π_0; set i = 0, with 0 ≤ i ≤ m, where m is the number of battle games;
S323, calculating the feature error between the current action strategy and the expert action strategy with the min-max algorithm,

e_i = max_{‖w‖≤1} min_{0≤j<i} w^T · (μ_E − μ_j)

obtaining the current reward-function weight vector w_i as the maximizing w, and updating the reward-function weight vector w* to (w_i)^T, wherein μ_E is the feature expectation of the expert behavior strategy, μ_j is the feature expectation of the current action strategy, and T denotes transposition;
S324, judging whether the feature error e_i is smaller than the error threshold; if so, outputting the target reward function R*(s) = (w_i)^T·φ(s); otherwise, calculating the current optimal action strategy π_i based on the deep reinforcement learning algorithm, updating the feature expectation μ(π_i), setting i = i + 1, and returning to step S323 to iterate.
6. The method for behavioral simulation training for intelligent gaming over the air as recited in claim 5, wherein: the feature expectation μ(π_i) satisfies the following formula:

μ(π_i) = Σ_{t=0}^{T} γ^t · φ(s_t)|π_i

where γ is the discount factor and φ(s_t)|π_i is the battlefield situation at time t under the action strategy π_i.
7. The method for behavioral simulation training for intelligent gaming over the air as recited in claim 6, wherein: controlling each agent to conduct the air game based on expert behaviors to realize reverse reinforcement learning further comprises the following steps:

generating battlefield situation sets according to the action strategy corresponding to each game, and calculating the global feature expectation of the m games of air gaming as:

μ̂ = (1/m) · Σ_{i=1}^{m} Σ_{t=0}^{T} γ^t · φ(s_t^(i))

wherein s_t^(i) is the environment state at time t of the i-th game, and φ(s_t^(i)) is the corresponding situation information vector.
8. The method for behavioral simulation training for intelligent gaming over the air as recited in claim 1, wherein: in step S33, the similarity R_ε between each agent behavior and the expert behavior is calculated as follows:
S331, determining the action a_t = μ(S_t|θ^μ) + N_t according to the current strategy μ and the exploration noise N_t, wherein μ(S_t|θ^μ) represents the action generated by the strategy μ under the neural-network parameters θ^μ in the environment state S_t;
S332, according to the current environment state S_t, generating an action a_e as the imitation object using the expert behavior strategy;
S333, calculating the similarity R_ε between the agent behavior and the expert behavior using the Kullback-Leibler divergence method; the calculation is as follows:
for a discrete action a_t, the similarity is computed from the Kullback-Leibler divergence

D_KL(a_t ‖ a_e) = Σ_{x∈X} a_t(x) · ln( a_t(x) / a_e(x) )

wherein a_t(x) represents the probability that the agent performs behavior x, a_e(x) represents the probability that the expert performs behavior x, and X is the behavior set;
for a continuous action a_t, the sum is replaced by an integral over the behavior set:

D_KL(a_t ‖ a_e) = ∫_X a_t(x) · ln( a_t(x) / a_e(x) ) dx

wherein a_t(x) represents the probability density of the agent performing behavior x, a_e(x) represents the probability density of the expert performing behavior x, x is a behavior and X is the behavior set.
9. The method for behavioral simulation training for intelligent gaming over the air as recited in claim 1, wherein: in step S4, the training of the intelligent agent game decision model includes the following steps:
S41, initializing the online neural-network parameters θ^μ of the actor network and θ^Q of the critic network in the agent game decision model;
S42, copying the online neural-network parameters θ^μ of the actor network and θ^Q of the critic network to the corresponding target-network parameters θ^{μ′} of the actor network and θ^{Q′} of the critic network;
S43, initializing a playback buffer R;
S44, for the l-th game (l = 1, 2, …, n), executing the following steps:
S441, initializing the random-process noise N and obtaining the initial environment state S;
S442, for each time t = 1, 2, …, T, executing the following steps:
a) each agent executes its corresponding action, and the environment state transitions to obtain the next environment state S_{t+1} and the composite reward r_t;
b) storing the environment-state transition process (S_t, a_t, r_t, S_{t+1}) as an array in the playback buffer R to serve as the training data set;
c) randomly sampling U of said arrays from the playback buffer R as mini-batch training data, with the labels y_L = r_L + γ·Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′})|θ^{Q′}), wherein μ′(s_{L+1}|θ^{μ′}) represents the action generated by the target strategy μ′ under the parameters θ^{μ′} in the environment state s_{L+1}, and Q′(s_{L+1}, μ′(s_{L+1}|θ^{μ′})|θ^{Q′}) represents the value of the target Q function under the parameters θ^{Q′} in the environment state s_{L+1} when executing that action;
d) updating the critic network by minimizing the mini-batch loss L = (1/U)·Σ_L (y_L − Q(s_L, a_L|θ^Q))²;
e) updating the actor network with the sampled policy gradient:

∇_{θ^μ} J ≈ (1/U)·Σ_L ∇_a Q(s, a|θ^Q)|_{s=s_L, a=μ(s_L)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_L}

wherein Q(s, a|θ^Q) denotes the value of the Q function under the parameters θ^Q in the environment state s when executing the action a, and ∇_a Q(s, a|θ^Q) denotes its partial derivative with respect to a; μ(s|θ^μ) denotes the action generated by the strategy μ under the parameters θ^μ in the environment state s, and ∇_{θ^μ} μ(s|θ^μ) denotes its partial derivative with respect to θ^μ;
f) updating target network parameters of the actor networkAnd target network parameters of critic network
g) judging the comprehensive reward rtIf the number of rounds n is larger than the reward threshold value, terminating the training to obtain a final intelligent agent game decision model, otherwise, returning to the step S44 to carry out the circular iterative training until the maximum training pair number n or the comprehensive reward rtGreater than the reward threshold.
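The training loop of steps S41–S442 can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: tiny linear/tanh approximators stand in for the actor and critic neural networks, and the state/action dimensions, learning rate, soft-update rate TAU, discount GAMMA, and the toy environment are all illustrative assumptions.

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
S_DIM, A_DIM = 4, 2
TAU, GAMMA, BATCH = 0.01, 0.99, 32

# S41: initialize online parameters theta_mu (actor) and theta_Q (critic).
theta_mu = rng.normal(size=(A_DIM, S_DIM)) * 0.1   # actor: a = tanh(theta_mu @ s)
theta_q = rng.normal(size=(S_DIM + A_DIM,)) * 0.1  # critic: Q = theta_q @ [s; a]

# S42: copy online parameters to the target networks.
theta_mu_t, theta_q_t = theta_mu.copy(), theta_q.copy()

# S43: initialize the replay buffer R.
buffer = deque(maxlen=10_000)

def actor(theta, s):
    return np.tanh(theta @ s)                      # mu(s | theta_mu)

def critic(theta, s, a):
    return float(theta @ np.concatenate([s, a]))   # Q(s, a | theta_q)

def env_step(s, a):
    # Toy stand-in for the air-game environment: damped state plus action push.
    s_next = 0.9 * s + 0.1 * np.concatenate([a, a])
    return s_next, -float(np.sum(s_next ** 2))     # composite reward r_t

for episode in range(3):                           # S44: episode loop
    s = rng.normal(size=S_DIM)                     # S441: initial state
    for t in range(50):                            # S442: time-step loop
        noise = rng.normal(scale=0.1, size=A_DIM)  # exploration noise N
        a = actor(theta_mu, s) + noise             # a) act with exploration
        s_next, r = env_step(s, a)
        buffer.append((s, a, r, s_next))           # b) store transition
        if len(buffer) >= BATCH:
            for sb, ab, rb, sb1 in random.sample(list(buffer), BATCH):  # c)
                a1 = actor(theta_mu_t, sb1)
                y = rb + GAMMA * critic(theta_q_t, sb1, a1)  # c) label y_L
                td = y - critic(theta_q, sb, ab)
                # d) critic update: one SGD step on (y_L - Q)^2
                theta_q += 1e-3 * td * np.concatenate([sb, ab])
                # e) actor update: dQ/da * dmu/dtheta_mu; for this linear
                # critic, dQ/da is simply the action slice of theta_q.
                dq_da = theta_q[S_DIM:]
                dmu = 1.0 - actor(theta_mu, sb) ** 2  # tanh derivative
                theta_mu += 1e-3 * np.outer(dq_da * dmu, sb)
            # f) soft-update the target networks
            theta_q_t = TAU * theta_q + (1 - TAU) * theta_q_t
            theta_mu_t = TAU * theta_mu + (1 - TAU) * theta_mu_t
        s = s_next

print(len(buffer))  # transitions stored across all episodes
```

The soft-update rate TAU in step f) is the standard DDPG choice; step g)'s reward-threshold termination is omitted here for brevity, and a real agent would replace the linear models with the claim's neural networks.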
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110425153.0A CN113221444B (en) | 2021-04-20 | 2021-04-20 | Behavior simulation training method for air intelligent game |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113221444A true CN113221444A (en) | 2021-08-06 |
CN113221444B CN113221444B (en) | 2023-01-03 |
Family
ID=77088029
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110425153.0A Active CN113221444B (en) | 2021-04-20 | 2021-04-20 | Behavior simulation training method for air intelligent game |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113221444B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291890A (en) * | 2020-05-13 | 2020-06-16 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Game strategy optimization method, system and storage medium |
CN112052511A (en) * | 2020-06-15 | 2020-12-08 | 成都蓉奥科技有限公司 | Air combat maneuver strategy generation technology based on deep random game |
CN112052456A (en) * | 2020-08-31 | 2020-12-08 | 浙江工业大学 | Deep reinforcement learning strategy optimization defense method based on multiple intelligent agents |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113867178A (en) * | 2021-10-26 | 2021-12-31 | 哈尔滨工业大学 | Virtual and real migration training system for multi-robot confrontation |
CN113867178B (en) * | 2021-10-26 | 2022-05-31 | 哈尔滨工业大学 | Virtual and real migration training system for multi-robot confrontation |
CN114021737A (en) * | 2021-11-04 | 2022-02-08 | 中国电子科技集团公司信息科学研究院 | Game-based reinforcement learning method, system, terminal and storage medium |
CN114021737B (en) * | 2021-11-04 | 2023-08-22 | 中国电子科技集团公司信息科学研究院 | Reinforced learning method, system, terminal and storage medium based on game |
CN114423046A (en) * | 2021-12-03 | 2022-04-29 | 中国人民解放军空军工程大学 | Cooperative communication interference decision method |
CN113893539A (en) * | 2021-12-09 | 2022-01-07 | 中国电子科技集团公司第十五研究所 | Cooperative fighting method and device for intelligent agent |
CN113893539B (en) * | 2021-12-09 | 2022-03-25 | 中国电子科技集团公司第十五研究所 | Cooperative fighting method and device for intelligent agent |
CN115470710A (en) * | 2022-09-26 | 2022-12-13 | 北京鼎成智造科技有限公司 | Air game simulation method and device |
CN115648204A (en) * | 2022-09-26 | 2023-01-31 | 吉林大学 | Training method, device, equipment and storage medium of intelligent decision model |
CN115470710B (en) * | 2022-09-26 | 2023-06-06 | 北京鼎成智造科技有限公司 | Air game simulation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN113221444B (en) | 2023-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113221444B (en) | Behavior simulation training method for air intelligent game | |
CN113093802B (en) | Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning | |
CN113095481B (en) | Air combat maneuver method based on parallel self-game | |
CN113791634B (en) | Multi-agent reinforcement learning-based multi-machine air combat decision method | |
CN113050686A (en) | Combat strategy optimization method and system based on deep reinforcement learning | |
Wang et al. | Improving maneuver strategy in air combat by alternate freeze games with a deep reinforcement learning algorithm | |
CN113893539B (en) | Cooperative fighting method and device for intelligent agent | |
CN114460959A (en) | Unmanned aerial vehicle group cooperative autonomous decision-making method and device based on multi-body game | |
CN113282061A (en) | Unmanned aerial vehicle air game countermeasure solving method based on course learning | |
CN116185059A (en) | Unmanned aerial vehicle air combat autonomous evasion maneuver decision-making method based on deep reinforcement learning | |
CN113159266B (en) | Air combat maneuver decision method based on sparrow searching neural network | |
CN113962012A (en) | Unmanned aerial vehicle countermeasure strategy optimization method and device | |
Kong et al. | Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat | |
CN111773722B (en) | Method for generating maneuver strategy set for avoiding fighter plane in simulation environment | |
CN114721424A (en) | Multi-unmanned aerial vehicle cooperative countermeasure method, system and storage medium | |
CN116700079A (en) | Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP | |
CN113282100A (en) | Unmanned aerial vehicle confrontation game training control method based on reinforcement learning | |
CN115933717A (en) | Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning | |
CN114371634B (en) | Unmanned aerial vehicle combat analog simulation method based on multi-stage after-the-fact experience playback | |
Xianyong et al. | Research on maneuvering decision algorithm based on improved deep deterministic policy gradient | |
CN117291254A (en) | Agent task allocation training method based on imitation learning and safety reinforcement learning | |
CN116225065A (en) | Unmanned plane collaborative pursuit method of multi-degree-of-freedom model for multi-agent reinforcement learning | |
Kong et al. | Multi-ucav air combat in short-range maneuver strategy generation using reinforcement learning and curriculum learning | |
Wang et al. | Research on autonomous decision-making of UCAV based on deep reinforcement learning | |
Lu et al. | Strategy Generation Based on DDPG with Prioritized Experience Replay for UCAV |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||