CN114358141A - Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision


Info

Publication number
CN114358141A
CN114358141A (application CN202111530475.8A)
Authority
CN
China
Prior art keywords
combat
reward
reinforcement learning
unit
task
Prior art date
Legal status
Pending
Application number
CN202111530475.8A
Other languages
Chinese (zh)
Inventor
李博遥
郑本昌
路鹰
黄虎
惠俊鹏
陈海鹏
王振亚
李君�
阎岩
范佳宣
李丝然
何昳頔
张佳
任金磊
吴志壕
刘峰
范中行
张旭辉
赵大海
韩特
肖肖
Current Assignee
China Academy of Launch Vehicle Technology CALT
Original Assignee
China Academy of Launch Vehicle Technology CALT
Priority date
Filing date
Publication date
Application filed by China Academy of Launch Vehicle Technology CALT filed Critical China Academy of Launch Vehicle Technology CALT
Priority to CN202111530475.8A
Publication of CN114358141A
Legal status: Pending (Current)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A multi-agent reinforcement learning method for cooperative decision-making among multiple combat units comprises the following steps: for a red-versus-blue game confrontation scenario, a multi-agent reinforcement learning model is established, realizing intelligent cooperative decision modeling for multiple combat units; the number of effective training samples is increased with a post-event target conversion method, enabling the multi-agent reinforcement learning model to converge during optimization; a reward function is constructed that takes the global team-task reward as a baseline and the reward for each combat unit's specific actions as feedback information; and multiple opponent strategies are generated from different combat schemes, and the multi-agent reinforcement learning model is trained with the reward function through massive simulated game confrontation. The invention addresses problems of the prior art such as the low decision cooperativity of multiple combat units in red-versus-blue game confrontation and the difficulty of obtaining valuable training samples.

Description

Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision
Technical Field
The invention belongs to the field of game confrontation within artificial intelligence technology and relates to a multi-agent reinforcement learning method.
Background
Multi-agent deep reinforcement learning, which combines the cooperative ability of multiple agents with the decision-making ability of reinforcement learning to solve cooperative decision-making problems for clustered multi-unit systems, is a new research hotspot and application direction in machine learning. It covers numerous algorithms, rules and frameworks, is widely applied in practical fields such as autonomous driving, energy distribution, formation control, trajectory planning, route planning and social problems, and therefore has very high research value and significance. Foreign research institutions have carried out some preliminary basic research on multi-agent deep reinforcement learning; in China, research on this technology, especially applied research in the field of military command, is only just beginning.
Most current intelligent decision algorithms rely on optimization and prior knowledge. For the dynamic optimization of multiple combat units in a red-versus-blue game confrontation scenario, such methods suffer from low decision cooperativity, difficulty in obtaining valuable training samples, and similar problems.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: overcoming the defects of the prior art, providing a multi-agent reinforcement learning method for multi-combat-unit cooperative decision-making, and solving problems of the prior art such as the low decision cooperativity of multiple combat units in red-versus-blue game confrontation and the difficulty of obtaining valuable training samples.
The technical solution of the invention is as follows: a multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision comprises the following steps:
Step one: establish a multi-agent reinforcement learning model for the red-versus-blue game confrontation scenario, realizing intelligent cooperative decision modeling for multiple combat units;
the multi-agent reinforcement learning model is constructed by the following steps:
building a game confrontation scene of a red party and a blue party;
analyzing task characteristics and decision points in a game confrontation scene of the red and blue parties to determine a state space of a cooperative task decision point;
and aiming at the cooperative task decision point, establishing a multi-agent reinforcement learning model.
The method for determining the state space of the collaborative task decision point comprises the following steps:
and (3) taking the whole situation information of the game countermeasure scene and the local observation information of the combat unit as state input, performing default verification through fixed part state input values, eliminating useless or counteractive states, and determining a key state space of a task decision point.
Step two, increasing the number of effective training samples by adopting a post-event target conversion method, and realizing the optimization convergence of the multi-agent reinforcement learning model;
the specific method for enhancing the number of effective training samples by adopting the post-event target conversion method comprises the following steps:
in each round of iterative training, sample data is selected from the experience pool according to the sampling probability value, the original task target which cannot be realized by the intelligent agent in the sample is changed into a state which can be reached at a certain moment, and an effective positive sample is constructed for model training.
The sampling probability value is calculated as follows:
Figure BDA0003410478690000031
wherein p isi=|δiI + ε represents the priority of the ith sample, δiRepresenting the time sequence difference error of the ith sample, wherein epsilon represents random noise and prevents the sampling probability from being 0; α is used to adjust the degree of priority, and p (i) is the sampling probability of the ith sample data.
Thirdly, constructing a reward function by taking the reward of the global team task as a reference and taking the reward of the specific action of each fighting unit as feedback information;
the method for constructing the reward function comprises the following steps:
Calculate the global task reward R_task from the situation information at the termination time of the task decision sequence;
calculate the action reward R_i of each combat unit from the action sequence executed by that unit, where i denotes the unit number, i = 1, 2, 3, …;
from the global task reward R_task and the action reward R_i of each combat unit, calculate the cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario.
The global task reward R_task includes two categories:
the task completion reward, given when the red side has completed the combat task objective at the termination time;
the damage reward, given when the number of blue combat units destroyed by the red side exceeds the red side's own losses;
the task completion reward and the damage reward are double values with different numerical ranges.
The action reward R_i of each combat unit includes three categories:
the casualty reward, a negative reward given when a red combat unit is destroyed by the blue side;
the ammunition consumption reward, a negative reward proportional to the amount of ammunition consumed by the red combat unit;
the field-of-view reward, a positive reward given when the red combat unit can detect blue-side situation information;
the red casualty reward, ammunition consumption reward and field-of-view reward are double values with different numerical ranges.
The cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario is calculated as:

r_i = η · R_task + (1 − η) · R_i

where η denotes the importance of the team's global task reward: η = 0 means each combat unit considers only the gains brought by its own actions, and η = 1 means only the overall team gain is considered.
And step four, generating a plurality of opponent strategies according to different combat schemes, and training the multi-agent reinforcement learning model through massive simulation game confrontation by utilizing a reward function.
The specific method for training the multi-agent reinforcement learning model by using the reward function through massive simulation game confrontation comprises the following steps:
the method comprises the steps of constructing a blues strategy library according to different combat schemes, expanding the blues strategy library by using a reds online decision model at intervals of a set training period, and completing evolution training of a reds multi-agent enhanced learning model by using a reward function through massive simulation game countermeasures.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses post-event target conversion to optimize and select the samples collected online in the red-versus-blue game confrontation scenario and to generate valuable training samples, which effectively increases the number of positive samples in a wargame deduction scenario with a large search space and enables rapid optimization and convergence of the reinforcement learning intelligent model;
2. The invention computes the feedback of the reinforcement learning model in real time by combining the overall task reward with the specific-action reward of each combat unit, which better fits the multi-unit red-versus-blue game confrontation scenario and improves the cooperative effect of the reinforcement learning model;
3. The invention constructs a blue-side strategy library based on different combat schemes and uses massive self-play deduction to complete the evolutionary training of the red-side multi-agent reinforcement learning model; increasing the diversity and combat difficulty of the opponent strategies effectively improves the combat decision-making capability of the reinforcement learning model.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a model architecture diagram of the present invention;
FIG. 3 is a schematic diagram of a method for converting a post-event object according to the present invention.
Detailed Description
The invention provides a multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision. As shown in FIG. 1, the method comprises the following steps:
Step one: establish a multi-agent reinforcement learning model for the red-versus-blue game confrontation scenario, realizing intelligent cooperative decision modeling for multiple combat units.
The construction process of the multi-agent reinforcement learning model comprises the following steps:
(1.1) building a game confrontation scene of a red and blue party;
(1.2) analyzing task characteristics and decision points in the red-versus-blue game confrontation scenario to determine the state space of the task decision points;
the specific method for designing the state space comprises the following steps:
Before the decision model is built, the overall situation information of the game confrontation scenario and the local observation information of each combat unit are used as the state input, default-value verification is performed by fixing part of the state inputs, useless or counterproductive states are eliminated, and the key state space of the task decision point is determined (a minimal illustrative sketch of this screening step is given after step (1.3) below).
(1.3) For the cooperative task decision point of step (1.2), establish the multi-agent reinforcement learning model, realizing cooperative task decision modeling for the red-versus-blue multi-combat-unit game confrontation scenario.
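As a minimal illustration of the state screening described in step (1.2), the Python sketch below assembles a candidate state vector from hypothetical global-situation and local-observation features and checks, by fixing one feature at a time to a default value, how strongly a value estimate reacts to it; features with near-zero influence are candidates for elimination. The feature names, the value_model callable and the threshold are illustrative assumptions, not part of the patent.

```python
import numpy as np

def build_state(global_situation: dict, local_obs: dict) -> np.ndarray:
    """Concatenate hypothetical global-situation and local-observation features
    into one state vector (feature names are illustrative only)."""
    return np.array([
        global_situation["sim_time"],
        global_situation["red_alive_ratio"],
        global_situation["blue_detected_ratio"],
        local_obs["own_x"], local_obs["own_y"],
        local_obs["own_ammo"],
        local_obs["nearest_enemy_dist"],
    ], dtype=np.float32)

def screen_features(states: np.ndarray, value_model, defaults: np.ndarray,
                    threshold: float = 1e-3) -> list:
    """Default-value verification: fix each feature to its default value and measure
    how much the value estimate changes; near-zero influence marks a useless state."""
    baseline = value_model(states)                     # shape (N,)
    keep = []
    for j in range(states.shape[1]):
        perturbed = states.copy()
        perturbed[:, j] = defaults[j]                  # fix feature j to its default value
        delta = np.mean(np.abs(value_model(perturbed) - baseline))
        if delta > threshold:                          # feature actually affects the decision
            keep.append(j)
    return keep
```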
Step two: increase the number of effective training samples with the post-event target conversion method, so that the intelligent model converges quickly during optimization.
The detailed method for generating effective training samples by post-event target conversion is as follows:
In each round of iterative training, sample data are first drawn from the experience pool according to the sampling probability below, so that the network model is trained more efficiently during reinforcement learning:

P(i) = p_i^α / Σ_k p_k^α

where p_i = |δ_i| + ε is the priority of the i-th sample, δ_i is the temporal-difference error (td-error) of the i-th sample, ε is random noise that prevents the sampling probability from being 0, α adjusts the degree of prioritization (α = 0 corresponds to uniform sampling), and P(i) is the sampling probability of the i-th sample.
During network training, training samples are selected from the experience pool according to these sampling probabilities; the original task goal that the agent failed to achieve in a sample is replaced by a state that the agent actually reached at some moment, and effective positive samples are constructed for model training. This increases the number of valuable training samples in the red-versus-blue game confrontation scenario with its large search space and enables rapid optimization and convergence of the intelligent model.
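The prioritized sampling just described can be sketched as follows. This is a minimal illustration of P(i) = p_i^α / Σ_k p_k^α with p_i = |δ_i| + ε, assuming a simple list-based experience pool; it is not the patent's implementation.

```python
import numpy as np

class PrioritizedPool:
    """Experience pool that samples transitions with probability
    P(i) = p_i**alpha / sum_k p_k**alpha, where p_i = |td_error_i| + eps."""

    def __init__(self, alpha: float = 0.6, eps: float = 1e-6):
        self.alpha, self.eps = alpha, eps
        self.transitions, self.priorities = [], []

    def add(self, transition, td_error: float = 1.0):
        self.transitions.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size: int):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()                               # sampling probability P(i)
        idx = np.random.choice(len(self.transitions), size=batch_size, p=probs)
        return idx, [self.transitions[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, d in zip(idx, td_errors):                  # refresh priorities after a training step
            self.priorities[i] = abs(float(d)) + self.eps
```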
And thirdly, constructing a reward function by taking the reward of the global team task as a reference and combining the reward of the specific actions of each fighting unit as feedback information.
The cooperative task decision reward function for the red-versus-blue game confrontation scenario is calculated in the following steps:
(3.1) calculate the global task reward R_task from the situation information at the termination time of the task decision sequence;
(3.2) calculate the action reward R_i of each combat unit from the action sequence executed by that unit;
(3.3) combine the reward values of steps (3.1) and (3.2) to calculate the cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario.
The global task reward includes two categories:
(3.1.1) the task completion reward, given when the red side has completed the combat task objective at the termination time;
(3.1.2) the damage reward, given when the number of blue combat units destroyed by the red side exceeds the red side's own losses;
the task completion reward and the damage reward are double values, but their numerical ranges differ.
For each combat unit in the red-versus-blue game confrontation scenario, the action reward includes three categories:
(3.2.1) the casualty reward, a negative reward given when the red combat unit is destroyed by the blue side;
(3.2.2) the ammunition consumption reward, a negative reward proportional to the amount of ammunition consumed by the red combat unit;
(3.2.3) the field-of-view reward, a positive reward given when the red combat unit can detect blue-side situation information;
the red casualty reward, ammunition consumption reward and field-of-view reward are double values, but their numerical ranges differ.
The cooperative task decision feedback information r_i of each combat unit is calculated as:

r_i = η · R_task + (1 − η) · R_i

where η denotes the importance of the team's global task reward: η = 0 means each combat unit considers only the gains brought by its own actions, and η = 1 means only the overall team gain is considered.
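A minimal sketch of this feedback computation, assuming the convex combination reconstructed above; the example reward values are placeholders.

```python
def unit_feedback(r_task: float, r_unit: float, eta: float) -> float:
    """Cooperative task decision feedback r_i = eta * R_task + (1 - eta) * R_i.
    eta = 0: only the unit's own action reward; eta = 1: only the team reward."""
    assert 0.0 <= eta <= 1.0
    return eta * r_task + (1.0 - eta) * r_unit

# Example: team reward 1.0 (task completed), unit reward -0.2 (ammunition spent)
feedback = unit_feedback(r_task=1.0, r_unit=-0.2, eta=0.7)   # -> 0.64
```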
Step four: generate multiple opponent strategies based on different combat schemes and train the multi-agent reinforcement learning model with the reward function through simulated game confrontation.
A blue-side strategy library is constructed based on the different combat schemes and expanded with the red-side online decision model at regular training intervals, increasing the diversity and combat difficulty of the strategy library; the evolutionary training of the red-side multi-agent reinforcement learning model is completed through massive simulated game confrontation.
The scenario addressed by the invention is a cooperative sequential decision problem with delayed returns. To achieve task cooperativity, the invention models each combat unit with reinforcement learning and then performs centralized training.
The algorithmic framework of the multi-agent reinforcement learning model is shown in FIG. 2. The model is built on the DDPG algorithm. DDPG uses an Actor-Critic architecture with single-step updates, converges faster than traditional policy-gradient methods, and can handle sequential decision problems over continuous action spaces. The model comprises an Actor policy network and a Critic evaluation network. The Actor network fits the agent's action policy; its output is not a single action but a probability distribution over actions π(a|s), representing the probability of selecting action a in the current state s. The Critic network fits the agent's action-value function Q_π(s, a), representing the value of taking action a in the current state s.
Meanwhile, the invention increases the number of effective training samples with post-event target conversion and constructs the feedback value by combining the team's global task reward with the specific-action reward of each combat unit, improving the training efficiency and the cooperative effect of the multi-agent reinforcement learning model.
The policy network input of the multi-agent reinforcement learning model is the real-time state of the combat scenario: the position and survival state of each red combat unit, the observed positions of the blue combat units, terrain information, and the simulation time. The network output is the red side's movement target position and the selection of an attack target. In this scenario, the aim is to learn a mapping from states to actions through neural network training within limited time and to use the reinforcement learning method to rapidly generate a movement and firepower allocation scheme online. The input of the evaluation network is the real-time state of the scenario together with the action information of every agent, and its output is the joint Q value.
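The network structure described above can be illustrated with a small PyTorch sketch: a per-agent Actor that maps a local observation to an action distribution, and a centralized Critic that maps the joint scenario state plus all agents' actions to a joint Q value. Layer sizes and the discrete action head are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps one unit's local observation to an action distribution pi(a|s)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(obs), dim=-1)   # probability of each candidate action

class CentralCritic(nn.Module):
    """Evaluation network: maps the joint scenario state and all agents' actions to a joint Q value."""
    def __init__(self, state_dim: int, joint_action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, joint_action], dim=-1))
```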
The method computes the team's global task reward and the specific-action reward of each combat unit to obtain the feedback value for reinforcement learning training.
Taking a land-battle wargame confrontation scenario as an example, the task completion reward corresponds to seizing the control point, and the damage reward is granted if, at the termination time, the red side has destroyed more blue combat units than it has lost itself. The individual action reward of each combat unit is computed from its survival state at the termination time, the amount of ammunition it consumed, and the number of blue units it can detect, yielding the training feedback value for each agent of the multi-agent reinforcement learning model.
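As an illustration of how these example rewards could be tallied, the sketch below computes the global and per-unit rewards from hypothetical end-of-episode statistics; the reward magnitudes and field names are assumptions, not values taken from the patent.

```python
def global_reward(control_point_seized: bool, blue_destroyed: int, red_lost: int) -> float:
    """Global task reward: task-completion term plus damage term (illustrative magnitudes)."""
    r = 0.0
    if control_point_seized:
        r += 10.0                       # task completion reward
    if blue_destroyed > red_lost:
        r += 5.0                        # damage reward: destroyed more than lost
    return r

def unit_reward(alive: bool, ammo_used: int, blue_units_detected: int) -> float:
    """Per-unit action reward: casualty (negative), ammunition (negative), field of view (positive)."""
    r = 0.0
    if not alive:
        r -= 3.0                        # casualty reward
    r -= 0.1 * ammo_used                # ammunition consumption reward
    r += 0.5 * blue_units_detected      # field-of-view reward
    return r
```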
The method adopts the "centralized training, distributed execution" principle: during model training, the observations and executed actions of all agents are known and used to train each agent's evaluation network; during execution, each agent's policy network generates decision actions based only on its own local observation.
The specific steps of the multi-agent reinforcement learning model training algorithm are as follows:
1) Initialize the policy main networks π_i(o_i; θ_i^π), the evaluation main networks Q_i(s, a; θ_i^Q), the policy target networks π_i′(o_i; θ_i^π′), the evaluation target networks Q_i′(s, a; θ_i^Q′), and the experience pool; each target network is a copy of the corresponding main network. Here o_i denotes the local observation information of each combat unit, s the joint situation information, a the joint action, θ_i^π and θ_i^Q the main-network weight parameters, and θ_i^π′ and θ_i^Q′ the target-network weight parameters; i = 1, 2, 3, …
2) Select the action of each agent in the current state:

a_{i,t} = π_i(o_{i,t}; θ_i^π) + N_t

where N_t is random noise obeying a normal distribution, used to increase the exploration ability of the agent;
3) Execute the actions, obtain the reward value of each agent, and store the state-action transition data (O_t, A_t, R_t, O_{t+1}) in the experience pool, where O_t = {o_{1,t}, o_{2,t}, …, o_{i,t}} denotes the joint state at time t, A_t = {a_{1,t}, a_{2,t}, …, a_{i,t}} the joint action of the agents at time t, R_t = {r_{1,t}, r_{2,t}, …, r_{i,t}} the feedback of each agent at time t, and O_{t+1} = {o_{1,t+1}, o_{2,t+1}, …, o_{i,t+1}} the joint state at time t + 1.
4) When the experience pool contains enough samples, process the sample data with post-event target conversion (selecting samples with the td-error-based sampling probability) and construct effective positive samples for model training. The loss function L of each agent's evaluation network is computed as:

L(θ_i^Q) = E[(y_{i,t} − Q_i(O_t, A_t; θ_i^Q))²],  with  y_{i,t} = r_{i,t} + γ Q_i′(O_{t+1}, A′_{t+1}; θ_i^Q′)

where A′_{t+1} is the joint action given by the policy target networks in state O_{t+1}, E[·] denotes the expectation, r_{i,t} is the reward value of the i-th agent at time t, and γ is the discount (attenuation) factor, 0 ≤ γ ≤ 1.
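One critic update of this form can be sketched as follows, reusing the Actor and CentralCritic modules from the earlier sketch; it is an illustrative single training step under assumed tensor shapes, not the patent's exact procedure.

```python
import torch
import torch.nn.functional as F

def critic_step(critic, target_critic, target_actors, batch, gamma: float, optimizer):
    """One TD update of agent i's evaluation network:
    y = r_i + gamma * Q_i'(O_{t+1}, A'_{t+1}); minimize (y - Q_i(O_t, A_t))^2."""
    joint_state, joint_action, reward, next_obs, next_joint_state = batch
    with torch.no_grad():
        # target policies provide the next joint action A'_{t+1}
        next_actions = torch.cat([actor(o) for actor, o in zip(target_actors, next_obs)], dim=-1)
        y = reward + gamma * target_critic(next_joint_state, next_actions).squeeze(-1)
    q = critic(joint_state, joint_action).squeeze(-1)
    loss = F.mse_loss(q, y)                 # L(theta^Q) = E[(y - Q)^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return (y - q).detach()                 # td-error, reusable as the sampling priority
```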
When sampling training data, the method ranks the sample data stored in the experience pool by priority, which increases the probability that valuable samples are drawn and improves training efficiency. The td-error is used as the measure of sample importance: the larger the td-error, the larger the gap between the evaluation network's action-value estimate and the action-value target, and the more valuable the training sample.
The scenario of the invention is not a strict sequential decision problem; rather, it is a cooperative sequential decision problem that combines sparse returns, delayed returns and similar difficulties. The problem can be regarded as a function optimization problem whose objective is to maximize the cooperative task reward. Because the search space of the red-versus-blue game confrontation scenario is large, a reinforcement learning method relying on random exploration obtains successful samples only with difficulty, or not at all. According to the method of the invention, post-event target conversion replaces the original task goal that the agent could not achieve in the sample data with a state the agent actually reached at some moment, constructing effective positive samples for model training.
A schematic diagram of post-event target conversion is shown in FIG. 3. Although the agent fails to reach the intended movement target position, the sample data are relabeled by the post-event target conversion method, so the currently failed decision sequence is converted into a successful decision trajectory. This accumulates movement-skill experience for the agent and ultimately enables learning of a cooperative movement strategy toward the target position.
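A minimal sketch of the relabeling idea in FIG. 3: a failed trajectory toward an unreached goal is rewritten so that a state the unit actually reached becomes the goal, turning it into a positive sample. The transition layout and the goal-based reward are assumptions for illustration.

```python
def relabel_trajectory(trajectory, reach_tolerance: float = 1.0):
    """Post-event target conversion: replace the original (unreached) goal of every
    transition with the position the unit actually reached at the end of the episode."""
    achieved_goal = trajectory[-1]["achieved_pos"]      # state reached at some (final) moment
    relabeled = []
    for step in trajectory:
        new_step = dict(step)
        new_step["goal"] = achieved_goal
        # the reward is recomputed against the substituted goal, so the sample becomes positive
        dist = sum((a - g) ** 2 for a, g in zip(step["achieved_pos"], achieved_goal)) ** 0.5
        new_step["reward"] = 1.0 if dist <= reach_tolerance else 0.0
        relabeled.append(new_step)
    return relabeled
```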
Modeling and training the intelligent strategy model in the red-versus-blue game confrontation scenario is data-driven. The method rapidly obtains training samples by simulating the offense-defense confrontation process of the combat scenario, improving the learning efficiency of the strategy model, and completes the evolution of the decision-making capability of the red-side multi-agent reinforcement learning model. The game confrontation training method comprises the following specific steps:
1) Before model training starts, generate multiple opponent strategies offline using prior knowledge based on different combat schemes, and construct the blue-side strategy library;
2) randomly draw a strategy model from the strategy library and generate training data for iterative model training through red-versus-blue confrontation deduction on the simulation platform;
3) expand the blue-side strategy library online with the red-side strategy model at regular training intervals;
4) repeat steps 2) to 4) cyclically to realize the evolutionary training of the intelligent model in the game confrontation scenario.
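The training procedure above can be sketched as a loop over an opponent pool; simulate_battle, update_red_model and snapshot_policy are hypothetical helpers standing in for the simulation platform and the learner, and the expansion interval is an assumption.

```python
import random

def evolve_red_model(red_model, blue_strategy_pool, pool_add_interval: int,
                     n_iterations: int, simulate_battle, update_red_model, snapshot_policy):
    """Evolutionary training against a growing blue-side strategy library."""
    for it in range(1, n_iterations + 1):
        blue_strategy = random.choice(blue_strategy_pool)       # step 2): random opponent
        episodes = simulate_battle(red_model, blue_strategy)    # confrontation deduction
        update_red_model(red_model, episodes)                   # iterative model training
        if it % pool_add_interval == 0:                         # step 3): expand blue library
            blue_strategy_pool.append(snapshot_policy(red_model))
    return red_model
```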
On the red-versus-blue game confrontation deduction simulation platform, the method is verified on the red side's cooperative maneuver-and-strike decision capability for seizing the control point. The test flow is as follows:
1) Set up an appropriate game confrontation scenario;
2) train the multi-agent reinforcement learning model through simulated confrontation and verify the adaptability of the red-side cooperative maneuver-and-strike decision model to the typical scenario; if model training does not converge, adjust the parameters and retrain until the model converges, then proceed to the next step;
3) carry out verification tests of the method under random scenario conditions;
4) in the same typical combat scenario as step 3), use a single-agent reinforcement learning model for each red combat unit and carry out verification tests after its training converges;
5) in the same typical combat scenario as step 3), train and test the multi-agent reinforcement learning model with post-event target conversion disabled;
6) statistically compare and analyze the test results of steps 3), 4) and 5); the comparison shows that the invention effectively addresses the low decision cooperativity of traditional multi-combat-unit game confrontation decision-making and the difficulty of obtaining valuable training samples.
For army tactical task planning requirements, the multi-agent reinforcement learning model is used to decide the red side's cooperative maneuver-and-strike action sequence in the red-versus-blue game confrontation scenario. The sample generation method based on post-event target conversion enhances learning efficiency and exploration ability and enables rapid convergence of the intelligent model. Evaluation parameters that consider both the global team task reward and the specific-action reward of each combat unit are constructed and used as feedback, effectively improving the cooperative decision effect of the intelligent model. A massive game confrontation training method generates multiple blue-side combat strategies offline and online and rapidly produces training samples through massive confrontation deduction, so that the combat capability of the intelligent model evolves and upgrades. On the red-versus-blue game confrontation deduction simulation platform, the effectiveness of the method is verified on the red side's cooperative maneuver-and-strike decision capability for seizing the control point. The invention solves problems of the prior art such as the low decision cooperativity of multiple combat units in red-versus-blue game confrontation and the difficulty of obtaining valuable training samples.
Those skilled in the art will appreciate that the details of the invention not described in detail in this specification are well within the skill of those in the art.

Claims (10)

1. A multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision is characterized by comprising the following steps:
aiming at the game confrontation scene of the red and blue sides, a multi-agent reinforcement learning model is established, and intelligent cooperative decision modeling facing multiple combat units is realized;
the effective training sample number is increased by adopting a post-event target conversion method, and the optimization convergence of the multi-agent reinforcement learning model is realized;
constructing a reward function by taking the reward of the global team task as a reference and taking the reward of the specific actions of each fighting unit as feedback information;
and generating a plurality of opponent strategies according to different combat schemes, and training the multi-agent reinforcement learning model by simulating game confrontation by using a reward function.
2. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 1, wherein: the multi-agent reinforcement learning model is constructed by the following steps:
building a game confrontation scene of a red party and a blue party;
analyzing task characteristics and decision points in a game confrontation scene of the red and blue parties to determine a state space of a cooperative task decision point;
and aiming at the cooperative task decision point, establishing a multi-agent reinforcement learning model.
3. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 2, wherein: the method for determining the state space of the collaborative task decision point comprises the following steps:
and (3) taking the whole situation information of the game countermeasure scene and the local observation information of the combat unit as state input, performing default verification through fixed part state input values, eliminating useless or counteractive states, and determining a key state space of a task decision point.
4. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 1, wherein: the specific method for increasing the number of effective training samples by the post-event target conversion method comprises the following steps:
in each round of iterative training, sample data are drawn from the experience pool according to their sampling probability values; the original task goal that the agent failed to achieve in a sample is replaced by a state that the agent actually reached at some moment, and an effective positive sample is constructed for model training.
5. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 4, wherein: the sampling probability value is calculated as follows:
P(i) = p_i^α / Σ_k p_k^α

where p_i = |δ_i| + ε is the priority of the i-th sample, δ_i is the temporal-difference error of the i-th sample, ε is random noise that prevents the sampling probability from being 0, α adjusts the degree of prioritization, and P(i) is the sampling probability of the i-th sample.
6. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 1, wherein: the method for constructing the reward function comprises the following steps:
calculating the global task reward R_task from the situation information at the termination time of the task decision sequence;
calculating the action reward R_i of each combat unit from the action sequence executed by that unit, where i denotes the unit number, i = 1, 2, 3, …;
from the global task reward R_task and the action reward R_i of each combat unit, calculating the cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario.
7. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 6, wherein the global task reward R_task includes two categories:
the task completion reward, given when the red side has completed the combat task objective at the termination time;
the damage reward, given when the number of blue combat units destroyed by the red side exceeds the red side's own losses;
the task completion reward and the damage reward are double values with different numerical ranges.
8. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 6, wherein the action reward R_i of each combat unit includes three categories:
the casualty reward, a negative reward given when the red combat unit is destroyed by the blue side;
the ammunition consumption reward, a negative reward proportional to the amount of ammunition consumed by the red combat unit;
the field-of-view reward, a positive reward given when the red combat unit can detect blue-side situation information;
the red casualty reward, ammunition consumption reward and field-of-view reward are double values with different numerical ranges.
9. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 6, wherein the cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario is calculated as:

r_i = η · R_task + (1 − η) · R_i

where η denotes the importance of the team's global task reward: η = 0 means each combat unit considers only the gains brought by its own actions, and η = 1 means only the overall team gain is considered.
10. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 1, wherein: the specific method for training the multi-agent reinforcement learning model by simulating game confrontation by utilizing the reward function comprises the following steps:
constructing a blue-side strategy library according to different combat schemes, expanding the blue-side strategy library with the red-side online decision model at set training intervals, and completing the evolutionary training of the red-side multi-agent reinforcement learning model with the reward function through simulated game confrontation.
CN202111530475.8A 2021-12-14 2021-12-14 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision Pending CN114358141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111530475.8A CN114358141A (en) 2021-12-14 2021-12-14 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111530475.8A CN114358141A (en) 2021-12-14 2021-12-14 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision

Publications (1)

Publication Number Publication Date
CN114358141A true CN114358141A (en) 2022-04-15

Family

ID=81099216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111530475.8A Pending CN114358141A (en) 2021-12-14 2021-12-14 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision

Country Status (1)

Country Link
CN (1) CN114358141A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925601A (en) * 2022-05-06 2022-08-19 南京航空航天大学 Combat simulation deduction method based on deep reinforcement learning and image vision
CN114626836B (en) * 2022-05-17 2022-08-05 浙江大学 Multi-agent reinforcement learning-based emergency post-delivery decision-making system and method
CN114626836A (en) * 2022-05-17 2022-06-14 浙江大学 Multi-agent reinforcement learning-based emergency delivery decision-making system and method
CN114897267B (en) * 2022-06-14 2024-02-27 哈尔滨工业大学(深圳) Fire distribution method and system for multi-to-multi-agent cooperative combat scene
CN114897267A (en) * 2022-06-14 2022-08-12 哈尔滨工业大学(深圳) Fire power distribution method and system for many-to-many intelligent agent cooperative battlefield scene
CN114880955A (en) * 2022-07-05 2022-08-09 中国人民解放军国防科技大学 War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning
CN114880955B (en) * 2022-07-05 2022-09-20 中国人民解放军国防科技大学 War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning
CN115544898A (en) * 2022-11-09 2022-12-30 哈尔滨工业大学 Multi-agent attack and defense decision method based on deep reinforcement learning
CN115544898B (en) * 2022-11-09 2023-08-29 哈尔滨工业大学 Multi-agent attack and defense decision-making method based on deep reinforcement learning
CN116128013B (en) * 2023-04-07 2023-07-04 中国人民解放军国防科技大学 Temporary collaboration method and device based on diversity population training and computer equipment
CN116128013A (en) * 2023-04-07 2023-05-16 中国人民解放军国防科技大学 Temporary collaboration method and device based on diversity population training and computer equipment
CN116679742A (en) * 2023-04-11 2023-09-01 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116679742B (en) * 2023-04-11 2024-04-02 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116821693A (en) * 2023-08-29 2023-09-29 腾讯科技(深圳)有限公司 Model training method and device for virtual scene, electronic equipment and storage medium
CN116821693B (en) * 2023-08-29 2023-11-03 腾讯科技(深圳)有限公司 Model training method and device for virtual scene, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114358141A (en) Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision
CN112861442B (en) Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN111275174B (en) Game-oriented radar countermeasure generating method
CN114757351B (en) Defense method for resisting attack by deep reinforcement learning model
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN114880955B (en) War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning
Uriarte et al. Automatic learning of combat models for RTS games
Toghiani-Rizi et al. Evaluating deep reinforcement learning for computer generated forces in ground combat simulation
CN109299491B (en) Meta-model modeling method based on dynamic influence graph strategy and using method
CN115293022A (en) Aviation soldier intelligent agent confrontation behavior modeling method based on OptiGAN and spatiotemporal attention
CN116360503B (en) Unmanned plane game countermeasure strategy generation method and system and electronic equipment
Hou et al. Advances in memetic automaton: Toward human-like autonomous agents in complex multi-agent learning problems
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
CN114662655A (en) Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN116596343A (en) Intelligent soldier chess deduction decision method based on deep reinforcement learning
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
CN115185294A (en) QMIX-based aviation soldier multi-formation collaborative autonomous behavior decision-making modeling method
CN113807230A (en) Equipment target identification method based on active reinforcement learning and man-machine intelligent body
Liu A novel approach based on evolutionary game theoretic model for multi-player pursuit evasion
Bian et al. Cooperative strike target assignment algorithm based on deep reinforcement learning
CN110930054A (en) Data-driven battle system key parameter rapid optimization method
Li et al. Dynamic weapon target assignment based on deep q network
Qin et al. Confrontation Simulation Platform Based on Spatio-temporal Agents
Mo et al. Research on virtual human swarm football collaboration technology based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination