CN114358141A - Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision


Info

Publication number
CN114358141A
CN114358141A (application CN202111530475.8A)
Authority
CN
China
Prior art keywords
combat
reward
reinforcement learning
unit
task
Prior art date
Legal status
Pending
Application number
CN202111530475.8A
Other languages
Chinese (zh)
Inventor
李博遥
郑本昌
路鹰
黄虎
惠俊鹏
陈海鹏
王振亚
李君�
阎岩
范佳宣
李丝然
何昳頔
张佳
任金磊
吴志壕
刘峰
范中行
张旭辉
赵大海
韩特
肖肖
Current Assignee
China Academy of Launch Vehicle Technology CALT
Original Assignee
China Academy of Launch Vehicle Technology CALT
Priority date
Filing date
Publication date
Application filed by China Academy of Launch Vehicle Technology CALT filed Critical China Academy of Launch Vehicle Technology CALT
Priority to CN202111530475.8A
Publication of CN114358141A
Legal status: Pending (Current)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A multi-agent reinforcement learning method for cooperative decision-making among multiple combat units comprises the following steps: for a red-versus-blue game confrontation scenario, a multi-agent reinforcement learning model is established, realizing intelligent cooperative decision modeling for multiple combat units; the number of effective training samples is increased with a post-event target conversion method, enabling the multi-agent reinforcement learning model to converge during optimization; a reward function is constructed that takes the global team-task reward as a baseline and the reward for each combat unit's specific actions as feedback information; and multiple opponent strategies are generated from different combat schemes, and the multi-agent reinforcement learning model is trained with the reward function through massive simulated game confrontation. The invention addresses problems of the prior art such as the low decision cooperativity of multiple combat units in red-versus-blue game confrontation and the difficulty of obtaining valuable training samples.

Description

Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision
Technical Field
The invention belongs to the field of game confrontation within artificial intelligence technology and relates to a multi-agent reinforcement learning method.
Background
Multi-agent deep reinforcement learning, which combines the cooperative ability of multiple agents with the decision-making ability of reinforcement learning to solve cooperative decision-making problems for clustered multi-unit systems, is a new research hotspot and application direction in machine learning. It covers numerous algorithms, rules and frameworks, is widely applied in practical fields such as autonomous driving, energy distribution, formation control, trajectory planning, route planning and social problems, and therefore has very high research value and significance. Foreign research institutions have carried out some preliminary basic research on multi-agent deep reinforcement learning; in China, research on this technology, especially applied research in the field of military command, is only just beginning.
Most current intelligent decision algorithms rely on optimization and prior knowledge. For the dynamic optimization of multiple combat units in a red-versus-blue game confrontation scenario, such methods suffer from low decision cooperativity, difficulty in obtaining valuable training samples, and similar problems.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: overcoming the defects of the prior art, providing a multi-agent reinforcement learning method for multi-combat-unit cooperative decision-making, and solving problems of the prior art such as the low decision cooperativity of multiple combat units in red-versus-blue game confrontation and the difficulty of obtaining valuable training samples.
The technical solution of the invention is as follows: a multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision comprises the following steps:
Step one: establish a multi-agent reinforcement learning model for the red-versus-blue game confrontation scenario, realizing intelligent cooperative decision modeling for multiple combat units;
the multi-agent reinforcement learning model is constructed by the following steps:
building a game confrontation scene of a red party and a blue party;
analyzing task characteristics and decision points in a game confrontation scene of the red and blue parties to determine a state space of a cooperative task decision point;
and aiming at the cooperative task decision point, establishing a multi-agent reinforcement learning model.
The method for determining the state space of the collaborative task decision point comprises the following steps:
and (3) taking the whole situation information of the game countermeasure scene and the local observation information of the combat unit as state input, performing default verification through fixed part state input values, eliminating useless or counteractive states, and determining a key state space of a task decision point.
Step two, increasing the number of effective training samples by adopting a post-event target conversion method, and realizing the optimization convergence of the multi-agent reinforcement learning model;
the specific method for enhancing the number of effective training samples by adopting the post-event target conversion method comprises the following steps:
in each round of iterative training, sample data is selected from the experience pool according to the sampling probability value, the original task target which cannot be realized by the intelligent agent in the sample is changed into a state which can be reached at a certain moment, and an effective positive sample is constructed for model training.
The sampling probability value is calculated as follows:
Figure BDA0003410478690000031
wherein p isi=|δiI + ε represents the priority of the ith sample, δiRepresenting the time sequence difference error of the ith sample, wherein epsilon represents random noise and prevents the sampling probability from being 0; α is used to adjust the degree of priority, and p (i) is the sampling probability of the ith sample data.
Thirdly, constructing a reward function by taking the reward of the global team task as a reference and taking the reward of the specific action of each fighting unit as feedback information;
the method for constructing the reward function comprises the following steps:
Calculate the global task reward R_task from the situation information at the termination time of the task decision sequence;
calculate the action reward R_i of each combat unit from the action sequence executed by that unit, where i denotes the unit number, i = 1, 2, 3, …;
from the global task reward R_task and the action reward R_i of each combat unit, calculate the cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario.
The global task reward R_task includes two categories:
the task completion reward, given when the red side has completed the combat task objective at the termination time;
the damage reward, given when the number of blue combat units destroyed by the red side exceeds the red side's own losses;
the task completion reward and the damage reward are double values with different numerical ranges.
The action reward R_i of each combat unit includes three categories:
the casualty reward, a negative reward given when a red combat unit is destroyed by the blue side;
the ammunition consumption reward, a negative reward proportional to the amount of ammunition consumed by the red combat unit;
the field-of-view reward, a positive reward given when the red combat unit can detect blue-side situation information;
the red casualty reward, ammunition consumption reward and field-of-view reward are double values with different numerical ranges.
The cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario is calculated as:

r_i = η · R_task + (1 − η) · R_i

where η denotes the importance of the team's global task reward: η = 0 means each combat unit considers only the gains brought by its own actions, and η = 1 means only the overall team gain is considered.
And step four, generating a plurality of opponent strategies according to different combat schemes, and training the multi-agent reinforcement learning model through massive simulation game confrontation by utilizing a reward function.
The specific method for training the multi-agent reinforcement learning model by using the reward function through massive simulation game confrontation comprises the following steps:
the method comprises the steps of constructing a blues strategy library according to different combat schemes, expanding the blues strategy library by using a reds online decision model at intervals of a set training period, and completing evolution training of a reds multi-agent enhanced learning model by using a reward function through massive simulation game countermeasures.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses post-event target conversion to optimize and select the samples collected online in the red-versus-blue game confrontation scenario and to generate valuable training samples, which effectively increases the number of positive samples in a wargame deduction scenario with a large search space and enables rapid optimization and convergence of the reinforcement learning intelligent model;
2. The invention computes the feedback of the reinforcement learning model in real time by combining the overall task reward with the specific-action reward of each combat unit, which better fits the multi-unit red-versus-blue game confrontation scenario and improves the cooperative effect of the reinforcement learning model;
3. The invention constructs a blue-side strategy library based on different combat schemes and uses massive self-play deduction to complete the evolutionary training of the red-side multi-agent reinforcement learning model; increasing the diversity and combat difficulty of the opponent strategies effectively improves the combat decision-making capability of the reinforcement learning model.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a model architecture diagram of the present invention;
FIG. 3 is a schematic diagram of a method for converting a post-event object according to the present invention.
Detailed Description
The invention provides a multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision. As shown in FIG. 1, the method comprises the following steps:
Step one: establish a multi-agent reinforcement learning model for the red-versus-blue game confrontation scenario, realizing intelligent cooperative decision modeling for multiple combat units.
The construction process of the multi-agent reinforcement learning model comprises the following steps:
(1.1) building a game confrontation scene of a red and blue party;
(1.2) analyzing task characteristics and decision points in the red-versus-blue game confrontation scenario to determine the state space of the task decision points;
the specific method for designing the state space comprises the following steps:
Before the decision model is built, the overall situation information of the game confrontation scenario and the local observation information of each combat unit are used as the state input, default-value verification is performed by fixing part of the state inputs, useless or counterproductive states are eliminated, and the key state space of the task decision point is determined (a minimal illustrative sketch of this screening step is given after step (1.3) below).
(1.3) For the cooperative task decision point of step (1.2), establish the multi-agent reinforcement learning model, realizing cooperative task decision modeling for the red-versus-blue multi-combat-unit game confrontation scenario.
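As a minimal illustration of the state screening described in step (1.2), the Python sketch below assembles a candidate state vector from hypothetical global-situation and local-observation features and checks, by fixing one feature at a time to a default value, how strongly a value estimate reacts to it; features with near-zero influence are candidates for elimination. The feature names, the value_model callable and the threshold are illustrative assumptions, not part of the patent.

```python
import numpy as np

def build_state(global_situation: dict, local_obs: dict) -> np.ndarray:
    """Concatenate hypothetical global-situation and local-observation features
    into one state vector (feature names are illustrative only)."""
    return np.array([
        global_situation["sim_time"],
        global_situation["red_alive_ratio"],
        global_situation["blue_detected_ratio"],
        local_obs["own_x"], local_obs["own_y"],
        local_obs["own_ammo"],
        local_obs["nearest_enemy_dist"],
    ], dtype=np.float32)

def screen_features(states: np.ndarray, value_model, defaults: np.ndarray,
                    threshold: float = 1e-3) -> list:
    """Default-value verification: fix each feature to its default value and measure
    how much the value estimate changes; near-zero influence marks a useless state."""
    baseline = value_model(states)                     # shape (N,)
    keep = []
    for j in range(states.shape[1]):
        perturbed = states.copy()
        perturbed[:, j] = defaults[j]                  # fix feature j to its default value
        delta = np.mean(np.abs(value_model(perturbed) - baseline))
        if delta > threshold:                          # feature actually affects the decision
            keep.append(j)
    return keep
```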
Step two: increase the number of effective training samples with the post-event target conversion method, so that the intelligent model converges quickly during optimization.
The detailed method for generating effective training samples by post-event target conversion is as follows:
In each round of iterative training, sample data are first drawn from the experience pool according to the sampling probability below, so that the network model is trained more efficiently during reinforcement learning:

P(i) = p_i^α / Σ_k p_k^α

where p_i = |δ_i| + ε is the priority of the i-th sample, δ_i is the temporal-difference error (td-error) of the i-th sample, ε is random noise that prevents the sampling probability from being 0, α adjusts the degree of prioritization (α = 0 corresponds to uniform sampling), and P(i) is the sampling probability of the i-th sample.
During network training, training samples are selected from the experience pool according to these sampling probabilities; the original task goal that the agent failed to achieve in a sample is replaced by a state that the agent actually reached at some moment, and effective positive samples are constructed for model training. This increases the number of valuable training samples in the red-versus-blue game confrontation scenario with its large search space and enables rapid optimization and convergence of the intelligent model.
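The prioritized sampling just described can be sketched as follows. This is a minimal illustration of P(i) = p_i^α / Σ_k p_k^α with p_i = |δ_i| + ε, assuming a simple list-based experience pool; it is not the patent's implementation.

```python
import numpy as np

class PrioritizedPool:
    """Experience pool that samples transitions with probability
    P(i) = p_i**alpha / sum_k p_k**alpha, where p_i = |td_error_i| + eps."""

    def __init__(self, alpha: float = 0.6, eps: float = 1e-6):
        self.alpha, self.eps = alpha, eps
        self.transitions, self.priorities = [], []

    def add(self, transition, td_error: float = 1.0):
        self.transitions.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size: int):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()                               # sampling probability P(i)
        idx = np.random.choice(len(self.transitions), size=batch_size, p=probs)
        return idx, [self.transitions[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, d in zip(idx, td_errors):                  # refresh priorities after a training step
            self.priorities[i] = abs(float(d)) + self.eps
```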
And thirdly, constructing a reward function by taking the reward of the global team task as a reference and combining the reward of the specific actions of each fighting unit as feedback information.
The cooperative task decision reward function for the red-versus-blue game confrontation scenario is calculated in the following steps:
(3.1) calculate the global task reward R_task from the situation information at the termination time of the task decision sequence;
(3.2) calculate the action reward R_i of each combat unit from the action sequence executed by that unit;
(3.3) combine the reward values of steps (3.1) and (3.2) to calculate the cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario.
The global task reward includes two categories:
(3.1.1) the task completion reward, given when the red side has completed the combat task objective at the termination time;
(3.1.2) the damage reward, given when the number of blue combat units destroyed by the red side exceeds the red side's own losses;
the task completion reward and the damage reward are double values, but their numerical ranges differ.
For each combat unit in the red-versus-blue game confrontation scenario, the action reward includes three categories:
(3.2.1) the casualty reward, a negative reward given when the red combat unit is destroyed by the blue side;
(3.2.2) the ammunition consumption reward, a negative reward proportional to the amount of ammunition consumed by the red combat unit;
(3.2.3) the field-of-view reward, a positive reward given when the red combat unit can detect blue-side situation information;
the red casualty reward, ammunition consumption reward and field-of-view reward are double values, but their numerical ranges differ.
The cooperative task decision feedback information r_i of each combat unit is calculated as:

r_i = η · R_task + (1 − η) · R_i

where η denotes the importance of the team's global task reward: η = 0 means each combat unit considers only the gains brought by its own actions, and η = 1 means only the overall team gain is considered.
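A minimal sketch of this feedback computation, assuming the convex combination reconstructed above; the example reward values are placeholders.

```python
def unit_feedback(r_task: float, r_unit: float, eta: float) -> float:
    """Cooperative task decision feedback r_i = eta * R_task + (1 - eta) * R_i.
    eta = 0: only the unit's own action reward; eta = 1: only the team reward."""
    assert 0.0 <= eta <= 1.0
    return eta * r_task + (1.0 - eta) * r_unit

# Example: team reward 1.0 (task completed), unit reward -0.2 (ammunition spent)
feedback = unit_feedback(r_task=1.0, r_unit=-0.2, eta=0.7)   # -> 0.64
```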
Step four: generate multiple opponent strategies based on different combat schemes and train the multi-agent reinforcement learning model with the reward function through simulated game confrontation.
A blue-side strategy library is constructed based on the different combat schemes and expanded with the red-side online decision model at regular training intervals, increasing the diversity and combat difficulty of the strategy library; the evolutionary training of the red-side multi-agent reinforcement learning model is completed through massive simulated game confrontation.
The scenario addressed by the invention is a cooperative sequential decision problem with delayed returns. To achieve task cooperativity, the invention models each combat unit with reinforcement learning and then performs centralized training.
The algorithmic framework of the multi-agent reinforcement learning model is shown in FIG. 2. The model is built on the DDPG algorithm. DDPG uses an Actor-Critic architecture with single-step updates, converges faster than traditional policy-gradient methods, and can handle sequential decision problems over continuous action spaces. The model comprises an Actor policy network and a Critic evaluation network. The Actor network fits the agent's action policy; its output is not a single action but a probability distribution over actions π(a|s), representing the probability of selecting action a in the current state s. The Critic network fits the agent's action-value function Q_π(s, a), representing the value of taking action a in the current state s.
Meanwhile, the invention increases the number of effective training samples with post-event target conversion and constructs the feedback value by combining the team's global task reward with the specific-action reward of each combat unit, improving the training efficiency and the cooperative effect of the multi-agent reinforcement learning model.
The policy network input of the multi-agent reinforcement learning model is the real-time state of the combat scenario: the position and survival state of each red combat unit, the observed positions of the blue combat units, terrain information, and the simulation time. The network output is the red side's movement target position and the selection of an attack target. In this scenario, the aim is to learn a mapping from states to actions through neural network training within limited time and to use the reinforcement learning method to rapidly generate a movement and firepower allocation scheme online. The input of the evaluation network is the real-time state of the scenario together with the action information of every agent, and its output is the joint Q value.
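The network structure described above can be illustrated with a small PyTorch sketch: a per-agent Actor that maps a local observation to an action distribution, and a centralized Critic that maps the joint scenario state plus all agents' actions to a joint Q value. Layer sizes and the discrete action head are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps one unit's local observation to an action distribution pi(a|s)."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(obs), dim=-1)   # probability of each candidate action

class CentralCritic(nn.Module):
    """Evaluation network: maps the joint scenario state and all agents' actions to a joint Q value."""
    def __init__(self, state_dim: int, joint_action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, joint_action], dim=-1))
```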
The method computes the team's global task reward and the specific-action reward of each combat unit to obtain the feedback value for reinforcement learning training.
Taking a land-battle wargame confrontation scenario as an example, the task completion reward corresponds to seizing the control point, and the damage reward is granted if, at the termination time, the red side has destroyed more blue combat units than it has lost itself. The individual action reward of each combat unit is computed from its survival state at the termination time, the amount of ammunition it consumed, and the number of blue units it can detect, yielding the training feedback value for each agent of the multi-agent reinforcement learning model.
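As an illustration of how these example rewards could be tallied, the sketch below computes the global and per-unit rewards from hypothetical end-of-episode statistics; the reward magnitudes and field names are assumptions, not values taken from the patent.

```python
def global_reward(control_point_seized: bool, blue_destroyed: int, red_lost: int) -> float:
    """Global task reward: task-completion term plus damage term (illustrative magnitudes)."""
    r = 0.0
    if control_point_seized:
        r += 10.0                       # task completion reward
    if blue_destroyed > red_lost:
        r += 5.0                        # damage reward: destroyed more than lost
    return r

def unit_reward(alive: bool, ammo_used: int, blue_units_detected: int) -> float:
    """Per-unit action reward: casualty (negative), ammunition (negative), field of view (positive)."""
    r = 0.0
    if not alive:
        r -= 3.0                        # casualty reward
    r -= 0.1 * ammo_used                # ammunition consumption reward
    r += 0.5 * blue_units_detected      # field-of-view reward
    return r
```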
The method adopts the "centralized training, distributed execution" principle: during model training, the observations and executed actions of all agents are known and used to train each agent's evaluation network; during execution, each agent's policy network generates decision actions based only on its own local observation.
The specific steps of the multi-agent reinforcement learning model training algorithm are as follows:
1) Initialize the policy main networks π_i(o_i; θ_i^π), the evaluation main networks Q_i(s, a; θ_i^Q), the policy target networks π_i′(o_i; θ_i^π′), the evaluation target networks Q_i′(s, a; θ_i^Q′), and the experience pool; each target network is a copy of the corresponding main network. Here o_i denotes the local observation information of each combat unit, s the joint situation information, a the joint action, θ_i^π and θ_i^Q the main-network weight parameters, and θ_i^π′ and θ_i^Q′ the target-network weight parameters; i = 1, 2, 3, …
2) Select the action of each agent in the current state:

a_{i,t} = π_i(o_{i,t}; θ_i^π) + N_t

where N_t is random noise obeying a normal distribution, used to increase the exploration ability of the agent;
3) Execute the actions, obtain the reward value of each agent, and store the state-action transition data (O_t, A_t, R_t, O_{t+1}) in the experience pool, where O_t = {o_{1,t}, o_{2,t}, …, o_{i,t}} denotes the joint state at time t, A_t = {a_{1,t}, a_{2,t}, …, a_{i,t}} the joint action of the agents at time t, R_t = {r_{1,t}, r_{2,t}, …, r_{i,t}} the feedback of each agent at time t, and O_{t+1} = {o_{1,t+1}, o_{2,t+1}, …, o_{i,t+1}} the joint state at time t + 1.
4) When the experience pool contains enough samples, process the sample data with post-event target conversion (selecting samples with the td-error-based sampling probability) and construct effective positive samples for model training. The loss function L of each agent's evaluation network is computed as:

L(θ_i^Q) = E[(y_{i,t} − Q_i(O_t, A_t; θ_i^Q))²],  with  y_{i,t} = r_{i,t} + γ Q_i′(O_{t+1}, A′_{t+1}; θ_i^Q′)

where A′_{t+1} is the joint action given by the policy target networks in state O_{t+1}, E[·] denotes the expectation, r_{i,t} is the reward value of the i-th agent at time t, and γ is the discount (attenuation) factor, 0 ≤ γ ≤ 1.
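One critic update of this form can be sketched as follows, reusing the Actor and CentralCritic modules from the earlier sketch; it is an illustrative single training step under assumed tensor shapes, not the patent's exact procedure.

```python
import torch
import torch.nn.functional as F

def critic_step(critic, target_critic, target_actors, batch, gamma: float, optimizer):
    """One TD update of agent i's evaluation network:
    y = r_i + gamma * Q_i'(O_{t+1}, A'_{t+1}); minimize (y - Q_i(O_t, A_t))^2."""
    joint_state, joint_action, reward, next_obs, next_joint_state = batch
    with torch.no_grad():
        # target policies provide the next joint action A'_{t+1}
        next_actions = torch.cat([actor(o) for actor, o in zip(target_actors, next_obs)], dim=-1)
        y = reward + gamma * target_critic(next_joint_state, next_actions).squeeze(-1)
    q = critic(joint_state, joint_action).squeeze(-1)
    loss = F.mse_loss(q, y)                 # L(theta^Q) = E[(y - Q)^2]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return (y - q).detach()                 # td-error, reusable as the sampling priority
```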
When sampling training data, the method ranks the sample data stored in the experience pool by priority, which increases the probability that valuable samples are drawn and improves training efficiency. The td-error is used as the measure of sample importance: the larger the td-error, the larger the gap between the evaluation network's action-value estimate and the action-value target, and the more valuable the training sample.
The scenario of the invention is not a strict sequential decision problem; rather, it is a cooperative sequential decision problem that combines sparse returns, delayed returns and similar difficulties. The problem can be regarded as a function optimization problem whose objective is to maximize the cooperative task reward. Because the search space of the red-versus-blue game confrontation scenario is large, a reinforcement learning method relying on random exploration obtains successful samples only with difficulty, or not at all. According to the method of the invention, post-event target conversion replaces the original task goal that the agent could not achieve in the sample data with a state the agent actually reached at some moment, constructing effective positive samples for model training.
A schematic diagram of post-event target conversion is shown in FIG. 3. Although the agent fails to reach the intended movement target position, the sample data are relabeled by the post-event target conversion method, so the currently failed decision sequence is converted into a successful decision trajectory. This accumulates movement-skill experience for the agent and ultimately enables learning of a cooperative movement strategy toward the target position.
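A minimal sketch of the relabeling idea in FIG. 3: a failed trajectory toward an unreached goal is rewritten so that a state the unit actually reached becomes the goal, turning it into a positive sample. The transition layout and the goal-based reward are assumptions for illustration.

```python
def relabel_trajectory(trajectory, reach_tolerance: float = 1.0):
    """Post-event target conversion: replace the original (unreached) goal of every
    transition with the position the unit actually reached at the end of the episode."""
    achieved_goal = trajectory[-1]["achieved_pos"]      # state reached at some (final) moment
    relabeled = []
    for step in trajectory:
        new_step = dict(step)
        new_step["goal"] = achieved_goal
        # the reward is recomputed against the substituted goal, so the sample becomes positive
        dist = sum((a - g) ** 2 for a, g in zip(step["achieved_pos"], achieved_goal)) ** 0.5
        new_step["reward"] = 1.0 if dist <= reach_tolerance else 0.0
        relabeled.append(new_step)
    return relabeled
```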
Modeling and training the intelligent strategy model in the red-versus-blue game confrontation scenario is data-driven. The method rapidly obtains training samples by simulating the offense-defense confrontation process of the combat scenario, improving the learning efficiency of the strategy model, and completes the evolution of the decision-making capability of the red-side multi-agent reinforcement learning model. The game confrontation training method comprises the following specific steps:
1) Before model training starts, generate multiple opponent strategies offline using prior knowledge based on different combat schemes, and construct the blue-side strategy library;
2) randomly draw a strategy model from the strategy library and generate training data for iterative model training through red-versus-blue confrontation deduction on the simulation platform;
3) expand the blue-side strategy library online with the red-side strategy model at regular training intervals;
4) repeat steps 2) to 4) cyclically to realize the evolutionary training of the intelligent model in the game confrontation scenario.
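The training procedure above can be sketched as a loop over an opponent pool; simulate_battle, update_red_model and snapshot_policy are hypothetical helpers standing in for the simulation platform and the learner, and the expansion interval is an assumption.

```python
import random

def evolve_red_model(red_model, blue_strategy_pool, pool_add_interval: int,
                     n_iterations: int, simulate_battle, update_red_model, snapshot_policy):
    """Evolutionary training against a growing blue-side strategy library."""
    for it in range(1, n_iterations + 1):
        blue_strategy = random.choice(blue_strategy_pool)       # step 2): random opponent
        episodes = simulate_battle(red_model, blue_strategy)    # confrontation deduction
        update_red_model(red_model, episodes)                   # iterative model training
        if it % pool_add_interval == 0:                         # step 3): expand blue library
            blue_strategy_pool.append(snapshot_policy(red_model))
    return red_model
```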
On the red-versus-blue game confrontation deduction simulation platform, the method is verified on the red side's cooperative maneuver-and-strike decision capability for seizing the control point. The test flow is as follows:
1) Set up an appropriate game confrontation scenario;
2) train the multi-agent reinforcement learning model through simulated confrontation and verify the adaptability of the red-side cooperative maneuver-and-strike decision model to the typical scenario; if model training does not converge, adjust the parameters and retrain until the model converges, then proceed to the next step;
3) carry out verification tests of the method under random scenario conditions;
4) in the same typical combat scenario as step 3), use a single-agent reinforcement learning model for each red combat unit and carry out verification tests after its training converges;
5) in the same typical combat scenario as step 3), train and test the multi-agent reinforcement learning model with post-event target conversion disabled;
6) statistically compare and analyze the test results of steps 3), 4) and 5); the comparison shows that the invention effectively addresses the low decision cooperativity of traditional multi-combat-unit game confrontation decision-making and the difficulty of obtaining valuable training samples.
For army tactical task planning requirements, the multi-agent reinforcement learning model is used to decide the red side's cooperative maneuver-and-strike action sequence in the red-versus-blue game confrontation scenario. The sample generation method based on post-event target conversion enhances learning efficiency and exploration ability and enables rapid convergence of the intelligent model. Evaluation parameters that consider both the global team task reward and the specific-action reward of each combat unit are constructed and used as feedback, effectively improving the cooperative decision effect of the intelligent model. A massive game confrontation training method generates multiple blue-side combat strategies offline and online and rapidly produces training samples through massive confrontation deduction, so that the combat capability of the intelligent model evolves and upgrades. On the red-versus-blue game confrontation deduction simulation platform, the effectiveness of the method is verified on the red side's cooperative maneuver-and-strike decision capability for seizing the control point. The invention solves problems of the prior art such as the low decision cooperativity of multiple combat units in red-versus-blue game confrontation and the difficulty of obtaining valuable training samples.
Those skilled in the art will appreciate that the details of the invention not described in detail in this specification are well within the skill of those in the art.

Claims (10)

1. A multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision is characterized by comprising the following steps:
aiming at the game confrontation scene of the red and blue sides, a multi-agent reinforcement learning model is established, and intelligent cooperative decision modeling facing multiple combat units is realized;
the effective training sample number is increased by adopting a post-event target conversion method, and the optimization convergence of the multi-agent reinforcement learning model is realized;
constructing a reward function by taking the reward of the global team task as a reference and taking the reward of the specific actions of each fighting unit as feedback information;
and generating a plurality of opponent strategies according to different combat schemes, and training the multi-agent reinforcement learning model by simulating game confrontation by using a reward function.
2. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 1, wherein: the multi-agent reinforcement learning model is constructed by the following steps:
building a game confrontation scene of a red party and a blue party;
analyzing task characteristics and decision points in a game confrontation scene of the red and blue parties to determine a state space of a cooperative task decision point;
and aiming at the cooperative task decision point, establishing a multi-agent reinforcement learning model.
3. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 2, wherein: the method for determining the state space of the collaborative task decision point comprises the following steps:
and (3) taking the whole situation information of the game countermeasure scene and the local observation information of the combat unit as state input, performing default verification through fixed part state input values, eliminating useless or counteractive states, and determining a key state space of a task decision point.
4. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 1, wherein: the specific method for increasing the number of effective training samples by the post-event target conversion method comprises the following steps:
in each round of iterative training, sample data are drawn from the experience pool according to their sampling probability values; the original task goal that the agent failed to achieve in a sample is replaced by a state that the agent actually reached at some moment, and an effective positive sample is constructed for model training.
5. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 4, wherein: the sampling probability value is calculated as follows:
P(i) = p_i^α / Σ_k p_k^α

where p_i = |δ_i| + ε is the priority of the i-th sample, δ_i is the temporal-difference error of the i-th sample, ε is random noise that prevents the sampling probability from being 0, α adjusts the degree of prioritization, and P(i) is the sampling probability of the i-th sample.
6. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 1, wherein: the method for constructing the reward function comprises the following steps:
calculating the global task reward R_task from the situation information at the termination time of the task decision sequence;
calculating the action reward R_i of each combat unit from the action sequence executed by that unit, where i denotes the unit number, i = 1, 2, 3, …;
from the global task reward R_task and the action reward R_i of each combat unit, calculating the cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario.
7. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 6, wherein the global task reward R_task includes two categories:
the task completion reward, given when the red side has completed the combat task objective at the termination time;
the damage reward, given when the number of blue combat units destroyed by the red side exceeds the red side's own losses;
the task completion reward and the damage reward are double values with different numerical ranges.
8. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 6, wherein the action reward R_i of each combat unit includes three categories:
the casualty reward, a negative reward given when the red combat unit is destroyed by the blue side;
the ammunition consumption reward, a negative reward proportional to the amount of ammunition consumed by the red combat unit;
the field-of-view reward, a positive reward given when the red combat unit can detect blue-side situation information;
the red casualty reward, ammunition consumption reward and field-of-view reward are double values with different numerical ranges.
9. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 6, wherein the cooperative task decision feedback information r_i of each combat unit in the red-versus-blue game confrontation scenario is calculated as:

r_i = η · R_task + (1 − η) · R_i

where η denotes the importance of the team's global task reward: η = 0 means each combat unit considers only the gains brought by its own actions, and η = 1 means only the overall team gain is considered.
10. The multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision making according to claim 1, wherein: the specific method for training the multi-agent reinforcement learning model by simulating game confrontation by utilizing the reward function comprises the following steps:
constructing a blue-side strategy library according to different combat schemes, expanding the blue-side strategy library with the red-side online decision model at set training intervals, and completing the evolutionary training of the red-side multi-agent reinforcement learning model with the reward function through simulated game confrontation.
CN202111530475.8A 2021-12-14 2021-12-14 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision Pending CN114358141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111530475.8A CN114358141A (en) 2021-12-14 2021-12-14 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111530475.8A CN114358141A (en) 2021-12-14 2021-12-14 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision

Publications (1)

Publication Number Publication Date
CN114358141A true CN114358141A (en) 2022-04-15

Family

ID=81099216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111530475.8A Pending CN114358141A (en) 2021-12-14 2021-12-14 Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision

Country Status (1)

Country Link
CN (1) CN114358141A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925601A (en) * 2022-05-06 2022-08-19 南京航空航天大学 Combat simulation deduction method based on deep reinforcement learning and image vision
CN114626836B (en) * 2022-05-17 2022-08-05 浙江大学 Multi-agent reinforcement learning-based emergency post-delivery decision-making system and method
CN114626836A (en) * 2022-05-17 2022-06-14 浙江大学 Multi-agent reinforcement learning-based emergency delivery decision-making system and method
CN114897267B (en) * 2022-06-14 2024-02-27 哈尔滨工业大学(深圳) Fire distribution method and system for multi-to-multi-agent cooperative combat scene
CN114897267A (en) * 2022-06-14 2022-08-12 哈尔滨工业大学(深圳) Fire power distribution method and system for many-to-many intelligent agent cooperative battlefield scene
CN114880955A (en) * 2022-07-05 2022-08-09 中国人民解放军国防科技大学 War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning
CN114880955B (en) * 2022-07-05 2022-09-20 中国人民解放军国防科技大学 War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning
CN115544898A (en) * 2022-11-09 2022-12-30 哈尔滨工业大学 Multi-agent attack and defense decision method based on deep reinforcement learning
CN115544898B (en) * 2022-11-09 2023-08-29 哈尔滨工业大学 Multi-agent attack and defense decision-making method based on deep reinforcement learning
CN116128013B (en) * 2023-04-07 2023-07-04 中国人民解放军国防科技大学 Temporary collaboration method and device based on diversity population training and computer equipment
CN116128013A (en) * 2023-04-07 2023-05-16 中国人民解放军国防科技大学 Temporary collaboration method and device based on diversity population training and computer equipment
CN116679742A (en) * 2023-04-11 2023-09-01 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116679742B (en) * 2023-04-11 2024-04-02 中国人民解放军海军航空大学 Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN116821693A (en) * 2023-08-29 2023-09-29 腾讯科技(深圳)有限公司 Model training method and device for virtual scene, electronic equipment and storage medium
CN116821693B (en) * 2023-08-29 2023-11-03 腾讯科技(深圳)有限公司 Model training method and device for virtual scene, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114358141A (en) Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision
CN112861442B (en) Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN111275174B (en) Game-oriented radar countermeasure generating method
CN114757351B (en) Defense method for resisting attack by deep reinforcement learning model
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN114880955B (en) War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning
Uriarte et al. Automatic learning of combat models for RTS games
Toghiani-Rizi et al. Evaluating deep reinforcement learning for computer generated forces in ground combat simulation
CN109299491B (en) Meta-model modeling method based on dynamic influence graph strategy and using method
CN115293022A (en) Aviation soldier intelligent agent confrontation behavior modeling method based on OptiGAN and spatiotemporal attention
CN116360503B (en) Unmanned plane game countermeasure strategy generation method and system and electronic equipment
Hou et al. Advances in memetic automaton: Toward human-like autonomous agents in complex multi-agent learning problems
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
CN114662655A (en) Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN116596343A (en) Intelligent soldier chess deduction decision method based on deep reinforcement learning
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
CN115185294A (en) QMIX-based aviation soldier multi-formation collaborative autonomous behavior decision-making modeling method
CN113807230A (en) Equipment target identification method based on active reinforcement learning and man-machine intelligent body
Liu A novel approach based on evolutionary game theoretic model for multi-player pursuit evasion
Bian et al. Cooperative strike target assignment algorithm based on deep reinforcement learning
CN110930054A (en) Data-driven battle system key parameter rapid optimization method
Li et al. Dynamic weapon target assignment based on deep q network
Qin et al. Confrontation Simulation Platform Based on Spatio-temporal Agents
Mo et al. Research on virtual human swarm football collaboration technology based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination