CN115047907A - Air isomorphic formation command method based on multi-agent PPO algorithm - Google Patents

Air isomorphic formation command method based on multi-agent PPO algorithm

Info

Publication number: CN115047907A (application CN202210656190.7A)
Authority: CN (China)
Prior art keywords: network, action, environment, information, output
Legal status: Granted; active
Priority/filing date: 2022-06-10
Publication date: 2022-09-13 (CN115047907A); granted 2024-05-07 as CN115047907B
Other languages: Chinese (zh)
Other versions: CN115047907B
Inventors: 汪亚斌, 李友江, 崔鹏, 郭成昊, 丁峰, 易侃
Assignee (original and current): CETC 28 Research Institute
Application filed by CETC 28 Research Institute

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104: Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Separation Of Gases By Adsorption (AREA)
  • Flow Control (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps: constructing an action network that takes the local environment state as input and an evaluation network that takes the global environment state as input; initializing the local environment state, the global environment state and the data caches required for training; interacting with the environment through the action network according to the local environment state; calculating an advantage function according to the global environment state; calculating the loss of the action network from the advantage function and the loss of the evaluation network from the evaluation network output, and back-propagating the two loss values to update the networks; and interacting with the environment using the updated networks. The method combines macroscopic and microscopic information to improve the action command capability of the formation, introduces the multi-agent PPO algorithm into the construction of formation-commanding agents for the first time, improves the stability of training the formation command agent, and thereby improves the training effect.

Description

Air isomorphic formation command method based on multi-agent PPO algorithm
Technical Field
The invention relates to an air isomorphic formation command method, in particular to an air isomorphic formation command method based on a multi-agent PPO algorithm.
Background
At present, reinforcement learning is increasingly widely applied to simulated formation action training. To achieve the training purpose, a neural network oriented to multiple agents generally needs to be constructed as the learning network for deep reinforcement learning. An important link in reinforcement learning is the input structure of the action network and the evaluation network: the input is the basis of neural network learning, and an input representation suitable for learning enables the neural network to learn quickly and efficiently.
The fundamental difference between many multi-agent algorithms lies in the input to the network, and the representation of the input values is an important aspect of the overall algorithm. The action network and evaluation network of some multi-agent algorithms use global inputs, which has the disadvantage that the algorithm ignores some important local information.
Another class of multi-agent algorithms uses local information for both networks; the disadvantage of training only on local information is that the global situation cannot be taken into account.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, the invention solves the technical problem of providing an air isomorphic formation command method based on a multi-agent PPO algorithm.
In order to solve the technical problem, the invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps:
step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state and other data for training, wherein the other data for training comprises intermediate hidden layer information of the action network and the evaluation network;
step 3, collecting environmental state data from the simulation confrontation environment, wherein the environmental state data consists of local environmental state data and global environmental state data; inputting the environment state data into the action network, outputting the action by the action network and issuing the action to the simulation confrontation environment; the simulation confrontation environment changes the environment state data after receiving the action and returns the changed environment state data to the action network; the action network outputs a formation control instruction, continuously interacts with a simulation confrontation environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation-commanding aircraft agent according to the sampled data in the sampling data set of step 3;
step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value from the sampled data in the sampling data set obtained after the sampling in step 3, and back-propagating according to the two loss values to update the action network and the evaluation network: the derivative ∇_θ loss of the loss value is calculated, realizing back-propagation updating of the parameters of the action network and the evaluation network;
step 6, using the updated action network to output actions and interact with the simulation confrontation environment, and continuing the sampling of step 3;
and repeating steps 3 to 6 until the actions output by the action network meet the set requirement.
In step 1, the local environment state around each aircraft agent in the air isomorphic formation is input to the action network; the global environment state is input to the evaluation network, which integrates the global environment states of all the aircraft agents and evaluates the influence of each aircraft agent's action on the overall objective of the whole formation, wherein the overall objective comprises destroying all enemy forces or ensuring that the overall damage to the formation is minimized.
In step 1, the local environment state is input to the action network by the following method:
inputting the friend-and-foe command information of the air isomorphic formation into the action network and organizing it into an n × 128 × 128 matrix, wherein each of the n channels represents one type of characteristic information of the command state, and the characteristic information comprises:
a position feature matrix: the battlefield is abstracted into a 128 × 128 space, where each point is 1 if an enemy aircraft agent is present, 2 if a friendly aircraft agent is present, and 0 if no aircraft agent is present;
a heading matrix: in the 128 × 128 matrix, each point, if an aircraft agent is present there, takes the heading of that aircraft agent, the heading being divided into 360 degrees;
a damaged state matrix: in the 128 × 128 matrix, each point, if an aircraft agent is present there, takes the damage state of that aircraft agent.
In step 1, the action network is constructed as follows:
the first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer performs a feature-learning representation of the battlefield friend-and-foe state information; the output of the convolution layer is flattened and input into a fully connected layer; it then passes through a recurrent neural network, whose input is the current battlefield environment together with the hidden layer output of the previous step and whose output is the hidden layer output of the current step; the hidden layer output is input to the air command target assignment action network, whose direct output is converted by categorical discretization into a 128 × 128-dimensional probability distribution, each dimension representing the probability that the aircraft agent flies to the corresponding coordinate to attack: if there is an enemy aircraft agent at the target coordinate, the aircraft agent flies to the target coordinate and attacks the enemy aircraft agent; if there is none, the aircraft agent simply flies to the target coordinate; the probability distribution is passed through a mask calculation to form the final action output; the mask is a 128 × 128-dimensional vector in which each element indicates whether there is an enemy aircraft agent at the corresponding coordinate.
In step 1, the evaluation network is used for evaluating the battlefield environment, the input of the evaluation network is the same as the input of the action network, and the output is a 1-dimensional vector.
In step 2, initializing the data for training the action network and the evaluation network and constructing and initializing a playback buffer comprise:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where s denotes state information and the subscript o indicates that the state information is local environment information; the action network hidden layer information hs_act, where hs denotes network hidden layer information and the subscript act indicates that the network is the action network; the evaluation network hidden layer information hs_critic, where the subscript critic indicates that the network is the evaluation network; the action information a output by the aircraft agent and the log value logp_a of the probability p_a of the action output by the aircraft agent, where p_a denotes the probability that the aircraft agent outputs action a; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input into the evaluation network.
In step 2, the data includes:
global environment state information s_share: the input used by the evaluation network during training is the global battlefield environment, and the dimensionality of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the time-step length of one round of combat (length denotes the time step and episode the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes a count and thread the thread running the corresponding simulation environment); num_agents is the number of aircraft agents on our side (agents refers to aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes state information);
local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where s denotes state information and the subscript o indicates that the state information is local environment information; its battlefield environment data are the same as those of the global environment state information s_share;
hs_act: the intermediate output of the recurrent neural hidden layer of the action network, with data dimensionality [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer (dim denotes the dimension value and hsact refers to the action network hidden layer);
hs_critic: the intermediate output of the recurrent neural hidden layer of the evaluation network, with data dimensionality [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer (hscritic refers to the evaluation network hidden layer).
The data set obtained by sampling in step 3 comprises s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ, where r is the action execution feedback obtained from the environment (for example, the number of destroyed enemy units), log π_θ is the log value taken of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
The method for calculating the advantage function described in step 4 is:

\hat{A}_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s^{share}_t)

wherein \hat{A}_t represents the estimate of the advantage function at time t; s^{share}_t is the global environment state information at time t; V(s^{share}_t) is the output value after the global environment state information at time t is input into the evaluation network; γ is the cumulative discount factor; l represents the number of action steps after time t; r_{t+l} represents the reward value fed back by the environment after step t+l; and V represents the evaluation network.
The method for calculating the action network loss in step 5 is:

loss_{actor} = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

r_t(\theta) = \frac{\pi_\theta(a_t \mid s^{o}_t)}{\pi_{\theta_{old}}(a_t \mid s^{o}_t)}

where t is the time t; clip(r_t(θ), 1-ε, 1+ε) is a truncation operation: if the value of r_t(θ) exceeds the range (1-ε, 1+ε) it is truncated, that is, if r_t(θ) is greater than 1+ε it is set to 1+ε, if it is less than 1-ε it is set to 1-ε, and if it lies within (1-ε, 1+ε) its value is kept; ε is a set value; \hat{A}_t represents the advantage of the action a_t selected by the current aircraft agent over selecting other actions; s^{o}_t represents the local environment state information of the aircraft agent at time t; min(·, ·) represents taking the smaller of the two compared values; \hat{\mathbb{E}}_t represents the average taken over multiple rounds of calculation; π_θ(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the latest parameters θ; π_{θ_old}(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the parameters θ_old of the previous iteration; and r_t(θ) is the ratio of the probability with which the current action network selects a_t to the probability with which the action network of the previous iteration selects a_t.
The calculation in step 5 of the derivative ∇_θ loss of the loss value, realizing back-propagation updating of the parameters of the action network and the evaluation network, is:

\nabla_\theta loss = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t)\,\hat{A}_t\right]

wherein \nabla_\theta loss is the derivative of the action network loss value; \nabla_\theta represents the derivative operation; s^{o} is the local environment state information; a_t is the action information; \log \pi_\theta(a_t \mid s^{o}_t) represents taking the log of \pi_\theta(a_t \mid s^{o}_t); \nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t) represents taking the derivative after taking the log; and \hat{\mathbb{E}}_t represents taking the average over multiple rounds of calculation for the estimate at time t.
Beneficial effects:
Aiming at application scenarios in which training the formation command simulation agent takes a long time and is not very stable, the invention constructs the formation command agent from the perspective of macroscopic command and combines macroscopic and microscopic information: the evaluation network takes the global environment state data as input, while the action network takes the local environment data around a single aircraft agent as input. This improves the macroscopic and microscopic control capability over the formation, introduces the multi-agent PPO (proximal policy optimization) algorithm into the construction of formation command agents for the first time, and improves the training stability of the formation command agent.
The invention constructs an air isomorphic formation command method based on a multi-agent PPO algorithm. The evaluation network uses global information, so the algorithm has the capability of evaluating the global situation; the input of the action network is local information, so each agent can focus on learning measures from local information; and the evaluation network, taking the global information as input and evaluating the global environment state, guides the agent to select actions that are favorable to the global environment state.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic view of the overall process of the present invention.
FIG. 2 is an action network architecture diagram.
Detailed Description
The following description takes as an example a simulation scenario in which 6 fighters on each of the red and blue sides fight against each other. The simulated battlefield space is discretized into a 128 × 128 grid space in which the fighters of both the red and blue sides maneuver and attack each other. The air isomorphic formation command method based on the multi-agent PPO (proximal policy optimization) algorithm controls the 6 red-side fighters in their engagement with the blue side and decides which grid cell each fighter flies to and which blue-side aircraft agent it attacks.
As shown in fig. 1, an air isomorphic formation commanding method based on multi-agent PPO algorithm includes: step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state and other data for training, wherein the other data for training comprises intermediate hidden layer information of the action network and the evaluation network;
step 3, collecting environmental state data from the simulation confrontation environment, wherein the environmental state data consists of local environmental state data and global environmental state data; inputting the environment state data into the action network, outputting the action by the action network and issuing the action to the simulation confrontation environment; the simulation confrontation environment changes the environment state data after receiving the action and returns the changed environment state data to the action network; the action network outputs a formation control instruction, continuously interacts with a simulation confrontation environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation-commanding aircraft agent according to the sampled data in the sampling data set of step 3;
step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value from the sampled data in the sampling data set obtained after the sampling in step 3, and back-propagating according to the two loss values to update the action network and the evaluation network: the derivative ∇_θ loss of the loss value is calculated, realizing back-propagation updating of the parameters of the action network and the evaluation network;
step 6, using the updated action network to output actions and interact with the simulation confrontation environment, and continuing the sampling of step 3;
and repeating steps 3 to 6 until the actions output by the action network meet the set requirement.
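For readability, the overall step 3 to step 6 loop can be sketched as follows. This is a minimal Python sketch under assumptions, not the patented implementation: the environment wrapper env, the helpers collect_rollout, compute_advantage and ppo_update, and the iteration count are illustrative names introduced here only.

```python
# Minimal sketch of the step 3-6 training loop described above; not the patented
# implementation. `env`, `collect_rollout`, `compute_advantage` and `ppo_update`
# are assumed helpers named only for illustration.

def train(actor, critic, optimizer, env, num_iterations=1000, gamma=0.99, clip_eps=0.2):
    # Step 2: initial local/global environment states and recurrent hidden states.
    state, hidden = env.reset(), None

    for _ in range(num_iterations):
        # Step 3: the actor interacts with the simulation confrontation environment
        # and a sampling data set (s_share, s_o, hs, a, logp_a, V, r) is collected.
        batch, state, hidden = collect_rollout(actor, critic, env, state, hidden)

        # Step 4: advantage estimate from rewards and the global-state value V(s_share).
        batch["advantage"] = compute_advantage(batch["r"], batch["v_share"], gamma)

        # Step 5: clipped actor loss + critic loss, back-propagated to update both networks.
        ppo_update(actor, critic, batch, optimizer, clip_eps)

        # Step 6: the updated actor is used in the next iteration (loop back to step 3).
```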
In step 1, the concrete method for constructing the action network and the evaluation network is as follows:
the action network takes as input the information of the red and blue sides in the battlefield environment, organized into a 3 × 128 × 128 matrix, where each of the 3 channels represents one class of battlefield environment characteristic information, and the features comprise the following:
a position feature matrix: the battlefield is abstracted into a 128 x 128 space, with 1 for each grid if there is a blue airplane, 2 for a red airplane agent, and 0 for no airplane agent.
A heading matrix: in the 128 × 128 lattice space, if there is an aircraft agent in a lattice, the corresponding matrix element value is the heading of that aircraft agent, and the heading is divided into 360 degrees.
A damaged state matrix: in the 128 × 128 lattice space, if there is an airplane agent in one lattice, the corresponding matrix element value is the damage status of the airplane agent, and the status is classified as good (indicated by 1), damaged (indicated by 0.5), and destroyed (indicated by 0.1).
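For illustration, the three 128 × 128 feature planes described above can be assembled into the 3 × 128 × 128 input tensor roughly as in the following Python sketch; the per-aircraft record fields (x, y, side, heading, damage) are assumed names, not taken from the patent.

```python
import numpy as np

def encode_observation(aircraft_list, grid=128):
    """Build the 3 x 128 x 128 input tensor: position, heading and damage planes.

    `aircraft_list` is assumed to hold dicts with keys 'x', 'y', 'side'
    ('red'/'blue'), 'heading' (0-359 degrees) and 'damage' (1.0 good,
    0.5 damaged, 0.1 destroyed); these field names are illustrative only.
    """
    obs = np.zeros((3, grid, grid), dtype=np.float32)
    for ac in aircraft_list:
        x, y = int(ac["x"]), int(ac["y"])
        obs[0, x, y] = 1.0 if ac["side"] == "blue" else 2.0   # position plane (1 blue, 2 red)
        obs[1, x, y] = ac["heading"]                          # heading plane (0-359)
        obs[2, x, y] = ac["damage"]                           # damage plane (1 / 0.5 / 0.1)
    return obs
```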
Based on the above input representation, as shown in FIG. 2, the action network architecture is constructed as follows:
the first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer performs a feature-learning representation of the battlefield friend-and-foe state information; the output of the convolution layer undergoes a flattening operation and is then input into a fully connected layer; it then passes through a recurrent neural (RNN) network, whose input is the current battlefield environment together with the hidden layer output of the previous step and whose output is the hidden layer output of the current step; the hidden layer output is input to the air command target assignment action (actor) network, whose direct output is converted by categorical discretization into a 128 × 128-dimensional probability distribution, each dimension representing the probability that the aircraft agent flies to the corresponding coordinate to attack: if there is an enemy aircraft agent at the target coordinate, the aircraft agent flies to the target coordinate and attacks the enemy aircraft agent; if there is none, the aircraft agent simply flies to the target coordinate. The probability distribution is passed through a mask calculation to form the final action output. The mask is a 128 × 128 vector in which each element indicates whether there is an enemy aircraft agent at the corresponding coordinate, and the masked distribution is taken as the final output.
The evaluation network constructed is used to evaluate the battlefield environment. Its first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer performs a feature-learning representation of the battlefield friend-and-foe state information; the output of the convolution layer undergoes a flattening operation and is then input into a fully connected layer; it then passes through a recurrent neural (RNN) network, whose input is the current battlefield environment together with the hidden layer output of the previous step and whose output is the hidden layer output of the current step; the hidden layer output is input to the critic evaluation head, and the evaluation network output is a one-dimensional vector.
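A minimal PyTorch sketch of the convolution, flattening, fully connected and recurrent layers and of the masked 128 × 128 action head and one-dimensional value head described above is given below. The kernel size, stride, choice of a GRU cell and the way the mask is combined with the logits are assumptions; the text fixes only the channel counts (3 in, 8 out), the hidden-layer width of 256 and the output shapes.

```python
import torch
import torch.nn as nn


class ConvRNNTower(nn.Module):
    """Backbone sketch: 2-D conv (3 -> 8 channels) -> flatten -> fully connected -> GRU cell."""

    def __init__(self, grid=128, hidden=256):
        super().__init__()
        # Channel counts follow the text; kernel size and stride are assumptions.
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(8 * (grid // 2) ** 2, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)   # hidden layer output hs (256-dimensional)

    def forward(self, obs, hs):
        # obs: [batch, 3, 128, 128]; hs: hidden state of the previous step (or None).
        x = torch.relu(self.conv(obs))
        x = torch.relu(self.fc(x.flatten(start_dim=1)))
        return self.rnn(x, hs)


class Actor(nn.Module):
    """Action network: tower -> 128*128 logits, masked before the categorical distribution."""

    def __init__(self, grid=128, hidden=256):
        super().__init__()
        self.tower = ConvRNNTower(grid, hidden)
        self.head = nn.Linear(hidden, grid * grid)

    def forward(self, local_obs, hs, mask=None):
        hs = self.tower(local_obs, hs)
        logits = self.head(hs)
        if mask is not None:
            # One possible reading of the "mask calculation"; the text only says the
            # distribution is combined with an enemy-presence mask over the 128x128 grid.
            logits = logits.masked_fill(mask == 0, float("-inf"))
        return torch.distributions.Categorical(logits=logits), hs


class Critic(nn.Module):
    """Evaluation network: same backbone over the global state, 1-dimensional value output."""

    def __init__(self, grid=128, hidden=256):
        super().__init__()
        self.tower = ConvRNNTower(grid, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, global_obs, hs):
        hs = self.tower(global_obs, hs)
        return self.head(hs), hs
```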
In step 2, a playback buffer is constructed and initialized; the contents of the buffer comprise s_share, s_o, hs_act, hs_critic, a, logp_a and V(s_share):
s_share: the input used by the critic network during training is the global battlefield environment, and the dimensionality of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode, the time-step length of one round of combat, is set to 1024 steps; num_thread, the number of simulation environments running in parallel, is set to 6; num_agents, the number of aircraft agents on one side, is set to 5; and dim_s, the dimension of the battlefield environment data in each time slice, is set to 128 × 128 × 3.
s_o: the battlefield environment input of each individual aircraft agent in the formation; its battlefield environment data are the same as s_share.
hs_act: the intermediate output of the recurrent neural hidden layer of the actor network, with data dimensionality [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact, the output dimension of the hidden layer, is set to 256.
hs_critic: the intermediate output of the recurrent neural hidden layer of the critic network, with data dimensionality [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic, the output dimension of the hidden layer, is set to 256.
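With the shapes just listed, the playback buffer can be preallocated, for example, as in the following sketch; the dictionary keys mirror the symbols used in the text, and NumPy and the channel-first observation layout are implementation choices rather than something specified by the patent.

```python
import numpy as np

# Illustrative buffer preallocation using the shapes given above:
# 1024 steps, 6 parallel environments, 5 agents, 3x128x128 observations, hidden size 256.
LEN_EPISODE, NUM_THREAD, NUM_AGENTS = 1024, 6, 5
DIM_S, DIM_HS = (3, 128, 128), 256

buffer = {
    "s_share":   np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS, *DIM_S), np.float32),
    "s_o":       np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS, *DIM_S), np.float32),
    "hs_act":    np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS, DIM_HS), np.float32),
    "hs_critic": np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS, DIM_HS), np.float32),
    "a":         np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS), np.int64),    # chosen grid index
    "logp_a":    np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS), np.float32),  # log prob of a
    "v_share":   np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS), np.float32),  # V(s_share)
    "r":         np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS), np.float32),  # environment reward
}
```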
In step 3, a sample data set for training is acquired by continuously interacting with the environment; the data set comprises s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ.
In step 4, the advantage function is calculated as

\hat{A}_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s^{share}_t)

The dimensionality of the transformed sampled data is [number of parallel environments, number of agents, number of steps of one command, dimensionality observed at each step].
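A possible implementation of this advantage estimate accumulates the discounted return backwards over the sampled rollout and subtracts the evaluation network output, as sketched below; the finite-horizon accumulation and the discount value of 0.99 are assumptions.

```python
import numpy as np

def compute_advantage(rewards, values, gamma=0.99):
    """Advantage estimate A_t = sum_l gamma^l * r_{t+l} - V(s_share_t).

    `rewards` and `values` have the time axis first (shape [T, ...]); the discounted
    return is accumulated backwards over the finite sampled rollout."""
    returns = np.zeros_like(rewards)
    running = np.zeros_like(rewards[0])
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # discounted return from step t onward
        returns[t] = running
    return returns - values                      # subtract the critic baseline V(s_share)
```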
In step 5, the actor loss (loss_actor) is calculated as

loss_{actor} = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

r_t(\theta) = \frac{\pi_\theta(a_t \mid s^{o}_t)}{\pi_{\theta_{old}}(a_t \mid s^{o}_t)}

wherein clip(r_t(θ), 1-ε, 1+ε) is a truncation operation: if r_t(θ) exceeds the range (1-ε, 1+ε) it is truncated, that is, if the value of r_t(θ) is greater than 1+ε it is set to 1+ε, if it is less than 1-ε it is set to 1-ε, and if it lies within (1-ε, 1+ε) it is kept; the value of ε is determined empirically. \hat{A}_t is the advantage function, representing the advantage of the action a_t selected by the current aircraft agent over selecting other actions. s^{o}_t represents the local environment state information of the aircraft agent at time t. min(·, ·) represents taking the smaller of the two compared values. \hat{\mathbb{E}}_t represents taking the average of multiple rounds of calculation. π_θ(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the latest parameters θ. π_{θ_old}(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the parameters θ_old of the previous iteration. r_t(θ) is the ratio of the probability with which the current action network selects a_t to the probability with which the action network of the previous iteration selects a_t.
The derivative ∇_θ loss of the loss value in step 5, which realizes back-propagation updating of the parameters of the action network and the evaluation network, is calculated as

\nabla_\theta loss = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t)\,\hat{A}_t\right]

wherein s^{o} is the local environment state information; a_t is the action information; \log \pi_\theta(a_t \mid s^{o}_t) represents taking the log of \pi_\theta(a_t \mid s^{o}_t); and \nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t) represents taking the derivative after taking the log.
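A PyTorch sketch of the step 5 update, combining the clipped actor loss above with an evaluation network loss and back-propagating both, is given below. The mean-squared-error form of the value loss, the shared optimizer, and the batch fields mask and returns (assumed to be added at sampling time, with all fields already converted to tensors) are assumptions introduced for illustration.

```python
import torch

def ppo_update(actor, critic, batch, optimizer, clip_eps=0.2):
    """One PPO update sketch: clipped actor loss plus an assumed MSE critic loss."""
    dist, _ = actor(batch["s_o"], batch["hs_act"], batch["mask"])
    new_logp = dist.log_prob(batch["a"])                 # log pi_theta(a_t | s_o_t)
    ratio = torch.exp(new_logp - batch["logp_a"])        # r_t(theta)

    adv = batch["advantage"]
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Maximize the clipped surrogate by minimizing its negative.
    loss_actor = -torch.min(ratio * adv, clipped * adv).mean()

    value, _ = critic(batch["s_share"], batch["hs_critic"])
    loss_value = (batch["returns"] - value.squeeze(-1)).pow(2).mean()   # assumed MSE form

    optimizer.zero_grad()
    (loss_actor + loss_value).backward()                 # back-propagate both loss values
    optimizer.step()
```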
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, can carry out the inventive content of the air isomorphic formation command method based on the multi-agent PPO algorithm provided by the invention and some or all of the steps of each embodiment. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It is clear to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and its corresponding general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially embodied in the form of a computer program or software product, which may be stored in a storage medium and includes instructions for causing a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer, an MCU, or a network device) to execute the methods described in the embodiments or some parts of the embodiments of the present invention.
The invention provides the idea and method of an air isomorphic formation command method based on a multi-agent PPO algorithm; there are many concrete methods and ways to implement this technical solution. All components not specified in the present embodiment can be implemented by the prior art.

Claims (10)

1. An air isomorphic formation command method based on a multi-agent PPO algorithm is characterized by comprising the following steps:
step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state and other data for training, wherein the other data for training comprises intermediate hidden layer information of the action network and the evaluation network;
step 3, collecting environmental state data from the simulation confrontation environment, wherein the environmental state data consists of local environmental state data and global environmental state data; inputting the environment state data into the action network, outputting the action by the action network and issuing the action to the simulation confrontation environment; the simulation confrontation environment changes the environment state data after receiving the action and returns the changed environment state data to the action network; the action network outputs a formation control instruction, continuously interacts with a simulation confrontation environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation-commanding aircraft agent according to the sampled data in the sampling data set of step 3;
step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value for the sampled data in the sampling data set obtained after the sampling in step 3, and back-propagating according to the two loss values to update the action network and the evaluation network: the derivative ∇_θ loss of the loss value is calculated, realizing back-propagation updating of the parameters of the action network and the evaluation network;
step 6, using the updated action network to output actions and interact with the simulation confrontation environment, and continuing the sampling of step 3;
and repeating steps 3 to 6 until the actions output by the action network meet the set requirement.
2. The method for commanding an air isomorphic formation based on a multi-agent PPO algorithm as claimed in claim 1, wherein step 1 comprises: inputting to the action network the local environment state around each aircraft agent in the air isomorphic formation; and inputting the global environment state into the evaluation network, which integrates the global environment states of all the aircraft agents and evaluates the influence of each aircraft agent's action on the overall objective of the whole formation, wherein the overall objective comprises destroying all enemy forces or ensuring that the overall damage to the formation is minimized.
3. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 2, wherein in step 1, the local environment status is input into the action network, and the method comprises:
inputting the friend-and-foe command information of the air isomorphic formation into the action network and organizing it into an n × 128 × 128 matrix, wherein each of the n channels represents one type of characteristic information of the command state, and the characteristic information comprises:
a position feature matrix: the battlefield is abstracted into a 128 × 128 space, where each point is 1 if an enemy aircraft agent is present, 2 if a friendly aircraft agent is present, and 0 if no aircraft agent is present;
a heading matrix: in the 128 × 128 matrix, each point, if an aircraft agent is present there, takes the heading of that aircraft agent, the heading being divided into 360 degrees;
a damaged state matrix: in the 128 × 128 matrix, each point, if an aircraft agent is present there, takes the damage state of that aircraft agent.
4. The method for commanding an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 3, wherein in step 1, the action network is constructed as follows:
the first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer performs a feature-learning representation of the battlefield friend-and-foe state information; the output of the convolution layer is flattened and input into a fully connected layer; it then passes through a recurrent neural network, whose input is the current battlefield environment together with the hidden layer output of the previous step and whose output is the hidden layer output of the current step; the hidden layer output is input to the air command target assignment action network, whose direct output is converted by categorical discretization into a 128 × 128-dimensional probability distribution, each dimension representing the probability that the aircraft agent flies to the corresponding coordinate to attack: if there is an enemy aircraft agent at the target coordinate, the aircraft agent flies to the target coordinate and attacks the enemy aircraft agent; if there is none, the aircraft agent simply flies to the target coordinate; the probability distribution is passed through a mask calculation to form the final action output; the mask is a 128 × 128-dimensional vector in which each element indicates whether there is an enemy aircraft agent at the corresponding coordinate.
5. The multi-agent PPO algorithm-based air isomorphic formation command method as claimed in claim 4, wherein in step 1, the evaluation network is used to evaluate battlefield environment, the input of the evaluation network is identical to the input of the action network, and the output is 1-dimensional vector.
6. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 5, wherein in step 2, initializing the data for training the action network and the evaluation network and constructing and initializing a playback buffer comprise:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where s denotes state information and the subscript o indicates that the state information is local environment information; the action network hidden layer information hs_act, where hs denotes network hidden layer information and the subscript act indicates that the network is the action network; the evaluation network hidden layer information hs_critic, where the subscript critic indicates that the network is the evaluation network; the action information a output by the aircraft agent and the log value logp_a of the probability p_a of the action output by the aircraft agent, where p_a denotes the probability that the aircraft agent outputs action a; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input into the evaluation network.
7. The method for commanding an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 6, wherein in step 2, the data comprises:
global environment state information s_share: the input used by the evaluation network during training is the global battlefield environment, and the dimensionality of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the time-step length of one round of combat (length denotes the time step and episode the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes a count and thread the thread running the corresponding simulation environment); num_agents is the number of aircraft agents on our side (agents refers to aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes state information);
local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where s denotes state information and the subscript o indicates that the state information is local environment information; its battlefield environment data are the same as those of the global environment state information s_share;
hs_act: the intermediate output of the recurrent neural hidden layer of the action network, with data dimensionality [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer (dim denotes the dimension value and hsact refers to the action network hidden layer);
hs_critic: the intermediate output of the recurrent neural hidden layer of the evaluation network, with data dimensionality [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer (hscritic refers to the evaluation network hidden layer).
8. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 7, wherein the data set obtained by sampling in step 3 comprises s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ, where r is the action execution feedback obtained from the environment, log π_θ is the log value taken of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
9. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 8, wherein the method for calculating the advantage function in step 4 comprises:

\hat{A}_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s^{share}_t)

wherein \hat{A}_t represents the estimate of the advantage function at time t; s^{share}_t is the global environment state information at time t; V(s^{share}_t) is the output value after the global environment state information at time t is input into the evaluation network; γ is the cumulative discount factor; l represents the number of action steps after time t; r_{t+l} represents the reward value fed back by the environment after step t+l; and V represents the evaluation network.
10. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 9, wherein the method for calculating the loss of the action network in step 5 comprises:

loss_{actor} = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

r_t(\theta) = \frac{\pi_\theta(a_t \mid s^{o}_t)}{\pi_{\theta_{old}}(a_t \mid s^{o}_t)}

where t is the time t; clip(r_t(θ), 1-ε, 1+ε) is a truncation operation: if the value of r_t(θ) exceeds the range (1-ε, 1+ε) it is truncated, that is, if r_t(θ) is greater than 1+ε it is set to 1+ε, if it is less than 1-ε it is set to 1-ε, and if it lies within (1-ε, 1+ε) its value is kept; ε is a set value; \hat{A}_t represents the advantage of the action a_t selected by the current aircraft agent over selecting other actions; s^{o}_t represents the local environment state information of the aircraft agent at time t; min(·, ·) represents taking the smaller of the two compared values; \hat{\mathbb{E}}_t represents the average taken over multiple rounds of calculation; π_θ(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the latest parameters θ; π_{θ_old}(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the parameters θ_old of the previous iteration; and r_t(θ) is the ratio of the probability with which the current action network selects a_t to the probability with which the action network of the previous iteration selects a_t;
and wherein the calculation in step 5 of the derivative ∇_θ loss of the loss value, realizing back-propagation updating of the parameters of the action network and the evaluation network, comprises:

\nabla_\theta loss = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t)\,\hat{A}_t\right]

wherein \nabla_\theta loss is the derivative of the action network loss value; \nabla_\theta represents the derivative operation; s^{o} is the local environment state information; a_t is the action information; \log \pi_\theta(a_t \mid s^{o}_t) represents taking the log of \pi_\theta(a_t \mid s^{o}_t); \nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t) represents taking the derivative after taking the log; and \hat{\mathbb{E}}_t represents taking the average over multiple rounds of calculation for the estimate at time t.
CN202210656190.7A 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm Active CN115047907B (en)

Priority Applications (1)

Application number: CN202210656190.7A (granted as CN115047907B); priority date: 2022-06-10; filing date: 2022-06-10; title: Air isomorphic formation command method based on multi-agent PPO algorithm

Applications Claiming Priority (1)

Application number: CN202210656190.7A (granted as CN115047907B); priority date: 2022-06-10; filing date: 2022-06-10; title: Air isomorphic formation command method based on multi-agent PPO algorithm

Publications (2)

Publication Number Publication Date
CN115047907A true CN115047907A (en) 2022-09-13
CN115047907B CN115047907B (en) 2024-05-07

Family

ID=83161154

Family Applications (1)

Application number: CN202210656190.7A (granted as CN115047907B); title: Air isomorphic formation command method based on multi-agent PPO algorithm; priority date: 2022-06-10; filing date: 2022-06-10

Country Status (1)

Country Link
CN (1) CN115047907B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction
CN117151224A (en) * 2023-07-27 2023-12-01 中国科学院自动化研究所 Strategy evolution training method, device, equipment and medium for strong random game of soldiers

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113625757A (en) * 2021-08-12 2021-11-09 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113625757A (en) * 2021-08-12 2021-11-09 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
夏庆军; 张安; 张耀中: "Multi-agent-based formation cooperative test-flight simulation and effectiveness evaluation", Fire Control & Command Control, no. 05, 15 May 2011 (2011-05-15) *
轩书哲, 柯良军: "Research on UAV swarm attack-defense confrontation strategy based on multi-agent reinforcement learning", Radio Engineering, vol. 51, no. 05, 5 May 2021 (2021-05-05) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction
CN117151224A (en) * 2023-07-27 2023-12-01 中国科学院自动化研究所 Strategy evolution training method, device, equipment and medium for strong random game of soldiers

Also Published As

Publication number Publication date
CN115047907B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
CN115047907A (en) Air isomorphic formation command method based on multi-agent PPO algorithm
US11779837B2 (en) Method, apparatus, and device for scheduling virtual objects in virtual environment
CN110147883B (en) Training method, device, equipment and storage medium for model for combat simulation
Schultz et al. Improving tactical plans with genetic algorithms
Hu et al. A dynamic adjusting reward function method for deep reinforcement learning with adjustable parameters
Wang et al. UAV swarm confrontation using hierarchical multiagent reinforcement learning
Lv et al. Sagci-system: Towards sample-efficient, generalizable, compositional, and incremental robot learning
CN116841317A (en) Unmanned aerial vehicle cluster collaborative countermeasure method based on graph attention reinforcement learning
Ghouri et al. Attitude control of quad-copter using deterministic policy gradient algorithms (DPGA)
CN113902087A (en) Multi-Agent deep reinforcement learning algorithm
Zha et al. Evaluate, explain, and explore the state more exactly: an improved Actor-Critic algorithm for complex environment
CN113741186A (en) Double-machine air combat decision method based on near-end strategy optimization
CN113554680A (en) Target tracking method and device, unmanned aerial vehicle and storage medium
Liang et al. Qauxi: Cooperative multi-agent reinforcement learning with knowledge transferred from auxiliary task
CN114840024A (en) Unmanned aerial vehicle control decision method based on context memory
CN114053712B (en) Action generation method, device and equipment of virtual object
KR20220166716A (en) Demonstration-conditioned reinforcement learning for few-shot imitation
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Chen et al. Modified PPO-RND method for solving sparse reward problem in ViZDoom
Ponsen et al. Hierarchical reinforcement learning with deictic representation in a computer game
Cao et al. PooL: Pheromone-inspired Communication Framework forLarge Scale Multi-Agent Reinforcement Learning
Mendi et al. Applications of Reinforcement Learning and its Extension to Tactical Simulation Technologies
Schwab et al. Tensor action spaces for multi-agent robot transfer learning
Karkus et al. Factored contextual policy search with Bayesian optimization
Zhao et al. Deep Reinforcement Learning‐Based Air Defense Decision‐Making Using Potential Games

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant