CN115047907B - Air isomorphic formation command method based on multi-agent PPO algorithm - Google Patents

Air isomorphic formation command method based on multi-agent PPO algorithm

Info

Publication number
CN115047907B
Authority
CN
China
Prior art keywords
network
action
agent
value
output
Prior art date
Legal status
Active
Application number
CN202210656190.7A
Other languages
Chinese (zh)
Other versions
CN115047907A (en)
Inventor
汪亚斌
李友江
崔鹏
郭成昊
丁峰
易侃
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202210656190.7A priority Critical patent/CN115047907B/en
Publication of CN115047907A publication Critical patent/CN115047907A/en
Application granted granted Critical
Publication of CN115047907B publication Critical patent/CN115047907B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104 Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Separation Of Gases By Adsorption (AREA)
  • Flow Control (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps: constructing an action network that takes local environment states as input and an evaluation network that takes global environment states as input; initializing the local environment state, the global environment state, and the data caches required for training; interacting with the environment through the action network according to the local environment state; calculating an advantage function according to the global environment state; calculating the loss of the action network according to the advantage function and the loss of the evaluation network according to the evaluation network output, and updating both networks by backward propagation of the two loss values; and using the updated networks for further interaction with the environment. The method improves the formation's combined macroscopic and microscopic command capability, introduces the multi-agent PPO algorithm into the construction of formation command agents for the first time, and improves both the stability and the effectiveness of training the formation command agents.

Description

Air isomorphic formation command method based on multi-agent PPO algorithm
Technical Field
The invention relates to an air isomorphic formation command method, in particular to an air isomorphic formation command method based on a multi-agent PPO algorithm.
Background
At present, reinforcement learning is increasingly applied to simulated formation action training. Achieving this training goal generally requires constructing a neural network oriented to multiple agents as the learning network for deep reinforcement learning. One important link is the input structure of the action network and the evaluation network: the input is the basis of neural network learning, and an input representation suited to learning allows the neural network to learn quickly and efficiently.
A fundamental difference between many multi-agent algorithms lies in the input to the networks, and the representation of the input values is an important aspect of the overall algorithm. In some multi-agent algorithms both the action network and the evaluation network use global inputs, with the disadvantage that the algorithm ignores important local information.
Other multi-agent algorithms use local information for both networks; the disadvantage of training on local information is that the global situation cannot be taken into account.
Disclosure of Invention
Purpose of the invention: aiming at the defects of the prior art, the invention solves the technical problem of providing an air isomorphic formation command method based on a multi-agent PPO algorithm.
In order to solve the technical problems, the invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps:
step 1, constructing an action network for local environment state input and an evaluation network for global environment state input;
step 2, initializing the local environment state, the global environment state and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation command aircraft agent according to the sampling data in the sampling data set in the step 3;
Step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value according to the sampled data in the sampling data set obtained in step 3, computing the derivatives of the two loss values, and updating the parameters of the action network and the evaluation network by backward propagation;
Step 6, outputting actions and simulating countermeasure environment interaction by using the updated action network, and continuing to sample in the step 3;
repeating the steps 3 to 6 until the action output by the action network meets the set requirement.
In step 1, the local environment state around each aircraft agent in the air isomorphic formation is input to the action network; the global environment state is input to the evaluation network, which integrates the global environment states of all aircraft agents to evaluate the influence of each aircraft agent's actions on the overall formation objective, wherein the overall objective comprises eliminating all enemy forces or minimizing the overall damage to the formation.
In step 1, the local environment state is input to the action network by the following method:
Inputting the friend-and-foe situation information of the air isomorphic formation command into the action network and organizing it into an n×128×128 matrix, wherein each dimension of the n-dimensional matrix represents one type of feature information of the command state, and the feature information comprises:
Position feature matrix: the battlefield is abstracted into a 128×128 space, and each point is 1 if an enemy aircraft agent is present, 2 if a friendly aircraft agent is present, and 0 if no aircraft agent is present;
Heading matrix: in the 128×128 matrix, if an aircraft agent is present at a point, the value at that point is the heading of the aircraft agent, with the heading discretized into 360 degrees.
In step 1, the action network is constructed as follows:
The first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer learns a feature representation of the battlefield state information. The output of the convolution layer is flattened and fed into the fully connected layer, and then into a recurrent neural network whose input is the current battlefield environment together with the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step. The hidden-layer output is fed into the air-command target-assignment action network, whose direct output is turned into a 128×128-dimensional probability distribution through categorical discretization; each dimension represents the probability that an aircraft agent flies to the corresponding coordinate to attack. If an enemy aircraft agent is at the target coordinate, it is attacked; if no aircraft agent is there, the aircraft agent flies to the target coordinate. The probability distribution is passed through a mask calculation to form the final action output; the mask is a 128×128-dimensional vector, each element of which indicates whether an enemy aircraft agent is at the corresponding coordinate.
In step 1, the evaluation network is used to evaluate the battlefield environment; the input of the evaluation network is identical to the input of the action network, and the output is a 1-dimensional vector.
In step 2, the data used for training the action network and the evaluation network are initialized, and a playback buffer is constructed and initialized, including:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where the subscript o indicates that the state information is local environment information; the action network hidden-layer information hs_act, where hs denotes hidden-layer information and the subscript act indicates the action network; the evaluation network hidden-layer information hs_critic, where the subscript critic indicates the evaluation network; the action information a output by the aircraft agent; the probability p_a of the action a output by the aircraft agent and its log value logp_a; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input into the evaluation network.
In step 2, the data includes:
Global environment state information s_share: the input used by the evaluation network in training, i.e. the global battlefield environment; the dimension of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the number of time steps in one round of combat (length denotes the time step and episode the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes the number and thread the thread running the corresponding simulation environment); num_agents is the number of friendly aircraft agents (agents refers to the aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes the state information);
Local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where s denotes state information and the subscript o indicates local environment information; the dimensions of the battlefield environment data are the same as those of the global environment state information s_share;
hs_act: the intermediate output of the recurrent hidden layer of the action network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer (dim is a dimension value and hs_act refers to the action network hidden layer);
hs_critic: the intermediate output of the recurrent hidden layer of the evaluation network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer and hs_critic refers to the evaluation network hidden layer.
The data set obtained by the sampling in step 3 includes s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ, where r is the action-execution feedback obtained from the environment (e.g. the number of hostile units destroyed), log π_θ is the log value of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
The advantage function in step 4 is calculated as:

\hat{A}_t = \sum_{l} \gamma^{l} r_{t+l} - V(s_{share}^{t})

where \hat{A}_t denotes the estimated value of the advantage function at time t; s_{share}^{t} is the global environment state information at time t, and V(s_{share}^{t}) is the output value after the global environment state information at time t is input into the evaluation network; γ is the cumulative discount value; l denotes the number of action steps after time t; r_{t+l} denotes the reward value r fed back by the environment after t+l steps; and V denotes the evaluation network.
The action network loss in step 5 is calculated as:

loss_{actor} = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_o^t)}{\pi_{\theta_{old}}(a_t \mid s_o^t)}

where t is the time step; clip(r_t(θ), 1−ε, 1+ε) is a truncation operation: if r_t(θ) exceeds the range (1−ε, 1+ε), its value is set to 1+ε when it is greater than 1+ε and to 1−ε when it is less than 1−ε, and it is kept unchanged when it lies within (1−ε, 1+ε); ε is a set value; \hat{A}_t is the advantage of the action a_t selected by the current aircraft agent relative to selecting other actions; s_o^t denotes the local environment state information of the aircraft agent at time t;
min(·,·) takes the smaller of the two compared values; \hat{E}_t denotes taking the average over multiple rounds of calculation; π_θ(a_t | s_o^t) is the probability that the action network with the latest parameters θ selects action a_t at time t; π_{θ_old}(a_t | s_o^t) is the probability that the action network with the previous-iteration parameters θ_old selects action a_t at time t; and r_t(θ) is the ratio of the probability of the current action network selecting a_t to that of the previous-iteration action network selecting a_t.
The backward-propagation update of the parameters of the action network and the evaluation network by computing the derivative of the loss value, as described in step 5, is realised as:

\nabla_{\theta}\, loss_{actor} = \hat{E}_t\left[\hat{A}_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_o^t)\right]

where \nabla_{\theta}\, loss_{actor} is the derivative of the action network loss value, \nabla_{\theta} denotes the derivative operation, s_o is the local environment state information, a_t is the action information, \log \pi_{\theta}(a_t \mid s_o^t) is the log value of \pi_{\theta}(a_t \mid s_o^t), \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_o^t) is its derivative, and \hat{E}_t denotes the value used to estimate time t by taking the average over multiple rounds of calculation.
The beneficial effects are that:
Aiming at the application scenario in which the formation-command simulation training agent requires long training time and exhibits weak stability, the invention constructs the formation command agent from the perspective of macroscopic command and adopts a combined macroscopic and microscopic mode: the evaluation network takes global environment state data as input while the action network takes the local environment data around a single aircraft agent as input, which improves the macroscopic and microscopic control capability of the formation. The multi-agent PPO (proximal policy optimization) algorithm is introduced for the first time into the construction of the command agent, which improves the training stability of the formation command agent.
The invention constructs an air isomorphic formation command method based on the multi-agent PPO algorithm. The evaluation network uses global information, so the algorithm has the capability of evaluating the global situation; the input of the action network is local information, so the agent can focus on learning countermeasures from local information; and the evaluation network, by taking the global information as input and evaluating the global environment state, guides the agent to select actions favourable to the global environment state.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram of the overall process of the present invention.
Fig. 2 is a structural diagram of the action network.
Detailed Description
The following describes an embodiment using a simulated engagement scenario with 6 fighters on each of the red and blue sides. The simulated battlefield space is discretized into a 128 x 128 grid space in which the fighters of both the red and blue sides maneuver and attack each other. The air isomorphic formation command method based on the multi-agent PPO (proximal policy optimization) algorithm controls the 6 fighters on the red side and the blue side as they fight against each other, and decides which fighter flies to which grid cell and which blue aircraft agent it strikes.
As shown in fig. 1, the air isomorphic formation command method based on the multi-agent PPO algorithm includes: step 1, constructing an action network for local environment state input and an evaluation network for global environment state input;
step 2, initializing the local environment state, the global environment state and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation command aircraft agent according to the sampling data in the sampling data set in the step 3;
Step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value according to the sampled data in the sampling data set obtained in step 3, computing the derivatives of the two loss values, and updating the parameters of the action network and the evaluation network by backward propagation;
Step 6, outputting actions and simulating countermeasure environment interaction by using the updated action network, and continuing to sample in the step 3;
repeating steps 3 to 6 until the action output by the action network meets the set requirement; a simplified sketch of this training loop is given below.
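For illustration only, the following is a minimal Python sketch of one training iteration for steps 3 to 6. The objects `env`, `actor`, `critic`, `buffer` and `optimizer`, and the method names called on them, are assumptions of this sketch rather than interfaces defined by the patent:

```python
def train_formation_commander(env, actor, critic, buffer, optimizer,
                              num_iterations=100, length_episode=1024,
                              gamma=0.99, clip_eps=0.2):
    """Sketch of steps 3-6: sample, compute advantages, update both networks, repeat."""
    for _ in range(num_iterations):
        # Step 3: interact with the simulated adversarial environment and sample data.
        local_obs, global_state = env.reset()
        hs_act, hs_critic = actor.initial_hidden(), critic.initial_hidden()
        buffer.clear()
        for _ in range(length_episode):
            action, logp_a, hs_act = actor.act(local_obs, hs_act)          # local-state input
            value, hs_critic = critic.evaluate(global_state, hs_critic)    # global-state input
            local_obs, global_state, reward, done = env.step(action)
            buffer.add(local_obs, global_state, hs_act, hs_critic,
                       action, logp_a, value, reward)
        # Step 4: advantage estimates from rewards and the global critic values.
        advantages = buffer.compute_advantages(gamma)
        # Step 5: actor / critic losses and backward-propagation parameter update.
        loss_actor, loss_value = buffer.ppo_losses(actor, critic, advantages, clip_eps)
        optimizer.zero_grad()
        (loss_actor + loss_value).backward()
        optimizer.step()
        # Step 6: the updated action network is reused for the next round of sampling.
```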
In step 1, the specific method for constructing the action network and the evaluation network is as follows:
The action network takes as input the information of both the red and blue sides in the battlefield environment, organized into a 3×128×128 matrix, where each dimension of the 3-dimensional matrix represents one type of battlefield environment feature information; the features comprise the following (a sketch of how this state tensor can be assembled is given after the list):
Position feature matrix: the battlefield is abstracted into a 128 x 128 space; on each grid cell the value is 1 for a blue aircraft agent, 2 for a red aircraft agent, and 0 when no aircraft agent is present.
Heading matrix: in the 128 x 128 grid space, if an aircraft agent is present in a cell, the corresponding matrix element value is the heading of that aircraft agent, with the heading discretized into 360 degrees.
Damage state matrix: in the 128 x 128 grid space, if an aircraft agent is present in a cell, the corresponding matrix element value is the damage state of that aircraft agent; the states are classified as intact (denoted by 1), damaged (denoted by 0.5) and shot down (denoted by 0.1).
Based on the above inputs, as shown in fig. 2, the constructed action network architecture is as follows:
The first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer learns a feature representation of the battlefield state information. The output of the convolution layer is fed into the fully connected layer after a flattening (flatten) operation, and then into a recurrent neural network (RNN) whose input is the current battlefield environment together with the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step. The hidden-layer output is fed into the air-command target-assignment (actor) network, whose direct output is turned into a 128 x 128-dimensional probability distribution through categorical discretization; each dimension represents the probability that an aircraft agent flies to the corresponding coordinate to attack. If an enemy aircraft agent is at the target coordinate, it is attacked; if no aircraft agent is there, the aircraft agent flies to the target coordinate. The probability distribution is passed through a mask calculation to form the final action output: the mask is a 128 x 128-dimensional vector, each element of which indicates whether an enemy aircraft agent is at the corresponding coordinate, and the product of the mask and the categorical output becomes the final output.
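A minimal PyTorch sketch of this action network structure follows. The channel counts (3 in, 8 out), the 128×128 categorical output, the masking step and the 256-dimensional hidden layer come from the description; the convolution kernel size, stride and activation functions are assumptions of the sketch:

```python
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    """Action network: Conv2d (3->8) -> flatten -> fully connected -> RNN -> masked 128x128 categorical."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)   # 3 input channels, 8 output channels
        self.fc = nn.Linear(8 * 64 * 64, hidden_dim)                      # flatten, then fully connected
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)                     # current state + previous hidden output
        self.head = nn.Linear(hidden_dim, 128 * 128)                      # one logit per grid coordinate

    def forward(self, obs, hs, mask):
        # obs: [batch, 3, 128, 128]; hs: [batch, hidden_dim] or None; mask: [batch, 128*128] in {0, 1}
        x = torch.relu(self.conv(obs))
        x = torch.relu(self.fc(x.flatten(1)))
        hs = self.rnn(x, hs)
        probs = torch.softmax(self.head(hs), dim=-1) * mask               # mask x categorical output
        probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-8)   # renormalise (mask must not be all zero)
        dist = torch.distributions.Categorical(probs=probs)
        return dist, hs
```

An action is then obtained as `a = dist.sample()`, which indexes a target grid coordinate, and `logp_a = dist.log_prob(a)` is stored for the PPO ratio used in step 5.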
The constructed evaluation network is used to evaluate the battlefield environment. Its first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer learns a feature representation of the battlefield state information. The output of the convolution layer is fed into the fully connected layer after a flattening (flatten) operation, and then into a recurrent neural network (RNN) whose input is the current battlefield environment together with the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step. The hidden-layer output is fed into the core evaluation network, and the evaluation network output is a one-dimensional vector.
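A corresponding PyTorch sketch of the evaluation (critic) network, under the same assumptions about kernel size and stride, shares the convolution-flatten-FC-RNN trunk but ends in a one-dimensional value output:

```python
import torch
import torch.nn as nn

class CriticNetwork(nn.Module):
    """Evaluation network: same Conv2d -> flatten -> FC -> RNN trunk, one-dimensional value output V(s_share)."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(8 * 64 * 64, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.value_head = nn.Linear(hidden_dim, 1)                        # 1-dimensional evaluation output

    def forward(self, global_state, hs):
        # global_state: [batch, 3, 128, 128]; hs: [batch, hidden_dim] or None
        x = torch.relu(self.conv(global_state))
        x = torch.relu(self.fc(x.flatten(1)))
        hs = self.rnn(x, hs)
        return self.value_head(hs), hs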
In step 2, a playback buffer is constructed and initialized; the contents of the buffer comprise s_share, s_o, hs_act, hs_critic, a, logp_a and V(s_share), and a preallocation sketch follows the list below.
s_share: the input used by the critic network in training, i.e. the global battlefield environment; the dimension of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the number of time steps of one round of combat and is set to 1024 steps; num_thread is the number of simulation environments running in parallel and is set to 6; num_agents is the number of friendly aircraft agents and is set to 5; dim_s is the dimension of the battlefield environment data in each time slice and is set to 128×128×3.
s_o: the battlefield environment input of each individual aircraft agent within the formation; the battlefield environment data are the same as s_share.
hs_act: the intermediate output of the recurrent hidden layer of the actor network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer and is set to 256.
hs_critic: the intermediate output of the recurrent hidden layer of the critic network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer and is set to 256.
In step 3, the action network continuously interacts with the environment to obtain a sampling data set for training; the data set comprises s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ.
In step 4, the advantage function is calculated as \hat{A}_t = \sum_{l} \gamma^{l} r_{t+l} - V(s_{share}^{t}), where the sum runs over the action steps l after time t.
The dimension of the sampled data after transformation is [number of parallel environments, number of agents, number of steps of one command round, observation dimension of each step].
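A minimal NumPy sketch of this advantage calculation, assuming the reward and value arrays have been sampled with the shape [length_episode, num_thread, num_agents] and γ = 0.99 as an assumed setting:

```python
import numpy as np

def compute_advantages(rewards, values, gamma=0.99):
    """Advantage estimate A_t = sum_l gamma^l * r_{t+l} - V(s_share^t)."""
    returns = np.zeros_like(rewards)
    running = np.zeros_like(rewards[0])
    for t in reversed(range(rewards.shape[0])):      # discounted reward-to-go, accumulated backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values                          # subtract the critic's value estimate
```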
In step 5, the actor loss (loss_actor) is calculated as:

loss_{actor} = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_o^t)}{\pi_{\theta_{old}}(a_t \mid s_o^t)}

where clip(r_t(θ), 1−ε, 1+ε) is a truncation operation: if r_t(θ) exceeds the range (1−ε, 1+ε), its value is set to 1+ε when it is greater than 1+ε and to 1−ε when it is less than 1−ε, and it is kept unchanged when it lies within (1−ε, 1+ε); the value of ε is set empirically. \hat{A}_t is the advantage function and represents the advantage of the action a_t selected by the current aircraft agent over selecting other actions. s_o^t denotes the local environment state information of the aircraft agent at time t. min(·,·) takes the smaller of the two compared values. \hat{E}_t denotes taking the average over multiple rounds of calculation. π_θ(a_t | s_o^t) is the probability that the aircraft agent selects action a_t using the latest parameters θ at time t. π_{θ_old}(a_t | s_o^t) is the probability that the aircraft agent selects action a_t at time t using the action network with the previous-iteration parameters θ_old. r_t(θ) is the ratio of the probability of the current action network selecting a_t to that of the previous-iteration action network selecting a_t.
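The clipped surrogate above maps directly onto a few lines of PyTorch; the sketch below negates the expression so it can be minimised with gradient descent, and ε = 0.2 is an assumed setting:

```python
import torch

def actor_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO actor loss built from log pi_theta(a_t|s_o^t), log pi_theta_old(a_t|s_o^t) and A_t."""
    ratio = torch.exp(logp_new - logp_old)                              # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)        # clip(r_t, 1-eps, 1+eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)     # smaller of the two terms
    return -surrogate.mean()                                            # average over samples, negated for descent
```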
The backward-propagation update of the parameters of the action network and the evaluation network by computing the derivative of the loss value, as described in step 5, is realised as:

\nabla_{\theta}\, loss_{actor} = \hat{E}_t\left[\hat{A}_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_o^t)\right]

where s_o is the local environment state information, a_t is the action information, \log \pi_{\theta}(a_t \mid s_o^t) is the log value of \pi_{\theta}(a_t \mid s_o^t), and \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_o^t) is its derivative with respect to θ.
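In an implementation, the derivative computation and parameter update correspond to an autograd backward pass followed by an optimizer step. In the sketch below the optimizers are assumed to have been created once for the two networks, and the mean-squared-error critic loss is an assumption of the example:

```python
def backprop_update(loss_actor, values_pred, returns, actor_optim, critic_optim):
    """Backward-propagation update of the action and evaluation networks from their loss values."""
    loss_value = ((values_pred - returns) ** 2).mean()   # assumed MSE loss for the evaluation network

    actor_optim.zero_grad()
    loss_actor.backward()                                # autograd evaluates the gradient formula above
    actor_optim.step()                                   # update action-network parameters theta

    critic_optim.zero_grad()
    loss_value.backward()
    critic_optim.step()                                  # update evaluation-network parameters
    return loss_value.item()
```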
In a specific implementation, the application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, can carry out the inventive content of the air isomorphic formation command method based on the multi-agent PPO algorithm and some or all of the steps of each embodiment. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be apparent to those skilled in the art that the technical solutions in the embodiments of the present invention may be implemented by means of a computer program and a corresponding general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied essentially in the form of a computer program, i.e. a software product, which may be stored in a storage medium and includes several instructions that cause a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer (MCU), a network device, or the like) to perform the methods described in the embodiments or in some parts of the embodiments of the present invention.
The invention provides an idea and a method for an air isomorphic formation command method based on the multi-agent PPO algorithm; there are numerous specific methods and ways of realising the technical scheme, and the above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented by using the prior art.

Claims (10)

1. An air isomorphic formation command method based on a multi-agent PPO algorithm is characterized by comprising the following steps:
step 1, constructing an action network for local environment state input and an evaluation network for global environment state input;
step 2, initializing the local environment state, the global environment state and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation command aircraft agent according to the sampling data in the sampling data set in the step 3;
Step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value according to the sampled data in the sampling data set obtained in step 3, computing the derivatives of the two loss values, and updating the parameters of the action network and the evaluation network by backward propagation;
Step 6, outputting actions and simulating countermeasure environment interaction by using the updated action network, and continuing to sample in the step 3;
repeating the steps 3 to 6 until the action output by the action network meets the set requirement.
2. The air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 1, wherein step 1 comprises: inputting the local environment state around each aircraft agent in the air isomorphic formation to the action network; and inputting the global environment state into the evaluation network, which integrates the global environment states of all aircraft agents to evaluate the influence of each aircraft agent's actions on the overall formation objective, wherein the overall objective comprises eliminating all enemy forces or minimizing the overall damage to the formation.
3. The air isomorphic formation command method based on a multi-agent PPO algorithm according to claim 2, wherein in step 1, the local environment state is input to the action network by the following method:
inputting the friend-and-foe situation information of the air isomorphic formation command into the action network and organizing it into an n×128×128 matrix, wherein each dimension of the n-dimensional matrix represents one type of feature information of the command state, and the feature information comprises:
a position feature matrix: the battlefield is abstracted into a 128×128 space, and each point is 1 if an enemy aircraft agent is present, 2 if a friendly aircraft agent is present, and 0 if no aircraft agent is present;
a heading matrix: in the 128×128 matrix, if an aircraft agent is present at a point, the value at that point is the heading of the aircraft agent, with the heading discretized into 360 degrees.
4. An air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 3, wherein in step 1, the action network is constructed as follows:
The first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer learns a feature representation of the battlefield state information; the output of the convolution layer is flattened and fed into the fully connected layer, and then into a recurrent neural network whose input is the current battlefield environment together with the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step; the hidden-layer output is fed into the air-command target-assignment action network, whose direct output is turned into a 128×128-dimensional probability distribution through categorical discretization, each dimension representing the probability that an aircraft agent flies to the corresponding coordinate to attack; if an enemy aircraft agent is at the target coordinate, it is attacked, and if no aircraft agent is there, the aircraft agent flies to the target coordinate; the probability distribution is passed through a mask calculation to form the final action output; the mask is a 128×128-dimensional vector, each element of which indicates whether an enemy aircraft agent is at the corresponding coordinate.
5. The method for air isomorphic formation command based on a multi-agent PPO algorithm according to claim 4, wherein in step 1, the evaluation network is used for evaluating the battlefield environment, the input of the evaluation network is identical to the input of the action network, and the output of the evaluation network is a 1-dimensional vector.
6. The method for air isomorphic formation command based on a multi-agent PPO algorithm according to claim 5, wherein in step 2, initializing the data used for training the action network and the evaluation network and constructing and initializing a playback buffer comprises:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where the subscript o indicates that the state information is local environment information; the action network hidden-layer information hs_act, where hs denotes hidden-layer information and the subscript act indicates the action network; the evaluation network hidden-layer information hs_critic, where the subscript critic indicates the evaluation network; the action information a output by the aircraft agent; the probability p_a of the action a output by the aircraft agent and its log value logp_a; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input into the evaluation network.
7. An air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 6, wherein in step 2, the data comprises:
Global environment state information s_share: the input used by the evaluation network in training, i.e. the global battlefield environment; the dimension of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the number of time steps in one round of combat (length denotes the time step and episode the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes the number and thread the thread running the corresponding simulation environment); num_agents is the number of friendly aircraft agents (agents refers to the aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes the state information);
local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where s denotes state information and the subscript o indicates local environment information; the dimensions of the battlefield environment data are the same as those of the global environment state information s_share;
hs_act: the intermediate output of the recurrent hidden layer of the action network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer (dim is a dimension value and hs_act refers to the action network hidden layer);
hs_critic: the intermediate output of the recurrent hidden layer of the evaluation network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer and hs_critic refers to the evaluation network hidden layer.
8. The air isomorphic formation command method based on a multi-agent PPO algorithm according to claim 7, wherein the data set obtained by the sampling in step 3 includes s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ, where r is the action-execution feedback obtained from the environment, log π_θ is the log value of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
9. The method for air isomorphic formation command based on a multi-agent PPO algorithm according to claim 8, wherein the advantage function in step 4 is calculated as:

\hat{A}_t = \sum_{l} \gamma^{l} r_{t+l} - V(s_{share}^{t})

where \hat{A}_t denotes the estimated value of the advantage function at time t; s_{share}^{t} is the global environment state information at time t, and V(s_{share}^{t}) is the output value after the global environment state information at time t is input into the evaluation network; γ is the cumulative discount value; l denotes the number of action steps after time t; r_{t+l} denotes the reward value r fed back by the environment after t+l steps; and V denotes the evaluation network.
10. The air isomorphic formation command method based on a multi-agent PPO algorithm according to claim 9, wherein the action network loss in step 5 is calculated as:

loss_{actor} = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_o^t)}{\pi_{\theta_{old}}(a_t \mid s_o^t)}

where t is the time step; clip(r_t(θ), 1−ε, 1+ε) is a truncation operation: if r_t(θ) exceeds the range (1−ε, 1+ε), its value is set to 1+ε when it is greater than 1+ε and to 1−ε when it is less than 1−ε, and it is kept unchanged when it lies within (1−ε, 1+ε); ε is a set value; \hat{A}_t is the advantage of the action a_t selected by the current aircraft agent relative to selecting other actions; s_o^t denotes the local environment state information of the aircraft agent at time t;
min(·,·) takes the smaller of the two compared values; \hat{E}_t denotes taking the average over multiple rounds of calculation; π_θ(a_t | s_o^t) is the probability that the action network with the latest parameters θ selects action a_t at time t; π_{θ_old}(a_t | s_o^t) is the probability that the action network with the previous-iteration parameters θ_old selects action a_t at time t; and r_t(θ) is the ratio of the probability of the current action network selecting a_t to that of the previous-iteration action network selecting a_t;
the backward-propagation update of the parameters of the action network and the evaluation network by computing the derivative of the loss value, as described in step 5, is realised as:

\nabla_{\theta}\, loss_{actor} = \hat{E}_t\left[\hat{A}_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_o^t)\right]

where \nabla_{\theta}\, loss_{actor} is the derivative of the action network loss value, \nabla_{\theta} denotes the derivative operation, s_o is the local environment state information, a_t is the action information, \log \pi_{\theta}(a_t \mid s_o^t) is the log value of \pi_{\theta}(a_t \mid s_o^t), \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_o^t) is its derivative, and \hat{E}_t denotes the value used to estimate time t by taking the average over multiple rounds of calculation.
CN202210656190.7A 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm Active CN115047907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210656190.7A CN115047907B (en) 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210656190.7A CN115047907B (en) 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm

Publications (2)

Publication Number Publication Date
CN115047907A (en) 2022-09-13
CN115047907B true CN115047907B (en) 2024-05-07

Family

ID=83161154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210656190.7A Active CN115047907B (en) 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm

Country Status (1)

Country Link
CN (1) CN115047907B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113625757A (en) * 2021-08-12 2021-11-09 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917B (en) * 2018-10-17 2020-08-25 北京大学 Multi-agent cooperation decision and training method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113625757A (en) * 2021-08-12 2021-11-09 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on attack-defense confrontation strategies for UAV swarms based on multi-agent reinforcement learning; 轩书哲, 柯良军; Radio Engineering (无线电工程); 2021-05-05; Vol. 51, No. 05; full text *
Multi-agent-based simulation and effectiveness evaluation of formation cooperative flight tests; 夏庆军, 张安, 张耀中; Fire Control & Command Control (火力与指挥控制); 2011-05-15; No. 05; full text *

Also Published As

Publication number Publication date
CN115047907A (en) 2022-09-13

Similar Documents

Publication Publication Date Title
Xin et al. Efficient decision makings for dynamic weapon-target assignment by virtual permutation and tabu search heuristics
Schultz et al. Improving tactical plans with genetic algorithms
CN110083971B (en) Self-explosion unmanned aerial vehicle cluster combat force distribution method based on combat deduction
CN108549402A (en) Unmanned aerial vehicle group method for allocating tasks based on quantum crow group hunting mechanism
CN109190978A (en) A kind of unmanned plane resource allocation methods based on quantum flock of birds mechanism of Evolution
CN107330560A (en) A kind of multitask coordinated distribution method of isomery aircraft for considering temporal constraint
Ming et al. Improved discrete mapping differential evolution for multi-unmanned aerial vehicles cooperative multi-targets assignment under unified model
CN112600795B (en) Method and system for collapsing combat network under incomplete information
CN113625569B (en) Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
Wang et al. UAV swarm confrontation using hierarchical multiagent reinforcement learning
CN115047907B (en) Air isomorphic formation command method based on multi-agent PPO algorithm
Lee et al. Autonomous control of combat unmanned aerial vehicles to evade surface-to-air missiles using deep reinforcement learning
CN113741186B (en) Double-aircraft air combat decision-making method based on near-end strategy optimization
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
CN111797966B (en) Multi-machine collaborative global target distribution method based on improved flock algorithm
CN113625767A (en) Fixed-wing unmanned aerial vehicle cluster collaborative path planning method based on preferred pheromone gray wolf algorithm
CN114911269B (en) Networking radar interference strategy generation method based on unmanned aerial vehicle group
Wang et al. Cooperatively pursuing a target unmanned aerial vehicle by multiple unmanned aerial vehicles based on multiagent reinforcement learning
CN113324545A (en) Multi-unmanned aerial vehicle collaborative task planning method based on hybrid enhanced intelligence
Zhao et al. Deep Reinforcement Learning‐Based Air Defense Decision‐Making Using Potential Games
Hao et al. Flight Trajectory Prediction Using an Enhanced CNN-LSTM Network
CN117590757B (en) Multi-unmanned aerial vehicle cooperative task allocation method based on Gaussian distribution sea-gull optimization algorithm
CN115695209B (en) Graph model-based anti-control unmanned aerial vehicle bee colony assessment method
Li et al. Research on stealthy UAV path planning based on improved genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant