CN115047907B - Air isomorphic formation command method based on multi-agent PPO algorithm - Google Patents
- Publication number: CN115047907B (application CN202210656190.7A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
Abstract
The invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, comprising the following steps: constructing an action network that takes local environment state as input and an evaluation network that takes global environment state as input; initializing the local environment state, the global environment state, and the data caches required for training; interacting with the environment through the action network according to the local environment state; calculating an advantage function from the global environment state; calculating the action network loss from the advantage function and the evaluation network loss from the evaluation network output, and back-propagating the two loss values to update both networks; and interacting with the environment using the updated networks. The method improves the formation's combined macroscopic-and-microscopic command capability, introduces the multi-agent PPO algorithm for the first time to construct formation command agents, and improves both the stability and the effect of their training.
Description
Technical Field
The invention relates to an air isomorphic formation command method, in particular to an air isomorphic formation command method based on a multi-agent PPO algorithm.
Background
At present, reinforcement learning is increasingly widely applied to simulated formation action training. A neural network oriented to multiple intelligent agents is generally constructed as the learning network for deep reinforcement learning. One important design decision is the input structure of the action network and the evaluation network: input is the basis of neural network learning, and an input representation suited to learning allows the network to learn quickly and efficiently.
A fundamental difference between multi-agent algorithms lies in the network input; the representation of the input values is an important aspect of the overall algorithm. Some multi-agent algorithms use global inputs for both the action network and the evaluation network, with the disadvantage that important local information is ignored. Other multi-agent algorithms use local information for both networks, with the disadvantage that training on local information alone cannot take the global situation into account.
Disclosure of Invention
The invention aims to: address the defects of the prior art by providing an air isomorphic formation command method based on a multi-agent PPO algorithm.
In order to solve the technical problems, the invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps:
step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state, and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation command aircraft agent according to the sampling data in the sampling data set in the step 3;
Step 5, calculating the action network loss (loss_actor) and the evaluation network loss (loss_value) from the sampled data in the sampled data set obtained in step 3, computing the derivatives of the two loss values, and updating the parameters of the action network and the evaluation network by backward propagation;
Step 6, using the updated action network to output actions and interact with the simulated countermeasure environment, and continuing the sampling of step 3;
repeating the steps 3 to 6 until the action output by the action network meets the set requirement.
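The loop of steps 3 to 6 can be sketched in Python; all names below (`train_loop` and its four callables) are hypothetical stand-ins for the sampling, advantage calculation, back-propagation update, and stopping test described above, not part of the patent itself.

```python
def train_loop(sample_rollout, compute_advantages, update_networks,
               converged, max_iters=100):
    """Skeleton of steps 3-6: sample, estimate advantages, update both
    networks by back-propagation, repeat until the action network's
    output meets the set requirement."""
    for it in range(max_iters):
        batch = sample_rollout()         # step 3: interact with the
                                         # simulated environment and sample
        adv = compute_advantages(batch)  # step 4: advantage function
        update_networks(batch, adv)      # step 5: actor and critic losses,
                                         # backward propagation
        if converged():                  # stop condition of the repeat
            return it + 1
    return max_iters

# Minimal wiring with no-op stand-ins: "converges" on the third check.
state = {"checks": 0}
def met_requirement():
    state["checks"] += 1
    return state["checks"] >= 3

assert train_loop(lambda: [], lambda b: [], lambda b, a: None,
                  met_requirement) == 3
```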
In the step 1, inputting local environment states around each aircraft agent in an air isomorphic formation to the action network; and inputting global environment states into the evaluation network, and integrating the global environment states of all the aircraft agents to evaluate the influence of the actions of each aircraft agent on the whole formation overall target, wherein the overall target comprises the elimination of all enemy forces or the minimum overall damage of the formation.
In step 1, the local environment state input to the action network is constructed as follows:
the friend-and-foe command information of the air isomorphic formation is input to the action network, organized as an n × 128 × 128 matrix in which each of the n channels represents one type of command-state feature information, including:
Position feature matrix: the battlefield is abstracted into a 128 × 128 space; each point is 1 if an aircraft agent of one side is present, 2 if an aircraft agent of the opposing side is present (in the embodiment below, 1 for blue and 2 for red), and 0 if no aircraft agent is present;
Heading matrix: in the 128 × 128 matrix, if an aircraft agent is present at a point, the corresponding element is that agent's heading, divided into 360 degrees;
Damage ("marred") state matrix: in the 128 × 128 matrix, if an aircraft agent is present at a point, the corresponding element is that agent's damage state.
In step 1, the action network is constructed as follows:
The first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels and learns a feature representation of the battlefield state information. The convolution layer's output is flattened and fed into a fully connected layer, then into a recurrent neural network whose inputs are the current battlefield environment and the hidden-layer output of the previous step, and whose output is the hidden-layer output of the current step. This hidden-layer output is fed to the air-command target-assignment action network, whose direct output is turned by categorical discretization into a 128 × 128-dimensional probability distribution; each dimension represents the probability that an aircraft agent flies to the corresponding coordinate to attack. If an enemy aircraft agent is at the target coordinate, it is attacked there; if no aircraft agent is there, the aircraft agent simply flies to the target coordinate. The probability distribution is then passed through a mask calculation to form the final action output; the mask is a 128 × 128-dimensional vector, each element of which indicates whether an enemy aircraft agent is at the corresponding coordinate.
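The mask step at the output can be sketched in plain Python on a small flattened grid. The helper `masked_action_probs` is hypothetical, and the renormalisation after masking is an assumption — the text only states that the probability distribution is passed through a mask calculation.

```python
import math

def masked_action_probs(logits, mask):
    """Softmax over the action network's raw outputs (here a small
    flattened grid instead of 128*128), multiplied elementwise by the
    0/1 mask and renormalised so the final distribution sums to 1."""
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [mk * e / z for mk, e in zip(mask, exps)]
    s = sum(probs)
    return [p / s for p in probs]

# Toy 2x2 grid flattened to 4 cells; the mask allows only cells 0 and 2.
probs = masked_action_probs([0.0, 0.0, 0.0, 0.0], [1, 0, 1, 0])
assert probs == [0.5, 0.0, 0.5, 0.0]
```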
In step 1, the evaluation network is used to evaluate the battlefield environment; its input is identical to that of the action network, and its output is a 1-dimensional vector.
In step 2, the data for training the action network and the evaluation network is initialized, and a playback buffer is constructed and initialized, containing:
the global environment state information s_share input to the evaluation network (s denotes state information; the subscript share marks it as global environment information); the local environment state information s_o input to the action network (the subscript o marks it as local environment information); the action network hidden-layer information hs_act (hs denotes network hidden-layer information; the subscript act marks the action network); the evaluation network hidden-layer information hs_critic (the subscript critic marks the evaluation network); the action information a output by the aircraft agent; p_a, the probability of the aircraft agent outputting action a, and log p_a, its logarithm; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is its output value for the global environment state input s_share.
In step 2, the data includes:
Global environment state information s_share: the global battlefield environment used as the evaluation network's training input. Its dimension is [length_episode, num_thread, num_agents, dim_s], where length_episode is the number of time steps in one combat round, num_thread is the number of simulation environments running in parallel, num_agents is the number of friendly aircraft agents, and dim_s is the dimension of the battlefield environment data in each time slice.
Local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation; its layout is the same as that of s_share.
hs_act: the intermediate output of the action network's recurrent hidden layer, with dimension [length_episode, num_thread, num_agents, dim_hs_act], where dim_hs_act is the hidden layer's output dimension.
hs_critic: the intermediate output of the evaluation network's recurrent hidden layer, with dimension [length_episode, num_thread, num_agents, dim_hs_critic], where dim_hs_critic is the hidden layer's output dimension.
The data set obtained by the sampling in step 3 includes s_share, s_o, hs_act, hs_critic, a, log p_a, V(s_share), r, and log π_θ, where r is the action-execution feedback obtained from the environment (e.g. the number of enemy units destroyed), log π_θ is the logarithm of the action network's direct output, π denotes the action network, and the subscript θ denotes its parameters.
In step 4, the dominance (advantage) function is calculated as:

Â_t = Σ_{l=0}^{T−t} γ^l · r_{t+l} − V(s_t^share)

where Â_t is the estimate of the dominance function at time t; s_t^share is the global environment state information at time t, and V(s_t^share) is the output value of the evaluation network V for that input; γ is the accumulated discount value; l is the number of action steps after time t, T is the final step of the round, and r_{t+l} is the reward value r fed back by the environment after t + l steps.
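A minimal plain-Python sketch of this estimate, assuming the reward list covers the remaining steps of the round (`advantage` is a hypothetical helper name):

```python
def advantage(rewards, values, t, gamma=0.99):
    """N-step advantage estimate at time t: the discounted sum of the
    environment rewards r_{t+l} over the rest of the round, minus the
    evaluation network's value V(s_t^share) (here a precomputed list)."""
    discounted_return = sum(gamma ** l * r
                            for l, r in enumerate(rewards[t:]))
    return discounted_return - values[t]

# Two-step toy rollout: 1.0 + 0.9 * 1.0 - 0.5 = 1.4
assert abs(advantage([1.0, 1.0], [0.5, 0.0], t=0, gamma=0.9) - 1.4) < 1e-9
```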
In step 5, the action network loss is calculated as:

loss_actor = −Ê_t[ min( r_t(θ) · Â_t , clip(r_t(θ), 1−ε, 1+ε) · Â_t ) ],  with  r_t(θ) = π_θ(a_t | s_t^o) / π_θ_old(a_t | s_t^o)

where t denotes the time step; clip(r_t(θ), 1−ε, 1+ε) is a truncation operation: if r_t(θ) is greater than 1+ε its value is set to 1+ε, if it is less than 1−ε its value is set to 1−ε, and if it lies within (1−ε, 1+ε) its value is kept; ε is a set value; Â_t is the advantage of the current aircraft agent selecting action a_t relative to selecting other actions; s_t^o is the local environment state information of the aircraft agent at time t; min(·, ·) takes the smaller of the two compared values; Ê_t[·] denotes taking the average over multiple rounds of calculation; π_θ(a_t | s_t^o) is the probability that the action network with the latest parameters θ selects action a_t at time t; π_θ_old(a_t | s_t^o) is the probability that the action network with the previous iteration's parameters θ_old selects action a_t at time t; and r_t(θ) is the ratio of these two probabilities.
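The per-sample clipped term can be sketched in plain Python (`clipped_surrogate` is a hypothetical helper; the actor loss is then the negative average of this quantity over the sampled batch):

```python
import math

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """Per-sample PPO objective min(r*A, clip(r, 1-eps, 1+eps)*A) with
    r = pi_theta(a|s) / pi_theta_old(a|s) computed from log-probs."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * adv, clipped * adv)

# Ratio 1.5 exceeds 1 + eps = 1.2, so a positive advantage is clipped:
assert abs(clipped_surrogate(math.log(1.5), 0.0, adv=2.0) - 2.4) < 1e-9
# Inside the trust region nothing is clipped:
assert clipped_surrogate(0.0, 0.0, adv=1.0) == 1.0
```

The outer `min` keeps the pessimistic value, which is what prevents the updated policy from moving too far from θ_old in a single iteration.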
The backward-propagation update of the action network and evaluation network parameters described in step 5 is realized by computing the derivative of the loss value:

∇_θ loss_actor = −Ê_t[ ∇_θ log π_θ(a_t | s_t^o) · Â_t ]

where ∇_θ denotes the derivative operation with respect to the action network parameters θ; s_t^o is the local environment state information and a_t is the action information; log π_θ(a_t | s_t^o) is the log value of the action network's output probability for a_t; and Ê_t[·] denotes the estimate at time t obtained by taking the average over multiple rounds of calculation.
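For the categorical (softmax) action output used here, ∇_θ log π_θ has a simple closed form with respect to the logits — one-hot of the chosen action minus the softmax probabilities. The sketch below (hypothetical helper names) verifies that form against a finite difference:

```python
import math

def log_softmax(logits, a):
    """log pi(a) for a categorical policy parameterised by raw logits."""
    m = max(logits)
    z = sum(math.exp(x - m) for x in logits)
    return (logits[a] - m) - math.log(z)

def grad_log_softmax(logits, a):
    """Analytic gradient of log pi(a) w.r.t. the logits:
    onehot(a) - softmax(logits). Scaled by the advantage estimate,
    this is the per-sample policy-gradient direction."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [(1.0 if i == a else 0.0) - e / z for i, e in enumerate(exps)]

# Check the analytic gradient against a finite difference.
logits, a, h = [0.1, 0.5, -0.2], 1, 1e-6
numeric = [(log_softmax([x + h if j == i else x
                         for j, x in enumerate(logits)], a)
            - log_softmax(logits, a)) / h for i in range(len(logits))]
assert all(abs(n - g) < 1e-4
           for n, g in zip(numeric, grad_log_softmax(logits, a)))
```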
The beneficial effects are that:
For application scenarios in which formation-command simulation training agents train slowly and unstably, the invention constructs the formation command agent from the macroscopic command angle and adopts a combined macroscopic-and-microscopic mode: the evaluation network takes global environment state data as input, while the action network takes the local environment data around a single aircraft agent, improving the formation's macroscopic and microscopic control capability. The multi-agent PPO (proximal policy optimization) algorithm is introduced for the first time to construct the command agent, improving the training stability of the formation command agent.
The invention constructs an air isomorphic formation command method based on a multi-agent PPO algorithm. Because the evaluation network uses global information, the algorithm can evaluate the global situation; because the action network's input is local information, each agent can focus on learning countermeasures from local information. The evaluation network evaluates the global environment state and guides the agent to select actions favorable to it.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
FIG. 1 is a schematic diagram of the overall process of the present invention.
Fig. 2 is a structural diagram of the action network.
Detailed Description
The following describes an embodiment with a simulation scene of 6 fighters on each of the red and blue sides. The simulated battlefield space is discretized into a 128 × 128 grid in which the fighters of both sides maneuver and attack each other. The air isomorphic formation command method based on the multi-agent PPO (proximal policy optimization) algorithm commands the 6 red fighters against the blue side, deciding which fighter flies to which grid cell and which blue aircraft agent it strikes.
As shown in fig. 1, an air isomorphic formation command method based on a multi-agent PPO algorithm includes: step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state, and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation command aircraft agent according to the sampling data in the sampling data set in the step 3;
Step 5, calculating the action network loss (loss_actor) and the evaluation network loss (loss_value) from the sampled data in the sampled data set obtained in step 3, computing the derivatives of the two loss values, and updating the parameters of the action network and the evaluation network by backward propagation;
Step 6, using the updated action network to output actions and interact with the simulated countermeasure environment, and continuing the sampling of step 3;
repeating the steps 3 to 6 until the action output by the action network meets the set requirement.
In step 1, the specific method for constructing the action network and the evaluation network is as follows:
The action network's input is the information of both the red and blue sides in the battlefield environment, organized as a 3 × 128 × 128 matrix; each of the 3 channels represents one type of battlefield environment feature information, as follows:
position feature matrix: the battlefield is abstracted as a 128 x 128 space, with 1 for blue aircraft, 2 for red aircraft agents, and 0 for no aircraft agents on each grid.
Heading matrix: in the grid space of 128 x 128, if an aircraft agent exists in one grid, the corresponding matrix element value is the heading of the aircraft agent, and the heading is divided into 360 degrees.
A marred state matrix: in the cell space of 128 x 128, if there is an aircraft agent in one cell, the corresponding matrix element value is the damage state of the aircraft agent, and the states are classified into good (denoted by 1), damaged (denoted by 0.5) and knocked down (denoted by 0.1).
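The three feature channels above can be sketched in plain Python; `build_state` and the sparse `units` mapping are hypothetical conveniences, while the side, heading, and damage codes follow the embodiment's conventions.

```python
def build_state(grid, units):
    """3-channel battlefield encoding: channel 0 = side code (1 blue,
    2 red, 0 empty), channel 1 = heading in degrees, channel 2 = damage
    state (1 good, 0.5 damaged, 0.1 knocked down).
    `units` maps (x, y) grid cells to (side, heading, damage) tuples."""
    state = [[[0.0] * grid for _ in range(grid)] for _ in range(3)]
    for (x, y), (side, heading, damage) in units.items():
        state[0][x][y] = side
        state[1][x][y] = heading
        state[2][x][y] = damage
    return state

# One blue fighter at (0, 1) heading 90, one damaged red fighter at (2, 3).
s = build_state(4, {(0, 1): (1, 90.0, 1.0), (2, 3): (2, 270.0, 0.5)})
assert s[0][2][3] == 2 and s[1][0][1] == 90.0 and s[2][2][3] == 0.5
```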
Based on the above input, as shown in fig. 2, the constructed action network architecture is as follows:
The first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels and learns a feature representation of the battlefield state information. The convolution layer's output is flattened (flat) and fed into the fully connected layer, then into a recurrent neural network (rnn) whose inputs are the current battlefield environment and the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step. This hidden-layer output is fed to the air-command target-assignment action (actor) network, whose direct output is turned by categorical discretization into a 128 × 128-dimensional probability distribution; each dimension represents the probability that an aircraft agent flies to the corresponding coordinate to attack. If an enemy aircraft agent is at the target coordinate, it will be attacked there; if no aircraft agent is there, the aircraft agent flies to the target coordinate. The probability distribution is passed through a mask calculation to form the final action output: the mask is a 128 × 128-dimensional vector, each element of which indicates whether an enemy aircraft agent is at the corresponding coordinate, and mask × categorical output becomes the final output.
The constructed evaluation network is used to evaluate the battlefield environment. Its first layer is likewise a convolutional network with 3 input channels and 8 output channels that learns a feature representation of the battlefield state information; the convolution layer's output is flattened and fed into the fully connected layer, then into a recurrent neural network whose inputs are the current battlefield environment and the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step. This hidden-layer output is fed to the core evaluation network, whose output is a one-dimensional vector.
In step 2, a playback buffer is constructed and initialized; its contents comprise s_share, s_o, hs_act, hs_critic, a, log p_a, and V(s_share).
s_share: the critic network's training input, i.e. the global battlefield environment, with dimension [length_episode, num_thread, num_agents, dim_s]; length_episode, the number of time steps in one combat round, is set to 1024; num_thread, the number of simulation environments running in parallel, is set to 6; num_agents, the number of friendly aircraft agents, is set to 5; dim_s, the dimension of the battlefield environment data in each time slice, is set to 128 × 128 × 3.
s_o: the battlefield environment input of each individual aircraft agent within the formation; its layout is the same as that of s_share.
hs_act: the intermediate output of the actor network's recurrent hidden layer, with dimension [length_episode, num_thread, num_agents, dim_hs_act]; dim_hs_act, the hidden layer's output dimension, is set to 256.
hs_critic: the intermediate output of the critic network's recurrent hidden layer, with dimension [length_episode, num_thread, num_agents, dim_hs_critic]; dim_hs_critic, the hidden layer's output dimension, is set to 256.
In step 3: a sampling data set for training is obtained by continuous interaction with the environment; the data set comprises s_share, s_o, hs_act, hs_critic, a, log p_a, V(s_share), r, and log π_θ.
In step 4: the advantage function is calculated:

Â_t = Σ_{l=0}^{T−t} γ^l · r_{t+l} − V(s_t^share)

After transformation, the dimension of the sampled data is [number of parallel environments, number of agents, number of steps in one command round, dimension observed in each step].
In step 5: the actor loss (loss_actor) is calculated:

loss_actor = −Ê_t[ min( r_t(θ) · Â_t , clip(r_t(θ), 1−ε, 1+ε) · Â_t ) ],  with  r_t(θ) = π_θ(a_t | s_t^o) / π_θ_old(a_t | s_t^o)

where clip(r_t(θ), 1−ε, 1+ε) is a truncation operation: if r_t(θ) is greater than 1+ε its value is set to 1+ε, if it is less than 1−ε its value is set to 1−ε, and if it lies within (1−ε, 1+ε) its value is retained; the value of ε is set empirically. Â_t, the advantage function, represents the advantage of the current aircraft agent selecting action a_t over selecting other actions; s_t^o is the local environment state information of the aircraft agent at time t; min(·, ·) takes the smaller of the two compared values; Ê_t[·] denotes taking the average over multiple rounds of calculation; π_θ(a_t | s_t^o) is the probability of the aircraft agent selecting action a_t under the latest parameters θ at time t; π_θ_old(a_t | s_t^o) is the probability of selecting a_t under the previous iteration's parameters θ_old; and r_t(θ) is the ratio of these two probabilities.
As described in step 5, the backward-propagation update of the parameters of the action network and the evaluation network is realized by computing the derivative of the loss value:

∇_θ loss_actor = −Ê_t[ ∇_θ log π_θ(a_t | s_t^o) · Â_t ]

where s_t^o is the local environment state information, a_t is the action information, log π_θ(a_t | s_t^o) is the log value of the action network's output probability for a_t, ∇_θ denotes the derivative with respect to the parameters θ, and Ê_t[·] denotes the time-t estimate obtained by averaging over multiple rounds of calculation.
In a specific implementation, the application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, can run the content of the air isomorphic formation command method based on the multi-agent PPO algorithm and some or all of the steps of each embodiment. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be apparent to those skilled in the art that the technical solutions in the embodiments of the present invention may be implemented by means of a computer program and a corresponding general hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention may be embodied essentially in the form of a computer program, i.e. a software product, which may be stored in a storage medium and include several instructions that cause a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer (MCU), a network device, etc.) to perform the methods described in the embodiments, or in some parts of the embodiments, of the present invention.
The invention provides an idea and a method for an air isomorphic formation command method based on a multi-agent PPO algorithm; there are many specific ways to implement the technical scheme, and the above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented using the prior art.
Claims (10)
1. An air isomorphic formation command method based on a multi-agent PPO algorithm is characterized by comprising the following steps:
step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state, and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function for the aircraft agents of the commanded formation according to the sampled data in the sampling data set of step 3;
Step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value according to the sampled data in the sampling data set obtained in step 3, calculating the derivative ∇loss of each loss value, and updating the parameters of the action network and the evaluation network by backward propagation according to the two loss values;
Step 6, outputting actions with the updated action network to interact with the simulated countermeasure environment, and continuing the sampling of step 3;
repeating the steps 3 to 6 until the action output by the action network meets the set requirement.
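As a rough illustration only, the repeated steps 3 to 6 loop (sample by interaction, estimate an advantage, update the two networks, sample again) can be sketched with a toy stand-in environment and scalar "networks". Every name, the toy dynamics, and the update rules below are assumptions for illustration, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(state, action):
    """Toy stand-in for the simulated countermeasure environment."""
    next_state = 0.9 * state + 0.1 * action          # assumed dynamics
    return next_state, float(-np.abs(next_state).sum())  # reward: stay near zero

def act(theta, state):
    """Toy stand-in for the action network: a bounded scalar action."""
    return float(np.tanh(theta * state.sum()))

theta, value = 0.5, 0.0          # "action network" / "evaluation network" params
state = rng.normal(size=4)
for _ in range(10):              # repeat steps 3-6
    samples = []
    for _ in range(16):          # step 3: interact with the environment, sample
        a = act(theta, state)
        state, r = env_step(state, a)
        samples.append(r)
    mean_return = float(np.mean(samples))
    advantage = mean_return - value   # step 4: return versus value estimate
    theta += 0.01 * advantage         # step 5: sketch of the two updates
    value += 0.1 * (mean_return - value)
```

The point is only the control flow: sampling and updating alternate until the policy output meets the set requirement.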
2. An air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 1, wherein step 1 comprises: inputting to the action network the local environment states around each aircraft agent in the air isomorphic formation; and inputting global environment states into the evaluation network, which integrates the global environment states of all the aircraft agents to evaluate the influence of each aircraft agent's actions on the overall target of the whole formation, wherein the overall target comprises eliminating all enemy forces or minimizing the overall damage to the formation.
3. An air isomorphic formation command method based on a multi-agent PPO algorithm according to claim 2, wherein in step 1, inputting a local environment state to the action network comprises:
inputting friend-or-foe command information of the air isomorphic formation into the action network, organized as an n×128×128 matrix, wherein each dimension of the n-dimensional matrix represents characteristic information of a command state, the characteristic information comprising:
a position feature matrix: the battlefield is abstracted into a 128×128 space, wherein each point is 1 if a friendly aircraft agent is present, 2 if an enemy aircraft agent is present, and 0 if no aircraft agent is present;
a heading state matrix: in the 128×128 matrix, if an aircraft agent is present at a point, the value at that point is the heading of the aircraft agent, divided over 360 degrees.
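A minimal sketch of this state encoding, assuming two channels (position features and heading features); the function name, tuple format and channel order are illustrative assumptions:

```python
import numpy as np

GRID = 128  # the claimed 128 x 128 battlefield abstraction

def encode_state(friendly, enemy):
    """friendly/enemy: lists of (x, y, heading_deg) tuples for aircraft agents.

    Channel 0: position features (0 empty, 1 friendly, 2 enemy).
    Channel 1: heading features (heading in degrees at occupied cells).
    """
    state = np.zeros((2, GRID, GRID), dtype=np.float32)
    for x, y, heading in friendly:
        state[0, x, y] = 1.0            # friendly aircraft agent present
        state[1, x, y] = heading % 360
    for x, y, heading in enemy:
        state[0, x, y] = 2.0            # enemy aircraft agent present
        state[1, x, y] = heading % 360
    return state
```

A call such as `encode_state([(3, 4, 90)], [(10, 10, 270)])` yields a 2×128×128 tensor of the kind the claim's n×128×128 matrix describes.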
4. An air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 3, wherein in step 1, the action network is constructed as follows:
The first layer is a convolution network: a 2-dimensional convolution layer with 3 input channels and 8 output channels, which performs feature-learning representation of the battlefield state information; the output of the convolution layer is flattened and input to a fully connected layer; a recurrent neural network follows, whose input is the current battlefield environment together with the hidden-layer output of the previous step, and whose output is the hidden-layer output of the current step; this hidden-layer output is input to the aerial command target-assignment action network, whose direct output is discretized into a 128×128-dimensional probability distribution, each dimension representing the probability that an aircraft agent flies to the corresponding coordinate to attack: if an enemy aircraft agent is present at the target coordinate, the aircraft agent attacks it, and if no enemy aircraft agent is present, the aircraft agent flies to the target coordinate; the probability distribution is then combined with a mask to form the final action output, the mask being a 128×128-dimensional vector in which each element indicates whether the corresponding coordinate holds an enemy aircraft agent.
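The final mask step can be sketched as follows; the function name, the fallback behavior when no enemy is present, and the use of a plain softmax are assumptions, not details taken from the claim:

```python
import numpy as np

def masked_action_distribution(logits, enemy_mask):
    """logits: (128*128,) raw action-network output over target coordinates.
    enemy_mask: (128*128,) of 0/1, 1 where an enemy aircraft agent is present.
    """
    z = logits - logits.max()               # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()     # softmax: probability per coordinate
    masked = probs * enemy_mask             # zero out coordinates with no enemy
    total = masked.sum()
    if total == 0.0:
        # assumed fallback: no enemy anywhere, keep the raw fly-to distribution
        return probs
    return masked / total                   # renormalized attack distribution
```

With this sketch, probability mass only remains on coordinates the mask marks as holding an enemy aircraft agent.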
5. The method for air isomorphic formation command based on the multi-agent PPO algorithm according to claim 4, wherein in step 1, the evaluation network is used for evaluating the battlefield environment, and, unlike the input of the action network, the input of the evaluation network is a 1-dimensional vector.
6. The method for air isomorphic formation command based on the multi-agent PPO algorithm according to claim 5, wherein in step 2, initializing the data for training of the action network and the evaluation network, and constructing and initializing a playback buffer, comprises:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where the subscript o indicates that the state information is local environment information; the action network hidden-layer information hs_act, where hs denotes network hidden-layer information and the subscript act indicates the action network; the evaluation network hidden-layer information hs_critic, where the subscript critic indicates the evaluation network; the action information a output by the aircraft agent; logp_a, the log value of p_a, the probability of the action a output by the aircraft agent; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input to the evaluation network.
7. An air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 6, wherein in step 2, the data comprises:
Global environment state information s_share: the input used by the evaluation network in training is the global battlefield environment; the dimension of the data is [length_episode, num_thread, num_agents, dim_s], wherein length_episode is the number of time steps of one combat round (length denotes time steps and episode denotes the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes a count and thread denotes the thread running the corresponding simulation environment); num_agents is the number of friendly aircraft agents (agents refers to the aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes the state information);
local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where the subscript o indicates local environment information; its battlefield environment data have the same dimensions as the global environment state information s_share;
hs_act: the intermediate output of the action network's recurrent neural hidden layer; the dimension of the data is [length_episode, num_thread, num_agents, dim_hs_act], wherein dim_hs_act is the output dimension of the hidden layer (dim is a dimension value and hs_act refers to the action network hidden layer);
hs_critic: the intermediate output of the evaluation network's recurrent neural hidden layer; the dimension of the data is [length_episode, num_thread, num_agents, dim_hs_critic], wherein dim_hs_critic is the output dimension of the hidden layer (hs_critic refers to the evaluation network hidden layer).
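The claimed buffer layout can be sketched by preallocating arrays with exactly these four axes; all concrete sizes below (episode length, thread count, agent count, feature dimensions) are illustrative placeholders, not values from the patent:

```python
import numpy as np

# Placeholder sizes for illustration only.
length_episode, num_thread, num_agents = 200, 8, 4
dim_s, dim_hs_act, dim_hs_critic = 256, 64, 64

# Playback buffer: one array per quantity, indexed
# [time step, parallel environment, aircraft agent, feature].
buffer = {
    "s_share":   np.zeros((length_episode, num_thread, num_agents, dim_s), np.float32),
    "s_o":       np.zeros((length_episode, num_thread, num_agents, dim_s), np.float32),
    "hs_act":    np.zeros((length_episode, num_thread, num_agents, dim_hs_act), np.float32),
    "hs_critic": np.zeros((length_episode, num_thread, num_agents, dim_hs_critic), np.float32),
}
```

Each time slice of a rollout then fills `buffer[name][t, thread, agent]` before the update pass reads the whole arrays.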
8. An air isomorphic formation command method based on the multi-agent PPO algorithm according to claim 7, wherein the data set obtained by the sampling in step 3 includes s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ; where r is the action-execution feedback obtained from the environment, log π_θ is the log value of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
9. The method for air isomorphic formation command based on the multi-agent PPO algorithm according to claim 8, wherein calculating the advantage function in step 4 comprises:
Â_t = Σ_l γ^l · r_{t+l} − V(s_share^t)

wherein Â_t represents the estimate of the advantage function at time t; s_share^t is the global environment state information at time t, and V(s_share^t) is the output value after the global environment state information at time t is input to the evaluation network; γ is the cumulative discount factor; l denotes the number of action steps after time t, and r_{t+l} denotes the reward value r fed back by the environment after t+l steps; V denotes the evaluation network.
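The advantage estimate above, a discounted reward sum minus the evaluation network's value output, can be computed directly; the function name and argument layout are assumptions:

```python
import numpy as np

def advantage_estimate(rewards, value_t, gamma=0.99):
    """A_hat_t = sum_l gamma**l * r_{t+l}  -  V(s_share^t).

    rewards: the sequence r_t, r_{t+1}, ... fed back by the environment.
    value_t: the evaluation network's output V(s_share^t) at time t.
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))   # gamma**0, gamma**1, ...
    return float(np.dot(discounts, rewards) - value_t)
```

For example, with rewards [1.0, 1.0], γ = 0.5 and V = 0.5, the estimate is 1.0 + 0.5 − 0.5.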
10. An air isomorphic formation command method based on the multi-agent PPO algorithm as claimed in claim 9, wherein the method for calculating the action network loss in step 5 comprises:
Wherein t is time t, clip (r t (θ), 1- ε,1+ε) is a truncation operation, and if the value of r t (θ) exceeds the range of (1- ε,1+ε), the value of r t (θ) is made to be 1+ε if the value of r t (θ) is greater than 1+ε, the value of r is made to be 1- ε if the value of r t (θ) is less than 1- ε, and the value of r is kept if the value of r is between (1- ε,1+ε); epsilon is a set value; an action a t representing the current aircraft agent selection relative to selecting other actions; representing local environment state information of the aircraft intelligent body at the moment t;
Representative get/> And/>A smaller value of the comparison; /(I)Representing taking the average of multiple rounds of calculation; /(I)A probability representing an action network selection action a t of the aircraft agent using the latest parameter θ at time t; /(I)The probability of action a t is selected by the action network representing the last round of iteration parameters theta old used by the aircraft agent at the moment t; r t (θ) is the ratio of the probability of computing the current action network selection a t to the probability of the action network selection a t of the previous iteration;
The method described in step 5 of calculating the derivative ∇loss of the loss value and updating the parameters of the action network and the evaluation network by backward propagation is as follows:

∇loss_actor = Ê_t[ ∇_θ log π_θ(a_t | s_o^t) · Â_t ]

wherein ∇loss_actor is the derivative of the action network loss value; ∇ denotes the derivative operation; s_o^t is the local environment state information and a_t is the action information; log π_θ(a_t | s_o^t) is the log value of π_θ(a_t | s_o^t), and ∇_θ log π_θ(a_t | s_o^t) is its derivative with respect to the parameters θ; Ê_t denotes estimating the value at time t by taking the average over multiple rounds of calculation.
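The clipped surrogate of claim 10 can be evaluated numerically from log-probabilities and advantages; the function name and the batched-array interface are assumptions:

```python
import numpy as np

def ppo_actor_objective(logp_new, logp_old, advantages, eps=0.2):
    """E_t[min(r_t(theta) * A_t, clip(r_t(theta), 1-eps, 1+eps) * A_t)].

    logp_new:  log pi_theta(a_t | s_o^t) under the latest parameters.
    logp_old:  log pi_theta_old(a_t | s_o^t) under the previous iteration.
    advantages: the estimates A_hat_t.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)               # truncation
    adv = np.asarray(advantages)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

With ratio 2 and advantage 1, the clip at 1+ε = 1.2 caps the term at 1.2; with ratio 0.5 and advantage −1, the pessimistic min keeps the clipped value −0.8, so large policy jumps never increase the objective.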
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210656190.7A CN115047907B (en) | 2022-06-10 | 2022-06-10 | Air isomorphic formation command method based on multi-agent PPO algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115047907A CN115047907A (en) | 2022-09-13 |
CN115047907B true CN115047907B (en) | 2024-05-07 |
Family
ID=83161154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210656190.7A Active CN115047907B (en) | 2022-06-10 | 2022-06-10 | Air isomorphic formation command method based on multi-agent PPO algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115047907B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115826627A (en) * | 2023-02-21 | 2023-03-21 | 白杨时代(北京)科技有限公司 | Method, system, equipment and storage medium for determining formation instruction |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112861442A (en) * | 2021-03-10 | 2021-05-28 | 中国人民解放军国防科技大学 | Multi-machine collaborative air combat planning method and system based on deep reinforcement learning |
CN113313267A (en) * | 2021-06-28 | 2021-08-27 | 浙江大学 | Multi-agent reinforcement learning method based on value decomposition and attention mechanism |
CN113625757A (en) * | 2021-08-12 | 2021-11-09 | 中国电子科技集团公司第二十八研究所 | Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism |
CN113791634A (en) * | 2021-08-22 | 2021-12-14 | 西北工业大学 | Multi-aircraft air combat decision method based on multi-agent reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109635917B (en) * | 2018-10-17 | 2020-08-25 | 北京大学 | Multi-agent cooperation decision and training method |
Non-Patent Citations (2)
Title |
---|
Research on UAV swarm attack-defense confrontation strategies based on multi-agent reinforcement learning; Xuan Shuzhe, Ke Liangjun; Radio Engineering; 2021-05-05; Vol. 51 (No. 05); full text *
Multi-agent-based formation cooperative flight-test simulation and effectiveness evaluation; Xia Qingjun; Zhang An; Zhang Yaozhong; Fire Control & Command Control; 2011-05-15 (No. 05); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||