CN115047907B - Air isomorphic formation command method based on multi-agent PPO algorithm - Google Patents

Air isomorphic formation command method based on multi-agent PPO algorithm

Info

Publication number
CN115047907B
Authority
CN
China
Prior art keywords
network
action
agent
value
output
Prior art date
Legal status
Active
Application number
CN202210656190.7A
Other languages
Chinese (zh)
Other versions
CN115047907A (en)
Inventor
汪亚斌
李友江
崔鹏
郭成昊
丁峰
易侃
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202210656190.7A priority Critical patent/CN115047907B/en
Publication of CN115047907A publication Critical patent/CN115047907A/en
Application granted granted Critical
Publication of CN115047907B publication Critical patent/CN115047907B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104 Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Separation Of Gases By Adsorption (AREA)
  • Flow Control (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps: constructing an action network that takes local environment states as input and an evaluation network that takes global environment states as input; initializing the local environment state, the global environment state, and the data caches required for training; interacting with the environment through the action network according to the local environment state; calculating an advantage function according to the global environment state; calculating the loss of the action network according to the advantage function and the loss of the evaluation network according to the evaluation network output, and updating both networks by backward propagation of the two loss values; and using the updated networks for further interaction with the environment. The method improves the formation's combined macroscopic and microscopic command capability, introduces the multi-agent PPO algorithm into the construction of formation command agents for the first time, and improves both the stability and the effectiveness of training the formation command agents.

Description

Air isomorphic formation command method based on multi-agent PPO algorithm
Technical Field
The invention relates to an air isomorphic formation command method, in particular to an air isomorphic formation command method based on a multi-agent PPO algorithm.
Background
At present, reinforcement learning is increasingly applied to simulated formation action training. Achieving this training goal generally requires constructing a neural network oriented to multiple agents as the learning network for deep reinforcement learning. One important link is the input structure of the action network and the evaluation network: the input is the basis of neural network learning, and an input representation suited to learning allows the neural network to learn quickly and efficiently.
A fundamental difference between many multi-agent algorithms lies in the input to the networks, and the representation of the input values is an important aspect of the overall algorithm. In some multi-agent algorithms both the action network and the evaluation network use global inputs, with the disadvantage that the algorithm ignores important local information.
Other multi-agent algorithms use local information for both networks; the disadvantage of training on local information is that the global situation cannot be taken into account.
Disclosure of Invention
Purpose of the invention: aiming at the defects of the prior art, the invention solves the technical problem of providing an air isomorphic formation command method based on a multi-agent PPO algorithm.
In order to solve the technical problems, the invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps:
step 1, constructing an action network for local environment state input and an evaluation network for global environment state input;
step 2, initializing the local environment state, the global environment state and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation command aircraft agent according to the sampling data in the sampling data set in the step 3;
Step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value according to the sampled data in the sampling data set obtained in step 3, computing the derivatives of the two loss values, and updating the parameters of the action network and the evaluation network by backward propagation;
Step 6, outputting actions and simulating countermeasure environment interaction by using the updated action network, and continuing to sample in the step 3;
repeating the steps 3 to 6 until the action output by the action network meets the set requirement.
In step 1, the local environment state around each aircraft agent in the air isomorphic formation is input to the action network; the global environment state is input to the evaluation network, which integrates the global environment states of all aircraft agents to evaluate the influence of each aircraft agent's actions on the overall formation objective, wherein the overall objective comprises eliminating all enemy forces or minimizing the overall damage to the formation.
In step 1, the local environment state is input to the action network by the following method:
Inputting the friend-and-foe situation information of the air isomorphic formation command into the action network and organizing it into an n×128×128 matrix, wherein each dimension of the n-dimensional matrix represents one type of feature information of the command state, and the feature information comprises:
Position feature matrix: the battlefield is abstracted into a 128×128 space, and each point is 1 if an enemy aircraft agent is present, 2 if a friendly aircraft agent is present, and 0 if no aircraft agent is present;
Heading matrix: in the 128×128 matrix, if an aircraft agent is present at a point, the value at that point is the heading of the aircraft agent, with the heading discretized into 360 degrees.
In step 1, the action network is constructed as follows:
The first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer learns a feature representation of the battlefield state information. The output of the convolution layer is flattened and fed into the fully connected layer, and then into a recurrent neural network whose input is the current battlefield environment together with the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step. The hidden-layer output is fed into the air-command target-assignment action network, whose direct output is turned into a 128×128-dimensional probability distribution through categorical discretization; each dimension represents the probability that an aircraft agent flies to the corresponding coordinate to attack. If an enemy aircraft agent is at the target coordinate, it is attacked; if no aircraft agent is there, the aircraft agent flies to the target coordinate. The probability distribution is passed through a mask calculation to form the final action output; the mask is a 128×128-dimensional vector, each element of which indicates whether an enemy aircraft agent is at the corresponding coordinate.
In step 1, the evaluation network is used to evaluate the battlefield environment; the input of the evaluation network is identical to the input of the action network, and the output is a 1-dimensional vector.
In step 2, the data used for training the action network and the evaluation network are initialized, and a playback buffer is constructed and initialized, including:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where the subscript o indicates that the state information is local environment information; the action network hidden-layer information hs_act, where hs denotes hidden-layer information and the subscript act indicates the action network; the evaluation network hidden-layer information hs_critic, where the subscript critic indicates the evaluation network; the action information a output by the aircraft agent; the probability p_a of the action a output by the aircraft agent and its log value logp_a; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input into the evaluation network.
In step 2, the data includes:
Global environment state information s_share: the input used by the evaluation network in training, i.e. the global battlefield environment; the dimension of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the number of time steps in one round of combat (length denotes the time step and episode the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes the number and thread the thread running the corresponding simulation environment); num_agents is the number of friendly aircraft agents (agents refers to the aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes the state information);
Local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where s denotes state information and the subscript o indicates local environment information; the dimensions of the battlefield environment data are the same as those of the global environment state information s_share;
hs_act: the intermediate output of the recurrent hidden layer of the action network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer (dim is a dimension value and hs_act refers to the action network hidden layer);
hs_critic: the intermediate output of the recurrent hidden layer of the evaluation network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer and hs_critic refers to the evaluation network hidden layer.
The data set obtained by the sampling in step 3 includes s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ, where r is the action-execution feedback obtained from the environment (e.g. the number of hostile units destroyed), log π_θ is the log value of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
The advantage function in step 4 is calculated as:

\hat{A}_t = \sum_{l} \gamma^{l} r_{t+l} - V(s_{share}^{t})

where \hat{A}_t denotes the estimated value of the advantage function at time t; s_{share}^{t} is the global environment state information at time t, and V(s_{share}^{t}) is the output value after the global environment state information at time t is input into the evaluation network; γ is the cumulative discount value; l denotes the number of action steps after time t; r_{t+l} denotes the reward value r fed back by the environment after t+l steps; and V denotes the evaluation network.
The action network loss in step 5 is calculated as:

loss_{actor} = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_o^t)}{\pi_{\theta_{old}}(a_t \mid s_o^t)}

where t is the time step; clip(r_t(θ), 1−ε, 1+ε) is a truncation operation: if r_t(θ) exceeds the range (1−ε, 1+ε), its value is set to 1+ε when it is greater than 1+ε and to 1−ε when it is less than 1−ε, and it is kept unchanged when it lies within (1−ε, 1+ε); ε is a set value; \hat{A}_t is the advantage of the action a_t selected by the current aircraft agent relative to selecting other actions; s_o^t denotes the local environment state information of the aircraft agent at time t;
min(·,·) takes the smaller of the two compared values; \hat{E}_t denotes taking the average over multiple rounds of calculation; π_θ(a_t | s_o^t) is the probability that the action network with the latest parameters θ selects action a_t at time t; π_{θ_old}(a_t | s_o^t) is the probability that the action network with the previous-iteration parameters θ_old selects action a_t at time t; and r_t(θ) is the ratio of the probability of the current action network selecting a_t to that of the previous-iteration action network selecting a_t.
The backward-propagation update of the parameters of the action network and the evaluation network by computing the derivative of the loss value, as described in step 5, is realised as:

\nabla_{\theta}\, loss_{actor} = \hat{E}_t\left[\hat{A}_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_o^t)\right]

where \nabla_{\theta}\, loss_{actor} is the derivative of the action network loss value, \nabla_{\theta} denotes the derivative operation, s_o is the local environment state information, a_t is the action information, \log \pi_{\theta}(a_t \mid s_o^t) is the log value of \pi_{\theta}(a_t \mid s_o^t), \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_o^t) is its derivative, and \hat{E}_t denotes the value used to estimate time t by taking the average over multiple rounds of calculation.
The beneficial effects are that:
Aiming at the application scenario in which the formation-command simulation training agent requires long training time and exhibits weak stability, the invention constructs the formation command agent from the perspective of macroscopic command and adopts a combined macroscopic and microscopic mode: the evaluation network takes global environment state data as input while the action network takes the local environment data around a single aircraft agent as input, which improves the macroscopic and microscopic control capability of the formation. The multi-agent PPO (proximal policy optimization) algorithm is introduced for the first time into the construction of the command agent, which improves the training stability of the formation command agent.
The invention constructs an air isomorphic formation command method based on the multi-agent PPO algorithm. The evaluation network uses global information, so the algorithm has the capability of evaluating the global situation; the input of the action network is local information, so the agent can focus on learning countermeasures from local information; and the evaluation network, by taking the global information as input and evaluating the global environment state, guides the agent to select actions favourable to the global environment state.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic diagram of the overall process of the present invention.
Fig. 2 is a structural diagram of the action network.
Detailed Description
The following describes an embodiment using a simulated engagement scenario with 6 fighters on each of the red and blue sides. The simulated battlefield space is discretized into a 128 x 128 grid space in which the fighters of both the red and blue sides maneuver and attack each other. The air isomorphic formation command method based on the multi-agent PPO (proximal policy optimization) algorithm controls the 6 fighters on the red side and the blue side as they fight against each other, and decides which fighter flies to which grid cell and which blue aircraft agent it strikes.
As shown in fig. 1, the air isomorphic formation command method based on the multi-agent PPO algorithm includes: step 1, constructing an action network for local environment state input and an evaluation network for global environment state input;
step 2, initializing the local environment state, the global environment state and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation command aircraft agent according to the sampling data in the sampling data set in the step 3;
Step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value according to the sampled data in the sampling data set obtained in step 3, computing the derivatives of the two loss values, and updating the parameters of the action network and the evaluation network by backward propagation;
Step 6, outputting actions and simulating countermeasure environment interaction by using the updated action network, and continuing to sample in the step 3;
repeating steps 3 to 6 until the action output by the action network meets the set requirement; a simplified sketch of this training loop is given below.
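For illustration only, the following is a minimal Python sketch of one training iteration for steps 3 to 6. The objects `env`, `actor`, `critic`, `buffer` and `optimizer`, and the method names called on them, are assumptions of this sketch rather than interfaces defined by the patent:

```python
def train_formation_commander(env, actor, critic, buffer, optimizer,
                              num_iterations=100, length_episode=1024,
                              gamma=0.99, clip_eps=0.2):
    """Sketch of steps 3-6: sample, compute advantages, update both networks, repeat."""
    for _ in range(num_iterations):
        # Step 3: interact with the simulated adversarial environment and sample data.
        local_obs, global_state = env.reset()
        hs_act, hs_critic = actor.initial_hidden(), critic.initial_hidden()
        buffer.clear()
        for _ in range(length_episode):
            action, logp_a, hs_act = actor.act(local_obs, hs_act)          # local-state input
            value, hs_critic = critic.evaluate(global_state, hs_critic)    # global-state input
            local_obs, global_state, reward, done = env.step(action)
            buffer.add(local_obs, global_state, hs_act, hs_critic,
                       action, logp_a, value, reward)
        # Step 4: advantage estimates from rewards and the global critic values.
        advantages = buffer.compute_advantages(gamma)
        # Step 5: actor / critic losses and backward-propagation parameter update.
        loss_actor, loss_value = buffer.ppo_losses(actor, critic, advantages, clip_eps)
        optimizer.zero_grad()
        (loss_actor + loss_value).backward()
        optimizer.step()
        # Step 6: the updated action network is reused for the next round of sampling.
```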
In step 1, the specific method for constructing the action network and the evaluation network is as follows:
The action network takes as input the information of both the red and blue sides in the battlefield environment, organized into a 3×128×128 matrix, where each dimension of the 3-dimensional matrix represents one type of battlefield environment feature information; the features comprise the following (a sketch of how this state tensor can be assembled is given after the list):
Position feature matrix: the battlefield is abstracted into a 128 x 128 space; on each grid cell the value is 1 for a blue aircraft agent, 2 for a red aircraft agent, and 0 when no aircraft agent is present.
Heading matrix: in the 128 x 128 grid space, if an aircraft agent is present in a cell, the corresponding matrix element value is the heading of that aircraft agent, with the heading discretized into 360 degrees.
Damage state matrix: in the 128 x 128 grid space, if an aircraft agent is present in a cell, the corresponding matrix element value is the damage state of that aircraft agent; the states are classified as intact (denoted by 1), damaged (denoted by 0.5) and shot down (denoted by 0.1).
Based on the above inputs, as shown in fig. 2, the constructed action network architecture is as follows:
The first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer learns a feature representation of the battlefield state information. The output of the convolution layer is fed into the fully connected layer after a flattening (flatten) operation, and then into a recurrent neural network (RNN) whose input is the current battlefield environment together with the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step. The hidden-layer output is fed into the air-command target-assignment (actor) network, whose direct output is turned into a 128 x 128-dimensional probability distribution through categorical discretization; each dimension represents the probability that an aircraft agent flies to the corresponding coordinate to attack. If an enemy aircraft agent is at the target coordinate, it is attacked; if no aircraft agent is there, the aircraft agent flies to the target coordinate. The probability distribution is passed through a mask calculation to form the final action output: the mask is a 128 x 128-dimensional vector, each element of which indicates whether an enemy aircraft agent is at the corresponding coordinate, and the product of the mask and the categorical output becomes the final output.
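A minimal PyTorch sketch of this action network structure follows. The channel counts (3 in, 8 out), the 128×128 categorical output, the masking step and the 256-dimensional hidden layer come from the description; the convolution kernel size, stride and activation functions are assumptions of the sketch:

```python
import torch
import torch.nn as nn

class ActorNetwork(nn.Module):
    """Action network: Conv2d (3->8) -> flatten -> fully connected -> RNN -> masked 128x128 categorical."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)   # 3 input channels, 8 output channels
        self.fc = nn.Linear(8 * 64 * 64, hidden_dim)                      # flatten, then fully connected
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)                     # current state + previous hidden output
        self.head = nn.Linear(hidden_dim, 128 * 128)                      # one logit per grid coordinate

    def forward(self, obs, hs, mask):
        # obs: [batch, 3, 128, 128]; hs: [batch, hidden_dim] or None; mask: [batch, 128*128] in {0, 1}
        x = torch.relu(self.conv(obs))
        x = torch.relu(self.fc(x.flatten(1)))
        hs = self.rnn(x, hs)
        probs = torch.softmax(self.head(hs), dim=-1) * mask               # mask x categorical output
        probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-8)   # renormalise (mask must not be all zero)
        dist = torch.distributions.Categorical(probs=probs)
        return dist, hs
```

An action is then obtained as `a = dist.sample()`, which indexes a target grid coordinate, and `logp_a = dist.log_prob(a)` is stored for the PPO ratio used in step 5.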
The constructed evaluation network is used to evaluate the battlefield environment. Its first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer learns a feature representation of the battlefield state information. The output of the convolution layer is fed into the fully connected layer after a flattening (flatten) operation, and then into a recurrent neural network (RNN) whose input is the current battlefield environment together with the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step. The hidden-layer output is fed into the core evaluation network, and the evaluation network output is a one-dimensional vector.
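A corresponding PyTorch sketch of the evaluation (critic) network, under the same assumptions about kernel size and stride, shares the convolution-flatten-FC-RNN trunk but ends in a one-dimensional value output:

```python
import torch
import torch.nn as nn

class CriticNetwork(nn.Module):
    """Evaluation network: same Conv2d -> flatten -> FC -> RNN trunk, one-dimensional value output V(s_share)."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(8 * 64 * 64, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.value_head = nn.Linear(hidden_dim, 1)                        # 1-dimensional evaluation output

    def forward(self, global_state, hs):
        # global_state: [batch, 3, 128, 128]; hs: [batch, hidden_dim] or None
        x = torch.relu(self.conv(global_state))
        x = torch.relu(self.fc(x.flatten(1)))
        hs = self.rnn(x, hs)
        return self.value_head(hs), hs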
In step 2, a playback buffer is constructed and initialized; the contents of the buffer comprise s_share, s_o, hs_act, hs_critic, a, logp_a and V(s_share), and a preallocation sketch follows the list below.
s_share: the input used by the critic network in training, i.e. the global battlefield environment; the dimension of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the number of time steps of one round of combat and is set to 1024 steps; num_thread is the number of simulation environments running in parallel and is set to 6; num_agents is the number of friendly aircraft agents and is set to 5; dim_s is the dimension of the battlefield environment data in each time slice and is set to 128×128×3.
s_o: the battlefield environment input of each individual aircraft agent within the formation; the battlefield environment data are the same as s_share.
hs_act: the intermediate output of the recurrent hidden layer of the actor network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer and is set to 256.
hs_critic: the intermediate output of the recurrent hidden layer of the critic network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer and is set to 256.
In step 3, the action network continuously interacts with the environment to obtain a sampling data set for training; the data set comprises s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ.
In step 4, the advantage function is calculated as \hat{A}_t = \sum_{l} \gamma^{l} r_{t+l} - V(s_{share}^{t}), where the sum runs over the action steps l after time t.
The dimension of the sampled data after transformation is [number of parallel environments, number of agents, number of steps of one command round, observation dimension of each step].
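A minimal NumPy sketch of this advantage calculation, assuming the reward and value arrays have been sampled with the shape [length_episode, num_thread, num_agents] and γ = 0.99 as an assumed setting:

```python
import numpy as np

def compute_advantages(rewards, values, gamma=0.99):
    """Advantage estimate A_t = sum_l gamma^l * r_{t+l} - V(s_share^t)."""
    returns = np.zeros_like(rewards)
    running = np.zeros_like(rewards[0])
    for t in reversed(range(rewards.shape[0])):      # discounted reward-to-go, accumulated backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values                          # subtract the critic's value estimate
```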
In step 5, the actor loss (loss_actor) is calculated as:

loss_{actor} = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_o^t)}{\pi_{\theta_{old}}(a_t \mid s_o^t)}

where clip(r_t(θ), 1−ε, 1+ε) is a truncation operation: if r_t(θ) exceeds the range (1−ε, 1+ε), its value is set to 1+ε when it is greater than 1+ε and to 1−ε when it is less than 1−ε, and it is kept unchanged when it lies within (1−ε, 1+ε); the value of ε is set empirically. \hat{A}_t is the advantage function and represents the advantage of the action a_t selected by the current aircraft agent over selecting other actions. s_o^t denotes the local environment state information of the aircraft agent at time t. min(·,·) takes the smaller of the two compared values. \hat{E}_t denotes taking the average over multiple rounds of calculation. π_θ(a_t | s_o^t) is the probability that the aircraft agent selects action a_t using the latest parameters θ at time t. π_{θ_old}(a_t | s_o^t) is the probability that the aircraft agent selects action a_t at time t using the action network with the previous-iteration parameters θ_old. r_t(θ) is the ratio of the probability of the current action network selecting a_t to that of the previous-iteration action network selecting a_t.
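The clipped surrogate above maps directly onto a few lines of PyTorch; the sketch below negates the expression so it can be minimised with gradient descent, and ε = 0.2 is an assumed setting:

```python
import torch

def actor_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO actor loss built from log pi_theta(a_t|s_o^t), log pi_theta_old(a_t|s_o^t) and A_t."""
    ratio = torch.exp(logp_new - logp_old)                              # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)        # clip(r_t, 1-eps, 1+eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)     # smaller of the two terms
    return -surrogate.mean()                                            # average over samples, negated for descent
```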
The backward-propagation update of the parameters of the action network and the evaluation network by computing the derivative of the loss value, as described in step 5, is realised as:

\nabla_{\theta}\, loss_{actor} = \hat{E}_t\left[\hat{A}_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_o^t)\right]

where s_o is the local environment state information, a_t is the action information, \log \pi_{\theta}(a_t \mid s_o^t) is the log value of \pi_{\theta}(a_t \mid s_o^t), and \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_o^t) is its derivative with respect to θ.
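In an implementation, the derivative computation and parameter update correspond to an autograd backward pass followed by an optimizer step. In the sketch below the optimizers are assumed to have been created once for the two networks, and the mean-squared-error critic loss is an assumption of the example:

```python
def backprop_update(loss_actor, values_pred, returns, actor_optim, critic_optim):
    """Backward-propagation update of the action and evaluation networks from their loss values."""
    loss_value = ((values_pred - returns) ** 2).mean()   # assumed MSE loss for the evaluation network

    actor_optim.zero_grad()
    loss_actor.backward()                                # autograd evaluates the gradient formula above
    actor_optim.step()                                   # update action-network parameters theta

    critic_optim.zero_grad()
    loss_value.backward()
    critic_optim.step()                                  # update evaluation-network parameters
    return loss_value.item()
```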
In a specific implementation, the application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, can carry out the inventive content of the air isomorphic formation command method based on the multi-agent PPO algorithm and some or all of the steps of each embodiment. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be apparent to those skilled in the art that the technical solutions in the embodiments of the present invention may be implemented by means of a computer program and a corresponding general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied essentially in the form of a computer program, i.e. a software product, which may be stored in a storage medium and includes several instructions that cause a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer (MCU), a network device, or the like) to perform the methods described in the embodiments or in some parts of the embodiments of the present invention.
The invention provides an idea and a method for an air isomorphic formation command method based on the multi-agent PPO algorithm; there are numerous specific methods and ways of realising the technical scheme, and the above is only a preferred embodiment of the invention. It should be pointed out that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented by using the prior art.

Claims (10)

1. An air isomorphic formation command method based on a multi-agent PPO algorithm is characterized by comprising the following steps:
step 1, constructing an action network for local environment state input and an evaluation network for global environment state input;
step 2, initializing the local environment state, the global environment state and other data for training, wherein the other data for training comprises the intermediate hidden-layer information of the action network and the evaluation network;
Step 3, collecting environment state data from the simulation countermeasure environment, wherein the environment state data consists of local environment state data and global environment state data; inputting the environmental state data into an action network, outputting actions by the action network and issuing the actions to the simulated countermeasure environment; the simulation countermeasure environment changes the environment state data after receiving the action, and returns the changed environment state data to the action network; the action network outputs formation control instructions, constantly interacts with the simulation countermeasure environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation command aircraft agent according to the sampling data in the sampling data set in the step 3;
Step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value according to the sampled data in the sampling data set obtained in step 3, computing the derivatives of the two loss values, and updating the parameters of the action network and the evaluation network by backward propagation;
Step 6, outputting actions and simulating countermeasure environment interaction by using the updated action network, and continuing to sample in the step 3;
repeating the steps 3 to 6 until the action output by the action network meets the set requirement.
2. The air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 1, wherein step 1 comprises: inputting the local environment state around each aircraft agent in the air isomorphic formation to the action network; and inputting the global environment state into the evaluation network, which integrates the global environment states of all aircraft agents to evaluate the influence of each aircraft agent's actions on the overall formation objective, wherein the overall objective comprises eliminating all enemy forces or minimizing the overall damage to the formation.
3. The air isomorphic formation command method based on a multi-agent PPO algorithm according to claim 2, wherein in step 1, the local environment state is input to the action network by the following method:
inputting the friend-and-foe situation information of the air isomorphic formation command into the action network and organizing it into an n×128×128 matrix, wherein each dimension of the n-dimensional matrix represents one type of feature information of the command state, and the feature information comprises:
a position feature matrix: the battlefield is abstracted into a 128×128 space, and each point is 1 if an enemy aircraft agent is present, 2 if a friendly aircraft agent is present, and 0 if no aircraft agent is present;
a heading matrix: in the 128×128 matrix, if an aircraft agent is present at a point, the value at that point is the heading of the aircraft agent, with the heading discretized into 360 degrees.
4. An air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 3, wherein in step 1, the action network is constructed as follows:
The first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer learns a feature representation of the battlefield state information; the output of the convolution layer is flattened and fed into the fully connected layer, and then into a recurrent neural network whose input is the current battlefield environment together with the hidden-layer output of the previous step and whose output is the hidden-layer output of the current step; the hidden-layer output is fed into the air-command target-assignment action network, whose direct output is turned into a 128×128-dimensional probability distribution through categorical discretization, each dimension representing the probability that an aircraft agent flies to the corresponding coordinate to attack; if an enemy aircraft agent is at the target coordinate, it is attacked, and if no aircraft agent is there, the aircraft agent flies to the target coordinate; the probability distribution is passed through a mask calculation to form the final action output; the mask is a 128×128-dimensional vector, each element of which indicates whether an enemy aircraft agent is at the corresponding coordinate.
5. The method for air isomorphic formation command based on a multi-agent PPO algorithm according to claim 4, wherein in step 1, the evaluation network is used for evaluating the battlefield environment, the input of the evaluation network is identical to the input of the action network, and the output of the evaluation network is a 1-dimensional vector.
6. The method for air isomorphic formation command based on a multi-agent PPO algorithm according to claim 5, wherein in step 2, initializing the data used for training the action network and the evaluation network and constructing and initializing a playback buffer comprises:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where the subscript o indicates that the state information is local environment information; the action network hidden-layer information hs_act, where hs denotes hidden-layer information and the subscript act indicates the action network; the evaluation network hidden-layer information hs_critic, where the subscript critic indicates the evaluation network; the action information a output by the aircraft agent; the probability p_a of the action a output by the aircraft agent and its log value logp_a; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input into the evaluation network.
7. An air isomorphic formation command method based on a multi-agent PPO algorithm as claimed in claim 6, wherein in step 2, the data comprises:
Global environment state information s_share: the input used by the evaluation network in training, i.e. the global battlefield environment; the dimension of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the number of time steps in one round of combat (length denotes the time step and episode the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes the number and thread the thread running the corresponding simulation environment); num_agents is the number of friendly aircraft agents (agents refers to the aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes the state information);
local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where s denotes state information and the subscript o indicates local environment information; the dimensions of the battlefield environment data are the same as those of the global environment state information s_share;
hs_act: the intermediate output of the recurrent hidden layer of the action network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer (dim is a dimension value and hs_act refers to the action network hidden layer);
hs_critic: the intermediate output of the recurrent hidden layer of the evaluation network; the dimension of the data is [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer and hs_critic refers to the evaluation network hidden layer.
8. The air isomorphic formation command method based on a multi-agent PPO algorithm according to claim 7, wherein the data set obtained by the sampling in step 3 includes s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ, where r is the action-execution feedback obtained from the environment, log π_θ is the log value of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
9. The method for air isomorphic formation command based on a multi-agent PPO algorithm according to claim 8, wherein the advantage function in step 4 is calculated as:

\hat{A}_t = \sum_{l} \gamma^{l} r_{t+l} - V(s_{share}^{t})

where \hat{A}_t denotes the estimated value of the advantage function at time t; s_{share}^{t} is the global environment state information at time t, and V(s_{share}^{t}) is the output value after the global environment state information at time t is input into the evaluation network; γ is the cumulative discount value; l denotes the number of action steps after time t; r_{t+l} denotes the reward value r fed back by the environment after t+l steps; and V denotes the evaluation network.
10. The air isomorphic formation command method based on a multi-agent PPO algorithm according to claim 9, wherein the action network loss in step 5 is calculated as:

loss_{actor} = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_o^t)}{\pi_{\theta_{old}}(a_t \mid s_o^t)}

where t is the time step; clip(r_t(θ), 1−ε, 1+ε) is a truncation operation: if r_t(θ) exceeds the range (1−ε, 1+ε), its value is set to 1+ε when it is greater than 1+ε and to 1−ε when it is less than 1−ε, and it is kept unchanged when it lies within (1−ε, 1+ε); ε is a set value; \hat{A}_t is the advantage of the action a_t selected by the current aircraft agent relative to selecting other actions; s_o^t denotes the local environment state information of the aircraft agent at time t;
min(·,·) takes the smaller of the two compared values; \hat{E}_t denotes taking the average over multiple rounds of calculation; π_θ(a_t | s_o^t) is the probability that the action network with the latest parameters θ selects action a_t at time t; π_{θ_old}(a_t | s_o^t) is the probability that the action network with the previous-iteration parameters θ_old selects action a_t at time t; and r_t(θ) is the ratio of the probability of the current action network selecting a_t to that of the previous-iteration action network selecting a_t;
the backward-propagation update of the parameters of the action network and the evaluation network by computing the derivative of the loss value, as described in step 5, is realised as:

\nabla_{\theta}\, loss_{actor} = \hat{E}_t\left[\hat{A}_t \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_o^t)\right]

where \nabla_{\theta}\, loss_{actor} is the derivative of the action network loss value, \nabla_{\theta} denotes the derivative operation, s_o is the local environment state information, a_t is the action information, \log \pi_{\theta}(a_t \mid s_o^t) is the log value of \pi_{\theta}(a_t \mid s_o^t), \nabla_{\theta}\log \pi_{\theta}(a_t \mid s_o^t) is its derivative, and \hat{E}_t denotes the value used to estimate time t by taking the average over multiple rounds of calculation.
CN202210656190.7A 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm Active CN115047907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210656190.7A CN115047907B (en) 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210656190.7A CN115047907B (en) 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm

Publications (2)

Publication Number Publication Date
CN115047907A (en) 2022-09-13
CN115047907B true CN115047907B (en) 2024-05-07

Family

ID=83161154

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210656190.7A Active CN115047907B (en) 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm

Country Status (1)

Country Link
CN (1) CN115047907B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113625757A (en) * 2021-08-12 2021-11-09 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635917B (en) * 2018-10-17 2020-08-25 北京大学 Multi-agent cooperation decision and training method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113625757A (en) * 2021-08-12 2021-11-09 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on attack-defense confrontation strategies for UAV swarms based on multi-agent reinforcement learning; 轩书哲, 柯良军; Radio Engineering (无线电工程); 2021-05-05; Vol. 51, No. 05; full text *
Multi-agent-based simulation and effectiveness evaluation of formation cooperative flight tests; 夏庆军, 张安, 张耀中; Fire Control & Command Control (火力与指挥控制); 2011-05-15; No. 05; full text *

Also Published As

Publication number Publication date
CN115047907A (en) 2022-09-13

Similar Documents

Publication Publication Date Title
Xin et al. Efficient decision makings for dynamic weapon-target assignment by virtual permutation and tabu search heuristics
Schultz et al. Improving tactical plans with genetic algorithms
CN110083971B (en) Self-explosion unmanned aerial vehicle cluster combat force distribution method based on combat deduction
CN108549402A (en) Unmanned aerial vehicle group method for allocating tasks based on quantum crow group hunting mechanism
CN109190978A (en) A kind of unmanned plane resource allocation methods based on quantum flock of birds mechanism of Evolution
CN107330560A (en) A kind of multitask coordinated distribution method of isomery aircraft for considering temporal constraint
Ming et al. Improved discrete mapping differential evolution for multi-unmanned aerial vehicles cooperative multi-targets assignment under unified model
CN112600795B (en) Method and system for collapsing combat network under incomplete information
CN113625569B (en) Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
Wang et al. UAV swarm confrontation using hierarchical multiagent reinforcement learning
CN115047907B (en) Air isomorphic formation command method based on multi-agent PPO algorithm
Lee et al. Autonomous control of combat unmanned aerial vehicles to evade surface-to-air missiles using deep reinforcement learning
CN113741186B (en) Double-aircraft air combat decision-making method based on near-end strategy optimization
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
Xianyong et al. Research on maneuvering decision algorithm based on improved deep deterministic policy gradient
CN111797966B (en) Multi-machine collaborative global target distribution method based on improved flock algorithm
CN113625767A (en) Fixed-wing unmanned aerial vehicle cluster collaborative path planning method based on preferred pheromone gray wolf algorithm
CN114911269B (en) Networking radar interference strategy generation method based on unmanned aerial vehicle group
Wang et al. Cooperatively pursuing a target unmanned aerial vehicle by multiple unmanned aerial vehicles based on multiagent reinforcement learning
CN113324545A (en) Multi-unmanned aerial vehicle collaborative task planning method based on hybrid enhanced intelligence
Zhao et al. Deep Reinforcement Learning‐Based Air Defense Decision‐Making Using Potential Games
Hao et al. Flight Trajectory Prediction Using an Enhanced CNN-LSTM Network
CN117590757B (en) Multi-unmanned aerial vehicle cooperative task allocation method based on Gaussian distribution sea-gull optimization algorithm
CN115695209B (en) Graph model-based anti-control unmanned aerial vehicle bee colony assessment method
Li et al. Research on stealthy UAV path planning based on improved genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant