CN115047907A - Air isomorphic formation command method based on multi-agent PPO algorithm - Google Patents

Air isomorphic formation command method based on multi-agent PPO algorithm

Info

Publication number: CN115047907A (application CN202210656190.7A)
Authority: CN (China)
Prior art keywords: network, action, environment, information, output
Legal status: Granted; active
Priority/filing date: 2022-06-10
Publication date: 2022-09-13 (CN115047907A); granted 2024-05-07 as CN115047907B
Other languages: Chinese (zh)
Other versions: CN115047907B
Inventors: 汪亚斌, 李友江, 崔鹏, 郭成昊, 丁峰, 易侃
Assignee (original and current): CETC 28 Research Institute
Application filed by CETC 28 Research Institute

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104: Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Separation Of Gases By Adsorption (AREA)
  • Flow Control (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps: constructing an action network that takes the local environment state as input and an evaluation network that takes the global environment state as input; initializing the local environment state, the global environment state and the data caches required for training; interacting with the environment through the action network according to the local environment state; calculating an advantage function according to the global environment state; calculating the loss of the action network from the advantage function and the loss of the evaluation network from the evaluation network output, and back-propagating the two loss values to update the networks; and interacting with the environment using the updated networks. The method combines macroscopic and microscopic information to improve the action command capability of the formation, introduces the multi-agent PPO algorithm into the construction of formation-commanding agents for the first time, improves the stability of training the formation command agent, and thereby improves the training effect.

Description

Air isomorphic formation command method based on multi-agent PPO algorithm
Technical Field
The invention relates to an air isomorphic formation command method, in particular to an air isomorphic formation command method based on a multi-agent PPO algorithm.
Background
At present, reinforcement learning is increasingly widely applied to simulated formation action training. To achieve the training purpose, a neural network oriented to multiple agents generally needs to be constructed as the learning network for deep reinforcement learning. An important link in reinforcement learning is the input structure of the action network and the evaluation network: the input is the basis of neural network learning, and an input representation suitable for learning enables the neural network to learn quickly and efficiently.
The fundamental difference between many multi-agent algorithms lies in the input to the network, and the representation of the input values is an important aspect of the overall algorithm. The action network and evaluation network of some multi-agent algorithms use global inputs, which has the disadvantage that the algorithm ignores some important local information.
Another class of multi-agent algorithms uses local information for both networks; the disadvantage of training only on local information is that the global situation cannot be taken into account.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, the invention solves the technical problem of providing an air isomorphic formation command method based on a multi-agent PPO algorithm.
In order to solve the technical problem, the invention discloses an air isomorphic formation command method based on a multi-agent PPO algorithm, which comprises the following steps:
step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state and other data for training, wherein the other data for training comprises intermediate hidden layer information of the action network and the evaluation network;
step 3, collecting environmental state data from the simulation confrontation environment, wherein the environmental state data consists of local environmental state data and global environmental state data; inputting the environment state data into the action network, outputting the action by the action network and issuing the action to the simulation confrontation environment; the simulation confrontation environment changes the environment state data after receiving the action and returns the changed environment state data to the action network; the action network outputs a formation control instruction, continuously interacts with a simulation confrontation environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation-commanding aircraft agent according to the sampled data in the sampling data set of step 3;
step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value from the sampled data in the sampling data set obtained after the sampling in step 3, and back-propagating according to the two loss values to update the action network and the evaluation network: the derivative ∇_θ loss of the loss value is calculated, realizing back-propagation updating of the parameters of the action network and the evaluation network;
step 6, using the updated action network to output actions and interact with the simulation confrontation environment, and continuing the sampling of step 3;
and repeating steps 3 to 6 until the actions output by the action network meet the set requirement.
In step 1, the local environment state around each aircraft agent in the air isomorphic formation is input to the action network; the global environment state is input to the evaluation network, which integrates the global environment states of all the aircraft agents and evaluates the influence of each aircraft agent's action on the overall objective of the whole formation, wherein the overall objective comprises destroying all enemy forces or ensuring that the overall damage to the formation is minimized.
In step 1, the local environment state is input to the action network by the following method:
inputting the friend-and-foe command information of the air isomorphic formation into the action network and organizing it into an n × 128 × 128 matrix, wherein each of the n channels represents one type of characteristic information of the command state, and the characteristic information comprises:
a position feature matrix: the battlefield is abstracted into a 128 × 128 space, where each point is 1 if an enemy aircraft agent is present, 2 if a friendly aircraft agent is present, and 0 if no aircraft agent is present;
a heading matrix: in the 128 × 128 matrix, each point, if an aircraft agent is present there, takes the heading of that aircraft agent, the heading being divided into 360 degrees;
a damaged state matrix: in the 128 × 128 matrix, each point, if an aircraft agent is present there, takes the damage state of that aircraft agent.
In step 1, the action network is constructed as follows:
the first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer performs a feature-learning representation of the battlefield friend-and-foe state information; the output of the convolution layer is flattened and input into a fully connected layer; it then passes through a recurrent neural network, whose input is the current battlefield environment together with the hidden layer output of the previous step and whose output is the hidden layer output of the current step; the hidden layer output is input to the air command target assignment action network, whose direct output is converted by categorical discretization into a 128 × 128-dimensional probability distribution, each dimension representing the probability that the aircraft agent flies to the corresponding coordinate to attack: if there is an enemy aircraft agent at the target coordinate, the aircraft agent flies to the target coordinate and attacks the enemy aircraft agent; if there is none, the aircraft agent simply flies to the target coordinate; the probability distribution is passed through a mask calculation to form the final action output; the mask is a 128 × 128-dimensional vector in which each element indicates whether there is an enemy aircraft agent at the corresponding coordinate.
In step 1, the evaluation network is used for evaluating the battlefield environment, the input of the evaluation network is the same as the input of the action network, and the output is a 1-dimensional vector.
In step 2, initializing the data for training the action network and the evaluation network and constructing and initializing a playback buffer comprise:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where s denotes state information and the subscript o indicates that the state information is local environment information; the action network hidden layer information hs_act, where hs denotes network hidden layer information and the subscript act indicates that the network is the action network; the evaluation network hidden layer information hs_critic, where the subscript critic indicates that the network is the evaluation network; the action information a output by the aircraft agent and the log value logp_a of the probability p_a of the action output by the aircraft agent, where p_a denotes the probability that the aircraft agent outputs action a; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input into the evaluation network.
In step 2, the data includes:
global environment state information s_share: the input used by the evaluation network during training is the global battlefield environment, and the dimensionality of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the time-step length of one round of combat (length denotes the time step and episode the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes a count and thread the thread running the corresponding simulation environment); num_agents is the number of aircraft agents on our side (agents refers to aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes state information);
local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where s denotes state information and the subscript o indicates that the state information is local environment information; its battlefield environment data are the same as those of the global environment state information s_share;
hs_act: the intermediate output of the recurrent neural hidden layer of the action network, with data dimensionality [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer (dim denotes the dimension value and hsact refers to the action network hidden layer);
hs_critic: the intermediate output of the recurrent neural hidden layer of the evaluation network, with data dimensionality [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer (hscritic refers to the evaluation network hidden layer).
The data set obtained by sampling in step 3 comprises s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ, where r is the action execution feedback obtained from the environment (for example, the number of destroyed enemy units), log π_θ is the log value taken of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
The method for calculating the advantage function described in step 4 is:

\hat{A}_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s^{share}_t)

wherein \hat{A}_t represents the estimate of the advantage function at time t; s^{share}_t is the global environment state information at time t; V(s^{share}_t) is the output value after the global environment state information at time t is input into the evaluation network; γ is the cumulative discount factor; l represents the number of action steps after time t; r_{t+l} represents the reward value fed back by the environment after step t+l; and V represents the evaluation network.
The method for calculating the action network loss in step 5 is:

loss_{actor} = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

r_t(\theta) = \frac{\pi_\theta(a_t \mid s^{o}_t)}{\pi_{\theta_{old}}(a_t \mid s^{o}_t)}

where t is the time t; clip(r_t(θ), 1-ε, 1+ε) is a truncation operation: if the value of r_t(θ) exceeds the range (1-ε, 1+ε) it is truncated, that is, if r_t(θ) is greater than 1+ε it is set to 1+ε, if it is less than 1-ε it is set to 1-ε, and if it lies within (1-ε, 1+ε) its value is kept; ε is a set value; \hat{A}_t represents the advantage of the action a_t selected by the current aircraft agent over selecting other actions; s^{o}_t represents the local environment state information of the aircraft agent at time t; min(·, ·) represents taking the smaller of the two compared values; \hat{\mathbb{E}}_t represents the average taken over multiple rounds of calculation; π_θ(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the latest parameters θ; π_{θ_old}(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the parameters θ_old of the previous iteration; and r_t(θ) is the ratio of the probability with which the current action network selects a_t to the probability with which the action network of the previous iteration selects a_t.
The calculation in step 5 of the derivative ∇_θ loss of the loss value, realizing back-propagation updating of the parameters of the action network and the evaluation network, is:

\nabla_\theta loss = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t)\,\hat{A}_t\right]

wherein \nabla_\theta loss is the derivative of the action network loss value; \nabla_\theta represents the derivative operation; s^{o} is the local environment state information; a_t is the action information; \log \pi_\theta(a_t \mid s^{o}_t) represents taking the log of \pi_\theta(a_t \mid s^{o}_t); \nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t) represents taking the derivative after taking the log; and \hat{\mathbb{E}}_t represents taking the average over multiple rounds of calculation for the estimate at time t.
Beneficial effects:
Aiming at application scenarios in which training the formation command simulation agent takes a long time and is not very stable, the invention constructs the formation command agent from the perspective of macroscopic command and combines macroscopic and microscopic information: the evaluation network takes the global environment state data as input, while the action network takes the local environment data around a single aircraft agent as input. This improves the macroscopic and microscopic control capability over the formation, introduces the multi-agent PPO (proximal policy optimization) algorithm into the construction of formation command agents for the first time, and improves the training stability of the formation command agent.
The invention constructs an air isomorphic formation command method based on a multi-agent PPO algorithm. The evaluation network uses global information, so the algorithm has the capability of evaluating the global situation; the input of the action network is local information, so each agent can focus on learning measures from local information; and the evaluation network, taking the global information as input and evaluating the global environment state, guides the agent to select actions that are favorable to the global environment state.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a schematic view of the overall process of the present invention.
FIG. 2 is an action network architecture diagram.
Detailed Description
The following description takes as an example a simulation scenario in which 6 fighters on each of the red and blue sides fight against each other. The simulated battlefield space is discretized into a 128 × 128 grid space in which the fighters of both the red and blue sides maneuver and attack each other. The air isomorphic formation command method based on the multi-agent PPO (proximal policy optimization) algorithm controls the 6 red-side fighters in their engagement with the blue side and decides which grid cell each fighter flies to and which blue-side aircraft agent it attacks.
As shown in fig. 1, an air isomorphic formation commanding method based on multi-agent PPO algorithm includes: step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state and other data for training, wherein the other data for training comprises intermediate hidden layer information of the action network and the evaluation network;
step 3, collecting environmental state data from the simulation confrontation environment, wherein the environmental state data consists of local environmental state data and global environmental state data; inputting the environment state data into the action network, outputting the action by the action network and issuing the action to the simulation confrontation environment; the simulation confrontation environment changes the environment state data after receiving the action and returns the changed environment state data to the action network; the action network outputs a formation control instruction, continuously interacts with a simulation confrontation environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation-commanding aircraft agent according to the sampled data in the sampling data set of step 3;
step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value from the sampled data in the sampling data set obtained after the sampling in step 3, and back-propagating according to the two loss values to update the action network and the evaluation network: the derivative ∇_θ loss of the loss value is calculated, realizing back-propagation updating of the parameters of the action network and the evaluation network;
step 6, using the updated action network to output actions and interact with the simulation confrontation environment, and continuing the sampling of step 3;
and repeating steps 3 to 6 until the actions output by the action network meet the set requirement.
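For readability, the overall step 3 to step 6 loop can be sketched as follows. This is a minimal Python sketch under assumptions, not the patented implementation: the environment wrapper env, the helpers collect_rollout, compute_advantage and ppo_update, and the iteration count are illustrative names introduced here only.

```python
# Minimal sketch of the step 3-6 training loop described above; not the patented
# implementation. `env`, `collect_rollout`, `compute_advantage` and `ppo_update`
# are assumed helpers named only for illustration.

def train(actor, critic, optimizer, env, num_iterations=1000, gamma=0.99, clip_eps=0.2):
    # Step 2: initial local/global environment states and recurrent hidden states.
    state, hidden = env.reset(), None

    for _ in range(num_iterations):
        # Step 3: the actor interacts with the simulation confrontation environment
        # and a sampling data set (s_share, s_o, hs, a, logp_a, V, r) is collected.
        batch, state, hidden = collect_rollout(actor, critic, env, state, hidden)

        # Step 4: advantage estimate from rewards and the global-state value V(s_share).
        batch["advantage"] = compute_advantage(batch["r"], batch["v_share"], gamma)

        # Step 5: clipped actor loss + critic loss, back-propagated to update both networks.
        ppo_update(actor, critic, batch, optimizer, clip_eps)

        # Step 6: the updated actor is used in the next iteration (loop back to step 3).
```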
In step 1, the concrete method for constructing the action network and the evaluation network is as follows:
the action network takes as input the information of the red and blue sides in the battlefield environment, organized into a 3 × 128 × 128 matrix, where each of the 3 channels represents one class of battlefield environment characteristic information, and the features comprise the following:
a position feature matrix: the battlefield is abstracted into a 128 x 128 space, with 1 for each grid if there is a blue airplane, 2 for a red airplane agent, and 0 for no airplane agent.
A heading matrix: in the 128 × 128 lattice space, if there is an aircraft agent in a lattice, the corresponding matrix element value is the heading of that aircraft agent, and the heading is divided into 360 degrees.
A damaged state matrix: in the 128 × 128 lattice space, if there is an airplane agent in one lattice, the corresponding matrix element value is the damage status of the airplane agent, and the status is classified as good (indicated by 1), damaged (indicated by 0.5), and destroyed (indicated by 0.1).
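For illustration, the three 128 × 128 feature planes described above can be assembled into the 3 × 128 × 128 input tensor roughly as in the following Python sketch; the per-aircraft record fields (x, y, side, heading, damage) are assumed names, not taken from the patent.

```python
import numpy as np

def encode_observation(aircraft_list, grid=128):
    """Build the 3 x 128 x 128 input tensor: position, heading and damage planes.

    `aircraft_list` is assumed to hold dicts with keys 'x', 'y', 'side'
    ('red'/'blue'), 'heading' (0-359 degrees) and 'damage' (1.0 good,
    0.5 damaged, 0.1 destroyed); these field names are illustrative only.
    """
    obs = np.zeros((3, grid, grid), dtype=np.float32)
    for ac in aircraft_list:
        x, y = int(ac["x"]), int(ac["y"])
        obs[0, x, y] = 1.0 if ac["side"] == "blue" else 2.0   # position plane (1 blue, 2 red)
        obs[1, x, y] = ac["heading"]                          # heading plane (0-359)
        obs[2, x, y] = ac["damage"]                           # damage plane (1 / 0.5 / 0.1)
    return obs
```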
Based on the above input representation, as shown in FIG. 2, the action network architecture is constructed as follows:
the first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer performs a feature-learning representation of the battlefield friend-and-foe state information; the output of the convolution layer undergoes a flattening operation and is then input into a fully connected layer; it then passes through a recurrent neural (RNN) network, whose input is the current battlefield environment together with the hidden layer output of the previous step and whose output is the hidden layer output of the current step; the hidden layer output is input to the air command target assignment action (actor) network, whose direct output is converted by categorical discretization into a 128 × 128-dimensional probability distribution, each dimension representing the probability that the aircraft agent flies to the corresponding coordinate to attack: if there is an enemy aircraft agent at the target coordinate, the aircraft agent flies to the target coordinate and attacks the enemy aircraft agent; if there is none, the aircraft agent simply flies to the target coordinate. The probability distribution is passed through a mask calculation to form the final action output. The mask is a 128 × 128 vector in which each element indicates whether there is an enemy aircraft agent at the corresponding coordinate, and the masked distribution is taken as the final output.
The evaluation network constructed is used to evaluate the battlefield environment. Its first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer performs a feature-learning representation of the battlefield friend-and-foe state information; the output of the convolution layer undergoes a flattening operation and is then input into a fully connected layer; it then passes through a recurrent neural (RNN) network, whose input is the current battlefield environment together with the hidden layer output of the previous step and whose output is the hidden layer output of the current step; the hidden layer output is input to the critic evaluation head, and the evaluation network output is a one-dimensional vector.
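A minimal PyTorch sketch of the convolution, flattening, fully connected and recurrent layers and of the masked 128 × 128 action head and one-dimensional value head described above is given below. The kernel size, stride, choice of a GRU cell and the way the mask is combined with the logits are assumptions; the text fixes only the channel counts (3 in, 8 out), the hidden-layer width of 256 and the output shapes.

```python
import torch
import torch.nn as nn


class ConvRNNTower(nn.Module):
    """Backbone sketch: 2-D conv (3 -> 8 channels) -> flatten -> fully connected -> GRU cell."""

    def __init__(self, grid=128, hidden=256):
        super().__init__()
        # Channel counts follow the text; kernel size and stride are assumptions.
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(8 * (grid // 2) ** 2, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)   # hidden layer output hs (256-dimensional)

    def forward(self, obs, hs):
        # obs: [batch, 3, 128, 128]; hs: hidden state of the previous step (or None).
        x = torch.relu(self.conv(obs))
        x = torch.relu(self.fc(x.flatten(start_dim=1)))
        return self.rnn(x, hs)


class Actor(nn.Module):
    """Action network: tower -> 128*128 logits, masked before the categorical distribution."""

    def __init__(self, grid=128, hidden=256):
        super().__init__()
        self.tower = ConvRNNTower(grid, hidden)
        self.head = nn.Linear(hidden, grid * grid)

    def forward(self, local_obs, hs, mask=None):
        hs = self.tower(local_obs, hs)
        logits = self.head(hs)
        if mask is not None:
            # One possible reading of the "mask calculation"; the text only says the
            # distribution is combined with an enemy-presence mask over the 128x128 grid.
            logits = logits.masked_fill(mask == 0, float("-inf"))
        return torch.distributions.Categorical(logits=logits), hs


class Critic(nn.Module):
    """Evaluation network: same backbone over the global state, 1-dimensional value output."""

    def __init__(self, grid=128, hidden=256):
        super().__init__()
        self.tower = ConvRNNTower(grid, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, global_obs, hs):
        hs = self.tower(global_obs, hs)
        return self.head(hs), hs
```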
In step 2, a playback buffer is constructed and initialized; the contents of the buffer comprise s_share, s_o, hs_act, hs_critic, a, logp_a and V(s_share):
s_share: the input used by the critic network during training is the global battlefield environment, and the dimensionality of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode, the time-step length of one round of combat, is set to 1024 steps; num_thread, the number of simulation environments running in parallel, is set to 6; num_agents, the number of aircraft agents on one side, is set to 5; and dim_s, the dimension of the battlefield environment data in each time slice, is set to 128 × 128 × 3.
s_o: the battlefield environment input of each individual aircraft agent in the formation; its battlefield environment data are the same as s_share.
hs_act: the intermediate output of the recurrent neural hidden layer of the actor network, with data dimensionality [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact, the output dimension of the hidden layer, is set to 256.
hs_critic: the intermediate output of the recurrent neural hidden layer of the critic network, with data dimensionality [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic, the output dimension of the hidden layer, is set to 256.
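With the shapes just listed, the playback buffer can be preallocated, for example, as in the following sketch; the dictionary keys mirror the symbols used in the text, and NumPy and the channel-first observation layout are implementation choices rather than something specified by the patent.

```python
import numpy as np

# Illustrative buffer preallocation using the shapes given above:
# 1024 steps, 6 parallel environments, 5 agents, 3x128x128 observations, hidden size 256.
LEN_EPISODE, NUM_THREAD, NUM_AGENTS = 1024, 6, 5
DIM_S, DIM_HS = (3, 128, 128), 256

buffer = {
    "s_share":   np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS, *DIM_S), np.float32),
    "s_o":       np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS, *DIM_S), np.float32),
    "hs_act":    np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS, DIM_HS), np.float32),
    "hs_critic": np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS, DIM_HS), np.float32),
    "a":         np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS), np.int64),    # chosen grid index
    "logp_a":    np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS), np.float32),  # log prob of a
    "v_share":   np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS), np.float32),  # V(s_share)
    "r":         np.zeros((LEN_EPISODE, NUM_THREAD, NUM_AGENTS), np.float32),  # environment reward
}
```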
In step 3, a sample data set for training is acquired by continuously interacting with the environment; the data set comprises s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ.
In step 4, the advantage function is calculated as

\hat{A}_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s^{share}_t)

The dimensionality of the transformed sampled data is [number of parallel environments, number of agents, number of steps of one command, dimensionality observed at each step].
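A possible implementation of this advantage estimate accumulates the discounted return backwards over the sampled rollout and subtracts the evaluation network output, as sketched below; the finite-horizon accumulation and the discount value of 0.99 are assumptions.

```python
import numpy as np

def compute_advantage(rewards, values, gamma=0.99):
    """Advantage estimate A_t = sum_l gamma^l * r_{t+l} - V(s_share_t).

    `rewards` and `values` have the time axis first (shape [T, ...]); the discounted
    return is accumulated backwards over the finite sampled rollout."""
    returns = np.zeros_like(rewards)
    running = np.zeros_like(rewards[0])
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # discounted return from step t onward
        returns[t] = running
    return returns - values                      # subtract the critic baseline V(s_share)
```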
In step 5, the actor loss (loss_actor) is calculated as

loss_{actor} = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

r_t(\theta) = \frac{\pi_\theta(a_t \mid s^{o}_t)}{\pi_{\theta_{old}}(a_t \mid s^{o}_t)}

wherein clip(r_t(θ), 1-ε, 1+ε) is a truncation operation: if r_t(θ) exceeds the range (1-ε, 1+ε) it is truncated, that is, if the value of r_t(θ) is greater than 1+ε it is set to 1+ε, if it is less than 1-ε it is set to 1-ε, and if it lies within (1-ε, 1+ε) it is kept; the value of ε is determined empirically. \hat{A}_t is the advantage function, representing the advantage of the action a_t selected by the current aircraft agent over selecting other actions. s^{o}_t represents the local environment state information of the aircraft agent at time t. min(·, ·) represents taking the smaller of the two compared values. \hat{\mathbb{E}}_t represents taking the average of multiple rounds of calculation. π_θ(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the latest parameters θ. π_{θ_old}(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the parameters θ_old of the previous iteration. r_t(θ) is the ratio of the probability with which the current action network selects a_t to the probability with which the action network of the previous iteration selects a_t.
The derivative ∇_θ loss of the loss value in step 5, which realizes back-propagation updating of the parameters of the action network and the evaluation network, is calculated as

\nabla_\theta loss = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t)\,\hat{A}_t\right]

wherein s^{o} is the local environment state information; a_t is the action information; \log \pi_\theta(a_t \mid s^{o}_t) represents taking the log of \pi_\theta(a_t \mid s^{o}_t); and \nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t) represents taking the derivative after taking the log.
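A PyTorch sketch of the step 5 update, combining the clipped actor loss above with an evaluation network loss and back-propagating both, is given below. The mean-squared-error form of the value loss, the shared optimizer, and the batch fields mask and returns (assumed to be added at sampling time, with all fields already converted to tensors) are assumptions introduced for illustration.

```python
import torch

def ppo_update(actor, critic, batch, optimizer, clip_eps=0.2):
    """One PPO update sketch: clipped actor loss plus an assumed MSE critic loss."""
    dist, _ = actor(batch["s_o"], batch["hs_act"], batch["mask"])
    new_logp = dist.log_prob(batch["a"])                 # log pi_theta(a_t | s_o_t)
    ratio = torch.exp(new_logp - batch["logp_a"])        # r_t(theta)

    adv = batch["advantage"]
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Maximize the clipped surrogate by minimizing its negative.
    loss_actor = -torch.min(ratio * adv, clipped * adv).mean()

    value, _ = critic(batch["s_share"], batch["hs_critic"])
    loss_value = (batch["returns"] - value.squeeze(-1)).pow(2).mean()   # assumed MSE form

    optimizer.zero_grad()
    (loss_actor + loss_value).backward()                 # back-propagate both loss values
    optimizer.step()
```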
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, can carry out the inventive content of the air isomorphic formation command method based on the multi-agent PPO algorithm provided by the invention and some or all of the steps of each embodiment. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It is clear to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and its corresponding general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially embodied in the form of a computer program or software product, which may be stored in a storage medium and includes instructions for causing a device containing a data processing unit (which may be a personal computer, a server, a single-chip microcomputer, an MCU, or a network device) to execute the methods described in the embodiments or some parts of the embodiments of the present invention.
The invention provides the idea and method of an air isomorphic formation command method based on a multi-agent PPO algorithm; there are many concrete methods and ways to implement this technical solution. All components not specified in the present embodiment can be implemented by the prior art.

Claims (10)

1. An air isomorphic formation command method based on a multi-agent PPO algorithm is characterized by comprising the following steps:
step 1, constructing an action network aiming at local environment state input and an evaluation network aiming at global environment state input;
step 2, initializing a local environment state, a global environment state and other data for training, wherein the other data for training comprises intermediate hidden layer information of the action network and the evaluation network;
step 3, collecting environmental state data from the simulation confrontation environment, wherein the environmental state data consists of local environmental state data and global environmental state data; inputting the environment state data into the action network, outputting the action by the action network and issuing the action to the simulation confrontation environment; the simulation confrontation environment changes the environment state data after receiving the action and returns the changed environment state data to the action network; the action network outputs a formation control instruction, continuously interacts with a simulation confrontation environment, and samples to obtain a sampling data set for training;
step 4, calculating an advantage function oriented to the formation-commanding aircraft agent according to the sampled data in the sampling data set of step 3;
step 5, calculating the action network loss loss_actor and the evaluation network loss loss_value for the sampled data in the sampling data set obtained after the sampling in step 3, and back-propagating according to the two loss values to update the action network and the evaluation network: the derivative ∇_θ loss of the loss value is calculated, realizing back-propagation updating of the parameters of the action network and the evaluation network;
step 6, using the updated action network to output actions and interact with the simulation confrontation environment, and continuing the sampling of step 3;
and repeating steps 3 to 6 until the actions output by the action network meet the set requirement.
2. The method for commanding an air isomorphic formation based on a multi-agent PPO algorithm as claimed in claim 1, wherein step 1 comprises: inputting to the action network the local environment state around each aircraft agent in the air isomorphic formation; and inputting the global environment state into the evaluation network, which integrates the global environment states of all the aircraft agents and evaluates the influence of each aircraft agent's action on the overall objective of the whole formation, wherein the overall objective comprises destroying all enemy forces or ensuring that the overall damage to the formation is minimized.
3. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 2, wherein in step 1, the local environment status is input into the action network, and the method comprises:
inputting the friend-and-foe command information of the air isomorphic formation into the action network and organizing it into an n × 128 × 128 matrix, wherein each of the n channels represents one type of characteristic information of the command state, and the characteristic information comprises:
a position feature matrix: the battlefield is abstracted into a 128 × 128 space, where each point is 1 if an enemy aircraft agent is present, 2 if a friendly aircraft agent is present, and 0 if no aircraft agent is present;
a heading matrix: in the 128 × 128 matrix, each point, if an aircraft agent is present there, takes the heading of that aircraft agent, the heading being divided into 360 degrees;
a damaged state matrix: in the 128 × 128 matrix, each point, if an aircraft agent is present there, takes the damage state of that aircraft agent.
4. The method for commanding an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 3, wherein in step 1, the action network is constructed as follows:
the first layer is a convolutional network: the 2-dimensional convolution layer has 3 input channels and 8 output channels, and the convolution layer performs a feature-learning representation of the battlefield friend-and-foe state information; the output of the convolution layer is flattened and input into a fully connected layer; it then passes through a recurrent neural network, whose input is the current battlefield environment together with the hidden layer output of the previous step and whose output is the hidden layer output of the current step; the hidden layer output is input to the air command target assignment action network, whose direct output is converted by categorical discretization into a 128 × 128-dimensional probability distribution, each dimension representing the probability that the aircraft agent flies to the corresponding coordinate to attack: if there is an enemy aircraft agent at the target coordinate, the aircraft agent flies to the target coordinate and attacks the enemy aircraft agent; if there is none, the aircraft agent simply flies to the target coordinate; the probability distribution is passed through a mask calculation to form the final action output; the mask is a 128 × 128-dimensional vector in which each element indicates whether there is an enemy aircraft agent at the corresponding coordinate.
5. The multi-agent PPO algorithm-based air isomorphic formation command method as claimed in claim 4, wherein in step 1, the evaluation network is used to evaluate battlefield environment, the input of the evaluation network is identical to the input of the action network, and the output is 1-dimensional vector.
6. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 5, wherein in step 2, initializing the data for training the action network and the evaluation network and constructing and initializing a playback buffer comprise:
the global environment state information s_share input to the evaluation network, where s denotes state information and the subscript share indicates that the state information is global environment information; the local environment state information s_o input to the action network, where s denotes state information and the subscript o indicates that the state information is local environment information; the action network hidden layer information hs_act, where hs denotes network hidden layer information and the subscript act indicates that the network is the action network; the evaluation network hidden layer information hs_critic, where the subscript critic indicates that the network is the evaluation network; the action information a output by the aircraft agent and the log value logp_a of the probability p_a of the action output by the aircraft agent, where p_a denotes the probability that the aircraft agent outputs action a; and the evaluation network output V(s_share), where V denotes the evaluation network and V(s_share) is the output value after the global environment state information s_share is input into the evaluation network.
7. The method for commanding an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 6, wherein in step 2, the data comprises:
global environment state information s_share: the input used by the evaluation network during training is the global battlefield environment, and the dimensionality of the data is [length_episode, num_thread, num_agents, dim_s], where length_episode is the time-step length of one round of combat (length denotes the time step and episode the corresponding combat round); num_thread is the number of simulation environments running in parallel (num denotes a count and thread the thread running the corresponding simulation environment); num_agents is the number of aircraft agents on our side (agents refers to aircraft agents); and dim_s is the dimension of the battlefield environment data in each time slice (s denotes state information);
local environment state information s_o: the battlefield environment input of each individual aircraft agent in the air isomorphic formation, where s denotes state information and the subscript o indicates that the state information is local environment information; its battlefield environment data are the same as those of the global environment state information s_share;
hs_act: the intermediate output of the recurrent neural hidden layer of the action network, with data dimensionality [length_episode, num_thread, num_agents, dim_hsact], where dim_hsact is the output dimension of the hidden layer (dim denotes the dimension value and hsact refers to the action network hidden layer);
hs_critic: the intermediate output of the recurrent neural hidden layer of the evaluation network, with data dimensionality [length_episode, num_thread, num_agents, dim_hscritic], where dim_hscritic is the output dimension of the hidden layer (hscritic refers to the evaluation network hidden layer).
8. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 7, wherein the data set obtained by sampling in step 3 comprises s_share, s_o, hs_act, hs_critic, a, logp_a, V(s_share), r and log π_θ, where r is the action execution feedback obtained from the environment, log π_θ is the log value taken of the direct output of the action network, π denotes the action network, and the subscript θ denotes the parameters of the action network.
9. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 8, wherein the method for calculating the advantage function in step 4 comprises:

\hat{A}_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s^{share}_t)

wherein \hat{A}_t represents the estimate of the advantage function at time t; s^{share}_t is the global environment state information at time t; V(s^{share}_t) is the output value after the global environment state information at time t is input into the evaluation network; γ is the cumulative discount factor; l represents the number of action steps after time t; r_{t+l} represents the reward value fed back by the environment after step t+l; and V represents the evaluation network.
10. The method for conducting an air homogeneous formation based on a multi-agent PPO algorithm as claimed in claim 9, wherein the method for calculating the loss of the action network in step 5 comprises:

loss_{actor} = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]

r_t(\theta) = \frac{\pi_\theta(a_t \mid s^{o}_t)}{\pi_{\theta_{old}}(a_t \mid s^{o}_t)}

where t is the time t; clip(r_t(θ), 1-ε, 1+ε) is a truncation operation: if the value of r_t(θ) exceeds the range (1-ε, 1+ε) it is truncated, that is, if r_t(θ) is greater than 1+ε it is set to 1+ε, if it is less than 1-ε it is set to 1-ε, and if it lies within (1-ε, 1+ε) its value is kept; ε is a set value; \hat{A}_t represents the advantage of the action a_t selected by the current aircraft agent over selecting other actions; s^{o}_t represents the local environment state information of the aircraft agent at time t; min(·, ·) represents taking the smaller of the two compared values; \hat{\mathbb{E}}_t represents the average taken over multiple rounds of calculation; π_θ(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the latest parameters θ; π_{θ_old}(a_t | s^{o}_t) represents the probability that the aircraft agent at time t selects action a_t using the action network with the parameters θ_old of the previous iteration; and r_t(θ) is the ratio of the probability with which the current action network selects a_t to the probability with which the action network of the previous iteration selects a_t;
and wherein the calculation in step 5 of the derivative ∇_θ loss of the loss value, realizing back-propagation updating of the parameters of the action network and the evaluation network, comprises:

\nabla_\theta loss = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t)\,\hat{A}_t\right]

wherein \nabla_\theta loss is the derivative of the action network loss value; \nabla_\theta represents the derivative operation; s^{o} is the local environment state information; a_t is the action information; \log \pi_\theta(a_t \mid s^{o}_t) represents taking the log of \pi_\theta(a_t \mid s^{o}_t); \nabla_\theta \log \pi_\theta(a_t \mid s^{o}_t) represents taking the derivative after taking the log; and \hat{\mathbb{E}}_t represents taking the average over multiple rounds of calculation for the estimate at time t.
CN202210656190.7A 2022-06-10 2022-06-10 Air isomorphic formation command method based on multi-agent PPO algorithm Active CN115047907B (en)

Priority Applications (1)

Application number: CN202210656190.7A (granted as CN115047907B); priority date: 2022-06-10; filing date: 2022-06-10; title: Air isomorphic formation command method based on multi-agent PPO algorithm

Applications Claiming Priority (1)

Application number: CN202210656190.7A (granted as CN115047907B); priority date: 2022-06-10; filing date: 2022-06-10; title: Air isomorphic formation command method based on multi-agent PPO algorithm

Publications (2)

Publication Number Publication Date
CN115047907A true CN115047907A (en) 2022-09-13
CN115047907B CN115047907B (en) 2024-05-07

Family

ID=83161154

Family Applications (1)

Application number: CN202210656190.7A (granted as CN115047907B); title: Air isomorphic formation command method based on multi-agent PPO algorithm; priority date: 2022-06-10; filing date: 2022-06-10

Country Status (1)

Country Link
CN (1) CN115047907B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction
CN117151224A (en) * 2023-07-27 2023-12-01 中国科学院自动化研究所 Strategy evolution training method, device, equipment and medium for strong random game of soldiers

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113625757A (en) * 2021-08-12 2021-11-09 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113313267A (en) * 2021-06-28 2021-08-27 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN113625757A (en) * 2021-08-12 2021-11-09 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle cluster scheduling method based on reinforcement learning and attention mechanism
CN113791634A (en) * 2021-08-22 2021-12-14 西北工业大学 Multi-aircraft air combat decision method based on multi-agent reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
夏庆军; 张安; 张耀中: "Multi-agent-based formation cooperative test-flight simulation and effectiveness evaluation", Fire Control & Command Control, no. 05, 15 May 2011 (2011-05-15) *
轩书哲, 柯良军: "Research on UAV swarm attack-defense confrontation strategy based on multi-agent reinforcement learning", Radio Engineering, vol. 51, no. 05, 5 May 2021 (2021-05-05) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826627A (en) * 2023-02-21 2023-03-21 白杨时代(北京)科技有限公司 Method, system, equipment and storage medium for determining formation instruction
CN117151224A (en) * 2023-07-27 2023-12-01 中国科学院自动化研究所 Strategy evolution training method, device, equipment and medium for strong random game of soldiers

Also Published As

Publication number Publication date
CN115047907B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
CN115047907A (en) Air isomorphic formation command method based on multi-agent PPO algorithm
US11779837B2 (en) Method, apparatus, and device for scheduling virtual objects in virtual environment
CN110147883B (en) Training method, device, equipment and storage medium for model for combat simulation
Schultz et al. Improving tactical plans with genetic algorithms
Hu et al. A dynamic adjusting reward function method for deep reinforcement learning with adjustable parameters
Wang et al. UAV swarm confrontation using hierarchical multiagent reinforcement learning
Lv et al. Sagci-system: Towards sample-efficient, generalizable, compositional, and incremental robot learning
CN116841317A (en) Unmanned aerial vehicle cluster collaborative countermeasure method based on graph attention reinforcement learning
Ghouri et al. Attitude control of quad-copter using deterministic policy gradient algorithms (DPGA)
CN113902087A (en) Multi-Agent deep reinforcement learning algorithm
Zha et al. Evaluate, explain, and explore the state more exactly: an improved Actor-Critic algorithm for complex environment
CN113741186A (en) Double-machine air combat decision method based on near-end strategy optimization
CN113554680A (en) Target tracking method and device, unmanned aerial vehicle and storage medium
Liang et al. Qauxi: Cooperative multi-agent reinforcement learning with knowledge transferred from auxiliary task
CN114840024A (en) Unmanned aerial vehicle control decision method based on context memory
CN114053712B (en) Action generation method, device and equipment of virtual object
KR20220166716A (en) Demonstration-conditioned reinforcement learning for few-shot imitation
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Chen et al. Modified PPO-RND method for solving sparse reward problem in ViZDoom
Ponsen et al. Hierarchical reinforcement learning with deictic representation in a computer game
Cao et al. PooL: Pheromone-inspired Communication Framework forLarge Scale Multi-Agent Reinforcement Learning
Mendi et al. Applications of Reinforcement Learning and its Extension to Tactical Simulation Technologies
Schwab et al. Tensor action spaces for multi-agent robot transfer learning
Karkus et al. Factored contextual policy search with Bayesian optimization
Zhao et al. Deep Reinforcement Learning‐Based Air Defense Decision‐Making Using Potential Games

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant