CN116776929A - Multi-agent task decision method based on PF-MADDPG - Google Patents

Multi-agent task decision method based on PF-MADDPG

Info

Publication number: CN116776929A
Application number: CN202310445392.1A
Authority: CN (China)
Prior art keywords: agent, intelligent, state, network, time
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 张绍杰, 赵卯卯
Current assignee: Qinhuai Innovation Research Institute Of Nanjing University Of Aeronautics And Astronautics; Nanjing University of Aeronautics and Astronautics
Original assignee: Qinhuai Innovation Research Institute Of Nanjing University Of Aeronautics And Astronautics; Nanjing University of Aeronautics and Astronautics
Application filed by Qinhuai Innovation Research Institute Of Nanjing University Of Aeronautics And Astronautics and Nanjing University of Aeronautics and Astronautics
Priority to CN202310445392.1A; published as CN116776929A

Landscapes

  • Computer And Data Communications (AREA)

Abstract

The application discloses a multi-agent task decision method based on PF-MADDPG, which introduces an improved MADDPG algorithm into multi-agent task planning. First, a two-dimensional environment model required for deep reinforcement learning is established according to the multi-agent attack and defense countermeasure environment; second, the state space, the action space, and a reward function based on a potential function are designed; finally, the network is trained to make multi-agent task decisions. Aiming at the slow training convergence and poor training performance of homogeneous multi-agent reinforcement learning algorithms in multi-agent confrontation, the application designs a MADDPG algorithm with potential-function rewards to solve the multi-agent attack and defense problem; it better realizes adversarial game training, and the potential-function reward accelerates the convergence of that training.

Description

Multi-agent task decision method based on PF-MADDPG
Technical Field
The application relates to the field of task planning decision-making, in particular to a multi-agent task decision-making method based on PF-MADDPG.
Background
Because many sequential decision problems involve multiple agents, multi-agent systems have been widely used as the degree of intelligence increases and artificial intelligence technology develops rapidly. Compared with a single-agent system, a multi-agent system offers higher efficiency, lower cost, flexibility, and reliability, and can handle complex problems that a single agent cannot solve. A multi-agent system is composed of several homogeneous or heterogeneous agents that form competitive or cooperative relationships, thereby realizing higher-order complex intelligence.
Multi-agent learning mainly studies policy learning among multiple agents. Its main approaches include reinforcement learning, robust learning, learning dynamics, and strategy learning. Reinforcement learning is a method that learns through continual trial and error from experience. With the tremendous increase in computing and storage power, deep learning is now widely used. Deep reinforcement learning combines deep learning and reinforcement learning and has recently made significant progress in sequential decision problems, including mission planning, intelligent air combat, and strategy games. Multi-agent deep reinforcement learning is an effective method for developing multi-agent systems, applying the concepts and methods of deep reinforcement learning to the learning and control of a multi-agent system. However, applying deep reinforcement learning to multi-agent systems faces many difficulties, such as non-unique learning objectives, non-stationary environments, partial observability, and algorithm stability and convergence.
The application of deep reinforcement learning in multi-agent systems has made some progress. The deep distributed recurrent Q-network (DDRQN) algorithm proposed by J. N. Foerster et al. effectively addresses the multi-agent partial-observability problem and the partially observable Markov decision process (POMDP) problem. Its disadvantage is that, because all agents are assumed to choose the same course of action, it cannot solve the heterogeneous multi-agent optimal control problem. The same team then proposed the counterfactual multi-agent policy gradient (COMA) algorithm, which improves the agents' ability to cooperate by using a global reward function to evaluate all current actions and states, but it is limited to discrete action spaces. In addition, multi-agent cooperative-control reinforcement learning algorithms mainly include the lenient deep Q-network (LDQN) algorithm, the Q-value mixing network (QMIX) algorithm, and the multi-agent deep deterministic policy gradient (MADDPG) algorithm.
Compared with other algorithms, the MADDPG algorithm adopts a centralized-training, distributed-execution structure and is currently a widely applied multi-agent reinforcement learning architecture. The MADDPG algorithm can handle a variety of task scenarios, including competition and cooperation between homogeneous or heterogeneous agents. Throughout training, the observations of the other agents can be used for centralized training, which improves algorithm efficiency.
The application with publication number CN113589842A provides an unmanned-cluster task cooperation method based on multi-agent reinforcement learning: a reinforcement learning simulation environment for multi-unmanned-system task planning is built with Unity; the information acquired from the simulation environment is wrapped into a reinforcement learning environment conforming to the Gym specification; the unmanned aerial vehicle cluster countermeasure environment is modeled; a multi-agent reinforcement learning environment is built with the TensorFlow deep learning library; the multi-agent reinforcement learning problem is solved with a cooperative deep deterministic policy gradient method; and the unmanned-cluster task planning result is output. That application improves on the prior art and can obtain a better multi-unmanned-system collaborative task planning result; it updates the reinforcement learning reward rule, whereas the traditional method directly uses the external reward obtained from the environment as the agent's own reward, which makes cooperative strategies difficult to learn. However, because local state averages and action averages of the other agents are introduced into the algorithm, it cannot be applied to heterogeneous multi-agent scenarios.
Potential functions have also been applied before. For example, the application with publication number CN110134140B proposes an unmanned aerial vehicle path planning method based on a potential-function-reward DQN for continuous states with unknown environmental information. Although a potential-function reward is adopted there, that application only considers the obstacle-avoidance task of a single unmanned aerial vehicle and does not apply potential-function rewards to multi-agent adversarial games; whether the method can solve the path planning problem in the multi-agent case remains to be studied.
Disclosure of Invention
Aiming at the defects of the prior art, the application discloses a multi-agent task decision method based on PF-MADDPG. First, a two-dimensional environment model required by the deep reinforcement learning algorithm is established according to the multi-agent attack and defense countermeasure environment. Second, the continuous state space, the action space, and a potential-function-based reward function of the agents are designed, and the learning process is described as a Markov decision process. Finally, learning and training are carried out to obtain a multi-agent task decision network. The method uses a reinforcement learning algorithm based on PF-MADDPG with centralized learning and distributed execution: the agents may use global information during learning but only local information when applying the decision, so the agents can make efficient task decisions in an unknown environment. Meanwhile, a potential function is used to design the reward function, which improves the convergence speed of the whole network and enables autonomous attack and defense decision making for multiple agents.
In order to achieve the technical purpose, the application adopts the following technical scheme:
A multi-agent task decision method based on PF-MADDPG, comprising the following steps:
step 1: setting a plurality of attackers, a plurality of defenders and a plurality of static target areas in the attack and defense countermeasure environment, randomly generating positions of each agent and the target areas in the environment, and constructing a multi-agent attack and defense countermeasure environment;
step 2: establishing a state space for each agent, wherein the state space of agent i consists of the states of the agent at each moment, and the state of each agent comprises its own velocity and position, the velocities and relative positions of the other agents, and the relative position with respect to the target area;
step 3: establishing an action space of each intelligent agent i, wherein the action space of each intelligent agent i is formed by actions taken by the intelligent agent at various moments;
step 4: establishing a reward function of the intelligent agent; the method specifically comprises the following substeps:
defining the attack-agent collision reward function r_{col,att} and the defense-agent collision reward function r_{col,def};
For an attack agent and a target, a corresponding distance reward function r_{AT} is constructed based on the potential function, where d_{AT}^t and d_{AT}^{t+1} denote the distances between the attack agent and the target at time t and time t+1, respectively, and λ is the moving step length of the agent;
for the attack agent and the defense agent, a corresponding distance reward function r_{AD} is constructed based on the potential function, where d_{AD}^t and d_{AD}^{t+1} denote the distances between the attack agent and the defense agent at time t and time t+1, respectively;
the reward function of the attack agent is calculated as:
r_{att} = r_{col,att} + r_{AT};
the reward function of the defense agent is calculated as:
r_{def} = r_{col,def} + r_{AD};
step 5: establishing and training a multi-agent task decision network model based on a PF-MADDPG algorithm, wherein the process of acquiring an experience pool required for training the multi-agent task decision network model comprises the following steps:
calculating the reward obtained by each agent after the state change according to the reward functions of the attack agents and the defense agents in step 4, and obtaining the state transition sequence ⟨s_t, a_t, r_t, s_{t+1}⟩ of each agent, where s_t is the state of the agent at time t, s_{t+1} is the state of the agent at time t+1, a_t is the action taken by the agent in state s_t at time t, and r_t is the reward value obtained by taking action a_t in state s_t at time t; defining the state transition sequence storage space of each agent as an experience pool, and storing the state transition sequence obtained at each moment into the experience pool;
step 6: and realizing task decision of the multiple agents by using the trained multi-agent task decision network model.
Further, in step 2, the state s_{i,t} of agent i at time t is defined as:
s_{i,t} = {v_{self,x}, v_{self,y}, x_{self}, y_{self}, v_{other,x}, v_{other,y}, x_{other}, y_{other}};
where v_{self,x} and v_{self,y} are the velocity components of agent i along the x and y axes at time t, x_{self} and y_{self} are the x- and y-axis position coordinates of agent i at time t, v_{other,x} and v_{other,y} are the velocity components of the other agents along the x and y axes at time t, and x_{other} and y_{other} are the relative position coordinates of agent i with respect to the other agents at time t;
the training period runs from t_0 to t_T, and the state space of agent i is composed of the states s_i of the agent at each moment.
further, in step 2, in step 3, the process of establishing the action space of the agent includes the following sub-steps:
state s of agent i i As an input to the policy network, an agent i action policy a is output i
a i =μ i (s i )+N noise
Wherein N is noise Mu, environmental noise i Strategy for agent i;
obtaining two-dimensional acceleration vector by adopting strategy integration method
Wherein the method comprises the steps ofAnd->Acceleration components of the intelligent agent in the left, right, upper and lower directions are respectively shown; η is a sensitivity coefficient related to acceleration for limiting the range of acceleration;
calculating to obtain the value of the instantaneous speed vector:
after the Δt time, the agent moves to the next position, and the update of the position is obtained by the following formula:
the action space of agent i is made up of actions taken by the agent at various times.
Further, in step 5, the process of establishing and training the multi-agent task decision network model based on the PF-MADDPG algorithm includes the following sub-steps:
step 5-1: initializing the multi-agent attack and defense countermeasure environment established in the step 1;
step 5-2: each intelligent agent randomly selects actions, so that the state of each intelligent agent changes, and the countermeasure environment changes;
step 5-3: calculating the reward obtained by each agent after the state change according to the agent reward function of step 4, and obtaining the state transition sequence ⟨s_t, a_t, r_t, s_{t+1}⟩ of each agent, where s_t is the state of the agent at time t, s_{t+1} is the state of the agent at time t+1, a_t is the action taken by the agent in state s_t at time t, and r_t is the reward value obtained by taking action a_t in state s_t at time t;
step 5-4: the state transition sequence storage space of each intelligent agent is defined as an experience pool, and the state transition sequence obtained in each moment in the step 5-3 is stored in the experience pool;
step 5-5: the control network of each agent adopts an Actor-Critic structure; the Actor network and the Critic network each adopt a dual-network structure comprising a Target network and an Eval network;
step 5-6: randomly taking a batch of experiences at different moments from the experience pool to form an experience package ⟨S, A, R, S′⟩, where S is the set of agent states at the current moment, S′ is the set of agent states at the next moment, A is the set of actions taken by the agents in state set S at the current moment, and R is the set of reward values obtained after the agents take the actions in A in state set S at the current moment;
step 5-7: inputting the next-moment state set S′ into the Actor network of each agent and outputting the next-moment action set A′, then taking S′ and A′ as inputs of the Critic network and calculating the target Q value estimated by each agent for the next moment;
step 5-8: defining the loss function of the Critic network of agent i as:
Loss_i = (1/M) Σ_{j=1}^{M} ( y_i^j − Q_i^μ(x^j, a_1^j, …, a_N^j) )²
where N is the number of agents and M is the number of experiences extracted during training; Q_i^μ(x^j, a_1^j, …, a_N^j) is the Q value that takes the actions of all agents and some state information x^j as input, x^j being composed of the observations (o_1, o_2, …, o_N) of all agents and possibly other state information; y_i^j denotes the target Q value of the j-th experience of agent i, calculated as:
y_i^j = r_i^j + γ Q_i^{μ′}(x′^j, a_1′, …, a_N′)
where r_i^j is the reward value of the j-th experience of agent i, γ is the discount factor, and Q_i^{μ′}(x′^j, a_1′, …, a_N′) is the Q value that takes the next actions of all agents and the next-moment state information x′^j as input;
step 5-9: updating the Eval network in the Critic network by minimizing the Loss function Loss in steps 5-8;
step 5-10: defining the policy gradient calculation formula:
∇_{θ_i} J ≈ (1/M) Σ_{j=1}^{M} ∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x^j, a_1^j, …, a_i, …, a_N^j) |_{a_i = μ_i(o_i^j)}
step 5-11: updating the Eval network in the Actor network through the policy gradient calculation formula;
step 5-12: updating Target networks in the Actor network and the Critic network by a soft updating method at fixed intervals;
step 5-13: repeating steps 5-2 to 5-12 until the set maximum number of training episodes is reached.
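For illustration only, the following Python-style skeleton sketches the order of operations in steps 5-1 to 5-13; the names env, agents, buffer and the agent update methods are hypothetical placeholders for this sketch and are not part of the application.

# Hypothetical skeleton of steps 5-1 to 5-13; all names are placeholders, not from the application.
def train(env, agents, buffer, max_episodes=160000, max_steps=40, batch_size=1024):
    for episode in range(max_episodes):                       # step 5-13: repeat up to the set limit
        states = env.reset()                                   # step 5-1: initialize the environment
        for _ in range(max_steps):
            # step 5-2: each agent selects an action (random at first, later from its Actor network)
            actions = [agent.act(s) for agent, s in zip(agents, states)]
            next_states, rewards, done = env.step(actions)     # the states and the environment change
            # steps 5-3 / 5-4: store the transition <s_t, a_t, r_t, s_{t+1}> in the experience pool
            buffer.add(states, actions, rewards, next_states)
            states = next_states
            if len(buffer) >= batch_size:
                batch = buffer.sample(batch_size)              # step 5-6: sample a batch <S, A, R, S'>
                for agent in agents:
                    agent.update_critic(batch, agents)         # steps 5-7 to 5-9
                    agent.update_actor(batch, agents)          # steps 5-10 and 5-11
                    agent.soft_update_targets()                # step 5-12
            if done:
                break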
Compared with the prior art, the application has the following beneficial effects:
firstly, the PF-MADDPG-based multi-agent task decision method combines a potential-function-based reward with the MADDPG algorithm; compared with traditional deep reinforcement learning algorithms it has higher learning efficiency and faster convergence, and it realizes autonomous decision making for multi-agent attack and defense countermeasures;
secondly, unlike control methods that require accurate model information, the PF-MADDPG algorithm provided by the application needs no accurate modeling; it learns and trains through continuous trial and error on experience and is therefore better suited to multi-agent systems;
thirdly, the multi-agent task decision method based on PF-MADDPG can efficiently make decisions in an unknown environment, and overcomes the defect that task decisions can only be made in a known or static environment in the prior art.
Drawings
FIG. 1 is a flowchart of the implementation steps of a multi-agent task decision model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an Actor network and a Critic network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of test results of a multi-agent task decision method according to an embodiment of the present application;
FIG. 4 is the average training reward curve for the confrontation between two attack agents trained with PF-MADDPG and two defense agents trained with MADDPG according to an embodiment of the present application;
FIG. 5 is the average training reward curve for the confrontation between two attack agents trained with MADDPG and two defense agents trained with PF-MADDPG according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the embodiment of the application discloses a multi-agent task decision method based on PF-MADDPG, comprising:
step 1: setting a plurality of aggressors, a plurality of defenders and a plurality of static target areas in the attack and defense countermeasure environment, randomly generating positions of each agent and the target areas in the environment, and constructing a multi-agent attack and defense countermeasure environment. The multi-agent attack and defense environment adopts a two-dimensional plane environment, the moving track of the agents is continuous, the attack agents need to avoid the attack agents to hit the static target area, and the target of the attack agents is interception attack agents.
Step 2: establishing the state space of each agent.
The state of each agent includes its own velocity and position, the velocities and relative positions of the other agents, and the relative position with respect to the target area. The state s_{i,t} of agent i at time t is defined as:
s_{i,t} = {v_{self,x}, v_{self,y}, x_{self}, y_{self}, v_{other,x}, v_{other,y}, x_{other}, y_{other}}   (1)
where v_{self,x} and v_{self,y} are the velocity components of agent i along the x and y axes at time t, x_{self} and y_{self} are the position coordinates of agent i at time t, v_{other,x} and v_{other,y} are the velocity components of the other agents along the x and y axes at time t, and x_{other} and y_{other} are the relative position coordinates of agent i with respect to the other agents at time t.
Let the training period run from t_0 to t_T; the state space of agent i is composed of the states of the agent at each moment in this period.
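A minimal sketch of how the state vector of formula (1) could be assembled is given below, assuming a single "other" agent and a single target area. The function and variable names are illustrative only, and the target-area term follows the verbal description of step 2 (formula (1) itself lists only the eight agent-related components).

import numpy as np

def build_state(self_vel, self_pos, other_vel, other_pos, target_pos):
    # s_{i,t}: own velocity and position, the other agent's velocity and relative position,
    # plus the relative position of the target area (per the step-2 description).
    rel_other = other_pos - self_pos
    rel_target = target_pos - self_pos
    return np.concatenate([self_vel, self_pos, other_vel, rel_other, rel_target])

# Example with arbitrary coordinates:
s = build_state(np.array([0.0, 0.5]), np.array([0.1, -0.2]),
                np.array([-0.1, 0.2]), np.array([0.4, 0.3]),
                np.array([0.7, -0.6]))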
Step 3: establishing the action space of each agent.
The state s_i of agent i is taken as the input of the policy network, which outputs the action policy a_i of agent i:
a_i = μ_i(s_i) + N_{noise}   (3)
where N_{noise} is environmental noise and μ_i is the policy of agent i.
To simulate the motion behavior of an actual agent, the agent model outputs a two-dimensional acceleration vector, from which the value of the instantaneous velocity vector is obtained. The acceleration and the instantaneous velocity are limited to specified ranges, and a policy-integration method is adopted to obtain the two-dimensional acceleration vector that directly generates continuous motion from the acceleration components of the agent in the left, right, up, and down directions, where η is an acceleration-related sensitivity coefficient used to limit the range of the acceleration. After a time interval Δt, the agent moves to the next position, which is updated from the instantaneous velocity and Δt.
the action space of agent i is made up of actions taken by the agent at various times.
Step 4: establishing the reward functions of the agents.
The attack-agent collision reward function r_{col,att} and the defense-agent collision reward function r_{col,def} are defined.
For an attack agent and a target, the distance reward function r_{AT} is designed based on the potential function, where d_{AT}^t and d_{AT}^{t+1} denote the distances between the attack agent and the target at time t and time t+1, respectively, and λ is the moving step length of the agent.
For the attack agent and the defense agent, the distance reward function r_{AD} is designed based on the potential function, where d_{AD}^t and d_{AD}^{t+1} denote the distances between the attack agent and the defense agent at time t and time t+1, respectively.
In summary, the reward function of the attack agent is
r_{att} = r_{col,att} + r_{AT}   (11)
and the reward function of the defense agent is
r_{def} = r_{col,def} + r_{AD}   (12)
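Because the distance-reward formulas themselves are not reproduced in the text above, the sketch below only illustrates the potential-based shaping idea they describe: the reward depends on how the relevant distance changes between time t and time t+1, with the attacker term normalised by the step length λ. The exact functional forms are assumptions of this example, not the patented formulas.

def r_AT(d_t, d_t1, lam):
    # Assumed potential-based shaping for the attacker: positive when it closes on the target.
    return (d_t - d_t1) / lam

def r_AD(d_t, d_t1):
    # Assumed potential-based shaping for the defender: positive when it closes on the attacker.
    return d_t - d_t1

def attacker_reward(r_col_att, d_at_t, d_at_t1, lam):
    return r_col_att + r_AT(d_at_t, d_at_t1, lam)   # r_att = r_col,att + r_AT   (formula 11)

def defender_reward(r_col_def, d_ad_t, d_ad_t1):
    return r_col_def + r_AD(d_ad_t, d_ad_t1)        # r_def = r_col,def + r_AD   (formula 12)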
Step 5: establishing and training a multi-agent task decision network model based on a PF-MADDPG algorithm;
step 5-1: initializing the multi-agent attack and defense countermeasure environment established in the step 1;
step 5-2: each intelligent agent randomly selects actions, so that the state of each intelligent agent changes, and the countermeasure environment changes;
step 5-3: calculating the reward obtained by each agent after the state change according to the agent reward functions (11) and (12) of step 4, thereby obtaining the state transition sequence ⟨s_t, a_t, r_t, s_{t+1}⟩ of each agent, where s_t is the state of the agent at time t, s_{t+1} is the state of the agent at time t+1, a_t is the action taken by the agent in state s_t at time t, and r_t is the reward value obtained by taking action a_t in state s_t at time t;
step 5-4: the state transition sequence storage space of each intelligent agent is defined as an experience pool, and the state transition sequence obtained in each moment in the step 5-3 is stored in the experience pool;
step 5-5: the control network of each agent adopts an Actor-Critic structure; the Actor network and the Critic network each adopt a dual-network structure comprising a Target network and an Eval network;
step 5-6: randomly taking a batch of experiences at different moments from the experience pool to form an experience package ⟨S, A, R, S′⟩, where S is the set of agent states at the current moment, S′ is the set of agent states at the next moment, A is the set of actions taken by the agents in state set S at the current moment, and R is the set of reward values obtained after the agents take the actions in A in state set S at the current moment;
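Steps 5-3 to 5-6 can be illustrated with a simple experience pool that stores the joint transitions and returns a random batch ⟨S, A, R, S′⟩; the class below is a generic replay-buffer sketch, not the data structure of the application.

import random
from collections import deque

class ExperiencePool:
    # Generic replay buffer for joint transitions <s_t, a_t, r_t, s_{t+1}> of all agents.
    def __init__(self, capacity=100000):
        self.storage = deque(maxlen=capacity)

    def __len__(self):
        return len(self.storage)

    def add(self, states, actions, rewards, next_states):
        # One entry holds the states, actions and rewards of all agents at one time step.
        self.storage.append((states, actions, rewards, next_states))

    def sample(self, batch_size):
        # Randomly take a batch of experiences from different moments: <S, A, R, S'>.
        batch = random.sample(self.storage, batch_size)
        S, A, R, S_next = zip(*batch)
        return list(S), list(A), list(R), list(S_next)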
step 5-7: inputting the next-moment state set S′ into the Actor network of each agent and outputting the next-moment action set A′, then taking S′ and A′ as inputs of the Critic network and calculating the target Q value estimated by each agent for the next moment;
step 5-8: defining the loss function of the Critic network of agent i as:
Loss_i = (1/M) Σ_{j=1}^{M} ( y_i^j − Q_i^μ(x^j, a_1^j, …, a_N^j) )²   (13)
where N is the number of agents and M is the number of experiences extracted during training; Q_i^μ(x^j, a_1^j, …, a_N^j) is the Q value that takes the actions of all agents and some state information x^j as input, x^j being composed of the observations (o_1, o_2, …, o_N) of all agents (here equivalent to their states) and possibly other state information; y_i^j denotes the target Q value of the j-th experience of agent i, defined by formula (14):
y_i^j = r_i^j + γ Q_i^{μ′}(x′^j, a_1′, …, a_N′)   (14)
where r_i^j is the reward value of the j-th experience of agent i, γ is the discount factor, and Q_i^{μ′}(x′^j, a_1′, …, a_N′) is the Q value that takes the next actions of all agents and the next-moment state information x′^j as input.
Step 5-9: updating the Eval network in the Critic network by minimizing the loss function (13) in steps 5-8;
step 5-10: defining the policy gradient calculation formula:
∇_{θ_i} J ≈ (1/M) Σ_{j=1}^{M} ∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x^j, a_1^j, …, a_i, …, a_N^j) |_{a_i = μ_i(o_i^j)}   (15)
step 5-11: updating the Eval network in the Actor network through formula (15);
step 5-12: updating the Target networks in the Actor network and the Critic network at fixed intervals by a soft-update method:
θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′},  θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′}
where τ is the soft-update coefficient, and θ^Q and θ^μ are the weights of the Eval networks in the Critic network and the Actor network, respectively.
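The Actor update of steps 5-10 and 5-11 and the soft target update of step 5-12 could look as follows; maximising the Q value is implemented as minimising its negative mean, and τ is the soft-update coefficient. The agent attributes are again assumed for illustration only.

import torch

def actor_update(agent_i, batch):
    # Deterministic policy-gradient step for the Eval Actor of agent_i (in the spirit of formula (15)).
    S, A, _, _ = batch
    M = S.shape[0]
    # Re-evaluate agent_i's own action with its Eval Actor; keep the stored actions of the other agents.
    actions = [A[:, k] for k in range(A.shape[1])]
    actions[agent_i.index] = agent_i.actor_eval(S[:, agent_i.index])
    x = torch.cat([S.reshape(M, -1)] + actions, dim=-1)
    actor_loss = -agent_i.critic_eval(x).mean()    # ascend Q by descending its negative mean
    agent_i.actor_optimizer.zero_grad()
    actor_loss.backward()
    agent_i.actor_optimizer.step()

def soft_update(target_net, eval_net, tau=0.01):
    # Step 5-12: theta_target <- tau * theta_eval + (1 - tau) * theta_target.
    for t_param, e_param in zip(target_net.parameters(), eval_net.parameters()):
        t_param.data.copy_(tau * e_param.data + (1.0 - tau) * t_param.data)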
Step 5-13: repeating the steps 5-2 to 5-12, and stopping repeating when the maximum training frequency set value is reached;
step 6: and realizing task decision of multiple agents by using a trained multi-agent task decision network model based on the PF-MADDPG algorithm.
Examples:
the final network structure in this example is designed as: the Actor network is [64;64;5] a fully connected neural network, the Critic network being [64;64;1] and the hidden layers of both neural networks use the Relu function as an activation function, as shown in fig. 2. During training, the maximum training round number is 160000, the maximum interaction step length per round is 40, the experience number extracted from the experience pool per time is 1024, the discount factor gamma is 0.95, and the auxiliary network updating rate tau is 0.01. The learning rate of the Actor network and the Critic network is set to 0.01, and both networks are optimized by adopting an Adam Optimizer.
In the embodiment, the positions of 2 attack agents, 2 defense agents and a static target area are initialized in randomly chosen areas satisfying certain constraints within the continuous two-dimensional environment model. The x and y axes of the environment take values in the finite coordinate range [-1, 1]. To reasonably simulate the confrontation process, static target regions are randomly generated within the x-axis and y-axis coordinate ranges [0.6, 0.8], [-0.8, -0.4] and [-0.9, -0.8], respectively. The unit of time in the environment is 0.1. The 2 attack agents start from their initial positions, avoid the defense agents, and attack the static target area; the 2 defense agents intercept the attack agents. The attack agents are represented by green circles of radius 0.05 with a maximum speed of 1.5, the defense agents by red circles of radius 0.07 with a maximum speed of 0.5, and the stationary target area by a black circle of radius 0.1. An example of two attack agents trained with PF-MADDPG against two defense agents trained with MADDPG is shown in fig. 3, where it can be seen that at least one attack agent successfully hits the target area. As shown in fig. 4 and fig. 5, the average reward curves of the confrontation between two attack agents trained with PF-MADDPG and two defense agents trained with MADDPG, and between two attack agents trained with MADDPG and two defense agents trained with PF-MADDPG, are compared; the results show that the task decision network constructed with PF-MADDPG converges quickly and meets the task decision requirements.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the application can be realized in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (4)

1. The multi-agent task decision method based on PF-MADDPG is characterized by comprising the following steps of:
step 1: setting a plurality of attackers, a plurality of defenders and a plurality of static target areas in the attack and defense countermeasure environment, randomly generating positions of each agent and the target areas in the environment, and constructing a multi-agent attack and defense countermeasure environment;
step 2: establishing a state space for each agent, wherein the state space of agent i consists of the states of the agent at each moment, and the state of each agent comprises its own velocity and position, the velocities and relative positions of the other agents, and the relative position with respect to the target area;
step 3: establishing an action space of each intelligent agent i, wherein the action space of each intelligent agent i is formed by actions taken by the intelligent agent at various moments;
step 4: establishing a reward function of the intelligent agent; the method specifically comprises the following substeps:
defining the attack-agent collision reward function r_{col,att} and the defense-agent collision reward function r_{col,def};
For an attack agent and a target, a corresponding distance reward function r_{AT} is constructed based on the potential function, wherein d_{AT}^t and d_{AT}^{t+1} denote the distances between the attack agent and the target at time t and time t+1, respectively, and λ is the moving step length of the agent;
for the attack agent and the defense agent, a corresponding distance reward function r_{AD} is constructed based on the potential function, wherein d_{AD}^t and d_{AD}^{t+1} denote the distances between the attack agent and the defense agent at time t and time t+1, respectively;
the reward function of the attack agent is calculated as:
r_{att} = r_{col,att} + r_{AT};
the reward function of the defense agent is calculated as:
r_{def} = r_{col,def} + r_{AD};
step 5: establishing and training a multi-agent task decision network model based on a PF-MADDPG algorithm, wherein the process of acquiring an experience pool required for training the multi-agent task decision network model comprises the following steps:
calculating the reward obtained by each agent after the state change according to the reward functions of the attack agents and the defense agents in step 4, and obtaining the state transition sequence ⟨s_t, a_t, r_t, s_{t+1}⟩ of each agent, wherein s_t is the state of the agent at time t, s_{t+1} is the state of the agent at time t+1, a_t is the action taken by the agent in state s_t at time t, and r_t is the reward value obtained by taking action a_t in state s_t at time t; defining the state transition sequence storage space of each agent as an experience pool, and storing the state transition sequence obtained at each moment into the experience pool;
step 6: and realizing task decision of the multiple agents by using the trained multi-agent task decision network model.
2. The PF-MADDPG based multi-agent task decision method according to claim 1, wherein in step 2, the state s_{i,t} of agent i at time t is defined as:
s_{i,t} = {v_{self,x}, v_{self,y}, x_{self}, y_{self}, v_{other,x}, v_{other,y}, x_{other}, y_{other}};
wherein v_{self,x} and v_{self,y} are the velocity components of agent i along the x and y axes at time t, x_{self} and y_{self} are the x- and y-axis position coordinates of agent i at time t, v_{other,x} and v_{other,y} are the velocity components of the other agents along the x and y axes at time t, and x_{other} and y_{other} are the relative position coordinates of agent i with respect to the other agents at time t;
the training period runs from t_0 to t_T, and the state space of agent i is composed of the states s_i of the agent at each moment.
3. The PF-MADDPG based multi-agent task decision method according to claim 2, wherein in step 3, the process of establishing the action space of an agent comprises the following sub-steps:
the state s_i of agent i is taken as the input of the policy network, which outputs the action policy a_i of agent i:
a_i = μ_i(s_i) + N_{noise};
wherein N_{noise} is environmental noise and μ_i is the policy of agent i;
a two-dimensional acceleration vector is obtained by a policy-integration method from the acceleration components of the agent in the left, right, up, and down directions, wherein η is an acceleration-related sensitivity coefficient used to limit the range of the acceleration;
the value of the instantaneous velocity vector is calculated from the acceleration;
after a time interval Δt, the agent moves to the next position, which is updated from the instantaneous velocity and Δt;
the action space of agent i is made up of the actions taken by the agent at each moment.
4. The PF-MADDPG based multi-agent task decision method according to claim 1, wherein in step 5, the process of establishing and training the multi-agent task decision network model based on the PF-MADDPG algorithm comprises the following sub-steps:
step 5-1: initializing the multi-agent attack and defense countermeasure environment established in the step 1;
step 5-2: each intelligent agent randomly selects actions, so that the state of each intelligent agent changes, and the countermeasure environment changes;
step 5-3: calculating the reward obtained by each agent after the state change according to the agent reward function of step 4, and obtaining the state transition sequence ⟨s_t, a_t, r_t, s_{t+1}⟩ of each agent, wherein s_t is the state of the agent at time t, s_{t+1} is the state of the agent at time t+1, a_t is the action taken by the agent in state s_t at time t, and r_t is the reward value obtained by taking action a_t in state s_t at time t;
step 5-4: the state transition sequence storage space of each intelligent agent is defined as an experience pool, and the state transition sequence obtained in each moment in the step 5-3 is stored in the experience pool;
step 5-5: the control network of each agent adopts an Actor-Critic structure; the Actor network and the Critic network each adopt a dual-network structure comprising a Target network and an Eval network;
step 5-6: randomly taking a batch of experiences at different moments from the experience pool to form an experience package ⟨S, A, R, S′⟩, wherein S is the set of agent states at the current moment, S′ is the set of agent states at the next moment, A is the set of actions taken by the agents in state set S at the current moment, and R is the set of reward values obtained after the agents take the actions in A in state set S at the current moment;
step 5-7: inputting the next-moment state set S′ into the Actor network of each agent and outputting the next-moment action set A′, then taking S′ and A′ as inputs of the Critic network and calculating the target Q value estimated by each agent for the next moment;
step 5-8: defining the loss function of the Critic network of agent i as:
Loss_i = (1/M) Σ_{j=1}^{M} ( y_i^j − Q_i^μ(x^j, a_1^j, …, a_N^j) )²
wherein N is the number of agents and M is the number of experiences extracted during training; Q_i^μ(x^j, a_1^j, …, a_N^j) is the Q value that takes the actions of all agents and some state information x^j as input, x^j being composed of the observations (o_1, o_2, …, o_N) of all agents and possibly other state information; y_i^j denotes the target Q value of the j-th experience of agent i, calculated as:
y_i^j = r_i^j + γ Q_i^{μ′}(x′^j, a_1′, …, a_N′)
wherein r_i^j is the reward value of the j-th experience of agent i, γ is the discount factor, and Q_i^{μ′}(x′^j, a_1′, …, a_N′) is the Q value that takes the next actions of all agents and the next-moment state information x′^j as input;
step 5-9: updating the Eval network in the Critic network by minimizing the Loss function Loss in steps 5-8;
step 5-10: defining the policy gradient calculation formula:
∇_{θ_i} J ≈ (1/M) Σ_{j=1}^{M} ∇_{θ_i} μ_i(o_i^j) ∇_{a_i} Q_i^μ(x^j, a_1^j, …, a_i, …, a_N^j) |_{a_i = μ_i(o_i^j)}
step 5-11: updating the Eval network in the Actor network through the policy gradient calculation formula;
step 5-12: updating Target networks in the Actor network and the Critic network by a soft updating method at fixed intervals;
step 5-13: repeating steps 5-2 to 5-12 until the set maximum number of training episodes is reached.
Application CN202310445392.1A, filed 2023-04-23 (priority date 2023-04-23) — Multi-agent task decision method based on PF-MADDPG — Pending — published as CN116776929A (en)

Priority Applications (1)

Application Number: CN202310445392.1A — Title: Multi-agent task decision method based on PF-MADDPG (published as CN116776929A)

Applications Claiming Priority (1)

Application Number: CN202310445392.1A — Title: Multi-agent task decision method based on PF-MADDPG (published as CN116776929A)

Publications (1)

Publication Number Publication Date
CN116776929A true CN116776929A (en) 2023-09-19

Family

ID=88008868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310445392.1A Pending CN116776929A (en) 2023-04-23 2023-04-23 Multi-agent task decision method based on PF-MADDPG

Country Status (1)

Country Link
CN (1) CN116776929A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117490696A (en) * 2023-10-23 2024-02-02 广州创源机器人有限公司 Method for accelerating navigation efficiency of robot
CN117808174A (en) * 2024-03-01 2024-04-02 山东大学 Micro-grid operation optimization method and system based on reinforcement learning under network attack
CN117808174B (en) * 2024-03-01 2024-05-28 山东大学 Micro-grid operation optimization method and system based on reinforcement learning under network attack

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination