CN108600379A - A heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient - Google Patents

A heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient

Info

Publication number
CN108600379A
Authority
CN
China
Prior art keywords
action, parameter, actor, state, critic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810397866.9A
Other languages
Chinese (zh)
Inventor
李瑞英
王瑞
胡晓惠
张慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN201810397866.9A priority Critical patent/CN108600379A/en
Publication of CN108600379A publication Critical patent/CN108600379A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14: Network analysis or design
    • H04L 41/142: Network analysis or design using statistical or mathematical methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14: Network analysis or design
    • H04L 41/145: Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient, belonging to the field of collaborative decision-making for heterogeneous intelligent unmanned systems, and comprising the following steps: first, define the characteristic attributes and reward rules of the heterogeneous multi-agent system, specify the state space and action space of the agents, and construct the motion environment in which the agents make collaborative decisions; then, based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and train the parameters of the learning model; using the trained model, obtain the state sequences of the agents; finally, according to the reward rules set in the environment, perform situation assessment on the motion state sequences of the agents. The present invention can construct a reasonable motion environment according to actual requirements and, through cooperation among the agents in the system, achieve intelligent perception and policy optimization, which has a positive effect on the development of the unmanned-systems field in China.

Description

A heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient
Technical field
The invention belongs to the field of collaborative decision-making for heterogeneous intelligent unmanned systems, and in particular relates to a heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient.
Background technology
In recent years, the rapid development of information technology and intelligent perception technology has laid an important foundation for advanced intelligent behaviors such as the perception of complex environments, accurate intelligent decision-making, and multi-machine task collaboration. Research on intelligent unmanned systems has become a landmark achievement of artificial intelligence; the complexity of their tasks and the uncertainty of dynamic environments require such systems to have strong adaptive and decision-making capabilities.
Traditional swarm intelligence (Swarm Intelligence) [1] dates back to 1959, when the French biologist Pierre-Paul Grassé found that insects form highly structured organizations that can complete complex tasks far beyond the ability of any individual. The ant colony is a classic example of such collective intelligence: through simple communication and coordination among individuals, it exhibits large-scale swarm intelligent behavior. Exploration of the collective behavior of insects has produced many swarm-intelligence algorithms, such as the ant colony system (Ant Colony System, ACS) [2] and particle swarm optimization (Particle Swarm Optimization, PSO). Traditional intelligent unmanned swarm systems are likewise based on biological swarm behavior: through mutual perception and information transfer, they cooperate at low cost in dangerous environments to complete diverse and complex tasks. Task allocation in current unmanned swarms is usually carried out according to the principle of maximizing the benefit-to-loss ratio (maximum allocation gain, minimum loss) and balancing tasks, which reflects the cooperative advantage of a swarm; however, these swarm algorithms are not yet mature and are unsuitable for the autonomous planning of large-scale complex tasks.
Situation-awareness learning methods based on deep reinforcement learning can give intelligent unmanned systems self-learning ability and improve their adaptability to complex and changing environments. Reinforcement learning has a long history; early reinforcement learning is closely related to the Markov decision process (MDP) model, which can be reduced to a four-tuple of state s, action a, reward r and transition probability P. The goal of learning is to find a policy: in a given state, different actions are taken with different probabilities and yield different returns. Its advantage is strong expressive power and good decision-making ability; its disadvantage is that both actions and states are discrete. In 2006, Hinton et al. proposed encoding deep neural networks with restricted Boltzmann machines (RBM, Restricted Boltzmann Machine) [3], drawing attention back to neural networks; in 2012, the breakthrough of deep convolutional networks [4] in the ImageNet competition [5] ushered in the boom of deep learning; in 2016, the deep reinforcement learning algorithm derived by combining the perception ability of deep learning with the decision-making ability of reinforcement learning brought the immense success of AlphaGo [6], setting a new milestone for the development of artificial intelligence. Using deep reinforcement learning for the intelligent control of robots [7-9] has become a new research direction.
References:
[1] Guy Theraulaz, Eric Bonabeau: A Brief History of Stigmergy. Artificial Life 5(2): 97-116 (1999)
[2] Marco Dorigo, Vittorio Maniezzo, Alberto Colorni: Ant system: optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B 26(1): 29-41 (1996)
[3] Geoffrey E. Hinton: Boltzmann machine. Scholarpedia 2(5): 1668 (2007)
[4] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012: 1106-1114
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Fei-Fei Li: ImageNet: A large-scale hierarchical image database. CVPR 2009: 248-255
[6] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484-489 (2016)
[7] Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, Peter I. Corke: Towards Vision-Based Deep Reinforcement Learning for Robotic Motion Control. CoRR abs/1511.03791 (2015)
[8] Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen: Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. CoRR abs/1603.02199 (2016)
[9] Chelsea Finn, Sergey Levine, Pieter Abbeel: Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. CoRR abs/1603.00448 (2016)
Summary of the invention
The technical problem solved by the present invention: based on existing algorithms and techniques, a heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient is proposed. The method first constructs the motion environment in which the heterogeneous multi-agent system makes collaborative decisions; it then establishes, based on the deep deterministic policy gradient algorithm, an actor module that makes decision actions and a critic module that evaluates feedback, and trains the parameters of the learning model; finally, collaborative decision-making among the heterogeneous agents is realized.
The technical solution of the invention: a heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient, comprising the following steps:
Step 1: define the characteristic attributes and reward rules of the heterogeneous multi-agent system, specify the state space and action space of the agents, abstract each agent as a motion node in the environment, and construct the motion environment in which the heterogeneous agents move cooperatively;
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and randomly initialize their parameters;
Step 3: the agents explore autonomously and randomly in the motion environment built in step 1: each agent obtains an action a from the actor module according to its current state s and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward rules, and each step <current state s, current action a, next state s', reward r> is stored in the experience pool;
Step 4: according to the <s, a, s', r> tuples stored in the experience pool in step 3, train and learn the parameters of the critic module and the actor module, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached;
Step 5: using the trained model, obtain the current action a of an agent given its current state s, and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent; at the same time, complete the situation assessment of the agent's motion state sequence according to the reward rules set in the environment. For an unmanned system, the optimal decision behavior cannot be obtained intuitively, so the quality of the decision behavior is analyzed and judged according to the reward rules set in the environment.
Preferably, the specific implementation of step 1 includes the following sub-steps:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, abstract each agent as a motion node in the environment;
Step 1.2: set the action of an agent: [direction of motion of the next step]; set the state of an agent: [its own position coordinates x, y; the position coordinates x, y of the target; the azimuth angle θ between its own position and the target position];
Step 1.3: set the reward rules in the environment;
Step 1.4: the abstracted motion nodes of the agents, the action space and state space of the agents, and the reward rules in the environment together constitute the motion environment in which the heterogeneous multi-agent system makes collaborative decisions.
Preferably, the specific implementation of step 2 includes the following sub-steps:
Step 2.1: the parameter updates of the actor module and the critic module are built on experience-based learning; a separate experience pool is set up to store the state-action pairs of each motion node <current state s, current action a, next state s', reward r>;
Step 2.2: establish the actor module, which takes the state s of each agent as the input of the network and obtains the next output action a of each agent through several intermediate layers; meanwhile, since the parameters of the network change dynamically during each round of iteration, a copy of the actor network structure is retained to make the parameter learning more stable, and this copy is updated only at a fixed time-step interval;
Step 2.3: establish the critic module, which takes the state s and action a of an agent as the input of the network and outputs the action value Q through several intermediate layers; meanwhile, to make the parameter learning more stable, a copy of the critic network structure is retained and is likewise updated only at a fixed time-step interval.
Preferably, the specific implementation of step 4 includes the following sub-steps:
Step 4.1: the critic module contains two network models with identical structure but different parameter update times; the network model Q whose parameters are updated immediately is called the online critic, and its parameters are denoted θ^Q; the network model Q' whose parameters are updated with a delay is called the target critic, and its parameters are denoted θ^Q';
For the target critic: according to an experience-pool tuple <current state s, current action a, next state s', reward r>, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the next action a' to be taken in state s' is estimated with the target actor network, the target action-value function is computed as Q'(s', a' | θ^Q'), and from Q' the estimated expected return y of taking action a in state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor;
For the online critic: the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool;
The mean squared error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))²
The parameters of the online critic network are updated using the error L;
The target critic is a delayed update of the online critic, and its parameter update formula is:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor;
Step 4.2: the actor module contains two network models with identical structure but different parameter update times; the network model μ whose parameters are updated immediately is the online actor, and its parameters are denoted θ^μ; the network model μ' whose parameters are updated with a delay is the target actor, and its parameters are denoted θ^μ';
For the target actor: the next action a' is computed from the next state s' in an experience-pool tuple <current state s, current action a, next state s', reward r>, i.e. a' = μ'(s' | θ^μ'), and is used to compute the target critic's target action-value function Q'(s', a' | θ^Q');
For the online actor: the actual current action is computed from the current state s in the experience pool, i.e. μ(s | θ^μ); the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the gradient descent formula:
∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_θ^μ μ(s | θ^μ)|_{s=s_i}
The target actor is a delayed update of the online actor, and its parameter update formula is:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor;
Step 4.3: train the model parameters of the critic network and the actor network, and replace previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached.
Compared with the prior art, the advantages and positive effects of the present invention are as follows:
(1) A feasible construction method for a heterogeneous multi-agent cooperative environment is proposed: by defining information such as the attributes, states and actions of the agents, the motion rules and the reward scheme, a heterogeneous multi-agent environment oriented to a particular task is constructed;
(2) The original deep deterministic policy gradient algorithm is improved: the parameters of the actor module and the critic module are updated by sharing the states and actions of the agents with one another; the critic module is computed from the actions output by the actor module, and in turn the critic module guides the parameter update of the actor module, reinforcing high-return actions and suppressing low-return actions, thereby achieving collaborative decision-making among the agents;
(3) For an unmanned system, the optimal decision of the system cannot be obtained intuitively. The present invention proposes to analyze and judge the quality of the decision behavior from the sequence of actions taken from the initial state to the end of iteration, using the reward rules provided by the constructed heterogeneous multi-agent environment, so as to complete the evaluation of the model.
Description of the drawings
Fig. 1 is the implementation flow chart of the present invention;
Fig. 2 is a schematic diagram of the structure of the heterogeneous multi-agent cooperative environment;
Fig. 3 is the network structure of the actor module;
Fig. 4 is the network structure of the critic module;
Fig. 5 is the data flow diagram of the actor module and the critic module.
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to the embodiments and the accompanying drawings. The embodiments described here are only intended to illustrate and explain the present invention, not to limit it.
The heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient proposed by the present invention mainly includes the following steps: first, define the characteristic attributes and reward rules of the heterogeneous agents, specify the state space and action space of the agents, and construct the motion environment in which the multi-agent system makes collaborative decisions; then, using the deep deterministic policy gradient algorithm, define an actor module that makes decision actions and a critic module that evaluates feedback, train the parameters of the learning model, and automatically make the next decision action according to the position of the agent in its environment and its state information; finally, since the optimal decision of the system cannot be given subjectively and can only be judged against certain evaluation criteria, the quality of the decision actions is examined using the reward rules defined in the environment as the basis. The present invention can construct a reasonable motion environment according to actual requirements and, through cooperation among the agents in the system, achieve intelligent perception and policy optimization, which has a positive effect on the development of the unmanned-systems field in China.
The following detailed description of.
Step 1: define the characteristic attributes and reward rules of the heterogeneous multi-agent system, specify the state space and action space of the agents, abstract each agent as a motion node in the environment, and construct the motion environment in which the heterogeneous agents make collaborative decisions;
In a specific implementation, the motion rules of the agents should be formulated according to the specific motion model, the motion space and action space of the agents should be specified, and a reasonable reward mechanism should be designed.
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and initialize the parameters of the actor module and the critic module;
The deep deterministic policy gradient algorithm is a reinforcement learning method based on the actor-critic framework and mainly contains two modules: the actor module and the critic module. The actor module is responsible for computing the action to be taken in the next step according to the current state; the critic module is responsible for giving feedback corrections to the parameters of the actor module according to the estimated expected return produced by the current state and the action taken. In the initial training stage, the parameters of the two modules need to be initialized separately.
Step 3: in the motion environment built in step 1, the agents rely on the algorithm of step 2 and explore autonomously and randomly for a certain number of initial iterations. Each agent obtains an action a from the actor module according to its current state s and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward rules, and each step <current state s, current action a, next state s', reward r> is stored in the experience pool for the subsequent learning of the module parameters;
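For illustration only (not part of the patent text), the experience pool of step 3 can be sketched as a simple fixed-capacity replay buffer; the class name, capacity and sampling interface below are assumptions.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity experience pool storing <s, a, s', r> tuples.

    Once the pool is full, newly generated tuples overwrite the oldest ones,
    matching the replacement behaviour described in step 4.
    """

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are dropped automatically

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        """Sample a random mini-batch of transitions for training."""
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        states, actions, next_states, rewards = zip(*batch)
        return states, actions, next_states, rewards
```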
Step 4: according to the <s, a, s', r> tuples stored in the experience pool in step 3, train and learn the parameters of the critic module and the actor module, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached.
To carry out the parameter updates stably, the critic module and the actor module each contain one network structure whose parameters are updated online in real time and one network structure whose parameters are updated with a delay of a certain number of time steps. The parameter update of the critic module relies on the action a computed by the actor module, while the parameter update of the actor module relies on the action-value gradient computed by the critic module; the two feed back to each other to achieve cooperative motion of the agents.
Step 5: using the trained model, obtain the current action a of an agent given its current state s, and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent; at the same time, complete the situation assessment of the agent's motion state sequence according to the reward rules set in the environment. For an unmanned system, the optimal decision behavior cannot be judged subjectively and can only be analyzed according to certain objective criteria; the quality of the decision behavior is therefore analyzed and judged according to the reward rules set in the environment.
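As a hedged illustration of step 5, a rollout and situation-assessment loop might look like the following sketch; the env, actor, termination and reward interfaces are assumptions, not part of the patent.

```python
def evaluate(env, actor, max_steps=200):
    """Roll out the trained actor and collect the state sequence and rewards.

    env.reset(), env.step(action) -> (state, reward, done) and actor(state)
    are assumed interfaces: the actor maps the current state s to action a.
    """
    state = env.reset()
    state_sequence = [state]
    total_reward = 0.0
    for _ in range(max_steps):
        action = actor(state)                   # current action a from state s
        state, reward, done = env.step(action)  # reach next state s', get reward r
        state_sequence.append(state)
        total_reward += reward                  # accumulated reward used for situation assessment
        if done:                                # task completed or environment terminated
            break
    return state_sequence, total_reward
```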
The specific implementation process of the above steps 1-5 is described in detail below.
1. Constructing the motion environment for multi-agent collaborative decision-making
The implementation is shown schematically in Fig. 2 and is divided into the following 4 sub-steps:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, such as the maximum movement speed and the range of the movement region, abstract each agent as a motion node in the environment;
Step 1.2: set the action space and state space of the agents; in the present invention, the action of an agent is set to [direction of motion of the next step], and the state of an agent is set to [its own position coordinates x, y; the position coordinates x, y of the target; the azimuth angle θ between itself and the target];
Step 1.3: set the reward mechanism in the environment, i.e. the environmental reward given when certain states are reached between the agents. The present invention mainly sets three kinds of reward rules: a certain initial distance is kept between agents, and agents must not come too close to one another; there is a maximum communication distance between agents, and the penalty increases once this distance is exceeded; and a corresponding reward is given according to whether an agent can monitor the target, which is the final purpose of the collaborative decision-making.
Step 1.4: the abstracted motion nodes of the agents, the action space and state space of the agents, and the reward rules in the environment together constitute the cooperative environment of the heterogeneous multi-agent system: for each agent, the next action and the reward information are obtained according to the current observation, so as to guide the continual optimization of the decision.
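A minimal sketch of such a cooperative environment is given below, assuming two-dimensional motion nodes and the three reward rules described above; all class names, distance thresholds and reward magnitudes are illustrative assumptions.

```python
import numpy as np

class CooperativeEnv:
    """Toy heterogeneous multi-agent environment following steps 1.1-1.4.

    State of an agent: [own x, own y, target x, target y, azimuth theta].
    Action of an agent: the direction of motion of the next step (radians).
    All distance thresholds below are illustrative placeholders.
    """

    def __init__(self, n_agents=3, min_dist=1.0, max_comm_dist=10.0, monitor_dist=2.0):
        self.n_agents = n_agents
        self.min_dist = min_dist              # agents must not come too close
        self.max_comm_dist = max_comm_dist    # farthest communication distance
        self.monitor_dist = monitor_dist      # distance at which the target is monitored
        self.positions = np.random.rand(n_agents, 2) * 5.0
        self.target = np.random.rand(2) * 5.0

    def state(self, i):
        dx, dy = self.target - self.positions[i]
        theta = np.arctan2(dy, dx)            # azimuth between agent i and the target
        return np.array([*self.positions[i], *self.target, theta], dtype=np.float32)

    def step(self, actions, speed=0.1):
        """Move every agent one step along its chosen direction and compute rewards."""
        for i, a in enumerate(actions):
            self.positions[i] += speed * np.array([np.cos(a), np.sin(a)])
        rewards = np.zeros(self.n_agents)
        for i in range(self.n_agents):
            for j in range(i + 1, self.n_agents):
                d = np.linalg.norm(self.positions[i] - self.positions[j])
                if d < self.min_dist:         # rule 1: agents too close
                    rewards[i] -= 1.0
                    rewards[j] -= 1.0
                if d > self.max_comm_dist:    # rule 2: beyond the communication range
                    rewards[i] -= d - self.max_comm_dist
                    rewards[j] -= d - self.max_comm_dist
            if np.linalg.norm(self.positions[i] - self.target) < self.monitor_dist:
                rewards[i] += 1.0             # rule 3: the target is monitored
        return [self.state(i) for i in range(self.n_agents)], rewards
```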
2. Establishing the network structures of the actor module and the critic module and initializing the network parameters
The actor module is used for decision actions and the critic module is used for evaluation feedback; this is divided into the following 2 steps:
1) The actor network structure used in the present invention is shown schematically in Fig. 3. With the state s of each motion node as input, the network consists of three fully connected layers (inner product layers); rectified linear units (Rectified Linear Units, ReLU) are used as the activation function after the first two fully connected layers, and the output of the third layer passes through a hyperbolic tangent function tanh(). The tanh() function is a variant of the sigmoid() function whose value range is [-1, 1] rather than the [0, 1] of the sigmoid function; the output result is the radian value of the next motion direction of each node. In the embodiment, the actor module is implemented on the open-source TensorFlow (abbreviated tf) deep learning framework; the network weights (Weights) and biases (Bias) are both initialized with the tf.contrib.layers.xavier_initializer function in TensorFlow, which returns a Xavier initializer for initializing the weights and ensures that the gradient magnitudes of all layers are almost the same. Since the parameters of the network change dynamically during each round of iteration, a copy of the actor network structure is retained to make the parameter learning more stable; this copy is updated only at a fixed time-step interval;
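A possible realization of this actor network is sketched below using the tf.keras API (with GlorotUniform, i.e. Xavier, initialization) rather than the older tf.contrib call mentioned above; the layer widths and the output scaling are assumptions.

```python
import numpy as np
import tensorflow as tf

def build_actor(state_dim):
    """Actor network sketch after Fig. 3: three fully connected layers,
    ReLU after the first two, tanh on the third; the tanh output in [-1, 1]
    can be scaled by pi outside the network to give a direction in radians.
    The layer widths (64) are illustrative.
    """
    init = tf.keras.initializers.GlorotUniform()  # Xavier initialization
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation='relu', kernel_initializer=init),
        tf.keras.layers.Dense(64, activation='relu', kernel_initializer=init),
        tf.keras.layers.Dense(1, activation='tanh', kernel_initializer=init),
    ])

# Hypothetical usage: direction (radians) for one agent state of dimension 5
# actor = build_actor(state_dim=5)
# direction = np.pi * actor(state[None, :])  # scale tanh output to [-pi, pi]
```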
2) The critic network structure used in the present invention is shown schematically in Fig. 4. With the state s of each motion node as input, the state passes through a fully connected layer with rectified-linear activation; the result is then combined with the action a as the input of a second fully connected layer and, after rectified-linear activation, is fed into a long short-term memory network LSTM (Long Short-Term Memory); the output result is the action value Q corresponding to state s and action a. Likewise, the critic module in the embodiment is implemented on the open-source TensorFlow deep learning framework; the network weights (Weights) and biases (Bias) are initialized with the tf.contrib.layers.xavier_initializer function in TensorFlow, and a copy of the critic network structure, whose parameters are updated only at a fixed time-step interval, is also retained.
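A corresponding sketch of the critic network, again in tf.keras with assumed layer widths and with the single transition treated as a length-1 sequence for the LSTM:

```python
import tensorflow as tf

def build_critic(state_dim, action_dim):
    """Critic network sketch after Fig. 4: state -> fully connected ReLU layer,
    concatenated with the action -> second fully connected ReLU layer -> LSTM
    -> scalar action value Q(s, a). Layer widths are illustrative choices.
    """
    init = tf.keras.initializers.GlorotUniform()             # Xavier initialization
    s_in = tf.keras.Input(shape=(state_dim,), name='state')
    a_in = tf.keras.Input(shape=(action_dim,), name='action')
    h = tf.keras.layers.Dense(64, activation='relu', kernel_initializer=init)(s_in)
    h = tf.keras.layers.Concatenate()([h, a_in])              # action enters the second layer
    h = tf.keras.layers.Dense(64, activation='relu', kernel_initializer=init)(h)
    h = tf.keras.layers.Reshape((1, 64))(h)                   # length-1 sequence for the LSTM
    h = tf.keras.layers.LSTM(32)(h)
    q = tf.keras.layers.Dense(1, kernel_initializer=init)(h)  # action value Q(s, a)
    return tf.keras.Model([s_in, a_in], q)
```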
3. Training and optimization based on the deep deterministic policy gradient algorithm
The parameter update of the critic module relies on the action a computed by the actor module, while the parameter update of the actor module relies on the action-value gradient computed by the critic module; the two feed back to each other to achieve collaborative decision-making among the agents, as shown in Fig. 5. This is divided into the following 3 steps:
Step 4.1: the critic module contains two network models with identical structure but different parameter update times; the network model Q whose parameters are updated immediately is called the online critic, and its parameters are denoted θ^Q; the network model Q' whose parameters are updated with a delay is called the target critic, and its parameters are denoted θ^Q';
For the target critic: according to an experience-pool tuple <current state s, current action a, next state s', reward r>, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the next action a' to be taken in state s' is estimated with the target actor network, the target action-value function is computed as Q'(s', a' | θ^Q'), and from Q' the estimated expected return y of taking action a in state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor;
For the online critic: the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool;
The mean squared error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))²
The parameters of the online critic network are updated using the error L;
The target critic is a delayed update of the online critic, and its parameter update formula is:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor;
Step 4.2: the actor module contains two network models with identical structure but different parameter update times; the network model μ whose parameters are updated immediately is the online actor, and its parameters are denoted θ^μ; the network model μ' whose parameters are updated with a delay is the target actor, and its parameters are denoted θ^μ';
For the target actor: the next action a' is computed from the next state s' in an experience-pool tuple <current state s, current action a, next state s', reward r>, i.e. a' = μ'(s' | θ^μ'), and is used to compute the target critic's target action-value function Q'(s', a' | θ^Q');
For the online actor: the actual current action is computed from the current state s in the experience pool, i.e. μ(s | θ^μ); the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the gradient descent formula:
∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_θ^μ μ(s | θ^μ)|_{s=s_i}
The target actor is a delayed update of the online actor, and its parameter update formula is:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor;
Step 4.3: train the model parameters of the critic network and the actor network, and replace previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached.
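To make the update rules of steps 4.1-4.3 concrete, the following hedged sketch shows one training step in tf.keras; the optimizers, batch handling and the values of γ and τ are assumptions rather than values given in the patent.

```python
import tensorflow as tf

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.01):
    """One training step following steps 4.1-4.3 (illustrative sketch only)."""
    s, a, s_next, r = batch  # tensors sampled from the experience pool; r has shape (batch, 1)

    # Step 4.1: target critic gives y = r + gamma * Q'(s', a' | theta^Q')
    a_next = target_actor(s_next)
    y = r + gamma * target_critic([s_next, a_next])
    with tf.GradientTape() as tape:
        q = critic([s, a])                                  # online expected return Q(s, a | theta^Q)
        critic_loss = tf.reduce_mean(tf.square(y - q))      # mean squared error L
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # Step 4.2: online actor ascends the action-value gradient of the online critic
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # Delayed (soft) update of the target networks: theta' = tau*theta + (1-tau)*theta'
    for target, online in ((target_critic, critic), (target_actor, actor)):
        for t_var, o_var in zip(target.variables, online.variables):
            t_var.assign(tau * o_var + (1.0 - tau) * t_var)
```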
Parts of the present invention that are not described in detail belong to techniques well known to those skilled in the art.
The above is only a partial specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection defined by the claims.

Claims (4)

1. A heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient, characterized by comprising the following steps:
Step 1: define the characteristic attributes and reward rules of the heterogeneous multi-agent system, specify the state space and action space of the agents, abstract each agent as a motion node in the environment, and construct the motion environment in which the heterogeneous agents make collaborative decisions;
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and randomly initialize their parameters;
Step 3: the agents explore autonomously and randomly in the motion environment built in step 1: each agent obtains an action a from the actor module according to its current state s and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward rules, and each step <current state s, current action a, next state s', reward r> is stored in the experience pool;
Step 4: according to the <s, a, s', r> tuples stored in the experience pool in step 3, train and learn the parameters of the critic module and the actor module, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached;
Step 5: using the trained model, obtain the current action a of an agent given its current state s, and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent; at the same time, complete the situation assessment of the agent's motion state sequence according to the reward rules set in the environment.
2. The heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of step 1 include:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, abstract each agent as a motion node in the environment;
Step 1.2: set the action of an agent: [direction of motion of the next step]; set the state of an agent: [its own position coordinates x, y; the position coordinates x, y of the target; the azimuth angle θ between its own position and the target position];
Step 1.3: set the reward rules in the environment;
Step 1.4: the abstracted motion nodes of the agents, the action space and state space of the agents, and the reward rules in the environment together constitute the motion environment in which the heterogeneous multi-agent system makes collaborative decisions.
3. The heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of step 2 are as follows:
Step 2.1: set up a separate experience pool to store the state-action pairs of each agent <current state s, current action a, next state s', reward r>;
Step 2.2: establish the actor module, which takes the state s of each agent as the input of the network and obtains the next output action a of each agent through several intermediate layers; meanwhile, retain a copy of the actor network structure, which is updated only at a fixed time-step interval;
Step 2.3: establish the critic module, which takes the state s and action a of an agent as the input of the network and outputs the action value Q through several intermediate layers; meanwhile, retain a copy of the critic network structure, which is likewise updated only at a fixed time-step interval.
4. The heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of step 4 are as follows:
Step 4.1: the critic module contains two network models with identical structure but different parameter update times; the network model Q whose parameters are updated immediately is called the online critic, and its parameters are denoted θ^Q; the network model Q' whose parameters are updated with a delay is called the target critic, and its parameters are denoted θ^Q';
For the target critic: according to an experience-pool tuple <current state s, current action a, next state s', reward r>, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the next action a' to be taken in state s' is estimated with the target actor network, the target action-value function is computed as Q'(s', a' | θ^Q'), and from Q' the estimated expected return y of taking action a in state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor;
For the online critic: the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool;
The mean squared error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))²
The parameters of the online critic network are updated using the error L;
The target critic is a delayed update of the online critic, and its parameter update formula is:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor;
Step 4.2: the actor module contains two network models with identical structure but different parameter update times; the network model μ whose parameters are updated immediately is the online actor, and its parameters are denoted θ^μ; the network model μ' whose parameters are updated with a delay is the target actor, and its parameters are denoted θ^μ';
For the target actor: the next action a' is computed from the next state s' in an experience-pool tuple <current state s, current action a, next state s', reward r>, i.e. a' = μ'(s' | θ^μ'), and is used to compute the target critic's target action-value function Q'(s', a' | θ^Q');
For the online actor: the actual current action is computed from the current state s in the experience pool, i.e. μ(s | θ^μ); the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the gradient descent formula:
∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_θ^μ μ(s | θ^μ)|_{s=s_i}
The target actor is a delayed update of the online actor, and its parameter update formula is:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor;
Step 4.3: train the model parameters of the critic network and the actor network, and replace previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached.
CN201810397866.9A 2018-04-28 2018-04-28 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient Pending CN108600379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810397866.9A CN108600379A (en) 2018-04-28 2018-04-28 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810397866.9A CN108600379A (en) 2018-04-28 2018-04-28 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient

Publications (1)

Publication Number Publication Date
CN108600379A true CN108600379A (en) 2018-09-28

Family

ID=63611007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810397866.9A Pending CN108600379A (en) 2018-04-28 2018-04-28 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient

Country Status (1)

Country Link
CN (1) CN108600379A (en)

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408157A (en) * 2018-11-01 2019-03-01 西北工业大学 A kind of determination method and device of multirobot cotasking
CN109407644A (en) * 2019-01-07 2019-03-01 齐鲁工业大学 One kind being used for manufacturing enterprise's Multi-Agent model control method and system
CN109657802A (en) * 2019-01-28 2019-04-19 清华大学深圳研究生院 A kind of Mixture of expert intensified learning method and system
CN109670270A (en) * 2019-01-11 2019-04-23 山东师范大学 Crowd evacuation emulation method and system based on the study of multiple agent deeply
CN109719721A (en) * 2018-12-26 2019-05-07 北京化工大学 A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait
CN109828460A (en) * 2019-01-21 2019-05-31 南京理工大学 A kind of consistent control method of output for two-way heterogeneous multi-agent system
CN109919319A (en) * 2018-12-31 2019-06-21 中国科学院软件研究所 Deeply learning method and equipment based on multiple history best Q networks
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends
CN109948642A (en) * 2019-01-18 2019-06-28 中山大学 Multiple agent cross-module state depth deterministic policy gradient training method based on image input
CN110045614A (en) * 2019-05-16 2019-07-23 河海大学常州校区 A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning
CN110084375A (en) * 2019-04-26 2019-08-02 东南大学 A kind of hierarchy division frame based on deeply study
CN110442129A (en) * 2019-07-26 2019-11-12 中南大学 A kind of control method and system that multiple agent is formed into columns
CN110515298A (en) * 2019-06-14 2019-11-29 南京信息工程大学 Based on the adaptive marine isomery multiple agent speed cooperative control method of optimization
CN110659796A (en) * 2019-08-08 2020-01-07 北京理工大学 Data acquisition method in rechargeable group vehicle intelligence
CN110839031A (en) * 2019-11-15 2020-02-25 中国人民解放军陆军工程大学 Malicious user behavior intelligent detection method based on reinforcement learning
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111050330A (en) * 2018-10-12 2020-04-21 中兴通讯股份有限公司 Mobile network self-optimization method, system, terminal and computer readable storage medium
CN111309880A (en) * 2020-01-21 2020-06-19 清华大学 Multi-agent action strategy learning method, device, medium and computing equipment
CN111416771A (en) * 2020-03-20 2020-07-14 深圳市大数据研究院 Method for controlling routing action based on multi-agent reinforcement learning routing strategy
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method
CN111582441A (en) * 2020-04-16 2020-08-25 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN111645076A (en) * 2020-06-17 2020-09-11 郑州大学 Robot control method and equipment
CN111687840A (en) * 2020-06-11 2020-09-22 清华大学 Method, device and storage medium for capturing space target
CN111814915A (en) * 2020-08-26 2020-10-23 中国科学院自动化研究所 Multi-agent space-time feature extraction method and system and behavior decision method and system
CN111914069A (en) * 2019-05-10 2020-11-10 京东方科技集团股份有限公司 Training method and device, dialogue processing method and system and medium
CN112015174A (en) * 2020-07-10 2020-12-01 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN112180724A (en) * 2020-09-25 2021-01-05 中国人民解放军军事科学院国防科技创新研究院 Training method and system for multi-agent cooperative cooperation under interference condition
CN112260733A (en) * 2020-11-10 2021-01-22 东南大学 Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method
CN112270451A (en) * 2020-11-04 2021-01-26 中国科学院重庆绿色智能技术研究院 Monitoring and early warning method and system based on reinforcement learning
CN112597693A (en) * 2020-11-19 2021-04-02 沈阳航盛科技有限责任公司 Self-adaptive control method based on depth deterministic strategy gradient
CN112668721A (en) * 2021-03-17 2021-04-16 中国科学院自动化研究所 Decision-making method for decentralized multi-intelligent system in general non-stationary environment
CN112853560A (en) * 2020-12-31 2021-05-28 盐城师范学院 Global process sharing control system and method based on ring spinning yarn quality
CN112926729A (en) * 2021-05-06 2021-06-08 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN112966641A (en) * 2021-03-23 2021-06-15 中国电子科技集团公司电子科学研究院 Intelligent decision-making method for multiple sensors and multiple targets and storage medium
CN112987713A (en) * 2019-12-17 2021-06-18 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN113189983A (en) * 2021-04-13 2021-07-30 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113218400A (en) * 2021-05-17 2021-08-06 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning
CN113269329A (en) * 2021-04-30 2021-08-17 北京控制工程研究所 Multi-agent distributed reinforcement learning method
CN113392798A (en) * 2021-06-29 2021-09-14 中国科学技术大学 Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation
CN113408796A (en) * 2021-06-04 2021-09-17 北京理工大学 Deep space probe soft landing path planning method for multitask deep reinforcement learning
CN113433953A (en) * 2021-08-25 2021-09-24 北京航空航天大学 Multi-robot cooperative obstacle avoidance method and device and intelligent robot
CN113467508A (en) * 2021-06-30 2021-10-01 天津大学 Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN113485119A (en) * 2021-07-29 2021-10-08 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113490578A (en) * 2019-03-08 2021-10-08 罗伯特·博世有限公司 Method for operating a robot in a multi-agent system, robot and multi-agent system
CN113534784A (en) * 2020-04-17 2021-10-22 华为技术有限公司 Decision method of intelligent body action and related equipment
CN113554300A (en) * 2021-07-19 2021-10-26 河海大学 Shared parking space real-time allocation method based on deep reinforcement learning
CN113589842A (en) * 2021-07-26 2021-11-02 中国电子科技集团公司第五十四研究所 Unmanned clustering task cooperation method based on multi-agent reinforcement learning
CN113792846A (en) * 2021-09-06 2021-12-14 中国科学院自动化研究所 State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment
CN113837654A (en) * 2021-10-14 2021-12-24 北京邮电大学 Multi-target-oriented intelligent power grid layered scheduling method
WO2022052406A1 (en) * 2020-09-08 2022-03-17 苏州浪潮智能科技有限公司 Automatic driving training method, apparatus and device, and medium
CN114548497A (en) * 2022-01-13 2022-05-27 山东师范大学 Crowd movement path planning method and system for realizing scene self-adaption
CN114638163A (en) * 2022-03-21 2022-06-17 重庆高新区飞马创新研究院 Self-learning algorithm-based intelligent group cooperative combat method generation method
CN114996856A (en) * 2022-06-27 2022-09-02 北京鼎成智造科技有限公司 Data processing method and device for airplane intelligent agent maneuver decision
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method
CN115366099A (en) * 2022-08-18 2022-11-22 江苏科技大学 Mechanical arm depth certainty strategy gradient training method based on forward kinematics
CN118071119A (en) * 2024-04-18 2024-05-24 中国电子科技集团公司第十研究所 Heterogeneous sensor mixed cooperative scheduling decision method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096729A (en) * 2016-06-06 2016-11-09 天津科技大学 A kind of towards the depth-size strategy learning method of complex task in extensive environment
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096729A (en) * 2016-06-06 2016-11-09 天津科技大学 A kind of towards the depth-size strategy learning method of complex task in extensive environment
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JELLE MUNK,JENS KOBER,ROBERT BABUSKA.: "《Learning State Representation for Deep Actor-Critic Control》", 《2016 IEEE 55TH CONFERENCE ON DECISION AND CONTROL(CDC)》 *
S PHANITEJA ; PARIJAT DEWANGAN,POOJA GUHAN.: "《A deep reinforcement learning approach for dynamically stable inverse kinematics of humanoid robots》", 《 2017 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (ROBIO)》 *
WANRONG HUANG,YANZHEN WANG,XIAODONG YI.: "《A Deep Reinforcement Learning Approach to Preserve Connectivity for Multi-robot Systems》", 《2017 10TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, BIOMEDICAL ENGINEERING AND INFORMATICS (CISP-BMEI)》 *

Cited By (88)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111050330A (en) * 2018-10-12 2020-04-21 中兴通讯股份有限公司 Mobile network self-optimization method, system, terminal and computer readable storage medium
CN109408157A (en) * 2018-11-01 2019-03-01 西北工业大学 A kind of determination method and device of multirobot cotasking
CN109408157B (en) * 2018-11-01 2022-03-04 西北工业大学 Method and device for determining multi-robot cooperative task
CN109719721A (en) * 2018-12-26 2019-05-07 北京化工大学 A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait
CN109719721B (en) * 2018-12-26 2020-07-24 北京化工大学 Adaptive gait autonomous emerging method of snake-like search and rescue robot
CN109919319A (en) * 2018-12-31 2019-06-21 中国科学院软件研究所 Deeply learning method and equipment based on multiple history best Q networks
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends
CN109407644A (en) * 2019-01-07 2019-03-01 齐鲁工业大学 One kind being used for manufacturing enterprise's Multi-Agent model control method and system
CN109670270A (en) * 2019-01-11 2019-04-23 山东师范大学 Crowd evacuation emulation method and system based on the study of multiple agent deeply
CN109948642A (en) * 2019-01-18 2019-06-28 中山大学 Multiple agent cross-module state depth deterministic policy gradient training method based on image input
CN109828460A (en) * 2019-01-21 2019-05-31 南京理工大学 A kind of consistent control method of output for two-way heterogeneous multi-agent system
CN109828460B (en) * 2019-01-21 2021-11-12 南京理工大学 Output consistency control method for bidirectional heterogeneous multi-agent system
CN109657802B (en) * 2019-01-28 2020-12-29 清华大学深圳研究生院 Hybrid expert reinforcement learning method and system
CN109657802A (en) * 2019-01-28 2019-04-19 清华大学深圳研究生院 A kind of Mixture of expert intensified learning method and system
CN113490578A (en) * 2019-03-08 2021-10-08 罗伯特·博世有限公司 Method for operating a robot in a multi-agent system, robot and multi-agent system
CN110084375B (en) * 2019-04-26 2021-09-17 东南大学 Multi-agent collaboration framework based on deep reinforcement learning
CN110084375A (en) * 2019-04-26 2019-08-02 东南大学 A kind of hierarchy division frame based on deeply study
CN111914069A (en) * 2019-05-10 2020-11-10 京东方科技集团股份有限公司 Training method and device, dialogue processing method and system and medium
WO2020228636A1 (en) * 2019-05-10 2020-11-19 京东方科技集团股份有限公司 Training method and apparatus, dialogue processing method and system, and medium
CN110045614A (en) * 2019-05-16 2019-07-23 河海大学常州校区 A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning
CN110515298A (en) * 2019-06-14 2019-11-29 南京信息工程大学 Based on the adaptive marine isomery multiple agent speed cooperative control method of optimization
CN110515298B (en) * 2019-06-14 2022-09-23 南京信息工程大学 Offshore heterogeneous multi-agent speed cooperative control method based on optimized self-adaption
CN110442129A (en) * 2019-07-26 2019-11-12 中南大学 A kind of control method and system that multiple agent is formed into columns
CN110659796A (en) * 2019-08-08 2020-01-07 北京理工大学 Data acquisition method in rechargeable group vehicle intelligence
CN110659796B (en) * 2019-08-08 2022-07-08 北京理工大学 Data acquisition method in rechargeable group vehicle intelligence
CN110839031A (en) * 2019-11-15 2020-02-25 中国人民解放军陆军工程大学 Malicious user behavior intelligent detection method based on reinforcement learning
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111026272B (en) * 2019-12-09 2023-10-31 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning
CN112987713B (en) * 2019-12-17 2024-08-13 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN112987713A (en) * 2019-12-17 2021-06-18 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN111309880A (en) * 2020-01-21 2020-06-19 清华大学 Multi-agent action strategy learning method, device, medium and computing equipment
CN111309880B (en) * 2020-01-21 2023-11-10 清华大学 Multi-agent action strategy learning method, device, medium and computing equipment
CN111416771A (en) * 2020-03-20 2020-07-14 深圳市大数据研究院 Method for controlling routing action based on multi-agent reinforcement learning routing strategy
CN111582441B (en) * 2020-04-16 2021-07-30 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN111582441A (en) * 2020-04-16 2020-08-25 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN113534784B (en) * 2020-04-17 2024-03-05 华为技术有限公司 Decision method of intelligent body action and related equipment
CN113534784A (en) * 2020-04-17 2021-10-22 华为技术有限公司 Decision method of intelligent body action and related equipment
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method
CN111687840B (en) * 2020-06-11 2021-10-29 清华大学 Method, device and storage medium for capturing space target
CN111687840A (en) * 2020-06-11 2020-09-22 清华大学 Method, device and storage medium for capturing space target
CN111645076A (en) * 2020-06-17 2020-09-11 郑州大学 Robot control method and equipment
US12045061B2 (en) 2020-07-10 2024-07-23 Goertek Inc. Multi-AGV motion planning method, device and system
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN112015174A (en) * 2020-07-10 2020-12-01 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN111814915B (en) * 2020-08-26 2020-12-25 中国科学院自动化研究所 Multi-agent space-time feature extraction method and system and behavior decision method and system
CN111814915A (en) * 2020-08-26 2020-10-23 中国科学院自动化研究所 Multi-agent space-time feature extraction method and system and behavior decision method and system
WO2022052406A1 (en) * 2020-09-08 2022-03-17 苏州浪潮智能科技有限公司 Automatic driving training method, apparatus and device, and medium
CN112180724B (en) * 2020-09-25 2022-06-03 中国人民解放军军事科学院国防科技创新研究院 Training method and system for multi-agent cooperative cooperation under interference condition
CN112180724A (en) * 2020-09-25 2021-01-05 中国人民解放军军事科学院国防科技创新研究院 Training method and system for multi-agent cooperative cooperation under interference condition
CN112270451A (en) * 2020-11-04 2021-01-26 中国科学院重庆绿色智能技术研究院 Monitoring and early warning method and system based on reinforcement learning
CN112270451B (en) * 2020-11-04 2022-05-24 中国科学院重庆绿色智能技术研究院 Monitoring and early warning method and system based on reinforcement learning
CN112260733A (en) * 2020-11-10 2021-01-22 东南大学 Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method
CN112260733B (en) * 2020-11-10 2022-02-01 东南大学 Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method
CN112597693A (en) * 2020-11-19 2021-04-02 沈阳航盛科技有限责任公司 Self-adaptive control method based on depth deterministic strategy gradient
CN112853560A (en) * 2020-12-31 2021-05-28 盐城师范学院 Global process sharing control system and method based on ring spinning yarn quality
CN112668721A (en) * 2021-03-17 2021-04-16 Decision-making method for decentralized multi-agent systems in a general non-stationary environment
CN112966641A (en) * 2021-03-23 2021-06-15 中国电子科技集团公司电子科学研究院 Intelligent decision-making method for multiple sensors and multiple targets and storage medium
CN113189983B (en) * 2021-04-13 2022-05-31 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113189983A (en) * 2021-04-13 2021-07-30 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113269329B (en) * 2021-04-30 2024-03-19 北京控制工程研究所 Multi-agent distributed reinforcement learning method
CN113269329A (en) * 2021-04-30 2021-08-17 北京控制工程研究所 Multi-agent distributed reinforcement learning method
CN112926729A (en) * 2021-05-06 2021-06-08 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN112926729B (en) * 2021-05-06 2021-08-03 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN113218400B (en) * 2021-05-17 2022-04-19 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning
CN113218400A (en) * 2021-05-17 2021-08-06 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning
CN113408796A (en) * 2021-06-04 2021-09-17 北京理工大学 Deep space probe soft landing path planning method for multitask deep reinforcement learning
CN113408796B (en) * 2021-06-04 2022-11-04 北京理工大学 Deep space probe soft landing path planning method for multitask deep reinforcement learning
CN113392798A (en) * 2021-06-29 2021-09-14 中国科学技术大学 Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation
CN113467508A (en) * 2021-06-30 2021-10-01 天津大学 Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN113467508B (en) * 2021-06-30 2022-06-28 天津大学 Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN113554300A (en) * 2021-07-19 2021-10-26 河海大学 Shared parking space real-time allocation method based on deep reinforcement learning
CN113589842B (en) * 2021-07-26 2024-04-19 中国电子科技集团公司第五十四研究所 Unmanned cluster task cooperation method based on multi-agent reinforcement learning
CN113589842A (en) * 2021-07-26 2021-11-02 中国电子科技集团公司第五十四研究所 Unmanned clustering task cooperation method based on multi-agent reinforcement learning
CN113485119A (en) * 2021-07-29 2021-10-08 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113485119B (en) * 2021-07-29 2022-05-10 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113433953A (en) * 2021-08-25 2021-09-24 北京航空航天大学 Multi-robot cooperative obstacle avoidance method and device and intelligent robot
CN113792846A (en) * 2021-09-06 2021-12-14 中国科学院自动化研究所 State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment
CN113837654B (en) * 2021-10-14 2024-04-12 北京邮电大学 Multi-objective-oriented smart grid hierarchical scheduling method
CN113837654A (en) * 2021-10-14 2021-12-24 北京邮电大学 Multi-target-oriented intelligent power grid layered scheduling method
CN114548497A (en) * 2022-01-13 2022-05-27 山东师范大学 Crowd movement path planning method and system for realizing scene self-adaption
CN114638163A (en) * 2022-03-21 2022-06-17 Self-learning algorithm-based intelligent group cooperative tactics generation method
CN114638163B (en) * 2022-03-21 2024-09-06 重庆高新区飞马创新研究院 Intelligent group collaborative tactics generation method based on self-learning algorithm
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method
CN114996856A (en) * 2022-06-27 2022-09-02 北京鼎成智造科技有限公司 Data processing method and device for airplane intelligent agent maneuver decision
CN115366099B (en) * 2022-08-18 2024-05-28 江苏科技大学 Mechanical arm depth deterministic strategy gradient training method based on forward kinematics
CN115366099A (en) * 2022-08-18 2022-11-22 Mechanical arm deep deterministic policy gradient training method based on forward kinematics
CN118071119A (en) * 2024-04-18 2024-05-24 中国电子科技集团公司第十研究所 Heterogeneous sensor mixed cooperative scheduling decision method

Similar Documents

Publication Publication Date Title
CN108600379A (en) A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN106970615B (en) A kind of real-time online path planning method based on deep reinforcement learning
WO2022012265A1 (en) Robot learning from demonstration via meta-imitation learning
CN107179077B (en) Self-adaptive visual navigation method based on ELM-LRF
CN109559277A (en) Multi-unmanned aerial vehicle cooperative map construction method oriented to data sharing
CN110135341A (en) Weed identification method, apparatus and terminal device
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN109960880A (en) A kind of industrial robot obstacle-avoiding route planning method based on machine learning
CN114942633A (en) Multi-agent cooperative anti-collision picking method based on digital twins and reinforcement learning
CN110866588B (en) Training learning method and system for realizing individuation of learning ability model of intelligent virtual digital animal
Papadopoulos et al. Towards open and expandable cognitive AI architectures for large-scale multi-agent human-robot collaborative learning
CN109529338A (en) Object control method, apparatus, electronic device and computer-readable medium
CN110181508A (en) Underwater robot three-dimensional Route planner and system
CN112348285B (en) Crowd evacuation simulation method in dynamic environment based on deep reinforcement learning
CN110109653A (en) Land battle chess intelligent engine and operation method thereof
CN116663416A (en) CGF decision behavior simulation method based on behavior tree
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Liu et al. Learning communication for cooperation in dynamic agent-number environment
Zuo et al. SOAR improved artificial neural network for multistep decision-making tasks
Ruifeng et al. Research progress and application of behavior tree technology
Hu et al. Super eagle optimization algorithm based three-dimensional ball security corridor planning method for fixed-wing UAVs
Tian et al. Fruit Picking Robot Arm Training Solution Based on Reinforcement Learning in Digital Twin
Wang et al. Towards optimization of path planning: An RRT*-ACO algorithm
CN117518907A (en) Control method, device, equipment and storage medium of intelligent agent
Qin et al. A path planning algorithm based on deep reinforcement learning for mobile robots in unknown environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20180928)