CN108600379A - A heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient - Google Patents

A heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient

Info

Publication number
CN108600379A
Authority
CN
China
Prior art keywords
action, parameter, actor, state, critic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810397866.9A
Other languages
Chinese (zh)
Inventor
李瑞英
王瑞
胡晓惠
张慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN201810397866.9A priority Critical patent/CN108600379A/en
Publication of CN108600379A publication Critical patent/CN108600379A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14: Network analysis or design
    • H04L 41/142: Network analysis or design using statistical or mathematical methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/14: Network analysis or design
    • H04L 41/145: Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient, belonging to the field of collaborative decision-making for heterogeneous intelligent unmanned systems, and comprising the following steps: first, define the characteristic attributes and reward rules of the heterogeneous multi-agent system, specify the state space and action space of the agents, and construct the motion environment in which the agents make collaborative decisions; then, based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and train the parameters of the learning model; using the trained model, obtain the state sequences of the agents; finally, according to the reward rules set in the environment, perform situation assessment on the motion state sequences of the agents. The present invention can construct a reasonable motion environment according to actual requirements and, through cooperation among the agents in the system, achieve intelligent perception and policy optimization, which has a positive effect on the development of the unmanned-systems field in China.

Description

A heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient
Technical field
The invention belongs to the field of collaborative decision-making for heterogeneous intelligent unmanned systems, and in particular relates to a heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient.
Background technology
In recent years, the rapid development of information technology and intelligent perception technology has laid an important foundation for advanced intelligent behaviors such as the perception of complex environments, accurate intelligent decision-making, and multi-machine task collaboration. Research on intelligent unmanned systems has become a landmark achievement of artificial intelligence; the complexity of their tasks and the uncertainty of dynamic environments require such systems to have strong adaptive and decision-making capabilities.
Traditional swarm intelligence (Swarm Intelligence) [1] dates back to 1959, when the French biologist Pierre-Paul Grassé found that insects form highly structured organizations that can complete complex tasks far beyond the ability of any individual. The ant colony is a classic example of such collective intelligence: through simple communication and coordination among individuals, it exhibits large-scale swarm intelligent behavior. Exploration of the collective behavior of insects has produced many swarm-intelligence algorithms, such as the ant colony system (Ant Colony System, ACS) [2] and particle swarm optimization (Particle Swarm Optimization, PSO). Traditional intelligent unmanned swarm systems are likewise based on biological swarm behavior: through mutual perception and information transfer, they cooperate at low cost in dangerous environments to complete diverse and complex tasks. Task allocation in current unmanned swarms is usually carried out according to the principle of maximizing the benefit-to-loss ratio (maximum allocation gain, minimum loss) and balancing tasks, which reflects the cooperative advantage of a swarm; however, these swarm algorithms are not yet mature and are unsuitable for the autonomous planning of large-scale complex tasks.
Situation-awareness learning methods based on deep reinforcement learning can give intelligent unmanned systems self-learning ability and improve their adaptability to complex and changing environments. Reinforcement learning has a long history; early reinforcement learning is closely related to the Markov decision process (MDP) model, which can be reduced to a four-tuple of state s, action a, reward r and transition probability P. The goal of learning is to find a policy: in a given state, different actions are taken with different probabilities and yield different returns. Its advantage is strong expressive power and good decision-making ability; its disadvantage is that both actions and states are discrete. In 2006, Hinton et al. proposed encoding deep neural networks with restricted Boltzmann machines (RBM, Restricted Boltzmann Machine) [3], drawing attention back to neural networks; in 2012, the breakthrough of deep convolutional networks [4] in the ImageNet competition [5] ushered in the boom of deep learning; in 2016, the deep reinforcement learning algorithm derived by combining the perception ability of deep learning with the decision-making ability of reinforcement learning brought the immense success of AlphaGo [6], setting a new milestone for the development of artificial intelligence. Using deep reinforcement learning for the intelligent control of robots [7-9] has become a new research direction.
References:
[1] Guy Theraulaz, Eric Bonabeau: A Brief History of Stigmergy. Artificial Life 5(2): 97-116 (1999)
[2] Marco Dorigo, Vittorio Maniezzo, Alberto Colorni: Ant system: optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B 26(1): 29-41 (1996)
[3] Geoffrey E. Hinton: Boltzmann machine. Scholarpedia 2(5): 1668 (2007)
[4] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012: 1106-1114
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Fei-Fei Li: ImageNet: A large-scale hierarchical image database. CVPR 2009: 248-255
[6] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484-489 (2016)
[7] Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, Peter I. Corke: Towards Vision-Based Deep Reinforcement Learning for Robotic Motion Control. CoRR abs/1511.03791 (2015)
[8] Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen: Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. CoRR abs/1603.02199 (2016)
[9] Chelsea Finn, Sergey Levine, Pieter Abbeel: Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. CoRR abs/1603.00448 (2016)
Summary of the invention
The technical problem solved by the present invention: based on existing algorithms and techniques, a heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient is proposed. The method first constructs the motion environment in which the heterogeneous multi-agent system makes collaborative decisions; it then establishes, based on the deep deterministic policy gradient algorithm, an actor module that makes decision actions and a critic module that evaluates feedback, and trains the parameters of the learning model; finally, collaborative decision-making among the heterogeneous agents is realized.
The technical solution of the invention: a heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient, comprising the following steps:
Step 1: define the characteristic attributes and reward rules of the heterogeneous multi-agent system, specify the state space and action space of the agents, abstract each agent as a motion node in the environment, and construct the motion environment in which the heterogeneous agents move cooperatively;
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and randomly initialize their parameters;
Step 3: the agents explore autonomously and randomly in the motion environment built in step 1: each agent obtains an action a from the actor module according to its current state s and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward rules, and each step <current state s, current action a, next state s', reward r> is stored in the experience pool;
Step 4: according to the <s, a, s', r> tuples stored in the experience pool in step 3, train and learn the parameters of the critic module and the actor module, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached;
Step 5: using the trained model, obtain the current action a of an agent given its current state s, and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent; at the same time, complete the situation assessment of the agent's motion state sequence according to the reward rules set in the environment. For an unmanned system, the optimal decision behavior cannot be obtained intuitively, so the quality of the decision behavior is analyzed and judged according to the reward rules set in the environment.
Preferably, the specific implementation of step 1 includes the following sub-steps:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, abstract each agent as a motion node in the environment;
Step 1.2: set the action of an agent: [direction of motion of the next step]; set the state of an agent: [its own position coordinates x, y; the position coordinates x, y of the target; the azimuth angle θ between its own position and the target position];
Step 1.3: set the reward rules in the environment;
Step 1.4: the abstracted motion nodes of the agents, the action space and state space of the agents, and the reward rules in the environment together constitute the motion environment in which the heterogeneous multi-agent system makes collaborative decisions.
Preferably, the specific implementation of step 2 includes the following sub-steps:
Step 2.1: the parameter updates of the actor module and the critic module are built on experience-based learning; a separate experience pool is set up to store the state-action pairs of each motion node <current state s, current action a, next state s', reward r>;
Step 2.2: establish the actor module, which takes the state s of each agent as the input of the network and obtains the next output action a of each agent through several intermediate layers; meanwhile, since the parameters of the network change dynamically during each round of iteration, a copy of the actor network structure is retained to make the parameter learning more stable, and this copy is updated only at a fixed time-step interval;
Step 2.3: establish the critic module, which takes the state s and action a of an agent as the input of the network and outputs the action value Q through several intermediate layers; meanwhile, to make the parameter learning more stable, a copy of the critic network structure is retained and is likewise updated only at a fixed time-step interval.
Preferably, the specific implementation of step 4 includes the following sub-steps:
Step 4.1: the critic module contains two network models with identical structure but different parameter update times; the network model Q whose parameters are updated immediately is called the online critic, and its parameters are denoted θ^Q; the network model Q' whose parameters are updated with a delay is called the target critic, and its parameters are denoted θ^Q';
For the target critic: according to an experience-pool tuple <current state s, current action a, next state s', reward r>, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the next action a' to be taken in state s' is estimated with the target actor network, the target action-value function is computed as Q'(s', a' | θ^Q'), and from Q' the estimated expected return y of taking action a in state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor;
For the online critic: the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool;
The mean squared error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))²
The parameters of the online critic network are updated using the error L;
The target critic is a delayed update of the online critic, and its parameter update formula is:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor;
Step 4.2: the actor module contains two network models with identical structure but different parameter update times; the network model μ whose parameters are updated immediately is the online actor, and its parameters are denoted θ^μ; the network model μ' whose parameters are updated with a delay is the target actor, and its parameters are denoted θ^μ';
For the target actor: the next action a' is computed from the next state s' in an experience-pool tuple <current state s, current action a, next state s', reward r>, i.e. a' = μ'(s' | θ^μ'), and is used to compute the target critic's target action-value function Q'(s', a' | θ^Q');
For the online actor: the actual current action is computed from the current state s in the experience pool, i.e. μ(s | θ^μ); the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the gradient descent formula:
∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_θ^μ μ(s | θ^μ)|_{s=s_i}
The target actor is a delayed update of the online actor, and its parameter update formula is:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor;
Step 4.3: train the model parameters of the critic network and the actor network, and replace previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached.
Compared with the prior art, the advantages and positive effects of the present invention are as follows:
(1) A feasible construction method for a heterogeneous multi-agent cooperative environment is proposed: by defining information such as the attributes, states and actions of the agents, the motion rules and the reward scheme, a heterogeneous multi-agent environment oriented to a particular task is constructed;
(2) The original deep deterministic policy gradient algorithm is improved: the parameters of the actor module and the critic module are updated by sharing the states and actions of the agents with one another; the critic module is computed from the actions output by the actor module, and in turn the critic module guides the parameter update of the actor module, reinforcing high-return actions and suppressing low-return actions, thereby achieving collaborative decision-making among the agents;
(3) For an unmanned system, the optimal decision of the system cannot be obtained intuitively. The present invention proposes to analyze and judge the quality of the decision behavior from the sequence of actions taken from the initial state to the end of iteration, using the reward rules provided by the constructed heterogeneous multi-agent environment, so as to complete the evaluation of the model.
Description of the drawings
Fig. 1 is the implementation flow chart of the present invention;
Fig. 2 is a schematic diagram of the structure of the heterogeneous multi-agent cooperative environment;
Fig. 3 is the network structure of the actor module;
Fig. 4 is the network structure of the critic module;
Fig. 5 is the data flow diagram of the actor module and the critic module.
Detailed description of the embodiments
The specific embodiments of the present invention are described in detail below with reference to the embodiments and the accompanying drawings. The embodiments described here are only intended to illustrate and explain the present invention, not to limit it.
The heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient proposed by the present invention mainly includes the following steps: first, define the characteristic attributes and reward rules of the heterogeneous agents, specify the state space and action space of the agents, and construct the motion environment in which the multi-agent system makes collaborative decisions; then, using the deep deterministic policy gradient algorithm, define an actor module that makes decision actions and a critic module that evaluates feedback, train the parameters of the learning model, and automatically make the next decision action according to the position of the agent in its environment and its state information; finally, since the optimal decision of the system cannot be given subjectively and can only be judged against certain evaluation criteria, the quality of the decision actions is examined using the reward rules defined in the environment as the basis. The present invention can construct a reasonable motion environment according to actual requirements and, through cooperation among the agents in the system, achieve intelligent perception and policy optimization, which has a positive effect on the development of the unmanned-systems field in China.
The following detailed description of.
Step 1: define the characteristic attributes and reward rules of the heterogeneous multi-agent system, specify the state space and action space of the agents, abstract each agent as a motion node in the environment, and construct the motion environment in which the heterogeneous agents make collaborative decisions;
In a specific implementation, the motion rules of the agents should be formulated according to the specific motion model, the motion space and action space of the agents should be specified, and a reasonable reward mechanism should be designed.
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and initialize the parameters of the actor module and the critic module;
The deep deterministic policy gradient algorithm is a reinforcement learning method based on the actor-critic framework and mainly contains two modules: the actor module and the critic module. The actor module is responsible for computing the action to be taken in the next step according to the current state; the critic module is responsible for giving feedback corrections to the parameters of the actor module according to the estimated expected return produced by the current state and the action taken. In the initial training stage, the parameters of the two modules need to be initialized separately.
Step 3: in the motion environment built in step 1, the agents rely on the algorithm of step 2 and explore autonomously and randomly for a certain number of initial iterations. Each agent obtains an action a from the actor module according to its current state s and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward rules, and each step <current state s, current action a, next state s', reward r> is stored in the experience pool for the subsequent learning of the module parameters;
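For illustration only (not part of the patent text), the experience pool of step 3 can be sketched as a simple fixed-capacity replay buffer; the class name, capacity and sampling interface below are assumptions.

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity experience pool storing <s, a, s', r> tuples.

    Once the pool is full, newly generated tuples overwrite the oldest ones,
    matching the replacement behaviour described in step 4.
    """

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are dropped automatically

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        """Sample a random mini-batch of transitions for training."""
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        states, actions, next_states, rewards = zip(*batch)
        return states, actions, next_states, rewards
```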
Step 4: according to the <s, a, s', r> tuples stored in the experience pool in step 3, train and learn the parameters of the critic module and the actor module, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached.
To carry out the parameter updates stably, the critic module and the actor module each contain one network structure whose parameters are updated online in real time and one network structure whose parameters are updated with a delay of a certain number of time steps. The parameter update of the critic module relies on the action a computed by the actor module, while the parameter update of the actor module relies on the action-value gradient computed by the critic module; the two feed back to each other to achieve cooperative motion of the agents.
Step 5: using the trained model, obtain the current action a of an agent given its current state s, and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent; at the same time, complete the situation assessment of the agent's motion state sequence according to the reward rules set in the environment. For an unmanned system, the optimal decision behavior cannot be judged subjectively and can only be analyzed according to certain objective criteria; the quality of the decision behavior is therefore analyzed and judged according to the reward rules set in the environment.
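As a hedged illustration of step 5, a rollout and situation-assessment loop might look like the following sketch; the env, actor, termination and reward interfaces are assumptions, not part of the patent.

```python
def evaluate(env, actor, max_steps=200):
    """Roll out the trained actor and collect the state sequence and rewards.

    env.reset(), env.step(action) -> (state, reward, done) and actor(state)
    are assumed interfaces: the actor maps the current state s to action a.
    """
    state = env.reset()
    state_sequence = [state]
    total_reward = 0.0
    for _ in range(max_steps):
        action = actor(state)                   # current action a from state s
        state, reward, done = env.step(action)  # reach next state s', get reward r
        state_sequence.append(state)
        total_reward += reward                  # accumulated reward used for situation assessment
        if done:                                # task completed or environment terminated
            break
    return state_sequence, total_reward
```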
The specific implementation process of the above steps 1-5 is described in detail below.
1. Constructing the motion environment for multi-agent collaborative decision-making
The implementation is shown schematically in Fig. 2 and is divided into the following 4 sub-steps:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, such as the maximum movement speed and the range of the movement region, abstract each agent as a motion node in the environment;
Step 1.2: set the action space and state space of the agents; in the present invention, the action of an agent is set to [direction of motion of the next step], and the state of an agent is set to [its own position coordinates x, y; the position coordinates x, y of the target; the azimuth angle θ between itself and the target];
Step 1.3: set the reward mechanism in the environment, i.e. the environmental reward given when certain states are reached between the agents. The present invention mainly sets three kinds of reward rules: a certain initial distance is kept between agents, and agents must not come too close to one another; there is a maximum communication distance between agents, and the penalty increases once this distance is exceeded; and a corresponding reward is given according to whether an agent can monitor the target, which is the final purpose of the collaborative decision-making.
Step 1.4: the abstracted motion nodes of the agents, the action space and state space of the agents, and the reward rules in the environment together constitute the cooperative environment of the heterogeneous multi-agent system: for each agent, the next action and the reward information are obtained according to the current observation, so as to guide the continual optimization of the decision.
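A minimal sketch of such a cooperative environment is given below, assuming two-dimensional motion nodes and the three reward rules described above; all class names, distance thresholds and reward magnitudes are illustrative assumptions.

```python
import numpy as np

class CooperativeEnv:
    """Toy heterogeneous multi-agent environment following steps 1.1-1.4.

    State of an agent: [own x, own y, target x, target y, azimuth theta].
    Action of an agent: the direction of motion of the next step (radians).
    All distance thresholds below are illustrative placeholders.
    """

    def __init__(self, n_agents=3, min_dist=1.0, max_comm_dist=10.0, monitor_dist=2.0):
        self.n_agents = n_agents
        self.min_dist = min_dist              # agents must not come too close
        self.max_comm_dist = max_comm_dist    # farthest communication distance
        self.monitor_dist = monitor_dist      # distance at which the target is monitored
        self.positions = np.random.rand(n_agents, 2) * 5.0
        self.target = np.random.rand(2) * 5.0

    def state(self, i):
        dx, dy = self.target - self.positions[i]
        theta = np.arctan2(dy, dx)            # azimuth between agent i and the target
        return np.array([*self.positions[i], *self.target, theta], dtype=np.float32)

    def step(self, actions, speed=0.1):
        """Move every agent one step along its chosen direction and compute rewards."""
        for i, a in enumerate(actions):
            self.positions[i] += speed * np.array([np.cos(a), np.sin(a)])
        rewards = np.zeros(self.n_agents)
        for i in range(self.n_agents):
            for j in range(i + 1, self.n_agents):
                d = np.linalg.norm(self.positions[i] - self.positions[j])
                if d < self.min_dist:         # rule 1: agents too close
                    rewards[i] -= 1.0
                    rewards[j] -= 1.0
                if d > self.max_comm_dist:    # rule 2: beyond the communication range
                    rewards[i] -= d - self.max_comm_dist
                    rewards[j] -= d - self.max_comm_dist
            if np.linalg.norm(self.positions[i] - self.target) < self.monitor_dist:
                rewards[i] += 1.0             # rule 3: the target is monitored
        return [self.state(i) for i in range(self.n_agents)], rewards
```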
2. Establishing the network structures of the actor module and the critic module and initializing the network parameters
The actor module is used for decision actions and the critic module is used for evaluation feedback; this is divided into the following 2 steps:
1) The actor network structure used in the present invention is shown schematically in Fig. 3. With the state s of each motion node as input, the network consists of three fully connected layers (inner product layers); rectified linear units (Rectified Linear Units, ReLU) are used as the activation function after the first two fully connected layers, and the output of the third layer passes through a hyperbolic tangent function tanh(). The tanh() function is a variant of the sigmoid() function whose value range is [-1, 1] rather than the [0, 1] of the sigmoid function; the output result is the radian value of the next motion direction of each node. In the embodiment, the actor module is implemented on the open-source TensorFlow (abbreviated tf) deep learning framework; the network weights (Weights) and biases (Bias) are both initialized with the tf.contrib.layers.xavier_initializer function in TensorFlow, which returns a Xavier initializer for initializing the weights and ensures that the gradient magnitudes of all layers are almost the same. Since the parameters of the network change dynamically during each round of iteration, a copy of the actor network structure is retained to make the parameter learning more stable; this copy is updated only at a fixed time-step interval;
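A possible realization of this actor network is sketched below using the tf.keras API (with GlorotUniform, i.e. Xavier, initialization) rather than the older tf.contrib call mentioned above; the layer widths and the output scaling are assumptions.

```python
import numpy as np
import tensorflow as tf

def build_actor(state_dim):
    """Actor network sketch after Fig. 3: three fully connected layers,
    ReLU after the first two, tanh on the third; the tanh output in [-1, 1]
    can be scaled by pi outside the network to give a direction in radians.
    The layer widths (64) are illustrative.
    """
    init = tf.keras.initializers.GlorotUniform()  # Xavier initialization
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation='relu', kernel_initializer=init),
        tf.keras.layers.Dense(64, activation='relu', kernel_initializer=init),
        tf.keras.layers.Dense(1, activation='tanh', kernel_initializer=init),
    ])

# Hypothetical usage: direction (radians) for one agent state of dimension 5
# actor = build_actor(state_dim=5)
# direction = np.pi * actor(state[None, :])  # scale tanh output to [-pi, pi]
```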
2) The critic network structure used in the present invention is shown schematically in Fig. 4. With the state s of each motion node as input, the state passes through a fully connected layer with rectified-linear activation; the result is then combined with the action a as the input of a second fully connected layer and, after rectified-linear activation, is fed into a long short-term memory network LSTM (Long Short-Term Memory); the output result is the action value Q corresponding to state s and action a. Likewise, the critic module in the embodiment is implemented on the open-source TensorFlow deep learning framework; the network weights (Weights) and biases (Bias) are initialized with the tf.contrib.layers.xavier_initializer function in TensorFlow, and a copy of the critic network structure, whose parameters are updated only at a fixed time-step interval, is also retained.
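A corresponding sketch of the critic network, again in tf.keras with assumed layer widths and with the single transition treated as a length-1 sequence for the LSTM:

```python
import tensorflow as tf

def build_critic(state_dim, action_dim):
    """Critic network sketch after Fig. 4: state -> fully connected ReLU layer,
    concatenated with the action -> second fully connected ReLU layer -> LSTM
    -> scalar action value Q(s, a). Layer widths are illustrative choices.
    """
    init = tf.keras.initializers.GlorotUniform()             # Xavier initialization
    s_in = tf.keras.Input(shape=(state_dim,), name='state')
    a_in = tf.keras.Input(shape=(action_dim,), name='action')
    h = tf.keras.layers.Dense(64, activation='relu', kernel_initializer=init)(s_in)
    h = tf.keras.layers.Concatenate()([h, a_in])              # action enters the second layer
    h = tf.keras.layers.Dense(64, activation='relu', kernel_initializer=init)(h)
    h = tf.keras.layers.Reshape((1, 64))(h)                   # length-1 sequence for the LSTM
    h = tf.keras.layers.LSTM(32)(h)
    q = tf.keras.layers.Dense(1, kernel_initializer=init)(h)  # action value Q(s, a)
    return tf.keras.Model([s_in, a_in], q)
```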
3. Training and optimization based on the deep deterministic policy gradient algorithm
The parameter update of the critic module relies on the action a computed by the actor module, while the parameter update of the actor module relies on the action-value gradient computed by the critic module; the two feed back to each other to achieve collaborative decision-making among the agents, as shown in Fig. 5. This is divided into the following 3 steps:
Step 4.1: the critic module contains two network models with identical structure but different parameter update times; the network model Q whose parameters are updated immediately is called the online critic, and its parameters are denoted θ^Q; the network model Q' whose parameters are updated with a delay is called the target critic, and its parameters are denoted θ^Q';
For the target critic: according to an experience-pool tuple <current state s, current action a, next state s', reward r>, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the next action a' to be taken in state s' is estimated with the target actor network, the target action-value function is computed as Q'(s', a' | θ^Q'), and from Q' the estimated expected return y of taking action a in state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor;
For the online critic: the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool;
The mean squared error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))²
The parameters of the online critic network are updated using the error L;
The target critic is a delayed update of the online critic, and its parameter update formula is:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor;
Step 4.2: the actor module contains two network models with identical structure but different parameter update times; the network model μ whose parameters are updated immediately is the online actor, and its parameters are denoted θ^μ; the network model μ' whose parameters are updated with a delay is the target actor, and its parameters are denoted θ^μ';
For the target actor: the next action a' is computed from the next state s' in an experience-pool tuple <current state s, current action a, next state s', reward r>, i.e. a' = μ'(s' | θ^μ'), and is used to compute the target critic's target action-value function Q'(s', a' | θ^Q');
For the online actor: the actual current action is computed from the current state s in the experience pool, i.e. μ(s | θ^μ); the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the gradient descent formula:
∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_θ^μ μ(s | θ^μ)|_{s=s_i}
The target actor is a delayed update of the online actor, and its parameter update formula is:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor;
Step 4.3: train the model parameters of the critic network and the actor network, and replace previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached.
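To make the update rules of steps 4.1-4.3 concrete, the following hedged sketch shows one training step in tf.keras; the optimizers, batch handling and the values of γ and τ are assumptions rather than values given in the patent.

```python
import tensorflow as tf

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.01):
    """One training step following steps 4.1-4.3 (illustrative sketch only)."""
    s, a, s_next, r = batch  # tensors sampled from the experience pool; r has shape (batch, 1)

    # Step 4.1: target critic gives y = r + gamma * Q'(s', a' | theta^Q')
    a_next = target_actor(s_next)
    y = r + gamma * target_critic([s_next, a_next])
    with tf.GradientTape() as tape:
        q = critic([s, a])                                  # online expected return Q(s, a | theta^Q)
        critic_loss = tf.reduce_mean(tf.square(y - q))      # mean squared error L
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))

    # Step 4.2: online actor ascends the action-value gradient of the online critic
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))

    # Delayed (soft) update of the target networks: theta' = tau*theta + (1-tau)*theta'
    for target, online in ((target_critic, critic), (target_actor, actor)):
        for t_var, o_var in zip(target.variables, online.variables):
            t_var.assign(tau * o_var + (1.0 - tau) * t_var)
```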
Parts of the present invention that are not described in detail belong to techniques well known to those skilled in the art.
The above is only a partial specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection defined by the claims.

Claims (4)

1. A heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient, characterized by comprising the following steps:
Step 1: define the characteristic attributes and reward rules of the heterogeneous multi-agent system, specify the state space and action space of the agents, abstract each agent as a motion node in the environment, and construct the motion environment in which the heterogeneous agents make collaborative decisions;
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and randomly initialize their parameters;
Step 3: the agents explore autonomously and randomly in the motion environment built in step 1: each agent obtains an action a from the actor module according to its current state s and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward rules, and each step <current state s, current action a, next state s', reward r> is stored in the experience pool;
Step 4: according to the <s, a, s', r> tuples stored in the experience pool in step 3, train and learn the parameters of the critic module and the actor module, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached;
Step 5: using the trained model, obtain the current action a of an agent given its current state s, and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent; at the same time, complete the situation assessment of the agent's motion state sequence according to the reward rules set in the environment.
2. The heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of step 1 include:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, abstract each agent as a motion node in the environment;
Step 1.2: set the action of an agent: [direction of motion of the next step]; set the state of an agent: [its own position coordinates x, y; the position coordinates x, y of the target; the azimuth angle θ between its own position and the target position];
Step 1.3: set the reward rules in the environment;
Step 1.4: the abstracted motion nodes of the agents, the action space and state space of the agents, and the reward rules in the environment together constitute the motion environment in which the heterogeneous multi-agent system makes collaborative decisions.
3. The heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of step 2 are as follows:
Step 2.1: set up a separate experience pool to store the state-action pairs of each agent <current state s, current action a, next state s', reward r>;
Step 2.2: establish the actor module, which takes the state s of each agent as the input of the network and obtains the next output action a of each agent through several intermediate layers; meanwhile, retain a copy of the actor network structure, which is updated only at a fixed time-step interval;
Step 2.3: establish the critic module, which takes the state s and action a of an agent as the input of the network and outputs the action value Q through several intermediate layers; meanwhile, retain a copy of the critic network structure, which is likewise updated only at a fixed time-step interval.
4. The heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of step 4 are as follows:
Step 4.1: the critic module contains two network models with identical structure but different parameter update times; the network model Q whose parameters are updated immediately is called the online critic, and its parameters are denoted θ^Q; the network model Q' whose parameters are updated with a delay is called the target critic, and its parameters are denoted θ^Q';
For the target critic: according to an experience-pool tuple <current state s, current action a, next state s', reward r>, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the next action a' to be taken in state s' is estimated with the target actor network, the target action-value function is computed as Q'(s', a' | θ^Q'), and from Q' the estimated expected return y of taking action a in state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor;
For the online critic: the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool;
The mean squared error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))²
The parameters of the online critic network are updated using the error L;
The target critic is a delayed update of the online critic, and its parameter update formula is:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor;
Step 4.2: the actor module contains two network models with identical structure but different parameter update times; the network model μ whose parameters are updated immediately is the online actor, and its parameters are denoted θ^μ; the network model μ' whose parameters are updated with a delay is the target actor, and its parameters are denoted θ^μ';
For the target actor: the next action a' is computed from the next state s' in an experience-pool tuple <current state s, current action a, next state s', reward r>, i.e. a' = μ'(s' | θ^μ'), and is used to compute the target critic's target action-value function Q'(s', a' | θ^Q');
For the online actor: the actual current action is computed from the current state s in the experience pool, i.e. μ(s | θ^μ); the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the gradient descent formula:
∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_θ^μ μ(s | θ^μ)|_{s=s_i}
The target actor is a delayed update of the online actor, and its parameter update formula is:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor;
Step 4.3: train the model parameters of the critic network and the actor network, and replace previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent collaborative decision-making is met or the maximum number of iterations is reached.
CN201810397866.9A 2018-04-28 2018-04-28 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient Pending CN108600379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810397866.9A CN108600379A (en) 2018-04-28 2018-04-28 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810397866.9A CN108600379A (en) 2018-04-28 2018-04-28 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient

Publications (1)

Publication Number Publication Date
CN108600379A true CN108600379A (en) 2018-09-28

Family

ID=63611007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810397866.9A Pending CN108600379A (en) 2018-04-28 2018-04-28 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient

Country Status (1)

Country Link
CN (1) CN108600379A (en)

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408157A (en) * 2018-11-01 2019-03-01 西北工业大学 A kind of determination method and device of multirobot cotasking
CN109407644A (en) * 2019-01-07 2019-03-01 齐鲁工业大学 One kind being used for manufacturing enterprise's Multi-Agent model control method and system
CN109657802A (en) * 2019-01-28 2019-04-19 清华大学深圳研究生院 A kind of Mixture of expert intensified learning method and system
CN109670270A (en) * 2019-01-11 2019-04-23 山东师范大学 Crowd evacuation emulation method and system based on the study of multiple agent deeply
CN109719721A (en) * 2018-12-26 2019-05-07 北京化工大学 A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait
CN109828460A (en) * 2019-01-21 2019-05-31 南京理工大学 A kind of consistent control method of output for two-way heterogeneous multi-agent system
CN109919319A (en) * 2018-12-31 2019-06-21 中国科学院软件研究所 Deeply learning method and equipment based on multiple history best Q networks
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends
CN109948642A (en) * 2019-01-18 2019-06-28 中山大学 Multiple agent cross-module state depth deterministic policy gradient training method based on image input
CN110045614A (en) * 2019-05-16 2019-07-23 河海大学常州校区 A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning
CN110084375A (en) * 2019-04-26 2019-08-02 东南大学 A kind of hierarchy division frame based on deeply study
CN110442129A (en) * 2019-07-26 2019-11-12 中南大学 A kind of control method and system that multiple agent is formed into columns
CN110515298A (en) * 2019-06-14 2019-11-29 南京信息工程大学 Based on the adaptive marine isomery multiple agent speed cooperative control method of optimization
CN110659796A (en) * 2019-08-08 2020-01-07 北京理工大学 Data acquisition method in rechargeable group vehicle intelligence
CN110839031A (en) * 2019-11-15 2020-02-25 中国人民解放军陆军工程大学 Malicious user behavior intelligent detection method based on reinforcement learning
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111050330A (en) * 2018-10-12 2020-04-21 中兴通讯股份有限公司 Mobile network self-optimization method, system, terminal and computer readable storage medium
CN111309880A (en) * 2020-01-21 2020-06-19 清华大学 Multi-agent action strategy learning method, device, medium and computing equipment
CN111416771A (en) * 2020-03-20 2020-07-14 深圳市大数据研究院 Method for controlling routing action based on multi-agent reinforcement learning routing strategy
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method
CN111582441A (en) * 2020-04-16 2020-08-25 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN111645076A (en) * 2020-06-17 2020-09-11 郑州大学 Robot control method and equipment
CN111687840A (en) * 2020-06-11 2020-09-22 清华大学 Method, device and storage medium for capturing space target
CN111814915A (en) * 2020-08-26 2020-10-23 中国科学院自动化研究所 Multi-agent space-time feature extraction method and system and behavior decision method and system
CN111914069A (en) * 2019-05-10 2020-11-10 京东方科技集团股份有限公司 Training method and device, dialogue processing method and system and medium
CN112015174A (en) * 2020-07-10 2020-12-01 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN112180724A (en) * 2020-09-25 2021-01-05 中国人民解放军军事科学院国防科技创新研究院 Training method and system for multi-agent cooperative cooperation under interference condition
CN112260733A (en) * 2020-11-10 2021-01-22 东南大学 Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method
CN112270451A (en) * 2020-11-04 2021-01-26 中国科学院重庆绿色智能技术研究院 Monitoring and early warning method and system based on reinforcement learning
CN112597693A (en) * 2020-11-19 2021-04-02 沈阳航盛科技有限责任公司 Self-adaptive control method based on depth deterministic strategy gradient
CN112668721A (en) * 2021-03-17 2021-04-16 中国科学院自动化研究所 Decision-making method for decentralized multi-intelligent system in general non-stationary environment
CN112853560A (en) * 2020-12-31 2021-05-28 盐城师范学院 Global process sharing control system and method based on ring spinning yarn quality
CN112926729A (en) * 2021-05-06 2021-06-08 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN112966641A (en) * 2021-03-23 2021-06-15 中国电子科技集团公司电子科学研究院 Intelligent decision-making method for multiple sensors and multiple targets and storage medium
CN112987713A (en) * 2019-12-17 2021-06-18 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN113189983A (en) * 2021-04-13 2021-07-30 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113218400A (en) * 2021-05-17 2021-08-06 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning
CN113269329A (en) * 2021-04-30 2021-08-17 北京控制工程研究所 Multi-agent distributed reinforcement learning method
CN113392798A (en) * 2021-06-29 2021-09-14 中国科学技术大学 Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation
CN113408796A (en) * 2021-06-04 2021-09-17 北京理工大学 Deep space probe soft landing path planning method for multitask deep reinforcement learning
CN113433953A (en) * 2021-08-25 2021-09-24 北京航空航天大学 Multi-robot cooperative obstacle avoidance method and device and intelligent robot
CN113467508A (en) * 2021-06-30 2021-10-01 天津大学 Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN113485119A (en) * 2021-07-29 2021-10-08 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113490578A (en) * 2019-03-08 2021-10-08 罗伯特·博世有限公司 Method for operating a robot in a multi-agent system, robot and multi-agent system
CN113534784A (en) * 2020-04-17 2021-10-22 华为技术有限公司 Decision method of intelligent body action and related equipment
CN113554300A (en) * 2021-07-19 2021-10-26 河海大学 Shared parking space real-time allocation method based on deep reinforcement learning
CN113589842A (en) * 2021-07-26 2021-11-02 中国电子科技集团公司第五十四研究所 Unmanned clustering task cooperation method based on multi-agent reinforcement learning
CN113792846A (en) * 2021-09-06 2021-12-14 中国科学院自动化研究所 State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment
CN113837654A (en) * 2021-10-14 2021-12-24 北京邮电大学 Multi-target-oriented intelligent power grid layered scheduling method
WO2022052406A1 (en) * 2020-09-08 2022-03-17 苏州浪潮智能科技有限公司 Automatic driving training method, apparatus and device, and medium
CN114548497A (en) * 2022-01-13 2022-05-27 山东师范大学 Crowd movement path planning method and system for realizing scene self-adaption
CN114638163A (en) * 2022-03-21 2022-06-17 重庆高新区飞马创新研究院 Self-learning algorithm-based intelligent group cooperative combat method generation method
CN114996856A (en) * 2022-06-27 2022-09-02 北京鼎成智造科技有限公司 Data processing method and device for airplane intelligent agent maneuver decision
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method
CN115366099A (en) * 2022-08-18 2022-11-22 江苏科技大学 Mechanical arm depth certainty strategy gradient training method based on forward kinematics
CN118071119A (en) * 2024-04-18 2024-05-24 中国电子科技集团公司第十研究所 Heterogeneous sensor mixed cooperative scheduling decision method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096729A (en) * 2016-06-06 2016-11-09 天津科技大学 A kind of towards the depth-size strategy learning method of complex task in extensive environment
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096729A (en) * 2016-06-06 2016-11-09 天津科技大学 A kind of towards the depth-size strategy learning method of complex task in extensive environment
CN106970615A (en) * 2017-03-21 2017-07-21 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107065881A (en) * 2017-05-17 2017-08-18 清华大学 A kind of robot global path planning method learnt based on deeply

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JELLE MUNK,JENS KOBER,ROBERT BABUSKA.: "《Learning State Representation for Deep Actor-Critic Control》", 《2016 IEEE 55TH CONFERENCE ON DECISION AND CONTROL(CDC)》 *
S PHANITEJA ; PARIJAT DEWANGAN,POOJA GUHAN.: "《A deep reinforcement learning approach for dynamically stable inverse kinematics of humanoid robots》", 《 2017 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (ROBIO)》 *
WANRONG HUANG,YANZHEN WANG,XIAODONG YI.: "《A Deep Reinforcement Learning Approach to Preserve Connectivity for Multi-robot Systems》", 《2017 10TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, BIOMEDICAL ENGINEERING AND INFORMATICS (CISP-BMEI)》 *

Cited By (88)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111050330A (en) * 2018-10-12 2020-04-21 中兴通讯股份有限公司 Mobile network self-optimization method, system, terminal and computer readable storage medium
CN109408157A (en) * 2018-11-01 2019-03-01 西北工业大学 A kind of determination method and device of multirobot cotasking
CN109408157B (en) * 2018-11-01 2022-03-04 西北工业大学 Method and device for determining multi-robot cooperative task
CN109719721A (en) * 2018-12-26 2019-05-07 北京化工大学 A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait
CN109719721B (en) * 2018-12-26 2020-07-24 北京化工大学 Adaptive gait autonomous emerging method of snake-like search and rescue robot
CN109919319A (en) * 2018-12-31 2019-06-21 中国科学院软件研究所 Deeply learning method and equipment based on multiple history best Q networks
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends
CN109407644A (en) * 2019-01-07 2019-03-01 齐鲁工业大学 One kind being used for manufacturing enterprise's Multi-Agent model control method and system
CN109670270A (en) * 2019-01-11 2019-04-23 山东师范大学 Crowd evacuation emulation method and system based on the study of multiple agent deeply
CN109948642A (en) * 2019-01-18 2019-06-28 中山大学 Multiple agent cross-module state depth deterministic policy gradient training method based on image input
CN109828460A (en) * 2019-01-21 2019-05-31 南京理工大学 A kind of consistent control method of output for two-way heterogeneous multi-agent system
CN109828460B (en) * 2019-01-21 2021-11-12 南京理工大学 Output consistency control method for bidirectional heterogeneous multi-agent system
CN109657802B (en) * 2019-01-28 2020-12-29 清华大学深圳研究生院 Hybrid expert reinforcement learning method and system
CN109657802A (en) * 2019-01-28 2019-04-19 清华大学深圳研究生院 A kind of Mixture of expert intensified learning method and system
CN113490578A (en) * 2019-03-08 2021-10-08 罗伯特·博世有限公司 Method for operating a robot in a multi-agent system, robot and multi-agent system
CN110084375B (en) * 2019-04-26 2021-09-17 东南大学 Multi-agent collaboration framework based on deep reinforcement learning
CN110084375A (en) * 2019-04-26 2019-08-02 东南大学 A kind of hierarchy division frame based on deeply study
CN111914069A (en) * 2019-05-10 2020-11-10 京东方科技集团股份有限公司 Training method and device, dialogue processing method and system and medium
WO2020228636A1 (en) * 2019-05-10 2020-11-19 京东方科技集团股份有限公司 Training method and apparatus, dialogue processing method and system, and medium
CN110045614A (en) * 2019-05-16 2019-07-23 河海大学常州校区 A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning
CN110515298A (en) * 2019-06-14 2019-11-29 南京信息工程大学 Based on the adaptive marine isomery multiple agent speed cooperative control method of optimization
CN110515298B (en) * 2019-06-14 2022-09-23 南京信息工程大学 Offshore heterogeneous multi-agent speed cooperative control method based on optimized self-adaption
CN110442129A (en) * 2019-07-26 2019-11-12 中南大学 A kind of control method and system that multiple agent is formed into columns
CN110659796A (en) * 2019-08-08 2020-01-07 北京理工大学 Data acquisition method in rechargeable group vehicle intelligence
CN110659796B (en) * 2019-08-08 2022-07-08 北京理工大学 Data acquisition method in rechargeable group vehicle intelligence
CN110839031A (en) * 2019-11-15 2020-02-25 中国人民解放军陆军工程大学 Malicious user behavior intelligent detection method based on reinforcement learning
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111026272B (en) * 2019-12-09 2023-10-31 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN110991972A (en) * 2019-12-14 2020-04-10 中国科学院深圳先进技术研究院 Cargo transportation system based on multi-agent reinforcement learning
CN112987713B (en) * 2019-12-17 2024-08-13 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN112987713A (en) * 2019-12-17 2021-06-18 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN111309880A (en) * 2020-01-21 2020-06-19 清华大学 Multi-agent action strategy learning method, device, medium and computing equipment
CN111309880B (en) * 2020-01-21 2023-11-10 清华大学 Multi-agent action strategy learning method, device, medium and computing equipment
CN111416771A (en) * 2020-03-20 2020-07-14 深圳市大数据研究院 Method for controlling routing action based on multi-agent reinforcement learning routing strategy
CN111582441B (en) * 2020-04-16 2021-07-30 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN111582441A (en) * 2020-04-16 2020-08-25 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
CN113534784B (en) * 2020-04-17 2024-03-05 华为技术有限公司 Decision method of intelligent body action and related equipment
CN113534784A (en) * 2020-04-17 2021-10-22 华为技术有限公司 Decision method of intelligent body action and related equipment
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method
CN111687840B (en) * 2020-06-11 2021-10-29 清华大学 Method, device and storage medium for capturing space target
CN111687840A (en) * 2020-06-11 2020-09-22 清华大学 Method, device and storage medium for capturing space target
CN111645076A (en) * 2020-06-17 2020-09-11 郑州大学 Robot control method and equipment
US12045061B2 (en) 2020-07-10 2024-07-23 Goertek Inc. Multi-AGV motion planning method, device and system
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN112015174A (en) * 2020-07-10 2020-12-01 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN111814915B (en) * 2020-08-26 2020-12-25 中国科学院自动化研究所 Multi-agent space-time feature extraction method and system and behavior decision method and system
CN111814915A (en) * 2020-08-26 2020-10-23 中国科学院自动化研究所 Multi-agent space-time feature extraction method and system and behavior decision method and system
WO2022052406A1 (en) * 2020-09-08 2022-03-17 苏州浪潮智能科技有限公司 Automatic driving training method, apparatus and device, and medium
CN112180724B (en) * 2020-09-25 2022-06-03 中国人民解放军军事科学院国防科技创新研究院 Training method and system for multi-agent cooperative cooperation under interference condition
CN112180724A (en) * 2020-09-25 2021-01-05 中国人民解放军军事科学院国防科技创新研究院 Training method and system for multi-agent cooperative cooperation under interference condition
CN112270451A (en) * 2020-11-04 2021-01-26 中国科学院重庆绿色智能技术研究院 Monitoring and early warning method and system based on reinforcement learning
CN112270451B (en) * 2020-11-04 2022-05-24 中国科学院重庆绿色智能技术研究院 Monitoring and early warning method and system based on reinforcement learning
CN112260733A (en) * 2020-11-10 2021-01-22 东南大学 Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method
CN112260733B (en) * 2020-11-10 2022-02-01 东南大学 Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method
CN112597693A (en) * 2020-11-19 2021-04-02 沈阳航盛科技有限责任公司 Self-adaptive control method based on depth deterministic strategy gradient
CN112853560A (en) * 2020-12-31 2021-05-28 盐城师范学院 Global process sharing control system and method based on ring spinning yarn quality
CN112668721A (en) * 2021-03-17 2021-04-16 Decision-making method for decentralized multi-agent systems in a general non-stationary environment
CN112966641A (en) * 2021-03-23 2021-06-15 中国电子科技集团公司电子科学研究院 Intelligent decision-making method for multiple sensors and multiple targets and storage medium
CN113189983B (en) * 2021-04-13 2022-05-31 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113189983A (en) * 2021-04-13 2021-07-30 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113269329B (en) * 2021-04-30 2024-03-19 北京控制工程研究所 Multi-agent distributed reinforcement learning method
CN113269329A (en) * 2021-04-30 2021-08-17 北京控制工程研究所 Multi-agent distributed reinforcement learning method
CN112926729A (en) * 2021-05-06 2021-06-08 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN112926729B (en) * 2021-05-06 2021-08-03 中国科学院自动化研究所 Man-machine confrontation intelligent agent strategy making method
CN113218400B (en) * 2021-05-17 2022-04-19 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning
CN113218400A (en) * 2021-05-17 2021-08-06 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning
CN113408796A (en) * 2021-06-04 2021-09-17 北京理工大学 Deep space probe soft landing path planning method for multitask deep reinforcement learning
CN113408796B (en) * 2021-06-04 2022-11-04 北京理工大学 Deep space probe soft landing path planning method for multitask deep reinforcement learning
CN113392798A (en) * 2021-06-29 2021-09-14 中国科学技术大学 Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation
CN113467508A (en) * 2021-06-30 2021-10-01 天津大学 Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN113467508B (en) * 2021-06-30 2022-06-28 天津大学 Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task
CN113554300A (en) * 2021-07-19 2021-10-26 河海大学 Shared parking space real-time allocation method based on deep reinforcement learning
CN113589842B (en) * 2021-07-26 2024-04-19 中国电子科技集团公司第五十四研究所 Unmanned cluster task cooperation method based on multi-agent reinforcement learning
CN113589842A (en) * 2021-07-26 2021-11-02 中国电子科技集团公司第五十四研究所 Unmanned clustering task cooperation method based on multi-agent reinforcement learning
CN113485119A (en) * 2021-07-29 2021-10-08 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113485119B (en) * 2021-07-29 2022-05-10 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113433953A (en) * 2021-08-25 2021-09-24 北京航空航天大学 Multi-robot cooperative obstacle avoidance method and device and intelligent robot
CN113792846A (en) * 2021-09-06 2021-12-14 中国科学院自动化研究所 State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment
CN113837654B (en) * 2021-10-14 2024-04-12 北京邮电大学 Multi-objective-oriented smart grid hierarchical scheduling method
CN113837654A (en) * 2021-10-14 2021-12-24 北京邮电大学 Multi-target-oriented intelligent power grid layered scheduling method
CN114548497A (en) * 2022-01-13 2022-05-27 山东师范大学 Crowd movement path planning method and system for realizing scene self-adaption
CN114638163A (en) * 2022-03-21 2022-06-17 Self-learning algorithm-based intelligent group cooperative tactics generation method
CN114638163B (en) * 2022-03-21 2024-09-06 重庆高新区飞马创新研究院 Intelligent group collaborative tactics generation method based on self-learning algorithm
CN115086374A (en) * 2022-06-14 2022-09-20 河南职业技术学院 Scene complexity self-adaptive multi-agent layered cooperation method
CN114996856A (en) * 2022-06-27 2022-09-02 北京鼎成智造科技有限公司 Data processing method and device for airplane intelligent agent maneuver decision
CN115366099B (en) * 2022-08-18 2024-05-28 江苏科技大学 Mechanical arm depth deterministic strategy gradient training method based on forward kinematics
CN115366099A (en) * 2022-08-18 2022-11-22 Mechanical arm deep deterministic policy gradient training method based on forward kinematics
CN118071119A (en) * 2024-04-18 2024-05-24 中国电子科技集团公司第十研究所 Heterogeneous sensor mixed cooperative scheduling decision method

Similar Documents

Publication Publication Date Title
CN108600379A (en) A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN106970615B (en) A kind of real-time online path planning method based on deep reinforcement learning
WO2022012265A1 (en) Robot learning from demonstration via meta-imitation learning
CN107179077B (en) Self-adaptive visual navigation method based on ELM-LRF
CN109559277A (en) Multi-unmanned aerial vehicle cooperative map construction method oriented to data sharing
CN110135341A (en) Weed identification method, apparatus and terminal device
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN109960880A (en) A kind of industrial robot obstacle-avoiding route planning method based on machine learning
CN114942633A (en) Multi-agent cooperative anti-collision picking method based on digital twins and reinforcement learning
CN110866588B (en) Training learning method and system for realizing individuation of learning ability model of intelligent virtual digital animal
Papadopoulos et al. Towards open and expandable cognitive AI architectures for large-scale multi-agent human-robot collaborative learning
CN109529338A (en) Object control method, apparatus, electronic device and computer-readable medium
CN110181508A (en) Underwater robot three-dimensional Route planner and system
CN112348285B (en) Crowd evacuation simulation method in dynamic environment based on deep reinforcement learning
CN110109653A (en) Land battle chess intelligent engine and operation method thereof
CN116663416A (en) CGF decision behavior simulation method based on behavior tree
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Liu et al. Learning communication for cooperation in dynamic agent-number environment
Zuo et al. SOAR improved artificial neural network for multistep decision-making tasks
Ruifeng et al. Research progress and application of behavior tree technology
Hu et al. Super eagle optimization algorithm based three-dimensional ball security corridor planning method for fixed-wing UAVs
Tian et al. Fruit Picking Robot Arm Training Solution Based on Reinforcement Learning in Digital Twin
Wang et al. Towards optimization of path planning: An RRT*-ACO algorithm
CN117518907A (en) Control method, device, equipment and storage medium of intelligent agent
Qin et al. A path planning algorithm based on deep reinforcement learning for mobile robots in unknown environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20180928)