CN108600379A - Heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient - Google Patents
Heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient
- Publication number
- CN108600379A CN108600379A CN201810397866.9A CN201810397866A CN108600379A CN 108600379 A CN108600379 A CN 108600379A CN 201810397866 A CN201810397866 A CN 201810397866A CN 108600379 A CN108600379 A CN 108600379A
- Authority
- CN
- China
- Prior art keywords
- action
- parameter
- actor
- state
- critic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/142—Network analysis or design using statistical or mathematical methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/145—Network analysis or design involving simulating, designing, planning or modelling of a network
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- Algebra (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Pure & Applied Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient, belonging to the field of coordinated decision-making for heterogeneous intelligent unmanned systems, and comprising the following steps. First, the characteristic attributes and reward/penalty rules of the heterogeneous agents are defined, the state space and action space of each agent are specified, and the motion environment in which the agents make coordinated decisions is constructed. Then, based on the deep deterministic policy gradient algorithm, an actor module that makes decision actions and a critic module that evaluates feedback are established, and the parameters of the learning model are trained. Using the trained model, the state sequence of each agent is obtained; finally, according to the reward/penalty rules set in the environment, a situation assessment of the agents' motion-state sequences is performed. The present invention can construct a reasonable motion environment according to actual requirements and, through the cooperation among the agents in the system, achieve intelligent perception and policy optimization, which has a positive effect on the development of the unmanned-systems field in China.
Description
Technical field
The invention belongs to the field of coordinated decision-making for heterogeneous intelligent unmanned systems, and in particular relates to a heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient.
Background art
In recent years, the rapid development of information technology and intelligent perception technology has laid an important foundation for advanced intelligent behaviours such as the perception of complex environments, accurate intelligent decision-making and multi-machine task cooperation. Research on intelligent unmanned systems has become a landmark achievement of artificial intelligence; the complexity of the tasks and the uncertainty of dynamic environments require such systems to have strong adaptive and collaborative capabilities.
Traditional swarm intelligence (Swarm Intelligence) [1] originated in 1959, when the French biologist Pierre-Paul Grassé found that insects form highly structured organizations able to complete complex tasks far beyond the capability of any individual. Ant colonies are a classic example of such intelligent clusters: through simple mutual communication and coordination between individuals, they exhibit intelligent behaviour at the scale of a large cluster. The exploration of swarm behaviour among insects has given rise to many swarm-intelligence algorithms, such as the ant colony system (Ant Colony System, ACS) [2] and particle swarm optimization (Particle Swarm Optimization, PSO). Traditional intelligent unmanned swarm systems are likewise based on biological swarm behaviour: through mutual perception, interaction and information transfer, they cooperate at low cost in dangerous environments to complete diverse and complex tasks. At present, task allocation within an unmanned swarm is usually carried out according to a maximum benefit-to-loss ratio (maximum allocation income with minimum loss) and a task-balancing principle, which reflects the cooperative advantage of the swarm; however, these swarm algorithms are not yet mature and are unsuited to the autonomous planning of large-scale complex tasks.
Situation-awareness learning methods based on deep reinforcement learning can give intelligent unmanned systems the ability to learn by themselves and improve their adaptability to complex and changing environments. Reinforcement learning has a long history; early reinforcement learning is closely related to the Markov decision process (MDP) model and can be reduced to a four-tuple of state s (state), action a (action), reward r (reward) and transition probability P (probability). The goal of learning is to find a policy: in a given state, different actions are taken with different probabilities and yield different returns. Its advantage is strong expressive power and good decision-making ability; its disadvantage is that actions and states are all discrete. In 2006, Hinton et al. proposed using restricted Boltzmann machines (Restricted Boltzmann Machine, RBM) to encode deep neural networks [3], which drew attention back to neural networks. In 2012, deep convolutional networks [4] broke out in the ImageNet competition [5], ushering in the flourishing of deep learning. In 2016, the deep reinforcement learning algorithms obtained by combining the perception ability of deep learning with the decision-making ability of reinforcement learning brought about the immense success of AlphaGo [6], establishing a new milestone for the development of artificial intelligence; the intelligent control of robots using deep reinforcement learning [7-9] has become a new research direction.
The references are as follows:
[1] Guy Theraulaz, Eric Bonabeau: A Brief History of Stigmergy. Artificial Life 5(2): 97-116 (1999)
[2] Marco Dorigo, Vittorio Maniezzo, Alberto Colorni: Ant system: optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B 26(1): 29-41 (1996)
[3] Geoffrey E. Hinton: Boltzmann machine. Scholarpedia 2(5): 1668 (2007)
[4] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012: 1106-1114
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Fei-Fei Li: ImageNet: A large-scale hierarchical image database. CVPR 2009: 248-255
[6] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587): 484-489 (2016)
[7] Fangyi Zhang, Jürgen Leitner, Michael Milford, Ben Upcroft, Peter I. Corke: Towards Vision-Based Deep Reinforcement Learning for Robotic Motion Control. CoRR abs/1511.03791 (2015)
[8] Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen: Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. CoRR abs/1603.02199 (2016)
[9] Chelsea Finn, Sergey Levine, Pieter Abbeel: Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. CoRR abs/1603.00448 (2016)
Summary of the invention
The technical problem solved by the present invention: based on existing algorithms and technology, a heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient is proposed. The method first constructs the motion environment in which the heterogeneous agents make coordinated decisions; then, based on the deep deterministic policy gradient algorithm, it establishes an actor module that makes decision actions and a critic module that evaluates feedback, and trains the parameters of the learning model; finally, coordinated decision-making among the heterogeneous agents is realized.
The technical solution of the invention: a heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient, comprising the following steps:
Step 1: define the characteristic attributes and reward/penalty rules of the heterogeneous agents, specify the state space and action space of each agent, abstract each agent as a moving node in the environment, and construct the motion environment in which the heterogeneous agents move cooperatively;
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and randomly initialize their parameters;
Step 3: the agents explore autonomously and randomly in the motion environment built in step 1: each agent, according to its current state s, obtains an action a from the actor module and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward/penalty rules, and the tuple <current state s, current action a, next state s', reward r> of each step is stored in the experience pool;
Step 4: train and learn the parameters of the critic module and the actor module from the <s, a, s', r> tuples stored in the experience pool in step 3, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent coordinated decision-making is met or the maximum number of iterations is reached;
Step 5: using the trained model, given the current state s of an agent, obtain its current action a and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent; at the same time, complete the situation assessment of the agent's motion-state sequence according to the reward/penalty rules set in the environment. For an unmanned system the optimal decision behaviour cannot be obtained intuitively, so the quality of the decision behaviour is analysed and judged according to the reward/penalty rules set in the environment.
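For illustration only, the following Python sketch outlines how steps 2-5 fit together. The environment, actor and critic objects are stubs standing in for the modules defined below; all names (ReplayPool, env, actor, critic and their methods) are hypothetical and are not part of the claimed embodiment.

```python
import random

class ReplayPool:                              # experience pool of <s, a, s', r> tuples (steps 2-4)
    def __init__(self, capacity=10000):
        self.capacity, self.data = capacity, []
    def add(self, s, a, s_next, r):
        if len(self.data) >= self.capacity:
            self.data.pop(0)                   # old tuples are replaced by newly generated ones (step 4)
        self.data.append((s, a, s_next, r))
    def sample(self, n):
        return random.sample(self.data, min(n, len(self.data)))

def train(env, actor, critic, episodes=100, steps=200):
    """Steps 3-4: explore, store transitions, update critic and actor from the pool."""
    pool = ReplayPool()
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps):
            a = actor.act(s, explore=True)          # decision action from the actor module
            s_next, r, done = env.step(a)           # reward given by the environment rules
            pool.add(s, a, s_next, r)
            critic.update(pool.sample(64), actor)   # critic learns from the actor's actions
            actor.update(pool.sample(64), critic)   # actor learns from the critic's value gradient
            s = s_next
            if done:
                break

def evaluate(env, actor, steps=200):
    """Step 5: roll out the trained model and score the state sequence by accumulated reward."""
    s, trajectory, score = env.reset(), [], 0.0
    for _ in range(steps):
        a = actor.act(s, explore=False)
        s, r, done = env.step(a)
        trajectory.append(s)
        score += r                                  # situation assessment via the reward/penalty rules
        if done:
            break
    return trajectory, score
```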
Preferably, the specific implementation of step 1 comprises the following sub-steps:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, abstract each agent as a moving node in the environment;
Step 1.2: set the action of an agent as [direction of motion of the next step]; set the state of an agent as [its own position coordinates x, y, the position coordinates x, y of the target, and the azimuth θ between its own position and the target position] (a small encoding sketch follows these sub-steps);
Step 1.3: set the reward/penalty rules in the environment;
Step 1.4: the abstract moving nodes of the agents, the action space and state space of the agents, and the reward/penalty rules in the environment together constitute the motion environment in which the heterogeneous agents make coordinated decisions.
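As a minimal illustration of this state and action encoding (a sketch, not part of the claimed method), the following Python snippet builds the five-dimensional state vector of step 1.2 and interprets an action as a heading angle; the helper names and the constant speed are hypothetical.

```python
import math

def agent_state(own_xy, target_xy):
    """State of step 1.2: [own x, own y, target x, target y, azimuth θ to the target]."""
    (x, y), (tx, ty) = own_xy, target_xy
    azimuth = math.atan2(ty - y, tx - x)      # bearing from the agent to the target
    return [x, y, tx, ty, azimuth]

def apply_action(own_xy, heading, speed, dt=1.0):
    """Action of step 1.2: the direction of motion for the next step, given as a heading in radians."""
    x, y = own_xy
    return (x + speed * math.cos(heading) * dt,
            y + speed * math.sin(heading) * dt)

# Example: an agent at (0, 0) observing a target at (3, 4), then moving toward it.
s = agent_state((0.0, 0.0), (3.0, 4.0))
next_xy = apply_action((0.0, 0.0), s[4], speed=1.0)
```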
Preferably, the specific implementation of step 2 comprises the following sub-steps:
Step 2.1: the update of the parameters of the actor module and the critic module must be built on experience-based learning, so a separate experience pool is set up to store the state-action pairs <current state s, current action a, next state s', reward r> of each moving node;
Step 2.2: establish the actor module, which takes the state s of each agent as the input of the network and produces the next output action a of each agent through several intermediate layers; at the same time, since the parameters of the network change dynamically during each iteration, a copy of the actor network structure is retained in order to make the parameter learning more stable, and this copy is only updated after a fixed number of time steps;
Step 2.3: establish the critic module, which takes the state s and action a of an agent as the input of the network and outputs the action value Q through several intermediate layers; likewise, to make the parameter learning more stable, a copy of the critic network structure is retained and is only updated after a fixed number of time steps.
Preferably, the specific implementation of step 4 comprises the following sub-steps:
Step 4.1: the critic module contains two network models with identical structure but different parameter-update schedules. The network model Q whose parameters are updated immediately is called the online critic, with parameters denoted θ^Q; the network model Q' whose parameters are updated with a delay is called the target critic, with parameters denoted θ^Q'.
For the target critic, according to the tuple <current state s, current action a, next state s', reward r> in the experience pool, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the target actor network is used to estimate the next action a' taken in the next state s', and the target action-value function is computed as Q'(s', a' | θ^Q'). From Q' the estimated expected return y of taking action a in the current state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor.
For the online critic, the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool.
The mean square error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is computed as
L = E[(y - Q(s, a | θ^Q))^2]
and the error L is used to update the parameters of the online critic network.
The target critic is a delayed update of the online critic, and its parameters are updated as:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor.
Step 4.2: the actor module contains two network models with identical structure but different parameter-update schedules. The network model μ whose parameters are updated immediately is the online actor, with parameters denoted θ^μ; the network model μ' whose parameters are updated with a delay is the target actor, with parameters denoted θ^μ'.
For the target actor, the next action a' of the next state s' in the tuple <current state s, current action a, next state s', reward r> of the experience pool is computed as a' = μ'(s' | θ^μ') and is used to compute the target action-value function Q'(s', a' | θ^Q') of the target critic.
For the online actor, the actual current action μ(s | θ^μ) is computed from the current state s in the experience pool; the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the policy-gradient formula
∇_θ^μ J ≈ E[ ∇_a Q(s, a | θ^Q)|_{a = μ(s)} · ∇_θ^μ μ(s | θ^μ) ].
The target actor is a delayed update of the online actor, and its parameters are updated as:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor.
Step 4.3: train the model parameters of the critic network and the actor network, and replace the previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent coordinated decision-making is met or the maximum number of iterations is reached.
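Purely as a numerical illustration of the update formulas above (a sketch, not the patented implementation), the following Python snippet computes the estimated expected return y and the delayed (soft) target-network updates for toy parameter vectors; the numeric values of γ, τ and the example transition are chosen arbitrarily.

```python
import numpy as np

gamma, tau = 0.99, 0.01                     # decay factor γ and balance factor τ (example values)

def estimated_return(r, q_target_next):
    """y = r + γ Q'(s', a' | θ^Q')  -- target for updating the online critic."""
    return r + gamma * q_target_next

def soft_update(theta_online, theta_target):
    """θ' = τ θ + (1 - τ) θ'  -- delayed update of the target-network parameters."""
    return tau * theta_online + (1.0 - tau) * theta_target

# Toy example: one transition and one pair of critic parameter vectors.
y = estimated_return(r=1.0, q_target_next=5.0)      # 1.0 + 0.99 * 5.0
theta_q  = np.array([0.5, -0.2])                     # online critic parameters θ^Q
theta_qp = np.array([0.4, -0.1])                     # target critic parameters θ^Q'
theta_qp = soft_update(theta_q, theta_qp)
```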
Compared with the prior art, the advantages and beneficial effects of the present invention are as follows:
(1) A feasible construction method for a heterogeneous multi-agent cooperative environment is proposed: by defining information such as the attributes, states and actions of the agents, the motion rules and the reward payouts, a heterogeneous multi-agent environment oriented towards a specific task is constructed.
(2) The original deep deterministic policy gradient algorithm is improved: the parameters of the actor module and the critic module are updated by sharing the states and actions of the agents with one another; the critic module is computed from the action output by the actor module, and the critic module in turn guides the parameter update of the actor module, reinforcing high-return actions and suppressing low-return actions, so that coordinated decision-making among the agents is achieved.
(3) For an unmanned system, the optimal decision of the system cannot be obtained intuitively. The present invention proposes to evaluate the sequence of action behaviours from the initial state to the end of iteration using the reward/penalty rules provided by the constructed heterogeneous multi-agent environment, so that the quality of the decision behaviour is finally analysed and judged and the evaluation of the model is completed.
Description of the drawings
Fig. 1 is the implementation flow chart of the present invention;
Fig. 2 is a structural schematic diagram of the heterogeneous multi-agent cooperative environment;
Fig. 3 is the network structure of the actor module;
Fig. 4 is the network structure of the critic module;
Fig. 5 is the data flow diagram between the actor module and the critic module.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the embodiments and the accompanying drawings. The embodiments described here are only used to illustrate and explain the present invention and are not used to limit it.
The heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient proposed by the present invention mainly comprises the following steps. First, the characteristic attributes and reward/penalty rules of the heterogeneous agents are defined, the state space and action space of each agent are specified, and the motion environment in which the agents make coordinated decisions is constructed. Then, based on the deep deterministic policy gradient algorithm, an actor module that makes decision actions and a critic module that evaluates feedback are defined, and the parameters of the learning model are trained, so that the decision action of the next step is made automatically from the position of an agent in its environment and its state information. Finally, since the optimal decision of the system cannot be given subjectively and can only be judged according to certain evaluation criteria, the quality of the decision actions is examined on the basis of the reward/penalty rules defined in the environment. The present invention can construct a reasonable motion environment according to actual requirements and, through the cooperation among the agents in the system, achieve intelligent perception and policy optimization, which has a positive effect on the development of the unmanned-systems field in China.
Each step is described in detail below.
Step 1: define the characteristic attributes and reward/penalty rules of the heterogeneous agents, specify the state space and action space of each agent, abstract each agent as a moving node in the environment, and construct the motion environment in which the heterogeneous agents make coordinated decisions.
In a specific implementation, the motion rules of the agents should be formulated according to the specific motion model, the motion space and action space of the agents should be specified, and the reward/penalty mechanism should be designed reasonably.
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and initialize the parameters of the actor module and the critic module.
The deep deterministic policy gradient algorithm is a reinforcement learning method based on the actor-critic framework and mainly contains two modules: the actor module and the critic module. The actor module is responsible for computing the action to be taken in the next step from the current state; the critic module is responsible for giving feedback corrections to the parameters of the actor module according to the estimated expected return produced by the current state and the action taken. In the initial training stage the parameters of the two modules need to be initialized separately.
Step 3: in the motion environment built in step 1, the agents rely on the algorithm of step 2 and explore autonomously and randomly for a certain number of initial iterations. Each agent obtains an action a from the actor module according to its current state s and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward/penalty rules, and the tuple <current state s, current action a, next state s', reward r> of each step is stored in the experience pool for the learning of the module parameters.
Step 4: train and learn the parameters of the critic module and the actor module from the <s, a, s', r> tuples stored in the experience pool in step 3, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent coordinated decision-making is met or the maximum number of iterations is reached.
In order to carry out the parameter updates stably, the critic module and the actor module each contain a network structure whose parameters are updated online in real time and a network structure whose parameters are updated with a delay of a certain number of time steps. The parameter update of the critic module relies on the action a computed by the actor module, and the parameter update of the actor module relies on the action-value gradient computed by the critic module; the two feed back to each other, achieving the goal of cooperative motion of the agents.
Step 5: using the trained model, given the current state s of an agent, obtain its current action a and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent. At the same time, complete the situation assessment of the agent's motion-state sequence according to the reward/penalty rules set in the environment. For an unmanned system, the optimal decision behaviour cannot be judged subjectively and can only be analysed according to certain objective criteria; the quality of the decision behaviour is therefore analysed and judged according to the reward/penalty rules set in the environment.
The specific implementation process of the above steps 1-5 is described in detail below in the following parts.
1. Constructing the motion environment for multi-agent coordinated decision-making
The implementation schematic is shown in Fig. 2 and is divided into the following 4 sub-steps:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, such as maximum motion speed and motion-area range, abstract each agent as a moving node in the environment;
Step 1.2: set the action space and state space of the agents. In the present invention the action of an agent is set as [direction of motion of the next step], and the state of an agent is set as [its own position coordinates x, y, the position coordinates x, y of the target, and the azimuth θ between itself and the target];
Step 1.3: set the reward/penalty mechanism in the environment, i.e. the environmental reward given when certain states are reached between the agents. The present invention mainly sets 3 reward/penalty rules (see the sketch after these sub-steps): a certain initial distance is kept between the agents, so that the agents must not be too close to one another; there is a maximum communication distance between the agents, and exceeding the prescribed maximum distance increases the penalty; a corresponding reward is given according to whether an agent can monitor the target, which is the final purpose of the coordinated decision-making.
Step 1.4: the abstract moving nodes of the agents, the action space and state space of the agents, and the reward/penalty rules in the environment together constitute the cooperative environment of the heterogeneous agents: for each agent, according to its current observation, the next action and the reward information are obtained, so as to guide the continuous optimization of the decision.
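The following Python sketch illustrates one possible encoding of the three reward/penalty rules of step 1.3. The thresholds D_MIN, D_MAX, MONITOR_RANGE and the reward magnitudes are assumed values chosen for illustration only and are not taken from the patent.

```python
import math

D_MIN, D_MAX, MONITOR_RANGE = 2.0, 10.0, 5.0   # assumed thresholds, not from the patent

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def step_reward(agent_xy, other_agents_xy, target_xy):
    """Combine the three reward/penalty rules of step 1.3 into one scalar return r."""
    r = 0.0
    for other in other_agents_xy:
        d = dist(agent_xy, other)
        if d < D_MIN:                      # rule 1: agents must not get too close to one another
            r -= 1.0
        if d > D_MAX:                      # rule 2: exceeding the maximum communication distance
            r -= 1.0 * (d - D_MAX)         #         is penalized, and the penalty grows with the excess
    if dist(agent_xy, target_xy) <= MONITOR_RANGE:
        r += 5.0                           # rule 3: reward for being able to monitor the target
    return r

# Example: one agent near the target, with one neighbour slightly too close.
print(step_reward((4.0, 0.0), [(5.0, 0.5)], (6.0, 0.0)))
```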
2. Establishing the network structures of the actor module and the critic module and initializing the network parameters
The actor module is used for decision actions and the critic module is used for evaluation feedback; this part is divided into the following 2 steps:
1) The schematic diagram of the actor network structure used in the present invention is shown in Fig. 3. The state s of each moving node is taken as the input and passes through three fully connected layers (inner-product layers), in which rectified linear units (Rectified Linear Units, ReLU) are used as the activation function after the first two fully connected layers, and the output of the third layer passes through a hyperbolic tangent function tanh(). The tanh() function is a variant of the sigmoid() function whose value range is [-1, 1] rather than the [0, 1] of the sigmoid function; the output result is the radian value of the direction of motion of each node in the next step. In the embodiment the actor module is implemented on the open-source deep learning framework TensorFlow (tf for short); the network weights (Weights) and biases (Bias) are all initialized with the tf.contrib.layers.xavier_initializer function in TensorFlow, which returns a Xavier initializer for initializing weights that keeps the gradient magnitude of every layer almost the same. During each iteration the parameters of the network change dynamically, so in order to make the parameter learning more stable a copy of the actor network structure is retained, and its parameters are only updated after a fixed number of time steps;
2) The schematic diagram of the critic network structure used in the present invention is shown in Fig. 4. The state s of each moving node is taken as the input and passes through one fully connected layer with a rectified-linear activation; its output together with the action a is then taken as the input of a second fully connected layer, and after rectified-linear activation the result is fed into a long short-term memory network LSTM (Long Short-Term Memory), whose output result is the action value Q corresponding to the state s and the action a. Likewise, in the embodiment the critic module is implemented on the open-source TensorFlow deep learning framework, the network weights (Weights) and biases (Bias) are initialized with the tf.contrib.layers.xavier_initializer function in TensorFlow, and a copy of the critic network structure whose parameters are updated only after a fixed number of time steps is retained.
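As a hedged example, the sketch below shows one possible TensorFlow 1.x construction of the two networks described above, assuming the five-dimensional state of step 1.2 and a one-dimensional heading action; the layer widths are arbitrary, and for brevity the critic's LSTM layer is replaced by a dense output layer, so this is an approximation of the described architecture rather than the patented network.

```python
import tensorflow as tf  # assumes TensorFlow 1.x

STATE_DIM, ACTION_DIM = 5, 1
xavier = tf.contrib.layers.xavier_initializer()      # initializer named in the embodiment

def build_actor(state, units=(64, 64), scope="actor"):
    """Three fully connected layers: ReLU, ReLU, then tanh giving a heading in [-1, 1] (radian scale)."""
    with tf.variable_scope(scope):
        h = tf.layers.dense(state, units[0], activation=tf.nn.relu, kernel_initializer=xavier)
        h = tf.layers.dense(h, units[1], activation=tf.nn.relu, kernel_initializer=xavier)
        return tf.layers.dense(h, ACTION_DIM, activation=tf.nn.tanh, kernel_initializer=xavier)

def build_critic(state, action, units=(64, 64), scope="critic"):
    """State through one dense+ReLU layer, concatenated with the action, then the action value Q.
    (The embodiment feeds this into an LSTM; a dense layer is used here to keep the sketch short.)"""
    with tf.variable_scope(scope):
        h = tf.layers.dense(state, units[0], activation=tf.nn.relu, kernel_initializer=xavier)
        h = tf.layers.dense(tf.concat([h, action], axis=1), units[1],
                            activation=tf.nn.relu, kernel_initializer=xavier)
        return tf.layers.dense(h, 1, kernel_initializer=xavier)     # Q(s, a)

state_ph  = tf.placeholder(tf.float32, [None, STATE_DIM])
action_ph = tf.placeholder(tf.float32, [None, ACTION_DIM])
mu = build_actor(state_ph)               # online actor; a delayed-update copy would use scope "target_actor"
q  = build_critic(state_ph, action_ph)   # online critic; a delayed-update copy would use scope "target_critic"
```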
3. Training and optimization based on the deep deterministic policy gradient algorithm
The parameter update of the critic module relies on the action a computed by the actor module, and the parameter update of the actor module relies on the action-value gradient computed by the critic module; the two feed back to each other to achieve coordinated decision-making among the agents, as shown in Fig. 5. This part is divided into the following 3 steps:
Step 4.1: the critic module contains two network models with identical structure but different parameter-update schedules. The network model Q whose parameters are updated immediately is called the online critic, with parameters denoted θ^Q; the network model Q' whose parameters are updated with a delay is called the target critic, with parameters denoted θ^Q'.
For the target critic, according to the tuple <current state s, current action a, next state s', reward r> in the experience pool, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the target actor network is used to estimate the next action a' taken in the next state s', and the target action-value function is computed as Q'(s', a' | θ^Q'). From Q' the estimated expected return y of taking action a in the current state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor.
For the online critic, the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool.
The error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is computed as
L = E[(y - Q(s, a | θ^Q))^2]
and the error L is used to update the parameters of the online critic network.
The target critic is a delayed update of the online critic, and its parameters are updated as:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor.
Step 4.2: the actor module contains two network models with identical structure but different parameter-update schedules. The network model μ whose parameters are updated immediately is the online actor, with parameters denoted θ^μ; the network model μ' whose parameters are updated with a delay is the target actor, with parameters denoted θ^μ'.
For the target actor, the next action a' of the next state s' in the tuple <current state s, current action a, next state s', reward r> of the experience pool is computed as a' = μ'(s' | θ^μ') and is used to compute the target action-value function Q'(s', a' | θ^Q') of the target critic.
For the online actor, the actual current action μ(s | θ^μ) is computed from the current state s in the experience pool; the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the policy-gradient formula
∇_θ^μ J ≈ E[ ∇_a Q(s, a | θ^Q)|_{a = μ(s)} · ∇_θ^μ μ(s | θ^μ) ].
The target actor is a delayed update of the online actor, and its parameters are updated as:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor.
Step 4.3: train the model parameters of the critic network and the actor network, and replace the previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent coordinated decision-making is met or the maximum number of iterations is reached.
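To make the interplay of the two updates concrete, the following self-contained NumPy sketch performs one critic update and one actor update for toy linear networks Q(s, a) = w·[s; a] and μ(s) = W·s, for which the gradients of the formulas above can be written analytically. It is an illustrative toy under these stated assumptions, not the TensorFlow implementation of the embodiment; all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a = 5, 1                               # state and action dimensions
gamma, tau, lr = 0.99, 0.01, 1e-3             # decay factor, balance factor, learning rate

# Toy linear networks Q(s, a) = w · [s; a] and μ(s) = W s, plus their delayed-update copies.
w,  W  = rng.normal(size=n_s + n_a), rng.normal(size=(n_a, n_s))   # online critic / online actor
w_t, W_t = w.copy(), W.copy()                                       # target critic / target actor

def critic(w, s, a):  return float(w @ np.concatenate([s, a]))
def actor(W, s):      return W @ s

def train_step(s, a, s_next, r):
    global w, W, w_t, W_t
    # Target critic: y = r + γ Q'(s', a' | θ^Q'), with a' = μ'(s' | θ^μ') from the target actor.
    a_next = actor(W_t, s_next)
    y = r + gamma * critic(w_t, s_next, a_next)
    # Online critic: minimize (y - Q(s, a | θ^Q))^2; for linear Q, dL/dw = -2 (y - Q) [s; a].
    q = critic(w, s, a)
    w += lr * 2.0 * (y - q) * np.concatenate([s, a])
    # Online actor: ascend ∇_a Q(s, a)|_{a=μ(s)} · ∇_θ μ(s); for linear models this is outer(w_a, s).
    grad_a_Q = w[n_s:]                        # ∂Q/∂a
    W += lr * np.outer(grad_a_Q, s)
    # Delayed (soft) updates of the target networks: θ' = τ θ + (1 - τ) θ'.
    w_t = tau * w + (1.0 - tau) * w_t
    W_t = tau * W + (1.0 - tau) * W_t

# One transition <s, a, s', r> drawn at random, just to exercise the update once.
s, s_next = rng.normal(size=n_s), rng.normal(size=n_s)
train_step(s, actor(W, s), s_next, r=1.0)
```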
Parts of the present invention that are not elaborated belong to techniques well known to those skilled in the art.
The above is only a part of the specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection defined by the claims.
Claims (4)
1. A heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient, characterized by comprising the following steps:
Step 1: define the characteristic attributes and reward/penalty rules of the heterogeneous agents, specify the state space and action space of each agent, abstract each agent as a moving node in the environment, and construct the motion environment in which the heterogeneous agents make coordinated decisions;
Step 2: based on the deep deterministic policy gradient algorithm, establish an actor module that makes decision actions and a critic module that evaluates feedback, and randomly initialize their parameters;
Step 3: the agents explore autonomously and randomly in the motion environment built in step 1: each agent, according to its current state s, obtains an action a from the actor module and reaches the next state s'; at the same time, the environmental reward r given when action a is taken in state s and the next state s' is reached is computed according to the reward/penalty rules, and the tuple <current state s, current action a, next state s', reward r> of each step is stored in the experience pool;
Step 4: train and learn the parameters of the critic module and the actor module from the <s, a, s', r> tuples stored in the experience pool in step 3, while replacing previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent coordinated decision-making is met or the maximum number of iterations is reached;
Step 5: using the trained model, given the current state s of an agent, obtain its current action a and reach the next state s'; repeat step 5 until the task is completed or the termination condition of the environment is reached, obtaining the state sequence of the agent; at the same time, complete the situation assessment of the agent's motion-state sequence according to the reward/penalty rules set in the environment.
2. The heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of said step 1 comprise:
Step 1.1: according to the characteristic attributes of the heterogeneous agents, abstract each agent as a moving node in the environment;
Step 1.2: set the action of an agent as [direction of motion of the next step]; set the state of an agent as [its own position coordinates x, y, the position coordinates x, y of the target, and the azimuth θ between its own position and the target position];
Step 1.3: set the reward/penalty rules in the environment;
Step 1.4: the abstract moving nodes of the agents, the action space and state space of the agents, and the reward/penalty rules in the environment together constitute the motion environment in which the heterogeneous agents make coordinated decisions.
3. The heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of said step 2 are as follows:
Step 2.1: set up a separate experience pool to store the state-action pairs of each agent <current state s, current action a, next state s', reward r>;
Step 2.2: establish the actor module, which takes the state s of each agent as the input of the network and obtains the next output action a of each agent through several intermediate layers; at the same time, retain a copy of the actor network structure whose parameters are only updated after a fixed number of time steps;
Step 2.3: establish the critic module, which takes the state s and action a of an agent as the input of the network and outputs the action value Q through several intermediate layers; at the same time, retain a copy of the critic network structure whose parameters are likewise only updated after a fixed number of time steps.
4. The heterogeneous multi-agent collaborative decision-making method based on the deep deterministic policy gradient according to claim 1, characterized in that the specific implementation sub-steps of said step 4 are as follows:
Step 4.1: the critic module contains two network models with identical structure but different parameter-update schedules; the network model Q whose parameters are updated immediately is called the online critic, with parameters denoted θ^Q, and the network model Q' whose parameters are updated with a delay is called the target critic, with parameters denoted θ^Q';
for the target critic, according to the tuple <current state s, current action a, next state s', reward r> in the experience pool, action a is taken in the current state s, the next state s' is reached, and the immediate reward r is obtained; the target actor network is used to estimate the next action a' taken in the next state s', the target action-value function is computed as Q'(s', a' | θ^Q'), and from Q' the estimated expected return y of taking action a in the current state s is obtained:
y = r + γ Q'(s', a' | θ^Q')
where γ (γ ∈ [0, 1]) denotes a decay factor;
for the online critic, the action value Q, i.e. the online expected return Q(s, a | θ^Q), is computed from the current state s and current action a in the experience pool;
the mean square error between the estimated expected return y and the online expected return Q(s, a | θ^Q) is computed as L = E[(y - Q(s, a | θ^Q))^2], and the error L is used to update the parameters of the online critic network;
the target critic is a delayed update of the online critic, and its parameters are updated as:
θ^Q' = τ θ^Q + (1 - τ) θ^Q'
where τ is a balance factor;
Step 4.2: the actor module contains two network models with identical structure but different parameter-update schedules; the network model μ whose parameters are updated immediately is the online actor, with parameters denoted θ^μ, and the network model μ' whose parameters are updated with a delay is the target actor, with parameters denoted θ^μ';
for the target actor, the next action a' of the next state s' in the tuple <current state s, current action a, next state s', reward r> of the experience pool is computed as a' = μ'(s' | θ^μ') and is used to compute the target action-value function Q'(s', a' | θ^Q') of the target critic;
for the online actor, the actual current action μ(s | θ^μ) is computed from the current state s in the experience pool; the parameters of the online actor network are updated jointly from the actual action μ(s | θ^μ) of the current state s and the output Q(s, a | θ^Q) of the online critic, with the policy-gradient formula ∇_θ^μ J ≈ E[ ∇_a Q(s, a | θ^Q)|_{a = μ(s)} · ∇_θ^μ μ(s | θ^μ) ];
the target actor is a delayed update of the online actor, and its parameters are updated as:
θ^μ' = τ θ^μ + (1 - τ) θ^μ'
where τ is a balance factor;
Step 4.3: train the model parameters of the critic network and the actor network, and replace the previously stored <s, a, s', r> tuples in the experience pool with newly generated ones; repeat step 4 until the optimization termination condition of multi-agent coordinated decision-making is met or the maximum number of iterations is reached.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810397866.9A CN108600379A (en) | 2018-04-28 | 2018-04-28 | Heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810397866.9A CN108600379A (en) | 2018-04-28 | 2018-04-28 | Heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient
Publications (1)
Publication Number | Publication Date |
---|---|
CN108600379A true CN108600379A (en) | 2018-09-28 |
Family
ID=63611007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810397866.9A Pending CN108600379A (en) | 2018-04-28 | 2018-04-28 | Heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108600379A (en) |
Cited By (57)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408157A (en) * | 2018-11-01 | 2019-03-01 | 西北工业大学 | A kind of determination method and device of multirobot cotasking |
CN109407644A (en) * | 2019-01-07 | 2019-03-01 | 齐鲁工业大学 | One kind being used for manufacturing enterprise's Multi-Agent model control method and system |
CN109657802A (en) * | 2019-01-28 | 2019-04-19 | 清华大学深圳研究生院 | A kind of Mixture of expert intensified learning method and system |
CN109670270A (en) * | 2019-01-11 | 2019-04-23 | 山东师范大学 | Crowd evacuation emulation method and system based on the study of multiple agent deeply |
CN109719721A (en) * | 2018-12-26 | 2019-05-07 | 北京化工大学 | A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait |
CN109828460A (en) * | 2019-01-21 | 2019-05-31 | 南京理工大学 | A kind of consistent control method of output for two-way heterogeneous multi-agent system |
CN109919319A (en) * | 2018-12-31 | 2019-06-21 | 中国科学院软件研究所 | Deeply learning method and equipment based on multiple history best Q networks |
CN109934332A (en) * | 2018-12-31 | 2019-06-25 | 中国科学院软件研究所 | The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends |
CN109948642A (en) * | 2019-01-18 | 2019-06-28 | 中山大学 | Multiple agent cross-module state depth deterministic policy gradient training method based on image input |
CN110045614A (en) * | 2019-05-16 | 2019-07-23 | 河海大学常州校区 | A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning |
CN110084375A (en) * | 2019-04-26 | 2019-08-02 | 东南大学 | A kind of hierarchy division frame based on deeply study |
CN110442129A (en) * | 2019-07-26 | 2019-11-12 | 中南大学 | A kind of control method and system that multiple agent is formed into columns |
CN110515298A (en) * | 2019-06-14 | 2019-11-29 | 南京信息工程大学 | Based on the adaptive marine isomery multiple agent speed cooperative control method of optimization |
CN110659796A (en) * | 2019-08-08 | 2020-01-07 | 北京理工大学 | Data acquisition method in rechargeable group vehicle intelligence |
CN110839031A (en) * | 2019-11-15 | 2020-02-25 | 中国人民解放军陆军工程大学 | Malicious user behavior intelligent detection method based on reinforcement learning |
CN110991972A (en) * | 2019-12-14 | 2020-04-10 | 中国科学院深圳先进技术研究院 | Cargo transportation system based on multi-agent reinforcement learning |
CN111026272A (en) * | 2019-12-09 | 2020-04-17 | 网易(杭州)网络有限公司 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
CN111050330A (en) * | 2018-10-12 | 2020-04-21 | 中兴通讯股份有限公司 | Mobile network self-optimization method, system, terminal and computer readable storage medium |
CN111309880A (en) * | 2020-01-21 | 2020-06-19 | 清华大学 | Multi-agent action strategy learning method, device, medium and computing equipment |
CN111416771A (en) * | 2020-03-20 | 2020-07-14 | 深圳市大数据研究院 | Method for controlling routing action based on multi-agent reinforcement learning routing strategy |
CN111563188A (en) * | 2020-04-30 | 2020-08-21 | 南京邮电大学 | Mobile multi-agent cooperative target searching method |
CN111582441A (en) * | 2020-04-16 | 2020-08-25 | 清华大学 | High-efficiency value function iteration reinforcement learning method of shared cyclic neural network |
CN111645076A (en) * | 2020-06-17 | 2020-09-11 | 郑州大学 | Robot control method and equipment |
CN111687840A (en) * | 2020-06-11 | 2020-09-22 | 清华大学 | Method, device and storage medium for capturing space target |
CN111814915A (en) * | 2020-08-26 | 2020-10-23 | 中国科学院自动化研究所 | Multi-agent space-time feature extraction method and system and behavior decision method and system |
CN111914069A (en) * | 2019-05-10 | 2020-11-10 | 京东方科技集团股份有限公司 | Training method and device, dialogue processing method and system and medium |
CN112015174A (en) * | 2020-07-10 | 2020-12-01 | 歌尔股份有限公司 | Multi-AGV motion planning method, device and system |
CN112180724A (en) * | 2020-09-25 | 2021-01-05 | 中国人民解放军军事科学院国防科技创新研究院 | Training method and system for multi-agent cooperative cooperation under interference condition |
CN112260733A (en) * | 2020-11-10 | 2021-01-22 | 东南大学 | Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method |
CN112270451A (en) * | 2020-11-04 | 2021-01-26 | 中国科学院重庆绿色智能技术研究院 | Monitoring and early warning method and system based on reinforcement learning |
CN112597693A (en) * | 2020-11-19 | 2021-04-02 | 沈阳航盛科技有限责任公司 | Self-adaptive control method based on depth deterministic strategy gradient |
CN112668721A (en) * | 2021-03-17 | 2021-04-16 | 中国科学院自动化研究所 | Decision-making method for decentralized multi-intelligent system in general non-stationary environment |
CN112853560A (en) * | 2020-12-31 | 2021-05-28 | 盐城师范学院 | Global process sharing control system and method based on ring spinning yarn quality |
CN112926729A (en) * | 2021-05-06 | 2021-06-08 | 中国科学院自动化研究所 | Man-machine confrontation intelligent agent strategy making method |
CN112966641A (en) * | 2021-03-23 | 2021-06-15 | 中国电子科技集团公司电子科学研究院 | Intelligent decision-making method for multiple sensors and multiple targets and storage medium |
CN112987713A (en) * | 2019-12-17 | 2021-06-18 | 杭州海康威视数字技术股份有限公司 | Control method and device for automatic driving equipment and storage medium |
CN113189983A (en) * | 2021-04-13 | 2021-07-30 | 中国人民解放军国防科技大学 | Open scene-oriented multi-robot cooperative multi-target sampling method |
CN113218400A (en) * | 2021-05-17 | 2021-08-06 | 太原科技大学 | Multi-agent navigation algorithm based on deep reinforcement learning |
CN113269329A (en) * | 2021-04-30 | 2021-08-17 | 北京控制工程研究所 | Multi-agent distributed reinforcement learning method |
CN113392798A (en) * | 2021-06-29 | 2021-09-14 | 中国科学技术大学 | Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation |
CN113408796A (en) * | 2021-06-04 | 2021-09-17 | 北京理工大学 | Deep space probe soft landing path planning method for multitask deep reinforcement learning |
CN113433953A (en) * | 2021-08-25 | 2021-09-24 | 北京航空航天大学 | Multi-robot cooperative obstacle avoidance method and device and intelligent robot |
CN113467508A (en) * | 2021-06-30 | 2021-10-01 | 天津大学 | Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task |
CN113485119A (en) * | 2021-07-29 | 2021-10-08 | 中国人民解放军国防科技大学 | Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability |
CN113490578A (en) * | 2019-03-08 | 2021-10-08 | 罗伯特·博世有限公司 | Method for operating a robot in a multi-agent system, robot and multi-agent system |
CN113534784A (en) * | 2020-04-17 | 2021-10-22 | 华为技术有限公司 | Decision method of intelligent body action and related equipment |
CN113554300A (en) * | 2021-07-19 | 2021-10-26 | 河海大学 | Shared parking space real-time allocation method based on deep reinforcement learning |
CN113589842A (en) * | 2021-07-26 | 2021-11-02 | 中国电子科技集团公司第五十四研究所 | Unmanned clustering task cooperation method based on multi-agent reinforcement learning |
CN113792846A (en) * | 2021-09-06 | 2021-12-14 | 中国科学院自动化研究所 | State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment |
CN113837654A (en) * | 2021-10-14 | 2021-12-24 | 北京邮电大学 | Multi-target-oriented intelligent power grid layered scheduling method |
WO2022052406A1 (en) * | 2020-09-08 | 2022-03-17 | 苏州浪潮智能科技有限公司 | Automatic driving training method, apparatus and device, and medium |
CN114548497A (en) * | 2022-01-13 | 2022-05-27 | 山东师范大学 | Crowd movement path planning method and system for realizing scene self-adaption |
CN114638163A (en) * | 2022-03-21 | 2022-06-17 | 重庆高新区飞马创新研究院 | Self-learning algorithm-based intelligent group cooperative combat method generation method |
CN114996856A (en) * | 2022-06-27 | 2022-09-02 | 北京鼎成智造科技有限公司 | Data processing method and device for airplane intelligent agent maneuver decision |
CN115086374A (en) * | 2022-06-14 | 2022-09-20 | 河南职业技术学院 | Scene complexity self-adaptive multi-agent layered cooperation method |
CN115366099A (en) * | 2022-08-18 | 2022-11-22 | 江苏科技大学 | Mechanical arm depth certainty strategy gradient training method based on forward kinematics |
CN118071119A (en) * | 2024-04-18 | 2024-05-24 | 中国电子科技集团公司第十研究所 | Heterogeneous sensor mixed cooperative scheduling decision method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106096729A (en) * | 2016-06-06 | 2016-11-09 | 天津科技大学 | A kind of towards the depth-size strategy learning method of complex task in extensive environment |
CN106970615A (en) * | 2017-03-21 | 2017-07-21 | 西北工业大学 | A kind of real-time online paths planning method of deeply study |
CN107065881A (en) * | 2017-05-17 | 2017-08-18 | 清华大学 | A kind of robot global path planning method learnt based on deeply |
- 2018
- 2018-04-28 CN CN201810397866.9A patent/CN108600379A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106096729A (en) * | 2016-06-06 | 2016-11-09 | 天津科技大学 | A kind of towards the depth-size strategy learning method of complex task in extensive environment |
CN106970615A (en) * | 2017-03-21 | 2017-07-21 | 西北工业大学 | A kind of real-time online paths planning method of deeply study |
CN107065881A (en) * | 2017-05-17 | 2017-08-18 | 清华大学 | A kind of robot global path planning method learnt based on deeply |
Non-Patent Citations (3)
Title |
---|
JELLE MUNK, JENS KOBER, ROBERT BABUSKA: "Learning State Representation for Deep Actor-Critic Control", 2016 IEEE 55th Conference on Decision and Control (CDC) *
S. PHANITEJA, PARIJAT DEWANGAN, POOJA GUHAN: "A deep reinforcement learning approach for dynamically stable inverse kinematics of humanoid robots", 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO) *
WANRONG HUANG, YANZHEN WANG, XIAODONG YI: "A Deep Reinforcement Learning Approach to Preserve Connectivity for Multi-robot Systems", 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) *
Cited By (88)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111050330A (en) * | 2018-10-12 | 2020-04-21 | 中兴通讯股份有限公司 | Mobile network self-optimization method, system, terminal and computer readable storage medium |
CN109408157A (en) * | 2018-11-01 | 2019-03-01 | 西北工业大学 | A kind of determination method and device of multirobot cotasking |
CN109408157B (en) * | 2018-11-01 | 2022-03-04 | 西北工业大学 | Method and device for determining multi-robot cooperative task |
CN109719721A (en) * | 2018-12-26 | 2019-05-07 | 北京化工大学 | A kind of autonomous emergence of imitative snake search and rescue robot adaptability gait |
CN109719721B (en) * | 2018-12-26 | 2020-07-24 | 北京化工大学 | Adaptive gait autonomous emerging method of snake-like search and rescue robot |
CN109919319A (en) * | 2018-12-31 | 2019-06-21 | 中国科学院软件研究所 | Deeply learning method and equipment based on multiple history best Q networks |
CN109934332A (en) * | 2018-12-31 | 2019-06-25 | 中国科学院软件研究所 | The depth deterministic policy Gradient learning method in pond is tested based on reviewer and double ends |
CN109407644A (en) * | 2019-01-07 | 2019-03-01 | 齐鲁工业大学 | One kind being used for manufacturing enterprise's Multi-Agent model control method and system |
CN109670270A (en) * | 2019-01-11 | 2019-04-23 | 山东师范大学 | Crowd evacuation emulation method and system based on the study of multiple agent deeply |
CN109948642A (en) * | 2019-01-18 | 2019-06-28 | 中山大学 | Multiple agent cross-module state depth deterministic policy gradient training method based on image input |
CN109828460A (en) * | 2019-01-21 | 2019-05-31 | 南京理工大学 | A kind of consistent control method of output for two-way heterogeneous multi-agent system |
CN109828460B (en) * | 2019-01-21 | 2021-11-12 | 南京理工大学 | Output consistency control method for bidirectional heterogeneous multi-agent system |
CN109657802B (en) * | 2019-01-28 | 2020-12-29 | 清华大学深圳研究生院 | Hybrid expert reinforcement learning method and system |
CN109657802A (en) * | 2019-01-28 | 2019-04-19 | 清华大学深圳研究生院 | A kind of Mixture of expert intensified learning method and system |
CN113490578A (en) * | 2019-03-08 | 2021-10-08 | 罗伯特·博世有限公司 | Method for operating a robot in a multi-agent system, robot and multi-agent system |
CN110084375B (en) * | 2019-04-26 | 2021-09-17 | 东南大学 | Multi-agent collaboration framework based on deep reinforcement learning |
CN110084375A (en) * | 2019-04-26 | 2019-08-02 | 东南大学 | A kind of hierarchy division frame based on deeply study |
CN111914069A (en) * | 2019-05-10 | 2020-11-10 | 京东方科技集团股份有限公司 | Training method and device, dialogue processing method and system and medium |
WO2020228636A1 (en) * | 2019-05-10 | 2020-11-19 | 京东方科技集团股份有限公司 | Training method and apparatus, dialogue processing method and system, and medium |
CN110045614A (en) * | 2019-05-16 | 2019-07-23 | 河海大学常州校区 | A kind of traversing process automatic learning control system of strand suction ship and method based on deep learning |
CN110515298A (en) * | 2019-06-14 | 2019-11-29 | 南京信息工程大学 | Based on the adaptive marine isomery multiple agent speed cooperative control method of optimization |
CN110515298B (en) * | 2019-06-14 | 2022-09-23 | 南京信息工程大学 | Offshore heterogeneous multi-agent speed cooperative control method based on optimized self-adaption |
CN110442129A (en) * | 2019-07-26 | 2019-11-12 | 中南大学 | A kind of control method and system that multiple agent is formed into columns |
CN110659796A (en) * | 2019-08-08 | 2020-01-07 | 北京理工大学 | Data acquisition method in rechargeable group vehicle intelligence |
CN110659796B (en) * | 2019-08-08 | 2022-07-08 | 北京理工大学 | Data acquisition method in rechargeable group vehicle intelligence |
CN110839031A (en) * | 2019-11-15 | 2020-02-25 | 中国人民解放军陆军工程大学 | Malicious user behavior intelligent detection method based on reinforcement learning |
CN111026272A (en) * | 2019-12-09 | 2020-04-17 | 网易(杭州)网络有限公司 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
CN111026272B (en) * | 2019-12-09 | 2023-10-31 | 网易(杭州)网络有限公司 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
CN110991972A (en) * | 2019-12-14 | 2020-04-10 | 中国科学院深圳先进技术研究院 | Cargo transportation system based on multi-agent reinforcement learning |
CN112987713B (en) * | 2019-12-17 | 2024-08-13 | 杭州海康威视数字技术股份有限公司 | Control method and device for automatic driving equipment and storage medium |
CN112987713A (en) * | 2019-12-17 | 2021-06-18 | 杭州海康威视数字技术股份有限公司 | Control method and device for automatic driving equipment and storage medium |
CN111309880A (en) * | 2020-01-21 | 2020-06-19 | 清华大学 | Multi-agent action strategy learning method, device, medium and computing equipment |
CN111309880B (en) * | 2020-01-21 | 2023-11-10 | 清华大学 | Multi-agent action strategy learning method, device, medium and computing equipment |
CN111416771A (en) * | 2020-03-20 | 2020-07-14 | 深圳市大数据研究院 | Method for controlling routing action based on multi-agent reinforcement learning routing strategy |
CN111582441B (en) * | 2020-04-16 | 2021-07-30 | 清华大学 | High-efficiency value function iteration reinforcement learning method of shared cyclic neural network |
CN111582441A (en) * | 2020-04-16 | 2020-08-25 | 清华大学 | High-efficiency value function iteration reinforcement learning method of shared cyclic neural network |
CN113534784B (en) * | 2020-04-17 | 2024-03-05 | 华为技术有限公司 | Decision method of intelligent body action and related equipment |
CN113534784A (en) * | 2020-04-17 | 2021-10-22 | 华为技术有限公司 | Decision method of intelligent body action and related equipment |
CN111563188A (en) * | 2020-04-30 | 2020-08-21 | 南京邮电大学 | Mobile multi-agent cooperative target searching method |
CN111687840B (en) * | 2020-06-11 | 2021-10-29 | 清华大学 | Method, device and storage medium for capturing space target |
CN111687840A (en) * | 2020-06-11 | 2020-09-22 | 清华大学 | Method, device and storage medium for capturing space target |
CN111645076A (en) * | 2020-06-17 | 2020-09-11 | 郑州大学 | Robot control method and equipment |
US12045061B2 (en) | 2020-07-10 | 2024-07-23 | Goertek Inc. | Multi-AGV motion planning method, device and system |
CN112015174B (en) * | 2020-07-10 | 2022-06-28 | 歌尔股份有限公司 | Multi-AGV motion planning method, device and system |
CN112015174A (en) * | 2020-07-10 | 2020-12-01 | 歌尔股份有限公司 | Multi-AGV motion planning method, device and system |
CN111814915B (en) * | 2020-08-26 | 2020-12-25 | 中国科学院自动化研究所 | Multi-agent space-time feature extraction method and system and behavior decision method and system |
CN111814915A (en) * | 2020-08-26 | 2020-10-23 | 中国科学院自动化研究所 | Multi-agent space-time feature extraction method and system and behavior decision method and system |
WO2022052406A1 (en) * | 2020-09-08 | 2022-03-17 | 苏州浪潮智能科技有限公司 | Automatic driving training method, apparatus and device, and medium |
CN112180724B (en) * | 2020-09-25 | 2022-06-03 | 中国人民解放军军事科学院国防科技创新研究院 | Training method and system for multi-agent cooperative cooperation under interference condition |
CN112180724A (en) * | 2020-09-25 | 2021-01-05 | 中国人民解放军军事科学院国防科技创新研究院 | Training method and system for multi-agent cooperative cooperation under interference condition |
CN112270451A (en) * | 2020-11-04 | 2021-01-26 | 中国科学院重庆绿色智能技术研究院 | Monitoring and early warning method and system based on reinforcement learning |
CN112270451B (en) * | 2020-11-04 | 2022-05-24 | 中国科学院重庆绿色智能技术研究院 | Monitoring and early warning method and system based on reinforcement learning |
CN112260733A (en) * | 2020-11-10 | 2021-01-22 | 东南大学 | Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method |
CN112260733B (en) * | 2020-11-10 | 2022-02-01 | 东南大学 | Multi-agent deep reinforcement learning-based MU-MISO hybrid precoding design method |
CN112597693A (en) * | 2020-11-19 | 2021-04-02 | 沈阳航盛科技有限责任公司 | Self-adaptive control method based on depth deterministic strategy gradient |
CN112853560A (en) * | 2020-12-31 | 2021-05-28 | 盐城师范学院 | Global process sharing control system and method based on ring spinning yarn quality |
CN112668721A (en) * | 2021-03-17 | 2021-04-16 | 中国科学院自动化研究所 | Decision-making method for decentralized multi-agent systems in a general non-stationary environment |
CN112966641A (en) * | 2021-03-23 | 2021-06-15 | 中国电子科技集团公司电子科学研究院 | Intelligent decision-making method for multiple sensors and multiple targets and storage medium |
CN113189983B (en) * | 2021-04-13 | 2022-05-31 | 中国人民解放军国防科技大学 | Open scene-oriented multi-robot cooperative multi-target sampling method |
CN113189983A (en) * | 2021-04-13 | 2021-07-30 | 中国人民解放军国防科技大学 | Open scene-oriented multi-robot cooperative multi-target sampling method |
CN113269329B (en) * | 2021-04-30 | 2024-03-19 | 北京控制工程研究所 | Multi-agent distributed reinforcement learning method |
CN113269329A (en) * | 2021-04-30 | 2021-08-17 | 北京控制工程研究所 | Multi-agent distributed reinforcement learning method |
CN112926729A (en) * | 2021-05-06 | 2021-06-08 | 中国科学院自动化研究所 | Man-machine confrontation intelligent agent strategy making method |
CN112926729B (en) * | 2021-05-06 | 2021-08-03 | 中国科学院自动化研究所 | Man-machine confrontation intelligent agent strategy making method |
CN113218400B (en) * | 2021-05-17 | 2022-04-19 | 太原科技大学 | Multi-agent navigation algorithm based on deep reinforcement learning |
CN113218400A (en) * | 2021-05-17 | 2021-08-06 | 太原科技大学 | Multi-agent navigation algorithm based on deep reinforcement learning |
CN113408796A (en) * | 2021-06-04 | 2021-09-17 | 北京理工大学 | Deep space probe soft landing path planning method for multitask deep reinforcement learning |
CN113408796B (en) * | 2021-06-04 | 2022-11-04 | 北京理工大学 | Deep space probe soft landing path planning method for multitask deep reinforcement learning |
CN113392798A (en) * | 2021-06-29 | 2021-09-14 | 中国科学技术大学 | Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation |
CN113467508A (en) * | 2021-06-30 | 2021-10-01 | 天津大学 | Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task |
CN113467508B (en) * | 2021-06-30 | 2022-06-28 | 天津大学 | Multi-unmanned aerial vehicle intelligent cooperative decision-making method for trapping task |
CN113554300A (en) * | 2021-07-19 | 2021-10-26 | 河海大学 | Shared parking space real-time allocation method based on deep reinforcement learning |
CN113589842B (en) * | 2021-07-26 | 2024-04-19 | 中国电子科技集团公司第五十四研究所 | Unmanned cluster task cooperation method based on multi-agent reinforcement learning |
CN113589842A (en) * | 2021-07-26 | 2021-11-02 | 中国电子科技集团公司第五十四研究所 | Unmanned clustering task cooperation method based on multi-agent reinforcement learning |
CN113485119A (en) * | 2021-07-29 | 2021-10-08 | 中国人民解放军国防科技大学 | Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability |
CN113485119B (en) * | 2021-07-29 | 2022-05-10 | 中国人民解放军国防科技大学 | Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability |
CN113433953A (en) * | 2021-08-25 | 2021-09-24 | 北京航空航天大学 | Multi-robot cooperative obstacle avoidance method and device and intelligent robot |
CN113792846A (en) * | 2021-09-06 | 2021-12-14 | 中国科学院自动化研究所 | State space processing method and system under ultrahigh-precision exploration environment in reinforcement learning and electronic equipment |
CN113837654B (en) * | 2021-10-14 | 2024-04-12 | 北京邮电大学 | Multi-objective-oriented smart grid hierarchical scheduling method |
CN113837654A (en) * | 2021-10-14 | 2021-12-24 | 北京邮电大学 | Multi-target-oriented intelligent power grid layered scheduling method |
CN114548497A (en) * | 2022-01-13 | 2022-05-27 | 山东师范大学 | Crowd movement path planning method and system for realizing scene self-adaption |
CN114638163A (en) * | 2022-03-21 | 2022-06-17 | 重庆高新区飞马创新研究院 | Self-learning algorithm-based intelligent group cooperative combat method generation method |
CN114638163B (en) * | 2022-03-21 | 2024-09-06 | 重庆高新区飞马创新研究院 | Intelligent group collaborative tactics generation method based on self-learning algorithm |
CN115086374A (en) * | 2022-06-14 | 2022-09-20 | 河南职业技术学院 | Scene complexity self-adaptive multi-agent layered cooperation method |
CN114996856A (en) * | 2022-06-27 | 2022-09-02 | 北京鼎成智造科技有限公司 | Data processing method and device for airplane intelligent agent maneuver decision |
CN115366099B (en) * | 2022-08-18 | 2024-05-28 | 江苏科技大学 | Mechanical arm depth deterministic strategy gradient training method based on forward kinematics |
CN115366099A (en) * | 2022-08-18 | 2022-11-22 | 江苏科技大学 | Mechanical arm depth deterministic strategy gradient training method based on forward kinematics |
CN118071119A (en) * | 2024-04-18 | 2024-05-24 | 中国电子科技集团公司第十研究所 | Heterogeneous sensor mixed cooperative scheduling decision method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108600379A (en) | A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient | |
CN106970615B (en) | A kind of real-time online path planning method based on deep reinforcement learning | |
WO2022012265A1 (en) | Robot learning from demonstration via meta-imitation learning | |
CN107179077B (en) | Self-adaptive visual navigation method based on ELM-LRF | |
CN109559277A (en) | Multi-unmanned aerial vehicle cooperative map construction method oriented to data sharing | |
CN110135341A (en) | Weed identification method, apparatus and terminal device | |
CN114741886B (en) | Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation | |
CN109960880A (en) | A kind of industrial robot obstacle-avoiding route planning method based on machine learning | |
CN114942633A (en) | Multi-agent cooperative anti-collision picking method based on digital twins and reinforcement learning | |
CN110866588B (en) | Training learning method and system for realizing individuation of learning ability model of intelligent virtual digital animal | |
Papadopoulos et al. | Towards open and expandable cognitive AI architectures for large-scale multi-agent human-robot collaborative learning | |
CN109529338A (en) | Object control method, apparatus, electronic device and computer-readable medium | |
CN110181508A (en) | Underwater robot three-dimensional route planning method and system | |
CN112348285B (en) | Crowd evacuation simulation method in dynamic environment based on deep reinforcement learning | |
CN110109653A (en) | Land battle chess intelligent engine and operation method thereof | |
CN116663416A (en) | CGF decision behavior simulation method based on behavior tree | |
CN115265547A (en) | Robot active navigation method based on reinforcement learning in unknown environment | |
Liu et al. | Learning communication for cooperation in dynamic agent-number environment | |
Zuo et al. | SOAR improved artificial neural network for multistep decision-making tasks | |
Ruifeng et al. | Research progress and application of behavior tree technology | |
Hu et al. | Super eagle optimization algorithm based three-dimensional ball security corridor planning method for fixed-wing UAVs | |
Tian et al. | Fruit Picking Robot Arm Training Solution Based on Reinforcement Learning in Digital Twin | |
Wang et al. | Towards optimization of path planning: An RRT*-ACO algorithm | |
CN117518907A (en) | Control method, device, equipment and storage medium of intelligent agent | |
Qin et al. | A path planning algorithm based on deep reinforcement learning for mobile robots in unknown environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180928 |