CN113449458A - Multi-agent deep deterministic policy gradient method based on curriculum learning - Google Patents

Multi-agent deep deterministic policy gradient method based on curriculum learning

Info

Publication number
CN113449458A
CN113449458A
Authority
CN
China
Prior art keywords
agent
strategy
network
learning
gradient method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110798780.9A
Other languages
Chinese (zh)
Inventor
黄梦醒
冯子凯
吴迪
毋媛媛
冯思玲
张宏瑞
帅文轩
施之羿
于睿华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hainan University
Original Assignee
Hainan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hainan University filed Critical Hainan University
Priority to CN202110798780.9A priority Critical patent/CN113449458A/en
Publication of CN113449458A publication Critical patent/CN113449458A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/25Design optimisation, verification or simulation using particle-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/04Constraint-based CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning, which combines curriculum learning with reinforcement learning. When sampling from the experience replay pool, data are drawn according to priority weights determined by the curriculum criterion complexity. Each agent is then trained with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network. When the next state reached by the agents acting in the environment is not a terminal state, the curriculum criterion is updated and the iterative calculation is repeated with a more complex curriculum. The priority criterion function contained in the curriculum criterion reflects the sampling priority weight of each sample, the repetition penalty accounts for the influence of repeated sampling on sample diversity, and the redundant-information penalty reduces the redundancy of the information exchanged between agents. Compared with other algorithms, the method improves the convergence efficiency and the final reward.

Description

Multi-agent deep deterministic policy gradient method based on curriculum learning
Technical Field
The invention relates to the technical field of machine learning, and in particular to a multi-agent deep deterministic policy gradient method based on curriculum learning.
Background
Reinforcement learning describes and solves the problem of an agent learning a policy that maximizes reward or achieves a specific goal while interacting with its environment. In recent years reinforcement learning has been applied to many challenging problems, such as single agents in games and robotics, and it has been used successfully in the single-agent setting. Real-world scenarios, however, are mostly multi-agent. For example, in unmanned-ship path planning, when an unmanned surface vessel navigates autonomously it must avoid static obstacles and other moving ships, and multi-agent reinforcement learning can recommend an optimal path that keeps traffic flowing smoothly. In taxi dispatching, multi-agent deep reinforcement learning can analyze the geographic distribution of urban population, taxis and pedestrian flow, and assign targets and routes to taxis in different locations so that traffic resources are used to the fullest extent. In cooperative formation of multiple unmanned ships, a multi-agent reinforcement learning algorithm can adaptively and cooperatively handle emergencies and disturbances in various sailing environments. In such multi-agent settings, because the environment information and the joint state information of the agents grow exponentially, traditional reinforcement learning algorithms suffer from instability and difficult convergence, so the reinforcement learning algorithms for multi-agent systems need to be improved.
Curriculum learning is a machine learning strategy in which learning starts from a simple curriculum and then proceeds to more complex ones; the simple curriculum lays the foundation for learning the later, more complex curriculum, ultimately improving the final asymptotic performance on the target task or reducing the computation time, thereby improving the effect of transfer learning.
The invention patent with publication number CN110852448A discloses a cooperative agent learning method based on multi-agent reinforcement learning; it only discloses how multiple agents in the same environment use global feature information through cooperative relationships so that different agents share model parameters and the model complexity is simplified, but it does not disclose the use of curriculum learning to address the difficulty of convergence. The master's thesis "Research and Application of Reinforcement Learning in Multi-agent Cooperation" from the University of Electronic Science and Technology of China proposes an attention-based multi-agent reinforcement learning method suitable for globally observable settings and a graph-network-based multi-agent reinforcement learning method suitable for partially observable environments, but does not correspondingly extend these methods to curriculum learning.
Disclosure of Invention
Therefore, the invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning, which is used to solve the problems of poor stability and difficult convergence of reinforcement learning algorithms in the multi-agent field.
The technical solution of the invention is realized as follows:
A multi-agent deep deterministic policy gradient method based on curriculum learning comprises the following steps:
Step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
Step S2, initializing the parameters and setting the number of iterations;
Step S3, the agents act in the multi-agent particle environment according to their policy networks, and the information generated by their actions is stored in the experience replay pool, this information including the next-state information;
Step S4, constructing the curriculum criterion from the information in the experience replay pool, calculating the curriculum criterion complexity, and sampling data from the experience replay pool according to priority weights determined by the curriculum criterion complexity;
Step S5, training each agent with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network, and after the training of the agents storing the errors and the repeated-sampling counts in the experience replay pool for updating;
Step S6, determining whether the state corresponding to the next-state information in step S3 is a terminal state; if so, stopping the iteration, otherwise executing steps S3-S5 in a loop.
Preferably, the constraint conditions in step S1 include:
the observations of a single agent are independent of the observations of other agents;
the environment is unknown, and the agent cannot predict the reward and the post-action state;
the communication method between the agents is ignored.
Preferably, initializing the parameters in step S2 includes: initializing the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate, and the maximum number of steps of an agent in a cycle.
Preferably, the information generated by the multi-agent actions in step S3 includes: state information, action information, reward information, and next-state information.
Preferably, the curriculum criterion in step S4 includes a priority criterion function, a repetition penalty function and a redundant-information penalty function, and the curriculum criterion complexity CI(x_i) is computed by combining the priority criterion function SP(δ_i, λ), the repetition penalty function RP(cn_i) and the redundant-information penalty function RIP(N), weighted by η and a second weighting coefficient.
Preferably, the priority criterion function SP(δ_i, λ) maps into [0, 1], where δ is the TD error and λ is the curriculum factor.
Preferably, the repetition penalty function RP(cn_i) is a function of cn, where cn is the incremental repetition vector maintained in the experience replay pool to record the number of times each sample has been drawn.
Preferably, the redundant-information penalty function RIP(N) is a function of N, where N is the number of agents.
Preferably, the specific steps of training each agent in step S5 with a deep deterministic policy gradient method based on the Adam optimizer and updating the policy network, policy target network, evaluation network and evaluation target network are:
Step S51, calculating the weights of the loss function and updating the evaluation network by minimizing the loss function with the Adam optimizer;
Step S52, updating the policy network using the policy gradient;
Step S53, updating the parameters of the evaluation target network and the policy target network.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning. A multi-agent particle environment is built; the agents act in it according to preset policy networks and the generated information is stored in the experience replay pool; the curriculum criterion is constructed from the information in the experience replay pool and the curriculum criterion complexity is calculated; the agents sample data from the experience replay pool according to priority weights determined by the curriculum criterion complexity; each agent is trained with a deep deterministic policy gradient method based on the Adam optimizer to update the policy network, policy target network, evaluation network and evaluation target network; and the iteration stops when the next state produced by the agents' actions in the multi-agent particle environment is a terminal state. By combining curriculum learning with reinforcement learning in this way, convergence and stability may be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only preferred embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of the multi-agent deep deterministic policy gradient method based on curriculum learning of the present invention;
FIG. 2 is a comparison of four algorithms for different numbers of agents in a simple multi-agent environment;
FIG. 3 is a graph of the average reward iteration for four algorithms with different numbers of agents in a simple multi-agent environment;
FIG. 4 is a comparison of four algorithms in a Tag multi-agent environment;
FIG. 5 is a graph of the average reward iteration curves for the four algorithms in a Tag multi-agent environment.
Detailed Description
For a better understanding of the technical content of the present invention, a specific embodiment is provided below, and the present invention is further described with reference to the accompanying drawings.
Referring to FIG. 1, the multi-agent deep deterministic policy gradient method based on curriculum learning provided by the invention comprises the following steps:
Step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
the invention adopts Openai to construct a multi-agent particle environment, in the environment, the actions of the agents are continuous and discrete, the task of each agent is to find and go to the coordinate of the agent in a specified step number, and obtain corresponding reward according to the distance between the last step of the agent and the coordinate of the agent, and the total reward of the system is the sum of the rewards of all agents.
The behavior policies of the agents are defined as the set π = {π_1, π_2, ..., π_N}, each represented by a neural network; likewise, a = {a_1, a_2, ..., a_N} is the set of agent actions, S = {S_1, S_2, ..., S_N} is the set of agent states, and the parameter set of all agents is defined as θ = {θ_1, θ_2, ..., θ_N}. Assuming the deterministic policy adopted by an agent at each step is μ, the action of each step is obtained by a_t = μ(S_t). The reward obtained after executing a policy is determined by a Q function. The algorithm operates under the following constraint conditions:
the learned policy can only be executed using local information, i.e. the observations of a single agent are independent of the observations of other agents;
a differentiable dynamics model of the environment is not required; the environment is unknown, the agent cannot predict the reward or the state after an action, and the agent's behavior depends only on its policy;
no communication method between the agents is assumed, i.e. no distinguishable communication channels between the agents are presumed. When the above conditions are satisfied, the versatility of the algorithm is greatly improved, and it is suitable for competitive and cooperative games, including settings with a defined communication mode.
Step S2, initialize the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate and the maximum number of steps of an agent in a cycle, and set the number of iterations, as shown in Table 1.
Table 1. Initialization of the parameters:
Learning rate: 0.01
Update coefficient τ: 0.01
Attenuation factor γ: 0.95
Experience replay pool size: 25600
Sample batch size: 1024
Maximum number of steps of an agent in a cycle: 20
Number of iterations: 10000
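For illustration only, the values of Table 1 could be grouped in a configuration structure such as the Python sketch below; the structure and names (for example CLMADDPGConfig) are hypothetical and do not come from the patent.

    from dataclasses import dataclass

    @dataclass
    class CLMADDPGConfig:
        """Hyperparameters from Table 1 (names are illustrative, not from the patent)."""
        learning_rate: float = 0.01      # step size for the Adam optimizer
        tau: float = 0.01                # soft-update coefficient for the target networks
        gamma: float = 0.95              # reward attenuation (discount) factor
        replay_pool_size: int = 25600    # capacity of the experience replay pool
        batch_size: int = 1024           # number of samples drawn per update
        max_steps_per_cycle: int = 20    # maximum steps an agent may take in one cycle
        num_iterations: int = 10000      # total training iterations

    config = CLMADDPGConfig()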
Step S3, the agents act in the multi-agent particle environment according to their policy networks, and the information generated by their actions is stored in the experience replay pool; this information comprises state information, action information, reward information and next-state information;
In a two-dimensional simulation environment, the agents take a random first action; each agent interacts with the environment and obtains a reward according to the set rule, after which the environment transitions to the next state. After a number of steps have been executed in sequence, one cycle is completed, and this process conforms to a Markov decision process. A certain number of cycles are executed, and the state information, action information, reward information and next-state information obtained during these cycles are stored in the experience replay pool for subsequent sampling.
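As a minimal sketch of this collection process, the following Python fragment assumes a Gym-style multi-agent environment interface (reset() returning one observation per agent and step() returning next observations, rewards, done flags and extra info) and a bounded replay pool; all names are illustrative assumptions, not the patent's code.

    from collections import deque

    # Experience replay pool: each entry is (states, actions, rewards, next_states, dones).
    replay_pool = deque(maxlen=25600)

    def collect_cycles(env, policies, num_cycles=100, max_steps=20):
        """Run num_cycles cycles and store every transition in the replay pool."""
        for _ in range(num_cycles):
            states = env.reset()                          # one observation per agent
            for _ in range(max_steps):
                # each agent chooses an action from its own policy network
                actions = [pi(s) for pi, s in zip(policies, states)]
                next_states, rewards, dones, _ = env.step(actions)
                replay_pool.append((states, actions, rewards, next_states, dones))
                states = next_states
                if all(dones):                            # terminal state reached early
                    break

With an environment exposing this assumed interface and one policy per agent, collect_cycles(env, policies) fills the pool for the subsequent priority-weighted sampling.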
Step S4, construct the curriculum criterion from the information in the experience replay pool, calculate the curriculum criterion complexity, and sample data from the experience replay pool according to priority weights determined by the curriculum criterion complexity. The curriculum criterion comprises a priority criterion function, a repetition penalty function and a redundant-information penalty function; the curriculum criterion complexity CI(x_i) is computed by combining the priority criterion function SP(δ_i, λ), the repetition penalty function RP(cn_i) and the redundant-information penalty function RIP(N), weighted by η and a second weighting coefficient.
In the present invention, two key issues are how to evaluate the complexity of each sample and how to design a well-defined criterion. The environment in which the agents act in reinforcement learning is unknown, so no prior knowledge is available for evaluating the samples; moreover, an artificially fixed standard of sample complexity may not suit an agent that keeps changing, since a concept that is difficult at the beginning of learning can easily be understood by the later, more experienced agent. The agent should therefore select samples from the experience replay pool by itself, which is known as autonomous learning, and to reduce the risk of selecting unreliable data the agent should focus on choosing appropriate samples in each value iteration: samples that are too simple may not help improve the current learning capability, while others may be too difficult for the agent to understand. The invention therefore defines a self-paced priority criterion function SP(δ_i, λ) that selects suitable samples from the experience replay pool and dynamically reflects the information gradually obtained during training. Here δ is the TD error and λ is the curriculum factor representing the model age, taken as 0.6 in the invention; SP(δ_i, λ) maps into [0, 1], and if λ < |δ| ≤ 2λ, SP(δ_i, λ) is a monotonically decreasing function of |δ|, if |δ| > 2λ, SP(δ_i, λ) is a monotonically increasing function, and otherwise SP(δ_i, λ) takes its global maximum.
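The exact expression of SP(δ_i, λ) is given only as a formula image in the original filing; the Python sketch below shows one possible piecewise form that matches the behaviour described above and is offered purely as an illustrative assumption, not as the patent's formula.

    def sp(delta: float, lam: float) -> float:
        """One possible self-paced priority function SP(delta, lambda) in [0, 1].

        Easy samples (|delta| <= lambda) get the global maximum, moderately hard
        samples are down-weighted, and very hard samples (|delta| > 2*lambda) are
        gradually re-weighted upward; the exact shape is an assumption.
        """
        d = abs(delta)
        if d <= lam:                       # easy sample: global maximum priority
            return 1.0
        if d <= 2.0 * lam:                 # monotonically decreasing on (lambda, 2*lambda]
            return 1.0 - 0.5 * (d - lam) / lam
        # monotonically increasing beyond 2*lambda, capped below the global maximum
        return min(0.9, 0.5 + 0.1 * (d - 2.0 * lam))

    print(sp(0.3, 0.6), sp(0.9, 0.6), sp(2.0, 0.6))   # 1.0, 0.75, 0.58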
In experience replay, unacceptable over-training can occur when some unnecessary samples are reused too many times; clearly, the limited replay memory then makes sample training worse. This reveals a traditional problem in reinforcement learning, the exploration-exploitation trade-off: fully exploring the state-action space is important to prevent the algorithm from falling into local minima, while exploiting the current policy helps the algorithm converge as soon as possible. If the agent is limited to some specific samples, it will therefore not obtain the optimal policy. To solve this problem, an incremental repetition vector {cn_1, cn_2, ..., cn_N} is added to the experience replay pool to record how many times each sample has been drawn. In each iteration the agent learns by sampling to update its model parameters; whenever a sample is drawn, the corresponding repetition count is updated, and the more often a sample has been drawn, the smaller its probability of being selected in the next iteration. A repetition penalty function RP(cn) is defined to accomplish this, where cn is the incremental repetition vector maintained in the experience replay pool.
When the number of agents is large, the amount of information exchanged between agents grows exponentially; the state and action information that each agent needs to extract from the other agents becomes redundant, and training the agents becomes difficult. The invention therefore introduces a redundant-information penalty function RIP(N), where N is the number of agents.
The training of a single agent does not use the information of all other agents; instead, a part of the information is extracted at random, and the amount of extracted information is constrained by the redundant-information penalty. First, a quantity of information is randomly selected in the range [1, N], and the critic (evaluation) function is then evaluated on this randomly selected subset of agent information.
After the priority criterion function, the repetition penalty function and the redundant-information penalty function have been obtained, the curriculum criterion complexity can be calculated. Samples are drawn with probability proportional to the curriculum criterion complexity to form a mini-batch, after which the curriculum criterion complexity is updated; finally, the curriculum factor is increased, λ ← λ + μ, so that more difficult samples are drawn in the next value iteration.
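The following sketch illustrates how the three terms could be combined into a curriculum criterion complexity and used to draw a mini-batch with probability proportional to that complexity, after which the repetition vector is updated. The specific forms of RP and RIP and the weighted combination are assumptions for illustration; only the overall procedure follows the description above.

    import numpy as np

    def sp(delta, lam):
        # same piecewise priority sketch as above (assumed form)
        d = abs(delta)
        if d <= lam:
            return 1.0
        if d <= 2.0 * lam:
            return 1.0 - 0.5 * (d - lam) / lam
        return min(0.9, 0.5 + 0.1 * (d - 2.0 * lam))

    def rp(cn):
        # assumed repetition penalty: decreases as a sample is drawn more often
        return 1.0 / (1.0 + cn)

    def rip(num_agents):
        # assumed redundant-information penalty: decreases with the agent count
        return 1.0 / np.log(2.0 + num_agents)

    def curriculum_complexity(td_errors, counts, num_agents, lam, eta=0.5, phi=0.5):
        # weighted combination of the three terms (the weighting itself is an assumption)
        sp_term = np.array([sp(d, lam) for d in td_errors])
        rp_term = np.array([rp(c) for c in counts])
        return sp_term + eta * rp_term + phi * rip(num_agents)

    def sample_batch(td_errors, counts, num_agents, lam, batch_size):
        # draw indices with probability proportional to curriculum complexity,
        # then update the repetition vector cn for the drawn samples
        ci = curriculum_complexity(td_errors, counts, num_agents, lam)
        probs = ci / ci.sum()
        idx = np.random.choice(len(ci), size=batch_size, p=probs)
        for i in idx:
            counts[i] += 1
        return idx

    # after each value iteration the curriculum factor grows: lam = lam + mu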
Step S5, train each agent with a deep deterministic policy gradient method based on the Adam optimizer, update the policy network, policy target network, evaluation network and evaluation target network, and after the training of the agents store the errors and the repeated-sampling counts in the experience replay pool for updating.
Training each agent in step S5 with the deep deterministic policy gradient method based on the Adam optimizer and updating the policy network, policy target network, evaluation network and evaluation target network comprises the following specific steps:
Step S51, calculate the importance-sampling weight of the loss function, ω_j = (N·P(j))^(-β) / max_i ω_i, and the current target Q value, y_j = r_j + γ·Q′(x′_j, a′_1, ..., a′_N), and update the evaluation network by minimizing the weighted loss L = (1/K) Σ_j ω_j (y_j − Q(x_j, a^j_1, ..., a^j_N))² with the Adam optimizer, where β is the sampling weight coefficient, P is the non-uniform sampling probability, r is the current instantaneous reward, γ is the attenuation factor, Q′ is the value of the next state, x is the state observation, a is the action of an agent, and K is the number of samples in one batch gradient descent step.
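As an illustration of step S51, the PyTorch sketch below performs one evaluation-network (critic) update with importance-sampling weights and the Adam optimizer, using random tensors in place of a sampled batch; the network sizes and variable names are assumptions, not the patent's implementation.

    import torch
    import torch.nn as nn

    obs_dim, act_dim, n_agents, K = 8, 2, 3, 32        # illustrative sizes
    joint_dim = n_agents * (obs_dim + act_dim)

    critic = nn.Sequential(nn.Linear(joint_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    target_critic = nn.Sequential(nn.Linear(joint_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    target_critic.load_state_dict(critic.state_dict())
    optimizer = torch.optim.Adam(critic.parameters(), lr=0.01)

    # Random stand-ins for one sampled mini-batch (x, a, r, x', a').
    x  = torch.randn(K, n_agents * obs_dim)
    a  = torch.randn(K, n_agents * act_dim)
    r  = torch.randn(K, 1)
    x2 = torch.randn(K, n_agents * obs_dim)
    a2 = torch.randn(K, n_agents * act_dim)           # actions from the target policies
    w  = torch.rand(K, 1)                             # importance-sampling weights omega_j

    gamma = 0.95
    with torch.no_grad():
        y = r + gamma * target_critic(torch.cat([x2, a2], dim=1))   # target Q value

    q = critic(torch.cat([x, a], dim=1))
    loss = (w * (y - q) ** 2).mean()                  # weighted MSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()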
Step S52, update the policy network using the sampled policy gradient
∇_{θi} J ≈ (1/K) Σ_j ∇_{θi} μ_i(o^j_i) ∇_{a_i} Q_i(x^j, a^j_1, ..., a_i, ..., a^j_N) |_{a_i = μ_i(o^j_i)},
where ∇ denotes the gradient, θ_i are the parameters of agent i defined above, Q_i is the evaluation network, and μ_i is the policy of agent i.
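Continuing the same illustrative setup, step S52 can be sketched as a deterministic policy-gradient update in which the policy network is trained by ascending the evaluation network's value with respect to its own action; again this is an assumed implementation, not the patent's code.

    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
    actor_optimizer = torch.optim.Adam(actor.parameters(), lr=0.01)

    # Local observations of agent i within the sampled batch
    # (assume agent i owns the first observation and action slice).
    o_i = x[:, :obs_dim]
    a_pred = actor(o_i)                                # a_i = mu_i(o_i)

    # Replace agent i's slice of the joint action with the differentiable prediction.
    a_joint = a.clone()
    a_joint[:, :act_dim] = a_pred

    # Deterministic policy gradient: maximize Q, i.e. minimize -Q.
    policy_loss = -critic(torch.cat([x, a_joint], dim=1)).mean()
    actor_optimizer.zero_grad()
    policy_loss.backward()
    actor_optimizer.step()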
Step S53, updating the parameters of the evaluation target network w' and the policy target network θ':
w′←τw+(1-τ)w′;
θ′←τθ+(1-τ)θ′;
where τ is an update coefficient that generally takes a small value, for example 0.1 or 0.01; in the present invention the value is 0.01.
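Continuing the same sketch, the soft updates w' ← τw + (1-τ)w' and θ' ← τθ + (1-τ)θ' of step S53 can be written as Polyak averaging of the parameters:

    def soft_update(target_net, source_net, tau=0.01):
        """Polyak averaging: target <- tau * source + (1 - tau) * target."""
        with torch.no_grad():
            for tgt, src in zip(target_net.parameters(), source_net.parameters()):
                tgt.mul_(1.0 - tau).add_(tau * src)

    soft_update(target_critic, critic, tau=0.01)   # evaluation target network w'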
Step S6, determining whether the state corresponding to the next-state information in step S3 is a terminal state; if so, stopping the iteration, otherwise executing steps S3-S5 in a loop.
According to the invention, curriculum learning and reinforcement learning are combined, and the sampling process is constrained by the curriculum criterion: the priority criterion function in the curriculum criterion reflects the sampling priority weight of each sample, the repetition penalty accounts for the influence of repeated sampling on sample diversity, and the redundant-information penalty appropriately reduces the redundancy of the information exchanged between agents, so the convergence efficiency and the final reward of the algorithm can be improved. The effect of the invention is discussed below through simulation experiments.
The multi-agent deep deterministic policy gradient method based on curriculum learning is abbreviated CL-MADDPG. In the simulation experiments it is compared with three other algorithms: MADDPG, PER-MADDPG and Greedy-MADDPG. First, experiments are carried out in a simple multi-agent environment with the number of agents set to 2, 3, 4, 5 and 10. The average rewards of CL-MADDPG and the three other algorithms in this simple multi-agent environment are shown in FIG. 2; as can be seen from FIG. 2, the average reward of CL-MADDPG is superior to that of the three other algorithms in all five test settings.
FIG. 3 shows the average reward iteration curves of the four algorithms under the five multi-agent settings. It can be seen that over 10000 generations of training CL-MADDPG always converges at the fastest speed and obtains the greatest return, i.e. the method of the present invention has the highest convergence efficiency and the best overall performance. Finally, the four algorithms are trained in the Tag multi-agent environment (containing good agents and adversary agents); the training results, shown in FIG. 4 and FIG. 5, further verify the superior performance of the CL-MADDPG algorithm of the invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A multi-agent deep deterministic policy gradient method based on curriculum learning, characterized by comprising the following steps:
Step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
Step S2, initializing the parameters and setting the number of iterations;
Step S3, the agents act in the multi-agent particle environment according to their policy networks, and the information generated by their actions is stored in the experience replay pool, this information including the next-state information;
Step S4, constructing the curriculum criterion from the information in the experience replay pool, calculating the curriculum criterion complexity, and sampling data from the experience replay pool according to priority weights determined by the curriculum criterion complexity;
Step S5, training each agent with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network, and after the training of the agents storing the errors and the repeated-sampling counts in the experience replay pool for updating;
Step S6, determining whether the state corresponding to the next-state information in step S3 is a terminal state; if so, stopping the iteration, otherwise executing steps S3-S5 in a loop.
2. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the constraints of step S1 include:
the observations of a single agent are independent of the observations of other agents;
the environment is unknown, and the agent cannot predict the reward and the post-action state;
the communication method between the agents is ignored.
3. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein initializing the parameters in step S2 comprises: initializing the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate, and the maximum number of steps of an agent in a cycle.
4. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the information generated by the multi-agent actions in step S3 includes: state information, action information, reward information, and next-state information.
5. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the curriculum criterion in step S4 includes a priority criterion function, a repetition penalty function and a redundant-information penalty function, and the curriculum criterion complexity CI(x_i) is computed by combining the priority criterion function SP(δ_i, λ), the repetition penalty function RP(cn_i) and the redundant-information penalty function RIP(N), weighted by η and a second weighting coefficient.
6. The curriculum-learning-based multi-agent deep deterministic policy gradient method of claim 5, wherein the priority criterion function SP(δ_i, λ) maps into [0, 1], where δ is the TD error and λ is the curriculum factor.
7. The curriculum-learning-based multi-agent deep deterministic policy gradient method of claim 5, wherein the repetition penalty function RP(cn_i) is a function of cn, where cn is the incremental repetition vector maintained in the experience replay pool to record the number of times each sample has been drawn.
8. The curriculum-learning-based multi-agent deep deterministic policy gradient method of claim 5, wherein the redundant-information penalty function RIP(N) is a function of N, where N is the number of agents.
9. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the specific steps of training each agent in step S5 with a deep deterministic policy gradient method based on the Adam optimizer and updating the policy network, policy target network, evaluation network and evaluation target network are:
Step S51, calculating the weights of the loss function and updating the evaluation network by minimizing the loss function with the Adam optimizer;
Step S52, updating the policy network using the policy gradient;
Step S53, updating the parameters of the evaluation target network and the policy target network.
CN202110798780.9A 2021-07-15 2021-07-15 Multi-agent depth certainty strategy gradient method based on course learning Pending CN113449458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110798780.9A CN113449458A (en) 2021-07-15 2021-07-15 Multi-agent depth certainty strategy gradient method based on course learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110798780.9A CN113449458A (en) 2021-07-15 2021-07-15 Multi-agent depth certainty strategy gradient method based on course learning

Publications (1)

Publication Number Publication Date
CN113449458A true CN113449458A (en) 2021-09-28

Family

ID=77816337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110798780.9A Pending CN113449458A (en) 2021-07-15 2021-07-15 Multi-agent depth certainty strategy gradient method based on course learning

Country Status (1)

Country Link
CN (1) CN113449458A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114285751A (en) * 2021-12-07 2022-04-05 中国科学院计算技术研究所 Traffic engineering method and system
CN114598667A (en) * 2022-03-04 2022-06-07 重庆邮电大学 Efficient equipment selection and resource allocation method based on federal learning
CN115542915A (en) * 2022-10-08 2022-12-30 中国矿业大学 Automatic driving reinforcement learning method based on approximate safety action
CN115542915B (en) * 2022-10-08 2023-10-31 中国矿业大学 Automatic driving reinforcement learning method based on approximate safety action
CN116151635A (en) * 2023-04-19 2023-05-23 深圳市迪博企业风险管理技术有限公司 Optimization method and device for decision-making of anti-risk enterprises based on multidimensional relation graph
CN116151635B (en) * 2023-04-19 2024-03-08 深圳市迪博企业风险管理技术有限公司 Optimization method and device for decision-making of anti-risk enterprises based on multidimensional relation graph
CN116610037B (en) * 2023-07-17 2023-09-29 中国海洋大学 Comprehensive optimization control method for air quantity of ocean platform ventilation system
CN116610037A (en) * 2023-07-17 2023-08-18 中国海洋大学 Comprehensive optimization control method for air quantity of ocean platform ventilation system
CN116680201B (en) * 2023-07-31 2023-10-17 南京争锋信息科技有限公司 System pressure testing method based on machine learning
CN116680201A (en) * 2023-07-31 2023-09-01 南京争锋信息科技有限公司 System pressure testing method based on machine learning
CN116739077A (en) * 2023-08-16 2023-09-12 西南交通大学 Multi-agent deep reinforcement learning method and device based on course learning
CN116739323A (en) * 2023-08-16 2023-09-12 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling
CN116739077B (en) * 2023-08-16 2023-10-31 西南交通大学 Multi-agent deep reinforcement learning method and device based on course learning
CN116739323B (en) * 2023-08-16 2023-11-10 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling
CN117826867A (en) * 2024-03-04 2024-04-05 之江实验室 Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium
CN117826867B (en) * 2024-03-04 2024-06-11 之江实验室 Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium

Similar Documents

Publication Publication Date Title
CN113449458A (en) Multi-agent depth certainty strategy gradient method based on course learning
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN111488988B (en) Control strategy simulation learning method and device based on counterstudy
CN112325897B (en) Path planning method based on heuristic deep reinforcement learning
CN112685165B (en) Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy
CN111582469A (en) Multi-agent cooperation information processing method and system, storage medium and intelligent terminal
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN114625151A (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
CN112269382B (en) Robot multi-target path planning method
CN110442129A (en) A kind of control method and system that multiple agent is formed into columns
CN113341972A (en) Robot path optimization planning method based on deep reinforcement learning
CN109925718A (en) A kind of system and method for distributing the micro- end map of game
US20230311003A1 (en) Decision model training method and apparatus, device, storage medium, and program product
CN118365099B (en) Multi-AGV scheduling method, device, equipment and storage medium
CN112613608A (en) Reinforced learning method and related device
CN117289691A (en) Training method for path planning agent for reinforcement learning in navigation scene
CN117474077B (en) Auxiliary decision making method and device based on OAR model and reinforcement learning
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN116307331B (en) Aircraft trajectory planning method
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
Peng et al. Hybrid learning for multi-agent cooperation with sub-optimal demonstrations
CN115933712A (en) Bionic fish leader-follower formation control method based on deep reinforcement learning
CN115936058A (en) Multi-agent migration reinforcement learning method based on graph attention network
Li et al. A self-learning Monte Carlo tree search algorithm for robot path planning
CN114911157A (en) Robot navigation control method and system based on partial observable reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210928)