CN113449458A - Multi-agent deep deterministic policy gradient method based on curriculum learning - Google Patents
Multi-agent deep deterministic policy gradient method based on curriculum learning
- Publication number
- CN113449458A CN113449458A CN202110798780.9A CN202110798780A CN113449458A CN 113449458 A CN113449458 A CN 113449458A CN 202110798780 A CN202110798780 A CN 202110798780A CN 113449458 A CN113449458 A CN 113449458A
- Authority
- CN
- China
- Prior art keywords
- agent
- strategy
- network
- learning
- gradient method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/25—Design optimisation, verification or simulation using particle-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/04—Constraint-based CAD
Abstract
The invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning, which combines curriculum learning with reinforcement learning. When sampling from the experience replay pool, data are drawn according to priority weights determined by the curriculum criterion complexity. Each agent is then trained with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network. When the next state produced by the multi-agent actions in the environment is not a termination state, the curriculum criterion is updated and the iteration is repeated with a more complex curriculum. The priority criterion function contained in the curriculum criterion reflects the sampling priority weight of each sample, the repeated-sampling penalty accounts for the influence of repeated sampling on sample diversity, and the redundant information penalty reduces the information redundancy of interaction between agents. Compared with other algorithms, the method improves the convergence efficiency and final reward of the algorithm.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a multi-agent deep deterministic policy gradient method based on curriculum learning.
Background
Reinforcement learning describes and solves the problem of an agent learning a strategy that maximizes reward or achieves a specific goal while interacting with its environment. In recent years it has been applied to many challenging problems, such as single agents in games and robotics, and it has been applied successfully in the single-agent field. However, practical scenarios are mostly multi-agent. For example, in unmanned-ship path planning, when an unmanned ship navigates autonomously on the water surface, multi-agent reinforcement learning can recommend an optimal path that avoids static obstacles and other moving ships and ensures smooth traffic. In taxi dispatching, multi-agent deep reinforcement learning can analyze the geographic distribution of urban population, taxis and pedestrian volume, and set targets and paths for taxis at different locations so that traffic resources are used to the greatest extent. In cooperative formation of multiple unmanned ships, a multi-agent reinforcement learning algorithm can adaptively and cooperatively cope with emergencies and interference in various driving environments. In such multi-agent settings, because environment information and multi-agent state information grow exponentially, traditional reinforcement learning algorithms suffer from instability and difficult convergence, so the reinforcement learning algorithms for multi-agent systems need improvement.
Curriculum learning is a machine learning paradigm: learning generally starts from simple courses and then proceeds to more complex ones, the simple courses laying the foundation for later complex courses. It ultimately improves the final asymptotic performance on the target task or reduces the computation time, thereby improving the effect of transfer learning.
The invention patent with publication number CN110852448A discloses a cooperative agent learning method based on multi-agent reinforcement learning, but it only discloses how multiple agents use global feature information through cooperative relationships in the same environment so that different agents share model parameters and the model complexity is simplified; it does not disclose using curriculum learning to solve the problem of difficult convergence. The master's thesis "Research and Application of Reinforcement Learning in Multi-Agent Cooperation" of the University of Electronic Science and Technology of China proposes an attention-based multi-agent reinforcement learning method suited to global observability and a graph-network-based multi-agent reinforcement learning method suited to partially observable environments, and extends them correspondingly to curriculum learning.
Disclosure of Invention
Therefore, the invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning, which solves the problems of poor stability and difficult convergence of reinforcement learning algorithms in the multi-agent field.
The technical scheme of the invention is realized as follows:
A multi-agent deep deterministic policy gradient method based on curriculum learning comprises the following steps:
step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
step S2, initializing the parameters and setting the number of iterations;
step S3, the agents act according to their policy networks in the multi-agent particle environment, and the information generated by the actions, including the next-state information, is stored in the experience replay pool;
step S4, constructing the curriculum criterion from the information in the experience replay pool, calculating the curriculum criterion complexity, and having the agents sample data from the experience replay pool according to the priority weights given by the curriculum criterion complexity;
step S5, training each agent with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network, and after training storing the errors and repeated-sampling counts into the experience replay pool for updating;
step S6, determining whether the state corresponding to the next-state information of step S3 is the termination state; if so, stopping the iteration, otherwise executing steps S3-S5 in a loop.
Preferably, the constraint conditions in step S1 include:
the observations of a single agent are independent of the observations of other agents;
the environment is unknown, and the agent cannot predict the reward and the post-action state;
the communication method between the agents is ignored.
Preferably, initializing the parameters in step S2 comprises: initializing the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate, and the maximum number of steps of an agent in a cycle.
Preferably, the information generated by the multi-agent action in step S3 includes: status information, action information, reward information, and next status information.
Preferably, the curriculum criterion in step S4 includes a priority criterion function, a repetition penalty function, and a redundant information penalty function, and the curriculum criterion complexity is calculated as:

CI(x_i) = η·SP(δ_i, λ) + φ·RP(cn_i) + ψ·RIP(N)

where CI(x_i) is the curriculum criterion complexity, SP(δ_i, λ) is the priority criterion function, RP(cn_i) is the repetition penalty function, RIP(N) is the redundant information penalty function, and η, φ, ψ are weighting factors.
Preferably, the expression of the priority criterion function is:
where δ is the TD error, λ is the curriculum factor, and SP(δ_i, λ) → [0, 1].
Preferably, the expression of the repetition penalty function is:
where cn is an incremental repetition vector in the experience replay pool that records the number of times each sample has been drawn.
Preferably, the expression of the redundant information penalty function is:
wherein N is the number of agents.
Preferably, the specific steps in step S5 of training each agent with the deep deterministic policy gradient method based on the Adam optimizer and updating the policy network, policy target network, evaluation network and evaluation target network are:
step S51, calculating the weights of the loss function, and updating the evaluation network by minimizing the loss function with the Adam optimizer;
step S52, updating the policy network using the policy gradient;
step S53, updating the parameters of the evaluation target network and the policy target network.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning. A multi-agent particle environment is constructed, the agents act in it according to preset policy networks, and the generated information is stored in an experience replay pool. A curriculum criterion is constructed from the information in the pool and its complexity is calculated; the agents sample data from the pool according to the priority weights given by the curriculum criterion complexity. Each agent is trained with a deep deterministic policy gradient method based on the Adam optimizer to update the policy network, policy target network, evaluation network and evaluation target network, and the iteration stops when the next state produced by the agents' actions in the multi-agent particle environment is a termination state. Combining curriculum learning into reinforcement learning in this way may improve convergence and stability.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only preferred embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of the multi-agent deep deterministic policy gradient method based on curriculum learning of the present invention;
FIG. 2 is a comparison of four algorithms for different numbers of agents in a simple multi-agent environment;
FIG. 3 is a graph of the average reward iteration for four algorithms with different numbers of agents in a simple multi-agent environment;
FIG. 4 is a comparison of four algorithms in a Tag multi-agent environment;
FIG. 5 is a graph of the average reward iteration curves for the four algorithms in a Tag multi-agent environment.
Detailed Description
For a better understanding of the technical content of the present invention, a specific embodiment is provided below, and the present invention is further described with reference to the accompanying drawings.
Referring to FIG. 1, the multi-agent deep deterministic policy gradient method based on curriculum learning provided by the invention comprises the following steps:
step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
the invention adopts Openai to construct a multi-agent particle environment, in the environment, the actions of the agents are continuous and discrete, the task of each agent is to find and go to the coordinate of the agent in a specified step number, and obtain corresponding reward according to the distance between the last step of the agent and the coordinate of the agent, and the total reward of the system is the sum of the rewards of all agents.
The behavior policy taken by the agents is defined as the set π = {π_1, π_2, …, π_N}, each represented by a neural network. Likewise, A = {a_1, a_2, …, a_N} is defined as the set of agent actions, S = {S_1, S_2, …, S_N} is the set of agent states, and the parameter set of all agents is θ = {θ_1, θ_2, …, θ_N}. Assuming the deterministic policy adopted by each agent is μ, the action at each step is given by a_t = μ(S_t). The reward obtained after executing a policy is determined by a Q function, and the algorithm operates under the following constraint conditions:
the learned policies can only be executed using local information, i.e. the observations of a single agent are independent of the observations of other agents;
a differentiable dynamic model of the environment need not be known: the environment is unknown, the agent cannot predict the reward or the post-action state, and its behavior depends only on its policy;
the communication method between agents is ignored, and no distinguishable communication channels between agents are assumed. When these conditions are satisfied, the versatility of the algorithm is greatly improved, and it is suitable for competitive and cooperative games, including those with a defined communication mode.
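The per-agent deterministic action selection a_t = μ(S_t) introduced above can be sketched as a tiny feed-forward policy. The two-layer tanh network, its sizes, and the random weights are illustrative assumptions, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

class DeterministicPolicy:
    """Minimal per-agent deterministic policy mu: state -> continuous action (a_t = mu(S_t))."""
    def __init__(self, state_dim, action_dim, hidden=64):
        # Small random weights stand in for a trained network
        self.w1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, action_dim))

    def act(self, state):
        h = np.tanh(state @ self.w1)   # hidden layer
        return np.tanh(h @ self.w2)    # action bounded in [-1, 1]

# One policy per agent, as in the set pi = {pi_1, ..., pi_N}
policies = [DeterministicPolicy(state_dim=4, action_dim=2) for _ in range(3)]
states = rng.normal(size=(3, 4))
actions = np.stack([p.act(s) for p, s in zip(policies, states)])
```

Because each policy consumes only its own agent's state, the sketch also respects the first constraint above: one agent's action never reads another agent's observation.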
Step S2, initialize the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate, the maximum number of steps of an agent in a cycle, and the number of iterations, as shown in Table 1.
Table 1. Initialization of the parameters:

Parameter | Value
---|---
Learning rate | 0.01
τ update coefficient | 0.01
γ attenuation factor | 0.95
Experience replay pool size | 25600
Sample batch size | 1024
Maximum number of steps of agent in a cycle | 20
Number of iterations | 10000
Step S3, the agents act according to their policy networks in the multi-agent particle environment, and the information generated by the actions, comprising state information, action information, reward information and next-state information, is stored in the experience replay pool;
In the two-dimensional simulation environment, the agents take a random first action. Each agent interacts with the environment and obtains a reward according to the set rule, after which the environment transitions to the next state. After several steps executed in sequence, a cycle is completed; this process conforms to a Markov decision model. A certain number of cycles are executed, and the state information, action information, reward information and next-state information obtained in them are stored in the experience replay pool for subsequent sampling.
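The storage of per-step transitions described above can be sketched as a minimal fixed-capacity replay pool; the class name and tuple layout are illustrative, though the real capacity of 25600 appears in Table 1 (a small capacity is used here to show the eviction behavior):

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool storing (state, action, reward, next_state) tuples."""
    def __init__(self, capacity=25600):
        # deque(maxlen=...) silently evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling; the curriculum criterion of step S4 replaces this
        # with priority-weighted sampling
        return random.sample(self.buffer, batch_size)

pool = ReplayPool(capacity=8)
for t in range(10):          # transitions 0 and 1 are evicted at capacity
    pool.store(t, 0, 1.0, t + 1)
batch = pool.sample(4)
```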
Step S4, construct the curriculum criterion from the information in the experience replay pool, calculate the curriculum criterion complexity, and have the agents sample data from the experience replay pool according to the priority weights given by the curriculum criterion complexity. The curriculum criterion comprises a priority criterion function, a repetition penalty function and a redundant information penalty function, and the curriculum criterion complexity is calculated as:

CI(x_i) = η·SP(δ_i, λ) + φ·RP(cn_i) + ψ·RIP(N)

where CI(x_i) is the curriculum criterion complexity, SP(δ_i, λ) is the priority criterion function, RP(cn_i) is the repetition penalty function, RIP(N) is the redundant information penalty function, and η, φ, ψ are weighting factors.
Two key issues in the invention are how to evaluate the complexity of each sample and how to design a well-defined criterion. The environment of the agent in reinforcement learning is unknown, so no prior knowledge is available for evaluating samples; moreover, an artificially fixed complexity standard may not suit an agent that is constantly changing, since a concept that is difficult at the beginning of learning is easily understood by the later, experienced agent. The agent should therefore select samples from experience replay by itself, which is known as "self-paced learning". To reduce the risk of selecting unreliable data, the agent should focus on choosing appropriate samples in each value iteration: samples that are too simple do not help improve the current learning ability, while others may be too hard for the agent to understand. The invention therefore defines a self-paced priority criterion function SP(δ_i, λ) to select suitable samples from the experience replay pool, dynamically reflecting the information gradually obtained during training. The expression of the priority criterion function is:
where δ is the TD error and λ is the curriculum factor representing the model age, taken as 0.6 in the invention, with SP(δ_i, λ) → [0, 1]. If |δ| > 2λ, SP(δ_i, λ) is a monotonically decreasing function of |δ|; if |δ| < λ, it is a monotonically increasing function; otherwise SP(δ_i, λ) takes its global maximum.
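The patent's formula image for SP(δ_i, λ) did not survive extraction, so the following is only one plausible piecewise shape consistent with the surrounding description — increasing for small |δ|, maximal in a middle band near λ, decreasing for large |δ|, mapped into [0, 1]. The exact functional form is an assumption:

```python
import numpy as np

def sp(delta, lam=0.6):
    """Self-paced priority SP(delta, lam) -> [0, 1] (assumed piecewise shape).

    Increasing while |delta| < lam, at its global maximum for
    lam <= |delta| <= 2*lam, decreasing beyond 2*lam.
    """
    d = abs(delta)
    if d < lam:
        return d / lam                 # monotonically increasing toward 1
    if d <= 2 * lam:
        return 1.0                     # global-maximum band around lam
    return float(np.exp(-(d - 2 * lam)))  # monotonically decreasing tail

# Moderate TD errors get the highest priority; tiny and huge errors are demoted
priorities = [sp(d) for d in (0.1, 0.6, 1.0, 3.0)]
```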
In experience replay, unacceptable over-training can occur when some unnecessary samples are reused too many times, and the limited replay memory obviously makes sample training worse. This reveals a traditional problem in reinforcement learning, the exploration-exploitation trade-off: fully exploring the state-action space is important to prevent the algorithm from falling into local minima, while exploiting the current strategy helps the algorithm converge as soon as possible. If the agent is limited to some specific samples, it will not obtain the optimal strategy. To solve this problem, we add a repetition vector {cn_1, cn_2, …, cn_N} to experience replay to record the number of times each sample is drawn. In each iteration the agent learns by sampling to update the model parameters; each time a sample is drawn, the corresponding repeated-sampling penalty is updated accordingly, so the more often a sample has been drawn, the smaller its probability of being selected in the next iteration. A repetition penalty function RP(cn) is defined to complete this process; its expression is:
where cn is an incremental repetition vector in the experience replay pool that records the number of times each sample has been drawn.
When the number of agents is large, the amount of information exchanged among them grows exponentially; the state and action information each agent must extract from the other agents becomes redundant, and training becomes difficult. The invention therefore introduces a redundant information penalty function, whose expression is:
wherein N is the number of agents.
The training of a single agent does not use the information of all other agents; instead it randomly extracts a part of it, with the amount of extracted information constrained by the redundant information penalty. First, an amount of information is randomly selected in the range [1, N]; the new critic function is then:
After the priority criterion function, the repetition penalty function and the redundant information penalty function are obtained, the curriculum criterion complexity can be calculated. Samples are drawn with probability proportional to the curriculum criterion complexity to form a mini-batch, the curriculum criterion complexity is then updated, and finally the curriculum factor is increased, λ ← λ + μ, so that more difficult samples are drawn in the next value iteration.
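The combination of the three functions into the curriculum criterion complexity and the complexity-proportional mini-batch sampling can be sketched as below. The additive weighting with factors eta, phi, psi and all numeric values are assumptions for illustration (the patent's formula image is missing):

```python
import numpy as np

rng = np.random.default_rng(1)

def curriculum_complexity(sp_vals, rp_vals, rip_val, eta=1.0, phi=1.0, psi=1.0):
    """Assumed additive form CI(x_i) = eta*SP + phi*RP + psi*RIP;
    eta, phi, psi stand in for the patent's (unrendered) weighting factors."""
    return eta * np.asarray(sp_vals) + phi * np.asarray(rp_vals) + psi * rip_val

# Per-sample priority and repetition-penalty values (illustrative numbers),
# plus one redundancy penalty shared by the batch
ci = curriculum_complexity(sp_vals=[0.9, 0.5, 0.1],
                           rp_vals=[0.8, 1.0, 1.0],
                           rip_val=0.2)

# Mini-batch drawn with probability proportional to curriculum complexity
probs = ci / ci.sum()
batch_idx = rng.choice(len(ci), size=2, replace=False, p=probs)

# Finally grow the curriculum factor so harder samples qualify next iteration
lam, mu = 0.6, 0.05
lam += mu
```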
Step S5, train each agent with the deep deterministic policy gradient method based on the Adam optimizer, update the policy network, policy target network, evaluation network and evaluation target network, and after training store the errors and repeated-sampling counts into the experience replay pool for updating. The specific steps are as follows:
Step S51, calculate the loss-function weight ω_j = (N·P(j))^(−β) / max_i(ω_i) and the current target Q value y = r + γ·Q′(x′, a′), then update the evaluation network by minimizing the loss function L = (1/K)·Σ_j ω_j·(y_j − Q(x_j, a_j))² with the Adam optimizer, where β is the sampling weight coefficient, P is the non-uniform sampling probability, r is the current instantaneous reward, γ is the attenuation factor, Q′ is the reward of the next state, x is the state observation, a is the agent's action value, and K is the number of samples for the batch gradient descent.
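A sketch of the step-S51 quantities, under the standard prioritized-replay and TD-target forms the symbols suggest (ω_j = (N·P(j))^(−β) / max_i ω_i, y = r + γQ′, weighted MSE loss). The concrete probabilities, rewards and Q values are illustrative:

```python
import numpy as np

def importance_weights(probs, beta=0.4):
    """omega_j = (N * P(j))**(-beta), normalised by the largest weight."""
    n = len(probs)
    w = (n * np.asarray(probs)) ** (-beta)
    return w / w.max()

def td_targets(rewards, next_q, gamma=0.95):
    """Target Q value y = r + gamma * Q'(x', a') from the target critic."""
    return np.asarray(rewards) + gamma * np.asarray(next_q)

def weighted_critic_loss(q, targets, weights):
    """Weighted MSE over the K-sample batch, minimised by Adam in step S51."""
    return float(np.mean(weights * (np.asarray(q) - targets) ** 2))

probs = [0.5, 0.3, 0.2]                      # non-uniform sampling probabilities P(j)
w = importance_weights(probs)                # rarer samples get larger weights
y = td_targets(rewards=[1.0, 0.0, -1.0], next_q=[2.0, 1.0, 0.0])
loss = weighted_critic_loss(q=[2.5, 1.0, -0.5], targets=y, weights=w)
```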
Step S52, update the policy network using the sampled policy gradient ∇_θ J ≈ (1/K)·Σ_j ∇_θ μ(x_j)·∇_a Q(x_j, a)|_{a=μ(x_j)}, where ∇ denotes the gradient, θ is a parameter of the evaluation network, and μ is the policy.
Step S53, updating the evaluation target network w 'and the policy target network θ' parameters:
w′←τw+(1-τ)w′;
θ′←τθ+(1-τ)θ′;
τ is the update coefficient and generally takes a small value, for example 0.1 or 0.01; in the present invention it is 0.01.
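The soft target updates above, w′ ← τw + (1−τ)w′ and θ′ ← τθ + (1−τ)θ′, can be sketched element-wise; representing parameters as plain lists is an illustrative simplification:

```python
def soft_update(target, source, tau=0.01):
    """w' <- tau*w + (1 - tau)*w', applied element-wise to flat parameter lists.

    With tau = 0.01 the target network drifts only 1% of the way toward the
    online network per update, which stabilises the TD targets of step S51.
    """
    return [tau * w + (1.0 - tau) * wp for w, wp in zip(source, target)]

target_params = [0.0, 1.0]   # current target-network parameters w'
online_params = [1.0, 0.0]   # freshly trained online parameters w
updated = soft_update(target_params, online_params, tau=0.01)
```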
Step S6, determining whether the state corresponding to the next state information in step S3 is the end state, if so, stopping the iteration, otherwise, executing steps S3-S5 in a loop.
According to the invention, curriculum learning and reinforcement learning are combined, with the sampling process governed by the curriculum criterion. The priority criterion function in the curriculum criterion reflects the sampling priority weight of each sample, the repeated-sampling penalty accounts for the influence of repeated sampling on sample diversity, and the redundant information penalty appropriately reduces the redundancy of information exchanged between agents, so the convergence efficiency and final reward of the algorithm can be improved. The effect of the invention is examined through simulation experiments.
The multi-agent deep deterministic policy gradient method based on curriculum learning is abbreviated CL-MADDPG, and it is compared in simulation experiments with three other algorithms: MADDPG, PER-MADDPG and Greedy-MADDPG. First, experiments are carried out in a simple multi-agent environment with the number of agents set to 2, 3, 4, 5 and 10. In this environment, the average rewards of CL-MADDPG and the other three algorithms are shown in FIG. 2, from which it can be seen that the average reward of CL-MADDPG is superior to that of the other three algorithms in all five test environments.
FIG. 3 shows the average-reward iteration curves of the four algorithms for the five agent counts. Over 10000 generations of training, CL-MADDPG always converges fastest and obtains the greatest return, i.e. the method of the invention has the highest convergence efficiency and the best overall performance. Finally, the four algorithms are trained in the Tag multi-agent environment (containing good agents and adversary agents); the training results, shown in FIGS. 4 and 5, further verify the superior performance of the CL-MADDPG algorithm of the invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (9)
1. A multi-agent deep deterministic policy gradient method based on curriculum learning, characterized by comprising the following steps:
step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
step S2, initializing the parameters and setting the number of iterations;
step S3, the agents act according to their policy networks in the multi-agent particle environment, and the information generated by the actions, including the next-state information, is stored in the experience replay pool;
step S4, constructing the curriculum criterion from the information in the experience replay pool, calculating the curriculum criterion complexity, and having the agents sample data from the experience replay pool according to the priority weights given by the curriculum criterion complexity;
step S5, training each agent with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network, and after training storing the errors and repeated-sampling counts into the experience replay pool for updating;
step S6, determining whether the state corresponding to the next-state information of step S3 is the termination state; if so, stopping the iteration, otherwise executing steps S3-S5 in a loop.
2. The curriculum learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the constraint conditions of step S1 include:
the observations of a single agent are independent of the observations of other agents;
the environment is unknown, and the agent cannot predict the reward and the post-action state;
the communication method between the agents is ignored.
3. The curriculum learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein initializing the parameters in step S2 comprises: initializing the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate, and the maximum number of steps of an agent in a cycle.
4. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the information generated by the agents' actions in step S3 includes: state information, action information, reward information, and next-state information.
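The per-step record of claim 4 maps naturally onto a four-field transition tuple; the field names are assumptions for the sketch:

```python
from collections import namedtuple

# One experience-replay record: state, action, reward, next state (claim 4).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

t = Transition(state=[0.0], action=[0.5], reward=[-0.5], next_state=[0.5])
```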
5. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the curriculum criteria in step S4 include a priority criterion function, a repetition penalty function and a redundant-information penalty function, and the curriculum criterion complexity is calculated by the following formula:
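The formula itself is not reproduced in this excerpt, so the sketch below is a purely hypothetical weighted combination of the three named terms (priority, repetition penalty, redundancy penalty), intended only to illustrate how a scalar complexity could drive priority-weighted sampling; every functional form and weight is an assumption.

```python
import numpy as np

def criterion_complexity(td_error, resample_count, redundancy, w=(1.0, 0.5, 0.5)):
    """Hypothetical curriculum criterion complexity (not the patent's formula)."""
    priority = abs(td_error)                 # assumed priority criterion: |TD error|
    rep_penalty = np.log1p(resample_count)   # assumed repetition penalty
    red_penalty = redundancy                 # assumed redundant-information penalty
    return w[0] * priority - w[1] * rep_penalty - w[2] * red_penalty

def sampling_weights(complexities):
    """Turn complexities into a valid sampling distribution via a softmax."""
    c = np.asarray(complexities, dtype=float)
    p = np.exp(c - c.max())
    return p / p.sum()
```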
9. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein training each agent with the deep deterministic policy gradient method based on the Adam optimizer in step S5, and updating the strategy network, strategy target network, evaluation network and evaluation target network, comprises:
step S51, calculating the weight of the loss function, and updating the evaluation network by minimizing the loss function with the Adam optimizer;
step S52, updating the strategy network by using the policy gradient;
step S53, updating the parameters of the evaluation target network and the strategy target network.
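Steps S51 to S53 can be sketched in numpy under strong simplifying assumptions: single-layer linear actor and critic, plain gradient steps in place of the Adam optimizer, an assumed agent dictionary holding "actor", "critic" and their target copies, and an assumed soft-update rate tau.

```python
import numpy as np

def update_agent(agent, s, a, y, lr=0.01, tau=0.01):
    """One DDPG-style update: critic (S51), actor (S52), targets (S53)."""
    sa = np.concatenate([s, a])
    # S51: minimize the squared TD loss (Q(s,a) - y)^2 with a gradient step.
    q = (agent["critic"] @ sa).item()
    agent["critic"] -= lr * 2.0 * (q - y) * sa
    # S52: deterministic policy gradient, ascending dQ/da * da/dtheta
    # for the linear actor a = actor @ s.
    obs_dim = len(s)
    dq_da = agent["critic"][0, obs_dim:]      # critic weights on the action inputs
    agent["actor"] += lr * np.outer(dq_da, s)
    # S53: soft (Polyak) update of both target networks.
    for net in ("actor", "critic"):
        agent[net + "_target"] = tau * agent[net] + (1 - tau) * agent[net + "_target"]
    return agent
```

The soft update keeps the target networks slowly tracking the online networks, which is the standard stabilization device in deterministic policy gradient methods.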
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110798780.9A CN113449458A (en) | 2021-07-15 | 2021-07-15 | Multi-agent depth certainty strategy gradient method based on course learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449458A true CN113449458A (en) | 2021-09-28 |
Family
ID=77816337
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110798780.9A Pending CN113449458A (en) | 2021-07-15 | 2021-07-15 | Multi-agent depth certainty strategy gradient method based on course learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449458A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114285751A (en) * | 2021-12-07 | 2022-04-05 | 中国科学院计算技术研究所 | Traffic engineering method and system |
CN114598667A (en) * | 2022-03-04 | 2022-06-07 | 重庆邮电大学 | Efficient equipment selection and resource allocation method based on federal learning |
CN115542915A (en) * | 2022-10-08 | 2022-12-30 | 中国矿业大学 | Automatic driving reinforcement learning method based on approximate safety action |
CN115542915B (en) * | 2022-10-08 | 2023-10-31 | 中国矿业大学 | Automatic driving reinforcement learning method based on approximate safety action |
CN116151635A (en) * | 2023-04-19 | 2023-05-23 | 深圳市迪博企业风险管理技术有限公司 | Optimization method and device for decision-making of anti-risk enterprises based on multidimensional relation graph |
CN116151635B (en) * | 2023-04-19 | 2024-03-08 | 深圳市迪博企业风险管理技术有限公司 | Optimization method and device for decision-making of anti-risk enterprises based on multidimensional relation graph |
CN116610037B (en) * | 2023-07-17 | 2023-09-29 | 中国海洋大学 | Comprehensive optimization control method for air quantity of ocean platform ventilation system |
CN116610037A (en) * | 2023-07-17 | 2023-08-18 | 中国海洋大学 | Comprehensive optimization control method for air quantity of ocean platform ventilation system |
CN116680201B (en) * | 2023-07-31 | 2023-10-17 | 南京争锋信息科技有限公司 | System pressure testing method based on machine learning |
CN116680201A (en) * | 2023-07-31 | 2023-09-01 | 南京争锋信息科技有限公司 | System pressure testing method based on machine learning |
CN116739077A (en) * | 2023-08-16 | 2023-09-12 | 西南交通大学 | Multi-agent deep reinforcement learning method and device based on course learning |
CN116739323A (en) * | 2023-08-16 | 2023-09-12 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
CN116739077B (en) * | 2023-08-16 | 2023-10-31 | 西南交通大学 | Multi-agent deep reinforcement learning method and device based on course learning |
CN116739323B (en) * | 2023-08-16 | 2023-11-10 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
CN117826867A (en) * | 2024-03-04 | 2024-04-05 | 之江实验室 | Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium |
CN117826867B (en) * | 2024-03-04 | 2024-06-11 | 之江实验室 | Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113449458A (en) | Multi-agent depth certainty strategy gradient method based on course learning | |
CN113110592B (en) | Unmanned aerial vehicle obstacle avoidance and path planning method | |
CN111488988B (en) | Control strategy simulation learning method and device based on counterstudy | |
CN112325897B (en) | Path planning method based on heuristic deep reinforcement learning | |
CN112685165B (en) | Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy | |
CN111582469A (en) | Multi-agent cooperation information processing method and system, storage medium and intelligent terminal | |
CN113688977B (en) | Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium | |
CN114625151A (en) | Underwater robot obstacle avoidance path planning method based on reinforcement learning | |
CN112269382B (en) | Robot multi-target path planning method | |
CN110442129A (en) | A kind of control method and system that multiple agent is formed into columns | |
CN113341972A (en) | Robot path optimization planning method based on deep reinforcement learning | |
CN109925718A (en) | A kind of system and method for distributing the micro- end map of game | |
US20230311003A1 (en) | Decision model training method and apparatus, device, storage medium, and program product | |
CN118365099B (en) | Multi-AGV scheduling method, device, equipment and storage medium | |
CN112613608A (en) | Reinforced learning method and related device | |
CN117289691A (en) | Training method for path planning agent for reinforcement learning in navigation scene | |
CN117474077B (en) | Auxiliary decision making method and device based on OAR model and reinforcement learning | |
CN113110101B (en) | Production line mobile robot gathering type recovery and warehousing simulation method and system | |
CN116307331B (en) | Aircraft trajectory planning method | |
CN116227622A (en) | Multi-agent landmark coverage method and system based on deep reinforcement learning | |
Peng et al. | Hybrid learning for multi-agent cooperation with sub-optimal demonstrations | |
CN115933712A (en) | Bionic fish leader-follower formation control method based on deep reinforcement learning | |
CN115936058A (en) | Multi-agent migration reinforcement learning method based on graph attention network | |
Li et al. | A self-learning Monte Carlo tree search algorithm for robot path planning | |
CN114911157A (en) | Robot navigation control method and system based on partial observable reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210928 |