CN113449458A - Multi-agent deep deterministic policy gradient method based on curriculum learning - Google Patents
Multi-agent deep deterministic policy gradient method based on curriculum learning
- Publication number
- CN113449458A CN113449458A CN202110798780.9A CN202110798780A CN113449458A CN 113449458 A CN113449458 A CN 113449458A CN 202110798780 A CN202110798780 A CN 202110798780A CN 113449458 A CN113449458 A CN 113449458A
- Authority
- CN
- China
- Prior art keywords
- agent
- strategy
- network
- learning
- gradient method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/25—Design optimisation, verification or simulation using particle-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/04—Constraint-based CAD
Abstract
The invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning, which combines curriculum learning with reinforcement learning. When sampling from the experience replay pool, data are drawn according to priority weights determined by the curriculum criterion complexity. Each agent is then trained with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network. When the next state produced by the multi-agent actions in the environment is not a termination state, the curriculum criterion is updated and the iteration is repeated with a more complex curriculum. The priority criterion function contained in the curriculum criterion reflects the sampling priority weight of each sample, the repeated-sampling penalty accounts for the influence of repeated sampling on sample diversity, and the redundant information penalty reduces the information redundancy of interaction between agents. Compared with other algorithms, the method improves the convergence efficiency and final reward of the algorithm.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a multi-agent deep deterministic policy gradient method based on curriculum learning.
Background
Reinforcement learning describes and solves the problem of an agent learning a strategy that maximizes reward or achieves a specific goal while interacting with its environment. In recent years it has been applied to many challenging problems, such as single agents in games and robotics, and it has been applied successfully in the single-agent field. However, practical scenarios are mostly multi-agent. For example, in unmanned-ship path planning, when an unmanned ship navigates autonomously on the water surface, multi-agent reinforcement learning can recommend an optimal path that avoids static obstacles and other moving ships and ensures smooth traffic. In taxi dispatching, multi-agent deep reinforcement learning can analyze the geographic distribution of urban population, taxis and pedestrian volume, and set targets and paths for taxis at different locations so that traffic resources are used to the greatest extent. In cooperative formation of multiple unmanned ships, a multi-agent reinforcement learning algorithm can adaptively and cooperatively cope with emergencies and interference in various driving environments. In such multi-agent settings, because environment information and multi-agent state information grow exponentially, traditional reinforcement learning algorithms suffer from instability and difficult convergence, so the reinforcement learning algorithms for multi-agent systems need improvement.
Curriculum learning is a machine learning paradigm: learning generally starts from simple courses and then proceeds to more complex ones, the simple courses laying the foundation for later complex courses. It ultimately improves the final asymptotic performance on the target task or reduces the computation time, thereby improving the effect of transfer learning.
The invention patent with publication number CN110852448A discloses a cooperative agent learning method based on multi-agent reinforcement learning, but it only discloses how multiple agents use global feature information through cooperative relationships in the same environment so that different agents share model parameters and the model complexity is simplified; it does not disclose using curriculum learning to solve the problem of difficult convergence. The master's thesis "Research and Application of Reinforcement Learning in Multi-Agent Cooperation" of the University of Electronic Science and Technology of China proposes an attention-based multi-agent reinforcement learning method suited to global observability and a graph-network-based multi-agent reinforcement learning method suited to partially observable environments, and extends them correspondingly to curriculum learning.
Disclosure of Invention
Therefore, the invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning, which solves the problems of poor stability and difficult convergence of reinforcement learning algorithms in the multi-agent field.
The technical scheme of the invention is realized as follows:
A multi-agent deep deterministic policy gradient method based on curriculum learning comprises the following steps:
step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
step S2, initializing the parameters and setting the number of iterations;
step S3, the agents act according to their policy networks in the multi-agent particle environment, and the information generated by the actions, including the next-state information, is stored in the experience replay pool;
step S4, constructing the curriculum criterion from the information in the experience replay pool, calculating the curriculum criterion complexity, and having the agents sample data from the experience replay pool according to the priority weights given by the curriculum criterion complexity;
step S5, training each agent with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network, and after training storing the errors and repeated-sampling counts into the experience replay pool for updating;
step S6, determining whether the state corresponding to the next-state information of step S3 is the termination state; if so, stopping the iteration, otherwise executing steps S3-S5 in a loop.
Preferably, the constraint conditions in step S1 include:
the observations of a single agent are independent of the observations of other agents;
the environment is unknown, and the agent cannot predict the reward and the post-action state;
the communication method between the agents is ignored.
Preferably, initializing the parameters in step S2 comprises: initializing the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate, and the maximum number of steps of an agent in a cycle.
Preferably, the information generated by the multi-agent action in step S3 includes: status information, action information, reward information, and next status information.
Preferably, the curriculum criterion in step S4 includes a priority criterion function, a repetition penalty function, and a redundant information penalty function, and the curriculum criterion complexity is calculated as:

CI(x_i) = η·SP(δ_i, λ) + φ·RP(cn_i) + ψ·RIP(N)

where CI(x_i) is the curriculum criterion complexity, SP(δ_i, λ) is the priority criterion function, RP(cn_i) is the repetition penalty function, RIP(N) is the redundant information penalty function, and η, φ, ψ are weighting factors.
Preferably, the expression of the priority criterion function is:
where δ is the TD error, λ is the curriculum factor, and SP(δ_i, λ) → [0, 1].
Preferably, the expression of the repetition penalty function is:
where cn is an incremental repetition vector in the experience replay pool that records the number of times each sample has been drawn.
Preferably, the expression of the redundant information penalty function is:
wherein N is the number of agents.
Preferably, the specific steps in step S5 of training each agent with the deep deterministic policy gradient method based on the Adam optimizer and updating the policy network, policy target network, evaluation network and evaluation target network are:
step S51, calculating the weights of the loss function, and updating the evaluation network by minimizing the loss function with the Adam optimizer;
step S52, updating the policy network using the policy gradient;
step S53, updating the parameters of the evaluation target network and the policy target network.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a multi-agent deep deterministic policy gradient method based on curriculum learning. A multi-agent particle environment is constructed, the agents act in it according to preset policy networks, and the generated information is stored in an experience replay pool. A curriculum criterion is constructed from the information in the pool and its complexity is calculated; the agents sample data from the pool according to the priority weights given by the curriculum criterion complexity. Each agent is trained with a deep deterministic policy gradient method based on the Adam optimizer to update the policy network, policy target network, evaluation network and evaluation target network, and the iteration stops when the next state produced by the agents' actions in the multi-agent particle environment is a termination state. Combining curriculum learning into reinforcement learning in this way may improve convergence and stability.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only preferred embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of the multi-agent deep deterministic policy gradient method based on curriculum learning of the present invention;
FIG. 2 is a comparison of four algorithms for different numbers of agents in a simple multi-agent environment;
FIG. 3 is a graph of the average reward iteration for four algorithms with different numbers of agents in a simple multi-agent environment;
FIG. 4 is a comparison of four algorithms in a Tag multi-agent environment;
FIG. 5 is a graph of the average reward iteration curves for the four algorithms in a Tag multi-agent environment.
Detailed Description
For a better understanding of the technical content of the present invention, a specific embodiment is provided below, and the present invention is further described with reference to the accompanying drawings.
Referring to FIG. 1, the multi-agent deep deterministic policy gradient method based on curriculum learning provided by the invention comprises the following steps:
step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
the invention adopts Openai to construct a multi-agent particle environment, in the environment, the actions of the agents are continuous and discrete, the task of each agent is to find and go to the coordinate of the agent in a specified step number, and obtain corresponding reward according to the distance between the last step of the agent and the coordinate of the agent, and the total reward of the system is the sum of the rewards of all agents.
The behavior policy taken by the agents is defined as the set π = {π_1, π_2, …, π_N}, each represented by a neural network. Likewise, A = {a_1, a_2, …, a_N} is defined as the set of agent actions, S = {S_1, S_2, …, S_N} is the set of agent states, and the parameter set of all agents is θ = {θ_1, θ_2, …, θ_N}. Assuming the deterministic policy adopted by each agent is μ, the action at each step is given by a_t = μ(S_t). The reward obtained after executing a policy is determined by a Q function, and the algorithm operates under the following constraint conditions:
the learned policies can only be executed using local information, i.e. the observations of a single agent are independent of the observations of other agents;
a differentiable dynamic model of the environment need not be known: the environment is unknown, the agent cannot predict the reward or the post-action state, and its behavior depends only on its policy;
the communication method between agents is ignored, and no distinguishable communication channels between agents are assumed. When these conditions are satisfied, the versatility of the algorithm is greatly improved, and it is suitable for competitive and cooperative games, including those with a defined communication mode.
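The per-agent deterministic action selection a_t = μ(S_t) introduced above can be sketched as a tiny feed-forward policy. The two-layer tanh network, its sizes, and the random weights are illustrative assumptions, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

class DeterministicPolicy:
    """Minimal per-agent deterministic policy mu: state -> continuous action (a_t = mu(S_t))."""
    def __init__(self, state_dim, action_dim, hidden=64):
        # Small random weights stand in for a trained network
        self.w1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, action_dim))

    def act(self, state):
        h = np.tanh(state @ self.w1)   # hidden layer
        return np.tanh(h @ self.w2)    # action bounded in [-1, 1]

# One policy per agent, as in the set pi = {pi_1, ..., pi_N}
policies = [DeterministicPolicy(state_dim=4, action_dim=2) for _ in range(3)]
states = rng.normal(size=(3, 4))
actions = np.stack([p.act(s) for p, s in zip(policies, states)])
```

Because each policy consumes only its own agent's state, the sketch also respects the first constraint above: one agent's action never reads another agent's observation.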
Step S2, initialize the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate, the maximum number of steps of an agent in a cycle, and the number of iterations, as shown in Table 1.
Table 1. Initialization of the parameters:

Parameter | Value
---|---
Learning rate | 0.01
τ update coefficient | 0.01
γ attenuation factor | 0.95
Experience replay pool size | 25600
Sample batch size | 1024
Maximum number of steps of agent in a cycle | 20
Number of iterations | 10000
Step S3, the agents act according to their policy networks in the multi-agent particle environment, and the information generated by the actions, comprising state information, action information, reward information and next-state information, is stored in the experience replay pool;
In the two-dimensional simulation environment, the agents take a random first action. Each agent interacts with the environment and obtains a reward according to the set rule, after which the environment transitions to the next state. After several steps executed in sequence, a cycle is completed; this process conforms to a Markov decision model. A certain number of cycles are executed, and the state information, action information, reward information and next-state information obtained in them are stored in the experience replay pool for subsequent sampling.
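The storage of per-step transitions described above can be sketched as a minimal fixed-capacity replay pool; the class name and tuple layout are illustrative, though the real capacity of 25600 appears in Table 1 (a small capacity is used here to show the eviction behavior):

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool storing (state, action, reward, next_state) tuples."""
    def __init__(self, capacity=25600):
        # deque(maxlen=...) silently evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling; the curriculum criterion of step S4 replaces this
        # with priority-weighted sampling
        return random.sample(self.buffer, batch_size)

pool = ReplayPool(capacity=8)
for t in range(10):          # transitions 0 and 1 are evicted at capacity
    pool.store(t, 0, 1.0, t + 1)
batch = pool.sample(4)
```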
Step S4, construct the curriculum criterion from the information in the experience replay pool, calculate the curriculum criterion complexity, and have the agents sample data from the experience replay pool according to the priority weights given by the curriculum criterion complexity. The curriculum criterion comprises a priority criterion function, a repetition penalty function and a redundant information penalty function, and the curriculum criterion complexity is calculated as:

CI(x_i) = η·SP(δ_i, λ) + φ·RP(cn_i) + ψ·RIP(N)

where CI(x_i) is the curriculum criterion complexity, SP(δ_i, λ) is the priority criterion function, RP(cn_i) is the repetition penalty function, RIP(N) is the redundant information penalty function, and η, φ, ψ are weighting factors.
Two key issues in the invention are how to evaluate the complexity of each sample and how to design a well-defined criterion. The environment of the agent in reinforcement learning is unknown, so no prior knowledge is available for evaluating samples; moreover, an artificially fixed complexity standard may not suit an agent that is constantly changing, since a concept that is difficult at the beginning of learning is easily understood by the later, experienced agent. The agent should therefore select samples from experience replay by itself, which is known as "self-paced learning". To reduce the risk of selecting unreliable data, the agent should focus on choosing appropriate samples in each value iteration: samples that are too simple do not help improve the current learning ability, while others may be too hard for the agent to understand. The invention therefore defines a self-paced priority criterion function SP(δ_i, λ) to select suitable samples from the experience replay pool, dynamically reflecting the information gradually obtained during training. The expression of the priority criterion function is:
where δ is the TD error and λ is the curriculum factor representing the model age, taken as 0.6 in the invention, with SP(δ_i, λ) → [0, 1]. If |δ| > 2λ, SP(δ_i, λ) is a monotonically decreasing function of |δ|; if |δ| < λ, it is a monotonically increasing function; otherwise SP(δ_i, λ) takes its global maximum.
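The patent's formula image for SP(δ_i, λ) did not survive extraction, so the following is only one plausible piecewise shape consistent with the surrounding description — increasing for small |δ|, maximal in a middle band near λ, decreasing for large |δ|, mapped into [0, 1]. The exact functional form is an assumption:

```python
import numpy as np

def sp(delta, lam=0.6):
    """Self-paced priority SP(delta, lam) -> [0, 1] (assumed piecewise shape).

    Increasing while |delta| < lam, at its global maximum for
    lam <= |delta| <= 2*lam, decreasing beyond 2*lam.
    """
    d = abs(delta)
    if d < lam:
        return d / lam                 # monotonically increasing toward 1
    if d <= 2 * lam:
        return 1.0                     # global-maximum band around lam
    return float(np.exp(-(d - 2 * lam)))  # monotonically decreasing tail

# Moderate TD errors get the highest priority; tiny and huge errors are demoted
priorities = [sp(d) for d in (0.1, 0.6, 1.0, 3.0)]
```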
In experience replay, unacceptable over-training can occur when some unnecessary samples are reused too many times, and the limited replay memory obviously makes sample training worse. This reveals a traditional problem in reinforcement learning, the exploration-exploitation trade-off: fully exploring the state-action space is important to prevent the algorithm from falling into local minima, while exploiting the current strategy helps the algorithm converge as soon as possible. If the agent is limited to some specific samples, it will not obtain the optimal strategy. To solve this problem, we add a repetition vector {cn_1, cn_2, …, cn_N} to experience replay to record the number of times each sample is drawn. In each iteration the agent learns by sampling to update the model parameters; each time a sample is drawn, the corresponding repeated-sampling penalty is updated accordingly, so the more often a sample has been drawn, the smaller its probability of being selected in the next iteration. A repetition penalty function RP(cn) is defined to complete this process; its expression is:
where cn is an incremental repetition vector in the experience replay pool that records the number of times each sample has been drawn.
When the number of agents is large, the amount of information exchanged among them grows exponentially; the state and action information each agent must extract from the other agents becomes redundant, and training becomes difficult. The invention therefore introduces a redundant information penalty function, whose expression is:
wherein N is the number of agents.
The training of a single agent does not use the information of all other agents; instead it randomly extracts a part of it, with the amount of extracted information constrained by the redundant information penalty. First, an amount of information is randomly selected in the range [1, N]; the new critic function is then:
After the priority criterion function, the repetition penalty function and the redundant information penalty function are obtained, the curriculum criterion complexity can be calculated. Samples are drawn with probability proportional to the curriculum criterion complexity to form a mini-batch, the curriculum criterion complexity is then updated, and finally the curriculum factor is increased, λ ← λ + μ, so that more difficult samples are drawn in the next value iteration.
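The combination of the three functions into the curriculum criterion complexity and the complexity-proportional mini-batch sampling can be sketched as below. The additive weighting with factors eta, phi, psi and all numeric values are assumptions for illustration (the patent's formula image is missing):

```python
import numpy as np

rng = np.random.default_rng(1)

def curriculum_complexity(sp_vals, rp_vals, rip_val, eta=1.0, phi=1.0, psi=1.0):
    """Assumed additive form CI(x_i) = eta*SP + phi*RP + psi*RIP;
    eta, phi, psi stand in for the patent's (unrendered) weighting factors."""
    return eta * np.asarray(sp_vals) + phi * np.asarray(rp_vals) + psi * rip_val

# Per-sample priority and repetition-penalty values (illustrative numbers),
# plus one redundancy penalty shared by the batch
ci = curriculum_complexity(sp_vals=[0.9, 0.5, 0.1],
                           rp_vals=[0.8, 1.0, 1.0],
                           rip_val=0.2)

# Mini-batch drawn with probability proportional to curriculum complexity
probs = ci / ci.sum()
batch_idx = rng.choice(len(ci), size=2, replace=False, p=probs)

# Finally grow the curriculum factor so harder samples qualify next iteration
lam, mu = 0.6, 0.05
lam += mu
```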
Step S5, train each agent with the deep deterministic policy gradient method based on the Adam optimizer, update the policy network, policy target network, evaluation network and evaluation target network, and after training store the errors and repeated-sampling counts into the experience replay pool for updating. The specific steps are as follows:
Step S51, calculate the loss-function weight ω_j = (N·P(j))^(−β) / max_i(ω_i) and the current target Q value y = r + γ·Q′(x′, a′), then update the evaluation network by minimizing the loss function L = (1/K)·Σ_j ω_j·(y_j − Q(x_j, a_j))² with the Adam optimizer, where β is the sampling weight coefficient, P is the non-uniform sampling probability, r is the current instantaneous reward, γ is the attenuation factor, Q′ is the reward of the next state, x is the state observation, a is the agent's action value, and K is the number of samples for the batch gradient descent.
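A sketch of the step-S51 quantities, under the standard prioritized-replay and TD-target forms the symbols suggest (ω_j = (N·P(j))^(−β) / max_i ω_i, y = r + γQ′, weighted MSE loss). The concrete probabilities, rewards and Q values are illustrative:

```python
import numpy as np

def importance_weights(probs, beta=0.4):
    """omega_j = (N * P(j))**(-beta), normalised by the largest weight."""
    n = len(probs)
    w = (n * np.asarray(probs)) ** (-beta)
    return w / w.max()

def td_targets(rewards, next_q, gamma=0.95):
    """Target Q value y = r + gamma * Q'(x', a') from the target critic."""
    return np.asarray(rewards) + gamma * np.asarray(next_q)

def weighted_critic_loss(q, targets, weights):
    """Weighted MSE over the K-sample batch, minimised by Adam in step S51."""
    return float(np.mean(weights * (np.asarray(q) - targets) ** 2))

probs = [0.5, 0.3, 0.2]                      # non-uniform sampling probabilities P(j)
w = importance_weights(probs)                # rarer samples get larger weights
y = td_targets(rewards=[1.0, 0.0, -1.0], next_q=[2.0, 1.0, 0.0])
loss = weighted_critic_loss(q=[2.5, 1.0, -0.5], targets=y, weights=w)
```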
Step S52, update the policy network using the sampled policy gradient ∇_θ J ≈ (1/K)·Σ_j ∇_θ μ(x_j)·∇_a Q(x_j, a)|_{a=μ(x_j)}, where ∇ denotes the gradient, θ is a parameter of the evaluation network, and μ is the policy.
Step S53, updating the evaluation target network w 'and the policy target network θ' parameters:
w′←τw+(1-τ)w′;
θ′←τθ+(1-τ)θ′;
τ is the update coefficient and generally takes a small value, for example 0.1 or 0.01; in the present invention it is 0.01.
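The soft target updates above, w′ ← τw + (1−τ)w′ and θ′ ← τθ + (1−τ)θ′, can be sketched element-wise; representing parameters as plain lists is an illustrative simplification:

```python
def soft_update(target, source, tau=0.01):
    """w' <- tau*w + (1 - tau)*w', applied element-wise to flat parameter lists.

    With tau = 0.01 the target network drifts only 1% of the way toward the
    online network per update, which stabilises the TD targets of step S51.
    """
    return [tau * w + (1.0 - tau) * wp for w, wp in zip(source, target)]

target_params = [0.0, 1.0]   # current target-network parameters w'
online_params = [1.0, 0.0]   # freshly trained online parameters w
updated = soft_update(target_params, online_params, tau=0.01)
```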
Step S6, determining whether the state corresponding to the next state information in step S3 is the end state, if so, stopping the iteration, otherwise, executing steps S3-S5 in a loop.
According to the invention, curriculum learning and reinforcement learning are combined, with the sampling process governed by the curriculum criterion. The priority criterion function in the curriculum criterion reflects the sampling priority weight of each sample, the repeated-sampling penalty accounts for the influence of repeated sampling on sample diversity, and the redundant information penalty appropriately reduces the redundancy of information exchanged between agents, so the convergence efficiency and final reward of the algorithm can be improved. The effect of the invention is examined through simulation experiments.
The multi-agent deep deterministic policy gradient method based on curriculum learning is abbreviated CL-MADDPG, and it is compared in simulation experiments with three other algorithms: MADDPG, PER-MADDPG and Greedy-MADDPG. First, experiments are carried out in a simple multi-agent environment with the number of agents set to 2, 3, 4, 5 and 10. In this environment, the average rewards of CL-MADDPG and the other three algorithms are shown in FIG. 2, from which it can be seen that the average reward of CL-MADDPG is superior to that of the other three algorithms in all five test environments.
FIG. 3 shows the average-reward iteration curves of the four algorithms for the five agent counts. Over 10000 generations of training, CL-MADDPG always converges fastest and obtains the greatest return, i.e. the method of the invention has the highest convergence efficiency and the best overall performance. Finally, the four algorithms are trained in the Tag multi-agent environment (containing good agents and adversary agents); the training results, shown in FIGS. 4 and 5, further verify the superior performance of the CL-MADDPG algorithm of the invention.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (9)
1. A multi-agent deep deterministic policy gradient method based on curriculum learning, characterized by comprising the following steps:
step S1, building a multi-agent particle environment and defining the constraint conditions of the multi-agent actions, wherein each agent comprises a policy network, a policy target network, an evaluation network and an evaluation target network;
step S2, initializing the parameters and setting the number of iterations;
step S3, the agents act according to their policy networks in the multi-agent particle environment, and the information generated by the actions, including the next-state information, is stored in the experience replay pool;
step S4, constructing the curriculum criterion from the information in the experience replay pool, calculating the curriculum criterion complexity, and having the agents sample data from the experience replay pool according to the priority weights given by the curriculum criterion complexity;
step S5, training each agent with a deep deterministic policy gradient method based on the Adam optimizer, updating the policy network, policy target network, evaluation network and evaluation target network, and after training storing the errors and repeated-sampling counts into the experience replay pool for updating;
step S6, determining whether the state corresponding to the next-state information of step S3 is the termination state; if so, stopping the iteration, otherwise executing steps S3-S5 in a loop.
2. The curriculum learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the constraint conditions of step S1 include:
the observations of a single agent are independent of the observations of other agents;
the environment is unknown, and the agent cannot predict the reward and the post-action state;
the communication method between the agents is ignored.
3. The curriculum learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein initializing the parameters in step S2 comprises: initializing the parameters of each agent's policy network and evaluation network, the sample batch size, the experience replay pool size, the learning rate, and the maximum number of steps of an agent in a cycle.
4. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the information generated by the agents' actions in step S3 includes: state information, action information, reward information, and next-state information.
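The per-step record of claim 4 maps naturally onto a four-field transition tuple; the field names are assumptions for the sketch:

```python
from collections import namedtuple

# One experience-replay record: state, action, reward, next state (claim 4).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

t = Transition(state=[0.0], action=[0.5], reward=[-0.5], next_state=[0.5])
```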
5. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein the curriculum criteria in step S4 include a priority criterion function, a repetition penalty function and a redundant-information penalty function, and the curriculum criterion complexity is calculated by the following formula:
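The formula itself is not reproduced in this excerpt, so the sketch below is a purely hypothetical weighted combination of the three named terms (priority, repetition penalty, redundancy penalty), intended only to illustrate how a scalar complexity could drive priority-weighted sampling; every functional form and weight is an assumption.

```python
import numpy as np

def criterion_complexity(td_error, resample_count, redundancy, w=(1.0, 0.5, 0.5)):
    """Hypothetical curriculum criterion complexity (not the patent's formula)."""
    priority = abs(td_error)                 # assumed priority criterion: |TD error|
    rep_penalty = np.log1p(resample_count)   # assumed repetition penalty
    red_penalty = redundancy                 # assumed redundant-information penalty
    return w[0] * priority - w[1] * rep_penalty - w[2] * red_penalty

def sampling_weights(complexities):
    """Turn complexities into a valid sampling distribution via a softmax."""
    c = np.asarray(complexities, dtype=float)
    p = np.exp(c - c.max())
    return p / p.sum()
```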
9. The curriculum-learning-based multi-agent deep deterministic policy gradient method as claimed in claim 1, wherein training each agent with the deep deterministic policy gradient method based on the Adam optimizer in step S5, and updating the strategy network, strategy target network, evaluation network and evaluation target network, comprises:
step S51, calculating the weight of the loss function, and updating the evaluation network by minimizing the loss function with the Adam optimizer;
step S52, updating the strategy network by using the policy gradient;
step S53, updating the parameters of the evaluation target network and the strategy target network.
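Steps S51 to S53 can be sketched in numpy under strong simplifying assumptions: single-layer linear actor and critic, plain gradient steps in place of the Adam optimizer, an assumed agent dictionary holding "actor", "critic" and their target copies, and an assumed soft-update rate tau.

```python
import numpy as np

def update_agent(agent, s, a, y, lr=0.01, tau=0.01):
    """One DDPG-style update: critic (S51), actor (S52), targets (S53)."""
    sa = np.concatenate([s, a])
    # S51: minimize the squared TD loss (Q(s,a) - y)^2 with a gradient step.
    q = (agent["critic"] @ sa).item()
    agent["critic"] -= lr * 2.0 * (q - y) * sa
    # S52: deterministic policy gradient, ascending dQ/da * da/dtheta
    # for the linear actor a = actor @ s.
    obs_dim = len(s)
    dq_da = agent["critic"][0, obs_dim:]      # critic weights on the action inputs
    agent["actor"] += lr * np.outer(dq_da, s)
    # S53: soft (Polyak) update of both target networks.
    for net in ("actor", "critic"):
        agent[net + "_target"] = tau * agent[net] + (1 - tau) * agent[net + "_target"]
    return agent
```

The soft update keeps the target networks slowly tracking the online networks, which is the standard stabilization device in deterministic policy gradient methods.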
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110798780.9A CN113449458A (en) | 2021-07-15 | 2021-07-15 | Multi-agent depth certainty strategy gradient method based on course learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113449458A true CN113449458A (en) | 2021-09-28 |
Family
ID=77816337
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110798780.9A Pending CN113449458A (en) | 2021-07-15 | 2021-07-15 | Multi-agent depth certainty strategy gradient method based on course learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449458A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114285751A (en) * | 2021-12-07 | 2022-04-05 | 中国科学院计算技术研究所 | Traffic engineering method and system |
CN114598667A (en) * | 2022-03-04 | 2022-06-07 | 重庆邮电大学 | Efficient equipment selection and resource allocation method based on federal learning |
CN115542915A (en) * | 2022-10-08 | 2022-12-30 | 中国矿业大学 | Automatic driving reinforcement learning method based on approximate safety action |
CN115542915B (en) * | 2022-10-08 | 2023-10-31 | 中国矿业大学 | Automatic driving reinforcement learning method based on approximate safety action |
CN116151635A (en) * | 2023-04-19 | 2023-05-23 | 深圳市迪博企业风险管理技术有限公司 | Optimization method and device for decision-making of anti-risk enterprises based on multidimensional relation graph |
CN116151635B (en) * | 2023-04-19 | 2024-03-08 | 深圳市迪博企业风险管理技术有限公司 | Optimization method and device for decision-making of anti-risk enterprises based on multidimensional relation graph |
CN116610037B (en) * | 2023-07-17 | 2023-09-29 | 中国海洋大学 | Comprehensive optimization control method for air quantity of ocean platform ventilation system |
CN116610037A (en) * | 2023-07-17 | 2023-08-18 | 中国海洋大学 | Comprehensive optimization control method for air quantity of ocean platform ventilation system |
CN116680201B (en) * | 2023-07-31 | 2023-10-17 | 南京争锋信息科技有限公司 | System pressure testing method based on machine learning |
CN116680201A (en) * | 2023-07-31 | 2023-09-01 | 南京争锋信息科技有限公司 | System pressure testing method based on machine learning |
CN116739077A (en) * | 2023-08-16 | 2023-09-12 | 西南交通大学 | Multi-agent deep reinforcement learning method and device based on course learning |
CN116739323A (en) * | 2023-08-16 | 2023-09-12 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
CN116739077B (en) * | 2023-08-16 | 2023-10-31 | 西南交通大学 | Multi-agent deep reinforcement learning method and device based on course learning |
CN116739323B (en) * | 2023-08-16 | 2023-11-10 | 北京航天晨信科技有限责任公司 | Intelligent evaluation method and system for emergency resource scheduling |
CN117826867A (en) * | 2024-03-04 | 2024-04-05 | 之江实验室 | Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium |
CN117826867B (en) * | 2024-03-04 | 2024-06-11 | 之江实验室 | Unmanned aerial vehicle cluster path planning method, unmanned aerial vehicle cluster path planning device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113449458A (en) | Multi-agent depth certainty strategy gradient method based on course learning | |
CN113110592B (en) | Unmanned aerial vehicle obstacle avoidance and path planning method | |
CN111488988B (en) | Control strategy simulation learning method and device based on counterstudy | |
CN112325897B (en) | Path planning method based on heuristic deep reinforcement learning | |
CN112685165B (en) | Multi-target cloud workflow scheduling method based on joint reinforcement learning strategy | |
CN111582469A (en) | Multi-agent cooperation information processing method and system, storage medium and intelligent terminal | |
CN113688977B (en) | Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium | |
CN114625151A (en) | Underwater robot obstacle avoidance path planning method based on reinforcement learning | |
CN112269382B (en) | Robot multi-target path planning method | |
CN110442129A (en) | A kind of control method and system that multiple agent is formed into columns | |
CN113341972A (en) | Robot path optimization planning method based on deep reinforcement learning | |
CN109925718A (en) | A kind of system and method for distributing the micro- end map of game | |
US20230311003A1 (en) | Decision model training method and apparatus, device, storage medium, and program product | |
CN118365099B (en) | Multi-AGV scheduling method, device, equipment and storage medium | |
CN112613608A (en) | Reinforced learning method and related device | |
CN117289691A (en) | Training method for path planning agent for reinforcement learning in navigation scene | |
CN117474077B (en) | Auxiliary decision making method and device based on OAR model and reinforcement learning | |
CN113110101B (en) | Production line mobile robot gathering type recovery and warehousing simulation method and system | |
CN116307331B (en) | Aircraft trajectory planning method | |
CN116227622A (en) | Multi-agent landmark coverage method and system based on deep reinforcement learning | |
Peng et al. | Hybrid learning for multi-agent cooperation with sub-optimal demonstrations | |
CN115933712A (en) | Bionic fish leader-follower formation control method based on deep reinforcement learning | |
CN115936058A (en) | Multi-agent migration reinforcement learning method based on graph attention network | |
Li et al. | A self-learning Monte Carlo tree search algorithm for robot path planning | |
CN114911157A (en) | Robot navigation control method and system based on partial observable reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210928 |