CN116596059A - Multi-agent reinforcement learning method based on priority experience sharing - Google Patents

Multi-agent reinforcement learning method based on priority experience sharing

Info

Publication number
CN116596059A
CN116596059A
Authority
CN
China
Prior art keywords
experience
priority
sample
agent
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310456116.5A
Other languages
Chinese (zh)
Inventor
郭鹏骏
赵岭忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202310456116.5A priority Critical patent/CN116596059A/en
Publication of CN116596059A publication Critical patent/CN116596059A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-agent reinforcement learning method based on priority experience sharing, which comprises the following steps: step 1, experience collection; step 2, priority calculation; and step 3, policy updating. The method uses a priority mechanism to evaluate the experience data of different agents and raises the sampling probability of the experience data that contributes most to the agents' policy learning, thereby addressing the poor learning performance caused by low-quality training samples.

Description

Multi-agent reinforcement learning method based on priority experience sharing
Technical Field
The invention relates to the technical field of multi-agent reinforcement learning, in particular to a multi-agent reinforcement learning method based on priority experience sharing.
Background
Human society has always developed by sharing experience. In modern society, people can acquire knowledge in many ways, such as reading books, watching videos and attending training. Even so, experience sharing remains a central component of learning, because it not only lets us understand knowledge more deeply but also helps us apply it better. Experience sharing is typically achieved by exchanging knowledge with peers and teachers; through this exchange we learn from other people's experiences and ideas and draw wisdom from them, which both speeds up the learning process and deepens our understanding, so that knowledge is applied more effectively.
Experience sharing is a form of information pooling that can effectively improve the learning efficiency and performance of multi-agent reinforcement learning. In classical reinforcement learning methods such as MADDPG and COMA, the policy optimization of each agent during training is independent: every agent has its own experience pool, and the overall learning efficiency is limited by the agent that learns most slowly. Learning speeds may differ between agents, which limits the learning efficiency of the whole system, because each agent updates its policy only from its own experience data, and differences between the experience data of different agents can slow some agents down and thereby affect the learning efficiency of the whole system. Experience sharing methods such as the SEAC algorithm let each agent update its own policy and value function using the experience data of other agents, realizing knowledge sharing and exchange. This approach helps the agents explore the environment and optimize decisions, thereby improving the return and performance of the overall system. However, experience sharing also has adverse effects. First, the distribution of the sample data is critical: a sparse reward environment is one in which agents rarely receive feedback rewards, with no reward information available most of the time. If the sample data contains a large amount of low-reward or redundant information, experience sharing may reduce the learning efficiency of the whole system, because low-reward or redundant information can interfere with the agents' learning: the agents cannot obtain effective feedback and reward signals and therefore cannot update their policies and value functions correctly.
Disclosure of Invention
The invention aims to solve the prior-art problem that, when a multi-agent system in a sparse reward environment stores experience and then shares it for policy updates, the low probability of obtaining useful information degrades the learning efficiency of the agent system, and provides a multi-agent reinforcement learning method based on priority experience sharing. The method uses a priority mechanism to evaluate the experience data of different agents and raises the sampling probability of the experience data that contributes most to the agents' policy learning, thereby addressing the poor learning performance caused by low-quality training samples.
The technical solution for achieving the aim of the invention is as follows:
a multi-agent reinforcement learning method based on priority experience sharing comprises the following steps:
step 1, experience collection: each agent observes the environment state, selects a corresponding action according to its own policy network, outputs the action to the environment so that the environment transitions to the next state, and collects experience from the state, action and reward information;
step 2, priority calculation: calculating the priority of each experience sample according to its reward and its relative importance to the agent, transmitting the priority to a centralized experience pool module, and giving larger weights to experience samples with larger contributions so that more effective samples are selected during policy updating;
step 3, policy updating: the priority of an experience sample determines its sampling probability, so that high-priority samples are selected for training more frequently; as priorities are continuously updated, the sampling probabilities change with them, and the information in the experience samples is exploited to accelerate agent learning and improve the training effect.
The specific process of experience collection in step 1 is as follows:
First, each agent observes the environment state and takes an action to explore; second, the experience data of each agent undergoes priority calculation, which judges the value of the experience, and the experience is stored in the shared experience pool; finally, experiences are sampled from the shared pool according to the sampling probabilities, and the parameters of the deep neural networks are updated for the next round of actual decision making. Each agent selects actions with its policy network and outputs them to the environment for execution, iterating until a preset number of training steps is reached or learning converges, while collecting experience from the state, action and reward information.
The specific process of priority calculation in step 2 is as follows:
The priority of each experience sample is calculated according to its reward and its relative importance to the agent. First, each agent calculates the priority of every experience sample in its own experience pool and transmits the priorities to the centralized experience pool module; when the policy is updated, the centralized experience sharing module samples the experience data of all agents, so that the important experience samples of every agent have sufficient opportunity to be used for training;
Priority experience sharing gives larger weights to experience samples with larger contributions, so that more effective samples are selected during policy updating. The contribution of an experience sample is assessed from two aspects: first, from the size of its reward, which allows high-value, efficient samples to be screened out; second, from the importance of the sample to the agent: if a sample carries a high reward but the agent is already familiar with it, its effect on correcting the agent's policy network is still small. Therefore, the difference between the target Q value derived from the sample reward and the Q value of the action-value function is taken as a measurement index, which increases the agent's utilization of sample data with larger deviations. In the multi-agent priority experience sharing method, the priority calculation formula is expressed as follows:
p_i = |r_i| + |r_i + γ·max_{a'} Q'(s_{i+1}, a') - Q(s_i, a_i)| + ε    (1)
where p_i denotes the priority of the i-th experience sample, r_i denotes the immediate reward of the sample, s_i and a_i denote the state and action of the sample respectively, s_{i+1} denotes the next state of the sample, a' ranges over all actions available in the next state, γ is the discount factor weighting the importance of future rewards, Q(s_i, a_i) and Q'(s_{i+1}, a') denote the Q values of the selected actions under the current network and the target network respectively, and ε is a positive number smaller than 0.01 that keeps the experience sample priority from being 0;
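For illustration, with hypothetical values r_i = 1, γ = 0.99, max_{a'} Q'(s_{i+1}, a') = 2, Q(s_i, a_i) = 1.5 and ε = 0.01, formula (1) gives p_i = |1| + |1 + 0.99·2 - 1.5| + 0.01 = 1 + 1.48 + 0.01 = 2.49; a larger reward or TD error raises the priority accordingly.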
the structure of the priority experience sharing algorithm is as follows: firstly, an agent observes an environment state and outputs corresponding actions to the environment by using a strategy network, the environment is transferred to the next state according to the current state and the actions of the agent and gives feedback rewards, and then the agent calculates experience sample data priority and stores the experience sample data priority in a shared experience pool; in the training process, probability sampling is carried out from a shared experience pool at intervals, each intelligent agent calculates TD errors of a value network and carries out corresponding updating, then a corresponding Q value is selected to guide strategy network updating, and finally the intelligent agent recalculates the priority of experience sample data according to the corresponding network errors and updates the priority weight to the corresponding experience sample in the experience pool.
The specific process of policy updating in step 3 is as follows:
If a greedy strategy is used to select the experience sample with the largest contribution during policy updating, the following defects arise: after each agent updates its Critic, the priorities of all experience samples are not refreshed; only the priorities of the currently selected samples are updated, so the stored weights correspond to the agent that computed them previously rather than to the current agent. A low contribution only indicates that the sample contributed little to the previously computed agent; it does not mean the sample is useless to the current agent, so a priority-greedy strategy may miss sample data that contributes strongly to the current agent. In addition, a greedy method always selects the samples with the largest priority, which causes overfitting. Random probability sampling is therefore adopted: it raises the sampling probability of high-contribution experience samples while preserving a degree of randomness. The sampling probability formula is defined as follows:
P(i) = p_i^α / Σ_k p_k^α
where P(i) is the probability that experience sample i is sampled, p_i is the priority of experience sample i, and α controls how strongly the priority influences sampling; α = 0 corresponds to uniform sampling;
In priority experience replay, the priority of an experience sample determines its sampling probability, so high-priority samples are selected for training more frequently; because priorities are continuously updated, the sampling probabilities change with them. In this way the information in the experience samples is used effectively, the learning speed of the agents is increased and the training effect is improved. The policy update algorithm is as follows:
Compared with the prior art, the present technical solution has the following characteristics:
1. The priority calculation method of the present solution considers two aspects: the reward of the experience sample and the importance of the sample to the agent.
2. Through random probability sampling of experiences, the sampling probability of high-contribution samples is increased while a degree of randomness is preserved to avoid overfitting.
3. Priorities are continuously updated during policy learning and refreshed in the shared experience pool, improving the accuracy of the experience sample weights.
Drawings
FIG. 1 is a diagram of a multi-agent priority experience sharing framework;
FIG. 2 is a flow chart of a multi-agent priority experience sharing task;
FIG. 3 is a diagram of the internal network architecture of the algorithm.
Detailed Description
The invention will now be described in further detail with reference to the drawings and specific examples, which are not intended to limit the invention thereto.
Examples:
a multi-agent reinforcement learning method based on priority experience sharing comprises the following steps:
step 1, experience collection: each agent observes the environment state, selects a corresponding action according to its own policy network, outputs the action to the environment so that the environment transitions to the next state, and collects experience from the state, action and reward information;
step 2, priority calculation: calculating the priority of each experience sample according to its reward and its relative importance to the agent, transmitting the priority to a centralized experience pool module, and giving larger weights to experience samples with larger contributions so that more effective samples are selected during policy updating;
step 3, policy updating: the priority of an experience sample determines its sampling probability, so that high-priority samples are selected for training more frequently; as priorities are continuously updated, the sampling probabilities change with them, and the information in the experience samples is exploited to accelerate agent learning and improve the training effect.
The specific process of experience collection in step 1 is as follows:
During multi-agent reinforcement learning training, each agent acts according to its own policy and generates corresponding trajectory data, and then performs policy updates using the data it has generated. In a multi-agent system, however, each agent explores and learns according to its own policy, and different agents may learn at different rates, so the overall learning efficiency of the system is limited by the slowest-learning agent. Experience sharing can improve the overall learning efficiency of a multi-agent system: the agents can cooperate to achieve common or individual goals, and in the process each agent can benefit from the experience of the other agents, accelerating learning and improving efficiency;
The basic idea of experience sharing is to store the experiences collected by the different agents in the environment in a unified experience pool, as shown in FIG. 1, and then to randomly draw batches of samples from the pool for training the neural networks; this improves sample diversity and avoids the negative effect that consecutive, highly correlated state sequences have on network training;
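As a minimal, non-authoritative sketch of such a unified pool, the shared storage can be pictured as below; the class name SharedReplayBuffer, the default capacity and the tuple layout are illustrative assumptions rather than details given in the patent:

```python
import numpy as np

class SharedReplayBuffer:
    """Centralized experience pool shared by all agents (illustrative sketch)."""

    def __init__(self, capacity=100000):
        self.capacity = capacity
        # each entry: (agent_id, state, action, reward, next_state, priority)
        self.storage = []

    def add(self, agent_id, state, action, reward, next_state, priority):
        """Store one prioritized transition; drop the oldest one when full."""
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append((agent_id, state, action, reward, next_state, priority))

    def priorities(self):
        """Return the current priority of every stored transition."""
        return np.array([entry[-1] for entry in self.storage], dtype=np.float64)

    def __len__(self):
        return len(self.storage)
```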
the specific flow of experience collection is shown in figure 2: firstly, each intelligent agent observes the environmental state in the environment to make action for exploration; secondly, carrying out priority calculation on experience data of each intelligent agent, judging the experience value of each intelligent agent and storing the experience value into a shared experience pool; finally, experience sampling is carried out in a shared experience pool according to probability sampling, and parameters of the deep neural network are updated for actual decision making in the next step; each agent selects actions according to the strategy network, outputs the actions to the environment for execution, iterates until the preset training step number or learning convergence is reached, and then collects experience according to the state, the actions and the rewarding information. The specific process of the priority calculation in the step 2 is as follows:
The priority of each experience sample is calculated according to its reward and its relative importance to the agent. First, each agent calculates the priority of every experience sample in its own experience pool and transmits the priorities to the centralized experience pool module; when the policy is updated, the centralized experience sharing module samples the experience data of all agents, so that the important experience samples of every agent have sufficient opportunity to be used for training;
Priority experience sharing gives larger weights to experience samples with larger contributions, so that more effective samples are selected during policy updating. The contribution of an experience sample is assessed from two aspects: first, from the size of its reward, which allows high-value, efficient samples to be screened out; second, from the importance of the sample to the agent: if a sample carries a high reward but the agent is already familiar with it, its effect on correcting the agent's policy network is still small. Therefore, the difference between the target Q value derived from the sample reward and the Q value of the action-value function is taken as a measurement index, which increases the agent's utilization of sample data with larger deviations. In the multi-agent priority experience sharing method, the priority calculation formula is expressed as follows:
p_i = |r_i| + |r_i + γ·max_{a'} Q'(s_{i+1}, a') - Q(s_i, a_i)| + ε    (1)
where p_i denotes the priority of the i-th experience sample, r_i denotes the immediate reward of the sample, s_i and a_i denote the state and action of the sample respectively, s_{i+1} denotes the next state of the sample, a' ranges over all actions available in the next state, γ is the discount factor weighting the importance of future rewards, Q(s_i, a_i) and Q'(s_{i+1}, a') denote the Q values of the selected actions under the current network and the target network respectively, and ε is a positive number smaller than 0.01 that keeps the experience sample priority from being 0;
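A minimal sketch of how formula (1) could be evaluated for a single transition is given below; the function name compute_priority and the convention of passing Q(s_i, a_i) and the vector of target-network Q values as plain numbers are assumptions made for illustration, not the patent's implementation:

```python
import numpy as np

def compute_priority(reward, q_sa, target_q_next, gamma=0.99, eps=0.01):
    """Priority of one experience sample following formula (1):
    p_i = |r_i| + |r_i + gamma * max_a' Q'(s_{i+1}, a') - Q(s_i, a_i)| + eps.

    `q_sa` is Q(s_i, a_i) from the current network and `target_q_next` is the
    vector of target-network Q values over all next-state actions.
    """
    td_error = reward + gamma * np.max(target_q_next) - q_sa
    return abs(reward) + abs(td_error) + eps
```

For example, compute_priority(1.0, 1.5, np.array([0.5, 2.0])) returns 1 + |1 + 0.99·2 - 1.5| + 0.01 = 2.49.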
the structure of the priority experience sharing algorithm is shown in fig. 3, firstly, an agent observes the environment state and outputs corresponding actions to the environment by using a strategy network, the environment is transferred to the next state according to the current state and the actions of the agent and gives feedback rewards, and then the agent calculates the experience sample data priority and stores the experience sample data priority in a shared experience pool; in the training process, probability sampling is carried out from a shared experience pool at intervals, each intelligent agent calculates TD errors of a value network and carries out corresponding updating, then a corresponding Q value is selected to guide strategy network updating, and finally the intelligent agent recalculates the priority of experience sample data according to the corresponding network errors and updates the priority weight to the corresponding experience sample in the experience pool.
The specific process of policy updating in step 3 is as follows:
If a greedy strategy is used to select the experience sample with the largest contribution during policy updating, the following defects arise: after each agent updates its Critic, the priorities of all experience samples are not refreshed; only the priorities of the currently selected samples are updated, so the stored weights correspond to the agent that computed them previously rather than to the current agent. A low contribution only indicates that the sample contributed little to the previously computed agent; it does not mean the sample is useless to the current agent, so a priority-greedy strategy may miss sample data that contributes strongly to the current agent. In addition, a greedy method always selects the samples with the largest priority, which causes overfitting. Random probability sampling is therefore adopted: it raises the sampling probability of high-contribution experience samples while preserving a degree of randomness. The sampling probability formula is defined as follows:
P(i) = p_i^α / Σ_k p_k^α
where P(i) is the probability that experience sample i is sampled, p_i is the priority of experience sample i, and α controls how strongly the priority influences sampling; α = 0 corresponds to uniform sampling;
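The probability sampling described above can be sketched as follows; α = 0.6 is an assumed illustrative value, since the patent text does not prescribe a particular α:

```python
import numpy as np

def sample_indices(priorities, batch_size, alpha=0.6):
    """Draw sample indices with P(i) = p_i^alpha / sum_k p_k^alpha.

    alpha = 0 reduces to uniform sampling; larger alpha favours
    high-priority samples while keeping every sample reachable.
    """
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = scaled / scaled.sum()
    # sampling with replacement keeps the call valid for any batch size
    return np.random.choice(len(priorities), size=batch_size, p=probs)
```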
In priority experience replay, the priority of an experience sample determines its sampling probability, so high-priority samples are selected for training more frequently; because priorities are continuously updated, the sampling probabilities change with them. In this way the information in the experience samples is used effectively, the learning speed of the agents is increased and the training effect is improved. The policy update algorithm is as follows:
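The policy update algorithm itself is not reproduced in the patent text (it is given in the drawings), so the loop below is only a hedged sketch of the steps described above, reusing the SharedReplayBuffer and sample_indices sketches: sample by priority, compute TD errors, update the critic and actor, and write refreshed priorities back to the shared pool. The agent interface (critic_td_error, update_critic, update_actor, recompute_priority) is assumed for illustration and stands in for whatever actor-critic implementation is used:

```python
def shared_update_step(buffer, agents, batch_size=64, alpha=0.6):
    """One priority-shared training round (illustrative sketch only)."""
    if len(buffer) < batch_size:
        return  # not enough shared experience collected yet

    idx = sample_indices(buffer.priorities(), batch_size, alpha)
    batch = [buffer.storage[i] for i in idx]

    for agent in agents:
        td_errors = agent.critic_td_error(batch)   # TD error of the value network
        agent.update_critic(td_errors)             # corresponding critic update
        agent.update_actor(batch)                  # Q values guide the policy network update

    # recompute priorities with the freshly updated networks and write them
    # back to the shared pool; keeping only the last agent's estimate is a
    # simplification made for this sketch
    for i, (agent_id, s, a, r, s_next, _) in zip(idx, batch):
        new_p = agents[-1].recompute_priority(s, a, r, s_next)
        buffer.storage[i] = (agent_id, s, a, r, s_next, new_p)
```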

Claims (4)

1. A multi-agent reinforcement learning method based on priority experience sharing, characterized by comprising the following steps:
step 1, experience collection: each agent observes the environment state, selects a corresponding action according to its own policy network, outputs the action to the environment so that the environment transitions to the next state, and collects experience from the state, action and reward information;
step 2, priority calculation: calculating the priority of each experience sample according to its reward and its relative importance to the agent, transmitting the priority to a centralized experience pool module, and giving larger weights to experience samples with larger contributions so that more effective samples are selected during policy updating;
step 3, policy updating: the priority of an experience sample determines its sampling probability, so that high-priority samples are selected for training more frequently; as priorities are continuously updated, the sampling probabilities change with them, and the information in the experience samples is exploited to accelerate agent learning and improve the training effect.
2. The multi-agent reinforcement learning method based on priority experience sharing according to claim 1, wherein the specific process of experience collection in step 1 is as follows:
First, each agent observes the environment state and takes an action to explore; second, the experience data of each agent undergoes priority calculation, which judges the value of the experience, and the experience is stored in the shared experience pool; finally, experiences are sampled from the shared pool according to the sampling probabilities, and the parameters of the deep neural networks are updated for the next round of actual decision making. Each agent selects actions with its policy network and outputs them to the environment for execution, iterating until a preset number of training steps is reached or learning converges, while collecting experience from the state, action and reward information.
3. The multi-agent reinforcement learning method based on priority experience sharing according to claim 1, wherein the specific process of priority calculation in step 2 is as follows:
The priority of each experience sample is calculated according to its reward and its relative importance to the agent. First, each agent calculates the priority of every experience sample in its own experience pool and transmits the priorities to the centralized experience pool module; when the policy is updated, the centralized experience sharing module samples the experience data of all agents, so that the important experience samples of every agent have sufficient opportunity to be used for training;
Priority experience sharing gives larger weights to experience samples with larger contributions, so that more effective samples are selected during policy updating. The contribution of an experience sample is assessed from two aspects: first, from the size of its reward, which allows high-value, efficient samples to be screened out; second, from the importance of the sample to the agent: if a sample carries a high reward but the agent is already familiar with it, its effect on correcting the agent's policy network is still small. Therefore, the difference between the target Q value derived from the sample reward and the Q value of the action-value function is taken as a measurement index, which increases the agent's utilization of sample data with larger deviations. In the multi-agent priority experience sharing method, the priority calculation formula is expressed as follows:
p_i = |r_i| + |r_i + γ·max_{a'} Q'(s_{i+1}, a') - Q(s_i, a_i)| + ε    (1)
where p_i denotes the priority of the i-th experience sample, r_i denotes the immediate reward of the sample, s_i and a_i denote the state and action of the sample respectively, s_{i+1} denotes the next state of the sample, a' ranges over all actions available in the next state, γ is the discount factor weighting the importance of future rewards, Q(s_i, a_i) and Q'(s_{i+1}, a') denote the Q values of the selected actions under the current network and the target network respectively, and ε is a positive number smaller than 0.01 that keeps the experience sample priority from being 0;
the structure of the priority experience sharing algorithm is as follows: firstly, an agent observes an environment state and outputs corresponding actions to the environment by using a strategy network, the environment is transferred to the next state according to the current state and the actions of the agent and gives feedback rewards, and then the agent calculates experience sample data priority and stores the experience sample data priority in a shared experience pool; in the training process, probability sampling is carried out from a shared experience pool at intervals, each intelligent agent calculates TD errors of a value network and carries out corresponding updating, then a corresponding Q value is selected to guide strategy network updating, and finally the intelligent agent recalculates the priority of experience sample data according to the corresponding network errors and updates the priority weight to the corresponding experience sample in the experience pool.
4. The multi-agent reinforcement learning method based on priority experience sharing according to claim 1, wherein the specific process of policy updating in step 3 is as follows:
If a greedy strategy is used to select the experience sample with the largest contribution during policy updating, the following defects arise: after each agent updates its Critic, the priorities of all experience samples are not refreshed; only the priorities of the currently selected samples are updated, so the stored weights correspond to the agent that computed them previously rather than to the current agent. A low contribution only indicates that the sample contributed little to the previously computed agent; it does not mean the sample is useless to the current agent, so a priority-greedy strategy may miss sample data that contributes strongly to the current agent. In addition, a greedy method always selects the samples with the largest priority, which causes overfitting. Random probability sampling is therefore adopted: it raises the sampling probability of high-contribution experience samples while preserving a degree of randomness. The sampling probability formula is defined as follows:
P(i) = p_i^α / Σ_k p_k^α
where P(i) is the probability that experience sample i is sampled, p_i is the priority of experience sample i, and α controls how strongly the priority influences sampling; α = 0 corresponds to uniform sampling;
In priority experience replay, the priority of an experience sample determines its sampling probability, so high-priority samples are selected for training more frequently; because priorities are continuously updated, the sampling probabilities change with them. In this way the information in the experience samples is used effectively, the learning speed of the agents is increased and the training effect is improved. The policy update algorithm is as follows:
CN202310456116.5A 2023-04-25 2023-04-25 Multi-agent reinforcement learning method based on priority experience sharing Pending CN116596059A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310456116.5A CN116596059A (en) 2023-04-25 2023-04-25 Multi-agent reinforcement learning method based on priority experience sharing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310456116.5A CN116596059A (en) 2023-04-25 2023-04-25 Multi-agent reinforcement learning method based on priority experience sharing

Publications (1)

Publication Number Publication Date
CN116596059A true CN116596059A (en) 2023-08-15

Family

ID=87599908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310456116.5A Pending CN116596059A (en) 2023-04-25 2023-04-25 Multi-agent reinforcement learning method based on priority experience sharing

Country Status (1)

Country Link
CN (1) CN116596059A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117407514A (en) * 2023-11-28 2024-01-16 星环信息科技(上海)股份有限公司 Solution plan generation method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111858009B (en) Task scheduling method of mobile edge computing system based on migration and reinforcement learning
CN109886343B (en) Image classification method and device, equipment and storage medium
CN113570039B (en) Block chain system based on reinforcement learning optimization consensus
CN113191484A (en) Federal learning client intelligent selection method and system based on deep reinforcement learning
CN110691422A (en) Multi-channel intelligent access method based on deep reinforcement learning
CN113568727B (en) Mobile edge computing task allocation method based on deep reinforcement learning
CN111695690A (en) Multi-agent confrontation decision-making method based on cooperative reinforcement learning and transfer learning
CN111832627A (en) Image classification model training method, classification method and system for suppressing label noise
CN113434212A (en) Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN106815782A (en) A kind of real estate estimation method and system based on neutral net statistical models
CN112717415B (en) Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game
CN116596059A (en) Multi-agent reinforcement learning method based on priority experience sharing
CN113487039B (en) Deep reinforcement learning-based intelligent self-adaptive decision generation method and system
CN116448117A (en) Path planning method integrating deep neural network and reinforcement learning method
CN116992779B (en) Simulation method and system of photovoltaic energy storage system based on digital twin model
CN114169543A (en) Federal learning algorithm based on model obsolescence and user participation perception
CN116341605A (en) Grey wolf algorithm hybrid optimization method based on reverse learning strategy
CN112215412A (en) Dissolved oxygen prediction method and device
Jiang et al. Action candidate based clipped double q-learning for discrete and continuous action tasks
CN117010482A (en) Strategy method based on double experience pool priority sampling and DuelingDQN implementation
CN114866272B (en) Multi-round data delivery system of true value discovery algorithm in crowd-sourced sensing environment
CN116361639A (en) Self-adaptive federal learning method suitable for artificial intelligence internet of things heterogeneous system
CN113890653B (en) Multi-agent reinforcement learning power distribution method for multi-user benefits
CN115618241A (en) Task self-adaption and federal learning method and system for edge side vision analysis
Yu et al. Historical best Q-networks for deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination