CN116205273A - Multi-agent reinforcement learning method for optimizing experience storage and experience reuse - Google Patents

Multi-agent reinforcement learning method for optimizing experience storage and experience reuse

Info

Publication number
CN116205273A
Authority
CN
China
Prior art keywords
experience
agent
pool
sample
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111440668.4A
Other languages
Chinese (zh)
Inventor
吴益飞
赵鹏
陈庆伟
郭健
李胜
樊卫华
成爱萍
郑瑞琳
梁皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202111440668.4A priority Critical patent/CN116205273A/en
Publication of CN116205273A publication Critical patent/CN116205273A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-agent reinforcement learning method for optimizing experience storage and experience reuse, which comprises the following steps: an experience buffer pool for experience storage based on an LRU (Least Recently Used) mechanism is constructed; a multi-agent experience collection method based on network weight sharing is designed; a mixed sampling method based on experience priority is adopted to give higher-priority data a higher retrieval rate, and a B+ tree data structure is adopted to store the experiences with marked priorities. Compared with the traditional method, the method provided by the invention further enriches sample types on the basis of sufficiently reducing the relevance of sample data, has an efficient experience-sample retrieval rate and good biological interpretability, is more robust to sample noise, and performs better when an agent faces complex environments and tasks.

Description

Multi-agent reinforcement learning method for optimizing experience storage and experience reuse
Technical Field
The invention belongs to the technical field of deep reinforcement learning, and particularly relates to a multi-agent experience playback method combining a least recently used mechanism and a priority mixed sampling mechanism.
Background
In recent years, artificial neural networks inspired by the structure and information-transmission mechanisms of biological neural networks have become one of the hot research directions in the field of machine learning. Deep learning methods have shown excellent performance in fields such as computer vision, intelligent recommendation and automatic driving. Meanwhile, reinforcement learning, as another important branch of machine learning, is widely used in sequential decision problems. Deep reinforcement learning combines deep learning with reinforcement learning: the perception and abstraction capability of deep learning on high-dimensional state information is used to control the interaction between an agent and the environment, so that the agent continuously learns through trial and error.
Deep reinforcement learning is a weakly supervised learning method that lacks effective manual intervention and regulation, so an agent needs a large amount of environment interaction to obtain a satisfactory control policy. In practical use, deep reinforcement learning also suffers from problems such as agent actions lacking rationality, sparse environmental reward signals, and return functions that are difficult to design. A large amount of environment interaction is therefore not feasible in a real environment, and low sample utilization has become the main bottleneck for the wide application of deep reinforcement learning algorithms in practical scenarios.
The experience replay method uses an experience buffer pool to uniformly store the samples generated at each time step. Each experience sample comprises the current state, the action taken in that state, the reward given by the environment, and the next state reached after the agent executes the action; during network training, stored samples are selected at random from the experience buffer pool. By pooling experiences, the experience replay mechanism reuses experience sample data during neural network training, and the random selection of experiences also alleviates the problems of correlated data and non-stationary distribution among experience samples.
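For illustration, the conventional experience replay scheme just described can be sketched in Python roughly as follows; the names UniformReplayBuffer and Experience are hypothetical, and the sketch only assumes a FIFO buffer with uniform random sampling, not any particular library.

```python
import random
from collections import deque, namedtuple

# One experience sample: current state, action taken, reward given by the
# environment, next state, and a flag marking a terminal state.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "is_end"])

class UniformReplayBuffer:
    """Conventional experience buffer pool: FIFO storage, uniform random sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def store(self, state, action, reward, next_state, is_end):
        self.buffer.append(Experience(state, action, reward, next_state, is_end))

    def sample(self, batch_size):
        # Random selection reduces the correlation between the training samples.
        return random.sample(list(self.buffer), batch_size)
```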
However, different experience samples in the experience pool contribute differently to the update of the network weights. The conventional first-in-first-out storage scheme and uniform sampling scheme ignore the relative importance of different samples, so all experience samples have the same sampling probability; such a uniform sampling strategy can make the network hard to converge and prone to falling into local minima. Meanwhile, if outdated experience samples are frequently used to update the decision model of the Agent, the maximum expected return obtainable by the Agent in subsequent training is affected, which increases the learning difficulty of the Agent and prolongs the learning time.
In deep reinforcement learning, the importance of an experience can be measured by the absolute value of its temporal-difference error. The larger the absolute value of the temporal-difference error, the larger the gap between the value network's estimate for the current state and the true value, and the larger the sample's contribution to the back-propagated weight update; conversely, the smaller the temporal-difference error, the smaller the sample's influence on the gradient computation. Existing priority experience replay methods based on a greedy strategy have the following disadvantages: (1) a sample with a low temporal-difference error merely provides less feedback information during training; it does not mean the sample contributes nothing to the weight update, and a greedy algorithm may miss important information contained in such samples. (2) With a greedy algorithm, the strategy repeatedly selects only the samples with large temporal-difference errors for training, which leads to overfitting. (3) The experience pool capacity in deep reinforcement learning algorithms is generally set to be large; with linear storage, experiences are retrieved by piece-by-piece comparison, so the time complexity of the search is high.
Therefore, an improvement on the experience storage mode and the sampling method in the deep reinforcement learning algorithm is necessary, and the training efficiency and the universality of the deep reinforcement learning algorithm can be further improved.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provide a deep reinforcement learning method for optimizing an experience playback storage mode and a sampling strategy, which can remarkably improve the convergence speed and stability of a deep reinforcement learning algorithm.
The technical solution for realizing the purpose of the invention is as follows: a multi-agent reinforcement learning method for optimizing experience storage and experience reuse, the method comprising the steps of:
step 1, initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
step 2, performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
step 3, calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
step 4, when updating the current training strategy, sampling m experience samples from the global shared experience pool with a mixed sampling method, and updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
and step 5, training the current Q network with the m sampled experience samples; after training, recalculating the TD errors of all experience samples in the experience pool, updating the priorities of the experiences in the global shared experience pool, and judging whether the number of training steps has reached the preset maximum; if not, returning to step 3, otherwise ending the procedure.
A multi-agent reinforcement learning system that optimizes experience storage and experience reuse, the system comprising, in order:
the initialization module is used for initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and for setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
the multi-agent experience collection module is used for performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
the experience pool construction module is used for calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
the updating module is used for sampling m experience samples from the global shared experience pool with a mixed sampling method when the current training strategy is updated, and for updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
the training module is used for training the current Q network with the m sampled experience samples, recalculating the TD errors of all experience samples in the experience pool after training, updating the priorities of the experiences in the global shared experience pool, judging whether the number of training steps has reached the preset maximum, and, if not, returning to the experience pool construction module.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
step 1, initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
step 2, performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
step 3, calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
step 4, when updating the current training strategy, sampling m experience samples from the global shared experience pool with a mixed sampling method, and updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
and step 5, training the current Q network with the m sampled experience samples; after training, recalculating the TD errors of all experience samples in the experience pool, updating the priorities of the experiences in the global shared experience pool, and judging whether the number of training steps has reached the preset maximum; if not, returning to step 3, otherwise ending the procedure.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
step 1, initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
step 2, performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
step 3, calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
step 4, when updating the current training strategy, sampling m experience samples from the global shared experience pool with a mixed sampling method, and updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
and step 5, training the current Q network with the m sampled experience samples; after training, recalculating the TD errors of all experience samples in the experience pool, updating the priorities of the experiences in the global shared experience pool, and judging whether the number of training steps has reached the preset maximum; if not, returning to step 3, otherwise ending the procedure.
Compared with the prior art, the invention has the following remarkable advantages: 1) compared with the first-in-first-out mechanism, the least-recently-used mechanism lengthens the retention time of important experience samples in the experience pool, thereby accelerating network training; 2) the priority-based mixed sampling mode avoids the low learning efficiency of uniform sampling while also avoiding the overfitting problem caused by priority sampling based on a greedy strategy; 3) the multi-agent experience collection mode updates parameters through parallelized computation, which improves the diversity of experience samples and at the same time accelerates the convergence of the neural network; 4) by organizing the storage of experience samples with a B+ tree data structure, the linear time complexity of searching for experience samples is reduced to logarithmic time complexity, which sufficiently increases the retrieval speed of a specific experience sample in a large-capacity experience pool.
The invention is described in further detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a schematic diagram of a reinforcement learning method for optimizing an empirical playback sampling strategy in one embodiment.
FIG. 2 is a schematic diagram of a multi-agent experience collection method based on multi-threading in one embodiment.
FIG. 3 is a schematic diagram of a hybrid sampling method based on empirical sample priority in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
In one embodiment, the invention provides a multi-agent experience replay method that combines a least-recently-used mechanism with a priority-based mixed sampling mechanism. Experience replay is an important optimization strategy in deep reinforcement learning algorithms, and optimizing it improves the convergence speed and stability of the reinforcement learning algorithm, thereby improving overall performance on reinforcement learning tasks. As shown in fig. 1, the method of the present invention comprises the following steps:
step 1, initializing all parameters ω of the current Q network, initializing all parameters ω′ = ω of the target Q network, initializing the capacity N of the experience pool, and setting the batch size m for gradient descent and the parameter-update time step T of the target network;
step 2, through network weight sharing, deploying a plurality of Agents with a multithreading technique to interact with the environment simultaneously, where at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability to acquire experience samples from the environment;
step 3, calculating the absolute value of the TD error (temporal-difference error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the experience pool based on a least-recently-used mechanism;
step 4, when updating the current training strategy, sampling m experience samples from the globally shared experience replay pool with a mixed sampling method, and updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
and step 5, training the current Q network with the m sampled experience samples; after training, recalculating the TD errors of all experience samples in the experience pool, updating the priorities of the experiences in the global shared experience pool, judging whether the number of training steps has reached the preset maximum, and if not, returning to step 3.
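Taken together, steps 1 to 5 amount to the training loop sketched below. This is only an illustrative outline under the assumption that suitable components exist: the pool object and the collect, td_error, hybrid_sample, train_on_batch and sync_target callables are hypothetical placeholders for the mechanisms detailed in the following paragraphs, not an implementation disclosed here.

```python
def training_loop(collect, td_error, pool, hybrid_sample, train_on_batch,
                  sync_target, m, T, max_steps):
    """Outline of steps 1-5; all behaviour is injected through the arguments."""
    sync_target()                                 # step 1: target network copies the current one (omega' = omega)
    for step in range(1, max_steps + 1):
        for exp in collect():                     # step 2: multithreaded Agents with shared weights
            pool.insert(abs(td_error(exp)), exp)  # step 3: |TD error| indexes the shared pool (LRU inside)
        batch = hybrid_sample(pool, m)            # step 4: priority-based mixed sampling of m samples
        train_on_batch(batch)                     # step 5: update the current Q network
        pool.refresh_priorities(td_error)         # recompute priorities with the updated network
        if step % T == 0:
            sync_target()                         # periodic target-network parameter update
```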
In the embodiment of the invention, the experience obtained by the Agent interacting with the environment at time t is the five-tuple {s_t, a_t, r_t, s_{t+1}, is_end}, where s_t is the state of the environment in which the Agent is located at time t and is also the input of the current Q network, a_t is the action executed by the Agent at time t, s_{t+1} is the state the environment transitions to at the next time according to the Agent's state and action, r_t is the environmental reward given for that action, and is_end indicates whether the current state is already a termination state.
Further, in one embodiment, in conjunction with fig. 2, the multithreading-based multi-agent experience collection method described in step 2 specifically includes:
Firstly, a public neural network model is deployed and n threads are created based on it; each thread has the same network structure as the public neural network, interacts with the environment independently to obtain experience data, and runs independently without interfering with the other threads.
Each individual sampling Agent interacts with its own environment to generate experience samples, and the experience samples generated by all sampling Agents are stored uniformly in a shared experience buffer pool. Each Agent selects actions with an ε-greedy strategy: in every state, the action with the largest state-action value function is selected with probability 1 - ε, and a random action is selected with probability ε. Since the action selected by each Agent in the same state is therefore uncertain, different experience samples are obtained.
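A minimal sketch of this collection scheme is given below, assuming an environment object with reset() and step() methods and a shared q_values_fn that returns the action values of the shared network; the names env, q_values_fn, shared_pool and pool_lock are hypothetical and serve only to illustrate the thread structure.

```python
import random
import threading

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def sampling_agent(env, q_values_fn, shared_pool, pool_lock, epsilon, steps):
    """One sampling Agent: interacts with its own environment copy and pushes
    experience five-tuples into the shared experience buffer pool."""
    state = env.reset()
    for _ in range(steps):
        action = epsilon_greedy(q_values_fn(state), epsilon)
        next_state, reward, is_end = env.step(action)
        with pool_lock:  # the buffer pool is shared by all threads
            shared_pool.append((state, action, reward, next_state, is_end))
        state = env.reset() if is_end else next_state

# n threads, each with its own exploration rate, share the same network weights, e.g.:
# threads = [threading.Thread(target=sampling_agent,
#                             args=(make_env(), q_values_fn, pool, lock, eps, 1000))
#            for eps in (0.05, 0.1, 0.2, 0.4)]
```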
Further, in one embodiment, calculating the absolute value of the experience sample's TD error (temporal-difference error) in step 3, inserting the experience into the global shared experience pool indexed by the TD error, and updating the time attributes of all experiences in the experience pool based on the least-recently-used mechanism specifically includes:
Step 3-1, calculate the temporal-difference error δ_t of the experience sample at the current time t:
δ_t = Q*(s_t, a_t) - y_t
where s_t is the state of the environment in which the Agent is located at time t, a_t is the action executed by the Agent at time t, y_t is the TD target at time t, and Q*(s_t, a_t) is the current Q network's approximation of the expected maximum cumulative future reward obtained by taking action a_t in state s_t, calculated with the Monte Carlo approximation method, as follows:
Q*(s_t, a_t) ≈ Q(s_t, a_t; ω)
y_t = r_t + γ · max_{a′} Q(s_{t+1}, a′; ω′)    (y_t = r_t when is_end is true)
where r_t is the real-time reward obtained after the Agent interacts with the environment at time t, γ is the return discount rate, ω′ denotes the parameters of the target Q network, and s_{t+1} is the new state reached after the Agent executes the action;
Step 3-2, take the absolute value |δ_t| of the calculated temporal-difference error as the priority weight of the experience sample; if the temporal-difference error of the experience five-tuple cannot be calculated yet, set the priority to a preset maximum value. The priority of the experience sample is then used as the index to insert it into the global shared experience pool whose storage structure is a B+ tree, in which non-leaf nodes store only the priorities of experience samples while leaf nodes store both the priorities and the experience five-tuples;
The time attribute of each experience in the experience pool is given by the least-recently-used rule: the time attribute of the experience currently being inserted is set to 0, the time attribute values of all other experiences in the pool are increased by 1, and when the pool reaches its capacity limit, the experience with the largest time attribute value is discarded.
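The insertion logic of step 3 can be sketched as follows; for readability a sorted Python list stands in for the B+ tree (a real B+ tree would give logarithmic insertion and lookup), and the class name PriorityLRUPool and its interface are hypothetical.

```python
class PriorityLRUPool:
    """Global shared experience pool sketch: entries are ordered by priority
    (|TD error|) and each entry carries a least-recently-used time attribute."""

    def __init__(self, capacity, max_priority=1e6):
        self.capacity = capacity
        self.max_priority = max_priority  # used when the TD error cannot be computed yet
        self.entries = []                 # [priority, time_attribute, experience], kept sorted by priority

    def insert(self, experience, td_error=None):
        priority = abs(td_error) if td_error is not None else self.max_priority
        for entry in self.entries:        # LRU rule: every existing experience ages by 1
            entry[1] += 1
        if len(self.entries) >= self.capacity:
            # evict the experience with the largest time attribute (least recently used)
            self.entries.remove(max(self.entries, key=lambda e: e[1]))
        self.entries.append([priority, 0, experience])  # the new experience is the "youngest"
        self.entries.sort(key=lambda e: e[0])           # a B+ tree would keep this order implicitly
```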
Further, in one embodiment, in conjunction with fig. 3, the priority-based mixed sampling method described in step 4 gives higher-priority sample data a higher sampling rate. It is specifically described as follows:
Step 4-1, given that the capacity of the global shared experience pool is N, a line segment of length 1 is divided into N segments I_1, I_2, I_3, ..., I_N, as shown in fig. 3; each segment corresponds to one experience record in the experience pool, and the segment lengths differ: the segment corresponding to an experience record with high priority is long, and the segment corresponding to an experience record with low priority is short. The segment length of each experience record i is determined by the following formula:
len(i) = (|δ_i| + ε)^(3/4) / Σ_k (|δ_k| + ε)^(3/4)
where δ_i represents the temporal-difference error of experience record i, ε is a positive number infinitely close to 0 that prevents a segment length of 0, and the exponent takes 3/4 as an empirical value.
Step 4-2, before sampling, the line segment of length 1 is further divided into M equal parts S_1, S_2, S_3, ..., S_{M-1}, S_M, as shown in fig. 3, with M >> N; this guarantees that the segment corresponding to every experience record is divided into its own small blocks. Each of the M parts falls on the segment of some experience record, and a longer segment is divided into more blocks. During sampling, m positions are drawn from the M positions by uniform sampling, which satisfies the requirement that experiences with higher priority are retrieved at a higher rate.
Step 4-3, the priorities of the drawn experiences are used as index values to find the corresponding experience records in the global shared experience pool; the time attribute values of these experience records are reset to 0, and the time attribute values of the remaining experience samples are increased by 1.
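A sketch of steps 4-1 to 4-3 under the segment-length rule of step 4-1 ((|δ_i| + ε) raised to 3/4 and normalised over the pool). For brevity the m sampling positions are drawn uniformly on [0, 1) instead of first cutting the interval into M >> N equal slices; the resulting distribution over segments is the same, and the function name and return convention are hypothetical.

```python
import random

def hybrid_sample_indices(td_errors, m, exponent=0.75, eps=1e-6):
    """Return the indices of m experiences, drawn so that records with a larger
    |TD error| own a longer segment of the unit interval and are hit more often."""
    weights = [(abs(d) + eps) ** exponent for d in td_errors]
    total = sum(weights)
    # right edge of each experience's segment on the length-1 line
    edges, acc = [], 0.0
    for w in weights:
        acc += w / total
        edges.append(acc)
    chosen = []
    for _ in range(m):
        u = random.random()                  # a uniformly drawn position on [0, 1)
        for i, edge in enumerate(edges):     # linear scan for clarity; a B+ tree makes this logarithmic
            if u <= edge:
                chosen.append(i)
                break
        else:
            chosen.append(len(edges) - 1)    # guard against floating-point rounding at the top edge
    return chosen

# Example: a pool of 5 experiences, drawing a batch of 3 indices.
# print(hybrid_sample_indices([2.0, 0.1, 0.5, 3.0, 0.05], m=3))
```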
Further, in one embodiment, in conjunction with fig. 1, training the current Q network with the m sampled experience samples in step 5 adopts a stochastic gradient descent method with a variable learning rate, specifically as follows:
Step 5-1, take experience samples in turn from the m sampled experiences and use the target Q network to calculate the gradient value g_i corresponding to the i-th experience sample:
g_i = (y_i - Q(s_i, a_i; ω)) · ∇_ω Q(s_i, a_i; ω),  with y_i the TD target of the i-th sample
where ω denotes the parameters of the current Q network, and s_i and a_i are, respectively, the state of the Agent in the i-th experience sample and the action selected by the Agent in that state;
Step 5-2, use the gradient value g_i to update the parameters of the current Q network, with the update rule:
ω ← ω + α′ · g_i
where α′ is the learning rate, a variable parameter defined as:
α′ = α · (n · p_i)^(-β)
where α is the initially set learning rate, a hyperparameter that must be given manually before network training, p_i is the sampling probability of the i-th experience sample, and β ∈ (0, 1); when training starts, β is a positive real number close to 0, and as the training iterations proceed, β gradually approaches 1. After each training step, the temporal-difference errors δ_i of the experience samples are updated using the Q network with the newly updated parameters, and the storage structure of the global shared experience pool is updated at the same time;
Step 5-3, update the parameters of the target Q network every T time steps according to the set time step T, with the update rule:
ω_target ← ω (when t = nT).
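The update rules of steps 5-1 and 5-2 can be sketched as follows. The gradient expression and the exponent -β in the learning-rate formula are assumptions reconstructed from the surrounding text (the exact expressions appear only as images in the source), and the toy linear Q function is purely illustrative.

```python
import numpy as np

def td_gradient(q_value, td_target, q_grad):
    """Gradient contribution of one sample for a squared TD loss, assuming
    g_i = (y_i - Q(s_i, a_i; w)) * dQ/dw."""
    return (td_target - q_value) * q_grad

def variable_lr_step(weights, grad, alpha, n, p_i, beta):
    """One update of the current Q network with the priority-dependent learning
    rate alpha' = alpha * (n * p_i) ** (-beta), as assumed above."""
    alpha_prime = alpha * (n * p_i) ** (-beta)
    return weights + alpha_prime * grad

# Toy example with a linear Q function Q(s, a; w) = w . phi(s, a):
w = np.zeros(4)
phi = np.array([1.0, 0.5, -0.2, 0.3])  # hypothetical feature vector of (s_i, a_i)
g = td_gradient(q_value=float(w @ phi), td_target=1.0, q_grad=phi)
w = variable_lr_step(w, g, alpha=0.01, n=10_000, p_i=1e-3, beta=0.4)
```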
a multi-agent reinforcement learning system that optimizes experience storage and experience reuse, the system comprising, in order:
the initialization module is used for initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and for setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
the multi-agent experience collection module is used for performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
the experience pool construction module is used for calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
the updating module is used for sampling m experience samples from the global shared experience pool with a mixed sampling method when the current training strategy is updated, and for updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
the training module is used for training the current Q network with the m sampled experience samples, recalculating the TD errors of all experience samples in the experience pool after training, updating the priorities of the experiences in the global shared experience pool, judging whether the number of training steps has reached the preset maximum, and, if not, returning to the experience pool construction module.
Specific limitations of the multi-agent reinforcement learning system for optimizing experience storage and experience reuse may be found in the above description of the multi-agent reinforcement learning method for optimizing experience storage and experience reuse, and are not repeated here. The modules of the system can be realized fully or partially in software, or by a combination of software and hardware: the modules may be embedded in hardware, be independent of the processor in the computer device, or be stored as software in the memory of the computer device so that the processor can call and execute the operations corresponding to them. Details not described herein belong to technology known in the art.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of when executing the computer program:
step 1, initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
step 2, performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
step 3, calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
step 4, when updating the current training strategy, sampling m experience samples from the global shared experience pool with a mixed sampling method, and updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
and step 5, training the current Q network with the m sampled experience samples; after training, recalculating the TD errors of all experience samples in the experience pool, updating the priorities of the experiences in the global shared experience pool, and judging whether the number of training steps has reached the preset maximum; if not, returning to step 3, otherwise ending the procedure.
For specific limitations on each step, reference may be made to the limitations of the multi-agent reinforcement learning method for optimizing experience storage and experience reuse hereinabove, and no further description is given here.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
step 1, initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
step 2, performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
step 3, calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
step 4, when updating the current training strategy, sampling m experience samples from the global shared experience pool with a mixed sampling method, and updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
and step 5, training the current Q network with the m sampled experience samples; after training, recalculating the TD errors of all experience samples in the experience pool, updating the priorities of the experiences in the global shared experience pool, and judging whether the number of training steps has reached the preset maximum; if not, returning to step 3, otherwise ending the procedure.
For specific limitations on each step, reference may be made to the limitations of the multi-agent reinforcement learning method for optimizing experience storage and experience reuse hereinabove, and no further description is given here.
Compared with the traditional method, the multi-agent experience replay method provided by the invention, which combines a least-recently-used mechanism with a priority-based mixed sampling mechanism, further enriches sample types while sufficiently reducing the correlation among sample data, has good biological interpretability, and performs better when facing complex environments and tasks.
The foregoing has outlined and described the basic principles, features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (9)

1. A multi-agent reinforcement learning method for optimizing experience storage and experience reuse, the method comprising the steps of:
step 1, initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
step 2, performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
step 3, calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
step 4, when updating the current training strategy, sampling m experience samples from the global shared experience pool with a mixed sampling method, and updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
and step 5, training the current Q network with the m sampled experience samples; after training, recalculating the TD errors of all experience samples in the experience pool, updating the priorities of the experiences in the global shared experience pool, and judging whether the number of training steps has reached the preset maximum; if not, returning to step 3, otherwise ending the procedure.
2. The multi-agent reinforcement learning method for optimizing experience storage and experience reuse according to claim 1, wherein the experience obtained by the Agent interacting with the environment at time t in step 2 is the five-tuple {s_t, a_t, r_t, s_{t+1}, is_end}, where s_t is the state of the environment in which the Agent is located at time t and is also the input of the current Q network, a_t is the action executed by the Agent at time t, s_{t+1} is the state the environment transitions to at the next time according to the Agent's state and action, r_t is the environmental reward given for that action, and is_end indicates whether the current state is already a termination state.
3. The multi-agent reinforcement learning method for optimizing experience storage and experience reuse according to claim 1 or 2, wherein the multi-agent experience collection based on the multi-thread in step 2 comprises the following specific procedures:
a public neural network model is deployed and n threads are created based on the model; each thread contains the same network structure and parameters as the public neural network model, interacts with the environment independently to obtain experience data, and runs independently without interfering with the other threads;
each individual sampling Agent interacts with the own environment to generate an experience sample, the experience samples generated by all the sampling agents are uniformly stored in a shared experience buffer pool, and each Agent adopts epsilon-greedy strategy to select actions, in particular: the action with the largest state action value function is selected with a probability of 1-epsilon for each state, and the random action is selected with a probability of epsilon.
4. The multi-agent reinforcement learning method for optimizing experience storage and experience reuse according to claim 3, wherein the calculating of the absolute value of the experience sample time difference error TD error in step 3 and inserting the experience into the global shared experience pool with TD error as index, and updating the time attribute values of all experiences in the global shared experience pool based on the least recently used mechanism comprises:
step 3-1, calculating the temporal-difference error δ_t of the experience sample at the current time t:
δ_t = Q*(s_t, a_t) - y_t
where s_t is the state of the environment in which the Agent is located at time t, a_t is the action executed by the Agent at time t, y_t is the TD target at time t, and Q*(s_t, a_t) is the current Q network's approximation of the expected maximum cumulative future reward obtained by taking action a_t in state s_t, calculated with the Monte Carlo approximation method, as follows:
Q*(s_t, a_t) ≈ Q(s_t, a_t; ω)
y_t = r_t + γ · max_{a′} Q(s_{t+1}, a′; ω′)    (y_t = r_t when is_end is true)
where r_t is the real-time reward obtained after the Agent interacts with the environment at time t, γ is the return discount rate, ω′ denotes the parameters of the target Q network, and s_{t+1} is the new state reached after the Agent executes the action;
step 3-2, taking the absolute value |δ_t| of the calculated temporal-difference error as the priority weight of the experience sample; if the temporal-difference error of the experience five-tuple cannot be calculated yet, the priority is set to the preset maximum value, and the priority of the experience sample is then used as the index to insert it into the global shared experience pool whose storage structure is a B+ tree, in which non-leaf nodes store only the priorities of experience samples while leaf nodes store both the priorities and the experience five-tuples;
the time attribute of each experience in the global shared experience pool is derived from the least recently used principle: the time attribute of the experience currently inserted into the global shared experience pool is set to 0, and the time attribute value of all other experiences in the global shared experience pool is added with 1, and when the capacity of the experience pool reaches the upper limit, the experience with the largest time attribute value is discarded.
5. The multi-agent reinforcement learning method for optimizing experience storage and experience reuse according to claim 4, wherein the mixed sampling method in step 4 gives higher-priority sample data a higher sampling rate; m experience samples are sampled with the mixed sampling method, and the time attributes of the experiences in the experience pool are updated according to the least-recently-used mechanism, the specific implementation process being as follows:
step 4-1, given that the capacity of the global shared experience pool is N, dividing a segment with a length of 1 into N segments, wherein each segment corresponds to one experience sample in the experience pool, the length of the segment corresponding to each experience sample is different, the length of the segment corresponding to the experience sample with high priority is longer, the segment corresponding to the experience sample with low priority is shorter, and the length of the segment of each experience sample i is determined by the following formula:
len(i) = (|δ_i| + ε)^(3/4) / Σ_k (|δ_k| + ε)^(3/4)
where δ_i represents the temporal-difference error of experience sample i, k is the number of experience samples in the summation, ε is a positive number infinitely close to 0 that prevents a segment length of 0, and the exponent takes 3/4 as an empirical value;
step 4-2, dividing the line segment with the length of 1 into M equal parts on average before sampling, where M >> N; each of the M parts falls on the segment corresponding to some experience sample, and a segment with higher priority is divided into more small blocks; during sampling, m positions are drawn from the M positions by uniform sampling;
and 4-3, taking the extracted priority of experience as an index value to find a corresponding experience sample from the global shared experience pool, updating the time attribute value of the experience sample to 0, and adding 1 to the time attribute values of the rest experience samples.
6. The multi-agent reinforcement learning method for optimizing experience storage and experience reuse according to claim 5, wherein in step 5, the current Q network is trained by using m empirical samples obtained by sampling, specifically, a batch gradient descent method based on a variable learning rate is adopted, and the steps are as follows:
step 5-1, taking experience samples in turn from the m sampled experiences, and using the target Q network to calculate the gradient value g_i corresponding to the i-th experience sample:
g_i = (y_i - Q(s_i, a_i; ω)) · ∇_ω Q(s_i, a_i; ω),  with y_i the TD target of the i-th sample
where ω denotes the parameters of the current Q network, and s_i and a_i are, respectively, the state of the Agent in the i-th experience sample and the action selected by the Agent in that state;
step 5-2, using the gradient value g_i to update the parameters of the current Q network, with the update rule:
ω ← ω + α′ · g_i
where α′ is the learning rate, a variable parameter defined as:
α′ = α · (n · p_i)^(-β)
where α is the initially set learning rate, a hyperparameter that must be given manually before network training, p_i is the sampling probability of the i-th experience sample, and β ∈ (0, 1); when training starts, β is a positive real number close to 0, and as the training iterations proceed, β gradually approaches 1; after each training step, the temporal-difference errors δ_i of the experience samples are updated using the Q network with the newly updated parameters, and the storage structure of the global shared experience pool is updated at the same time;
and step 5-3, updating the parameters of the target Q network every T time steps according to the set time step T, with the update rule:
ω_target ← ω (when t = nT).
7. a multi-agent reinforcement learning system that optimizes experience storage and experience reuse, the system comprising, in order:
the initialization module is used for initializing all parameters ω of the current Q network, all parameters ω′ = ω of the target Q network and the capacity N of the experience pool, and for setting the batch size m for gradient descent and the parameter-update time step T of the target Q network;
the multi-agent experience collection module is used for performing multithreading-based multi-agent experience collection: through network weight sharing, a plurality of Agents are deployed with a multithreading technique to interact with the environment simultaneously, and at each time step the Agents in different threads adopt exploration strategies that differ with a certain probability, so as to acquire experience samples from the environment;
the experience pool construction module is used for calculating the absolute value of the temporal-difference error (TD error) of each experience sample, inserting the experience into a global shared experience pool indexed by the TD error, and updating the time attribute values of all experiences in the global shared experience pool based on a least-recently-used mechanism;
the updating module is used for sampling m experience samples from the global shared experience pool with a mixed sampling method when the current training strategy is updated, and for updating the time attributes of the experiences in the experience pool according to the least-recently-used mechanism, where m is the batch size set for gradient descent;
the training module is used for training the current Q network with the m sampled experience samples, recalculating the TD errors of all experience samples in the experience pool after training, updating the priorities of the experiences in the global shared experience pool, judging whether the number of training steps has reached the preset maximum, and, if not, returning to the experience pool construction module.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when the computer program is executed by the processor.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202111440668.4A 2021-11-30 2021-11-30 Multi-agent reinforcement learning method for optimizing experience storage and experience reuse Pending CN116205273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111440668.4A CN116205273A (en) 2021-11-30 2021-11-30 Multi-agent reinforcement learning method for optimizing experience storage and experience reuse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111440668.4A CN116205273A (en) 2021-11-30 2021-11-30 Multi-agent reinforcement learning method for optimizing experience storage and experience reuse

Publications (1)

Publication Number Publication Date
CN116205273A true CN116205273A (en) 2023-06-02

Family

ID=86511662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111440668.4A Pending CN116205273A (en) 2021-11-30 2021-11-30 Multi-agent reinforcement learning method for optimizing experience storage and experience reuse

Country Status (1)

Country Link
CN (1) CN116205273A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822618A (en) * 2023-08-30 2023-09-29 北京汉勃科技有限公司 Deep reinforcement learning exploration method and assembly based on dynamic noise network
CN117010482A (en) * 2023-07-06 2023-11-07 三峡大学 Strategy method based on double experience pool priority sampling and DuelingDQN implementation


Similar Documents

Publication Publication Date Title
CN116205273A (en) Multi-agent reinforcement learning method for optimizing experience storage and experience reuse
CN111768028B (en) GWLF model parameter adjusting method based on deep reinforcement learning
CN111353582A (en) Particle swarm algorithm-based distributed deep learning parameter updating method
CN111209095B (en) Pruning method based on tree search in DAG parallel task scheduling
CN113485826B (en) Load balancing method and system for edge server
CN108446770B (en) Distributed machine learning slow node processing system and method based on sampling
CN112463189B (en) Distributed deep learning multi-step delay updating method based on communication operation sparsification
CN113487039B (en) Deep reinforcement learning-based intelligent self-adaptive decision generation method and system
CN112836974B (en) Dynamic scheduling method for multiple field bridges between boxes based on DQN and MCTS
CN112734014A (en) Experience playback sampling reinforcement learning method and system based on confidence upper bound thought
CN115437795B (en) Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
CN116448117A (en) Path planning method integrating deep neural network and reinforcement learning method
CN116468159A (en) Reactive power optimization method based on dual-delay depth deterministic strategy gradient
CN115686779A (en) Self-adaptive edge computing task scheduling method based on DQN
CN112381218A (en) Local updating method for distributed deep learning training
CN116494247A (en) Mechanical arm path planning method and system based on depth deterministic strategy gradient
CN117194502B (en) Database content cache replacement method based on long-term and short-term memory network
CN112131089B (en) Software defect prediction method, classifier, computer device and storage medium
CN111488208B (en) Bian Yun collaborative computing node scheduling optimization method based on variable-step-size bat algorithm
CN111740925B (en) Deep reinforcement learning-based flow scheduling method
CN113419424A (en) Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN116862025A (en) Model training method, system, client and server node, electronic device and storage medium
CN111898752A (en) Apparatus and method for performing LSTM neural network operations
CN116432780A (en) Model increment learning method, device, equipment and storage medium
CN111001161B (en) Game strategy obtaining method based on second-order back propagation priority

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination