CN108108822B - Different strategy deep reinforcement learning method for parallel training - Google Patents

Different strategy deep reinforcement learning method for parallel training

Info

Publication number
CN108108822B
Authority
CN
China
Prior art keywords
experience information
experience
storage unit
environment
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810040895.XA
Other languages
Chinese (zh)
Other versions
CN108108822A (en)
Inventor
陈志波
张直政
陈嘉乐
石隽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810040895.XA priority Critical patent/CN108108822B/en
Publication of CN108108822A publication Critical patent/CN108108822A/en
Application granted granted Critical
Publication of CN108108822B publication Critical patent/CN108108822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parallel-training off-policy (different-strategy) deep reinforcement learning method, which comprises the following steps: creating an environment process pool and, with the agent to be trained by deep reinforcement learning acting as the main process, selecting two or more environment processes to start interacting simultaneously; during the interaction, each environment process stores the experience information generated at every interaction step in the experience information cache unit corresponding to that environment process, and every time n pieces of single-step interaction information have been added to an environment process's cache unit, the main process samples from the experience storage units according to an experience sample sampling algorithm and updates the corresponding parameters of the agent; when an episode ends, the environment process screens the experience information generated during that episode according to an experience sample screening-and-storing algorithm and either stores it in the corresponding experience storage unit or deletes it directly. The method improves the sample efficiency of the reinforcement learning algorithm, shortens training time, and thereby improves the learning efficiency and stability of the reinforcement learning algorithm.

Description

Different strategy deep reinforcement learning method for parallel training
Technical Field
The invention relates to the technical field of artificial intelligence and machine learning, and in particular to a parallel-training off-policy (different-strategy) deep reinforcement learning method.
Background
Reinforcement learning is an important machine learning method with many applications in intelligent control, robotics, human-machine gaming, clinical medicine, analysis and prediction, and other fields. Reinforcement learning stands apart from the supervised and unsupervised learning of traditional machine learning: the agent obtains experience from its interaction with the environment and thereby learns a policy mapping environment states to actions. In reinforcement learning, the agent receives state information from the environment and, based on the learned policy, produces an action that acts on the environment; the environment state changes after receiving the action and at the same time produces a return value (reward or punishment); the changed state and the reward/punishment signal are sent back to the agent, which updates its policy according to the received information and selects the next decision (i.e., the next action) according to that policy. The learning goal of a reinforcement learning system is to dynamically adjust the agent's parameters during interaction with the environment so as to update the policy being learned, such that the positive signal fed back by the environment is maximized.
Deep reinforcement learning is a newer form of reinforcement learning in which the agent is built from a deep learning model. In deep reinforcement learning the agent needs more training samples, and therefore a large amount of interaction between the agent and the environment is required to generate training data. In real scientific research and industrial production, however, the design of the environment and of the feedback signal (the return value) is very complicated; computing the state and return values after the environment changes in response to the agent's action involves a large amount of computation and takes a long time, which places extremely high demands on the training procedure and the sample efficiency of agent learning.
Therefore, it is necessary to study how to improve the agent's sample efficiency in deep reinforcement learning and shorten its training time, thereby improving the agent's learning ability and allowing it to deliver practical value in application scenarios more quickly and effectively.
Disclosure of Invention
The invention aims to provide a parallel-training off-policy deep reinforcement learning method, which improves the sample efficiency of the reinforcement learning algorithm, shortens training time, and thereby improves learning efficiency and stability.
The purpose of the invention is realized by the following technical scheme:
a parallel-training different-strategy deep reinforcement learning method comprises the following steps:
creating an environment process pool, wherein the environment process pool comprises a plurality of environment processes of the same type;
an agent to be trained by deep reinforcement learning serves as the main process, sends an interaction request to the environment process pool, and selects two or more environment processes to start interacting simultaneously according to the information returned by the environment processes;
during the interaction between the main process and the current environment process, the environment process stores the experience information generated at each interaction step in the experience information cache unit corresponding to the current environment process; each time n pieces of single-step interaction information have been added to the cache unit of an environment process, the main process samples from the experience storage units based on an experience sample sampling algorithm and updates the corresponding parameters of the agent;
and when an episode of the interaction between the main process and the current environment process ends, the environment process screens the experience information generated during that episode based on an experience sample screening-and-storing algorithm and either stores it in the corresponding experience storage unit or deletes it directly.
According to the technical solution provided by the invention, the experience sample screening-and-storing algorithm and the experience sample sampling algorithm are optimization algorithms built on top of an asynchronous parallel training framework; they can be selected and adjusted according to the specific application requirements and scenarios, so that the sample efficiency of the off-policy reinforcement learning algorithm is improved, training time is shortened, and learning efficiency and stability are further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a conventional off-policy deep reinforcement learning method according to an embodiment of the present invention;
Fig. 2 is a flowchart of the parallel-training off-policy deep reinforcement learning method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a neural network according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a parallel-training off-policy deep reinforcement learning method. For off-policy deep reinforcement learning algorithms, it uses multiple processes so that the agent interacts with several environments at the same time, and the agent is trained asynchronously on the experience generated by interacting with these environments. The method mainly comprises the following steps:
1. Create an environment process pool, wherein the environment process pool comprises a plurality of environment processes of the same type.
In the embodiment of the invention, the same or different parameter settings can be adopted among the environment processes.
2. The agent to be trained by deep reinforcement learning acts as the main process, sends an interaction request to the environment process pool, and selects two or more environment processes to start interacting simultaneously according to the information returned by each environment process.
In this step, the agent, acting as the main process, sends an interaction request to the environment process pool. If the pool contains an environment process that is not currently interacting, the identifier of that environment process is returned to the main process; the main process sets the environment identifier of that process from the idle state to the occupied state and interacts with it. If no environment process in the pool is free, an environment identifier in the occupied state is returned to the main process as a wait signal, and the main process decides, based on the wait signal, either to wait or to keep sending interaction requests to the pool.
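A minimal Python sketch of this request/handshake logic is given below. It assumes a simple in-memory pool object standing in for the real multi-process pool, and all class, function, and variable names (EnvProcessPool, acquire_environment, IDLE, OCCUPIED) are illustrative rather than taken from the patent.

```python
import random
import time

IDLE, OCCUPIED = 0, 1

class EnvProcessPool:
    """Tracks which environment processes are free; a stand-in for the real pool."""
    def __init__(self, num_envs):
        # one status flag per environment process, keyed by its identifier
        self.status = {env_id: IDLE for env_id in range(num_envs)}

    def request(self):
        # return the identifier of an idle environment process,
        # or None as the "wait" signal when every process is occupied
        idle = [env_id for env_id, s in self.status.items() if s == IDLE]
        return random.choice(idle) if idle else None

def acquire_environment(pool, retry_interval=0.1):
    # the main process (agent) keeps requesting until an idle environment is granted
    while True:
        env_id = pool.request()
        if env_id is not None:
            pool.status[env_id] = OCCUPIED   # mark idle -> occupied before interacting
            return env_id
        time.sleep(retry_interval)           # wait signal received: retry later
```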
3. During the interaction between the main process and the current environment process, the environment process stores the experience information generated at each interaction step in the experience information cache unit corresponding to the current environment process. Each time n pieces of single-step interaction information have been added to the cache unit of an environment process, the main process samples from the experience storage units based on an experience sample sampling algorithm and updates the corresponding parameters of the agent.
In the embodiment of the present invention, the experience information mainly includes: state values, action values, return values, an interaction-termination flag, and the cumulative return value (optional).
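For illustration only, a single-step experience record could be represented as follows. The field names are assumptions, and the next_state field is added here because off-policy updates commonly need it, even though it is not listed explicitly above.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

@dataclass
class Experience:
    state: np.ndarray         # state value received from the environment
    action: np.ndarray        # action value produced by the agent
    reward: float             # single-step return value
    done: bool                # interaction-termination flag
    next_state: Optional[np.ndarray] = None  # assumed field, not listed in the patent text
    episode_return: Optional[float] = None   # cumulative return of the episode (optional)
```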
4. When an episode of the interaction between the main process and the current environment process ends, the environment process screens the experience information generated during that episode based on an experience sample screening-and-storing algorithm and either stores it in the corresponding experience storage unit or deletes it directly.
In the embodiment of the invention, when an episode of the interaction between the main process and the current environment process ends, the main process releases the current environment process and resets its state to idle; the reset environment process resets its environment and waits for the next interaction to start.
In the embodiment of the invention, the experience storage units mainly comprise a common experience information storage unit and a high-return experience information storage unit. Both units have a fixed length and store experience information in a FIFO (first-in, first-out) manner; the length is the maximum number of single-step experience records a unit can hold. The length of the common experience information storage unit is denoted L_O, and the number of single-step experience records currently stored in it is denoted N_O; the length of the high-return experience information storage unit is denoted L_H, and the number of single-step experience records currently stored in it is denoted N_H.
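A minimal sketch of such a fixed-length FIFO storage unit is shown below, assuming a simple deque-backed implementation; the class name is illustrative, and the example lengths are taken from the worked example later in the description.

```python
from collections import deque

class ExperienceStore:
    """Fixed-length FIFO storage unit; the oldest single-step record is dropped first."""
    def __init__(self, length):
        self.buffer = deque(maxlen=length)

    def extend(self, experiences):
        # store the single-step experience records of one episode
        self.buffer.extend(experiences)

    def __len__(self):
        # N_O / N_H: number of single-step records currently stored
        return len(self.buffer)

ordinary_store = ExperienceStore(length=10**6)      # L_O, value from the worked example below
high_return_store = ExperienceStore(length=10**4)   # L_H, value from the worked example below
```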
In the embodiment of the present invention, the experience sample screening-and-storing algorithm screens and stores the experience information held in an environment process's experience information cache unit after a complete episode of interaction between the main process corresponding to the agent and that environment process has finished. The main procedure is as follows:
The experience information generated by the current episode and held in the experience information cache unit is stored into the common experience information storage unit, and N_O is updated; the updated N_O is then compared with a first threshold N_limit.
In the embodiment of the invention, interaction proceeds step by step, and one episode comprises multiple interaction steps. The experience information cache unit and the experience information storage units serve different purposes: while an episode is still running, the experience information of each completed step is kept in the cache unit; after the episode has finished, the experience information of all its steps is stored into the experience storage units.
If N_O is less than the first threshold N_limit, the storing operation ends;
if N is presentOGreater than a first threshold value NlimitThen respectively calculate the latest stored NnewP-th of round accumulated return value corresponding to experience information of each round1Value R of percentilehighAnd p is2Value R of percentilelowWherein p is1>p2(ii) a Recording the round accumulated return value of the experience information currently stored in the common experience information storage unit as R, wherein R is more than RhighCopying the experience information corresponding to the round to a high-return experience information storage unit for storage; when R islow≤R≤RhighThen, the experience information corresponding to the turn is changed into (R-R) with probability phigh)/(Rhigh-Rlow) Copying the experience information to a high-return experience information storage unit for storage; when R is less than or equal to RlowWhen the operation is finished, the storage operation is finished.
In the embodiment of the invention, each interaction step within an episode produces a return value; the episode-accumulated return value is the sum of the return values of all the steps of that episode, i.e., the accumulation is performed step by step.
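The screening-and-storing procedure described above could be sketched as follows. This is an illustrative reading of the patent text with hypothetical names: it keeps a separate deque of the episode-accumulated returns of the N_new most recent episodes, assumes the store objects follow the ExperienceStore sketch above, and writes the copy probability as the reconstructed p = (R - R_low)/(R_high - R_low).

```python
import random
from collections import deque

import numpy as np

# episode-accumulated returns of the N_new most recently stored episodes (N_new = 100 assumed)
recent_episode_returns = deque(maxlen=100)

def screen_and_store(episode_buffer, ordinary_store, high_return_store,
                     n_limit, p1=90, p2=10):
    # R: episode-accumulated return, i.e. the sum of the single-step returns of this episode
    R = sum(e.reward for e in episode_buffer)
    ordinary_store.extend(episode_buffer)        # always store into the common unit
    recent_episode_returns.append(R)
    if len(ordinary_store) < n_limit:            # N_O below the first threshold: done
        return
    r_high = float(np.percentile(recent_episode_returns, p1))  # p_1-th percentile R_high
    r_low = float(np.percentile(recent_episode_returns, p2))   # p_2-th percentile R_low
    if R > r_high:
        high_return_store.extend(episode_buffer)               # always keep high-return episodes
    elif R >= r_low:
        # reconstructed copy probability p = (R - R_low) / (R_high - R_low)
        p = (R - r_low) / max(r_high - r_low, 1e-8)
        if random.random() < p:
            high_return_store.extend(episode_buffer)
    # R < R_low: the episode is not copied to the high-return unit
```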
The experience sample sampling algorithm is used in the main process corresponding to the agent to sample from the storage units and update the agent's parameters. The main procedure is as follows:
when the main process prepares to update the parameters of the intelligent agent, the number N of the experience information already stored in the common experience information storage unit is detectedOAnd compares it with a second threshold value Nl(typically set to an integer multiple of Batch Size): if N is presentO<NlIf yes, the sampling and parameter updating are abandoned; if N is presentO≥NlDetecting the number of experience information already stored in the high-return experience information storage unit, and if the number of experience samples already stored is lower than a third threshold value Nl1Sampling from a common experience sample storage unit, and if the sampling is not lower than a fourth threshold value Nl2Then, one of the following two ways is selected to complete the sampling:
sampling from the high-return experience information storage unit with the probability P and performing parameter updating by using the obtained samples, and sampling from the common experience information storage unit with the probability (1-P) and performing parameter updating by using the obtained samples;
and sampling P samples in the sample set sampled every time from the high-return experience information storage unit, and sampling (1-P) samples from the common experience information storage unit.
For convenience of explanation, the following description will be given with reference to specific examples.
The conventional off-policy deep reinforcement learning algorithm shown in Fig. 1 uses a serial, synchronous training mode with a single agent and a single environment. In contrast, the scheme provided by the embodiment of the invention, shown in Fig. 2, has a single agent interact with multiple environments of the same type in parallel and optimizes the storage and sampling of experience information accordingly. The main points are as follows:
the intelligent agent in the embodiment of the invention is based on a classic strategy gradient algorithm Deep Deterministic policy gradient, and mainly comprises a strategy network and a value network shown in figure 3. Setting general experience informationLength L of memory cellO=106(ii) a Length L of high return experience information storage unitH=104(ii) a The Baych Size adopted when updating the neural network in the intelligent agent is 128; threshold value N of number of stored experience information required for sampling from common experience information storage unitlimit64 × 128; threshold N of the number of stored experience information required for sampling from the high-reward experience information storage unitl=32*128。
First, a process pool containing a plurality of environment processes of the same type is created according to the method described above, where the environment processes may use the same or different parameter settings (for example, different difficulty levels in a game environment). The main process corresponding to the agent then interacts with the different environment processes at the same time; each time a step of interaction is completed, the agent checks, according to the experience sample sampling algorithm described above, whether the number of experience records stored in each storage unit meets the requirements and decides how to update its neural networks. More specifically, in each back-propagation of the neural networks, sampling is done from the high-return experience information storage unit with probability 0.1 and from the common experience information storage unit with probability 0.9. Meanwhile, the experience information obtained from each single-step interaction of an environment process is stored in the experience information cache unit corresponding to that process. After a complete episode of interaction is finished, the environment process stores the experience information generated by that episode according to the experience information screening-and-storing algorithm of the technical solution: the episode-accumulated return values of the 100 most recently stored episodes in the common experience information storage unit (i.e., N_new = 100) are taken, their 90th percentile (i.e., p_1 = 90) is used as R_high and their 10th percentile (i.e., p_2 = 10) as R_low; these are compared with the episode-accumulated return value R of the experience information just stored in the common experience information storage unit, and the corresponding storage strategy is selected.
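Collected for reference, the concrete settings of this worked example could be written as a single configuration; the key names are illustrative.

```python
# Worked-example settings gathered in one place; key names are illustrative.
example_config = {
    "ordinary_store_length": 10**6,     # L_O
    "high_return_store_length": 10**4,  # L_H
    "batch_size": 128,
    "n_limit_common": 64 * 128,   # records required before sampling from the common unit
    "n_l_high": 32 * 128,         # records required before sampling from the high-return unit
    "p_high": 0.1,                # probability of sampling a batch from the high-return unit
    "n_new": 100,                 # episodes used for the percentile estimate
    "p1": 90,                     # percentile defining R_high
    "p2": 10,                     # percentile defining R_low
}
```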
It should be noted that the specific parameter values mentioned in the above example are only illustrative and not limiting; in practical applications, users can set the parameter values according to actual needs or experience.
Use of the invention is detectable to a certain extent; a specific detection scheme is as follows:
firstly, detecting the process number of the related programs. If a main process and a plurality of environment processes interacting with the main process exist in the related program, the technical scheme related to the invention patent is probably used.
Second, inspect the read/write behaviour of the program's processes and storage units. If the program contains several storage units with a high read/write frequency, among which two have a read frequency clearly higher than their write frequency, the technical solution of this patent is probably in use.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A parallel-training different-strategy deep reinforcement learning method is characterized by comprising the following steps:
creating an environment process pool, wherein the environment process pool comprises a plurality of environment processes of the same type;
an agent to be trained by deep reinforcement learning serves as the main process, sends an interaction request to the environment process pool, and selects two or more environment processes to start interacting simultaneously according to the information returned by the environment processes;
during the interaction between the main process and the current environment process, the environment process stores the experience information generated at each interaction step in the experience information cache unit corresponding to the current environment process; each time n pieces of single-step interaction information have been added to the cache unit of an environment process, the main process samples from the experience storage units based on an experience sample sampling algorithm and updates the corresponding parameters of the agent;
when an episode of the interaction between the main process and the current environment process ends, the environment process screens the experience information generated during that episode based on an experience sample screening-and-storing algorithm and either stores it in the corresponding experience storage unit or deletes it directly;
the method for filtering and storing the experience information generated by the interaction of the round based on the experience sample filtering and storing algorithm by the environment process to the corresponding experience storage unit comprises the following steps:
the experience storage unit comprises a common experience information storage unit and a high return experience information storage unit; the length of the common experience information storage unit is recorded as LOThe number of experience information corresponding to the single-step interaction stored in the common experience information storage unit is recorded as NO(ii) a The length of the high-return experience information storage unit is recorded as LHThe number of experience information corresponding to the single-step interaction stored in the high-reward experience information storage unit is recorded as NH
storing the information currently held in the experience information cache unit into the common experience information storage unit, and updating N_O; comparing the updated N_O with a first threshold N_limit;
if N_O is less than the first threshold N_limit, ending the storing operation;
if N_O is greater than the first threshold N_limit, computing the p_1-th percentile value R_high and the p_2-th percentile value R_low (with p_1 > p_2) of the episode-accumulated return values of the N_new most recently stored episodes of experience information; denoting the episode-accumulated return value of the experience information currently stored in the common experience information storage unit by R: when R > R_high, copying the experience information of that episode into the high-return experience information storage unit; when R_low ≤ R ≤ R_high, copying the experience information of that episode into the high-return experience information storage unit with probability p = (R - R_low)/(R_high - R_low); when R ≤ R_low, ending the storing operation.
2. The parallel-training different-strategy deep reinforcement learning method according to claim 1, wherein the agent, acting as the main process, sends an interaction request to the environment process pool; if the pool contains an environment process that is not in an interacting state, the identifier of the corresponding environment process is returned to the main process, and the main process sets the environment identifier of that process from the idle state to the occupied state and interacts with it;
and if no environment process in the pool is free of interaction, an environment identifier in the occupied state is returned to the main process as a wait signal, and the main process chooses, according to the wait signal, either to wait or to keep sending interaction requests to the environment process pool.
3. The method according to claim 1, wherein the experience information comprises: state values, action values, return values, an interaction-termination flag, and a cumulative return value.
4. The parallel-training different-strategy deep reinforcement learning method according to claim 1, wherein the main process sampling from the experience storage units based on the experience sample sampling algorithm and updating the corresponding parameters of the agent comprises:
when the main process prepares to update the agent's parameters, checking the number N_O of experience records already stored in the common experience information storage unit and comparing it with a second threshold N_l: if N_O < N_l, skipping the sampling and the parameter update; if N_O ≥ N_l, checking the number of experience records already stored in the high-return experience information storage unit; if the number of stored experience samples in the high-return experience information storage unit is below a third threshold N_l1, sampling from the common experience information storage unit, and if it is not below a fourth threshold N_l2, choosing one of the following two ways to complete the sampling:
sampling from the high-return experience information storage unit with the probability P and performing parameter updating by using the obtained samples, and sampling from the common experience information storage unit with the probability (1-P) and performing parameter updating by using the obtained samples;
or, within each sampled batch, drawing a fraction P of the samples from the high-return experience information storage unit and a fraction (1 - P) from the common experience information storage unit.
5. The parallel-training different-strategy deep reinforcement learning method according to claim 1, wherein the common experience information storage unit and the high-return experience information storage unit are both of fixed length and store experience information in a FIFO manner; the length refers to the maximum number of single-step experience records that can be stored.
6. The method of claim 1, wherein when an episode of the interaction between the main process and the current environment process ends, the main process releases the current environment process and resets its state to idle; the reset environment process resets its environment and waits for the next interaction to start.
CN201810040895.XA 2018-01-16 2018-01-16 Different strategy deep reinforcement learning method for parallel training Active CN108108822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810040895.XA CN108108822B (en) 2018-01-16 2018-01-16 Different strategy deep reinforcement learning method for parallel training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810040895.XA CN108108822B (en) 2018-01-16 2018-01-16 Different strategy deep reinforcement learning method for parallel training

Publications (2)

Publication Number Publication Date
CN108108822A CN108108822A (en) 2018-06-01
CN108108822B true CN108108822B (en) 2020-06-26

Family

ID=62220060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810040895.XA Active CN108108822B (en) 2018-01-16 2018-01-16 Different strategy deep reinforcement learning method for parallel training

Country Status (1)

Country Link
CN (1) CN108108822B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7016295B2 (en) * 2018-06-28 2022-02-04 三菱重工業株式会社 Decision-making devices, unmanned systems, decision-making methods, and programs
CN110888401B (en) * 2018-09-11 2022-09-06 京东科技控股股份有限公司 Combustion control optimization method and device for thermal generator set and readable storage medium
CN109523029B (en) * 2018-09-28 2020-11-03 清华大学深圳研究生院 Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method
CN110428057A (en) * 2019-05-06 2019-11-08 南京大学 A kind of intelligent game playing system based on multiple agent deeply learning algorithm
CN110531617B (en) * 2019-07-30 2021-01-08 北京邮电大学 Multi-unmanned aerial vehicle 3D hovering position joint optimization method and device and unmanned aerial vehicle base station
WO2020098823A2 (en) * 2019-12-12 2020-05-22 Alipay (Hangzhou) Information Technology Co., Ltd. Determining action selection policies of an execution device
WO2020098821A2 (en) 2019-12-12 2020-05-22 Alipay (Hangzhou) Information Technology Co., Ltd. Determining action selection policies of an execution device
SG11202010172WA (en) 2019-12-12 2020-11-27 Alipay Hangzhou Inf Tech Co Ltd Determining action selection policies of execution device
CN112926735B (en) * 2021-01-29 2024-08-02 北京字节跳动网络技术有限公司 Method, device, framework, medium and equipment for updating deep reinforcement learning model
CN114117752A (en) * 2021-11-10 2022-03-01 杭州海康威视数字技术股份有限公司 Method and system for training reinforcement learning model of intelligent agent

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN106779072A (en) * 2016-12-23 2017-05-31 深圳市唯特视科技有限公司 A kind of enhancing based on bootstrapping DQN learns deep search method
CN107209872A (en) * 2015-02-06 2017-09-26 谷歌公司 The distributed training of reinforcement learning system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018005739A (en) * 2016-07-06 2018-01-11 株式会社デンソー Method for learning reinforcement of neural network and reinforcement learning device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN107209872A (en) * 2015-02-06 2017-09-26 谷歌公司 The distributed training of reinforcement learning system
CN106779072A (en) * 2016-12-23 2017-05-31 深圳市唯特视科技有限公司 A kind of enhancing based on bootstrapping DQN learns deep search method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-agent Reinforcement Learning Based on Bidding; MENG Wei et al.; The 1st International Conference on Information Science and Engineering; 2009-12-28; pp. 4949-4952 *
Research on Parallel Reinforcement Learning Algorithms and Their Applications (并行强化学习算法及其应用研究); MENG Wei et al.; Computer Engineering and Applications (计算机工程与应用); 2009-12-01; vol. 45, no. 34; pp. 25-28, 52 *

Also Published As

Publication number Publication date
CN108108822A (en) 2018-06-01

Similar Documents

Publication Publication Date Title
CN108108822B (en) Different strategy deep reinforcement learning method for parallel training
CN110168578B (en) Multi-tasking neural network with task-specific paths
CN105184367B (en) The model parameter training method and system of deep neural network
CN110647294B (en) Storage block recovery method and device, storage medium and electronic equipment
CN106802772B (en) Data recovery method and device and solid state disk
CN109284233B (en) Garbage recovery method of storage system and related device
CN112819159A (en) Deep reinforcement learning training method and computer readable storage medium
CN111125519A (en) User behavior prediction method and device, electronic equipment and storage medium
CN113850364A (en) Non-transitory computer-readable recording medium, learning method, and information processing apparatus
CN112416255A (en) User writing speed control method, device, equipment and medium
CN116112563A (en) Dual-strategy self-adaptive cache replacement method based on popularity prediction
CN113268457B (en) Self-adaptive learning index method and system supporting efficient writing
CN117235088A (en) Cache updating method, device, equipment, medium and platform of storage system
CN117453123A (en) Data classification storage method and equipment based on reinforcement learning
EP3926547A1 (en) Program, learning method, and information processing apparatus
CN113268143A (en) Multimodal man-machine interaction method based on reinforcement learning
CN110533192B (en) Reinforced learning method and device, computer readable medium and electronic equipment
CN112801130A (en) Image clustering quality evaluation method, system, medium, and apparatus
CN110765360A (en) Text topic processing method and device, electronic equipment and computer storage medium
CN116893854A (en) Method, device, equipment and storage medium for detecting conflict of instruction resources
JP5931276B2 (en) Programmable display, its program
CN113741402A (en) Equipment control method and device, computer equipment and storage medium
CN110297977B (en) Personalized recommendation single-target evolution method for crowd funding platform
CN113439253B (en) Application cleaning method and device, storage medium and electronic equipment
CN117806837B (en) Method, device, storage medium and system for managing hard disk tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant