CN112734014A - Experience playback sampling reinforcement learning method and system based on confidence upper bound thought - Google Patents

Experience playback sampling reinforcement learning method and system based on confidence upper bound thought

Info

Publication number
CN112734014A
CN112734014A CN202110038613.4A
Authority
CN
China
Prior art keywords
experience
training sample
current
value
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110038613.4A
Other languages
Chinese (zh)
Inventor
刘帅
韩思源
王小文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202110038613.4A priority Critical patent/CN112734014A/en
Publication of CN112734014A publication Critical patent/CN112734014A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides an experience playback sampling reinforcement learning method and system based on a confidence upper bound thought, which include: acquiring experience obtained by interaction of an intelligent agent with the environment, and storing the experience data into an experience playback pool; when the current training strategy is updated, randomly selecting experiences from the experience playback pool according to the priority probability to generate a candidate training sample set; selecting a training sample set according to the confidence upper bound value of each candidate training sample; and updating the parameters of the neural network used for function approximation according to the training sample data. The technical scheme of the disclosure can be combined with any off-line RL algorithm, so that the problems of insufficient sample utilization and low learning efficiency of update algorithms in the related art are alleviated to a certain extent, the sampling efficiency is effectively improved, and the generalization capability of algorithm updates is further improved.

Description

Experience playback sampling reinforcement learning method and system based on confidence upper bound thought
Technical Field
The disclosure belongs to the technical field of reinforcement learning, and particularly relates to an experience playback sampling reinforcement learning method and system based on a confidence upper bound thought.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Deep reinforcement learning is an important research direction in the field of artificial intelligence, in which an agent autonomously learns the optimal action-execution strategy through continuous interaction with the environment so as to maximize its cumulative reward. Deep reinforcement learning methods have enjoyed tremendous success in a number of areas and tasks, including video games, the game of Go, and robotic control. Since the great potential of deep reinforcement learning has not been fully exploited, several efforts have been devoted in recent years to studying its feasibility and generalization in different application environments. However, existing deep reinforcement learning algorithms still lack data efficiency, and even a simple learning task requires a large amount of environment interaction. The high cost and low fault tolerance of real environments make it difficult for an agent to interact with the environment extensively, greatly limiting the exploration and application of algorithms in real scenes. Likewise, it is also important to improve learning efficiency in complex simulation environments. Thus, one of the biggest challenges in deep reinforcement learning is to let the agent learn efficiently in the application while consuming little time and few resources. Even so, much current research on deep reinforcement learning focuses on improving performance within the computational budget of a single machine, and the problem of how to best utilize more resources has not been adequately addressed.
The experience playback method alleviates this problem to some extent. During the learning process, the agent stores its interaction information with the environment, namely experiences, in a playback buffer pool, and then uniformly and randomly selects part of the experiences for playback to update the control strategy. Unlike online reinforcement learning agents that discard incoming data immediately after an update, the experience playback method allows the agent to learn from data generated by previous versions of the strategy, so one experience can be used for more than one update, thereby breaking the temporal-correlation constraint; this is particularly useful when training neural network function approximators with stochastic gradient descent algorithms that rest on the independent and identically distributed assumption. For example, the deep Q learning algorithm uses a large sliding-window replay memory from which samples are drawn uniformly at random, and each experience is revisited 18 times on average. The experience playback method stabilizes the training of the value function represented by the deep neural network, trading relatively cheap computation and memory for the expensive environment interaction that learning from a large amount of experience would otherwise require, and effectively improves data efficiency.
Judging from the human learning process, different experiences have different importance for learning a strategy. However, the original experience playback method samples uniformly with equal probability from the playback pool and does not consider the importance of different samples to strategy optimization, so a number of efforts have been directed at improving the original experience playback method. Because the importance of a sample has no definite quantitative measure, existing methods usually design an importance index based on qualitative analysis.
The prioritized experience replay (PER) method extends the classical prioritized sweeping idea and measures the importance of a sample with a biased index based on the time sequence difference error (TD-error). The key idea is that the agent can learn more efficiently from samples with higher uncertainty. The larger the index value of a sample, the higher its expected learning progress and the higher its sampling priority. However, the importance of a sample is obviously not determined by the time sequence difference error alone; it may also be related to factors such as the reward signal and the frequency with which the sample has been drawn, so the original prioritized experience replay method still leaves considerable room for improvement.
Q-Prop builds on the deep deterministic policy gradient method (DDPG) and uses the Taylor expansion of an off-policy critic as a control variate combined with the Monte Carlo gradient estimate of the policy, thereby reducing the gradient variance and improving the stability and sampling efficiency of DDPG. However, although Q-Prop offers a remedy for high sample complexity, the learning curve of its policy updates still suffers from large oscillations, because it inherits the high variance inherent in policy gradient methods. Furthermore, recent studies by Tucker et al. indicate that the observed performance improvement may be due to subtle implementation details rather than a better baseline function.
The hindsight experience replay (HER) technique is a sample-efficient learning method proposed for sparse rewards. In each round it replaces the original goal with a new goal (such as a state actually reached) to obtain new experiences and stores them in the experience playback pool, multiplying the amount of data by the number of new goals and effectively improving the sample efficiency of multi-goal tasks. While it avoids complex reward engineering and allows the agent to learn efficiently from sparse and binary rewards, it may require up to 50 steps per round before anything other than a failure reward is obtained. Furthermore, its performance improvement on single-goal tasks is limited.
The Remember and Forget Experience Replay (ReF-ER) technique uses the difference between the policy at the time a sample was collected and the current policy as the importance weight of the sample. Only samples close to the current policy are used to update the policy gradient, and the KL divergence between the new policy and the old policy that generated a sample is constrained from becoming too large; this improves sample quality while keeping the capacity of the experience playback pool unchanged, so the algorithm makes effective use of more good data. However, the method has to compute the closeness of all samples to the current policy at every update, and the same sample is assigned different importance under different policies, which contradicts actual human learning experience.
In summary, existing sampling techniques have many problems in design principle and generalization capability, which limits their range of application. Methods based on the experience playback technique still have considerable room for improvement; it is therefore necessary to improve the experience playback sampling method with respect to these problems, so as to increase the sampling efficiency and application potential of deep reinforcement learning algorithms.
Disclosure of Invention
In order to overcome the defects of the prior art, an experience playback sampling reinforcement learning method based on the confidence upper bound thought is provided, so as to improve the sampling efficiency and application potential of deep reinforcement learning algorithms.
In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
in a first aspect, an empirical playback sampling reinforcement learning method based on a confidence upper bound thought is disclosed, which comprises the following steps:
acquiring experience obtained by interaction of an intelligent agent and the environment, and storing the experience data into an experience playback pool;
when the current training strategy is updated, randomly selecting experience from the experience playback pool according to the priority probability to generate a candidate training sample set;
selecting a training sample set according to the confidence upper bound value of each candidate training sample;
and updating parameters of the neural network for function approximation according to the training sample data.
According to the further technical scheme, before the experience obtained by interaction of the intelligent agent and the environment is collected, the network parameters of the deep reinforcement learning algorithm, the current maximum time sequence difference error value and the initial observed value of the intelligent agent are initialized.
According to the further technical scheme, after initialization, at each time step, an intelligent agent and the environment interact to obtain experience, the priority value of each experience is set to be the current maximum priority value, and the experience is stored in an experience playback pool, wherein the method specifically comprises the following steps:
the intelligent agent obtains an observed value of the current moment from the environment;
the intelligent agent calculates the action selected at the current moment according to the current strategy and the observed value at the current moment;
the intelligent agent and the environment interactively execute actions, and the environment is transferred to the next state according to the actions of the intelligent agent and returns a reward signal, an observed value at the next moment and an index for judging whether the turn is terminated or not;
calculating the current maximum priority value by using the current maximum time sequence difference error, and setting the priority value corresponding to the time step experience as the current maximum priority value;
and adding data generated in the interaction process into the experience playback pool.
According to the further technical scheme, when the candidate training sample set is generated:
acquiring the sum of the priority values of the experiences in the current experience playback pool, and dividing the sum of the priority values evenly into λ·K parts;
taking one experience from each part according to the priority probability and adding it to the candidate training sample set.
According to the further technical scheme, a training sample set is selected according to the confidence upper bound value of each candidate training sample, and the method specifically comprises the following steps:
calculating a confidence upper bound value of each candidate training sample;
sorting confidence upper bound values from small to large, and selecting the first K experiences to be added into a training sample set;
updating network parameters according to the training sample set data;
calculating the time sequence difference error of each training sample, and storing the maximum value of the time sequence difference errors in all data;
calculating a loss function according to the time sequence difference error obtained by forward propagation, and performing gradient backward propagation;
and updating parameters of the neural network according to the gradient and the learning rate.
According to the further technical scheme, the time sequence difference error of each training sample is calculated, and when the maximum value of the time sequence difference errors in all data is stored, the training sample data is input into a neural network for forward propagation to obtain the time sequence difference error of each training sample;
and comparing the maximum time sequence difference error stored before training with the maximum time sequence difference error corresponding to the current training sample, and storing the maximum value between the maximum time sequence difference error and the maximum time sequence difference error as the maximum value of the experienced time sequence difference errors in the current experience playback pool.
In a second aspect, an empirical playback sampling reinforcement learning system based on confidence upper bound thought is disclosed, comprising:
the system comprises an acquisition module, a playback module and a processing module, wherein the acquisition module is used for collecting experience data generated by interaction between an agent and the environment and adding the experience data into an experience playback pool;
the sampling module is used for randomly selecting a plurality of experiences from the experience playback pool according to the priority probability to generate a candidate training sample set;
the ordering module is used for ordering the experience in the candidate training sample set according to the confidence upper bound value to generate a training sample set;
and the updating module is used for updating the parameters of the neural network according to the training sample set.
Preferably, the acquisition module comprises:
the first calculation unit is used for calculating the action selected at the current moment according to the observation value of the intelligent agent at the current moment and the current strategy;
the observation unit is used for interactively executing actions with the environment through the agent and observing the empirical data corresponding to the current time step, including: the environment transfers to the next state according to the action of the intelligent agent and returns to the intelligent agent a reward signal, the observed value at the next moment, and an index for judging whether the turn is terminated;
the second calculation unit is used for calculating the current maximum priority value according to the current maximum time sequence difference error and setting the priority value corresponding to the time step experience as the current maximum priority value;
and the first adding unit is used for adding the experience data generated by the current time step into the experience playback pool.
Preferably, the sampling module includes:
the segmentation unit is used for acquiring the number of experiences in the current experience playback pool when the current training strategy is updated, and dividing all the current experiences evenly into λ·K segments;
and the second adding unit is used for taking out one experience from each segment according to the priority probability of each experience and adding the experience to the candidate training sample set.
Preferably, the sorting module includes:
the third calculating unit is used for calculating a confidence upper bound value of each candidate training sample;
the sorting unit is used for sorting the experience in the candidate training set from small to large according to the confidence upper bound value;
a third adding unit for selecting the first K ordered experiences to be added into the training sample set.
Preferably, the update module includes:
the first updating unit is used for carrying out forward propagation of the neural network according to the observation value at the current moment, the action selected at the current moment and the observation value at the next moment so as to obtain the time sequence difference error of each training sample;
the comparison unit is used for comparing the maximum time sequence difference error corresponding to the previous moment with the time sequence difference error corresponding to each training sample at the current moment, wherein the larger time sequence difference error is set as the experienced maximum time sequence difference error in the current experience playback pool;
the fourth calculation unit is used for calculating a loss function according to the time sequence difference error obtained by forward propagation and carrying out gradient backward propagation;
the second updating unit is used for updating the parameters of the neural network according to the gradient and the learning rate;
the judging unit is used for judging whether the training result meets the termination requirement, and if so, terminating the training; otherwise, returning to the sampling module.
The above one or more technical solutions have the following beneficial effects:
the technical scheme disclosed by the invention can be combined with any offline RL algorithm, so that the problems of insufficient utilization of samples and low learning efficiency of the updating algorithm in the related technology are solved to a certain extent, the sampling efficiency is effectively improved, and the generalization capability of algorithm updating is further improved.
According to the technical scheme, the confidence upper bound thought is introduced into the prior experience playback sampling technique, and, without increasing extra computational complexity or storage, historical information is fully utilized and the sampling efficiency and sample utilization rate are improved, thereby strengthening the exploration capability of the algorithm and improving the training speed and generalization capability of the reinforcement learning algorithm. The core idea of the present invention can be described simply as increasing the sampling probability of experiences that have been sampled only a few times while still taking the expected learning progress of each experience into account. The improved experience playback sampling strategy can be used in the implementation of any off-line reinforcement learning algorithm, so it can be applied to many fields and tasks, can significantly improve the learning efficiency of training, and helps improve the generalization capability of the algorithm. For more complex neural networks or reinforcement learning training tasks, the increase in neural network learning efficiency may be particularly significant.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a schematic diagram of a reinforcement learning system in which an agent interacts with an environment;
FIG. 2 is a schematic diagram of an empirical replay sampling reinforcement learning strategy based on confidence upper bound concepts;
FIG. 3 is a flow diagram of an empirical replay sampling reinforcement learning strategy based on confidence upper bound concepts;
FIG. 4 is a flow diagram of an example process for sampling empirical data based on precedence probabilities and confidence upper bound values;
FIG. 5 is a block diagram of an empirical playback sampling reinforcement learning updating apparatus based on confidence upper bound thought;
FIG. 6 is a graph of the average reward of the invention in the Atari Pong (Pong-v0) experiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
The experience playback sampling technique is an important factor affecting training efficiency and learning effectiveness in reinforcement learning, especially in deep reinforcement learning. Existing sampling techniques have many problems in design principle and generalization capability, which limits the range of application of the algorithms. Therefore, aiming at the prior art's failure to consider the influence of a sample's historical information on its training importance, the invention provides an improved experience playback sampling strategy, so that samples trained fewer times are preferentially sampled for training on the premise of considering the expected learning progress, thereby further improving sampling efficiency and algorithm stability. The strategy provided by the invention can be combined with any off-line reinforcement learning algorithm and executed in different reinforcement learning task environments, including simulated environments and real environments. Agents in a simulated environment can be realized by one or more computer program simulations, and the related tasks include but are not limited to: enabling simulated players to win video games, navigating drones and unmanned vehicles, reaching Nash equilibrium in game scenarios, and the like. Agents in a real environment rely on mechanical structures that can interact with real-environment information, and the related tasks include but are not limited to: using a robot arm to complete operations such as pushing, pulling and placing, navigation of autonomous or semi-autonomous vehicles such as unmanned aerial vehicles and unmanned ground vehicles, drone swarm combat drills, and the like.
In the embodiment of the present section, a general deep reinforcement learning system is taken as an example, and as shown in fig. 1, the system receives observation information of an agent on an environment and selects an action from an executable action set corresponding to a current state to complete interaction between the agent and the environment. The invention provides an empirical playback sampling reinforcement learning strategy based on a confidence upper bound thought, which is shown in figure 2.
Example one
The embodiment discloses an empirical playback sampling reinforcement learning method based on a confidence upper bound thought, which comprises the following steps of: acquiring experience obtained by interaction of an intelligent agent and the environment, and storing the experience data into an experience playback pool; when the current training strategy is updated, randomly selecting lambda.K experiences from the experience playback pool according to the priority probability to generate a candidate training sample set; selecting a training sample set according to the confidence upper bound value of each candidate training sample; and updating parameters of the neural network for function approximation according to the training sample data.
In a specific implementation example, the purpose of the invention is realized by the following technical scheme:
as shown in fig. 2, an empirical replay sampling reinforcement learning strategy based on a confidence upper bound thought includes:
step 1: network parameters of randomly initializing deep reinforcement learning algorithm and initial observed value o of intelligent agent0And empirically setting the current maximum timing difference error value errormax
The time sequence difference error of each piece of empirical data is an index, generated during training with the neural network, that describes the difference between the current strategy and the target strategy. It should be noted that when the current time step is the last time step of the corresponding round, the value of the time step under the target strategy is simply the reward of that time step; otherwise, it is composed of the reward of the time step and the output of the neural network.
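For illustration only, a minimal sketch of this target-value rule, assuming a Q-learning-style value estimate; the function name and the discount factor gamma are assumptions and not taken from the patent text:

```python
def td_target(reward, next_value, done, gamma=0.99):
    """Target value for one time step.

    If the current time step is the last step of its round (done), the target
    is just the reward; otherwise it is the reward plus the discounted value
    output by the (target) neural network for the next time step.
    """
    return reward if done else reward + gamma * next_value
```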
Step 2: and at each time step, the intelligent agent and the environment interact to obtain experience, the priority value of each experience is set as the current maximum priority value, and the experience is stored in an experience playback pool.
Wherein the priority value may be obtained by adding a predetermined constant to the absolute value of the time sequence difference error, or by using the reciprocal of the rank obtained when experiences are ordered by their time sequence difference errors.
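A brief sketch of the two priority variants mentioned above, proportional (absolute TD error plus a small constant) and rank-based (reciprocal of the rank under TD-error ordering); the constant eps and the use of NumPy are assumptions:

```python
import numpy as np

def proportional_priority(td_error, eps=1e-6):
    # p_i = |TD-error_i| + eps, so every experience keeps a non-zero priority
    return abs(td_error) + eps

def rank_based_priorities(td_errors):
    # p_i = 1 / rank(i), where rank 1 is the experience with the largest |TD-error|
    order = np.argsort(-np.abs(np.asarray(td_errors, dtype=np.float64)))
    ranks = np.empty(len(order), dtype=np.float64)
    ranks[order] = np.arange(1, len(order) + 1)
    return 1.0 / ranks
```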
Step 201: and the intelligent agent acquires the observed value of the current moment from the environment.
Step 202: and the intelligent agent calculates the action selected at the current moment according to the current strategy and the observed value at the current moment.
Step 203: and the intelligent agent and the environment perform actions in an interactive way, and the environment is transferred to the next state according to the actions of the intelligent agent and returns a reward signal, an observed value at the next moment and an index for judging whether the turn is terminated or not.
Step 204: and calculating the current maximum priority value by using the current maximum time sequence difference error, and setting the priority value corresponding to the time step experience as the current maximum priority value.
Step 205: and adding data generated in the interaction process into the experience playback pool.
In a specific embodiment, the step 2 includes: at each time step t, the agent receives an observation o_t of the environment state and inputs this observed value into the neural network corresponding to the current strategy to obtain the action a_t selected at the current moment. The agent performs action a_t to complete the interaction with the environment; the environment transfers to the next state according to the action of the agent and returns to the agent the current reward signal r_t and an index done_t for judging whether the round is over. The agent then receives the next observation o_{t+1} of the environment state. Next, the current maximum time sequence difference error error_max is used to calculate the current maximum priority value p_max, and the priority value corresponding to the data of this time step is set to p_max. Each time the agent completes one interaction with the environment, the data e_t = (o_t, a_t, r_t, o_{t+1}, done_t) generated in the interaction process is added to the experience playback pool. In general, e_t is also called an experience. Because the experiences of an off-line reinforcement learning algorithm do not depend on the strategy at the current moment, i.e. the experience of each time step obeys the independent and identically distributed assumption, the experiences in the experience playback pool do not need to be stored separately per strategy; they only need to be stored in interaction order, and the experience playback pool can use efficient data structures such as arrays and matrices for access and other operations.
It should be noted that the capacity of the experience playback pool is a preset positive integer N; when the stored data reaches the capacity of the experience playback pool, existing data must be deleted so that the experience data generated at the next time step can be stored. Generally, which data to delete from the experience playback pool, and when, can be determined by the time of storage, i.e. the data stored the longest is deleted first to make room for new data; it can also be determined by the magnitude of the time sequence difference error, i.e. the data with the smallest time sequence difference error, that is, the data with the smallest expected learning potential, is deleted first to make room for new data.
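A minimal sketch of such a replay pool with capacity N, where a new experience is stored at the current maximum priority and, once the pool is full, either the oldest entry or the entry with the smallest priority (smallest expected learning potential) is evicted; the class and method names are illustrative assumptions:

```python
import collections

Experience = collections.namedtuple("Experience", "obs action reward next_obs done")

class ReplayPool:
    def __init__(self, capacity, evict="oldest"):
        self.capacity = capacity
        self.evict = evict        # "oldest" (FIFO) or "lowest_priority"
        self.data = []            # experiences, kept in interaction order
        self.priorities = []      # one priority value per stored experience

    def add(self, experience, max_priority):
        if len(self.data) >= self.capacity:
            if self.evict == "oldest":
                idx = 0           # delete the experience stored the longest
            else:                 # delete the smallest-priority (smallest TD-error) experience
                idx = min(range(len(self.priorities)), key=self.priorities.__getitem__)
            del self.data[idx]
            del self.priorities[idx]
        self.data.append(experience)
        self.priorities.append(max_priority)   # new experience gets the current maximum priority
```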
Step 3: when the current training strategy is updated, take λ·K experiences from the experience playback pool according to the priority probability, where λ ≥ 1.
Step 301: and acquiring the sum of the priority values of the experience in the current experience playback pool, and averagely dividing the sum of the priority values into lambda · K parts.
Step 302: one experience from each is taken and added to the set of candidate training samples according to the precedence probability.
In a specific embodiment, the step 3 includes: when the data in the experience playback pool reaches a certain amount (e.g. 10000), the training process is started, and the maximum number of training time steps is defined as T, with t = 1, …, T and N > T. The priority values of the experiences in the current experience playback pool are summed, and the sum is divided evenly into λ·K parts. Here K is a predetermined positive integer and λ is an index that determines the influence of the confidence upper bound value on sampling: a larger λ indicates a larger influence of the confidence upper bound value, i.e. experiences sampled fewer times are selected preferentially, while λ = 1 indicates that the confidence upper bound value is not used to influence the sampling result. The initial value is set to λ ≥ 1, and λ can be gradually annealed to 1 as the number of training iterations increases, using a linear or exponential annealing method.
One experience is taken from each part according to the priority probability and added to the candidate training sample set, where each experience i has the sampling priority probability

P(i) = p_i^α / Σ_k p_k^α,

in which α is a predetermined constant, k ranges over all experiences in the entire experience playback pool, and p_i is the priority value computed from the time sequence difference error, which describes the expected learning potential of the corresponding experience. α is an index that determines how strongly the priority influences sampling: when α = 0 the priority is not used to influence the sampling result, i.e. sampling is uniform with equal probability, and when α > 0 a larger α indicates a larger influence of the priority on sampling. A higher expected learning potential means a higher expected learning progress, i.e. a higher training value of the corresponding experience, which therefore obtains a higher priority probability.
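A sketch of the stratified sampling step just described: the total priority mass Σ_k p_k^α is split into λ·K equal segments and one experience is drawn from each segment in proportion to p_i^α. The variable names, the default α, and the use of NumPy are assumptions:

```python
import numpy as np

def sample_candidates(priorities, lam, K, alpha=0.6, rng=None):
    """Return lambda*K candidate indices, one drawn from each priority segment."""
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    cum = np.cumsum(p)                           # cumulative priority mass
    n_segments = int(lam * K)
    edges = np.linspace(0.0, cum[-1], n_segments + 1)
    candidates = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        u = rng.uniform(lo, hi)                  # a point inside this segment of mass
        idx = int(np.searchsorted(cum, u))       # experience whose mass interval covers it
        candidates.append(min(idx, len(p) - 1))
    return candidates
```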
Step 4: select a training sample set according to the confidence upper bound value of each candidate training sample.
Step 401: a confidence upper bound value is calculated for each candidate training sample.
Step 402: and sorting confidence upper bound values from small to large, and selecting the first K experiences to be added into the training sample set.
In a specific embodiment, the step 4 includes: calculating the confidence upper bound value of each candidate training sample, sorting the confidence upper bound values from small to large, and selecting the first K experiences to add to the training sample set (as shown in FIG. 4). The confidence upper bound value of experience i is computed from N_i(t), the number of times experience i has been used for training from the start of training up to time t (the explicit expression is given as a formula image in the original filing). This part is inspired by the upper confidence bound (UCB) method, in which the confidence interval of the agent's reward distribution gradually narrows as the number of samples increases, i.e. the standard deviation of the reward distribution becomes smaller and smaller; accordingly, the number of times each experience has been used for training is taken into account when selecting the training set. On the basis of the time sequence difference error, the importance of experiences that have been used only a few times is raised to a certain degree, so that more experiences can be fully learned. The method makes full use of historical information, strengthens the exploration capability of sample selection, and significantly improves the sampling efficiency and learning capability of the reinforcement learning algorithm.
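The exact confidence upper bound expression appears only as an image in the original filing, so the sketch below should be read as illustrating the selection mechanism only: it uses an assumed score that grows with the training count N_i(t), so that sorting in ascending order and keeping the first K favours experiences that have been trained fewer times. The score function, the constant c, and all names are assumptions:

```python
import math

def select_training_set(candidates, train_counts, t, K, c=1.0):
    """Keep the K candidates with the smallest (assumed) confidence-bound scores.

    candidates   -- indices produced by the priority-based candidate sampling
    train_counts -- train_counts[i] = N_i(t), the number of times experience i
                    has been used for training since the start of training
    t            -- current training time step
    """
    def score(i):
        # Assumed surrogate that increases with N_i(t): rarely-trained
        # experiences get small scores and survive the ascending sort.
        return c * math.sqrt((train_counts[i] + 1.0) / math.log(t + 2.0))
    chosen = sorted(candidates, key=score)[:K]
    for i in chosen:
        train_counts[i] += 1      # these experiences are now used once more for training
    return chosen
```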
Step 5: update the network parameters according to the training sample set data.
Step 501: and calculating the time sequence difference error of each training sample, and storing the maximum value of the time sequence difference errors in all the data.
Step 5001: inputting training sample data into a neural network, and carrying out forward propagation to obtain a time sequence difference error of each training sample.
Step 5002: and comparing the maximum time sequence difference error stored before training with the maximum time sequence difference error corresponding to the current training sample, and storing the maximum value between the maximum time sequence difference error and the maximum time sequence difference error as the maximum value of the experienced time sequence difference errors in the current experience playback pool.
Step 502: and calculating a loss function according to the time sequence difference error obtained by forward propagation, and performing gradient backward propagation.
Step 503: and updating parameters of the neural network according to the gradient and the learning rate.
In a specific embodiment, the step 5 includes: inputting the training sample data into the neural network of the algorithm and carrying out forward propagation to obtain the time sequence difference error of each training sample. The time sequence difference error of each piece of empirical data is an index, generated during training with the neural network, that describes the difference between the current strategy and the target strategy. It is typically expressed as the difference between the value of the time step under the target strategy and the value of the time step under the current strategy. The value of the time step under the target strategy is the output obtained by inputting the observed value of the next time step and the action selected at the next time step based on the current strategy into the neural network, plus the currently obtained reward. The value of the time step under the current strategy is the output obtained by inputting the current observed value and the currently selected action into the neural network. It should be noted that when the current time step is the last time step of the corresponding round, i.e. the index done_t for judging whether the round is terminated equals 1, the value of the time step under the target strategy is simply the reward of that time step; otherwise, it is composed of the reward of the time step and the output of the neural network.
The maximum time sequence difference error stored before training is compared with the maximum time sequence difference error among the current K training samples, and the larger of the two is stored as error_max, the maximum time sequence difference error of the experiences in the current experience playback pool. Then, a loss function is calculated from the time sequence difference errors obtained by forward propagation, and gradient backpropagation is carried out. In general, the loss function may take the form of a mean squared error function. Finally, the parameters of the algorithm's neural network are updated according to the gradient and the predefined learning rate.
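A condensed PyTorch-style sketch of step 5 for a DQN-like value network; the network objects, the discount factor, and the optimizer are assumptions rather than part of the patent:

```python
import torch
import torch.nn.functional as F

def update_step(q_net, target_net, optimizer, batch, error_max, gamma=0.99):
    """Forward pass -> TD errors, refresh of the stored maximum TD error,
    mean-squared-error loss, gradient backpropagation, parameter update."""
    obs, actions, rewards, next_obs, dones = batch                  # tensors for the K samples
    q_values = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones.float()) * next_q  # at terminal steps the target reduces to the reward
    td_errors = targets - q_values
    error_max = max(error_max, td_errors.abs().max().item())        # keep the larger maximum TD error
    loss = F.mse_loss(q_values, targets)                            # mean squared error loss
    optimizer.zero_grad()
    loss.backward()                                                 # gradient backpropagation
    optimizer.step()                                                # apply gradient with the learning rate
    return td_errors.detach(), error_max
```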
Step 6: and judging whether the training termination condition is reached, if not, returning to the step 3. The training termination condition may be a judgment of whether the maximum training time step is reached, or a judgment index set by the user according to the actual environment and the task requirement.
According to the technical scheme provided by the invention, by improving the experience playback sampling technique, the confidence upper bound thought is introduced into the sampling method, so that the sampling probability of experiences sampled many times is reduced while the sampling probability of experiences sampled few times is increased, and the sampling efficiency and sample utilization rate are improved without extra computational complexity or storage; this strengthens the exploration capability of the algorithm and improves the training capability of the reinforcement learning algorithm. The improved experience playback sampling strategy can be combined with any off-line reinforcement learning algorithm, so it can be applied to many fields and tasks, such as intelligent router path-optimization decisions, robot-arm grasping control, smart-grid energy management and economic dispatch decisions, and humanoid robot walking and jumping control in the MuJoCo simulation environment. It can significantly improve the learning efficiency of training and help improve the generalization capability of the algorithm.
For the energy management decision task of the smart grid, the agent is a microgrid, and the experience obtained by interaction between the agent and the environment takes the form of a quintuple e_t = (o_t, a_t, r_t, o_{t+1}, done_t), where o_t is the observed value of the agent, comprising (g_t, l_t, s_t, p_t), in which g_t is the local power generation of the current period, l_t is the local load, s_t is the local battery state, and p_t is the retail-market electricity price; a_t is the action of the agent, comprising (λ_t, u_t), where λ_t is the auction price chosen at the current moment and u_t is the amount of electric energy traded; r_t is the reward signal the agent obtains for performing the action, generally composed of the electric energy trading profit and some default penalty terms; o_{t+1} is the observation at the next moment after the action is executed; and done_t is the index for judging whether the round is over; for a day-ahead trading market, each time step is one hour and the round ends once the number of experiences reaches 24. The priority value corresponding to each experience is described by the absolute value of the difference between the target strategy value and the current strategy value; a batch of experience samples is drawn from the experience playback pool according to the priority probability, the training sample set is then obtained according to the confidence upper bound value of each sample, and the training sample set is used to update the parameters of the policy network and the value network, finally obtaining the optimal strategy that maximizes the benefit of the microgrid while satisfying constraints such as supply-demand balance.
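For the smart-grid example, the experience quintuple could be laid out as below; the field names mirror the symbols in the description, and everything else is an illustrative assumption:

```python
from typing import NamedTuple, Tuple

class GridObservation(NamedTuple):
    g: float   # g_t: local power generation in the current period
    l: float   # l_t: local load
    s: float   # s_t: local battery state
    p: float   # p_t: retail-market electricity price

class GridExperience(NamedTuple):
    obs: GridObservation          # o_t
    action: Tuple[float, float]   # a_t = (lambda_t auction price, u_t traded energy)
    reward: float                 # r_t: trading profit minus default penalty terms
    next_obs: GridObservation     # o_{t+1}
    done: bool                    # done_t: True once 24 one-hour time steps have elapsed
```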
To verify the effectiveness of the proposed method, a specific experiment was carried out on the Atari Pong game (Pong-v0), and the average reward curve shown in FIG. 6 was obtained. The experiment uses the deep Q learning method; the PER curve is the result obtained with the original prioritized experience replay sampling technique, and the UCB curve is the result of the method provided by the invention. Evidently, the method of the present invention makes the algorithm converge more quickly, showing that it improves sampling efficiency and increases the stability of the algorithm.
Example II
In order to achieve the above object, the present invention provides an experience playback sampling updating apparatus based on the above-mentioned confidence upper bound thought, as shown in FIG. 5, including: an acquisition module, used for collecting experience data generated by interaction between an agent and the environment and adding the experience data into an experience playback pool; a sampling module, used for randomly selecting a plurality of experiences from the experience playback pool according to the priority probability to generate a candidate training sample set; an ordering module, used for ordering the experiences in the candidate training sample set according to the confidence upper bound value to generate a training sample set; and an updating module, used for updating the parameters of the neural network according to the training sample set.
Wherein, the collection module includes:
the first calculation unit is used for calculating the action selected at the current moment according to the observation value of the intelligent agent at the current moment and the current strategy;
the observation unit is used for interactively executing actions with the environment through the agent and observing the empirical data corresponding to the current time step, including: the environment transfers to the next state according to the action of the intelligent agent and returns to the intelligent agent a reward signal, the observed value at the next moment, and an index for judging whether the turn is terminated;
the second calculation unit is used for calculating the current maximum priority value according to the current maximum time sequence difference error and setting the priority value corresponding to the time step experience as the current maximum priority value;
and the first adding unit is used for adding the experience data generated by the current time step into the experience playback pool.
Wherein, the sampling module includes:
the segmentation unit is used for acquiring the number of experiences in the current experience playback pool when the current training strategy is updated, and dividing all the current experiences evenly into λ·K segments;
and the second adding unit is used for taking out one experience from each segment according to the priority probability of each experience and adding the experience to the candidate training sample set.
Wherein, the sequencing module comprises:
the third calculating unit is used for calculating a confidence upper bound value of each candidate training sample;
the sorting unit is used for sorting the experience in the candidate training set from small to large according to the confidence upper bound value;
a third adding unit for selecting the first K ordered experiences to be added into the training sample set.
Wherein, the update module includes:
the first updating unit is used for carrying out forward propagation of the neural network according to the observation value at the current moment, the action selected at the current moment and the observation value at the next moment so as to obtain the time sequence difference error of each training sample;
the comparison unit is used for comparing the maximum time sequence difference error corresponding to the previous moment with the time sequence difference error corresponding to each training sample at the current moment, wherein the larger time sequence difference error is set as the experienced maximum time sequence difference error in the current experience playback pool;
the fourth calculation unit is used for calculating a loss function according to the time sequence difference error obtained by forward propagation and carrying out gradient backward propagation;
the second updating unit is used for updating the parameters of the neural network according to the gradient and the learning rate; the judging unit is used for judging whether the training result meets the termination requirement, and if so, terminating the training; otherwise, returning to the sampling module.
According to the embodiments of the disclosure, by improving the prior experience playback sampling technique, the confidence upper bound thought is introduced into the sampling technique: the sampling probability of experiences sampled only a few times is increased while the expected learning progress of each experience is still taken into account, the historical information generated during training is fully utilized, and the sampling efficiency and sample utilization rate are improved without extra computational complexity or storage, thereby strengthening the exploration capability of the algorithm and further improving the training speed and generalization capability of the reinforcement learning algorithm. The improved experience playback sampling strategy can be used in the implementation of any off-line reinforcement learning algorithm, so it can be applied to many fields and tasks, can significantly improve the learning efficiency of training, and helps improve the generalization capability of the algorithm. For more complex neural networks or reinforcement learning training tasks, the increase in neural network learning efficiency may be particularly significant.
EXAMPLE III
The present embodiment is directed to a computing device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program to implement the specific steps of the method according to the above embodiment.
Example four
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps of the method of the above-described embodiment example.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present disclosure.
Those skilled in the art will appreciate that the modules or steps of the present disclosure described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code executable by computing means, whereby the modules or steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof may be fabricated into a single integrated circuit module. The present disclosure is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. An experience playback sampling reinforcement learning method based on a confidence upper bound thought is characterized by comprising the following steps:
acquiring experience obtained by interaction of an intelligent agent and the environment, and storing the experience data into an experience playback pool;
when the current training strategy is updated, randomly selecting experience from the experience playback pool according to the priority probability to generate a candidate training sample set;
selecting a training sample set according to the confidence upper bound value of each candidate training sample;
and updating parameters of the neural network for function approximation according to the training sample data.
2. The experience playback sampling reinforcement learning method based on the confidence upper bound thought as claimed in claim 1, characterized in that before the experience obtained by interaction between the agent and the environment is collected, the network parameters of the deep reinforcement learning algorithm, the current maximum time sequence difference error value and the initial observed value of the agent are initialized.
3. The experience replay sampling reinforcement learning method based on the confidence upper bound thought as claimed in claim 2, wherein after initialization, at each time step, the agent and the environment interact to obtain experiences, the priority value of each experience is set to be the current maximum priority value, and the experiences are stored in an experience replay pool, specifically:
the intelligent agent obtains an observed value of the current moment from the environment;
the intelligent agent calculates the action selected at the current moment according to the current strategy and the observed value at the current moment;
the intelligent agent and the environment interactively execute actions, and the environment is transferred to the next state according to the actions of the intelligent agent and returns a reward signal, an observed value at the next moment and an index for judging whether the turn is terminated or not;
calculating the current maximum priority value by using the current maximum time sequence difference error, and setting the priority value corresponding to the time step experience as the current maximum priority value;
and adding data generated in the interaction process into the experience playback pool.
4. The method for reinforcement learning based on empirical replay sampling of confidence upper bound thought of claim 1, wherein when generating the candidate training sample set:
acquiring the sum of the priority values of the experiences in the current experience playback pool, and dividing the sum of the priority values evenly into λ·K parts;
taking one experience from each part according to the priority probability and adding it to the candidate training sample set.
5. The experience playback sampling reinforcement learning method based on the confidence upper bound thought as claimed in claim 1, characterized in that the training sample set is selected according to the confidence upper bound value of each candidate training sample, specifically:
calculating the confidence upper bound value of each candidate training sample;
sorting the confidence upper bound values from small to large, and selecting the first K experiences to add to the training sample set;
updating the network parameters according to the training sample set data;
calculating the temporal difference error of each training sample, and storing the maximum temporal difference error over all data;
calculating a loss function from the temporal difference errors obtained by forward propagation, and performing backward propagation of the gradient;
and updating the parameters of the neural network according to the gradient and the learning rate.
Preferably, when the temporal difference error of each training sample is calculated and the maximum temporal difference error over all data is stored, the training sample data is fed into the neural network for forward propagation to obtain the temporal difference error of each training sample;
and the maximum temporal difference error stored before training is compared with the maximum temporal difference error of the current training samples, and the larger of the two is stored as the maximum temporal difference error of the experiences in the current experience playback pool.
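The claim does not give the confidence-upper-bound formula itself, so the PyTorch sketch below assumes a UCB1-style bonus added to each candidate's stored priority, keeps the K experiences with the smallest bound (following the small-to-large ordering above), and then runs the TD-error, loss, gradient back-propagation and maximum-TD-error bookkeeping steps. The replay-count table visit_counts, the q_net/target_net pair and the Huber loss are illustrative assumptions rather than details taken from the patent.

```python
import math
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, pool, candidates, visit_counts,
               K, gamma=0.99, c=2.0):
    """Select K samples by confidence upper bound, then update the network (claim 5)."""
    total_visits = sum(visit_counts[i] for i in candidates) + 1

    def ucb(i):
        # Assumed UCB1-style bound: stored priority plus an exploration bonus that
        # shrinks the more often an experience has already been replayed.
        bonus = c * math.sqrt(math.log(total_visits) / (visit_counts[i] + 1))
        return pool.priorities[i] + bonus

    # Sort the bounds from small to large and keep the first K experiences.
    chosen = sorted(candidates, key=ucb)[:K]
    batch = [pool.data[i] for i in chosen]

    obs, act, rew, next_obs, done = (torch.as_tensor(x, dtype=torch.float32)
                                     for x in zip(*batch))
    act = act.long()

    # Forward propagation: current Q-values and bootstrapped one-step targets.
    q = q_net(obs).gather(1, act.view(-1, 1)).squeeze(1)
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * target_net(next_obs).max(dim=1).values
    td_error = target - q

    # Keep the running maximum TD error for later max-priority assignment.
    pool.max_td_error = max(pool.max_td_error, td_error.abs().max().item())

    # Loss from the TD errors, gradient backpropagation, parameter update with the learning rate.
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    for i in chosen:
        visit_counts[i] += 1
    return loss.item()
```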
6. An experience playback sampling reinforcement learning system based on the confidence upper bound thought, characterized by comprising:
an acquisition module, configured to collect the experience data generated by interaction between the agent and the environment and add the experience data to the experience playback pool;
a sampling module, configured to randomly select a plurality of experiences from the experience playback pool according to their priority probabilities to generate a candidate training sample set;
an ordering module, configured to order the experiences in the candidate training sample set according to their confidence upper bound values to generate a training sample set;
and an updating module, configured to update the parameters of the neural network according to the training sample set.
7. The system as claimed in claim 6, characterized in that the acquisition module comprises:
a first calculation unit, configured to calculate the action selected at the current moment according to the agent's observation of the current moment and the current strategy;
an observation unit, configured to execute the action in the environment through the agent and observe the experience data of the current time step, wherein the environment transitions to the next state according to the agent's action and returns to the agent a reward signal, the observation of the next moment and a flag indicating whether the episode has terminated;
a second calculation unit, configured to calculate the current maximum priority value from the current maximum temporal difference error and set the priority value of the experience at this time step to the current maximum priority value;
and a first adding unit, configured to add the experience data generated at the current time step to the experience playback pool.
Preferably, the sampling module comprises:
a segmentation unit, configured to obtain the number of experiences in the current experience playback pool when the current training strategy is updated, and to evenly divide all current experiences into λ·K segments;
and a second adding unit, configured to take one experience from each segment according to its priority probability and add it to the candidate training sample set.
Preferably, the ordering module comprises:
a third calculation unit, configured to calculate the confidence upper bound value of each candidate training sample;
a sorting unit, configured to sort the experiences in the candidate training set from small to large according to their confidence upper bound values;
and a third adding unit, configured to select the first K sorted experiences and add them to the training sample set.
8. The system as claimed in claim 6, characterized in that the updating module comprises:
a first updating unit, configured to perform forward propagation of the neural network according to the observation of the current moment, the action selected at the current moment and the observation of the next moment, so as to obtain the temporal difference error of each training sample;
a comparison unit, configured to compare the maximum temporal difference error of the previous moment with the temporal difference error of each training sample at the current moment, and to set the larger temporal difference error as the maximum temporal difference error of the experiences in the current experience playback pool;
a fourth calculation unit, configured to calculate a loss function from the temporal difference errors obtained by forward propagation and perform backward propagation of the gradient;
a second updating unit, configured to update the parameters of the neural network according to the gradient and the learning rate;
and a judging unit, configured to judge whether the training result meets the termination condition, terminate the training if so, and otherwise return to the sampling module.
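Read together, the modules of claims 6 to 8 map naturally onto a single driver class. The skeleton below is a structural sketch only: it reuses the hypothetical ReplayPool, collect_step, sample_candidates and train_step helpers from the earlier sketches, and the class name, the default values of λ and K, and the fixed-step run loop (standing in for the judging unit) are all illustrative.

```python
from collections import defaultdict

class UCBReplaySystem:
    """Structural sketch of the system of claims 6-8 (hypothetical names throughout)."""

    def __init__(self, env, agent, q_net, target_net, optimizer, lam=4, K=32):
        self.env, self.agent = env, agent
        self.q_net, self.target_net, self.optimizer = q_net, target_net, optimizer
        self.pool = ReplayPool()                 # experience playback pool
        self.visit_counts = defaultdict(int)     # replay counts used by the assumed UCB bonus
        self.lam, self.K = lam, K

    def acquire(self):
        # Acquisition module: one agent-environment interaction step.
        return collect_step(self.env, self.agent, self.pool)

    def sample(self):
        # Sampling module: lambda*K candidates drawn by priority probability.
        return sample_candidates(list(self.pool.priorities), self.lam, self.K)

    def update(self, candidates):
        # Ordering + updating modules: UCB selection, then one gradient step.
        return train_step(self.q_net, self.target_net, self.optimizer, self.pool,
                          candidates, self.visit_counts, self.K)

    def run(self, steps):
        # Judging unit simplified to a fixed step budget; in practice one would
        # warm up the pool before the first update.
        for _ in range(steps):
            self.acquire()
            self.update(self.sample())
```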
9. A computing device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-5 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the method of any one of claims 1-5.
CN202110038613.4A 2021-01-12 2021-01-12 Experience playback sampling reinforcement learning method and system based on confidence upper bound thought Pending CN112734014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110038613.4A CN112734014A (en) 2021-01-12 2021-01-12 Experience playback sampling reinforcement learning method and system based on confidence upper bound thought

Publications (1)

Publication Number Publication Date
CN112734014A (en) 2021-04-30

Family

ID=75591469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110038613.4A Pending CN112734014A (en) 2021-01-12 2021-01-12 Experience playback sampling reinforcement learning method and system based on confidence upper bound thought

Country Status (1)

Country Link
CN (1) CN112734014A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170032245A1 (en) * 2015-07-01 2017-02-02 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Providing Reinforcement Learning in a Deep Learning System
CN110399920A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学(深圳) A kind of non-perfect information game method, apparatus, system and storage medium based on deeply study
CN111461347A (en) * 2020-04-02 2020-07-28 中国科学技术大学 Reinforced learning method for optimizing experience playback sampling strategy

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Fengkai Ke et al., "A Priority Experience Replay Sampling Method Based on Upper Confidence Bound", ICDLT 2019 *
Yutaro Ogawa (小川雄太郎), 《边做边学深度强化学习》 (Learning Deep Reinforcement Learning by Doing), China Machine Press (机械工业出版社), 30 April 2020 *
Zhu Fei (朱斐) et al., "一种最大置信上界经验采样的深度Q网络方法" (A Deep Q-Network Method with Maximum Upper Confidence Bound Experience Sampling), Journal of Computer Research and Development (《计算机研究与发展》) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361132A (en) * 2021-06-28 2021-09-07 浩鲸云计算科技股份有限公司 Air-cooled data center energy-saving method based on deep Q learning block network
CN113361132B (en) * 2021-06-28 2022-03-15 浩鲸云计算科技股份有限公司 Air-cooled data center energy-saving method based on deep Q learning block network
CN113487039A (en) * 2021-06-29 2021-10-08 山东大学 Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
CN113487039B (en) * 2021-06-29 2023-08-22 山东大学 Deep reinforcement learning-based intelligent self-adaptive decision generation method and system
CN114009380A (en) * 2021-10-25 2022-02-08 湖北清江鲟鱼谷特种渔业有限公司 Sturgeon hatching method and system based on neural network model
CN114048833A (en) * 2021-11-05 2022-02-15 哈尔滨工业大学(深圳) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
CN114048833B (en) * 2021-11-05 2023-01-17 哈尔滨工业大学(深圳) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
CN114371729A (en) * 2021-12-22 2022-04-19 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN117313826A (en) * 2023-11-30 2023-12-29 安徽大学 Arbitrary-angle inverted pendulum model training method based on reinforcement learning
CN117313826B (en) * 2023-11-30 2024-02-23 安徽大学 Arbitrary-angle inverted pendulum model training method based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN112734014A (en) Experience playback sampling reinforcement learning method and system based on confidence upper bound thought
Bai et al. A model-based reinforcement learning with adversarial training for online recommendation
Anthony et al. Thinking fast and slow with deep learning and tree search
Xu et al. Learning to explore via meta-policy gradient
Bertsekas et al. Neuro-dynamic programming: an overview
WO2021159779A1 (en) Information processing method and apparatus, computer-readable storage medium and electronic device
CN112232478A (en) Multi-agent reinforcement learning method and system based on layered attention mechanism
Jiang et al. Monotonic robust policy optimization with model discrepancy
CN114742231A (en) Multi-objective reinforcement learning method and device based on pareto optimization
CN112613608A (en) Reinforced learning method and related device
CN113918826B (en) Processing method of release information, and training method and device of resource prediction model
Wu et al. Quality-similar diversity via population based reinforcement learning
Liu et al. Prioritized experience replay based on multi-armed bandit
Fan et al. Learnable behavior control: Breaking atari human world records via sample-efficient behavior selection
Huang et al. Split-level evolutionary neural architecture search with elite weight inheritance
Bagga et al. Learnable strategies for bilateral agent negotiation over multiple issues
Chadi et al. Understanding Reinforcement Learning Algorithms: The Progress from Basic Q-learning to Proximal Policy Optimization
Li et al. Introspective Reinforcement Learning and Learning from Demonstration.
US20200364555A1 (en) Machine learning system
Glatt et al. Case-based policy inference for transfer in reinforcement learning
Nabati et al. Representation-driven reinforcement learning
Opalic et al. A Deep Reinforcement Learning Scheme for Battery Energy Management
Ba et al. Monte Carlo Tree Search with variable simulation periods for continuously running tasks
Kővári et al. Enhanced Experience Prioritization: A Novel Upper Confidence Bound Approach
Zhang Pricing via Artificial Intelligence: The Impact of Neural Network Architecture on Algorithmic Collusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination