US20170032245A1 - Systems and Methods for Providing Reinforcement Learning in a Deep Learning System - Google Patents

Systems and Methods for Providing Reinforcement Learning in a Deep Learning System

Info

Publication number
US20170032245A1
Authority
US
United States
Prior art keywords
data
state
action
network
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/212,042
Inventor
Ian David Moffat Osband
Benjamin Van Roy
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Leland Stanford Junior University
Original Assignee
Leland Stanford Junior University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Leland Stanford Junior University filed Critical Leland Stanford Junior University
Priority to US15/212,042 priority Critical patent/US20170032245A1/en
Publication of US20170032245A1 publication Critical patent/US20170032245A1/en
Priority to US16/576,697 priority patent/US20200065672A1/en
Assigned to THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIVERSITY. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: Osband, Ian David Moffat; Van Roy, Benjamin

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N99/005

Definitions

  • This invention relates to deep learning networks including, but not limited to, artificial neural networks. More particularly, this invention relates to systems and methods for training deep learning networks from a set of training data using reinforcement learning.
  • Deep learning networks including, but not limited to, artificial neural networks are machine learning systems that receive data, extract statistics and classify results. These systems use a training set of data to generate a model in order to make data driven decisions to provide a desired output.
  • Deep learning networks are often used to solve problems in an unknown environment where the training dataset is used to ascertain the extent of the environment.
  • One manner of training a deep learning network is referred to as reinforcement learning in which the system takes a sequence of actions in order to maximize cumulative rewards.
  • In reinforcement learning, the system begins with an imperfect knowledge of the environment and learns through experience. As such, there is a fundamental trade-off between exploration and exploitation in that the system may improve its future rewards by exploring poorly understood states and actions and sacrificing immediate rewards.
  • Some other common exploration approaches are dithering strategies.
  • An example of a dithering strategy is ε-greedy.
  • In common dithering strategies, the approximated value of an action is a single value and the system picks the action with the highest value.
  • In some strategies, the system may also choose some actions at random.
  • Another common exploration strategy is inspired by Thompson sampling. In a Thompson sampling strategy there is some notion of uncertainty: a distribution is maintained over the possible values from the dataset, and the system explores by randomly selecting a policy according to the probability that the selected policy is the optimal policy.
  • Reinforcement learning with deep exploration is provided in the following manner in accordance with some embodiments of the invention.
  • a deep neural network is maintained.
  • a reinforcement learning process is applied to the deep neural network.
  • the reinforcement learning process is performed in the following manner in accordance with some embodiments.
  • A set of observed data and a set of artificial data are received.
  • For each of a number of episodes, the process samples a set of data that is a union of the set of observed data and the set of artificial data to generate a set of training data.
  • A state-action value function is then determined for the set of training data using a bootstrap process and an approximator.
  • The approximator estimates a state-action value function for a dataset.
  • For each time step in each episode, the process determines a state of the system for the current time step from the set of training data.
  • An action based on the determined state of the system and a policy mapping actions to the state of the system is selected by the process, and results for the action, including a reward and a transition state that result from the selected action, are determined.
  • Result data from the current time step that includes the state, the action, the reward, and the transition state is stored.
  • the set of the observed data is then updated with the result data from at least one time step of an episode at the conclusion of an episode.
  • the reinforcement learning process generates the set of artificial data from the set of observed data.
  • the artificial data is generated by sampling the set of observed data with replacement to generate the set of artificial data.
  • The artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each of the sampled state-action pairs stochastically optimistic rewards and random state transitions.
  • The reinforcement learning process maintains a training mask that indicates the result data from each time period in each episode to be used in training and updates the set of observed data by adding the result data from each time period of an episode indicated in the training mask.
  • The approximator is received as an input.
  • The approximator is read from memory.
  • The approximator is a neural network trained to fit a state-action value function to the data set via a least-squares iteration.
  • one or more reinforcement learning processes are applied to the deep neural network.
  • each of the reinforcement learning processes independently maintains a set of observed data.
  • the reinforcement learning processes cooperatively maintain the set of observed data.
  • a bootstrap mask that indicates each element in the set of observed data that is available to each of the reinforcement learning processes is maintained.
  • FIG. 1 illustrates various devices in a network that perform processes that provide systems and methods for providing reinforcement learning in a deep learning network in accordance with various embodiments of the invention.
  • FIG. 2 illustrates a processing system in a device that performs processes that provide systems and methods for providing reinforcement learning in a deep learning network in accordance with various embodiments of the invention.
  • FIG. 3 illustrates a deep neural network that uses processes providing reinforcement learning in deep learning networks in accordance with some embodiments of the invention.
  • FIG. 4 illustrates a state diagram of a deterministic chain representing an environment.
  • FIG. 5 illustrates planning and look ahead trees for exploring the deterministic chain shown in FIG. 4 in accordance with various approaches.
  • FIG. 6 illustrates a process for providing reinforcement learning in a deep learning network in accordance with an embodiment of the invention.
  • FIG. 7 illustrates an incremental process for providing reinforcement learning in a deep learning network in accordance with an embodiment of the invention.
  • FIG. 8 illustrates a deterministic chain of states in an environment of a problem.
  • FIG. 9 illustrates the results of application of various reinforcement learning approaches.
  • FIG. 10 illustrates results of a deep learning network using processes that provide reinforcement learning in accordance with an embodiment of the invention and results from a DQN network.
  • FIG. 11 illustrates a graph showing results of a deep learning network using processes that provide reinforcement learning in accordance with an embodiment of the invention learning various Atari games compared to a human player playing the games.
  • FIG. 12 illustrates graphs showing improvements to policies and rewards of various Atari games by deep learning network using systems and methods for providing reinforcement learning in accordance with an embodiment of the invention.
  • FIG. 13 illustrates a table of results for various deep learning networks, including a deep learning network that uses processes providing reinforcement learning in accordance with an embodiment of the invention.
  • deep learning networks are machine learning systems that use a dataset of observed data to learn how to solve a problem in a system where all of the states of the system, actions based upon states, and/or the resulting transitions are not fully known.
  • Examples of deep learning networks include, but are not limited to, deep neural networks.
  • actions taken by a system may impose delayed consequences.
  • Thus, the design of exploration strategies is more difficult than in action-response systems, such as multi-armed bandit problems, where there are no delayed consequences, because the system must establish a context.
  • The system observes a state of the environment, s_lt, and selects an action, a_lt, according to a policy π which maps the states to actions.
  • A reward, r_lt, and a state transition to state, s_lt+1, are realized in response to the action.
  • The goal of the system during exploration is to maximize the long-term sum of the expected rewards even though the system is unsure of the dynamics of the environment and the reward structure.
  • Deep exploration means exploration that is directed over multiple time steps. Deep exploration can also be called “planning to learn” or “far-sighted” exploration. Unlike exploration of multi-arm bandit problems, deep exploration requires planning over several time steps instead of the balancing of actions which are immediately rewarding or immediately informative in a directed exploration. For deep learning exploitation, an efficient agent should consider the future rewards over several time steps and not simply the myopic rewards. In exactly the same way, efficient exploration may require taking actions which are neither immediately rewarding, nor immediately informative.
  • Planning and look-ahead trees for several algorithmic exploration approaches to the MDP of this deterministic chain are shown in FIG. 5 .
  • tree 501 represents the possible decisions of a bandit algorithm
  • tree 502 represents the possible decisions of a dithering algorithm
  • tree 503 represents the possible decisions of a shallow exploration algorithm
  • tree 504 represents the possible decisions of a deep exploration algorithm.
  • Actions “left” and “right” are solid lines; rewarding states are at the leftmost and rightmost bottom nodes; and dashed lines indicate that the agent can plan ahead for either rewards or information.
  • Unlike bandit algorithms, a reinforcement learning agent can plan to exploit future rewards. The strategies that use direct exploration cannot plan ahead.
  • Reinforcement learning is a deep learning approach that differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
  • A common approach to reinforcement learning involves learning a state-action value function, Q, which for time, t, state, s, and action, a, provides an estimate, Q_t(s, a), of expected rewards over the remainder of the episode: r_{l,t} + r_{l,t+1} + . . . + r_{l,τ}.
  • A bootstrap principle is commonly used to approximate a population distribution by a sample distribution.
  • A common bootstrap takes as input a data set D and an estimator, γ.
  • The bootstrap generates a sample data set from the bootstrapped distribution that has a cardinality equal to that of D and is sampled uniformly with replacement from data set D.
  • The bootstrap sample estimate is then taken to be the estimator γ applied to this sampled data set.
  • A network that is an efficient and scalable system for generating bootstrap samples from a large and deep neural network includes a shared architecture with K bootstrapped heads (or exploration processes) branching off independently. Each head is trained only on a separate unique sub-sample of data that represents a single bootstrap sample γ(D′).
  • the shared network learns via a joint feature representation across all of the data, which can provide significant computational advantages at the cost of lower diversity between heads.
  • This type of bootstrap can be trained efficiently in a single forward/backward pass and can be thought of as data-dependent dropout, where the dropout mask for each head is fixed for each data point.
  • A parameterized estimate of the Q-value function, Q(s, a; θ), is used rather than a tabular encoding.
  • A separate neural network is used as an approximator function to estimate the parameterized value.
  • α is the scalar learning rate and y_t^Q is the target value r_t + γ max_a Q(s_{t+1}, a; θ⁻).
  • the system learns from sampled transitions from an experience buffer, rather than learning fully online.
  • The system uses a target network with parameters θ⁻ that are copied from the learning network (θ⁻ ← θ_t) only every τ time steps and then kept fixed in between updates.
  • A Double DQN system modifies the target y_t^Q and may help further, as shown in Equation (2).
  • A deep learning network that uses reinforcement learning as provided in accordance with embodiments of the invention modifies the learning process to approximate a distribution over Q-values via the bootstrap.
  • a deep learning network that uses reinforcement learning as provided in accordance with some embodiments of the invention samples a single Q-value function from an approximate posterior maintained by the system.
  • An exploration process then follows the policy which is optimal for that sample for the duration of the episode. This is a natural adaptation of the Thompson sampling heuristic to reinforcement learning that allows for temporally extended (or deep) exploration.
  • An exploration process for a deep learning network that uses reinforcement learning as provided in accordance with some embodiments of the invention may be efficiently implemented by building up K bootstrapped estimates of the Q-value function in parallel.
  • Each one of these value function heads Q_k(s, a; θ) is trained against a separate target network Q_k(s, a; θ⁻) such that each of Q_1, . . . , Q_K provides a temporally extended (and consistent) estimate of the value uncertainty via Thompson sampling distribution estimates.
  • Flags w_1, . . . , w_K ∈ {0,1} that indicate which heads are privy to which data may be maintained.
  • A bootstrap sample is made by selecting k ∈ {1, . . . , K} uniformly at random and following Q_k for the duration of that episode.
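  • As an illustration only, the following minimal sketch shows this episode-level sampling in Python, assuming a hypothetical environment object env with reset()/step() methods and a list q_heads of K trained Q-value heads, each mapping a state to a vector of per-action values:

        import random

        import numpy as np

        def run_episode(env, q_heads):
            """Pick one head uniformly at random and act greedily with respect to it
            for the whole episode (episode-level Thompson-style sampling)."""
            k = random.randrange(len(q_heads))        # bootstrap sample: head k chosen uniformly
            state = env.reset()
            transitions, done = [], False
            while not done:
                q_values = q_heads[k](state)          # per-action value estimates from head k
                action = int(np.argmax(q_values))     # follow Q_k greedily for this episode
                next_state, reward, done = env.step(action)
                transitions.append((state, action, reward, next_state))
                state = next_state
            return transitions                        # result data for later training updates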
  • Thompson sampling is often referred to as a bandit algorithm that takes a single sample from the posterior at every time step and chooses the action which is optimal for that time step.
  • A system samples a value function from its posterior. Naïve applications of Thompson sampling to reinforcement learning that resample every time step can be extremely inefficient.
  • Such an agent would have to commit to a sample for several time steps in order to achieve deep exploration.
  • PSRL does commit to a sample for several steps and provides state of the art guarantees.
  • PSRL still requires solving a single known MDP, which will usually be intractable for large systems.
  • A deep learning network that uses reinforcement learning provided in accordance with some embodiments of the invention approximates this approach, committing to a sample for several steps, and performs exploration via randomized value functions sampled from an approximate posterior.
  • A deep learning network that uses reinforcement learning provided in accordance with some embodiments of the invention recovers state-of-the-art guarantees in the setting with tabular basis functions, but the performance of these systems is crucially dependent upon a suitable linear representation of the value function.
  • A deep learning network that uses reinforcement learning provided in accordance with some embodiments of the invention extends these ideas to produce a system that can simultaneously perform generalization and exploration with a flexible nonlinear value function representation. This approach is simple, general, and compatible with almost all advances in deep exploration via reinforcement learning at low computational cost and with few tuning parameters.
  • a reinforcement learning system in accordance with embodiments of this invention overcomes these problems by providing an exploration strategy that combines efficient generalization and exploration via leveraging a bootstrap process and artificial data.
  • the system receives a set of training data that includes observed data and artificial data.
  • The artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each state-action pair stochastically optimistic rewards and random state transitions.
  • The artificial data is generated by sampling a set of observed data with replacement to obtain a set of data having a number of elements that is approximately equal to or greater than the number of elements in the set of observed data.
  • the observed and artificial data are sampled to obtain a training sample set of data.
  • the training sample dataset includes M samples of data.
  • M is equal to or greater than a number of episodes to observe during an exploration process.
  • the sampling is performed in accordance with a known and/or a provided distribution.
  • a bootstrap process is applied to the union of the observed data and the artificial data to obtain a new distribution for the sample of data.
  • An approximator function is applied to the new distribution to generate a randomized state-value function.
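  • A minimal sketch of this step is given below for illustration, assuming the observed data is a list of transitions and approximator is any routine that fits a state-action value function to a data set (both names are hypothetical):

        import numpy as np

        def randomized_value_function(observed, approximator, m, seed=None):
            """Resample the observed data with replacement to form artificial data, pool the
            two sets, draw a bootstrap training sample of size m, and fit the approximator
            to it, yielding a randomized value function estimate."""
            rng = np.random.default_rng(seed)
            n = len(observed)
            artificial = [observed[i] for i in rng.integers(0, n, size=n)]  # sample with replacement
            pooled = observed + artificial                                  # union of observed and artificial data
            sample = [pooled[i] for i in rng.integers(0, len(pooled), size=m)]
            return approximator(sample)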
  • For each episode, the process observes a state of the system, s_lt, for a particular time period from the training sample dataset and selects an action to perform based upon the policy π.
  • The results, including a reward, r_lt, and a resulting transition state, s_lt+1, that result from the action are observed.
  • the state, action, reward and transition state are stored as result data for the episode. This is repeated for each time step in the episode.
  • the observed dataset is updated with the result data stored during the episode.
  • a training mask may be maintained that includes flags indicating the result data from particular time steps is to be used for training. The training mask is then read to determine the result data to add to the observed data.
  • multiple exploration processes may be performed concurrently.
  • the result data observed from each exploration process is shared with other exploration processes.
  • the observed dataset for each process is updated independently.
  • A bootstrap mask is maintained that indicates which elements of the observed dataset are available to each process.
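  • One possible realization of such a bootstrap mask is sketched below; the masking probability p is an assumption, not a detail taken from the disclosure:

        import numpy as np

        def make_bootstrap_mask(num_transitions, num_processes, p=0.5, seed=None):
            """Binary mask of shape (num_processes, num_transitions); entry [k, i] = 1 means
            exploration process k is allowed to train on observed transition i."""
            rng = np.random.default_rng(seed)
            return (rng.random((num_processes, num_transitions)) < p).astype(np.uint8)

        def visible_data(observed, mask, k):
            """Subset of the shared observed data available to exploration process k."""
            return [item for item, flag in zip(observed, mask[k]) if flag]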
  • A replay buffer may be maintained and played back to update the parameters of the value function network Q.
  • Network 100 includes a communications network 160 .
  • the communications network 160 is a network such as the Internet that allows devices connected to the network 160 to communicate with other connected devices.
  • Server systems 110 , 140 , and 170 are connected to the network 160 .
  • Each of the server systems 110 , 140 , and 170 is a group of one or more servers communicatively connected to one another via internal networks that execute processes that provide cloud services to users over the network 160 .
  • cloud services are one or more applications that are executed by one or more server systems to provide data and/or executable applications to devices over a network.
  • the server systems 110 , 140 , and 170 are shown each having three servers in the internal network. However, the server systems 110 , 140 and 170 may include any number of servers and any additional number of server systems may be connected to the network 160 to provide cloud services.
  • a deep learning network that uses systems and methods that provide reinforcement learning in accordance with an embodiment of the invention may be provided by process being executed on a single server system and/or a group of server systems communicating over network 160 .
  • the personal devices 180 are shown as desktop computers that are connected via a conventional “wired” connection to the network 160 .
  • the personal device 180 may be a desktop computer, a laptop computer, a smart television, an entertainment gaming console, or any other device that connects to the network 160 via a “wired” connection.
  • the mobile device 120 connects to network 160 using a wireless connection.
  • a wireless connection is a connection that uses Radio Frequency (RF) signals, Infrared signals, or any other form of wireless signaling to connect to the network 160 .
  • the mobile device 120 is a mobile telephone.
  • mobile device 120 may be a mobile phone, Personal Digital Assistant (PDA), a tablet, a smartphone, or any other type of device that connects to network 160 via wireless connection without departing from this invention.
  • An example of a processing system in a device that executes instructions to perform processes that interact with other devices connected to the network shown in FIG. 1 , in order to provide a deep learning network that uses systems and methods that provide reinforcement learning in accordance with various embodiments of this invention, is shown in FIG. 2 .
  • the processing device 200 includes a processor 205 , a non-volatile memory 210 , and a volatile memory 215 .
  • The processor 205 is a processor, microprocessor, controller, or a combination of processors, microprocessors, and/or controllers that performs instructions stored in the volatile memory 215 or the non-volatile memory 210 to manipulate data stored in the memory.
  • the non-volatile memory 210 can store the processor instructions utilized to configure the processing system 200 to perform processes including processes in accordance with embodiments of the invention and/or data for the processes being utilized.
  • the processing system software and/or firmware can be stored in any of a variety of non-transient computer readable media appropriate to a specific application.
  • a network interface is a device that allows processing system 200 to transmit and receive data over a network based upon the instructions performed by processor 205 .
  • Although a specific processing system is described with reference to FIG. 2 , any of a variety of processing systems in the various devices can be configured to provide the methods and systems in accordance with embodiments of the invention.
  • A deep learning network may concurrently run multiple exploration processes to achieve greater exploration of an environment.
  • a conceptual diagram of a deep learning network that has multiple concurrently running exploration processes in accordance with an embodiment of this invention is shown in FIG. 3 .
  • Deep learning network 300 operates on a frame 305 .
  • K number of heads or exploration processes interact with network 300 to explore the environment.
  • The bootstrap principle is used to approximate a population distribution by a sample distribution.
  • A common bootstrap takes as input a data set D and an estimator, γ.
  • The bootstrap generates a sample data set from the bootstrapped distribution that has a cardinality equal to that of D and is sampled uniformly with replacement from data set D.
  • The bootstrap sample estimate is then taken to be the estimator γ applied to this sampled data set.
  • System 300 is an efficient and scalable system for generating bootstrap samples from a large and deep neural network.
  • The network 300 includes a shared architecture with K bootstrapped heads (or exploration processes) branching off independently. Each head is trained only on a separate unique sub-sample of data that represents a single bootstrap sample γ(D′).
  • the shared network learns via a joint feature representation across all of the data, which can provide significant computational advantages at the cost of lower diversity between heads.
  • This type of bootstrap can be trained efficiently in a single forward/backward pass and can be thought of as data-dependent dropout, where the dropout mask for each head is fixed for each data point.
  • This expectation indicates that the initial state is s, the action is a, and thereafter actions are selected by the policy π.
  • A parameterized estimate of the Q-value function, Q(s, a; θ), is used rather than a tabular encoding.
  • A neural network is used to estimate the parameterized value.
  • α is the scalar learning rate and y_t^Q is the target value r_t + γ max_a Q(s_{t+1}, a; θ⁻).
  • A deep learning network that uses reinforcement learning as provided in accordance with embodiments of this invention modifies DQN to approximate a distribution over Q-values via the bootstrap.
  • A deep learning network that uses reinforcement learning as provided in accordance with embodiments of this invention samples a single Q-value function from its approximate posterior. The system follows the policy which is optimal for that sample for the duration of the episode.
  • systems and methods provide reinforcement learning by providing a deep exploration process.
  • The deep exploration process fits a state-action value function to a sample of data from a set of data that includes artificial data and observed data.
  • the system receives a set of training data that includes observed data and artificial data.
  • The artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each state-action pair stochastically optimistic rewards and random state transitions.
  • the artificial set of data is generated by sampling the observed set of data with replacement.
  • The artificial data includes M elements, where M is approximately equal to or greater than the number of elements in the observed dataset.
  • The use of the combination of observed and artificial data provides randomness in the samples to induce deep exploration.
  • An exploration process for providing reinforcement learning to a deep learning network in accordance with an embodiment of this invention is shown in FIG. 6 .
  • Process 600 performs exploration in M distinct episodes.
  • The number of episodes is received as an input and, in accordance with some other embodiments, the number of episodes may be set or selected by the process based on the size of the deep learning network.
  • a set of data including historical and artificial data is obtained ( 605 ).
  • the observed data is read from a memory.
  • the observed data is received from another system.
  • the artificial data is generated by sampling the observed data with replacement. In accordance with some other embodiments, the artificial data is generated from the observed data based on a known distribution of the original data. In accordance with some other embodiments, the artificial data is generated independent of the observed data. In accordance with some of these embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model; and assigning each state-action pair stochastically optimistic rewards and random state transitions. In accordance with some embodiments, the artificial dataset includes M elements of data where M is approximately equal to or greater than the number of elements in the observed dataset.
  • An approximator function is also received as input ( 610 ).
  • the approximator function may be set for process 600 and stored in memory for use.
  • the approximator estimates a state-action value function for a data set.
  • The approximator function may be a neural network trained to fit a state-action value function to the data set via a least-squares iteration.
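  • As one hypothetical instantiation of such an approximator, the sketch below fits a linear-in-features state-action value function by repeated least-squares regression onto bootstrapped targets; the feature map phi, the discount gamma, and the iteration count are assumptions, and a neural network fit by regression could play the same role:

        import numpy as np

        def least_squares_q(data, phi, num_actions, gamma=0.99, iterations=50):
            """Fit Q(s, a) ~= phi(s, a) . theta by iterating least-squares regression onto
            the targets r + gamma * max_b Q(s', b) computed from the current fit."""
            X = np.array([phi(s, a) for (s, a, r, s_next) in data])
            theta = np.zeros(X.shape[1])
            for _ in range(iterations):
                targets = np.array([
                    r + gamma * max(phi(s_next, b) @ theta for b in range(num_actions))
                    for (s, a, r, s_next) in data
                ])
                theta, *_ = np.linalg.lstsq(X, targets, rcond=None)   # least-squares update
            return lambda s, a: phi(s, a) @ theta                      # fitted state-action value function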
  • the observed and artificial data are sampled to obtain a training set of data ( 615 ).
  • the training data includes M samples of data.
  • M is equal to or greater than a number of episodes to observe.
  • the sampling is performed in accordance with a known and/or a provided distribution.
  • A bootstrap process is applied to the union of the observed data and the artificial data to obtain a new distribution and the approximator function is applied to the distribution to generate a randomized state-value function ( 620 ).
  • For each time step, a state, s, is observed based on the training data and an action, a, is selected based on the state of the system from the sample of data and the policy π ( 630 ).
  • The reward, r_lt, realized and the resulting transition state, s_lt+1, are observed ( 635 ).
  • The state, s, the action, a, and the resulting transition state, s_lt+1, are stored as resulting data in memory.
  • the selecting ( 630 ) and observing of the results are repeated until the time period ends ( 640 ).
  • the observed set of data is then updated with the results ( 650 ).
  • a training mask is maintained that indicates the result data from particular time steps of each episode to add to the observed set of data and the mask is read to determine which elements of the resulting data to add to the observed data. This is then repeated for each of the M episodes ( 645 ) and process 600 ends.
  • Process 600 for providing reinforcement learning for a deep learning network in accordance with an embodiment of the invention is described with respect to FIG. 6 .
  • Other methods that add, remove, and/or combine steps of process 600 may be performed without departing from the various embodiments of this invention.
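  • Purely as an illustrative sketch of process 600, and not a definitive implementation, the loop below ties the steps together; randomized_value_function is the hypothetical helper sketched earlier, and env and policy are assumed interfaces:

        def process_600(env, observed, approximator, policy, num_episodes, sample_size):
            """For each episode: fit a randomized Q function to a bootstrap sample of the
            pooled observed/artificial data (615/620), act with it for one episode while
            recording results (630/635), then fold the results into the observed data (650)."""
            for _ in range(num_episodes):
                q = randomized_value_function(observed, approximator, sample_size)
                state, done, results = env.reset(), False, []
                while not done:
                    action = policy(q, state)                       # select action from state and policy
                    next_state, reward, done = env.step(action)     # observe reward and transition
                    results.append((state, action, reward, next_state))
                    state = next_state
                observed.extend(results)                            # update the observed data set
            return observed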
  • Fitting a model like a deep neural network is a computationally expensive task. As such, it is desirable to use incremental methods to incorporate new data samples into the fitting process as the data is generated. To do so, parallel computing may be used.
  • a process that performs multiple concurrent explorations in accordance with an embodiment of the invention is shown in FIG. 7 .
  • Process 700 performs exploration in M distinct episodes for K separate exploration process.
  • The number of episodes is received as an input and, in accordance with some other embodiments, the number of episodes may be set or selected by the process based on the size of the deep learning network.
  • a set of data including historical and artificial data is obtained ( 705 ).
  • the observed data is read from a memory.
  • the observed data is received from another system.
  • the artificial data for one or more exploration processes is generated by sampling the observed data with replacement. In accordance with some other embodiments, the artificial data for one or more exploration processes is generated from the observed data based on a known distribution of the original data. In accordance with some other embodiments, the artificial data for one or more of the exploration processes is generated independent of the observed data. In accordance with some of these embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model; and assigning each state-action pair stochastically optimistic rewards and random state transitions. In accordance with some embodiments, the artificial dataset includes M elements of data where M is approximately equal to or greater than the number of elements in the observed dataset.
  • An approximator function is also received as input ( 705 ).
  • the approximator function may be set for process 700 and stored in memory for use.
  • the approximator estimates a state-action value function for a data set.
  • The approximator function may be a neural network trained to fit a state-action value function to the data set via a least-squares iteration.
  • One to K approximators may be used, where each of the approximators is applied to the training set data of one or more of the K exploration processes.
  • the observed and artificial data are sampled to obtain a training set of data for each of the K independent processes ( 715 ).
  • the training data for each exploration process includes M samples of data. In accordance with a number of these embodiments, M is equal to or greater than a number of episodes to observe.
  • The sampling for one or more exploration processes is performed in accordance with a known and/or a provided distribution. In accordance with some embodiments, one or more of the exploration processes may have the same set of artificial data.
  • a bootstrap process is applied to the union of the observed data and the artificial data to obtain a new distribution and the approximator function is applied to the distribution to generate a randomized state-value function ( 720 ).
  • Exploration is performed in the following manner. For each time step, a state, s, is observed and an action, a, is selected based on the state of the system from the sample of data and the policy π ( 730 ). The reward, r_lt, realized and the resulting transition state, s_lt+1, are observed ( 735 ). The state, s, the action, a, and the resulting transition state, s_lt+1, are stored as resulting data in memory.
  • the selecting ( 730 ) and observing of the results are repeated until the time period for the episode ends ( 740 ).
  • the observed set of data is individually updated for each exploration process with the results ( 750 ).
  • a bootstrap mask may be maintained to indicate the observed data that is available to each exploration process.
  • the observed data is updated with the data from all of the different K exploration processes.
  • a training mask is maintained that indicates the result data from particular time steps of each episode for each exploration process to add to the observed set of data and the mask is read to determine which elements of the resulting data to add to the observed data. This is then repeated for each of the M episodes ( 745 ) and process 700 ends.
  • Bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) explores in a manner similar to the provably-efficient algorithm PSRL, but bootstrapped DQN uses a bootstrapped neural network to approximate a posterior sample for the value. Unlike PSRL, bootstrapped DQN directly samples a value function and does not require further planning steps.
  • Bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) is similar to RLSVI, which is also provably-efficient, but with a neural network instead of a linear value function and a bootstrap instead of Gaussian sampling.
  • The analysis for the linear setting suggests that this nonlinear approach will work well as long as the distribution {Q_1, . . . , Q_K} remains stochastically optimistic, or at least as spread out as the “correct” posterior.
  • Bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) relies upon random initialization of the network weights as a prior to induce diversity.
  • the initial diversity is enough to maintain diverse generalization to new and unseen states for large and deep neural networks.
  • the initial diversity is effective for this experimental setting, but will not work in all situations.
  • In some situations, a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention may need to maintain some more rigorous notion of “prior”, potentially through the use of artificial prior data to maintain diversity.
  • One potential explanation for the efficacy of simple random initialization is that, unlike supervised learning or bandits, where all networks fit the same data, each of the K Q-value heads has a unique target network. This, together with stochastic minibatches and flexible nonlinear representations, means that even small differences at initialization may become bigger as the heads refit to unique TD errors.
  • a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention was evaluated across 49 Atari games on the Arcade Learning Environment.
  • the domains of these games are not specifically designed to showcase the tested deep learning network.
  • Many Atari games are structured so that small rewards always indicate part of an optimal policy, which may be crucial to the strong performance observed with dithering strategies.
  • The evaluations show that exploration via bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) produces significant gains versus ε-greedy in this setting.
  • Bootstrapped DQN reaches peak performance roughly similar to DQN.
  • the improved exploration of the bootstrapped DQN reaches human performance on average 30% faster across all games. This translates to significantly improved cumulative rewards through learning.
  • The bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) evaluated had a network structure identical to the convolutional structure of DQN, except that the bootstrapped DQN split into 10 separate bootstrap heads after the convolutional layers.
  • the convolutional part of the network (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) used is identical to the one used in other systems.
  • The input to the network is a 4×84×84 tensor with a rescaled, grayscale version of the last four observations.
  • The first convolutional (conv) layer has 32 filters of size 8 with a stride of 4.
  • The second conv layer has 64 filters of size 4 with stride 2.
  • The last conv layer has 64 filters of size 3.
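  • A sketch of this architecture in PyTorch is shown below for illustration only; the stride of the last conv layer, the 512-unit hidden width of each head, and the number of actions are assumptions rather than details taken from the disclosure:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class BootstrappedQNetwork(nn.Module):
            """Shared convolutional torso with several bootstrapped Q-value heads branching
            off after the convolutional layers."""
            def __init__(self, num_actions, num_heads=10):
                super().__init__()
                self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)    # 32 filters of size 8, stride 4
                self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)   # 64 filters of size 4, stride 2
                self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)   # 64 filters of size 3 (stride assumed)
                self.heads = nn.ModuleList([
                    nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions))
                    for _ in range(num_heads)
                ])

            def forward(self, x, head=None):
                x = F.relu(self.conv1(x))
                x = F.relu(self.conv2(x))
                x = F.relu(self.conv3(x))
                x = torch.flatten(x, start_dim=1)
                if head is not None:
                    return self.heads[head](x)              # Q-values from a single bootstrap head
                return [h(x) for h in self.heads]           # Q-values from every head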
  • Each of the networks tested were trained with RMSProp with a momentum of 0.95 and a learning rate of 0.00025.
  • the processes were trained for a total of 50 m steps per game, which corresponds to 200 m frames.
  • the processes were stopped every 1 m frames for evaluation.
  • the bootstrapped DQN used an ensemble voting policy.
  • the experience replay contains the 1 m most recent transitions.
  • the network was updated every 4 steps by randomly sampling a minibatch of 32 transitions from the replay buffer to use the exact same minibatch schedule as DQN.
  • An ε-greedy policy was used with ε annealed linearly from 1 to 0.01 over the first 1 m timesteps.
  • Bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) makes significant improvements to the cumulative rewards of DQN on Atari, while the peak performance is roughly similar. Furthermore, bootstrapped DQN without gradient normalization on each head typically learned even faster than the implementation with rescaling by 1/K, but the network was somewhat prone to premature and suboptimal convergence.
  • DQN is an implementation of DQN with the hyperparameters specified above, using the double Q-Learning update.
  • the peak final performance of DQN is similar under bootstrapped DQN to previous benchmarks.
  • The shared network architecture allows training of this combined network via backpropagation. Feeding K network heads to the shared convolutional network effectively increases the learning rate for this portion of the network. In some games, the increased learning rate leads to premature and sub-optimal convergence. The best final scores were achieved by normalizing the gradients by 1/K, but normalizing the gradients also slows early learning.
  • bootstrapped DQN drives efficient exploration in several Atari games.
  • Bootstrapped DQN generally outperforms DQN with ε-greedy exploration.
  • FIG. 10 demonstrates this effect for a diverse section of games.
  • bootstrapped DQN typically performs better. Bootstrapped DQN does not reach human performance on Amidar (DQN does) but does on Beam Rider and Battle Zone (DQN does not). To summarize this improvement in learning time, the number of frames required to reach human performance is considered. If bootstrapped DQN reaches human performance in 1/x frames of DQN, bootstrapped DQN has improved by x. FIG. 11 shows that Bootstrapped DQN typically reaches human performance significantly faster.
  • Bootstrapped DQN is able to learn much faster than DQN.
  • Graph 1201 of FIG. 12 shows that bootstrapped DQN also improves upon the final score across most games.
  • The real benefits of efficient exploration mean that bootstrapped DQN outperforms DQN by orders of magnitude in terms of the cumulative rewards through learning (shown in graph 1202 of FIG. 12 ).
  • performance is normalized relative to a fully random policy.
  • the most similar work to bootstrapped DQN presents several other approaches to improved exploration in Atari. For example, AUC-20 is optimized for a normalized version of the cumulative returns after 20 m frames.
  • bootstrapped DQN may be beneficial as a purely exploitative policy.
  • all of the heads are combined into a single ensemble policy, for example by choosing the action with the most votes across heads. This approach might have several benefits.
  • the ensemble policy can often outperform any individual policy.
  • The distribution of votes across heads gives a measure of the uncertainty in the optimal policy.
  • Bootstrapped DQN can know what it doesn't know. In an application where executing a poorly-understood action is dangerous, this could be crucial. The uncertainty in this policy is surprisingly interpretable: all heads agree at clearly crucial decision points, but remain diverse at other, less important steps.
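  • A minimal sketch of such an ensemble policy, assuming q_heads is a list of trained heads each mapping a state to per-action values:

        from collections import Counter

        import numpy as np

        def ensemble_action(q_heads, state):
            """Each head votes for its greedy action; the most common vote is executed and
            the fraction of agreeing heads serves as a rough measure of policy uncertainty."""
            votes = [int(np.argmax(head(state))) for head in q_heads]
            action, count = Counter(votes).most_common(1)[0]
            agreement = count / len(votes)       # 1.0 when all heads agree, lower when they disagree
            return action, agreement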

Abstract

Systems and methods for providing reinforcement learning for a deep learning network are disclosed. A reinforcement learning process that provides deep exploration is provided by a bootstrap process applied to a sample of observed and artificial data to facilitate deep exploration via a Thompson sampling approach.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • The current application is a Continuation-In-Part Application of U.S. patent application Ser. No. 15/201,284 filed Jul. 1, 2016 that in turn claims priority to U.S. Provisional Application No. 62/187,681, filed Jul. 1, 2015, the disclosures of which are incorporated herein by reference as if set forth herewith.
  • FIELD OF THE INVENTION
  • This invention relates to deep learning networks including, but not limited to, artificial neural networks. More particularly, this invention relates to systems and methods for training deep learning networks from a set of training data using reinforcement learning.
  • BACKGROUND OF THE INVENTION
  • Deep learning networks including, but not limited to, artificial neural networks are machine learning systems that receive data, extract statistics and classify results. These systems use a training set of data to generate a model in order to make data driven decisions to provide a desired output.
  • Deep learning networks are often used to solve problems in an unknown environment where the training dataset is used to ascertain the extent of the environment. One manner of training a deep learning network is referred to as reinforcement learning in which the system takes a sequence of actions in order to maximize cumulative rewards. In reinforcement learning, the system begins with an imperfect knowledge of the environment and learns through experience. As such, there is a fundamental trade-off between exploration and exploitation in that the system may improve its future rewards by exploring poorly understood states and actions and sacrificing immediate rewards.
  • Many approaches to reinforcement learning have been put forth. Most of the proposed approaches are designed based upon Markov Decision Processes (MDPs) with small finite state spaces. Some other approaches require solving computationally intractable planning tasks. These approaches are not practical in complex environments that require the system to generalize in order to operate properly. Thus, reinforcement learning approaches in large-scale applications have relied upon either statistically inefficient exploration strategies or no exploration at all.
  • Some other common exploration approaches are dithering strategies. An example of a dithering strategy is ε-greedy. In common dithering strategies, the approximated value of an action is a single value and the system picks the action with the highest value. In some strategies, the system may also choose some actions at random. Another common exploration strategy is inspired by Thompson sampling. In a Thompson sampling strategy there is some notion of uncertainty: a distribution is maintained over the possible values from the dataset, and the system explores by randomly selecting a policy according to the probability that the selected policy is the optimal policy.
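  • For illustration, a minimal sketch of the ε-greedy dithering strategy described above, with q_values assumed to be the per-action value estimates for the current state:

        import random

        import numpy as np

        def epsilon_greedy(q_values, epsilon):
            """With probability epsilon pick an action uniformly at random; otherwise pick
            the action with the highest estimated value."""
            if random.random() < epsilon:
                return random.randrange(len(q_values))
            return int(np.argmax(q_values))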
  • SUMMARY
  • The above and other problems are solved and an advance in the art is made by systems and methods for providing reinforcement learning in deep learning networks in accordance with some embodiments of the invention. Reinforcement learning with deep exploration is provided in the following manner in accordance with some embodiments of the invention. A deep neural network is maintained. A reinforcement learning process is applied to the deep neural network.
  • The reinforcement learning process is performed in the following manner in accordance with some embodiments. A set of observed data and a set of artificial data are received. For each of a number of episodes, the process samples a set of data that is a union of the set of observed data and the set of artificial data to generate a set of training data. A state-action value function is then determined for the set of training data using a bootstrap process and an approximator. The approximator estimates a state-action value function for a dataset. For each time step in each of the one or more episodes, the process determines a state of the system for the current time step from the set of training data. An action based on the determined state of the system and a policy mapping actions to the state of the system is selected by the process, and results for the action, including a reward and a transition state that result from the selected action, are determined. Result data from the current time step that includes the state, the action, the reward, and the transition state is stored. The set of the observed data is then updated with the result data from at least one time step of an episode at the conclusion of the episode.
  • In accordance with some embodiments, the reinforcement learning process generates the set of artificial data from the set of observed data. In accordance with many of these embodiments, the artificial data is generated by sampling the set of observed data with replacement to generate the set of artificial data. In accordance with a number of other embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each of the sampled state-action pairs stochastically optimistic rewards and random state transitions.
  • In accordance with some embodiments, the reinforcement learning process maintains a training mask that indicates the result data from each time period in each episode to be used in training and updates the set of observed data by adding the result data from each time period of an episode indicated in the training mask.
  • In accordance with some embodiments, the approximator is received as an input. In accordance with many embodiments, the approximator is read from memory. In accordance with a number of embodiments, the approximator is a neural network trained to fit a state-action value function to the data set via a least-squares iteration.
  • In accordance with some embodiments, one or more reinforcement learning processes are applied to the deep neural network. In accordance with many of these embodiments, each of the reinforcement learning processes independently maintains a set of observed data. In accordance with some other embodiments, the reinforcement learning processes cooperatively maintain the set of observed data. In accordance with some of these embodiments, a bootstrap mask that indicates each element in the set of observed data that is available to each of the reinforcement learning processes is maintained.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 illustrates various devices in a network that perform processes that provide systems and methods for providing reinforcement learning in a deep learning network in accordance with various embodiments of the invention.
  • FIG. 2 illustrates a processing system in a device that performs processes that provide systems and methods for providing reinforcement learning in a deep learning network in accordance with various embodiments of the invention.
  • FIG. 3 illustrates a deep neural network that uses processes providing reinforcement learning in deep learning networks in accordance with some embodiments of the invention.
  • FIG. 4 illustrates a state diagram of a deterministic chain representing an environment.
  • FIG. 5 illustrates planning and look ahead trees for exploring the deterministic chain shown in FIG. 4 in accordance with various approaches.
  • FIG. 6 illustrates a process for providing reinforcement learning in a deep learning network in accordance with an embodiment of the invention.
  • FIG. 7 illustrates an incremental process for providing reinforcement learning in a deep learning network in accordance with an embodiment of the invention.
  • FIG. 8 illustrates a deterministic chain of states in an environment of a problem.
  • FIG. 9 illustrates the results of application of various reinforcement learning approaches.
  • FIG. 10 illustrates results of a deep learning network using processes that provide reinforcement learning in accordance with an embodiment of the invention and results from a DQN network.
  • FIG. 11 illustrates a graph showing results of a deep learning network using processes that provide reinforcement learning in accordance with an embodiment of the invention learning various Atari games compared to a human player playing the games.
  • FIG. 12 illustrates graphs showing improvements to policies and rewards of various Atari games by deep learning network using systems and methods for providing reinforcement learning in accordance with an embodiment of the invention.
  • FIG. 13 illustrates a table of results for various deep learning networks, including a deep learning network that uses processes providing reinforcement learning in accordance with an embodiment of the invention.
  • DETAILED DISCUSSION
  • Turning now to the drawings, systems and methods for providing reinforcement learning to a deep learning network in accordance with various embodiments of the invention are disclosed. For purposes of this discussion, deep learning networks are machine learning systems that use a dataset of observed data to learn how to solve a problem in a system where all of the states of the system, actions based upon states, and/or the resulting transitions are not fully known. Examples of deep learning networks include, but are not limited to, deep neural networks.
  • Systems and methods in accordance with some embodiments of this invention that provide reinforcement learning do so by providing an exploration process for a deep learning network to solve a problem in an environment. In reinforcement learning, actions taken by a system may impose delayed consequences. Thus, the design of exploration strategies is more difficult than in action-response systems, such as multi-armed bandit problems, where there are no delayed consequences, because the system must establish a context. An example of a system that has delayed consequences is a system that interacts with an environment over repeated episodes, l, of length τ. In each time step, t=1, . . . , τ, of an episode, the system observes a state of the environment, s_lt, and selects an action, a_lt, according to a policy π which maps the states to actions. A reward, r_lt, and a state transition to state, s_lt+1, are realized in response to the action. The goal of the system during exploration is to maximize the long-term sum of the expected rewards even though the system is unsure of the dynamics of the environment and the reward structure.
  • Deep Learning
  • To understand a system that may have delayed consequences, a deep learning network needs to explore as many states of the system as possible to understand the state-action policy and the rewards associated with actions. Uncertainty estimates allow a system to direct an exploration process at potentially informative states and actions. In multi-arm bandit problems, directed exploration of the system rather than a dithering exploration generally characterizes efficient algorithms. However, directed exploration is not enough to guarantee efficiency in more complex systems with delayed consequences. Instead, the exploration must also be deep. Deep exploration means exploration that is directed over multiple time steps. Deep exploration can also be called “planning to learn” or “far-sighted” exploration. Unlike exploration of multi-arm bandit problems, deep exploration requires planning over several time steps instead of the balancing of actions which are immediately rewarding or immediately informative in a directed exploration. For deep learning exploitation, an efficient agent should consider the future rewards over several time steps and not simply the myopic rewards. In exactly the same way, efficient exploration may require taking actions which are neither immediately rewarding, nor immediately informative.
  • To illustrate this distinction, consider the simple deterministic chain {s_−3, . . . , s_+3} with a three-step horizon starting from state s_0 that is shown in FIG. 4. The Markov Decision Process (MDP) of the chain is known to the system a priori, with deterministic actions “left” and “right”. All states have zero reward, except for the leftmost state s_−3, which has a known reward of ε>0, and the rightmost state s_3, which is unknown. In order to reach either a rewarding state or an informative state within three steps from s_0, a system needs to plan a consistent strategy over several time steps.
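  • A sketch of this chain as a small environment is given below for illustration; the reward at the rightmost state is a parameter here only because it is unknown to the agent in the example:

        class DeterministicChain:
            """Deterministic chain of FIG. 4: states s_-3 .. s_+3, a three-step horizon from
            s_0, actions LEFT/RIGHT, a known small reward epsilon at the leftmost state and
            an unknown reward at the rightmost state."""
            LEFT, RIGHT = 0, 1

            def __init__(self, epsilon=0.1, right_reward=1.0, horizon=3):
                self.epsilon, self.right_reward, self.horizon = epsilon, right_reward, horizon

            def reset(self):
                self.state, self.t = 0, 0
                return self.state

            def step(self, action):
                self.state = max(-3, min(3, self.state + (1 if action == self.RIGHT else -1)))
                self.t += 1
                if self.state == -3:
                    reward = self.epsilon            # known reward on the left
                elif self.state == 3:
                    reward = self.right_reward       # unknown (to the agent) reward on the right
                else:
                    reward = 0.0
                return self.state, reward, self.t >= self.horizon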
  • Planning and look-ahead trees for several algorithmic exploration approaches to the MDP of this deterministic chain are shown in FIG. 5. In FIG. 5, tree 501 represents the possible decisions of a bandit algorithm, tree 502 represents the possible decisions of a dithering algorithm, tree 503 represents the possible decisions of a shallow exploration algorithm, and tree 504 represents the possible decisions of a deep exploration algorithm. In all of the trees 501-504, actions, including action “left” and action “right”, are solid lines; rewarding states are at the leftmost and rightmost bottom nodes; and dashed lines indicate that the agent can plan ahead for either rewards or information. As can be seen from trees 501-504, only a system that employs a deep exploration strategy, such as reinforcement learning with deep exploration, can plan ahead for both rewards and information. Unlike bandit algorithms, a reinforcement learning agent can plan to exploit future rewards. The strategies that use direct exploration cannot plan ahead.
  • Reinforcement Learning
  • Reinforcement learning is a deep learning approach that differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). A common approach to reinforcement learning involves learning a state-action value function, Q, which for time, t, state, s, and action, a, provides an estimate, Qt(s, a), of the expected rewards over the remainder of the episode: rlt+rl(t+1)+ . . . +rlτ. Given a state-action value function, Q, the system selects an action that maximizes Qt(s, a) when at state s and time t. Most reinforcement learning systems provide an exploration strategy that balances exploration with exploitation. However, the vast majority of these processes operate in a "tabula rasa" setting which does not allow for generalization between state-action pairs, which is needed in systems having a large number of states and actions.
  • A bootstrap principle is commonly used to approximate a population distribution by a sample distribution. A common bootstrap takes as input a data set D and an estimator, γ. The bootstrap generates a sample data set that has a cardinality equal to that of D and is sampled uniformly with replacement from data set D. The bootstrap sample estimate is then taken to be γ applied to the sampled data set. A network that is an efficient and scalable system for generating bootstrap samples from a large and deep neural network includes a shared architecture with K bootstrapped heads (or exploration processes) branching off independently. Each head is trained only on a separate unique sub-sample of data that represents a single bootstrap sample γ(D′). The shared network learns via a joint feature representation across all of the data, which can provide significant computational advantages at the cost of lower diversity between heads. This type of bootstrap can be trained efficiently in a single forward/backward pass and can be thought of as data-dependent dropout, where the dropout mask for each head is fixed for each data point.
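  • As a concrete illustration of the bootstrap principle described above, the following is a minimal NumPy sketch that resamples a data set D uniformly with replacement and applies an estimator γ to the resample; the choice of the mean as the estimator is an assumption for illustration only.

    import numpy as np

    def bootstrap_estimate(D, estimator, rng=None):
        """Draw one bootstrap sample of |D| points with replacement and apply the estimator."""
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, len(D), size=len(D))   # uniform sampling with replacement
        return estimator(D[idx])

    # Example: approximate the sampling distribution of the mean with 1000 bootstrap replicates.
    D = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=500)
    replicates = [bootstrap_estimate(D, np.mean) for _ in range(1000)]
    print(np.mean(replicates), np.std(replicates))   # bootstrap estimate of the mean and its spread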
  • For a policy π, the value of an action a in state s may be expressed as Q^π(s, a) := E_{s,a,π}[Σ_{t=1}^∞ γ^t r_t], where γ∈(0,1) is a discount factor that balances immediate versus future rewards r_t. This expectation indicates that the initial state is s, the initial action is a, and thereafter actions are selected by the policy π. The optimal value is Q*(s, a) := max_π Q^π(s, a). To scale to large problems, a parameterized estimate of the Q-value function, Q(s, a; θ), is used rather than a tabular encoding, and a separate neural network is used as an approximator function to estimate the parameterized value.
  • In a deep Q-learning network, the Q-learning update from state s_t, action a_t, reward r_t and new state s_{t+1} is given by

  • θ_{t+1} ← θ_t + α(y_t^Q − Q(s_t, a_t; θ_t))∇_θ Q(s_t, a_t; θ_t)  (1)
  • where α is the scalar learning rate and y_t^Q is the target value r_t + γ max_a Q(s_{t+1}, a; θ⁻), in which θ⁻ are the target network parameters, held fixed at θ⁻ = θ_t.
  • Several important modifications to the updating process in Q-learning improve stability for a deep learning network using reinforcement learning provided in accordance with some embodiments of the invention. First, the system learns from sampled transitions from an experience buffer, rather than learning fully online. Second, the system uses a target network with parameters θ⁻ that are copied from the learning network (θ⁻ ← θ_t) only every τ time steps and then kept fixed between updates. A Double DQN system modifies the target y_t^Q and may help further, as shown in Equation (2):

  • y_t^Q ← r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ⁻).  (2)
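  • A short sketch of how the targets in Equations (1) and (2) might be computed is given below, assuming NumPy and treating the learning and target networks simply as callables that map a state to a vector of Q-values over actions; the function names are illustrative and not part of the described embodiment.

    import numpy as np

    def q_learning_target(r_t, s_next, q_target, gamma=0.99):
        """Standard target: y = r_t + gamma * max_a Q(s_{t+1}, a; theta_minus)."""
        return r_t + gamma * np.max(q_target(s_next))

    def double_dqn_target(r_t, s_next, q_online, q_target, gamma=0.99):
        """Double DQN target of Equation (2): the online network selects the action,
        the target network evaluates it."""
        a_star = int(np.argmax(q_online(s_next)))        # argmax_a Q(s_{t+1}, a; theta_t)
        return r_t + gamma * q_target(s_next)[a_star]    # evaluated under theta_minus

    def q_update(theta, grad_q, q_value, y, alpha=2.5e-4):
        """One gradient step of Equation (1): theta <- theta + alpha*(y - Q)*grad_Q."""
        return theta + alpha * (y - q_value) * grad_q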
  • A deep learning network that uses reinforcement learning as provided in accordance with some embodiments of the invention modifies this learning process to approximate a distribution over Q-values via the bootstrap. At the start of each episode, the network samples a single Q-value function from an approximate posterior maintained by the system. An exploration process then follows the policy which is optimal for that sample for the duration of the episode. This is a natural adaptation of the Thompson sampling heuristic to reinforcement learning that allows for temporally extended (or deep) exploration.
  • An exploration process for a deep learning network that uses reinforcement learning as provided in accordance with some embodiments of the invention may be efficiently implemented by building up K∈ℕ bootstrapped estimates of the Q-value function in parallel. Each one of these value function heads Q_k(s, a; θ) is trained against a separate target network Q_k(s, a; θ⁻), such that each of Q_1, . . . , Q_K provides a temporally extended (and consistent) estimate of the value uncertainty via Thompson sampling distribution estimates. In order to keep track of which data belongs to which bootstrap head, flags ω_1, . . . , ω_K∈{0,1} that indicate which heads are privy to which data are maintained. A bootstrap sample is made by selecting k∈{1, . . . , K} uniformly at random and following Q_k for the duration of that episode.
  • The observation that temporally extended exploration is necessary for efficient reinforcement learning is not new. For any prior distribution over MDPs, the optimal exploration strategy is available through dynamic programming in the Bayesian belief state space. However, the exact solution is intractable even for very simple systems. Many successful reinforcement learning applications focus on generalization and planning but address exploration only superficially or not at all; such exploration strategies can be highly inefficient.
  • Many exploration strategies are guided by the principle of "optimism in the face of uncertainty" (OFU). These algorithms add an exploration bonus to the values of state-action pairs that may lead to useful learning and select actions to maximize these adjusted values. This approach was first proposed for finite-armed bandits, but the principle has been extended successfully to various multi-armed bandit settings with generalization and to tabular reinforcement learning. Except in particular deterministic contexts, OFU methods that lead to efficient reinforcement learning in complex domains have been computationally intractable. One particular OFU system aims to add an effective bonus through a variation of a Deep Q-learning Network (DQN). The resulting system relies on a large number of hand-tuned parameters and is only suitable for application to deterministic problems.
  • Perhaps the oldest heuristic for balancing exploration with exploitation is given by Thompson sampling. Thompson sampling is often presented as a bandit algorithm: it takes a single sample from the posterior at every time step and chooses the action which is optimal for that sample. To apply the Thompson sampling principle to reinforcement learning, a system samples a value function from its posterior. Naïve applications of Thompson sampling to reinforcement learning, which resample every time step, can be extremely inefficient; instead, an agent must commit to a sample for several time steps in order to achieve deep exploration. One proposed system, PSRL, does commit to a sample for several steps and provides state-of-the-art guarantees. However, PSRL still requires solving a single known MDP, which will usually be intractable for large systems.
  • A deep learning network that uses reinforcement learning provided in accordance with some embodiments of the invention approximates this approach, committing to a sample for several steps, by exploring via randomized value functions sampled from an approximate posterior. Such a network recovers state-of-the-art guarantees in the setting with tabular basis functions, but its performance is crucially dependent upon a suitable linear representation of the value function. A deep learning network that uses reinforcement learning provided in accordance with some embodiments of the invention extends these ideas to produce a system that can simultaneously perform generalization and exploration with a flexible nonlinear value function representation. The method is simple, general, and compatible with almost all advances in deep reinforcement learning, at low computational cost and with few tuning parameters.
  • A reinforcement learning system in accordance with embodiments of this invention overcomes these problems by providing an exploration strategy that combines efficient generalization and exploration by leveraging a bootstrap process and artificial data. In accordance with some embodiments of the invention, the system receives a set of training data that includes observed data and artificial data. In accordance with some embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each state-action pair stochastically optimistic rewards and random state transitions. In accordance with some other embodiments, the artificial data is generated by sampling the set of observed data with replacement to obtain a set of data having a number of elements that is approximately equal to or greater than the number of elements in the set of observed data.
  • The observed and artificial data are sampled to obtain a training sample set of data. In accordance with some of these embodiments, the training sample dataset includes M samples of data. In accordance with a number of these embodiments, M is equal to or greater than a number of episodes to observe during an exploration process. In accordance with some embodiments, the sampling is performed in accordance with a known and/or a provided distribution. A bootstrap process is applied to the union of the observed data and the artificial data to obtain a new distribution for the sample of data. An approximator function is applied to the new distribution to generate a randomized state-value function.
  • For each episode, the process observes a state of the system, slt, for a particular time period from the training sample dataset and selects an action to perform based upon the policy π. The results, including the reward, rlt, realized and the transition state, slt+1, resulting from the action, are observed. The state, action, reward and transition state are stored as result data for the episode. This is repeated for each time step in the episode. After the episode is completed, the observed dataset is updated with the result data stored during the episode. In accordance with some embodiments, a training mask may be maintained that includes flags indicating the result data from particular time steps that is to be used for training. The training mask is then read to determine the result data to add to the observed data.
  • In accordance with many embodiments, multiple exploration processes may be performed concurrently. In accordance with a number of these embodiments, the result data observed by each exploration process is shared with the other exploration processes. In accordance with a number of embodiments, the observed dataset for each process is updated independently. In accordance with some of these embodiments, a bootstrap mask is maintained that indicates which elements of the observed dataset are available to each process.
  • In accordance with some embodiments, a replay buffer may be maintained and played back to update the parameters of the value function network Q.
  • Systems and methods for providing reinforcement learning in a deep learning network in accordance with some embodiments of the invention are set forth below with reference to the Figures.
  • Systems that Provide Deep Learning Networks
  • A system that provides a deep learning network using systems and methods that provide reinforcement learning in accordance with some embodiments of the invention is shown in FIG. 1. Network 100 includes a communications network 160. The communications network 160 is a network such as the Internet that allows devices connected to the network 160 to communicate with other connected devices. Server systems 110, 140, and 170 are connected to the network 160. Each of the server systems 110, 140, and 170 is a group of one or more servers communicatively connected to one another via internal networks that execute processes that provide cloud services to users over the network 160. For purposes of this discussion, cloud services are one or more applications that are executed by one or more server systems to provide data and/or executable applications to devices over a network. The server systems 110, 140, and 170 are shown each having three servers in the internal network. However, the server systems 110, 140 and 170 may include any number of servers, and any additional number of server systems may be connected to the network 160 to provide cloud services. In accordance with various embodiments of this invention, a deep learning network that uses systems and methods that provide reinforcement learning in accordance with an embodiment of the invention may be provided by processes being executed on a single server system and/or a group of server systems communicating over the network 160.
  • Users may use personal devices 180 and 120 that connect to the network 160 to perform processes for providing and/or interacting with a deep learning network that uses systems and methods that provide reinforcement learning in accordance with various embodiments of the invention. In the shown embodiment, the personal devices 180 are shown as desktop computers that are connected via a conventional "wired" connection to the network 160. However, a personal device 180 may be a desktop computer, a laptop computer, a smart television, an entertainment gaming console, or any other device that connects to the network 160 via a "wired" connection. The mobile device 120 connects to the network 160 using a wireless connection. A wireless connection is a connection that uses Radio Frequency (RF) signals, Infrared signals, or any other form of wireless signaling to connect to the network 160. In FIG. 1, the mobile device 120 is a mobile telephone. However, the mobile device 120 may be a mobile phone, a Personal Digital Assistant (PDA), a tablet, a smartphone, or any other type of device that connects to the network 160 via a wireless connection without departing from this invention.
  • Example of a Processing System
  • An example of a processing system in a device that executes instructions to perform processes that interact with other devices connected to the network shown in FIG. 1 to provide a deep learning network that uses systems and methods that provide reinforcement learning in accordance with various embodiments of this invention is shown in FIG. 2. One skilled in the art will recognize that a particular processing system may include other components that are omitted for brevity without departing from this invention. The processing device 200 includes a processor 205, a non-volatile memory 210, and a volatile memory 215. The processor 205 is a processor, microprocessor, controller, or a combination of processors, microprocessors, and/or controllers that performs instructions stored in the volatile memory 215 or the non-volatile memory 210 to manipulate data stored in the memory. The non-volatile memory 210 can store the processor instructions utilized to configure the processing system 200 to perform processes including processes in accordance with embodiments of the invention and/or data for the processes being utilized. In other embodiments, the processing system software and/or firmware can be stored in any of a variety of non-transient computer readable media appropriate to a specific application. A network interface is a device that allows the processing system 200 to transmit and receive data over a network based upon the instructions performed by the processor 205. Although a processing system 200 is illustrated in FIG. 2, any of a variety of processing systems in the various devices that can be configured to provide the methods and systems in accordance with embodiments of the invention can be utilized.
  • System that Provides Training by Multiple Concurrently Running Exploration Processes
  • In accordance with some embodiments of the invention, a deep learning network may concurrently run multiple exploration processes to achieve greater exploration of an environment. A conceptual diagram of a deep learning network that has multiple concurrently running exploration processes in accordance with an embodiment of this invention is shown in FIG. 3. Deep learning network 300 operates on a frame 305. K heads or exploration processes interact with network 300 to explore the environment.
  • The bootstrap principle is used to approximate a population distribution by a sample distribution. A common bootstrap takes as input a data set D and an estimator, γ. The bootstrap generates a sample data set that has a cardinality equal to that of D and is sampled uniformly with replacement from data set D. The bootstrap sample estimate is then taken to be γ applied to the sampled data set. System 300 is an efficient and scalable system for generating bootstrap samples from a large and deep neural network. The network 300 includes a shared architecture with K bootstrapped heads (or exploration processes) branching off independently. Each head is trained only on a separate unique sub-sample of data that represents a single bootstrap sample γ(D′). The shared network learns via a joint feature representation across all of the data, which can provide significant computational advantages at the cost of lower diversity between heads. This type of bootstrap can be trained efficiently in a single forward/backward pass and can be thought of as data-dependent dropout, where the dropout mask for each head is fixed for each data point.
  • For a policy π, the value of an action a in state s may be expressed as Q^π(s, a) := E_{s,a,π}[Σ_{t=1}^∞ γ^t r_t], where γ∈(0,1) is a discount factor that balances immediate versus future rewards r_t. This expectation indicates that the initial state is s, the initial action is a, and thereafter actions are selected by the policy π. The optimal value is Q*(s, a) := max_π Q^π(s, a). To scale to large problems, a parameterized estimate of the Q-value function Q(s, a; θ) is used rather than a tabular encoding. In accordance with some embodiments, a neural network is used to estimate the parameterized value.
  • The Q-learning update from state s_t, action a_t, reward r_t and new state s_{t+1} is given by

  • θ_{t+1} ← θ_t + α(y_t^Q − Q(s_t, a_t; θ_t))∇_θ Q(s_t, a_t; θ_t)  (1)
  • where α is the scalar learning rate and y_t^Q is the target value r_t + γ max_a Q(s_{t+1}, a; θ⁻), in which θ⁻ are the target network parameters, held fixed at θ⁻ = θ_t.
  • Several important modifications to the updating process in Q-learning improve stability for a deep learning network using reinforcement learning provided in accordance with some embodiments of the invention. First, the algorithm learns from sampled transitions from an experience buffer, rather than learning fully online. Second, the algorithm uses a target network with parameters θ⁻ that are copied from the learning network (θ⁻ ← θ_t) only every τ time steps and then kept fixed between updates. Double DQN modifies the target y_t^Q and helps further:

  • y_t^Q ← r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ⁻).  (2)
  • A deep learning network that uses reinforcement learning as provided in accordance with embodiments of this invention modifies DQN to approximate a distribution over Q-values via the bootstrap. At the start of each episode, such a network samples a single Q-value function from its approximate posterior. The system then follows the policy which is optimal for that sample for the duration of the episode.
  • An exploration process is efficiently implemented by building up K∈ℕ bootstrapped estimates of the Q-value function in parallel, as in FIG. 3. Importantly, each one of these value function heads Q_k(s, a; θ) is trained against its own target network Q_k(s, a; θ⁻). This means that each of Q_1, . . . , Q_K provides a temporally extended (and consistent) estimate of the value uncertainty via TD estimates. In order to keep track of which data belongs to which bootstrap head, flags ω_1, . . . , ω_K∈{0,1} indicating which heads are privy to which data are stored. A bootstrap sample is approximated by selecting k∈{1, . . . , K} uniformly at random and following Q_k for the duration of that episode.
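  • A sketch of how the bootstrap flags might gate the training of each head is shown below; the per-head loss is written for a single transition, each head and its target network are treated as callables, and all names are illustrative assumptions.

    import numpy as np

    def masked_head_losses(transition, flags, q_heads, q_targets, gamma=0.99):
        """Per-head squared TD errors for one transition, gated by the bootstrap flags.

        transition: (s, a, r, s_next); flags: array of K values in {0, 1}, where
        flags[k] = 1 means head k is privy to this datapoint."""
        s, a, r, s_next = transition
        losses = np.zeros(len(q_heads))
        for k, (q_k, q_k_target) in enumerate(zip(q_heads, q_targets)):
            if flags[k] == 0:
                continue                                  # head k never sees this datapoint
            y = r + gamma * np.max(q_k_target(s_next))    # head-specific target network
            losses[k] = (y - q_k(s)[a]) ** 2              # TD error for head k only
        return losses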
  • Reinforcement Learning Exploration Process
  • In accordance with some embodiments of the invention, systems and methods provide reinforcement learning by providing a deep exploration process. The deep exploration process fits a state-action value function to a sample of data from a set of data that includes artificial data and observed data. In accordance with some embodiments of the invention, the system receives a set of training data that includes observed data and artificial data. In accordance with some of these embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each state-action pair stochastically optimistic rewards and random state transitions. In accordance with some other embodiments, the artificial set of data is generated by sampling the observed set of data with replacement. In accordance with a number of embodiments, the artificial data includes M elements, where M is approximately greater than or equal to the number of elements in the observed dataset. The use of the combination of observed and artificial data provides randomness in the samples to induce deep exploration. An exploration process for providing reinforcement learning to a deep learning network in accordance with an embodiment of this invention is shown in FIG. 6.
  • Process 600 performs exploration in M distinct episodes. In accordance with some embodiments, the number of episodes is received as an input, and in accordance with some other embodiments the number of episodes may be set or selected by the process based on the size of the deep learning network. A set of data including observed and artificial data is obtained (605). In accordance with some embodiments, the observed data is read from a memory. In accordance with some other embodiments, the observed data is received from another system.
  • In accordance with some embodiments, the artificial data is generated by sampling the observed data with replacement. In accordance with some other embodiments, the artificial data is generated from the observed data based on a known distribution of the original data. In accordance with some other embodiments, the artificial data is generated independent of the observed data. In accordance with some of these embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each state-action pair stochastically optimistic rewards and random state transitions. In accordance with some embodiments, the artificial dataset includes M elements of data, where M is approximately equal to or greater than the number of elements in the observed dataset.
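  • The two ways of generating artificial data mentioned in this step can be sketched as follows. The specific distribution used to make rewards "stochastically optimistic" and the uniform choice of random transitions in the second variant are purely illustrative assumptions, not the claimed method.

    import numpy as np

    def artificial_data_by_resampling(observed, M, rng=None):
        """Variant 1: sample the observed transitions with replacement (bootstrap-style)."""
        rng = rng or np.random.default_rng()
        idx = rng.integers(0, len(observed), size=M)
        return [observed[i] for i in idx]

    def artificial_data_from_generative_model(n_states, n_actions, M, rng=None):
        """Variant 2 (illustrative only): random state-action pairs with optimistic rewards
        and random next states standing in for a diffusely mixed generative model."""
        rng = rng or np.random.default_rng()
        data = []
        for _ in range(M):
            s = rng.integers(n_states)
            a = rng.integers(n_actions)
            r = rng.normal(loc=1.0, scale=1.0)   # assumed "stochastically optimistic" reward
            s_next = rng.integers(n_states)      # random state transition
            data.append((s, a, r, s_next))
        return data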
  • An approximator function is also received as input (610). In accordance with some embodiments, the approximator function may be set for process 600 and stored in memory for use. The approximator estimates a state-action value function for a data set. In accordance with some embodiments, the approximator function may be a neural network trained to fit a state-action value function to the data set via a least-squares iteration.
  • The observed and artificial data are sampled to obtain a training set of data (615). In accordance with some of these embodiments, the training data includes M samples of data. In accordance with a number of these embodiments, M is equal to or greater than the number of episodes to observe. In accordance with some embodiments, the sampling is performed in accordance with a known and/or a provided distribution. A bootstrap process is applied to the union of the observed data and the training data to obtain a new distribution, and the approximator function is applied to the distribution to generate a randomized state-value function (620). For each time step, a state, s, is observed based on the training data, and an action, a, is selected based on the state of the system, the sample of data and the policy π (630). The reward, rlt, realized and the resulting transition state, slt+1, are observed (635). The state, s, the action, a, and the resulting transition state, slt+1, are stored as result data in memory. The selecting (630) and observing of the results are repeated until the time period ends (640). The observed set of data is then updated with the results (650). In accordance with some embodiments, a training mask is maintained that indicates the result data from particular time steps of each episode to add to the observed set of data, and the mask is read to determine which elements of the result data to add to the observed data. This is then repeated for each of the M episodes (645), and process 600 ends.
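  • Putting the steps of process 600 together, a minimal sketch of the per-episode loop might look like the following. The helper behavior assumed here (that the approximator returns a callable Q and that every result is added back to the observed data rather than being filtered by a training mask) is an illustrative simplification.

    import numpy as np

    def exploration_process_600(observed, artificial, approximator, env, M, rng=None):
        """Sketch of FIG. 6: for each of M episodes, bootstrap a training sample from the
        union of observed and artificial data, fit a randomized value function, explore
        greedily with it, and fold the results back into the observed data."""
        rng = rng or np.random.default_rng()
        for episode in range(M):
            pool = observed + artificial                          # union of the two datasets (605)
            idx = rng.integers(0, len(pool), size=len(pool))      # bootstrap resample (615/620)
            q = approximator([pool[i] for i in idx])              # randomized value function (620)

            s, done, results = env.reset(), False, []
            while not done:
                a = int(np.argmax(q(s)))                          # action from the fitted Q (630)
                s_next, r, done = env.step(a)                     # observe reward and transition (635)
                results.append((s, a, r, s_next))                 # store result data
                s = s_next
            observed.extend(results)                              # update observed data (650)
        return observed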
  • Although one process for providing reinforcement learning for a deep learning network in accordance with an embodiment of the invention is described with respect to FIG. 6, other processes that add, remove, and/or combine steps of process 600 may be performed without departing from the various embodiments of this invention.
  • Fitting a model like a deep neural network is a computationally expensive task. As such, it is desirable to use incremental methods to incorporate new data samples into the fitting process as the data is generated. To do so, parallel computing may be used. A process that performs multiple concurrent explorations in accordance with an embodiment of the invention is shown in FIG. 7.
  • Process 700 performs exploration in M distinct episodes for each of K separate exploration processes. In accordance with some embodiments, the number of episodes is received as an input, and in accordance with some other embodiments the number of episodes may be set or selected by the process based on the size of the deep learning network. A set of data including observed and artificial data is obtained (705). In accordance with some embodiments, the observed data is read from a memory. In accordance with some other embodiments, the observed data is received from another system.
  • In accordance with some embodiments, the artificial data for one or more exploration processes is generated by sampling the observed data with replacement. In accordance with some other embodiments, the artificial data for one or more exploration processes is generated from the observed data based on a known distribution of the original data. In accordance with some other embodiments, the artificial data for one or more of the exploration processes is generated independent of the observed data. In accordance with some of these embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each state-action pair stochastically optimistic rewards and random state transitions. In accordance with some embodiments, the artificial dataset includes M elements of data, where M is approximately equal to or greater than the number of elements in the observed dataset.
  • An approximator function is also received as input (710). In accordance with some embodiments, the approximator function may be set for process 700 and stored in memory for use. The approximator estimates a state-action value function for a data set. In accordance with some embodiments, the approximator function may be a neural network trained to fit a state-action value function to the data set via a least-squares iteration. In accordance with some embodiments, 1 to K approximators may be used, where each of the 1 to K approximators is applied to the training set data of one or more of the K exploration processes.
  • The observed and artificial data are sampled to obtain a training set of data for each of the K independent processes (715). In accordance with some of these embodiments, the training data for each exploration process includes M samples of data. In accordance with a number of these embodiments, M is equal to or greater than the number of episodes to observe. In accordance with some embodiments, the sampling for one or more exploration processes is performed in accordance with a known and/or a provided distribution. In accordance with some embodiments, one or more of the exploration processes may have the same set of artificial data.
  • For each of the K exploration processes, a bootstrap process is applied to the union of the observed data and the artificial data to obtain a new distribution, and the approximator function is applied to the distribution to generate a randomized state-value function (720). For each exploration process, exploration is performed in the following manner. For each time step, a state, s, is observed and an action, a, is selected based on the state of the system, the sample of data and the policy π (730). The reward, rlt, realized and the resulting transition state, slt+1, are observed (735). The state, s, the action, a, and the resulting transition state, slt+1, are stored as result data in memory. The selecting (730) and observing of the results are repeated until the time period for the episode ends (740). The observed set of data is individually updated for each exploration process with the results (750). To do so, a bootstrap mask may be maintained to indicate the observed data that is available to each exploration process. In accordance with some other embodiments, the observed data is updated with the data from all of the K different exploration processes. In accordance with some embodiments, a training mask is maintained that indicates the result data from particular time steps of each episode for each exploration process to add to the observed set of data, and the mask is read to determine which elements of the result data to add to the observed data. This is then repeated for each of the M episodes (745), and process 700 ends.
  • Although one process for providing reinforcement learning for a deep learning network using multiple exploration processes in accordance with an embodiment of the invention is described with respect to FIG. 7, other processes that add, remove, and/or combine steps of process 700 may be performed without departing from the various embodiments of this invention.
  • Testing for Deep Exploration
  • The following is an explanation of a series of didactic computational experiments designed to highlight the need for deep exploration. These environments can be described by chains of length N>3, as shown in FIG. 8. Each episode of interaction lasts N+9 steps, after which point an exploration process resets to the initial state, s2. These are toy problems intended to be expository rather than entirely realistic. The need to balance a well-known and mildly successful strategy against an unknown, but potentially more rewarding, approach can emerge in many practical applications.
  • The environments in these problems may be described by a finite tabular MDP. However, the processes tested only interact with the MDP through raw pixel features. The two feature mappings of the tests are φ_{1hot}(s_t) := 1{x = s_t} and φ_{therm}(s_t) := 1{x ≤ s_t}, both in {0, 1}^N. The results for φ_{therm} were better for all Deep Q-learning Network (DQN) variants due to better generalization, but the difference was relatively small. A Thompson sampling DQN is the same as a bootstrapped DQN (a deep learning network that uses reinforcement learning provided in accordance with an embodiment of the invention), but resamples every time step. An ensemble DQN uses the same architecture as bootstrapped DQN, but with an ensemble policy.
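  • The two feature mappings can be written compactly; the following NumPy sketch assumes states are indexed 1 through N.

    import numpy as np

    def phi_one_hot(s_t, N):
        """One-hot features: phi_1hot(s_t) = 1{x = s_t} for x = 1..N."""
        x = np.arange(1, N + 1)
        return (x == s_t).astype(np.float32)

    def phi_thermometer(s_t, N):
        """Thermometer features: phi_therm(s_t) = 1{x <= s_t} for x = 1..N."""
        x = np.arange(1, N + 1)
        return (x <= s_t).astype(np.float32)

    print(phi_one_hot(3, 6))       # [0. 0. 1. 0. 0. 0.]
    print(phi_thermometer(3, 6))   # [1. 1. 1. 0. 0. 0.]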
  • For purposes of this discussion, a process has successfully learned the optimal policy when it has completed one hundred episodes with the optimal reward of 10. For each chain length, each learning system was executed for 2000 episodes across three seeds. The median time to learn for each system is shown in FIG. 9, together with a conservative lower bound of 99 + 2^(N−11) on the expected time to learn for any shallow exploration strategy. As seen in graphs 901-904, only bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) demonstrates graceful scaling to long chains, which require deep exploration.
  • Bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) explores in a manner similar to the provably efficient algorithm PSRL, but bootstrapped DQN uses a bootstrapped neural network to approximate a posterior sample for the value. Unlike PSRL, bootstrapped DQN directly samples a value function and does not require further planning steps. Bootstrapped DQN is similar to RLSVI, which is also provably efficient, but it uses a neural network instead of a linear value function and a bootstrap instead of Gaussian sampling. The analysis for the linear setting suggests that this nonlinear approach will work well as long as the distribution {Q1, . . . , QK} remains stochastically optimistic, or at least as spread out as the "correct" posterior.
  • Bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) relies upon random initialization of the network weights as a prior to induce diversity. The initial diversity is enough to maintain diverse generalization to new and unseen states for large and deep neural networks. The initial diversity is effective for this experimental setting, but will not work in all situations. In general, it may be necessary for a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention to maintain some more rigorous notion of "prior", potentially through the use of artificial prior data to maintain diversity. One potential explanation for the efficacy of simple random initialization is that, unlike supervised learning or bandits, where all networks fit the same data, each of the Q_k heads has a unique target network. This, together with stochastic minibatches and flexible nonlinear representations, means that even small differences at initialization may grow as the heads refit to unique TD errors.
  • Atari Evaluation
  • A deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention was evaluated across 49 Atari games on the Arcade Learning Environment. The domains of these games are not specifically designed to showcase the tested deep learning network. In fact, many Atari games are structured so that small rewards always indicate part of an optimal policy, which may be crucial for the strong performance observed by dithering strategies. The evaluations show that exploration via bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) produces significant gains versus ε-greedy in this setting. Bootstrapped DQN reaches peak performance roughly similar to DQN. However, the improved exploration of bootstrapped DQN reaches human performance on average 30% faster across all games. This translates to significantly improved cumulative rewards through learning.
  • Deep Learning Network Set-Up for Atari Evaluation
  • The bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) evaluated had a network structure identical to the convolutional structure of a DQN, except that the bootstrapped DQN split into 10 separate bootstrap heads after the convolutional layers.
  • 49 Atari games were used for the experiments. Each step of the process corresponds to four steps of the emulator, where the same action is repeated, and the reward values of the process are clipped between −1 and 1 for stability. The processes are evaluated, and performance is reported, based upon the raw scores.
  • The convolutional part of the network (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) used is identical to the one used in other systems. The input to the network is a 4×84×84 tensor with a rescaled, grayscale version of the last four observations. The first convolutional (conv) layer has 32 filters of size 8 with a stride of 4. The second conv layer has 64 filters of size 4 with a stride of 2. The last conv layer has 64 filters of size 3. The network is split beyond the final conv layer into K=10 distinct heads, each one fully connected and identical to the single head of a DQN, which includes a fully connected layer of 512 units followed by another fully connected layer to the Q-values for each action. The fully connected layers all use Rectified Linear Units (ReLU) as a non-linearity. Gradients that flow from each head into the shared network are normalized by 1/K.
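  • The architecture just described can be sketched in PyTorch as follows; the stride of the last convolutional layer and the handling of the 1/K gradient rescaling (left to the training loop here) are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class BootstrappedQNetwork(nn.Module):
        """Shared DQN-style convolutional torso with K independent bootstrap heads."""

        def __init__(self, n_actions, k_heads=10):
            super().__init__()
            self.shared = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 32 filters of size 8, stride 4
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 64 filters of size 4, stride 2
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 64 filters of size 3 (stride assumed 1)
                nn.Flatten(),                                           # 64 * 7 * 7 = 3136 features for 4x84x84 input
            )
            self.heads = nn.ModuleList(
                nn.Sequential(nn.Linear(3136, 512), nn.ReLU(), nn.Linear(512, n_actions))
                for _ in range(k_heads)
            )

        def forward(self, frames):
            """frames: float tensor of shape (batch, 4, 84, 84); returns a list of K Q-value tensors."""
            features = self.shared(frames)
            return [head(features) for head in self.heads]

    net = BootstrappedQNetwork(n_actions=18)
    qs = net(torch.zeros(1, 4, 84, 84))
    print(len(qs), qs[0].shape)   # 10 heads, each of shape (1, 18)

  • One way to realize the 1/K normalization described above would be to register a backward hook on the shared feature tensor that divides the accumulated gradient by K before it reaches the torso; treating this as part of the training loop keeps the module definition itself simple.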
  • Each of the networks tested was trained with RMSProp with a momentum of 0.95 and a learning rate of 0.00025. The discount was set to γ=0.99, and the number of steps between target updates was set to τ=10000 steps. The processes were trained for a total of 50 m steps per game, which corresponds to 200 m frames. The processes were stopped every 1 m frames for evaluation. Furthermore, the bootstrapped DQN used an ensemble voting policy. The experience replay contains the 1 m most recent transitions. The network was updated every 4 steps by randomly sampling a minibatch of 32 transitions from the replay buffer, to use the exact same minibatch schedule as DQN. For training, an ε-greedy policy was used, with ε annealed linearly from 1 to 0.01 over the first 1 m timesteps.
  • Gradient Normalization in Bootstrap Heads
  • Most literature in deep reinforcement learning for Atari focuses on learning the best single evaluation policy, with particular attention to whether this is above or below human performance. This is unusual for the reinforcement learning literature, which typically focuses upon cumulative or final performance.
  • Based on the results, bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) makes significant improvements to the cumulative rewards of DQN on Atari, while the peak performance is roughly similar. Furthermore, bootstrapped DQN without gradient normalization on each head typically learned even faster than the implementation with 1/K rescaling, but the network was somewhat prone to premature and suboptimal convergence.
  • In order to better the benchmark "best" policies reported for DQN, bootstrapped DQN should use gradient normalization. However, it is not entirely clear whether gradient normalization represents an improvement for all settings.
  • Where a reinforcement learning system is deployed to learn from real interactions, cumulative rewards present a better measure of performance. In these settings, the benefits of gradient normalization are less clear. However, even with the 1/K normalization, bootstrapped DQN significantly outperforms DQN in terms of cumulative rewards.
  • Sharing Data in Bootstrap Heads
  • In the Atari tests, all network heads (exploration processes) of the bootstrapped DQN share all of the data, so the bootstrap heads do not implement a traditional bootstrap at all. This is different from the regression task, where bootstrapped data was essential to obtain meaningful uncertainty estimates. There are several theories for why the networks maintain significant diversity even without data bootstrapping in this setting.
  • First, the network heads all train on different target networks. As such, when facing the same (s, a, r, s′) datapoint, the various heads can reach drastically different Q-value updates. Second, Atari is a deterministic environment, so any observed transition is the unique correct datapoint for this type of setting. Third, the networks are deep and the heads are initialized from different random values, so the heads will likely find quite diverse generalization even when they agree on given data. Finally, since all variants of DQN take many frames to update the policy, it is likely that even using p=0.5 the heads would still populate their replay memory with identical datapoints. Thus, using p=1 to save on minibatch passes seems like a reasonable compromise, and p=1 does not seem to negatively affect performance too much in this setting. More research is needed to examine exactly where and when this data sharing is important.
  • Results Tables
  • In Table 1, shown in FIG. 13, the average score achieved by the various systems during the most successful evaluation period is shown and compared to human performance and a uniformly random policy. DQN is an implementation of DQN with the hyperparameters specified above, using the double Q-learning update. The peak final performance under bootstrapped DQN is similar to previous DQN benchmarks.
  • To compare the benefits of exploration via bootstrapped DQN, the results of bootstrapped DQN are benchmarked against the most similar prior work on incentivizing exploration in Atari. To do so, bootstrapped DQN is compared to AUC-100. Based on the results, bootstrapped DQN outperforms this prior work significantly.
  • Implementing Bootstrapped DQN at Scale
  • In the evaluations, the number of heads needed to generate online bootstrap samples for DQN in a computationally efficient manner was examined. Three key questions need to be answered: how many heads are needed, how should gradients be passed to the shared network, and how should data be bootstrapped online? To answer these questions, significant compromises were made in order to maintain a computational cost comparable to DQN.
  • More heads lead to faster learning, but even a small number of heads captures most of the benefits of bootstrapped DQN. For the evaluations, K=10 was used.
  • The shared network architecture allows training of this combined network via backpropagation. Feeding K network heads into the shared convolutional network effectively increases the learning rate for this portion of the network. In some games, the increased learning rate leads to premature and sub-optimal convergence. The best final scores were achieved by normalizing the gradients by 1/K, but this normalization also leads to slower early learning.
  • To implement an online bootstrap, an independent Bernoulli mask w_1, . . . , w_K ~ Ber(p) was used for each head in each episode. These flags are stored in the memory replay buffer and identify which heads are trained on which data. However, when trained using a shared minibatch, the network will also require effectively 1/p more iterations, which is undesirable computationally. Surprisingly, the bootstrapped DQN performed similarly irrespective of p, and all values of p outperformed DQN. In light of this empirical observation for Atari, p=1 is used to save on minibatch passes. As a result, bootstrapped DQN runs at a similar computational speed to vanilla DQN on identical hardware.
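  • A sketch of how such masks might be drawn and stored alongside each transition is given below; the list-based replay-buffer representation is an illustrative assumption.

    import numpy as np

    def store_transition(replay_buffer, transition, k_heads=10, p=0.5, rng=None):
        """Append (s, a, r, s_next) together with an independent Bernoulli(p) flag per head.
        With p = 1 every head trains on every transition, matching the compromise used
        for the Atari experiments."""
        rng = rng or np.random.default_rng()
        mask = rng.binomial(1, p, size=k_heads)   # w_1, ..., w_K ~ Ber(p)
        replay_buffer.append((transition, mask))

    buffer = []
    store_transition(buffer, (0, 1, 0.0, 1), p=1.0)
    print(buffer[0][1])   # [1 1 1 1 1 1 1 1 1 1]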
  • Efficient Exploration in Atari
  • In the evaluations, bootstrapped DQN drives efficient exploration in several Atari games. For the same amount of game experience, bootstrapped DQN generally outperforms DQN with ε-greedy exploration. FIG. 10 demonstrates this effect for a diverse selection of games.
  • On games where DQN performs well, bootstrapped DQN typically performs better. Bootstrapped DQN does not reach human performance on Amidar (DQN does) but does on Beam Rider and Battle Zone (DQN does not). To summarize this improvement in learning time, the number of frames required to reach human performance is considered. If bootstrapped DQN reaches human performance in 1/x of the frames required by DQN, bootstrapped DQN has improved by a factor of x. FIG. 11 shows that bootstrapped DQN typically reaches human performance significantly faster.
  • On most games where DQN does not reach human performance, bootstrapped DQN does not solve the problem by itself. On some challenging Atari games where deep exploration is conjectured to be important, the results for bootstrapped DQN are not entirely successful, but still promising. In Frostbite, bootstrapped DQN reaches the second level much faster than DQN, but network instabilities cause the performance to crash. In Montezuma's Revenge, bootstrapped DQN reaches the first key after 20 m frames (DQN never observes a reward even after 200 m frames) but does not properly learn from this experience. These results suggest that improved exploration may help to solve these remaining games, but they also highlight the importance of other problems like network instability, reward clipping and temporally extended rewards.
  • Overall Performance
  • Bootstrapped DQN is able to learn much faster than DQN. Graph 1201 of FIG. 12 shows that bootstrapped DQN also improves upon the final score across most games. However, the real benefits of efficient exploration mean that bootstrapped DQN outperforms DQN by orders of magnitude in terms of the cumulative rewards through learning (shown in graph 1202 of FIG. 12). In both graphs, performance is normalized relative to a fully random policy. The most similar work to bootstrapped DQN presents several other approaches to improved exploration in Atari. For example, AUC-20 is optimized for a normalized version of the cumulative returns after 20 m frames. According to this metric, averaged across the 14 games considered, bootstrapped DQN improves upon both the base DQN (0.29) and the best AUC-20 method (0.37), obtaining 0.62. These results, together with results tables across all 49 games, are provided in the table shown in FIG. 13.
  • Visualizing Bootstrapped DQN
  • The following provides some more insight into how bootstrapped DQN drives deep exploration in Atari. In each game, although each head Q1, . . . , Q10 learns a high scoring policy, the policies found by each head are quite distinct. Although each head performs well, each follows a unique policy. By contrast, ε-greedy strategies are almost indistinguishable for small values of ε and totally ineffectual for larger values. Deep exploration is key to improved learning, since diverse experiences allow for better generalization.
  • Disregarding exploration, bootstrapped DQN may also be beneficial as a purely exploitative policy. In the evaluations, all of the heads are combined into a single ensemble policy, for example by choosing the action with the most votes across heads. This approach might have several benefits. First, the ensemble policy can often outperform any individual policy. Second, the distribution of votes across heads gives a measure of the uncertainty in the optimal policy. Unlike a conventional DQN, bootstrapped DQN can know what it doesn't know. In an application where executing a poorly-understood action is dangerous, this could be crucial. The uncertainty in this policy is surprisingly interpretable: all heads agree at clearly crucial decision points, but remain diverse at other, less important, steps.
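  • The ensemble voting policy mentioned above can be sketched as a majority vote over the greedy action of each head; ties here are broken toward the lowest-indexed action, which is an arbitrary assumption.

    import numpy as np

    def ensemble_vote_action(q_values_per_head):
        """q_values_per_head: array of shape (K, n_actions) with each head's Q-values.
        Returns the action chosen by most heads, plus the vote counts as a rough
        measure of the policy's uncertainty."""
        greedy = np.argmax(q_values_per_head, axis=1)          # each head's greedy action
        counts = np.bincount(greedy, minlength=q_values_per_head.shape[1])
        return int(np.argmax(counts)), counts

    q = np.array([[1.0, 0.2], [0.9, 0.4], [0.1, 0.8]])          # 3 heads, 2 actions
    action, votes = ensemble_vote_action(q)
    print(action, votes)                                        # 0 [2 1]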
  • Although the present invention has been described in certain specific aspects, many additional modifications and variations would be apparent to those skilled in the art. It is therefore to be understood that the present invention can be practiced otherwise than specifically described without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.

Claims (24)

What is claimed is:
1. A deep learning system comprising:
at least one processor;
memory accessible by each at least one processor;
instructions that when read by the at least one processor direct the at least one processor to:
maintain a deep neural network; and
apply a reinforcement learning process to the deep neural network where the reinforcement learning process includes:
receive a set of observed data and a set of artificial data,
for each of one or more episodes:
sample from a set of data that is a union of the set of observed data and the set of artificial data to generate a set of training data;
determine a state-action value function for the set of training data using a bootstrap process and an approximator, where the approximator estimates a state-action value function for a dataset;
for each time step in each one or more episode:
determine a state of the system for a current time step from the set of training data;
select an action based on the determined state of the system and a policy mapping actions to the state of the system;
determine results for the action including a reward and a transition state that result from the selected action; and
store result data for the current time step that includes the state, the action, the transition state, and
update the set of the observed data with the result data from at least one time step for each of the one or more episodes.
2. The deep learning system of claim 1 wherein the instructions further direct the at least one processor to generate the set of artificial data from the set of observed data.
3. The deep learning system of claim 2 wherein the instructions to generate the artificial data include instructions that direct the at least one processor to:
sample the set of observed data with replacement to generate the set of artificial data.
4. The deep learning system of claim 2 wherein the instructions to generate the artificial data include instructions that direct the at least one processor to:
sample a plurality of state-action pairs from a diffusely mixed generative model; and
assign each of the plurality of sampled state-action pairs stochastically optimistic rewards and random state transitions.
5. The deep learning system of claim 1 wherein the instructions further direct the at least one processor to:
maintain a training mask that indicates the result data from each time period in each episode to be used in training; and
wherein the updating of the set of observed data includes adding the result data from each time period of an episode indicated in the training mask.
6. The deep learning network of claim 1 where the instructions further direct the processor to:
receive the approximator as an input.
7. The deep learning network of claim 1 wherein the instructions further direct the processor to:
read the approximator from memory.
8. The deep learning network of claim 1 wherein the approximator is a neural network trained to fit a state-action value function to the data set via a least squared iteration.
9. The deep learning network of claim 1 wherein a plurality of reinforcement learning processes are applied to the deep neural network.
10. The deep learning network of claim 9 wherein each of the plurality of reinforcement learning processes independently maintain the set of observed data.
11. The deep learning network of claim 9 wherein the plurality of reinforcement learning processes cooperatively maintain the set of observed data.
12. The deep learning network of claim 9 wherein the instructions further direct the processor to:
maintain a bootstrap mask that indicates each element in the set of observed data that is available to each of the plurality of reinforcement learning processes.
13. A method performed by at least one processor executing instructions stored in memory to perform the method to provide reinforcement learning in a deep learning network, the method comprising:
receiving a set of observed data and a set of artificial data;
for each of one or more episodes:
sampling from a set of data that is a union of the set of observed data and the set of artificial data to generate a set of training data,
determining a state-action value function for the set of training data using a bootstrap process and an approximator, where the approximator estimates a state-action value function for a dataset,
for each time step in each one or more episode:
determining a state of the system for a current time step from the set of training data;
selecting an action based on the determined state of the system and a policy mapping actions to the state of the system;
determining results for the action including a reward and a transition state that result from the selected action; and
storing result data for the current time step that includes the state, the action, the transition state, and
updating the set of the observed data with the result data from at least one time step of each of the one or more episodes.
14. The method of claim 13 further comprising generating the set of artificial data from the set of observed data.
15. The method of claim 14 further comprising:
sampling the set of observed data with replacement to generate the set of artificial data.
16. The method of claim 14 further comprising:
sampling a plurality of state-action pairs from a diffusely mixed generative model; and
assigning each of the plurality of sampled state-action pairs stochastically optimistic rewards and random state transitions.
17. The method of claim 13 further comprising:
maintaining a training mask that indicates the result data from each time period in each episode to be used in training; and
wherein the updating of the set of observed data includes adding the result data from each time period of an episode indicated in the training mask.
18. The method of claim 13 further comprising:
receiving the approximator as an input.
19. The method of claim 13 further comprising:
reading the approximator from memory.
20. The method of claim 13 wherein the approximator is a neural network trained to fit a state-action value function to the data set via a least squared iteration.
21. The method of claim 13 wherein a plurality of reinforcement learning methods are applied to the deep neural network.
22. The method of claim 21 wherein each of the plurality of reinforcement learning methods independently maintain the set of observed data.
23. The method of claim 21 wherein the plurality of reinforcement learning methods cooperatively maintain the set of observed data.
24. The method of claim 21 further comprising:
maintaining a bootstrap mask that indicates each element in the set of observed data that is available to each of the plurality of reinforcement learning methods.
US15/212,042 2015-07-01 2016-07-15 Systems and Methods for Providing Reinforcement Learning in a Deep Learning System Abandoned US20170032245A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US15/212,042 US20170032245A1 (en) 2015-07-01 2016-07-15 Systems and Methods for Providing Reinforcement Learning in a Deep Learning System
US16/576,697 US20200065672A1 (en) 2015-07-01 2019-09-19 Systems and Methods for Providing Reinforcement Learning in a Deep Learning System

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201562187681P 2015-07-01 2015-07-01
US201615201284A 2016-07-01 2016-07-01
US15/212,042 US20170032245A1 (en) 2015-07-01 2016-07-15 Systems and Methods for Providing Reinforcement Learning in a Deep Learning System

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US201615201284A Continuation-In-Part 2015-07-01 2016-07-01

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/576,697 Continuation US20200065672A1 (en) 2015-07-01 2019-09-19 Systems and Methods for Providing Reinforcement Learning in a Deep Learning System

Publications (1)

Publication Number Publication Date
US20170032245A1 true US20170032245A1 (en) 2017-02-02

Family

ID=57882627

Family Applications (2)

Application Number Title Priority Date Filing Date
US15/212,042 Abandoned US20170032245A1 (en) 2015-07-01 2016-07-15 Systems and Methods for Providing Reinforcement Learning in a Deep Learning System
US16/576,697 Pending US20200065672A1 (en) 2015-07-01 2019-09-19 Systems and Methods for Providing Reinforcement Learning in a Deep Learning System

Family Applications After (1)

Application Number Title Priority Date Filing Date
US16/576,697 Pending US20200065672A1 (en) 2015-07-01 2019-09-19 Systems and Methods for Providing Reinforcement Learning in a Deep Learning System

Country Status (1)

Country Link
US (2) US20170032245A1 (en)

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460405A (en) * 2018-02-02 2018-08-28 上海大学 Image steganalysis ensemble classifier optimization method based on deep reinforcement learning
JP2018142060A (en) * 2017-02-27 2018-09-13 株式会社東芝 Isolation management system and isolation management method
WO2018205778A1 (en) * 2017-05-11 2018-11-15 苏州大学张家港工业技术研究院 Large-area surveillance method based on deep weighted double Q-learning, and surveillance robot
WO2018210430A1 (en) * 2017-05-19 2018-11-22 Telefonaktiebolaget Lm Ericsson (Publ) Training a software agent to control an environment
CN109344877A (en) * 2018-08-31 2019-02-15 深圳先进技术研究院 Sample data processing method, sample data processing apparatus, and electronic device
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Microgrid energy storage scheduling method and device based on deep Q-network reinforcement learning
US10210860B1 (en) 2018-07-27 2019-02-19 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
CN109597876A (en) * 2018-11-07 2019-04-09 中山大学 Multi-turn dialogue answer selection model and method based on reinforcement learning
CN109711239A (en) * 2018-09-11 2019-05-03 重庆邮电大学 Visual attention detection method based on an improved hybrid incremental dynamic Bayesian network
CN109976909A (en) * 2019-03-18 2019-07-05 中南大学 Learning-based low-latency task scheduling method in edge computing networks
US20190385091A1 (en) * 2018-06-15 2019-12-19 International Business Machines Corporation Reinforcement learning exploration by exploiting past experiences for critical events
CN110598906A (en) * 2019-08-15 2019-12-20 珠海米枣智能科技有限公司 Method and system for controlling energy consumption of superstores in real time based on deep reinforcement learning
CN110753936A (en) * 2017-08-25 2020-02-04 谷歌有限责任公司 Batch reinforcement learning
CN111160755A (en) * 2019-12-26 2020-05-15 西北工业大学 DQN-based real-time scheduling method for aircraft overhaul workshop
CN111226235A (en) * 2018-01-17 2020-06-02 华为技术有限公司 Method for generating training data for training neural network, method for training neural network, and method for autonomous operation using neural network
CN111258909A (en) * 2020-02-07 2020-06-09 中国信息安全测评中心 Test sample generation method and device
US10701439B2 (en) 2018-01-04 2020-06-30 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method of thereof
US20200249675A1 (en) * 2019-01-31 2020-08-06 StradVision, Inc. Method and device for providing personalized and calibrated adaptive deep learning model for the user of an autonomous vehicle
US10789511B2 (en) * 2018-10-12 2020-09-29 Deepmind Technologies Limited Controlling agents over long time scales using temporal value transport
US10818019B2 (en) 2017-08-14 2020-10-27 Siemens Healthcare Gmbh Dilated fully convolutional network for multi-agent 2D/3D medical image registration
CN111950873A (en) * 2020-07-30 2020-11-17 上海卫星工程研究所 Satellite real-time guiding task planning method and system based on deep reinforcement learning
CN112084680A (en) * 2020-09-02 2020-12-15 沈阳工程学院 Energy Internet optimization strategy method based on DQN algorithm
US10872294B2 (en) * 2018-09-27 2020-12-22 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
CN112183762A (en) * 2020-09-15 2021-01-05 上海交通大学 Reinforcement learning method based on a hybrid action space
US10984507B2 (en) 2019-07-17 2021-04-20 Harris Geospatial Solutions, Inc. Image processing system including training model based upon iterative blurring of geospatial images and related methods
CN112734014A (en) * 2021-01-12 2021-04-30 山东大学 Experience replay sampling reinforcement learning method and system based on the upper confidence bound principle
CN112836974A (en) * 2021-02-05 2021-05-25 上海海事大学 DQN- and MCTS-based dynamic scheduling method for multiple yard cranes across container blocks
US20210200743A1 (en) * 2019-12-30 2021-07-01 Ensemble Rcm, Llc Validation of data in a database record using a reinforcement learning algorithm
US11068748B2 (en) 2019-07-17 2021-07-20 Harris Geospatial Solutions, Inc. Image processing system including training model based upon iteratively biased loss function and related methods
US20210224685A1 (en) * 2020-01-21 2021-07-22 Walmart Apollo, Llc Robust reinforcement learning in personalized content prediction
CN113162850A (en) * 2021-01-13 2021-07-23 中国科学院计算技术研究所 Artificial intelligence-based heterogeneous network multi-path scheduling method and system
CN113261016A (en) * 2018-11-05 2021-08-13 诺基亚通信公司 Single-shot multi-user multiple-input multiple-output (MU-MIMO) resource pairing using Deep Q Network (DQN) based reinforcement learning
CN113268933A (en) * 2021-06-18 2021-08-17 大连理工大学 Rapid structural parameter design method of S-shaped emergency robot based on reinforcement learning
CN113544703A (en) * 2019-03-05 2021-10-22 易享信息技术有限公司 Efficient off-policy credit allocation
US11164077B2 (en) * 2017-11-02 2021-11-02 Siemens Aktiengesellschaft Randomized reinforcement learning for control of complex systems
US11182676B2 (en) 2017-08-04 2021-11-23 International Business Machines Corporation Cooperative neural network deep reinforcement learning with partial input assistance
US11188797B2 (en) * 2018-10-30 2021-11-30 International Business Machines Corporation Implementing artificial intelligence agents to perform machine learning tasks using predictive analytics to leverage ensemble policies for maximizing long-term returns
US11204761B2 (en) 2018-12-03 2021-12-21 International Business Machines Corporation Data center including cognitive agents and related methods
CN113923308A (en) * 2021-10-15 2022-01-11 浙江工业大学 Predictive outbound-call task allocation method and outbound-call system based on deep reinforcement learning
CN114138416A (en) * 2021-12-03 2022-03-04 福州大学 Load-time-window-oriented DQN-based adaptive allocation method for cloud software resources
CN114161419A (en) * 2021-12-13 2022-03-11 大连理工大学 Robot operation skill efficient learning method guided by scene memory
US20220183748A1 (en) * 2020-12-16 2022-06-16 Biosense Webster (Israel) Ltd. Accurate tissue proximity
US20220215269A1 (en) * 2018-02-06 2022-07-07 Cognizant Technology Solutions U.S. Corporation Enhancing Evolutionary Optimization in Uncertain Environments By Allocating Evaluations Via Multi-Armed Bandit Algorithms
US11417087B2 (en) 2019-07-17 2022-08-16 Harris Geospatial Solutions, Inc. Image processing system including iteratively biased training model probability distribution function and related methods
US11449763B2 (en) * 2018-03-07 2022-09-20 Adobe Inc. Making resource-constrained sequential recommendations
US11461703B2 (en) 2019-01-23 2022-10-04 International Business Machines Corporation Determinantal reinforced learning in artificial intelligence
US11557036B2 (en) 2016-05-18 2023-01-17 Siemens Healthcare Gmbh Method and system for image registration using an intelligent artificial agent
US11568236B2 (en) 2018-01-25 2023-01-31 The Research Foundation For The State University Of New York Framework and methods of diverse exploration for fast and safe policy improvement
US20230072777A1 (en) * 2021-07-16 2023-03-09 Tata Consultancy Services Limited Budget constrained deep q-network for dynamic campaign allocation in computational advertising
US11656775B2 (en) 2018-08-07 2023-05-23 Marvell Asia Pte, Ltd. Virtualizing isolation areas of solid-state storage media
WO2023111700A1 (en) * 2021-12-15 2023-06-22 International Business Machines Corporation Reinforcement learning under constraints
US11693601B2 (en) 2018-08-07 2023-07-04 Marvell Asia Pte, Ltd. Enabling virtual functions on storage media
US11790399B2 (en) 2020-01-21 2023-10-17 Walmart Apollo, Llc Dynamic evaluation and use of global and contextual personas
US11823039B2 (en) 2018-08-24 2023-11-21 International Business Machines Corporation Safe and fast exploration for reinforcement learning using constrained action manifolds
US11853901B2 (en) 2019-07-26 2023-12-26 Samsung Electronics Co., Ltd. Learning method of AI model and electronic apparatus

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10396919B1 (en) * 2017-05-12 2019-08-27 Virginia Tech Intellectual Properties, Inc. Processing of communications signals using machine learning
US11640516B2 (en) * 2020-06-03 2023-05-02 International Business Machines Corporation Deep evolved strategies with reinforcement

Cited By (74)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11741605B2 (en) 2016-05-18 2023-08-29 Siemens Healthcare Gmbh Method and system for image registration using an intelligent artificial agent
US11557036B2 (en) 2016-05-18 2023-01-17 Siemens Healthcare Gmbh Method and system for image registration using an intelligent artificial agent
JP2018142060A (en) * 2017-02-27 2018-09-13 株式会社東芝 Isolation management system and isolation management method
US11224970B2 (en) * 2017-05-11 2022-01-18 Soochow University Large area surveillance method and surveillance robot based on weighted double deep Q-learning
WO2018205778A1 (en) * 2017-05-11 2018-11-15 苏州大学张家港工业技术研究院 Large-area surveillance method based on deep weighted double Q-learning, and surveillance robot
WO2018210430A1 (en) * 2017-05-19 2018-11-22 Telefonaktiebolaget Lm Ericsson (Publ) Training a software agent to control an environment
US11182676B2 (en) 2017-08-04 2021-11-23 International Business Machines Corporation Cooperative neural network deep reinforcement learning with partial input assistance
US11354813B2 (en) 2017-08-14 2022-06-07 Siemens Healthcare Gmbh Dilated fully convolutional network for 2D/3D medical image registration
US10818019B2 (en) 2017-08-14 2020-10-27 Siemens Healthcare Gmbh Dilated fully convolutional network for multi-agent 2D/3D medical image registration
CN110753936A (en) * 2017-08-25 2020-02-04 谷歌有限责任公司 Batch reinforcement learning
US11164077B2 (en) * 2017-11-02 2021-11-02 Siemens Aktiengesellschaft Randomized reinforcement learning for control of complex systems
US10701439B2 (en) 2018-01-04 2020-06-30 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method of thereof
CN111226235A (en) * 2018-01-17 2020-06-02 华为技术有限公司 Method for generating training data for training neural network, method for training neural network, and method for autonomous operation using neural network
US11568236B2 (en) 2018-01-25 2023-01-31 The Research Foundation For The State University Of New York Framework and methods of diverse exploration for fast and safe policy improvement
CN108460405A (en) * 2018-02-02 2018-08-28 上海大学 Image steganalysis ensemble classifier optimization method based on deep reinforcement learning
US20220215269A1 (en) * 2018-02-06 2022-07-07 Cognizant Technology Solutions U.S. Corporation Enhancing Evolutionary Optimization in Uncertain Environments By Allocating Evaluations Via Multi-Armed Bandit Algorithms
US11449763B2 (en) * 2018-03-07 2022-09-20 Adobe Inc. Making resource-constrained sequential recommendations
US20190385091A1 (en) * 2018-06-15 2019-12-19 International Business Machines Corporation Reinforcement learning exploration by exploiting past experiences for critical events
US11367433B2 (en) 2018-07-27 2022-06-21 Deepgram, Inc. End-to-end neural networks for speech recognition and classification
US20210035565A1 (en) * 2018-07-27 2021-02-04 Deepgram, Inc. Deep learning internal state index-based search and classification
US10720151B2 (en) 2018-07-27 2020-07-21 Deepgram, Inc. End-to-end neural networks for speech recognition and classification
US10540959B1 (en) 2018-07-27 2020-01-21 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
US10380997B1 (en) * 2018-07-27 2019-08-13 Deepgram, Inc. Deep learning internal state index-based search and classification
US20200035224A1 (en) * 2018-07-27 2020-01-30 Deepgram, Inc. Deep learning internal state index-based search and classification
US10847138B2 (en) * 2018-07-27 2020-11-24 Deepgram, Inc. Deep learning internal state index-based search and classification
US11676579B2 (en) * 2018-07-27 2023-06-13 Deepgram, Inc. Deep learning internal state index-based search and classification
US10210860B1 (en) 2018-07-27 2019-02-19 Deepgram, Inc. Augmented generalized deep learning with special vocabulary
US11693601B2 (en) 2018-08-07 2023-07-04 Marvell Asia Pte, Ltd. Enabling virtual functions on storage media
US11656775B2 (en) 2018-08-07 2023-05-23 Marvell Asia Pte, Ltd. Virtualizing isolation areas of solid-state storage media
US11823039B2 (en) 2018-08-24 2023-11-21 International Business Machines Corporation Safe and fast exploration for reinforcement learning using constrained action manifolds
CN109344877A (en) * 2018-08-31 2019-02-15 深圳先进技术研究院 Sample data processing method, sample data processing apparatus, and electronic device
CN109711239A (en) * 2018-09-11 2019-05-03 重庆邮电大学 Visual attention detection method based on an improved hybrid incremental dynamic Bayesian network
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Microgrid energy storage scheduling method and device based on deep Q-network reinforcement learning
US10872294B2 (en) * 2018-09-27 2020-12-22 Deepmind Technologies Limited Imitation learning using a generative predecessor neural network
JP7139524B2 (en) 2018-10-12 2022-09-20 ディープマインド テクノロジーズ リミテッド Control agents over long timescales using time value transfer
CN112840359A (en) * 2018-10-12 2021-05-25 渊慧科技有限公司 Controlling agents over long time scales using temporal value transport
US11769049B2 (en) 2018-10-12 2023-09-26 Deepmind Technologies Limited Controlling agents over long time scales using temporal value transport
JP2022504739A (en) * 2018-10-12 2022-01-13 ディープマインド テクノロジーズ リミテッド Controlling agents over long timescales using time value transfer
US10789511B2 (en) * 2018-10-12 2020-09-29 Deepmind Technologies Limited Controlling agents over long time scales using temporal value transport
US11188797B2 (en) * 2018-10-30 2021-11-30 International Business Machines Corporation Implementing artificial intelligence agents to perform machine learning tasks using predictive analytics to leverage ensemble policies for maximizing long-term returns
CN113261016A (en) * 2018-11-05 2021-08-13 诺基亚通信公司 Single-shot multi-user multiple-input multiple-output (MU-MIMO) resource pairing using Deep Q Network (DQN) based reinforcement learning
CN109597876A (en) * 2018-11-07 2019-04-09 中山大学 Multi-turn dialogue answer selection model and method based on reinforcement learning
US11204761B2 (en) 2018-12-03 2021-12-21 International Business Machines Corporation Data center including cognitive agents and related methods
US11461703B2 (en) 2019-01-23 2022-10-04 International Business Machines Corporation Determinantal reinforced learning in artificial intelligence
US10824151B2 (en) * 2019-01-31 2020-11-03 StradVision, Inc. Method and device for providing personalized and calibrated adaptive deep learning model for the user of an autonomous vehicle
US20200249675A1 (en) * 2019-01-31 2020-08-06 StradVision, Inc. Method and device for providing personalized and calibrated adaptive deep learning model for the user of an autonomous vehicle
CN113544703A (en) * 2019-03-05 2021-10-22 易享信息技术有限公司 Efficient off-policy credit allocation
CN109976909A (en) * 2019-03-18 2019-07-05 中南大学 Learning-based low-latency task scheduling method in edge computing networks
US11068748B2 (en) 2019-07-17 2021-07-20 Harris Geospatial Solutions, Inc. Image processing system including training model based upon iteratively biased loss function and related methods
US10984507B2 (en) 2019-07-17 2021-04-20 Harris Geospatial Solutions, Inc. Image processing system including training model based upon iterative blurring of geospatial images and related methods
US11417087B2 (en) 2019-07-17 2022-08-16 Harris Geospatial Solutions, Inc. Image processing system including iteratively biased training model probability distribution function and related methods
US11853901B2 (en) 2019-07-26 2023-12-26 Samsung Electronics Co., Ltd. Learning method of AI model and electronic apparatus
CN110598906A (en) * 2019-08-15 2019-12-20 珠海米枣智能科技有限公司 Method and system for controlling energy consumption of superstores in real time based on deep reinforcement learning
CN111160755A (en) * 2019-12-26 2020-05-15 西北工业大学 DQN-based real-time scheduling method for aircraft overhaul workshop
US20210200743A1 (en) * 2019-12-30 2021-07-01 Ensemble Rcm, Llc Validation of data in a database record using a reinforcement learning algorithm
US20210224685A1 (en) * 2020-01-21 2021-07-22 Walmart Apollo, Llc Robust reinforcement learning in personalized content prediction
US11790399B2 (en) 2020-01-21 2023-10-17 Walmart Apollo, Llc Dynamic evaluation and use of global and contextual personas
US11645580B2 (en) * 2020-01-21 2023-05-09 Walmart Apollo, Llc Robust reinforcement learning in personalized content prediction
CN111258909A (en) * 2020-02-07 2020-06-09 中国信息安全测评中心 Test sample generation method and device
CN111950873A (en) * 2020-07-30 2020-11-17 上海卫星工程研究所 Satellite real-time guiding task planning method and system based on deep reinforcement learning
CN112084680A (en) * 2020-09-02 2020-12-15 沈阳工程学院 Energy Internet optimization strategy method based on DQN algorithm
CN112183762A (en) * 2020-09-15 2021-01-05 上海交通大学 Reinforcement learning method based on a hybrid action space
US20220183748A1 (en) * 2020-12-16 2022-06-16 Biosense Webster (Israel) Ltd. Accurate tissue proximity
CN112734014A (en) * 2021-01-12 2021-04-30 山东大学 Experience replay sampling reinforcement learning method and system based on the upper confidence bound principle
CN113162850A (en) * 2021-01-13 2021-07-23 中国科学院计算技术研究所 Artificial intelligence-based heterogeneous network multi-path scheduling method and system
CN112836974A (en) * 2021-02-05 2021-05-25 上海海事大学 DQN- and MCTS-based dynamic scheduling method for multiple yard cranes across container blocks
CN113268933A (en) * 2021-06-18 2021-08-17 大连理工大学 Rapid structural parameter design method of S-shaped emergency robot based on reinforcement learning
US20230072777A1 (en) * 2021-07-16 2023-03-09 Tata Consultancy Services Limited Budget constrained deep q-network for dynamic campaign allocation in computational advertising
US11915262B2 (en) * 2021-07-16 2024-02-27 Tata Consultancy Services Limited Budget constrained deep Q-network for dynamic campaign allocation in computational advertising
CN113923308A (en) * 2021-10-15 2022-01-11 浙江工业大学 Predictive outbound-call task allocation method and outbound-call system based on deep reinforcement learning
CN114138416A (en) * 2021-12-03 2022-03-04 福州大学 Load-time-window-oriented DQN-based adaptive allocation method for cloud software resources
CN114161419A (en) * 2021-12-13 2022-03-11 大连理工大学 Robot operation skill efficient learning method guided by scene memory
WO2023111700A1 (en) * 2021-12-15 2023-06-22 International Business Machines Corporation Reinforcement learning under constraints
TWI822291B (en) * 2021-12-15 2023-11-11 美商萬國商業機器公司 Computer-implemented methods, computer program products, and computer processing systems for offline reinforcement learning with a dataset

Also Published As

Publication number Publication date
US20200065672A1 (en) 2020-02-27

Similar Documents

Publication Publication Date Title
US20170032245A1 (en) Systems and Methods for Providing Reinforcement Learning in a Deep Learning System
WO2017004626A1 (en) Systems and methods for providing reinforcement learning in a deep learning system
Osband et al. Deep exploration via bootstrapped DQN
Gronauer et al. Multi-agent deep reinforcement learning: a survey
CN109902706B (en) Recommendation method and device
Shao et al. A survey of deep reinforcement learning in video games
US11291917B2 (en) Artificial intelligence (AI) model training using cloud gaming network
CN110326004B (en) Training a strategic neural network using path consistency learning
AU2016354558B2 (en) Asynchronous deep reinforcement learning
US11157316B1 (en) Determining action selection policies of an execution device
Dowe et al. Bayes not Bust! Why Simplicity is no Problem for Bayesians
US11204803B2 (en) Determining action selection policies of an execution device
CN113487039A (en) Intelligent body self-adaptive decision generation method and system based on deep reinforcement learning
Wang et al. Achieving cooperation through deep multiagent reinforcement learning in sequential prisoner's dilemmas
Catteeuw et al. The limits and robustness of reinforcement learning in Lewis signalling games
CN112470123A (en) Determining action selection guidelines for an execution device
Gao et al. Adversarial policy gradient for alternating markov games
US20220036179A1 (en) Online task inference for compositional tasks with context adaptation
US9981190B2 (en) Telemetry based interactive content generation
US20210138350A1 (en) Sensor statistics for ranking users in matchmaking systems
Olesen et al. Evolutionary planning in latent space
Amhraoui et al. Expected Lenient Q-learning: a fast variant of the Lenient Q-learning algorithm for cooperative stochastic Markov games
Pavlovic A semantical approach to equilibria and rationality
Hooper POMDP learning in partially observable 3d virtual worlds
Seth et al. A scalable species-based genetic algorithm for reinforcement learning problems

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: THE BOARD OF TRUSTEES OF THE LELAND STANFORD JUNIOR UNIVERSITY, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:OSBAND, IAN DAVID MOFFAT;VAN ROY, BENJAMIN;REEL/FRAME:052368/0934

Effective date: 20191217