WO2017004626A1 - Systems and methods for providing reinforcement learning in a deep learning system - Google Patents

Systems and methods for providing reinforcement learning in a deep learning system

Info

Publication number
WO2017004626A1
WO2017004626A1 (PCT/US2016/042631)
Authority
WO
WIPO (PCT)
Prior art keywords
data
state
action
network
reinforcement learning
Prior art date
Application number
PCT/US2016/042631
Other languages
English (en)
Inventor
Ian David Moffat OSBAND
Benjamin Van Roy
Original Assignee
The Board Of Trustees Of The Leland Stanford Junior University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Board Of Trustees Of The Leland Stanford Junior University filed Critical The Board Of Trustees Of The Leland Stanford Junior University
Publication of WO2017004626A1 publication Critical patent/WO2017004626A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • Deep learning networks, including but not limited to artificial neural networks, are machine learning systems that receive data, extract statistics, and classify results. These systems use a training set of data to generate a model in order to make data-driven decisions that provide a desired output.
  • the reinforcement learning process is performed in the following manner in accordance with some embodiments.
  • a set of observed data and a set of artificial data are received.
  • the process samples from the union of the set of observed data and the set of artificial data to generate a set of training data.
  • a state-action value function is then determined for the set of training data using a bootstrap process and an approximator.
  • the approximator estimates a state-action value function for a data set.
  • the process determines a state of the system for a current time step from the set of training data.
  • an action is selected by the process based on the determined state of the system and a policy mapping actions to states, and the results of the action, including a reward and a transition state that result from the selected action, are determined. Result data from the current time step that includes the state, the action, and the transition state is stored. At the conclusion of an episode, the set of observed data is updated with the result data from at least one time step of the episode.
  • the reinforcement learning process generates the set of artificial data from the set of observed data.
  • the artificial data is generated by sampling the set of observed data with replacement to generate the set of artificial data.
  • the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each of the sampled state-action pairs stochastically optimistic rewards and random state transitions (a sketch of both ways of generating artificial data appears after this list).
  • the approximator is received as an input.
  • the approximator is read from memory.
  • the approximator is a neural network trained to fit a state-action value function to the data set via a least-squares iteration.
  • Figure 1 illustrates various devices in a network that perform processes providing reinforcement learning in a deep learning network in accordance with various embodiments of the invention.
  • Figure 2 illustrates a processing system in a device that performs processes providing reinforcement learning in a deep learning network in accordance with various embodiments of the invention.
  • Figure 3 illustrates a deep neural network that uses processes providing reinforcement learning in deep learning networks in accordance with some embodiments of the invention.
  • Figure 6 illustrates a process for providing reinforcement learning in a deep learning network in accordance with an embodiment of the invention.
  • Figure 8 illustrates a deterministic chain of states in an environment of a problem.
  • Figure 9 illustrates the results of application of various reinforcement learning approaches.
  • Figure 10 illustrates results of a deep learning network using processes that provide reinforcement learning in accordance with an embodiment of the invention and results from a DQN network.
  • Figure 12 illustrates graphs showing improvements to policies and rewards of various Atari games by deep learning network using systems and methods for providing reinforcement learning in accordance with an embodiment of the invention.
  • Figure 13 illustrates a table of results for various deep learning networks, including a deep learning network that uses processes providing reinforcement learning in accordance with an embodiment of the invention.
  • deep learning networks are machine learning systems that use a dataset of observed data to learn how to solve a problem in a system where all of the states of the system, actions based upon states, and/or the resulting transitions are not fully known.
  • Examples of deep learning networks include, but are not limited to, deep neural networks.
  • Systems and methods in accordance with some embodiments of this invention provide reinforcement learning by providing an exploration process for a deep learning network to solve a problem in an environment.
  • actions taken by a system may impose delayed consequences.
  • the design of exploration strategies is more difficult than in action-response systems with no delayed consequences, such as multi-armed bandit problems, because the system must establish a context.
  • Deep exploration means exploration that is directed over multiple time steps. Deep exploration can also be called “planning to learn" or "far-sighted” exploration.
  • Planning and look-ahead trees for several algorithmic exploration approaches to the MDP of this deterministic chain are shown in FIG. 5.
  • tree 501 represents the possible decisions of a bandit algorithm
  • tree 502 represents the possible decisions of a dithering algorithm
  • tree 503 represents the possible decisions of a shallow exploration algorithm
  • tree 504 represents the possible decisions of a deep exploration algorithm.
  • Actions, including the actions "left" and "right", are solid lines; rewarding states are at the left-most and right-most bottom nodes; and dashed lines indicate that the agent can plan ahead for either rewards or information.
  • a reinforcement learning agent can plan to exploit future rewards. The strategies that use direct exploration cannot plan ahead.
  • Reinforcement learning is a deep learning approach that differs from standard supervised learning in that correct input/output pairs are never presented, nor sub- optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
  • a common approach to reinforcement learning involves learning a state-action value function, Q, which for time t, state s, and action a provides an estimate, Q_t(s, a), of the expected rewards over the remainder of the episode: r_t + r_{t+1} + ... + r_T.
  • a bootstrap principle is commonly used to approximate a population distribution by a sample distribution.
  • a common bootstrap takes as input a data set D and an estimator, ψ.
  • the bootstrap generates a sample data set from the bootstrapped distribution that has cardinality equal to that of D and is sampled uniformly with replacement from data set D.
  • the bootstrap sample estimate is then taken to be ψ evaluated on that resampled data set (a minimal sketch of this bootstrap appears after this list).
  • a network that is an efficient and scalable system for generating bootstrap samples from a large and deep neural network includes a shared architecture with K bootstrapped heads (or exploration processes) branching off independently. Each head is trained only on its own unique sub-sample of the data, representing a single bootstrap sample ψ(D') (a sketch of such a multi-head network and its masked training appears after this list).
  • the shared network learns via a joint feature representation across all of the data, which can provide significant computational advantages at the cost of lower diversity between heads.
  • This type of bootstrap can be trained efficiently in a single forward/backward pass and can be thought of as data-dependent dropout, where the dropout mask for each head is fixed for each data point.
  • Q^π(s, a) = E_{s,a,π}[ Σ_{t=1}^∞ γ^t r_t ], where γ ∈ (0, 1) is a discount factor that balances immediate versus future rewards r_t.
  • This expectation indicates that the initial state is s, the action is a, and thereafter actions are selected by the policy π.
  • A deep learning network that uses reinforcement learning in accordance with some embodiments of the invention modifies the learning process to approximate a distribution over Q-values via the bootstrap.
  • a deep learning network that uses reinforcement learning as provided in accordance with some embodiments of the invention samples a single Q-value function from an approximate posterior maintained by the system.
  • An exploration process then follows the policy which is optimal for that sample for the duration of the episode. This is a natural adaptation of the Thompson sampling heuristic to reinforcement learning that allows for temporally extended (or deep) exploration.
  • An exploration process for a deep learning network that uses reinforcement learning as provided in accordance with some embodiments of the invention may be efficiently implemented by building up K ∈ ℕ bootstrapped estimates of the Q-value function in parallel.
  • Each one of these value function heads Q_k(s, a; θ) is trained against a separate target network Q_k(s, a; θ⁻) such that each of Q_1, ..., Q_K provides a temporally extended (and consistent) estimate of the value uncertainty via Thompson sampling distribution estimates.
  • flags w_1, ..., w_K ∈ {0, 1} that indicate which heads are privy to which data are maintained.
  • a bootstrap sample is made by selecting k ∈ {1, ..., K} uniformly at random and following Q_k for the duration of that episode (a sketch of this per-episode head selection appears after this list).
  • a deep learning network that uses reinforcement learning provided in accordance with some embodiments of the invention commits to a sample for several time steps, performing exploration via randomized value functions sampled from an approximate posterior.
  • a deep learning network that uses reinforcement learning provided in accordance with some embodiments of the invention recovers state-of-the-art guarantees in the setting with tabular basis functions, but the performance of these systems is crucially dependent upon a suitable linear representation of the value function.
  • a deep learning network that uses reinforcement learning provided in accordance with some embodiments of the invention extends these ideas to produce a system that can simultaneously perform generalization and exploration with a flexible nonlinear value function representation. Our method is simple, general, and compatible with almost all advances in deep exploration via reinforcement learning at low computational cost and with few tuning parameters.
  • the process observes a state of the system, s_t, for a particular time period from the training sample data set and selects an action to perform based upon policy π.
  • the results, including a reward r_t realized and a resulting transition state s_{t+1} resulting from the action, are observed.
  • the state, action, reward and transition state are stored as result data for the episode. This is repeated for each time step in the episode.
  • the observed dataset is updated with the result data stored during the episode.
  • a training mask may be maintained that includes flags indicating that the result data from particular time steps is to be used for training. The training mask is then read to determine the result data to add to the observed data.
  • Network 100 includes a communications network 160.
  • the communications network 160 is a network such as the Internet that allows devices connected to the network 160 to communicate with other connected devices.
  • Server systems 110, 140, and 170 are connected to the network 160.
  • Each of the server systems 110, 140, and 170 is a group of one or more servers communicatively connected to one another via internal networks that execute processes that provide cloud services to users over the network 160.
  • cloud services are one or more applications that are executed by one or more server systems to provide data and/or executable applications to devices over a network.
  • the server systems 110, 140, and 170 are shown each having three servers in the internal network. However, the server systems 110, 140, and 170 may include any number of servers, and any additional number of server systems may be connected to the network 160 to provide cloud services.
  • a deep learning network that uses systems and methods that provide reinforcement learning in accordance with an embodiment of the invention may be provided by processes being executed on a single server system and/or a group of server systems communicating over network 160.
  • Users may use personal devices 180 and 120 that connect to the network 160 to perform processes for providing and/or interaction with a deep learning network that uses systems and methods that provide reinforcement learning in accordance with various embodiments of the invention.
  • the personal devices 180 are shown as desktop computers that are connected via a conventional "wired" connection to the network 160.
  • the personal device 180 may be a desktop computer, a laptop computer, a smart television, an entertainment gaming console, or any other device that connects to the network 160 via a "wired" connection.
  • the mobile device 120 connects to network 160 using a wireless connection.
  • a wireless connection is a connection that uses Radio Frequency (RF) signals, Infrared signals, or any other form of wireless signaling to connect to the network 160.
  • the mobile device 120 is a mobile telephone.
  • mobile device 120 may be a mobile phone, Personal Digital Assistant (PDA), a tablet, a smartphone, or any other type of device that connects to network 160 via wireless connection without departing from this invention.
  • the processor 205 is a processor, microprocessor, controller, or a combination of processors, microprocessors, and/or controllers that performs instructions stored in the volatile memory 215 or the non-volatile memory 210 to manipulate data stored in the memory.
  • the non-volatile memory 210 can store the processor instructions utilized to configure the processing system 200 to perform processes including processes in accordance with embodiments of the invention and/or data for the processes being utilized.
  • the processing system software and/or firmware can be stored in any of a variety of non-transient computer readable media appropriate to a specific application.
  • a network interface is a device that allows processing system 200 to transmit and receive data over a network based upon the instructions performed by processor 205. Although a processing system 200 is illustrated in FIG. 2, any of a variety of processing systems in the various devices can be configured to provide the methods and systems in accordance with embodiments of the invention.
  • a deep learning network that uses reinforcement learning as provided in accordance with embodiments of this invention modifies DQN to approximate a distribution over Q-values via the bootstrap.
  • a deep learning network that uses reinforcement learning as provided in accordance with embodiments of this invention samples a single Q-value function from its approximate posterior. The system follows the policy which is optimal for that sample for the duration of the episode.
  • An exploration process is efficiently implemented by building up K ∈ ℕ bootstrapped estimates of the Q-value function in parallel, as in Figure 3. Importantly, each one of these value function heads Q_k(s, a; θ) is trained against its own target network Q_k(s, a; θ⁻). This means that each of Q_1, ..., Q_K provides a temporally extended (and consistent) estimate of the value uncertainty via TD estimates.
  • Flags w_1, ..., w_K ∈ {0, 1} indicating which heads are privy to which data are maintained.
  • We approximate a bootstrap sample by selecting k ∈ {1, ..., K} uniformly at random and following Q_k for the duration of that episode.
  • systems and methods provide reinforcement learning by providing a deep exploration process.
  • the deep exploration process fits a state-action value function to a sample of data from a set of data that includes artificial data and observed data.
  • the system receives a set of training data that includes observed data and artificial data.
  • the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model and assigning each state-action pair stochastically optimistic rewards and random state transitions.
  • the artificial set of data is generated by sampling the observed set of data with replacement.
  • the artificial data includes M elements, where M is approximately equal to or greater than the number of elements in the observed data set.
  • the use of the combination of observed and artificial data provides randomness in the samples to induce deep exploration.
  • An exploration process for providing reinforcement learning to a deep learning network in accordance with an embodiment of this invention is shown in FIG. 6.
  • Process 600 performs exploration in M distinct episodes.
  • the number of episodes is received as an input, and in accordance with some other embodiments the number of episodes may be set or selected by the process based on the size of the deep learning network.
  • a set of data including historical and artificial data is obtained (605).
  • the observed data is read from a memory.
  • the observed data is received from another system.
  • the artificial data is generated by sampling the observed data with replacement. In accordance with some other embodiments, the artificial data is generated from the observed data based on a known distribution of the original data. In accordance with some other embodiments, the artificial data is generated independent of the observed data. In accordance with some of these embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model; and assigning each state-action pair stochastically optimistic rewards and random state transitions. In accordance with some embodiments, the artificial dataset includes M elements of data where M is approximately equal to or greater than the number of elements in the observed dataset.
  • An approximator function is also received as input (610).
  • the approximator function may be set for process 600 and stored in memory for use.
  • the approximator estimates a state-action value function for a data set.
  • the approximator function may be a neural network trained to fit a state-action value function to the data set via a least-squares iteration.
  • the observed and artificial data are sampled to obtain a training set of data (615).
  • the training data includes M samples of data.
  • M is equal to or greater than a number of episodes to observe.
  • the sampling is performed in accordance with a known and/or a provided distribution.
  • a bootstrap process is applied to the union of the observed data and the training data to obtain a new distribution and the approximator function is applied to the distribution to generate a randomized state-value function (620).
  • a state, s, is observed based on the training data and an action, a, is selected based on the state of the system from the sample of data and the policy π (630).
  • the reward, r_t, realized and the resulting transition state, s_{t+1}, are observed (635).
  • the state, s, the action, a, and the resulting transition state, s_{t+1}, are stored as resulting data in memory.
  • the selecting (630) and observing of the results are repeated until the time period ends (640).
  • the observed set of data is then updated with the results (650).
  • a training mask is maintained that indicates the result data from particular time steps of each episode to add to the observed set of data, and the mask is read to determine which elements of the resulting data to add to the observed data. This is then repeated for each of the M episodes (645) and process 600 ends (a minimal end-to-end sketch of this exploration loop appears after this list).
  • Fitting a model like a deep neural network is a computationally expensive task. As such, it is desirable to use incremental methods to incorporate new data samples into the fitting process as the data is generated. To do so, parallel computing may be used. A process that performs multiple concurrent explorations in accordance with an embodiment of the invention is shown in FIG. 7.
  • Process 700 performs exploration in M distinct episodes for K separate exploration process.
  • the number of episodes is received as an input, and in accordance with some other embodiments the number of episodes may be set or selected by the process based on the size of the deep learning network.
  • a set of data including historical and artificial data is obtained (705).
  • the observed data is read from a memory.
  • the observed data is received from another system.
  • the artificial data for one or more exploration processes is generated by sampling the observed data with replacement. In accordance with some other embodiments, the artificial data for one or more exploration processes is generated from the observed data based on a known distribution of the original data. In accordance with some other embodiments, the artificial data for one or more of the exploration processes is generated independent of the observed data. In accordance with some of these embodiments, the artificial data is generated by sampling state-action pairs from a diffusely mixed generative model; and assigning each state-action pair stochastically optimistic rewards and random state transitions. In accordance with some embodiments, the artificial dataset includes M elements of data where M is approximately equal to or greater than the number of elements in the observed dataset.
  • An approximator function is also received as input (705).
  • the approximator function may be set for process 700 and stored in memory for use.
  • the approximator estimates a state-action value function for a data set.
  • the approximator function may be a neural network trained to fit a state-action value function to the data set via a least-squares iteration.
  • From 1 to K approximators may be used, where each approximator is applied to the training set data of one or more of the K exploration processes.
  • the observed and artificial data are sampled to obtain a training set of data for each of the K independent processes (715).
  • the training data for each exploration process includes M samples of data.
  • M is equal to or greater than a number of episodes to observe.
  • the sampling for one or more exploration processes is performed in accordance with a known and/or a provided distribution.
  • one or more of the exploration processes may have the same set of artificial data.
  • a bootstrap process is applied to the union of the observed data and the artificial data to obtain a new distribution and the approximator function is applied to the distribution to generate a randomized state-value function (720).
  • exploration is performed in the following manner. For each time step, a state, s, is observed and an action, a, is selected based on the state of the system from the sample of data and the policy π (730). The reward, r_t, realized and the resulting transition state, s_{t+1}, are observed (735). The state, s, the action, a, and the resulting transition state, s_{t+1}, are stored as resulting data in memory. The selecting (730) and observing of the results are repeated until the time period for the episode ends (740).
  • the observed set of data is individually updated for each exploration process with the results (750). To do so, a bootstrap mask may be maintained to indicate the observed data that is available to each exploration process. In accordance with some other embodiments, the observed data is updated with the data from all of the different K exploration processes. In accordance with some embodiments, a training mask is maintained that indicates the result data from particular time steps of each episode for each exploration process to add to the observed set of data, and the mask is read to determine which elements of the resulting data to add to the observed data. This is then repeated for each of the M episodes (745) and process 700 ends (a sketch of running K such exploration processes in parallel appears after this list).
  • Ensemble DQN uses the same architecture as bootstrapped DQN, but with an ensemble policy.
  • a process has successfully learned the optimal policy when it has completed one hundred episodes with the optimal reward of 10.
  • each learning system was executed for 2000 episodes across three seeds.
  • the median time to learn for each system is shown in Figure 9, together with a conservative lower bound of 99 + 2^(N−11) on the expected time to learn for any shallow exploration strategy.
  • bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention)
  • Bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) explores in a manner similar to the provably-efficient algorithm PSRL, but bootstrapped DQN uses a bootstrapped neural network to approximate a posterior sample for the value. Unlike PSRL, bootstrapped DQN directly samples a value function and does not require further planning steps.
  • the bootstrapped DQN (a deep learning network that uses reinforcement learning as provided in accordance with an embodiment of the invention) is similar to RLSVI, which is also provably efficient, but with a neural network instead of a linear value function and a bootstrap instead of Gaussian sampling.
  • the analysis for the linear setting suggests that this nonlinear approach will work well as long as the distribution {Q_1, ..., Q_K} remains stochastically optimistic, or at least as spread out as the "correct" posterior.
  • bootstrapped DQN drives efficient exploration in several Atari games.
  • bootstrapped DQN generally outperforms DQN with ε-greedy exploration.
  • Figure 10 demonstrates this effect for a diverse selection of games.
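The two ways of generating artificial data described in the list above can be illustrated with a short Python sketch. This is not the claimed implementation: the function names, the uniform distribution standing in for a "diffusely mixed" generative model, and the exponential draw used to produce "stochastically optimistic" rewards are all assumptions made only for illustration.

```python
import random

def artificial_from_observed(observed, m):
    """Variant (a): artificial data drawn by sampling the observed
    transitions uniformly with replacement, giving m elements."""
    return [random.choice(observed) for _ in range(m)]

def artificial_from_generative_model(n_states, n_actions, m, optimism=10.0):
    """Variant (b): sample state-action pairs from a generative model (a uniform
    distribution stands in for the diffusely mixed model here) and attach
    stochastically optimistic rewards and random state transitions."""
    data = []
    for _ in range(m):
        state = random.randrange(n_states)
        action = random.randrange(n_actions)
        reward = random.expovariate(1.0 / optimism)  # skewed toward large, optimistic values
        next_state = random.randrange(n_states)      # random transition
        data.append((state, action, reward, next_state))
    return data
```

In both variants M is chosen to be at least approximately the number of elements in the observed data set, matching the description above.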
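The bootstrap principle referenced above (resample the data set D uniformly with replacement to the same cardinality, then apply the estimator ψ) fits in a few lines. The use of the sample mean as ψ below is only an illustrative choice.

```python
import random

def bootstrap_estimate(dataset, estimator):
    """Resample uniformly with replacement to the cardinality of the data set,
    then evaluate the estimator on the resample."""
    resample = [random.choice(dataset) for _ in range(len(dataset))]
    return estimator(resample)

# Illustrative use: the sample mean plays the role of the estimator psi.
observations = [1.0, 2.5, 3.0, 4.5, 5.0]
mean = lambda d: sum(d) / len(d)
print(bootstrap_estimate(observations, mean))  # one bootstrap sample estimate
```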
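The shared architecture with K bootstrapped heads and mask-based training described above can be sketched with PyTorch. This is a hedged sketch under stated assumptions, not the patented implementation: the layer sizes, the Bernoulli(0.5) masking probability, and the Huber loss are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BootstrappedQNetwork(nn.Module):
    """Shared feature trunk with K Q-value heads branching off independently."""

    def __init__(self, state_dim, n_actions, n_heads, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_heads)])

    def forward(self, state):
        features = self.trunk(state)
        # One Q(s, .) estimate per head, stacked as (K, batch, n_actions).
        return torch.stack([head(features) for head in self.heads])


def bootstrap_mask(batch_size, n_heads, p=0.5):
    """Flags per transition indicating which heads are privy to which data."""
    return torch.bernoulli(torch.full((batch_size, n_heads), p))


def masked_td_loss(net, target_net, batch, mask, gamma=0.99):
    """Each head k is trained only on the transitions its mask exposes
    (data-dependent dropout), against its own target-network head."""
    states, actions, rewards, next_states, dones = batch  # actions: LongTensor of shape (batch,)
    q_all = net(states)                       # (K, batch, n_actions)
    with torch.no_grad():
        next_q_all = target_net(next_states)  # (K, batch, n_actions)
    loss = torch.zeros(())
    for k in range(q_all.shape[0]):
        q_sa = q_all[k].gather(1, actions.unsqueeze(1)).squeeze(1)
        target = rewards + gamma * (1.0 - dones) * next_q_all[k].max(dim=1).values
        per_sample = F.smooth_l1_loss(q_sa, target, reduction="none")
        loss = loss + (mask[:, k] * per_sample).mean()
    return loss
```

The per-transition mask plays the role of the flags w_1, ..., w_K mentioned above: a head only accumulates gradient on the data it is privy to, while all heads share the trunk's joint feature representation.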
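The per-episode selection of a single head, held fixed for the whole episode in the style of Thompson sampling, might look like the sketch below. It assumes a network shaped like the one in the previous sketch and a simple environment interface (`reset()` returning a state vector, `step(action)` returning `(next_state, reward, done)`); both interfaces are assumptions for illustration.

```python
import random
import torch

def run_bootstrapped_episode(env, net, n_heads):
    """Pick one head uniformly at random, then act greedily with respect to
    that head's Q-values for the duration of the episode."""
    k = random.randrange(n_heads)
    state, done, transitions = env.reset(), False, []
    while not done:
        with torch.no_grad():
            q_values = net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))[k]
        action = int(q_values.argmax(dim=1).item())
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
    return k, transitions  # the chosen head and the episode's result data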
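As context for the single exploration process of FIG. 6 (process 600), the loop below strings the described steps together: pool observed and artificial data, fit a randomized state-action value function to a bootstrap resample using an approximator, act on it for an episode, and fold the result data back into the observed set through a training mask. The environment, policy, and approximator interfaces and the keep-everything mask are illustrative assumptions, not the specification's requirements.

```python
import random

def exploration_process_600(observed, artificial, approximator, policy, env, n_episodes):
    """Single exploration process in the spirit of process 600 (FIG. 6)."""
    for _ in range(n_episodes):
        pooled = observed + artificial
        # Bootstrap resample of the pooled observed-plus-artificial data.
        training = [random.choice(pooled) for _ in range(len(pooled))]
        q_function = approximator(training)        # randomized state-action value function
        state, done, results = env.reset(), False, []
        while not done:                            # one episode of acting on the sampled Q
            action = policy(q_function, state)
            next_state, reward, done = env.step(action)
            results.append((state, action, reward, next_state))
            state = next_state
        # Training mask: flags selecting which time steps' result data enter the observed set.
        training_mask = [1] * len(results)         # here: keep every time step
        observed.extend(r for r, keep in zip(results, training_mask) if keep)
    return observed
```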
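Process 700 of FIG. 7 runs K such exploration processes, each with its own view of the shared observed data. The sketch below keeps one Bernoulli bootstrap mask per process so that the K fitted value functions stay diverse; the masking probability, the interfaces, and the round-robin use of the 1-to-K approximators are assumptions for illustration only.

```python
import random

def parallel_exploration_700(observed, make_artificial, approximators, policy, env,
                             n_episodes, n_processes, p=0.5):
    """K exploration processes in the spirit of process 700 (FIG. 7)."""
    # Bootstrap masks: which shared observations each process is privy to.
    masks = [[random.random() < p for _ in observed] for _ in range(n_processes)]
    for _ in range(n_episodes):
        for k in range(n_processes):
            visible = [d for d, keep in zip(observed, masks[k]) if keep] or list(observed)
            pooled = visible + make_artificial(visible)
            training = [random.choice(pooled) for _ in range(len(pooled))]
            q_function = approximators[k % len(approximators)](training)
            state, done, results = env.reset(), False, []
            while not done:
                action = policy(q_function, state)
                next_state, reward, done = env.step(action)
                results.append((state, action, reward, next_state))
                state = next_state
            observed.extend(results)
            # Expose the new result data to each process according to fresh mask flags.
            for mask in masks:
                mask.extend(random.random() < p for _ in results)
    return observed
```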

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to systems and methods for providing reinforcement learning for a deep learning network. A reinforcement learning process that enables deep exploration is provided by a bootstrap applied to a sample of observed and artificial data in order to facilitate deep exploration by means of a Thompson sampling technique.
PCT/US2016/042631 2015-07-01 2016-07-15 Systems and methods for providing reinforcement learning in a deep learning system WO2017004626A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201562187681P 2015-07-01 2015-07-01
US62/187,681 2015-07-01
US201615201284A 2016-07-01 2016-07-01
US15/201,284 2016-07-01

Publications (1)

Publication Number Publication Date
WO2017004626A1 true WO2017004626A1 (fr) 2017-01-05

Family

ID=57609621

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/042631 WO2017004626A1 (fr) 2015-07-01 2016-07-15 Systems and methods for providing reinforcement learning in a deep learning system

Country Status (1)

Country Link
WO (1) WO2017004626A1 (fr)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106970615A (zh) * 2017-03-21 2017-07-21 西北工业大学 一种深度强化学习的实时在线路径规划方法
CN106991509A (zh) * 2017-05-27 2017-07-28 重庆科技学院 基于径向基函数神经网络模型的测井曲线预测方法
CN107292392A (zh) * 2017-05-11 2017-10-24 苏州大学 基于深度带权双q学习的大范围监控方法及监控机器人
CN107607942A (zh) * 2017-08-31 2018-01-19 北京大学 基于深度学习模型的大尺度电磁散射与逆散射的预测方法
CN108051999A (zh) * 2017-10-31 2018-05-18 中国科学技术大学 基于深度强化学习的加速器束流轨道控制方法及系统
CN108319286A (zh) * 2018-03-12 2018-07-24 西北工业大学 一种基于强化学习的无人机空战机动决策方法
CN109472984A (zh) * 2018-12-27 2019-03-15 苏州科技大学 基于深度强化学习的信号灯控制方法、系统和存储介质
CN110011876A (zh) * 2019-04-19 2019-07-12 福州大学 一种基于强化学习的Sketch的网络测量方法
CN110190918A (zh) * 2019-04-25 2019-08-30 广西大学 基于深度q学习的认知无线传感器网络频谱接入方法
CN110326004A (zh) * 2017-02-24 2019-10-11 谷歌有限责任公司 使用路径一致性学习训练策略神经网络
CN110503661A (zh) * 2018-05-16 2019-11-26 武汉智云星达信息技术有限公司 一种基于深度强化学习和时空上下文的目标图像追踪方法
CN110635973A (zh) * 2019-11-08 2019-12-31 西北工业大学青岛研究院 一种基于强化学习的骨干网络流量确定方法及系统
US10701439B2 (en) 2018-01-04 2020-06-30 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method of thereof
CN111462230A (zh) * 2020-02-18 2020-07-28 天津大学 一种基于深度强化学习的台风中心定位方法
KR102179090B1 (ko) * 2020-01-30 2020-11-16 주식회사 투비코 신경망을 이용한 의료 진단 방법
CN112204580A (zh) * 2018-03-27 2021-01-08 诺基亚通信公司 使用深度q网络促进资源配对的方法和装置
CN113077853A (zh) * 2021-04-06 2021-07-06 西安交通大学 双loss价值网络深度强化学习KVFD模型力学参数全局优化方法及系统
CN113452026A (zh) * 2021-06-29 2021-09-28 华中科技大学 一种电力系统薄弱评估智能体训练方法、评估方法和系统
CN113905606A (zh) * 2021-09-13 2022-01-07 中国地质大学(武汉) 基于深度强化学习的贴片机贴装调度模型训练方法
CN114683287A (zh) * 2022-04-25 2022-07-01 浙江工业大学 一种基于元动作分层泛化的机械臂模仿学习方法
US11645580B2 (en) * 2020-01-21 2023-05-09 Walmart Apollo, Llc Robust reinforcement learning in personalized content prediction
US11853901B2 (en) 2019-07-26 2023-12-26 Samsung Electronics Co., Ltd. Learning method of AI model and electronic apparatus
US11886988B2 (en) 2017-11-22 2024-01-30 International Business Machines Corporation Method for adaptive exploration to accelerate deep reinforcement learning
CN117634320A (zh) * 2024-01-24 2024-03-01 合肥工业大学 基于深度强化学习的三相高频变压器多目标优化设计方法

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070011119A1 (en) * 2005-05-07 2007-01-11 Thaler Stephen L Device for the autonomous bootstrapping of useful information
US20100094786A1 (en) * 2008-10-14 2010-04-15 Honda Motor Co., Ltd. Smoothed Sarsa: Reinforcement Learning for Robot Delivery Tasks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070011119A1 (en) * 2005-05-07 2007-01-11 Thaler Stephen L Device for the autonomous bootstrapping of useful information
US20100094786A1 (en) * 2008-10-14 2010-04-15 Honda Motor Co., Ltd. Smoothed Sarsa: Reinforcement Learning for Robot Delivery Tasks

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110326004A (zh) * 2017-02-24 2019-10-11 谷歌有限责任公司 使用路径一致性学习训练策略神经网络
CN106970615A (zh) * 2017-03-21 2017-07-21 西北工业大学 一种深度强化学习的实时在线路径规划方法
CN107292392A (zh) * 2017-05-11 2017-10-24 苏州大学 基于深度带权双q学习的大范围监控方法及监控机器人
CN107292392B (zh) * 2017-05-11 2019-11-22 苏州大学 基于深度带权双q学习的大范围监控方法及监控机器人
CN106991509A (zh) * 2017-05-27 2017-07-28 重庆科技学院 基于径向基函数神经网络模型的测井曲线预测方法
CN107607942B (zh) * 2017-08-31 2019-09-13 北京大学 基于深度学习模型的大尺度电磁散射与逆散射的预测方法
CN107607942A (zh) * 2017-08-31 2018-01-19 北京大学 基于深度学习模型的大尺度电磁散射与逆散射的预测方法
CN108051999A (zh) * 2017-10-31 2018-05-18 中国科学技术大学 基于深度强化学习的加速器束流轨道控制方法及系统
CN108051999B (zh) * 2017-10-31 2020-08-25 中国科学技术大学 基于深度强化学习的加速器束流轨道控制方法及系统
US11886988B2 (en) 2017-11-22 2024-01-30 International Business Machines Corporation Method for adaptive exploration to accelerate deep reinforcement learning
US10701439B2 (en) 2018-01-04 2020-06-30 Samsung Electronics Co., Ltd. Electronic apparatus and controlling method of thereof
CN108319286A (zh) * 2018-03-12 2018-07-24 西北工业大学 一种基于强化学习的无人机空战机动决策方法
CN108319286B (zh) * 2018-03-12 2020-09-22 西北工业大学 一种基于强化学习的无人机空战机动决策方法
CN112204580A (zh) * 2018-03-27 2021-01-08 诺基亚通信公司 使用深度q网络促进资源配对的方法和装置
CN112204580B (zh) * 2018-03-27 2024-04-12 诺基亚通信公司 使用深度q网络促进资源配对的方法和装置
CN110503661A (zh) * 2018-05-16 2019-11-26 武汉智云星达信息技术有限公司 一种基于深度强化学习和时空上下文的目标图像追踪方法
CN109472984A (zh) * 2018-12-27 2019-03-15 苏州科技大学 基于深度强化学习的信号灯控制方法、系统和存储介质
CN110011876A (zh) * 2019-04-19 2019-07-12 福州大学 一种基于强化学习的Sketch的网络测量方法
CN110011876B (zh) * 2019-04-19 2022-05-03 福州大学 一种基于强化学习的Sketch的网络测量方法
CN110190918A (zh) * 2019-04-25 2019-08-30 广西大学 基于深度q学习的认知无线传感器网络频谱接入方法
CN110190918B (zh) * 2019-04-25 2021-04-30 广西大学 基于深度q学习的认知无线传感器网络频谱接入方法
US11853901B2 (en) 2019-07-26 2023-12-26 Samsung Electronics Co., Ltd. Learning method of AI model and electronic apparatus
CN110635973A (zh) * 2019-11-08 2019-12-31 西北工业大学青岛研究院 一种基于强化学习的骨干网络流量确定方法及系统
CN110635973B (zh) * 2019-11-08 2022-07-12 西北工业大学青岛研究院 一种基于强化学习的骨干网络流量确定方法及系统
US11645580B2 (en) * 2020-01-21 2023-05-09 Walmart Apollo, Llc Robust reinforcement learning in personalized content prediction
KR102179090B1 (ko) * 2020-01-30 2020-11-16 주식회사 투비코 신경망을 이용한 의료 진단 방법
CN111462230B (zh) * 2020-02-18 2023-08-15 天津大学 一种基于深度强化学习的台风中心定位方法
CN111462230A (zh) * 2020-02-18 2020-07-28 天津大学 一种基于深度强化学习的台风中心定位方法
CN113077853B (zh) * 2021-04-06 2023-08-18 西安交通大学 双loss价值网络深度强化学习KVFD模型力学参数全局优化方法及系统
CN113077853A (zh) * 2021-04-06 2021-07-06 西安交通大学 双loss价值网络深度强化学习KVFD模型力学参数全局优化方法及系统
CN113452026A (zh) * 2021-06-29 2021-09-28 华中科技大学 一种电力系统薄弱评估智能体训练方法、评估方法和系统
CN113905606B (zh) * 2021-09-13 2022-09-30 中国地质大学(武汉) 基于深度强化学习的贴片机贴装调度模型训练方法
CN113905606A (zh) * 2021-09-13 2022-01-07 中国地质大学(武汉) 基于深度强化学习的贴片机贴装调度模型训练方法
CN114683287A (zh) * 2022-04-25 2022-07-01 浙江工业大学 一种基于元动作分层泛化的机械臂模仿学习方法
CN114683287B (zh) * 2022-04-25 2023-10-20 浙江工业大学 一种基于元动作分层泛化的机械臂模仿学习方法
CN117634320A (zh) * 2024-01-24 2024-03-01 合肥工业大学 基于深度强化学习的三相高频变压器多目标优化设计方法
CN117634320B (zh) * 2024-01-24 2024-04-09 合肥工业大学 基于深度强化学习的三相高频变压器多目标优化设计方法

Similar Documents

Publication Publication Date Title
US20170032245A1 (en) Systems and Methods for Providing Reinforcement Learning in a Deep Learning System
WO2017004626A1 (fr) Systems and methods for providing reinforcement learning in a deep learning system
Osband et al. Deep exploration via bootstrapped DQN
Lanctot et al. A unified game-theoretic approach to multiagent reinforcement learning
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Shao et al. A survey of deep reinforcement learning in video games
US11291917B2 (en) Artificial intelligence (AI) model training using cloud gaming network
Lazaridis et al. Deep reinforcement learning: A state-of-the-art walkthrough
US11886988B2 (en) Method for adaptive exploration to accelerate deep reinforcement learning
US20190272465A1 (en) Reward estimation via state prediction using expert demonstrations
US11157316B1 (en) Determining action selection policies of an execution device
WO2019133848A1 (fr) Avatar commandé par persona et artificiellement intelligent
Wu et al. Deep ensemble reinforcement learning with multiple deep deterministic policy gradient algorithm
US11204803B2 (en) Determining action selection policies of an execution device
CN113487039A (zh) 基于深度强化学习的智能体自适应决策生成方法及系统
Schweighofer et al. Understanding the effects of dataset characteristics on offline reinforcement learning
Xin et al. Exploration entropy for reinforcement learning
Zhang et al. Efficient experience replay architecture for offline reinforcement learning
Tang et al. A framework for constrained optimization problems based on a modified particle swarm optimization
Catteeuw et al. The limits and robustness of reinforcement learning in Lewis signalling games
Gao et al. Adversarial policy gradient for alternating markov games
US9981190B2 (en) Telemetry based interactive content generation
US20210138350A1 (en) Sensor statistics for ranking users in matchmaking systems
Seify Single-agent optimization with monte-carlo tree search and deep reinforcement learning
Argall et al. Automatic weight learning for multiple data sources when learning from demonstration

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16818963

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16818963

Country of ref document: EP

Kind code of ref document: A1