WO2020259504A1 - An Efficient Exploration Method for Reinforcement Learning - Google Patents

An Efficient Exploration Method for Reinforcement Learning

Info

Publication number
WO2020259504A1
WO2020259504A1 · PCT/CN2020/097757 · CN2020097757W
Authority
WO
WIPO (PCT)
Prior art keywords
reinforcement learning
exploration
state
strategy
reward
Prior art date
Application number
PCT/CN2020/097757
Other languages
English (en)
French (fr)
Inventor
张寅
胡滨
Original Assignee
浙江大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学 (Zhejiang University)
Publication of WO2020259504A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • The invention relates to an efficient exploration method for deep reinforcement learning, and in particular to a count-based exploration strategy and its application to continuous-space tasks.
  • Reinforcement learning belongs to the field of machine learning and is an important method for solving sequential decision problems. Reinforcement learning models the sequential decision problem as an external environment and treats the decision algorithm as an agent; the agent improves its decision policy through trial-and-error learning so that the policy obtains the maximum cumulative return over the sequential decision process.
  • In recent years, with the combination of reinforcement learning and deep learning, reinforcement learning algorithms have achieved remarkable results and have been widely applied in games, robot control, natural language processing, computer vision and other fields.
  • In particular, the DeepMind team combined deep learning, reinforcement learning and Monte Carlo tree search to build the Go systems AlphaGo and AlphaZero, which defeated the Korean player Lee Sedol and the Go world champion Ke Jie respectively, marking the point at which machine learning algorithms comprehensively surpassed humans in board games and demonstrating the strong decision-making ability and development potential of reinforcement learning algorithms.
  • Existing exploration strategies in reinforcement learning mainly use uniform sampling or Gaussian noise, that is, exploring randomly with a certain probability or exploring randomly in the neighborhood of the optimal action. These methods in effect add random, undirected noise on top of the currently learned policy and are referred to as dithering (jitter) strategies. Because a dithering strategy does not consider the value of each exploratory action, it suffers from drawbacks such as low data utilization and requiring unbounded time for sufficient exploration.
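For concreteness, a minimal sketch of the dithering (jitter) baselines described above is shown below; the `policy` callable, the action bounds and the noise scale are illustrative assumptions, not part of the invention.

```python
# Illustrative sketch (not part of the invention): the "dithering" baselines the
# text criticizes, i.e. undirected noise added to the currently learned policy.
# `policy`, the action bounds and the noise scale are hypothetical placeholders.
import numpy as np

def uniform_dither_action(policy, state, epsilon=0.1, action_dim=6, low=-1.0, high=1.0):
    """With probability epsilon take a uniformly random action, otherwise the policy action."""
    if np.random.rand() < epsilon:
        return np.random.uniform(low, high, size=action_dim)   # undirected random exploration
    return policy(state)

def gaussian_dither_action(policy, state, sigma=0.2):
    """Explore in the neighborhood of the current best action by adding Gaussian noise."""
    action = np.asarray(policy(state), dtype=np.float64)
    return action + np.random.normal(0.0, sigma, size=action.shape)
```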
  • The purpose of the present invention is to solve the problems existing in the prior art and to provide an efficient exploration method for reinforcement learning.
  • The count estimation function is represented by a neural network c_θ(s, err), and the triple set M obtained in step 1.4) is used as the data set to train the neural network c_θ; in each training round, a batch of data is sampled from the triple set M:
  • N is the number of data records in the current batch;
  • s_i is the state s of the i-th data record;
  • err_i is the reconstruction error error of the i-th data record;
  • cnt_i is the count count of the i-th data record.
  • The loss function of the neural network c_θ is given by the formula in the description; it trains c_θ to predict cnt_i from (s_i, err_i).
  • In the reward formulas (1) and (2), β is the reward magnitude factor, β > 0.
  • On the basis of the above technical solution, each step can be implemented in the following specific manner.
  • The termination condition is that the number of interactions between the reinforcement learning algorithm and the environment reaches the preset upper limit T.
  • The reward R is preferably calculated using formula (2).
  • The present invention is mainly aimed at the problem of balancing exploration and exploitation in reinforcement learning.
  • In continuous-space tasks, a pre-trained count estimation function is used to estimate the number of times the agent has encountered a state; this count is used to calculate a reward, and the reward guides the agent to explore rarely encountered states, thereby achieving efficient exploration. By using an independent exploration policy to process this reward signal, the influence of the reward signal on the agent's action policy is avoided, making the exploration process more stable.
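As a minimal sketch of this count-based reward idea: the β / sqrt(count) bonus below is a common choice in count-based exploration and is used here only as an assumption, since the patent's reward formulas (1) and (2) are given as images in the description.

```python
# A minimal sketch of the count-based reward idea, assuming a beta / sqrt(count)
# bonus; the patent's reward formulas (1) and (2) are only available as images,
# so this form is an assumption, not the claimed formula.
import math

def exploration_reward(count_estimate: float, beta: float = 1.0) -> float:
    """Large for rarely visited states, decaying as the estimated visit count grows."""
    return beta / math.sqrt(max(count_estimate, 1.0))

# usage: cnt = c_theta(state, reconstruction_error); R = exploration_reward(cnt, beta=1.0)
```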
  • Figure 1 is the flow chart of pre-training the count estimation function.
  • Figure 2 is the model framework of the reinforcement learning algorithm based on policy separation.
  • Figure 3 is a schematic diagram of the exploration algorithm based on policy separation.
  • Figure 4 shows the test results under the HalfCheetah task in the embodiment.
  • Figure 5 shows the test results under the Swimmer task in the embodiment.
  • Figure 6 shows the test results under the Ant task in the embodiment.
  • Figure 7 shows the test results under the Reacher task in the embodiment.
  • As shown in Figures 1-3, the present invention provides an efficient exploration method for reinforcement learning, with the following steps:
  • The count estimation function is represented by a neural network c_θ(s, err), and the triple set M obtained in step 1.4) is used as the data set to train the neural network c_θ; in each training round, a batch of data is sampled from the triple set M:
  • N is the number of data records in the current batch;
  • s_i is the state s of the i-th data record;
  • err_i is the reconstruction error error of the i-th data record;
  • cnt_i is the count count of the i-th data record.
  • The loss function of the neural network c_θ is given by the formula in the description.
  • For states in a continuous state space, the above count estimation function can estimate the number of occurrences of a state from the state's VAE reconstruction error.
  • The VAE used in the present invention can be replaced by any other structure that can reconstruct the input and yield a corresponding reconstruction error.
  • The proposed efficient exploration strategy can be combined with existing reinforcement learning algorithms such as the deterministic policy gradient algorithm (DDPG); see Figure 2 and Figure 3, where the subscript t denotes the t-th iteration. The implementation process is described in detail below.
  • In the reward formulas (1) and (2), β is the reward magnitude factor, β > 0.
  • The termination condition is set as follows: the number of interactions between the reinforcement learning algorithm and the environment reaches the preset upper limit T.
  • To test the practical effect of the efficient exploration method, Mujoco is used as the test environment for the algorithm.
  • Mujoco is a physics simulator that can simulate complex dynamic systems quickly and accurately, and it is widely used in robotics, biomechanics, graphic animation, machine learning and other fields. In reinforcement learning, Mujoco is often used as a benchmark for continuous-space problems. Mujoco contains a series of simulation environments.
  • Gym is a platform for reinforcement learning research released by OpenAI. It provides a series of reinforcement learning tasks, including classic control tasks, Atari games, robot control tasks and so on, and provides an interface for interacting with these environments. Gym also integrates Mujoco's simulation environments as the Mujoco type of reinforcement learning tasks. The Mujoco tasks have been upgraded to version v2, and the tests use the v2 version of the Mujoco tasks.
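For reference, a minimal interaction loop with one of the Mujoco tasks used in the tests might look as follows; it assumes a Gym version exposing the classic (obs, reward, done, info) step interface and mujoco-py installed, and uses a random policy purely as a placeholder for the learned policies.

```python
# A minimal interaction loop with one of the Mujoco tasks used in the tests.
# Assumes a Gym version exposing the classic (obs, reward, done, info) step API
# and mujoco-py installed; the random policy is only a placeholder for mu / mu_E.
import gym

env = gym.make("HalfCheetah-v2")
state = env.reset()
episode_return = 0.0
for _ in range(1000):
    action = env.action_space.sample()              # placeholder for the learned policy
    state, reward, done, info = env.step(action)
    episode_return += reward
    if done:
        state = env.reset()
print("return of the random placeholder policy:", episode_return)
```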
  • The test parameters of the two methods are configured as follows:
  • All middle-layer vector dimensions of DDPG's action network and evaluation network are set to 64.
  • All VAEs use the same structure: the dimensions of the encoding layer and decoding layer are 64, and the dimensions of the mean vector, standard deviation vector and latent vector are 8.
  • The structures of the evaluation networks and action networks of the action policy and the exploration policy are the same as in DDPG.
  • Other DDPG-related parameters are the same as those used in the DDPG algorithm above.
  • The batch size of sampled data is 64. All optimizers use the Adam algorithm; the learning rate of all action networks is 10^-4, and the learning rate of all other networks is 10^-3. All activation functions are ReLU.
  • The test results are shown in Figures 4-7, where BRL-S denotes the results of the efficient exploration method proposed by the present invention.
  • The test results show that under all four test tasks the proposed efficient exploration method achieves better results than DDPG.
  • Under the HalfCheetah task, the score obtained by the efficient exploration method is about 15% higher than that of DDPG; under Swimmer it is about 67% higher; under Ant it is about 160% higher; and under Reacher the score improves from about -12 to about -8.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An efficient exploration method for reinforcement learning, the steps of which are as follows: 1) pre-train a count estimation function; 2) use the pre-trained count estimation function to perform efficient exploration in reinforcement learning. The method mainly addresses the balance between exploration and exploitation in reinforcement learning. In continuous-space tasks, a pre-trained count estimation function estimates the number of times the agent has encountered a state; this count is used to compute a reward, and the reward guides the agent to explore rarely encountered states, thereby achieving efficient exploration. By using an independent exploration policy to process the reward signal, the influence of the reward signal on the agent's action policy is avoided, making the exploration process more stable.

Description

An Efficient Exploration Method for Reinforcement Learning
Technical Field
The present invention relates to an efficient exploration method for deep reinforcement learning, and in particular to a count-based exploration strategy and its application to continuous-space tasks.
Background Art
Reinforcement learning belongs to the field of machine learning and is an important method for solving sequential decision problems. Reinforcement learning models the sequential decision problem as an external environment and treats the decision algorithm as an agent; the agent improves its decision policy through trial-and-error learning so that the policy obtains the maximum cumulative return over the sequential decision process. In recent years, with the combination of reinforcement learning and deep learning, reinforcement learning algorithms have achieved remarkable results and have been widely applied in games, robot control, natural language processing, computer vision and other fields. In particular, the DeepMind team combined deep learning, reinforcement learning and Monte Carlo tree search to build the Go systems AlphaGo and AlphaZero, which defeated the Korean player Lee Sedol and the Go world champion Ke Jie respectively, marking the point at which machine learning algorithms comprehensively surpassed humans in board games and demonstrating the strong decision-making ability and development potential of reinforcement learning algorithms.
However, reinforcement learning algorithms still face the trade-off between exploration and exploitation. Exploration versus exploitation is the central tension of reinforcement learning: on the one hand, the agent must explore new states and actions to discover a potentially optimal policy; on the other hand, it must exploit historical information to realize the optimal policy. The two are inherently in conflict, since performing exploration necessarily affects the optimality of the policy, while executing the optimal policy necessarily limits exploration; balancing this conflict is an important problem in reinforcement learning.
Existing exploration strategies in reinforcement learning mainly use uniform sampling or Gaussian noise, that is, exploring randomly with a certain probability or exploring randomly in the neighborhood of the optimal action. These methods in effect add random, undirected noise on top of the currently learned policy and are referred to as dithering (jitter) strategies. Because a dithering strategy does not consider the value of each exploratory action, it suffers from drawbacks such as low data utilization and requiring unbounded time for sufficient exploration.
Summary of the Invention
The purpose of the present invention is to solve the problems existing in the prior art and to provide an efficient exploration method for reinforcement learning.
The specific technical solution adopted by the present invention is as follows:
An efficient exploration method for reinforcement learning, with the following steps:
1) Pre-train a count estimation function. The pre-training process is:
1.1) Sample in the state space of the environment.
1.2) Reconstruct each sampled state with an independent VAE.
1.3) For each sampled state s, train the VAE with different numbers of copies of state s and compute the corresponding reconstruction errors; denoting the reconstruction error as error when the number is count, obtain the triple <s, error, count>.
1.4) Collect the triples <s, error, count> obtained under different states and different numbers into a triple set M.
1.5) Represent the count estimation function by a neural network c_θ(s, err), and use the triple set M obtained in 1.4) as the data set to train the neural network c_θ. In each training round, sample a batch of data from the triple set M:
{<s_i, err_i, cnt_i>}_{i=1,...,N}  (formula image PCTCN2020097757-appb-000001)
where N is the number of data records in the current batch; s_i is the state s of the i-th record, err_i is the reconstruction error error of the i-th record, and cnt_i is the number count of the i-th record.
The loss function of the neural network c_θ is:
(formula image PCTCN2020097757-appb-000002)
2) Use the pre-trained count estimation function to perform efficient exploration in reinforcement learning. The exploration process is:
2.1) Copy the action policy μ(s) of the reinforcement learning algorithm to obtain the exploration policy μ_E(s).
2.2) Initialize a VAE, denoted V*.
2.3) For the sequence of <state s, action a, reward r> encountered during reinforcement learning, for each <state s, action a, reward r> pair, train V* with state s and compute its reconstruction error err* = ‖V*(s) - s‖², where V*(s) denotes the output obtained by feeding state s into V*.
2.4) Use the real-time reconstruction error err* of state s and the pre-trained count estimation function c_θ(s, err) to estimate the number of occurrences of state s: cnt(s) = c_θ(s, err*).
2.5) Compute the corresponding reward from the estimated state occurrence count cnt(s) according to formula (1) or (2):
(formula (1): image PCTCN2020097757-appb-000003)
(formula (2): image PCTCN2020097757-appb-000004)
where β is the reward magnitude factor, β > 0.
2.6) Train the action policy μ(s) with the reward signal r given by the environment, and train the exploration policy μ_E(s) with the reward signal R; when training the exploration policy, also introduce a distance constraint to the action policy, α‖μ(s) - μ_E(s)‖², where α is the constraint factor.
2.7) With probability p, select the action produced by the exploration policy to interact with the environment; with the remaining probability 1 - p, select the action produced by the action policy to interact with the environment.
2.8) Keep repeating 2.3)~2.7) until the termination condition is met, then terminate the loop, completing the efficient exploration of reinforcement learning.
On the basis of the above technical solution, each step can be implemented in the following specific manner.
Preferably, the termination condition is that the number of interactions between the reinforcement learning algorithm and the environment reaches the preset upper limit T.
Preferably, the reward R is computed using formula (2).
Preferably, in 2.6), the distance between the two policies, dist = ‖μ(s) - μ_E(s)‖², is computed during training, and α is adjusted dynamically according to dist: when the distance between the two policies exceeds the preset upper bound, α is increased; when it falls below the preset lower bound, α is decreased.
Preferably, the reward magnitude factor β = 1 and the probability p = 0.1.
The present invention mainly addresses the balance between exploration and exploitation in reinforcement learning. In continuous-space tasks, a pre-trained count estimation function estimates the number of times the agent has encountered a state; this count is used to compute a reward, and the reward guides the agent to explore rarely encountered states, thereby achieving efficient exploration. By using an independent exploration policy to process the reward signal, the influence of the reward signal on the agent's action policy is avoided, making the exploration process more stable.
Brief Description of the Drawings
Figure 1 is the flow chart of pre-training the count estimation function.
Figure 2 is the model framework of the reinforcement learning algorithm based on policy separation.
Figure 3 is a schematic diagram of the exploration algorithm based on policy separation.
Figure 4 shows the test results under the HalfCheetah task in the embodiment.
Figure 5 shows the test results under the Swimmer task in the embodiment.
Figure 6 shows the test results under the Ant task in the embodiment.
Figure 7 shows the test results under the Reacher task in the embodiment.
Detailed Description of the Embodiments
The present invention is further described below with reference to the drawings and a specific embodiment.
As shown in Figures 1-3, the present invention provides an efficient exploration method for reinforcement learning, with the following steps:
1) Pre-train a count estimation function. The pre-training process is:
1.1) Sample in the state space of the environment.
1.2) Reconstruct each sampled state with an independent VAE.
1.3) For each sampled state s, train the VAE with different numbers of copies of state s and compute the corresponding reconstruction errors; denoting the reconstruction error as error when the number is count, obtain the triple <s, error, count>.
1.4) Collect the triples <s, error, count> obtained under different states and different numbers into a triple set M.
1.5) Represent the count estimation function by a neural network c_θ(s, err), and use the triple set M obtained in 1.4) as the data set to train the neural network c_θ. In each training round, sample a batch of data from the triple set M:
{<s_i, err_i, cnt_i>}_{i=1,...,N}  (formula image PCTCN2020097757-appb-000005)
where N is the number of data records in the current batch; s_i is the state s of the i-th record, err_i is the reconstruction error error of the i-th record, and cnt_i is the number count of the i-th record.
The loss function of the neural network c_θ is:
(formula image PCTCN2020097757-appb-000006)
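As an illustration of step 1.5), the following is a minimal sketch of how the count estimation network c_θ(s, err) could be built and trained on the triple set M. The squared-error regression loss is an assumption, since the patent's loss formula is only available as an image; the widths, batch size and learning rate follow the embodiment's settings for non-action networks, and the class and function names are illustrative.

```python
# A sketch of step 1.5) under stated assumptions: the network c_theta takes a state
# and its reconstruction error and regresses the associated count. The squared-error
# loss is an assumption (the patent's loss formula is only available as an image).
import torch
import torch.nn as nn

class CountEstimator(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden_dim),   # input: state s concatenated with err
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s: torch.Tensor, err: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, err], dim=-1)).squeeze(-1)

def train_count_estimator(c_theta, states, errors, counts,
                          rounds=10000, batch_size=64, lr=1e-3):
    """states: [M, d]; errors: [M, 1]; counts: [M] -- the triple set M from steps 1.1)-1.4)."""
    opt = torch.optim.Adam(c_theta.parameters(), lr=lr)
    for _ in range(rounds):
        idx = torch.randint(0, states.shape[0], (batch_size,))  # sample a batch from M
        pred = c_theta(states[idx], errors[idx])
        loss = ((pred - counts[idx]) ** 2).mean()               # assumed regression loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return c_theta
```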
For states in a continuous state space, the above count estimation function can estimate the number of occurrences of a state from the state's VAE reconstruction error. The VAE used in the present invention can be replaced by any other structure that can reconstruct the input and yield a corresponding reconstruction error. The proposed efficient exploration strategy can be combined with existing reinforcement learning algorithms such as the deterministic policy gradient algorithm (DDPG); see Figure 2 and Figure 3, where the subscript t denotes the t-th iteration. The implementation process is described in detail below.
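The following is a minimal VAE sketch of the kind referred to above, sized as in the embodiment (encoder/decoder width 64; mean, standard deviation and latent vectors of dimension 8). Only the forward pass and the reconstruction error err* = ‖V*(s) - s‖² are shown; the VAE training objective (reconstruction plus KL terms) is omitted, and the class and function names are illustrative.

```python
# A minimal VAE sketch for obtaining the reconstruction error used by the method,
# with the layer sizes from the embodiment. Any model that reconstructs the input
# and yields a reconstruction error could be substituted, as the text notes.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 64, latent_dim: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)        # mean vector
        self.log_std = nn.Linear(hidden_dim, latent_dim)   # log standard deviation vector
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, state_dim))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.enc(s)
        mu, std = self.mu(h), self.log_std(h).exp()
        z = mu + std * torch.randn_like(std)               # reparameterization trick
        return self.dec(z)                                  # reconstruction of s

def reconstruction_error(vae: VAE, s: torch.Tensor) -> torch.Tensor:
    """err = ||V(s) - s||^2, as used in steps 1.3) and 2.3)."""
    with torch.no_grad():
        return ((vae(s) - s) ** 2).sum(dim=-1)
```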
2) Use the pre-trained count estimation function to perform efficient exploration in reinforcement learning. The exploration process is:
2.1) Copy the action policy μ(s) of the reinforcement learning algorithm to obtain the exploration policy μ_E(s).
2.2) Initialize a VAE, denoted V*.
2.3) For the sequence of <state s, action a, reward r> encountered during reinforcement learning, for each <state s, action a, reward r> pair, train V* with state s and compute its reconstruction error err* = ‖V*(s) - s‖², where V*(s) denotes the output obtained by feeding state s into V*.
2.4) Use the real-time reconstruction error err* of state s and the pre-trained count estimation function c_θ(s, err) to estimate the number of occurrences of state s: cnt(s) = c_θ(s, err*).
2.5) Compute the corresponding reward from the estimated state occurrence count cnt(s) according to formula (1) or (2):
(formula (1): image PCTCN2020097757-appb-000007)
(formula (2): image PCTCN2020097757-appb-000008)
where β is the reward magnitude factor, β > 0.
Either of the two reward formulas above may be chosen as needed, but formula (2) is preferred in the present invention.
2.6) Train the action policy μ(s) with the reward signal r given by the environment, and train the exploration policy μ_E(s) with the reward signal R; when training the exploration policy, also introduce a distance constraint to the action policy, α‖μ(s) - μ_E(s)‖², where α is the constraint factor. During training, the distance between the two policies, dist = ‖μ(s) - μ_E(s)‖², is computed and α is adjusted dynamically according to dist: when the distance between the two policies exceeds the preset upper bound, α is increased; when it falls below the preset lower bound, α is decreased.
2.7) With probability p, select the action produced by the exploration policy to interact with the environment; with the remaining probability 1 - p, select the action produced by the action policy to interact with the environment.
2.8) Keep repeating 2.3)~2.7) until the termination condition is met, then terminate the loop, completing the efficient exploration of reinforcement learning. The termination condition is set as: the number of interactions between the reinforcement learning algorithm and the environment reaches the preset upper limit T.
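A sketch of the exploration procedure in steps 2.1)~2.8) is given below, reusing the VAE and CountEstimator sketches above. The policy update routines, the single-step VAE update and the β / sqrt(cnt) bonus are assumptions standing in for the patent's formula (2), which is only available as an image; the classic Gym step interface is also assumed.

```python
# A sketch of steps 2.1)-2.8): policy separation with a count-based exploration reward.
# The update callables and the bonus form are assumptions, not the claimed formulas.
import copy
import math
import random
import torch

def run_efficient_exploration(env, mu, train_action_policy, train_exploration_policy,
                              v_star, train_vae_step, c_theta,
                              beta=1.0, p=0.1, alpha=1.0, T=1_000_000):
    mu_E = copy.deepcopy(mu)                        # 2.1) exploration policy starts as a copy of mu(s)
    state = env.reset()
    for _ in range(T):                              # 2.8) stop after T environment interactions
        s = torch.as_tensor(state, dtype=torch.float32)

        # 2.7) act with the exploration policy with probability p, else with the action policy
        with torch.no_grad():
            action = mu_E(s) if random.random() < p else mu(s)
        next_state, r, done, _ = env.step(action.numpy())

        # 2.3) update V* on the visited state and compute its reconstruction error err*
        train_vae_step(v_star, s)
        with torch.no_grad():
            err = ((v_star(s) - s) ** 2).sum()

        # 2.4)-2.5) estimate cnt(s) = c_theta(s, err*) and turn it into an exploration reward
        cnt = c_theta(s.unsqueeze(0), err.view(1, 1)).item()
        R = beta / math.sqrt(max(cnt, 1.0))         # assumed bonus form

        # 2.6) r trains the action policy; R plus the distance penalty trains mu_E
        train_action_policy(mu, s, action, r, next_state)
        dist_penalty = alpha * ((mu(s) - mu_E(s)) ** 2).sum()
        train_exploration_policy(mu_E, s, action, R, next_state, dist_penalty)

        state = env.reset() if done else next_state
```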
The above method is applied to a specific example below. The implementation steps are as described above; the example mainly demonstrates the effect of the method.
Example:
To test the practical effect of the efficient exploration method, Mujoco is used as the test environment for the algorithm.
Mujoco is a physics simulator that can simulate complex dynamic systems quickly and accurately and is widely used in robotics, biomechanics, graphic animation, machine learning and other fields. In reinforcement learning, Mujoco is often used as a benchmark for continuous-space problems. Mujoco contains a series of simulation environments.
Gym is a platform for reinforcement learning research released by OpenAI. It provides a series of reinforcement learning tasks, including classic control tasks, Atari games, robot control tasks and so on, and provides an interface for interacting with these environments. Gym also integrates Mujoco's simulation environments as the Mujoco type of reinforcement learning tasks. The Mujoco tasks have been upgraded to version v2, and the tests use the v2 version of the Mujoco tasks.
Four Mujoco tasks, HalfCheetah, Swimmer, Ant and Reacher, are selected for testing.
The deterministic policy gradient method DDPG is used as the reinforcement learning baseline for comparison. The test parameters of the two methods are configured as follows:
1) DDPG algorithm
All middle-layer vector dimensions of DDPG's action network and evaluation network are set to 64.
The update parameter of the delayed (target) networks is τ = 0.01, and the discount factor is γ = 0.99.
2) The efficient exploration method of the present invention (the specific method is as described in steps 1)-2) above and is not repeated here)
All VAEs use the same structure: the dimensions of the encoding layer and decoding layer are 64, and the dimensions of the mean vector, standard deviation vector and latent vector are 8.
The structures of the evaluation networks and action networks of the action policy and the exploration policy are the same as in DDPG. Other DDPG-related parameters are the same as those used in the DDPG algorithm above.
The distance constraint coefficient in the exploration policy loss function is α = 1, and the upper and lower bounds on the distance dist are d+ = 0.3 and d- = 0.1, respectively. The dynamic adjustment coefficient of α is λ = 1.01: when the distance exceeds d+, α = α × λ; when the distance falls below d-, α = α ÷ λ. The reward magnitude factor is β = 1, and the reward is computed using formula (2).
The probability of selecting the exploration policy when choosing an action is p = 0.1.
The parameters shared by the two algorithms are as follows:
The batch size of sampled data is 64. All optimizers use the Adam algorithm; the learning rate of all action networks is 10^-4, and the learning rate of all other networks is 10^-3. All activation functions are ReLU.
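For reference, the hyperparameters listed in this embodiment, together with the dynamic adjustment of the constraint factor α described above, can be collected as in the following sketch; the dictionary keys and the function name are illustrative, not taken from the patent.

```python
# A small sketch collecting the embodiment's hyperparameters and the dynamic
# adjustment of alpha (multiplied by lambda above d+, divided by lambda below d-).
CONFIG = {
    "hidden_dim": 64,            # middle layers of the action / evaluation networks
    "vae_hidden_dim": 64,        # VAE encoder / decoder width
    "vae_latent_dim": 8,         # mean, standard deviation and latent vector size
    "tau": 0.01,                 # delayed (target) network update parameter
    "gamma": 0.99,               # discount factor
    "alpha": 1.0,                # initial distance-constraint factor
    "d_plus": 0.3,               # upper bound on the policy distance
    "d_minus": 0.1,              # lower bound on the policy distance
    "lambda": 1.01,              # alpha adjustment coefficient
    "beta": 1.0,                 # reward magnitude factor
    "p": 0.1,                    # probability of acting with the exploration policy
    "batch_size": 64,
    "actor_lr": 1e-4,            # action networks
    "other_lr": 1e-3,            # all other networks
}

def adjust_alpha(alpha: float, dist: float,
                 d_plus: float = 0.3, d_minus: float = 0.1, lam: float = 1.01) -> float:
    """Increase alpha when the two policies drift apart, decrease it when they collapse together."""
    if dist > d_plus:
        return alpha * lam
    if dist < d_minus:
        return alpha / lam
    return alpha
```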
The test results are shown in Figures 4~7, where BRL-S denotes the results of the efficient exploration method proposed by the present invention. The results show that under all four test tasks the proposed efficient exploration method achieves better results than DDPG. Specifically, under the HalfCheetah task the score obtained by the efficient exploration method is about 15% higher than that of DDPG; under Swimmer it is about 67% higher; under Ant it is about 160% higher; and under Reacher the score is improved from about -12 to about -8.
The embodiment described above is only a preferred solution of the present invention and is not intended to limit the present invention. A person of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention. Therefore, all technical solutions obtained by equivalent substitution or equivalent transformation fall within the protection scope of the present invention.

Claims (5)

  1. An efficient exploration method for reinforcement learning, characterized in that the steps are as follows:
    1) Pre-train a count estimation function, the pre-training process being:
    1.1) sample in the state space of the environment;
    1.2) reconstruct each sampled state with an independent VAE;
    1.3) for each sampled state s, train the VAE with different numbers of copies of state s and compute the corresponding reconstruction errors; denoting the reconstruction error as error when the number is count, obtain the triple <s, error, count>;
    1.4) collect the triples <s, error, count> obtained under different states and different numbers into a triple set M;
    1.5) represent the count estimation function by a neural network c_θ(s, err), and use the triple set M obtained in 1.4) as the data set to train the neural network c_θ, sampling a batch of data from the triple set M in each training round:
    {<s_i, err_i, cnt_i>}_{i=1,...,N}  (formula image PCTCN2020097757-appb-100001)
    where N is the number of data records in the current batch; s_i is the state s of the i-th record, err_i is the reconstruction error error of the i-th record, and cnt_i is the number count of the i-th record;
    the loss function of the neural network c_θ being:
    (formula image PCTCN2020097757-appb-100002)
    2) Use the pre-trained count estimation function to perform efficient exploration in reinforcement learning, the exploration process being:
    2.1) copy the action policy μ(s) of the reinforcement learning algorithm to obtain the exploration policy μ_E(s);
    2.2) initialize a VAE, denoted V*;
    2.3) for the sequence of <state s, action a, reward r> encountered during reinforcement learning, for each <state s, action a, reward r> pair, train V* with state s and compute its reconstruction error err* = ‖V*(s) - s‖², where V*(s) denotes the output obtained by feeding state s into V*;
    2.4) use the real-time reconstruction error err* of state s and the pre-trained count estimation function c_θ(s, err) to estimate the number of occurrences of state s: cnt(s) = c_θ(s, err*);
    2.5) compute the corresponding reward from the estimated state occurrence count cnt(s) according to formula (1) or (2):
    (formula (1): image PCTCN2020097757-appb-100003)
    (formula (2): image PCTCN2020097757-appb-100004)
    where β is the reward magnitude factor, β > 0;
    2.6) train the action policy μ(s) with the reward signal r given by the environment, and train the exploration policy μ_E(s) with the reward signal R; when training the exploration policy, also introduce a distance constraint to the action policy, α‖μ(s) - μ_E(s)‖², where α is the constraint factor;
    2.7) with a certain probability p, select the action produced by the exploration policy to interact with the environment, and with the remaining probability 1 - p, select the action produced by the action policy to interact with the environment;
    2.8) keep repeating 2.3)~2.7) until the termination condition is met, then terminate the loop, completing the efficient exploration of reinforcement learning.
  2. The efficient exploration method for reinforcement learning according to claim 1, characterized in that the termination condition is that the number of interactions between the reinforcement learning algorithm and the environment reaches the preset upper limit T.
  3. The efficient exploration method for reinforcement learning according to claim 1, characterized in that the reward R is computed using formula (2).
  4. The efficient exploration method for reinforcement learning according to claim 1, characterized in that in 2.6), the distance between the two policies, dist = ‖μ(s) - μ_E(s)‖², is computed during training, and α is dynamically adjusted according to the distance dist: when the distance between the two policies exceeds the preset upper bound, α is increased; when it falls below the preset lower bound, α is decreased.
  5. The efficient exploration method for reinforcement learning according to claim 1, characterized in that the reward magnitude factor β = 1 and the probability p = 0.1.
PCT/CN2020/097757 2019-06-24 2020-06-23 An efficient exploration method for reinforcement learning WO2020259504A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910549586.XA CN110390399A (zh) 2019-06-24 2019-06-24 An efficient exploration method for reinforcement learning
CN201910549586.X 2019-06-24

Publications (1)

Publication Number Publication Date
WO2020259504A1 true WO2020259504A1 (zh) 2020-12-30

Family

ID=68285838

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097757 WO2020259504A1 (zh) 2019-06-24 2020-06-23 An efficient exploration method for reinforcement learning

Country Status (2)

Country Link
CN (1) CN110390399A (zh)
WO (1) WO2020259504A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390399A (zh) * 2019-06-24 2019-10-29 浙江大学 An efficient exploration method for reinforcement learning
CN111861159B (zh) * 2020-07-03 2024-02-02 武汉实为信息技术股份有限公司 A task allocation method based on reinforcement learning
CN112462613B (zh) * 2020-12-08 2022-09-23 周世海 A reinforcement learning agent control optimization method based on Bayesian probability
CN113239629B (zh) * 2021-06-03 2023-06-16 上海交通大学 A reinforcement learning exploration and exploitation method based on determinantal point processes over trajectory space

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729953A (zh) * 2017-09-18 2018-02-23 清华大学 Robot plume tracking method based on reinforcement learning in continuous state-action domains
US20180101784A1 (en) * 2016-10-05 2018-04-12 D-Wave Systems Inc. Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
US20190179938A1 (en) * 2017-12-13 2019-06-13 Google Llc Reinforcement learning techniques to improve searching and/or to conserve computational and network resources
CN110390399A (zh) * 2019-06-24 2019-10-29 浙江大学 An efficient exploration method for reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180101784A1 (en) * 2016-10-05 2018-04-12 D-Wave Systems Inc. Discrete variational auto-encoder systems and methods for machine learning using adiabatic quantum computers
CN107729953A (zh) * 2017-09-18 2018-02-23 清华大学 Robot plume tracking method based on reinforcement learning in continuous state-action domains
US20190179938A1 (en) * 2017-12-13 2019-06-13 Google Llc Reinforcement learning techniques to improve searching and/or to conserve computational and network resources
CN110390399A (zh) * 2019-06-24 2019-10-29 浙江大学 An efficient exploration method for reinforcement learning

Also Published As

Publication number Publication date
CN110390399A (zh) 2019-10-29

Similar Documents

Publication Publication Date Title
WO2020259504A1 (zh) An efficient exploration method for reinforcement learning
Zhang et al. A competitive mechanism based multi-objective particle swarm optimizer with fast convergence
Blondé et al. Sample-efficient imitation learning via generative adversarial nets
CN109690576A (zh) 在多个机器学习任务上训练机器学习模型
Chuang et al. The annealing robust backpropagation (ARBP) learning algorithm
US20220176248A1 (en) Information processing method and apparatus, computer readable storage medium, and electronic device
WO2020190415A1 (en) Reinforcement learning to train a character using disparate target animation data
CN113408743A (zh) 联邦模型的生成方法、装置、电子设备和存储介质
Tan et al. Deep reinforcement learning: from Q-learning to deep Q-learning
CN109858630A (zh) 用于强化学习的方法和设备
Zhong et al. Density-based evolutionary framework for crowd model calibration
WO2007050622A2 (en) Weighted pattern learning for neural networks
CN110488611A (zh) 一种仿生机器鱼运动控制方法、控制器及仿生机器鱼
CN108683614A (zh) 基于门限残差网络的虚拟现实设备集群带宽分配装置
CN116834037B (zh) 基于动态多目标优化的采摘机械臂轨迹规划方法及装置
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN113379027A (zh) 一种生成对抗交互模仿学习方法、系统、存储介质及应用
CN113843802B (zh) 一种基于深度强化学习td3算法的机械臂运动控制方法
CN114911157A (zh) 基于部分可观测强化学习的机器人导航控制方法及系统
Chen et al. Attention Loss Adjusted Prioritized Experience Replay
CN114511092A (zh) 一种基于量子线路的图注意力机制实现方法
Reid et al. Mutual reinforcement learning with heterogenous agents
CN109816530A (zh) 一种基于深度强化学习a3c算法的金融交易方法
CN116227571B (zh) 模型的训练、动作确定方法、装置、电子设备及存储介质
CN115796244B (zh) 一种超非线性输入输出系统基于cff的参数辨识方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20832243

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20832243

Country of ref document: EP

Kind code of ref document: A1