CN116757272A - A continuous action control reinforcement learning framework and learning method

A continuous action control reinforcement learning framework and learning method

Info

Publication number
CN116757272A
Authority
CN
China
Prior art keywords
action control
continuous action
expectation
reinforcement learning
clustering
Legal status
Pending
Application number
CN202310805443.7A
Other languages
Chinese (zh)
Inventor
黄天意
Current Assignee
Westlake University
Original Assignee
Westlake University
Priority date
2023-07-03
Filing date
2023-07-03
Publication date
2023-09-15
Application filed by Westlake University
Priority to CN202310805443.7A
Publication of CN116757272A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

The application discloses a continuous action control reinforcement learning framework and learning method, and relates to the technical field of artificial intelligence. The learning framework includes: a multi-step state-transition learning module, which uses a convolutional neural network to learn multi-step state transitions and update the policy; an expectation estimation module, which uses a multi-step temporal-difference algorithm to estimate the expectation of the multi-step cumulative return; and a sample clustering module, which clusters different types of state-transition samples so that each type of sample is sampled uniformly. The application combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering, which effectively improves learning efficiency and accuracy and makes fuller use of the samples.

Description

A continuous action control reinforcement learning framework and learning method

Technical Field

The present invention relates to the field of artificial intelligence technology, and in particular to a continuous action control reinforcement learning framework and learning method.

Background

At present, several effective deep reinforcement learning algorithms have been proposed for optimizing continuous control. The most representative is DDPG, which works on the actor-critic method. Let ρ_t be the state at time t and α_t the action at time t, and define the deterministic policy as follows:

α_t = π_θ(ρ_t)

The existing actor-critic framework trains the agent by alternately updating an estimate of the cumulative return and a policy that maximizes this estimate. The estimate of the cumulative return can be obtained by minimizing the following objective function:

J(θ_Q) = E_{(ρ_t, α_t, r_t, ρ_{t+1}) ∈ B} [ ( r_t + γ Q_{θ'_Q}(ρ_{t+1}, π_{θ'}(ρ_{t+1})) - Q_{θ_Q}(ρ_t, α_t) )^2 ]

where B is the set of sampled state transitions, returns and actions, Q_{θ_Q} is the estimated cumulative-return function, θ'_Q and θ' are the corresponding target-network parameters, and γ is the discount factor.

The objective function that needs to be maximized when updating the policy is as follows:

J(θ) = E_{ρ_t ∈ B} [ Q_{θ_Q}(ρ_t, π_θ(ρ_t)) ]

Building on the actor-critic framework, DDPG learns single-step state transitions with a fully connected neural network and estimates the expectation of the cumulative-return function from single-step returns. TD3 and SAC are two improved algorithms based on DDPG. TD3 addresses overestimation, policy updating and exploration in DDPG through twin critic networks, temporal-difference estimation and Gaussian noise. SAC improves exploration in DDPG mainly by modifying the objective function; it also uses twin critic networks and temporal-difference estimation.
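As a point of reference for the background above, the following is a minimal sketch of a single-step DDPG-style actor-critic update in PyTorch. It is illustrative only: the layer sizes, learning rates, and the omission of target networks and exploration noise are assumptions, not part of the claimed method.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 17, 6, 0.99           # illustrative dimensions

actor = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                      nn.Linear(256, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                       nn.Linear(256, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def ddpg_update(s, a, r, s_next):
    """One single-step actor-critic update on a sampled batch (s, a, r, s_next)."""
    with torch.no_grad():                             # single-step TD target r + gamma * Q(s', pi(s'))
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = ((critic(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()  # maximize Q(s, pi(s))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# usage sketch with a random batch of 32 transitions
ddpg_update(torch.randn(32, state_dim), torch.randn(32, action_dim),
            torch.randn(32, 1), torch.randn(32, state_dim))
```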

However, the existing technology has the following shortcomings:

1. Considering only single-step state transitions leads to insufficient learning efficiency.

2. Estimating the expectation of the cumulative return from only single-step returns leads to inaccurate estimates.

3. Updating the neural network with randomly sampled state transitions tends to leave the samples underutilized.

Summary of the Invention

In order to overcome the above problems, or at least partially solve them, the present invention provides a continuous action control reinforcement learning framework and learning method that combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering, effectively improving learning efficiency and accuracy and making fuller use of the samples.

To solve the above technical problems, the technical solution adopted by the present invention is as follows:

In a first aspect, the present invention provides a continuous action control reinforcement learning framework, comprising a multi-step state-transition learning module, an expectation estimation module and a sample clustering module, wherein:

the multi-step state-transition learning module is configured to learn multi-step state transitions with a convolutional neural network and to update the policy;

the expectation estimation module is configured to estimate the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and

the sample clustering module is configured to cluster different types of state-transition samples so that each type of sample is sampled uniformly.

This framework is the first to combine a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering. It has the following characteristics: a convolutional neural network takes multi-step state transitions into account when updating the policy; a multi-step temporal-difference algorithm estimates the expectation of the multi-step cumulative return; and clustering the existing state-transition samples ensures that every type of sample is sufficiently sampled. In reinforcement learning for continuous control, the present invention learns multi-step state transitions with a convolutional neural network, which improves learning efficiency; on this basis it estimates the expectation of the cumulative return from multi-step returns, which makes the estimate more accurate; and it clusters state-transition samples so that different types of samples are sampled uniformly, which makes fuller use of the samples.

Based on the first aspect, further, the above policy is α_t = π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t), where α is the action, ρ is the state, π is the policy, θ_c are the parameters of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.

Based on the first aspect, further, the following objective function is minimized to obtain the estimated expectation of the multi-step cumulative return.

The objective function is:

J(θ_q) = E_{B_n} [ ( Σ_{i=0}^{n_q-1} γ^i r_{t+i} + γ^{n_q} Q_{θ'_q}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q}, π_{θ_c}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q})) - Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, α_t) )^2 ]

where n_p is the number of state-transition steps, n_q is the number of reward steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function that estimates the expectation of the cumulative return, θ_q are the parameters of Q and θ'_q are the corresponding target parameters.

Based on the first aspect, further, the policy is updated with the function Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t)).

Based on the first aspect, further, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples within each period are clustered; the clustering method adopts the k-means algorithm.

Based on the first aspect, further, when state transitions are sampled to update the functions, the samples in each cluster are sampled uniformly.

The present invention has at least the following advantages or beneficial effects:

The present invention provides a continuous action control reinforcement learning framework and learning method that combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering. In reinforcement learning for continuous control, it learns multi-step state transitions with a convolutional neural network, which improves learning efficiency; on this basis it estimates the expectation of the cumulative return from multi-step returns, which makes the estimate more accurate; and it clusters state-transition samples so that different types of samples are sampled uniformly, which makes fuller use of the samples.

Brief Description of the Drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required for the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.

Figure 1 is a schematic diagram of a continuous action control reinforcement learning framework according to an embodiment of the present invention;

Figure 2 is a schematic diagram of agents undergoing experimental training in different virtual environments in an embodiment of the present invention;

Figure 3 is a schematic diagram of the sampling pool obtained after sample clustering in an embodiment of the present invention.

Detailed Description of the Embodiments

In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of different configurations.

Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

In the description of the embodiments of the present invention, "plurality" means at least two.

Embodiment:

As shown in Figure 1, in a first aspect, an embodiment of the present invention provides a continuous action control reinforcement learning framework, comprising a multi-step state-transition learning module 100, an expectation estimation module 200 and a sample clustering module 300, wherein:

the multi-step state-transition learning module 100 is configured to learn multi-step state transitions with a convolutional neural network and to update the policy;

the expectation estimation module 200 is configured to estimate the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and

the sample clustering module 300 is configured to cluster different types of state-transition samples so that each type of sample is sampled uniformly.

Through the cooperation of the multi-step state-transition learning module 100, the expectation estimation module 200 and the sample clustering module 300, this framework combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering. It has the following characteristics: a convolutional neural network takes multi-step state transitions into account when updating the policy; a multi-step temporal-difference algorithm estimates the expectation of the multi-step cumulative return; and clustering the existing state-transition samples ensures that every type of sample is sufficiently sampled. In reinforcement learning for continuous control, the present invention learns multi-step state transitions with a convolutional neural network, which improves learning efficiency; on this basis it estimates the expectation of the cumulative return from multi-step returns, which makes the estimate more accurate; and it clusters state-transition samples so that different types of samples are sampled uniformly, which makes fuller use of the samples.

Based on the first aspect, further, the above policy is α_t = π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t), where α is the action, ρ is the state, π is the policy, θ_c are the parameters of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.
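For illustration, a minimal sketch of a convolutional policy over a window of the last n_p states is given below, written in PyTorch. The 1-D convolution layout, channel counts and kernel size are assumptions; the description above only specifies that a convolutional neural network consumes the multi-step state sequence.

```python
import torch
import torch.nn as nn

class ConvPolicy(nn.Module):
    """Maps the last n_p states (a window) to a single continuous action."""
    def __init__(self, state_dim, action_dim, n_p):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(state_dim, 64, kernel_size=min(3, n_p)), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool over the time dimension
        )
        self.head = nn.Sequential(nn.Linear(64, action_dim), nn.Tanh())

    def forward(self, states):                # states: (batch, n_p, state_dim)
        x = states.transpose(1, 2)            # -> (batch, state_dim, n_p) for Conv1d
        return self.head(self.conv(x).squeeze(-1))

policy = ConvPolicy(state_dim=17, action_dim=6, n_p=4)
action = policy(torch.randn(32, 4, 17))       # one action per stacked state window
```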

Based on the first aspect, further, the following objective function is minimized to obtain the estimated expectation of the multi-step cumulative return.

The objective function is:

J(θ_q) = E_{B_n} [ ( Σ_{i=0}^{n_q-1} γ^i r_{t+i} + γ^{n_q} Q_{θ'_q}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q}, π_{θ_c}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q})) - Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, α_t) )^2 ]

where n_p is the number of state-transition steps, n_q is the number of reward steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function that estimates the expectation of the cumulative return, θ_q are the parameters of Q and θ'_q are the corresponding target parameters.

Based on the above policy, in the newly defined framework the estimate of the cumulative return can be obtained by minimizing the above objective function; the multi-step target term above plays the role that the single-step target plays in the original framework.
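A minimal sketch of how the n_q-step target in the objective above can be computed is shown below. The discount factor γ, the tensor shapes and the use of separate target networks are assumptions consistent with the temporal-difference formulation, not details fixed by the description.

```python
import torch

def multi_step_target(rewards, next_window, target_policy, target_q, gamma=0.99):
    """n_q-step TD target: sum of discounted rewards plus a bootstrapped Q value.

    rewards:     (batch, n_q) tensor of r_t ... r_{t+n_q-1}
    next_window: (batch, n_p, state_dim) states ending at time t + n_q
    """
    n_q = rewards.shape[1]
    discounts = gamma ** torch.arange(n_q, dtype=rewards.dtype)
    discounted_sum = (rewards * discounts).sum(dim=1, keepdim=True)
    with torch.no_grad():                     # target networks are held fixed
        bootstrap = target_q(next_window, target_policy(next_window))
    return discounted_sum + (gamma ** n_q) * bootstrap

# The critic loss is then the mean squared error between
# Q(rho_{t-n_p+1..t}, alpha_t) and this target over the sampled set B_n.
```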

Based on the first aspect, further, the policy is updated with the function Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t)).

The objective function that needs to be maximized when updating the policy is as above; this step is likewise carried out by updating the convolutional neural network defined above.

Based on the first aspect, further, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples within each period are clustered. The clustering method adopts the k-means algorithm.

Based on the first aspect, further, when state transitions are sampled to update the functions, the samples in each cluster are sampled uniformly.

In some embodiments of the present invention, the total number of training steps also needs to be divided evenly into different time periods, and the samples within each period are then clustered. The clustering method is k-means. The resulting sampling pool is shown in Figure 3. During sampling, assume the current period is p and the number of clusters in each period is k. Each time the neural network is updated, the samples in each existing cluster are drawn with equal probability, the remaining probability mass being split evenly among the clusters, while samples from the current period are drawn with probability 0.2.
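The segment-wise clustering and cluster-balanced sampling described above can be sketched as follows. The flattened-transition features passed to k-means and the even split of the remaining 0.8 probability mass across clusters are assumptions drawn from the description, not prescribed details.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_segment(transitions, k):
    """Cluster one finished time segment; transitions is an (n, feat) array."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(transitions)
    return [transitions[labels == c] for c in range(k)]

def sample_batch(clustered_segments, current_segment, batch_size, rng):
    """Draw about 20% of the batch from the current segment and spread the rest
    evenly over all existing clusters from earlier segments."""
    clusters = [c for segment in clustered_segments for c in segment if len(c)]
    n_current = max(1, int(round(0.2 * batch_size)))
    picks = [current_segment[rng.integers(0, len(current_segment), n_current)]]
    if clusters:
        per_cluster = max(1, (batch_size - n_current) // len(clusters))
        picks += [c[rng.integers(0, len(c), per_cluster)] for c in clusters]
    return np.concatenate(picks, axis=0)

rng = np.random.default_rng(0)
segment = rng.normal(size=(500, 8))                 # toy flattened transitions
clustered = [cluster_segment(segment, k=4)]         # clusters from one past segment
batch = sample_batch(clustered, rng.normal(size=(200, 8)), batch_size=256, rng=rng)
```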

In some embodiments of the present invention, the algorithm flow for learning based on this framework is as follows:

Let np be the number of defined time periods and pt the number of steps in each time period; the algorithm is as follows:

Initialize the neural network parameters
Initialize the sampling space
Initialize the exploration noise
For e = 1 : np
    For t = 1 : pt
        Select an action through the policy
        Add exploration noise to the action
        Execute the action to obtain the reward r_t and the next state
        Store the action, state and reward in the sampling space
        Select samples from the existing clusters in the sampling space and from the state transitions generated in the current period
        Update the neural networks with the selected samples
    End for
    Cluster the samples from the previous time period
End for
Output the policy model based on the neural network.
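Translating the listing above into runnable form, a minimal training-loop skeleton might look like the following. The environment interface, the 0.1 noise scale and the helper callables (policy, update_networks, cluster_segment) are stand-ins for the components sketched earlier and are assumptions, not the claimed implementation.

```python
import numpy as np

def train(env, policy, update_networks, cluster_segment, n_periods, pt, k, seed=0):
    """np time periods of pt steps each; cluster each finished period's samples."""
    rng = np.random.default_rng(seed)
    clustered_segments = []          # clusters from earlier periods (the sampling space)
    current_segment = []             # transitions generated in the current period
    state_window = env.reset()       # last n_p states stacked into one window
    for period in range(n_periods):
        for step in range(pt):
            action = policy(state_window)
            action = action + 0.1 * rng.standard_normal(np.shape(action))  # exploration noise
            next_window, reward, done = env.step(action)
            current_segment.append((state_window, action, reward, next_window))
            update_networks(clustered_segments, current_segment)  # clusters + current period
            state_window = env.reset() if done else next_window
        clustered_segments.append(cluster_segment(current_segment, k))  # cluster the finished period
        current_segment = []
    return policy                    # the neural-network-based policy model
```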

In some embodiments of the present invention, the TD3 algorithm is improved with the framework proposed by the present invention, yielding a new algorithm, TD3+. Experiments were conducted in the virtual robot control environment MuJoCo on the HalfCheetah, Walker2d and Hopper tasks. Figure 2 shows the agents in the HalfCheetah, Walker2d and Hopper environments. In the first two environments, reinforcement learning must make the agent walk as far as possible within a fixed number of steps, while the last requires training a one-legged agent to hop as far as possible.

The comparison algorithms include DDPG, SAC and TD3. For all methods, each task was run for 2×10^6 time steps. The cumulative returns obtained by the different algorithms on the different tasks are shown in Table 1; it can be seen that TD3+, implemented with the proposed framework, outperforms the existing algorithms.

Table 1:

Environment     TD3+        TD3         SAC        DDPG
HalfCheetah     13589.17    10032.66    9643.93    9453.22
Walker2d        6167.26     4471.43     4971.42    3804.91
Hopper          3812.30     3472.65     3531.77    3736.21

In some embodiments of the present invention, ablation experiments were performed; the results are shown in Table 2, comparing the new method without clustering (TD3+woC), without the convolutional neural network (TD3+woS) and without the multi-step temporal-difference algorithm (TD3+woQ). It can be seen that each part of the present invention (the convolutional neural network, multi-step temporal-difference estimation and clustering) effectively improves the performance of reinforcement learning.

Table 2:

Environment     TD3+        TD3+woC     TD3+woS    TD3+woQ
HalfCheetah     13589.17    12824.48    12654.81   12051.56
Walker2d        6267.26     6056.13     5401.72    5737.12
Hopper          3812.30     3758.30     3713.23    3762.34

In a second aspect, an embodiment of the present invention provides a continuous action control reinforcement learning method based on the continuous action control reinforcement learning framework according to any one of the above first aspects, comprising the following steps:

learning multi-step state transitions with a convolutional neural network and updating the learning policy;

estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and

clustering different types of state-transition samples so that each type of sample is sampled uniformly.

In reinforcement learning for continuous control, the present invention learns multi-step state transitions with a convolutional neural network, which improves learning efficiency; on this basis it estimates the expectation of the cumulative return from multi-step returns, which makes the estimate more accurate; and it clusters state-transition samples so that different types of samples are sampled uniformly, which makes fuller use of the samples.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that the present application can be implemented in other specific forms without departing from the spirit or essential characteristics of the present application. Therefore, the embodiments should be regarded as exemplary and non-restrictive in every respect, and the scope of the present application is defined by the appended claims rather than the above description; it is therefore intended that all changes falling within the meaning and scope of the equivalent elements of the claims be embraced in the present application. Any reference signs in the claims shall not be construed as limiting the claims concerned.

Claims (7)

1. A continuous action control reinforcement learning framework, characterized by comprising a multi-step state-transition learning module, an expectation estimation module and a sample clustering module, wherein: the multi-step state-transition learning module is configured to learn multi-step state transitions with a convolutional neural network and to update the policy; the expectation estimation module is configured to estimate the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and the sample clustering module is configured to cluster different types of state-transition samples so that each type of sample is sampled uniformly.

2. The continuous action control reinforcement learning framework according to claim 1, characterized in that the policy is α_t = π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t), where α is the action, ρ is the state, π is the policy, θ_c are the parameters of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.

3. The continuous action control reinforcement learning framework according to claim 1, characterized in that the following objective function is minimized to obtain the estimated expectation of the multi-step cumulative return, the objective function being:

J(θ_q) = E_{B_n} [ ( Σ_{i=0}^{n_q-1} γ^i r_{t+i} + γ^{n_q} Q_{θ'_q}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q}, π_{θ_c}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q})) - Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, α_t) )^2 ]

where n_p is the number of state-transition steps, n_q is the number of reward steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function that estimates the expectation of the cumulative return, θ_q are the parameters of Q and θ'_q are the corresponding target parameters.

4. The continuous action control reinforcement learning framework according to claim 3, characterized in that the policy is updated with the function Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t)).

5. The continuous action control reinforcement learning framework according to claim 1, characterized in that, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples within each period are clustered; the clustering method adopts the k-means algorithm.

6. The continuous action control reinforcement learning framework according to claim 1, characterized in that, when state transitions are sampled to update the functions, the samples in each cluster are sampled uniformly.

7. A continuous action control reinforcement learning method based on the continuous action control reinforcement learning framework according to any one of claims 1-6, characterized by comprising the following operations: learning multi-step state transitions with a convolutional neural network and updating the learning policy; estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and clustering different types of state-transition samples so that each type of sample is sampled uniformly.
CN202310805443.7A 2023-07-03 2023-07-03 A continuous action control reinforcement learning framework and learning method Pending CN116757272A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310805443.7A CN116757272A (en) 2023-07-03 2023-07-03 A continuous action control reinforcement learning framework and learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310805443.7A CN116757272A (en) 2023-07-03 2023-07-03 A continuous action control reinforcement learning framework and learning method

Publications (1)

Publication Number Publication Date
CN116757272A true CN116757272A (en) 2023-09-15

Family

ID=87956899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310805443.7A Pending CN116757272A (en) 2023-07-03 2023-07-03 A continuous action control reinforcement learning framework and learning method

Country Status (1)

Country Link
CN (1) CN116757272A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
US20210397959A1 (en) * 2020-06-22 2021-12-23 Google Llc Training reinforcement learning agents to learn expert exploration behaviors from demonstrators
CN115293217A (en) * 2022-08-23 2022-11-04 南京邮电大学 Unsupervised pseudo tag optimization pedestrian re-identification method based on radio frequency signals
CN115439887A (en) * 2022-08-26 2022-12-06 三维通信股份有限公司 Pedestrian re-identification method and system based on pseudo label optimization and storage medium
CN116224794A (en) * 2023-03-03 2023-06-06 北京理工大学 Reinforced learning continuous action control method based on discrete-continuous heterogeneous Q network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIN LI et al.: "Clustering experience replay for the effective exploitation in reinforcement learning", Elsevier, pages 1-9 *
黄天意: "Deep reinforcement learning algorithms and their application in unsupervised denoising" (深度强化学习算法及其在无监督去噪中的应用研究), CNKI Doctoral Dissertations Electronic Journal, pages 3-4 *


Legal Events

Code    Title
PB01    Publication
SE01    Entry into force of request for substantive examination
RJ01    Rejection of invention patent application after publication (application publication date: 20230915)