CN116757272A - A continuous action control reinforcement learning framework and learning method

A continuous action control reinforcement learning framework and learning method

Info

Publication number
CN116757272A
Authority
CN
China
Prior art keywords
action control
continuous action
expectation
reinforcement learning
clustering
Legal status
Pending
Application number
CN202310805443.7A
Other languages
Chinese (zh)
Inventor
黄天意
Current Assignee
Westlake University
Original Assignee
Westlake University
Priority date
2023-07-03
Filing date
2023-07-03
Publication date
2023-09-15
Application filed by Westlake University
Priority to CN202310805443.7A
Publication of CN116757272A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

The application discloses a continuous action control reinforcement learning framework and learning method, and relates to the technical field of artificial intelligence. The learning framework includes: a multi-step state-transition learning module, which uses a convolutional neural network to learn multi-step state transitions and update the policy; an expectation estimation module, which uses a multi-step temporal-difference algorithm to estimate the expectation of the multi-step cumulative return; and a sample clustering module, which clusters different types of state-transition samples so that each type of sample is sampled uniformly. The application combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering, which effectively improves learning efficiency and accuracy and makes fuller use of the samples.

Description

A continuous action control reinforcement learning framework and learning method

Technical Field

The present invention relates to the field of artificial intelligence technology, and in particular to a continuous action control reinforcement learning framework and learning method.

Background

At present, several effective deep reinforcement learning algorithms have been proposed for optimizing continuous control. The most representative is DDPG, which works on the actor-critic method. Let ρ_t be the state at time t and α_t the action at time t, and define the deterministic policy as follows:

α_t = π_θ(ρ_t)

The existing actor-critic framework trains the agent by alternately updating an estimate of the cumulative return and a policy that maximizes this estimate. The estimate of the cumulative return can be obtained by minimizing the following objective function:

J(θ_Q) = E_{(ρ_t, α_t, r_t, ρ_{t+1}) ∈ B} [ ( r_t + γ Q_{θ'_Q}(ρ_{t+1}, π_{θ'}(ρ_{t+1})) - Q_{θ_Q}(ρ_t, α_t) )^2 ]

where B is the set of sampled state transitions, returns and actions, Q_{θ_Q} is the estimated cumulative-return function, θ'_Q and θ' are the corresponding target-network parameters, and γ is the discount factor.

The objective function that needs to be maximized when updating the policy is as follows:

J(θ) = E_{ρ_t ∈ B} [ Q_{θ_Q}(ρ_t, π_θ(ρ_t)) ]

Building on the actor-critic framework, DDPG learns single-step state transitions with a fully connected neural network and estimates the expectation of the cumulative-return function from single-step returns. TD3 and SAC are two improved algorithms based on DDPG. TD3 addresses overestimation, policy updating and exploration in DDPG through twin critic networks, temporal-difference estimation and Gaussian noise. SAC improves exploration in DDPG mainly by modifying the objective function; it also uses twin critic networks and temporal-difference estimation.
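As a point of reference for the background above, the following is a minimal sketch of a single-step DDPG-style actor-critic update in PyTorch. It is illustrative only: the layer sizes, learning rates, and the omission of target networks and exploration noise are assumptions, not part of the claimed method.

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 17, 6, 0.99           # illustrative dimensions

actor = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                      nn.Linear(256, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                       nn.Linear(256, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def ddpg_update(s, a, r, s_next):
    """One single-step actor-critic update on a sampled batch (s, a, r, s_next)."""
    with torch.no_grad():                             # single-step TD target r + gamma * Q(s', pi(s'))
        target = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=-1))
    critic_loss = ((critic(torch.cat([s, a], dim=-1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()  # maximize Q(s, pi(s))
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# usage sketch with a random batch of 32 transitions
ddpg_update(torch.randn(32, state_dim), torch.randn(32, action_dim),
            torch.randn(32, 1), torch.randn(32, state_dim))
```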

However, the existing technology has the following shortcomings:

1. Considering only single-step state transitions leads to insufficient learning efficiency.

2. Estimating the expectation of the cumulative return from only single-step returns leads to inaccurate estimates.

3. Updating the neural network with randomly sampled state transitions tends to leave the samples underutilized.

Summary of the Invention

In order to overcome the above problems, or at least partially solve them, the present invention provides a continuous action control reinforcement learning framework and learning method that combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering, effectively improving learning efficiency and accuracy and making fuller use of the samples.

To solve the above technical problems, the technical solution adopted by the present invention is as follows:

In a first aspect, the present invention provides a continuous action control reinforcement learning framework, comprising a multi-step state-transition learning module, an expectation estimation module and a sample clustering module, wherein:

the multi-step state-transition learning module is configured to learn multi-step state transitions with a convolutional neural network and to update the policy;

the expectation estimation module is configured to estimate the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and

the sample clustering module is configured to cluster different types of state-transition samples so that each type of sample is sampled uniformly.

This framework is the first to combine a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering. It has the following characteristics: a convolutional neural network takes multi-step state transitions into account when updating the policy; a multi-step temporal-difference algorithm estimates the expectation of the multi-step cumulative return; and clustering the existing state-transition samples ensures that every type of sample is sufficiently sampled. In reinforcement learning for continuous control, the present invention learns multi-step state transitions with a convolutional neural network, which improves learning efficiency; on this basis it estimates the expectation of the cumulative return from multi-step returns, which makes the estimate more accurate; and it clusters state-transition samples so that different types of samples are sampled uniformly, which makes fuller use of the samples.

Based on the first aspect, further, the above policy is α_t = π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t), where α is the action, ρ is the state, π is the policy, θ_c are the parameters of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.

Based on the first aspect, further, the following objective function is minimized to obtain the estimated expectation of the multi-step cumulative return.

The objective function is:

J(θ_q) = E_{B_n} [ ( Σ_{i=0}^{n_q-1} γ^i r_{t+i} + γ^{n_q} Q_{θ'_q}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q}, π_{θ_c}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q})) - Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, α_t) )^2 ]

where n_p is the number of state-transition steps, n_q is the number of reward steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function that estimates the expectation of the cumulative return, θ_q are the parameters of Q and θ'_q are the corresponding target parameters.

Based on the first aspect, further, the policy is updated with the function Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t)).

Based on the first aspect, further, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples within each period are clustered; the clustering method adopts the k-means algorithm.

Based on the first aspect, further, when state transitions are sampled to update the functions, the samples in each cluster are sampled uniformly.

The present invention has at least the following advantages or beneficial effects:

The present invention provides a continuous action control reinforcement learning framework and learning method that combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering. In reinforcement learning for continuous control, it learns multi-step state transitions with a convolutional neural network, which improves learning efficiency; on this basis it estimates the expectation of the cumulative return from multi-step returns, which makes the estimate more accurate; and it clusters state-transition samples so that different types of samples are sampled uniformly, which makes fuller use of the samples.

Brief Description of the Drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required for the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and should therefore not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.

Figure 1 is a schematic diagram of a continuous action control reinforcement learning framework according to an embodiment of the present invention;

Figure 2 is a schematic diagram of agents undergoing experimental training in different virtual environments in an embodiment of the present invention;

Figure 3 is a schematic diagram of the sampling pool obtained after sample clustering in an embodiment of the present invention.

Detailed Description of the Embodiments

In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of different configurations.

Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

In the description of the embodiments of the present invention, "plurality" means at least two.

Embodiment:

As shown in Figure 1, in a first aspect, an embodiment of the present invention provides a continuous action control reinforcement learning framework, comprising a multi-step state-transition learning module 100, an expectation estimation module 200 and a sample clustering module 300, wherein:

the multi-step state-transition learning module 100 is configured to learn multi-step state transitions with a convolutional neural network and to update the policy;

the expectation estimation module 200 is configured to estimate the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and

the sample clustering module 300 is configured to cluster different types of state-transition samples so that each type of sample is sampled uniformly.

Through the cooperation of the multi-step state-transition learning module 100, the expectation estimation module 200 and the sample clustering module 300, this framework combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering. It has the following characteristics: a convolutional neural network takes multi-step state transitions into account when updating the policy; a multi-step temporal-difference algorithm estimates the expectation of the multi-step cumulative return; and clustering the existing state-transition samples ensures that every type of sample is sufficiently sampled. In reinforcement learning for continuous control, the present invention learns multi-step state transitions with a convolutional neural network, which improves learning efficiency; on this basis it estimates the expectation of the cumulative return from multi-step returns, which makes the estimate more accurate; and it clusters state-transition samples so that different types of samples are sampled uniformly, which makes fuller use of the samples.

Based on the first aspect, further, the above policy is α_t = π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t), where α is the action, ρ is the state, π is the policy, θ_c are the parameters of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.
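For illustration, a minimal sketch of a convolutional policy over a window of the last n_p states is given below, written in PyTorch. The 1-D convolution layout, channel counts and kernel size are assumptions; the description above only specifies that a convolutional neural network consumes the multi-step state sequence.

```python
import torch
import torch.nn as nn

class ConvPolicy(nn.Module):
    """Maps the last n_p states (a window) to a single continuous action."""
    def __init__(self, state_dim, action_dim, n_p):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(state_dim, 64, kernel_size=min(3, n_p)), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool over the time dimension
        )
        self.head = nn.Sequential(nn.Linear(64, action_dim), nn.Tanh())

    def forward(self, states):                # states: (batch, n_p, state_dim)
        x = states.transpose(1, 2)            # -> (batch, state_dim, n_p) for Conv1d
        return self.head(self.conv(x).squeeze(-1))

policy = ConvPolicy(state_dim=17, action_dim=6, n_p=4)
action = policy(torch.randn(32, 4, 17))       # one action per stacked state window
```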

Based on the first aspect, further, the following objective function is minimized to obtain the estimated expectation of the multi-step cumulative return.

The objective function is:

J(θ_q) = E_{B_n} [ ( Σ_{i=0}^{n_q-1} γ^i r_{t+i} + γ^{n_q} Q_{θ'_q}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q}, π_{θ_c}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q})) - Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, α_t) )^2 ]

where n_p is the number of state-transition steps, n_q is the number of reward steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function that estimates the expectation of the cumulative return, θ_q are the parameters of Q and θ'_q are the corresponding target parameters.

Based on the above policy, in the newly defined framework the estimate of the cumulative return can be obtained by minimizing the above objective function; the multi-step target term above plays the role that the single-step target plays in the original framework.
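A minimal sketch of how the n_q-step target in the objective above can be computed is shown below. The discount factor γ, the tensor shapes and the use of separate target networks are assumptions consistent with the temporal-difference formulation, not details fixed by the description.

```python
import torch

def multi_step_target(rewards, next_window, target_policy, target_q, gamma=0.99):
    """n_q-step TD target: sum of discounted rewards plus a bootstrapped Q value.

    rewards:     (batch, n_q) tensor of r_t ... r_{t+n_q-1}
    next_window: (batch, n_p, state_dim) states ending at time t + n_q
    """
    n_q = rewards.shape[1]
    discounts = gamma ** torch.arange(n_q, dtype=rewards.dtype)
    discounted_sum = (rewards * discounts).sum(dim=1, keepdim=True)
    with torch.no_grad():                     # target networks are held fixed
        bootstrap = target_q(next_window, target_policy(next_window))
    return discounted_sum + (gamma ** n_q) * bootstrap

# The critic loss is then the mean squared error between
# Q(rho_{t-n_p+1..t}, alpha_t) and this target over the sampled set B_n.
```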

Based on the first aspect, further, the policy is updated with the function Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t)).

The objective function that needs to be maximized when updating the policy is as above; this step is likewise carried out by updating the convolutional neural network defined above.

Based on the first aspect, further, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples within each period are clustered. The clustering method adopts the k-means algorithm.

Based on the first aspect, further, when state transitions are sampled to update the functions, the samples in each cluster are sampled uniformly.

In some embodiments of the present invention, the total number of training steps also needs to be divided evenly into different time periods, and the samples within each period are then clustered. The clustering method is k-means. The resulting sampling pool is shown in Figure 3. During sampling, assume the current period is p and the number of clusters in each period is k. Each time the neural network is updated, the samples in each existing cluster are drawn with equal probability, the remaining probability mass being split evenly among the clusters, while samples from the current period are drawn with probability 0.2.
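The segment-wise clustering and cluster-balanced sampling described above can be sketched as follows. The flattened-transition features passed to k-means and the even split of the remaining 0.8 probability mass across clusters are assumptions drawn from the description, not prescribed details.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_segment(transitions, k):
    """Cluster one finished time segment; transitions is an (n, feat) array."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(transitions)
    return [transitions[labels == c] for c in range(k)]

def sample_batch(clustered_segments, current_segment, batch_size, rng):
    """Draw about 20% of the batch from the current segment and spread the rest
    evenly over all existing clusters from earlier segments."""
    clusters = [c for segment in clustered_segments for c in segment if len(c)]
    n_current = max(1, int(round(0.2 * batch_size)))
    picks = [current_segment[rng.integers(0, len(current_segment), n_current)]]
    if clusters:
        per_cluster = max(1, (batch_size - n_current) // len(clusters))
        picks += [c[rng.integers(0, len(c), per_cluster)] for c in clusters]
    return np.concatenate(picks, axis=0)

rng = np.random.default_rng(0)
segment = rng.normal(size=(500, 8))                 # toy flattened transitions
clustered = [cluster_segment(segment, k=4)]         # clusters from one past segment
batch = sample_batch(clustered, rng.normal(size=(200, 8)), batch_size=256, rng=rng)
```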

In some embodiments of the present invention, the algorithm flow for learning based on this framework is as follows:

Let np be the number of defined time periods and pt the number of steps in each time period; the algorithm is as follows:

Initialize the neural network parameters
Initialize the sampling space
Initialize the exploration noise
For e = 1 : np
    For t = 1 : pt
        Select an action through the policy
        Add exploration noise to the action
        Execute the action to obtain the reward r_t and the next state
        Store the action, state and reward in the sampling space
        Select samples from the existing clusters in the sampling space and from the state transitions generated in the current period
        Update the neural networks with the selected samples
    End for
    Cluster the samples from the previous time period
End for
Output the policy model based on the neural network.
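Translating the listing above into runnable form, a minimal training-loop skeleton might look like the following. The environment interface, the 0.1 noise scale and the helper callables (policy, update_networks, cluster_segment) are stand-ins for the components sketched earlier and are assumptions, not the claimed implementation.

```python
import numpy as np

def train(env, policy, update_networks, cluster_segment, n_periods, pt, k, seed=0):
    """np time periods of pt steps each; cluster each finished period's samples."""
    rng = np.random.default_rng(seed)
    clustered_segments = []          # clusters from earlier periods (the sampling space)
    current_segment = []             # transitions generated in the current period
    state_window = env.reset()       # last n_p states stacked into one window
    for period in range(n_periods):
        for step in range(pt):
            action = policy(state_window)
            action = action + 0.1 * rng.standard_normal(np.shape(action))  # exploration noise
            next_window, reward, done = env.step(action)
            current_segment.append((state_window, action, reward, next_window))
            update_networks(clustered_segments, current_segment)  # clusters + current period
            state_window = env.reset() if done else next_window
        clustered_segments.append(cluster_segment(current_segment, k))  # cluster the finished period
        current_segment = []
    return policy                    # the neural-network-based policy model
```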

In some embodiments of the present invention, the TD3 algorithm is improved with the framework proposed by the present invention, yielding a new algorithm, TD3+. Experiments were conducted in the virtual robot control environment MuJoCo on the HalfCheetah, Walker2d and Hopper tasks. Figure 2 shows the agents in the HalfCheetah, Walker2d and Hopper environments. In the first two environments, reinforcement learning must make the agent walk as far as possible within a fixed number of steps, while the last requires training a one-legged agent to hop as far as possible.

The comparison algorithms include DDPG, SAC and TD3. For all methods, each task was run for 2×10^6 time steps. The cumulative returns obtained by the different algorithms on the different tasks are shown in Table 1; it can be seen that TD3+, implemented with the proposed framework, outperforms the existing algorithms.

Table 1:

Environment     TD3+        TD3         SAC        DDPG
HalfCheetah     13589.17    10032.66    9643.93    9453.22
Walker2d        6167.26     4471.43     4971.42    3804.91
Hopper          3812.30     3472.65     3531.77    3736.21

In some embodiments of the present invention, ablation experiments were performed; the results are shown in Table 2, comparing the new method without clustering (TD3+woC), without the convolutional neural network (TD3+woS) and without the multi-step temporal-difference algorithm (TD3+woQ). It can be seen that each part of the present invention (the convolutional neural network, multi-step temporal-difference estimation and clustering) effectively improves the performance of reinforcement learning.

Table 2:

Environment     TD3+        TD3+woC     TD3+woS    TD3+woQ
HalfCheetah     13589.17    12824.48    12654.81   12051.56
Walker2d        6267.26     6056.13     5401.72    5737.12
Hopper          3812.30     3758.30     3713.23    3762.34

In a second aspect, an embodiment of the present invention provides a continuous action control reinforcement learning method based on the continuous action control reinforcement learning framework according to any one of the above first aspects, comprising the following steps:

learning multi-step state transitions with a convolutional neural network and updating the learning policy;

estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and

clustering different types of state-transition samples so that each type of sample is sampled uniformly.

In reinforcement learning for continuous control, the present invention learns multi-step state transitions with a convolutional neural network, which improves learning efficiency; on this basis it estimates the expectation of the cumulative return from multi-step returns, which makes the estimate more accurate; and it clusters state-transition samples so that different types of samples are sampled uniformly, which makes fuller use of the samples.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments, and that the present application can be implemented in other specific forms without departing from the spirit or essential characteristics of the present application. Therefore, the embodiments should be regarded as exemplary and non-restrictive in every respect, and the scope of the present application is defined by the appended claims rather than the above description; it is therefore intended that all changes falling within the meaning and scope of the equivalent elements of the claims be embraced in the present application. Any reference signs in the claims shall not be construed as limiting the claims concerned.

Claims (7)

1. A continuous action control reinforcement learning framework, characterized by comprising a multi-step state-transition learning module, an expectation estimation module and a sample clustering module, wherein: the multi-step state-transition learning module is configured to learn multi-step state transitions with a convolutional neural network and to update the policy; the expectation estimation module is configured to estimate the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and the sample clustering module is configured to cluster different types of state-transition samples so that each type of sample is sampled uniformly.

2. The continuous action control reinforcement learning framework according to claim 1, characterized in that the policy is α_t = π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t), where α is the action, ρ is the state, π is the policy, θ_c are the parameters of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.

3. The continuous action control reinforcement learning framework according to claim 1, characterized in that the following objective function is minimized to obtain the estimated expectation of the multi-step cumulative return, the objective function being:

J(θ_q) = E_{B_n} [ ( Σ_{i=0}^{n_q-1} γ^i r_{t+i} + γ^{n_q} Q_{θ'_q}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q}, π_{θ_c}(ρ_{t+n_q-n_p+1}, ..., ρ_{t+n_q})) - Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, α_t) )^2 ]

where n_p is the number of state-transition steps, n_q is the number of reward steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function that estimates the expectation of the cumulative return, θ_q are the parameters of Q and θ'_q are the corresponding target parameters.

4. The continuous action control reinforcement learning framework according to claim 3, characterized in that the policy is updated with the function Q_{θ_q}(ρ_{t-n_p+1}, ..., ρ_t, π_{θ_c}(ρ_{t-n_p+1}, ..., ρ_t)).

5. The continuous action control reinforcement learning framework according to claim 1, characterized in that, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples within each period are clustered; the clustering method adopts the k-means algorithm.

6. The continuous action control reinforcement learning framework according to claim 1, characterized in that, when state transitions are sampled to update the functions, the samples in each cluster are sampled uniformly.

7. A continuous action control reinforcement learning method based on the continuous action control reinforcement learning framework according to any one of claims 1-6, characterized by comprising the following operations: learning multi-step state transitions with a convolutional neural network and updating the learning policy; estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and clustering different types of state-transition samples so that each type of sample is sampled uniformly.
CN202310805443.7A 2023-07-03 2023-07-03 A continuous action control reinforcement learning framework and learning method Pending CN116757272A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310805443.7A CN116757272A (en) 2023-07-03 2023-07-03 A continuous action control reinforcement learning framework and learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310805443.7A CN116757272A (en) 2023-07-03 2023-07-03 A continuous action control reinforcement learning framework and learning method

Publications (1)

Publication Number Publication Date
CN116757272A true CN116757272A (en) 2023-09-15

Family

ID=87956899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310805443.7A Pending CN116757272A (en) 2023-07-03 2023-07-03 A continuous action control reinforcement learning framework and learning method

Country Status (1)

Country Link
CN (1) CN116757272A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
US20210397959A1 (en) * 2020-06-22 2021-12-23 Google Llc Training reinforcement learning agents to learn expert exploration behaviors from demonstrators
CN115293217A (en) * 2022-08-23 2022-11-04 南京邮电大学 Unsupervised pseudo tag optimization pedestrian re-identification method based on radio frequency signals
CN115439887A (en) * 2022-08-26 2022-12-06 三维通信股份有限公司 Pedestrian re-identification method and system based on pseudo label optimization and storage medium
CN116224794A (en) * 2023-03-03 2023-06-06 北京理工大学 Reinforced learning continuous action control method based on discrete-continuous heterogeneous Q network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIN LI et al.: "Clustering experience replay for the effective exploitation in reinforcement learning", Elsevier, pages 1-9 *
黄天意: "Deep reinforcement learning algorithms and their application in unsupervised denoising" (深度强化学习算法及其在无监督去噪中的应用研究), CNKI Doctoral Dissertations Electronic Journal, pages 3-4 *


Legal Events

Code    Title
PB01    Publication
SE01    Entry into force of request for substantive examination
RJ01    Rejection of invention patent application after publication (application publication date: 20230915)