CN116757272A - Continuous motion control reinforcement learning framework and learning method - Google Patents

Continuous motion control reinforcement learning framework and learning method

Info

Publication number
CN116757272A
CN116757272A CN202310805443.7A CN202310805443A CN116757272A CN 116757272 A CN116757272 A CN 116757272A CN 202310805443 A CN202310805443 A CN 202310805443A CN 116757272 A CN116757272 A CN 116757272A
Authority
CN
China
Prior art keywords
learning
motion control
clustering
reinforcement learning
continuous motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310805443.7A
Other languages
Chinese (zh)
Inventor
黄天意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Westlake University
Original Assignee
Westlake University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Westlake University
Priority to CN202310805443.7A
Publication of CN116757272A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

The application discloses a continuous action control reinforcement learning framework and learning method, and relates to the technical field of artificial intelligence. The learning framework includes: a multi-step state transition learning module for learning multi-step state transitions with a convolutional neural network and updating the policy; an expectation estimation module for estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm; and a sample clustering module for clustering state transition samples of different types so that every sample is sampled uniformly. The application combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering, effectively improves learning efficiency and accuracy, and makes fuller use of the samples.

Description

Continuous motion control reinforcement learning framework and learning method
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a continuous action control reinforcement learning framework and a learning method.
Background
Currently, several effective deep reinforcement learning algorithms have been proposed for optimizing continuous control. The most representative is DDPG, which is built on the actor-critic approach. Let ρ_t be the state at time t and α_t the action at time t; a deterministic policy is then defined as follows:
α_t = π_θ(ρ_t)
The existing actor-critic framework trains the agent by cyclically updating the estimate of the cumulative return and the policy that maximizes this estimate. An estimate of the cumulative return can be obtained by minimizing an objective function of the standard single-step temporal-difference form
L(θ_Q) = E_{(ρ_t, α_t, r_t, ρ_{t+1}) ∈ B} [ ( r_t + γ Q'(ρ_{t+1}, π_{θ'}(ρ_{t+1})) − Q(ρ_t, α_t) )^2 ],
where B is the set of sampled state transitions, rewards and actions, Q is the estimate of the cumulative return, Q' and π_{θ'} are the target networks and γ is the discount factor.
The objective function that needs to be maximized when updating the policy is the expected value of the critic along the policy, J(θ) = E_{ρ_t ∈ B} [ Q(ρ_t, π_θ(ρ_t)) ].
based on the actor-commentator framework, DDPG learns the state transitions of a single step mainly through a fully connected neural network and then estimates the expectations of the cumulative rewards function through the cumulative rewards of the single step. TD3 and SAC are two improved algorithms based on DDPG, and TD3 improves over-estimation, policy update and exploration in DDPG through a double criticizer network, time sequence differential estimation and gaussian noise. SAC has advanced the exploration in DDPG mainly by improving objective functions, it also uses double commentators networks and time-series differential estimation.
However, the prior art has the following disadvantages:
1. Considering only single-step state transitions leads to low learning efficiency.
2. Estimating the expectation of the cumulative return from only the single-step return may result in an inaccurate estimate.
3. Updating the neural network with randomly sampled state transitions tends to leave the samples under-utilized.
Disclosure of Invention
In order to overcome or at least partially solve the above problems, the present application provides a continuous motion control reinforcement learning framework and a learning method, which combine a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering, effectively improve learning efficiency and accuracy, and make fuller use of the samples.
In order to solve the technical problems, the application adopts the following technical scheme:
in a first aspect, the present application provides a continuous motion control reinforcement learning framework, comprising a multi-step state transition learning module, an expectation estimation module, and a sample clustering module, wherein:
the multi-step state transition learning module is used for learning multi-step state transitions with a convolutional neural network and updating the policy;
the expectation estimation module is used for estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm;
and the sample clustering module is used for clustering state transition samples of different types so that every sample is sampled uniformly.
The framework combines, for the first time, a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering, and has the following characteristics: the policy is updated with a convolutional neural network so that multi-step state transitions are taken into account; the expectation of the multi-step cumulative return is estimated with a multi-step temporal-difference algorithm; and every sample is fully sampled by clustering the existing state transition samples. In the application, learning multi-step state transitions through the convolutional neural network in reinforcement learning for continuous control improves learning efficiency; estimating the expected cumulative return from multi-step returns on this basis makes the estimate more accurate; and clustering lets state transition samples of different types be sampled uniformly, so that the samples are used more fully.
Based on the first aspect, further, the policy is α_t = π_{θ_c}(ρ_{t−n_p+1}, …, ρ_t), where α is the action, ρ is the state, π is the policy, θ_c is the parameter of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.
Based on the first aspect, further, the expectation of the multi-step cumulative return is estimated by minimizing an objective function in which n_p is the number of state-transition steps, n_q is the number of return steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function estimating the expectation of the cumulative return, and the remaining symbols are the parameters used to estimate Q.
Based on the first aspect, further, a function of the estimated cumulative-return expectation is adopted to update the policy.
Based on the first aspect, further, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples in each time period are clustered; the clustering method is the k-means algorithm.
Based on the first aspect, further, when state transitions are sampled to update the function, the samples in each cluster are sampled uniformly.
The application has at least the following advantages or beneficial effects:
The application provides a continuous action control reinforcement learning framework and learning method that combine a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering. In reinforcement learning for continuous control, learning multi-step state transitions through the convolutional neural network improves learning efficiency; estimating the expected cumulative return from multi-step returns on this basis makes the estimate more accurate; and clustering lets state transition samples of different types be sampled uniformly, so that the samples are used more fully.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a continuous motion control reinforcement learning framework according to an embodiment of the present application;
FIG. 2 is a schematic diagram of experimental training of an intelligent agent in different virtual environments according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a sampling pool obtained after sample clustering in an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the description of the embodiments of the present application, "plurality" means at least 2.
Examples:
As shown in FIG. 1, in a first aspect, an embodiment of the present application provides a continuous motion control reinforcement learning framework, including a multi-step state transition learning module 100, an expectation estimation module 200, and a sample clustering module 300, wherein:
the multi-step state transition learning module 100 is used for learning multi-step state transitions with a convolutional neural network and updating the policy;
the expectation estimation module 200 is configured to estimate the expectation of the multi-step cumulative return using a multi-step temporal-difference algorithm;
the sample clustering module 300 is configured to cluster state transition samples of different types, so that every sample is sampled uniformly.
Through the cooperation of the multi-step state transition learning module 100, the expectation estimation module 200 and the sample clustering module 300, the framework combines a convolutional neural network, multi-step temporal-difference estimation and state-transition clustering, and has the following characteristics: the policy is updated with a convolutional neural network so that multi-step state transitions are taken into account; the expectation of the multi-step cumulative return is estimated with a multi-step temporal-difference algorithm; and every sample is fully sampled by clustering the existing state transition samples. In the application, learning multi-step state transitions through the convolutional neural network in reinforcement learning for continuous control improves learning efficiency; estimating the expected cumulative return from multi-step returns on this basis makes the estimate more accurate; and clustering lets state transition samples of different types be sampled uniformly, so that the samples are used more fully.
Based on the first aspect, further, the policy is α_t = π_{θ_c}(ρ_{t−n_p+1}, …, ρ_t), where α is the action, ρ is the state, π is the policy, θ_c is the parameter of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.
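As an illustration of a policy network that conditions on the last n_p states through a convolution, a minimal sketch follows; the layer widths, the 1-D convolution over the time axis and the tanh output squashing are assumptions of the example rather than details fixed by the filing.

```python
import torch
import torch.nn as nn

class ConvPolicy(nn.Module):
    """pi_{theta_c}: maps the last n_p states (rho_{t-n_p+1}, ..., rho_t) to the action alpha_t.

    The layer widths and the 1-D convolution over the time axis are illustrative assumptions.
    """
    def __init__(self, state_dim, action_dim, n_p=4, hidden=256):
        super().__init__()
        # Convolve across the n_p time steps; the state dimension acts as the channel axis.
        self.conv = nn.Sequential(nn.Conv1d(state_dim, hidden, kernel_size=n_p), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions assumed bounded in [-1, 1]
        )

    def forward(self, states):
        # states: (batch, n_p, state_dim) -> (batch, state_dim, n_p) for Conv1d.
        features = self.conv(states.transpose(1, 2)).squeeze(-1)
        return self.head(features)
```

A batch of stacked states of shape (batch, n_p, state_dim) is then mapped to a batch of actions in one forward pass.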
Based on the first aspect, further, the expectation of the multi-step cumulative return is estimated by minimizing an objective function in which n_p is the number of state-transition steps, n_q is the number of return steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function estimating the expectation of the cumulative return, and the remaining symbols are the parameters used to estimate Q.
Based on this policy, in the newly defined framework an estimate of the cumulative return can be obtained by minimizing the objective function described above; the symbols in the objective correspond to those defined above.
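The exact objective is the one described above. Purely as an illustration of the general idea behind a multi-step temporal-difference target, namely summing n_q sampled returns and only then bootstrapping with the critic, the following sketch is given; the discount factor, tensor shapes and target networks are assumptions of the sketch and not details taken from the filing.

```python
import torch

def n_step_td_target(rewards, final_state, critic_target, actor_target, gamma=0.99):
    """Multi-step TD target: accumulate n_q discounted returns, then bootstrap with Q.

    rewards: tensor of shape (batch, n_q) holding r_t, ..., r_{t+n_q-1};
    final_state: the state reached after the n_q steps.
    The discount factor, shapes and target networks are assumptions of this sketch.
    """
    n_q = rewards.shape[1]
    discounts = gamma ** torch.arange(n_q, dtype=rewards.dtype)
    n_step_return = (rewards * discounts).sum(dim=1, keepdim=True)
    with torch.no_grad():
        bootstrap = critic_target(final_state, actor_target(final_state))
    return n_step_return + (gamma ** n_q) * bootstrap
```

The critic would then be trained on the squared difference between its prediction and this target.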
Based on the first aspect, further, a function of the estimated cumulative-return expectation is adopted to update the policy.
The objective function that needs to be maximized when updating the policy is the one given above; this update is likewise carried out by updating the defined convolutional neural network.
Based on the first aspect, further, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples in each time period are clustered. The clustering method is the k-means algorithm.
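A minimal sketch of the per-period clustering step is given below, using scikit-learn's KMeans; the feature representation of a state-transition sample (a flattened vector) and the number of clusters k are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_period_samples(transitions, k=8, seed=0):
    """Cluster the state-transition samples collected in one time period with k-means.

    `transitions` is assumed to be an array of shape (num_samples, feature_dim),
    e.g. flattened (states, action, returns) vectors; k is an illustrative choice.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(transitions)
    # Return one index array per cluster, for later uniform per-cluster sampling.
    return [np.where(km.labels_ == c)[0] for c in range(k)]
```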
Based on the first aspect, further, when state transitions are sampled to update the function, the samples in each cluster are sampled uniformly.
In some embodiments of the present application, the total number of training steps is likewise divided evenly into time periods, and the samples within each time period are then clustered; k-means is chosen as the clustering method. The resulting sampling pool is shown in FIG. 3. Suppose that at sampling time the current period is p and that the number of clusters in each period is k. At each update of the neural network the samples in each cluster are sampled with a fixed probability, and the probability that a sample from the current period is sampled is 0.2.
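One way such a sampling rule might be realised is sketched below, with 20% of each minibatch drawn from the transitions of the current period and the remainder spread uniformly over the clusters of the earlier periods; the batch size and the even split across clusters are assumptions of this sketch.

```python
import numpy as np

def sample_batch(clustered_periods, current_period, batch_size=256, p_current=0.2, seed=0):
    """Draw a minibatch: a p_current share from the current period, the rest spread
    uniformly over the clusters of the earlier periods.

    clustered_periods: list over earlier periods, each a list of per-cluster index arrays;
    current_period: index array of transitions from the ongoing period.
    The batch size and the even split over clusters are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    n_current = int(round(p_current * batch_size))
    picks = [rng.choice(current_period, size=n_current)]
    clusters = [c for period in clustered_periods for c in period if len(c) > 0]
    per_cluster = (batch_size - n_current) // max(len(clusters), 1)
    for c in clusters:
        picks.append(rng.choice(c, size=per_cluster))
    return np.concatenate(picks)
```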
In some embodiments of the application, the algorithm flow for learning based on the framework is as follows:
np is the number of defined time periods, pt is the number of steps per time period, and the algorithm is as follows:
initializing neural network parameters
Initializing sampling space
Initializing exploring noise
Fore=1:np
Fort=1:pt
Selecting actions by policy
Adding exploratory noise for motion
Executing an action rewards rt and status
Storing actions, states, and rewards into a sampling space
Selecting samples from existing clusters in the sample space and state transitions generated by the current period
Updating neural networks by selected samples
Endfor
Clustering samples in a previous time period
Endfor
Outputting a policy model based on the neural network.
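Read as a Python skeleton, the loop above might look like the sketch below; the environment interface, the exploration-noise scale and the helper methods select_action, sample, update and cluster are placeholder assumptions standing in for the components described earlier, not an implementation given by the filing.

```python
import numpy as np

def train(env, agent, n_periods, steps_per_period, noise_std=0.1, seed=0):
    """Skeleton of the training flow: collect, store, sample, update, then cluster
    the samples of the period that just ended.  All interfaces are illustrative."""
    rng = np.random.default_rng(seed)
    buffer, clustered_periods = [], []
    state = env.reset()
    for e in range(n_periods):
        period_start = len(buffer)
        for t in range(steps_per_period):
            action = agent.select_action(state)                          # policy action
            action = action + rng.normal(0.0, noise_std, action.shape)   # exploration noise
            next_state, reward, done = env.step(action)
            buffer.append((state, action, reward, next_state))           # store transition
            batch = agent.sample(buffer, clustered_periods, period_start)
            agent.update(batch)                                          # update the networks
            state = env.reset() if done else next_state
        # Cluster the samples generated during the period that just ended.
        clustered_periods.append(agent.cluster(buffer[period_start:]))
    return agent.policy
```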
In some embodiments of the present application, the TD3 algorithm is modified with the framework proposed by the present application, resulting in a new algorithm TD3+. Experiments were performed in the virtual robot control environment MuJoCo, with the experimental tasks HalfCheetah, Walker2d and Hopper. FIG. 2 shows the agents in the HalfCheetah, Walker2d and Hopper environments. In the first two environments the agent must learn, through reinforcement learning, to walk as far as possible within a fixed number of steps, while in the last one the single-legged agent must be trained to hop as far as possible.
The comparison algorithms are DDPG, SAC and TD3. For all methods, each task runs for 2×10^6 time steps. The cumulative returns obtained by the different algorithms on the different tasks are shown in Table 1; the algorithm TD3+ implemented with the proposed framework outperforms the existing algorithms.
Table 1:
Running environment   TD3+       TD3        SAC       DDPG
HalfCheetah           13589.17   10032.66   9643.93   9453.22
Walker2d              6167.26    4471.43    4971.42   3804.91
Hopper                3812.30    3472.65    3531.77   3736.21
In some embodiments of the present application, ablation experiments were performed; the results in Table 2 compare the new method without clustering (TD3+woC), without the convolutional neural network (TD3+woS), and without the multi-step temporal-difference algorithm (TD3+woQ). The results show that each part of the application (the convolutional neural network, multi-step temporal-difference estimation and clustering) effectively improves the reinforcement learning performance.
Table 2:
Running environment   TD3+       TD3+woC    TD3+woS    TD3+woQ
HalfCheetah           13589.17   12824.48   12654.81   12051.56
Walker2d              6267.26    6056.13    5401.72    5737.12
Hopper                3812.30    3758.30    3713.23    3762.34
In a second aspect, an embodiment of the present application provides a continuous motion control reinforcement learning method based on the continuous motion control reinforcement learning framework according to any one of the first aspect, including the steps of:
adopting a convolutional neural network to learn multi-step state transition, and updating a learning strategy;
estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm;
different types of state transition samples are clustered such that each sample is sampled uniformly.
In the application, learning multi-step state transitions through the convolutional neural network in reinforcement learning for continuous control improves learning efficiency; estimating the expected cumulative return from multi-step returns on this basis makes the estimate more accurate; and clustering lets state transition samples of different types be sampled uniformly, so that the samples are used more fully.
The above is only a preferred embodiment of the present application, and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (7)

1. A continuous motion control reinforcement learning framework, characterized by comprising a multi-step state transition learning module, an expectation estimation module and a sample clustering module, wherein:
the multi-step state transition learning module is used for learning multi-step state transitions with a convolutional neural network and updating the policy;
the expectation estimation module is used for estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm;
and the sample clustering module is used for clustering state transition samples of different types so that every sample is sampled uniformly.
2. The continuous motion control reinforcement learning framework of claim 1, wherein the policy is α_t = π_{θ_c}(ρ_{t−n_p+1}, …, ρ_t), where α is the action, ρ is the state, π is the policy, θ_c is the parameter of the convolutional neural network, t is the current time and n_p is the number of state-transition steps.
3. The continuous motion control reinforcement learning framework of claim 1, wherein the expectation of the multi-step cumulative return is estimated by minimizing an objective function in which n_p is the number of state-transition steps, n_q is the number of return steps, B_n is the set of sampled multi-step state transitions, multi-step returns and actions, E is the expectation, Q is the function estimating the expectation of the cumulative return, and the remaining symbols are the parameters used to estimate Q.
4. The continuous motion control reinforcement learning framework of claim 3, characterized in that a function of the estimated cumulative-return expectation is used to update the policy.
5. The continuous motion control reinforcement learning framework of claim 1, wherein, when clustering is performed, the total number of training steps is divided evenly into different time periods and the samples in each time period are clustered; the clustering method is the k-means algorithm.
6. The continuous motion control reinforcement learning framework of claim 1, wherein, when state transitions are sampled to update the function, the samples in each cluster are sampled uniformly.
7. A continuous motion control reinforcement learning method based on the continuous motion control reinforcement learning framework according to any one of claims 1 to 6, characterized by comprising the following operations:
learning multi-step state transitions with a convolutional neural network and updating the policy;
estimating the expectation of the multi-step cumulative return with a multi-step temporal-difference algorithm;
different types of state transition samples are clustered such that each sample is sampled uniformly.
CN202310805443.7A 2023-07-03 2023-07-03 Continuous motion control reinforcement learning framework and learning method Pending CN116757272A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310805443.7A CN116757272A (en) 2023-07-03 2023-07-03 Continuous motion control reinforcement learning framework and learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310805443.7A CN116757272A (en) 2023-07-03 2023-07-03 Continuous motion control reinforcement learning framework and learning method

Publications (1)

Publication Number Publication Date
CN116757272A true CN116757272A (en) 2023-09-15

Family

ID=87956899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310805443.7A Pending CN116757272A (en) 2023-07-03 2023-07-03 Continuous motion control reinforcement learning framework and learning method

Country Status (1)

Country Link
CN (1) CN116757272A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
US20210397959A1 (en) * 2020-06-22 2021-12-23 Google Llc Training reinforcement learning agents to learn expert exploration behaviors from demonstrators
CN115293217A (en) * 2022-08-23 2022-11-04 南京邮电大学 Unsupervised pseudo tag optimization pedestrian re-identification method based on radio frequency signals
CN115439887A (en) * 2022-08-26 2022-12-06 三维通信股份有限公司 Pedestrian re-identification method and system based on pseudo label optimization and storage medium
CN116224794A (en) * 2023-03-03 2023-06-06 北京理工大学 Reinforced learning continuous action control method based on discrete-continuous heterogeneous Q network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MIN LI et al.: "Clustering experience replay for the effective exploitation in reinforcement learning", Elsevier, pages 1-9 *
黄天意 (HUANG Tianyi): "Research on Deep Reinforcement Learning Algorithms and Their Application in Unsupervised Denoising", CNKI Doctoral Electronic Journals, pages 3-4 *

Similar Documents

Publication Publication Date Title
CN107247961B (en) Track prediction method applying fuzzy track sequence
CN111966823B (en) Graph node classification method facing label noise
Jourdan et al. Using datamining techniques to help metaheuristics: A short survey
CN110232416A (en) Equipment failure prediction technique based on HSMM-SVM
CN106067034B (en) Power distribution network load curve clustering method based on high-dimensional matrix characteristic root
CN112766603B (en) Traffic flow prediction method, system, computer equipment and storage medium
CN109558898B (en) Multi-choice learning method with high confidence based on deep neural network
Osentoski et al. Learning hierarchical models of activity
CN112308161A (en) Particle swarm algorithm based on artificial intelligence semi-supervised clustering target
CN111913887B (en) Software behavior prediction method based on beta distribution and Bayesian estimation
CN116757272A (en) Continuous motion control reinforcement learning framework and learning method
CN117117850A (en) Short-term electricity load prediction method and system
CN112415337A (en) Power distribution network fault diagnosis method based on dynamic set coverage
CN108134687B (en) Gray model local area network peak flow prediction method based on Markov chain
CN115794405A (en) Dynamic resource allocation method of big data processing framework based on SSA-XGboost algorithm
Knowles et al. Message Passing Algorithms for the Dirichlet Diffusion Tree.
Yu et al. Autonomous knowledge-oriented clustering using decision-theoretic rough set theory
Jia An adaptive sampling algorithm for simulation-based optimization with descriptive complexity preference
Dlapa Cluster restarted DM: New algorithm for global optimisation
CN103646407B (en) A kind of video target tracking method based on composition distance relation figure
CN105160436A (en) N neighboring Lipschitz supporting surface-based generalized augmented group global optimization method
Moreno et al. Robust growing hierarchical self organizing map
Chen et al. Composite kernel based SVM for hierarchical multi-label gene function classification
He et al. A method to cloud computing resources requirement prediction on SaaS application
CN108960427A (en) A kind of manufacture cloud service optimum choice method of knowledge based guidance type genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination