CN110502721A - Continuity reinforcement learning system and method based on stochastic differential equations - Google Patents

Continuity reinforcement learning system and method based on stochastic differential equations

Info

Publication number
CN110502721A
CN110502721A (application CN201910712857.9A)
Authority
CN
China
Prior art keywords
value
action
apg
estimator
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910712857.9A
Other languages
Chinese (zh)
Other versions
CN110502721B (en)
Inventor
贾文川
程丽梅
陈添豪
孙翊
马书根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN201910712857.9A
Publication of CN110502721A
Application granted
Publication of CN110502721B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/13 Differential equations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks
    • G06F18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Operations Research (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a continuity reinforcement learning system and method based on stochastic differential equations. The system comprises an action policy generator APG, an environment state estimator ESE, a value estimator VE, a memory storage module MS and an external environment EE. The method proceeds as follows: initialize the action policy generator APG, the environment state estimator ESE and the value estimator VE; the action policy generator APG calculates and outputs the action increment Δa_k; the external environment EE outputs the next action value a_(k+1), the next environment state value s_(k+1) and the current-step reward value R_k, which are stored in the memory storage module MS; the environment state estimator ESE updates the environment state parameter set θ_p and predicts the future environment state estimate s'_k; the VE optimizer updates the Q-function network and predicts the future reward estimate R'_k; the APG optimizer updates the action value parameter set θ_v. Taking a stochastic differential equation as its basic model, the method achieves continuity of action control, keeps the variance of the training process controllable, and selects actions by predicting changes in the environment so as to interact with the environment more effectively.

Description

Continuity reinforcement learning system and method based on stochastic differential equations
Technical Field
The invention relates to the field of reinforcement learning and the field of stochastic processes, in particular to a reinforcement learning method for continuous systems.
Background
Deep reinforcement learning is an end-to-end learning paradigm that combines the perception capability of deep learning with the decision-making capability of reinforcement learning; it is highly general and realizes direct control from raw input to output. Reinforcement learning is an important unsupervised learning method: through interaction with the environment, an agent judges the current environment state by means of a value function and takes corresponding actions so as to obtain better rewards. At present, reinforcement learning algorithms mainly focus on discrete action sets, while classical continuous reinforcement learning methods such as DDPG and A3C are used for continuous action control in applications such as robot motion control and autonomous driving.
However, most current continuous reinforcement learning methods have theoretical shortcomings. For example, the noise introduced by DDPG ensures the continuity of the action but leaves the variance uncontrolled, while A3C under a Gaussian policy can control the variance but does not theoretically satisfy the continuity condition.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing reinforcement learning methods by establishing a reinforcement learning system and method that satisfy the continuity condition over any time interval and control the variance of the action output through a network.
To this end, the invention provides a reinforcement learning framework that theoretically guarantees the continuity of actions and allows the variance to be controlled during training, namely a continuity reinforcement learning system and method based on stochastic differential equations. Under this framework, the agent makes action selections by predicting changes in the environment. Unlike Markov control, the agent under this system no longer adapts passively to the environment but interacts with it more effectively in order to obtain the maximum reward.
The invention provides a reinforcement learning system based on stochastic differential equations, which comprises five main parts: (1) an environment state estimator ESE, (2) an action policy generator APG, (3) a value estimator VE, (4) a memory storage module MS, and (5) an external environment EE.
The action policy generator APG consists of an APG optimizer, an APG mean-variance network and an APG arithmetic unit, and calculates and outputs the action increment Δa_k in each single-step training process. The environment state estimator ESE comprises an ESE optimizer, which updates the environment state parameter set θ_p and predicts the future environment state estimate s'_k. The value estimator VE consists of a VE optimizer and a Q-function network, and predicts the future reward estimate R'_k. The memory storage module MS stores the current-step environment state value s_k and the current-step action value a_k, as well as the next environment state value s_(k+1), the next action value a_(k+1) and the current-step reward R_k obtained in each single-step training process. The external environment EE describes and measures actions and environment states, taking the form of simulation software or real physical systems.
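For illustration only, the following Python sketch outlines the five components as minimal interfaces; the class and method names (MemoryStorage, ActionPolicyGenerator.delta_action, EnvironmentStateEstimator.update_and_predict, ValueEstimator.update_and_predict) are assumptions made for this sketch and are not identifiers defined by the patent.

```python
from collections import deque

class MemoryStorage:
    """MS: rolling store for the (s_k, a_k, R_k, s_{k+1}, a_{k+1}) tuples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s_k, a_k, r_k, s_next, a_next):
        self.buffer.append((s_k, a_k, r_k, s_next, a_next))

    def latest(self):
        return self.buffer[-1]


class ActionPolicyGenerator:
    """APG: optimizer + mean-variance network + arithmetic unit."""
    def delta_action(self, s_k, a_k):
        raise NotImplementedError   # returns the action increment Δa_k

    def update(self, s_pred, theta_p, r_pred):
        raise NotImplementedError   # APG optimizer updates the action parameter set θ_v


class EnvironmentStateEstimator:
    """ESE: updates θ_p and predicts the future state estimate s'_k."""
    def update_and_predict(self, s_k, a_k):
        raise NotImplementedError   # returns (s_pred, theta_p)


class ValueEstimator:
    """VE: Q-function network + VE optimizer, predicts the future reward estimate R'_k."""
    def update_and_predict(self, s_next, a_next, r_k, s_pred, theta_p, a_k):
        raise NotImplementedError   # returns r_pred
```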
The invention provides a continuity reinforcement learning method based on stochastic differential equations, which comprises the following steps:
Step 1: initialize all parameters of the action policy generator APG, the environment state estimator ESE and the value estimator VE;
Step 2: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them to the action policy generator APG; the action policy generator APG calculates and outputs the increment Δa_k of the current-step action value a_k according to the APG mean-variance network; the external environment EE executes the action value a_k + Δa_k to obtain the next action value a_(k+1), the next environment state value s_(k+1) and the current-step reward value R_k; the obtained data (s_k, a_k, R_k, s_(k+1), a_(k+1)) are stored in the memory storage module MS;
Step 3: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them jointly to the environment state estimator ESE; according to the input (s_k, a_k), the ESE updates and outputs the environment state parameter set θ_p and predicts and outputs the future environment state estimate s'_k;
Step 4: take the next environment state value s_(k+1), the next action value a_(k+1) and the current-step reward value R_k out of the memory storage module MS and input them, together with the future environment state estimate s'_k and the updated environment state parameter set θ_p output by the environment state estimator ESE and the current-step action value a_k input to the action policy generator APG, jointly into the value estimator VE; according to the input (s_(k+1), a_(k+1), R_k, s'_k, θ_p, a_k), the VE optimizer updates the Q-function network, and the value estimator VE predicts and outputs the future reward estimate R'_k;
Step 5: the future environment state estimate s'_k and the environment state parameter set θ_p output by the environment state estimator ESE, together with the future reward estimate R'_k output by the value estimator VE, are input jointly to the action policy generator APG; according to the input (s'_k, θ_p, R'_k), the APG optimizer updates the action value parameter set θ_v;
Step 6: judge whether training has reached the termination condition; if so, the whole training process is finished; if not, add 1 to k and return to Step 2 to continue the next single-step training process.
Step 1 realizes the initialization of the training process; Step 2 realizes the action execution of the training process; and Steps 3, 4 and 5 jointly realize the parameter update of the training process.
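A minimal sketch of Steps 1 to 6 as a driver loop, assuming concrete implementations of the component interfaces sketched above and a hypothetical environment wrapper `env` exposing `reset_state_and_action()` and `step_with_increment()`; none of these method names are prescribed by the patent.

```python
def train(apg, ese, ve, ms, env, max_steps=1500):
    """Single-step training loop corresponding to Steps 1-6 (illustrative only)."""
    s_k, a_k = env.reset_state_and_action()                # Step 1: parameters of apg/ese/ve
                                                           # are assumed already initialized
    for k in range(max_steps):
        # Step 2: APG outputs the increment Δa_k; EE executes a_k + Δa_k
        delta_a = apg.delta_action(s_k, a_k)
        s_next, a_next, r_k = env.step_with_increment(a_k + delta_a)
        ms.store(s_k, a_k, r_k, s_next, a_next)

        # Step 3: ESE updates θ_p and predicts the future state estimate s'_k
        s_pred, theta_p = ese.update_and_predict(s_k, a_k)

        # Step 4: VE updates the Q-function network and predicts R'_k
        r_pred = ve.update_and_predict(s_next, a_next, r_k, s_pred, theta_p, a_k)

        # Step 5: APG optimizer updates the action value parameter set θ_v
        apg.update(s_pred, theta_p, r_pred)

        # Step 6: advance to the next single-step training process
        s_k, a_k = s_next, a_next
```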
In the continuity reinforcement learning method based on stochastic differential equations provided by the invention, the incremental description of the environment state and the action output takes the form

    ds_t = f(s_t, a_t) dt + g(s_t, a_t) dB_t,
    da_t = μ(s_t, a_t) dt + σ(s_t, a_t) dB̃_t,

where f: R^(n+m) → R^n and g: R^(n+m) → R^(n×n) are environment change functions (specific networks or equations in the actual model) parameterized by the environment state parameter set θ_p, μ and σ are the drift and diffusion of the action output parameterized by the action parameter set θ_v, s_t is the environment state value at time t, a_t is the action value at time t, (s_t, a_t) is the array of environment state value and action value, B_t is an n-dimensional Brownian motion, B̃_t is an m-dimensional Brownian motion, and the components of B_t and B̃_t are mutually independent.
Writing Y_t = (s_t, a_t) ∈ R^(n+m), the continuity reinforcement learning method based on stochastic differential equations is governed by the diffusion equation

    dY_t = f̄(Y_t) dt + ḡ(Y_t) dB̄_t,

where B̄_t = (B_t, B̃_t) is an (n+m)-dimensional Brownian motion, and f̄ and ḡ, obtained by stacking the state and action drift and diffusion terms above, constitute the basic model of the continuity reinforcement learning method based on stochastic differential equations.
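The increments above can be simulated with a standard Euler-Maruyama discretization. The sketch below is only a numerical illustration: the linear drift and diffusion callables stand in for the learned networks, and the function names f, g, mu, sigma as well as the step size are placeholders rather than the patent's model.

```python
import numpy as np

def euler_maruyama_step(s, a, f, g, mu, sigma, dt, rng):
    """One Euler-Maruyama step for the coupled state/action increments:
    ds = f dt + g dB (environment, θ_p side), da = mu dt + sigma dB~ (action, θ_v side)."""
    n, m = s.shape[0], a.shape[0]
    dB = rng.normal(0.0, np.sqrt(dt), size=n)    # n-dimensional Brownian increment
    dBt = rng.normal(0.0, np.sqrt(dt), size=m)   # independent m-dimensional increment
    ds = f(s, a) * dt + g(s, a) @ dB
    da = mu(s, a) * dt + sigma(s, a) @ dBt       # Δa: the action increment
    return s + ds, a + da

# toy usage with placeholder linear dynamics
rng = np.random.default_rng(0)
f = lambda s, a: -0.1 * s
g = lambda s, a: 0.05 * np.eye(s.size)
mu = lambda s, a: -0.2 * a
sigma = lambda s, a: 0.1 * np.eye(a.size)
s, a = np.ones(3), np.zeros(2)
for _ in range(10):
    s, a = euler_maruyama_step(s, a, f, g, mu, sigma, dt=0.05, rng=rng)
```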
the APG mean variance network, the Q function network, the network of the environment change function f and the network of the environment change function g in the continuity reinforcement learning method based on the stochastic differential equation all use a rectification linear unit ReLU as a network model.
The VE optimizer of the value estimator VE predicts the future reward estimate R'_k through a Q function. The main objective function of the VE optimizer is J_Q(θ), expressed in terms of the expectation conditional on (s_k, a_k); the Q value in the current state is solved and updated on the basis of the objective function J_Q(θ).
The environment state parameter set θ_p in the environment state estimator ESE is determined by the law of change of the environment state, which defines the objective function of the ESE optimizer. The ESE optimizer also uses an additional objective function, which evaluates the accuracy with which the environment state estimator ESE estimates future changes of the environment state.
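The ESE objective and its additional objective are not reproduced in this text. As a stand-in only, the sketch below shows a generic one-step prediction loss measuring how accurately a θ_p-parameterized predictor (a hypothetical PyTorch module `ese_net`) anticipates the observed next state; it is not the patent's formula.

```python
import torch

def ese_prediction_loss(s_pred, s_next):
    """Mean squared error between the ESE estimate s'_k and the observed s_{k+1}."""
    return torch.mean((s_pred - s_next) ** 2)

# hypothetical usage with a predictor module `ese_net` and optimizer `opt`:
#   s_pred = ese_net(s_k, a_k)
#   loss = ese_prediction_loss(s_pred, s_next)
#   opt.zero_grad(); loss.backward(); opt.step()
```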
The action policy generator APG updates the parameters of the APG mean-variance network using a policy gradient method, according to the objective function of the APG optimizer.
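The APG objective itself is likewise not reproduced here; the sketch below is a generic Gaussian policy-gradient surrogate consistent with updating a mean-variance network, with the value weight supplied by the VE estimate R'_k. The function and argument names are illustrative assumptions.

```python
import torch

def apg_policy_gradient_loss(mean, var, delta_a, value_estimate):
    """Negative log-likelihood of the sampled increment Δa_k under the Gaussian
    defined by the APG mean-variance network, weighted by the value estimate."""
    dist = torch.distributions.Normal(mean, var.sqrt())
    log_prob = dist.log_prob(delta_a).sum(dim=-1)
    return -(log_prob * value_estimate.detach()).mean()
```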
in summary, compared with the existing classical reinforcement learning method, the reinforcement learning method based on the stochastic differential equation provided by the invention takes the action stochastic differential equation as the core of the basic model, can solve the two problems of continuity and uncertainty control caused by variance control, which are difficult to be considered in the traditional reinforcement learning method, and can avoid the defect of markov control in the classical reinforcement learning.
The invention has the advantages and positive effects that:
(1) The continuity reinforcement learning method provided by the invention does not depend on the change of the current environment state value alone, but on the environment state value and the action value at the previous moment, so the method adapts to the real-time environment and achieves a predictive estimation effect.
(2) The method provided by the invention is based on increments rather than on a Markov control process. On the one hand, this avoids blind adaptation of the system to the environment, since the agent adjusts its own state by observing the environment; on the other hand, it reduces the influence of the delay between sensor and actuator in the control process, so that the agent's behaviour is smoother and more accurate.
(3) The continuity reinforcement learning method based on stochastic differential equations strictly guarantees the continuity of actions, can be applied to the control of continuous actions, and guarantees the existence and uniqueness of the value estimation network. Since real physical systems and real actions are mainly continuous systems and continuous actions, the invention, compared with other existing reinforcement learning methods, theoretically advances the application of reinforcement learning to real control.
Drawings
FIG. 1 is a training flow chart of the continuity reinforcement learning method based on stochastic differential equations according to the present invention;
FIG. 2 is a single-step execution process diagram of the continuity reinforcement learning method based on stochastic differential equations;
FIG. 3 is a training update process diagram of the continuity reinforcement learning method based on stochastic differential equations;
FIG. 4 is a computational example diagram of the continuity reinforcement learning method based on stochastic differential equations according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the continuity reinforcement learning system and method based on stochastic differential equations of the present invention are further described below with reference to the accompanying drawings and embodiments.
The invention provides a system and method for continuity reinforcement learning based on stochastic differential equations, suitable for continuous control applications.
As shown in FIG. 1, the continuity reinforcement learning method based on stochastic differential equations provided by the present invention comprises the following steps:
Step 1: initialize all parameters of the action policy generator APG, the environment state estimator ESE, the value estimator VE, the memory storage module MS and the external environment EE contained in the learning method.
The initialization of parameters is illustrated with the Pendulum-v0 inverted-pendulum control experiment in OpenAI Gym. In the Pendulum-v0 experiment, the pendulum starts at a random position, and the control objective is to swing it up and keep it upright. In the experiment, the time interval Δt is set to 0.05 and the discount factor γ to 0.6; the objective function of the action policy generator APG is used to update the APG network, thereby simulating the control process of this classical inverted-pendulum control model.
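A minimal setup sketch for this experiment, assuming the classic OpenAI Gym API; on newer gym/gymnasium releases the same swing-up task is registered as "Pendulum-v1", and the reset/step return signatures differ slightly.

```python
import gym

DT = 0.05     # time interval Δt used in the experiment
GAMMA = 0.6   # discount factor γ

env = gym.make("Pendulum-v0")          # "Pendulum-v1" on newer gym versions
obs = env.reset()                      # pendulum starts at a random position
print(env.observation_space.shape)     # (3,): cos(theta), sin(theta), angular velocity
print(env.action_space.shape)          # (1,): torque applied to the joint
```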
Step 2: the current-step environment state value s_k and the current-step action value a_k are input to the action policy generator APG, which calculates and outputs the increment Δa_k of the current-step action value a_k; the external environment EE executes the action value a_k + Δa_k to obtain the next action value a_(k+1), the next environment state value s_(k+1) and the current-step reward value R_k; the obtained data (s_k, a_k, R_k, s_(k+1), a_(k+1)) are stored in the memory storage module MS.
FIG. 2 shows the single-step execution process of the continuity reinforcement learning method based on stochastic differential equations, illustrating a single training execution of the method of the present invention. During execution, the action policy generator APG does the main work while the value estimator VE and the environment state estimator ESE are dormant: the action policy generator APG generates the increment of the action for the next moment, and in each execution step the (s_(k+1), a_(k+1), R_k) generated after the external environment EE executes the current-step action value a_k are stored in the memory storage module for the subsequent update process.
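A sketch of this execution phase, assuming a Gym-style environment as the external environment EE and the MemoryStorage sketch given earlier as `ms`; treating the executed command a_k + Δa_k as the next action value a_(k+1) is an assumption of the illustration, not a statement of the patent.

```python
import numpy as np

def execute_step(env, s_k, a_k, delta_a, ms):
    """Apply a_k + Δa_k in the external environment EE, observe (s_{k+1}, R_k),
    and store the transition in the memory storage module `ms`."""
    a_cmd = np.clip(a_k + delta_a, env.action_space.low, env.action_space.high)
    s_next, r_k, done, info = env.step(a_cmd)     # classic gym step signature
    a_next = a_cmd                                # assumption: executed command = a_{k+1}
    ms.store(s_k, a_k, r_k, s_next, a_next)
    return s_next, a_next, r_k, done
```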
The action output and the environment state of the whole continuity reinforcement learning system are described by control-change increments; the incremental description of the environment state and the action output takes the form

    ds_t = f(s_t, a_t) dt + g(s_t, a_t) dB_t,
    da_t = μ(s_t, a_t) dt + σ(s_t, a_t) dB̃_t,

where f: R^(n+m) → R^n and g: R^(n+m) → R^(n×n) are environment change functions (specific networks or equations in the actual model) parameterized by the environment state parameter set θ_p, μ and σ are the drift and diffusion of the action output parameterized by the action parameter set θ_v, s_t is the environment state value at time t, a_t is the action value at time t, (s_t, a_t) is the array of environment state value and action value, B_t is an n-dimensional Brownian motion, B̃_t is an m-dimensional Brownian motion, and the components of B_t and B̃_t are mutually independent.
Writing Y_t = (s_t, a_t) ∈ R^(n+m), the whole method is governed by the diffusion equation

    dY_t = f̄(Y_t) dt + ḡ(Y_t) dB̄_t,

where B̄_t = (B_t, B̃_t) is an (n+m)-dimensional Brownian motion; f̄ and ḡ, obtained by stacking the state and action drift and diffusion terms, form the basic model of the method.
the APG mean variance network, the Q function network, the environment change function f and the environment change function g network all use a rectification linear unit ReLU as a network model. The rectifying linear unit ReLU rarely has points that cannot be differentiated, thus guaranteeing the continuity of the method.
FIG. 3 shows the training update process of the continuity reinforcement learning method based on stochastic differential equations. The training update of the method is divided into three parts: the parameter update of the environment state estimator ESE, the parameter update of the value estimator VE and the parameter update of the action policy generator APG. The specific training update process is as follows:
Step 3: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them jointly to the environment state estimator ESE; according to the input (s_k, a_k), the ESE updates and outputs the environment state parameter set θ_p and predicts and outputs the future environment state estimate s'_k.
The environment state parameter set θ_p in the environment state estimator ESE is determined by the law of change of the environment state, which defines the objective function of the ESE optimizer. The ESE optimizer also uses an additional objective function, which evaluates the accuracy with which the environment state estimator ESE estimates future changes of the environment state.
Step 4: take the next environment state value s_(k+1), the next action value a_(k+1) and the current-step reward value R_k out of the memory storage module MS and input them, together with the future environment state estimate s'_k and the updated environment state parameter set θ_p output by the environment state estimator ESE and the current-step action value a_k input to the action policy generator APG, jointly into the value estimator VE; according to the input (s_(k+1), a_(k+1), R_k, s'_k, θ_p, a_k), the VE optimizer updates the Q-function network, and the value estimator VE predicts and outputs the future reward estimate R'_k.
The VE optimizer of the value estimator VE predicts the future reward estimate R'_k through a Q function. The main objective function of the VE optimizer is J_Q(θ), expressed in terms of the expectation conditional on (s_k, a_k); the Q value in the current state is solved and updated on the basis of the objective function J_Q(θ).
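The concrete form of J_Q(θ) is not reproduced in this text. As a stand-in, the sketch below shows a generic one-step Bellman-residual loss, with `q_net` a hypothetical Q-function module taking (state, action) batches and γ matching the discount factor used in the Pendulum experiment; this is an assumption, not the patent's objective.

```python
import torch

def q_bellman_residual(q_net, s_k, a_k, r_k, s_next, a_next, gamma=0.6):
    """Squared one-step Bellman residual for the VE Q-function network."""
    with torch.no_grad():
        target = r_k + gamma * q_net(s_next, a_next)   # bootstrapped target
    return torch.mean((q_net(s_k, a_k) - target) ** 2)
```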
Step 5: the future environment state estimate s'_k and the environment state parameter set θ_p output by the environment state estimator ESE, together with the future reward estimate R'_k output by the value estimator VE, are input jointly to the action policy generator APG; according to the input (s'_k, θ_p, R'_k), the APG optimizer updates the action value parameter set θ_v.
The action policy generator APG comprises an APG optimizer, a mean-variance network and an arithmetic unit. The APG updates the parameters of the APG mean-variance network using a policy gradient method, according to the objective function of the APG optimizer.
Step 6: judge whether training has reached the termination condition; if so, the whole training process is finished; if not, add 1 to k and return to Step 2 to continue the next single-step training process. The termination condition may be a preset number of training steps or may be set according to the training target.
FIG. 4 shows a computational example of the continuity reinforcement learning method based on stochastic differential equations. The three curves in the figure are the results of the Pendulum-v0 inverted-pendulum control training experiment carried out in the OpenAI Gym simulation environment by the method of the present invention and by two other reinforcement learning methods, DDPG and A3C. The training termination condition is set to 1500 steps; in each of the 1500 steps, the system and method of the invention run Steps 2 to 6 to carry out the execution process and the training update process. As can be seen from FIG. 4, the system and method of the invention, built on stochastic differential equation theory distinct from the other methods, are well suited to control training for continuous systems.
The above-mentioned embodiments are intended to illustrate the objects and technical solutions of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A continuity reinforcement learning system based on stochastic differential equations, characterized in that:
the system comprises an action policy generator APG, an environment state estimator ESE, a value estimator VE, a memory storage module MS and an external environment EE;
the APG consists of an APG optimizer, an APG mean variance network and an APG arithmetic unit and is used for training each single step according to the environmental state value s of the current stepkAnd the current step action value akCalculating and outputting the motion increment delta ak
the environment state estimator ESE comprises an ESE optimizer, and is used for updating the environment state parameter set θ_p and predicting the future environment state estimate s'_k from the current-step environment state value s_k and the current-step action value a_k;
the value estimator VE consists of a VE optimizer and a Q-function network; the VE optimizer updates the Q-function network according to the input (s_(k+1), a_(k+1), R_k, s'_k, θ_p, a_k), and the value estimator VE is used for predicting the future reward estimate R'_k, where s_(k+1) is the next environment state value and a_(k+1) is the next action value;
the memory storage module MS is used for storing the current step environment state value skCurrent step action value akAnd storing the next environmental state value s obtained in each single-step training processk+1The next action value ak+1And the current step reward Rk
the external environment EE is used to describe and measure actions and environment states, taking the form of simulation software or real physical systems.
2. A continuity reinforcement learning method based on stochastic differential equations, using the system according to claim 1, characterized in that the training process of the method comprises the following steps:
Step 1: initialize the action policy generator APG, the environment state estimator ESE and the value estimator VE;
Step 2: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them to the action policy generator APG; the action policy generator APG calculates and outputs the increment Δa_k of the current-step action value a_k according to the APG mean-variance network; the external environment EE executes the action value a_k + Δa_k to obtain the next action value a_(k+1), the next environment state value s_(k+1) and the current-step reward value R_k; the obtained data (s_k, a_k, R_k, s_(k+1), a_(k+1)) are stored in the memory storage module MS;
Step 3: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them jointly to the environment state estimator ESE; according to the input (s_k, a_k), the ESE updates and outputs the environment state parameter set θ_p and predicts and outputs the future environment state estimate s'_k;
Step 4: take the next environment state value s_(k+1), the next action value a_(k+1) and the current-step reward value R_k out of the memory storage module MS and input them, together with the future environment state estimate s'_k and the updated environment state parameter set θ_p output by the environment state estimator ESE and the current-step action value a_k input to the action policy generator APG, jointly into the value estimator VE; according to the input (s_(k+1), a_(k+1), R_k, s'_k, θ_p, a_k), the VE optimizer updates the Q-function network, and the value estimator VE predicts and outputs the future reward estimate R'_k;
Step 5: the future environment state estimate s'_k and the environment state parameter set θ_p output by the environment state estimator ESE, together with the future reward estimate R'_k output by the value estimator VE, are input jointly to the action policy generator APG; according to the input (s'_k, θ_p, R'_k), the APG optimizer updates the action value parameter set θ_v;
Step 6: judge whether training has reached the termination condition; if so, the whole training process is finished; if not, add 1 to k and return to Step 2 to continue the next single-step training process.
3. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that Step 1 realizes the initialization of the training process, Step 2 realizes the action execution of the training process, and Steps 3, 4 and 5 jointly realize the parameter update of the training process.
4. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that the incremental description of the environment state and the action output takes the form

    ds_t = f(s_t, a_t) dt + g(s_t, a_t) dB_t,
    da_t = μ(s_t, a_t) dt + σ(s_t, a_t) dB̃_t,

where f: R^(n+m) → R^n and g: R^(n+m) → R^(n×n) are environment change functions (specific networks or equations in the actual model) parameterized by the environment state parameter set θ_p, μ and σ are the drift and diffusion of the action output parameterized by the action parameter set θ_v, s_t is the environment state value at time t, a_t is the action value at time t, (s_t, a_t) is the array of environment state value and action value, B_t is an n-dimensional Brownian motion, B̃_t is an m-dimensional Brownian motion, and the components of B_t and B̃_t are mutually independent; writing Y_t = (s_t, a_t) ∈ R^(n+m), the continuity reinforcement learning method based on stochastic differential equations is governed by the diffusion equation

    dY_t = f̄(Y_t) dt + ḡ(Y_t) dB̄_t,

where B̄_t = (B_t, B̃_t) is an (n+m)-dimensional Brownian motion, and f̄ and ḡ constitute the basic model of the continuity reinforcement learning method based on stochastic differential equations.
5. The method as claimed in claim 4, characterized in that the APG mean-variance network, the Q-function network, the network of the environment change function f and the network of the environment change function g all use the rectified linear unit ReLU as the network model.
6. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that the VE optimizer of the value estimator VE predicts the future reward estimate R'_k through a Q function; the main objective function of the VE optimizer is J_Q(θ), expressed in terms of the expectation conditional on (s_k, a_k), and the Q value in the current state is solved and updated on the basis of the objective function J_Q(θ).
7. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that the environment state parameter set θ_p in the environment state estimator ESE is determined by the law of change of the environment state, which defines the objective function of the ESE optimizer; the ESE optimizer further uses an additional objective function, which evaluates the accuracy with which the environment state estimator ESE estimates future changes of the environment state.
8. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that the action policy generator APG updates the parameters of the APG mean-variance network using a policy gradient method, according to the objective function of the APG optimizer.
9. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 4, claim 6 or claim 8, characterized in that the basic model consists of the drift f̄ and the diffusion ḡ of the diffusion equation dY_t = f̄(Y_t) dt + ḡ(Y_t) dB̄_t.
CN201910712857.9A 2019-08-02 2019-08-02 Continuity reinforcement learning system and method based on random differential equation Active CN110502721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910712857.9A CN110502721B (en) 2019-08-02 2019-08-02 Continuity reinforcement learning system and method based on random differential equation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910712857.9A CN110502721B (en) 2019-08-02 2019-08-02 Continuity reinforcement learning system and method based on random differential equation

Publications (2)

Publication Number Publication Date
CN110502721A true CN110502721A (en) 2019-11-26
CN110502721B CN110502721B (en) 2021-04-06

Family

ID=68587805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910712857.9A Active CN110502721B (en) 2019-08-02 2019-08-02 Continuity reinforcement learning system and method based on random differential equation

Country Status (1)

Country Link
CN (1) CN110502721B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523029A (en) * 2018-09-28 2019-03-26 清华大学深圳研究生院 For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN110032186A (en) * 2019-03-27 2019-07-19 上海大学 A kind of labyrinth feature identification of anthropomorphic robot and traveling method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523029A (en) * 2018-09-28 2019-03-26 清华大学深圳研究生院 For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN110032186A (en) * 2019-03-27 2019-07-19 上海大学 A kind of labyrinth feature identification of anthropomorphic robot and traveling method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JINWUK SEOK et al.: "The analysis of a stochastic differential approach for Langevin competitive learning algorithm", Object Recognition Supported by User Interaction for Service Robots *
TIMOTHY P. LILLICRAP et al.: "Continuous control with deep reinforcement learning", arXiv:1509.02971 [cs.LG] *
金海东 et al.: "An integrated stochastic gradient descent Q-learning method with an adaptive learning rate", Chinese Journal of Computers *

Also Published As

Publication number Publication date
CN110502721B (en) 2021-04-06

Similar Documents

Publication Publication Date Title
JP6926203B2 (en) Reinforcement learning with auxiliary tasks
JP6827539B2 (en) Training action selection neural networks
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN110223517A (en) Short-term traffic flow forecast method based on temporal correlation
CN112119404A (en) Sample efficient reinforcement learning
US20210103815A1 (en) Domain adaptation for robotic control using self-supervised learning
JP2019537136A (en) Environmental prediction using reinforcement learning
Narendra et al. Fast Reinforcement Learning using multiple models
CN111433689B (en) Generation of control systems for target systems
Truong et al. Design of an advanced time delay measurement and a smart adaptive unequal interval grey predictor for real-time nonlinear control systems
WO2022156182A1 (en) Methods and apparatuses for constructing vehicle dynamics model and for predicting vehicle state information
CN108594803B (en) Path planning method based on Q-learning algorithm
CN112154458A (en) Reinforcement learning using proxy courses
CN113419424B (en) Modeling reinforcement learning robot control method and system for reducing overestimation
US20240095495A1 (en) Attention neural networks with short-term memory units
CN112571420B (en) Dual-function model prediction control method under unknown parameters
WO2021225923A1 (en) Generating robot trajectories using neural networks
CN113614743A (en) Method and apparatus for operating a robot
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
Caarls et al. Parallel online temporal difference learning for motor control
Inga et al. Online inverse linear-quadratic differential games applied to human behavior identification in shared control
US20230120256A1 (en) Training an artificial neural network, artificial neural network, use, computer program, storage medium and device
CN114529010A (en) Robot autonomous learning method, device, equipment and storage medium
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
CN110502721B (en) Continuity reinforcement learning system and method based on random differential equation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant