CN110502721A - Continuity reinforcement learning system and method based on stochastic differential equations - Google Patents

Continuity reinforcement learning system and method based on stochastic differential equations

Info

Publication number
CN110502721A
CN110502721A (application CN201910712857.9A)
Authority
CN
China
Prior art keywords
value
action
apg
estimator
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910712857.9A
Other languages
Chinese (zh)
Other versions
CN110502721B (en)
Inventor
贾文川
程丽梅
陈添豪
孙翊
马书根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN201910712857.9A
Publication of CN110502721A
Application granted
Publication of CN110502721B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/13 Differential equations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/29 Graphical models, e.g. Bayesian networks
    • G06F18/295 Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Operations Research (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a continuity reinforcement learning system and method based on stochastic differential equations. The system comprises an action policy generator APG, an environment state estimator ESE, a value estimator VE, a memory storage module MS and an external environment EE. The method proceeds as follows: initialize the action policy generator APG, the environment state estimator ESE and the value estimator VE; the action policy generator APG calculates and outputs the action increment Δa_k; the external environment EE outputs the next action value a_(k+1), the next environment state value s_(k+1) and the current-step reward value R_k, which are stored in the memory storage module MS; the environment state estimator ESE updates the environment state parameter set θ_p and predicts the future environment state estimate s'_k; the VE optimizer updates the Q-function network and predicts the future reward estimate R'_k; the APG optimizer updates the action value parameter set θ_v. Taking a stochastic differential equation as its basic model, the method achieves continuity of action control, keeps the variance of the training process controllable, and selects actions by predicting changes in the environment so as to interact with the environment more effectively.

Description

Continuity reinforcement learning system and method based on stochastic differential equations
Technical Field
The invention relates to the field of reinforcement learning and the field of stochastic processes, in particular to a reinforcement learning method for continuous systems.
Background
Deep reinforcement learning is an end-to-end learning paradigm that combines the perception capability of deep learning with the decision-making capability of reinforcement learning; it is highly general and realizes direct control from raw input to output. Reinforcement learning is an important unsupervised learning method: through interaction with the environment, an agent judges the current environment state by means of a value function and takes corresponding actions so as to obtain better rewards. At present, reinforcement learning algorithms mainly focus on discrete action sets, while classical continuous reinforcement learning methods such as DDPG and A3C are used for continuous action control in applications such as robot motion control and autonomous driving.
However, most current continuous reinforcement learning methods have theoretical shortcomings. For example, the noise introduced by DDPG ensures the continuity of the action but leaves the variance uncontrolled, while A3C under a Gaussian policy can control the variance but does not theoretically satisfy the continuity condition.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing reinforcement learning methods by establishing a reinforcement learning system and method that satisfy the continuity condition over any time interval and control the variance of the action output through a network.
To this end, the invention provides a reinforcement learning framework that theoretically guarantees the continuity of actions and allows the variance to be controlled during training, namely a continuity reinforcement learning system and method based on stochastic differential equations. Under this framework, the agent makes action selections by predicting changes in the environment. Unlike Markov control, the agent under this system no longer adapts passively to the environment but interacts with it more effectively in order to obtain the maximum reward.
The invention provides a reinforcement learning system based on stochastic differential equations, which comprises five main parts: (1) an environment state estimator ESE, (2) an action policy generator APG, (3) a value estimator VE, (4) a memory storage module MS, and (5) an external environment EE.
The action policy generator APG consists of an APG optimizer, an APG mean-variance network and an APG arithmetic unit, and calculates and outputs the action increment Δa_k in each single-step training process. The environment state estimator ESE comprises an ESE optimizer, which updates the environment state parameter set θ_p and predicts the future environment state estimate s'_k. The value estimator VE consists of a VE optimizer and a Q-function network, and predicts the future reward estimate R'_k. The memory storage module MS stores the current-step environment state value s_k and the current-step action value a_k, as well as the next environment state value s_(k+1), the next action value a_(k+1) and the current-step reward R_k obtained in each single-step training process. The external environment EE describes and measures actions and environment states, taking the form of simulation software or real physical systems.
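For illustration only, the following Python sketch outlines the five components as minimal interfaces; the class and method names (MemoryStorage, ActionPolicyGenerator.delta_action, EnvironmentStateEstimator.update_and_predict, ValueEstimator.update_and_predict) are assumptions made for this sketch and are not identifiers defined by the patent.

```python
from collections import deque

class MemoryStorage:
    """MS: rolling store for the (s_k, a_k, R_k, s_{k+1}, a_{k+1}) tuples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s_k, a_k, r_k, s_next, a_next):
        self.buffer.append((s_k, a_k, r_k, s_next, a_next))

    def latest(self):
        return self.buffer[-1]


class ActionPolicyGenerator:
    """APG: optimizer + mean-variance network + arithmetic unit."""
    def delta_action(self, s_k, a_k):
        raise NotImplementedError   # returns the action increment Δa_k

    def update(self, s_pred, theta_p, r_pred):
        raise NotImplementedError   # APG optimizer updates the action parameter set θ_v


class EnvironmentStateEstimator:
    """ESE: updates θ_p and predicts the future state estimate s'_k."""
    def update_and_predict(self, s_k, a_k):
        raise NotImplementedError   # returns (s_pred, theta_p)


class ValueEstimator:
    """VE: Q-function network + VE optimizer, predicts the future reward estimate R'_k."""
    def update_and_predict(self, s_next, a_next, r_k, s_pred, theta_p, a_k):
        raise NotImplementedError   # returns r_pred
```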
The invention provides a continuity reinforcement learning method based on stochastic differential equations, which comprises the following steps:
Step 1: initialize all parameters of the action policy generator APG, the environment state estimator ESE and the value estimator VE;
Step 2: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them to the action policy generator APG; the action policy generator APG calculates and outputs the increment Δa_k of the current-step action value a_k according to the APG mean-variance network; the external environment EE executes the action value a_k + Δa_k to obtain the next action value a_(k+1), the next environment state value s_(k+1) and the current-step reward value R_k; the obtained data (s_k, a_k, R_k, s_(k+1), a_(k+1)) are stored in the memory storage module MS;
Step 3: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them jointly to the environment state estimator ESE; according to the input (s_k, a_k), the ESE updates and outputs the environment state parameter set θ_p and predicts and outputs the future environment state estimate s'_k;
Step 4: take the next environment state value s_(k+1), the next action value a_(k+1) and the current-step reward value R_k out of the memory storage module MS and input them, together with the future environment state estimate s'_k and the updated environment state parameter set θ_p output by the environment state estimator ESE and the current-step action value a_k input to the action policy generator APG, jointly into the value estimator VE; according to the input (s_(k+1), a_(k+1), R_k, s'_k, θ_p, a_k), the VE optimizer updates the Q-function network, and the value estimator VE predicts and outputs the future reward estimate R'_k;
Step 5: the future environment state estimate s'_k and the environment state parameter set θ_p output by the environment state estimator ESE, together with the future reward estimate R'_k output by the value estimator VE, are input jointly to the action policy generator APG; according to the input (s'_k, θ_p, R'_k), the APG optimizer updates the action value parameter set θ_v;
Step 6: judge whether training has reached the termination condition; if so, the whole training process is finished; if not, add 1 to k and return to Step 2 to continue the next single-step training process.
Step 1 realizes the initialization of the training process; Step 2 realizes the action execution of the training process; and Steps 3, 4 and 5 jointly realize the parameter update of the training process.
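A minimal sketch of Steps 1 to 6 as a driver loop, assuming concrete implementations of the component interfaces sketched above and a hypothetical environment wrapper `env` exposing `reset_state_and_action()` and `step_with_increment()`; none of these method names are prescribed by the patent.

```python
def train(apg, ese, ve, ms, env, max_steps=1500):
    """Single-step training loop corresponding to Steps 1-6 (illustrative only)."""
    s_k, a_k = env.reset_state_and_action()                # Step 1: parameters of apg/ese/ve
                                                           # are assumed already initialized
    for k in range(max_steps):
        # Step 2: APG outputs the increment Δa_k; EE executes a_k + Δa_k
        delta_a = apg.delta_action(s_k, a_k)
        s_next, a_next, r_k = env.step_with_increment(a_k + delta_a)
        ms.store(s_k, a_k, r_k, s_next, a_next)

        # Step 3: ESE updates θ_p and predicts the future state estimate s'_k
        s_pred, theta_p = ese.update_and_predict(s_k, a_k)

        # Step 4: VE updates the Q-function network and predicts R'_k
        r_pred = ve.update_and_predict(s_next, a_next, r_k, s_pred, theta_p, a_k)

        # Step 5: APG optimizer updates the action value parameter set θ_v
        apg.update(s_pred, theta_p, r_pred)

        # Step 6: advance to the next single-step training process
        s_k, a_k = s_next, a_next
```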
In the continuity reinforcement learning method based on stochastic differential equations provided by the invention, the incremental description of the environment state and the action output takes the form

    ds_t = f(s_t, a_t) dt + g(s_t, a_t) dB_t,
    da_t = μ(s_t, a_t) dt + σ(s_t, a_t) dB̃_t,

where f: R^(n+m) → R^n and g: R^(n+m) → R^(n×n) are environment change functions (specific networks or equations in the actual model) parameterized by the environment state parameter set θ_p, μ and σ are the drift and diffusion of the action output parameterized by the action parameter set θ_v, s_t is the environment state value at time t, a_t is the action value at time t, (s_t, a_t) is the array of environment state value and action value, B_t is an n-dimensional Brownian motion, B̃_t is an m-dimensional Brownian motion, and the components of B_t and B̃_t are mutually independent.
Writing Y_t = (s_t, a_t) ∈ R^(n+m), the continuity reinforcement learning method based on stochastic differential equations is governed by the diffusion equation

    dY_t = f̄(Y_t) dt + ḡ(Y_t) dB̄_t,

where B̄_t = (B_t, B̃_t) is an (n+m)-dimensional Brownian motion, and f̄ and ḡ, obtained by stacking the state and action drift and diffusion terms above, constitute the basic model of the continuity reinforcement learning method based on stochastic differential equations.
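The increments above can be simulated with a standard Euler-Maruyama discretization. The sketch below is only a numerical illustration: the linear drift and diffusion callables stand in for the learned networks, and the function names f, g, mu, sigma as well as the step size are placeholders rather than the patent's model.

```python
import numpy as np

def euler_maruyama_step(s, a, f, g, mu, sigma, dt, rng):
    """One Euler-Maruyama step for the coupled state/action increments:
    ds = f dt + g dB (environment, θ_p side), da = mu dt + sigma dB~ (action, θ_v side)."""
    n, m = s.shape[0], a.shape[0]
    dB = rng.normal(0.0, np.sqrt(dt), size=n)    # n-dimensional Brownian increment
    dBt = rng.normal(0.0, np.sqrt(dt), size=m)   # independent m-dimensional increment
    ds = f(s, a) * dt + g(s, a) @ dB
    da = mu(s, a) * dt + sigma(s, a) @ dBt       # Δa: the action increment
    return s + ds, a + da

# toy usage with placeholder linear dynamics
rng = np.random.default_rng(0)
f = lambda s, a: -0.1 * s
g = lambda s, a: 0.05 * np.eye(s.size)
mu = lambda s, a: -0.2 * a
sigma = lambda s, a: 0.1 * np.eye(a.size)
s, a = np.ones(3), np.zeros(2)
for _ in range(10):
    s, a = euler_maruyama_step(s, a, f, g, mu, sigma, dt=0.05, rng=rng)
```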
the APG mean variance network, the Q function network, the network of the environment change function f and the network of the environment change function g in the continuity reinforcement learning method based on the stochastic differential equation all use a rectification linear unit ReLU as a network model.
The VE optimizer of the value estimator VE predicts the future reward estimate R'_k through a Q function. The main objective function of the VE optimizer is J_Q(θ), expressed in terms of the expectation conditional on (s_k, a_k); the Q value in the current state is solved and updated on the basis of the objective function J_Q(θ).
The environment state parameter set θ_p in the environment state estimator ESE is determined by the law of change of the environment state, which defines the objective function of the ESE optimizer. The ESE optimizer also uses an additional objective function, which evaluates the accuracy with which the environment state estimator ESE estimates future changes of the environment state.
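The ESE objective and its additional objective are not reproduced in this text. As a stand-in only, the sketch below shows a generic one-step prediction loss measuring how accurately a θ_p-parameterized predictor (a hypothetical PyTorch module `ese_net`) anticipates the observed next state; it is not the patent's formula.

```python
import torch

def ese_prediction_loss(s_pred, s_next):
    """Mean squared error between the ESE estimate s'_k and the observed s_{k+1}."""
    return torch.mean((s_pred - s_next) ** 2)

# hypothetical usage with a predictor module `ese_net` and optimizer `opt`:
#   s_pred = ese_net(s_k, a_k)
#   loss = ese_prediction_loss(s_pred, s_next)
#   opt.zero_grad(); loss.backward(); opt.step()
```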
The action policy generator APG updates the parameters of the APG mean-variance network using a policy gradient method, according to the objective function of the APG optimizer.
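The APG objective itself is likewise not reproduced here; the sketch below is a generic Gaussian policy-gradient surrogate consistent with updating a mean-variance network, with the value weight supplied by the VE estimate R'_k. The function and argument names are illustrative assumptions.

```python
import torch

def apg_policy_gradient_loss(mean, var, delta_a, value_estimate):
    """Negative log-likelihood of the sampled increment Δa_k under the Gaussian
    defined by the APG mean-variance network, weighted by the value estimate."""
    dist = torch.distributions.Normal(mean, var.sqrt())
    log_prob = dist.log_prob(delta_a).sum(dim=-1)
    return -(log_prob * value_estimate.detach()).mean()
```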
in summary, compared with the existing classical reinforcement learning method, the reinforcement learning method based on the stochastic differential equation provided by the invention takes the action stochastic differential equation as the core of the basic model, can solve the two problems of continuity and uncertainty control caused by variance control, which are difficult to be considered in the traditional reinforcement learning method, and can avoid the defect of markov control in the classical reinforcement learning.
The invention has the advantages and positive effects that:
(1) The continuity reinforcement learning method provided by the invention does not depend on the change of the current environment state value alone, but on the environment state value and the action value at the previous moment, so the method adapts to the real-time environment and achieves a predictive estimation effect.
(2) The method provided by the invention is based on increments rather than on a Markov control process. On the one hand, this avoids blind adaptation of the system to the environment, since the agent adjusts its own state by observing the environment; on the other hand, it reduces the influence of the delay between sensor and actuator in the control process, so that the agent's behaviour is smoother and more accurate.
(3) The continuity reinforcement learning method based on stochastic differential equations strictly guarantees the continuity of actions, can be applied to the control of continuous actions, and guarantees the existence and uniqueness of the value estimation network. Since real physical systems and real actions are mainly continuous systems and continuous actions, the invention, compared with other existing reinforcement learning methods, theoretically advances the application of reinforcement learning to real control.
Drawings
FIG. 1 is a training flow chart of the continuity reinforcement learning method based on stochastic differential equations according to the present invention;
FIG. 2 is a single-step execution process diagram of the continuity reinforcement learning method based on stochastic differential equations;
FIG. 3 is a training update process diagram of the continuity reinforcement learning method based on stochastic differential equations;
FIG. 4 is a computational example diagram of the continuity reinforcement learning method based on stochastic differential equations according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the continuity reinforcement learning system and method based on stochastic differential equations of the present invention are further described below with reference to the accompanying drawings and embodiments.
The invention provides a system and method for continuity reinforcement learning based on stochastic differential equations, suitable for continuous control applications.
As shown in FIG. 1, the continuity reinforcement learning method based on stochastic differential equations provided by the present invention comprises the following steps:
Step 1: initialize all parameters of the action policy generator APG, the environment state estimator ESE, the value estimator VE, the memory storage module MS and the external environment EE contained in the learning method.
The initialization of parameters is illustrated with the Pendulum-v0 inverted-pendulum control experiment in OpenAI Gym. In the Pendulum-v0 experiment, the pendulum starts at a random position, and the control objective is to swing it up and keep it upright. In the experiment, the time interval Δt is set to 0.05 and the discount factor γ to 0.6; the objective function of the action policy generator APG is used to update the APG network, thereby simulating the control process of this classical inverted-pendulum control model.
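A minimal setup sketch for this experiment, assuming the classic OpenAI Gym API; on newer gym/gymnasium releases the same swing-up task is registered as "Pendulum-v1", and the reset/step return signatures differ slightly.

```python
import gym

DT = 0.05     # time interval Δt used in the experiment
GAMMA = 0.6   # discount factor γ

env = gym.make("Pendulum-v0")          # "Pendulum-v1" on newer gym versions
obs = env.reset()                      # pendulum starts at a random position
print(env.observation_space.shape)     # (3,): cos(theta), sin(theta), angular velocity
print(env.action_space.shape)          # (1,): torque applied to the joint
```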
Step 2: the current-step environment state value s_k and the current-step action value a_k are input to the action policy generator APG, which calculates and outputs the increment Δa_k of the current-step action value a_k; the external environment EE executes the action value a_k + Δa_k to obtain the next action value a_(k+1), the next environment state value s_(k+1) and the current-step reward value R_k; the obtained data (s_k, a_k, R_k, s_(k+1), a_(k+1)) are stored in the memory storage module MS.
FIG. 2 shows the single-step execution process of the continuity reinforcement learning method based on stochastic differential equations, illustrating a single training execution of the method of the present invention. During execution, the action policy generator APG does the main work while the value estimator VE and the environment state estimator ESE are dormant: the action policy generator APG generates the increment of the action for the next moment, and in each execution step the (s_(k+1), a_(k+1), R_k) generated after the external environment EE executes the current-step action value a_k are stored in the memory storage module for the subsequent update process.
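A sketch of this execution phase, assuming a Gym-style environment as the external environment EE and the MemoryStorage sketch given earlier as `ms`; treating the executed command a_k + Δa_k as the next action value a_(k+1) is an assumption of the illustration, not a statement of the patent.

```python
import numpy as np

def execute_step(env, s_k, a_k, delta_a, ms):
    """Apply a_k + Δa_k in the external environment EE, observe (s_{k+1}, R_k),
    and store the transition in the memory storage module `ms`."""
    a_cmd = np.clip(a_k + delta_a, env.action_space.low, env.action_space.high)
    s_next, r_k, done, info = env.step(a_cmd)     # classic gym step signature
    a_next = a_cmd                                # assumption: executed command = a_{k+1}
    ms.store(s_k, a_k, r_k, s_next, a_next)
    return s_next, a_next, r_k, done
```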
The action output and the environment state of the whole continuity reinforcement learning system are described by control-change increments; the incremental description of the environment state and the action output takes the form

    ds_t = f(s_t, a_t) dt + g(s_t, a_t) dB_t,
    da_t = μ(s_t, a_t) dt + σ(s_t, a_t) dB̃_t,

where f: R^(n+m) → R^n and g: R^(n+m) → R^(n×n) are environment change functions (specific networks or equations in the actual model) parameterized by the environment state parameter set θ_p, μ and σ are the drift and diffusion of the action output parameterized by the action parameter set θ_v, s_t is the environment state value at time t, a_t is the action value at time t, (s_t, a_t) is the array of environment state value and action value, B_t is an n-dimensional Brownian motion, B̃_t is an m-dimensional Brownian motion, and the components of B_t and B̃_t are mutually independent.
Writing Y_t = (s_t, a_t) ∈ R^(n+m), the whole method is governed by the diffusion equation

    dY_t = f̄(Y_t) dt + ḡ(Y_t) dB̄_t,

where B̄_t = (B_t, B̃_t) is an (n+m)-dimensional Brownian motion; f̄ and ḡ, obtained by stacking the state and action drift and diffusion terms, form the basic model of the method.
the APG mean variance network, the Q function network, the environment change function f and the environment change function g network all use a rectification linear unit ReLU as a network model. The rectifying linear unit ReLU rarely has points that cannot be differentiated, thus guaranteeing the continuity of the method.
FIG. 3 shows the training update process of the continuity reinforcement learning method based on stochastic differential equations. The training update of the method is divided into three parts: the parameter update of the environment state estimator ESE, the parameter update of the value estimator VE and the parameter update of the action policy generator APG. The specific training update process is as follows:
Step 3: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them jointly to the environment state estimator ESE; according to the input (s_k, a_k), the ESE updates and outputs the environment state parameter set θ_p and predicts and outputs the future environment state estimate s'_k.
The environment state parameter set θ_p in the environment state estimator ESE is determined by the law of change of the environment state, which defines the objective function of the ESE optimizer. The ESE optimizer also uses an additional objective function, which evaluates the accuracy with which the environment state estimator ESE estimates future changes of the environment state.
Step 4: take the next environment state value s_(k+1), the next action value a_(k+1) and the current-step reward value R_k out of the memory storage module MS and input them, together with the future environment state estimate s'_k and the updated environment state parameter set θ_p output by the environment state estimator ESE and the current-step action value a_k input to the action policy generator APG, jointly into the value estimator VE; according to the input (s_(k+1), a_(k+1), R_k, s'_k, θ_p, a_k), the VE optimizer updates the Q-function network, and the value estimator VE predicts and outputs the future reward estimate R'_k.
The VE optimizer of the value estimator VE predicts the future reward estimate R'_k through a Q function. The main objective function of the VE optimizer is J_Q(θ), expressed in terms of the expectation conditional on (s_k, a_k); the Q value in the current state is solved and updated on the basis of the objective function J_Q(θ).
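The concrete form of J_Q(θ) is not reproduced in this text. As a stand-in, the sketch below shows a generic one-step Bellman-residual loss, with `q_net` a hypothetical Q-function module taking (state, action) batches and γ matching the discount factor used in the Pendulum experiment; this is an assumption, not the patent's objective.

```python
import torch

def q_bellman_residual(q_net, s_k, a_k, r_k, s_next, a_next, gamma=0.6):
    """Squared one-step Bellman residual for the VE Q-function network."""
    with torch.no_grad():
        target = r_k + gamma * q_net(s_next, a_next)   # bootstrapped target
    return torch.mean((q_net(s_k, a_k) - target) ** 2)
```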
Step 5: the future environment state estimate s'_k and the environment state parameter set θ_p output by the environment state estimator ESE, together with the future reward estimate R'_k output by the value estimator VE, are input jointly to the action policy generator APG; according to the input (s'_k, θ_p, R'_k), the APG optimizer updates the action value parameter set θ_v.
The action policy generator APG comprises an APG optimizer, a mean-variance network and an arithmetic unit. The APG updates the parameters of the APG mean-variance network using a policy gradient method, according to the objective function of the APG optimizer.
Step 6: judge whether training has reached the termination condition; if so, the whole training process is finished; if not, add 1 to k and return to Step 2 to continue the next single-step training process. The termination condition may be a preset number of training steps or may be set according to the training target.
FIG. 4 shows a computational example of the continuity reinforcement learning method based on stochastic differential equations. The three curves in the figure are the results of the Pendulum-v0 inverted-pendulum control training experiment carried out in the OpenAI Gym simulation environment by the method of the present invention and by two other reinforcement learning methods, DDPG and A3C. The training termination condition is set to 1500 steps; in each of the 1500 steps, the system and method of the invention run Steps 2 to 6 to carry out the execution process and the training update process. As can be seen from FIG. 4, the system and method of the invention, built on stochastic differential equation theory distinct from the other methods, are well suited to control training for continuous systems.
The above-mentioned embodiments are intended to illustrate the objects and technical solutions of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A continuity reinforcement learning system based on stochastic differential equations, characterized in that:
the system comprises an action policy generator APG, an environment state estimator ESE, a value estimator VE, a memory storage module MS and an external environment EE;
the APG consists of an APG optimizer, an APG mean variance network and an APG arithmetic unit and is used for training each single step according to the environmental state value s of the current stepkAnd the current step action value akCalculating and outputting the motion increment delta ak
the environment state estimator ESE comprises an ESE optimizer, and is used for updating the environment state parameter set θ_p and predicting the future environment state estimate s'_k from the current-step environment state value s_k and the current-step action value a_k;
the value estimator VE consists of a VE optimizer and a Q-function network; the VE optimizer updates the Q-function network according to the input (s_(k+1), a_(k+1), R_k, s'_k, θ_p, a_k), and the value estimator VE is used for predicting the future reward estimate R'_k, where s_(k+1) is the next environment state value and a_(k+1) is the next action value;
the memory storage module MS is used for storing the current step environment state value skCurrent step action value akAnd storing the next environmental state value s obtained in each single-step training processk+1The next action value ak+1And the current step reward Rk
the external environment EE is used to describe and measure actions and environment states, taking the form of simulation software or real physical systems.
2. A continuity reinforcement learning method based on stochastic differential equations, using the system according to claim 1, characterized in that the training process of the method comprises the following steps:
Step 1: initialize the action policy generator APG, the environment state estimator ESE and the value estimator VE;
Step 2: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them to the action policy generator APG; the action policy generator APG calculates and outputs the increment Δa_k of the current-step action value a_k according to the APG mean-variance network; the external environment EE executes the action value a_k + Δa_k to obtain the next action value a_(k+1), the next environment state value s_(k+1) and the current-step reward value R_k; the obtained data (s_k, a_k, R_k, s_(k+1), a_(k+1)) are stored in the memory storage module MS;
Step 3: take the current-step environment state value s_k and the current-step action value a_k out of the memory storage module MS and input them jointly to the environment state estimator ESE; according to the input (s_k, a_k), the ESE updates and outputs the environment state parameter set θ_p and predicts and outputs the future environment state estimate s'_k;
Step 4: take the next environment state value s_(k+1), the next action value a_(k+1) and the current-step reward value R_k out of the memory storage module MS and input them, together with the future environment state estimate s'_k and the updated environment state parameter set θ_p output by the environment state estimator ESE and the current-step action value a_k input to the action policy generator APG, jointly into the value estimator VE; according to the input (s_(k+1), a_(k+1), R_k, s'_k, θ_p, a_k), the VE optimizer updates the Q-function network, and the value estimator VE predicts and outputs the future reward estimate R'_k;
Step 5: the future environment state estimate s'_k and the environment state parameter set θ_p output by the environment state estimator ESE, together with the future reward estimate R'_k output by the value estimator VE, are input jointly to the action policy generator APG; according to the input (s'_k, θ_p, R'_k), the APG optimizer updates the action value parameter set θ_v;
Step 6: judge whether training has reached the termination condition; if so, the whole training process is finished; if not, add 1 to k and return to Step 2 to continue the next single-step training process.
3. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that Step 1 realizes the initialization of the training process, Step 2 realizes the action execution of the training process, and Steps 3, 4 and 5 jointly realize the parameter update of the training process.
4. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that the incremental description of the environment state and the action output takes the form

    ds_t = f(s_t, a_t) dt + g(s_t, a_t) dB_t,
    da_t = μ(s_t, a_t) dt + σ(s_t, a_t) dB̃_t,

where f: R^(n+m) → R^n and g: R^(n+m) → R^(n×n) are environment change functions (specific networks or equations in the actual model) parameterized by the environment state parameter set θ_p, μ and σ are the drift and diffusion of the action output parameterized by the action parameter set θ_v, s_t is the environment state value at time t, a_t is the action value at time t, (s_t, a_t) is the array of environment state value and action value, B_t is an n-dimensional Brownian motion, B̃_t is an m-dimensional Brownian motion, and the components of B_t and B̃_t are mutually independent; writing Y_t = (s_t, a_t) ∈ R^(n+m), the continuity reinforcement learning method based on stochastic differential equations is governed by the diffusion equation

    dY_t = f̄(Y_t) dt + ḡ(Y_t) dB̄_t,

where B̄_t = (B_t, B̃_t) is an (n+m)-dimensional Brownian motion, and f̄ and ḡ constitute the basic model of the continuity reinforcement learning method based on stochastic differential equations.
5. The method as claimed in claim 4, characterized in that the APG mean-variance network, the Q-function network, the network of the environment change function f and the network of the environment change function g all use the rectified linear unit ReLU as the network model.
6. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that the VE optimizer of the value estimator VE predicts the future reward estimate R'_k through a Q function; the main objective function of the VE optimizer is J_Q(θ), expressed in terms of the expectation conditional on (s_k, a_k), and the Q value in the current state is solved and updated on the basis of the objective function J_Q(θ).
7. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that the environment state parameter set θ_p in the environment state estimator ESE is determined by the law of change of the environment state, which defines the objective function of the ESE optimizer; the ESE optimizer further uses an additional objective function, which evaluates the accuracy with which the environment state estimator ESE estimates future changes of the environment state.
8. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 2, characterized in that the action policy generator APG updates the parameters of the APG mean-variance network using a policy gradient method, according to the objective function of the APG optimizer.
9. The continuity reinforcement learning method based on stochastic differential equations as claimed in claim 4, claim 6 or claim 8, characterized in that the basic model consists of the drift f̄ and the diffusion ḡ of the diffusion equation dY_t = f̄(Y_t) dt + ḡ(Y_t) dB̄_t.
CN201910712857.9A 2019-08-02 2019-08-02 Continuity reinforcement learning system and method based on random differential equation Active CN110502721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910712857.9A CN110502721B (en) 2019-08-02 2019-08-02 Continuity reinforcement learning system and method based on random differential equation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910712857.9A CN110502721B (en) 2019-08-02 2019-08-02 Continuity reinforcement learning system and method based on random differential equation

Publications (2)

Publication Number Publication Date
CN110502721A true CN110502721A (en) 2019-11-26
CN110502721B CN110502721B (en) 2021-04-06

Family

ID=68587805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910712857.9A Active CN110502721B (en) 2019-08-02 2019-08-02 Continuity reinforcement learning system and method based on random differential equation

Country Status (1)

Country Link
CN (1) CN110502721B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523029A (en) * 2018-09-28 2019-03-26 清华大学深圳研究生院 For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN110032186A (en) * 2019-03-27 2019-07-19 上海大学 A kind of labyrinth feature identification of anthropomorphic robot and traveling method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109523029A (en) * 2018-09-28 2019-03-26 清华大学深圳研究生院 For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN110032186A (en) * 2019-03-27 2019-07-19 上海大学 A kind of labyrinth feature identification of anthropomorphic robot and traveling method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JINWUK SEOK et al.: "The analysis of a stochastic differential approach for Langevin competitive learning algorithm", Object Recognition Supported by User Interaction for Service Robots *
TIMOTHY P. LILLICRAP et al.: "Continuous control with deep reinforcement learning", arXiv:1509.02971 [cs.LG] *
金海东 et al.: "An integrated stochastic gradient descent Q-learning method with an adaptive learning rate", Chinese Journal of Computers *

Also Published As

Publication number Publication date
CN110502721B (en) 2021-04-06

Similar Documents

Publication Publication Date Title
JP6926203B2 (en) Reinforcement learning with auxiliary tasks
JP6827539B2 (en) Training action selection neural networks
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN110223517A (en) Short-term traffic flow forecast method based on temporal correlation
CN112119404A (en) Sample efficient reinforcement learning
US20210103815A1 (en) Domain adaptation for robotic control using self-supervised learning
JP2019537136A (en) Environmental prediction using reinforcement learning
Narendra et al. Fast Reinforcement Learning using multiple models
CN111433689B (en) Generation of control systems for target systems
Truong et al. Design of an advanced time delay measurement and a smart adaptive unequal interval grey predictor for real-time nonlinear control systems
WO2022156182A1 (en) Methods and apparatuses for constructing vehicle dynamics model and for predicting vehicle state information
CN108594803B (en) Path planning method based on Q-learning algorithm
CN112154458A (en) Reinforcement learning using proxy courses
CN113419424B (en) Modeling reinforcement learning robot control method and system for reducing overestimation
US20240095495A1 (en) Attention neural networks with short-term memory units
CN112571420B (en) Dual-function model prediction control method under unknown parameters
WO2021225923A1 (en) Generating robot trajectories using neural networks
CN113614743A (en) Method and apparatus for operating a robot
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
Caarls et al. Parallel online temporal difference learning for motor control
Inga et al. Online inverse linear-quadratic differential games applied to human behavior identification in shared control
US20230120256A1 (en) Training an artificial neural network, artificial neural network, use, computer program, storage medium and device
CN114529010A (en) Robot autonomous learning method, device, equipment and storage medium
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
CN110502721B (en) Continuity reinforcement learning system and method based on random differential equation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant