CN112597693A - Self-adaptive control method based on depth deterministic strategy gradient - Google Patents
Self-adaptive control method based on depth deterministic strategy gradient
- Publication number
- CN112597693A (application CN202011297651.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- training
- value
- actor
- critic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 23
- 230000009471 action Effects 0.000 claims abstract description 16
- 238000004088 simulation Methods 0.000 claims abstract description 12
- 230000002787 reinforcement Effects 0.000 claims abstract description 11
- 238000013528 artificial neural network Methods 0.000 claims description 12
- 230000003044 adaptive effect Effects 0.000 claims description 3
- 230000008569 process Effects 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 238000005309 stochastic process Methods 0.000 claims description 3
- 230000003993 interaction Effects 0.000 abstract description 6
- 230000006870 function Effects 0.000 abstract description 3
- 238000010586 diagram Methods 0.000 description 4
- 238000013178 mathematical model Methods 0.000 description 2
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Computer Hardware Design (AREA)
- Geometry (AREA)
- Feedback Control In General (AREA)
Abstract
The invention provides an adaptive control method based on the deep deterministic policy gradient (DDPG). A simulation training environment is constructed according to the characteristics of the real system; the state (observed quantities), reward function, termination conditions and actions are constructed; the critic network, the actor network and the corresponding target networks of the deep deterministic policy gradient method are constructed and trained through trial-and-error interaction with the simulation training environment; the trained actor network is used as the controller of the system. The invention applies deep reinforcement learning to controller design, sets out the implementation steps of the method, and, once offline simulation training meets the controller requirements, ports the controller to the real environment to realize adaptive control of a nonlinear system.
Description
Technical field:
The application relates to the technical field of computer software, and in particular to an adaptive control method based on deep reinforcement learning.
Background art:
A traditional PID controller requires accurate modeling of the system: the time-domain model is converted into a frequency-domain transfer function through the Laplace transform, and the classical PID controller is designed according to methods such as the root locus.
The problem to be solved by the application is the control of nonlinear systems: constructing a model-free control method that does not depend on an accurate mathematical model.
Disclosure of Invention
The application aims to provide an adaptive control method based on deep reinforcement learning that solves the control problem of nonlinear systems by constructing a model-free control method that does not depend on an accurate mathematical model.
The technical scheme of the application is as follows: an adaptive control method based on the deep deterministic policy gradient, comprising the following steps:
1) first, a simulation training environment is established according to the characteristics of the real system; the simulation training environment is consistent with the real system and interacts with the reinforcement-learning training;
2) the state, reward, action and termination conditions are constructed as the training elements of deep reinforcement learning, with the action interval a ∈ [A_min, A_max] and control-command amplitude limiting applied according to the real system;
3) a critic network, an actor network and the corresponding critic-target and actor-target networks are constructed, each of which is a neural network;
4) the critic network and the actor network are trained for multiple episodes, the next episode starting after the current one finishes; 5) the trained actor network is used as the controller.
The state, reward, action and termination conditions are defined as follows. State: the current value true, the error error = reference − true, and the error integral ∫e dt are taken as the state quantities;
Reward: if the actual value leaves the range [min, max] (true ≤ min || true ≥ max), the reward is −100; if |error| > 0.1, the reward is −1; if |error| < 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated.
The process of training the critic network and the actor network comprises the following steps:
a) the critic network parameters θ^Q and the actor network parameters θ^μ are initialized and copied to the critic-target network (θ^{Q′}) and the actor-target network (θ^{μ′}); the experience pool is initialized;
next, M rounds of training are started:
b) the actor selects an action according to the actor network and delivers it to the environment: a_t = μ(s_t|θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic (Ornstein–Uhlenbeck) process;
c) after the environment executes the action, it returns the reward r_t and the new state s_{t+1};
d) the transition (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, and N transitions are randomly sampled as a mini-batch for network training;
e) the target value and the critic loss are calculated according to:
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′}) | θ^{Q′}), L = (1/N)Σ_i (y_i − Q(s_i, a_i|θ^Q))²
f) θ^Q is updated with the Adam optimizer;
g) the policy gradient of the actor network is calculated: ∇_{θ^μ}J ≈ (1/N)Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i};
h) θ^μ is updated with the Adam optimizer;
i) the actor-target and critic-target networks are updated by soft update: θ^{Q′} ← τθ^Q + (1−τ)θ^{Q′}, θ^{μ′} ← τθ^μ + (1−τ)θ^{μ′}, with τ ≪ 1.
the application has the advantages that: a simulation training environment is constructed according to the characteristics of the real system; constructing states (observed quantities), return functions, cut-off conditions and actions; constructing a critic network, an actor network and a corresponding target network of a depth certainty strategy gradient method, and training by trial-and-error interaction with a simulation training environment; the operator network training results are used as the controller for the system.
The invention applies deep reinforcement learning to controller design, sets out the implementation steps of the method, and, once offline simulation training meets the controller requirements, ports the controller to the real environment to realize adaptive control of a nonlinear system.
Drawings
FIG. 1 is a schematic diagram of an environment and reinforcement learning training interaction;
FIG. 2 is a schematic diagram of a neural network architecture;
FIG. 3 is a schematic diagram of the interaction of the deep deterministic policy gradient neural networks with the training environment;
FIG. 4 is a schematic diagram of transplanting the trained actor network to the real system.
Detailed Description
The invention provides an adaptive control method based on the deep deterministic policy gradient, whose main steps are as follows:
1) First, a simulation training environment is constructed according to the characteristics of the real system and is kept consistent with it; the interaction between the environment and the reinforcement-learning training is shown in FIG. 1.
2) The state, reward, action and termination conditions are constructed as the training elements of deep reinforcement learning;
State: the current value true, the error error = reference − true, and the error integral ∫e dt are taken as the state quantities;
Reward: if the actual value leaves the range [min, max] (true ≤ min || true ≥ max), the reward is −100; if |error| > 0.1, the reward is −1; if |error| < 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated;
Action interval: a ∈ [A_min, A_max], with control-command amplitude limiting applied according to the real system. A minimal code sketch of these elements follows.
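By way of illustration only, the elements above can be written as the following minimal Python sketch. All names (make_step_elements, clip_action, min_value, max_value, dt) are hypothetical, and the thresholds follow the definitions given above; this is a sketch under those assumptions, not the patent's prescribed implementation.

```python
import numpy as np

def make_step_elements(true_value, reference, error_integral, dt,
                       min_value, max_value):
    """Compute the state, reward and termination flag defined above.

    min_value/max_value are the plant limits; all names are illustrative.
    """
    error = reference - true_value
    error_integral += error * dt  # running integral of e dt

    # State: current value, error and error integral.
    state = np.array([true_value, error, error_integral], dtype=np.float32)

    # Termination: the episode ends when the value leaves [min, max].
    done = (true_value <= min_value) or (true_value >= max_value)

    # Reward as specified: -100 out of range, -1 / +10 by |error|.
    if done:
        reward = -100.0
    elif abs(error) > 0.1:
        reward = -1.0
    else:
        reward = 10.0

    return state, reward, done, error_integral

def clip_action(a, a_min, a_max):
    """Amplitude-limit the control command to the interval [A_min, A_max]."""
    return float(np.clip(a, a_min, a_max))
```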
3) According to the deep deterministic policy gradient method, a critic network, an actor network and the corresponding critic-target and actor-target networks are constructed; the structure of the neural networks is shown in FIG. 2.
A schematic diagram of the interaction of the deep deterministic policy gradient networks with the environment is shown in FIG. 3; a code sketch of one possible pair of network definitions is given below.
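The layer sizes of FIG. 2 are not reproduced here, so the following PyTorch sketch of one possible actor/critic pair rests on assumptions: the three-dimensional state from step 2), a single bounded action, and two hidden layers of 64 units are illustrative choices only.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu) producing one bounded action."""
    def __init__(self, state_dim=3, action_dim=1, a_max=1.0, hidden=64):
        super().__init__()
        self.a_max = a_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )

    def forward(self, s):
        return self.a_max * self.net(s)  # rescale to [-A_max, A_max]

class Critic(nn.Module):
    """Action-value function Q(s, a | theta_Q)."""
    def __init__(self, state_dim=3, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```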
4) Training the critic network and the actor network.
The critic and actor networks are trained with the deep deterministic policy gradient algorithm; the training procedure is as follows, and a minimal code sketch follows the steps:
a) the critic network parameters θ^Q and the actor network parameters θ^μ are initialized and copied to the critic-target network (θ^{Q′}) and the actor-target network (θ^{μ′}); the experience pool is initialized;
next, M rounds of training are started:
b) the actor selects an action according to the actor network and delivers it to the environment: a_t = μ(s_t|θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic (Ornstein–Uhlenbeck) process;
c) after the environment executes the action, it returns the reward r_t and the new state s_{t+1};
d) the transition (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, and N transitions are randomly sampled as a mini-batch for network training;
e) the target value and the critic loss are calculated according to:
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′}) | θ^{Q′}), L = (1/N)Σ_i (y_i − Q(s_i, a_i|θ^Q))²
f) θ^Q is updated with the Adam optimizer;
g) the policy gradient of the actor network is calculated: ∇_{θ^μ}J ≈ (1/N)Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i};
h) θ^μ is updated with the Adam optimizer;
i) the actor-target and critic-target networks are updated by soft update: θ^{Q′} ← τθ^Q + (1−τ)θ^{Q′}, θ^{μ′} ← τθ^μ + (1−τ)θ^{μ′}, with τ ≪ 1.
The current training episode then ends and the next episode begins.
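Steps a) through i) can be assembled into a training loop such as the following sketch, which assumes the Actor and Critic classes above and a hypothetical environment exposing reset() and step(); the hyperparameter values, the simplified AR(1) stand-in for the Ornstein-Uhlenbeck noise, and the (1 - done) masking of the target are illustrative assumptions rather than values fixed by the patent.

```python
import copy
import random
from collections import deque

import numpy as np
import torch

# Illustrative hyperparameters; the patent does not fix these values.
GAMMA, TAU, BATCH, M_EPISODES, MAX_STEPS = 0.99, 0.005, 64, 200, 500

def ddpg_train(env, actor, critic):
    """Minimal sketch of steps a)-i); env.reset() -> s, env.step(a) -> (s', r, done)."""
    # a) Target networks start as copies; the experience pool is initialized.
    actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
    opt_mu = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_q = torch.optim.Adam(critic.parameters(), lr=1e-3)
    pool = deque(maxlen=100_000)
    ou = 0.0  # simplified AR(1) stand-in for Ornstein-Uhlenbeck noise

    for _ in range(M_EPISODES):                       # M rounds of training
        s = env.reset()
        for _ in range(MAX_STEPS):
            # b) a_t = mu(s_t | theta_mu) + OU_t, amplitude-limited.
            ou = 0.85 * ou + np.random.normal(0.0, 0.2)
            with torch.no_grad():
                a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy() + ou
            a = np.clip(a, -actor.a_max, actor.a_max)
            # c) The environment executes the action, returns r_t and s_{t+1}.
            s2, r, done = env.step(a)
            # d) Store (s_t, a_t, r_t, s_{t+1}); sample N transitions.
            pool.append((s, a, r, s2, float(done)))
            s = s2
            if len(pool) >= BATCH:
                S, A, R, S2, D = (
                    torch.as_tensor(np.array(x), dtype=torch.float32)
                    for x in zip(*random.sample(pool, BATCH)))
                # e) y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})); terminal
                # transitions are masked, a common refinement.
                with torch.no_grad():
                    y = (R.unsqueeze(1) + GAMMA * (1.0 - D.unsqueeze(1))
                         * critic_t(S2, actor_t(S2)))
                # f) Update theta_Q with Adam on L = mean((y_i - Q)^2).
                q_loss = ((critic(S, A) - y) ** 2).mean()
                opt_q.zero_grad(); q_loss.backward(); opt_q.step()
                # g)-h) Policy gradient: ascend Q(s, mu(s)) via Adam.
                mu_loss = -critic(S, actor(S)).mean()
                opt_mu.zero_grad(); mu_loss.backward(); opt_mu.step()
                # i) Soft-update the target networks with rate tau.
                for tgt, src in ((actor_t, actor), (critic_t, critic)):
                    for pt, p in zip(tgt.parameters(), src.parameters()):
                        pt.data.mul_(1.0 - TAU).add_(TAU * p.data)
            if done:
                break
    return actor
```

Minimizing −Q(s, μ(s)) with respect to θ^μ is the usual way of realizing the policy gradient of step g): ascending Q with respect to the action is implemented as descending its negative mean.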
5) The trained actor network is used as the controller, as sketched below.
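At deployment time the exploration noise is dropped and the actor runs one inference per control cycle; the following sketch (hypothetical names, reusing the state construction of step 2)) illustrates this:

```python
import numpy as np
import torch

def control_step(actor, true_value, reference, error_integral, dt,
                 a_min, a_max):
    """One control cycle of the deployed actor; no exploration noise."""
    error = reference - true_value
    error_integral += error * dt
    state = torch.tensor([true_value, error, error_integral],
                         dtype=torch.float32)
    with torch.no_grad():
        a = actor(state).item()
    # Amplitude-limit the command exactly as during training.
    return float(np.clip(a, a_min, a_max)), error_integral
```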
Claims (3)
1. An adaptive control method based on a deep deterministic policy gradient, characterized by comprising the following steps:
1) first, establishing a simulation training environment according to the characteristics of the real system, the simulation training environment being consistent with the real system and interacting with the reinforcement-learning training;
2) constructing the state, reward, action and termination conditions as the training elements of deep reinforcement learning, the action interval being a ∈ [A_min, A_max], with control-command amplitude limiting applied according to the real system;
3) constructing a critic network, an actor network and the corresponding critic-target and actor-target networks, each of which is a neural network;
4) performing multiple episodes of training on the critic network and the actor network, the next episode starting after the current episode finishes;
5) using the trained actor network as the controller.
2. The adaptive control method based on a deep deterministic policy gradient according to claim 1, characterized in that the state, reward, action and termination conditions are defined as follows. State: the current value true, the error error = reference − true, and the error integral ∫e dt are taken as the state quantities;
Reward: if the actual value leaves the range [min, max] (true ≤ min || true ≥ max), the reward is −100; if |error| > 0.1, the reward is −1; if |error| < 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated.
3. The adaptive control method based on a deep deterministic policy gradient according to claim 1, characterized in that the process of training the critic network and the actor network comprises the following steps:
A) initializing the critic network parameters θ^Q and the actor network parameters θ^μ and copying them to the critic-target network (θ^{Q′}) and the actor-target network (θ^{μ′}); initializing the experience pool;
next, M rounds of training are started:
B) the actor selects an action according to the actor network and delivers it to the environment: a_t = μ(s_t|θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic (Ornstein–Uhlenbeck) process;
C) after the environment executes the action, it returns the reward r_t and the new state s_{t+1};
D) the transition (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, and N transitions are randomly sampled as a mini-batch for network training;
E) calculating the target value and the critic loss according to:
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′}) | θ^{Q′}), L = (1/N)Σ_i (y_i − Q(s_i, a_i|θ^Q))²
F) updating θ^Q with the Adam optimizer;
G) calculating the policy gradient of the actor network: ∇_{θ^μ}J ≈ (1/N)Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i};
H) updating θ^μ with the Adam optimizer;
I) updating the actor-target and critic-target networks by soft update: θ^{Q′} ← τθ^Q + (1−τ)θ^{Q′}, θ^{μ′} ← τθ^μ + (1−τ)θ^{μ′}.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011297651.3A CN112597693A (en) | 2020-11-19 | 2020-11-19 | Self-adaptive control method based on depth deterministic strategy gradient |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011297651.3A CN112597693A (en) | 2020-11-19 | 2020-11-19 | Self-adaptive control method based on depth deterministic strategy gradient |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112597693A true CN112597693A (en) | 2021-04-02 |
Family
ID=75183402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011297651.3A Pending CN112597693A (en) | 2020-11-19 | 2020-11-19 | Self-adaptive control method based on depth deterministic strategy gradient |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112597693A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108600379A (en) * | 2018-04-28 | 2018-09-28 | 中国科学院软件研究所 | A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient |
CN109948642A (en) * | 2019-01-18 | 2019-06-28 | 中山大学 | Multiple agent cross-module state depth deterministic policy gradient training method based on image input |
CN110323981A (en) * | 2019-05-14 | 2019-10-11 | 广东省智能制造研究所 | A kind of method and system controlling permanent magnetic linear synchronous motor |
CN111079936A (en) * | 2019-11-06 | 2020-04-28 | 中国科学院自动化研究所 | Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning |
Non-Patent Citations (1)
Title |
---|
LE JIANG et al.: "Path tracking control based on Deep reinforcement learning in Autonomous driving", 2019 3rd Conference on Vehicle Control and Intelligence (CVCI), pages 1-6 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113721645A (en) * | 2021-08-07 | 2021-11-30 | 中国航空工业集团公司沈阳飞机设计研究所 | Unmanned aerial vehicle continuous maneuvering control method based on distributed reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108052004B (en) | Industrial mechanical arm automatic control method based on deep reinforcement learning | |
CN110515303B (en) | DDQN-based self-adaptive dynamic path planning method | |
CN110238839B (en) | Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction | |
CN107272403A (en) | A kind of PID controller parameter setting algorithm based on improvement particle cluster algorithm | |
CN111898770B (en) | Multi-agent reinforcement learning method, electronic equipment and storage medium | |
CN112215364B (en) | Method and system for determining depth of enemy-friend based on reinforcement learning | |
CN110427006A (en) | A kind of multi-agent cooperative control system and method for process industry | |
CN111008449A (en) | Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment | |
Han et al. | Intelligent decision-making for 3-dimensional dynamic obstacle avoidance of UAV based on deep reinforcement learning | |
Bianchi et al. | Heuristically accelerated reinforcement learning: Theoretical and experimental results | |
CN114815882B (en) | Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning | |
Ren | Optimal control | |
CN114065929A (en) | Training method and device for deep reinforcement learning model and storage medium | |
CN116604532A (en) | Intelligent control method for upper limb rehabilitation robot | |
CN112597693A (en) | Self-adaptive control method based on depth deterministic strategy gradient | |
CN116880191A (en) | Intelligent control method of process industrial production system based on time sequence prediction | |
Liu et al. | Forward-looking imaginative planning framework combined with prioritized-replay double DQN | |
CN105117616B (en) | Microbial fermentation optimization method based on particle cluster algorithm | |
CN110888323A (en) | Control method for intelligent optimization of switching system | |
CN110450164A (en) | Robot control method, device, robot and storage medium | |
CN110794825A (en) | Heterogeneous stage robot formation control method | |
CN115618497A (en) | Aerofoil optimization design method based on deep reinforcement learning | |
CN113919217B (en) | Adaptive parameter setting method and device for active disturbance rejection controller | |
CN113759929B (en) | Multi-agent path planning method based on reinforcement learning and model predictive control | |
CN115903901A (en) | Output synchronization optimization control method for unmanned cluster system with unknown internal state |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||