CN112597693A - Self-adaptive control method based on depth deterministic strategy gradient - Google Patents

Self-adaptive control method based on depth deterministic strategy gradient

Info

Publication number
CN112597693A
CN112597693A CN202011297651.3A CN202011297651A
Authority
CN
China
Prior art keywords
network, training, value, actor, critic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011297651.3A
Other languages
Chinese (zh)
Inventor
卢旺
孟凡石
孙继泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Hangsheng Technology Co ltd
Original Assignee
Shenyang Hangsheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Hangsheng Technology Co ltd filed Critical Shenyang Hangsheng Technology Co ltd
Priority to CN202011297651.3A
Publication of CN112597693A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides an adaptive control method based on the deep deterministic policy gradient. A simulation training environment is constructed according to the characteristics of the real system; the state (observed quantities), reward function, termination condition, and action are constructed; a critic network, an actor network, and the corresponding target networks of the deep deterministic policy gradient method are constructed and trained through trial-and-error interaction with the simulation training environment; the trained actor network is used as the controller for the system. The invention applies the deep reinforcement learning method to controller design, presents the implementation steps of the method, and, after the controller requirements are met through offline simulation training, transplants the controller into the real environment to realize adaptive control of the nonlinear system.

Description

Self-adaptive control method based on depth deterministic strategy gradient
Technical field:
The application relates to the technical field of computer software, and in particular to an adaptive control method based on deep reinforcement learning.
Background art:
A traditional PID controller requires accurate modeling of the system: the time-domain model is converted into a frequency-domain transfer function through the Laplace transform, and the classical PID controller is then designed using methods such as the root locus.
The problem to be solved by the application is the control of nonlinear systems: constructing a model-free control method that does not depend on an accurate mathematical model.
Disclosure of Invention
The application aims to provide an adaptive control method based on deep reinforcement learning that solves the control problem of nonlinear systems by constructing a model-free control method that does not depend on an accurate mathematical model.
The technical scheme of the application is an adaptive control method based on the deep deterministic policy gradient, which comprises the following steps:
1) First, a simulation training environment is established according to the characteristics of the real system; the simulation training environment is consistent with the real system and interacts with the reinforcement learning training;
2) The state, reward, action, and termination condition are constructed as the training elements of deep reinforcement learning, where the action interval is a ∈ [A_min, A_max] and the control command is amplitude-limited according to the real system;
3) A critic network, an actor network, and the corresponding critic-target and actor-target networks are constructed, which together form the neural network;
4) The critic network and the actor network are trained for a number of episodes; after the current episode finishes, the next episode of training starts;
5) The trained actor network is used as the controller.
The state, reward, action, and termination condition are defined as follows. State: the current value (true), the error value error = reference - true, and the integral of the error ∫e dt are taken as the state quantity;
Reward: the reward is -100 if the actual value falls below the minimum min or above the maximum max (true ≤ min || true ≥ max); if the absolute value of the error is greater than 0.1, the reward is -1; if the absolute value of the error is less than 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated.
The process of training the critic network and the actor network comprises the following steps:
a) Initialize the critic network parameters θ^Q and the actor network parameters θ^μ, and copy them to the critic-target and actor-target networks; initialize the experience pool;
Next, M training episodes are started:
b) The actor selects an action according to the actor network and passes it to the environment: a_t = μ(s_t | θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic (Ornstein-Uhlenbeck) process;
c) After the environment executes the action, it returns the reward r_t and the new state s_{t+1};
d) Store (s_t, a_t, r_t, s_{t+1}) in the experience pool, and randomly sample N transitions as a mini-batch for network training (a replay-buffer sketch is given after this list);
e) Calculate the loss of the critic network according to the formula:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))^2
y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | θ^μ') | θ^Q')
f) Update θ^Q with the Adam optimizer;
g) Calculate the policy gradient of the actor network:
∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
h) Update θ^μ with the Adam optimizer;
i) Update the actor-target and critic-target networks with a soft update:
θ^Q' ← τ θ^Q + (1 - τ) θ^Q'
θ^μ' ← τ θ^μ + (1 - τ) θ^μ'
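For illustration only, the experience pool of steps a) and d) can be sketched as a simple replay buffer in Python. The class name ReplayBuffer, the capacity, and the batch size are illustrative assumptions and are not specified by the application; this is a minimal sketch, not the claimed implementation.

    import random
    from collections import deque

    import numpy as np


    class ReplayBuffer:
        """Fixed-size experience pool storing (s_t, a_t, r_t, s_{t+1}, done) transitions."""

        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

        def store(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Randomly sample N transitions as a mini-batch for network training
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
            return states, actions, rewards, next_states, dones

        def __len__(self):
            return len(self.buffer)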
the application has the advantages that: a simulation training environment is constructed according to the characteristics of the real system; constructing states (observed quantities), return functions, cut-off conditions and actions; constructing a critic network, an actor network and a corresponding target network of a depth certainty strategy gradient method, and training by trial-and-error interaction with a simulation training environment; the operator network training results are used as the controller for the system.
The invention applies the deep reinforcement learning method to the design of the controller, introduces the implementation steps of the method, and transplants the controller into a real environment after meeting the requirements of the controller through off-line simulation training to realize the self-adaptive control of the nonlinear system.
Drawings
FIG. 1 is a schematic diagram of an environment and reinforcement learning training interaction;
FIG. 2 is a schematic diagram of a neural network architecture;
FIG. 3 is a schematic diagram of interaction of a deep deterministic strategy gradient neural network with a training environment;
FIG. 4 is a schematic diagram of transplanting the trained actor network to the real system.
Detailed Description
The invention provides an adaptive control method based on the deep deterministic policy gradient, which mainly comprises the following steps:
1) First, a simulation training environment is constructed according to the characteristics of the real system; the simulation training environment is consistent with the real system, and the interaction between the environment and the reinforcement learning training is shown in FIG. 1.
2) The state, reward, action, and termination condition are constructed as the training elements of deep reinforcement learning (a minimal environment sketch implementing these rules is given after this step):
State: the current value (true), the error value error = reference - true, and the integral of the error ∫e dt are taken as the state quantity;
Reward: the reward is -100 if the actual value falls below the minimum min or above the maximum max (true ≤ min || true ≥ max); if the absolute value of the error is greater than 0.1, the reward is -1; if the absolute value of the error is less than 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated;
Action interval: a ∈ [A_min, A_max]; the control command is amplitude-limited according to the real system.
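The following is a minimal sketch of a simulation training environment implementing the state, reward, termination, and action-limiting rules above. The plant model plant_step, the limits A_MIN, A_MAX, MIN_VAL, MAX_VAL, the reference value, and the sampling period DT are illustrative placeholders, not values taken from the application; a real application would substitute the dynamics and limits of the actual system.

    import numpy as np

    # Illustrative constants; the real system supplies its own limits and dynamics.
    A_MIN, A_MAX = -1.0, 1.0        # action (control command) amplitude limits
    MIN_VAL, MAX_VAL = -10.0, 10.0  # valid range of the controlled quantity
    DT = 0.01                       # sampling period used for the error integral


    def plant_step(true_value, action):
        """Placeholder plant model; replace with dynamics matching the real system."""
        return true_value + DT * action


    class TrainingEnv:
        def __init__(self, reference=1.0):
            self.reference = reference
            self.reset()

        def reset(self):
            self.true = 0.0
            self.error_integral = 0.0
            return self._state()

        def _state(self):
            error = self.reference - self.true
            # State: current value, error, and integral of the error
            return np.array([self.true, error, self.error_integral], dtype=np.float32)

        def step(self, action):
            action = float(np.clip(action, A_MIN, A_MAX))   # control command amplitude limiting
            self.true = plant_step(self.true, action)
            error = self.reference - self.true
            self.error_integral += error * DT

            done = self.true <= MIN_VAL or self.true >= MAX_VAL  # termination condition
            if done:
                reward = -100.0
            elif abs(error) > 0.1:
                reward = -1.0
            else:
                reward = 10.0
            return self._state(), reward, done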
3) According to the deep deterministic policy gradient method, a critic network, an actor network, and the corresponding critic-target and actor-target networks are constructed; the structure of the neural network is shown in FIG. 2.
A schematic diagram of the interaction of the deep deterministic policy gradient neural networks with the environment is shown in FIG. 3. A minimal network-construction sketch is given below.
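A minimal PyTorch sketch of the four networks (actor, critic, and their target copies) follows. The state and action dimensions, hidden-layer sizes, and the use of tanh to bound the actor output are illustrative assumptions; the specific architecture of FIG. 2 is not reproduced here.

    import copy

    import torch
    import torch.nn as nn


    class Actor(nn.Module):
        """Deterministic policy μ(s | θ^μ): maps the state to a bounded control command."""

        def __init__(self, state_dim=3, action_dim=1, a_max=1.0, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, action_dim), nn.Tanh(),  # output in [-1, 1]
            )
            self.a_max = a_max

        def forward(self, state):
            # Scale to the action interval (symmetric limits assumed for this sketch)
            return self.a_max * self.net(state)


    class Critic(nn.Module):
        """Action-value function Q(s, a | θ^Q)."""

        def __init__(self, state_dim=3, action_dim=1, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state, action):
            return self.net(torch.cat([state, action], dim=-1))


    actor, critic = Actor(), Critic()
    # Target networks start as copies of the online networks (step a)
    actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)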
4) Training the critic network and the actor network
The critic network and the actor network are trained using the deep deterministic policy gradient algorithm; the training process is as follows:
a) Initialize the critic network parameters θ^Q and the actor network parameters θ^μ, and copy them to the critic-target and actor-target networks; initialize the experience pool;
Next, M training episodes are started:
b) The actor selects an action according to the actor network and passes it to the environment: a_t = μ(s_t | θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic (Ornstein-Uhlenbeck) process;
c) After the environment executes the action, it returns the reward r_t and the new state s_{t+1};
d) Store (s_t, a_t, r_t, s_{t+1}) in the experience pool, and randomly sample N transitions as a mini-batch for network training;
e) Calculate the loss of the critic network according to the formula:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))^2
y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | θ^μ') | θ^Q')
f) Update θ^Q with the Adam optimizer;
g) Calculate the policy gradient of the actor network:
∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
h) Update θ^μ with the Adam optimizer;
i) Update the actor-target and critic-target networks with a soft update:
θ^Q' ← τ θ^Q + (1 - τ) θ^Q'
θ^μ' ← τ θ^μ + (1 - τ) θ^μ'
After the current training episode is finished, the next episode of training is started. A sketch of one complete update step (steps d to i) is given below.
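One complete network update covering steps d) to i) can be sketched as follows. It assumes the ReplayBuffer, Actor, and Critic sketches above; the discount factor GAMMA, soft-update rate TAU, batch size, and learning rates are illustrative values not taken from the application, and masking the target with (1 - done) is a common refinement not spelled out in the formulas above.

    import torch
    import torch.nn.functional as F

    GAMMA, TAU, BATCH_SIZE = 0.99, 0.005, 64  # illustrative hyperparameters

    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)


    def ddpg_update(buffer):
        # d) Randomly sample N transitions as a mini-batch
        states, actions, rewards, next_states, dones = buffer.sample(BATCH_SIZE)
        s = torch.as_tensor(states, dtype=torch.float32)
        a = torch.as_tensor(actions, dtype=torch.float32).reshape(BATCH_SIZE, -1)
        r = torch.as_tensor(rewards, dtype=torch.float32).unsqueeze(1)
        s2 = torch.as_tensor(next_states, dtype=torch.float32)
        d = torch.as_tensor(dones, dtype=torch.float32).unsqueeze(1)

        # e) Critic loss: L = (1/N) Σ (y_i - Q(s_i, a_i | θ^Q))^2
        with torch.no_grad():
            y = r + GAMMA * (1.0 - d) * critic_target(s2, actor_target(s2))
        critic_loss = F.mse_loss(critic(s, a), y)
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()  # f) update θ^Q with Adam

        # g)/h) Policy gradient: maximize Q(s, μ(s)) by minimizing its negative mean
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()  # update θ^μ with Adam

        # i) Soft update of the target networks: θ' ← τθ + (1 - τ)θ'
        for target, online in ((actor_target, actor), (critic_target, critic)):
            for tp, p in zip(target.parameters(), online.parameters()):
                tp.data.mul_(1.0 - TAU).add_(TAU * p.data)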
5) The trained actor network is used as the controller, as sketched below.
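Deployment of the trained actor network as the controller (FIG. 4) can then be sketched as a simple control cycle. The function and constants below reuse the illustrative sketches above; reading the sensor value and applying the command to the real system are outside this sketch.

    import numpy as np
    import torch

    actor.eval()  # trained actor network from the sketches above, used as the controller


    def control_step(reference, true_value, error_integral, dt=0.01):
        """One control cycle: build the state, query the actor, clip and return the command."""
        error = reference - true_value
        error_integral += error * dt
        state = torch.tensor([[true_value, error, error_integral]], dtype=torch.float32)
        with torch.no_grad():
            action = actor(state).item()
        return float(np.clip(action, A_MIN, A_MAX)), error_integral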

Claims (3)

1. An adaptive control method based on a deep deterministic policy gradient, characterized in that it comprises the following steps:
1) first, a simulation training environment is established according to the characteristics of the real system, wherein the simulation training environment is consistent with the real system and interacts with the reinforcement learning training;
2) the state, reward, action, and termination condition are constructed as the training elements of deep reinforcement learning, wherein the action interval is a ∈ [A_min, A_max] and the control command is amplitude-limited according to the real system;
3) a critic network, an actor network, and the corresponding critic-target and actor-target networks are constructed, which together form the neural network;
4) the critic network and the actor network are trained for a number of episodes; after the current episode finishes, the next episode of training starts;
5) the trained actor network is used as the controller.
2. The adaptive control method based on the deep deterministic policy gradient according to claim 1, characterized in that the state, reward, action, and termination condition are as follows. State: the current value (true), the error value error = reference - true, and the integral of the error ∫e dt are taken as the state quantity;
Reward: the reward is -100 if the actual value falls below the minimum min or above the maximum max (true ≤ min || true ≥ max); if the absolute value of the error is greater than 0.1, the reward is -1; if the absolute value of the error is less than 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated.
3. The adaptive control method based on the deep deterministic policy gradient according to claim 1, characterized in that the process of training the critic network and the actor network comprises the following steps:
A) initialize the critic network parameters θ^Q and the actor network parameters θ^μ, and copy them to the critic-target and actor-target networks; initialize the experience pool;
next, M training episodes are started:
B) the actor selects an action according to the actor network and passes it to the environment: a_t = μ(s_t | θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic process;
C) after the environment executes the action, it returns the reward r_t and the new state s_{t+1};
D) store (s_t, a_t, r_t, s_{t+1}) in the experience pool, and randomly sample N transitions as a mini-batch for network training;
E) calculate the loss of the critic network according to the formula:
L = (1/N) Σ_i (y_i - Q(s_i, a_i | θ^Q))^2
y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | θ^μ') | θ^Q')
F) update θ^Q with the Adam optimizer;
G) calculate the policy gradient of the actor network:
∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
H) update θ^μ with the Adam optimizer;
I) update the actor-target and critic-target networks with a soft update:
θ^Q' ← τ θ^Q + (1 - τ) θ^Q'
θ^μ' ← τ θ^μ + (1 - τ) θ^μ'
CN202011297651.3A 2020-11-19 2020-11-19 Self-adaptive control method based on depth deterministic strategy gradient Pending CN112597693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011297651.3A CN112597693A (en) 2020-11-19 2020-11-19 Self-adaptive control method based on depth deterministic strategy gradient

Publications (1)

Publication Number Publication Date
CN112597693A (en) 2021-04-02

Family

ID=75183402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011297651.3A Pending CN112597693A (en) 2020-11-19 2020-11-19 Self-adaptive control method based on depth deterministic strategy gradient

Country Status (1)

Country Link
CN (1) CN112597693A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN109948642A (en) * 2019-01-18 2019-06-28 中山大学 Multiple agent cross-module state depth deterministic policy gradient training method based on image input
CN110323981A (en) * 2019-05-14 2019-10-11 广东省智能制造研究所 A kind of method and system controlling permanent magnetic linear synchronous motor
CN111079936A (en) * 2019-11-06 2020-04-28 中国科学院自动化研究所 Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LE JIANG et al.: "Path tracking control based on Deep reinforcement learning in Autonomous driving", 2019 3rd Conference on Vehicle Control and Intelligence (CVCI), pages 1-6 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113721645A (en) * 2021-08-07 2021-11-30 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle continuous maneuvering control method based on distributed reinforcement learning

Similar Documents

Publication Publication Date Title
CN108052004B (en) Industrial mechanical arm automatic control method based on deep reinforcement learning
CN110515303B (en) DDQN-based self-adaptive dynamic path planning method
CN110238839B (en) Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
CN107272403A (en) A kind of PID controller parameter setting algorithm based on improvement particle cluster algorithm
CN111898770B (en) Multi-agent reinforcement learning method, electronic equipment and storage medium
CN112215364B (en) Method and system for determining depth of enemy-friend based on reinforcement learning
CN110427006A (en) A kind of multi-agent cooperative control system and method for process industry
CN111008449A (en) Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment
Han et al. Intelligent decision-making for 3-dimensional dynamic obstacle avoidance of UAV based on deep reinforcement learning
Bianchi et al. Heuristically accelerated reinforcement learning: Theoretical and experimental results
CN114815882B (en) Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning
Ren Optimal control
CN114065929A (en) Training method and device for deep reinforcement learning model and storage medium
CN116604532A (en) Intelligent control method for upper limb rehabilitation robot
CN112597693A (en) Self-adaptive control method based on depth deterministic strategy gradient
CN116880191A (en) Intelligent control method of process industrial production system based on time sequence prediction
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN105117616B (en) Microbial fermentation optimization method based on particle cluster algorithm
CN110888323A (en) Control method for intelligent optimization of switching system
CN110450164A (en) Robot control method, device, robot and storage medium
CN110794825A (en) Heterogeneous stage robot formation control method
CN115618497A (en) Aerofoil optimization design method based on deep reinforcement learning
CN113919217B (en) Adaptive parameter setting method and device for active disturbance rejection controller
CN113759929B (en) Multi-agent path planning method based on reinforcement learning and model predictive control
CN115903901A (en) Output synchronization optimization control method for unmanned cluster system with unknown internal state

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination