CN112597693A - Self-adaptive control method based on depth deterministic strategy gradient - Google Patents
Self-adaptive control method based on depth deterministic strategy gradient
- Publication number
- CN112597693A (application CN202011297651.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- training
- value
- actor
- critic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 23
- 230000009471 action Effects 0.000 claims abstract description 16
- 238000004088 simulation Methods 0.000 claims abstract description 12
- 230000002787 reinforcement Effects 0.000 claims abstract description 11
- 238000013528 artificial neural network Methods 0.000 claims description 12
- 230000003044 adaptive effect Effects 0.000 claims description 3
- 230000008569 process Effects 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 238000005309 stochastic process Methods 0.000 claims description 3
- 230000003993 interaction Effects 0.000 abstract description 6
- 230000006870 function Effects 0.000 abstract description 3
- 238000010586 diagram Methods 0.000 description 4
- 238000013178 mathematical model Methods 0.000 description 2
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Computer Hardware Design (AREA)
- Geometry (AREA)
- Feedback Control In General (AREA)
Abstract
The invention provides an adaptive control method based on the deep deterministic policy gradient (DDPG). A simulation training environment is constructed according to the characteristics of the real system; the state (observed quantities), reward function, termination conditions and actions are constructed; the critic network, the actor network and the corresponding target networks of the deep deterministic policy gradient method are constructed and trained through trial-and-error interaction with the simulation training environment; the trained actor network is used as the controller of the system. The invention applies deep reinforcement learning to controller design, sets out the implementation steps of the method, and, once offline simulation training meets the controller requirements, ports the controller to the real environment to realize adaptive control of a nonlinear system.
Description
Technical field:
The application relates to the technical field of computer software, and in particular to an adaptive control method based on deep reinforcement learning.
Background art:
A traditional PID controller requires accurate modeling of the system: the time-domain model is converted into a frequency-domain transfer function through the Laplace transform, and the classical PID controller is designed according to methods such as the root locus.
The problem to be solved by the application is the control of nonlinear systems: constructing a model-free control method that does not depend on an accurate mathematical model.
Disclosure of Invention
The application aims to provide an adaptive control method based on deep reinforcement learning that solves the control problem of nonlinear systems by constructing a model-free control method that does not depend on an accurate mathematical model.
The technical scheme of the application is as follows: an adaptive control method based on the deep deterministic policy gradient, comprising the following steps:
1) first, a simulation training environment is established according to the characteristics of the real system; the simulation training environment is consistent with the real system and interacts with the reinforcement-learning training;
2) the state, reward, action and termination conditions are constructed as the training elements of deep reinforcement learning, with the action interval a ∈ [A_min, A_max] and control-command amplitude limiting applied according to the real system;
3) a critic network, an actor network and the corresponding critic-target and actor-target networks are constructed, each of which is a neural network;
4) the critic network and the actor network are trained for multiple episodes, the next episode starting after the current one finishes; 5) the trained actor network is used as the controller.
The state, reward, action and termination conditions are defined as follows. State: the current value true, the error error = reference − true, and the error integral ∫e dt are taken as the state quantities;
Reward: if the actual value leaves the range [min, max] (true ≤ min || true ≥ max), the reward is −100; if |error| > 0.1, the reward is −1; if |error| < 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated.
The process of training the critic network and the actor network comprises the following steps:
a) the critic network parameters θ^Q and the actor network parameters θ^μ are initialized and copied to the critic-target network (θ^{Q′}) and the actor-target network (θ^{μ′}); the experience pool is initialized;
next, M rounds of training are started:
b) the actor selects an action according to the actor network and delivers it to the environment: a_t = μ(s_t|θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic (Ornstein–Uhlenbeck) process;
c) after the environment executes the action, it returns the reward r_t and the new state s_{t+1};
d) the transition (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, and N transitions are randomly sampled as a mini-batch for network training;
e) the target value and the critic loss are calculated according to:
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′}) | θ^{Q′}), L = (1/N)Σ_i (y_i − Q(s_i, a_i|θ^Q))²
f) θ^Q is updated with the Adam optimizer;
g) the policy gradient of the actor network is calculated: ∇_{θ^μ}J ≈ (1/N)Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i};
h) θ^μ is updated with the Adam optimizer;
i) the actor-target and critic-target networks are updated by soft update: θ^{Q′} ← τθ^Q + (1−τ)θ^{Q′}, θ^{μ′} ← τθ^μ + (1−τ)θ^{μ′}, with τ ≪ 1.
the application has the advantages that: a simulation training environment is constructed according to the characteristics of the real system; constructing states (observed quantities), return functions, cut-off conditions and actions; constructing a critic network, an actor network and a corresponding target network of a depth certainty strategy gradient method, and training by trial-and-error interaction with a simulation training environment; the operator network training results are used as the controller for the system.
The invention applies deep reinforcement learning to controller design, sets out the implementation steps of the method, and, once offline simulation training meets the controller requirements, ports the controller to the real environment to realize adaptive control of a nonlinear system.
Drawings
FIG. 1 is a schematic diagram of an environment and reinforcement learning training interaction;
FIG. 2 is a schematic diagram of a neural network architecture;
FIG. 3 is a schematic diagram of the interaction of the deep deterministic policy gradient neural networks with the training environment;
FIG. 4 is a schematic diagram of transplanting the trained actor network to the real system.
Detailed Description
The invention provides an adaptive control method based on the deep deterministic policy gradient, whose main steps are as follows:
1) First, a simulation training environment is constructed according to the characteristics of the real system and is kept consistent with it; the interaction between the environment and the reinforcement-learning training is shown in FIG. 1.
2) The state, reward, action and termination conditions are constructed as the training elements of deep reinforcement learning;
State: the current value true, the error error = reference − true, and the error integral ∫e dt are taken as the state quantities;
Reward: if the actual value leaves the range [min, max] (true ≤ min || true ≥ max), the reward is −100; if |error| > 0.1, the reward is −1; if |error| < 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated;
Action interval: a ∈ [A_min, A_max], with control-command amplitude limiting applied according to the real system. A minimal code sketch of these elements follows.
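By way of illustration only, the elements above can be written as the following minimal Python sketch. All names (make_step_elements, clip_action, min_value, max_value, dt) are hypothetical, and the thresholds follow the definitions given above; this is a sketch under those assumptions, not the patent's prescribed implementation.

```python
import numpy as np

def make_step_elements(true_value, reference, error_integral, dt,
                       min_value, max_value):
    """Compute the state, reward and termination flag defined above.

    min_value/max_value are the plant limits; all names are illustrative.
    """
    error = reference - true_value
    error_integral += error * dt  # running integral of e dt

    # State: current value, error and error integral.
    state = np.array([true_value, error, error_integral], dtype=np.float32)

    # Termination: the episode ends when the value leaves [min, max].
    done = (true_value <= min_value) or (true_value >= max_value)

    # Reward as specified: -100 out of range, -1 / +10 by |error|.
    if done:
        reward = -100.0
    elif abs(error) > 0.1:
        reward = -1.0
    else:
        reward = 10.0

    return state, reward, done, error_integral

def clip_action(a, a_min, a_max):
    """Amplitude-limit the control command to the interval [A_min, A_max]."""
    return float(np.clip(a, a_min, a_max))
```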
3) According to the deep deterministic policy gradient method, a critic network, an actor network and the corresponding critic-target and actor-target networks are constructed; the structure of the neural networks is shown in FIG. 2.
A schematic diagram of the interaction of the deep deterministic policy gradient networks with the environment is shown in FIG. 3; a code sketch of one possible pair of network definitions is given below.
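The layer sizes of FIG. 2 are not reproduced here, so the following PyTorch sketch of one possible actor/critic pair rests on assumptions: the three-dimensional state from step 2), a single bounded action, and two hidden layers of 64 units are illustrative choices only.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s | theta_mu) producing one bounded action."""
    def __init__(self, state_dim=3, action_dim=1, a_max=1.0, hidden=64):
        super().__init__()
        self.a_max = a_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )

    def forward(self, s):
        return self.a_max * self.net(s)  # rescale to [-A_max, A_max]

class Critic(nn.Module):
    """Action-value function Q(s, a | theta_Q)."""
    def __init__(self, state_dim=3, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```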
4) Training the critic network and the actor network.
The critic and actor networks are trained with the deep deterministic policy gradient algorithm; the training procedure is as follows, and a minimal code sketch follows the steps:
a) the critic network parameters θ^Q and the actor network parameters θ^μ are initialized and copied to the critic-target network (θ^{Q′}) and the actor-target network (θ^{μ′}); the experience pool is initialized;
next, M rounds of training are started:
b) the actor selects an action according to the actor network and delivers it to the environment: a_t = μ(s_t|θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic (Ornstein–Uhlenbeck) process;
c) after the environment executes the action, it returns the reward r_t and the new state s_{t+1};
d) the transition (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, and N transitions are randomly sampled as a mini-batch for network training;
e) the target value and the critic loss are calculated according to:
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′}) | θ^{Q′}), L = (1/N)Σ_i (y_i − Q(s_i, a_i|θ^Q))²
f) θ^Q is updated with the Adam optimizer;
g) the policy gradient of the actor network is calculated: ∇_{θ^μ}J ≈ (1/N)Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i};
h) θ^μ is updated with the Adam optimizer;
i) the actor-target and critic-target networks are updated by soft update: θ^{Q′} ← τθ^Q + (1−τ)θ^{Q′}, θ^{μ′} ← τθ^μ + (1−τ)θ^{μ′}, with τ ≪ 1.
The current training episode then ends and the next episode begins.
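Steps a) through i) can be assembled into a training loop such as the following sketch, which assumes the Actor and Critic classes above and a hypothetical environment exposing reset() and step(); the hyperparameter values, the simplified AR(1) stand-in for the Ornstein-Uhlenbeck noise, and the (1 - done) masking of the target are illustrative assumptions rather than values fixed by the patent.

```python
import copy
import random
from collections import deque

import numpy as np
import torch

# Illustrative hyperparameters; the patent does not fix these values.
GAMMA, TAU, BATCH, M_EPISODES, MAX_STEPS = 0.99, 0.005, 64, 200, 500

def ddpg_train(env, actor, critic):
    """Minimal sketch of steps a)-i); env.reset() -> s, env.step(a) -> (s', r, done)."""
    # a) Target networks start as copies; the experience pool is initialized.
    actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
    opt_mu = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_q = torch.optim.Adam(critic.parameters(), lr=1e-3)
    pool = deque(maxlen=100_000)
    ou = 0.0  # simplified AR(1) stand-in for Ornstein-Uhlenbeck noise

    for _ in range(M_EPISODES):                       # M rounds of training
        s = env.reset()
        for _ in range(MAX_STEPS):
            # b) a_t = mu(s_t | theta_mu) + OU_t, amplitude-limited.
            ou = 0.85 * ou + np.random.normal(0.0, 0.2)
            with torch.no_grad():
                a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy() + ou
            a = np.clip(a, -actor.a_max, actor.a_max)
            # c) The environment executes the action, returns r_t and s_{t+1}.
            s2, r, done = env.step(a)
            # d) Store (s_t, a_t, r_t, s_{t+1}); sample N transitions.
            pool.append((s, a, r, s2, float(done)))
            s = s2
            if len(pool) >= BATCH:
                S, A, R, S2, D = (
                    torch.as_tensor(np.array(x), dtype=torch.float32)
                    for x in zip(*random.sample(pool, BATCH)))
                # e) y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})); terminal
                # transitions are masked, a common refinement.
                with torch.no_grad():
                    y = (R.unsqueeze(1) + GAMMA * (1.0 - D.unsqueeze(1))
                         * critic_t(S2, actor_t(S2)))
                # f) Update theta_Q with Adam on L = mean((y_i - Q)^2).
                q_loss = ((critic(S, A) - y) ** 2).mean()
                opt_q.zero_grad(); q_loss.backward(); opt_q.step()
                # g)-h) Policy gradient: ascend Q(s, mu(s)) via Adam.
                mu_loss = -critic(S, actor(S)).mean()
                opt_mu.zero_grad(); mu_loss.backward(); opt_mu.step()
                # i) Soft-update the target networks with rate tau.
                for tgt, src in ((actor_t, actor), (critic_t, critic)):
                    for pt, p in zip(tgt.parameters(), src.parameters()):
                        pt.data.mul_(1.0 - TAU).add_(TAU * p.data)
            if done:
                break
    return actor
```

Minimizing −Q(s, μ(s)) with respect to θ^μ is the usual way of realizing the policy gradient of step g): ascending Q with respect to the action is implemented as descending its negative mean.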
5) The trained actor network is used as the controller, as sketched below.
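At deployment time the exploration noise is dropped and the actor runs one inference per control cycle; the following sketch (hypothetical names, reusing the state construction of step 2)) illustrates this:

```python
import numpy as np
import torch

def control_step(actor, true_value, reference, error_integral, dt,
                 a_min, a_max):
    """One control cycle of the deployed actor; no exploration noise."""
    error = reference - true_value
    error_integral += error * dt
    state = torch.tensor([true_value, error, error_integral],
                         dtype=torch.float32)
    with torch.no_grad():
        a = actor(state).item()
    # Amplitude-limit the command exactly as during training.
    return float(np.clip(a, a_min, a_max)), error_integral
```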
Claims (3)
1. An adaptive control method based on a deep deterministic policy gradient, characterized by comprising the following steps:
1) first, establishing a simulation training environment according to the characteristics of the real system, the simulation training environment being consistent with the real system and interacting with the reinforcement-learning training;
2) constructing the state, reward, action and termination conditions as the training elements of deep reinforcement learning, the action interval being a ∈ [A_min, A_max], with control-command amplitude limiting applied according to the real system;
3) constructing a critic network, an actor network and the corresponding critic-target and actor-target networks, each of which is a neural network;
4) performing multiple episodes of training on the critic network and the actor network, the next episode starting after the current episode finishes;
5) using the trained actor network as the controller.
2. The adaptive control method based on a deep deterministic policy gradient according to claim 1, characterized in that the state, reward, action and termination conditions are defined as follows. State: the current value true, the error error = reference − true, and the error integral ∫e dt are taken as the state quantities;
Reward: if the actual value leaves the range [min, max] (true ≤ min || true ≥ max), the reward is −100; if |error| > 0.1, the reward is −1; if |error| < 0.1, the reward is +10;
Termination condition: if true ≤ min || true ≥ max, the current training episode is terminated.
3. The adaptive control method based on a deep deterministic policy gradient according to claim 1, characterized in that the process of training the critic network and the actor network comprises the following steps:
A) initializing the critic network parameters θ^Q and the actor network parameters θ^μ and copying them to the critic-target network (θ^{Q′}) and the actor-target network (θ^{μ′}); initializing the experience pool;
next, M rounds of training are started:
B) the actor selects an action according to the actor network and delivers it to the environment: a_t = μ(s_t|θ^μ) + OU_t, where OU_t is exploration noise generated by a stochastic (Ornstein–Uhlenbeck) process;
C) after the environment executes the action, it returns the reward r_t and the new state s_{t+1};
D) the transition (s_t, a_t, r_t, s_{t+1}) is stored in the experience pool, and N transitions are randomly sampled as a mini-batch for network training;
E) calculating the target value and the critic loss according to:
y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^{μ′}) | θ^{Q′}), L = (1/N)Σ_i (y_i − Q(s_i, a_i|θ^Q))²
F) updating θ^Q with the Adam optimizer;
G) calculating the policy gradient of the actor network: ∇_{θ^μ}J ≈ (1/N)Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i};
H) updating θ^μ with the Adam optimizer;
I) updating the actor-target and critic-target networks by soft update: θ^{Q′} ← τθ^Q + (1−τ)θ^{Q′}, θ^{μ′} ← τθ^μ + (1−τ)θ^{μ′}.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011297651.3A CN112597693A (en) | 2020-11-19 | 2020-11-19 | Self-adaptive control method based on depth deterministic strategy gradient |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011297651.3A CN112597693A (en) | 2020-11-19 | 2020-11-19 | Self-adaptive control method based on depth deterministic strategy gradient |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112597693A true CN112597693A (en) | 2021-04-02 |
Family
ID=75183402
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011297651.3A Pending CN112597693A (en) | 2020-11-19 | 2020-11-19 | Self-adaptive control method based on depth deterministic strategy gradient |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112597693A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108600379A (en) * | 2018-04-28 | 2018-09-28 | 中国科学院软件研究所 | A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient |
CN109948642A (en) * | 2019-01-18 | 2019-06-28 | 中山大学 | Multiple agent cross-module state depth deterministic policy gradient training method based on image input |
CN110323981A (en) * | 2019-05-14 | 2019-10-11 | 广东省智能制造研究所 | A kind of method and system controlling permanent magnetic linear synchronous motor |
CN111079936A (en) * | 2019-11-06 | 2020-04-28 | 中国科学院自动化研究所 | Wave fin propulsion underwater operation robot tracking control method based on reinforcement learning |
Non-Patent Citations (1)
Title |
---|
LE JIANG et al.: "Path tracking control based on Deep reinforcement learning in Autonomous driving", 2019 3rd Conference on Vehicle Control and Intelligence (CVCI), pages 1-6 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113721645A (en) * | 2021-08-07 | 2021-11-30 | 中国航空工业集团公司沈阳飞机设计研究所 | Unmanned aerial vehicle continuous maneuvering control method based on distributed reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108052004B (en) | Industrial mechanical arm automatic control method based on deep reinforcement learning | |
CN110515303B (en) | DDQN-based self-adaptive dynamic path planning method | |
CN110238839B (en) | Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction | |
CN107272403A (en) | A kind of PID controller parameter setting algorithm based on improvement particle cluster algorithm | |
CN111898770B (en) | Multi-agent reinforcement learning method, electronic equipment and storage medium | |
CN112215364B (en) | Method and system for determining depth of enemy-friend based on reinforcement learning | |
CN110427006A (en) | A kind of multi-agent cooperative control system and method for process industry | |
CN111008449A (en) | Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment | |
Han et al. | Intelligent decision-making for 3-dimensional dynamic obstacle avoidance of UAV based on deep reinforcement learning | |
Bianchi et al. | Heuristically accelerated reinforcement learning: Theoretical and experimental results | |
CN114815882B (en) | Unmanned aerial vehicle autonomous formation intelligent control method based on reinforcement learning | |
Ren | Optimal control | |
CN114065929A (en) | Training method and device for deep reinforcement learning model and storage medium | |
CN116604532A (en) | Intelligent control method for upper limb rehabilitation robot | |
CN112597693A (en) | Self-adaptive control method based on depth deterministic strategy gradient | |
CN116880191A (en) | Intelligent control method of process industrial production system based on time sequence prediction | |
Liu et al. | Forward-looking imaginative planning framework combined with prioritized-replay double DQN | |
CN105117616B (en) | Microbial fermentation optimization method based on particle cluster algorithm | |
CN110888323A (en) | Control method for intelligent optimization of switching system | |
CN110450164A (en) | Robot control method, device, robot and storage medium | |
CN110794825A (en) | Heterogeneous stage robot formation control method | |
CN115618497A (en) | Aerofoil optimization design method based on deep reinforcement learning | |
CN113919217B (en) | Adaptive parameter setting method and device for active disturbance rejection controller | |
CN113759929B (en) | Multi-agent path planning method based on reinforcement learning and model predictive control | |
CN115903901A (en) | Output synchronization optimization control method for unmanned cluster system with unknown internal state |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||