CN108008627B

CN108008627B - Parallel optimization reinforcement learning self-adaptive PID control method

Info

Publication number: CN108008627B
Application number: CN201711325553.4A
Authority: CN
Inventors: 孙歧峰; 任辉; 段友祥; 李洪强
Original assignee: China University of Petroleum East China
Current assignee: China University of Petroleum East China
Priority date: 2017-12-13
Filing date: 2017-12-13
Publication date: 2022-10-28
Anticipated expiration: 2037-12-13
Also published as: CN108008627A

Abstract

The invention discloses a parallel optimized reinforcement learning self-adaptive PID control method, which is characterized by comprising the following steps: step S1: discretizing a transfer function by using matlab software through a zero-order keeper method, and initializing controller parameters and M control threads to perform parallel learning; step S2: defining a transfer function for transferring the input signal to S1, calculating an output value, and taking the difference value of the input signal and the output signal as an input vector of a control algorithm; and step S3: transferring the input vector to an improved self-adaptive PID controller for training, and iterating for N times to obtain a trained model; and step S4: carrying out control test by using the trained model, and recording input and output signals and the change value of a PID parameter; step S5: visual test data and control effect comparison. The invention better solves the problems of the prior self-adaptive PID, and improves the stability and the learning efficiency of the algorithm by utilizing the characteristic of multi-thread parallel learning of A3C learning.

Description

Parallel optimization reinforcement learning self-adaptive PID control method

Technical Field

The invention relates to a self-adaptive PID control method, belongs to the technical field of control, and particularly relates to an improved self-adaptive PID (proportional-integral-derivative) control algorithm based on a parallel optimized actuator evaluator.

Background

A PID (Proportional/Integral/Differential) control system is a linear controller, which is controlled according to the deviation principle, and has the advantages of simple principle, strong robustness, simple setting, no need of obtaining an accurate mathematical model of an object, and the like, so that the PID control system is the most commonly used control system in industrial control. In the engineering practice of PID control system parameter setting, particularly the engineering practice of PID control parameter setting of a linear, time-invariant and weak time-lag system, the traditional setting method obtains abundant experience and is widely applied. However, in the practical industrial process control engineering practice, many controlled objects have the characteristics of time-varying uncertainty, pure hysteresis and the like, and the control process mechanism is complex; under the influence of factors such as noise and load disturbance, process parameters and even model structures can change. Therefore, online adjustment of PID parameters is required to meet the requirement of real-time control. Under the condition, the traditional parameter setting method is difficult to meet the requirements of engineering practice, and shows great limitations.

Adaptive PID control techniques are an effective way to address such problems. The adaptive PID control model takes advantage of both the adaptive control philosophy and the conventional PID controller. Firstly, the method is an adaptive controller and has the advantages of automatically identifying a controlled process, automatically setting controller parameters, adapting to parameter change of the controlled process and the like; and secondly, the controller has the advantages of simple structure, good robustness, high reliability and the like of the conventional PID controller. Due to the advantages, the industrial process control device is ideal in engineering practice. After the adaptive PID control is proposed, the fuzzy adaptive PID controller, the neural network adaptive PID controller and the Actor-Critic adaptive PID controller are successively proposed under the research of a wide range of scholars.

For example, document 1: liu Guo Rong, yang Xianhui fuzzy adaptive PID controller [ J ] control and decision, 1995 (6) proposed a fuzzy rule based adaptive PID controller, whose main idea is: when the system gives sudden change, state interference or structure interference, the transient response can be divided into 9 conditions, and after the system response is obtained at each sampling moment, the control strength can be properly increased or decreased by using a fuzzy control method according to the given condition and the change trend of the system response at the moment and the existing system control knowledge so as to prevent the response from changing towards the given direction and enable the output to tend to be given as soon as possible. However, the control method requires the experience of professionals and parameter optimization to control a complex system, and the inaccurate control effect set by the fuzzy rule cannot achieve the satisfactory effect.

Document 2 discloses a study [ J ] of BP neural network-based PID parameter self-tuning, a systematic simulation report, 2005 proposes adaptive PID control based on BP neural network, and its control idea is: the neural network identifier transmits the control deviation back to the neural network neurons to correct the weight of the neural network, the deviation between the set input of the object and the actual output of the object is reversely transmitted to the neural network controller after passing through the identifier, the neural network identifier corrects the network weight by using an error signal, and the neural network identifier can gradually follow the change of a system after being learned for many times. The method generally adopts supervised learning to optimize parameters, but teacher signals are difficult to acquire.

Document 3 shows a loose learning, populus, adaptive PID control [ J ] based on actuator-evaluator learning, a control theory and application, 2011 proposes adaptive PID control of an Actor-critical structure. The control idea is as follows: the PID parameter is adaptively adjusted by using the model-free online learning capability of AC learning, the strategy function of the Actor and the value function of Critic are learned simultaneously by adopting one RBF network, the defect that the traditional PID controller is difficult to adjust the parameter online in real time is overcome, and the method has the advantages of strong response speed adaptive capability and the like. But instability of the AC learning structure itself often causes the algorithm to be difficult to converge.

Patent CN201510492758 discloses an actuator adaptive PID control method, which combines an expert PID controller and a fuzzy PID controller and is respectively connected with an actuator, and the actuator selects the expert PID controller or the fuzzy PID controller according to current state information and expected information, and although this controller can reduce overshoot and has the characteristic of high control precision, this controller still needs a lot of prior knowledge of professionals to make a decision on the use of the controller.

Disclosure of Invention

The invention aims at: aiming at the characteristics of adaptive PID control, a method of adaptive PID control (A3C) based on parallel optimization actuator evaluator learning is provided, and the method is used for controlling a system in industry. The invention better solves the problems of the prior self-adaptive PID, and improves the stability and the learning efficiency of the algorithm by utilizing the characteristic of multi-thread parallel learning of A3C learning. The adaptive PID controller based on A3C has the advantages of high response speed, strong adaptive capacity, strong anti-interference capacity and the like.

The self-adaptive PID control method based on the parallel optimization and the learning of the actuator evaluator comprises the following steps:

step S1: using MATLAB (MATLAB, commercial mathematical software produced by MathWorks company in America) software to define a continuous transfer function of any order of a controlled system, discretizing the continuous transfer function by a zero-order keeper method to obtain a discretized transfer function of a user-defined time interval, initializing controller parameters and M control threads to perform parallel learning, wherein the parameters mainly comprise BP neural network parameters and PID control environment parameters, and each thread is an independent control Agent;

step S2: in step S1, after initializing a BP neural network weight parameter and a control object of a PID controller, defining a discrete input signal RIN, sequentially transmitting the discrete input signal into a discrete transfer function according to a defined time interval, calculating an output value of the transfer function, and taking a difference value of the input signal and the output signal as an input vector x (t) of an A3C adaptive PID control algorithm;

and step S3: transmitting the input vector x (t) obtained in the step S2 into a built A3C self-adaptive PID control system for iterative training, and obtaining a trained model after iterating for N times;

step S31: calculating the current error e (t), the primary error delta e (t) and the secondary error delta e ² (t) as input vector x (t) = [ e (t), Δ ² e(t)] ^T And normalizing the same by a sigmod function;

step S32: and transmitting the input vector to an Actor network of each thread, and obtaining a new parameter of the PID. The Actor network outputs the mean value and the variance of Gaussian distribution of three parameters of the PID instead of directly outputting the parameter value of the PID, and estimates the three parameter values through the Gaussian distribution of the three parameters, wherein when o =1,2,3, the output layer outputs the mean value of the PID parameters, and when o =4,5,6, the output layer outputs the variance of the PID parameters.Wherein the Actor network is a BP neural network with 3 layers: input with layer 1 as input layer and layer 2 as hidden layer

Output ho of hidden layer _k (t)＝min(max(hi _k (t),0),6)k＝1,2,3…20

Layer 3 is the output layer, the input of which

Output of the output layer

Step S33: and the new PID parameters are given to the controller to obtain control output, the control error is calculated, and the reward value is calculated according to the environment reward function R (t). R (t) = alpha ₁ r ₁ (t)+α ₂ r ₂ (t)

Vector value x' (t) to the next state;

step S34: the reward function R (t), the current state vector x (t), and the next state vector x' (t) are passed to the Critic network, which is similar to the Actor network except that there is only one output node. Critic network outputs the state values and calculates the TD error, δ _TD ＝r(t)+γV(S _t+1 ,W _v ′)-V(S _t ,W _v ′)；

Step S35: after the TD error is calculated, each Actor-critical network in the A3C structure does not directly update the own network weight, but updates Actor-critical network parameters stored in a central brain (Global-net) by using the own gradient, wherein the updating mode is that

W _v ＝W _v +α _c dW _v Wherein W is _a Actor network weight, W 'stored for central brain' _a Weight of Actor network for each AC fabric, W _v Critic network weight, W 'stored by the Central brain' _v Critic network weight, α, representing each AC structure _a Alpha is the learning rate of Actor _c For Critic learning rate, the central brain will pass an up-to-date parameter to each AC structure after update;

step S36: in order to complete a training process, the loop is iterated for N times, the training is quitted, and the model is saved.

And step S4: carrying out control test by using the trained model, and recording the change values of the input signal, the output signal and the PID parameter;

step S41: transmitting the input signal defined in the step S1 to a control model of the thread with the highest trained reward function;

step S42: s41, calculating current, primary and secondary errors as input vectors, inputting the input vectors into the selected control model, wherein the difference from the training process is that only PID parameter adjustment quantity output by an Actor network is needed, and the adjusted PID parameter is transmitted to a controller to obtain the output of the controller;

step S43: the input signal, the output signal, and the PID parameter variation value obtained in step S42 are saved.

Step S5: and (4) visualizing the experimental data obtained in the step S4 by using Matlab, wherein the experimental data comprise input signals, output signals and change values of PID parameters of the controller, and comparing the change values with fuzzy adaptive PID control and AC-PID adaptive PID control to control the effect.

Drawings

FIG. 1 is a schematic process flow diagram of the present invention.

FIG. 2 is a block diagram of an improved adaptive PID controller

FIG. 3 is an output signal of the improved controller using a step signal as an input signal

FIG. 4 shows the control quantity of the controller after improvement

FIG. 5 is a control error of the improved adaptive PID controller

FIG. 6 is a parameter adjustment curve of the A3C adaptive PID controller

FIG. 7 is a comparison of an improved controller with a fuzzy, AC architecture adaptive PID controller

FIG. 8 comparison and analysis of control experiments of different controllers

Detailed Description

The invention is further described below using MATLAB software in conjunction with FIGS. 1-5: the specific implementation scheme of the adaptive PID control based on the parallel optimization and the learning of the actuator evaluator comprises the following steps as shown in FIG. 1:

(1) And initializing parameters. Is selected by the control system as

A third-order transfer function, the discrete time is set to be 0.001s, and the transfer function after discretization by adopting Z change is as follows: youtt (k) = -den (2) youtt (k-1) -den (3) youtt (k-2) -den (4) youtt (k-1) + num (2) u (k-1) + num (3) u (k-2) + num (4) u (k-3), the input signal is a step signal with the value equal to 1.0, the number of single training steps is 1000 steps, the time is 1.0s, 4 threads are initialized to represent 4 independent adaptive PID controllers, and training is carried out.

(2) An input vector is calculated. E (t) = rin (0) -yurt (0) =1.0 when t =0; e (t-1) =0; e (t-2) =0 input vector x (t) = [ e (t), Δ ² e(t)] ^T Wherein e (t) = rin-yourt =1.0 Δ e (t) = e (t) -e (t-1) =1.0 Δ ² e (t) = e (t) -2 × e (t-1) + e (t-2) =1.0; calculated x (t) = [1.0,1.0] ^T Normalizing by sigmod function to obtain final input vector x [ t ]]＝[0.73，0.73，0.73] ^T 。

(3) And (5) training the model. The improved adaptive PID controller structure is shown in FIG. 2, after calculating the state vector, firstly transmitting the state vector to the Actor network, the Actor network outputting the mean value mu and variance sigma of the three parameters P, I and D, obtaining the actual parameter values of P, I and D according to Gaussian sampling, assigning the new parameter values to the incremental PID controller, and the controller calculating the control quantity u (t) according to the error and the new PID parameter

u(t)＝u(t-1)+Δu(t)＝u(t-1)+K _I (t)e(t)+K _P (t)Δe(t)+K _D (t)Δ ² e(t)

The discrete transfer function acted by the control quantity calculates an output signal value yourt (t + 1), an error value and a state vector at the next time t +1 according to the process of (1). In addition, the environment reward function calculates the reward value of the control Agent according to the error, and the reward function is as follows:

R(t)＝α ₁ r ₁ (t)+α ₂ r ₂ (t)

where α 1=0.6, α 2=0.4, e (t) =0.001

The reward function is an important component of reinforcement learning, after the reward value is obtained, the reward value and the state vector of the next moment are transmitted to the criticic network, the criticic network outputs the state values of the t moment and the t +1 moment, and the TD error is calculated, wherein the calculation formula is as follows: delta _TD ＝r(t)+γV(S _t+1 ,W _v ′)-V(S _t ,W _v ′)，W _v ' is Critic network weight. Because the operation speeds of the threads are not synchronous, each controller does not fix the sequence to update the Actor network and Critic network parameters stored in Global Net in fig. 2, and the update formula is as follows:

wherein W _a Actor network weight, W 'stored by the Central brain' _a Weight of Actor network for each AC fabric, W _v Critic network weight, W 'stored for Central brain' _v Critic network weight, α, representing each AC structure _a =0.001 learning rate of Actor, α _c Where 0.01 is Critic's learning rate, the algorithm reaches a steady state after 3000 iterations after one training has been completed.

(4) And collecting experimental data. And (3) using the trained controller model, and selecting the thread with the highest accumulated reward as a test controller when controlling the test because 4 threads are set for control training. And (4) carrying out control test according to the control parameters set in the step (1), wherein the control time is 1s, namely, carrying out 1000 times of control. According to the calculation mode in the step (2), calculating a state vector, transmitting the state vector into a trained model, enabling the Critic network to be out of action in the control test process, outputting P, I and D parameter values by an Actor, and storing the values of yourt, rin, u, P, I and D for visual analysis in the control test process.

(5) And (6) visualizing the data. And (3) performing visualization analysis on the data stored in the step (4) by using a matlab software visualization tool: as shown in FIG. 3, FIG. 3 shows _y With an output value of outt, the controller can reach a steady state in less than 0.2s and has a fast regulation capability. Fig. 4 shows the output signal of the control variable of the controller, from which it can be derived that the controller can reach a stable state very quickly. Fig. 5 shows the control error of the controller, where the control error is equal to the input signal amount minus the output signal amount. Fig. 6 shows the variation of the parameters of the controllers P, I, and D, and it can be seen that there are different degrees of adjustment for the 3 parameters before the system is stabilized, and the parameters are not changed after the system is stabilized. The same control object and input signals are used for carrying out experimental comparison on the fuzzy adaptive PID controller and the Actor-critical adaptive PID controller, the signal output comparison diagram of the three controllers can be seen as an attached figure 7, the detailed control analysis can be seen as an attached figure 8, and as shown in the figure 8, the controller of the invention has smaller overshoot but faster response speed as the fuzzy PID controller while not needing too much professional priori knowledge, and has higher learning speed than the AC-PID controller, and the overshoot and the response speed both have great advantages.

The invention aims to solve the problems of the traditional adaptive PID controller, the fuzzy adaptive PID and the expert adaptive PID controller need related knowledge of a large number of professionals, and teacher signals of the neural network adaptive PID controller are difficult to obtain. And the learning algorithm is used for parallel learning in multiple threads of the CPU, so that the learning rate of the AC-PID controller is greatly improved, and a better control effect is achieved. The specific control effect comparison can be seen in fig. 7, and fig. 7 shows three controllers selected: the fuzzy PID controller, the AC-PID controller and the A3C-PID controller of the invention carry out control comparison under the same parameters, and detailed control analysis can be seen in the attached figure 8: the controller of the invention does not need too many professionals' prior knowledge, has smaller overshoot but faster response speed as the fuzzy PID controller, and has higher learning speed than the AC-PID controller, and both overshoot and response speed have great advantages.

The present invention is not limited to the above embodiments, and various other equivalent modifications, substitutions and alterations can be made without departing from the basic technical concept of the invention according to the common technical knowledge and conventional means in the field.

Claims

1. A reinforcement learning self-adaptive PID control method for parallel optimization is characterized by comprising the following steps:

step S1: using MATLAB software to define an arbitrary order continuous transfer function of a controlled system, discretizing the transfer function by a zero order keeper method to obtain a discretized transfer function with a custom time interval, initializing controller parameters and M control threads for parallel learning, wherein the parameters mainly comprise BP neural network parameters and PID control environment parameters, and each thread is an independent control Agent;

step S2: after initializing a BP neural network weight parameter and a control object of a PID controller, defining a discrete input signal RIN, sequentially transmitting the discrete input signal into a discrete transfer function according to a defined time interval, calculating an output value of the transfer function, and taking a difference value of the input signal and the output signal as an input vector x (t) of an A3C self-adaptive PID control algorithm;

step S31: calculating the current error e (t), the primary error delta e (t) and the secondary error delta e ² (t) as input vector x (t) = [ e (t), Δ ² e(t)]T, and normalizing the T by a sigmod function;

step S32: the method comprises the steps of transmitting an input vector to an Actor network of each thread and obtaining new parameters of the PID, wherein the Actor network does not directly output parameter values of the PID but outputs the mean value and variance of Gaussian distribution of three parameters of the PID, the three parameter values are estimated through the Gaussian distribution of the three parameters, when o =1,2,3, the mean value of the PID parameters is output by an output layer, and when o =4,5,6, the variance of the PID parameters is output, wherein the Actor network is a BP neural network with 3 layers: input with layer 1 as input layer and layer 2 as hidden layer

The output hok (t) = min (max (hik (t), 0), 6) k =1,2,3 \823020of the hidden layer,

layer 3 is the output layer, the input of which

Output of the output layer

Step S33: giving a new PID parameter to a controller to obtain control output, calculating a control error, and calculating a reward value according to an environment reward function R (t), wherein R (t) = alpha 1R1 (t) + alpha 2R2 (t) until a vector value x' (t) of a next state;

step S34: passing the reward function R (t), the current state vector x (t), and the next state vector x ' (t) to a Critic network, which is similar in structure to the Actor network except that there is only one output node, the Critic network primarily outputs the state value and calculates the TD error, δ TD = R (t) + γ V (St +1, wv ') -V (St, wv ');

step S35: after the TD error is calculated, each Actor-critical network in the A3C structure does not directly update its own network weight, but updates Actor-critical network parameters stored in the central brain (Global-net) by using its own gradient, in a manner of Wat +1= Wat + α adWat, wvt +1= Wvt + α cdWvt, where t and t +1 represent different times, wa is the Actor network weight stored in the central brain,

w' a is the weight of the Actor network for each AC fabric, wv is the Critic network weight stored by the central brain,

w' v represents a Critic network weight of each AC structure, alpha a is the learning rate of the Actor, alpha c is the learning rate of the Critic, and after updating, the central brain transmits a latest parameter to each AC structure;

step S36: in order to complete a training process, the loop is iterated for N times, the training is quitted, and the model is saved;

2. The robust learning adaptive PID control method of the parallel optimization according to the claim 1, wherein the step S4 comprises the following steps: