CN116822346A

CN116822346A - Sewage treatment nitrate nitrogen concentration control method based on Q learning

Info

Publication number: CN116822346A
Application number: CN202310713223.1A
Authority: CN
Inventors: 王鼎; 王元; 赵明明
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2023-09-29

Abstract

The invention relates to a Q-learning-based method for controlling nitrate nitrogen concentration in sewage treatment. In a sewage treatment system, making the nitrate nitrogen concentration track a desired trajectory is an important control objective. To address this tracking problem, the invention provides a Q-learning-based trajectory tracking control method that reduces the dependence on system model information and is used to design tracking control of the nitrate nitrogen concentration in the sewage treatment process. Following the adaptive dynamic programming approach, a Q-learning algorithm framework is established, neural networks are trained to solve the optimal tracking control problem, and the method is verified on the BSM1 (Benchmark Simulation Model No. 1) simulation model. The invention enables the nitrate nitrogen concentration to track the desired trajectory more accurately, thereby achieving effective control of the sewage treatment process.

Description

Sewage treatment nitrate nitrogen concentration control method based on Q learning

Technical Field

The invention belongs to the field of sewage treatment.

Background

The activated sludge process is the most common treatment process used in sewage treatment plants in China. Its main principle is to cultivate a suitable growth environment for various microorganisms in a biochemical tank and then aerate it so that flocculent activated sludge forms. Within the activated sludge, microorganisms carry out reactions such as nitrification, denitrification, ammonification and oxidation, thereby decomposing the organic matter in the sewage. Finally, sludge and water are separated by sedimentation, and the clear supernatant is discharged from the system for reuse. Overall, a sewage treatment system is a highly complex nonlinear system whose variables are strongly coupled, which makes it difficult to build a system model. The daily influent flow of a sewage treatment plant also fluctuates considerably with weather and season. The variables measured during the reaction exhibit a large time lag, so the control action generated by the controller is delayed as well. These characteristics make the process difficult to control. In addition, under present conditions there is still a gap between the theory of sewage treatment control and its practical application. Studying the control principles of the sewage treatment process in depth, and in particular designing advanced control methods with good control performance so that plants can operate stably and efficiently, is therefore very important for easing China's water resource problems and for sustainable development. The invention aims to design an advanced self-learning optimal tracking control technique for solving the optimal control problem of a sewage treatment system.

The idea of reinforcement learning derives from the human trial-and-error learning mechanism: the controller updates its own control strategy by learning from reward and punishment signals obtained through interaction with the environment. To address the "curse of dimensionality" that arises in the control of high-dimensional complex systems, the adaptive dynamic programming (ADP) method was developed. ADP integrates ideas from dynamic programming, neural networks and reinforcement learning. In essence, it uses online or offline data and a neural network to estimate the performance index function of the system, and then obtains an optimal or near-optimal control strategy according to the principle of optimality. Whereas dynamic programming solves the Hamilton-Jacobi-Bellman (HJB) equation backward in time, ADP obtains the optimal solution forward in time, which alleviates the curse of dimensionality. The essence of sewage treatment control is to use a series of control methods to make the key variables of the system track their desired values, so that the effluent quality meets the standard. The optimal tracking control problem is an important research topic in ADP, and its control objective is to make the state of the controlled system track a given desired trajectory. The error between the state of the controlled system and the desired trajectory can usually be treated as a new state, so that the optimal tracking problem is converted into an optimal regulation problem. The theory of ADP has been studied widely, but its application to sewage treatment control has received less attention. In particular, existing sewage treatment studies generally use the traditional model-based heuristic dynamic programming (HDP) method from ADP, and few tracking control methods based on model-free Q learning are available.

Against this background, the invention provides a Q-learning-based optimal tracking control method for complex sewage treatment systems. A neural network framework of the execution-evaluation (actor-critic) type is constructed to approximate the Q function and improve the control strategy, which yields model-free optimal tracking control; the method is applied to nitrate nitrogen concentration control in the sewage treatment process in sunny weather. First, the invention simulates the whole sewage treatment process with the Benchmark Simulation Model No. 1 (BSM1) for activated sludge sewage treatment, jointly developed by the International Association on Water Quality (IAWQ) and the European Cooperation in Science and Technology (COST). BSM1 reproduces the components, the microbial reaction processes and the activated sludge process of sewage treatment relatively accurately, which improves the reliability of the experimental results. Second, the controlled system and the desired trajectory are combined into an augmented system, so that the optimal tracking problem for the nitrate nitrogen concentration is converted into an optimal regulation problem for the augmented system. Finally, the Q-learning-based optimal tracking control method is applied to nitrate nitrogen concentration control in the sewage treatment process, which achieves more accurate control and improves the effluent quality.

A sewage treatment system is a complex nonlinear system. To simulate the sewage treatment process more faithfully, the invention uses the high-accuracy BSM1 to simulate the entire activated sludge treatment process. The activated sludge process consists of two main parts, a biochemical reaction tank and a secondary sedimentation tank. As shown in FIG. 1, sewage first enters the anoxic zone of the biochemical reaction tank, where microorganisms reduce the nitrate nitrogen in the water to nitrogen gas, which is released; that is, the denitrification reaction removes nitrogen, and the organic matter in the sewage is preliminarily degraded. In the aerobic zone, oxygen is supplied to the aeration tank, and with sufficient oxygen the ammonia nitrogen in the sewage is converted into nitrate and nitrite; that is, the nitrification reaction takes place, so that organic nitrogen compounds are removed. After the biochemical reactions, the sewage enters the secondary sedimentation tank for sludge-water separation. The supernatant is the treated clean water; part of the settled sludge is returned to the biochemical reaction tank to take part in the reaction again, and the rest is discharged from the system.

According to the treatment flow of the activated sludge process, the quality of the treatment result depends on whether the biochemical reactions proceed sufficiently, and this in turn is strongly influenced by the concentration of each component. Among these, the nitrate nitrogen concentration S_NO,2 of the second partition is a key factor: it mainly takes part in the nitrification and denitrification reactions in the biochemical reaction tank and affects nitrogen removal and hence the effluent quality. The nitrate nitrogen concentration is mainly related to the influent composition and the internal reflux quantity Q_a of the sewage. If the nitrate nitrogen concentration in the anoxic zone is too low, the denitrification reaction slows down and the organic matter in the sewage is degraded incompletely; if it is too high, the water may become over-enriched with nutrients and the denitrification efficiency decreases. Effective control of the nitrate nitrogen concentration in the biochemical reaction tank is therefore essential for ensuring the sewage treatment effect.

Disclosure of Invention

The detailed steps of the design of the tracking controller for the nitrate nitrogen concentration are as follows.

Step 1. Convert the nitrate nitrogen concentration tracking control problem in sewage treatment into an optimal regulation problem.

Given the characteristics of sewage treatment, the process can generally be described by the following discrete-time nonlinear non-affine system:

x(k+1)＝P(x(k),u(k)) (1)

where k ∈ ℕ and ℕ denotes the natural numbers; x(k) is the nitrate nitrogen concentration S_NO,2 of the second partition at the current time k; u(k) is the internal reflux quantity at time k; P(·,·) is an unknown nonlinear function representing the system dynamics.

The desired trajectory of the nitrate nitrogen concentration is defined as

d(k+1)＝φ(d(k)) (2)

where φ(·) is the function that generates the desired trajectory. Two cases are considered in the invention, a constant expected value and a dynamic expected value. In the constant case the output of φ(·) is always 1. In the dynamic case the output of φ(·) is adjusted to 1.4 during one period and to 0.6 on days 11-14, and is 1 at all other times. The tracking error between the nitrate nitrogen concentration and the desired trajectory is defined as

e(k)＝d(k)-x(k) (3)
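The desired trajectory and the tracking error of equation (3) can be illustrated with a minimal Python sketch. The function names are illustrative only, and `high_window` is a hypothetical caller-supplied day range for the 1.4 setpoint, since the text does not state it; treating days 11-14 as a closed interval is likewise an assumption.

```python
def desired_value(day, high_window, dynamic=True):
    """Desired nitrate nitrogen concentration (mg/L) at a given simulation day.

    high_window is a hypothetical (start_day, end_day) pair for the 1.4
    setpoint; the text does not give it, so the caller must choose it.
    """
    if not dynamic:
        return 1.0                      # constant expected value
    start, end = high_window
    if start <= day < end:
        return 1.4                      # raised setpoint (window is an assumption)
    if 11.0 <= day <= 14.0:
        return 0.6                      # lowered setpoint on days 11-14
    return 1.0                          # all other times


def tracking_error(d_k, x_k):
    """Equation (3): e(k) = d(k) - x(k)."""
    return d_k - x_k
```

With the initial state x(0)=0.5 mg/L and d(0)=1 mg/L, `tracking_error(1.0, 0.5)` gives an initial error of 0.5 mg/L.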

To achieve tracking of the desired trajectory, a total error control strategy is defined as

where λ is a real number greater than 0 (for the BSM1 model, λ=0.02 can be used), u₁(k) is a feedback control strategy that can stabilize the system, and u₂(k) is the near-optimal control strategy output by the Q-learning algorithm. The feedback control strategy can be expressed as

u₁(k)＝d(k+1)+αe(k) (5)

where α is a constant parameter that can be set to α=0.6 according to the actual controlled system.
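As a small worked example of equation (5), the following Python sketch evaluates the feedback control strategy for one time step. The function name is illustrative; only α=0.6 and the formula itself come from the text.

```python
ALPHA = 0.6  # value of alpha given in the text


def feedback_control(d_next, e_k, alpha=ALPHA):
    """Equation (5): u1(k) = d(k+1) + alpha * e(k)."""
    return d_next + alpha * e_k


# Constant expected value d = 1 mg/L and initial state x(0) = 0.5 mg/L:
e0 = 1.0 - 0.5                     # e(0) = d(0) - x(0) = 0.5
u1_0 = feedback_control(1.0, e0)   # 1.0 + 0.6 * 0.5 = 1.3
```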

The original system can be converted into a system concerning tracking error according to formulas (1) - (5)

The utility function is defined as

The utility function represents the immediate cost generated by the control input at time k, and U(0,0)=0. It contains two constants, one weighting the state variable and one weighting the control variable, and fixed values are selected for both.

For the optimal regulation problem in sewage treatment described above, it is desirable to find a suitable allowable control strategy that minimizes the Q function as follows

Where ζ=k, k+1, k+2 … represents k and any time thereafter, and the Q function is the sum of utility functions at all times.
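The relation between the utility function and the Q function can be sketched in Python as follows. The quadratic form of the utility and the weights `q` and `r` are assumptions made for illustration (they satisfy U(0,0)=0 but are not taken from the text), and the infinite sum is truncated to the length of the supplied sequences.

```python
def utility(e, u2, q=1.0, r=1.0):
    """Assumed quadratic stage cost; q and r are placeholder weights."""
    return q * e * e + r * u2 * u2


def q_value(errors, controls, q=1.0, r=1.0):
    """Truncated Q function: sum of utilities from time k onward.

    errors[i] and controls[i] are e(k+i) and u2(k+i).
    """
    return sum(utility(e, u, q, r) for e, u in zip(errors, controls))


# Errors 1.0 and 0.5 with zero control effort: 1.0 + 0.25 = 1.25
cost = q_value([1.0, 0.5], [0.0, 0.0])
```

A control strategy that drives the error to zero faster accumulates a smaller sum, which is why minimizing the Q function yields tracking.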

When the Q function in formula (8) attains its minimum, that is, the optimal Q function Q*(e(k),u₂(k)), the corresponding u₂(k) is the optimal control strategy. The optimal control strategy keeps the system stable while ensuring that the error in equation (6) tends to zero, so that the nitrate nitrogen concentration tracks the desired value.

According to Bellman's principle of optimality, the optimal Q function Q*(e(k),u₂(k)) satisfies the following discrete-time HJB equation:

the optimal control strategy can be solved by
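For a Q function defined as the undiscounted sum of utilities from time k onward, the Bellman recursion and the associated minimizing control usually take the form below. This is a sketch in standard notation rather than a reproduction of the patent's own formulas; the symbol u_2^* for the optimal strategy is an assumption introduced here.

```latex
Q^{*}\bigl(e(k),u_2(k)\bigr)
  = U\bigl(e(k),u_2(k)\bigr)
  + \min_{u_2(k+1)} Q^{*}\bigl(e(k+1),u_2(k+1)\bigr),
\qquad
u_2^{*}(k) = \arg\min_{u_2(k)} Q^{*}\bigl(e(k),u_2(k)\bigr).
```

The first relation follows by splitting the sum of utilities into the term at time k and the remaining terms from k+1 onward.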

Step 2. Build the Q-learning algorithm framework. For the optimal regulation problem above, a Q-learning algorithm is introduced to obtain the optimal control strategy.

Many conventional ADP algorithms exist that are model-based, meaning that the system dynamics equations need to be relied upon to obtain the optimal control strategy. In recent years, Q learning algorithms have been introduced into ADP to obtain optimal controllers via online data without the need to build accurate system models. The sewage treatment process is highly complex and nonlinear, and it is difficult to build a very accurate model to simulate it. Therefore, it is necessary to control the sewage treatment process by using the Q learning algorithm, and the overall structure thereof is shown in fig. 2.

First, the evaluation network is constructed. In the Q-learning framework, the input of the evaluation network consists of two parts, the error e(k) and the control strategy u₂(k), and its output is the Q function at the current time.

The Q function output by the evaluation network is

where W_c1(k) is the network weight between the input layer and the hidden layer, W_c2(k) is the network weight between the hidden layer and the output layer, and ψ_c(·) is the hyperbolic tangent activation function of the evaluation network.

The error function is defined as

The objective function for training the evaluation network is

The weights between the hidden layer and the output layer of the evaluation network are updated according to the rule

W_c2(k+1)＝W_c2(k)-γ_c ΔW_c2(k) (15)

where γ_c is the learning rate of the evaluation network.
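The evaluation network and update rule (15) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the hidden-layer width, the random initialization, the error function e_c = Q̂ - target and the objective E_c = 0.5·e_c² are not taken from the text, and the fixed `target` stands in for whatever target the error function uses. Only the hidden-to-output weights W_c2 are updated, as the text describes.

```python
import math
import random


def critic_forward(w_c1, w_c2, e, u2):
    """Return (Q_hat, hidden) for the input [e, u2] with a tanh hidden layer."""
    z = (e, u2)
    hidden = [math.tanh(sum(w_c1[i][j] * z[i] for i in range(2)))
              for j in range(len(w_c2))]
    q_hat = sum(w * h for w, h in zip(w_c2, hidden))
    return q_hat, hidden


def critic_update(w_c1, w_c2, e, u2, target, gamma_c=0.01):
    """One gradient step on W_c2 only, in the form of update rule (15)."""
    q_hat, hidden = critic_forward(w_c1, w_c2, e, u2)
    e_c = q_hat - target                     # assumed error function
    delta = [e_c * h for h in hidden]        # dE_c/dW_c2 for E_c = 0.5 * e_c**2
    new_w_c2 = [w - gamma_c * d for w, d in zip(w_c2, delta)]
    return new_w_c2, 0.5 * e_c * e_c


random.seed(0)
n_hidden = 6                                 # hypothetical hidden-layer width
w_c1 = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(2)]
w_c2 = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]

losses = []
for _ in range(200):
    w_c2, loss = critic_update(w_c1, w_c2, e=0.5, u2=0.1, target=0.3)
    losses.append(loss)
```

Because the output is linear in W_c2, each step with the small learning rate γ_c=0.01 shrinks the error for this fixed input, so `losses` decreases over the 200 iterations.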

Next, the execution network is constructed. The input of the execution network is the error e(k) at time k, and its output is the control strategy u₂(k).

The control strategy expression is

where W_a1(k) is the network weight between the input layer and the hidden layer of the execution network, W_a2(k) is the network weight between the hidden layer and the output layer, and ψ_a(·) is the hyperbolic tangent activation function of the execution network.

Under the Q-learning framework, the error function of the execution network is defined as

The objective function of the execution network is defined as

W_a2(k+1)＝W_a2(k)-γ_a ΔW_a2(k) (20)

where γ_a is the learning rate of the execution network.
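The execution network and update rule (20) can be sketched in the same way. The assumptions here are the error function e_a = Q̂ (the distance of the estimated Q value from zero), the objective 0.5·e_a², the hidden-layer width, and the function `toy_q`, a hypothetical analytic stand-in for the evaluation network that keeps the example short.

```python
import math
import random


def actor_forward(w_a1, w_a2, e):
    """Return (u2, hidden) for the scalar input e with a tanh hidden layer."""
    hidden = [math.tanh(w * e) for w in w_a1]
    u2 = sum(w * h for w, h in zip(w_a2, hidden))
    return u2, hidden


def actor_update(w_a1, w_a2, e, q_hat, dq_du2, gamma_a=0.01):
    """One gradient step on W_a2 only, in the form of update rule (20)."""
    _, hidden = actor_forward(w_a1, w_a2, e)
    e_a = q_hat                                  # assumed error function
    delta = [e_a * dq_du2 * h for h in hidden]   # chain rule through u2
    return [w - gamma_a * d for w, d in zip(w_a2, delta)]


def toy_q(e, u2):
    """Hypothetical stand-in for the evaluation network output."""
    return e * e + (u2 + 0.5 * e) ** 2


random.seed(1)
n_hidden = 6                                     # hypothetical hidden-layer width
w_a1 = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
w_a2 = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]

e = 0.5
u2_start, _ = actor_forward(w_a1, w_a2, e)
for _ in range(500):
    u2, _ = actor_forward(w_a1, w_a2, e)
    w_a2 = actor_update(w_a1, w_a2, e, toy_q(e, u2), 2.0 * (u2 + 0.5 * e))
u2_end, _ = actor_forward(w_a1, w_a2, e)
```

In the full scheme the value `q_hat` and the derivative `dq_du2` would come from the evaluation network rather than from `toy_q`; the update moves u₂ toward the value that lowers the estimated Q function.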

Step 3. Train the evaluation network and the execution network.

After the evaluation network and the execution network have been constructed, their weights are updated by online training. The main training steps are as follows:

(1) Initialize the evaluation network. FIG. 3 is a schematic diagram of the evaluation network, in which the weight between the input layer and the hidden layer is W_c1 and the weight between the hidden layer and the output layer is W_c2. The network inputs are the system error e(k) at the current time and the control strategy u₂(k), and the output is the Q function at the current time. Set the learning rate γ_c=0.01 and the initial time k=0;

(2) Initialize the execution network. FIG. 4 is a schematic diagram of the execution network, in which the weight between the input layer and the hidden layer is W_a1 and the weight between the hidden layer and the output layer is W_a2. The network input is the system error e(k) at the current time, and the output is the control strategy u₂(k). Set the learning rate γ_a=0.01 and the initial time k=0;

(3) Input the current nitrate nitrogen concentration state x(k) of BSM1, calculate the error e(k) between x(k) and the desired trajectory d(k), and then obtain the stabilizing control strategy u₁(k);

(4) Input the error e(k) into the execution network to calculate the control strategy u₂(k) at the current time;

(5) Calculate u(k) from the control strategy u₂(k) output by the execution network and the known stabilizing control strategy u₁(k);

(6) Input u₂(k) and the current error e(k) into the evaluation network, and calculate the utility function U(e(k),u₂(k)) and the Q function Q(e(k),u₂(k));

(7) Apply u(k) to the sewage treatment system to obtain the nitrate nitrogen concentration state x(k+1) at the next time, and then obtain the control strategy u(k+1) at the next time;

(8) Calculate the error function of the evaluation network according to formula (13), and update the weights of the evaluation network according to the update rule;

(9) Calculate the error function of the execution network according to formula (18), and update the weights of the execution network according to the update rule.
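The order of operations in steps (1)-(9) can be sketched as a single Python loop. Several parts are assumptions made only so that the sketch runs: `plant` is a hypothetical scalar stand-in and not the BSM1 dynamics, the total control is assumed to be the additive combination u = u₁ + λu₂, the utility is an assumed quadratic with unit weights, the two error functions are assumed forms, and the hidden-layer width is arbitrary. The values α=0.6, λ=0.02 and the learning rates 0.01 are those quoted in the text.

```python
import math
import random

ALPHA, LAM = 0.6, 0.02        # alpha and lambda quoted in the text
GAMMA_C = GAMMA_A = 0.01      # learning rates quoted in the text
N_H = 6                       # hypothetical hidden-layer width


def plant(x, u):
    """Hypothetical scalar stand-in for the process; NOT the BSM1 dynamics."""
    return 0.8 * x + 0.2 * u


def utility(e, u2):
    """Assumed quadratic stage cost with unit weights."""
    return e * e + u2 * u2


def actor(w_a1, w_a2, e):
    h = [math.tanh(w * e) for w in w_a1]
    return sum(w * hj for w, hj in zip(w_a2, h)), h


def critic(w_c1, w_c2, e, u2):
    h = [math.tanh(w_c1[0][j] * e + w_c1[1][j] * u2) for j in range(N_H)]
    return sum(w * hj for w, hj in zip(w_c2, h)), h


def run(steps=200, x0=0.5, d=1.0, seed=0):
    rng = random.Random(seed)

    def init():
        return [rng.uniform(-0.5, 0.5) for _ in range(N_H)]

    w_c1, w_c2 = [init(), init()], init()            # step (1)
    w_a1, w_a2 = init(), init()                      # step (2)
    x, errors = x0, []
    for _ in range(steps):
        e = d - x                                    # step (3): e(k)
        u1 = d + ALPHA * e                           # eq. (5), d(k+1) = d here
        u2, h_a = actor(w_a1, w_a2, e)               # step (4)
        u = u1 + LAM * u2                            # step (5), assumed form
        q, h_c = critic(w_c1, w_c2, e, u2)           # step (6)
        x_next = plant(x, u)                         # step (7)
        e_next = d - x_next
        u2_next, _ = actor(w_a1, w_a2, e_next)
        q_next, _ = critic(w_c1, w_c2, e_next, u2_next)
        e_c = q - (utility(e, u2) + q_next)          # step (8), assumed error
        dq_du2 = sum(w_c2[j] * (1.0 - h_c[j] ** 2) * w_c1[1][j]
                     for j in range(N_H))
        w_c2 = [w - GAMMA_C * e_c * hj for w, hj in zip(w_c2, h_c)]
        w_a2 = [w - GAMMA_A * q * dq_du2 * hj        # step (9), assumed error
                for w, hj in zip(w_a2, h_a)]
        errors.append(e)
        x = x_next
    return errors


errors = run()
```

On this toy plant the feedback term u₁ alone already drives the error toward zero, so the sketch mainly shows how the two networks are evaluated and updated within each sampling step; it says nothing about performance on BSM1.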

Step 4. Apply the Q-learning algorithm to BSM1, set a constant expected value and a dynamic expected value in turn, and compare the results with a traditional PID algorithm to verify the tracking control capability of the Q-learning algorithm for different desired trajectories.

In the specific procedure, the error between the nitrate nitrogen concentration in BSM1 and the expected value is calculated first, and the Q-learning algorithm obtains the near-optimal control strategy u₂(k) for this error through the execution network. u₂(k) and the feedback control strategy are then combined according to formula (4) to obtain the total error control strategy of the system, which is applied to BSM1 so that the nitrate nitrogen concentration tracks the desired trajectory. Next, with all other conditions unchanged, an incremental PID algorithm replaces the Q-learning algorithm: the same initial error is input, and the control strategy output by the PID algorithm is used to make the nitrate nitrogen concentration track the constant and dynamic desired trajectories. Finally, the control performance of the two algorithms is compared and analyzed.
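The incremental PID baseline mentioned above can be sketched in its textbook form, in which the controller computes a change in the control signal from the last three errors. The class name and the gains used in the example are hypothetical; the text does not give the PID parameters used in the comparison.

```python
class IncrementalPID:
    """Textbook incremental (velocity-form) PID controller."""

    def __init__(self, kp, ki, kd, u0=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.u = u0        # current control signal u(k-1)
        self.e1 = 0.0      # e(k-1)
        self.e2 = 0.0      # e(k-2)

    def step(self, e):
        """Return u(k) = u(k-1) + du(k) for the new error e(k)."""
        du = (self.kp * (e - self.e1)
              + self.ki * e
              + self.kd * (e - 2.0 * self.e1 + self.e2))
        self.u += du
        self.e2, self.e1 = self.e1, e
        return self.u


pid = IncrementalPID(kp=1.0, ki=0.5, kd=0.0)   # hypothetical gains
first = pid.step(1.0)    # du = 1.0*1.0 + 0.5*1.0 = 1.5, so u = 1.5
second = pid.step(1.0)   # du = 0.0 + 0.5*1.0 = 0.5, so u = 2.0
```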

The innovations of the invention are as follows. A controller is designed with the Q-learning algorithm for a complex nonlinear sewage treatment system whose model is unknown. A corresponding design method is provided for the problem that the actual nitrate nitrogen concentration has difficulty tracking the expected value during sewage treatment. The overall Q-learning framework is built, and the evaluation network and the execution network are constructed and trained. The execution network calculates the current optimal control strategy, which is combined with the feedback control strategy to obtain the total control strategy of the system, so that the nitrate nitrogen concentration is controlled stably. Tracking control is verified on the BSM1 simulation model for several desired trajectories.

Drawings

FIG. 1 General structure of BSM1

FIG. 2 Overall framework of Q learning

FIG. 3 Structure of the evaluation network

FIG. 4 Structure of the execution network

FIG. 5 Influent flow rate in sunny weather

FIG. 6 Nitrate nitrogen concentration tracking curve for the constant expected value

FIG. 7 Tracking error between the nitrate nitrogen concentration and the expected value

FIG. 8 Change in the internal reflux quantity

FIG. 9 Nitrate nitrogen concentration tracking curve for the dynamic expected value

FIG. 10 Tracking error between the nitrate nitrogen concentration and the expected value

FIG. 11 Change in the internal reflux quantity

Detailed Description

In this section, measured data for sunny weather are used to verify nitrate nitrogen concentration control experimentally on the BSM1 simulation model. The data are 14 days of samples from an actual sewage treatment plant, with a sampling period of 15 minutes. FIG. 5 shows the influent flow rate in sunny weather: under normal sunny conditions the inflow is stable, and the daily influent flow of the sewage treatment plant always stays between 2×10⁴ m³ and 3×10⁴ m³.
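The size of the data set follows directly from the sampling period. A short Python check, in which the helper `day_of_sample` is an illustrative name for converting a sample index into simulation days:

```python
SAMPLE_MINUTES = 15
DAYS = 14

samples_per_day = 24 * 60 // SAMPLE_MINUTES   # 96 samples per day
total_samples = DAYS * samples_per_day        # 1344 samples in 14 days


def day_of_sample(k):
    """Simulation time in days of sample index k (k = 0 is the start)."""
    return k * SAMPLE_MINUTES / (24 * 60)
```

Each discrete time step k in the controller therefore corresponds to 15 minutes of plant operation.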

The initial state of the system is chosen as x(0)=0.5 mg/L, and the sewage treatment system is controlled with the Q-learning method. To illustrate the potential of the Q-learning-based control method on the BSM1 simulation model more intuitively, it is compared with the PID method, currently the most common method in sewage treatment. The test is first performed with a constant expected value. The desired trajectory of the nitrate nitrogen concentration is set empirically to 1 mg/L. The tracking control results of the Q-learning method and the PID method are shown in FIG. 6, where FIG. 6(a) is the complete concentration curve for all 14 days, and FIG. 6(b) and FIG. 6(c) are partial enlarged views around day 6 and day 12, respectively. The Q-learning method shows less overshoot than the PID method and responds more quickly. FIG. 7 shows the error between the actual and expected values of the nitrate nitrogen concentration: FIG. 7(a) is the error curve for all 14 days, and FIG. 7(b) and FIG. 7(c) are partial enlarged views. With a constant expected value, the error of the Q-learning control method stays essentially within ±0.04 mg/L and the overall control is more accurate, whereas the error of PID control is clearly larger. The results in FIG. 6 and FIG. 7 show that the Q-learning method achieves better control performance than the traditional PID method. FIG. 8 shows the change in the internal reflux quantity; the control system regulates the nitrate nitrogen concentration by adjusting the internal reflux quantity.

To verify the stability of the system and its ability to track the expected value when the Q-learning-based control method has to handle sudden changes, a dynamic expected value is used for nitrate nitrogen concentration tracking in the experiment. The tracking control results are shown in FIG. 9, where FIG. 9(a) is the complete concentration curve for all 14 days, and FIG. 9(b) and FIG. 9(c) are partial enlarged views of FIG. 9(a). Compared with PID, the Q-learning-based control method makes the nitrate nitrogen concentration track the expected value better. FIG. 10 shows the error between the actual and expected values of the nitrate nitrogen concentration: even when the expected value changes, the control error of Q learning stays within ±0.04 mg/L, whereas the control error of PID often exceeds ±0.04 mg/L. FIG. 11 shows the change in the internal reflux quantity when the expected value changes; as the manipulated variable, the internal reflux quantity can be adjusted quickly so that the actual nitrate nitrogen concentration tracks the expected value. These results show that the Q-learning-based optimal tracking control method achieves good control in the sewage treatment process, tracks the set trajectory of the nitrate nitrogen concentration effectively in sunny weather, and ensures normal operation of the sewage treatment system.
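Statements such as "the error stays within ±0.04 mg/L" can be checked on a recorded error sequence with simple statistics. The following Python sketch is illustrative only: the function names are hypothetical and the sample values are made up for the example, not results from the experiments.

```python
def band_fraction(errors, band=0.04):
    """Fraction of samples whose absolute error is within the band (mg/L)."""
    if not errors:
        raise ValueError("errors must not be empty")
    inside = sum(1 for e in errors if abs(e) <= band)
    return inside / len(errors)


def mean_abs_error(errors):
    """Mean absolute tracking error (mg/L)."""
    if not errors:
        raise ValueError("errors must not be empty")
    return sum(abs(e) for e in errors) / len(errors)


# Made-up error samples in mg/L, for illustration only.
sample = [0.01, -0.03, 0.05, 0.0]
share_inside = band_fraction(sample)   # 3 of 4 samples are inside the band
mae = mean_abs_error(sample)
```

Applying the same two functions to the error sequences of both controllers gives a like-for-like numerical comparison to accompany the curves.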

Claims

1. A Q-learning-based method for controlling nitrate nitrogen concentration in sewage treatment, characterized by comprising the following steps:

step 1, converting the problem of tracking control of nitrate nitrogen concentration in sewage treatment into an optimal regulation problem;

the dynamic equation of the sewage treatment system is expressed as

x(k+1)＝P(x(k),u(k)) (1)

where k ∈ ℕ and ℕ denotes the natural numbers; x(k) is the nitrate nitrogen concentration S_NO,2 of the second partition at the current time k; u(k) is the internal reflux quantity at time k; P(·,·) is an unknown nonlinear function representing the system dynamics;

defining the desired trace of nitrate nitrogen concentration as

d(k+1)＝φ(d(k)) (2)

where φ(·) is the function that generates the desired trajectory, for which a constant expected value and a dynamic expected value are distinguished; in the constant case the output of φ(·) is always 1; in the dynamic case the output of φ(·) is adjusted to 1.4 during one period and to 0.6 on days 11-14, and is 1 at all other times; the tracking error between the nitrate nitrogen concentration and the desired trajectory is defined as

e(k)＝d(k)-x(k) (3)

where λ is a real number greater than 0, λ=0.02; u₁(k) is a feedback control strategy that can stabilize the system, and u₂(k) is the near-optimal control strategy output by the Q-learning algorithm; the feedback control strategy is expressed as

u₁(k)＝d(k+1)+αe(k) (5)

where α is a constant parameter, α=0.6;

converting the original system into a system concerning tracking error according to formulas (1) - (5)

the utility function is defined as

the utility function represents the immediate cost generated by the control input at time k, and U(0,0)=0; it contains two constants, one weighting the state variable and one weighting the control variable, and fixed values are selected for both;

Where ζ=k, k+1, k+2 … represents k and any time thereafter, and the Q function is the sum of utility functions at all times;

when the Q function in formula (8) attains its minimum, that is, the optimal Q function Q*(e(k),u₂(k)), the corresponding u₂(k) is the optimal control strategy; the optimal control strategy keeps the system stable while ensuring that the error in formula (6) tends to zero, so that the nitrate nitrogen concentration tracks the expected value;

according to Bellman's principle of optimality, the optimal Q function Q*(e(k),u₂(k)) satisfies the discrete-time HJB equation:

the optimal control strategy is solved by

Step 2, building a Q learning algorithm framework;

first, the evaluation network is constructed; in the Q-learning framework, the input of the evaluation network consists of two parts, the error e(k) and the control strategy u₂(k), and its output is the Q function at the current time;

the Q function output by the evaluation network is

where W_c1(k) is the network weight between the input layer and the hidden layer, W_c2(k) is the network weight between the hidden layer and the output layer, and ψ_c(·) is the hyperbolic tangent activation function of the evaluation network;

defining an error function as

the objective function for training the evaluation network is

W_c2(k+1)＝W_c2(k)-γ_c ΔW_c2(k) (15)

where γ_c is the learning rate of the evaluation network;

next, the execution network is constructed; the input of the execution network is the error e(k) at time k, and its output is the control strategy u₂(k);

The control strategy expression is

where W_a1(k) is the network weight between the input layer and the hidden layer of the execution network, W_a2(k) is the network weight between the hidden layer and the output layer, and ψ_a(·) is the hyperbolic tangent activation function of the execution network;

Defining an objective function of an execution network as

W_a2(k+1)＝W_a2(k)-γ_a ΔW_a2(k) (20)

where γ_a is the learning rate of the execution network;

step 3, judging and executing network training;

after the evaluation network and the execution network have been constructed, their weights are updated by online training; the main training steps are as follows:

(1) initialize the evaluation network; the weight between the input layer and the hidden layer is W_c1, and the weight between the hidden layer and the output layer is W_c2; the network inputs are the system error e(k) at the current time and the control strategy u₂(k), and the output is the Q function at the current time; set the learning rate γ_c=0.01 and the initial time k=0;

(2) initialize the execution network; the weight between the input layer and the hidden layer is W_a1, and the weight between the hidden layer and the output layer is W_a2; the network input is the system error e(k) at the current time, and the output is the control strategy u₂(k); set the learning rate γ_a=0.01 and the initial time k=0;

(3) input the current nitrate nitrogen concentration state x(k) of BSM1, calculate the error e(k) between x(k) and the desired trajectory d(k), and then obtain the stabilizing control strategy u₁(k);

(9) Calculating according to a formula (18) to obtain an error function of the execution network, and updating the weight of the execution network according to an updating rule;

step 4, applying a Q learning algorithm to the BSM1, and sequentially setting a constant expected value and a dynamic expected value;

the error between the nitrate nitrogen concentration in BSM1 and the expected value is calculated, and the Q-learning algorithm obtains the near-optimal control strategy u₂(k) for this error through the execution network; u₂(k) and the feedback control strategy are then combined according to formula (4) to obtain the total error control strategy of the system, which is applied to BSM1 so that the nitrate nitrogen concentration tracks the desired trajectory.