CN117369263B - Intelligent combustion control method of hot blast stove based on reinforcement learning and attention mechanism - Google Patents


Info

Publication number: CN117369263B (application CN202311375874.0A)
Authority: CN (China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other versions: CN117369263A (Chinese, zh)
Inventors: 梁合兰, 国宏伟, 闫炳基, 杨韬然
Original and current assignee: Suzhou University
Application filed by Suzhou University; priority to CN202311375874.0A; granted and published as CN117369263B

Abstract

The invention belongs to the technical field of combustion control of hot blast stoves, and particularly relates to an intelligent combustion control method of a hot blast stove based on reinforcement learning and an attention mechanism, which comprises the following steps: S1: acquiring historical data of combustion of the combustion furnace, and selecting continuous data from the historical data as combustion state data by means of a moving time window; S2: based on the combustion state data, obtaining a trained Attention-MLP model; S3: acquiring real-time combustion data, and controlling the gas valve position adjustment direction of the hot blast stove according to the real-time combustion data by means of the Attention-MLP model. The invention solves the problem that existing hot blast stove combustion control methods cannot simultaneously meet the requirements of control precision and real-time performance, achieves higher accuracy, and can meet the requirements of intelligent combustion optimization control of blast furnace hot blast stoves.

Description

Intelligent combustion control method of hot blast stove based on reinforcement learning and attention mechanism
Technical Field
The invention relates to the technical field of combustion control of hot blast stoves, in particular to an intelligent combustion control method of a hot blast stove based on reinforcement learning and attention mechanisms.
Background
Blast furnace smelting is a key link in the steel-making process, and its production cost accounts for more than 60% of the cost of steel products. The hot blast stove is a vital heat-exchange device in the iron-making process; its function is to generate and deliver high-temperature hot air to the blast furnace to meet the heat requirement of the iron-ore reduction process. Improving the combustion control precision of the hot blast stove is an important means of raising the air temperature, reducing fuel consumption and lowering emissions, and is also an effective means of prolonging the service life of the hot blast stove and reducing the labor intensity of workers.
The goal of hot blast stove combustion control is to achieve optimal control of the vault temperature and the flue-gas temperature rise rate during the combustion period by dynamically adjusting the air valve position and the gas valve position according to the combustion state. Combustion control methods for blast furnace hot blast stoves fall into three main categories: traditional control methods, mathematical model methods and intelligent control methods. Among these, traditional control methods suffer from control lag and excessive control action intensity, while mathematical model methods require large investment because many thermal parameters must be monitored. In comparison, intelligent control methods for the hot blast stove can use intelligent knowledge to design the controller, have advantages such as a wide working range and broad applicability, and are the mainstream direction of current development. However, in existing intelligent control methods, rule induction and extraction remain difficult; in actual operation the rules are mostly set manually and suffer from drawbacks such as a single form of expression and an overly simple structure.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is that methods in the prior art cannot simultaneously satisfy the control precision and real-time performance requirements of optimized hot blast stove combustion.
In order to solve the technical problems, the invention provides an intelligent combustion control method of a hot blast stove based on reinforcement learning and attention mechanism, which comprises the following steps:
S1: acquiring historical data of combustion of a combustion furnace, and selecting continuous data from the historical data as combustion state data by utilizing a moving time window;
S2: based on the combustion state data, obtaining a trained Attention-MLP model;
S3: acquiring real-time combustion data, and controlling the gas valve position adjusting direction of the hot blast stove according to the real-time combustion data by utilizing the Attention-MLP model;
The Attention-MLP model comprises a predefined action space, an experience pool, a Q network for outputting actions and a target network for outputting estimated values to guide the Q network, wherein initial parameters of the Q network and initial parameters of the target network are the same; after the combustion state data is obtained by the intelligent agent observing the environment, an output state is formed after the combustion state data is processed by the attention mechanism module, a state transfer record obtained by the interaction of the intelligent agent and the environment is stored in the experience pool, the Q network is trained by adopting an experience playback mechanism, the parameters of the Q network are updated, and the Q network synchronizes the updated parameters to the target network;
there are a plurality of pieces of combustion state data, and each piece comprises values of the gas pressure, the air valve position, the gas valve position and the vault temperature.
In one embodiment of the invention, storing state transition records obtained by the agent interacting with the environment in the experience pool comprises: the intelligent agent obtains feedback rewards corresponding to the action after selecting the action from the action space to execute according to the output state, and the intelligent agent transfers to a new state, and takes the combustion state data, the action, the feedback rewards and the new state as state transfer records to be stored in the experience pool; the action space is defined as a collection of 3 adjustment directions of a gas valve position, and the 3 adjustment directions are specifically: downward adjustment, upward adjustment and constant.
In one embodiment of the present invention, the specific steps for forming the output state after processing by the attention mechanism module are:
S211: taking the combustion state data s_t ∈ R^(ω×m) as input, obtaining the linear features Z_t through a linear mapping, as follows:

Z_t = {z_1, z_2, ..., z_ω} = W_z·s_t + b_z

where z_i ∈ R^D represents the linear feature at time i, i = 1, ..., ω, with ω the time window length; m is the length of each piece of furnace data; W_z ∈ R^(D×m) is a neural network parameter, b_z ∈ R^D is the bias parameter, and D is the dimension of the embedded feature;
S212: taking the linear features Z_t as input, calculating the hidden states H_t with position codes added, as follows:

H_t = {h_1, h_2, ..., h_ω} = Z_t + C

where h_i ∈ R^D represents the hidden state at the i-th moment, i = 1, ..., ω; C is a neural network parameter whose size is the same as that of Z_t;
S213: mapping H_t to the keys K_t and values V_t by linear mappings, as follows:

K_t = {k_1, k_2, ..., k_ω} = W_k·H_t + b_k
V_t = {v_1, v_2, ..., v_ω} = W_v·H_t + b_v

where k_i and v_i ∈ R^D respectively represent the key vector and the value vector at the i-th moment, i = 1, ..., ω; W_k, W_v ∈ R^(D×D) are weight parameters of the embedded network; b_k, b_v ∈ R^D are bias parameters;
S214: inputting the query vector q ∈ R^D, calculating the relevance A_t between the query vector and the keys, as follows:

A_t = (a_1, a_2, ..., a_ω) = softmax(q^T·K_t)

where a_i represents the attention value between the query vector and the key at the i-th moment, i = 1, ..., ω; softmax is the normalized exponential function, and q is a randomly initialized learnable parameter;
S215: substituting the relevance vector and the value matrix into the following formula to obtain the output state e_t of the combustion state data s_t at moment t:

e_t = Σ_{i=1}^{ω} a_i·v_i = V_t·A_t^T
In one embodiment of the invention, the training of the Q network using an experience playback mechanism includes:
S21: initializing the experience pool D, the parameters of the Q network and the target network, and the sampling priority of each state transition record, and recording the state transition record at moment i as τ_i;
S22: according to the sampling priorities, calculating the sampling probability of each state transition record:

P(τ_i) = p_i^β / Σ_j p_j^β

where β is the state transition sampling constant, τ_i is the state transition record at moment i, and p_i is the sampling priority of the state transition record τ_i;
S23: sampling B state transition records according to the sampling probabilities, setting a loss function according to the Q values corresponding to the output actions of the sampled records, and updating the parameters of the Q network by a gradient descent method so as to minimize the loss function value.
In one embodiment of the invention, the loss function is expressed as:

L = L_TD + λ_1·L_E + λ_2·L_L2

where L_TD is the time-series differential error, L_E is the large-margin classification loss, L_L2 is the L2 regularization loss, and λ_1, λ_2 are the weights of the corresponding loss terms.
In one embodiment of the present invention, the expression of the time-series differential error is:

L_TD = Y_t − Q(s_t, a_t; θ)

where Y_t represents the cumulative expected return estimate at moment t, and Q(s_t, a_t; θ) represents the cumulative expected return estimate output by the Q network with parameters θ when state s_t is input; and

Y_t = r_{t+1} + γ·Q(s_{t+1}, a*_{t+1}; θ⁻),  a*_{t+1} = argmax_a Q(s_{t+1}, a; θ)

where r_{t+1} is the reward at moment t+1; γ ∈ (0, 1) is the discount coefficient; Q(s_{t+1}, a*_{t+1}; θ⁻) is the cumulative expected return estimate of the target network with parameters θ⁻ for input s_{t+1} and output action a*_{t+1}; and a*_{t+1} is the action with the maximum cumulative expected return estimate output value when state s_{t+1} is input to the Q network.
In one embodiment of the present invention, the expression of the large-margin classification loss is:

L_E = max_{a∈A} [Q(s_t, a; θ) + l(a_t^E, a)] − Q(s_t, a_t^E; θ)

where A represents the action space, s_t represents the combustion state of the hot blast stove at moment t, a represents any action in the action space A, Q(s_t, a; θ) represents the output of the Q network with parameters θ for action a under input state s_t, and a_t^E represents the expert action at moment t, namely the valve position adjustment action corresponding to state s_t in the state transition record; l(a_t^E, a) is a penalty function that equals 0 when the action output by the model is consistent with the expert action reflected in the record, and otherwise equals b, a hyper-parameter greater than zero.
In one embodiment of the present invention, the expression of the L2 regularization loss is:

L_L2 = ||θ||_2^2

where θ denotes the parameters of the Q network.
In one embodiment of the present invention, the specific method for the Q network to synchronize the updated parameters to the target network is: and synchronizing the parameters of the Q network to the target network when the sampling times reach the preset target network parameter updating frequency.
In one embodiment of the present invention, the gradient descent method is as follows: calculating the importance sampling weight of every state transition record according to the sampling probability, updating the gradient values according to the average importance sampling weight of the B state transition records obtained by sampling, and updating the sampling priorities of the state transition records according to the time-series differential error.
Compared with the prior art, the technical scheme of the invention has the following advantages:
The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism combines the ideas of reinforcement learning and supervised learning, and uses the combustion history data set to guide the agent to learn offline. In particular, in view of the characteristics of the historical data, the attention mechanism is used to better learn the state change rules of the hot blast stove contained in the historical data, so that better combustion control actions for the hot blast stove are output, solving the problem that combustion control of the hot blast stove cannot meet the requirements of control precision and real-time performance due to characteristics such as nonlinearity, continuity and hysteresis.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings, in which
FIG. 1 is a flow chart of an intelligent combustion control method of a hot blast stove based on reinforcement learning and attention mechanism;
FIG. 2 is a generation map of the combustion state data in the embodiment;
FIG. 3 is a block diagram of the Q network or the target network in an embodiment;
FIG. 4 is a flow chart of the processing of state data into output states by the attention-based embedded network of FIG. 3;
FIGS. 5 (a) - (b) are the accuracy of the Attention-MLP model and the MLP model on the training set and the test set, respectively;
FIG. 6 is a graph of a comparative change in loss function values for an Attention-MLP model and an MLP model;
FIG. 7 is a graph showing Q-value change curves of the Attention-MLP model and the MLP model.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
Referring to FIG. 1, the invention provides an intelligent combustion control method of a hot blast stove based on reinforcement learning and attention mechanism, which comprises the following steps:
S1: acquiring historical data of combustion of a combustion furnace, and selecting continuous data from the historical data as combustion state data by utilizing a moving time window;
S2: based on the combustion state data, obtaining a trained Attention-MLP model;
S3: acquiring real-time combustion data, and controlling the gas valve position adjusting direction of the hot blast stove according to the real-time combustion data by utilizing the Attention-MLP model;
The Attention-MLP model comprises a predefined action space, an experience pool, a Q network for outputting actions and a target network for outputting estimated values to guide the Q network, wherein initial parameters of the Q network and initial parameters of the target network are the same; after the combustion state data is obtained by the intelligent agent observing the environment, an output state is formed after the combustion state data is processed by the attention mechanism module, a state transfer record obtained by the interaction of the intelligent agent and the environment is stored in the experience pool, the Q network is trained by adopting an experience playback mechanism, the parameters of the Q network are updated, and the Q network synchronizes the updated parameters to the target network;
There are a plurality of pieces of combustion state data, and each piece comprises values of the gas pressure, the air valve position, the gas valve position and the vault temperature.
Specifically, in S1 a moving-time-window strategy is adopted: continuous data within a time window of size ω = 10 are selected to construct a state, that is, s_t = X_{t−ω+1:t}, where x_t represents the values of variables such as the gas pressure and air pressure at time t, and m is the length of each record. For s_1 to s_{ω−1}, the earlier portion of the data is missing and is padded with 0. The generation of the state data is shown in FIG. 2: the upper part of FIG. 2 is the collected furnace history data, and the lower part shows the state data of the reinforcement learning model, where the state data s_t at time t is composed of the firing history records X_{t−ω+1:t} from time t−ω+1 to time t. The state data at other times are obtained in the same way.
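The moving-time-window state construction described above can be sketched as follows; the array shapes and the zero-padding of the earliest states follow the description (ω = 10, record length m = 5), while the toy history values are purely illustrative:

```python
import numpy as np

def build_states(history: np.ndarray, omega: int = 10) -> list:
    """Build one state per time step from furnace history rows.

    history: (T, m) array, one record x_t per time step.
    Returns a list of (omega, m) states s_t = X_{t-omega+1 : t},
    zero-padded at the top when fewer than omega rows precede t.
    """
    T, m = history.shape
    states = []
    for t in range(T):
        window = history[max(0, t - omega + 1): t + 1]
        pad = np.zeros((omega - len(window), m))  # missing early rows -> 0
        states.append(np.vstack([pad, window]))
    return states

history = np.arange(12 * 5, dtype=float).reshape(12, 5)  # toy furnace log, m = 5
states = build_states(history)
```

Each state therefore always has the fixed shape (ω, m), which lets a single embedded network process every time step uniformly.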
As shown in fig. 3, the Q network and the target network are both composed of an embedded network f and a fully connected network g, the embedded network f based on the attention mechanism maps the two-dimensional time sequence state data s t into an output state e t, and then takes the output state e t as input, and Q value output estimated values corresponding to 3 actions of the gas valve position are obtained through the fully connected network g.
Specifically, storing state transition records obtained by interaction of an agent with an environment in the experience pool includes: the intelligent agent obtains feedback rewards corresponding to the action after selecting the action from the action space to execute according to the output state, and the intelligent agent transfers to a new state, and takes the combustion state data, the action, the feedback rewards and the new state as state transfer records to be stored in the experience pool; the action space is defined as a collection of 3 adjustment directions of a gas valve position, and the 3 adjustment directions are specifically: downward adjustment, upward adjustment and constant.
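The experience-pool bookkeeping just described, one record per interaction of (combustion state, action, feedback reward, new state), can be sketched as follows; the field names, the string action labels and the pool capacity are illustrative assumptions, not identifiers from the source:

```python
from collections import namedtuple, deque

# One state-transition record: (state, action, reward, new state).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

# The 3 gas-valve adjustment directions of the action space.
ACTIONS = ("down", "up", "hold")

pool = deque(maxlen=10_000)  # experience pool with bounded capacity
pool.append(Transition(state="s0", action="up", reward=1.0, next_state="s1"))
```

A bounded deque simply drops the oldest transitions once full, which is one common way to keep the pool at a fixed size.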
As shown in FIG. 4, the specific steps for forming the output state after processing by the attention mechanism module are as follows:
S211: taking the combustion state data s_t ∈ R^(ω×m) as input, obtaining the linear features Z_t through a linear mapping, as follows:

Z_t = {z_1, z_2, ..., z_ω} = W_z·s_t + b_z

where z_i ∈ R^D represents the linear feature at time i, i = 1, ..., ω; m is the length of each piece of combustion state data, m = 5; W_z ∈ R^(D×m) is a neural network parameter, b_z ∈ R^D is the bias parameter, and D is the dimension of the embedded feature, D = 64;
S212: taking the linear features Z_t as input, calculating the hidden states H_t with position codes added, as follows:

H_t = {h_1, h_2, ..., h_ω} = Z_t + C

where h_i ∈ R^D represents the hidden state at the i-th moment, i = 1, ..., ω; C is a neural network parameter whose size is the same as that of Z_t;
S213: mapping H_t to the keys K_t and values V_t by linear mappings, as follows:

K_t = {k_1, k_2, ..., k_ω} = W_k·H_t + b_k
V_t = {v_1, v_2, ..., v_ω} = W_v·H_t + b_v

where k_i and v_i ∈ R^D respectively represent the key vector and the value vector at the i-th moment, i = 1, ..., ω; W_k, W_v ∈ R^(D×D) are weight parameters of the embedded network; b_k, b_v ∈ R^D are bias parameters;
S214: inputting the query vector q ∈ R^D, calculating the relevance A_t between the query vector and the keys, as follows:

A_t = (a_1, a_2, ..., a_ω) = softmax(q^T·K_t)

where a_i represents the attention value between the query vector and the key at the i-th moment, i = 1, ..., ω; softmax is the normalized exponential function, and q is a randomly initialized learnable parameter;
S215: substituting the relevance vector and the value matrix into the following formula to obtain the output state e_t of the combustion state data s_t at moment t:

e_t = Σ_{i=1}^{ω} a_i·v_i = V_t·A_t^T

The Q value output estimates are then obtained as follows: the output state e_t is input into the fully connected network g, the number of nodes of the output layer is set to 3, and the Q value output estimates corresponding to the three actions of increasing the gas valve position, reducing the gas valve position and keeping the gas valve position unchanged are output respectively, namely:

Q = {Q_1, Q_2, Q_3} = MLP(e_t).
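The embedding steps S211 to S215 can be sketched in plain NumPy as below; the dimensions ω = 10, m = 5, D = 64 follow the embodiment, while the randomly initialized W_z, C, W_k, W_v and q merely stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
omega, m, D = 10, 5, 64  # window length, record length, embedding dimension

# Randomly initialised parameters stand in for learned ones (illustration only).
W_z, b_z = rng.normal(size=(D, m)) * 0.1, np.zeros((D, 1))
C = rng.normal(size=(D, omega)) * 0.1          # positional-encoding parameter
W_k, b_k = rng.normal(size=(D, D)) * 0.1, np.zeros((D, 1))
W_v, b_v = rng.normal(size=(D, D)) * 0.1, np.zeros((D, 1))
q = rng.normal(size=(D, 1)) * 0.1              # learnable query vector

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def embed(s_t: np.ndarray) -> np.ndarray:
    """S211-S215: map a state s_t of shape (omega, m) to the output state e_t (D,)."""
    Z = W_z @ s_t.T + b_z          # S211: linear features, shape (D, omega)
    H = Z + C                      # S212: add position codes
    K = W_k @ H + b_k              # S213: keys
    V = W_v @ H + b_v              # S213: values
    A = softmax(q.T @ K).ravel()   # S214: attention weights over the omega steps
    return V @ A                   # S215: weighted sum of value vectors

e_t = embed(rng.normal(size=(omega, m)))
```

A small fully connected head mapping e_t to 3 Q values (one per valve action) would then play the role of the network g.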
in this embodiment, the training of the Q network using an empirical playback mechanism includes:
S21: initializing an experience pool D, parameters of the Q network and the target network, sampling priority of each state transition record, and recording the state transition record at the moment i as tau i;
S22: according to the sampling priorities, calculating the sampling probability of each state transition record:

P(τ_i) = p_i^β / Σ_j p_j^β

where β is the state transition sampling constant, β = 0.4 in this embodiment, τ_i is the state transition record at moment i, and p_i is the sampling priority of the state transition record τ_i;
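The priority-based sampling probability can be sketched as follows, a minimal computation assuming β = 0.4 as in this embodiment:

```python
import numpy as np

def sampling_probs(priorities, beta: float = 0.4) -> np.ndarray:
    """P(tau_i) = p_i**beta / sum_j p_j**beta over the stored priorities."""
    p = np.asarray(priorities, dtype=float) ** beta
    return p / p.sum()

# Records with higher priority are sampled proportionally more often.
probs = sampling_probs([1.0, 2.0, 4.0])
```

With β < 1 the distribution is flattened relative to the raw priorities, so high-error records are favoured without starving the rest of the pool.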
S23: sampling B state transition records according to the sampling probabilities, setting a loss function according to the Q values corresponding to the output actions of the sampled records, and updating the parameters of the Q network by a gradient descent method so as to minimize the loss function value.
The expression of the loss function is:

L = L_TD + λ_1·L_E + λ_2·L_L2

where L_TD is the time-series differential error, L_E is the large-margin classification loss, L_L2 is the L2 regularization loss, and λ_1, λ_2 are the weights of the corresponding loss terms.
The expression of the time-series differential error is:

L_TD = Y_t − Q(s_t, a_t; θ)

where Y_t represents the cumulative expected return estimate at moment t, and Q(s_t, a_t; θ) represents the cumulative expected return estimate output by the Q network with parameters θ when state s_t is input; and

Y_t = r_{t+1} + γ·Q(s_{t+1}, a*_{t+1}; θ⁻),  a*_{t+1} = argmax_a Q(s_{t+1}, a; θ)

where r_{t+1} is the reward at moment t+1; γ ∈ (0, 1) is the discount coefficient; Q(s_{t+1}, a*_{t+1}; θ⁻) is the cumulative expected return estimate of the target network with parameters θ⁻ for input s_{t+1} and output action a*_{t+1}; and a*_{t+1} is the action with the maximum cumulative expected return estimate output value when state s_{t+1} is input to the Q network.
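The target computation, in which the Q network picks the next action and the target network evaluates it, can be sketched as follows; the toy q_net and target_net tables are illustrative stand-ins, and γ = 0.99 is an assumed value (the source only requires γ ∈ (0, 1)):

```python
import numpy as np

def td_error(q_net, target_net, s_t, a_t, r_next, s_next, gamma: float = 0.99):
    """L_TD = Y_t - Q(s_t, a_t), with
    Y_t = r_{t+1} + gamma * Q_target(s_{t+1}, argmax_a Q(s_{t+1}, a))."""
    a_star = int(np.argmax(q_net(s_next)))          # greedy action from the Q network
    y_t = r_next + gamma * target_net(s_next)[a_star]
    return y_t - q_net(s_t)[a_t]

# Toy "networks": fixed Q tables over the 3 valve actions (down, up, hold).
q_net = lambda s: np.array([1.0, 3.0, 2.0])
target_net = lambda s: np.array([0.5, 2.5, 2.0])
delta = td_error(q_net, target_net, s_t=None, a_t=1, r_next=1.0, s_next=None)
```

Because action selection (Q network) and action evaluation (target network) are decoupled, this target is less prone to the over-estimation of a single-network max.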
The expression of the large-margin classification loss is:

L_E = max_{a∈A} [Q(s_t, a; θ) + l(a_t^E, a)] − Q(s_t, a_t^E; θ)

where A represents the action space, s_t represents the combustion state of the hot blast stove at moment t, a represents any action in the action space A, Q(s_t, a; θ) represents the output of the Q network with parameters θ for action a under input state s_t, and a_t^E represents the expert action at moment t, namely the valve position adjustment action corresponding to state s_t in the state transition record; l(a_t^E, a) is a penalty function that equals 0 when the action output by the model is consistent with the expert action reflected in the record, and otherwise equals b, a hyper-parameter greater than zero.
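The large-margin term can be sketched as a short function; the margin value b = 0.8 is an assumed example, since the source only requires b > 0:

```python
import numpy as np

def large_margin_loss(q_values: np.ndarray, expert_action: int, b: float = 0.8):
    """L_E = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    where l(a_E, a) = 0 if a == a_E, else the margin b > 0."""
    margin = np.full(len(q_values), b)
    margin[expert_action] = 0.0            # no penalty on the expert action itself
    return np.max(q_values + margin) - q_values[expert_action]

loss_wrong = large_margin_loss(np.array([1.0, 3.0, 2.0]), expert_action=0)
loss_right = large_margin_loss(np.array([1.0, 3.0, 2.0]), expert_action=1)
```

The loss is zero only when the expert action's Q value exceeds every other action's Q value by at least the margin b, pushing the offline-trained policy toward the recorded valve adjustments.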
The expression of the L2 regularization loss is:

L_L2 = ||θ||_2^2

where θ denotes the parameters of the Q network.
In this embodiment, the specific method for the Q network to synchronize the updated parameters to the target network is: and synchronizing the parameters of the Q network to the target network when the sampling times reach the preset target network parameter updating frequency.
The method for updating the parameters of the Q network by adopting the gradient descent method comprises the following steps:
and calculating the importance sampling weight of every state transition record according to the sampling probability:

w_i = (N·P(τ_i))^(−α)

where N is the number of state transition records in the experience pool and α is the importance sampling constant used in this embodiment.
The gradient values are updated according to the average importance sampling weight of the B state transition records obtained by sampling: θ = θ − λ·Δ_θL, where Δ_θL(τ_i) represents the loss function gradient value corresponding to the state transition record τ_i and λ is the learning rate; and the sampling priority of each sampled state transition record is updated according to its time-series differential error, p_i = |L_TD(τ_i)|.
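The weight and priority bookkeeping can be sketched as follows; the exponent α = 0.6, the max-normalization of the weights and the small ε added to the refreshed priority are conventional prioritized-replay choices assumed here, not values given in the embodiment:

```python
import numpy as np

def importance_weights(probs, alpha: float = 0.6) -> np.ndarray:
    """w_i = (N * P(tau_i))**(-alpha), normalised so the largest weight is 1."""
    probs = np.asarray(probs, dtype=float)
    w = (len(probs) * probs) ** (-alpha)
    return w / w.max()

def updated_priority(td_error: float, eps: float = 1e-6) -> float:
    """Refresh a record's sampling priority from its time-series differential error."""
    return abs(td_error) + eps

w = importance_weights([0.5, 0.3, 0.2])
```

Rarely sampled records (small P) receive larger weights, which compensates for the bias that prioritized sampling introduces into the gradient estimate.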
In order to evaluate the performance of the Attention-MLP model, the proposed model was implemented with Python 3.11.3, JAX 0.4.8 and related software under a GNU/Linux operating system; the computation was completed on a computer with an Intel(R) Core(TM) i7-10700 CPU and 16 GB of memory, and experimental data were recorded with Wandb. In terms of data, a set of hot blast stove combustion control data was obtained for training and testing. After dividing the data by burning period and eliminating abnormal periods, 27 burning periods were obtained and randomly divided into training and test sets at a ratio of 7:3 (19 periods for the training phase and the remaining 8 for the testing phase). A total of 47739 state transitions were then obtained according to the generation rule of the state data.
Four different random number seeds are used for determining the parameter theta of the Q network and the parameter theta - of the target network, and four experiments are respectively carried out on each network structure. Fig. 5 to 7 show the variation of the statistical index in the training process, the Attention-MLP model is the reinforcement learning model based on the Attention mechanism proposed by the present invention, and the MLP is the reinforcement learning model only using the fully connected network.
As can be seen from the accuracy curves of FIGS. 5(a)-(b), the accuracy of both models increases as training proceeds; moreover, the training set accuracy and test set accuracy are close at the same time step, indicating that the Attention-MLP model neither over-fits nor under-fits. FIG. 6 shows the change in the average loss value during training: the overall loss decreases continuously, indicating that the model can effectively learn the rules in the data. FIG. 7 shows the average Q value of the models during training: the average Q values of all models eventually converge to around 8, meaning that all models produce similar Q value estimates. Compared with the MLP model, the Attention-MLP model can attend to the important parts of the input state sequence, and the addition of position embedding allows the model to take the sequence order into account, giving it the best solution quality.
According to this embodiment, the intelligent optimized combustion control method for the blast furnace hot blast stove based on deep reinforcement learning needs neither monitoring instruments such as flowmeters and residual oxygen meters nor an explicitly provided correspondence between the valve positions and such detection equipment; instead, based on the deep reinforcement learning method, it autonomously learns the implicit relationship between the furnace burning state and the valve position adjustment parameters from historical furnace-burning operation records. Furthermore, considering the difficulty of representing state features, a deep model combining an attention-based embedded network and a fully connected network is proposed. Experimental results show that the accuracy of the model reaches 86%, which can meet the intelligent combustion optimization control requirements of the blast furnace hot blast stove.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present invention will be apparent to those of ordinary skill in the art in light of the foregoing description. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present invention.

Claims (9)

1. An intelligent combustion control method of a hot blast stove based on reinforcement learning and attention mechanism is characterized by comprising the following steps:
S1: acquiring historical data of combustion of a combustion furnace, and selecting continuous data from the historical data as combustion state data by utilizing a moving time window;
S2: based on the combustion state data, obtaining a trained Attention-MLP model;
S3: acquiring real-time combustion data, and controlling the adjustment direction of a gas valve position of the hot blast stove according to the real-time combustion data by utilizing the Attention-MLP model;
The Attention-MLP model comprises a predefined action space, an experience pool, a Q network for outputting actions and a target network for outputting estimated values to guide the Q network, wherein initial parameters of the Q network and initial parameters of the target network are the same; after the combustion state data is obtained by the intelligent agent observing the environment, an output state is formed after the combustion state data is processed by the attention mechanism module, a state transfer record obtained by the interaction of the intelligent agent and the environment is stored in the experience pool, the Q network is trained by adopting an experience playback mechanism, the parameters of the Q network are updated, and the Q network synchronizes the updated parameters to the target network;
There are multiple pieces of combustion state data, each comprising values of the gas pressure, the air valve position, the gas valve position, and the dome temperature;
the specific steps by which the attention mechanism module forms the output state are as follows:
s211: taking combustion state data s t as input, obtaining linear characteristics through linear mapping The following is shown:
Zt={z1,z2,...,zω}=Wzst+bz
wherein, Representing the linear characteristic at time i, i=1, …, ω, ω representing the time window length; m is the length of each piece of furnace data; as a parameter of the neural network, D is the dimension of the embedded feature, which is the bias parameter;
s212: calculating hidden states added with position codes by taking linear characteristic Z t as input The following are provided:
Ht={h1,h2,...,hω}=Zt+C
wherein, The hidden state at the i-th moment is represented, i=1, …, ω; Is a neural network parameter, and the size of the neural network parameter is the same as Z t;
S213: mapping H t to keys by linear mapping Sum valueThe following is shown:
Kt={k1,k2,...,kω}=WkHt+bk
Vt={v1,v2,...,vω}=WvHt+bv
In the method, in the process of the invention, Respectively representing a key vector and a value vector at the i-th moment, i=1, …, ω; A weight parameter for the embedded network; is a bias parameter;
S214: inputting query vectors Calculating relevance between query vectors and keysThe following is shown:
At=(a1,a2,...,aω)=softmax(qTKt)
In the method, in the process of the invention, An attention value between the query vector and the key representing the i-th moment, i=1, …, ω; Softmax is a normalized exponential function, a randomly initialized learnable parameter;
S215: substituting the correlation vector and the value matrix according to the following formula to obtain the output state of the combustion state data s t at the moment t
2. The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism according to claim 1, characterized in that storing the state-transition records obtained from the agent's interaction with the environment in the experience pool comprises: the agent selects an action from the action space according to the output state and executes it, obtains the feedback reward corresponding to that action, and transitions to a new state; the combustion state data, the action, the feedback reward, and the new state are stored in the experience pool as a state-transition record; the action space is defined as the set of 3 adjustment directions of the gas valve position, the 3 adjustment directions being: adjust downward, adjust upward, and keep unchanged.
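The transition storage of claim 2 can be sketched minimally as follows. The record layout, capacity, and the numeric encoding of the three valve adjustments are assumptions; the reward function is not specified in the claim and is left abstract here.

```python
from collections import deque

ACTIONS = (-1, 0, +1)   # adjust downward, keep unchanged, adjust upward

class ExperiencePool:
    def __init__(self, capacity=10000):
        # bounded buffer: oldest records are discarded once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        # one state-transition record (s_t, a_t, r_{t+1}, s_{t+1})
        self.buffer.append((state, action, reward, next_state))

pool = ExperiencePool()
# hypothetical record: gas pressure, air valve, gas valve, dome temperature
pool.store([1.2, 0.5, 0.30, 1150.0], +1, 0.8, [1.2, 0.5, 0.35, 1163.0])
print(len(pool.buffer))   # 1
```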
3. The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism according to claim 1, characterized in that training the Q network with an experience replay mechanism comprises the following steps:
S21: initializing an experience pool D, parameters of the Q network and the target network, sampling priority of each state transition record, and recording the state transition record at the moment i as tau i;
s22: according to the sampling priority, calculating the sampling probability of each state transition record:
Wherein, beta is a state transition sampling constant, tau i is a state transition record of the moment i, and p τi is a sampling priority of the state transition record tau i;
S23: and sampling the state transition record according to the sampling probability, setting a loss function according to the Q value corresponding to the output action of the B-state transition record obtained by sampling, and updating the parameters of the Q network by adopting a gradient descent method so as to minimize the loss function value.
4. The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism according to claim 3, characterized in that the expression of the loss function is:

L = L_TD + λ_1 L_E + λ_2 L_{L2}

where L_TD is the temporal-difference error, L_E is the large-margin classification loss, L_{L2} is the L2 regularization loss, and λ_1, λ_2 are the weights of the corresponding loss terms.
5. The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism according to claim 4, characterized in that the expression of the temporal-difference error is:

L_TD = Y_t − Q(s_t, a_t; θ)

where Y_t denotes the cumulative expected return estimate at time t, and Q(s_t, a_t; θ) denotes the cumulative expected return estimate output by the Q network for input state s_t and action a_t; Y_t is computed as:

Y_t = r_{t+1} + γ Q(s_{t+1}, a*_{t+1}; θ⁻), with a*_{t+1} = argmax_a Q(s_{t+1}, a; θ)

where r_{t+1} is the reward at time t+1; γ ∈ (0, 1) is the discount coefficient; Q(s_{t+1}, a*_{t+1}; θ⁻) denotes the cumulative expected return estimate of the target network with parameters θ⁻ for input s_{t+1} and action a*_{t+1}; a*_{t+1} is the action with the maximum cumulative expected return estimate when state s_{t+1} is input to the Q network.
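The target Y_t of claim 5 follows the Double-DQN pattern (the Q network selects the argmax action, the target network evaluates it), which can be sketched numerically as follows. The network outputs are stubbed with fixed vectors; γ, the reward, and all values are illustrative.

```python
import numpy as np

gamma = 0.95
r_next = 1.0                                      # reward r_{t+1}

q_values_next      = np.array([0.2, 0.9, 0.4])    # Q(s_{t+1}, ·; theta)
target_values_next = np.array([0.3, 0.7, 0.5])    # Q(s_{t+1}, ·; theta^-)

# Q network selects the greedy action, target network evaluates it
a_star = int(np.argmax(q_values_next))            # argmax_a Q(s_{t+1}, a; theta)
Y_t = r_next + gamma * target_values_next[a_star]

q_sa = 0.6                                        # Q(s_t, a_t; theta)
td_error = Y_t - q_sa
print(round(Y_t, 3), round(td_error, 3))
```

Decoupling action selection (θ) from action evaluation (θ⁻) in this way reduces the overestimation bias of plain Q-learning targets.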
6. The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism according to claim 4, characterized in that the expression of the large-margin classification loss is:

L_E = max_{a∈A} [Q(s_t, a; θ) + l(a_E^t, a)] − Q(s_t, a_E^t; θ)

where A denotes the action space; s_t denotes the combustion state of the hot blast stove at time t; a denotes any action in the action space A; Q(s_t, a; θ) denotes the output of the Q network with parameters θ for action a under input state s_t; a_E^t denotes the expert action at time t, namely the valve-position adjustment action taken under state s_t in the state-transition record; l(a_E^t, a) is the margin penalty function, equal to a hyper-parameter b > 0 when the action output by the model is inconsistent with the expert action reflected in the record, and 0 otherwise.
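The large-margin loss of claim 6, in the form reconstructed above, can be sketched as follows. The Q values, the expert action index, and the margin b are illustrative; the three actions correspond to the valve adjustment directions of claim 2.

```python
import numpy as np

q_values = np.array([0.8, 0.5, 0.6])   # Q(s_t, ·; theta) over the 3 valve actions
a_expert = 1                           # expert (recorded) valve adjustment action
b = 0.4                                # margin hyper-parameter, > 0

# margin function l(a_E, a): b for non-expert actions, 0 for the expert action
margin = np.full_like(q_values, b)
margin[a_expert] = 0.0

L_E = float(np.max(q_values + margin) - q_values[a_expert])
print(round(L_E, 3))                   # 0.7
```

The loss is zero only when the expert action's Q value exceeds every other action's Q value by at least the margin b, which pushes the network toward reproducing the recorded expert valve adjustments.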
7. The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism according to claim 4, characterized in that the expression of the L2 regularization loss is:

L_{L2} = ‖θ‖²

where θ denotes the parameters of the Q network.
8. The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism according to claim 3, characterized in that the Q network synchronizes the updated parameters to the target network as follows: when the number of sampling steps reaches the preset target-network parameter update frequency, the parameters of the Q network are synchronized to the target network.
9. The intelligent combustion control method for the hot blast stove based on reinforcement learning and the attention mechanism according to claim 4, characterized in that the gradient descent method comprises: computing the importance-sampling weight of each state-transition record from its sampling probability, updating the gradient values according to the average importance-sampling weight of the B state-transition records obtained by sampling, and updating the sampling priorities of the state-transition records according to the temporal-difference error.
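The bookkeeping of claim 9 can be sketched with the standard prioritized-replay forms, which are assumptions here since the claim gives no formulas: importance weight w_i = (N · P(τ_i))^(−α) normalized by its maximum, and priority updated from the TD error as p_i = |δ_i| + ε. All constants are illustrative.

```python
import numpy as np

N = 4                                        # records currently in the pool
probs = np.array([0.4, 0.3, 0.2, 0.1])       # sampling probabilities P(tau_i)
alpha = 0.5                                  # importance-sampling exponent

# importance-sampling weights correct the bias of non-uniform sampling
w = (N * probs) ** (-alpha)
w /= w.max()                                 # normalize for gradient stability

# priorities are refreshed from the absolute TD errors of the sampled batch
td_errors = np.array([0.9, -0.2, 0.05, 1.5])
eps = 1e-3                                   # keeps every priority strictly positive
new_priorities = np.abs(td_errors) + eps
print(w.round(3), new_priorities)
```

Frequently sampled records receive smaller weights (and thus smaller gradient contributions), while records with large TD errors regain high priority for future sampling.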
CN202311375874.0A 2023-10-23 Intelligent combustion control method of hot blast stove based on reinforcement learning and attention mechanism Active CN117369263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311375874.0A CN117369263B (en) 2023-10-23 Intelligent combustion control method of hot blast stove based on reinforcement learning and attention mechanism


Publications (2)

Publication Number Publication Date
CN117369263A CN117369263A (en) 2024-01-09
CN117369263B true CN117369263B (en) 2024-07-09


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017818A (en) * 2022-06-20 2022-09-06 南京工业职业技术大学 Power plant flue gas oxygen content intelligent prediction method based on attention mechanism and multilayer LSTM
CN116668995A (en) * 2023-07-27 2023-08-29 苏州大学 Deep reinforcement learning-based vehicle networking dynamic beacon broadcasting method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant