CN113637819B

CN113637819B - Blast furnace material distribution method and system based on deep reinforcement learning

Info

Publication number: CN113637819B
Application number: CN202110937412.8A
Authority: CN
Inventors: 何树营; 赵春鹏; 李智杰; 周春晖
Original assignee: Beris Engineering and Research Corp
Current assignee: Beris Engineering and Research Corp
Priority date: 2021-08-16
Filing date: 2021-08-16
Publication date: 2022-10-18
Anticipated expiration: 2041-08-16
Also published as: CN113637819A

Abstract

The invention provides a blast furnace burden distribution method based on deep reinforcement learning. The method comprises: acquiring actual burden surface state data of the blast furnace; inputting the actual burden surface state data into a preset blast furnace burden distribution matrix optimization deep reinforcement learning model to obtain an optimized distribution matrix; and autonomously controlling the burden distribution system of the blast furnace with the optimized distribution matrix. When the blast furnace burden distribution matrix optimization deep reinforcement learning model is trained, the actual burden surface state, the distribution matrix, the reward or penalty obtained after the distribution matrix is applied, and the resulting effect on the actual burden surface after distribution are taken into account. By optimizing the distribution matrix with deep reinforcement learning, the method offers high control precision, good generalization, strong resistance to disturbances, high flexibility and high optimization efficiency.

Description

Blast furnace material distribution method and system based on deep reinforcement learning

Technical Field

The disclosure belongs to the technical field of metallurgy, and particularly relates to a blast furnace material distribution method and system based on deep reinforcement learning.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

The blast furnace ironmaking process involves a large number of complex physical and chemical reactions between two streams of material: solid raw materials such as ore and coke are charged into the furnace by a bell-less top charging system, while auxiliary fuels such as oxygen and pulverized coal are injected at the bottom of the furnace through the tuyeres. The main principle inside the blast furnace is the thermochemical reduction of iron ore by carbon monoxide. A continuous and stable thermal environment is a prerequisite for producing molten iron, and the ironmaking process is typically nonlinear, has large time delays, is heavily affected by noise and has distributed parameters. As a result, its control, and in particular burden control, is still largely based on experience, and automatic control of the blast furnace ironmaking process has long been a research focus in both academia and industry.

Burden distribution is an important part of blast furnace ironmaking, and the most common charging mode is bell-less distribution: after the ore and coke enter parallel or serial hoppers, they are distributed layer by layer through a rotating chute. The shape of the burden surface directly affects the gas-flow distribution in the ironmaking process. In general, a V-shaped burden surface with a suitable platform is favorable because it meets the criteria for gas-flow development: the central gas flow is activated and the peripheral gas flow is suppressed. The burden surface shape is closely related to the operating condition of the blast furnace, yet there is little research on how to set the distribution matrix according to the actual burden surface so as to obtain the optimal burden surface shape. The distribution matrix output by the distribution system affects the distribution of the burden, the distribution cycle, the distribution timing and the dynamics of iron production; moreover, the temperature and gas-flow distributions can be adjusted through the output characteristics of burden distribution, which in turn affects the operating state of the whole furnace. Practical experience shows that optimizing the distribution matrix is important for stable production and smooth operation of the furnace and for reducing the accident rate and fuel consumption.

Disclosure of Invention

To solve the above problems, the invention provides a blast furnace burden distribution method and system based on deep reinforcement learning. Historical blast furnace burden distribution data are acquired and stored in a memory according to a certain rule; a blast furnace burden distribution matrix model is trained offline on these data using deep reinforcement learning; and the blast furnace burden distribution system then uses the trained model, by observing the actual environment of the blast furnace, to control burden distribution autonomously.

To achieve the above purpose, the invention adopts the following technical solution:

In a first aspect, the present disclosure provides a blast furnace burden distribution method based on deep reinforcement learning, including:

acquiring actual burden surface state data of the blast furnace;

inputting the actual burden surface state data into a preset blast furnace burden distribution matrix optimization deep reinforcement learning model to obtain an optimized distribution matrix;

autonomously controlling the burden distribution system of the blast furnace with the optimized distribution matrix;

wherein, when the blast furnace burden distribution matrix optimization deep reinforcement learning model is trained, the actual burden surface state, the distribution matrix, the reward or penalty obtained after the distribution matrix is applied, and the resulting effect on the actual burden surface after distribution are taken into account.

Further, the training of the blast furnace burden distribution matrix optimization deep reinforcement learning model mainly comprises:

acquiring historical blast furnace burden distribution data, including the actual burden surface state, the distribution matrix, the reward or penalty obtained after the distribution matrix is applied, and the effect on the actual burden surface after distribution;

sampling a batch of data from the historical data; setting the distribution strategy and the distribution matrix for the current moment according to the actual burden surface state at that moment; obtaining the actual distribution reward from the actual feedback of the blast furnace burden surface; and estimating the optimal burden surface state and the optimal distribution matrix action at the next moment;

adjusting the weights of the deep network in the direction that decreases the loss function, and iterating the training until the deep network is optimal.

Further, the acquired historical blast furnace burden distribution data are regularized; the regularized data format is a quintuple comprising the burden surface state of the blast furnace at the previous moment, the burden distribution action at the current moment, the reward value obtained by the current action, the reward value calculated according to the set reward rule, and the burden surface state at the next moment after the action is executed.

Further, the blast furnace burden distribution matrix optimization deep reinforcement learning model comprises a target network module and a prediction network module; the weights of the target network and the prediction network are updated alternately along the descent direction of the loss function until an iteration condition is met, generating an optimal distribution matrix optimization model.

Further, the actual burden surface state data of the blast furnace are acquired continuously, so that the actual burden surface state at each moment is obtained.

Further, the input state of the blast furnace burden distribution matrix optimization deep reinforcement learning model is the continuously acquired actual burden surface state of the blast furnace, and the output is a distribution matrix comprising distribution angles and numbers of distribution turns.

Further, the trained blast furnace burden distribution matrix optimization deep reinforcement learning model is packaged and integrated into an independent system that communicates with the burden distribution and detection equipment to realize autonomous control of burden distribution.

In a second aspect, the disclosure also provides a blast furnace burden distribution system based on deep reinforcement learning, comprising a data acquisition module, an optimization module and a control module;

the data acquisition module is configured to acquire actual burden surface state data of the blast furnace;

the optimization module is configured to input the actual burden surface state data into a preset blast furnace burden distribution matrix optimization deep reinforcement learning model to obtain an optimized distribution matrix;

the control module is configured to autonomously control the burden distribution system of the blast furnace with the optimized distribution matrix.
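The three modules can be pictured as a simple acquire, optimize, control pipeline. The Python sketch below is only an illustration under stated assumptions: the state and matrix representations, the callback names `read_sensor`, `policy` and `send_to_chute`, and the `control_cycle` helper are all hypothetical and stand in for the real detector, trained model and chute interface, which the disclosure does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical representations (not specified in the disclosure):
# a burden surface state is a list of surface heights at radial sample points,
# and a distribution matrix is a list of (chute angle in degrees, turns) pairs.
State = List[float]
DistributionMatrix = List[Tuple[float, int]]


@dataclass
class DataAcquisitionModule:
    read_sensor: Callable[[], State]  # hypothetical hook to the burden surface detector

    def acquire(self) -> State:
        return self.read_sensor()


@dataclass
class OptimizationModule:
    policy: Callable[[State], DistributionMatrix]  # stands in for the trained DRL model

    def optimize(self, state: State) -> DistributionMatrix:
        return self.policy(state)


@dataclass
class ControlModule:
    send_to_chute: Callable[[DistributionMatrix], None]  # hypothetical actuator interface

    def apply(self, matrix: DistributionMatrix) -> None:
        self.send_to_chute(matrix)


def control_cycle(acq: DataAcquisitionModule,
                  opt: OptimizationModule,
                  ctl: ControlModule) -> DistributionMatrix:
    """One pass of acquire -> optimize -> control; returns the matrix that was sent."""
    state = acq.acquire()
    matrix = opt.optimize(state)
    ctl.apply(matrix)
    return matrix


if __name__ == "__main__":
    sent: List[DistributionMatrix] = []
    acq = DataAcquisitionModule(read_sensor=lambda: [1.0, 0.8, 0.5])
    opt = OptimizationModule(policy=lambda s: [(38.0, 3), (35.5, 2)])  # dummy policy
    ctl = ControlModule(send_to_chute=sent.append)
    print(control_cycle(acq, opt, ctl))
```

Injecting the sensor, model and actuator as callables keeps the control loop testable without plant hardware; in a deployed system the `policy` callable would wrap the trained model.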

Compared with the prior art, the beneficial effects of the present disclosure are as follows:

Real sensor data from an actual blast furnace burden distribution system are acquired and regularized; the distribution matrix optimization model of the burden distribution system is trained offline with deep reinforcement learning, producing an optimal model for each stage; and the trained blast furnace burden distribution matrix optimization deep reinforcement learning model then controls blast furnace burden distribution autonomously. This intelligent, autonomous distribution control based on deep reinforcement learning optimizes the distribution matrix and offers high control precision, good generalization, strong resistance to disturbances, high flexibility and high search efficiency.

Drawings

The accompanying drawings, which form a part of this specification, are included to provide a further understanding of the present embodiments; the exemplary embodiments and their description serve to explain the present embodiments and do not unduly limit them.

Fig. 1 is a flow chart of Example 1 of the present disclosure;

Fig. 2 is a block diagram of the deep reinforcement learning structure according to Example 1 of the present disclosure.

Detailed Description

The present disclosure is further described below with reference to the drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

Example 1:

As shown in Fig. 1, this example provides a blast furnace burden distribution method and system based on deep reinforcement learning, including:

S1: acquiring historical blast furnace burden distribution data and storing them in a memory. The historical data comprise the actual burden surface state, the action taken (a distribution matrix), the reward or penalty obtained after the distribution matrix is applied, and the resulting effect on the actual burden surface after distribution, namely the burden surface state at the next moment;

S2: training the blast furnace burden distribution matrix optimization deep reinforcement learning model on the historical burden distribution data;

the training comprises: sampling data in batches from the stored historical data; setting the distribution strategy and the distribution matrix for the current moment according to the actual burden surface state at that moment; obtaining the actual distribution reward from the actual feedback of the blast furnace burden surface; estimating the optimal burden surface state and the optimal distribution matrix action at the next moment, until distribution is finished; adjusting the weights of the deep network in the direction that decreases the loss function; and iterating the training until the deep network is optimal;

S3: autonomously controlling the burden distribution system of the blast furnace through the trained blast furnace burden distribution matrix optimization deep reinforcement learning model.

In step S1, the historical blast furnace burden distribution data comprise the burden surface state of the blast furnace at the current moment, the distribution matrix that was set, the reward obtained by applying the distribution matrix, and the burden surface state at the next moment;

in step S1, the acquired historical burden distribution data are regularized;

the regularized data format is the quintuple $(s_t, a_t, R_t, s_{t+1})$, where $s_t$ is the burden surface state of the blast furnace at the current moment; $a_t$ is the burden distribution action at the current moment, namely the distribution matrix; $R_t$ is the reward value obtained by the current action, calculated according to the set reward rule; and $s_{t+1}$ is the burden surface state at the next moment after action $a_t$ is executed.
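One way to hold the regularized records in memory and draw the uniform random mini-batches used for training is sketched below. The field names, the integer encoding of the distribution-matrix action, and the fixed-capacity first-in-first-out eviction are assumptions for illustration; the disclosure only says the data are stored "according to a certain rule".

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass(frozen=True)
class Transition:
    """One regularized record (s_t, a_t, R_t, s_{t+1})."""
    state: Tuple[float, ...]       # s_t: burden surface state at time t
    action: int                    # a_t: index of the distribution matrix applied
    reward: float                  # R_t: reward from the set reward rule
    next_state: Tuple[float, ...]  # s_{t+1}: burden surface state after a_t


class ReplayMemory:
    """Fixed-capacity memory; once full, the oldest records are dropped first."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._buffer: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, record: Transition) -> None:
        self._buffer.append(record)

    def sample(self, batch_size: int) -> List[Transition]:
        """Uniform random mini-batch drawn without replacement."""
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self) -> int:
        return len(self._buffer)
```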

In step S2, the blast furnace burden distribution matrix optimization deep reinforcement learning model is trained on the collected burden distribution data. The agent in this example is the blast furnace burden distribution system, which sets a distribution matrix according to the real-time burden surface state and carries out distribution;

as shown in Fig. 2, in this example the blast furnace burden distribution matrix optimization deep reinforcement learning model comprises a target network module and a prediction network module. The prediction network module estimates the burden surface state of the blast furnace, outputs the value of the burden distribution system for each distribution matrix that can be adopted in that state, selects the optimal distribution matrix action according to these estimated action values, and feeds the selected action back to the target network module. The target network module is responsible for evaluation: it evaluates the state value, action value, optimal state and optimal action estimated by the prediction network. Separating action selection from action evaluation in this way prevents the overestimation of values, and hence overly optimistic estimates, that arises when the same network both selects and evaluates actions.

Both value functions are learned from historical data sampled randomly in batches. The prediction network weights θ are updated at every learning step, while the target network weights θ′ are updated at fixed intervals of learning steps: θ′ is copied directly from θ and does not take part in the real-time updates of the learning process. In each update, one set of weights is used to determine the greedy policy and the other set is used to evaluate its value.
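The separation of selection (prediction network, weights θ) from evaluation (target network, weights θ′) corresponds to the Double DQN target. A minimal NumPy sketch, assuming the value arrays for the next states have already been computed by the two networks; the `dones` mask for finished distribution episodes and the names `double_dqn_targets` and `sync_target` are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np


def double_dqn_targets(rewards, dones, q_pred_next, q_target_next, gamma):
    """Double-Q learning targets for a mini-batch.

    rewards:       (B,)   reward of each sampled transition
    dones:         (B,)   1.0 where distribution has finished (no bootstrap), else 0.0
    q_pred_next:   (B, A) prediction-network values Q(s_{t+1}, . ; theta)
    q_target_next: (B, A) target-network values    Q(s_{t+1}, . ; theta')
    """
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    q_pred_next = np.asarray(q_pred_next, dtype=float)
    q_target_next = np.asarray(q_target_next, dtype=float)
    # Selection: the prediction network (theta) picks the greedy next action.
    best_next = np.argmax(q_pred_next, axis=1)
    # Evaluation: the target network (theta') scores that action.
    evaluated = q_target_next[np.arange(rewards.shape[0]), best_next]
    return rewards + gamma * (1.0 - dones) * evaluated


def sync_target(theta):
    """Hard update theta' <- theta, performed every fixed number of learning steps."""
    return {name: w.copy() for name, w in theta.items()}
```

For example, with rewards [1, 0], prediction values [[1, 3], [5, 2]], target values [[10, 20], [30, 40]] and γ = 0.5, the prediction network picks actions 1 and 0, the target network scores them 20 and 30, and the targets are [11, 15]; taking the target network's own maximum instead would give [11, 20].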

In the present disclosure, the instantaneous reward function of the burden distribution system is defined by the deviation between the actual and the ideal burden surface of the blast furnace, namely the negative squared difference:

$$r_t = -\left[\psi(x) - \xi(x)\right]^2$$

where ψ(x) and ξ(x) are the ideal and the actual burden surface functions of the blast furnace, respectively. Mature methods for calculating the actual and ideal burden surface functions have already been developed and are not detailed here; the focus of the present disclosure is the method for optimizing the blast furnace burden distribution matrix.
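The reward above is stated pointwise in x. A minimal sketch, assuming ψ and ξ are available as surface heights sampled at the same radial positions and that the squared deviations are summed into one scalar; the disclosure does not specify this aggregation, and `instant_reward` is a hypothetical name:

```python
import numpy as np


def instant_reward(ideal_profile, actual_profile):
    """Negative squared deviation between ideal (psi) and actual (xi) burden surfaces.

    Assumption: both surfaces are sampled at the same radial positions x_i,
    and the pointwise squared differences are summed into one scalar reward.
    """
    psi = np.asarray(ideal_profile, dtype=float)
    xi = np.asarray(actual_profile, dtype=float)
    if psi.shape != xi.shape:
        raise ValueError("ideal and actual profiles must be sampled at the same points")
    return -float(np.sum((psi - xi) ** 2))
```

The reward is zero only when the actual surface matches the ideal one at every sample point and becomes more negative as the deviation grows.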

Of particular interest in this disclosure is the long-term reward $R_t$ of the distribution system, defined as the sum of the current instantaneous reward and all future (infinite-horizon) instantaneous rewards. To keep the reward finite, a discount factor that weights future rewards relative to the current reward is introduced. For simplicity, the weights are usually assumed to form a geometric progression: γ is a fixed value in [0, 1), and the discount factor at each future time is γ times the discount factor at the preceding time. The long-term reward $R_t$ of the distribution system can therefore be expressed as:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$$
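With the geometric discounting described above, the long-term reward satisfies R_t = r_t + γR_{t+1}, so a finite sequence of instantaneous rewards can be accumulated backwards. The sketch below truncates the infinite sum at the length of the supplied list, which is an assumption made so the example terminates; `discounted_return` is a hypothetical name:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma**k * r_{t+k}, truncated at the length of `rewards`."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Backward recursion R_t = r_t + gamma * R_{t+1}, starting from the last reward.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For a constant reward r the truncated sum approaches the bound r/(1 - γ); for example, 200 unit rewards with γ = 0.9 give a value within 1e-6 of 10.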

The method mainly aims to optimize the distribution matrix, which consists of the angles of the distribution chute and the number of distribution turns at each angle. The joint decision set of distribution angle and number of turns is defined as the Cartesian product of the angle decision set and the turn decision set, $A_t = A_1 \times A_2$, where $A_1$ denotes the distribution angle decision set and $A_2$ denotes the distribution turn decision set. The goal of the burden distribution system is then to select, at any time t, the optimal decision that maximizes the long-term return, expressed as:

$$a_t^{*} = \arg\max_{a_t \in A_t} \mathbb{E}\left[R_t \mid s_t, a_t\right]$$
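The joint decision set A_t = A_1 × A_2 can be enumerated directly as a Cartesian product, and a greedy decision picks the combination with the largest estimated action value, which the trained network supplies as its estimate of the expected long-term reward. The angle and turn values below are hypothetical placeholders, and treating each decision as a single (angle, turns) pair is a simplifying assumption; a full distribution matrix may list several angles, each with its own number of turns.

```python
from itertools import product

import numpy as np

# Hypothetical decision sets; real values depend on the furnace and chute.
ANGLE_SET = (40.0, 38.0, 36.0)  # A_1: chute angles in degrees
TURN_SET = (1, 2, 3)            # A_2: numbers of turns at an angle

# A_t = A_1 x A_2: every (angle, turns) combination, in a fixed order
DECISION_SET = list(product(ANGLE_SET, TURN_SET))


def greedy_decision(q_values):
    """Pick the (angle, turns) pair whose estimated action value is largest."""
    q = np.asarray(q_values, dtype=float)
    if q.shape != (len(DECISION_SET),):
        raise ValueError(f"expected {len(DECISION_SET)} action values, got shape {q.shape}")
    return DECISION_SET[int(np.argmax(q))]
```

Because the network outputs one value per element of the product set, the output layer size equals |A_1| × |A_2| (nine in this toy setting).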

The training iteration process of the blast furnace burden distribution matrix optimization deep reinforcement learning model in this example is as follows:

(1) Storing the regularized blast furnace burden distribution data in a storage unit;

(2) Uniformly and randomly drawing mini-batches of samples from the historical data for model training;

(3) Calculating the state value and the action value through the value functions:

$$Q_\pi(s_t, a_t) = \mathbb{E}_\pi\left[R_t \mid s_t, a_t\right], \qquad V_\pi(s_t) = \mathbb{E}_\pi\left[R_t \mid s_t\right]$$

where $Q_\pi(s_t, a_t)$ is the action-value function and $V_\pi(s_t)$ is the state-value function.

(4) Updating all weights of the prediction network by back-propagating the gradient of the loss function through the neural network:

$$W_{t+1} = W_t - \alpha g_t$$

where W is the weight of the neural network, α is the learning rate, and $g_t$ is the gradient of the loss function.

(5) After the iterations are finished, running a simulation test with the resulting blast furnace burden distribution matrix optimization deep reinforcement learning model.
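Steps (2) to (4) can be illustrated end to end with a small NumPy sketch. To keep it self-contained and runnable, it makes several assumptions that the disclosure does not state: a linear action-value function Q(s, ·) = sW stands in for the deep prediction network, the loss is the squared temporal-difference error with the target held fixed, the Double-Q target is used, and the target weights are copied every `sync_every` steps. The update line implements W_{t+1} = W_t - αg_t; all function names are hypothetical.

```python
import numpy as np


def q_values(W, states):
    """Linear action-value approximator Q(s, .) = s @ W; returns shape (B, A)."""
    return states @ W


def train_step(W, W_target, batch, gamma, alpha):
    """One SGD step W <- W - alpha * g on a squared TD loss (target held fixed).

    batch = (s, a, r, s_next) with shapes (B, D), (B,), (B,), (B, D).
    Returns the updated weights and the loss before the update.
    """
    s, a, r, s_next = batch
    idx = np.arange(len(r))
    # Double-Q target: W selects the next action, W_target evaluates it.
    a_star = np.argmax(q_values(W, s_next), axis=1)
    y = r + gamma * q_values(W_target, s_next)[idx, a_star]
    td = q_values(W, s)[idx, a] - y  # prediction minus target
    loss = 0.5 * np.mean(td ** 2)
    # dLoss/dW: only the column of each taken action receives a gradient.
    grad_T = np.zeros((W.shape[1], W.shape[0]))  # (A, D), transposed for np.add.at
    np.add.at(grad_T, a, td[:, None] * s / len(r))
    return W - alpha * grad_T.T, loss


def train(transitions, n_actions, gamma=0.9, alpha=0.05, batch_size=32,
          steps=500, sync_every=50, seed=0):
    """Steps (2)-(4): uniform mini-batches, SGD on W, periodic copy to W_target."""
    s, a, r, s_next = transitions
    rng = np.random.default_rng(seed)
    W = np.zeros((s.shape[1], n_actions))
    W_target = W.copy()
    for step in range(1, steps + 1):
        pick = rng.integers(0, len(r), size=batch_size)  # uniform, with replacement
        W, _ = train_step(W, W_target, (s[pick], a[pick], r[pick], s_next[pick]),
                          gamma, alpha)
        if step % sync_every == 0:
            W_target = W.copy()  # theta' <- theta at a fixed interval
    return W
```

A linear model keeps the gradient explicit; replacing `q_values` and the hand-written gradient with a neural network and automatic differentiation would give the deep variant described in the text.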

In step S3, the trained blast furnace burden distribution matrix optimization deep reinforcement learning model is packaged and integrated into an independent system that communicates with the burden distribution and detection equipment, such as the chute, so that burden distribution is controlled autonomously.

Example 2:

This example provides a blast furnace burden distribution system based on deep reinforcement learning, comprising a data acquisition module, an optimization module and a control module.

Example 3:

This example provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the deep reinforcement learning-based blast furnace burden distribution method described in Example 1.

Example 4:

This example provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the deep reinforcement learning-based blast furnace burden distribution method described in Example 1.

The above description is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the present embodiment shall fall within the protection scope of the present embodiment.

Claims

1. A blast furnace material distribution method based on deep reinforcement learning is characterized by comprising the following steps:

acquiring actual burden surface state data of the blast furnace;

wherein, when the blast furnace burden distribution matrix optimization deep reinforcement learning model is trained, the actual burden surface state, the distribution matrix, the reward or penalty obtained after the distribution matrix is applied, and the effect on the actual burden surface after distribution are taken into account; and the instantaneous reward function of the burden distribution system is set as the deviation between the actual burden surface and the ideal burden surface of the blast furnace;

the training of the blast furnace burden distribution matrix optimization deep reinforcement learning model comprises: adjusting the weights of the deep network in the direction that decreases the loss function; the model comprises a target network module and a prediction network module, whose weights are updated alternately along the descent direction of the loss function; the prediction network module estimates the burden surface state of the blast furnace, outputs the value of the burden distribution system for each distribution matrix that can be adopted in that state, selects the distribution matrix action according to the action values corresponding to the estimated burden surface state, and feeds the selected optimal distribution matrix action back to the target network module; and the target network module is responsible for evaluation, evaluating the state value, action value, state and action estimated by the prediction network.

2. The blast furnace burden distribution method based on deep reinforcement learning of claim 1, wherein the blast furnace burden distribution matrix optimization deep reinforcement learning model training comprises the following main contents:

sampling a batch of data from the historical data; setting the distribution strategy and the distribution matrix for the current moment according to the actual burden surface state at that moment; obtaining the actual distribution reward from the actual feedback of the blast furnace burden surface; and estimating the optimal burden surface state and the optimal distribution matrix action at the next moment;

3. The blast furnace burden distribution method based on deep reinforcement learning according to claim 2, wherein the acquired historical blast furnace burden distribution data are regularized; the regularized data format is a quintuple comprising the burden surface state of the blast furnace at the previous moment, the burden distribution action at the current moment, the reward value obtained by the current action, the reward value calculated according to the set reward rule, and the burden surface state at the next moment after the action is executed.

4. The blast furnace burden distribution method based on deep reinforcement learning according to claim 2, wherein the blast furnace burden distribution matrix optimization deep reinforcement learning model comprises a target network module and a prediction network module, and the target network and the prediction network alternately update their weights along the descent direction of the loss function until an iteration condition is met, generating an optimal distribution matrix optimization model.

5. The blast furnace burden distribution method based on deep reinforcement learning according to claim 1, wherein the actual burden surface state data of the blast furnace are acquired continuously, so that the actual burden surface state at each moment is obtained.

6. The blast furnace burden distribution method based on deep reinforcement learning according to claim 1, wherein the input state of the blast furnace burden distribution matrix optimization deep reinforcement learning model is the continuously acquired actual burden surface state of the blast furnace, and the output is a distribution matrix comprising distribution angles and numbers of distribution turns.

7. The blast furnace burden distribution method based on deep reinforcement learning according to claim 1, wherein the trained blast furnace burden distribution matrix optimization deep reinforcement learning model is packaged and integrated into an independent system that communicates with burden distribution and detection equipment to realize autonomous control of burden distribution.

8. A blast furnace material distribution system based on deep reinforcement learning is characterized by comprising a data acquisition module, an optimization module and a control module;

9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the blast furnace burden distribution method based on deep reinforcement learning according to any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the deep reinforcement learning-based blast furnace burden distribution method according to any one of claims 1 to 7.