EP3872432A1 - Method, apparatus and electronic device for constructing reinforcement learning model - Google Patents
- Publication number
- EP3872432A1 (application number EP21164660.9A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- feed amount
- coal feed
- calciner
- reinforcement learning
- current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- C—CHEMISTRY; METALLURGY
- C04—CEMENTS; CONCRETE; ARTIFICIAL STONE; CERAMICS; REFRACTORIES
- C04B—LIME, MAGNESIA; SLAG; CEMENTS; COMPOSITIONS THEREOF, e.g. MORTARS, CONCRETE OR LIKE BUILDING MATERIALS; ARTIFICIAL STONE; CERAMICS; REFRACTORIES; TREATMENT OF NATURAL STONE
- C04B7/00—Hydraulic cements
- C04B7/36—Manufacture of hydraulic cements in general
- C04B7/361—Condition or time responsive control in hydraulic cement manufacturing processes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F27—FURNACES; KILNS; OVENS; RETORTS
- F27B—FURNACES, KILNS, OVENS, OR RETORTS IN GENERAL; OPEN SINTERING OR LIKE APPARATUS
- F27B7/00—Rotary-drum furnaces, i.e. horizontal or slightly inclined
- F27B7/20—Details, accessories, or equipment peculiar to rotary-drum furnaces
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F27—FURNACES; KILNS; OVENS; RETORTS
- F27B—FURNACES, KILNS, OVENS, OR RETORTS IN GENERAL; OPEN SINTERING OR LIKE APPARATUS
- F27B7/00—Rotary-drum furnaces, i.e. horizontal or slightly inclined
- F27B7/20—Details, accessories, or equipment peculiar to rotary-drum furnaces
- F27B7/42—Arrangement of controlling, monitoring, alarm or like devices
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F27—FURNACES; KILNS; OVENS; RETORTS
- F27D—DETAILS OR ACCESSORIES OF FURNACES, KILNS, OVENS, OR RETORTS, IN SO FAR AS THEY ARE OF KINDS OCCURRING IN MORE THAN ONE KIND OF FURNACE
- F27D19/00—Arrangements of controlling devices
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F27—FURNACES; KILNS; OVENS; RETORTS
- F27D—DETAILS OR ACCESSORIES OF FURNACES, KILNS, OVENS, OR RETORTS, IN SO FAR AS THEY ARE OF KINDS OCCURRING IN MORE THAN ONE KIND OF FURNACE
- F27D21/00—Arrangements of monitoring devices; Arrangements of safety devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- F—MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
- F27—FURNACES; KILNS; OVENS; RETORTS
- F27D—DETAILS OR ACCESSORIES OF FURNACES, KILNS, OVENS, OR RETORTS, IN SO FAR AS THEY ARE OF KINDS OCCURRING IN MORE THAN ONE KIND OF FURNACE
- F27D19/00—Arrangements of controlling devices
- F27D2019/0096—Arrangements of controlling devices involving simulation means, e.g. of the treating or charging step
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/08—Thermal analysis or thermal optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/14—Force analysis or force optimisation, e.g. static or dynamic forces
Definitions
- the present disclosure relates to the field of data processing technology, particularly to the field of big data and deep learning technology, more particularly to a method, apparatus and electronic device for constructing a reinforcement learning model, and to a computer readable storage medium.
- Embodiments of the present disclosure propose a method, apparatus and electronic device for constructing a reinforcement learning model.
- the embodiments also relate to a computer readable storage medium.
- embodiments of the present disclosure provide a method for constructing a reinforcement learning model, comprising: establishing a first simulation model between a calciner coal feed amount and a calciner temperature; establishing a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature; establishing a prediction model among: an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; and constructing a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model; the coal feed amount comprising the calciner coal feed amount and the kiln head coal feed amount.
- embodiments of the present disclosure provide an apparatus for constructing a reinforcement learning model, comprising: a first simulation model establishing unit, configured to establish a first simulation model between a calciner coal feed amount and a calciner temperature; a second simulation model establishing unit, configured to establish a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature; a prediction model establishing unit, configured to establish a prediction model among an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; and a reinforcement learning model construction unit, configured to construct a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model; the coal feed amount comprising the calciner coal feed amount and the kiln head coal feed amount.
- embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage apparatus, storing one or more programs thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method provided by the first aspect.
- embodiments of the present disclosure provide a computer-readable medium, storing a computer program thereon, wherein the program, when executed by a processor, causes the processor to implement the method provided by the first aspect.
- embodiments of the present disclosure provide a computer program product including a computer program, where the computer program, when executed by a processing apparatus, implements the method provided by the first aspect.
- in the method and apparatus for constructing a reinforcement learning model, the electronic device, and the computer readable storage medium provided by the embodiments of the present disclosure, the first simulation model between the calciner coal feed amount and the calciner temperature is first established, along with the second simulation model among the kiln head coal feed amount, the kiln current, the secondary air temperature, and the smoke chamber temperature; the prediction model among the under-grate pressure, the calciner temperature output by the first simulation model, the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model, and the free calcium content is then established; and finally the reinforcement learning model that represents the association between the coal feed amount (including the calciner coal feed amount and the kiln head coal feed amount) and the free calcium content is constructed according to the preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model.
- the present disclosure introduces the concept of reinforcement learning into a cement calcination scenario.
- the reinforcement learning model that may represent the corresponding relationship between the input coal feed amount and the free calcium content of a final product under the influence of a plurality of parameters is constructed.
- unlike other machine learning models, which act as compensators, the reinforcement learning model is more compatible with the complex, multi-parameter cement calcination scenario, making the determined corresponding relationship more accurate; at the same time, its strong generalization ability allows it to be applied more simply to other similar scenarios.
- Fig. 1 shows an exemplary system architecture 100 to which the embodiments of the method, apparatus and electronic device for constructing a reinforcement learning model, and computer readable storage medium of the present disclosure may be applied.
- the system architecture 100 may include sensors 101, 102, and 103, a network 104, a server 105, and a coal feeding device 106.
- the network 104 is used to provide a communication link medium between the sensors 101, 102, and 103 and the server 105, and between the server 105 and the coal feeding device 106.
- the network 104 may include various connection types, such as wired or wireless communication links, or optical fiber cables.
- Various types of information acquired by the sensors 101, 102, and 103 may be sent to the server 105 through the network 104, and the server 105 may generate control instructions based on the received information after processing and then issue the control instructions to the coal feeding device 106 through the network 104.
- the above communication may be implemented by various applications installed on the sensors 101, 102, and 103, the server 105, and the coal feeding device 106, such as information transmission applications, coal feed optimization control applications, or control instruction sending and receiving applications.
- the sensors 101, 102, and 103 are physical components (such as pressure sensors, temperature sensors, current sensors) installed in relevant positions of cement calcination-related devices (such as calciners, clinker kilns) to receive actual signals generated by actual devices.
- the sensors 101, 102, and 103 may also be virtual components provided on virtual related devices of cement calcination, to receive parameters or simulation parameters predetermined in the test scenarios.
- the server 105 may be hardware or software.
- When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server; when the server is software, it may be implemented as a plurality of software or software modules, or as a single software or software module, which is not limited herein.
- the coal feeding device 106 may be embodied as a physical device such as a coal conveyor belt or a coal conveyor. In a virtual test scenario, it may be directly replaced by a virtual device having a controlled coal conveying capacity.
- the server 105 may provide various services through various built-in applications.
- a coal feed optimization control application that may provide a coal feed optimization control service in cement calcination may be used as an example.
- the server 105 operates the coal feed optimization control application and may achieve the following effects: first, receive an instruction for a target free calcium content required for cement clinker production of the present batch; then, input the target free calcium content into a pre-constructed reinforcement learning model that represents a corresponding relationship between a coal feed amount and a free calcium content to obtain a theoretical coal feed amount output by the reinforcement learning model; next, issue a corresponding coal feed amount instruction to the coal feeding device 106 using a theoretical calciner coal feed amount and a theoretical kiln head coal feed amount included in the theoretical coal feed amount.
- the reinforcement learning model used by the server 105 in the above process may be constructed based on the following method: first, receiving a large amount of historical data on the calciner coal feed amount, calciner temperature, kiln head coal feed amount, kiln current, secondary air temperature, smoke chamber temperature and under-grate pressure from the sensors 101, 102, and 103 through the network 104; then, establishing a first simulation model between the calciner coal feed amount and the calciner temperature, and establishing a second simulation model among the kiln head coal feed amount, the kiln current, the secondary air temperature, and the smoke chamber temperature; then, establishing a prediction model among: the under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; and finally, constructing the reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model.
- the parameters used to construct the simulation models and the prediction model, such as the calciner coal feed amount, the calciner temperature, the kiln head coal feed amount, the kiln current, the secondary air temperature, the smoke chamber temperature, and the under-grate pressure, may be acquired from the sensors 101, 102, and 103, or may be stored locally in the server 105 in various forms such as logs or production inspection data reports. When the server 105 detects that the data have been stored locally, it may choose to acquire the data directly from local storage; in that case, the process of generating the reinforcement learning model may not require the sensors 101, 102, and 103 or the network 104.
- the method for constructing a reinforcement learning model provided in the subsequent embodiments of the present disclosure is generally performed by the server 105 having strong computing power and more computing resources. Accordingly, the apparatus for constructing a reinforcement learning model is generally also provided in the server 105.
- It should be understood that the number of sensors, networks, servers and coal feeding devices in Fig. 1 is merely illustrative. Depending on the implementation needs, there may be any number of sensors, networks, servers and coal feeding devices.
- FIG. 2 is a flowchart of a method for constructing a reinforcement learning model according to an embodiment of the present disclosure.
- a flow 200 includes the following steps:
- Step 201: establishing a first simulation model between a calciner coal feed amount and a calciner temperature.
- This step aims to establish the first simulation model between the calciner coal feed amount and the calciner temperature by an executing body of the method for constructing a reinforcement learning model (for example, the server 105 shown in Fig. 1 ).
- the first simulation model is used to represent a corresponding relationship between the calciner coal feed amount and the calciner temperature.
- In order to establish the first simulation model that may represent this corresponding relationship, a large amount of historical calciner coal feed amount data and corresponding historical calciner temperature data are required as sample data to participate in training and construction of the simulation model. The first simulation model may be expressed as y(k) = a·y(k-1) + b·u(k-1), where:
- y(k) is the calciner temperature at time k;
- y(k-1) and u(k-1) are respectively the calciner temperature and the calciner coal feed amount at time k-1 (that is, the moment preceding time k);
- a and b are respectively undetermined coefficients, whose particular values may be obtained by calculation using the least squares method based on historical data. For example, in a certain experimental scenario, a is 0.983 and b is 0.801.
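Using the coefficients given above (a = 0.983, b = 0.801 in one experimental scenario), the least-squares identification of the first simulation model can be sketched as follows. The data here are synthetic stand-ins for the historical sensor records; the units and value ranges are assumptions for illustration, not taken from the patent.

```python
import numpy as np

# Synthetic historical data (assumption: real data would come from plant sensors).
rng = np.random.default_rng(0)
true_a, true_b = 0.983, 0.801
u = rng.uniform(8.0, 12.0, size=200)  # calciner coal feed amount (t/h, assumed range)
y = np.empty(201)
y[0] = 870.0                          # initial calciner temperature (assumed)
for k in range(200):
    # First simulation model dynamics: y(k) = a*y(k-1) + b*u(k-1) plus noise
    y[k + 1] = true_a * y[k] + true_b * u[k] + rng.normal(0, 0.05)

# Least-squares fit of the undetermined coefficients a and b
X = np.column_stack([y[:-1], u])      # regressors: y(k-1), u(k-1)
t = y[1:]                             # targets: y(k)
(a, b), *_ = np.linalg.lstsq(X, t, rcond=None)
print(f"a={a:.3f}, b={b:.3f}")
```

With enough historical samples, the recovered coefficients converge to the underlying values, which is how the example values a = 0.983 and b = 0.801 would be obtained in practice.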
- Step 202: establishing a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature.
- This step aims to establish the second simulation model among the kiln head coal feed amount, the kiln current, the secondary air temperature, and the smoke chamber temperature by the executing body.
- the second simulation model is used to represent a corresponding relationship between the kiln head coal feed amount and the kiln current, the secondary air temperature, and the smoke chamber temperature.
- In order to establish the second simulation model that may represent this corresponding relationship, a large amount of historical kiln head coal feed amount data and corresponding historical kiln current, historical secondary air temperature, and historical smoke chamber temperature data are required as sample data to participate in training and construction of the simulation model. The second simulation model may be constructed in the same form as the above formula.
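The same least-squares identification extends to the second simulation model, which has three outputs (kiln current, secondary air temperature, smoke chamber temperature) driven by the kiln head coal feed amount. The sketch below fits a multi-output linear model; all coefficients, initial values, and value ranges are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Illustrative only: coefficients and data are synthetic, not from the patent.
rng = np.random.default_rng(1)
A = np.diag([0.95, 0.97, 0.96])      # per-output autoregressive coefficients
b = np.array([2.0, 1.5, 1.2])        # effect of kiln head coal feed on each output
u = rng.uniform(5.0, 9.0, size=300)  # kiln head coal feed amount (assumed range)
# State: [kiln current, secondary air temperature, smoke chamber temperature]
s = np.empty((301, 3))
s[0] = [850.0, 1100.0, 1050.0]       # assumed initial operating point
for k in range(300):
    s[k + 1] = A @ s[k] + b * u[k] + rng.normal(0, 0.1, size=3)

# Multi-output least squares: fit S(k) = W.T @ [S(k-1); u(k-1)] in one call
X = np.column_stack([s[:-1], u])     # (300, 4) regressors
W, *_ = np.linalg.lstsq(X, s[1:], rcond=None)  # (4, 3) coefficient matrix
print(np.round(W.T, 3))              # each row: fitted coefficients per output
```

`np.linalg.lstsq` accepts a matrix of targets, so the three controlled variables are identified jointly from one pass over the historical data.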
- adjustable variables mainly include: feed amount, calciner coal feed amount, kiln head coal feed amount, kiln speed, high temperature fan speed, grate cooler speed
- controlled variables mainly include: calciner outlet temperature, calciner outlet pressure, secondary air temperature, tertiary air temperature, kiln burning zone temperature, kiln head negative pressure, kiln tail temperature, smoke chamber temperature, kiln current, under-grate pressure, and vertical weight.
- the controlled variable refers to a variable that may not be directly adjusted, but may be affected by an adjustable variable.
- the free calcium content is mainly related to the calciner temperature, the kiln current, the secondary air temperature, the smoke chamber temperature, and the under-grate pressure, and these variables are mainly determined by the three adjustable parameters: the calciner coal feed amount, the kiln head coal feed amount, and the under-grate pressure.
- the three adjustable variables may be mainly considered: the calciner coal feed amount, the kiln head coal feed amount, and the under-grate pressure
- four controlled variables may be mainly considered: the calciner temperature, the kiln current, the secondary air temperature, and the smoke chamber temperature
- one final target variable may be considered: the free calcium content.
- the construction of the simulation models that represent parameter changes related to the coal feed amount during cement calcination is indispensable. Therefore, the executing body constructs, through step 201, the first simulation model that represents the corresponding relationship between the controlled variable (the calciner temperature) and the adjustable variable (the calciner coal feed amount), and constructs, through step 202, the second simulation model that represents the corresponding relationship between the controlled variables (the kiln current, the secondary air temperature, and the smoke chamber temperature) and the adjustable variable (the kiln head coal feed amount).
- Step 203: establishing a prediction model among an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content.
- this step aims to establish, by the executing body, the prediction model among the under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and the free calcium content.
- As analyzed above, it may be considered that the clinker quality index (the free calcium content) is mainly affected by the five controlled variables of under-grate pressure, calciner temperature, kiln current, secondary air temperature, and smoke chamber temperature. Therefore, in this step, the prediction model between the above five controlled variables and the free calcium content is established; that is, the generated prediction model may predict a value of the free calcium content based on actual values of the given five controlled variables.
- the establishing of the above prediction model requires a large amount of historical data to participate in training, so as to find a more accurate relationship of the influence of the controlled variables on the free calcium content. This may be achieved using various models or algorithms in which multiple input parameters predict a unique output parameter, such as an SVM (Support Vector Machine), a neural network, or a tree model, which is not limited herein; the model or algorithm may be selected based on all possible influencing factors in actual application scenarios.
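As a minimal illustration of such a prediction model, the sketch below fits a linear regressor from the five controlled variables to the free calcium content (an SVM, neural network, or tree model could be substituted). All data, coefficient values, and units are synthetic assumptions, not values from the patent.

```python
import numpy as np

# Synthetic stand-in for historical plant records; ranges are illustrative only.
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([
    rng.uniform(860, 900, n),    # calciner temperature (deg C, assumed)
    rng.uniform(800, 900, n),    # kiln current (A, assumed)
    rng.uniform(1050, 1150, n),  # secondary air temperature (deg C, assumed)
    rng.uniform(1000, 1100, n),  # smoke chamber temperature (deg C, assumed)
    rng.uniform(4.5, 5.5, n),    # under-grate pressure (kPa, assumed)
])
# Hypothetical ground truth: free calcium falls as calcination temperatures rise.
w_true = np.array([-0.004, -0.001, -0.002, -0.001, 0.1])
y = X @ w_true + 8.5 + rng.normal(0, 0.02, n)  # free calcium content (%)

# Linear prediction model fitted by least squares with an intercept term.
Xb = np.column_stack([X, np.ones(n)])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

def predict_free_calcium(controlled_vars):
    """Predict free calcium content from the five controlled variables."""
    return np.append(controlled_vars, 1.0) @ w
```

In the constructed reinforcement learning model, `predict_free_calcium` plays the role of the prediction model: it maps the outputs of the two simulation models plus the under-grate pressure to the quality index.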
- the large amount of historical data as samples required to construct the models in the above steps may all come from acquisition of various sensors (such as the sensors 101, 102, and 103 shown in Fig. 1 ) installed on relevant devices used in clinker calcination.
- the under-grate pressure may be acquired by a pressure sensor installed in a grate cooler
- the kiln current may be acquired by a current sensor installed in the kiln head
- temperature sensors of different performance and models may be selected based on actual temperature ranges.
- Step 204: constructing a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model.
- this step aims to establish the reinforcement learning model that represents the association between the coal feed amount and the free calcium content according to the preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model by the executing body.
- the free calcium content is affected by the five controlled variables of under-grate pressure, calciner temperature, kiln current, secondary air temperature, and smoke chamber temperature, and these controlled variables are in turn controlled by the two adjustable variables, namely, the calciner coal feed amount and the kiln head coal feed amount.
- the reinforcement learning model that may represent the association between the coal feed amount and the free calcium content is constructed according to the reinforcement learning model architecture.
- Reinforcement learning is also known as encouragement learning, evaluation learning, or enhancement learning.
- Reinforcement learning is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of an agent maximizing returns or achieving a particular goal through learning strategies during interaction with the environment. Different from other deep learning algorithms that simulate biological neural networks, in reinforcement learning the agent learns by "trial and error", obtaining reward signals that guide its behavior through interaction with the environment, with the aim of maximizing the reward for the agent. Reinforcement learning also differs from supervised learning in connectionist learning, mainly in the reinforcement signals.
- a reinforcement signal provided by the environment in reinforcement learning is an evaluation of the quality of a generated action (usually a scalar signal), rather than telling the reinforcement learning system (RLS) how to generate a correct action. Because the external environment provides little information, the RLS must rely on its own experience to learn. In this way, the RLS gains knowledge in an action-evaluation environment and improves its action plans to adapt to the environment. Deep learning models may also be used in reinforcement learning to form deep reinforcement learning (DRL), which has a better effect.
- Actor-Critic (e.g., Advantage Actor-Critic, A2C), PPO (Proximal Policy Optimization), TRPO (Trust Region Policy Optimization), and other reinforcement learning model architectures having different characteristics may be used to construct the reinforcement learning model that represents the corresponding relationship between the coal feed amount and the free calcium content required in this step.
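The following sketch shows how the simulation models and the prediction model can together serve as the environment for a reinforcement learning agent: the action is the pair of coal feed amounts, the state holds the controlled variables, and the reward is the negative distance of the predicted free calcium content from a target. A random-search agent stands in for an actor-critic learner such as A2C or PPO; every coefficient, range, and value here is an illustrative assumption, not a value from the patent.

```python
import numpy as np

class CalcinationEnv:
    """Toy environment: the simulation models advance the state, the prediction
    model scores it. All coefficients are illustrative assumptions."""

    def __init__(self, target_fc=1.0):
        self.target_fc = target_fc
        self.reset()

    def reset(self):
        # [calciner temp, kiln current, secondary air temp, smoke chamber temp]
        self.state = np.array([870.0, 850.0, 1100.0, 1050.0])
        return self.state.copy()

    def _predict_fc(self, s, under_grate_pressure=5.0):
        # Stand-in prediction model (linear; an SVM or neural net could be used).
        w = np.array([-0.004, -0.001, -0.002, -0.001])
        return s @ w + 0.1 * under_grate_pressure + 8.5

    def step(self, action):
        calciner_coal, kiln_head_coal = action
        s = self.state
        # First simulation model: calciner temperature dynamics.
        s[0] = 0.983 * s[0] + 0.801 * calciner_coal
        # Second simulation model: kiln current / secondary air / smoke chamber.
        s[1:] = 0.96 * s[1:] + np.array([2.0, 1.5, 1.2]) * kiln_head_coal
        # Reward: negative distance of predicted free calcium from the target.
        reward = -abs(self._predict_fc(s) - self.target_fc)
        return s.copy(), reward

# Random-search stand-in for the learning agent (A2C/PPO/TRPO would go here).
env = CalcinationEnv(target_fc=1.0)
rng = np.random.default_rng(3)
best_action, best_reward = None, -np.inf
for _ in range(500):
    env.reset()
    action = rng.uniform([10.0, 30.0], [20.0, 45.0])  # coal feed ranges (assumed)
    reward = 0.0
    for _ in range(50):            # let the dynamics settle toward steady state
        _, reward = env.step(action)
    if reward > best_reward:
        best_action, best_reward = action, reward
print("best coal feeds:", np.round(best_action, 2), "reward:", round(best_reward, 3))
```

In a real application the random-search loop would be replaced by the chosen architecture's policy and value updates, with the same environment interface.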
- the method for constructing a reinforcement learning model introduces the concept of reinforcement learning into a cement calcination scenario: based on the established simulation models and the prediction model, and under the reinforcement learning architecture, the reinforcement learning model that may represent the corresponding relationship between the input coal feed amount and the free calcium content of a final product under the influence of a plurality of parameters is constructed.
- unlike other machine learning models, which act as compensators, the reinforcement learning model is more compatible with the complex, multi-parameter cement calcination scenario, making the determined corresponding relationship more accurate; at the same time, its strong generalization ability allows the present disclosure to be applied more simply to other similar scenarios.
- the existing technology may not meet the needs of the complex cement calcination scenario, because PID control considers only system deviations, mainly tracking system setpoints, and does not support multi-objective optimization of clinker quality and energy consumption in the cement calcination scenario.
- due to the plurality of parameters that must be controlled in real time in the cement production process, it is also difficult for MPC to achieve unified real-time control of these parameters.
- the generalization ability of MPC is poor.
- the models need to be re-established each time.
- FIG. 3 is a flowchart of another method for constructing a reinforcement learning model according to an embodiment of the present disclosure.
- a flow 300 includes the following steps:
- the above steps 301-304 are the same as steps 201-204 as shown in Fig. 2 .
- the above steps may be summarized as a construction process of the reinforcement learning model. For contents of the same parts, reference may be made to the corresponding parts of the previous embodiment, and detailed description thereof will be omitted.
- Step 305 receiving a target free calcium content given in a target scenario
- this step aims to receive the target free calcium content given by a user in the target scenario by the executing body.
- This step serves as the first step in using the reinforcement learning model to guide the coal feed amount during cement calcination, that is, acquiring a set clinker quality index.
- Step 306 determining a theoretical coal feed amount corresponding to the target free calcium content using the reinforcement learning model
- this step aims to determine the theoretical coal feed amount corresponding to the target free calcium content using the reinforcement learning model by the executing body. That is, since the reinforcement learning model may represent the corresponding relationship between the coal feed amount and the free calcium content, given the target free calcium content, the corresponding theoretical coal feed amount may be inversely derived from the corresponding relationship, where the theoretical coal feed amount includes a theoretical calciner coal feed amount and a theoretical kiln head coal feed amount.
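The inverse derivation described above can be sketched as a simple search over candidate coal feed amounts. The `predict_fcao` callable below is a hypothetical stand-in for the composed simulation and prediction pipeline, and all names and numbers are illustrative, not taken from the disclosure:

```python
def theoretical_coal_feed(target_fcao, predict_fcao, candidates):
    """Pick the coal feed amount whose predicted free calcium content is
    closest to the target; ties are broken toward the smaller feed."""
    return min(candidates,
               key=lambda feed: (abs(predict_fcao(feed) - target_fcao), feed))

# toy monotone model: feeding more coal lowers the free calcium content
predict = lambda feed: 3.0 - 0.2 * feed
candidates = [0.5 * i for i in range(41)]  # 0.0 .. 20.0 (arbitrary units)
best = theoretical_coal_feed(1.2, predict, candidates)
```

In practice the theoretical coal feed amount would be split into the theoretical calciner coal feed amount and the theoretical kiln head coal feed amount; the one-dimensional search here only illustrates the inverse use of the learned relationship.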
- Step 307 guiding a calciner coal feeding operation and a kiln head coal feeding operation in the target scenario based on the theoretical coal feed amount.
- this step aims to instruct the calciner coal feeding operation and the kiln head coal feeding operation in the target scenario based on the theoretical coal feed amount by the executing body.
- a coal feeding device (such as the coal feeding device 106 shown in Fig. 1 ) is controlled to feed a corresponding amount of coal to the calciner and the kiln head.
- since this embodiment of the present disclosure includes all the technical features of the previous embodiment (that is, the construction steps of the reinforcement learning model), it also has all the beneficial effects of the previous embodiment.
- this embodiment of the present disclosure also provides, through steps 305 to 307, a solution for guiding the coal feed amount based on the constructed reinforcement learning model, so as to guide the coal input in the cement calcination process with a reasonable coal feed amount: feeding as little coal as possible while ensuring the clinker quality as far as possible, to reduce costs and increase efficiency. The coal saved is also equivalent to a corresponding reduction in carbon dioxide emitted to the atmosphere, which is conducive to building an environmentally friendly enterprise.
- Temperature control includes, but is not limited to, all effective means such as physical cooling or reduction of coal feed amount.
- An entire process of cement calcination may be seen in the schematic diagram of an apparatus given in the upper left corner of Fig. 4 .
- the entire process involves many controlled parameters, such as the calciner coal feed amount and the kiln head coal feed amount, and these parameters directly affect the quality of the clinker, that is, the free calcium content.
- an enterprise usually requires that the free calcium content is between 0.5% and 1.5%.
- the free calcium content in the modeling process of the embodiments of the present disclosure is adjusted to be between 1% and 1.5%, to reduce production costs as much as possible while ensuring quality.
- the process parameters are adjusted based on the reinforcement learning model to reduce coal consumption while ensuring quality.
- the entire modeling process is very complicated. The following is a detailed introduction to the various parts of the construction process of the reinforcement learning model that the server is responsible for:
- the free calcium content is measured about once an hour in the production process. Because it is required to control and adjust the coal feed amount and other parameters in real time, it is necessary to establish the real-time prediction model on the free calcium content. Since the free calcium content is mainly related to the calciner temperature, the kiln current, the secondary air temperature, the smoke chamber temperature and the under-grate pressure, the established model is:
- Free calcium content = f(calciner temperature, kiln current, secondary air temperature, smoke chamber temperature, under-grate pressure). In an experiment, a large amount of historical data is used to fit f. In the present embodiment, this large amount of historical data is used to construct the prediction model through neural networks.
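The shape of the fitting step can be sketched as follows. The disclosure fits f with neural networks; the gradient-descent least-squares fit below is only a linear stand-in to illustrate the five-input, one-output form, and it assumes inputs normalized to comparable scales:

```python
def fit_predictor(samples, targets, lr=0.5, epochs=20000):
    """Linear stand-in for the neural network prediction model:
    free_calcium ~ f(calciner_temp, kiln_current, secondary_air_temp,
                     smoke_chamber_temp, under_grate_pressure).
    Fits weights and bias by batch gradient descent on squared error."""
    n = len(samples[0])
    w, b = [0.0] * n, 0.0
    m = len(samples)
    for _ in range(epochs):
        gw, gb = [0.0] * n, 0.0
        for x, y in zip(samples, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i in range(n):
                gw[i] += err * x[i]
            gb += err
        w = [wi - lr * gi / m for wi, gi in zip(w, gw)]
        b -= lr * gb / m
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
```

A neural network would replace the linear form with stacked nonlinear layers, but the interface — historical process samples in, a real-time free calcium prediction out — is the same.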
- it is necessary to construct the simulation models of the cement calcination process, that is, models of how the controlled variables such as the calciner temperature, the kiln current, the secondary air temperature, or the smoke chamber temperature change during calcination after the coal feed amount is adjusted.
- a first-order inertia element plus a pure time-delay (lag) element is often selected to simulate a complex industrial system having large inertia and pure lag.
- the calciner temperature is mainly related to the calciner coal feed amount
- the kiln current, the secondary air temperature, and the smoke chamber temperature are mainly related to the kiln head coal feed amount.
- a system model of the calciner temperature with respect to the calciner coal feed amount may be established, and a system model of the kiln current, the secondary air temperature, and the smoke chamber temperature with respect to the kiln head coal feed amount may be established.
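The first-order-inertia-plus-pure-lag structure above can be sketched in discrete time as follows; the gain, time constant, and delay values used are illustrative placeholders, not parameters identified in the disclosure:

```python
from collections import deque

class FirstOrderLagModel:
    """First-order inertia element with a pure time delay (dead time):
    y[k+1] = y[k] + (dt / tau) * (K * u[k - d] - y[k])."""
    def __init__(self, gain, tau, delay_steps, dt=1.0, y0=0.0):
        self.gain, self.tau, self.dt = gain, tau, dt
        self.y = y0
        self.buf = deque([0.0] * delay_steps)  # queue of delayed inputs

    def step(self, u):
        if self.buf:                 # apply the pure lag, if any
            self.buf.append(u)
            u = self.buf.popleft()
        self.y += (self.dt / self.tau) * (self.gain * u - self.y)
        return self.y

# e.g., a hypothetical calciner-temperature response to the calciner coal feed
calciner_temp = FirstOrderLagModel(gain=2.0, tau=5.0, delay_steps=3)
response = [calciner_temp.step(1.0) for _ in range(200)]  # unit step input
```

In the same way, one instance per controlled variable would give the system models of the kiln current, the secondary air temperature, and the smoke chamber temperature with respect to the kiln head coal feed amount, with gains and time constants fitted from historical step-response data.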
- the present embodiment uses an Actor-critic reinforcement learning model, uses the three adjustable parameters: calciner coal feed amount, kiln head coal feed amount, and under-grate pressure as an Action of the reinforcement learning model, and temporarily ignores other parameters in the calcination process.
- the purpose is to ensure that the final free calcium content is between 1% and 1.5% and, at the same time, that when the feed amount is set at a certain level, the calciner coal feed amount and the kiln head coal feed amount are as small as possible.
- since the measurement standard of coal consumption is total coal feed amount/feed amount, and it is assumed that the feed rate is fixed, that is, the feed amount per unit time is fixed, the coal consumption only needs to consider the calciner coal feed amount and the kiln head coal feed amount.
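The optimization objective just described — keep the free calcium content inside the 1%–1.5% band while feeding as little coal as possible — can be sketched as a reward function. The band-penalty shape and the `coal_weight` trade-off coefficient are illustrative assumptions, not values from the disclosure:

```python
def reward(free_calcium, calciner_coal, kiln_head_coal,
           lo=1.0, hi=1.5, coal_weight=0.1):
    """Rewards the Action (coal feeds, under-grate pressure) through its
    outcome: a bonus when free calcium lies in [lo, hi], a penalty growing
    with the distance to the band otherwise, minus a cost proportional to
    the total coal fed (feed rate assumed fixed, as in the text)."""
    if lo <= free_calcium <= hi:
        quality = 1.0
    else:
        quality = -min(abs(free_calcium - lo), abs(free_calcium - hi))
    return quality - coal_weight * (calciner_coal + kiln_head_coal)
```

With such a reward, the Actor-Critic agent is pushed both toward the quality band and toward the smaller coal feed amounts within it.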
- the data processing step at the bottom of Fig. 4 is the sample-based parameter update process of the Actor-critic reinforcement learning model.
- a sample is collected from each actual Action (the parameters may be denoted S_t, a_t, r_t, S_(t+1), etc.).
- these samples are stored in a memory database in the form of tuples.
- some data are selected from the memory database by sampling to update the parameters of the Actor-critic reinforcement learning model.
- the effectiveness and usability of the Actor-critic reinforcement learning model are maintained using this update method.
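The memory database and sampling step can be sketched as a standard replay buffer of (S_t, a_t, r_t, S_(t+1)) tuples; the capacity and batch size here are illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (S_t, a_t, r_t, S_t+1) transition tuples and returns random
    mini-batches for updating the Actor-critic model parameters."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)  # oldest samples evicted first

    def store(self, state, action, rew, next_state):
        self.buf.append((state, action, rew, next_state))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of samples
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

mem = ReplayMemory(capacity=100)
for t in range(5):
    mem.store(("s", t), ("a", t), float(t), ("s", t + 1))
batch = mem.sample(3)
```

Each sampled mini-batch would then drive one gradient update of the actor and critic networks, which is the update loop shown at the bottom of Fig. 4.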
- the minimized coal feed amount may subsequently be determined based on a given free calcium content, so as to reduce costs and increase benefits.
- the present disclosure provides an embodiment of an apparatus for constructing a reinforcement learning model, and the apparatus embodiment corresponds to the method embodiment as shown in Fig. 2 .
- the apparatus may be particularly applied to various electronic devices.
- an apparatus 500 for constructing a reinforcement learning model of the present embodiment may include: a first simulation model establishing unit 501, a second simulation model establishing unit 502, a prediction model establishing unit 503 and a reinforcement learning model construction unit 504.
- the first simulation model establishing unit 501 is configured to establish a first simulation model between a calciner coal feed amount and a calciner temperature.
- the second simulation model establishing unit 502 is configured to establish a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature.
- the prediction model establishing unit 503 is configured to establish a prediction model among: an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content.
- the reinforcement learning model construction unit 504 is configured to construct a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model; the coal feed amount including the calciner coal feed amount and the kiln head coal feed amount.
- for the particular processing of the first simulation model establishing unit 501, the second simulation model establishing unit 502, the prediction model establishing unit 503 and the reinforcement learning model construction unit 504 of the apparatus 500 for constructing a reinforcement learning model, and the technical effects thereof, reference may be made to the relevant descriptions of steps 201-204 in the corresponding embodiment of Fig. 2, respectively, and detailed description thereof will be omitted.
- the apparatus 500 for constructing a reinforcement learning model may further include:
- the apparatus 500 for constructing a reinforcement learning model may further include:
- the apparatus 500 for constructing a reinforcement learning model may further include:
- the reinforcement learning model construction unit 504 may include: an A2C reinforcement learning model construction subunit, configured to construct the reinforcement learning model that represents the association between the coal feed amount and the free calcium content according to an Actor-Critic reinforcement learning model architecture.
- the A2C reinforcement learning model construction subunit may be further configured to:
- as an apparatus embodiment, the present embodiment corresponds to the above method embodiment.
- the apparatus for constructing a reinforcement learning model provided in this embodiment of the present disclosure introduces the concept of reinforcement learning into a cement calcination scenario: based on the established simulation models and the prediction model, under the reinforcement learning architecture, a reinforcement learning model is constructed that represents the corresponding relationship between the input coal feed amount, under the influence of a plurality of parameters, and the free calcium content of the final product.
- since the reinforcement learning model differs in its characteristics from other machine learning models, it is better suited to the complex, multi-parameter cement calcination scenario, making the determined corresponding relationship more accurate; at the same time, the strong generalization ability of the reinforcement learning model enables it to be more simply applied in other similar scenarios.
- the present disclosure also provides an electronic device for constructing a reinforcement learning model and a readable storage medium.
- Fig. 6 shows a block diagram of an electronic device suitable for implementing the method for constructing a reinforcement learning model according to an embodiment of the present disclosure.
- the electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
- the electronic device may also represent various forms of mobile apparatuses, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing apparatuses.
- the components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
- the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting various components, including high-speed interfaces and low-speed interfaces.
- the various components are connected to each other using different buses, and may be installed on a common motherboard or in other methods as needed.
- the processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface).
- if desired, a plurality of processors and/or a plurality of buses may be used together with a plurality of memories.
- a plurality of electronic devices may be connected, each device providing some of the necessary operations, for example, as a server array, a set of blade servers, or a multi-processor system.
- one processor 601 is used as an example.
- the memory 602 is a non-transitory computer readable storage medium provided by the present disclosure.
- the memory stores instructions executable by at least one processor, so that the at least one processor performs the method for constructing a reinforcement learning model provided by the present disclosure.
- the non-transitory computer readable storage medium of the present disclosure stores computer instructions for causing a computer to perform the method for constructing a reinforcement learning model provided by the present disclosure.
- the memory 602 may be used to store non-transitory software programs, non-transitory computer executable programs and modules, such as program instructions/modules corresponding to the method for constructing a reinforcement learning model in the embodiments of the present disclosure (for example, the first simulation model establishing unit 501, the second simulation model establishing unit 502, the prediction model establishing unit 503 and the reinforcement learning model construction unit 504 as shown in Fig. 5 ).
- the processor 601 executes the non-transitory software programs, instructions, and modules stored in the memory 602 to execute various functional applications and data processing of the server, that is, to implement the method for constructing a reinforcement learning model in the foregoing method embodiments.
- the memory 602 may include a storage program area and a storage data area, where the storage program area may store an operating system and at least one function required application program; and the storage data area may store data created by the use of the electronic device according to the method for constructing a reinforcement learning model, etc.
- the memory 602 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
- the memory 602 may optionally include memories remotely provided with respect to the processor 601, and these remote memories may be connected to the electronic device of the method for constructing a reinforcement learning model through a network. Examples of the above network include but are not limited to the Internet, intranet, local area network, mobile communication network, and combinations thereof.
- the electronic device of the method for constructing a reinforcement learning model may further include: an input apparatus 603 and an output apparatus 604.
- the processor 601, the memory 602, the input apparatus 603, and the output apparatus 604 may be connected through a bus or in other methods. In Fig. 6 , connection through a bus is used as an example.
- the input apparatus 603 may receive input digital or character information, and generate key signal inputs related to user settings and function control of the electronic device of the method for constructing a reinforcement learning model, such as touch screen, keypad, mouse, trackpad, touchpad, pointing stick, one or more mouse buttons, trackball, joystick and other input apparatuses.
- the output apparatus 604 may include a display device, an auxiliary lighting apparatus (for example, LED), a tactile feedback apparatus (for example, a vibration motor), and the like.
- the display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
- Various embodiments of the systems and technologies described herein may be implemented in digital electronic circuit systems, integrated circuit systems, dedicated ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system that includes at least one programmable processor.
- the programmable processor may be a dedicated or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit the data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
- These computing programs include machine instructions for the programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages.
- The terms "machine readable medium" and "computer readable medium" refer to any computer program product, device, and/or apparatus (for example, magnetic disk, optical disk, memory, programmable logic device (PLD)) used to provide machine instructions and/or data to the programmable processor, including a machine readable medium that receives machine instructions as machine readable signals.
- The term "machine readable signal" refers to any signal used to provide machine instructions and/or data to the programmable processor.
- the systems and technologies described herein may be implemented on a computer, the computer has: a display apparatus for displaying information to the user (for example, CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing apparatus (for example, mouse or trackball), and the user may use the keyboard and the pointing apparatus to provide input to the computer.
- Other types of apparatuses may also be used to provide interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback); and any form (including acoustic input, voice input, or tactile input) may be used to receive input from the user.
- the systems and technologies described herein may be implemented in a computing system that includes backend components (e.g., as a data server), or a computing system that includes middleware components (e.g., application server), or a computing system that includes frontend components (for example, a user computer having a graphical user interface or a web browser, through which the user may interact with the implementations of the systems and the technologies described herein), or a computing system that includes any combination of such backend components, middleware components, or frontend components.
- the components of the system may be interconnected by any form or medium of digital data communication (e.g., communication network). Examples of the communication network include: local area networks (LAN), wide area networks (WAN) and the Internet.
- the computing system may include a client and a server.
- the client and the server are generally far from each other and usually interact through the communication network.
- the relationship between the client and the server is generated by computer programs that run on the corresponding computer and have a client-server relationship with each other.
- the server may be a cloud server, also known as a cloud computing server or a cloud host.
- the server is a host product in the cloud computing service system that solves the defects of difficult management and weak business scalability in traditional physical host and virtual private server (VPS) services.
- the concept of reinforcement learning is introduced into a cement calcination scenario: based on the established simulation models and the prediction model, under the reinforcement learning architecture, a reinforcement learning model is constructed that represents the corresponding relationship between the input coal feed amount and the free calcium content of the final product under the influence of a plurality of parameters.
- since the reinforcement learning model differs in its characteristics from other machine learning models, it is better suited to the complex, multi-parameter cement calcination scenario, making the determined corresponding relationship more accurate; at the same time, the strong generalization ability of the reinforcement learning model enables it to be more simply applied in other similar scenarios.
Abstract
Description
- The present disclosure relates to the field of data processing technology, particularly to the field of big data and deep learning technology, and more particularly to a method, apparatus and electronic device for constructing a reinforcement learning model, and relates to a computer readable storage medium.
- There are three main stages in the production process of cement: raw material mining and grinding, calcination of raw material to clinker, and clinker reprocessing. The calcination of raw material to clinker is a very complicated process, and the costs of the coal and electricity consumed in the process are very high. In the calcination process, the main consumption is of coal and electricity, of which coal consumption accounts for the largest proportion. Thus, how to reasonably manage and control the coal feed amount in the calcination stage is the key to decreasing cost and increasing efficiency in the cement industry.
- Embodiments of the present disclosure propose a method, apparatus and electronic device for constructing a reinforcement learning model. The embodiments also relate to a computer readable storage medium.
- In a first aspect, embodiments of the present disclosure provide a method for constructing a reinforcement learning model, comprising: establishing a first simulation model between a calciner coal feed amount and a calciner temperature; establishing a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature; establishing a prediction model among: an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; and constructing a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model; the coal feed amount comprising the calciner coal feed amount and the kiln head coal feed amount.
- In a second aspect, embodiments of the present disclosure provide an apparatus for constructing a reinforcement learning model, comprising: a first simulation model establishing unit, configured to establish a first simulation model between a calciner coal feed amount and a calciner temperature; a second simulation model establishing unit, configured to establish a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature; a prediction model establishing unit, configured to establish a prediction model among an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; and a reinforcement learning model construction unit, configured to construct a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model; the coal feed amount comprising the calciner coal feed amount and the kiln head coal feed amount.
- In a third aspect, embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage apparatus, storing one or more programs thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method provided by the first aspect.
- In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium, storing a computer program thereon, wherein the program, when executed by a processor, causes the processor to implement the method provided by the first aspect.
- In a fifth aspect, embodiments of the present disclosure provide a computer program product including a computer program, where the computer program, when executed by a processing apparatus, implements the method provided by the first aspect.
- The method and apparatus for constructing a reinforcement learning model, electronic device, and computer readable storage medium provided by the embodiments of the present disclosure first establish the first simulation model between the calciner coal feed amount and the calciner temperature, and the second simulation model among the kiln head coal feed amount, the kiln current, the secondary air temperature, and the smoke chamber temperature; then establish the prediction model among the under-grate pressure, the calciner temperature output by the first simulation model, the kiln current, the secondary air temperature and the smoke chamber temperature output by the second simulation model, and the free calcium content; and finally construct the reinforcement learning model that represents the association between the coal feed amount and the free calcium content according to the preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model, the coal feed amount including the calciner coal feed amount and the kiln head coal feed amount.
- Different from the existing technology that may not meet the needs of a complex scenario of cement calcination, the present disclosure introduces the concept of reinforcement learning into a cement calcination scenario. Based on the established simulation models and the prediction model, under the reinforcement learning architecture, the reinforcement learning model that may represent the corresponding relationship between the input coal feed amount and the free calcium content of a final product under the influence of a plurality of parameters is constructed. In addition, since the reinforcement learning model is different from the compensator characteristics of other machine learning models, it is more compatible with the complex and multi-parameter cement calcination scenario, making the determined corresponding relationship more accurate, and at the same time a strong generalization ability of the reinforcement learning model may also be more simply applied to other similar scenarios.
- It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be easily understood by the following description.
- By reading the detailed description of non-limiting embodiments with reference to the following accompanying drawings, other features, objectives and advantages of the present disclosure will become more apparent:
- Fig. 1 is an exemplary system architecture in which the present disclosure may be implemented;
- Fig. 2 is a flowchart of a method for constructing a reinforcement learning model according to an embodiment of the present disclosure;
- Fig. 3 is a flowchart of another method for constructing a reinforcement learning model according to an embodiment of the present disclosure;
- Fig. 4 is a schematic flowchart of the method for constructing a reinforcement learning model in an application scenario according to an embodiment of the present disclosure;
- Fig. 5 is a structural block diagram of an apparatus for constructing a reinforcement learning model according to an embodiment of the present disclosure; and
- Fig. 6 is a block diagram of an electronic device suitable for implementing the method for constructing a reinforcement learning model according to an embodiment of the present disclosure.
- The present disclosure will be further described in detail below with reference to the accompanying drawings and embodiments. It may be understood that the embodiments described herein are only used to explain the relevant disclosure, but not to limit the disclosure. In addition, it should be noted that, for ease of description, only the parts related to the relevant disclosure are shown in the accompanying drawings.
- It should be noted that the embodiments in the present disclosure and the features in the embodiments may be combined with each other on a non-conflict basis. The present disclosure will be described below in detail with reference to the accompanying drawings and in combination with the embodiments.
-
Fig. 1 shows an exemplary system architecture 100 to which the embodiments of the method, apparatus and electronic device for constructing a reinforcement learning model, and the computer readable storage medium, of the present disclosure may be applied. - As shown in
Fig. 1 , the system architecture 100 may include sensors, a network 104, a server 105, and a coal feeding device 106. The network 104 is used to provide a communication link medium between the sensors and the server 105, and between the server 105 and the coal feeding device 106. The network 104 may include various connection types, such as wired or wireless communication links, or optic fibers. - Various types of information acquired by the
sensors may be sent to the server 105 through the network 104, and the server 105 may generate control instructions based on the received information after processing, and then issue the control instructions to the coal feeding device 106 through the network 104. The above communication may be implemented by various applications installed on the sensors, the server 105, and the coal feeding device 106, such as information transmission applications, coal feed optimization control applications, or control instruction sending and receiving applications. - Typically, the
sensors may be various types of sensors installed on the relevant devices used in clinker calcination. The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server; when the server 105 is software, it may be implemented as a plurality of pieces of software or software modules, or as a single piece of software or software module, which is not limited herein. In an actual scenario, the coal feeding device 106 may be embodied as a physical device such as a coal conveyor belt or a coal conveyor. In a virtual test scenario, it may be directly replaced by a virtual device having a controlled coal conveying capacity. - The
server 105 may provide various services through various built-in applications. Taking as an example a coal feed optimization control application that may provide a coal feed optimization control service in cement calcination, the server 105 runs the coal feed optimization control application and may achieve the following effects: first, receiving an instruction specifying a target free calcium content required for cement clinker production of the present batch; then, inputting the target free calcium content into a pre-constructed reinforcement learning model that represents a corresponding relationship between a coal feed amount and a free calcium content, to obtain a theoretical coal feed amount output by the reinforcement learning model; next, issuing a corresponding coal feed amount instruction to the coal feeding device 106 using the theoretical calciner coal feed amount and the theoretical kiln head coal feed amount included in the theoretical coal feed amount. - The reinforcement learning model used by the
server 105 in the above process may be constructed by the following method: first, receiving a large amount of historical data on the calciner coal feed amount, the calciner temperature, the kiln head coal feed amount, the kiln current, the secondary air temperature, the smoke chamber temperature and the under-grate pressure from the sensors through the network 104; then, establishing a first simulation model between the calciner coal feed amount and the calciner temperature, and establishing a second simulation model among the kiln head coal feed amount, the kiln current, the secondary air temperature, and the smoke chamber temperature; then, establishing a prediction model among: the under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; and finally, constructing the reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model. - It should be noted that the parameters such as the calciner coal feed amount, the calciner temperature, the kiln head coal feed amount, the kiln current, the secondary air temperature, the smoke chamber temperature, and the under-grate pressure used to construct the simulation models and the prediction model, in addition to being acquired from the
sensors through the network 104, may also be stored in advance in the server 105 in various forms such as logs or production inspection data reports. Therefore, when the server 105 detects that the data have already been stored locally, it may choose to acquire the data directly from local storage. In this case, the process of generating the reinforcement learning model may not require the sensors or the network 104. - Since the construction of the simulation models, the prediction model, and the reinforcement learning model based on a large number of parameters occupies considerable computing resources and demands strong computing power, the method for constructing a reinforcement learning model provided in the subsequent embodiments of the present disclosure is generally performed by the
server 105, which has strong computing power and abundant computing resources. Accordingly, the apparatus for constructing a reinforcement learning model is generally also provided in the server 105. - It should be understood that the number of sensors, networks, servers and coal feeding devices in
Fig. 1 is merely illustrative. Depending on the implementation needs, there may be any number of sensors, networks, servers and coal feeding devices. - With reference to
Fig. 2, Fig. 2 is a flowchart of a method for constructing a reinforcement learning model according to an embodiment of the present disclosure. A flow 200 includes the following steps: -
Step 201, establishing a first simulation model between a calciner coal feed amount and a calciner temperature; - This step aims to establish the first simulation model between the calciner coal feed amount and the calciner temperature by an executing body of the method for constructing a reinforcement learning model (for example, the
server 105 shown in Fig. 1 ). - The first simulation model is used to represent a corresponding relationship between the calciner coal feed amount and the calciner temperature. In order to construct a first simulation model that can represent this corresponding relationship, a large number of historical calciner coal feed amounts and the corresponding historical calciner temperature data are required as samples to participate in the training and construction of the simulation model. For example, the first simulation model that represents the corresponding relationship between the calciner coal feed amount and the calciner temperature may be constructed in the form of the following formula:
y(k) = a·y(k-1) + b·u(k-1)
- In the formula, y(k) is the calciner temperature at time k, and y(k-1) and u(k-1) are respectively the calciner temperature and the calciner coal feed amount at time k-1 (that is, the moment immediately preceding time k); a and b are undetermined coefficients whose particular values may be obtained by calculation with the least squares method on historical data. For example, in a certain experimental scenario, a is 0.983 and b is 0.801.
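As an illustrative sketch (not part of the patent), the undetermined coefficients a and b of such a first-order model can be estimated from historical sensor data with ordinary least squares. The data below are synthetic, generated from the example coefficients cited above; all units and magnitudes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "historical" data generated from the example coefficients
# a = 0.983, b = 0.801 cited in the text, plus measurement noise.
n = 500
u = 10.0 + rng.normal(0.0, 0.5, n)        # calciner coal feed amount
y = np.empty(n)
y[0] = 870.0                               # initial calciner temperature
for k in range(1, n):
    y[k] = 0.983 * y[k - 1] + 0.801 * u[k - 1] + rng.normal(0.0, 0.1)

# Least squares fit of y(k) = a*y(k-1) + b*u(k-1).
X = np.column_stack([y[:-1], u[:-1]])      # regressors at time k-1
a_hat, b_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]
print(round(a_hat, 3), round(b_hat, 3))    # should be close to 0.983 and 0.801
```

The same fitting procedure applies to the second simulation model, one output variable at a time.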
- Step 202, establishing a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature;
- In this step, the executing body establishes the second simulation model among the kiln head coal feed amount, the kiln current, the secondary air temperature, and the smoke chamber temperature.
- Different from the first simulation model, the second simulation model is used to represent the corresponding relationship between the kiln head coal feed amount and the kiln current, the secondary air temperature, and the smoke chamber temperature. In order to construct a second simulation model that can represent this corresponding relationship, a large number of historical kiln head coal feed amounts and the corresponding historical kiln currents, historical secondary air temperatures, and historical smoke chamber temperatures are required as sample data to participate in the training and construction of the simulation model. The second simulation model may also be constructed in the same form as the above formula.
- It should be noted that the executing body needs to construct the first simulation model and the second simulation model through
step 201 and step 202 respectively because, in a cement clinker calcination process, the adjustable variables mainly include: the feed amount, the calciner coal feed amount, the kiln head coal feed amount, the kiln speed, the high temperature fan speed, and the grate cooler speed; and the controlled variables mainly include: the calciner outlet temperature, the calciner outlet pressure, the secondary air temperature, the tertiary air temperature, the kiln burning zone temperature, the kiln head negative pressure, the kiln tail temperature, the smoke chamber temperature, the kiln current, the under-grate pressure, and the vertical weight. A controlled variable is a variable that may not be adjusted directly, but may be affected by an adjustable variable. - All of the above variables ultimately act on the quality index of the calcined finished product: the free calcium content. Therefore, in order to ensure the clinker quality of the finished product, it is necessary to monitor these variables during the entire calcination, so as to estimate the quality of the calcined clinker product from them. Investigation shows that the free calcium content is mainly related to the calciner temperature, the kiln current, the secondary air temperature, the smoke chamber temperature, and the under-grate pressure, and that these variables are mainly determined by three adjustable parameters: the calciner coal feed amount, the kiln head coal feed amount, and the under-grate pressure. Therefore, given that the present disclosure mainly focuses on the coal consumption caused by coal feeding and on the clinker quality (i.e., the free calcium content), three adjustable variables are mainly considered: the calciner coal feed amount, the kiln head coal feed amount, and the under-grate pressure; four controlled variables are mainly considered: the calciner temperature, the kiln current, the secondary air temperature, and the smoke chamber temperature; and one final target variable is considered: the free calcium content.
- In order to optimize the parameter adjustment of the coal feed amount using the reinforcement learning model, the construction of simulation models that represent the parameter changes related to the coal feed amount during cement calcination is indispensable. Therefore, the executing body constructs the first simulation model, which represents the corresponding relationship between the controlled variable (the calciner temperature) and the adjustable variable (the calciner coal feed amount), through
step 201, and constructs the second simulation model, which represents the corresponding relationship between the controlled variables (the kiln current, the secondary air temperature, and the smoke chamber temperature) and the adjustable variable (the kiln head coal feed amount), through step 202. -
Step 203, establishing a prediction model among: an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; - On the basis of
step 201 and step 202, this step aims for the executing body to establish the prediction model among the under-grate pressure, the calciner temperature output by the first simulation model, the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model, and the free calcium content. - As described in
step 202, it may be considered that the clinker quality index, the free calcium content, is mainly affected by the five controlled variables: the under-grate pressure, the calciner temperature, the kiln current, the secondary air temperature, and the smoke chamber temperature. Therefore, in this step, the prediction model between these five controlled variables and the free calcium content is established; that is, the generated prediction model may predict a value of the free calcium content based on actual values of the given five controlled variables.
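The patent leaves the concrete choice of prediction model open (SVM, neural network, or tree model). As a minimal, hypothetical sketch of the neural-network option, the snippet below fits a tiny one-hidden-layer network on synthetic data; the linear generating rule, the feature normalization, and every coefficient are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training set: 5 normalized controlled variables -> free calcium (%).
# Feature order: calciner temp, kiln current, secondary air temp,
# smoke chamber temp, under-grate pressure.
X = rng.uniform(0.0, 1.0, (2000, 5))
true_w = np.array([-1.2, -0.4, -0.3, -0.5, 0.2])   # invented generating rule
y = 1.8 + X @ true_w + rng.normal(0.0, 0.02, 2000)

# One-hidden-layer network trained with full-batch gradient descent.
W1 = rng.normal(0.0, 0.5, (5, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)
lr = 0.1
for _ in range(3000):
    h = np.tanh(X @ W1 + b1)                 # hidden activations
    pred = (h @ W2 + b2).ravel()             # predicted free calcium
    err = pred - y
    dW2 = h.T @ err[:, None] / len(X)        # backpropagated gradients
    db2 = err.mean()
    dh = (err[:, None] @ W2.T) * (1.0 - h ** 2)
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

mse = float(np.mean((pred - y) ** 2))
print("training MSE:", round(mse, 4))
```

In practice an off-the-shelf library model would replace this hand-written loop; the sketch only shows the input/output shape of the prediction problem.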
- The large amount of historical data as samples required to construct the models in the above steps may all come from acquisition of various sensors (such as the
sensors shown in Fig. 1 ) installed on relevant devices used in clinker calcination. For example, the under-grate pressure may be acquired by a pressure sensor installed in the grate cooler, the kiln current may be acquired by a current sensor installed at the kiln head, and for the various temperatures, temperature sensors of different performance and models may be selected based on the actual temperature ranges. -
Step 204, constructing a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model. - On the basis of
step 203, this step aims for the executing body to establish the reinforcement learning model that represents the association between the coal feed amount and the free calcium content according to the preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model.
- Reinforcement learning (RL), also known as encouragement learning, evaluation learning, or enhancement learning, is one of the paradigms and methodologies of machine learning. Reinforcement learning is used to describe and solve the problem that an agent maximizes returns or achieves a particular goal during interaction with the environment through learning strategies. Different from other neural network deep learning algorithms that simulate biological neural networks, the reinforcement learning algorithm is that the agent learns in a "trial and error" method, obtains a reward guide behavior through interaction with the environment, aims to maximize the reward for the agent. Reinforcement learning is different from supervised learning in connectionist learning, which is mainly manifested in reinforcement signals. A reinforcement signal provided by the environment in reinforcement learning is an evaluation of the quality of a generated action (usually a scalar signal), rather than telling a reinforcement learning system (RLS) how to generate a correct action. Because an external environment provides little information, RLS must rely on its own experience to learn. Using the method, RLS gains knowledge in an action-evaluation environment and improves action plans to adapt to the environment. Deep learning models may also be used in reinforcement learning to form deep reinforcement learning (DRL) having a better effect.
- Actor-critic (A2C), PPO, TRPO, and other reinforcement learning model architectures having different characteristics may be used to construct the reinforcement learning model that may represent the corresponding relationship between the coal feed amount and the free calcium content required in this step.
- Different from the existing technology that may not meet the needs of a complex scenario of cement calcination, the method for constructing a reinforcement learning model provided in the embodiment of the present disclosure introduces the concept of reinforcement learning into a cement calcination scenario, based on the established simulation models and the prediction model, under the reinforcement learning architecture, the reinforcement learning model that may represent the corresponding relationship between the input coal feed amount under the influence of a plurality of parameters and the free calcium content of a final product is constructed. In addition, since the reinforcement learning model is different from the compensator characteristics of other machine learning models, it is more compatible with the complex and multi-parameter cement calcination scenario, making the determined corresponding relationship more accurate, and at the same time a strong generalization ability of the reinforcement learning model may also be more simply applied to the present disclosure in other similar scenarios.
- The existing technology may not meet the needs of a complex scenario of cement calcination, because PID control only considers system deviations, mainly is to track system setting values, but does not support a multi-objective optimization of clinker quality and energy consumption in the cement calcination scenario. On the other hand, due to the real-time control of a plurality of parameters involved in the cement production process, it is also difficult for MPC to achieve unified real-time control of the plurality of parameters. At the same time, the generalization ability of MPC is poor. For a calcination system of similar scenarios, the models need to be re-established each time.
- With reference to
Fig. 3, Fig. 3 is a flowchart of another method for constructing a reinforcement learning model according to an embodiment of the present disclosure. A flow 300 includes the following steps: - Step 301: establishing a first simulation model between a calciner coal feed amount and a calciner temperature;
- Step 302: establishing a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature;
- Step 303: establishing a prediction model among: an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content;
- Step 304: constructing a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model;
- The above steps 301-304 are the same as steps 201-204 as shown in
Fig. 2 . The above steps may be summarized as a construction process of the reinforcement learning model. For contents of the same parts, reference may be made to the corresponding parts of the previous embodiment, and detailed description thereof will be omitted. - Step 305: receiving a target free calcium content given in a target scenario;
- On the basis of constructing the available reinforcement learning model in
step 304, this step aims for the executing body to receive the target free calcium content given by a user in the target scenario. This step serves as the first step in using the reinforcement learning model to guide the coal feed amount during cement calcination, that is, acquiring the set clinker quality index. - Step 306: determining a theoretical coal feed amount corresponding to the target free calcium content using the reinforcement learning model;
- On the basis of
step 305, this step aims for the executing body to determine the theoretical coal feed amount corresponding to the target free calcium content using the reinforcement learning model. That is, since the reinforcement learning model represents the corresponding relationship between the coal feed amount and the free calcium content, given the target free calcium content, the corresponding theoretical coal feed amount may be derived inversely from this corresponding relationship, where the theoretical coal feed amount includes a theoretical calciner coal feed amount and a theoretical kiln head coal feed amount. - Step 307: guiding a calciner coal feeding operation and a kiln head coal feeding operation in the target scenario based on the theoretical coal feed amount.
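The inverse use of the model in step 306, from a target free calcium content back to a coal feed amount, can be sketched as a search over candidate actions. Here `predict_free_calcium` is a hypothetical stand-in for the trained simulation and prediction models; its linear form and all numbers are invented for the example:

```python
import itertools

def predict_free_calcium(calciner_coal, kiln_head_coal):
    # Hypothetical stand-in for the trained simulation + prediction models:
    # more coal -> hotter calcination -> lower free calcium content (%).
    return 3.0 - 0.12 * calciner_coal - 0.08 * kiln_head_coal

def theoretical_coal_feed(target_fc, lo=5.0, hi=15.0, step=0.1, tol=0.05):
    """Return the (calciner, kiln head) coal feed pair with the smallest
    total coal whose predicted free calcium is within `tol` of the target."""
    grid = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    best = None
    for c, k in itertools.product(grid, grid):
        if abs(predict_free_calcium(c, k) - target_fc) <= tol:
            if best is None or c + k < best[0] + best[1]:
                best = (c, k)
    return best

feed = theoretical_coal_feed(target_fc=1.2)
print(feed)
```

A trained actor network would normally output this action directly; the grid search only illustrates the inverse relationship the model represents.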
- On the basis of step 306, this step aims to instruct the calciner coal feeding operation and the kiln head coal feeding operation in the target scenario based on the theoretical coal feed amount by the executing body. For example, a coal feeding device (such as the
coal feeding device 106 shown in Fig. 1 ) is controlled to feed a corresponding amount of coal to the calciner and the kiln head. - Since this embodiment of the present disclosure includes all the technical features of the previous embodiment (that is, the construction steps of the reinforcement learning model), it has all the beneficial effects of the previous embodiment. On this basis, this embodiment of the present disclosure also provides a solution on how to guide the coal feed amount based on the constructed reinforcement learning model through
steps 305 to 307, so as to guide the input coal amount in the cement calcination process by giving a reasonable coal feed amount: putting in as little coal as possible while ensuring the clinker quality, so as to reduce costs and increase efficiency. The coal saved is also equivalent to a corresponding reduction in the carbon dioxide emitted to the atmosphere, which helps the enterprise to be environmentally friendly. - On the basis of the previous embodiment, although the above controlled variables are mainly affected by the adjustable variables, cement calcination is a very complicated process: many other sudden or unavoidable factors may change some controlled variables and in turn affect the clinker quality. Therefore, the following solution may also be used to determine whether other means are required to adjust the controlled variables:
for the calciner temperature: - acquiring a current calciner temperature, and determining a simulated calciner coal feed amount corresponding to the current calciner temperature based on the first simulation model; and
- adjusting the calciner temperature based on the sign (plus or minus) of a first difference between the simulated calciner coal feed amount and the theoretical calciner coal feed amount, in response to the first difference exceeding a first preset threshold.
- Similarly, for the kiln current, the secondary air temperature, and the smoke chamber temperature:
- acquiring a current kiln current, a current secondary air temperature, and a current smoke chamber temperature, and determining a simulated kiln head coal feed amount corresponding to the current kiln current, the current secondary air temperature, and the current smoke chamber temperature based on the second simulation model; and
- adjusting the kiln current, the secondary air temperature, and the smoke chamber temperature based on a second difference between the simulated kiln head coal feed amount and the theoretical kiln head coal feed amount, in response to the second difference exceeding a second preset threshold.
- Temperature control includes, but is not limited to, all effective means such as physical cooling or reduction of coal feed amount.
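The threshold logic above can be expressed as a small helper; the function name, units, and threshold values here are illustrative, not from the patent:

```python
def needs_adjustment(simulated_feed, theoretical_feed, threshold):
    """Compare the coal feed amount implied by current sensor readings
    (via the simulation model) with the theoretical coal feed amount;
    return (exceeds_threshold, sign_of_difference)."""
    diff = simulated_feed - theoretical_feed
    if abs(diff) <= threshold:
        return False, 0
    return True, 1 if diff > 0 else -1

# Example: the simulated calciner coal feed implied by the current
# calciner temperature is well above the theoretical feed, so the
# temperature should be adjusted downward per the sign of the difference.
print(needs_adjustment(simulated_feed=12.4, theoretical_feed=10.0, threshold=1.0))
```

The same check applies to the second difference, using the simulated and theoretical kiln head coal feed amounts.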
- To deepen understanding, the present disclosure also provides an implementation solution in combination with an application scenario. Reference may be made to the schematic diagram as shown in
Fig. 4 . - An entire process of cement calcination may be seen in the schematic diagram of an apparatus given in the upper left corner of
Fig. 4 . First, the raw material is fed, and then it sequentially goes through the four processes of preheater preheating, calciner heating, rotary kiln calcination, and grate cooler cooling to produce clinker. The entire process involves many controlled parameters, such as the calciner coal feed amount or the kiln head coal feed amount, and these parameters directly affect the quality of the clinker, that is, the free calcium content. In actual production, an enterprise usually requires the free calcium content to be between 0.5% and 1.5%. Mechanism research has found that a low free calcium content results from an excessively high calcination temperature, which causes overburning, and that a higher coal feed amount corresponds to higher coal consumption. Therefore, in order to ensure low coal consumption under the premise of qualified quality, the free calcium content in the modeling process of the embodiments of the present disclosure is targeted to be between 1% and 1.5%, to reduce production costs as much as possible while ensuring quality.
- The free calcium content is measured about once an hour in the production process. Because it is required to control and adjust the coal feed amount and other parameters in real time, it is necessary to establish the real-time prediction model on the free calcium content. Since the free calcium content is mainly related to the calciner temperature, the kiln current, the secondary air temperature, the smoke chamber temperature and the under-grate pressure, the established model is:
- Free calcium content = f (calciner temperature, kiln current, secondary air temperature, smoke chamber temperature, under-grate pressure); in an experiment, a large amount of historical data are used to fit f. In the present embodiment, the large amount of historical data are used to construct the prediction model through neural networks.
- To adjust the parameters using the reinforcement learning model, it is necessary to construct the simulation models in the cement calcination process. That is, after the coal feed amount is adjusted, how the controlled variables such as the calciner temperature, the kiln current, the secondary air temperature, or the smoke chamber temperature may change during calcination. In the industry, a first-order inertia model plus a hysteresis link are often selected to simulate a complex industrial system having large inertia and pure lag. By consulting relevant professional information, the calciner temperature is mainly related to the calciner coal feed amount, and the kiln current, the secondary air temperature, and the smoke chamber temperature are mainly related to the kiln head coal feed amount. A system model of the calciner temperature with respect to the calciner coal feed amount may be established, and a system model of the kiln current, the secondary air temperature, and the smoke chamber temperature with respect to the kiln head coal feed amount may be established.
- With the simulation models and the prediction model constructed in the above steps, the reinforcement learning model may be easily established. The present embodiment uses an Actor-critic reinforcement learning model, uses the three adjustable parameters: calciner coal feed amount, kiln head coal feed amount, and under-grate pressure as an Action of the reinforcement learning model, and temporarily ignores other parameters in the calcination process. The purpose is to ensure that a final free calcium content is between 1%-1.5%, and at the same time, when the feed amount is set at a certain level, the calciner coal feed amount and the kiln head coal feed amount should be as little as possible. Since the measurement standard of coal consumption is total coal feed amount/feed amount, it is assumed that the speed of the feed amount is fixed, that is, the feed amount per unit time is fixed, so the coal consumption only needs to consider the calciner coal feed amount and the kiln head coal feed amount.
- Model details are as follows:
- Action: is a three-dimensional vector, three-dimension continuous action, which is the calciner coal feed amount, the kiln head coal feed amount, and an under- grate pressure value. That is, these three parameters are output for control at every moment;
- State: is a 14-dimensional (10-dimensional, after cutting some of the parameters corresponding to t-2) vector, which is calciner temperature t-2 (may be cut), t-1, a value at time t, the kiln current, the secondary air temperature and smoke chamber temperature t-2 (may be cut), t-1, a value at time t, a current value of the under-grate pressure, and a prediction value of the free calcium content given by a free calcium content prediction model constructed through the above steps. After each execution of an Action, State updates through the simulation environment;
- Reward (reward value): since the purpose is to reduce coal consumption while ensuring quality, Reward is divided into two parts, that is, whether the free calcium content is within a target value range and a current coal feed amount. That is, Reward = -(kiln head coal feed amount + calciner coal feed amount)+100∗I_({1%≤actual free calcium content≤1.5%}). Here, I is an indicative function, when 1%≤actual free calcium content≤1.5%, the value of I is 1, otherwise the value of I is 0.
- It may be seen from the above Reward formula that when the free calcium content meets the standard, the less the total coal feed amount, the greater the value of Reward.
- A data processing step at the bottom of
Fig. 4 is the parameter update process of the Actor-critic reinforcement learning model based on samples. First, a sample is obtained from each actual Action (its elements may be denoted S_t, a_t, r_t, S_{t+1}, etc.). Then, these samples are stored in a memory database in the form of tuples. Next, some data are selected from the memory database by sampling to update the parameters of the Actor-critic reinforcement learning model. The effectiveness and usability of the Actor-critic reinforcement learning model are maintained through this update method.
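The sampling-based update described above relies on a memory database of transition tuples. A minimal sketch, where the tuple layout follows the description and everything else (capacity, batch size, class name) is illustrative:

```python
import random
from collections import deque

class ReplayMemory:
    """Memory database of (S_t, a_t, r_t, S_{t+1}) tuples as described
    above; the oldest transitions are discarded once capacity is reached."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random minibatch for updating the actor-critic networks.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

memory = ReplayMemory(capacity=100)
for t in range(5):
    memory.store(state=[t], action=[0.1 * t], reward=-float(t), next_state=[t + 1])
batch = memory.sample(3)
print(len(batch))
```

Sampling minibatches from such a buffer, rather than always using the most recent transition, decorrelates the updates and keeps older experience usable.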
- With further reference to
Fig. 5 , as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for constructing a reinforcement learning model, and the apparatus embodiment corresponds to the method embodiment as shown in Fig. 2 . The apparatus may be particularly applied to various electronic devices. - As shown in
Fig. 5 , an apparatus 500 for constructing a reinforcement learning model of the present embodiment may include: a first simulation model establishing unit 501, a second simulation model establishing unit 502, a prediction model establishing unit 503 and a reinforcement learning model construction unit 504. The first simulation model establishing unit 501 is configured to establish a first simulation model between a calciner coal feed amount and a calciner temperature. The second simulation model establishing unit 502 is configured to establish a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature. The prediction model establishing unit 503 is configured to establish a prediction model among: an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content. The reinforcement learning model construction unit 504 is configured to construct a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model; the coal feed amount including the calciner coal feed amount and the kiln head coal feed amount. - In the present embodiment, in the
apparatus 500 for constructing a reinforcement learning model, for the particular processing and the technical effects of the first simulation model establishing unit 501, the second simulation model establishing unit 502, the prediction model establishing unit 503 and the reinforcement learning model construction unit 504, reference may be made to the relevant descriptions of steps 201-204 in the corresponding embodiment of Fig. 2 respectively, and detailed description thereof will be omitted. - In some alternative implementations of the present embodiment, the
apparatus 500 for constructing a reinforcement learning model may further include: - a given parameter receiving unit, configured to receive a target free calcium content given in a target scenario;
- a theoretical coal feed amount determination unit, configured to determine a theoretical coal feed amount corresponding to the target free calcium content using the reinforcement learning model; where, the theoretical coal feed amount includes a theoretical calciner coal feed amount and a theoretical kiln head coal feed amount; and
- a coal feeding operation instruction unit, configured to guide a calciner coal feeding operation and a kiln head coal feeding operation in the target scenario based on the theoretical coal feed amount.
- In some alternative implementations of the present embodiment, the
apparatus 500 for constructing a reinforcement learning model may further include: - a simulated calciner temperature determination unit, configured to acquire a current calciner temperature, and determine a simulated calciner coal feed amount corresponding to the current calciner temperature based on the first simulation model; and
- a first adjusting unit, configured to adjust the calciner temperature based on a plus or minus of a first difference between the simulated calciner coal feed amount and the theoretical calciner coal feed amount, in response to the first difference exceeding a first preset threshold.
- In some alternative implementations of the present embodiment, the
apparatus 500 for constructing a reinforcement learning model may further include: - a simulated kiln head coal feed amount determination unit, configured to acquire a current kiln current, a current secondary air temperature, and a current smoke chamber temperature, and determine a simulated kiln head coal feed amount corresponding to the current kiln current, the current secondary air temperature, and the current smoke chamber temperature based on the second simulation model; and
- a second adjusting unit, configured to adjust the kiln current, the secondary air temperature, and the smoke chamber temperature based on a second difference between the simulated kiln head coal feed amount and the theoretical kiln head coal feed amount, in response to the second difference exceeding a second preset threshold.
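How the three models held by these units compose, with the simulation models feeding process variables into the prediction model, can be sketched as follows; every callable here is a hypothetical stand-in, not the patented models:

```python
# Assumed-interface sketch of the model chain: the first simulation model
# maps the calciner coal feed amount to a calciner temperature, the second
# maps the kiln head coal feed amount to the kiln current, secondary air
# temperature and smoke chamber temperature, and the prediction model maps
# those variables plus the under-grate pressure to a free calcium content.
def predict_free_calcium(calciner_coal, kiln_head_coal, under_grate_pressure,
                         first_sim, second_sim, prediction_model):
    calciner_temp = first_sim(calciner_coal)
    kiln_current, secondary_air_temp, smoke_chamber_temp = second_sim(kiln_head_coal)
    return prediction_model(under_grate_pressure, calciner_temp,
                            kiln_current, secondary_air_temp,
                            smoke_chamber_temp)
```

With toy linear stand-ins for the three models, the chain runs end to end from the two coal feed amounts to a predicted free calcium value.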
- In some alternative implementations of the present embodiment, the reinforcement learning
model construction unit 504 may include:
an A2C reinforcement learning model construction subunit, configured to construct the reinforcement learning model that represents the association between the coal feed amount and the free calcium content according to an Actor-Critic reinforcement learning model architecture. - In some alternative implementations of the present embodiment, the A2C reinforcement learning model construction subunit may further include:
- an Action configuration module, configured to construct the calciner coal feed amount, the kiln head coal feed amount, and the under-grate pressure as an Action represented by a three-dimensional vector;
- a State configuration module, configured to construct a State represented by a ten-dimensional vector by at least using each of the following as a dimension: a calciner temperature, a kiln current, a secondary air temperature, and a smoke chamber temperature at a previous moment; a calciner temperature, a kiln current, a secondary air temperature, a smoke chamber temperature and an under-grate pressure at a current moment; and a prediction value of the free calcium content output by the prediction model; wherein, after each execution of an Action, the State is updated through a preset simulation environment;
- a Reward configuration module, configured to determine a Reward indicating whether the output prediction value of the free calcium content is within a preset target value range, and indicating a current coal feed amount; and
- an A2C reinforcement learning model construction module, configured to construct the reinforcement learning model that represents the association between the coal feed amount and the free calcium content, based on the Action, the State and the Reward.
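As a toy illustration of the Action/State wiring (a deliberately simplified linear stand-in, not the A2C networks of the disclosure), an actor mapping the ten-dimensional State to the three-dimensional Action and a critic scoring the State might look like:

```python
import numpy as np

# Simplified linear stand-in (an assumption for illustration only): the
# actor maps the ten-dimensional State to the three-dimensional Action
# (calciner coal feed, kiln head coal feed, under-grate pressure); the
# critic estimates a scalar state value used when computing advantages.
rng = np.random.default_rng(0)

class LinearActorCritic:
    def __init__(self, state_dim=10, action_dim=3):
        self.w_actor = rng.normal(scale=0.1, size=(action_dim, state_dim))
        self.w_critic = rng.normal(scale=0.1, size=state_dim)

    def act(self, state):
        return self.w_actor @ state          # deterministic 3-dim action

    def value(self, state):
        return float(self.w_critic @ state)  # scalar state-value estimate
```

In a real A2C setup both maps would be neural networks and the actor would parameterize a stochastic policy; the linear version only shows the shapes the modules above agree on.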
- The present embodiment corresponds to the above method embodiment as the apparatus embodiment. Different from the existing technology, which may not meet the needs of a complex cement calcination scenario, the apparatus for constructing a reinforcement learning model provided in this embodiment of the present disclosure introduces the concept of reinforcement learning into the cement calcination scenario. Based on the established simulation models and the prediction model, under the reinforcement learning architecture, a reinforcement learning model is constructed that may represent the corresponding relationship between the input coal feed amount, under the influence of a plurality of parameters, and the free calcium content of the final product. In addition, since the reinforcement learning model is different from the compensator characteristics of other machine learning models, it is more compatible with the complex, multi-parameter cement calcination scenario, making the determined corresponding relationship more accurate; at the same time, the strong generalization ability of the reinforcement learning model allows it to be applied more simply to other similar scenarios.
- According to an embodiment of the present disclosure, the present disclosure also provides an electronic device for constructing a reinforcement learning model and a readable storage medium.
-
Fig. 6 shows a block diagram of an electronic device suitable for implementing the method for constructing a reinforcement learning model according to an embodiment of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the present disclosure described and/or claimed herein. - As shown in
Fig. 6 , the electronic device includes: one or more processors 601, a memory 602, and interfaces for connecting various components, including high-speed interfaces and low-speed interfaces. The various components are connected to each other using different buses, and may be installed on a common motherboard or in other manners as needed. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphic information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, a plurality of processors and/or a plurality of buses may be used together with a plurality of memories if desired. Similarly, a plurality of electronic devices may be connected, with each device providing some of the necessary operations (for example, as a server array, a set of blade servers, or a multi-processor system). In Fig. 6 , one processor 601 is used as an example. - The
memory 602 is a non-transitory computer readable storage medium provided by the present disclosure. The memory stores instructions executable by at least one processor, so that the at least one processor performs the method for constructing a reinforcement learning model provided by the present disclosure. The non-transitory computer readable storage medium of the present disclosure stores computer instructions for causing a computer to perform the method for constructing a reinforcement learning model provided by the present disclosure. - The
memory 602, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs and modules, such as the program instructions/modules corresponding to the method for constructing a reinforcement learning model in the embodiments of the present disclosure (for example, the first simulation model establishing unit 501, the second simulation model establishing unit 502, the prediction model establishing unit 503 and the reinforcement learning model construction unit 504 as shown in Fig. 5 ). The processor 601 executes the non-transitory software programs, instructions, and modules stored in the memory 602 to execute various functional applications and data processing of the server, that is, to implement the method for constructing a reinforcement learning model in the foregoing method embodiments. - The
memory 602 may include a storage program area and a storage data area, where the storage program area may store an operating system and an application program required by at least one function; and the storage data area may store data created by the use of the electronic device according to the method for constructing a reinforcement learning model, etc. In addition, the memory 602 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory 602 may optionally include memories remotely located with respect to the processor 601, and these remote memories may be connected through a network to the electronic device implementing the method for constructing a reinforcement learning model. Examples of the above network include but are not limited to the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. - The electronic device of the method for constructing a reinforcement learning model may further include: an
input apparatus 603 and an output apparatus 604. The processor 601, the memory 602, the input apparatus 603, and the output apparatus 604 may be connected through a bus or by other methods. In Fig. 6 , connection through a bus is used as an example. - The
input apparatus 603 may receive input digital or character information, and generate key signal inputs related to user settings and function control of the electronic device of the method for constructing a reinforcement learning model; examples include a touch screen, keypad, mouse, trackpad, touchpad, pointing stick, one or more mouse buttons, trackball, joystick and other input apparatuses. The output apparatus 604 may include a display device, an auxiliary lighting apparatus (for example, an LED), a tactile feedback apparatus (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen. - Various embodiments of the systems and technologies described herein may be implemented in digital electronic circuit systems, integrated circuit systems, dedicated ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system that includes at least one programmable processor. The programmable processor may be a dedicated or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit the data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
- These computing programs (also referred to as programs, software, software applications, or code) include machine instructions of the programmable processor and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine readable medium" and "computer readable medium" refer to any computer program product, device, and/or apparatus (for example, a magnetic disk, optical disk, memory, or programmable logic device (PLD)) used to provide machine instructions and/or data to the programmable processor, including a machine readable medium that receives machine instructions as machine readable signals. The term "machine readable signal" refers to any signal used to provide machine instructions and/or data to the programmable processor.
- In order to provide interaction with a user, the systems and technologies described herein may be implemented on a computer that has: a display apparatus for displaying information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing apparatus (for example, a mouse or trackball), with which the user may provide input to the computer. Other types of apparatuses may also be used to provide interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
- The systems and technologies described herein may be implemented in a computing system that includes backend components (e.g., as a data server), or a computing system that includes middleware components (e.g., application server), or a computing system that includes frontend components (for example, a user computer having a graphical user interface or a web browser, through which the user may interact with the implementations of the systems and the technologies described herein), or a computing system that includes any combination of such backend components, middleware components, or frontend components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., communication network). Examples of the communication network include: local area networks (LAN), wide area networks (WAN) and the Internet.
- The computing system may include a client and a server. The client and the server are generally remote from each other and usually interact through the communication network. The relationship between the client and the server is generated by computer programs that run on the corresponding computers and have a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or a cloud host; it is a host product in the cloud computing service system, and solves the defects of difficult management and weak business scalability in traditional physical host and virtual private server (VPS) services.
- According to the technical solution of the embodiments of the present disclosure, the concept of reinforcement learning is introduced into a cement calcination scenario. Based on the established simulation models and the prediction model, under the reinforcement learning architecture, a reinforcement learning model is constructed that may represent the corresponding relationship between the input coal feed amount, under the influence of a plurality of parameters, and the free calcium content of the final product. In addition, since the reinforcement learning model is different from the compensator characteristics of other machine learning models, it is more compatible with the complex, multi-parameter cement calcination scenario, making the determined corresponding relationship more accurate; at the same time, the strong generalization ability of the reinforcement learning model allows it to be applied more simply to other similar scenarios.
- It should be understood that the various forms of processes shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in different orders. As long as the desired results of the technical solution disclosed in the present disclosure may be achieved, no limitation is made herein.
- The above particular embodiments do not constitute limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement and improvement made within the spirit and principle of the present disclosure shall be included in the protection scope of the present disclosure.
Claims (15)
- A method for constructing a reinforcement learning model, the method comprising: establishing (201) a first simulation model between a calciner coal feed amount and a calciner temperature; establishing (202) a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature; establishing (203) a prediction model among: an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; and constructing (204) a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model; the coal feed amount comprising the calciner coal feed amount and the kiln head coal feed amount.
- The method according to claim 1, further comprising: receiving (305) a target free calcium content given in a target scenario; determining (306) a theoretical coal feed amount corresponding to the target free calcium content using the reinforcement learning model; wherein, the theoretical coal feed amount comprises a theoretical calciner coal feed amount and a theoretical kiln head coal feed amount; and guiding (307) a calciner coal feeding operation and a kiln head coal feeding operation in the target scenario based on the theoretical coal feed amount.
- The method according to claim 2, further comprising: acquiring a current calciner temperature, and determining a simulated calciner coal feed amount corresponding to the current calciner temperature based on the first simulation model; and adjusting the calciner temperature based on a plus or minus of a first difference between the simulated calciner coal feed amount and the theoretical calciner coal feed amount, in response to the first difference exceeding a first preset threshold.
- The method according to claim 2, further comprising: acquiring a current kiln current, a current secondary air temperature, and a current smoke chamber temperature, and determining a simulated kiln head coal feed amount corresponding to the current kiln current, the current secondary air temperature, and the current smoke chamber temperature based on the second simulation model; and adjusting the kiln current, the secondary air temperature, and the smoke chamber temperature based on a second difference between the simulated kiln head coal feed amount and the theoretical kiln head coal feed amount, in response to the second difference exceeding a second preset threshold.
- The method according to any one of claims 1-4, wherein, the constructing (204) comprises:
constructing the reinforcement learning model according to an Actor-Critic reinforcement learning model architecture. - The method according to claim 5, wherein, the constructing the reinforcement learning model according to an Actor-Critic reinforcement learning model architecture comprises: constructing the calciner coal feed amount, the kiln head coal feed amount, and the under-grate pressure as an Action represented by a three-dimensional vector; constructing a State represented by a ten-dimensional vector by at least using each of the following as a dimension: a calciner temperature, a kiln current, a secondary air temperature, and a smoke chamber temperature at a previous moment; a calciner temperature, a kiln current, a secondary air temperature, a smoke chamber temperature and an under-grate pressure at a current moment; and a prediction value of the free calcium content output by the prediction model; wherein, after each execution of an Action, the State is updated through a preset simulation environment; determining a Reward indicating whether the output prediction value of the free calcium content is within a preset target value range, and indicating a current coal feed amount; and constructing the reinforcement learning model that represents the association between the coal feed amount and the free calcium content, based on the Action, the State and the Reward.
- An apparatus for constructing a reinforcement learning model, the apparatus comprising: a first simulation model establishing unit (501), configured to establish a first simulation model between a calciner coal feed amount and a calciner temperature; a second simulation model establishing unit (502), configured to establish a second simulation model among a kiln head coal feed amount, a kiln current, a secondary air temperature, and a smoke chamber temperature; a prediction model establishing unit (503), configured to establish a prediction model among: an under-grate pressure; the calciner temperature output by the first simulation model; the kiln current, the secondary air temperature, and the smoke chamber temperature output by the second simulation model; and a free calcium content; and a reinforcement learning model construction unit (504), configured to construct a reinforcement learning model that represents an association between a coal feed amount and the free calcium content according to a preset reinforcement learning model architecture, using the first simulation model, the second simulation model, and the prediction model; the coal feed amount comprising the calciner coal feed amount and the kiln head coal feed amount.
- The apparatus according to claim 7, further comprising: a given parameter receiving unit, configured to receive a target free calcium content given in a target scenario; a theoretical coal feed amount determination unit, configured to determine a theoretical coal feed amount corresponding to the target free calcium content using the reinforcement learning model; wherein, the theoretical coal feed amount comprises a theoretical calciner coal feed amount and a theoretical kiln head coal feed amount; and a coal feeding operation instruction unit, configured to guide a calciner coal feeding operation and a kiln head coal feeding operation in the target scenario based on the theoretical coal feed amount.
- The apparatus according to claim 8, further comprising: a simulated calciner temperature determination unit, configured to acquire a current calciner temperature, and determine a simulated calciner coal feed amount corresponding to the current calciner temperature based on the first simulation model; and a first adjusting unit, configured to adjust the calciner temperature based on a plus or minus of a first difference between the simulated calciner coal feed amount and the theoretical calciner coal feed amount, in response to the first difference exceeding a first preset threshold.
- The apparatus according to claim 8, further comprising: a simulated kiln head coal feed amount determination unit, configured to acquire a current kiln current, a current secondary air temperature, and a current smoke chamber temperature, and determine a simulated kiln head coal feed amount corresponding to the current kiln current, the current secondary air temperature, and the current smoke chamber temperature based on the second simulation model; and a second adjusting unit, configured to adjust the kiln current, the secondary air temperature, and the smoke chamber temperature based on a second difference between the simulated kiln head coal feed amount and the theoretical kiln head coal feed amount, in response to the second difference exceeding a second preset threshold.
- The apparatus according to any one of claims 7-10, wherein, the reinforcement learning model construction unit (504) comprises:
an A2C reinforcement learning model construction subunit, configured to construct the reinforcement learning model according to an Actor-Critic reinforcement learning model architecture. - The apparatus according to claim 11, wherein, the A2C reinforcement learning model construction subunit further comprises: an Action configuration module, configured to construct the calciner coal feed amount, the kiln head coal feed amount, and the under-grate pressure as an Action represented by a three-dimensional vector; a State configuration module, configured to construct a State represented by a ten-dimensional vector by at least using each of the following as a dimension: a calciner temperature, a kiln current, a secondary air temperature, and a smoke chamber temperature at a previous moment; a calciner temperature, a kiln current, a secondary air temperature, a smoke chamber temperature and an under-grate pressure at a current moment; and a prediction value of the free calcium content output by the prediction model; wherein, after each execution of an Action, the State is updated through a preset simulation environment; a Reward configuration module, configured to determine a Reward indicating whether the output prediction value of the free calcium content is within a preset target value range, and indicating a current coal feed amount; and an A2C reinforcement learning model construction module, configured to construct the reinforcement learning model that represents the association between the coal feed amount and the free calcium content, based on the Action, the State and the Reward.
- An electronic device, comprising: at least one processor (601); and a memory (602) communicatively connected to the at least one processor (601); wherein the memory (602) stores instructions executable by the at least one processor (601), and the instructions, when executed by the at least one processor (601), cause the at least one processor (601) to perform the method for constructing a reinforcement learning model according to any one of claims 1-6.
- A non-transitory computer readable storage medium, storing computer instructions, the computer instructions being used to cause a computer to perform the method for constructing a reinforcement learning model according to any one of claims 1-6.
- A computer program product comprising a computer program, the computer program, when executed by a processor (601), implementing the method according to any one of claims 1-6.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010948561.XA CN112100916B (en) | 2020-09-10 | 2020-09-10 | Method, device, electronic equipment and medium for constructing reinforcement learning model |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3872432A1 true EP3872432A1 (en) | 2021-09-01 |
EP3872432B1 EP3872432B1 (en) | 2023-06-21 |
Family
ID=73750827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP21164660.9A Active EP3872432B1 (en) | 2020-09-10 | 2021-03-24 | Method, apparatus and electronic device for constructing reinforcement learning model |
Country Status (5)
Country | Link |
---|---|
US (1) | US20210216686A1 (en) |
EP (1) | EP3872432B1 (en) |
JP (1) | JP7257436B2 (en) |
KR (1) | KR102506122B1 (en) |
CN (1) | CN112100916B (en) |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1014240A1 (en) | 1998-12-17 | 2000-06-28 | Siemens Aktiengesellschaft | A system of case-based reasoning for sensor prediction in a technical process, especially in a cement kiln, method and apparatus therefore |
ATE292098T1 (en) * | 1999-11-04 | 2005-04-15 | Pretoria Portland Cement | CONTROL SYSTEM FOR A KILN SYSTEM |
US8095479B2 (en) * | 2006-02-28 | 2012-01-10 | Hitachi, Ltd. | Plant control apparatus and method having functions of determining appropriate learning constraint conditions |
US7660639B2 (en) * | 2006-03-27 | 2010-02-09 | Hitachi, Ltd. | Control system for control subject having combustion unit and control system for plant having boiler |
JP6636358B2 (en) * | 2015-09-30 | 2020-01-29 | 太平洋セメント株式会社 | How to predict the quality or manufacturing conditions of fly ash cement |
JP6639988B2 (en) | 2016-03-29 | 2020-02-05 | 太平洋セメント株式会社 | Prediction method of manufacturing conditions of cement clinker |
CN106570244B (en) * | 2016-10-25 | 2019-07-30 | 浙江邦业科技股份有限公司 | One-dimensional simulation method for predicting cement rotary kiln clinker quality |
CN110187727B (en) * | 2019-06-17 | 2021-08-03 | 武汉理工大学 | Glass melting furnace temperature control method based on deep learning and reinforcement learning |
CN111061149B (en) * | 2019-07-01 | 2022-08-02 | 浙江恒逸石化有限公司 | Circulating fluidized bed coal saving and consumption reduction method based on deep learning prediction control optimization |
CN110981240B (en) * | 2019-12-19 | 2022-04-08 | 华东理工大学 | Calcination process optimization method and system |
2020
- 2020-09-10 CN CN202010948561.XA patent/CN112100916B/en active Active

2021
- 2021-03-24 EP EP21164660.9A patent/EP3872432B1/en active Active
- 2021-03-29 US US17/215,932 patent/US20210216686A1/en active Pending
- 2021-03-29 JP JP2021055392A patent/JP7257436B2/en active Active
- 2021-04-20 KR KR1020210051391A patent/KR102506122B1/en active IP Right Grant
Non-Patent Citations (1)
Title |
---|
Zhou Xiaojie et al., "Supervisory Control for Rotary Kiln Temperature Based on Reinforcement Learning", Lecture Notes in Control and Information Sciences, vol. 344, Berlin, DE, 1 January 2006, pp. 428-437, ISSN: 0170-8643, XP055820193, DOI: 10.1007/978-3-540-37256-1_49 * |
Also Published As
Publication number | Publication date |
---|---|
CN112100916A (en) | 2020-12-18 |
JP7257436B2 (en) | 2023-04-13 |
JP2022023775A (en) | 2022-02-08 |
KR102506122B1 (en) | 2023-03-03 |
EP3872432B1 (en) | 2023-06-21 |
US20210216686A1 (en) | 2021-07-15 |
CN112100916B (en) | 2023-07-25 |
KR20210052412A (en) | 2021-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3872432A1 (en) | Method, apparatus and electronic device for constructing reinforcement learning model | |
JP6530783B2 (en) | Machine learning device, control device and machine learning program | |
Jin et al. | Zeroing neural networks: A survey | |
US20220100156A1 (en) | Control system database systems and methods | |
US9740186B2 (en) | Monitoring control system and control device | |
CN110481536B (en) | Control method and device applied to hybrid electric vehicle | |
US10564617B2 (en) | Plant control device, plant control method, and recording medium | |
US10809701B2 (en) | Network adapted control system | |
CN111966361B (en) | Method, device, equipment and storage medium for determining model to be deployed | |
US11675570B2 (en) | System and method for using a graphical user interface to develop a virtual programmable logic controller | |
CN115016353A (en) | Monitoring management system for remote control equipment | |
CN110297460A (en) | Thermal walking update the system and computer | |
CN112859601B (en) | Robot controller design method, device, equipment and readable storage medium | |
CN104081298A (en) | System and method for automated handling of a workflow in an automation and/or electrical engineering project | |
CN103558762A (en) | Method for implementing immune genetic PID controller based on graphic configuration technology | |
CN112828896B (en) | Household intelligent accompanying robot and execution control method thereof | |
Labib | Machine Learning-Based Framework to Predict Single and Multiple Daylighting Simulation Outputs Using Neural Networks | |
US11709471B2 (en) | Distributed automated synthesis of correct-by-construction controllers | |
Zhao et al. | Application Research of Intelligent Pneumatic Control System in Industrial Automation | |
He et al. | Embedded Dynamic Intelligent Algorithm in Computer Software Testing | |
Xu | Design and Implementation of Embedded Intelligent Control System Software Programming Based on Genetic Algorithms | |
Criollo et al. | DIGITAL SHADOW SUPPORTED BY AUGMENTED REALITY AND CLOUD FOR THE MONITORING OF AN INDUSTRIAL PROCESS | |
KR20230108341A (en) | Data collection analysis module, operation method of data collection analysis module and programmable logic controller | |
Huang et al. | Ant Colony Algorithm Theory and Its Application in Control Engineering | |
CN115560604A (en) | Control method, device, medium and equipment for material flow regulating valve of blast furnace charging bucket |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20210324 |
|
AK | Designated contracting states |
Kind code of ref document: A1
Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
17Q | First examination report despatched |
Effective date: 20220411 |
|
REG | Reference to a national code |
Ref country code: DE
Ref legal event code: R079
Ref document number: 602021002983
Country of ref document: DE
Free format text: PREVIOUS MAIN CLASS: F27B0007200000
Ipc: F27B0007420000 |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06N 3/08 20060101ALI20221004BHEP
Ipc: G06N 3/04 20060101ALI20221004BHEP
Ipc: G06N 3/00 20060101ALI20221004BHEP
Ipc: F27D 21/00 20060101ALI20221004BHEP
Ipc: F27D 19/00 20060101ALI20221004BHEP
Ipc: F27B 7/20 20060101ALI20221004BHEP
Ipc: C04B 7/36 20060101ALI20221004BHEP
Ipc: F27B 7/42 20060101AFI20221004BHEP |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
INTG | Intention to grant announced |
Effective date: 20230111 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE PATENT HAS BEEN GRANTED |
|
AK | Designated contracting states |
Kind code of ref document: B1
Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: EP |
|
REG | Reference to a national code |
Ref country code: DE
Ref legal event code: R096
Ref document number: 602021002983
Country of ref document: DE |
|
REG | Reference to a national code |
Ref country code: AT
Ref legal event code: REF
Ref document number: 1581158
Country of ref document: AT
Kind code of ref document: T
Effective date: 20230715 |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D |
|
REG | Reference to a national code |
Ref country code: LT Ref legal event code: MG9D |
|
REG | Reference to a national code |
Ref country code: NL
Ref legal event code: MP
Effective date: 20230621 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SE
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: NO
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230921 |
|
REG | Reference to a national code |
Ref country code: AT
Ref legal event code: MK05
Ref document number: 1581158
Country of ref document: AT
Kind code of ref document: T
Effective date: 20230621 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: RS
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: NL
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: LV
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: LT
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: HR
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: GR
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230922 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: FI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20230621 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20230621 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: ES Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20230621 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20231021 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SM
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: SK
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: RO
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: PT
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20231023

Ref country code: IS
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20231021

Ref country code: ES
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: EE
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: CZ
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621

Ref country code: AT
Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT
Effective date: 20230621 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: PL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20230621 |
|
REG | Reference to a national code |
Ref country code: DE
Ref legal event code: R097
Ref document number: 602021002983
Country of ref document: DE |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: DK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20230621 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE
Payment date: 20240130
Year of fee payment: 4 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20230621 |
|
26N | No opposition filed |
Effective date: 20240322 |