CN114484584A - Heat supply control method and system based on offline reinforcement learning - Google Patents

Heat supply control method and system based on offline reinforcement learning

Info

Publication number
CN114484584A
CN114484584A (application CN202210067515.8A); granted publication CN114484584B
Authority
CN
China
Prior art keywords
data
model
steps
heat supply
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210067515.8A
Other languages
Chinese (zh)
Other versions
CN114484584B (en)
Inventor
马志军
胡继新
梁炜
何子峰
张康
成甜甜
曹玉玺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Power Investment Group Xiongan Energy Co ltd
Guodian Investment Fenghe New Energy Technology Hebei Co ltd
Original Assignee
State Power Investment Group Xiongan Energy Co ltd
Guodian Investment Fenghe New Energy Technology Hebei Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Power Investment Group Xiongan Energy Co ltd, Guodian Investment Fenghe New Energy Technology Hebei Co ltd filed Critical State Power Investment Group Xiongan Energy Co ltd
Priority to CN202210067515.8A priority Critical patent/CN114484584B/en
Publication of CN114484584A publication Critical patent/CN114484584A/en
Application granted granted Critical
Publication of CN114484584B publication Critical patent/CN114484584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • F MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24 HEATING; RANGES; VENTILATING
    • F24D DOMESTIC- OR SPACE-HEATING SYSTEMS, e.g. CENTRAL HEATING SYSTEMS; DOMESTIC HOT-WATER SUPPLY SYSTEMS; ELEMENTS OR COMPONENTS THEREFOR
    • F24D19/00 Details
    • F24D19/10 Arrangement or mounting of control or safety devices
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention provides a heat supply control method and system based on offline reinforcement learning, wherein the method comprises the following steps: collecting heating data and inputting the heating data set into a heating model; sampling interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model; deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and monitoring the effect of the G_ω model. The invention applies an advanced offline reinforcement learning algorithm to a central heating control system, fully exploiting the advantages of the reinforcement learning algorithm without interacting with the real environment and avoiding the inefficient sampling and high cost of environment interaction; historical interaction data are fully utilized, and compared with the prior art the performance of the control algorithm is greatly improved both in theory and in practice.

Description

Heat supply control method and system based on offline reinforcement learning
Technical Field
The invention relates to the technical field of heating system control, in particular to a heating control method and system based on offline reinforcement learning.
Background
Intelligent control of central heating systems has a great influence on improving residents' quality of life and on urban construction in China, and is a technology currently receiving much attention. A central heating system mainly comprises three parts: a heat source, heat exchange stations, and users. The heat sources of current central heating systems are mainly thermal power plants, regional boiler rooms and centralized boiler rooms; steam or hot water generated by the heat source is sent to the heat exchange stations through a primary pipe network, and the heat exchange stations transfer the heat of the primary-network steam or hot water to user terminals through a secondary pipe network.
In the past, traditional optimization control methods were used to control central heating systems, that is, regulation was performed by algorithms driven by models constructed from physical mechanisms. The drawbacks of these methods are obvious: once the operating conditions change, the adjustment capability of the algorithm is very limited, and modeling has to be carried out anew.
With the rapid development of the intelligent cloud technology and the AI algorithm, the application of the novel artificial intelligence algorithm in the central heating system is gradually deepened, and the advantages of the novel artificial intelligence algorithm are gradually highlighted. Compared with the traditional control algorithm, the intelligent control algorithm based on data driving has the advantages of strong robustness, high response speed and the like. At present, the intelligent control algorithm based on data driving mainly comprises the following two types:
1. Supervised learning: the model is trained end-to-end on historical data, so its performance depends heavily on the quality and quantity of the data; considering the data quality in actual scenarios, its generalization performance is poor.
2. Reinforcement learning: interaction with the environment is required. Given a State of the environment, the program selects a corresponding Action according to some Policy; after the action is executed the environment changes, i.e. the state transitions to a new state s', and the program obtains a Reward. The program adjusts its policy according to the rewards obtained, so that by the time all steps have been executed, i.e. when a Terminal state is reached, the cumulative reward is maximized. Reinforcement learning continuously strengthens the decision-making level of the agent, but since actual scenarios usually give the agent no opportunity for continual trial and error, safety and cost cannot be guaranteed; and without interaction with the environment, a large extrapolation error may result.
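In standard notation (a textbook formulation, not specific to this invention), the program therefore seeks a policy π that maximizes the expected cumulative reward

    J(π) = E_π [ Σ_{t=0}^{T} γ^t · r(s_t, a_t) ],   γ ∈ [0, 1],

where γ is the discount factor that appears again in the target equation of the method below.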
The two methods have their respective advantages and limitations. The idea of supervised learning is a purely end-to-end form, and if the training data are insufficient, the generalization error of the control algorithm is very large. General reinforcement learning methods can perform well in a control task such as central heating, but require interaction with the environment as the basis for improving model performance.
Disclosure of Invention
In view of this, the present invention provides a novel algorithm that meets users' heat demand more stably and, on that basis, effectively reduces the operating losses of the heating system and lowers the heating cost; it provides an offline reinforcement learning method that needs no interaction with the environment and can fully exploit the advantages of reinforcement learning. Offline reinforcement learning is a current hotspot in academia and industry and an important form for deploying reinforcement learning in heating scenarios; it can lower the threshold for applying reinforcement learning to heat supply and facilitates the intelligent and digital transformation of the heating industry. The goal of central heating control is to provide users with a comfortable indoor environment; technically, relevant parameters are adjusted through a control algorithm to meet the heat demand of heating users.
The invention provides a heat supply control method based on offline reinforcement learning, which comprises the following steps:
S1, collecting heating data, inputting the heating data into the heating model, and setting the time horizon T, the target network update rate τ, the mini-batch size N, the maximum perturbation Φ, the number of sampled actions n, the minimum weighting λ, and random initial parameters θ_1, θ_2, φ, ω;
initializing two Q networks Q_θ(s, a): Q_{θ1}, Q_{θ2}; a perturbation model ξ_φ; and target networks Q_{θ'1}, Q_{θ'2}, where two target networks are used to prevent over-estimation of the Q value, together with a target perturbation model ξ_{φ'}, the purpose of the perturbation network being to provide action diversity, so that actions within [−Φ, Φ] can be sampled rather than relying solely on the generator; and generating the VAE normal-distribution model G_ω = {E_{ω1}, D_{ω2}},
where θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
the parameter Φ is used to adjust actions within the range [−Φ, Φ], which allows the algorithm to access actions in the constrained region without sampling many times from the generative model G_ω;
S2, sampling a mini-batch of interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
based on the normal distribution N(μ, σ), let μ, σ = E_{ω1}(s, a); ã = D_{ω2}(s, z), z ~ N(μ, σ);
ω ← argmin_ω Σ (a − ã)² + D_KL( N(μ, σ) ‖ N(0, 1) );
where s is the State, a is the Action taken, and s' is the next state after a is executed in s; τ is the parameter expressing the influence of the new value on the updated value, and r is the Reward obtained after taking action a in state s;
selecting from G_ω, according to the distribution in the data set, the actions with the highest similarity as candidates, where the number of sampled actions n represents the number of candidate actions;
sampling n actions:
{ a_i ~ G_ω(s') }_{i=1}^{n};
and perturbing each sampled action:
{ a_i ← a_i + ξ_φ(s', a_i, Φ) }_{i=1}^{n},
to enhance the diversity of actions;
selecting, according to the Q networks, the highest-valued of these actions as the action actually taken;
setting the target y:
y = r + γ · max_{a_i} [ λ · min_{j=1,2} Q_{θ'_j}(s', a_i) + (1 − λ) · max_{j=1,2} Q_{θ'_j}(s', a_i) ];
where the λ parameter controls the degree to which future uncertainty is penalized, and γ is the discount factor used to reduce the influence of future values; both d (the terminal flag) and γ lie in the range 0–1;
θ ← argmin_θ Σ ( y − Q_θ(s, a) )²;
φ ← argmax_φ Σ Q_{θ1}(s, a + ξ_φ(s, a, Φ)), a ~ G_ω(s);
updating the target networks: θ'_i ← τ·θ_i + (1 − τ)·θ'_i; φ' ← τ·φ + (1 − τ)·φ';
looping, taking the minimum of the two Q networks in the target, until the time step reaches T;
S3, deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; monitoring the effect of the G_ω model at regular intervals and, according to the observed effect, updating the G_ω model with the newly trained version when the effect improves and rolling back the G_ω model when the effect is poor;
after deployment, the G_ω model further accumulates real-time expert data; the process then returns to the data-collection step and iterates continuously, improving the operating efficiency of the central heating system and saving energy while ensuring sufficient heat supply for residents.
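As a worked illustration of the target y in step S2 (all numbers chosen arbitrarily): with r = 1, γ = 0.99, λ = 0.75 and a single candidate action whose two target-network estimates are Q_{θ'1}(s', a_1) = 10 and Q_{θ'2}(s', a_1) = 12,

    y = 1 + 0.99 · (0.75 · min(10, 12) + 0.25 · max(10, 12)) = 1 + 0.99 · 10.5 ≈ 11.4,

so the pessimistic minimum dominates the target and over-estimated Q values are damped.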
Further, the generation method of the G_ω model in step S1 comprises: performing basic data processing on the heating data collected through different channels, including data cleaning and data aggregation;
the data cleaning method comprises: removing abnormal values and mutation points from the data based on an Elliptic Envelope model, and filling missing data by linear interpolation;
the data aggregation method comprises: aligning the timestamps of heating data collected at different frequencies to form complete historical data.
Further, the heating data collection in step S1 includes collecting weather data, central heating system operating conditions, and real-time data related to heat consumers.
Further, the weather data collection method comprises collecting, through an API provided by a weather data service, city-level real-time weather and 24-hour weather forecast data for the heating area in real time at a frequency of 5 minutes.
Further, the method for collecting central heating system operating conditions comprises: collecting operating-condition data in real time at a frequency of 10 minutes through pressure sensors, temperature sensors, flow meters and heat meters; the collected operating-condition data are transmitted to an intelligent gateway over a PLC protocol and uploaded by the gateway to a time-series database.
Further, the method for collecting user-side data comprises: collecting users' indoor temperature and humidity data in real time at a frequency of 5 minutes through smart speakers, uploading them in real time to an IoT platform via Wi-Fi, and synchronizing them to the database in real time.
The invention also provides a heat supply control system based on offline reinforcement learning, which uses the heat supply control method based on offline reinforcement learning described above and comprises:
a data acquisition module: used for acquiring heating data and inputting the heating data into the heating model;
a model generation module: used for sampling interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
a model deployment module: used for deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and for monitoring the effect of the G_ω model at regular intervals, updating the G_ω model when the effect improves and rolling back the G_ω model when the effect is poor.
Further, the model generation module includes a G_ω model training submodule for training the G_ω model based on the BCQ (Batch-Constrained deep Q-learning) algorithm.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the off-line reinforcement learning-based heating control method described above.
The invention also provides a computer device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the heating control method based on offline reinforcement learning.
Compared with the prior art, the invention has the following beneficial effects:
the invention applies an advanced offline reinforcement learning algorithm to a central heating control system, fully exploiting the advantages of the reinforcement learning algorithm without interacting with the real environment and avoiding the inefficient sampling and high cost of environment interaction; in addition, the invention makes full use of historical interaction data and, compared with the prior art, greatly improves the performance of the control algorithm both in theory and in practice.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
In the drawings:
FIG. 1 is a flow chart of a heat supply control method based on offline reinforcement learning according to the present invention;
FIG. 2 is a block diagram of a computer device according to an embodiment of the present invention;
FIG. 3 is a flowchart of a model training deployment according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and products consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
At present, intelligent control of a central heating system is mainly realized by designing machine-learning-based intelligent control algorithms on historical interaction data (also called expert data) from past years, specifically by the following two means:
(1) Supervised learning: strategies in the historical data are learned end to end from historical interaction data; the inputs are various numerical features under actual operating conditions, and the label is the regulation value, namely the secondary-network supply water temperature. In short, this method establishes a direct mapping between actual operating-condition data and the action taken, thereby learning the strategy (see the sketch after this list).
(2) Reinforcement learning: reinforcement learning can be subdivided into three technical approaches: first, directly building an agent-learning algorithm that interacts continuously with the real environment, so that the performance of the strategy is continuously reinforced; second, learning an environment model from historical interaction data (expert data), i.e. simulating how the environment changes after an action is taken, and then training the reinforcement learning algorithm in the simulated environment; and third, using a constructed mechanism model as the reinforcement learning environment for strategy learning by the agent.
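As a concrete illustration of means (1), a minimal supervised baseline might regress the secondary-network supply temperature directly on operating-condition features; all column names below are illustrative assumptions, not fields defined by the invention:

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    def fit_supervised_baseline(history: pd.DataFrame) -> GradientBoostingRegressor:
        features = ["outdoor_temp", "primary_supply_temp", "primary_return_temp",
                    "flow_rate", "indoor_temp"]                 # assumed feature columns
        X, y = history[features], history["secondary_supply_temp"]  # label = regulation value
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)  # time-ordered split
        model = GradientBoostingRegressor().fit(X_tr, y_tr)    # direct mapping state -> action
        print("hold-out R^2:", model.score(X_te, y_te))
        return model

Such a baseline performs exactly the direct state-to-action mapping described above, which is why its generalization is bounded by the quality and coverage of the historical data.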
The embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention provides a heat supply control method based on offline reinforcement learning which, as shown in FIG. 1, comprises the following steps:
S1, collecting heating data, inputting the heating data into the heating model, and setting the time horizon T, the target network update rate τ, the mini-batch size N, the maximum perturbation Φ, the number of sampled actions n, the minimum weighting λ, and random initial parameters θ_1, θ_2, φ, ω;
initializing two Q networks Q_θ(s, a): Q_{θ1}, Q_{θ2}; a perturbation model ξ_φ; and target networks Q_{θ'1}, Q_{θ'2}, where two target networks are used to prevent over-estimation of the Q value, together with a target perturbation model ξ_{φ'}, the purpose of the perturbation network being to provide action diversity, so that actions within [−Φ, Φ] can be sampled rather than relying solely on the generator; and generating the VAE normal-distribution model G_ω = {E_{ω1}, D_{ω2}},
where θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
the parameter Φ is used to adjust actions within the range [−Φ, Φ], which allows the algorithm to access actions in the constrained region without sampling many times from the generative model G_ω;
S2, sampling a mini-batch of interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
based on the normal distribution N(μ, σ), let μ, σ = E_{ω1}(s, a); ã = D_{ω2}(s, z), z ~ N(μ, σ);
ω ← argmin_ω Σ (a − ã)² + D_KL( N(μ, σ) ‖ N(0, 1) );
where s is the State, a is the Action taken, and s' is the next state after a is executed in s; τ is the parameter expressing the influence of the new value on the updated value, and r is the Reward obtained after taking action a in state s;
selecting from G_ω, according to the distribution in the data set, the actions with the highest similarity as candidates, where the number of sampled actions n represents the number of candidate actions;
sampling n actions:
{ a_i ~ G_ω(s') }_{i=1}^{n};
and perturbing each sampled action:
{ a_i ← a_i + ξ_φ(s', a_i, Φ) }_{i=1}^{n},
to enhance the diversity of actions;
selecting, according to the Q networks, the highest-valued of these actions as the action actually taken;
setting the target y:
y = r + γ · max_{a_i} [ λ · min_{j=1,2} Q_{θ'_j}(s', a_i) + (1 − λ) · max_{j=1,2} Q_{θ'_j}(s', a_i) ];
where the λ parameter controls the degree to which future uncertainty is penalized, and γ is the discount factor used to reduce the influence of future values; both d (the terminal flag) and γ lie in the range 0–1;
θ ← argmin_θ Σ ( y − Q_θ(s, a) )²;
φ ← argmax_φ Σ Q_{θ1}(s, a + ξ_φ(s, a, Φ)), a ~ G_ω(s);
updating the target networks: θ'_i ← τ·θ_i + (1 − τ)·θ'_i; φ' ← τ·φ + (1 − τ)·φ';
looping, taking the minimum of the two Q networks in the target, until the time step reaches T;
S3, deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; monitoring the effect of the G_ω model at regular intervals and, according to the observed effect, updating the G_ω model with the newly trained version when the effect improves and rolling back the G_ω model when the effect is poor;
after deployment, the G_ω model further accumulates real-time expert data; the process then returns to the data-collection step and iterates continuously, improving the operating efficiency of the central heating system and saving energy while ensuring sufficient heat supply for residents.
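To make the update equations of steps S1-S2 concrete, the following is a minimal PyTorch sketch of one BCQ-style training step. The network sizes, learning rates and hyper-parameter values at the top are illustrative assumptions rather than values specified by this embodiment, and actions are assumed normalized to [−1, 1]:

    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    S_DIM, A_DIM, Z_DIM = 8, 1, 4        # assumed state/action/latent sizes
    N_ACT, PHI, LMBDA, GAMMA, TAU = 10, 0.05, 0.75, 0.99, 0.005

    class VAE(nn.Module):                # G_w = {E_w1, D_w2}
        def __init__(self):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU())
            self.mu = nn.Linear(64, Z_DIM)
            self.log_std = nn.Linear(64, Z_DIM)
            self.dec = nn.Sequential(nn.Linear(S_DIM + Z_DIM, 64), nn.ReLU(),
                                     nn.Linear(64, A_DIM), nn.Tanh())
        def forward(self, s, a):
            h = self.enc(torch.cat([s, a], 1))
            mu, std = self.mu(h), self.log_std(h).clamp(-4, 4).exp()
            z = mu + std * torch.randn_like(std)          # z ~ N(mu, sigma)
            return self.dec(torch.cat([s, z], 1)), mu, std
        def sample(self, s):                              # a ~ G_w(s)
            z = torch.randn(s.shape[0], Z_DIM).clamp(-0.5, 0.5)
            return self.dec(torch.cat([s, z], 1))

    def mlp(i, o):
        return nn.Sequential(nn.Linear(i, 64), nn.ReLU(), nn.Linear(64, o))

    vae = VAE()
    q1, q2 = mlp(S_DIM + A_DIM, 1), mlp(S_DIM + A_DIM, 1)   # Q_theta1, Q_theta2
    xi = mlp(S_DIM + A_DIM, A_DIM)                          # perturbation model xi_phi
    q1_t, q2_t, xi_t = copy.deepcopy(q1), copy.deepcopy(q2), copy.deepcopy(xi)
    vae_opt = torch.optim.Adam(vae.parameters(), 1e-3)
    q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), 1e-3)
    xi_opt = torch.optim.Adam(xi.parameters(), 1e-3)

    def perturb(net, s, a):              # xi_phi(s, a, Phi), bounded to [-Phi, Phi]
        return PHI * torch.tanh(net(torch.cat([s, a], 1)))

    def train_step(s, a, r, s2, done):   # one mini-batch (s, a, r, s') from the data set
        # VAE: w <- argmin sum (a - a~)^2 + KL(N(mu, sigma) || N(0, 1))
        a_rec, mu, std = vae(s, a)
        kl = -0.5 * (1 + 2 * std.log() - mu.pow(2) - std.pow(2)).mean()
        vae_opt.zero_grad(); (F.mse_loss(a_rec, a) + 0.5 * kl).backward(); vae_opt.step()

        with torch.no_grad():            # target y over n perturbed candidate actions
            s_rep = s2.repeat_interleave(N_ACT, 0)
            cand = vae.sample(s_rep)
            cand = (cand + perturb(xi_t, s_rep, cand)).clamp(-1, 1)
            qa = q1_t(torch.cat([s_rep, cand], 1))
            qb = q2_t(torch.cat([s_rep, cand], 1))
            # lambda-weighted clipped double-Q: penalizes future uncertainty
            qmix = LMBDA * torch.min(qa, qb) + (1 - LMBDA) * torch.max(qa, qb)
            y = r + GAMMA * (1 - done) * qmix.view(-1, N_ACT).max(1, keepdim=True)[0]

        sa = torch.cat([s, a], 1)        # theta <- argmin sum (y - Q_theta(s, a))^2
        q_loss = F.mse_loss(q1(sa), y) + F.mse_loss(q2(sa), y)
        q_opt.zero_grad(); q_loss.backward(); q_opt.step()

        a_gen = vae.sample(s).detach()   # phi <- argmax sum Q_theta1(s, a + xi_phi(s, a, Phi))
        xi_loss = -q1(torch.cat([s, (a_gen + perturb(xi, s, a_gen)).clamp(-1, 1)], 1)).mean()
        xi_opt.zero_grad(); xi_loss.backward(); xi_opt.step()

        for net, tgt in ((q1, q1_t), (q2, q2_t), (xi, xi_t)):   # soft target updates
            for p, pt in zip(net.parameters(), tgt.parameters()):
                pt.data.mul_(1 - TAU).add_(TAU * p.data)

In this sketch the λ-weighted minimum over the two target Q networks implements the pessimistic target y above, and the perturbation network ξ adjusts generated actions within [−Φ, Φ] exactly as described.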
The generation method of the G_ω model in step S1 comprises: performing basic data processing on the heating data collected through different channels, including data cleaning and data aggregation;
the data cleaning method comprises: removing abnormal values and mutation points from the data based on an Elliptic Envelope model, and filling missing data by linear interpolation;
the data aggregation method comprises: aligning the timestamps of heating data collected at different frequencies to form complete historical data.
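A possible realization of these cleaning and aggregation steps, assuming the collected series arrive as pandas DataFrames with a DatetimeIndex; the contamination rate and the 10-minute target grid are illustrative choices:

    import numpy as np
    import pandas as pd
    from sklearn.covariance import EllipticEnvelope

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        # EllipticEnvelope cannot handle NaNs, so fit it on a provisionally filled copy
        filled = df.interpolate(method="linear").ffill().bfill()
        outlier = EllipticEnvelope(contamination=0.01).fit_predict(filled.values) == -1
        cleaned = df.copy()
        cleaned[outlier] = np.nan                    # drop abnormal values / mutation points
        return cleaned.interpolate(method="linear")  # fill gaps by linear interpolation

    def aggregate(weather, station, indoor):
        # Align sources sampled at different frequencies (5 min / 10 min / 5 min)
        # on a common 10-minute grid to form complete historical records
        frames = [df.resample("10min").mean() for df in (weather, station, indoor)]
        return pd.concat(frames, axis=1).interpolate(method="linear")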
The heating data collection in step S1, as shown in FIG. 3, includes collecting weather data, central heating system operating conditions, and real-time data related to heat consumers.
The weather data collection method comprises collecting, through an API provided by a weather data service, city-level real-time weather and 24-hour weather forecast data for the heating area in real time at a frequency of 5 minutes.
The method for collecting central heating system operating conditions comprises: collecting operating-condition data in real time at a frequency of 10 minutes through pressure sensors, temperature sensors, flow meters and heat meters; the collected operating-condition data are transmitted to an intelligent gateway over a PLC protocol and uploaded by the gateway to a time-series database.
The method for collecting user-side data comprises: collecting users' indoor temperature and humidity data in real time at a frequency of 5 minutes through smart speakers, uploading them in real time to an IoT platform via Wi-Fi, and synchronizing them to the database in real time.
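For illustration, the 5-minute weather polling described above might look as follows; the endpoint URL and the write_point writer are placeholders, since the actual weather API and time-series database are not specified here:

    import time
    import requests

    WEATHER_API = "https://weather.example.com/api/v1/now"   # hypothetical endpoint

    def poll_weather(city, write_point):
        # Poll city-level real-time weather (and 24 h forecast) every 5 minutes
        while True:
            resp = requests.get(WEATHER_API, params={"city": city}, timeout=10)
            resp.raise_for_status()
            write_point(measurement="weather", fields=resp.json(), ts=time.time())
            time.sleep(300)                                  # 5-minute frequency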
The embodiment of the present invention further provides a heat supply control system based on offline reinforcement learning, which uses the heat supply control method based on offline reinforcement learning as described above and comprises:
a data acquisition module: used for acquiring heating data and inputting the heating data into the heating model;
a model generation module: used for sampling interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
a model deployment module: used for deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and for monitoring the effect of the G_ω model at regular intervals, updating the G_ω model when the effect improves and rolling back the G_ω model when the effect is poor.
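A minimal sketch of the scheduled prediction-and-rollback loop such a model deployment module could run; all callables (fetch_state, push_setpoints, evaluate_effect, load_previous_model) are hypothetical placeholders for the server's actual interfaces, and the 10-minute interval is an illustrative choice:

    import time

    def deployment_loop(model, fetch_state, push_setpoints,
                        evaluate_effect, load_previous_model, interval_s=600):
        best_score = float("-inf")
        while True:
            state = fetch_state()                      # latest weather + operating data
            t_primary, t_secondary = model.predict(state)   # supply water temperatures
            push_setpoints(t_primary, t_secondary)     # issue to the heat exchange station
            score = evaluate_effect()                  # periodic effect monitoring
            if score >= best_score:
                best_score = score                     # effect improved: keep current model
            else:
                model = load_previous_model()          # effect poor: roll the model back
            time.sleep(interval_s)                     # scheduled task period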
The model generation module comprises a G_ω model training submodule for training the G_ω model based on the BCQ (Batch-Constrained deep Q-learning) algorithm.
The performance of supervised learning algorithms in machine learning must be supported by large amounts of high-quality data, and for a control task their performance is usually inferior to that of reinforcement learning algorithms.
Training a reinforcement learning agent normally requires continuous interaction with a real environment to generate new data; in an actual heating scenario, however, a weak agent obviously cannot be allowed to sample in the environment. The present invention avoids the safety problems and high cost caused by interaction between the agent and the environment.
The embodiment of the invention is applied to a central heating system and is also a significant example of deploying reinforcement learning in an industrial scenario; it offers insights for future applications of reinforcement learning in central heating systems and in industry.
Fig. 2 is a schematic structural diagram of a computer device provided in an embodiment of the present invention; referring to fig. 2 of the drawings, the computer apparatus comprises: an input device 23, an output device 24, a memory 22 and a processor 21; the memory 22 for storing one or more programs; when the one or more programs are executed by the one or more processors 21, causing the one or more processors 21 to implement the heating control method as provided in the above embodiments; wherein the input device 23, the output device 24, the memory 22 and the processor 21 may be connected by a bus or other means, as exemplified by the bus connection in fig. 2.
The memory 22 is a readable and writable storage medium of a computing device, and can be used for storing a software program, a computer executable program, and program instructions corresponding to the heating control method according to the embodiment of the present invention; the memory 22 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like; further, the memory 22 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device; in some examples, the memory 22 may further include memory located remotely from the processor 21, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 23 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the apparatus; the output device 24 may include a display device such as a display screen.
The processor 21 executes various functional applications and data processing of the device by executing software programs, instructions and modules stored in the memory 22, so as to implement the above-mentioned heat supply control method.
The computer device provided above can be used to execute the heating control method based on offline reinforcement learning provided in the above embodiments, and has corresponding functions and advantages.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the offline reinforcement learning-based heating control method according to the above embodiments. The storage medium is any of various types of memory devices or storage devices, including: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory or magnetic media (e.g., hard disk or optical storage); and registers or other similar types of memory elements. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or in a different second computer system connected to the first computer system through a network (such as the Internet); the second computer system may provide program instructions to the first computer for execution. A storage medium may include two or more storage media residing in different locations, such as in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.
Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the heating control method based on offline reinforcement learning as described in the above embodiments, and may also perform related operations in the heating control method provided by any embodiment of the present invention.
Technical solutions of the present invention have been described with reference to preferred embodiments shown in the drawings, but it is apparent that the scope of the present invention is not limited to these specific embodiments, as will be readily understood by those skilled in the art. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A heat supply control method based on offline reinforcement learning, characterized by comprising the following steps:
S1, collecting heating data, inputting the heating data into a heating model, and setting a time horizon T, a target network update rate τ, a mini-batch size N, a maximum perturbation Φ, a number of sampled actions n, a minimum weighting λ, and random parameters θ_1, θ_2, φ, ω;
initializing two Q networks Q_θ(s, a): Q_{θ1}, Q_{θ2}; a perturbation model ξ_φ; target networks Q_{θ'1}, Q_{θ'2}; and a target perturbation model ξ_{φ'}; and generating a VAE normal-distribution model G_ω = {E_{ω1}, D_{ω2}},
wherein θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
S2, sampling a mini-batch of interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
based on the normal distribution N(μ, σ), letting μ, σ = E_{ω1}(s, a); ã = D_{ω2}(s, z), z ~ N(μ, σ);
ω ← argmin_ω Σ (a − ã)² + D_KL( N(μ, σ) ‖ N(0, 1) );
wherein s is a State, a is an Action taken, and s' is the next state after a is executed in s; τ is a parameter expressing the influence of the new value on the updated value, and r is the Reward obtained after taking action a in state s;
selecting from G_ω, according to the distribution in the data set, the actions with the highest similarity as candidates, wherein the number of sampled actions n represents the number of candidate actions;
sampling n actions:
{ a_i ~ G_ω(s') }_{i=1}^{n};
and perturbing each sampled action:
{ a_i ← a_i + ξ_φ(s', a_i, Φ) }_{i=1}^{n};
selecting, according to the Q networks, the highest-valued of these actions as the action actually taken;
setting the target y:
y = r + γ · max_{a_i} [ λ · min_{j=1,2} Q_{θ'_j}(s', a_i) + (1 − λ) · max_{j=1,2} Q_{θ'_j}(s', a_i) ];
wherein the λ parameter is used to control the degree to which future uncertainty is penalized;
θ ← argmin_θ Σ ( y − Q_θ(s, a) )²;
φ ← argmax_φ Σ Q_{θ1}(s, a + ξ_φ(s, a, Φ)), a ~ G_ω(s);
updating the target networks: θ'_i ← τ·θ_i + (1 − τ)·θ'_i; φ' ← τ·φ + (1 − τ)·φ';
looping, taking the minimum of the two Q networks in the target, until the time step reaches T;
S3, deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and monitoring the effect of the G_ω model at regular intervals, updating the G_ω model when the effect improves and rolling back the G_ω model when the effect is poor.
2. A heating control method according to claim 1, wherein the generation method of the G_ω model in step S1 comprises: performing basic data processing on heating data collected through different channels, including data cleaning and data aggregation;
the data cleaning method comprises: removing abnormal values and mutation points from the data based on an Elliptic Envelope model, and filling missing data by linear interpolation;
the data aggregation method comprises: aligning the timestamps of heating data collected at different frequencies to form complete historical data.
3. A heating control method according to claim 1, wherein the heating data collection of S1 includes collecting weather data, central heating system operating conditions, and real-time data related to heat consumers.
4. A heating control method according to claim 3, wherein the weather data collection method comprises collecting, through an API provided by a weather data service, city-level real-time weather and 24-hour weather forecast data for the heating area in real time at a frequency of 5 minutes.
5. A heating control method according to claim 3, wherein the method for collecting central heating system operating conditions comprises: collecting operating-condition data in real time at a frequency of 10 minutes through pressure sensors, temperature sensors, flow meters and heat meters, the collected operating-condition data being transmitted to an intelligent gateway over a PLC protocol and uploaded by the gateway to a time-series database.
6. A heating control method according to claim 3, wherein the method for collecting user-side data comprises: collecting users' indoor temperature and humidity data in real time at a frequency of 5 minutes through smart speakers, uploading them in real time to an IoT platform via Wi-Fi, and synchronizing them to the database in real time.
7. A heating control system based on offline reinforcement learning, characterized in that it uses the heating control method based on offline reinforcement learning according to any one of claims 1-6 and comprises:
a data acquisition module: for acquiring heating data and inputting the heating data into the heating model;
a model generation module: for sampling interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
a model deployment module: for deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and for monitoring the effect of the G_ω model at regular intervals, updating the G_ω model when the effect improves and rolling back the G_ω model when the effect is poor.
8. A heating control system according to claim 7, wherein the model generation module comprises a G_ω model training submodule for training the G_ω model based on the BCQ algorithm.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the offline reinforcement learning-based heating control method according to any one of claims 1 to 6.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the offline reinforcement learning-based heating control method according to any one of claims 1-6.
CN202210067515.8A 2022-01-20 2022-01-20 Heat supply control method and system based on offline reinforcement learning Active CN114484584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210067515.8A CN114484584B (en) 2022-01-20 2022-01-20 Heat supply control method and system based on offline reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210067515.8A CN114484584B (en) 2022-01-20 2022-01-20 Heat supply control method and system based on offline reinforcement learning

Publications (2)

Publication Number Publication Date
CN114484584A (en) 2022-05-13
CN114484584B (en) 2022-11-11

Family

ID=81472980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210067515.8A Active CN114484584B (en) 2022-01-20 2022-01-20 Heat supply control method and system based on offline reinforcement learning

Country Status (1)

Country Link
CN (1) CN114484584B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0249531A1 (en) * 1986-06-06 1987-12-16 Alcatel Method and apparatus for controlling a central-heating system
CN103591637A (en) * 2013-11-19 2014-02-19 长春工业大学 Centralized heating secondary network operation adjustment method
US20150134124A1 (en) * 2012-05-15 2015-05-14 Passivsystems Limited Predictive temperature management system controller
CN108613332A (en) * 2018-04-12 2018-10-02 南京信息工程大学 A kind of energy-saving building film micro area personnel interactive mode hot comfort adjusting method
CN111561732A (en) * 2020-05-18 2020-08-21 瑞纳智能设备股份有限公司 Heat exchange station heat supply adjusting method and system based on artificial intelligence
CN111652371A (en) * 2020-05-29 2020-09-11 京东城市(北京)数字科技有限公司 Offline reinforcement learning network training method, device, system and storage medium
CN112268312A (en) * 2020-10-23 2021-01-26 哈尔滨派立仪器仪表有限公司 Intelligent heat supply management system based on deep learning
CN113606649A (en) * 2021-07-23 2021-11-05 淄博热力有限公司 Intelligent heat supply station control prediction system based on machine learning algorithm


Also Published As

Publication number Publication date
CN114484584B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
Lissa et al. Deep reinforcement learning for home energy management system control
Zhou et al. Combined heat and power system intelligent economic dispatch: A deep reinforcement learning approach
CN109253494B (en) Control method of electric heat storage device based on heat load prediction
CN109270842B (en) Bayesian network-based regional heat supply model prediction control system and method
Claessens et al. Model-free control of thermostatically controlled loads connected to a district heating network
JP2023129546A (en) System and method for optimal control of energy storage system
CN112614009B (en) Power grid energy management method and system based on deep expectation Q-learning
WO2021062748A1 (en) Optimization method and apparatus for integrated energy system and computer readable storage medium
CN114498641A (en) Distributed flexible resource aggregation control device and control method
Wojdyga Predicting heat demand for a district heating systems
CN112413831A (en) Energy-saving control system and method for central air conditioner
CN114503120A (en) Simulation method and device of integrated energy system and computer readable storage medium
Chitsazan et al. Wind speed forecasting using an echo state network with nonlinear output functions
Yahya et al. Short-term electric load forecasting using recurrent neural network (study case of load forecasting in central java and special region of yogyakarta)
Fusco et al. Knowledge-and data-driven services for energy systems using graph neural networks
CN114707737A (en) System and method for predicting power consumption based on edge calculation
Yu et al. Short-term cooling and heating loads forecasting of building district energy system based on data-driven models
Ruelens et al. Residential demand response applications using batch reinforcement learning
CN114484584B (en) Heat supply control method and system based on offline reinforcement learning
Liao et al. MEMS: An automated multi-energy management system for smart residences using the DD-LSTM approach
Zhang et al. Flexible selection framework for secondary frequency regulation units based on learning optimisation method
CN115169839A (en) Heating load scheduling method based on data-physics-knowledge combined drive
Panahazari et al. A hybrid optimization and deep learning algorithm for cyber-resilient der control
CN110826776B (en) Initial solution optimization method based on dynamic programming in distribution network line transformation relation identification
Zandi et al. An automatic learning framework for smart residential communities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant