CN114484584A - Heat supply control method and system based on offline reinforcement learning - Google Patents

Heat supply control method and system based on offline reinforcement learning

Info

Publication number
CN114484584A
CN114484584A (application CN202210067515.8A); granted publication CN114484584B
Authority
CN
China
Prior art keywords
data
model
steps
heat supply
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210067515.8A
Other languages
Chinese (zh)
Other versions
CN114484584B (en)
Inventor
马志军
胡继新
梁炜
何子峰
张康
成甜甜
曹玉玺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Power Investment Group Xiongan Energy Co ltd
Guodian Investment Fenghe New Energy Technology Hebei Co ltd
Original Assignee
State Power Investment Group Xiongan Energy Co ltd
Guodian Investment Fenghe New Energy Technology Hebei Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Power Investment Group Xiongan Energy Co ltd, Guodian Investment Fenghe New Energy Technology Hebei Co ltd filed Critical State Power Investment Group Xiongan Energy Co ltd
Priority to CN202210067515.8A priority Critical patent/CN114484584B/en
Publication of CN114484584A publication Critical patent/CN114484584A/en
Application granted granted Critical
Publication of CN114484584B publication Critical patent/CN114484584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • F MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING
    • F24 HEATING; RANGES; VENTILATING
    • F24D DOMESTIC- OR SPACE-HEATING SYSTEMS, e.g. CENTRAL HEATING SYSTEMS; DOMESTIC HOT-WATER SUPPLY SYSTEMS; ELEMENTS OR COMPONENTS THEREFOR
    • F24D19/00 Details
    • F24D19/10 Arrangement or mounting of control or safety devices
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention provides a heat supply control method and system based on offline reinforcement learning, wherein the method comprises the following steps: collecting heating data and inputting the heating data set into a heating model; sampling interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model; deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and monitoring the effect of the G_ω model. The invention applies an advanced offline reinforcement learning algorithm to a central heating control system, fully exploiting the advantages of the reinforcement learning algorithm without interacting with the real environment and avoiding the inefficient sampling and high cost of environment interaction; historical interaction data are fully utilized, and compared with the prior art the performance of the control algorithm is greatly improved both in theory and in practice.

Description

Heat supply control method and system based on offline reinforcement learning
Technical Field
The invention relates to the technical field of heating system control, in particular to a heating control method and system based on offline reinforcement learning.
Background
Intelligent control of central heating systems has a great influence on improving residents' quality of life and on urban construction in China, and is a technology currently receiving much attention. A central heating system mainly comprises three parts: a heat source, heat exchange stations, and users. The heat sources of current central heating systems are mainly thermal power plants, regional boiler rooms and centralized boiler rooms; steam or hot water generated by the heat source is sent to the heat exchange stations through a primary pipe network, and the heat exchange stations transfer the heat of the primary-network steam or hot water to user terminals through a secondary pipe network.
In the past, traditional optimization control methods were used to control central heating systems, that is, regulation was performed by algorithms driven by models constructed from physical mechanisms. The drawbacks of these methods are obvious: once the operating conditions change, the adjustment capability of the algorithm is very limited, and modeling has to be carried out anew.
With the rapid development of the intelligent cloud technology and the AI algorithm, the application of the novel artificial intelligence algorithm in the central heating system is gradually deepened, and the advantages of the novel artificial intelligence algorithm are gradually highlighted. Compared with the traditional control algorithm, the intelligent control algorithm based on data driving has the advantages of strong robustness, high response speed and the like. At present, the intelligent control algorithm based on data driving mainly comprises the following two types:
1. Supervised learning: the model is trained end-to-end on historical data, so its performance depends heavily on the quality and quantity of the data; considering the data quality in actual scenarios, its generalization performance is poor.
2. Reinforcement learning: interaction with the environment is required. Given a State of the environment, the program selects a corresponding Action according to some Policy; after the action is executed the environment changes, i.e. the state transitions to a new state s', and the program obtains a Reward. The program adjusts its policy according to the rewards obtained, so that by the time all steps have been executed, i.e. when a Terminal state is reached, the cumulative reward is maximized. Reinforcement learning continuously strengthens the decision-making level of the agent, but since actual scenarios usually give the agent no opportunity for continual trial and error, safety and cost cannot be guaranteed; and without interaction with the environment, a large extrapolation error may result.
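In standard notation (a textbook formulation, not specific to this invention), the program therefore seeks a policy π that maximizes the expected cumulative reward

    J(π) = E_π [ Σ_{t=0}^{T} γ^t · r(s_t, a_t) ],   γ ∈ [0, 1],

where γ is the discount factor that appears again in the target equation of the method below.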
The two methods have their respective advantages and limitations. The idea of supervised learning is a purely end-to-end form, and if the training data are insufficient, the generalization error of the control algorithm is very large. General reinforcement learning methods can perform well in a control task such as central heating, but require interaction with the environment as the basis for improving model performance.
Disclosure of Invention
In view of this, the present invention provides a novel algorithm that meets users' heat demand more stably and, on that basis, effectively reduces the operating losses of the heating system and lowers the heating cost; it provides an offline reinforcement learning method that needs no interaction with the environment and can fully exploit the advantages of reinforcement learning. Offline reinforcement learning is a current hotspot in academia and industry and an important form for deploying reinforcement learning in heating scenarios; it can lower the threshold for applying reinforcement learning to heat supply and facilitates the intelligent and digital transformation of the heating industry. The goal of central heating control is to provide users with a comfortable indoor environment; technically, relevant parameters are adjusted through a control algorithm to meet the heat demand of heating users.
The invention provides a heat supply control method based on offline reinforcement learning, which comprises the following steps:
S1, collecting heating data, inputting the heating data into the heating model, and setting the time horizon T, the target network update rate τ, the mini-batch size N, the maximum perturbation Φ, the number of sampled actions n, the minimum weighting λ, and random initial parameters θ_1, θ_2, φ, ω;
initializing two Q networks Q_θ(s, a): Q_{θ1}, Q_{θ2}; a perturbation model ξ_φ; and target networks Q_{θ'1}, Q_{θ'2}, where two target networks are used to prevent over-estimation of the Q value, together with a target perturbation model ξ_{φ'}, the purpose of the perturbation network being to provide action diversity, so that actions within [−Φ, Φ] can be sampled rather than relying solely on the generator; and generating the VAE normal-distribution model G_ω = {E_{ω1}, D_{ω2}},
where θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
the parameter Φ is used to adjust actions within the range [−Φ, Φ], which allows the algorithm to access actions in the constrained region without sampling many times from the generative model G_ω;
S2, sampling a mini-batch of interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
based on the normal distribution N(μ, σ), let μ, σ = E_{ω1}(s, a); ã = D_{ω2}(s, z), z ~ N(μ, σ);
ω ← argmin_ω Σ (a − ã)² + D_KL( N(μ, σ) ‖ N(0, 1) );
where s is the State, a is the Action taken, and s' is the next state after a is executed in s; τ is the parameter expressing the influence of the new value on the updated value, and r is the Reward obtained after taking action a in state s;
selecting from G_ω, according to the distribution in the data set, the actions with the highest similarity as candidates, where the number of sampled actions n represents the number of candidate actions;
sampling n actions:
{ a_i ~ G_ω(s') }_{i=1}^{n};
and perturbing each sampled action:
{ a_i ← a_i + ξ_φ(s', a_i, Φ) }_{i=1}^{n},
to enhance the diversity of actions;
selecting, according to the Q networks, the highest-valued of these actions as the action actually taken;
setting the target y:
y = r + γ · max_{a_i} [ λ · min_{j=1,2} Q_{θ'_j}(s', a_i) + (1 − λ) · max_{j=1,2} Q_{θ'_j}(s', a_i) ];
where the λ parameter controls the degree to which future uncertainty is penalized, and γ is the discount factor used to reduce the influence of future values; both d (the terminal flag) and γ lie in the range 0–1;
θ ← argmin_θ Σ ( y − Q_θ(s, a) )²;
φ ← argmax_φ Σ Q_{θ1}(s, a + ξ_φ(s, a, Φ)), a ~ G_ω(s);
updating the target networks: θ'_i ← τ·θ_i + (1 − τ)·θ'_i; φ' ← τ·φ + (1 − τ)·φ';
looping, taking the minimum of the two Q networks in the target, until the time step reaches T;
S3, deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; monitoring the effect of the G_ω model at regular intervals and, according to the observed effect, updating the G_ω model with the newly trained version when the effect improves and rolling back the G_ω model when the effect is poor;
after deployment, the G_ω model further accumulates real-time expert data; the process then returns to the data-collection step and iterates continuously, improving the operating efficiency of the central heating system and saving energy while ensuring sufficient heat supply for residents.
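As a worked illustration of the target y in step S2 (all numbers chosen arbitrarily): with r = 1, γ = 0.99, λ = 0.75 and a single candidate action whose two target-network estimates are Q_{θ'1}(s', a_1) = 10 and Q_{θ'2}(s', a_1) = 12,

    y = 1 + 0.99 · (0.75 · min(10, 12) + 0.25 · max(10, 12)) = 1 + 0.99 · 10.5 ≈ 11.4,

so the pessimistic minimum dominates the target and over-estimated Q values are damped.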
Further, the generation method of the G_ω model in step S1 comprises: performing basic data processing on the heating data collected through different channels, including data cleaning and data aggregation;
the data cleaning method comprises: removing abnormal values and mutation points from the data based on an Elliptic Envelope model, and filling missing data by linear interpolation;
the data aggregation method comprises: aligning the timestamps of heating data collected at different frequencies to form complete historical data.
Further, the heating data collection in step S1 includes collecting weather data, central heating system operating conditions, and real-time data related to heat consumers.
Further, the weather data collection method comprises collecting, through an API provided by a weather data service, city-level real-time weather and 24-hour weather forecast data for the heating area in real time at a frequency of 5 minutes.
Further, the method for collecting central heating system operating conditions comprises: collecting operating-condition data in real time at a frequency of 10 minutes through pressure sensors, temperature sensors, flow meters and heat meters; the collected operating-condition data are transmitted to an intelligent gateway over a PLC protocol and uploaded by the gateway to a time-series database.
Further, the method for collecting user-side data comprises: collecting users' indoor temperature and humidity data in real time at a frequency of 5 minutes through smart speakers, uploading them in real time to an IoT platform via Wi-Fi, and synchronizing them to the database in real time.
The invention also provides a heat supply control system based on offline reinforcement learning, which uses the heat supply control method based on offline reinforcement learning described above and comprises:
a data acquisition module: used for acquiring heating data and inputting the heating data into the heating model;
a model generation module: used for sampling interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
a model deployment module: used for deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and for monitoring the effect of the G_ω model at regular intervals, updating the G_ω model when the effect improves and rolling back the G_ω model when the effect is poor.
Further, the model generation module includes a G_ω model training submodule for training the G_ω model based on the BCQ (Batch-Constrained deep Q-learning) algorithm.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the off-line reinforcement learning-based heating control method described above.
The invention also provides a computer device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the heating control method based on offline reinforcement learning.
Compared with the prior art, the invention has the following beneficial effects:
the invention applies an advanced offline reinforcement learning algorithm to a central heating control system, fully exploiting the advantages of the reinforcement learning algorithm without interacting with the real environment and avoiding the inefficient sampling and high cost of environment interaction; in addition, the invention makes full use of historical interaction data and, compared with the prior art, greatly improves the performance of the control algorithm both in theory and in practice.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
In the drawings:
FIG. 1 is a flow chart of a heat supply control method based on offline reinforcement learning according to the present invention;
FIG. 2 is a block diagram of a computer device according to an embodiment of the present invention;
FIG. 3 is a flowchart of a model training deployment according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and products consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
At present, intelligent control of a central heating system is mainly realized by designing machine-learning-based intelligent control algorithms on historical interaction data (also called expert data) from past years, specifically by the following two means:
(1) Supervised learning: strategies in the historical data are learned end to end from historical interaction data; the inputs are various numerical features under actual operating conditions, and the label is the regulation value, namely the secondary-network supply water temperature. In short, this method establishes a direct mapping between actual operating-condition data and the action taken, thereby learning the strategy (see the sketch after this list).
(2) Reinforcement learning: reinforcement learning can be subdivided into three technical approaches: first, directly building an agent-learning algorithm that interacts continuously with the real environment, so that the performance of the strategy is continuously reinforced; second, learning an environment model from historical interaction data (expert data), i.e. simulating how the environment changes after an action is taken, and then training the reinforcement learning algorithm in the simulated environment; and third, using a constructed mechanism model as the reinforcement learning environment for strategy learning by the agent.
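As a concrete illustration of means (1), a minimal supervised baseline might regress the secondary-network supply temperature directly on operating-condition features; all column names below are illustrative assumptions, not fields defined by the invention:

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    def fit_supervised_baseline(history: pd.DataFrame) -> GradientBoostingRegressor:
        features = ["outdoor_temp", "primary_supply_temp", "primary_return_temp",
                    "flow_rate", "indoor_temp"]                 # assumed feature columns
        X, y = history[features], history["secondary_supply_temp"]  # label = regulation value
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)  # time-ordered split
        model = GradientBoostingRegressor().fit(X_tr, y_tr)    # direct mapping state -> action
        print("hold-out R^2:", model.score(X_te, y_te))
        return model

Such a baseline performs exactly the direct state-to-action mapping described above, which is why its generalization is bounded by the quality and coverage of the historical data.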
The embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention provides a heat supply control method based on offline reinforcement learning which, as shown in FIG. 1, comprises the following steps:
S1, collecting heating data, inputting the heating data into the heating model, and setting the time horizon T, the target network update rate τ, the mini-batch size N, the maximum perturbation Φ, the number of sampled actions n, the minimum weighting λ, and random initial parameters θ_1, θ_2, φ, ω;
initializing two Q networks Q_θ(s, a): Q_{θ1}, Q_{θ2}; a perturbation model ξ_φ; and target networks Q_{θ'1}, Q_{θ'2}, where two target networks are used to prevent over-estimation of the Q value, together with a target perturbation model ξ_{φ'}, the purpose of the perturbation network being to provide action diversity, so that actions within [−Φ, Φ] can be sampled rather than relying solely on the generator; and generating the VAE normal-distribution model G_ω = {E_{ω1}, D_{ω2}},
where θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
the parameter Φ is used to adjust actions within the range [−Φ, Φ], which allows the algorithm to access actions in the constrained region without sampling many times from the generative model G_ω;
S2, sampling a mini-batch of interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
based on the normal distribution N(μ, σ), let μ, σ = E_{ω1}(s, a); ã = D_{ω2}(s, z), z ~ N(μ, σ);
ω ← argmin_ω Σ (a − ã)² + D_KL( N(μ, σ) ‖ N(0, 1) );
where s is the State, a is the Action taken, and s' is the next state after a is executed in s; τ is the parameter expressing the influence of the new value on the updated value, and r is the Reward obtained after taking action a in state s;
selecting from G_ω, according to the distribution in the data set, the actions with the highest similarity as candidates, where the number of sampled actions n represents the number of candidate actions;
sampling n actions:
{ a_i ~ G_ω(s') }_{i=1}^{n};
and perturbing each sampled action:
{ a_i ← a_i + ξ_φ(s', a_i, Φ) }_{i=1}^{n},
to enhance the diversity of actions;
selecting, according to the Q networks, the highest-valued of these actions as the action actually taken;
setting the target y:
y = r + γ · max_{a_i} [ λ · min_{j=1,2} Q_{θ'_j}(s', a_i) + (1 − λ) · max_{j=1,2} Q_{θ'_j}(s', a_i) ];
where the λ parameter controls the degree to which future uncertainty is penalized, and γ is the discount factor used to reduce the influence of future values; both d (the terminal flag) and γ lie in the range 0–1;
θ ← argmin_θ Σ ( y − Q_θ(s, a) )²;
φ ← argmax_φ Σ Q_{θ1}(s, a + ξ_φ(s, a, Φ)), a ~ G_ω(s);
updating the target networks: θ'_i ← τ·θ_i + (1 − τ)·θ'_i; φ' ← τ·φ + (1 − τ)·φ';
looping, taking the minimum of the two Q networks in the target, until the time step reaches T;
S3, deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; monitoring the effect of the G_ω model at regular intervals and, according to the observed effect, updating the G_ω model with the newly trained version when the effect improves and rolling back the G_ω model when the effect is poor;
after deployment, the G_ω model further accumulates real-time expert data; the process then returns to the data-collection step and iterates continuously, improving the operating efficiency of the central heating system and saving energy while ensuring sufficient heat supply for residents.
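To make the update equations of steps S1-S2 concrete, the following is a minimal PyTorch sketch of one BCQ-style training step. The network sizes, learning rates and hyper-parameter values at the top are illustrative assumptions rather than values specified by this embodiment, and actions are assumed normalized to [−1, 1]:

    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    S_DIM, A_DIM, Z_DIM = 8, 1, 4        # assumed state/action/latent sizes
    N_ACT, PHI, LMBDA, GAMMA, TAU = 10, 0.05, 0.75, 0.99, 0.005

    class VAE(nn.Module):                # G_w = {E_w1, D_w2}
        def __init__(self):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU())
            self.mu = nn.Linear(64, Z_DIM)
            self.log_std = nn.Linear(64, Z_DIM)
            self.dec = nn.Sequential(nn.Linear(S_DIM + Z_DIM, 64), nn.ReLU(),
                                     nn.Linear(64, A_DIM), nn.Tanh())
        def forward(self, s, a):
            h = self.enc(torch.cat([s, a], 1))
            mu, std = self.mu(h), self.log_std(h).clamp(-4, 4).exp()
            z = mu + std * torch.randn_like(std)          # z ~ N(mu, sigma)
            return self.dec(torch.cat([s, z], 1)), mu, std
        def sample(self, s):                              # a ~ G_w(s)
            z = torch.randn(s.shape[0], Z_DIM).clamp(-0.5, 0.5)
            return self.dec(torch.cat([s, z], 1))

    def mlp(i, o):
        return nn.Sequential(nn.Linear(i, 64), nn.ReLU(), nn.Linear(64, o))

    vae = VAE()
    q1, q2 = mlp(S_DIM + A_DIM, 1), mlp(S_DIM + A_DIM, 1)   # Q_theta1, Q_theta2
    xi = mlp(S_DIM + A_DIM, A_DIM)                          # perturbation model xi_phi
    q1_t, q2_t, xi_t = copy.deepcopy(q1), copy.deepcopy(q2), copy.deepcopy(xi)
    vae_opt = torch.optim.Adam(vae.parameters(), 1e-3)
    q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), 1e-3)
    xi_opt = torch.optim.Adam(xi.parameters(), 1e-3)

    def perturb(net, s, a):              # xi_phi(s, a, Phi), bounded to [-Phi, Phi]
        return PHI * torch.tanh(net(torch.cat([s, a], 1)))

    def train_step(s, a, r, s2, done):   # one mini-batch (s, a, r, s') from the data set
        # VAE: w <- argmin sum (a - a~)^2 + KL(N(mu, sigma) || N(0, 1))
        a_rec, mu, std = vae(s, a)
        kl = -0.5 * (1 + 2 * std.log() - mu.pow(2) - std.pow(2)).mean()
        vae_opt.zero_grad(); (F.mse_loss(a_rec, a) + 0.5 * kl).backward(); vae_opt.step()

        with torch.no_grad():            # target y over n perturbed candidate actions
            s_rep = s2.repeat_interleave(N_ACT, 0)
            cand = vae.sample(s_rep)
            cand = (cand + perturb(xi_t, s_rep, cand)).clamp(-1, 1)
            qa = q1_t(torch.cat([s_rep, cand], 1))
            qb = q2_t(torch.cat([s_rep, cand], 1))
            # lambda-weighted clipped double-Q: penalizes future uncertainty
            qmix = LMBDA * torch.min(qa, qb) + (1 - LMBDA) * torch.max(qa, qb)
            y = r + GAMMA * (1 - done) * qmix.view(-1, N_ACT).max(1, keepdim=True)[0]

        sa = torch.cat([s, a], 1)        # theta <- argmin sum (y - Q_theta(s, a))^2
        q_loss = F.mse_loss(q1(sa), y) + F.mse_loss(q2(sa), y)
        q_opt.zero_grad(); q_loss.backward(); q_opt.step()

        a_gen = vae.sample(s).detach()   # phi <- argmax sum Q_theta1(s, a + xi_phi(s, a, Phi))
        xi_loss = -q1(torch.cat([s, (a_gen + perturb(xi, s, a_gen)).clamp(-1, 1)], 1)).mean()
        xi_opt.zero_grad(); xi_loss.backward(); xi_opt.step()

        for net, tgt in ((q1, q1_t), (q2, q2_t), (xi, xi_t)):   # soft target updates
            for p, pt in zip(net.parameters(), tgt.parameters()):
                pt.data.mul_(1 - TAU).add_(TAU * p.data)

In this sketch the λ-weighted minimum over the two target Q networks implements the pessimistic target y above, and the perturbation network ξ adjusts generated actions within [−Φ, Φ] exactly as described.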
The generation method of the G_ω model in step S1 comprises: performing basic data processing on the heating data collected through different channels, including data cleaning and data aggregation;
the data cleaning method comprises: removing abnormal values and mutation points from the data based on an Elliptic Envelope model, and filling missing data by linear interpolation;
the data aggregation method comprises: aligning the timestamps of heating data collected at different frequencies to form complete historical data.
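A possible realization of these cleaning and aggregation steps, assuming the collected series arrive as pandas DataFrames with a DatetimeIndex; the contamination rate and the 10-minute target grid are illustrative choices:

    import numpy as np
    import pandas as pd
    from sklearn.covariance import EllipticEnvelope

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        # EllipticEnvelope cannot handle NaNs, so fit it on a provisionally filled copy
        filled = df.interpolate(method="linear").ffill().bfill()
        outlier = EllipticEnvelope(contamination=0.01).fit_predict(filled.values) == -1
        cleaned = df.copy()
        cleaned[outlier] = np.nan                    # drop abnormal values / mutation points
        return cleaned.interpolate(method="linear")  # fill gaps by linear interpolation

    def aggregate(weather, station, indoor):
        # Align sources sampled at different frequencies (5 min / 10 min / 5 min)
        # on a common 10-minute grid to form complete historical records
        frames = [df.resample("10min").mean() for df in (weather, station, indoor)]
        return pd.concat(frames, axis=1).interpolate(method="linear")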
The heating data collection in step S1, as shown in FIG. 3, includes collecting weather data, central heating system operating conditions, and real-time data related to heat consumers.
The weather data collection method comprises collecting, through an API provided by a weather data service, city-level real-time weather and 24-hour weather forecast data for the heating area in real time at a frequency of 5 minutes.
The method for collecting central heating system operating conditions comprises: collecting operating-condition data in real time at a frequency of 10 minutes through pressure sensors, temperature sensors, flow meters and heat meters; the collected operating-condition data are transmitted to an intelligent gateway over a PLC protocol and uploaded by the gateway to a time-series database.
The method for collecting user-side data comprises: collecting users' indoor temperature and humidity data in real time at a frequency of 5 minutes through smart speakers, uploading them in real time to an IoT platform via Wi-Fi, and synchronizing them to the database in real time.
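For illustration, the 5-minute weather polling described above might look as follows; the endpoint URL and the write_point writer are placeholders, since the actual weather API and time-series database are not specified here:

    import time
    import requests

    WEATHER_API = "https://weather.example.com/api/v1/now"   # hypothetical endpoint

    def poll_weather(city, write_point):
        # Poll city-level real-time weather (and 24 h forecast) every 5 minutes
        while True:
            resp = requests.get(WEATHER_API, params={"city": city}, timeout=10)
            resp.raise_for_status()
            write_point(measurement="weather", fields=resp.json(), ts=time.time())
            time.sleep(300)                                  # 5-minute frequency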
The embodiment of the present invention further provides a heat supply control system based on offline reinforcement learning, which uses the heat supply control method based on offline reinforcement learning as described above and comprises:
a data acquisition module: used for acquiring heating data and inputting the heating data into the heating model;
a model generation module: used for sampling interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
a model deployment module: used for deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and for monitoring the effect of the G_ω model at regular intervals, updating the G_ω model when the effect improves and rolling back the G_ω model when the effect is poor.
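A minimal sketch of the scheduled prediction-and-rollback loop such a model deployment module could run; all callables (fetch_state, push_setpoints, evaluate_effect, load_previous_model) are hypothetical placeholders for the server's actual interfaces, and the 10-minute interval is an illustrative choice:

    import time

    def deployment_loop(model, fetch_state, push_setpoints,
                        evaluate_effect, load_previous_model, interval_s=600):
        best_score = float("-inf")
        while True:
            state = fetch_state()                      # latest weather + operating data
            t_primary, t_secondary = model.predict(state)   # supply water temperatures
            push_setpoints(t_primary, t_secondary)     # issue to the heat exchange station
            score = evaluate_effect()                  # periodic effect monitoring
            if score >= best_score:
                best_score = score                     # effect improved: keep current model
            else:
                model = load_previous_model()          # effect poor: roll the model back
            time.sleep(interval_s)                     # scheduled task period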
The model generation module comprises a G_ω model training submodule for training the G_ω model based on the BCQ (Batch-Constrained deep Q-learning) algorithm.
The performance of supervised learning algorithms in machine learning must be supported by large amounts of high-quality data, and for a control task their performance is usually inferior to that of reinforcement learning algorithms.
Training a reinforcement learning agent normally requires continuous interaction with a real environment to generate new data; in an actual heating scenario, however, a weak agent obviously cannot be allowed to sample in the environment. The present invention avoids the safety problems and high cost caused by interaction between the agent and the environment.
The embodiment of the invention is applied to a central heating system and is also a significant example of deploying reinforcement learning in an industrial scenario; it offers insights for future applications of reinforcement learning in central heating systems and in industry.
Fig. 2 is a schematic structural diagram of a computer device provided in an embodiment of the present invention; referring to fig. 2 of the drawings, the computer apparatus comprises: an input device 23, an output device 24, a memory 22 and a processor 21; the memory 22 for storing one or more programs; when the one or more programs are executed by the one or more processors 21, causing the one or more processors 21 to implement the heating control method as provided in the above embodiments; wherein the input device 23, the output device 24, the memory 22 and the processor 21 may be connected by a bus or other means, as exemplified by the bus connection in fig. 2.
The memory 22 is a readable and writable storage medium of a computing device, and can be used for storing a software program, a computer executable program, and program instructions corresponding to the heating control method according to the embodiment of the present invention; the memory 22 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like; further, the memory 22 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device; in some examples, the memory 22 may further include memory located remotely from the processor 21, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 23 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the apparatus; the output device 24 may include a display device such as a display screen.
The processor 21 executes various functional applications and data processing of the device by executing software programs, instructions and modules stored in the memory 22, so as to implement the above-mentioned heat supply control method.
The computer device provided above can be used to execute the heating control method based on offline reinforcement learning provided in the above embodiments, and has corresponding functions and advantages.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the offline reinforcement learning-based heating control method according to the above embodiments. The storage medium is any of various types of memory devices or storage devices, including: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory or magnetic media (e.g., hard disk or optical storage); and registers or other similar types of memory elements. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or in a different second computer system connected to the first computer system through a network (such as the Internet); the second computer system may provide program instructions to the first computer for execution. A storage medium may include two or more storage media residing in different locations, such as in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.
Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the heating control method based on offline reinforcement learning as described in the above embodiments, and may also perform related operations in the heating control method provided by any embodiment of the present invention.
Technical solutions of the present invention have been described with reference to preferred embodiments shown in the drawings, but it is apparent that the scope of the present invention is not limited to these specific embodiments, as will be readily understood by those skilled in the art. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A heat supply control method based on offline reinforcement learning, characterized by comprising the following steps:
S1, collecting heating data, inputting the heating data into a heating model, and setting a time horizon T, a target network update rate τ, a mini-batch size N, a maximum perturbation Φ, a number of sampled actions n, a minimum weighting λ, and random parameters θ_1, θ_2, φ, ω;
initializing two Q networks Q_θ(s, a): Q_{θ1}, Q_{θ2}; a perturbation model ξ_φ; target networks Q_{θ'1}, Q_{θ'2}; and a target perturbation model ξ_{φ'}; and generating a VAE normal-distribution model G_ω = {E_{ω1}, D_{ω2}},
wherein θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
S2, sampling a mini-batch of interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
based on the normal distribution N(μ, σ), letting μ, σ = E_{ω1}(s, a); ã = D_{ω2}(s, z), z ~ N(μ, σ);
ω ← argmin_ω Σ (a − ã)² + D_KL( N(μ, σ) ‖ N(0, 1) );
wherein s is a State, a is an Action taken, and s' is the next state after a is executed in s; τ is a parameter expressing the influence of the new value on the updated value, and r is the Reward obtained after taking action a in state s;
selecting from G_ω, according to the distribution in the data set, the actions with the highest similarity as candidates, wherein the number of sampled actions n represents the number of candidate actions;
sampling n actions:
{ a_i ~ G_ω(s') }_{i=1}^{n};
and perturbing each sampled action:
{ a_i ← a_i + ξ_φ(s', a_i, Φ) }_{i=1}^{n};
selecting, according to the Q networks, the highest-valued of these actions as the action actually taken;
setting the target y:
y = r + γ · max_{a_i} [ λ · min_{j=1,2} Q_{θ'_j}(s', a_i) + (1 − λ) · max_{j=1,2} Q_{θ'_j}(s', a_i) ];
wherein the λ parameter is used to control the degree to which future uncertainty is penalized;
θ ← argmin_θ Σ ( y − Q_θ(s, a) )²;
φ ← argmax_φ Σ Q_{θ1}(s, a + ξ_φ(s, a, Φ)), a ~ G_ω(s);
updating the target networks: θ'_i ← τ·θ_i + (1 − τ)·θ'_i; φ' ← τ·φ + (1 − τ)·φ';
looping, taking the minimum of the two Q networks in the target, until the time step reaches T;
S3, deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and monitoring the effect of the G_ω model at regular intervals, updating the G_ω model when the effect improves and rolling back the G_ω model when the effect is poor.
2. A heating control method according to claim 1, wherein the generation method of the G_ω model in step S1 comprises: performing basic data processing on heating data collected through different channels, including data cleaning and data aggregation;
the data cleaning method comprises: removing abnormal values and mutation points from the data based on an Elliptic Envelope model, and filling missing data by linear interpolation;
the data aggregation method comprises: aligning the timestamps of heating data collected at different frequencies to form complete historical data.
3. A heating control method according to claim 1, wherein the heating data collection of S1 includes collecting weather data, central heating system operating conditions, and real-time data related to heat consumers.
4. A heating control method according to claim 3, wherein the weather data collection method comprises collecting, through an API provided by a weather data service, city-level real-time weather and 24-hour weather forecast data for the heating area in real time at a frequency of 5 minutes.
5. A heating control method according to claim 3, wherein the method for collecting central heating system operating conditions comprises: collecting operating-condition data in real time at a frequency of 10 minutes through pressure sensors, temperature sensors, flow meters and heat meters, the collected operating-condition data being transmitted to an intelligent gateway over a PLC protocol and uploaded by the gateway to a time-series database.
6. A heating control method according to claim 3, wherein the method for collecting user-side data comprises: collecting users' indoor temperature and humidity data in real time at a frequency of 5 minutes through smart speakers, uploading them in real time to an IoT platform via Wi-Fi, and synchronizing them to the database in real time.
7. A heating control system based on offline reinforcement learning, characterized in that it uses the heating control method based on offline reinforcement learning according to any one of claims 1-6 and comprises:
a data acquisition module: for acquiring heating data and inputting the heating data into the heating model;
a model generation module: for sampling interaction data from the heating data set 𝓑 as quadruples (s, a, r, s'), looping over time steps t = 1 to T, and training the G_ω model;
a model deployment module: for deploying the trained G_ω model to a server, predicting the primary-network and secondary-network supply water temperatures through a scheduled task, and issuing the prediction results to the heat exchange stations; and for monitoring the effect of the G_ω model at regular intervals, updating the G_ω model when the effect improves and rolling back the G_ω model when the effect is poor.
8. A heating control system according to claim 7, wherein the model generation module comprises a G_ω model training submodule for training the G_ω model based on the BCQ algorithm.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the offline reinforcement learning-based heating control method according to any one of claims 1 to 6.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the offline reinforcement learning-based heating control method according to any one of claims 1-6.
CN202210067515.8A 2022-01-20 2022-01-20 Heat supply control method and system based on offline reinforcement learning Active CN114484584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210067515.8A CN114484584B (en) 2022-01-20 2022-01-20 Heat supply control method and system based on offline reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210067515.8A CN114484584B (en) 2022-01-20 2022-01-20 Heat supply control method and system based on offline reinforcement learning

Publications (2)

Publication Number Publication Date
CN114484584A (en) 2022-05-13
CN114484584B (en) 2022-11-11

Family

ID=81472980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210067515.8A Active CN114484584B (en) 2022-01-20 2022-01-20 Heat supply control method and system based on offline reinforcement learning

Country Status (1)

Country Link
CN (1) CN114484584B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0249531A1 (en) * 1986-06-06 1987-12-16 Alcatel Method and apparatus for controlling a central-heating system
CN103591637A (en) * 2013-11-19 2014-02-19 长春工业大学 Centralized heating secondary network operation adjustment method
US20150134124A1 (en) * 2012-05-15 2015-05-14 Passivsystems Limited Predictive temperature management system controller
CN108613332A (en) * 2018-04-12 2018-10-02 南京信息工程大学 A kind of energy-saving building film micro area personnel interactive mode hot comfort adjusting method
CN111561732A (en) * 2020-05-18 2020-08-21 瑞纳智能设备股份有限公司 Heat exchange station heat supply adjusting method and system based on artificial intelligence
CN111652371A (en) * 2020-05-29 2020-09-11 京东城市(北京)数字科技有限公司 Offline reinforcement learning network training method, device, system and storage medium
CN112268312A (en) * 2020-10-23 2021-01-26 哈尔滨派立仪器仪表有限公司 Intelligent heat supply management system based on deep learning
CN113606649A (en) * 2021-07-23 2021-11-05 淄博热力有限公司 Intelligent heat supply station control prediction system based on machine learning algorithm


Also Published As

Publication number Publication date
CN114484584B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
Lissa et al. Deep reinforcement learning for home energy management system control
Zhou et al. Combined heat and power system intelligent economic dispatch: A deep reinforcement learning approach
CN109253494B (en) Control method of electric heat storage device based on heat load prediction
CN109270842B (en) Bayesian network-based regional heat supply model prediction control system and method
Claessens et al. Model-free control of thermostatically controlled loads connected to a district heating network
JP2023129546A (en) System and method for optimal control of energy storage system
CN112614009B (en) Power grid energy management method and system based on deep expectation Q-learning
WO2021062748A1 (en) Optimization method and apparatus for integrated energy system and computer readable storage medium
CN114498641A (en) Distributed flexible resource aggregation control device and control method
Wojdyga Predicting heat demand for a district heating systems
CN112413831A (en) Energy-saving control system and method for central air conditioner
CN114503120A (en) Simulation method and device of integrated energy system and computer readable storage medium
Chitsazan et al. Wind speed forecasting using an echo state network with nonlinear output functions
Yahya et al. Short-term electric load forecasting using recurrent neural network (study case of load forecasting in central java and special region of yogyakarta)
Fusco et al. Knowledge-and data-driven services for energy systems using graph neural networks
CN114707737A (en) System and method for predicting power consumption based on edge calculation
Yu et al. Short-term cooling and heating loads forecasting of building district energy system based on data-driven models
Ruelens et al. Residential demand response applications using batch reinforcement learning
CN114484584B (en) Heat supply control method and system based on offline reinforcement learning
Liao et al. MEMS: An automated multi-energy management system for smart residences using the DD-LSTM approach
Zhang et al. Flexible selection framework for secondary frequency regulation units based on learning optimisation method
CN115169839A (en) Heating load scheduling method based on data-physics-knowledge combined drive
Panahazari et al. A hybrid optimization and deep learning algorithm for cyber-resilient der control
CN110826776B (en) Initial solution optimization method based on dynamic programming in distribution network line transformation relation identification
Zandi et al. An automatic learning framework for smart residential communities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant