CN116880169A - Peak power demand prediction control method based on deep reinforcement learning - Google Patents

Peak power demand prediction control method based on deep reinforcement learning

Info

Publication number
CN116880169A
Authority
CN
China
Prior art keywords
action
network
energy consumption
value
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310744068.XA
Other languages
Chinese (zh)
Inventor
Fu Qiming
Liu Lu
Ma Jie
Chen Jianping
Lu You
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University of Science and Technology
Original Assignee
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University of Science and Technology filed Critical Suzhou University of Science and Technology
Priority to CN202310744068.XA priority Critical patent/CN116880169A/en
Publication of CN116880169A publication Critical patent/CN116880169A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems according to some preassigned criterion, electric
    • G05B13/04 Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042 Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The application relates to the technical field of building energy conservation and discloses a peak power demand prediction control method based on deep reinforcement learning, comprising the following steps: acquiring data samples of four buildings over a period of time as a data set; dividing the control actions of building energy consumption into M equally sized intervals; splitting the data into a training set and a test set at an 8:2 ratio; constructing a deep forest module; constructing a first deep reinforcement learning module; predicting and classifying new energy consumption data with the trained model; constructing a second deep reinforcement learning module; and predicting the next action a_t with an Actor network. Through this iterative updating, the Agent gradually learns the optimal action strategy and achieves optimal control of the building-group peak load. The method avoids the human factors and limitations of traditional methods and offers better flexibility and adaptability.

Description

Peak power demand prediction control method based on deep reinforcement learning
Technical Field
The application relates to the technical field of building energy conservation, in particular to a peak power demand prediction control method based on deep reinforcement learning.
Background
The building sector is one of the most energy-intensive fields worldwide, and the management and optimization of building energy consumption have become urgent. Peak power demand control is an important strategy in building energy management and optimization. By implementing peak power demand control, the energy consumption of a building can be reduced, and its energy utilization efficiency and management level can be improved. However, controlling peak power demand is difficult in practice.
Conventional rule-based control strategies suffer from a number of limitations, such as being unable to accommodate complex environmental and demand changes. Although deep reinforcement learning techniques have been widely used in recent years in peak power demand prediction and control, existing research has certain drawbacks. First, the traditional deep reinforcement learning method has high calculation cost when processing the continuous state space, which results in slow algorithm convergence. Second, existing studies tend to focus only on control of peak power demand, and ignore the importance of predictions for optimizing control strategies. Therefore, innovations in prediction and control are needed to improve the efficiency and accuracy of energy consumption management and optimization.
Disclosure of Invention
The application aims to provide a peak power demand prediction control method based on deep reinforcement learning, so as to solve the problems noted in the background art: traditional rule-based control strategies have many limitations, such as the inability to adapt to complex environmental changes and demand changes. Although deep reinforcement learning techniques have been widely used in recent years for peak power demand prediction and control, existing research has certain drawbacks. First, traditional deep reinforcement learning methods have a high computational cost when processing continuous state spaces, which results in slow algorithm convergence. Second, existing studies tend to focus only on the control of peak power demand and ignore the importance of prediction for optimizing control strategies.
In order to achieve the above purpose, the present application provides the following technical solutions: a peak power demand prediction control method based on deep reinforcement learning comprises the following steps:
step one, using EnergyPlus to simulate and acquire data samples of four buildings over a period of time as a data set;
step two, dividing the control actions of building energy consumption into M equally sized intervals to obtain different action spaces; by splitting the action control space into several equidistant intervals, each discrete control action is associated with a continuous numerical range so that it can be better processed and modeled;
step three, splitting the data into a training set and a test set at an 8:2 ratio and reconstructing the energy consumption data within the training set range; classifying and labeling the energy consumption data with labels in the interval [1, M], forming new samples and labels, and normalizing them;
step four, constructing a deep forest module, taking the data set from step three as the input of the deep forest module, and training a deep forest classifier; after the classifier training is completed, the normalized samples are fed back into the classifier as original feature vectors; transformed feature vectors are obtained through multi-granularity scanning; the cascade forest structure in the deep forest takes the transformed feature vectors as input and outputs the probability of each action category corresponding to the data;
step five, constructing a first deep reinforcement learning module for predicting energy consumption data; the normalized newly constructed sample is combined with the action-interval class probabilities output by the deep forest module as the input of the Q neural network; the Q neural network calculates the Q values of all actions, and the target Q network calculates the target Q value; the TD error between the two is used to update the parameters of the Q network;
step six, predicting and classifying new energy consumption data by using the trained model, and comparing and verifying the new energy consumption data with an actual observed value to evaluate the generalization capability and the prediction precision of the model;
step seven, constructing a second deep reinforcement learning module for controlling the energy storage equipment in the building group so as to optimize peak load; at each time step t, the Agent predicts the future energy demand of the building group using the deep reinforcement learning module combined with the deep forest, and combines it with the current building status, weather and time to form a new state tuple s_t, which is input to the other deep reinforcement learning module; based on this state tuple, the Agent selects an action a_t, which influences the peak load of the whole system by controlling the energy storage equipment in the four buildings;
step eight, the Agent obtains the new state tuple s_t and predicts the next action a_t using the Actor network;
Step nine, through the iterative updating, the Agent can learn the optimal action strategy step by step, and realize the optimal control of the building group peak load.
Preferably, in the third step, the training set range data is subjected to sample and label reconstruction, proper attributes are required to be selected as characteristics, and proper first n pieces of historical energy consumption data are selected as characteristics through cross verification; then for time t, will [ E ] t-n ,E t-n-1 …,E t-1 ]As a new sample, E t For its corresponding new tag.
Preferably, in the fifth step, the algorithm minimizes the average mean square error between the Q network and the target Q network by gradient descent, so as to optimize the training effect of the model.
Preferably, in step eight, the Actor network will s t As input, output an a t Then Agent uses the probability distribution to sample an action a t The method comprises the steps of carrying out a first treatment on the surface of the Next, agent will a t As input, combine the current state s t Calculating a target Q value Q through a Critic network target (s t ,a t ) The method comprises the steps of carrying out a first treatment on the surface of the Finally, the Agent uses an Adam optimization algorithm to update the parameters of the Actor network and the Critic network so as to maximize the target Q value; in the optimization process, in order to prevent the oscillation of network parameters, the parameters of the target Actor network and the target Critic network are updated by using a soft update strategy.
Preferably, the original large prediction space is divided into N subspaces by a depth forest module, the actions in each subspace are expressed by a unified formula, the formula ingeniously utilizes the property of a general term to compress the action space, and each action in the compression space is expressed as one action in the whole subspace;
in the formula, x and z represent an upper limit and a lower limit of the action space, respectively, and N represents a final value of the compression space; by compressing the action space in this way, the size of the action space can be greatly reduced to cope with the problem of the reduction of prediction accuracy caused by the large prediction space.
Preferably, in the fifth step, modeling the energy consumption prediction problem as MDP modeling, and constructing corresponding states, actions and immediate rewards functions;
wherein:
status: denoted by s; s is(s) t Consists of the normalized sample in the third step and the probability output by the depth forest module in the fourth step, namely
The actions are as follows: a is used for representing that each action corresponds to an energy consumption predicted value;
immediate rewards function: denoted by r; at time t, a t For the predicted value of energy consumption, the absolute value of the difference value between the predicted value of energy consumption and the actual energy consumption value can be regarded as rewards obtained by the agent at the time t, and the predicted value of energy consumption and the actual energy consumption value are expressed as follows:
R 1 =|En pre -En true |。
Preferably, in step five, the parameters θ of the Q network are updated using the TD error between the Q network and the target Q network, specifically:
L(θ_i) = E[(r + γ*max_a' Q(s', a'; θ_i^-) − Q(s, a; θ_i))^2]
where (s, a, r, s') is a quadruple sampled from the experience pool, a' is the action executed by the agent at time t+1, θ_i^- and θ_i denote the parameters of the target Q network and the Q network respectively, and r is the reward obtained by executing action a_t in state s_t at time t.
Preferably, in step seven, modeling the control problem of the energy storage device in the building as MDP modeling, and constructing corresponding states, actions and immediate rewards functions;
wherein:
status: the state variable of the control system mainly consists of two parts; the first part comprises state variables of the cluster building, which are divided into time variables, area related variables and building related variables; the time variable comprises month, time and day types; the area related variables include weather information and electricity prices, including outdoor dry bulb temperature, relative humidity, direct and diffuse solar radiation, solar power generation, and predicted outdoor temperatures and humidities for 5-8 hours and 11-15 hours into the future; building related variables include indoor temperature, indoor humidity, and use of non-mobile devices; the second part comprises dynamic state variables such as coefficient of performance of the heat pump, state of charge of the hot and cold water tanks, and building energy consumption by the predicted next time step t;
the actions are as follows: the energy storage system under each building consists of two controllable units which respectively represent hot water storage tanks and cold water storage tanks; in order to ensure that the shortage of energy supply and demand does not occur, the upper and lower limits of the action space are set to be 1/3 of the maximum energy storage capacity, and the action space is expressed as { a } 11 ,a 12 ,a 21 ,a 22 ,a 31 ,a 32 ,a 41 ,a 42 };
Immediate rewards function: the bonus function of the control section should take into account the power peak regulation effect and the power cost, as they both affect the quality of the timing control of the system; the quality of the electric power peak regulation is mainly reflected on energy consumption variables in the rewarding function, and the cost judgment is based on the influence of the current price; thus, the bonus function is designed as follows:
R 2 =α*En t +β*[(En t /10) 3 ]*Pr t
wherein En is t Represents the current power demand value, which has been smoothed to increaseCalculation accuracy, pr t A current electricity price representing time t; the reward function in this equation grasps the interaction between power demand and price in order to find an intermediate value that balances peak power demand and power cost; wherein the set values of a and beta are 0.8 and 0.2, respectively.
Preferably, in the updating of the eighth Actor network, action a is taken by maximizing the current state t The Q value obtained, i.e. maxQ (s t ,a t ) Updating parameters of the Actor network; the updating process uses a gradient rising method, so that the quality of a strategy of an Actor network can be gradually improved; the method comprises the following steps:
critic network whose goal is to minimize the error of the predicted Q value from the true Q value, i.e., training parameters of Critic network by minimizing mean square error using the predicted Q value as the goal, TD goal is defined as y i =r+γQ(s',μ(s'|θ μ′ )|θ Q ) The method is characterized by comprising the following steps:
L=1/N∑ i (y i -Q(s i ,a iQ )) 2
the target Actor network and the target Critic network both adopt a soft update method to ensure the stability of the algorithm instead of directly copying network parameters, and the method is as follows:
the application has the beneficial effects that:
the method can efficiently solve the problems of building load prediction and energy storage equipment control. According to the method, future load conditions can be predicted through learning building load data, and an optimal control strategy is generated according to the real-time state of the energy storage equipment and the predicted conditions of the building load. Compared with the traditional rule and experience-based control method, the method does not need to manually design a control strategy, and can automatically learn an optimal strategy according to data, so that human factors and limitations in the traditional method are avoided, and the method has better flexibility and adaptability. Meanwhile, the method can effectively reduce the energy consumption cost of the building, improves the energy utilization efficiency, and plays an important role in promoting the realization of the aim of intelligent energy.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of the overall architecture of the present application;
FIG. 2 is an enlarged view of the data set generation architecture of FIG. 1 in accordance with the present application;
FIG. 3 is an enlarged view of the prediction phase architecture of FIG. 1 according to the present application;
FIG. 4 is an enlarged view of the Control phase architecture of FIG. 1 according to the present application;
FIG. 5 is an enlarged view of the data preprocessing architecture of the prediction stage of FIG. 2 according to the present application;
FIG. 6 is an enlarged view of the deep forest classifier architecture of the prediction stage of FIG. 2 of the present application;
FIG. 7 is an enlarged view of the DQN-based energy consumption prediction architecture of the prediction stage of FIG. 2 according to the present application.
Detailed Description
In order that the above objects, features and advantages of the application will be readily understood, a more particular description of the application will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit of the application, whereby the application is not limited to the specific embodiments disclosed below.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Examples
As shown in fig. 1-7, an embodiment of the present application discloses a peak power demand prediction control method based on deep reinforcement learning, which includes the following steps:
Step one, with reference to the commercial reference buildings developed by the United States Department of Energy (DOE), using EnergyPlus to simulate and acquire data samples of four buildings over a period of time as a data set;
Step two, dividing the control actions of building energy consumption into M equally sized intervals to obtain different action spaces; by splitting the action control space into several equidistant intervals, each discrete control action is associated with a continuous numerical range so that it can be better processed and modeled;
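As an illustration of this discretization, the following Python sketch splits an assumed continuous control range into M equal intervals and maps a continuous value to its interval label in [1, M]; the range bounds, M and the helper names are placeholder assumptions, not values taken from the application.

```python
# Hypothetical sketch of step two: splitting the energy-consumption control
# range into M equal-width intervals. M, action_low and action_high are
# illustrative assumptions.
import numpy as np

M = 20                                  # number of equal-size action intervals (assumed)
action_low, action_high = 0.0, 100.0    # assumed control range

# Interval edges and the representative (midpoint) value of each interval.
edges = np.linspace(action_low, action_high, M + 1)
midpoints = (edges[:-1] + edges[1:]) / 2.0

def action_to_interval(value: float) -> int:
    """Map a continuous control value to its interval label in [1, M]."""
    idx = int(np.clip(np.digitize(value, edges[1:-1]), 0, M - 1))
    return idx + 1  # labels are 1-based, matching the [1, M] label range
```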
Step three, splitting the data into a training set and a test set at an 8:2 ratio and reconstructing the energy consumption data within the training set range. Classifying and labeling the energy consumption data with labels in the interval [1, M], forming new samples and labels, and normalizing them;
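For concreteness, the sketch below shows one way to build the [E_{t-n}, ..., E_{t-1}] → E_t samples and apply min-max normalization after an 8:2 split; the array names, the window length n = 6 and the random placeholder data are assumptions for illustration only.

```python
# A minimal sketch of step three, assuming a 1-D array of energy readings and
# a window length n chosen by cross-validation (n = 6 is a placeholder).
import numpy as np

def build_samples(energy: np.ndarray, n: int):
    """Use [E_{t-n}, ..., E_{t-1}] as a sample and E_t as its label."""
    X = np.stack([energy[i - n:i] for i in range(n, len(energy))])
    y = energy[n:]
    return X, y

def minmax_normalize(a: np.ndarray):
    lo, hi = a.min(), a.max()
    return (a - lo) / (hi - lo + 1e-8), (lo, hi)

energy = np.random.rand(1000)          # placeholder for simulated EnergyPlus data
split = int(0.8 * len(energy))         # 8:2 train/test split
train, test = energy[:split], energy[split:]
X_train, y_train = build_samples(train, n=6)
X_train, _ = minmax_normalize(X_train)
y_train, _ = minmax_normalize(y_train)
```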
Step four, constructing a deep forest module, taking the data set from step three as the input of the deep forest module, and training a deep forest classifier. After the classifier training is completed, the normalized samples are fed back into the classifier as original feature vectors. Transformed feature vectors are obtained through multi-granularity scanning. The cascade forest structure in the deep forest takes the transformed feature vectors as input and outputs the probability of each action category corresponding to the data;
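The deep forest classifier itself is not specified line by line in the application; the following simplified cascade-forest sketch (with multi-granularity scanning omitted) only illustrates how stacked random/extra-trees forests can augment the original feature vector with class-probability vectors and output per-class probabilities. The class name and hyperparameters are assumptions.

```python
# A simplified stand-in for the deep forest (cascade forest) classifier of
# step four; not the application's exact implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

class MiniCascadeForest:
    def __init__(self, n_layers=2, n_estimators=100):
        self.layers = [
            [RandomForestClassifier(n_estimators=n_estimators),
             ExtraTreesClassifier(n_estimators=n_estimators)]
            for _ in range(n_layers)
        ]

    def fit(self, X, y):
        feats = X
        for layer in self.layers:
            probs = []
            for forest in layer:
                forest.fit(feats, y)
                probs.append(forest.predict_proba(feats))
            # Augment the original features with each forest's class-probability vector.
            feats = np.hstack([X] + probs)
        return self

    def predict_proba(self, X):
        feats = X
        for layer in self.layers:
            probs = [forest.predict_proba(feats) for forest in layer]
            feats = np.hstack([X] + probs)
        # Average the last layer's class-probability vectors.
        return np.mean(probs, axis=0)
```

In the context of the method above, X would be the normalized windows from step three, y the interval labels in [1, M], and the probabilities returned by predict_proba are what gets concatenated to the state in step five.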
Step five, constructing a first deep reinforcement learning module for predicting energy consumption data; this module adopts the DQN architecture and combines the normalized newly constructed sample with the action-interval class probabilities output by the deep forest module as the input of the Q neural network. The Q neural network calculates the Q values of all actions, and the target Q network calculates the target Q value. The TD error between the two is used to update the parameters of the Q network. Specifically, the algorithm minimizes the average mean square error between the Q network and the target Q network by gradient descent, thereby optimizing the training of the model, expressed as:
L(θ_i) = E[(r + γ*max_a' Q(s', a'; θ_i^-) − Q(s, a; θ_i))^2]
where (s, a, r, s') is a quadruple sampled from the experience pool, a' is the action executed by the agent at time t+1, θ_i^- and θ_i denote the parameters of the target Q network and the Q network respectively, and r is the reward obtained by executing action a_t in state s_t at time t. Specifically, this part models the energy consumption prediction problem as an MDP and constructs the corresponding state, action and immediate reward functions.
Wherein:
(a) State: denoted by s. s_t consists of the normalized sample from step three and the probability vector output by the deep forest module in step four, namely the concatenation of the two.
(b) Action: denoted by a; each action corresponds to an energy consumption prediction.
(c) Immediate reward function: denoted by r. At time t, a_t is the energy consumption prediction, and the absolute value of the difference between the predicted and actual energy consumption values is regarded as the reward obtained by the agent at time t, expressed as:
R_1 = |En_pre − En_true|
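A hedged PyTorch sketch of the step-five update is given below: the state concatenates the normalized window with the deep-forest class probabilities, the Q network scores all discrete actions, and one gradient step minimizes the mean-squared TD error against the target Q network. The layer sizes, γ and the optimizer settings are illustrative assumptions, not values from the application.

```python
# Sketch of the DQN prediction module (step five); sizes are placeholders.
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state (normalized window + class probabilities) to one Q value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the mean-squared TD error between Q and the target Q network."""
    s, a, r, s_next = batch                       # tensors sampled from the experience pool
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)   # average mean square error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical wiring (dimensions assumed): the target network starts as a copy of the
# Q network and is periodically synchronized during training.
state_dim, n_actions = 6 + 20, 20
q_net, target_net = QNet(state_dim, n_actions), QNet(state_dim, n_actions)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```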
step six, predicting and classifying new energy consumption data by using the trained model, and comparing and verifying the new energy consumption data with an actual observed value to evaluate the generalization capability and the prediction precision of the model;
Step seven, constructing a second deep reinforcement learning module for controlling the energy storage equipment in the building group so as to optimize peak load. At each time step t, the Agent predicts the future energy demand of the building group using the deep reinforcement learning module combined with the deep forest, and combines it with the current building state, weather and time to form a new state tuple s_t, which is input to the other deep reinforcement learning module. Based on this state tuple, the Agent selects an action a_t, which affects the peak load of the overall system by controlling the energy storage devices (hot and cold water storage tanks) in the four buildings. Specifically, this part models the control problem of the energy storage devices in the buildings as an MDP and constructs the corresponding state, action and immediate reward functions.
Wherein:
(a) State: the state variables of the control system consist mainly of two parts. The first part includes the state variables of the clustered buildings, divided into time, area-related and building-related variables. The time variables comprise the month, the hour and the day type; the area-related variables include weather information and electricity prices, covering outdoor dry bulb temperature, relative humidity, direct and diffuse solar radiation, solar power generation, and predicted outdoor temperatures and humidities for 5-8 hours and 11-15 hours into the future, with the 6-hour and 12-hour outdoor temperatures and humidities being preferred; the building-related variables include indoor temperature, indoor humidity and the use of non-mobile devices. The second part includes dynamic state variables such as the coefficient of performance (COP) of the heat pump, the state of charge (SOC) of the hot and cold water tanks, and the predicted building energy consumption for the next time step t.
(b) Action: the energy storage system of each building consists of two controllable units, representing the hot water storage tank and the cold water storage tank respectively. To ensure that no shortage between energy supply and demand occurs, the upper and lower limits of the action space are set to 1/3 of the maximum energy storage capacity, and the action space is expressed as {a_11, a_12, a_21, a_22, a_31, a_32, a_41, a_42}.
(c) Immediate reward function: the reward function of the control part should take into account both the power peak-regulation effect and the power cost, as both affect the quality of the system's timing control. The quality of power peak regulation is mainly reflected in the energy consumption variable of the reward function, while the cost judgment is based on the influence of the current price. Thus, the reward function is designed as follows:
R_2 = α*En_t + β*(En_t/10)^3*Pr_t
where En_t represents the current power demand value, which is smoothed to improve calculation accuracy, and Pr_t represents the current electricity price at time t. The reward function captures the interaction between power demand and price in order to find an intermediate value that balances peak power demand and power cost. The values of α and β are set to 0.8 and 0.2, respectively.
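The control reward can be transcribed directly; the helper below evaluates R_2 = α*En_t + β*(En_t/10)^3*Pr_t with the stated α = 0.8 and β = 0.2 (the function name and argument names are illustrative).

```python
# Direct transcription of the step-seven reward; en_t is the smoothed current
# power demand and pr_t the electricity price at time t.
def control_reward(en_t: float, pr_t: float, alpha: float = 0.8, beta: float = 0.2) -> float:
    return alpha * en_t + beta * (en_t / 10.0) ** 3 * pr_t
```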
Step eight, the control module adopts a DDPG architecture, and the Agent obtains a new state tuple s t Predicting next action a using an Actor network t . Specifically, the Actor network will s t As input, output an a t Then Agent uses the probability distribution to sample an action a t . Next, agent will a t As input, combine the current state s t Calculating a target Q value Q through a Critic network target (s t ,a t ). Finally, the Agent uses Adam optimization algorithm to update the parameters of the Actor network and the Critic network to maximize the target Q value. Action a is taken in the update of the Actor network by maximizing the current state t The Q value obtained, i.e. maxQ (s t ,a t ) To update the parameters of the Actor network. This update procedure uses a gradient-increasing approach, enabling the Actor network to gradually improve the quality of its policies. The method comprises the following steps:
critic networks are similar to the Q networks in step five, with the goal of minimizing the error of the predicted Q value from the true Q valueI.e. training parameters of Critic networks by minimizing mean square error using predicted Q values as targets, TD targets are defined as y i =r+γQ(s',μ(s'|θ μ′ )|θ Q ) The method is characterized by comprising the following steps:
L=1/N∑ i (y i -Q(s i ,a iQ )) 2
the target Actor network and the target Critic network both adopt a soft update method to ensure the stability of the algorithm instead of directly copying network parameters, and the method is as follows:
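The following sketch summarizes the step-eight updates under standard DDPG assumptions: the Critic is regressed onto the TD target built from the target networks, the Actor is updated by gradient ascent on the Critic's Q value, and both target networks are soft updated. τ, γ, the layer sizes and the network interfaces (critic(s, a), actor(s)) are assumptions rather than values specified in the application.

```python
# Hedged sketch of the DDPG control module (step eight); sizes are placeholders.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s) -> a, tanh-bounded."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a) network over the concatenated state-action pair."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch
    # Critic: regress Q(s, a) onto y = r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next)).squeeze(-1)
    critic_loss = nn.functional.mse_loss(critic(s, a).squeeze(-1), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: gradient ascent on Q(s, mu(s)), implemented as minimizing -Q.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks: theta' <- tau*theta + (1 - tau)*theta'.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```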
step nine, through the iterative updating, the Agent can learn the optimal action strategy step by step, and realize the optimal control of the building group peak load.
Specifically, the specific algorithm flow of the whole predictive control method is as follows:
S1, initializing the number of state classes M, corresponding to the number of sample classes;
S2, initializing experience pool D1 and experience pool D2;
S3, initializing the Q function Q_1 and the target Q function Q'_1;
S4, initializing the Actor network and the target Actor network, with Actor network parameters θ_a and target Actor network parameters θ_a'; initializing the Critic network and the target Critic network, with Critic network parameters θ_c and target Critic network parameters θ_c';
S5, dividing the data set, reconstructing the data within the training set range to form new samples and label values, and normalizing the data;
S6, training the deep forest classifier;
S7, setting the number of training episodes and entering training; for each episode, carrying out steps S8 to S17;
S8, randomly selecting a data sample, classifying it with the trained deep forest classifier, outputting the class probabilities, and constructing a new state s_1t from the class probabilities and the original sample;
S9, using the current state s_1t, calculating the Q values of all actions according to the action-value function Q_1 and selecting action a_1 with an ε-greedy policy;
S10, executing action a_1, observing the new state s_1t+1 and the immediate reward R_1, and storing the experience (s_1t, a_1, R_1, s_1t+1) in replay memory D1;
S11, randomly extracting a batch of experiences from experience pool D1 and updating the action-value function Q_1 with the experience data;
S12, updating the target function Q'_1 once every n steps;
S13, if the prediction accuracy reaches a preset percentage, constructing a new state s_2t from the predicted value;
S14, the Actor network selecting action a_2 based on the new state s_2t using an ε-greedy policy;
S15, executing action a_2, observing the new state s_2t+1 and the immediate reward R_2, and storing the experience (s_2t, a_2, R_2, s_2t+1) in experience pool D2;
S16, randomly sampling a batch of experiences from experience pool D2 and using them to update the Critic network parameters θ_c and the Actor network parameters θ_a;
S17, every certain number of steps, updating the target Critic network parameters θ_c' and the target Actor network parameters θ_a' with a soft update.
Those skilled in the art will appreciate that the features recited in the various embodiments of the application and/or in the claims may be combined in various ways, even if such combinations are not explicitly recited in the application. In particular, such combinations may be made without departing from the spirit and teachings of the application, and all of them fall within its scope.
The foregoing description of the embodiments illustrates the general principles of the application and is not intended to limit it; any modifications, equivalents and improvements made without departing from the spirit and principles of the application shall fall within its scope of protection.

Claims (9)

1. A peak power demand prediction control method based on deep reinforcement learning is characterized in that: the method comprises the following steps:
step one, using EnergyPlus to simulate and acquire data samples of four buildings over a period of time as a data set;
step two, dividing the control actions of building energy consumption into M equally sized intervals to obtain different action spaces; by splitting the action control space into several equidistant intervals, each discrete control action is associated with a continuous numerical range so that it can be better processed and modeled;
step three, splitting the data into a training set and a test set at an 8:2 ratio and reconstructing the energy consumption data within the training set range; classifying and labeling the energy consumption data with labels in the interval [1, M], forming new samples and labels, and normalizing them;
step four, constructing a deep forest module, taking the data set from step three as the input of the deep forest module, and training a deep forest classifier; after the classifier training is completed, the normalized samples are fed back into the classifier as original feature vectors; transformed feature vectors are obtained through multi-granularity scanning; the cascade forest structure in the deep forest takes the transformed feature vectors as input and outputs the probability of each action category corresponding to the data;
step five, constructing a first deep reinforcement learning module for predicting energy consumption data; the normalized newly constructed sample is combined with the action-interval class probabilities output by the deep forest module as the input of the Q neural network; the Q neural network calculates the Q values of all actions, and the target Q network calculates the target Q value; the TD error between the two is used to update the parameters of the Q network;
step six, predicting and classifying new energy consumption data by using the trained model, and comparing and verifying the new energy consumption data with an actual observed value to evaluate the generalization capability and the prediction precision of the model;
step seven, constructing a second deep reinforcement learning module for controlling the energy storage equipment in the building group so as to optimize peak load; at each time step t, the Agent predicts the future energy demand of the building group using the deep reinforcement learning module combined with the deep forest, and combines it with the current building state, weather and time to form a new state tuple s_t, which is input to the other deep reinforcement learning module; based on this state tuple, the Agent selects an action a_t, which influences the peak load of the whole system by controlling the energy storage equipment in the four buildings;
step eight, the Agent obtains the new state tuple s_t and predicts the next action a_t using the Actor network;
Step nine, through the iterative updating, the Agent can learn the optimal action strategy step by step, and realize the optimal control of the building group peak load.
2. The method for predictive control of peak power demand based on deep reinforcement learning of claim 1, wherein: in step three, samples and labels are reconstructed from the training set range data; suitable attributes need to be selected as features, and the appropriate number n of historical energy consumption records is selected by cross-validation; then, for time t, [E_{t-n}, ..., E_{t-1}] is taken as a new sample and E_t as its corresponding new label.
3. The method for predictive control of peak power demand based on deep reinforcement learning of claim 1, wherein: in the fifth step, the algorithm minimizes the average mean square error between the Q network and the target Q network through gradient descent, so that the training effect of the model is optimized.
4. The method for predictive control of peak power demand based on deep reinforcement learning of claim 1, wherein: in step eight, the Actor network takes s_t as input and outputs a_t, and the Agent then samples an action a_t from this probability distribution; next, the Agent takes a_t as input and, combined with the current state s_t, computes a target Q value Q_target(s_t, a_t) through the Critic network; finally, the Agent uses the Adam optimization algorithm to update the parameters of the Actor network and the Critic network so as to maximize the target Q value; during optimization, to prevent oscillation of the network parameters, the parameters of the target Actor network and the target Critic network are updated with a soft update strategy.
5. The method for predictive control of peak power demand based on deep reinforcement learning of claim 3, wherein: the original large prediction space is divided into N subspaces by the deep forest module, and the actions in each subspace are expressed by a unified formula; the formula uses the general-term property to compress the action space, so that each action in the compressed space represents one action of the whole subspace;
in the formula, x and z represent the upper and lower limits of the action space respectively, and N represents the final value of the compressed space; by compressing the action space in this way, its size can be greatly reduced, which counters the loss of prediction accuracy caused by an overly large prediction space.
6. The method for predictive control of peak power demand based on deep reinforcement learning of claim 5, wherein: the energy consumption prediction problem is modeled as an MDP, and the corresponding state, action and immediate reward functions are constructed;
wherein:
state: denoted by s; s_t consists of the normalized sample from step three and the probability vector output by the deep forest module in step four, namely the concatenation of the two;
action: denoted by a; each action corresponds to an energy consumption prediction;
immediate reward function: denoted by r; at time t, a_t is the energy consumption prediction, and the absolute value of the difference between the predicted and actual energy consumption values is regarded as the reward obtained by the agent at time t, expressed as:
R_1 = |En_pre − En_true|.
7. The method for predictive control of peak power demand based on deep reinforcement learning of claim 6, wherein: in step five, the parameters θ of the Q network are updated using the TD error between the Q network and the target Q network, specifically:
L(θ_i) = E[(r + γ*max_a' Q(s', a'; θ_i^-) − Q(s, a; θ_i))^2]
where (s, a, r, s') is a quadruple sampled from the experience pool, a' is the action executed by the agent at time t+1, θ_i^- and θ_i denote the parameters of the target Q network and the Q network respectively, and r is the reward obtained by executing action a_t in state s_t at time t.
8. The method for predictive control of peak power demand based on deep reinforcement learning of claim 1, wherein: the control problem of the energy storage devices in the buildings is modeled as an MDP, and the corresponding state, action and immediate reward functions are constructed;
wherein:
state: the state variables of the control system consist mainly of two parts; the first part comprises the state variables of the clustered buildings, divided into time variables, area-related variables and building-related variables; the time variables comprise the month, the hour and the day type; the area-related variables include weather information and electricity prices, covering outdoor dry bulb temperature, relative humidity, direct and diffuse solar radiation, solar power generation, and predicted outdoor temperatures and humidities for 5-8 hours and 11-15 hours into the future; the building-related variables include indoor temperature, indoor humidity and the use of non-mobile devices; the second part comprises dynamic state variables such as the coefficient of performance of the heat pump, the state of charge of the hot and cold water tanks, and the predicted building energy consumption for the next time step t;
action: the energy storage system of each building consists of two controllable units, representing the hot water storage tank and the cold water storage tank respectively; to ensure that no shortage between energy supply and demand occurs, the upper and lower limits of the action space are set to 1/3 of the maximum energy storage capacity, and the action space is expressed as {a_11, a_12, a_21, a_22, a_31, a_32, a_41, a_42};
immediate reward function: the reward function of the control part should take into account both the power peak-regulation effect and the power cost, as both affect the quality of the system's timing control; the quality of power peak regulation is mainly reflected in the energy consumption variable of the reward function, while the cost judgment is based on the influence of the current price; thus, the reward function is designed as follows:
R_2 = α*En_t + β*(En_t/10)^3*Pr_t
where En_t represents the current power demand value, which is smoothed to improve calculation accuracy, and Pr_t represents the current electricity price at time t; the reward function captures the interaction between power demand and price in order to find an intermediate value that balances peak power demand and power cost; the values of α and β are set to 0.8 and 0.2, respectively.
9. The method for predictive control of peak power demand based on deep reinforcement learning of claim 4, wherein: in the step-eight Actor network update, the parameters of the Actor network are updated by maximizing the Q value obtained by taking action a_t in the current state, i.e. max Q(s_t, a_t); the update uses gradient ascent, so that the Actor network can gradually improve the quality of its policy; specifically, the policy gradient is
∇_θ^μ J ≈ (1/N)*Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} * ∇_θ^μ μ(s|θ^μ)|_{s=s_i};
for the Critic network, the goal is to minimize the error between the predicted Q value and the true Q value, i.e. the Critic network parameters are trained by minimizing the mean square error with the predicted Q value as the target; the TD target is defined as y_i = r + γ*Q(s', μ(s'|θ^μ')|θ^Q'), and the loss is
L = (1/N)*Σ_i (y_i − Q(s_i, a_i|θ^Q))^2;
the target Actor network and the target Critic network both use soft updates rather than directly copying the network parameters, which ensures the stability of the algorithm:
θ^Q' ← τ*θ^Q + (1 − τ)*θ^Q',  θ^μ' ← τ*θ^μ + (1 − τ)*θ^μ'.
CN202310744068.XA 2023-06-25 2023-06-25 Peak power demand prediction control method based on deep reinforcement learning Pending CN116880169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310744068.XA CN116880169A (en) 2023-06-25 2023-06-25 Peak power demand prediction control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310744068.XA CN116880169A (en) 2023-06-25 2023-06-25 Peak power demand prediction control method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116880169A true CN116880169A (en) 2023-10-13

Family

ID=88263433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310744068.XA Pending CN116880169A (en) 2023-06-25 2023-06-25 Peak power demand prediction control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116880169A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688846A (en) * 2024-02-02 2024-03-12 杭州经纬信息技术股份有限公司 Reinforced learning prediction method and system for building energy consumption and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination