CN117057553A - Deep reinforcement learning-based household energy demand response optimization method and system - Google Patents

Deep reinforcement learning-based household energy demand response optimization method and system

Info

Publication number
CN117057553A
CN117057553A (application CN202310984893.7A)
Authority
CN
China
Prior art keywords
power
energy
network
time
representing
Prior art date
Legal status
Pending
Application number
CN202310984893.7A
Other languages
Chinese (zh)
Inventor
郑伟铭
杨超
林俊鹏
袁郁
林焕鑫
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202310984893.7A
Publication of CN117057553A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315Needs-based resource requirements planning or analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/003Load forecast, e.g. methods or systems for forecasting future load demand
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/004Generation forecast, e.g. methods or systems for forecasting future energy generation
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/20The dispersed energy generation being of renewable origin
    • H02J2300/22The renewable source being solar energy

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • General Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Power Engineering (AREA)
  • General Business, Economics & Management (AREA)
  • Public Health (AREA)
  • Biophysics (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Water Supply & Treatment (AREA)
  • Educational Administration (AREA)
  • Primary Health Care (AREA)
  • Development Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention relates to the technical field of household energy demand response and discloses a household energy demand response optimization method and system based on deep reinforcement learning. The method comprises the following specific steps: S1: constructing a home energy system model according to the electric appliances managed by the home energy management system; S2: based on a Markov decision process, constructing a dual-agent reinforcement learning model of the home energy management system based on a deep Q network and a deep deterministic policy gradient network according to the home energy system model; S3: designing a reward function of the dual-agent reinforcement learning model; S4: based on the designed reward function, learning through the dual-agent reinforcement learning model to obtain an optimal strategy and optimize the response of the home energy management system to the household energy demand. The method solves the dimension explosion of the action space that may arise in the prior art, improves the convergence rate of the algorithm, and provides a better user experience.

Description

Deep reinforcement learning-based household energy demand response optimization method and system
Technical Field
The invention relates to the technical field of household energy demand response, in particular to a household energy demand response optimization method and system based on deep reinforcement learning.
Background
With the advance of power systems and the large-scale integration of renewable energy into the power grid, smart grids are playing an increasing role in the overall power system, and Home Energy Management Systems (HEMS) play an indispensable role in demand-side management. An efficient home energy management system can provide many benefits to residents and to the power system, such as financial profit, energy conservation, comfort and convenience, minimization of greenhouse gas emissions, and increased utilization of energy storage systems and renewable energy. In addition, renewable energy sources are becoming more common in smart homes. However, because renewable power generation is highly susceptible to weather and natural conditions, is uncertain and uncontrollable, and the energy density of renewable generation devices is low, accurately predicting renewable generation is increasingly difficult, and using renewable energy on a large scale remains a serious challenge.
Many existing references optimize home energy demand response with conventional constrained optimization methods, including but not limited to mixed-integer linear programming, dynamic programming, multi-objective optimization, and mixed-integer quadratic programming. These methods have a mature mathematical and theoretical basis and a certain stability and reliability, but they face problems such as computational complexity and local optima when dealing with complex nonlinear and large-scale problems. In addition, some studies use reinforcement learning methods, such as federated reinforcement learning, multi-objective reinforcement learning, and energy management algorithms based on deep deterministic policy gradients. However, the above studies that consider renewable energy mainly focus on overcoming the uncertainty of renewable energy through the algorithms themselves, which generally requires a large amount of experimental and interaction data and makes it difficult or time-consuming to collect enough experience in a complex environment. Many studies do not consider reducing the error of renewable energy data prediction in the feature-processing step; those that do mainly work on data quality, prediction models and algorithms, feature engineering, correction models, and parameter tuning. They also do not consider using an energy storage system to balance the difference between renewable supply and demand, to cope with changes in energy demand, and to improve the reliability and predictability of renewable energy by storing energy and smoothing renewable fluctuations.
In one existing invention, without a building thermodynamic model and while ensuring user comfort, the problem of minimizing the energy cost of a smart home is modeled as a Markov decision process, and the corresponding environmental states, actions, and reward functions are designed. The environmental state includes real-time and predicted future photovoltaic output power, electricity price, and outdoor temperature data, and the local features of the predicted data are obtained through decomposition, denoising, and reconstruction with a discrete wavelet transform. Under an integrated prediction-and-decision scheduling mode, the smart home energy management system is optimally controlled with the stochastic-policy Soft Actor-Critic algorithm, which consists of a learning stage and an application stage.
However, the prior art still suffers from the dimension explosion of the action space that may arise in practical applications, so how to devise a household energy demand response optimization method based on deep reinforcement learning is a technical problem to be solved in this technical field.
Disclosure of Invention
The invention provides a home energy demand response optimization method based on deep reinforcement learning, which aims to solve the dimension explosion of the action space that may arise in the prior art, improves the convergence speed of the algorithm, and provides a better user experience.
In order to achieve the above purpose of the present invention, the following technical scheme is adopted:
a home energy demand response optimization method based on deep reinforcement learning comprises the following specific steps:
s1: constructing a home energy system model according to the electric appliances managed by the home energy management system;
s2: based on a Markov decision process, constructing a dual-agent reinforcement learning model of the home energy management system based on a deep Q network and a deep deterministic policy gradient network according to the home energy system model;
s3: designing a reward function of the dual-agent reinforcement learning model;
s4: based on the designed reward function, learning through the dual-agent reinforcement learning model to obtain an optimal strategy and optimize the response of the home energy management system to the household energy demand.
Preferably, the electric appliances managed by the household energy management system comprise an energy storage system, fixed-power appliances, time-shiftable appliances, power-controllable appliances, a new energy vehicle, and a photovoltaic power generation system.
Further, in the step S1, a home energy system model is constructed according to the electric appliances managed by the home energy management system, specifically:
Time-shiftable appliance:
The working time is defined within the range [t_n^start, t_n^end], where t_n^start denotes the earliest time at which the appliance is allowed to start running, t_n^end denotes the end time, and the running duration is d_n. t_n^actual denotes the time step at which the time-shiftable appliance actually starts to operate, P_{n,t} denotes the current power of the n-th appliance, and k_{n,t} is a Boolean variable representing the on/off state of the time-shiftable appliance. The state of the time-shiftable appliance is expressed in terms of k_{n,t} and P_{n,t}.
Assuming that the appliance, once started, completes a full operating cycle, t_n^actual satisfies the constraint that the whole cycle of duration d_n lies inside the window [t_n^start, t_n^end].
If the running state at time t-1 is k_{n,t-1} = 1 and the cycle has not yet been completed, then k_{n,t} = 1; only when t reaches the critical value at which the cycle ends may the appliance switch off.
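The constraint above can be made concrete with a short sketch. The following Python snippet is an illustration only (function names such as is_valid_start and the numeric example are assumptions, not part of the patent); it checks that a chosen actual start time keeps the full uninterruptible cycle inside the allowed window and derives the on/off indicator k_{n,t}:

```python
def is_valid_start(t_actual: int, t_start: int, t_end: int, duration: int) -> bool:
    """A start time is feasible if the full cycle [t_actual, t_actual + duration)
    fits inside the user-defined window [t_start, t_end]."""
    return t_start <= t_actual and t_actual + duration - 1 <= t_end


def on_off_state(t: int, t_actual: int, duration: int) -> int:
    """Boolean indicator k_{n,t}: 1 while the appliance is running its
    uninterruptible cycle, 0 otherwise."""
    return 1 if t_actual <= t < t_actual + duration else 0


# Example: window 18:00-23:00, a 2-hour cycle started at 20:00
assert is_valid_start(20, 18, 23, 2)
assert on_off_state(21, 20, 2) == 1
assert on_off_state(23, 20, 2) == 0
```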
Power-controllable appliance:
P_{n,t} denotes the current power of the n-th power-controllable appliance at time t, which varies between the minimum power P_n^min and the maximum power P_n^max of the n-th power-controllable appliance. The state of the power-controllable appliance is expressed in terms of P_{n,t}.
The constraint of the power-controllable appliance is expressed as P_n^min ≤ P_{n,t} ≤ P_n^max.
power fixed electric appliance:
is provided withFor the switching state of the nth power fixture at time t, the +.>To power the nth power fixture at time t, define the state of the power fixture as:
Energy storage system:
The state of the energy storage system is represented by its state of charge SoC at time t, where the SoC ranges from 0% to 100%. The SoC at time t+1 is determined by the previous time step and the action executed. When the energy storage system is charging, SoC_{t+1} is obtained from the charging power at time t multiplied by the charging efficiency, added to β·SoC_t, where β is the energy-leakage coefficient accounting for self-discharge. Because the per-cycle self-loss of a household energy storage system is small, the self-loss rate is assumed to be τ = 0, and the constraint of the energy storage system is expressed as:
SoC_min ≤ SoC_t ≤ SoC_max
where SoC_{n,t} is the state of charge of the n-th energy storage battery at time t, P_t^ch denotes the charging power level of the energy storage battery at time t, P^ch,max denotes its highest charging power level, P_t^dis denotes the discharging power level, and P^dis,max denotes its highest discharging power level. The energy storage system is set to have approximately the same initial capacity at the beginning of each day, giving the further constraint that the capacities at time t and t+24 are approximately equal,
where the sampling interval of the variable t is one hour and t+24 denotes the same time step on the next day.
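The SoC transition described above can be sketched as follows. This is a minimal illustration under assumed parameter names and values (capacity_kwh, eta_charge, and the like are not specified in the patent); it applies the charging-efficiency and leakage terms and enforces the SoC bounds:

```python
def soc_update(soc, p_charge, p_discharge, capacity_kwh,
               eta_charge=0.95, eta_discharge=0.95, beta=1.0,
               soc_min=0.0, soc_max=1.0):
    """One-step state-of-charge transition for the energy storage system.

    Charging energy is scaled by the charging efficiency, discharging energy
    by the discharging efficiency, and beta models energy leakage (beta = 1
    corresponds to the zero-self-loss assumption in the text). All numeric
    defaults are illustrative assumptions.
    """
    delta = (p_charge * eta_charge - p_discharge / eta_discharge) / capacity_kwh
    soc_next = beta * soc + delta
    # Enforce SoC_min <= SoC_{t+1} <= SoC_max
    return min(max(soc_next, soc_min), soc_max)


# Example: a 10 kWh battery at 50% SoC charged with 2 kW for one hour
print(soc_update(0.5, p_charge=2.0, p_discharge=0.0, capacity_kwh=10.0))
```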
New energy vehicle:
Taking the loss coefficient τ into account, the state of the new energy vehicle battery is likewise represented by its state of charge SoC at time t:
SoC_min ≤ SoC_t ≤ SoC_max
where SoC_{n,t} is the state of charge of the n-th new energy vehicle battery at time t, P_t^ch denotes the charging power level of the vehicle battery at time t, P^ch,max denotes its highest charging power level, P_t^dis denotes the discharging power level, and P^dis,max denotes its highest discharging power level.
Further, in the step S2, the Markov decision process is specifically the tuple (S, O, A, p(s_{i+1}|s_i, a_i), R), where S is the state space, O is the observation, A is the action space, p(s_{i+1}|s_i, a_i) is the probability that the current state s_i ∈ S becomes the next state s_{i+1} after action a_i is executed, r denotes the instant reward function, and R is the expected reward. Let the policy π: S → P(A) be the mapping from states to actions. Denoting the instant reward at time t by r_t, the expected reward of the policy is expressed as
R = E_π[ Σ_{t=0}^{∞} γ^t r_t ],
where γ ∈ (0, 1) is the discount factor.
Furthermore, in the step S2, based on the Markov decision process, the dual-agent reinforcement learning model of the home energy management system, based on a deep Q network and a deep deterministic policy gradient network, is constructed according to the home energy system model, specifically:
The total state of the household energy management system at time t is composed of the appliance states defined in the system model above, together with P_t^PV, the power of the photovoltaic power generation system at time t, P_t^err, the prediction error of the photovoltaic power generation system at time t, and λ_t, the electricity price at time t.
The total observation of the home energy management system at time t is defined accordingly, where the observed variables are estimates of the true system state.
Two agents, together called the dual agents, are instantiated separately with a deep Q network and a deep deterministic policy gradient network.
The deep Q network concentrates energy use in periods with lower energy prices by controlling the power levels and running times of the appliances. Its action space at time t comprises the charging power level of the electric vehicle, the discharging power level of the electric vehicle, the charging power level of the energy storage battery, and the discharging power level of the energy storage battery.
The deep deterministic policy gradient network reduces the influence of the prediction error of the photovoltaic power generation system by controlling the charging and discharging of the energy storage system; its action space at time t is the continuous charge/discharge power of the energy storage system.
This yields the dual-agent reinforcement learning model.
Furthermore, the reward function comprises the electricity cost, the user comfort, and the error compensation of the photovoltaic power generation system.
In the step S3, the reward function of the dual-agent reinforcement learning model is designed specifically as a multi-objective optimization problem that jointly optimizes the electricity cost, the residents' electricity-use comfort, and the use of renewable energy:
the reward is a weighted combination of these terms, where μ denotes the weight of each term, and the terms are the total electricity cost at time t, the user comfort cost at time t, and the error compensation cost at time t.
For the electricity cost term, λ_t denotes the price of electricity purchased from the external power grid at time t and P_t^load denotes the total power load of the home energy management system at time t; the lower the total load and the electricity price, the higher the reward. With P_t^g denoting the power level purchased from the grid and P_t^pv the power level supplied by the photovoltaic system, the constraint on P_t^load is:
P_t^load ≤ P_t^g + P_t^pv + P_t^ess
The user comfort cost comprises the influence of the electricity-use waiting time on satisfaction, the influence of the power of the controllable appliances on satisfaction, the influence of battery (range) anxiety on satisfaction, and the influence of excessive charging and discharging on satisfaction.
For the waiting-time term, θ_1 represents the mapping coefficient from waiting time to satisfaction.
For the controllable-appliance term, P_max is the maximum electric power and P_{n,t} is the power of the n-th appliance at time t.
For the excessive charge/discharge term, η_max is the coefficient representing the maximum allowable capacity of the battery and θ_4 is the mapping coefficient from the battery state of charge to satisfaction.
Smoothing the fluctuation of the photovoltaic system output with the energy storage system is also considered:
B_up denotes the upper limit of the prediction error, B_low denotes the lower limit of the prediction error, and P_t^err denotes the prediction-error power level, P_t^err = P_t^forecast - P_t^actual, where P_t^forecast is the predicted value of the current power and P_t^actual is its actual value. ζ_n represents the mapping from power to the reward; the smaller the prediction error, the larger the reward.
Further, when the prediction error of the photovoltaic power generation system exceeds B_up or falls below B_low, the error-compensation function of the energy storage system is activated. If P_t^err > B_up > 0, the predicted value is larger than the actual value and the energy storage system outputs electric energy; if P_t^err < B_low < 0, the predicted value is smaller than the actual value and the energy storage system stores electric energy.
When the predicted value is greater than the actual value, the actual charging power is subtracted from P_t^err; when the predicted value is smaller than the actual value, the actual discharging power is added to P_t^err.
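The dead-band compensation rule described above might be sketched as follows; the band limits, the power cap, and the sign convention (positive = discharge) are assumptions for illustration:

```python
def compensate_pv_error(p_err: float, b_up: float, b_low: float,
                        p_ess_max: float) -> float:
    """Dead-band compensation: the energy storage system only acts when the
    PV prediction error P_t^err leaves the band [b_low, b_up]. A positive
    return value means discharge (predicted > actual), a negative value
    means charge (predicted < actual). Limits are illustrative."""
    if p_err > b_up > 0:
        # Forecast exceeded actual PV output: ESS discharges to cover the gap.
        return min(p_err, p_ess_max)
    if p_err < b_low < 0:
        # Actual PV output exceeded the forecast: ESS absorbs the surplus.
        return max(p_err, -p_ess_max)
    return 0.0  # inside the dead band: no compensation


print(compensate_pv_error(0.8, b_up=0.3, b_low=-0.3, p_ess_max=2.0))   # discharge 0.8 kW
print(compensate_pv_error(-0.5, b_up=0.3, b_low=-0.3, p_ess_max=2.0))  # charge 0.5 kW
```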
Further, the deep Q network comprises a Q network and a target Q network; the deep deterministic policy gradient network comprises an Actor network and a Critic network; the Actor network comprises a first Adam optimizer, an Actor neural network, and an Actor target neural network; the Critic network comprises a second Adam optimizer, a Critic neural network, and a Critic target neural network; the dual-agent reinforcement learning model also comprises an experience buffer.
Furthermore, in the step S4, learning is performed through the dual-agent reinforcement learning model to obtain the optimal strategy, specifically:
During deep Q network learning, the Q network approximates the action-value function Q(s, a; θ), where θ is the Q network parameter, and Q(s, a; θ) is trained by gradient descent after the action-value error is calculated. The Q network copies θ to the target Q network every n steps. The target Q network approximates the target value max_{a'} Q(s', a'; θ^-), which enters the action-value error used to train the Q network by gradient descent.
During deep deterministic policy gradient learning, the Actor neural network continuously feeds its output after gradient descent into the first Adam optimizer, and the first Adam optimizer continuously updates the parameters of the Actor neural network; the Critic neural network continuously feeds its output after the Q-gradient processing into the second Adam optimizer, and the second Adam optimizer continuously updates the parameters of the Critic neural network. The Actor neural network feeds the action a = μ(s_t) that maximizes the value function under the policy μ into the Critic neural network; the Critic neural network updates the Actor neural network through momentum on the basis of a = μ(s_t). The Actor neural network soft-updates the Actor target neural network; the Actor target neural network outputs the action a' = μ'(s_{t+1}) obtained with the policy μ', and a' = μ'(s_{t+1}) is used to update the Critic target neural network.
In the home energy system model environment, the action output by the learned deep Q network at time t and the action output by the deep deterministic policy gradient network at time t are executed; according to the total observation, the quadruple updated after the interaction with the environment at time t is obtained and stored in the experience buffer. The experience buffer randomly samples a quadruple at time k, which is fed into the deep Q network and the deep deterministic policy gradient network respectively for the next round of learning, until the optimal strategy is obtained.
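The interaction between the two agents, the environment, and the experience buffer can be summarized in the following skeleton; the env, dqn_agent, and ddpg_agent objects and their act()/update() methods are placeholders assumed for illustration and are not defined in the patent:

```python
import random
from collections import deque

class ReplayBuffer:
    """Shared experience buffer storing (o_t, a_t, r_t, o_{t+1}) tuples."""
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def train(env, dqn_agent, ddpg_agent, episodes=200, steps_per_episode=24):
    """Skeleton of the dual-agent interaction loop described in the text."""
    buffer = ReplayBuffer()
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(steps_per_episode):
            a_dqn = dqn_agent.act(obs)        # discrete appliance/EV/ESS action
            a_ddpg = ddpg_agent.act(obs)      # continuous ESS compensation power
            next_obs, reward, done = env.step(a_dqn, a_ddpg)
            buffer.push((obs, (a_dqn, a_ddpg), reward, next_obs))
            batch = buffer.sample(64)
            dqn_agent.update(batch)           # gradient step on the Q network
            ddpg_agent.update(batch)          # actor/critic gradient steps
            obs = next_obs
            if done:
                break
```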
A home energy demand response optimization system based on deep reinforcement learning comprises an energy system module, a dual-agent reinforcement learning model construction module, a reward function module, and a home energy demand response optimization module:
the energy system module constructs a home energy system model according to the electric appliances managed by the home energy management system;
the dual-agent reinforcement learning model construction module is used for constructing, based on a Markov decision process, a dual-agent reinforcement learning model of the home energy management system based on a deep Q network and a deep deterministic policy gradient network according to the home energy system model;
the reward function module is used for designing the reward function of the dual-agent reinforcement learning model;
the home energy demand response optimization module is used for learning through the dual-agent reinforcement learning model based on the designed reward function, obtaining the optimal strategy, and optimizing the response of the home energy management system to the household energy demand.
The beneficial effects of the invention are as follows:
According to the invention, the different types of electric appliances in a household are considered as a whole, and a home energy system model is constructed; the home energy demand response problem is formulated as a Markov decision process, a dual-agent reinforcement learning model based on a deep Q network and a deep deterministic policy gradient network is designed, learning is performed through the dual-agent reinforcement learning model based on the designed reward function to obtain the optimal strategy, and the response of the home energy management system to the household energy demand is optimized. The method therefore solves the dimension explosion of the action space that may arise in the prior art, improves the convergence rate of the algorithm, and provides a better user experience.
Drawings
FIG. 1 is a schematic flow chart of a home energy demand response optimization method based on deep reinforcement learning.
Fig. 2 is a schematic diagram of a home energy management system and a home energy system model of a home energy demand response optimization method based on deep reinforcement learning.
FIG. 3 is an algorithm framework diagram of a home energy demand response optimization method based on deep reinforcement learning.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
Example 1
As shown in fig. 1, a method for optimizing the response of the household energy demand based on deep reinforcement learning comprises the following specific steps:
s1: constructing a home energy system model according to the electric appliances managed by the home energy management system;
s2: based on a Markov decision process, constructing a dual-agent reinforcement learning model of the home energy management system based on a deep Q network and a deep deterministic policy gradient network according to the home energy system model;
s3: designing a reward function of the dual-agent reinforcement learning model;
s4: based on the designed reward function, learning through the dual-agent reinforcement learning model to obtain an optimal strategy and optimize the response of the home energy management system to the household energy demand.
In this embodiment, the step S4 specifically includes: based on the designed reward function, learning is performed through the dual-agent reinforcement learning model; the agents interact with the environment through the executed actions and new states are generated; the Q value of the deep Q network is fitted and iterated by the neural network with the reward as the target so that the loss function decreases; this cycle is repeated a certain number of times, and thereby the response of the home energy management system to the household energy demand is optimized.
In one embodiment, as shown in fig. 2, the electric appliances managed by the home energy management system include an energy storage system, fixed-power appliances, time-shiftable appliances, power-controllable appliances, a new energy vehicle, and a photovoltaic power generation system. In this embodiment, the photovoltaic power generation system generates power from renewable energy; the system also comprises an external power grid; the photovoltaic power generation system and the external power grid supply power to the electric appliances managed by the home energy management system.
In this embodiment, the home energy management system receives the energy load information of the appliances it manages and information from the external power grid, and issues control instructions according to this information.
Example 2
More specifically, in a specific embodiment, in the step S1, a home energy system model is constructed according to the electric appliances managed by the home energy management system, and specifically:
Time-shiftable appliance:
In this embodiment, a time-shiftable appliance can shift its operation in time according to the needs of the consumer, so that it is turned on in low electricity-price periods and turned off in high electricity-price periods; its on/off state is represented as "on: 1", "off: 0".
The working time is defined within the range [t_n^start, t_n^end], where t_n^start denotes the earliest time at which the appliance is allowed to start running, t_n^end denotes the end time, and the running duration is d_n. t_n^actual denotes the time step at which the time-shiftable appliance actually starts to operate, P_{n,t} denotes the current power of the n-th appliance, and k_{n,t} is a Boolean variable representing the on/off state of the time-shiftable appliance. The state of the time-shiftable appliance is expressed in terms of k_{n,t} and P_{n,t}.
Assuming that the appliance, once started, completes a full operating cycle, t_n^actual satisfies the constraint that the whole cycle of duration d_n lies inside the window [t_n^start, t_n^end].
In the present embodiment, it is assumed that a full cycle must be completed once the time-shiftable appliance is started: if the running state at time t-1 is k_{n,t-1} = 1 and the appliance has not yet completed one cycle, then k_{n,t} = 1, and the same holds if time t reaches the critical value while the cycle is not yet complete.
Power-controllable appliance:
A power-controllable appliance can adjust its power level according to the user's needs, either to save energy or to improve user satisfaction. P_{n,t} denotes the current power of the n-th power-controllable appliance at time t, which varies between the minimum power P_n^min and the maximum power P_n^max of the n-th power-controllable appliance. The state of the power-controllable appliance is expressed in terms of P_{n,t}.
The constraint of the power-controllable appliance is expressed as P_n^min ≤ P_{n,t} ≤ P_n^max.
Power fixed electric appliance:
such an electricityThe energy consumption of the device is fixed in the using process and can not change along with the change of time, load or other factors, and the device is provided withFor the switching state of the nth power fixture at time t, the +.>To power the nth power fixture at time t, define the state of the power fixture as:
Energy storage system:
The home energy storage system can store electric energy and supply power when needed, and it can be used by the home energy management system at any time of day. The energy storage system is used to improve the energy utilization rate, compensate the errors of photovoltaic power generation, and balance the grid load. The state of the energy storage system is represented by its state of charge SoC at time t, where the SoC ranges from 0% to 100%. The SoC at time t+1 is determined by the previous time step and the action executed. When the energy storage system is charging, SoC_{t+1} is obtained from the charging power at time t multiplied by the charging efficiency, added to β·SoC_t, where β is the energy-leakage coefficient. Because a household battery system is small, its per-cycle self-loss is assumed to be small; assuming a self-loss rate τ = 0, the constraint of the energy storage system is expressed as:
SoC_min ≤ SoC_t ≤ SoC_max
where SoC_{n,t} is the state of charge of the n-th energy storage battery at time t, P_t^ch denotes the charging power level of the energy storage battery at time t, P^ch,max denotes its highest charging power level, P_t^dis denotes the discharging power level, and P^dis,max denotes its highest discharging power level. To better predict and plan energy use and supply and to maintain the stability and reliability of the system, the energy storage system is assumed to have approximately the same initial capacity at the beginning of each day, giving the further constraint:
the sampling interval of the variable t is one hour, t+24 denotes the same time step on the next day, and the initial capacity at the beginning of each day differs by no more than 0.005 of the maximum battery capacity.
New energy vehicle:
The state of the electric vehicle is modeled in roughly the same way as the state of the energy storage system, using SoC_t as its state. However, because the battery capacity of an electric vehicle is generally large and its driving range is affected by battery degradation, the loss coefficient τ is taken into account: each charge and discharge of the electric vehicle causes a certain loss. Considering the state of charge SoC at time t, the state of the vehicle battery is represented by:
SoC_min ≤ SoC_t ≤ SoC_max
where SoC_{n,t} is the state of charge of the n-th new energy vehicle battery at time t, P_t^ch denotes the charging power level of the vehicle battery at time t, P^ch,max denotes its highest charging power level, P_t^dis denotes the discharging power level, and P^dis,max denotes its highest discharging power level.
In a specific embodiment, in the step S2, the demand response and error compensation problem of the home energy management system is defined as a Markov decision process, whose objectives are to reduce the electricity cost, improve user satisfaction, and at the same time reduce the prediction error of the photovoltaic system. The Markov decision process is specifically the tuple (S, O, A, p(s_{i+1}|s_i, a_i), R), where S is the state space, O is the observation, A is the action space, p(s_{i+1}|s_i, a_i) is the probability that the current state s_i ∈ S becomes the next state s_{i+1} after action a_i is executed, r denotes the instant reward function, and R is the expected reward. Let the policy π: S → P(A) be the mapping from states to actions. Denoting the instant reward at time t by r_t, the expected reward of the policy is expressed as
R = E_π[ Σ_{t=0}^{∞} γ^t r_t ],
where γ ∈ (0, 1) is the discount factor: intuitively, rewards received now are worth more than rewards received in the future, and mathematically an undiscounted sum of infinitely many rewards does not necessarily converge to a finite value and is difficult to handle in the equations.
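The role of the discount factor can be stated precisely with a standard bound (a textbook fact, not taken from the patent): if the per-step reward is bounded by r_max, the discounted sum converges geometrically,

```latex
\left|\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} r_{\max}
  \;=\; \frac{r_{\max}}{1-\gamma},
  \qquad 0 < \gamma < 1,
```

which is why choosing γ ∈ (0, 1) keeps the expected reward finite and well defined.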
Furthermore, in the step S2, based on the Markov decision process, the dual-agent reinforcement learning model of the home energy management system, based on a deep Q network and a deep deterministic policy gradient network, is constructed according to the home energy system model, specifically:
The total state of the household energy management system at time t is composed of the appliance states defined in the system model above, together with P_t^PV, the power of the photovoltaic power generation system at time t, P_t^err, the prediction error of the photovoltaic power generation system at time t, and λ_t, the electricity price at time t.
The total observation of the home energy management system at time t is defined accordingly, where the observed variables are estimates of the true system state.
Two agents, together called the dual agents, are instantiated separately with a deep Q network and a deep deterministic policy gradient network.
In this embodiment, the dual agents have similar perception and decision processes and can learn and act from the same state, which reduces the repeated transmission and processing of state information; the deep Q network and the deep deterministic policy gradient network therefore share the same state, but each agent still learns and acts independently, and their policies and value functions differ.
In this embodiment, the two algorithms use different action spaces and therefore obtain different policies and value functions. Because electricity prices in the power market generally fluctuate with supply and demand and with peak-valley differences, the DQN concentrates energy use in periods with lower energy prices by controlling the power levels and running times of the appliances, thereby reducing the energy cost. The action space of the deep Q network at time t comprises the charging power level of the electric vehicle, the discharging power level of the electric vehicle, the charging power level of the energy storage battery, and the discharging power level of the energy storage battery.
In this embodiment, the deep deterministic policy gradient algorithm is used to reduce the influence of the prediction error of the photovoltaic power generation system by controlling the charging and discharging of the energy storage system; its action space at time t is the continuous charge/discharge power of the energy storage system.
This yields the dual-agent reinforcement learning model.
In one embodiment, the reward function comprises the electricity cost, the user comfort, and the error compensation of the photovoltaic power generation system.
In a specific embodiment, in the step S3, the reward function of the dual-agent reinforcement learning model is designed as follows. The reward function generally reflects the objective of the optimization problem; because the actions selected by the agent are driven by the reward, designing an appropriate reward function is crucial for a reinforcement learning algorithm, and the goal is to find a policy π that maximizes the expected reward. The reward function is therefore defined in terms of the electricity cost, the residents' electricity-use comfort, and the error compensation of renewable energy, i.e., as a multi-objective optimization problem that jointly optimizes the electricity cost, the residents' electricity-use comfort, and the use of renewable energy:
the reward is a weighted combination of these terms, where μ denotes the weight of each term, and the terms are the total electricity cost at time t, the user comfort cost at time t, and the error compensation cost at time t.
For the electricity cost term, λ_t denotes the price of electricity purchased from the external power grid at time t and P_t^load denotes the total power of the home energy management system at time t; the lower the total load and the electricity price, the higher the reward. With P_t^g denoting the power level purchased from the grid and P_t^pv the power level supplied by the photovoltaic system, the constraint on P_t^load is:
P_t^load ≤ P_t^g + P_t^pv + P_t^ess
The user comfort cost comprises the influence of the electricity-use waiting time on satisfaction, the influence of the power of the controllable appliances on satisfaction, the influence of battery (range) anxiety on satisfaction, and the influence of excessive charging and discharging on satisfaction.
For the waiting-time term, θ_1 represents the mapping coefficient from waiting time to satisfaction.
For the controllable-appliance term, P_max is the maximum electric power and P_{n,t} is the power of the n-th appliance at time t. Because higher power helps keep the power quality stable and reduces problems such as voltage fluctuation and power noise, a better power-quality experience improves user satisfaction; the smaller P_max - P_{n,t}, the higher the satisfaction.
In this embodiment, with the popularization of new energy vehicles, which are favored for their environmental and economic characteristics, driving range has always been one of their most disputed issues. The fundamental determinant of driving range is the SoC: the closer the SoC is to full, the more the owner's range anxiety is relieved; conversely, as the battery approaches its minimum allowed state of charge, range anxiety increases sharply. To alleviate this, the new energy vehicle is encouraged to keep a relatively high battery state of charge to guarantee travel, charging during low-price periods so that the battery stays at a level that yields higher satisfaction; the closer SoC_t is to the maximum battery state of charge, the higher the satisfaction.
In this embodiment, overcharging and overdischarging accelerate the decay of the new energy vehicle's battery capacity. Frequently charging or discharging the battery to extremely high or extremely low levels causes drastic changes in the chemical reactions inside the battery and reduces its capacity, which means the battery can store less energy per charge and the driving range decreases. Excessive charging and discharging can also degrade the power performance of the new energy vehicle: the state of charge directly affects the acceleration capability and maximum power output of the electric vehicle, and when the battery is overcharged or overdischarged its voltage and current output may become unstable, affecting the vehicle's power performance and driving experience. Furthermore, overcharging and overdischarging shorten the cycle life of the battery, i.e., the number of complete charge-discharge cycles it can undergo; frequent overcharging and overdischarging accelerate battery aging and shorten its life, causing the vehicle to lose performance and driving range more quickly. Therefore, to protect the new energy vehicle battery and extend its life, the influence of excessive charging and discharging on satisfaction is considered:
where η_max is the coefficient representing the maximum allowable capacity of the battery and θ_4 is the mapping coefficient from the battery state of charge to satisfaction.
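A rough sketch of how these four satisfaction terms could be combined into a single comfort cost is given below; the coefficients and functional forms are assumptions for illustration, since the patent only names the mappings (θ_1, θ_4, η_max, P_max) without specifying them here:

```python
def comfort_cost(wait_hours, p_n, p_max, soc_ev, soc_max,
                 theta1=0.1, theta2=0.05, theta4=0.2, eta_max=0.9):
    """Illustrative combination of the four satisfaction effects discussed
    above; all coefficients and the exact functional forms are assumptions."""
    wait_term = theta1 * wait_hours                      # longer waiting lowers satisfaction
    power_term = theta2 * (p_max - p_n)                  # smaller gap to P_max -> higher satisfaction
    anxiety_term = theta4 * (soc_max - soc_ev)           # fuller EV battery eases range anxiety
    overuse_term = max(0.0, soc_ev - eta_max * soc_max)  # penalize charging beyond the allowed capacity
    return wait_term + power_term + anxiety_term + overuse_term


print(comfort_cost(wait_hours=2, p_n=1.5, p_max=2.0, soc_ev=0.6, soc_max=1.0))
```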
Smoothing the fluctuation of the photovoltaic system output with the energy storage system is also considered, to improve the prediction accuracy:
B_up denotes the upper limit of the prediction error, B_low denotes the lower limit of the prediction error, and P_t^err denotes the prediction-error power level, P_t^err = P_t^forecast - P_t^actual, where P_t^forecast is the predicted value of the current power and P_t^actual is its actual value. ζ_n represents the mapping from power to the reward; the smaller the prediction error, the larger the reward.
In one embodiment, when the prediction error of the photovoltaic power generation system exceeds B_up or falls below B_low, the error-compensation function of the energy storage system is activated. If P_t^err > B_up > 0, the predicted value is larger than the actual value and the energy storage system outputs electric energy; if P_t^err < B_low < 0, the predicted value is smaller than the actual value and the energy storage system stores electric energy.
When the predicted value is greater than the actual value, the actual charging power is subtracted from P_t^err; when the predicted value is smaller than the actual value, the actual discharging power is added to P_t^err.
In one embodiment, as shown in fig. 3, the deep Q network comprises a Q network and a target Q network; the deep deterministic policy gradient network comprises an Actor network and a Critic network; the Actor network comprises a first Adam optimizer, an Actor neural network, and an Actor target neural network; the Critic network comprises a second Adam optimizer, a Critic neural network, and a Critic target neural network; the dual-agent reinforcement learning model also comprises an experience buffer.
In a specific embodiment, in the step S4, learning is performed through the dual-agent reinforcement learning model to obtain the optimal strategy, specifically:
During deep Q network learning, the Q network approximates the action-value function Q(s, a; θ), where θ is the Q network parameter, and Q(s, a; θ) is trained by gradient descent after the action-value error is calculated. The Q network copies θ to the target Q network every n steps. The target Q network approximates the target value max_{a'} Q(s', a'; θ^-), which enters the action-value error used to train the Q network by gradient descent.
During deep deterministic policy gradient learning, the Actor neural network continuously feeds its output after gradient descent into the first Adam optimizer, and the first Adam optimizer continuously updates the parameters of the Actor neural network; the Critic neural network continuously feeds its output after the Q-gradient processing into the second Adam optimizer, and the second Adam optimizer continuously updates the parameters of the Critic neural network. The Actor neural network feeds the action a = μ(s_t) that maximizes the value function under the policy μ into the Critic neural network; the Critic neural network updates the Actor neural network through momentum on the basis of a = μ(s_t). The Actor neural network soft-updates the Actor target neural network; the Actor target neural network outputs the action a' = μ'(s_{t+1}) obtained with the policy μ', and a' = μ'(s_{t+1}) is used to update the Critic target neural network.
In the home energy system model environment, the action output by the learned deep Q network at time t and the action output by the deep deterministic policy gradient network at time t are executed; according to the total observation, the quadruple updated after the interaction with the environment at time t is obtained and stored in the experience buffer. The experience buffer randomly samples a quadruple at time k, which is fed into the deep Q network and the deep deterministic policy gradient network respectively for the next round of learning, until the optimal strategy is obtained.
Compared with the prior art, the invention has the advantages that:
the prior art does not consider the problem of electrical comfort. The invention emphasizes that on the premise of saving energy and reducing cost, the living comfort and satisfaction of family members in the home are ensured, which means that the requirements and preferences of the family members are comprehensively considered through an intelligent control system and an optimization algorithm, and the use time and mode of energy equipment are flexibly adjusted so as to provide a comfortable, healthy, economical and efficient living environment.
The invention also considers the problem of the accuracy of demand prediction for renewable energy sources. In the practical application scene of household energy, the energy storage system is coordinated with the renewable energy, and the optimal energy storage strategy is obtained based on a data driving and depth deterministic strategy gradient algorithm by controlling the charge and discharge power level of the energy storage equipment, so that the fluctuation of the renewable energy is smoothed, and the reliability and the predictability of the renewable energy are improved.
In order to solve the problem of dimensional explosion possibly caused by action space in practical application, the invention adopts a double-agent reinforcement learning framework. According to the method, the problem can be decomposed into two layers, each layer corresponds to different action spaces, the former agent uses discrete action spaces and regards different intelligent households as a whole, the number of state-action pairs is greatly reduced, and the latter agent interacts with the environment by adopting continuous action spaces due to the continuity of control variables, so that a better optimization strategy is obtained. By the layering method, the calculation dimension of the algorithm in practical application can be greatly reduced, so that the convergence speed of the algorithm is improved, and better user experience is provided for users.
The invention takes the participation of residential electricity load in grid peak shaving as its entry point and considers using the household energy storage system to compensate the prediction error:
the invention considers different types of electric appliances in families in an integral mode, including time-shifting electric appliances, electric appliances with controllable power and electric appliances with fixed power, aims at minimizing the electricity cost of residents and maximizing the satisfaction of residents, establishes the demand response as a Markov decision process, and optimizes the user load curve by controlling the operation of various electric appliances and solving by using a deep Q network so as to achieve the purpose of intelligent electricity utilization.
In order to reduce the influence of the power generation prediction error of the photovoltaic system, the invention treats the charge and discharge power level of the energy storage system as a continuous random variable, proposes to compensate the prediction error with a deep deterministic policy gradient algorithm, and improves the prediction accuracy by selecting the optimal scheduling scheme of the energy storage device, so as to optimize the demand response strategy.
In the invention, because controlling multiple agents in the same environment is complex and the different appliances in a household do not have strong game-theoretic relations with one another, all devices in the household are treated as a single agent in order to reduce the number of state-action pairs and greatly reduce the computational complexity.
Example 3
More specifically, the problems that this patent mainly solves are:
1) Household energy electricity cost and electricity comfort level:
because in the electric power market, the electricity price generally fluctuates according to the supply and demand conditions and peak-valley differences, the household energy electricity consumption cost is unstable,
such volatility of electricity prices complicates the prediction and management of electricity costs and families may need to reasonably arrange high energy consumption activities such as washing clothes, cooking, etc. during periods of lower electricity prices to reduce energy costs. In addition, households may also consider utilizing the volatility of electricity prices to optimize energy management, such as charging electric vehicles, storing energy, etc. during periods of lower electricity prices. In order to cope with the fluctuation of electricity price, the invention provides a household energy management system, the energy utilization mode is flexibly adjusted by using an optimization algorithm through monitoring the energy demand condition so as to maximize the energy utilization and reduce the cost, besides, the invention also considers the satisfaction level of a user, and the invention emphasizes that the living comfort and satisfaction of household members in the household are ensured on the premise of saving the energy and reducing the cost, which means that the invention comprehensively considers the demands and the preferences of the household members through an intelligent control system and the optimization algorithm and flexibly adjusts the use time and the mode of energy equipment so as to provide a comfortable, healthy, economical and efficient living environment.
2) The prediction accuracy of renewable energy sources is improved:
Deploying renewable energy devices in the home greatly improves energy utilization and reduces energy consumption pressure, but renewable energy is easily affected by weather and natural conditions, which may make the household energy supply unstable, with situations of power shortage or surplus. When supply is in surplus, the energy storage system stores the surplus electricity to balance the residents' power demand during high-price periods, reducing the cost of purchasing electricity from the external grid; when supply falls short, the energy storage system releases electricity to relieve the pressure on energy consumption. Accurate renewable energy prediction therefore provides great help to a good demand response strategy. Accordingly, this patent treats the prediction error as a continuous variable, sets thresholds on the charge and discharge power levels, and provides an optimization algorithm framework for prediction accuracy in practical application scenarios, while also taking the electricity cost and electricity-use comfort into account, so as to achieve a better demand response strategy.
3) Interaction of smart home with environment:
Most research treats the different smart home appliances as individual agents and uses a multi-agent reinforcement learning framework to model the game relations between them, so as to optimize the demand response. However, multi-agent methods generally suffer from convergence difficulties, are very demanding in the design of hyperparameters and reward functions, can cause problems such as gradient explosion, and consume very large computing resources. To avoid these problems, the invention treats the different smart home appliances together as a single agent and, without affecting the stated objective function, reduces the coupling between appliances, so that the interactivity of the environment can still play its role while the computing resources and complexity are reduced.
Therefore, the invention provides a dual-agent reinforcement learning model for home energy demand response that takes renewable energy error compensation into account. It first establishes mathematical models and constraint conditions for the different smart appliances, then builds separate agents for demand response optimization and for error compensation optimization, and stores the Markov quadruples generated by interaction with the environment in an experience replay buffer to reduce the correlation between data. In this way it achieves the three objectives of reducing electricity cost, improving user electricity-use comfort and reducing the renewable energy prediction error, provides good electricity-use strategies for the different demands of residents, and helps balance power between the demand side and the supply side.
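By way of a non-limiting illustration, the following minimal Python sketch shows one possible way to store and sample the Markov quadruples mentioned above; the class name ReplayBuffer, the capacity and the batch size are assumptions introduced for the example and are not part of the claimed method.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores Markov quadruples (o_t, a_t, r_t, o_{t+1}) for off-policy learning."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest quadruples are discarded first

    def store(self, obs, action, reward, next_obs):
        self.buffer.append((obs, action, reward, next_obs))

    def sample(self, batch_size=64):
        # uniform random sampling breaks the temporal correlation between
        # consecutive interaction steps, as described in the text above
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```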
Example 4
A home energy demand response optimization system based on deep reinforcement learning comprises an energy system module, a dual-agent reinforcement learning model construction module, a reward function module and a home energy demand response optimization module:
the energy system module constructs a home energy system model according to the electric appliances managed by the home energy management system;
the dual-agent reinforcement learning model construction module is used for constructing, based on a Markov decision process and according to the home energy system model, a dual-agent reinforcement learning model of the home energy management system based on a deep Q network and a deep deterministic policy gradient network;
the reward function module is used for designing the reward function of the dual-agent reinforcement learning model;
the home energy demand response optimization module is used for learning through the dual-agent reinforcement learning model based on the designed reward function, obtaining the optimal policy and optimizing the response of the home energy management system to home energy demand.
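As a non-limiting illustration of how these four modules could be composed, the minimal Python sketch below wires them together in the order of steps S1-S4; all class and method names (HomeEnergyDemandResponseSystem, build, design, learn) are assumptions made for the example only.

```python
class HomeEnergyDemandResponseSystem:
    """Skeleton composition of the four modules described above."""
    def __init__(self, energy_system_module, model_builder, reward_module, optimizer_module):
        self.energy_system_module = energy_system_module  # builds the home energy system model
        self.model_builder = model_builder                 # builds the DQN + DDPG dual-agent model
        self.reward_module = reward_module                 # designs the reward function
        self.optimizer_module = optimizer_module           # learns the optimal policy

    def run(self, appliance_config):
        env_model = self.energy_system_module.build(appliance_config)      # step S1
        agents = self.model_builder.build(env_model)                       # step S2
        reward_fn = self.reward_module.design()                            # step S3
        return self.optimizer_module.learn(agents, env_model, reward_fn)   # step S4
```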
It is to be understood that the above examples of the present invention are provided by way of illustration only and are not intended to limit the embodiments of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention is intended to be protected by the following claims.

Claims (10)

1. A home energy demand response optimization method based on deep reinforcement learning, characterized in that the method comprises the following specific steps:
S1: constructing a home energy system model according to the electric appliances managed by the home energy management system;
S2: based on a Markov decision process, constructing a dual-agent reinforcement learning model of the home energy management system based on a deep Q network and a deep deterministic policy gradient network according to the home energy system model;
S3: designing a reward function of the dual-agent reinforcement learning model;
S4: based on the designed reward function, learning through the dual-agent reinforcement learning model to obtain the optimal strategy and optimize the response of the home energy management system to home energy demand.
2. The home energy demand response optimization method based on deep reinforcement learning of claim 1, wherein: the electric appliances managed by the home energy management system include an energy storage system, power-fixed appliances, time-shiftable appliances, power-controllable appliances, a new energy vehicle and a photovoltaic power generation system.
3. The home energy demand response optimization method based on deep reinforcement learning according to claim 2, wherein: in the step S1, a home energy system model is constructed according to the electric appliances managed by the home energy management system, specifically:
Time-shiftable appliance:
The working time of the nth time-shiftable appliance is defined within the range $[t_n^{\rm start}, t_n^{\rm end}]$, where $t_n^{\rm start}$ indicates the set start time from which the appliance is allowed to begin running, $t_n^{\rm end}$ represents the end time, the running time is $d_n$, $t_n^{\rm act}$ represents the time step at which the time-shiftable appliance actually starts to operate, $P_{n,t}^{\rm shift}$ represents the current power of the nth time-shiftable appliance, and $k_{n,t}$ is a Boolean variable representing the on-off state of the time-shiftable appliance; the state of the time-shiftable appliance is expressed in terms of $k_{n,t}$ and $P_{n,t}^{\rm shift}$.
Assuming the appliance completes a full operating cycle once it is started, $k_{n,t}$ satisfies the constraint
$$\sum_{t = t_n^{\rm start}}^{t_n^{\rm end}} k_{n,t} = d_n .$$
If the running state at time $t-1$ satisfies $k_{n,t-1}=1$ and the cycle is not yet complete, then $k_{n,t}=1$; if $t$ reaches the critical value $t_n^{\rm end} - d_n$, then $k_{n,t}=1$.
Power-controllable appliance:
Let $P_{n,t}^{\rm ctrl}$ represent the current power of the nth power-controllable appliance at time $t$, which varies between $P_n^{\min}$ and $P_n^{\max}$, where $P_n^{\max}$ is the maximum power of the nth power-controllable appliance and $P_n^{\min}$ is the lowest power of the nth power-controllable appliance; the state of the power-controllable appliance is expressed by $P_{n,t}^{\rm ctrl}$.
The constraint of the power-controllable appliance is expressed as
$$P_n^{\min} \le P_{n,t}^{\rm ctrl} \le P_n^{\max} .$$
Power-fixed appliance:
Let $u_{n,t}$ be the switching state of the nth power-fixed appliance at time $t$ and $P_{n,t}^{\rm fix}$ be the power of the nth power-fixed appliance at time $t$; the state of the power-fixed appliance is defined by $u_{n,t}$ and $P_{n,t}^{\rm fix}$.
Energy storage system:
The state of the energy storage system is represented by its state of charge $SoC_t$ at time $t$, where $SoC$ ranges from 0% to 100%; the $SoC$ state at time $t+1$ is determined by the previous time step and the action executed. When the energy storage system is charging, $SoC_{t+1}$ is obtained by multiplying the charging power at time $t$ by the charging efficiency and adding $\beta\,SoC_t$ to account for energy leakage, where $\beta$ represents the energy-leakage coefficient. Assuming the self-loss per cycle of the energy storage system is small, the self-loss rate is taken as $\tau = 0$, and the constraint of the energy storage system is expressed as
$$SoC_{\min} \le SoC_t \le SoC_{\max},$$
where $SoC_{n,t}$ is the state of charge of the nth energy storage battery at time $t$, $P_{n,t}^{\rm ch}$ represents the charging power level of the energy storage battery at time $t$, $P^{\rm ch}_{\max}$ represents the highest charging power level of the energy storage battery, $P_{n,t}^{\rm dis}$ represents the discharging power level of the energy storage battery, and $P^{\rm dis}_{\max}$ represents the highest discharging power level of the energy storage battery. In order to improve the stability of the system and its capacity to cope with power fluctuations, the energy storage system is required to have an approximately equal initial capacity at the beginning of each day, giving the further constraint
$$SoC_t \approx SoC_{t+24},$$
where the sampling interval of the variable $t$ is one hour and $t+24$ represents the same time step on the next day.
New energy vehicle:
Considering the loss coefficient $\tau$, the state of the new energy vehicle battery is likewise represented by its state of charge $SoC_t$ at time $t$, with the constraint
$$SoC_{\min} \le SoC_t \le SoC_{\max},$$
where $SoC_{n,t}$ is the state of charge of the nth new energy vehicle battery at time $t$, $P_{n,t}^{\rm ev,ch}$ represents the charging power level of the new energy vehicle battery at time $t$, $P^{\rm ev,ch}_{\max}$ represents the highest charging power level of the new energy vehicle battery, $P_{n,t}^{\rm ev,dis}$ represents the discharging power level of the new energy vehicle battery, and $P^{\rm ev,dis}_{\max}$ represents the highest discharging power level of the new energy vehicle battery.
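For illustration only, the following Python sketch gives a minimal numerical rendering of the appliance and storage constraints above; the function names, the efficiency values eta_ch and eta_dis, the leakage coefficient beta and the SoC limits are assumptions chosen for the example and do not fix the claimed model.

```python
import numpy as np

def shiftable_on(t, t_start, t_end, started_at, duration):
    """Boolean on/off state k_{n,t}: once started, the appliance runs for
    `duration` consecutive steps inside the allowed window [t_start, t_end]."""
    if started_at is None:
        return False
    return t_start <= t <= t_end and started_at <= t < started_at + duration

def controllable_power(requested_power, p_min, p_max):
    """Power-controllable appliance: the power is kept within [P_min, P_max]."""
    return float(np.clip(requested_power, p_min, p_max))

def ess_soc_update(soc, p_charge, p_discharge, capacity_kwh,
                   eta_ch=0.95, eta_dis=0.95, beta=0.999, dt_hours=1.0):
    """One-step SoC update with charging efficiency, an assumed discharging
    efficiency and a leakage coefficient beta, clipped to [0, 1] (0%-100%)."""
    delta = (p_charge * eta_ch - p_discharge / eta_dis) * dt_hours / capacity_kwh
    return float(np.clip(beta * soc + delta, 0.0, 1.0))
```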
4. A home energy demand response optimization method based on deep reinforcement learning as claimed in claim 3, wherein: in the step S2, the Markov decision process is specifically the tuple $(S, O, A, p(s_{i+1} \mid s_i, a_i), R)$, where $S$ is the state space, $O$ is the observation space, $A$ is the action space, $p(s_{i+1} \mid s_i, a_i)$ represents the probability of transitioning to the next state $s_{i+1}$ after executing action $a_i$ in the current state $s_i \in S$, $r$ represents the instant reward function and $R$ is the expected reward; the strategy $\pi: S \rightarrow P(A)$ is the mapping from states to actions; the instant reward at time $t$ is denoted $r_t$, and the expected reward of the strategy is expressed as
$$R = \mathbb{E}\Big[\sum_{t} \gamma^{t} r_t\Big],$$
where $\gamma \in (0, 1)$ represents the discount factor.
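As a small worked example of the discounted return defined above (not part of the claim), the sketch below evaluates the return for a single sampled trajectory of instant rewards; the function name and the example values are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.99):
    """R = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one sampled trajectory."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# e.g. with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 2.71
```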
5. The home energy demand response optimization method based on deep reinforcement learning of claim 4, wherein: in the step S2, based on the Markov decision process, the dual-agent reinforcement learning model of the home energy management system based on a deep Q network and a deep deterministic policy gradient network is constructed according to the home energy system model, specifically as follows:
The total state of the household energy management system at the time t is expressed as:
wherein the appliance states, the energy storage system state and the new energy vehicle state are those defined in the system model formulas, $P_t^{PV}$ represents the power of the photovoltaic power generation system at time $t$, $P_t^{err}$ represents the prediction error of the photovoltaic power generation system at time $t$, and $\lambda_t$ represents the electricity price at time $t$;
the total observed value of the home energy management system at the time t is expressed as:
wherein each observed variable is an estimate of the corresponding true system state;
two agents, called dual agents, are instantiated separately using a deep Q network and a deep deterministic policy gradient network;
the deep Q network concentrates energy use in periods of lower energy prices by controlling the power levels and running times of the appliances; the action space of the deep Q network at time $t$ is expressed as:
wherein $P_t^{\rm ev,ch}$ represents the power level of electric vehicle charging, $P_t^{\rm ev,dis}$ represents the power level of electric vehicle discharging, $P_t^{\rm ess,ch}$ represents the power level of energy storage battery charging, and $P_t^{\rm ess,dis}$ represents the power level of energy storage battery discharging;
the deep deterministic policy gradient network reduces the influence of the prediction error of the photovoltaic power generation system by controlling the charging and discharging of the energy storage system; the action space of the deep deterministic policy gradient network at time $t$ is expressed as:
thus obtaining the dual-agent reinforcement learning model.
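The following minimal sketch illustrates one possible encoding of the total state and of the two action spaces described in this claim; the variable names, the discrete power levels and the power bounds are assumptions made for the example.

```python
import numpy as np

def total_state(appliance_states, soc_ess, soc_ev, p_pv, p_err, price):
    # s_t collects the appliance states, storage and EV SoC, PV power,
    # PV prediction error and electricity price into one vector
    return np.concatenate([np.asarray(appliance_states, dtype=float),
                           [soc_ess, soc_ev, p_pv, p_err, price]])

# DQN agent: discrete charge/discharge power levels for the EV and the
# energy storage battery (5 example levels per device; negative = discharge)
DQN_POWER_LEVELS = [-2.0, -1.0, 0.0, 1.0, 2.0]  # kW

# DDPG agent: a continuous storage charge/discharge set-point used for PV
# error compensation, bounded by assumed storage power limits
def ddpg_action(raw_action, p_dis_max=3.0, p_ch_max=3.0):
    return float(np.clip(raw_action, -p_dis_max, p_ch_max))
```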
6. The home energy demand response optimization method based on deep reinforcement learning of claim 5, wherein: the reward function comprises the electric power cost, the user comfort level and the error compensation of the photovoltaic power generation system;
in the step S3, the reward function of the dual-agent reinforcement learning model is designed, specifically: the reward function is designed as a multi-objective optimization problem that jointly optimizes the electric power cost, residents' electricity-use comfort and renewable energy utilization:
$$r_t = -\big(\mu_1 C_t^{\rm price} + \mu_2 C_{i,t}^{\rm comf} + \mu_3 C_t^{\rm err}\big),$$
wherein $\mu$ represents the weight of each term, $C_t^{\rm price}$ is the total electricity cost at time $t$, $C_{i,t}^{\rm comf}$ is the comfort cost of user $i$ at time $t$, and $C_t^{\rm err}$ is the error compensation cost at time $t$;
wherein $\lambda_t$ represents the price of electricity purchased from the external power grid at time $t$ and $P_t^{load}$ is the total power load of the home energy management system, so $\lambda_t P_t^{load}$ is the total electricity cost at time $t$; the lower the load and the electricity price, the higher the reward. $P_t^{g}$ represents the power level purchased from the grid, $P_t^{pv}$ represents the power level input from the photovoltaic system, and $P_t^{ess}$ represents the power level supplied by the energy storage system; the constraint on $P_t^{load}$ is:
$$P_t^{load} \le P_t^{g} + P_t^{pv} + P_t^{ess}$$
The user comfort cost includes the influence of the waiting time for power consumption on satisfaction, the influence of the power level of the controllable appliance on satisfaction, the influence of battery anxiety on satisfaction, and the influence of excessive charging and discharging on satisfaction.
wherein $\theta_1$ represents the mapping from time to the satisfaction coefficient;
wherein $P_{\max}$ is the maximum electric power and $P_{n,t}$ is the power of the nth appliance at time $t$;
wherein $\eta_{\max}$ represents the coefficient of the maximum allowable battery capacity, and $\theta_4$ represents the mapping from the battery state of charge to satisfaction.
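As a non-limiting illustration of the multi-objective reward described in this claim, the sketch below combines the three cost terms with weights; the sign convention (reward as the negative weighted cost), the weight values and the function names are assumptions for the example.

```python
def electricity_cost(price, p_load, dt_hours=1.0):
    # lambda_t * P_t^load: lower prices and lower load yield a higher reward
    return price * p_load * dt_hours

def reward(elec_cost, comfort_cost, err_comp_cost, mu=(0.5, 0.3, 0.2)):
    # weighted multi-objective reward; each cost enters with a negative sign
    # because the agent maximizes the reward
    return -(mu[0] * elec_cost + mu[1] * comfort_cost + mu[2] * err_comp_cost)

# example: price 0.8 per kWh and a 2 kW load for one hour, with assumed
# comfort and error-compensation costs
print(reward(electricity_cost(0.8, 2.0), comfort_cost=0.5, err_comp_cost=0.1))
```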
7. The home energy demand response optimization method based on deep reinforcement learning of claim 6, wherein: when the prediction error of the photovoltaic power generation system exceeds $B_{up}$ or falls below $B_{low}$, the prediction-error compensation function of the energy storage system is activated; if $P_t^{err} > B_{up} > 0$, the predicted value is larger than the actual value and the energy storage system outputs electric energy; if $P_t^{err} < B_{low} < 0$, the predicted value is smaller than the actual value and the energy storage system stores electric energy;
when the predicted value is greater than the actual value, the actual charging power is subtracted from $P_t^{err}$; when the predicted value is smaller than the actual value, the actual discharging power is added to $P_t^{err}$;
smoothing the fluctuation of the output of the photovoltaic system by using the energy storage system is also considered:
wherein $B_{up}$ represents the upper limit of the prediction error, $B_{low}$ represents the lower limit of the prediction error, and $P_t^{err}$ represents the prediction error power level, $P_t^{err} = P_t^{forecast} - P_t^{actual}$, where $P_t^{forecast}$ represents the predicted value of the current power and $P_t^{actual}$ represents the actual value of the current power; $\zeta_n$ represents the mapping from power to satisfaction, and the smaller the remaining prediction error, the larger the reward function.
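For illustration only, the sketch below implements the threshold rule of this claim: the storage system reacts only when the prediction error $P_t^{err}$ leaves the dead band $[B_{low}, B_{up}]$; the function name and the power limits are assumptions.

```python
def compensate_pv_error(p_err, b_up, b_low, p_dis_max=3.0, p_ch_max=3.0):
    """Threshold-based PV error compensation with p_err = forecast - actual."""
    if p_err > b_up > 0:
        # forecast exceeded actual PV output: discharge to cover the shortfall
        return {"mode": "discharge", "power": min(p_err, p_dis_max)}
    if p_err < b_low < 0:
        # actual PV output exceeded the forecast: charge to absorb the surplus
        return {"mode": "charge", "power": min(-p_err, p_ch_max)}
    return {"mode": "idle", "power": 0.0}

print(compensate_pv_error(1.2, b_up=0.5, b_low=-0.5))  # discharge 1.2 kW
```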
8. The home energy demand response optimization method based on deep reinforcement learning of claim 7, wherein: the deep Q network comprises a Q network and a target Q network; the deep deterministic policy gradient network comprises an Actor network and a Critic network; the Actor network comprises a 1st Adam optimizer, an Actor neural network and an Actor target neural network; the Critic network comprises a 2nd Adam optimizer, a Critic neural network and a Critic target neural network; the dual-agent reinforcement learning model further comprises an experience buffer.
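By way of a non-limiting sketch of the network components listed in this claim, the PyTorch snippet below instantiates a Q network with a target copy and an Actor/Critic pair, each with its own Adam optimizer; the layer sizes, learning rates and dimension values are assumptions chosen for illustration.

```python
import copy
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128, out_act=None):
    layers = [nn.Linear(in_dim, hidden), nn.ReLU(),
              nn.Linear(hidden, hidden), nn.ReLU(),
              nn.Linear(hidden, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

state_dim, n_discrete_actions, cont_action_dim = 16, 25, 1

q_net = mlp(state_dim, n_discrete_actions)                    # Q network
q_target = copy.deepcopy(q_net)                               # target Q network

actor = mlp(state_dim, cont_action_dim, out_act=nn.Tanh())    # Actor neural network
actor_target = copy.deepcopy(actor)                           # Actor target neural network
critic = mlp(state_dim + cont_action_dim, 1)                  # Critic neural network
critic_target = copy.deepcopy(critic)                         # Critic target neural network

q_optim = torch.optim.Adam(q_net.parameters(), lr=1e-3)
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-4)   # 1st Adam optimizer
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3) # 2nd Adam optimizer
```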
9. The home energy demand response optimization method based on deep reinforcement learning of claim 8, wherein: in the step S4, learning is performed through the dual-agent reinforcement learning model to obtain the optimal strategy, which specifically includes:
during deep Q network learning, the Q network approximates the action-value function $Q(s, a; \theta)$, where $\theta$ denotes the Q network parameters; after the action-value error is calculated, $Q(s, a; \theta)$ is trained on the Q network through gradient descent; the Q network copies $\theta$ to the target Q network every $n$ steps; the target Q network approximates the target value function $\max_{a'} Q(s', a'; \theta^{-})$, and after the action-value error against $\max_{a'} Q(s', a'; \theta^{-})$ is calculated, the Q network is trained through gradient descent;
during deep deterministic policy gradient network learning, the Actor neural network continuously feeds its output after gradient descent into the 1st Adam optimizer, and the 1st Adam optimizer continuously updates the parameters of the Actor neural network; the Critic neural network continuously feeds its output after the Q-gradient processing into the 2nd Adam optimizer, and the 2nd Adam optimizer continuously updates the parameters of the Critic neural network; the Actor neural network uses the policy $\mu$ to find the action $a = \mu(s_t)$ that maximizes the value function and inputs it into the Critic neural network; the Critic neural network updates the Actor neural network through momentum according to $a = \mu(s_t)$; the Actor neural network performs a soft update on the Actor target neural network; the Actor target neural network outputs the action $a' = \mu'(s_{t+1})$ obtained with the policy $\mu'$, and the Critic target neural network is updated with $a' = \mu'(s_{t+1})$;
in the environment of the home energy system model, the action output at time $t$ by the learned deep Q network and the action output at time $t$ by the deep deterministic policy gradient network are executed; according to the total observation value, the quadruple $(o_t, a_t, r_t, o_{t+1})$ updated after the interaction with the environment at time $t$ is obtained and input into the experience buffer; the quadruple at a time $k$ is randomly sampled from the experience buffer to obtain $(o_k, a_k, r_k, o_{k+1})$, which is respectively input into the deep Q network and the deep deterministic policy gradient network for the next round of learning, until the optimal strategy is obtained.
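The compressed PyTorch sketch below mirrors the sequence of operations in this claim (action-value error, periodic hard copy to the target Q network, Critic/Actor updates and soft target updates); it reuses the network and optimizer names assumed in the sketch after claim 8, and the hyper-parameter values GAMMA, TAU and COPY_EVERY are likewise assumptions, not the claimed procedure.

```python
import torch
import torch.nn.functional as F
# q_net, q_target, q_optim, actor, actor_target, critic, critic_target,
# actor_optim and critic_optim are the modules defined in the previous sketch.

GAMMA, TAU, COPY_EVERY = 0.99, 0.005, 200

def dqn_step(batch, step):
    s, a, r, s_next = batch                               # tensors sampled from the buffer
    with torch.no_grad():
        target = r + GAMMA * q_target(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)                       # action-value (TD) error
    q_optim.zero_grad(); loss.backward(); q_optim.step()
    if step % COPY_EVERY == 0:                            # copy theta to the target network
        q_target.load_state_dict(q_net.state_dict())

def ddpg_step(batch):
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + GAMMA * critic_target(torch.cat([s_next, a_next], dim=1)).squeeze(1)
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = F.mse_loss(q, y)                        # Critic update via the 2nd Adam optimizer
    critic_optim.zero_grad(); critic_loss.backward(); critic_optim.step()
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_optim.zero_grad(); actor_loss.backward(); actor_optim.step()
    for tgt, src in ((actor_target, actor), (critic_target, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):  # soft target update
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```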
10. A home energy demand response optimization system based on deep reinforcement learning, characterized in that it comprises an energy system module, a dual-agent reinforcement learning model construction module, a reward function module and a home energy demand response optimization module:
the energy system module constructs a home energy system model according to the electric appliances managed by the home energy management system;
the dual-agent reinforcement learning model construction module is used for constructing, based on a Markov decision process and according to the home energy system model, a dual-agent reinforcement learning model of the home energy management system based on a deep Q network and a deep deterministic policy gradient network;
the reward function module is used for designing the reward function of the dual-agent reinforcement learning model;
the home energy demand response optimization module is used for learning through the dual-agent reinforcement learning model based on the designed reward function, obtaining the optimal policy and optimizing the response of the home energy management system to home energy demand.
CN202310984893.7A 2023-08-04 2023-08-04 Deep reinforcement learning-based household energy demand response optimization method and system Pending CN117057553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310984893.7A CN117057553A (en) 2023-08-04 2023-08-04 Deep reinforcement learning-based household energy demand response optimization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310984893.7A CN117057553A (en) 2023-08-04 2023-08-04 Deep reinforcement learning-based household energy demand response optimization method and system

Publications (1)

Publication Number Publication Date
CN117057553A true CN117057553A (en) 2023-11-14

Family

ID=88656561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310984893.7A Pending CN117057553A (en) 2023-08-04 2023-08-04 Deep reinforcement learning-based household energy demand response optimization method and system

Country Status (1)

Country Link
CN (1) CN117057553A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117350515A (en) * 2023-11-21 2024-01-05 安徽大学 Ocean island group energy flow scheduling method based on multi-agent reinforcement learning
CN117350515B (en) * 2023-11-21 2024-04-05 安徽大学 Ocean island group energy flow scheduling method based on multi-agent reinforcement learning
CN117833307A (en) * 2023-12-08 2024-04-05 三峡大学 Household micro-grid group optimization method based on approximate collective strategy and independent learner
CN117833307B (en) * 2023-12-08 2024-06-11 三峡大学 Household micro-grid group optimization method based on approximate collective strategy and independent learner
CN117578679A (en) * 2024-01-15 2024-02-20 太原理工大学 Lithium battery intelligent charging control method based on reinforcement learning
CN117578679B (en) * 2024-01-15 2024-03-22 太原理工大学 Lithium battery intelligent charging control method based on reinforcement learning
CN117726143A (en) * 2024-02-07 2024-03-19 山东大学 Environment-friendly micro-grid optimal scheduling method and system based on deep reinforcement learning
CN117726143B (en) * 2024-02-07 2024-05-17 山东大学 Environment-friendly micro-grid optimal scheduling method and system based on deep reinforcement learning
CN117811051A (en) * 2024-02-27 2024-04-02 华东交通大学 Micro-grid elasticity control method based on demand side response
CN117811051B (en) * 2024-02-27 2024-05-07 华东交通大学 Micro-grid elasticity control method based on demand side response

Similar Documents

Publication Publication Date Title
Zhang et al. Multi-objective load dispatch for microgrid with electric vehicles using modified gravitational search and particle swarm optimization algorithm
CN117057553A (en) Deep reinforcement learning-based household energy demand response optimization method and system
CN112713618B (en) Active power distribution network source network load storage cooperative optimization operation method based on multi-scene technology
CN108429288B (en) Off-grid type microgrid energy storage optimization configuration method considering demand response
CN108092290B (en) Microgrid energy configuration method combining energy storage capacity configuration and optimized operation
CN113572157B (en) User real-time autonomous energy management optimization method based on near-end policy optimization
CN114091879A (en) Multi-park energy scheduling method and system based on deep reinforcement learning
CN105337310B (en) A kind of more microgrid Economical Operation Systems of cascaded structure light storage type and method
Chen et al. Intelligent energy scheduling in renewable integrated microgrid with bidirectional electricity-to-hydrogen conversion
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN113326994A (en) Virtual power plant energy collaborative optimization method considering source load storage interaction
CN112217195A (en) Cloud energy storage charging and discharging strategy forming method based on GRU multi-step prediction technology
CN116436019B (en) Multi-resource coordination optimization method, device and storage medium
Dong et al. Optimal scheduling framework of electricity-gas-heat integrated energy system based on asynchronous advantage actor-critic algorithm
CN115115130A (en) Wind-solar energy storage hydrogen production system day-ahead scheduling method based on simulated annealing algorithm
CN117543581A (en) Virtual power plant optimal scheduling method considering electric automobile demand response and application thereof
CN113555888B (en) Micro-grid energy storage coordination control method
CN115940284A (en) Operation control strategy of new energy hydrogen production system considering time-of-use electricity price
Zhang et al. Transfer deep reinforcement learning-based large-scale V2G continuous charging coordination with renewable energy sources
CN115411776A (en) Thermoelectric cooperative scheduling method and device for residential comprehensive energy system
CN114742453A (en) Micro-grid energy management method based on Rainbow deep Q network
Yu et al. A fuzzy Q-learning algorithm for storage optimization in islanding microgrid
Xin et al. Genetic based fuzzy Q-learning energy management for smart grid
Park et al. Reinforcement Learning base DR Method for ESS SoC Optimization and Users Satisfaction
CN111275572A (en) Unit scheduling system and method based on particle swarm and deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination