CN112488531B - Heterogeneous flexible load real-time regulation and control method and device based on deep reinforcement learning - Google Patents
- Publication number: CN112488531B (application number CN202011389959.0A)
- Authority: CN (China)
- Prior art keywords: load, time, real, regulation, function
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06Q10/0631: Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06313: Resource planning in a project environment
- G06Q10/0637: Strategic management or analysis, e.g. setting a goal or target of an organisation
- G06Q50/06: Energy or water supply
- G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks
- G06N3/08: Neural networks; learning methods
- H02J3/00: Circuit arrangements for AC mains or AC distribution networks
- H02J3/14: Adjusting voltage in AC networks by switching loads on to, or off from, the network, e.g. progressively balanced loading
- H02J2203/20: Simulating, e.g. planning, reliability check, modelling or computer-assisted design [CAD]
- Y02B70/3225: Demand response systems, e.g. load shedding, peak shaving
- Y04S20/222: Demand response systems, e.g. load shedding, peak shaving
Abstract
The application discloses a heterogeneous flexible load real-time regulation and control method and device based on deep reinforcement learning, which solve the technical problems that, under existing load regulation and control approaches, the response capability of user-side flexible loads is low and the demand-response potential of the user side is difficult to tap.
Description
Technical Field
The application relates to the technical field of load regulation and control of power systems, in particular to a heterogeneous flexible load real-time regulation and control method and device based on deep reinforcement learning.
Background
As large numbers of diverse flexible loads on the demand side are connected to the grid and participate in its regulation, the heterogeneous characteristics of flexible loads have become increasingly prominent, and handling this heterogeneity has become a key problem for practical regulation and application. Heterogeneous loads exhibit two modes of heterogeneity: type and parameter. In general, loads of different types are type-heterogeneous, while loads of the same type but with different inherent parameters are parameter-heterogeneous. Modeling heterogeneous flexible loads is the foundation of flexible load regulation.
Conventional load regulation and control models heterogeneous loads with fixed physical parameters and then clusters them, by parameter similarity, into homogeneous or equivalent groups for target optimization and unified scheduling, but it is difficult to avoid the complexity of the heterogeneous equipment's physical parameters. For example, for a temperature-controlled load, the conventional method establishes a first-order thermodynamic model based mainly on the load's dynamic temperature characteristics and periodic operating mode. However, because load types are numerous, parameters differ sharply, and regulation depends on many kinds of sensing and interaction information, the response capability of user-side flexible loads is reduced and the demand-response potential of the user side is difficult to tap.
Disclosure of Invention
The application provides a heterogeneous flexible load real-time regulation and control method and device based on deep reinforcement learning, which are used to solve the technical problems that, under existing load regulation and control approaches, the response capability of user-side flexible loads is low and the demand-response potential of the user side is difficult to tap.
In view of this, the first aspect of the present application provides a method for real-time regulation and control of heterogeneous flexible loads based on deep reinforcement learning, including:
respectively establishing a single flexible load dynamic model for each type of heterogeneous flexible load of the power system, to obtain the state variable, action variable, environment variable and reward function of each single flexible load;
establishing a heterogeneous flexible load aggregation model according to the state variables, action variables, environment variables and reward functions of all the single flexible loads, wherein the heterogeneous flexible load aggregation model comprises the state variable, state space, action variable, action space and state-transition function of the aggregated load;
applying the aggregation model to the real-time regulation and control environment of the power system to obtain the reward function of the aggregated load participating in real-time response;
establishing an aggregated-load real-time regulation and control deep reinforcement learning model, and training it according to the state variable, action variable, state-transition function and real-time-response reward function of the aggregated load, to obtain a real-time optimization regulation and control decision model for flexible load aggregation;
and inputting the state variable of the target aggregated load into the real-time optimization regulation and control decision model for flexible load aggregation, to obtain the optimal strategy for real-time regulation and control of the aggregated load.
Optionally, the single flexible load dynamics model includes a load temperature control dynamics function, a user discomfort function, and a reward function.
Optionally, the heterogeneous flexible load aggregation model is:

s(t+1) = F_transition(s(t), a(t), w(t))

where s(t+1) is the state variable of the aggregated load at time t+1, s(t) is the state variable of the aggregated load at time t, a(t) is the action variable of the aggregated load at time t, and w(t) is the environment variable at time t. In the associated reward function, R_agg(t) is the reward of the aggregated load at time t, r_DR(t) is the total revenue from the aggregated load's participation in demand response at time t, the summed discomfort terms give the total user discomfort, and lambda(t)(P_agg(t) - P_base(t))*dt is the electricity-fee expenditure measured against the baseline load.
Optionally, the aggregated-load real-time regulation and control deep reinforcement learning model is trained with a deep Q-network (DQN) algorithm.
Optionally, the loss function of the deep reinforcement learning model is:

L(theta) = (1/m) * sum_{j=1}^{m} [ y_j - Q(s_j, a_j; theta) ]^2

where y_j is the target value of the Q-network function, m is the number of samples, theta is the weight coefficient of the Q-network function, s_j is the state variable of the j-th sample, and a_j is the action variable of the j-th sample.
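As a minimal numeric sketch of this mean-squared Q-learning loss (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def dqn_loss(q_values, actions, targets):
    """Mean-squared error between target values y_j and Q(s_j, a_j; theta).

    q_values: (m, n_actions) array of predicted Q-values for the m sampled states
    actions:  (m,) integer array of sampled actions a_j
    targets:  (m,) array of target values y_j
    """
    m = len(targets)
    q_taken = q_values[np.arange(m), actions]   # Q(s_j, a_j; theta) per sample
    return float(np.mean((targets - q_taken) ** 2))

# Two-sample example: per-sample errors are 0.5 and 0.0, so the loss is 0.125
q = np.array([[1.0, 2.0], [0.5, 0.0]])
loss = dqn_loss(q, np.array([1, 0]), np.array([2.5, 0.5]))
```

The same quantity is what the gradient-descent step of the training procedure below minimizes with respect to theta.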
Optionally, the training of the aggregated load real-time regulation and control deep reinforcement learning model to obtain a real-time optimization regulation and control decision model for flexible load aggregation includes:
initializing the predicted Q-network function and the target Q-network function, and setting the number of iteration rounds EP, the learning rate alpha, the exploration rate epsilon, and the maximum size M of the experience replay pool;
collecting training samples and storing them in the experience replay pool;
randomly sampling a batch of n samples from the experience replay pool and calculating the loss function of the Q-network function;
updating the weight coefficient theta of the Q-network function by gradient descent;
continuously generating new samples, replacing the oldest samples in the experience replay pool with them, and recalculating the loss function and the weight coefficient theta of the Q-network function;
updating the weight coefficient theta' of the target Q-network function;
checking whether the state variable s of the aggregated load has reached a terminal state; if so, emptying the experience replay pool, sampling again, and placing the new samples into the pool;
and judging whether the number of iteration rounds has reached EP; if so, ending the training, otherwise continuing the iteration.
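As a compact illustration of the training procedure above, the following sketch wires the same steps together. The toy environment, the linear Q-function, and every hyperparameter value are stand-ins assumed for illustration only; the patent's model uses a neural Q-network over the aggregated-load state:

```python
import random
import numpy as np

random.seed(0); np.random.seed(0)

# Toy stand-ins for the aggregated-load environment (illustrative only).
N_STATE, N_ACTION = 4, 2

def env_reset():
    return np.random.rand(N_STATE)

def env_step(s, a):
    s2 = np.roll(s, 1)                       # dummy state transition
    r = float(a) - s[0]                      # dummy reward
    return s2, r, random.random() < 0.05     # small chance of a terminal state

# Linear Q-function Q(s, a; theta) = theta[a] . s (a neural net in the patent)
theta = np.zeros((N_ACTION, N_STATE))        # predicted Q-network weights
theta_target = theta.copy()                  # target Q-network weights theta'
EP, alpha, eps, M, n, gamma = 20, 0.01, 0.1, 500, 16, 0.95
pool = []                                    # experience replay pool

s = env_reset()
for ep in range(EP):
    for _ in range(100):
        # epsilon-greedy exploration
        a = random.randrange(N_ACTION) if random.random() < eps \
            else int(np.argmax(theta @ s))
        s2, r, done = env_step(s, a)
        pool.append((s, a, r, s2, done))
        if len(pool) > M:
            pool.pop(0)                      # replace the oldest sample
        if len(pool) >= n:
            # random mini-batch, then one gradient step on the MSE loss
            for sj, aj, rj, sj2, dj in random.sample(pool, n):
                yj = rj if dj else rj + gamma * np.max(theta_target @ sj2)
                theta[aj] += alpha * (yj - theta[aj] @ sj) * sj
        s = env_reset() if done else s2
    theta_target = theta.copy()              # periodic target-network update
```

The periodic copy into `theta_target` mirrors the separate update of theta', which stabilizes the bootstrapped target values y_j.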
Optionally, the method further comprises:
and testing the real-time optimization regulation and control decision model of flexible load aggregation.
The second aspect of the present application provides a heterogeneous flexible load real-time regulation and control device based on deep reinforcement learning, comprising:
the single flexible load modeling module, which is used for respectively establishing a single flexible load dynamic model for each type of heterogeneous flexible load of the power system, to obtain the state variable, action variable, environment variable and reward function of each single flexible load;
the aggregated load modeling module, which is used for establishing a heterogeneous flexible load aggregation model according to the state variables, action variables, environment variables and reward functions of all the single flexible loads, wherein the heterogeneous flexible load aggregation model comprises the state variable, state space, action variable, action space and state-transition function of the aggregated load;
the application module, which is used for applying the aggregation model to the real-time regulation and control environment of the power system to obtain the reward function of the aggregated load participating in real-time response;
the deep reinforcement learning module, which is used for establishing an aggregated-load real-time regulation and control deep reinforcement learning model and training it according to the state variable, action variable, state-transition function and real-time-response reward function of the aggregated load, to obtain a real-time optimization regulation and control decision model for flexible load aggregation;
and the strategy output module, which is used for inputting the state variable of the target aggregated load into the real-time optimization regulation and control decision model for flexible load aggregation, to obtain the optimal strategy for real-time regulation and control of the aggregated load.
Optionally, the single flexible load dynamics model includes a load temperature control dynamics function, a user discomfort function, and a reward function.
Optionally, the method further comprises:
and the model testing module is used for testing the real-time optimization regulation and control decision model of flexible load aggregation.
According to the technical scheme, the embodiment of the application has the following advantages:
the application provides a heterogeneous flexible load real-time regulation and control method based on deep reinforcement learning, which comprises the following steps: respectively establishing a single flexible load dynamic model for different types of heterogeneous flexible loads of the power system to obtain a state variable, an action variable, an environment variable and a return function of the single flexible load; according to the state variables, the action state variables, the environment variables and the return functions of all the single flexible loads, a heterogeneous flexible load aggregation model is established, and comprises the state variables, the state spaces, the action variables, the action spaces and the state transfer functions of aggregated loads; applying the aggregation model to a real-time regulation and control environment of the power system to obtain a return function of aggregation load participating in real-time response; establishing a polymerization load real-time regulation and control deep reinforcement learning model, and training the polymerization load real-time regulation and control deep reinforcement learning model according to a state variable, an action variable, a state transfer function and a return function participating in real-time response of the polymerization load; and inputting the state variable of the target aggregated load into a real-time optimization regulation and control decision model for flexible load aggregation to obtain an optimal strategy for real-time regulation and control of the aggregated load.
According to the heterogeneous flexible load real-time regulation and control method based on deep reinforcement learning, firstly, single flexible load models are respectively established for heterogeneous flexible loads of different types, then a polymerization load model is established for a plurality of flexible loads with different parameters and different structures, so that a Markov decision process when the heterogeneous flexible loads participate in demand response is obtained, a decision function of a polymer is trained through a machine learning framework of the deep reinforcement learning based on historical data, a real-time optimization regulation and control decision model of the heterogeneous flexible load polymer is obtained, an optimal strategy of real-time regulation and control of the polymerization load is obtained, and the flexible load response capability of a user side is improved. The technical problems that the response capability of the flexible load on the user side is low and the demand response potential on the user side is difficult to excite in the existing load regulation and control mode are solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 shows the neural-network form of the Q-network function;
fig. 2 is a block diagram of a flow structure of a heterogeneous flexible load real-time regulation and control method based on deep reinforcement learning provided in an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application provides an embodiment of a heterogeneous flexible load real-time regulation and control method based on deep reinforcement learning, which comprises the following steps:
step 101, respectively establishing a single flexible load dynamic model for different types of heterogeneous flexible loads of the power system to obtain a state variable, an action variable, an environment variable and a return function of the single flexible load.
It should be noted that a single flexible load dynamic model is first established for each type of heterogeneous flexible load, yielding the value ranges and variation trends of the single load's state variables, action variables, environment variables, reward functions, and related quantities. For ease of understanding, the embodiment of the present application is illustrated with two heterogeneous flexible loads, an electric heating load and an electric hot water load; other single flexible loads can achieve the same effects as this embodiment by making the corresponding parameter changes.
For electric heating load:
the electric heating load is typically a temperature-controlled load, the purpose of which is to maintain the room temperature within a certain comfort range. Electric heating simulation by adopting equivalent model similar to first-order circuitSetting the rated power of the electric heating equipment i as P in the running process of the load i rate The equivalent thermal resistance of the room isEquivalent heat capacity ofIndoor temperature at time T is T i (T) outdoor temperature T ex (t), the dynamic equation of the indoor temperature can be expressed as:
in the formula, K i (t) is an action variable for controlling the on-off of the electric heating equipment i, and K i (t)∈{0,1}。
When the time granularity is Δ t, the formulat∈[0,T]Approximation translates into a state-transfer equation at discrete time:
let the comfortable temperature range of the user be [ T i L ,T i U ]The temperature of the electric heating equipment can be changed within the range ofThe user's discomfort function may be defined as:
wherein, the first and the second end of the pipe are connected with each other,andand respectively, discomfort penalty factors of the indoor temperature exceeding the upper limit and the lower limit.
If the electricity price at time t is lambda(t), the user's electricity-fee expenditure is:

f_i^elec(t) = lambda(t) * P_i^rate * K_i(t) * dt

Since the objective is to maximize the reward, the reward function of electric heating device i can be expressed as:

r_i(t) = -f_i^unc(t) - f_i^elec(t)
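These three pieces (thermal dynamics, discomfort, and reward) can be sketched in a few lines of code. All numeric parameter values below (R, C, P_rate, the comfort band, the penalty factors) are hypothetical placeholders, not values from the patent:

```python
def heater_step(T_in, T_out, K, P_rate=2.0, R=2.0, C=1.5, dt=0.25):
    """One discrete step of the first-order thermal model:
    T(t+dt) = T(t) + dt * ((T_out - T_in)/(R*C) + P_rate*K/C)."""
    return T_in + dt * ((T_out - T_in) / (R * C) + P_rate * K / C)

def discomfort(T, T_low=20.0, T_high=24.0, alpha_low=1.0, alpha_high=1.0):
    """Penalty f_unc for leaving the comfort band [T_low, T_high]."""
    return alpha_low * max(T_low - T, 0.0) + alpha_high * max(T - T_high, 0.0)

def reward(T, K, price, P_rate=2.0, dt=0.25):
    """r_i(t) = -f_i^unc - f_i^elec, with f_i^elec = price * P_rate * K * dt."""
    return -discomfort(T) - price * P_rate * K * dt
```

With these helpers, one simulated day is simply a loop over `heater_step` calls driven by an outdoor-temperature trace, a price signal, and an on/off policy K.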
for electric water heating loads:
the electric hot water load can maintain the domestic water stored in the water tank within a comfortable range. In addition to dissipating heat in the environment, hot water in the water tank may also dissipate heat by flowing in cold water due to domestic hot water flowing out. Thus, the water temperature dynamic equation in tank i can be defined as:
wherein, the first and the second end of the pipe are connected with each other,represents the equivalent thermal resistance of the water tank i,represents the equivalent heat capacity, T, of the tank i i (t) Water temperature in tank i, P at time t i rate Indicating rated power, Q, of electric water heater of water tank i i (t) represents the amount of heat taken away by domestic water at time t, Q i (t) is related to the water usage habits of the user, K i (t) is a control variable for controlling the on-off of the electric hot water tank i, and K i (t)∈{0,1}。
General formulat∈[0,T]Converting into a discrete form to obtain a state transition equation as follows:
Analogously, the comfortable temperature range of the electric hot water load can be defined as [T_i^L, T_i^U] and its controllable temperature range as [T_i^min, T_i^max]; the user's discomfort function, electricity cost, and reward function are obtained in the same way as above and are not repeated here.
Step 102, establishing a heterogeneous flexible load aggregation model according to the state variables, action variables, environment variables and reward functions of all the single flexible loads, wherein the heterogeneous flexible load aggregation model comprises the state variable, state space, action variable, action space and state-transition function of the aggregated load.
The flexible loads, heterogeneous in both type and parameters, are aggregated to obtain an aggregation model of the heterogeneous flexible loads; the model comprises the state variable, state space, action variable, action space and state-transition function of the aggregated load. For ease of explanation, the aggregation model is established for N electric heating loads and L electric hot water loads with heterogeneous parameters; the electric heating loads are indexed 1 to N and the electric hot water loads N+1 to N+L.
Let the state variable of the aggregated load be:

s(t) = [T_1(t), T_2(t), ..., T_{N+L}(t)]

and the action variable be:

a(t) = [K_1(t), K_2(t), ..., K_{N+L}(t)]

The state space of the aggregated load is:

S = prod_{i=1}^{N+L} [T_i^min, T_i^max]

where T_i^min and T_i^max are the lower and upper limits of the temperature control range of load i (for the electric hot water loads, of the water-temperature control range).

The action space is:

A = {0, 1}^{N+L}
since each load in the aggregate load has heterogeneous parameters, a single dynamic equation cannot be directly established. The state transition equations of the respective loads can be simultaneously established to obtain:
the above state transition equation can be simplified as:
s(t+1) = F_transition(s(t), a(t), w(t))
where w (t) represents an environmental variable.
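As a concrete illustration of the aggregated transition function above, the per-load dynamics can be stacked into one vectorized update. This is a minimal sketch only: the first-order equivalent-thermal-parameter model and the parameter names `R`, `C`, `P_rate` are assumptions for illustration, not the patent's exact load model.

```python
import numpy as np

def f_transition(s, a, w, R, C, P_rate, dt=0.25):
    """One step of the aggregated transition s(t+1) = F_transition(s(t), a(t), w(t)).

    Each load follows an assumed first-order thermal model with its own
    heterogeneous parameters R (resistance), C (capacitance), P_rate (power);
    w is the ambient-temperature environment variable.
    """
    s = np.asarray(s, dtype=float)
    a = np.asarray(a, dtype=float)  # on/off actions in {0, 1}
    # Discretized dynamics: dT/dt = (w - T)/(R*C) + a*P_rate/C
    return s + dt * ((w - s) / (R * C) + a * P_rate / C)

# Example: N + L = 3 loads with heterogeneous (illustrative) parameters
R = np.array([2.0, 2.5, 3.0])
C = np.array([1.5, 2.0, 1.0])
P = np.array([4.0, 3.0, 5.0])
s0 = np.array([20.0, 21.0, 55.0])   # current temperatures
s1 = f_transition(s0, np.array([1, 0, 1]), w=10.0, R=R, C=C, P_rate=P)
```

Switching a load on always raises its next temperature relative to leaving it off, which is the monotonicity the regulation actions rely on.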
Let the demand response regulation power to be executed by the aggregated load at time t be ΔP_DR(t), and the baseline load of the aggregated load at time t be P_base(t). The aggregated power P_agg(t) at time t is then:
P_agg(t) = P_base(t) + ΔP_DR(t)
The aggregated load earns revenue in the real-time market by responding to system regulation instructions. Let the unit revenue of demand response at time t be μ(t). The full revenue is obtained only when the user's actual response power lies within an error band [1−ε, 1+ε] of the requested regulation power, where ε is typically 20%; outside this band no revenue is obtained. The total revenue of the aggregated load participating in demand response is therefore:
r_DR(t) = μ(t)·|ΔP_DR(t)|·Δt, if (P_agg(t) − P_base(t)) / ΔP_DR(t) ∈ [1−ε, 1+ε]; otherwise r_DR(t) = 0
Thus, the reward function of the aggregated load at time t can be expressed as:
R_agg(t) = r_DR(t) − Σ_{i=1}^{N+L} f_i^unc(t) − λ(t)·(P_agg(t) − P_base(t))·Δt
where R_agg(t) is the reward function of the aggregated load at time t, r_DR(t) is the total revenue of the aggregated load participating in demand response at time t, Σ f_i^unc(t) is the total user discomfort, and −λ(t)·(P_agg(t) − P_base(t))·Δt is the reduction in electricity cost.
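The three reward terms can be sketched in a few lines. This is an illustrative implementation under the assumptions stated in the text (the [1−ε, 1+ε] revenue band taken as the ratio of actual to requested response); the parameter names are hypothetical.

```python
import numpy as np

def reward_agg(p_agg, p_base, dp_dr, mu, lam, discomfort, dt=0.25, eps=0.2):
    """R_agg(t) = r_DR(t) - total discomfort - lambda(t)*(P_agg - P_base)*dt.

    r_DR pays mu(t) per unit of requested regulation power, but only when the
    actual response falls within [1-eps, 1+eps] of the requested amount.
    """
    response_ratio = (p_agg - p_base) / dp_dr
    r_dr = mu * abs(dp_dr) * dt if (1 - eps) <= response_ratio <= (1 + eps) else 0.0
    # Negative (P_agg - P_base) during curtailment makes the cost term a gain.
    return r_dr - float(np.sum(discomfort)) - lam * (p_agg - p_base) * dt
```

For a requested curtailment of 5 units met exactly (ratio 1), both the demand-response revenue and the electricity-cost reduction are earned; with no response at all, the reward collapses to zero (absent discomfort).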
The purpose of real-time optimal regulation of the aggregated load at time t_0 is to find the optimal policy at t_0 that maximizes the user's cumulative expected reward from t_0 to T. To account for the uncertainty of future periods, future rewards are multiplied by a discount factor γ. Assuming the initial state variable s_0 is known, the real-time regulation optimization problem can be expressed as:
max E_{s(t),a(t)} [ Σ_{t=t_0}^{T} γ^{t−t_0} R_agg(t) ]
s.t. s(t_0) = s_0, s(t) ∈ S, a(t) ∈ A
where E_{s(t),a(t)} denotes the expectation over the feasible state space S and action space A. This real-time regulation optimization problem is a mixed-integer nonlinear program whose optimization variable is a(t_0).
Step 103, applying the aggregation model to the real-time regulation environment of the power system to obtain the reward function of the aggregated load participating in the real-time response.
It should be noted that after the aggregation model is obtained, it needs to be placed in the real-time regulation environment of the power system, where it interacts with the environment and continuously evolves to obtain the reward function of the aggregated load participating in the real-time response.
Step 104, establishing a deep reinforcement learning model for real-time regulation of the aggregated load, and training it according to the state variable, action variable, state transition function, and reward function of the aggregated load participating in the real-time response, to obtain a real-time optimal regulation decision model for flexible load aggregation.
It should be noted that, as can be seen from the real-time regulation optimization problem, the dimension of the constraints reaches (N+L)·(T−t_0). When the number of aggregated loads is large, solving the optimization directly is too complex to meet the real-time requirement of real-time optimal regulation. Therefore, an approximation of a(t_0) is obtained by the following deep reinforcement learning method. From steps 102 to 103, the process of the aggregated load participating in real-time regulation is a Markov decision process: the decision at time t_0 and the subsequent states depend only on the current state s(t_0) and not on historical information. The quadruple <s, a, r, s'> of the Markov decision process is thus obtained, namely:
the state variable s ∈ S;
the action variable a ∈ A;
the state transition function s' = F_transition(s, a);
the reward function r = R_agg(s, a).
Let π be the policy of the aggregated load, i.e., the probability of taking action variable a in state variable s in the Markov decision process, written π(a|s):
π(a|s)=Pr[a(t)=a|s(t)=s]
Thus, the goal of deep reinforcement learning is to find the optimal policy π* that maximizes the expected value of the cumulative reward function.
Define the Q network function of the aggregated load and represent it by a neural network; define the loss function for learning the Q network function; and initialize the prediction Q network and the target Q network. Initialize the state of the aggregated load and collect samples into an experience replay pool. Train the prediction Q network offline using mini-batches sampled from the replay pool and the values of the target Q network, updating the prediction-network parameters by gradient descent; repeat this step, periodically updating the target-network parameters, until the iteration count reaches its maximum.
And 105, inputting the state variable of the target aggregated load into a real-time optimization regulation and control decision model for flexible load aggregation to obtain an optimal strategy for real-time regulation and control of the aggregated load.
It should be noted that after the real-time optimal regulation decision model for flexible load aggregation is obtained, the state variable s of the target aggregated load is input into the model to obtain the optimal policy for real-time regulation of the aggregated load, expressed as:
a* = argmax_{a∈A} Q(s, a | θ)
where Q(s, a | θ) is the Q network function.
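Extracting the decision from a trained Q network is a plain argmax over candidate actions. The sketch below assumes a callable `q_network(s, a)`; the toy Q function is invented purely for the example.

```python
import numpy as np

def optimal_action(q_network, s, actions):
    """Greedy policy extraction: a* = argmax_a Q(s, a | theta)."""
    q_values = [q_network(s, a) for a in actions]
    return actions[int(np.argmax(q_values))]

# Toy Q function peaking at a = 2 (illustration only, not a trained network)
toy_q = lambda s, a: -(a - 2) ** 2
best = optimal_action(toy_q, s=0.0, actions=[0, 1, 2, 3])
```

For the aggregated load, `actions` would range over the feasible subset of {0,1}^{N+L}; in practice the action space is restricted or factorized to keep this enumeration tractable.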
According to the heterogeneous flexible load real-time regulation method based on deep reinforcement learning, single flexible load models are first established for the different types of heterogeneous flexible loads, and an aggregated load model is then established for multiple heterogeneous flexible loads with different parameters. This yields the Markov decision process of the heterogeneous flexible loads participating in demand response. Based on historical data, the decision function of the aggregate is trained within a deep reinforcement learning framework, producing a real-time optimal regulation decision model for the heterogeneous flexible load aggregate and the optimal policy for real-time regulation of the aggregated load, thereby improving the flexible load response capability on the user side. This solves the technical problems that, under existing load regulation methods, the response capability of user-side flexible loads is low and the demand response potential on the user side is difficult to tap.
Specifically, the deep reinforcement learning model for real-time regulation of the aggregated load is trained with a deep Q-network (DQN) algorithm. The algorithm introduces two functions to search for the optimal policy: a value function and a Q network function. The value function is the expected cumulative reward obtained by the individual following policy π from state s:
V_π(s) = E_π[ Σ_{k=0}^{T−t} γ^k R_agg(t+k) | s(t) = s ]
The Q network function is the expected cumulative reward obtained by selecting action variable a in state variable s and following policy π thereafter:
Q_π(s, a) = E_π[ Σ_{k=0}^{T−t} γ^k R_agg(t+k) | s(t) = s, a(t) = a ]
Under the optimal policy π*, for any other policy π and any state variable s, the individual's value function satisfies V_{π*}(s) ≥ V_π(s).
Based on the Bellman optimality equation, under the optimal policy π* the relationship between the value function and the Q network function is:
V_{π*}(s) = max_{a∈A} Q_{π*}(s, a)
That is, the Q network function can be decomposed into two parts: the reward in the current state plus the value of the next state multiplied by the discount factor:
Q_{π*}(s, a) = R_agg(s, a) + γ·V_{π*}(s')
The value function under the optimal policy therefore satisfies:
V_{π*}(s) = max_{a∈A} [ R_agg(s, a) + γ·V_{π*}(s') ]
Substituting V_{π*}(s') = max_{a'} Q_{π*}(s', a') into the above relationship shows that the Q network function satisfies
Q_{π*}(s, a) = R_agg(s, a) + γ·max_{a'} Q_{π*}(s', a')
which is applied to the training of the neural network: the left-hand side is regarded as the predicted value Q of the Q network function, and the right-hand side as the target value Q'.
The Q network function is then parameterized by a neural network. The mapping from the input (s, a) to Q is represented by a typical fully connected neural network, as shown in fig. 1, where the inputs are s and a, the output is Q, and the weight coefficients are denoted θ. The purpose of deep reinforcement learning is to train the weight coefficients θ so that the predicted value Q approaches the target value Q' as closely as possible.
If the Q network functions on both sides of this equation were trained with the same parameters, the two would be too strongly coupled, which hinders convergence of the algorithm. Therefore, the two sides are represented by two separate neural networks Q and Q', called the prediction Q network and the target Q network; the two networks have identical structures, with weight coefficients θ and θ', respectively.
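The prediction/target pair described above can be sketched with a tiny fully connected network in plain NumPy. The layer sizes, initialization, and tanh activation are illustrative assumptions, not the patent's architecture; the point is that θ' starts as (and is periodically reset to) a copy of θ.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_theta(n_in, n_hidden):
    """Weights theta of a small fully connected network mapping (s, a) -> Q."""
    return {"W1": rng.normal(0, 0.1, (n_hidden, n_in)),
            "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.1, (1, n_hidden)),
            "b2": np.zeros(1)}

def q_forward(theta, s, a):
    """Q(s, a | theta): forward pass through one hidden tanh layer."""
    x = np.concatenate([np.atleast_1d(s), np.atleast_1d(a)]).astype(float)
    h = np.tanh(theta["W1"] @ x + theta["b1"])
    return (theta["W2"] @ h + theta["b2"]).item()

# Prediction network Q and target network Q' share one architecture;
# the target weights theta' are a periodic copy of theta (theta' <- theta).
theta = init_theta(n_in=3, n_hidden=8)
theta_target = {k: v.copy() for k, v in theta.items()}
```

Immediately after a copy, the two networks agree on every (s, a); they then drift apart as θ is trained between synchronizations.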
Deep reinforcement learning trains the neural network parameters on the available data so that the network's output approaches the target value as closely as possible. Let the current data consist of m samples (s_j, a_j, s'_j, r_j), j = 1, 2, …, m. The mean-square-error loss function of the neural network can then be expressed as:
Loss(θ) = (1/m) Σ_{j=1}^{m} (y_j − Q(s_j, a_j | θ))²
where y_j is the target value of the Q network function:
y_j = r_j + γ·max_{a'} Q'(s'_j, a' | θ')
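The target values y_j and the mean-square-error loss translate directly into code. A minimal sketch, assuming `next_q_max` already holds max_{a'} Q'(s'_j, a' | θ') for each sample; the optional `done` mask (terminal states keep only the immediate reward) is a common DQN refinement, not stated in the text.

```python
import numpy as np

def td_targets(rewards, next_q_max, gamma=0.99, done=None):
    """Target values y_j = r_j + gamma * max_a' Q'(s'_j, a' | theta')."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_max = np.asarray(next_q_max, dtype=float)
    y = rewards + gamma * next_q_max
    if done is not None:  # terminal samples keep only the immediate reward
        y = np.where(done, rewards, y)
    return y

def mse_loss(q_pred, y):
    """Loss(theta) = (1/m) * sum_j (y_j - Q(s_j, a_j | theta))^2."""
    q_pred = np.asarray(q_pred, dtype=float)
    return float(np.mean((y - q_pred) ** 2))
```

Note that y_j is computed with the target-network weights θ' and is treated as a constant during the gradient step on θ.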
as shown in fig. 2, the process of training the aggregation load real-time regulation deep reinforcement learning model is as follows:
(1) the neural network functions Q and Q' are initialized. The iteration round number is set to be EP, the learning rate is alpha, the exploration rate is epsilon, and the maximum size of the experience playback pool is M. Iterative training then begins.
(2) And sampling to obtain an experience playback pool.
Data samples for training the neural network are obtained through offline sampling, and the collected samples are stored in the experience replay pool. First, the state variable of the aggregated load is randomly initialized to obtain s = s_1. An action a = a_1 is then chosen with the greedy strategy (ε-greedy).
The ε-greedy strategy is:
a = a random action in A, if δ < ε; a = argmax_{a∈A} Q(s, a | θ), otherwise
where the exploration rate ε is a constant between 0 and 1, and δ is a random sample drawn uniformly from [0, 1]. This strategy explores as much of the action space as possible and avoids getting stuck in a local optimum.
Then s_1 and a_1 are substituted into the transition function and the reward function to obtain the next state s'_1 and the reward r_1, giving the sample quadruple (s_1, a_1, s'_1, r_1). Let s_2 = s'_1. The above steps are repeated to obtain (s_2, a_2, s'_2, r_2), …, (s_M, a_M, s'_M, r_M), where M is the maximum number of samples in the experience replay pool. The initial state variable is reset each time t reaches its maximum value.
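A fixed-capacity pool of (s, a, s', r) quadruples with automatic eviction of the oldest samples captures the replay-pool behavior described above. This is a generic sketch; the class name and interface are illustrative.

```python
from collections import deque
import random

class ReplayPool:
    """Fixed-capacity experience replay pool of (s, a, s', r) quadruples."""

    def __init__(self, capacity):
        self.pool = deque(maxlen=capacity)  # oldest samples evicted automatically

    def add(self, s, a, s_next, r):
        self.pool.append((s, a, s_next, r))

    def sample(self, n):
        """Random mini-batch of n quadruples for step (3)."""
        return random.sample(list(self.pool), n)

    def __len__(self):
        return len(self.pool)
```

Random sampling from the pool breaks the temporal correlation between consecutive transitions, which is what makes the mini-batch gradient steps well-behaved.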
(3) Randomly extract a batch of n samples from the experience replay pool and compute the corresponding loss function Loss(θ).
(4) Update the parameter θ of the Q network function by gradient descent:
θ ← θ − α·∇_θ Loss(θ)
(5) Continue to generate new samples (s_j, a_j, s'_j, r_j), j = M+1, M+2, …, and replace the oldest samples in the experience replay pool with them. Repeat steps (3) and (4) each time n new data samples are added.
(6) The parameter θ' of the target Q network function Q' is updated by copying the prediction-network parameters, namely:
θ′←θ
(7) checking whether the state s is a final state, and if so, emptying the experience playback pool and jumping to the step (2) to restart.
(8) Repeat steps (2) to (7) until the number of iteration rounds reaches EP.
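The steps above can be tied together in a compact training-loop skeleton. This is a structural sketch only: the environment, Q-update, and target-sync are passed in as callables, and the action-selection line is a placeholder standing in for the ε-greedy rule of step (2); none of the names are from the patent.

```python
import random

def train_dqn(env_step, init_state, q_update, sync_target,
              episodes=3, horizon=5, pool_capacity=32, batch=4, epsilon=0.2):
    """Skeleton of steps (1)-(8): fill the replay pool, update the prediction
    network on random mini-batches, and periodically sync the target network."""
    pool = []
    for ep in range(episodes):
        s = init_state()                                  # step (2): reset state
        for t in range(horizon):
            # Placeholder policy standing in for epsilon-greedy over Q(s, a | theta)
            a = random.randrange(2) if random.random() < epsilon else 0
            s_next, r = env_step(s, a)
            pool.append((s, a, s_next, r))
            if len(pool) > pool_capacity:                 # evict oldest sample
                pool.pop(0)
            if len(pool) >= batch:
                q_update(random.sample(pool, batch))      # steps (3)-(4)
            s = s_next
        sync_target()                                     # step (6): theta' <- theta
    return pool
```

In a real run, `q_update` would compute Loss(θ) against target values from θ' and take a gradient step, and `sync_target` would copy θ into θ' every fixed number of updates rather than once per episode.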
After neural network training is completed, test set data may be generated to verify the validity of the strategy.
In any state, given the state s of the aggregated load, the optimal decision for real-time regulation is obtained as a* = argmax_{a∈A} Q(s, a | θ).
The optimized regulation of the aggregated load is then tested and the results recorded.
The application also provides an embodiment of the heterogeneous flexible load real-time regulation and control device based on deep reinforcement learning, which comprises the following steps:
the single flexible load modeling module is used for respectively establishing a single flexible load dynamic model for different types of heterogeneous flexible loads of the power system to obtain a state variable, an action variable, an environment variable and a return function of the single flexible load;
the aggregated load modeling module is used for establishing a heterogeneous flexible load aggregation model according to the state variables, action variables, environment variables, and reward functions of all the single flexible loads, wherein the heterogeneous flexible load aggregation model comprises the state variable, state space, action variable, action space, and state transfer function of the aggregated load;
the application module is used for applying the aggregation model to a real-time regulation and control environment of the power system to obtain a return function of aggregation load participating in real-time response;
the deep reinforcement learning module is used for establishing a deep reinforcement learning model for real-time regulation of the aggregated load, and training it according to the state variable, action variable, state transfer function, and reward function of the aggregated load participating in the real-time response, to obtain a real-time optimal regulation decision model for flexible load aggregation;
and the strategy output module is used for inputting the state variable of the target aggregated load into the real-time optimization regulation and control decision model for flexible load aggregation to obtain the optimal strategy for real-time regulation and control of the aggregated load.
According to the heterogeneous flexible load real-time regulation device based on deep reinforcement learning, single flexible load models are first established for the different types of heterogeneous flexible loads, and an aggregated load model is then established for multiple flexible loads with different parameters and structures. This yields the Markov decision process of the heterogeneous flexible loads participating in demand response. Based on historical data, the decision function of the aggregate is trained within a deep reinforcement learning framework, producing a real-time optimal regulation decision model for the heterogeneous flexible load aggregate and the optimal policy for real-time regulation of the aggregated load, thereby improving the flexible load response capability on the user side. This solves the technical problems that, under existing load regulation methods, the response capability of user-side flexible loads is low and the demand response potential on the user side is difficult to tap.
Further, the single flexible load dynamic model includes a load temperature control dynamic function, a user discomfort function, and a reward function.
Further, the method also comprises the following steps:
and the model testing module is used for testing the real-time optimization regulation and control decision model of flexible load aggregation.
It should be noted that the device provided in the embodiment of the present application is a virtual device embodiment corresponding to the foregoing heterogeneous flexible load real-time regulation and control method embodiment based on deep reinforcement learning, and the embodiment of the present application can achieve the same technical effects as the foregoing heterogeneous flexible load real-time regulation and control method embodiment based on deep reinforcement learning, and is not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the technical solutions contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (9)
1. A heterogeneous flexible load real-time regulation and control method based on deep reinforcement learning is characterized by comprising the following steps:
respectively establishing a single flexible load dynamic model for different types of heterogeneous flexible loads of the power system to obtain a state variable, an action variable, an environment variable and a return function of the single flexible load;
according to the state variables, action variables, environment variables, and return functions of all the single flexible loads, establishing a heterogeneous flexible load aggregation model, wherein the heterogeneous flexible load aggregation model comprises the state variables, state spaces, action variables, action spaces, and state transfer functions of aggregated loads; wherein the heterogeneous flexible load aggregation model is as follows:
s(t+1)=F transition (s(t),a(t),w(t))
wherein s(t+1) is the state variable of the aggregated load at time t+1, F_transition(s(t), a(t), w(t)) is the state transfer function of the aggregated load at time t, s(t) is the state variable of the aggregated load at time t, a(t) is the action variable of the aggregated load at time t, w(t) is the environment variable at time t, R_agg(t) is the reward function of the aggregated load at time t, r_DR(t) is the total revenue of the aggregated load participating in demand response at time t, λ(t)·(P_agg(t) − P_base(t))·Δt is the reduction in electricity cost, f_i^unc is the user discomfort, T_i^L is the user's lower comfort temperature limit, T_i^U is the user's upper comfort temperature limit, T_i^min is the lower limit of the temperature variation of device i, T_i^max is the upper limit of the temperature variation of device i, N is the number of electric heating devices, L is the number of electric water-heating devices, T_i(t) is the indoor temperature, and the two discomfort penalty coefficients apply when the indoor temperature T_i(t) exceeds the user's upper comfort temperature limit T_i^U and when T_i(t) falls below the user's lower comfort temperature limit T_i^L, respectively;
applying the aggregation model to a real-time regulation and control environment of the power system to obtain a return function of aggregation load participating in real-time response;
establishing a deep reinforcement learning model for real-time regulation of the aggregated load, and training the model according to the state variable, action variable, state transfer function, and reward function of the aggregated load participating in the real-time response, to obtain a real-time optimal regulation decision model for flexible load aggregation;
and inputting the state variable of the target aggregated load into a real-time optimization regulation and control decision model for flexible load aggregation to obtain an optimal strategy for real-time regulation and control of the aggregated load.
2. The method for real-time regulation and control of heterogeneous flexible loads based on deep reinforcement learning according to claim 1, wherein the single flexible load dynamic model comprises a load temperature control dynamic function, a user discomfort function and a reward function.
3. The heterogeneous flexible load real-time regulation and control method based on deep reinforcement learning of claim 1, wherein the aggregate load real-time regulation and control deep reinforcement learning model is trained by adopting a deep Q value network algorithm.
4. The method for real-time regulation of heterogeneous flexible loads based on deep reinforcement learning according to claim 3, wherein the loss function of the deep reinforcement learning model is:
Loss(θ) = (1/m) Σ_{j=1}^{m} (y_j − Q(s_j, a_j | θ))²
wherein y_j is the target value of the Q network function, m is the number of samples, θ is the weight coefficient of the Q network function, s_j is the state variable of the jth sample, and a_j is the action variable of the jth sample.
5. The heterogeneous flexible load real-time regulation and control method based on deep reinforcement learning of claim 4, wherein the training of the aggregated load real-time regulation and control deep reinforcement learning model to obtain a flexible load aggregated real-time optimization regulation and control decision model comprises:
initializing a predicted Q network function and a target Q network function, setting the number of iteration rounds as EP, the learning rate as alpha, the exploration rate as epsilon and the maximum size of an experience playback pool as M;
collecting training samples, and storing the training samples in the experience playback pool;
extracting n samples from the experience playback pool in random batch, and calculating a loss function of the Q network function;
updating the weight coefficient theta of the Q network function by adopting a gradient descent method;
continuously generating new samples, replacing the old samples in the experience playback pool with the new samples, and calculating a loss function and a weight coefficient theta of the Q network function;
updating a weight coefficient theta' of the target Q network function;
checking whether a state variable s of the aggregation load is in a final state, if so, emptying the experience playback pool, sampling again, and putting a sampling sample into the experience playback pool;
and judging whether the iteration round number reaches EP, if so, finishing the training, and otherwise, continuing the iteration.
6. The method for regulating and controlling the heterogeneous flexible load based on the deep reinforcement learning in real time as claimed in claim 5, further comprising:
and testing the real-time optimization regulation and control decision model of flexible load aggregation.
7. The utility model provides a heterogeneous flexible load real-time regulation and control device based on deep reinforcement study which characterized in that includes:
the single flexible load modeling module is used for respectively establishing a single flexible load dynamic model for different types of heterogeneous flexible loads of the power system to obtain a state variable, an action variable, an environment variable and a return function of the single flexible load;
the aggregation load modeling module is used for establishing a heterogeneous flexible load aggregation model according to the state variables, the action state variables, the environment variables and the return functions of all the single flexible loads, wherein the heterogeneous flexible load aggregation model comprises the state variables, the state spaces, the action variables, the action spaces and the state transfer functions of the aggregation loads; wherein the heterogeneous flexible load aggregation model is as follows:
s(t+1)=F transition (s(t),a(t),w(t))
wherein s(t+1) is the state variable of the aggregated load at time t+1, F_transition(s(t), a(t), w(t)) is the state transfer function of the aggregated load at time t, s(t) is the state variable of the aggregated load at time t, a(t) is the action variable of the aggregated load at time t, w(t) is the environment variable at time t, R_agg(t) is the reward function of the aggregated load at time t, r_DR(t) is the total revenue of the aggregated load participating in demand response at time t, λ(t)·(P_agg(t) − P_base(t))·Δt is the reduction in electricity cost, f_i^unc is the user discomfort, T_i^L is the user's lower comfort temperature limit, T_i^U is the user's upper comfort temperature limit, T_i^min is the lower limit of the temperature variation of device i, T_i^max is the upper limit of the temperature variation of device i, N is the number of electric heating devices, L is the number of electric water-heating devices, T_i(t) is the indoor temperature, and the two discomfort penalty coefficients apply when the indoor temperature T_i(t) exceeds the user's upper comfort temperature limit T_i^U and when T_i(t) falls below the user's lower comfort temperature limit T_i^L, respectively;
the application module is used for applying the aggregation model to a real-time regulation and control environment of the power system to obtain a return function of aggregation load participating in real-time response;
the deep reinforcement learning module is used for establishing a deep reinforcement learning model for real-time regulation of the aggregated load, and training it according to the state variable, action variable, state transfer function, and reward function of the aggregated load participating in the real-time response, to obtain a real-time optimal regulation decision model for flexible load aggregation;
and the strategy output module is used for inputting the state variable of the target aggregated load into the real-time optimization regulation and control decision model for flexible load aggregation to obtain the optimal strategy for real-time regulation and control of the aggregated load.
8. The device for real-time regulation and control of heterogeneous flexible loads based on deep reinforcement learning of claim 7, wherein the single flexible load dynamic model comprises a load temperature control dynamic function, a user discomfort function and a reward function.
9. The device for real-time regulation and control of heterogeneous flexible loads based on deep reinforcement learning according to claim 7, further comprising:
and the model testing module is used for testing the real-time optimization regulation and control decision model of flexible load aggregation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011389959.0A CN112488531B (en) | 2020-12-02 | 2020-12-02 | Heterogeneous flexible load real-time regulation and control method and device based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112488531A CN112488531A (en) | 2021-03-12 |
CN112488531B true CN112488531B (en) | 2022-09-06 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113723798B (en) * | 2021-08-27 | 2022-11-11 | 广东电网有限责任公司 | Demand response control method and system based on online deep reinforcement learning |
CN115549109A (en) * | 2022-09-15 | 2022-12-30 | 清华大学 | Mass flexible load rapid aggregation control method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107453356A (en) * | 2017-08-21 | 2017-12-08 | 南京邮电大学 | User side flexible load dispatching method based on adaptive Dynamic Programming |
CN107800157A (en) * | 2017-11-14 | 2018-03-13 | 武汉大学 | The virtual power plant dual-layer optimization dispatching method of the temperature control load containing polymerization and new energy |
CN108964042A (en) * | 2018-07-24 | 2018-12-07 | 合肥工业大学 | Regional power grid operating point method for optimizing scheduling based on depth Q network |
CN110705737A (en) * | 2019-08-09 | 2020-01-17 | 四川大学 | Comprehensive optimization configuration method for multiple energy storage capacities of multi-energy microgrid |
WO2020037127A1 (en) * | 2018-08-17 | 2020-02-20 | Dauntless.Io, Inc. | Systems and methods for modeling and controlling physical dynamical systems using artificial intelligence |
CN111709672A (en) * | 2020-07-20 | 2020-09-25 | 国网黑龙江省电力有限公司 | Virtual power plant economic dispatching method based on scene and deep reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3398116A1 (en) * | 2015-12-31 | 2018-11-07 | Vito NV | Methods, controllers and systems for the control of distribution systems using a neural network architecture |
- 2020-12-02 CN CN202011389959.0A patent/CN112488531B/en active Active
Non-Patent Citations (1)
Title |
---|
Optimal Dispatch of Distributed Electric Heating Participating in Demand Response Based on Deep Reinforcement Learning; Yan Gangui et al.; Power System Technology; 2020-11-05; Vol. 44, No. 11; pp. 4140-4147 * |
Also Published As
Publication number | Publication date |
---|---|
CN112488531A (en) | 2021-03-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wei et al. | A bi-level scheduling model for virtual power plants with aggregated thermostatically controlled loads and renewable energy | |
Pinto et al. | Data-driven district energy management with surrogate models and deep reinforcement learning | |
Pinto et al. | Coordinated energy management for a cluster of buildings through deep reinforcement learning | |
Javaid et al. | Towards buildings energy management: using seasonal schedules under time of use pricing tariff via deep neuro-fuzzy optimizer | |
CN113572157B (en) | User real-time autonomous energy management optimization method based on near-end policy optimization | |
CN111695793B (en) | Method and system for evaluating energy utilization flexibility of comprehensive energy system | |
Kou et al. | Model-based and data-driven HVAC control strategies for residential demand response | |
CN112488531B (en) | Heterogeneous flexible load real-time regulation and control method and device based on deep reinforcement learning | |
CN113112077B (en) | HVAC control system based on multi-step prediction deep reinforcement learning algorithm | |
CN112036934A (en) | Quotation method for participation of load aggregators in demand response considering thermoelectric coordinated operation | |
CN114623569B (en) | Cluster air conditioner load differential regulation and control method based on deep reinforcement learning | |
Kong et al. | Refined peak shaving potential assessment and differentiated decision-making method for user load in virtual power plants | |
CN104036328A (en) | Self-adaptive wind power prediction system and prediction method | |
Bian et al. | Demand response model identification and behavior forecast with OptNet: A gradient-based approach | |
Feng et al. | Economic dispatch of industrial park considering uncertainty of renewable energy based on a deep reinforcement learning approach | |
Zhang et al. | Networked Multiagent-Based Safe Reinforcement Learning for Low-Carbon Demand Management in Distribution Networks | |
Wu et al. | Virtual-real interaction control of hybrid load system for low-carbon energy services | |
CN113591391A (en) | Power load control device, control method, terminal, medium and application | |
Amadeh et al. | Building cluster demand flexibility: An innovative characterization framework and applications at the planning and operational levels | |
Lénet et al. | An inverse nash mean field game-based strategy for the decentralized control of thermostatic loads | |
Coffman et al. | A model-free method for learning flexibility capacity of loads providing grid support | |
Pandiyan et al. | Recursive training based physics-inspired neural network for electric water heater modeling | |
Liu et al. | A Real-time Demand Response Strategy of Home Energy Management by Using Distributed Deep Reinforcement Learning | |
CN112052989B (en) | Risk cost allocation method for comprehensive energy resource sharing community | |
Amasyali et al. | Deep Reinforcement Learning for Autonomous Water Heater Control. Buildings 2021, 11, 548 | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||