CN115001002B - Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling - Google Patents
Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
- Publication number
- CN115001002B (application CN202210916196.3A)
- Authority
- CN
- China
- Prior art keywords
- energy storage
- value
- network
- strategy
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000004146 energy storage Methods 0.000 title claims abstract description 218
- 238000000034 method Methods 0.000 title claims abstract description 43
- 238000011217 control strategy Methods 0.000 claims abstract description 66
- 238000012549 training Methods 0.000 claims abstract description 45
- 238000005457 optimization Methods 0.000 claims abstract description 32
- 230000008569 process Effects 0.000 claims abstract description 15
- 238000013528 artificial neural network Methods 0.000 claims description 80
- 239000013598 vector Substances 0.000 claims description 9
- 230000009471 action Effects 0.000 claims description 7
- 238000002372 labelling Methods 0.000 claims description 6
- 230000004044 response Effects 0.000 claims description 2
- 230000001537 neural effect Effects 0.000 claims 1
- 230000006870 function Effects 0.000 abstract description 39
- 238000009826 distribution Methods 0.000 abstract description 12
- 230000008859 change Effects 0.000 abstract description 6
- 238000007599 discharging Methods 0.000 description 11
- 230000008901 benefit Effects 0.000 description 6
- 238000004364 calculation method Methods 0.000 description 6
- 150000001875 compounds Chemical class 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 238000013461 design Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000005611 electricity Effects 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 230000002787 reinforcement Effects 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 238000012935 Averaging Methods 0.000 description 1
- 230000009194 climbing Effects 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 238000011835 investigation Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000009987 spinning Methods 0.000 description 1
Images
Classifications
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/28—Arrangements for balancing of the load in a network by storage of energy
- H02J3/32—Arrangements for balancing of the load in a network by storage of energy using batteries with converting means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06315—Needs-based resource requirements planning or analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Economics (AREA)
- Physics & Mathematics (AREA)
- Strategic Management (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Marketing (AREA)
- General Business, Economics & Management (AREA)
- Tourism & Hospitality (AREA)
- Health & Medical Sciences (AREA)
- Entrepreneurship & Innovation (AREA)
- Development Economics (AREA)
- General Health & Medical Sciences (AREA)
- Quality & Reliability (AREA)
- Operations Research (AREA)
- Game Theory and Decision Science (AREA)
- Computational Linguistics (AREA)
- Power Engineering (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Educational Administration (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Water Supply & Treatment (AREA)
- Primary Health Care (AREA)
- Supply And Distribution Of Alternating Current (AREA)
Abstract
The invention provides an optimal scheduling method and system for solving the problem of energy storage participating in peak clipping and valley filling. The method comprises: setting a parameterized deep Q-value network; training the parameterized deep Q-value network with historical load data and the energy storage power output at the corresponding moments; and, during training, limiting the control strategy updates with a trust-region optimization model, so that an optimal strategy is obtained quickly and accurately and optimal scheduling control of the energy storage is realized under the current conditions. The invention uses trust-region reinforcement learning to limit the size of each strategy update in continuous control, so that the distribution form does not change greatly at each update and the return converges monotonically and incrementally; the optimization result can be corrected on line, and charge and discharge constraints are taken into account to achieve optimal peak clipping and valley filling control.
Description
Technical Field
The invention belongs to the technical field of power grid dispatching, and particularly relates to an optimal dispatching method and system, based on trust-region reinforcement learning, for solving energy storage participation in peak clipping and valley filling.
Background
A large-scale battery energy storage system can perform peak clipping and valley filling of the load by discharging at the load peak and charging at the load valley. By using energy storage for peak clipping and valley filling, a power grid company can postpone upgrades of equipment capacity, improve equipment utilization and save equipment renewal costs; a power consumer can use energy storage for peak clipping and valley filling and exploit the peak-valley electricity price difference to obtain economic benefits. Achieving the best peak clipping and valley filling effect with a limited battery capacity while satisfying a set of constraint conditions requires an optimization algorithm.
The classical optimization algorithms for solving the charging and discharging strategy of an energy storage system include gradient algorithms and dynamic programming. Gradient algorithms cannot handle discontinuous constraint conditions and depend strongly on the initial values. Dynamic programming can take discontinuous and nonlinear constraints into account in the model and is convenient to solve on a computer. However, with large-scale grid-connected energy storage and highly random loads, both methods suffer from problems of precision and computational efficiency; moreover, both rely on an accurate physical model, whose accuracy is difficult to guarantee in practical problems.
Disclosure of Invention
In view of the above, the invention aims to solve the problems that, with large-scale grid-connected energy storage and highly random loads, the classical optimization algorithms for solving the charging and discharging strategy of an energy storage system suffer from limited precision and computational efficiency, and that modeling accuracy is difficult to guarantee.
In order to solve the technical problems, the invention provides the following technical scheme:
in a first aspect, the present invention provides an optimal scheduling method for solving the problem that energy storage participates in peak clipping and valley filling, including the following steps:
setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network is used for parameterizing an input control strategy with its own network parameters and outputting a plurality of parameterized control strategies;

acquiring historical active load values, load forecast values and the energy storage power output at the corresponding moments; taking the energy storage power output, the load active value and the forecast value at the initial moment as the initial state, controlling the energy storage with an arbitrary initial energy storage control strategy, iteratively training the parameterized deep Q-value network and updating its network parameters with the objective of minimizing the variance of the load curve, and limiting the network-parameter updates with a trust-region optimization model; training ends when the condition D̄_KL(π_θ(k), π_θ(k+1)) ≤ δ is satisfied, where D̄_KL denotes the trust-region constraint on the manifold, π_θ denotes the control strategy parameterized by the network parameters θ, δ denotes the constraint limit value, and k and k+1 denote successive updates of the network parameters θ;

and acquiring the current active load value and energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output, and issuing it to the energy storage sub-controller for energy storage scheduling control.
Further, the parameterized deep Q-value network specifically includes: an energy storage strategy neural network and an energy storage state value neural network;

the energy storage strategy neural network is set up according to the approximate state-action energy storage Q-Value network Q_π(s_t, a_t), with θ as its corresponding network parameters;

the energy storage state value neural network is set up according to the approximate state energy storage Q-Value network V_π(s_t), with ω as its corresponding network parameters;

where s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value of taking action a_t in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions, r denotes the return, and γ denotes the discount factor.
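The two networks can be illustrated with the following minimal sketch. The layer sizes, the 10-way discretization of the charge/discharge power and all class and variable names are illustrative assumptions, not taken from the patent text:

```python
import torch
import torch.nn as nn

N_ACTIONS = 10   # assumed number of discretized charge/discharge power levels
STATE_DIM = 3    # assumed state: [load forecast, current load, current BES power]

class PolicyNet(nn.Module):
    """Energy storage strategy network: probabilities over the discrete power levels."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )
    def forward(self, state):
        return torch.softmax(self.body(state), dim=-1)

class ValueNet(nn.Module):
    """Energy storage state value network: scalar value of a state."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, state):
        return self.body(state).squeeze(-1)
```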
Further, the trust-region optimization model specifically comprises: maximizing, over the network parameters θ, the expected discounted return of the updated control strategy relative to the control strategy before the update, subject to the trust-region constraint between the updated control strategy and the control strategy before the update;

where π_old denotes the control strategy before the update and π_θ denotes the control strategy updated according to the network parameters θ.
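A standard trust-region (TRPO) form consistent with this description is given below for reference; A denotes the advantage function and the exact expression in the original formula image may differ:

```latex
\max_{\theta}\; L_{\pi_{\mathrm{old}}}(\pi_\theta)
  = \mathbb{E}_{s,a\sim\pi_{\mathrm{old}}}\!\left[
      \frac{\pi_\theta(a\mid s)}{\pi_{\mathrm{old}}(a\mid s)}\,A_{\pi_{\mathrm{old}}}(s,a)\right]
\quad\text{s.t.}\quad
  \bar{D}_{\mathrm{KL}}\!\left(\pi_{\mathrm{old}}\,\|\,\pi_\theta\right)\le\delta
```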
Further, iteratively training the parameterized deep Q-value network, updating the network parameters, and controlling the network-parameter updates with the trust-region optimization model until the condition D̄_KL(π_θ(k), π_θ(k+1)) ≤ δ is satisfied and training ends specifically comprises:

starting from the initial state, controlling the energy storage N times with the control strategy π_θ to obtain the round-k strategy state-action trajectories, where π_θ is the output of the energy storage strategy neural network, θ are the parameters of the energy storage strategy network, τ_i is the i-th trajectory, and (s_t, a_t) is the trajectory state and action vector at time t;

for each step (s_t, a_t) in τ_i, recording the corresponding return, calculating the action-state value function Q_π(s_t, a_t) of the corresponding step with the energy storage strategy neural network based on the return, and calculating the state value function V_ω(s_t) of the corresponding step with the energy storage state value neural network, where ω are the parameters of the energy storage state value neural network;

for each step (s_t, a_t) in τ_i, calculating the advantage function A(s_t, a_t) based on the action-state value function and the state value function;

estimating the policy gradient g based on the advantage function, where N denotes the total number of load and energy-storage control rounds and the gradient of the energy storage strategy neural network is taken with respect to θ;

computing, based on the policy gradient, the second-order partial derivative of the energy storage strategy neural network with respect to θ and solving for the auxiliary variable x, which has no actual physical meaning;

letting the iteration index j = 0, 1, 2, ..., and successively updating the network parameters of the energy storage strategy neural network by a backtracking step, where the maximum number of step-length backtracking steps of the energy storage strategy neural network is limited;

for the energy storage state value neural network, taking the recorded returns as labels and updating its parameters with a stochastic gradient descent algorithm, where the gradient of the energy storage state value neural network loss function L(ω) is taken with respect to the network parameters ω;
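Under the notation introduced here, the advantage function and the policy-gradient estimator referenced in these steps take the following standard form (the original formula images are not reproduced and may differ in detail):

```latex
A(s_t,a_t) = Q_\pi(s_t,a_t) - V_\omega(s_t), \qquad
\hat g = \frac{1}{N}\sum_{i=1}^{N}\sum_{t}
  \nabla_\theta \log \pi_\theta\!\left(a_t^{(i)}\mid s_t^{(i)}\right) A\!\left(s_t^{(i)},a_t^{(i)}\right)
```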
Further, in the expression for minimizing the variance of the load curve, n is the number of load data points in a day, determined by the forecast load data, with the current moment corresponding to one of these load data points; P_L(t_k) is the load at moment t_k and is a known quantity: the actual load up to and including the current moment and the forecast load thereafter; P_B(t_k) is the power of the BES between moments t_(k-1) and t_k, with charging positive and discharging negative; it is known up to the current moment and is the control variable thereafter.
In a second aspect, the present invention provides an optimized scheduling system for solving the problem that energy storage participates in peak clipping and valley filling, including:
the setting unit is used for setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network is used for parameterizing an input control strategy with its own network parameters and outputting a plurality of parameterized control strategies;

the training unit is used for acquiring historical active load values, load forecast values and the energy storage power output at the corresponding moments; taking the energy storage power output, the load active value and the forecast value at the initial moment as the initial state, controlling the energy storage with an arbitrary initial energy storage control strategy, iteratively training the parameterized deep Q-value network and updating its network parameters with the objective of minimizing the variance of the load curve, and limiting the network-parameter updates with a trust-region optimization model; training ends when the condition D̄_KL(π_θ(k), π_θ(k+1)) ≤ δ is satisfied, where D̄_KL denotes the trust-region constraint on the manifold, π_θ denotes the control strategy parameterized by the network parameters θ, δ denotes the constraint limit value, and k and k+1 denote successive updates of the network parameters θ;

and the control unit is used for acquiring the current active load value and energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output, and issuing it to the energy storage sub-station controller for energy storage scheduling control.
Further, the parameterized deep Q-value network specifically includes: an energy storage strategy neural network and an energy storage state value neural network;

the energy storage strategy neural network is set up according to the approximate state-action energy storage Q-Value network Q_π(s_t, a_t), with θ as its corresponding network parameters;

the energy storage state value neural network is set up according to the approximate state energy storage Q-Value network V_π(s_t), with ω as its corresponding network parameters;

where s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value of taking action a_t in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions, r denotes the return, and γ denotes the discount factor.
Further, the trust-region optimization model specifically comprises: maximizing, over the network parameters θ, the expected discounted return of the updated control strategy relative to the control strategy before the update, subject to the trust-region constraint between the updated control strategy and the control strategy before the update;

where π_old denotes the control strategy before the update and π_θ denotes the control strategy updated according to the network parameters θ.
Further, the process of the training unit iteratively training the parameterized deep Q-value network and updating the network parameters specifically includes:

starting from the initial state, controlling the energy storage N times with the control strategy π_θ to obtain the round-k strategy state-action trajectories, where π_θ is the output of the energy storage strategy neural network, θ are the parameters of the energy storage strategy network, τ_i is the i-th trajectory, and (s_t, a_t) is the trajectory state and action vector at time t;

for each step (s_t, a_t) in τ_i, recording the corresponding return, calculating the action-state value function Q_π(s_t, a_t) of the corresponding step with the energy storage strategy neural network based on the return, and calculating the state value function V_ω(s_t) of the corresponding step with the energy storage state value neural network, where ω are the parameters of the energy storage state value neural network;

for each step (s_t, a_t) in τ_i, calculating the advantage function A(s_t, a_t) based on the action-state value function and the state value function;

estimating the policy gradient g based on the advantage function, where N denotes the total number of load and energy-storage control rounds and the gradient of the energy storage strategy neural network is taken with respect to θ;

computing, based on the policy gradient, the second-order partial derivative of the energy storage strategy neural network with respect to θ and solving for the auxiliary variable x, which has no actual physical meaning;

letting the iteration index j = 0, 1, 2, ..., and successively updating the network parameters of the energy storage strategy neural network by a backtracking step, where the maximum number of step-length backtracking steps of the energy storage strategy neural network is limited;

for the energy storage state value neural network, taking the recorded returns as labels and updating its parameters with a stochastic gradient descent algorithm, where the gradient of the energy storage state value neural network loss function L(ω) is taken with respect to the network parameters ω;
Further, in the expression for minimizing the variance of the load curve, n is the number of load data points in a day, determined by the forecast load data, with the current moment corresponding to one of these load data points; P_L(t_k) is the load at moment t_k and is a known quantity: the actual load up to and including the current moment and the forecast load thereafter; P_B(t_k) is the power of the BES between moments t_(k-1) and t_k, with charging positive and discharging negative; it is known up to the current moment and is the control variable thereafter.
In conclusion, the invention provides an optimal scheduling method and system for solving the problem of energy storage participating in peak clipping and valley filling, comprising: setting a parameterized deep Q-value network, training it with historical load data and the energy storage power output at the corresponding moments, and limiting the control strategy updates with a trust-region optimization model during training, so that an optimal strategy is obtained quickly and accurately and optimal scheduling control of the energy storage is realized under the current conditions. The invention uses trust-region reinforcement learning to limit the size of each strategy update in continuous control, so that the distribution form does not change greatly at each update and the return converges monotonically and incrementally; the optimization result can be corrected on line, and charge and discharge constraints are taken into account to achieve optimal peak clipping and valley filling control.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a schematic flowchart of an optimal scheduling method for solving energy storage participation peak clipping and valley filling according to an embodiment of the present invention;
FIG. 2 shows the parameter updating process of trust-region reinforcement learning provided by the embodiment of the present invention;
fig. 3 is a schematic diagram of an energy storage strategy neural network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an energy storage state value neural network provided by an embodiment of the present invention;
fig. 5 is a flowchart of network training according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A large-scale battery energy storage system can perform peak clipping and valley filling of the load by discharging at the load peak and charging at the load valley. By using energy storage for peak clipping and valley filling, a power grid company can postpone upgrades of equipment capacity, improve equipment utilization and save equipment renewal costs; a power consumer can use energy storage for peak clipping and valley filling and exploit the peak-valley electricity price difference to obtain economic benefits. Achieving the best peak clipping and valley filling effect with a limited battery capacity while satisfying a set of constraint conditions requires an optimization algorithm.

The classical optimization algorithms for solving the charging and discharging strategy of an energy storage system include gradient algorithms and dynamic programming. Gradient algorithms cannot handle discontinuous constraint conditions and depend strongly on the initial values. Dynamic programming can take discontinuous and nonlinear constraints into account in the model and is convenient to solve on a computer. However, with large-scale grid-connected energy storage and highly random loads, both methods suffer from problems of precision and computational efficiency; moreover, both rely on an accurate physical model, whose accuracy is difficult to guarantee in practical problems.
Traditional reinforcement learning methods based on the policy gradient have allowed deep neural networks to make notable progress on control tasks. However, achieving good results with policy gradient methods is difficult because they are very sensitive to the update step size: if it is too small, training is very slow; if it is too large, the feedback signal is buried in noise and the model's performance may even collapse. The sampling efficiency of such methods is also often low, so learning even a simple task may require millions to billions of total iterations.
On this basis, the invention provides an optimal scheduling method and system, based on trust-region reinforcement learning, for solving the problem that energy storage participates in peak clipping and valley filling.
The following describes the optimal scheduling method for energy storage participation in peak clipping and valley filling based on trust-region reinforcement learning.
Referring to fig. 1, the present embodiment provides an optimal scheduling method for solving energy storage participation in peak clipping and valley filling based on trust-region reinforcement learning.
First, the design idea of solving the optimal scheduling of energy storage participation in peak clipping and valley filling based on trust-region reinforcement learning is explained in detail as follows:
trust Region Policy Optimization (TRPO) limits the size of strategy update in continuous control, does not change the distribution form greatly every time of update, enables the benefit to meet the requirement of incremental convergence, and can correct the Optimization result on line.
Because the charging and discharging power of the energy storage can be changed rapidly and flexibly, the ramp-rate constraint does not need to be considered. Ignoring the internal losses of the battery, the battery can be treated as a constant-voltage source model. If the owner of the energy storage system is a power consumer, then under a market electricity price system the user's goal is to maximize the economic benefit the energy storage system brings; if the owner is the grid, the goal is to make the load curve of the grid as flat as possible, in order to reduce the number of start-ups and shut-downs of conventional generator sets and the spinning reserve capacity. Mathematically, the variance reflects the degree to which a random variable deviates from its mean, and the variance of the load reflects how flat the load curve is. This embodiment therefore chooses minimizing the variance of the load curve as the objective function:
In the formula, n is the number of load data points in a day, determined by the forecast load data, with the current moment corresponding to one of these load data points; P_L(t_k) is the load at moment t_k and is a known quantity: the actual load up to and including the current moment and the forecast load thereafter; P_B(t_k) is the power of the BES between moments t_(k-1) and t_k, with charging positive and discharging negative; it is known up to the current moment and is the control variable thereafter.
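As a concrete illustration, a plausible form of this objective, assuming the net load is the sum of the load and the BES charging power, can be sketched as follows; the exact expression in the original formula image may differ:

```python
import numpy as np

def load_curve_variance(p_load, p_bes):
    """Variance of the net load curve over one day (smaller means flatter).

    p_load : array of n load values P_L(t_k), actual up to the current moment
             and forecast afterwards.
    p_bes  : array of n BES powers P_B(t_k), charging positive, discharging
             negative; past values are known, future values are the controls.
    """
    net = np.asarray(p_load, dtype=float) + np.asarray(p_bes, dtype=float)
    return float(np.mean((net - net.mean()) ** 2))
```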
The following describes the parameters of the present embodiment in turn. The real-time optimization of the present embodiment includes the following constraints.
1. Battery capacity constraint
The battery charge at each moment must not exceed the upper and lower limits of the battery capacity:

In the formula, E_min and E_max are respectively the lower and upper limits of the remaining battery capacity; E(t_k) is the battery charge at moment t_k, which is a known quantity up to the current moment and a state variable thereafter.
During online calculation, the battery charge at the current moment is the initial value and the charge at the final moment of the day is the final value. Neglecting battery losses, the charge the battery loses over a period of time equals the energy it outputs over that period:

In the formula, Δt is the interval between adjacent load data points, and the initial and final values of the remaining battery capacity are the battery charges at the current moment and at the final moment, respectively.
2. Power constraint
Owing to the limitations of the power conversion system (PCS) and of the battery body, the output power of the battery at each moment cannot exceed the upper and lower power limits:
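The two constraints above, together with the energy balance, can be expressed as a small feasibility check; the bound names and the sign convention (charging increases the stored energy) are assumptions made for illustration:

```python
def feasible(e_prev, p_bes, dt, e_min, e_max, p_min, p_max, tol=1e-6):
    """Check capacity, power and energy-balance constraints over one interval.

    e_prev : battery charge at the previous moment (kWh)
    p_bes  : BES power over the interval, charging positive (kW)
    dt     : interval between adjacent load data points (h)
    """
    e_next = e_prev + p_bes * dt          # energy balance, battery losses neglected
    ok_capacity = e_min - tol <= e_next <= e_max + tol
    ok_power = p_min - tol <= p_bes <= p_max + tol
    return ok_capacity and ok_power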
In this embodiment, the optimization problem is converted into a Markov sequential decision model, which mainly comprises a state space, an action space and a return function.
For convenience of description, the following quantities are defined (see the sketch after this list):

3. S: the state space; in this embodiment the state comprises the current energy storage output power and the load forecast value.

A: the action space, which in this embodiment refers to the charge and discharge power of the energy storage at the future moment;

P: the transition probability distribution; the transition here is deterministic, so it is set to 1.

The return function contains a load-fluctuation variance-minimization term, a term ensuring the battery charge stays within its upper and lower limits, a term ensuring the charge and discharge power stay within their upper and lower limits, and a term ensuring the charge/discharge energy-power balance.

7. π(a|s): in this patent, the probability assigned to each candidate energy storage charge and discharge power.

9. The state-action energy storage Q-Value network Q_π(s_t, a_t).

10. The state energy storage Q-Value network V_π(s_t).

11. The advantage (merit) function A(s_t, a_t): its physical meaning is the difference, in state s_t, between the value of the selected action a_t and the expected value over all possible actions, i.e. A(s_t, a_t) = Q_π(s_t, a_t) - V_π(s_t).
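A minimal sketch of these model components follows. The state layout, penalty weight and the exact composition of the return are illustrative assumptions, not the patented formulas themselves:

```python
import numpy as np

def build_state(load_forecast, load_now, p_bes_now):
    """State s_t, assumed layout: [load forecast, current load, current BES power]."""
    return np.array([load_forecast, load_now, p_bes_now], dtype=float)

def reward(net_load_window, e_next, p_bes, e_min, e_max, p_min, p_max, w=100.0):
    """Return r_t: load-flattening term plus penalties for violated limits."""
    r = -float(np.var(net_load_window))                 # variance-minimisation objective
    r -= w * max(0.0, e_min - e_next, e_next - e_max)   # battery charge limits
    r -= w * max(0.0, p_min - p_bes, p_bes - p_max)     # charge/discharge power limits
    return r
```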
The design concept of the present embodiment is explained on the basis of the above. The starting point of the scheme is that each strategy update should make the expected discounted return increase monotonically, and its expression is therefore written in the following form:

where the added term is the function to be solved; it must satisfy a condition whose purpose is to ensure that the expected discounted return increases monotonically.

Here the two strategies are arbitrary control strategies. It can be seen that the discounted return function used to evaluate a strategy is converted into a form evaluated by the advantage function, and when this term is positive the policy update is an improving update. This expression alone, however, does not give much information, so the states in it are written out explicitly as follows:
Rearranging the terms:

The discounted state visitation probability is then defined:

Its physical meaning is the discount-factor-weighted probability of visiting each state under the new policy, without normalization. In this case, the expected discounted return of the new policy becomes:
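The original formula images are not reproduced in this text; the standard relations they describe are, for reference:

```latex
\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau\sim\tilde\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}A_\pi(s_t,a_t)\Big]
                = \eta(\pi) + \sum_{s}\rho_{\tilde\pi}(s)\sum_{a}\tilde\pi(a\mid s)\,A_\pi(s,a),
\qquad
\rho_{\tilde\pi}(s)=\sum_{t=0}^{\infty}\gamma^{t}\,P(s_t=s\mid\tilde\pi)
```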
From this equation it can be seen how to judge, for a new policy, whether it is a better strategy: for every possible state, examine the expected advantage value; if it is positive, i.e. if:

then the new policy is a better strategy, and in the examined state the policy is updated according to:

until, for all reachable states and all actions that may be taken in those states, the advantage is no longer positive, which indicates convergence to the optimal strategy.

Furthermore, in order to accelerate the calculation, and because the park load, the photovoltaic output and the optimal energy storage control capacity do not change greatly between successive control periods, so that the variation in each training round is not particularly large, the change of the discounted state visitation probability caused by the strategy update is neglected and the visitation probability of the old policy is substituted for that of the new policy. In this case:
For reinforcement learning, a parameter vector θ may be used to parameterize the control strategy as π_θ. It can be proved that:

where π_θold is the current parameterized control strategy and π_θ is the updated parameterized control strategy.

For consistency with the notation in the algorithm below, and for ease of description, the subscripts are simply rewritten as follows:

Here θ_k denotes the current policy and θ_(k+1) denotes the updated policy; both are parameterized policy functions, and the policy can be updated using this inequality relation, which is an inequality in these variables.
Let the right-hand side of the above inequality be the surrogate objective.

According to the Majorize-Minimize optimization principle, this embodiment maximizes the surrogate at each step and updates the control strategy π_θ accordingly, so that the expected discounted return increases incrementally.
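The lower bound underlying this Majorize-Minimize step is the standard TRPO bound, reproduced here as an assumption (ε denotes the maximum absolute advantage); the original formula image may differ in detail:

```latex
\eta(\theta) \;\ge\; L_{\theta_k}(\theta) - C\,D_{\mathrm{KL}}^{\max}(\theta_k,\theta),
\qquad C=\frac{4\epsilon\gamma}{(1-\gamma)^{2}}
```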
The idea of the trust region is to impose the trust-region constraint D̄_KL ≤ δ on the manifold; this constraint applies to all states, each state being examined, which is analogous to the Euclidean-space trust-region constraint in optimization theory.
The objective function in the above optimization problem is now calculated from sampled values.

For the expectation term, importance-sampling estimation may be employed. Let q denote the sampling distribution; then, for a given state, this term can be estimated by the following importance sampling:

The final computational form of the above trust-region problem is:
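A standard sample-based form consistent with this description is the following; it is given for reference and the exact expression in the original formula image may differ:

```latex
\max_{\theta}\;\; \mathbb{E}_{s\sim\rho_{\theta_k},\,a\sim q}\!\left[
   \frac{\pi_\theta(a\mid s)}{q(a\mid s)}\,A_{\theta_k}(s,a)\right]
\quad\text{s.t.}\quad
   \mathbb{E}_{s\sim\rho_{\theta_k}}\!\big[D_{\mathrm{KL}}\big(\pi_{\theta_k}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\big)\big]\le\delta
```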
In summary, the implementation steps of the optimal scheduling method for solving energy storage participation in peak clipping and valley filling based on trust-region reinforcement learning in this embodiment are as follows:
s100: and setting a parameterized depth Q value network, wherein the parameterized depth Q value network is used for parameterizing the input control strategy by utilizing the network parameters of the parameterized depth Q value network and outputting a plurality of parameterized control strategies.
The setting flow of this embodiment is as follows:
Step 1: discretizing the energy storage control interval into 10 equal sub-intervals, each with the corresponding step length;
Step 2: setting the energy storage strategy neural network corresponding to the approximate state-action energy storage Q-Value network Q_π(s_t, a_t):

Step 3: setting the energy storage state value neural network corresponding to the approximate state energy storage Q-Value network V_π(s_t):
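A sketch of this setup under assumed power limits is shown below; the numeric values are illustrative only, and steps 2 and 3 would instantiate networks such as the PolicyNet and ValueNet classes sketched earlier:

```python
import numpy as np

# Step 1 (assumption): split the admissible BES power range into 10 equal levels.
P_MIN, P_MAX = -500.0, 500.0                 # illustrative charge/discharge limits (kW)
N_ACTIONS = 10
action_powers = np.linspace(P_MIN, P_MAX, N_ACTIONS)
step_length = action_powers[1] - action_powers[0]   # step length of each interval

# Steps 2 and 3 then instantiate the two networks sketched earlier, e.g.:
# policy_net, value_net = PolicyNet(), ValueNet()
```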
S200: acquiring historical active load values, load forecast values and the energy storage power output at the corresponding moments; taking the energy storage power output, the load active value and the forecast value at the initial moment as the initial state, controlling the energy storage with an arbitrary initial energy storage control strategy, iteratively training the parameterized deep Q-value network and updating the network parameters with the objective of minimizing the variance of the load curve, and limiting the network-parameter updates with a trust-region optimization model; training ends when the condition D̄_KL(π_θ(k), π_θ(k+1)) ≤ δ is satisfied, where D̄_KL denotes the trust-region constraint on the manifold, π_θ denotes the control strategy parameterized by the network parameters θ, and δ denotes the constraint limit value (this condition is not shown in fig. 1).

In this embodiment, the specific process of iteratively training the parameterized deep Q-value network is as follows:
Step 1: assuming the initial distribution for the park to be a standard normal distribution, obtaining the historical active value and the forecast value of the park load and the energy storage power output at the corresponding moments;
1) Taking this as the initial state, controlling the energy storage N times with the control strategy π_θ to obtain the trajectories, where τ_i denotes the i-th trajectory, (s_t, a_t) is the state and action vector of the trajectory at time t, π_θ is the output of the energy storage strategy network, θ are the parameters of the energy storage strategy network, and the collected trajectories form the round-k strategy state-action trajectory;

2) For each step (s_t, a_t) in τ_i, recording its corresponding reward r_t, the reward here being the realized energy-storage/load regulation gain;

3) For each step in τ_i, calculating the action-state value function Q_π(s_t, a_t) of the corresponding step with the action-state neural network;

4) For each step in τ_i, calculating the value V_ω(s_t) of the corresponding step with the energy storage Q-Value network, where ω are the parameters of the energy storage Q-Value network;

6) Estimating the policy gradient g:

Here g is the policy gradient, N denotes the total number of load and energy-storage control rounds, and the gradient of the energy storage strategy network is taken with respect to θ;

8) Solving the following system of equations for the auxiliary search direction:

9) Letting the iteration index j = 0, 1, 2, ... and successively updating the energy storage strategy network parameters:

If the update reduces the energy storage strategy network loss while satisfying the trust-region constraint, the process of updating the energy storage strategy network parameters ends; otherwise, step 9) continues to be executed;

10) For the energy storage Q-Value network, taking the recorded returns as labels and updating its parameters with a stochastic gradient descent algorithm:

Here the gradient of the energy storage Q-Value network loss function L(ω) is taken with respect to the network parameters ω;

11) Repeating 1) to 10) until the energy storage Q-Value network parameters ω and the energy storage strategy network parameters θ converge, at which point training ends.
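Steps 1) to 11) can be summarised in the following sketch. The trajectory-collection helper, the use of the plain gradient in place of the conjugate-gradient solve of Hx = g, and all hyper-parameter values are illustrative assumptions, not the patented implementation itself:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Discounted returns used as labels for the value network (step 10)."""
    out, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return torch.tensor(list(reversed(out)), dtype=torch.float32)

def train_round(policy_net, value_net, collect_trajectories,
                delta=0.01, alpha=0.8, max_backtracks=10, value_lr=1e-3):
    # steps 1)-2): roll out the current strategy; collect_trajectories is an assumed
    # helper returning (states [T, state_dim], actions [T] long, rewards list)
    states, actions, rewards = collect_trajectories(policy_net)
    returns = discounted_returns(rewards)

    # steps 3)-4): advantage A = Q - V, with the sampled return standing in for Q
    advantages = (returns - value_net(states)).detach()

    # step 6): policy gradient of the surrogate objective
    probs = policy_net(states)
    log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
    surrogate = (log_probs * advantages).mean()
    grads = torch.autograd.grad(surrogate, list(policy_net.parameters()))

    # steps 8)-9): search direction and backtracking line search under the KL constraint;
    # the plain gradient is used here as a simplified stand-in for solving Hx = g
    direction = grads
    old_params = [p.detach().clone() for p in policy_net.parameters()]
    old_probs = probs.detach()
    for j in range(max_backtracks):
        step = alpha ** j
        with torch.no_grad():
            for p, p0, d in zip(policy_net.parameters(), old_params, direction):
                p.copy_(p0 + step * d)
        new_probs = policy_net(states)
        kl = (old_probs * (old_probs.log() - new_probs.log())).sum(dim=1).mean()
        if kl.item() <= delta:          # accept the first step inside the trust region
            break

    # step 10): fit the value network to the discounted returns by SGD
    optimiser = torch.optim.SGD(value_net.parameters(), lr=value_lr)
    value_loss = ((value_net(states) - returns) ** 2).mean()
    optimiser.zero_grad()
    value_loss.backward()
    optimiser.step()
```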
As shown in fig. 2, fig. 2 illustrates the parameter updating process of trust-region reinforcement learning: the direction indicated by the arrow is the direction that guarantees a reduction of the energy storage strategy network loss or of the stochastic gradient, and the corresponding circle is the value range of the parameters under that update. With each update, the range of the parameter update becomes smaller and smaller as the number of updates grows, so that the updating of the network parameters is completed within a limited number of steps.
Fig. 3 and 4 are schematic diagrams of an energy storage strategy neural network and an energy storage state value neural network, respectively. The input of the energy storage strategy neural network comprises a load predicted value, a current load and a current energy storage charging and discharging power, and the probability corresponding to the future energy storage charging and discharging power state is output after the hidden layer operation; the input of the energy storage state value neural network comprises a load predicted value, a current load and a current energy storage charging and discharging power, and after the hidden layer operation, a Q value corresponding to a future energy storage charging and discharging power state is output.
Fig. 5 shows a simplified flow chart of parameterized deep Q-value network training. The training process is based on updating the advantage function: the energy storage strategy neural network updates its parameters with the trust-region method, and the energy storage Q-Value network updates its network parameters with stochastic gradient descent.
S300: acquiring the current active load value and the energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output, and issuing it to the energy storage sub-controller for energy storage scheduling control.

Based on the trained parameterized deep Q-value network, the real-time control steps for implementing optimal scheduling in this embodiment are as follows:
step 2: selecting the strategy corresponding to the ten values with the maximum output result in the energy storage strategy network;
The embodiment provides an optimal scheduling method for solving the problem of energy storage participating in peak clipping and valley filling, comprising: setting a parameterized deep Q-value network, training it with historical load data and the energy storage power output at the corresponding moments, and limiting the control strategy updates with a trust-region optimization model during training, so that an optimal strategy is obtained quickly and accurately and optimal scheduling control of the energy storage is realized under the current conditions. The invention uses trust-region reinforcement learning to limit the size of each strategy update in continuous control, so that the distribution form does not change greatly at each update and the return converges monotonically and incrementally; the optimization result can be corrected on line, and charge and discharge constraints are taken into account to achieve optimal peak clipping and valley filling control.
The above is a detailed description of an embodiment of the optimal scheduling method for solving energy storage participation in peak clipping and valley filling; the following is a detailed description of an embodiment of the optimal scheduling system for solving energy storage participation in peak clipping and valley filling.
The embodiment provides an optimal scheduling system for solving the problem that energy storage participates in peak clipping and valley filling, which comprises:
the setting unit is used for setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network is used for parameterizing an input control strategy with its own network parameters and outputting a plurality of parameterized control strategies;

the training unit is used for acquiring historical active load values, load forecast values and the energy storage power output at the corresponding moments; taking the energy storage power output, the load active value and the forecast value at the initial moment as the initial state, controlling the energy storage with an arbitrary initial energy storage control strategy, iteratively training the parameterized deep Q-value network and updating its network parameters with the objective of minimizing the variance of the load curve, and limiting the network-parameter updates with a trust-region optimization model; training ends when the condition D̄_KL(π_θ(k), π_θ(k+1)) ≤ δ is satisfied, where D̄_KL denotes the trust-region constraint on the manifold, π_θ denotes the control strategy parameterized by the network parameters θ, and δ denotes the constraint limit value;

and the control unit is used for acquiring the current active load value and energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output, and issuing it to the energy storage sub-station controller for energy storage scheduling control.
The parameterized deep Q-value network specifically includes: an energy storage strategy neural network and an energy storage state value neural network;

the energy storage strategy neural network is set up according to the approximate state-action energy storage Q-Value network Q_π(s_t, a_t), with θ as its corresponding network parameters;

the energy storage state value neural network is set up according to the approximate state energy storage Q-Value network V_π(s_t), with ω as its corresponding network parameters;

where s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value of taking action a_t in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions, r denotes the return, and γ denotes the discount factor.
In addition, the trust-region optimization model specifically comprises: maximizing, over the network parameters θ, the expected discounted return of the updated control strategy relative to the control strategy before the update, subject to the trust-region constraint between the updated control strategy and the control strategy before the update;

where π_old denotes the control strategy before the update and π_θ denotes the control strategy updated according to the network parameters θ.
Further, the process of the training unit iteratively training the parameterized deep Q-value network and updating the network parameters specifically includes:

starting from the initial state, controlling the energy storage N times with the control strategy π_θ to obtain the round-k strategy state-action trajectories, where π_θ is the output of the energy storage strategy neural network, θ are the parameters of the energy storage strategy network, τ_i is the i-th trajectory, and (s_t, a_t) is the trajectory state and action vector at time t;

for each step (s_t, a_t) in τ_i, recording the corresponding return, calculating the action-state value function Q_π(s_t, a_t) of the corresponding step with the energy storage strategy neural network based on the return, and calculating the state value function V_ω(s_t) of the corresponding step with the energy storage state value neural network, where ω are the parameters of the energy storage state value neural network;

for each step (s_t, a_t) in τ_i, calculating the advantage function A(s_t, a_t) based on the action-state value function and the state value function;

estimating the policy gradient g based on the advantage function, where N denotes the total number of load and energy-storage control rounds and the gradient of the energy storage strategy neural network is taken with respect to θ;

computing, based on the policy gradient, the second-order partial derivative of the energy storage strategy neural network with respect to θ and solving for the auxiliary variable x, which has no actual physical meaning;

letting the iteration index j = 0, 1, 2, ..., and successively updating the network parameters of the energy storage strategy neural network by a backtracking step, where the maximum number of step-length backtracking steps of the energy storage strategy neural network is limited;

for the energy storage state value neural network, taking the recorded returns as labels and updating its parameters with a stochastic gradient descent algorithm, where the gradient of the energy storage state value neural network loss function L(ω) is taken with respect to the network parameters ω;
Further, in the expression for minimizing the variance of the load curve in this embodiment, n is the number of load data points in a day, determined by the forecast load data, with the current moment corresponding to one of these load data points; P_L(t_k) is the load at moment t_k and is a known quantity: the actual load up to and including the current moment and the forecast load thereafter; P_B(t_k) is the power of the BES between moments t_(k-1) and t_k, with charging positive and discharging negative; it is known up to the current moment and is the control variable thereafter.
It should be noted that, the optimal scheduling system for solving energy storage participation peak clipping and valley filling provided in this embodiment is used to implement the optimal scheduling method provided in the foregoing embodiment, and specific settings of each unit are based on complete implementation of the method, which is not described herein again.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (4)
1. An optimal scheduling method for solving the problem that energy storage participates in peak clipping and valley filling is characterized by comprising the following steps:
setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network is used for parameterizing an input control strategy with its own network parameters and outputting a plurality of parameterized control strategies, the parameterized deep Q-value network specifically comprising: an energy storage strategy neural network and an energy storage state value neural network;

the energy storage strategy neural network being set up according to the approximate state-action energy storage Q-Value network Q_π(s_t, a_t), with θ as its corresponding network parameters;

the energy storage state value neural network being set up according to the approximate state energy storage Q-Value network V_π(s_t), with ω as its corresponding network parameters;

wherein s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value of taking action a_t in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions, r denotes the return, and γ denotes the discount factor;
acquiring historical active load values, load forecast values and the energy storage power output at the corresponding moments; taking the energy storage power output, the load active value and the forecast value at the initial moment as the initial state, controlling the energy storage with an arbitrary initial energy storage control strategy, iteratively training the parameterized deep Q-value network and updating the network parameters with the objective of minimizing the variance of the load curve, and limiting the network-parameter updates with a trust-region optimization model; training ends when the condition D̄_KL(π_θ(k), π_θ(k+1)) ≤ δ is satisfied, wherein D̄_KL denotes the trust-region constraint on the manifold, π_θ denotes the control strategy parameterized by the network parameters θ, δ denotes the constraint limit value, and k and k+1 denote successive updates of the network parameters θ; the trust-region optimization model specifically comprises:

maximizing, over the network parameters θ, the expected discounted return of the updated control strategy relative to the control strategy before the update, subject to the trust-region constraint between the updated control strategy and the control strategy before the update, wherein π_old denotes the control strategy before the update and π_θ denotes the control strategy updated according to the network parameters θ; the iterative training of the parameterized deep Q-value network, the updating of the network parameters and the control of the network-parameter updates with the trust-region optimization model until the condition D̄_KL(π_θ(k), π_θ(k+1)) ≤ δ is satisfied and training ends specifically comprise:
starting from the initial state, controlling the energy storage N times with the control strategy π_θ to obtain the strategy state-action trajectories, wherein π_θ is the output of the energy storage strategy neural network, θ are the parameters of the energy storage strategy network, τ_i is the i-th trajectory of the round-k strategy state-action trajectory, and (s_t, a_t) is the trajectory state and action vector at time t;

for each step (s_t, a_t) in τ_i, recording the corresponding return, calculating the action-state value function Q_π(s_t, a_t) of the corresponding step with the energy storage strategy neural network based on the return, and calculating the state value function V_ω(s_t) of the corresponding step with the energy storage state value neural network, wherein ω are the parameters of the energy storage state value neural network;

for each step (s_t, a_t) in τ_i, calculating the advantage function A(s_t, a_t) based on the action-state value function and the state value function;

estimating the policy gradient g based on the advantage function, wherein N denotes the total number of load and energy-storage control rounds and the gradient of the energy storage strategy neural network is taken with respect to θ;

computing, based on the policy gradient, the second-order partial derivative of the energy storage strategy neural network with respect to θ and solving for the auxiliary variable x, which has no actual physical meaning;

letting the iteration index j = 0, 1, 2, ..., and successively updating the network parameters of the energy storage strategy neural network by a backtracking step, wherein the maximum number of step-length backtracking steps of the energy storage strategy neural network is limited;

for the energy storage state value neural network, taking the recorded returns as labels and updating its parameters with a stochastic gradient descent algorithm, wherein the gradient of the energy storage state value neural network loss function L(ω) is taken with respect to the network parameters ω;

repeating the above steps until the energy storage strategy network parameters θ and the energy storage state value network parameters ω converge, at which point training ends;
and acquiring the current load active value and the current energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output result, and issuing it to the energy storage sub-controller for energy storage scheduling control.
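The training procedure recited in claim 1 (trajectory rollouts, advantage estimation, a policy-gradient step limited by a trust-domain check with step-length backtracking, and a value-function update) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the toy sinusoidal load profile, the three-action charge/hold/discharge set, the tabular softmax policy and per-step value table standing in for the patent's neural networks, and a simplified normalized-gradient step used in place of the conjugate-gradient solution of $Hx=g$.

```python
# Minimal sketch of the trust-domain training loop of claim 1 (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
T = 24                                                   # control steps per episode (one day)
load = 50 + 20 * np.sin(np.linspace(0, 2 * np.pi, T))    # hypothetical load curve
actions = np.array([-10.0, 0.0, 10.0])                   # BES power: discharge, idle, charge
n_a = len(actions)
theta = np.zeros((T, n_a))                               # softmax policy logits per time step
phi = np.zeros(T)                                        # state-value estimate per time step
delta, alpha, max_backtrack, gamma = 0.01, 0.5, 10, 0.99

def policy(t):
    z = np.exp(theta[t] - theta[t].max())
    return z / z.sum()

def rollout():
    """Run one episode; reward is the negative squared deviation from the mean load."""
    acts, rews = [], []
    for t in range(T):
        a = rng.choice(n_a, p=policy(t))
        net = load[t] + actions[a]
        acts.append(a)
        rews.append(-(net - load.mean()) ** 2)
    return np.array(acts), np.array(rews)

for it in range(200):
    acts, rews = rollout()
    # Discounted returns serve as Q estimates and as labels for the value function
    R = np.zeros(T)
    run = 0.0
    for t in reversed(range(T)):
        run = rews[t] + gamma * run
        R[t] = run
    adv = R - phi                                        # advantage A = Q - V
    # Policy gradient of sum_t log pi_theta(a_t | s_t) * A_t
    g = np.zeros_like(theta)
    for t in range(T):
        p = policy(t)
        grad_logp = -p
        grad_logp[acts[t]] += 1.0                        # d/dlogits of log softmax
        g[t] = grad_logp * adv[t]
    # Simplified natural-gradient style step with backtracking until the KL trust region holds
    old_theta = theta.copy()
    old_probs = np.array([policy(t) for t in range(T)])
    step = g / (np.linalg.norm(g) + 1e-8) * np.sqrt(2 * delta)
    for j in range(max_backtrack):
        theta = old_theta + (alpha ** j) * step
        new_probs = np.array([policy(t) for t in range(T)])
        kl = np.mean(np.sum(old_probs * np.log(old_probs / new_probs), axis=1))
        if kl <= delta:
            break                                        # trust-domain constraint satisfied
    else:
        theta = old_theta                                # no acceptable step found, keep old policy
    # Value-function update by gradient descent on (V - R)^2
    phi += 0.1 * (R - phi)

print("final mean |advantage|:", np.abs(adv).mean())
```

The backtracking loop mirrors the claimed step-length backtracking: the step is shrunk by the factor $\alpha$ until the KL trust-domain constraint holds or the maximum number of backtracking iterations is reached.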
2. The optimal scheduling method for solving the problem of energy storage participation peak clipping and valley filling according to claim 1, wherein the expression of the variance of the minimized load curve is specifically as follows:
$$\min F=\frac{1}{N_{d}}\sum_{t=1}^{N_{d}}\Big(P_{L}(t)+P_{B}(t)-\bar{P}\Big)^{2},\qquad \bar{P}=\frac{1}{N_{d}}\sum_{t=1}^{N_{d}}\big(P_{L}(t)+P_{B}(t)\big)$$
in the formula, $N_{d}$ is the number of load data points in a day, determined from the load measurement data, and the current time is set to correspond to the $k$-th ($1\le k\le N_{d}$) load data point; $P_{L}(t)$ is the load at time $t$ and is a known quantity, being the actual load for $t\le k$ and the predicted load for $t>k$; $P_{B}(t)$ is the BES charging/discharging power between time $t-1$ and time $t$, with charging positive and discharging negative, a known quantity for $t\le k$ and the control variable for $t>k$.
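As a worked illustration of the claim-2 objective, the sketch below computes the variance of the combined load/storage curve. The point count, load profile, and function names are hypothetical, not taken from the patent; the caller would concatenate actual load for $t\le k$ and predicted load for $t>k$.

```python
# Hedged sketch: variance of the net curve P_L(t) + P_B(t) over one day.
import numpy as np

def load_curve_variance(p_load: np.ndarray, p_bes: np.ndarray) -> float:
    """Variance of the combined curve; BES charging is positive, discharging negative."""
    net = p_load + p_bes
    return float(np.mean((net - net.mean()) ** 2))

# Example: 96 points (15-minute resolution), idle storage as a baseline plan
n_points = 96
p_load = 60 + 25 * np.sin(np.linspace(0, 2 * np.pi, n_points))
p_bes = np.zeros(n_points)
print(load_curve_variance(p_load, p_bes))
```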
3. An optimal scheduling system for solving the problem that energy storage participates in peak clipping and valley filling, characterized by comprising:
a setting unit, configured to set a parameterized deep Q-value network, wherein the parameterized deep Q-value network is used for parameterizing an input control strategy with its own network parameters and outputting a plurality of parameterized control strategies, and specifically comprises: an energy storage strategy neural network and an energy storage state value neural network;
the energy storage strategy neural network is a Q-Value network based on approximate state-action energy storageSet up as corresponding network parameters;
the energy storage state value neural network is a network set up to approximate the energy storage state value $V_{\pi}(s)$, with $\varphi$ as its corresponding network parameters;
wherein $s$ denotes the state, $a$ denotes the action, $t$ denotes the time, $\pi$ denotes the energy storage control strategy, $Q_{\pi}(s,a)=\mathbb{E}_{\pi}\big[\sum_{t}\gamma^{t}r_{t}\mid s_{0}=s,a_{0}=a\big]$ denotes the value of taking action $a$ in state $s$, $V_{\pi}(s)=\mathbb{E}_{a\sim\pi}\big[Q_{\pi}(s,a)\big]$ denotes the expected value of state $s$ over all possible actions $a$, $r$ denotes the return, and $\gamma$ denotes the discount factor;
a training unit, configured to acquire historical active values and predicted values of the load and the energy storage power output at the corresponding moments, take the energy storage power output, the load active value and the load predicted value at the initial moment as the initial state input, control the energy storage with an arbitrary initial energy storage control strategy, iteratively train the parameterized deep Q-value network with the objective of minimizing the variance of the load curve and update the network parameters, use a trust-domain optimization model to control the number of network-parameter updates, and end the training when the condition $\bar{D}_{KL}(\pi_{\theta_{\text{old}}},\pi_{\theta})\le\delta$ is met, wherein $\bar{D}_{KL}$ denotes the trust-domain constraint on the policy manifold, $\pi_{\theta}$ denotes the control strategy parameterized by the network parameters $\theta$, $\delta$ denotes the constraint limit value, and $\theta_{\text{old}}$ and $\theta$ denote the network parameters before and after the update; the trust-domain optimization model is specifically:
$$\max_{\theta}\; L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta})=\mathbb{E}_{s,a\sim\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\,A_{\pi_{\theta_{\text{old}}}}(s,a)\right]\quad\text{s.t.}\quad \bar{D}_{KL}(\pi_{\theta_{\text{old}}},\pi_{\theta})\le\delta$$
in the formula, $\pi_{\theta_{\text{old}}}$ denotes the control strategy before the update, $\pi_{\theta}$ denotes the control strategy updated according to the network parameters $\theta$, $L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta})$ denotes the expected discounted return of the updated control strategy relative to the control strategy before the update, and $\bar{D}_{KL}(\pi_{\theta_{\text{old}}},\pi_{\theta})$ denotes the trust-domain constraint between the updated control strategy and the control strategy before the update; the process in which the training unit iteratively trains the parameterized deep Q-value network and updates the network parameters specifically comprises:
taking the initial state as the starting state, controlling the energy storage $M$ times with the control strategy $\pi_{\theta}$ to obtain the strategy state-action trajectories $\tau=\{\tau_{1},\tau_{2},\dots,\tau_{M}\}$, wherein $\pi_{\theta}(a\mid s)$ is the output result of the energy storage strategy neural network, $\theta$ is the parameter of the energy storage strategy network, $\tau_{i}$ is the $i$-th round strategy state-action trajectory, $i$ is the trajectory index, and $s^{i}_{t}$ and $a^{i}_{t}$ are the trajectory state and the action vector of the $i$-th trajectory at time $t$;
for each step $t$ in $\tau_{i}$, recording the corresponding return, calculating the action-state value function $Q_{\pi}(s_{t},a_{t})$ of the corresponding step based on the return by using the energy storage strategy neural network, and calculating the state value function $V_{\varphi}(s_{t})$ of the corresponding step by using the energy storage state value neural network, wherein $\varphi$ is the parameter of the energy storage state value neural network;
for each step $t$ in $\tau_{i}$, calculating the advantage function from the action-state value function and the state value function: $A(s_{t},a_{t})=Q_{\pi}(s_{t},a_{t})-V_{\varphi}(s_{t})$;
estimating the policy gradient based on the advantage function: $g=\frac{1}{N}\sum_{i}\sum_{t}\nabla_{\theta}\log\pi_{\theta}(a^{i}_{t}\mid s^{i}_{t})\,A(s^{i}_{t},a^{i}_{t})$, wherein $N$ denotes the total number of control rounds of the load and the energy storage, and $\nabla_{\theta}$ denotes the gradient of the energy storage strategy neural network with respect to $\theta$;
computing, based on the policy gradient, the second-order partial derivative $H$ of the energy storage strategy neural network's trust-domain constraint with respect to $\theta$, and solving $Hx=g$ for $x$, wherein $x$ is an auxiliary variable with no actual physical significance;
letting the iteration subscript $j=0,1,\dots,L$, sequentially updating the network parameters of the energy storage strategy neural network to $\theta_{\text{new}}=\theta_{\text{old}}+\alpha^{j}\sqrt{\tfrac{2\delta}{x^{\top}Hx}}\,x$, wherein $L$ denotes the maximum number of step-length backtracking iterations of the energy storage strategy neural network and $\alpha\in(0,1)$ is the backtracking coefficient;
for the energy storage state value neural network, taking the discounted return $\hat{R}_{t}$ as the label and adopting the stochastic gradient descent algorithm to update the parameters to $\varphi\leftarrow\varphi-\beta\,\nabla_{\varphi}\mathcal{L}(\varphi)$, wherein $\nabla_{\varphi}\mathcal{L}(\varphi)$ is the gradient of the energy storage state value neural network loss function $\mathcal{L}(\varphi)=\sum_{t}\big(V_{\varphi}(s_{t})-\hat{R}_{t}\big)^{2}$ with respect to the network parameters $\varphi$, and $\beta$ is the learning rate;
repeating the above steps until the conditions $\bar{D}_{KL}(\pi_{\theta_{\text{old}}},\pi_{\theta})\le\delta$ and $L_{\pi_{\theta_{\text{old}}}}(\pi_{\theta})\ge 0$ are both met, at which point the training ends;
and a control unit, configured to acquire the current load active value and the current energy storage power output, input them into the trained parameterized deep Q-value network, select the strategy corresponding to the maximum value in the output result, and issue it to the energy storage sub-controller for energy storage scheduling control.
4. The optimal scheduling system for solving the problem of energy storage participation peak clipping and valley filling according to claim 3, wherein the expression of the variance of the minimized load curve is specifically as follows:
$$\min F=\frac{1}{N_{d}}\sum_{t=1}^{N_{d}}\Big(P_{L}(t)+P_{B}(t)-\bar{P}\Big)^{2},\qquad \bar{P}=\frac{1}{N_{d}}\sum_{t=1}^{N_{d}}\big(P_{L}(t)+P_{B}(t)\big)$$
in the formula, $N_{d}$ is the number of load data points in a day, determined by the predicted load data, and the current time is set to correspond to the $k$-th ($1\le k\le N_{d}$) load data point; $P_{L}(t)$ is the load at time $t$ and is a known quantity, being the actual load for $t\le k$ and the predicted load for $t>k$; $P_{B}(t)$ is the BES charging/discharging power between time $t-1$ and time $t$, with charging positive and discharging negative, a known quantity for $t\le k$ and the control variable for $t>k$.
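For the control unit recited in claims 1 and 3, a minimal sketch of the dispatch step is given below. `TrainedQNetwork` and `StorageSubController` are hypothetical stand-ins, since the patent does not specify the interfaces of the trained network or the energy storage sub-controller.

```python
# Hedged sketch of the control-unit step: evaluate the trained network on the
# current state and dispatch the maximum-value strategy to the sub-controller.
from typing import Sequence

class TrainedQNetwork:
    def evaluate(self, state: Sequence[float]) -> Sequence[float]:
        # Placeholder: would return one value per candidate control strategy.
        return [0.2, 0.7, 0.1]

class StorageSubController:
    def dispatch(self, strategy_index: int) -> None:
        print(f"dispatching strategy {strategy_index} to energy storage")

def schedule_step(net: TrainedQNetwork, ctrl: StorageSubController,
                  load_active: float, storage_output: float) -> int:
    values = net.evaluate([load_active, storage_output])
    best = max(range(len(values)), key=lambda i: values[i])  # max-value strategy
    ctrl.dispatch(best)
    return best

schedule_step(TrainedQNetwork(), StorageSubController(), load_active=62.5, storage_output=-5.0)
```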
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210916196.3A CN115001002B (en) | 2022-08-01 | 2022-08-01 | Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115001002A CN115001002A (en) | 2022-09-02 |
CN115001002B true CN115001002B (en) | 2022-12-30 |
Family ID: 83021019
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210916196.3A Active CN115001002B (en) | 2022-08-01 | 2022-08-01 | Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115001002B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116826816B (en) * | 2023-08-30 | 2023-11-10 | 湖南大学 | Energy storage active-reactive coordination multiplexing method considering electric energy quality grading management |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220164657A1 (en) * | 2020-11-25 | 2022-05-26 | Chevron U.S.A. Inc. | Deep reinforcement learning for field development planning optimization |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109347149A (en) * | 2018-09-20 | 2019-02-15 | 国网河南省电力公司电力科学研究院 | Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning |
CN110488861A (en) * | 2019-07-30 | 2019-11-22 | 北京邮电大学 | Unmanned plane track optimizing method, device and unmanned plane based on deeply study |
CN110365057A (en) * | 2019-08-14 | 2019-10-22 | 南方电网科学研究院有限责任公司 | Distributed energy participation power distribution network peak regulation scheduling optimization method based on reinforcement learning |
CN113242469A (en) * | 2021-04-21 | 2021-08-10 | 南京大学 | Self-adaptive video transmission configuration method and system |
CN113572157A (en) * | 2021-07-27 | 2021-10-29 | 东南大学 | User real-time autonomous energy management optimization method based on near-end policy optimization |
CN114630299A (en) * | 2022-03-08 | 2022-06-14 | 南京理工大学 | Information age-perceptible resource allocation method based on deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
Lyu Xiaoqian; Research on dynamic economic dispatch of power systems addressing the uncertainty of renewable-energy forecast deviations; China Master's Theses Full-text Database, Engineering Science and Technology II; 2022-02-28; pp. 29-30 *
Also Published As
Publication number | Publication date |
---|---|
CN115001002A (en) | 2022-09-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110059844B (en) | Energy storage device control method based on ensemble empirical mode decomposition and LSTM | |
Jasmin et al. | Reinforcement learning approaches to economic dispatch problem | |
CN112614009A (en) | Power grid energy management method and system based on deep expected Q-learning | |
Zhou et al. | Reinforcement learning-based scheduling strategy for energy storage in microgrid | |
CN112491094B (en) | Hybrid-driven micro-grid energy management method, system and device | |
CN105631528B (en) | Multi-target dynamic optimal power flow solving method based on NSGA-II and approximate dynamic programming | |
CN117277357B (en) | Novel thermal power energy storage frequency modulation method and system adopting flow battery and electronic equipment | |
CN111367349A (en) | Photovoltaic MPPT control method and system based on prediction model | |
CN112213945B (en) | Improved robust prediction control method and system for electric vehicle participating in micro-grid group frequency modulation | |
CN114784823A (en) | Micro-grid frequency control method and system based on depth certainty strategy gradient | |
CN115001002B (en) | Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling | |
CN116629461B (en) | Distributed optimization method, system, equipment and storage medium for active power distribution network | |
CN116436003B (en) | Active power distribution network risk constraint standby optimization method, system, medium and equipment | |
CN116862050A (en) | Time sequence network-based daily prediction method, system, storage medium and equipment for carbon emission factors | |
CN118381095B (en) | Intelligent control method and device for energy storage charging and discharging of new energy micro-grid | |
CN115986839A (en) | Intelligent scheduling method and system for wind-water-fire comprehensive energy system | |
Harrold et al. | Battery control in a smart energy network using double dueling deep q-networks | |
CN111313449A (en) | Cluster electric vehicle power optimization management method based on machine learning | |
CN111516702B (en) | Online real-time layered energy management method and system for hybrid electric vehicle | |
Wang et al. | Prioritized sum-tree experience replay TD3 DRL-based online energy management of a residential microgrid | |
CN117833316A (en) | Method for dynamically optimizing operation of energy storage at user side | |
CN111799820B (en) | Double-layer intelligent hybrid zero-star cloud energy storage countermeasure regulation and control method for power system | |
CN115459320B (en) | Intelligent decision-making method and device for aggregation control of multipoint distributed energy storage system | |
CN114048576B (en) | Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid | |
CN116979579A (en) | Electric automobile energy-computing resource scheduling method based on safety constraint of micro-grid |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||