CN115001002B - Optimal scheduling method and system for solving the problem of energy storage participating in peak clipping and valley filling - Google Patents

Optimal scheduling method and system for solving the problem of energy storage participating in peak clipping and valley filling

Info

Publication number
CN115001002B
Authority
CN
China
Prior art keywords
energy storage
value
network
strategy
state
Prior art date
Legal status
Active
Application number
CN202210916196.3A
Other languages
Chinese (zh)
Other versions
CN115001002A
Inventor
陈显超
张杰明
高宜凡
陈展尘
王辉
梁妍陟
仲卫
程林晖
钟榜
褚裕谦
Current Assignee
Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Zhaoqing Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202210916196.3A priority Critical patent/CN115001002B/en
Publication of CN115001002A publication Critical patent/CN115001002A/en
Application granted granted Critical
Publication of CN115001002B publication Critical patent/CN115001002B/en

Classifications

    • H ELECTRICITY
    • H02 GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J 3/00 Circuit arrangements for ac mains or ac distribution networks
    • H02J 3/28 Arrangements for balancing of the load in a network by storage of energy
    • H02J 3/32 Arrangements for balancing of the load in a network by storage of energy using batteries with converting means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q 10/06315 Needs-based resource requirements planning or analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/06 Energy or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications


Abstract

The invention provides an optimal scheduling method and system for solving the problem of energy storage participating in peak clipping and valley filling. A parameterized deep Q-value network is set up and trained with historical load data and the energy storage power output at the corresponding moments, and a trust domain optimization model is used during training to limit the number of control-strategy updates, so that an optimal strategy is obtained quickly and accurately and optimal scheduling control of the energy storage is realized under the current conditions. The invention uses trust-domain reinforcement learning to limit the size of each strategy update in continuous control, so that the distribution does not change greatly at each update and the return converges monotonically and incrementally; the optimization result can be corrected online, and charge and discharge constraints are taken into account to achieve the optimal peak clipping and valley filling control function.

Description

Optimal scheduling method and system for solving the problem of energy storage participating in peak clipping and valley filling
Technical Field
The invention belongs to the technical field of power grid dispatching, and particularly relates to an optimal dispatching method and system, based on trust-domain reinforcement learning, for energy storage participating in peak clipping and valley filling.
Background
The large-scale battery energy storage system can realize the peak clipping and valley filling functions of the load by discharging at the peak of the load and charging at the valley of the load. The power grid company utilizes the stored energy to cut peaks and fill valleys, so that the upgrading of the equipment capacity can be postponed, the utilization rate of the equipment is improved, and the updating cost of the equipment is saved; the power consumer can utilize the energy storage to cut peak and fill valley, and can utilize the peak-valley power price difference to obtain economic benefit. How to achieve the optimal peak clipping and valley filling effects by using the limited battery capacity and meet the limits of a set of constraint conditions needs to be realized by means of an optimization algorithm.
The classical optimization algorithms for solving the charging and discharging strategy of an energy storage system include gradient algorithms and dynamic programming. Gradient algorithms cannot handle discontinuous constraint conditions and depend strongly on the initial values. Dynamic programming can take discontinuous and nonlinear constraints into account in the model and is convenient to solve on a computer. However, when large-scale energy storage is connected to the grid and the load is highly random, both methods suffer from precision and computational-efficiency problems; moreover, both rely on an accurate physical model, whose accuracy is difficult to guarantee in practical problems.
Disclosure of Invention
In view of the above, the invention aims to solve the problems that, when large-scale energy storage is connected to the grid and the load is highly random, the classical optimization algorithms for solving the charging and discharging strategy of an energy storage system suffer from precision and computational-efficiency problems, and the modeling accuracy is difficult to guarantee.
In order to solve the technical problems, the invention provides the following technical scheme:
in a first aspect, the present invention provides an optimal scheduling method for solving the problem that energy storage participates in peak clipping and valley filling, including the following steps:
setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network parameterizes the input control strategy with its own network parameters and outputs a plurality of parameterized control strategies;

acquiring historical active values and predicted values of the load and the energy storage power output at the corresponding moments, taking the energy storage power output, the load active value and the predicted value at the initial moment as the initial state, controlling the energy storage with an arbitrary initial energy storage control strategy, iteratively training the parameterized deep Q-value network and updating the network parameters with the objective of minimizing the variance of the load curve, controlling the number of network-parameter updates with a trust domain optimization model, and ending the training when the condition D̄_KL(θ_k, θ_{k+1}) ≤ δ is met, where D̄_KL denotes the trust domain constraint on the manifold, π_θ denotes the control strategy parameterized by the network parameters θ, δ denotes the constraint limit value, and k and k + 1 denote the update indices of the network parameters θ;

and acquiring the current load active value and energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output results, and issuing it to the energy storage sub-controller for energy storage scheduling control.
Further, the parameterized deep Q-value network specifically comprises an energy storage strategy neural network and an energy storage state value neural network;

the energy storage strategy neural network is set up from the approximate state-action energy storage Q-Value network Q_π(s_t, a_t), with corresponding network parameters θ; the energy storage state value neural network is set up from the approximate state energy storage Q-Value network V_π(s_t), with corresponding network parameters φ;

where s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value obtained when action a_t is taken in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions a, r denotes the return, and γ denotes the discount factor.
Further, the trust domain optimization model is specifically:

maximize over θ:  L_{π_old}(π_θ)   subject to   D̄_KL(π_old, π_θ) ≤ δ,

where π_old denotes the control strategy before the update, π_θ denotes the control strategy updated according to the network parameters θ, L_{π_old}(π_θ) denotes the expected discount return of the updated control strategy compared with the control strategy before the update, and D̄_KL(π_old, π_θ) ≤ δ denotes the trust domain constraint between the updated control strategy and the control strategy before the update.
Further, the iterative training of the parameterized deep Q-value network, the updating of the network parameters, and the control of the number of network-parameter updates with the trust domain optimization model until the condition D̄_KL(θ_k, θ_{k+1}) ≤ δ is met and the training ends, specifically comprise:

taking the initial state as the starting state, controlling the energy storage T times with the control strategy π_{θ_k} to obtain the strategy state-action trajectory set D_k = {τ_1, τ_2, …}, where π is the output result of the energy storage strategy neural network, θ are the parameters of the energy storage strategy network, D_k is the k-th round strategy state-action trajectory set, τ_i is the i-th trajectory with τ_i = {(s_t^i, a_t^i)}, and (s_t^i, a_t^i) is the state and action vector of the i-th trajectory at time t;

for each step t in D_k, recording the corresponding return, computing the action-state cost function Q_π(s_t, a_t) of the corresponding step with the energy storage strategy neural network based on that return, and computing the state cost function V_φ(s_t) of the corresponding step with the energy storage state value neural network, where φ are the parameters of the energy storage state value neural network;

for each step t in D_k, computing the advantage function A_t based on the action-state cost function and the state cost function:

A_t = Q_π(s_t, a_t) − V_φ(s_t);

estimating the policy gradient g_k based on the advantage function:

g_k = (1 / |D_k|) Σ_{τ∈D_k} Σ_t ∇_θ log π_θ(a_t | s_t) · A_t,

where |D_k| denotes the total number of control rounds of the load and energy storage, and ∇_θ denotes the gradient of the energy storage strategy neural network with respect to θ;

computing, based on the policy gradient, the second-order partial derivative H_k of the energy storage strategy neural network with respect to θ, and solving

H_k x_k = g_k,

where x_k is an auxiliary variable with no actual physical meaning;

letting the iteration subscript j = 0, 1, …, K and sequentially updating the network parameters of the energy storage strategy neural network as

θ_{k+1} = θ_k + α^j √( 2δ / (x_kᵀ H_k x_k) ) · x_k,

where K denotes the maximum number of backtracking steps of the step length of the energy storage strategy neural network and α ∈ (0, 1) is the backtracking coefficient;

for the energy storage state value neural network, taking R̂_t as the label and updating its parameters with the stochastic gradient descent algorithm as

φ_{k+1} = φ_k − β ∇_φ L_V,

where ∇_φ L_V is the gradient of the energy storage state value neural network loss function L_V = (1 / (|D_k| T)) Σ_{τ∈D_k} Σ_t ( V_φ(s_t) − R̂_t )² with respect to the network parameters φ, and β is the learning rate;

repeating the above steps until the conditions that the energy storage strategy network parameters θ and the energy storage state value network parameters φ have converged are met, at which point the training ends.
Further, the expression for minimizing the variance of the load curve is specifically:

min F = (1/N) Σ_{t=1}^{N} [ P_L(t) + P_BES(t) − P_av ]²,   with   P_av = (1/N) Σ_{t=1}^{N} [ P_L(t) + P_BES(t) ],

where N is the number of load data points in one day and is determined by the predicted load data, and the current moment corresponds to the i-th (1 ≤ i ≤ N) load data point; P_L(t) is the load at time t and is a known quantity — the actual load for t < i and the predicted load for t ≥ i; P_BES(t) is the BES (battery energy storage) power between time t and time t + 1, with charging taken as positive and discharging as negative — a known quantity for t < i and the control variable for t ≥ i.
In a second aspect, the present invention provides an optimized scheduling system for solving the problem that energy storage participates in peak clipping and valley filling, including:
a setting unit, configured to set a parameterized deep Q-value network, wherein the parameterized deep Q-value network parameterizes the input control strategy with its own network parameters and outputs a plurality of parameterized control strategies;

a training unit, configured to acquire historical active values and predicted values of the load and the energy storage power output at the corresponding moments, take the energy storage power output, the load active value and the predicted value at the initial moment as the initial state, control the energy storage with an arbitrary initial energy storage control strategy, iteratively train the parameterized deep Q-value network and update the network parameters with the objective of minimizing the variance of the load curve, control the number of network-parameter updates with a trust domain optimization model, and end the training when the condition D̄_KL(θ_k, θ_{k+1}) ≤ δ is met, where D̄_KL denotes the trust domain constraint on the manifold, π_θ denotes the control strategy parameterized by the network parameters θ, δ denotes the constraint limit value, and k and k + 1 denote the update indices of the network parameters θ;

and a control unit, configured to acquire the current load active value and energy storage power output, input them into the trained parameterized deep Q-value network, select the strategy corresponding to the maximum value in the output results, and issue it to the energy storage sub-controller for energy storage scheduling control.
Further, the parameterized deep Q-value network specifically comprises an energy storage strategy neural network and an energy storage state value neural network;

the energy storage strategy neural network is set up from the approximate state-action energy storage Q-Value network Q_π(s_t, a_t), with corresponding network parameters θ; the energy storage state value neural network is set up from the approximate state energy storage Q-Value network V_π(s_t), with corresponding network parameters φ;

where s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value obtained when action a_t is taken in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions a, r denotes the return, and γ denotes the discount factor.
Further, the trust domain optimization model is specifically:

maximize over θ:  L_{π_old}(π_θ)   subject to   D̄_KL(π_old, π_θ) ≤ δ,

where π_old denotes the control strategy before the update, π_θ denotes the control strategy updated according to the network parameters θ, L_{π_old}(π_θ) denotes the expected discount return of the updated control strategy compared with the control strategy before the update, and D̄_KL(π_old, π_θ) ≤ δ denotes the trust domain constraint between the updated control strategy and the control strategy before the update.
Further, the process of the training unit iteratively training the parameterized deep Q-value network and updating the network parameters specifically comprises:

taking the initial state as the starting state, controlling the energy storage T times with the control strategy π_{θ_k} to obtain the strategy state-action trajectory set D_k = {τ_1, τ_2, …}, where π is the output result of the energy storage strategy neural network, θ are the parameters of the energy storage strategy network, D_k is the k-th round strategy state-action trajectory set, τ_i is the i-th trajectory with τ_i = {(s_t^i, a_t^i)}, and (s_t^i, a_t^i) is the state and action vector of the i-th trajectory at time t;

for each step t in D_k, recording the corresponding return, computing the action-state cost function Q_π(s_t, a_t) of the corresponding step with the energy storage strategy neural network based on that return, and computing the state cost function V_φ(s_t) of the corresponding step with the energy storage state value neural network, where φ are the parameters of the energy storage state value neural network;

for each step t in D_k, computing the advantage function A_t based on the action-state cost function and the state cost function:

A_t = Q_π(s_t, a_t) − V_φ(s_t);

estimating the policy gradient g_k based on the advantage function:

g_k = (1 / |D_k|) Σ_{τ∈D_k} Σ_t ∇_θ log π_θ(a_t | s_t) · A_t,

where |D_k| denotes the total number of control rounds of the load and energy storage, and ∇_θ denotes the gradient of the energy storage strategy neural network with respect to θ;

computing, based on the policy gradient, the second-order partial derivative H_k of the energy storage strategy neural network with respect to θ, and solving

H_k x_k = g_k,

where x_k is an auxiliary variable with no actual physical meaning;

letting the iteration subscript j = 0, 1, …, K and sequentially updating the network parameters of the energy storage strategy neural network as

θ_{k+1} = θ_k + α^j √( 2δ / (x_kᵀ H_k x_k) ) · x_k,

where K denotes the maximum number of backtracking steps of the step length of the energy storage strategy neural network and α ∈ (0, 1) is the backtracking coefficient;

for the energy storage state value neural network, taking R̂_t as the label and updating its parameters with the stochastic gradient descent algorithm as

φ_{k+1} = φ_k − β ∇_φ L_V,

where ∇_φ L_V is the gradient of the energy storage state value neural network loss function L_V = (1 / (|D_k| T)) Σ_{τ∈D_k} Σ_t ( V_φ(s_t) − R̂_t )² with respect to the network parameters φ, and β is the learning rate;

repeating the steps until the conditions that the energy storage strategy network parameters θ and the energy storage state value network parameters φ have converged are met, at which point the training ends.
Further, the expression for minimizing the variance of the load curve is specifically:

min F = (1/N) Σ_{t=1}^{N} [ P_L(t) + P_BES(t) − P_av ]²,   with   P_av = (1/N) Σ_{t=1}^{N} [ P_L(t) + P_BES(t) ],

where N is the number of load data points in one day and is determined by the predicted load data, and the current moment corresponds to the i-th (1 ≤ i ≤ N) load data point; P_L(t) is the load at time t and is a known quantity — the actual load for t < i and the predicted load for t ≥ i; P_BES(t) is the BES (battery energy storage) power between time t and time t + 1, with charging taken as positive and discharging as negative — a known quantity for t < i and the control variable for t ≥ i.
In conclusion, the invention provides an optimal scheduling method and system for solving the problem of energy storage participating in peak clipping and valley filling. A parameterized deep Q-value network is set up and trained with historical load data and the energy storage power output at the corresponding moments, and a trust domain optimization model is used during training to limit the number of control-strategy updates, so that an optimal strategy is obtained quickly and accurately and optimal scheduling control of the energy storage is realized under the current conditions. The invention uses trust-domain reinforcement learning to limit the size of each strategy update in continuous control, so that the distribution does not change greatly at each update and the return converges monotonically and incrementally; the optimization result can be corrected online, and charge and discharge constraints are taken into account to achieve the optimal peak clipping and valley filling control function.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a schematic flowchart of an optimal scheduling method for solving energy storage participation peak clipping and valley filling according to an embodiment of the present invention;
FIG. 2 is a parameter updating process of trust domain-reinforcement learning provided by the embodiment of the present invention;
fig. 3 is a schematic diagram of an energy storage strategy neural network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an energy storage state value neural network provided by an embodiment of the present invention;
fig. 5 is a flowchart of network training according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The large-scale battery energy storage system can realize the peak clipping and valley filling functions of the load by discharging at the peak of the load and charging at the valley of the load. The power grid company utilizes the energy storage to carry out peak clipping and valley filling, so that the upgrading of the equipment capacity can be postponed, the utilization rate of the equipment is improved, and the updating cost of the equipment is saved; the power consumer can utilize the energy storage to cut peak and fill valley, and can utilize the peak-valley power price difference to obtain economic benefit. How to achieve the optimal peak clipping and valley filling effects by using the limited battery capacity and meet the limits of a set of constraint conditions needs to be realized by means of an optimization algorithm.
The classical optimization algorithms for solving the charging and discharging strategy of an energy storage system include gradient algorithms and dynamic programming. Gradient algorithms cannot handle discontinuous constraint conditions and depend strongly on the initial values. Dynamic programming can take discontinuous and nonlinear constraints into account in the model and is convenient to solve on a computer. However, when large-scale energy storage is connected to the grid and the load is highly random, both methods suffer from precision and computational-efficiency problems; moreover, both rely on an accurate physical model, whose accuracy is difficult to guarantee in practical problems.
Traditional reinforcement learning methods based on the policy gradient have enabled deep neural networks to make notable progress in control tasks. However, it is difficult to achieve good results with the policy gradient method, because it is very sensitive to the iteration step size: if the step is too small, the training process is very slow; if it is chosen too large, the feedback signal is buried in noise, and model performance may even collapse. The sampling efficiency of such methods is also often low, and even a simple learning task may require millions to billions of total iterations.
Based on the method, the invention provides an optimal scheduling method and system for solving the problem that energy storage participates in peak clipping and valley filling based on the trust domain-reinforcement learning.
The following describes a method for optimizing and scheduling energy storage participation peak clipping and valley filling based on trust domain-reinforcement learning.
Referring to fig. 1, the present embodiment provides an optimal scheduling method for solving energy storage participation peak clipping and valley filling based on trust domain-reinforcement learning.
First, the design idea of solving the optimal scheduling of energy storage participating in peak clipping and valley filling based on trust-domain reinforcement learning is explained in detail as follows:
trust Region Policy Optimization (TRPO) limits the size of strategy update in continuous control, does not change the distribution form greatly every time of update, enables the benefit to meet the requirement of incremental convergence, and can correct the Optimization result on line.
Because the charging and discharging power of the energy storage can be changed rapidly and flexibly, ramp-rate constraints do not need to be considered. Ignoring the internal losses of the battery, the battery can be regarded as a constant-voltage-source model. If the owner of the energy storage system is a power consumer, then under a market electricity-price system the user's goal is to maximize the economic benefit the energy storage system brings; if the owner is the grid, the goal is to make the grid load curve as flat as possible in order to reduce the number of start-ups and shut-downs of conventional generator sets and the spinning-reserve capacity. Mathematically, the variance reflects the degree to which a random variable deviates from its mean, and the variance of the load reflects how flat the load curve is. This embodiment therefore chooses minimizing the variance of the load curve as the objective function:
min F = (1/N) Σ_{t=1}^{N} [ P_L(t) + P_BES(t) − P_av ]²,   with   P_av = (1/N) Σ_{t=1}^{N} [ P_L(t) + P_BES(t) ],

where N is the number of load data points in one day and is determined by the predicted load data, and the current moment corresponds to the i-th (1 ≤ i ≤ N) load data point; P_L(t) is the load at time t and is a known quantity — the actual load for t < i and the predicted load for t ≥ i; P_BES(t) is the BES (battery energy storage) power between time t and time t + 1, with charging taken as positive and discharging as negative — a known quantity for t < i and the control variable for t ≥ i.
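To make the objective concrete, the short Python sketch below evaluates the load-curve variance for a candidate energy storage schedule. The function and variable names (`load_variance`, `load`, `p_bes`) are illustrative assumptions, not identifiers from the patent.

```python
import numpy as np

def load_variance(load, p_bes):
    """Variance of the net load curve after energy-storage dispatch.

    load  : array of N load values P_L(t) for one day (actual for t < i,
            forecast for t >= i), in kW.
    p_bes : array of N energy-storage powers P_BES(t), charging positive,
            discharging negative, in kW.
    """
    net = np.asarray(load) + np.asarray(p_bes)      # net load seen by the grid
    return float(np.mean((net - net.mean()) ** 2))

# A schedule that discharges at the peak and charges at the valley should
# reduce the variance relative to doing nothing.
load = np.array([80.0, 60.0, 50.0, 70.0, 100.0, 90.0])
idle = np.zeros_like(load)
peak_shave = np.array([0.0, 10.0, 20.0, 0.0, -20.0, -10.0])
print(load_variance(load, idle), load_variance(load, peak_shave))
```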
The quantities used in this embodiment are described in turn below. The real-time optimization in this embodiment is subject to the following constraints.
1. Battery capacity constraint
The battery energy at each moment must not exceed the upper and lower limits of the battery capacity:

E_min ≤ E(t) ≤ E_max,

where E_min and E_max are respectively the lower and upper limits of the remaining battery energy, and E(t) is the battery energy at time t — a known quantity for t < i and a state variable for t ≥ i.

In the online calculation, the energy at the current moment, E(i), is the initial value, and the energy at time N, E(N), is the final value. Neglecting the losses of the battery, the change of the battery energy over one interval equals the energy exchanged in that period:

E(t+1) = E(t) + P_BES(t) · Δt,

where Δt is the interval between adjacent load data points, and E(i) and E(N) are respectively the initial and final values of the remaining battery energy.
2. Power constraint
Due to the limitations of the power electronic converter (PCS) and the battery body, the output power of the battery at each moment cannot exceed the upper and lower power limits:

−P_max ≤ P_BES(t) ≤ P_max,

where P_max is the maximum charge/discharge power limit.
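The capacity and power constraints above can be checked for a candidate schedule as in the following sketch; the symbols follow the constraint definitions, while the function itself and its interface are illustrative assumptions.

```python
import numpy as np

def feasible(p_bes, e0, e_min, e_max, p_max, dt):
    """Check the battery capacity and power constraints for a schedule.

    p_bes : array of P_BES(t), charging positive, discharging negative (kW).
    e0    : battery energy at the current moment (kWh), the initial value.
    e_min, e_max : lower / upper limits of the remaining battery energy (kWh).
    p_max : maximum charge/discharge power limit (kW).
    dt    : interval between adjacent load data points (h).
    """
    if np.any(np.abs(p_bes) > p_max):          # power constraint
        return False
    energy = e0 + np.cumsum(p_bes) * dt        # lossless energy balance
    return bool(np.all((energy >= e_min) & (energy <= e_max)))
```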
In this embodiment, the optimization problem is converted into a Markov sequential decision model, which mainly comprises a state space, an action space and a return function. For convenience of description, the following notation is defined:
3. S: the state space; in this embodiment the state consists of the current energy storage output power P_BES(i) and the load predicted value P̂_L. A: the action space, which in this embodiment refers to the charge and discharge power of the energy storage at future moments. P: the state transition probability distribution; here the transition is deterministic, and the probability is set to 1.

4. r: the reward function; the reward obtained in this embodiment combines four terms: f₁, the variance-minimization objective of the load fluctuation; f₂, a term ensuring the battery energy stays within its upper and lower limits; f₃, a term ensuring the charge and discharge power stays within its upper and lower limits; and f₄, a term ensuring the charge/discharge energy-power balance.

5. ρ₀: the probability distribution of the initial state s₀; in this embodiment ρ₀ is the standard normal distribution.

6. γ: the discount factor; a conservative value is adopted.

7. π: the control strategy, which in this patent refers to the probability of the charge and discharge power corresponding to the energy storage.

8. η(π): the expected discount return,

η(π) = E_{τ∼π}[ Σ_t γ^t r(s_t, a_t) ],

where t is the time index of the sample trajectory τ and E denotes the averaging operator.
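As a rough illustration of the reward and expected-discount-return definitions above (the patent's exact reward expression is contained in an image that is not reproduced here, so the weighting below is an assumption), a Python sketch might look like:

```python
import numpy as np

def reward(net_load_window, energy, p_bes, e_min, e_max, p_max, w=1.0):
    """Reward combining the variance objective f1 with limit penalties f2, f3.

    The exact weighting in the patent is not shown; equal weights are assumed,
    and f4 (charge/discharge energy-power balance) is assumed to be enforced
    by the simulation step, so it is omitted here.
    """
    f1 = np.var(net_load_window)                              # load fluctuation
    f2 = max(0.0, e_min - energy) + max(0.0, energy - e_max)  # capacity limits
    f3 = max(0.0, abs(p_bes) - p_max)                         # power limits
    return -(f1 + w * (f2 + f3))

def discounted_return(rewards, gamma=0.9):
    """Expected discount return eta(pi), estimated from one sampled trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```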
9. State-action energy storage Q-Value network Q_π(s, a): its physical meaning is the value obtained when action a_t is taken in state s_t.

10. Energy storage Q-Value network V_π(s): its physical meaning is the expected value of state s_t over all possible actions a.

11. The advantage function:

A_π(s, a) = Q_π(s, a) − V_π(s),

whose physical meaning is, in state s_t, the difference between the value of the selected action and the expected value over all possible actions, where V_π(s) = E_{a∼π}[ Q_π(s, a) ].
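A minimal sketch of the advantage computation A_π(s, a) = Q_π(s, a) − V_π(s); here Q is estimated from the sampled discounted return-to-go and V from a state-value estimate, which is one common choice rather than the patent's exact estimator.

```python
import numpy as np

def advantages(rewards, state_values, gamma=0.9):
    """A_t = Q(s_t, a_t) - V(s_t), with Q(s_t, a_t) estimated by the
    discounted return-to-go of the sampled trajectory."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q - np.asarray(state_values)
```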
The design concept of the present embodiment is explained based on the above description. The starting point of the scheme is that each update of the strategy π should make the expected discount return η increase monotonically, so η is written in the form

η(π̃) = η(π) + M(π, π̃),

where M(π, π̃) is the function to be solved; it must satisfy M(π, π̃) ≥ 0, its purpose being to ensure that η increases monotonically.

Using the advantage function, η can be re-expressed as

η(π̃) = η(π) + E_{τ∼π̃}[ Σ_t γ^t A_π(s_t, a_t) ].

Here π̃ and π are two arbitrary control strategies. This converts the discount return function used to evaluate a strategy into a form evaluated through the advantage function: whenever this additional term is positive, the update of the policy is an improvement. Since this expression alone is not very informative, the state s in it is made explicit:

η(π̃) = η(π) + Σ_t Σ_s P(s_t = s | π̃) Σ_a π̃(a|s) γ^t A_π(s, a).

Rearranging the terms,

η(π̃) = η(π) + Σ_s Σ_t γ^t P(s_t = s | π̃) Σ_a π̃(a|s) A_π(s, a).

Define the discounted state visitation probability

ρ_π(s) = P(s₀ = s) + γ P(s₁ = s) + γ² P(s₂ = s) + ⋯ ,

whose physical meaning is the (unnormalized) discount-weighted probability of visiting state s under policy π. With this definition,

η(π̃) = η(π) + Σ_s ρ_{π̃}(s) Σ_a π̃(a|s) A_π(s, a).

From this equation it can be seen how to judge whether a new policy π̃ is a better strategy: for every possible state s, examine its expected advantage; if

Σ_a π̃(a|s) A_π(s, a) ≥ 0,

then π̃ is a better strategy, and in the state s under investigation the policy is updated to π̃. This continues until, for all states s reached under π̃ and all actions a that may be taken, the advantage A_π(s, a) is no longer positive, which indicates convergence to the optimal strategy.

Furthermore, to accelerate the calculation — in particular because the park load, photovoltaic output and energy storage change little between successive control periods, the optimal control quantity does not vary greatly and the variation in each training round is not especially large — the change of the discounted state visitation probability caused by the strategy update is neglected, replacing ρ_{π̃}(s) with ρ_π(s). This gives

L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a).
for reinforcement learning, the use of parameter vectors can be adopted
Figure 611258DEST_PATH_IMAGE004
Parameterizable control strategy
Figure 645073DEST_PATH_IMAGE005
Is composed of
Figure 356677DEST_PATH_IMAGE123
It can be proved that:
Figure 159548DEST_PATH_IMAGE124
wherein:
Figure 298406DEST_PATH_IMAGE022
for the purpose of the current parameterized control strategy,
Figure 312192DEST_PATH_IMAGE125
the strategy is controlled for the updated parameterization.
Figure 765171DEST_PATH_IMAGE126
Here, the
Figure 484865DEST_PATH_IMAGE127
Is composed of
Figure 732307DEST_PATH_IMAGE022
Figure 803031DEST_PATH_IMAGE125
To (1) a
Figure 794121DEST_PATH_IMAGE031
And (4) each element.
Figure 633901DEST_PATH_IMAGE128
The calculation expression of (a) is as follows:
Figure 550779DEST_PATH_IMAGE129
for the sake of consistency with the notation in the algorithm below, and for ease of description, the simple rewrite subscripts herein are labeled as follows:
Figure 46482DEST_PATH_IMAGE124
here, the
Figure 638000DEST_PATH_IMAGE130
Represents the current policy to
Figure 269970DEST_PATH_IMAGE131
Represents the updated policy, which is one or more
Figure 187111DEST_PATH_IMAGE005
The parameterized policy function may be updated with this inequality relationship, which is an inequality of the variables.
Let

M_k(θ) = L_{θ_k}(π_θ) − C · D_KL^max(θ_k, θ).

Following the Minorize-Maximize (MM) optimization principle, this embodiment maximizes M_k(θ) at each step and updates the control strategy π accordingly, so that the expected discount return η increases incrementally. To maximize M_k(θ), the scheme adopts a trust-domain method to optimize the model:

maximize over θ:  L_{θ_k}(π_θ)   subject to   D̄_KL(θ_k, θ) ≤ δ.

The idea of the trust domain is to embody the trust domain constraint D̄_KL(θ_k, θ) ≤ δ on the manifold. This constraint is applied to all states — each state is examined — which is similar to the Euclidean-space trust-region constraint in optimization theory.
The following discussion computes the objective function in the above optimization problem from sampled values:

L_{θ_k}(π_θ) = η(θ_k) + Σ_s ρ_{θ_k}(s) Σ_a π_θ(a|s) A_{θ_k}(s, a).

For the term Σ_s ρ_{θ_k}(s) ( · ), the sample mean is used as a replacement, i.e. an expectation E_{s∼ρ_{θ_k}}[ · ], where ρ_{θ_k} is the probability distribution of states under the parameters θ_k.

For the term Σ_a π_θ(a|s) A_{θ_k}(s, a), importance sampling estimation may be employed. Let q denote the sampling distribution; then, for the i-th state s_i, this term can be estimated by the importance sample

E_{a∼q}[ π_θ(a|s_i) / q(a|s_i) · A_{θ_k}(s_i, a) ].

Since D_KL^max has a high computational complexity, the scheme uses the mean KL divergence D̄_KL instead.

The final computational form of the above trust-domain problem is

maximize over θ:  E_{s∼ρ_{θ_k}, a∼π_{θ_k}}[ π_θ(a|s) / π_{θ_k}(a|s) · A_{θ_k}(s, a) ]
subject to:       E_{s∼ρ_{θ_k}}[ D_KL( π_{θ_k}(·|s) ‖ π_θ(·|s) ) ] ≤ δ.
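To make the final computational form concrete, the sketch below evaluates the importance-sampled surrogate objective and the mean KL divergence from sampled states and discrete charge/discharge actions; the array layout and names are illustrative assumptions.

```python
import numpy as np

def surrogate_and_kl(new_probs, old_probs, actions, advantages):
    """new_probs, old_probs: arrays of shape (T, n_actions) holding the action
    distributions pi_theta(.|s_t) and pi_theta_k(.|s_t) at the sampled states.
    actions: sampled action indices a_t; advantages: A_theta_k(s_t, a_t)."""
    t = np.arange(len(actions))
    ratio = new_probs[t, actions] / old_probs[t, actions]       # importance weights
    surrogate = np.mean(ratio * advantages)                     # objective to maximize
    kl = np.mean(np.sum(old_probs * np.log(old_probs / new_probs), axis=1))
    return surrogate, kl                                        # accept update only if kl <= delta
```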
in summary, the implementation steps of the optimal scheduling method for solving the energy storage participation peak clipping and valley filling based on the trust domain-reinforcement learning in this embodiment are as follows:
S100: setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network parameterizes the input control strategy with its own network parameters and outputs a plurality of parameterized control strategies.
The setting flow of this embodiment is as follows:

Step 1: discretize the energy storage control interval into 10 equal sub-intervals, the step length of each interval being ΔP;

Step 2: set up the energy storage strategy neural network that approximates the state-action energy storage Q-Value network Q_π(s_t, a_t), and let its corresponding parameters be θ;

Step 3: set up the energy storage state value neural network that approximates the state energy storage Q-Value network V_π(s_t), and let its corresponding parameters be φ.
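A minimal PyTorch sketch of the two networks set up in S100: a strategy network that outputs a probability over the 10 discretized charge/discharge power levels, and a state-value network that outputs a single value. The layer sizes and the 3-dimensional state are illustrative assumptions; Figs. 3 and 4 show the actual structures.

```python
import torch
import torch.nn as nn

N_ACTIONS = 10   # discretized charge/discharge power levels
STATE_DIM = 3    # load forecast, current load, current storage power (assumed, per Figs. 3-4)

class PolicyNet(nn.Module):          # energy storage strategy neural network
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, hidden), nn.Tanh(),
                                 nn.Linear(hidden, N_ACTIONS))
    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)   # pi_theta(a | s)

class ValueNet(nn.Module):           # energy storage state value neural network
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)              # V_phi(s)
```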
S200: obtaining historical active values and predicted values of loads and energy storage power output at corresponding moments, inputting the energy storage power output, the active values and the predicted values at initial moments as initial states, controlling energy storage by any initial energy storage control strategy, iteratively training a parameterized deep Q value network and updating network parameters by taking variance of a minimized load curve as a target, controlling the updating times of the network parameters by using a trust domain optimization model, and meeting the conditions
D̄_KL(θ_k, θ_{k+1}) ≤ δ; when this condition is met, the training ends. Here D̄_KL denotes the trust domain constraint on the manifold, π_θ denotes the control strategy parameterized by the network parameters θ, and δ denotes the constraint limit value (this condition is not shown in Fig. 1).
In this embodiment, the specific process of iteratively training the parameterized deep Q-value network is as follows:

Step 1: assume that the initial state distribution of the park is the standard normal distribution ρ₀, and obtain the historical active value and predicted value of the park load and the energy storage power output at the corresponding moment as the initial state s₀;

Step 2: set the constraint limit value δ = 0.9 and the maximum number of backtracking steps K for the step length;

Step 3: initialize the policy parameters θ₀ and the energy storage Q-Value network parameters φ₀.
Step 4: for k = 0, 1, 2, …, perform the following steps in order:

1) With s₀ as the initial state, control the energy storage T times with the control strategy π_{θ_k} to obtain the trajectory set D_k = {τ_1, τ_2, …}, where τ_i denotes the i-th trajectory, (s_t^i, a_t^i) is the state and action vector of the i-th trajectory at time t, π is the output of the energy storage strategy network, θ are the parameters of the energy storage strategy network, and D_k is the k-th round strategy state-action trajectory set;

2) For each step t in D_k, record its corresponding reward R̂_t, where R̂_t is the energy storage-load regulation gain;

3) For each step t in D_k, compute the action-state cost function Q_π(s_t, a_t) of the corresponding step with the action-state neural network;

4) For each step t in D_k, compute the state value V_φ(s_t) of the corresponding step with the energy storage Q-Value network, where φ are the parameters of the energy storage Q-Value network;

5) For each step t in D_k, compute the advantage function A_t = Q_π(s_t, a_t) − V_φ(s_t);

6) Estimate the policy gradient

g_k = (1 / |D_k|) Σ_{τ∈D_k} Σ_t ∇_θ log π_θ(a_t | s_t) · A_t,

where g_k is the policy gradient, |D_k| denotes the total number of control rounds of the load and energy storage, and ∇_θ denotes the gradient of the energy storage strategy network with respect to θ;

7) Compute the second-order partial derivative H_k of the energy storage strategy network with respect to θ;

8) Solve the following system of equations:

H_k x_k = g_k,

where x_k is an auxiliary variable with no actual physical meaning;

9) Let the iteration subscript j = 0, 1, 2, …, K and sequentially update the energy storage strategy network parameters:

θ_{k+1} = θ_k + α^j √( 2δ / (x_kᵀ H_k x_k) ) · x_k,

where α ∈ (0, 1) is the backtracking coefficient of the step length; if θ_{k+1} reduces the energy storage strategy network loss while satisfying D̄_KL(θ_k, θ_{k+1}) ≤ δ, the update of the energy storage strategy network parameters ends; otherwise, continue with step 9);

10) For the energy storage Q-Value network, with R̂_t as the label, update its parameters with the stochastic gradient descent algorithm:

φ_{k+1} = φ_k − β ∇_φ L_V,

where ∇_φ L_V is the gradient of the energy storage Q-Value network loss function L_V = (1 / (|D_k| T)) Σ_{τ∈D_k} Σ_t ( V_φ(s_t) − R̂_t )² with respect to the network parameters φ, and β is the learning rate;

11) Repeat 1) to 10) until the energy storage Q-Value network parameters φ and the energy storage strategy network parameters θ both converge, at which point the training ends.
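The condensed sketch below ties steps 1)-11) together. For brevity it replaces the Hessian solve of steps 7)-8) with a plain gradient direction and keeps only the backtracking acceptance test of step 9), so it is a simplified illustration of the trust-domain update rather than the patent's exact procedure; the gym-style environment interface and the hyperparameter values other than δ = 0.9 are assumptions.

```python
import torch

def discounted_returns(rewards, gamma=0.9):
    """Return-to-go labels R_t for the value-network update (step 10)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return torch.tensor(list(reversed(out)), dtype=torch.float32)

def rollout(env, policy, horizon):
    """Sample one trajectory with the current strategy (steps 1-2).
    Assumes a gym-style env with reset() -> state and step(a) -> (state, reward, done)."""
    s = env.reset()
    states, actions, rewards = [], [], []
    for _ in range(horizon):
        probs = policy(torch.as_tensor(s, dtype=torch.float32))
        a = int(torch.multinomial(probs, 1))
        s, r, done = env.step(a)
        states.append(torch.as_tensor(s, dtype=torch.float32))
        actions.append(a); rewards.append(float(r))
        if done:
            break
    return torch.stack(states), torch.tensor(actions), rewards

def train(env, policy, value_net, rounds=100, horizon=96,
          delta=0.9, alpha=0.8, K=10, gamma=0.9, lr_v=1e-3):
    v_opt = torch.optim.SGD(value_net.parameters(), lr=lr_v)        # step 10) SGD
    for _ in range(rounds):                                         # step 4) outer loop
        states, actions, rewards = rollout(env, policy, horizon)
        returns = discounted_returns(rewards, gamma)
        with torch.no_grad():
            adv = returns - value_net(states)                       # steps 3)-5)
        old_probs = policy(states).detach()
        logp = torch.log(policy(states)[torch.arange(len(actions)), actions])
        loss = -(logp * adv).mean()                                 # step 6) negative surrogate
        grads = torch.autograd.grad(loss, list(policy.parameters()))
        old_params = [p.detach().clone() for p in policy.parameters()]
        for j in range(K):                                          # step 9) backtracking
            with torch.no_grad():
                for p, p0, g in zip(policy.parameters(), old_params, grads):
                    p.copy_(p0 - (alpha ** j) * g)                  # descend loss = ascend surrogate
                new_probs = policy(states)
                kl = (old_probs * (old_probs / new_probs).log()).sum(-1).mean()
            if kl <= delta:                                         # trust-domain test
                break                                               # (smallest step kept if none passes)
        v_loss = ((value_net(states) - returns) ** 2).mean()        # step 10) value fit
        v_opt.zero_grad(); v_loss.backward(); v_opt.step()
```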
As shown in Fig. 2, which illustrates the parameter-update process of trust-domain reinforcement learning, the direction indicated by the arrow is the direction that guarantees a reduction of the energy storage strategy network loss or of the stochastic gradient, and the corresponding circle is the value range of the parameters for that update. The update range of the parameters shrinks as the number of updates grows, so the update of the network parameters is completed within a finite number of steps.
Fig. 3 and 4 are schematic diagrams of an energy storage strategy neural network and an energy storage state value neural network, respectively. The input of the energy storage strategy neural network comprises a load predicted value, a current load and a current energy storage charging and discharging power, and the probability corresponding to the future energy storage charging and discharging power state is output after the hidden layer operation; the input of the energy storage state value neural network comprises a load predicted value, a current load and a current energy storage charging and discharging power, and after the hidden layer operation, a Q value corresponding to a future energy storage charging and discharging power state is output.
Fig. 5 shows a simplified flow chart of the parameterized deep Q-value network training. The training process is based on updating the advantage function: the energy storage strategy neural network updates its parameters through the trust-domain method, and the energy storage Q-Value network updates its parameters through stochastic gradient descent.
S300: and acquiring the current load active value and the energy storage power output, inputting the current load active value and the energy storage power output into the trained parameterized depth Q value network, and selecting a strategy corresponding to the maximum value in the output result and issuing the strategy to the energy storage sub-controller for energy storage scheduling control.
Based on the trained parameterized deep Q-value network, the real-time control steps for implementing optimal scheduling in this embodiment are as follows (a minimal code sketch of these steps is given after the list):
Step 1: obtain the current load active power value and energy storage power output s_t;
Step 2: input s_t into the energy storage strategy network;
Step 3: select the strategies a* corresponding to the ten largest values in the output of the energy storage strategy network;
Step 4: send a* to the energy storage sub-controller.
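The sketch below illustrates these four steps in Python, reusing the `policy_net` from the earlier sketch; the state ordering, the normalisation and the `send_to_subcontroller` interface are assumptions for illustration.

```python
import torch

def dispatch_step(load_forecast, load_active_power, storage_power_output, policy_net, k=10):
    """Build the current state, run the energy storage strategy network and
    return the k strategies with the largest outputs (ten by default)."""
    state = torch.tensor([[load_forecast, load_active_power, storage_power_output]],
                         dtype=torch.float32)
    with torch.no_grad():
        scores = policy_net(state).squeeze(0)
    top_values, top_actions = torch.topk(scores, k)
    return top_actions.tolist()

# candidate_strategies = dispatch_step(0.82, 0.78, -0.10, policy_net)
# send_to_subcontroller(candidate_strategies)   # hypothetical interface to the energy storage sub-controller
```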
This embodiment provides an optimal scheduling method for solving the problem that energy storage participates in peak clipping and valley filling: a parameterized deep Q-value network is set up and trained with historical load data and the energy storage power output at the corresponding moments, and a trust-region optimization model limits the number of control strategy updates during training, so that the optimal strategy is obtained quickly and accurately and optimal energy storage scheduling control is achieved under the current conditions. The invention uses trust-region reinforcement learning to bound the size of each strategy update in continuous control, so that the policy distribution does not change drastically at any single update, the return improves monotonically and converges, the optimization result can be corrected online, and charge/discharge constraints are taken into account to achieve the optimal peak clipping and valley filling control function.
The above is a detailed description of an embodiment of the optimal scheduling method for solving the problem that energy storage participates in peak clipping and valley filling; the following is a detailed description of an embodiment of the corresponding optimal scheduling system.
The embodiment provides an optimal scheduling system for solving the problem that energy storage participates in peak clipping and valley filling, which comprises:
a setting unit, which is used for setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network is used for parameterizing an input control strategy with its own network parameters and outputting a plurality of parameterized control strategies;
a training unit for obtaining the historical active power values and forecast values of the load and the energy storage power output at the corresponding moments, taking the energy storage power output, the load active power value and the forecast value at the initial moment as the initial state input, controlling the energy storage with an arbitrary initial energy storage control strategy, iteratively training the parameterized deep Q-value network and updating the network parameters with the objective of minimizing the variance of the load curve, and limiting the number of network parameter updates with a trust-region optimization model; the training ends when the condition
D_KL(π_θk ‖ π_θ{k+1}) ≤ δ
is satisfied, where D_KL denotes the trust-region constraint on the manifold, π_θ denotes the control strategy π parameterized by the network parameters θ, and δ denotes the constraint limit;
and the control unit is used for acquiring the current load active power value and the energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output result, and issuing it to the energy storage sub-controller for energy storage scheduling control.
The parameterized deep Q-value network specifically includes: an energy storage strategy neural network and an energy storage state value neural network;
the energy storage strategy neural network is built from the approximate state-action energy storage Q-Value function Q_π(s_t, a_t), with corresponding network parameters θ; the energy storage state value neural network is built from the approximate state energy storage Q-Value function V_π(s_t), with corresponding network parameters ω;
wherein s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value of taking action a_t in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions a, r denotes the return, and γ denotes the discount factor.
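For reference, the two value functions just defined are normally written as discounted-return expectations; the following block gives this standard form, which is an assumption consistent with the description rather than the exact expression in the original figures.

```latex
Q_{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right],
\qquad
V_{\pi}(s_t) = \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\!\left[\,Q_{\pi}(s_t, a)\,\right].
```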
In addition, the trust-region optimization model is specifically as follows:
max_θ L_{π_old}(π_θ)  subject to  D_KL(π_old ‖ π_θ) ≤ δ
in the formula, π_old denotes the control strategy before the update, π_θ denotes the control strategy updated according to the network parameters θ, L_{π_old}(π_θ) denotes the expected discounted return of the updated control strategy compared with the control strategy before the update, and D_KL(π_old ‖ π_θ) ≤ δ denotes the trust-region constraint between the updated control strategy and the control strategy before the update.
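In trust-region policy optimization, the expected-return term and the constraint above usually take the following importance-weighted form; since the exact expressions in the original figures are not recoverable, this block is a standard-form assumption consistent with the surrounding description.

```latex
L_{\pi_{\text{old}}}(\pi_\theta)
  = \mathbb{E}_{s,a \sim \pi_{\text{old}}}
    \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A_{\pi_{\text{old}}}(s, a) \right],
\qquad
D_{KL}(\pi_{\text{old}} \,\|\, \pi_\theta)
  = \mathbb{E}_{s}\!\left[ D_{KL}\!\big(\pi_{\text{old}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big) \right] \le \delta .
```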
Further, the process in which the training unit iteratively trains the parameterized deep Q-value network and updates the network parameters specifically includes:
taking the initial state as the starting point, controlling the energy storage N times with the strategy π_θ to obtain the strategy state-action trajectories D_k, where π_θ is the output result of the energy storage strategy neural network, θ_k are the parameters of the energy storage strategy network, D_k is the k-th round set of strategy state-action trajectories, τ_i is the i-th trajectory with τ_i = {(s_t^i, a_t^i)}, and (s_t^i, a_t^i) are the state and action vectors of the i-th trajectory at time t;
for each step t in D_k, recording the corresponding return, calculating the action-state value function Q_π(s_t^i, a_t^i) of the corresponding step with the energy storage strategy neural network based on the return, and calculating the state value function V_ω(s_t^i) of the corresponding step with the energy storage state value neural network, where ω are the parameters of the energy storage state value neural network;
for each step t in D_k, calculating the advantage function A_t^i = Q_π(s_t^i, a_t^i) − V_ω(s_t^i) based on the action-state value function and the state value function;
estimating the policy gradient ĝ_k = (1/N) Σ_{i=1}^{N} Σ_t ∇_θ log π_θ(a_t^i | s_t^i) A_t^i based on the advantage function, where N denotes the total number of control rounds of the load and the energy storage and ∇_θ log π_θ(a_t^i | s_t^i) denotes the gradient of the energy storage strategy neural network at θ_k (these expressions are also summarized in the formula block after this list);
based on the policy gradient, calculating the second-order partial derivative Ĥ_k of the energy storage strategy neural network with respect to θ_k and solving Ĥ_k x_k = ĝ_k, where x_k is an auxiliary variable with no actual physical significance;
with iteration index j = 0, 1, …, J, sequentially updating the network parameters of the energy storage strategy neural network to θ_{k+1} = θ_k + α^j · sqrt(2δ / (x_k^T Ĥ_k x_k)) · x_k, where J denotes the maximum number of backtracking steps of the step length of the energy storage strategy neural network;
for the energy storage state value neural network, taking the recorded discounted return as the label and updating the parameters with a stochastic gradient descent algorithm to ω_{k+1} = ω_k − β ∇_ω L(ω_k), where L(ω) is the loss function of the energy storage state value neural network, ∇_ω L(ω) is its gradient with respect to the network parameters ω, and β is the learning rate;
repeating the above steps until the energy storage strategy network parameters θ and the energy storage state value network parameters ω converge, at which point the training is finished.
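The advantage function, the policy-gradient estimate and the two parameter updates used in these steps are summarized below; the exact expressions in the original figures are not recoverable, so this block gives the standard trust-region forms they are assumed to follow.

```latex
A_t^{i} = Q_{\pi}(s_t^{i}, a_t^{i}) - V_{\omega}(s_t^{i}),
\qquad
\hat{g}_k = \frac{1}{N}\sum_{i=1}^{N}\sum_{t} \nabla_{\theta}\log \pi_{\theta}\!\left(a_t^{i} \mid s_t^{i}\right) A_t^{i}, \\[4pt]
\hat{H}_k x_k = \hat{g}_k,
\qquad
\theta_{k+1} = \theta_k + \alpha^{j}\sqrt{\frac{2\delta}{x_k^{\top}\hat{H}_k x_k}}\; x_k,
\qquad
\omega_{k+1} = \omega_k - \beta\, \nabla_{\omega} L(\omega_k).
```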
Further, the expression for minimizing the variance of the load curve in the present embodiment is specifically as follows:
min F = (1/N_T) Σ_{i=1}^{N_T} [ P_load(i) + P_BES(i) − (1/N_T) Σ_{j=1}^{N_T} ( P_load(j) + P_BES(j) ) ]²
in the formula, N_T is the number of load data points in a day, determined by the forecast load data, and the current time corresponds to the n-th (1 ≤ n ≤ N_T) load data point; P_load(i) is the load at time i and is a known quantity, being the actual load for times up to the current point and the forecast load afterwards; P_BES(i) is the battery energy storage (BES) power between time i and time i+1, taken positive when the battery is charging and negative when it is discharging, being a known quantity for times up to the current point and the control variable afterwards.
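The following short Python sketch evaluates this net-load variance for a daily profile; the array length and values are illustrative assumptions, and the scheduler would search over the future P_BES entries (the control variables) that minimise the returned value.

```python
import numpy as np

def load_curve_variance(p_load, p_bes):
    """Variance of the net load curve. p_load holds the daily load profile
    (actual values up to the current point, forecasts afterwards); p_bes holds
    the BES power per interval (positive = charging, negative = discharging)."""
    net_load = np.asarray(p_load, dtype=float) + np.asarray(p_bes, dtype=float)
    return float(np.mean((net_load - net_load.mean()) ** 2))

# Illustrative 24-point daily profile (arbitrary units) with an idle storage schedule.
p_load = np.array([60, 55, 52, 50, 51, 58, 70, 85, 95, 100, 102, 99,
                   97, 96, 98, 101, 105, 110, 108, 100, 90, 80, 70, 64])
p_bes = np.zeros_like(p_load, dtype=float)
print(load_curve_variance(p_load, p_bes))
```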
It should be noted that, the optimal scheduling system for solving energy storage participation peak clipping and valley filling provided in this embodiment is used to implement the optimal scheduling method provided in the foregoing embodiment, and specific settings of each unit are based on complete implementation of the method, which is not described herein again.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (4)

1. An optimal scheduling method for solving the problem that energy storage participates in peak clipping and valley filling, characterized by comprising the following steps:
setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network is used for parameterizing an input control strategy with its own network parameters and outputting a plurality of parameterized control strategies, and the parameterized deep Q-value network specifically comprises: an energy storage strategy neural network and an energy storage state value neural network;
the energy storage strategy neural network is built from the approximate state-action energy storage Q-Value function Q_π(s_t, a_t), with corresponding network parameters θ; the energy storage state value neural network is built from the approximate state energy storage Q-Value function V_π(s_t), with corresponding network parameters ω; wherein s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value of taking action a_t in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions a, r denotes the return, and γ denotes the discount factor;
acquiring the historical active power values and forecast values of the load and the energy storage power output at the corresponding moments, taking the energy storage power output, the load active power value and the forecast value at the initial moment as the initial state input, controlling the energy storage with an arbitrary initial energy storage control strategy, performing iterative training on the parameterized deep Q-value network with the objective of minimizing the variance of the load curve, updating the network parameters, and limiting the number of network parameter updates with a trust-region optimization model; the training ends when the condition D_KL(π_θk ‖ π_θ{k+1}) ≤ δ is satisfied, wherein D_KL denotes the trust-region constraint on the manifold, π_θ denotes the control strategy π parameterized by the network parameters θ, δ denotes the constraint limit, and θ_k and θ_{k+1} denote successive values of the network parameters θ; the trust-region optimization model is specifically:
max_θ L_{π_old}(π_θ)  subject to  D_KL(π_old ‖ π_θ) ≤ δ
in the formula, π_old denotes the control strategy before the update, π_θ denotes the control strategy updated according to the network parameters θ, L_{π_old}(π_θ) denotes the expected discounted return of the updated control strategy compared with the control strategy before the update, and D_KL(π_old ‖ π_θ) ≤ δ denotes the trust-region constraint between the updated control strategy and the control strategy before the update; performing iterative training on the parameterized deep Q-value network, updating the network parameters, and limiting the number of network parameter updates with the trust-region optimization model until the above condition is satisfied specifically comprises:
taking the initial state as the starting point, controlling the energy storage N times with the strategy π_θ to obtain the strategy state-action trajectories D_k, wherein π_θ is the output result of the energy storage strategy neural network, θ_k are the parameters of the energy storage strategy network, D_k is the k-th round set of strategy state-action trajectories, τ_i is the i-th trajectory with τ_i = {(s_t^i, a_t^i)}, and (s_t^i, a_t^i) are the state and action vectors of the i-th trajectory at time t;
for each step t in D_k, recording the corresponding return, calculating the action-state value function Q_π(s_t^i, a_t^i) of the corresponding step with the energy storage strategy neural network based on the return, and calculating the state value function V_ω(s_t^i) of the corresponding step with the energy storage state value neural network, wherein ω are the parameters of the energy storage state value neural network;
for each step t in D_k, calculating the advantage function A_t^i = Q_π(s_t^i, a_t^i) − V_ω(s_t^i) based on the action-state value function and the state value function;
estimating the policy gradient ĝ_k = (1/N) Σ_{i=1}^{N} Σ_t ∇_θ log π_θ(a_t^i | s_t^i) A_t^i based on the advantage function, wherein N denotes the total number of control rounds of the load and the energy storage, and ∇_θ log π_θ(a_t^i | s_t^i) denotes the gradient of the energy storage strategy neural network at θ_k;
based on the policy gradient, calculating the second-order partial derivative Ĥ_k of the energy storage strategy neural network with respect to θ_k and solving Ĥ_k x_k = ĝ_k, wherein x_k is an auxiliary variable with no actual physical significance;
with iteration index j = 0, 1, …, J, sequentially updating the network parameters of the energy storage strategy neural network to θ_{k+1} = θ_k + α^j · sqrt(2δ / (x_k^T Ĥ_k x_k)) · x_k, wherein J denotes the maximum number of backtracking steps of the step length of the energy storage strategy neural network;
for the energy storage state value neural network, taking the recorded discounted return as the label and updating the parameters with a stochastic gradient descent algorithm to ω_{k+1} = ω_k − β ∇_ω L(ω_k), wherein L(ω) is the loss function of the energy storage state value neural network, ∇_ω L(ω) is its gradient with respect to the network parameters ω, and β is the learning rate;
repeating the above steps until the energy storage strategy network parameters θ and the energy storage state value network parameters ω converge, at which point the training is finished;
and acquiring the current load active power value and the energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output result, and issuing it to the energy storage sub-controller for energy storage scheduling control.
2. The optimal scheduling method for solving the problem of energy storage participation in peak clipping and valley filling according to claim 1, wherein the expression for minimizing the variance of the load curve is specifically as follows:
min F = (1/N_T) Σ_{i=1}^{N_T} [ P_load(i) + P_BES(i) − (1/N_T) Σ_{j=1}^{N_T} ( P_load(j) + P_BES(j) ) ]²
in the formula, N_T is the number of load data points in a day, determined by the forecast load data, and the current time corresponds to the n-th (1 ≤ n ≤ N_T) load data point; P_load(i) is the load at time i and is a known quantity, being the actual load for times up to the current point and the forecast load afterwards; P_BES(i) is the battery energy storage (BES) power between time i and time i+1, taken positive when the battery is charging and negative when it is discharging, being a known quantity for times up to the current point and the control variable afterwards.
3. An optimal scheduling system for solving the problem that energy storage participates in peak clipping and valley filling, characterized by comprising:
a setting unit, which is used for setting a parameterized deep Q-value network, wherein the parameterized deep Q-value network is used for parameterizing an input control strategy with its own network parameters and outputting a plurality of parameterized control strategies, and the parameterized deep Q-value network specifically comprises: an energy storage strategy neural network and an energy storage state value neural network;
the energy storage strategy neural network is built from the approximate state-action energy storage Q-Value function Q_π(s_t, a_t), with corresponding network parameters θ; the energy storage state value neural network is built from the approximate state energy storage Q-Value function V_π(s_t), with corresponding network parameters ω; wherein s denotes the state, a denotes the action, t denotes the time, π denotes the energy storage control strategy, Q_π(s_t, a_t) denotes the value of taking action a_t in state s_t, V_π(s_t) denotes the expected value of state s_t over all possible actions a, r denotes the return, and γ denotes the discount factor;
a training unit, which is used for obtaining the historical active power values and forecast values of the load and the energy storage power output at the corresponding moments, taking the energy storage power output, the load active power value and the forecast value at the initial moment as the initial state input, controlling the energy storage with an arbitrary initial energy storage control strategy, performing iterative training on the parameterized deep Q-value network with the objective of minimizing the variance of the load curve, updating the network parameters, and limiting the number of network parameter updates with a trust-region optimization model; the training ends when the condition D_KL(π_θk ‖ π_θ{k+1}) ≤ δ is satisfied, wherein D_KL denotes the trust-region constraint on the manifold, π_θ denotes the control strategy π parameterized by the network parameters θ, δ denotes the constraint limit, and θ_k and θ_{k+1} denote successive values of the network parameters θ; the trust-region optimization model is specifically:
max_θ L_{π_old}(π_θ)  subject to  D_KL(π_old ‖ π_θ) ≤ δ
in the formula, π_old denotes the control strategy before the update, π_θ denotes the control strategy updated according to the network parameters θ, L_{π_old}(π_θ) denotes the expected discounted return of the updated control strategy compared with the control strategy before the update, and D_KL(π_old ‖ π_θ) ≤ δ denotes the trust-region constraint between the updated control strategy and the control strategy before the update; the process in which the training unit performs iterative training on the parameterized deep Q-value network and updates the network parameters specifically comprises:
taking the initial state as the starting point, controlling the energy storage N times with the strategy π_θ to obtain the strategy state-action trajectories D_k, wherein π_θ is the output result of the energy storage strategy neural network, θ_k are the parameters of the energy storage strategy network, D_k is the k-th round set of strategy state-action trajectories, τ_i is the i-th trajectory with τ_i = {(s_t^i, a_t^i)}, and (s_t^i, a_t^i) are the state and action vectors of the i-th trajectory at time t;
for each step t in D_k, recording the corresponding return, calculating the action-state value function Q_π(s_t^i, a_t^i) of the corresponding step with the energy storage strategy neural network based on the return, and calculating the state value function V_ω(s_t^i) of the corresponding step with the energy storage state value neural network, wherein ω are the parameters of the energy storage state value neural network;
for each step t in D_k, calculating the advantage function A_t^i = Q_π(s_t^i, a_t^i) − V_ω(s_t^i) based on the action-state value function and the state value function;
estimating the policy gradient ĝ_k = (1/N) Σ_{i=1}^{N} Σ_t ∇_θ log π_θ(a_t^i | s_t^i) A_t^i based on the advantage function, wherein N denotes the total number of control rounds of the load and the energy storage, and ∇_θ log π_θ(a_t^i | s_t^i) denotes the gradient of the energy storage strategy neural network at θ_k;
based on the policy gradient, calculating the second-order partial derivative Ĥ_k of the energy storage strategy neural network with respect to θ_k and solving Ĥ_k x_k = ĝ_k, wherein x_k is an auxiliary variable with no actual physical significance;
with iteration index j = 0, 1, …, J, sequentially updating the network parameters of the energy storage strategy neural network to θ_{k+1} = θ_k + α^j · sqrt(2δ / (x_k^T Ĥ_k x_k)) · x_k, wherein J denotes the maximum number of backtracking steps of the step length of the energy storage strategy neural network;
for the energy storage state value neural network, taking the recorded discounted return as the label and updating the parameters with a stochastic gradient descent algorithm to ω_{k+1} = ω_k − β ∇_ω L(ω_k), wherein L(ω) is the loss function of the energy storage state value neural network, ∇_ω L(ω) is its gradient with respect to the network parameters ω, and β is the learning rate;
repeating the above steps until the energy storage strategy network parameters θ and the energy storage state value network parameters ω converge, at which point the training is finished;
and a control unit, which is used for acquiring the current load active power value and the energy storage power output, inputting them into the trained parameterized deep Q-value network, selecting the strategy corresponding to the maximum value in the output result, and issuing it to the energy storage sub-controller for energy storage scheduling control.
4. The optimal scheduling system for solving the problem of energy storage participation in peak clipping and valley filling according to claim 3, wherein the expression for minimizing the variance of the load curve is specifically as follows:
min F = (1/N_T) Σ_{i=1}^{N_T} [ P_load(i) + P_BES(i) − (1/N_T) Σ_{j=1}^{N_T} ( P_load(j) + P_BES(j) ) ]²
in the formula, N_T is the number of load data points in a day, determined by the forecast load data, and the current time corresponds to the n-th (1 ≤ n ≤ N_T) load data point; P_load(i) is the load at time i and is a known quantity, being the actual load for times up to the current point and the forecast load afterwards; P_BES(i) is the battery energy storage (BES) power between time i and time i+1, taken positive when the battery is charging and negative when it is discharging, being a known quantity for times up to the current point and the control variable afterwards.
CN202210916196.3A 2022-08-01 2022-08-01 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling Active CN115001002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210916196.3A CN115001002B (en) 2022-08-01 2022-08-01 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling

Publications (2)

Publication Number Publication Date
CN115001002A CN115001002A (en) 2022-09-02
CN115001002B true CN115001002B (en) 2022-12-30

Family

ID=83021019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210916196.3A Active CN115001002B (en) 2022-08-01 2022-08-01 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling

Country Status (1)

Country Link
CN (1) CN115001002B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116826816B (en) * 2023-08-30 2023-11-10 湖南大学 Energy storage active-reactive coordination multiplexing method considering electric energy quality grading management

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning
CN110365057A (en) * 2019-08-14 2019-10-22 南方电网科学研究院有限责任公司 Distributed energy participation power distribution network peak regulation scheduling optimization method based on reinforcement learning
CN110488861A (en) * 2019-07-30 2019-11-22 北京邮电大学 Unmanned plane track optimizing method, device and unmanned plane based on deeply study
CN113242469A (en) * 2021-04-21 2021-08-10 南京大学 Self-adaptive video transmission configuration method and system
CN113572157A (en) * 2021-07-27 2021-10-29 东南大学 User real-time autonomous energy management optimization method based on near-end policy optimization
CN114630299A (en) * 2022-03-08 2022-06-14 南京理工大学 Information age-perceptible resource allocation method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220164657A1 (en) * 2020-11-25 2022-05-26 Chevron U.S.A. Inc. Deep reinforcement learning for field development planning optimization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Dynamic Economic Dispatch of Power Systems Addressing the Uncertainty of New Energy Forecast Deviation; Lü Xiaoqian; China Master's Theses Full-text Database - Engineering Science and Technology II; 2022-02-28; pp. 29-30 *

Also Published As

Publication number Publication date
CN115001002A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN110059844B (en) Energy storage device control method based on ensemble empirical mode decomposition and LSTM
Jasmin et al. Reinforcement learning approaches to economic dispatch problem
CN112614009A (en) Power grid energy management method and system based on deep expected Q-learning
Zhou et al. Reinforcement learning-based scheduling strategy for energy storage in microgrid
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN105631528B (en) Multi-target dynamic optimal power flow solving method based on NSGA-II and approximate dynamic programming
CN117277357B (en) Novel thermal power energy storage frequency modulation method and system adopting flow battery and electronic equipment
CN111367349A (en) Photovoltaic MPPT control method and system based on prediction model
CN112213945B (en) Improved robust prediction control method and system for electric vehicle participating in micro-grid group frequency modulation
CN114784823A (en) Micro-grid frequency control method and system based on depth certainty strategy gradient
CN115001002B (en) Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN116629461B (en) Distributed optimization method, system, equipment and storage medium for active power distribution network
CN116436003B (en) Active power distribution network risk constraint standby optimization method, system, medium and equipment
CN116862050A (en) Time sequence network-based daily prediction method, system, storage medium and equipment for carbon emission factors
CN118381095B (en) Intelligent control method and device for energy storage charging and discharging of new energy micro-grid
CN115986839A (en) Intelligent scheduling method and system for wind-water-fire comprehensive energy system
Harrold et al. Battery control in a smart energy network using double dueling deep q-networks
CN111313449A (en) Cluster electric vehicle power optimization management method based on machine learning
CN111516702B (en) Online real-time layered energy management method and system for hybrid electric vehicle
Wang et al. Prioritized sum-tree experience replay TD3 DRL-based online energy management of a residential microgrid
CN117833316A (en) Method for dynamically optimizing operation of energy storage at user side
CN111799820B (en) Double-layer intelligent hybrid zero-star cloud energy storage countermeasure regulation and control method for power system
CN115459320B (en) Intelligent decision-making method and device for aggregation control of multipoint distributed energy storage system
CN114048576B (en) Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid
CN116979579A (en) Electric automobile energy-computing resource scheduling method based on safety constraint of micro-grid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant