CN112003269B - Intelligent on-line control method of grid-connected shared energy storage system - Google Patents

Intelligent on-line control method of grid-connected shared energy storage system

Info

Publication number
CN112003269B
CN112003269B (application CN202010754472.1A)
Authority
CN
China
Prior art keywords
cbess
network
soc
grid
value
Prior art date
Legal status
Active
Application number
CN202010754472.1A
Other languages
Chinese (zh)
Other versions
CN112003269A (en)
Inventor
刘友波
宋航
黄媛
刘俊勇
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202010754472.1A priority Critical patent/CN112003269B/en
Publication of CN112003269A publication Critical patent/CN112003269A/en
Application granted granted Critical
Publication of CN112003269B publication Critical patent/CN112003269B/en

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/008Circuit arrangements for ac mains or ac distribution networks involving trading of energy or energy transmission rights
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/28Arrangements for balancing of the load in a network by storage of energy
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/381Dispersed generators
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46Controlling of the sharing of output between the generators, converters, or transformers
    • H02J3/466Scheduling the operation of the generators, e.g. connecting or disconnecting generators to meet a given demand
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2113/00Details relating to the application field
    • G06F2113/04Power grid distribution networks
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20 Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/40Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation wherein a plurality of decentralised, dispersed or local energy generation technologies are operated simultaneously

Landscapes

  • Engineering & Computer Science (AREA)
  • Power Engineering (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention discloses an intelligent online control method for a grid-connected shared energy storage system. The method builds two multi-hidden-layer competition (dueling) Q-network models; establishes a Markov decision process for the CBESS and maps its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates; determines the environmental state features and the instant reward function; and performs E rounds of loop-iterative learning. In each period the MG executes the first planning dispatch of the round, and the agent of the CBESS perceives the environment, including the pre-transaction amounts with the external system, to obtain the first state vector s_t. The main competition Q network takes s_t as input and outputs the Q values corresponding to all actions. The residual capacity SOC_t of the CBESS is updated to SOC_{t+1}; the MG performs a second planning of the period according to the tradable electric quantity actually fed back by the CBESS; s_t, a_t, r_t and s_{t+1} are computed and stored, all hyperparameters of the main competition Q network are updated through gradient back-propagation, the priorities p_i of the data stored in the sumtree are updated, and the parameters of the main competition Q network are copied to the target competition Q network.

Description

Intelligent on-line control method of grid-connected shared energy storage system
Technical Field
The invention relates to the technical field of power system automation, in particular to an intelligent online control method of a grid-connected shared energy storage system.
Background
Unlike a centrally controlled energy storage system (ESS), a shared (community) battery energy storage system (CBESS) is small in scale, generally only a few megawatts in capacity, and is installed on the secondary side of the distribution-substation transformer to mitigate the negative effects of renewable resources and continuous load changes. Once integrated into a grid-connected micro-grid (MG), the CBESS can improve the flexibility and reliability of the MG through rapid charging and discharging. With the deregulation of the distribution market, a CBESS can be operated by an independent enterprise that participates in the market through price-responsive behavior and realizes arbitrage. However, for conventional CBESS optimization decision methods, whether centralized optimal control or distributed coordinated optimization, complex system modeling, unobservable data and various uncertainty factors pose many challenges to model-based physical approaches.
In recent years, machine learning has developed rapidly, and its strong perception and data-analysis capabilities match the requirements of big-data applications in the smart grid. Reinforcement learning (RL) acquires knowledge of the environment through continuous interaction between a decision-making agent and the environment, taking actions that affect the environment so as to achieve a preset target. Deep learning (DL) does not depend on any analytic equation; instead it describes a mathematical problem and its approximate solution using large amounts of existing data, and when applied within RL it can effectively alleviate difficulties such as the intractability of the value function. The present method aims to overcome the difficult modeling, poor extensibility and poor practicality of physical modeling approaches, as well as the difficult solution, poor robustness and slow convergence of traditional intelligent algorithms when the state space is too large.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an intelligent online control method of a grid-connected shared energy storage system, which comprises the following steps:
Step one, build two multi-hidden-layer competition (dueling) Q-network models, a main competition Q network and a target competition Q network; the input is the feature vector s_t of the observed state, and the output is the action value Q(s_t, a_t) corresponding to each action a_t in the action set A;
Step two, establish a Markov decision process for the CBESS and map its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates; determine the environmental state features and the instant reward function;
Step three, enter E rounds of loop-iterative learning; at the start of each round, reinitialize the load curve of the MG, the output of the RDG, the market prices and the SOC of the shared energy storage;
Step four, the MG executes the first planning dispatch of the round; the agent of the CBESS perceives the environment, including the pre-transaction amounts with the external system, and obtains the first state vector s_t;
Step five, use s_t as the input of the main competition Q network to obtain the Q-value outputs corresponding to all actions; select the optimal estimated Q value from the current Q-value output by the ε-greedy method, determine the corresponding action a_t, and execute it;
Step six, update the residual capacity SOC_t of the CBESS to SOC_{t+1}; judge whether SOC_{t+1} exceeds the range [0,1], i.e. whether the limit is violated, and from this compute the termination indicator done_t of the iteration; at the same time compute the instant reward r_t after the action;
Step seven, the MG performs the second planning of the current period according to the tradable electric quantity actually fed back by the CBESS and determines the electric quantity traded with the external system; at the same time it publishes the pre-transaction quantities P^{mg.CHE}_{t+1} and P^{mg.grid}_{t+1} of the next period, which serve as the perceived state information of the agent in the next period; the state of the system is updated to s_{t+1};
Step eight, compute s_t, a_t, r_t and s_{t+1}, and store them together with done_t in the leaf nodes of the sumtree in sequence; if the quantity of stored data reaches the preset mini-batch sampling capacity m, randomly sample m samples from the stored data, calculate the current target Q value and its error, and update all hyperparameters of the main competition Q network through gradient back-propagation;
Step nine, after the Q network is updated, recalculate and update the priorities p_i of the data stored in the sumtree, copy the parameters of the main competition Q network to the target competition Q network, and set the current state s = s_{t+1}; if s is a termination state or the number of iteration rounds T is reached, the iteration of the current round ends and the method returns to step three to continue the loop; otherwise, go to step five to continue the iteration.
Furthermore, the main competition Q network is a multi-hidden-layer architecture with a single-neuron state-value sub-layer and a K-neuron action-advantage sub-layer; the ReLU function is selected as the activation function to speed up convergence; the inter-layer weights ω are initialized with a normal distribution, and the biases b are initialized to constants close to 0; the time-sequence number, the state of charge of the CBESS, the market prices, and the pre-transaction quantities of the MG with the CBESS/upper-level distribution network form the state feature vector s_t used as the network input; the network outputs the optimal discretized charge-discharge action value Q_t, and is finally trained to convergence by prioritized replay of the data.
Further, the action set A is defined as follows: the action space of the CBESS is uniformly discretized into K discrete charge-discharge options P_be^(k):
A = {P_be^(1), P_be^(2), …, P_be^(K)}
where A is the set of all possible actions and P_be^(k) denotes the k-th charge/discharge action in the uniformly discretized action space of the CBESS.
Further, the Markov decision process established for the CBESS maps the CBESS charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates, specifically as follows:
the residual electric quantity of the CBESS changes continuously during charging and discharging, and its variation is related to the charge/discharge energy and the self-discharge in the period; the recursive relationship for charging is
SoC(t) = (1 − σ_sdr)·SoC(t−1) + P_be·(1 − L_c)·Δt / E_cap
and the discharging process is
SoC(t) = (1 − σ_sdr)·SoC(t−1) − P_be·Δt / [E_cap·(1 − L_dc)]
where SoC(t) is the state of charge of the CBESS in period t; P_be(t) is the charge/discharge power of the CBESS in period t; σ_sdr is the self-discharge rate of the storage medium; L_c and L_dc are the charging and discharging losses of the CBESS, respectively; Δt is the duration of each calculation window; and E_cap is the energy capacity of the CBESS;
the maximum allowable charge/discharge power of the CBESS at time t is determined by its charge/discharge characteristics and its remaining state of charge at time t; during operation the following constraint must also be satisfied:
SoC_min ≤ SoC(t) ≤ SoC_max
where SoC_max and SoC_min are the upper and lower limits of the CBESS state-of-charge constraint;
the environmental state features are as follows: the environmental state feature vector perceived by the CBESS at time t is defined as
s_t = [t, SOC_t, price^{b.pre}_t, price^{s.pre}_t, P^{mg.CHE}_t, P^{mg.grid}_t]^T
where t is the time-sequence number; price^{b.pre}_t and price^{s.pre}_t denote the predicted selling and purchasing prices of the upper-level grid at time t; and P^{mg.CHE}_t and P^{mg.grid}_t denote the pre-transaction electric quantities between the micro-grid and the CBESS and the upper-level grid, respectively;
the instant reward function is as follows: the CBESS obtains energy arbitrage by charging during off-peak hours and discharging during peak hours; after the actual transaction powers with the micro-grid and the upper-level grid are determined, the reward benefit r_EAP is calculated according to the real-time price;
the total operation and maintenance cost C_{o,m} of the CBESS includes the basic charge/discharge cost
C_1 = |P_be| · c_be
plus an additional operating-cost term incurred when the state of charge approaches its limits (given as an image in the original);
a negative reward r_line with coefficient σ is added as a penalty to suppress the power fluctuation P_exc_grid at the point of common connection:
r_line = −σ · |P_exc_grid|
if the executed action drives the SOC outside [0,1], a large penalty r_exc is given to prevent the agent from making unreasonable decisions in subsequent learning; the instant reward r_t is then
r_t = r_EAP − C_{o,m} + r_line, or r_t = r_exc when the SOC limit is violated.
further, the MG executes the first planning dispatch of the round, and the agent of the CBESS perceives the environment, including the pre-transaction amounts with the external system, to obtain the first state vector s_t, as follows: the MG model aims to minimize the running cost under the predicted price signal; the objective function of the economic dispatch model minimizes, over the planning horizon T, the sum of the generation costs of the CDGs, the operating costs of the micro-grid energy storage, and the cost of the energy traded with the CBESS and the upper-level distribution network (the full expression is given as an image in the original), where T is the planning period; C^z_{CDG} is the generation cost of the z-th CDG and c^{es}_i is the operating cost of the i-th micro-grid energy storage; P^{z,t}_{CDG} is the power output of the z-th CDG and P^{i,t}_{es} is the charge/discharge power of the i-th micro-grid energy storage; p^{b.grid}_t and p^{s.grid}_t denote the selling and purchasing prices of the upper-level distribution network in each period, and P^{b.CHE}_t and P^{s.CHE}_t denote the selling and purchasing prices issued by the CBESS operator;
based on the forecast data, the micro-grid applies mixed-integer linear programming (MILP) to obtain the transaction quantities P^{mg.CHE}_t and P^{mg.grid}_t with the CBESS and the upper-level distribution network for the period, and publishes this transaction information; the agent of the CBESS obtains the state feature vector s_t = [t, SOC_t, price^{b.pre}_t, price^{s.pre}_t, P^{mg.CHE}_t, P^{mg.grid}_t] by perceiving the external environment.
Further, s_t is used as the input of the main competition Q network to obtain the Q-value outputs corresponding to all actions; the ε-greedy method selects the optimal estimated Q value from the current Q-value output to determine the corresponding action a_t, which is then executed, including the following process:
use s_t as the input of the main competition Q network to obtain the Q-value outputs of all actions; with the ε-greedy method, select the corresponding action a_t in the current Q-value output and execute a_t in state s_t; for the ε-greedy policy, a value ε ∈ (0,1) is first set; with probability (1 − ε) the action a* currently considered optimal (maximum Q value) is chosen greedily, and with probability ε a potential behavior is explored at random from all K discrete candidate actions:
a_t = argmax_a Q(s_t, a; θ) with probability 1 − ε, and a_t drawn uniformly from A with probability ε,
where ε decreases gradually from ε_ini to ε_fin over the course of the iterations.
Further, the residual capacity SOC_t of the CBESS is updated to SOC_{t+1}; whether SOC_{t+1} exceeds the range [0,1] determines whether the limit is violated, from which the termination indicator done_t of the iteration is computed, and the instant reward r_t after the action is computed at the same time. Specifically: update the electric quantity SOC_t of the CBESS to SOC_{t+1}, judge whether the iteration is in a termination state, and compute the instant reward r_t after the action; a binary variable done_t serves as the termination indicator and as the interruption index of each iteration process:
done_t = 1 if the state of charge goes out of limit during energy-storage operation, and done_t = 0 otherwise;
done_t = 1 indicates termination and the iteration is exited, while done_t = 0 indicates the iteration is not terminated.
Further, step eight computes s_t, a_t, r_t and s_{t+1} and stores them together with done_t in the leaf nodes of the sumtree in sequence; if the quantity of stored data reaches the preset mini-batch sampling capacity m, m samples are randomly drawn from the stored data, the current target Q value and its error are calculated, and all hyperparameters of the main competition Q network are updated through gradient back-propagation, where the current target Q value y_j is:
y_j = r_j + γ·Q'(s_{j+1}, argmax_{a'} Q(s_{j+1}, a'; θ); θ'), with y_j = r_j when done_j = 1,
where Q is the main competition Q network (parameters θ) and Q' is the target competition Q network (parameters θ');
a proportional prioritization strategy is adopted, i.e. the sampling probability P(i) of the i-th sample is:
P(i) = p_i^α / Σ_k p_k^α
where α ∈ [0,1] is the power exponent that converts the magnitude of the TD error into a priority; if α = 0, this degenerates to uniform random sampling; p_i is the priority of transition i, calculated as:
p_i = |δ_i| + ζ
where ζ is a small positive offset;
the bias is corrected using importance-sampling (IS) weights to obtain a mean-square-error loss function L_i(θ_i) that takes the sample priorities into account:
L_i(θ_i) = (1/m) · Σ_j ω_j · (y_j − Q(s_j, a_j; θ_i))²
and finally all parameters θ of the main competition Q network are updated through gradient back-propagation of the neural network:
ω_j = (N·P(j))^(−β) / max_i ω_i
θ_i = θ_{i−1} + α·∇_{θ_i} L_i(θ_i)
where ω_j is the IS weight of sample j, and β is a hyperparameter that gradually increases to 1.
The invention has the following beneficial effects: 1. the invention endows the CBESS with strong online learning and decision-making ability in a highly uncertain environment; by approximating the optimal action-value function without relying on any analytic equation, it solves the problem that iterative solution is impossible when the environment state is continuous and the state space is huge;
2. the collaborative optimization of the double competition Q-network structure and the priority replay strategy can effectively alleviate over-estimation by the model, significantly improve the accuracy of the agent's decisions and the robustness of convergence, accelerate the convergence of the algorithm and improve online computational efficiency.
Drawings
FIG. 1 is a flow chart of the intelligent online control method of the grid-connected shared energy storage system;
FIG. 2 is a diagram of the competition Q network;
FIG. 3 is a diagram of the sumtree data structure;
FIG. 4 is a schematic diagram of the algorithm structure of the prioritized experience replay strategy.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following descriptions.
As shown in FIG. 1, the data-driven online control decision method of the invention for a grid-connected shared energy storage system includes the following steps:
S1: build two multi-hidden-layer competition Q-network models, a main competition Q network and a target competition Q network; the input of the models is the feature vector s_t of the observed state, and the output is the action value Q(s_t, a_t) corresponding to each action a_t in the action set A. All parameters of the Q networks, the capacity D of the data-storage structure sumtree and the priority values of its leaf nodes are first initialized.
S2: establish the Markov decision process of the CBESS and map its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates, determining: 1) the control target of the algorithm: stabilize the power fluctuation at the grid-connection point of the micro-grid as much as possible while maximizing the market arbitrage of the energy storage; 2) the environmental state features: the time-sequence number of the current period, the residual electric quantity of the CBESS, the predicted selling/purchasing prices of the upper-level grid, and the pre-transaction electric quantities with the distribution network/CBESS obtained from the MG's first economic dispatch; 3) the reward function: the energy arbitrage profit r_EAP realized by flexible charging and discharging of the CBESS, the total operation and maintenance cost C_{o,m}, the penalty r_line on power fluctuation at the grid-connection point, and the penalty r_exc on energy-storage SOC limit violation.
S3: before each iteration round starts, the uncertain data, including the load curve of the micro-grid, the renewable distributed generation output and the market price signals, are reinitialized.
S4: the micro-grid performs the pre-planning of each period based on the forecast data to obtain the pre-transaction electric quantities between period t and the CBESS/upper-level distribution network, i.e. P^{mg.CHE}_t and P^{mg.grid}_t, and publishes this information; meanwhile, the agent of the CBESS obtains the state feature vector s_t = [t, SOC_t, price^{b.pre}_t, price^{s.pre}_t, P^{mg.CHE}_t, P^{mg.grid}_t] by perceiving the external environment.
S5: use s_t as the input of the main competition Q network to obtain the Q-value outputs corresponding to all actions; select the optimal estimated Q value from the current Q-value output by the ε-greedy method, determine the corresponding action a_t, and execute it.
S6: update the residual capacity SOC_t of the CBESS to SOC_{t+1}; judge whether SOC_{t+1} exceeds the range [0,1], i.e. whether the limit is violated, and from this compute the termination indicator done_t of the iteration; at the same time compute the instant reward r_t after the action.
S7: the MG performs the second planning of the current period according to the tradable electric quantity actually fed back by the CBESS, determines the electric quantity traded with the external system, and at the same time publishes the pre-transaction quantities P^{mg.CHE}_{t+1} and P^{mg.grid}_{t+1} of the next period as the perceived state information of the agent in the next period; the state of the system is then updated to s_{t+1}.
S8: compute s_t, a_t, r_t and s_{t+1}, and store them together with done_t in the leaf nodes of the sumtree in sequence. Once the quantity of stored data reaches the preset mini-batch sampling capacity m, randomly sample m samples, calculate the current target Q value and its error, and update all hyperparameters of the main competition Q network through gradient back-propagation.
S9: after the Q network is updated, recalculate and update the priorities p_i of the data stored in the sumtree, periodically copy the parameters of the main competition Q network to the target Q network, and set the current state s = s_{t+1}. If s is a termination state or the number of iteration rounds T is reached, the current round of iterations ends and the method returns to S3 to continue the loop; otherwise go to step S5 to continue the iteration.
5.1 The concrete process of step S1 is as follows:
the CBESS interacts with the environment under a control target to obtain feedback rewards by continuously sensing the power demand of the microgrid and the market environment. A multi-hidden-layer master competition Q network architecture having a sub-layer of state values of single neurons and a sub-layer of action dominance of K neurons is constructed, as shown in fig. 2. The corresponding target contention Q network architecture is consistent therewith. The activation function selects the ReLu function to speed up the convergence process. The normal initialization interlayer weight ω and the initialization bias b are all constants tending to 0. Forming a state characteristic vector s by the time sequence number, the charge state of the CBESS, the market price, the pre-trading electric quantity of the MG and the CBESS/superior distribution network tOutputting the optimal discretized charge-discharge action value Q as network inputtAnd finally, performing network training by preferentially playing back data to iteratively converge. In the energy storage intelligent decision method based on model-free reinforcement learning and data driving, a priority proportion sample playback method based on a sumtree data structure is adopted, and simultaneously, the strategy precision and the convergence speed can be remarkably improved after the method is compatible with DDQN, so that the algorithm robustness is increased; meanwhile, the application of the competitive network architecture can enable the agent to quickly identify correct actions during the strategy evaluation, and the method has higher calculation efficiency and considerable fitting precision and stronger self-adaptive capacity.
Sumtree is the binary tree structure shown in FIG. 3. The root node is at the top level, the branch nodes are in the middle, and only the leaf nodes at the bottom store samples. Each parent node contains the sum of its two children; the root node is therefore the sum of all priorities, denoted p_total. Because this data structure provides an efficient way to compute cumulative sums of priorities, the sumtree supports efficient storage, update and sampling of the priority variables. During storage, acquired data are written into the leaf nodes from left to right, and once the leaves are full the oldest data are overwritten one by one from the left. A significant advantage of this approach is that the transitions need not be sorted by priority, which greatly reduces the computational burden and facilitates real-time training. Before iteration begins, the capacity of the sumtree leaves must be determined and the leaf priority values initialized.
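A minimal Python sketch of such a sumtree follows; the class and method names are illustrative assumptions, and the rolling overwrite from the left implements the storage rule just described.

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold sample priorities and whose parent nodes
    hold the sum of their two children; the root therefore holds p_total."""
    def __init__(self, capacity):
        self.capacity = capacity                   # number of leaf nodes
        self.tree = np.zeros(2 * capacity - 1)     # priorities, root at index 0
        self.data = np.empty(capacity, dtype=object)
        self.write = 0                             # next leaf to (over)write

    def add(self, priority, sample):
        """Store samples left to right, overwriting the oldest when full."""
        self.data[self.write] = sample
        self.update(self.write + self.capacity - 1, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, idx, priority):
        """Change a leaf priority and propagate the change up to the root."""
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get(self, value):
        """Descend from the root along cumulative sums to locate one leaf."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):        # until idx is a leaf
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    @property
    def total(self):
        return self.tree[0]                        # p_total, sum of priorities
```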
When a change of the environment state is perceived, the agent controls the CBESS to feed back a corresponding action a_t. The action space of the CBESS is uniformly discretized into K discrete charge-discharge options P_be^(k):
A = {P_be^(1), P_be^(2), …, P_be^(K)}
where A is the set of all possible actions and P_be^(k) denotes the k-th charge/discharge action in the uniformly discretized action space of the CBESS.
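A one-line illustration of this uniform discretization; the power rating and the value of K are assumptions of the example.

```python
import numpy as np

P_BE_MAX = 1.0  # MW, illustrative charge/discharge rating
K = 11          # illustrative number of discrete options
actions = np.linspace(-P_BE_MAX, P_BE_MAX, K)  # uniform discrete options P_be^(k)
```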
5.2 The concrete process of step S2 is as follows:
Establish the Markov decision process of the CBESS and map its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates, specifically:
The residual capacity of the CBESS changes continuously during charging and discharging, and its variation is related to the charge/discharge energy and the self-discharge in the period. The recursive relationship for charging is
SoC(t) = (1 − σ_sdr)·SoC(t−1) + P_be·(1 − L_c)·Δt / E_cap
and the discharging process is
SoC(t) = (1 − σ_sdr)·SoC(t−1) − P_be·Δt / [E_cap·(1 − L_dc)]
where SoC(t) is the state of charge of the CBESS in period t; P_be(t) is the charge/discharge power of the CBESS in period t; σ_sdr is the self-discharge rate of the storage medium; L_c and L_dc are the charging and discharging losses of the CBESS, respectively; Δt is the duration of each calculation window; and E_cap is the energy capacity of the CBESS.
The maximum allowable charge/discharge power of the CBESS at time t is determined by its charge/discharge characteristics and its remaining state of charge at time t; during operation the following constraint must also be satisfied:
SoC_min ≤ SoC(t) ≤ SoC_max
where SoC_max and SoC_min are the upper and lower limits of the CBESS state-of-charge constraint.
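For illustration, the two recursions and the limit check can be combined into a single update routine; the sign convention used here (positive P_be charges, negative discharges) is an assumption of this sketch.

```python
SOC_MIN, SOC_MAX = 0.0, 1.0   # illustrative limits; the text checks the range [0, 1]

def soc_update(soc, p_be, dt, e_cap, sigma_sdr, l_c, l_dc):
    """One-step SoC recursion from the charge/discharge formulas above;
    returns the next SoC and the out-of-limit flag used for done_t."""
    soc_next = (1 - sigma_sdr) * soc               # self-discharge over Δt
    if p_be >= 0:                                  # charging: losses shrink stored energy
        soc_next += p_be * (1 - l_c) * dt / e_cap
    else:                                          # discharging: losses inflate energy drawn
        soc_next += p_be * dt / (e_cap * (1 - l_dc))
    done = not (SOC_MIN <= soc_next <= SOC_MAX)
    return soc_next, done
```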
Reinforcement learning learns a mapping from environment states to actions, with the goal of maximizing the accumulated reward the agent obtains during its interaction with the environment. RL uses the Markov decision process (MDP) to simplify the modeling; an MDP is usually defined as a four-tuple (S, A, r, f), where S is the set of all environment states, with s_t ∈ S the state of the agent at time t; A is the set of actions the agent can execute, with a_t ∈ A the action taken at time t; r is the reward function, with r_t ~ r(s_t, a_t) the immediate reward obtained by executing action a_t in state s_t; and f is the state-transition probability distribution function, with s_{t+1} ~ f(s_t, a_t) the probability of transitioning to the next state s_{t+1} after executing a_t in s_t. The goal of the Markov model is to find an optimal policy π* that maximizes the expected sum of rewards from an initial state s:
V^{π*}(s) = max_π E_π [ Σ_{t≥0} γ^t · r_t | s_0 = s ]
where E_π denotes the expectation of the value under policy π, and 0 < γ < 1 is the decay (discount) coefficient in reinforcement learning that characterizes the importance of future rewards.
When the scale of the problem is small, the algorithm is relatively easy to solve. For practical problems, however, the state space is usually very large; the computational cost of conventional iterative solution is too high, and it suffers from difficult convergence, slow convergence speed and a tendency to over-estimate, so the method proposed by the invention is needed for an improved solution. For the online-control data-driven technique of the grid-connected shared energy storage system, the mapping relationships are as follows:
(1) Environmental state features
The environmental state feature vector perceived by the CBESS at time t is defined as
s_t = [t, SOC^{be}_t, price^{b.pre}_t, price^{s.pre}_t, P^{mg.CHE}_t, P^{mg.grid}_t]^T, s_t ∈ S
where t is the time-sequence number; price^{b.pre}_t and price^{s.pre}_t denote the predicted selling and purchasing prices of the upper-level grid at time t; and P^{mg.CHE}_t and P^{mg.grid}_t denote the pre-transaction electric quantities between the micro-grid and the CBESS and the upper-level grid, respectively.
(2) Feedback reward
During continuous perception and learning, after the CBESS selects action a_t in a given environment state s_t, it obtains a single-step instant reward r_t composed as follows.
1) The CBESS achieves an energy arbitrage profit (EAP) by charging during off-peak hours and discharging during peak hours. After the actual transaction powers with the micro-grid and the upper-level grid are determined, the reward benefit r_EAP is calculated according to the real-time price.
2) Besides the basic unit cost c_be of the CBESS, continued operation when its state of charge approaches a limit incurs additional cost. The total operation and maintenance cost C_{o,m} of the CBESS includes the basic charge/discharge cost
C_1 = |P_be| · c_be
plus an additional operating-cost term that grows as the state of charge approaches its limits (formula given as an image in the original).
3) The CBESS has the ability to mitigate the negative impact of the MG on the distribution grid. A negative reward r_line with coefficient σ is therefore added as a penalty to suppress the power fluctuation P_exc_grid at the point of common connection:
r_line = −σ · |P_exc_grid|
4) Once an executed action drives the SOC outside [0,1], a large penalty r_exc must be given to prevent the agent from making unreasonable decisions in subsequent learning. Finally, the instant reward r_t is defined as
r_t = r_EAP − C_{o,m} + r_line, or r_t = r_exc when the SOC limit is violated.
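Under the reconstruction of r_t above, a single-step reward routine could look like this sketch; the magnitude of r_exc and the argument names are illustrative assumptions.

```python
def instant_reward(r_eap, c_om, p_exc_grid, sigma, soc_next, r_exc=-100.0):
    """Instant reward r_t: arbitrage profit minus O&M cost plus the tie-line
    fluctuation penalty, or the large penalty r_exc on an SOC violation."""
    if not 0.0 <= soc_next <= 1.0:
        return r_exc                       # SOC left [0, 1]: penalize heavily
    r_line = -sigma * abs(p_exc_grid)      # grid-connection fluctuation penalty
    return r_eap - c_om + r_line
```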
5.3 The concrete process of step S3 is as follows:
Before each iteration round starts, the uncertain data, including the load curve of the micro-grid, the renewable distributed generation output and the market price signals, are initialized. Specifically, actual values of the load curve, RDG output and market electricity price can be given, with the prediction errors assumed to follow given normal distributions so as to represent the uncertainty fluctuations.
5.4 The concrete process of step S4 is as follows:
For the MG model, whose goal is to minimize the operating cost under the predicted price signal, the objective function of the economic dispatch (ED) model minimizes, over the planning horizon T, the sum of the generation costs of the CDGs, the operating costs of the micro-grid energy storage, and the cost of the energy traded with the CBESS and the upper-level distribution network (the full expression is given as an image in the original), where T is the planning period; C^z_{CDG} is the generation cost of the z-th CDG and c^{es}_i is the operating cost of the i-th micro-grid energy storage; P^{z,t}_{CDG} is the power output of the z-th CDG and P^{i,t}_{es} is the charge/discharge power of the i-th micro-grid energy storage; p^{b.grid}_t and p^{s.grid}_t denote the selling and purchasing prices of the upper-level distribution network in each period, and P^{b.CHE}_t and P^{s.CHE}_t denote the selling and purchasing prices issued by the CBESS operator.
Based on the forecast data, the micro-grid applies mixed-integer linear programming (MILP) to obtain the transaction quantities P^{mg.CHE}_t and P^{mg.grid}_t with the CBESS and the upper-level distribution network for the period, and publishes this transaction information; meanwhile, the agent of the CBESS obtains the state feature vector s_t = [t, SOC_t, price^{b.pre}_t, price^{s.pre}_t, P^{mg.CHE}_t, P^{mg.grid}_t] by perceiving the external environment.
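As a toy illustration of this pre-dispatch step, the sketch below uses the PuLP library with a single CDG and purchase-only trading; this is a deliberate simplification of the patent's MILP, and every name and the reduced cost structure are assumptions of the example.

```python
import pulp

def microgrid_predispatch(T, c_cdg, p_cdg_max, load,
                          price_buy_grid, price_buy_che):
    """Minimal pre-dispatch: one CDG plus purchases from the upper-level grid
    and the CBESS must cover the forecast load at minimum cost."""
    prob = pulp.LpProblem("mg_predispatch", pulp.LpMinimize)
    p_cdg = pulp.LpVariable.dicts("p_cdg", range(T), 0, p_cdg_max)
    p_grid = pulp.LpVariable.dicts("p_grid", range(T), 0)  # bought from grid
    p_che = pulp.LpVariable.dicts("p_che", range(T), 0)    # bought from CBESS
    prob += pulp.lpSum(c_cdg * p_cdg[t]
                       + price_buy_grid[t] * p_grid[t]
                       + price_buy_che[t] * p_che[t] for t in range(T))
    for t in range(T):
        prob += p_cdg[t] + p_grid[t] + p_che[t] == load[t]  # power balance
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    # the published pre-transaction quantities P_t^{mg.CHE} and P_t^{mg.grid}
    return ([p_che[t].value() for t in range(T)],
            [p_grid[t].value() for t in range(T)])
```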
5.5 The concrete process of step S5 is as follows:
Use s_t as the input of the main competition Q network to obtain the Q-value outputs corresponding to all actions; with the ε-greedy method, select the corresponding action a_t in the current Q-value output and execute a_t in state s_t. For the ε-greedy policy, a value ε ∈ (0,1) is first set; with probability (1 − ε) the action a* currently considered optimal (maximum Q value) is chosen greedily, and with probability ε a potential behavior is explored at random from all K discrete candidate actions:
a_t = argmax_a Q(s_t, a; θ) with probability 1 − ε, and a_t drawn uniformly from A with probability ε,
where ε decreases from ε_ini to ε_fin over the course of the iterations, encouraging more exploration early on and focusing mainly on greedy convergence later, so that the algorithm converges stably.
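A small sketch of this annealed ε-greedy selection; the schedule constants are illustrative assumptions.

```python
import numpy as np

def select_action(q_values, step, eps_ini=1.0, eps_fin=0.05, decay_steps=10000):
    """ε-greedy choice over the K discrete charge/discharge options, with ε
    annealed linearly from eps_ini down to eps_fin."""
    eps = max(eps_fin, eps_ini - (eps_ini - eps_fin) * step / decay_steps)
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))  # explore: random action index
    return int(np.argmax(q_values))              # exploit: greedy action a*
```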
5.6 The concrete process of step S6 is as follows:
S6: update the electric quantity SOC_t of the CBESS to SOC_{t+1}, judge whether the iteration is in a termination state, and compute the instant reward r_t after the action. A binary variable done_t serves as the termination indicator and as the interruption index of each iteration process:
done_t = 1 if the state of charge goes out of limit during energy-storage operation, and done_t = 0 otherwise;
done_t = 1 indicates termination and the iteration is exited, while done_t = 0 indicates the iteration is not terminated.
S7: the MG performs a second MILP planning according to the tradable electric quantity actually fed back by the CBESS, determines the electric quantity traded with the external system in this period, and at the same time publishes the pre-transaction quantities P^{mg.CHE}_{t+1} and P^{mg.grid}_{t+1} of the next period as the perceived state information of the agent in the next period; the state of the system is then updated to s_{t+1}.
S8: during the continuous iterative update, the s_t, a_t, r_t and s_{t+1} obtained in each period t, together with the termination indicator done_t, form a quintuple {s_t, a_t, r_t, s_{t+1}, done_t} that is stored in the leaf nodes of the sumtree in sequence. If the stored quantity reaches the maximum capacity of the leaf nodes, the old data are overwritten in rolling fashion as new data are stored, ensuring the validity of the samples. Once the number of samples reaches the mini-batch training size m, m samples (j = 1, 2, …, m) are randomly drawn from the leaf nodes according to the priority replay mechanism, and the current target Q value y_j corresponding to each sample is calculated:
y_j = r_j + γ·Q'(s_{j+1}, argmax_{a'} Q(s_{j+1}, a'; θ); θ'), with y_j = r_j when done_j = 1,
where Q is the main competition Q network (parameters θ) and Q' is the target competition Q network (parameters θ').
In the priority replay mechanism, more important sample data are replayed with higher frequency; the TD error δ therefore needs to be calculated and stored, and samples with larger |δ| are easier to sample. A proportional prioritization strategy is adopted, a stochastic sampling strategy between pure greedy prioritization and uniform sampling, in which the sampling probability P(i) of the i-th sample is
P(i) = p_i^α / Σ_k p_k^α
where α ∈ [0,1] is the power exponent that converts the magnitude of the TD error into a priority; if α = 0, this degenerates to uniform random sampling. p_i is the priority of transition i, calculated as
p_i = |δ_i| + ζ
where ζ is a small positive offset that ensures edge samples with TD error 0 can still be drawn. This process changes the expected distribution of the stochastic updates and hence the solution to which they converge; importance-sampling (IS) weights are therefore used to correct the bias, yielding a mean-square-error loss function L_i(θ_i) that takes the sample priorities into account:
L_i(θ_i) = (1/m) · Σ_j ω_j · (y_j − Q(s_j, a_j; θ_i))²
Finally, all parameters θ of the main competition Q network are updated through gradient back-propagation of the neural network:
ω_j = (N·P(j))^(−β) / max_i ω_i
θ_i = θ_{i−1} + α·∇_{θ_i} L_i(θ_i)
where ω_j is the IS weight of sample j, and β is a hyperparameter that gradually increases to 1. FIG. 4 summarizes the structure of the prioritized experience replay algorithm.
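Putting the double-Q target and the IS-weighted loss together, one mini-batch update might look like the following PyTorch sketch; tensor names and shapes are assumptions of the example, and the returned |δ| values would be written back into the sumtree as new priorities p_i.

```python
import torch

def train_step(main_net, target_net, optimizer, batch, is_weights, gamma=0.99):
    """One mini-batch update of the main competition Q network: double-Q
    target y_j, IS-weighted MSE loss L_i(θ_i), gradient back-propagation."""
    s, a, r, s_next, done = batch                  # tensors of length m
    q = main_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a_star = main_net(s_next).argmax(dim=1, keepdim=True)     # main net selects
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # target net evaluates
        y = r + gamma * q_next * (1.0 - done)      # y_j = r_j when done_j = 1
    td_error = y - q
    loss = (is_weights * td_error.pow(2)).mean()   # IS-weighted MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return td_error.detach().abs()                 # new |δ| for the sumtree priorities
```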
S9: q network recalculation after update and updating of priority p of stored data in sumtreeiAnd periodically connecting the main competition Q networkCopying the parameters of the network to the target Q network, and simultaneously making the current state s equal to st+1If S is in the termination state, the current iteration is finished, or the iteration number T is reached, all iterations are finished and returned to S3 for circulation, otherwise, the iteration is continued in the step S5.
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise forms disclosed herein; various other combinations, modifications and environments may be resorted to within the scope of the inventive concept described above or apparent to those skilled in the relevant art, and modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. An intelligent online control method of a grid-connected shared energy storage system, characterized by comprising the following steps:
Step one, build two multi-hidden-layer competition Q-network models, a main competition Q network and a target competition Q network; the input is the feature vector s_t of the observed state, and the output is the action value Q(s_t, a_t) corresponding to each action a_t in the action set A;
Step two, establish a Markov decision process for the CBESS and map its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates; determine the environmental state features and the instant reward function;
Step three, enter E rounds of loop-iterative learning; at the start of each round, reinitialize the load curve of the MG, the output of the RDG, the market prices and the SOC of the shared energy storage;
Step four, the MG executes the first planning dispatch of the round; the agent of the CBESS perceives the environment, including the pre-transaction amounts with the external system, and obtains the first state vector s_t;
Step five, use s_t as the input of the main competition Q network to obtain the Q-value outputs corresponding to all actions, select the optimal estimated Q value from the current Q-value output by the ε-greedy method, determine the corresponding action a_t, and execute it;
Step six, update the residual capacity SOC_t of the CBESS to SOC_{t+1}; judge whether SOC_{t+1} exceeds the range [0,1], i.e. whether the limit is violated, and from this compute the termination indicator done_t of the iteration; at the same time compute the instant reward r_t after the action;
Step seven, the MG performs the second planning of the current period according to the tradable electric quantity actually fed back by the CBESS and determines the electric quantity traded with the external system; at the same time it publishes the pre-transaction quantities P^{mg.CHE}_{t+1} and P^{mg.grid}_{t+1} of the next period, which serve as the perceived state information of the agent in the next period; the state of the system is updated to s_{t+1};
Step eight, compute s_t, a_t, r_t and s_{t+1}, and store them together with done_t in the leaf nodes of the sumtree in sequence; if the quantity of stored data reaches the preset mini-batch sampling capacity m, randomly sample m samples from the stored data, calculate the current target Q value and its error, and update all hyperparameters of the main competition Q network through gradient back-propagation;
Step nine, after the Q network is updated, recalculate and update the priorities p_i of the data stored in the sumtree, copy the parameters of the main competition Q network to the target competition Q network, and set the current state s = s_{t+1}; if s is a termination state or the number of iteration rounds T is reached, the iteration of the current round ends and the method returns to step three to continue the loop; otherwise, go to step five to continue the iteration.
2. The intelligent online control method of the grid-connected shared energy storage system according to claim 1, wherein the main competition Q network is a multi-hidden-layer architecture with a single-neuron state-value sub-layer and a K-neuron action-advantage sub-layer; the ReLU function is selected as the activation function to accelerate convergence; the inter-layer weights ω are initialized with a normal distribution, and the biases b are initialized to constants close to 0; the time-sequence number, the state of charge of the CBESS, the market prices, and the pre-transaction quantities of the MG with the CBESS/upper-level distribution network form the state feature vector s_t used as the network input; the network outputs the optimal discretized charge-discharge action value Q_t, and is finally trained to convergence by prioritized replay of the data.
3. The intelligent online control method of the grid-connected shared energy storage system according to claim 1, wherein the action set A is: the action space of the CBESS is uniformly discretized into K discrete charge-discharge options P_be^(k):
A = {P_be^(1), P_be^(2), …, P_be^(K)}
where A is the set of all possible actions and P_be^(k) denotes the k-th charge/discharge action in the uniformly discretized action space of the CBESS.
4. The intelligent online control method of the grid-connected shared energy storage system according to claim 1, wherein the Markov decision process established for the CBESS maps the CBESS charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates, specifically:
the residual electric quantity of the CBESS changes continuously during charging and discharging, and its variation is related to the charge/discharge energy and the self-discharge in the period; the recursive relationship for charging is
SoC(t) = (1 − σ_sdr)·SoC(t−1) + P_be·(1 − L_c)·Δt / E_cap
and the discharging process is
SoC(t) = (1 − σ_sdr)·SoC(t−1) − P_be·Δt / [E_cap·(1 − L_dc)]
where SoC(t) is the residual electric quantity (state of charge) of the CBESS in period t; P_be(t) is the charge/discharge power of the CBESS in period t; σ_sdr is the self-discharge rate of the storage medium; L_c and L_dc are the charging and discharging losses of the CBESS, respectively; and Δt is the duration of each calculation window;
the maximum allowable charge/discharge power of the CBESS at time t is determined by its charge/discharge characteristics and its remaining state of charge at time t; during operation the following constraint must also be satisfied:
SoC_min ≤ SoC(t) ≤ SoC_max
where SoC_max and SoC_min are the upper and lower limits of the CBESS state-of-charge constraint;
the environmental state features are as follows: the environmental state feature vector perceived by the CBESS at time t is defined as
s_t = [t, SOC_t, price^{b.pre}_t, price^{s.pre}_t, P^{mg.CHE}_t, P^{mg.grid}_t]^T
where t is the time-sequence number; price^{b.pre}_t and price^{s.pre}_t denote the predicted selling and purchasing prices of the upper-level grid at time t, and P^{mg.CHE}_t and P^{mg.grid}_t denote the pre-transaction electric quantities between the micro-grid and the CBESS and the upper-level grid, respectively;
the instant reward function is as follows: the CBESS obtains energy arbitrage by charging during off-peak hours and discharging during peak hours; after the actual transaction powers with the micro-grid and the upper-level grid are determined, the reward benefit r_EAP is calculated according to the real-time price;
the total operation and maintenance cost C_{o,m} of the CBESS includes the basic charge/discharge cost
C_1 = |P_be| · c_be
plus an additional operating-cost term incurred when the state of charge approaches its limits (given as an image in the original);
a negative reward r_line with coefficient σ is added as a penalty to suppress the power fluctuation P_exc_grid at the point of common connection:
r_line = −σ · |P_exc_grid|
if the executed action drives the SOC outside [0,1], a large penalty r_exc is given to prevent the agent from making unreasonable decisions in subsequent learning; the instant reward r_t is:
r_t = r_EAP − C_{o,m} + r_line, or r_t = r_exc when the SOC limit is violated.
5. the method of claim 1, wherein the MG performs a first scheduling in a round to obtain a first state vector s obtained by a proxy-aware environment of a pre-transaction CBESS with an external system tThe method comprises the following steps: for the MG model, which aims to minimize the running cost under the predicted price signal, the objective function of the economic dispatch model is as follows:
Figure FDA0003547888980000033
in the formula, T is a planning period;
Figure FDA0003547888980000034
is the power generation cost of the z-th CDG, ci esThe operation cost of the ith microgrid for energy storage is;
Figure FDA0003547888980000035
is the power output of the z-th CDG,
Figure FDA0003547888980000036
the charging and discharging power of the ith microgrid energy storage;
Figure FDA0003547888980000037
Figure FDA0003547888980000038
respectively represents the selling price and the purchasing price of the superior distribution network in each period,
Figure FDA0003547888980000039
respectively representing the selling price and the purchasing price issued by the CBESS operator;
the micro-grid adopts a mixed integer linear programming method according to the prediction data to obtain the transaction electric quantity between the time interval and the CBESS and the superior distribution network
Figure FDA00035478889800000310
And issuing transaction information to the outside; the agent of CBESS obtains the state feature vector by sensing the external environment
Figure FDA00035478889800000311
6. The method according to claim 1, wherein s_t is used as the input of the main competition Q network to obtain the Q-value outputs corresponding to all actions, the optimal estimated Q value is selected from the current Q-value output by the ε-greedy method, and the action a_t corresponding to the optimal estimated Q value is determined and executed, including the following process:
use s_t as the input of the main competition Q network to obtain the Q-value outputs of all actions; with the ε-greedy method, select the corresponding action a_t in the current Q-value output and execute a_t in state s_t; for the ε-greedy policy, a value ε ∈ (0,1) is first set; with probability (1 − ε) the action a* currently considered optimal (maximum Q value) is chosen greedily, and with probability ε a potential behavior is explored at random from all K discrete candidate actions:
a_t = argmax_a Q(s_t, a; θ) with probability 1 − ε, and a_t drawn uniformly from A with probability ε,
where ε decreases gradually from ε_ini to ε_fin over the course of the iterations.
7. The intelligent online control method of the grid-connected shared energy storage system according to claim 1, wherein the residual capacity SoC_t of the CBESS is updated to SoC_{t+1}; whether SoC_{t+1} exceeds the range [0,1] determines whether the limit is violated, from which the termination indicator done_t of the iteration is computed, and the instant reward r_t after the action is computed at the same time, specifically: update the electric quantity SoC_t of the CBESS to SoC_{t+1}, judge whether the iteration is in a termination state, and compute the instant reward r_t after the action; a binary variable done_t serves as the termination indicator and as the interruption index of each iteration process:
done_t = 1 if the state of charge goes out of limit during energy-storage operation, and done_t = 0 otherwise;
done_t = 1 indicates termination and the iteration is exited, while done_t = 0 indicates the iteration is not terminated.
8. The method according to claim 1, wherein step eight computes s_t, a_t, r_t and s_{t+1} and stores them together with done_t in the leaf nodes of the sumtree in sequence; if the quantity of stored data reaches the preset mini-batch sampling capacity m, m samples are randomly drawn from the stored data, the current target Q value and its error are calculated, and all hyperparameters of the main competition Q network are updated through gradient back-propagation, wherein the current target Q value y_j is:
y_j = r_j + γ·Q'(s_{j+1}, argmax_{a'} Q(s_{j+1}, a'; θ); θ'), with y_j = r_j when done_j = 1;
a proportional prioritization strategy is adopted, i.e. the sampling probability P(i) of the i-th sample is:
P(i) = p_i^α / Σ_k p_k^α
where α ∈ [0,1] is the power exponent that converts the magnitude of the TD error into a priority; if α = 0, this degenerates to uniform random sampling; p_i is the priority of transition i, calculated as:
p_i = |δ_i| + ζ
where ζ is a small positive offset;
the bias is corrected using importance-sampling (IS) weights to obtain a mean-square-error loss function L_i(θ_i) that takes the sample priorities into account; finally all parameters θ of the main competition Q network are updated through gradient back-propagation of the neural network:
ω_j = (N·P(j))^(−β) / max_i ω_i
θ_i = θ_{i−1} + α·∇_{θ_i} L_i(θ_i)
where ω_j is the IS weight of sample j, and β is a hyperparameter that gradually increases to 1.
CN202010754472.1A 2020-07-30 2020-07-30 Intelligent on-line control method of grid-connected shared energy storage system Active CN112003269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010754472.1A CN112003269B (en) 2020-07-30 2020-07-30 Intelligent on-line control method of grid-connected shared energy storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010754472.1A CN112003269B (en) 2020-07-30 2020-07-30 Intelligent on-line control method of grid-connected shared energy storage system

Publications (2)

Publication Number Publication Date
CN112003269A CN112003269A (en) 2020-11-27
CN112003269B true CN112003269B (en) 2022-06-28

Family

ID=73462676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010754472.1A Active CN112003269B (en) 2020-07-30 2020-07-30 Intelligent on-line control method of grid-connected shared energy storage system

Country Status (1)

Country Link
CN (1) CN112003269B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112671033B (en) * 2020-12-14 2022-12-23 广西电网有限责任公司电力科学研究院 Priority-level-considered microgrid active scheduling control method and system
CN112670982B (en) * 2020-12-14 2022-11-08 广西电网有限责任公司电力科学研究院 Active power scheduling control method and system for micro-grid based on reward mechanism
CN113126498A (en) * 2021-04-17 2021-07-16 西北工业大学 Optimization control system and control method based on distributed reinforcement learning
CN114048576B (en) * 2021-11-24 2024-05-10 国网四川省电力公司成都供电公司 Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid
CN114285854B (en) * 2022-03-03 2022-07-05 成都工业学院 Edge computing system and method with storage optimization and security transmission capability
CN116316755B (en) * 2023-03-07 2023-11-14 西南交通大学 Energy management method for electrified railway energy storage system based on reinforcement learning
CN117541036B (en) * 2024-01-10 2024-04-05 中网华信科技股份有限公司 Energy management method and system based on intelligent park

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150184549A1 (en) * 2013-12-31 2015-07-02 General Electric Company Methods and systems for enhancing control of power plant generating units

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109347149A (en) * 2018-09-20 2019-02-15 国网河南省电力公司电力科学研究院 Micro-capacitance sensor energy storage dispatching method and device based on depth Q value network intensified learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep reinforcement learning algorithm for voltage regulation of distribution networks containing energy storage systems; Shi Jingjian et al.; Electric Power Construction; 2020-03-01 (No. 03); full text *

Also Published As

Publication number Publication date
CN112003269A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN112003269B (en) Intelligent on-line control method of grid-connected shared energy storage system
CN111884213B (en) Power distribution network voltage adjusting method based on deep reinforcement learning algorithm
Yan et al. Deep reinforcement learning for continuous electric vehicles charging control with dynamic user behaviors
CN110059844B (en) Energy storage device control method based on ensemble empirical mode decomposition and LSTM
CN106600059B (en) Intelligent power grid short-term load prediction method based on improved RBF neural network
CN112614009B (en) Power grid energy management method and system based on deep expectation Q-learning
CN109878369B (en) Electric vehicle charging and discharging optimal scheduling method based on fuzzy PID real-time electricity price
CN110071502B (en) Calculation method for short-term power load prediction
Huang et al. A control strategy based on deep reinforcement learning under the combined wind-solar storage system
CN110751318A (en) IPSO-LSTM-based ultra-short-term power load prediction method
CN110414725B (en) Wind power plant energy storage system scheduling method and device integrating prediction and decision
CN112434848A (en) Nonlinear weighted combination wind power prediction method based on deep belief network
CN115714382A (en) Active power distribution network real-time scheduling method and device based on security reinforcement learning
CN113919545A (en) Photovoltaic power generation power prediction method and system with integration of multiple data models
CN115345380A (en) New energy consumption electric power scheduling method based on artificial intelligence
CN112381359A (en) Multi-critic reinforcement learning power economy scheduling method based on data mining
CN115795992A (en) Park energy Internet online scheduling method based on virtual deduction of operation situation
CN114648170A (en) Reservoir water level prediction early warning method and system based on hybrid deep learning model
CN115049102A (en) Electricity price prediction method and device, mobile terminal and storage medium
CN113972645A (en) Power distribution network optimization method based on multi-agent depth determination strategy gradient algorithm
CN117060408A (en) New energy power generation prediction method and system
CN116937565A (en) Distributed photovoltaic power generation power prediction method, system, equipment and medium
CN116683530A (en) Wind-light-containing hybrid type pumping and storing station cascade reservoir random optimization scheduling method
CN114648178B (en) Operation and maintenance strategy optimization method of electric energy metering device based on DDPG algorithm
CN114971250B (en) Comprehensive energy economy dispatching system based on deep Q learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant