CN112614009B - Power grid energy management method and system based on deep expectation Q-learning - Google Patents

Power grid energy management method and system based on deep expectation Q-learning

Info

Publication number
CN112614009B
CN112614009B CN202011418334.2A
Authority
CN
China
Prior art keywords
learning
power grid
energy management
neural network
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011418334.2A
Other languages
Chinese (zh)
Other versions
CN112614009A (en
Inventor
陈振
韩晓言
丁理杰
魏巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Original Assignee
Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd filed Critical Electric Power Research Institute of State Grid Sichuan Electric Power Co Ltd
Priority to CN202011418334.2A priority Critical patent/CN112614009B/en
Publication of CN112614009A publication Critical patent/CN112614009A/en
Application granted granted Critical
Publication of CN112614009B publication Critical patent/CN112614009B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06Electricity, gas or water supply
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/004Generation forecast, e.g. methods or systems for forecasting future energy generation
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/381Dispersed generators
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20Simulating, e g planning, reliability check, modelling or computer assisted design [CAD]
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2300/00Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
    • H02J2300/20The dispersed energy generation being of renewable origin
    • H02J2300/22The renewable source being solar energy
    • H02J2300/24The renewable source being solar energy of photovoltaic origin
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02EREDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E10/00Energy generation through renewable energy sources
    • Y02E10/50Photovoltaic [PV] energy
    • Y02E10/56Power conversion systems, e.g. maximum power point trackers
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The application discloses a power grid energy management method and system based on a double-deep expected Q-learning network algorithm. First, the uncertainty of the photovoltaic output at the prediction point is modeled with a Bayesian neural network to obtain the probability distribution of the photovoltaic output; this probability distribution is then input into a power grid energy management model based on the double-deep expected Q-learning network algorithm to obtain the corresponding photovoltaic generation output strategy, and the system operates each photovoltaic output device according to that strategy. The application formulates the economic dispatch problem of the microgrid as a Markov decision process and maps the objective function and constraints into the reward and penalty function of reinforcement learning, so that the optimal decision is obtained through the agent's ability to learn and interact with the environment. By modeling the uncertainty of the photovoltaic generation output in the learning environment with a Bayesian neural network, random state transitions are properly accounted for in the Markov decision process, and the convergence rate of the algorithm is significantly improved.

Description

Power grid energy management method and system based on deep expectation Q-learning
Technical Field
The application relates to the technical field of power grid energy management systems, in particular to a power grid energy management method and system based on deep expectation Q-learning.
Background
With the development of renewable-energy generation technology, the penetration of distributed sources such as photovoltaics in the power system keeps increasing, which brings problems and even challenges to the safe and economic operation of the power system. The output of distributed sources such as photovoltaics is uncertain and time-varying, being influenced by surrounding environmental factors such as weather, and this makes it difficult to draw up dispatch plans. How to model the uncertainty of photovoltaic output properly and solve the resulting problem efficiently is therefore an important issue worthy of research.
For uncertainty modeling, the common methods are stochastic models, fuzzy models, interval-number models and chance-constrained models. The fitting quality of a stochastic model is limited by the chosen family of distribution functions; an interval-number model describes the uncertainty set with interval numbers and avoids risk under extreme conditions, but the resulting strategies are conservative and sacrifice the economy of system operation; a chance-constrained model tries to balance risk minimization against economic benefit by converting the uncertain scheduling model into a deterministic optimization problem.
Since solving an uncertainty optimization model is quite complex, the nonlinear optimization model is usually linearized before being solved; common methods include mixed-integer programming, dynamic programming, stochastic linear programming, improved differential evolution and the moth-flame optimization algorithm. Classical optimization algorithms struggle to find the global optimum of a nonlinear model, while heuristic algorithms generally take a long time. Against this background, a microgrid with high photovoltaic penetration requires more accurate modeling of the photovoltaic generation output and an efficient solution algorithm.
Deep reinforcement learning is a rapidly developing branch of artificial intelligence that adapts automatically to changes in uncertain factors by continually improving its policy through interaction with the environment and feedback learning. Compared with traditional algorithms, a deep reinforcement learning algorithm does not rely on an explicit objective function; it evaluates decision behavior through a reward function instead, and can produce a corresponding control scheme and optimization strategy for different operating requirements and optimization objectives, enabling real-time decision making.
Disclosure of Invention
In order to model the uncertainty of photovoltaic output properly and solve it efficiently, the application provides a power grid energy management method and system based on a deep expected Q reinforcement learning algorithm, realizing real-time energy and economic dispatch of a microgrid.
The application is realized by the following technical scheme:
the scheme provides a power grid energy management method based on a double-deep expected Q-learning network algorithm, which comprises the following steps of:
s1, modeling uncertainty of photovoltaic output of a predicted point based on a Bayesian neural network and obtaining probability distribution of the photovoltaic output;
s2, inputting probability distribution of photovoltaic output into a power grid energy management model based on a double-depth expected Q-learning network algorithm to obtain a corresponding photovoltaic power generation output strategy;
and S3, operating all the photovoltaic output devices according to the photovoltaic power generation output strategy.
The further optimization scheme is that the power grid energy management model establishment process based on the double-depth expected Q-learning network algorithm is as follows:
the method comprises the following steps of T1, only considering an energy storage system as a controllable resource, taking the lowest daily operation cost as an objective function and meeting the operation constraint of a micro-grid, and establishing a power grid energy management model;
T2, modeling the power grid energy management model in T1 as a Markov decision process;
T3, based on the probability distribution of photovoltaic output and considering the random process of state transition, providing a double-deep expected Q-learning network algorithm by modifying the iteration rule of the Q value on the basis of the traditional model-free algorithm, and solving the Markov decision process;
and T4, setting reasonable parameters to ensure convergence of a neural network learning process, and training a neural network based on a double-depth expected Q-learning network algorithm to obtain a power grid energy management model based on the double-depth expected Q-learning network algorithm.
The further optimization scheme is that the specific modeling process of the predicted point photovoltaic output uncertainty based on the Bayesian neural network in the S1 is as follows:
s11, information of decisive factors, persistence influence factors and bursty influence factors of the predicted points is read, and data preprocessing is carried out;
s12, inputting the preprocessed predicted point decisive factor data and the persistence influence factor data into a deep full-connection layer of the Bayesian neural network, and inputting the preprocessed abrupt influence factor data into a probability layer of the Bayesian neural network for modeling;
s13, obtaining photovoltaic output probability distribution of the predicted point after multiple model training.
The further optimization scheme is that the objective function with the lowest daily operation cost in T1 is as follows: the daily operation cost is the sum of the electricity purchasing cost and the operation cost of the energy storage system in the dispatching period, and is expressed as:
wherein: t is the number of scheduling periods; x is x t The amount of electricity x needed to be exchanged with the main grid for period t t And > 0 represents purchasing electricity from the main power grid, and conversely selling electricity to the main power grid; c b,t /c g,t Representing prices for buying/selling electricity from/to the main grid during period t; τ t For the operation cost of the energy storage system in the period t, || + As a positive function.
The further optimization scheme is that the micro-grid operation constraint in the T1 comprises the following steps: power balance constraints, energy storage system operating constraints, and battery state constraints during a scheduling period.
The further optimization scheme is that the specific modeling process of the Markov decision process in T2 comprises the following steps:
constructing a state space by considering the diversity and the necessity of the system variables;
taking charge and discharge of the energy storage system and the action of buying and selling electric quantity to the power grid into consideration to ensure the power balance inside the system to construct an action space;
mapping the objective function into a rewarding decision function;
the discount rate takes a fixed value of 0.9 in calculation;
the state transition probability is expressed as the probability of the photovoltaic output of the next state.
The further optimization scheme is that the specific method of the step T3 is as follows:
introducing an experience playback mechanism on the basis of a reinforcement learning Q-learning algorithm, storing rewards and state updating conditions obtained by each interaction with the environment, and obtaining an approximate Q value after the parameters of the neural network are converged; selecting decoupling actions of the estimated Q network and the target Q network and calculating a target Q value;
a double-deep expected Q-learning network algorithm is provided on the basis of a double-deep Q-learning network, a Bayesian neural network and deep reinforcement learning are combined, a stochastic process of state transition is represented by the Bayesian neural network, and a Q expected value in a stochastic state is utilized to update the Q network.
The further optimization scheme is that the specific process of updating the Q network by using the Q expected value in the random state is as follows:
firstly, selecting an energy storage system scheduling strategy in an estimated Q network;
then, updating the Q value in the target Q network;
simplifying the model and discretizing the probability density function.
The further optimization scheme is that, when setting reasonable parameters in T4 to ensure convergence of the neural network learning process, the experience replay pool, the exploration rate and the learning rate need to be considered.
The application also provides a grid energy management system based on the double-deep expectation Q-learning network algorithm, which comprises:
the probability distribution acquisition device models uncertainty of photovoltaic output of the predicted point based on the Bayesian neural network and acquires probability distribution of the photovoltaic output;
the first modeling device only considers the energy storage system as a controllable resource, takes the lowest daily operation cost as an objective function and meets the operation constraint of the micro-grid, and establishes a power grid energy management model;
the second modeling device models the power grid energy management model into a Markov decision process;
the solving device considers the random process of state transition, proposes a double-depth expected Q-learning network algorithm by modifying the iteration rule of the Q value on the basis of the traditional model-free algorithm, and solves the Markov decision process;
the model training device sets reasonable parameters to ensure convergence of a neural network learning process, trains a neural network based on a double-depth expected Q-learning network algorithm to obtain a power grid energy management model based on the double-depth expected Q-learning network algorithm;
the power grid energy management system controls all photovoltaic output devices based on a photovoltaic power generation output strategy obtained by a power grid energy management model of a double-depth expected Q-learning network algorithm.
The principle of the application is as follows:
1. modeling the uncertainty of the photovoltaic output of the predicted point based on the Bayesian neural network and obtaining probability distribution of the photovoltaic output;
the Bayesian neural network can obtain a relatively stable prediction model according to a relatively small data volume, so that the fitting problem can not occur; meanwhile, the weights and the biases of the neurons of the probability layer obey a certain probability distribution, and the capability of describing uncertainty variables is provided. The method is characterized in that the photovoltaic output prediction based on the Bayesian neural network needs to analyze various influencing factors, the factors influencing the photovoltaic output are of various types, and the step is to model the photovoltaic output in a classification way:
(1) Decisive factor
The intensity of the illumination radiation is a decisive factor influencing the photovoltaic output. The photovoltaic output can be obtained by the following formula.
P_PV = φ·A·η
wherein: φ is the illumination radiation intensity; A is the total area of the photovoltaic array; η is the photoelectric conversion efficiency; A and η are fixed parameters of the photovoltaic panel.
(2) Persistence influencing factor
Persistent influencing factors are quantities such as temperature, relative humidity and wind speed that affect the photovoltaic output over a longer time. The time over which these factors affect the photovoltaic output is usually longer than the dispatch period, so their effect is mined from historical data. Because these data are complex, have high feature dimensionality and are not linearly related to the photovoltaic output, feeding them directly into the network would increase the training difficulty; they are therefore preprocessed by a regression analysis module and a feature extraction module. First, their Pearson coefficients with the photovoltaic output are computed to quantify the interdependence of temperature, wind speed, relative humidity and photovoltaic output. The correlation coefficient of a persistent influencing factor with the photovoltaic output also depends on the prediction time interval, so the correlation coefficients of temperature, relative humidity and wind speed with the photovoltaic output in different periods are learned from the historical data. Finally, the multidimensional features are mapped to a low dimension by a deep fully connected neural layer, which preserves the integrity of the features while reducing model complexity and improving training efficiency.
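As a rough illustration of the regression-analysis step, the sketch below computes the Pearson coefficient of each persistent factor with the photovoltaic output for one period of the day; the dictionary layout and factor names are assumptions made for the example, not part of the patent.

```python
import numpy as np

def pearson_with_pv(history, period):
    """Pearson correlation of each persistent factor with PV output in one period.

    `history` maps names ("temperature", "humidity", "wind_speed", "pv") to
    arrays of shape (n_days, n_periods); the layout is assumed for illustration.
    """
    pv = history["pv"][:, period]
    coeffs = {}
    for name in ("temperature", "humidity", "wind_speed"):
        x = history[name][:, period]
        # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
        coeffs[name] = float(np.corrcoef(x, pv)[0, 1])
    return coeffs
```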
(3) Factors of bursty influence
Sudden influencing factors, such as haze or moving cloud cover, affect the photovoltaic output over a short time, generally shorter than the dispatch period. Their influence on the photovoltaic output only shows up between adjacent periods; that is, the photovoltaic output at the prediction point is related to the photovoltaic output immediately before it, and the correlation with the output of the period just preceding the prediction point is the highest. Therefore only the output data of the period immediately before the prediction point are fed into the Bayesian neural network, which avoids the data redundancy that multi-period inputs would cause.
The temperature, wind-speed and relative-humidity data obtained from the regression analysis are fed into a deep fully connected layer for feature extraction and dimensionality reduction; the maximum photovoltaic output prediction and the extracted features are then fed into the deep fully connected layer together and, jointly with the photovoltaic output of the period immediately preceding the prediction point, serve as the input of the probabilistic layer of the Bayesian neural network.
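A minimal PyTorch sketch of a network of this shape is given below. The layer widths follow the figures reported later in the embodiment (30, 50 and 55 neurons); the use of a reparameterized Gaussian linear layer as the probabilistic layer, and all class and argument names, are assumptions made for illustration rather than the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Linear layer whose weights and biases are Gaussian, sampled on every forward pass."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))

    def forward(self, x):
        w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)  # sample weights
        b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)  # sample biases
        return F.linear(x, w, b)

class PVBayesNet(nn.Module):
    """Fully connected feature extraction for persistent factors + probabilistic output layer."""
    def __init__(self, n_factors=3):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(n_factors, 30), nn.ReLU(),
                                      nn.Linear(30, 50), nn.ReLU())
        # probabilistic layer input: extracted features + deterministic PV estimate
        # + photovoltaic output of the period preceding the prediction point
        self.prob = BayesLinear(50 + 2, 55)
        self.head = nn.Linear(55, 1)

    def forward(self, factors, pv_det, pv_prev):
        h = self.features(factors)
        z = torch.cat([h, pv_det, pv_prev], dim=-1)
        return self.head(torch.relu(self.prob(z)))

# Repeated stochastic forward passes approximate the predictive distribution:
# net = PVBayesNet(); samples = torch.stack([net(f, p, q) for _ in range(100)])
```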
2. Only considering the energy storage system as a controllable resource, taking the lowest daily operation cost as an objective function and meeting the operation constraint of the micro-grid, and establishing a power grid energy management model;
as shown in fig. 3, controllable devices in a microgrid generally include energy storage systems, controllable loads, electric vehicles involved in scheduling, and the like. The application focuses on the modeling and solving of the micro-grid random scheduling based on deep reinforcement learning, so that only the energy storage system is considered as a controllable resource. For scenes containing other controllable devices, only the dimension of the action in the Markov decision process needs to be changed on the basis of the model of the application:
(1) Objective function
And taking the lowest daily operation cost as an objective function, and solving an energy management strategy of the micro-grid. The daily operation cost is the sum of the electricity purchase cost and the operation cost of the energy storage system in the dispatching period, and can be defined as follows:
wherein: t is the number of scheduling periods; x is x t The amount of electricity x needed to be exchanged with the main grid for period t t And > 0 represents purchasing electricity from the main power grid, and conversely selling electricity to the main power grid; c b,t /c g,t Representing prices for buying/selling electricity from/to the main grid during period t; τ t And the operation cost of the energy storage system is t time period. I. + As a positive function.
(2) Constraint conditions
1) Power balance constraint
x_t - P_t^L + P_t^PV - P_t^ESS = 0
wherein: P_t^PV is the photovoltaic generation output at time t, a random variable; P_t^ESS is the power of the energy storage battery at time t, with P_t^ESS > 0 meaning the energy storage system is charging and P_t^ESS < 0 meaning it is discharging; P_t^L is the load power at time t.
2) Energy storage system operation constraints
β_min < β_t < β_max
β_{t+1} = β_t + η_c·P_t^ch·Δt - (P_t^dis/η_d)·Δt
wherein: β_t is the state of charge of the energy storage system at time t, and β_min and β_max are the minimum and maximum values allowed for the state of charge; P_t^ch and P_t^dis are the charging and discharging power of the energy storage system; η_c and η_d are the charging and discharging efficiencies of the energy storage system; P_max^ch and P_max^dis are the maximum charging and discharging power of the energy storage system.
Because of the lifetime degradation and capacity fade of the energy storage system, its per-kWh cost has to be considered in optimization and scheduling. The per-kWh cost is the storage cost obtained by levelizing the whole-life-cycle cost of the energy storage system over the energy it delivers. Defining the per-kWh cost of the energy storage system in operation as λ, the operating cost of the energy storage system in period t can be expressed as:
τ_t = λ·|P_t^ESS|
3) Battery state constraints during scheduling periods
β_0 = β_T
wherein: β_T is the state of charge of the energy storage system at the end of the scheduling period, and β_0 is the state of charge at the beginning of the scheduling period.
The model targets a small microgrid, as shown in Fig. 3: all electrical equipment is supplied by the same distribution feeder and is geographically close together, so power-flow constraints do not need to be considered.
In this model, since the photovoltaic output is an uncertain variable, the objective function is an expected value, and the corresponding stochastic optimization model is formulated accordingly.
3. modeling a grid energy management model as a markov decision process;
when the deep reinforcement learning algorithm is adopted to solve the economic dispatch model, firstly, the power grid energy management model needs to be modeled as a Markov decision process:
(1) State space
The state is the set of observable variables. When constructing the state space, both the diversity and the necessity of the system variables are considered: the state at time t comprises the state of charge of the energy storage system in the microgrid, the real-time load power, the real-time photovoltaic generation power and the predicted photovoltaic output of the next period. The uncertain output power of the next period is represented by the probability distribution produced by the Bayesian neural network, so the state at time t can be written as {β_t, P_t^PV, P_t^L} together with that distribution.
(2) Action space
Actions are the adjustable variables. In this model, the power balance inside the system is maintained through the charging and discharging actions of the energy storage system and the buying and selling of electricity to the grid, with the main grid supporting the microgrid to guarantee its internal energy balance. The action at time t is therefore a discrete set in which the first n elements and the last n elements represent discharging and charging of the energy storage system, respectively.
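One way to realize such a discrete action set, with n discharging levels, one idle action and n charging levels (matching the nine gears described in the embodiment for n = 4), is sketched below; the mapping and its parameters are assumptions for illustration.

```python
def action_to_power(action_index, n_levels=4, p_max=50.0):
    """Map a discrete action index to energy-storage power (charging positive).

    Indices 0 .. n_levels-1          -> discharging at increasing power,
    index   n_levels                 -> no action,
    indices n_levels+1 .. 2*n_levels -> charging at increasing power.
    """
    if action_index < n_levels:                          # discharge gears
        return -(action_index + 1) * p_max / n_levels
    if action_index == n_levels:                         # idle
        return 0.0
    return (action_index - n_levels) * p_max / n_levels  # charge gears
```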
(3) Rewards
In deep reinforcement learning, optimization objectives are mapped into rewards decision functions. According to an objective function modeled by the power grid energy management model, the rewards at the time t are set as follows:
wherein: the first term is the reward for buying/selling electricity from/to the grid, and the second term is the operating reward of the energy storage system. The grid reward is expressed in terms of the electricity bought from and sold to the grid in period t.
The reward of the energy storage system comprises the operation cost τ_t and a penalty υ_t for violating the operating constraints. For the state-of-charge constraint, the violation penalty term υ_t is defined as:
υ_t(s∈ψ, P_t^ESS) = -δ·|P_t^ESS|
wherein: δ is the penalty per unit and can be set to a relatively large number; ψ is the set of violation states during system operation, mainly the state of charge of the energy storage system exceeding its limits. In period t, the violation states can be expressed as:
Δβ > β_max - β_t
Δβ > β_t - β_min
wherein Δβ = η_c·|P_t^ESS|_+·Δt + |-P_t^ESS|_+·Δt/η_d.
For the battery state constraint over the scheduling period, a larger penalty Γ is imposed if the battery state at the end of the period does not equal the initial state. The battery operation reward for period t, and the total reward within one scheduling period, are defined accordingly.
(4) State transition probability and discount rate
In the Markov decision process, the discount rate expresses how much weight is given to future rewards; a fixed value of 0.9 is used in the calculations. After the state s and the action a have been selected, the state of charge of the energy storage system in the next state follows from the operating constraints of the energy storage system and the real-time load power can be read directly, so the state transition probability reduces to the probability of the photovoltaic output of the next state.
4. Considering a state transition random process, providing a double-depth expected Q-learning network algorithm by modifying an iteration rule of a Q value on the basis of a traditional model-free algorithm, and solving a Markov decision process;
the model-free reinforcement learning algorithm obtains a single fixed state transition process according to the interaction of the intelligent agent and the environment, and ignores the random problem of state transition in the learning environment. When the state variable of the reinforcement learning contains uncertain factors, the random transition of the neglected state can influence the convergence speed of the deep reinforcement learning algorithm, so the application provides a double-deep expected Q-learning network algorithm. The dual-deep expectation Q-learning network algorithm combines Bayesian neural networks with deep reinforcement learning, and updates the Q network with the Q expectation in the random state by representing the random process of state transition with the Bayesian neural network. The flow of the algorithm is shown in fig. 4;
(1) Q-learning algorithm in reinforcement learning
Based on the state s, the agent selects an action a with the ε-greedy method, obtains the reward r(s, a), and after entering the state s' updates the value function Q(s, a); here ε is the exploration probability and γ is the attenuation (discount) factor.
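For reference, a standard tabular Q-learning update with ε-greedy action selection looks like the sketch below; the learning rate alpha is a standard ingredient of the update that the text above does not spell out, and all names are illustrative.

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """Explore with probability eps, otherwise act greedily on the current Q table."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning update of the value function Q(s, a)."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
```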
(2) Dual deep Q learning network (double deep Q network, DDQN) algorithm
An experience replay mechanism is introduced to store the reward and state update obtained from each interaction with the environment, and an approximate Q value is obtained once the neural network parameters have converged. Because this Q value is often overestimated, the overestimation is avoided by decoupling action selection in the estimated Q network from the calculation of the target Q value in the target Q network.
The specific algorithm can be expressed as follows:
Q(s, a; θ_t) = r(s, a) + γ·Q(s', a; θ_t)
wherein: θ_e are the parameters of the estimated Q network and θ_t the parameters of the target Q network, and the action a in the target term is the one selected by the estimated Q network. Every fixed number of training steps, the parameters of the estimated Q network are copied to the target Q network, i.e. θ_t ← θ_e.
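A minimal PyTorch sketch of this decoupled target: the action is chosen with the estimated (online) network while its value is read from the target network; tensor shapes and names are assumptions.

```python
import torch

@torch.no_grad()
def ddqn_target(q_est, q_tgt, reward, next_state, gamma=0.9):
    """Double-DQN target: select the action with q_est, evaluate it with q_tgt."""
    a_star = q_est(next_state).argmax(dim=1, keepdim=True)   # action selection (online net)
    q_next = q_tgt(next_state).gather(1, a_star).squeeze(1)  # action evaluation (target net)
    return reward + gamma * q_next

def sync_target(q_est, q_tgt):
    """theta_t <- theta_e: copy the online parameters into the target network."""
    q_tgt.load_state_dict(q_est.state_dict())
```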
(3) Dual-deep expectation Q-learning network (double deep expected Q network, DDEQN) algorithm
Based on DDQN, a DDEQN algorithm is provided, a Bayesian neural network is combined with deep reinforcement learning, a stochastic process of state transition is represented by the Bayesian neural network, and a Q expected value in a stochastic state is utilized to update the Q network.
First, an energy storage system scheduling policy is selected in an estimated Q network:
then, the Q value is updated in the target Q network, and the calculation formula is as follows:
Q(s, a; θ_t) = r(s, a) + γ·E(Q(s', a; θ_t))
wherein E(Q(s', a; θ_t)) is the expected target Q value of action a over the next state s'. In period t, the Bayesian neural network predicts the photovoltaic output of the next period, whose probability density function is ρ(s'), so E(Q(s', a; θ_t)) is the expectation of Q(s', a; θ_t) under ρ(s').
To simplify the model, the probability density function is discretized: the prediction of the Bayesian neural network is sampled and 2m intervals are formed between the minimum and maximum sampled values, each interval being represented by its left endpoint. After repeated sampling the probability of each interval is estimated, and the expected Q value is computed as the probability-weighted sum of the Q values at these discrete points.
The action-value function can then be rewritten in this discretized form.
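The resulting expected target can be approximated as a probability-weighted sum over the 2m discretized photovoltaic-output points, as in this PyTorch sketch; tensor shapes and names are assumptions for illustration.

```python
import torch

@torch.no_grad()
def ddeqn_target(q_est, q_tgt, reward, next_states, probs, gamma=0.9):
    """Expected double-DQN target over a discretized next-state distribution.

    next_states: tensor (batch, 2m, state_dim), one candidate state per PV interval
    probs:       tensor (batch, 2m), estimated probability of each interval
    """
    b, k, d = next_states.shape
    flat = next_states.reshape(b * k, d)
    a_star = q_est(flat).argmax(dim=1, keepdim=True)       # select action with the online net
    q_next = q_tgt(flat).gather(1, a_star).reshape(b, k)   # evaluate it with the target net
    expected_q = (probs * q_next).sum(dim=1)               # E[Q(s', a*)] over the 2m intervals
    return reward + gamma * expected_q
```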
5. setting reasonable parameters to ensure convergence of a neural network learning process, and training a neural network based on a double-deep expected Q-learning network algorithm to obtain a power grid energy management model based on the double-deep expected Q-learning network algorithm.
The experience replay pool, the exploration rate and the learning rate all affect the convergence of the neural network during training, so these parameters must be set reasonably to ensure that the learning process converges.
(1) Experience replay pool: experience replay mainly removes the correlation between experience samples, and training proceeds by randomly sampling from the stored set of past state transitions. Because the model has a relatively large action set, a larger replay pool has to be provided so that the mini-batches drawn during training are sufficiently diverse and comprehensive.
(2) Exploration rate: a fixed ε in the ε-greedy method can prevent the neural network training from converging in its later stages. The application lets ε decrease gradually with the number of training steps while exploring the environment, which yields better convergence.
(3) Learning rate: too high a learning rate can cause overfitting, while too low a learning rate makes convergence slow or even stagnant, so a suitable learning rate has to be found through repeated trials. Since the parameters of the target Q network are copied from the estimated Q network, an appropriate copy frequency must also be set to avoid overestimation.
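As one concrete illustration, the parameter values reported later in the embodiment (replay pool of 4800 samples, mini-batches of 600, ε decayed from 0.1 to 0.001 over 24000 steps, learning rate 0.001, target network updated every 10 training iterations) could be collected in a configuration like the following sketch; the linear decay schedule and all names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    replay_size: int = 4800    # experience replay pool capacity
    batch_size: int = 600      # mini-batch size for random sampling
    eps_start: float = 0.1     # initial exploration rate
    eps_end: float = 0.001     # final exploration rate
    eps_steps: int = 24000     # number of steps over which epsilon decays
    lr: float = 0.001          # learning rate (Adam)
    target_update: int = 10    # copy theta_e -> theta_t every this many trainings

    def epsilon(self, step: int) -> float:
        """Exploration rate decayed linearly from eps_start to eps_end."""
        frac = min(step / self.eps_steps, 1.0)
        return self.eps_start + frac * (self.eps_end - self.eps_start)
```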
And finally, controlling the operation of each photovoltaic output device based on the photovoltaic power generation output strategy obtained in the power grid energy management model of the double-depth expected Q-learning network algorithm.
Compared with the prior art, the application has the following advantages and beneficial effects:
the application provides a power grid energy management method and a system based on a double-depth expected Q-learning network algorithm, which are used for simulating a micro-grid economic scheduling problem into a Markov decision process, mapping an objective function and constraint conditions into a reinforced learning reward and punishment function, and realizing real-time optimal decision by utilizing the learning and environment interaction capabilities of the objective function and constraint conditions; uncertainty modeling of photovoltaic power generation output in a learning environment is conducted through a Bayesian neural network, state random transfer is properly considered in a Markov decision process, and the convergence rate of an algorithm is remarkably improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application.
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a schematic diagram of a photovoltaic output prediction flow based on a Bayesian neural network;
FIG. 3 is a schematic diagram of a microgrid system composition;
FIG. 4 is a neural network training flow diagram of an algorithm;
FIG. 5 is a graph of park photovoltaic output;
FIG. 6 is a plot of campus load;
FIG. 7 is a graph of typical solar farm photovoltaic output versus load;
FIG. 8 is a graph comparing predicted results with actual values of photovoltaic output in different days;
FIG. 9 is a comparison of mode one and mode two convergence behavior;
fig. 10 is a state of charge diagram of the energy storage system in three modes.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present application, the present application will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present application and the descriptions thereof are for illustrating the present application only and are not to be construed as limiting the present application.
Example 1
As shown in fig. 1, a grid energy management method based on a dual-deep expectation Q-learning network algorithm includes the following steps:
s1, modeling uncertainty of photovoltaic output of a predicted point based on a Bayesian neural network and obtaining probability distribution of the photovoltaic output;
s2, inputting probability distribution of photovoltaic output into a power grid energy management model based on a double-depth expected Q-learning network algorithm to obtain a corresponding photovoltaic power generation output strategy;
and S3, operating all the photovoltaic output devices according to the photovoltaic power generation output strategy.
The further optimization scheme is that the power grid energy management model establishment process based on the double-depth expected Q-learning network algorithm is as follows:
the method comprises the following steps of T1, only considering an energy storage system as a controllable resource, taking the lowest daily operation cost as an objective function and meeting the operation constraint of a micro-grid, and establishing a power grid energy management model;
T2, modeling the power grid energy management model in T1 as a Markov decision process;
T3, based on the probability distribution of photovoltaic output and considering the random process of state transition, providing a double-deep expected Q-learning network algorithm by modifying the iteration rule of the Q value on the basis of the traditional model-free algorithm, and solving the Markov decision process;
and T4, setting reasonable parameters to ensure convergence of a neural network learning process, and training a neural network based on a double-depth expected Q-learning network algorithm to obtain a power grid energy management model based on the double-depth expected Q-learning network algorithm.
The further optimization scheme is that the specific modeling process of the predicted point photovoltaic output uncertainty based on the Bayesian neural network in the S1 is as follows:
s11, information of decisive factors, persistence influence factors and bursty influence factors of the predicted points is read, and data preprocessing is carried out;
s12, inputting the preprocessed predicted point decisive factor data and the persistence influence factor data into a deep full-connection layer of the Bayesian neural network, and inputting the preprocessed abrupt influence factor data into a probability layer of the Bayesian neural network for modeling;
s13, obtaining photovoltaic output probability distribution of the predicted point after multiple model training.
The further optimization scheme is that the objective function with the lowest daily operation cost in T1 is as follows: the daily operation cost is the sum of the electricity purchasing cost and the operation cost of the energy storage system in the dispatching period, and is expressed as:
wherein: t is the number of scheduling periods; x is x t The amount of electricity x needed to be exchanged with the main grid for period t t And > 0 represents purchasing electricity from the main power grid, and conversely selling electricity to the main power grid; c b,t /c g,t Representing prices for buying/selling electricity from/to the main grid during period t; τ t For the operation cost of the energy storage system in the period t, || + As a positive function.
The further optimization scheme is that the micro-grid operation constraint in the T1 comprises the following steps: power balance constraints, energy storage system operating constraints, and battery state constraints during a scheduling period.
The further optimization scheme is that the specific modeling process of the Markov decision process in T2 comprises the following steps:
constructing a state space by considering the diversity and the necessity of the system variables;
taking charge and discharge of the energy storage system and the action of buying and selling electric quantity to the power grid into consideration to ensure the power balance inside the system to construct an action space;
mapping the objective function into a rewarding decision function;
the discount rate takes a fixed value of 0.9 in calculation;
the state transition probability is expressed as the probability of the photovoltaic output of the next state.
The further optimization scheme is that the specific method of the step T3 is as follows:
introducing an experience playback mechanism on the basis of a reinforcement learning Q-learning algorithm, storing rewards and state updating conditions obtained by each interaction with the environment, and obtaining an approximate Q value after the parameters of the neural network are converged; selecting decoupling actions of the estimated Q network and the target Q network and calculating a target Q value;
a double-deep expected Q-learning network algorithm is provided on the basis of a double-deep Q-learning network, a Bayesian neural network and deep reinforcement learning are combined, a stochastic process of state transition is represented by the Bayesian neural network, and a Q expected value in a stochastic state is utilized to update the Q network.
The further optimization scheme is that the specific process of updating the Q network by using the Q expected value in the random state is as follows:
firstly, selecting an energy storage system scheduling strategy in an estimated Q network;
then, updating the Q value in the target Q network;
simplifying the model and discretizing the probability density function.
The further optimization scheme is that, when setting reasonable parameters in T4 to ensure convergence of the neural network learning process, the experience replay pool, the exploration rate and the learning rate need to be considered.
Example 2
The embodiment provides a grid energy management system based on a double-deep expectation Q-learning network algorithm, which comprises:
the probability distribution acquisition device models uncertainty of photovoltaic output of the predicted point based on the Bayesian neural network and acquires probability distribution of the photovoltaic output;
the first modeling device only considers the energy storage system as a controllable resource, takes the lowest daily operation cost as an objective function and meets the operation constraint of the micro-grid, and establishes a power grid energy management model;
the second modeling device models the power grid energy management model into a Markov decision process;
the solving device considers the random process of state transition, proposes a double-depth expected Q-learning network algorithm by modifying the iteration rule of the Q value on the basis of the traditional model-free algorithm, and solves the Markov decision process;
the model training device sets reasonable parameters to ensure convergence of a neural network learning process, trains a neural network based on a double-depth expected Q-learning network algorithm to obtain a power grid energy management model based on the double-depth expected Q-learning network algorithm;
the power grid energy management system controls all photovoltaic output devices based on a photovoltaic power generation output strategy obtained by a power grid energy management model of a double-depth expected Q-learning network algorithm.
Example 3
The practical application of the present application is explained using the photovoltaic output and total load data of a small industrial park from May to December.
Assume the photovoltaic output and load power of the industrial park are as shown in Figs. 5, 6 and 7; the other parameters are listed in Table 1.
TABLE 1 energy storage system parameters
After repeated trials, the sample capacity of the experience replay mechanism in the DDEQN algorithm is set to 4800 and the size of each mini-batch to 600; the initial exploration rate is 0.1, the final exploration rate 0.001, and the number of exploration steps 24000; the learning rate is 0.001; and the target Q network parameters are updated once every 10 training iterations.
The Bayesian neural network program for photovoltaic generation output prediction is written in Python using the PyTorch package, and the DDEQN algorithm program is written on the TensorFlow framework; the optimizer is the Adam algorithm, which adapts the learning rate and therefore converges faster and better. The computer hardware is a Core i7-8550U with 8 GB RAM. The Bayesian neural network is trained for 10000 steps, taking 22 h; the DDEQN neural network is trained for 70000 steps, taking 49 h.
(1) Bayesian neural network training results
In the Bayesian neural network, the fully connected layer used for feature extraction has 30 neurons, the next fully connected layer has 50 neurons, and the probabilistic layer has 55 neurons. Two days, July 10 (sunny) and September 6 (rainy), were selected to verify the prediction results.
As can be seen from Fig. 8, the Bayesian neural network has high prediction accuracy. On the sunny day, the predicted mean of the Bayesian neural network is essentially equal to the actual value, the 95% confidence interval is narrow, and the prediction accuracy is high. On the rainy day, because of the complexity and variability of the surrounding environmental factors, the prediction error of the Bayesian neural network is larger around 6:00, but the predictions at other times remain accurate. Although the prediction accuracy is lower than on the sunny day, the predicted values fully follow the trend of the actual output, and the prediction error is within an acceptable range.
(2) Validity verification of DDEQN algorithm
The following three modes were designed for comparative analysis:
mode one: adopting a DDQN algorithm, taking uncertainty of photovoltaic output into consideration, and randomly extracting a predicted result of the Bayesian neural network as input to train the deep neural network;
mode two: adopting a DDEQN algorithm, namely the algorithm provided by the application;
mode three: a random optimization algorithm based on a scene method and considering uncertainty of photovoltaic output.
The scheduling period is one day, divided into 24 time periods. The photovoltaic output and load demand over a scheduling period are shown in Appendix D. A two-step training method is adopted for the neural network: action-optimization training is first carried out on single periods, and overall training is then carried out over the whole scheduling period, which effectively improves the convergence rate of the algorithm.
After each training step, the neural network is tested: the Q values corresponding to the optimal actions over one scheduling period are accumulated and normalized to represent the degree of convergence of the neural network. Defining Θ as the convergence rate, the convergence rate of the neural network after the i-th training step is the accumulated Q value at step i normalized by Q*,
where Q* is the accumulated Q value after the neural network has converged.
To examine the convergence performance of the proposed method, Q* must also be determined in advance. Repeated tests show that both mode one and mode two converge after 70000 training steps, so Q* is taken as the accumulation of the Q values of the optimal actions over one scheduling period at training step 70000.
The training results for mode one and mode two are shown in fig. 9:
when the convergence rate Θ reaches 0.995, the neural network is considered to converge. As can be taken from fig. 2, the pattern two converges at the step of training 35000, and the pattern one converges at the step of training 67000 or so. Therefore, the DDEQN provided by the application has better convergence performance.
(3) Comparison with random optimization algorithm
To simulate the uncertainty of photovoltaic generation output, 10000 scenarios are sampled from the Bayesian neural network and used as the photovoltaic output scenario set of the stochastic optimization model. For comparability, the stochastic optimization algorithm uses the same energy-storage charge/discharge power levels as the deep reinforcement learning algorithms. As can be seen from Table 2, the deep reinforcement learning algorithms adapt better to the uncertainty of photovoltaic output and their results are more economical than those of the stochastic optimization algorithm. In addition, compared with the conventional DDQN algorithm, the proposed DDEQN algorithm achieves a lower operating cost, mainly because of its better convergence performance.
TABLE 2 comparison of the economics of the different modes
In order to further analyze and compare the economy of the deep reinforcement learning algorithm and the random optimization algorithm, the charge and discharge strategies of the energy storage system in one scheduling period in three modes are compared with the charge state. The comparison result is shown in FIG. 10.
In Fig. 10, gears 0 to 3 represent four levels of discharging of the energy storage system, 4 means the energy storage system is idle, and 5 to 8 represent four levels of charging. As can be seen from Fig. 10, the operating strategies of the energy storage system under the DDQN and DDEQN algorithms are very similar; this is because the DDEQN algorithm of the present application builds on the DDQN algorithm and only adds modeling of the state transition to accelerate convergence, so the two algorithms converge to the same point.
The foregoing description of the embodiments has been provided to illustrate the general principles of the application; it is not meant to limit the scope of the application nor to limit the application to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the application are intended to be included within its scope.

Claims (5)

1. A deep expectation Q-learning based grid energy management method, comprising the steps of:
s1, modeling uncertainty of photovoltaic output of a predicted point based on a Bayesian neural network and obtaining probability distribution of the photovoltaic output;
the modeling of the uncertainty of the photovoltaic output of the predicted point based on the Bayesian neural network comprises the following specific processes:
s11, information of decisive factors, persistence influence factors and bursty influence factors of the predicted points is read, and data preprocessing is carried out;
s12, inputting the preprocessed predicted point decisive factor data and the persistence influence factor data into a deep full-connection layer of the Bayesian neural network, and inputting the preprocessed abrupt influence factor data into a probability layer of the Bayesian neural network for modeling;
s13, obtaining photovoltaic output probability distribution of the predicted point after multiple model training;
s2, inputting probability distribution of photovoltaic output into a power grid energy management model based on a double-depth expected Q-learning network algorithm to obtain a corresponding photovoltaic power generation output strategy;
s3, the system operates all the photovoltaic output devices according to a photovoltaic power generation output strategy;
the power grid energy management model establishment process based on the double-depth expected Q-learning network algorithm comprises the following steps:
the method comprises the following steps of T1, only considering an energy storage system as a controllable resource, taking the lowest daily operation cost as an objective function and meeting the operation constraint of a micro-grid, and establishing a power grid energy management model; the objective function with the lowest daily running cost in T1 is as follows: the daily operation cost is the sum of the electricity purchasing cost and the operation cost of the energy storage system in the dispatching period, and is expressed as:
wherein: t is the number of scheduling periods; x is x t The amount of electricity x needed to be exchanged with the main grid for period t t And > 0 represents purchasing electricity from the main power grid, and conversely selling electricity to the main power grid; c b,t Representing the price of buying electricity from a main power grid in the period t; c g,t Representing the price of selling electricity to the main power grid in the period t; τ t For the operation cost of the energy storage system in the period t, || + Taking a positive function;
the microgrid operational constraints include: power balance constraints, energy storage system operating constraints, and battery state constraints during a scheduling period
T2, modeling the power grid energy management model in the T1 as a Markov decision process;
the specific modeling process of the Markov decision process includes (an illustrative environment sketch is given after claim 1):
constructing a state space in consideration of the diversity and necessity of the system variables;
constructing an action space from the charging and discharging actions of the energy storage system and the amount of electricity bought from or sold to the main grid, so as to ensure the power balance within the system;
mapping the objective function into a reward function;
the discount factor is fixed at 0.9 in the calculation;
the state transition probability is expressed as the probability of the photovoltaic output in the next state;
T3, based on the probability distribution of the photovoltaic output and considering the stochastic process of state transition, proposing a double deep expected Q-learning network algorithm by modifying the Q-value iteration rule of the traditional model-free algorithm, and solving the Markov decision process;
and T4, setting reasonable parameters to ensure convergence of the neural network learning process, and training the neural network with the double deep expected Q-learning network algorithm to obtain the power grid energy management model based on the double deep expected Q-learning network algorithm.
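As an illustration only (not the patented implementation), the hybrid Bayesian neural network described in steps S11-S13 could be sketched in Python/PyTorch as follows: decisive and persistent influencing factors feed deep fully connected layers, abrupt influencing factors feed a probabilistic (weight-sampling) layer, and repeated stochastic forward passes yield the photovoltaic output distribution. All class, function and parameter names here are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Probabilistic layer: weights are drawn from a learned Gaussian
    (reparameterisation trick), so every forward pass is stochastic."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))

    def forward(self, x):
        w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)
        b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)

class PVUncertaintyNet(nn.Module):
    """Decisive/persistent factors -> deep fully connected layers;
    abrupt factors -> probabilistic layer; head predicts PV output."""
    def __init__(self, n_factor, n_abrupt, hidden=64):
        super().__init__()
        self.factor_branch = nn.Sequential(
            nn.Linear(n_factor, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.abrupt_branch = BayesianLinear(n_abrupt, hidden)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, 1))

    def forward(self, x_factor, x_abrupt):
        h = torch.cat([self.factor_branch(x_factor),
                       self.abrupt_branch(x_abrupt)], dim=-1)
        return self.head(h)

def pv_output_distribution(model, x_factor, x_abrupt, n_samples=200):
    """Monte Carlo forward passes give an empirical PV output distribution."""
    with torch.no_grad():
        samples = torch.stack([model(x_factor, x_abrupt) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)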
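Likewise, a minimal sketch of the Markov decision process of step T2, under assumed state and action definitions: the state combines the period, the battery state and the sampled photovoltaic output, the action is the storage charge/discharge power, the reward is the negative period cost, and the discount factor is the fixed 0.9 of the claim. The wear-cost coefficient, state bounds and capacity are assumptions, not values from the patent.

import numpy as np

GAMMA = 0.9  # discount factor fixed at 0.9, as in claim 1

class MicrogridMDP:
    """Illustrative MDP: state = (period, battery state, PV output),
    action = storage charge/discharge power, reward = negative period cost."""
    def __init__(self, buy_price, sell_price, pv_sampler, load,
                 soc_min=0.1, soc_max=0.9, capacity=100.0):
        self.buy_price, self.sell_price = buy_price, sell_price
        self.pv_sampler = pv_sampler        # t -> sampled PV output (kW)
        self.load = load                    # t -> load demand (kW)
        self.soc_min, self.soc_max, self.capacity = soc_min, soc_max, capacity

    def reset(self):
        self.t, self.soc = 0, 0.5
        self.pv = self.pv_sampler(self.t)
        return np.array([self.t, self.soc, self.pv])

    def step(self, p_batt):
        """p_batt > 0: charge, p_batt < 0: discharge (kW over one period)."""
        # battery state constraint
        self.soc = np.clip(self.soc + p_batt / self.capacity,
                           self.soc_min, self.soc_max)
        # power balance: the residual demand is exchanged with the main grid
        x_t = self.load(self.t) + p_batt - self.pv
        cost = (self.buy_price[self.t] * max(x_t, 0.0)
                - self.sell_price[self.t] * max(-x_t, 0.0)
                + 0.01 * abs(p_batt))       # assumed storage operation cost tau_t
        self.t += 1
        done = self.t >= len(self.buy_price)
        self.pv = 0.0 if done else self.pv_sampler(self.t)  # stochastic transition
        return np.array([self.t, self.soc, self.pv]), -cost, done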
2. The deep expectation Q-learning based power grid energy management method according to claim 1, wherein the specific method of step T3 is as follows:
introducing an experience replay mechanism on the basis of the reinforcement learning Q-learning algorithm, storing the reward and state update obtained from each interaction with the environment, and obtaining an approximate Q value after the neural network parameters have converged; decoupling action selection and target Q-value calculation between the estimated Q network and the target Q network;
the double deep expected Q-learning network algorithm is proposed on the basis of the double deep Q-learning network: the Bayesian neural network is combined with deep reinforcement learning, the stochastic process of state transition is represented by the Bayesian neural network, and the expected Q value over the stochastic states is used to update the Q network.
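A minimal sketch, assuming PyTorch Q-networks over a discrete action set, of the expected target described in this claim: the estimated (online) network selects the greedy action, the target network evaluates it, and the target is averaged over discretised photovoltaic scenarios for the next state. Function and argument names are hypothetical, not the patent's.

import torch

def expected_double_q_target(online_net, target_net, reward,
                             next_state_scenarios, scenario_probs, gamma=0.9):
    """Expected double-DQN target:
    y = r + gamma * sum_k p_k * Q_target(s'_k, argmax_a Q_online(s'_k, a)),
    where the scenarios s'_k discretise the photovoltaic distribution."""
    expected_q = 0.0
    for s_next, prob in zip(next_state_scenarios, scenario_probs):
        a_star = torch.argmax(online_net(s_next), dim=-1, keepdim=True)  # select
        q_eval = target_net(s_next).gather(-1, a_star).squeeze(-1)       # evaluate
        expected_q = expected_q + prob * q_eval
    return reward + gamma * expected_q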
3. The deep expectation Q-learning based power grid energy management method according to claim 2, wherein the specific process of updating the Q network with the expected Q value over the stochastic states is as follows:
firstly, selecting the energy storage system scheduling strategy with the estimated Q network;
then, updating the Q value in the target Q network;
and simplifying the model and discretizing the probability density function.
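One possible discretisation of the probability density function mentioned above, assuming the Bayesian network's forecast is summarised by a Gaussian mean and standard deviation; the bin count and truncation width are arbitrary choices for the sketch, not values from the patent.

import numpy as np
from scipy.stats import norm

def discretise_pv_distribution(mu, sigma, n_bins=7, width=3.0):
    """Split a Gaussian PV-output forecast into n_bins representative
    values (bin centres) with their probabilities (bin masses)."""
    edges = np.linspace(mu - width * sigma, mu + width * sigma, n_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    probs = norm.cdf(edges[1:], mu, sigma) - norm.cdf(edges[:-1], mu, sigma)
    probs = probs / probs.sum()   # renormalise the truncated tails
    return centres, probs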
4. The deep expectation Q-learning based power grid energy management method according to claim 1, wherein, when reasonable parameters are set in T4 to ensure convergence of the neural network learning process, the experience replay pool, the exploration rate and the learning rate need to be considered.
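An illustrative parameter set of the kind this claim refers to; every numeric value below is an assumption for the sketch, not a value disclosed in the patent (only the 0.9 discount factor comes from claim 1).

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    replay_pool_size: int = 10_000    # capacity of the experience replay pool
    batch_size: int = 64              # transitions sampled per update
    epsilon_start: float = 1.0        # initial exploration rate (epsilon-greedy)
    epsilon_end: float = 0.05         # floor of the exploration rate
    epsilon_decay: float = 0.995      # multiplicative decay per episode
    learning_rate: float = 1e-3       # optimiser step size
    gamma: float = 0.9                # discount factor, as fixed in claim 1
    target_update_period: int = 200   # steps between target-network syncs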
5. A deep expectation Q-learning based power grid energy management system for implementing the deep expectation Q-learning based power grid energy management method of any one of claims 1-4, comprising:
a probability distribution acquisition device, which models the uncertainty of the photovoltaic output at the prediction point based on the Bayesian neural network and acquires the probability distribution of the photovoltaic output;
a first modeling device, which considers only the energy storage system as a controllable resource, takes the lowest daily operation cost as the objective function while satisfying the micro-grid operation constraints, and establishes the power grid energy management model;
a second modeling device, which models the power grid energy management model as a Markov decision process;
a solving device, which considers the stochastic process of state transition, proposes the double deep expected Q-learning network algorithm by modifying the Q-value iteration rule of the traditional model-free algorithm, and solves the Markov decision process;
a model training device, which sets reasonable parameters to ensure convergence of the neural network learning process and trains the neural network with the double deep expected Q-learning network algorithm to obtain the power grid energy management model based on the double deep expected Q-learning network algorithm;
wherein the power grid energy management system controls all photovoltaic output devices according to the photovoltaic power generation output strategy obtained from the power grid energy management model based on the double deep expected Q-learning network algorithm (an illustrative composition sketch follows this claim).
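Purely as an illustration of the decomposition in claim 5 (all class and method names below are hypothetical and not part of the patent), the five devices could be composed as follows.

class GridEnergyManagementSystem:
    """Hypothetical wiring of the five devices recited in claim 5."""
    def __init__(self, distribution_device, first_modeling_device,
                 second_modeling_device, solving_device, training_device,
                 pv_devices):
        self.distribution_device = distribution_device        # Bayesian NN PV distribution
        self.first_modeling_device = first_modeling_device    # cost objective + constraints
        self.second_modeling_device = second_modeling_device  # Markov decision process
        self.solving_device = solving_device                  # double deep expected Q-learning
        self.training_device = training_device                # parameter setting + training
        self.pv_devices = pv_devices                          # controlled photovoltaic devices

    def run(self, measurements):
        pv_distribution = self.distribution_device.predict(measurements)
        cost_model = self.first_modeling_device.build()
        mdp = self.second_modeling_device.to_mdp(cost_model, pv_distribution)
        policy = self.training_device.train(self.solving_device, mdp)
        strategy = policy.dispatch(pv_distribution)
        for device in self.pv_devices:
            device.apply(strategy)     # operate every PV output device per the strategy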
CN202011418334.2A 2020-12-07 2020-12-07 Power grid energy management method and system based on deep expectation Q-learning Active CN112614009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011418334.2A CN112614009B (en) 2020-12-07 2020-12-07 Power grid energy management method and system based on deep expectation Q-learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011418334.2A CN112614009B (en) 2020-12-07 2020-12-07 Power grid energy management method and system based on deep expectation Q-learning

Publications (2)

Publication Number Publication Date
CN112614009A CN112614009A (en) 2021-04-06
CN112614009B true CN112614009B (en) 2023-08-25

Family

ID=75229451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011418334.2A Active CN112614009B (en) 2020-12-07 2020-12-07 Power grid energy management method and system based on deep expectation Q-learning

Country Status (1)

Country Link
CN (1) CN112614009B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113110052B (en) * 2021-04-15 2022-07-26 浙大宁波理工学院 Hybrid energy management method based on neural network and reinforcement learning
CN113139682B (en) * 2021-04-15 2023-10-10 北京工业大学 Micro-grid energy management method based on deep reinforcement learning
CN113098007B (en) * 2021-04-25 2022-04-08 山东大学 Distributed online micro-grid scheduling method and system based on layered reinforcement learning
CN113141017B (en) * 2021-04-29 2022-08-09 福州大学 Control method for energy storage system to participate in primary frequency modulation of power grid based on DDPG algorithm and SOC recovery
CN113572157B (en) * 2021-07-27 2023-08-29 东南大学 User real-time autonomous energy management optimization method based on near-end policy optimization
CN113885330B (en) * 2021-10-26 2022-06-17 哈尔滨工业大学 Information physical system safety control method based on deep reinforcement learning
CN114280491B (en) * 2021-12-23 2024-01-05 中山大学 Retired battery residual capacity estimation method based on active learning
CN114172840B (en) * 2022-01-17 2022-09-30 河海大学 Multi-microgrid system energy routing method based on graph theory and deep reinforcement learning
CN114938372B (en) * 2022-05-20 2023-04-18 天津大学 Federal learning-based micro-grid group request dynamic migration scheduling method and device
CN115334165B (en) * 2022-07-11 2023-10-17 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN116388279B (en) * 2023-05-23 2024-01-23 安徽中超光电科技有限公司 Grid-connected control method and control system for solar photovoltaic power generation system
CN117132089B (en) * 2023-10-27 2024-03-08 邯郸欣和电力建设有限公司 Power utilization strategy optimization scheduling method and device
CN117216720B (en) * 2023-11-07 2024-02-23 天津市普迅电力信息技术有限公司 Multi-system data fusion method for distributed photovoltaic active power
CN117613983B (en) * 2024-01-23 2024-04-16 国网冀北电力有限公司 Energy storage charge and discharge control decision method and device based on fusion rule reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107067190A (en) * 2017-05-18 2017-08-18 厦门大学 The micro-capacitance sensor power trade method learnt based on deeply
CN108932671A (en) * 2018-06-06 2018-12-04 上海电力学院 A kind of LSTM wind-powered electricity generation load forecasting method joined using depth Q neural network tune
CN109063841A (en) * 2018-08-27 2018-12-21 北京航空航天大学 A kind of failure mechanism intelligent analysis method based on Bayesian network and deep learning algorithm
CN109581282A (en) * 2018-11-06 2019-04-05 宁波大学 Indoor orientation method based on the semi-supervised deep learning of Bayes
CN110930016A (en) * 2019-11-19 2020-03-27 三峡大学 Cascade reservoir random optimization scheduling method based on deep Q learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on micro-grid energy storage scheduling strategy based on deep reinforcement learning; Wang Yadong et al.; Renewable Energy Resources (《可再生能源》); 2019-08-31; Vol. 37, No. 8; pp. 1220-1227 *

Also Published As

Publication number Publication date
CN112614009A (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN112614009B (en) Power grid energy management method and system based on deep expectation Q-learning
Atef et al. Assessment of stacked unidirectional and bidirectional long short-term memory networks for electricity load forecasting
Tan et al. Multi-objective energy management of multiple microgrids under random electric vehicle charging
CN112186743B (en) Dynamic power system economic dispatching method based on deep reinforcement learning
CN113572157B (en) User real-time autonomous energy management optimization method based on near-end policy optimization
Cai et al. Wind speed forecasting based on extreme gradient boosting
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN112217195B (en) Cloud energy storage charging and discharging strategy forming method based on GRU multi-step prediction technology
Huang et al. A control strategy based on deep reinforcement learning under the combined wind-solar storage system
CN116187601B (en) Comprehensive energy system operation optimization method based on load prediction
CN113887141A (en) Micro-grid group operation strategy evolution method based on federal learning
CN114156951B (en) Control optimization method and device of source network load storage system
Fan et al. Multi-objective LSTM ensemble model for household short-term load forecasting
CN115374692A (en) Double-layer optimization scheduling decision method for regional comprehensive energy system
CN114723230A (en) Micro-grid double-layer scheduling method and system for new energy power generation and energy storage
Bartels et al. Influence of hydrogen on grid investments for smart microgrids
Yu et al. Short-term cooling and heating loads forecasting of building district energy system based on data-driven models
CN111313449B (en) Cluster electric vehicle power optimization management method based on machine learning
Fu et al. Predictive control of power demand peak regulation based on deep reinforcement learning
CN115115145B (en) Demand response scheduling method and system for distributed photovoltaic intelligent residence
CN115511218A (en) Intermittent type electrical appliance load prediction method based on multi-task learning and deep learning
CN115169839A (en) Heating load scheduling method based on data-physics-knowledge combined drive
CN115705608A (en) Virtual power plant load sensing method and device
Zandi et al. An automatic learning framework for smart residential communities
CN113326994A (en) Virtual power plant energy collaborative optimization method considering source load storage interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant