CN112054561B

CN112054561B - Wind power-pumped storage combined system daily random dynamic scheduling method based on SARSA (lambda) algorithm

Info

Publication number: CN112054561B
Application number: CN202010973224.6A
Authority: CN
Inventors: 李文武; 郑凯新; 刘江鹏; 石强; 余跃; 赵迪
Original assignee: China Three Gorges University CTGU
Current assignee: China Three Gorges University CTGU
Priority date: 2020-09-16
Filing date: 2020-09-16
Publication date: 2022-06-14
Anticipated expiration: 2040-09-16
Also published as: CN112054561A

Abstract

The invention provides a daily random dynamic scheduling method for a wind power-pumped storage combined system based on the SARSA(λ) algorithm, comprising the following steps: first, the randomness of the wind power output is taken into account, and the probability distribution of the wind power output is represented by a Beta distribution; second, a daily random dynamic scheduling model of the wind power-pumped storage combined system is established that takes the time-of-use electricity price into account; finally, the multi-step temporal-difference SARSA(λ) algorithm from reinforcement learning is introduced to solve the model, learning from historical scenario data and accumulating experience through continual trial and error. The method offers a new approach to the multi-stage decision problem of joint wind-storage optimal scheduling under randomness, and improves solution efficiency while achieving the optimal scheduling objective.

Description

Wind power-pumped storage combined system daily random dynamic scheduling method based on SARSA (lambda) algorithm

Technical Field

The invention relates to water resource recycling and the collection and distribution of natural water within water-saving activities in the energy-saving and environmental protection industry, and addresses these problems with a reinforcement learning method from the field of big data. In particular, it relates to a daily random dynamic scheduling method for a wind power-pumped storage combined system based on the reinforcement learning SARSA(λ) algorithm.

Background

As the energy industry moves toward high-quality development, wind power generation is now widely used worldwide. At the same time, the randomness and volatility of wind power pose challenges for power grid operation and scheduling. How to control the power characteristics of wind power at grid connection, so that large-scale wind power can be absorbed efficiently, has become an urgent problem.

With the development of energy storage technology, the pumped storage power station, a mature and widely used energy storage device, offers flexible response and rapid start and stop. Equipping a power system with a pumped storage power station allows not only peak shaving and valley filling, but also dynamic services such as spinning reserve, load following, phase modulation, and frequency control. This improves the static and dynamic stability of the system, brings considerable benefits, and helps ensure the safe and stable operation of the power grid. Jointly optimizing the operation of wind power and a pumped storage power station can effectively improve the operating benefit of wind power, reduce restrictions on wind power grid connection, and yield considerable economic benefit.

In existing methods, the scheduling of the wind power-pumped storage combined system relies on the traditional random dynamic programming algorithm, which suffers from poor scheduling results and low efficiency.

Disclosure of Invention

The invention provides a wind power-pumped storage combined system daily random dynamic scheduling method based on an SARSA (lambda) algorithm, which is used for solving or at least partially solving the technical problems of poor scheduling effect and low efficiency in the prior art.

In order to solve the technical problem, the invention provides a wind power-pumped storage combined system daily random dynamic scheduling method based on an SARSA (lambda) algorithm, which comprises the following steps:

S1: describing the randomness of the wind power output;

S2: constructing, according to the randomness of the wind power output and the time-of-use electricity price, a daily random dynamic scheduling model of the wind power-pumped storage combined system:

in the formula: T is the number of time periods in a scheduling cycle; R_t is the index function for period t; V_t is the storage capacity of the upper reservoir of the pumped storage power station at the beginning of period t; P_t^gd is the output of the pumped storage power station in period t, where a value less than 0 indicates pumping and a value greater than 0 indicates power generation; the expressions of R_t and P_t^gd are as follows:

in the formula: C_t is the peak-valley electricity price corresponding to period t; after the wind power prediction error distribution function curve for period t is discretized into N values, each value has a corresponding power and a corresponding probability p_t,i; G_h is the start-stop cost of a single unit of the pumped storage power station, and n_t is the number of units started or stopped in period t; the generation-state indicator is 1 when the pumped storage units are in the generation state in period t and 0 otherwise; P_t^g is the generation output of the units in period t; the pumping-state indicator is 1 when the pumped storage units are in the motor state in period t and 0 otherwise; P_t^d is the pumping power of the units in period t;

S3: determining the constraint conditions of the daily random dynamic scheduling model of the wind power-pumped storage combined system;

S4: solving the daily random dynamic scheduling model of the wind power-pumped storage combined system using the SARSA(λ) algorithm from reinforcement learning to obtain the scheduling result.
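The formulas for the S2 model are not reproduced in the text above. One form that is consistent with the symbol definitions given there is sketched below. It is an assumption rather than the formula as filed: the indicator symbols u_t^g and u_t^d and the discretized wind power value P_{t,i}^w are hypothetical names introduced here for illustration.

```latex
\max\; F = \mathbb{E}\left[\sum_{t=1}^{T} R_t\left(V_t, P_t^{gd}\right)\right]

R_t = C_t\left(\sum_{i=1}^{N} p_{t,i}\, P_{t,i}^{w} + P_t^{gd}\right) - G_h\, n_t,
\qquad
P_t^{gd} = u_t^{g} P_t^{g} - u_t^{d} P_t^{d}
```

Under this reading, revenue in each period is the time-of-use price applied to the expected wind output plus the net pumped storage output, less the start-stop cost, and the sign of P_t^gd follows the stated convention (negative when pumping, positive when generating).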

In one embodiment, S1 specifically includes:

the probability density function of the wind power prediction error is expressed by a Beta distribution, whose expression is as follows:

in the formula: x is the wind power output prediction error; a and b are the shape parameters of the Beta distribution; by changing the values of a and b, Beta distributions of different shapes can be obtained, which accommodates the positive or negative skew that may occur in the wind power output prediction error; B(a, b) is expressed as:

historical data of the wind farm are collected and sorted to obtain the frequency distribution of the wind farm's prediction errors, and the shape parameters a and b of the Beta distribution are calculated from the mean and variance of the prediction errors, the calculation equations being as follows:

in the formula: μ is the mean of the prediction errors; σ is the standard deviation of the prediction errors.
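The calculation of a and b from μ and σ can be illustrated with the standard method-of-moments estimate for a Beta distribution, a = μ(μ(1-μ)/σ² - 1) and b = (1-μ)(μ(1-μ)/σ² - 1). The sketch below assumes this is the calculation intended and that the prediction errors have been normalized to the interval (0, 1); the function names are hypothetical.

```python
import math


def beta_shape_from_moments(mu: float, sigma: float) -> tuple[float, float]:
    """Method-of-moments estimate of the Beta shape parameters (a, b).

    Assumes the prediction errors were normalized to (0, 1) beforehand.
    """
    if not 0.0 < mu < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    var = sigma ** 2
    if not 0.0 < var < mu * (1.0 - mu):
        raise ValueError("variance must be positive and below mu * (1 - mu)")
    common = mu * (1.0 - mu) / var - 1.0
    return mu * common, (1.0 - mu) * common


def beta_pdf(x: float, a: float, b: float) -> float:
    """Standard Beta density x^(a-1) (1-x)^(b-1) / B(a, b) for 0 < x < 1."""
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_b)
```

For example, a mean of 0.4 and a standard deviation of 0.2 give a = 2 and b = 3, a distribution skewed toward smaller errors.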

In one embodiment, the constraints in S3 include:

(1) reservoir capacity constraint:

V_min≤V_t≤V_max

in the formula: V_min and V_max are respectively the minimum and maximum available storage capacity of the upper reservoir of the pumped storage power station, and V_t is the actual available storage capacity of the upper reservoir in period t;

(2) constraint on the change in reservoir capacity between the first and last periods of each day:

V₂₄-V₀＝0

wherein the pumped storage reservoir is regulated on a daily basis, so the reservoir capacities at the start and end of the day are equal; V₂₄ and V₀ are the storage capacities of the reservoir at hour 24 and hour 0, respectively;

(3) power generation and pumping power constraints:

in the formula: the two bounds are respectively the upper and lower limits of the generation output of the pumped storage units, and P_t^g is the actual generation output of the pumped storage units in period t;

the pumping power constraint is as follows:

P_t^d＝P^d·k_t

in the formula: P_t^d is the actual pumping power of the pumped storage power station in period t, P^d is the rated pumping power of a single unit of the pumped storage power station, and k_t is the total number of pumping units operating in period t;

(4) pumping/generation mutual exclusion constraint:

the constraint indicates that the pumped storage units cannot be in the generation state and the pumping state at the same time within the same period; the first indicator marks whether the units are in the generation state in period t, and the second marks whether the units are in the motor (pumping) state in period t;

(5) constraint on the total number of units:

in the formula: the constrained quantity is the total number of units in the working state in period t, and N is the total number of available units of the pumped storage power station.
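The five constraints can be read as a feasibility check on a candidate schedule. The sketch below is a minimal illustration under stated assumptions: the class and function names are hypothetical, the generation limits are applied to the station output rather than per unit, and a schedule is treated as generating when at least one generating unit is running.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StationLimits:
    v_min: float         # minimum available upper-reservoir storage (V_min)
    v_max: float         # maximum available upper-reservoir storage (V_max)
    p_gen_min: float     # lower limit of generation output (assumed station-wide)
    p_gen_max: float     # upper limit of generation output (assumed station-wide)
    p_pump_rated: float  # rated pumping power of one unit (P^d)
    n_units: int         # total number of available units (N)


def pumping_power(limits: StationLimits, k_pump: int) -> float:
    """Constraint (3), pumping part: P_t^d = P^d * k_t."""
    return limits.p_pump_rated * k_pump


def period_is_feasible(limits: StationLimits, v_t: float, p_gen: float,
                       n_gen: int, k_pump: int) -> bool:
    """Check constraints (1), (3), (4) and (5) for a single period."""
    generating = n_gen > 0
    pumping = k_pump > 0
    if not limits.v_min <= v_t <= limits.v_max:      # (1) storage bounds
        return False
    if generating and pumping:                       # (4) mutual exclusion
        return False
    if generating and not limits.p_gen_min <= p_gen <= limits.p_gen_max:
        return False                                 # (3) generation limits
    if not generating and p_gen != 0:
        return False                                 # no output without units
    return 0 <= n_gen + k_pump <= limits.n_units     # (5) unit count


def day_is_balanced(v_start: float, v_end: float, tol: float = 1e-6) -> bool:
    """Constraint (2): storage at hour 24 equals storage at hour 0."""
    return abs(v_end - v_start) <= tol
```

A check of this kind could be used to mask infeasible actions before the learning algorithm selects among them, although the patent text does not say how infeasible actions are handled.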

In one embodiment, the SARSA(λ) algorithm in S4 builds on the model-free SARSA algorithm by introducing an E matrix, which records the path traveled in each round and its attenuation, together with a step attenuation coefficient λ, thereby changing the single-step update of the SARSA algorithm into the round update of the SARSA(λ) algorithm.

In one embodiment, after each step in a round, that is, after an action is selected in the current state, the value at the (S, A) position of the utility trace function E(S, A) is incremented by 1; after each action, the value function Q(S, A) and the utility trace function E(S, A) are updated, where Q(S, A) is updated as follows:

Q(S,A)＝Q(S,A)+αδE(S,A)

wherein α is the learning rate, which controls the convergence of learning; γ is the attenuation factor, which reduces the influence of future returns on the current strategy; the TD error δ＝R+γQ(S′,A′)-Q(S,A) represents the error between the ideal value and the actual value of Q(S, A);

E(S, A) is updated as follows:

E(S,A)＝γλE(S,A).

In one embodiment, an ε-greedy strategy is adopted in the iterative process of the SARSA(λ) algorithm to select the action in each state, specifically: a random number between 0 and 1 is generated and compared with the exploration probability ε; if the number is smaller than ε, the system selects an action at random, each action being equally likely; if the number is not smaller than ε, the system selects the known optimal action in the current state, as shown in the following formula:

in the formula, A_i is the known optimal action in state S.
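The selection rule described above can be sketched in a few lines. The function name and the choice to break ties by taking the first maximal entry are assumptions made for illustration; the text does not specify a tie-breaking rule.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of the Q-value table.

    q_row   -- list of Q values for the current state, one per action
    epsilon -- exploration probability
    rng     -- source of randomness (the random module or a random.Random)
    """
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return rng.randrange(len(q_row))
    # Exploit: the known best action (first one if several are tied).
    return q_row.index(max(q_row))
```

With ε set to 0 the rule is purely greedy, which matches the behavior described for the online learning stage.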

In one embodiment, S4 specifically includes:

S4.1: processing the historical electricity price data and the historical wind power output data, then feeding the processed data into the solution of the daily random dynamic scheduling model of the wind power-pumped storage combined system for pre-learning; in the pre-learning stage, experience is accumulated through continual exploration to update the element values of the Q-value table and the utility trace function E;

S4.2: performing online learning starting from the Q-value table and the element values of the utility trace function E obtained in the pre-learning stage, and selecting the action with the maximum Q value in the current state according to the greedy strategy.

In one embodiment, S4.1 specifically includes:

S4.1.1: initializing the Q-value table, the utility trace function E, the number of iterations, and the learning rate, wherein every element of the Q-value table is 0 in the initial stage of the SARSA(λ) algorithm, and the utility trace function E is 0 for all state-action pairs;

S4.1.2: taking the upper reservoir storage capacity corresponding to the current period as the first state of the state sequence, and determining the wind power output of the current period from the wind power forecast for the current period and the Beta-distributed probability density function of the wind power prediction error;

S4.1.3: determining the state S corresponding to the upper reservoir storage capacity of the pumped storage power station in the current period, selecting a pumping/generation action through the ε-greedy strategy according to the electricity price of the current period and the electricity price trend over the periods, and determining the pumping/generation flow of the upper reservoir according to the action;

S4.1.4: obtaining, from the state transition equation, the new state S′ corresponding to the upper reservoir storage capacity in the next period after the action is taken, together with the reward R obtained by taking the action;

S4.1.5: determining the wind power output of the next period;

S4.1.6: selecting a new pumping/generation action through the ε-greedy strategy according to the electricity price of the next period and the electricity price trend over the periods;

S4.1.7: updating the utility trace function E(S, A) and the TD error δ according to the update rule of the SARSA(λ) algorithm;

S4.1.8: updating the value function Q for the upper reservoir storage capacity states S and pumping/generation flows A of all periods so far, and attenuating the utility trace function E updated in S4.1.7;

S4.1.9: judging whether the algorithm has reached the specified number of iterations; if not, setting the period t = t + 1 and returning to S4.1.2 to continue the iteration.
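Steps S4.1.3 and S4.1.4 rely on a state transition equation and a reward that the text does not spell out. The sketch below shows one possible discretized form, purely as an illustration: the upper reservoir storage is an integer level, an action moves the storage by one level, pumping and generation efficiencies and the start-stop cost are ignored, and an infeasible action is treated as idling. All names and numbers are hypothetical.

```python
from dataclasses import dataclass

PUMP, IDLE, GENERATE = -1, 0, 1  # sign matches P_t^gd: negative means pumping


@dataclass(frozen=True)
class ToyStation:
    v_min: int = 0                  # lowest storage level index
    v_max: int = 4                  # highest storage level index
    power_per_level: float = 50.0   # power exchanged per level moved (hypothetical)


def step(station: ToyStation, level: int, action: int,
         price: float, expected_wind: float):
    """Return (next_level, reward) for one scheduling period.

    Pumping raises the storage by one level, generating lowers it by one.
    The reward is the time-of-use price applied to the expected wind output
    plus the net pumped storage output.
    """
    next_level = level - action
    if not station.v_min <= next_level <= station.v_max:
        # The action would violate the storage bounds, so the station idles.
        next_level, action = level, IDLE
    p_gd = action * station.power_per_level
    reward = price * (expected_wind + p_gd)
    return next_level, reward
```

In this toy form, pumping in a low-price period costs little revenue and raises the storage level, which later allows generation in a high-price period; that is the trade-off the learning algorithm is meant to discover.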

One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:

The invention provides a daily random dynamic scheduling method for a wind power-pumped storage combined system based on the SARSA(λ) algorithm. The method first considers the randomness of the wind power output and establishes a daily random dynamic scheduling model of the combined system that takes the time-of-use electricity price into account; it then determines the constraint conditions of the model; finally, it introduces the SARSA(λ) algorithm from reinforcement learning to solve the model. The method offers a new approach to joint wind-storage optimal scheduling under randomness. Analyzing the randomness of the wind power output when studying the optimal scheduling of the combined system can improve prediction accuracy; establishing the scheduling model and its constraint conditions with the time-of-use electricity price taken into account can improve the scheduling result; and solving the model with the SARSA(λ) algorithm can improve the efficiency of solution and scheduling.

Furthermore, the method makes full use of the local historical data of the wind farm to analyze the randomness of the wind power output, and also uses the historical data to pre-train the system's optimization decision capability so that experience is accumulated gradually.

Furthermore, the algorithm is applied for the first time to the joint wind-storage optimal scheduling problem with randomness. Because the Q-value table is updated through pre-learning and the algorithm uses the round update mode, the solution time is greatly shortened and the curse of dimensionality to which traditional algorithms are prone is avoided; at the same time, the optimal strategy can be learned through continual interaction with the environment, yielding a high-quality solution.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is an overall structural diagram of a scheduling method provided by the present invention;

FIG. 2 is a diagram of the Markov decision process of the present invention;

FIG. 3 is a schematic diagram of reinforcement learning according to the present invention;

FIG. 4 is a flow chart of the SARSA (λ) algorithm of the present invention;

FIG. 5 is a flowchart illustrating an application of the SARSA (λ) algorithm in daily random dynamic scheduling of a wind storage combined system according to the present invention.

Detailed Description

Through extensive research and practice, the inventors of the present application have found the following. As theoretical research on wind power-pumped storage optimal scheduling gradually improves and matures, the factors considered in the problem become increasingly complex, placing higher demands on the research method. Because wind power output forecasts contain a certain error, the randomness of the wind power output must first be analyzed when studying the optimal scheduling of the combined system, so as to improve prediction accuracy. The random optimal scheduling problem of the wind power-pumped storage combined system is therefore a high-dimensional, multi-stage, nonlinear optimization problem with complex constraints and many factors to be considered together. Commonly used algorithms for random optimal scheduling problems include the particle swarm algorithm and the random dynamic programming algorithm. However, as the system scale grows, these algorithms show certain limitations: the particle swarm algorithm easily falls into a local optimum during the search, must escape local optima during the calculation, and has difficulty finding the theoretical global optimum; the random dynamic programming algorithm can find the theoretical optimum, but it is prone to the curse of dimensionality, so the solution time becomes too long for practical application.

Reinforcement learning (RL) is an important branch of machine learning (ML). It generally comprises two parts, an agent and an environment. The agent is not told which action to take in the environment; it must discover through its own attempts which actions yield the greatest reward. An action often affects not only the immediate reward but also the next state, and thereby subsequent rewards. Trial and error and delayed reward are the two most important and distinctive features of reinforcement learning.

Building on existing research, the invention applies the SARSA(λ) algorithm from reinforcement learning, for the first time, to solving the daily random dynamic scheduling model of the wind power-pumped storage combined system. It considers the randomness of the wind power output and the Markov decision process; after experience has been accumulated through continual trial and error in the pre-learning stage, essentially every element of the Q-value table has been updated, and the system is then put into online learning, so that the model obtains a high-quality solution while effectively avoiding the curse of dimensionality. The algorithm offers a new approach to multi-energy complementarity problems with randomness.

In order to achieve the technical effects, the main concept of the invention is as follows:

A new method is provided for solving the daily random dynamic scheduling problem of the wind-storage combined system, which improves the scheduling result while effectively reducing the solution time and improving scheduling efficiency. First, reinforcement learning learns the optimal strategy through continual interaction with the environment, which effectively addresses the curse of dimensionality to which the traditional random dynamic programming algorithm is prone. Second, on the basis of the model-free SARSA algorithm, an E matrix recording the path of each round and its attenuation is introduced together with a step attenuation coefficient λ, turning the single-step update of the SARSA algorithm into the round update of the SARSA(λ) algorithm; the SARSA(λ) algorithm is also an online learning method, and this improvement allows it to find a high-quality solution more quickly and reduces the solution time. The invention studies the daily random dynamic scheduling problem of the wind-storage combined system: within the scheduling cycle, the pumped storage units are fully used to store cheap wind power through the pumped storage power station and to deliver it at a high price during the peak period, so that the economic benefit of the system is maximized while all constraint conditions are satisfied. The research focuses on the following aspects:

1) Describing the randomness of the wind power output.

2) Establishing a daily random dynamic scheduling model of the wind power-pumped storage combined system that takes the time-of-use electricity price into account, and determining the objective function and constraint conditions with the goal of maximizing the economic benefit of the combined system over one day.

3) Applying reinforcement learning theory to solve the daily random dynamic scheduling problem of the wind-storage combined system, and determining the recursion equation and the state transition equation.

4) Applying the SARSA(λ) algorithm to the model solution and determining the solution procedure.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

The embodiment of the invention provides a wind power-pumped storage combined system daily random dynamic scheduling method based on an SARSA (lambda) algorithm, which comprises the following steps:

S1: describing the randomness of the wind power output;

S2: constructing, according to the randomness of the wind power output and the time-of-use electricity price, a daily random dynamic scheduling model of the wind power-pumped storage combined system;

S3: determining the constraint conditions of the daily random dynamic scheduling model;

S4: solving the daily random dynamic scheduling model using the SARSA(λ) algorithm from reinforcement learning to obtain the scheduling result;

wherein the model and its symbols T, R_t, V_t, P_t^gd, P_t^g and P_t^d are as defined in the Disclosure of Invention.

Specifically, the method provided by the invention first describes the randomness of the wind power output; then establishes a daily random dynamic scheduling model of the wind power-pumped storage combined system that takes the time-of-use electricity price into account; then determines the constraint conditions of the model; and finally introduces the SARSA(λ) algorithm to solve the model, which includes determining the state transition equation and the recursion equation of the model and the solution procedure of the SARSA(λ) algorithm. In a specific implementation, the method comprises two processes, pre-learning and online learning: the historical electricity price data and historical wind power output data are first processed and fed into the model solution for pre-learning, and after pre-learning is finished the system is put into online learning. The overall structure of the method is shown in FIG. 1.

On the one hand, the method provided by the invention considers the randomness of the wind power output, which safeguards the accuracy of the wind power output forecast and makes the model's results more realistic.

On the other hand, the time-of-use electricity price is considered. The peak electricity consumption periods are usually at noon and in the evening, when the wind power output is small and the pumped storage power station releases water to generate electricity; in the late-night valley period, the wind power output is large and the surplus wind power is stored by pumping water into the reservoir. This maximizes the daily revenue of the combined system, achieves peak shaving and valley filling, and effectively eases the wind power absorption problem.

In one embodiment, S1 specifically includes:

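the probability density function of the wind power prediction error is expressed by a Beta distribution, and its shape parameters a and b are calculated from the mean μ and the standard deviation σ of the prediction errors obtained from the historical data of the wind farm, as described in the Disclosure of Invention.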

Specifically, the randomness of the wind power output is determined by the time-varying nature of the wind speed. For the intraday forecast of a large-scale wind farm, the probability density function of the wind power prediction error is represented by a Beta distribution.

After the Beta-distributed probability density function of the wind power output prediction error is obtained, it is integrated to obtain the corresponding prediction error distribution function. The distribution function is then discretized into N values, where the abscissa of each value is the prediction error and the ordinate is the corresponding probability; combined with the wind power output forecast for period t, this gives the expected wind power output in period t. These values are used later to obtain the expected wind power output, namely the probabilities p_t,i and the corresponding powers in the model, and P_t^w* in the reward formula R_t.

Combining the wind power output forecast for each period of the next day, the wind power output in period t is obtained:

in the formula: the wind power output forecast for period t, P_t^w, is a deterministic value; the prediction error of the wind power output in period t is a random variable; therefore, the wind power output in period t is also a random variable.

The wind power outputs of all periods form a wind power output sequence, and the expected wind power output of each period reflects the randomness of the wind power output.
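The discretization and expectation described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's procedure: the Beta variable x on (0, 1) is mapped linearly to a signed error expressed as a fraction of installed capacity (the bounds err_min and err_max are hypothetical), bins are of equal width, and each bin is represented by its midpoint.

```python
import math


def beta_pdf(x: float, a: float, b: float) -> float:
    """Standard Beta density on 0 < x < 1."""
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_b)


def discretize_beta(a: float, b: float, n_values: int, sub_steps: int = 200):
    """Split (0, 1) into n_values equal bins; return (midpoints, probabilities)."""
    width = 1.0 / n_values
    h = width / sub_steps
    mids, probs = [], []
    for i in range(n_values):
        lo = i * width
        # Midpoint-rule integral of the density over the bin.
        mass = sum(beta_pdf(lo + (j + 0.5) * h, a, b) for j in range(sub_steps)) * h
        mids.append(lo + 0.5 * width)
        probs.append(mass)
    total = sum(probs)
    return mids, [p / total for p in probs]  # renormalize so the masses sum to 1


def expected_wind_output(forecast: float, capacity: float, a: float, b: float,
                         n_values: int, err_min: float = -0.5,
                         err_max: float = 0.5) -> float:
    """Expected output = sum over i of p_i * (forecast + capacity * error_i)."""
    mids, probs = discretize_beta(a, b, n_values)
    outputs = [forecast + capacity * (err_min + x * (err_max - err_min)) for x in mids]
    return sum(p * w for p, w in zip(probs, outputs))
```

With a symmetric distribution (a = b) and symmetric error bounds, the expected error is zero and the expected output equals the forecast; a skewed distribution shifts the expectation above or below it.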

In one embodiment, the constraints in S3 include the reservoir capacity constraint, the constraint on the change in reservoir capacity between the first and last periods of each day, the power generation and pumping power constraints, the pumping/generation mutual exclusion constraint, and the constraint on the total number of units, each as defined in the Disclosure of Invention.

In one embodiment, the SARSA(λ) algorithm in S4 builds on the model-free SARSA algorithm by introducing an E matrix, which records the path traveled in each round and its attenuation, together with a step attenuation coefficient λ, thereby changing the single-step update of the SARSA algorithm into the round update of the SARSA(λ) algorithm.

Specifically, a round refers to the sequence of steps in the reinforcement learning algorithm from the first step until the final reward is obtained.

E matrix: SARSA(λ) is a round-update algorithm, and the matrix E is introduced to record each step on the path of a round. Each time an action is selected in a state, 1 is added to the corresponding (state, action) position of the E matrix; the larger the value of an element of the E matrix, the more often that step has been taken and the more important it is within the whole round. The matrix E is the utility trace function E.

E(S, A) is updated as follows:

E(S,A)＝γλE(S,A).

Specifically, reinforcement learning is a method of learning the optimal strategy through continual interaction with the environment. Besides the agent and the environment, a reinforcement learning system has four core elements: the policy, the reward signal, the value function, and (optionally) a model of the environment. The policy defines how the learning agent behaves at a given time; it is a mapping from environment states to actions. The reward signal defines the goal of the reinforcement learning problem and is the main basis for changing the policy; it indicates which actions are good in the short term. The value function indicates what is good in the long term: the value of a state is the total reward the agent can expect to accumulate in the future starting from that state. Whereas the reward determines the immediate, intrinsic desirability of an environment state, the value represents the long-term expectation over all states likely to follow. If a model of the environment is built, it is used for planning, that is, considering the situations that may arise in the future before they are actually experienced in order to decide in advance which action to take. Methods that solve reinforcement learning problems using an environment model and planning are called model-based methods. The simpler model-free methods learn directly by trial and error, selecting the best action from accumulated experience. The reinforcement learning process is illustrated in FIG. 3.

The SARSA(λ) algorithm is a model-free, multi-step temporal-difference online control algorithm and an improved version of the traditional SARSA algorithm. The core elements of the SARSA algorithm are the current state S, the current action A, the reward R obtained after taking the action, the next state S′ entered after taking the action, and the action A′ taken in the next state. The difference between the two is that the SARSA algorithm uses a single-step update whereas the SARSA(λ) algorithm uses a round update; to realize the round update, the SARSA(λ) algorithm introduces the utility trace function E(S, A) and the step attenuation coefficient λ. The utility trace function E(S, A) records the path taken in each round and its attenuation.

Referring to fig. 2 and fig. 3, fig. 2 is a diagram of a markov decision process, and fig. 3 is a diagram of reinforcement learning according to the present invention.

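In one embodiment, an ε-greedy strategy is adopted in the iterative process of the SARSA(λ) algorithm to select the action in each state: a random number between 0 and 1 is generated and compared with the exploration probability ε; if the number is smaller than ε, an action is selected at random with equal probability, and otherwise the known optimal action A_i in state S is selected.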

Specifically, the flow of the SARSA (λ) algorithm is summarized as follows:

Input:

iteration times T, a state set S, an action set A, a learning rate alpha, an attenuation factor gamma, an exploration probability epsilon and a pace attenuation coefficient lambda.

Start

And (4) randomly initializing values Q corresponding to all states and actions, wherein each element in the Q value table in the initial stage is 0.

For i from 1to T:

a) Initialize the utility trace function E for all state-action pairs to 0, and initialize S as the first state of the current state sequence. Set A to the action selected by the ε-greedy strategy in the current state S.

b) Perform action A in state S to obtain the new state S′ and the reward R.

c) Select action A′ in state S′ by the ε-greedy strategy.

d) Update the utility trace function E (S, A) and compute the TD error δ:

E(S,A) = E(S,A) + 1

δ = R + γQ(S′,A′) - Q(S,A)

e) For all states S and corresponding actions A of the current sequence, update the value function Q (S, A) and the utility trace function E (S, A):

Q(S,A) = Q(S,A) + αδE(S,A)

E(S,A) = γλE(S,A)

f) S = S′, A = A′

g) Determine whether the number of iterations has been reached; if so, end the iteration; otherwise, return to step b).

End

Output:

The Q values corresponding to all states and actions in the Q value table.

In the algorithm flow, to ensure that the action value function Q converges, the learning rate α generally needs to be reduced gradually as the iterations proceed. The flow chart of the SARSA (λ) algorithm is shown in FIG. 4.
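The flow above can be illustrated with a minimal tabular sketch. It is not the patented implementation: the function names, the `step` environment callback, the toy chain environment, and all parameter values are assumptions chosen for illustration. The sketch also assumes each round ends when the environment reports a terminal state, and it does not bootstrap from a terminal state; neither detail is stated in the flow above.

```python
import random


def epsilon_greedy(Q, state, eps, rng):
    """With probability eps pick a uniformly random action, else the best-known one."""
    n_actions = len(Q[state])
    if rng.random() < eps:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])


def sarsa_lambda(step, n_states, n_actions, start_state, iterations,
                 alpha=0.1, gamma=0.9, eps=0.1, lam=0.8, seed=0):
    """Tabular SARSA(lambda) with accumulating traces.

    `step(state, action)` is a hypothetical environment callback that
    returns (next_state, reward, done).
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]      # Q table starts at 0
    for _ in range(iterations):
        E = [[0.0] * n_actions for _ in range(n_states)]  # a) reset traces
        s = start_state
        a = epsilon_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done = step(s, a)                       # b) act, observe S', R
            a2 = epsilon_greedy(Q, s2, eps, rng)           # c) choose A'
            E[s][a] += 1.0                                 # d) bump the visited pair
            # Assumption: no bootstrapping from a terminal state.
            target = r if done else r + gamma * Q[s2][a2]
            delta = target - Q[s][a]                       #    TD error
            for i in range(n_states):                      # e) update every pair
                for j in range(n_actions):
                    Q[i][j] += alpha * delta * E[i][j]
                    E[i][j] *= gamma * lam
            s, a = s2, a2                                  # f) move on
    return Q


def chain_step(state, action):
    """Hypothetical toy environment: action 0 moves right, action 1 moves left.

    Reaching state 3 ends the round with reward 1; all other rewards are 0.
    """
    nxt = state + 1 if action == 0 else max(state - 1, 0)
    done = nxt == 3
    return nxt, (1.0 if done else 0.0), done


Q = sarsa_lambda(chain_step, n_states=4, n_actions=2, start_state=0, iterations=300)
```

In this toy run the pair (state 2, move right) is the only one that leads directly to the reward, so its Q value approaches 1, and earlier pairs on the path receive credit through the decaying trace.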

In one embodiment, S4 specifically includes:

Specifically, in the initial stage of the SARSA (λ) algorithm, the elements of the Q value table and of the utility trace function E are all 0. If the method were put directly into online learning, a large amount of exploration would be needed during the initial iterations, and most of the actions selected in each state would be random rather than optimal. Therefore, before online learning, the historical electricity price data and the historical wind power output data are processed and fed into the model solution for pre-learning. In the pre-learning stage, experience is accumulated through continuous exploration and the Q value table and the elements of the utility trace function E are updated; the updated values are then carried into online learning, so that the system is already largely able to provide the optimal action strategy at the start of online learning.

After pre-learning is finished, the system is put into online learning. At this point, most elements of the Q value table have been updated and a good strategy can generally be given at each stage; the exploration probability ε is small at this stage, so in most cases the system follows the greedy strategy and selects the action with the maximum Q value in the current state. Through these steps, the SARSA (λ) algorithm can obtain a high-quality solution that maximizes economic benefit for the daily random dynamic scheduling problem of the wind storage combined system, can effectively reduce the solution time, and avoids the curse of dimensionality.

Compared with traditional optimization methods, the reinforcement-learning-based SARSA (λ) algorithm provided by the invention offers the following three improvements in solving the wind storage joint random optimization scheduling problem:

First: a daily random dynamic scheduling model of the wind storage combined system is established, which takes maximization of the daily economic benefit of the wind storage combined system as its objective and considers the time-of-use electricity price.

Second: by introducing reinforcement learning theory, the decision-making capability of the system is gradually improved through continuous interaction between the learning system and the environment, the value function Q converges within a short time, and the curse of dimensionality that conventional algorithms are prone to is avoided.

Third: the round-updated SARSA (λ) algorithm is applied to the model solution, which shortens the solution time while guaranteeing a high-quality solution.

In one embodiment, S4.1 specifically includes:

S4.1.2: determining the upper reservoir storage capacity value corresponding to the current time period as the first state of the state sequence, and solving the wind power output of the current time period from the wind power predicted value for the current time period and the wind power prediction error probability density function, which obeys a Beta distribution;

S4.1.5: solving the wind power output in the next period;
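The wind output calculation in S4.1.2 can be sketched as follows, under stated assumptions. The shape parameters are obtained here by the standard method of moments for a Beta distribution on (0, 1); whether this matches the calculation equation of the invention cannot be confirmed from this passage. The linear mapping from the normalized error to megawatts, the clipping to installed capacity, and all function and parameter names are hypothetical.

```python
import random


def beta_shape_from_moments(mean, var):
    """Method-of-moments shape parameters (a, b) for a Beta distribution on (0, 1)."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie in (0, 1)")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def sample_wind_output(forecast, capacity, a, b, err_span, rng):
    """Draw one wind output as forecast plus a Beta-distributed error.

    Assumption: the normalized error x in (0, 1) maps linearly to an error
    in [-err_span, +err_span]; the result is clipped to [0, capacity].
    """
    x = rng.betavariate(a, b)
    error = (2.0 * x - 1.0) * err_span
    return min(max(forecast + error, 0.0), capacity)


# Example: normalized prediction errors with mean 0.5 and variance 0.05.
a, b = beta_shape_from_moments(0.5, 0.05)
rng = random.Random(42)
outputs = [sample_wind_output(30.0, 50.0, a, b, 10.0, rng) for _ in range(5)]
```

Sampling several outputs per period in this way gives the scenario data on which the learning procedure can be trained.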

Specifically, the SARSA (λ) algorithm is introduced into the model solution, and the state transition equation and the recursion equation of the model are determined. The choice of state variable is crucial: the state variable must be closely related to the decision variable, must reflect the objective well, must allow the value of the decision variable to be obtained, and, more importantly, must represent the whole recursion process well, so that the "no aftereffect" property is satisfied.

The wind storage combined daily random dynamic scheduling problem is a multi-stage decision problem; when it is solved with the reinforcement learning SARSA (λ) algorithm, the state variables and decision variables must be discretized. The state variable is the storage capacity of the upper reservoir of the pumped storage power station: the water level Z_i is discretized into M values from small to large, and the corresponding storage capacity is V_i. The decision variable is the pumping/generating power of the pumped storage power station: the pumping power of a single unit is a fixed value, so the pumping power is discretized according to the number of units, while the generating power is discretized uniformly at a fixed power interval. After the state variables and the decision variables are determined, the state transition equation can be obtained as follows:

in the formula, V_t and V_{t+1} are the storage capacities of the reservoir at the beginning and end of period t, respectively; Q_c is the pumping flow rate in period t, m³/s; Q_fd is the generating flow rate in period t, m³/s; and ΔT is the duration of power generation/pumping in period t.
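A water balance consistent with these symbol definitions is V_{t+1} = V_t + (Q_c - Q_fd)·ΔT. This form is inferred from the definitions only and may omit terms used in the invention. The sketch below applies it and maps the result back to the nearest discrete storage level; the function names, the capacity check, and the example figures are assumptions.

```python
def next_storage(v_t, q_pump, q_gen, dt_seconds, v_min, v_max):
    """Water balance of the reservoir over one period (volumes in m^3, flows in m^3/s)."""
    v_next = v_t + (q_pump - q_gen) * dt_seconds
    if not v_min <= v_next <= v_max:
        raise ValueError("action violates the reservoir capacity limits")
    return v_next


def nearest_level(v, levels):
    """Index of the discrete storage level closest to v."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - v))


# Example: pump 50 m^3/s for one hour starting from 1,000,000 m^3.
levels = [0, 500_000, 1_000_000, 1_500_000, 2_000_000]
v_next = next_storage(1_000_000, 50, 0, 3600, levels[0], levels[-1])
state_index = nearest_level(v_next, levels)
```

Snapping to the nearest level keeps the state inside the discrete set of M storage values that indexes the Q value table.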

According to the Bellman principle, in a given state, maximizing the future reward is equivalent to maximizing the sum of the immediate reward and the maximum future reward of the next state. When the wind power output in each time period is an independent random variable, the recurrence equation is obtained:

in the formula:

the expected value of the maximized economic benefit from period t to period 24;

the economic benefit value of period t;

the expected value of the maximized economic benefit from period (t+1) to period 24.
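One form of the recurrence that is consistent with the three quantities just described is sketched below. The symbol f_t for the expected maximized benefit, the use of P_t^{gd} as the maximization variable, and the boundary condition f_{25} = 0 are assumptions introduced for illustration; they are not taken from the original formula.

```latex
f_t(V_t) = \max_{P_t^{gd}} \; \mathbb{E}\left[ R_t\left(V_t, P_t^{gd}\right) + f_{t+1}\left(V_{t+1}\right) \right],
\qquad t = 1, \dots, 24, \qquad f_{25}(\cdot) = 0
```

Here the expectation is taken over the random wind power output in period t, and V_{t+1} follows from V_t and the chosen action through the state transition equation.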

In the same state, different decisions (actions) yield different rewards. In each state, the reward value is:

R_t = C_t (P_t^{w*} + P_t^{gd}) - G_h n_t
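As a worked illustration of this reward, the sketch below evaluates it for one period. It assumes P_t^{w*} is the wind power output of the period and that each period lasts one hour, so that price times power gives revenue; the function name and the numbers are hypothetical.

```python
def period_reward(price, wind_power, storage_power, start_stop_cost, n_switched):
    """R_t = C_t * (P_w + P_gd) - G_h * n_t.

    storage_power < 0 means pumping, > 0 means generating.
    """
    return price * (wind_power + storage_power) - start_stop_cost * n_switched


# Pumping 10 MW in a low-price period while one unit is started.
r_pump = period_reward(0.5, 30.0, -10.0, 100.0, 1)
# Generating 10 MW in a high-price period with no unit switched.
r_gen = period_reward(2.0, 30.0, 10.0, 100.0, 0)
```

With these figures, pumping in the low-price period gives a negative reward while generating in the high-price period gives a positive one, which is the price arbitrage the scheduling model exploits.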

Referring to fig. 5, which is a flow chart of the application of the SARSA (λ) algorithm to daily random dynamic scheduling of the wind storage combined system: in the specific implementation, the wind power predicted value for each time period is known, since the scheduling center obtains the predicted values for each period of the current day on the previous day from historical wind power data. The electricity price of the current period and the electricity price trend across periods are also known, and can be obtained from the historical electricity price trend of each period of each day in the area. The method of solving the wind power output in the next time period in S4.1.5 is the same as in step S4.1.2.

The specific implementation examples described in this invention are merely illustrative of the system of the present invention. Those skilled in the art to which the invention relates may make various changes, additions or modifications to the described embodiments (i.e., using similar alternatives), without departing from the principles and spirit of the invention or exceeding the scope thereof as defined in the appended claims. The scope of the invention is only limited by the appended claims.

Claims

1. A wind power-pumped storage combined system daily random dynamic scheduling method based on an SARSA (lambda) algorithm is characterized by comprising the following steps:

S1: describing the randomness of wind power output;

S2: according to the randomness of wind power output and the time-of-use electricity price, constructing a target function of a daily random dynamic scheduling model of the wind power-storage combination system:

in the formula: T is the number of time periods in a cycle; R_t is the index function of period t; V_t is the storage capacity of the upper reservoir of the pumped storage power station at the beginning of period t; P_t^{gd} is the power output of the pumped storage power station in period t, where a value less than 0 indicates pumping and a value greater than 0 indicates power generation; the expressions of R_t and P_t^{gd} are as follows:

in the formula: C_t is the peak-valley electricity price corresponding to period t; after the wind power output prediction error distribution function curve of period t is discretized into N values, the corresponding power is

and the corresponding probability is p_{t,i}; G_h is the cost required to start or stop a single unit of the pumped storage power station; n_t is the number of units started or stopped in period t of the pumped storage power station; when the pumped storage power station unit is in a power generation state in period t,

S4: solving a daily random dynamic scheduling model of the wind power-storage combination system by adopting the SARSA (λ) algorithm in reinforcement learning to obtain a scheduling result;

wherein, S4 specifically includes:

2. The dynamic scheduling method of claim 1, wherein the S1 specifically comprises:

acquiring and sorting historical data of the wind power plant to obtain wind power plant prediction error frequency distribution, and calculating shape parameters a and b of Beta distribution according to the mean value and variance of prediction errors, wherein the calculation equation is as follows:

3. The dynamic scheduling method of claim 1 wherein the constraints in S3 include:

(1) reservoir capacity constraint:

V_min≤V_t≤V_max

V₂₄-V₀＝0

wherein the pumped storage reservoir is of a daily regulation type, the storage capacities of the reservoir at the beginning and end of each day are equal, and V₂₄ and V₀ are the storage capacities of the reservoir at hour 24 and hour 0, respectively;

(3) power generation and pumping power constraint:

in the formula:

the constraint of the pumping power is as follows:

P_t^d = P^d k_t

(4) pumping and generation mutual exclusion constraint:

(5) total number of units constraint:

in the formula:

is the total number of units in the working state in period t, and N' is the total number of available units of the pumped storage power station.

4. The dynamic scheduling method of claim 1 wherein the SARSA (λ) algorithm in S4 introduces an E matrix recording the paths and fading in each round and a step fading coefficient λ based on the model-independent SARSA algorithm, and changes the single step updating mode of the SARSA algorithm into the round updating mode of the SARSA (λ) algorithm, where the E matrix is a utility trace function.

5. The dynamic scheduling method of claim 4, wherein after each step in a round, i.e. after an action is selected in the current state, the SARSA (λ) algorithm increments the value at the (S, A) position of the utility trace function E (S, A) by 1, and updates the action value function Q (S, A) and the utility trace function E (S, A) after each action; Q (S, A) is updated as follows:

wherein α is the learning rate, used to control the convergence of learning; γ is the attenuation factor, used to reduce the influence of future return on the current strategy; the TD error δ represents the error between the ideal value and the actual value of Q (S, A); S represents the current state, A represents the current action, R represents the reward obtained after action A is taken, S′ represents the next state entered after the action is taken, and A′ represents the action taken in the next state;

E (S, A) is updated as follows:

E(S,A) = γλE(S,A).

6. The dynamic scheduling method of claim 5, wherein a greedy strategy is adopted in the SARSA (λ) algorithm iteration process to select the action in each state, specifically: randomly generating a decimal between 0 and 1 and comparing it with the exploration probability ε; if the decimal is smaller than ε, the system selects an action at random, each action being selected with the same probability; if the decimal is not smaller than ε, the system selects the known optimal action in the current state, as shown in the following formula:

in the formula, A_i is the known optimal strategy in state S.

7. The dynamic scheduling method of claim 1, wherein S4.1 specifically comprises:

S4.1.3: determining a state corresponding to the storage capacity value of the upper reservoir of the pumped storage power station at the current time period, selecting a pumping/power generation action by a greedy strategy according to the current time period power price and the power price trend of each time period, and determining the pumping/power generation flow of the upper reservoir of the pumped storage power station according to the action;

S4.1.4: obtaining, by the state transfer equation, a new state corresponding to the upper reservoir storage capacity of the pumped storage power station at the next time period after the action is taken, and the reward obtained by taking the action;

S4.1.5: solving the wind power output in the next period;

S4.1.8: updating the value function Q of the upper reservoir storage capacity state and the pumping/generating flow corresponding to all the current time periods, and attenuating the trace function E updated in S4.1.7;