CN114841595A - Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method - Google Patents

Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method

Info

Publication number
CN114841595A
Authority
CN
China
Prior art keywords
real-time
state
hydropower station
algorithm
Prior art date
Legal status
Pending
Application number
CN202210548151.5A
Other languages
Chinese (zh)
Inventor
谢俊
包长玉
潘学萍
郑源
潘虹
Current Assignee
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN202210548151.5A
Publication of CN114841595A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a real-time optimal scheduling method for a hydropower station plant based on a deep reinforcement learning algorithm, which is used to solve real-time optimal scheduling within the hydropower station plant. Real-time optimal scheduling within a hydropower station plant is an important link in adjusting the day-ahead power generation plan and an important problem in the economic operation of power systems. The method is developed for the optimal adjustment of the deviation between the power grid load forecast and the day-ahead power generation plan: first, the real-time scheduling problem within the hydropower station plant is converted into a Markov decision process; then the deep reinforcement learning algorithm Deep Q-Learning is applied to solve the problem; finally, a real-time rolling operation strategy within the hydropower station plant is obtained and applied to the formulation of the actual real-time scheduling strategy within the plant. The method is based on a data-driven approach, can effectively solve the real-time optimal scheduling problem within the hydropower station plant, and shows good robustness in dealing with emergencies.

Description

Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method
Technical Field
The invention belongs to the field of power dispatching, and particularly relates to a hydropower station plant real-time optimization dispatching method based on a deep reinforcement learning algorithm.
Background
Most existing research focuses on modeling real-time optimal scheduling and the optimal allocation of output, while the adjustment of load-distribution deviations of hydroelectric generating sets caused by inaccurate forecasts is rarely considered. In actual production operation, however, deviations between the real-time load and the planned load inevitably arise within the dispatching day, which affects the safe and stable operation of the power grid. Studying the real-time response to grid load regulation is therefore both unavoidable and important. Moreover, over the long term, real-time scheduling of a hydropower station exhibits a certain repetitiveness, so accumulated historical decision schemes can guide decisions on the real-time deviations of the hydropower units. With the development of artificial intelligence, a data-driven deep reinforcement learning algorithm enables an agent to react quickly and accurately to load deviations through interaction with the environment, and offers advantages over conventional algorithms in solving the real-time optimal scheduling problem within a hydropower station plant.
Disclosure of Invention
In order to solve the technical problems mentioned in the Background, the invention provides a real-time scheduling decision method for hydropower station plants based on a deep reinforcement learning algorithm, which is used to solve the real-time optimized operation problem within a hydropower station plant.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
A real-time optimal scheduling method for a hydropower station plant based on a deep reinforcement algorithm comprises the following steps:
(1) constructing a mathematical model according to the real-time optimized operation conditions within the hydropower station plant;
(2) converting the real-time scheduling problem within the hydropower station plant into a Markov decision process according to the mathematical model constructed in step (1);
(3) solving the Markov decision process by applying the deep reinforcement learning algorithm Deep Q-Learning to obtain a real-time rolling operation strategy within the hydropower station plant.
Preferably, the step (1) specifically comprises:
according to the optimized operation criterion of the hydroelectric generating sets, the given day-ahead power generation plan of the hydroelectric generating sets is adjusted, and the objective function is as follows:
min ΔQ = Σ_{i=1}^{N} ΔQ_i
where ΔQ is the total water-consumption deviation of all running hydroelectric generating sets within the 15 min scheduling period; N is the number of hydroelectric generating sets; ΔQ_i is the water-consumption deviation of hydroelectric generating set i within the 15 min scheduling period; the consumption characteristic function Q_i = f(P_i, H_i) expresses the flow Q_i of hydroelectric generating set i as a nonlinear function of its average output P_i and average head H_i within the scheduling period, specifically expressed as:
Q_i = f(P_i, H_i)
where f is the consumption characteristic function of the hydroelectric generating set;
when the real-time output adjustment of hydroelectric generating set i is ΔP_i, the corresponding water-consumption adjustment ΔQ_i is expressed as:
ΔQ_i = f(ΔP_i, H_i)
for the deviation values of the output and water consumption of the hydroelectric generating sets, the power-deviation balance constraint, the output constraint of each unit, the flow constraint of each unit and the vibration-zone constraint of each unit during operation are established in combination with the real-time operation variables, specifically expressed as follows:
the power-deviation balance constraint is specifically expressed as:
Σ_{i=1}^{N} ΔP_i = ΔP_L
where ΔP_i is the real-time output adjustment of hydroelectric generating set i, and ΔP_L is the deviation between the total system load and the day-ahead output plan within the 15 min scheduling period;
the unit output constraint is specifically expressed as:
P_{i,min} ≤ P_i ± ΔP_i ≤ P_{i,max}
where P_{i,max} and P_{i,min} are respectively the upper and lower output limits of hydroelectric generating set i;
the unit flow constraint is specifically expressed as:
Q_{i,min} ≤ Q_i ± ΔQ_i ≤ Q_{i,max}
where Q_{i,max} and Q_{i,min} are respectively the upper and lower limits of the generating flow of hydroelectric generating set i;
the unit cavitation-vibration-zone constraint is specifically expressed as:
(P_i + ΔP_i - P_{zi,max})(P_i + ΔP_i - P_{zi,min}) ≥ 0
where P_{zi,max} and P_{zi,min} are respectively the upper and lower limits of the cavitation-vibration zone of hydroelectric generating set i during operation.
Preferably, the Markov decision process in step (2) combines reinforcement learning with the main features of the hydropower station real-time optimal scheduling problem, and constructs a learning process by defining an agent, a state set S, an action set A and a reward matrix R in the reinforcement learning algorithm; the agent is the dispatcher of the hydropower station or the automatic generation control system, and learns to select actions from the environment so as to maximize future returns; the interval [0, 5%·P_{i,max}] is divided into K parts as the state set S, where P_{i,max} denotes the maximum output; the action set A is the set of discrete output deviations {ΔP_i} of each unit of the hydropower station; according to the related parameters of the hydroelectric generating sets, the state set S and the action set A, the element value r_t(s_t, s_{t+1}, a_t) of the reward matrix is determined, i.e. the reward obtained when, from the state s_t of the current period, action a_t is taken and the state is updated to s_{t+1} of the next period.
Preferably, the initialization setting of the parameters of the Deep reinforcement Learning algorithm Deep Q-Learning in step (3) includes the following steps:
31) randomly initializing a state s, initializing an experience memory pool D, and setting the capacity of the experience memory pool D as N;
32) constructing a Q network and a target Q network, randomly initializing the Q network weight θ, and setting the target Q network weight θ^- = θ;
33) initializing the step-size factor α and the discount factor γ;
34) initializing the number of training iterations M.
Preferably, the deep reinforcement learning algorithm in step (3) specifically includes the following steps:
321) selecting an action a according to an epsilon-greedy strategy;
322) executing the action a to obtain an instant reward r, a next state s' and a termination state done;
323) storing {s, a, r, s', done} as a group of batch data in the experience memory pool D;
324) judging whether the batch data in the experience memory pool D is more than or equal to N:
when the batch data in D is larger than or equal to N, randomly extracting m batch data in D as training samples, wherein m is 32;
taking s' of all training samples as the input of the Q network to obtain the Q value of each action in state s':
target_Q = r + γ·max_{a'} Q(s', a'; θ^-)
where Q(s_t, a_t) is the Q value of executing action a_t in state s_t, r + γ·max_{a'} Q(s', a'; θ^-) is the expected return obtainable by executing action a_t in state s_t, and the target_Q value corresponding to the Q value is calculated from the target Q network;
training the Q network by applying a gradient descent algorithm to the Q value and the target_Q value, and updating the target Q network once every C steps, i.e. setting θ^- = θ;
325) when the batch data in D are fewer than N, judging whether the termination state has been reached:
if it is the termination state, searching for the next initialization state s and continuing the above steps;
if it is not the termination state, converting the current state s into the new next state s' and repeating the above steps in a loop.
The technical scheme described above brings the following beneficial effects:
The invention applies the data-driven DQN algorithm to the real-time optimal scheduling problem within hydropower station plants and, with the day-ahead output plan and unit combination determined, compares the effectiveness of the DQN algorithm in real-time decision-making against other algorithms, specifically as follows:
1) The agent trained with the DQN algorithm can determine the optimal deviation adjustment of each unit in a random environment according to the accumulated reward, the day-ahead output plan and the real-time load deviation, so as to follow real-time changes of the environment. When the agent faces an unknown environment, it can adjust the unit deviations several times according to the learned prior knowledge until the current state is the final state.
2) For a given prediction deviation, the deviation water consumption of the proposed algorithm is significantly lower than that of the GA algorithm, i.e. the decision made by the DQN algorithm is better than that of the GA algorithm, and the online decision time of the DQN algorithm is far shorter than that of the GA algorithm.
Drawings
FIG. 1 is a schematic diagram of a reinforcement learning process;
FIG. 2 is a graph of a DQN value function approximation network;
FIG. 3 is a RL optimization decision and control framework diagram;
FIG. 4 is a diagram of the overall architecture of a hydropower station real-time optimization operation based on deep reinforcement learning;
fig. 5 is a graph of the average reward variation of the DQN algorithm.
Detailed Description
The technical scheme of the invention is explained in detail in the following with the accompanying drawings.
The technical scheme adopted by the invention is a real-time optimal scheduling method for a hydropower station plant based on a deep reinforcement learning algorithm, which mainly comprises the following three steps:
1) giving a mathematical description of the real-time optimized operation problem within the hydropower station plant;
2) converting the real-time scheduling problem within the hydropower station plant into a Markov decision process according to the mathematical model given in step 1);
3) solving the problem by applying the deep reinforcement learning algorithm Deep Q-Learning (DQN) to obtain a real-time rolling operation strategy within the hydropower station plant.
When the real-time power generation plan of the hydropower station is adjusted, the day-ahead power generation plan of the hydropower station needs to be updated in real time in a rolling manner; the time interval is set to 15 minutes, and the scheduling horizon runs from the current scheduling time to 24:00 of the current day.
According to the optimized operation criterion of the hydroelectric generating sets, for the real-time load deviation, the given day-ahead power generation plan of the hydroelectric generating sets is finely adjusted on the premise of safe operation so that the deviation water consumption is minimized, and the objective function is as follows:
min ΔQ = Σ_{i=1}^{N} ΔQ_i    (1)
where ΔQ is the total water-consumption deviation of all running units within the 15 min scheduling period; N is the number of hydroelectric generating sets; ΔQ_i is the water-consumption deviation of unit i within the 15 min scheduling period.
The consumption characteristic Q_i of a hydroelectric generating set can be expressed as a nonlinear function of the average output P_i of unit i within the scheduling period and the average head H_i, i.e.:
Q_i = f(P_i, H_i)    (2)
When the output of the unit is adjusted in real time by ΔP_i, the corresponding water-consumption adjustment ΔQ_i is:
ΔQ_i = f(ΔP_i, H_i)    (3)
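To make equations (2) and (3) concrete, the following Python sketch evaluates a hypothetical consumption characteristic f(P, H) and the water-consumption deviation for a candidate output adjustment. The quadratic form, its coefficients, the assumed head, and the finite-difference reading of equation (3) are illustrative assumptions only; an actual plant would use fitted unit performance curves.

```python
def consumption_characteristic(p_mw: float, h_m: float) -> float:
    """Hypothetical consumption characteristic Q_i = f(P_i, H_i), in m^3/s.

    The quadratic-in-output, inverse-in-head form and the coefficients are
    assumed for illustration; they are not taken from the patent.
    """
    a, b, c = 0.002, 1.10, 3.0
    return (a * p_mw ** 2 + b * p_mw + c) / max(h_m, 1e-6)


def water_deviation(p_mw: float, dp_mw: float, h_m: float) -> float:
    """Water-consumption deviation for an output adjustment dp_mw (MW).

    Implemented here as f(P_i + dP_i, H_i) - f(P_i, H_i), which is one
    possible reading of equation (3).
    """
    return consumption_characteristic(p_mw + dp_mw, h_m) - consumption_characteristic(p_mw, h_m)


if __name__ == "__main__":
    # A unit running at 80 MW under a 60 m head, adjusted upward by 2 MW.
    print(f"dQ_i = {water_deviation(80.0, 2.0, 60.0):.3f} m^3/s")
```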
the constraint condition of the real-time operation of the hydropower station is similar to the economic operation, but the variable of the real-time operation is the deviation amount of the output and the water consumption of the hydropower unit, and 4 aspects are also considered, namely output deviation balance constraint, output constraint of each unit, flow constraint of each unit, vibration area constraint during the operation of each unit and the like.
1) Power balance constraint
Σ_{i=1}^{N} ΔP_i = ΔP_L    (4)
where ΔP_L is the total load deviation of the system within the 15 min period.
2) Unit output constraint
P_{i,min} ≤ P_i ± ΔP_i ≤ P_{i,max}    (5)
where P_{i,max} and P_{i,min} are respectively the upper and lower output limits of unit i.
3) Unit flow constraint
Q_{i,min} ≤ Q_i ± ΔQ_i ≤ Q_{i,max}    (6)
where Q_{i,max} and Q_{i,min} are respectively the upper and lower limits of the generating flow of unit i.
4) Unit cavitation-vibration-zone constraint
When the hydroelectric generating set is in operation, the cavitation-vibration operating zone should be avoided as far as possible, i.e.
(P_i + ΔP_i - P_{zi,max})(P_i + ΔP_i - P_{zi,min}) ≥ 0    (7)
where P_{zi,max} and P_{zi,min} are respectively the upper and lower limits of the cavitation-vibration zone of unit i during operation; the unit should avoid the operating region [P_{zi,min}, P_{zi,max}] when running.
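As an illustration of how constraints (4) to (7) might be checked for a candidate set of output adjustments, a minimal sketch follows; the unit data structure, field names and tolerance are assumptions rather than definitions from the patent.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Per-unit operating point and limits (assumed field names; MW and m^3/s)."""
    p: float        # current average output P_i
    p_min: float    # output lower limit
    p_max: float    # output upper limit
    q: float        # current generating flow Q_i
    dq: float       # water-consumption adjustment for the candidate output adjustment
    q_min: float    # flow lower limit
    q_max: float    # flow upper limit
    pz_min: float   # cavitation-vibration zone lower limit
    pz_max: float   # cavitation-vibration zone upper limit


def feasible(units, dps, dp_load, tol=1e-6):
    """Check constraints (4)-(7) for the candidate adjustments dps (MW)."""
    # (4) power-deviation balance: the unit adjustments must cover the load deviation
    if abs(sum(dps) - dp_load) > tol:
        return False
    for u, dp in zip(units, dps):
        p_new = u.p + dp
        if not (u.p_min <= p_new <= u.p_max):                     # (5) output limits
            return False
        if not (u.q_min <= u.q + u.dq <= u.q_max):                # (6) generating-flow limits
            return False
        if (p_new - u.pz_max) * (p_new - u.pz_min) < 0:           # (7) stay outside the vibration zone
            return False
    return True
```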
Because real-time scheduling requires the units to react quickly and accurately to the load deviation, the method is analyzed and studied with a data-driven algorithm. Inspired by behavioral psychology, reinforcement learning (RL) is a simulation-based optimization method: through interaction with an environment that contains all other active agents, the interacting agent searches for an optimal or near-optimal strategy. Based on this evolutionary computing approach, the agent is trained through interaction with the environment to take optimal or near-optimal actions. Unlike supervised learning, which requires an external supervisor to provide example strategies, an RL-based learning process proceeds through interaction with a dynamic environment and feedback that evaluates earlier decisions.
The basic framework of reinforcement learning (RL) consists mainly of two parts, the environment and the agent, as shown in FIG. 1. With the goal of maximizing the long-term reward value, the agent selects an action according to a certain policy and acts on the environment, ultimately deciding what action should be taken in each state encountered. The purpose of RL is for the system to learn a mapping from the environment to behavior so as to optimize the objective function.
When establishing the hydropower station real-time optimal scheduling model based on reinforcement learning theory, the main features of reinforcement learning and of the hydropower station real-time optimal scheduling problem need to be combined, and the state set S, the action set A and the reward matrix R in the reinforcement learning algorithm must be defined reasonably. First, the state set S is defined as K discrete load-deviation amounts from 0 to 5% of the maximum output P_{i,max}, so the elements of the state set S for each time period can be divided into K states. Second, the action set A is a set of discrete output-deviation values {ΔP_i} of each unit of the hydropower station. Finally, according to the related parameters of the hydroelectric generating sets and the elements of the state set S and the action set A, the element value r_t(s_t, s_{t+1}, a_t) of the reward matrix is determined, i.e. the reward obtained when, from the state s_t of the current period, action a_t is taken and the state is updated to s_{t+1} of the next period.
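The state and action discretization described above could be set up as in the sketch below; the maximum output, the number of states K and the action granularity are assumed example values, not figures from the patent.

```python
import numpy as np

P_MAX = 150.0      # assumed maximum unit output, MW
K = 50             # assumed number of discrete states

# State set S: the load-deviation interval [0, 5% of P_max] divided into K parts.
states = np.linspace(0.0, 0.05 * P_MAX, K)

# Action set A: discrete per-unit output deviations (assumed 0.25 MW steps).
actions = np.arange(-2.0, 2.0 + 0.25, 0.25)


def nearest_state(load_deviation_mw: float) -> int:
    """Map a measured load deviation onto its discrete state index."""
    return int(np.abs(states - load_deviation_mw).argmin())
```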
Deep Learning (DL) is a feature extraction optimization method based on an artificial neural network, and a series of nonlinear units are used to realize direct mapping relation between input and output. The output of the lower layer is used as the input of the upper layer, training is carried out according to a back propagation algorithm, and useful characteristic information is automatically mined from mass data. As a data-driven method, the method overcomes the over-rigidity problem of manually extracting the features, considers complex environmental factors and is beneficial to solving the nonlinear problem.
Deep reinforcement learning (DRL) combines DL and RL, introducing neural networks to directly express and optimize a value function, a policy or an environment model in an end-to-end manner. DRL can make full use of high-dimensional raw input data for pattern extraction and model construction; in addition, it can serve as the basis for policy control. Compared with traditional reinforcement learning, deep reinforcement learning overcomes the inability to handle high-dimensional, large-scale problems.
Q-learning is the most common method in reinforcement learning. In the Q-learning algorithm, the Q value of each state-action pair (the value of selecting each action in each state) is stored in a table, called the Q table, and updated by a stochastic gradient descent method.
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]    (8)
where Q(s_t, a_t) is the Q value of executing action a_t in state s_t, α is the step-size control factor, r_t + γ·max_{a'} Q(s_{t+1}, a') is the expected return obtainable by executing action a_t in state s_t, and γ is the discount factor. However, in a high-dimensional space the traversal speed of the agent is too slow to learn the value of each state separately. When the state space or action space is high-dimensional, the Q-learning algorithm becomes impractical. To overcome this problem, researchers have proposed value-function approximation: by adjusting the parameter θ, a function is used to approximate the value function under a certain policy, as shown in formula (9).
Q(s_t, a_t; θ) ≈ Q(s_t, a_t)    (9)
By this method, the task is converted into solving for the parameter θ in the objective function:
L(θ) = E[(target_Q - Q(s, a; θ))^2]    (10)
where L denotes the objective function and E denotes the expectation;
the parameter θ is updated step by step with a stochastic gradient descent method so that the objective function converges to its minimum.
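For intuition, the following is a minimal numerical sketch of one stochastic-gradient step on this objective with a linear approximation Q(s, a; θ) = θ·φ(s, a), for which the gradient with respect to θ is simply the feature vector φ; the feature values, the target and the step size are assumed purely for illustration.

```python
import numpy as np


def q_value(theta, phi):
    """Linear approximation Q(s, a; theta) = theta . phi(s, a)."""
    return float(theta @ phi)


def sgd_step(theta, phi, target_q, alpha=0.01):
    """One gradient step on (target_Q - Q(s, a; theta))^2.

    For a linear approximator the gradient of Q with respect to theta is
    simply phi, so the step is theta + alpha * (target_Q - Q) * phi.
    """
    td_error = target_q - q_value(theta, phi)
    return theta + alpha * td_error * phi


theta = np.zeros(3)
phi = np.array([1.0, 0.5, -0.2])   # assumed features of one (s, a) pair
for _ in range(200):               # repeated steps drive Q(s, a; theta) toward the target
    theta = sgd_step(theta, phi, target_q=1.0)
print(q_value(theta, phi))         # approaches 1.0
```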
As shown in FIG. 2, the value function of DQN is approximated by a neural network, and the parameter θ consists of the network weights of each layer of the neural network. Unlike the tabular Q-learning algorithm, DQN updates the value function by updating the parameter θ rather than the Q table; the parameter update formula is as follows:
θ_{t+1} = θ_t + α[target_Q - Q(s_t, a_t; θ_t)]·∇_{θ_t} Q(s_t, a_t; θ_t)    (11)
where θ_{t+1} denotes the parameters to be updated in the next iteration, θ_t denotes the parameters of the current iteration, and ∇_{θ_t} Q(s_t, a_t; θ_t) denotes the gradient.
In this way, the update of the DQN value function is converted into a supervised-learning-style update.
The improvement of the Q-learning algorithm by the DQN is mainly reflected in three aspects:
1) approximating a value function using a deep neural network;
2) the learning process of reinforcement learning is trained by using experience playback;
3) setting a separate target Q network for calculating the TD error.
the algorithm pseudo code is shown in table 1:
TABLE 1
(Table 1, the DQN algorithm pseudocode, is provided as an image in the original document.)
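Since Table 1 survives only as an image, the following is a hedged Python/PyTorch sketch of the training procedure described in steps 31)-34) and 321)-325) above; the environment interface (reset/step), the network shape and the point at which training begins are assumptions rather than details taken from the patent.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn


def train_dqn(env, n_state_features, n_actions, M=5000, N=20000, m=32, C=100,
              alpha=0.01, gamma=0.95, eps=0.1):
    """DQN training loop following steps 31)-34) and 321)-325) in the text.

    `env` is assumed to expose reset() -> state vector and
    step(action_index) -> (next_state, reward, done); this interface is an
    assumption, since the patent does not define it.
    """
    q_net = nn.Sequential(nn.Linear(n_state_features, 128), nn.ReLU(),
                          nn.Linear(128, 128), nn.ReLU(),
                          nn.Linear(128, n_actions))
    target_net = copy.deepcopy(q_net)                      # 32) theta^- = theta
    optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)
    replay = deque(maxlen=N)                               # 31) experience memory pool D
    step_count = 0

    for _ in range(M):                                     # 34) training iterations
        s, done = env.reset(), False
        while not done:
            # 321) epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
            # 322)-323) interact with the environment and store the transition
            s_next, r, done = env.step(a)
            replay.append((s, a, r, s_next, float(done)))
            s = s_next
            # 324) train once enough transitions are stored (the text waits until the
            # pool holds N samples; starting at one batch of m is a common simplification)
            if len(replay) >= m:
                batch = random.sample(replay, m)
                states, acts, rews, nexts, dones = map(
                    lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
                q_sa = q_net(states).gather(1, acts.long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target_q = rews + gamma * target_net(nexts).max(dim=1).values * (1.0 - dones)
                loss = nn.functional.mse_loss(q_sa, target_q)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                step_count += 1
                if step_count % C == 0:                    # refresh the target network every C steps
                    target_net.load_state_dict(q_net.state_dict())
    return q_net
```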
The main advantage of RL or DRL is that the model can be learned from an offline environment and can adapt to a dynamic environment. After the model is fully trained using offline data, the model can be utilized online in a real-time environment. The framework of the implementation of the algorithm applied in decision and control is shown in fig. 3.
The framework comprises two parts: a training part and an execution part. The training part is the main research content of the invention: it is responsible for learning knowledge, while the execution part puts the learned knowledge into practice so as to make optimal decisions in the real-time physical environment. If an emergency occurs, the agent interacts with the new environment; by adjusting its behavior, the agent gradually increases the reward it obtains and restores the optimization effect.
As mentioned above, agents, environments, rewards, and actions are four basic components of reinforcement learning. Further, the details of the algorithm implementation will be described based on these four parts, and the overall architecture is shown in fig. 4.
Through the learning process, the agent finds a set of optimal behaviors that can affect the environment. The agent must be able to generate any behavior contained in the set of allowable behaviors defined by the modeler (e.g. differences in the generating flow of the hydroelectric generating sets) and to perceive feedback on its behavior. Feedback is the only guidance by which the agent improves its decision-making. The environment is defined by the set of states the agent has access to, and the purpose of learning is to find the optimal behavior in each state. For example, in the real-time optimized operation of a hydropower station, the agent can be the dispatcher who adjusts the unit outputs, and the deviation between the day-ahead dispatching plan and the real-time load can be regarded as the environment. In this case, the learning objective is to find the best unit output-adjustment strategy for a given real-time deviation based on operational objectives (such as maximizing hydropower profit or minimizing expected water consumption) and constraints (such as unit output constraints and the upper and lower limits of generating flow).
1) Intelligent agent
The dispatcher of the hydroelectric power plant or the automatic generation control system can be seen as an agent that learns from the environment to select actions to maximize future returns. In the real-time optimization operation of the hydropower station, the intelligent agent can be used for adjusting the output of the generator set, and in the algorithm design, the optimization decision is highly dependent on the environment and reward.
2) Environment
The environment in the real-time optimized operation of the hydropower station refers to the dynamic output deviation of the hydropower unit. The adjustment is done every 15 minutes in real time, i.e. the state changes. The environment is defined by a set of states that the agent has access to. The purpose of learning is to find out the optimal output adjustment amount under the output deviation amount of each hydroelectric generating set, so that the water consumption of the hydroelectric generating sets is minimum.
3) Reward
Rewards or feedback are the key to reinforcement learning algorithms. Generally, a reasonable reward function can guide the agent to move in the "right direction"; the reward function of the invention is designed mainly on the basis of formula (3). Furthermore, since the objective function is the minimization of water consumption, the reward obtained by the agent is designed as its inverse. The aim of this strategy is that the larger the reward received by the agent, the less water the units consume in the real-time optimized operation of the hydropower station. The agent obtains rewards and approaches the objective function step by step, as shown in the following formula.
(Equation image in the original: the reward is defined as the inverse of the deviation water consumption given by the objective function.)
4) Action
The action performed by the agent is defined as the output adjustment ΔP_i made by each running hydroelectric generating set; these actions are limited by the upper and lower limits of unit output, the load-balance constraint and the cavitation-vibration-zone constraint.
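Tying the four components together, a minimal environment-step sketch is shown below; it reuses the hypothetical `Unit` and `water_deviation` helpers from the earlier sketches, reads the "inverse of the cost function" as a negative-water-consumption reward, and uses assumed values for the head, the infeasibility penalty and the termination tolerance.

```python
def env_step(units, dp_load_remaining, dps, head_m=60.0):
    """One scheduling interaction: apply unit output adjustments `dps` (MW)
    and return (remaining load deviation, reward, done).

    `units` is a list of the Unit records and `water_deviation` the consumption
    helper from the earlier sketches; the negative-water-consumption reward is
    an assumed reading of the reward defined in the text.
    """
    for u, dp in zip(units, dps):
        p_new = u.p + dp
        within_limits = u.p_min <= p_new <= u.p_max                       # constraint (5)
        outside_vibration = (p_new - u.pz_max) * (p_new - u.pz_min) >= 0  # constraint (7)
        if not (within_limits and outside_vibration):
            return dp_load_remaining, -100.0, False    # assumed penalty for an invalid action
    dq_total = sum(water_deviation(u.p, dp, head_m) for u, dp in zip(units, dps))
    remaining = dp_load_remaining - sum(dps)
    done = abs(remaining) < 1e-3                       # final state: residual deviation ~ 0 MW
    return remaining, -dq_total, done                  # less water consumed -> larger reward
```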
Examples
The DQN-based real-time scheduling method for the hydropower station plant is applied to a hydropower station example with 4 units to solve the real-time optimized operation problem within the plant.
1) The deep reinforcement learning algorithm is built mainly on a deep neural network, so the network structure directly determines the performance of the algorithm. In the neural network of this example, three hidden layers are established with 128, 128 and 250 neurons respectively, and two rectified linear units (ReLU) and a sigmoid function are used as the activation functions of the neural network. The hyper-parameters of the DQN deep reinforcement learning algorithm are shown in Table 2.
TABLE 2
Parameter                            Value
Number of training iterations M      5000
Step-size factor α                   0.01
Discount factor γ                    0.95
Experience memory pool capacity N    20000
Training frequency n                 5
Training batch size m                32
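A sketch of a Q network matching the architecture described in this example (hidden layers of 128, 128 and 250 neurons with two ReLU activations and a sigmoid), written with PyTorch; the input and output dimensions, and the placement of the sigmoid after the third hidden layer, are assumptions since the patent does not specify them.

```python
import torch.nn as nn


def build_q_network(n_state_features: int, n_actions: int) -> nn.Sequential:
    """Q network with hidden layers 128-128-250 and ReLU/ReLU/Sigmoid activations."""
    return nn.Sequential(
        nn.Linear(n_state_features, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 250), nn.Sigmoid(),
        nn.Linear(250, n_actions),   # one Q value per discrete output-deviation action
    )
```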
One week of 15 min interval data of the hydropower station are selected as training samples for the neural network; the training samples are first classified and preprocessed according to the unit combinations. To guarantee the training effect, the DQN algorithm is applied to train the agent for 5000 iterations, and the average-reward curves corresponding to training under the different unit combinations are shown in FIG. 5.
When the number of training iterations M is set to 5000, the agent converges reliably, which shows that the agent gradually adapts to the environment through learning and obtains higher returns: many choices are random at the beginning, and after a number of iterations the agent learns to select actions whose trend and probability approach the optimization target, so the DQN algorithm achieves a good training effect. The training effect differs between unit combinations: the water consumption for adjustment with four units in operation is lower than with two or three units in operation, but the convergence behavior is roughly the same for the different unit combinations, with the optimal solutions reached at about 2800 iterations.
2) After the agent has been trained in the established real-time scheduling environment, the output fine-tuning of the hydroelectric generating sets can be decided according to the real-time state (the real-time output deviation between the load curve and the day-ahead power generation plan). To verify the effectiveness of the agent in a random environment, a load curve of a certain day and the day-ahead output of the hydroelectric generating sets are selected, and a scheduling environment with randomly varying load deviations is superimposed, the maximum load deviation being taken as 5% of P_{i,max}. Five decisions were made for the random deviations in period 1, period 7 and period 13, and the results are shown in Table 3.
TABLE 3
(Table 3, listing the unit output adjustments decided for the random deviations in periods 1, 7 and 13, is provided as an image in the original document.)
With the day-ahead unit combination determined, the agent can determine the optimal deviation adjustment of each unit in a random environment according to the accumulated reward, the day-ahead output plan and the real-time load deviation, so as to follow real-time changes of the environment.
3) To verify the effectiveness of the agent in dealing with emergencies, the deviation ΔP_L at a certain moment in period 1 is artificially set to 7 MW, and the decisions made by the agent in this unknown environment are observed; the results are shown in Table 4.
TABLE 4
Adjustment           Residual deviation/MW   Unit 1   Unit 2   Unit 3   Unit 4
First adjustment     2                       0.12     0.85     4.03     0
Second adjustment    0                       1.51     0.21     0.26     0.02
When facing an unknown environment, the agent can adjust the unit deviations several times according to the learned prior knowledge until the current state is the final state, i.e. the residual deviation is 0 MW. In addition, as the agent gradually adapts to the unknown environment through interaction with it, the acquired experience is stored as learned knowledge to improve its strategy and cope with changes of the unknown environment, thus completing the agent's self-learning and self-evolution.
4) To analyze the efficiency of the DQN algorithm in solving the real-time scheduling problem within the hydropower station plant, a genetic algorithm (GA) is adopted for comparison. With the day-ahead output plan and the unit combination determined, prediction deviations are set, and the DQN and GA algorithms are applied separately to solve single-period real-time scheduling; the water consumption and solution times of the two algorithms are compared in Tables 5 and 6.
TABLE 5
Solving algorithm deviation/MW Unit 1 Unit 2 Unit 3 Unit 4 Deviation water consumption/3600 m 3
GA 5 2.1 0 0 2.9 43.89
DQN 5 4.26 0 0 0.74 6.19
GA 5 0.322 1.9 2.67 0.107 21.917
DQN 5 0.12 0.85 4.03 0 4.862
GA 3 0.1166 0 0 2.8834 55.144
DQN 3 2.87 0 0 0.13 2.6314
TABLE 6
Solving algorithm   Training time/s   Decision time/s
GA                  N/A               12.627
DQN                 167.324           0.233
With the day-ahead output plan and unit combination fixed, after the unit outputs are adjusted according to the prediction deviation using the DQN algorithm, the deviation water consumption is significantly lower than with the GA algorithm, i.e. the decision made by the DQN algorithm is better. This is because the DQN algorithm is data-driven and can, through interaction with the current environment, quickly retrieve the optimal output-deviation decision from its memory pool, whereas the genetic algorithm depends on population size and quality, crossover and mutation operations when solving, which increases the complexity and solution time of the problem. The training time of the DQN algorithm is far longer than its decision time, so in practical application an offline-training, online-decision mode can be adopted, avoiding the impact of the big-data training process on decision efficiency.
The embodiments are only intended to illustrate the technical idea of the present invention and do not limit it; any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the scope of the present invention.

Claims (5)

1. A real-time optimal scheduling method for a hydropower station plant based on a deep reinforcement algorithm, characterized by comprising the following steps:
(1) constructing a mathematical model according to real-time optimized operation conditions in a hydropower station plant;
(2) converting the real-time scheduling problem in the hydropower station plant into a Markov decision process according to the mathematical model constructed in the step (1);
(3) and solving the Markov decision process by applying Deep reinforcement Learning algorithm Deep Q-Learning to obtain a real-time rolling operation strategy in the hydropower station plant.
2. The method for real-time optimal scheduling in a hydropower station plant based on a deep reinforcement algorithm according to claim 1, wherein step (1) specifically comprises:
according to the optimized operation criterion of the hydroelectric generating sets, the given day-ahead power generation plan of the hydroelectric generating sets is adjusted, and the objective function is as follows:
min ΔQ = Σ_{i=1}^{N} ΔQ_i
where ΔQ is the total water-consumption deviation of all running hydroelectric generating sets within the 15 min scheduling period; N is the number of hydroelectric generating sets; ΔQ_i is the water-consumption deviation of hydroelectric generating set i within the 15 min scheduling period; the consumption characteristic function Q_i = f(P_i, H_i) expresses the flow Q_i of hydroelectric generating set i as a nonlinear function of its average output P_i and average head H_i within the scheduling period, specifically expressed as:
Q_i = f(P_i, H_i)
where f is the consumption characteristic function of the hydroelectric generating set;
when the real-time output adjustment of hydroelectric generating set i is ΔP_i, the corresponding water-consumption adjustment ΔQ_i is expressed as:
ΔQ_i = f(ΔP_i, H_i)
for the deviation values of the output and water consumption of the hydroelectric generating sets, the power-deviation balance constraint, the output constraint of each unit, the flow constraint of each unit and the vibration-zone constraint of each unit during operation are established in combination with the real-time operation variables, specifically expressed as follows:
the power-deviation balance constraint is specifically expressed as:
Σ_{i=1}^{N} ΔP_i = ΔP_L
where ΔP_i is the real-time output adjustment of hydroelectric generating set i, and ΔP_L is the deviation between the total system load and the day-ahead output plan within the 15 min scheduling period;
the unit output constraint is specifically expressed as:
P_{i,min} ≤ P_i ± ΔP_i ≤ P_{i,max}
where P_{i,max} and P_{i,min} are respectively the upper and lower output limits of hydroelectric generating set i;
the unit flow constraint is specifically expressed as:
Q_{i,min} ≤ Q_i ± ΔQ_i ≤ Q_{i,max}
where Q_{i,max} and Q_{i,min} are respectively the upper and lower limits of the generating flow of hydroelectric generating set i;
the unit cavitation-vibration-zone constraint is specifically expressed as:
(P_i + ΔP_i - P_{zi,max})(P_i + ΔP_i - P_{zi,min}) ≥ 0
where P_{zi,max} and P_{zi,min} are respectively the upper and lower limits of the cavitation-vibration zone of hydroelectric generating set i during operation.
3. The method for real-time optimal scheduling in a hydropower station plant based on a deep reinforcement algorithm according to claim 2, wherein the Markov decision process in step (2) combines reinforcement learning with the main features of the hydropower station real-time optimal scheduling problem, and constructs a learning process by defining an agent, a state set S, an action set A and a reward matrix R in the reinforcement learning algorithm; the agent is the dispatcher of the hydropower station or the automatic generation control system, and learns to select actions from the environment so as to maximize future returns; the interval [0, 5%·P_{i,max}] is divided into K parts as the state set S, where P_{i,max} denotes the maximum output; the action set A is the set of discrete output deviations {ΔP_i} of each unit of the hydropower station; according to the related parameters of the hydroelectric generating sets, the state set S and the action set A, the element value r_t(s_t, s_{t+1}, a_t) of the reward matrix is determined, i.e. the reward obtained when, from the state s_t of the current period, action a_t is taken and the state is updated to s_{t+1} of the next period.
4. The method for real-time optimized dispatching in hydropower plants based on the Deep reinforcement algorithm as claimed in claim 1, wherein the parameter initialization setting of Deep reinforcement Learning algorithm Deep Q-Learning in step (3) comprises the following steps:
31) randomly initializing a state s, initializing an experience memory pool D, and setting the capacity of the experience memory pool D as N;
32) constructing a Q network and a target Q network, randomly initializing the Q network weight θ, and setting the target Q network weight θ^- = θ;
33) initializing the step-size factor α and the discount factor γ;
34) initializing the number of training iterations M.
5. The method for real-time optimized dispatching in hydropower plants based on the deep reinforcement algorithm according to claim 4, wherein the deep reinforcement learning algorithm in the step (3) specifically comprises the following steps:
321) selecting an action a according to an epsilon-greedy strategy;
322) executing the action a to obtain an instant reward r, a next state s' and a termination state done;
323) storing {s, a, r, s', done} as a group of batch data in the experience memory pool D;
324) judging whether the batch data in the experience memory pool D is more than or equal to N:
when the batch data in D is larger than or equal to N, randomly extracting m batch data in D as training samples, wherein m is 32;
taking s' of all training samples as the input of the Q network to obtain the Q value of each action in state s':
target_Q = r + γ·max_{a'} Q(s', a'; θ^-)
where Q(s_t, a_t) is the Q value of executing action a_t in state s_t, r + γ·max_{a'} Q(s', a'; θ^-) is the expected return obtainable by executing action a_t in state s_t, and the target_Q value corresponding to the Q value is calculated from the target Q network;
training the Q network by applying a gradient descent algorithm to the Q value and the target_Q value, and updating the target Q network once every C steps, i.e. setting θ^- = θ;
325) when the batch data in D are fewer than N, judging whether the termination state has been reached:
if it is the termination state, searching for the next initialization state s and continuing the above steps;
if it is not the termination state, converting the current state s into the new next state s' and repeating the above steps in a loop.
CN202210548151.5A 2022-05-18 2022-05-18 Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method Pending CN114841595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210548151.5A CN114841595A (en) 2022-05-18 2022-05-18 Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210548151.5A CN114841595A (en) 2022-05-18 2022-05-18 Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method

Publications (1)

Publication Number Publication Date
CN114841595A 2022-08-02

Family

ID=82569781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210548151.5A Pending CN114841595A (en) 2022-05-18 2022-05-18 Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method

Country Status (1)

Country Link
CN (1) CN114841595A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number Priority date Publication date Assignee Title
CN117132089A (en) * 2023-10-27 2023-11-28 邯郸欣和电力建设有限公司 Power utilization strategy optimization scheduling method and device
CN117132089B (en) * 2023-10-27 2024-03-08 邯郸欣和电力建设有限公司 Power utilization strategy optimization scheduling method and device

Similar Documents

Publication Publication Date Title
Li et al. Coordinated load frequency control of multi-area integrated energy system using multi-agent deep reinforcement learning
CN113363997B (en) Reactive voltage control method based on multi-time scale and multi-agent deep reinforcement learning
Pan et al. A comparison of neural network backpropagation algorithms for electricity load forecasting
CN113363998B (en) Power distribution network voltage control method based on multi-agent deep reinforcement learning
CN103729695A (en) Short-term power load forecasting method based on particle swarm and BP neural network
CN110826774B (en) Bus load prediction method and device, computer equipment and storage medium
CN112491094B (en) Hybrid-driven micro-grid energy management method, system and device
CN114362187B (en) Active power distribution network cooperative voltage regulation method and system based on multi-agent deep reinforcement learning
CN112012875B (en) Optimization method of PID control parameters of water turbine regulating system
Han et al. Lightweight actor-critic generative adversarial networks for real-time smart generation control of microgrids
CN115293052A (en) Power system active power flow online optimization control method, storage medium and device
CN115345380A (en) New energy consumption electric power scheduling method based on artificial intelligence
CN114841595A (en) Deep-enhancement-algorithm-based hydropower station plant real-time optimization scheduling method
CN113872213B (en) Autonomous optimization control method and device for power distribution network voltage
CN115795992A (en) Park energy Internet online scheduling method based on virtual deduction of operation situation
Li et al. Data‐driven cooperative load frequency control method for microgrids using effective exploration‐distributed multi‐agent deep reinforcement learning
Zeng et al. Distributed deep reinforcement learning-based approach for fast preventive control considering transient stability constraints
CN111799820B (en) Double-layer intelligent hybrid zero-star cloud energy storage countermeasure regulation and control method for power system
Sarangi et al. Short term load forecasting using artificial neural network: a comparison with genetic algorithm implementation
Wei et al. A combination forecasting method of grey neural network based on genetic algorithm
CN110289643B (en) Rejection depth differential dynamic planning real-time power generation scheduling and control algorithm
Patil et al. Soft Computing Techniques for the Integration of Distributed Energy Resources (DERs)
Kang et al. Power flow coordination optimization control method for power system with DG based on DRL
CN117893043A (en) Hydropower station load distribution method based on DDPG algorithm and deep learning model
Wang et al. Complexity-Based Structural Optimization of Deep Belief Network and Application in Wastewater Treatment Process

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination