CN115049292A - Intelligent single reservoir flood control scheduling method based on DQN deep reinforcement learning algorithm - Google Patents
- Publication number
- CN115049292A CN115049292A CN202210741864.3A CN202210741864A CN115049292A CN 115049292 A CN115049292 A CN 115049292A CN 202210741864 A CN202210741864 A CN 202210741864A CN 115049292 A CN115049292 A CN 115049292A
- Authority
- CN
- China
- Prior art keywords
- reservoir
- scheduling
- decision
- learning
- period
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06312—Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A10/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
- Y02A10/40—Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping
Landscapes
- Business, Economics & Management (AREA)
- Engineering & Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Economics (AREA)
- Strategic Management (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Entrepreneurship & Innovation (AREA)
- General Physics & Mathematics (AREA)
- Marketing (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Development Economics (AREA)
- Game Theory and Decision Science (AREA)
- Educational Administration (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Primary Health Care (AREA)
- Life Sciences & Earth Sciences (AREA)
- Water Supply & Treatment (AREA)
- Biomedical Technology (AREA)
- Public Health (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Feedback Control In General (AREA)
Abstract
The invention relates to an intelligent single-reservoir flood control scheduling method based on the DQN deep reinforcement learning algorithm, comprising the following steps: constructing an artificial-intelligence-based "unsupervised deep learning" model for reservoir scheduling; establishing DRL reward feedback based on reservoir power generation scheduling; and establishing an artificial intelligence scheduling expert for a given reservoir based on its measured inflow runoff process. Compared with the optimal power generation scheduling process solved by dynamic programming, the power generation scheduling result of the intelligent single-reservoir flood control scheduling method based on the DQN deep reinforcement learning algorithm is clearly superior to that of traditional decision-tree-based reservoir power generation scheduling, and the unsupervised deep learning model for reservoir scheduling shows strong learning and decision-making capability and strong adaptability in reservoir scheduling decisions.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a single reservoir intelligent flood control scheduling method based on a DQN deep reinforcement learning algorithm.
Background
In 2016, the success of the Go program AlphaGo revealed the potential of artificial intelligence. The emergence of AlphaGo was a milestone that triggered a surge in artificial intelligence development. Driven by this wave, core AI technologies have accelerated and spread to other industries. During play, AlphaGo must weigh the after-effects and the winning probability of every move. The core algorithm of AlphaGo is Deep Reinforcement Learning (DRL), which is suited to mapping states to decisions, and in particular to decision processes with the Markov property.
Traditional reinforcement learning theory has been continuously refined over recent decades, but it still struggles with complex real-world problems, especially those involving many states and many decisions. Deep reinforcement learning (DRL) is the product of combining deep learning with reinforcement learning: it integrates the strong comprehension ability of deep learning on problems such as vision and cognition with the decision-making ability of reinforcement learning, forming a new mode of end-to-end learning from perception to action. This mode gives machine learning "autonomous learning" in a real sense. DRL has made artificial intelligence genuinely practical, giving it strong learning and adaptive ability in complex environments with high-dimensional states and decisions.
In the water conservancy industry, reservoir scheduling is a typical Markov decision process: scheduling decisions depend on state conditions such as reservoir storage and incoming water, so reservoir scheduling and the DRL algorithm fit together closely. If DRL technology is extended to the water conservancy industry, reservoir scheduling will be one of its main fields of application.
Therefore, how to introduce the DRL technology into reservoir scheduling, adapt to reservoir scheduling decisions and determine the optimal control process of reservoir power generation is a problem to be solved.
Disclosure of Invention
In order to overcome the problems in the prior art, the invention provides a single reservoir intelligent flood control scheduling method based on a DQN deep reinforcement learning algorithm. The method is based on a DQN network and a DRL model, adopts reservoir power generation as reward feedback, establishes a reservoir operation control model based on a deep reinforcement learning algorithm, and establishes a reservoir dispatching artificial intelligence expert, thereby determining the optimal control process of the reservoir power generation.
The purpose of the invention is realized as follows:
a single reservoir intelligent flood control scheduling method based on a DQN deep reinforcement learning algorithm comprises the following steps:
step 1: constructing an artificial intelligence-based reservoir scheduling unsupervised deep learning model:
respectively establishing a brain, a memory library and an 'autonomous learning' algorithm module of the Agent by taking a DRL technical architecture as a reference;
the brain of the Agent is constructed with the Deep Q-Network (DQN) algorithm and contains two neural networks: an Action Network (AN) and a Target Network (TN);
the memory bank stores the scheduling knowledge generated during scheduling; the scheduling decision of each time period forms one piece of knowledge;
the "autonomous learning" module continuously increases the value function based on the Bellman equation, so that the decision-making capability of the Agent keeps improving; as the number of learning passes grows, the value function of the current pass is represented by the average of the value functions computed over the k learning passes so far:

U_k = (1/k) · Σ_{i=1}^{k} u_i = ((k − 1) · U_{k−1} + u_k) / k

where u_k is the decision value function under the given scheduling state at the k-th learning pass; u_i is the decision value function under the given scheduling state at the i-th pass; U_k is the average value function obtained after the k-th pass; and U_{k−1} is the average value function obtained after the (k−1)-th pass.
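As a minimal sketch (not from the patent text), this running average can be computed incrementally rather than by re-summing all passes; the function name and sample values below are illustrative:

```python
def incremental_average(U_prev: float, u_k: float, k: int) -> float:
    """Running mean over k learning passes:
    U_k = ((k - 1) * U_{k-1} + u_k) / k, written in incremental form."""
    return U_prev + (u_k - U_prev) / k

# The incremental form reproduces the batch mean of all passes seen so far.
passes = [4.0, 6.0, 5.0, 9.0]   # u_1 .. u_4 (illustrative values)
U = 0.0
for k, u in enumerate(passes, start=1):
    U = incremental_average(U, u, k)
```

The incremental form avoids storing every past value function, which matters once the number of learning passes grows large.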
After each autonomous learning pass, the value function is updated as follows:

U_k(S_t, A_t) = (1 − α) · U_{k−1}(S_t, A_t) + α · u(S_t, A_t)

where S_t is the condition attribute at the beginning of period t; A_t is the decision attribute at the beginning of period t; α is the learning rate; and u is the decision value function under the given scheduling state;
in the decision value estimation of reservoir power generation scheduling, the U value of each decision in the decision set is computed from the state S_{t+1}, and the decision benefit is evaluated by averaging:

u(S_t, A_t) = R_t + λ · (1/n) · Σ_{A_{t+1}} U(S_{t+1}, A_{t+1})

where R_t is the decision benefit value obtained in period t; S_{t+1} is the condition attribute at the end of period t; A_{t+1} is a decision attribute at the end of period t; n is the number of decisions in the decision set; and λ is the discount factor;
after each learning pass, the error feedback used to update the neural network weight parameters by gradient descent is computed from the change of the value function:

E_k = U_k − U_{k−1}

where E_k is the difference between the value functions of the (k−1)-th and k-th learning passes;
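Taken together, these relations amount to a small update rule. The following sketch (names and numbers are illustrative, not from the patent) strings the averaged decision-value estimate, the learning-rate update, and the error feedback together:

```python
from statistics import mean

def decision_value(R_t: float, next_values: list, lam: float) -> float:
    """u(S_t, A_t): period benefit plus the discounted MEAN value of the
    decisions available at the end-of-period state S_{t+1} (averaging
    evaluation, as described above, rather than a max)."""
    return R_t + lam * mean(next_values)

def update_value(U_prev: float, u: float, alpha: float) -> float:
    """U_k = (1 - alpha) * U_{k-1} + alpha * u."""
    return (1 - alpha) * U_prev + alpha * u

u_val = decision_value(R_t=10.0, next_values=[2.0, 4.0, 6.0], lam=0.9)
U_new = update_value(U_prev=12.0, u=u_val, alpha=0.5)
E_k = U_new - 12.0   # error fed back to the network weights
```

With α = 0.5 the new estimate splits the difference between the remembered value and the fresh one; larger α weights the current decision value more heavily.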
step 2: and (3) establishing reward feedback of the DRL on the basis of reservoir power generation scheduling:
evaluating the benefit of the decision according to the state of the current time period and the obtained decision, and feeding back the benefit in a reward mode; wherein, the generated energy and whether the guaranteed output is reached are taken as indexes of benefit evaluation;
Step 3: establish an artificial intelligence scheduling expert for a given reservoir based on its measured inflow runoff process:
the measured reservoir inflow runoff information and the corresponding scheduling period are taken as the input state; autonomous learning is carried out through the "autonomous learning" algorithm module, and the brain of the Agent decides the reservoir operation for the coming period, i.e., the power generation output; on this basis, the reward of this operation, i.e., the power generation benefit, is estimated and returned through a reservoir power generation scheduling simulation; the state, operation and benefit of the reservoir are then stored in the memory bank as a piece of knowledge; when the memory bank holds enough knowledge and the learning condition is met, the Agent begins to learn from the knowledge in memory, then continues actual scheduling operations to obtain new knowledge and update the memory bank; by cycling this learning–scheduling process, the Agent gradually matures into an artificial intelligence expert for reservoir scheduling;
and using the established reservoir dispatching artificial intelligence expert for the reservoir power generation dispatching decision to determine the optimal control process of the reservoir power generation.
Further, in the memory bank, the scheduling decision of period t is formed into a piece of knowledge by the condition attribute at the beginning of period t (S_t), the decision attribute (Action), the reward (Reward), and the condition attribute at the end of period t (S_{t+1}), and is stored as follows:

<S_t = (T_t, L_t, Q_t), Reward = R_t, Action = A_t, S_{t+1} = (T_{t+1}, L_{t+1}, Q_{t+1})>

where S_t and S_{t+1} are the condition attributes at the beginning and end of period t, respectively; R_t is the penalized power generation benefit of period t; A_t is the decision attribute of period t; T_t and T_{t+1} are the numbers of the scheduling periods within the year; L_t and L_{t+1} are the reservoir control water levels at the beginning of periods t and t+1, respectively; and Q_t and Q_{t+1} are the reservoir inflow volumes of periods t and t+1, respectively.
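A piece of knowledge of this shape maps naturally onto a replay-memory record. The sketch below assumes a fixed capacity and uses illustrative numbers; neither is specified in the patent:

```python
import random
from collections import deque, namedtuple

# One piece of scheduling knowledge: <S_t, Reward, Action, S_{t+1}>,
# with each state S = (T, L, Q): period number, control water level, inflow.
Knowledge = namedtuple("Knowledge", ["S_t", "reward", "action", "S_next"])

class MemoryBank:
    """Fixed-capacity memory bank; the oldest knowledge is evicted when full."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, piece: Knowledge) -> None:
        self.buffer.append(piece)

    def sample(self, batch_size: int):
        """Random batch for 'recollection' learning."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)

bank = MemoryBank()
bank.store(Knowledge(S_t=(14, 295.0, 1200.0), reward=3.2,
                     action=220.0, S_next=(15, 296.1, 980.0)))
```

Random sampling from the bank breaks the temporal correlation between consecutive scheduling periods, which is the usual reason DQN-style methods learn from a memory rather than from the latest transition alone.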
Further, the reward feedback in step 2 is given by:

R(K_t, Q_t, N_t) = [b(K_t, Q_t, N_t) − a · {max(e − b(K_t, Q_t, N_t), 0)}^b] · Δt

V_{t+1} = V_t + Q_t − Q_{p,t} − Q_{s,t}

where R is the penalized power generation benefit of period t, i.e. the Reward; K_t is the water storage level at the beginning of period t; Q_t is the total inflow volume of the reservoir in period t; N_t is the power generation output of the reservoir in period t; Q_{p,t} is the total power generation flow in period t; Q_{s,t} is the spilled water volume in period t; b(·) is the hydropower station's power generation in period t; a and b are penalty coefficients; e is the guaranteed output of the system; V_t is the storage volume at the beginning of period t; Δt is the length of the scheduling period; and V_{t+1} is the storage volume at the end of period t;

the constraints are:

K_min ≤ K_t ≤ K_max
0 ≤ N_t ≤ N_M
0 ≤ Q_t ≤ Q_M

where K_min and K_max are the minimum and maximum allowed reservoir water levels in period t; N_t is the decision output of period t; N_M is the installed capacity of the reservoir, i.e. its maximum power generation output; and Q_M is the maximum discharge capacity of the turbines.
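A direct transcription of the penalized reward, the water balance, and the constraints might look like the following sketch (the penalty exponent is named `p` here to avoid clashing with the energy function `b(·)`; all numbers are illustrative):

```python
def reward(b_t: float, e: float, a: float = 1.0, p: float = 2.0,
           dt: float = 1.0) -> float:
    """Penalized generation benefit for one period:
    R = [b - a * max(e - b, 0) ** p] * dt,
    where b_t is the period's energy output and e the guaranteed output."""
    return (b_t - a * max(e - b_t, 0.0) ** p) * dt

def water_balance(V_t: float, Q_in: float, Q_gen: float, Q_spill: float) -> float:
    """V_{t+1} = V_t + Q_t - Q_{p,t} - Q_{s,t} (consistent volume units)."""
    return V_t + Q_in - Q_gen - Q_spill

def feasible(K_t, N_t, Q_t, K_min, K_max, N_M, Q_M) -> bool:
    """Water-level, output, and turbine discharge-capacity constraints."""
    return (K_min <= K_t <= K_max) and (0 <= N_t <= N_M) and (0 <= Q_t <= Q_M)
```

When output falls below the guaranteed value, the quadratic penalty dominates the reward; this is what steers the agent away from under-producing relative to the firm output.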
The DRL algorithm is oriented to decision and control problems, and decision and control directly determine the degree of intelligence of an artificial intelligence system. Conventional reinforcement learning makes decisions with a state–decision table, which limits its capability. The DRL algorithm takes the Bellman equation as its core and uses a deep neural network to model the relation between state and decision, effectively improving learning, decision-making and control in high-dimensional state and decision environments. Reservoir scheduling is a typical Markov decision process whose decisions depend on state conditions such as reservoir storage and incoming water, so reservoir scheduling and the DRL algorithm fit together closely. Extending DRL technology to the water conservancy industry, reservoir scheduling will be one of its main fields of application.
Compared with the prior art, the invention has the advantages and beneficial effects that:
the invention applies the deep reinforcement learning technology in the artificial intelligence algorithm to the reservoir power generation scheduling decision, explores the coupling mode of the deep reinforcement learning and the reservoir power generation scheduling decision, and has application potential. Firstly, establishing a DRL learning model based on a deep Q value learning algorithm (DQN) according to a theoretical framework of an RL algorithm; then coupling the DRL algorithm with a reservoir power generation dispatching model by taking the Reward estimation as a connection point; and finally, on the basis of a random simulation runoff process, constructing an unsupervised deep learning model for reservoir scheduling through unsupervised autonomous learning. Compared with the optimal power generation scheduling process of dynamic planning and solving, the power generation scheduling result of the single reservoir intelligent flood control scheduling method based on the DQN deep reinforcement learning algorithm is obviously superior to the traditional reservoir power generation scheduling result based on the decision tree, and the reservoir scheduling unsupervised deep learning model has strong learning capacity and decision making capacity and strong adaptability in reservoir scheduling decision making.
Drawings
The invention is further illustrated by the following figures and examples.
FIG. 1 is a schematic network structure diagram of an artificial intelligence-based reservoir scheduling unsupervised deep learning model of the invention;
FIG. 2 is a Q-Learning algorithm based cost function update process according to the present invention;
FIG. 3 is a comparison diagram of the water level control processes of the Huanren reservoir under different networks according to the present invention;
FIG. 4 is a graph illustrating the effect of learning efficiency parameters on DRL "autonomous learning" efficiency;
FIG. 5 is a diagram of the deviations of the DRL-based and decision-tree-rule-based power generation scheduling processes from the optimal process.
Detailed Description
Embodiment:
In this embodiment, the power generation scheduling of the Huanren reservoir is taken as an example: DRL technology is introduced into reservoir scheduling, and the intelligent single-reservoir flood control scheduling method based on the DQN deep reinforcement learning algorithm is applied using the artificial-intelligence-based "unsupervised deep learning" model for reservoir scheduling constructed above. The Huanren reservoir is located in the middle and lower reaches of the Hun River, at approximately east longitude 124°36′50″ and north latitude 40°42′15″, with a dam-site control basin area of 10,364 km². The mean annual rainfall of the basin is 860 mm; about 70% of the rainfall is concentrated from June to September, and large floods generally occur from late July to mid-August. The Huanren reservoir lies at the upstream end of the reservoir group and is the leading reservoir of the Hun River cascade; it has annual regulation capability, serves power generation as its main purpose, and provides comprehensive benefits including flood control, irrigation, aquaculture and tourism.
The basic parameters of the Huanren reservoir are shown in Table 1.
TABLE 1. Basic parameters of the Huanren reservoir
In this embodiment, a 400-year runoff process is generated by stochastic simulation on the basis of the measured ten-day average inflow of the Huanren reservoir up to 2010, and the DRL deep learning model is trained on the simulated runoff process, thereby establishing an artificial intelligence expert for ten-day scheduling of the Huanren reservoir.
The intelligent flood control scheduling method comprises the following steps:
step 1: constructing an artificial intelligence-based reservoir scheduling unsupervised deep learning model:
respectively establishing a brain, a memory library and an 'autonomous learning' algorithm module of the Agent by taking a DRL technical architecture as a reference;
the brain of the Agent is constructed with the Deep Q-Network (DQN) algorithm; the Agent contains two neural networks, an Action Network (AN) and a Target Network (TN); the DRL learns by "recollection", the aim of learning being to train the AN and TN, and the Sarsa algorithm is adopted to update the Q value in the DQN.
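The AN/TN pairing and the Sarsa-style Q-value target can be sketched as follows; the linear model is a toy stand-in for the patent's neural networks, and all dimensions are illustrative:

```python
class LinearQ:
    """Toy linear Q-model per action, standing in for the AN/TN networks."""
    def __init__(self, n_features: int, n_actions: int):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def q(self, state, action: int) -> float:
        return sum(wi * si for wi, si in zip(self.w[action], state))

    def copy_from(self, other: "LinearQ") -> None:
        """Periodic synchronisation: TN <- AN, as in DQN."""
        self.w = [row[:] for row in other.w]

action_net = LinearQ(n_features=3, n_actions=4)  # AN: chooses decisions, trained every pass
target_net = LinearQ(n_features=3, n_actions=4)  # TN: frozen copy used for targets

def sarsa_target(r: float, s_next, a_next: int, lam: float) -> float:
    """On-policy (Sarsa) target: r + lam * Q_TN(S_{t+1}, A_{t+1})."""
    return r + lam * target_net.q(s_next, a_next)
```

Keeping the target network frozen between synchronisations stabilises the moving target that the action network is trained against; the Sarsa target uses the decision actually taken at S_{t+1} rather than the maximising one.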
In the memory bank, the scheduling knowledge generated during scheduling is stored, and the scheduling decision of each period forms one piece of knowledge. Specifically, the scheduling decision of period t is formed into a piece of knowledge by the condition attribute at the beginning of period t (S_t), the decision attribute (Action), the reward (Reward), and the condition attribute at the end of period t (S_{t+1}), and is stored in the memory bank as follows:

<S_t = (T_t, L_t, Q_t), Reward = R_t, Action = A_t, S_{t+1} = (T_{t+1}, L_{t+1}, Q_{t+1})>   (1)

where S_t and S_{t+1} are the condition attributes at the beginning and end of period t, respectively; R_t is the penalized power generation benefit of period t; A_t is the decision attribute of period t; T_t and T_{t+1} are the numbers of the scheduling periods within the year; L_t and L_{t+1} are the reservoir control water levels at the beginning of periods t and t+1, respectively; and Q_t and Q_{t+1} are the reservoir inflow volumes of periods t and t+1, respectively.
The "autonomous learning" module continuously increases the value function based on the Bellman equation, so that the decision-making capability of the Agent keeps improving; as the number of learning passes grows, the value function of the current pass is represented by the average of the value functions computed over the k learning passes so far:

U_k = (1/k) · Σ_{i=1}^{k} u_i = ((k − 1) · U_{k−1} + u_k) / k   (2)

where u_k is the decision value function under the given scheduling state at the k-th learning pass; u_i is the decision value function under the given scheduling state at the i-th pass; U_k is the average value function obtained after the k-th pass; and U_{k−1} is the average value function obtained after the (k−1)-th pass.
After each autonomous learning pass, the value function is updated as follows:

U_k(S_t, A_t) = (1 − α) · U_{k−1}(S_t, A_t) + α · u(S_t, A_t)   (3)

where S_t is the condition attribute at the beginning of period t; A_t is the decision attribute at the beginning of period t; α is the learning rate (the larger its value, the more weight is placed on the current decision value); and u is the decision value function under the given scheduling state;
in the decision value estimation of reservoir power generation scheduling, the U value of each decision in the decision set is computed from the state S_{t+1}, and the decision benefit is evaluated by averaging:

u(S_t, A_t) = R_t + λ · (1/n) · Σ_{A_{t+1}} U(S_{t+1}, A_{t+1})   (4)

where R_t is the decision benefit value obtained in period t; S_{t+1} is the condition attribute at the end of period t; A_{t+1} is a decision attribute at the end of period t; n is the number of decisions in the decision set; and λ is the discount factor (the larger its value, the greater the influence of the remaining periods on the decision value function);
after each learning pass, the error feedback used to update the neural network weight parameters by gradient descent is computed from the change of the value function:

E_k = U_k − U_{k−1}   (5)

where E_k is the difference between the value functions of the (k−1)-th and k-th learning passes.
Step 2: establish the reward feedback of the DRL based on reservoir power generation scheduling:
the benefit of a decision is evaluated from the state of the current period and the decision taken, and is fed back in the form of a reward; the power generation amount, and whether the guaranteed output is reached, serve as the benefit-evaluation indexes.
The reward feedback is given by:

R(K_t, Q_t, N_t) = [b(K_t, Q_t, N_t) − a · {max(e − b(K_t, Q_t, N_t), 0)}^b] · Δt   (6)

V_{t+1} = V_t + Q_t − Q_{p,t} − Q_{s,t}   (7)

where R is the penalized power generation benefit of period t, i.e. the Reward; K_t is the water storage level at the beginning of period t; Q_t is the total inflow volume of the reservoir in period t; N_t is the power generation output of the reservoir in period t; Q_{p,t} is the total power generation flow in period t; Q_{s,t} is the spilled water volume in period t; b(·) is the hydropower station's power generation in period t, computed from the water consumption rate, the water head, etc.; a and b are penalty coefficients determined by the power generation guarantee rate of the hydropower station, taken here as 1 and 2 respectively; e is the guaranteed output of the system; V_t is the storage volume at the beginning of period t; Δt is the length of the scheduling period; and V_{t+1} is the storage volume at the end of period t;
the constraints are:

K_min ≤ K_t ≤ K_max   (8)
0 ≤ N_t ≤ N_M   (9)
0 ≤ Q_t ≤ Q_M   (10)

where K_min and K_max are the minimum and maximum allowed reservoir water levels in period t; N_t is the decision output of period t; N_M is the installed capacity of the reservoir, i.e. its maximum power generation output; and Q_M is the maximum discharge capacity of the turbines.
Step 3: establish an artificial intelligence scheduling expert for a given reservoir based on its measured inflow runoff process:
In an artificial intelligence system, an Agent denotes an object with behavioral capability, such as a robot. The main task of reinforcement learning is to enable the Agent to learn and master knowledge of the Environment through exploration, forming its own knowledge system and memory.
The learning mode of the DRL can be illustrated by game playing. The Agent perceives the state of the characters in the game Environment through its eyes, selects the best keyboard operation for them through its brain, evaluates the quality of each operation using the state fed back by the Environment, and then continuously adjusts its operations to steer the game toward victory. After the Agent has played the game hundreds of times, the operating process and the keys to winning are stored in its memory. The Agent's memory comes from the experience of its own play, and the experience of others can also be stored in it. By continuously learning and summarizing through recollection, the Agent gradually becomes an expert player of the game.
The cultivation of the reservoir-scheduling artificial intelligence expert follows the same process, as shown in Fig. 1. The measured reservoir inflow runoff information and the corresponding scheduling period are taken as the input state, and autonomous learning is carried out by the 'autonomous learning' algorithm module: the Agent first perceives the current state of the reservoir, including the water storage level and the incoming water, and its brain decides the reservoir operation for the coming period. On this basis, a reservoir power-generation scheduling simulation estimates and returns the reward of that operation, namely the power generation benefit. The state, operation and benefit of the reservoir are then stored in the memory bank as one piece of knowledge; once the memory bank holds enough knowledge and the learning condition is met, the Agent starts to learn from the knowledge in memory, and then continues actual scheduling operations to obtain new knowledge and update the memory bank. By cycling this learning-scheduling process, the Agent gradually matures into an artificial intelligence expert for reservoir scheduling, which is then used in reservoir power-generation scheduling decisions to determine the optimal control process of reservoir power generation.
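The perceive-decide-reward-memorize-learn cycle described above can be sketched in a few lines. The `agent`, `env`, `MemoryBank` and `train` names are illustrative stand-ins, and the capacity and threshold defaults merely echo the values of Table 4; the real 'autonomous learning' module is the DQN-based model of step 1.

```python
import random
from collections import deque

class MemoryBank:
    """Stores scheduling knowledge; one decision per period = one knowledge."""
    def __init__(self, capacity=2000):
        self.bank = deque(maxlen=capacity)
    def store(self, knowledge):
        self.bank.append(knowledge)
    def sample(self, w=200):
        return random.sample(self.bank, min(w, len(self.bank)))

def train(agent, env, episodes=400, learn_threshold=200):
    """Cycle: sense state -> decide -> simulate reward -> store -> learn."""
    memory = MemoryBank()
    for _ in range(episodes):                 # one episode = one simulated runoff year
        state = env.reset()                   # (period number, water level, inflow)
        done = False
        while not done:
            action = agent.decide(state)                 # brain picks an output level
            next_state, rew, done = env.step(action)     # simulated generation benefit
            memory.store((state, action, rew, next_state))
            if len(memory.bank) >= learn_threshold:
                agent.learn(memory.sample())             # replay stored knowledge
            state = next_state
    return memory
```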
The research results for the case-study reservoir are as follows:
an artificial intelligence agent is established for the power generation scheduling of the reservoir. Because reservoir power generation scheduling is a Markov process, in which different reservoir states call for different scheduling strategies and lead to very different end-of-period states, the water storage level and the incoming water of the reservoir are used as input states. In addition, because the incoming-water process varies strongly within the year, the remaining-period benefit differs between scheduling periods, so the scheduling period must also be one of the input states. In this embodiment, an artificial intelligence expert for ten-day scheduling is established based on runoff information at the ten-day scale, so the state space of the scheduling period is the ten-day-period number within one year, i.e. 1, 2, …, 36. The decision of reservoir power generation scheduling can be either the generation discharge flow or the power generation output of the units; this embodiment takes the power generation output as the decision.
In the deep learning of the DRL based on the DQN network, both the state and the decision exist in discrete form. To improve operability in practical application, the inflow runoff of the case-study reservoir is discretized at every magnitude. The discretization jointly considers the flow requirements of the reservoir's units and of the downstream reservoir, and the inflow runoff is discretized according to the criteria of Table 2, where 150 m³/s is the maximum discharge capacity of a single unit at full output, 300 m³/s is the flow required by the downstream reservoir, and 500 m³/s is the full-gate flow.
TABLE 2 Grading standard of the 1-, 3-, 7- and 10-day average inflow runoff of the case-study reservoir
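A minimal sketch of the inflow discretization implied by Table 2, assuming the grade boundaries are exactly the three thresholds quoted above (150, 300 and 500 m³/s); the grade indices are illustrative, since the table body is not reproduced here.

```python
def inflow_grade(q_m3s):
    """Map an average inflow (m^3/s) to a discrete grade index."""
    thresholds = (150.0, 300.0, 500.0)   # single-unit max / downstream demand / full gate
    for grade, limit in enumerate(thresholds):
        if q_m3s <= limit:
            return grade
    return len(thresholds)               # inflow above the highest threshold
```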
According to the actual power generation requirements and unit configuration of the case-study reservoir, the output of the reservoir is discretized into 6 grades, i.e. 6 decisions, as shown in Table 3. According to the input state information, the DQN network selects the optimal output from these 6 decisions to carry out reservoir scheduling.
TABLE 3 Cluster centers of the power generation output of the case-study reservoir (thousand kilowatts)
The DRL model simulates a 400-year runoff process from the measured inflow runoff process, performs 'autonomous simulation' on the simulated runoff, and uses the knowledge obtained from the simulation for learning. The learned DRL is then used to decide the power generation scheduling process of the case-study reservoir. To evaluate the decision-making ability of the DRL in reservoir power generation scheduling, dynamic programming is used to determine the optimal control process of reservoir power generation as a benchmark.
To analyze the influence of the decision value function on reservoir power generation scheduling when the unsupervised deep learning model is applied, the decision benefit is evaluated with the maximum value and with the average value respectively, giving the DRL1 and DRL2 models.
In the DRL1 model, in the decision value estimation of reservoir power generation scheduling, the U value of each decision in the decision set is calculated from the state S_{t+1}, and the maximum U value is selected as the remaining-period power generation benefit (Rr), as shown in Fig. 2(a).
In the DRL2 model, according to Eq. (4), the U value of each decision in the decision set is likewise calculated from the state S_{t+1}, and the average of these U values is taken as the remaining-period power generation benefit (Rr), as shown in Fig. 2(b).
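The two estimators can be sketched side by side. The function names and the list `u_next` (assumed to hold U(S_{t+1}, A) for each candidate decision A in the decision set) are illustrative; `mode="max"` corresponds to DRL1 and `mode="mean"` to DRL2.

```python
def remaining_benefit(u_next, mode="mean"):
    """Remaining-period benefit Rr from the U values of the decision set."""
    if mode == "max":                       # DRL1: chase the single largest U value
        return max(u_next)
    return sum(u_next) / len(u_next)        # DRL2: average over the decision set

def decision_value(reward, u_next, lam=0.9, mode="mean"):
    # Decision value = immediate benefit + discounted remaining-period benefit
    return reward + lam * remaining_benefit(u_next, mode)
```

Averaging keeps the estimate from being dominated by a single optimistic decision, which is exactly the failure mode the DRL1 results below exhibit.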
After the DRL1 and DRL2 models each learned 2000 times on the 400-year simulated runoff, they were used to decide the power generation scheduling process of the case-study reservoir over the 1116 ten-day periods of 1980-2010. The resulting water level control processes, compared with the optimal control process determined by dynamic programming, are shown in Fig. 3.
Fig. 3(a) compares the water level control process of the case-study reservoir based on the DRL1 model with the optimal water level control process. The comparison shows that the DRL1 water level stays at the dead water level in most scheduling periods, and rises to the normal storage level only in the few periods when the flood-season inflow is particularly large. The main reason is that when the DRL1 model evaluates the decision value function, the remaining-period power generation benefit is represented by the maximum U value, so during learning the Agent always chooses the decision with the maximum power generation.
Fig. 3(b) compares the water level control process of the case-study reservoir based on the DRL2 model with the optimal water level control process. The comparison shows that DRL2 has strong decision-making ability: its water level control process is highly consistent with the optimal one.
Therefore, the invention evaluates the decision benefit with the average value rather than the maximum value.
DRL algorithm theory shows that the learning efficiency of the DRL model is governed by its parameters; the parameter values used in this embodiment are listed in Table 4. The model parameters fall into two categories. The first category, knowledge control parameters, controls the memory capacity, the start condition of 'autonomous learning', and so on; these are low-sensitivity parameters in DRL learning. The second category, learning efficiency parameters, controls the stability of 'autonomous learning', the search of the decision space and the convergence rate; these are sensitive parameters. This embodiment therefore analyzes the influence of the learning efficiency parameters on the 'autonomous learning' of reservoir power generation scheduling.
TABLE 4 Control parameters of the DRL deep learning system
Knowledge control parameter | Value | Learning efficiency parameter | Value |
Total knowledge in memory (M) | 2000 | Learning rate (α) | 0.03 |
Knowledge per learning pass (W) | 200 | Discount factor (λ) | 0.9 |
Learning interval threshold (L) | 50 | Greedy probability (ε) | 0.9 |
Learning knowledge threshold (D) | 200 | Weight update interval (K) | 30 |
FIG. 4 shows the change process of the Reward under different parameter values.
FIG. 4(a) shows the Reward change process under different values of the greedy probability (ε). The greedy probability determines how the scheduling decision switches between 'exploitation' and 'exploration' during 'autonomous simulation'. When ε is 0.95, the probability of 'exploration' is only 0.05, which is unfavorable for finding new knowledge samples, so the learning efficiency is low. When ε is 0.8, the probability of 'exploration' reaches 0.2, and a large amount of 'exploration sample' knowledge is generated and stored in the memory bank during 'autonomous learning'; such knowledge reflects the diversity of the samples. However, only one of the 'exploration samples' is optimal, and if a large amount of 'exploration sample' knowledge remains in the memory bank for a long time, its 'inferior' entries impair the stability and accuracy of learning.
Fig. 4(b) shows the Reward change process with increasing learning passes under different values of the discount factor (λ). The discount factor represents the degree to which the remaining-period power generation benefit influences the decision value: the larger λ, the stronger that influence. When λ is 0.95, the decision value consists mainly of the remaining-period benefit, and the Reward of the current power generation decision has little influence, so the decision value cannot fully reflect the scheduling effect of the decision and the learning efficiency decreases. When λ is too low, the influence of the current Reward increases: the scheduling decision focuses too much on the immediate benefit, the decision values of 'exploration sample' knowledge fluctuate violently, and learning from such samples makes the network model unstable.
Fig. 4(c) shows the Reward change process with increasing learning passes under different values of the learning rate (α). Over the 2000 learning passes, the model achieves the highest Reward when α is 0.03. When α is 0.001, Eq. (3) shows that the average decision value is only weakly influenced by the Reward, and the DRL learning effect is the worst: the average value function over-weights its history during updating, so its value changes little, which is unfavorable for learning reservoir scheduling. When α is 0.3, the current decision value takes too large a share in the update of the average value function. Because the decision value is affected by the period scheduling decision (Action), when the decision is chosen randomly in 'exploration' mode, the decision value of 'exploration sample' knowledge fluctuates strongly, harming the stability of learning and reducing the learning efficiency.
Fig. 4(d) shows the effect of the update interval (K) of the TN network weights in the DQN on the Reward value. K = 10 means the weight parameters of the AN network are assigned to the TN network after every 10 learning passes. When K is small, this assignment is executed frequently and the average value function becomes unstable. When K is large, the assignment must wait a long time, so the weight parameters of the AN and TN networks differ greatly, and updating the average value function by Eq. (2) distorts the value estimate.
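The role of the update interval K can be sketched as a hard copy of the Action Network weights into the Target Network every K-th learning step, as in vanilla DQN; plain Python lists stand in for the network weights here.

```python
def sync_target(an_weights, tn_weights, step, K=30):
    """Copy the AN weights into the TN every K-th learning step (hard update)."""
    if step % K == 0:
        tn_weights[:] = an_weights        # in-place copy, as in vanilla DQN
    return tn_weights
```

Between syncs the TN stays frozen, which is what keeps the learning target stable; too small a K removes that stability, too large a K lets the two networks drift apart.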
Comparative example:
to compare the learning effect of the DRL, this comparative example uses dynamic programming and a decision tree (C5.0) to build power generation scheduling models for the case-study reservoir. The ten-day power generation result of 1980-2010 solved by dynamic programming is taken as the optimal scheduling result, with the energy production and the assurance rate as the comparison benchmarks.
In reservoir power generation scheduling research, decision trees are often used to mine scheduling knowledge. This comparative example uses the decision tree (C5.0) algorithm to mine the optimal power generation scheduling process and to establish a set of power generation scheduling rules suitable for the case-study reservoir. First, the optimal power generation process is obtained by dynamic programming on the simulated 400-year runoff process. Then the state and decision variables of the reservoir are discretized using the standards of Tables 2 and 3 of the embodiment. Finally, the decision tree (C5.0) algorithm is used to mine the power generation scheduling rule set of the case-study reservoir, and according to these rules the power generation scheduling process of the reservoir in 1980-2010 is simulated.
Fig. 5 shows the deviations of the simulated reservoir water level processes, based on the DRL and on the decision-tree rules, from the DP optimal process. As can be seen from Fig. 5, compared with the water level process of DP optimal scheduling, the reservoir water level control process based on the decision-tree scheduling rules fluctuates over a wider range than the DRL water level process; the DRL-based power generation scheduling decisions are closer to the optimal decisions.
Based on the DRL model after 2000 learning passes, the ten-day power generation scheduling process of the case-study reservoir was simulated. The energy production and assurance rates of the dynamic programming, decision tree and DRL models are given in Table 5. DP, as the optimal solution, has the highest energy production and assurance rate. Power generation scheduling based on the decision-tree rules gives poor simulation results. The energy production and assurance rate of the DRL differ little from the DP results, showing that the DRL has good decision-making ability.
TABLE 5 Energy production and assurance rates of the dynamic programming, decision tree and DRL models
On the basis of the DQN (Deep Q-Network) and the DRL (deep reinforcement learning) model, with the reservoir power generation as reward feedback, an unsupervised deep learning model for reservoir scheduling based on artificial intelligence is established. Taking the case-study reservoir as an example, the DRL model is trained on a simulated 400-year runoff process, and its decision-making ability is checked and evaluated on the measured runoff process of 1980-2010. The conclusions are as follows:
(1) when the DRL model is applied to reservoir power generation scheduling, the remaining-period power generation benefit of the reservoir must be evaluated with the average value rather than the maximum value during value-function evaluation. If the uncertainty of runoff or runoff forecast information is considered, Markov state transition or Bayesian theory can be adopted to evaluate the remaining-period power generation benefit of the reservoir.
(2) Under the influence of the model parameters, the learning efficiency of the DRL differs greatly. By comparing the learning rate, the discount factor, the greedy coefficient and the weight update interval, the invention preliminarily determines the value ranges of the different parameters in reservoir power generation scheduling, as well as the degree and manner of their influence on learning efficiency.
(3) Measured against the optimal power generation scheduling result solved by dynamic programming, the DRL-based scheduling result is clearly superior to the traditional decision-tree-based result. This fully displays the strong learning and decision-making abilities of the DRL and its strong adaptability in reservoir scheduling decisions.
Finally, it should be noted that the above is only intended to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to a preferred arrangement, those skilled in the art should understand that the technical solution of the present invention (such as the application of the various formulas, the sequence of steps, etc.) can be modified or equivalently replaced without departing from its spirit and scope.
Claims (3)
1. A single reservoir intelligent flood control scheduling method based on a DQN deep reinforcement learning algorithm is characterized by comprising the following steps:
step 1: constructing an artificial intelligence-based reservoir scheduling unsupervised deep learning model:
respectively establishing the brain, the memory bank and the 'autonomous learning' algorithm module of the Agent with reference to the DRL technical architecture;
the brain of the Agent is constructed with the Deep Q-Network (DQN) algorithm and has a double-layer neural network, namely an Action Network (AN) and a Target Network (TN);
the memory bank stores the scheduling knowledge generated during the scheduling process, each period's scheduling decision forming one piece of knowledge;
the 'autonomous learning' module continuously improves the value function based on the Bellman equation, so that the decision-making ability of the Agent keeps improving; as the number of learning passes increases, the value function of the current pass is embodied by the average of the value functions computed over the adjacent k learning passes, with the formula:
U_k = (u_1 + u_2 + … + u_k) / k
in the formula, u_k is the decision value function under the given scheduling state learned in the k-th pass; u_i is the decision value function under the given scheduling state learned in the i-th pass; U_k is the average value function obtained after the k-th learning pass; U_{k-1} is the average value function obtained after the (k-1)-th learning pass;
after the autonomous learning, the value function is updated with the following formula:
U_k(S_t, A_t) = (1 − α)·U_{k-1}(S_t, A_t) + α·u(S_t, A_t)
in the formula, S_t is the condition attribute at the beginning of period t; A_t is the decision attribute at the beginning of period t; α is the learning rate; u is the decision value function under the given scheduling state;
in the decision value estimation of reservoir power generation scheduling, the U value of each decision in the decision set is calculated from the state S_{t+1}, and the U value evaluates the decision benefit in the form of an average, with the formula:
u(S_t, A_t) = R_t + λ·Ū(S_{t+1})
in the formula, R_t is the decision benefit value obtained in period t; S_{t+1} is the condition attribute at the end of period t; Ū(S_{t+1}) is the average of the U values of all decisions in the decision set at the end of period t; λ is the discount factor;
after each learning pass, the error feedback for updating the neural network weight parameters by the gradient descent method is obtained from the change of the value function, with the formula:
E_k = U_k − U_{k-1}
in the formula, E_k is the difference between the value functions of the (k-1)-th and k-th learning passes;
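A numeric sketch of the claim-1 update chain, assuming the formulas above: the averaged value U_k = (1 − α)·U_{k-1} + α·u, and the error E_k = U_k − U_{k-1} that drives the gradient-descent weight update.

```python
def update_value(U_prev, u_new, alpha=0.03):
    """One value update U_k = (1-alpha)*U_{k-1} + alpha*u, with error E_k."""
    U_new = (1.0 - alpha) * U_prev + alpha * u_new
    return U_new, U_new - U_prev            # (U_k, E_k) feeds the weight update
```

A small α keeps U_k close to its history, so E_k stays small; this is the averaging behavior whose sensitivity is analyzed in Fig. 4(c).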
step 2: establishing reward feedback of the DRL on the basis of reservoir power generation scheduling:
evaluating the benefit of the decision according to the state of the current period and the decision made, and feeding the benefit back as a reward, with the energy production and whether the guaranteed output is reached as the benefit evaluation indexes;
step 3: establishing a scheduling artificial intelligence expert for a given reservoir based on its measured inflow runoff process:
taking the measured reservoir inflow runoff information and the corresponding scheduling period as the input state, carrying out autonomous learning through the 'autonomous learning' algorithm module, and deciding the reservoir operation in the coming period, namely the power generation output, through the brain of the Agent; on this basis, estimating and returning the reward of that operation, namely the power generation benefit, in a reservoir power-generation scheduling simulation mode; then storing the state, operation and benefit of the reservoir into the memory bank as knowledge; when the memory bank holds enough knowledge and the learning condition is met, starting to learn the knowledge in memory, then continuing actual scheduling operations to obtain new knowledge and update the memory bank; and by cycling this learning-actual scheduling process, enabling the Agent to gradually mature into an artificial intelligence expert for reservoir scheduling;
and using the established reservoir dispatching artificial intelligence expert for the reservoir power generation dispatching decision to determine the optimal control process of the reservoir power generation.
2. The intelligent single-reservoir flood control scheduling method based on the DQN deep reinforcement learning algorithm as claimed in claim 1, wherein in the memory bank the scheduling decision of period t, together with the condition attribute at the beginning of the period (S_t), the decision attribute (Action), the reward (Reward) and the condition attribute at the end of the period (S_{t+1}), forms one piece of knowledge stored in the memory bank, with the formula:
<S_t = (T_t, L_t, Q_t), Reward = R_t, Action = A_t, S_{t+1} = (T_{t+1}, L_{t+1}, Q_{t+1})>
in the formula, S_t and S_{t+1} are the condition attributes at the beginning and at the end of period t respectively; R_t is the penalized energy production benefit of period t; A_t is the decision attribute of period t; T_t and T_{t+1} are the scheduling period numbers within the year; L_t and L_{t+1} are the controlled reservoir water levels at the beginning of periods t and t+1 respectively; Q_t and Q_{t+1} are the total inflow volumes of the reservoir in periods t and t+1 respectively.
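One 'knowledge' record of claim 2 can be sketched as a named tuple; the concrete types and the sample numbers below are illustrative assumptions, not values from the patent.

```python
from typing import NamedTuple, Tuple

State = Tuple[int, float, float]            # (T: period number, L: water level, Q: inflow)

class Knowledge(NamedTuple):
    s_t: State                              # S_t = (T_t, L_t, Q_t)
    reward: float                           # R_t, penalized energy benefit
    action: int                             # A_t, index of the chosen output decision
    s_next: State                           # S_{t+1} = (T_{t+1}, L_{t+1}, Q_{t+1})

k = Knowledge(s_t=(12, 300.5, 210.0), reward=38.2, action=3,
              s_next=(13, 301.2, 180.0))
```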
3. The intelligent single-reservoir flood control scheduling method based on the DQN deep reinforcement learning algorithm as claimed in claim 1, wherein the formula of the reward feedback in step 2 is:
R(K_t, Q_t, N_t) = [b(K_t, Q_t, N_t) − a·{max(e − b(K_t, Q_t, N_t), 0)}^b]·Δt
V_{t+1} = V_t + Q_t − Q_{p,t} − Q_{s,t}
in the formula: r is the electric energy generating capacity benefit after the penalty of t time period, namely Reward; k t Is the initial water storage level of the t time period; q t Is the total water volume of the reservoir in the period t; n is a radical of t Generating output of the reservoir at the time t; q p,t Representing the total generated flow in the t period; q s,t The water abandon amount in the period t; b (-) is the hydropower station generated energy in the period t; a and b are penalty coefficients; e, ensuring the output of the system; v t The reservoir capacity at time t; Δ t represents a scheduling period length; v t+1 The capacity of the water storage at the end of the t period;
the constraints are as follows:
K_min ≤ K_t ≤ K_max
0 ≤ N_t ≤ N_M
0 ≤ Q_t ≤ Q_M
wherein K_min and K_max respectively denote the minimum and maximum allowed reservoir water levels in period t; N_t denotes the decision output in period t; N_M is the installed capacity of the reservoir, i.e. its maximum power generation output; Q_M is the maximum discharge capacity of the turbines.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210741864.3A CN115049292B (en) | 2022-06-28 | 2022-06-28 | Intelligent single reservoir flood control scheduling method based on DQN deep reinforcement learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115049292A true CN115049292A (en) | 2022-09-13 |
CN115049292B CN115049292B (en) | 2023-03-24 |
Family
ID=83163984
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210741864.3A Active CN115049292B (en) | 2022-06-28 | 2022-06-28 | Intelligent single reservoir flood control scheduling method based on DQN deep reinforcement learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115049292B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115952958A (en) * | 2023-03-14 | 2023-04-11 | 珠江水利委员会珠江水利科学研究院 | Reservoir group joint optimization scheduling method based on MADDPG reinforcement learning |
CN117132089A (en) * | 2023-10-27 | 2023-11-28 | 邯郸欣和电力建设有限公司 | Power utilization strategy optimization scheduling method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109636226A (en) * | 2018-12-21 | 2019-04-16 | 华中科技大学 | A kind of reservoir multi-objective Hierarchical Flood Control Dispatch method |
CN110930016A (en) * | 2019-11-19 | 2020-03-27 | 三峡大学 | Cascade reservoir random optimization scheduling method based on deep Q learning |
US20200119556A1 (en) * | 2018-10-11 | 2020-04-16 | Di Shi | Autonomous Voltage Control for Power System Using Deep Reinforcement Learning Considering N-1 Contingency |
CN112186743A (en) * | 2020-09-16 | 2021-01-05 | 北京交通大学 | Dynamic power system economic dispatching method based on deep reinforcement learning |
WO2022083029A1 (en) * | 2020-10-19 | 2022-04-28 | 深圳大学 | Decision-making method based on deep reinforcement learning |
Non-Patent Citations (2)
Title |
---|
练继建 (LIAN Jijian) et al.: "Research progress in Agent-based water resources management models", Advances in Water Science (《水科学进展》) *
董香栾 (DONG Xiangluan): "Research on optimal scheduling strategies for integrated energy systems based on the DQN algorithm", China Master's Theses Full-text Database (Electronic Journal), Engineering Science and Technology II *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |