CN115912367A - A method for intelligent generation of power system operation mode based on deep reinforcement learning - Google Patents


Info

Publication number
CN115912367A
Authority
CN
China
Prior art keywords
action
power
operation mode
network
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211418090.7A
Other languages
Chinese (zh)
Inventor
吕晨
陈兴雷
于子洋
周博文
杨东升
李广地
伍薇蓉
马全
杨钊
文晶
李文臣
崔勇
顾军
涂崎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Electric Power Research Institute Co Ltd CEPRI
State Grid Shanghai Electric Power Co Ltd
Original Assignee
China Electric Power Research Institute Co Ltd CEPRI
State Grid Shanghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Electric Power Research Institute Co Ltd CEPRI, State Grid Shanghai Electric Power Co Ltd filed Critical China Electric Power Research Institute Co Ltd CEPRI
Priority to CN202211418090.7A priority Critical patent/CN115912367A/en
Publication of CN115912367A publication Critical patent/CN115912367A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides an intelligent generation method for power system operation modes based on deep reinforcement learning, and relates to the technical field of power grid operation. The method models the power grid as a reinforcement-learning problem using a Markov decision process (MDP) and establishes an improved mapping strategy between agent actions and adjustable action objects. An operation-mode intelligent-generation DQN network is constructed; the current system power-flow state and the target operation state are input to the DQN network, which outputs the action with the largest Q value. Power-flow iterations are carried out with the P-Q decomposition method: if λ > 1 or the power flow fails to converge within 10 iterations, a pathological power flow is considered to have occurred, the action is discarded, and the DQN network regenerates a new action; if the power flow converges, the operating state of the adjustable action object is adjusted according to the improved mapping strategy. Actions are adjusted continuously until the load level of the target operation mode is met or the maximum number of action adjustments is reached. Finally, the estimated Q network parameters are output, completing the intelligent generation and intelligent deletion of power grid operation modes.

Description

A method for intelligent generation of power system operation modes based on deep reinforcement learning

Technical Field

The present invention relates to the field of power grid operation technology, and in particular to a method for intelligently generating power system operation modes based on deep reinforcement learning.

Background Art

Power system operation-mode calculation gives the safe and stable operation boundary of the power grid; it is the overall guiding plan for ensuring safe and stable grid operation and the theoretical basis on which dispatchers assess the real-time operating state of the grid. Since every kind of power system stability analysis must be based on power flow results, power flow calculation is the fundamental element of operation-mode calculation. In recent years, rapid socio-economic development, large-scale integration of new energy sources and the emergence of new-type power systems have not only increased the scale and complexity of the grid to an unprecedented level but also significantly increased the number of typical operation modes, so operation-mode calculation faces severe challenges.

In actual engineering practice, the annual operation-mode calculation of a large power grid is mainly completed cooperatively by mode-calculation personnel at dispatch and control centers of all levels using power system simulation and analysis software, and it involves a great deal of manual work. Specifically, typical operation modes under various extreme conditions must first be drafted according to the forecast of next year's load and network-structure changes, with reference to the previous year's operating experience; the safe operation boundary of the grid is then determined by combining manual power-flow adjustment with stability calculations, providing the theoretical basis for economic dispatch, equipment maintenance planning and related work. On the one hand, the grid keeps growing and its operating characteristics become ever more complex; on the other hand, operation-mode calculation has long relied on a large amount of manual labor, with a heavy and highly repetitive workload.

At present, artificial intelligence technology is leading a new round of scientific and technological revolution and industrial transformation. With the development of artificial intelligence, deep reinforcement learning, trained on large numbers of samples, helps humans extract the general laws of data and greatly reduces the investment of manpower and material resources. Intelligent generation of power system operation modes based on deep reinforcement learning lets a machine take over the generation of grid operation modes. While generating an operation mode it also diagnoses the rationality of that mode, i.e. whether the power flow converges or becomes pathological. Deep reinforcement learning makes it possible to adjust the high-dimensional power-flow space intelligently, to inject knowledge and experience into the adjustment process, to shrink the action space and to imitate the manual adjustment process more effectively. This is of great significance for reducing the burden on staff, providing operators with a basis for power-flow adjustment and raising the automation level of the power system.

The Chinese patent "CN111478331A, A method and system for adjusting the convergence of power system power flow" proposes a grid operation-mode calculation method based on improved deep Q-learning. That patent determines the input and output dimensions of the Q neural network model from the state space and action space; establishes a mapping between the action space and the start/stop states of the generators; adjusts the generator operating states according to the adjustment action output by the trained model; takes the target load level and the generator start/stop states as input and the adjustment action as output, training the Q neural network model according to those dimensions; adjusts the power flow to a converged state according to the adjustment action; and meets the load demand under different operation modes by switching generators on and off and adjusting the power of the balancing machine. In that patent the only adjustable action object is the generator, whereas in today's new-type power system with large-scale new-energy integration the adjustable parameters also include line commissioning status, new-energy output, controllable load and DC status, in addition to generator power. Moreover, the generators in that patent have only the two states on and off, which cannot satisfy the need to adjust partial unit output in actual operation-mode calculation.

Summary of the Invention

In view of the shortcomings of the prior art, the present invention provides a method for intelligently generating power system operation modes based on deep reinforcement learning.

A method for intelligently generating power system operation modes based on deep reinforcement learning specifically comprises the following steps:

Step 1: Use the Markov decision process (MDP) to perform reinforcement learning modeling of the power grid;

Step 1.1: Use the Markov decision process to set the parameters of the grid operation-mode process;

The operation-mode calculation personnel are set as an agent, and the grid operating data and the power flow equations are set as the environment; the result of the interaction between agent and environment is convergence of the grid power flow calculation, and the interaction process is represented by a Markov decision process;

The Markov decision process MDP consists of the 5-tuple (S, A, P_r, R, γ), where S is the system environment state space and s_t is the system state at time t; A is the action space and a_t is the agent action at time t; P_r is the transition probability, and P_r(s_{t+1} | s_t, a_t) is the probability of moving to state s_{t+1} after taking action a_t in state s_t; R is the reward function, and r_t is the reward obtained after taking action a_t in state s_t; γ is the discount factor (0 ≤ γ ≤ 1), used to balance the influence of immediate and future rewards on the decision process;
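
As an illustration only (not part of the claimed method), the following minimal Python sketch bundles the five elements (S, A, P_r, R, γ) into one container; all class and field names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class GridMDP:
    """Minimal container for the 5-tuple (S, A, Pr, R, gamma) described above."""
    states: Sequence[Any]                  # S: system environment state space
    actions: Sequence[int]                 # A: discrete action space
    transition: Callable[[Any, int], Any]  # Pr: returns next state s_{t+1} given (s_t, a_t)
    reward: Callable[[Any, int], float]    # R: returns r_t for (s_t, a_t)
    gamma: float = 0.95                    # discount factor, 0 <= gamma <= 1
```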

To quantitatively describe how the action a_t at time t steers the direction of the system state transition, the State-Action value function is introduced: the expected cumulative reward obtained after executing action a_t in state s_t at time t is denoted Q(s, a), and is computed as in formula (1).

Q(s, a) = E[r_t + γ·r_{t+1} + γ²·r_{t+2} + ... | s_t = s, a_t = a, π],  s ∈ S, a ∈ A   (1)

In the formula, a larger γ means the future rewards have a greater influence on Q(s, a); γ = 1 means future rewards and the immediate reward influence Q(s, a) equally, while γ = 0 means only the immediate reward affects Q(s, a); π denotes the agent's action execution policy, i.e. the mapping between the system state s_t and the action a_t.

The optimal policy μ* is computed so that the Q value of the action a_t at each time step is maximal, as shown in formula (2):

μ* = max Q_μ(s, a)   (2)

where Q_μ(s, a) is the expected return of policy μ after taking action a starting from state s;
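
A minimal sketch, assuming a tabular Q function, of how the discounted return of formula (1) and the greedy choice behind formula (2) can be evaluated (array sizes and values are illustrative):

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted reward r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... (formula (1))."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def greedy_action(q_table, state):
    """Pick the action with the largest Q(s, a), i.e. the action the optimal policy selects."""
    return int(np.argmax(q_table[state]))

# toy example: 3 states x 4 actions
q_table = np.random.rand(3, 4)
print(discounted_return([-1, -1, 0]), greedy_action(q_table, state=0))
```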

Step 1.2: Define the expressions of the system environment state space S, the action space A and the reward function R in the Markov decision model;

In the system environment state space S, the state space s_t at time t is defined as:

s_t = [p, q, s, v, L, D, l]   (3)

p = [p_1, p_2, ..., p_m]   (4)

q = [q_1, q_2, ..., q_m]   (5)

s = [s_1, s_2, ..., s_n]   (5)

v = [v_1, v_2, ..., v_g]   (6)

L = [L_1, L_2, ..., L_h]   (7)

D = [D_1, D_2, ..., D_k]   (8)

l = [l_1, l_2, ..., l_N]   (9)

In the formulas, p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the line commissioning status of node i; v_i is the new-energy output of node i; L_i is the controllable load output of node i; D_i is the DC output of node i; m, n, g, h and k are respectively the total number of adjustable generator nodes excluding the balancing machine, the total number of lines, the total number of new-energy nodes, the total number of controllable-load nodes and the total number of DC nodes; l_1, l_2, ..., l_N together form a binary code used to number the different operation modes;
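
A minimal sketch of assembling the state vector s_t of formula (3) from its component vectors (the array sizes and values below are illustrative assumptions):

```python
import numpy as np

def build_state(p, q, s, v, L, D, l):
    """Concatenate generator P/Q, line status, new-energy output, controllable load,
    DC output and the binary operation-mode code into one state vector s_t."""
    return np.concatenate([p, q, s, v, L, D, l]).astype(float)

# toy example: m=2 generators, n=3 lines, g=1 new-energy node, h=1 load, k=1 DC node, N=2-bit mode code
s_t = build_state(p=[0.3, 0.5], q=[0.1, 0.05], s=[1, 1, 0],
                  v=[1], L=[0.2], D=[0.4], l=[0, 1])
print(s_t.shape, s_t)
```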

In the action space A, the action space is discrete; it is associated with discrete positive integers as shown in formula (10).

A = [{1, 2, ..., m}, {1, 2, ..., n}, {1, 2, ..., g}, {1, 2, ..., h}, {1, 2, ..., k}]   (10)

The numbers in set A represent the numbers of the adjustable action objects, and the adjustment action at time t is denoted a_t;

Four indicators of the power-flow adjustment problem are defined: (1) the power flow calculation converges, denoted c_1; (2) the output power of the balancing machine does not exceed its limits, denoted c_2; (3) the network loss rate is below the set value, quantified by computing the network loss rate; (4) no pathological power flow is produced, quantified by the λ value of the power-flow iteration; therefore, the reward function R is as in formula (11):

R = 0, if after executing a_t the power flow converges and the balancing machine output is within limits; R = −1, otherwise   (11)

After executing a_t, if the power flow calculation converges and the output power of the balancing machine does not exceed its limits, R is 0; in all other cases R is −1;
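
A minimal sketch of the reward rule of formula (11), assuming boolean flags for convergence (c_1) and the balancing-machine limit check (c_2) are available from the power-flow routine (the names are illustrative):

```python
def reward(converged: bool, balancer_within_limits: bool) -> int:
    """R = 0 when the power flow converges and the balancing machine stays within limits,
    otherwise R = -1, so fewer adjustment steps yield a larger cumulative reward."""
    return 0 if (converged and balancer_within_limits) else -1

print(reward(True, True), reward(True, False), reward(False, True))  # 0 -1 -1
```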

Step 2: Establish an improved mapping strategy between agent actions and adjustable action objects;

The improved mapping strategy is as follows. Let P_G be the sum of the active power of the current grid generators excluding the balancing machine; P_L be the sum of the active power of all current loads of the grid; P_Bmax / P_Bmin be the maximum / minimum active power of the balancing machine; K be the set target network-loss rate; P_i be the active power of generator i; and P_imax be the maximum active power of generator i, with a minimum adjustment threshold of 0.05·P_imax. Three cases are distinguished:

(1) When the condition of the first case holds [formula image not reproduced], and a_t = i, set P_i = 0.5·P_imax; if at that moment P_i ≥ 0.5·P_imax, commission P_i at the midpoint of P_i and P_imax, rounded up, until P_imax is reached. In this scenario the total active power of the system generators is judged insufficient, and generator active power must be increased to satisfy the power-flow convergence requirement.

(2) When the condition of the second case holds [formula image not reproduced], and a_t = i, set P_i = 0.5·P_imax; if at that moment P_i ≤ 0.5·P_imax, move P_i to the midpoint of P_i and the shutdown output, rounded down, until 0% is reached, i.e. shutdown. In this scenario the total active power of the system generators is judged excessive, and generator active power must be reduced to satisfy the power-flow convergence requirement.

(3) In all cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5·P_imax, commission P_i at the midpoint of P_i and P_imax, rounded up, until P_imax is reached; otherwise move P_i to the midpoint of P_i and the shutdown output, rounded down, until 0% is reached, i.e. shutdown;
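
A minimal Python sketch of the generator part of this improved mapping strategy, assuming outputs are expressed as fractions of P_imax and snapped to the 0.05·P_imax grid; the helper names and the rounding directions follow the worked examples given later in the embodiment and are illustrative, not a definitive implementation:

```python
import math

def step_toward_max(p_frac):
    """Move P_i to the midpoint of P_i and P_imax, rounded up to the next 5% step."""
    mid = 0.5 * (p_frac + 1.0)
    return min(1.0, round(math.ceil(mid / 0.05) * 0.05, 2))

def step_toward_zero(p_frac):
    """Move P_i to the midpoint of P_i and shutdown (0%), rounded down to a 5% step."""
    mid = 0.5 * p_frac
    return max(0.0, round(math.floor(mid / 0.05) * 0.05, 2))

def apply_mapping(case, p_frac):
    """Case 1: total generation insufficient -> raise output; case 2: excessive -> lower it;
    case 3: raise if already at or above 50% of P_imax, otherwise lower."""
    if case == 1:
        return step_toward_max(p_frac) if p_frac >= 0.5 else 0.5
    if case == 2:
        return step_toward_zero(p_frac) if p_frac <= 0.5 else 0.5
    return step_toward_max(p_frac) if p_frac >= 0.5 else step_toward_zero(p_frac)

# worked examples from the embodiment: 0.75 -> 0.9 (case 1), 0.25 -> 0.1 (case 2)
print(apply_mapping(1, 0.75), apply_mapping(2, 0.25))
```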

Step 3: Construct the operation-mode intelligent-generation DQN network;

The DQN network combines the Q-learning framework with a neural network: the neural network is used to estimate the Q value function, and after it computes the value function of each power-flow adjustment action, ε-greedy search is used for action selection and the action with the largest Q value is output.
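
A minimal sketch of the ε-greedy selection described above, assuming the network's per-action Q values are already available as an array (the names and the value of ε are illustrative):

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise pick the action with the largest Q value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

print(epsilon_greedy(np.array([0.1, -0.4, 0.7, 0.2])))  # usually returns 2
```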

The DQN network introduces an estimated Q network and a target Q network, and its training process is:

Step A1: At the start of training, the node, generator, line and load parameters of the estimated Q network and the target Q network are set identically, with parameter matrices θ and θ′ respectively;

Step A2: During training, the estimated Q network is updated once per time step along the gradient-descent direction of the loss function in formula (13); the DQN network computes the Q values from the estimated Q network and the current state and outputs a power-flow adjustment action;

L(θ) = E[(r + γ·max_a′ Q(s′, a′; θ′) − Q(s, a; θ))²]   (13)

Step A3: Adjust the operating state of the adjustable action object according to step A2;

Step A4: Every C steps, pass the estimated Q network parameters θ to the target Q network θ′;

Step A5: The target Q network is updated once every C time steps along the gradient-descent direction of formula (13).

The power-flow adjustment action value computed by the estimated Q network is called the predicted value; the sum of the immediate reward in the current state and the power-flow adjustment action value computed by the target Q network is called the true value. The parameters of the estimated Q network are updated by back-propagation. During training this update process is repeated until the power flow converges with the balancing-machine output within limits, or the number of iteration rounds is reached;
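
A minimal PyTorch-style sketch of the estimated-Q / target-Q update in steps A1 to A5; the loss is taken here as the squared difference between the "true value" r + γ·max_a′ Q_target(s′, a′) and the predicted Q(s, a), and the network size, optimizer and use of PyTorch are illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

state_dim, n_actions, gamma, C = 8, 5, 0.95, 10

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # estimated Q network (theta)
target_net = copy.deepcopy(q_net)                                                     # target Q network (theta')
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(step, s, a, r, s_next):
    """One gradient-descent update of the estimated Q network; copy theta -> theta' every C steps."""
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # predicted value Q(s, a; theta)
    with torch.no_grad():
        q_true = r + gamma * target_net(s_next).max(dim=1).values  # "true value" from reward + target network
    loss = nn.functional.mse_loss(q_pred, q_true)                  # squared TD error, as in formula (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())             # step A4: sync parameters
    return loss.item()

# toy batch
s = torch.randn(4, state_dim); a = torch.randint(0, n_actions, (4,))
r = torch.zeros(4); s_next = torch.randn(4, state_dim)
print(train_step(1, s, a, r, s_next))
```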

Step 4: Model the intelligent deletion process of operation modes and construct a pathological power-flow diagnosis model;

If the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow is unsolvable; or the power flow calculation has a feasible solution that cannot be found, i.e. a pathological power-flow problem;

The pathological power-flow problem covers the following two situations:

(1) Pathological power flow caused by an overloaded transfer cross-section: the active power imbalance is shared out by adjusting the output of the adjustable action objects, thereby resolving the pathological power flow;

(2) Pathological power flow caused by insufficient local reactive power support: the following power-flow iteration index is defined for its diagnosis:

When the power flow calculation with the P-Q decomposition method does not converge, the index λ is used as the criterion, as shown in formula (14):

λ = max{ |[ΔU]^(3) / [ΔU]^(2)| }   (14)

In the formula, [ΔU]^(3) is the voltage increment of the third iteration and [ΔU]^(2) is the voltage increment of the second iteration;

When the power flow converges normally, λ < 1; as the reactive power demand of the PQ-node loads increases, λ increases; when the power flow becomes pathological, λ > 1.

After an operation mode is generated in step 3, its rationality is judged: the power flow is calculated with the P-Q decomposition method, and if λ > 1 or the power-flow iteration does not converge within 10 iterations, a pathological power flow is deemed to have occurred and the operation mode is deleted; otherwise the operation mode is considered reasonable and retained, completing the intelligent deletion of operation modes;
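
A minimal sketch of the deletion check built around the λ criterion of formula (14), assuming the power-flow routine records the voltage-correction vectors of the second and third P-Q iterations (all function names are illustrative):

```python
import numpy as np

def lambda_index(delta_u_2, delta_u_3):
    """lambda = max |[dU]^(3) / [dU]^(2)| over all buses (formula (14))."""
    return float(np.max(np.abs(np.asarray(delta_u_3) / np.asarray(delta_u_2))))

def keep_operation_mode(converged, iterations, delta_u_2, delta_u_3, max_iter=10):
    """Discard the generated operation mode if lambda > 1 or the flow fails to converge within 10 iterations."""
    if not converged or iterations > max_iter:
        return False
    return lambda_index(delta_u_2, delta_u_3) <= 1.0

print(keep_operation_mode(True, 6, [0.04, 0.02], [0.01, 0.005]))  # lambda = 0.25 -> keep
print(keep_operation_mode(True, 6, [0.01, 0.02], [0.03, 0.05]))   # lambda = 3.0  -> discard
```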

Step 5: Repeat steps 3 to 4, continually adjusting the action until the load level of the target operation mode is met or the maximum number of action adjustments is reached;

Step 6: Output the estimated Q network parameters θ, completing the intelligent generation and intelligent deletion of power grid operation modes.

The beneficial effects produced by adopting the above technical solution are:

The present invention provides a method for intelligently generating power system operation modes based on deep reinforcement learning, in which a computer, instead of a person, completes the adjustment of the adjustable objects and finally outputs usable operation modes, greatly reducing the workload of operation-mode calculation personnel. The adjustable action objects in the present invention include generator status, line commissioning status, new-energy output status, controllable-load status and DC status, which better matches the needs of operation-mode calculation in an actual power grid. Adjusting the action space satisfies the practical requirements for adjusting components of the new-type power system, and the improved mapping strategy also speeds up model training, thereby reducing the computing-power requirement. The output-adjustment threshold of an action object is 5% of its maximum power, which better matches operation-mode adjustment requirements in an actual grid; different mapping strategies are designed for the different classes of action objects, so the growth of the action space is handled well by the improved mapping strategy without a significant increase in running time, accelerating the DQN training process.

Compared with the prior art, the technical solution of the present invention adopts an intelligent operation-mode generation method based on deep reinforcement learning: a computer, instead of a person, completes the adjustment of the adjustable objects and finally outputs usable operation modes, greatly reducing the workload of operation-mode calculation personnel; adjusting the action space satisfies the practical requirements for adjusting components of the new-type power system; and the improved mapping strategy speeds up model training, thereby reducing the computing-power requirement.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of the method for intelligently generating power system operation modes in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the generator-node mapping relationship of the IEEE 30-bus system in an embodiment of the present invention;

FIG. 3 is a flow chart of the operation-mode intelligent-generation DQN network algorithm in an embodiment of the present invention.

DETAILED DESCRIPTION

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are used to illustrate the present invention, but are not intended to limit its scope.

A method for intelligently generating power system operation modes based on deep reinforcement learning, as shown in FIG. 1, specifically comprises the following steps:

Step 1: Use the Markov decision process (MDP) to perform reinforcement learning modeling of the power grid;

The formulation of a grid operation mode is essentially a process of adjusting the power system power flow until it converges; it can be regarded as a decision-making process in which operation-mode calculation personnel compute and adjust grid data to obtain the system power flow.

Step 1.1: Use the Markov decision process to set the parameters of the grid operation-mode process;

The operation-mode calculation personnel are set as an agent (a computing entity that resides in an environment and acts continuously and autonomously, characterized by persistence, reactivity, social ability and initiative); the grid operating data and the power flow equations are set as the environment. The result of the interaction between agent and environment is convergence of the grid power flow calculation, yielding the typical operation modes. The interaction process between the agent and the environment is represented by a Markov decision process (MDP);

The Markov decision process MDP consists of the 5-tuple (S, A, P_r, R, γ), where S is the system environment state space and s_t is the system state at time t; A is the action space and a_t is the agent action at time t; P_r is the transition probability, and P_r(s_{t+1} | s_t, a_t) is the probability of moving to state s_{t+1} after taking action a_t in state s_t; R is the reward function, and r_t is the reward obtained after taking action a_t in state s_t; γ is the discount factor (0 ≤ γ ≤ 1), used to balance the influence of immediate and future rewards on the decision process;

To quantitatively describe how the action a_t at time t steers the direction of the system state transition, the State-Action value function is introduced: the expected cumulative reward obtained after executing action a_t in state s_t at time t is denoted Q(s, a), and is computed as in formula (1).

Q(s, a) = E[r_t + γ·r_{t+1} + γ²·r_{t+2} + ... | s_t = s, a_t = a, π],  s ∈ S, a ∈ A   (1)

In the formula, a larger γ means the future rewards have a greater influence on Q(s, a); γ = 1 means future rewards and the immediate reward influence Q(s, a) equally, while γ = 0 means only the immediate reward affects Q(s, a); π denotes the agent's action execution policy, i.e. the mapping between the system state s_t and the action a_t.

The optimal policy μ* is computed so that the Q value of the action a_t at each time step is maximal, as shown in formula (2):

μ* = max Q_μ(s, a)   (2)

where Q_μ(s, a) is the expected return of policy μ after taking action a starting from state s;

Step 1.2: Define the expressions of the system environment state space S, the action space A and the reward function R in the Markov decision model;

In the system environment state space S, the state space s_t at time t is defined as:

s_t = [p, q, s, v, L, D, l]   (3)

p = [p_1, p_2, ..., p_m]   (4)

q = [q_1, q_2, ..., q_m]   (5)

s = [s_1, s_2, ..., s_n]   (5)

v = [v_1, v_2, ..., v_g]   (6)

L = [L_1, L_2, ..., L_h]   (7)

D = [D_1, D_2, ..., D_k]   (8)

l = [l_1, l_2, ..., l_N]   (9)

In the formulas, p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the line commissioning status of node i; v_i is the new-energy output of node i, where the new-energy output is assumed to follow a Weibull probability distribution; L_i is the controllable load output of node i, treated as a negative generator output; D_i is the DC output of node i, treated as a negative current-source generator output at the sending end and a positive current-source generator output at the receiving end; m, n, g, h and k are respectively the total number of adjustable generator nodes excluding the balancing machine, the total number of lines, the total number of new-energy nodes, the total number of controllable-load nodes and the total number of DC nodes; l_1, l_2, ..., l_N together form a binary code used to number the different operation modes. For example, if there are 16 operation modes in total, then N = 4; l_1 l_2 l_3 l_4 = 0000 represents the 1st operation mode and l_1 l_2 l_3 l_4 = 1111 represents the 16th operation mode.
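
A minimal sketch of the binary operation-mode code l = [l_1, ..., l_N] described above (the choice of one-based mode numbering follows the example in the text; the function name is illustrative):

```python
def encode_mode(mode_number, n_bits):
    """Mode 1 -> 00...0 and mode 2**n_bits -> 11...1, matching the example of 16 modes with N = 4."""
    bits = format(mode_number - 1, f"0{n_bits}b")
    return [int(b) for b in bits]

print(encode_mode(1, 4), encode_mode(16, 4))  # [0, 0, 0, 0]  [1, 1, 1, 1]
```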

In this embodiment, to simplify the model and reflect how operation-mode calculation personnel actually adjust the adjustable action objects, the generator, controllable-load and DC outputs are simplified so that the minimum threshold of a single adjustment is 5% of the maximum output and the adjustment can only be an integer multiple of 5%; for example, the allowed values of p_i are [0, 0.05, 0.1, ..., 1.0], and p_1 = 0.3 means the output of the generator at node 1 is adjusted to 30% of its maximum active power. The line commissioning status is simplified to the two states 1/0, where 1 means in service and 0 means out of service. Since new-energy output fluctuates randomly and follows a Weibull probability distribution, it is simplified to the expected value of the Weibull distribution function, with only the two states 1/0, where 1 means output at the expected value of the distribution and 0 means out of service.
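
A minimal sketch of the simplification described above: snapping a requested output to the nearest 5% step of the maximum output (the function name and the sample values are illustrative):

```python
def snap_output(fraction, step=0.05):
    """Clip a requested output fraction to [0, 1] and round it to the nearest multiple of 5%."""
    fraction = min(1.0, max(0.0, fraction))
    return round(round(fraction / step) * step, 2)

print(snap_output(0.33), snap_output(0.301), snap_output(1.2))  # 0.35 0.3 1.0
```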

In the action space A, owing to the simplified definition of how the adjustable action objects are adjusted, the generator, controllable-load and DC nodes have only 21 output-adjustment levels (0, 5%, 10%, ..., 100%), and the line commissioning status and new-energy nodes have only the two adjustment states 1 and 0; hence the action space A is discrete, and it is associated with discrete positive integers as shown in formula (10).

A = [{1, 2, ..., m}, {1, 2, ..., n}, {1, 2, ..., g}, {1, 2, ..., h}, {1, 2, ..., k}]   (10)

The numbers in set A represent the numbers of the adjustable action objects; the number 0 means no adjustment of that object, and the code is output selectively according to the actual situation of the node system. The adjustment action at time t is denoted a_t; for example, a_t = [1, 3, 0, 0, 4] means adjusting the operating status of generator node 1, line 3 and DC node 4.
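
A minimal sketch of decoding an action vector such as a_t = [1, 3, 0, 0, 4], where 0 means no adjustment of that object class (the category names are illustrative):

```python
def decode_action(a_t):
    """Map each non-zero entry of a_t to the adjustable object it addresses."""
    categories = ["generator", "line", "new_energy", "controllable_load", "dc"]
    return [(cat, idx) for cat, idx in zip(categories, a_t) if idx != 0]

print(decode_action([1, 3, 0, 0, 4]))  # [('generator', 1), ('line', 3), ('dc', 4)]
```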

For the reward function R, formula (2) shows that the immediate reward r_t affects the computed Q_μ(s, a), and Q_μ(s, a) in turn affects the choice of action a_t. The design idea of the reward function is as follows: when the agent selects an action that makes the power flow converge, the environment gives a larger reward; when it selects an action that makes the power flow diverge or drives the balancing machine beyond its limits, the environment gives a corresponding penalty, so that to obtain the maximum reward the agent constrains its actions to satisfy the action change rate. Four indicators of the power-flow adjustment problem are defined: (1) the power flow calculation converges, denoted c_1; (2) the output power of the balancing machine does not exceed its limits, denoted c_2; (3) the network loss rate is below the set value, quantified by computing the network loss rate; (4) no pathological power flow is produced, quantified by the λ value of the power-flow iteration; therefore, the reward function R is as in formula (11):

R = 0, if after executing a_t the power flow converges and the balancing machine output is within limits; R = −1, otherwise   (11)

After executing a_t, if the power flow calculation converges and the output power of the balancing machine does not exceed its limits, R is 0; in all other cases R is −1. Therefore, the fewer the adjustment steps, the larger the cumulative reward.

Step 2: Establish an improved mapping strategy between agent actions and adjustable action objects;

This embodiment takes the IEEE 30-bus system as an example; the generator-node mapping relationship is shown in FIG. 2. The usual mapping strategy places the actions a_t in one-to-one correspondence with the adjustable generators, but a grid has many generators, each additional generator makes the state space grow exponentially, and most of those states yield non-convergent power flows; an exhaustive search over action states would therefore also require exponentially growing time. To improve search efficiency, the improved mapping strategy is designed as follows:

The improved mapping strategy is as follows. Let P_G be the sum of the active power of the current grid generators excluding the balancing machine; P_L be the sum of the active power of all current loads of the grid; P_Bmax / P_Bmin be the maximum / minimum active power of the balancing machine; K be the set target network-loss rate; P_i be the active power of generator i; and P_imax be the maximum active power of generator i, with a minimum adjustment threshold of 0.05·P_imax. Three cases are distinguished:

(1) When the condition of the first case holds [formula image not reproduced], and a_t = i, set P_i = 0.5·P_imax; if at that moment P_i ≥ 0.5·P_imax, commission P_i at the midpoint of P_i and P_imax, rounded up, until P_imax is reached. For example, if a_t = i and currently P_i = 0.75·P_imax, the adjusted P_i = [0.5·(75% + 100%)]·P_imax = 0.875·P_imax → 0.9·P_imax. In this scenario the total active power of the system generators is judged insufficient, and generator active power must be increased to satisfy the power-flow convergence requirement.

(2) When the condition of the second case holds [formula image not reproduced], and a_t = i, set P_i = 0.5·P_imax; if at that moment P_i ≤ 0.5·P_imax, move P_i to the midpoint of P_i and the shutdown output, rounded down, until 0% is reached, i.e. shutdown. For example, if a_t = i and currently P_i = 0.25·P_imax, the adjusted P_i = [0.5·(25% + 0%)]·P_imax = 0.125·P_imax → 0.1·P_imax. In this scenario the total active power of the system generators is judged excessive, and generator active power must be reduced to satisfy the power-flow convergence requirement.

(3) In all cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5·P_imax, commission P_i at the midpoint of P_i and P_imax, rounded up, until P_imax is reached; otherwise move P_i to the midpoint of P_i and the shutdown output, rounded down, until 0% is reached, i.e. shutdown;

Replacing the generator nodes with the other adjustable action objects yields the mapping strategies for the remaining action objects by analogy; the mapping strategy can be adjusted according to the actual situation of the node system.

Step 3: Construct the operation-mode intelligent-generation DQN network;

The algorithm flow chart of the operation-mode intelligent-generation DQN network is shown in FIG. 3. The DQN network is an improvement on the Q-learning framework: it combines Q-learning with a neural network and uses the neural network to estimate the Q value function; after the neural network computes the value function of each power-flow adjustment action, ε-greedy search is used for action selection and the action with the largest Q value is output.

The DQN network introduces an estimated Q network and a target Q network, and its training process is:

Step A1: At the start of training, the node, generator, line and load parameters of the estimated Q network and the target Q network are set identically, with parameter matrices θ and θ′ respectively;

Step A2: During training, the estimated Q network is updated once per time step along the gradient-descent direction of the loss function in formula (13); the DQN network computes the Q values from the estimated Q network and the current state and outputs a power-flow adjustment action;

L(θ) = E[(r + γ·max_a′ Q(s′, a′; θ′) − Q(s, a; θ))²]   (13)

Step A3: Adjust the operating state of the adjustable action object according to step A2;

Step A4: Every C steps, pass the estimated Q network parameters θ to the target Q network θ′;

Step A5: The target Q network is updated once every C time steps along the gradient-descent direction of formula (13).

The power-flow adjustment action value computed by the estimated Q network is called the predicted value; the sum of the immediate reward in the current state and the power-flow adjustment action value computed by the target Q network is called the true value. The parameters of the estimated Q network are updated by back-propagation. During training this update process is repeated until the power flow converges with the balancing-machine output within limits, or the number of iteration rounds is reached;

Step 4: Model the intelligent deletion process of operation modes and construct a pathological power-flow diagnosis model;

If the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow is unsolvable; or the power flow calculation has a feasible solution that cannot be found, i.e. a pathological power-flow problem. Pathological power flow is characterized by a converged solution that deviates severely from the starting initial values, an increasing number of iterations with slow convergence, or a Jacobian matrix that tends toward singularity so that the power flow cannot converge to the feasible solution. Pathological power flow has two causes: (1) an overloaded transfer cross-section, i.e. excessive active power; (2) insufficient local reactive power support.

The pathological power-flow problem covers the following two situations:

(1) Pathological power flow caused by an overloaded transfer cross-section: the active power imbalance is shared out by adjusting the output of the adjustable action objects, thereby resolving the pathological power flow;

(2) Pathological power flow caused by insufficient local reactive power support: the following power-flow iteration index is defined for its diagnosis:

When the power flow calculation with the P-Q decomposition method does not converge, the index λ is used as the criterion, as shown in formula (14):

λ = max{ |[ΔU]^(3) / [ΔU]^(2)| }   (14)

In the formula, [ΔU]^(3) is the voltage increment of the third iteration and [ΔU]^(2) is the voltage increment of the second iteration;

When the power flow converges normally, λ < 1; as the reactive power demand of the PQ-node loads increases, λ increases; when the power flow becomes pathological, λ > 1. Taking the IEEE 118-bus system as an example, the active power demand of PQ node 29 is 24 MW and its reactive power demand is 4 MVar; the reactive power demand of this load node is gradually increased, and its relationship with λ is shown in Table 1.

Table 1: Relationship between the increase in reactive power demand at the load node and λ

[Table 1 values: image not reproduced]

As Table 1 shows, as the reactive power demand of the node increases, λ > 1 once the system becomes ill-conditioned, so it is reasonable to use λ as an index of the degree of system ill-conditioning.

After an operation mode is generated in step 3, its rationality is judged: the power flow is calculated with the P-Q decomposition method, and if λ > 1 or the power-flow iteration does not converge within 10 iterations, a pathological power flow is deemed to have occurred and the operation mode is deleted; otherwise the operation mode is considered reasonable and retained, completing the intelligent deletion of operation modes;

Step 5: Repeat steps 3 to 4, continually adjusting the action until the load level of the target operation mode is met or the maximum number of action adjustments is reached;

Step 6: Output the estimated Q network parameters θ, completing the intelligent generation and intelligent deletion of power grid operation modes.

The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example technical solutions formed by replacing the above features with technical features of similar function disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (7)

1. An intelligent generation method for a power system operation mode based on deep reinforcement learning, characterized by comprising the following steps:
Step 1: performing reinforcement learning modeling on the power grid by using a Markov decision process (MDP);
Step 2: establishing an improved mapping strategy between agent actions and adjustable action objects;
Step 3: constructing an operation-mode intelligent-generation DQN network;
Step 4: modeling the intelligent deletion process of operation modes and constructing a pathological power-flow diagnosis model;
wherein, if the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow is unsolvable; or the power flow calculation has a feasible solution that cannot be found, i.e. a pathological power-flow problem;
Step 5: repeating step 3 to step 4, continually adjusting the action until the load level of the target operation mode is met or the maximum number of action adjustments is reached;
Step 6: outputting the estimated Q network parameters θ, completing the intelligent generation and intelligent deletion of the power grid operation modes.
2. The intelligent generation method for the power system operation mode based on deep reinforcement learning according to claim 1, wherein step 1 specifically comprises the following steps:
Step 1.1: setting the parameters of the power grid operation-mode process by using a Markov decision process;
setting the operation-mode calculation personnel as an agent, and setting the grid operating data and the power flow equations as the environment, wherein the result of the interaction between the agent and the environment is convergence of the grid power flow calculation, and the interaction process between the agent and the environment is represented by a Markov decision process;
the Markov decision process MDP consists of the 5-tuple (S, A, P_r, R, γ), where S is the system environment state space and s_t is the system state at time t; A is the action space and a_t is the agent action at time t; P_r is the transition probability, and P_r(s_{t+1} | s_t, a_t) is the probability of moving to state s_{t+1} after taking action a_t in state s_t; R is the reward function, and r_t is the reward obtained after taking action a_t in state s_t; γ is the discount factor (0 ≤ γ ≤ 1), used to balance the influence of immediate and future rewards on the decision process;
to quantitatively describe how the action a_t at time t steers the direction of the system state transition, the State-Action value function is introduced: the expected cumulative reward obtained after executing action a_t in state s_t at time t is denoted Q(s, a), computed as in formula (1);
Q(s, a) = E[r_t + γ·r_{t+1} + γ²·r_{t+2} + ... | s_t = s, a_t = a, π],  s ∈ S, a ∈ A   (1)
where a larger γ means the future rewards have a greater effect on Q(s, a); γ = 1 means future rewards and the immediate reward affect Q(s, a) equally, and γ = 0 means only the immediate reward affects Q(s, a); π denotes the agent's action execution policy, i.e. the mapping between the system state s_t and the action a_t;
calculating the optimal policy μ* so that the Q value of the action a_t at each time step is maximal, as shown in formula (2):
μ* = max Q_μ(s, a)   (2)
where Q_μ(s, a) is the expected return of policy μ after taking action a from state s;
Step 1.2: defining the expressions of the system environment state space S, the action space A and the reward function R in the Markov decision model;
in the system environment state space S, defining the state space s_t at time t as:
s_t = [p, q, s, v, L, D, l]   (3)
p = [p_1, p_2, ..., p_m]   (4)
q = [q_1, q_2, ..., q_m]   (5)
s = [s_1, s_2, ..., s_n]   (5)
v = [v_1, v_2, ..., v_g]   (6)
L = [L_1, L_2, ..., L_h]   (7)
D = [D_1, D_2, ..., D_k]   (8)
l = [l_1, l_2, ..., l_N]   (9)
where p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the line commissioning status of node i; v_i is the new-energy output of node i; L_i is the controllable load output of node i; D_i is the DC output of node i; m, n, g, h and k are respectively the total number of adjustable generator nodes excluding the balancing machine, the total number of lines, the total number of new-energy nodes, the total number of controllable-load nodes and the total number of DC nodes; l_1, l_2, ..., l_N together form a binary code used to number the different operation modes;
in the action space A, the action space A is discrete and is associated with discrete positive integers, as shown in formula (10);
A = [{1, 2, ..., m}, {1, 2, ..., n}, {1, 2, ..., g}, {1, 2, ..., h}, {1, 2, ..., k}]   (10)
the numbers in set A represent the numbers of the adjustable action objects, and the adjustment action at time t is denoted a_t;
defining 4 indicators of the power-flow adjustment problem: (1) the power flow calculation converges, denoted c_1; (2) the output power of the balancing machine does not exceed its limits, denoted c_2; (3) the network loss rate is below the set value, quantified by computing the network loss rate; (4) no pathological power flow is produced, quantified by the λ value of the power-flow iteration; therefore, the reward function R is as in formula (11):
R = 0, if after executing a_t the power flow converges and the balancing machine output is within limits; R = −1, otherwise   (11)
after executing a_t, if the power flow calculation converges and the output power of the balancing machine does not exceed its limits, R is 0; in all other cases R is −1.
3. The method as claimed in claim 1, wherein in the improved mapping strategy of step 2, P_G is set as the sum of the active power of the current grid generators excluding the balancing machine; P_L is the sum of the active power of all current loads of the grid; P_Bmax / P_Bmin are the maximum / minimum active power of the balancing machine [formula image not reproduced]; K is the set target network-loss rate; P_i is the active power of generator i; P_imax is the maximum active power of generator i, and the minimum adjustment threshold is 0.05·P_imax.
4. The intelligent generation method for the power system operation mode based on deep reinforcement learning according to claim 1, wherein the DQN network in step 3 is obtained by combining a Q-learning network with a neural network, the neural network being used to estimate the Q value function; after the neural network computes the value function of each power-flow adjustment action, ε-greedy search is used for action selection, and the action with the largest Q value is selected for output.
5. The intelligent generation method for the operation mode of the power system based on deep reinforcement learning as claimed in claim 1, wherein the pathological power flow problem in step 4 includes the following two situations:
(1) pathological power flow caused by an overloaded section power flow: the active unbalanced power is redistributed by adjusting the output of the adjustable action objects, thereby resolving the pathological power flow;
(2) pathological power flow caused by insufficient local reactive power support, which is judged by defining the following power flow iteration index:
when the load flow calculation using the PQ decomposition method does not converge, the index λ is taken as the criterion, as shown in formula (14):
λ = max{ |ΔU^(3) / ΔU^(2)| }    (14)
where ΔU^(3) is the voltage increment of the third iteration and ΔU^(2) is the voltage increment of the second iteration;
λ < 1 when the power flow converges normally; λ increases as the reactive power demand of the PQ-node loads increases; and λ > 1 when the power flow is ill-conditioned;
after the operation mode is generated in step 3, its rationality is judged: load flow calculation is carried out with the PQ decomposition method, and when λ > 1 or the load flow calculation does not converge within 10 iterations, the pathological power flow phenomenon is considered to have occurred and the operation mode is deleted; otherwise, the operation mode is considered reasonable and is retained, completing the intelligent deletion of operation modes.
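The λ criterion of formula (14) and the deletion rule of this claim can be expressed compactly; the sketch below assumes the voltage increments of the second and third P-Q iterations are available as arrays with no zero entries, which is an assumption about the load-flow solver's interface.

```python
import numpy as np

def lambda_index(delta_u_2: np.ndarray, delta_u_3: np.ndarray) -> float:
    """Formula (14): lambda = max |dU^(3) / dU^(2)|, taken over all buses.
    Assumes delta_u_2 has no zero entries."""
    return float(np.max(np.abs(delta_u_3 / delta_u_2)))

def is_pathological(converged: bool, iterations: int, lam: float) -> bool:
    """Deletion rule: the operation mode is treated as pathological when
    lambda > 1 or the P-Q load flow has not converged within 10 iterations."""
    return lam > 1.0 or (not converged and iterations >= 10)
```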
6. The intelligent generation method for the operation mode of the power system based on deep reinforcement learning as claimed in claim 3, wherein the improved mapping strategy comprises the following three cases:
(1) When the condition indicating that the total active power of the system generators is insufficient holds: when a_t = i, set P_i = 0.5 P_imax; if at this time P_i ≥ 0.5 P_imax, the output is set to the midpoint value of P_i and P_imax rounded up, repeatedly, until P_imax is put into operation; in this case it is judged that the total active power of the system generators is insufficient, and the generator active power is increased to meet the requirement of power flow convergence;
(2) When the condition indicating that the total active power of the system generators is excessive holds: when a_t = i, set P_i = 0.5 P_imax; if at this time P_i ≤ 0.5 P_imax, the output is set to the midpoint value of P_i and the shutdown output rounded down, repeatedly, until 0% output is reached and the unit is shut down; in this case it is judged that the total active power of the system generators is too large, and the generator active power is reduced to meet the requirement of power flow convergence;
(3) In cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5 P_imax, the output is set to the midpoint value of P_i and P_imax rounded up, repeatedly, until P_imax is reached; otherwise, the output is set to the midpoint value of P_i and the shutdown output rounded down, repeatedly, until 0% output is reached, i.e., the unit is shut down.
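One way to read the midpoint-and-round adjustment used in all three cases is sketched below; the rounding grid of 0.05 * P_imax is an assumption based on the minimum adjustment threshold mentioned in claim 3, since the claim does not state the rounding granularity explicitly.

```python
import math

def midpoint_step(p_i: float, p_imax: float, increase: bool, grid: float) -> float:
    """One adjustment step of the improved mapping strategy: move P_i to the
    midpoint between its current value and the target (P_imax when increasing,
    shutdown/0 when decreasing), rounded up or down to the nearest multiple of grid."""
    if increase:
        mid = 0.5 * (p_i + p_imax)
        return min(p_imax, math.ceil(mid / grid) * grid)
    mid = 0.5 * p_i
    return max(0.0, math.floor(mid / grid) * grid)

# Example: stepping generator i from 0.5 * P_imax up to P_imax (cases (1)/(3)),
# assuming the rounding grid equals the 0.05 * P_imax threshold from claim 3.
p_imax = 100.0
p = 0.5 * p_imax
while p < p_imax:
    p = midpoint_step(p, p_imax, increase=True, grid=0.05 * p_imax)
```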
7. The intelligent generation method for the operation mode of the power system based on deep reinforcement learning as claimed in claim 4, wherein the DQN network introduces an estimation Q network and a target Q network, and the training process comprises:
Step A1: when training starts, the parameters of the estimation Q network and of the target Q network for the node, generator, line and load quantities are set to be the same, with parameter matrices θ and θ';
Step A2: during training, the estimation Q network is updated once every time step in the gradient-descent direction of the loss function in formula (13), and the DQN network calculates Q values from the estimation Q network and the current state to output a power flow adjustment action;
L(θ) = E[ (R + γ max_{a'} Q(s', a'; θ') − Q(s, a; θ))^2 ]    (13)
Step A3: the running state of the adjustable action object is adjusted according to the action output in Step A2;
Step A4: every C time steps, the estimation Q network parameters θ are transmitted to the target Q network parameters θ';
Step A5: the target Q network is updated once every C time steps according to the gradient-descent direction of formula (13);
the parameters of the estimation Q network are updated by back propagation, and this updating process is repeated during training until the power flow converges and the balancing machine output is not out of limit, or the number of iteration rounds is reached.
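Read as a whole, claim 7 describes a fairly standard DQN update loop with an estimation network θ and a periodically synchronised target network θ'. The PyTorch sketch below illustrates that loop under assumed dimensions and hyper-parameters (STATE_DIM, N_ACTIONS, C, GAMMA, EPS and the hidden-layer size are illustrative, not values from the patent), and uses the conventional DQN temporal-difference loss in place of formula (13).

```python
import copy
import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, C, GAMMA, EPS = 16, 8, 50, 0.99, 0.1   # illustrative values

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = copy.deepcopy(q_net)                  # theta' starts equal to theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def select_action(state: torch.Tensor) -> int:
    """epsilon-greedy choice of a power-flow adjustment action."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(step: int, s, a, r, s_next, done) -> None:
    """One update of the estimation Q network; theta is copied to theta' every C steps."""
    q_sa = q_net(s)[a]                             # Q(s, a; theta)
    with torch.no_grad():                          # TD target uses the target network theta'
        y = r + (0.0 if done else GAMMA * target_net(s_next).max().item())
    loss = (q_sa - y) ** 2                         # squared TD error (formula (13) style)
    optimizer.zero_grad()
    loss.backward()                                # back propagation through theta only
    optimizer.step()
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())   # transmit theta to theta'
```

In this sketch the target network is refreshed only by copying θ, which is the usual reading of transmitting θ to θ' every C steps.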
CN202211418090.7A 2022-11-14 2022-11-14 A method for intelligent generation of power system operation mode based on deep reinforcement learning Pending CN115912367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211418090.7A CN115912367A (en) 2022-11-14 2022-11-14 A method for intelligent generation of power system operation mode based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211418090.7A CN115912367A (en) 2022-11-14 2022-11-14 A method for intelligent generation of power system operation mode based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115912367A true CN115912367A (en) 2023-04-04

Family

ID=86496603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211418090.7A Pending CN115912367A (en) 2022-11-14 2022-11-14 A method for intelligent generation of power system operation mode based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115912367A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117041068A (en) * 2023-07-31 2023-11-10 广东工业大学 Deep reinforcement learning reliable sensing service assembly integration method and system
CN118232333A (en) * 2024-03-29 2024-06-21 中国南方电网有限责任公司 Power grid section regulation and control method and device based on deep reinforcement learning
CN118278495A (en) * 2024-05-27 2024-07-02 东北大学 A method for generating power grid operation mode based on reinforcement learning
CN118862991B (en) * 2024-09-27 2025-01-03 浙江伟臻成套柜体有限公司 A lightweight distribution cabinet service life loss prediction optimization method

Similar Documents

Publication Publication Date Title
CN112615379B (en) Power grid multi-section power control method based on distributed multi-agent reinforcement learning
CN115912367A (en) A method for intelligent generation of power system operation mode based on deep reinforcement learning
CN115241885B (en) Power grid real-time scheduling optimization method and system, computer equipment and storage medium
CN110535146B (en) Electric power system reactive power optimization method based on depth determination strategy gradient reinforcement learning
CN114362187B (en) A method and system for cooperative voltage regulation of active distribution network based on multi-agent deep reinforcement learning
CN113489015A (en) Power distribution network multi-time scale reactive voltage control method based on reinforcement learning
CN112213945B (en) Improved robust prediction control method and system for electric vehicle participating in micro-grid group frequency modulation
CN114566971B (en) Real-time optimal power flow calculation method based on near-end strategy optimization algorithm
CN116523327A (en) Method and equipment for intelligently generating operation strategy of power distribution network based on reinforcement learning
CN115313403A (en) Real-time voltage regulation and control method based on deep reinforcement learning algorithm
CN112488442B (en) Power distribution network reconstruction method based on deep reinforcement learning algorithm and source load uncertainty
CN108306346A (en) A kind of distribution network var compensation power-economizing method
CN113110052B (en) A Hybrid Energy Management Approach Based on Neural Networks and Reinforcement Learning
CN109494766A (en) 2019-03-19 A kind of intelligent power generation control method of manual depth's emotion game intensified learning
CN117674114A (en) Dynamic economic scheduling method and system for power distribution network
CN117937599A (en) Multi-agent reinforcement learning distribution network optimization method for distributed photovoltaic consumption
CN112012875B (en) Optimization method of PID control parameters of water turbine regulating system
CN110163540A (en) Electric power system transient stability prevention and control method and system
CN115345380A (en) A new energy consumption power dispatching method based on artificial intelligence
CN106300417A (en) Wind farm group reactive voltage optimal control method based on Model Predictive Control
CN111749847A (en) On-line control method, system and device for wind turbine pitch
CN114330649B (en) Voltage regulation method and system based on evolutionary learning and deep reinforcement learning
CN118671629A (en) CSAPSO-improved DNN algorithm-based energy storage power station battery state of health evaluation method
CN117893043A (en) Hydropower station load distribution method based on DDPG algorithm and deep learning model
CN117543607A (en) Distribution network reactive power optimization method based on genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination