CN115912367A - Intelligent generation method for operation mode of power system based on deep reinforcement learning - Google Patents

Intelligent generation method for operation mode of power system based on deep reinforcement learning

Info

Publication number
CN115912367A
CN115912367A (application CN202211418090.7A)
Authority
CN
China
Prior art keywords
action
power
operation mode
network
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211418090.7A
Other languages
Chinese (zh)
Inventor
吕晨
陈兴雷
于子洋
周博文
杨东升
李广地
伍薇蓉
马全
杨钊
文晶
李文臣
崔勇
顾军
涂崎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Electric Power Research Institute Co Ltd CEPRI
State Grid Shanghai Electric Power Co Ltd
Original Assignee
China Electric Power Research Institute Co Ltd CEPRI
State Grid Shanghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Electric Power Research Institute Co Ltd CEPRI, State Grid Shanghai Electric Power Co Ltd filed Critical China Electric Power Research Institute Co Ltd CEPRI
Priority to CN202211418090.7A priority Critical patent/CN115912367A/en
Publication of CN115912367A publication Critical patent/CN115912367A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides an intelligent generation method for power system operation modes based on deep reinforcement learning, and relates to the technical field of power grid operation. The method models the power grid as a reinforcement learning problem using a Markov decision process (MDP) and establishes an improved mapping strategy between agent actions and adjustable action objects. An operation-mode intelligent-generation DQN network is constructed; the current system power flow state and the target operation state are input to the DQN network, which outputs the action with the largest Q value. Power flow iterations are performed with the PQ decomposition method: if λ > 1 or the power flow has not converged after 10 iterations, the power flow is considered ill-conditioned, the action is discarded, and the DQN network generates a new action; if the power flow converges, the operating state of the adjustable action object is adjusted according to the improved mapping strategy. Actions are adjusted continuously until the load level of the target operation mode is met or the maximum number of action adjustments is reached. The estimated Q network parameters are output, completing the intelligent generation and intelligent pruning of power grid operation modes.

Description

Intelligent generation method for operation mode of power system based on deep reinforcement learning
Technical Field
The invention relates to the technical field of power grid operation, in particular to an intelligent generation method of an electric power system operation mode based on deep reinforcement learning.
Background
The calculation of power system operation modes provides the safe and stable operating boundary of the power grid; it is the general guidance scheme for ensuring safe and stable grid operation and the theoretical basis on which dispatchers evaluate the real-time operating state of the grid. Because the various stability analyses of a power system must be performed on the basis of power flow results, power flow calculation is an essential foundation of operation mode calculation. In recent years, rapid socio-economic development, large-scale integration of new energy sources, and the construction of novel power systems have not only increased the scale and complexity of the grid to an unprecedented degree but also markedly increased the number of typical grid operation modes, so the operation mode calculation work faces severe challenges.
In actual engineering, the annual operation mode calculation of a large power grid is mainly completed cooperatively by calculation personnel at dispatch and control centers at each level, based on power system simulation and analysis software, and involves a large amount of manual work. Specifically, according to the forecast of next year's load and grid-structure changes, typical operation modes under various limit conditions are preliminarily formulated with reference to the previous year's operating experience; the safe operating boundary of the grid is then determined by combining manual power flow adjustment with stability calculations, providing a theoretical basis for economic dispatch, equipment maintenance planning, and related work. On the one hand, the scale of the power grid grows daily and its operating characteristics become increasingly complex; on the other hand, operation mode calculation has long relied on a large amount of manual labor, with heavy workload and high repetitiveness.
At present, artificial intelligence technology is leading a new revolution in science and industry. With the development of artificial intelligence, deep reinforcement learning helps people extract general rules from data on the basis of training with large numbers of samples, greatly reducing the investment of manpower and material resources. Intelligent generation of power system operation modes based on deep reinforcement learning means that a machine replaces manual work to complete the generation of grid operation modes and, while generating them, diagnoses the rationality of the current operation mode, i.e., whether the power flow converges or becomes ill-conditioned. With deep reinforcement learning, the high-dimensional power flow space can be adjusted intelligently, knowledge can be incorporated into the adjustment process to reduce the action space, and the manual adjustment process can be effectively imitated. This relieves the burden on staff, provides a power flow adjustment basis for operators, and improves the automation level of the power system.
Chinese patent CN111478331A, a method and system for adjusting power flow convergence of a power system, provides a power grid operation mode calculation method based on improved deep Q-learning. The input and output dimensions of a Q neural network model are determined from the state space and action space; a mapping relation between the action space and the start-stop states of the system generators is determined, and the generator operating states are adjusted according to the adjustment action output by the trained model; the target load level and the generator start-stop states are taken as input and the adjustment action as output to train the Q neural network model according to the input and output dimensions; the power flow is adjusted to a convergent state according to the adjustment action; and the load requirements of different operation modes are met by switching generators on and off and adjusting the power of the balancing machine. In that patent the only adjustable action object is the generator, whereas in today's novel power systems with large-scale new energy integration the adjustable parameters include, besides generator power, the line commissioning state, new energy output state, controllable load state, DC state, and so on. In addition, the generator state in that patent has only two values, on and off, which cannot satisfy the requirement of partially adjusting unit output in actual operation mode calculation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an intelligent generation method of an electric power system operation mode based on deep reinforcement learning.
An intelligent generation method of an electric power system operation mode based on deep reinforcement learning specifically comprises the following steps:
step 1: performing reinforcement learning modeling on the power grid by using a Markov Decision Process (MDP);
step 1.1: setting parameters in the power grid operation mode process by using a Markov decision process;
setting a power grid operation mode calculator as an intelligent agent, setting power grid operation data and a power flow calculation formula as an environment, wherein the power grid power flow calculation convergence is the result of interaction between the intelligent agent and the environment, and the process of interaction between the intelligent agent and the environment is represented by a Markov decision process;
the Markov decision process MDP consists of the 5-tuple (S, A, P_r, R, γ), where S is the system environment state space and s_t is the system state at time t; A is the action space and a_t is the agent action at time t; P_r is the transition probability, and P_r(s_{t+1}|s_t, a_t) is the probability of transitioning to state s_{t+1} after taking action a_t in state s_t; R is the reward function, and r_t is the reward obtained after taking action a_t in state s_t; γ is the discount factor (0 ≤ γ ≤ 1), used to balance the influence of the immediate reward and future rewards on the decision process;
to quantitatively describe how action a_t at time t guides the direction of the system state transition, the state-action value function is introduced, i.e., the expected cumulative reward obtained after performing action a_t in state s_t at time t, denoted Q(s, a) and calculated as in formula (1).
Q(s,a) = E[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s, a_t = a, π],  s ∈ S, a ∈ A   (1)
A larger γ means the future reward has a larger effect on Q(s, a); γ = 1 means the future reward and the immediate reward affect Q(s, a) equally, and γ = 0 means only the immediate reward affects Q(s, a). π represents the action execution policy of the agent, i.e., the mapping between the system state s_t and the action a_t.
The optimal policy μ* makes the Q value of action a_t maximal at each moment, as in formula (2):
μ* = max Q_μ(s, a)   (2)
where Q_μ(s, a) is the expected return of policy μ after taking action a from state s;
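As an illustration of formulas (1) and (2), the short sketch below (a hypothetical Python example, not part of the patent; the reward list and the table of Q values are invented placeholders) estimates Q(s, a) as a discounted sum of rewards along one trajectory and then picks the greedy action.

```python
# Hedged illustration of the discounted return in formula (1) and the greedy
# choice in formula (2). The rewards and Q-table below are invented examples.
GAMMA = 0.9  # discount factor, 0 <= gamma <= 1

def discounted_return(rewards, gamma=GAMMA):
    """Q(s, a) estimated as r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# One hypothetical trajectory of immediate rewards after taking action a in state s
print(discounted_return([-1, -1, 0]))          # -1 - 0.9 + 0.0 = -1.9

# Greedy policy over a made-up table of Q values for one state
q_values = {"action_1": -1.9, "action_2": -0.5, "action_3": -1.2}
best_action = max(q_values, key=q_values.get)  # the policy picks the largest Q value
print(best_action)                             # -> "action_2"
```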
step 1.2: defining an expression of a system environment state space S, an action space A and a reward function R in the Markov decision model;
In the system environment state space S, the state s_t at time t is defined as:
s_t = [p, q, s, v, L, D, l]   (3)
p = [p_1, p_2, ..., p_m]   (4)
q = [q_1, q_2, ..., q_m]   (5)
s = [s_1, s_2, ..., s_n]   (5)
v = [v_1, v_2, ..., v_g]   (6)
L = [L_1, L_2, ..., L_h]   (7)
D = [D_1, D_2, ..., D_k]   (8)
l = [l_1, l_2, ..., l_N]   (9)
where p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the commissioning state of line i; v_i is the new energy output at node i; L_i is the controllable load output at node i; D_i is the DC output at node i; m, n, g, h and k are, respectively, the total numbers of adjustable generator nodes excluding the balancing machine, lines, new energy nodes, controllable load nodes and DC nodes; and l_1, l_2, ..., l_N together form a binary code that represents the numbers of the different operation modes;
in the action space A, the action space is discrete and is associated with discrete positive integers, as in formula (10).
A = [{1,2,...,m}, {1,2,...,n}, {1,2,...,g}, {1,2,...,h}, {1,2,...,k}]   (10)
The numbers in set A represent the numbers of the adjustable action objects, and the adjustment action at time t is denoted a_t;
four indexes are defined for the power flow adjustment problem: (1) convergence of the power flow calculation, denoted c_1; (2) the balancing machine output power not exceeding its limit, denoted c_2; (3) a network loss rate below the set value, quantified by calculating the network loss rate; (4) no ill-conditioned power flow, quantified by the λ value of the power flow iterations; the reward function R is thus given by formula (11):
R = 0, if the power flow converges and the balancing machine output is within its limit after a_t is executed; R = −1, otherwise   (11)
that is, if the power flow calculation converges and the balancing machine output power is not out of limit after a_t is executed, R is 0, and in all other cases R is −1;
step 2: establishing an improved mapping strategy between the agent actions and the adjustable action objects;
in the improved mapping strategy, let P_G be the total active power of the current grid generators excluding the balancing machine; P_L the total active power of all current loads of the grid; P_Bmax/P_Bmin the maximum/minimum active power of the balancing machine; k the set target network loss rate; P_i the active power of generator i; and P_imax the maximum active power of generator i, with a minimum adjustment threshold of 0.05 P_imax; the following three cases are included:
(1) When the available generation including the balancing machine at its maximum output cannot cover the load and the target network loss, i.e. P_G + P_Bmax < (1 + k) P_L: if a_t = i, let P_i = 0.5 P_imax; if at this time P_i ≥ 0.5 P_imax already, set P_i to the midpoint of P_i and P_imax, rounded up, until P_imax is reached. In this scenario the total active power of the system generators is judged to be insufficient, and the generator active power must be increased for the power flow to converge.
(2) When the generation exceeds the load and the target network loss even with the balancing machine at its minimum output, i.e. P_G + P_Bmin > (1 + k) P_L: if a_t = i, let P_i = 0.5 P_imax; if at this time P_i ≤ 0.5 P_imax already, set P_i to the midpoint of P_i and the shutdown output, rounded down, until the output reaches 0%, i.e. the unit is shut down. In this scenario the total active power of the system generators is judged to be excessive, and the generator active power must be reduced for the power flow to converge.
(3) In all cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5 P_imax, set P_i to the midpoint of P_i and P_imax, rounded up, until P_imax is reached; otherwise set P_i to the midpoint of P_i and the shutdown output, rounded down, until the output reaches 0%, i.e. the unit is shut down;
step 3: constructing the operation-mode intelligent-generation DQN network;
the DQN network combines the Q-learning framework with a neural network: the neural network is used to estimate the Q value function, and after the value of each power flow adjustment action has been calculated by the neural network, an ε-greedy search is used to select the action, outputting the action with the largest Q value.
The DQN network introduces an estimated Q network and a target Q network, and the training process comprises the following steps:
step A1: at the start of training, the parameters of the estimated Q network and the target Q network for the nodes, generators, lines and loads are set to be identical, with parameter matrices θ and θ';
step A2: during training, the estimated Q network is updated once per time step along the gradient descent direction of the loss function in formula (13), and the DQN network calculates a Q value from the estimated Q network and the current state and outputs a power flow adjustment action;
L(θ) = E[(r_t + γ max_{a'} Q(s_{t+1}, a'; θ') − Q(s_t, a_t; θ))²]   (13)
step A3: the operating state of the adjustable action object is adjusted according to step A2;
step A4: every C steps, the estimated Q network parameters θ are copied to the target Q network θ';
step A5: the target Q network is updated once every C time steps along the gradient descent direction of formula (13).
The power flow adjustment action value calculated by the estimated Q network is called the predicted value, and the sum of the immediate reward in the current state and the action value of that state calculated by the target Q network is called the true value; the parameters of the estimated Q network are updated by back propagation. This updating process is repeated during training until the power flow converges and the balancing machine output is not out of limit, or the number of iteration rounds is reached;
step 4: modeling the intelligent pruning process of operation modes and constructing an ill-conditioned power flow diagnosis model;
if the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow has no solution; or the power flow calculation has a feasible solution that cannot be found, i.e. the ill-conditioned power flow problem;
the ill-conditioned power flow problem comprises the following two situations:
(1) ill-conditioned power flow caused by overloaded cross-section flows, which is resolved by adjusting the output of the adjustable action objects to redistribute the active unbalanced power;
(2) ill-conditioned power flow caused by insufficient local reactive power support, which is judged by defining the following power flow iteration index:
when the power flow is calculated with the PQ decomposition method and does not converge, the index λ is taken as the criterion, as in formula (14):
λ = max{ |[ΔU]^(3) / [ΔU]^(2)| }   (14)
where [ΔU]^(3) is the voltage increment of the third iteration and [ΔU]^(2) is the voltage increment of the second iteration;
λ < 1 when the power flow converges normally; λ increases as the reactive demand of the PQ node loads increases, and λ > 1 when the power flow is ill-conditioned.
After the operation mode is generated in step 3, its rationality is judged: the power flow is calculated with the PQ decomposition method, and if λ > 1 or the power flow iteration does not converge within 10 iterations, an ill-conditioned power flow is considered to have occurred and the operation mode is deleted; otherwise the operation mode is considered reasonable and is retained, completing the intelligent pruning of operation modes;
step 5: repeat steps 3 to 4, continuously adjusting actions until the load level of the target operation mode is met or the maximum number of action adjustments is reached;
step 6: output the estimated Q network parameters θ, completing the intelligent generation and intelligent pruning of power grid operation modes.
The beneficial effects produced by the above technical solution are as follows:
The invention provides an intelligent generation method for power system operation modes based on deep reinforcement learning in which a computer, instead of a person, completes the adjustment of the adjustable objects and finally outputs a usable operation mode, greatly reducing the workload of operation mode calculation personnel. The design of the action space satisfies the practical requirements of adjusting the elements of a novel power system, and the improved mapping strategy accelerates model training and thus reduces the computing power required. The output adjustment threshold of an action object is 5% of its maximum power, which better matches the adjustment requirements of operation mode calculation in an actual grid; different mapping strategies are designed for different action objects, so the improved mapping strategy copes well with the enlarged action space without a notable increase in running time and accelerates the DQN network training process.
Compared with the prior art, the technical solution provided by the invention adopts an operation-mode intelligent generation method based on deep reinforcement learning: the computer replaces the person in completing the adjustment of the adjustable objects and finally outputs a usable operation mode, greatly reducing the workload of operation mode calculation personnel; the action space satisfies the practical element adjustment requirements of the novel power system; and the improved mapping strategy accelerates model training and reduces the computing power requirement.
Drawings
FIG. 1 is a flow chart of an intelligent generation method of an operation mode of an electric power system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a node mapping relationship of an IEEE-30 node system generator in an embodiment of the present invention;
fig. 3 is a flowchart of an algorithm for intelligently generating DQN network in an operating mode according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
An intelligent generation method for an electric power system operation mode based on deep reinforcement learning is disclosed, as shown in fig. 1, and specifically comprises the following steps:
step 1: performing reinforcement learning modeling on the power grid by using a Markov Decision Process (MDP);
The formulation of a power grid operation mode is essentially a process of adjusting the power system power flow to convergence; it can be regarded as the decision process by which operation mode calculation personnel calculate and adjust the grid data to obtain the system power flow.
step 1.1: setting parameters in the power grid operation mode process by using a Markov decision process;
the method comprises the steps that a power grid operation mode calculator is set as an Agent (Agent, which refers to a calculation entity which is resident in a certain environment, can continuously and autonomously play a role and has the characteristics of residence, reactivity, sociality, initiative and the like), power grid operation data and a power flow calculation formula are set as the environment, and the result of interaction between the Agent and the environment is power grid power flow calculation convergence to obtain the result of a typical operation mode. The Process of interaction of the agent with the environment is represented by Markov Decision Process (MDP);
the Markov departmentThe Freund decision process MDP consists of 5-tuple (S, A, P) r R, gamma), S is the system environment state space, S t The system state at the moment t; a is an action space, a t Is the agent action at time t; p r To transition probabilities, P r (s t+1 |s t ,a t ) Is in a state s t Taking action a t Post transition to state s t+1 The probability of (d); r is a reward function, R t Is in a state s t Take action a t The reward value obtained later; gamma is a discount factor (gamma is more than or equal to 0 and less than or equal to 1) and is used for balancing the influence of the instant reward value and the future reward value on the decision making process;
to quantitatively describe the action a at time t t Guiding the system State transition direction, introducing the function concept of State Action State-Action value, namely, the State s is guided at the moment t t Performing action a t The expected value of the jackpot prize to be obtained later is represented by Q (s, a), and the specific calculation method is shown in formula (1).
Q(s,a)=E[r t +γr t+12 r t+2 +...|s t =s,a t =a,π],s∈S,a∈A (1)
Where, larger γ means larger effect of the future bonus value on Q (s, a), γ =1 means that the future bonus value and the instant bonus value have the same effect on Q (s, a), and γ =1 means that only the instant bonus value affects Q (s, a); π represents the action execution policy of the agent, i.e. the system state s t And action a t The mapping relationship between them.
Calculating an optimal strategy mu * Make the action a at each time t The Q value of (2) is maximum, and the formula is shown as (2):
μ * =maxQ μ (s,a) (2)
in the formula, Q μ (s, a) is the expected return of policy μ after taking action a from state s;
step 1.2: defining an expression of a system environment state space S, an action space A and a reward function R in the Markov decision model;
In the system environment state space S, the state s_t at time t is defined as:
s_t = [p, q, s, v, L, D, l]   (3)
p = [p_1, p_2, ..., p_m]   (4)
q = [q_1, q_2, ..., q_m]   (5)
s = [s_1, s_2, ..., s_n]   (5)
v = [v_1, v_2, ..., v_g]   (6)
L = [L_1, L_2, ..., L_h]   (7)
D = [D_1, D_2, ..., D_k]   (8)
l = [l_1, l_2, ..., l_N]   (9)
where p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the commissioning state of line i; v_i is the new energy output at node i, which is considered to follow a Weibull probability distribution; L_i is the controllable load output at node i, regarded as a negative generator output; D_i is the DC output at node i, with the sending end regarded as a negative equivalent generator output and the receiving end as a positive equivalent generator output; m, n, g, h and k are, respectively, the total numbers of adjustable generator nodes excluding the balancing machine, lines, new energy nodes, controllable load nodes and DC nodes; and l_1, l_2, ..., l_N together form a binary code that represents the different operation mode numbers, for example: with 16 operation modes in total, N = 4, l_1 l_2 l_3 l_4 = 0000 denotes the 1st operation mode and l_1 l_2 l_3 l_4 = 1111 denotes the 16th operation mode.
In this embodiment, to simplify the model and reflect how operation mode calculation personnel actually adjust the adjustable action objects, the generator, controllable load and DC outputs are simplified so that the minimum single-adjustment threshold is 5% of the maximum output and every adjustment is an integer multiple of 5%; for example, p_i may take the values [0, 0.05, 0.1, ..., 1.0], and p_1 = 0.3 means that the output of the generator at node 1 is adjusted to 30% of its maximum active power. The line commissioning state is simplified to the two states 1/0, where 1 is in service and 0 is out of service. The new energy output fluctuates randomly and follows a Weibull probability distribution; it is simplified to the expected value of the Weibull distribution function, with only the 1/0 states, where 1 is output at the expected value of the distribution and 0 is shutdown.
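As an illustration of this discretization, the sketch below (a hypothetical Python example, not part of the patent; all numerical values, array sizes and the helper names are invented) snaps a generator output to the nearest 5% step and packs a small state vector s_t together with the binary operation-mode code l.

```python
import numpy as np

STEP = 0.05  # minimum adjustment threshold: 5% of maximum output

def snap_to_step(frac, step=STEP):
    """Round a per-unit output (0..1) to the nearest 5% level."""
    return round(round(frac / step) * step, 2)

def mode_code(mode_index, n_bits=4):
    """Binary code l_1..l_N for the operation mode number (mode 1 -> 0...0)."""
    return [int(b) for b in format(mode_index - 1, f"0{n_bits}b")]

# Hypothetical miniature system: 2 adjustable generators, 3 lines, 1 new-energy
# node, 1 controllable load, 1 DC node, 16 candidate operation modes (N = 4).
p = [snap_to_step(0.31), snap_to_step(0.74)]   # -> [0.30, 0.75]
q = [0.10, 0.20]                               # reactive power (illustrative)
s = [1, 1, 0]                                  # line commissioning states
v = [1]                                        # new energy at Weibull expectation
L = [0.15]                                     # controllable load output
D = [0.20]                                     # DC output
l = mode_code(3)                               # 3rd operation mode -> [0, 0, 1, 0]

s_t = np.concatenate([p, q, s, v, L, D, l])    # state vector per formula (3)
print(s_t)
```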
In the action space A, because the adjustment modes of the adjustable action objects are defined in this simplified way, the generator, controllable load and DC nodes each have only the 21 output levels 0, 5%, 10%, ..., 100%, and the line commissioning states and new energy nodes have only the two adjustment values 1 and 0; the action space A is therefore discrete and is associated with discrete positive integers, as in formula (10).
A = [{1,2,...,m}, {1,2,...,n}, {1,2,...,g}, {1,2,...,h}, {1,2,...,k}]   (10)
The numbers in set A represent the numbers of the adjustable action objects; a number of 0 means the corresponding object is not adjusted, and the code is output selectively according to the actual conditions of different node systems. The adjustment action at time t is denoted a_t; for example, a_t = [1, 3, 0, 0, 4] represents adjusting the operating states of generator node 1, line 3 and DC node 4.
In the reward function R, the immediate reward r_t affects the calculation result of Q_μ(s, a) in formula (2), and Q_μ(s, a) in turn affects the selection of action a_t. The design idea of the reward function is as follows: when the agent chooses an action that makes the power flow converge, the environment gives a larger reward; when it chooses an action that makes the power flow diverge or drives the balancing machine beyond its limit, the environment gives a corresponding penalty value, so that to obtain the maximum reward the agent constrains its actions to satisfy the action change rate. Four indexes are defined for the power flow adjustment problem: (1) convergence of the power flow calculation, denoted c_1; (2) the balancing machine output power not exceeding its limit, denoted c_2; (3) a network loss rate below the set value, quantified by calculating the network loss rate; (4) no ill-conditioned power flow, quantified by the λ value of the power flow iterations; the reward function R is thus given by formula (11):
R = 0, if the power flow converges and the balancing machine output is within its limit after a_t is executed; R = −1, otherwise   (11)
that is, if the power flow calculation converges and the balancing machine output power is not out of limit after a_t is executed, R is 0, and in all other cases R is −1; the fewer adjustment steps are used, the larger the cumulative reward obtained.
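The following sketch (a hypothetical Python example, not part of the patent; the function and flag names are invented) expresses the 0/−1 reward of formula (11) and the resulting preference for shorter adjustment sequences.

```python
GAMMA = 0.9  # discount factor

def reward(converged, balancer_within_limit):
    """Formula (11): 0 when the flow converges and the balancing machine stays
    within its limit after a_t is executed, -1 in every other case."""
    return 0.0 if (converged and balancer_within_limit) else -1.0

def episode_return(step_results, gamma=GAMMA):
    """Discounted return of one adjustment episode; fewer failed steps -> larger return."""
    return sum((gamma ** t) * reward(c, b) for t, (c, b) in enumerate(step_results))

# Hypothetical episodes: (converged, balancer_within_limit) per adjustment step
print(episode_return([(False, True), (False, True), (True, True)]))  # -1.9
print(episode_return([(False, True), (True, True)]))                 # -1.0 (fewer steps, larger return)
```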
Step 2: establishing an improved mapping strategy of the intelligent agent action and the adjustable action object;
In this embodiment, taking the IEEE-30 node system as an example, the generator node mapping relationship is shown in fig. 2. Under the usual mapping strategy, which associates the action a_t directly with the individual generator states, the state space grows exponentially with each additional generator, the power flow fails to converge for most of those states, and an exhaustive search over the action states would likewise take exponentially more time. To improve the search efficiency, the improved mapping strategy is designed as follows:
in the improved mapping strategy, let P_G be the total active power of the current grid generators excluding the balancing machine; P_L the total active power of all current loads of the grid; P_Bmax/P_Bmin the maximum/minimum active power of the balancing machine; k the set target network loss rate; P_i the active power of generator i; and P_imax the maximum active power of generator i, with a minimum adjustment threshold of 0.05 P_imax; the following three cases are included:
(1) When the available generation including the balancing machine at its maximum output cannot cover the load and the target network loss, i.e. P_G + P_Bmax < (1 + k) P_L: if a_t = i, let P_i = 0.5 P_imax; if at this time P_i ≥ 0.5 P_imax already, set P_i to the midpoint of P_i and P_imax, rounded up, until P_imax is reached. For example, if a_t = i and currently P_i = 0.75 P_imax, then after adjustment P_i = [0.5 × (75% + 100%)] P_imax = 0.875 P_imax → 0.9 P_imax. In this scenario the total active power of the system generators is judged to be insufficient, and the generator active power must be increased for the power flow to converge.
(2) When the generation exceeds the load and the target network loss even with the balancing machine at its minimum output, i.e. P_G + P_Bmin > (1 + k) P_L: if a_t = i, let P_i = 0.5 P_imax; if at this time P_i ≤ 0.5 P_imax already, set P_i to the midpoint of P_i and the shutdown output, rounded down, until the output reaches 0%, i.e. the unit is shut down. For example, if a_t = i and currently P_i = 0.25 P_imax, then after adjustment P_i = [0.5 × (25% + 0%)] P_imax = 0.125 P_imax → 0.1 P_imax. In this scenario the total active power of the system generators is judged to be excessive, and the generator active power must be reduced for the power flow to converge.
(3) In all cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5 P_imax, set P_i to the midpoint of P_i and P_imax, rounded up, until P_imax is reached; otherwise set P_i to the midpoint of P_i and the shutdown output, rounded down, until the output reaches 0%, i.e. the unit is shut down;
replacing the generator node with other adjustable action objects, the mapping strategies of the other action objects are obtained by analogy and are adjusted according to the actual conditions of the node system.
step 3: constructing the operation-mode intelligent-generation DQN network;
the algorithm flowchart of the operation-mode intelligent-generation DQN network is shown in fig. 3. The DQN network is an improvement on the Q-learning framework that combines it with a neural network: the neural network is used to estimate the Q value function, and after the value of each power flow adjustment action has been calculated by the neural network, an ε-greedy search is used to select the action, outputting the action with the largest Q value.
The DQN network introduces an estimated Q network and a target Q network, and the training process is as follows:
step A1: at the start of training, the parameters of the estimated Q network and the target Q network for the nodes, generators, lines and loads are set to be identical, with parameter matrices θ and θ';
step A2: during training, the estimated Q network is updated once per time step along the gradient descent direction of the loss function in formula (13), and the DQN network calculates a Q value from the estimated Q network and the current state and outputs a power flow adjustment action;
L(θ) = E[(r_t + γ max_{a'} Q(s_{t+1}, a'; θ') − Q(s_t, a_t; θ))²]   (13)
step A3: the operating state of the adjustable action object is adjusted according to step A2;
step A4: every C steps, the estimated Q network parameters θ are copied to the target Q network θ';
step A5: the target Q network is updated once every C time steps along the gradient descent direction of formula (13).
The power flow adjustment action value calculated by the estimated Q network is called the predicted value, and the sum of the immediate reward in the current state and the action value of that state calculated by the target Q network is called the true value; the parameters of the estimated Q network are updated by back propagation. This updating process is repeated during training until the power flow converges and the balancing machine output is not out of limit, or the number of iteration rounds is reached;
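A compact sketch of this estimated-Q / target-Q training scheme is shown below (a hypothetical PyTorch example, not part of the patent; the network sizes, hyperparameters and the stand-in state tensors are invented placeholders). It mirrors steps A1–A5: the estimated Q network is updated every step against the target computed with the target Q network, the action is chosen ε-greedily, and θ is copied to θ' every C steps.

```python
import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 32, 10     # invented dimensions
GAMMA, EPSILON, C = 0.9, 0.1, 50  # discount, exploration rate, target-copy period

def make_q_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net = make_q_net()                              # estimated Q network, parameters theta
target_net = make_q_net()                         # target Q network, parameters theta'
target_net.load_state_dict(q_net.state_dict())    # step A1: identical at the start
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def select_action(state):
    """epsilon-greedy search over the Q values of the power flow adjustment actions."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(step, state, action, reward, next_state):
    """Steps A2-A5: one gradient-descent update of the loss in formula (13)."""
    q_pred = q_net(state)[action]                               # predicted value
    with torch.no_grad():
        q_true = reward + GAMMA * target_net(next_state).max()  # true value
    loss = (q_true - q_pred) ** 2
    optimizer.zero_grad()
    loss.backward()                                             # back propagation
    optimizer.step()
    if step % C == 0:                                           # step A4: copy theta -> theta'
        target_net.load_state_dict(q_net.state_dict())
    return float(loss)

# Usage with made-up tensors standing in for the power flow states s_t and s_{t+1}
s_t, s_next = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
a_t = select_action(s_t)
print(train_step(step=1, state=s_t, action=a_t, reward=-1.0, next_state=s_next))
```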
step 4: modeling the intelligent pruning process of operation modes and constructing an ill-conditioned power flow diagnosis model;
if the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow has no solution; or the power flow calculation has a feasible solution that cannot be found, i.e. the ill-conditioned power flow problem. Ill-conditioned power flow is characterized by a convergence solution that deviates severely from the initial value, an increased number of iterations, and slow convergence; or by a Jacobian matrix that tends towards singularity, so that the power flow cannot converge to the feasible solution. There are two causes of ill-conditioned power flow: (1) the cross-section power flow is too heavy, i.e. the active power is too large; (2) local reactive power support is insufficient.
The ill-conditioned power flow problem comprises the following two situations:
(1) ill-conditioned power flow caused by overloaded cross-section flows, which is resolved by adjusting the output of the adjustable action objects to redistribute the active unbalanced power;
(2) ill-conditioned power flow caused by insufficient local reactive power support, which is judged by defining the following power flow iteration index:
when the power flow is calculated with the PQ decomposition method and does not converge, the index λ is taken as the criterion, as in formula (14):
λ = max{ |[ΔU]^(3) / [ΔU]^(2)| }   (14)
where [ΔU]^(3) is the voltage increment of the third iteration and [ΔU]^(2) is the voltage increment of the second iteration;
λ < 1 when the power flow converges normally; λ increases as the reactive demand of the PQ node loads increases, and λ > 1 when the power flow is ill-conditioned. Taking the IEEE-118 node system as an example, PQ node No. 29 has an active demand of 24 MW and a reactive demand of 4 MVar; the reactive demand of this load node is increased gradually, and the relationship between the reactive demand and λ is shown in Table 1.
Table 1. Relationship between the increase in load node reactive demand and λ
As can be seen from Table 1, as the reactive demand of the node increases, λ > 1 when the system becomes ill-conditioned, so it is reasonable to adopt λ as an index of the degree of ill-conditioning of the system.
After the operation mode is generated in step 3, its rationality is judged: the power flow is calculated with the PQ decomposition method, and if λ > 1 or the power flow iteration does not converge within 10 iterations, an ill-conditioned power flow is considered to have occurred and the operation mode is deleted; otherwise the operation mode is considered reasonable and is retained, completing the intelligent pruning of operation modes;
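This pruning rule can be sketched as follows (a hypothetical Python example, not part of the patent; in practice the per-iteration voltage increments would come from the PQ decomposition power flow solver, and the arrays used here are invented).

```python
import numpy as np

MAX_ITERATIONS = 10  # power flow deemed non-convergent beyond this count

def lambda_index(delta_u_2, delta_u_3):
    """Formula (14): lambda = max |[dU]^(3) / [dU]^(2)| over the node voltages."""
    return float(np.max(np.abs(np.asarray(delta_u_3) / np.asarray(delta_u_2))))

def keep_operation_mode(converged, iterations, delta_u_2, delta_u_3):
    """Delete the mode on ill-conditioned flow (lambda > 1) or non-convergence within 10 iterations."""
    if not converged or iterations > MAX_ITERATIONS:
        return False
    return lambda_index(delta_u_2, delta_u_3) <= 1.0

# Invented voltage increments from the 2nd and 3rd PQ-decomposition iterations
du2 = [0.020, 0.015, 0.030]
du3 = [0.010, 0.006, 0.012]            # shrinking increments -> lambda = 0.5 < 1, keep
print(keep_operation_mode(True, 5, du2, du3))       # True
du3_bad = [0.030, 0.020, 0.045]        # growing increments -> lambda = 1.5 > 1, delete
print(keep_operation_mode(True, 5, du2, du3_bad))   # False
```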
step 5: repeat steps 3 to 4, continuously adjusting actions until the load level of the target operation mode is met or the maximum number of action adjustments is reached;
step 6: output the estimated Q network parameters θ, completing the intelligent generation and intelligent pruning of power grid operation modes.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and illustrative of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combinations of the above features, and that other embodiments formed by arbitrarily combining the above features or their equivalents without departing from the spirit of the invention are also encompassed, for example technical solutions formed by mutually replacing the above features with (but not limited to) technical features of similar function disclosed in the embodiments of the present disclosure.

Claims (7)

1. An intelligent generation method of an electric power system operation mode based on deep reinforcement learning is characterized by comprising the following steps:
step 1: performing reinforcement learning modeling on the power grid by using a Markov Decision Process (MDP);
step 2: establishing an improved mapping strategy of the intelligent agent action and the adjustable action object;
step 3: constructing the operation-mode intelligent-generation DQN network;
step 4: modeling the intelligent pruning process of operation modes and constructing an ill-conditioned power flow diagnosis model;
if the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow has no solution; or the power flow calculation has a feasible solution that cannot be found, i.e. the ill-conditioned power flow problem;
step 5: repeat steps 3 to 4, continuously adjusting actions until the load level of the target operation mode is met or the maximum number of action adjustments is reached;
step 6: output the estimated Q network parameters θ, completing the intelligent generation and intelligent pruning of power grid operation modes.
2. The intelligent generation method of the operation mode of the power system based on the deep reinforcement learning according to claim 1, wherein the step 1 specifically comprises the following steps:
step 1.1: setting parameters in the power grid operation mode process by using a Markov decision process;
setting a power grid operation mode calculator as an intelligent agent, setting power grid operation data and a power flow calculation formula as an environment, wherein the result of interaction between the intelligent agent and the environment is power grid power flow calculation convergence, and the process of interaction between the intelligent agent and the environment is represented by a Markov decision process;
the Markov decision process MDP consists of the 5-tuple (S, A, P_r, R, γ), where S is the system environment state space and s_t is the system state at time t; A is the action space and a_t is the agent action at time t; P_r is the transition probability, and P_r(s_{t+1}|s_t, a_t) is the probability of transitioning to state s_{t+1} after taking action a_t in state s_t; R is the reward function, and r_t is the reward obtained after taking action a_t in state s_t; γ is the discount factor (0 ≤ γ ≤ 1), used to balance the influence of the immediate reward and future rewards on the decision process;
to quantitatively describe how action a_t at time t guides the direction of the system state transition, the state-action value function is introduced, i.e., the expected cumulative reward obtained after performing action a_t in state s_t at time t, denoted Q(s, a) and calculated as in formula (1);
Q(s,a) = E[r_t + γr_{t+1} + γ²r_{t+2} + ... | s_t = s, a_t = a, π],  s ∈ S, a ∈ A   (1)
a larger γ means the future reward has a larger effect on Q(s, a); γ = 1 means the future reward and the immediate reward affect Q(s, a) equally, and γ = 0 means only the immediate reward affects Q(s, a); π represents the action execution policy of the agent, i.e., the mapping between the system state s_t and the action a_t;
the optimal policy μ* makes the Q value of action a_t maximal at each moment, as in formula (2):
μ* = max Q_μ(s, a)   (2)
where Q_μ(s, a) is the expected return of policy μ after taking action a from state s;
step 1.2: defining an expression of a system environment state space S, an action space A and a reward function R in the Markov decision model;
in the system environment state space S, the state s_t at time t is defined as:
s_t = [p, q, s, v, L, D, l]   (3)
p = [p_1, p_2, ..., p_m]   (4)
q = [q_1, q_2, ..., q_m]   (5)
s = [s_1, s_2, ..., s_n]   (5)
v = [v_1, v_2, ..., v_g]   (6)
L = [L_1, L_2, ..., L_h]   (7)
D = [D_1, D_2, ..., D_k]   (8)
l = [l_1, l_2, ..., l_N]   (9)
where p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the commissioning state of line i; v_i is the new energy output at node i; L_i is the controllable load output at node i; D_i is the DC output at node i; m, n, g, h and k are, respectively, the total numbers of adjustable generator nodes excluding the balancing machine, lines, new energy nodes, controllable load nodes and DC nodes; and l_1, l_2, ..., l_N together form a binary code that represents the numbers of the different operation modes;
in the action space A, the action space is discrete and is associated with discrete positive integers, as in formula (10);
A = [{1,2,...,m}, {1,2,...,n}, {1,2,...,g}, {1,2,...,h}, {1,2,...,k}]   (10)
the numbers in set A represent the numbers of the adjustable action objects, and the adjustment action at time t is denoted a_t;
four indexes are defined for the power flow adjustment problem: (1) convergence of the power flow calculation, denoted c_1; (2) the balancing machine output power not exceeding its limit, denoted c_2; (3) a network loss rate below the set value, quantified by calculating the network loss rate; (4) no ill-conditioned power flow, quantified by the λ value of the power flow iterations; the reward function R is thus given by formula (11):
R = 0, if the power flow converges and the balancing machine output is within its limit after a_t is executed; R = −1, otherwise   (11)
that is, if the power flow calculation converges and the balancing machine output power is not out of limit after a_t is executed, R is 0, and in all other cases R is −1.
3. The intelligent generation method of the operation mode of the power system based on deep reinforcement learning according to claim 1, wherein in the improved mapping strategy of step 2, P_G is set as the total active power of the current grid generators excluding the balancing machine; P_L is the total active power of all current loads of the grid; P_Bmax/P_Bmin are the maximum/minimum active power of the balancing machine; k is the set target network loss rate; P_i is the active power of generator i; and P_imax is the maximum active power of generator i, with a minimum adjustment threshold of 0.05 P_imax.
4. The intelligent generation method of the operation mode of the power system based on deep reinforcement learning according to claim 1, wherein the DQN network in step 3 combines a Q-learning framework with a neural network: the neural network is used to estimate the Q value function, and after the value of each power flow adjustment action has been calculated by the neural network, an ε-greedy search is used to select the action, outputting the action with the largest Q value.
5. The intelligent generation method of the operation mode of the power system based on deep reinforcement learning according to claim 1, wherein the ill-conditioned power flow problem in step 4 comprises the following two situations:
(1) ill-conditioned power flow caused by overloaded cross-section flows, which is resolved by adjusting the output of the adjustable action objects to redistribute the active unbalanced power;
(2) ill-conditioned power flow caused by insufficient local reactive power support, which is judged by defining the following power flow iteration index:
when the power flow calculated with the PQ decomposition method does not converge, the index λ is taken as the criterion, as in formula (14):
λ = max{ |[ΔU]^(3) / [ΔU]^(2)| }   (14)
where [ΔU]^(3) is the voltage increment of the third iteration and [ΔU]^(2) is the voltage increment of the second iteration;
λ < 1 when the power flow converges normally; λ increases as the reactive demand of the PQ node loads increases, and λ > 1 when the power flow is ill-conditioned;
after the operation mode is generated in step 3, its rationality is judged: the power flow is calculated with the PQ decomposition method, and if λ > 1 or the power flow iteration does not converge within 10 iterations, an ill-conditioned power flow is considered to have occurred and the operation mode is deleted; otherwise the operation mode is considered reasonable and is retained, completing the intelligent pruning of the operation modes.
6. The intelligent generation method of the operation mode of the power system based on deep reinforcement learning according to claim 3, wherein the improved mapping strategy comprises the following three cases:
(1) when the available generation including the balancing machine at its maximum output cannot cover the load and the target network loss, i.e. P_G + P_Bmax < (1 + k) P_L: if a_t = i, let P_i = 0.5 P_imax; if at this time P_i ≥ 0.5 P_imax already, set P_i to the midpoint of P_i and P_imax, rounded up, until P_imax is reached; in this scenario the total active power of the system generators is judged to be insufficient, and the generator active power must be increased for the power flow to converge;
(2) when the generation exceeds the load and the target network loss even with the balancing machine at its minimum output, i.e. P_G + P_Bmin > (1 + k) P_L: if a_t = i, let P_i = 0.5 P_imax; if at this time P_i ≤ 0.5 P_imax already, set P_i to the midpoint of P_i and the shutdown output, rounded down, until the output reaches 0%, i.e. the unit is shut down; in this scenario the total active power of the system generators is judged to be excessive, and the generator active power must be reduced for the power flow to converge;
(3) in all cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5 P_imax, set P_i to the midpoint of P_i and P_imax, rounded up, until P_imax is reached; otherwise set P_i to the midpoint of P_i and the shutdown output, rounded down, until the output reaches 0%, i.e. the unit is shut down.
7. The intelligent generation method of the operation mode of the power system based on deep reinforcement learning according to claim 4, wherein the DQN network introduces an estimated Q network and a target Q network, and the training process comprises:
step A1: at the start of training, the parameters of the estimated Q network and the target Q network for the nodes, generators, lines and loads are set to be identical, with parameter matrices θ and θ';
step A2: during training, the estimated Q network is updated once per time step along the gradient descent direction of the loss function in formula (13), and the DQN network calculates a Q value from the estimated Q network and the current state and outputs a power flow adjustment action;
L(θ) = E[(r_t + γ max_{a'} Q(s_{t+1}, a'; θ') − Q(s_t, a_t; θ))²]   (13)
step A3: the operating state of the adjustable action object is adjusted according to step A2;
step A4: every C steps, the estimated Q network parameters θ are copied to the target Q network θ';
step A5: the target Q network is updated once every C time steps along the gradient descent direction of formula (13);
the parameters of the estimated Q network are updated by back propagation, and this updating process is repeated during training until the power flow converges and the balancing machine output is not out of limit, or the number of iteration rounds is reached.
CN202211418090.7A 2022-11-14 2022-11-14 Intelligent generation method for operation mode of power system based on deep reinforcement learning Pending CN115912367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211418090.7A CN115912367A (en) 2022-11-14 2022-11-14 Intelligent generation method for operation mode of power system based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211418090.7A CN115912367A (en) 2022-11-14 2022-11-14 Intelligent generation method for operation mode of power system based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115912367A true CN115912367A (en) 2023-04-04

Family

ID=86496603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211418090.7A Pending CN115912367A (en) 2022-11-14 2022-11-14 Intelligent generation method for operation mode of power system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115912367A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117041068A (en) * 2023-07-31 2023-11-10 广东工业大学 Deep reinforcement learning reliable sensing service assembly integration method and system
CN118278495A (en) * 2024-05-27 2024-07-02 东北大学 Method for generating power grid operation mode based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN115912367A (en) Intelligent generation method for operation mode of power system based on deep reinforcement learning
CN112039069A (en) Double-layer collaborative planning method and system for power distribution network energy storage and flexible switch
CN107292502B (en) Power distribution network reliability assessment method
CN114362196A (en) Multi-time-scale active power distribution network voltage control method
CN103345663B (en) Consider the Unit Commitment optimization method of ramping rate constraints
CN114722709B (en) Cascade reservoir group optimal scheduling method and system considering generated energy and minimum output
CN115345380A (en) New energy consumption electric power scheduling method based on artificial intelligence
CN115313403A (en) Real-time voltage regulation and control method based on deep reinforcement learning algorithm
CN112012875B (en) Optimization method of PID control parameters of water turbine regulating system
CN115986834A (en) Near-end strategy optimization algorithm-based optical storage charging station operation optimization method and system
CN114566971A (en) Real-time optimal power flow calculation method based on near-end strategy optimization algorithm
CN115036931A (en) Active power grid reactive voltage affine adjustable robust optimization method and device
CN114784823A (en) Micro-grid frequency control method and system based on depth certainty strategy gradient
CN107516892A (en) The method that the quality of power supply is improved based on processing active optimization constraints
CN118174355A (en) Micro-grid energy optimization scheduling method
CN117057228A (en) Inverter multi-objective optimization method based on deep reinforcement learning
CN116914751A (en) Intelligent power distribution control system
CN117893043A (en) Hydropower station load distribution method based on DDPG algorithm and deep learning model
CN114330649A (en) Voltage regulation method and system based on evolutionary learning and deep reinforcement learning
CN111525556B (en) Multi-target optimal power flow calculation method considering wind power confidence risk
CN113555876A (en) Line power flow regulation and control method and system based on artificial intelligence
CN117277346A (en) Energy storage frequency modulation method, device and equipment based on multi-agent system
CN117117989A (en) Deep reinforcement learning solving method for unit combination
CN111799820A (en) Double-layer intelligent hybrid zero-star cloud energy storage countermeasure regulation and control method for power system
CN117057623A (en) Comprehensive power grid safety optimization scheduling method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination