CN115912367A - A method for intelligent generation of power system operation mode based on deep reinforcement learning - Google Patents


Info

Publication number
CN115912367A
Authority
CN
China
Prior art keywords
action
power
operation mode
network
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211418090.7A
Other languages
Chinese (zh)
Inventor
吕晨
陈兴雷
于子洋
周博文
杨东升
李广地
伍薇蓉
马全
杨钊
文晶
李文臣
崔勇
顾军
涂崎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Electric Power Research Institute Co Ltd CEPRI
State Grid Shanghai Electric Power Co Ltd
Original Assignee
China Electric Power Research Institute Co Ltd CEPRI
State Grid Shanghai Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Electric Power Research Institute Co Ltd CEPRI, State Grid Shanghai Electric Power Co Ltd filed Critical China Electric Power Research Institute Co Ltd CEPRI
Priority to CN202211418090.7A priority Critical patent/CN115912367A/en
Publication of CN115912367A publication Critical patent/CN115912367A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides an intelligent generation method for power system operation modes based on deep reinforcement learning, and relates to the technical field of power grid operation. The method models the power grid as a reinforcement-learning problem using a Markov decision process (MDP) and establishes an improved mapping strategy between agent actions and adjustable action objects. An operation-mode intelligent-generation DQN network is constructed; the current system power-flow state and the target operation state are input to the DQN network, which outputs the action with the largest Q value. Power-flow iterations are carried out with the P-Q decomposition method: if λ > 1 or the power flow fails to converge within 10 iterations, a pathological power flow is considered to have occurred, the action is discarded, and the DQN network regenerates a new action; if the power flow converges, the operating state of the adjustable action object is adjusted according to the improved mapping strategy. Actions are adjusted continuously until the load level of the target operation mode is met or the maximum number of action adjustments is reached. Finally, the estimated Q network parameters are output, completing the intelligent generation and intelligent deletion of power grid operation modes.

Description

A method for intelligent generation of power system operation modes based on deep reinforcement learning

Technical Field

The present invention relates to the field of power grid operation technology, and in particular to a method for intelligently generating power system operation modes based on deep reinforcement learning.

Background Art

Power system operation-mode calculation gives the safe and stable operation boundary of the power grid; it is the overall guiding plan for ensuring safe and stable grid operation and the theoretical basis on which dispatchers assess the real-time operating state of the grid. Since every kind of power system stability analysis must be based on power flow results, power flow calculation is the fundamental element of operation-mode calculation. In recent years, rapid socio-economic development, large-scale integration of new energy sources and the emergence of new-type power systems have not only increased the scale and complexity of the grid to an unprecedented level but also significantly increased the number of typical operation modes, so operation-mode calculation faces severe challenges.

In actual engineering practice, the annual operation-mode calculation of a large power grid is mainly completed cooperatively by mode-calculation personnel at dispatch and control centers of all levels using power system simulation and analysis software, and it involves a great deal of manual work. Specifically, typical operation modes under various extreme conditions must first be drafted according to the forecast of next year's load and network-structure changes, with reference to the previous year's operating experience; the safe operation boundary of the grid is then determined by combining manual power-flow adjustment with stability calculations, providing the theoretical basis for economic dispatch, equipment maintenance planning and related work. On the one hand, the grid keeps growing and its operating characteristics become ever more complex; on the other hand, operation-mode calculation has long relied on a large amount of manual labor, with a heavy and highly repetitive workload.

At present, artificial intelligence technology is leading a new round of scientific and technological revolution and industrial transformation. With the development of artificial intelligence, deep reinforcement learning, trained on large numbers of samples, helps humans extract the general laws of data and greatly reduces the investment of manpower and material resources. Intelligent generation of power system operation modes based on deep reinforcement learning lets a machine take over the generation of grid operation modes. While generating an operation mode it also diagnoses the rationality of that mode, i.e. whether the power flow converges or becomes pathological. Deep reinforcement learning makes it possible to adjust the high-dimensional power-flow space intelligently, to inject knowledge and experience into the adjustment process, to shrink the action space and to imitate the manual adjustment process more effectively. This is of great significance for reducing the burden on staff, providing operators with a basis for power-flow adjustment and raising the automation level of the power system.

The Chinese patent "CN111478331A, A method and system for adjusting the convergence of power system power flow" proposes a grid operation-mode calculation method based on improved deep Q-learning. That patent determines the input and output dimensions of the Q neural network model from the state space and action space; establishes a mapping between the action space and the start/stop states of the generators; adjusts the generator operating states according to the adjustment action output by the trained model; takes the target load level and the generator start/stop states as input and the adjustment action as output, training the Q neural network model according to those dimensions; adjusts the power flow to a converged state according to the adjustment action; and meets the load demand under different operation modes by switching generators on and off and adjusting the power of the balancing machine. In that patent the only adjustable action object is the generator, whereas in today's new-type power system with large-scale new-energy integration the adjustable parameters also include line commissioning status, new-energy output, controllable load and DC status, in addition to generator power. Moreover, the generators in that patent have only the two states on and off, which cannot satisfy the need to adjust partial unit output in actual operation-mode calculation.

Summary of the Invention

In view of the shortcomings of the prior art, the present invention provides a method for intelligently generating power system operation modes based on deep reinforcement learning.

A method for intelligently generating power system operation modes based on deep reinforcement learning specifically comprises the following steps:

Step 1: Use the Markov decision process (MDP) to perform reinforcement learning modeling of the power grid;

Step 1.1: Use the Markov decision process to set the parameters of the grid operation-mode process;

The operation-mode calculation personnel are set as an agent, and the grid operating data and the power flow equations are set as the environment; the result of the interaction between agent and environment is convergence of the grid power flow calculation, and the interaction process is represented by a Markov decision process;

The Markov decision process MDP consists of the 5-tuple (S, A, P_r, R, γ), where S is the system environment state space and s_t is the system state at time t; A is the action space and a_t is the agent action at time t; P_r is the transition probability, and P_r(s_{t+1} | s_t, a_t) is the probability of moving to state s_{t+1} after taking action a_t in state s_t; R is the reward function, and r_t is the reward obtained after taking action a_t in state s_t; γ is the discount factor (0 ≤ γ ≤ 1), used to balance the influence of immediate and future rewards on the decision process;
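
As an illustration only (not part of the claimed method), the following minimal Python sketch bundles the five elements (S, A, P_r, R, γ) into one container; all class and field names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class GridMDP:
    """Minimal container for the 5-tuple (S, A, Pr, R, gamma) described above."""
    states: Sequence[Any]                  # S: system environment state space
    actions: Sequence[int]                 # A: discrete action space
    transition: Callable[[Any, int], Any]  # Pr: returns next state s_{t+1} given (s_t, a_t)
    reward: Callable[[Any, int], float]    # R: returns r_t for (s_t, a_t)
    gamma: float = 0.95                    # discount factor, 0 <= gamma <= 1
```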

To quantitatively describe how the action a_t at time t steers the direction of the system state transition, the State-Action value function is introduced: the expected cumulative reward obtained after executing action a_t in state s_t at time t is denoted Q(s, a), and is computed as in formula (1).

Q(s, a) = E[r_t + γ·r_{t+1} + γ²·r_{t+2} + ... | s_t = s, a_t = a, π],  s ∈ S, a ∈ A   (1)

In the formula, a larger γ means the future rewards have a greater influence on Q(s, a); γ = 1 means future rewards and the immediate reward influence Q(s, a) equally, while γ = 0 means only the immediate reward affects Q(s, a); π denotes the agent's action execution policy, i.e. the mapping between the system state s_t and the action a_t.

The optimal policy μ* is computed so that the Q value of the action a_t at each time step is maximal, as shown in formula (2):

μ* = max Q_μ(s, a)   (2)

where Q_μ(s, a) is the expected return of policy μ after taking action a starting from state s;
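
A minimal sketch, assuming a tabular Q function, of how the discounted return of formula (1) and the greedy choice behind formula (2) can be evaluated (array sizes and values are illustrative):

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted reward r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... (formula (1))."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def greedy_action(q_table, state):
    """Pick the action with the largest Q(s, a), i.e. the action the optimal policy selects."""
    return int(np.argmax(q_table[state]))

# toy example: 3 states x 4 actions
q_table = np.random.rand(3, 4)
print(discounted_return([-1, -1, 0]), greedy_action(q_table, state=0))
```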

Step 1.2: Define the expressions of the system environment state space S, the action space A and the reward function R in the Markov decision model;

In the system environment state space S, the state space s_t at time t is defined as:

s_t = [p, q, s, v, L, D, l]   (3)

p = [p_1, p_2, ..., p_m]   (4)

q = [q_1, q_2, ..., q_m]   (5)

s = [s_1, s_2, ..., s_n]   (5)

v = [v_1, v_2, ..., v_g]   (6)

L = [L_1, L_2, ..., L_h]   (7)

D = [D_1, D_2, ..., D_k]   (8)

l = [l_1, l_2, ..., l_N]   (9)

In the formulas, p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the line commissioning status of node i; v_i is the new-energy output of node i; L_i is the controllable load output of node i; D_i is the DC output of node i; m, n, g, h and k are respectively the total number of adjustable generator nodes excluding the balancing machine, the total number of lines, the total number of new-energy nodes, the total number of controllable-load nodes and the total number of DC nodes; l_1, l_2, ..., l_N together form a binary code used to number the different operation modes;
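
A minimal sketch of assembling the state vector s_t of formula (3) from its component vectors (the array sizes and values below are illustrative assumptions):

```python
import numpy as np

def build_state(p, q, s, v, L, D, l):
    """Concatenate generator P/Q, line status, new-energy output, controllable load,
    DC output and the binary operation-mode code into one state vector s_t."""
    return np.concatenate([p, q, s, v, L, D, l]).astype(float)

# toy example: m=2 generators, n=3 lines, g=1 new-energy node, h=1 load, k=1 DC node, N=2-bit mode code
s_t = build_state(p=[0.3, 0.5], q=[0.1, 0.05], s=[1, 1, 0],
                  v=[1], L=[0.2], D=[0.4], l=[0, 1])
print(s_t.shape, s_t)
```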

In the action space A, the action space is discrete; it is associated with discrete positive integers as shown in formula (10).

A = [{1, 2, ..., m}, {1, 2, ..., n}, {1, 2, ..., g}, {1, 2, ..., h}, {1, 2, ..., k}]   (10)

The numbers in set A represent the numbers of the adjustable action objects, and the adjustment action at time t is denoted a_t;

Four indicators of the power-flow adjustment problem are defined: (1) the power flow calculation converges, denoted c_1; (2) the output power of the balancing machine does not exceed its limits, denoted c_2; (3) the network loss rate is below the set value, quantified by computing the network loss rate; (4) no pathological power flow is produced, quantified by the λ value of the power-flow iteration; therefore, the reward function R is as in formula (11):

R = 0, if after executing a_t the power flow converges and the balancing machine output is within limits; R = −1, otherwise   (11)

After executing a_t, if the power flow calculation converges and the output power of the balancing machine does not exceed its limits, R is 0; in all other cases R is −1;
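
A minimal sketch of the reward rule of formula (11), assuming boolean flags for convergence (c_1) and the balancing-machine limit check (c_2) are available from the power-flow routine (the names are illustrative):

```python
def reward(converged: bool, balancer_within_limits: bool) -> int:
    """R = 0 when the power flow converges and the balancing machine stays within limits,
    otherwise R = -1, so fewer adjustment steps yield a larger cumulative reward."""
    return 0 if (converged and balancer_within_limits) else -1

print(reward(True, True), reward(True, False), reward(False, True))  # 0 -1 -1
```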

Step 2: Establish an improved mapping strategy between agent actions and adjustable action objects;

The improved mapping strategy is as follows. Let P_G be the sum of the active power of the current grid generators excluding the balancing machine; P_L be the sum of the active power of all current loads of the grid; P_Bmax / P_Bmin be the maximum / minimum active power of the balancing machine; K be the set target network-loss rate; P_i be the active power of generator i; and P_imax be the maximum active power of generator i, with a minimum adjustment threshold of 0.05·P_imax. Three cases are distinguished:

(1) When the condition of the first case holds [formula image not reproduced], and a_t = i, set P_i = 0.5·P_imax; if at that moment P_i ≥ 0.5·P_imax, commission P_i at the midpoint of P_i and P_imax, rounded up, until P_imax is reached. In this scenario the total active power of the system generators is judged insufficient, and generator active power must be increased to satisfy the power-flow convergence requirement.

(2) When the condition of the second case holds [formula image not reproduced], and a_t = i, set P_i = 0.5·P_imax; if at that moment P_i ≤ 0.5·P_imax, move P_i to the midpoint of P_i and the shutdown output, rounded down, until 0% is reached, i.e. shutdown. In this scenario the total active power of the system generators is judged excessive, and generator active power must be reduced to satisfy the power-flow convergence requirement.

(3) In all cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5·P_imax, commission P_i at the midpoint of P_i and P_imax, rounded up, until P_imax is reached; otherwise move P_i to the midpoint of P_i and the shutdown output, rounded down, until 0% is reached, i.e. shutdown;
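
A minimal Python sketch of the generator part of this improved mapping strategy, assuming outputs are expressed as fractions of P_imax and snapped to the 0.05·P_imax grid; the helper names and the rounding directions follow the worked examples given later in the embodiment and are illustrative, not a definitive implementation:

```python
import math

def step_toward_max(p_frac):
    """Move P_i to the midpoint of P_i and P_imax, rounded up to the next 5% step."""
    mid = 0.5 * (p_frac + 1.0)
    return min(1.0, round(math.ceil(mid / 0.05) * 0.05, 2))

def step_toward_zero(p_frac):
    """Move P_i to the midpoint of P_i and shutdown (0%), rounded down to a 5% step."""
    mid = 0.5 * p_frac
    return max(0.0, round(math.floor(mid / 0.05) * 0.05, 2))

def apply_mapping(case, p_frac):
    """Case 1: total generation insufficient -> raise output; case 2: excessive -> lower it;
    case 3: raise if already at or above 50% of P_imax, otherwise lower."""
    if case == 1:
        return step_toward_max(p_frac) if p_frac >= 0.5 else 0.5
    if case == 2:
        return step_toward_zero(p_frac) if p_frac <= 0.5 else 0.5
    return step_toward_max(p_frac) if p_frac >= 0.5 else step_toward_zero(p_frac)

# worked examples from the embodiment: 0.75 -> 0.9 (case 1), 0.25 -> 0.1 (case 2)
print(apply_mapping(1, 0.75), apply_mapping(2, 0.25))
```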

Step 3: Construct the operation-mode intelligent-generation DQN network;

The DQN network combines the Q-learning framework with a neural network: the neural network is used to estimate the Q value function, and after it computes the value function of each power-flow adjustment action, ε-greedy search is used for action selection and the action with the largest Q value is output.
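
A minimal sketch of the ε-greedy selection described above, assuming the network's per-action Q values are already available as an array (the names and the value of ε are illustrative):

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise pick the action with the largest Q value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

print(epsilon_greedy(np.array([0.1, -0.4, 0.7, 0.2])))  # usually returns 2
```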

The DQN network introduces an estimated Q network and a target Q network, and its training process is:

Step A1: At the start of training, the node, generator, line and load parameters of the estimated Q network and the target Q network are set identically, with parameter matrices θ and θ′ respectively;

Step A2: During training, the estimated Q network is updated once per time step along the gradient-descent direction of the loss function in formula (13); the DQN network computes the Q values from the estimated Q network and the current state and outputs a power-flow adjustment action;

L(θ) = E[(r + γ·max_a′ Q(s′, a′; θ′) − Q(s, a; θ))²]   (13)

Step A3: Adjust the operating state of the adjustable action object according to step A2;

Step A4: Every C steps, pass the estimated Q network parameters θ to the target Q network θ′;

Step A5: The target Q network is updated once every C time steps along the gradient-descent direction of formula (13).

The power-flow adjustment action value computed by the estimated Q network is called the predicted value; the sum of the immediate reward in the current state and the power-flow adjustment action value computed by the target Q network is called the true value. The parameters of the estimated Q network are updated by back-propagation. During training this update process is repeated until the power flow converges with the balancing-machine output within limits, or the number of iteration rounds is reached;
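
A minimal PyTorch-style sketch of the estimated-Q / target-Q update in steps A1 to A5; the loss is taken here as the squared difference between the "true value" r + γ·max_a′ Q_target(s′, a′) and the predicted Q(s, a), and the network size, optimizer and use of PyTorch are illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

state_dim, n_actions, gamma, C = 8, 5, 0.95, 10

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # estimated Q network (theta)
target_net = copy.deepcopy(q_net)                                                     # target Q network (theta')
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(step, s, a, r, s_next):
    """One gradient-descent update of the estimated Q network; copy theta -> theta' every C steps."""
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # predicted value Q(s, a; theta)
    with torch.no_grad():
        q_true = r + gamma * target_net(s_next).max(dim=1).values  # "true value" from reward + target network
    loss = nn.functional.mse_loss(q_pred, q_true)                  # squared TD error, as in formula (13)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())             # step A4: sync parameters
    return loss.item()

# toy batch
s = torch.randn(4, state_dim); a = torch.randint(0, n_actions, (4,))
r = torch.zeros(4); s_next = torch.randn(4, state_dim)
print(train_step(1, s, a, r, s_next))
```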

Step 4: Model the intelligent deletion process of operation modes and construct a pathological power-flow diagnosis model;

If the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow is unsolvable; or the power flow calculation has a feasible solution that cannot be found, i.e. a pathological power-flow problem;

The pathological power-flow problem covers the following two situations:

(1) Pathological power flow caused by an overloaded transfer cross-section: the active power imbalance is shared out by adjusting the output of the adjustable action objects, thereby resolving the pathological power flow;

(2) Pathological power flow caused by insufficient local reactive power support: the following power-flow iteration index is defined for its diagnosis:

When the power flow calculation with the P-Q decomposition method does not converge, the index λ is used as the criterion, as shown in formula (14):

λ = max{ |[ΔU]^(3) / [ΔU]^(2)| }   (14)

In the formula, [ΔU]^(3) is the voltage increment of the third iteration and [ΔU]^(2) is the voltage increment of the second iteration;

When the power flow converges normally, λ < 1; as the reactive power demand of the PQ-node loads increases, λ increases; when the power flow becomes pathological, λ > 1.

After an operation mode is generated in step 3, its rationality is judged: the power flow is calculated with the P-Q decomposition method, and if λ > 1 or the power-flow iteration does not converge within 10 iterations, a pathological power flow is deemed to have occurred and the operation mode is deleted; otherwise the operation mode is considered reasonable and retained, completing the intelligent deletion of operation modes;
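
A minimal sketch of the deletion check built around the λ criterion of formula (14), assuming the power-flow routine records the voltage-correction vectors of the second and third P-Q iterations (all function names are illustrative):

```python
import numpy as np

def lambda_index(delta_u_2, delta_u_3):
    """lambda = max |[dU]^(3) / [dU]^(2)| over all buses (formula (14))."""
    return float(np.max(np.abs(np.asarray(delta_u_3) / np.asarray(delta_u_2))))

def keep_operation_mode(converged, iterations, delta_u_2, delta_u_3, max_iter=10):
    """Discard the generated operation mode if lambda > 1 or the flow fails to converge within 10 iterations."""
    if not converged or iterations > max_iter:
        return False
    return lambda_index(delta_u_2, delta_u_3) <= 1.0

print(keep_operation_mode(True, 6, [0.04, 0.02], [0.01, 0.005]))  # lambda = 0.25 -> keep
print(keep_operation_mode(True, 6, [0.01, 0.02], [0.03, 0.05]))   # lambda = 3.0  -> discard
```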

Step 5: Repeat steps 3 to 4, continually adjusting the action until the load level of the target operation mode is met or the maximum number of action adjustments is reached;

Step 6: Output the estimated Q network parameters θ, completing the intelligent generation and intelligent deletion of power grid operation modes.

The beneficial effects produced by adopting the above technical solution are:

The present invention provides a method for intelligently generating power system operation modes based on deep reinforcement learning, in which a computer, instead of a person, completes the adjustment of the adjustable objects and finally outputs usable operation modes, greatly reducing the workload of operation-mode calculation personnel. The adjustable action objects in the present invention include generator status, line commissioning status, new-energy output status, controllable-load status and DC status, which better matches the needs of operation-mode calculation in an actual power grid. Adjusting the action space satisfies the practical requirements for adjusting components of the new-type power system, and the improved mapping strategy also speeds up model training, thereby reducing the computing-power requirement. The output-adjustment threshold of an action object is 5% of its maximum power, which better matches operation-mode adjustment requirements in an actual grid; different mapping strategies are designed for the different classes of action objects, so the growth of the action space is handled well by the improved mapping strategy without a significant increase in running time, accelerating the DQN training process.

Compared with the prior art, the technical solution of the present invention adopts an intelligent operation-mode generation method based on deep reinforcement learning: a computer, instead of a person, completes the adjustment of the adjustable objects and finally outputs usable operation modes, greatly reducing the workload of operation-mode calculation personnel; adjusting the action space satisfies the practical requirements for adjusting components of the new-type power system; and the improved mapping strategy speeds up model training, thereby reducing the computing-power requirement.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of the method for intelligently generating power system operation modes in an embodiment of the present invention;

FIG. 2 is a schematic diagram of the generator-node mapping relationship of the IEEE 30-bus system in an embodiment of the present invention;

FIG. 3 is a flow chart of the operation-mode intelligent-generation DQN network algorithm in an embodiment of the present invention.

DETAILED DESCRIPTION

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are used to illustrate the present invention, but are not intended to limit its scope.

A method for intelligently generating power system operation modes based on deep reinforcement learning, as shown in FIG. 1, specifically comprises the following steps:

Step 1: Use the Markov decision process (MDP) to perform reinforcement learning modeling of the power grid;

The formulation of a grid operation mode is essentially a process of adjusting the power system power flow until it converges; it can be regarded as a decision-making process in which operation-mode calculation personnel compute and adjust grid data to obtain the system power flow.

Step 1.1: Use the Markov decision process to set the parameters of the grid operation-mode process;

The operation-mode calculation personnel are set as an agent (a computing entity that resides in an environment and acts continuously and autonomously, characterized by persistence, reactivity, social ability and initiative); the grid operating data and the power flow equations are set as the environment. The result of the interaction between agent and environment is convergence of the grid power flow calculation, yielding the typical operation modes. The interaction process between the agent and the environment is represented by a Markov decision process (MDP);

The Markov decision process MDP consists of the 5-tuple (S, A, P_r, R, γ), where S is the system environment state space and s_t is the system state at time t; A is the action space and a_t is the agent action at time t; P_r is the transition probability, and P_r(s_{t+1} | s_t, a_t) is the probability of moving to state s_{t+1} after taking action a_t in state s_t; R is the reward function, and r_t is the reward obtained after taking action a_t in state s_t; γ is the discount factor (0 ≤ γ ≤ 1), used to balance the influence of immediate and future rewards on the decision process;

To quantitatively describe how the action a_t at time t steers the direction of the system state transition, the State-Action value function is introduced: the expected cumulative reward obtained after executing action a_t in state s_t at time t is denoted Q(s, a), and is computed as in formula (1).

Q(s, a) = E[r_t + γ·r_{t+1} + γ²·r_{t+2} + ... | s_t = s, a_t = a, π],  s ∈ S, a ∈ A   (1)

In the formula, a larger γ means the future rewards have a greater influence on Q(s, a); γ = 1 means future rewards and the immediate reward influence Q(s, a) equally, while γ = 0 means only the immediate reward affects Q(s, a); π denotes the agent's action execution policy, i.e. the mapping between the system state s_t and the action a_t.

The optimal policy μ* is computed so that the Q value of the action a_t at each time step is maximal, as shown in formula (2):

μ* = max Q_μ(s, a)   (2)

where Q_μ(s, a) is the expected return of policy μ after taking action a starting from state s;

Step 1.2: Define the expressions of the system environment state space S, the action space A and the reward function R in the Markov decision model;

In the system environment state space S, the state space s_t at time t is defined as:

s_t = [p, q, s, v, L, D, l]   (3)

p = [p_1, p_2, ..., p_m]   (4)

q = [q_1, q_2, ..., q_m]   (5)

s = [s_1, s_2, ..., s_n]   (5)

v = [v_1, v_2, ..., v_g]   (6)

L = [L_1, L_2, ..., L_h]   (7)

D = [D_1, D_2, ..., D_k]   (8)

l = [l_1, l_2, ..., l_N]   (9)

In the formulas, p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the line commissioning status of node i; v_i is the new-energy output of node i, where the new-energy output is assumed to follow a Weibull probability distribution; L_i is the controllable load output of node i, treated as a negative generator output; D_i is the DC output of node i, treated as a negative current-source generator output at the sending end and a positive current-source generator output at the receiving end; m, n, g, h and k are respectively the total number of adjustable generator nodes excluding the balancing machine, the total number of lines, the total number of new-energy nodes, the total number of controllable-load nodes and the total number of DC nodes; l_1, l_2, ..., l_N together form a binary code used to number the different operation modes. For example, if there are 16 operation modes in total, then N = 4; l_1 l_2 l_3 l_4 = 0000 represents the 1st operation mode and l_1 l_2 l_3 l_4 = 1111 represents the 16th operation mode.
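
A minimal sketch of the binary operation-mode code l = [l_1, ..., l_N] described above (the choice of one-based mode numbering follows the example in the text; the function name is illustrative):

```python
def encode_mode(mode_number, n_bits):
    """Mode 1 -> 00...0 and mode 2**n_bits -> 11...1, matching the example of 16 modes with N = 4."""
    bits = format(mode_number - 1, f"0{n_bits}b")
    return [int(b) for b in bits]

print(encode_mode(1, 4), encode_mode(16, 4))  # [0, 0, 0, 0]  [1, 1, 1, 1]
```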

In this embodiment, to simplify the model and reflect how operation-mode calculation personnel actually adjust the adjustable action objects, the generator, controllable-load and DC outputs are simplified so that the minimum threshold of a single adjustment is 5% of the maximum output and the adjustment can only be an integer multiple of 5%; for example, the allowed values of p_i are [0, 0.05, 0.1, ..., 1.0], and p_1 = 0.3 means the output of the generator at node 1 is adjusted to 30% of its maximum active power. The line commissioning status is simplified to the two states 1/0, where 1 means in service and 0 means out of service. Since new-energy output fluctuates randomly and follows a Weibull probability distribution, it is simplified to the expected value of the Weibull distribution function, with only the two states 1/0, where 1 means output at the expected value of the distribution and 0 means out of service.
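
A minimal sketch of the simplification described above: snapping a requested output to the nearest 5% step of the maximum output (the function name and the sample values are illustrative):

```python
def snap_output(fraction, step=0.05):
    """Clip a requested output fraction to [0, 1] and round it to the nearest multiple of 5%."""
    fraction = min(1.0, max(0.0, fraction))
    return round(round(fraction / step) * step, 2)

print(snap_output(0.33), snap_output(0.301), snap_output(1.2))  # 0.35 0.3 1.0
```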

In the action space A, owing to the simplified definition of how the adjustable action objects are adjusted, the generator, controllable-load and DC nodes have only 21 output-adjustment levels (0, 5%, 10%, ..., 100%), and the line commissioning status and new-energy nodes have only the two adjustment states 1 and 0; hence the action space A is discrete, and it is associated with discrete positive integers as shown in formula (10).

A = [{1, 2, ..., m}, {1, 2, ..., n}, {1, 2, ..., g}, {1, 2, ..., h}, {1, 2, ..., k}]   (10)

The numbers in set A represent the numbers of the adjustable action objects; the number 0 means no adjustment of that object, and the code is output selectively according to the actual situation of the node system. The adjustment action at time t is denoted a_t; for example, a_t = [1, 3, 0, 0, 4] means adjusting the operating status of generator node 1, line 3 and DC node 4.
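
A minimal sketch of decoding an action vector such as a_t = [1, 3, 0, 0, 4], where 0 means no adjustment of that object class (the category names are illustrative):

```python
def decode_action(a_t):
    """Map each non-zero entry of a_t to the adjustable object it addresses."""
    categories = ["generator", "line", "new_energy", "controllable_load", "dc"]
    return [(cat, idx) for cat, idx in zip(categories, a_t) if idx != 0]

print(decode_action([1, 3, 0, 0, 4]))  # [('generator', 1), ('line', 3), ('dc', 4)]
```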

For the reward function R, formula (2) shows that the immediate reward r_t affects the computed Q_μ(s, a), and Q_μ(s, a) in turn affects the choice of action a_t. The design idea of the reward function is as follows: when the agent selects an action that makes the power flow converge, the environment gives a larger reward; when it selects an action that makes the power flow diverge or drives the balancing machine beyond its limits, the environment gives a corresponding penalty, so that to obtain the maximum reward the agent constrains its actions to satisfy the action change rate. Four indicators of the power-flow adjustment problem are defined: (1) the power flow calculation converges, denoted c_1; (2) the output power of the balancing machine does not exceed its limits, denoted c_2; (3) the network loss rate is below the set value, quantified by computing the network loss rate; (4) no pathological power flow is produced, quantified by the λ value of the power-flow iteration; therefore, the reward function R is as in formula (11):

R = 0, if after executing a_t the power flow converges and the balancing machine output is within limits; R = −1, otherwise   (11)

After executing a_t, if the power flow calculation converges and the output power of the balancing machine does not exceed its limits, R is 0; in all other cases R is −1. Therefore, the fewer the adjustment steps, the larger the cumulative reward.

Step 2: Establish an improved mapping strategy between agent actions and adjustable action objects;

This embodiment takes the IEEE 30-bus system as an example; the generator-node mapping relationship is shown in FIG. 2. The usual mapping strategy places the actions a_t in one-to-one correspondence with the adjustable generators, but a grid has many generators, each additional generator makes the state space grow exponentially, and most of those states yield non-convergent power flows; an exhaustive search over action states would therefore also require exponentially growing time. To improve search efficiency, the improved mapping strategy is designed as follows:

The improved mapping strategy is as follows. Let P_G be the sum of the active power of the current grid generators excluding the balancing machine; P_L be the sum of the active power of all current loads of the grid; P_Bmax / P_Bmin be the maximum / minimum active power of the balancing machine; K be the set target network-loss rate; P_i be the active power of generator i; and P_imax be the maximum active power of generator i, with a minimum adjustment threshold of 0.05·P_imax. Three cases are distinguished:

(1) When the condition of the first case holds [formula image not reproduced], and a_t = i, set P_i = 0.5·P_imax; if at that moment P_i ≥ 0.5·P_imax, commission P_i at the midpoint of P_i and P_imax, rounded up, until P_imax is reached. For example, if a_t = i and currently P_i = 0.75·P_imax, the adjusted P_i = [0.5·(75% + 100%)]·P_imax = 0.875·P_imax → 0.9·P_imax. In this scenario the total active power of the system generators is judged insufficient, and generator active power must be increased to satisfy the power-flow convergence requirement.

(2) When the condition of the second case holds [formula image not reproduced], and a_t = i, set P_i = 0.5·P_imax; if at that moment P_i ≤ 0.5·P_imax, move P_i to the midpoint of P_i and the shutdown output, rounded down, until 0% is reached, i.e. shutdown. For example, if a_t = i and currently P_i = 0.25·P_imax, the adjusted P_i = [0.5·(25% + 0%)]·P_imax = 0.125·P_imax → 0.1·P_imax. In this scenario the total active power of the system generators is judged excessive, and generator active power must be reduced to satisfy the power-flow convergence requirement.

(3) In all cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5·P_imax, commission P_i at the midpoint of P_i and P_imax, rounded up, until P_imax is reached; otherwise move P_i to the midpoint of P_i and the shutdown output, rounded down, until 0% is reached, i.e. shutdown;

Replacing the generator nodes with the other adjustable action objects yields the mapping strategies for the remaining action objects by analogy; the mapping strategy can be adjusted according to the actual situation of the node system.

Step 3: Construct the operation-mode intelligent-generation DQN network;

The algorithm flow chart of the operation-mode intelligent-generation DQN network is shown in FIG. 3. The DQN network is an improvement on the Q-learning framework: it combines Q-learning with a neural network and uses the neural network to estimate the Q value function; after the neural network computes the value function of each power-flow adjustment action, ε-greedy search is used for action selection and the action with the largest Q value is output.

The DQN network introduces an estimated Q network and a target Q network, and its training process is:

Step A1: At the start of training, the node, generator, line and load parameters of the estimated Q network and the target Q network are set identically, with parameter matrices θ and θ′ respectively;

Step A2: During training, the estimated Q network is updated once per time step along the gradient-descent direction of the loss function in formula (13); the DQN network computes the Q values from the estimated Q network and the current state and outputs a power-flow adjustment action;

L(θ) = E[(r + γ·max_a′ Q(s′, a′; θ′) − Q(s, a; θ))²]   (13)

Step A3: Adjust the operating state of the adjustable action object according to step A2;

Step A4: Every C steps, pass the estimated Q network parameters θ to the target Q network θ′;

Step A5: The target Q network is updated once every C time steps along the gradient-descent direction of formula (13).

The power-flow adjustment action value computed by the estimated Q network is called the predicted value; the sum of the immediate reward in the current state and the power-flow adjustment action value computed by the target Q network is called the true value. The parameters of the estimated Q network are updated by back-propagation. During training this update process is repeated until the power flow converges with the balancing-machine output within limits, or the number of iteration rounds is reached;

Step 4: Model the intelligent deletion process of operation modes and construct a pathological power-flow diagnosis model;

If the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow is unsolvable; or the power flow calculation has a feasible solution that cannot be found, i.e. a pathological power-flow problem. Pathological power flow is characterized by a converged solution that deviates severely from the starting initial values, an increasing number of iterations with slow convergence, or a Jacobian matrix that tends toward singularity so that the power flow cannot converge to the feasible solution. Pathological power flow has two causes: (1) an overloaded transfer cross-section, i.e. excessive active power; (2) insufficient local reactive power support.

The pathological power-flow problem covers the following two situations:

(1) Pathological power flow caused by an overloaded transfer cross-section: the active power imbalance is shared out by adjusting the output of the adjustable action objects, thereby resolving the pathological power flow;

(2) Pathological power flow caused by insufficient local reactive power support: the following power-flow iteration index is defined for its diagnosis:

When the power flow calculation with the P-Q decomposition method does not converge, the index λ is used as the criterion, as shown in formula (14):

λ = max{ |[ΔU]^(3) / [ΔU]^(2)| }   (14)

In the formula, [ΔU]^(3) is the voltage increment of the third iteration and [ΔU]^(2) is the voltage increment of the second iteration;

When the power flow converges normally, λ < 1; as the reactive power demand of the PQ-node loads increases, λ increases; when the power flow becomes pathological, λ > 1. Taking the IEEE 118-bus system as an example, the active power demand of PQ node 29 is 24 MW and its reactive power demand is 4 MVar; the reactive power demand of this load node is gradually increased, and its relationship with λ is shown in Table 1.

Table 1: Relationship between the increase in reactive power demand at the load node and λ

[Table 1 values: image not reproduced]

As Table 1 shows, as the reactive power demand of the node increases, λ > 1 once the system becomes ill-conditioned, so it is reasonable to use λ as an index of the degree of system ill-conditioning.

After an operation mode is generated in step 3, its rationality is judged: the power flow is calculated with the P-Q decomposition method, and if λ > 1 or the power-flow iteration does not converge within 10 iterations, a pathological power flow is deemed to have occurred and the operation mode is deleted; otherwise the operation mode is considered reasonable and retained, completing the intelligent deletion of operation modes;

Step 5: Repeat steps 3 to 4, continually adjusting the action until the load level of the target operation mode is met or the maximum number of action adjustments is reached;

Step 6: Output the estimated Q network parameters θ, completing the intelligent generation and intelligent deletion of power grid operation modes.

The above description is only a preferred embodiment of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example technical solutions formed by replacing the above features with technical features of similar function disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (7)

1. An intelligent generation method for a power system operation mode based on deep reinforcement learning, characterized by comprising the following steps:
Step 1: performing reinforcement learning modeling on the power grid by using a Markov decision process (MDP);
Step 2: establishing an improved mapping strategy between agent actions and adjustable action objects;
Step 3: constructing an operation-mode intelligent-generation DQN network;
Step 4: modeling the intelligent deletion process of operation modes and constructing a pathological power-flow diagnosis model;
wherein, if the power flow calculation cannot converge, two cases are distinguished: the power flow calculation has no feasible solution, i.e. the power flow is unsolvable; or the power flow calculation has a feasible solution that cannot be found, i.e. a pathological power-flow problem;
Step 5: repeating step 3 to step 4, continually adjusting the action until the load level of the target operation mode is met or the maximum number of action adjustments is reached;
Step 6: outputting the estimated Q network parameters θ, completing the intelligent generation and intelligent deletion of the power grid operation modes.
2. The intelligent generation method for the power system operation mode based on deep reinforcement learning according to claim 1, wherein step 1 specifically comprises the following steps:
Step 1.1: setting the parameters of the power grid operation-mode process by using a Markov decision process;
setting the operation-mode calculation personnel as an agent, and setting the grid operating data and the power flow equations as the environment, wherein the result of the interaction between the agent and the environment is convergence of the grid power flow calculation, and the interaction process between the agent and the environment is represented by a Markov decision process;
the Markov decision process MDP consists of the 5-tuple (S, A, P_r, R, γ), where S is the system environment state space and s_t is the system state at time t; A is the action space and a_t is the agent action at time t; P_r is the transition probability, and P_r(s_{t+1} | s_t, a_t) is the probability of moving to state s_{t+1} after taking action a_t in state s_t; R is the reward function, and r_t is the reward obtained after taking action a_t in state s_t; γ is the discount factor (0 ≤ γ ≤ 1), used to balance the influence of immediate and future rewards on the decision process;
to quantitatively describe how the action a_t at time t steers the direction of the system state transition, the State-Action value function is introduced: the expected cumulative reward obtained after executing action a_t in state s_t at time t is denoted Q(s, a), computed as in formula (1);
Q(s, a) = E[r_t + γ·r_{t+1} + γ²·r_{t+2} + ... | s_t = s, a_t = a, π],  s ∈ S, a ∈ A   (1)
where a larger γ means the future rewards have a greater effect on Q(s, a); γ = 1 means future rewards and the immediate reward affect Q(s, a) equally, and γ = 0 means only the immediate reward affects Q(s, a); π denotes the agent's action execution policy, i.e. the mapping between the system state s_t and the action a_t;
calculating the optimal policy μ* so that the Q value of the action a_t at each time step is maximal, as shown in formula (2):
μ* = max Q_μ(s, a)   (2)
where Q_μ(s, a) is the expected return of policy μ after taking action a from state s;
Step 1.2: defining the expressions of the system environment state space S, the action space A and the reward function R in the Markov decision model;
in the system environment state space S, defining the state space s_t at time t as:
s_t = [p, q, s, v, L, D, l]   (3)
p = [p_1, p_2, ..., p_m]   (4)
q = [q_1, q_2, ..., q_m]   (5)
s = [s_1, s_2, ..., s_n]   (5)
v = [v_1, v_2, ..., v_g]   (6)
L = [L_1, L_2, ..., L_h]   (7)
D = [D_1, D_2, ..., D_k]   (8)
l = [l_1, l_2, ..., l_N]   (9)
where p_i is the active power of the generator at node i; q_i is the reactive power of the generator at node i; s_i is the line commissioning status of node i; v_i is the new-energy output of node i; L_i is the controllable load output of node i; D_i is the DC output of node i; m, n, g, h and k are respectively the total number of adjustable generator nodes excluding the balancing machine, the total number of lines, the total number of new-energy nodes, the total number of controllable-load nodes and the total number of DC nodes; l_1, l_2, ..., l_N together form a binary code used to number the different operation modes;
in the action space A, the action space A is discrete and is associated with discrete positive integers, as shown in formula (10);
A = [{1, 2, ..., m}, {1, 2, ..., n}, {1, 2, ..., g}, {1, 2, ..., h}, {1, 2, ..., k}]   (10)
the numbers in set A represent the numbers of the adjustable action objects, and the adjustment action at time t is denoted a_t;
defining 4 indicators of the power-flow adjustment problem: (1) the power flow calculation converges, denoted c_1; (2) the output power of the balancing machine does not exceed its limits, denoted c_2; (3) the network loss rate is below the set value, quantified by computing the network loss rate; (4) no pathological power flow is produced, quantified by the λ value of the power-flow iteration; therefore, the reward function R is as in formula (11):
R = 0, if after executing a_t the power flow converges and the balancing machine output is within limits; R = −1, otherwise   (11)
after executing a_t, if the power flow calculation converges and the output power of the balancing machine does not exceed its limits, R is 0; in all other cases R is −1.
3. The method as claimed in claim 1, wherein in the improved mapping strategy of step 2, P_G is set as the sum of the active power of the current grid generators excluding the balancing machine; P_L is the sum of the active power of all current loads of the grid; P_Bmax / P_Bmin are the maximum / minimum active power of the balancing machine [formula image not reproduced]; K is the set target network-loss rate; P_i is the active power of generator i; P_imax is the maximum active power of generator i, and the minimum adjustment threshold is 0.05·P_imax.
4. The intelligent generation method for the power system operation mode based on deep reinforcement learning according to claim 1, wherein the DQN network in step 3 is obtained by combining a Q-learning network with a neural network, the neural network being used to estimate the Q value function; after the neural network computes the value function of each power-flow adjustment action, ε-greedy search is used for action selection, and the action with the largest Q value is selected for output.
5. The intelligent generation method for the operation mode of the power system based on deep reinforcement learning as claimed in claim 1, wherein the pathological power flow problem in step 4 includes the following two situations:
(1) pathological power flow caused by an overloaded section power flow: the active unbalanced power is redistributed by adjusting the output of the adjustable action objects, thereby resolving the pathological power flow;
(2) pathological power flow caused by insufficient local reactive power support, which is judged by defining the following power flow iteration index:
when the load flow calculation using the PQ decomposition method does not converge, the index λ is taken as the criterion, as shown in formula (14):
λ = max{ |ΔU^(3) / ΔU^(2)| }    (14)
where ΔU^(3) is the voltage increment of the third iteration and ΔU^(2) is the voltage increment of the second iteration;
λ < 1 when the power flow converges normally; λ increases as the reactive power demand of the PQ-node loads increases; and λ > 1 when the power flow is ill-conditioned;
after the operation mode is generated in step 3, its rationality is judged: load flow calculation is carried out with the PQ decomposition method, and when λ > 1 or the load flow calculation does not converge within 10 iterations, the pathological power flow phenomenon is considered to have occurred and the operation mode is deleted; otherwise, the operation mode is considered reasonable and is retained, completing the intelligent deletion of operation modes.
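The λ criterion of formula (14) and the deletion rule of this claim can be expressed compactly; the sketch below assumes the voltage increments of the second and third P-Q iterations are available as arrays with no zero entries, which is an assumption about the load-flow solver's interface.

```python
import numpy as np

def lambda_index(delta_u_2: np.ndarray, delta_u_3: np.ndarray) -> float:
    """Formula (14): lambda = max |dU^(3) / dU^(2)|, taken over all buses.
    Assumes delta_u_2 has no zero entries."""
    return float(np.max(np.abs(delta_u_3 / delta_u_2)))

def is_pathological(converged: bool, iterations: int, lam: float) -> bool:
    """Deletion rule: the operation mode is treated as pathological when
    lambda > 1 or the P-Q load flow has not converged within 10 iterations."""
    return lam > 1.0 or (not converged and iterations >= 10)
```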
6. The intelligent generation method for the operation mode of the power system based on deep reinforcement learning as claimed in claim 3, wherein the improved mapping strategy comprises the following three cases:
(1) When the condition indicating that the total active power of the system generators is insufficient holds: when a_t = i, set P_i = 0.5 P_imax; if at this time P_i ≥ 0.5 P_imax, the output is set to the midpoint value of P_i and P_imax rounded up, repeatedly, until P_imax is put into operation; in this case it is judged that the total active power of the system generators is insufficient, and the generator active power is increased to meet the requirement of power flow convergence;
(2) When the condition indicating that the total active power of the system generators is excessive holds: when a_t = i, set P_i = 0.5 P_imax; if at this time P_i ≤ 0.5 P_imax, the output is set to the midpoint value of P_i and the shutdown output rounded down, repeatedly, until 0% output is reached and the unit is shut down; in this case it is judged that the total active power of the system generators is too large, and the generator active power is reduced to meet the requirement of power flow convergence;
(3) In cases other than (1) and (2), when a_t = i: if P_i ≥ 0.5 P_imax, the output is set to the midpoint value of P_i and P_imax rounded up, repeatedly, until P_imax is reached; otherwise, the output is set to the midpoint value of P_i and the shutdown output rounded down, repeatedly, until 0% output is reached, i.e., the unit is shut down.
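One way to read the midpoint-and-round adjustment used in all three cases is sketched below; the rounding grid of 0.05 * P_imax is an assumption based on the minimum adjustment threshold mentioned in claim 3, since the claim does not state the rounding granularity explicitly.

```python
import math

def midpoint_step(p_i: float, p_imax: float, increase: bool, grid: float) -> float:
    """One adjustment step of the improved mapping strategy: move P_i to the
    midpoint between its current value and the target (P_imax when increasing,
    shutdown/0 when decreasing), rounded up or down to the nearest multiple of grid."""
    if increase:
        mid = 0.5 * (p_i + p_imax)
        return min(p_imax, math.ceil(mid / grid) * grid)
    mid = 0.5 * p_i
    return max(0.0, math.floor(mid / grid) * grid)

# Example: stepping generator i from 0.5 * P_imax up to P_imax (cases (1)/(3)),
# assuming the rounding grid equals the 0.05 * P_imax threshold from claim 3.
p_imax = 100.0
p = 0.5 * p_imax
while p < p_imax:
    p = midpoint_step(p, p_imax, increase=True, grid=0.05 * p_imax)
```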
7. The intelligent generation method for the operation mode of the power system based on deep reinforcement learning as claimed in claim 4, wherein the DQN network introduces an estimation Q network and a target Q network, and the training process comprises:
Step A1: when training starts, the parameters of the estimation Q network and of the target Q network for the node, generator, line and load quantities are set to be the same, with parameter matrices θ and θ';
Step A2: during training, the estimation Q network is updated once every time step in the gradient-descent direction of the loss function in formula (13), and the DQN network calculates Q values from the estimation Q network and the current state to output a power flow adjustment action;
L(θ) = E[ (R + γ max_{a'} Q(s', a'; θ') − Q(s, a; θ))^2 ]    (13)
Step A3: the running state of the adjustable action object is adjusted according to the action output in Step A2;
Step A4: every C time steps, the estimation Q network parameters θ are transmitted to the target Q network parameters θ';
Step A5: the target Q network is updated once every C time steps according to the gradient-descent direction of formula (13);
the parameters of the estimation Q network are updated by back propagation, and this updating process is repeated during training until the power flow converges and the balancing machine output is not out of limit, or the number of iteration rounds is reached.
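Read as a whole, claim 7 describes a fairly standard DQN update loop with an estimation network θ and a periodically synchronised target network θ'. The PyTorch sketch below illustrates that loop under assumed dimensions and hyper-parameters (STATE_DIM, N_ACTIONS, C, GAMMA, EPS and the hidden-layer size are illustrative, not values from the patent), and uses the conventional DQN temporal-difference loss in place of formula (13).

```python
import copy
import random
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, C, GAMMA, EPS = 16, 8, 50, 0.99, 0.1   # illustrative values

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = copy.deepcopy(q_net)                  # theta' starts equal to theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def select_action(state: torch.Tensor) -> int:
    """epsilon-greedy choice of a power-flow adjustment action."""
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(step: int, s, a, r, s_next, done) -> None:
    """One update of the estimation Q network; theta is copied to theta' every C steps."""
    q_sa = q_net(s)[a]                             # Q(s, a; theta)
    with torch.no_grad():                          # TD target uses the target network theta'
        y = r + (0.0 if done else GAMMA * target_net(s_next).max().item())
    loss = (q_sa - y) ** 2                         # squared TD error (formula (13) style)
    optimizer.zero_grad()
    loss.backward()                                # back propagation through theta only
    optimizer.step()
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())   # transmit theta to theta'
```

In this sketch the target network is refreshed only by copying θ, which is the usual reading of transmitting θ to θ' every C steps.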
CN202211418090.7A 2022-11-14 2022-11-14 A method for intelligent generation of power system operation mode based on deep reinforcement learning Pending CN115912367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211418090.7A CN115912367A (en) 2022-11-14 2022-11-14 A method for intelligent generation of power system operation mode based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211418090.7A CN115912367A (en) 2022-11-14 2022-11-14 A method for intelligent generation of power system operation mode based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115912367A true CN115912367A (en) 2023-04-04

Family

ID=86496603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211418090.7A Pending CN115912367A (en) 2022-11-14 2022-11-14 A method for intelligent generation of power system operation mode based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115912367A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117041068A (en) * 2023-07-31 2023-11-10 广东工业大学 Deep reinforcement learning reliable sensing service assembly integration method and system
CN118232333A (en) * 2024-03-29 2024-06-21 中国南方电网有限责任公司 Power grid section regulation and control method and device based on deep reinforcement learning
CN118278495A (en) * 2024-05-27 2024-07-02 东北大学 A method for generating power grid operation mode based on reinforcement learning
CN118862991B (en) * 2024-09-27 2025-01-03 浙江伟臻成套柜体有限公司 A lightweight distribution cabinet service life loss prediction optimization method

Similar Documents

Publication Publication Date Title
CN112615379B (en) Power grid multi-section power control method based on distributed multi-agent reinforcement learning
CN115912367A (en) A method for intelligent generation of power system operation mode based on deep reinforcement learning
CN115241885B (en) Power grid real-time scheduling optimization method and system, computer equipment and storage medium
CN110535146B (en) Electric power system reactive power optimization method based on depth determination strategy gradient reinforcement learning
CN114362187B (en) A method and system for cooperative voltage regulation of active distribution network based on multi-agent deep reinforcement learning
CN113489015A (en) Power distribution network multi-time scale reactive voltage control method based on reinforcement learning
CN112213945B (en) Improved robust prediction control method and system for electric vehicle participating in micro-grid group frequency modulation
CN114566971B (en) Real-time optimal power flow calculation method based on near-end strategy optimization algorithm
CN116523327A (en) Method and equipment for intelligently generating operation strategy of power distribution network based on reinforcement learning
CN115313403A (en) Real-time voltage regulation and control method based on deep reinforcement learning algorithm
CN112488442B (en) Power distribution network reconstruction method based on deep reinforcement learning algorithm and source load uncertainty
CN108306346A (en) A kind of distribution network var compensation power-economizing method
CN113110052B (en) A Hybrid Energy Management Approach Based on Neural Networks and Reinforcement Learning
CN109494766A (en) 2019-03-19 A kind of intelligent power generation control method of manual depth's emotion game intensified learning
CN117674114A (en) Dynamic economic scheduling method and system for power distribution network
CN117937599A (en) Multi-agent reinforcement learning distribution network optimization method for distributed photovoltaic consumption
CN112012875B (en) Optimization method of PID control parameters of water turbine regulating system
CN110163540A (en) Electric power system transient stability prevention and control method and system
CN115345380A (en) A new energy consumption power dispatching method based on artificial intelligence
CN106300417A (en) Wind farm group reactive voltage optimal control method based on Model Predictive Control
CN111749847A (en) On-line control method, system and device for wind turbine pitch
CN114330649B (en) Voltage regulation method and system based on evolutionary learning and deep reinforcement learning
CN118671629A (en) CSAPSO-improved DNN algorithm-based energy storage power station battery state of health evaluation method
CN117893043A (en) Hydropower station load distribution method based on DDPG algorithm and deep learning model
CN117543607A (en) Distribution network reactive power optimization method based on genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination