CN116260143A - Automatic control method and system for power distribution network switch based on reinforcement learning theory - Google Patents
- Publication number
- CN116260143A (application number CN202310099843.0A)
- Authority
- CN
- China
- Prior art keywords
- distribution network
- power
- power distribution
- model
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/04—Circuit arrangements for ac mains or ac distribution networks for connecting networks of the same frequency but supplied from different sources
- H02J3/06—Controlling transfer of power between connected networks; Controlling sharing of load between connected networks
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for ac mains or ac distribution networks
- H02J3/38—Arrangements for parallely feeding a single network by two or more generators, converters or transformers
- H02J3/46—Controlling of the sharing of output between the generators, converters, or transformers
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/10—Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/20—Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2300/00—Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
- H02J2300/20—The dispersed energy generation being of renewable origin
- H02J2300/22—The renewable source being solar energy
- H02J2300/24—The renewable source being solar energy of photovoltaic origin
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2300/00—Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
- H02J2300/20—The dispersed energy generation being of renewable origin
- H02J2300/28—The renewable source being wind energy
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02E—REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
- Y02E10/00—Energy generation through renewable energy sources
- Y02E10/50—Photovoltaic [PV] energy
- Y02E10/56—Power conversion systems, e.g. maximum power point trackers
Abstract
The invention discloses a method and system for automatic control of power distribution network switches based on reinforcement learning theory, wherein the method comprises the following steps: establishing a Distflow power flow optimization constraint model and determining an approximate dynamic programming reinforcement learning algorithm; acquiring historical output data of the distributed generation units and establishing a distributed generation unit output power fluctuation and conversion model; determining the distribution network topology, obtaining the active output data of the distributed generation units, the state information of the controllable sectionalizing switches and tie switches of the distribution network, and the calculation result information of the Distflow power flow optimization constraint model, and establishing an MDP model according to the distributed generation unit output power fluctuation and conversion model, the distribution network topology and the obtained information; and solving the MDP model with the reinforcement learning algorithm and outputting the optimal automatic switch control strategy of the distribution network in real time. The invention can address the output power fluctuation, outages and faults of the distributed generation units in the distribution network, improve the power supply reliability of the distribution system and increase the investment benefit of the distribution system.
Description
Technical Field
The invention relates to the technical field of power grid dispatching, in particular to a power distribution network switch automatic control method and system based on reinforcement learning theory.
Background
The distribution network undertakes the demanding task of receiving and distributing electric energy in the power system; it directly faces end users and is closely tied to people's daily production and life. The traditional distribution network has a tree-shaped, radial structure with several weaknesses: large damage from faults, poor mutual backup capability, a low degree of automation, and so on. Although its construction cost is low, its reliability is not high. In recent years, more and more distributed generation (Distributed Generation, DG) systems, such as wind power and photovoltaic new energy generation systems, have been connected to the distribution network; to a certain extent this meets the development requirements of a clean, environment-friendly, low-cost, efficient and reliable power industry.
However, access of these renewable energy sources also has a number of adverse effects on the distribution network. The biggest problem is that new energy generation modes such as wind and solar power are affected by the environment and exhibit considerable random fluctuation, uncertainty and instability; secondly, every node and branch of the distribution network is prone to failure under natural and man-made disasters. These adverse factors make the operating state of the distribution system complex and variable, directly affecting its safe operation.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent. Therefore, a first object of the present invention is to provide an automatic control method for power distribution network switches based on reinforcement learning theory, by which the problems of output power fluctuation, outages and faults of distributed generation units in the distribution network can be addressed, the power supply reliability of the distribution system can be improved, and the investment benefit of the distribution system can be increased.
The second object of the invention is to provide an automatic control system for the power distribution network switch based on the reinforcement learning theory.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
an automatic control method for a power distribution network switch based on reinforcement learning theory comprises the following steps:
step S1: establishing a Distflow power flow optimization constraint model, and determining an approximate dynamic programming reinforcement learning algorithm;
step S2: acquiring historical output data of a photovoltaic and wind power distributed generation unit, and establishing a distributed generation unit output power fluctuation and conversion model according to the historical output data;
step S3: determining the distribution network topology, acquiring the active output data of the distributed generation units, the state information of the controllable sectionalizing switches and tie switches of the distribution network, and the calculation result information of the Distflow power flow optimization constraint model, and establishing a distribution network Markov decision process MDP model according to the distributed generation unit output power fluctuation and conversion model, the distribution network topology and the information acquired in step S3;
Step S4: and solving the MDP model of the power distribution network Markov decision process by adopting the reinforcement learning algorithm of the approximate dynamic programming, and outputting the automatic control optimal strategy of the power distribution network switch in real time.
Optionally, the step S1 includes:
step S11: establishing the Distflow power flow optimization constraint model according to the power distribution network topological structure constraint and the power distribution network power flow calculation basic theory;
step S12: determining the approximate dynamic programming reinforcement learning algorithm according to reinforcement learning theory and the Bellman optimality equation.
Optionally, in step S2, before the step of establishing the distributed generation unit output power fluctuation and conversion model, the method further includes: analyzing and quantifying the output power fluctuation and uncertainty of each distributed generation unit, so as to take the quantified values as inputs of the distribution network Markov decision process MDP model.
Optionally, fluctuation and uncertainty of the output power of each distributed power generation unit can be simulated through the fluctuation and transformation model of the output power of the distributed power generation unit.
Optionally, the step S3 includes:
step S31: modeling output power fluctuation status quantized values of each distributed generation unit and corresponding power distribution network topological structures as state parameters of a power distribution network Markov decision process MDP model;
Step S32: modeling the action state of a switch connected with each branch line of a power distribution network as an action combination parameter of an MDP model of a Markov decision process of the power distribution network;
step S33: selecting the load shedding in the Distflow power flow optimization constraint model that considers distributed generation faults, together with the line operation cost, and modeling them as the reward function reference indices of the distribution network Markov decision process MDP model;
step S34: the state transition probabilities of the distribution network markov decision process MDP model are defined so as to take into account the uncertainty caused by the probability of change in the generated power output level of each distributed generation unit.
Optionally, in step S4, before outputting the automatic control optimal strategy of the power distribution network switch in real time, the method further includes: and performing offline iterative training on the MDP model of the power distribution network Markov decision process.
Optionally, the step of performing offline iterative training on the power distribution network markov decision process MDP model includes: and inputting active output data of the distributed power generation unit, outputting an automatic control strategy of a power distribution network switch, and feeding back an evaluation index of the real-time running condition of the power distribution network.
Optionally, in step S4, after outputting the power distribution network switch automatic control optimal policy in real time, the method further includes: and displaying decision result characters and displaying the network topology structure of the real-time power distribution network on the man-machine interaction interface.
Optionally, the action state of the switch includes: the switch changes the switch state at the current time and maintains the switch state at the current time.
In order to achieve the above object, a second aspect of the present invention provides an automatic control system for a power distribution network switch based on reinforcement learning theory, including:
the establishing module is used for establishing a Distflow power flow optimization constraint model;
the determining module is used for determining a reinforcement learning algorithm approximate to dynamic programming;
the acquisition module is used for acquiring historical output data of the photovoltaic and wind power distributed generation units so that the establishment module establishes an output power fluctuation and conversion model of the distributed generation units according to the historical output data;
the establishing module is also used for establishing a distribution network Markov decision process MDP model according to the distribution network topology determined by the determining module, the active output data of the distributed generation units, the state information of the controllable sectionalizing switches and tie switches of the distribution network, the calculation result information of the Distflow power flow optimization constraint model, and the distributed generation unit output power fluctuation and conversion model;
And the calculation module is used for solving the MDP model of the power distribution network Markov decision process according to the reinforcement learning algorithm of the approximate dynamic programming and outputting the automatic control optimal strategy of the power distribution network switch in real time.
The invention has at least the following technical effects:
1. According to the invention, the distributed generation unit output power fluctuation and conversion model is built from the historical output data of distributed generation units such as photovoltaic and wind power, and the Markov decision process MDP model is built from this fluctuation and conversion model. Because the fluctuation and conversion model fully accounts for the environmentally driven output fluctuation of new energy generation modes such as wind and solar power, the MDP model built from it can focus on the power output fluctuation of each distributed generation unit and provide an efficient and stable optimization strategy for automatic control of the distribution network switches.
2. The invention establishes a Distflow power flow optimization constraint model, which can handle problems such as optimal reconfiguration and fault reconfiguration of the distribution network. Because distribution network reconfiguration can optimize the power flow distribution and improve supply reliability and economy by selecting users' power supply paths, the Markov decision process MDP model obtained with the Distflow power flow optimization constraint model can handle distribution system faults and improve system reliability; on the basis of the MDP model, an optimal strategy can be formulated dynamically in each time interval according to the real-time state and the state transition probabilities.
3. The invention adopts an Approximate Dynamic Programming (ADP) algorithm to solve the MDP model, and can solve the problem of dimension disaster.
4. The invention aims to optimally solve the problems of output power fluctuation, outages and faults of the distributed generation units in the distribution network, improve the power supply reliability of the distribution system and increase the investment benefit of the distribution system.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a flowchart of a method for automatically controlling a power distribution network switch based on reinforcement learning theory according to an embodiment of the present invention.
Fig. 2 is an overall frame diagram of a system corresponding to an automatic control method for a power distribution network switch based on approximate dynamic programming according to an embodiment of the present invention.
FIG. 3 is a flow chart of an approximate dynamic programming algorithm according to an embodiment of the present invention.
FIG. 4 is a block diagram of an offline computing process according to an embodiment of the present invention.
FIG. 5 is a block diagram of an online computing process according to an embodiment of the invention.
Fig. 6 is an overall flowchart of a method for automatically controlling a power distribution network switch based on reinforcement learning theory according to an embodiment of the present invention.
Fig. 7 is a block diagram of a power distribution network switch automatic control system based on reinforcement learning theory according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative, are intended to explain the present invention, and should not be construed as limiting it.
The method and system for automatically controlling the switch of the power distribution network based on the reinforcement learning theory of the embodiment are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for automatically controlling a power distribution network switch based on reinforcement learning theory according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step S1: a Distflow power flow optimization constraint model is established, and an approximate dynamic programming reinforcement learning algorithm is determined.
The step S1 includes:
step S11: and establishing a Distflow power flow optimization constraint model according to the power distribution network topological structure constraint and the power distribution network power flow calculation basic theory.
Specifically, the network node structure of the distribution network is mainly tree-shaped and radial, supplemented by ring structures, and it is widely used in areas with light and medium load density. Considering both economy and ease of management, most distribution grids adopt a radial structure.
Assuming that a distribution network has n nodes and m connecting lines, the basic judgment equation of the "tree" structure of the distribution network is:
m=n-1 (1)
The above formula describes the basic requirement of a tree structure, but a spanning tree must also satisfy connectivity; a tree structure with connectivity is called a spanning tree. In the spanning tree, every node except the root node (the substation node) must have exactly one parent node. This requirement can be enforced with the following equations:
\beta_{ij} + \beta_{ji} = \alpha_l, \quad l = 1, 2, \ldots, m \quad (2)

\sum_{j \in N(i)} \beta_{ij} = 1, \quad i = 1, 2, \ldots, n \quad (3)

\beta_{0j} = 0, \quad j \in N(0) \quad (4)

\beta_{ij} \in \{0, 1\} \quad (5)

0 \le \alpha_l \le 1 \quad (6)
Two binary variables β_ij and β_ji are introduced for each connecting line of the tree structure, where β_ij = 1 means that node j is the parent node of node i and otherwise β_ij = 0; Σ denotes summation and N(i) denotes the set of all nodes connected to node i. In addition, the connection state (connected or disconnected) of any two nodes in the network is expressed by the variable α_l (also written α_ij), which ensures that the distribution network corresponds to a spanning tree connected to the main substation. The above formulas state, respectively, that line l actually exists in the spanning tree, that every node except the root node has exactly one parent node, and that the substation node, i.e. the root node, has no parent node. These five equations guarantee the connectivity of the tree structure, making it a spanning tree that models the structure of the distribution network.
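Purely for illustration, the following sketch (not part of the patent; the small five-node feeder data and the use of the networkx library are assumptions) checks the spanning-tree conditions above: m = n - 1, connectivity, and one parent per non-root node.

```python
import networkx as nx  # assumed helper library for the connectivity check

def is_spanning_tree(n_nodes, closed_lines, root=0):
    """Check the radial ("tree") conditions: m = n - 1, connectivity,
    and exactly one parent for every node except the root (substation)."""
    m = len(closed_lines)
    if m != n_nodes - 1:                      # basic judgement equation (1)
        return False
    g = nx.Graph()
    g.add_nodes_from(range(n_nodes))
    g.add_edges_from(closed_lines)
    if not nx.is_connected(g):                # a spanning tree must be connected
        return False
    # orient edges away from the root; every non-root node gets one parent
    parents = dict(nx.bfs_predecessors(g, root))
    return len(parents) == n_nodes - 1

# hypothetical 5-node feeder with 4 closed branches
print(is_spanning_tree(5, [(0, 1), (1, 2), (1, 3), (3, 4)]))  # True
```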
The automatic control method of the power distribution network switch meets the operation constraint conditions of the power distribution network in each decision time t, wherein the operation constraint conditions comprise network topology constraint, power balance constraint, power flow constraint, voltage limit constraint and line capacity constraint. The theoretical basis for meeting the constraint is a DistFlow power distribution network power flow calculation model. The power flow calculation of the power system can utilize parameters such as physical structures of nodes, voltage phasors, active power distribution, reactive power distribution, line loss and the like in the network system as operation conditions to determine the operation state of the whole power system.
The following formulas are obtained according to Kirchhoff's voltage law, the law of energy conservation and Ohm's law:
S_1 = S_0 - S_{loss1} - S_{L1} \quad (7)

V_1 \angle \theta = V_0 - z_1 I_0 \quad (9)
where S_i (i = 1, 2, ..., n) denotes the injection power of node i; for example, the injection power of node 0 is S_0 = P_0 + jQ_0, i.e. the injection power S_0 equals the complex sum of the injected active power P_0 and reactive power Q_0. S_{loss1} denotes the energy loss on the line from node 0 to node 1, and S_{L1} denotes the load on node 1; the injection power at node 1 equals the injection power at node 0 minus the line loss power and the load demand power. z_1 = r_1 + jx_1 denotes the impedance of the line connecting node 0 to node 1, with r_1 and x_1 the line impedance variables; the relationship between impedance and loss power follows from Ohm's law. V_1∠θ denotes the voltage phasor at node 1, whose relation to the voltage at node 0 is V_1∠θ = V_0 - z_1 I_0, where I_0 is the line current and θ is the phase angle.
And the following formula is obtained according to the power calculation formula:
where the line current satisfies I_0 = (P_0 - jQ_0)/V_0, with (P_0 - jQ_0) the conjugate form of S_0; substituting this together with z_1 = r_1 + jx_1 into V_1∠θ = V_0 - z_1 I_0 gives:

V_1 \angle \theta = V_0 - (r_1 + jx_1)(P_0 - jQ_0)/V_0 \quad (11)
Taking the modulus of both sides yields the following formula:
after simplification, the following formula can be obtained:
From the above analysis, the recurrence can be extended by analogy to any node i, giving the following DistFlow equations:

To reflect the actual conditions of power system operation, lowercase symbols are introduced into the above formulas for the total power consumed by the load at node j. Here p_j and q_j are the injected active and reactive power of node j respectively; the load terms denote the active and reactive power consumed by the electrical load at node j; the generation terms denote the active and reactive power supplied by a generator or DG source connected at node j; v_j is the voltage of node j; and r_j and x_j are the line impedance variables. The load power and the generator/DG output power at each node deserve particular attention.
A nonlinear Distflow equation has now been obtained. To better apply the above formulas in practical research, the following two assumptions are made:

Assumption 1: the nonlinear terms in the DistFlow model are very small and can be taken as 0.

Assumption 2: taking V_j ≈ V_0, the following formula can be obtained:
based on the two assumptions above, the nonlinear Distflow equation obtained above can be converted into a system of linear equations:
Thus a linearized Distflow power flow calculation model is obtained. Automatically controlling each branch switch of the distribution network essentially amounts to planning and designing the distribution network topology, and the following analysis and research are based on this linearized Distflow power flow calculation model.
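As a purely illustrative sketch (assuming a single radial feeder, per-unit quantities and made-up data; not the patent's implementation), the linearized DistFlow recursion can be evaluated node by node: branch flows accumulate the downstream net loads, and voltages drop by (r·P + x·Q)/V0.

```python
def linear_distflow(loads_p, loads_q, r, x, v0=1.0):
    """Linearized DistFlow on a radial feeder: node 0 is the substation,
    branch i connects node i to node i+1.  Loss terms are neglected
    (assumption 1) and V_j ~= V_0 is used in the voltage drop (assumption 2)."""
    n = len(loads_p)                      # number of load nodes 1..n
    P = [0.0] * n                         # active power on branch i -> i+1
    Q = [0.0] * n
    v = [v0] * (n + 1)                    # per-unit node voltages
    # branch flow = sum of all downstream net loads (no losses)
    for i in reversed(range(n)):
        P[i] = loads_p[i] + (P[i + 1] if i + 1 < n else 0.0)
        Q[i] = loads_q[i] + (Q[i + 1] if i + 1 < n else 0.0)
    # voltage drop along each branch, linearized around v0
    for i in range(n):
        v[i + 1] = v[i] - (r[i] * P[i] + x[i] * Q[i]) / v0
    return P, Q, v

# hypothetical 3-load feeder, per-unit data
print(linear_distflow([0.02, 0.03, 0.01], [0.01, 0.01, 0.005],
                      r=[0.05, 0.04, 0.03], x=[0.04, 0.03, 0.02]))
```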
Based on the Distflow power flow calculation model, the following power flow optimization constraint model is considered:
\min \sum_{j} p_{sd,j} \quad (21)

where the new physical quantities p_sd,i and q_sd,i denote the active and reactive load shedding at node i respectively. Load shedding, i.e. reducing load, refers to disconnecting part of the load from the grid in order to maintain the power balance and stability of the power system, typically when line faults or natural disasters occur. In addition, P_ik and Q_ik denote the active and reactive power flowing from node i to any node k; the generation terms denote the active and reactive power supplied to node i by a generator or DG source; P_d,i and q_d,i denote the active and reactive load demands at node i; the flow limits denote the maximum values of P_ji and Q_ji; and the voltage limits denote the minimum and maximum values of V_i. The relationship between p_sd,i and q_sd,i is expressed through the load power factor tan β. Using the constraint relations of this power flow optimization constraint model, for a distribution network of n nodes, given the voltage V_1 of the root node (node 1), the load demand P_d,i of each node, the distribution network topology and the impedance r_ji + jx_ji of each branch, the node voltages V_i and V_j, the branch power flows P_ji and Q_ji, and the active and reactive load shedding p_sd,i and q_sd,i are solved for optimally.
The on/off state of some lines in the distribution network can be controlled, so the following Distflow power flow optimization constraint model is obtained:

\min \sum_{j} p_{sd,j} \quad (30)

A variable μ_ji is introduced into the above model to represent the on/off state of a line: μ_ji = 1 indicates that the line is closed, otherwise μ_ji = 0.
Network reconfiguration is the essence of the automatic switch control strategy. It refers to changing the combined state of the sectionalizing switches and tie switches while the network operates normally, i.e. selecting users' power supply paths, in order to optimize the power flow distribution and improve supply reliability and economy. Problems such as optimal reconfiguration and fault reconfiguration of the distribution network are solved with this model.
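As an illustrative sketch only, the load-shedding objective and the line on/off variables μ_ji can be cast as a small mixed-integer program; Gurobi is the optimizer the patent later mentions for such subproblems, but the toy two-line network, the capacity figures, the variable names and the simplified constraints below are assumptions and omit the full DistFlow and radiality constraints.

```python
import gurobipy as gp
from gurobipy import GRB

def min_load_shedding(demand, line_cap):
    """Toy MILP: choose line states mu and shed amounts p_sd so that the
    served load on each line respects its capacity; the objective follows
    min sum(p_sd) from the flow optimization model above."""
    m = gp.Model("toy_reconfig")
    mu = m.addVars(len(line_cap), vtype=GRB.BINARY, name="mu")      # line on/off
    shed = m.addVars(len(demand), lb=0.0, name="p_sd")              # active load shedding
    # each load b is fed by line b in this toy example (assumption)
    for b, d in enumerate(demand):
        m.addConstr(d - shed[b] <= line_cap[b] * mu[b], name=f"cap_{b}")
    # crude stand-in for a topology rule: at least one of the two lines stays open
    m.addConstr(mu.sum() <= len(line_cap) - 1, name="open_one_line")
    m.setObjective(shed.sum(), GRB.MINIMIZE)
    m.optimize()
    return ([mu[i].X for i in range(len(line_cap))],
            [shed[b].X for b in range(len(demand))])

# hypothetical data: two loads of 0.8 and 0.5 p.u., two lines rated 1.0 p.u.
print(min_load_shedding([0.8, 0.5], [1.0, 1.0]))
```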
Step S12: a reinforcement learning algorithm of approximate dynamic programming is determined according to reinforcement learning theory and the Bellman optimality equation.
Reinforcement learning is a class of problems in the field of machine learning that aims to let an agent take optimal actions to maximize its return; it is used to find the best action to take in a given environment. Reinforcement learning differs from supervised learning in that, in supervised learning, the training data carry label values, so the model is trained with correct answers, whereas in reinforcement learning, although there are no labels, the agent can gradually learn the optimal actions or paths through trial and error.
Dynamic programming (Dynamic Programming, DP) originates from problems in engineering and finance, which tend to focus on continuous states and control decisions. In the artificial intelligence field, by contrast, DP is mainly concerned with discrete states and decisions. It is a model-based algorithm in reinforcement learning: under this algorithm the environment and model are known to the agent, and given a complete Markov decision process model the optimal strategy can be learned, so it is classified as a model-based approach.

DP involves many high-dimensional problems, such as the one studied here, which are typically tackled with mathematical programming tools; most of this work has focused on deterministic problems using linear, nonlinear or integer programming.
As with other reinforcement learning algorithms, the core idea of DP is still to find the optimal decision based on the state-value function. Under the DP algorithm, however, the state-value function is based on the Bellman optimality equation, from which the following can be derived:
where v_*(S_t), marked with an asterisk subscript, denotes the state-value function that satisfies value maximization; R_t is the immediate reward at time t; γ is the discount factor with γ ∈ [0,1], where the closer γ is to 1 the more weight is given to future returns and the closer γ is to 0 the more weight is given to the current return; G_{t+1} is the sum of all rewards from time t+1 onward; S_t is the state set and s the state at time t; A_t is the action set and a the action at time t; and E[·] denotes the expected value. Using the concept of expectation from probability theory, the equation can be rewritten as:

v_*(S_t) = \max_a \sum_{s',r} p(s', r \mid s, a) \, [r + \gamma G_{t+1}] \quad (40)

where max denotes maximization, s' is the state at the next time, r is the reward value fed back by the environment, and p(s', r | s, a) is the probability that the environment feeds back reward r and transitions to the next state s' given state s and action a.
Noting that G_{t+1} and v_*(S_{t+1}) are interchangeable here, the following are obtained:

v_*(S_t) = \max_a \sum_{s',r} p(s', r \mid s, a) \, [r + \gamma v_*(S_{t+1})] \quad (41)

v_*(S_t) = \max_a \{ R_t + \sum_{s',r} p(s', r \mid s, a) \, \gamma v_*(S_{t+1}) \} \quad (42)

The substituted equation is the prototype of the DP algorithm: it turns the Bellman equation into a recursively updated equation that approximates the ideal value function. On this basis every dynamic program can be written recursively, the recursion linking the state value v_t(S_t) at a given time t with the state value v_{t+1}(S_{t+1}) at the next time.
However, the DP algorithm has three "curses" in dimensions, complicating the solution process, mainly in three aspects: the state space S is too large to calculate the value function v in an acceptable time * (S t ) The method comprises the steps of carrying out a first treatment on the surface of the The decision space A is too large, the arrangement and combination of actions are too many, and the actions tend to rise exponentially, so that the optimal actions cannot be found quickly; the resulting space is too large to calculate the expected value for the future rewards.
Approximate dynamic programming is based on an algorithmic strategy that advances step by step through time and is therefore also called forward dynamic programming. The DP algorithm needs to convert the expectation to be solved into probability-distribution form, giving Σ_{s',r} p(s', r | s, a)[r + γ v_*(S_{t+1})]. Since an excessive number of MDP (Markov Decision Process) states makes the solution slow or even intractable, an approximate dynamic programming algorithm is proposed to deal with this: the state-value function v_*(S_t) is approximated and a new concept is introduced:

The post-decision state refers to the system state after the decision has been made but before any new information arrives; it is the state lying between S_t and S_{t+1} once decision a_t has been taken, and is recorded as the post-decision state variable.
After this concept is put forward, the above formula can be rewritten as:
Step S2: and acquiring historical output data of the photovoltaic and wind power distributed generation units, and establishing a distributed generation unit output power fluctuation and conversion model according to the historical output data.
In the step S2, before the distributed generation unit output power fluctuation and conversion model is established, the method further includes: the output power fluctuations and uncertainty conditions of each distributed generation unit are analyzed and quantified to take the quantified value as one of the state inputs of the power distribution network markov decision process MDP model.
The relevant historical data show that the active output of wind farms and photovoltaic plants is strongly random over long time scales, which motivates the study of how the output of new energy generation units such as wind and photovoltaic generation fluctuates with the natural environment.
The fluctuation and randomness of a wind farm's active power are affected by several objective factors, such as the region and the climate conditions where the wind farm is located and the spatial distribution and layout of the wind turbines. Using wind generation active output data with a 15 min sampling period, the daily average output of the wind generation is calculated with the following formula:

where P_daily-average and W_day denote the daily average output and the daily generation of the wind power respectively, and P(t) is the active output at time t; the daily average output of the wind farm is analyzed on this basis. The active power of the wind farm is affected by natural weather conditions in different seasons, and its fluctuation over the year is severe. Therefore, to select representative wind DG active output fluctuation data, the output data of a typical output day (daily generation = annual generation / 365 days) of the wind farm are chosen. On such a day the output peaks generally occur from 3:00 to 10:00 in the daytime and from 22:00 to 24:00 at night, while in the remaining periods the output is close to zero.
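For example, with a 15-minute sampling period the daily generation and daily average output can be computed as follows; the flat 2 MW series is made-up data and the calculation paraphrases the description above (the original formula was given as an image).

```python
def daily_average_output(p_15min_kw):
    """Daily generation W_day (kWh) and daily average output (kW) from
    96 active-power samples taken every 15 minutes."""
    assert len(p_15min_kw) == 96, "one sample every 15 min over 24 h"
    w_day = sum(p * 0.25 for p in p_15min_kw)   # each sample covers 0.25 h
    return w_day, w_day / 24.0                  # average power over the day

# made-up flat 2 MW wind output for illustration
w, p_avg = daily_average_output([2000.0] * 96)
print(w, p_avg)   # 48000.0 kWh, 2000.0 kW
```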
The fluctuation and randomness of a photovoltaic plant's active power are mainly caused by the sunshine hours, the altitude of the photovoltaic power plant and natural disasters (drought, heavy rain, frost). Likewise, actual photovoltaic active output data of a typical output day with sampling period t = 15 min are used. Weather conditions directly cause severe fluctuation of the photovoltaic plant's active output; the output on sunny days is far better than on overcast days; and the photovoltaic DG output is intermittent, with long continuous periods of zero output at night.
In this embodiment, after the distributed generation unit output power fluctuation and transformation model is established, the fluctuation and uncertainty of the output power of each distributed generation unit can be simulated through the distributed generation unit output power fluctuation and transformation model.
Specifically, the quantized value of the fluctuation of the output power of each distributed generation unit (DG) and the corresponding topology structure of the distribution network are regarded as the MDP state, and the generated output power of DG has high uncertainty.
A DG has k output levels in each time period; the larger k is, the finer the discretization of the DG output power analog quantity and the smaller the error. The probability of moving from output level k at time t to output level k' at time t+1 is denoted π_kk'; this transition probability can be obtained by Monte Carlo treatment of the DG historical data (for example, of the wind generation DG). If, for instance, output level k occurs m times at time t in the history and the transition from level k to level k' at time t+1 occurs n times, then π_kk' equals n/m.
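A minimal sketch of the counting estimate π_kk' = n/m from a discretized historical output sequence; the level boundaries, the data and the function names are assumed.

```python
from collections import defaultdict

def estimate_transition_matrix(levels, k):
    """Estimate pi_kk' = n/m: n = transitions from level k to k' at t+1,
    m = occurrences of level k at t, counted over a historical level sequence."""
    counts = defaultdict(int)
    occur = defaultdict(int)
    for a, b in zip(levels[:-1], levels[1:]):
        occur[a] += 1
        counts[(a, b)] += 1
    return [[counts[(i, j)] / occur[i] if occur[i] else 0.0
             for j in range(k)] for i in range(k)]

# hypothetical DG output history discretized into k = 3 levels
history = [0, 0, 1, 2, 1, 1, 0, 1, 2, 2, 1, 0]
for row in estimate_transition_matrix(history, 3):
    print(row)
```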
Step S3: the distribution network topology is determined; the active output data of the distributed generation units, the state information of the controllable sectionalizing switches and tie switches of the distribution network, and the calculation result information of the Distflow power flow optimization constraint model are acquired; and a distribution network Markov decision process MDP model is established according to the distributed generation unit output power fluctuation and conversion model, the distribution network topology and the information acquired in step S3.

The overall system framework corresponding to the automatic distribution network switch control method based on approximate dynamic programming is shown in fig. 2. Fig. 2 consists of two parts, the reinforcement learning algorithm and the distributed power distribution network. At each operating moment of the distributed power distribution network, external conditions such as weather and natural disasters together with the network topology are taken as the state parameter S_t, i.e. the external environment state input to the MDP process (concretely, external conditions such as weather and natural disasters affect the output of photovoltaic and wind generation), and the switching strategy A_t is then output. According to preset parameters such as the network line operation cost and the load shedding, the current reward value, i.e. the immediate reward R_t, is fed back. The reinforcement learning ADP algorithm uses the post-decision state and the forward dynamic algorithm to perform iterative calculation on these data, the real-time operating state of the distribution network and the accumulated reward value, and the calculation result is fed back to the agent. Each complete time horizon is one training run; after hundreds of training runs each decision yields a converged value, these values are stored in a table, and finally they are compared to find the optimal strategy at each moment.
The step of establishing a power distribution network markov decision process MDP model in step S3 includes:
step S31: and modeling the quantized value of the fluctuation condition of the output power of each distributed power generation unit and the corresponding distribution network topological structure into the state parameters of a distribution network Markov decision process MDP model.
State set S_{i,t}: the switches on each line within the distribution network topology are defined in turn as the network topology:

\Xi_t = [\,swt_1, swt_2, swt_3, \ldots, swt_n\,] \quad (45)

where swt_n denotes the current state of switch n; each switch has an open state and a closed state, represented by the binary digits 0 and 1 respectively. The topology of the distribution network at this moment can therefore be represented by the n-bit binary number Ξ_t. In the distribution network Markov decision model, the network topology at time t and the power output levels of the distributed generation units are defined as the state set:

S_{i,t} = [\,\Xi_t \mid k_{1,t}, k_{2,t}, k_{3,t}, \ldots, k_{dg,t}, \ldots, k_{DG,t}\,] \quad (46)

where Ξ_t denotes the network topology of the distribution network at time t and k_{dg,t} denotes the power output level of DG dg at time t.
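Illustratively, a state S_{i,t} can be stored as the n-bit switch word Ξ_t together with the DG output levels; the class and field names below are assumptions, not the patent's data structures.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class MdpState:
    """State set element: network topology word and DG output levels."""
    switch_word: Tuple[int, ...]   # Xi_t, one 0/1 entry per switch swt_1..swt_n
    dg_levels: Tuple[int, ...]     # k_{dg,t} for every distributed generation unit

    def topology_as_int(self) -> int:
        """Pack Xi_t into the n-bit binary number described above."""
        value = 0
        for bit in self.switch_word:
            value = (value << 1) | bit
        return value

# hypothetical 5-switch feeder with two DGs at output levels 2 and 0
s = MdpState(switch_word=(1, 1, 0, 1, 1), dg_levels=(2, 0))
print(s.topology_as_int())   # 27
```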
Step S32: the action states of the switches connected with each branch line of the power distribution network are modeled as action combination parameters of an MDP model of a Markov decision process of the power distribution network.
The action state of the switch comprises switching state of the switch at the current moment and maintaining the switching state at the current moment.
Action set A_t(S_{i,t}): each of the switches [swt_1, swt_2, swt_3, ..., swt_n] connecting the nodes has exactly two possible actions at each time t, namely changing the switch state held at time t-1 (from closed to open, or from open to closed) or keeping the switch state of time t-1; the action of a single switch is denoted a_swt1 = a_1 or a_2. Since every switch in the tree network has these two actions at any time, A_t(S_{i,t}) can be expressed as:

A_t(S_{i,t}) = [\,a_{swt1}, a_{swt2}, \ldots, a_{swti}\,] \quad (47)

where a_{swti} denotes the action of the i-th switch and S_{i,t} denotes the state set.
Step S33: the load shedding in the Distflow power flow optimization constraint model that considers distributed generation faults, together with the line operation cost, is selected and modeled as the reward function reference index of the distribution network Markov decision process MDP model.

Immediate reward function R(S_{i,t}, a_t): when a specific action a_t ∈ A_t(S_{i,t}) is applied at time t, the state of the network changes from s_t ∈ S_{i,t} to s_{t+1} ∈ S_{i,t}, where s_t and s_{t+1} denote the states at times t and t+1 respectively, and the system observes an immediate reward value R(S_{i,t}, a_t). This reward value is measured through the power flow calculation of the whole distribution network.

Specifically, the load shedding in the distribution network power flow calculation model (Distflow) that considers distributed generation faults can be selected as the reference index of the reward function R(S_{i,t}, a_t). R(S_{i,t}, a_t) represents the load-shedding cost and the line operation cost in the distribution system; its values are constant and negative, and the larger R(S_{i,t}, a_t) is, the smaller the required load-shedding cost. The amount of load shedding reflects the input/output power balance and the stability of the distribution system: the smaller it is, the better the operating requirements and economic budget of the distribution network are met.
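A small sketch of such a reward, combining a load-shedding cost term and a line operation cost term; the cost coefficients and the data are assumed and chosen negative so that R stays negative, as stated above.

```python
def immediate_reward(shed_by_node, closed_lines, c_shed=-10.0, c_line=-0.1):
    """R(S_{i,t}, a_t) as a sum of load-shedding cost and line operation cost;
    negative coefficients keep R negative, a larger R meaning less shedding."""
    shedding_cost = sum(c_shed * p for p in shed_by_node.values())
    line_cost = sum(c_line for mu in closed_lines if mu == 1)
    return shedding_cost + line_cost

# hypothetical: 0.05 p.u. shed at node 7, four lines closed
print(immediate_reward({7: 0.05}, [1, 1, 0, 1, 1]))   # -0.9
```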
Step S34: the state transition probabilities of the distribution network markov decision process MDP model are defined so as to take into account the uncertainty caused by the probability of change in the generated power output level of each distributed generation unit.
In this embodiment, the uncertainty caused by the change probability of the DG power output level is considered, and the state transition probability of the Markov decision process MDP model, i.e. the probability of moving from one state to another under a given action, is defined.
State transition probability P(S_{j,t+1} | S_{i,t}, a_t): taking into account the uncertainty caused by the change probability π_kk' of the DG power output level from time t to t+1, the probability that state S_{i,t} transitions to state S_{j,t+1} under action a_t can be expressed as:

P(S_{j,t+1} \mid S_{i,t}, a_t) = \pi_{kk'} \times P(\Xi_{t+1} \mid \Xi_t, a_t) \quad (48)

where P(Ξ_{t+1} | Ξ_t, a_t) denotes the probability that the distribution network topology Ξ_t at time t is transformed into Ξ_{t+1} under action a_t. Since the change of Ξ_t into Ξ_{t+1} is caused by network reconfiguration and the topology transformation is deterministic under a given reconfiguration operation, this probability is either 100% or 0%. π_kk' is the probability of moving from output level k at time t to output level k' at time t+1 when there is only one DG; if there are n DGs, the level-transition probabilities of all the DGs should be multiplied, i.e. the π_kk' term is written as a product over the n DGs.
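A sketch of equation (48) for several DGs: the topology factor is 1 or 0 because reconfiguration is deterministic, and the DG level-transition probabilities multiply; the matrices, topology encodings and names below are assumed.

```python
def transition_probability(dg_pis, levels_now, levels_next,
                           topo_next, expected_topo_next):
    """P(S_{j,t+1} | S_{i,t}, a_t) = prod_dg pi_kk' * P(Xi_{t+1} | Xi_t, a_t);
    the topology factor is 1 if reconfiguration a_t maps the current topology
    to topo_next (i.e. topo_next == expected_topo_next), otherwise 0."""
    prob = 1.0 if topo_next == expected_topo_next else 0.0
    for pi, k, k2 in zip(dg_pis, levels_now, levels_next):
        prob *= pi[k][k2]          # pi is the per-DG level transition matrix
    return prob

# hypothetical: two DGs sharing the same 2-level transition matrix
pi = [[0.8, 0.2], [0.3, 0.7]]
print(transition_probability([pi, pi], (0, 1), (0, 0),
                             topo_next=0b101, expected_topo_next=0b101))
# 0.8 * 0.3 * 1.0 = 0.24
```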
Step S4: and solving an MDP model of a power distribution network Markov decision process by adopting a reinforcement learning algorithm of approximate dynamic programming, and outputting an automatic control optimal strategy of a power distribution network switch in real time.
In the step S4, before outputting the automatic control optimal strategy of the power distribution network switch in real time, the method further includes: and performing offline iterative training on the MDP model of the Markov decision process of the power distribution network.
The step of performing offline iterative training on the power distribution network Markov decision process MDP model comprises the following steps: and inputting active output data of the distributed power generation unit, outputting an automatic control strategy of a power distribution network switch, and feeding back an evaluation index of the real-time running condition of the power distribution network.
According to the Markov model of the distribution network established in step S3 and the Bellman optimality equation, and considering the uncertainty and fluctuation of the DG output, the dynamics of the whole system are given by the four-argument probability distribution P(S_{j,t+1} | S_{i,t}, a_t). The recursively optimized state-value function can be expressed as:
where γ is the discount factor with γ ∈ [0,1], the closer γ is to 1 the more weight is given to future returns and the closer γ is to 0 the more weight is given to the current return; R(S_{i,t}, a_t) is the immediate reward fed back by the environment in the current state S_{i,t} under action a_t, and v_{t+1}(S_{j,t+1}) is the state-value function of the next-time state S_{j,t+1}. Setting γ = 1 and moving R(S_{i,t}, a_t) outside the brackets gives:
wherein,,representing the state S at the current time i,t And action a t Under the precondition, the next time state S j,t+1 The expected value of the state-cost function of (a), the instant benefit function R (S i,t ,a t ) Consists of two parts: the cost of load shedding and the line operation cost in the power distribution system are constant. So R (S) i,t ,a t ) Can be expressed as:
R(S i,t ,a t )=∑ b∈B (c b *p sd,b )+∑ l∈L (c 1 *μ b,b′ ) (52)
wherein B is node set, c b To cut off the load cost factor, p sd,b For active load shedding at node b, c 1 Mu, as a line operation cost coefficient b,b′ Introducing The concept of The post-decision state for The line operation cost from node b to node b', S i,t ,a t Rewriting into form of state variable after decisionWill beDefined as->Thereby realizing the form conversion of the logarithmic expected solving process, and obtaining the following formula:
wherein,,representing post-decision state->Is used to determine the state-value of the state-value of the value,indicating the last time state S j,t-1 And action a t-1 Under the precondition, the current moment state S i,t Expected value of state-cost function, R t+1 (S j,t+1 ,a t+1 ) Representing the next time state S j,t+1 And action a t+1 Immediate prize value for lower environmental feedback, +.>Representing post-decision state->The above equation converts multi-periodic and large scale MDP-based stochastic models into single-period deterministic models for each state in each decision period and can be solved by iteration +.>
A variable n is introduced to denote the nth iteration, giving the following formula:

where the terms denote the state-value function of state S_{i,t} in the nth iteration and the post-decision state in the nth iteration, respectively. Combining the post-decision state-value function with the recursive form of the state-value function, the forward dynamic algorithm is then used to update the state-value function values:

where the terms denote the state-value of the post-decision state in the nth iteration, the state-value function of that post-decision state, and the state S_{j,t} in the nth iteration, respectively; the coefficient α is a smoothing parameter smaller than 1. By applying this update formula repeatedly over a sufficient number of iterations, the state-value of each post-decision state converges to a corresponding value. The values of the different switching actions in each Markov state are thus obtained, and finally the optimal actions and decisions are found by comparing these values.
The pseudo-code of the ADP algorithm is as follows (an illustrative code sketch follows these steps):

Step 1a: set the network topology state and the DG fluctuating output state.

Step 1c: set the step size α ∈ (0,1) and the number of iteration rounds N.

Step 2: for t = 1, 2, ..., T, do:

Step 2a: solve R(S_{i,t}, a_t) as a mixed-integer linear program with the Gurobi mathematical programming optimizer and obtain the post-decision value, substituting each candidate action a_t to solve the maximization problem.

Step 2c: obtain the post-decision state from state S_{i,t} and action a_t; judge whether t is smaller than T and, if so, execute the following step;

Step 2d: according to the DG fluctuation and the network topology, move from the post-decision state to the Markov state S_{j,t+1} of the next moment.
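The steps above can be sketched as the following offline training loop; the callback names, the smoothing update and the toy usage are assumptions (e.g. solve_decision standing in for the Gurobi single-period subproblem of Step 2a), not the patent's exact implementation.

```python
import random

def adp_offline(initial_state, horizon, n_iters, alpha,
                solve_decision, sample_next_state):
    """Forward ADP training loop sketched from the pseudo-code above:
    at each t the single-period problem is solved greedily against the
    current post-decision value table, and that table is smoothed with
    step size alpha.  solve_decision and sample_next_state are assumed
    user-supplied callbacks (e.g. an MILP solve and a DG-level sampler)."""
    v_post = {}                                     # post-decision state values
    for n in range(n_iters):
        state = initial_state
        for t in range(horizon):
            # Step 2a: pick the action maximising reward + current value estimate
            action, value_hat, post_state = solve_decision(state, v_post)
            # forward dynamic update with smoothing parameter alpha
            old = v_post.get(post_state, 0.0)
            v_post[post_state] = (1.0 - alpha) * old + alpha * value_hat
            # Step 2d: sample the next Markov state from the DG fluctuation model
            state = sample_next_state(post_state)
    return v_post

# toy usage with dummy callbacks just to show the calling convention
dummy = lambda s, v: ("keep", random.random(), s)
step = lambda s: s
print(len(adp_offline("s0", horizon=4, n_iters=10, alpha=0.5,
                      solve_decision=dummy, sample_next_state=step)))  # 1
```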
The approximate dynamic programming algorithm flow chart is shown in fig. 3.
The learning and training of the automatic distribution network switch control MDP model under DG output power fluctuation is an iterative offline calculation process, whose flow chart is shown in fig. 4. The purpose of the offline computation is to obtain converged post-decision state values, which is achieved by the approximate dynamic programming algorithm. The input of the offline computing part comprises the network topology of the distribution system and the uncertainty and fluctuation of the DG output power. This information is fed into the ADP algorithm, which returns the post-decision state values; this is a multi-period process that iterates repeatedly and finally converges.
In step S4, after outputting the power distribution network switch automatic control optimal strategy in real time, the method further includes: and displaying decision result characters and displaying the network topology structure of the real-time power distribution network on the man-machine interaction interface.
After the training and learning of the automatic distribution network switch control MDP model under DG output power fluctuation are finished, the optimal switching strategy needs to be predicted automatically. The prediction and decision stage is an online calculation process, whose flow chart is shown in fig. 5.

Online computing enables the agent to obtain the optimal switching strategy of the distribution network from the real-time Markov state observed at each moment. As shown in fig. 5, the post-decision state values output by the offline computation and the Markov state S_{i,t} observed in each time period are taken as the input of the single-period deterministic model; the objective function is still R(S_{i,t}, a_t), the constraints are the distribution network structure and power flow constraints proposed in step S1, and the model then returns the post-decision state value and the best action a_t(S_{i,t}). In this way the online computing process obtains, at each decision moment, the value-maximizing strategy based on the real-time Markov state S_{i,t}.
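A sketch of this online phase: at each decision time the observed Markov state and the converged post-decision values feed a single-period choice whose best action is returned; the callbacks and the toy call are assumed.

```python
def online_decision(observed_state, v_post, candidate_actions,
                    reward_fn, post_state_fn):
    """Single-period deterministic choice: maximise R(S_{i,t}, a_t) plus the
    converged post-decision value of the resulting post-decision state."""
    best_action, best_value = None, float("-inf")
    for a in candidate_actions:
        post = post_state_fn(observed_state, a)
        value = reward_fn(observed_state, a) + v_post.get(post, 0.0)
        if value > best_value:
            best_action, best_value = a, value
    return best_action, best_value

# toy call: two candidate switch actions on an abstract state
print(online_decision("s", {("s", "open"): 1.0}, ["open", "keep"],
                      reward_fn=lambda s, a: -0.2,
                      post_state_fn=lambda s, a: (s, a)))   # ('open', 0.8)
```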
Fig. 6 is the overall flowchart of the automatic distribution network switch control method based on reinforcement learning theory. As shown in fig. 6, it consists of four parts: deriving the relevant theory, establishing the automatic distribution network switch control MDP model under DG fluctuation, solving the MDP model, and organizing the analysis results into a human-machine interaction APP. The theory derivation part derives the Bellman equation from the ideas of reinforcement learning and the Markov decision model, and introduces the basic principles of dynamic programming and approximate dynamic programming based on it. The MDP modeling part accounts for how new energy generation units such as wind and photovoltaic generation fluctuate with the natural environment and establishes the Markov decision model for automatic distribution network switch control: first a fluctuation and conversion model between the different output levels of the distributed generation units is established, then the distribution network MDP model is built following the MDP modeling idea of reinforcement learning theory, and finally the corresponding program is written according to the structural constraints of the distribution network and the Distflow power flow calculation constraints. The MDP solving part addresses the problems and difficulties that arise when traditional reinforcement learning algorithms solve the MDP model and proposes the automatic distribution network switch control ADP algorithm; its core is to introduce the post-decision state and the forward dynamic algorithm, so as to convert the multi-period, large-scale MDP-based stochastic model into a single-period deterministic model for each state in each decision period. Finally, a human-machine interaction APP (Application) is written, realizing interaction between the user and the automatic distribution network switch control software system, so that the text and picture results of the optimal decisions can be viewed directly.
Fig. 7 is a block diagram of a power distribution network switch automatic control system based on reinforcement learning theory according to an embodiment of the present invention. As shown in fig. 7, the power distribution network switch automatic control system 100 based on reinforcement learning theory includes an establishing module 10, a determining module 20, an obtaining module 30, and a calculating module 40, where the establishing module 10 is connected to the determining module 20 and the obtaining module 30, and the calculating module 40 is also connected to the determining module 20.
The establishing module 10 is configured to establish the Distflow power flow optimization constraint model. The determining module 20 is configured to determine the reinforcement learning algorithm of approximate dynamic programming. The obtaining module 30 is configured to obtain historical output data of the photovoltaic and wind power distributed generation units, so that the establishing module 10 builds the output power fluctuation and conversion model of the distributed generation units from the historical output data. The establishing module 10 is further configured to establish the power distribution network Markov decision process MDP model according to the power distribution network topology determined by the determining module 20, the active power output data of the distributed generation units, the status information of the controllable sectionalizing switches and tie switches of the power distribution network, the calculation results of the Distflow power flow optimization constraint model, and the output power fluctuation and conversion model of the distributed generation units. The calculating module 40 is configured to solve the power distribution network Markov decision process MDP model with the approximate dynamic programming reinforcement learning algorithm and to output the optimal strategy for automatic control of the power distribution network switches in real time.
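The output power fluctuation and conversion model that the obtaining module and the establishing module cooperate on can be pictured, assuming the historical output of each photovoltaic or wind unit is discretized into a small number of output levels, as estimating a level-to-level Markov transition matrix from the historical series. This is a simplified illustration under that assumption, not the patent's exact procedure:

```python
import numpy as np

def estimate_transition_matrix(historical_output, level_edges):
    """Estimate an output-level transition matrix from historical DG output.

    historical_output: 1-D array of measured output power over consecutive periods.
    level_edges: bin edges used to discretize the output into discrete levels.
    Returns a row-stochastic matrix P where P[i, j] approximates the probability
    of moving from output level i to output level j in one period.
    """
    levels = np.digitize(historical_output, level_edges)   # map each sample to a level index
    n = len(level_edges) + 1
    counts = np.zeros((n, n))
    for current, nxt in zip(levels[:-1], levels[1:]):
        counts[current, nxt] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

Estimating one such matrix per distributed generation unit gives the change probabilities of the power output levels from which the state transition probabilities of the MDP model can be assembled.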
It should be noted that, to avoid redundancy, the specific implementation of the power distribution network switch automatic control system based on reinforcement learning theory may refer to the specific implementation of the power distribution network switch automatic control method described above, and it is not described again here.
In summary, the invention provides a method and a system for automatic control of power distribution network switches based on reinforcement learning theory. The method fully considers the output power fluctuation of renewable energy sources such as wind power and photovoltaic generation, establishes a Markov decision process MDP model for automatic control of the power distribution network switches, combines a general distribution network configuration model with power supply reliability, tracks the output power fluctuation of each distributed generation unit, and provides an efficient and stable optimization strategy for automatic switch control. The MDP process is used to improve economy and reliability, reduce losses and balance loads, completing network reconfiguration of a power distribution network containing distributed generation units; on the basis of this model, an optimal strategy is established dynamically in each time interval according to the real-time state and the state transition probabilities. The approximate dynamic programming ADP algorithm is adopted to solve the MDP model, which overcomes the curse of dimensionality. The invention thus addresses output power fluctuation, outages and faults of the distributed generation units in the power distribution network, improves the power supply reliability of the power distribution system, and improves its investment benefit.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the present invention has been described in detail through the foregoing description of the preferred embodiment, it should be understood that the foregoing description is not to be considered as limiting the invention. Many modifications and substitutions of the present invention will become apparent to those of ordinary skill in the art upon reading the foregoing. Accordingly, the scope of the invention should be limited only by the attached claims.
Claims (10)
1. A power distribution network switch automatic control method based on reinforcement learning theory, characterized by comprising the following steps:
step S1: establishing a Distflow power flow optimization constraint model, and determining a reinforcement learning algorithm of approximate dynamic programming;
step S2: acquiring historical output data of photovoltaic and wind power distributed generation units, and establishing a distributed generation unit output power fluctuation and conversion model according to the historical output data;
step S3: determining the power distribution network topology, acquiring active power output data of the distributed generation units, status information of the controllable sectionalizing switches and tie switches of the power distribution network, and calculation result information of the Distflow power flow optimization constraint model, and establishing a power distribution network Markov decision process MDP model according to the distributed generation unit output power fluctuation and conversion model, the power distribution network topology and the information acquired in this step;
step S4: solving the power distribution network Markov decision process MDP model with the reinforcement learning algorithm of approximate dynamic programming, and outputting the optimal strategy for automatic control of the power distribution network switches in real time.
2. The method for automatically controlling a power distribution network switch based on reinforcement learning theory according to claim 1, wherein the step S1 comprises:
step S11: establishing the Distflow power flow optimization constraint model according to the power distribution network topological structure constraints and the basic theory of power distribution network power flow calculation;
step S12: determining the reinforcement learning algorithm of approximate dynamic programming according to reinforcement learning theory and the Bellman optimality equation.
3. The method for automatically controlling a power distribution network switch based on reinforcement learning theory according to claim 1, wherein in the step S2, before establishing the distributed generation unit output power fluctuation and conversion model, the method further comprises: analyzing and quantifying the output power fluctuation and uncertainty of each distributed generation unit, so that the quantified values can be taken as inputs to the power distribution network Markov decision process MDP model.
4. The method for automatically controlling a power distribution network switch based on reinforcement learning theory according to claim 3, wherein the fluctuation and uncertainty of the output power of each distributed generation unit can be simulated through the distributed generation unit output power fluctuation and conversion model.
5. The method for automatically controlling a power distribution network switch based on reinforcement learning theory according to claim 4, wherein the step S3 comprises:
step S31: modeling the quantized output power fluctuation status of each distributed generation unit and the corresponding power distribution network topology as the state parameters of the power distribution network Markov decision process MDP model;
step S32: modeling the action states of the switches connected to each branch line of the power distribution network as the action combination parameters of the power distribution network Markov decision process MDP model;
step S33: selecting the cut load in the Distflow power flow optimization constraint model considering distributed generation faults, and modeling it together with the line operation cost as the reward function reference index of the power distribution network Markov decision process MDP model;
step S34: defining the state transition probabilities of the power distribution network Markov decision process MDP model so as to account for the uncertainty caused by the change probabilities of the power output level of each distributed generation unit.
6. The method for automatically controlling a power distribution network switch based on reinforcement learning theory according to claim 1, wherein in step S4, before outputting the optimal strategy for automatic control of the power distribution network switches in real time, the method further comprises: performing offline iterative training on the power distribution network Markov decision process MDP model.
7. The method for automatically controlling a power distribution network switch based on reinforcement learning theory according to claim 6, wherein the step of performing offline iterative training on the power distribution network Markov decision process MDP model comprises: inputting the active power output data of the distributed generation units, outputting a power distribution network switch automatic control strategy, and feeding back an evaluation index of the real-time operating condition of the power distribution network.
8. The method for automatically controlling a power distribution network switch based on reinforcement learning theory according to claim 1, wherein in the step S4, after outputting the optimal strategy for automatic control of the power distribution network switches in real time, the method further comprises: displaying the decision result as text and displaying the real-time network topology of the power distribution network on the human-machine interaction interface.
9. The method for automatically controlling a power distribution network switch based on reinforcement learning theory according to claim 5, wherein the action states of a switch comprise: the switch changing its state at the current moment and the switch maintaining its state at the current moment.
10. An automatic control system for power distribution network switches based on reinforcement learning theory, characterized by comprising:
an establishing module, used for establishing a Distflow power flow optimization constraint model;
a determining module, used for determining a reinforcement learning algorithm of approximate dynamic programming;
an obtaining module, used for obtaining historical output data of photovoltaic and wind power distributed generation units, so that the establishing module establishes a distributed generation unit output power fluctuation and conversion model according to the historical output data;
the establishing module being further used for establishing a power distribution network Markov decision process MDP model according to the power distribution network topology determined by the determining module, the active power output data of the distributed generation units, the status information of the controllable sectionalizing switches and tie switches of the power distribution network, the calculation result information of the Distflow power flow optimization constraint model, and the distributed generation unit output power fluctuation and conversion model; and
a calculating module, used for solving the power distribution network Markov decision process MDP model according to the reinforcement learning algorithm of approximate dynamic programming and outputting the optimal strategy for automatic control of the power distribution network switches in real time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310099843.0A CN116260143A (en) | 2023-02-06 | 2023-02-06 | Automatic control method and system for power distribution network switch based on reinforcement learning theory |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116260143A (en) | 2023-06-13 |
Family
ID=86687421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310099843.0A Pending CN116260143A (en) | 2023-02-06 | 2023-02-06 | Automatic control method and system for power distribution network switch based on reinforcement learning theory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116260143A (en) |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |