CN117748515A - Two-stage reinforcement learning method and system for urban power distribution network reconstruction operation

Info

Publication number: CN117748515A (granted as CN117748515B)
Application number: CN202311811778.6A
Authority: CN (China); original language: Chinese (zh)
Legal status: Granted; Active
Inventors: 高红均, 雷成, 姜思远, 梁宇, 聂金峰, 潘旭东, 王仁俊, 刘友波, 刘俊勇
Assignees: Sichuan University; Energy Development Research Institute of China Southern Power Grid Co Ltd
Prior art keywords: switch, distribution network, agent, photovoltaic, reinforcement learning
Classification: Supply And Distribution Of Alternating Current

Abstract

The invention relates to a two-stage reinforcement learning method and system for the reconstruction operation of an urban power distribution network. The method comprises: constructing a mathematical model of dynamic reconstruction operation for the urban power distribution network; proposing the concept of switch contribution degree together with a method for quantifying it, in which the contribution of each switch is quantified from photovoltaic absorption and power supply capability and the switches are screened by their contribution values, thereby achieving a large-scale dimensionality reduction of the switch action space; proposing a Weighted QMIX multi-agent deep reinforcement learning algorithm based on action weights, which deepens the agents' perception of switch contribution, and, on this basis, a two-stage reinforcement learning framework in which the screened switches are divided into two groups according to their contribution and the training process is correspondingly divided into two stages, avoiding the learning difficulty caused by an unbalanced weight distribution; and fully training the proposed multi-agent deep reinforcement learning model to obtain the optimal strategy for dynamic reconstruction operation of the urban power distribution network.

Description

Two-stage reinforcement learning method and system for urban power distribution network reconstruction operation
Technical Field
The invention relates to the technical field of urban power distribution network dynamic reconstruction, in particular to a two-stage reinforcement learning method and system for urban power distribution network reconstruction operation.
Background
With the gradual improvement of urban distribution network automation, reconstruction technology has become a reliable means of optimizing distribution network operation. By changing the topology of the distribution network, reconstruction transfers power flow, which reduces network losses and improves operating economy while also balancing the loads of the different areas and reducing local load loss. In addition, as the photovoltaic penetration in urban distribution networks gradually increases, applying reconstruction technology can effectively improve the photovoltaic absorption level.
However, as urban distribution networks keep growing, the number of controllable switches increases accordingly, and traditional model-based dynamic reconstruction methods struggle with the resulting high-dimensional action space: solving is slow, prone to falling into local solutions, and sometimes infeasible altogether. Reinforcement learning, by contrast, is directly data-driven and obtains the optimal strategy through frequent interaction with the environment; it offers strong policy-exploration capability and very high solving speed, and its advantage becomes more pronounced as the environment grows more complex, so it provides a new way to efficiently solve the dynamic reconstruction strategy of ever-larger, complex urban distribution networks. Furthermore, among the numerous controllable switches of a large-scale distribution network, a considerable number do not actively participate in reconstruction optimization, and some never act during the whole dispatching period; their contribution to reconstruction optimization is low. Yet an agent still performs trial-and-error learning on the actions of these switches while interacting with the environment, which wastes substantial computing resources and increases the agent's learning difficulty. Discarding the switches whose contribution to reconstruction optimization is low can therefore greatly reduce the agent's action space, speed up training, strengthen the agent's perception of the important switch actions, and improve the optimization capability. Based on this analysis, a reinforcement learning method for urban distribution network reconstruction operation that takes the switch contribution degree into account deserves in-depth study.
Disclosure of Invention
The invention provides a two-stage reinforcement learning method for the reconstruction operation of an urban power distribution network, which addresses the technical problem that the high-dimensional switch action space of existing power distribution networks is difficult for conventional reconstruction methods to solve.
The invention is realized by the following technical scheme: a two-stage reinforcement learning method for reconstruction operation of an urban power distribution network comprises the following steps:
step one, constructing a dynamic reconstruction operation mathematical model of the urban power distribution network;
step two, quantifying the contribution degree of each switch according to photovoltaic absorption and power supply capability, screening the switches by contribution value, and omitting the screened-out low-contribution switches during reconstruction optimization, thereby achieving a large-scale dimensionality reduction of the switch action space;
step three, constructing a Weighted QMIX multi-agent reinforcement learning model to deepen the agents' perception of switch contribution, proposing a two-stage reinforcement learning framework, dividing the screened switches into two groups according to their contribution, and dividing the reinforcement learning training process into two stages accordingly;
and step four, training the Weighted QMIX multi-agent reinforcement learning model by adopting a method of centralized training and decentralized execution to obtain an optimal strategy for dynamic reconstruction operation of the urban power distribution network.
Optionally, in order to better implement the present invention, the objective function of the dynamic reconstruction operation mathematical model of the urban power distribution network is:
$$\min F=\sum_{t=1}^{T}\left(c_{1}P_{t}^{\mathrm{cut}}+c_{2}P_{t}^{\mathrm{loss}}+c_{3}P_{t}^{\mathrm{net}}+c_{4}N_{t}^{\mathrm{sw}}\right)$$
wherein: $T$ is the number of hours in the distribution network reconstruction optimization cycle; $P_{t}^{\mathrm{cut}}$ is the curtailed photovoltaic energy at time $t$; $P_{t}^{\mathrm{loss}}$ is the load loss at time $t$; $P_{t}^{\mathrm{net}}$ is the network loss of the distribution network at time $t$; $N_{t}^{\mathrm{sw}}$ is the number of switching operations at time $t$; and $c_{1}$, $c_{2}$, $c_{3}$ and $c_{4}$ are respectively the cost coefficients of photovoltaic curtailment, load loss, network loss and switch operation.
The curtailed photovoltaic energy, the load loss, the network loss and the number of switching operations are calculated as follows:
The curtailed photovoltaic energy is calculated as:
$$P_{t}^{\mathrm{cut}}=\sum_{i\in\Omega_{\mathrm{PV}}}\left(P_{i,t}^{\mathrm{PV}}-P_{i,t}^{\mathrm{grid}}\right)$$
wherein: $\Omega_{\mathrm{PV}}$ represents the set of photovoltaic access nodes; $P_{i,t}^{\mathrm{PV}}$ represents the photovoltaic output power at node $i$ at time $t$; $P_{i,t}^{\mathrm{grid}}$ represents the photovoltaic power actually injected into the grid at time $t$;
the load loss is calculated as:
$$P_{t}^{\mathrm{loss}}=\sum_{i\in\Omega_{\mathrm{LS}}}\left(P_{i,t}^{\mathrm{L}}-P_{i,t}^{\mathrm{inj}}\right)$$
wherein: $\Omega_{\mathrm{LS}}$ represents the set of load-shedding nodes; $P_{i,t}^{\mathrm{L}}$ represents the predicted load power of node $i$ at time $t$; $P_{i,t}^{\mathrm{inj}}$ represents the power actually injected into node $i$ at time $t$;
the network loss is calculated as:
$$P_{t}^{\mathrm{net}}=\sum_{l\in\Omega_{\mathrm{B}}}I_{l,t}^{2}R_{l}$$
wherein: $\Omega_{\mathrm{B}}$ is the set of all branches of the distribution network; $I_{l,t}$ is the RMS value of the current in branch $l$ at time $t$; $R_{l}$ is the resistance of branch $l$;
the number of switching operations is calculated as:
$$N_{t}^{\mathrm{sw}}=\sum_{l\in\Omega_{\mathrm{SW}}}\left|\alpha_{l,t}-\alpha_{l,t-1}\right|$$
wherein: $\Omega_{\mathrm{SW}}$ is the set of all switch branches of the distribution network; $\alpha_{l,t}$ and $\alpha_{l,t-1}$ respectively represent the state of the switch on branch $l$ after the reconstruction of the distribution network at times $t$ and $t-1$, a value of 0 indicating that the switch on branch $l$ is open and a value of 1 indicating that it is closed.
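To make the cost structure concrete, the following is a minimal sketch of how the four cost terms could be evaluated for a candidate 24-hour schedule; the function name, array layouts and default coefficient values are illustrative assumptions, not values from the patent.

```python
import numpy as np

def daily_operation_cost(p_curtail, p_loss, p_net, switch_states,
                         c1=1.0, c2=10.0, c3=0.5, c4=0.1):
    """Evaluate F = sum_t (c1*P_cut_t + c2*P_loss_t + c3*P_net_t + c4*N_sw_t).

    p_curtail, p_loss, p_net : arrays of shape (T,), the hourly curtailed PV
        energy, load loss and network loss.
    switch_states : 0/1 array of shape (T + 1, n_switches); row 0 is the state
        before the scheduling day so that hour t can be compared with t - 1.
    """
    n_sw = np.abs(np.diff(switch_states, axis=0)).sum(axis=1)  # N_sw_t per hour
    return float(np.sum(c1 * p_curtail + c2 * p_loss + c3 * p_net + c4 * n_sw))
```

The negative of this quantity is what the reinforcement learning agents later see, split into reward terms.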
Optionally, in order to better implement the present invention, the constraint conditions of the dynamic reconstruction operation mathematical model of the urban power distribution network include a power flow constraint, a safe operation constraint, a reconstruction constraint, a photovoltaic output constraint and a load loss constraint. The power flow constraint is calculated as:
$$P_{i,t}=U_{i,t}\sum_{j\in i}U_{j,t}\left(G_{ij}\cos\theta_{ij,t}+B_{ij}\sin\theta_{ij,t}\right),\qquad Q_{i,t}=U_{i,t}\sum_{j\in i}U_{j,t}\left(G_{ij}\sin\theta_{ij,t}-B_{ij}\cos\theta_{ij,t}\right)$$
wherein: $P_{i,t}$ and $Q_{i,t}$ respectively represent the active and reactive power injected at node $i$ at time $t$; $U_{i,t}$ represents the voltage of node $i$ at time $t$; $G_{ij}$ and $B_{ij}$ are the admittance components between adjacent nodes; $\theta_{ij,t}$ is the voltage phase angle difference.
The safe operation constraint is calculated as:
$$U_{i}^{\min}\le U_{i,t}\le U_{i}^{\max},\qquad \left|I_{l,t}\right|\le I_{l}^{\max}$$
The reconstruction constraint is calculated as:
$$N^{\mathrm{fix}}+\sum_{l\in\Omega_{\mathrm{SW}}}\alpha_{l,t}=N^{\mathrm{node}}-N^{\mathrm{sub}}$$
wherein: $N^{\mathrm{fix}}$ represents the total number of branches in the grid frame that are always in the closed state and cannot be adjusted; $\delta(i)$ denotes the set of branch terminal nodes with node $i$ as the initial node; $\pi(i)$ denotes the set of branch initial nodes with node $i$ as the terminal node; $N^{\mathrm{node}}$ represents the total number of system nodes; $N^{\mathrm{sub}}$ represents the number of substations of the optimized subject. Since a distribution network containing distributed photovoltaics may operate in islands under this constraint alone, it is supplemented: power is injected at the non-substation nodes and the nodes are kept in a connected state through a simplified power flow constraint,
$$\sum_{j\in\pi(i)}F_{ji,t}-\sum_{j\in\delta(i)}F_{ij,t}=1,\qquad \left|F_{ij,t}\right|\le M\alpha_{ij,t}$$
wherein: $F_{ij,t}$ represents the auxiliary power flow active power on branch $ij$ at time $t$, and $M$ is a sufficiently large constant.
The photovoltaic output constraint is calculated as:
$$P_{i,t}^{\mathrm{PV,min}}\le P_{i,t}^{\mathrm{grid}}\le P_{i,t}^{\mathrm{PV,max}}$$
wherein: $P_{i,t}^{\mathrm{PV,max}}$ is the upper limit of the photovoltaic active output at node $i$ in period $t$; $P_{i,t}^{\mathrm{PV,min}}$ is the lower limit of the photovoltaic active output at node $i$ in period $t$.
The load loss constraint is calculated as:
$$P_{i,t}^{\mathrm{L}}-P_{i,t}^{\mathrm{inj}}\le\lambda_{i}P_{i,t}^{\mathrm{L}}$$
wherein: $\lambda_{i}$ is the loss-of-load scaling factor of node $i$.
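The radiality and connectivity conditions behind the reconstruction constraint can be checked for a candidate switch combination as in the sketch below. This is a feasibility check under assumed data structures, not the patent's optimizer, which would embed the fictitious-flow constraints directly in the mathematical program.

```python
import networkx as nx

def topology_feasible(closed_branches, n_nodes, substation_nodes):
    """Check that the closed branches form a spanning forest with exactly one
    tree per substation, i.e. a radial, fully fed network with no islands.

    closed_branches  : iterable of (i, j) node pairs whose switch is closed.
    substation_nodes : set of node indices that host a substation.
    """
    g = nx.Graph()
    g.add_nodes_from(range(n_nodes))
    g.add_edges_from(closed_branches)
    # Radiality: number of closed branches == nodes - substations.
    if g.number_of_edges() != n_nodes - len(substation_nodes):
        return False
    # Connectivity: every component must contain exactly one substation node
    # (together with the edge count this also rules out cycles).
    for comp in nx.connected_components(g):
        if sum(1 for n in comp if n in substation_nodes) != 1:
            return False
    return True
```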
Optionally, in order to better implement the present invention, the method for quantifying the switch contribution degree comprises the following steps:
A Monte Carlo method is used to simulate $M$ source-load samples, a noise-adding method is applied to emphasize the photovoltaic curtailment and load-shedding phenomena in the source-load samples, and the dynamic reconstruction operation mathematical model of the urban power distribution network is solved to generate, from the $M$ source-load samples, $M$ corresponding dynamic reconstruction switch action samples.
The substation $k$ to which a switch belongs is taken as the environment for acquiring the indices.
For the dynamic reconstruction switch action samples, the index quantification for photovoltaic absorption and power supply capability improvement is calculated as:
$$\rho_{k,t}^{\mathrm{PV}}=\frac{\sum_{i\in\Phi_{k}^{\mathrm{PV}}}E_{i,t}^{\mathrm{PV}}}{\sum_{t=1}^{24}\sum_{i\in\Phi_{k}^{\mathrm{PV}}}E_{i,t}^{\mathrm{PV}}},\qquad \rho_{k,t}^{\mathrm{LS}}=\frac{\sum_{i\in\Phi_{k}^{\mathrm{LS}}}E_{i,t}^{\mathrm{C}}}{\sum_{t=1}^{24}\sum_{i\in\Phi_{k}^{\mathrm{LS}}}E_{i,t}^{\mathrm{C}}}$$
wherein: $\rho_{k,t}^{\mathrm{PV}}$ represents the proportion of the photovoltaic absorption at time $t$ of all photovoltaics on substation $k$ to the total absorption, and $\rho_{k,t}^{\mathrm{LS}}$ represents the proportion of the power compensation at time $t$ of all load shedding on substation $k$ to the total compensation; $\Phi_{k}^{\mathrm{PV}}$ represents all photovoltaic nodes belonging to substation $k$; $E_{i,t}^{\mathrm{PV}}$ represents the photovoltaic absorption of photovoltaic node $i$ at time $t$; $\Phi_{k}^{\mathrm{LS}}$ represents all load-loss nodes belonging to substation $k$; $E_{i,t}^{\mathrm{C}}$ represents the power compensation of load-loss node $i$ at time $t$; the quantities are evaluated on each dynamic reconstruction switch action sample.
After $\rho_{k,t}^{\mathrm{PV}}$ and $\rho_{k,t}^{\mathrm{LS}}$ are obtained, the 24-hour photovoltaic absorption capability and power supply capability of each switch are quantified as:
$$c_{s,t}^{\mathrm{PV}}=\lambda_{s,t}\,\rho_{k,t}^{\mathrm{PV}},\qquad c_{s,t}^{\mathrm{LS}}=\lambda_{s,t}\,\rho_{k,t}^{\mathrm{LS}}$$
wherein: $c_{s,t}^{\mathrm{PV}}$ denotes the quantified photovoltaic absorption capability of switch $s$ at time $t$, and $c_{s,t}^{\mathrm{LS}}$ denotes its quantified power supply capability at time $t$; $\lambda_{s,t}$ indicates whether switch $s$ acts at time $t$. If switch $s$ does not act at time $t$, the switch is considered to make no contribution to photovoltaic absorption and power supply improvement at that moment; if switch $s$ acts at time $t$, it is considered to contribute to photovoltaic absorption and power supply improvement at that moment.
After obtaining the $c_{s,t}^{\mathrm{PV}}$ and $c_{s,t}^{\mathrm{LS}}$ of switch $s$ in each of the $M$ samples, they are accumulated and averaged to obtain the quantified photovoltaic absorption capability and power supply improvement capability of the switch over the $M$ samples:
$$C_{s}^{\mathrm{PV}}=\frac{1}{M}\sum_{m=1}^{M}\sum_{t=1}^{24}c_{s,t,m}^{\mathrm{PV}},\qquad C_{s}^{\mathrm{LS}}=\frac{1}{M}\sum_{m=1}^{M}\sum_{t=1}^{24}c_{s,t,m}^{\mathrm{LS}},\qquad C_{s}=C_{s}^{\mathrm{PV}}+C_{s}^{\mathrm{LS}}$$
wherein: $C_{s}^{\mathrm{PV}}$ and $C_{s}^{\mathrm{LS}}$ represent the quantified photovoltaic absorption capability and power supply improvement capability of switch $s$ based on the $M$ samples; $C_{s}$ denotes the final contribution degree quantization value of switch $s$.
After the contribution degree quantization values of all switches are obtained, the sectionalizing switches and the interconnection switches are evaluated and screened separately. The evaluation rule is as follows: the contribution quantization values of the switches are sorted, with 50% as the boundary; a switch whose contribution quantization value ranks above 50% is judged to be an important switch, and one whose value ranks below 50% is judged to be a non-important switch. In the reconstruction optimization, the non-important switches are ignored, i.e. the non-important sectionalizing switches are kept constantly closed and the non-important interconnection switches constantly open, and the action combinations of the important switches are taken as the optimization object.
Optionally, in order to better implement the present invention, the Weighted QMIX multi-agent deep reinforcement learning model is calculated by the following method:
Weight factors are introduced to adjust the contributions of the different agents to the joint Q-value function.
For the projection operator of the joint action-value function, a weighting function $w(s,\mathbf{u})$ is added:
$$\Pi_{w}Q=\arg\min_{Q_{tot}}\sum_{\mathbf{u}}w(s,\mathbf{u})\left(Q(s,\mathbf{u})-Q_{tot}(s,\mathbf{u})\right)^{2}$$
wherein the parameters of the unrestricted joint critic $\hat{Q}^{*}$ and of $Q_{tot}$ are updated as follows:
(1) the loss function of $\hat{Q}^{*}$ is:
$$\mathcal{L}_{\hat{Q}^{*}}=\left(\hat{Q}^{*}(s,\mathbf{u})-y\right)^{2},\qquad y=r+\gamma\hat{Q}^{*}\!\left(s',\arg\max_{\mathbf{u}'}Q_{tot}(s',\mathbf{u}')\right)$$
(2) the loss function of $Q_{tot}$ is:
$$\mathcal{L}_{Q_{tot}}=w(s,\mathbf{u})\left(Q_{tot}(s,\mathbf{u})-y\right)^{2}$$
(3) the expression of $w(s,\mathbf{u})$ is:
$$w(s,\mathbf{u})=\begin{cases}1,&Q_{tot}(s,\mathbf{u})<y\\ \alpha,&\text{otherwise}\end{cases}$$
The actions of each switch in the joint action space are weighted by the added weighting function, highlighting the fact that different switches have different contribution degrees.
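The weighted update can be sketched as below, assuming the optimistically-weighted form of Weighted QMIX from the literature; the patent's exact weighting function is not reproduced here, so the tensor names and the choice of weighting are assumptions.

```python
import torch

def weighted_qmix_losses(q_tot, q_star, target_y, alpha=0.75):
    """TD losses in the spirit of Weighted QMIX (OW weighting assumed).

    q_tot    : Q_tot(s, u) from the monotonic mixing network, shape (batch,).
    q_star   : unrestricted joint critic Q*(s, u), same shape.
    target_y : bootstrapped target r + gamma * Q*(s', argmax_u' Q_tot(s', .)).
    """
    y = target_y.detach()
    td = q_tot - y
    # w(s, u) = 1 where Q_tot underestimates the target, alpha elsewhere, so
    # underestimated joint actions are corrected at full weight.
    w = torch.where(td < 0, torch.ones_like(td), torch.full_like(td, alpha))
    loss_qtot = (w * td.pow(2)).mean()        # weighted projection loss for Q_tot
    loss_qstar = (q_star - y).pow(2).mean()   # unweighted loss for Q*
    return loss_qtot, loss_qstar
```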
Optionally, to better implement the present invention, a multi-agent interaction model is also included. The multi-agent interaction model comprises an observation space, a state space, an action space, a reward function and state transition probabilities. The observation space represents the state values that each agent is able to observe from the environment; the observation of agent $n$ at time $t$ is defined as:
$$o_{n,t}=\left\{P_{i,t},\,U_{i,t},\,I_{l,t},\,\boldsymbol{\alpha}_{t},\,t\right\},\qquad i,l\in k$$
wherein: $P_{i,t}$, $U_{i,t}$ and $I_{l,t}$ respectively represent, at time $t$, the power and voltage of node $i$ and the current of branch $l$ on the substation $k$ to which the switch controlled by agent $n$ belongs; $\boldsymbol{\alpha}_{t}$ is the on-off status of all switches; and $t$ is the time of day.
The state space represents the union of all agent observation spaces, expressed as follows:
$$s_{t}=\bigcup_{n}o_{n,t}$$
wherein the dynamic reconfiguration problem of the distribution network is set to be a fully observable problem, i.e. $s_{t}=o_{n,t}$.
The action space represents the action each agent takes after acquiring its observation at a given time; the action of agent $n$ at time $t$ is defined as:
$$a_{n,t}=\left\{\alpha_{n,t}\right\}$$
wherein: $\alpha_{n,t}$ represents the state value of the sectionalizing switch controlled by agent $n$ at time $t$.
The reward function represents the set of reward values obtained after the agents interact with the environment; for agent $n$ at time $t$ the reward function is defined as:
$$r_{n,t}=-\left(c_{3}P_{t}^{\mathrm{net}}+c_{4}N_{t}^{\mathrm{sw}}\right)-\sigma_{t}$$
wherein: $c_{3}P_{t}^{\mathrm{net}}$ and $c_{4}N_{t}^{\mathrm{sw}}$ respectively represent the two indices mentioned above, the network loss index and the switch action cost index, and $\sigma_{t}$ represents an out-of-limit penalty term.
The state transition probability describes the environmental impact of the multi-agent actions: $P(s_{t+1}\mid s_{t},a_{n,t})$ represents the probability that agent $n$, taking action $a_{n,t}$ in state $s_{t}$, transitions to state $s_{t+1}$; under the current policy $\pi$ the state transition probability is $P^{\pi}(s_{t+1}\mid s_{t})$.
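A minimal sketch of the observation structure and reward signal defined above; the field names and coefficient values are illustrative assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """Observation o_{n,t} of one agent: quantities of the substation to which
    its controlled switch belongs, plus the global switch status and the hour."""
    node_power: np.ndarray      # P_{i,t} for the substation's nodes
    node_voltage: np.ndarray    # U_{i,t}
    branch_current: np.ndarray  # I_{l,t} for the substation's branches
    switch_status: np.ndarray   # on/off state of all switches
    hour: int                   # time of day t

def agent_reward(p_net_t, n_sw_t, limit_violations, c3=0.5, c4=0.1, c_pen=100.0):
    """Reward r_{n,t}: negative network-loss and switching-cost indices minus an
    out-of-limit penalty term (all coefficient values here are assumptions)."""
    return -(c3 * p_net_t + c4 * n_sw_t) - c_pen * limit_violations
```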
optionally, to better implement the present invention, the learning system includes:
the state detection module is used for detecting photovoltaic real-time output data and load real-time demand data of the urban power distribution network;
the information storage module is used for storing the data detected by the state detection module;
the switch contribution degree quantization module is used for quantizing the contribution degree of each switch according to the photovoltaic absorption rate and load loss events, so as to provide a data basis for distribution network dispatchers to screen the switches;
the switch contribution degree evaluation module is used for providing a method basis for screening the switches for distribution network scheduling personnel;
the switch grouping module is used for grouping the screened switches and providing basis for the subsequent reinforcement learning optimization in two stages;
The multi-agent centralized training module is used for respectively and intensively training each agent in two stages in a mode of sharing observation values;
and the single-agent decentralized execution module operates each trained agent independently, so that each agent can independently formulate its own reconstruction operation optimization strategy.
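How the seven modules could be chained is sketched below; every interface name here is a hypothetical placeholder used only to show the data flow between the modules.

```python
def run_reconfiguration_pipeline(detector, store, quantizer, evaluator,
                                 grouper, trainer, executor):
    """Chain the seven modules; all method names are illustrative placeholders."""
    data = detector.read_pv_and_load()                 # state detection module
    store.save(data)                                   # information storage module
    scores = quantizer.score(store.history())          # contribution quantization
    important = evaluator.screen(scores)               # contribution evaluation (top 50%)
    group1, group2 = grouper.split(important, scores)  # switch grouping module
    agents = trainer.train_two_stage(group1, group2)   # multi-agent centralized training
    return executor.deploy(agents)                     # single-agent decentralized execution
```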
Compared with the prior art, the invention has the following beneficial effects:
in the problem of distribution network reconstruction operation optimization, the huge switch action space brought by ever-larger urban distribution networks makes traditional reconstruction methods difficult to solve. On this basis, the invention provides a two-stage reinforcement learning method for urban power distribution network reconstruction operation that achieves a large-scale dimensionality reduction of the switch action space and helps the multi-agent system find the optimal nonlinear mapping between source-load data and the reconstruction optimization scheme. The decision result of the reconstruction operation scheme, obtained from the deployed reinforcement learning model, can help distribution network operators make quick decisions and avoids wasting computing resources, so the method has practical value.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow diagram of a two-stage reinforcement learning method for power distribution network reconstruction operation according to the present invention;
FIG. 2 is a schematic diagram of the two-stage reinforcement learning system for power distribution network reconstruction operation of the present invention;
FIG. 3 is a schematic view of the subject selection of the present invention;
FIG. 4 is a schematic diagram of the distribution of multiple agents and switches according to the present invention;
FIG. 5 is a block diagram of a 297 node system used in the present invention;
FIG. 6 is a diagram showing the effect of the method of the invention for verifying the contribution of the switch;
FIG. 7 is a diagram of the effect of the two-stage reinforcement learning framework of the present invention;
FIG. 8 is a diagram showing the effect of the Weighted QMIX algorithm of the present invention;
FIG. 9 is a graph comparing the photovoltaic absorption effects of the different methods;
FIG. 10 is a graph comparing the power supply capability improvement effects of the different methods.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. Based on the embodiments of the invention, all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the protection scope of the invention.
Example 1:
the embodiment provides a two-stage reinforcement learning method for reconstruction operation of an urban power distribution network, wherein:
the two-stage reinforcement learning framework is as follows: all switches participating in the optimization are ranked by contribution degree and divided into two groups, with 50% as the dividing line; the switches with higher contribution form group 1 and those with lower contribution form group 2. In the first stage, an agent is assigned to each switch in group 1 and all switches in group 2 are ignored, their states keeping the action result of the previous moment; group 1 is then optimized by reconstruction. The topology constraint of the distribution network need not be considered in this stage, the out-of-limit penalty being left to the reward function of the second stage. In the second stage, an agent is assigned to each switch in group 2 and all switches in group 1 are ignored, their states keeping the action result of the first stage; group 2 is then optimized by reconstruction, and the topology constraint of the distribution network must be considered in this stage.
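Under assumed environment hooks (freeze, enforce_topology and the agent factory are hypothetical interfaces, not the patent's implementation), the two-stage procedure could look like the following sketch.

```python
def two_stage_training(env, group1, group2, make_agents, train, episodes=60000):
    """Two-stage reinforcement learning over the screened switches.

    Stage 1: one agent per group-1 (high-contribution) switch; group-2 switches
    hold their previous states and the topology constraint is relaxed.
    Stage 2: one agent per group-2 switch; group-1 switches keep the stage-1
    result and the topology constraint is enforced.
    """
    agents1 = make_agents(group1)
    env.freeze(group2)                 # group-2 switches hold their last states
    env.enforce_topology(False)        # stage 1: topology constraint not considered
    train(agents1, env, episodes)

    stage1_plan = {s: agents1[s].greedy_plan(env) for s in group1}
    agents2 = make_agents(group2)
    env.freeze(group1, at=stage1_plan)  # group-1 switches keep the stage-1 result
    env.enforce_topology(True)          # stage 2: topology constraint enforced
    train(agents2, env, episodes)
    return agents1, agents2
```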
As shown in fig. 1, a dynamic reconstruction operation optimization mathematical model in the context of a large-scale urban distribution network is proposed; the method comprises the steps of providing a concept of contribution degree of a switch on the basis of a reconstruction mathematical model of a power distribution network, quantifying the contribution degree of the switch from two angles of photovoltaic digestion capability and power supply capability, evaluating the switch according to the contribution degree, and neglecting the switch with low contribution degree in reconstruction optimization, so that the dimension of an intelligent body action space is greatly reduced; the Weighted QMIX algorithm based on QMIX can give different weights to different switch actions so as to strengthen the perception of the intelligent body on the importance of the switch, and according to the difference of the contribution degree of the switch obtained after screening, a two-stage reinforcement learning framework is provided, wherein the intelligent body in the first stage controls the action of the switch with higher contribution degree, and the intelligent body in the second stage controls the action of the switch with lower contribution degree, so that the negative influence of the unbalanced distribution of the weight of the switch action is avoided; the reinforcement learning model is trained by adopting a training method of centralized training and decentralized execution, namely all the agents in the training stage share observed quantity, all the agents in the execution stage independently execute, and the optimal strategy for the reconstruction operation of the urban power distribution network can be obtained rapidly after the training is finished.
As shown in fig. 2, the two-stage reinforcement learning system for power distribution network reconstruction operation includes a state detection module, an information storage module, a switch contribution quantization module, a switch contribution evaluation module, a switch grouping module, a multi-agent centralized training module, and a single-agent decentralized execution module. The state detection module is used for detecting photovoltaic real-time output data and load real-time demand data of the urban power distribution network and storing the photovoltaic real-time output data and the load real-time demand data in the information storage module; the information storage module is used for storing historical data information of the photovoltaic and the load; the switch contribution degree quantization module is used for carrying out contribution degree quantization on each switch according to two types of indexes, and the result of the switch contribution degree quantization module provides a data basis for screening the switches by distribution network scheduling personnel; the switch contribution degree evaluation module provides a method basis for screening the switch by a distribution network dispatcher; the switch grouping module is used for grouping the screened switches and providing basis for the subsequent reinforcement learning optimization in two stages; the multi-agent centralized training module is used for respectively carrying out centralized training on each agent in two stages in a mode of sharing observation values; and the single-agent scattered execution module independently operates each trained agent, and each agent independently makes a respective reconstruction operation optimization strategy without interaction.
As shown in fig. 3, since different switches have different contribution degrees, calculation of the contribution degrees of the switches should be performed one by one. The invention considers that the contribution degree of all switches in the power distribution network is calculated one by one, namely, in one contribution degree calculation, a research object is concentrated on one switch. In terms of the selection of contribution index, in view of the problems of low photovoltaic absorption rate and frequent load loss event commonly existing in a large-scale power distribution network, the invention focuses on the improvement effect of a switch on the photovoltaic absorption capacity and the load power supply capacity of the power distribution network, so the quantification method is described as follows:
(1) Simulation generation using Monte Carlo methodThe phenomena of photovoltaic light rejection and load reduction in the sample are emphasized by the source load sample through a noise adding method; by a step-one model, byIndividual source load sample generationDynamic reconfiguration switch action samples corresponding to the samples;
(2) Considering that the contribution degree of a single switch to the optimization of the whole urban distribution network is limited, and more local optimization of the urban distribution network is realized, the invention ensures that the switch belongs to the transformer substationRather than the entire distribution network as an environment for obtaining metrics. For samplesThe index quantification method for improving the photovoltaic digestion and the power supply capacity is as follows:
Wherein:representing a substationAll of the above are photovoltaicThe photovoltaic consumption at the moment is in proportion to the total consumption,representing a substationAll of the above load shedding is performedThe proportion of the power compensation quantity at the moment to the total compensation quantity;representing all belonging to substationsIs a light Fu Jiedian of (2);representation ofTime light Fu JiedianIs a photovoltaic digestion amount of (2);representing all belonging to substationsIs not loaded on the load losing node;representation ofMoment load losing nodeIs used for compensating the power of the power supply.
To be obtainedAndafter that, the 24-hour photovoltaic digestion capability and the power supply capability of the switch are quantized, and the quantization processing is as follows:
wherein:indicating switchAt the position ofQuantification of photovoltaic digestion capability at the moment,indicating switchAt the position ofAnd quantifying the power supply capacity at the moment. If it isTime switchIf no action takes place, it is considered that at this point the switch does not contribute to photovoltaic digestion and improved power supply, ifTime switchAction takes place, it is considered that at this point the switch contributes to photovoltaic digestion and to an improved supply of electricity.
To obtain the switch respectivelyAt the position ofObtained from individual samplesAndthen, the obtained products are accumulated and averaged to obtain the productThe photovoltaic capacity and power supply lifting capacity quantization values of the switch under the condition of a plurality of samples are shown as the following formula:
Wherein:andthe representation is based onContact switch for individual samplesPhotovoltaic digestion capability and power supply boost capability quantification values;indicating switchQuantized value of the final contribution of (c).
(3) And for the interconnection switch between the two substations, taking an average value, and calculating the same method.
(4) After the contribution quantized values of all the switches are obtained, the switches need to be evaluated and screened. Considering that the sectionalizer is essentially different from the tie switch, the two types of switches are separated at the time of evaluation.
The evaluation rule is: the contribution quantized values of the switches are ranked, the demarcation is 50%, and if the contribution quantized values of the switches are ranked higher than 50%, the switches are considered important, and if the contribution quantized values of the switches are ranked lower than 50%, the switches are considered non-important. In the reconstruction optimization, non-important switches are ignored, namely, the non-important sectionalizing switches are constantly closed, the non-important interconnecting switches are constantly opened, and the action combination of the important switches is taken as an optimization object.
As shown in fig. 4, the dynamic reconstruction of the present invention aims at obtaining an optimal switching state combination scheme within 24 hours, which belongs to a sequential decision problem, and the decision scheme of a future period is not affected by the past power distribution network environment but is only related to the current period power distribution network environment, so that the problem of the present invention satisfies the markov property, and furthermore, an optimization object can be regarded as a set of results composed of a plurality of independent individuals participating in joint optimization. According to the analysis, the dynamic reconfiguration optimization problem of the power distribution network based on multi-agent deep reinforcement learning can be converted into a Markov model.
The Markov model is as follows:
It is composed of the state set $S$, the action sets $A_{n}$ of the agents, the reward functions $R_{n}$ of the agents, the observations $O_{n}$ of the agents and the state transition function $P$. The goal of each agent is to learn a strategy $\pi_{n}$ that maximizes the accumulated discounted reward $G_{t}=\sum_{k=0}^{T}\gamma^{k}r_{t+k}$, where the discount rate $\gamma$ balances short-term and long-term returns and $T$ is the total number of time periods of the optimization problem.
Because the optimization object can be regarded as a set of results formed by the joint optimization of a plurality of independent individuals, and the important property that different switches have different contribution degrees is considered, the invention distributes an agent for each switch participating in the reconstruction optimization, that is to say, the action space of each agent at each moment is only 2, and the agent focuses on the action condition of the switch without judging the action of other switches. By means of Weighted QMIX algorithm, this approach can improve the optimization efficiency of the model, which is more advantageous than the single agent approach.
And (3) performing example verification analysis:
as shown in fig. 5, the proposed method was verified using an actual 297 node system consisting of 2 substations, 4 transformers and 8 feeders. The transformer substation 1 is provided with 2 transformers T1 and T2; the low-voltage side of the T1 is connected with 2 feeder lines S1T11 and S1T12; the low-voltage side of T2 is connected with two feeder lines S1T21 and S1T22. The transformer station 2 is provided with two transformers T3 and T4; the low-voltage side of the T3 is connected with 2 feeder lines S2T11 and S2T12; the low-voltage side of T4 is connected with 2 feeder lines S2T21 and S2T22.
As shown in fig. 6, the validity of the method of taking the switching contribution into account and performing switching screening according thereto was verified:
to verify the effectiveness of the method that considers the switch contribution degree, it is compared with a method that optimizes directly without considering switch contribution; the grouping method is the same as above. For the urban distribution network dynamic reconstruction problem, the performance of the 2 methods is compared in fig. 6, where graphs (a) and (b) are the training curves of the two methods respectively. It should be noted that, since the reward function of the second stage reflects the final optimization effect, all reward curves shown in the invention are those of the second stage.
Analysis of fig. 6 can be concluded as follows:
in the problem of the invention, the return curve of method (a) converges within 60000 training rounds, while the return curve of method (b) still fails to converge after the same number of rounds. Graph (a) shows a clear advantage in convergence speed: its return curve approaches a plateau at about 42000 rounds and the return value finally stabilizes around -1800. In graph (b), the return curve jitters strongly, cannot reliably converge within the limited rounds and shows no trend toward convergence, its return value finally staying around -2400, which indicates that the multi-agent reinforcement learning algorithm cannot efficiently optimize the distribution network dynamic reconstruction problem within the limited rounds. Analyzing the early training stage, graph (a) shows that the agents learn useful strategies at about 2000 rounds, after which the return value starts to improve markedly, whereas the agents in graph (b) fail to explore useful strategies early on and frequently take out-of-limit actions, so the return curve jitters severely and the return value does not begin to improve clearly until about 17000 rounds.
The root cause of the great difference of the multi-agent reinforcement learning results under the two methods is the difference of the agent joint action spaces. The combined action space of the method (a) is a screening result according to the switching contribution degree under the premise of considering the dual lifting of the light Fu Xiaona capability and the power supply capability, and all the intelligent agents control the switches which have large optimization contribution degree and relatively frequent actions on the power distribution network, so that the scale of the combined action space is greatly reduced, and the calculation speed and the model convergence speed of the intelligent agents are obviously improved; in addition, as the switch screening operation is subjected to a strict quantitative evaluation process, the connection among the switches is tighter, the coordination and optimization capacity of the corresponding multi-agent is improved, and each agent can learn a better strategy more easily.
As can be obtained by the analysis of fig. 6, the urban power distribution network reconstruction operation model taking the switching contribution degree into consideration has better performance both in convergence speed and optimization effect, so the method provided by the invention is effective.
As shown in fig. 7, the validity of the two-stage reinforcement learning framework was verified:
to verify the effectiveness of the two-stage dynamic reconstruction method, it is compared with a method that performs reconstruction optimization directly after switch screening, without grouping. For the urban distribution network dynamic reconstruction problem, the performance of the 2 methods is compared in fig. 7, where graphs (a) and (b) are the training curves of the two methods respectively.
Analysis of fig. 7 can be concluded as follows:
in the problem of the invention, both methods in fig. 7 converge within 60000 rounds and their final return values stabilize around -1800, indicating that the method of graph (b) can also learn the optimal strategy within the limited rounds. By contrast, however, the method of graph (a) not only shows smaller fluctuations of the return value during learning but also converges much faster than the method of graph (b), especially in the early stage of training.
The reason why the training curves corresponding to the two methods are greatly different is that the complexity of the optimization problem is changed. The method provided by the invention adopts a two-stage dynamic reconstruction strategy, is essentially a hierarchical reinforcement learning, and can well solve the marginal effects of too sparse return, too large action space and multiple agents, and simplify the complex problem. Compared with a comparison method, the method provided by the invention distinguishes two types of switches with larger contribution degree difference, performs two-stage reinforcement learning, reduces the negative influence of the Weighted QMIX algorithm on the unbalanced switch weight distribution, and simultaneously greatly reduces the quantity of the agents required by reconstruction optimization in each stage, thereby reducing the possibility that excellent actions cannot be learned due to too sparse excellent return, avoiding marginal effects caused by too much quantity of the agents, and reducing the learning difficulty of the agents on a better strategy.
As the analysis of fig. 7 shows, the two-stage reinforcement learning framework achieves better performance in both convergence speed and optimization effect, so the method provided by the invention is effective.
As shown in fig. 8, the validity of the Weighted QMIX algorithm was verified:
to verify the effectiveness of the Weighted QMIX algorithm proposed by the invention, the algorithm is compared with the QMIX algorithm and the QTRAN algorithm, both of which are multi-agent reinforcement learning methods based on value function decomposition, and with the DQN algorithm, which does not involve value function decomposition and is a single-agent reinforcement learning method. For the urban distribution network dynamic reconstruction problem, the performance of the 4 algorithms is compared in fig. 8, where graphs (a) to (d) correspond in order to Weighted QMIX, QMIX, QTRAN and DQN.
Analysis of fig. 8 can be concluded as follows:
in the problem of the invention, the Weighted QMIX, QMIX and QTRAN algorithms all converge within 60000 rounds, while the convergence of DQN is poor. Weighted QMIX converges fastest and reaches the highest return value; QMIX approaches convergence at about 50000 rounds with a final return value around -2000; QTRAN approaches convergence at about 55000 rounds with a final return value around -2400, lower than Weighted QMIX and QMIX; DQN exhibits a quite unstable return curve throughout training and hardly converges.
Comparing the results of (a) (b) (c), it can be known that the Weighted QMIX algorithm has the characteristics of further strengthening that different switches in the distribution network have different importance due to the different weights given to different actions of each agent, and the Weighted operation brings great benefit to learning of the agents, so that the agents can learn better strategies more easily, and therefore, compared with algorithms without action weight concepts such as QMIX and QTRAN, the Weighted QMIX algorithm has faster convergence speed and better optimization result; in addition, it can be seen that, because the constraint condition of the QTRAN algorithm is more loose, the QTRAN algorithm converges faster than the Weighted QMIX algorithm in the initial training period, but falls into a locally optimal strategy around 30000 rounds, and the final return value is the lowest of the three, although the agent interacts with a better strategy around 43000 rounds to get rid of the local solution gradually, which means that the QTRAN algorithm has poor performance in the problem of the present invention.
Comparing the results of (a) and (d), the single agent reinforcement learning method is weak when the problem of the invention is processed, and can not converge in a limited round, because the dimension of the action space is too high and the single agent is difficult to deal with, the learning fluctuation of the agent is large, the convergence speed is low and the optimization effect is poor as can be seen from the return curve; in contrast, the method for distributing each switch with one intelligent agent has obvious effect, greatly reduces the action space of each intelligent agent, eliminates the interference of other switches by means of the cooperative reinforcement learning method, and strengthens the perception of the intelligent agent on the action of the controlled switch, so that the Weighted QMIX algorithm not only has a convergence rate far faster than the DQN algorithm, but also obtains a return value far higher than the DQN algorithm.
From the analysis of FIG. 8, the Weighted QMIX algorithm has the best optimization effect and the fastest convergence speed, so the Weighted QMIX algorithm of the invention performs best.
As shown in fig. 9 and 10, table 1, the effectiveness of the proposed method compared with the conventional method is verified: a step of
In order to verify the performance advantages of the data driving method compared with the traditional model driving method, the method provided by the invention is compared with a global reconstruction method based on a three-level dynamic reconstruction method, a global reconstruction method based on a binary particle swarm and a global reconstruction method based on a mathematical programming algorithm. The 4 types of methods, including the method provided by the invention, are numbered from No. 1 to No. 4 in sequence, and in order to embody the optimization effect, the operation result of the power distribution network which is not reconstructed is added. In order to fully mine the performance of the class 4 method, the invention sets the following operation scene: the photovoltaic output is larger, the photovoltaic output cannot be completely absorbed, namely, the light rejection phenomenon occurs, meanwhile, the load demand is higher, the load exceeds the transmissibility of the feeder line, and namely, the load loss phenomenon occurs. In this scenario, the amount of light rejection and the amount of load loss for the class 4 method are shown in fig. 9 and 10, respectively, and the optimization results are shown in table 1.
Table 1 comparison of optimized results
As can be seen by combining fig. 9 and table 1, compared with the operation without reconstruction, the class 4 methods all significantly improve the photovoltaic digestion level, which indicates that although the global optimality of the switching action cannot be ensured due to the screening of the switch in the method 1, the photovoltaic digestion effect of the method 1 already exceeds the optimization precision of global reconstruction by virtue of the quantification of the photovoltaic digestion capability index of the switch and the full perception of the multi-agent system, and compared with the methods 2, 3 and 4, the light rejection cost is respectively reduced by 9.35%, 5.77% and 2.38%; as can be seen by combining fig. 10 and table 1, compared with the operation without reconstruction, the 4 types of methods all improve the load power supply capacity, wherein the method 2 can reduce the load coordination optimization, so the power supply capacity is improved the highest, the method 1 is improved by 11.24% compared with the method 2, but is lower than the methods 3 and 4, which shows that the switch obtained by screening after quantifying the power supply capacity index of the switch can obviously enhance the perception capacity of the intelligent body on the switch action strategy for improving the power supply capacity, and compared with the methods 3 and 4, the load loss cost is respectively reduced by 56.01% and 51.46%, and the improvement is obvious.
As can be seen from table 1, the network loss cost of method 1 is reduced by 2.03%, 4.57% and 1.09% compared with methods 2, 3 and 4, respectively, which indicates that for large-scale urban distribution network, method 1 can not guarantee the global optimal solution under any condition but also avoid a large-scale switching action, thereby reducing the network loss method; the method 2 carries out flexible switching classification action aiming at the structure of the power distribution network, thereby reducing the network loss cost; the conventional global reconstruction method such as the method 3 and the method 4 can cause a plurality of switching actions because of no local flexibility of the switch, which can bring about a large-scale power flow transfer, so that the operation cost of the power distribution network can be increased, and the operation risk can be brought about. In addition, method 2 reduces the switching action dimension by determining the reconstruction level, so the solution time is much shorter than methods 3, 4, but the solution time of method 2 is more than hundred times slower than method 1, and as the power distribution network scale expands, the solution speed advantage of method 1 compared to method 2 expands further.
Combining the results of fig. 9, fig. 10 and table 1, the method according to the invention performs best.
The foregoing is merely a preferred embodiment of the invention. It should be understood that the invention is not limited to the form disclosed herein, which is not to be regarded as excluding other embodiments; the invention is capable of use in various other combinations, modifications and environments and can be changed within the scope of the inventive concept described herein, by the above teachings or by the skill or knowledge of the relevant art. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the protection scope of the appended claims.
Example 2:
the learning system corresponding to the two-stage reinforcement learning method for urban power distribution network reconstruction operation provided in embodiment 1 comprises a state detection module, an information storage module, a switch contribution quantization module, a switch contribution evaluation module, a switch grouping module, a multi-agent centralized training module and a single-agent decentralized execution module. The state detection module detects the real-time photovoltaic output data and real-time load demand data of the urban distribution network; the information storage module stores the data detected by the state detection module; the switch contribution quantization module quantizes the contribution of each switch according to the photovoltaic absorption rate and load loss events, providing the data basis for distribution network dispatchers to screen the switches; the switch contribution evaluation module provides the method basis for dispatchers to screen the switches; the switch grouping module groups the screened switches and provides the basis for the subsequent two-stage reinforcement learning optimization; the multi-agent centralized training module trains the agents of the two stages centrally by sharing observations; and the single-agent decentralized execution module operates each trained agent independently, so that each agent can independently formulate its own reconstruction operation optimization strategy.
The above description is merely an embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any change or substitution that a person skilled in the art could easily conceive within the technical scope disclosed by the present invention shall be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A two-stage reinforcement learning method for the reconstruction operation of an urban power distribution network, characterized by comprising the following steps:
step one, constructing a dynamic reconstruction operation mathematical model of the urban power distribution network;
step two, quantifying the contribution degree of each switch according to photovoltaic absorption and power supply capability, screening the switches by contribution value, and omitting the screened-out low-contribution switches during reconstruction optimization, thereby achieving a large-scale dimensionality reduction of the switch action space;
step three, constructing a Weighted QMIX multi-agent reinforcement learning model to deepen the agents' perception of switch contribution, proposing a two-stage reinforcement learning framework, dividing the screened switches into two groups according to their contribution, and dividing the reinforcement learning training process into two stages accordingly;
and step four, training the Weighted QMIX multi-agent reinforcement learning model by adopting a method of centralized training and decentralized execution to obtain an optimal strategy for dynamic reconstruction operation of the urban power distribution network.
2. The two-stage reinforcement learning method for urban power distribution network reconstruction operation according to claim 1, wherein the objective function of the mathematical model of dynamic reconstruction operation of the urban power distribution network is:
$$\min F=\sum_{t=1}^{T}\left(c_{1}P_{t}^{\mathrm{cut}}+c_{2}P_{t}^{\mathrm{loss}}+c_{3}P_{t}^{\mathrm{net}}+c_{4}N_{t}^{\mathrm{sw}}\right)$$
wherein: $T$ is the number of hours in the distribution network reconstruction optimization cycle; $P_{t}^{\mathrm{cut}}$ is the curtailed photovoltaic energy at time $t$; $P_{t}^{\mathrm{loss}}$ is the load loss at time $t$; $P_{t}^{\mathrm{net}}$ is the network loss of the distribution network at time $t$; $N_{t}^{\mathrm{sw}}$ is the number of switching operations at time $t$; $c_{1}$, $c_{2}$, $c_{3}$ and $c_{4}$ are respectively the cost coefficients of photovoltaic curtailment, load loss, network loss and switch operation;
the curtailed photovoltaic energy, the load loss, the network loss and the number of switching operations are specifically calculated as follows:
the curtailed photovoltaic energy is calculated as:
$$P_{t}^{\mathrm{cut}}=\sum_{i\in\Omega_{\mathrm{PV}}}\left(P_{i,t}^{\mathrm{PV}}-P_{i,t}^{\mathrm{grid}}\right)$$
wherein: $\Omega_{\mathrm{PV}}$ represents the set of photovoltaic access nodes; $P_{i,t}^{\mathrm{PV}}$ represents the photovoltaic output power at node $i$ at time $t$; $P_{i,t}^{\mathrm{grid}}$ represents the photovoltaic power actually injected into the grid at time $t$;
the load loss is calculated as:
$$P_{t}^{\mathrm{loss}}=\sum_{i\in\Omega_{\mathrm{LS}}}\left(P_{i,t}^{\mathrm{L}}-P_{i,t}^{\mathrm{inj}}\right)$$
wherein: $\Omega_{\mathrm{LS}}$ represents the set of load-shedding nodes; $P_{i,t}^{\mathrm{L}}$ represents the predicted load power of node $i$ at time $t$; $P_{i,t}^{\mathrm{inj}}$ represents the power actually injected into node $i$ at time $t$;
the network loss is calculated as:
$$P_{t}^{\mathrm{net}}=\sum_{l\in\Omega_{\mathrm{B}}}I_{l,t}^{2}R_{l}$$
wherein: $\Omega_{\mathrm{B}}$ is the set of all branches of the distribution network; $I_{l,t}$ is the RMS value of the current in branch $l$ at time $t$; $R_{l}$ is the resistance of branch $l$;
the number of switching operations is calculated as:
$$N_{t}^{\mathrm{sw}}=\sum_{l\in\Omega_{\mathrm{SW}}}\left|\alpha_{l,t}-\alpha_{l,t-1}\right|$$
wherein: $\Omega_{\mathrm{SW}}$ is the set of all switch branches of the distribution network; $\alpha_{l,t}$ and $\alpha_{l,t-1}$ respectively represent the state of the switch on branch $l$ after the reconstruction of the distribution network at times $t$ and $t-1$, a value of 0 indicating that the switch on branch $l$ is open and a value of 1 indicating that it is closed.
3. The two-stage reinforcement learning method for reconstructing and running of an urban power distribution network according to claim 2, wherein constraint conditions of a dynamic reconstruction and running mathematical model of the urban power distribution network comprise a power flow constraint, a safe running constraint, a reconstruction constraint, a photovoltaic output constraint and a load loss constraint, and the calculation formula of the power flow constraint is as follows:
wherein:and->Respectively indicate->Time injection node->Active power and reactive power of (a); />Representation->Time node->Is a voltage of (2); admittance between adjacent nodes is +.>And->;/>Is the voltage phase angle difference;
the calculation formula of the safe operation constraint is as follows:
the calculation formula of the reconstruction constraint is as follows:
wherein:representing the total number of the branches which are always in a closed state and cannot be adjusted in the net rack; / >To express by +.>A branch terminal node set serving as an initial node; />Expressed as +.>A branch initial node set which is a terminal node; />Representing the total number of system nodes; />Representing the number of substations of the optimizing subject; the power distribution network containing distributed photovoltaic may have island operation under constraint, so that the power needs to be supplemented, and power is injected into non-substation nodes>The nodes are kept in a communication state through simplified tide constraint; />Representation->Time branch->Auxiliary tidal current active power;
the photovoltaic output constraint is:

$$\underline{P}_{i,t}^{\mathrm{pv}} \le P_{i,t}^{\mathrm{grid}} \le \overline{P}_{i,t}^{\mathrm{pv}}$$

wherein: $\overline{P}_{i,t}^{\mathrm{pv}}$ is the upper limit of the photovoltaic active output at node $i$ in period $t$; $\underline{P}_{i,t}^{\mathrm{pv}}$ is the lower limit of the photovoltaic active output at node $i$ in period $t$;
the load loss constraint is:

$$0 \le P_{i,t}^{\mathrm{fc}} - P_{i,t}^{\mathrm{inj}} \le \lambda_{i}\, P_{i,t}^{\mathrm{fc}}$$

wherein: $\lambda_{i}$ is the load-loss scaling factor of node $i$.
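As a sketch only, assuming the box-type limits and the radiality count reconstructed above (and leaving the auxiliary-flow connectivity check aside), a per-hour feasibility screen might look like this in Python; every name below is a hypothetical placeholder:

```python
import numpy as np

def feasible(sw_state_t, n_fixed_closed, n_node, n_sub,
             volt, v_min=0.95, v_max=1.05, i_branch=None, i_max=None):
    """Rough feasibility screen for one hour (claim 3 sketch).

    sw_state_t : 0/1 vector of adjustable switch states at time t.
    volt       : per-node voltage magnitudes in p.u. for the same hour.
    The radiality test implements the reconstructed branch-count equation;
    connectivity via auxiliary flows is omitted for brevity.
    """
    radial = sw_state_t.sum() + n_fixed_closed == n_node - n_sub
    v_ok = np.all((volt >= v_min) & (volt <= v_max))
    i_ok = True if i_branch is None else np.all(np.abs(i_branch) <= i_max)
    return bool(radial and v_ok and i_ok)
```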
4. The two-stage reinforcement learning method for urban power distribution network reconstruction operation according to claim 1, wherein the switch contribution degree is quantified by the following steps:
a Monte Carlo method is used to simulate $N$ source-load samples, in which the phenomena of photovoltaic curtailment and load shedding are emphasized by adding noise; the urban power distribution network dynamic reconstruction operation mathematical model then generates, from the $N$ source-load samples, the $N$ corresponding dynamic reconfiguration switch-action samples; the substation $m$ to which each switch belongs is taken as the environment for index acquisition;
for the dynamic reconfiguration switch-action samples, the quantification indexes of photovoltaic accommodation and power supply capacity improvement are calculated as:

$$\eta_{m,t}^{\mathrm{pv}} = \frac{\sum_{i\in\Phi_{m}^{\mathrm{pv}}} E_{i,t}^{\mathrm{pv}}}{\sum_{t=1}^{24}\sum_{i\in\Phi_{m}^{\mathrm{pv}}} E_{i,t}^{\mathrm{pv}}}, \qquad \eta_{m,t}^{\mathrm{load}} = \frac{\sum_{i\in\Phi_{m}^{\mathrm{load}}} E_{i,t}^{\mathrm{load}}}{\sum_{t=1}^{24}\sum_{i\in\Phi_{m}^{\mathrm{load}}} E_{i,t}^{\mathrm{load}}}$$

wherein: $\eta_{m,t}^{\mathrm{pv}}$ denotes the proportion of the photovoltaic accommodation of all photovoltaics in substation $m$ at time $t$ to the total accommodation, and $\eta_{m,t}^{\mathrm{load}}$ denotes the proportion of the power compensation of all load-shedding nodes in substation $m$ at time $t$ to the total compensation; $\Phi_{m}^{\mathrm{pv}}$ denotes all photovoltaic nodes belonging to substation $m$; $E_{i,t}^{\mathrm{pv}}$ denotes the photovoltaic accommodation of photovoltaic node $i$ at time $t$; $\Phi_{m}^{\mathrm{load}}$ denotes all load-shedding nodes belonging to substation $m$; $E_{i,t}^{\mathrm{load}}$ denotes the power compensation of load-shedding node $i$ at time $t$; both indexes are evaluated on each dynamic reconfiguration switch-action sample;
after $\eta_{m,t}^{\mathrm{pv}}$ and $\eta_{m,t}^{\mathrm{load}}$ are obtained, the 24-hour photovoltaic accommodation capacity and power supply capacity of each switch are quantified as:

$$C_{s,t}^{\mathrm{pv}} = \begin{cases}\eta_{m,t}^{\mathrm{pv}}, & \text{switch } s \text{ acts at time } t\\ 0, & \text{otherwise}\end{cases} \qquad C_{s,t}^{\mathrm{load}} = \begin{cases}\eta_{m,t}^{\mathrm{load}}, & \text{switch } s \text{ acts at time } t\\ 0, & \text{otherwise}\end{cases}$$

wherein: $C_{s,t}^{\mathrm{pv}}$ quantifies the photovoltaic accommodation capacity of switch $s$ at time $t$, and $C_{s,t}^{\mathrm{load}}$ quantifies its power supply capacity at time $t$; if switch $s$ does not act at time $t$, it is considered to contribute nothing to photovoltaic accommodation or power supply improvement at that moment; if switch $s$ acts at time $t$, it is considered to contribute to both;
after the values $C_{s,t}^{\mathrm{pv}}$ and $C_{s,t}^{\mathrm{load}}$ of switch $s$ at every time $t$ are obtained, they are accumulated and averaged to obtain the quantized photovoltaic accommodation capacity and power supply improvement capacity of the switch over the $N$ samples:

$$C_{s}^{\mathrm{pv}} = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{24} C_{s,t}^{\mathrm{pv},(n)}, \qquad C_{s}^{\mathrm{load}} = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{24} C_{s,t}^{\mathrm{load},(n)}, \qquad C_{s} = C_{s}^{\mathrm{pv}} + C_{s}^{\mathrm{load}}$$

wherein: $C_{s}^{\mathrm{pv}}$ and $C_{s}^{\mathrm{load}}$ denote the quantized photovoltaic accommodation capacity and power supply improvement capacity of switch $s$ based on the $N$ samples; $C_{s}$ denotes the final contribution degree quantization value of switch $s$;
after the contribution degree quantization values of all switches are obtained, the sectionalizing switches and the tie switches are evaluated and screened separately according to the following rule: the quantization values are sorted, with the 50% rank as the boundary; a switch ranked above 50% is judged to be an important switch, and a switch ranked below 50% is judged to be a non-important switch; in the reconfiguration optimization the non-important switches are ignored, that is, non-important sectionalizing switches are kept constantly closed and non-important tie switches constantly open, and only the action combinations of the important switches are taken as the optimization object.
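A minimal Python sketch of this quantification and screening step follows; the array shapes, the boolean `acted` encoding and the median-based split are assumptions made for illustration:

```python
import numpy as np

def switch_contribution(eta_pv, eta_load, acted):
    """Contribution quantization of claim 4 (a sketch with assumed shapes).

    eta_pv, eta_load : [n_sample, 24] hourly accommodation / compensation
                       shares of the substation each switch belongs to.
    acted            : [n_sample, 24, n_switch] boolean, True where the
                       switch changed state in that hour of that sample.
    Returns the per-switch contribution value C_s.
    """
    n = len(eta_pv)
    c_pv = (eta_pv[:, :, None] * acted).sum(axis=(0, 1)) / n
    c_load = (eta_load[:, :, None] * acted).sum(axis=(0, 1)) / n
    return c_pv + c_load

def screen_switches(c_s):
    """Keep the switches whose contribution ranks in the top 50%."""
    return np.flatnonzero(c_s > np.median(c_s))
```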
5. The two-stage reinforcement learning method for urban power distribution network reconstruction operation according to claim 1, wherein the Weighted QMIX multi-agent deep reinforcement learning model is computed as follows:
a weight function is introduced to adjust the contributions of different agents to the joint Q-value function;
for the $\max$ operator, a weight function $w(s,\mathbf{u})$ is added, yielding the weighted projection:

$$\Pi_{w}\hat{Q}^{*} := \arg\min_{Q_{tot}} \sum_{\mathbf{u}} w(s,\mathbf{u})\left(\hat{Q}^{*}(s,\mathbf{u}) - Q_{tot}(s,\mathbf{u})\right)^{2}$$

wherein the parameters $\theta$ of the monotonic mixing network $Q_{tot}$, the parameters $\phi$ of the unrestricted joint action-value network $\hat{Q}^{*}$ and the target $y$ are updated as follows:

(1) the loss function of $Q_{tot}$ is:

$$L(\theta) = \sum_{i=1}^{b} w(s_{i},\mathbf{u}_{i})\left(Q_{tot}(\boldsymbol{\tau}_{i},\mathbf{u}_{i},s_{i};\theta) - y_{i}\right)^{2}$$

(2) the loss function of $\hat{Q}^{*}$ is:

$$L(\phi) = \sum_{i=1}^{b}\left(\hat{Q}^{*}(\boldsymbol{\tau}_{i},\mathbf{u}_{i},s_{i};\phi) - y_{i}\right)^{2}$$

(3) the expression of $y_{i}$ is:

$$y_{i} = r_{i} + \gamma\,\hat{Q}^{*}\!\left(\boldsymbol{\tau}_{i}',\ \arg\max_{\mathbf{u}'} Q_{tot}(\boldsymbol{\tau}_{i}',\mathbf{u}',s_{i}';\theta^{-}),\ s_{i}';\ \phi^{-}\right)$$

where $b$ is the batch size, $\gamma$ is the discount factor, and $\theta^{-}$, $\phi^{-}$ are the target-network parameters;
by weighting the actions of each switch in the joint action space through the weight function, the characteristic that different switches have different contribution degrees is highlighted.
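For illustration, a minimal PyTorch-style sketch of the two weighted losses is given below. The tensor shapes, the function names and in particular the `action_weights` rule (a simple stand-in that keys the weight on the contribution of the switches toggled in each sample) are assumptions, not the patented implementation:

```python
import torch

def weighted_qmix_loss(q_tot, q_star, y, weights):
    """Weighted QMIX losses (claim 5 sketch).

    q_tot   : Q_tot(τ, u, s; θ) for the sampled joint actions, shape [b]
    q_star  : unrestricted joint estimate Q̂*(τ, u, s; φ), shape [b]
    y       : bootstrapped targets (treated as constants), shape [b]
    weights : w(s, u) per sample, shape [b]
    """
    loss_theta = (weights * (q_tot - y.detach()) ** 2).mean()
    loss_phi = ((q_star - y.detach()) ** 2).mean()
    return loss_theta, loss_phi

def action_weights(acted_switches, contribution, floor=0.5):
    """Assumed weighting rule: the mean contribution of the switches
    toggled in each sample, floored so that no sample is ignored.

    acted_switches : float 0/1 tensor, shape [b, n_switch]
    contribution   : per-switch contribution values C_s, shape [n_switch]
    """
    toggled = acted_switches.sum(-1).clamp(min=1)
    w = (acted_switches * contribution).sum(-1) / toggled
    return w.clamp(min=floor)
```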
6. The two-stage reinforcement learning method for urban power distribution network reconstruction operation according to claim 5, further comprising a multi-agent interaction model comprising an observation space, a state space, an action space, a reward function and a state transition probability; the observation space represents the state values each agent can observe from the environment, and the observation of agent $i$ at time $t$ is defined as:

$$o_{t}^{i} = \left\{P_{j,t},\ U_{j,t},\ I_{jk,t},\ \alpha_{jk,t},\ t\right\}$$

wherein: $P_{j,t}$, $U_{j,t}$, $I_{jk,t}$, $\alpha_{jk,t}$ and $t$ respectively represent the power and voltage of node $j$ at time $t$ in the substation $m$ to which the switch controlled by agent $i$ belongs, the current of branch $jk$, the on-off state of the switch on branch $jk$, and the time $t$;
the state space represents the union of all agent observation spaces, expressed as:

$$s_{t} = \left\{o_{t}^{1},\ o_{t}^{2},\ \ldots,\ o_{t}^{n}\right\}$$

wherein the dynamic reconfiguration problem of the distribution network is set to be a fully observable problem, namely $o_{t}^{i} = s_{t}$ for every agent $i$;
the action space represents the actions each agent takes after acquiring its observation at a certain time, and the action of agent $i$ at time $t$ is defined as:

$$a_{t}^{i} = \left\{\alpha_{1,t}^{i},\ \alpha_{2,t}^{i},\ \ldots,\ \alpha_{K,t}^{i}\right\}$$

wherein: $\alpha_{k,t}^{i}$ represents the state value at time $t$ of the $k$-th sectionalizing switch controlled by agent $i$;
the reward function represents the set of reward values obtained after the agents interact with the environment; for agent $i$ at time $t$ the reward function is defined as:

$$r_{t} = -\left(f_{t}^{\mathrm{pv}} + f_{t}^{\mathrm{load}} + f_{t}^{\mathrm{net}} + f_{t}^{\mathrm{sw}} + f_{t}^{\mathrm{pen}}\right)$$

wherein: $f_{t}^{\mathrm{pv}}$ and $f_{t}^{\mathrm{load}}$ respectively represent the two indexes mentioned in the second step, $f_{t}^{\mathrm{net}}$ and $f_{t}^{\mathrm{sw}}$ are the network loss index and the switch action cost index, and $f_{t}^{\mathrm{pen}}$ represents the out-of-limit penalty term;
the state transition probability $P\left(s_{t+1}\mid s_{t},\mathbf{a}_{t}\right)$ describes the environmental impact of the multi-agent actions, representing the probability that the agents, taking joint action $\mathbf{a}_{t}$ in state $s_{t}$, transition to state $s_{t+1}$; under the current policy $\pi$ the state transition probability is:

$$P^{\pi}\left(s_{t+1}\mid s_{t}\right) = \sum_{\mathbf{a}_{t}} \pi\left(\mathbf{a}_{t}\mid s_{t}\right) P\left(s_{t+1}\mid s_{t},\mathbf{a}_{t}\right)$$
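As a concrete illustration of the observation and state definitions above, a small Python sketch follows; the field names and the flat concatenation used for the full-observability state are assumptions:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AgentObservation:
    """Observation o_t^i of claim 6; field names are assumptions."""
    node_power: np.ndarray     # P_{j,t} for nodes of the agent's substation
    node_voltage: np.ndarray   # U_{j,t}
    branch_current: np.ndarray # I_{jk,t}
    switch_state: np.ndarray   # α_{jk,t}, 0 = open, 1 = closed
    hour: int                  # t

def joint_state(observations):
    """Full-observability assumption: the state is the union of the
    per-agent observations, here flattened into a single vector."""
    return np.concatenate([
        np.concatenate([o.node_power, o.node_voltage,
                        o.branch_current, o.switch_state, [o.hour]])
        for o in observations
    ])
```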
7. A learning system based on the two-stage reinforcement learning method for urban power distribution network reconstruction operation according to any one of claims 1 to 6, characterized by comprising:
The state detection module is used for detecting photovoltaic real-time output data and load real-time demand data of the urban power distribution network;
the information storage module is used for storing the data detected by the state detection module;
the switch contribution degree quantization module is used for quantizing the contribution degree of each switch according to the photovoltaic accommodation rate and load loss events, so as to provide a data basis for distribution network dispatchers to screen the switches;

the switch contribution degree evaluation module is used for providing distribution network dispatchers with a methodological basis for screening the switches;

the switch grouping module is used for grouping the screened switches, providing the basis for the subsequent two-stage reinforcement learning optimization;

the multi-agent centralized training module is used for centrally training the agents in two stages by sharing observation values;

and the single-agent decentralized execution module runs each trained agent independently, so that each agent can make its reconstruction operation optimization strategy on its own.
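To make the module pipeline concrete, a skeleton of how these components might be wired together is sketched below in Python; all class and method names are hypothetical placeholders, not the patented code:

```python
class ReconfigurationSystem:
    """Skeleton of the learning system of claim 7 (illustrative only)."""

    def __init__(self, detector, store, quantifier, evaluator,
                 grouper, trainer, executors):
        self.detector, self.store = detector, store
        self.quantifier, self.evaluator = quantifier, evaluator
        self.grouper, self.trainer = grouper, trainer
        self.executors = executors  # one trained agent per substation

    def run_offline(self):
        samples = self.store.load(self.detector.collect())  # detect + store
        c_s = self.quantifier.quantify(samples)             # per-switch C_s
        important = self.evaluator.screen(c_s)              # top-50% switches
        groups = self.grouper.split(important, c_s)         # two stage groups
        return self.trainer.train(groups)                   # centralized training

    def run_online(self, observations):
        # decentralized execution: each agent acts on its own observation
        return [ex.act(o) for ex, o in zip(self.executors, observations)]
```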
CN202311811778.6A 2023-12-27 2023-12-27 Two-stage reinforcement learning method and system for urban power distribution network reconstruction operation Active CN117748515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311811778.6A CN117748515B (en) 2023-12-27 2023-12-27 Two-stage reinforcement learning method and system for urban power distribution network reconstruction operation

Publications (2)

Publication Number Publication Date
CN117748515A true CN117748515A (en) 2024-03-22
CN117748515B CN117748515B (en) 2024-09-17

Family

ID=90277512


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170371306A1 (en) * 2016-06-27 2017-12-28 Ecole Polytechnique Federale De Lausanne (Epfl) System and Method for Dispatching an Operation of a Distribution Feeder with Heterogeneous Prosumers
CN110739719A (en) * 2019-08-30 2020-01-31 华北电力大学 Two-step decision method for optimized access of flexible multi-state switch
US11487273B1 (en) * 2021-04-30 2022-11-01 Dalian University Of Technology Distributed industrial energy operation optimization platform automatically constructing intelligent models and algorithms
CN115600741A (en) * 2022-10-18 2023-01-13 四川大学(Cn) Urban power distribution network multistage dynamic reconstruction method and system based on deep learning
CN116845859A (en) * 2023-05-26 2023-10-03 四川大学 Power distribution network two-stage dynamic reconfiguration operation method based on multi-agent reinforcement learning
CN117239764A (en) * 2023-08-08 2023-12-15 贵州电网有限责任公司 Power distribution network two-stage voltage control method based on multi-agent reinforcement learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant