CN113517684B - Method and system for establishing a parallel deep reinforcement learning model for power flow state adjustment
- Publication number
- CN113517684B (application CN202110286364.0A)
- Authority
- CN
- China
- Prior art keywords
- limit
- line
- power
- generator
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/008—Circuit arrangements for AC mains or AC distribution networks involving trading of energy or energy transmission rights
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/10—Power transmission or distribution systems management focussing at grid-level, e.g. load flow analysis, node profile computation, meshed network optimisation, active network management or spinning reserve management
- H02J2203/20—Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Landscapes
- Engineering & Computer Science (AREA)
- Power Engineering (AREA)
- Supply And Distribution Of Alternating Current (AREA)
Abstract
The application discloses a method and a system for establishing a parallel deep reinforcement learning model for power flow state adjustment. The method comprises the following steps: establishing the power flow state, actions, a policy, rewards, and returns to form a Markov decision process; locating adjustment targets, screening actionable equipment, and calculating the action amounts of the generators according to the power flow state, actions, policy, rewards, and returns, wherein the adjustment targets comprise different section sets; and establishing, according to the adjustment targets, the actionable equipment, and the generator action amounts, a parallel deep reinforcement learning model for power flow state adjustment that takes the N-1 static stability constraint into account.
Description
Technical Field
The application relates to the technical field of power systems, and in particular to a method and a system for establishing a parallel deep reinforcement learning model for power flow state adjustment.
Background
Power flow analysis and adjustment are the most basic tasks of power grid simulation analysis and provide a quantitative basis for judging the rationality, safety, reliability, and economy of grid operation and planning schemes. According to the results of power flow analysis or the requirements of subsequent simulation calculations, the parameters and structure of the grid power flow equations are modified and re-solved until a power flow that matches actual conditions or the requirements of subsequent simulation is obtained, thereby generating an operating mode. Adjusting the power flow state to satisfy the N-1 static stability constraint is one of the important steps. In actual work, this task mainly relies on manual adjustment of the power flow state, and the degree of automation is low. The general procedure is to first perform N-1 calculations on the initial power flow, observe the out-of-limit conditions of the power flow, and then adjust the generator outputs accordingly; however, in this process the direction of adjustment is unclear, the size of the adjustment is uncertain, eliminating one out-of-limit condition easily introduces another, and the adjustment efficiency is low.
Research on adjusting the power flow operating state falls broadly into two lines. On the one hand, the task is formulated as an optimal power flow problem: the region under consideration can be delimited by sensitivity analysis, N-1 security can be handled through current injections, and the unified power flow controller has been used to study the optimal power flow control of the power system. A mixed-integer linear programming model has been used to optimally determine a subset of phase-shifting transformers whose angles may be adjusted to minimize the total generation cost in an optimal power flow problem, and security-constrained optimal power flow has been accelerated through new contingency-screening models, making online application possible. On the other hand, practical active power flow adjustment methods based on sensitivity analysis of the DC power flow model have been proposed, including the concept of flow-relief sensitivity, used to install series capacitors on non-heavily-loaded lines so as to relieve the load imbalance between transmission lines. However, whether based on optimal power flow or on constructed indices, when applied to an actual large power grid these methods for adjusting the power flow operating state suffer from poor effectiveness and poor convergence because of the complex constraints and the large grid scale.
No effective solution has yet been proposed for this technical problem in the prior art: whether optimal power flow or constructed indices are used, existing methods for adjusting the power flow operating state perform poorly and converge poorly when applied to an actual large power grid, owing to the complex constraints and the large grid scale.
Disclosure of Invention
The embodiments of the present disclosure provide a method for establishing a parallel deep reinforcement learning model for power flow state adjustment, which at least solves the technical problem that existing methods for adjusting the power flow operating state, whether based on optimal power flow or on constructed indices, perform poorly and converge poorly when applied to an actual large power grid because of the complex constraints and the large grid scale.
According to one aspect of the disclosed embodiments, there is provided a method of establishing a parallel deep reinforcement learning model for power flow state adjustment, comprising: establishing the power flow state, actions, policy, rewards, and returns to form a Markov decision process; locating adjustment targets, screening actionable equipment, and calculating the action amounts of the generators according to the power flow state, actions, policy, rewards, and returns, wherein the adjustment targets comprise different section sets; and establishing, according to the adjustment targets, the actionable equipment, and the generator action amounts, a parallel deep reinforcement learning model for power flow state adjustment that considers the N-1 static stability constraint.
According to another aspect of the embodiments of the present disclosure, there is also provided a system for establishing a parallel deep reinforcement learning model for power flow state adjustment, comprising: a Markov forming module for establishing the power flow state, actions, policy, rewards, and returns to form a Markov decision process; an action amount calculating module for locating adjustment targets, screening actionable equipment, and calculating the generator action amounts according to the power flow state, actions, policy, rewards, and returns; and a model building module for establishing, according to the adjustment targets, the actionable equipment, and the generator action amounts, a parallel deep reinforcement learning model for power flow state adjustment that considers the N-1 static stability constraint.
In the invention, a Markov decision process for power flow adjustment is constructed from the process of adjusting the power flow to satisfy static stability. A power flow state adjustment strategy is then formulated based on locating the adjustment targets, screening the actionable equipment, and calculating the action amounts, and the adjustment process is accelerated through sensitivity, transfer factor, and load margin. A parallel deep reinforcement learning model is then established, actions are mapped to power flow adjustments, generator pairs are formed, and multi-section targets are adjusted in parallel. The action strategy of reinforcement learning and the deep learning network are improved, so that learning efficiency is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the present disclosure and, together with the description, serve to explain the present disclosure. In the drawings:
FIG. 1 is a flow diagram of a method of establishing a parallel deep reinforcement learning model for power flow state adjustment according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a method for automatically adjusting a power flow to satisfy the static stability constraint according to an embodiment of the disclosure;
FIG. 3 shows the flow and implementation of the adjustment strategy according to an embodiment of the present disclosure;
FIG. 4 shows the average number of adjustment steps and the proportion of samples satisfying the constraint during the iterative process of the 36-node system according to an embodiment of the present disclosure;
FIG. 5 shows the change in the cumulative total out-of-limit number at various load levels for the 36-node system according to an embodiment of the present disclosure;
FIG. 6 shows the change in the cumulative total out-of-limit number at various load levels for the Northeast China grid according to an embodiment of the present disclosure;
FIG. 7 shows experimental results under different hyper-parameters according to an embodiment of the disclosure;
FIG. 8 is a schematic diagram of a system for establishing a parallel deep reinforcement learning model for power flow state adjustment according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings; however, the present invention may be embodied in many different forms and is not limited to the embodiments described herein, which are provided to disclose the invention fully and completely and to fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to limit the invention. In the drawings, like elements/components are referred to by like reference numerals.
Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
According to a first aspect of the present embodiment, a method of establishing a parallel deep reinforcement learning model for power flow state adjustment is provided. Referring to FIG. 1, the method includes:
S102, establishing the power flow state, actions, policy, rewards, and returns to form a Markov decision process;
S104, locating adjustment targets, screening actionable equipment, and calculating the action amounts of the generators according to the power flow state, actions, policy, rewards, and returns, wherein the adjustment targets comprise different section sets;
S106, establishing a parallel deep reinforcement learning model for power flow state adjustment that considers the N-1 static stability constraint, according to the adjustment targets, the actionable equipment, and the action amounts of the generators.
Specifically, referring to FIG. 2, this embodiment provides a method for establishing a parallel deep reinforcement learning model for power flow state adjustment, which comprises forming a Markov decision process, formulating a power flow state adjustment strategy, and establishing a parallel deep reinforcement learning model. Forming the Markov decision process is the basis for applying reinforcement learning, formulating the power flow state adjustment strategy is the key to accelerating the power flow adjustment, and establishing the parallel deep reinforcement learning model is the core method of the power flow adjustment.
N-1 static stability means that after any single element (such as a generator, transmission line, or transformer) among the N elements of the power system is disconnected due to a fault, no accident such as loss of supply to users or voltage collapse is caused by overload tripping of other lines.
Forming the Markov decision process includes establishing the state, the actions, the policy, the rewards, and the returns.
The state is the power flow state, so the variables observed mainly comprise the active power of each line in the current power flow and the active output of each generator. The state space can be expressed as

$s = [P_{L1}, \ldots, P_{Ln_L}, P_{G1}, \ldots, P_{Gn_G}]$

where $P_{Li}$ and $P_{Gi}$ are the active power of the i-th line and the i-th generator, and $n_L$ and $n_G$ are the numbers of lines and generators, respectively.
In this work, the power flow is made to satisfy the constraints mainly by acting on the generators, so the action space can be expressed as

$A = [G_1, G_2, \ldots, G_{n_G}]$

where $G_i$ is the flag bit of the i-th generator.
The policy is the conditional probability distribution p of actions and can be expressed as the following formula. The policy serves as the optimization target for the actions and plays an important role in the reinforcement learning described later.
$\pi(a|s) = p(a|s)$
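By way of illustration only, the following minimal Python sketch shows one possible way to hold the MDP elements described above (a state vector of line and generator active powers, generator actions, and a tabular policy). All container and function names are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PowerFlowState:
    """State s: active power of every line and active output of every generator."""
    p_line: np.ndarray  # shape (n_L,), P_Li of each line
    p_gen: np.ndarray   # shape (n_G,), P_Gi of each generator

    def vector(self) -> np.ndarray:
        # s = [P_L1, ..., P_LnL, P_G1, ..., P_GnG]
        return np.concatenate([self.p_line, self.p_gen])

@dataclass
class Action:
    """Action on the generator flag bits, modelled here as one up/down pair
    (anticipating the matched generator pairs formed later in the method)."""
    gen_up: int    # generator whose output is increased
    gen_down: int  # generator whose output is decreased

def policy(prob_table: np.ndarray, state_id: int) -> np.ndarray:
    """pi(a|s) = p(a|s): conditional probability distribution over actions
    for a discretized state, stored in a simple lookup table."""
    return prob_table[state_id]
```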
During the adjustment process, both the current power flow and the N-1 power flows may have out-of-limit conditions. To guide the actions so that the power flow satisfies the static stability constraint, rewards are assigned to the out-of-limit conditions as follows.
a) Current power flow out-of-limit condition
If the current power flow goes out of limit during the adjustment process, a negative reward r is given, determined by the reward coefficient $\lambda_R$, the current power $P_{Li}$ and upper power limit $P_{Li}^{\max}$ of each line i, and the current voltage $U_i$ and lower voltage limit $U_i^{\min}$ of each node i.
b) N-1 power flow out-of-limit condition
The N-1 calculation yields N power flows, each missing one element. According to the out-of-limit conditions of these N power flows, the accumulated out-of-limit number of each line is counted, giving

$N_L = [N_L^1, N_L^2, \ldots, N_L^{n_L}]$

where $N_L^i$ is the accumulated out-of-limit number of the i-th line and $n_L$ is the total number of lines. The accumulated out-of-limit total is then

$N_{ZL} = \sum_{i=1}^{n_L} N_L^i$
To make the power flow satisfy the N-1 constraint, the accumulated out-of-limit total after each adjustment is counted into the reward function so as to reflect the current out-of-limit situation; this reward $r_1$ is a function of $N_j$ and $N_{init}$, the out-of-limit total after the j-th adjustment and the initial out-of-limit total, respectively.
Besides the accumulated out-of-limit total, the out-of-limit transfer situation of each line after each adjustment can also be used to represent the out-of-limit condition of the current state; this reward $r_2$ is a function of the out-of-limit number and the initial out-of-limit number of the j-th line, together with $n_{init}$ and $n_{add}$, the initial out-of-limit number and the currently added out-of-limit number, respectively.
In addition to the out-of-limit numbers above, how much power is out of limit must also be considered; the current out-of-limit degree is represented by a reward $r_3$ that compares the out-of-limit power and the number of out-of-limit lines after the j-th adjustment with the out-of-limit power and the number of out-of-limit lines in the initial state.
The return is the accumulation of rewards over time and can be expressed as

$G_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau}$

where $\gamma$ is the decay coefficient.
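The exact reward expressions are given by formulas not reproduced in this text, so the sketch below only illustrates, under assumed placeholder weightings, the counting and discounting logic described above (accumulated out-of-limit totals, out-of-limit power, and the discounted return); the helper names are hypothetical.

```python
import numpy as np

def cumulative_out_of_limit(n_per_line: np.ndarray) -> int:
    """N_ZL: accumulated out-of-limit total over all lines after one round
    of N-1 calculations (n_per_line holds N_L^i for each line)."""
    return int(n_per_line.sum())

def reward(n_now: int, n_init: int,
           p_over_now: np.ndarray, p_over_init: np.ndarray) -> float:
    """Illustrative reward combining two of the signals described above:
    the reduction of the accumulated out-of-limit total and the reduction of
    the out-of-limit power. The weighting is a placeholder, not the patent's."""
    r1 = (n_init - n_now) / max(n_init, 1)
    r3 = (p_over_init.sum() - p_over_now.sum()) / max(p_over_init.sum(), 1e-6)
    return r1 + r3

def discounted_return(rewards: list, gamma: float = 0.95) -> float:
    """G_t = sum_k gamma^k * r_{t+k}: accumulation of rewards over time."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```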
Referring to FIG. 3, formulating the power flow state adjustment strategy includes locating the adjustment targets, screening the actionable equipment, and calculating the action amounts.
To locate the adjustment targets, one N-1 calculation is performed, yielding the results of N power flow calculations. The out-of-limit number of each line is then counted according to its out-of-limit condition in the N power flows, and the lines are sorted from large to small by out-of-limit number to obtain the line ordering based on out-of-limit number, where $\chi_L(i)$ denotes the number of the i-th line and $\chi_N(i)$ the number of the i-th node.
To reduce the mutual interference between different adjustment objects during the adjustment process, out-of-limit lines with similar power supply areas and similar load areas are grouped into the same transmission section. Generally, after a branch is opened, the power flow tends to transfer to transmission sections with the same power supply area and the same load area. It is therefore necessary to identify the transmission sections among the out-of-limit lines and nodes.
To identify the power supply area and the load area of an out-of-limit line, the adjustment sensitivity is introduced, defined as

$s_{i,j} = \Delta P_{L,j} / \Delta P_{G,i}$

where $\Delta P_{G,i}$ is the active power change of generator i and $\Delta P_{L,j}$ is the active power change of line j. If $s_{i,j}$ is positive, generator i belongs to the power supply area of line j; if $s_{i,j}$ is negative, generator i belongs to the load area of line j. The power supply area $\kappa_{PS}$ and the load area $\kappa_{LD}$ of each line are thus formed, where $G_i$ denotes the i-th generator, $n_{PS}$ is the number of generators in the power supply area, and $n_{LD}$ is the number of generators in the load area.
It is then judged whether out-of-limit lines and nodes share similar power supply areas and load areas; if so, those lines and nodes form a weak transmission section, which can be expressed as

$\mu_C = \{\chi_L(1), \ldots, \chi_L(n_{CL}), \chi_N(1), \ldots, \chi_N(n_{CN})\}$

where $n_{CL}$ is the number of lines contained in the weak section and $n_{CN}$ is the number of nodes contained in the weak section.
During the adjustment process, the out-of-limit conditions of the lines and nodes keep changing, forming different section sets $\Omega_C$, where $n_C$ is the number of transmission sections in one adjustment process.
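As a rough illustration of the section-identification step, the sketch below classifies generators into supply and load regions by the sign of the adjustment sensitivity and then groups out-of-limit lines whose regions overlap into the same transmission section. The overlap threshold and all names are assumptions for illustration, not values from the patent.

```python
import numpy as np

def supply_and_load_regions(sens_row: np.ndarray, eps: float = 1e-3):
    """For one line j, split generators by the sign of s_ij = dP_Lj / dP_Gi:
    positive -> power supply area, negative -> load area."""
    supply = set(np.where(sens_row > eps)[0])
    load = set(np.where(sens_row < -eps)[0])
    return supply, load

def group_into_sections(out_lines, sens, similarity=0.6):
    """Group out-of-limit lines whose supply and load areas overlap strongly
    into the same (weak) transmission section. 'sens' has shape (n_G, n_L);
    'similarity' is an assumed Jaccard-style threshold."""
    sections = []
    for j in out_lines:
        sup_j, load_j = supply_and_load_regions(sens[:, j])
        placed = False
        for sec in sections:
            sim = (len(sup_j & sec["supply"]) / max(len(sup_j | sec["supply"]), 1) +
                   len(load_j & sec["load"]) / max(len(load_j | sec["load"]), 1)) / 2
            if sim >= similarity:
                sec["lines"].append(j)
                sec["supply"] |= sup_j
                sec["load"] |= load_j
                placed = True
                break
        if not placed:
            sections.append({"lines": [j], "supply": sup_j, "load": load_j})
    return sections
```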
When screening the actionable equipment, certain principles should be followed when adjusting the power flow of a given section, so as to avoid causing excessive new out-of-limit conditions elsewhere. The specific principles are:
(1) select equipment with a large influence on the current adjustment object;
(2) minimize the power flow changes on other lines, otherwise other lines will easily go out of limit;
(3) avoid acting on parts of the power flow with a high load degree.
Based on the above principles, in order to minimize the influence on the power flow of other lines when adjusting a given line, the power change that each generator produces on each line must be considered when screening the adjustable generators, i.e., the adjustment sensitivity is used. Meanwhile, when the power of a generator is adjusted, the power flow of the surrounding lines is affected considerably, and the loading of those lines may affect the out-of-limit situation after adjustment; therefore, the load margin of the lines connected to the operated generator must also be considered. The connection-line load margin refers to the minimum value of the power margin of the lines connected to a node, where $p_{Li}$ is the power angle difference of line i and $\Omega_L$ is the set of lines connected to the node.
According to the adjustment sensitivity and the connection-line load margin, the generators are preliminarily screened and sorted by their degree of influence on the given line; according to the sign of the adjustment sensitivity, a positive-regulation generator sequence $\chi_{G+}$ and a negative-regulation generator sequence $\chi_{G-}$ are obtained. The action equipment pairs $A_{match}$ are then formed from these two sequences.
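A minimal sketch of how the screening and pairing just described could be realized: generators are filtered by an assumed connection-line load-margin threshold, split by sensitivity sign, ranked by influence, and zipped into up/down pairs. The threshold and all function names are illustrative assumptions.

```python
import numpy as np

def rank_generators(sens_col: np.ndarray, load_margin: np.ndarray,
                    margin_min: float = 0.1):
    """For one target line: keep generators whose connecting lines still have
    enough load margin, then sort by |sensitivity| to obtain the positive- and
    negative-regulation sequences. margin_min is an assumed threshold."""
    idx = np.where(load_margin >= margin_min)[0]
    pos = sorted((i for i in idx if sens_col[i] > 0),
                 key=lambda i: -abs(sens_col[i]))
    neg = sorted((i for i in idx if sens_col[i] < 0),
                 key=lambda i: -abs(sens_col[i]))
    return pos, neg

def match_pairs(pos, neg):
    """A_match: pair the k-th positive-regulation generator with the k-th
    negative-regulation generator so that start-up equals shut-down."""
    return list(zip(pos, neg))
```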
To calculate the action amounts, the adjustment-amount target of the out-of-limit line must first be determined. This target is related to the power deficit of the disconnected line, and a transfer factor is commonly used to describe the relationship between the peak power fluctuation transferred to a line and that power deficit. If the power deficit of a disconnected line is $\Delta P_{L0}$ and the power fluctuation of the out-of-limit line is $\Delta P_{Lf}$, the transfer factor is

$\Delta P_{Lf} / \Delta P_{L0}$

According to the transfer factor, the adjustment-amount target $\Delta P_{LT}$ of the out-of-limit line can be obtained, where $P_{Lop}$ is the active power when the line is disconnected.
The action amount of the equipment is then calculated. When operating the generators, not only the adjustment sensitivity but also the power balance must be considered, i.e., the amount of power brought on line must equal the amount taken off line. The generator action amount can be calculated from the relation between the generator action amount and the line power adjustment amount, where $s_{i+}$ and $s_{i-}$ are the sensitivity matching pair of the target line, and $\Delta P_G$ and $\Delta P_L$ are the generator action amount and the line power adjustment amount, respectively.
Considering the power limits of the generators, the action amount of each generator is further limited by the available capacity, where $P_{Gi}$ and $P_{\max i}$ are the active power and the upper power limit of the i-th generator, and $P_{Gj}$ and $P_{\max j}$ are the active power and the upper power limit of the j-th generator, respectively.
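The following sketch illustrates, under simplifying assumptions, the action-amount calculation described above: a transfer factor, an adjustment target for the out-of-limit line, and a paired generator action bounded by the units' power limits. The formula for the adjustment target is an assumed simple form (the patent's exact expression is not reproduced in this text), and all names are hypothetical.

```python
def transfer_factor(delta_p_fluct: float, delta_p_shortage: float) -> float:
    """Peak power fluctuation on the out-of-limit line divided by the
    power deficit of the disconnected line."""
    return delta_p_fluct / delta_p_shortage

def line_adjustment_target(factor: float, p_line_open: float,
                           p_limit: float) -> float:
    """Assumed target Delta P_LT for the out-of-limit line: bring the
    transferred flow back under the line's limit."""
    return max(factor * p_line_open - p_limit, 0.0)

def generator_action(delta_p_line: float, s_up: float, s_down: float,
                     p_up: float, p_up_max: float,
                     p_down: float, p_down_min: float = 0.0) -> float:
    """Delta P_G for one matched generator pair: the up-regulated unit gains
    and the down-regulated unit loses the same amount (power balance), sized
    from the pair's combined sensitivity and clipped to the units' limits."""
    s_pair = abs(s_up - s_down)           # combined effect of the pair on the line
    dp = abs(delta_p_line) / s_pair if s_pair > 1e-6 else 0.0
    dp = min(dp, p_up_max - p_up)         # headroom of the up-regulated unit
    dp = min(dp, p_down - p_down_min)     # margin of the down-regulated unit
    return max(dp, 0.0)
```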
To build the parallel deep reinforcement learning model, an Actor and a Critic are constructed first. The Actor uses the policy function and is responsible for generating actions and interacting with the environment. The Critic uses the value function, evaluates the performance of the Actor, and guides the Actor's actions in the next stage. The parameter gradient of the Actor is

$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \hat{Q}(s_t^n, a_t^n)\,\nabla_\theta \log p_\theta(a_t^n \mid s_t^n)$

where $\hat{Q}$ is the action-value function, N is the number of samples, $T_n$ is the time period contained in an episode, and $p_\theta$ is the conditional probability under the policy parameter $\theta$.
According to the action equipment pairs $A_{match}$, a prior probability is assigned to each action.
The Critic is updated based on the squared error between the estimated Q value and the actual Q value, which serves as its loss.
to increase the guiding force of the feedback, a reference is added to the Q value to make the feedback positive or negative, which is usually a cost function V (s t ) The gradient becomes:
however, the above equation means that two networks are required to calculate Q and V, respectively, and Q is estimated as follows:
the loss of the Critic network becomes the square loss of the actual state value and the estimated state value as shown in the following equation.
The above Actor-Critic network is trained synchronously in multiple threads, forming a global net and a worker net in each thread. The global net and the worker nets have the same structure; the only difference is that the main network is not trained directly and is only used to store the parameters of the Actor-Critic structure. Before each run, a worker net obtains the parameters (w, b) from the global net through a pull operation; after the run, it returns its parameters to the global net through a push operation, which updates the parameters of the global net. The multithreaded operation improves computational efficiency on the one hand and breaks the correlation between samples on the other, which benefits convergence. In each iteration, each worker net trains on one section and acts on the generators with higher sensitivity to that section, thereby realizing parallel adjustment of multiple sections.
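The global-net/worker-net scheme can be sketched as follows, assuming PyTorch models with identical structure. One common way (an assumption here, not necessarily the patent's exact procedure) to realize the push step is to hand the worker's gradients to the global optimizer; interact_and_compute_loss is a hypothetical helper standing in for one section's environment interaction.

```python
import threading
import torch

def pull(global_model, worker_model):
    """Copy the latest global parameters (w, b) into the worker before acting."""
    worker_model.load_state_dict(global_model.state_dict())

def push(global_opt, global_model, worker_model, loss):
    """Back-propagate on the worker and apply its gradients to the global net."""
    worker_model.zero_grad()
    loss.backward()
    for g_param, w_param in zip(global_model.parameters(),
                                worker_model.parameters()):
        g_param.grad = w_param.grad   # hand the worker's gradients to the global net
    global_opt.step()

def worker(global_model, global_opt, make_model, section, lock, steps=100):
    """Each thread adjusts one transmission section; parameter pull and
    gradient push on the shared global net happen under a lock."""
    local = make_model()
    for _ in range(steps):
        with lock:
            pull(global_model, local)
        # interact_and_compute_loss is a hypothetical helper: it runs one
        # adjustment episode for this section and returns the A2C loss.
        loss = interact_and_compute_loss(local, section)
        with lock:
            push(global_opt, global_model, local, loss)

# Example wiring (sections, make_model, global_model, global_opt assumed to exist):
# lock = threading.Lock()
# threads = [threading.Thread(target=worker,
#                             args=(global_model, global_opt, make_model, sec, lock))
#            for sec in sections]
# for t in threads: t.start()
```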
In this embodiment, verification is first performed on a small example. The method of the invention is applied to it: action equipment is selected through sensitivity and power adjustment is performed with deep reinforcement learning, so that the samples are adjusted to satisfy the constraints. The test results verify the effectiveness of the invention.
Further, the small example is the New England 39-bus standard system. Based on the system's initial converged power flow, the generators and loads are randomly varied between 0 and 2 times their original values, and the switching states of the capacitors and reactors are changed, generating 5000 groups of data; 4000 groups are used as the training set and 1000 groups as the test set. For the actual large grid, the Northeast China grid is adopted, and the data generation and distribution methods are consistent with the foregoing.
Further, as shown in FIG. 4, the average number of adjustment steps and the proportion of samples satisfying the constraint can be observed during the iterative process of the 36-node system: the average number of steps needed to adjust the power flow to satisfy the N-1 constraint decreases gradually and stabilizes at 5 to 6 steps after 60 iterations. Meanwhile, the proportion of samples satisfying the constraint keeps increasing, converges after 60 iterations, and finally approaches 98%.
Further, referring to FIG. 5, when the load level is 100%, 80%, or 140%, the cumulative out-of-limit total can be gradually reduced to 0, that is, all samples can be adjusted to satisfy the constraint. However, when the load increases to 180%, the cumulative out-of-limit total stabilizes at about 55 and cannot be reduced to 0, i.e., the power flow cannot be adjusted to satisfy the N-1 constraint. Thus, for the 39-node system, provided the load level is not too high, the power flow can be adjusted to satisfy the N-1 constraint within 10 steps.
In the embodiment of the invention, verification is also performed on the Northeast China grid. Using the method, action equipment is selected through sensitivity and power adjustment is performed with deep reinforcement learning, so that the samples are adjusted to satisfy the constraints.
Further, as shown in FIG. 6, when the load level is 100%, 80%, 120%, 140%, or 160%, the cumulative out-of-limit total of the Northeast China grid can be gradually reduced to 0 within 50 steps, i.e., the power flow can be adjusted to satisfy the N-1 constraint. However, when the load is 200% of the original power flow, the cumulative out-of-limit total stabilizes at about 200 and cannot be adjusted to satisfy the constraint. Thus, for this actual grid, provided the load level is not too high, the power flow can be adjusted to satisfy the N-1 constraint within 50 steps.
Further, referring to FIG. 7, the influence of the reward coefficient, the batch size, the update interval d, and the number of layers of the fully connected network on the adjustment result is shown. The reward coefficient acts directly on the magnitude of the reinforcement learning reward. It can be observed that, for both the 39-node system and the actual grid system, the constraint-satisfaction ratio rises continuously as the coefficient increases from 10 to 50 and drops slightly at 60; the reward coefficient should therefore be set to 50. For the batch size, a value that is too small is equivalent to online learning, and the lack of long-term samples makes the value-function fit inaccurate, degrading performance; a value that is too large slows the weight-update rate and makes the optimization process very lengthy. A batch size of 30 works best for the 39-node system, and 40 works best for the actual grid system. The update interval d determines the update frequency of the target network: updating the Actor network and the Critic asynchronously reduces the accumulated error and the variance, but too large a d makes the updates too slow to learn the latest actions. As seen in the figure, training is best when d is 2 and 3 for the two systems, respectively. For the number of layers of the fully connected network, too few layers easily cause underfitting and too many easily cause overfitting, both of which degrade training; training is best when the two systems use 3 and 4 layers, respectively. In summary, the parameters have a large influence on the results when applying such artificial intelligence algorithms, so configuring them reasonably is critical.
In this way, indices such as sensitivity, transfer factor, and load margin are used to evaluate the adjustment targets, the optimal action objects are learned through deep reinforcement learning, the equipment related to the power flow limit violations is quickly located, and multiple sections are adjusted in parallel, realizing rapid adjustment of the power flow state.
During the power flow adjustment, the weak parts of the system can be located and the matched action equipment pairs can be learned through deep reinforcement learning, so that multiple sections are adjusted simultaneously and the power flow is adjusted efficiently.
In summary, a Markov decision process for power flow adjustment is constructed from the process of adjusting the power flow to satisfy static stability. A power flow state adjustment strategy is then formulated based on locating the adjustment targets, screening the actionable equipment, and calculating the action amounts, and the adjustment process is accelerated through sensitivity, transfer factor, and load margin. A parallel deep reinforcement learning model is then established, actions are mapped to power flow adjustments, generator pairs are formed, and multi-section targets are adjusted in parallel. The action strategy of reinforcement learning and the deep learning network are improved, so that learning efficiency is improved.
Optionally, establishing the power flow state, actions, policy, rewards, and returns to form a Markov decision process includes: determining, according to the active power of each line in the current power flow and the active output of each generator, the power flow state space as

$s = [P_{L1}, \ldots, P_{Ln_L}, P_{G1}, \ldots, P_{Gn_G}]$

where s is the power flow state space, $P_{Li}$ and $P_{Gi}$ are the active power of the i-th line and the i-th generator, and $n_L$ and $n_G$ are the numbers of lines and generators; making the power flow satisfy the constraints by acting on the generators, and determining the action space as

$A = [G_1, \ldots, G_{n_G}]$

where A is the action space and $G_i$ is the flag bit of the i-th generator; and determining the policy, which is the conditional probability distribution p of actions, according to

$\pi(a|s) = p(a|s)$

where $\pi$ is the policy and a is the action.
Optionally, establishing the power flow state, actions, policy, rewards, and returns to form a Markov decision process further includes: in the case that the current power flow is out of limit, determining the current power flow out-of-limit reward r, where $\lambda_R$ is the reward coefficient, $P_{Li}$ and $P_{Li}^{\max}$ are the current power and the upper power limit of the i-th line, respectively, and $U_i$ and $U_i^{\min}$ are the current voltage and the lower voltage limit of the i-th node, respectively.
Optionally, establishing the power flow state, actions, policy, rewards, and returns to form a Markov decision process further includes: in the case of N-1 power flow limit violations, determining that the N-1 calculation yields N power flows, each missing one element, and counting the accumulated out-of-limit number of each line according to the out-of-limit conditions of the N power flows,

$N_L = [N_L^1, \ldots, N_L^{n_L}]$

where $N_L$ is the vector of accumulated out-of-limit numbers and $N_L^i$ is the accumulated out-of-limit number of the i-th line; determining the accumulated out-of-limit total from the per-line counts and the total number of lines,

$N_{ZL} = \sum_{i=1}^{n_L} N_L^i$

where $N_{ZL}$ is the accumulated out-of-limit total and $n_L$ is the total number of lines; to make the power flow satisfy the N-1 constraint, counting the accumulated out-of-limit total after each adjustment into the reward so as to reflect the current out-of-limit situation, and determining the accumulated out-of-limit total reward $r_1$, where $N_j$ and $N_{init}$ are the out-of-limit total after the j-th adjustment and the initial out-of-limit total, respectively; counting the out-of-limit transfer situation of each line after each adjustment into the reward so as to reflect the out-of-limit condition of the current state, and determining the out-of-limit transfer reward $r_2$, where the out-of-limit number and the initial out-of-limit number of the j-th line are used together with $n_{init}$ and $n_{add}$, the initial out-of-limit number and the currently added out-of-limit number, respectively; counting the out-of-limit power into the reward so as to reflect the current out-of-limit degree, and determining the out-of-limit power reward $r_3$, which compares the out-of-limit power and the number of out-of-limit lines after the j-th adjustment with those of the initial state; and determining the N-1 power flow out-of-limit reward according to the accumulated out-of-limit total reward, the out-of-limit transfer reward, and the out-of-limit power reward.
Optionally, establishing the power flow state, actions, policy, rewards, and returns to form a Markov decision process further includes: determining the return from the rewards accumulated over time,

$G_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau}$

where G is the return, t is the time, $\gamma$ is the decay coefficient, and $\tau$ is the time index within the accumulation.
Optionally, locating the adjustment targets according to the Markov decision process includes: after one N-1 calculation, counting the out-of-limit number of each out-of-limit line according to its out-of-limit condition in the N power flows, and sorting the out-of-limit lines from large to small by out-of-limit number to obtain the line ordering $\chi_L$ based on out-of-limit number, where $\chi_L(i)$ is the number of the i-th line; determining the adjustment sensitivity from the active power change of the generator and the active power change of the line,

$s_{i,j} = \Delta P_{L,j} / \Delta P_{G,i}$

where $\Delta P_{G,i}$ is the active power change of generator i and $\Delta P_{L,j}$ is the active power change of line j; if $s_{i,j}$ is positive, generator i belongs to the power supply area of line j, and if $s_{i,j}$ is negative, generator i belongs to the load area of line j, thereby determining the power supply area and the load area of each line, where $\kappa_{PS}$ is the power supply area, $\kappa_{LD}$ is the load area, $G_i$ denotes the i-th generator, $n_{PS}$ is the number of generators in the power supply area, and $n_{LD}$ is the number of generators in the load area; judging whether out-of-limit lines and nodes share similar power supply areas and load areas and, if so, determining that they form a weak transmission section

$\mu_C = \{\chi_L(1), \ldots, \chi_L(n_{CL}), \chi_N(1), \ldots, \chi_N(n_{CN})\}$

where $\mu_C$ is the weak transmission section, $n_{CL}$ is the number of lines contained in the weak section, and $n_{CN}$ is the number of nodes contained in the weak section; and, since the out-of-limit conditions of the lines and nodes keep changing during the adjustment process, forming different section sets $\Omega_C$ from the out-of-limit lines and nodes, where $n_C$ is the number of transmission sections in one adjustment process.
Optionally, screening the actionable equipment according to the Markov decision process includes: when adjusting the power of a generator, determining the connection-line load margin of the generator, the connection-line load margin being the minimum value of the power margin of the lines connected to the node, where $d_L$ is the connection-line load margin of the generator, $p_{Li}$ is the power angle difference of line i, and $\Omega_L$ is the set of lines connected to the node; sorting the generators according to the adjustment sensitivity and the connection-line load margin, and determining the positive-regulation generator sequence $\chi_{G+} = [\chi_{G+}(1), \chi_{G+}(2), \ldots, \chi_{G+}(n_{G+})]$ and the negative-regulation generator sequence $\chi_{G-} = [\chi_{G-}(1), \chi_{G-}(2), \ldots, \chi_{G-}(n_{G-})]$; and forming the action equipment pairs $A_{match}$ from the positive-regulation and negative-regulation generator sequences.
Optionally, calculating the action amounts of the generators according to the Markov decision process includes: determining the transfer factor from the power deficit of the disconnected line and the power fluctuation of the out-of-limit line, where $\Delta P_{L0}$ is the line power deficit and $\Delta P_{Lf}$ is the power fluctuation of the out-of-limit line; determining, according to the transfer factor, the adjustment-amount target $\Delta P_{LT}$ of the out-of-limit line, where $P_{Lop}$ is the active power when the line is disconnected; determining the line power adjustment amount from the adjustment-amount target of the out-of-limit line; determining the initial action amount of the generators from the relation between the line power adjustment amount and the generator action amount, where $s_{i+}$ and $s_{i-}$ are the sensitivity matching pair of the target line, and $\Delta P_G$ and $\Delta P_L$ are the initial generator action amount and the line power adjustment amount, respectively; and, considering the power limits of the generators, determining the generator action amount from the initial action amount, where $P_{Gi}$ and $P_{\max i}$ are the active power and the upper power limit of the i-th generator, and $P_{Gj}$ and $P_{\max j}$ are the active power and the upper power limit of the j-th generator, respectively.
Optionally, establishing the parallel deep reinforcement learning model according to the adjustment targets, the actionable equipment, and the generator action amounts includes: constructing an Actor model, which generates actions and interacts with the environment through the policy function; constructing a Critic model, which evaluates the Actor with the value function and guides the Actor's actions in the next stage; determining the parameter gradient of the Actor model as

$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \hat{Q}(s_t^n, a_t^n)\,\nabla_\theta \log p_\theta(a_t^n \mid s_t^n)$

where $\hat{Q}$ is the action-value function, N is the number of samples, $T_n$ is the time period contained in an episode, $p_\theta$ is the conditional probability under the policy parameter $\theta$, $s_t^n$ is the current state, and $a_t^n$ is the action; determining the loss function of the Critic model as

$loss = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \bigl(r_t^n + \gamma V_\pi(s_{t+1}^n) - V_\pi(s_t^n)\bigr)^2$

where loss is the loss function, N is the number of samples, $V_\pi$ is the value function, $T_n$ is the time period contained in an episode, $s_{t+1}^n$ is the state at the next moment, and $r_t^n$ is the reward; and determining the Actor-Critic network from the Actor model and the Critic model, and training the Actor-Critic network synchronously in multiple threads.
In accordance with another aspect of the present embodiment, a system 800 for establishing a parallel deep reinforcement learning model for power flow state adjustment is provided. Referring to FIG. 8, the system 800 includes: a Markov forming module 810 for establishing the power flow state, actions, policy, rewards, and returns to form a Markov decision process; an action amount calculating module 820 for locating adjustment targets, screening actionable equipment, and calculating the generator action amounts according to the power flow state, actions, policy, rewards, and returns; and a model building module 830 for establishing a parallel deep reinforcement learning model for power flow state adjustment that considers the N-1 static stability constraint, according to the adjustment targets, the actionable equipment, and the generator action amounts.
Optionally, the Markov forming module 810 includes: a power flow state determining sub-module for determining, according to the active power of each line in the current power flow and the active output of each generator, the power flow state space as

$s = [P_{L1}, \ldots, P_{Ln_L}, P_{G1}, \ldots, P_{Gn_G}]$

where s is the power flow state space, $P_{Li}$ and $P_{Gi}$ are the active power of the i-th line and the i-th generator, and $n_L$ and $n_G$ are the numbers of lines and generators; an action determining sub-module for making the power flow satisfy the constraints by acting on the generators and determining the action space as

$A = [G_1, \ldots, G_{n_G}]$

where A is the action space and $G_i$ is the flag bit of the i-th generator; and a policy determining sub-module for determining the policy, which is the conditional probability distribution p of actions, according to

$\pi(a|s) = p(a|s)$

where $\pi$ is the policy and a is the action.
Optionally, the Markov forming module 810 further includes: a reward determining sub-module for determining the current power flow out-of-limit reward r, where $\lambda_R$ is the reward coefficient, $P_{Li}$ and $P_{Li}^{\max}$ are the current power and the upper power limit of the i-th line, respectively, and $U_i$ and $U_i^{\min}$ are the current voltage and the lower voltage limit of the i-th node, respectively; an accumulated out-of-limit counting sub-module for determining that the N-1 calculation yields N power flows, each missing one element, and counting the accumulated out-of-limit number of each line according to the out-of-limit conditions of the N power flows,

$N_L = [N_L^1, \ldots, N_L^{n_L}]$

where $N_L^i$ is the accumulated out-of-limit number of the i-th line; an accumulated out-of-limit total determining sub-module for determining the accumulated out-of-limit total from the per-line counts and the total number of lines,

$N_{ZL} = \sum_{i=1}^{n_L} N_L^i$

where $N_{ZL}$ is the accumulated out-of-limit total and $n_L$ is the total number of lines; an accumulated out-of-limit total reward sub-module for counting, so that the power flow satisfies the N-1 constraint, the accumulated out-of-limit total after each adjustment into the reward and determining the accumulated out-of-limit total reward $r_1$, where $N_j$ and $N_{init}$ are the out-of-limit total after the j-th adjustment and the initial out-of-limit total, respectively; an out-of-limit transfer reward sub-module for counting the out-of-limit transfer situation of each line after each adjustment into the reward and determining the out-of-limit transfer reward $r_2$, where the out-of-limit number and the initial out-of-limit number of the j-th line are used together with $n_{init}$ and $n_{add}$, the initial out-of-limit number and the currently added out-of-limit number, respectively; an out-of-limit power reward sub-module for counting the out-of-limit power into the reward, reflecting the current out-of-limit degree, and determining the out-of-limit power reward $r_3$, which compares the out-of-limit power and the number of out-of-limit lines after the j-th adjustment with those of the initial state; an N-1 power flow out-of-limit reward sub-module for determining the N-1 power flow out-of-limit reward according to the accumulated out-of-limit total reward, the out-of-limit transfer reward, and the out-of-limit power reward; and a return determining sub-module for determining the return from the rewards accumulated over time,

$G_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau}$

where G is the return, t is the time, and $\gamma$ is the decay coefficient.
Optionally, the action amount calculating module 820 includes: an out-of-limit line sorting sub-module for counting, after one N-1 calculation, the out-of-limit number of each out-of-limit line according to its out-of-limit condition in the N power flows and sorting the out-of-limit lines from large to small by out-of-limit number to obtain the line ordering $\chi_L$, where $\chi_L(i)$ is the number of the i-th line; an adjustment sensitivity determining sub-module for determining the adjustment sensitivity from the active power change of the generator and the active power change of the line,

$s_{i,j} = \Delta P_{L,j} / \Delta P_{G,i}$

where $\Delta P_{G,i}$ is the active power change of generator i and $\Delta P_{L,j}$ is the active power change of line j; a supply and load area determining sub-module for assigning generator i to the power supply area of line j if $s_{i,j}$ is positive and to the load area of line j if $s_{i,j}$ is negative, thereby determining the power supply area $\kappa_{PS}$ and the load area $\kappa_{LD}$ of each line, where $G_i$ denotes the i-th generator, $n_{PS}$ is the number of generators in the power supply area, and $n_{LD}$ is the number of generators in the load area; a weak transmission section determining sub-module for judging whether out-of-limit lines and nodes share similar power supply areas and load areas and, if so, determining that they form a weak transmission section

$\mu_C = \{\chi_L(1), \ldots, \chi_L(n_{CL}), \chi_N(1), \ldots, \chi_N(n_{CN})\}$

where $\mu_C$ is the weak transmission section, $n_{CL}$ is the number of lines contained in the weak section, and $n_{CN}$ is the number of nodes contained in the weak section; and a section set forming sub-module for forming, as the out-of-limit conditions of the lines and nodes keep changing during the adjustment process, different section sets $\Omega_C$ from the out-of-limit lines and nodes, where $n_C$ is the number of transmission sections in one adjustment process.
Optionally, the action amount calculating module 820 further includes: a connection-line load margin determining sub-module for determining, when the power of a generator is adjusted, the connection-line load margin of the generator, the connection-line load margin being the minimum value of the power margin of the lines connected to the node, where $d_L$ is the connection-line load margin of the generator, $p_{Li}$ is the power angle difference of line i, and $\Omega_L$ is the set of lines connected to the node; a generator sorting sub-module for sorting the generators according to the adjustment sensitivity and the connection-line load margin and determining the positive-regulation generator sequence $\chi_{G+} = [\chi_{G+}(1), \ldots, \chi_{G+}(n_{G+})]$ and the negative-regulation generator sequence $\chi_{G-} = [\chi_{G-}(1), \ldots, \chi_{G-}(n_{G-})]$; and an action equipment pair forming sub-module for forming the action equipment pairs $A_{match}$ from the positive-regulation and negative-regulation generator sequences.
Optionally, the calculate action amount module 820 includes: the transfer factor determining sub-module, which is used for determining a transfer factor according to the power shortage of the line and the power fluctuation of the out-of-limit line:
wherein the transfer factor is determined from ΔP_L0, the line power shortage, and ΔP_Lf, the power fluctuation of the out-of-limit line;
the out-of-limit line adjustment amount determining sub-module is used for determining the adjustment amount of the out-of-limit line according to the transfer factor:
wherein ΔP_LT is the adjustment amount target of the out-of-limit line and P_Lop is the active power when the line is disconnected;
the line power adjustment amount determining sub-module is used for determining the line power adjustment amount according to the adjustment amount target of the out-of-limit line;
the initial action amount determining sub-module is used for determining the initial action amount of the generator according to the relation between the line power adjustment amount and the action amount of the generator:
wherein the sensitivity matching pair of the target line relates ΔP_G, the initial action amount of the generator, to ΔP_L, the line power adjustment amount;
the generator action amount determining sub-module is used for determining the generator action amount from the initial action amount while taking the power limits of the generators into account:
wherein P_Gi and P_maxi are the active power and the upper power limit of the i-th generator, and P_Gj and P_maxj are the active power and the upper power limit of the j-th generator.
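Turning a line adjustment target into a limited generator action could look like the sketch below; keeping the paired generators in balance and clipping the down-regulated unit at zero output are assumptions, while the division by the sensitivity and the upper-limit check follow the description:

```python
def generator_actions(dP_line_target, s_i, P_Gi, P_maxi, P_Gj, P_maxj):
    """dP_line_target: adjustment amount target of the out-of-limit line (ΔP_LT).
    s_i: adjustment sensitivity of the up-regulated generator i to that line.
    Generator i is raised by ΔP_G = ΔP_L / s and its paired generator j is lowered
    by the same amount so that total generation stays balanced; both moves are
    clipped to the generators' power limits."""
    dP_i = dP_line_target / s_i        # initial action amount of generator i
    dP_i = min(dP_i, P_maxi - P_Gi)    # respect the upper power limit of generator i
    dP_j = -min(dP_i, P_Gj)            # generator j cannot be driven below zero output
    return dP_i, dP_j
```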
Optionally, the modeling module 830 includes: the Actor model constructing sub-module, which is used for constructing an Actor model that generates actions and interacts with the environment through a policy function; the Critic model constructing sub-module, which is used for constructing a Critic model that evaluates the Actor model with a value function and guides the action of the Actor model in the next stage; the parameter gradient determining sub-module, which is used for determining the parameter gradient of the Actor model, wherein the gradient is expressed in terms of the action value function, the number of samples N, the number of time steps T_n contained in an episode, the conditional probability p_θ under policy parameter θ, the current state and the action;
the loss function determining sub-module is used for determining the loss function of the Critic model, wherein loss is the loss function, N is the number of samples, V_π is the value function, T_n is the number of time steps contained in an episode, and the remaining quantities are the state at the next moment and the reward;
and the synchronous training sub-module is used for determining an Actor-Critic network according to the Actor model and the Critic model, and training the Actor-Critic network synchronously in a plurality of threads.
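A compact PyTorch-style sketch of such an Actor-Critic update is shown below; the network sizes, the one-step TD advantage and the use of torch are assumptions, and the embodiment's exact gradient and loss formulas are not reproduced here:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with a policy (Actor) head and a value (Critic) head."""
    def __init__(self, n_state, n_action, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_state, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_action)    # policy head, p_θ(a|s)
        self.critic = nn.Linear(hidden, 1)          # value head, V_π(s)

    def forward(self, s):
        h = self.body(s)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)

def ac_losses(model, states, actions, rewards, next_states, gamma=0.99):
    """One-step Actor-Critic losses for a batch of transitions.
    Actor loss: policy-gradient term weighted by the TD advantage.
    Critic loss: squared TD error r + γ·V(s') - V(s)."""
    dist, v = model(states)
    with torch.no_grad():
        _, v_next = model(next_states)
        target = rewards + gamma * v_next.squeeze(-1)
    advantage = target - v.squeeze(-1)
    actor_loss = -(dist.log_prob(actions) * advantage.detach()).mean()
    critic_loss = advantage.pow(2).mean()
    return actor_loss, critic_loss
```

Training synchronously in several threads would run several such workers against shared parameters, in the spirit of A3C; that machinery is omitted here.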
The system 800 for establishing a parallel deep reinforcement learning model for tide state adjustment according to this embodiment of the present application corresponds to the method for establishing a parallel deep reinforcement learning model for tide state adjustment according to another embodiment of the present application, and is not described again herein.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiments of the application can be realized by adopting various computer languages, such as the object-oriented programming language Java, the scripting language JavaScript and the like.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (12)
1. A method for establishing a parallel deep reinforcement learning model for tide state adjustment, comprising:
establishing a tide state, an action, a policy, a reward and a return to form a Markov decision process;
positioning an adjustment target, screening actionable equipment and calculating the action amount of a generator according to the tide state, the action, the policy, the reward and the return, wherein the adjustment target comprises different section sets;
establishing, according to the adjustment target, the actionable equipment and the action amount of the generator, a parallel deep reinforcement learning model for tide state adjustment that considers the N-1 static stability constraint;
wherein establishing the tide state, action, policy, reward and return to form a Markov decision process comprises:
determining, according to the current active power of each line and of each generator, the tide state space as:
s = {P_{L1}, …, P_{L,n_L}, P_{G1}, …, P_{G,n_G}}
wherein s is the tide state space, P_Li and P_Gi are the active power of the i-th line and of the i-th generator respectively, n_L is the number of lines, and n_G is the number of generators;
enabling the power flow to meet the constraints through generator actions, and determining the action space as:
A = {G_1, G_2, …, G_{n_G}}
wherein A is the action space and G_i is the marker bit of the i-th generator;
determining a policy, which is the conditional probability distribution p of an action, according to the following formula:
π(a|s) = p(a|s)
wherein π is the policy and a is the action;
establishing a tide state, an action, a policy, a reward and a return to form a Markov decision process further comprises:
under the condition that the current power flow is out of limit, determining the current power flow out-of-limit reward as:
wherein r is the current power flow out-of-limit reward, λ_R is the reward factor, and the remaining quantities are the current power and the upper power limit of the i-th line and the current voltage and the lower voltage limit of the i-th node, respectively;
establishing a tide state, an action, a policy, a reward and a return to form a Markov decision process further comprises:
determining the return according to the rewards accumulated over time:
wherein G is the return, t is the time, γ is the decay factor, and τ is a time period.
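By way of non-limiting illustration, the state, action and return of claim 1 map onto code roughly as follows; the array layout of the state and the finite-horizon accumulation are assumptions, since the claim gives the expressions only symbolically:

```python
import numpy as np

def make_state(P_line, P_gen):
    """Tide (power flow) state s: line active powers followed by generator active powers."""
    return np.concatenate([P_line, P_gen])

def discounted_return(rewards, gamma=0.95):
    """Return accumulated from rewards over time, G_t = Σ_τ γ^τ · r_{t+τ}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))
```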
2. The method of claim 1, wherein establishing the tide state, action, policy, reward and return to form a Markov decision process further comprises:
under the condition that the N-1 power flow is out of limit, obtaining N power flows each missing one element after the N-1 calculation, and counting the cumulative out-of-limit number of each line according to the out-of-limit conditions of the N power flows:
wherein N_L collects the cumulative out-of-limit numbers of the lines, its i-th entry being the cumulative out-of-limit number of the i-th line;
determining the total cumulative out-of-limit number according to the cumulative out-of-limit number of each line and the total number of lines:
wherein N_ZL is the total cumulative out-of-limit number and n_L is the total number of lines;
counting the total cumulative out-of-limit number after each adjustment into the reward, so that the power flow is driven to meet the N-1 constraint and the current out-of-limit condition is reflected, and determining the cumulative out-of-limit total reward as:
wherein r_1 is the cumulative out-of-limit total reward, computed from the out-of-limit number after the j-th adjustment and the initial out-of-limit number N_init;
counting the out-of-limit transfer condition of each line after each adjustment into the reward, so that the out-of-limit condition of the current state is reflected, and determining the out-of-limit transfer condition reward as:
wherein r_2 is the out-of-limit transfer condition reward, computed from the out-of-limit number and the initial out-of-limit number of the j-th line, and n_init and n_add are the initial out-of-limit number and the newly added out-of-limit number, respectively;
counting the out-of-limit power into the reward, so that the current out-of-limit degree is reflected, and determining the out-of-limit power reward as:
wherein r_3 is the out-of-limit power reward, computed from the out-of-limit power and the out-of-limit line number of the i-th out-of-limit line after the j-th adjustment and from the out-of-limit power and the out-of-limit line number of the i-th out-of-limit line in the initial state;
and determining the N-1 power flow out-of-limit reward according to the cumulative out-of-limit total reward, the out-of-limit transfer condition reward and the out-of-limit power reward.
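By way of non-limiting illustration, the three reward components of claim 2 can be combined as sketched below; the claim's exact functional forms are not reproduced here, so the ratio-based shaping and the weights are assumptions intended only to show the bookkeeping:

```python
def n_minus_1_reward(N_j, N_init, n_add, P_over_j, P_over_init, w=(1.0, 1.0, 1.0)):
    """N_j, N_init: cumulative out-of-limit counts after the j-th adjustment / initially.
    n_add:        number of newly out-of-limit lines created by the adjustment.
    P_over_j, P_over_init: total out-of-limit power now / initially.
    Each term rewards improvement relative to the initial state (assumed shaping)."""
    r1 = (N_init - N_j) / max(N_init, 1)                     # cumulative out-of-limit total reward
    r2 = -n_add / max(N_init, 1)                             # out-of-limit transfer penalty
    r3 = (P_over_init - P_over_j) / max(P_over_init, 1e-6)   # out-of-limit power reward
    return w[0] * r1 + w[1] * r2 + w[2] * r3
```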
3. The method of claim 1, wherein locating an adjustment target according to the Markov decision process comprises:
after one N-1 calculation, counting the out-of-limit number of each out-of-limit line according to the out-of-limit condition of each line in the N power flows, and sequencing the out-of-limit lines from large to small by out-of-limit number, the line ordering based on out-of-limit numbers being:
wherein χ_L is the line ordering based on out-of-limit numbers, its i-th entry being the number of the i-th line;
determining, according to the active power variation of the generator and the active power variation of the line, the adjustment sensitivity as:
s_{i,j} = ΔP_{L,j} / ΔP_{G,i}
wherein ΔP_{L,j} is the active power variation of line j and ΔP_{G,i} is the active power variation of generator i;
determining the power supply area and the load area of each line: if s_{i,j} is positive, generator i belongs to the power supply area of line j; if s_{i,j} is negative, generator i belongs to the load area of line j;
wherein κ_PS is the power supply area, κ_LD is the load area, G_i represents the i-th generator, n_PS is the number of generators in the power supply area, and n_LD is the number of generators in the load area;
judging whether the out-of-limit lines and nodes have similar power supply areas and load areas, and if so, determining that these out-of-limit lines and nodes form a weak power transmission section:
μ_C = {χ_L(1), …, χ_L(n_CL), χ_N(1), …, χ_N(n_CN)}
wherein μ_C is the weak power transmission section, n_CL is the number of lines contained in the weak section, and n_CN is the number of nodes contained in the weak section;
forming different section sets from the out-of-limit lines and nodes, whose out-of-limit conditions keep changing during the adjustment process:
wherein Ω_C is the set of different sections and n_C is the number of power transmission sections in one adjustment process.
4. The method of claim 1, wherein screening actionable equipment according to the Markov decision process comprises:
when the power of a generator is adjusted, determining the connection line load margin of the generator, the connection line load margin being the minimum power margin among the lines connected to the node, determined according to the following formula:
wherein d_L is the connection line load margin of the generator, p_Li is the power angle difference of line i, and Ω_L is the set of lines connected to the node;
sequencing the generators according to the adjustment sensitivity and the connection line load margin, and determining a positive-adjustment generator sequence and a negative-adjustment generator sequence:
wherein the positive-adjustment generator sequence contains the first, second, …, n_{G+}-th positive-direction generators and the negative-adjustment generator sequence contains the first, second, …, n_{G-}-th negative-direction generators;
forming action equipment pairs according to the positive-adjustment generator sequence and the negative-adjustment generator sequence:
wherein A_match is the set of action equipment pairs.
5. The method of claim 1, wherein calculating the action amount of the generator according to the Markov decision process comprises:
determining a transfer factor according to the power shortage of the line and the power fluctuation of the out-of-limit line:
wherein the transfer factor is determined from ΔP_L0, the line power shortage, and ΔP_Lf, the power fluctuation of the out-of-limit line;
determining the adjustment amount of the out-of-limit line according to the transfer factor:
wherein ΔP_LT is the adjustment amount target of the out-of-limit line and P_Lop is the active power when the line is disconnected;
determining the line power adjustment amount according to the adjustment amount target of the out-of-limit line;
determining the initial action amount of the generator according to the relation between the line power adjustment amount and the action amount of the generator:
wherein the sensitivity matching pair of the target line relates ΔP_G, the initial action amount of the generator, to ΔP_L, the line power adjustment amount;
determining the generator action amount from the initial action amount of the generator while taking the power limits of the generators into account:
wherein P_Gi and P_maxi are the active power and the upper power limit of the i-th generator, and P_Gj and P_maxj are the active power and the upper power limit of the j-th generator.
6. The method of claim 1, wherein building a parallel deep reinforcement learning model based on the adjustment targets, the actionable devices, and the amount of action of the generator comprises:
constructing an Actor model, and using the Actor model to generate actions and interact with the environment through a policy function;
constructing a Critic model, and using the Critic model to evaluate the Actor model with a value function and to guide the action of the Actor model in the next stage;
determining the parameter gradient of the Actor model, wherein the gradient is expressed in terms of the action value function, the number of samples N, the number of time steps T_n contained in an episode, the conditional probability p_θ under policy parameter θ, the current state and the action;
determining the loss function of the Critic model, wherein loss is the loss function, N is the number of samples, V_π is the value function, T_n is the number of time steps contained in an episode, and the remaining quantities are the state at the next moment and the reward;
and determining an Actor-Critic network according to the Actor model and the Critic model, and training the Actor-Critic network synchronously in a plurality of threads.
7. A system for establishing a parallel deep reinforcement learning model for tide state adjustment, comprising:
the Markov forming module, which is used for establishing a tide state, an action, a policy, a reward and a return to form a Markov decision process;
the action amount calculating module, which is used for positioning an adjustment target, screening actionable equipment and calculating the action amount of the generator according to the tide state, the action, the policy, the reward and the return;
the model building module, which is used for building, according to the adjustment target, the actionable equipment and the action amount of the generator, a parallel deep reinforcement learning model for tide state adjustment that considers the N-1 static stability constraint;
wherein the Markov forming module comprises:
the tide state determining sub-module, which is used for determining, according to the current active power of each line and of each generator, the tide state space as:
s = {P_{L1}, …, P_{L,n_L}, P_{G1}, …, P_{G,n_G}}
wherein s is the tide state space, P_Li and P_Gi are the active power of the i-th line and of the i-th generator respectively, n_L is the number of lines, and n_G is the number of generators;
the action determining sub-module is used for enabling the power flow to meet the constraints through generator actions, and determining the action space as:
A = {G_1, G_2, …, G_{n_G}}
wherein A is the action space and G_i is the marker bit of the i-th generator;
the policy determining sub-module is used for determining a policy, which is the conditional probability distribution p of an action, according to the following formula:
π(a|s) = p(a|s)
wherein π is the policy and a is the action;
the Markov forming module further comprises:
the reward determining sub-module, which is used for determining the current power flow out-of-limit reward as:
wherein r is the current power flow out-of-limit reward, λ_R is the reward factor, and the remaining quantities are the current power and the upper power limit of the i-th line and the current voltage and the lower voltage limit of the i-th node, respectively;
the Markov forming module further comprises:
the return determining sub-module, which is used for determining the return according to the rewards accumulated over time:
wherein G is the return, t is the time, γ is the decay factor, and τ is a time period.
8. The system of claim 7, wherein the Markov forming module further comprises:
the cumulative out-of-limit number counting sub-module, which is used for obtaining, under the condition that the N-1 power flow is out of limit, N power flows each missing one element after the N-1 calculation, and counting the cumulative out-of-limit number of each line according to the out-of-limit conditions of the N power flows:
wherein N_L collects the cumulative out-of-limit numbers of the lines, its i-th entry being the cumulative out-of-limit number of the i-th line;
the total cumulative out-of-limit number determining sub-module is used for determining the total cumulative out-of-limit number according to the cumulative out-of-limit number of each line and the total number of lines:
wherein N_ZL is the total cumulative out-of-limit number and n_L is the total number of lines;
the cumulative out-of-limit total reward determining sub-module is used for counting the total cumulative out-of-limit number after each adjustment into the reward, so that the power flow is driven to meet the N-1 constraint and the current out-of-limit condition is reflected, and determining the cumulative out-of-limit total reward as:
wherein r_1 is the cumulative out-of-limit total reward, computed from the out-of-limit number after the j-th adjustment and the initial out-of-limit number N_init;
the out-of-limit transfer condition reward determining sub-module is used for counting the out-of-limit transfer condition of each line after each adjustment into the reward, so that the out-of-limit condition of the current state is reflected, and determining the out-of-limit transfer condition reward as:
wherein r_2 is the out-of-limit transfer condition reward, computed from the out-of-limit number and the initial out-of-limit number of the j-th line, and n_init and n_add are the initial out-of-limit number and the newly added out-of-limit number, respectively;
the out-of-limit power reward determining sub-module is used for counting the out-of-limit power into the reward, so that the current out-of-limit degree is reflected, and determining the out-of-limit power reward as:
wherein r_3 is the out-of-limit power reward, computed from the out-of-limit power and the out-of-limit line number of the i-th out-of-limit line after the j-th adjustment and from the out-of-limit power and the out-of-limit line number of the i-th out-of-limit line in the initial state;
and the N-1 power flow out-of-limit reward sub-module is used for determining the N-1 power flow out-of-limit reward according to the cumulative out-of-limit total reward, the out-of-limit transfer condition reward and the out-of-limit power reward.
9. The system of claim 7, wherein the action amount calculating module comprises:
the out-of-limit line sequencing sub-module, which is used for counting, after one N-1 calculation, the out-of-limit number of each out-of-limit line according to the out-of-limit condition of each line in the N power flows, and sequencing the out-of-limit lines from large to small by out-of-limit number, the line ordering based on out-of-limit numbers being:
wherein χ_L is the line ordering based on out-of-limit numbers, its i-th entry being the number of the i-th line;
the adjustment sensitivity determining sub-module is used for determining the adjustment sensitivity according to the active power variation of the generator and the active power variation of the line:
s_{i,j} = ΔP_{L,j} / ΔP_{G,i}
wherein ΔP_{L,j} is the active power variation of line j and ΔP_{G,i} is the active power variation of generator i;
the power supply and load area determining sub-module is used for determining the power supply area and the load area of each line: if s_{i,j} is positive, generator i belongs to the power supply area of line j; if s_{i,j} is negative, generator i belongs to the load area of line j;
wherein κ_PS is the power supply area, κ_LD is the load area, G_i represents the i-th generator, n_PS is the number of generators in the power supply area, and n_LD is the number of generators in the load area;
the weak power transmission section determining sub-module is used for judging whether the out-of-limit lines and nodes have similar power supply areas and load areas, and if so, determining that these out-of-limit lines and nodes form a weak power transmission section:
μ_C = {χ_L(1), …, χ_L(n_CL), χ_N(1), …, χ_N(n_CN)}
wherein μ_C is the weak power transmission section, n_CL is the number of lines contained in the weak section, and n_CN is the number of nodes contained in the weak section;
the section set forming sub-module is used for forming different section sets from the out-of-limit lines and nodes, whose out-of-limit conditions keep changing during the adjustment process:
wherein Ω_C is the set of different sections and n_C is the number of power transmission sections in one adjustment process.
10. The system of claim 7, wherein the action amount calculating module comprises:
the connection line load margin determining sub-module, which is used for determining the connection line load margin of a generator when its power is adjusted, the connection line load margin being the minimum power margin among the lines connected to the node, determined according to the following formula:
wherein d_L is the connection line load margin of the generator, p_Li is the power angle difference of line i, and Ω_L is the set of lines connected to the node;
the generator sequencing sub-module is used for sequencing the generators according to the adjustment sensitivity and the connection line load margin, and determining a positive-adjustment generator sequence and a negative-adjustment generator sequence:
wherein the positive-adjustment generator sequence contains the first, second, …, n_{G+}-th positive-direction generators and the negative-adjustment generator sequence contains the first, second, …, n_{G-}-th negative-direction generators;
the action equipment pair forming sub-module is used for forming action equipment pairs according to the positive-adjustment generator sequence and the negative-adjustment generator sequence:
wherein A_match is the set of action equipment pairs.
11. The system of claim 7, wherein the action amount calculating module comprises:
the transfer factor determining sub-module, which is used for determining a transfer factor according to the power shortage of the line and the power fluctuation of the out-of-limit line:
wherein the transfer factor is determined from ΔP_L0, the line power shortage, and ΔP_Lf, the power fluctuation of the out-of-limit line;
the out-of-limit line adjustment amount determining sub-module is used for determining the adjustment amount of the out-of-limit line according to the transfer factor:
wherein ΔP_LT is the adjustment amount target of the out-of-limit line and P_Lop is the active power when the line is disconnected;
the line power adjustment amount determining sub-module is used for determining the line power adjustment amount according to the adjustment amount target of the out-of-limit line;
the initial action amount determining sub-module is used for determining the initial action amount of the generator according to the relation between the line power adjustment amount and the action amount of the generator:
wherein the sensitivity matching pair of the target line relates ΔP_G, the initial action amount of the generator, to ΔP_L, the line power adjustment amount;
the generator action amount determining sub-module is used for determining the generator action amount from the initial action amount while taking the power limits of the generators into account:
wherein P_Gi and P_maxi are the active power and the upper power limit of the i-th generator, and P_Gj and P_maxj are the active power and the upper power limit of the j-th generator.
12. The system of claim 7, wherein the model building module comprises:
the Actor model constructing sub-module is used for constructing an Actor model that generates actions and interacts with the environment through a policy function;
the Critic model constructing sub-module is used for constructing a Critic model that evaluates the Actor model with a value function and guides the action of the Actor model in the next stage;
the parameter gradient determining sub-module is used for determining the parameter gradient of the Actor model, wherein the gradient is expressed in terms of the action value function, the number of samples N, the number of time steps T_n contained in an episode, the conditional probability p_θ under policy parameter θ, the current state and the action;
the loss function determining sub-module is used for determining the loss function of the Critic model, wherein loss is the loss function, N is the number of samples, V_π is the value function, T_n is the number of time steps contained in an episode, and the remaining quantities are the state at the next moment and the reward;
and the synchronous training sub-module is used for determining an Actor-Critic network according to the Actor model and the Critic model, and training the Actor-Critic network synchronously in a plurality of threads.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110286364.0A CN113517684B (en) | 2021-03-17 | 2021-03-17 | Method and system for establishing parallel deep reinforcement learning model for tide state adjustment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113517684A CN113517684A (en) | 2021-10-19 |
CN113517684B true CN113517684B (en) | 2023-08-25 |
Family
ID=78061743
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110286364.0A Active CN113517684B (en) | 2021-03-17 | 2021-03-17 | Method and system for establishing parallel deep reinforcement learning model for tide state adjustment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113517684B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114336632B (en) * | 2021-12-28 | 2024-09-06 | 西安交通大学 | Method for correcting alternating current power flow based on model information assisted deep learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111209710A (en) * | 2020-01-07 | 2020-05-29 | 中国电力科学研究院有限公司 | Automatic adjustment method and device for load flow calculation convergence |
CN112507614A (en) * | 2020-12-01 | 2021-03-16 | 广东电网有限责任公司中山供电局 | Comprehensive optimization method for power grid in distributed power supply high-permeability area |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |