CN114004282A - Method for extracting deep reinforcement learning emergency control strategy of power system - Google Patents

Method for extracting deep reinforcement learning emergency control strategy of power system

Info

Publication number
CN114004282A
CN114004282A (application CN202111188349.9A)
Authority
CN
China
Prior art keywords
model
sample
node
power system
theta
Prior art date
Legal status
Pending
Application number
CN202111188349.9A
Other languages
Chinese (zh)
Inventor
张俊
高天露
戴宇欣
张科
许沛东
陈思远
Current Assignee
Wuhan University WHU
State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Wuhan University WHU
State Grid Zhejiang Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU, State Grid Zhejiang Electric Power Co Ltd filed Critical Wuhan University WHU
Priority to CN202111188349.9A priority Critical patent/CN114004282A/en
Publication of CN114004282A publication Critical patent/CN114004282A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply

Abstract

The invention provides a method for extracting a deep reinforcement learning emergency control strategy of a power system. Observation data are constructed by introducing feature data of the power system node model at a plurality of historical moments; a deep Q-learning network model is then built and optimized with a stochastic gradient descent algorithm to obtain a deep reinforcement learning model for power system emergency control; a data set under a specific fault scenario is generated with the trained deep Q-learning network model; a weighted oblique decision tree model based on the information gain ratio is trained on this data set to complete the strategy extraction; and a strategy fidelity index, an actual control performance index of the strategy and a model complexity index are set to evaluate the model performance under different hyper-parameters, so that the optimal model can be selected according to actual requirements in the field of power system emergency control.

Description

Method for extracting deep reinforcement learning emergency control strategy of power system
Technical Field
The invention belongs to the interdisciplinary field of artificial intelligence and power systems, and particularly relates to a method for extracting a deep reinforcement learning emergency control strategy of a power system.
Background
Major blackout events around the world, such as the blackout in the United States in 2003, have caused great social and economic losses and are a warning of the urgent need to build a safer and more reliable power system. However, the current protection and control mechanisms of the power system are designed offline on the basis of a few typical scenarios and cannot adapt to unknown changes of the power system. Meanwhile, with the development of Artificial Intelligence (AI) technology in fields such as natural language processing, computer vision and automatic driving, these techniques have also been successfully applied in power systems, for example in load forecasting and renewable energy forecasting, transmission line icing thickness identification, and fast-charging guidance for electric vehicles. Artificial intelligence algorithms represented by Deep Learning (DL) can cope more easily with unknown changes in the power system because of their strong feature extraction and nonlinear mapping capabilities.
In recent years, the application of Deep Reinforcement Learning (DRL) to autonomous driving, games and other fields has verified its advantages in solving sequential decision-making problems, which naturally include power system control problems. Many scholars have tried to solve prevention control, emergency control and restoration control problems of the power system based on DRL and have obtained good results.
However, the black-box nature and poor interactivity of these artificial intelligence algorithms limit their application in practical scenarios, especially where critical decisions are involved. Therefore, scholars at home and abroad have tried to build lighter and interpretable decision models from existing AI models based on concepts such as imitation learning and knowledge distillation. In particular, some scholars have proposed reinforcement learning strategy extraction methods based on decision trees and their variants. However, these efforts have only verified their feasibility in some simple game scenarios, and no related work has yet been successfully carried out on power system control problems.
Disclosure of Invention
In order to overcome the defects of the background art, the invention provides an emergency control strategy extraction method for deep reinforcement learning of an electric power system.
The specific technical scheme of the invention is a method for extracting a deep reinforcement learning emergency control strategy of a power system, which specifically comprises the following steps.
Step 1: constructing observation data by introducing feature data of the power system node model at a plurality of historical moments;
Step 2: introducing a deep Q-learning network model, sequentially feeding several groups of observation data into the deep Q-learning network model to predict load-shedding actions, and then performing optimization training with a stochastic gradient descent algorithm to obtain a deep reinforcement learning model for power system emergency control;
Step 3: generating a data set under a specific fault scenario based on the trained deep Q-learning network model;
Step 4: for each non-leaf node of the weighted oblique decision tree model based on the information gain ratio, inputting the state-action pair data of the data set under that node into the model, solving for the minimum of the model objective function with a quasi-Newton algorithm to obtain the optimal parameters of the model at that node, dividing the data set under the node into a left subset and a right subset, constructing left and right child nodes, and repeating the above steps until the termination condition of the algorithm is met;
Step 5: setting a strategy fidelity index, an actual control performance index of the strategy and a model complexity index to evaluate the model performance under different hyper-parameters, so that the optimal model can be selected for power system emergency control according to task requirements;
preferably, the observation data in step 1 is specifically defined as:
X_t = [u_t, u_{t+1}, ..., u_{t+L-1}]^T
u_{t+l} = {data_{t+l,p,j} | 1 ≤ p ≤ P, 1 ≤ j ≤ J}, l ∈ [0, L-1]
wherein X_t denotes the t-th group of observation data, t denotes the starting moment of the t-th group of observation data, and L is a positive integer giving the length of the observation window; u_{t+l} denotes the (l+1)-th row of the t-th group of observation data, i.e. the observations of the power system multi-node model at moment t+l; data_{t+l,p,j} denotes the j-th type of feature data of the p-th bus node at moment t+l in the power system multi-node model, P is the number of power system nodes, and J is the number of node features.
Preferably, the load-shedding action predicted in step 2 is formed by combining percentage load shedding at the bus nodes of the power system; each bus node has two load-shedding options, namely no action or shedding 20% of the total load on that bus node;
the deep Q-learning network model can predict 2^H load-shedding actions in total, where H is the number of controllable nodes; the actions are further sorted and numbered, i.e. the action set is defined as:
Y = [0, 1, ..., y, ..., 2^H - 1], y ∈ ℕ
Preferably, the data set in step 3 is generated as follows: after training of the deep Q-learning network model is completed, for a set fault scenario, the power system feature quantities from moment t to moment t+L-1, x_t, are fed into the DQN decision model in a rolling manner; the decision model selects the optimal action y_t from the action set Y, and the model input and output data at each step are recorded to construct a state-action pair (x_t, y_t), thus completing the generation of the labelled data set. The state-action pair data set of step 3 can be expressed as:
S = {(x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_N, y_N)}
wherein (x_i, y_i) denotes the i-th state-action pair in the data set, x_i is the power system state quantity of the i-th state-action pair, y_i is the control action of the i-th state-action pair, and N is the number of state-action pairs in the data set;
Preferably, step 4 is specifically as follows:
Step 4.1: the input of each non-leaf node of the weighted oblique decision tree model based on the information gain ratio is a training data set S, (x_i, y_i) ∈ S, i = 1, 2, ..., M, M ≤ N, where M is the number of samples of the data set under the current node and N is the total number of samples; the maximum model depth is set to D and the current node depth is d;
wherein the training data set under the root node is the data set S generated in step 3, and the training data set under any other non-leaf node is the left subset S'_L or right subset S'_R obtained by splitting the training set of its parent node;
Step 4.2: create the model root node G based on the data set S, and let the current node depth d = 0;
Step 4.3: if the current node depth d is larger than the maximum model depth D, set node G as a leaf node whose label is the class label k with the largest number of samples in the data set S; otherwise, go to step 4.4;
Step 4.4: if all samples in the data set S belong to the same class k, set node G as a leaf node with label k; otherwise, go to step 4.5;
Step 4.5: initialize the parameter θ at the current node of the model in the manner of a univariate decision tree to obtain an initial value θ_0;
Step 4.6: starting from the initial value θ_0, solve for the minimum of the model objective function with the quasi-Newton algorithm and obtain the optimal model parameter θ_best:
L(θ) = -(H(S) - H(S|θ))/H(S) + λ‖θ‖₂²
wherein L(θ) is the model objective function, λ is the L2 regularization coefficient, θ is the parameter to be trained at each node of the model, ‖θ‖₂ is the two-norm of θ, H(S) is the empirical entropy of the sample set S, and H(S|θ) is the conditional empirical entropy of the sample set S given θ;
H(S) = -Σ_{k=1}^{K} (|S_k|/|S|) log₂(|S_k|/|S|)
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; |S_k| is the number of class-k samples in the sample set S, and |S| is the total number of samples in the sample set S;
H(S|θ) = (W_L·H_L + W_R·H_R)/M
wherein W_L is the sum of the weights of all samples belonging to the left subset, W_R is the sum of the weights of all samples belonging to the right subset, H_L is the weighted information entropy of the left child node, H_R is the weighted information entropy of the right child node, M is the total number of samples of the sample set S under the node, and θ is the parameter to be trained at each node of the model;
W_L = Σ_{(x_i, y_i, w_i^L) ∈ S_L} w_i^L
wherein w_i^L is the weight with which the sample (x_i, y_i) belongs to the left child node, and S_L is the set associating each sample with its left-child weight information;
W_R = Σ_{(x_i, y_i, w_i^R) ∈ S_R} w_i^R
wherein w_i^R is the weight with which the sample (x_i, y_i) belongs to the right child node, and S_R is the set associating each sample with its right-child weight information;
H_L = -Σ_{k=1}^{K} (W_L^k/W_L) log₂(W_L^k/W_L)
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; W_L^k is the sum of the weights of the class-k samples belonging to the left subset, and W_L is the sum of the weights of all samples in the sample set S belonging to the left subset;
H_R = -Σ_{k=1}^{K} (W_R^k/W_R) log₂(W_R^k/W_R)
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; W_R^k is the sum of the weights of the class-k samples belonging to the right subset, and W_R is the sum of the weights of all samples in the sample set S belonging to the right subset;
S_L = {(x_i, y_i, w_i^L) | (x_i, y_i) ∈ S}
S_R = {(x_i, y_i, w_i^R) | (x_i, y_i) ∈ S}
wherein (x_i, y_i) denotes the i-th sample in the sample set S, S_L associates each sample with its left-subset weight information, and S_R associates each sample with its right-subset weight information;
w_i^L = σ(θ^T x_i)
w_i^R = 1 - σ(θ^T x_i)
wherein w_i^L is the weight with which the i-th sample belongs to the left child node, w_i^R is the weight with which the i-th sample belongs to the right child node, and σ(·) is the sigmoid function;
W_L^k = Σ_{(x_i, y_i, w_i^L) ∈ S_L, y_i = k} w_i^L
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; W_L^k is the sum of the left-subset weights of the class-k samples in the sample set;
W_R^k = Σ_{(x_i, y_i, w_i^R) ∈ S_R, y_i = k} w_i^R
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; W_R^k is the sum of the right-subset weights of the class-k samples in the sample set;
Step 4.7: based on the current model parameter θ_best, compute the corresponding objective function value L_0 from L(θ);
Step 4.8: randomly initialize the parameter θ, repeating the initialization C times, where C is a model hyper-parameter, to obtain an initial parameter value θ'_0;
Step 4.9: starting from the initial value θ'_0, solve L(θ) with the quasi-Newton algorithm to obtain the optimal model parameter θ'_best;
Step 4.10: based on the currently solved model parameter θ'_best, compute the corresponding objective function value L'_0;
Step 4.11: if the objective function value L'_0 < L_0, set the optimal model parameter θ_best = θ'_best; otherwise, go to step 4.12;
Step 4.12: based on the optimal parameter θ_best, obtain the left and right subsets S'_L and S'_R:
S'_L = {(x_i, y_i) ∈ S | w_i^L ≥ w_i^R}
S'_R = {(x_i, y_i) ∈ S | w_i^L < w_i^R}
wherein (x_i, y_i) denotes the i-th sample in the sample set S, w_i^L is the weight with which the i-th sample belongs to the left child node, S_L associates each sample with its left-subset weight information, w_i^R is the weight with which the i-th sample belongs to the right child node, and S_R associates each sample with its right-subset weight information;
Step 4.13: construct the left child node G_L, let its training set be S'_L, set d = d + 1, and go to step 4.3;
Step 4.14: construct the right child node G_R, let its training set be S'_R, set d = d + 1, and go to step 4.3.
Preferably, step 5 specifically comprises:
Step 5.1: the strategy fidelity index in step 5 measures the degree to which the decisions of the weighted oblique decision tree model based on the information gain ratio match those of the deep reinforcement learning strategy, and is calculated as:
fidelity = (1/N) Σ_{i=1}^{N} I(y_i = ŷ_i)
wherein y_i and ŷ_i are, respectively, the outputs of the deep reinforcement learning model and of the weighted oblique decision tree model based on the information gain ratio given the same input x_i, N is the total number of samples, and I(·) is the indicator function.
Step 5.2: the actual control performance index of the strategy in step 5 denotes the average return obtained per episode when the weighted oblique decision tree strategy based on the information gain ratio is applied to an actual control scenario. When the deep reinforcement learning model is applied to the actual control scenario, the average return it obtains per episode is denoted r_e; correspondingly, the average return of the weighted oblique decision tree based on the information gain ratio in the same scenario is denoted r'_e. Thus R_e = r'_e - r_e > 0 indicates that the weighted oblique decision tree model based on the information gain ratio has better control performance than the deep reinforcement learning model, and vice versa.
Step 5.3: the model complexity index in step 5 is the model complexity, measured by the number of model parameters or by the model depth. From the viewpoint of model interpretability and interactivity, a weighted oblique decision tree strategy based on the information gain ratio with as low a model complexity as possible is sought.
Step 5.4: after comprehensively considering the index results of steps 5.1 to 5.3, the optimal weighted oblique decision tree model based on the information gain ratio is selected according to actual requirements.
The advantage of the invention is that a complex deep reinforcement learning model can be extracted into a lightweight control strategy with a certain degree of interpretability while maintaining good control performance, which alleviates the difficulty of applying artificial intelligence techniques in practice caused by their black-box nature.
Drawings
FIG. 1: workflow of the method for extracting a deep reinforcement learning emergency control strategy of a power system;
FIG. 2: topology of the IEEE 39-bus system;
FIG. 3: pseudo code of the algorithm of the weighted oblique decision tree model based on the information gain ratio.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The method is introduced on an under-voltage load shedding problem on the IEEE 39-bus system: strategy extraction is performed on a deep reinforcement learning agent for under-voltage load shedding, the complex deep reinforcement learning strategy is converted into a lighter strategy in the form of a weighted oblique decision tree based on the information gain ratio with a certain degree of interpretability, and the effectiveness and advancement of the method are evaluated through three indices: strategy fidelity, actual control performance of the strategy, and model complexity.
Referring to FIGS. 1 to 3, an embodiment of the present invention is described below. As shown in FIG. 1, the embodiment provides a method for extracting a deep reinforcement learning emergency control strategy of a power system, which specifically includes the following steps:
step 1: as shown in step 1 of fig. 1, the observation data is constructed by introducing feature data of a plurality of historical moments of a multi-node model of the power system.
The observation data in step 1 are specifically defined as:
X_t = [u_t, u_{t+1}, ..., u_{t+L-1}]^T
u_{t+l} = {data_{t+l,p,j} | 1 ≤ p ≤ P, 1 ≤ j ≤ J}, l ∈ [0, L-1]
wherein X_t denotes the t-th group of observation data, t denotes the starting moment of the t-th group of observation data, and L is a positive integer giving the length of the observation window; u_{t+l} denotes the (l+1)-th row of the t-th group of observation data, i.e. the observations of the power system multi-node model at moment t+l; data_{t+l,p,j} denotes the j-th type of feature data of the p-th bus node at moment t+l in the power system multi-node model, P is the number of power system nodes, and J is the number of node features.
As shown in FIG. 2, the observed quantities of the agent are set as: the per-unit voltages on the high- and low-voltage sides of buses 4, 7, 8 and 18, and the load margins of buses 4 and 7, so that P = 4, J_bus4 = J_bus7 = 3, J_bus8 = J_bus18 = 2. To capture the trend of the features, the feature quantities of the latest 5 simulation steps are stacked as the final input of the agent, i.e. P = 4, J_bus4 = J_bus7 = 3, J_bus8 = J_bus18 = 2, L = 5, and x_t = [u_{t-4}, u_{t-3}, ..., u_t].
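The following Python sketch illustrates one possible way to assemble the observation window x_t of this embodiment; the flat per-step feature layout and all names are illustrative assumptions rather than part of the described method:

```python
import numpy as np

# Per this embodiment: 4 monitored buses (4, 7, 8, 18) with 3, 3, 2 and 2
# features respectively (high/low-side per-unit voltages, plus the load
# margin for buses 4 and 7), flattened to 10 features per simulation step,
# and an observation window of the latest L = 5 steps.
L_WINDOW = 5
FEATURES_PER_STEP = 10

def build_observation(history):
    """history: list of per-step feature vectors u_t (newest last).
    Returns x_t = [u_{t-4}, ..., u_t] as an (L, features) array."""
    if len(history) < L_WINDOW:
        raise ValueError("need at least L_WINDOW steps of history")
    return np.asarray(history[-L_WINDOW:], dtype=np.float32)

# Example: 20 recorded steps of 10 features each -> observation of shape (5, 10)
obs = build_observation([np.random.rand(FEATURES_PER_STEP) for _ in range(20)])
```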
Step 2: as shown in step 2 of FIG. 1, a deep Q-learning network model is introduced, several groups of observation data are sequentially fed into the deep Q-learning network model to predict load-shedding actions, and the model is then optimized with a stochastic gradient descent algorithm to obtain the deep reinforcement learning model for power system emergency control.
and 2, the predicted load reduction action is formed by a load shedding percentage combination mode of bus nodes of the power system, each bus node has two load reduction modes, and the modes of no action and load reduction are respectively defined as 20% of the total load on the bus nodes.
The number of load reduction actions predicted by the deep Q learning network model includes 2 in totalHH is the number of controllable nodes; further, the actions included in the actions are sorted and numbered, that is, the action set is defined as:
Y=[0,1,...,y,...,2H-1],y∈Ν
here, the operable bus is set to be bus 4 or 7, that is, H is 2, and therefore: y ═ 0,1,2, 3.
Step 3: as shown in step 3 of FIG. 1, a data set under a specific fault scenario is generated based on the trained deep Q-learning network model.
The data set in step 3 is generated as follows: after training of the deep Q-learning network model is completed, for the set fault scenario, the power system feature quantities from moment t to moment t+4, x_t, are fed into the DQN decision model in a rolling manner; the decision model selects the optimal action y_t from the action set [0, 1, 2, 3], and the model input and output data at each step are recorded to construct a state-action pair (x_t, y_t), thus completing the generation of the labelled data set. The state-action pair data set of step 3 can be expressed as:
S = {(x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_N, y_N)}
wherein (x_i, y_i) denotes the i-th state-action pair in the data set, x_i is the power system state quantity of the i-th state-action pair, y_i is the control action of the i-th state-action pair, and N is the number of state-action pairs in the data set, here N = 4836.
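The rolling data collection of this step could look roughly like the following sketch, assuming a trained Q-network `q_net` and a simulation environment `env` with a Gym-style reset/step interface; both objects and their call signatures are illustrative assumptions:

```python
import numpy as np

def generate_dataset(q_net, env, n_episodes):
    """Roll the trained DQN greedily over the configured fault scenario and
    record every (state, action) pair as one labelled sample of the set S."""
    states, actions = [], []
    for _ in range(n_episodes):
        x = env.reset()                 # stacked observation window x_t
        done = False
        while not done:
            q_values = q_net(np.asarray(x)[None, ...])   # shape (1, 2**H)
            y = int(np.argmax(q_values))                 # greedy action y_t
            states.append(np.asarray(x).ravel())
            actions.append(y)
            x, _, done, _ = env.step(y)
    return np.stack(states), np.asarray(actions)

# states, actions = generate_dataset(q_net, env, n_episodes=200)
# -> in this embodiment the recorded pairs form a data set with N = 4836 samples.
```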
Step 4: as shown in step 4 of FIG. 1, at each non-leaf node of the weighted oblique decision tree model based on the information gain ratio, the state-action pair data of the data set under that node are input into the model, the minimum of the model objective function is solved with a quasi-Newton algorithm to obtain the optimal parameters of the model at that node, the data set under the node is divided into a left subset and a right subset, left and right child nodes are constructed, and the above steps are repeated until the termination condition of the algorithm is met.
as shown in the pseudo code of fig. 3, the step 4 is as follows:
step 4.1: the input condition of each non-leaf node in the weighted tilt decision tree model with the set information gain ratio is a training data set S, (x)i,yi) E, S, i is equal to 1,2,3, and M is equal to or less than N, where M is the number of data set samples under the current node, N is the total number of samples, and N is equal to 4836; setting the maximum depth of the model as D, wherein D belongs to {3,4,5,6,7,8}, and the depth of the current node is D;
wherein the training data set under the root node is the data set S generated in the step 3, and the training data sets under other non-leaf nodes are left subsets S 'obtained by dividing the training set of the parent node'LS 'of right subset'R
Step 4.2: creating a model root node G based on the data set S, and enabling the current node depth d to be 0;
step 4.3: if the current node depth D is larger than the maximum model depth D, the node G is set as a leaf node, and the label of the node G is the corresponding label k with the maximum number of samples in the data set S; otherwise, turning to step 4.4;
step 4.4: if all samples in dataset S belong to the same class k, node G is set as a leaf node, labeled k. Otherwise, turning to step 4.5;
step 4.5: initializing a parameter theta under a current node of the model in a univariate decision tree mode to obtain an initial value theta0
Step 4.6: based on quasi-Newton algorithm and initial value theta0Solving the minimum value of the model objective function and obtaining the optimal parameter theta of the modelbest
L(θ) = -(H(S) - H(S|θ))/H(S) + λ‖θ‖₂²
wherein L(θ) is the model objective function, λ is the L2 regularization coefficient, set here as λ = 0.0001; θ is the parameter to be trained at each node of the model, ‖θ‖₂ is the two-norm of θ, H(S) is the empirical entropy of the sample set S, and H(S|θ) is the conditional empirical entropy of the sample set S given θ;
H(S) = -Σ_{k=1}^{K} (|S_k|/|S|) log₂(|S_k|/|S|)
wherein K is the total number of sample classes, here K = 4; k denotes the k-th class label in the samples; |S_k| is the number of class-k samples in the sample set S, and |S| is the total number of samples in the sample set S;
H(S|θ) = (W_L·H_L + W_R·H_R)/M
wherein W_L is the sum of the weights of all samples belonging to the left subset, W_R is the sum of the weights of all samples belonging to the right subset, H_L is the weighted information entropy of the left child node, H_R is the weighted information entropy of the right child node, M is the total number of samples of the sample set S under the node, and θ is the parameter to be trained at each node of the model;
W_L = Σ_{(x_i, y_i, w_i^L) ∈ S_L} w_i^L
wherein w_i^L is the weight with which the sample (x_i, y_i) belongs to the left child node, and S_L is the set associating each sample with its left-child weight information;
W_R = Σ_{(x_i, y_i, w_i^R) ∈ S_R} w_i^R
wherein w_i^R is the weight with which the sample (x_i, y_i) belongs to the right child node, and S_R is the set associating each sample with its right-child weight information;
H_L = -Σ_{k=1}^{K} (W_L^k/W_L) log₂(W_L^k/W_L)
wherein K is the total number of sample classes, here K = 4; k denotes the k-th class label in the samples; W_L^k is the sum of the weights of the class-k samples belonging to the left subset, and W_L is the sum of the weights of all samples in the sample set S belonging to the left subset;
H_R = -Σ_{k=1}^{K} (W_R^k/W_R) log₂(W_R^k/W_R)
wherein K is the total number of sample classes, here K = 4; k denotes the k-th class label in the samples; W_R^k is the sum of the weights of the class-k samples belonging to the right subset, and W_R is the sum of the weights of all samples in the sample set S belonging to the right subset;
S_L = {(x_i, y_i, w_i^L) | (x_i, y_i) ∈ S}
S_R = {(x_i, y_i, w_i^R) | (x_i, y_i) ∈ S}
wherein (x_i, y_i) denotes the i-th sample in the sample set S, S_L associates each sample with its left-subset weight information, and S_R associates each sample with its right-subset weight information;
w_i^L = σ(θ^T x_i)
w_i^R = 1 - σ(θ^T x_i)
wherein w_i^L is the weight with which the i-th sample belongs to the left child node, w_i^R is the weight with which the i-th sample belongs to the right child node, and σ(·) is the sigmoid function;
W_L^k = Σ_{(x_i, y_i, w_i^L) ∈ S_L, y_i = k} w_i^L
wherein K is the total number of sample classes, here K = 4; k denotes the k-th class label in the samples; W_L^k is the sum of the left-subset weights of the class-k samples in the sample set;
W_R^k = Σ_{(x_i, y_i, w_i^R) ∈ S_R, y_i = k} w_i^R
wherein K is the total number of sample classes, here K = 4; k denotes the k-th class label in the samples; W_R^k is the sum of the right-subset weights of the class-k samples in the sample set;
Step 4.7: based on the current model parameter θ_best, compute the corresponding objective function value L_0 from L(θ);
Step 4.8: randomly initialize the parameter θ, repeating the initialization C times, where C is a model hyper-parameter, set here as C = 3, to obtain an initial parameter value θ'_0;
Step 4.9: starting from the initial value θ'_0, solve L(θ) with the quasi-Newton algorithm to obtain the optimal model parameter θ'_best;
Step 4.10: based on the currently solved model parameter θ'_best, compute the corresponding objective function value L'_0;
Step 4.11: if the objective function value L'_0 < L_0, set the optimal model parameter θ_best = θ'_best; otherwise, go to step 4.12;
Step 4.12: based on the optimal parameter θ_best, obtain the left and right subsets S'_L and S'_R:
S'_L = {(x_i, y_i) ∈ S | w_i^L ≥ w_i^R}
S'_R = {(x_i, y_i) ∈ S | w_i^L < w_i^R}
wherein (x_i, y_i) denotes the i-th sample in the sample set S, w_i^L is the weight with which the i-th sample belongs to the left child node, S_L associates each sample with its left-subset weight information, w_i^R is the weight with which the i-th sample belongs to the right child node, and S_R associates each sample with its right-subset weight information;
Step 4.13: construct the left child node G_L, let its training set be S'_L, set d = d + 1, and go to step 4.3;
Step 4.14: construct the right child node G_R, let its training set be S'_R, set d = d + 1, and go to step 4.3.
and 5: as shown in the step 5 of fig. 1, a strategy fidelity index, a strategy actual control performance index and a model complexity index are set to evaluate the model performance under different hyper-parameters, so that an optimal model is selected according to actual requirements;
the step 5 specifically comprises the following steps:
step 5.1, the strategy fidelity index in step 5 is strategy fidelity, the meaning is the decision matching degree of the weighted inclined decision tree model based on the information gain ratio and the depth reinforcement learning strategy, and the calculation formula is as follows:
Figure BDA0003300208670000121
wherein y is
Figure BDA0003300208670000122
The output of the deep reinforcement learning and the weighted tilt decision tree model based on the information gain ratio is given the same input x, N is the total amount of samples, and I (-) is an illustrative function.
Step 5.2: and 5, the strategy actual control performance index is the strategy actual control performance and represents the average return obtained in each round when the weighted inclined decision tree strategy based on the information gain ratio is applied to an actual control scene. When the deep reinforcement learning model is applied to the actual control scenario, the average reward it gets at each round is: r iseCorrespondingly, the average return of the weighted tilt decision tree based on the information gain ratio in the corresponding scene is recorded as: r'e. Thus, Re=r’e-reA value > 0 represents that the weighted inclined decision tree model based on the information gain ratio has better control performance than the deep reinforcement learning model and vice versa.
Step 5.3: and 5, the model complexity index is the model complexity, and is measured by the model parameter number or the model depth. From the viewpoint of model interpretability and interactivity, a weighted tilt decision tree strategy based on an information gain ratio is sought, wherein the complexity of the model is as small as possible.
Step 5.4: after comprehensively considering the index results of the step 5.1 to the step 5.3, selecting a weighted tilt decision tree model with the optimal information gain ratio according to actual requirements;
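A possible way to compute the three indices of steps 5.1 to 5.3 is sketched below; the tree representation follows the node dictionaries of the sketch after step 4, and the function names are illustrative:

```python
import numpy as np

def policy_fidelity(y_drl, y_tree):
    """Step 5.1: fraction of samples on which the extracted tree reproduces
    the decision of the deep reinforcement learning model."""
    return float(np.mean(np.asarray(y_drl) == np.asarray(y_tree)))

def control_performance_gap(returns_tree, returns_drl):
    """Step 5.2: R_e = r'_e - r_e; a positive value means the extracted tree
    policy obtains a higher average per-episode return than the DRL policy."""
    return float(np.mean(returns_tree) - np.mean(returns_drl))

def tree_complexity(node):
    """Step 5.3: one possible complexity measure, the number of non-leaf
    (parameterised) nodes of the extracted tree."""
    if node["leaf"]:
        return 0
    return 1 + tree_complexity(node["left"]) + tree_complexity(node["right"])
```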
The index results obtained in this embodiment show that the strategy fidelity and the actual control performance of the candidate models do not differ much; therefore, the model with the lowest model complexity is selected, namely the weighted oblique decision tree model based on the information gain ratio with model depth 3.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, it will be appreciated by those skilled in the art that these are merely illustrative and that various changes or modifications may be made to these embodiments without departing from the principles and spirit of the invention. The scope of the invention is only limited by the appended claims.

Claims (6)

1. A method for extracting an emergency control strategy for deep reinforcement learning of a power system, characterized by comprising the following steps:
step 1: constructing observation data by introducing feature data of the power system node model at a plurality of historical moments;
step 2: introducing a deep Q-learning network model, sequentially feeding several groups of observation data into the deep Q-learning network model to predict load-shedding actions, and then performing optimization training with a stochastic gradient descent algorithm to obtain a deep reinforcement learning model for power system emergency control;
step 3: generating a data set under a specific fault scenario based on the trained deep Q-learning network model;
step 4: for each non-leaf node of the weighted oblique decision tree model based on the information gain ratio, inputting the state-action pair data of the data set under that node into the model, solving for the minimum of the model objective function with a quasi-Newton algorithm to obtain the optimal parameters of the model at that node, dividing the data set under the node into a left subset and a right subset, constructing left and right child nodes, and repeating the above steps until the termination condition of the algorithm is met;
step 5: setting a strategy fidelity index, an actual control performance index of the strategy and a model complexity index to evaluate the model performance under different hyper-parameters, so that the optimal model can be selected for power system emergency control according to task requirements.
2. The method for extracting the emergency control strategy for deep reinforcement learning of the power system according to claim 1, wherein the observation data in step 1 is specifically defined as:
X_t = [u_t, u_{t+1}, ..., u_{t+L-1}]^T
u_{t+l} = {data_{t+l,p,j} | 1 ≤ p ≤ P, 1 ≤ j ≤ J}, l ∈ [0, L-1]
wherein X_t denotes the t-th group of observation data, t denotes the starting moment of the t-th group of observation data, and L is a positive integer giving the length of the observation window; u_{t+l} denotes the (l+1)-th row of the t-th group of observation data, i.e. the observations of the power system multi-node model at moment t+l; data_{t+l,p,j} denotes the j-th type of feature data of the p-th bus node at moment t+l in the power system multi-node model, P is the number of power system nodes, and J is the number of node features.
3. The method for extracting the emergency control strategy for deep reinforcement learning of the power system according to claim 1, wherein the load-shedding action predicted in step 2 is formed by combining percentage load shedding at the bus nodes of the power system, each bus node having two load-shedding options, namely no action or shedding 20% of the total load on that bus node;
the deep Q-learning network model can predict 2^H load-shedding actions in total, where H is the number of controllable nodes; the actions are further sorted and numbered, i.e. the action set is defined as:
Y = [0, 1, ..., y, ..., 2^H - 1], y ∈ ℕ.
4. The method for extracting the emergency control strategy for deep reinforcement learning of the power system according to claim 1, wherein the data set in step 3 is generated as follows: after training of the deep Q-learning network model is completed, for a set fault scenario, the power system feature quantities from moment t to moment t+L-1, x_t, are fed into the DQN decision model in a rolling manner; the decision model selects the optimal action y_t from the action set Y, and the model input and output data at each step are recorded to construct a state-action pair (x_t, y_t), thus completing the generation of the labelled data set; the state-action pair data set of step 3 can be expressed as:
S = {(x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_N, y_N)}
wherein (x_i, y_i) denotes the i-th state-action pair in the data set, x_i is the power system state quantity of the i-th state-action pair, y_i is the control action of the i-th state-action pair, and N is the number of state-action pairs in the data set.
5. The method for extracting the emergency control strategy for deep reinforcement learning of the power system according to claim 1, wherein step 4 is specifically as follows:
step 4.1: the input of each non-leaf node of the weighted oblique decision tree model based on the information gain ratio is a training data set S, (x_i, y_i) ∈ S, i = 1, 2, ..., M, M ≤ N, where M is the number of samples of the data set under the current node and N is the total number of samples; the maximum model depth is set to D and the current node depth is d;
wherein the training data set under the root node is the data set S generated in step 3, and the training data set under any other non-leaf node is the left subset S'_L or right subset S'_R obtained by splitting the training set of its parent node;
step 4.2: create the model root node G based on the data set S, and let the current node depth d = 0;
step 4.3: if the current node depth d is larger than the maximum model depth D, set node G as a leaf node whose label is the class label k with the largest number of samples in the data set S; otherwise, go to step 4.4;
step 4.4: if all samples in the data set S belong to the same class k, set node G as a leaf node with label k; otherwise, go to step 4.5;
step 4.5: initialize the parameter θ at the current node of the model in the manner of a univariate decision tree to obtain an initial value θ_0;
step 4.6: starting from the initial value θ_0, solve for the minimum of the model objective function with the quasi-Newton algorithm and obtain the optimal model parameter θ_best:
L(θ) = -(H(S) - H(S|θ))/H(S) + λ‖θ‖₂²
wherein L(θ) is the model objective function, λ is the L2 regularization coefficient, θ is the parameter to be trained at each node of the model, ‖θ‖₂ is the two-norm of θ, H(S) is the empirical entropy of the sample set S, and H(S|θ) is the conditional empirical entropy of the sample set S given θ;
H(S) = -Σ_{k=1}^{K} (|S_k|/|S|) log₂(|S_k|/|S|)
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; |S_k| is the number of class-k samples in the sample set S, and |S| is the total number of samples in the sample set S;
H(S|θ) = (W_L·H_L + W_R·H_R)/M
wherein W_L is the sum of the weights of all samples belonging to the left subset, W_R is the sum of the weights of all samples belonging to the right subset, H_L is the weighted information entropy of the left child node, H_R is the weighted information entropy of the right child node, M is the total number of samples of the sample set S under the node, and θ is the parameter to be trained at each node of the model;
W_L = Σ_{(x_i, y_i, w_i^L) ∈ S_L} w_i^L
wherein w_i^L is the weight with which the sample (x_i, y_i) belongs to the left child node, and S_L is the set associating each sample with its left-child weight information;
W_R = Σ_{(x_i, y_i, w_i^R) ∈ S_R} w_i^R
wherein w_i^R is the weight with which the sample (x_i, y_i) belongs to the right child node, and S_R is the set associating each sample with its right-child weight information;
H_L = -Σ_{k=1}^{K} (W_L^k/W_L) log₂(W_L^k/W_L)
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; W_L^k is the sum of the weights of the class-k samples belonging to the left subset, and W_L is the sum of the weights of all samples in the sample set S belonging to the left subset;
H_R = -Σ_{k=1}^{K} (W_R^k/W_R) log₂(W_R^k/W_R)
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; W_R^k is the sum of the weights of the class-k samples belonging to the right subset, and W_R is the sum of the weights of all samples in the sample set S belonging to the right subset;
S_L = {(x_i, y_i, w_i^L) | (x_i, y_i) ∈ S}
S_R = {(x_i, y_i, w_i^R) | (x_i, y_i) ∈ S}
wherein (x_i, y_i) denotes the i-th sample in the sample set S, S_L associates each sample with its left-subset weight information, and S_R associates each sample with its right-subset weight information;
w_i^L = σ(θ^T x_i)
w_i^R = 1 - σ(θ^T x_i)
wherein w_i^L is the weight with which the i-th sample belongs to the left child node, w_i^R is the weight with which the i-th sample belongs to the right child node, and σ(·) is the sigmoid function;
W_L^k = Σ_{(x_i, y_i, w_i^L) ∈ S_L, y_i = k} w_i^L
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; W_L^k is the sum of the left-subset weights of the class-k samples in the sample set;
W_R^k = Σ_{(x_i, y_i, w_i^R) ∈ S_R, y_i = k} w_i^R
wherein K is the total number of sample classes; k denotes the k-th class label in the samples; W_R^k is the sum of the right-subset weights of the class-k samples in the sample set;
step 4.7: based on the current model parameter θ_best, compute the corresponding objective function value L_0 from L(θ);
step 4.8: randomly initialize the parameter θ, repeating the initialization C times, where C is a model hyper-parameter, to obtain an initial parameter value θ'_0;
step 4.9: starting from the initial value θ'_0, solve L(θ) with the quasi-Newton algorithm to obtain the optimal model parameter θ'_best;
step 4.10: based on the currently solved model parameter θ'_best, compute the corresponding objective function value L'_0;
step 4.11: if the objective function value L'_0 < L_0, set the optimal model parameter θ_best = θ'_best; otherwise, go to step 4.12;
step 4.12: based on the optimal parameter θ_best, obtain the left and right subsets S'_L and S'_R:
S'_L = {(x_i, y_i) ∈ S | w_i^L ≥ w_i^R}
S'_R = {(x_i, y_i) ∈ S | w_i^L < w_i^R}
wherein (x_i, y_i) denotes the i-th sample in the sample set S, w_i^L is the weight with which the i-th sample belongs to the left child node, S_L associates each sample with its left-subset weight information, w_i^R is the weight with which the i-th sample belongs to the right child node, and S_R associates each sample with its right-subset weight information;
step 4.13: construct the left child node G_L, let its training set be S'_L, set d = d + 1, and go to step 4.3;
step 4.14: construct the right child node G_R, let its training set be S'_R, set d = d + 1, and go to step 4.3.
6. The method according to claim 1, characterized in that step 5 specifically comprises the following steps:
step 5.1: the strategy fidelity index in step 5 measures the degree to which the decisions of the weighted oblique decision tree model based on the information gain ratio match those of the deep reinforcement learning strategy, and is calculated as:
fidelity = (1/N) Σ_{i=1}^{N} I(y_i = ŷ_i)
wherein y_i and ŷ_i are, respectively, the outputs of the deep reinforcement learning model and of the weighted oblique decision tree model based on the information gain ratio given the same input x_i, N is the total number of samples, and I(·) is the indicator function;
step 5.2: the actual control performance index of the strategy in step 5 denotes the average return obtained per episode when the weighted oblique decision tree strategy based on the information gain ratio is applied to an actual control scenario; when the deep reinforcement learning model is applied to the actual control scenario, the average return it obtains per episode is denoted r_e; correspondingly, the average return of the weighted oblique decision tree based on the information gain ratio in the same scenario is denoted r'_e; thus, R_e = r'_e - r_e > 0 indicates that the control performance of the weighted oblique decision tree model based on the information gain ratio is superior to that of the deep reinforcement learning model, and vice versa;
step 5.3: the model complexity index in step 5 is the model complexity, measured by the number of model parameters or by the model depth; from the viewpoint of model interpretability and interactivity, a weighted oblique decision tree strategy based on the information gain ratio with as low a model complexity as possible is sought;
step 5.4: after comprehensively considering the index results of steps 5.1 to 5.3, the optimal weighted oblique decision tree model based on the information gain ratio is selected according to actual requirements.
CN202111188349.9A 2021-10-12 2021-10-12 Method for extracting deep reinforcement learning emergency control strategy of power system Pending CN114004282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111188349.9A CN114004282A (en) 2021-10-12 2021-10-12 Method for extracting deep reinforcement learning emergency control strategy of power system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111188349.9A CN114004282A (en) 2021-10-12 2021-10-12 Method for extracting deep reinforcement learning emergency control strategy of power system

Publications (1)

Publication Number Publication Date
CN114004282A true CN114004282A (en) 2022-02-01

Family

ID=79922683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111188349.9A Pending CN114004282A (en) 2021-10-12 2021-10-12 Method for extracting deep reinforcement learning emergency control strategy of power system

Country Status (1)

Country Link
CN (1) CN114004282A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116151047A (en) * 2023-04-21 2023-05-23 嘉豪伟业科技有限公司 Power dispatching data network fault simulation method and system
CN116151047B (en) * 2023-04-21 2023-06-27 嘉豪伟业科技有限公司 Power dispatching data network fault simulation method and system

Similar Documents

Publication Publication Date Title
CN111310915B (en) Data anomaly detection defense method oriented to reinforcement learning
CN110968866B (en) Defense method for resisting attack for deep reinforcement learning model
CN110852448A (en) Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN113255936B (en) Deep reinforcement learning strategy protection defense method and device based on imitation learning and attention mechanism
CN110991027A (en) Robot simulation learning method based on virtual scene training
CN111104522A (en) Regional industry association effect trend prediction method based on knowledge graph
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN110109358B (en) Feedback-based hybrid multi-agent cooperative control method
CN112491818B (en) Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN111461325B (en) Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN114757351A (en) Defense method for resisting attack by deep reinforcement learning model
CN112069504A (en) Model enhanced defense method for resisting attack by deep reinforcement learning
CN114757362A (en) Multi-agent system communication method based on edge enhancement and related device
CN114004282A (en) Method for extracting deep reinforcement learning emergency control strategy of power system
CN115345222A (en) Fault classification method based on TimeGAN model
CN113276852B (en) Unmanned lane keeping method based on maximum entropy reinforcement learning framework
CN113936140A (en) Evaluation method of sample attack resisting model based on incremental learning
CN111348034B (en) Automatic parking method and system based on generation countermeasure simulation learning
CN117406100A (en) Lithium ion battery remaining life prediction method and system
CN115761654B (en) Vehicle re-identification method
Wu et al. Fault diagnosis of TE process based on incremental learning
Bar et al. Deep Reinforcement Learning Approach with adaptive reward system for robot navigation in Dynamic Environments
CN115562835A (en) Agile satellite imaging task scheduling method and system based on data driving
CN113298255B (en) Deep reinforcement learning robust training method and device based on neuron coverage rate

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination