Power distribution network fault outage rate prediction method and system based on improved random forest
Technical Field
The invention belongs to the technical field of power distribution network fault and power failure reliability assessment, and particularly relates to a power distribution network fault power failure rate prediction method and system based on an improved random forest.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The distribution network transmits and distributes electric energy within the power system and strongly influences the stability of the power system. According to statistics, 80% of power failure events are caused by outages of the distribution network.
Outages caused by distribution network faults depend on many factors, so predicting the fault outage rate requires considering line architecture, weather conditions, human factors and other influences.
At present, the fault outage rate of the power distribution network is mainly studied with analytic methods and simulation methods. Both build a mathematical model that depends on the grid structure, which is a major limitation. Meanwhile, data-driven prediction of the outage rate with machine learning algorithms is becoming widely applied. However, artificial neural networks are difficult to tune and converge slowly. Support vector machines overcome the slow convergence and overfitting of artificial neural networks, but they handle large sample sets poorly and have low accuracy on multi-class problems.
Therefore, a method that efficiently improves prediction of the fault outage rate level of the power distribution network is currently needed for assessing distribution network outage reliability.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides the power distribution network fault outage rate prediction method based on the improved random forest, which is not limited to the grid structure of the power distribution network any more, and realizes effective prediction of the power distribution network fault outage rate level from fault record data.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the method for predicting the power failure rate of the power distribution network based on the improved random forest comprises the following steps:
acquiring power distribution network fault and power failure related data, extracting characteristic quantities, and calculating fault and power failure reliability parameters based on the extracted characteristic quantities;
according to the fault power failure reliability parameters and interval distribution, classifying and numbering the fault rate levels of the fault power failure reliability parameters as labels of a sample set;
carrying out weight analysis on different characteristic quantities by using a principal component analysis method to obtain weight coefficients;
removing the characteristic quantities with smaller coefficients according to the feature weight coefficients obtained by principal component analysis, so as to reduce the complexity of the model (the selection is generally made using a scree plot and the cumulative contribution rate);
and obtaining a prediction model by applying a random forest algorithm optimized by a gray wolf optimization algorithm to the characteristic quantity data set obtained by the principal component analysis method and the sample set label data.
According to a further technical scheme, the fault power failure reliability parameters comprise: the power failure rate of the faults of the overhead lines, the power failure rate of the faults of the cable lines, the power failure rate of the faults of the distribution transformers, the power failure rate of the faults of the load switches and the power failure rate of the faults of the circuit breakers.
According to a further technical scheme, the extracted characteristic quantities comprise: the number of faults of each facility within a certain period; the weather factors, equipment factors, human factors and environmental factors that cause the faults; and the condition of the facility itself.
According to a further technical scheme, a principal component analysis method is used to carry out weight analysis on the different types of data to obtain weight coefficients, and the characteristic quantities with large weight coefficients (generally selected so that the cumulative contribution rate reaches 85% or more, or by means of a scree plot) form the model input vector.
According to a further technical scheme, the feature vector is input to the prediction model to obtain a prediction result.
The above one or more technical solutions have the following beneficial effects:
the invention optimizes the parameter selection of the random forest algorithm and improves the accuracy of the model. Meanwhile, the method has high training speed and can effectively process high-dimensional data. Compared with the prior art, the method is not limited to the grid structure of the power distribution network any more, and the effective prediction of the power failure rate level of the power distribution network is realized from the fault record data.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention. They illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a principal component analysis flow chart according to an embodiment of the present invention;
FIG. 3 is a flow chart of the improved random forest algorithm according to an embodiment of the present invention;
FIG. 4 is a diagram of the training results of the improved random forest algorithm according to an embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
Referring to the attached figure 1, the embodiment discloses a power distribution network fault outage rate level prediction method based on an improved random forest, which comprises the following steps:
step one, processing the power distribution network fault history records into a usable data format, wherein the data serve as characteristic quantities and comprise the values of the characteristic quantities listed in Table 1;
step two, classifying and numbering the fault rate levels of the various facilities in the power distribution network according to their fault rates and interval distribution, wherein the fault rate levels serve as labels of the sample set; the random forest algorithm is a supervised learning algorithm, so labels must be provided for the characteristic quantities;
step three, performing weight analysis on the input data by using a principal component analysis method, and obtaining the processed input data according to the weight coefficients;
and step four, training an improved random forest algorithm on the characteristic quantity data set obtained by the principal component analysis method and the sample set label data to obtain a prediction model.
In step one, the extracted data include: the number of faults of each facility (overhead lines, cable lines, transformers and the like) within a certain period; the weather factors, equipment factors, human factors and environmental factors that caused the faults; and the condition of the facility itself (years in operation, line length and insulation rate). These serve as the original characteristic quantities, and the final input characteristic quantities are subsequently selected by principal component analysis.
The relevant data in the electric energy system, the PMS system and the distribution network fault records of the power system are organized into a usable data format, and the following items are used as input characteristic quantities.
TABLE 1 Power distribution network Fault Power failure related data
The line fault outage rate equals the number of line fault outages counted in the electric energy system divided by the line length in the PMS (times/(100 km·year)). The switchgear fault outage rate equals the number of switchgear fault outages counted in the electric energy system divided by the number of switchgear units in the PMS (times/(100 units·year)). The distribution transformer fault outage rate equals the number of fault outages of public distribution transformers counted in the electric energy system divided by the number of public distribution transformers counted in the electric energy system (times/(100 units·year)).
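The three rates share one form, which can be sketched as a small helper. This is a minimal illustration that assumes the normalization implied by the stated units (per 100 km or per 100 units, per year); the function name, its parameters and the figures are hypothetical and do not come from the fault records.

```python
def fault_outage_rate(outage_count, quantity, years=1.0, per=100.0):
    """Fault outages per `per` units (km of line or pieces of equipment) per year.

    Assumption: `quantity` is the line length in km or the number of devices,
    and the count is normalized to 100 units and one year, as the units suggest.
    """
    if quantity <= 0 or years <= 0:
        raise ValueError("quantity and years must be positive")
    return outage_count / quantity / years * per


# Hypothetical figures, for illustration only.
line_rate = fault_outage_rate(12, 240.0)            # 12 outages on 240 km in one year
switch_rate = fault_outage_rate(3, 600, years=2.0)  # 3 outages among 600 switches in two years
```

With these made-up inputs the line rate is 5 outages per 100 km per year and the switch rate is 0.25 outages per 100 units per year.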
In the fourth step, the parameters of the random forest algorithm are optimized by using the gray wolf optimization algorithm.
Specifically, a principal component analysis method is used to carry out weight analysis on the initial characteristic quantity data and obtain weight coefficients; the flow of obtaining the weights by principal component analysis is shown in FIG. 2. The eigenvalues M and eigenvectors T are obtained from the covariance matrix, and the normalized eigenvalues are the weights $\omega_j$ of the different characteristic quantities. The main characteristic quantities are then selected according to their weights, thereby reducing the complexity of the model.
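The normalization and selection step can be sketched as follows. This is a minimal illustration that assumes the eigenvalues of the covariance matrix have already been computed and uses the 85% cumulative contribution rate mentioned earlier as the cut-off; the function names and the eigenvalues are hypothetical.

```python
def component_weights(eigenvalues):
    """Normalize eigenvalues so that they sum to 1 (the weights)."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def select_by_cumulative_contribution(eigenvalues, threshold=0.85):
    """Keep the largest-weight items until their cumulative weight reaches
    `threshold`; return the kept indices and the cumulative weight."""
    weights = component_weights(eigenvalues)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i]
        if cumulative >= threshold:
            break
    return kept, cumulative


# Hypothetical eigenvalues of a 6 x 6 covariance matrix.
eigenvalues = [4.0, 2.5, 1.5, 1.0, 0.5, 0.5]
kept, cumulative = select_by_cumulative_contribution(eigenvalues)
```

Here the first four items are kept, because the first three reach only about 80% and the fourth brings the cumulative contribution to about 90%.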
In step four, a random forest algorithm whose initial parameters (the number of decision trees and the number of split attributes) are optimized by the gray wolf optimization algorithm is trained by supervised learning on the characteristic quantity data set and sample set label data produced by principal component analysis, with the feature vector as input. The principle is shown in FIG. 3.
The gray wolf optimization algorithm mathematically models the hunting process of a gray wolf pack and gradually approaches the optimal solution under the guidance of the α, β and δ wolves (the three candidate solutions with the highest fitness values). The algorithm has few parameters to tune and strong global search capability, so it can be used to search for the optimal parameters of the random forest and improve the accuracy of the model. In the iterative process, the positions of α, β and δ stand in for the position of the optimal solution, and the specific process is given by the formulas:
wherein $X$ is the solution corresponding to the current gray wolf; $X_\alpha$, $X_\beta$, $X_\delta$ are the candidate solutions α, β and δ; $D_\alpha$, $D_\beta$, $D_\delta$ are the distances from the current gray wolf to α, β and δ; and $K_1$, $K_2$, $K_3$ and $A_1$, $A_2$, $A_3$ are coefficient vectors. $K = 2r_1$, where $r_1$ is a random vector in [0, 1]; $K$ assigns a random weight to the prey, randomly strengthening or weakening (depending on the magnitude of $r_1$) the prey's effect on the distance between prey and wolf, which helps explore the search space. $A = 2a r_2 - a$, where $a$ decreases gradually from 2 to 0 and $r_2$ is a random vector in [0, 1]. As $A$ decreases, half of the iterations are spent on the case $|A| > 1$ and half on the case $|A| < 1$.
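The formulas themselves do not appear in this text. Assuming they follow the standard gray wolf optimizer, they can be written with the symbols defined above as shown below; the intermediate positions $X_1$, $X_2$, $X_3$ and the iteration index $t$ are notation introduced here for illustration.

$$
D_\alpha = \lvert K_1 X_\alpha - X \rvert, \qquad
D_\beta = \lvert K_2 X_\beta - X \rvert, \qquad
D_\delta = \lvert K_3 X_\delta - X \rvert \tag{1}
$$

$$
X_1 = X_\alpha - A_1 D_\alpha, \qquad
X_2 = X_\beta - A_2 D_\beta, \qquad
X_3 = X_\delta - A_3 D_\delta \tag{2}
$$

$$
X(t+1) = \frac{X_1 + X_2 + X_3}{3} \tag{3}
$$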
According to formulas 1, 2 and 3, the other wolves determine their positions for the next iteration. After all wolves have been updated, the three best solutions obtained become the new α, β and δ. The gray wolf optimization algorithm is a parameter optimization algorithm; here it optimizes the initial parameters of the random forest algorithm, namely the number of decision trees and the number of split attributes, thereby improving the accuracy of the model.
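The iteration described above can be sketched in a few lines. This is a minimal illustration, not the implementation of the embodiment: it assumes the standard gray wolf update (distance to each leader, step toward it, average of the three estimates), a linear decrease of a from 2 toward 0, and that the three best solutions found so far are retained as leaders. The function names, the toy fitness function and all default values are hypothetical.

```python
import random


def gray_wolf_optimize(fitness, bounds, n_wolves=20, n_iter=100, seed=0):
    """Maximize `fitness` over the box `bounds` = [(low, high), ...]."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    leaders = []  # alpha, beta, delta: the three fittest solutions found so far
    for t in range(n_iter):
        # Rank the pack together with the previous leaders (assumed elitism).
        ranked = sorted(wolves + leaders, key=fitness, reverse=True)
        leaders = [list(w) for w in ranked[:3]]
        a = 2.0 * (1.0 - t / n_iter)  # decreases linearly from 2 toward 0
        updated = []
        for x in wolves:
            position = []
            for d, (lo, hi) in enumerate(bounds):
                estimate = 0.0
                for leader in leaders:
                    k = 2.0 * rng.random()              # K = 2 * r1
                    big_a = 2.0 * a * rng.random() - a  # A = 2 * a * r2 - a
                    dist = abs(k * leader[d] - x[d])    # D = |K * X_leader - X|
                    estimate += leader[d] - big_a * dist
                # Average the three estimates and keep the wolf inside the box.
                position.append(min(max(estimate / 3.0, lo), hi))
            updated.append(position)
        wolves = updated
    best = max(wolves + leaders, key=fitness)
    return best, fitness(best)


def toy_fitness(x):
    """Toy objective with its maximum (zero) at the origin."""
    return -(x[0] ** 2 + x[1] ** 2)


best, best_fitness = gray_wolf_optimize(toy_fitness, [(-5.0, 5.0), (-5.0, 5.0)])
```

On this toy problem the returned position lies close to the origin. For the prediction model, the fitness would instead score a random forest trained with the candidate parameters.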
The random forest algorithm is a machine learning algorithm. It uses the bootstrap resampling method to draw a certain number of samples from the original samples to build decision trees, and then obtains the final prediction by voting over the results of the individual decision trees. Numerous theoretical and empirical studies show that the random forest algorithm has high prediction accuracy, tolerates outliers and noise well, and is not prone to overfitting.
The random forest is composed of a plurality of decision trees $\{h(x, \theta_k),\ k = 1, 2, \ldots, n\}$, where $\{\theta_k\}$ are independent and identically distributed random vectors and $n$ is the number of decision trees in the random forest. Finally, for a given independent variable $x$, all decision trees vote together to determine the output. The specific implementation steps are as follows:
(1) The random forest uses the bootstrap resampling method to draw a certain number of samples from the original training sample set each time, generating a sub-sample set; each sub-sample set corresponds to one classification tree.
(2) At each node of the tree, m features are randomly selected from the M features, and one feature α is selected from these m features as the split attribute of the node according to the principle of minimum node impurity.
(3) The node is split into 2 branches according to feature α, and the feature with the best classification effect is then sought among the remaining features, so that the branches of the classification tree are built recursively. The tree is grown fully, without pruning, so that the impurity of each node is minimized, until the tree classifies the training set accurately or all attributes have been used.
(4) In the classification stage, the class label is obtained by combining the results of all classification trees. Random forests use the voting principle, namely:
where $N$ is the number of decision trees in the forest, $I(\cdot)$ is the indicator function, $n_{h_i,c}$ is the classification result of tree $h_i$ for class $c$, and $n_{h_i}$ is the number of leaf nodes of tree $h_i$.
The generated classification trees form a random forest, and the random forest classifier is used to discriminate and classify new data; the classification result is determined by the number of votes cast by the tree classifiers.
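The voting in step (4) can be sketched as a plain majority vote. This is a minimal illustration that assumes each tree returns a single class label and that ties go to the smallest label; the stand-in trees and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class with the most votes; ties go to the smallest label."""
    counts = Counter(tree_predictions)
    most = max(counts.values())
    return min(label for label, n in counts.items() if n == most)


def forest_predict(trees, x):
    """Classify x by voting over the predictions of all trees."""
    return majority_vote([tree(x) for tree in trees])


# Three stand-in "trees" (hypothetical threshold rules on a single feature).
trees = [
    lambda x: 0 if x[0] < 0.5 else 1,
    lambda x: 0 if x[0] < 0.3 else 1,
    lambda x: 0 if x[0] < 0.7 else 1,
]
label = forest_predict(trees, [0.6])  # the trees vote 1, 1 and 0
```

Two of the three stand-in trees vote for class 1, so the forest returns class 1 for this input.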
The random forest uses bootstrap sampling, and when each tree is grown, the split variable at each node is chosen from only a few randomly selected variables. Thus both the samples and the choice of each node variable are random.
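Both sources of randomness can be sketched together. This is a minimal illustration that assumes Gini impurity as the impurity measure (the text does not name one) and threshold splits on numeric features; the function names and the data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) samples with replacement: one sub-sample set per tree."""
    picks = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in picks], [labels[i] for i in picks]


def gini(labels):
    """Gini impurity of a list of class labels (assumed impurity measure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, m, rng):
    """Among m randomly chosen features, return (impurity, feature, threshold)
    of the split with the lowest weighted impurity, or None if none exists."""
    candidates = rng.sample(range(len(rows[0])), m)
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send samples to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


rng = random.Random(0)
rows = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]  # hypothetical data
labels = [0, 0, 1, 1]
sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
split = best_split(rows, labels, m=2, rng=rng)
```

On this toy data the second feature is constant and offers no split, so the chosen split is on the first feature at threshold 2.0, which separates the two classes with zero impurity.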
Two parameters must be set when generating a random forest: the number of decision trees to generate and the number of features selected at each node. The choice of these two parameters determines whether the generated random forest makes full use of the information in the feature vectors, and thus determines the accuracy of the resulting model. Therefore, the gray wolf optimization algorithm is used to find the optimal number of decision trees and number of features so as to obtain the best prediction effect; the specific flow is shown in FIG. 3.
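One way to connect the two algorithms is to treat a wolf's position as a pair of real numbers and decode it into the two integer parameters before scoring. The sketch below is a hypothetical illustration: the rounding and clamping rule, the upper limit of 500 trees and the stand-in scoring function are assumptions, and in practice the scorer would train a random forest and return, for example, its validation accuracy.

```python
def decode_position(position, n_features, max_trees=500):
    """Map a continuous position to (number of trees, features per node)."""
    n_trees = min(max(int(round(position[0])), 1), max_trees)
    n_split = min(max(int(round(position[1])), 1), n_features)
    return n_trees, n_split


def make_fitness(evaluate_accuracy, n_features):
    """Wrap a scorer taking (n_trees, n_split) into a fitness over positions."""
    def fitness(position):
        n_trees, n_split = decode_position(position, n_features)
        return evaluate_accuracy(n_trees, n_split)
    return fitness


def fake_accuracy(n_trees, n_split):
    """Stand-in scorer (hypothetical): peaks at 100 trees and 3 features."""
    return 1.0 - abs(n_trees - 100) / 1000.0 - abs(n_split - 3) / 10.0


fitness = make_fitness(fake_accuracy, n_features=8)
score = fitness([100.2, 2.9])  # decodes to 100 trees and 3 features per node
```

Clamping keeps every candidate valid, so the optimizer never asks for zero trees or for more features than exist.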
The method provided by the invention for predicting the fault outage rate level of the power distribution network takes a data-driven approach and overcomes the limitation of analytic and simulation methods to the grid structure. It offers strong processing capacity for large samples, fast convergence and high prediction accuracy, and can effectively predict the fault outage rate level of the power distribution network.
Meanwhile, the invention can also be combined with other functions of the user, such as a power distribution network fault data acquisition and storage function, a fault data processing function and the like, so that the power distribution network fault outage rate level can be known and mastered more timely and accurately.
Example two
The present embodiment aims to provide a computing device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the following steps, including:
step one, processing a power distribution network fault historical record into an available data type as a characteristic quantity;
step two, classifying and numbering the fault rate levels of various facilities in the power distribution network according to the fault rates of the facilities and interval distribution, and using the fault rate levels as labels of a sample set;
thirdly, performing weight analysis on the input data by using a principal component analysis method, and obtaining the processed input data according to a weight coefficient;
and step four, training data by using an improved random forest algorithm to obtain a prediction model.
Example three
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, performs the steps of:
step one, processing a power distribution network fault historical record into an available data type as a characteristic quantity;
step two, classifying and numbering the fault rate levels of various facilities in the power distribution network according to the fault rates of the facilities and interval distribution, and using the fault rate levels as labels of a sample set;
thirdly, performing weight analysis on the input data by using a principal component analysis method, and obtaining the processed input data according to a weight coefficient;
and step four, training data by using an improved random forest algorithm to obtain a prediction model.
Example four
The present embodiment aims to provide a power distribution network fault outage rate prediction system based on an improved random forest, which includes a server configured to:
step one, processing a power distribution network fault historical record into an available data type as a characteristic quantity;
step two, classifying and numbering the fault rate levels of various facilities in the power distribution network according to the fault rates of the facilities and interval distribution, and using the fault rate levels as labels of a sample set;
thirdly, performing weight analysis on the input data by using a principal component analysis method, and obtaining the processed input data according to a weight coefficient;
and step four, training data by using an improved random forest algorithm to obtain a prediction model.
The steps involved in the devices of embodiments two, three and four correspond to those of method embodiment one; for details, see the relevant description of embodiment one. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that causes the processor to perform any of the methods of the present invention.
Results of the experiment
Taking prediction of the overhead line fault outage rate level as an example, the original data are obtained from local historical records of overhead line fault outages and from equipment information; part of the data is shown in Table 2 below:
TABLE 2
The fault rate is divided by interval into four classes according to its magnitude: 0, (0, 0.1], (0.1, 0.4), and 0.4 or above, labeled as classes 0, 1, 2 and 3 respectively, which serve as the class labels of the input characteristic quantities.
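The labeling rule can be sketched as a small function. This is a minimal illustration that assumes the last class starts at 0.4 inclusive, which is one reading of the intervals as printed; the function name is hypothetical.

```python
def outage_rate_level(rate):
    """Map a fault outage rate to its class label 0, 1, 2 or 3."""
    if rate < 0:
        raise ValueError("rate must be non-negative")
    if rate == 0:
        return 0   # no fault outages
    if rate <= 0.1:
        return 1   # (0, 0.1]
    if rate < 0.4:
        return 2   # (0.1, 0.4)
    return 3       # 0.4 or above (assumed boundary handling)


levels = [outage_rate_level(r) for r in (0, 0.05, 0.1, 0.25, 0.4, 1.2)]
```

For the six sample rates the labels are 0, 1, 1, 2, 3 and 3.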
The characteristic quantities with larger weights are screened out by principal component analysis and used as the input vector. The improved random forest algorithm is trained on the input characteristic quantity sample set together with the label sample set, and the training result is shown in FIG. 4.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present invention, and it should be understood that those skilled in the art can make various modifications and variations without inventive effort on the basis of the technical solution of the present invention.