CN111401783A - Power system operation data integration feature selection method - Google Patents
- Publication number
- CN111401783A (application CN202010265810.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Abstract
The invention discloses a power system operation data integration feature selection method, which comprises the following steps: S1, extracting K training subsets; S2, performing correlation analysis; S3, ranking the features on each training subset by weight to obtain different results; S4, aggregating the results to obtain the optimal feature subset. On the basis of random sampling and the RReliefF feature selection algorithm, the invention provides an integrated feature selection framework for power plant operation data, which improves the stability of the algorithm, removes redundant data, and improves time efficiency.
Description
Technical Field
The invention relates to a feature selection method, in particular to a power system operation data integration feature selection method.
Background
With the continuous expansion of the smart grid, historical operation data have become very large in volume and high in dimensionality. When such data are analyzed and modeled, using all attributes as model inputs not only increases the computational burden but also leads to the curse of dimensionality, causing model overfitting and reduced generalization capability. Dimensionality reduction is therefore required before modeling.
The common dimensionality reduction techniques are feature extraction and feature selection. Feature extraction maps data from a high-dimensional space to a low-dimensional feature space, usually through a mathematical transformation such as projection; typical methods include Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), and Canonical Correlation Analysis (CCA). However, the new features produced by feature extraction lose the physical meaning of the original features and are weakly interpretable, which is unacceptable in many problems. Feature selection, in contrast, selects from the original feature space the smallest subset of features that maximizes an evaluation criterion. The selected features keep their original physical meaning and are therefore highly interpretable, an obvious advantage, and feature selection has been widely applied in recent years to bioinformatics, computer vision, object recognition, and other fields. In practical applications, a feature selection algorithm is required not only to select good features but also to be stable. Feature selection stability means that the method is robust to small perturbations of the training samples: a stable feature selection method should produce the same or similar feature subsets when the training samples are slightly perturbed. Improving the stability of feature selection helps to find truly relevant features, strengthens the confidence of domain experts in the results, and further reduces the complexity and time cost of data acquisition.
Feature selection instability has three main sources. (1) Instability of the algorithm itself: most existing feature selection methods aim only at selecting a minimal feature subset and neglect the stability of the selection. (2) Highly redundant features: if a feature selection algorithm achieves the same learning accuracy on different feature subsets of the same dataset, the selected attributes are not stable. (3) The high-dimensional small-sample problem: in some practical problems such as gene detection, there are usually only a few hundred samples but thousands of features. In a high-dimensional small-sample space, a small change in the sample set alters the data distribution, and a change in the data distribution alters the selection result, causing differences in the selected features.
The RReliefF algorithm can handle nonlinear data and is computationally efficient, making it a well-known feature selection method. However, it has two shortcomings. First, the algorithm does not consider the stability of feature selection: because of the characteristics of the data and of the algorithm itself, repeated runs may produce different optimal subsets. Second, it cannot remove redundant features: any feature whose weight contributes positively to the predicted value is retained, and the algorithm ignores correlations among features, whereas power plant operation data contain many redundant features with high coupling and strong correlation among them.
Disclosure of Invention
The main object of the invention is to provide a power system operation data integration feature selection method.
The technical solution adopted by the invention is as follows. A power system operation data integration feature selection method comprises the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
Further, the step S1 specifically includes:
Bootstrap sampling with replacement: n samples are drawn from the original data and m feature dimensions are selected to form a sub-dataset, and this is repeated K times, where n, m and K are set by the user.
Further, the step S2 specifically includes:
Correlation analysis is performed on each training subset using the Pearson correlation algorithm. If the absolute value of the correlation coefficient between two features exceeds a preset threshold, one of the two features is deleted at random, yielding K reduced training subsets. Through this step, redundant data are removed.
Further, the step S3 specifically includes:
The features on each training subset are ranked by weight using the RReliefF algorithm, yielding K different ranking results.
The invention has the advantages that:
the invention provides an integrated feature selection framework aiming at the operation data of the power plant on the basis of random sampling and RReliefF feature selection algorithm, thereby improving the stability of the algorithm, removing redundant data and improving the time efficiency.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages, which will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is a flow chart of a method for selecting an integrated feature of operating data of an electrical power system according to an embodiment of the present invention;
FIG. 2 is an integrated feature selection framework diagram of an embodiment of the present invention;
FIG. 3 is a graph comparing the stability of the results of experiments according to examples of the present invention;
FIG. 4 is a graph comparing the time efficiency of embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to FIG. 1, the power system operation data integration feature selection method comprises the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
On the basis of random sampling and the RReliefF feature selection algorithm, the invention provides an integrated feature selection framework for power plant operation data, which improves the stability of the algorithm, removes redundant data, and improves time efficiency.
The step S1 specifically includes:
Bootstrap is a random sampling method with replacement. Specifically, n samples are drawn from the original data and m feature dimensions are selected to form a sub-dataset; this is repeated K times, where n, m and K are set by the user.
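For illustration, a minimal Python sketch of this sampling step; the function name and parameters are chosen here for illustration and do not appear in the patent. It draws n rows with replacement and m feature columns without replacement, K times:

```python
import numpy as np

def bootstrap_subsets(X, y, n, m, K, seed=0):
    """Draw K sub-datasets: n rows with replacement, m feature columns without."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(K):
        rows = rng.integers(0, X.shape[0], size=n)            # sampling with replacement
        cols = rng.choice(X.shape[1], size=m, replace=False)  # m feature dimensions
        subsets.append((X[np.ix_(rows, cols)], y[rows], cols))
    return subsets

# e.g. subsets = bootstrap_subsets(X, y, n=500, m=20, K=10)
```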
The step S2 specifically includes:
Correlation analysis is performed on each training subset using the Pearson correlation algorithm. If the absolute value of the correlation coefficient between two features exceeds a preset threshold, one of the two features is deleted at random, yielding K reduced training subsets. Through this step, redundant data are removed.
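A minimal sketch of this redundancy filter, assuming a pandas DataFrame and a user-chosen threshold (the value 0.9 below is illustrative; the patent does not fix a threshold). When two features correlate above the threshold, one of the pair is dropped at random:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9, seed=0):
    """Drop one feature from every pair whose |Pearson correlation| exceeds threshold."""
    rng = np.random.default_rng(seed)
    corr = df.corr(method="pearson").abs()
    keep = list(df.columns)
    for i, a in enumerate(df.columns):
        for b in df.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > threshold:
                keep.remove(rng.choice([a, b]))   # delete one of the two at random
    return df[keep]
```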
The step S3 specifically includes:
The features on each training subset are ranked by weight using the RReliefF algorithm, yielding K different ranking results.
Introduction to the RReliefF algorithm:
The Relief algorithm was first proposed by Kira et al. in 1992 as an efficient feature weighting algorithm for data reduction, applicable to classification problems. The ReliefF algorithm was later proposed to handle multi-class problems. The main idea of the algorithm is to weight features according to their ability to distinguish samples of different classes within a neighborhood: a good feature brings samples of the same class close together and pushes samples of different classes apart. Features whose weight falls below a preset threshold are removed, finally yielding the optimal feature subset.
The ReliefF algorithm assigns an initial weight to each feature in the dataset. The weight W[A] is then updated iteratively with the following formula (shown here in its basic single-neighbour form) over m randomly sampled instances, after which the final weights are obtained:

$$W[A] \leftarrow W[A] - \frac{\mathrm{diff}(A, R_i, H)}{m} + \frac{\mathrm{diff}(A, R_i, M)}{m}$$
In the above formula, W[A] denotes the weight of feature A, and R_i is a sample randomly selected from the training set. H denotes the nearest sample to R_i of the same class (the nearest hit), and M denotes the nearest sample to R_i of a different class (the nearest miss). If R_i and H have different values of feature A, feature A separates two samples of the same class, so the weight of feature A is decreased. If R_i and M have different values of feature A, feature A separates two samples of different classes, so the weight of feature A is increased. The function diff is calculated as follows:
For numerical attributes:

$$\mathrm{diff}(A, I_1, I_2) = \frac{\lvert \mathrm{value}(A, I_1) - \mathrm{value}(A, I_2) \rvert}{\max(A) - \min(A)}$$

For discrete attributes:

$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0, & \mathrm{value}(A, I_1) = \mathrm{value}(A, I_2) \\ 1, & \mathrm{value}(A, I_1) \neq \mathrm{value}(A, I_2) \end{cases}$$
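As an illustration, a minimal Python sketch of the basic Relief update described above (single nearest hit and miss per sampled instance, numerical features scaled so that diff lies in [0, 1]); all names are illustrative and not part of the patent:

```python
import numpy as np

def relief_weights(X, y, m_iters=100, seed=0):
    """Basic Relief update: W[A] -= diff(A,R,H)/m ; W[A] += diff(A,R,M)/m."""
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # scales diff into [0, 1]
    W = np.zeros(X.shape[1])
    for _ in range(m_iters):
        i = rng.integers(len(X))
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                           # exclude the sampled instance itself
        hit = np.where(y == y[i])[0]
        miss = np.where(y != y[i])[0]
        H = X[hit[np.argmin(dist[hit])]]           # nearest sample of the same class
        M = X[miss[np.argmin(dist[miss])]]         # nearest sample of a different class
        W += (np.abs(X[i] - M) - np.abs(X[i] - H)) / span / m_iters
    return W
```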
On the basis of ReliefF, Kononenko et al. proposed the RReliefF algorithm for regression problems. Because the predicted value in a regression problem is continuous, samples cannot be divided into "same class" and "different class". To address this, RReliefF introduces the probability that two instances have different predicted values, modeled by the relative distance between their predicted values, in place of the hit/miss distinction.
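A condensed Python sketch of the RReliefF weight estimate for regression, following the standard accumulator formulation (NdY, NdA, NdYdA over the k nearest neighbours of each sampled instance); the patent itself does not reproduce these formulas, so this is an illustrative reconstruction that simplifies the neighbour influence to a uniform weight:

```python
import numpy as np

def rrelieff_weights(X, y, m_iters=100, k=10, seed=0):
    """RReliefF estimate: W[A] = NdYdA/NdY - (NdA - NdYdA)/(m - NdY)."""
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    y_span = y.max() - y.min() + 1e-12
    NdY, NdA, NdYdA = 0.0, np.zeros(X.shape[1]), np.zeros(X.shape[1])
    for _ in range(m_iters):
        i = rng.integers(len(X))
        dist = np.abs((X - X[i]) / span).sum(axis=1)
        dist[i] = np.inf
        for j in np.argsort(dist)[:k]:             # k nearest neighbours
            w = 1.0 / k                            # uniform neighbour influence (simplified)
            d_y = abs(y[i] - y[j]) / y_span        # probability that the predictions differ
            d_a = np.abs(X[i] - X[j]) / span       # per-feature difference
            NdY += w * d_y
            NdA += w * d_a
            NdYdA += w * d_y * d_a
    return NdYdA / NdY - (NdA - NdYdA) / (m_iters - NdY)
```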
The integrated feature selection framework of the present invention, BPRR (Bootstrap-Pearson-RReliefF), is an improvement built on these algorithms.
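Putting the pieces together, a sketch of the BPRR flow under the assumptions above. The patent does not state how the K rankings are aggregated in step S4, so the sketch uses the mean rank of each feature across subsets as one plausible choice; the helper names (bootstrap_subsets, drop_correlated, rrelieff_weights) come from the illustrative snippets earlier in this description:

```python
import numpy as np
import pandas as pd

def bprr(df, target, n, m, K, threshold=0.9, top=10, seed=0):
    """BPRR sketch: bootstrap sampling -> Pearson filter -> RReliefF -> rank aggregation."""
    features = df.drop(columns=[target])
    X, y, cols = features.to_numpy(), df[target].to_numpy(), features.columns
    rank_sum, counts = {}, {}
    for Xs, ys, idx in bootstrap_subsets(X, y, n, m, K, seed):
        sub = drop_correlated(pd.DataFrame(Xs, columns=cols[idx]), threshold, seed)
        w = rrelieff_weights(sub.to_numpy(), ys)
        for rank, c in enumerate(sub.columns[np.argsort(-w)]):   # best feature first
            rank_sum[c] = rank_sum.get(c, 0) + rank
            counts[c] = counts.get(c, 0) + 1
    mean_rank = {c: rank_sum[c] / counts[c] for c in rank_sum}
    return sorted(mean_rank, key=mean_rank.get)[:top]            # aggregated feature subset
```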
Stability measurement index:
the stability of the algorithm is evaluated by selecting and expanding an extension of Kunzewav similarity measure (extension of Kuncheva similarity measure) index. The kunzhewa similarity measure is one of the subset stability measure indexes, is extended from kunzhewa similarity measure (kunchheva similarity measure), and can be used for measuring the feature subset similarity of different feature numbers. The calculation formula is as follows:
$$S(F_1, F_2) = \frac{r - \frac{k_1 k_2}{n}}{\min(k_1, k_2) - \max(0,\; k_1 + k_2 - n)}$$

where F_1 and F_2 denote the two feature subsets, k_1 = |F_1| and k_2 = |F_2| are their cardinalities, r = |F_1 ∩ F_2|, and n is the total number of features. The value of the extended Kuncheva similarity measure lies in [-1, 1]; the larger the value, the higher the similarity of the two feature subsets.
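A small Python helper for the extended Kuncheva index as reconstructed above (feature subsets given as collections of column names, n the total number of features); illustrative only:

```python
def extended_kuncheva(s1, s2, n):
    """Extended Kuncheva similarity for feature subsets of possibly different sizes."""
    k1, k2, r = len(s1), len(s2), len(set(s1) & set(s2))
    denom = min(k1, k2) - max(0, k1 + k2 - n)
    return (r - k1 * k2 / n) / denom if denom else 1.0   # degenerate subsets treated as identical
```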
Experimental verification:
In the experiments of the invention, the data are divided into 10, 20, …, 90 subsets. The BPRR framework and the RReliefF algorithm are then used to perform feature selection on these subsets, the extended Kuncheva similarity measure is computed over the feature selection results, and the values are averaged to obtain the stability of each algorithm for each number of subsets.
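For the stability evaluation, a sketch of one way to average the extended Kuncheva index over all pairs of selection results (one selected feature subset per run); this mirrors the averaging described above, although the exact pairing scheme is not spelled out in the patent, and it reuses the illustrative extended_kuncheva helper defined earlier:

```python
from itertools import combinations

def mean_stability(selected_subsets, n_features):
    """Average extended Kuncheva similarity over all pairs of selection results."""
    pairs = list(combinations(selected_subsets, 2))
    return sum(extended_kuncheva(a, b, n_features) for a, b in pairs) / len(pairs)
```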
The stability of RReliefF and of the integrated feature selection framework is measured in this way, and the results are shown in FIG. 3 and FIG. 4.
and (4) conclusion:
the following two main aspects can be seen from the figure: first, the stability of the two algorithms is compared on a stability comparison graph. RReliefF has poor stability, does not perform well in the subset number of 10 to 90, and the integrated feature selection framework exhibits better stability. The results indicate that the integrated feature selection framework is effective in improving the stability of RReliefF.
Second, the time efficiency comparison (FIG. 4) shows that RReliefF runs much slower than the integrated feature selection framework; for example, at 90 subsets, RReliefF is 3 times slower than the integrated feature selection framework.
The experimental results show that, compared with the RReliefF algorithm, the proposed method improves the stability of feature selection, eliminates redundant features, and improves efficiency. The method can be applied to the feature selection step of data preprocessing before modeling and prediction on large-scale power data: it screens out the attributes related to the target parameters from many power parameters, produces the same result when the data are subject to small disturbances, and thereby improves the reliability of feature selection.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (4)
1. A power system operation data integration feature selection method, characterized by comprising the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
3. The power system operation data integration feature selection method of claim 1, characterized in that the step S2 specifically comprises:
Correlation analysis is performed on each training subset using the Pearson correlation algorithm. If the absolute value of the correlation coefficient between two features exceeds a preset threshold, one of the two features is deleted at random, yielding K reduced training subsets. Through this step, redundant data are removed.
4. The power system operation data integration feature selection method of claim 1, characterized in that the step S3 specifically comprises:
The features on each training subset are ranked by weight using the RReliefF algorithm, yielding K different ranking results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010265810.5A CN111401783A (en) | 2020-04-07 | 2020-04-07 | Power system operation data integration feature selection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010265810.5A CN111401783A (en) | 2020-04-07 | 2020-04-07 | Power system operation data integration feature selection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111401783A true CN111401783A (en) | 2020-07-10 |
Family
ID=71431498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010265810.5A Pending CN111401783A (en) | 2020-04-07 | 2020-04-07 | Power system operation data integration feature selection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111401783A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114464266A (en) * | 2022-01-27 | 2022-05-10 | 东北电力大学 | Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156484A (en) * | 2016-06-08 | 2016-11-23 | 中国科学院自动化研究所 | Disease of brain individuation Forecasting Methodology based on nuclear magnetic resonance image and system |
CN106250442A (en) * | 2016-07-26 | 2016-12-21 | 新疆大学 | The feature selection approach of a kind of network security data and system |
CN107169514A (en) * | 2017-05-05 | 2017-09-15 | 清华大学 | The method for building up of diagnosing fault of power transformer model |
CN109636248A (en) * | 2019-01-15 | 2019-04-16 | 清华大学 | Feature selection approach and device suitable for transient stability evaluation in power system |
Non-Patent Citations (4)
Title |
---|
Wu Jiehua (伍杰华): "Complex network link classification based on the RReliefF feature selection algorithm", Computer Engineering (《计算机工程》) *
Liu Yi (刘艺) et al.: "Survey on the stability of feature selection", Journal of Software (《软件学报》) *
Zhang Lei (张磊) et al.: "Stable reference point selection based on a multi-objective ant colony algorithm", Computer Technology and Development (《计算机技术与发展》) *
Chai Mingrui (柴明锐) et al.: "Data Mining Technology and Its Applications in Petroleum Geology" (《数据挖掘技术及在石油地质中的应用》), Tianjin Science and Technology Press, 30 September 2017 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114464266A (en) * | 2022-01-27 | 2022-05-10 | 东北电力大学 | Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR |
CN114464266B (en) * | 2022-01-27 | 2022-08-02 | 东北电力大学 | Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110070141B (en) | Network intrusion detection method | |
US7542953B1 (en) | Data classification by kernel density shape interpolation of clusters | |
Ibrahim et al. | Cluster representation of the structural description of images for effective classification | |
Saha et al. | A new multiobjective clustering technique based on the concepts of stability and symmetry | |
Saha et al. | Simultaneous feature selection and symmetry based clustering using multiobjective framework | |
Belhaouari et al. | Optimized K‐Means Algorithm | |
CN111275127B (en) | Dynamic feature selection method based on condition mutual information | |
CN111460161A (en) | Unsupervised text theme related gene extraction method for unbalanced big data set | |
Min et al. | Automatic determination of clustering centers for “clustering by fast search and find of density peaks” | |
CN115344693A (en) | Clustering method based on fusion of traditional algorithm and neural network algorithm | |
Maddumala | A Weight Based Feature Extraction Model on Multifaceted Multimedia Bigdata Using Convolutional Neural Network. | |
Mandal et al. | Unsupervised non-redundant feature selection: a graph-theoretic approach | |
Zhang et al. | A new outlier detection algorithm based on fast density peak clustering outlier factor. | |
CN111401783A (en) | Power system operation data integration feature selection method | |
Wang et al. | Mining high-dimensional data | |
Rahman et al. | An efficient approach for selecting initial centroid and outlier detection of data clustering | |
Balaganesh et al. | Movie success rate prediction using robust classifier | |
CN111444989A (en) | Network intrusion detection method | |
CN110837853A (en) | Rapid classification model construction method | |
Li | Logistic and SVM credit score models based on lasso variable selection | |
Yang et al. | Adaptive density peak clustering for determinging cluster center | |
Schneider et al. | Expected similarity estimation for large scale anomaly detection | |
CN111382273A (en) | Text classification method based on feature selection of attraction factors | |
Hochma et al. | Efficient Feature Ranking and Selection using Statistical Moments | |
Claypo et al. | A new feature selection based on class dependency and feature dissimilarity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200710 |