CN111401783A - Power system operation data integration feature selection method - Google Patents

Power system operation data integration feature selection method

Info

Publication number
CN111401783A
Authority
CN
China
Prior art keywords
feature
feature selection
algorithm
features
power system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010265810.5A
Other languages
Chinese (zh)
Inventor
王勇
李磊
马强
管荑
李慧聪
田大伟
耿玉杰
刘勇
林琳
娄建楼
孙博
李建坡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Shandong Electric Power Co Ltd
Northeast Electric Power University
Original Assignee
State Grid Shandong Electric Power Co Ltd
Northeast Dianli University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Shandong Electric Power Co Ltd, Northeast Dianli University
Priority to CN202010265810.5A
Publication of CN111401783A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a power system operation data integration feature selection method comprising the following steps: S1, extracting K training subsets; S2, performing correlation analysis; S3, ranking the features on each training subset by weight to obtain different results; and S4, aggregating the results to obtain the optimal feature subset. Building on random sampling and the RReliefF feature selection algorithm, the invention provides an integrated (ensemble) feature selection framework for power plant operation data, which improves the stability of the algorithm, removes redundant data, and improves time efficiency.

Description

Power system operation data integration feature selection method
Technical Field
The invention relates to a feature selection method, in particular to a power system operation data integration feature selection method.
Background
At present, with the continuous expansion of the smart grid, the volume of historical data is huge and its dimensionality is high. When historical data are analyzed and modeled, using all attributes as model inputs not only increases the computational burden but also brings the curse of dimensionality, causing the model to overfit and reducing its generalization ability. Dimensionality reduction is therefore required before modeling.
The common dimensionality reduction techniques are feature extraction and feature selection. Feature extraction maps data from a high-dimensional space to a low-dimensional feature space by mathematical methods (e.g., projection); typical examples are Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), and Canonical Correlation Analysis (CCA). However, the physical meaning of the extracted features differs greatly from that of the original features, so the extracted features are hard to interpret, which is unacceptable in many problems. Feature selection, by contrast, selects from the original feature space the smallest feature subset that maximizes an evaluation criterion. The selected features keep the physical meaning of the original features and are therefore highly interpretable. Because of this clear advantage, feature selection has been widely applied in recent years in bioinformatics, computer vision, target recognition, and other fields. In practical applications, a feature selection algorithm must not only select features well but also select them stably. Feature selection stability means that the method is robust to small perturbations of the training samples: a stable method should produce the same or similar feature subsets when the training samples are slightly perturbed. Improving the stability of feature selection helps to find the relevant features, strengthens domain experts' confidence in the results, and further reduces the complexity and time cost of data acquisition.
Feature selection instability has three sources: (1) Instability of the algorithm itself: most existing feature selection methods aim only at selecting a minimal feature subset and neglect the stability of the selection. (2) High feature redundancy: if a feature selection algorithm achieves the same learning accuracy on different feature subsets of the data set, the selected attributes are unstable. (3) The high-dimensional, small-sample problem: in some practical problems such as gene detection, there are usually only a few hundred samples but thousands of features. In a high-dimensional, small-sample space, a small change in the sample set alters the data distribution, which in turn alters the feature selection result.
The RReliefF algorithm can handle nonlinear data, is computationally efficient, and is a well-known feature selection method. However, it has two defects. First, the algorithm does not take the stability of feature selection into account and is itself unstable: because of the characteristics of the data and of the algorithm, repeated runs may produce different optimal subsets. Second, it cannot remove redundant features: any feature that contributes positively to the predicted value receives a large weight and is retained, and the algorithm does not consider the correlation among features, whereas power plant operation data contain many redundant features that are highly coupled and strongly correlated.
Disclosure of Invention
The main object of the invention is to provide a power system operation data integration feature selection method.
The technical solution adopted by the invention is a power system operation data integration feature selection method comprising the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
Further, step S1 specifically comprises:
extracting K training subsets $D_1, D_2, \ldots, D_K$ from the original data set D by the Bootstrap random sampling method.
Further, step S2 specifically comprises:
performing correlation analysis on each training subset with the Pearson correlation coefficient and, if the absolute value of the correlation coefficient between two features exceeds a given threshold, randomly deleting one of the two features, to obtain K training subsets $D'_1, D'_2, \ldots, D'_K$.
This step removes redundant data.
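For reference, the Pearson correlation coefficient used in this step is the standard one. For two features $x$ and $y$ observed on $n$ samples, with sample means $\bar{x}$ and $\bar{y}$ (the symbols are chosen here for illustration and do not come from the original text):

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{xy} \le 1
```

A pair of features is treated as redundant when $|r_{xy}|$ exceeds the chosen threshold.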
Further, step S3 specifically comprises:
ranking the features on each training subset by weight with the RReliefF algorithm to obtain K different results.
The advantages of the invention are as follows:
building on random sampling and the RReliefF feature selection algorithm, the invention provides an integrated feature selection framework for power plant operation data, which improves the stability of the algorithm, removes redundant data, and improves time efficiency.
In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is a flow chart of a method for selecting an integrated feature of operating data of an electrical power system according to an embodiment of the present invention;
FIG. 2 is an integrated feature selection framework diagram of an embodiment of the present invention;
FIG. 3 is a graph comparing the stability of the results of experiments according to examples of the present invention;
FIG. 4 is a graph comparing the time efficiency of embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, a power system operation data integration feature selection method comprises the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
Building on random sampling and the RReliefF feature selection algorithm, the invention provides an integrated feature selection framework for power plant operation data, which improves the stability of the algorithm, removes redundant data, and improves time efficiency.
Step S1 specifically comprises:
extracting K training subsets $D_1, D_2, \ldots, D_K$ from the original data set D by the Bootstrap random sampling method.
Bootstrap is a random sampling method with replacement. Specifically, n samples are drawn from the original data each time and m dimensions are selected to form a sub-dataset; this is repeated K times, where n, m and K are set by the user.
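The sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `bootstrap_subsets`, the list-of-rows data layout, and the choice to pick the m dimensions as distinct columns are all hypothetical. In a regression setting the target variable would be carried along with each sampled row; it is omitted here for brevity.

```python
import random

def bootstrap_subsets(data, n, m, K, seed=0):
    """Draw K sub-datasets of n rows (with replacement) and m feature columns.

    `data` is a list of rows, each a list of feature values.
    Returns a list of (column_indices, rows) pairs.
    """
    rng = random.Random(seed)
    n_features = len(data[0])
    subsets = []
    for _ in range(K):
        # Rows are sampled with replacement, as in Bootstrap.
        rows = [rng.choice(data) for _ in range(n)]
        # Assumption: the m dimensions are distinct columns.
        cols = sorted(rng.sample(range(n_features), m))
        subsets.append((cols, [[row[c] for c in cols] for row in rows]))
    return subsets

# Hypothetical toy data: 6 samples, 4 features.
D = [[i, i * 2, i % 3, 10 - i] for i in range(6)]
subsets = bootstrap_subsets(D, n=5, m=3, K=4)
```

Each of the K entries keeps the selected column indices so that later rankings can be mapped back to the original feature names.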
Step S2 specifically comprises:
performing correlation analysis on each training subset with the Pearson correlation coefficient and, if the absolute value of the correlation coefficient between two features exceeds a given threshold, randomly deleting one of the two features, to obtain K training subsets $D'_1, D'_2, \ldots, D'_K$.
This step removes redundant data.
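A minimal sketch of this redundancy filter is given below. The function names, the column-wise data layout, the default threshold of 0.9, and the treatment of constant columns are assumptions made for illustration; the patent leaves the threshold to the user.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9, seed=0):
    """Return the indices of the columns kept after filtering.

    For each pair whose |r| exceeds `threshold`, one of the two columns
    is deleted at random.
    """
    rng = random.Random(seed)
    kept = list(range(len(columns)))
    i = 0
    while i < len(kept):
        j = i + 1
        removed_i = False
        while j < len(kept):
            r = pearson(columns[kept[i]], columns[kept[j]])
            if abs(r) > threshold:
                if rng.random() < 0.5:
                    kept.pop(i)      # drop the first column of the pair
                    removed_i = True
                    break
                kept.pop(j)          # drop the second column of the pair
            else:
                j += 1
        if not removed_i:
            i += 1
    return kept

# Hypothetical example: column 1 is an exact multiple of column 0.
cols = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [5, 1, 4, 2, 3]]
kept = drop_correlated(cols, threshold=0.9)
```

Because the deletion is random, either column 0 or column 1 survives in this example, while the weakly correlated column 2 is always kept.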
Step S3 specifically comprises:
ranking the features on each training subset by weight with the RReliefF algorithm to obtain K different results.
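The K rankings produced here are combined in step S4. The text does not state which aggregation rule is used, so the sketch below assumes one simple option, mean-rank aggregation; the function name, the penalty for missing features, and the feature names are hypothetical.

```python
def aggregate_rankings(rankings, top=None):
    """Combine K rankings (best feature first) by mean rank position.

    Each ranking lists feature names from one training subset. A feature
    missing from a ranking (for example, removed by the correlation filter
    or not sampled) is assigned the worst position, len(ranking).
    """
    features = sorted({f for r in rankings for f in r})
    mean_rank = {}
    for f in features:
        positions = [r.index(f) if f in r else len(r) for r in rankings]
        mean_rank[f] = sum(positions) / len(positions)
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(features, key=lambda f: (mean_rank[f], f))
    return ordered[:top] if top else ordered

# Hypothetical rankings from K = 3 subsets of power plant signals.
results = [
    ["load", "steam_flow", "coal_feed"],
    ["load", "coal_feed", "air_flow"],
    ["steam_flow", "load", "coal_feed"],
]
best = aggregate_rankings(results, top=2)
```

Other rules, such as voting on how often a feature appears in the top positions, would fit the same interface.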
Introduction to the RReliefF algorithm:
The Relief algorithm was first proposed by Kira et al. in 1992 as an efficient feature weighting algorithm for data reduction, and it is applicable to classification problems. The ReliefF algorithm was later proposed to handle multi-class problems. The main idea of the algorithm is as follows: features are weighted according to their ability to distinguish between samples of different classes within a neighborhood, so that a good feature brings samples of the same class close together and keeps samples of different classes far apart. Features whose weight is below a set threshold are removed, and the optimal feature subset is finally obtained.
The ReliefF algorithm assigns an initial weight to each feature in the data set. The weight W[A] is then updated iteratively by the following formula, and the final result is obtained after k iterations.
$$W[A] = W[A] - \frac{\mathrm{diff}(A, R, H)}{k} + \frac{\mathrm{diff}(A, R, M)}{k} \tag{3}$$
In the above formula, $W[A]$ denotes the weight of feature A. $R$ is a sample drawn at random from the training samples, for which ReliefF searches the nearest instances. $H$ denotes the sample nearest to $R$ that belongs to the same class, and $M$ denotes the sample nearest to $R$ that belongs to a different class. If $R$ and $H$ differ on feature A, feature A separates two samples of the same class, so its weight is decreased. If $R$ and $M$ differ on feature A, feature A separates two samples of different classes, so its weight is increased. The function $\mathrm{diff}(A, I_1, I_2)$ is calculated as follows.
For numerical attributes:
$$\mathrm{diff}(A, I_1, I_2) = \frac{|\mathrm{value}(A, I_1) - \mathrm{value}(A, I_2)|}{\max(A) - \min(A)} \tag{4}$$
For discrete attributes:
$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0, & \mathrm{value}(A, I_1) = \mathrm{value}(A, I_2) \\ 1, & \text{otherwise} \end{cases} \tag{5}$$
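The weight update described above can be sketched for the simplest case: two classes, numerical attributes, and a single nearest hit and nearest miss per iteration (basic Relief rather than the k-neighbor ReliefF variant). The function names, the summed normalized difference used as the distance, and the toy data are assumptions for illustration.

```python
import random

def diff(a, x, y, ranges):
    """Normalized difference of numerical attribute `a` between two samples."""
    return abs(x[a] - y[a]) / ranges[a] if ranges[a] else 0.0

def relief(X, y, iterations, seed=0):
    """Basic Relief weights for numerical features and class labels `y`."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ranges = [max(r[a] for r in X) - min(r[a] for r in X)
              for a in range(n_features)]

    def distance(p, q):
        # Assumption: distance is the sum of per-attribute differences.
        return sum(diff(a, p, q, ranges) for a in range(n_features))

    w = [0.0] * n_features
    for _ in range(iterations):
        i = rng.randrange(len(X))
        R = X[i]
        hits = [X[j] for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(len(X)) if y[j] != y[i]]
        H = min(hits, key=lambda s: distance(R, s))    # nearest hit
        M = min(misses, key=lambda s: distance(R, s))  # nearest miss
        for a in range(n_features):
            # Reward separation from the miss, penalize separation from the hit.
            w[a] += (diff(a, R, M, ranges) - diff(a, R, H, ranges)) / iterations
    return w

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.8, 0.8], [0.9, 0.2], [1.0, 0.6]]
y = [0, 0, 0, 1, 1, 1]
weights = relief(X, y, iterations=20)
```

On this toy data the class-separating feature 0 receives a larger weight than the noise feature 1, whichever samples are drawn.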
On the basis of ReliefF, Kononenko et al. proposed the RReliefF algorithm for regression problems. Because the predicted value in a regression problem is continuous, samples have no class labels, so nearest same-class and different-class samples cannot be defined. To solve this, RReliefF introduces the probability that two instances differ, modeled by the relative distance between their predicted values, in place of deciding whether the two instances belong to the same class.
Figure 796120DEST_PATH_IMAGE022
(6)
Figure 118517DEST_PATH_IMAGE024
(7)
Figure 666173DEST_PATH_IMAGE026
(8)
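For orientation, the formulation of RReliefF commonly given in the literature is reproduced below. That equations (6) to (8) correspond to these expressions, and the symbol names used here, are assumptions; the equation numbering of this document is therefore not repeated.

```latex
\begin{aligned}
P_{\mathrm{diffA}} &= P(\text{different value of } A \mid \text{nearest instances}) \\
P_{\mathrm{diffC}} &= P(\text{different prediction} \mid \text{nearest instances}) \\
P_{\mathrm{diffC}\mid\mathrm{diffA}} &= P(\text{different prediction} \mid \text{different value of } A \text{ and nearest instances}) \\
W[A] &= \frac{P_{\mathrm{diffC}\mid\mathrm{diffA}}\; P_{\mathrm{diffA}}}{P_{\mathrm{diffC}}}
      - \frac{\bigl(1 - P_{\mathrm{diffC}\mid\mathrm{diffA}}\bigr)\, P_{\mathrm{diffA}}}{1 - P_{\mathrm{diffC}}}
\end{aligned}
```

In this form the weight of feature A is large when a difference in A tends to coincide with a difference in the predicted value.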
The integrated feature selection framework BPRR (Bootstrap-Pearson-RReliefF) of the present invention is an improved algorithm.
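The RReliefF ranking used inside the framework can be sketched as below. This is a simplified illustration, not the patent's code: it weights the k nearest neighbors uniformly (the published algorithm uses a rank-based weighting), handles numerical attributes only, and the function name and toy data are hypothetical.

```python
import random

def rrelieff(X, y, iterations, k=2, seed=0):
    """Simplified RReliefF weights for numerical features and target `y`."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Value ranges for normalization; a constant column falls back to 1.0.
    f_range = [(max(r[a] for r in X) - min(r[a] for r in X)) or 1.0
               for a in range(p)]
    y_range = (max(y) - min(y)) or 1.0

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / f_range[a]

    n_dc = 0.0              # accumulated difference in the predicted value
    n_da = [0.0] * p        # accumulated difference in each attribute
    n_dcda = [0.0] * p      # accumulated joint difference
    for _ in range(iterations):
        i = rng.randrange(n)
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sum(diff(a, i, j) for a in range(p)))
        for j in others[:k]:
            d_y = abs(y[i] - y[j]) / y_range
            n_dc += d_y / k  # assumption: uniform neighbor weight 1/k
            for a in range(p):
                d_a = diff(a, i, j)
                n_da[a] += d_a / k
                n_dcda[a] += d_y * d_a / k
    # Assumes 0 < n_dc < iterations, which holds for the toy data below.
    return [n_dcda[a] / n_dc - (n_da[a] - n_dcda[a]) / (iterations - n_dc)
            for a in range(p)]

# Hypothetical toy data: the target depends on feature 0; feature 1 is constant.
X = [[0.0, 5.0], [0.1, 5.0], [0.3, 5.0], [0.6, 5.0], [1.0, 5.0]]
y = [2 * row[0] for row in X]
weights = rrelieff(X, y, iterations=10)
```

Sorting the features by these weights in descending order gives the ranking for one training subset; the constant feature contributes nothing and receives a weight of zero.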
Stability measurement index:
The stability of the algorithm is evaluated with the extension of the Kuncheva similarity measure. This index is a subset-based stability measure extended from the Kuncheva similarity measure, and it can measure the similarity of feature subsets that contain different numbers of features. It is calculated as follows:
$$S(F_i, F_j) = \frac{|F_i \cap F_j| - \dfrac{|F_i|\,|F_j|}{n}}{\min(|F_i|, |F_j|) - \max(0,\ |F_i| + |F_j| - n)} \tag{9}$$
where $F_i$ and $F_j$ denote two feature subsets, $|\cdot|$ denotes the cardinality of a feature subset, and $n$ is the total number of features from which both subsets are drawn. The index takes values in [-1, 1]; the larger its value, the more similar the two feature subsets.
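A small sketch of the index follows. It assumes the form commonly given for the extension of Kuncheva's measure to subsets of unequal size, (|A ∩ B| − |A||B|/n) / (min(|A|, |B|) − max(0, |A| + |B| − n)); the function name and the example subsets are hypothetical.

```python
def extended_kuncheva(a, b, n):
    """Extended Kuncheva similarity of feature subsets `a` and `b`.

    `n` is the total number of candidate features.
    """
    a, b = set(a), set(b)
    r = len(a & b)                      # number of shared features
    ka, kb = len(a), len(b)
    denom = min(ka, kb) - max(0, ka + kb - n)
    if denom == 0:
        raise ValueError("index undefined for these subset sizes")
    # Observed overlap minus the overlap expected by chance, normalized.
    return (r - ka * kb / n) / denom

same = extended_kuncheva({1, 2, 3}, {1, 2, 3}, n=10)
disjoint = extended_kuncheva({1, 2}, {3, 4}, n=10)
```

Overlap beyond what chance would produce gives a positive value, and disjoint subsets give a negative one.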
Experimental verification:
In the experiment, the number of subsets is set to 10, 20, …, 90 in turn. The BPRR framework and the RReliefF algorithm are each used to select features on the subsets, the extended Kuncheva similarity measure is computed on the feature selection results, and the values are averaged to obtain the stability of each algorithm for each number of subsets.
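The averaging step can be sketched as the mean similarity over all pairs of selected feature subsets. Averaging over all pairs, the helper names, and the example subsets are assumptions; the text only states that the results are averaged.

```python
from itertools import combinations

def similarity(a, b, n):
    """Assumed form of the extended Kuncheva index for two feature sets."""
    r, ka, kb = len(a & b), len(a), len(b)
    return (r - ka * kb / n) / (min(ka, kb) - max(0, ka + kb - n))

def average_stability(subsets, n):
    """Mean pairwise similarity of the feature subsets from repeated runs."""
    pairs = list(combinations(subsets, 2))
    return sum(similarity(a, b, n) for a, b in pairs) / len(pairs)

# Hypothetical selections from three runs over n = 10 candidate features.
runs = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2}]
score = average_stability(runs, n=10)
```

Repeating this for each number of subsets gives one stability value per setting, which is the kind of curve compared between the two algorithms.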
The stability of RReliefF and of the integrated feature selection framework was measured with this index; the results are shown in FIG. 3 and FIG. 4.
Conclusion:
Two main observations can be drawn from the figures. First, the stability comparison graph compares the stability of the two algorithms. RReliefF is less stable and performs poorly for every number of subsets from 10 to 90, whereas the integrated feature selection framework is more stable. This indicates that the integrated feature selection framework effectively improves the stability of RReliefF.
Second, the time efficiency graph compares the running time of the two algorithms. RReliefF runs much more slowly than the integrated feature selection framework; for example, at 90 subsets RReliefF is 3 times slower.
The experimental results show that, compared with the RReliefF algorithm, the method improves the stability of feature selection, eliminates redundant features, and improves efficiency. The method can be applied to the feature selection step of data preprocessing before modeling and prediction on large-scale power data: it screens out the attributes related to the target parameter from a large number of power parameters, yields the same result when the data are slightly perturbed, and thus improves the reliability of feature selection.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. A power system operation data integration feature selection method, characterized by comprising the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
2. The power system operation data integration feature selection method of claim 1, characterized in that step S1 specifically comprises:
extracting K training subsets $D_1, D_2, \ldots, D_K$ from the original data set D by the Bootstrap random sampling method.
3. The power system operation data integration feature selection method of claim 1, characterized in that step S2 specifically comprises:
performing correlation analysis on each training subset with the Pearson correlation coefficient and, if the absolute value of the correlation coefficient between two features exceeds a given threshold, randomly deleting one of the two features, to obtain K training subsets $D'_1, D'_2, \ldots, D'_K$.
This step removes redundant data.
4. The power system operation data integration feature selection method of claim 1, characterized in that step S3 specifically comprises:
ranking the features on each training subset by weight with the RReliefF algorithm to obtain K different results.
CN202010265810.5A 2020-04-07 2020-04-07 Power system operation data integration feature selection method Pending CN111401783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010265810.5A CN111401783A (en) 2020-04-07 2020-04-07 Power system operation data integration feature selection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010265810.5A CN111401783A (en) 2020-04-07 2020-04-07 Power system operation data integration feature selection method

Publications (1)

Publication Number Publication Date
CN111401783A true CN111401783A (en) 2020-07-10

Family

ID=71431498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010265810.5A Pending CN111401783A (en) 2020-04-07 2020-04-07 Power system operation data integration feature selection method

Country Status (1)

Country Link
CN (1) CN111401783A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114464266A (en) * 2022-01-27 2022-05-10 东北电力大学 Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156484A (en) * 2016-06-08 2016-11-23 中国科学院自动化研究所 Disease of brain individuation Forecasting Methodology based on nuclear magnetic resonance image and system
CN106250442A (en) * 2016-07-26 2016-12-21 新疆大学 The feature selection approach of a kind of network security data and system
CN107169514A (en) * 2017-05-05 2017-09-15 清华大学 The method for building up of diagnosing fault of power transformer model
CN109636248A (en) * 2019-01-15 2019-04-16 清华大学 Feature selection approach and device suitable for transient stability evaluation in power system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
伍杰华: "基于RReliefF特征选择算法的复杂网络链接分类", 《计算机工程》 *
刘艺 等: "特征选择稳定性研究综述", 《软件学报》 *
张磊 等: "基于多目标蚁群算法的稳定参考点选择", 《计算机技术与发展》 *
柴明锐 等: "《数据挖掘技术及在石油地质中的应用》", 30 September 2017, 天津科学技术出版社 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114464266A (en) * 2022-01-27 2022-05-10 东北电力大学 Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR
CN114464266B (en) * 2022-01-27 2022-08-02 东北电力大学 Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR

Similar Documents

Publication Publication Date Title
CN110070141B (en) Network intrusion detection method
US7542953B1 (en) Data classification by kernel density shape interpolation of clusters
Ibrahim et al. Cluster representation of the structural description of images for effective classification
Saha et al. A new multiobjective clustering technique based on the concepts of stability and symmetry
Saha et al. Simultaneous feature selection and symmetry based clustering using multiobjective framework
Belhaouari et al. Optimized K‐Means Algorithm
CN111275127B (en) Dynamic feature selection method based on condition mutual information
CN111460161A (en) Unsupervised text theme related gene extraction method for unbalanced big data set
Min et al. Automatic determination of clustering centers for “clustering by fast search and find of density peaks”
CN115344693A (en) Clustering method based on fusion of traditional algorithm and neural network algorithm
Maddumala A Weight Based Feature Extraction Model on Multifaceted Multimedia Bigdata Using Convolutional Neural Network.
Mandal et al. Unsupervised non-redundant feature selection: a graph-theoretic approach
Zhang et al. A new outlier detection algorithm based on fast density peak clustering outlier factor.
CN111401783A (en) Power system operation data integration feature selection method
Wang et al. Mining high-dimensional data
Rahman et al. An efficient approach for selecting initial centroid and outlier detection of data clustering
Balaganesh et al. Movie success rate prediction using robust classifier
CN111444989A (en) Network intrusion detection method
CN110837853A (en) Rapid classification model construction method
Li Logistic and SVM credit score models based on lasso variable selection
Yang et al. Adaptive density peak clustering for determinging cluster center
Schneider et al. Expected similarity estimation for large scale anomaly detection
CN111382273A (en) Text classification method based on feature selection of attraction factors
Hochma et al. Efficient Feature Ranking and Selection using Statistical Moments
Claypo et al. A new feature selection based on class dependency and feature dissimilarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200710