CN111401783A - Power system operation data integration feature selection method - Google Patents
- Publication number
- CN111401783A (application CN202010265810.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Abstract
The invention discloses a power system operation data integration feature selection method, which comprises the following steps: S1, extracting K training subsets; S2, performing correlation analysis; S3, ranking the features on each training subset by weight to obtain different results; S4, aggregating the results to obtain the optimal feature subset. On the basis of random sampling and the RReliefF feature selection algorithm, the invention provides an integrated feature selection framework for power plant operation data, which improves the stability of the algorithm, removes redundant data, and improves time efficiency.
Description
Technical Field
The invention relates to a feature selection method, in particular to a power system operation data integration feature selection method.
Background
With the continuous expansion of the smart grid, historical operation data have become very large in volume and high in dimensionality. When such data are analyzed and modeled, using all attributes as model inputs not only increases the computational burden but also leads to the curse of dimensionality, causing model overfitting and reduced generalization capability. Dimensionality reduction is therefore required before modeling.
The common dimensionality reduction techniques are feature extraction and feature selection. Feature extraction maps data from a high-dimensional space to a low-dimensional feature space, usually through a mathematical transformation such as projection; typical methods include Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), and Canonical Correlation Analysis (CCA). However, the new features produced by feature extraction lose the physical meaning of the original features and are weakly interpretable, which is unacceptable in many problems. Feature selection, in contrast, selects from the original feature space the smallest subset of features that maximizes an evaluation criterion. The selected features keep their original physical meaning and are therefore highly interpretable, an obvious advantage, and feature selection has been widely applied in recent years to bioinformatics, computer vision, object recognition, and other fields. In practical applications, a feature selection algorithm is required not only to select good features but also to be stable. Feature selection stability means that the method is robust to small perturbations of the training samples: a stable feature selection method should produce the same or similar feature subsets when the training samples are slightly perturbed. Improving the stability of feature selection helps to find truly relevant features, strengthens the confidence of domain experts in the results, and further reduces the complexity and time cost of data acquisition.
Feature selection instability has three main sources. (1) Instability of the algorithm itself: most existing feature selection methods aim only at selecting a minimal feature subset and neglect the stability of the selection. (2) Highly redundant features: if a feature selection algorithm achieves the same learning accuracy on different feature subsets of the same dataset, the selected attributes are not stable. (3) The high-dimensional small-sample problem: in some practical problems such as gene detection, there are usually only a few hundred samples but thousands of features. In a high-dimensional small-sample space, a small change in the sample set alters the data distribution, and a change in the data distribution alters the selection result, causing differences in the selected features.
The RReliefF algorithm can handle nonlinear data and is computationally efficient, making it a well-known feature selection method. However, it has two shortcomings. First, the algorithm does not consider the stability of feature selection: because of the characteristics of the data and of the algorithm itself, repeated runs may produce different optimal subsets. Second, it cannot remove redundant features: any feature whose weight contributes positively to the predicted value is retained, and the algorithm ignores correlations among features, whereas power plant operation data contain many redundant features with high coupling and strong correlation among them.
Disclosure of Invention
The main object of the invention is to provide a power system operation data integration feature selection method.
The technical solution adopted by the invention is as follows. A power system operation data integration feature selection method comprises the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
Further, the step S1 specifically includes:
Bootstrap sampling with replacement: n samples are drawn from the original data and m feature dimensions are selected to form a sub-dataset, and this is repeated K times, where n, m and K are set by the user.
Further, the step S2 specifically includes:
Correlation analysis is performed on each training subset using the Pearson correlation algorithm. If the absolute value of the correlation coefficient between two features exceeds a preset threshold, one of the two features is deleted at random, yielding K reduced training subsets. Through this step, redundant data are removed.
Further, the step S3 specifically includes:
The features on each training subset are ranked by weight using the RReliefF algorithm, yielding K different ranking results.
The invention has the advantages that:
the invention provides an integrated feature selection framework aiming at the operation data of the power plant on the basis of random sampling and RReliefF feature selection algorithm, thereby improving the stability of the algorithm, removing redundant data and improving the time efficiency.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages, which will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is a flow chart of a method for selecting an integrated feature of operating data of an electrical power system according to an embodiment of the present invention;
FIG. 2 is an integrated feature selection framework diagram of an embodiment of the present invention;
FIG. 3 is a graph comparing the stability of the results of experiments according to examples of the present invention;
FIG. 4 is a graph comparing the time efficiency of embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to FIG. 1, the power system operation data integration feature selection method comprises the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
On the basis of random sampling and the RReliefF feature selection algorithm, the invention provides an integrated feature selection framework for power plant operation data, which improves the stability of the algorithm, removes redundant data, and improves time efficiency.
The step S1 specifically includes:
Bootstrap is a random sampling method with replacement. Specifically, n samples are drawn from the original data and m feature dimensions are selected to form a sub-dataset; this is repeated K times, where n, m and K are set by the user.
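For illustration, a minimal Python sketch of this sampling step; the function name and parameters are chosen here for illustration and do not appear in the patent. It draws n rows with replacement and m feature columns without replacement, K times:

```python
import numpy as np

def bootstrap_subsets(X, y, n, m, K, seed=0):
    """Draw K sub-datasets: n rows with replacement, m feature columns without."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(K):
        rows = rng.integers(0, X.shape[0], size=n)            # sampling with replacement
        cols = rng.choice(X.shape[1], size=m, replace=False)  # m feature dimensions
        subsets.append((X[np.ix_(rows, cols)], y[rows], cols))
    return subsets

# e.g. subsets = bootstrap_subsets(X, y, n=500, m=20, K=10)
```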
The step S2 specifically includes:
Correlation analysis is performed on each training subset using the Pearson correlation algorithm. If the absolute value of the correlation coefficient between two features exceeds a preset threshold, one of the two features is deleted at random, yielding K reduced training subsets. Through this step, redundant data are removed.
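A minimal sketch of this redundancy filter, assuming a pandas DataFrame and a user-chosen threshold (the value 0.9 below is illustrative; the patent does not fix a threshold). When two features correlate above the threshold, one of the pair is dropped at random:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9, seed=0):
    """Drop one feature from every pair whose |Pearson correlation| exceeds threshold."""
    rng = np.random.default_rng(seed)
    corr = df.corr(method="pearson").abs()
    keep = list(df.columns)
    for i, a in enumerate(df.columns):
        for b in df.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > threshold:
                keep.remove(rng.choice([a, b]))   # delete one of the two at random
    return df[keep]
```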
The step S3 specifically includes:
The features on each training subset are ranked by weight using the RReliefF algorithm, yielding K different ranking results.
Introduction to the RReliefF algorithm:
The Relief algorithm was first proposed by Kira et al. in 1992 as an efficient feature weighting algorithm for data reduction, applicable to classification problems. The ReliefF algorithm was later proposed to handle multi-class problems. The main idea of the algorithm is to weight features according to their ability to distinguish samples of different classes within a neighborhood: a good feature brings samples of the same class close together and pushes samples of different classes apart. Features whose weight falls below a preset threshold are removed, finally yielding the optimal feature subset.
The ReliefF algorithm assigns an initial weight to each feature in the dataset. The weight W[A] is then updated iteratively with the following formula (shown here in its basic single-neighbour form) over m randomly sampled instances, after which the final weights are obtained:

$$W[A] \leftarrow W[A] - \frac{\mathrm{diff}(A, R_i, H)}{m} + \frac{\mathrm{diff}(A, R_i, M)}{m}$$
In the above formula, W[A] denotes the weight of feature A, and R_i is a sample randomly selected from the training set. H denotes the nearest sample to R_i of the same class (the nearest hit), and M denotes the nearest sample to R_i of a different class (the nearest miss). If R_i and H have different values of feature A, feature A separates two samples of the same class, so the weight of feature A is decreased. If R_i and M have different values of feature A, feature A separates two samples of different classes, so the weight of feature A is increased. The function diff is calculated as follows:
For numerical attributes:

$$\mathrm{diff}(A, I_1, I_2) = \frac{\lvert \mathrm{value}(A, I_1) - \mathrm{value}(A, I_2) \rvert}{\max(A) - \min(A)}$$

For discrete attributes:

$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0, & \mathrm{value}(A, I_1) = \mathrm{value}(A, I_2) \\ 1, & \mathrm{value}(A, I_1) \neq \mathrm{value}(A, I_2) \end{cases}$$
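As an illustration, a minimal Python sketch of the basic Relief update described above (single nearest hit and miss per sampled instance, numerical features scaled so that diff lies in [0, 1]); all names are illustrative and not part of the patent:

```python
import numpy as np

def relief_weights(X, y, m_iters=100, seed=0):
    """Basic Relief update: W[A] -= diff(A,R,H)/m ; W[A] += diff(A,R,M)/m."""
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # scales diff into [0, 1]
    W = np.zeros(X.shape[1])
    for _ in range(m_iters):
        i = rng.integers(len(X))
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                           # exclude the sampled instance itself
        hit = np.where(y == y[i])[0]
        miss = np.where(y != y[i])[0]
        H = X[hit[np.argmin(dist[hit])]]           # nearest sample of the same class
        M = X[miss[np.argmin(dist[miss])]]         # nearest sample of a different class
        W += (np.abs(X[i] - M) - np.abs(X[i] - H)) / span / m_iters
    return W
```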
On the basis of ReliefF, Kononenko et al. proposed the RReliefF algorithm for regression problems. Because the predicted value in a regression problem is continuous, samples cannot be divided into "same class" and "different class". To address this, RReliefF introduces the probability that two instances have different predicted values, modeled by the relative distance between their predicted values, in place of the hit/miss distinction.
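A condensed Python sketch of the RReliefF weight estimate for regression, following the standard accumulator formulation (NdY, NdA, NdYdA over the k nearest neighbours of each sampled instance); the patent itself does not reproduce these formulas, so this is an illustrative reconstruction that simplifies the neighbour influence to a uniform weight:

```python
import numpy as np

def rrelieff_weights(X, y, m_iters=100, k=10, seed=0):
    """RReliefF estimate: W[A] = NdYdA/NdY - (NdA - NdYdA)/(m - NdY)."""
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    y_span = y.max() - y.min() + 1e-12
    NdY, NdA, NdYdA = 0.0, np.zeros(X.shape[1]), np.zeros(X.shape[1])
    for _ in range(m_iters):
        i = rng.integers(len(X))
        dist = np.abs((X - X[i]) / span).sum(axis=1)
        dist[i] = np.inf
        for j in np.argsort(dist)[:k]:             # k nearest neighbours
            w = 1.0 / k                            # uniform neighbour influence (simplified)
            d_y = abs(y[i] - y[j]) / y_span        # probability that the predictions differ
            d_a = np.abs(X[i] - X[j]) / span       # per-feature difference
            NdY += w * d_y
            NdA += w * d_a
            NdYdA += w * d_y * d_a
    return NdYdA / NdY - (NdA - NdYdA) / (m_iters - NdY)
```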
The integrated feature selection framework of the present invention, BPRR (Bootstrap-Pearson-RReliefF), is an improvement built on these algorithms.
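Putting the pieces together, a sketch of the BPRR flow under the assumptions above. The patent does not state how the K rankings are aggregated in step S4, so the sketch uses the mean rank of each feature across subsets as one plausible choice; the helper names (bootstrap_subsets, drop_correlated, rrelieff_weights) come from the illustrative snippets earlier in this description:

```python
import numpy as np
import pandas as pd

def bprr(df, target, n, m, K, threshold=0.9, top=10, seed=0):
    """BPRR sketch: bootstrap sampling -> Pearson filter -> RReliefF -> rank aggregation."""
    features = df.drop(columns=[target])
    X, y, cols = features.to_numpy(), df[target].to_numpy(), features.columns
    rank_sum, counts = {}, {}
    for Xs, ys, idx in bootstrap_subsets(X, y, n, m, K, seed):
        sub = drop_correlated(pd.DataFrame(Xs, columns=cols[idx]), threshold, seed)
        w = rrelieff_weights(sub.to_numpy(), ys)
        for rank, c in enumerate(sub.columns[np.argsort(-w)]):   # best feature first
            rank_sum[c] = rank_sum.get(c, 0) + rank
            counts[c] = counts.get(c, 0) + 1
    mean_rank = {c: rank_sum[c] / counts[c] for c in rank_sum}
    return sorted(mean_rank, key=mean_rank.get)[:top]            # aggregated feature subset
```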
Stability measurement index:
the stability of the algorithm is evaluated by selecting and expanding an extension of Kunzewav similarity measure (extension of Kuncheva similarity measure) index. The kunzhewa similarity measure is one of the subset stability measure indexes, is extended from kunzhewa similarity measure (kunchheva similarity measure), and can be used for measuring the feature subset similarity of different feature numbers. The calculation formula is as follows:
$$S(F_1, F_2) = \frac{r - \frac{k_1 k_2}{n}}{\min(k_1, k_2) - \max(0,\; k_1 + k_2 - n)}$$

where F_1 and F_2 denote the two feature subsets, k_1 = |F_1| and k_2 = |F_2| are their cardinalities, r = |F_1 ∩ F_2|, and n is the total number of features. The value of the extended Kuncheva similarity measure lies in [-1, 1]; the larger the value, the higher the similarity of the two feature subsets.
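A small Python helper for the extended Kuncheva index as reconstructed above (feature subsets given as collections of column names, n the total number of features); illustrative only:

```python
def extended_kuncheva(s1, s2, n):
    """Extended Kuncheva similarity for feature subsets of possibly different sizes."""
    k1, k2, r = len(s1), len(s2), len(set(s1) & set(s2))
    denom = min(k1, k2) - max(0, k1 + k2 - n)
    return (r - k1 * k2 / n) / denom if denom else 1.0   # degenerate subsets treated as identical
```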
Experimental verification:
In the experiments of the invention, the data are divided into 10, 20, …, 90 subsets. The BPRR framework and the RReliefF algorithm are then used to perform feature selection on these subsets, the extended Kuncheva similarity measure is computed over the feature selection results, and the values are averaged to obtain the stability of each algorithm for each number of subsets.
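For the stability evaluation, a sketch of one way to average the extended Kuncheva index over all pairs of selection results (one selected feature subset per run); this mirrors the averaging described above, although the exact pairing scheme is not spelled out in the patent, and it reuses the illustrative extended_kuncheva helper defined earlier:

```python
from itertools import combinations

def mean_stability(selected_subsets, n_features):
    """Average extended Kuncheva similarity over all pairs of selection results."""
    pairs = list(combinations(selected_subsets, 2))
    return sum(extended_kuncheva(a, b, n_features) for a, b in pairs) / len(pairs)
```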
The stability of RReliefF and of the integrated feature selection framework is measured in this way, and the results are shown in FIG. 3 and FIG. 4.
and (4) conclusion:
the following two main aspects can be seen from the figure: first, the stability of the two algorithms is compared on a stability comparison graph. RReliefF has poor stability, does not perform well in the subset number of 10 to 90, and the integrated feature selection framework exhibits better stability. The results indicate that the integrated feature selection framework is effective in improving the stability of RReliefF.
Second, the time efficiency comparison (FIG. 4) shows that RReliefF runs much slower than the integrated feature selection framework; for example, at 90 subsets, RReliefF is 3 times slower than the integrated feature selection framework.
The experimental results show that, compared with the RReliefF algorithm, the proposed method improves the stability of feature selection, eliminates redundant features, and improves efficiency. The method can be applied to the feature selection step of data preprocessing before modeling and prediction on large-scale power data: it screens out the attributes related to the target parameters from many power parameters, produces the same result when the data are subject to small disturbances, and thereby improves the reliability of feature selection.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (4)
1. A power system operation data integration feature selection method, characterized by comprising the following steps:
S1, extracting K training subsets;
S2, performing correlation analysis;
S3, ranking the features on each training subset by weight to obtain different results;
S4, aggregating the results to obtain the optimal feature subset.
3. The power system operation data integration feature selection method of claim 1, characterized in that the step S2 specifically comprises:
Correlation analysis is performed on each training subset using the Pearson correlation algorithm. If the absolute value of the correlation coefficient between two features exceeds a preset threshold, one of the two features is deleted at random, yielding K reduced training subsets. Through this step, redundant data are removed.
4. The power system operation data integration feature selection method of claim 1, characterized in that the step S3 specifically comprises:
The features on each training subset are ranked by weight using the RReliefF algorithm, yielding K different ranking results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010265810.5A CN111401783A (en) | 2020-04-07 | 2020-04-07 | Power system operation data integration feature selection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010265810.5A CN111401783A (en) | 2020-04-07 | 2020-04-07 | Power system operation data integration feature selection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111401783A true CN111401783A (en) | 2020-07-10 |
Family
ID=71431498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010265810.5A Pending CN111401783A (en) | 2020-04-07 | 2020-04-07 | Power system operation data integration feature selection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111401783A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114464266A (en) * | 2022-01-27 | 2022-05-10 | 东北电力大学 | Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156484A (en) * | 2016-06-08 | 2016-11-23 | 中国科学院自动化研究所 | Disease of brain individuation Forecasting Methodology based on nuclear magnetic resonance image and system |
CN106250442A (en) * | 2016-07-26 | 2016-12-21 | 新疆大学 | The feature selection approach of a kind of network security data and system |
CN107169514A (en) * | 2017-05-05 | 2017-09-15 | 清华大学 | The method for building up of diagnosing fault of power transformer model |
CN109636248A (en) * | 2019-01-15 | 2019-04-16 | 清华大学 | Feature selection approach and device suitable for transient stability evaluation in power system |
Non-Patent Citations (4)
Title |
---|
Wu Jiehua (伍杰华): "Complex network link classification based on the RReliefF feature selection algorithm", Computer Engineering (《计算机工程》) *
Liu Yi (刘艺) et al.: "Survey on the stability of feature selection", Journal of Software (《软件学报》) *
Zhang Lei (张磊) et al.: "Stable reference point selection based on a multi-objective ant colony algorithm", Computer Technology and Development (《计算机技术与发展》) *
Chai Mingrui (柴明锐) et al.: "Data Mining Technology and Its Applications in Petroleum Geology" (《数据挖掘技术及在石油地质中的应用》), Tianjin Science and Technology Press, 30 September 2017 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114464266A (en) * | 2022-01-27 | 2022-05-10 | 东北电力大学 | Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR |
CN114464266B (en) * | 2022-01-27 | 2022-08-02 | 东北电力大学 | Pulverized coal boiler NOx emission prediction method and device based on improved SSA-GPR |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110070141B (en) | Network intrusion detection method | |
US7542953B1 (en) | Data classification by kernel density shape interpolation of clusters | |
Ibrahim et al. | Cluster representation of the structural description of images for effective classification | |
Saha et al. | A new multiobjective clustering technique based on the concepts of stability and symmetry | |
Saha et al. | Simultaneous feature selection and symmetry based clustering using multiobjective framework | |
Belhaouari et al. | Optimized K‐Means Algorithm | |
CN111275127B (en) | Dynamic feature selection method based on condition mutual information | |
CN111460161A (en) | Unsupervised text theme related gene extraction method for unbalanced big data set | |
Min et al. | Automatic determination of clustering centers for “clustering by fast search and find of density peaks” | |
CN115344693A (en) | Clustering method based on fusion of traditional algorithm and neural network algorithm | |
Maddumala | A Weight Based Feature Extraction Model on Multifaceted Multimedia Bigdata Using Convolutional Neural Network. | |
Mandal et al. | Unsupervised non-redundant feature selection: a graph-theoretic approach | |
Zhang et al. | A new outlier detection algorithm based on fast density peak clustering outlier factor. | |
CN111401783A (en) | Power system operation data integration feature selection method | |
Wang et al. | Mining high-dimensional data | |
Rahman et al. | An efficient approach for selecting initial centroid and outlier detection of data clustering | |
Balaganesh et al. | Movie success rate prediction using robust classifier | |
CN111444989A (en) | Network intrusion detection method | |
CN110837853A (en) | Rapid classification model construction method | |
Li | Logistic and SVM credit score models based on lasso variable selection | |
Yang et al. | Adaptive density peak clustering for determinging cluster center | |
Schneider et al. | Expected similarity estimation for large scale anomaly detection | |
CN111382273A (en) | Text classification method based on feature selection of attraction factors | |
Hochma et al. | Efficient Feature Ranking and Selection using Statistical Moments | |
Claypo et al. | A new feature selection based on class dependency and feature dissimilarity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200710 |