CN109034270A

CN109034270A - A visual feature selection method based on fault binary classification non-negative matrix factorization

Info

Publication number: CN109034270A
Application number: CN201810968454.6A
Authority: CN
Inventors: 梁霖; 牛奔; 刘飞; 山磊; 何康康; 徐光华
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2018-08-23
Filing date: 2018-08-23
Publication date: 2018-12-18

Abstract

A visual feature selection method based on fault binary classification non-negative matrix factorization. First, the multi-class problem is divided into multiple binary classification problems by pairwise combination of the classes. Each binary classification problem is then decomposed by non-negative matrix factorization and the result is expressed as a heat map. Finally, the salient-expression principle of the heat map is used to select effective classification features, and the sensitive features of the entire data set are extracted from these classification features. The method reduces dimensionality while maintaining good classification performance of the low-dimensional feature subset.

Description

A visual feature selection method based on fault binary classification non-negative matrix factorization

Technical field

The invention belongs to the technical field of condition monitoring and fault diagnosis of mechanical equipment, and in particular relates to a visual feature selection method based on fault binary classification non-negative matrix factorization.

Background

As the complexity and level of integration of electromechanical systems continue to increase, so does the risk of equipment failure during operation. Condition monitoring and fault diagnosis are therefore necessary to identify faults that arise and evolve during operation and to diagnose and deal with abnormal components in time. With the continuous advance of information acquisition technology, more and more features describing system state and operating parameters can be obtained, including redundant and irrelevant ones. This poses a great challenge to subsequent diagnosis and identification and calls for effective feature selection and extraction on high-dimensional data. In addition to traditional dimensionality reduction methods, non-negative matrix factorization (NMF) yields a low-rank approximation of the original feature data matrix, and its results have good interpretability and physical meaning. The factor matrix related to the features can be used to reduce the dimensionality of the data features, and the method has been widely applied in the field of monitoring and diagnosis.

However, current feature analysis methods based on non-negative matrix factorization directly analyze the basis matrix or coefficient matrix obtained by factorizing the original multi-class fault sample matrix. They are algorithm-centered and the selection process usually lacks human participation. As a result, the selection process is not transparent or intuitive, the selection results are hard to interpret, and the effectiveness of feature analysis and selection is limited.

Summary of the invention

To overcome the above shortcomings of the prior art, the object of the present invention is to provide a visual feature selection method based on fault binary classification non-negative matrix factorization. It combines the physical meaning of the non-negative matrix factorization result with the salient-expression advantage of heat maps, which makes feature selection intuitive and ensures the classification accuracy of the selected feature subset.

To achieve the above object, the technical solution adopted by the present invention is as follows:

A visual feature selection method based on fault binary classification non-negative matrix factorization, comprising the following steps:

1) Extract the data set V_{m×n} to be processed, in which the m rows represent samples and the n columns represent features;

2) Apply non-negativity and normalization processing to the data set V_{m×n}, where i = 1, 2, ..., m; j = 1, 2, ..., n; max V_ij is the maximum value of the column vector V_j; and min V_kj is the minimum value of the column vector V_j;

3) Divide the original multi-fault classification problem into multiple binary classification problems by pairwise combination of the classes. Assuming that V_{m×n} contains N classes of samples, the feature set corresponding to each binary classification problem is denoted P_i, where i indexes the pairwise combinations of the N classes (N(N−1)/2 in total);

4) Factorize each non-negative feature set P_i with the least-squares iterative algorithm, that is, P_i = W_i H_i;

Randomly initialize W_i and H_i. The low-dimensional embedding dimension r_i is preferably chosen equal to the number of sample classes. Non-negative matrix factorization yields the basis matrix W_i and the coefficient matrix H_i by iterative update rules,

where W_i is the basis matrix obtained by non-negative matrix factorization of the feature set P_i, W_i^T denotes the transpose of the basis matrix W_i, H_i is the coefficient matrix obtained by non-negative matrix factorization of the feature set P_i, and H_i^T denotes the transpose of the coefficient matrix H_i;

5) Visualize the basis matrix W_i and the coefficient matrix H_i as heat maps; the rows of the basis matrix W_i correspond to samples, and the columns of the coefficient matrix H_i correspond to the original features;

6) Observe the clustering in the basis matrix W_i. If the two classes of samples are clearly separated in one of the columns of W_i, proceed to step 7); otherwise choose a new low-dimensional embedding dimension r_i and return to step 4);

7) Select the classification feature F_i from the heat map using the salient-expression principle, adjusting the heat map threshold so that the number of classification features is 1;

8) Take the union of the classification features F_i obtained for all binary classification problems to obtain the final classification feature set F, i.e. the union of all F_i.
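The displayed formulas belonging to steps 2), 3), 4) and 8) are not reproduced in this text. The block below is a reconstruction under stated assumptions, not the patent's exact notation: min-max scaling is inferred from the definitions of max V_ij and min V_kj, the count of binary problems from the pairwise combination of N classes, and the update rules are the standard multiplicative updates for the squared-error NMF objective, which involve exactly the transposes W_i^T and H_i^T named in step 4). The symbol ⊙ and the fraction bars denote element-wise multiplication and division.

```latex
% Step 2 (assumed form): column-wise min-max scaling to [0, 1]
\bar V_{ij} = \frac{V_{ij} - \min_k V_{kj}}{\max_k V_{kj} - \min_k V_{kj}},
\qquad i = 1,\dots,m,\; j = 1,\dots,n

% Step 3: number of binary problems for N classes
i = 1, 2, \dots, \binom{N}{2} = \frac{N(N-1)}{2}

% Step 4 (assumed form): multiplicative updates for P_i \approx W_i H_i
H_i \leftarrow H_i \odot \frac{W_i^{T} P_i}{W_i^{T} W_i H_i},
\qquad
W_i \leftarrow W_i \odot \frac{P_i H_i^{T}}{W_i H_i H_i^{T}}

% Step 8: union of the per-problem classification features
F = \bigcup_{i} F_i
```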

The beneficial effects of the present invention are as follows: the method allows computer and human to cooperate and complement each other's strengths, the selection results are highly interpretable, and the original high-dimensional features are reduced in dimension while good classification performance of the low-dimensional feature subset is maintained.

Description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 is a heat map visualization of the non-negative matrix factorization result for data set P₁ (samples of classes 1 and 2) in the embodiment.

Fig. 3 is a heat map visualization of the non-negative matrix factorization result for data set P₂ (samples of classes 1 and 3) in the embodiment.

Fig. 4 is a heat map visualization of the non-negative matrix factorization result for data set P₃ (samples of classes 2 and 3) in the embodiment.

Fig. 5 is a three-dimensional visualization of features 1, 7 and 12 of the Wine data set in the embodiment.

Detailed description

The present invention is described in detail below with reference to the drawings and an embodiment. The embodiment uses the Wine data set from the UCI repository. The Wine data set comes from the chemical analysis of three different varieties of wine, which determined the content of 13 constituents in each of them. It contains three classes of data totaling 178 samples with 13 feature attributes (constituents); the class sizes are 59 (class 1), 71 (class 2) and 48 (class 3). This embodiment performs feature selection on these 13 features to pick out features with good classification performance.

Referring to Fig. 1, a visual feature selection method based on fault binary classification non-negative matrix factorization comprises the following steps:

1) Extract the data set V_{m×n} to be processed, in which the m rows represent samples and the n columns represent features; this embodiment uses the Wine data set;

3) Divide the original multi-fault classification problem into multiple binary classification problems by pairwise combination of the classes. Assuming that V_{m×n} contains N classes of samples, the feature set corresponding to each binary classification problem is denoted P_i. In this embodiment the Wine data set is divided into three binary classification problems: P₁ (samples of classes 1 and 2), P₂ (samples of classes 1 and 3) and P₃ (samples of classes 2 and 3);

where W_i is the basis matrix obtained by non-negative matrix factorization of the feature set P_i, W_i^T denotes the transpose of the basis matrix W_i, H_i is the coefficient matrix obtained by non-negative matrix factorization of the feature set P_i, and H_i^T denotes the transpose of the coefficient matrix H_i;

6) Observe the clustering in the basis matrix W_i. If the two classes of samples are clearly separated in one of the columns of W_i, as shown in Fig. 2, proceed to step 7); otherwise choose a new low-dimensional embedding dimension r_i and return to step 4);

From a clustering point of view, the feature set of a binary classification problem should in essence contain only two kinds of features: those that can distinguish the two classes and those that cannot. All other features can be obtained as combinations of these two essential kinds. Therefore, when P₁, P₂ and P₃ are factorized in this embodiment, the low-dimensional embedding dimension r is preferably set to 2 in each case. The visualized factorization results are shown in Figs. 2, 3 and 4;
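The preprocessing, pairwise split and rank-2 factorization described above can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: the function names are hypothetical, the data are synthetic rather than the Wine data set, min-max scaling is assumed for the normalization step, and the standard multiplicative updates for the squared-error objective stand in for the iteration rules, which are not reproduced in this text.

```python
from itertools import combinations

import numpy as np


def min_max_normalize(V):
    """Scale each feature column to [0, 1] (assumed form of the normalization step)."""
    V = np.asarray(V, dtype=float)
    col_min = V.min(axis=0)
    col_range = V.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns are mapped to 0
    return (V - col_min) / col_range


def pairwise_subsets(V, labels):
    """Yield ((class_a, class_b), P_i, labels_i) for every pair of classes."""
    labels = np.asarray(labels)
    for a, b in combinations(np.unique(labels).tolist(), 2):
        mask = (labels == a) | (labels == b)
        yield (a, b), V[mask], labels[mask]


def nmf(P, r=2, n_iter=500, eps=1e-9, seed=0):
    """Factor P (samples x features) into W (samples x r) and H (r x features).

    Uses multiplicative updates for the squared-error objective; this choice
    is an assumption, since the source's own update rules are not shown.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((P.shape[0], r))
    H = rng.random((r, P.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ P) / (W.T @ W @ H + eps)
        W *= (P @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic stand-in data: 3 classes, 20 samples each, 5 features.
rng = np.random.default_rng(1)
labels = np.repeat([1, 2, 3], 20)
V = rng.random((60, 5))
V[:, 0] += (labels == 1) * 3.0  # feature 0 separates class 1
V[:, 3] += (labels == 3) * 3.0  # feature 3 separates class 3

V_norm = min_max_normalize(V)
# One (W_i, H_i) pair per binary problem, with embedding dimension r = 2.
factors = {pair: nmf(P, r=2) for pair, P, _ in pairwise_subsets(V_norm, labels)}
```

Each `W` in `factors` has one row per sample of the two classes involved and each `H` has one column per original feature, so both can be passed directly to any heat map plotting routine for the visual inspection in step 6).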

7) Apply the salient-expression principle to the heat map: by the rule of matrix multiplication, a large value multiplied by a large value is still a large value, which stands out from the small values in the heat map. The classification feature F_i is selected as the original feature corresponding to the large-value region of the coefficient matrix H_i. The number of classification features is controlled by adjusting the heat map threshold, and is usually set to 1;

This embodiment applies the salient-expression principle of the heat map to the factorization results of P₁, P₂ and P₃ in turn to select classification features. The display threshold of the heat map is adjusted so that exactly one classification feature is selected for each binary classification problem. From the coefficient matrices H, the selected classification features are feature 1, feature 7 and feature 12, respectively, and their union gives the classification feature set {1, 7, 12}.
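In the method the selection is made by a person looking at the heat maps. The sketch below is only a programmatic stand-in for that visual judgment, under assumptions: `select_feature` is a hypothetical helper, the separation score (gap between class means divided by the summed class spreads) is one possible way to find the column of W in which the two classes separate, and raising the threshold until a single feature remains is treated as taking the largest entry in the matching row of H. The small matrices are made-up values, not results from the Wine data set, and the indices are 0-based whereas the text numbers features from 1.

```python
import numpy as np


def select_feature(W, H, labels):
    """Return the index of one classification feature for a binary problem."""
    labels = np.asarray(labels)
    a, b = np.unique(labels)  # exactly two classes are expected
    Wa, Wb = W[labels == a], W[labels == b]
    # Score each basis column by how clearly it separates the two classes.
    gap = np.abs(Wa.mean(axis=0) - Wb.mean(axis=0))
    spread = Wa.std(axis=0) + Wb.std(axis=0) + 1e-12
    k = int(np.argmax(gap / spread))
    # Large W[:, k] times large H[k, j] gives a large entry in the product,
    # so the largest H[k, j] marks the salient feature j.
    return int(np.argmax(H[k]))


y = np.array([1, 1, 1, 2, 2, 2])

# Hypothetical factors of a first binary problem: column 0 of W separates.
W1 = np.array([[0.9, 0.5], [0.8, 0.4], [0.9, 0.5],
               [0.1, 0.5], [0.1, 0.4], [0.2, 0.5]])
H1 = np.array([[0.1, 0.2, 0.9, 0.1],
               [0.6, 0.5, 0.4, 0.6]])

# Hypothetical factors of a second binary problem: column 1 of W separates.
W2 = np.array([[0.5, 0.1], [0.4, 0.2], [0.5, 0.1],
               [0.5, 0.9], [0.4, 0.8], [0.5, 0.9]])
H2 = np.array([[0.5, 0.5, 0.4, 0.5],
               [0.8, 0.1, 0.2, 0.1]])

F1 = select_feature(W1, H1, y)
F2 = select_feature(W2, H2, y)
F = sorted({F1, F2})  # union of the per-problem selections
```

Using a set for the union means a feature chosen by several binary problems appears only once in `F`.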

This embodiment extracts features 1, 7 and 12 of the Wine data set and plots a three-dimensional visualization of the data set, shown in Fig. 5. The figure shows that, with the features selected by the method of the present invention, the three varieties of wine are well separated. With a KNN classifier, the classification rate of subset F reaches 94.94%.
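A check of this kind can be sketched with a small k-nearest-neighbour evaluation. The text does not state the value of k or the validation protocol behind the reported rate, so k = 3 and leave-one-out evaluation are assumptions here, `knn_loo_accuracy` is a hypothetical helper, and the data are synthetic, so the result is not comparable with the figure quoted above.

```python
import numpy as np


def knn_loo_accuracy(X, y, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbour majority vote."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Squared Euclidean distance between every pair of samples.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a sample may not vote for itself
    nearest = np.argsort(d, axis=1)[:, :k]
    correct = 0
    for i, idx in enumerate(nearest):
        values, counts = np.unique(y[idx], return_counts=True)
        correct += int(values[np.argmax(counts)] == y[i])
    return correct / len(y)


# Synthetic stand-in: 3 classes, 6 features, only the first 3 informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 15)
X = rng.normal(size=(45, 6)) * 0.3
X[:, :3] += y[:, None] * 5.0

selected = [0, 1, 2]  # plays the role of the selected subset F
acc_selected = knn_loo_accuracy(X[:, selected], y)
acc_all = knn_loo_accuracy(X, y)
```

Comparing `acc_selected` with `acc_all` on real data indicates how much classification performance the reduced subset retains.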

As the above application shows, the method of the present invention makes feature selection intuitive, gives highly interpretable selection results and ensures the classification accuracy of the selected feature subset. It can help with high-dimensional data analysis in the monitoring and fault diagnosis of complex electromechanical systems and improve the efficiency and accuracy of fault diagnosis.

Claims

1. A visual feature selection method based on fault binary classification non-negative matrix factorization, characterized in that it comprises the following steps:

1) Extract the data set V_{m×n} to be processed, in which the m rows represent samples and the n columns represent features;

2) Apply non-negativity and normalization processing to the data set V_{m×n}, where i = 1, 2, ..., m; j = 1, 2, ..., n; max V_ij is the maximum value of the column vector V_j; and min V_kj is the minimum value of the column vector V_j;

3) Divide the original multi-fault classification problem into multiple binary classification problems by pairwise combination of the classes. Assuming that V_{m×n} contains N classes of samples, the feature set corresponding to each binary classification problem is denoted P_i, where i indexes the pairwise combinations of the N classes (N(N−1)/2 in total);

4) Factorize each non-negative feature set P_i with the least-squares iterative algorithm, that is, P_i = W_i H_i;

Randomly initialize W_i and H_i. The low-dimensional embedding dimension r_i is preferably chosen equal to the number of sample classes. Non-negative matrix factorization yields the basis matrix W_i and the coefficient matrix H_i by iterative update rules,

where W_i is the basis matrix obtained by non-negative matrix factorization of the feature set P_i, W_i^T denotes the transpose of the basis matrix W_i, H_i is the coefficient matrix obtained by non-negative matrix factorization of the feature set P_i, and H_i^T denotes the transpose of the coefficient matrix H_i;

5) Visualize the basis matrix W_i and the coefficient matrix H_i as heat maps; the rows of the basis matrix W_i correspond to samples, and the columns of the coefficient matrix H_i correspond to the original features;

6) Observe the clustering in the basis matrix W_i. If the two classes of samples are clearly separated in one of the columns of W_i, proceed to step 7); otherwise choose a new low-dimensional embedding dimension r_i and return to step 4);

7) Select the classification feature F_i from the heat map using the salient-expression principle, adjusting the heat map threshold so that the number of classification features is 1;

8) Take the union of the classification features F_i obtained for all binary classification problems to obtain the final classification feature set F, i.e. the union of all F_i.