CN106156789A - Effective feature sample identification technique for enhancing classifier generalization performance - Google Patents

Effective feature sample identification technique for enhancing classifier generalization performance

Info

Publication number
CN106156789A
CN106156789A (application CN201610303447.5A)
Authority
CN
China
Prior art keywords
sigma
cluster
feature sample
formula
clustering
Prior art date
Legal status
Pending
Application number
CN201610303447.5A
Other languages
Chinese (zh)
Inventor
焦卫东
杨志强
Current Assignee
Zhejiang Normal University CJNU
Original Assignee
Zhejiang Normal University CJNU
Priority date
Filing date
Publication date
Application filed by Zhejiang Normal University CJNU
Priority to CN201610303447.5A
Publication of CN106156789A
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411: Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/23: Clustering techniques


Abstract

The invention discloses an effective feature sample identification technique for enhancing classifier generalization performance, characterized in that the method comprises the following steps: 1) establishment of the classifier generalization performance evaluation index; 2) construction of the fuzzy clustering criterion; 3) primary clustering partition of the feature sample set; 4) definition of the intra-class and inter-class average distances; 5) establishment of the initial cluster preference criterion; 6) secondary clustering identification of the feature sample set. The beneficial effects of the invention are a reasonable design, simple use, effective removal of noise points and outliers, and a high recognition rate of valid feature samples.

Description

Effective Feature Sample Identification Technique for Enhancing Classifier Generalization Performance

Technical Field

Building on signal processing theory, the present invention proposes an effective feature sample identification method based on data cluster analysis. It exploits the automatic pattern-partitioning property of cluster analysis to remove outliers and noise points from the feature data, thereby purifying the feature data and, on that basis, improving the generalization performance of the support vector machine classifier. The method lays a foundation for solving the accurate pattern recognition and classification problems that arise in mechanical fault diagnosis.

Background Art

The support vector machine (SVM), grounded in statistical learning theory, offers clear advantages in pattern classification and has been successfully applied to fault diagnosis. In theory, the optimal separating surface of an SVM is determined by the support vectors lying at the class margins, yet outliers (wild values) and noise points near the class margins are often mixed with valid samples. The resulting separating surface is therefore not optimal, which degrades the generalization performance of the classifier [1,2].

In practical fault diagnosis applications, external interference around the object under diagnosis and internal noise of the acquisition system may introduce noise into the raw observation data; abnormal or faulty sensors, abnormal fluctuations of force or motion in the system, or merely a change in operating conditions can also produce abnormal observation outliers. If these noise values or outliers in the raw data are not handled properly, they enter the feature space along with feature extraction and form noise points or outliers that clearly deviate from the overall class characteristics. In addition, many other factors negatively affect fault diagnosis, such as the redundancy of sensed observation information caused by the scattering and reverberation of vibration as it propagates through the mechanical structure, and an excessively high feature dimension chosen during feature extraction. Information redundancy makes subsequent feature extraction more difficult and further amplifies the negative effect of noise and outliers; an excessively high feature dimension makes the estimation of sample statistics more difficult, thereby reducing the generalization ability of the classifier [3]. The feature data must therefore first be purified before effective diagnosis is possible.

Summary of the Invention

The purpose of the present invention is to solve the above problems by developing an effective feature sample identification technique for enhancing classifier generalization performance.

To achieve the above purpose, the technical solution of the present invention is an effective feature sample identification technique for enhancing classifier generalization performance, characterized in that the method comprises the following steps:

1) establishment of the classifier generalization performance evaluation index;

2) construction of the fuzzy clustering criterion;

3) primary clustering partition of the feature sample set;

4) definition of the intra-class and inter-class average distances;

5) establishment of the initial cluster preference criterion;

6) secondary clustering identification of the feature sample set.

The classifier generalization performance evaluation index is established as:

$$R(w)=R_{\mathrm{emp}}(w)+\Phi(h/l),$$

$$h\le\min\!\left(\left[r^{2}a^{2}\right],\,n\right)+1.$$

where Φ(·) is the confidence risk function, h is the VC dimension of the classification function, and l is the number of training samples. The true risk R(w) consists of two parts, the empirical risk R_emp(w) and the confidence risk Φ(·). [·] denotes taking the integer part. r is the radius of the smallest hypersphere enclosing all mapped points in the high-dimensional feature space, a bounds the norm of the separating hyperplane's weight vector, and n is the dimension of that space.
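As an illustration of this bound, the following Python sketch estimates the enclosing hypersphere radius and evaluates h ≤ min([r²a²], n) + 1. The radius approximation (largest distance from the sample mean) and the parameter names are assumptions made for this sketch, not part of the patent text.

```python
import numpy as np

def vc_dimension_bound(X, a, n_dim):
    """Upper bound h <= min([r^2 * a^2], n) + 1 on the VC dimension.

    Illustrative sketch: the enclosing-hypersphere radius r is approximated by
    the largest distance from the sample mean, and `a` is a user-supplied bound
    on the weight-vector norm; neither choice is prescribed by the patent text.
    """
    center = X.mean(axis=0)
    r = float(np.max(np.linalg.norm(X - center, axis=1)))  # approximate radius
    h = min(int(np.floor(r ** 2 * a ** 2)), n_dim) + 1     # [.] = integer part
    return r, h

# toy usage
X = np.random.default_rng(0).normal(size=(50, 4))
print(vc_dimension_bound(X, a=1.0, n_dim=X.shape[1]))
```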

The fuzzy clustering criterion is constructed as:

$$\min J_{\mathrm{FCM}}(U,V)=\sum_{k=1}^{n}\sum_{i=1}^{c}\left(u_{ik}\right)^{m}\left(d_{ik}\right)^{2}.$$

where d_ik = ||x_k - v_i|| is the distance between sample x_k and cluster center v_i, usually measured with the Euclidean metric, and m is the fuzzy weighting exponent, typically m = 2. J_FCM(U, V) is the sum of squared distances from the samples of all classes to the cluster centers, weighted by the m-th power of the membership degree of sample x_k in class i.
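A minimal Python sketch of this objective, assuming samples in the rows of X, centers in the rows of V, and a membership matrix U with one row per cluster (the naming conventions are illustrative, not taken from the patent):

```python
import numpy as np

def fcm_objective(X, V, U, m=2.0):
    """J_FCM(U, V) = sum_k sum_i (u_ik)^m * d_ik^2 with Euclidean d_ik.

    X: (n, p) feature samples, V: (c, p) cluster centers,
    U: (c, n) membership degrees whose columns sum to 1.
    """
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # (c, n) squared distances
    return float(((U ** m) * d2).sum())
```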

The primary clustering partition of the feature sample set is computed as:

$$v_{i}^{(l)}=\sum_{k=1}^{n}\left(u_{ik}^{(l)}\right)^{m}x_{k}\Big/\sum_{k=1}^{n}\left(u_{ik}^{(l)}\right)^{m},\quad i=1,\ldots,c,$$

$$u_{ik}^{(l+1)}=1\Big/\sum_{j=1}^{c}\left(\frac{d_{ik}}{d_{jk}}\right)^{\frac{2}{m-1}},\quad\forall i,\ \forall k.$$

where the number of clusters c, the fuzzy weighting exponent m, and the initial membership matrix U_0 are set in advance, with iteration counter l = 0. For a given stopping threshold ε > 0, the iteration is repeated until max{|u_ik^(l) - u_ik^(l-1)|} < ε, at which point the algorithm terminates; otherwise l = l + 1 and the algorithm continues.
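The iteration can be sketched in Python as follows; the random initialization of U_0 and the max_iter safeguard are implementation choices for this sketch, not requirements stated in the patent.

```python
import numpy as np

def fcm_partition(X, c=3, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Primary fuzzy c-means partition following the update equations above."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)                  # columns of U_0 sum to 1
    for _ in range(max_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)     # v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                          # guard against zero distances
        ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=1)                # u_ik = 1 / sum_j (d_ik/d_jk)^(2/(m-1))
        if np.max(np.abs(U_new - U)) < eps:            # stop when max |u^(l) - u^(l-1)| < eps
            return V, U_new
        U = U_new
    return V, U
```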

The intra-class average distance and the inter-class average distance are defined as:

$$\delta_{\mathrm{inner}}=\sum_{i=1}^{n_{O}-1}\sum_{j=i+1}^{n_{O}}\left\|x_{i}-x_{j}\right\|\Big/C_{n_{O}}^{2},\qquad\delta_{\mathrm{inter}}=\sum_{i=1}^{c-1}\sum_{j=i+1}^{c}\left\|v_{i}-v_{j}\right\|\Big/C_{c}^{2}.$$

where C_{n_O}^2 is the number of pairwise combinations of the data samples in cluster {X_O}; v_i and v_j are the centers of the i-th cluster {X_i} and the j-th cluster {X_j}, respectively; and C_c^2 is the number of pairwise combinations of the c cluster indices.
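A direct Python sketch of the two definitions, assuming every cluster contains at least two samples (the function names are illustrative):

```python
import numpy as np
from itertools import combinations

def intra_class_average_distance(cluster):
    """delta_inner: average pairwise distance over the C(n_O, 2) sample pairs of one cluster."""
    pairs = list(combinations(range(len(cluster)), 2))
    return sum(np.linalg.norm(cluster[i] - cluster[j]) for i, j in pairs) / len(pairs)

def inter_class_average_distance(centers):
    """delta_inter: average pairwise distance over the C(c, 2) pairs of cluster centers."""
    pairs = list(combinations(range(len(centers)), 2))
    return sum(np.linalg.norm(centers[i] - centers[j]) for i, j in pairs) / len(pairs)
```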

The initial cluster preference criterion is established as:

$$\{X_{f}\}\Leftarrow\{X_{O}\},\ \ \text{s.t.}\ \max_{O}\left[n_{O}\,C_{n_{O}}^{2}\Big/\sum_{i=1}^{n_{O}-1}\sum_{j=i+1}^{n_{O}}\left\|x_{i}-x_{j}\right\|\right],$$

$$\{X_{n}\}\Leftarrow\{X_{O}\},\ \ \text{s.t.}\ \min_{O}\left[n_{O}\,C_{n_{O}}^{2}\Big/\sum_{i=1}^{n_{O}-1}\sum_{j=i+1}^{n_{O}}\left\|x_{i}-x_{j}\right\|\right].$$

where {X_f} is the initial valid cluster, of size n_f and center v_f, formed by the valid feature samples contained in the c clusters (usually c ≥ 3), and {X_n} is the initial invalid cluster, of size n_n and center v_n, formed mainly by noise points or outliers. Since δ_inner is the summed pairwise distance divided by C_{n_O}^2, the ratio being maximized (or minimized) equals n_O/δ_inner, so the preferred valid cluster is the one that is both largest and most compact.
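A Python sketch of the preference rule, scoring each cluster by n_O · C(n_O, 2) divided by its summed pairwise distances and taking the argmax as the valid cluster and the argmin as the invalid one. It assumes every cluster contains at least two distinct samples; the function and variable names are illustrative.

```python
import numpy as np
from itertools import combinations

def select_initial_clusters(clusters):
    """Return the indices of the initial valid cluster {X_f} and invalid cluster {X_n}.

    `clusters` is a list of (n_O, p) arrays produced by the primary partition.
    """
    def score(C):
        n_o = len(C)
        pair_sum = sum(np.linalg.norm(C[i] - C[j])
                       for i, j in combinations(range(n_o), 2))
        return n_o * (n_o * (n_o - 1) // 2) / pair_sum   # size-weighted inverse of delta_inner
    scores = [score(C) for C in clusters]
    return int(np.argmax(scores)), int(np.argmin(scores))  # ({X_f}, {X_n})
```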

The secondary clustering identification of the feature sample set is computed as:

$$\text{if}\ \left\|v_{d}-v_{f}\right\|<\delta_{\mathrm{inter}},\ \text{then}\ \{X_{d}\}\Rightarrow\{X_{f}\};$$

$$\text{else}\ \{X_{s}\}\Rightarrow\{X_{f}\}\ \text{and}\ \{X_{t}\}\Rightarrow\{X_{n}\},\ \ \text{s.t.}\ \min_{s}\left[\sum_{i=1}^{n_{s}-1}\sum_{j=i+1}^{n_{s}}\left\|x_{i}-x_{j}\right\|\cdot\left\|v_{s}-v_{f}\right\|\Big/C_{n_{s}}^{2}\right].$$

where {X_s} is a combined subset of size n_s and center v_s drawn from {X_d} that satisfies the minimization criterion. After the valid samples have been extracted, the data samples remaining in {X_d} form the subset {X_t}, which is merged into the invalid cluster {X_n} of noise points and outliers. Equation (18) then performs the secondary partition of the invalid cluster, where x_i is a data sample of the invalid cluster {X_n} formed by the main partition and X_near is the data sample of the valid cluster {X_f} closest to the center v_n of the invalid cluster {X_n}.
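The decision rule can be sketched as follows; the exhaustive subset search shown here is only one illustrative way to satisfy the minimization condition and is practical only for small clusters, so it should be read as an assumption of this sketch rather than the method prescribed by the patent.

```python
import numpy as np
from itertools import combinations

def secondary_assignment(X_d, v_f, delta_inter):
    """Secondary clustering decision for one undecided cluster {X_d}.

    If its center lies within delta_inter of the valid center v_f, the whole
    cluster is accepted into {X_f}; otherwise the subset {X_s} minimising
    (summed pairwise distances in X_s) * ||v_s - v_f|| / C(n_s, 2) is accepted
    and the remainder {X_t} is sent to the invalid cluster.
    """
    v_d = X_d.mean(axis=0)
    if np.linalg.norm(v_d - v_f) < delta_inter:
        return X_d, X_d[:0]                              # everything accepted
    best_score, best_idx = np.inf, None
    for n_s in range(2, len(X_d) + 1):
        for idx in combinations(range(len(X_d)), n_s):
            S = X_d[list(idx)]
            pair_sum = sum(np.linalg.norm(S[i] - S[j])
                           for i, j in combinations(range(n_s), 2))
            score = pair_sum * np.linalg.norm(S.mean(axis=0) - v_f) / (n_s * (n_s - 1) / 2)
            if score < best_score:
                best_score, best_idx = score, set(idx)
    accepted = np.array([i in best_idx for i in range(len(X_d))])
    return X_d[accepted], X_d[~accepted]                 # ({X_s}, {X_t})
```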

Brief Description of the Drawings

Figure 1 is a schematic flow chart of the effective feature sample identification technique for enhancing classifier generalization performance according to the present invention.

Figure 2 is a schematic diagram of the SVM classifier generalization performance evaluation principle.

Figure 3 shows the hypersphere domain description in the feature space.

Figure 4 shows the identification results for valid feature samples of the normal state.

Figure 5 shows the identification results for valid feature samples of gear tooth damage.

Figure 6 shows the identification results for valid feature samples of machine base looseness.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings. Figure 1 is a schematic flow chart of the effective feature sample identification technique for enhancing classifier generalization performance according to the present invention. A valid feature sample identification method based on the removal of noise points and outliers is used to purify and pre-process the feature samples in the feature space.

This technical solution takes the identification of valid feature samples for three gearbox modes (normal state, gear tooth damage, and machine base looseness) as an example to explain the feature sample purification pre-processing. Its basic principle is as follows: based on the structural risk minimization (SRM) principle of statistical learning theory, the generalization performance of the SVM classifier is maximized by applying two purification passes to the feature samples of the several fault-mode classes according to a hierarchical processing principle, yielding valid feature samples for classifier design. The SVM classifier generalization performance evaluation principle is shown in Figure 2, namely

$$R(w)=R_{\mathrm{emp}}(w)+\Phi(h/l),$$

$$h\le\min\!\left(\left[r^{2}a^{2}\right],\,n\right)+1.$$

where Φ(·) is the confidence risk function, h is the VC dimension of the classification function, and l is the number of training samples. The true risk R(w) consists of two parts, the empirical risk R_emp(w) and the confidence risk Φ(·). [·] denotes taking the integer part. r is the radius of the smallest hypersphere enclosing all mapped points in the high-dimensional feature space. The hypersphere domain description in the feature space is shown in Figure 3.

Example 1

Identification of valid feature samples for the normal state

The clustering criterion is established for the normal-state feature sample set and two successive clustering partitions are performed; the identification results for the valid feature samples are shown in Figure 4.

The clustering criterion is established as given by the fuzzy clustering criterion above.

The primary clustering partition of the normal-state feature sample set is performed with the iterative update formulas given above.

The intra-class and inter-class average distances for the normal state are defined as above.

The initial cluster preference criterion for the normal-state feature samples is established as above.

The secondary clustering identification of the normal-state feature sample set is computed as above.

Example 2

Identification of valid feature samples for gear tooth damage

The clustering criterion is established for the gear tooth damage feature sample set and two successive clustering partitions are performed; the identification results for the valid feature samples are shown in Figure 5.

The clustering criterion is established as given by the fuzzy clustering criterion above.

The primary clustering partition of the gear tooth damage feature sample set is performed with the iterative update formulas given above.

The intra-class and inter-class average distances for gear tooth damage are defined as above.

The initial cluster preference criterion for the gear tooth damage feature samples is established as above.

The secondary clustering identification of the gear tooth damage feature sample set is computed as above.

Example 3

Identification of valid feature samples for machine base looseness

The clustering criterion is established for the machine base looseness feature sample set and two successive clustering partitions are performed; the identification results for the valid feature samples are shown in Figure 6.

The clustering criterion is established as given by the fuzzy clustering criterion above.

The primary clustering partition of the machine base looseness feature sample set is performed with the iterative update formulas given above.

The intra-class and inter-class average distances for machine base looseness are defined as above.

The initial cluster preference criterion for the machine base looseness feature samples is established as above.

The secondary clustering identification of the machine base looseness feature sample set is computed as above.

References

[1] Du Zhe, Liu Sanyang, Qi Xiaogang. A fuzzy support vector machine with a new membership function. Journal of System Simulation, 2009, 21(7): 1901-1903.

[2] Ding Shifei, Qi Bingjuan, Tan Hongyan. A review of support vector machine theory and algorithms. Journal of University of Electronic Science and Technology of China, 2011, 40(1): 2-10.

[3] Zhang Biao. Analysis and research on feature selection algorithms in text classification. Master's thesis, University of Science and Technology of China, Hefei, 2010.

The above technical solution merely reflects a preferred embodiment of the present invention. Modifications that those skilled in the art may make to certain parts of it embody the principles of the present invention and fall within its scope of protection.

Claims (7)

1. An effective feature sample identification technique for enhancing classifier generalization performance, characterized in that the method comprises the following steps: 1) establishment of the classifier generalization performance evaluation index; 2) construction of the fuzzy clustering criterion; 3) primary clustering partition of the feature sample set; 4) definition of the intra-class and inter-class average distances; 5) establishment of the initial cluster preference criterion; 6) secondary clustering identification of the feature sample set.

2. The effective feature sample identification technique for enhancing classifier generalization performance according to claim 1, characterized in that the classifier generalization performance evaluation index is established as

$$R(w)=R_{\mathrm{emp}}(w)+\Phi(h/l),\qquad h\le\min\!\left(\left[r^{2}a^{2}\right],\,n\right)+1,$$

where Φ(·) is the confidence risk function, h is the VC dimension of the classification function, and l is the number of training samples. The true risk R(w) consists of two parts, the empirical risk R_emp(w) and the confidence risk Φ(·); [·] denotes taking the integer part; r is the radius of the smallest hypersphere enclosing all mapped points in the high-dimensional space.

3. The effective feature sample identification technique for enhancing classifier generalization performance according to claim 1, characterized in that the fuzzy clustering criterion is constructed as

$$\min J_{\mathrm{FCM}}(U,V)=\sum_{k=1}^{n}\sum_{i=1}^{c}\left(u_{ik}\right)^{m}\left(d_{ik}\right)^{2},$$

where d_ik = ||x_k - v_i|| is the distance between sample x_k and cluster center v_i, usually the Euclidean distance; m is the fuzzy weighting exponent, typically m = 2; and J_FCM(U, V) is the sum of squared distances from the samples of all classes to the cluster centers, weighted by the m-th power of the membership degree of sample x_k in class i.

4. The effective feature sample identification technique for enhancing classifier generalization performance according to claim 1, characterized in that the primary clustering partition of the feature sample set is computed as

$$v_{i}^{(l)}=\sum_{k=1}^{n}\left(u_{ik}^{(l)}\right)^{m}x_{k}\Big/\sum_{k=1}^{n}\left(u_{ik}^{(l)}\right)^{m},\quad i=1,\ldots,c,\qquad u_{ik}^{(l+1)}=1\Big/\sum_{j=1}^{c}\left(\frac{d_{ik}}{d_{jk}}\right)^{\frac{2}{m-1}},\quad\forall i,\ \forall k,$$

where the number of clusters c, the fuzzy weighting exponent m, and the initial membership matrix U_0 are set in advance, with iteration counter l = 0; for a given stopping threshold ε > 0, the iteration is repeated until max{|u_ik^(l) - u_ik^(l-1)|} < ε, at which point the algorithm terminates; otherwise l = l + 1 and the algorithm continues.

5. The effective feature sample identification technique for enhancing classifier generalization performance according to claim 1, characterized in that the intra-class and inter-class average distances are defined as

$$\delta_{\mathrm{inner}}=\sum_{i=1}^{n_{O}-1}\sum_{j=i+1}^{n_{O}}\left\|x_{i}-x_{j}\right\|\Big/C_{n_{O}}^{2},\qquad\delta_{\mathrm{inter}}=\sum_{i=1}^{c-1}\sum_{j=i+1}^{c}\left\|v_{i}-v_{j}\right\|\Big/C_{c}^{2},$$

where C_{n_O}^2 is the number of pairwise combinations of the data samples in cluster {X_O}; v_i and v_j are the centers of the i-th cluster {X_i} and the j-th cluster {X_j}, respectively; and C_c^2 is the number of pairwise combinations of the c cluster indices.

6. The effective feature sample identification technique for enhancing classifier generalization performance according to claim 1, characterized in that the initial cluster preference criterion is established as

$$\{X_{f}\}\Leftarrow\{X_{O}\},\ \ \text{s.t.}\ \max_{O}\left[n_{O}\,C_{n_{O}}^{2}\Big/\sum_{i=1}^{n_{O}-1}\sum_{j=i+1}^{n_{O}}\left\|x_{i}-x_{j}\right\|\right];\qquad\{X_{n}\}\Leftarrow\{X_{O}\},\ \ \text{s.t.}\ \min_{O}\left[n_{O}\,C_{n_{O}}^{2}\Big/\sum_{i=1}^{n_{O}-1}\sum_{j=i+1}^{n_{O}}\left\|x_{i}-x_{j}\right\|\right],$$

where {X_f} is the initial valid cluster, of size n_f and center v_f, formed by the valid feature samples contained in the c clusters (usually c ≥ 3), and {X_n} is the initial invalid cluster, of size n_n and center v_n, formed mainly by noise points or outliers.

7. The effective feature sample identification technique for enhancing classifier generalization performance according to claim 1, characterized in that the secondary clustering identification of the feature sample set is computed as

$$\text{if}\ \left\|v_{d}-v_{f}\right\|<\delta_{\mathrm{inter}},\ \text{then}\ \{X_{d}\}\Rightarrow\{X_{f}\};\ \text{else}\ \{X_{s}\}\Rightarrow\{X_{f}\}\ \text{and}\ \{X_{t}\}\Rightarrow\{X_{n}\},\ \ \text{s.t.}\ \min_{s}\left[\sum_{i=1}^{n_{s}-1}\sum_{j=i+1}^{n_{s}}\left\|x_{i}-x_{j}\right\|\cdot\left\|v_{s}-v_{f}\right\|\Big/C_{n_{s}}^{2}\right],$$

where {X_s} is a combined subset of size n_s and center v_s drawn from {X_d} that satisfies the minimization criterion; after the valid samples have been extracted, the data samples remaining in {X_d} form the subset {X_t}, which is merged into the invalid cluster {X_n} of noise points and outliers. Equation (18) then performs the secondary partition of the invalid cluster, where x_i is a data sample of the invalid cluster {X_n} formed by the main partition and X_near is the data sample of the valid cluster {X_f} closest to the center v_n of the invalid cluster {X_n}.
Application CN201610303447.5A was filed on 2016-05-09 with a priority date of 2016-05-09 under the title "Effective feature sample identification technique for enhancing classifier generalization performance"; it was published as CN106156789A on 2016-11-23 and its legal status is Pending. Family ID: 57352810; country: CN (China).

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109239585A * 2018-09-06 2019-01-18 南京理工大学 Fault diagnosis method based on improved preferred wavelet packets


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification methods for imbalanced data
CN104794482A (en) * 2015-03-24 2015-07-22 江南大学 Inter-class maximization clustering algorithm based on improved kernel fuzzy C mean value
CN105447520A (en) * 2015-11-23 2016-03-30 盐城工学院 Sample classification method based on weighted PTSVM (projection twin support vector machine)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
焦卫东 et al.: "An overall improved fault diagnosis method based on support vector machines", 《仪器仪表学报》 (Chinese Journal of Scientific Instrument) *


Similar Documents

Publication Publication Date Title
Zhang et al. A new interpretable learning method for fault diagnosis of rolling bearings
Zhou et al. Dynamic graph-based feature learning with few edges considering noisy samples for rotating machinery fault diagnosis
Shao et al. A deep learning approach for fault diagnosis of induction motors in manufacturing
Oh et al. Scalable and unsupervised feature engineering using vibration-imaging and deep learning for rotor system diagnosis
CN110132598B (en) A Fault Noise Diagnosis Algorithm for Rolling Bearings of Rotating Equipment
CN109582003B (en) Bearing fault diagnosis method based on pseudo label semi-supervised kernel local Fisher discriminant analysis
CN106682688B (en) Bearing fault diagnosis method based on particle swarm optimization with stacked noise reduction self-encoding network
Wang et al. Attention-aware temporal–spatial graph neural network with multi-sensor information fusion for fault diagnosis
CN107316057B (en) Nuclear power plant fault diagnosis method
Gong et al. Implementation of machine learning for fault classification on vehicle power transmission system
CN107677472A (en) The bearing state noise diagnostics algorithm that network-oriented Variable Selection merges with Characteristic Entropy
CN105134619B (en) A kind of fault diagnosis based on wavelet energy, manifold dimension-reducing and dynamic time warping and health evaluating method
CN104502103A (en) Bearing fault diagnosis method based on fuzzy support vector machine
Lu et al. Feature extraction using adaptive multiwavelets and synthetic detection index for rotor fault diagnosis of rotating machinery
Li et al. Maximum margin Riemannian manifold-based hyperdisk for fault diagnosis of roller bearing with multi-channel fusion covariance matrix
Xu et al. A method combining refined composite multiscale fuzzy entropy with PSO-SVM for roller bearing fault diagnosis
CN105678343A (en) Adaptive-weighted-group-sparse-representation-based diagnosis method for noise abnormity of hydroelectric generating set
Zhang et al. A bearing fault diagnosis method based on multiscale dispersion entropy and GG clustering
Xu et al. Automatic roller bearings fault diagnosis using DSAE in deep learning and CFS algorithm
CN116451022A (en) Adaptive bearing fault diagnosis method based on depth discrimination reactance domain
CN111611867A (en) Intelligent Fault Diagnosis Method of Rolling Bearing Based on Multi-class Fuzzy Correlation Vector Machine
CN117056849A (en) Unsupervised method and system for monitoring abnormal state of complex mechanical equipment
Lu et al. A zero-shot intelligent fault diagnosis system based on EEMD
CN111428772B (en) Photovoltaic system depth anomaly detection method based on k-nearest neighbor adaptive voting
CN106156789A (en) Towards the validity feature sample identification techniques strengthening grader popularization performance

Legal Events

C06: Publication
PB01: Publication
SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 2016-11-23)