WO2019041629A1 - SVM-based classification method for high-dimensional imbalanced data - Google Patents

SVM-based classification method for high-dimensional imbalanced data Download PDF

Info

Publication number
WO2019041629A1
WO2019041629A1 (PCT/CN2017/115847)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
svm
algorithm
space
boundary
Prior art date
Application number
PCT/CN2017/115847
Other languages
English (en)
French (fr)
Inventor
张春慨 (Chunkai Zhang)
Original Assignee
哈尔滨工业大学深圳研究生院 (Harbin Institute of Technology Shenzhen Graduate School)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen Graduate School (哈尔滨工业大学深圳研究生院)
Publication of WO2019041629A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods

Definitions

  • The invention belongs to the technical field of data classification, and in particular relates to a method for classifying imbalanced samples.
  • Current classification methods for high-dimensional imbalanced data first solve either the high-dimensional problem or the imbalance problem and then the other, without considering the new problems that high dimensionality brings to imbalanced-data classification or the effect of the imbalance on high-dimensional data classification.
  • The classification of imbalanced data is mainly carried out at two levels: sampling at the data level and classification at the algorithm level.
  • Sampling at the data level is one of the important means of correcting an imbalanced data distribution in the sample space. Through undersampling, resampling, and hybrid sampling, the sample space with imbalanced class counts is reconstructed so that the originally imbalanced data tends toward numerical balance; this reduces the impact of the imbalance on subsequent classification and prevents the classifier from chasing global accuracy by over-attending to the accuracy of the majority classes while ignoring the accuracy on the minority classes that people actually care about [23].
  • A large number of experimental studies have shown that sampling can significantly improve the classification of imbalanced data, and sampling methods are now widely used in the field of imbalanced sample classification.
  • Undersampling methods delete certain samples according to certain rules so as to improve the classification result.
  • In 1997, Kubat et al. proposed one-sided selection, a method that partitions sample points into different types based on the Euclidean distances between them and samples accordingly. Its main idea is to inspect the classes of the K sample points nearest to a given sample and, according to how those K classes differ from the sample's own class, assign the sample to one of four types: safe samples, redundant samples, boundary samples, and noise samples. Safe and redundant samples lie toward the interior of their clusters, so even when they belong to the minority class a traditional classifier can recognize them at a fairly high level; boundary and noise samples sit where several classes mix spatially, are called "unsafe samples," and usually require more attention from the classifier.
  • Based on this spatial distribution of the samples, one-sided selection removes the "unsafe samples" of the majority class while retaining the boundary, redundant, and safe samples of the minority class, so that the sample space becomes as separable as possible.
  • The SMOTE (synthetic minority over-sampling technique) algorithm proposed by Chawla et al. is a classical oversampling method that has been widely applied to imbalanced data, and many improved oversampling methods have been derived from it.
  • The main idea of the SMOTE algorithm is to pick, for a given minority sample, one of its k nearest minority-class neighbors at random and then interpolate on the line segment between the two minority samples to generate a synthetic minority sample, per equation (1) in the description below.
  • Although the SMOTE algorithm changes the imbalance ratio between the majority and minority classes, generating a synthetic minority sample between two real minority samples alters the variance, covariance, class density, and other statistics of the original sample space.
  • Because the samples generated by SMOTE are random, the method avoids overfitting the training data and also better expands the decision space of the minority class.
  • Many oversampling methods improve on SMOTE, such as the Borderline-SMOTE method proposed by Han et al., which interpolates only among boundary samples.
  • Another family of sampling methods focuses on how the sampling rate is set; SBC is a typical example. It holds that the clusters of the sample space differ in importance because their spatial distributions differ, so the same sampling rate should not be applied to all samples of a class. Based on this idea, SBC clusters the majority class into several clusters and undersamples each majority cluster at its own rate, reducing the sample count of each cluster to a different degree.
  • The processing of high-dimensional data mainly involves dimensionality reduction and feature selection.
  • Linear Discriminant Analysis, a classical supervised dimension-reduction classification method, has long been widely used in the processing of high-dimensional data.
  • LDA seeks a projection after which samples of different classes are as far apart as possible and samples of the same class are as close as possible; the original sample space is projected along the direction that maximizes the ratio of the between-class distance to the within-class distance.
  • The LDA method is one of the most used methods in pattern recognition and image processing, and it achieves very good classification results when the classes are well separated and problems of data fragmentation and blurred boundaries are rare.
  • With C classes in total, however, the sample space after reduction has at most C-1 dimensions, so when high-dimensional data is also imbalanced the feature space is extremely compressed, and minority classes may be covered by majority classes, with samples of different classes sharing the same attributes after reduction.
  • Unsupervised dimensionality-reduction methods ignore class information and instead seek to restore certain properties of the original sample space during reduction.
  • The classic PCA (Principal Component Analysis), for example, chooses projection directions according to the variance distribution along different directions of the original feature space, so that the variance distribution is preserved as much as possible after reduction.
  • Current feature-selection methods can be divided into three categories according to the relationship between the feature-selection process and the classifier-training process: filter, wrapper, and embedded methods.
  • The support vector machine recursive feature elimination method SVM-RFE finds the weight of each attribute in each iteration; the magnitude of a weight represents how much attention the SVM pays to that feature, and the optimal feature combination is selected by repeatedly eliminating the features with relatively low weights.
  • The support vector machine backward feature elimination method SVM-BFE eliminates one feature per training round, keeps the feature combination that performs best after some feature is removed, and carries it into the next round of training.
  • Because SVM-based feature selection aims at classification, it eliminates feature combinations that harm the classification result as well as highly redundant and highly correlated features, thereby finding the feature combination that classifies best; it has achieved a series of good results on high-dimensional data.
  • Since the impact of the imbalance problem on feature selection is not considered, the selection process can easily drift in a direction unfavorable to recognizing the minority class: algorithms that complete feature selection in one pass (such as LASSO) may directly discard feature combinations that are important for identifying the minority class; iterative feature elimination, an improvement on backward elimination, selects features by consulting the classifier's own "perception," eliminating in each round the feature the classifier judges to contribute least while improving the final result most, but it likewise cannot prevent the selection from proceeding toward raising the recognition rate of the majority class.
  • In addition, SMOTE oversampling is the mainstream treatment for imbalance problems and has been widely applied to imbalanced data with good results. In high-dimensional imbalanced data, however, the high-dimensional problem prevents traditional sampling from changing the classifier's bias toward the majority class, rendering traditional sampling meaningless.
  • The experiments in [21] show that although SMOTE can increase the classifier's attention to the minority class in low-dimensional data, the effect is not obvious in high dimensions, mainly because the minority samples SMOTE generates introduce correlation between samples into the new sample space rather than correlation between features, so the generated minority samples cannot faithfully restore the minority-class distribution of the original sample space.
  • The present invention designs an SVM-based classification method for high-dimensional imbalanced data, which solves the classification of high-dimensional imbalanced data sets and achieves good results.
  • An SVM-based high-dimensional imbalanced data classification method comprises two parts: the first part is feature selection and the second part is data sampling. The feature selection part adopts the SVM-BRFE algorithm, which comprises the following steps:
  • First, the SVM is trained to obtain the initial feature weight vector w, the Lagrange multipliers α, and the F1 value.
  • Then, the minority-class samples with α = C are resampled at a single rate and the SVM is retrained on the resampled data so that its separating hyperplane moves in the direction of increasing F1; because every change of the hyperplane also changes the boundary samples, this process is repeated, resampling the new minority boundary at a single rate each time, until the hyperplane maximizing F1 is found, and this w is used as the feature score for one round of feature selection.
  • Finally, iterative feature elimination proceeds from the least to the most important feature, removing in each round the one feature whose elimination raises F1 the most. Since removing a feature also changes the SVM's separating hyperplane and hence the boundary samples, the remaining features must be re-scored to produce a new weight vector w that evaluates each feature's importance in the new feature space.
  • The data sampling part adopts an improved SMOTE algorithm, the PBKS algorithm, which solves the space-conversion problem that arises when the SVM handles imbalanced data because the input space differs from the training space. It exploits the facts that the SVM automatically delimits the sample boundary and that the imbalance problem in an SVM manifests itself mainly as an imbalance of the boundary samples.
  • The PBKS algorithm synthesizes new minority samples from two distinct minority samples in Hilbert space, searches for the approximate pre-image in Euclidean space of each oversampled point, and uses the PSO algorithm to adaptively optimize the sampling rates of the minority boundary samples and the newly generated points, improving the SVM's classification performance.
  • The invention combines the two parts into an algorithm aimed specifically at solving the classification of high-dimensional imbalanced data; the second part solves the new problems that arise after the SVM-based first part has addressed the imbalance problem within the high-dimensional imbalanced classification task.
  • Figure 1 is a flowchart of the solution to the imbalance problem;
  • Figure 2 is a histogram of the AUC values of each algorithm;
  • Figure 3 is a graph of the ROC curves obtained by each algorithm on data set 1;
  • Figure 4 is a graph of the ROC curves obtained by each algorithm on data set 2;
  • Figure 5 is a graph of the ROC curves obtained by each algorithm on data set 3;
  • Figure 6 is a graph of the ROC curves obtained by each algorithm on data set 4;
  • Figure 7 is a graph of the ROC curves obtained by each algorithm on data set 5;
  • Figure 8 is a graph of the ROC curves obtained by each algorithm on data set 6.
  • By analyzing the SVM-RFE feature-selection process, the present invention finds that the imbalance problem can be taken into account during iterative feature selection by improving the feature-evaluation scheme of the wrapper selection process: the SVM's automatic delimitation of the boundary is used to resample the sample points in Hilbert space so that the F1 value of the support vector machine model improves, and the SVM's feature weight vector w at that point serves as the evaluation criterion for the features.
  • The two are then combined to perform feature selection on high-dimensional imbalanced data while accounting for the imbalance problem, thereby solving the high-dimensional problem.
  • The time complexity of the algorithm is O(d²), where d is the total number of features; the main procedure is as follows.
  • First, the SVM is trained to obtain the initial feature weight vector w, the Lagrange multipliers α, and the F1 value, and these three values are recorded for later comparison.
  • Then, the minority boundary samples with α = C are resampled at a single rate and the SVM is retrained, repeatedly, until the separating hyperplane that maximizes F1 is found; this w is used as the feature score for the round.
  • Finally, iterative feature elimination proceeds from the least to the most important feature, removing in each round the one feature whose elimination raises F1 the most; since each removal changes the separating hyperplane and the boundary samples, the remaining features are re-scored to produce a new weight vector w for the new feature space.
  • The resampling in the feature-selection part does not participate in updating the training set: the minority boundary samples are resampled only to obtain a feature weight vector w that is fair to both the majority and minority classes and thus better measures each feature's importance in high-dimensional imbalanced data, not to directly shift the SVM's attention toward the minority class to raise the direct classification result and F1 value. In other words, the resampling before each round of feature selection only addresses the high-dimensional problem as affected by the imbalance problem; it does not solve the imbalance problem itself.
  • When the maximum F1 value is obtained, the current round of resampling ends, the SVM's weight vector w at the maximum F1 is saved and used to measure feature importance and rank the features, and the resampled copies are then removed.
  • The above process is repeated until the optimal feature subset has been selected.
  • The resampling never changes train_set; train_set is updated only within the feature-selection process, once after each selected feature.
  • The PSO-Border-Kernel-SMOTE (PBKS) oversampling algorithm mainly solves the space-conversion problem that arises when the SVM handles imbalanced classification because the input space differs from the training space. It exploits the facts that the SVM automatically delimits the sample boundary and that the imbalance problem in an SVM is concentrated in the boundary samples.
  • The PBKS algorithm synthesizes new minority samples from two distinct minority samples in Hilbert space, searches for the approximate pre-image in Euclidean space of each oversampled point, and uses the PSO algorithm to adaptively optimize the sampling rates of the minority boundary samples and the newly generated points, improving the SVM's classification performance.
  • Let the implicit mapping from Euclidean space to Hilbert space be as shown in equation (2) of the description, and assume the explicitly defined kernel is a Gaussian kernel. Hereinafter K_ij is written for K(x_i, x_j), the inner product of two Euclidean points x_i and x_j after they are mapped into Hilbert space; the squared distance in Hilbert space is then given by equation (3).
  • The SMOTE algorithm finds the k samples nearest to a sample point x_i, randomly selects one of them, x_j, and interpolates linearly between x_i and x_j. Since the present invention mainly considers oversampling the minority boundary samples, for each minority point inside the boundary in Hilbert space another minority boundary point is randomly selected as the second SMOTE input; the SMOTE oversampling formula in Hilbert space is then equation (6), where λ_ij is a random number in the open interval (0, 1).
  • The present invention uses the minority-class samples inside the boundary automatically delimited by the SVM as the distance constraints in the vector d_{x_ij}, replacing the original constraints, and applies a grid method to find the approximate pre-image.
  • The minority boundary samples delimited in Hilbert space are labeled 1, 2, ..., k, and the upper and lower bounds of the d features over these k minority boundary samples are computed.
  • The bounds are given by equations (10) and (11), where (10) gives the lower bounds and (11) the upper bounds of all minority boundary samples.
  • The granularity of each grid cell is set per equation (12), dividing the boundary minority space into k × d cells; each cell represents a position in Euclidean space, and the goal is to find the cell whose image in Hilbert space is closest to the point produced by oversampling.
  • Specifically, the size of each cell is the maximum of the feature dimension minus its minimum, divided by the total number k of original boundary samples; the subsequent pre-image search scans the whole grid space cell by cell.
  • Equation (12) gives the grid granularity of the i-th feature.
  • In each PSO random grid search, every dimension is incremented by the number of grid cells optimized by PSO to obtain a candidate x_ij, and the searched sample point is taken as one iteration of the solution variable x_ij.
  • The squared cosine between the vectors of equations (7) and (9) is then computed, as in equation (13), until the iterations end.
  • Finally, the point with the largest squared cosine replaces the target solution x_ij as the approximate pre-image of z_ij.
  • To pursue global accuracy, a traditional classifier may simply classify all minority samples into the majority class, obtaining a high global accuracy while its correct rate on the minority samples is zero; in this situation the traditional single evaluation index no longer suits the evaluation of imbalanced sample classification, and special composite indices that weigh multiple aspects are needed to accommodate this special case.
  • These criteria fall into two classes: one is called "atomic criteria" and the other "composite criteria," the latter being combinations of atomic criteria and mathematical theory, proposed after extensive research, that are complex yet well adapted to evaluating imbalanced sample classification.
  • The receiver operating characteristic (ROC) curve is also widely used in evaluating imbalanced sample classification.
  • Equations (14) through (17) list some confusion-matrix-based atomic criteria often used in imbalanced sample classification.
  • F-Measure, shown in equation (17), is the criterion most often applied to imbalanced classification; it is compounded from recall, precision, and a balance factor, and attains a good value only when both Recall and Precision are high.
  • In equation (17), β is the balance factor that weighs recall against precision (usually β is set to 1).
  • The ROC (Receiver Operating Characteristic) curve was proposed by Swets in 1988 and has since been widely used in many fields.
  • The ROC space uses FPRate as the X-axis and TPRate as the Y-axis; by sweeping the decision threshold, pairs of false-positive and true-positive rates are obtained, and connecting these scattered points forms the ROC curve.
  • The ROC curve cannot directly evaluate imbalanced classification quantitatively, so the area under the ROC curve (AUC) was proposed as a quantitative index.
  • The classification performance of a classifier can be evaluated by the area under the ROC curve (i.e., the AUC): the larger the AUC, the better the classification.
  • UCI is a well-known, open machine-learning repository.
  • All experimental data sets of the present invention come from UCI.
  • The experimental data are shown in Table 2.
  • Table 2 describes the specific properties of the data sets used in all experiments, where the No. column is the data set number, Data-Set the data set name, #Attr. the number of attributes contained in the data set, and %Min. the proportion of minority-class samples.
  • The BRFE-PBKS-SVM algorithm is divided into two parts: the first is feature selection and the second is data sampling. Combining them yields an algorithm aimed specifically at high-dimensional imbalanced classification, in which the second part solves the new problems that arise after the SVM-based first part has addressed the imbalance problem.
  • Among the four algorithms compared, BRFE-PBKS-SVM achieved the highest minority-class recall; relative to unimproved SMOTE, the PBKS oversampling algorithm improves minority recall markedly, and as minority recall rises its precision falls somewhat.
  • In terms of global accuracy, BRFE-PBKS-SVM is optimal among all algorithm combinations on data sets 2 through 5. With the same oversampling algorithm, the combinations using the improved BRFE feature selection perform best, because BRFE considers the imbalance problem during feature elimination; with the same feature-selection algorithm, the combinations using the improved PBKS oversampling perform best, because training takes place in the Hilbert space induced by a polynomial or Gaussian kernel, and the points generated by PBKS oversampling better fill the boundary in that Hilbert space and are more reasonably distributed, which improves the classification more.
  • Figure 2 compares the AUC values of the ROC curves of the four algorithms on the six data sets. On all but the second and fourth data sets, BRFE-PBKS-SVM attains the largest AUC, and on the fourth data set, where the improved algorithm fails to obtain the optimal AUC, the gap is only 0.006; overall, BRFE-PBKS-SVM exhibits good stability. Figures 3-8 show that the AUC values of the four SVM-based combinations differ little on each data set, which also attests to the SVM's stability and suitability for classifying high-dimensional imbalanced data.
  • In Figures 3-8, the area enclosed by each curve is the corresponding AUC value in Figure 2. The diagonal represents the worst classification level, with an AUC of 0.5; when a classifier's ROC curve on a data set lies below this diagonal, its AUC is less than 0.5, meaning the classifier performs worse on that data set than a randomly guessing classifier. For example, the ROC curve of BRFE-PBKS-SVM on the fifth data set corresponds to an AUC of 0.993.
  • The six ROC plots show that, except on the second and fourth data sets, the areas enclosed by the four algorithms differ little, all achieve good results, and the improved algorithm attains the largest AUC on those four data sets. On the second and fourth data sets the four algorithms differ considerably and the ROC curves are extremely uneven; BRFE-PBKS-SVM also fails to achieve the best classification there, but its AUC differs little from that of the best algorithm, and all obtain a better ROC area than a random classifier. This shows that the SVM-based BRFE-PBKS-SVM algorithm for high-dimensional imbalanced classification tasks can complete the classification of high-dimensional imbalanced data stably and effectively and achieve considerable results.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An SVM-based classification method for high-dimensional imbalanced data, comprising two parts. The first part is feature selection using the SVM-BRFE algorithm: the boundary is resampled to find the optimal feature weights that measure feature importance, features are selected, the training set is updated, and the process is repeated, ultimately retaining the features most conducive to raising the F1 value while all other features are eliminated. Subsequent training therefore proceeds with as little feature redundancy, as few irrelevant feature combinations, and as low a dimensionality as possible, reducing the influence of the high-dimensional problem on the imbalance problem and its constraint on the SMOTE oversampling algorithm. The second part is data sampling using an improved SMOTE algorithm, the PBKS algorithm, which uses the minority-class samples inside the boundary automatically delimited by the SVM as the distance constraints in $d_{x_{ij}}$ under Hilbert space, replacing the original constraints, and applies a grid method to find the approximate pre-image. The method completes the classification of high-dimensional imbalanced data stably and effectively and achieves considerable results.

Description

SVM-Based Classification Method for High-Dimensional Imbalanced Data

Technical Field

The present invention belongs to the technical field of data classification, and in particular relates to a method for classifying imbalanced samples.

Background Art

In the classification tasks of data mining, current methods for high-dimensional imbalanced data first solve either the high-dimensional problem or the imbalance problem and then the other; they consider neither the new problems that high dimensionality brings to imbalanced-data classification nor the effect of the imbalance on high-dimensional data classification. The classification of imbalanced data is mainly carried out at two levels: sampling at the data level and classification at the algorithm level.

Sampling at the data level is one of the important means of correcting an imbalanced data distribution in the sample space. Through undersampling, resampling, and hybrid sampling, the sample space with imbalanced class counts is reconstructed so that the originally imbalanced data tends toward numerical balance. This reduces the impact of the imbalance on subsequent classification and prevents the classifier from chasing global accuracy by over-attending to the accuracy of the majority classes while ignoring the accuracy on the minority classes that people actually care about [23]. A large number of experimental studies have shown that sampling can significantly improve the classification of imbalanced data, and sampling methods are by now widely used in the field of imbalanced sample classification.

Undersampling methods delete certain samples according to certain rules so as to improve the classification result. In 1997, Kubat et al. proposed one-sided selection, a method that partitions sample points into different types based on the Euclidean distances between them and samples accordingly. Its main idea is to inspect the classes of the K sample points nearest to a given sample and, according to how those K classes differ from the sample's own class, assign the sample to one of four types: safe samples, redundant samples, boundary samples, and noise samples. Safe and redundant samples lie toward the interior of their clusters, so even when they belong to the minority class a traditional classifier can recognize them at a fairly high level; boundary and noise samples sit where several classes mix spatially, are called "unsafe samples," and usually require more attention from the classifier. Based on this spatial distribution, one-sided selection removes the "unsafe samples" of the majority class while retaining the boundary, redundant, and safe samples of the minority class, so that the sample space becomes as separable as possible. A rough sketch of this sample typing is given below.
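The following is an illustrative reconstruction of the sample typing behind one-sided selection, not Kubat's reference code: the thresholds that separate the four types here are assumptions, and scikit-learn supplies the K-nearest-neighbor search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sample_types(X, y, K=5):
    """Label each sample safe / redundant / borderline / noise by how many
    of its K nearest neighbors share its class (thresholds are assumptions)."""
    nn = NearestNeighbors(n_neighbors=K + 1).fit(X)
    _, idx = nn.kneighbors(X)                      # idx[:, 0] is the point itself
    types = []
    for i in range(len(X)):
        same = int(np.sum(y[idx[i, 1:]] == y[i]))  # neighbors of the same class
        if same == K:
            types.append("redundant")              # deep inside its own cluster
        elif same > K // 2:
            types.append("safe")
        elif same >= 1:
            types.append("borderline")             # mixed-class neighborhood
        else:
            types.append("noise")                  # surrounded by other classes
    return types
```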
The SMOTE (synthetic minority over-sampling technique) algorithm proposed by Chawla et al. is a classical oversampling method that has been widely applied to imbalanced data, and many improved oversampling methods have been derived from it. The main idea of the SMOTE algorithm is to pick, for a given minority sample, one of its k nearest minority-class neighbors at random and then interpolate on the line segment between the two minority samples to generate a synthetic minority sample, with the following formula:

$$x_{new} = x_i + rand(0, 1) \times (x_j - x_i) \qquad (1)$$
Although the SMOTE algorithm changes the imbalance ratio between the majority and minority classes, generating a synthetic minority sample between two real minority samples alters the variance, covariance, class density, and other statistics of the original sample space. This restricts dimension-reduction methods that seek to preserve the variance of the sample space, and it also greatly weakens methods such as KNN that classify according to the distribution of the original sample space. On the other hand, because the samples SMOTE generates are random, it avoids overfitting the training data and better expands the decision space of the minority class. Many oversampling methods improve on SMOTE, such as the Borderline-SMOTE method proposed by Han et al., which interpolates among boundary samples. A minimal sketch of SMOTE interpolation follows.
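This is a minimal sketch of SMOTE interpolation per equation (1); the function name and the use of scikit-learn's NearestNeighbors are illustrative assumptions, not the original authors' code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Generate n_new synthetic minority samples by linear interpolation."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                  # idx[:, 0] is the point itself
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))               # a random minority sample x_i
        j = idx[i, rng.integers(1, k + 1)]         # one of its k minority neighbors
        lam = rng.random()                         # rand(0, 1)
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))   # equation (1)
    return np.array(synth)
```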
Another family of sampling methods focuses on how the sampling rate is set; SBC is a typical example. It holds that the clusters of the sample space differ in importance because their spatial distributions differ, so the same sampling rate should not be applied to all samples of a class; the distribution of each cluster in the sample space should be considered. Based on this idea, the SBC algorithm clusters the majority class of the imbalanced data into several clusters and then sets an undersampling ratio for each majority cluster according to certain rules, reducing the number of samples in each majority cluster to a different degree.
The processing of high-dimensional data mainly involves dimensionality reduction and feature selection. Linear Discriminant Analysis (LDA), a classical supervised dimension-reduction classification method, has long been widely used for high-dimensional data. LDA seeks a projection after which samples of different classes are as far apart as possible and samples of the same class are as close as possible, projecting the original sample space along the direction that maximizes the ratio of the between-class distance to the within-class distance. LDA is one of the more heavily used methods in pattern recognition and image processing; when the classes are well separated and problems of data fragmentation and blurred boundaries are rare, it achieves very good classification results. With C classes in total, however, the reduced sample space has at most C-1 dimensions, so when high-dimensional data is also imbalanced the feature space is extremely compressed, minority classes may be covered by majority classes, and samples of different classes may share the same attributes after reduction. Unsupervised dimensionality-reduction methods ignore class information and instead seek to restore certain properties of the original sample space during reduction. The classic PCA (Principal Component Analysis), for example, chooses projection directions according to the variance distribution along different directions of the original feature space, so that the variance distribution is preserved as much as possible after reduction. Many data experiments show that even when the sample space has thousands of features, most of the variance energy can be retained with fewer than ten percent as many projection directions as original features. PCA works very well when the class information essentially follows the variance distribution, as in fields such as image classification; but because it ignores class labels, it often performs very badly on data whose variance does not reflect the class distribution. Manifold learning, first proposed in 2000, has become a research focus in information science; its main idea is to assume that data in the high-dimensional space has some special structure, and that after the high-dimensional data is mapped to a low-dimensional space, the data should still restore, as far as possible, the essential structural features the original data had in high dimensions. The contrast between LDA's C-1 output limit and PCA's variance criterion is illustrated below.
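The two reduction behaviours described above can be checked quickly with scikit-learn; the synthetic data set here is only an illustration, not one of the experimental data sets.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA

# LDA yields at most C-1 components for C classes, while PCA keeps most
# variance with far fewer directions than the original feature count.
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           n_classes=2, weights=[0.9, 0.1], random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.transform(X).shape)        # (500, 1): C-1 = 1 dimension for 2 classes
pca = PCA(n_components=0.9).fit(X)   # keep 90% of the variance
print(pca.n_components_)             # typically far fewer than 100 directions
```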
Current feature-selection methods can be divided into three categories according to the relationship between the feature-selection process and the classifier-training process: filter, wrapper, and embedded methods. The support vector machine recursive feature elimination method SVM-RFE finds the weight of each attribute in each iteration; the magnitude of a weight represents how much attention the SVM pays to that feature, and the optimal feature combination is selected by repeatedly eliminating the features with relatively low weights. The support vector machine backward feature elimination method SVM-BFE eliminates one feature per training round, keeps the feature combination that performs best after some feature is removed, and carries it into the next round of training. Because SVM-based feature selection aims at classification, it eliminates feature combinations that harm the classification result as well as highly redundant and highly correlated features, thereby finding the feature combination that classifies best, and it has achieved a series of good results on high-dimensional data.

Since the impact of the imbalance problem on feature selection is not considered, the selection process can easily drift in a direction unfavorable to recognizing the minority class: algorithms that complete feature selection in one pass (such as LASSO) may directly discard feature combinations that are important for identifying the minority class; iterative feature elimination, an improvement on backward elimination, selects features by consulting the classifier's own "perception," eliminating in each round the feature the classifier judges to contribute least while improving the final result most, but it likewise cannot prevent the selection from proceeding toward raising the recognition rate of the majority class.

In addition, SMOTE oversampling is the mainstream treatment for imbalance problems and has been widely applied to imbalanced data with good results. In high-dimensional imbalanced data, however, the high-dimensional problem prevents traditional sampling from changing the classifier's bias toward the majority class, rendering traditional sampling meaningless. The experiments in [21] show that although SMOTE can increase the classifier's attention to the minority class in low-dimensional data, the effect is not obvious in high dimensions, mainly because the minority samples SMOTE generates introduce correlation between samples into the new sample space rather than correlation between features, so the generated minority samples cannot faithfully restore the minority-class distribution of the original sample space.
Summary of the Invention

To solve the problems of the prior art, the present invention designs an SVM-based classification method for high-dimensional imbalanced data that solves the classification of high-dimensional imbalanced data sets and achieves good results.

The invention is specifically realized by the following technical solution:

An SVM-based high-dimensional imbalanced data classification method comprises two parts: the first part is feature selection and the second part is data sampling. The feature selection part adopts the SVM-BRFE algorithm, which comprises the following steps:

First, the SVM is trained to obtain the initial feature weight vector w, the Lagrange multipliers α, and the F1 value;

Then, the minority-class samples with α = C are resampled at a single rate and the SVM is trained on the resampled data, so that the SVM's separating hyperplane moves in the direction of increasing F1. Because every change of the separating hyperplane is accompanied by a simultaneous change of the dividing hyperplanes, the boundary samples also change, so the process must be repeated, resampling the new minority sample boundary at a single rate each time, until the separating hyperplane that maximizes F1 is found; this w value is used as the feature score for one round of feature selection;

Finally, iterative feature elimination proceeds from the least to the most important feature, removing in each round the one feature whose elimination raises F1 the most. Since removing a feature also changes the SVM's separating hyperplane and hence the boundary samples, the remaining features must be re-scored to produce a new weight vector w that evaluates each feature's importance in the new feature space.

The data sampling part adopts an improved SMOTE algorithm, the PBKS algorithm. The PBKS algorithm solves the space-conversion problem that arises when the SVM handles imbalanced classification because the input space differs from the training space; it exploits the facts that the SVM automatically delimits the sample boundary and that the imbalance problem in an SVM manifests itself mainly as an imbalance of the boundary samples. PBKS synthesizes new minority samples from two distinct minority samples in Hilbert space, searches for the approximate pre-image in Euclidean space of each point produced by oversampling, and uses the PSO algorithm to adaptively optimize the sampling rates of the minority boundary samples and the newly generated points, improving the SVM's classification performance.

By combining the two parts, the invention forms an algorithm aimed specifically at solving the classification of high-dimensional imbalanced data. In this algorithm, what the second part must solve are the new problems that arise after the SVM-based approach has addressed the imbalance problem within the high-dimensional imbalanced classification task.
Brief Description of the Drawings

Figure 1 is a flowchart of the solution to the imbalance problem;

Figure 2 is a histogram of the AUC values of each algorithm;

Figure 3 is a graph of the ROC curves obtained by each algorithm on data set 1;

Figure 4 is a graph of the ROC curves obtained by each algorithm on data set 2;

Figure 5 is a graph of the ROC curves obtained by each algorithm on data set 3;

Figure 6 is a graph of the ROC curves obtained by each algorithm on data set 4;

Figure 7 is a graph of the ROC curves obtained by each algorithm on data set 5;

Figure 8 is a graph of the ROC curves obtained by each algorithm on data set 6.
Detailed Description of the Embodiments

The invention is further described below with reference to the drawings and specific embodiments.

By analyzing the SVM-RFE feature-selection process, the present invention finds that the imbalance problem can be taken into account during iterative feature selection by improving the feature-evaluation scheme of the wrapper selection process. The SVM's automatic delimitation of the boundary is therefore used to resample the sample points in Hilbert space so that the F1 value of the support vector machine model improves, and the SVM's feature weight vector w at that point is used as the evaluation criterion for the features. The two are combined below to perform feature selection on high-dimensional imbalanced data while accounting for the imbalance problem, thereby solving the high-dimensional problem. The time complexity of the algorithm is O(d²), where d is the total number of features; the main procedure is as follows.

Algorithm 1: Pseudocode of the SVM-BRFE algorithm
[The pseudocode of Algorithm 1 appears as images in the original publication.]
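Since the pseudocode is only available as images, the following Python sketch reconstructs the loop from the textual description alone. It is a simplified assumption, not the patented implementation: the helper names are invented, the minority class is assumed to carry label 1, and the elimination step simply drops the smallest-weight feature instead of searching for the removal that raises F1 most.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def train_svm(X, y, C=1.0):
    """Train a linear SVM; return the model, |w|, the alphas, and F1."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = np.abs(clf.coef_).ravel()                   # feature weight vector
    alpha = np.abs(clf.dual_coef_).ravel()          # Lagrange multipliers
    f1 = f1_score(y, clf.predict(X))                # minority class assumed = 1
    return clf, w, alpha, f1

def boundary_weights(X, y, C=1.0, max_rounds=10):
    """Resample minority boundary samples (alpha == C) once per round until
    F1 stops improving; return the weight vector w at the best F1."""
    Xr, yr = X.copy(), y.copy()
    clf, best_w, alpha, best_f1 = train_svm(Xr, yr, C)
    for _ in range(max_rounds):
        sv = clf.support_[np.isclose(np.abs(clf.dual_coef_).ravel(), C)]
        minority_border = [i for i in sv if yr[i] == 1]
        if not minority_border:
            break
        # single-rate resampling: duplicate each minority boundary sample once
        Xr = np.vstack([Xr, Xr[minority_border]])
        yr = np.concatenate([yr, yr[minority_border]])
        clf, w, alpha, f1 = train_svm(Xr, yr, C)
        if f1 <= best_f1:
            break
        best_w, best_f1 = w, f1
    return best_w  # the resampled copies are discarded; only w is kept

def svm_brfe(X, y, n_keep=10):
    """Backward feature elimination driven by boundary-resampled weights."""
    features = list(range(X.shape[1]))
    while len(features) > n_keep:
        w = boundary_weights(X[:, features], y)
        worst = features[int(np.argmin(w))]   # least important feature
        features.remove(worst)                # update train_set
    return features
```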
First, the SVM is trained to obtain the initial feature weight vector w, the Lagrange multipliers α, and the F1 value, and these three values are recorded for later comparison.

Then, the minority-class samples with α = C are resampled at a single rate and the SVM is trained on the resampled data, so that the SVM's separating hyperplane moves in the direction of increasing F1. Because every change of the separating hyperplane is accompanied by a simultaneous change of the dividing hyperplanes, the boundary samples also change, so the process must be repeated, resampling the new minority sample boundary at a single rate each time, until the separating hyperplane that maximizes F1 is found; this w value is used as the feature score for one round of feature selection.

Finally, iterative feature elimination proceeds from the least to the most important feature, removing in each round the one feature whose elimination raises F1 the most. Since removing a feature also changes the SVM's separating hyperplane and hence the boundary samples, the remaining features must be re-scored to produce a new weight vector w that evaluates each feature's importance in the new feature space.

Here it is worth noting that the resampling in the feature-selection part does not participate in updating the training set: the minority boundary samples are resampled only to obtain a feature weight vector w that is fair to both the majority and minority classes, so as to better measure the importance of each feature in high-dimensional imbalanced data, not to directly change the SVM's attention to the minority class in order to raise the direct classification result and F1 value. That is, the resampling before each round of feature selection only solves the high-dimensional problem as affected by the imbalance problem; it does not solve the imbalance problem itself. Therefore, when the maximum F1 value is obtained, the current round of resampling ends, the SVM's weight vector w at the maximum F1 is saved and used to measure feature importance and rank the features, the minority sample copies produced by the resampling are removed so that only the original minority samples remain, and the feature-selection process begins. After each feature is selected, the above process repeats until the optimal feature subset has been selected. As the pseudocode of Algorithm 1 shows, the resampling does not modify train_set; train_set is updated only within the feature-selection process, once after every selected feature.

Through the above steps, resampling the boundary to find the optimal feature weights that measure feature importance, selecting features, updating the training set, and repeating, the features most conducive to raising the F1 value are ultimately retained while the others are eliminated, so that subsequent training proceeds with as little feature redundancy, as few irrelevant feature combinations, and as low a dimensionality as possible. This reduces the influence of the high-dimensional problem on the imbalance problem and its constraint on the SMOTE oversampling algorithm, which facilitates improving the traditional oversampling algorithm in the subsequent stage to solve the imbalance problem and improve classification.

The PSO-Border-Kernel-SMOTE (PBKS) oversampling algorithm mainly solves the space-conversion problem that arises when the SVM handles imbalanced classification because the input space differs from the training space. It exploits the facts that the SVM automatically delimits the sample boundary and that the imbalance problem in an SVM manifests itself mainly as an imbalance of the boundary samples: PBKS synthesizes new minority samples from two distinct minority samples in Hilbert space, searches for the approximate pre-image in Euclidean space of each point produced by oversampling, and uses the PSO algorithm to adaptively optimize the sampling rates of the minority boundary samples and the newly generated points, improving the SVM's classification performance. As Figure 1 shows, the left branch of the flow is carried out in Hilbert space and the right branch mainly in Euclidean space; the middle part is the key link connecting the operations in the two spaces.

Before solving this problem, the distance metric in Hilbert space is first introduced:
$$\varphi: \mathbb{R}^d \to \mathcal{H}, \quad x \mapsto \varphi(x) \qquad (2)$$

$$d^2(\varphi(x_i), \varphi(x_j)) = \|\varphi(x_i) - \varphi(x_j)\|^2 = K_{ii} - 2K_{ij} + K_{jj} \qquad (3)$$
Let the implicit mapping from Euclidean space to Hilbert space be as shown in equation (2), and assume the explicitly defined kernel function is a Gaussian kernel. Hereinafter K_ij is written for K(x_i, x_j), the inner product of two points x_i and x_j of Euclidean space after they are mapped into Hilbert space. The squared distance in Hilbert space is then as shown in equation (3).

When the kernel is Gaussian, the relation between the squared distance in Euclidean space and the squared distance in Hilbert space is given by equations (4) and (5), where D² denotes the squared distance in Euclidean space and d² the squared distance in Hilbert space.
$$d^2 = 2 - 2\exp\left(-\frac{D^2}{2\sigma^2}\right) \qquad (4)$$

$$D^2 = -2\sigma^2 \ln\left(1 - \frac{d^2}{2}\right) \qquad (5)$$
The SMOTE algorithm finds the k samples nearest to a sample point x_i, randomly selects one sample point x_j among those k, and interpolates linearly between x_i and x_j. Since the present invention mainly considers oversampling the minority boundary samples, in Hilbert space another minority sample point inside the boundary is randomly selected for each minority boundary point as the second input of the SMOTE algorithm; the SMOTE oversampling formula in Hilbert space is then as shown in equation (6), where λ_ij is a random number in the open interval (0, 1).
$$z_{ij} = \varphi(x_i) + \lambda_{ij}\,(\varphi(x_j) - \varphi(x_i)) \qquad (6)$$
To find the approximate pre-image of z_ij in Hilbert space, the distance constraints between sample points are essential for locating the approximate position of the pre-image:

Suppose SMOTE oversampling in Hilbert space generates the sample point z_ij; the vector of squared distances between z_ij and each minority boundary sample of the SVM, $d_{z_{ij}}$, is as shown in equation (7), where the total number of minority samples in the boundary is assumed to be k:

$$d_{z_{ij}} = \left[d^2(z_{ij}, \varphi(x_1)), \; d^2(z_{ij}, \varphi(x_2)), \; \ldots, \; d^2(z_{ij}, \varphi(x_k))\right] \qquad (7)$$
Suppose further that there is an unknown sample point x_ij in the original Euclidean space of the training set; the vector of squared distances between x_ij and the k sample points of equation (7), $D_{x_{ij}}$, is as shown in equation (8). In equations (7) and (8), the sample points corresponding to the subscripts 1, 2, ..., k must be identical.

$$D_{x_{ij}} = \left[D^2(x_{ij}, x_1), \; D^2(x_{ij}, x_2), \; \ldots, \; D^2(x_{ij}, x_k)\right] \qquad (8)$$
When the kernel is a Gaussian kernel, combining equations (4) and (8) maps the Euclidean-space vector $D_{x_{ij}}$ into the corresponding Hilbert space, as shown in equation (9):

$$d_{x_{ij}} = \left[2 - 2\exp\left(-\frac{D^2(x_{ij}, x_1)}{2\sigma^2}\right), \; \ldots, \; 2 - 2\exp\left(-\frac{D^2(x_{ij}, x_k)}{2\sigma^2}\right)\right] \qquad (9)$$
The closer the value of equation (9) is to that of equation (7), the closer the position $\varphi(x_{ij})$ of x_ij after the space transformation, in the Hilbert space induced by the Gaussian kernel, is to the SMOTE-synthesized sample point z_ij.
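The kernel-space quantities of equations (2) through (9) can be sketched as below, assuming a Gaussian kernel with bandwidth σ; the helper names are illustrative, not part of the patent.

```python
import numpy as np

def gauss_K(xi, xj, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    D2 = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)   # squared Euclidean distance
    return np.exp(-D2 / (2 * sigma ** 2))

def hilbert_d2(xi, xj, sigma=1.0):
    # eqs. (3)-(4): d^2 = K_ii - 2 K_ij + K_jj = 2 - 2 K_ij for a Gaussian kernel
    return 2.0 - 2.0 * gauss_K(xi, xj, sigma)

def smote_d2_to_borders(xi, xj, lam, borders, sigma=1.0):
    """eqs. (6)-(7): squared Hilbert distances from the synthetic point
    z = phi(xi) + lam * (phi(xj) - phi(xi)) to each boundary sample."""
    d2 = []
    for xb in borders:
        # expand ||z - phi(xb)||^2 into inner products K(., .)
        Kii, Kjj, Kij = 1.0, 1.0, gauss_K(xi, xj, sigma)   # K_mm = 1 (Gaussian)
        Kib, Kjb = gauss_K(xi, xb, sigma), gauss_K(xj, xb, sigma)
        zz = (1 - lam) ** 2 * Kii + 2 * lam * (1 - lam) * Kij + lam ** 2 * Kjj
        zb = (1 - lam) * Kib + lam * Kjb
        d2.append(zz - 2 * zb + 1.0)                       # K_bb = 1 (Gaussian)
    return np.array(d2)
```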
Building on the idea of using the k original minority sample points nearest to the SMOTE-generated point as constraints to determine the pre-image of the Hilbert-space sample, and in order to fill the boundary minority class well, the present invention uses the minority-class samples inside the boundary automatically delimited by the SVM as the distance constraints in $d_{x_{ij}}$, replacing the original constraints, and adopts a grid method to find the approximate pre-image. Specifically: suppose that after SVM training the minority boundary samples delimited in Hilbert space are labeled 1, 2, ..., k; the upper and lower bounds of the d features over these k minority boundary samples are computed as in equations (10) and (11), where (10) gives the lower bounds and (11) the upper bounds of all minority boundary samples.
$$x_{low} = \left[\min_{1 \le m \le k} x_{m1}, \; \min_{1 \le m \le k} x_{m2}, \; \ldots, \; \min_{1 \le m \le k} x_{md}\right] \qquad (10)$$

$$x_{high} = \left[\max_{1 \le m \le k} x_{m1}, \; \max_{1 \le m \le k} x_{m2}, \; \ldots, \; \max_{1 \le m \le k} x_{md}\right] \qquad (11)$$
Then the granularity of each grid cell is set according to equation (12), dividing the boundary minority space into k × d cells; each cell represents a position in Euclidean space, and the goal is to find the cell whose image in Hilbert space is closest to the point produced by oversampling. Specifically, the size of each cell is the maximum of the feature dimension minus its minimum, divided by the total number k of original boundary samples, and the subsequent pre-image search scans the whole grid space cell by cell.
$$grid_i = \frac{x_{high,i} - x_{low,i}}{k}, \quad i = 1, 2, \ldots, d \qquad (12)$$
In equation (7), z_ij is the minority sample point generated by SMOTE oversampling in Hilbert space and is known; in equation (8), x_ij is the sought pre-image of z_ij and is unknown. Equation (12) gives the grid granularity of the i-th feature. In each PSO random grid search, every dimension is incremented by the number of grid cells optimized by PSO to obtain a candidate x_ij, and the searched sample point is taken as one iteration of the solution variable x_ij. It is substituted into equation (7), and the squared cosine between the vectors of equations (7) and (9) is then computed, as in equation (13), until the iterations end. Finally, the point with the largest squared cosine replaces the target solution x_ij as the approximate pre-image of z_ij.
$$\cos^2(d_{z_{ij}}, d_{x_{ij}}) = \frac{(d_{z_{ij}} \cdot d_{x_{ij}})^2}{\|d_{z_{ij}}\|^2 \, \|d_{x_{ij}}\|^2} \qquad (13)$$
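A simplified sketch of the grid search of equations (10) through (13) follows. To stay short it replaces the PSO optimizer with uniform random sampling of grid coordinates, which is an assumption and not the patent's PSO procedure; d2_target stands for the vector of equation (7), as computed, e.g., by smote_d2_to_borders from the sketch above.

```python
import numpy as np

def preimage_grid_search(borders, d2_target, sigma=1.0, n_iter=2000,
                         rng=np.random.default_rng(0)):
    """Find an approximate Euclidean pre-image of a Hilbert-space point by
    maximizing the squared cosine of eq. (13) over random grid positions."""
    borders = np.asarray(borders)
    k, d = borders.shape
    x_low, x_high = borders.min(axis=0), borders.max(axis=0)   # eqs. (10)-(11)
    grid = (x_high - x_low) / k                                # eq. (12) granularity
    best_x, best_cos2 = None, -1.0
    for _ in range(n_iter):
        # candidate pre-image: lower bound plus an integer number of grid cells
        x = x_low + rng.integers(0, k + 1, size=d) * grid
        D2 = np.sum((borders - x) ** 2, axis=1)                # eq. (8)
        d2 = 2.0 - 2.0 * np.exp(-D2 / (2 * sigma ** 2))        # eq. (9)
        cos2 = (d2 @ d2_target) ** 2 / ((d2 @ d2) * (d2_target @ d2_target) + 1e-12)
        if cos2 > best_cos2:                                   # eq. (13) fitness
            best_x, best_cos2 = x, cos2
    return best_x   # approximate pre-image of the synthetic point z_ij
```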
Considering the particularity of imbalanced sample classification, evaluating with traditional criteria causes the following problem: to pursue global accuracy, a traditional classifier may simply classify all minority samples into the majority class, obtaining a high global accuracy while its correct rate on the minority samples is zero. In this situation, the traditional single evaluation index no longer suits the evaluation of imbalanced classification. We therefore need special composite indices that consider multiple aspects to accommodate the special case of imbalanced sample classification. These criteria fall mainly into two classes: one is called "atomic criteria" and the other "composite criteria," the latter being combinations of atomic criteria and mathematical theory, proposed after extensive research, that are complex yet well adapted to evaluating imbalanced classification. In addition, the receiver operating characteristic (ROC) curve is widely used in evaluating imbalanced sample classification.

Table 1 shows the confusion matrix for the binary classification problem involved in imbalanced sample classification. By tallying the entries of the confusion matrix and their composite indices, we can better compute the accuracy of each class separately and consider the classification of the different classes individually, so that the criteria for evaluating imbalanced classification algorithms do not blindly pursue the highest global accuracy but weigh the accuracy of the minority and majority classes simultaneously.
Table 1: Confusion matrix

                     Predicted positive     Predicted negative
Actual positive      TP (true positive)     FN (false negative)
Actual negative      FP (false positive)    TN (true negative)
Equations (14) to (17) list some confusion-matrix-based atomic evaluation criteria that are often used in imbalanced sample classification.
$$TPRate = Recall = \frac{TP}{TP + FN} \qquad (14)$$

$$FPRate = \frac{FP}{FP + TN} \qquad (15)$$

$$Precision = \frac{TP}{TP + FP} \qquad (16)$$

$$F\text{-}Measure = \frac{(1 + \beta^2) \cdot Recall \cdot Precision}{\beta^2 \cdot Precision + Recall} \qquad (17)$$
F-Measure, shown in equation (17), is the criterion most often applied in evaluating imbalanced sample classification. It is compounded from recall, precision, and a balance factor; F-Measure attains a good value only when both Recall and Precision are high. In equation (17), β is the balance factor that weighs recall against precision (usually β is set to 1).
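These criteria translate directly into code, under the standard confusion-matrix definitions assumed in the reconstruction of equations (14) through (17) above:

```python
def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Atomic and composite criteria of eqs. (14)-(17) from confusion-matrix counts."""
    recall = tp / (tp + fn)            # TPRate: recall of the (minority) positive class
    fp_rate = fp / (fp + tn)           # FPRate, the ROC x-axis
    precision = tp / (tp + fp)
    f_measure = ((1 + beta ** 2) * recall * precision) / (beta ** 2 * precision + recall)
    return recall, fp_rate, precision, f_measure
```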
The ROC curve (Receiver Operating Characteristic curve) was proposed by Swets in 1988 and has since been widely applied in many fields. The ROC space takes FPRate as its X-axis and TPRate as its Y-axis; by sweeping the decision threshold, pairs of false-positive and true-positive rates are obtained, and connecting these scattered points forms the ROC curve.
The ROC curve cannot directly give a quantitative evaluation of the imbalanced classification problem, so to obtain a quantitative index the AUC (Area Under the ROC Curve) was proposed. The classification performance of a classifier algorithm can be evaluated by the area under the ROC curve (i.e., the AUC): the larger the AUC, the better the classification.
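The curve and its area can be computed, for example, with scikit-learn; the labels and scores below are toy values for illustration, not the experimental data.

```python
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1, 0, 1]                 # true labels (minority class = 1)
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # classifier decision scores
fpr, tpr, thresholds = roc_curve(y_true, scores)  # sweeps the threshold
print(auc(fpr, tpr))                              # area under the ROC curve
```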
UCI is a well-known, open machine-learning repository. To make the experimental results more convincing, all experimental data sets of the present invention come from UCI. The experimental data are shown in Table 2, which describes the specific properties of the data sets used in all experiments: the No. column is the data set number, Data-Set the data set name, #Attr. the number of attributes contained in the data set, and %Min. the proportion of minority-class samples.

Table 2: Experimental data

[Table 2 is an image in the original publication; columns: No., Data-Set, #Attr., %Min.]
The BRFE-PBKS-SVM algorithm is divided into two parts: the first is feature selection and the second is data sampling. Combining the two yields an algorithm aimed specifically at solving the classification of high-dimensional imbalanced data; in this algorithm, what the second part must solve are the new problems that arise after the SVM-based approach has addressed the imbalance problem within the high-dimensional imbalanced classification task. Using the evaluation criteria introduced above, the efficiency of BRFE-PBKS-SVM is compared from the following three aspects, namely the improvement of the minority-class recognition rate, the improvement of overall efficiency, and a comparison of algorithm stability:

a) the change in minority-class recall;

b) the change in global accuracy and F1 value;

c) the area enclosed by the ROC curve.
Table 3: Comparison of minority-class recall and precision

[Table 3 is an image in the original publication.]
As Table 3 shows, BRFE-PBKS-SVM achieved the highest minority-class recall among the four algorithms; compared with the unimproved SMOTE algorithm, the PBKS oversampling algorithm raises minority recall markedly, and as minority recall rises, its precision falls somewhat.

Table 4: Comparison of the F1 and ACC values of each algorithm

[Table 4 is an image in the original publication.]
In Table 4, comparing the second column with the fourth and the sixth with the eighth shows the ACC of ordinary SMOTE oversampling versus PBKS oversampling in the SVM; comparing the second column with the sixth and the fourth with the eighth shows the effect of the SVM-RFE feature-selection algorithm versus SVM-BRFE. In terms of global accuracy ACC, BRFE-PBKS-SVM is optimal among all algorithm combinations on data sets 2 through 5. With the same oversampling algorithm, the combinations using the improved BRFE feature selection achieve the best results, because BRFE considers the imbalance problem during feature elimination; with the same feature-selection algorithm, the combinations using the improved PBKS oversampling achieve the best results, because the data are trained in the Hilbert space induced by a polynomial or Gaussian kernel, and the sample points generated by PBKS oversampling better fill the boundary in that Hilbert space and are more reasonably distributed spatially, so they improve the classification more.
Figure 2 compares the AUC values of the ROC curves of the four algorithms on the six data sets. It shows that on all but the second and fourth data sets BRFE-PBKS-SVM attains the largest AUC, and on the fourth data set, even though the improved algorithm fails to obtain the optimal AUC, the gap is only 0.006; overall, the BRFE-PBKS-SVM algorithm exhibits good stability. Figures 3-8 show that the AUC values of the four SVM-based algorithm combinations differ little on each data set, which also attests to the SVM's stability and superiority in classifying high-dimensional imbalanced data.

In Figures 3-8, the area enclosed by each curve is the corresponding AUC value in Figure 2. The diagonal represents the worst classification level, with an AUC of 0.5; when a classifier's ROC curve on a data set lies below this diagonal, its AUC is less than 0.5, meaning the classifier performs worse on that data set than a randomly guessing classifier. The closer the ROC curve tends toward the upper left, the more effective the corresponding algorithm and the closer its AUC is to 1. For example, the ROC curve of BRFE-PBKS-SVM on the fifth data set in Figure 7 corresponds, per Figure 2, to an AUC of 0.993.

The six ROC plots obtained in the experiments show that, except on the second and fourth data sets, the areas enclosed by the four algorithms differ little and all achieve good results, with the final improved algorithm attaining the largest AUC on those four data sets. On the second and fourth data sets the four algorithms differ considerably and the ROC curves are extremely uneven; BRFE-PBKS-SVM also fails to achieve the best classification there, but its AUC differs little from that of the best-performing algorithm, and all obtain a better ROC area than a random classifier. This shows that the SVM-based BRFE-PBKS-SVM algorithm for high-dimensional imbalanced classification tasks can complete the classification of high-dimensional imbalanced data stably and effectively and achieve considerable results.
The above is a further detailed description of the invention with reference to specific preferred embodiments, and the specific implementation of the invention cannot be deemed limited to these descriptions. For persons of ordinary skill in the art to which the invention belongs, several simple deductions or substitutions may be made without departing from the concept of the invention, and all of them shall be regarded as falling within the scope of protection of the invention.

Claims (3)

  1. An SVM-based high-dimensional imbalanced data classification method, characterized in that the method comprises two parts: the first part is a feature selection part and the second part is a data sampling part;
    The feature selection part adopts the SVM-BRFE algorithm, which comprises the following steps: first, the SVM is trained to obtain the initial feature weight vector w, the Lagrange multipliers α, and the F1 value; then, the minority-class samples with α = C are resampled at a single rate and the SVM is trained on the resampled data, so that the SVM's separating hyperplane moves in the direction of increasing F1; because every change of the separating hyperplane is accompanied by a simultaneous change of the dividing hyperplanes, the boundary samples also change, so the process must be repeated, resampling the new minority sample boundary at a single rate each time, until the separating hyperplane that maximizes F1 is found, and this w value is used as the feature score for one round of feature selection; finally, iterative feature elimination proceeds from the least to the most important feature, removing in each round the one feature whose elimination raises F1 the most; since removing a feature also changes the SVM's separating hyperplane and hence the boundary samples, the remaining features must be re-scored to produce a new weight vector w that evaluates each feature's importance in the new feature space;
    The data sampling part adopts an improved SMOTE algorithm, the PBKS algorithm. The PBKS algorithm solves the space-conversion problem that arises when the SVM handles imbalanced classification because the input space differs from the training space; it exploits the facts that the SVM automatically delimits the sample boundary and that the imbalance problem in an SVM manifests itself mainly as an imbalance of the boundary samples. PBKS synthesizes new minority samples from two distinct minority samples in Hilbert space, searches for the approximate pre-image in Euclidean space of each point produced by oversampling, and uses the PSO algorithm to adaptively optimize the sampling rates of the minority boundary samples and the newly generated points, improving the SVM's classification performance.
  2. The method according to claim 1, characterized in that the PBKS algorithm uses the minority-class samples inside the boundary automatically delimited by the SVM as the distance constraints in $d_{x_{ij}}$, replacing the original constraints, and adopts a grid method to find the approximate pre-image, where $d_{x_{ij}}$ is the vector obtained by mapping the Euclidean-space distances $D^2(x_i, x_j)$ between sample points $x_i$ and $x_j$ into the corresponding Hilbert space.
  3. The method according to claim 1, characterized in that, assuming that after SVM training the minority boundary samples delimited in Hilbert space are labeled 1, 2, ..., k, the upper bounds $x_{high}$ and lower bounds $x_{low}$ of the d features over these k minority boundary samples are computed; the granularity of each grid cell is then set, dividing the boundary minority space into k × d cells, each cell representing a position in Euclidean space, and a cell is sought whose image in Hilbert space is closest to the point produced by oversampling; specifically, the size of each cell is the maximum of the feature dimension minus its minimum divided by the total number k of original boundary samples, and the subsequent pre-image search scans the whole grid space cell by cell; in each PSO random grid search, every dimension is incremented by the number of grid cells optimized by PSO to obtain x_ij, and the searched sample point is taken as one iteration of the solution variable x_ij; then the squared cosine between $d_{z_{ij}}$ and $d_{x_{ij}}$ is computed until the iterations end; finally, the point with the largest squared cosine replaces the target solution x_ij as the approximate pre-image of z_ij, where z_ij is the minority sample point generated by SMOTE oversampling in Hilbert space and x_ij is the sought pre-image of z_ij.
PCT/CN2017/115847 2017-08-30 2017-12-13 SVM-based classification method for high-dimensional imbalanced data WO2019041629A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710763329.7 2017-08-30
CN201710763329.7A CN107563435A (zh) 2017-08-30 2017-08-30 SVM-based classification method for high-dimensional imbalanced data

Publications (1)

Publication Number Publication Date
WO2019041629A1 true WO2019041629A1 (zh) 2019-03-07

Family

ID=60978124

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/115847 WO2019041629A1 (zh) 2017-08-30 2017-12-13 SVM-based classification method for high-dimensional imbalanced data

Country Status (2)

Country Link
CN (1) CN107563435A (zh)
WO (1) WO2019041629A1 (zh)


Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108494845B (zh) * 2018-03-14 2020-12-22 曙光信息产业(北京)有限公司 Job scheduling method and apparatus based on a 6D-Torus network
CN108563119B (zh) * 2018-03-26 2021-06-15 哈尔滨工程大学 Unmanned surface vehicle motion control method based on a fuzzy support vector machine algorithm
CN108763873A (zh) * 2018-05-28 2018-11-06 苏州大学 Gene classification method and related device
CN109635034B (zh) * 2018-11-08 2020-03-03 北京字节跳动网络技术有限公司 Training data resampling method and apparatus, storage medium and electronic device
CN109376944A (zh) * 2018-11-13 2019-02-22 国网宁夏电力有限公司电力科学研究院 Method and apparatus for constructing a smart electricity meter prediction model
CN109540562A (zh) * 2018-12-12 2019-03-29 上海理工大学 Chiller fault diagnosis method
CN109886462B (zh) * 2019-01-18 2021-10-08 杭州电子科技大学 Distillation column fault diagnosis method using a support vector machine optimized by improved particle swarm optimization
CN111693939A (zh) * 2019-03-15 2020-09-22 中国科学院上海高等研究院 Method, apparatus, device and medium for improving indoor adjacent-grid positioning accuracy
CN111210075B (zh) * 2020-01-07 2023-05-12 国网辽宁省电力有限公司朝阳供电公司 Lightning-strike transmission line fault probability analysis method based on a combined classifier
CN111652193B (zh) * 2020-07-08 2024-03-19 中南林业科技大学 Wetland classification method based on multi-source imagery
CN112396124B (zh) * 2020-12-01 2023-01-24 北京理工大学 Small-sample data augmentation method and system for imbalanced data
US11797516B2 (en) * 2021-05-12 2023-10-24 International Business Machines Corporation Dataset balancing via quality-controlled sample generation
CN113408707A (zh) * 2021-07-05 2021-09-17 哈尔滨理工大学 Deep-learning-based encrypted network traffic identification method
CN113657499B (zh) * 2021-08-17 2023-08-11 中国平安财产保险股份有限公司 Feature-selection-based benefit allocation method and apparatus, electronic device and medium
CN114612255B (zh) * 2022-04-08 2023-11-07 湖南提奥医疗科技有限公司 Insurance pricing method based on feature selection from electronic medical record data


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868775A (zh) * 2016-03-23 2016-08-17 深圳市颐通科技有限公司 Imbalanced sample classification method based on the PSO algorithm
CN105930856A (zh) * 2016-03-23 2016-09-07 深圳市颐通科技有限公司 Classification method based on an improved DBSCAN-SMOTE algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHUNKAI ZHANG: "Research on Classification Method of High-Dimensional Class-Imbalanced Data Sets Based on SVM", 2017 IEEE SECOND INTERNATIONAL CONFERENCE ON DATA SCIENCE IN CYBERSPACE, 29 June 2017 (2017-06-29), XP033139592 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110806A (zh) * 2019-05-15 2019-08-09 济南浪潮高新科技投资发展有限公司 Method for balancing winning-bid and non-winning-bid data based on machine learning
CN111782904A (zh) * 2019-12-10 2020-10-16 国网天津市电力公司电力科学研究院 Imbalanced data set processing method and system based on an improved SMOTE algorithm
CN111782904B (zh) * 2019-12-10 2023-10-27 国网天津市电力公司电力科学研究院 Imbalanced data set processing method and system based on an improved SMOTE algorithm
CN111125359A (zh) * 2019-12-17 2020-05-08 东软集团股份有限公司 Text information classification method, apparatus and device
CN111125359B (zh) * 2019-12-17 2023-12-15 东软集团股份有限公司 Text information classification method, apparatus and device
CN111275003A (zh) * 2020-02-19 2020-06-12 煤炭科学研究总院 Microseismic signal recognition method based on a class-optimal Gaussian-kernel multi-class support vector machine
CN111275003B (zh) * 2020-02-19 2023-08-01 煤炭科学研究总院 Microseismic signal recognition method based on a class-optimal Gaussian-kernel multi-class support vector machine
CN112000705A (zh) * 2020-03-30 2020-11-27 华南理工大学 Imbalanced data stream mining method based on active drift detection
CN112000705B (zh) * 2020-03-30 2024-04-02 华南理工大学 Imbalanced data stream mining method based on active drift detection
CN111695626B (zh) * 2020-06-10 2023-10-31 湖南湖大金科科技发展有限公司 High-dimensional imbalanced data classification method based on hybrid sampling and feature selection
CN111695626A (zh) * 2020-06-10 2020-09-22 湖南湖大金科科技发展有限公司 High-dimensional imbalanced data classification method based on hybrid sampling and feature selection
CN112257767A (zh) * 2020-10-16 2021-01-22 浙江大学 State classification method for key product components with class-imbalanced data
US20220120727A1 (en) * 2020-10-16 2022-04-21 Saudi Arabian Oil Company Detecting equipment defects using lubricant analysis
CN112633227B (zh) * 2020-12-30 2024-02-23 应急管理部国家自然灾害防治研究院 Method and system for automatically identifying lightning whistler waves in Zhangheng-1 induction magnetometer data
CN112633227A (zh) * 2020-12-30 2021-04-09 应急管理部国家自然灾害防治研究院 Method and system for automatically identifying lightning whistler waves in Zhangheng-1 induction magnetometer data
CN112733960B (zh) * 2021-01-25 2023-06-20 大连交通大学 Imbalanced object recognition method based on synthetic-data oversampling
CN112733960A (zh) * 2021-01-25 2021-04-30 大连交通大学 Imbalanced object recognition method based on synthetic-data oversampling
CN112819806A (zh) * 2021-02-23 2021-05-18 江苏科技大学 Ship weld defect detection method based on a deep convolutional neural network model
CN112819806B (zh) * 2021-02-23 2024-05-28 江苏科技大学 Ship weld defect detection method based on a deep convolutional neural network model
CN113032726A (zh) * 2021-02-25 2021-06-25 北京化工大学 Weighted upsampling method based on kernel probability density estimation for fluidized-bed agglomeration fault monitoring
CN113032726B (zh) * 2021-02-25 2023-11-24 北京化工大学 Weighted upsampling method based on kernel probability density estimation for fluidized-bed agglomeration fault monitoring
CN113792765A (zh) * 2021-08-24 2021-12-14 西安理工大学 Oversampling method based on triangle-centroid weights
CN113723514B (zh) * 2021-08-31 2023-10-20 重庆邮电大学 Hybrid-sampling-based balancing method for secure access log data
CN113723514A (zh) * 2021-08-31 2021-11-30 重庆邮电大学 Hybrid-sampling-based balancing method for secure access log data
US11836219B2 (en) 2021-11-03 2023-12-05 International Business Machines Corporation Training sample set generation from imbalanced data in view of user goals
CN115455177A (zh) * 2022-08-02 2022-12-09 淮阴工学院 Method and apparatus for augmenting imbalanced chemical-industry text data based on a hybrid sample space
CN115455177B (zh) * 2022-08-02 2023-07-21 淮阴工学院 Method and apparatus for augmenting imbalanced chemical-industry text data based on a hybrid sample space
CN116051288A (zh) * 2023-03-30 2023-05-02 华南理工大学 Resampling-based data augmentation method for financial credit scoring
CN116628443A (zh) * 2023-05-16 2023-08-22 西安工程大学 POA-SVM transformer fault diagnosis method and electronic device
CN116628443B (zh) * 2023-05-16 2024-01-23 西安工程大学 POA-SVM transformer fault diagnosis method and electronic device
CN116721354B (zh) * 2023-08-08 2023-11-21 中铁七局集团电务工程有限公司武汉分公司 Building crack defect identification method and system, and readable storage medium
CN116721354A (zh) * 2023-08-08 2023-09-08 中铁七局集团电务工程有限公司武汉分公司 Building crack defect identification method and system, and readable storage medium
CN117272116A (zh) * 2023-10-13 2023-12-22 西安工程大学 Transformer fault diagnosis method based on a LoRAS-balanced data set
CN117272116B (zh) * 2023-10-13 2024-05-17 西安工程大学 Transformer fault diagnosis method based on a LoRAS-balanced data set

Also Published As

Publication number Publication date
CN107563435A (zh) 2018-01-09

Similar Documents

Publication Publication Date Title
WO2019041629A1 (zh) SVM-based classification method for high-dimensional imbalanced data
CN111695626B (zh) High-dimensional imbalanced data classification method based on hybrid sampling and feature selection
Wang et al. A perception-driven approach to supervised dimensionality reduction for visualization
Xie et al. An improved oversampling algorithm based on the samples’ selection strategy for classifying imbalanced data
Sun et al. An adaptive density peaks clustering method with Fisher linear discriminant
CN105930856A (zh) Classification method based on an improved DBSCAN-SMOTE algorithm
Zhang et al. A cost-sensitive ensemble method for class-imbalanced datasets
CN101853389A (zh) 多类目标的检测装置及检测方法
Dai et al. Multi-granularity relabeled under-sampling algorithm for imbalanced data
CN109800790B (zh) Feature selection method for high-dimensional data
CN109150830B (zh) Hierarchical intrusion detection method based on a support vector machine and a probabilistic neural network
Wang et al. Nearest Neighbor with Double Neighborhoods Algorithm for Imbalanced Classification.
Wang et al. AGNES-SMOTE: An oversampling algorithm based on hierarchical clustering and improved SMOTE
Longadge et al. Multi-cluster based approach for skewed data in data mining
Cao et al. An over-sampling method based on probability density estimation for imbalanced datasets classification
Miranda et al. Active testing for SVM parameter selection
CN115859115A (zh) Intelligent resampling technique based on the Gaussian distribution
CN116204647A (zh) Method and apparatus for establishing a target comparison learning model and for text clustering
Ma et al. A membership-based resampling and cleaning algorithm for multi-class imbalanced overlapping data
Paithankar et al. A HK clustering algorithm for high dimensional data using ensemble learning
Che et al. Boosting few-shot open-set recognition with multi-relation margin loss
Hossen et al. A comparison of some soft computing methods on Imbalanced data
Jiang et al. Undersampling of approaching the classification boundary for imbalance problem
Li et al. Overlapping oriented imbalanced ensemble learning method based on projective clustering and stagewise hybrid sampling
CN113361563B (zh) Parkinson's disease speech data classification system based on dual sample and feature transformation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17923570

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17923570

Country of ref document: EP

Kind code of ref document: A1
