CN109165694B

CN109165694B - Method and system for classifying unbalanced data sets

Info

Publication number: CN109165694B
Application number: CN201811061152.7A
Authority: CN
Inventors: 张雪英; 李凤莲; 陈桂军; 张波; 魏鑫; 焦江丽
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2018-09-12
Filing date: 2018-09-12
Publication date: 2022-07-08
Anticipated expiration: 2038-09-12
Also published as: CN109165694A

Abstract

The invention discloses a classification method and a classification system for unbalanced data sets, and a class center c for calculating and obtaining a positive class training set and a negative class training set₁And c₂Determining the distance T between the centers of the two classes, a positive class hyperplane, a negative class hyperplane, a first distance, a second distance, a third distance and a fourth distance, further determining a fuzzy membership function, and determining a classification model according to the fuzzy membership function and a fuzzy dual-support vector machine. And obtaining the optimized classification model by adopting a grid search algorithm and a cross verification method. And inputting the unbalanced data to be classified into the optimized classification model to obtain a classification result of the unbalanced data to be classified. According to the method or the system, the determined classification model based on the fuzzy membership function is used, different membership values are given to the sample points according to different contributions of the sample points to the classification hyperplane and different unbalanced rates of the two types of samples, the unbalance among the samples is reduced, and therefore the accuracy of the classification result when the method or the system is used is improved.

Description

Method and system for classifying unbalanced data sets

Technical Field

The invention relates to the technical field of unbalanced data processing, in particular to a method and a system for classifying unbalanced data sets.

Background

Many industry data tend to have data distribution imbalance phenomena. Taking the binary problem as an example, if the proportion of one sample is much larger than that of the other sample, the data set is an unbalanced data set. The majority of samples are also called negative samples, the minority of samples are called positive samples, and the ratio of the number of negative samples to the number of positive samples is called Imbalance Rate (IR). Typical examples include: fault diagnosis data, credit fraud data, medical diagnosis data, and the like. When the unbalanced data set is classified and predicted, the reference value of the classification prediction accuracy of the minority class in practice is more important, but the prediction accuracy of the majority class is generally higher and the prediction accuracy of the minority class is lower by a common classification prediction model, and the prediction error of the minority class generally brings greater economic loss and even life cost, such as credit card embezzlement accidents, coal mine water inrush and gas outburst accidents and the like. Therefore, how to improve the accuracy of the classification prediction of a small number of classes of unbalanced data sets is a research hotspot at home and abroad in recent years.

The Batuwita et al provides a Fuzzy Support Vector Machine (FSVM) for processing unbalanced data sets, sets different penalty factors for positive and negative samples, designs a fuzzy membership function to endow the training samples with different membership, but the method only considers the distance between the samples and class centers and the unbalanced condition of the samples, does not consider the distribution characteristics of the samples, and has poor classification accuracy. A novel double-membership fuzzy support vector machine is proposed by Chua Yan et al, so that the classification accuracy is effectively improved, the complexity is increased, and the classification efficiency is low.

Disclosure of Invention

The invention aims to provide a method and a system for classifying an unbalanced data set, which aim to solve the problems of low efficiency and poor accuracy when the unbalanced data set is classified in the prior art.

In order to achieve the purpose, the invention provides the following scheme:

a method of classifying an unbalanced data set, comprising:

acquiring sample unbalanced data; the sample unbalanced data comprises positive class data and negative class data; the positive data represent a type of data with a small quantity in the sample unbalanced data, and the negative data represent a type of data with a large quantity in the sample unbalanced data;

randomly dividing sample unbalanced data to obtain a training set and a test set; the training set comprises a positive training set and a negative training set; the test set comprises a positive type test set and a negative type test set;

obtaining class center c of the positive class training set₁And class center c of the negative class training set₂And a center c of the training set;

centering said class c₁The difference between the vector c and the center c of the training set is determined as a normal hyperplane normal vector w₁Centering said class c₂The difference from the center c of the training set is determined as a negative hyperplane-like normal vector w₂Centering said class c₁And the class center c₂The modulus of the difference is determined as the distance T between the two types of centers;

according to said class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passage of said class center c₁And a positive class hyperplane and a line passing through said class center c₂The negative hyperplane-like surface of (1);

according to said class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the first distance di⁺Second distance di^-A third distance dli⁺And a fourth distance dli^-(ii) a The first distance di⁺Representing a distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di^-Representing a distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli⁺Representing the positive class data in the positive class training set passing through the class center c₂A distance to the negative hyperplane; the fourth distance dli^-Representing the negative class data in the negative class training set to pass through a class center c₁A distance to the positive hyperplane;

determining from a neighbor algorithmCompactness C of the positive data in the positive training set_i ⁺Determining the closeness C of the negative class data in the negative class training set according to the neighbor algorithm_i ^-；

According to the first distance di⁺The second distance di^-Said third distance dl⁺Said fourth distance dl^-The compactness C_i ⁺The tightness C_i ^-And the distance T between the two classes of centers determines a fuzzy membership function (1),

wherein S is_i+Representing the fuzzy membership of the positive data, S_i-Representing the fuzzy membership degree of the negative class data, wherein epsilon represents a radius control factor, and sigma represents a sample weight value endowing parameter;

determining a classification model (2) according to the fuzzy membership function (1) and a fuzzy dual-support vector machine,

wherein, FTWSVM1 represents the positive classification hyperplane, A represents the first data to be classified, w₁Normal vector representing the hyperplane of the positive class classification, e₁Column vectors of positive type with representing elements all equal to 1, b₁Denotes a first constant, d₁Denotes a first penalty parameter, S_ARepresenting fuzzy membership of first data to be classified, ξ representing relaxation factor, s.t. representing constraint condition, B representing second data to be classified, e₂Presentation elementAll negative class column vectors equal to 1, FTWSVM2 denotes the negative class classification hyperplane, w₂Normal vector representing the hyperplane of the negative class classification, b₂Denotes a second constant, d₂Denotes a second penalty parameter, S_BRepresenting the fuzzy membership of the second data to be classified;

taking a training set and a test set of sample unbalanced data as input of a classification model (2), taking recall ratio, precision ratio, g-mean and F value of the test set as output of the classification model (2), and determining an optimized first penalty parameter d by adopting a grid search algorithm and a cross-validation method₁And an optimized second penalty parameter d₂Obtaining an optimized classification model;

acquiring unbalanced data to be detected;

and taking the to-be-detected unbalanced data as the input of the optimized classification model to obtain the classification result of the to-be-detected unbalanced data.

Optionally, said according to said class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁And a positive class hyperplane and a line passing through said class center c₂The negative hyperplane specifically comprises:

according to said class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁Is a positive hyperplane w₁x⁺+b₁1(3) and passing through said class center c₂Is a negative hyperplane w₂x^-+b₂(xvi) ═ 1 (4); wherein x is⁺Representing positive class data, x, in a positive class training set^-Representing negative class data in the negative class training set.

Optionally, said according to said class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the first distance di⁺Second distance di^-A third distance dli⁺And a fourth distance dli^-The method specifically comprises the following steps:

according to the class centerc₁The class center c₂The normal vector w₁And the normal vector w₂Determining a first distance

Second distance

Third distance

And a fourth distance

Wherein the content of the first and second substances,

represents a normal vector w₁The torque of (a) is set to be,

represents a normal vector w₂Is greater than the torque, | w₁I represents the normal vector w₁Modulo, | w₂I represents the normal vector w₂The die of (1).

Optionally, the closeness C of the positive data in the positive training set is determined according to a neighbor algorithm_i ⁺Determining the compactness C of the negative class data in the negative class training set according to a neighbor algorithm_i ^-The method specifically comprises the following steps:

determining the closeness of the positive data in the positive training set according to the neighbor algorithm

Determining the closeness of the negative class data in the negative class training set according to the nearest neighbor algorithm

Wherein, X_i ⁺Representing the ith positive class data in the positive class training set,

set of K neighbor samples, x, representing the ith positive class data in the positive class training set_j ⁺To represent

Of (2) is the j-th neighboring sample, X_i ^-Representing the ith negative class data in the negative class training set,

set of K neighbor samples, X, representing the ith negative class data in the negative class training set_j ^-To represent

K is

And

the number of neighboring samples in (a).

Optionally, the training set and the test set of the sample unbalanced data are used as input of the classification model (2), the recall ratio, the precision ratio, the g-mean and the F value of the test set are used as output of the classification model (2), and the optimized first penalty parameter d is determined by adopting a grid search algorithm and a cross-validation method₁And an optimized second penalty parameter d₂Obtaining an optimized classification model, specifically including:

taking a training set and a test set of sample unbalanced data as the input of a classification model (2) and taking the recall ratio of a positive class test set

Precision ratio

And

as the output of the classification model (2), determining the optimized first punishment parameter d by adopting a grid search algorithm and a cross validation method₁And an optimized second penalty parameter d₂Obtaining an optimized classification model; wherein TP represents the number of correctly classified positive data in the positive test set, FN represents the number of incorrectly classified negative data in the negative test set, TN represents the number of correctly classified negative data in the negative test set, and FP represents the number of incorrectly classified positive data in the positive test set.

A classification system for unbalanced data sets, comprising:

the first acquisition module is used for acquiring unbalanced data; the unbalanced data comprises positive class data and negative class data; the positive data represent a type of data with a small quantity in the sample unbalanced data, and the negative data represent a type of data with a large quantity in the sample unbalanced data;

the training set and test set generating module is used for randomly dividing the sample unbalanced data to obtain a training set and a test set; the training set comprises a positive training set and a negative training set; the test set comprises a positive type test set and a negative type test set

A second obtaining module for obtaining the class center c of the positive class training set₁And class center c of the negative class training set₂And a center c of the training set;

a distance T determining module of normal vector and two class centers for determining the class center c₁The difference between the vector c and the center c of the training set is determined as a normal hyperplane normal vector w₁Centering said class c₂The difference from the center c of the training set is determined as a negative hyperplane-like normal vector w₂Centering said class c₁And said class center c₂The modulus of the difference is determined as the distance T between the two types of centers;

a hyperplane determination module for determining the class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁And a positive class hyperplane and a line passing through said class center c₂The negative hyperplane-like surface of (1);

a distance determination module for determining the center c according to the class₁The class center c₂The normal vector w₁And the normal vector w₂Determining the first distance di⁺Second distance di^-A third distance dli⁺And a fourth distance dli^-(ii) a The first distance di⁺Representing a distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di^-Representing a distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli⁺Representing the positive class data in the positive class training set passing through the class center c₂A distance to the negative hyperplane; the fourth distance dli^-Representing the negative class data in the negative class training set to pass through a class center c₁A distance to the positive hyperplane;

the compactness determining module is used for determining the compactness C of the positive data in the positive training set according to the neighbor algorithm_i ⁺Determining the closeness C of the negative class data in the negative class training set according to the neighbor algorithm_i ^-；

A fuzzy membership function determination module for determining the first distance di⁺The second distance di^-The third distance dli⁺The fourth distance dli^-The tightness C_i ⁺The tightness C_i ^-And the distance T between the two classes of centers determines a fuzzy membership function (1),

a classification model determining module for determining a classification model (2) according to the fuzzy membership function (1) and the fuzzy dual-support vector machine,

wherein, FTWSVM1 represents the positive classification hyperplane, A represents the first data to be classified, w₁Normal vector representing the hyperplane of the positive class classification, e₁Column vectors of positive type with representing elements all equal to 1, b₁Denotes a first constant, d₁Denotes a first penalty parameter, S_ARepresenting fuzzy membership of first data to be classified, ξ representing relaxation factor, s.t. representing constraint condition, B representing second data to be classified, e₂A negative class column vector representing elements all equal to 1, FTWSVM2 represents a negative class classification hyperplane, w₂Normal vector representing the hyperplane of the negative class classification, b₂Denotes a second constant, d₂Denotes a second penalty parameter, S_BRepresenting the fuzzy membership of the second data to be classified;

the optimized classification model generation module is used for taking a training set and a test set of sample unbalanced data as the input of the classification model (2), taking the recall ratio, precision ratio, g-mean and F value of the test set as the output of the classification model (2), and determining an optimized first punishment parameter d by adopting a grid search algorithm and a cross-validation method₁And an optimized second penalty parameter d₂Obtaining an optimized classification model;

the third acquisition module is used for acquiring unbalanced data to be detected;

and the classification result generation module is used for taking the to-be-detected unbalanced data as the input of the optimized classification model to obtain the classification result of the to-be-detected unbalanced data.

Optionally, the hyperplane determining module specifically includes:

a positive hyperplane and negative hyperplane determining unit for determining the center c according to the class₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁Is a positive hyperplane w₁x⁺+b₁1(3) and passing through said class center c₂Is a negative hyperplane w₂x^-+b₂1 (4); wherein x is⁺Representing positive class data in a positive class training set, x^-Representing negative class data in the negative class training set.

Optionally, the distance determining module specifically includes:

a distance determination unit for determining the distance between the class center and the object center₁The class center c₂The normal vector w₁And the normal vector w₂Determining a first distance

Second distance

Third distance

And a fourth distance

Wherein the content of the first and second substances,

represents a normal vector w₁The torque of (a) is set to be,

Optionally, the tightness determining module specifically includes:

the positive data compactness determining unit is used for determining the compactness of the positive data in the positive training set according to a neighbor algorithm

The closeness determining unit of the negative class data is used for determining the closeness of the negative class data in the negative class training set according to the neighbor algorithm

set of K neighbor samples, X, representing the ith negative class data in the negative class training set_i ^-To represent

K is

And

the number of neighboring samples in (a).

Optionally, the optimized classification model generating module specifically includes:

an optimized classification model generation unit used for taking the training set and the test set of the sample unbalanced data as the input of the classification model (2) and taking the recall ratio of the positive class test set

Precision ratio

And

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention discloses a classification method and a classification system for unbalanced data sets, and a class center c for calculating and obtaining a positive class training set and a negative class training set₁And c₂And a training set center C, further determining the distance T, the positive hyperplane, the negative hyperplane, the first distance, the second distance, the third distance and the fourth distance of the two kinds of centers, and determining the compactness C of the positive data and the negative data according to a neighbor algorithm_i ⁺And C_i ^-. According to the first distance, the second distance and the tightness C_i ⁺、C_i ^-And determining a fuzzy membership function (1) according to the distance T between the two classes of centers, and determining a classification model (2) according to the fuzzy membership function (1) and a fuzzy dual-support vector machine. Determining the optimized first punishment parameter d by adopting a grid search algorithm and a cross verification method₁And an optimized second penalty parameter d₂And obtaining the optimized classification model. Will be provided withAnd inputting the unbalanced data to be classified into the optimized classification model to obtain a classification result of the unbalanced data to be classified. The method or the system of the invention endows different membership values to the sample points according to the difference of contribution of the sample points to the classification hyperplane and the difference of the unbalanced rate of the two types of samples by using the determined classification model based on the fuzzy membership function, reduces the unbalance among the samples, reduces the influence of noise points contained in the samples on the classification hyperplane, and improves the accuracy of the classification result when the method or the system of the invention is used.

The method or the system of the invention also processes two quadratic programming problems by using the determined classification model based on the fuzzy membership function and the fuzzy double-support vector machine, thereby greatly reducing the complexity of the algorithm and improving the operation efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.

FIG. 1 is a flow chart of a method for classifying unbalanced data sets according to the present invention;

FIG. 2 is a diagram of a classification system for unbalanced data sets according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

FIG. 1 is a flow chart of a method for classifying unbalanced data sets according to the present invention. As shown in fig. 1, a method for classifying unbalanced data sets includes:

s101, acquiring sample unbalanced data; in the embodiment, the angry emotion samples in the CASIA Chinese emotion corpus are selected as positive samples, and the remaining emotion samples in the CASIA Chinese emotion corpus are selected as negative samples. And selecting the MFCC characteristics, the acoustic characteristics and the prosodic characteristics of the sample voice, and correspondingly obtaining three characteristic values of the mean value, the variance and the standard deviation of the voice characteristics respectively to obtain unbalanced data. The unbalanced data includes positive class data and negative class data. More negative class data than positive class data.

Step S102, carrying out random division on sample unbalanced data to obtain a training set and a test set; the training set comprises a positive training set and a negative training set; the test set comprises a positive type test set and a negative type test set;

s103, acquiring class center c of the positive class training set₁And class center c of the negative class training set₂And the center c of the training set.

Step S104, centering the class c₁The difference between the vector c and the center c of the training set is determined as a normal hyperplane normal vector w₁Centering said class c₂The difference from the center c of the training set is determined as a negative hyperplane-like normal vector w₂Centering said class c₁And said class center c₂The modulus of the difference is determined as the distance T between the centers of the two classes.

Step S105, according to the class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁And a positive class hyperplane and a line passing through said class center c₂The negative hyperplane-like.

Step S106, according to the class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the first distance di⁺Second distance di^-A third distance dli⁺And a fourth distance dli^-(ii) a The first distance di⁺Representing a distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di^-Representing a distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli⁺Representing the positive class data in the positive class training set passing through the class center c₂A distance to the negative hyperplane; the fourth distance dli^-Representing the negative class data in the negative class training set to pass through a class center c₁A distance to the positive hyperplane;

step S107, determining the closeness C of the positive data in the positive training set according to the neighbor algorithm_i ⁺Determining the closeness C of the negative class data in the negative class training set according to the neighbor algorithm_i ^-。

Step S108, according to the first distance di⁺The second distance di^-The third distance dli⁺The fourth distance dli^-The tightness C_i ⁺The tightness C_i ^-And the distance T between the two classes of centers determines a fuzzy membership function (1),

wherein S is_i+Representing the fuzzy membership of the positive data, S_i-Representing the fuzzy membership degree of the negative class data, wherein epsilon represents a radius control factor, and sigma represents a sample weight value endowing parameter; epsilon may serve to preprocess the data setThe function of (1) controlling most effective samples in the hypersphere; and sigma is a very small number, and the weight is given to the sample point by combining the K nearest neighbor criterion.

Step S109, determining a classification model (2) according to the fuzzy membership function (1) and a fuzzy dual-support vector machine,

wherein, FTWSVM1 represents the positive classification hyperplane, A represents the first data to be classified, w₁Normal vector representing the hyperplane of the positive class classification, e₁Column vectors of positive type with representing elements all equal to 1, b₁Denotes a first constant, d₁Denotes a first penalty parameter, S_ARepresenting fuzzy membership of first data to be classified, ξ representing relaxation factor, s.t. representing constraint condition, B representing second data to be classified, e₂A negative class column vector representing elements all equal to 1, FTWSVM2 represents a negative class classification hyperplane, w₂Normal vector representing the hyperplane of the negative class classification, b₂Denotes a second constant, d₂Denotes a second penalty parameter, S_BAnd expressing the fuzzy membership of the second data to be classified.

Step S110, a training set and a test set of sample unbalanced data are used as input of a classification model (2), recall ratio, precision ratio, g-mean and F value of the test set are used as output of the classification model (2), and a grid search algorithm and a cross-validation method are adopted to determine an optimized first punishment parameter d₁And an optimized second penalty parameter d₂And obtaining the optimized classification model.

S111, acquiring unbalanced data to be detected; in this embodiment, happy emotion samples in the TYUT2.0 emotion voice database of the university of tai yuan are selected as positive samples, and remaining emotion samples in the TYUT2.0 emotion voice database of the university of tai yuan are selected as negative samples. And selecting the MFCC characteristics, the acoustic characteristics and the prosodic characteristics of the sample voice, and correspondingly obtaining three characteristic values of the mean value, the variance and the standard deviation of the voice characteristics respectively to obtain unbalanced data. The unbalanced data includes positive class data and negative class data. More negative class data than positive class data.

And S112, taking the to-be-detected unbalanced data as the input of the optimized classification model to obtain a classification result of the to-be-detected unbalanced data.

According to the method, the classification model is determined based on the fuzzy membership function, different membership values are given to the sample points according to different contributions of the sample points to the classification hyperplane and different unbalanced rates of the two types of samples, the imbalance among the samples is reduced, the influence of noise points contained in the samples on the classification hyperplane is reduced, and therefore the accuracy of the classification result when the method is used is improved. The method of the embodiment also processes two quadratic programming problems by using a determination classification model based on a fuzzy membership function and a fuzzy dual-support vector machine, thereby greatly reducing the complexity of the algorithm and improving the operation efficiency.

In practical application, according to the class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁Is a positive hyperplane w₁x⁺+b₁1(3) and passing through said class center c₂Is a negative hyperplane w₂x^-+b₂1 (4); wherein x is⁺Representing positive class data, x, in a positive class training set^-Representing negative class data in the negative class training set.

According to said class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining a first distance

Second distance

Third stepDistance between two adjacent plates

And a fourth distance

Wherein the content of the first and second substances,

represents a normal vector w₁The torque of (a) is set to be,

Determining the closeness of the positive data in the positive training set according to the nearest neighbor algorithm

The j-th neighbor sample in (1), K is

And

the number of neighboring samples in (a).

Taking a training set and a test set of sample unbalanced data as input of a classification model (2), and taking the recall ratio of a positive class test set

Precision ratio

And

The recall and the precision respectively represent the ratio of positive and negative samples of a classifier for correct prediction, but the classifier with high recall does not necessarily have high precision in many times, so the geometric mean value g-mean is introduced to evaluate the performance of the classifier, and the larger the g-mean is, the better the classification effect is. F-value considers a combination of recall and precision of a small number of classes.

In the present embodiment, the first distance di is provided⁺The second distance di^-And the thirdDistance dli⁺The fourth distance dli^-The tightness C_i ⁺The tightness C_i ^-The specific calculation formula also provides a specific calculation method of the recall ratio, precision ratio, g-mean and F value contained in the classification result.

In practical application, a fuzzy membership function (1) is introduced based on a fuzzy dual-support vector machine (FTWSVM) model, and a classification model is reconstructed, wherein the method comprises the following steps:

the original TWSVM model abandons the parallel constraint condition, and for the two-classification problem, two non-parallel hyperplanes are constructed, the construction principle is that the samples are as close as possible to the sample point of the class and as far as possible from the other class, samples belonging to the class 1 and the class-1 are respectively represented by A, B matrixes, and the optimization problem is constructed as formula (3):

wherein d is₁，d₂As a penalty parameter, e₁,e₂Is a column vector of all 1's. The classification can be obtained by optimizing the above formula:

w₁x⁺+b₁＝1；w₂x^-+b₂＝-1。

w₁x⁺+b₁1 and w₂x^-+b₂And-1 is the obtained classification hyperplane, and the data is divided into two types by obtaining the classification hyperplane.

On the basis, a fuzzy membership function S is introduced_A、S_BThen the classification hyperplane optimization problem of the classification model can be expressed as equation (2).

Wherein S_A、S_BFor the fuzzy membership of each sample of A, B samples, the product of the sample error and the membership represents the amount of contribution of the sample point to the classifier. The Lagrange transform is paired and the problem is expressed as

x^Tw_r+b_r＝min|x^Tw_l+b_l|l＝1，2 (5)；

Where | is x to plane x^Tw_l+b_lVertical distance of 0, (l 1, 2).

The embodiment provides a specific derivation process of a classification model, and the method of the embodiment also processes two quadratic programming problems by using a determined classification model based on a fuzzy membership function and a fuzzy dual-support vector machine through the method or the system of the invention, and if the number of the two types of samples is the same, the time can be 4 times faster than that of an SVM, thereby greatly reducing the complexity of the algorithm and improving the operation efficiency.

FIG. 2 is a diagram of a classification system for unbalanced data sets according to the present invention. As shown in fig. 2, a classification system for unbalanced data sets includes:

the first acquisition module 1 is used for acquiring unbalanced data; the unbalanced data comprises positive class data and negative class data; the positive data represent a type of data with a small quantity in the sample unbalanced data, and the negative data represent a type of data with a large quantity in the sample unbalanced data;

the training set and test set generating module 2 is used for randomly dividing sample unbalanced data to obtain a training set and a test set; the training set comprises a positive training set and a negative training set; the test set comprises a positive type test set and a negative type test set

A second obtaining module 3, configured to obtain a class center c of the positive class training set₁And class center c of the negative class training set₂And a center c of the training set;

a normal vector and a distance T between two class centers is determined by a module 4 for determining the class center c₁The difference from the center c of the training set is determined to be positiveHyperplane-like normal vector w₁Centering said class c₂The difference from the center c of the training set is determined as a negative hyperplane-like normal vector w₂Centering said class c₁And said class center c₂The modulus of the difference is determined as the distance T between the two types of centers;

a hyperplane determining module 5 for determining the center c according to the class₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁And a positive class hyperplane and a line passing through said class center c₂The negative hyperplane-like surface of (1);

a distance determination module 6 for determining the center c according to the class₁The class center c₂The normal vector w₁And the normal vector w₂Determining the first distance di⁺Second distance di^-A third distance dli⁺And a fourth distance dli^-(ii) a The first distance di⁺Representing a distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di^-Representing a distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli⁺Representing the positive class data in the positive class training set passing through the class center c₂A distance to the negative hyperplane; the fourth distance dli^-Representing the negative class data in the negative class training set to pass through a class center c₁A distance to the positive hyperplane;

a closeness determining module 7 for determining the closeness C of the positive data in the positive training set according to the neighbor algorithm_i ⁺Determining the closeness C of the negative class data in the negative class training set according to the neighbor algorithm_i ^-；

A fuzzy membership function determination module 8 for determining a fuzzy membership function according to said first distance di⁺The second distance di^-The third distance dli⁺The fourth distance dli^-The tightness C_i ⁺The tightness C_i ^-And the distance T between the two classes of centers determines a fuzzy membership function (1),

a classification model determining module 9 for determining a classification model (2) according to the fuzzy membership function (1) and the fuzzy dual-support vector machine,

an optimized classification model generation module 10, which is used for taking the training set and the test set of the sample unbalanced data as the input of the classification model (2) and taking the recall ratio, precision ratio, g-mean sum and sum of the test setThe F value is used as the output of the classification model (2), and the optimized first punishment parameter d is determined by adopting a grid search algorithm and a cross verification method₁And an optimized second penalty parameter d₂Obtaining an optimized classification model;

a third obtaining module 11, configured to obtain unbalanced data to be detected;

and the classification result generation module 12 is configured to use the to-be-detected unbalanced data as an input of the optimized classification model to obtain a classification result of the to-be-detected unbalanced data.

The system of the embodiment gives different membership values to the sample points according to the difference of contribution of the sample points to the classification hyperplane and the difference of non-equilibrium rates of the two types of samples by using the determined classification model based on the fuzzy membership function, so that the imbalance among the samples is reduced, the influence of noise points contained in the samples on the classification hyperplane is reduced, and the accuracy of the classification result when the system is used is improved. The system in the embodiment processes two quadratic programming problems by using the determination classification model based on the fuzzy membership function and the fuzzy dual-support vector machine, thereby greatly reducing the complexity of the algorithm and improving the operation efficiency.

In practical application, the hyperplane determining module specifically includes: the positive hyperplane and negative hyperplane determining unit is used for determining the center c according to the class₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁Is a positive hyperplane w₁x⁺+b₁1(3) and passing through said class center c₂Is a negative hyperplane w₂x^-+b₂1 (4); wherein x is⁺Representing positive class data, x, in a positive class training set^-Representing negative class data in the negative class training set.

The distance determining module specifically comprises: a distance determination unit for determining the distance between the class center and the object center₁The class center c₂The normal vector w₁And the normal vector w₂Determining a first distance

Second distance

Third distance

And a fourth distance

Wherein the content of the first and second substances,

represents a normal vector w₁The torque of (a) is set to be,

The compactness determining module specifically comprises: the positive data compactness determining unit is used for determining the compactness of the positive data in the positive training set according to a neighbor algorithm

The j-th neighbor of (1)Sample, X_i ^-Representing the ith negative class data in the negative class training set,

K is

And

the number of neighboring samples in (a).

The optimized classification model generation module specifically includes: an optimized classification model generation unit used for taking the training set and the test of the sample unbalanced data as the input of the classification model (2) and taking the recall ratio of the positive class test set

Precision ratio

And

as the output of the classification model (2), determining the optimized first punishment parameter d by adopting a grid search algorithm and a cross validation method₁And an optimized second penalty parameter d₂Obtaining an optimized classification model; wherein TP represents the number of correctly classified positive data in the positive test set, FN represents the number of incorrectly classified negative data in the negative test set, TN represents the number of correctly classified negative data in the negative test set, FP represents incorrectly classified positive data in the positive test setThe number of (2).

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; for those skilled in the art, variations can be made in the specific embodiments and applications without departing from the spirit of the invention. In view of the above, this description should not be taken as limiting the invention.

Claims

1. A method of classifying an unbalanced data set, comprising:

acquiring sample unbalanced data, specifically comprising: selecting an angry emotion sample in a CASIA Chinese emotion corpus as a positive sample, selecting remaining emotion samples in the CASIA Chinese emotion corpus as negative samples, selecting MFCC (Mel frequency cepstrum coefficient) features, tone features and rhythm features of sample voice, and correspondingly obtaining three feature values of a mean value, a variance and a standard deviation of voice features respectively to obtain unbalanced data; the sample unbalanced data comprises positive class data and negative class data; the positive data represent a type of data with a small quantity in the sample unbalanced data, and the negative data represent a type of data with a large quantity in the sample unbalanced data;

will be described inClass center c₁The difference between the vector c and the center c of the training set is determined as a normal hyperplane normal vector w₁Centering said class c₂The difference from the center c of the training set is determined as a negative hyperplane-like normal vector w₂Centering said class c₁And said class center c₂The modulus of the difference is determined as the distance T between the two types of centers;

according to said class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁And a positive class hyperplane and a line passing through said class center c₂The negative hyperplane-like surface of (1);

determining the closeness C of the positive data in the positive training set according to the neighbor algorithm_i ⁺Determining the closeness C of the negative class data in the negative class training set according to the neighbor algorithm_i ^-；

According to the first distance di⁺The second distance di^-The third distance dli⁺The fourth distance dli^-The tightness C_i ⁺The tightness C_i ^-And the distance T between the two classes of centers determines a fuzzy membership function (1),

acquiring unbalanced data to be detected, specifically comprising: selecting happy emotion samples in a TUT 2.0 emotion voice database of the Taiyuan university as positive samples, selecting remaining emotion samples in the TUT 2.0 emotion voice database of the Taiyuan university as negative samples, selecting MFCC (Mel frequency cepstrum coefficient) features of sample voice, and selecting tone features and rhythm features, and respectively and correspondingly solving three feature values of mean value, variance and standard deviation of the voice features to obtain unbalanced data to be detected;

2. The method of classification according to claim 1, characterised in that said class centre c is a function of said class centre₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passage of said class center c₁And a positive class hyperplane and a line passing through said class center c₂The negative hyperplane specifically comprises:

according to said class center c₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁Is a positive hyperplane w₁x⁺+b₁1(3) and passing through said class center c₂Is a negative hyperplane w₂x^-+b₂1 (4); wherein x is⁺Representing positive class data, x, in a positive class training set^-Representing negative class data in the negative class training set.

3. The method of classification according to claim 2, characterised in that said class centre c is a function of said class centre₁The class center c₂The normal vector w₁And the normal vector w₂Determining the first distance di⁺Second distance di^-A third distance dli⁺And a fourth distance dli^-The method specifically comprises the following steps:

Second distance

Third distance

And a fourth distance

Wherein the content of the first and second substances,

represents a normal vector w₁The torque of (a) is set to be,

4. The classification method according to claim 2, wherein the closeness C of the positive class data in the positive class training set is determined according to a neighbor algorithm_i ⁺Determining the closeness C of the negative class data in the negative class training set according to the neighbor algorithm_i ^-The method specifically comprises the following steps:

Wherein x is_i ⁺Representing the ith positive class data in the positive class training set,

K is

And

the number of neighboring samples in (a).

5. The classification method according to claim 1, wherein the optimized first penalty parameter d is determined by a grid search algorithm and a cross-validation method by taking a training set and a test set of sample unbalanced data as input of the classification model (2), taking the recall ratio, precision ratio, g-mean and F value of the test set as output of the classification model (2)₁And an optimized second penalty parameter d₂Obtaining an optimized classification model, specifically including:

Precision ratio

And

6. A classification system for unbalanced data sets, comprising:

the first obtaining module is configured to obtain unbalanced data, and specifically includes: selecting an angry emotion sample in a CASIA Chinese emotion corpus as a positive sample, selecting remaining emotion samples in the CASIA Chinese emotion corpus as negative samples, selecting MFCC (Mel frequency cepstrum coefficient) features, tone features and rhythm features of sample voice, and correspondingly obtaining three feature values of a mean value, a variance and a standard deviation of voice features respectively to obtain unbalanced data; the sample unbalanced data comprises positive class data and negative class data; the positive data represent a type of data with a small quantity in the sample unbalanced data, and the negative data represent a type of data with a large quantity in the sample unbalanced data;

module for determining distance T between normal vector and two kinds of centersAt the center c of the class₁The difference between the vector c and the center c of the training set is determined as a normal hyperplane normal vector w₁Centering said class c₂The difference from the center c of the training set is determined as a negative hyperplane-like normal vector w₂Centering said class c₁And said class center c₂The modulus of the difference is determined as the distance T between the two types of centers;

A fuzzy membership function determination module for determining the first distance di⁺The second distance di^-Said third distance dl⁺Said fourth distance dl^-The compactness C_i ⁺The tightness C_i ^-And the two classesThe distance T of the centers determines the fuzzy membership function (1),

the third obtaining module is configured to obtain unbalanced data to be measured, and specifically includes: selecting happy emotion samples in a TUT 2.0 emotion voice database of the Taiyuan university as positive samples, selecting remaining emotion samples in the TUT 2.0 emotion voice database of the Taiyuan university as negative samples, selecting MFCC (Mel frequency cepstrum coefficient) features of sample voice, and selecting tone features and rhythm features, and respectively and correspondingly solving three feature values of mean value, variance and standard deviation of the voice features to obtain unbalanced data to be detected;

7. The classification system according to claim 6, wherein the hyperplane determination module specifically comprises:

a positive hyperplane and negative hyperplane determining unit for determining the center c according to the class₁The class center c₂The normal vector w₁And the normal vector w₂Determining the passing of said class center c₁Is a positive hyperplane w₁x⁺+b₁1(3) and passing through said class center c₂Is a negative hyperplane w₂x^-+b₂1 (4); wherein x is⁺Representing positive class data, x, in a positive class training set^-Representing negative class data in the negative class training set.

8. The classification system according to claim 7, wherein the distance determination module specifically includes:

a distance determination unit for determining the distance between the class center and the object center₁The class center c₂The normal vector w1 and the normal vector w2 determine a first distance

Second distance

Third distance

And a fourth distance

Wherein the content of the first and second substances,

represents a normal vector w₁The torque of (a) is set to be,

9. The classification system according to claim 7, wherein the closeness determination module specifically comprises:

The j-th neighbor of (1)This, x_i ^-Representing the ith negative class data in the negative class training set,

K is

And

the number of neighboring samples in (a).

10. The classification system according to claim 6, wherein the optimized classification model generation module specifically includes:

Precision ratio

And

as the output of the classification model (2), determining the optimized first punishment parameter d by adopting a grid search algorithm and a cross validation method₁And an optimized second penalty parameter d₂Obtaining an optimized classification model; wherein TP represents the number of correctly classified positive data in the positive test set, FN represents the number of incorrectly classified negative data in the negative test set, and TN represents the negativeThe number of correctly classified negative class data in the test set, and the FP indicates the number of incorrectly classified positive class data in the positive class test set.