CN109165694B - Method and system for classifying unbalanced data sets

Info

Publication number: CN109165694B
Application number: CN201811061152.7A
Authority: CN (China)
Prior art keywords: class, data, positive, negative, distance
Other language: Chinese (zh)
Other versions: CN109165694A
Inventors: 张雪英, 李凤莲, 陈桂军, 张波, 魏鑫, 焦江丽
Original and current assignee: Taiyuan University of Technology
Events: application filed by Taiyuan University of Technology; priority to CN201811061152.7A; publication of CN109165694A; application granted; publication of CN109165694B
Legal status: Active

Classifications

    • G06F18/2411: Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines (G: Physics; G06: Computing; G06F: Electric digital data processing; G06F18/00: Pattern recognition; G06F18/24: Classification techniques)
    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting (G06F18/21: Design or setup of recognition systems or techniques)

Abstract

The invention discloses a method and a system for classifying unbalanced data sets. The class centers c1 and c2 of the positive class and negative class training sets are calculated; the distance T between the two class centers, a positive class hyperplane, a negative class hyperplane, and first, second, third and fourth distances are determined; a fuzzy membership function is then determined, and a classification model is built from the fuzzy membership function and a fuzzy twin support vector machine. The optimized classification model is obtained with a grid search algorithm and cross-validation. Unbalanced data to be classified are input into the optimized classification model to obtain their classification result. Because the classification model is based on the fuzzy membership function, sample points receive different membership values according to their different contributions to the classification hyperplane and the different imbalance rates of the two classes of samples, which reduces the imbalance between samples and thereby improves the accuracy of the classification results.

Description

Method and system for classifying unbalanced data sets
Technical Field
The invention relates to the technical field of unbalanced data processing, in particular to a method and a system for classifying unbalanced data sets.
Background
Data in many industries tend to exhibit imbalanced distributions. Taking the binary classification problem as an example, if the proportion of one class of samples is much larger than that of the other, the data set is an unbalanced data set. The majority class samples are called negative samples, the minority class samples are called positive samples, and the ratio of the number of negative samples to the number of positive samples is called the Imbalance Rate (IR). Typical examples include fault diagnosis data, credit fraud data, medical diagnosis data, and the like. When classifying and predicting an unbalanced data set, the classification accuracy of the minority class is what matters most in practice; yet common classification models usually predict the majority class accurately while predicting the minority class poorly, and prediction errors on the minority class generally bring greater economic losses and even cost lives, as in credit card fraud incidents or coal mine water inrush and gas outburst accidents. How to improve the classification accuracy on the minority class of unbalanced data sets has therefore become a research hotspot at home and abroad in recent years.
Batuwita et al. proposed a Fuzzy Support Vector Machine (FSVM) for processing unbalanced data sets, setting different penalty factors for the positive and negative samples and designing a fuzzy membership function to give the training samples different memberships; however, the method considers only the distance between a sample and its class center and the imbalance of the samples, ignores the distribution characteristics of the samples, and its classification accuracy is poor. Chua Yan et al. proposed a novel double-membership fuzzy support vector machine that effectively improves classification accuracy, but it increases complexity and its classification efficiency is low.
Disclosure of Invention
The invention aims to provide a method and a system for classifying unbalanced data sets, so as to solve the problems of low efficiency and poor accuracy when unbalanced data sets are classified in the prior art.
In order to achieve the purpose, the invention provides the following scheme:
a method of classifying an unbalanced data set, comprising:
acquiring sample unbalanced data; the sample unbalanced data comprises positive class data and negative class data; the positive class data represent the class with few samples in the sample unbalanced data, and the negative class data represent the class with many samples;
randomly dividing sample unbalanced data to obtain a training set and a test set; the training set comprises a positive training set and a negative training set; the test set comprises a positive type test set and a negative type test set;
obtaining the class center c1 of the positive class training set, the class center c2 of the negative class training set, and the center c of the whole training set;
determining the difference between the class center c1 and the training set center c as the positive class hyperplane normal vector w1, the difference between the class center c2 and the training set center c as the negative class hyperplane normal vector w2, and the modulus of the difference between the class centers c1 and c2 as the distance T between the two class centers;
determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the positive class hyperplane passing through the class center c1 and the negative class hyperplane passing through the class center c2;
determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, a first distance di+, a second distance di-, a third distance dli+ and a fourth distance dli-; the first distance di+ represents the distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di- represents the distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli+ represents the distance from the positive class data in the positive class training set to the negative class hyperplane passing through the class center c2; the fourth distance dli- represents the distance from the negative class data in the negative class training set to the positive class hyperplane passing through the class center c1;
determining, according to a K-nearest-neighbor algorithm, the tightness Ci+ of the positive class data in the positive class training set and the tightness Ci- of the negative class data in the negative class training set;
determining a fuzzy membership function (1) from the first distance di+, the second distance di-, the third distance dli+, the fourth distance dli-, the tightness Ci+, the tightness Ci- and the distance T between the two class centers,
$$S_{i+} = S\left(d_{i+}, dl_{i+}, C_i^{+}, T; \varepsilon, \sigma\right), \qquad S_{i-} = S\left(d_{i-}, dl_{i-}, C_i^{-}, T; \varepsilon, \sigma\right) \qquad (1)$$
wherein Si+ represents the fuzzy membership of the positive class data, Si- represents the fuzzy membership of the negative class data, ε represents a radius control factor, and σ represents a parameter for assigning sample weights;
determining a classification model (2) according to the fuzzy membership function (1) and a fuzzy twin support vector machine,
$$\mathrm{FTWSVM1:}\quad \min_{w_1, b_1, \xi}\ \frac{1}{2}\left\lVert A w_1 + e_1 b_1 \right\rVert^2 + d_1 S_B^{\mathsf T} \xi \quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi \ge e_2,\ \ \xi \ge 0$$
$$\mathrm{FTWSVM2:}\quad \min_{w_2, b_2, \eta}\ \frac{1}{2}\left\lVert B w_2 + e_2 b_2 \right\rVert^2 + d_2 S_A^{\mathsf T} \eta \quad \text{s.t.}\quad (A w_2 + e_1 b_2) + \eta \ge e_1,\ \ \eta \ge 0 \qquad (2)$$
wherein FTWSVM1 denotes the optimization problem of the positive class classification hyperplane, A denotes the matrix of first class (positive) samples to be classified, w1 the normal vector of the positive class classification hyperplane, e1 the all-ones column vector of the positive class, b1 a first constant, d1 a first penalty parameter, SA the fuzzy membership of the first class samples, ξ and η slack variables (relaxation factors), s.t. the constraint conditions, B the matrix of second class (negative) samples to be classified, e2 the all-ones column vector of the negative class, FTWSVM2 the optimization problem of the negative class classification hyperplane, w2 the normal vector of the negative class classification hyperplane, b2 a second constant, d2 a second penalty parameter, and SB the fuzzy membership of the second class samples;
taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall, precision, g-mean and F-value on the test set as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by a grid search algorithm and cross-validation, so as to obtain the optimized classification model;
acquiring unbalanced data to be detected;
and taking the to-be-detected unbalanced data as the input of the optimized classification model to obtain the classification result of the to-be-detected unbalanced data.
Optionally, the determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the positive class hyperplane passing through the class center c1 and the negative class hyperplane passing through the class center c2 specifically comprises:
determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the positive class hyperplane passing through the class center c1 as $w_1^{\mathsf T} x_+ + b_1 = 1$ (3) and the negative class hyperplane passing through the class center c2 as $w_2^{\mathsf T} x_- + b_2 = -1$ (4), wherein x+ represents the positive class data in the positive class training set and x- represents the negative class data in the negative class training set.
Optionally, the determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the first distance di+, the second distance di-, the third distance dli+ and the fourth distance dli- specifically comprises:
determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the first distance
$$d_{i+} = \frac{\left| w_1^{\mathsf T} x_i^{+} + b_1 - 1 \right|}{\lVert w_1 \rVert},$$
the second distance
$$d_{i-} = \frac{\left| w_2^{\mathsf T} x_i^{-} + b_2 + 1 \right|}{\lVert w_2 \rVert},$$
the third distance
$$dl_{i+} = \frac{\left| w_2^{\mathsf T} x_i^{+} + b_2 + 1 \right|}{\lVert w_2 \rVert},$$
and the fourth distance
$$dl_{i-} = \frac{\left| w_1^{\mathsf T} x_i^{-} + b_1 - 1 \right|}{\lVert w_1 \rVert},$$
wherein $w_1^{\mathsf T}$ and $w_2^{\mathsf T}$ denote the transposes of the normal vectors w1 and w2, and $\lVert w_1 \rVert$ and $\lVert w_2 \rVert$ denote their moduli.
Optionally, the determining, according to the K-nearest-neighbor algorithm, the tightness Ci+ of the positive class data in the positive class training set and the tightness Ci- of the negative class data in the negative class training set specifically comprises:
determining the tightness of the positive class data in the positive class training set from its K nearest same-class neighbors,
$$C_i^{+} = C\left(X_i^{+}; x_1^{+}, \dots, x_K^{+}\right),$$
and the tightness of the negative class data in the negative class training set,
$$C_i^{-} = C\left(X_i^{-}; x_1^{-}, \dots, x_K^{-}\right),$$
wherein Xi+ represents the i-th positive class sample in the positive class training set, xj+ represents the j-th of its K nearest neighbor samples, Xi- represents the i-th negative class sample in the negative class training set, xj- represents the j-th of its K nearest neighbor samples, and K is the number of neighbor samples.
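The explicit tightness expression survives only as images in the source, so the sketch below assumes one plausible realization in which tightness decays with the mean distance to the K nearest same-class neighbors; only the symbols (the sample, its K neighbors, and K itself) come from the patent, and the exponential form is an assumption.

```python
import numpy as np

def tightness(X, K=5):
    """Tightness of each sample from its K nearest same-class neighbors.

    ASSUMED form: tightness decays with the mean distance to the K nearest
    same-class neighbors, so tightly clustered samples score close to 1."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)                                 # exclude the sample itself
    knn = np.sort(D, axis=1)[:, :K]                             # K nearest neighbor distances
    return np.exp(-knn.mean(axis=1))
```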
Optionally, the taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall, precision, g-mean and F-value on the test set as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by a grid search algorithm and cross-validation to obtain the optimized classification model specifically comprises:
taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), and taking the recall of the positive class test set
$$R = \frac{TP}{TP + FN},$$
the precision
$$P = \frac{TP}{TP + FP},$$
$$g\text{-}mean = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}$$
and
$$F = \frac{2PR}{P + R}$$
as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by a grid search algorithm and cross-validation to obtain the optimized classification model; wherein TP represents the number of correctly classified positive class samples in the positive class test set, FN represents the number of positive class samples in the positive class test set misclassified as negative, TN represents the number of correctly classified negative class samples in the negative class test set, and FP represents the number of negative class samples in the negative class test set misclassified as positive.
A classification system for unbalanced data sets, comprising:
the first acquisition module is used for acquiring sample unbalanced data; the sample unbalanced data comprises positive class data and negative class data; the positive class data represent the class with few samples in the sample unbalanced data, and the negative class data represent the class with many samples;
the training set and test set generating module is used for randomly dividing the sample unbalanced data to obtain a training set and a test set; the training set comprises a positive class training set and a negative class training set; the test set comprises a positive class test set and a negative class test set;
a second acquisition module, used for obtaining the class center c1 of the positive class training set, the class center c2 of the negative class training set, and the center c of the whole training set;
a normal vector and two-class-center distance T determining module, used for determining the difference between the class center c1 and the training set center c as the positive class hyperplane normal vector w1, the difference between the class center c2 and the training set center c as the negative class hyperplane normal vector w2, and the modulus of the difference between the class centers c1 and c2 as the distance T between the two class centers;
a hyperplane determining module, used for determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the positive class hyperplane passing through the class center c1 and the negative class hyperplane passing through the class center c2;
a distance determining module, used for determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the first distance di+, the second distance di-, the third distance dli+ and the fourth distance dli-; the first distance di+ represents the distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di- represents the distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli+ represents the distance from the positive class data in the positive class training set to the negative class hyperplane passing through the class center c2; the fourth distance dli- represents the distance from the negative class data in the negative class training set to the positive class hyperplane passing through the class center c1;
a tightness determining module, used for determining, according to the K-nearest-neighbor algorithm, the tightness Ci+ of the positive class data in the positive class training set and the tightness Ci- of the negative class data in the negative class training set;
a fuzzy membership function determining module, used for determining the fuzzy membership function (1) from the first distance di+, the second distance di-, the third distance dli+, the fourth distance dli-, the tightness Ci+, the tightness Ci- and the distance T between the two class centers,
$$S_{i+} = S\left(d_{i+}, dl_{i+}, C_i^{+}, T; \varepsilon, \sigma\right), \qquad S_{i-} = S\left(d_{i-}, dl_{i-}, C_i^{-}, T; \varepsilon, \sigma\right) \qquad (1)$$
wherein Si+ represents the fuzzy membership of the positive class data, Si- represents the fuzzy membership of the negative class data, ε represents a radius control factor, and σ represents a parameter for assigning sample weights;
a classification model determining module, used for determining the classification model (2) according to the fuzzy membership function (1) and the fuzzy twin support vector machine,
$$\mathrm{FTWSVM1:}\quad \min_{w_1, b_1, \xi}\ \frac{1}{2}\left\lVert A w_1 + e_1 b_1 \right\rVert^2 + d_1 S_B^{\mathsf T} \xi \quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi \ge e_2,\ \ \xi \ge 0$$
$$\mathrm{FTWSVM2:}\quad \min_{w_2, b_2, \eta}\ \frac{1}{2}\left\lVert B w_2 + e_2 b_2 \right\rVert^2 + d_2 S_A^{\mathsf T} \eta \quad \text{s.t.}\quad (A w_2 + e_1 b_2) + \eta \ge e_1,\ \ \eta \ge 0 \qquad (2)$$
wherein FTWSVM1 denotes the optimization problem of the positive class classification hyperplane, A denotes the matrix of first class (positive) samples to be classified, w1 the normal vector of the positive class classification hyperplane, e1 the all-ones column vector of the positive class, b1 a first constant, d1 a first penalty parameter, SA the fuzzy membership of the first class samples, ξ and η slack variables (relaxation factors), s.t. the constraint conditions, B the matrix of second class (negative) samples to be classified, e2 the all-ones column vector of the negative class, FTWSVM2 the optimization problem of the negative class classification hyperplane, w2 the normal vector of the negative class classification hyperplane, b2 a second constant, d2 a second penalty parameter, and SB the fuzzy membership of the second class samples;
the optimized classification model generating module is used for taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall, precision, g-mean and F-value on the test set as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by a grid search algorithm and cross-validation to obtain the optimized classification model;
the third acquisition module is used for acquiring unbalanced data to be detected;
and the classification result generation module is used for taking the to-be-detected unbalanced data as the input of the optimized classification model to obtain the classification result of the to-be-detected unbalanced data.
Optionally, the hyperplane determining module specifically includes:
a positive and negative hyperplane determining unit, used for determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the positive class hyperplane passing through the class center c1 as $w_1^{\mathsf T} x_+ + b_1 = 1$ (3) and the negative class hyperplane passing through the class center c2 as $w_2^{\mathsf T} x_- + b_2 = -1$ (4), wherein x+ represents the positive class data in the positive class training set and x- represents the negative class data in the negative class training set.
Optionally, the distance determining module specifically includes:
a distance determining unit, used for determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the first distance
$$d_{i+} = \frac{\left| w_1^{\mathsf T} x_i^{+} + b_1 - 1 \right|}{\lVert w_1 \rVert},$$
the second distance
$$d_{i-} = \frac{\left| w_2^{\mathsf T} x_i^{-} + b_2 + 1 \right|}{\lVert w_2 \rVert},$$
the third distance
$$dl_{i+} = \frac{\left| w_2^{\mathsf T} x_i^{+} + b_2 + 1 \right|}{\lVert w_2 \rVert},$$
and the fourth distance
$$dl_{i-} = \frac{\left| w_1^{\mathsf T} x_i^{-} + b_1 - 1 \right|}{\lVert w_1 \rVert},$$
wherein $w_1^{\mathsf T}$ and $w_2^{\mathsf T}$ denote the transposes of the normal vectors w1 and w2, and $\lVert w_1 \rVert$ and $\lVert w_2 \rVert$ denote their moduli.
Optionally, the tightness determining module specifically includes:
a positive class data tightness determining unit, used for determining the tightness of the positive class data in the positive class training set from its K nearest same-class neighbors,
$$C_i^{+} = C\left(X_i^{+}; x_1^{+}, \dots, x_K^{+}\right);$$
a negative class data tightness determining unit, used for determining the tightness of the negative class data in the negative class training set,
$$C_i^{-} = C\left(X_i^{-}; x_1^{-}, \dots, x_K^{-}\right),$$
wherein Xi+ represents the i-th positive class sample in the positive class training set, xj+ represents the j-th of its K nearest neighbor samples, Xi- represents the i-th negative class sample in the negative class training set, xj- represents the j-th of its K nearest neighbor samples, and K is the number of neighbor samples.
Optionally, the optimized classification model generating module specifically includes:
an optimized classification model generating unit, used for taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall of the positive class test set
$$R = \frac{TP}{TP + FN},$$
the precision
$$P = \frac{TP}{TP + FP},$$
$$g\text{-}mean = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}$$
and
$$F = \frac{2PR}{P + R}$$
as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by a grid search algorithm and cross-validation to obtain the optimized classification model; wherein TP represents the number of correctly classified positive class samples in the positive class test set, FN represents the number of positive class samples in the positive class test set misclassified as negative, TN represents the number of correctly classified negative class samples in the negative class test set, and FP represents the number of negative class samples in the negative class test set misclassified as positive.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses a classification method and a classification system for unbalanced data sets, and a class center c for calculating and obtaining a positive class training set and a negative class training set1And c2And a training set center C, further determining the distance T, the positive hyperplane, the negative hyperplane, the first distance, the second distance, the third distance and the fourth distance of the two kinds of centers, and determining the compactness C of the positive data and the negative data according to a neighbor algorithmi +And Ci -. According to the first distance, the second distance and the tightness Ci +、Ci -And determining a fuzzy membership function (1) according to the distance T between the two classes of centers, and determining a classification model (2) according to the fuzzy membership function (1) and a fuzzy dual-support vector machine. Determining the optimized first punishment parameter d by adopting a grid search algorithm and a cross verification method1And an optimized second penalty parameter d2And obtaining the optimized classification model. Will be provided withAnd inputting the unbalanced data to be classified into the optimized classification model to obtain a classification result of the unbalanced data to be classified. The method or the system of the invention endows different membership values to the sample points according to the difference of contribution of the sample points to the classification hyperplane and the difference of the unbalanced rate of the two types of samples by using the determined classification model based on the fuzzy membership function, reduces the unbalance among the samples, reduces the influence of noise points contained in the samples on the classification hyperplane, and improves the accuracy of the classification result when the method or the system of the invention is used.
Because the classification model is built from the fuzzy membership function and the fuzzy twin support vector machine, the method and the system also only solve two smaller quadratic programming problems, which greatly reduces the complexity of the algorithm and improves operational efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a method for classifying unbalanced data sets according to the present invention;
FIG. 2 is a diagram of a classification system for unbalanced data sets according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a method and a system for classifying unbalanced data sets, so as to solve the problems of low efficiency and poor accuracy when unbalanced data sets are classified in the prior art.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a flow chart of a method for classifying unbalanced data sets according to the present invention. As shown in fig. 1, a method for classifying unbalanced data sets includes:
s101, acquiring sample unbalanced data; in the embodiment, the angry emotion samples in the CASIA Chinese emotion corpus are selected as positive samples, and the remaining emotion samples in the CASIA Chinese emotion corpus are selected as negative samples. And selecting the MFCC characteristics, the acoustic characteristics and the prosodic characteristics of the sample voice, and correspondingly obtaining three characteristic values of the mean value, the variance and the standard deviation of the voice characteristics respectively to obtain unbalanced data. The unbalanced data includes positive class data and negative class data. More negative class data than positive class data.
Step S102, carrying out random division on sample unbalanced data to obtain a training set and a test set; the training set comprises a positive training set and a negative training set; the test set comprises a positive type test set and a negative type test set;
s103, acquiring class center c of the positive class training set1And class center c of the negative class training set2And the center c of the training set.
Step S104, centering the class c1The difference between the vector c and the center c of the training set is determined as a normal hyperplane normal vector w1Centering said class c2The difference from the center c of the training set is determined as a negative hyperplane-like normal vector w2Centering said class c1And said class center c2The modulus of the difference is determined as the distance T between the centers of the two classes.
Step S105, according to the class center c1The class center c2The normal vector w1And the normal vector w2Determining the passing of said class center c1And a positive class hyperplane and a line passing through said class center c2The negative hyperplane-like.
Step S106, according to the class center c1The class center c2The normal vector w1And the normal vector w2Determining the first distance di+Second distance di-A third distance dli+And a fourth distance dli-(ii) a The first distance di+Representing a distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di-Representing a distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli+Representing the positive class data in the positive class training set passing through the class center c2A distance to the negative hyperplane; the fourth distance dli-Representing the negative class data in the negative class training set to pass through a class center c1A distance to the positive hyperplane;
step S107, determining the closeness C of the positive data in the positive training set according to the neighbor algorithmi +Determining the closeness C of the negative class data in the negative class training set according to the neighbor algorithmi -
Step S108, according to the first distance di+The second distance di-The third distance dli+The fourth distance dli-The tightness Ci +The tightness Ci -And the distance T between the two classes of centers determines a fuzzy membership function (1),
$$S_{i+} = S\left(d_{i+}, dl_{i+}, C_i^{+}, T; \varepsilon, \sigma\right), \qquad S_{i-} = S\left(d_{i-}, dl_{i-}, C_i^{-}, T; \varepsilon, \sigma\right) \qquad (1)$$
wherein Si+ represents the fuzzy membership of the positive class data, Si- represents the fuzzy membership of the negative class data, ε represents a radius control factor, and σ represents a parameter for assigning sample weights. ε controls the radius of the hypersphere so that most of the effective samples fall inside it; σ is a very small number, and weights are assigned to sample points in combination with the K-nearest-neighbor criterion.
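Formula (1) itself is available only as images in the source, so the following sketch is an illustrative stand-in rather than the patent's expression: it merely reproduces the behavior described above, with membership shrinking as the distance to the sample's own hyperplane grows, scaling by tightness, and ε and σ playing the hypersphere-radius and small-weight roles.

```python
import numpy as np

def fuzzy_membership(d, dl, C, T, eps=1.0, sigma=1e-3):
    """Illustrative stand-in for formula (1); the patent's exact expression
    is not reproduced in the source text. Membership shrinks as the distance
    d to the sample's own class hyperplane grows, is scaled by the tightness
    C, and eps*T acts as a hypersphere radius separating effective samples
    from suspected noise, which receives only the tiny weight sigma."""
    effective = dl >= eps * T               # far enough from the opposite hyperplane
    s = (1.0 - d / (d.max() + sigma)) * C   # distance- and tightness-based weight
    return np.where(effective, s, sigma)    # suspected noise points get weight sigma
```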
Step S109, determining the classification model (2) according to the fuzzy membership function (1) and a fuzzy twin support vector machine,
$$\mathrm{FTWSVM1:}\quad \min_{w_1, b_1, \xi}\ \frac{1}{2}\left\lVert A w_1 + e_1 b_1 \right\rVert^2 + d_1 S_B^{\mathsf T} \xi \quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi \ge e_2,\ \ \xi \ge 0$$
$$\mathrm{FTWSVM2:}\quad \min_{w_2, b_2, \eta}\ \frac{1}{2}\left\lVert B w_2 + e_2 b_2 \right\rVert^2 + d_2 S_A^{\mathsf T} \eta \quad \text{s.t.}\quad (A w_2 + e_1 b_2) + \eta \ge e_1,\ \ \eta \ge 0 \qquad (2)$$
wherein FTWSVM1 denotes the optimization problem of the positive class classification hyperplane, A denotes the matrix of first class (positive) samples to be classified, w1 the normal vector of the positive class classification hyperplane, e1 the all-ones column vector of the positive class, b1 a first constant, d1 a first penalty parameter, SA the fuzzy membership of the first class samples, ξ and η slack variables (relaxation factors), s.t. the constraint conditions, B the matrix of second class (negative) samples to be classified, e2 the all-ones column vector of the negative class, FTWSVM2 the optimization problem of the negative class classification hyperplane, w2 the normal vector of the negative class classification hyperplane, b2 a second constant, d2 a second penalty parameter, and SB the fuzzy membership of the second class samples.
Step S110, taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall, precision, g-mean and F-value on the test set as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by a grid search algorithm and cross-validation, so as to obtain the optimized classification model.
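As a runnable illustration of step S110, the sketch below performs a grid search with cross-validation scored by g-mean. scikit-learn's SVC merely stands in for the patent's FTWSVM, which is not a library model, and its penalty C plays the role of the penalty parameters d1 and d2; the toy data are likewise an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def g_mean(y_true, y_pred):
    # geometric mean of the per-class recalls, used to score each parameter setting
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

# Toy imbalanced data standing in for the speech features of the embodiment.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 10)), rng.normal(1, 1, (30, 10))])
y = np.hstack([np.zeros(300), np.ones(30)])        # 1 = minority (positive) class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Grid search with 5-fold cross-validation over the penalty parameter.
search = GridSearchCV(SVC(), {"C": 10.0 ** np.arange(-2, 3)},
                      scoring=make_scorer(g_mean), cv=5)
search.fit(X_tr, y_tr)
print("best parameters:", search.best_params_)
print("test g-mean:", g_mean(y_te, search.predict(X_te)))
```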
Step S111, acquiring unbalanced data to be detected. In this embodiment, the happy emotion samples in the TYUT2.0 emotion speech database of the Taiyuan University of Technology are selected as positive class samples, and the remaining emotion samples in the database are selected as negative class samples. The MFCC features, acoustic features and prosodic features of the sample speech are selected, and the mean, variance and standard deviation of each speech feature are computed, yielding the unbalanced data to be detected. The unbalanced data include positive class data and negative class data, with more negative class data than positive class data.
Step S112, taking the unbalanced data to be detected as the input of the optimized classification model to obtain the classification result of the unbalanced data to be detected.
In the method of this embodiment, the classification model determined on the basis of the fuzzy membership function gives sample points different membership values according to their different contributions to the classification hyperplane and the different imbalance rates of the two classes of samples, which reduces the imbalance between samples and weakens the influence of noise points on the classification hyperplane, thereby improving the accuracy of the classification results. Because the classification model is built from the fuzzy membership function and a fuzzy twin support vector machine, the method also only needs to solve two smaller quadratic programming problems, which greatly reduces the complexity of the algorithm and improves operational efficiency.
In practical application, the positive class hyperplane passing through the class center c1 is determined as $w_1^{\mathsf T} x_+ + b_1 = 1$ (3) and the negative class hyperplane passing through the class center c2 as $w_2^{\mathsf T} x_- + b_2 = -1$ (4) according to the class centers c1 and c2 and the normal vectors w1 and w2, wherein x+ represents the positive class data in the positive class training set and x- represents the negative class data in the negative class training set.
According to the class centers c1 and c2 and the normal vectors w1 and w2, the first distance
$$d_{i+} = \frac{\left| w_1^{\mathsf T} x_i^{+} + b_1 - 1 \right|}{\lVert w_1 \rVert},$$
the second distance
$$d_{i-} = \frac{\left| w_2^{\mathsf T} x_i^{-} + b_2 + 1 \right|}{\lVert w_2 \rVert},$$
the third distance
$$dl_{i+} = \frac{\left| w_2^{\mathsf T} x_i^{+} + b_2 + 1 \right|}{\lVert w_2 \rVert},$$
and the fourth distance
$$dl_{i-} = \frac{\left| w_1^{\mathsf T} x_i^{-} + b_1 - 1 \right|}{\lVert w_1 \rVert}$$
are determined, wherein $w_1^{\mathsf T}$ and $w_2^{\mathsf T}$ denote the transposes of the normal vectors w1 and w2, and $\lVert w_1 \rVert$ and $\lVert w_2 \rVert$ denote their moduli.
The tightness of the positive class data in the positive class training set is determined from its K nearest same-class neighbors,
$$C_i^{+} = C\left(X_i^{+}; x_1^{+}, \dots, x_K^{+}\right),$$
and likewise the tightness of the negative class data in the negative class training set,
$$C_i^{-} = C\left(X_i^{-}; x_1^{-}, \dots, x_K^{-}\right),$$
wherein Xi+ represents the i-th positive class sample in the positive class training set, xj+ represents the j-th of its K nearest neighbor samples, Xi- represents the i-th negative class sample in the negative class training set, xj- represents the j-th of its K nearest neighbor samples, and K is the number of neighbor samples.
Taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), the recall of the positive class test set
$$R = \frac{TP}{TP + FN},$$
the precision
$$P = \frac{TP}{TP + FP},$$
$$g\text{-}mean = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}$$
and
$$F = \frac{2PR}{P + R}$$
are taken as the output of the classification model (2), and the optimized first penalty parameter d1 and the optimized second penalty parameter d2 are determined by a grid search algorithm and cross-validation to obtain the optimized classification model; wherein TP represents the number of correctly classified positive class samples in the positive class test set, FN represents the number of positive class samples in the positive class test set misclassified as negative, TN represents the number of correctly classified negative class samples in the negative class test set, and FP represents the number of negative class samples in the negative class test set misclassified as positive.
The recall and the precision respectively reflect how completely and how reliably the classifier predicts the positive class, but a classifier with high recall does not necessarily have high precision, so the geometric mean g-mean is introduced to evaluate classifier performance: the larger the g-mean, the better the classification effect. The F-value combines the recall and the precision of the minority class.
This embodiment provides the specific calculation formulas for the first distance di+, the second distance di-, the third distance dli+, the fourth distance dli-, the tightness Ci+ and the tightness Ci-, as well as the specific calculation methods for the recall, precision, g-mean and F-value contained in the classification result.
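A minimal sketch of those four evaluation measures, computed directly from the confusion counts TP, FN, TN and FP defined above:

```python
def metrics(tp, fn, tn, fp):
    """Recall, precision, g-mean and F-value from the confusion counts
    (positive = minority class)."""
    recall = tp / (tp + fn)                   # fraction of positives found
    precision = tp / (tp + fp)                # fraction of predicted positives that are correct
    neg_recall = tn / (tn + fp)               # recall of the majority (negative) class
    g_mean = (recall * neg_recall) ** 0.5     # geometric mean of the two recalls
    f_value = 2 * precision * recall / (precision + recall)
    return recall, precision, g_mean, f_value

print(metrics(tp=18, fn=2, tn=150, fp=10))    # approximately (0.9, 0.643, 0.919, 0.75)
```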
In practical application, the fuzzy membership function (1) is introduced on the basis of the twin support vector machine model to form the fuzzy twin support vector machine (FTWSVM), and the classification model is reconstructed as follows:
the original TWSVM model abandons the parallel constraint condition, and for the two-classification problem, two non-parallel hyperplanes are constructed, the construction principle is that the samples are as close as possible to the sample point of the class and as far as possible from the other class, samples belonging to the class 1 and the class-1 are respectively represented by A, B matrixes, and the optimization problem is constructed as formula (3):
$$\min_{w_1, b_1, \xi}\ \frac{1}{2}\left\lVert A w_1 + e_1 b_1 \right\rVert^2 + d_1 e_2^{\mathsf T} \xi \quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi \ge e_2,\ \ \xi \ge 0$$
$$\min_{w_2, b_2, \eta}\ \frac{1}{2}\left\lVert B w_2 + e_2 b_2 \right\rVert^2 + d_2 e_1^{\mathsf T} \eta \quad \text{s.t.}\quad (A w_2 + e_1 b_2) + \eta \ge e_1,\ \ \eta \ge 0 \qquad (3)$$
wherein d1 and d2 are penalty parameters, and e1 and e2 are all-ones column vectors. Optimizing the above formulas yields the classification hyperplanes
$$w_1^{\mathsf T} x_+ + b_1 = 1, \qquad w_2^{\mathsf T} x_- + b_2 = -1,$$
by means of which the data are divided into the two classes.
On this basis, the fuzzy membership functions SA and SB are introduced, and the classification hyperplane optimization problem of the classification model can then be expressed as formula (2):
$$\mathrm{FTWSVM1:}\quad \min_{w_1, b_1, \xi}\ \frac{1}{2}\left\lVert A w_1 + e_1 b_1 \right\rVert^2 + d_1 S_B^{\mathsf T} \xi \quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi \ge e_2,\ \ \xi \ge 0$$
$$\mathrm{FTWSVM2:}\quad \min_{w_2, b_2, \eta}\ \frac{1}{2}\left\lVert B w_2 + e_2 b_2 \right\rVert^2 + d_2 S_A^{\mathsf T} \eta \quad \text{s.t.}\quad (A w_2 + e_1 b_2) + \eta \ge e_1,\ \ \eta \ge 0 \qquad (2)$$
wherein SA and SB are the fuzzy memberships of the individual samples in A and B, and the product of a sample's error and its membership represents the amount that the sample point contributes to the classifier. Taking the Lagrangian dual of each problem and solving, the class of a sample x is decided by
$$x^{\mathsf T} w_r + b_r = \min_{l=1,2} \left| x^{\mathsf T} w_l + b_l \right| \qquad (5),$$
wherein $\left| x^{\mathsf T} w_l + b_l \right|$ is the perpendicular distance from x to the plane $x^{\mathsf T} w_l + b_l = 0$ (l = 1, 2); a sample is assigned to the class whose hyperplane it is nearer to.
This embodiment provides the specific derivation of the classification model. Because the classification model determined from the fuzzy membership function and the fuzzy twin support vector machine solves two smaller quadratic programming problems, training can be roughly four times faster than a standard SVM when the two classes contain the same number of samples, which greatly reduces the complexity of the algorithm and improves operational efficiency.
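To make the "two smaller quadratic programs" concrete, the sketch below solves the standard TWSVM dual for one plane with cvxopt and classifies by the nearer plane as in formula (5). Placing the memberships as caps on the box constraint is an assumption carried over from common FTWSVM formulations, not a statement of the patent's exact model, and all names are illustrative.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

def ftwsvm_plane(A, B, d, s_B, reg=1e-6):
    """Solve the dual of one twin-SVM subproblem and return (w, b).

    Standard TWSVM dual (Jayadeva et al.): with H = [A e], G = [B e],
    minimize 0.5*a'G(H'H)^(-1)G'a - e'a subject to 0 <= a <= d * s_B,
    where the memberships s_B of the opposite-class samples cap the box
    constraint (an assumed FTWSVM choice; the patent's equations are
    image-only). reg regularizes the matrix inversion.
    """
    H = np.hstack([A, np.ones((A.shape[0], 1))])
    G = np.hstack([B, np.ones((B.shape[0], 1))])
    HtH_inv = np.linalg.inv(H.T @ H + reg * np.eye(H.shape[1]))
    Q = G @ HtH_inv @ G.T                              # dual Hessian
    m = B.shape[0]
    sol = solvers.qp(matrix(Q), matrix(-np.ones(m)),
                     matrix(np.vstack([-np.eye(m), np.eye(m)])),
                     matrix(np.hstack([np.zeros(m), d * np.asarray(s_B, float)])))
    alpha = np.array(sol["x"]).ravel()
    u = -HtH_inv @ G.T @ alpha                         # u = [w; b]
    return u[:-1], u[-1]

def predict(x, planes):
    # assign x to the class whose hyperplane is nearer, as in formula (5)
    return int(np.argmin([abs(x @ w + b) / np.linalg.norm(w) for w, b in planes]))

# Tiny demo: plane 1 fits class A, plane 2 fits class B (the sign flip from
# swapping the roles does not change the plane, since only |x'w + b| is used).
rng = np.random.default_rng(1)
A, B = rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (60, 2))
planes = [ftwsvm_plane(A, B, 1.0, np.ones(60)), ftwsvm_plane(B, A, 1.0, np.ones(20))]
print(predict(np.array([0.0, 0.0]), planes))           # expected: 0 (the A side)
```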
FIG. 2 is a diagram of a classification system for unbalanced data sets according to the present invention. As shown in fig. 2, a classification system for unbalanced data sets includes:
the first acquisition module 1 is used for acquiring unbalanced data; the unbalanced data comprises positive class data and negative class data; the positive data represent a type of data with a small quantity in the sample unbalanced data, and the negative data represent a type of data with a large quantity in the sample unbalanced data;
the training set and test set generating module 2 is used for randomly dividing sample unbalanced data to obtain a training set and a test set; the training set comprises a positive training set and a negative training set; the test set comprises a positive type test set and a negative type test set
A second obtaining module 3, configured to obtain a class center c of the positive class training set1And class center c of the negative class training set2And a center c of the training set;
a normal vector and a distance T between two class centers is determined by a module 4 for determining the class center c1The difference from the center c of the training set is determined to be positiveHyperplane-like normal vector w1Centering said class c2The difference from the center c of the training set is determined as a negative hyperplane-like normal vector w2Centering said class c1And said class center c2The modulus of the difference is determined as the distance T between the two types of centers;
a hyperplane determining module 5 for determining the center c according to the class1The class center c2The normal vector w1And the normal vector w2Determining the passing of said class center c1And a positive class hyperplane and a line passing through said class center c2The negative hyperplane-like surface of (1);
a distance determination module 6 for determining the center c according to the class1The class center c2The normal vector w1And the normal vector w2Determining the first distance di+Second distance di-A third distance dli+And a fourth distance dli-(ii) a The first distance di+Representing a distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di-Representing a distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli+Representing the positive class data in the positive class training set passing through the class center c2A distance to the negative hyperplane; the fourth distance dli-Representing the negative class data in the negative class training set to pass through a class center c1A distance to the positive hyperplane;
a closeness determining module 7 for determining the closeness C of the positive data in the positive training set according to the neighbor algorithmi +Determining the closeness C of the negative class data in the negative class training set according to the neighbor algorithmi -
A fuzzy membership function determination module 8 for determining a fuzzy membership function according to said first distance di+The second distance di-The third distance dli+The fourth distance dli-The tightness Ci +The tightness Ci -And the distance T between the two classes of centers determines a fuzzy membership function (1),
$$S_{i+} = S\left(d_{i+}, dl_{i+}, C_i^{+}, T; \varepsilon, \sigma\right), \qquad S_{i-} = S\left(d_{i-}, dl_{i-}, C_i^{-}, T; \varepsilon, \sigma\right) \qquad (1)$$
wherein Si+ represents the fuzzy membership of the positive class data, Si- represents the fuzzy membership of the negative class data, ε represents a radius control factor, and σ represents a parameter for assigning sample weights;
a classification model determining module 9, used for determining the classification model (2) according to the fuzzy membership function (1) and the fuzzy twin support vector machine,
$$\mathrm{FTWSVM1:}\quad \min_{w_1, b_1, \xi}\ \frac{1}{2}\left\lVert A w_1 + e_1 b_1 \right\rVert^2 + d_1 S_B^{\mathsf T} \xi \quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi \ge e_2,\ \ \xi \ge 0$$
$$\mathrm{FTWSVM2:}\quad \min_{w_2, b_2, \eta}\ \frac{1}{2}\left\lVert B w_2 + e_2 b_2 \right\rVert^2 + d_2 S_A^{\mathsf T} \eta \quad \text{s.t.}\quad (A w_2 + e_1 b_2) + \eta \ge e_1,\ \ \eta \ge 0 \qquad (2)$$
wherein FTWSVM1 denotes the optimization problem of the positive class classification hyperplane, A denotes the matrix of first class (positive) samples to be classified, w1 the normal vector of the positive class classification hyperplane, e1 the all-ones column vector of the positive class, b1 a first constant, d1 a first penalty parameter, SA the fuzzy membership of the first class samples, ξ and η slack variables (relaxation factors), s.t. the constraint conditions, B the matrix of second class (negative) samples to be classified, e2 the all-ones column vector of the negative class, FTWSVM2 the optimization problem of the negative class classification hyperplane, w2 the normal vector of the negative class classification hyperplane, b2 a second constant, d2 a second penalty parameter, and SB the fuzzy membership of the second class samples;
an optimized classification model generating module 10, used for taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall, precision, g-mean and F-value on the test set as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by a grid search algorithm and cross-validation to obtain the optimized classification model;
a third obtaining module 11, configured to obtain unbalanced data to be detected;
and the classification result generation module 12 is configured to use the to-be-detected unbalanced data as an input of the optimized classification model to obtain a classification result of the to-be-detected unbalanced data.
Using the classification model determined on the basis of the fuzzy membership function, the system of this embodiment gives sample points different membership values according to their different contributions to the classification hyperplane and the different imbalance rates of the two classes of samples, which reduces the imbalance between samples, weakens the influence of noise points on the classification hyperplane, and improves the accuracy of the classification results. The system also only solves two smaller quadratic programming problems, thanks to the classification model built from the fuzzy membership function and the fuzzy twin support vector machine, which greatly reduces the complexity of the algorithm and improves operational efficiency.
In practical application, the hyperplane determining module specifically includes: a positive and negative hyperplane determining unit, used for determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the positive class hyperplane passing through the class center c1 as $w_1^{\mathsf T} x_+ + b_1 = 1$ (3) and the negative class hyperplane passing through the class center c2 as $w_2^{\mathsf T} x_- + b_2 = -1$ (4), wherein x+ represents the positive class data in the positive class training set and x- represents the negative class data in the negative class training set.
The distance determining module specifically includes: a distance determining unit, used for determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the first distance
$$d_{i+} = \frac{\left| w_1^{\mathsf T} x_i^{+} + b_1 - 1 \right|}{\lVert w_1 \rVert},$$
the second distance
$$d_{i-} = \frac{\left| w_2^{\mathsf T} x_i^{-} + b_2 + 1 \right|}{\lVert w_2 \rVert},$$
the third distance
$$dl_{i+} = \frac{\left| w_2^{\mathsf T} x_i^{+} + b_2 + 1 \right|}{\lVert w_2 \rVert},$$
and the fourth distance
$$dl_{i-} = \frac{\left| w_1^{\mathsf T} x_i^{-} + b_1 - 1 \right|}{\lVert w_1 \rVert},$$
wherein $w_1^{\mathsf T}$ and $w_2^{\mathsf T}$ denote the transposes of the normal vectors w1 and w2, and $\lVert w_1 \rVert$ and $\lVert w_2 \rVert$ denote their moduli.
The tightness determining module specifically includes: a positive class data tightness determining unit, used for determining the tightness of the positive class data in the positive class training set from its K nearest same-class neighbors,
$$C_i^{+} = C\left(X_i^{+}; x_1^{+}, \dots, x_K^{+}\right);$$
and a negative class data tightness determining unit, used for determining the tightness of the negative class data in the negative class training set,
$$C_i^{-} = C\left(X_i^{-}; x_1^{-}, \dots, x_K^{-}\right),$$
wherein Xi+ represents the i-th positive class sample in the positive class training set, xj+ represents the j-th of its K nearest neighbor samples, Xi- represents the i-th negative class sample in the negative class training set, xj- represents the j-th of its K nearest neighbor samples, and K is the number of neighbor samples.
The optimized classification model generating module specifically includes: an optimized classification model generating unit, used for taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall of the positive class test set
$$R = \frac{TP}{TP + FN},$$
the precision
$$P = \frac{TP}{TP + FP},$$
$$g\text{-}mean = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}$$
and
$$F = \frac{2PR}{P + R}$$
as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by a grid search algorithm and cross-validation to obtain the optimized classification model; wherein TP represents the number of correctly classified positive class samples in the positive class test set, FN represents the number of positive class samples in the positive class test set misclassified as negative, TN represents the number of correctly classified negative class samples in the negative class test set, and FP represents the number of negative class samples in the negative class test set misclassified as positive.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; for those skilled in the art, variations can be made in the specific embodiments and applications without departing from the spirit of the invention. In view of the above, this description should not be taken as limiting the invention.

Claims (10)

1. A method of classifying an unbalanced data set, comprising:
acquiring sample unbalanced data, specifically comprising: selecting the angry emotion samples in the CASIA Chinese emotion corpus as positive class samples and the remaining emotion samples in the corpus as negative class samples, selecting the MFCC (Mel-frequency cepstral coefficient) features, acoustic features and prosodic features of the sample speech, and computing the mean, variance and standard deviation of each speech feature, so as to obtain the unbalanced data; the sample unbalanced data comprises positive class data and negative class data; the positive class data represent the class with few samples in the sample unbalanced data, and the negative class data represent the class with many samples;
randomly dividing sample unbalanced data to obtain a training set and a test set; the training set comprises a positive training set and a negative training set; the test set comprises a positive type test set and a negative type test set;
obtaining the class center c1 of the positive class training set, the class center c2 of the negative class training set, and the center c of the whole training set;
determining the difference between the class center c1 and the training set center c as the positive class hyperplane normal vector w1, the difference between the class center c2 and the training set center c as the negative class hyperplane normal vector w2, and the modulus of the difference between the class centers c1 and c2 as the distance T between the two class centers;
determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the positive class hyperplane passing through the class center c1 and the negative class hyperplane passing through the class center c2;
determining, according to the class centers c1 and c2 and the normal vectors w1 and w2, the first distance di+, the second distance di-, the third distance dli+ and the fourth distance dli-; the first distance di+ represents the distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance di- represents the distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dli+ represents the distance from the positive class data in the positive class training set to the negative class hyperplane passing through the class center c2; the fourth distance dli- represents the distance from the negative class data in the negative class training set to the positive class hyperplane passing through the class center c1;
determining, according to a K-nearest-neighbor algorithm, the tightness Ci+ of the positive class data in the positive class training set and the tightness Ci- of the negative class data in the negative class training set;
determining a fuzzy membership function (1) from the first distance di+, the second distance di-, the third distance dli+, the fourth distance dli-, the tightness Ci+, the tightness Ci- and the distance T between the two class centers,
$$S_{i+} = S\left(d_{i+}, dl_{i+}, C_i^{+}, T; \varepsilon, \sigma\right), \qquad S_{i-} = S\left(d_{i-}, dl_{i-}, C_i^{-}, T; \varepsilon, \sigma\right) \qquad (1)$$
wherein Si+ represents the fuzzy membership of the positive class data, Si- represents the fuzzy membership of the negative class data, ε represents a radius control factor, and σ represents a parameter for assigning sample weights;
determining a classification model (2) according to the fuzzy membership function (1) and a fuzzy twin support vector machine,
$$\mathrm{FTWSVM1:}\quad \min_{w_1, b_1, \xi}\ \frac{1}{2}\left\lVert A w_1 + e_1 b_1 \right\rVert^2 + d_1 S_B^{\mathsf T} \xi \quad \text{s.t.}\quad -(B w_1 + e_2 b_1) + \xi \ge e_2,\ \ \xi \ge 0$$
$$\mathrm{FTWSVM2:}\quad \min_{w_2, b_2, \eta}\ \frac{1}{2}\left\lVert B w_2 + e_2 b_2 \right\rVert^2 + d_2 S_A^{\mathsf T} \eta \quad \text{s.t.}\quad (A w_2 + e_1 b_2) + \eta \ge e_1,\ \ \eta \ge 0 \qquad (2)$$
wherein FTWSVM1 denotes the optimization problem of the positive class classification hyperplane, A denotes the matrix of first class (positive) samples to be classified, w1 the normal vector of the positive class classification hyperplane, e1 the all-ones column vector of the positive class, b1 a first constant, d1 a first penalty parameter, SA the fuzzy membership of the first class samples, ξ and η slack variables (relaxation factors), s.t. the constraint conditions, B the matrix of second class (negative) samples to be classified, e2 the all-ones column vector of the negative class, FTWSVM2 the optimization problem of the negative class classification hyperplane, w2 the normal vector of the negative class classification hyperplane, b2 a second constant, d2 a second penalty parameter, and SB the fuzzy membership of the second class samples;
taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall ratio, the precision ratio, the g-mean and the F value on the test set as the output of the classification model (2), and determining an optimized first penalty parameter d1 and an optimized second penalty parameter d2 by using a grid search algorithm and cross-validation, to obtain an optimized classification model;
acquiring the unbalanced data to be classified, specifically comprising: selecting the happy emotion samples in the TUT 2.0 emotional speech database of Taiyuan University of Technology as positive class samples, selecting the remaining emotion samples in that database as negative class samples, selecting the MFCC (Mel-frequency cepstral coefficient) features, tone features and rhythm features of the sample speech, and calculating the mean, variance and standard deviation of each speech feature, to obtain the unbalanced data to be classified;
and taking the unbalanced data to be classified as the input of the optimized classification model, to obtain the classification result of the unbalanced data to be classified.
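By way of illustration only (not part of the claimed method), the center-finding steps of the claim above can be sketched in a few lines of NumPy; taking the arithmetic mean as the "center", and all function and variable names, are assumptions of this sketch:

import numpy as np

def centers_and_normals(X_pos, X_neg):
    # Class centers of the positive and negative training sets
    # (arithmetic mean is an assumed choice of "center").
    c1 = X_pos.mean(axis=0)
    c2 = X_neg.mean(axis=0)
    # Center of the whole training set.
    c = np.vstack([X_pos, X_neg]).mean(axis=0)
    # Hyperplane normal vectors and the two-class-center distance T.
    w1 = c1 - c
    w2 = c2 - c
    T = np.linalg.norm(c1 - c2)
    return c1, c2, c, w1, w2, T

Here X_pos and X_neg are row-sample arrays for the positive and negative class training sets; the returned T is the two-class-center distance later consumed by the membership function.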
2. The classification method according to claim 1, wherein determining, according to the class center c1, the class center c2, the normal vector w1 and the normal vector w2, the positive class hyperplane passing through the class center c1 and the negative class hyperplane passing through the class center c2 specifically comprises:
determining, according to the class center c1, the class center c2, the normal vector w1 and the normal vector w2, the positive class hyperplane w1·x^+ + b1 = 1 (3) passing through the class center c1 and the negative class hyperplane w2·x^- + b2 = 1 (4) passing through the class center c2; wherein x^+ represents the positive class data in the positive class training set and x^- represents the negative class data in the negative class training set.
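Because each hyperplane of equations (3) and (4) is required to pass through its class center, the offsets follow in closed form; a minimal sketch, assuming this reading of the equations:

import numpy as np

def hyperplane_offsets(c1, c2, w1, w2):
    # w1.x + b1 = 1 must hold at x = c1, so b1 = 1 - w1.c1;
    # symmetrically for the negative class hyperplane at c2.
    b1 = 1.0 - np.dot(w1, c1)
    b2 = 1.0 - np.dot(w2, c2)
    return b1, b2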
3. The classification method according to claim 2, wherein determining, according to the class center c1, the class center c2, the normal vector w1 and the normal vector w2, the first distance d_{i+}, the second distance d_{i-}, the third distance dl_{i+} and the fourth distance dl_{i-} specifically comprises:
determining, according to the class center c1, the class center c2, the normal vector w1 and the normal vector w2, the first distance
d_{i+} = |w_1^T x_i^+ + b_1 - 1| / |w_1|,
the second distance
d_{i-} = |w_2^T x_i^- + b_2 - 1| / |w_2|,
the third distance
dl_{i+} = |w_2^T x_i^+ + b_2 - 1| / |w_2|,
and the fourth distance
dl_{i-} = |w_1^T x_i^- + b_1 - 1| / |w_1|;
wherein w_1^T represents the transpose of the normal vector w1, w_2^T represents the transpose of the normal vector w2, |w1| represents the modulus of the normal vector w1, and |w2| represents the modulus of the normal vector w2.
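A vectorized sketch of these four point-to-hyperplane distances, assuming the |w·x + b − 1| / |w| form recovered above (array names are illustrative):

import numpy as np

def class_distances(X_pos, X_neg, w1, b1, w2, b2):
    n1 = np.linalg.norm(w1)
    n2 = np.linalg.norm(w2)
    d_pos = np.abs(X_pos @ w1 + b1 - 1.0) / n1   # d_{i+}: positive data to positive hyperplane
    d_neg = np.abs(X_neg @ w2 + b2 - 1.0) / n2   # d_{i-}: negative data to negative hyperplane
    dl_pos = np.abs(X_pos @ w2 + b2 - 1.0) / n2  # dl_{i+}: positive data to negative hyperplane
    dl_neg = np.abs(X_neg @ w1 + b1 - 1.0) / n1  # dl_{i-}: negative data to positive hyperplane
    return d_pos, d_neg, dl_pos, dl_neg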
4. The classification method according to claim 2, wherein determining the closeness C_i^+ of the positive class data in the positive class training set and the closeness C_i^- of the negative class data in the negative class training set according to the nearest neighbor algorithm specifically comprises:
determining the closeness of the positive class data in the positive class training set according to the nearest neighbor algorithm
[closeness C_i^+: reproduced in the original only as image FDA0003551614440000041]
and determining the closeness of the negative class data in the negative class training set according to the nearest neighbor algorithm
[closeness C_i^-: reproduced in the original only as image FDA0003551614440000042]
wherein x_i^+ represents the ith positive class data in the positive class training set, x_j^+ represents the jth neighbor sample in the set of K neighbor samples of x_i^+, x_i^- represents the ith negative class data in the negative class training set, x_j^- represents the jth neighbor sample in the set of K neighbor samples of x_i^-, and K is the number of neighbor samples in each set.
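The closeness formulas survive only as images in the published text, so the sketch below substitutes one common nearest-neighbor choice — closeness decaying with the mean distance to the K nearest same-class neighbors — which should be read as an assumption, not the patent's exact definition:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def closeness(X, k=5):
    # Fit K nearest neighbors within one class; each point's nearest
    # neighbor is itself, so request k + 1 and drop the self column.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    # Assumed proxy: tighter neighborhoods give closeness nearer 1.
    return np.exp(-dist[:, 1:].mean(axis=1))

Applied separately to the positive and negative class training sets, this yields per-sample values playing the role of C_i^+ and C_i^-.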
5. The classification method according to claim 1, wherein taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), taking the recall ratio, the precision ratio, the g-mean and the F value of the test set as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by using a grid search algorithm and cross-validation to obtain the optimized classification model specifically comprises:
taking the training set and the test set of the sample unbalanced data as the input of the classification model (2), and taking the recall ratio of the positive class test set
R = TP / (TP + FN),
the precision ratio
P = TP / (TP + FP),
the g-mean
g-mean = sqrt( (TP / (TP + FN)) × (TN / (TN + FP)) ),
and the F value
F = 2PR / (P + R)
as the output of the classification model (2), and determining the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by using a grid search algorithm and cross-validation, to obtain the optimized classification model; wherein TP represents the number of correctly classified positive class data in the positive class test set, FN represents the number of positive class data in the positive class test set that are incorrectly classified, TN represents the number of correctly classified negative class data in the negative class test set, and FP represents the number of negative class data in the negative class test set that are incorrectly classified.
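A small sketch of these four evaluation measures computed from the confusion counts, with the minority (positive) class labeled 1 by assumption:

import numpy as np

def imbalance_metrics(y_true, y_pred):
    # Confusion counts with the minority (positive) class labeled 1.
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    g_mean = np.sqrt(recall * tn / (tn + fp))
    f_value = 2 * precision * recall / (precision + recall)
    return recall, precision, g_mean, f_value

In the grid search, each candidate penalty pair (d1, d2) would be trained under cross-validation and scored with these measures, and the best-scoring pair kept.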
6. A classification system for unbalanced data sets, comprising:
a first obtaining module, configured to obtain the sample unbalanced data, specifically comprising: selecting the angry emotion samples in the CASIA Chinese emotion corpus as positive class samples, selecting the remaining emotion samples in the CASIA Chinese emotion corpus as negative class samples, selecting the MFCC (Mel-frequency cepstral coefficient) features, tone features and rhythm features of the sample speech, and calculating the mean, variance and standard deviation of each speech feature, to obtain the sample unbalanced data; the sample unbalanced data comprises positive class data and negative class data; the positive class data represents the minority class in the sample unbalanced data, and the negative class data represents the majority class in the sample unbalanced data;
a training set and test set generating module, configured to randomly divide the sample unbalanced data to obtain a training set and a test set; the training set comprises a positive class training set and a negative class training set; the test set comprises a positive class test set and a negative class test set;
a second obtaining module, configured to obtain the class center c1 of the positive class training set, the class center c2 of the negative class training set, and the center c of the training set;
a normal vector and two-class-center distance determining module, configured to determine the difference between the class center c1 and the center c of the training set as the positive class hyperplane normal vector w1, determine the difference between the class center c2 and the center c of the training set as the negative class hyperplane normal vector w2, and determine the modulus of the difference between the class center c1 and the class center c2 as the distance T between the two class centers;
a hyperplane determination module, configured to determine, according to the class center c1, the class center c2, the normal vector w1 and the normal vector w2, the positive class hyperplane passing through the class center c1 and the negative class hyperplane passing through the class center c2;
a distance determination module, configured to determine, according to the class center c1, the class center c2, the normal vector w1 and the normal vector w2, the first distance d_{i+}, the second distance d_{i-}, the third distance dl_{i+} and the fourth distance dl_{i-}; the first distance d_{i+} represents the distance from the positive class data in the positive class training set to the positive class hyperplane; the second distance d_{i-} represents the distance from the negative class data in the negative class training set to the negative class hyperplane; the third distance dl_{i+} represents the distance from the positive class data in the positive class training set to the negative class hyperplane passing through the class center c2; the fourth distance dl_{i-} represents the distance from the negative class data in the negative class training set to the positive class hyperplane passing through the class center c1;
a closeness determination module, configured to determine the closeness C_i^+ of the positive class data in the positive class training set and the closeness C_i^- of the negative class data in the negative class training set according to a nearest neighbor algorithm;
a fuzzy membership function determination module, configured to determine the fuzzy membership function (1) according to the first distance d_{i+}, the second distance d_{i-}, the third distance dl_{i+}, the fourth distance dl_{i-}, the closeness C_i^+, the closeness C_i^- and the distance T between the two class centers,
[formula (1), fuzzy membership function: reproduced in the original only as image FDA0003551614440000061]
wherein S_{i+} represents the fuzzy membership of the positive class data, S_{i-} represents the fuzzy membership of the negative class data, ε represents a radius control factor, and σ represents a sample weight assignment parameter;
a classification model determining module, configured to determine the classification model (2) according to the fuzzy membership function (1) and a fuzzy twin support vector machine,
[formula (2), the pair of FTWSVM optimization problems: reproduced in the original only as image FDA0003551614440000062]
wherein FTWSVM1 represents the positive class classification hyperplane, A represents the first data to be classified, w1 represents the normal vector of the positive class classification hyperplane, e1 represents a positive class column vector whose elements are all equal to 1, b1 represents a first constant, d1 represents a first penalty parameter, S_A represents the fuzzy membership of the first data to be classified, ξ represents a slack variable, s.t. denotes the constraint conditions, B represents the second data to be classified, e2 represents a negative class column vector whose elements are all equal to 1, FTWSVM2 represents the negative class classification hyperplane, w2 represents the normal vector of the negative class classification hyperplane, b2 represents a second constant, d2 represents a second penalty parameter, and S_B represents the fuzzy membership of the second data to be classified;
an optimized classification model generation module, configured to take the training set and the test set of the sample unbalanced data as the input of the classification model (2), take the recall ratio, the precision ratio, the g-mean and the F value of the test set as the output of the classification model (2), and determine the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by using a grid search algorithm and cross-validation, to obtain the optimized classification model;
a third obtaining module, configured to obtain the unbalanced data to be classified, specifically comprising: selecting the happy emotion samples in the TUT 2.0 emotional speech database of Taiyuan University of Technology as positive class samples, selecting the remaining emotion samples in that database as negative class samples, selecting the MFCC (Mel-frequency cepstral coefficient) features, tone features and rhythm features of the sample speech, and calculating the mean, variance and standard deviation of each speech feature, to obtain the unbalanced data to be classified;
and a classification result generation module, configured to take the unbalanced data to be classified as the input of the optimized classification model, to obtain the classification result of the unbalanced data to be classified.
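As a rough sketch of the per-utterance feature statistics named in the obtaining modules: librosa and the specific features below (MFCCs plus a YIN pitch track standing in for the tone and rhythm features) are assumptions, since the claims do not name an extraction toolkit:

import numpy as np
import librosa

def speech_feature_vector(path):
    # Frame-level features: 13 MFCCs and a YIN pitch track (the pitch
    # track is an assumed stand-in for the tone/rhythm features).
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    f0 = librosa.yin(y, fmin=80, fmax=400, sr=sr)

    def stats(a):
        a = np.atleast_2d(a)
        # Mean, variance and standard deviation of each feature track.
        return np.concatenate([a.mean(axis=1), a.var(axis=1), a.std(axis=1)])

    return np.concatenate([stats(mfcc), stats(f0)])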
7. The classification system according to claim 6, wherein the hyperplane determination module specifically comprises:
a positive class hyperplane and negative class hyperplane determining unit, configured to determine, according to the class center c1, the class center c2, the normal vector w1 and the normal vector w2, the positive class hyperplane w1·x^+ + b1 = 1 (3) passing through the class center c1 and the negative class hyperplane w2·x^- + b2 = 1 (4) passing through the class center c2; wherein x^+ represents the positive class data in the positive class training set and x^- represents the negative class data in the negative class training set.
8. The classification system according to claim 7, wherein the distance determination module specifically includes:
a distance determination unit, configured to determine, according to the class center c1, the class center c2, the normal vector w1 and the normal vector w2, the first distance
d_{i+} = |w_1^T x_i^+ + b_1 - 1| / |w_1|,
the second distance
d_{i-} = |w_2^T x_i^- + b_2 - 1| / |w_2|,
the third distance
dl_{i+} = |w_2^T x_i^+ + b_2 - 1| / |w_2|,
and the fourth distance
dl_{i-} = |w_1^T x_i^- + b_1 - 1| / |w_1|;
wherein w_1^T represents the transpose of the normal vector w1, w_2^T represents the transpose of the normal vector w2, |w1| represents the modulus of the normal vector w1, and |w2| represents the modulus of the normal vector w2.
9. The classification system according to claim 7, wherein the closeness determination module specifically comprises:
a positive class data closeness determining unit, configured to determine the closeness of the positive class data in the positive class training set according to the nearest neighbor algorithm
[closeness C_i^+: reproduced in the original only as image FDA0003551614440000081]
and a negative class data closeness determining unit, configured to determine the closeness of the negative class data in the negative class training set according to the nearest neighbor algorithm
[closeness C_i^-: reproduced in the original only as image FDA0003551614440000082]
wherein x_i^+ represents the ith positive class data in the positive class training set, x_j^+ represents the jth neighbor sample in the set of K neighbor samples of x_i^+, x_i^- represents the ith negative class data in the negative class training set, x_j^- represents the jth neighbor sample in the set of K neighbor samples of x_i^-, and K is the number of neighbor samples in each set.
10. The classification system according to claim 6, wherein the optimized classification model generation module specifically includes:
an optimized classification model generation unit, configured to take the training set and the test set of the sample unbalanced data as the input of the classification model (2), take the recall ratio of the positive class test set
R = TP / (TP + FN),
the precision ratio
P = TP / (TP + FP),
the g-mean
g-mean = sqrt( (TP / (TP + FN)) × (TN / (TN + FP)) ),
and the F value
F = 2PR / (P + R)
as the output of the classification model (2), and determine the optimized first penalty parameter d1 and the optimized second penalty parameter d2 by using a grid search algorithm and cross-validation, to obtain the optimized classification model; wherein TP represents the number of correctly classified positive class data in the positive class test set, FN represents the number of positive class data in the positive class test set that are incorrectly classified, TN represents the number of correctly classified negative class data in the negative class test set, and FP represents the number of negative class data in the negative class test set that are incorrectly classified.
CN201811061152.7A 2018-09-12 2018-09-12 Method and system for classifying unbalanced data sets Active CN109165694B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811061152.7A CN109165694B (en) 2018-09-12 2018-09-12 Method and system for classifying unbalanced data sets

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811061152.7A CN109165694B (en) 2018-09-12 2018-09-12 Method and system for classifying unbalanced data sets

Publications (2)

Publication Number Publication Date
CN109165694A CN109165694A (en) 2019-01-08
CN109165694B (en) 2022-07-08

Family

ID=64894748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811061152.7A Active CN109165694B (en) 2018-09-12 2018-09-12 Method and system for classifying unbalanced data sets

Country Status (1)

Country Link
CN (1) CN109165694B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516733A (en) * 2019-08-23 2019-11-29 西南石油大学 A well logging lithology recognition method based on an improved multi-classification twin support vector machine
CN110751190A (en) * 2019-09-27 2020-02-04 北京淇瑀信息科技有限公司 Financial risk model generation method and device and electronic equipment
CN110781922A (en) * 2019-09-27 2020-02-11 北京淇瑀信息科技有限公司 Sample data generation method and device for machine learning model and electronic equipment
CN115008882A (en) * 2022-08-09 2022-09-06 南通海恒纺织设备有限公司 Rotary screen printer pressure compensation optimization system based on the Industrial Internet of Things
CN116108349B (en) * 2022-12-19 2023-12-15 广州爱浦路网络技术有限公司 Algorithm model training optimization method, device, data classification method and system


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7707129B2 (en) * 2006-03-20 2010-04-27 Microsoft Corporation Text classification by weighted proximal support vector machine based on positive and negative sample sizes and weights
US20170270429A1 (en) * 2016-03-21 2017-09-21 Xerox Corporation Methods and systems for improved machine learning using supervised classification of imbalanced datasets with overlap
US10084822B2 (en) * 2016-05-19 2018-09-25 Nec Corporation Intrusion detection and prevention system and method for generating detection rules and taking countermeasures
US11049011B2 (en) * 2016-11-16 2021-06-29 Indian Institute Of Technology Delhi Neural network classifier

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method for unbalanced data
CN104463221A (en) * 2014-12-22 2015-03-25 江苏科海智能系统有限公司 Imbalance sample weighting method suitable for training of support vector machine
CN104679860A (en) * 2015-02-27 2015-06-03 北京航空航天大学 Classifying method for unbalanced data
CN105913091A (en) * 2016-04-19 2016-08-31 华东理工大学 Support vector data description method for fuzzy zone negative class samples based on class center distance
CN107871141A (en) * 2017-11-07 2018-04-03 太原理工大学 A classification prediction method and classification predictor for unbalanced data sets

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Fuzzy multiclass support vector machines for unbalanced data; Yuanyuan Wu et al.; 2017 29th Chinese Control And Decision Conference (CCDC); 2017-07-17; pp. 1-5 *
Fuzzy support vector machines for class imbalance learning; Batuwita R et al.; IEEE Transactions on Fuzzy Systems; 2010-12-31; Vol. 18, No. 3; pp. 558-571 *
Research on fuzzy twin support vector machines based on hybrid fuzzy membership; Ding Shengfeng et al.; Application Research of Computers; 2013-02-28; Vol. 30, No. 2; pp. 432-435 *
A novel fuzzy SVM model for imbalanced data classification; Cai Yanyan et al.; Journal of Xidian University (Natural Science Edition); 2015-10-31; Vol. 42, No. 5; pp. 120-124 *

Also Published As

Publication number Publication date
CN109165694A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
CN109165694B (en) Method and system for classifying unbalanced data sets
US11586875B2 (en) Systems and methods for optimization of a data model network architecture for target deployment
CN106202952A (en) A kind of Parkinson disease diagnostic method based on machine learning
CN111242302A (en) XGboost prediction method of intelligent parameter optimization module
CN103941131A (en) Transformer fault detecting method based on simplified set unbalanced SVM (support vector machine)
CN110826618A (en) Personal credit risk assessment method based on random forest
CN113094988A (en) Data-driven slurry circulating pump operation optimization method and system
CN109344907A (en) Based on the method for discrimination for improving judgment criteria sorting algorithm
Zhou et al. Personal credit default prediction model based on convolution neural network
May et al. Topic identification and discovery on text and speech
CN109460872B (en) Mobile communication user loss imbalance data prediction method
Fan et al. Modeling voice pathology detection using imbalanced learning
CN113420508A (en) Unit combination calculation method based on LSTM
CN105608460A (en) Method and system for fusing multiple classifiers
CN117077819A (en) Water quality prediction method
CN113821975B (en) Method and system for predicting performance decay of fuel cell
Castillo et al. Optimization of the fuzzy C-means algorithm using evolutionary methods
Benjumea et al. Genetic clustering algorithm for extractive text summarization
CN115437960A (en) Regression test case sequencing method, device, equipment and storage medium
CN111522743B (en) Software defect prediction method based on gradient lifting tree support vector machine
CN113656707A (en) Financing product recommendation method, system, storage medium and equipment
Deng et al. A negative selection algorithm based on adaptive immunoregulation
Kimiaei An improved randomized algorithm with noise level tuning for large-scale noisy unconstrained DFO problems
Abootalebi et al. Multiple-attribute group decision making using a modified TOPSIS method in the presence of interval data
CN115829036B (en) Sample selection method and device for text knowledge reasoning model continuous learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant