CN104679860B - A classification method for imbalanced data - Google Patents

A classification method for imbalanced data

Info

Publication number
CN104679860B
Authority
CN
China
Prior art keywords
sample
sample set
training sample
decision function
membership
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510089729.5A
Other languages
Chinese (zh)
Other versions
CN104679860A (en)
Inventor
王理
邓卫国
钱中
王祎旸
许波
雷超
游越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201510089729.5A priority Critical patent/CN104679860B/en
Publication of CN104679860A publication Critical patent/CN104679860A/en
Application granted granted Critical
Publication of CN104679860B publication Critical patent/CN104679860B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classification method for imbalanced data, comprising: learning a training sample set of the imbalanced data to obtain a first classification decision function and a second classification decision function; obtaining a first membership degree and a second membership degree from the first classification decision function and the second classification decision function, respectively; obtaining a classification decision function from the first membership degree and the second membership degree; determining the samples of a second overlap-region sample set within the test sample set of the imbalanced data; and classifying the samples of the second overlap-region sample set according to the classification decision function.

Description

A classification method for imbalanced data
Technical field
The invention belongs to the field of data classification, and in particular relates to a classification method for imbalanced data.
Background art
Today's society is in an era of information explosion. Faced with vast quantities of data, extracting useful information and knowledge from massive data has become a major challenge. For this reason, data-driven statistical machine learning has emerged as the principal means of knowledge acquisition: an appropriate learning algorithm is designed for specific historical data in order to obtain a mathematical or statistical model that captures the regularities of the data itself and can then be used to predict future data. Because of this importance for knowledge acquisition, statistical machine learning has become a key topic in research on intelligent analysis and intelligent decision-making, and is widely applied in industry and business.
The most common machine learning problem is supervised classification, with applications such as biometric recognition, text classification, web mining, speech recognition and network intrusion detection. Over the past few decades, researchers in machine learning have studied classification methods thoroughly and proposed many effective algorithms that remain in wide use, including K-nearest neighbours, decision trees, neural networks, ensemble learning and the support vector machine (SVM). Among these, the SVM has attracted the most attention. It is a learning machine built on statistical learning theory and the structural risk minimization principle. Compared with traditional learning algorithms such as neural networks, the SVM has a solid theoretical foundation; its training reduces to a quadratic convex optimization problem, so a globally optimal solution can be obtained, avoiding the tendency of neural networks to become trapped in local optima, and good generalization can still be achieved when the number of samples is small. Because of these advantages, the SVM is currently one of the most widely studied and applied learning algorithms in both academia and industry.
However, as applications expand and practice deepens, new challenges and problems keep emerging, and the classification of imbalanced data is one of the obstacles that the machine learning field urgently needs to overcome. Specifically, the imbalanced data classification problem refers to situations in which the number of samples of one class is far smaller than that of the other classes, for example in abnormal data analysis, intrusion detection, fraud detection, video surveillance, fault diagnosis and medical diagnosis. When traditional machine learning classification methods handle imbalanced data, the classifier's decisions tend to favour the majority class, so its ability to recognize minority-class samples degrades severely; yet in most applications it is precisely the classification accuracy on the minority class that matters most. How to prevent the classifier from reserving a larger decision region for the majority class has therefore become one of the key problems in research on imbalanced data classification.
Researchers in machine learning have done a great deal of work on the imbalanced data classification problem and have proposed many different solutions, which can generally be grouped into two types: one starts at the data level and weakens the degree of imbalance by changing the sample distribution of the training set; the other works at the algorithm level, making appropriate improvements to the algorithm itself, in view of its limitations on imbalanced data, so that it adapts to the imbalanced classification problem.
Even for a classifier with learning ability as strong as the SVM, imbalanced data causes a sharp decline in learning performance. Given the effectiveness and popularity of the SVM, many researchers have studied it specifically for imbalanced learning and proposed a number of improved algorithms, with some success; on the whole, however, the classification accuracy of existing methods on imbalanced data is still not high.
Summary of the invention
To overcome the existing defects, the invention provides a classification method for imbalanced data.
According to an aspect of the invention, a classification method for imbalanced data is proposed, the method comprising the following steps:
learning a training sample set of the imbalanced data to obtain a first classification decision function and a second classification decision function;
obtaining a first membership degree and a second membership degree from the first classification decision function and the second classification decision function, respectively;
obtaining a classification decision function from the first membership degree and the second membership degree;
determining the samples of a second overlap-region sample set within the test sample set of the imbalanced data;
classifying the samples of the second overlap-region sample set according to the classification decision function.
In the above scheme, obtaining the first membership degree and the second membership degree from the first classification decision function and the second classification decision function comprises:
judging the samples of the first-class training sample set and of the second-class training sample set with the first classification decision function and the second classification decision function, respectively; forming a first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set; and calculating, for each sample of the first overlap-region sample set, the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set.
In the above scheme, judging the samples of the first-class training sample set and of the second-class training sample set with the first classification decision function and the second classification decision function, and forming the first overlap-region sample set from the samples that belong to both sets, comprises:
using the logical relation between the first classification decision function and the second classification decision function to label each sample of the first-class training sample set and of the second-class training sample set as a noise point, a sample belonging to the first-class training sample set, a sample belonging to the second-class training sample set, or a sample belonging to both the first-class and the second-class training sample sets, and forming the first overlap-region sample set from the samples that belong to both the first-class and the second-class training sample sets.
In the above scheme, the first membership degree is calculated as
μ_i^A = d_i^− / (d_i^+ + d_i^−)
wherein:
μ_i^A is the first membership degree, representing the probability that sample x_i of the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set; d_i^+ is the ratio of the distance from sample x_i of the first overlap-region sample set to the centre of the minimal hypersphere corresponding to the first-class training sample set, to the radius of that hypersphere; d_i^− is the ratio of the distance from sample x_i of the first overlap-region sample set to the centre of the minimal hypersphere corresponding to the second-class training sample set, to the radius of that hypersphere.
In the above scheme, the second membership degree is calculated as
μ_i^B = d_i^+ / (d_i^+ + d_i^−)
wherein:
μ_i^B is the second membership degree, representing the probability that sample x_i of the first overlap-region sample set belongs to the second-class training sample set; B denotes the second-class training sample set.
In the above scheme, obtaining the classification decision function from the first membership degree and the second membership degree comprises:
constructing a sample set for a dual-membership support vector machine;
determining a dual-membership fuzzy support vector machine from the sample set for the dual-membership support vector machine;
obtaining the classification decision function from the dual-membership fuzzy support vector machine.
In the above scheme, the dual-membership fuzzy support vector machine is computed from
min_{w,b} (1/2)||w||² + C Σ_{i=1}^{l} (μ_i^A ξ_i + μ_i^B η_i),
with μ_i^A + μ_i^B = 1, μ_i^A ≥ 0, μ_i^B ≥ 0, ξ_i ≥ 0, η_i ≥ 0, i = 1, 2, ..., l,
wherein:
w is the weight vector of the separating hyperplane; C is the noise penalty parameter; μ_i^A is the first membership degree;
ξ_i is the first non-negative slack variable; μ_i^B is the second membership degree; η_i is the second non-negative slack variable;
b is the threshold of the separating hyperplane; Φ(·) is the nonlinear mapping function.
In the above scheme, the classification decision function is calculated as
f(x) = sgn( Σ_{i=1}^{l} (α_i − β_i) K(x, x_i) + b )
wherein:
f(x) is the classification decision function; sgn(·) is the sign function; α_i is the first Lagrange multiplier of the sample; β_i is the second Lagrange multiplier of the sample; K(x, x_i) is a kernel function satisfying the Mercer condition.
The invention obtains, from the training sample set of the imbalanced data, a classification decision function that captures the classification characteristics of the imbalanced data, and classifies the imbalanced data with this classification decision function, so that the imbalanced data can be classified accurately according to the characteristics of the data itself.
Brief description of the drawings
Fig. 1 is a flowchart of the classification method for imbalanced data of Embodiment 1;
Fig. 2 shows the classification results of the three classification models of Embodiment 2 on the Pima-indians data set;
Fig. 3 shows the classification results of the three classification models of Embodiment 2 on the Breast-w data set;
Fig. 4 shows the classification results of the three classification models of Embodiment 2 on the Ionosphere data set.
In order to present the structure of the embodiments of the invention clearly, certain dimensions, structures and devices are marked in the figures, but this is only for the purpose of illustration and is not intended to limit the invention to those specific dimensions, structures, devices and environments; a person of ordinary skill in the art may adjust or modify these devices and environments according to specific needs, and such adjustments or modifications remain within the scope of the appended claims.
Detailed description of the embodiments
A classification method for imbalanced data provided by the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
In the following description, several different aspects of the invention are described; however, a person of ordinary skill in the art may practise the invention using only some or all of the structures or processes described. Specific numbers, configurations and orders are set out for clarity of explanation, but it will be apparent that the invention may also be practised without these specific details. In other cases, well-known features are not described in detail so as not to obscure the invention.
Embodiment 1
To remedy the low classification accuracy of existing methods on imbalanced data, this embodiment provides a classification method for imbalanced data. As shown in Fig. 1, the method of this embodiment comprises the following steps:
Step S101: learning the training sample set of the imbalanced data to obtain a first classification decision function and a second classification decision function.
To classify imbalanced data accurately, a portion of the data is first extracted from the imbalanced data to form a training sample set, which should reflect, on the whole, the class proportions of the imbalanced data. The samples of the training sample set are divided, according to the proportion they occupy, into a first-class training sample set and a second-class training sample set, where the first-class training sample set contains the samples that make up the larger proportion of the training sample set, and the second-class training sample set contains the remaining samples. Since the first-class and second-class training sample sets are available, the first and second classification decision functions can characterize the two sets well, laying the foundation for the subsequent classification of the imbalanced data.
Step S102: obtaining a first membership degree and a second membership degree from the first classification decision function and the second classification decision function, respectively.
The first classification decision function divides the samples of the first-class training sample set into three types: samples inside the minimal hypersphere corresponding to the first-class training sample set, samples on the boundary of that hypersphere, and samples outside it. Likewise, the second classification decision function divides the samples of the second-class training sample set into the same three types. Since the first-class and second-class training sample sets together constitute the whole training set, the samples belonging to both sets can be determined from the two decision functions, and the set of these samples is taken as the first overlap-region sample set. The probabilities that each sample of the first overlap-region sample set belongs to the first-class and to the second-class training sample set are then calculated, giving the first and second membership degrees. In other words, each sample of the first overlap-region sample set carries the attributes of both training sample sets at the same time; these are exactly the samples that are prone to misclassification.
Step S103: obtaining a classification decision function from the first membership degree and the second membership degree.
Once the first and second membership degrees of the samples of the first overlap-region sample set are available, a sample set for a dual-membership support vector machine and a dual-membership fuzzy support vector machine can be constructed from them; solving the dual-membership fuzzy support vector machine yields the classification decision function used to classify the imbalanced data. This decision function classifies a sample according to its membership degrees with respect to the first-class and second-class training sample sets.
Step S104: determining the samples of the second overlap-region sample set within the test sample set of the imbalanced data.
The test sample set of the imbalanced data is processed in the same way as the training sample set was processed to obtain the first overlap-region sample set, which yields the samples of the second overlap-region sample set.
Step S105: classifying the samples of the second overlap-region sample set according to the classification decision function.
The classification decision function is already capable of classifying the imbalanced data accurately; applying it directly to the samples of the second overlap-region sample set therefore classifies the imbalanced data accurately.
This embodiment obtains, from the training sample set of the imbalanced data, a classification decision function that captures the classification characteristics of the imbalanced data, and classifies the imbalanced data with this function, so that the imbalanced data can be classified accurately according to the characteristics of the data itself.
Specifically, step S102 comprises:
judging the samples of the first-class training sample set and of the second-class training sample set with the first classification decision function and the second classification decision function, respectively; forming the first overlap-region sample set from the samples that belong to both sets; and calculating, for each sample of the first overlap-region sample set, the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set.
Here, judging the samples of the two training sample sets with the two classification decision functions and forming the first overlap-region sample set from the samples that belong to both sets comprises: using the logical relation between the first classification decision function and the second classification decision function to label each sample as one of four types, namely a noise point, a sample belonging to the first-class training sample set, a sample belonging to the second-class training sample set, or a sample belonging to both training sample sets, and forming the first overlap-region sample set from the samples of the last type, specifically:
if f+(x_i) < 0 and f−(x_i) < 0, sample x_i is a noise point, where f+(x_i) is the first classification decision function, f−(x_i) is the second classification decision function, and x_i is a sample of the first-class or second-class training sample set, i = 0, 1, ...;
if f+(x_i) ≥ 0 and f−(x_i) < 0, sample x_i belongs to the first-class training sample set;
if f+(x_i) < 0 and f−(x_i) ≥ 0, sample x_i belongs to the second-class training sample set;
if f+(x_i) > 0 and f−(x_i) > 0, sample x_i belongs to the first overlap-region sample set, which is exactly the set of samples the invention sets out to classify (an illustrative sketch applying these rules follows).
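For illustration only, the following Python sketch applies the four rules above to decision values produced by two single-class learners; the function name assign_regions and the inputs f_plus_vals and f_minus_vals are hypothetical and stand for the values of the first and second classification decision functions on each sample.

```python
import numpy as np

def assign_regions(f_plus_vals, f_minus_vals):
    """Label each sample as noise, class-1, class-2, or overlap.

    f_plus_vals, f_minus_vals: 1-D arrays with the values of the first and
    second classification decision functions f+(x_i) and f-(x_i).
    Returns an array of strings in {'noise', 'class1', 'class2', 'overlap'}.
    """
    f_plus_vals = np.asarray(f_plus_vals, dtype=float)
    f_minus_vals = np.asarray(f_minus_vals, dtype=float)
    labels = np.empty(f_plus_vals.shape, dtype=object)
    labels[(f_plus_vals < 0) & (f_minus_vals < 0)] = 'noise'
    labels[(f_plus_vals >= 0) & (f_minus_vals < 0)] = 'class1'
    labels[(f_plus_vals < 0) & (f_minus_vals >= 0)] = 'class2'
    labels[(f_plus_vals > 0) & (f_minus_vals > 0)] = 'overlap'
    return labels

# Example: four samples, one of each type.
print(assign_regions([-1.0, 0.5, -0.5, 0.3], [-2.0, -0.5, 0.5, 0.4]))
# -> ['noise' 'class1' 'class2' 'overlap']
```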
After the first overlap-region sample set is obtained, the membership degrees of its samples must be determined. There are many ways to compute membership degrees; here a distance-based dual membership is used. Specifically, the first membership degree is calculated as
μ_i^A = d_i^− / (d_i^+ + d_i^−)
wherein:
μ_i^A is the first membership degree, representing the probability that sample x_i of the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set;
d_i^+ = ||Φ+(x_i) − a+|| / R+ is the ratio of the distance from sample x_i of the first overlap-region sample set to the centre of the minimal hypersphere corresponding to the first-class training sample set, to the radius of that hypersphere; Φ+(x_i) is the image of sample x_i under the nonlinear mapping corresponding to the first-class training sample set; a+ is the centre of the minimal hypersphere corresponding to the first-class training sample set; R+ is the radius of that hypersphere;
d_i^− = ||Φ−(x_i) − a−|| / R− is the ratio of the distance from sample x_i of the first overlap-region sample set to the centre of the minimal hypersphere corresponding to the second-class training sample set, to the radius of that hypersphere; Φ−(x_i) is the image of sample x_i under the nonlinear mapping corresponding to the second-class training sample set; a− is the centre of the minimal hypersphere corresponding to the second-class training sample set; R− is the radius of that hypersphere.
The second membership degree is calculated as
μ_i^B = d_i^+ / (d_i^+ + d_i^−)
wherein:
μ_i^B is the second membership degree, representing the probability that sample x_i of the first overlap-region sample set belongs to the second-class training sample set; B denotes the second-class training sample set.
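A minimal sketch of this distance-based dual membership, assuming the two hypersphere distance ratios are already available; the function name compute_dual_membership and its arguments are hypothetical.

```python
import numpy as np

def compute_dual_membership(d_plus, d_minus):
    """Distance-based dual membership for overlap-region samples.

    d_plus  : ratio of the distance from x_i to the centre of the class-1
              minimal hypersphere to that hypersphere's radius, ||Phi+(x_i)-a+||/R+.
    d_minus : the analogous ratio for the class-2 hypersphere, ||Phi-(x_i)-a-||/R-.
    Returns (mu_A, mu_B) with mu_A + mu_B = 1.
    """
    d_plus = np.asarray(d_plus, dtype=float)
    d_minus = np.asarray(d_minus, dtype=float)
    denom = d_plus + d_minus
    mu_A = d_minus / denom   # relatively closer to the class-1 sphere -> larger membership in A
    mu_B = d_plus / denom
    return mu_A, mu_B

# Example: a sample relatively twice as far from the class-1 sphere as from the class-2 sphere.
mu_A, mu_B = compute_dual_membership([2.0], [1.0])
print(mu_A, mu_B)  # [0.33333333] [0.66666667]
```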
Obtaining the classification decision function from the first membership degree and the second membership degree in step S103 comprises:
S1031: constructing the sample set for the dual-membership support vector machine.
The sample set for the dual-membership support vector machine must take into account, for each sample, both the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set, and the first and second membership degrees sum to 1.
S1032: determining the dual-membership fuzzy support vector machine from the sample set for the dual-membership support vector machine. The dual-membership fuzzy support vector machine is computed from
min_{w,b} (1/2)||w||² + C Σ_{i=1}^{l} (μ_i^A ξ_i + μ_i^B η_i),
with μ_i^A + μ_i^B = 1, μ_i^A ≥ 0, μ_i^B ≥ 0, ξ_i ≥ 0, η_i ≥ 0, i = 1, 2, ..., l,
wherein:
w is the weight vector of the separating hyperplane;
C is the noise penalty parameter;
μ_i^A is the first membership degree;
ξ_i is the first non-negative slack variable;
μ_i^B is the second membership degree;
η_i is the second non-negative slack variable; ξ_i and η_i reflect the error bandwidth of each sample point;
b is the threshold (the intercept) of the separating hyperplane;
Φ(·) is the nonlinear mapping function.
S1033: obtaining the classification decision function from the dual-membership fuzzy support vector machine. The classification decision function is calculated as
f(x) = sgn( Σ_{i=1}^{l} (α_i − β_i) K(x, x_i) + b )
wherein:
f(x) is the classification decision function;
sgn(·) is the sign function;
α_i is the first Lagrange multiplier of the sample;
β_i is the second Lagrange multiplier of the sample;
K(x, x_i) is a kernel function satisfying the Mercer condition.
After the classification decision function is obtained, the samples of the second overlap-region sample set of the test sample set are determined in the same way as the first overlap-region sample set was obtained from the training sample set, and the classification decision function is applied to the samples of the second overlap-region sample set, thereby classifying the data of the test sample set of the imbalanced data.
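For illustration, the sketch below trains a linear-kernel version of such a dual-membership fuzzy SVM by solving the primal problem numerically with SciPy. It assumes one particular reading of the model, namely that every overlap-region sample contributes both a class-A constraint w·x_i + b ≥ 1 − ξ_i and a class-B constraint −(w·x_i + b) ≥ 1 − η_i, with the slacks weighted by μ_i^A and μ_i^B as in the objective above; the margin constraints and all function names are assumptions, since only the objective and the membership/slack constraints are stated explicitly above.

```python
import numpy as np
from scipy.optimize import minimize

def train_dual_membership_fsvm(X, mu_A, mu_B, C=1.0):
    """Linear-kernel dual-membership fuzzy SVM (one assumed formulation).

    Minimizes 0.5*||w||^2 + C*sum(mu_A*xi + mu_B*eta)
    s.t.  w.x_i + b >= 1 - xi_i      (class-A side, slack weighted by mu_A)
         -(w.x_i + b) >= 1 - eta_i   (class-B side, slack weighted by mu_B)
          xi_i >= 0, eta_i >= 0.
    Returns (w, b).
    """
    X = np.asarray(X, dtype=float)
    mu_A, mu_B = np.asarray(mu_A, dtype=float), np.asarray(mu_B, dtype=float)
    l, d = X.shape

    def unpack(z):
        return z[:d], z[d], z[d + 1:d + 1 + l], z[d + 1 + l:]

    def objective(z):
        w, b, xi, eta = unpack(z)
        return 0.5 * w.dot(w) + C * np.sum(mu_A * xi + mu_B * eta)

    cons = [
        {'type': 'ineq', 'fun': lambda z: (lambda w, b, xi, eta: X.dot(w) + b - 1 + xi)(*unpack(z))},
        {'type': 'ineq', 'fun': lambda z: (lambda w, b, xi, eta: -(X.dot(w) + b) - 1 + eta)(*unpack(z))},
    ]
    bounds = [(None, None)] * (d + 1) + [(0, None)] * (2 * l)
    res = minimize(objective, np.zeros(d + 1 + 2 * l),
                   bounds=bounds, constraints=cons, method='SLSQP')
    w, b, _, _ = unpack(res.x)
    return w, b

def decide(w, b, x):
    """Classification decision f(x) = sgn(w.x + b)."""
    return int(np.sign(np.dot(w, x) + b))

# Toy usage: three overlap-region samples with their dual memberships.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
mu_A = np.array([0.8, 0.2, 0.5])
mu_B = 1.0 - mu_A
w, b = train_dual_membership_fsvm(X, mu_A, mu_B, C=10.0)
print(decide(w, b, np.array([0.0, 1.0])))  # the point with high mu_A typically lands on the +1 side
```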
Embodiment 2
This embodiment describes the invention in detail through a practical scenario.
The basic steps of this embodiment are as follows:
(1) Support Vector Data Description (SVDD) is used to perform single-class learning separately on the samples of the two classes of the training set (the first-class training sample set, which accounts for the larger proportion, and the second-class training sample set, which accounts for the remainder), giving the first classification decision function f+(x) and the second classification decision function f−(x); from these, the noise points, the positive-class samples (samples of the first-class training sample set), the negative-class samples (samples of the second-class training sample set) and the samples of the first overlap-region sample set are identified;
(2) based on f+(x), f−(x) and the minimal hyperspheres of the two classes, the dual membership degrees of the samples of the first overlap-region sample set are calculated;
(3) the dual-membership fuzzy support vector machine model is trained on the samples of the first overlap-region sample set, giving the classification decision function f(x) for overlap-region samples;
(4) for the test set samples, f+(x) and f−(x) are first used to identify each sample as a noise point, a positive-class sample, a negative-class sample or an overlap-region sample;
(5) for the overlap-region samples of the test set, the dual membership degrees are calculated and the decision function f(x) of the dual-membership fuzzy support vector machine model is used for discrimination (an illustrative end-to-end sketch of this pipeline follows the list).
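The following Python sketch strings these steps together using scikit-learn, with OneClassSVM standing in for SVDD (closely related for an RBF kernel, but not identical) and a plain SVC standing in for the dual-membership fuzzy SVM on the overlap region; it therefore corresponds to the SVDD+SVM comparison model discussed later rather than to the full SVDD+(D-FSVM) method, and all function and variable names are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM, SVC

def fit_single_class_models(X_train, y_train, nu=0.1):
    """Step (1): one single-class learner per class (OneClassSVM standing in for SVDD)."""
    f_plus = OneClassSVM(kernel='rbf', nu=nu, gamma='scale').fit(X_train[y_train == +1])
    f_minus = OneClassSVM(kernel='rbf', nu=nu, gamma='scale').fit(X_train[y_train == -1])
    return f_plus, f_minus

def split_regions(f_plus, f_minus, X):
    """Steps (1)/(4): noise, class-1 only, class-2 only, or overlap region."""
    dp, dm = f_plus.decision_function(X), f_minus.decision_function(X)
    noise = (dp < 0) & (dm < 0)
    only_plus = (dp >= 0) & (dm < 0)
    only_minus = (dp < 0) & (dm >= 0)
    overlap = (dp > 0) & (dm > 0)
    return noise, only_plus, only_minus, overlap

def train_and_predict(X_train, y_train, X_test):
    f_plus, f_minus = fit_single_class_models(X_train, y_train)

    # Step (3) analogue: train a classifier on the training-set overlap region only
    # (a plain SVC here, i.e. the SVDD+SVM comparison variant, not the full D-FSVM).
    _, _, _, overlap_tr = split_regions(f_plus, f_minus, X_train)
    if overlap_tr.any() and len(np.unique(y_train[overlap_tr])) == 2:
        overlap_clf = SVC(kernel='rbf', gamma='scale').fit(X_train[overlap_tr], y_train[overlap_tr])
    else:                                   # degenerate overlap region: fall back to all data
        overlap_clf = SVC(kernel='rbf', gamma='scale').fit(X_train, y_train)

    # Steps (4)-(5): clear-cut test samples take the label of the sphere that accepts them,
    # overlap samples go to the overlap classifier; noise points default to the majority class
    # (how test-time noise points are labelled is not prescribed above).
    _, _, only_minus, overlap_te = split_regions(f_plus, f_minus, X_test)
    y_pred = np.full(len(X_test), +1)
    y_pred[only_minus] = -1
    if overlap_te.any():
        y_pred[overlap_te] = overlap_clf.predict(X_test[overlap_te])
    return y_pred

# Toy usage: two overlapping Gaussian clusters, +1 = majority class, -1 = minority class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (80, 2)), rng.normal(1.5, 1.0, (40, 2))])
y = np.array([+1] * 80 + [-1] * 40)
print(train_and_predict(X, y, X)[:10])
```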
The decision function construction in step (1) proceeds as follows.
SVDD performs single-class learning: it seeks a hypersphere in a high-dimensional space that covers as many as possible of the images of the data in that feature space, thereby describing the boundary of the data. Given a set X = {x_i | i = 1, 2, ..., n} of n data objects, SVDD maps the input space into a high-dimensional space through a nonlinear mapping Φ(·) and looks for a hypersphere with radius R and centre a that covers as many of the x_i as possible. SVDD sets up the following optimization problem:
min R²
s.t. ||Φ(x_i) − a||² ≤ R²,
i = 1, 2, ..., n.
A slack variable vector ξ = (ξ_1, ξ_2, ..., ξ_n) is introduced so that the hypersphere may exclude a part of the samples as noise, and the optimization problem becomes:
min q(R, ξ) = R² + C Σ_{i=1}^{n} ξ_i
s.t. ||Φ(x_i) − a||² ≤ R² + ξ_i,
ξ_i ≥ 0, i = 1, 2, ..., n.
Here q(R, ξ) is the objective function of the optimization problem and C is the noise penalty parameter. Introducing non-negative Lagrange multipliers α_i and γ_i gives the Lagrangian
L = R² + C Σ_i ξ_i − Σ_i α_i (R² + ξ_i − ||Φ(x_i) − a||²) − Σ_i γ_i ξ_i.
Setting C = 1/(nν), the problem can be rewritten in terms of ν, where ν (0 ≤ ν ≤ 1) controls the fraction of target-class samples that may be rejected: nν is a lower bound on the number of support vectors and an upper bound on the number of exterior points (when ν = 1 this upper bound equals the number of data points). Taking the partial derivatives of L with respect to R, a and ξ and setting them to zero yields
Σ_i α_i = 1,  a = Σ_i α_i Φ(x_i),  γ_i = C − α_i (hence 0 ≤ α_i ≤ C).
Replacing the inner product Φ(x_i)·Φ(x_j) with a Mercer kernel K(x_i, x_j), the Wolfe dual of the original problem is obtained:
max_α Σ_i α_i K(x_i, x_i) − Σ_i Σ_j α_i α_j K(x_i, x_j)
s.t. Σ_i α_i = 1, 0 ≤ α_i ≤ C, i = 1, 2, ..., n.
According to the Karush-Kuhn-Tucker (KKT) optimality conditions, the sample data fall into three types:
the first type are interior points, located inside the hypersphere, with ||Φ(x_i) − a||² < R² and α_i = 0;
the second type are support vectors, located on the boundary of the hypersphere, with ||Φ(x_i) − a||² = R² and 0 < α_i < C;
the third type are exterior points, located outside the hypersphere, with ||Φ(x_i) − a||² > R² and α_i = C.
To determine the type of a sample, the decision function is:
f(x) = sgn(R² − ||Φ(x) − a||²).
Thus the decision function value of a support vector is 0, that of an interior point is greater than 0, and that of an exterior point is less than 0.
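The sketch below solves this SVDD Wolfe dual numerically for an RBF kernel using scipy.optimize (a convenience choice; a dedicated QP solver would normally be used) and evaluates the decision function f(x) = sgn(R² − ||Φ(x) − a||²) through kernel expansions; the class name SimpleSVDD and its parameters are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

class SimpleSVDD:
    """Minimal SVDD trained by maximizing the Wolfe dual
    sum_i a_i K(x_i,x_i) - sum_ij a_i a_j K(x_i,x_j), with sum_i a_i = 1, 0 <= a_i <= C."""

    def __init__(self, C=0.2, gamma=1.0):
        self.C, self.gamma = C, gamma

    def fit(self, X):
        self.X = np.asarray(X, dtype=float)
        n = len(self.X)
        K = rbf_kernel(self.X, self.X, self.gamma)

        def neg_dual(a):            # minimize the negative of the dual objective
            return -(a @ np.diag(K) - a @ K @ a)

        cons = [{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}]
        res = minimize(neg_dual, np.full(n, 1.0 / n),
                       bounds=[(0.0, self.C)] * n, constraints=cons, method='SLSQP')
        self.alpha = res.x
        self.const = self.alpha @ K @ self.alpha          # centre term sum_ij a_i a_j K(x_i,x_j)
        # R^2 is the squared distance of a boundary support vector (0 < alpha_i < C).
        on_boundary = (self.alpha > 1e-6) & (self.alpha < self.C - 1e-6)
        idx = np.where(on_boundary)[0][0] if on_boundary.any() else int(np.argmax(self.alpha))
        self.R2 = self._sq_dist(self.X[idx:idx + 1])[0]
        return self

    def _sq_dist(self, Xnew):
        """||Phi(x)-a||^2 = K(x,x) - 2 sum_i a_i K(x_i,x) + sum_ij a_i a_j K(x_i,x_j)."""
        Kxz = rbf_kernel(np.asarray(Xnew, dtype=float), self.X, self.gamma)
        return np.ones(len(Kxz)) - 2.0 * Kxz @ self.alpha + self.const   # RBF: K(x,x) = 1

    def decision_function(self, Xnew):
        return self.R2 - self._sq_dist(Xnew)              # > 0 inside, 0 on the boundary, < 0 outside

# Toy usage: describe one cluster and score a near point and a far point.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (60, 2))
svdd = SimpleSVDD(C=0.2, gamma=0.5).fit(X)
print(svdd.decision_function(np.array([[0.0, 0.0], [5.0, 5.0]])))  # first likely > 0, second < 0
```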
The dual-membership fuzzy SVM algorithm (Double-Fuzzy Support Vector Machine, D-FSVM) used in step (3) proceeds as follows.
The sample set of the dual-membership support vector machine takes the form
T = {(x_1, μ_1^A, μ_1^B), (x_2, μ_2^A, μ_2^B), ..., (x_l, μ_l^A, μ_l^B)},
i.e. each sample belongs to both classes with certain probabilities: sample x_i belongs to class A (y_i = 1) with probability μ_i^A and to class B (y_i = −1) with probability μ_i^B. Here y_i is the class label of the i-th sample; in a standard two-class support vector machine model the samples are divided into class A and class B, so that y_i ∈ {−1, +1}, i = 1, ..., l, i.e. sample x_i carries only a single label y_i: y_i = +1 means x_i belongs to class A, and y_i = −1 means x_i belongs to class B.
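As a small illustration (all names hypothetical), such a dual-membership training set can be represented as tuples (x_i, μ_i^A, μ_i^B) with the two memberships summing to one:

```python
import numpy as np

# Hypothetical overlap-region samples and their memberships in class A.
X_overlap = np.array([[0.2, 0.9], [1.1, 0.3], [0.6, 0.5]])
mu_A = np.array([0.7, 0.25, 0.5])
mu_B = 1.0 - mu_A                      # memberships sum to 1 by construction

# Dual-membership sample set T = {(x_i, mu_i^A, mu_i^B)}.
T = list(zip(X_overlap, mu_A, mu_B))
for x, a, b in T:
    assert abs(a + b - 1.0) < 1e-12    # each sample belongs to both classes
print(T[0])
```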
The basic model of the dual-membership fuzzy support vector machine is
min_{w,b} (1/2)||w||² + C Σ_{i=1}^{l} (μ_i^A ξ_i + μ_i^B η_i),
with μ_i^A + μ_i^B = 1, μ_i^A ≥ 0, μ_i^B ≥ 0, ξ_i ≥ 0, η_i ≥ 0, i = 1, 2, ..., l.
The Lagrangian of this problem is formed with the non-negative Lagrange multipliers α_k, λ_k, v_k and γ_k (the first, second, third and fourth multipliers, respectively).
Solving the original problem is equivalent to solving its dual problem. The objective function of the dual problem involves the inner product Φ(x_i)·Φ(x_j) in the high-dimensional space reached after the transformation; if the dimension of that space is very high, a "curse of dimensionality" arises. To avoid this, by functional theory the inner product in the high-dimensional feature space can be replaced with a kernel function K(x_i, x_j) satisfying the Mercer condition.
The classification operator finally obtained is
f(x) = sgn( Σ_{i=1}^{l} (α_i − λ_i) K(x, x_i) + b ).
It can be seen from the model above that the essential step distinguishing the dual-membership fuzzy support vector machine from the traditional support vector machine is determining, for each sample point, its membership probabilities with respect to class A and class B; a crucial step is therefore how to build a membership model that describes the degree to which a training sample point belongs to each of the two classes.
The distance-based dual membership computation is used:
μ_i^A = d_i^− / (d_i^+ + d_i^−),  μ_i^B = d_i^+ / (d_i^+ + d_i^−),
where d_i^+ and d_i^− are, for a sample located in the overlap region, the ratios of its distances to the centres of the two minimal hyperspheres to the corresponding radii. Specifically, d_i^+ = ||Φ+(x_i) − a+|| / R+ is the ratio of the distance from sample x_i of the first overlap-region sample set to the centre of the minimal hypersphere corresponding to the first-class training sample set, to the radius of that hypersphere; Φ+(x_i) is the image of x_i under the nonlinear mapping corresponding to the first-class training sample set; a+ is the centre and R+ the radius of that hypersphere. Likewise, d_i^− = ||Φ−(x_i) − a−|| / R− is the corresponding ratio for the minimal hypersphere of the second-class training sample set, with Φ−(x_i), a− and R− defined analogously.
The present embodiment is illustrated below with a practical scenario.
Three databases from the University of California, Irvine (UCI) machine learning repository were chosen: the Pima Indians diabetes data set (Pima-indians), the Wisconsin breast cancer data set (Breast-w) and the Johns Hopkins University ionosphere data set (Ionosphere). The details of each database are given in the table below.
Table 1. Basic information of the UCI data sets

Data set      Dimension  Positive samples  Negative samples  Total samples  Imbalance ratio
Pima-indians  8          268               500               768            1:2
Breast-w      9          241               458               699            1:2
Ionosphere    34         126               225               351            1:2
Each UCI data set was randomly partitioned, with 70% used as the training set and the remaining 30% as the test set, and the imbalance ratio was kept unchanged during the partition.
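Such a split that preserves the class ratio can be obtained, for example, with scikit-learn's stratified train/test split (illustrative snippet; random data stands in for the actual data loading):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# X: feature matrix, y: labels (+1 majority / -1 minority); random data as a stand-in here.
rng = np.random.default_rng(42)
X = rng.normal(size=(768, 8))
y = np.array([+1] * 500 + [-1] * 268)

# 70% training / 30% test, stratified so the imbalance ratio is preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
print((y_tr == -1).mean(), (y_te == -1).mean())  # minority fractions are (almost) identical
```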
To analyse the performance of the SVDD-based dual-membership fuzzy support vector machine algorithm proposed by the invention, the comparison models include a plain SVM and an SVDD-based SVM. The SVDD-based SVM is similar to the SVDD-based dual-membership fuzzy support vector machine algorithm (D-FSVM), except that, in the second and fourth steps, the overlap-region samples are discriminated with an ordinary SVM model and no dual membership degrees are assigned to them.
The classification evaluation indices used are sensitivity (SE), specificity (SP) and overall average classification accuracy (General Accuracy, GA). The experimental results are shown below.
Table 2. Experimental results
As can be seen from the results (see Figs. 2, 3 and 4), on all three data sets the SVDD+SVM algorithm and the SVDD+(D-FSVM) algorithm are clearly better than the plain SVM model. Therefore, first using the SVDD algorithm to identify the noise points, positive-class samples, negative-class samples and overlap-region samples, and then learning the overlap-region samples with either an SVM model or the dual-membership fuzzy support vector machine model, yields good classification results.
Moreover, on all three data sets the SE, SP and GA indices of the SVDD+(D-FSVM) algorithm proposed by the invention are the highest. For overlap-region samples, the dual membership therefore better describes the relative degree to which a sample point belongs to the positive and negative classes, and the dual-membership fuzzy support vector machine model classifies the overlap-region samples better.
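For reference, the three evaluation indices above (SE, SP, GA) can be computed from a confusion matrix as in the following sketch; treating the minority class as the positive class for SE is an assumption, since the convention is not spelled out above.

```python
import numpy as np

def evaluate(y_true, y_pred, positive_label=-1):
    """Sensitivity (SE), specificity (SP) and overall accuracy (GA).

    SE = recall on the positive (here: minority) class,
    SP = recall on the negative class, GA = overall accuracy.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = (y_true == positive_label), (y_true != positive_label)
    tp = np.sum(pos & (y_pred == positive_label))
    tn = np.sum(neg & (y_pred != positive_label))
    se = tp / pos.sum() if pos.sum() else 0.0
    sp = tn / neg.sum() if neg.sum() else 0.0
    ga = np.mean(y_true == y_pred)
    return se, sp, ga

# Example: 6 test samples, minority class labelled -1.
print(evaluate([-1, -1, +1, +1, +1, +1], [-1, +1, +1, +1, -1, +1]))
# -> (0.5, 0.75, 0.6666666666666666)
```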
Finally, it should be noted that the above embodiments are only intended to describe the technical solution of the invention, not to limit it; the invention may be extended to other modifications, variations, applications and embodiments, and all such modifications, variations, applications and embodiments are considered to be within the spirit and scope of the invention.

Claims (5)

1. A classification method for imbalanced data, characterized in that the method comprises the following steps:
dividing the training samples of the imbalanced data, according to the proportion they occupy in the training sample set, into a first-class training sample set and a second-class training sample set, the set formed by these samples constituting a first overlap-region sample set, and learning the first overlap-region sample set to obtain a first classification decision function and a second classification decision function;
obtaining a first membership degree and a second membership degree from the first classification decision function and the second classification decision function, respectively;
obtaining a classification decision function from the first membership degree and the second membership degree;
determining the samples of the first overlap-region sample set within the test sample set of the imbalanced data;
classifying the samples of the first overlap-region sample set according to the classification decision function;
wherein obtaining the first membership degree and the second membership degree from the first classification decision function and the second classification decision function comprises:
judging the samples of the first-class training sample set and of the second-class training sample set with the first classification decision function and the second classification decision function, respectively, forming the first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set, and calculating, for each sample of the first overlap-region sample set, the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set;
the first membership degree being calculated as
μ_i^A = d_i^− / (d_i^+ + d_i^−)
wherein:
μ_i^A is the first membership degree, representing the probability that sample x_i of the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set; d_i^+ is the ratio of the distance from sample x_i of the first overlap-region sample set to the centre of the minimal hypersphere corresponding to the first-class training sample set, to the radius of that hypersphere; d_i^− is the ratio of the distance from sample x_i of the first overlap-region sample set to the centre of the minimal hypersphere corresponding to the second-class training sample set, to the radius of that hypersphere;
and the second membership degree being calculated as
μ_i^B = d_i^+ / (d_i^+ + d_i^−)
wherein:
μ_i^B is the second membership degree, representing the probability that sample x_i of the first overlap-region sample set belongs to the second-class training sample set; B denotes the second-class training sample set.
2. The method according to claim 1, characterized in that judging the samples of the first-class training sample set and of the second-class training sample set with the first classification decision function and the second classification decision function, and forming the first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set, comprises:
using the logical relation between the first classification decision function and the second classification decision function to label each sample of the first-class training sample set and of the second-class training sample set as a noise point, a sample belonging to the first-class training sample set, a sample belonging to the second-class training sample set, or a sample belonging to both the first-class and the second-class training sample sets, and forming the first overlap-region sample set from the samples that belong to both the first-class and the second-class training sample sets.
3. The method according to claim 1, characterized in that obtaining the classification decision function from the first membership degree and the second membership degree comprises:
constructing a sample set for a dual-membership support vector machine;
determining a dual-membership fuzzy support vector machine from the sample set for the dual-membership support vector machine;
obtaining the classification decision function from the dual-membership fuzzy support vector machine.
4. The method according to claim 3, characterized in that the dual-membership fuzzy support vector machine is computed from
min_{w,b} (1/2)||w||² + C Σ_{i=1}^{l} (μ_i^A ξ_i + μ_i^B η_i)
μ_i^A + μ_i^B = 1
μ_i^A ≥ 0, μ_i^B ≥ 0, ξ_i ≥ 0, η_i ≥ 0, i = 1, 2, ..., l
wherein:
w is the weight vector of the separating hyperplane; C is the noise penalty parameter; μ_i^A is the first membership degree; ξ_i is the first non-negative slack variable; μ_i^B is the second membership degree; η_i is the second non-negative slack variable; b is the threshold of the separating hyperplane; Φ(·) is the nonlinear mapping function.
5. The method according to claim 3, characterized in that the classification decision function is calculated as
f(x) = sgn( Σ_{i=1}^{l} (α_i − β_i) K(x, x_i) + b )
wherein:
f(x) is the classification decision function; sgn(·) is the sign function; α_i is the first Lagrange multiplier of the sample; β_i is the second Lagrange multiplier of the sample; K(x, x_i) is a kernel function satisfying the Mercer condition.
CN201510089729.5A 2015-02-27 2015-02-27 A classification method for imbalanced data Expired - Fee Related CN104679860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510089729.5A CN104679860B (en) 2015-02-27 2015-02-27 A classification method for imbalanced data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510089729.5A CN104679860B (en) 2015-02-27 2015-02-27 A classification method for imbalanced data

Publications (2)

Publication Number Publication Date
CN104679860A CN104679860A (en) 2015-06-03
CN104679860B true CN104679860B (en) 2017-11-07

Family

ID=53314902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510089729.5A Expired - Fee Related CN104679860B (en) 2015-02-27 2015-02-27 A classification method for imbalanced data

Country Status (1)

Country Link
CN (1) CN104679860B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105005589B (en) * 2015-06-26 2017-12-29 腾讯科技(深圳)有限公司 A kind of method and apparatus of text classification
CN105447520A (en) * 2015-11-23 2016-03-30 盐城工学院 Sample classification method based on weighted PTSVM (projection twin support vector machine)
CN107463938B (en) * 2017-06-26 2021-02-26 南京航空航天大学 Aero-engine gas circuit component fault detection method based on interval correction support vector machine
CN108960056B (en) * 2018-05-30 2022-06-03 西南交通大学 Fall detection method based on attitude analysis and support vector data description
CN110555054B (en) * 2018-06-15 2023-06-09 泉州信息工程学院 Data classification method and system based on fuzzy double-supersphere classification model
CN109165694B (en) * 2018-09-12 2022-07-08 太原理工大学 Method and system for classifying unbalanced data sets
CN109919931B (en) * 2019-03-08 2020-12-25 数坤(北京)网络科技有限公司 Coronary stenosis degree evaluation model training method and evaluation system
CN111126577A (en) * 2020-03-30 2020-05-08 北京精诊医疗科技有限公司 Loss function design method for unbalanced samples

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402690A (en) * 2011-09-28 2012-04-04 南京师范大学 Data classification method based on intuitive fuzzy integration and system
CN102945280A (en) * 2012-11-15 2013-02-27 翟云 Unbalanced data distribution-based multi-heterogeneous base classifier fusion classification method
CN104268577A (en) * 2014-06-27 2015-01-07 大连理工大学 Human body behavior identification method based on inertial sensor

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402690A (en) * 2011-09-28 2012-04-04 南京师范大学 Data classification method based on intuitive fuzzy integration and system
CN102945280A (en) * 2012-11-15 2013-02-27 翟云 Unbalanced data distribution-based multi-heterogeneous base classifier fusion classification method
CN104268577A (en) * 2014-06-27 2015-01-07 大连理工大学 Human body behavior identification method based on inertial sensor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Email filtering based on a dual-membership fuzzy support vector machine; Sun Mingsong et al.; Computer Engineering and Applications; 2010-01-20; vol. 46, no. 2; p. 94, sections 2 and 3.1; p. 95, sections 4 and 5.2 *
A fuzzy classification method for imbalanced data based on class weights; Xue Zhenxia et al.; Computer Science; 2008-11-30; vol. 35, no. 11; p. 171, section 3 *

Also Published As

Publication number Publication date
CN104679860A (en) 2015-06-03

Similar Documents

Publication Publication Date Title
CN104679860B (en) A classification method for imbalanced data
US20230141886A1 (en) Method for assessing hazard on flood sensitivity based on ensemble learning
Chen et al. Regional disaster risk assessment of China based on self-organizing map: clustering, visualization and ranking
Farsadnia et al. Identification of homogeneous regions for regionalization of watersheds by two-level self-organizing feature maps
CN106897738B (en) A kind of pedestrian detection method based on semi-supervised learning
Wang et al. Assessment of river water quality based on theory of variable fuzzy sets and fuzzy binary comparison method
Zhu et al. Flood disaster risk assessment based on random forest algorithm
CN107123123A (en) Image segmentation quality evaluating method based on convolutional neural networks
Wu et al. A hybrid support vector regression approach for rainfall forecasting using particle swarm optimization and projection pursuit technology
CN105487526A (en) FastRVM (fast relevance vector machine) wastewater treatment fault diagnosis method
CN108764621A (en) A kind of family endowment collaboration nurse dispatching method of data-driven
CN112785450A (en) Soil environment quality partitioning method and system
CN107729922A (en) Remote sensing images method for extracting roads based on deep learning super-resolution technique
Zhang et al. Surface and high-altitude combined rainfall forecasting using convolutional neural network
CN112418571A (en) Method and device for enterprise environmental protection comprehensive evaluation
CN107247954A (en) A kind of image outlier detection method based on deep neural network
Zhang et al. Information fusion for automated post-disaster building damage evaluation using deep neural network
CN114399212A (en) Ecological environment quality evaluation method and device, electronic equipment and storage medium
CN107909278A (en) A kind of method and system of program capability comprehensive assessment
CN106446965A (en) Spacecraft visible light image classification method
Li et al. Evaluation of livable city based on GIS and PSO-SVM: A case study of hunan province
Inyang et al. Visual association analytics approach to predictive modelling of students’ academic performance
CN109214598A (en) Batch ranking method based on K-MEANS and ARIMA model prediction residential quarters collateral risk
CN111401683B (en) Method and device for measuring tradition of ancient villages
CN104778479B (en) A kind of image classification method and system based on sparse coding extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171107

CF01 Termination of patent right due to non-payment of annual fee