CN112035719B - Category imbalance data classification method and system based on convex polyhedron classifier - Google Patents
- Publication number
- CN112035719B CN112035719B CN202010904076.2A CN202010904076A CN112035719B CN 112035719 B CN112035719 B CN 112035719B CN 202010904076 A CN202010904076 A CN 202010904076A CN 112035719 B CN112035719 B CN 112035719B
- Authority
- CN
- China
- Prior art keywords
- samples
- convex
- sample
- classification
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention belongs to the technical field of artificial intelligence and information science, and discloses a classification method and system for class-imbalanced data based on a convex polyhedron classifier. The method comprises the following steps: dividing the class-imbalanced data set S into a training set T and a test set P, and marking the minority class samples X and the majority class samples Y in the training set T; detecting the samples of Y that fall into the convex hull of X, removing them, and denoting the set of remaining samples of Y as Y'; training a convex polyhedron classification model between X and Y' using the convex polyhedron construction algorithm; and judging the class of each sample in the test set P with the obtained classification model. In solving the class-imbalance classification problem, the method fully considers the natural distribution of the data, needs no balancing preprocessing and no excessive parameter tuning, is simple to implement, is suitable for high-dimensional data, and has strong generalization ability. The convex polyhedron classifier is also applied to the field of imbalanced data classification for the first time, which is highly original.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and information science, and particularly relates to a class-imbalanced data classification method and system based on a convex polyhedron classifier.
Background
The existing method for solving the unbalanced classification problem mainly changes the distribution of training samples through data resampling technologies such as over-sampling, under-sampling and the like, so that the unbalanced degree of data is reduced. And then, the balanced data is fed to a specific classifier to make classification decisions.
Oversampling increases the samples of the minority class to bring the minority and majority classes into quantitative balance. It typically synthesizes new, non-repeated samples between two closely spaced minority class samples using k-nearest-neighbor search and linear interpolation. But this approach tends to cause data overlap at the classification boundary, making it difficult for the classifier to distinguish the class attributes of boundary samples. Even if the two classes of samples are forcibly separated, the resulting classification surfaces tend to be very complex, leading to over-fitting.
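For concreteness, the k-nearest-neighbor plus linear-interpolation idea described above can be sketched as follows (a minimal illustration, not the SMOTE reference implementation; the function name and parameters are our own):

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, rng=None):
    """Synthesize n_new minority samples by linear interpolation between a
    random minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # distances from sample i to every minority sample (itself included)
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority convex hull, yet near the class boundary they can still overlap majority samples, which is exactly the drawback noted above.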
Undersampling reduces the number of majority class samples to the level of the minority class to maintain balance. It typically deletes majority class samples according to certain cleaning rules, or reduces them using clustering. But cleaning may mistakenly delete important samples, losing classification information; and clustering uses cluster centers to decide which samples to retain, so important boundary-point information may be lost.
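The random variant of undersampling amounts to no more than the following sketch (illustrative code with our own naming), which makes the information-loss risk obvious: any row not drawn is simply discarded, boundary points included:

```python
import numpy as np

def random_undersample(majority, n_keep, rng=None):
    """Keep n_keep randomly chosen majority samples and drop the rest;
    the dropped rows may include exactly the boundary points a classifier needs."""
    rng = np.random.default_rng(rng)
    majority = np.asarray(majority)
    idx = rng.choice(len(majority), size=n_keep, replace=False)
    return majority[idx]
```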
Recent studies have also shown that the 50:50 balanced data formed by resampling techniques is not more discriminative than the original data. This shows that solving the imbalanced classification problem from the data-balancing standpoint is an artificial, empirical practice, and there is no evidence from practical use proving its effectiveness.
Through the above analysis, the problems and defects existing in the prior art are as follows:
(1) The existing method needs to perform balanced preprocessing on data, so that data distribution is changed, and the problem of global information loss occurs.
(2) The existing method takes data rebalancing as a premise, and can not fully consider the natural distribution characteristic of the data.
(3) The prior art has limitations in solving the practical imbalanced application problem of medical auxiliary diagnosis. In this problem, people suffering from severe disease are the minority and healthy people the majority. If a balancing technique is used, the following occurs: [a] oversampling synthesizes new patient data that, in spatial distribution, necessarily tends toward the majority class, i.e. the synthesized (diseased) samples and the majority class (healthy) samples overlap at the boundary, making them difficult to distinguish; [b] undersampling deletes a large portion of the samples in the majority class, which moves the classification boundary toward the majority class, i.e. in later decisions a significant portion of healthy people are misclassified as ill, increasing the burden of subsequent examinations.
The difficulty of solving the problems and the defects is as follows:
the existing unbalanced classifier depends on the rebalancing treatment of data when generating a decision surface, namely, the data needs to be balanced first and then can work. The process of rebalancing the data destroys the natural distribution of the data so that valid information is overwritten or deleted. Therefore, it becomes particularly difficult to effectively utilize natural distribution information of the original data. In addition, protecting few class samples with high misclassification cost from being destroyed is a difficult problem that the existing method cannot effectively solve.
The meaning of solving the problems and the defects is as follows:
(1) The preprocessing process of data rebalancing is not used, so that the response time of classification decision is shortened;
(2) Effective information can be learned from the natural distribution form of the original data;
(3) The information of the minority sample is protected from being destroyed, and the importance of the minority sample with high misclassification cost is reflected;
(4) The unbalanced classification system module is simplified, and the system structure is lightweight.
The invention provides a lightweight solution to the imbalance problem that is methodologically simple, easy to implement, and learns the minority class samples accurately. The invention fully exploits the potential of the minority class samples with high misclassification cost for improving the performance of the classification model.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a class-imbalanced data classification method and system based on a convex polyhedron classifier.
The invention is realized in such a way that the class-imbalanced data classification system based on a convex polyhedron classifier comprises:
the imbalanced data set preprocessing module, used for dividing a given finite imbalanced data set S into a training set T and a test set P, then marking the minority class samples X and the majority class samples Y in the training set T;
the sample-space convex polyhedron differentiation module, used for representing the convex hull of X by convex combinations of the samples in the minority class set X and providing the convex-polyhedron-separability judgment for the two sample sets X and Y'; it detects the samples of Y that do not lie in the convex hull of X, which form the clean sample set Y', thereby realizing the convex polyhedron differentiation of the sample space;
the classification model construction module, which runs the convex polyhedron construction algorithm on X and Y' to obtain a set LDFs of linear discriminant functions and constructs the classification model CPC(x) from the set LDFs;
and the classification decision module, used for making classification decisions on the samples of the test set P according to the model CPC(x) and outputting the classification results.
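As a rough illustration of the preprocessing module's split-and-mark step, the following sketch follows the description above (the class and method names, the label convention, and the 50:50 split are our own assumptions; the patent specifies no implementation):

```python
import numpy as np

class ImbalancedCPCSystem:
    """Illustrative skeleton of the preprocessing module described above."""

    def preprocess(self, S_X, S_y, rng=None):
        """Split S into training and test halves, then separate the training
        half into minority set X (label +1) and majority set Y (label -1)."""
        rng = np.random.default_rng(rng)
        idx = rng.permutation(len(S_y))
        half = len(S_y) // 2
        tr, te = idx[:half], idx[half:]
        T_X, T_y = S_X[tr], S_y[tr]
        X = T_X[T_y == +1]   # minority class samples of the training set
        Y = T_X[T_y == -1]   # majority class samples of the training set
        return X, Y, (S_X[te], S_y[te])
```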
Another object of the present invention is to provide a classification method of class imbalance data based on a convex polyhedron classifier, comprising:
step 1: for a given finite imbalanced data set S, divide it into a training set T and a test set P in a 50:50 ratio. Then mark the minority class samples in the training set T as X = {x_i, 1 ≤ i ≤ m}, where m is the number of minority class samples, and mark the majority class samples in T as Y = {y_j, 1 ≤ j ≤ n}, where n is the number of majority class samples;
step 2: first, represent the convex hull of X by the convex combinations of the samples in the minority class set X, i.e. CH(X) = {x | x = Σ_{1≤i≤m} α_i x_i, Σ_{1≤i≤m} α_i = 1, x_i ∈ X, α_i ≥ 0}, and give the convex-polyhedron-separability criterion for the two sample sets X and Y': if the intersection of the convex hull of X with Y' is empty, expressed as CH(X) ∩ Y' = ∅, then X is said to be convex polyhedron separable relative to Y'. Then detect the samples of Y that fall inside the convex hull of X; the remaining samples form the clean sample set Y', realizing the convex polyhedron differentiation of the sample space.
The method comprises the following specific steps:
step 2.1: set the initial clean sample set Y' = ∅, and set the initial sample index variable k = 1;
step 2.2: select a single sample y_k from the majority class set Y and place it in the region to be detected;
step 2.3: calculate the distance from the convex hull of X to y_k:
d(y_k, CH(X)) = min{d(y_k, x), x ∈ CH(X)};
step 2.4: if d(y_k, CH(X)) > 0, put y_k into the set Y';
step 2.5: increase the sample index variable k by 1, i.e. k ← k + 1;
step 2.6: if undetected samples remain in Y, i.e. k ≤ n, go to step 2.2; otherwise go to step 2.7;
step 2.7: obtain the sample set Y' whose intersection with the convex hull of X is empty, i.e. CH(X) ∩ Y' = ∅, realizing the convex polyhedron differentiation of the sample space.
step 3: run the convex polyhedron construction algorithm on X and Y' to obtain a set of linear discriminant functions LDFs = {f_l(x), 1 ≤ l ≤ L} satisfying min_{1≤l≤L} f_l(x) > 0 for every x ∈ X. A classification model CPC(x) is constructed from the set LDFs and expressed as: CPC(x) = +1, x ∈ X; CPC(x) = -1, x ∈ Y. The method comprises the following specific steps:
step 3.1: initialize the set of linear discriminant functions LDFs = ∅, and initialize the discriminant function index variable L = 1;
step 3.2: find the globally closest point pair (y_p ∈ Y', x* ∈ CH(X)) by computing point-to-convex-hull distances;
step 3.3: use the closest point pair (y_p, x*) to compute a linear discriminant function f_L(x) = w_L·x + b_L, where w_L = x* - y_p and b_L = (||y_p||² - ||x*||²)/2, i.e. f_L(x) = w_L·x + b_L = 0 is the perpendicular bisector of the segment connecting the two closest points y_p and x*;
step 3.4: mark the samples in Y' with f_L(x) < 0 and store the indices of these samples in a temporary data space IDS;
step 3.5: delete the samples marked in IDS from Y'; the remaining sample set is still denoted Y';
step 3.6: put f_L(x) into the linear discriminant function set LDFs;
step 3.7: if Y' ≠ ∅, set L ← L + 1, empty IDS, and return to step 3.2;
step 3.8: finish the linear discriminant function set LDFs = {f_l(x), 1 ≤ l ≤ L};
step 3.9: construct the classification model from the linear discriminant function set LDFs: CPC(x) = +1 if min_{1≤l≤L} f_l(x) > 0, and CPC(x) = -1 otherwise.
step 4: make classification decisions on the samples of the test set P according to the model CPC(x) and output the classification results; the evaluation indices include Precision, Recall, Specificity, F1-Score, and G-Mean.
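The five evaluation indices of step 4 can be computed from a confusion matrix as follows (minority class taken as the positive class; a sketch using the standard formulas, not code from the patent):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Five standard indices with the minority class as the positive class;
    values returned in percent, rounded to two decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity on the minority class
    specificity = tn / (tn + fp)       # accuracy on the majority class
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return {'Precision': round(100 * precision, 2),
            'Recall': round(100 * recall, 2),
            'Specificity': round(100 * specificity, 2),
            'F1-Score': round(100 * f1, 2),
            'G-Mean': round(100 * g_mean, 2)}
```

With the embodiment's test set of 100 minority and 400 majority samples, the counts TP=98, FN=2, FP=5, TN=395 reproduce the Table 2 values exactly; this is an inference about the confusion matrix, not something stated in the text.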
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method.
It is a further object of the invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method.
Another object of the present invention is to provide an information processing terminal for solving bank fraud detection, disease diagnosis, risk behavior assessment, which is equipped with the class imbalance data classification system based on a convex polyhedron classifier, the terminal comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the method.
By combining all the technical schemes, the invention has the advantages and positive effects that:
the method provided by the invention divides the class unbalanced data set into a training set T and a test set P, and marks a few class samples in the training set T as X= { X i I is more than or equal to 1 and less than or equal to m, and most samples in the training set are marked as Y= { Y j J is more than or equal to 1 and less than or equal to n; detecting samples falling into the X convex hulls in the Y, removing the samples, and marking the set of the residual samples in the Y as Y'; training a convex polyhedron classification model CPC (X) between X and Y' using a convex polyhedron construction algorithm; the classification model CPC (x) obtained is used for judging the classification of each sample in the test set P. The method can be used for effectively solving the problems of bank fraud detection, disease diagnosis, risk behavior assessment and the like.
In solving the class-imbalance classification problem, the method fully considers the natural distribution of the data, needs no balancing preprocessing and no excessive parameter tuning, is simple to implement, is suitable for high-dimensional data, and has strong generalization ability. The convex polyhedron classifier is also applied to the field of imbalanced data classification for the first time, which is highly original.
Compared with the prior art, the method makes full use of the advantages of local linear functions (simple implementation, strong approximation capability, good interpretability) to establish a convex polyhedron classification model that tightly encloses the minority class samples; it fully considers the natural distribution of the class-imbalanced data, discards data rebalancing, and fully exploits the potential of the minority class samples with high misclassification cost for improving model performance; and it fully exerts the few-parameter, distribution-assumption-free character of the convex polyhedron classifier, avoiding, from the standpoint of real-time performance, the dependence of traditional methods on complex processing mechanisms and excessive parameter tuning.
The class unbalanced data classification system has simple modules, short decision response time and easy realization and expansion.
Compared with the existing oversampling and undersampling methods, the classification method for class-imbalanced data provided by the invention improves the evaluation indices Precision, Recall, Specificity, F1-Score, and G-Mean;
the techniques and methods involved in the present invention can be very easily implemented on a computer system;
the terminal provided with the class unbalanced data classification system based on the convex polyhedron classifier can realize early warning of abnormal events such as bank fraud detection, disease diagnosis, risk behavior assessment and the like.
On international benchmark evaluation data sets, the method of the invention has obvious advantages over the oversampling and undersampling methods on the evaluation indices Precision, Recall, Specificity, F1-Score, and G-Mean. The specific data are as follows: on Precision, the method is on average 5.79 percentage points higher than the oversampling (SMOTE) method and 11.62 percentage points higher than random undersampling; on Recall, on average 39.78 percentage points higher than SMOTE and 10.28 percentage points higher than random undersampling; on Specificity, on average 6.23 percentage points higher than SMOTE and 3.01 percentage points higher than random undersampling; on F1-Score, on average 23.85 percentage points higher than SMOTE and 11.75 percentage points higher than random undersampling; on G-Mean, on average 22.78 percentage points higher than SMOTE and 6.19 percentage points higher than random undersampling.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description will briefly explain the drawings needed in the embodiments of the present application, and it is obvious that the drawings described below are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a classification method of class imbalance data based on a convex polyhedron classifier according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of sample space convex polyhedron differentiation provided by an embodiment of the present invention.
FIG. 3 is a schematic diagram of a convex polyhedron construction algorithm provided by an embodiment of the present invention.
FIG. 4 is a graph of the effect of an unbalanced data set on a randomly generated Gaussian distribution provided by an embodiment of the invention.
Fig. 5 is a graph of the effect of a few class samples and a majority class samples in the labeled training set T according to an embodiment of the present invention.
Fig. 6 is a sample effect diagram of labels falling into a minority class convex hull provided by an embodiment of the present invention.
Fig. 7 is a diagram showing the effect of realizing the sample space convex polyhedron differentiation according to the embodiment of the present invention.
FIG. 8 is a graph showing the effect of computing a set of linear discriminant functions provided by an embodiment of the present invention.
Fig. 9 is a diagram of classification decision according to the model CPC (x) for samples in the test set P according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Aiming at the problems existing in the prior art, the invention provides a classification method and a classification system for class unbalanced data based on a convex polyhedron classifier, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a classification method of class imbalance data based on a convex polyhedron classifier, comprising:
step 1: for a given finite imbalanced data set S, divide it into a training set T and a test set P in a 50:50 ratio. Then mark the minority class samples in the training set T as X = {x_i, 1 ≤ i ≤ m}, where m is the number of minority class samples, and mark the majority class samples in T as Y = {y_j, 1 ≤ j ≤ n}, where n is the number of majority class samples;
step 2: first, represent the convex hull of X by the convex combinations of the samples in the minority class set X, i.e. CH(X) = {x | x = Σ_{1≤i≤m} α_i x_i, Σ_{1≤i≤m} α_i = 1, x_i ∈ X, α_i ≥ 0}, and give the convex-polyhedron-separability criterion for the two sample sets X and Y': if the intersection of the convex hull of X with Y' is empty, expressed as CH(X) ∩ Y' = ∅, then X is said to be convex polyhedron separable relative to Y'. Then detect the samples of Y that fall inside the convex hull of X; the remaining samples form the clean sample set Y', realizing the convex polyhedron differentiation of the sample space.
step 3: run the convex polyhedron construction algorithm on X and Y' to obtain a set of linear discriminant functions LDFs = {f_l(x), 1 ≤ l ≤ L}; construct the classification model CPC(x) from the set LDFs, expressed as CPC(x) = +1, x ∈ X; CPC(x) = -1, x ∈ Y.
step 4: make classification decisions on the samples of the test set P according to the model CPC(x) and output the classification results; the evaluation indices include Precision, Recall, Specificity, F1-Score, and G-Mean.
As shown in fig. 2, the specific steps of step 2 include:
step 2.1: set the initial clean sample set Y' = ∅, and set the initial sample index variable k = 1;
step 2.2: select a single sample y_k from the majority class set Y and place it in the region to be detected;
step 2.3: calculate the distance from the convex hull of X to y_k:
d(y_k, CH(X)) = min{d(y_k, x), x ∈ CH(X)};
step 2.4: if d(y_k, CH(X)) > 0, put y_k into the set Y';
step 2.5: increase the sample index variable k by 1, i.e. k ← k + 1;
step 2.6: if undetected samples remain in Y, i.e. k ≤ n, go to step 2.2; otherwise go to step 2.7;
step 2.7: obtain the sample set Y' whose intersection with the convex hull of X is empty, i.e. CH(X) ∩ Y' = ∅, realizing the convex polyhedron differentiation of the sample space.
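Steps 2.1-2.7 hinge on the point-to-convex-hull distance d(y_k, CH(X)). One way to sketch it is to minimize ||Σ_i α_i x_i - y_k||² over the simplex of convex coefficients, here with SciPy's SLSQP solver (an illustrative choice; the patent does not prescribe a solver, and a dedicated QP solver would be the usual production choice):

```python
import numpy as np
from scipy.optimize import minimize

def dist_to_convex_hull(y, X):
    """d(y, CH(X)): minimise ||X^T a - y||^2 over the simplex
    {a : a_i >= 0, sum_i a_i = 1}."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    m = len(X)
    res = minimize(lambda a: np.sum((X.T @ a - y) ** 2),
                   np.full(m, 1.0 / m),            # start at the centroid
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}],
                   method='SLSQP')
    return np.sqrt(max(res.fun, 0.0))

def purify_majority(Y, X, tol=1e-6):
    """Steps 2.1-2.7: keep only majority samples strictly outside CH(X)."""
    return np.array([yk for yk in Y if dist_to_convex_hull(yk, X) > tol])
```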
As shown in fig. 3, the specific steps of step 3 include:
step 3.1: initialize the set of linear discriminant functions LDFs = ∅, and initialize the discriminant function index variable L = 1;
step 3.2: find the globally closest point pair (y_p ∈ Y', x* ∈ CH(X)) by computing point-to-convex-hull distances;
step 3.3: use the closest point pair (y_p, x*) to compute a linear discriminant function f_L(x) = w_L·x + b_L, where w_L = x* - y_p and b_L = (||y_p||² - ||x*||²)/2, i.e. f_L(x) = w_L·x + b_L = 0 is the perpendicular bisector of the segment connecting the two closest points y_p and x*;
step 3.4: mark the samples in Y' with f_L(x) < 0 and store the indices of these samples in a temporary data space IDS;
step 3.5: delete the samples marked in IDS from Y'; the remaining sample set is still denoted Y';
step 3.6: put f_L(x) into the linear discriminant function set LDFs;
step 3.7: if Y' ≠ ∅, set L ← L + 1, empty IDS, and return to step 3.2;
step 3.8: finish the linear discriminant function set LDFs = {f_l(x), 1 ≤ l ≤ L};
step 3.9: construct the classification model from the linear discriminant function set LDFs.
the invention will be further described with reference to specific examples
Examples
The classification method of the class unbalanced data based on the convex polyhedron classifier provided by the invention comprises the following steps:
step 1: randomly generate an imbalanced data set S of Gaussian distribution [as in fig. 4], with 200 minority class samples and 800 majority class samples. Divide it into a training set T and a test set P in a 50:50 ratio. Then mark the minority class samples in T as X = {x_i, 1 ≤ i ≤ 100} and the majority class samples in T as Y = {y_j, 1 ≤ j ≤ 400} [as in fig. 5];
step 2: set the initial clean sample set Y' = ∅;
step 3: sequentially select each sample y_j ∈ Y, 1 ≤ j ≤ 400, and place it in the region to be detected;
step 4: calculate the distance d(y_j, CH(X)) between the convex hull of X and y_j; if d(y_j, CH(X)) > 0, put y_j into the set Y'; otherwise mark y_j as falling inside the convex hull [as in fig. 6];
step 5: delete the 9 samples so marked from Y and confirm that all samples in Y have been detected, obtaining the clean sample set Y' and realizing the convex polyhedron differentiation of the sample space [as in fig. 7];
step 6: initialize the set of linear discriminant functions LDFs = ∅;
step 7: find the globally closest point pair (y_p ∈ Y', x* ∈ CH(X)) by computing point-to-convex-hull distances;
step 8: use the closest point pair (y_p, x*) to compute a linear discriminant function f_1(x) = w_1·x + b_1, where w_1 = x* - y_p and b_1 = (||y_p||² - ||x*||²)/2, i.e. f_1(x) = w_1·x + b_1 = 0 is the perpendicular bisector of the segment connecting the two closest points y_p and x*;
step 9: mark the samples in Y' with f_1(x) < 0 and store the indices of these samples in a temporary data space IDS;
step 10: delete the samples marked in IDS from Y'; the remaining sample set is still denoted Y';
step 11: put f_1(x) into the linear discriminant function set LDFs;
step 12: repeat steps 7-11 until Y' = ∅;
step 13: empty the temporary data space IDS and arrange the linear discriminant function set LDFs = {f_l(x), 1 ≤ l ≤ 5}, where
f_1(x) = -6x_1 - 6x_2 + 456
f_2(x) = -16x_1 - 5x_2 + 809
f_3(x) = -5x_1 - 12x_2 + 681
f_4(x) = -21x_1 - 4x_2 + 1008
f_5(x) = -12x_1 - 21x_2 + 756 [as in fig. 8];
step 14: construct the classification model CPC(x) from the linear discriminant function set LDFs;
step 15: make classification decisions on the samples of the test set P according to the model CPC(x) [as in fig. 9]; the statistics form a confusion matrix [as in table 1]; calculate and output the classification results, with indices including Precision, Recall, Specificity, F1-Score, and G-Mean [as in table 2].
TABLE 1
The index calculation process is: Precision = TP/(TP+FP); Recall = TP/(TP+FN); Specificity = TN/(TN+FP); F1-Score = 2·Precision·Recall/(Precision+Recall); G-Mean = √(Recall·Specificity), where TP, FN, FP, TN are the entries of the confusion matrix with the minority class as the positive class.
TABLE 2
| Precision(%) | Recall(%) | Specificity(%) | F1-Score(%) | G-Mean(%) |
|---|---|---|---|---|
| 95.15 | 98.00 | 98.75 | 96.55 | 98.37 |
The invention is applicable to datasets in high-dimensional space.
The invention is further described below in connection with specific experimental data.
1) Data for experiments
| Data set | Number of samples | Number of minority class samples | Imbalance ratio | Number of features |
|---|---|---|---|---|
| Wisconsin | 683 | 239 | 0.54 | 9 |
| Pima | 768 | 268 | 0.54 | 8 |
| Glass | 214 | 70 | 0.49 | 9 |
| Vehicle | 846 | 217 | 0.34 | 18 |
| Ecoli | 336 | 77 | 0.30 | 7 |
| Yeast | 1484 | 163 | 0.12 | 8 |
| Vowel | 988 | 90 | 0.10 | 13 |
2) Experimental results 1: Precision (%)
Data set | Proposed method | Oversampling (SMOTE) | Difference vs. SMOTE | Random undersampling | Difference vs. undersampling
---|---|---|---|---|---
Wisconsin | 92.15 | 92.06 | 0.09 | 92.06 | 0.09
Pima | 98.17 | 57.77 | 40.40 | 50.28 | 47.89
Glass | 66.70 | 77.29 | -10.59 | 68.14 | -1.44
Vehicle | 87.47 | 57.96 | 29.51 | 41.45 | 46.02
Ecoli | 74.12 | 66.91 | 7.21 | 70.81 | 3.31
Yeast | 56.05 | 71.79 | -15.74 | 60.00 | -3.95
Vowel | 89.65 | 100.00 | -10.35 | 100.00 | -10.35
Average difference | | | 5.79 | | 11.62
In terms of the accuracy Precision, the proposed method is on average 5.79 percentage points higher than the oversampling (SMOTE) method and 11.62 percentage points higher than the random undersampling method.
3) Experimental results 2: Recall (%)
Data set | Proposed method | Oversampling (SMOTE) | Difference vs. SMOTE | Random undersampling | Difference vs. undersampling
---|---|---|---|---|---
Wisconsin | 100.00 | 97.87 | 2.13 | 97.87 | 2.13
Pima | 100.00 | 43.40 | 56.60 | 79.25 | 20.75
Glass | 85.71 | 50.00 | 35.71 | 78.57 | 7.14
Vehicle | 95.35 | 25.58 | 69.77 | 83.72 | 11.63
Ecoli | 86.67 | 53.33 | 33.34 | 80.00 | 6.67
Yeast | 100.00 | 46.88 | 53.12 | 87.50 | 12.50
Vowel | 100.00 | 72.22 | 27.78 | 88.89 | 11.11
Average difference | | | 39.78 | | 10.28
In terms of the recall rate Recall, the proposed method is on average 39.78 percentage points higher than the oversampling (SMOTE) method and 10.28 percentage points higher than the random undersampling method.
4) Experimental results 3: Specificity (%)
Data set | Proposed method | Oversampling (SMOTE) | Difference vs. SMOTE | Random undersampling | Difference vs. undersampling
---|---|---|---|---|---
Wisconsin | 97.12 | 96.66 | 0.46 | 96.66 | 0.46
Pima | 64.60 | 60.02 | 4.58 | 61.43 | 3.17
Glass | 72.14 | 68.14 | 4.00 | 64.29 | 7.85
Vehicle | 51.67 | 48.93 | 2.74 | 51.54 | 0.13
Ecoli | 82.45 | 70.11 | 12.34 | 80.85 | 1.60
Yeast | 73.85 | 67.68 | 6.17 | 70.04 | 3.81
Vowel | 98.31 | 84.98 | 13.33 | 94.28 | 4.03
Average difference | | | 6.23 | | 3.01
In terms of the specificity Specificity, the proposed method is on average 6.23 percentage points higher than the oversampling (SMOTE) method and 3.01 percentage points higher than the random undersampling method.
5) Experimental results 4: F1-Score (%)
In terms of the F1 metric F1-Score, the proposed method is on average 23.85 percentage points higher than the oversampling (SMOTE) method and 11.75 percentage points higher than the random undersampling method.
6) Experimental results 5: G-Mean (%)
Data set | Proposed method | Oversampling (SMOTE) | Difference vs. SMOTE | Random undersampling | Difference vs. undersampling
---|---|---|---|---|---
Wisconsin | 98.55 | 97.26 | 1.29 | 97.26 | 1.29
Pima | 80.37 | 51.04 | 29.33 | 69.77 | 10.60
Glass | 78.63 | 58.37 | 20.26 | 71.07 | 7.56
Vehicle | 70.19 | 35.38 | 34.81 | 65.69 | 4.50
Ecoli | 84.53 | 61.15 | 23.38 | 80.42 | 4.11
Yeast | 85.94 | 56.33 | 29.61 | 78.28 | 7.66
Vowel | 99.15 | 78.34 | 20.81 | 91.55 | 7.60
Average difference | | | 22.78 | | 6.19
In terms of the G metric G-Mean, the proposed method is on average 22.78 percentage points higher than the oversampling (SMOTE) method and 6.19 percentage points higher than the random undersampling method.
In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more; the terms "upper," "lower," "left," "right," "inner," "outer," "front," "rear," "head," "tail," and the like are used as an orientation or positional relationship based on that shown in the drawings, merely to facilitate description of the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and therefore should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.
Claims (4)
1. A convex polyhedron classifier-based class imbalance data classification system, comprising:
the finite imbalanced data set preprocessing module, used for dividing a given finite imbalanced data set S into a training set T and a testing set P, and then marking the minority class samples X and the majority class samples Y in the training set T;
the convex polyhedron differentiation module of the sample space, used for representing the convex hull of X by convex combinations of the samples in the minority class set X and providing the criterion for judging whether the two sample sets X and Y' are convex polyhedron separable; the samples in Y that are not inside the convex hull of X are detected and form a pure sample set Y', thereby realizing the convex polyhedron differentiation of the sample space;
the classification model construction module, used for running the convex polyhedron construction algorithm on X and Y' to obtain a set LDFs of linear discriminant functions, and constructing a classification model CPC(x) according to the set LDFs;
the classification decision module is used for carrying out classification decision on samples in the test set P according to the model CPC (x) and outputting classification results;
the classification method of the class unbalanced data based on the convex polyhedron classifier comprises the following steps:
step 1: for a given finite imbalance data set S, dividing into a training set T and a test set P in a proportion of 50% to 50%; then, a few class samples in the training set T are marked as x= { X i 1.ltoreq.i.ltoreq.m, where m is the number of minority class samples; the majority sample in the marked training set T is y= { Y j 1.ltoreq.j.ltoreq.n, where n is the number of most classes of samples;
step 2: convex hulls representing X using convex combinations of samples in a minority class set X, i.e., CH (X) = { x|x= Σ 1≤i≤m α i x i ,∑ 1≤i≤m α i =1,x i ∈X,α i 0, and provides two sample sets X and Y' convex polyhedron separable decision criteria: if the intersection of the convex hull of X and Y' is null, it is expressed asThen it is indicated that X is convex polyhedral relative to Y'; then, detecting samples which are not in the X convex hulls in the Y, wherein the samples form a pure sample set Y', so that the convex polyhedron in the sample space can be differentiated;
step 3: running a convex polyhedron construction algorithm on X and Y' to obtain a set LDFs= { f of linear discriminant functions l (x) L is more than or equal to 1 and less than or equal to L, meets the following requirementsf l (x i )>0;/>f l (y j ) < 0; constructing a classification model CPC (X) according to the set LDFs, and expressing the classification model CPC (X) = +1, X epsilon X; CPC (x) = -1, x e Y;
step 4: classifying and deciding the samples in the test set P according to the model CPC (x), and outputting classification results, wherein the evaluation indexes comprise accuracy Precision, recall rate Recall and specificity rate Specificity, F 1 Metric F 1 Score, G metric G-Mean;
the step 2 comprises the following steps:
step 2.1: setting an initial valuePure sample setSetting an initial sample indication variable k=1;
step 2.2: selection of a single sample Y from a plurality of classes of samples Y k Placing the sample in a region to be detected;
step 2.3: calculating convex hull of X to y k Is as follows
d(y k ,CH(X))=min{d(y k ,x),x∈CH(X)};
Step 2.4: if d (y) k CH (X)) > 0, y will be k Put into set Y';
step 2.5: the sample indicates that the value of variable k has increased by 1, namely k+ k+1;
step 2.6: if the sample which is not detected exists in Y, namely k is less than n, turning to step 2.2; otherwise, turning to step 2.7;
step 2.7: obtaining a sample set Y' which intersects the convex hull of X as empty, i.eRealizing the differentiation of a convex polyhedron in a sample space;
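The distance d(y_k, CH(X)) in step 2.3 requires projecting a point onto a convex hull. The patent does not fix a particular solver; a Frank-Wolfe (conditional gradient) iteration is one standard way to sketch it:

```python
def dist_to_convex_hull(y, X, max_iter=1000, tol=1e-12):
    """Approximate d(y, CH(X)) by projecting y onto the convex hull of the
    sample set X using Frank-Wolfe (conditional gradient) iterations."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    p = list(X[0])                                   # start at an arbitrary sample
    for _ in range(max_iter):
        g = [pi - yi for pi, yi in zip(p, y)]        # gradient of ||p - y||^2 / 2
        s = min(X, key=lambda x: dot(g, x))          # best vertex direction
        d = [si - pi for si, pi in zip(s, p)]
        dd = dot(d, d)
        if dd == 0.0:
            break
        gamma = max(0.0, min(1.0, -dot(g, d) / dd))  # exact line search on [0, 1]
        if gamma <= tol:
            break                                    # no further improvement
        p = [pi + gamma * di for pi, di in zip(p, d)]
    g = [pi - yi for pi, yi in zip(p, y)]
    return dot(g, g) ** 0.5
```

A positive result flags y_k as lying outside the hull (step 2.4). For example, the distance from (4, 4) to the triangle with vertices (0, 0), (4, 0), (0, 4) is sqrt(8), while points inside the triangle get a (numerically) zero distance.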
the step 3 specifically comprises the following steps:
step 3.1: initializing a set of linear discriminant functionsInitializing a linear discriminant function indicating variable l=1;
step 3.2: by calculating the distance from the point to the convex hull, the nearest point pair (Y p ∈Y',x * ∈CH(X));
Step 3.3: using the closest point pair (y p ,x * ) Calculate a linear discriminant function f L (x)=w L ·x+b L Wherein w is L =x * -y p ,b L =(||y p || 2 -||x * || 2 ) 2, i.e. f L (x)=w L ·x+b L =0 is the connection of two closest points y p And x * A perpendicular bisector of the connection line;
step 3.4: f in the label Y L (x) Samples < 0 and storing the numbers of these samples in a temporary data space IDS;
step 3.5: deleting the marked samples in the IDS from Y ', the remaining sample set still marked as Y';
step 3.6: will f L (x) Putting the linear discriminant function set LDFs;
step 3.7: if it isL=l+1, ids is emptied, returning to step 3.2;
step 3.8: finishing the linear discriminant function set ldfs= { f l (x),1≤l≤L};
Step 3.9: constructing a classification model by using the linear discriminant function set LDFs:
the method divides a class unbalanced data set into a training set T and a test set P, and marks a minority class sample in the training set T as X= { X i I is more than or equal to 1 and less than or equal to m, and most samples in the training set are marked as Y= { Y j J is more than or equal to 1 and less than or equal to n; detecting samples falling into the X convex hulls in the Y, removing the samples, and marking the set of the residual samples in the Y as Y'; training a convex polyhedron classification model CPC (X) between X and Y' using a convex polyhedron construction algorithm; the classification model CPC (x) is used for judging the classification of each sample in the test set P, and can be used for effectively solving the problems of bank fraud detection, disease diagnosis and risk behavior assessment.
2. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of claim 1.
3. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of claim 1.
4. An information processing terminal for solving bank fraud detection, disease diagnosis, and risk behaviour assessment, carrying the class imbalance data classification system based on a convex polyhedron classifier according to claim 1, the terminal comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010904076.2A CN112035719B (en) | 2020-09-01 | 2020-09-01 | Category imbalance data classification method and system based on convex polyhedron classifier |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112035719A CN112035719A (en) | 2020-12-04 |
CN112035719B true CN112035719B (en) | 2024-02-20 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492096A (en) * | 2018-10-23 | 2019-03-19 | 华东理工大学 | A kind of unbalanced data categorizing system integrated based on geometry |
CN110533116A (en) * | 2019-09-04 | 2019-12-03 | 大连大学 | Based on the adaptive set of Euclidean distance at unbalanced data classification method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180210944A1 (en) * | 2017-01-26 | 2018-07-26 | Agt International Gmbh | Data fusion and classification with imbalanced datasets |
Non-Patent Citations (1)
An effective method to determine whether a point is within a convex hull and its generalized convex polyhedron classifier; Qiangkui Leng, et al.; Information Sciences; vol. 504; pp. 435-448 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||