CN112035719B - Category imbalance data classification method and system based on convex polyhedron classifier

Category imbalance data classification method and system based on convex polyhedron classifier

Info

Publication number
CN112035719B
CN112035719B (application CN202010904076.2A; published as CN112035719A)
Authority
CN
China
Prior art keywords
samples
convex
sample
classification
class
Prior art date
Legal status
Active
Application number
CN202010904076.2A
Other languages
Chinese (zh)
Other versions
CN112035719A (en)
Inventor
冷强奎
赵留洋
李松宇
Current Assignee
Bohai University
Original Assignee
Bohai University
Priority date
Filing date
Publication date
Application filed by Bohai University
Priority to CN202010904076.2A
Publication of CN112035719A
Application granted
Publication of CN112035719B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/906: Clustering; Classification
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining, for mining of medical data, e.g. analysing previous cases of other patients

Abstract

The invention belongs to the technical field of artificial intelligence/information science, and discloses a class-imbalance data classification method and system based on a convex polyhedron classifier. The method comprises the following steps: divide the class-imbalanced data set S into a training set T and a test set P, and mark the minority-class samples X and the majority-class samples Y in the training set T; detect the samples of Y that fall inside the convex hull of X, remove them, and denote the set of remaining samples of Y as Y'; train a convex polyhedron classification model between X and Y' using the convex polyhedron construction algorithm; and judge the class of each sample in the test set P using the obtained classification model. When solving the class-imbalance classification problem, the method fully considers the natural distribution characteristics of the data, requires no balancing preprocessing and no excessive parameter tuning, is simple to implement, is suitable for high-dimensional data, and has strong generalization ability. The convex polyhedron classifier is also applied to the field of imbalanced data classification for the first time, which is of notable originality.

Description

Category imbalance data classification method and system based on convex polyhedron classifier
Technical Field
The invention belongs to the technical field of artificial intelligence/information science, and particularly relates to a class-imbalance data classification method and system based on a convex polyhedron classifier.
Background
The existing methods for solving the imbalanced classification problem mainly change the distribution of the training samples through data resampling techniques such as oversampling and undersampling, so as to reduce the degree of imbalance in the data. The rebalanced data are then fed to a specific classifier to make classification decisions.
Oversampling increases the number of samples in the minority class so that the minority class and the majority class reach quantitative balance. It typically synthesizes new, non-duplicated samples between two closely spaced minority-class samples using K-nearest-neighbor search and linear interpolation. However, this approach tends to cause data overlap at the classification boundary, making it difficult for the classifier to distinguish the class attributes of boundary samples. Even if the two classes of samples are forcibly separated, the resulting classification surface tends to be very complex, leading to over-fitting.
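To make the mechanism concrete, a minimal sketch of this K-nearest-neighbor plus linear-interpolation scheme is shown below; the function name smote_like_oversample and its parameters are illustrative assumptions, and the code describes the generic technique, not the method of the invention:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=None):
    """Synthesize n_new minority samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n, d = X_min.shape                                   # assumes n > k
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                      # exclude self-distance
    neighbours = np.argsort(dists, axis=1)[:, :k]        # k nearest neighbours per sample
    synthetic = np.empty((n_new, d))
    for t in range(n_new):
        i = rng.integers(n)                              # a random minority sample
        j = neighbours[i, rng.integers(k)]               # one of its neighbours
        gap = rng.random()                               # interpolation factor in [0, 1)
        synthetic[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```

Because every synthetic point lies on the segment between two minority samples, points generated near the class boundary can drift toward the majority region, which is exactly the overlap problem noted above.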
Undersampling reduces the number of majority-class samples to the level of the minority class to maintain balance. It typically deletes majority-class samples according to certain cleaning rules, or reduces them by clustering. However, cleaning may mistakenly delete important samples, causing a loss of classification information; clustering decides which samples to retain based on cluster centers, so important boundary-point information may be lost.
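For comparison, the random undersampling baseline used in the experiments later in this document can be sketched as follows (the function name random_undersample is an illustrative assumption):

```python
import numpy as np

def random_undersample(X_maj, n_keep, seed=None):
    """Randomly keep only n_keep majority-class samples (e.g. the minority count),
    discarding the rest; the discarded samples are the information loss criticized above."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[keep]
```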
Recent studies have also shown that the 50:50 balanced data produced by resampling techniques is not more discriminative than the original data. This indicates that solving the imbalanced classification problem from the data-balancing point of view is an artificial, empirical practice, and there is no evidence from practical use that proves its effectiveness.
Through the above analysis, the problems and defects existing in the prior art are as follows:
(1) The existing methods need to perform balancing preprocessing on the data, which changes the data distribution and causes a loss of global information.
(2) The existing methods take data rebalancing as a premise and cannot fully consider the natural distribution characteristics of the data.
(3) The prior art has limitations in solving practical imbalanced applications such as medical auxiliary diagnosis. In this problem, the population suffering from a severe disease is the minority and the healthy population is the majority. If a balancing technique is used, the following occurs: [a] oversampling synthesizes new patient data that inevitably tend toward the majority class in spatial distribution, i.e., the synthesized (diseased) samples overlap with the majority-class (healthy) samples at the boundary, making them difficult to distinguish; [b] undersampling deletes a large portion of the majority-class samples, which moves the classification boundary toward the majority class, i.e., in later decisions a significant portion of healthy people would be misclassified as ill, increasing the burden of subsequent examinations.
The difficulty of solving the problems and the defects is as follows:
the existing imbalance classifiers depend on a rebalancing treatment of the data when generating a decision surface; that is, the data must be balanced before the classifier can work. The rebalancing process destroys the natural distribution of the data, so that valid information is overwritten or deleted. It therefore becomes particularly difficult to effectively utilize the natural distribution information of the original data. In addition, protecting the minority-class samples, whose misclassification cost is high, from being destroyed is a difficult problem that the existing methods cannot effectively solve.
The meaning of solving the problems and the defects is as follows:
(1) The preprocessing step of data rebalancing is eliminated, so the response time of classification decisions is shortened;
(2) Effective information can be learned from the natural distribution of the original data;
(3) The information of the minority-class samples is protected from being destroyed, reflecting the importance of the minority-class samples with high misclassification cost;
(4) The imbalanced classification system modules are simplified, and the system structure is lightweight.
The invention provides a lightweight solution to the imbalance problem that is simple, easy to implement, and able to learn accurately from the minority-class samples. The invention fully exploits the potential of the minority-class samples, whose misclassification cost is high, for improving the performance of the classification model.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a class-imbalance data classification method and system based on a convex polyhedron classifier.
The invention is realized in such a way that the class-imbalance data classification system based on a convex polyhedron classifier comprises:
an imbalanced data set preprocessing module, used for dividing a given finite imbalanced data set S into a training set T and a test set P, and then marking the minority-class samples X and the majority-class samples Y in the training set T;
a sample-space convex polyhedron differentiation module, used for representing the convex hull of X by convex combinations of the samples in the minority-class set X and providing the criterion for deciding whether the two sample sets X and Y' are convex-polyhedron separable; the samples of Y that do not lie in the convex hull of X are detected and form the pure sample set Y', realizing the convex polyhedron differentiation of the sample space;
a classification model construction module, which runs the convex polyhedron construction algorithm on X and Y' to obtain a set LDFs of linear discriminant functions and constructs a classification model CPC(x) from the set LDFs;
and a classification decision module, used for making classification decisions on the samples in the test set P according to the model CPC(x) and outputting the classification results.
Another object of the present invention is to provide a classification method of class-imbalance data based on a convex polyhedron classifier, comprising:
Step 1: for a given finite imbalanced data set S, divide it into a training set T and a test set P in a 50:50 ratio. Then mark the minority-class samples in the training set T as X = {x_i | 1 ≤ i ≤ m}, where m is the number of minority-class samples, and the majority-class samples in the training set T as Y = {y_j | 1 ≤ j ≤ n}, where n is the number of majority-class samples;
Step 2: first, represent the convex hull of X by the convex combinations of the samples in the minority-class set X, i.e. CH(X) = {x | x = Σ_{1≤i≤m} α_i x_i, Σ_{1≤i≤m} α_i = 1, x_i ∈ X, α_i ≥ 0}, and give the criterion for deciding whether the two sample sets X and Y' are convex-polyhedron separable: if the intersection of the convex hull of X and Y' is empty, expressed as CH(X) ∩ Y' = ∅, then X is said to be convex-polyhedron separable relative to Y'. Then detect the samples of Y that fall inside the convex hull of X and remove them; the remaining samples form the pure sample set Y', so that the convex polyhedron differentiation of the sample space is realized.
The method comprises the following specific steps:
Step 2.1: set the initial pure sample set Y' = ∅ and the initial sample index variable k = 1;
Step 2.2: select a single sample y_k from the majority-class set Y and place it in the region to be detected;
Step 2.3: compute the distance from y_k to the convex hull of X as
d(y_k, CH(X)) = min{d(y_k, x), x ∈ CH(X)};
Step 2.4: if d(y_k, CH(X)) > 0, put y_k into the set Y';
Step 2.5: increase the sample index variable k by 1, i.e. k ← k + 1;
Step 2.6: if undetected samples remain in Y, i.e. k ≤ n, go to step 2.2; otherwise go to step 2.7;
Step 2.7: obtain the sample set Y' whose intersection with the convex hull of X is empty, i.e. CH(X) ∩ Y' = ∅, so that the convex polyhedron differentiation of the sample space is realized (a computational sketch of this detection follows these steps).
Step 3: running a convex polyhedron construction algorithm on X and Y' to obtain a set LDFs= { f of a group of linear discriminant functions l (x) L is more than or equal to 1 and less than or equal to L, meets the following requirementsA classification model CPC (x) is constructed from the set LDFs and expressed as: CPC (X) = +1, X e X; CPC (x) = -1, x e Y. The method comprises the following specific steps:
step 3.1: initializing a set of linear discriminant functionsInitializing a linear discriminant function indicating variable l=1;
step 3.2: by calculating the distance from the point to the convex hull, the nearest point pair (Y p ∈Y',x * ∈CH(X));
Step 3.3: using the closest point pair (y p ,x * ) Calculate a linear discriminant function f L (x)=w L ·x+b L Wherein w is L =x * -y p ,b L =(||y p || 2 -||x * || 2 ) 2, i.e. f L (x)=w L ·x+b L =0 is to connect twoThe nearest point y p And x * A perpendicular bisector of the connection line;
step 3.4: f in the label Y L (x) Samples < 0 and storing the numbers of these samples in a temporary data space IDS;
step 3.5: deleting the marked samples in the IDS from Y ', the remaining sample set still marked as Y';
step 3.6: will f L (x) Putting the linear discriminant function set LDFs;
step 3.7: if it isL=l+1, ids is emptied, returning to step 3.2;
step 3.8: finishing the linear discriminant function set ldfs= { f l (x),1≤l≤L};
Step 3.9: constructing a classification model by using the linear discriminant function set LDFs:
step 4: classifying and deciding the samples in the test set P according to the model CPC (x), and outputting classification results, wherein the evaluation indexes comprise accuracy Precision, recall rate Recall and specificity rate Specificity, F 1 Metric F 1 Score, G metric G-Mean.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method.
It is a further object of the invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method.
Another object of the present invention is to provide an information processing terminal for bank fraud detection, disease diagnosis and risk behavior assessment, which is equipped with the class-imbalance data classification system based on a convex polyhedron classifier; the terminal comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the method.
Combining all the above technical schemes, the invention has the following advantages and positive effects:
The method provided by the invention divides the class-imbalanced data set into a training set T and a test set P, marks the minority-class samples in the training set T as X = {x_i | 1 ≤ i ≤ m} and the majority-class samples in the training set as Y = {y_j | 1 ≤ j ≤ n}; detects the samples of Y that fall inside the convex hull of X, removes them, and denotes the set of remaining samples of Y as Y'; trains a convex polyhedron classification model CPC(x) between X and Y' using the convex polyhedron construction algorithm; and uses the obtained classification model CPC(x) to judge the class of each sample in the test set P. The method can be used to effectively solve problems such as bank fraud detection, disease diagnosis and risk behavior assessment.
When solving the class-imbalance classification problem, the method fully considers the natural distribution characteristics of the data, requires no balancing preprocessing and no excessive parameter tuning, is simple to implement, is suitable for high-dimensional data, and has strong generalization ability. The convex polyhedron classifier is also applied to the field of imbalanced data classification for the first time, which is of notable originality.
Compared with the prior art, the method fully utilizes the advantages of local linear functions, such as simple implementation, strong approximation ability and good interpretability, to establish a convex polyhedron classification model that tightly encloses the minority-class samples; it fully considers the natural distribution characteristics of the class-imbalanced data, abandons data rebalancing techniques, and fully exploits the potential of the minority-class samples with high misclassification cost for improving model performance; and it gives full play to the characteristics of the convex polyhedron classifier, such as few parameters and no distributional assumptions, thereby avoiding, from the perspective of real-time performance, the dependence of traditional methods on complex processing mechanisms and excessive parameter tuning.
The class-imbalanced data classification system has simple modules and short decision response time, and is easy to implement and extend.
Compared with the existing oversampling and undersampling methods, the class-imbalance data classification method provided by the invention is improved on evaluation indexes such as Precision, Recall, Specificity, F1-Score and G-Mean;
the techniques and methods involved in the present invention can be very easily implemented on a computer system;
the terminal provided with the class unbalanced data classification system based on the convex polyhedron classifier can realize early warning of abnormal events such as bank fraud detection, disease diagnosis, risk behavior assessment and the like.
On international benchmark evaluation data sets, the method of the invention shows obvious advantages over the oversampling and undersampling methods on evaluation indexes such as Precision, Recall, Specificity, F1-Score and G-Mean. The specific data are as follows: on Precision, the method is on average 5.79 percentage points higher than the oversampling (SMOTE) method and 11.62 percentage points higher than random undersampling; on Recall, on average 39.78 and 10.28 percentage points higher, respectively; on Specificity, on average 6.23 and 3.01 percentage points higher; on F1-Score, on average 23.85 and 11.75 percentage points higher; and on G-Mean, on average 22.78 and 6.19 percentage points higher.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description will briefly explain the drawings needed in the embodiments of the present application, and it is obvious that the drawings described below are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a classification method of class imbalance data based on a convex polyhedron classifier according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of sample space convex polyhedron differentiation provided by an embodiment of the present invention.
FIG. 3 is a schematic diagram of a convex polyhedron construction algorithm provided by an embodiment of the present invention.
FIG. 4 is an effect graph of a randomly generated Gaussian-distributed imbalanced data set provided by an embodiment of the invention.
Fig. 5 is an effect graph of the labeled minority-class samples and majority-class samples in the training set T provided by an embodiment of the present invention.
Fig. 6 is an effect graph of labeling the samples that fall into the minority-class convex hull provided by an embodiment of the present invention.
Fig. 7 is a diagram showing the effect of realizing the sample space convex polyhedron differentiation according to the embodiment of the present invention.
FIG. 8 is a graph showing the effect of computing a set of linear discriminant functions provided by an embodiment of the present invention.
Fig. 9 is a diagram of classification decision according to the model CPC (x) for samples in the test set P according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Aiming at the problems existing in the prior art, the invention provides a classification method and a classification system for class unbalanced data based on a convex polyhedron classifier, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a classification method of class-imbalance data based on a convex polyhedron classifier, comprising:
Step 1: for a given finite imbalanced data set S, divide it into a training set T and a test set P in a 50:50 ratio. Then mark the minority-class samples in the training set T as X = {x_i | 1 ≤ i ≤ m}, where m is the number of minority-class samples, and the majority-class samples in the training set T as Y = {y_j | 1 ≤ j ≤ n}, where n is the number of majority-class samples;
Step 2: first, represent the convex hull of X by the convex combinations of the samples in the minority-class set X, i.e. CH(X) = {x | x = Σ_{1≤i≤m} α_i x_i, Σ_{1≤i≤m} α_i = 1, x_i ∈ X, α_i ≥ 0}, and give the criterion for deciding whether the two sample sets X and Y' are convex-polyhedron separable: if the intersection of the convex hull of X and Y' is empty, expressed as CH(X) ∩ Y' = ∅, then X is said to be convex-polyhedron separable relative to Y'. Then detect the samples of Y that fall inside the convex hull of X and remove them; the remaining samples form the pure sample set Y', so that the convex polyhedron differentiation of the sample space is realized.
Step 3: run the convex polyhedron construction algorithm on X and Y' to obtain a set of linear discriminant functions LDFs = {f_l(x), 1 ≤ l ≤ L} satisfying f_l(x_i) > 0 for every x_i ∈ X and every l, and, for every y_j ∈ Y', f_l(y_j) < 0 for at least one l; construct a classification model CPC(x) from the set LDFs, expressed as CPC(x) = +1, x ∈ X; CPC(x) = -1, x ∈ Y.
Step 4: make classification decisions on the samples in the test set P according to the model CPC(x) and output the classification results; the evaluation indexes include Precision, Recall, Specificity, the F1 measure F1-Score, and the G measure G-Mean.
As shown in fig. 2, the specific steps of step 2 include:
Step 2.1: set the initial pure sample set Y' = ∅ and the initial sample index variable k = 1;
Step 2.2: select a single sample y_k from the majority-class set Y and place it in the region to be detected;
Step 2.3: compute the distance from y_k to the convex hull of X as
d(y_k, CH(X)) = min{d(y_k, x), x ∈ CH(X)};
Step 2.4: if d(y_k, CH(X)) > 0, put y_k into the set Y';
Step 2.5: increase the sample index variable k by 1, i.e. k ← k + 1;
Step 2.6: if undetected samples remain in Y, i.e. k ≤ n, go to step 2.2; otherwise go to step 2.7;
Step 2.7: obtain the sample set Y' whose intersection with the convex hull of X is empty, i.e. CH(X) ∩ Y' = ∅, so that the convex polyhedron differentiation of the sample space is realized.
As shown in fig. 3, the specific steps of step 3 include:
Step 3.1: initialize the set of linear discriminant functions LDFs = ∅ and the linear discriminant function index variable l = 1;
Step 3.2: by computing the distances from points to the convex hull, find the closest point pair (y_p ∈ Y', x* ∈ CH(X));
Step 3.3: use the closest point pair (y_p, x*) to compute a linear discriminant function f_l(x) = w_l·x + b_l, where w_l = x* - y_p and b_l = (||y_p||² - ||x*||²)/2; that is, f_l(x) = w_l·x + b_l = 0 is the perpendicular bisector of the segment connecting the two closest points y_p and x*;
Step 3.4: mark the samples in Y' with f_l(x) < 0 and store the indices of these samples in a temporary data space IDS;
Step 3.5: delete the samples marked in IDS from Y'; the remaining sample set is still denoted Y';
Step 3.6: put f_l(x) into the linear discriminant function set LDFs;
Step 3.7: if Y' ≠ ∅, set l ← l + 1, empty IDS, and return to step 3.2;
Step 3.8: finalize the linear discriminant function set LDFs = {f_l(x), 1 ≤ l ≤ L};
Step 3.9: construct the classification model using the linear discriminant function set LDFs: CPC(x) = +1 if f_l(x) > 0 for every l, 1 ≤ l ≤ L, and CPC(x) = -1 otherwise.
the invention will be further described with reference to specific examples
Examples
The classification method of the class unbalanced data based on the convex polyhedron classifier provided by the invention comprises the following steps:
step 1: an unbalanced data set S of gaussian distribution is randomly generated [ as in fig. 4 ]. 200 minority class samples and 800 majority class samples. The training set T and the test set P are divided into 50 percent to 50 percent ratio. Then, a few class samples in the training set T are marked as x= { X i I is more than or equal to 1 and less than or equal to 100, and most types of samples in the training set T are marked as Y= { Y j J is more than or equal to 1 and less than or equal to 400; [ as in FIG. 5 ]
Step 2: setting an initial set of clean samples
Step 3: sequentially selecting each sample Y from Y j E, Y, j is more than or equal to 1 and less than or equal to 400, and is placed in a region to be detected;
step 4: calculating convex hull and y of X j Distance d (y) j CH (X)), if d (y) j CH (X)) > 0, y will be j Put into set Y'. Otherwise, will y j Marked as[ as in FIG. 6 ]
Step 5: delete 9 markers in Y asAnd confirm that all samples in Y have been detected, resulting in a clean sample set Y'. Realizing the differentiation of a convex polyhedron in a sample space; [ as in FIG. 7 ]
Step 6: initializing a set of linear discriminant functions
Step 7: by calculating the distance from the point to the convex hull, the nearest point pair (Y p ∈Y',x * ∈CH(X));
Step 8: by the nearest point pair (y p ,x * ) Calculate a linear discriminant function f 1 (x)=w 1 ·x+b 1 Wherein w is 1 =x * -y p ,b 1 =(||y p || 2 -||x * || 2 ) 2, i.e. f 1 (x)=w 1 ·x+b 1 =0 is the connection of 2 closest points y p And x * A perpendicular bisector of the connection line;
step 9: f in the label Y 1 (x) Samples < 0 and storing the numbers of these samples in a temporary data space IDS;
step 10: deleting the marked samples in the IDS from Y ', the remaining sample set still marked as Y';
step 11: will f 1 (x) Putting the linear discriminant function set LDFs;
step 12: repeating steps 7-11 until
Step 13: the temporary data space IDS is emptied, and the linear discriminant function set LDFs= { f is arranged l (x) 1.ltoreq.l.ltoreq.5, wherein,
f 1 (x)=-6x 1 -6x 2 +456
f 2 (x)=-16x 1 -5x 2 +809
f 3 (x)=-5x 1 -12x 2 +681
f 4 (x)=-21x 1 -4x 2 +1008
f 5 (x)=-12x 1 -21x 2 +756; [ as in FIG. 8 ].
Step 14: construction of classification models using linear discriminant functions
Step 15: the classification decisions are made on the samples in the test set P according to the model CPC (x) [ as in fig. 9 ]. Statistics form a confusion matrix [ as in table 1 ]. Calculating and outputting classification results, wherein the indexes comprise accuracy Precision, recall rate Recall and specificity rate Specificity, F 1 Metric F 1 Score, G metric G-Mean [ as table 2 ].
TABLE 1
The index calculation process comprises the following steps:
TABLE 2
Precision(%) Recall(%) Specificity(%) F1-Score(%) G-Mean(%)
95.15 98.00 98.75 96.55 98.37
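For illustration, the classification model of step 14 and the above index calculations can be sketched in Python. The decision rule (+1 only when all five discriminants are positive) is inferred from the separability requirements of step 3, the confusion-matrix counts in the usage example are inferred from the 100/400 test split and the percentages in Table 2 rather than quoted from Table 1, and the names W, b, cpc and imbalance_metrics are chosen here only for illustration:

```python
import numpy as np

# the five discriminant functions f_l(x) = w_l·x + b_l obtained in step 13
W = np.array([[ -6.0,  -6.0],
              [-16.0,  -5.0],
              [ -5.0, -12.0],
              [-21.0,  -4.0],
              [-12.0, -21.0]])
b = np.array([456.0, 809.0, 681.0, 1008.0, 756.0])

def cpc(x):
    """CPC(x): +1 (minority) only if x lies on the positive side of all five
    hyperplanes, i.e. inside the convex polyhedron; otherwise -1 (majority)."""
    return 1 if np.all(W @ np.asarray(x, dtype=float) + b > 0) else -1

def imbalance_metrics(tp, fn, fp, tn):
    """Precision, Recall, Specificity, F1-Score and G-Mean from the confusion
    matrix, with the minority class treated as the positive class."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)                  # sensitivity / true positive rate
    specificity = tn / (tn + fp)                  # true negative rate
    f1_score    = 2 * precision * recall / (precision + recall)
    g_mean      = (recall * specificity) ** 0.5
    return precision, recall, specificity, f1_score, g_mean

# example: counts consistent with a 100/400 test split and the values in Table 2
print(imbalance_metrics(tp=98, fn=2, fp=5, tn=395))
# -> (0.9515..., 0.98, 0.9875, 0.9655..., 0.9837...)
```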
The invention is applicable to datasets in high-dimensional space.
The invention is further described below in connection with specific experimental data.
1) Data for experiments
Data set | Number of samples | Minority class sample number | Imbalance rate | Feature number
Wisconsin | 683 | 239 | 0.54 | 9
Pima | 768 | 268 | 0.54 | 8
Glass | 214 | 70 | 0.49 | 9
Vehicle | 846 | 217 | 0.34 | 18
Ecoli | 336 | 77 | 0.30 | 7
Yeast | 1484 | 163 | 0.12 | 8
Vowel | 988 | 90 | 0.10 | 13
2) Experimental results 1: Precision (%)
Data set | The method of the invention | Oversampling (SMOTE) | Difference (invention vs. SMOTE) | Random undersampling | Difference (invention vs. undersampling)
Wisconsin | 92.15 | 92.06 | 0.09 | 92.06 | 0.09
Pima | 98.17 | 57.77 | 40.40 | 50.28 | 47.89
Glass | 66.70 | 77.29 | -10.59 | 68.14 | -1.44
Vehicle | 87.47 | 57.96 | 29.51 | 41.45 | 46.02
Ecoli | 74.12 | 66.91 | 7.21 | 70.81 | 3.31
Yeast | 56.05 | 71.79 | -15.74 | 60.00 | -3.95
Vowel | 89.65 | 100.00 | -10.35 | 100.00 | -10.35
Average difference | | | 5.79 | | 11.62
On Precision, the method is on average 5.79 percentage points higher than the oversampling (SMOTE) method and 11.62 percentage points higher than the random undersampling method.
3) Experimental results 2: Recall (%)
Data set | The method of the invention | Oversampling (SMOTE) | Difference (invention vs. SMOTE) | Random undersampling | Difference (invention vs. undersampling)
Wisconsin | 100.00 | 97.87 | 2.13 | 97.87 | 2.13
Pima | 100.00 | 43.40 | 56.60 | 79.25 | 20.75
Glass | 85.71 | 50.00 | 35.71 | 78.57 | 7.14
Vehicle | 95.35 | 25.58 | 69.77 | 83.72 | 11.63
Ecoli | 86.67 | 53.33 | 33.34 | 80.00 | 6.67
Yeast | 100.00 | 46.88 | 53.12 | 87.50 | 12.50
Vowel | 100.00 | 72.22 | 27.78 | 88.89 | 11.11
Average difference | | | 39.78 | | 10.28
On Recall, the method is on average 39.78 percentage points higher than the oversampling (SMOTE) method and 10.28 percentage points higher than the random undersampling method.
4) Experimental results 3: Specificity (%)
Data set | The method of the invention | Oversampling (SMOTE) | Difference (invention vs. SMOTE) | Random undersampling | Difference (invention vs. undersampling)
Wisconsin | 97.12 | 96.66 | 0.46 | 96.66 | 0.46
Pima | 64.60 | 60.02 | 4.58 | 61.43 | 3.17
Glass | 72.14 | 68.14 | 4.00 | 64.29 | 7.85
Vehicle | 51.67 | 48.93 | 2.74 | 51.54 | 0.13
Ecoli | 82.45 | 70.11 | 12.34 | 80.85 | 1.60
Yeast | 73.85 | 67.68 | 6.17 | 70.04 | 3.81
Vowel | 98.31 | 84.98 | 13.33 | 94.28 | 4.03
Average difference | | | 6.23 | | 3.01
On Specificity, the method is on average 6.23 percentage points higher than the oversampling (SMOTE) method and 3.01 percentage points higher than the random undersampling method.
5) Experimental results 4: F1-Score (%)
On F1-Score, the method is on average 23.85 percentage points higher than the oversampling (SMOTE) method and 11.75 percentage points higher than the random undersampling method.
6) Experimental results 5: G-Mean (%)
Data set | The method of the invention | Oversampling (SMOTE) | Difference (invention vs. SMOTE) | Random undersampling | Difference (invention vs. undersampling)
Wisconsin | 98.55 | 97.26 | 1.29 | 97.26 | 1.29
Pima | 80.37 | 51.04 | 29.33 | 69.77 | 10.60
Glass | 78.63 | 58.37 | 20.26 | 71.07 | 7.56
Vehicle | 70.19 | 35.38 | 34.81 | 65.69 | 4.50
Ecoli | 84.53 | 61.15 | 23.38 | 80.42 | 4.11
Yeast | 85.94 | 56.33 | 29.61 | 78.28 | 7.66
Vowel | 99.15 | 78.34 | 20.81 | 91.55 | 7.60
Average difference | | | 22.78 | | 6.19
On G-Mean, the method is on average 22.78 percentage points higher than the oversampling (SMOTE) method and 6.19 percentage points higher than the random undersampling method.
In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more; the terms "upper," "lower," "left," "right," "inner," "outer," "front," "rear," "head," "tail," and the like are used as an orientation or positional relationship based on that shown in the drawings, merely to facilitate description of the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and therefore should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.

Claims (4)

1. A convex polyhedron classifier-based class imbalance data classification system, comprising:
a finite imbalanced data set preprocessing module, used for dividing a given finite imbalanced data set S into a training set T and a test set P, and then marking the minority-class samples X and the majority-class samples Y in the training set T;
a sample-space convex polyhedron differentiation module, used for representing the convex hull of X by convex combinations of the samples in the minority-class set X and providing the judgment of whether the convex polyhedrons of the two sample sets X and Y' are separable; the samples of Y that do not lie in the convex hull of X are detected and form the pure sample set Y', realizing the convex polyhedron differentiation of the sample space;
a classification model construction module, which runs the convex polyhedron construction algorithm on X and Y' to obtain a set LDFs of linear discriminant functions and constructs a classification model CPC(x) from the set LDFs;
a classification decision module, used for making classification decisions on the samples in the test set P according to the model CPC(x) and outputting the classification results;
the classification method of class-imbalance data based on the convex polyhedron classifier comprises the following steps:
Step 1: for a given finite imbalanced data set S, divide it into a training set T and a test set P in a 50:50 ratio; then mark the minority-class samples in the training set T as X = {x_i | 1 ≤ i ≤ m}, where m is the number of minority-class samples, and the majority-class samples in the training set T as Y = {y_j | 1 ≤ j ≤ n}, where n is the number of majority-class samples;
Step 2: represent the convex hull of X by the convex combinations of the samples in the minority-class set X, i.e. CH(X) = {x | x = Σ_{1≤i≤m} α_i x_i, Σ_{1≤i≤m} α_i = 1, x_i ∈ X, α_i ≥ 0}, and give the criterion for deciding whether the two sample sets X and Y' are convex-polyhedron separable: if the intersection of the convex hull of X and Y' is empty, expressed as CH(X) ∩ Y' = ∅, then X is said to be convex-polyhedron separable relative to Y'; then detect the samples of Y that do not lie in the convex hull of X, which form the pure sample set Y', so that the convex polyhedron differentiation of the sample space is realized;
Step 3: run the convex polyhedron construction algorithm on X and Y' to obtain a set of linear discriminant functions LDFs = {f_l(x), 1 ≤ l ≤ L} that satisfies f_l(x_i) > 0 for every x_i ∈ X and every l, and, for every y_j ∈ Y', f_l(y_j) < 0 for at least one l; construct a classification model CPC(x) from the set LDFs, expressed as CPC(x) = +1, x ∈ X; CPC(x) = -1, x ∈ Y;
Step 4: make classification decisions on the samples in the test set P according to the model CPC(x) and output the classification results, wherein the evaluation indexes include Precision, Recall, Specificity, the F1 measure F1-Score, and the G measure G-Mean;
the step 2 comprises the following steps:
Step 2.1: set the initial pure sample set Y' = ∅ and the initial sample index variable k = 1;
Step 2.2: select a single sample y_k from the majority-class set Y and place it in the region to be detected;
Step 2.3: compute the distance from y_k to the convex hull of X as
d(y_k, CH(X)) = min{d(y_k, x), x ∈ CH(X)};
Step 2.4: if d(y_k, CH(X)) > 0, put y_k into the set Y';
Step 2.5: increase the sample index variable k by 1, i.e. k ← k + 1;
Step 2.6: if undetected samples remain in Y, i.e. k ≤ n, go to step 2.2; otherwise go to step 2.7;
Step 2.7: obtain the sample set Y' whose intersection with the convex hull of X is empty, i.e. CH(X) ∩ Y' = ∅, realizing the convex polyhedron differentiation of the sample space;
the step 3 specifically comprises the following steps:
Step 3.1: initialize the set of linear discriminant functions LDFs = ∅ and the linear discriminant function index variable l = 1;
Step 3.2: by computing the distances from points to the convex hull, find the closest point pair (y_p ∈ Y', x* ∈ CH(X));
Step 3.3: use the closest point pair (y_p, x*) to compute a linear discriminant function f_l(x) = w_l·x + b_l, where w_l = x* - y_p and b_l = (||y_p||² - ||x*||²)/2; that is, f_l(x) = w_l·x + b_l = 0 is the perpendicular bisector of the segment connecting the two closest points y_p and x*;
Step 3.4: mark the samples in Y' with f_l(x) < 0 and store the indices of these samples in a temporary data space IDS;
Step 3.5: delete the samples marked in IDS from Y'; the remaining sample set is still denoted Y';
Step 3.6: put f_l(x) into the linear discriminant function set LDFs;
Step 3.7: if Y' ≠ ∅, set l ← l + 1, empty IDS, and return to step 3.2;
Step 3.8: finalize the linear discriminant function set LDFs = {f_l(x), 1 ≤ l ≤ L};
Step 3.9: construct the classification model using the linear discriminant function set LDFs: CPC(x) = +1 if f_l(x) > 0 for every l, 1 ≤ l ≤ L, and CPC(x) = -1 otherwise;
the method divides the class-imbalanced data set into a training set T and a test set P, marks the minority-class samples in the training set T as X = {x_i | 1 ≤ i ≤ m} and the majority-class samples in the training set as Y = {y_j | 1 ≤ j ≤ n}; detects the samples of Y that fall inside the convex hull of X, removes them, and denotes the set of remaining samples of Y as Y'; trains a convex polyhedron classification model CPC(x) between X and Y' using the convex polyhedron construction algorithm; and uses the classification model CPC(x) to judge the class of each sample in the test set P, and can be used to effectively solve the problems of bank fraud detection, disease diagnosis and risk behavior assessment.
2. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of claim 1.
3. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of claim 1.
4. An information processing terminal for bank fraud detection, disease diagnosis and risk behavior assessment, carrying the class imbalance data classification system based on a convex polyhedron classifier according to claim 1, the terminal comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the method of claim 1.
CN202010904076.2A 2020-09-01 2020-09-01 Category imbalance data classification method and system based on convex polyhedron classifier Active CN112035719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010904076.2A CN112035719B (en) 2020-09-01 2020-09-01 Category imbalance data classification method and system based on convex polyhedron classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010904076.2A CN112035719B (en) 2020-09-01 2020-09-01 Category imbalance data classification method and system based on convex polyhedron classifier

Publications (2)

Publication Number Publication Date
CN112035719A CN112035719A (en) 2020-12-04
CN112035719B true CN112035719B (en) 2024-02-20

Family

ID=73590773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010904076.2A Active CN112035719B (en) 2020-09-01 2020-09-01 Category imbalance data classification method and system based on convex polyhedron classifier

Country Status (1)

Country Link
CN (1) CN112035719B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492096A (en) * 2018-10-23 2019-03-19 华东理工大学 A kind of unbalanced data categorizing system integrated based on geometry
CN110533116A (en) * 2019-09-04 2019-12-03 大连大学 Based on the adaptive set of Euclidean distance at unbalanced data classification method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180210944A1 (en) * 2017-01-26 2018-07-26 Agt International Gmbh Data fusion and classification with imbalanced datasets

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492096A (en) * 2018-10-23 2019-03-19 华东理工大学 A kind of unbalanced data categorizing system integrated based on geometry
CN110533116A (en) * 2019-09-04 2019-12-03 大连大学 Based on the adaptive set of Euclidean distance at unbalanced data classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An effective method to determine whether a point is within a convex hull and its generalized convex polyhedron classifier; Qiangkui Leng, et al.; Information Sciences; Vol. 504; 435-448 *

Also Published As

Publication number Publication date
CN112035719A (en) 2020-12-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant