CN112035719B - Category imbalance data classification method and system based on convex polyhedron classifier - Google Patents
- Publication number
- CN112035719B CN112035719B CN202010904076.2A CN202010904076A CN112035719B CN 112035719 B CN112035719 B CN 112035719B CN 202010904076 A CN202010904076 A CN 202010904076A CN 112035719 B CN112035719 B CN 112035719B
- Authority
- CN
- China
- Prior art keywords
- samples
- convex
- sample
- classification
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Abstract
The invention belongs to the technical field of artificial intelligence and information science, and discloses a classification method and system for class-imbalanced data based on a convex polyhedron classifier. The method comprises the following steps: dividing the class-imbalanced data set S into a training set T and a test set P, and marking the minority class samples X and the majority class samples Y in the training set T; detecting the samples of Y that fall into the convex hull of X, removing them, and denoting the set of remaining samples of Y as Y'; training a convex polyhedron classification model between X and Y' using the convex polyhedron construction algorithm; and judging the class of each sample in the test set P with the obtained classification model. In solving the class-imbalance classification problem, the method fully considers the natural distribution of the data, needs no balancing preprocessing and no excessive parameter tuning, is simple to implement, is suitable for high-dimensional data, and has strong generalization ability. The convex polyhedron classifier is also applied to the field of imbalanced data classification for the first time, which is highly original.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence and information science, and particularly relates to a class-imbalanced data classification method and system based on a convex polyhedron classifier.
Background
The existing method for solving the unbalanced classification problem mainly changes the distribution of training samples through data resampling technologies such as over-sampling, under-sampling and the like, so that the unbalanced degree of data is reduced. And then, the balanced data is fed to a specific classifier to make classification decisions.
Oversampling increases the samples of the minority class to bring the minority and majority classes into quantitative balance. It typically synthesizes new, non-repeated samples between two closely spaced minority class samples using k-nearest-neighbor search and linear interpolation. But this approach tends to cause data overlap at the classification boundary, making it difficult for the classifier to distinguish the class attributes of boundary samples. Even if the two classes of samples are forcibly separated, the resulting classification surfaces tend to be very complex, leading to over-fitting.
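For concreteness, the k-nearest-neighbor plus linear-interpolation idea described above can be sketched as follows (a minimal illustration, not the SMOTE reference implementation; the function name and parameters are our own):

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, rng=None):
    """Synthesize n_new minority samples by linear interpolation between a
    random minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # distances from sample i to every minority sample (itself included)
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority convex hull, yet near the class boundary they can still overlap majority samples, which is exactly the drawback noted above.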
Undersampling reduces the number of majority class samples to the level of the minority class to maintain balance. It typically deletes majority class samples according to certain cleaning rules, or reduces them using clustering. But cleaning may mistakenly delete important samples, losing classification information; and clustering uses cluster centers to decide which samples to retain, so important boundary-point information may be lost.
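The random variant of undersampling amounts to no more than the following sketch (illustrative code with our own naming), which makes the information-loss risk obvious: any row not drawn is simply discarded, boundary points included:

```python
import numpy as np

def random_undersample(majority, n_keep, rng=None):
    """Keep n_keep randomly chosen majority samples and drop the rest;
    the dropped rows may include exactly the boundary points a classifier needs."""
    rng = np.random.default_rng(rng)
    majority = np.asarray(majority)
    idx = rng.choice(len(majority), size=n_keep, replace=False)
    return majority[idx]
```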
Recent studies have also shown that the 50:50 balanced data formed by resampling techniques is not more discriminative than the original data. This shows that solving the imbalanced classification problem from the data-balancing standpoint is an artificial, empirical practice, and there is no evidence from practical use proving its effectiveness.
Through the above analysis, the problems and defects existing in the prior art are as follows:
(1) The existing method needs to perform balanced preprocessing on data, so that data distribution is changed, and the problem of global information loss occurs.
(2) The existing method takes data rebalancing as a premise, and can not fully consider the natural distribution characteristic of the data.
(3) The prior art has limitations in solving the practical imbalanced application problem of medical auxiliary diagnosis. In this problem, people suffering from severe disease are the minority and healthy people the majority. If a balancing technique is used, the following occurs: [a] oversampling synthesizes new patient data that, in spatial distribution, necessarily tends toward the majority class, i.e. the synthesized (diseased) samples and the majority class (healthy) samples overlap at the boundary, making them difficult to distinguish; [b] undersampling deletes a large portion of the samples in the majority class, which moves the classification boundary toward the majority class, i.e. in later decisions a significant portion of healthy people are misclassified as ill, increasing the burden of subsequent examinations.
The difficulty of solving the problems and the defects is as follows:
the existing unbalanced classifier depends on the rebalancing treatment of data when generating a decision surface, namely, the data needs to be balanced first and then can work. The process of rebalancing the data destroys the natural distribution of the data so that valid information is overwritten or deleted. Therefore, it becomes particularly difficult to effectively utilize natural distribution information of the original data. In addition, protecting few class samples with high misclassification cost from being destroyed is a difficult problem that the existing method cannot effectively solve.
The meaning of solving the problems and the defects is as follows:
(1) The preprocessing process of data rebalancing is not used, so that the response time of classification decision is shortened;
(2) Effective information can be learned from the natural distribution form of the original data;
(3) The information of the minority sample is protected from being destroyed, and the importance of the minority sample with high misclassification cost is reflected;
(4) The unbalanced classification system module is simplified, and the system structure is lightweight.
The invention provides a lightweight solution to the imbalance problem that is methodologically simple, easy to implement, and learns the minority class samples accurately. The invention fully exploits the potential of the minority class samples with high misclassification cost for improving the performance of the classification model.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a class-imbalanced data classification method and system based on a convex polyhedron classifier.
The invention is realized in such a way that the class-imbalanced data classification system based on a convex polyhedron classifier comprises:
the imbalanced data set preprocessing module, used for dividing a given finite imbalanced data set S into a training set T and a test set P, then marking the minority class samples X and the majority class samples Y in the training set T;
the sample-space convex polyhedron differentiation module, used for representing the convex hull of X by convex combinations of the samples in the minority class set X and providing the convex-polyhedron-separability judgment for the two sample sets X and Y'; it detects the samples of Y that do not lie in the convex hull of X, which form the clean sample set Y', thereby realizing the convex polyhedron differentiation of the sample space;
the classification model construction module, which runs the convex polyhedron construction algorithm on X and Y' to obtain a set LDFs of linear discriminant functions and constructs the classification model CPC(x) from the set LDFs;
and the classification decision module, used for making classification decisions on the samples of the test set P according to the model CPC(x) and outputting the classification results.
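As a rough illustration of the preprocessing module's split-and-mark step, the following sketch follows the description above (the class and method names, the label convention, and the 50:50 split are our own assumptions; the patent specifies no implementation):

```python
import numpy as np

class ImbalancedCPCSystem:
    """Illustrative skeleton of the preprocessing module described above."""

    def preprocess(self, S_X, S_y, rng=None):
        """Split S into training and test halves, then separate the training
        half into minority set X (label +1) and majority set Y (label -1)."""
        rng = np.random.default_rng(rng)
        idx = rng.permutation(len(S_y))
        half = len(S_y) // 2
        tr, te = idx[:half], idx[half:]
        T_X, T_y = S_X[tr], S_y[tr]
        X = T_X[T_y == +1]   # minority class samples of the training set
        Y = T_X[T_y == -1]   # majority class samples of the training set
        return X, Y, (S_X[te], S_y[te])
```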
Another object of the present invention is to provide a classification method of class imbalance data based on a convex polyhedron classifier, comprising:
step 1: for a given finite imbalanced data set S, divide it into a training set T and a test set P in a 50:50 ratio. Then mark the minority class samples in the training set T as X = {x_i, 1 ≤ i ≤ m}, where m is the number of minority class samples, and mark the majority class samples in T as Y = {y_j, 1 ≤ j ≤ n}, where n is the number of majority class samples;
step 2: first, represent the convex hull of X by the convex combinations of the samples in the minority class set X, i.e. CH(X) = {x | x = Σ_{1≤i≤m} α_i x_i, Σ_{1≤i≤m} α_i = 1, x_i ∈ X, α_i ≥ 0}, and give the convex-polyhedron-separability criterion for the two sample sets X and Y': if the intersection of the convex hull of X with Y' is empty, expressed as CH(X) ∩ Y' = ∅, then X is said to be convex polyhedron separable relative to Y'. Then detect the samples of Y that fall inside the convex hull of X; the remaining samples form the clean sample set Y', realizing the convex polyhedron differentiation of the sample space.
The method comprises the following specific steps:
step 2.1: set the initial clean sample set Y' = ∅, and set the initial sample index variable k = 1;
step 2.2: select a single sample y_k from the majority class set Y and place it in the region to be detected;
step 2.3: calculate the distance from the convex hull of X to y_k:
d(y_k, CH(X)) = min{d(y_k, x), x ∈ CH(X)};
step 2.4: if d(y_k, CH(X)) > 0, put y_k into the set Y';
step 2.5: increase the sample index variable k by 1, i.e. k ← k + 1;
step 2.6: if undetected samples remain in Y, i.e. k ≤ n, go to step 2.2; otherwise go to step 2.7;
step 2.7: obtain the sample set Y' whose intersection with the convex hull of X is empty, i.e. CH(X) ∩ Y' = ∅, realizing the convex polyhedron differentiation of the sample space.
step 3: run the convex polyhedron construction algorithm on X and Y' to obtain a set of linear discriminant functions LDFs = {f_l(x), 1 ≤ l ≤ L} satisfying min_{1≤l≤L} f_l(x) > 0 for every x ∈ X. A classification model CPC(x) is constructed from the set LDFs and expressed as: CPC(x) = +1, x ∈ X; CPC(x) = -1, x ∈ Y. The method comprises the following specific steps:
step 3.1: initialize the set of linear discriminant functions LDFs = ∅, and initialize the discriminant function index variable L = 1;
step 3.2: find the globally closest point pair (y_p ∈ Y', x* ∈ CH(X)) by computing point-to-convex-hull distances;
step 3.3: use the closest point pair (y_p, x*) to compute a linear discriminant function f_L(x) = w_L·x + b_L, where w_L = x* - y_p and b_L = (||y_p||² - ||x*||²)/2, i.e. f_L(x) = w_L·x + b_L = 0 is the perpendicular bisector of the segment connecting the two closest points y_p and x*;
step 3.4: mark the samples in Y' with f_L(x) < 0 and store the indices of these samples in a temporary data space IDS;
step 3.5: delete the samples marked in IDS from Y'; the remaining sample set is still denoted Y';
step 3.6: put f_L(x) into the linear discriminant function set LDFs;
step 3.7: if Y' ≠ ∅, set L ← L + 1, empty IDS, and return to step 3.2;
step 3.8: finish the linear discriminant function set LDFs = {f_l(x), 1 ≤ l ≤ L};
step 3.9: construct the classification model from the linear discriminant function set LDFs: CPC(x) = +1 if min_{1≤l≤L} f_l(x) > 0, and CPC(x) = -1 otherwise.
step 4: make classification decisions on the samples of the test set P according to the model CPC(x) and output the classification results; the evaluation indices include Precision, Recall, Specificity, F1-Score, and G-Mean.
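The five evaluation indices of step 4 can be computed from a confusion matrix as follows (minority class taken as the positive class; a sketch using the standard formulas, not code from the patent):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Five standard indices with the minority class as the positive class;
    values returned in percent, rounded to two decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity on the minority class
    specificity = tn / (tn + fp)       # accuracy on the majority class
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return {'Precision': round(100 * precision, 2),
            'Recall': round(100 * recall, 2),
            'Specificity': round(100 * specificity, 2),
            'F1-Score': round(100 * f1, 2),
            'G-Mean': round(100 * g_mean, 2)}
```

With the embodiment's test set of 100 minority and 400 majority samples, the counts TP=98, FN=2, FP=5, TN=395 reproduce the Table 2 values exactly; this is an inference about the confusion matrix, not something stated in the text.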
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method.
It is a further object of the invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method.
Another object of the present invention is to provide an information processing terminal for solving bank fraud detection, disease diagnosis, risk behavior assessment, which is equipped with the class imbalance data classification system based on a convex polyhedron classifier, the terminal comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the method.
By combining all the technical schemes, the invention has the advantages and positive effects that:
the method provided by the invention divides the class unbalanced data set into a training set T and a test set P, and marks a few class samples in the training set T as X= { X i I is more than or equal to 1 and less than or equal to m, and most samples in the training set are marked as Y= { Y j J is more than or equal to 1 and less than or equal to n; detecting samples falling into the X convex hulls in the Y, removing the samples, and marking the set of the residual samples in the Y as Y'; training a convex polyhedron classification model CPC (X) between X and Y' using a convex polyhedron construction algorithm; the classification model CPC (x) obtained is used for judging the classification of each sample in the test set P. The method can be used for effectively solving the problems of bank fraud detection, disease diagnosis, risk behavior assessment and the like.
In solving the class-imbalance classification problem, the method fully considers the natural distribution of the data, needs no balancing preprocessing and no excessive parameter tuning, is simple to implement, is suitable for high-dimensional data, and has strong generalization ability. The convex polyhedron classifier is also applied to the field of imbalanced data classification for the first time, which is highly original.
Compared with the prior art, the method makes full use of the advantages of local linear functions (simple implementation, strong approximation capability, good interpretability) to establish a convex polyhedron classification model that tightly encloses the minority class samples; it fully considers the natural distribution of the class-imbalanced data, discards data rebalancing, and fully exploits the potential of the minority class samples with high misclassification cost for improving model performance; and it fully exerts the few-parameter, distribution-assumption-free character of the convex polyhedron classifier, avoiding, from the standpoint of real-time performance, the dependence of traditional methods on complex processing mechanisms and excessive parameter tuning.
The class unbalanced data classification system has simple modules, short decision response time and easy realization and expansion.
Compared with the existing oversampling and undersampling methods, the classification method for class-imbalanced data provided by the invention improves the evaluation indices Precision, Recall, Specificity, F1-Score, and G-Mean;
the techniques and methods involved in the present invention can be very easily implemented on a computer system;
the terminal provided with the class unbalanced data classification system based on the convex polyhedron classifier can realize early warning of abnormal events such as bank fraud detection, disease diagnosis, risk behavior assessment and the like.
On international benchmark evaluation data sets, the method of the invention has obvious advantages over the oversampling and undersampling methods on the evaluation indices Precision, Recall, Specificity, F1-Score, and G-Mean. The specific data are as follows: on Precision, the method is on average 5.79 percentage points higher than the oversampling (SMOTE) method and 11.62 percentage points higher than random undersampling; on Recall, on average 39.78 percentage points higher than SMOTE and 10.28 percentage points higher than random undersampling; on Specificity, on average 6.23 percentage points higher than SMOTE and 3.01 percentage points higher than random undersampling; on F1-Score, on average 23.85 percentage points higher than SMOTE and 11.75 percentage points higher than random undersampling; on G-Mean, on average 22.78 percentage points higher than SMOTE and 6.19 percentage points higher than random undersampling.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description will briefly explain the drawings needed in the embodiments of the present application, and it is obvious that the drawings described below are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a classification method of class imbalance data based on a convex polyhedron classifier according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of sample space convex polyhedron differentiation provided by an embodiment of the present invention.
FIG. 3 is a schematic diagram of a convex polyhedron construction algorithm provided by an embodiment of the present invention.
FIG. 4 is a graph of the effect of an unbalanced data set on a randomly generated Gaussian distribution provided by an embodiment of the invention.
Fig. 5 is a graph of the effect of a few class samples and a majority class samples in the labeled training set T according to an embodiment of the present invention.
Fig. 6 is a sample effect diagram of labels falling into a minority class convex hull provided by an embodiment of the present invention.
Fig. 7 is a diagram showing the effect of realizing the sample space convex polyhedron differentiation according to the embodiment of the present invention.
FIG. 8 is a graph showing the effect of computing a set of linear discriminant functions provided by an embodiment of the present invention.
Fig. 9 is a diagram of classification decision according to the model CPC (x) for samples in the test set P according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Aiming at the problems existing in the prior art, the invention provides a classification method and a classification system for class unbalanced data based on a convex polyhedron classifier, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a classification method of class imbalance data based on a convex polyhedron classifier, comprising:
step 1: for a given finite imbalanced data set S, divide it into a training set T and a test set P in a 50:50 ratio. Then mark the minority class samples in the training set T as X = {x_i, 1 ≤ i ≤ m}, where m is the number of minority class samples, and mark the majority class samples in T as Y = {y_j, 1 ≤ j ≤ n}, where n is the number of majority class samples;
step 2: first, represent the convex hull of X by the convex combinations of the samples in the minority class set X, i.e. CH(X) = {x | x = Σ_{1≤i≤m} α_i x_i, Σ_{1≤i≤m} α_i = 1, x_i ∈ X, α_i ≥ 0}, and give the convex-polyhedron-separability criterion for the two sample sets X and Y': if the intersection of the convex hull of X with Y' is empty, expressed as CH(X) ∩ Y' = ∅, then X is said to be convex polyhedron separable relative to Y'. Then detect the samples of Y that fall inside the convex hull of X; the remaining samples form the clean sample set Y', realizing the convex polyhedron differentiation of the sample space.
step 3: run the convex polyhedron construction algorithm on X and Y' to obtain a set of linear discriminant functions LDFs = {f_l(x), 1 ≤ l ≤ L}; construct the classification model CPC(x) from the set LDFs, expressed as CPC(x) = +1, x ∈ X; CPC(x) = -1, x ∈ Y.
step 4: make classification decisions on the samples of the test set P according to the model CPC(x) and output the classification results; the evaluation indices include Precision, Recall, Specificity, F1-Score, and G-Mean.
As shown in fig. 2, the specific steps of step 2 include:
step 2.1: set the initial clean sample set Y' = ∅, and set the initial sample index variable k = 1;
step 2.2: select a single sample y_k from the majority class set Y and place it in the region to be detected;
step 2.3: calculate the distance from the convex hull of X to y_k:
d(y_k, CH(X)) = min{d(y_k, x), x ∈ CH(X)};
step 2.4: if d(y_k, CH(X)) > 0, put y_k into the set Y';
step 2.5: increase the sample index variable k by 1, i.e. k ← k + 1;
step 2.6: if undetected samples remain in Y, i.e. k ≤ n, go to step 2.2; otherwise go to step 2.7;
step 2.7: obtain the sample set Y' whose intersection with the convex hull of X is empty, i.e. CH(X) ∩ Y' = ∅, realizing the convex polyhedron differentiation of the sample space.
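Steps 2.1-2.7 hinge on the point-to-convex-hull distance d(y_k, CH(X)). One way to sketch it is to minimize ||Σ_i α_i x_i - y_k||² over the simplex of convex coefficients, here with SciPy's SLSQP solver (an illustrative choice; the patent does not prescribe a solver, and a dedicated QP solver would be the usual production choice):

```python
import numpy as np
from scipy.optimize import minimize

def dist_to_convex_hull(y, X):
    """d(y, CH(X)): minimise ||X^T a - y||^2 over the simplex
    {a : a_i >= 0, sum_i a_i = 1}."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    m = len(X)
    res = minimize(lambda a: np.sum((X.T @ a - y) ** 2),
                   np.full(m, 1.0 / m),            # start at the centroid
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}],
                   method='SLSQP')
    return np.sqrt(max(res.fun, 0.0))

def purify_majority(Y, X, tol=1e-6):
    """Steps 2.1-2.7: keep only majority samples strictly outside CH(X)."""
    return np.array([yk for yk in Y if dist_to_convex_hull(yk, X) > tol])
```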
As shown in fig. 3, the specific steps of step 3 include:
step 3.1: initialize the set of linear discriminant functions LDFs = ∅, and initialize the discriminant function index variable L = 1;
step 3.2: find the globally closest point pair (y_p ∈ Y', x* ∈ CH(X)) by computing point-to-convex-hull distances;
step 3.3: use the closest point pair (y_p, x*) to compute a linear discriminant function f_L(x) = w_L·x + b_L, where w_L = x* - y_p and b_L = (||y_p||² - ||x*||²)/2, i.e. f_L(x) = w_L·x + b_L = 0 is the perpendicular bisector of the segment connecting the two closest points y_p and x*;
step 3.4: mark the samples in Y' with f_L(x) < 0 and store the indices of these samples in a temporary data space IDS;
step 3.5: delete the samples marked in IDS from Y'; the remaining sample set is still denoted Y';
step 3.6: put f_L(x) into the linear discriminant function set LDFs;
step 3.7: if Y' ≠ ∅, set L ← L + 1, empty IDS, and return to step 3.2;
step 3.8: finish the linear discriminant function set LDFs = {f_l(x), 1 ≤ l ≤ L};
step 3.9: construct the classification model from the linear discriminant function set LDFs.
the invention will be further described with reference to specific examples
Examples
The classification method of the class unbalanced data based on the convex polyhedron classifier provided by the invention comprises the following steps:
step 1: randomly generate an imbalanced data set S of Gaussian distribution [as in fig. 4], with 200 minority class samples and 800 majority class samples. Divide it into a training set T and a test set P in a 50:50 ratio. Then mark the minority class samples in T as X = {x_i, 1 ≤ i ≤ 100} and the majority class samples in T as Y = {y_j, 1 ≤ j ≤ 400} [as in fig. 5];
step 2: set the initial clean sample set Y' = ∅;
step 3: sequentially select each sample y_j ∈ Y, 1 ≤ j ≤ 400, and place it in the region to be detected;
step 4: calculate the distance d(y_j, CH(X)) between the convex hull of X and y_j; if d(y_j, CH(X)) > 0, put y_j into the set Y'; otherwise mark y_j as falling inside the convex hull [as in fig. 6];
step 5: delete the 9 samples so marked from Y and confirm that all samples in Y have been detected, obtaining the clean sample set Y' and realizing the convex polyhedron differentiation of the sample space [as in fig. 7];
step 6: initialize the set of linear discriminant functions LDFs = ∅;
step 7: find the globally closest point pair (y_p ∈ Y', x* ∈ CH(X)) by computing point-to-convex-hull distances;
step 8: use the closest point pair (y_p, x*) to compute a linear discriminant function f_1(x) = w_1·x + b_1, where w_1 = x* - y_p and b_1 = (||y_p||² - ||x*||²)/2, i.e. f_1(x) = w_1·x + b_1 = 0 is the perpendicular bisector of the segment connecting the two closest points y_p and x*;
step 9: mark the samples in Y' with f_1(x) < 0 and store the indices of these samples in a temporary data space IDS;
step 10: delete the samples marked in IDS from Y'; the remaining sample set is still denoted Y';
step 11: put f_1(x) into the linear discriminant function set LDFs;
step 12: repeat steps 7-11 until Y' = ∅;
step 13: empty the temporary data space IDS and arrange the linear discriminant function set LDFs = {f_l(x), 1 ≤ l ≤ 5}, where
f_1(x) = -6x_1 - 6x_2 + 456
f_2(x) = -16x_1 - 5x_2 + 809
f_3(x) = -5x_1 - 12x_2 + 681
f_4(x) = -21x_1 - 4x_2 + 1008
f_5(x) = -12x_1 - 21x_2 + 756 [as in fig. 8];
step 14: construct the classification model CPC(x) from the linear discriminant function set LDFs;
step 15: make classification decisions on the samples of the test set P according to the model CPC(x) [as in fig. 9]; the statistics form a confusion matrix [as in table 1]; calculate and output the classification results, with indices including Precision, Recall, Specificity, F1-Score, and G-Mean [as in table 2].
TABLE 1
The index calculation process is: Precision = TP/(TP+FP); Recall = TP/(TP+FN); Specificity = TN/(TN+FP); F1-Score = 2·Precision·Recall/(Precision+Recall); G-Mean = √(Recall·Specificity), where TP, FN, FP, TN are the entries of the confusion matrix with the minority class as the positive class.
TABLE 2
| Precision(%) | Recall(%) | Specificity(%) | F1-Score(%) | G-Mean(%) |
|---|---|---|---|---|
| 95.15 | 98.00 | 98.75 | 96.55 | 98.37 |
The invention is applicable to datasets in high-dimensional space.
The invention is further described below in connection with specific experimental data.
1) Data for experiments
| Data set | Number of samples | Number of minority class samples | Imbalance ratio | Number of features |
|---|---|---|---|---|
| Wisconsin | 683 | 239 | 0.54 | 9 |
| Pima | 768 | 268 | 0.54 | 8 |
| Glass | 214 | 70 | 0.49 | 9 |
| Vehicle | 846 | 217 | 0.34 | 18 |
| Ecoli | 336 | 77 | 0.30 | 7 |
| Yeast | 1484 | 163 | 0.12 | 8 |
| Vowel | 988 | 90 | 0.10 | 13 |
2) Experimental results 1: Precision (%)
Data set | Proposed method | Oversampling (SMOTE) | Difference vs. SMOTE | Random undersampling | Difference vs. undersampling
---|---|---|---|---|---
Wisconsin | 92.15 | 92.06 | 0.09 | 92.06 | 0.09
Pima | 98.17 | 57.77 | 40.40 | 50.28 | 47.89
Glass | 66.70 | 77.29 | -10.59 | 68.14 | -1.44
Vehicle | 87.47 | 57.96 | 29.51 | 41.45 | 46.02
Ecoli | 74.12 | 66.91 | 7.21 | 70.81 | 3.31
Yeast | 56.05 | 71.79 | -15.74 | 60.00 | -3.95
Vowel | 89.65 | 100.00 | -10.35 | 100.00 | -10.35
Average difference | | | 5.79 | | 11.62
In terms of the accuracy Precision, the proposed method is on average 5.79 percentage points higher than the oversampling (SMOTE) method and 11.62 percentage points higher than the random undersampling method.
3) Experimental results 2: Recall (%)
Data set | Proposed method | Oversampling (SMOTE) | Difference vs. SMOTE | Random undersampling | Difference vs. undersampling
---|---|---|---|---|---
Wisconsin | 100.00 | 97.87 | 2.13 | 97.87 | 2.13
Pima | 100.00 | 43.40 | 56.60 | 79.25 | 20.75
Glass | 85.71 | 50.00 | 35.71 | 78.57 | 7.14
Vehicle | 95.35 | 25.58 | 69.77 | 83.72 | 11.63
Ecoli | 86.67 | 53.33 | 33.34 | 80.00 | 6.67
Yeast | 100.00 | 46.88 | 53.12 | 87.50 | 12.50
Vowel | 100.00 | 72.22 | 27.78 | 88.89 | 11.11
Average difference | | | 39.78 | | 10.28
In terms of the recall rate Recall, the proposed method is on average 39.78 percentage points higher than the oversampling (SMOTE) method and 10.28 percentage points higher than the random undersampling method.
4) Experimental results 3: Specificity (%)
Data set | Proposed method | Oversampling (SMOTE) | Difference vs. SMOTE | Random undersampling | Difference vs. undersampling
---|---|---|---|---|---
Wisconsin | 97.12 | 96.66 | 0.46 | 96.66 | 0.46
Pima | 64.60 | 60.02 | 4.58 | 61.43 | 3.17
Glass | 72.14 | 68.14 | 4.00 | 64.29 | 7.85
Vehicle | 51.67 | 48.93 | 2.74 | 51.54 | 0.13
Ecoli | 82.45 | 70.11 | 12.34 | 80.85 | 1.60
Yeast | 73.85 | 67.68 | 6.17 | 70.04 | 3.81
Vowel | 98.31 | 84.98 | 13.33 | 94.28 | 4.03
Average difference | | | 6.23 | | 3.01
In terms of the specificity Specificity, the proposed method is on average 6.23 percentage points higher than the oversampling (SMOTE) method and 3.01 percentage points higher than the random undersampling method.
5) Experimental results 4: F1-Score (%)
In terms of the F1 metric F1-Score, the proposed method is on average 23.85 percentage points higher than the oversampling (SMOTE) method and 11.75 percentage points higher than the random undersampling method.
6) Experimental results 5: G-Mean (%)
Data set | Proposed method | Oversampling (SMOTE) | Difference vs. SMOTE | Random undersampling | Difference vs. undersampling
---|---|---|---|---|---
Wisconsin | 98.55 | 97.26 | 1.29 | 97.26 | 1.29
Pima | 80.37 | 51.04 | 29.33 | 69.77 | 10.60
Glass | 78.63 | 58.37 | 20.26 | 71.07 | 7.56
Vehicle | 70.19 | 35.38 | 34.81 | 65.69 | 4.50
Ecoli | 84.53 | 61.15 | 23.38 | 80.42 | 4.11
Yeast | 85.94 | 56.33 | 29.61 | 78.28 | 7.66
Vowel | 99.15 | 78.34 | 20.81 | 91.55 | 7.60
Average difference | | | 22.78 | | 6.19
In terms of the G metric G-Mean, the proposed method is on average 22.78 percentage points higher than the oversampling (SMOTE) method and 6.19 percentage points higher than the random undersampling method.
In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more; the terms "upper," "lower," "left," "right," "inner," "outer," "front," "rear," "head," "tail," and the like are used as an orientation or positional relationship based on that shown in the drawings, merely to facilitate description of the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and therefore should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.
Claims (4)
1. A convex polyhedron classifier-based class imbalance data classification system, comprising:
the finite imbalanced data set preprocessing module, used for dividing a given finite imbalanced data set S into a training set T and a testing set P, and then marking the minority class samples X and the majority class samples Y in the training set T;
the convex polyhedron differentiation module of the sample space, used for representing the convex hull of X by convex combinations of the samples in the minority class set X and providing the criterion for judging whether the two sample sets X and Y' are convex polyhedron separable; the samples in Y that are not inside the convex hull of X are detected and form a pure sample set Y', thereby realizing the convex polyhedron differentiation of the sample space;
the classification model construction module, used for running the convex polyhedron construction algorithm on X and Y' to obtain a set LDFs of linear discriminant functions, and constructing a classification model CPC(x) according to the set LDFs;
the classification decision module is used for carrying out classification decision on samples in the test set P according to the model CPC (x) and outputting classification results;
the classification method of the class unbalanced data based on the convex polyhedron classifier comprises the following steps:
step 1: for a given finite imbalance data set S, dividing into a training set T and a test set P in a proportion of 50% to 50%; then, a few class samples in the training set T are marked as x= { X i 1.ltoreq.i.ltoreq.m, where m is the number of minority class samples; the majority sample in the marked training set T is y= { Y j 1.ltoreq.j.ltoreq.n, where n is the number of most classes of samples;
step 2: convex hulls representing X using convex combinations of samples in a minority class set X, i.e., CH (X) = { x|x= Σ 1≤i≤m α i x i ,∑ 1≤i≤m α i =1,x i ∈X,α i 0, and provides two sample sets X and Y' convex polyhedron separable decision criteria: if the intersection of the convex hull of X and Y' is null, it is expressed asThen it is indicated that X is convex polyhedral relative to Y'; then, detecting samples which are not in the X convex hulls in the Y, wherein the samples form a pure sample set Y', so that the convex polyhedron in the sample space can be differentiated;
step 3: running a convex polyhedron construction algorithm on X and Y' to obtain a set LDFs= { f of linear discriminant functions l (x) L is more than or equal to 1 and less than or equal to L, meets the following requirementsf l (x i )>0;/>f l (y j ) < 0; constructing a classification model CPC (X) according to the set LDFs, and expressing the classification model CPC (X) = +1, X epsilon X; CPC (x) = -1, x e Y;
step 4: classifying and deciding the samples in the test set P according to the model CPC (x), and outputting classification results, wherein the evaluation indexes comprise accuracy Precision, recall rate Recall and specificity rate Specificity, F 1 Metric F 1 Score, G metric G-Mean;
the step 2 comprises the following steps:
step 2.1: setting an initial valuePure sample setSetting an initial sample indication variable k=1;
step 2.2: selection of a single sample Y from a plurality of classes of samples Y k Placing the sample in a region to be detected;
step 2.3: calculating convex hull of X to y k Is as follows
d(y k ,CH(X))=min{d(y k ,x),x∈CH(X)};
Step 2.4: if d (y) k CH (X)) > 0, y will be k Put into set Y';
step 2.5: the sample indicates that the value of variable k has increased by 1, namely k+ k+1;
step 2.6: if the sample which is not detected exists in Y, namely k is less than n, turning to step 2.2; otherwise, turning to step 2.7;
step 2.7: obtaining a sample set Y' which intersects the convex hull of X as empty, i.eRealizing the differentiation of a convex polyhedron in a sample space;
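The distance d(y_k, CH(X)) in step 2.3 requires projecting a point onto a convex hull. The patent does not fix a particular solver; a Frank-Wolfe (conditional gradient) iteration is one standard way to sketch it:

```python
def dist_to_convex_hull(y, X, max_iter=1000, tol=1e-12):
    """Approximate d(y, CH(X)) by projecting y onto the convex hull of the
    sample set X using Frank-Wolfe (conditional gradient) iterations."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    p = list(X[0])                                   # start at an arbitrary sample
    for _ in range(max_iter):
        g = [pi - yi for pi, yi in zip(p, y)]        # gradient of ||p - y||^2 / 2
        s = min(X, key=lambda x: dot(g, x))          # best vertex direction
        d = [si - pi for si, pi in zip(s, p)]
        dd = dot(d, d)
        if dd == 0.0:
            break
        gamma = max(0.0, min(1.0, -dot(g, d) / dd))  # exact line search on [0, 1]
        if gamma <= tol:
            break                                    # no further improvement
        p = [pi + gamma * di for pi, di in zip(p, d)]
    g = [pi - yi for pi, yi in zip(p, y)]
    return dot(g, g) ** 0.5
```

A positive result flags y_k as lying outside the hull (step 2.4). For example, the distance from (4, 4) to the triangle with vertices (0, 0), (4, 0), (0, 4) is sqrt(8), while points inside the triangle get a (numerically) zero distance.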
the step 3 specifically comprises the following steps:
step 3.1: initializing a set of linear discriminant functionsInitializing a linear discriminant function indicating variable l=1;
step 3.2: by calculating the distance from the point to the convex hull, the nearest point pair (Y p ∈Y',x * ∈CH(X));
Step 3.3: using the closest point pair (y p ,x * ) Calculate a linear discriminant function f L (x)=w L ·x+b L Wherein w is L =x * -y p ,b L =(||y p || 2 -||x * || 2 ) 2, i.e. f L (x)=w L ·x+b L =0 is the connection of two closest points y p And x * A perpendicular bisector of the connection line;
step 3.4: f in the label Y L (x) Samples < 0 and storing the numbers of these samples in a temporary data space IDS;
step 3.5: deleting the marked samples in the IDS from Y ', the remaining sample set still marked as Y';
step 3.6: will f L (x) Putting the linear discriminant function set LDFs;
step 3.7: if it isL=l+1, ids is emptied, returning to step 3.2;
step 3.8: finishing the linear discriminant function set ldfs= { f l (x),1≤l≤L};
Step 3.9: constructing a classification model by using the linear discriminant function set LDFs:
the method divides a class unbalanced data set into a training set T and a test set P, and marks a minority class sample in the training set T as X= { X i I is more than or equal to 1 and less than or equal to m, and most samples in the training set are marked as Y= { Y j J is more than or equal to 1 and less than or equal to n; detecting samples falling into the X convex hulls in the Y, removing the samples, and marking the set of the residual samples in the Y as Y'; training a convex polyhedron classification model CPC (X) between X and Y' using a convex polyhedron construction algorithm; the classification model CPC (x) is used for judging the classification of each sample in the test set P, and can be used for effectively solving the problems of bank fraud detection, disease diagnosis and risk behavior assessment.
2. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of claim 1.
3. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of claim 1.
4. An information processing terminal for solving bank fraud detection, disease diagnosis, and risk behaviour assessment, carrying the class imbalance data classification system based on a convex polyhedron classifier according to claim 1, the terminal comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010904076.2A CN112035719B (en) | 2020-09-01 | 2020-09-01 | Category imbalance data classification method and system based on convex polyhedron classifier |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112035719A CN112035719A (en) | 2020-12-04 |
CN112035719B true CN112035719B (en) | 2024-02-20 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492096A (en) * | 2018-10-23 | 2019-03-19 | 华东理工大学 | A kind of unbalanced data categorizing system integrated based on geometry |
CN110533116A (en) * | 2019-09-04 | 2019-12-03 | 大连大学 | Based on the adaptive set of Euclidean distance at unbalanced data classification method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180210944A1 (en) * | 2017-01-26 | 2018-07-26 | Agt International Gmbh | Data fusion and classification with imbalanced datasets |
Non-Patent Citations (1)
An effective method to determine whether a point is within a convex hull and its generalized convex polyhedron classifier; Qiangkui Leng, et al.; Information Sciences; vol. 504; pp. 435-448 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||