CN104063520A - Unbalance data classifying method based on cluster sampling kernel transformation - Google Patents


Info

Publication number
CN104063520A
CN104063520A (application CN201410342031.5A)
Authority
CN
China
Prior art keywords
output layer
data set
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410342031.5A
Other languages
Chinese (zh)
Inventor
李鹏 (Li Peng)
张楷卉 (Zhang Kaihui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201410342031.5A
Publication of CN104063520A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an unbalanced data classification method based on cluster sampling kernel transformation and belongs to the field of unbalanced data classification. It aims to solve the problem that traditional unbalanced data classification methods give a poor classification effect. The method comprises the steps of: (1) vectorizing the unbalanced data to be classified to obtain an unbalanced data set; (2) resampling the vectors in the unbalanced data set with a cluster sampling method based on dynamic self-organizing maps to obtain a resampled unbalanced data set; (3) transforming the kernel function of the classifier SVM and classifying the resampled unbalanced data set obtained in step (2) with the kernel-transformed classifier SVM to obtain a classified unbalanced data set. The method is applicable to medical diagnosis, insurance and other fraud detection, protein detection, fault detection, customer churn prediction, and other fields.

Description

Unbalanced data classification method based on clustering sampling kernel transformation
Technical Field
The present invention belongs to the field of unbalanced data classification.
Background
The classification problem for unbalanced data sets is a difficult problem in natural science and has important practical application value in fields such as biology, medicine, engineering, and computing. It has been shown that traditional classification methods cannot achieve a satisfactory recognition effect when the data categories are unbalanced. How to find a classification method adapted to the characteristics of unbalanced data sets is therefore a direction worthy of further exploration.
Classification is a very important data-mining task whose goal is to induce a general description of each class from data whose classes are already known. Classification techniques based on machine learning, especially sample-based learning methods, have become the most effective approach to pattern recognition and classifier design through more than twenty years of continuous development. Existing classification techniques handle well the problems and applications characterized by relatively small data volumes, relatively complete labeling, and relatively uniform data distributions. The classification problem for unbalanced data, however, remains one of the most challenging difficulties in classification research. When the data are unbalanced, the samples cannot accurately reflect the data distribution of the whole space. For example, under a binary classification strategy the positive examples may account for only a small proportion of all samples; the classifier is then easily swamped by the large class and ignores the small class, a 'data flooding' phenomenon that greatly degrades classification performance and may even cause it to fail.
Disclosure of Invention
The invention aims to solve the problem that traditional unbalanced data classification methods give a poor classification effect, and provides an unbalanced data classification method based on cluster sampling kernel transformation.
The unbalanced data classification method based on cluster sampling kernel transformation
comprises the following steps:
Step one: vectorize the unbalanced data to be classified to obtain an unbalanced data set;
Step two: resample the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps to obtain a resampled unbalanced data set;
Step three: transform the kernel function of the classifier SVM, and classify the resampled unbalanced data set obtained in step two with the kernel-transformed classifier SVM to obtain the classified unbalanced data set.
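Read as a whole, the three steps form a simple pipeline. The following sketch is illustrative only; the helper names (vectorize, vsom_resample, train_transformed_svm) are assumed placeholders for the procedures detailed in the embodiments, not interfaces defined by the invention.

```python
# Illustrative sketch of the three-step pipeline. The helper names are
# assumed placeholders for the procedures detailed below, not part of
# the patent itself.

def classify_unbalanced(raw_records, labels):
    # Step one: vectorize the raw unbalanced data into an unbalanced data set.
    X = vectorize(raw_records)

    # Step two: resample with the dynamic-SOM cluster sampling method.
    X_res, y_res = vsom_resample(X, labels)

    # Step three: train an SVM whose kernel is conformally transformed with
    # imbalance factors, then classify the resampled data set.
    model = train_transformed_svm(X_res, y_res)
    return model.predict(X_res)
```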
The method of resampling the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps, to obtain the resampled unbalanced data set, comprises the following steps:
Step two-one: initialize the self-organizing map network and set the training-count variable cycle to zero;
Step two-two: initialize the weights of the output-layer neuron nodes of the self-organizing map network; at t = 0, assign every output-layer node weight w_ij a random decimal, i.e. 0 < w_ij < 1, where i = 1, 2, ..., L, L being the vector dimension of each output-layer neuron, and j = 1, 2, ..., c, c being the number of output-layer neurons;
Step two-three: randomly select a sample vector X = (x_1, x_2, ..., x_L) from the unbalanced data set and input it into the self-organizing map network, adding 1 to the training-count variable cycle each time a sample is input; the total number of input samples is |D|;
Step two-four: calculate the distance dis(X, w_j) between the sample vector X and the weight vector w_j of each output-layer neuron node;
Step two-five: select the node with the minimum distance dis(X, w_j) as the competition-winning node, denoted w*_j; then dis(X, w*_j) = min(dis(X, w_j)), 1 ≤ j ≤ c;
Step two-six: according to the formula w_j(t+1) = w_j(t) + η(t)·r_j(t)·(X − w_j(t)), adjust the weights of the competition-winning node w*_j and of the output-layer neuron nodes within its neighborhood, where η(t) denotes the learning-rate value and r_j the neighborhood value of the jth output-layer neuron;
Step two-seven: if cycle mod |D| = 0, calculate the R² clustering-criterion coefficient of the current output-layer neurons; if the R² value is greater than the threshold μ, terminate; otherwise go to step two-eight;
Step two-eight: among the weight-adjusted output-layer neurons, find the one with the largest within-class sum of squared deviations, insert a new output-layer neuron beside it, initialize the new node's weight to the mean of the adjacent output-layer neuron vectors, and go to step two-three.
In step three, the kernel function of the classifier SVM is transformed as follows:
K̃(x, x') = C(x)·C(x')·K(x, x')
where K(x, x') = exp(−‖x − x'‖²/(2σ²)), σ > 0, σ being the kernel parameter,
C(x) = Σ_{k∈SV} exp(−‖x − x_k‖²/(2τ_k²)),
SV is the number of support vectors, and τ_k² is the imbalance factor of the kth input sample, k = 1, 2, ..., SV.
In step three, the kernel-transformed classifier SVM is:
f(x) = sgn( Σ_{k=1}^{SV} α_k·y_k·K̃(x, x_k) + b )
where y_k is the class of the kth input sample, α_k is the parameter that determines the optimal classification hyperplane for the kth input sample, and b is the offset.
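The transformed kernel and the decision function above translate directly into code. A minimal sketch (NumPy assumed; the decision function is the standard SVM form written with the symbols defined here):

```python
import numpy as np

def rbf_kernel(x, x2, sigma):
    # K(x, x') = exp(-||x - x'||^2 / (2*sigma^2)), sigma > 0
    return np.exp(-np.sum((np.asarray(x) - np.asarray(x2)) ** 2) / (2.0 * sigma ** 2))

def transformed_kernel(x, x2, C, sigma):
    # K~(x, x') = C(x) * C(x') * K(x, x')
    return C(x) * C(x2) * rbf_kernel(x, x2, sigma)

def svm_decision(x, sv_x, sv_y, alpha, b, C, sigma):
    # f(x) = sgn( sum_k alpha_k * y_k * K~(x, x_k) + b ) over the support vectors
    s = sum(a * y * transformed_kernel(x, xk, C, sigma)
            for a, y, xk in zip(alpha, sv_y, sv_x))
    return 1 if s + b >= 0 else -1
```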
The method has two advantages. First, the unbalanced data are resampled, which removes a large amount of noise data that would affect classification, reduces the imbalance ratio, and reduces the occurrence of data flooding. Second, aiming at characteristics of unbalanced data sets such as data skew, high noise, serious information loss, and data flooding, a special kernel function suited to unbalanced data is constructed, and by introducing an imbalance factor the deviation between the optimal classification surface and the ideal classification surface is corrected automatically, effectively improving the classification effect on unbalanced data.
Drawings
Fig. 1 is a schematic diagram of the principle of the unbalanced data classification method based on cluster sampling kernel transformation according to the first embodiment.
Fig. 2 is a schematic diagram of the cluster sampling principle in the second embodiment.
Detailed Description
The first specific embodiment: this embodiment is described with reference to Fig. 1. The unbalanced data classification method based on cluster sampling kernel transformation according to this embodiment comprises the following steps:
Step one: vectorize the unbalanced data to be classified to obtain an unbalanced data set;
Step two: resample the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps to obtain a resampled unbalanced data set;
Step three: transform the kernel function of the classifier SVM, and classify the resampled unbalanced data set obtained in step two with the kernel-transformed classifier SVM to obtain the classified unbalanced data set.
This embodiment mainly addresses the classification problem for unbalanced data sets, organically combining two strategies: sample resampling and classifier improvement. The method creatively adopts a resampling method that combines unsupervised clustering with K-nearest-neighbor rules to select and prune the samples of the unbalanced data set; this not only effectively balances the skewed state of the samples but also greatly reduces the number of support vectors, markedly improving classification speed while improving the classification effect. The sampling method overcomes shortcomings of traditional sampling methods such as lack of theoretical basis, strong randomness, subjective human interference, and information loss, resolves the aliasing phenomenon in the data, and clearly improves the generalization performance of the subsequent SVM classifier. To adapt to the unbalanced state of the samples, the SVM classification model is also improved: through a transformation of the kernel function, an imbalance factor is introduced to adjust the classification surface automatically according to the imbalance ratio, achieving class-boundary calibration.
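The K-nearest-neighbor pruning rule is not spelled out at this point in the text; one plausible reading, sketched below purely as an assumption, is to discard majority-class samples whose k nearest neighbours are all majority-class, since such samples lie far from the class boundary and contribute little to the support vectors.

```python
import numpy as np

def knn_prune(X, y, k=5, majority=-1):
    # Assumed KNN pruning rule (the text does not define it): drop a
    # majority-class sample when all of its k nearest neighbours are also
    # majority-class, i.e. it is far from the class boundary.
    X, y = np.asarray(X), np.asarray(y)
    keep = []
    for i in range(len(X)):
        if y[i] != majority:
            keep.append(i)                       # always keep the small class
            continue
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]              # skip the sample itself
        if not np.all(y[nn] == majority):        # near the boundary -> keep
            keep.append(i)
    return X[keep], y[keep]
```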
For the classification of unbalanced data sets, resampling is an effective way to resolve data imbalance; the key is to eliminate a large amount of noise information and markedly reduce the degree of data skew while guaranteeing minimal information loss, so that most sample points useful for classification learning are retained. The self-organizing map (SOM) simulates the self-organizing characteristics of the human brain and realizes an order-preserving mapping from high-dimensional data onto a two-dimensional plane; SOM clustering therefore has clear advantages over other methods for high-dimensional data, has strong resistance to noise interference, and can be accelerated by parallel processing. The dynamic self-organizing map (V-SOM) clustering method is used to avoid the under-utilization of neurons caused by neuron expansion and to overcome the boundary effect easily produced by rectangular and similar structures.
The cluster sampling method based on the dynamic self-organizing map effectively balances the skewed state of the samples and greatly reduces the number of support vectors, markedly improving classification speed while improving the classification effect. This sampling method overcomes the shortcomings of traditional sampling methods, such as lack of theoretical basis, strong randomness, subjective human interference, and information loss, resolves the aliasing phenomenon in the data, and clearly improves the generalization performance of the subsequent SVM classifier.
Imbalance of the data categories causes the optimal classification hyperplane obtained by actual learning to be basically consistent in direction with the ideal hyperplane, yet far from the negative examples and close to the positive examples; this is the result of data flooding, and at test time the classification hyperplane shows a strong bias toward the negative class. The principle of the method is to apply a conformal transformation to the kernel function of the SVM: by introducing an imbalance factor adapted to the characteristics of the unbalanced data set, the classification boundary is adjusted automatically to relieve the closeness of the classification surface to the positive examples, so that the final classification hyperplane lies closer to the ideal hyperplane. In principle this adjustment differs from adjusting penalty factors and from boundary translation: the magnitude of the imbalance factor is computed from the degree of skew of the input unbalanced data set and from the data themselves, the movement of the classification surface is realized automatically without human intervention, and the lack of theoretical basis inherent in manually set parameters is avoided.
The second specific embodiment: this embodiment is a further limitation of the unbalanced data classification method based on cluster sampling kernel transformation. The method of resampling the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps, to obtain the resampled unbalanced data set, comprises the following steps:
Step two-one: initialize the self-organizing map network and set the training-count variable cycle to zero;
Step two-two: initialize the weights of the output-layer neuron nodes of the self-organizing map network; at t = 0, assign every output-layer node weight w_ij a random decimal, i.e. 0 < w_ij < 1, where i = 1, 2, ..., L, L being the vector dimension of each output-layer neuron, and j = 1, 2, ..., c, c being the number of output-layer neurons;
Step two-three: randomly select a sample vector X = (x_1, x_2, ..., x_L) from the unbalanced data set and input it into the self-organizing map network, adding 1 to the training-count variable cycle each time a sample is input; the total number of input samples is |D|;
Step two-four: calculate the distance dis(X, w_j) between the sample vector X and the weight vector w_j of each output-layer neuron node;
Step two-five: select the node with the minimum distance dis(X, w_j) as the competition-winning node, denoted w*_j; then dis(X, w*_j) = min(dis(X, w_j)), 1 ≤ j ≤ c;
Step two-six: according to the formula w_j(t+1) = w_j(t) + η(t)·r_j(t)·(X − w_j(t)), adjust the weights of the competition-winning node w*_j and of the output-layer neuron nodes within its neighborhood, where η(t) denotes the learning-rate value and r_j the neighborhood value of the jth output-layer neuron;
Step two-seven: if cycle mod |D| = 0, calculate the R² clustering-criterion coefficient of the current output-layer neurons; if the R² value is greater than the threshold μ, terminate; otherwise go to step two-eight;
Step two-eight: among the weight-adjusted output-layer neurons, find the one with the largest within-class sum of squared deviations, insert a new output-layer neuron beside it, initialize the new node's weight to the mean of the adjacent output-layer neuron vectors, and go to step two-three.
Cluster sampling is also known as whole-cluster sampling: some small groups are drawn from the population, and all elements within the drawn groups are then surveyed. The sampling unit is not a single individual but a group of individuals. The small groups can be drawn by simple random sampling, systematic sampling, or clustering methods. Its advantages are simplicity, convenience, and cost savings, and it is especially suitable when the overall sampling frame is difficult to determine.
The SVM classification principle indicates that the optimal classification hyperplane is determined only by the sample points near it, not by those far from it. Colloquially, if a classification algorithm can correctly separate a positive example from the negative examples that interfere with it most (i.e., are closest to it), then the negative examples that interfere less (i.e., are farther away) can naturally be separated correctly as well. In principle, therefore, the SVM classification model requires the collected samples to share some similarity of attribute features even when their classes differ, and the concept of cluster sampling conforms to exactly this principle; the cluster sampling process is shown in Fig. 2.
Resampling is an effective way to resolve data imbalance; the key is to eliminate a large amount of noise information and markedly reduce the degree of data skew while guaranteeing minimal information loss, so that most sample points useful for classification learning are retained. This patent proposes a new clustering method, dynamic self-organizing map clustering, to resolve this somewhat contradictory requirement; the method also has the advantage of clustering high-dimensional data well. The basic idea is to partition the original large-scale unbalanced data into N clusters by clustering; clusters whose sample points are all negative examples are deleted, and the remaining clusters are added to the selected sample set, as the sketch below illustrates.
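A minimal sketch of that selection rule, assuming the cluster assignments (for example, the winning SOM neuron of each sample) are already available; the function and parameter names are illustrative:

```python
import numpy as np

def cluster_prune(X, y, cluster_ids, negative=-1):
    # Keep every cluster that contains at least one non-negative sample;
    # clusters consisting entirely of negative examples are deleted.
    X, y, ids = np.asarray(X), np.asarray(y), np.asarray(cluster_ids)
    keep = np.zeros(len(X), dtype=bool)
    for c in np.unique(ids):
        members = ids == c
        if np.any(y[members] != negative):   # the cluster touches the small class
            keep |= members
    return X[keep], y[keep]
```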
The weights of the output-layer neurons are adjusted with the following formulas:
w_j(t+1) = w_j(t) + η(t)·r_j(t)·(x_i − w_j(t))   (1)
dis(x_i, w_j(t)) = 1 − sim(x_i, w_j(t))   (2)
where w_j(t+1) and w_j(t) denote the weight vector of neuron w_j after and before the adjustment, η(t) is the learning-rate function, and r_j(t) is the neighborhood function; both decrease gradually as training progresses. dis(x_i, w_j(t)) denotes the distance between the sample vector x_i and the neuron vector w_j(t), and its magnitude can be converted into a similarity computation: the greater the similarity between two vectors, the smaller their distance. In general the similarity can be computed with the cosine formula, i.e.
sim(x, w) = ( Σ_{i=1}^{L} W_{x_i}·W_{w_i} ) / ( √(Σ_{i=1}^{L} W_{x_i}²) · √(Σ_{i=1}^{L} W_{w_i}²) )   (3)
In equation (3), L denotes the dimension of the vectors, W_{x_i} denotes the weight of the sample vector x in the ith dimension, and W_{w_i} denotes the weight of the neuron vector w in the ith dimension; all vectors involved have been normalized.
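Equations (2) and (3) translate directly into code; a small sketch (NumPy assumed):

```python
import numpy as np

def sim(x, w):
    # Equation (3): cosine similarity of the (normalized) vectors x and w.
    return np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))

def dis(x, w):
    # Equation (2): distance as the complement of the similarity.
    return 1.0 - sim(x, w)
```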
The algorithm adopts the R² clustering-criterion coefficient as the judgment basis for seeking a balance between over-utilization and under-utilization of the neurons. Let m_i be the vector corresponding to neuron N_i; then the within-class sum of squared deviations of the samples mapped to N_i is
S_i = Σ_{x_j → N_i} dis(x_j, m_i)   (4)
At time t, assuming the output layer has c neurons in total, define P_c = Σ_{i=1}^{c} S_i. Let T be the total sum of squared deviations of all samples; then T = Σ_{i=1}^{|D|} dis(x_i, x̄)
where x̄ denotes the mean vector of all training samples and |D| denotes the total number of input samples. Then
R² = 1 − P_c / T   (5)
The clustering-criterion coefficient R² has the value range [0, 1], and its value generally increases monotonically with the growth of the network scale. The threshold μ must therefore be set to terminate the growth of the network at the appropriate time and prevent under-utilization of the neurons.
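Putting steps two-one through two-eight together with equations (1)-(5) gives the following training sketch. The learning-rate and neighborhood schedules, the initial network size, and the placement of an inserted neuron are not fixed by the text, so simple assumed choices are used, and the neighborhood is shrunk to the winner alone for brevity.

```python
import numpy as np

def cos_dis(x, w):
    # equation (2): 1 - cosine similarity
    return 1.0 - np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))

def vsom_train(X, mu=0.9, eta0=0.5, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    D, L = X.shape
    W = rng.random((2, L))                         # step two-two: weights in (0, 1)
    for epoch in range(max_epochs):                # cycles over the |D| samples
        eta = eta0 / (1 + epoch)                   # assumed learning-rate decay
        for x in X[rng.permutation(D)]:            # step two-three
            d = [cos_dis(x, w) for w in W]         # step two-four
            j = int(np.argmin(d))                  # step two-five: winner w*_j
            W[j] += eta * (x - W[j])               # step two-six, equation (1)
        wins = np.array([np.argmin([cos_dis(x, w) for w in W]) for x in X])
        S = np.array([sum(cos_dis(x, W[j]) for x in X[wins == j])
                      for j in range(len(W))])     # equation (4)
        T = sum(cos_dis(x, X.mean(axis=0)) for x in X)
        if 1.0 - S.sum() / T > mu:                 # equation (5): R^2 > mu
            break                                  # step two-seven: stop growing
        j_max = int(np.argmax(S))                  # step two-eight: grow the map
        W = np.vstack([W, (W[j_max] + W[(j_max + 1) % len(W)]) / 2.0])
    return W
```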
The third specific embodiment: in step three, the kernel function of the classifier SVM is transformed as follows:
K̃(x, x') = C(x)·C(x')·K(x, x')
where K(x, x') = exp(−‖x − x'‖²/(2σ²)), σ > 0, σ being the kernel parameter,
C(x) = Σ_{k∈SV} exp(−‖x − x_k‖²/(2τ_k²)),
SV is the number of support vectors, and τ_k² is the imbalance factor of the kth input sample, k = 1, 2, ..., SV.
As is well known, the basic idea of the kernel method is to map linearly inseparable samples in an input space I into a high-dimensional feature space F through φ(x), so that a linear classification surface is obtained in the high-dimensional space. Letting Z = φ(x), a neighborhood of the point x is mapped into the space F as
dZ = ∇φ·dx = Σ_i (∂φ(x)/∂x_i) dx_i   (6)
The quadratic form of dZ is then
|dZ|² = Σ_{i,j} g_ij(x) dx_i dx_j   (7)
where
g_ij(x) = (∂φ(x)/∂x_i)·(∂φ(x)/∂x_j)   (8)
The factor g_ij is called the Riemannian metric and is determined by the mapping φ(x). Although φ(x) is not given explicitly in the kernel method, g_ij can always be expressed in terms of the kernel function K through the computational techniques associated with kernel functions:
g_ij(x) = ( ∂²K(x, x') / (∂x_i ∂x'_j) )|_{x'=x}   (9)
The Riemannian metric g_ij(x) characterizes the extent to which a small region of the input space I is enlarged after being mapped into the feature space F by φ(x); this is the basis of the kernel-function transformation. To overcome the interference of an unbalanced data set with model training, the kernel function K is transformed so as to enlarge the Riemannian metric g_ij(x) near the classification boundary while shrinking it at the remaining sample points. The following transformation of the kernel function is therefore introduced:
K̃(x, x') = C(x)·C(x')·K(x, x')   (10)
The RBF kernel is used as the raw kernel function of the classifier, in view of its good performance on most classes. Substituting equation (10) into equation (9), the following result is obtained for such a kernel function:
g̃_ij(x) = (∂C(x)/∂x_i)·(∂C(x)/∂x_j) + C(x)²·g_ij(x)   (11)
As equation (11) shows, the function C(x) must have a first derivative of 0 at its maximum, to ensure that the transformation causes no angular change between feature points and that larger values are taken at the support vectors. The selected kernel function is the RBF kernel, whose form is:
K(x, x') = exp(−‖x − x'‖²/(2σ²)), σ > 0   (12)
C(x) is a projection function that must be chosen appropriately so that the resulting K̃ takes larger values at the classification boundary; the RBF form of the projection function is chosen:
C(x) = Σ_{k∈SV} exp(−‖x − x_k‖²/(2τ_k²))   (13)
where τ_k² is an important parameter called the imbalance factor. It must make C(x) take larger values closer to the classification interface and smaller values farther from it, and can therefore be set according to the following formula:
τ_k² = AVG_{i ∈ {‖Φ(x_i) − Φ(x_k)‖² < M, y_i ≠ y_k}} ( ‖Φ(x_i) − Φ(x_k)‖² )   (14)
where M is a given distance constant, and ‖Φ(x_i) − Φ(x_k)‖² can be obtained through kernel-function operations:
‖Φ(x_i) − Φ(x_k)‖² = K(x_i, x_i) + K(x_k, x_k) − 2K(x_i, x_k)   (15)
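Equations (13)-(15) combine into a small routine for the imbalance factors and the projection function. A sketch under stated assumptions (NumPy; the fallback value when a support vector has no opposite-class neighbour within M is an assumption, since the text does not cover that case):

```python
import numpy as np

def rbf(xi, xj, sigma):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def feat_dist2(xi, xk, sigma):
    # Equation (15): ||Phi(x_i) - Phi(x_k)||^2 via the kernel trick.
    return rbf(xi, xi, sigma) + rbf(xk, xk, sigma) - 2.0 * rbf(xi, xk, sigma)

def imbalance_factors(sv_x, sv_y, sigma, M):
    # Equation (14): tau_k^2 averages the feature-space squared distances
    # from support vector k to opposite-class support vectors closer than M.
    taus = []
    for xk, yk in zip(sv_x, sv_y):
        d2 = [feat_dist2(xi, xk, sigma) for xi, yi in zip(sv_x, sv_y)
              if yi != yk and feat_dist2(xi, xk, sigma) < M]
        taus.append(np.mean(d2) if d2 else M)      # assumed fallback
    return np.array(taus)

def C(x, sv_x, tau2):
    # Equation (13): RBF projection function built on the support vectors.
    return sum(np.exp(-np.sum((x - xk) ** 2) / (2.0 * t2))
               for xk, t2 in zip(sv_x, tau2))
```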
it is necessary to discuss how the imbalance ratio plays a role in kernel function adjustment, that is, while considering the Riemann metric corresponding to the change support vector, the imbalance ratio must be introduced to ensure that the positive and negative examples correspond toIn a small sample classOf the larger and large classIs smaller. Here, letWhileThe two factors are respectively used as positive example samples and negative example samplesThe desired effect can be achieved.
By introducing the imbalance factor τ_k², and because of the skew between positive and negative examples in the unbalanced data set and the difference in the numbers of positive and negative support vectors near the classification surface, the positive and negative imbalance factors take different sizes. Through the transformation, the kernel function internally contains the imbalance-factor parameters, and the imbalance factors automatically adjust the classification boundary according to the skew of the input data. In principle this adjustment differs from adjusting penalty factors and from boundary translation: the magnitude of the imbalance factor is computed from the degree of skew of the input unbalanced data set and from the data themselves, the movement of the classification surface is realized automatically without human intervention, and the lack of theoretical basis inherent in manually set parameters is avoided.
The fourth specific embodiment: this embodiment is a further limitation of the unbalanced data classification method based on cluster sampling kernel transformation described in the third embodiment.
In step three, the kernel-transformed classifier SVM is:
f(x) = sgn( Σ_{k=1}^{SV} α_k·y_k·K̃(x, x_k) + b )
where y_k is the class of the kth input sample, α_k is the parameter that determines the optimal classification hyperplane for the kth input sample, and b is the offset. To verify the effect and influence of the invention on the classification of unbalanced data sets, four data sets were selected from the public standard data sets MUC-6 and UCI for verification; their composition is shown in Table 1. It is worth noting that the imbalance ratios of the UCI-Seg1 and UCI-Glass7 data sets are essentially the same, but their degrees of positive-example information scarcity differ greatly; they were chosen to verify the impact that the scarcity of positive-example information in unbalanced data has on the classification method and results. In practical applications of unbalanced data-set classification, the classification effect on the small class is usually the greater concern, so the precision, recall, and F-measure of positive-example classification were selected as the technical indexes for measuring the classification effect in the experiments. The calculation formulas are the standard ones: Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F-measure = 2·Precision·Recall/(Precision + Recall), where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives of the positive class.
TABLE 1 List of unbalanced data sets

Data set      Negative-example samples   Positive-example samples   Imbalance ratio
MUC-6         18815                      1266                       15:1
UCI-Seg1      1980                       330                        6:1
UCI-Glass7    185                        29                         6.4:1
UCI-Abalone   4145                       32                         130:1
Four method strategies were used to run experiments on the four data sets above, to verify the role of the three algorithms in classification. In each experiment, 50% of each data set was used for training and the remaining 50% for testing, with the training and testing data guaranteed to have the same imbalance ratio (a sketch of such a split is given below). The detailed strategies of the four methods are as follows:
Method one: classify with an SVM model using an ordinary RBF kernel;
Method two: classify with an SVM model based on kernel transformation;
Method three: introduce the KNN pruning algorithm on the basis of method two;
Method four: introduce the cluster sampling algorithm on the basis of method three.
Comparing the experimental results of methods one and two verifies the influence of the kernel transformation on the classification effect; comparing methods two and three verifies the influence of the KNN pruning algorithm; and comparing methods three and four verifies the influence of cluster sampling.
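The 50/50 split that preserves the imbalance ratio amounts to splitting each class separately; a sketch of such a stratified split (NumPy assumed):

```python
import numpy as np

def stratified_half_split(y, seed=0):
    # Split each class 50/50 so that the training and testing halves keep
    # the same imbalance ratio, as the protocol above requires.
    rng = np.random.default_rng(seed)
    train, test = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.where(np.asarray(y) == cls)[0])
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return np.array(train), np.array(test)
```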
Comparison and analysis of the experimental results yield some meaningful conclusions. (1) For unbalanced data classification, the kernel-transformation method does not help improve the precision of positive-example classification, but it clearly improves the recall and ultimately improves the overall classification performance of the system; that is, more positive examples can be found with kernel transformation. (2) For unbalanced data classification, the KNN pruning algorithm improves precision but does not help recall, and ultimately improves the overall classification performance; that is, KNN pruning reduces the misclassification of negative examples. However, the algorithm caused serious degradation, and even failure, of system performance in the classification experiments on the UCI-Glass7 and UCI-Abalone data sets. Analysis suggests this result is caused by a serious shortage of positive-example information and is not strongly related to the imbalance ratio of the data: the imbalance ratios of UCI-Glass7 and UCI-Seg1 are essentially the same, but UCI-Glass7 has only 29 positive samples, so the training data contain only 14 positive examples, and little positive-example information remains after pruning, clearly degrading the classification effect; on the UCI-Abalone data set the method fails because the few positive examples are almost completely deleted. It follows that the classification of unbalanced data should consider not only the imbalance ratio: the degree of information shortage is also an important factor affecting the classification method and effect. (3) For unbalanced data classification, the cluster sampling method improves both precision and recall, further proving that resampling is an effective way to solve unbalanced data classification.
Unbalanced data-set classification has many fields of application, for example medical diagnosis, cancer detection, credit-card and insurance fraud detection, bioinformatics such as protein detection, enterprise bankruptcy, fault detection, and customer churn prediction.

Claims (4)

1. An unbalanced data classification method based on cluster sampling kernel transformation, characterized by comprising the following steps:
Step one: vectorize the unbalanced data to be classified to obtain an unbalanced data set;
Step two: resample the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps to obtain a resampled unbalanced data set;
Step three: transform the kernel function of the classifier SVM, and classify the resampled unbalanced data set obtained in step two with the kernel-transformed classifier SVM to obtain the classified unbalanced data set.
2. The unbalanced data classification method based on cluster sampling kernel transformation according to claim 1, characterized in that the method of resampling the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps, to obtain the resampled unbalanced data set, comprises the following steps:
Step two-one: initialize the self-organizing map network and set the training-count variable cycle to zero;
Step two-two: initialize the weights of the output-layer neuron nodes of the self-organizing map network; at t = 0, assign every output-layer node weight w_ij a random decimal, i.e. 0 < w_ij < 1, where i = 1, 2, ..., L, L being the vector dimension of each output-layer neuron, and j = 1, 2, ..., c, c being the number of output-layer neurons;
Step two-three: randomly select a sample vector X = (x_1, x_2, ..., x_L) from the unbalanced data set and input it into the self-organizing map network, adding 1 to the training-count variable cycle each time a sample is input; the total number of input samples is |D|;
Step two-four: calculate the distance dis(X, w_j) between the sample vector X and the weight vector w_j of each output-layer neuron node;
Step two-five: select the node with the minimum distance dis(X, w_j) as the competition-winning node, denoted w*_j; then dis(X, w*_j) = min(dis(X, w_j)), 1 ≤ j ≤ c;
Step two-six: according to the formula w_j(t+1) = w_j(t) + η(t)·r_j(t)·(X − w_j(t)), adjust the weights of the competition-winning node w*_j and of the output-layer neuron nodes within its neighborhood, where η(t) denotes the learning-rate value and r_j the neighborhood value of the jth output-layer neuron;
Step two-seven: if cycle mod |D| = 0, calculate the R² clustering-criterion coefficient of the current output-layer neurons; if the R² value is greater than the threshold μ, terminate; otherwise go to step two-eight;
Step two-eight: among the weight-adjusted output-layer neurons, find the one with the largest within-class sum of squared deviations, insert a new output-layer neuron beside it, initialize the new node's weight to the mean of the adjacent output-layer neuron vectors, and go to step two-three.
3. The unbalanced data classification method based on cluster sampling kernel transformation according to claim 2, characterized in that in step three the kernel function of the classifier SVM is transformed as follows:
K̃(x, x') = C(x)·C(x')·K(x, x')
where K(x, x') = exp(−‖x − x'‖²/(2σ²)), σ > 0, σ being the kernel parameter,
C(x) = Σ_{k∈SV} exp(−‖x − x_k‖²/(2τ_k²)),
SV is the number of support vectors, and τ_k² is the imbalance factor of the kth input sample, k = 1, 2, ..., SV.
4. The unbalanced data classification method based on cluster sampling kernel transformation according to claim 3, characterized in that in step three the kernel-transformed classifier SVM is:
f(x) = sgn( Σ_{k=1}^{SV} α_k·y_k·K̃(x, x_k) + b )
where y_k is the class of the kth input sample, α_k is the parameter that determines the optimal classification hyperplane for the kth input sample, and b is the offset.
CN201410342031.5A 2014-07-17 2014-07-17 Unbalance data classifying method based on cluster sampling kernel transformation Pending CN104063520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410342031.5A CN104063520A (en) 2014-07-17 2014-07-17 Unbalance data classifying method based on cluster sampling kernel transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410342031.5A CN104063520A (en) 2014-07-17 2014-07-17 Unbalance data classifying method based on cluster sampling kernel transformation

Publications (1)

Publication Number Publication Date
CN104063520A true CN104063520A (en) 2014-09-24

Family

ID=51551234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410342031.5A Pending CN104063520A (en) 2014-07-17 2014-07-17 Unbalance data classifying method based on cluster sampling kernel transformation

Country Status (1)

Country Link
CN (1) CN104063520A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104373338A (en) * 2014-11-19 2015-02-25 北京航空航天大学 Hydraulic pump fault diagnosing method based on LMD-SVD and IG-SVM
CN106022511A (en) * 2016-05-11 2016-10-12 北京京东尚科信息技术有限公司 Information predicting method and device
CN110706749A (en) * 2019-09-10 2020-01-17 至本医疗科技(上海)有限公司 Cancer type prediction system and method based on tissue and organ differentiation hierarchical relation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050154701A1 (en) * 2003-12-01 2005-07-14 Parunak H. Van D. Dynamic information extraction with self-organizing evidence construction
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method of unbalance data
CN103605711A (en) * 2013-11-12 2014-02-26 中国石油大学(北京) Construction method and device, classification method and device of support vector machine

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050154701A1 (en) * 2003-12-01 2005-07-14 Parunak H. Van D. Dynamic information extraction with self-organizing evidence construction
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method of unbalance data
CN103605711A (en) * 2013-11-12 2014-02-26 中国石油大学(北京) Construction method and device, classification method and device of support vector machine

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI PENG 等: "Imbalanced Data SVM Classification Method Based on Cluster Boundary Sampling and DT-KNN Pruning", 《INTERNATIONAL JOURNAL OF SIGNAL PROCESSING,IMAGE PROCESSING AND PATTERN RECOGNITION》 *
李鹏 (LI PENG) et al.: "A Classification Method for Imbalanced Data Sets Based on a Hybrid Strategy", 《ACTA ELECTRONICA SINICA》 *
李鹏 (LI PENG) et al.: "Research Progress on Classification Techniques for Imbalanced Data Sets", 《SCIENCEPAPER ONLINE》 *
陶新民 (TAO XINMIN) et al.: "Kernel Clustering Ensemble SVM Algorithm for Imbalanced Data", 《JOURNAL OF HARBIN ENGINEERING UNIVERSITY》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104373338A (en) * 2014-11-19 2015-02-25 北京航空航天大学 Hydraulic pump fault diagnosing method based on LMD-SVD and IG-SVM
CN106022511A (en) * 2016-05-11 2016-10-12 北京京东尚科信息技术有限公司 Information predicting method and device
CN110706749A (en) * 2019-09-10 2020-01-17 至本医疗科技(上海)有限公司 Cancer type prediction system and method based on tissue and organ differentiation hierarchical relation
CN110706749B (en) * 2019-09-10 2022-06-10 至本医疗科技(上海)有限公司 Cancer type prediction system and method based on tissue and organ differentiation hierarchical relation


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140924