CN104063520A - Unbalance data classifying method based on cluster sampling kernel transformation - Google Patents


Info

Publication number
CN104063520A
CN104063520A (application CN201410342031.5A)
Authority
CN
China
Prior art keywords
output layer
data set
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410342031.5A
Other languages
Chinese (zh)
Inventor
李鹏 (Li Peng)
张楷卉 (Zhang Kaihui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201410342031.5A
Publication of CN104063520A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an unbalanced data classification method based on cluster sampling kernel transformation and belongs to the field of unbalanced data classification. It aims to solve the problem that traditional unbalanced data classification methods give a poor classification effect. The method comprises the steps of: (1) vectorizing the unbalanced data to be classified to obtain an unbalanced data set; (2) resampling the vectors in the unbalanced data set with a cluster sampling method based on dynamic self-organizing maps to obtain a resampled unbalanced data set; (3) transforming the kernel function of the classifier SVM and classifying the resampled unbalanced data set obtained in step (2) with the kernel-transformed classifier SVM to obtain a classified unbalanced data set. The method is applicable to medical diagnosis, insurance and other fraud detection, protein detection, fault detection, customer churn prediction, and other fields.

Description

Unbalanced data classification method based on clustering sampling kernel transformation
Technical Field
The present invention belongs to the field of unbalanced data classification.
Background
The classification problem for unbalanced data sets is a difficult problem in natural science and has important practical application value in fields such as biology, medicine, engineering, and computing. It has been shown that traditional classification methods cannot achieve a satisfactory recognition effect when the data categories are unbalanced. How to find a classification method adapted to the characteristics of unbalanced data sets is therefore a direction worthy of further exploration.
Classification is a very important data-mining task whose goal is to induce a general description of each class from data whose classes are already known. Classification techniques based on machine learning, especially sample-based learning methods, have become the most effective approach to pattern recognition and classifier design through more than twenty years of continuous development. Existing classification techniques handle well the problems and applications characterized by relatively small data volumes, relatively complete labeling, and relatively uniform data distributions. The classification problem for unbalanced data, however, remains one of the most challenging difficulties in classification research. When the data are unbalanced, the samples cannot accurately reflect the data distribution of the whole space. For example, under a binary classification strategy the positive examples may account for only a small proportion of all samples; the classifier is then easily swamped by the large class and ignores the small class, a 'data flooding' phenomenon that greatly degrades classification performance and may even cause it to fail.
Disclosure of Invention
The invention aims to solve the problem that traditional unbalanced data classification methods give a poor classification effect, and provides an unbalanced data classification method based on cluster sampling kernel transformation.
The unbalanced data classification method based on cluster sampling kernel transformation
comprises the following steps:
Step one: vectorize the unbalanced data to be classified to obtain an unbalanced data set;
Step two: resample the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps to obtain a resampled unbalanced data set;
Step three: transform the kernel function of the classifier SVM, and classify the resampled unbalanced data set obtained in step two with the kernel-transformed classifier SVM to obtain the classified unbalanced data set.
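Read as a whole, the three steps form a simple pipeline. The following sketch is illustrative only; the helper names (vectorize, vsom_resample, train_transformed_svm) are assumed placeholders for the procedures detailed in the embodiments, not interfaces defined by the invention.

```python
# Illustrative sketch of the three-step pipeline. The helper names are
# assumed placeholders for the procedures detailed below, not part of
# the patent itself.

def classify_unbalanced(raw_records, labels):
    # Step one: vectorize the raw unbalanced data into an unbalanced data set.
    X = vectorize(raw_records)

    # Step two: resample with the dynamic-SOM cluster sampling method.
    X_res, y_res = vsom_resample(X, labels)

    # Step three: train an SVM whose kernel is conformally transformed with
    # imbalance factors, then classify the resampled data set.
    model = train_transformed_svm(X_res, y_res)
    return model.predict(X_res)
```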
The method of resampling the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps, to obtain the resampled unbalanced data set, comprises the following steps:
Step two-one: initialize the self-organizing map network and set the training-count variable cycle to zero;
Step two-two: initialize the weights of the output-layer neuron nodes of the self-organizing map network; at t = 0, assign every output-layer node weight w_ij a random decimal, i.e. 0 < w_ij < 1, where i = 1, 2, ..., L, L being the vector dimension of each output-layer neuron, and j = 1, 2, ..., c, c being the number of output-layer neurons;
Step two-three: randomly select a sample vector X = (x_1, x_2, ..., x_L) from the unbalanced data set and input it into the self-organizing map network, adding 1 to the training-count variable cycle each time a sample is input; the total number of input samples is |D|;
Step two-four: calculate the distance dis(X, w_j) between the sample vector X and the weight vector w_j of each output-layer neuron node;
Step two-five: select the node with the minimum distance dis(X, w_j) as the competition-winning node, denoted w*_j; then dis(X, w*_j) = min(dis(X, w_j)), 1 ≤ j ≤ c;
Step two-six: according to the formula w_j(t+1) = w_j(t) + η(t)·r_j(t)·(X − w_j(t)), adjust the weights of the competition-winning node w*_j and of the output-layer neuron nodes within its neighborhood, where η(t) denotes the learning-rate value and r_j the neighborhood value of the jth output-layer neuron;
Step two-seven: if cycle mod |D| = 0, calculate the R² clustering-criterion coefficient of the current output-layer neurons; if the R² value is greater than the threshold μ, terminate; otherwise go to step two-eight;
Step two-eight: among the weight-adjusted output-layer neurons, find the one with the largest within-class sum of squared deviations, insert a new output-layer neuron beside it, initialize the new node's weight to the mean of the adjacent output-layer neuron vectors, and go to step two-three.
In step three, the kernel function of the classifier SVM is transformed as follows:
K̃(x, x') = C(x)·C(x')·K(x, x')
where K(x, x') = exp(−‖x − x'‖²/(2σ²)), σ > 0, σ being the kernel parameter,
C(x) = Σ_{k∈SV} exp(−‖x − x_k‖²/(2τ_k²)),
SV is the number of support vectors, and τ_k² is the imbalance factor of the kth input sample, k = 1, 2, ..., SV.
In step three, the kernel-transformed classifier SVM is:
f(x) = sgn( Σ_{k=1}^{SV} α_k·y_k·K̃(x, x_k) + b )
where y_k is the class of the kth input sample, α_k is the parameter that determines the optimal classification hyperplane for the kth input sample, and b is the offset.
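The transformed kernel and the decision function above translate directly into code. A minimal sketch (NumPy assumed; the decision function is the standard SVM form written with the symbols defined here):

```python
import numpy as np

def rbf_kernel(x, x2, sigma):
    # K(x, x') = exp(-||x - x'||^2 / (2*sigma^2)), sigma > 0
    return np.exp(-np.sum((np.asarray(x) - np.asarray(x2)) ** 2) / (2.0 * sigma ** 2))

def transformed_kernel(x, x2, C, sigma):
    # K~(x, x') = C(x) * C(x') * K(x, x')
    return C(x) * C(x2) * rbf_kernel(x, x2, sigma)

def svm_decision(x, sv_x, sv_y, alpha, b, C, sigma):
    # f(x) = sgn( sum_k alpha_k * y_k * K~(x, x_k) + b ) over the support vectors
    s = sum(a * y * transformed_kernel(x, xk, C, sigma)
            for a, y, xk in zip(alpha, sv_y, sv_x))
    return 1 if s + b >= 0 else -1
```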
The method has two advantages. First, the unbalanced data are resampled, which removes a large amount of noise data that would affect classification, reduces the imbalance ratio, and reduces the occurrence of data flooding. Second, aiming at characteristics of unbalanced data sets such as data skew, high noise, serious information loss, and data flooding, a special kernel function suited to unbalanced data is constructed, and by introducing an imbalance factor the deviation between the optimal classification surface and the ideal classification surface is corrected automatically, effectively improving the classification effect on unbalanced data.
Drawings
Fig. 1 is a schematic diagram of the principle of the unbalanced data classification method based on cluster sampling kernel transformation according to the first embodiment.
Fig. 2 is a schematic diagram of the cluster sampling principle in the second embodiment.
Detailed Description
The first specific embodiment: this embodiment is described with reference to Fig. 1. The unbalanced data classification method based on cluster sampling kernel transformation according to this embodiment comprises the following steps:
Step one: vectorize the unbalanced data to be classified to obtain an unbalanced data set;
Step two: resample the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps to obtain a resampled unbalanced data set;
Step three: transform the kernel function of the classifier SVM, and classify the resampled unbalanced data set obtained in step two with the kernel-transformed classifier SVM to obtain the classified unbalanced data set.
This embodiment mainly addresses the classification problem for unbalanced data sets, organically combining two strategies: sample resampling and classifier improvement. The method creatively adopts a resampling method that combines unsupervised clustering with K-nearest-neighbor rules to select and prune the samples of the unbalanced data set; this not only effectively balances the skewed state of the samples but also greatly reduces the number of support vectors, markedly improving classification speed while improving the classification effect. The sampling method overcomes shortcomings of traditional sampling methods such as lack of theoretical basis, strong randomness, subjective human interference, and information loss, resolves the aliasing phenomenon in the data, and clearly improves the generalization performance of the subsequent SVM classifier. To adapt to the unbalanced state of the samples, the SVM classification model is also improved: through a transformation of the kernel function, an imbalance factor is introduced to adjust the classification surface automatically according to the imbalance ratio, achieving class-boundary calibration.
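The K-nearest-neighbor pruning rule is not spelled out at this point in the text; one plausible reading, sketched below purely as an assumption, is to discard majority-class samples whose k nearest neighbours are all majority-class, since such samples lie far from the class boundary and contribute little to the support vectors.

```python
import numpy as np

def knn_prune(X, y, k=5, majority=-1):
    # Assumed KNN pruning rule (the text does not define it): drop a
    # majority-class sample when all of its k nearest neighbours are also
    # majority-class, i.e. it is far from the class boundary.
    X, y = np.asarray(X), np.asarray(y)
    keep = []
    for i in range(len(X)):
        if y[i] != majority:
            keep.append(i)                       # always keep the small class
            continue
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]              # skip the sample itself
        if not np.all(y[nn] == majority):        # near the boundary -> keep
            keep.append(i)
    return X[keep], y[keep]
```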
For the classification of unbalanced data sets, resampling is an effective way to resolve data imbalance; the key is to eliminate a large amount of noise information and markedly reduce the degree of data skew while guaranteeing minimal information loss, so that most sample points useful for classification learning are retained. The self-organizing map (SOM) simulates the self-organizing characteristics of the human brain and realizes an order-preserving mapping from high-dimensional data onto a two-dimensional plane; SOM clustering therefore has clear advantages over other methods for high-dimensional data, has strong resistance to noise interference, and can be accelerated by parallel processing. The dynamic self-organizing map (V-SOM) clustering method is used to avoid the under-utilization of neurons caused by neuron expansion and to overcome the boundary effect easily produced by rectangular and similar structures.
The cluster sampling method based on the dynamic self-organizing map effectively balances the skewed state of the samples and greatly reduces the number of support vectors, markedly improving classification speed while improving the classification effect. This sampling method overcomes the shortcomings of traditional sampling methods, such as lack of theoretical basis, strong randomness, subjective human interference, and information loss, resolves the aliasing phenomenon in the data, and clearly improves the generalization performance of the subsequent SVM classifier.
Imbalance of the data categories causes the optimal classification hyperplane obtained by actual learning to be basically consistent in direction with the ideal hyperplane, yet far from the negative examples and close to the positive examples; this is the result of data flooding, and at test time the classification hyperplane shows a strong bias toward the negative class. The principle of the method is to apply a conformal transformation to the kernel function of the SVM: by introducing an imbalance factor adapted to the characteristics of the unbalanced data set, the classification boundary is adjusted automatically to relieve the closeness of the classification surface to the positive examples, so that the final classification hyperplane lies closer to the ideal hyperplane. In principle this adjustment differs from adjusting penalty factors and from boundary translation: the magnitude of the imbalance factor is computed from the degree of skew of the input unbalanced data set and from the data themselves, the movement of the classification surface is realized automatically without human intervention, and the lack of theoretical basis inherent in manually set parameters is avoided.
The second specific embodiment: this embodiment is a further limitation of the unbalanced data classification method based on cluster sampling kernel transformation. The method of resampling the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps, to obtain the resampled unbalanced data set, comprises the following steps:
Step two-one: initialize the self-organizing map network and set the training-count variable cycle to zero;
Step two-two: initialize the weights of the output-layer neuron nodes of the self-organizing map network; at t = 0, assign every output-layer node weight w_ij a random decimal, i.e. 0 < w_ij < 1, where i = 1, 2, ..., L, L being the vector dimension of each output-layer neuron, and j = 1, 2, ..., c, c being the number of output-layer neurons;
Step two-three: randomly select a sample vector X = (x_1, x_2, ..., x_L) from the unbalanced data set and input it into the self-organizing map network, adding 1 to the training-count variable cycle each time a sample is input; the total number of input samples is |D|;
Step two-four: calculate the distance dis(X, w_j) between the sample vector X and the weight vector w_j of each output-layer neuron node;
Step two-five: select the node with the minimum distance dis(X, w_j) as the competition-winning node, denoted w*_j; then dis(X, w*_j) = min(dis(X, w_j)), 1 ≤ j ≤ c;
Step two-six: according to the formula w_j(t+1) = w_j(t) + η(t)·r_j(t)·(X − w_j(t)), adjust the weights of the competition-winning node w*_j and of the output-layer neuron nodes within its neighborhood, where η(t) denotes the learning-rate value and r_j the neighborhood value of the jth output-layer neuron;
Step two-seven: if cycle mod |D| = 0, calculate the R² clustering-criterion coefficient of the current output-layer neurons; if the R² value is greater than the threshold μ, terminate; otherwise go to step two-eight;
Step two-eight: among the weight-adjusted output-layer neurons, find the one with the largest within-class sum of squared deviations, insert a new output-layer neuron beside it, initialize the new node's weight to the mean of the adjacent output-layer neuron vectors, and go to step two-three.
Cluster sampling is also known as whole-cluster sampling: some small groups are drawn from the population, and all elements within the drawn groups are then surveyed. The sampling unit is not a single individual but a group of individuals. The small groups can be drawn by simple random sampling, systematic sampling, or clustering methods. Its advantages are simplicity, convenience, and cost savings, and it is especially suitable when the overall sampling frame is difficult to determine.
The SVM classification principle indicates that the optimal classification hyperplane is determined only by the sample points near it, not by those far from it. Colloquially, if a classification algorithm can correctly separate a positive example from the negative examples that interfere with it most (i.e., are closest to it), then the negative examples that interfere less (i.e., are farther away) can naturally be separated correctly as well. In principle, therefore, the SVM classification model requires the collected samples to share some similarity of attribute features even when their classes differ, and the concept of cluster sampling conforms to exactly this principle; the cluster sampling process is shown in Fig. 2.
Resampling is an effective way to resolve data imbalance; the key is to eliminate a large amount of noise information and markedly reduce the degree of data skew while guaranteeing minimal information loss, so that most sample points useful for classification learning are retained. This patent proposes a new clustering method, dynamic self-organizing map clustering, to resolve this somewhat contradictory requirement; the method also has the advantage of clustering high-dimensional data well. The basic idea is to partition the original large-scale unbalanced data into N clusters by clustering; clusters whose sample points are all negative examples are deleted, and the remaining clusters are added to the selected sample set, as the sketch below illustrates.
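A minimal sketch of that selection rule, assuming the cluster assignments (for example, the winning SOM neuron of each sample) are already available; the function and parameter names are illustrative:

```python
import numpy as np

def cluster_prune(X, y, cluster_ids, negative=-1):
    # Keep every cluster that contains at least one non-negative sample;
    # clusters consisting entirely of negative examples are deleted.
    X, y, ids = np.asarray(X), np.asarray(y), np.asarray(cluster_ids)
    keep = np.zeros(len(X), dtype=bool)
    for c in np.unique(ids):
        members = ids == c
        if np.any(y[members] != negative):   # the cluster touches the small class
            keep |= members
    return X[keep], y[keep]
```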
The weights of the output-layer neurons are adjusted with the following formulas:
w_j(t+1) = w_j(t) + η(t)·r_j(t)·(x_i − w_j(t))   (1)
dis(x_i, w_j(t)) = 1 − sim(x_i, w_j(t))   (2)
where w_j(t+1) and w_j(t) denote the weight vector of neuron w_j after and before the adjustment, η(t) is the learning-rate function, and r_j(t) is the neighborhood function; both decrease gradually as training progresses. dis(x_i, w_j(t)) denotes the distance between the sample vector x_i and the neuron vector w_j(t), and its magnitude can be converted into a similarity computation: the greater the similarity between two vectors, the smaller their distance. In general the similarity can be computed with the cosine formula, i.e.
sim(x, w) = ( Σ_{i=1}^{L} W_{x_i}·W_{w_i} ) / ( √(Σ_{i=1}^{L} W_{x_i}²) · √(Σ_{i=1}^{L} W_{w_i}²) )   (3)
In equation (3), L denotes the dimension of the vectors, W_{x_i} denotes the weight of the sample vector x in the ith dimension, and W_{w_i} denotes the weight of the neuron vector w in the ith dimension; all vectors involved have been normalized.
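Equations (2) and (3) translate directly into code; a small sketch (NumPy assumed):

```python
import numpy as np

def sim(x, w):
    # Equation (3): cosine similarity of the (normalized) vectors x and w.
    return np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))

def dis(x, w):
    # Equation (2): distance as the complement of the similarity.
    return 1.0 - sim(x, w)
```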
The algorithm adopts the R² clustering-criterion coefficient as the judgment basis for seeking a balance between over-utilization and under-utilization of the neurons. Let m_i be the vector corresponding to neuron N_i; then the within-class sum of squared deviations of the samples mapped to N_i is
S_i = Σ_{x_j → N_i} dis(x_j, m_i)   (4)
At time t, assuming the output layer has c neurons in total, define P_c = Σ_{i=1}^{c} S_i. Let T be the total sum of squared deviations of all samples; then T = Σ_{i=1}^{|D|} dis(x_i, x̄)
where x̄ denotes the mean vector of all training samples and |D| denotes the total number of input samples. Then
R² = 1 − P_c / T   (5)
The clustering-criterion coefficient R² has the value range [0, 1], and its value generally increases monotonically with the growth of the network scale. The threshold μ must therefore be set to terminate the growth of the network at the appropriate time and prevent under-utilization of the neurons.
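Putting steps two-one through two-eight together with equations (1)-(5) gives the following training sketch. The learning-rate and neighborhood schedules, the initial network size, and the placement of an inserted neuron are not fixed by the text, so simple assumed choices are used, and the neighborhood is shrunk to the winner alone for brevity.

```python
import numpy as np

def cos_dis(x, w):
    # equation (2): 1 - cosine similarity
    return 1.0 - np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))

def vsom_train(X, mu=0.9, eta0=0.5, max_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    D, L = X.shape
    W = rng.random((2, L))                         # step two-two: weights in (0, 1)
    for epoch in range(max_epochs):                # cycles over the |D| samples
        eta = eta0 / (1 + epoch)                   # assumed learning-rate decay
        for x in X[rng.permutation(D)]:            # step two-three
            d = [cos_dis(x, w) for w in W]         # step two-four
            j = int(np.argmin(d))                  # step two-five: winner w*_j
            W[j] += eta * (x - W[j])               # step two-six, equation (1)
        wins = np.array([np.argmin([cos_dis(x, w) for w in W]) for x in X])
        S = np.array([sum(cos_dis(x, W[j]) for x in X[wins == j])
                      for j in range(len(W))])     # equation (4)
        T = sum(cos_dis(x, X.mean(axis=0)) for x in X)
        if 1.0 - S.sum() / T > mu:                 # equation (5): R^2 > mu
            break                                  # step two-seven: stop growing
        j_max = int(np.argmax(S))                  # step two-eight: grow the map
        W = np.vstack([W, (W[j_max] + W[(j_max + 1) % len(W)]) / 2.0])
    return W
```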
The third specific embodiment: in step three, the kernel function of the classifier SVM is transformed as follows:
K̃(x, x') = C(x)·C(x')·K(x, x')
where K(x, x') = exp(−‖x − x'‖²/(2σ²)), σ > 0, σ being the kernel parameter,
C(x) = Σ_{k∈SV} exp(−‖x − x_k‖²/(2τ_k²)),
SV is the number of support vectors, and τ_k² is the imbalance factor of the kth input sample, k = 1, 2, ..., SV.
As is well known, the basic idea of the kernel method is to map linearly inseparable samples in an input space I into a high-dimensional feature space F through φ(x), so that a linear classification surface is obtained in the high-dimensional space. Letting Z = φ(x), a neighborhood of the point x is mapped into the space F as
dZ = ∇φ·dx = Σ_i (∂φ(x)/∂x_i) dx_i   (6)
The quadratic form of dZ is then
|dZ|² = Σ_{i,j} g_ij(x) dx_i dx_j   (7)
where
g_ij(x) = (∂φ(x)/∂x_i)·(∂φ(x)/∂x_j)   (8)
The factor g_ij is called the Riemannian metric and is determined by the mapping φ(x). Although φ(x) is not given explicitly in the kernel method, g_ij can always be expressed in terms of the kernel function K through the computational techniques associated with kernel functions:
g_ij(x) = ( ∂²K(x, x') / (∂x_i ∂x'_j) )|_{x'=x}   (9)
The Riemannian metric g_ij(x) characterizes the extent to which a small region of the input space I is enlarged after being mapped into the feature space F by φ(x); this is the basis of the kernel-function transformation. To overcome the interference of an unbalanced data set with model training, the kernel function K is transformed so as to enlarge the Riemannian metric g_ij(x) near the classification boundary while shrinking it at the remaining sample points. The following transformation of the kernel function is therefore introduced:
K̃(x, x') = C(x)·C(x')·K(x, x')   (10)
The RBF kernel is used as the raw kernel function of the classifier, in view of its good performance on most classes. Substituting equation (10) into equation (9), the following result is obtained for such a kernel function:
g̃_ij(x) = (∂C(x)/∂x_i)·(∂C(x)/∂x_j) + C(x)²·g_ij(x)   (11)
As equation (11) shows, the function C(x) must have a first derivative of 0 at its maximum, to ensure that the transformation causes no angular change between feature points and that larger values are taken at the support vectors. The selected kernel function is the RBF kernel, whose form is:
K(x, x') = exp(−‖x − x'‖²/(2σ²)), σ > 0   (12)
C(x) is a projection function that must be chosen appropriately so that the resulting K̃ takes larger values at the classification boundary; the RBF form of the projection function is chosen:
C(x) = Σ_{k∈SV} exp(−‖x − x_k‖²/(2τ_k²))   (13)
where τ_k² is an important parameter called the imbalance factor. It must make C(x) take larger values closer to the classification interface and smaller values farther from it, and can therefore be set according to the following formula:
τ_k² = AVG_{i ∈ {‖Φ(x_i) − Φ(x_k)‖² < M, y_i ≠ y_k}} ( ‖Φ(x_i) − Φ(x_k)‖² )   (14)
where M is a given distance constant, and ‖Φ(x_i) − Φ(x_k)‖² can be obtained through kernel-function operations:
‖Φ(x_i) − Φ(x_k)‖² = K(x_i, x_i) + K(x_k, x_k) − 2K(x_i, x_k)   (15)
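Equations (13)-(15) combine into a small routine for the imbalance factors and the projection function. A sketch under stated assumptions (NumPy; the fallback value when a support vector has no opposite-class neighbour within M is an assumption, since the text does not cover that case):

```python
import numpy as np

def rbf(xi, xj, sigma):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def feat_dist2(xi, xk, sigma):
    # Equation (15): ||Phi(x_i) - Phi(x_k)||^2 via the kernel trick.
    return rbf(xi, xi, sigma) + rbf(xk, xk, sigma) - 2.0 * rbf(xi, xk, sigma)

def imbalance_factors(sv_x, sv_y, sigma, M):
    # Equation (14): tau_k^2 averages the feature-space squared distances
    # from support vector k to opposite-class support vectors closer than M.
    taus = []
    for xk, yk in zip(sv_x, sv_y):
        d2 = [feat_dist2(xi, xk, sigma) for xi, yi in zip(sv_x, sv_y)
              if yi != yk and feat_dist2(xi, xk, sigma) < M]
        taus.append(np.mean(d2) if d2 else M)      # assumed fallback
    return np.array(taus)

def C(x, sv_x, tau2):
    # Equation (13): RBF projection function built on the support vectors.
    return sum(np.exp(-np.sum((x - xk) ** 2) / (2.0 * t2))
               for xk, t2 in zip(sv_x, tau2))
```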
it is necessary to discuss how the imbalance ratio plays a role in kernel function adjustment, that is, while considering the Riemann metric corresponding to the change support vector, the imbalance ratio must be introduced to ensure that the positive and negative examples correspond toIn a small sample classOf the larger and large classIs smaller. Here, letWhileThe two factors are respectively used as positive example samples and negative example samplesThe desired effect can be achieved.
By introducing the imbalance factor τ_k², and because of the skew between positive and negative examples in the unbalanced data set and the difference in the numbers of positive and negative support vectors near the classification surface, the positive and negative imbalance factors take different sizes. Through the transformation, the kernel function internally contains the imbalance-factor parameters, and the imbalance factors automatically adjust the classification boundary according to the skew of the input data. In principle this adjustment differs from adjusting penalty factors and from boundary translation: the magnitude of the imbalance factor is computed from the degree of skew of the input unbalanced data set and from the data themselves, the movement of the classification surface is realized automatically without human intervention, and the lack of theoretical basis inherent in manually set parameters is avoided.
The fourth specific embodiment: this embodiment is a further limitation of the unbalanced data classification method based on cluster sampling kernel transformation described in the third embodiment.
In step three, the kernel-transformed classifier SVM is:
f(x) = sgn( Σ_{k=1}^{SV} α_k·y_k·K̃(x, x_k) + b )
where y_k is the class of the kth input sample, α_k is the parameter that determines the optimal classification hyperplane for the kth input sample, and b is the offset. To verify the effect and influence of the invention on the classification of unbalanced data sets, four data sets were selected from the public standard data sets MUC-6 and UCI for verification; their composition is shown in Table 1. It is worth noting that the imbalance ratios of the UCI-Seg1 and UCI-Glass7 data sets are essentially the same, but their degrees of positive-example information scarcity differ greatly; they were chosen to verify the impact that the scarcity of positive-example information in unbalanced data has on the classification method and results. In practical applications of unbalanced data-set classification, the classification effect on the small class is usually the greater concern, so the precision, recall, and F-measure of positive-example classification were selected as the technical indexes for measuring the classification effect in the experiments. The calculation formulas are the standard ones: Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F-measure = 2·Precision·Recall/(Precision + Recall), where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives of the positive class.
TABLE 1 List of unbalanced data sets

Data set      Negative-example samples   Positive-example samples   Imbalance ratio
MUC-6         18815                      1266                       15:1
UCI-Seg1      1980                       330                        6:1
UCI-Glass7    185                        29                         6.4:1
UCI-Abalone   4145                       32                         130:1
Four method strategies were used to run experiments on the four data sets above, to verify the role of the three algorithms in classification. In each experiment, 50% of each data set was used for training and the remaining 50% for testing, with the training and testing data guaranteed to have the same imbalance ratio (a sketch of such a split is given below). The detailed strategies of the four methods are as follows:
Method one: classify with an SVM model using an ordinary RBF kernel;
Method two: classify with an SVM model based on kernel transformation;
Method three: introduce the KNN pruning algorithm on the basis of method two;
Method four: introduce the cluster sampling algorithm on the basis of method three.
Comparing the experimental results of methods one and two verifies the influence of the kernel transformation on the classification effect; comparing methods two and three verifies the influence of the KNN pruning algorithm; and comparing methods three and four verifies the influence of cluster sampling.
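The 50/50 split that preserves the imbalance ratio amounts to splitting each class separately; a sketch of such a stratified split (NumPy assumed):

```python
import numpy as np

def stratified_half_split(y, seed=0):
    # Split each class 50/50 so that the training and testing halves keep
    # the same imbalance ratio, as the protocol above requires.
    rng = np.random.default_rng(seed)
    train, test = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.where(np.asarray(y) == cls)[0])
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return np.array(train), np.array(test)
```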
Comparison and analysis of the experimental results yield some meaningful conclusions. (1) For unbalanced data classification, the kernel-transformation method does not help improve the precision of positive-example classification, but it clearly improves the recall and ultimately improves the overall classification performance of the system; that is, more positive examples can be found with kernel transformation. (2) For unbalanced data classification, the KNN pruning algorithm improves precision but does not help recall, and ultimately improves the overall classification performance; that is, KNN pruning reduces the misclassification of negative examples. However, the algorithm caused serious degradation, and even failure, of system performance in the classification experiments on the UCI-Glass7 and UCI-Abalone data sets. Analysis suggests this result is caused by a serious shortage of positive-example information and is not strongly related to the imbalance ratio of the data: the imbalance ratios of UCI-Glass7 and UCI-Seg1 are essentially the same, but UCI-Glass7 has only 29 positive samples, so the training data contain only 14 positive examples, and little positive-example information remains after pruning, clearly degrading the classification effect; on the UCI-Abalone data set the method fails because the few positive examples are almost completely deleted. It follows that the classification of unbalanced data should consider not only the imbalance ratio: the degree of information shortage is also an important factor affecting the classification method and effect. (3) For unbalanced data classification, the cluster sampling method improves both precision and recall, further proving that resampling is an effective way to solve unbalanced data classification.
Unbalanced data-set classification has many fields of application, for example medical diagnosis, cancer detection, credit-card and insurance fraud detection, bioinformatics such as protein detection, enterprise bankruptcy, fault detection, and customer churn prediction.

Claims (4)

1. An unbalanced data classification method based on cluster sampling kernel transformation, characterized by comprising the following steps:
Step one: vectorize the unbalanced data to be classified to obtain an unbalanced data set;
Step two: resample the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps to obtain a resampled unbalanced data set;
Step three: transform the kernel function of the classifier SVM, and classify the resampled unbalanced data set obtained in step two with the kernel-transformed classifier SVM to obtain the classified unbalanced data set.
2. The unbalanced data classification method based on cluster sampling kernel transformation according to claim 1, characterized in that the method of resampling the vectors in the unbalanced data set obtained in step one with the cluster sampling method based on dynamic self-organizing maps, to obtain the resampled unbalanced data set, comprises the following steps:
Step two-one: initialize the self-organizing map network and set the training-count variable cycle to zero;
Step two-two: initialize the weights of the output-layer neuron nodes of the self-organizing map network; at t = 0, assign every output-layer node weight w_ij a random decimal, i.e. 0 < w_ij < 1, where i = 1, 2, ..., L, L being the vector dimension of each output-layer neuron, and j = 1, 2, ..., c, c being the number of output-layer neurons;
Step two-three: randomly select a sample vector X = (x_1, x_2, ..., x_L) from the unbalanced data set and input it into the self-organizing map network, adding 1 to the training-count variable cycle each time a sample is input; the total number of input samples is |D|;
Step two-four: calculate the distance dis(X, w_j) between the sample vector X and the weight vector w_j of each output-layer neuron node;
Step two-five: select the node with the minimum distance dis(X, w_j) as the competition-winning node, denoted w*_j; then dis(X, w*_j) = min(dis(X, w_j)), 1 ≤ j ≤ c;
Step two-six: according to the formula w_j(t+1) = w_j(t) + η(t)·r_j(t)·(X − w_j(t)), adjust the weights of the competition-winning node w*_j and of the output-layer neuron nodes within its neighborhood, where η(t) denotes the learning-rate value and r_j the neighborhood value of the jth output-layer neuron;
Step two-seven: if cycle mod |D| = 0, calculate the R² clustering-criterion coefficient of the current output-layer neurons; if the R² value is greater than the threshold μ, terminate; otherwise go to step two-eight;
Step two-eight: among the weight-adjusted output-layer neurons, find the one with the largest within-class sum of squared deviations, insert a new output-layer neuron beside it, initialize the new node's weight to the mean of the adjacent output-layer neuron vectors, and go to step two-three.
3. The unbalanced data classification method based on cluster sampling kernel transformation according to claim 2, characterized in that in step three the kernel function of the classifier SVM is transformed as follows:
K̃(x, x') = C(x)·C(x')·K(x, x')
where K(x, x') = exp(−‖x − x'‖²/(2σ²)), σ > 0, σ being the kernel parameter,
C(x) = Σ_{k∈SV} exp(−‖x − x_k‖²/(2τ_k²)),
SV is the number of support vectors, and τ_k² is the imbalance factor of the kth input sample, k = 1, 2, ..., SV.
4. The unbalanced data classification method based on cluster sampling kernel transformation according to claim 3, characterized in that in step three the kernel-transformed classifier SVM is:
f(x) = sgn( Σ_{k=1}^{SV} α_k·y_k·K̃(x, x_k) + b )
where y_k is the class of the kth input sample, α_k is the parameter that determines the optimal classification hyperplane for the kth input sample, and b is the offset.
CN201410342031.5A 2014-07-17 2014-07-17 Unbalance data classifying method based on cluster sampling kernel transformation Pending CN104063520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410342031.5A CN104063520A (en) 2014-07-17 2014-07-17 Unbalance data classifying method based on cluster sampling kernel transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410342031.5A CN104063520A (en) 2014-07-17 2014-07-17 Unbalance data classifying method based on cluster sampling kernel transformation

Publications (1)

Publication Number Publication Date
CN104063520A true CN104063520A (en) 2014-09-24

Family

ID=51551234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410342031.5A Pending CN104063520A (en) 2014-07-17 2014-07-17 Unbalance data classifying method based on cluster sampling kernel transformation

Country Status (1)

Country Link
CN (1) CN104063520A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104373338A (en) * 2014-11-19 2015-02-25 北京航空航天大学 Hydraulic pump fault diagnosing method based on LMD-SVD and IG-SVM
CN106022511A (en) * 2016-05-11 2016-10-12 北京京东尚科信息技术有限公司 Information predicting method and device
CN110706749A (en) * 2019-09-10 2020-01-17 至本医疗科技(上海)有限公司 Cancer type prediction system and method based on tissue and organ differentiation hierarchical relation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050154701A1 (en) * 2003-12-01 2005-07-14 Parunak H. Van D. Dynamic information extraction with self-organizing evidence construction
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method of unbalance data
CN103605711A (en) * 2013-11-12 2014-02-26 中国石油大学(北京) Construction method and device, classification method and device of support vector machine

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050154701A1 (en) * 2003-12-01 2005-07-14 Parunak H. Van D. Dynamic information extraction with self-organizing evidence construction
CN101980202A (en) * 2010-11-04 2011-02-23 西安电子科技大学 Semi-supervised classification method of unbalance data
CN103605711A (en) * 2013-11-12 2014-02-26 中国石油大学(北京) Construction method and device, classification method and device of support vector machine

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI PENG 等: "Imbalanced Data SVM Classification Method Based on Cluster Boundary Sampling and DT-KNN Pruning", 《INTERNATIONAL JOURNAL OF SIGNAL PROCESSING,IMAGE PROCESSING AND PATTERN RECOGNITION》 *
李鹏 (LI PENG) et al.: "A Classification Method for Imbalanced Data Sets Based on a Hybrid Strategy", 《ACTA ELECTRONICA SINICA》 *
李鹏 (LI PENG) et al.: "Research Progress on Classification Techniques for Imbalanced Data Sets", 《SCIENCEPAPER ONLINE》 *
陶新民 (TAO XINMIN) et al.: "Kernel Clustering Ensemble SVM Algorithm for Imbalanced Data", 《JOURNAL OF HARBIN ENGINEERING UNIVERSITY》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104373338A (en) * 2014-11-19 2015-02-25 北京航空航天大学 Hydraulic pump fault diagnosing method based on LMD-SVD and IG-SVM
CN106022511A (en) * 2016-05-11 2016-10-12 北京京东尚科信息技术有限公司 Information predicting method and device
CN110706749A (en) * 2019-09-10 2020-01-17 至本医疗科技(上海)有限公司 Cancer type prediction system and method based on tissue and organ differentiation hierarchical relation
CN110706749B (en) * 2019-09-10 2022-06-10 至本医疗科技(上海)有限公司 Cancer type prediction system and method based on tissue and organ differentiation hierarchical relation


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140924