CN110309302A

CN110309302A - A kind of uneven file classification method and system of combination SVM and semi-supervised clustering

Info

Publication number: CN110309302A
Application number: CN201910414208.0A
Authority: CN
Inventors: 姜震; 熊相真; 杜阳; 冯路捷; 孙祥瑜
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2019-05-17
Filing date: 2019-05-17
Publication date: 2019-10-08
Anticipated expiration: 2039-05-17
Also published as: CN110309302B

Abstract

The invention discloses the uneven file classification methods and system of a kind of combination SVM and semi-supervised clustering, pre-process to text to be processed, obtain the text data of vector format as data set；SVM classifier is trained using training set to obtain disaggregated model, test set is predicted using disaggregated model, obtains the generic and confidence level of test set；It is clustered using semi-supervised clustering algorithm to data set, obtains the generic and its confidence level of test set；The obtained test set generic and its confidence level of SVM classifier and semi-supervised clustering algorithm are merged, final output is obtained.Present invention incorporates different types of methods in the technical field of uneven text classification, realize the mutual supplement with each other's advantages of distinct methods, use vectorization and normalized method, it compensates for when handling the sparse text data of higher-dimension, because having the shortcomings that text classification result is inaccurate caused by label text is very few.Efficiently solve the problems, such as that text categories are unbalanced.

Description

A kind of uneven file classification method and system of combination SVM and semi-supervised clustering

Technical field

The invention belongs to natural language processing field, especially uneven text classification field more particularly to a kind of combination The uneven file classification method and system of SVM and semi-supervised clustering.

Background technique

Text classification problem is a classical problem in natural language processing field, information filtering, mail classification, The fields such as query intention prediction, text subject tracking, which suffer from, widely applies.Traditional file classification method primarily directed to The design of text classification problem is balanced, when treatment scale is smaller, data distribution is uniform and intensive balance text classification problem It works well.But still have more limitation.Especially in practical application, due to class imbalance, there is label text Very few and sample has the characteristics that higher-dimension is sparse, increases the complexity of text classification, classification accuracy is caused to decline, limit The application of file classification method in practice.

Currently, solving these problems mainly has following a few class methods and thinking:

1) it aiming at the problem that class imbalance in text classification, proposes and changes metric form, resampling, cost correlation The solutions such as habit.Generally use ROC curve, F- measurement isometry mode；Upper sampling, lower sampling, the resampling of mixed sampling Method；The mistake for increasing small class text is divided into this cost relational learning method.These methods can preferably solve lower dimensional space In class imbalance problem, but be directed to higher dimensional space specific to text classification problem, the cost of study it is very high and It as a result is not very accurate.

2) for the problem for having label text very few in text classification, the semi-supervised algorithm of two classes is proposed.One kind, original Disaggregated model in increase a part depend on the item without label text so that final text classification result is by there is label text Codetermined with no label text, solved the problems, such as that label text is very few, if but during realization disaggregated model It is mismatched with text, algorithm performance can be reduced with training.It is another kind of, using there is label text one classifier of training, so The classification for demarcating unmarked text afterwards obtains pseudo label text, finally obtains a new classification using all text training Device is repeated until convergence.This method has also solved the problems, such as that label text is very few, but due to existing in pseudo label text Noise, repetition training will lead to noise accumulation, reduce the accuracy of text classification.

3) aiming at the problem that having text to have the characteristics that higher-dimension is sparse in text classification, the method for Feature Compression is proposed, Two classes: feature selecting and feature extraction can be divided into.Wherein, feature extraction be according to certain criterion from text extraction feature； Feature selecting is that selected section most has the feature of class discrimination ability from primitive character according to certain criterion.Both methods subtracts Lack text data in training and bring expense on the time of classifying, while also reducing a possibility that dimension disaster occurs.But It is that can inevitably cast out effective text information in compression, cause the problem that text classification is not accurate enough.

Based on certain deficiency present in the above-mentioned method for solving the problems, such as imbalance text classification, preferably divide to reach Class effect, it is necessary to provide the more efficient algorithm of one kind to solve the above problems.

Summary of the invention

The present invention proposes the imbalance of a kind of combination SVM and semi-supervised clustering according to problems of the prior art File classification method and system can improve and existing imitate to classifier single in uneven text classification problem or algorithm classification The bad situation of fruit is finally reached the purpose of uneven text Accurate classification.

The technical solution adopted in the present invention is as follows:

A kind of uneven file classification method of combination SVM and semi-supervised clustering, process are as follows:

S1, text to be processed is pre-processed, obtains the text data of vector format as data set；The data set It is divided into training set and test set；

S2, SVM classifier is trained using training set to obtain disaggregated model, test set is carried out using disaggregated model Prediction, obtains the generic and confidence level of test set；

S3, it is clustered using semi-supervised clustering algorithm to data set, obtains the generic and its confidence level of test set；

S4, the obtained test set generic and its confidence level of SVM classifier and semi-supervised clustering algorithm are melted It closes, obtains final output, realization makes final classification to uneven text.

Further, S2 process are as follows:

S2.1. in training set, using one-to-one (one-versus-one) method, one is found between any two classes sample A hyperplane separates different classes of text, obtains being divided into more classification problems multiple based on the disaggregated model of SVM training Two classification problems；

S2.2. weight is arranged to the distance of the sample in training set to hyperplane, acquires a new decision function；

S2.3. sample generic and its probability are calculated according to new decision function；For multicategory classification, using one It votes (one-versus-one) method, obtains the generic of final test text；

S2.4. confidence level is gone out by probability calculation.

Further, the new decision function indicates are as follows:

When positive and negative samples are classified, it is contemplated that text has class imbalance feature in classification problem.It is asked for this Topic, we to it plus a weight, make after it calculates decision function value (decision function value and sample arrive distance and be positively correlated) Threshold value is obtained to move.Wherein,Respectively indicate the weight added when label is positive and negative samples；N₊ Represent the number of samples that label is positive, N_-The number of samples that label is negative is represented, f (x) is the decision function of SVM.

Further, the S3 process are as follows:

S3.1. the number of clusters amount (K value) and generic determined by training set, according to the label of sample each in training set according to Cluster that is secondary to be divided to corresponding cluster, being initialized；

S3.2. to each cluster, mass center is updated, and sample is divided into again according to the distance that sample reaches mass center by each cluster In；

S3.3. judge whether each cluster meets splitting condition, the cluster for the condition that meets is divided, update mass center and K value again；

S3.4. according to the distance between each test set sample and mass center, sample is repartitioned into corresponding cluster, and is counted Calculate its confidence level；

S3.5. the above S3.2-S3.4 step is repeated, until meeting stopping criterion for iteration；

S3.6. according to cluster generic, the generic and confidence level of test text are obtained.

Further, the process of mass center is updated in the S3.2 are as follows:

S3.2.1. mass center is calculated:

S3.2.2. sample reaches the distance of mass center are as follows:

S3.2.3. the distance weighted processing that mass center is reached to sample, obtains:

Wherein, | C_m| indicate cluster C_mSample number, μ_mIndicate cluster C_mCorresponding mass center.x_iFor some sample, V in cluster_mIt indicates Mass center μ_mThe number of sample in generic, V represent the number of all samples, and K is the dimension of mass center.μ_m[i] and x [i] difference table Show mass center μ_mWith the ith feature value of sample x.

Further, the splitting condition are as follows: whether training of judgement collection includes noise, and (mistake divides sample if there is noise in current cluster This), then current cluster is divided, conversely, not needing then to be divided.

Further, the process of the S4 are as follows: calculate separately sensitivity S E under SVM classifier and semi-supervised clustering and special Property SP；The Gmean value for calculating separately the sensitivity S E and specificity SP under SVM classifier and semi-supervised clustering, according to being obtained Gmean value obtain a weight mu, using the weight mu to the confidence level C of classification results_SKAS(x_i) normalized is done, according to The result of normalized is determined using the classification results in SVM classifier or semi-supervised clustering.To identified classification results It is merged, exports the final prediction result of test text.

Further, the calculation method of the confidence level is that maximum probability subtracts its second maximum probability in class categories.

Further, the pretreated process are as follows: select the keyword in text to be measured, remove stop words；According to crucial Word frequency rate calculates weight, finally makes text vector to be measured, then vector is normalized using deviation standardized mode Processing.

The present invention also designs the uneven Text Classification System of a kind of combination SVM and semi-supervised clustering, including pretreatment list Member, training unit and predicting unit；

The pretreatment unit carries out vectorization processing to text to be measured, and makees normalized to vector, final to obtain The data set of the format of vector, and data set is input to training unit；

The training unit includes SVM cell and semi-supervised clustering unit, is utilized respectively SVM cell and semi-supervised clustering list Member classifies to test set, obtains test set generic and its confidence level；

The predicting unit: the generic and its confidence level that SVM classifier and semi-supervised clustering unit are exported pass through Fusion treatment obtains final result.

Beneficial effects of the present invention:

Classification method designed by the present invention is built-up by SVM classifier and semi-supervised Kmeans algorithm combination, realizes The mutual supplement with each other's advantages of two methods.Using vectorization and normalized method, the text data sparse in processing higher-dimension is compensated for When, because having the shortcomings that text classification result is inaccurate caused by label text is very few.By improving semi-supervised clustering algorithm, Solve the problems, such as that text categories are unbalanced.Solves the initialization K value in semi-supervised clustering using the text classification result of SVM With the doubt problem of mass center.Meanwhile the present invention devises a kind of splitting algorithm, it is accurate when can effectively promote text classification Degree.The present invention is significantly improved with existing to classifier or algorithm classification effect single in uneven text classification problem not Good situation is finally reached the purpose of uneven text Accurate classification.

Detailed description of the invention

Fig. 1 is that the present invention is directed to class imbalance file classification method flow chart；

Fig. 2 is SVM classifier process flow diagram；

Fig. 3 is semi-supervised clustering algorithm process flow chart；

Fig. 4 is using Gmean value to SVM and semi-supervised clustering processing result treatment process schematic diagram

Fig. 5 is the schematic diagram of uneven file classification method；

Fig. 6 is a kind of frame diagram of the uneven Text Classification System of combination SVM and semi-supervised clustering.

Specific embodiment

In order to make the objectives, technical solutions, and advantages of the present invention clearer, with reference to the accompanying drawings and embodiments, right The present invention is further elaborated.It should be appreciated that described herein, the specific embodiments are only for explaining the present invention, not For limiting the present invention.

As shown in Figure 1, 2, the uneven text classification side of a kind of the combination SVM and semi-supervised clustering designed by the present invention Method, detailed process is as follows:

S1, the keyword being first sorted out in text to be measured, rejecting act on text classification little stop words；According to pass Keyword frequency calculates weight, to make text vector to be measured；Secondly, carrying out normalizing to vector in the standardized method of deviation Change processing；The finally format output data set using vector format used in libsvm as standard, so as to it is subsequent processing and It uses.

Obtained document format data is as follows:

[label][index1]:[value1][index2]:[value2]…

Wherein, Label is label value, and index is sequential index, is usually continuous integer, i.e. feature number, It must be arranged according to ascending order, value is characteristic value, and usually a pile real number forms.

S2, svm classifier algorithm, detailed process is as follows as shown in Figure 3:

S2.1. it trains: in training set, using one-to-one (one-versus-one) method, between any two classes sample Find a hyperplane, different classes of text separated, thus the sample of k classification just need to design k (k-1)/2 it is super flat Face obtains that more classification problems are divided into multiple two classification problems based on the disaggregated model of SVM training；

Seek the hyperplane indicated by f (x)=wx+b, so that objective function meets following formula:

Wherein, ξ_iFor the slack variable on normal data, c is given penalty factor, and w and b are indicate hyperplane two A parameter

S2.2. in traditional SVM classifier, former decision function obtains text by the distance between text and hyperplane Generic, and former decision function is as follows:

Wherein, y_iFor sample label value, α_iFor Lagrange multiplier, K (x_i, x) and it is kernel function, b is hyperplane displacement, and l is Vector dimension；

Since sample number of all categories in data set is seriously uneven, training process can be tilted to multiclass, to sample to super The distance of plane adds a weight, the number of samples N that the present invention is positive according to label₊The number of samples N being negative with label_-It calculates New decision function is obtained, expression formula is as follows by the weight in conjunction with former decision function for the weight of each classification out:

Wherein, f (x) is former decision function, as can be seen that the lesser classification of amount of text will have from expression formula (3) Biggish weight, the biggish classification of amount of text will have lesser weight, solve SVM classifier and handle uneven text When the problem ineffective for minority class text prediction,

S2.3. the numerical value on real axis can be projected on [0,1] using sigmoid function, i.e., by an output real value It is converted into a probability value.Sigmoid function representation are as follows:

The present invention further improves probability output function, the formula of the probability value of output are as follows:

Wherein, p_iFor the probability value f of i-th of sample output_iFor new decision function, A and B are for adjusting mapping value The determination method of the parameter of size, A and B are as follows:

Wherein, by parameter t_iBe divided into sample be positive class when corresponding value t₊With sample be negative class when corresponding value t_-,

S2.4. after calculating probability, then confidence level obtained by probability, existing method is the maximum class prediction probability of selection As confidence level；Calculating employed in the present invention using confidence level is that probability maximum in sample classification classification is subtracted it Second maximum probability, expression formula are as follows:

C_svm(x_i)=P_svm(y=c_{max_j}|x_j)-P_svm(y=c_{sub_max_j}|x_j) (7)

Calculating confidence level using the method can be pseudo- reference numerals of the data exclusion in SVM for being likely located at class overlapping region Except collection, solve the problems, such as that SVM performance under class overlapping cases declines.

S3. semi-supervised clustering algorithm, detailed process is as follows as shown in Figure 4:

S3.1. according to the determining number of clusters amount (K value) of training sample and generic, by the sample in training set according to its class Distinguishing label is successively divided to corresponding cluster, the cluster initialized.

S3.2. for each cluster, its mass center is updated, and sample is divided into respectively again according to the distance that sample reaches mass center In a cluster；Wherein, the method for mass center is updated are as follows:

S3.2.1. mass center is the average value of all samples (vector) in cluster, and expression formula is as follows:

Wherein μ_mIndicate cluster C_mMass center, | C_m| indicate cluster C_mSample number (wherein K=| C_m|)

S3.2.2. sample is calculated at a distance from each mass center using Euclidean distance, expression formula is as follows:

μ_mIndicate mass center.K is the dimension of mass center.μ_m[i] and x [i] respectively indicate mass center μ_mWith the ith feature of sample x Value.

S3.2.3. since text data concentrates sample number of all categories seriously uneven, training process can be tilted to multiclass, Therefore, the distance to text to mass center adds a weight, threshold value can be made to move, and new range formula is as follows:

S3.3. judge whether each cluster meets splitting condition, the cluster for the condition that meets is divided, and update mass center and K value.

The condition of division are as follows: if there is noise (error sample) in current cluster, current cluster is divided, conversely, then It does not need to be divided.

The process of division are as follows: divisional mode: current cluster C is found_mThe interior sample point x farthest apart from mass center, if x and mass center Distance is r.Current cluster is then divided into A and B two parts:

A={ d (x, μ_m)≤r|x∈C_m} (11)

B={ d (x, μ_m) > r | x ∈ C_m} (12)

Wherein, d (x, μ_m) it is sample point x and mass center μ_mDistance.

S3.4.1. the calculating of distance:

S3.4.2. need first to calculate the probability that text belongs to each classification, the calculating side of probability before calculating confidence level Formula is as follows:

P [i]=(1/d [cluster [i]])/sum (14)

Wherein, P [i] representative sample i belongs to the probability of current cluster, the cluster label that cluster [i] representative sample i belongs to, d [j] is represented in j-th of cluster, and current text i reaches the distance of mass center, and sum represents the distance that current sample i reaches each mass center It is reciprocal and.Sum calculation formula is as follows:

S3.4.3. confidence level, confidence level are measured by the difference of the maximum probability of classification and it and the second maximum probability Calculation formula it is as follows:

C_SKAS(x_i)=P_SKAS(y=C_{max_j}|x_i)-P_SKAS(y=C_{sub_max_j}|x_i) (16)

S4, blending algorithm, as shown in Figure 5:

S4.1. the sensitivity S E (sensitivity, SE) and spy under SVM classifier and semi-supervised Kmeans are calculated separately Anisotropic SP (specificity, SP)；Calculation formula is as follows:

Wherein, TP is real example (true positive): being correctly divided into the sample number of positive class, FN is false positive example (false positive): being mistakenly divided into the sample number of positive class, and TN is true counter-example (true negative): being correctly divided into The sample number of anti-class, FP are false counter-example (false negative): being mistakenly divided into the sample number of anti-class.

S4.2. the Gmean value of the sensitivity S E and specificity SP under SVM classifier and semi-supervised Kmeans are calculated separately, Gmean value calculation formula is as follows:

S4.3. by the Gmean value W of SVM classifier₁With the Gmean value W of semi-supervised Kmeans₂A weight mu is calculated, is counted It is as follows to calculate formula:

S4.4. using weight mu to the confidence level C of classification results_SKAS(x_i) normalized, the process of normalized is such as Under:

It is determined according to the result of normalized using the classification results in SVM classifier or semi-supervised Kmeans.

S4.5. classification results used by will be final, are merged, export the final prediction result of test text.

Uneven file classification method based on a kind of combination SVM designed by the present invention and semi-supervised clustering, the present invention The uneven Text Classification System that also proposed a kind of combination SVM and semi-supervised clustering, as shown in fig. 6, the system includes Pretreatment unit, training unit and predicting unit；

Pretreatment unit carries out vectorization processing to text to be measured, and makees normalized to vector, finally obtains vector Format data set, and data set is input to training unit；

Training unit includes SVM cell and semi-supervised Kmeans unit, is utilized respectively SVM cell and semi-supervised Kmeans is mono- Member classifies to test set, obtains test set generic and its confidence level；

Predicting unit: the generic and its confidence level that SVM classifier and semi-supervised Kmeans unit are exported, through normalizing After change processing, the two result is merged, i.e., obtains the final generic of test text according to decision function.

Above embodiments are merely to illustrate design philosophy and feature of the invention, and its object is to make technology in the art Personnel can understand the content of the present invention and implement it accordingly, and protection scope of the present invention is not limited to the above embodiments.So it is all according to It is within the scope of the present invention according to equivalent variations made by disclosed principle, mentality of designing or modification.

Claims

1. the uneven file classification method of a kind of combination SVM and semi-supervised clustering, which comprises the following steps:

S1. text to be processed is pre-processed, obtains the text data of vector format as data set；The data set is divided into Training set and test set；

S2. SVM classifier is trained using training set to obtain disaggregated model, test set is carried out using disaggregated model pre- It surveys, obtains the generic and confidence level of test set；

S3. it is clustered using semi-supervised clustering algorithm to data set, obtains the generic and its confidence level of test set；

S4. the obtained test set generic and its confidence level of SVM classifier and semi-supervised clustering algorithm are merged, is obtained To final output, realization makes final classification to uneven text.

2. the uneven file classification method of a kind of combination SVM and semi-supervised clustering according to claim 1, feature exist In S2 process are as follows:

S2.1. in training set, using one-to-one method, a hyperplane is found between any two classes sample, it will be different classes of Text separate；

S2.3. sample generic and its probability are calculated according to new decision function；Wherein, it for multicategory classification, uses One-to-one method ballot, obtains the generic of final test text；

S2.4. confidence level is gone out by probability calculation.

3. the uneven file classification method of a kind of combination SVM and semi-supervised clustering according to claim 2, feature exist In the new decision function indicates are as follows:

Wherein,Respectively indicate the weight added when label is positive and negative samples；N₊Label is represented to be positive Number of samples, N_-The number of samples that label is negative is represented, f (x) is the decision function of SVM.

4. the uneven file classification method of a kind of combination SVM and semi-supervised clustering according to claim 1, feature exist In the S3 process are as follows:

S3.1. the number of clusters amount and generic determined by training set, is successively divided to according to the label of sample each in training set Corresponding cluster, the cluster initialized；

S3.2. to each cluster, mass center is updated, and sample is divided into each cluster again according to the distance that sample reaches mass center；

S3.4. according to the distance between each test set sample and mass center, sample is repartitioned into corresponding cluster, and calculate it Confidence level；

5. the uneven file classification method of a kind of combination SVM and semi-supervised clustering according to claim 4, feature exist In the process of the update mass center in the S3.2 are as follows:

S3.2.1. mass center is calculated:

S3.2.2. sample reaches the distance of mass center are as follows:

Wherein, | C_m| indicate cluster C_mSample number, μ_mIndicate cluster C_mCorresponding mass center, x_iFor some sample, V in cluster_mIndicate mass center μ_m The number of sample in generic, V represent the number of all samples, and K is the dimension of mass center, μ_m[i] and x [i] respectively indicate matter Heart μ_mWith the ith feature value of sample x.

6. the uneven file classification method of a kind of combination SVM and semi-supervised clustering according to claim 4, feature exist In the splitting condition are as follows: if there is noise in current cluster, current cluster is divided, conversely, not needing then point It splits.

7. the uneven file classification method of a kind of combination SVM and semi-supervised clustering according to claim 1, feature exist In the process of the S4 are as follows: calculate separately the sensitivity S E and specificity SP under SVM classifier and semi-supervised clustering；It counts respectively The Gmean value for calculating the sensitivity S E and specificity SP under SVM classifier and semi-supervised clustering, is worth according to Gmean obtained To a weight mu, using the weight mu to the confidence level C of classification results_SKAS(x_i) normalized is done, according to normalized As a result it determines using the classification results in SVM classifier or semi-supervised clustering；Identified classification results are merged, are exported The final prediction result of test text.

8. the uneven file classification method of a kind of combination SVM and semi-supervised clustering according to claim 2 or 4, special Sign is that the calculation method of the confidence level is that maximum probability subtracts the second maximum probability in class categories.

9. the uneven file classification method of a kind of combination SVM and semi-supervised clustering according to claim 1, feature exist In the pretreated process are as follows: select the keyword in text to be measured, remove stop words；With being calculated according to keyword frequency Weight is finally made text vector to be measured, then vector is normalized using deviation standardized mode.

10. a kind of classification system based on the uneven file classification method for combining SVM and semi-supervised clustering described in claim 1 System, which is characterized in that including pretreatment unit, training unit and predicting unit；

The pretreatment unit carries out vectorization processing to text to be measured, and makees normalized to vector, finally obtains vector Format data set, and data set is input to training unit；

The training unit includes SVM cell and semi-supervised clustering unit, is utilized respectively SVM cell and semi-supervised clustering unit pair Test set classification, obtains test set generic and its confidence level；

The predicting unit: the generic and its confidence level that SVM classifier and semi-supervised clustering unit are exported, by fusion Processing obtains final result.