CN110309302A - Imbalanced text classification method and system combining SVM and semi-supervised clustering - Google Patents

Imbalanced text classification method and system combining SVM and semi-supervised clustering Download PDF

Info

Publication number
CN110309302A
CN110309302A (application CN201910414208.0A)
Authority
CN
China
Prior art keywords
semi-supervised clustering
text
SVM
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910414208.0A
Other languages
Chinese (zh)
Other versions
CN110309302B (en)
Inventor
姜震
熊相真
杜阳
冯路捷
孙祥瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201910414208.0A priority Critical patent/CN110309302B/en
Publication of CN110309302A publication Critical patent/CN110309302A/en
Application granted granted Critical
Publication of CN110309302B publication Critical patent/CN110309302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/35 Information retrieval of unstructured textual data: Clustering; Classification
    • G06F18/214 Pattern recognition: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/23213 Pattern recognition: Non-hierarchical clustering using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G06F18/2411 Pattern recognition: Classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F18/254 Pattern recognition: Fusion techniques of classification results, e.g. of results related to same input data


Abstract

The invention discloses an imbalanced text classification method and system combining SVM and semi-supervised clustering. The text to be processed is preprocessed to obtain vector-format text data as a data set; an SVM classifier is trained on the training set to obtain a classification model, which is then used to predict the test set, yielding the class and confidence of each test sample; the data set is also clustered with a semi-supervised clustering algorithm, yielding a second class and confidence for each test sample; finally, the classes and confidences produced by the SVM classifier and by the semi-supervised clustering algorithm are fused to obtain the final output. The invention combines different types of methods in the technical field of imbalanced text classification so that their advantages complement each other. By using vectorization and normalization, it compensates for the inaccurate classification that arises when processing high-dimensional sparse text data with too few labeled texts, and it effectively solves the problem of imbalanced text categories.

Description

Imbalanced text classification method and system combining SVM and semi-supervised clustering
Technical field
The invention belongs to the field of natural language processing, in particular the field of imbalanced text classification, and more particularly relates to an imbalanced text classification method and system combining SVM and semi-supervised clustering.
Background technique
Text classification is a classical problem in natural language processing, with wide applications in information filtering, mail classification, query-intent prediction, text topic tracking, and other fields. Traditional text classification methods are designed primarily for balanced text classification problems, and work well when the scale is small and the data are uniformly and densely distributed, but they still have clear limitations. In practical applications especially, class imbalance, scarcity of labeled texts, and high-dimensional sparse samples increase the complexity of text classification and reduce classification accuracy, limiting the practical use of text classification methods.
At present, the following classes of methods and ideas are used to solve these problems:
1) For the class-imbalance problem in text classification, solutions such as changed evaluation metrics, resampling, and cost-sensitive learning have been proposed: evaluation via ROC curves and the F-measure; over-sampling, under-sampling, and mixed-sampling resampling methods; and cost-sensitive learning methods that increase the misclassification cost of minority-class texts. These methods handle class imbalance well in low-dimensional spaces, but in the high-dimensional spaces typical of text classification the cost of learning is very high and the results are not very accurate.
2) For the scarcity of labeled texts in text classification, two classes of semi-supervised algorithms have been proposed. In the first, a term depending on the unlabeled texts is added to the original classification model, so that the final classification result is jointly determined by the labeled and unlabeled texts. This alleviates the scarcity of labels, but if the classification model does not match the data, training degrades the algorithm's performance. In the second, a classifier is trained on the labeled texts and used to label the unlabeled texts, producing pseudo-labeled texts; a new classifier is then trained on all the texts, and the process repeats until convergence. This also alleviates label scarcity, but because the pseudo-labeled texts contain noise, repeated training accumulates that noise and reduces classification accuracy.
3) For the high-dimensional sparsity of texts, feature-compression methods have been proposed, which divide into two classes: feature selection and feature extraction. Feature extraction derives features from the text according to some criterion; feature selection picks, from the original features, the subset most discriminative between classes. Both reduce the time overhead of training and classification and lower the risk of the curse of dimensionality, but compression inevitably discards some useful text information, making classification less accurate.
Given these deficiencies of the existing methods for imbalanced text classification, a more efficient algorithm is needed to achieve better classification results.
Summary of the invention
In view of the problems in the prior art, the present invention proposes an imbalanced text classification method and system combining SVM and semi-supervised clustering, which can improve on the poor results obtained by a single classifier or algorithm on imbalanced text classification problems and ultimately achieve accurate classification of imbalanced text.
The technical solution adopted in the present invention is as follows:
An imbalanced text classification method combining SVM and semi-supervised clustering proceeds as follows:
S1. Preprocess the text to be processed to obtain vector-format text data as a data set; the data set is divided into a training set and a test set;
S2. Train an SVM classifier on the training set to obtain a classification model, and use the classification model to predict the test set, obtaining the class and confidence of each test sample;
S3. Cluster the data set with a semi-supervised clustering algorithm, obtaining a second class and confidence for each test sample;
S4. Fuse the classes and confidences produced by the SVM classifier and by the semi-supervised clustering algorithm to obtain the final output, realizing the final classification of the imbalanced text.
Further, S2 proceeds as follows:
S2.1. On the training set, using the one-versus-one method, find a hyperplane between every pair of classes that separates texts of different classes, so that the SVM-trained classification model splits the multi-class problem into multiple binary problems;
S2.2. Apply a weight to the distance from each training sample to the hyperplane, obtaining a new decision function;
S2.3. Compute each sample's class and class probability from the new decision function; for multi-class classification, use one-versus-one voting to obtain the final class of the test text;
S2.4. Compute the confidence from the probabilities.
Further, the new decision function is expressed as follows:
When classifying positive and negative samples, the class imbalance of text classification must be considered. To address it, a weight is applied to the decision function value (which is positively correlated with the sample's distance to the hyperplane), shifting the effective threshold. Here w+ and w- denote the weights applied when the label is positive and negative respectively; N+ is the number of positive samples, N- is the number of negative samples, and f(x) is the decision function of the SVM.
Further, S3 proceeds as follows:
S3.1. Determine the number of clusters (the K value) and their classes from the training set, and assign each training sample to the corresponding cluster according to its label, obtaining the initial clusters;
S3.2. For each cluster, update the centroid, and re-assign samples to clusters according to their distance to each centroid;
S3.3. Check whether each cluster meets the splitting condition; split the clusters that do, then update the centroids and the K value;
S3.4. Assign each test sample to the corresponding cluster according to its distance to each centroid, and compute its confidence;
S3.5. Repeat steps S3.2-S3.4 until the iteration stopping criterion is met;
S3.6. From the class of each cluster, obtain the class and confidence of each test text.
Further, the centroid update in S3.2 proceeds as follows:
S3.2.1. Compute the centroid: μm = (1/|Cm|) Σx∈Cm x;
S3.2.2. Compute the distance from a sample to the centroid: d(x, μm) = sqrt(Σi=1..K (μm[i] - x[i])²);
S3.2.3. Apply a weight to the sample-to-centroid distance.
Here |Cm| is the number of samples in cluster Cm, μm is the centroid of cluster Cm, xi is a sample in the cluster, Vm is the number of samples in the class of centroid μm, V is the number of all samples, K is the dimension of the centroid, and μm[i] and x[i] are the i-th feature values of μm and x.
Further, the splitting condition is: judge whether the training set contains noise (misclassified samples); if the current cluster contains noise, split it; otherwise no split is needed.
Further, S4 proceeds as follows: compute the sensitivity SE and specificity SP under the SVM classifier and under semi-supervised clustering; compute the Gmean of SE and SP for each; from the two Gmean values obtain a weight μ; use the weight μ to normalize the confidence CSKAS(xi) of the classification result; according to the normalized result, decide whether to use the classification result of the SVM classifier or of the semi-supervised clustering; fuse the determined classification results and output the final prediction for each test text.
Further, the confidence is computed as the largest class probability minus the second-largest class probability.
Further, preprocessing proceeds as follows: select the keywords in the text under test and remove stop words; compute weights from keyword frequencies and vectorize the text; then normalize the vectors by min-max (deviation) standardization.
The present invention also provides an imbalanced text classification system combining SVM and semi-supervised clustering, comprising a preprocessing unit, a training unit, and a prediction unit.
The preprocessing unit vectorizes the text under test and normalizes the vectors, finally obtaining a vector-format data set, which it passes to the training unit.
The training unit comprises an SVM unit and a semi-supervised clustering unit, which separately classify the test set, each producing a class and confidence for every test sample.
The prediction unit fuses the classes and confidences output by the SVM unit and the semi-supervised clustering unit to obtain the final result.
Beneficial effects of the invention:
The classification method of the invention combines an SVM classifier with a semi-supervised K-means algorithm, so that the advantages of the two methods complement each other. Vectorization and normalization compensate for the inaccurate classification caused by having too few labeled texts when processing high-dimensional sparse text data. Improving the semi-supervised clustering algorithm solves the class-imbalance problem. The SVM's classification result resolves the uncertainty in initializing the K value and the centroids of the semi-supervised clustering. The invention further designs a splitting algorithm that effectively improves text classification accuracy. The invention markedly improves on the poor results of a single classifier or algorithm on imbalanced text classification problems and ultimately achieves accurate classification of imbalanced text.
Brief description of the drawings
Fig. 1 is the flow chart of the class-imbalanced text classification method of the invention;
Fig. 2 is the processing flow chart of the SVM classifier;
Fig. 3 is the processing flow chart of the semi-supervised clustering algorithm;
Fig. 4 is a schematic diagram of the fusion of the SVM and semi-supervised clustering results using Gmean values;
Fig. 5 is a schematic diagram of the imbalanced text classification method;
Fig. 6 is a block diagram of the imbalanced text classification system combining SVM and semi-supervised clustering.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
As shown in Figures 1 and 2, the imbalanced text classification method combining SVM and semi-supervised clustering designed by the present invention proceeds as follows:
S1. First, sort out the keywords in the text under test and discard stop words, which contribute little to classification; compute weights from keyword frequencies to vectorize the text. Next, normalize the vectors by min-max (deviation) standardization. Finally, output the data set in the vector format used by libsvm, for subsequent processing and use.
The resulting data format is as follows:
[label][index1]:[value1][index2]:[value2]…
Here label is the label value; index is an ordinal index, usually a consecutive integer (the feature number), which must be arranged in ascending order; and value is the feature value, usually a real number.
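The libsvm line format and the min-max (deviation) normalization described above can be sketched in Python; the helper names below are illustrative, not from the patent.

```python
def to_libsvm_line(label, features):
    """Format one sample as a libsvm line: [label] [index]:[value] ...
    Indices are 1-based and emitted in ascending order; zero values are skipped."""
    parts = [str(label)]
    for idx in sorted(features):
        val = features[idx]
        if val != 0:
            parts.append(f"{idx}:{val:g}")
    return " ".join(parts)

def min_max_normalize(values):
    """Deviation (min-max) standardization: maps each value into [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]
```

For example, `to_libsvm_line(1, {2: 0.5, 1: 0.25})` yields the line `1 1:0.25 2:0.5`, which libsvm and scikit-learn's `load_svmlight_file` can both consume.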
S2. The SVM classification algorithm, shown in Figure 3, proceeds as follows:
S2.1. Training: on the training set, using the one-versus-one method, find a hyperplane between every pair of classes that separates texts of different classes; thus k classes require k(k-1)/2 hyperplanes, and the SVM-trained classification model splits the multi-class problem into multiple binary problems.
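The one-versus-one decomposition above, together with the pairwise voting used to pick the final class, can be sketched as follows (the function names are illustrative; the patent gives no code):

```python
from collections import Counter
from itertools import combinations

def ovo_pairs(classes):
    """One-versus-one: one binary classifier per unordered pair of classes,
    i.e. k(k-1)/2 hyperplanes for k classes."""
    return list(combinations(classes, 2))

def ovo_vote(pairwise_winners):
    """Each pairwise classifier votes for one class; the class with the
    most votes becomes the predicted class of the test text."""
    return Counter(pairwise_winners).most_common(1)[0][0]
```

With four classes, `ovo_pairs([0, 1, 2, 3])` yields six pairs, matching k(k-1)/2 = 4·3/2.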
Seek the hyperplane given by f(x) = wx + b such that the objective satisfies: min (1/2)||w||² + c Σi ξi, subject to yi(w·xi + b) ≥ 1 - ξi and ξi ≥ 0 (1),
where ξi is the slack variable on sample i, c is a given penalty factor, and w and b are the two parameters describing the hyperplane.
S2.2. In a traditional SVM classifier, the original decision function obtains the class of a text from the distance between the text and the hyperplane: f(x) = Σi=1..l αi yi K(xi, x) + b (2),
where yi is the sample's label value, αi is a Lagrange multiplier, K(xi, x) is the kernel function, b is the hyperplane offset, and l is the vector dimension;
Because the number of samples of each class in the data set is severely imbalanced, training tilts toward the majority class. The invention therefore adds a weight to the distance from a sample to the hyperplane: the weight of each class is computed from the number of positive samples N+ and the number of negative samples N-, and combined with the original decision function f(x) to obtain the new decision function (3).
As expression (3) shows, classes with fewer texts receive a larger weight and classes with more texts a smaller weight, which resolves the poor prediction of minority-class texts when an SVM classifier handles imbalanced text.
S2.3. The sigmoid function S(x) = 1/(1 + e^(-x)) (4) projects a value on the real axis onto [0, 1], i.e. converts an output real value into a probability value.
The present invention further improves the probability output function: the output probability pi (5) is obtained from the new decision function value fi via a sigmoid-style mapping with parameters A and B that adjust the scale of the mapping. The parameters A and B are determined (6) from targets ti, which take the value t+ when the sample is of the positive class and t- when it is of the negative class.
S2.4. After the probabilities are computed, the confidence is obtained from them. The existing practice is to take the largest predicted class probability as the confidence; the present invention instead takes the largest class probability minus the second-largest:
Csvm(xi) = Psvm(y = cmax_j | xi) - Psvm(y = csub_max_j | xi) (7)
Computing confidence this way excludes, from the pseudo-labeled set, SVM outputs that may lie in class-overlap regions, mitigating the SVM's performance decline under class overlap.
S3. The semi-supervised clustering algorithm, shown in Figure 4, proceeds as follows:
S3.1. Determine the number of clusters (the K value) and their classes from the training samples, and assign each training sample to the corresponding cluster according to its class label, obtaining the initial clusters.
S3.2. For each cluster, update its centroid, and re-assign samples to clusters according to their distance to each centroid. The centroid is updated as follows:
S3.2.1. The centroid is the mean of all sample vectors in the cluster:
μm = (1/|Cm|) Σx∈Cm x (8), where μm is the centroid of cluster Cm and |Cm| is the number of samples in Cm.
S3.2.2. The distance between a sample and each centroid is the Euclidean distance:
d(x, μm) = sqrt(Σi=1..K (μm[i] - x[i])²) (9), where μm is the centroid, K is the dimension of the centroid, and μm[i] and x[i] are the i-th feature values of μm and x.
S3.2.3. Because the number of samples of each class in the text data set is severely imbalanced, training tilts toward the majority class; a weight is therefore added to the distance from a text to a centroid, shifting the effective threshold and yielding the new distance formula (10).
Here |Cm| is the number of samples in cluster Cm, μm is the centroid of Cm, xi is a sample in the cluster, Vm is the number of samples in the class of centroid μm, V is the number of all samples, K is the dimension of the centroid, and μm[i] and x[i] are the i-th feature values of μm and x.
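Equations (8) and (9) can be written down directly; the image of the weighted distance (10) is absent from the source, so the weighting shown below (scaling by the class's sample share Vm/V, which pulls samples toward minority-class centroids) is only one plausible reading, flagged as such:

```python
import math

def centroid(cluster):
    """Eq. (8): the centroid is the componentwise mean of the cluster's
    sample vectors."""
    n, dim = len(cluster), len(cluster[0])
    return [sum(x[i] for x in cluster) / n for i in range(dim)]

def euclidean(x, mu):
    """Eq. (9): Euclidean distance between a sample and a centroid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))

def weighted_distance(x, mu, v_m, v_total):
    """One plausible reading of Eq. (10), whose image is not reproduced in
    the source: scale the distance by the class's share of samples
    (v_m / v_total), so minority-class centroids attract samples more."""
    return (v_m / v_total) * euclidean(x, mu)
```

Under this reading, a centroid whose class holds half the samples sees its distances halved relative to the raw Euclidean metric.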
S3.3. Check whether each cluster meets the splitting condition; split the clusters that do, and update the centroids and the K value.
The splitting condition is: if the current cluster contains noise (misclassified samples), split the current cluster; otherwise no split is needed.
The splitting procedure: find the sample point x in the current cluster Cm farthest from the centroid, and let r be the distance between x and the centroid. The current cluster is then divided into two parts A and B:
A = {d(x, μm) ≤ r | x ∈ Cm} (11)
B = {d(x, μm) > r | x ∈ Cm} (12)
where d(x, μm) is the distance between sample point x and centroid μm.
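Taken literally, with r the maximum distance, every point satisfies d(x, μm) ≤ r and lands in A; a common bisecting variant instead seeds the new cluster with the farthest point and reassigns each point to the nearer center. The sketch below uses that variant as an assumption, not as the patented rule:

```python
import math

def euclidean(x, mu):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))

def split_cluster(cluster, mu):
    """Hedged reading of Eqs. (11)-(12): the sample farthest from the
    centroid mu seeds a new cluster B, and each point joins whichever
    center (old centroid or new seed) is nearer."""
    far = max(cluster, key=lambda x: euclidean(x, mu))
    a = [x for x in cluster if euclidean(x, mu) <= euclidean(x, far)]
    b = [x for x in cluster if euclidean(x, mu) > euclidean(x, far)]
    return a, b
```

An outlier far from the bulk of the cluster ends up alone in B, matching the stated purpose of isolating noise (misclassified samples).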
S3.4. Assign each test sample to the corresponding cluster according to its distance to each centroid, and compute its confidence:
S3.4.1. Compute the distance (13) from the sample to each centroid;
S3.4.2. Before the confidence can be computed, the probability that the text belongs to each class is needed:
P[i] = (1/d[cluster[i]])/sum (14)
where P[i] is the probability that sample i belongs to its current cluster, cluster[i] is the label of the cluster that sample i belongs to, d[j] is the distance from the current text i to the centroid of the j-th cluster, and sum is the sum of the reciprocals of the distances from the current sample i to each centroid:
sum = Σj=1..K 1/d[j] (15)
S3.4.3. The confidence is measured by the difference between the largest class probability and the second-largest:
CSKAS(xi) = PSKAS(y = Cmax_j | xi) - PSKAS(y = Csub_max_j | xi) (16)
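Equations (14)-(16), reciprocal-distance membership probabilities and their top-two margin, can be sketched as follows (function names are illustrative):

```python
def cluster_probs(dists):
    """Eqs. (14)-(15): membership probability proportional to the
    reciprocal of the distance to each centroid, normalized by the sum
    of the reciprocals."""
    inv = [1.0 / d for d in dists]
    s = sum(inv)
    return [v / s for v in inv]

def cluster_confidence(dists):
    """Eq. (16): largest membership probability minus the second-largest."""
    p = sorted(cluster_probs(dists), reverse=True)
    return p[0] - p[1]
```

A sample at distances 1 and 3 from two centroids gets probabilities 0.75 and 0.25, hence confidence 0.5.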
S3.5. Repeat steps S3.2-S3.4 until the iteration stopping criterion is met;
S3.6. From the class of each cluster, obtain the class and confidence of each test text.
S4. The fusion algorithm, shown in Figure 5:
S4.1. Compute the sensitivity SE and specificity SP under the SVM classifier and under semi-supervised K-means:
SE = TP/(TP + FN) (17), SP = TN/(TN + FP) (18)
where TP (true positives) is the number of samples correctly assigned to the positive class, FN (false negatives) is the number of positive samples wrongly assigned to the negative class, TN (true negatives) is the number of samples correctly assigned to the negative class, and FP (false positives) is the number of negative samples wrongly assigned to the positive class.
S4.2. Compute the Gmean of SE and SP for the SVM classifier and for semi-supervised K-means: Gmean = sqrt(SE × SP) (19).
S4.3. From the Gmean value W1 of the SVM classifier and the Gmean value W2 of semi-supervised K-means, compute a weight μ (20).
S4.4. Normalize the confidence CSKAS(xi) of the classification result with the weight μ (21), and decide from the normalized result whether to use the classification result of the SVM classifier or of semi-supervised K-means.
S4.5. Fuse the finally adopted classification results and output the final prediction for each test text.
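Equations (17)-(19) are standard and can be written down; the images of (20)-(21) are absent from the source, so the final selection rule below (comparing Gmean-scaled confidences) is a stand-in assumption, not the patented normalization:

```python
import math

def se_sp(tp, fn, tn, fp):
    """Eqs. (17)-(18): sensitivity SE = TP/(TP+FN), specificity SP = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def gmean(se, sp):
    """Eq. (19): geometric mean of sensitivity and specificity."""
    return math.sqrt(se * sp)

def fuse(conf_svm, conf_skas, w1, w2):
    """Choose between the two classifiers' outputs. The patent derives a
    weight mu from the Gmean values W1, W2 and normalizes C_SKAS with it
    (Eqs. (20)-(21), images absent); comparing the Gmean-scaled
    confidences directly is one plausible stand-in for that rule."""
    return "svm" if w1 * conf_svm >= w2 * conf_skas else "skas"
```

For example, with TP=8, FN=2, TN=90, FP=10, sensitivity is 0.8 and specificity 0.9, so Gmean ≈ 0.849; the classifier with the larger Gmean-weighted confidence wins the fusion.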
Based on the imbalanced text classification method combining SVM and semi-supervised clustering designed above, the present invention also proposes an imbalanced text classification system combining SVM and semi-supervised clustering, shown in Figure 6, comprising a preprocessing unit, a training unit, and a prediction unit.
The preprocessing unit vectorizes the text under test and normalizes the vectors, finally obtaining a vector-format data set, which it passes to the training unit.
The training unit comprises an SVM unit and a semi-supervised K-means unit, which separately classify the test set, each producing a class and confidence for every test sample.
The prediction unit normalizes the classes and confidences output by the SVM unit and the semi-supervised K-means unit, then fuses the two results, i.e. obtains the final class of each test text according to the decision function.
The above embodiments only illustrate the design ideas and features of the invention, so that those skilled in the art can understand and implement it; the protection scope of the invention is not limited to these embodiments. All equivalent changes or modifications made according to the disclosed principles and design ideas fall within the protection scope of the invention.

Claims (10)

1. An imbalanced text classification method combining SVM and semi-supervised clustering, characterized by comprising the following steps:
S1. preprocessing the text to be processed to obtain vector-format text data as a data set, the data set being divided into a training set and a test set;
S2. training an SVM classifier on the training set to obtain a classification model, and predicting the test set with the classification model to obtain the class and confidence of each test sample;
S3. clustering the data set with a semi-supervised clustering algorithm to obtain a second class and confidence for each test sample;
S4. fusing the classes and confidences produced by the SVM classifier and by the semi-supervised clustering algorithm to obtain the final output, realizing the final classification of the imbalanced text.
2. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, characterized in that S2 comprises:
S2.1. on the training set, using the one-versus-one method, finding a hyperplane between every pair of classes that separates texts of different classes;
S2.2. applying a weight to the distance from each training sample to the hyperplane to obtain a new decision function;
S2.3. computing each sample's class and class probability from the new decision function, wherein, for multi-class classification, one-versus-one voting yields the final class of the test text;
S2.4. computing the confidence from the probabilities.
3. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 2, characterized in that the new decision function is expressed as follows:
wherein w+ and w- denote the weights applied when the label is positive and negative respectively, N+ is the number of positive samples, N- is the number of negative samples, and f(x) is the decision function of the SVM.
4. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, characterized in that the process of S3 is:
S3.1. Determine the number of clusters and their class labels from the training set, and assign each training sample to the corresponding cluster according to its label, obtaining the initialized clusters;
S3.2. For each cluster, update the centroid, and re-assign the samples to the clusters according to the distance from each sample to the centroid;
S3.3. Judge whether each cluster satisfies the splitting condition; split the clusters that satisfy the condition, and update the centroids and the value of K again;
S3.4. According to the distance between each test sample and the centroids, re-assign the samples to the corresponding clusters and calculate their confidence levels;
S3.5. Repeat steps S3.2–S3.4 until the iteration stopping criterion is satisfied;
S3.6. Obtain the class label and confidence level of the test text according to the class label of its cluster.
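A minimal seeded-k-means sketch of S3.1–S3.6, in the spirit of the Basu et al. paper cited below (pure NumPy; the cluster splitting of S3.3 and the confidence computation are omitted, and the function name is hypothetical):

```python
import numpy as np

def seeded_kmeans(X_lab, y_lab, X_unl, n_iter=10):
    """Seeded k-means: clusters are initialized from labelled samples,
    then centroids are updated and samples re-assigned."""
    labels = np.unique(y_lab)
    # S3.1: one initial cluster per class, seeded by the labelled data.
    centroids = np.vstack([X_lab[y_lab == c].mean(axis=0) for c in labels])
    X = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        # S3.2/S3.4: assign every sample to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # S3.2: recompute each centroid from the current assignment.
        centroids = np.vstack([X[assign == k].mean(axis=0)
                               for k in range(len(labels))])
    # S3.6: unlabelled samples inherit the class label of their cluster.
    return labels[assign[len(X_lab):]], centroids
```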
5. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 4, characterized in that the process of updating the centroid in S3.2 is:
S3.2.1. Calculate the centroid: μm = (1/|Cm|) Σ_{xi∈Cm} xi
S3.2.2. The distance from a sample to the centroid is: d(x, μm) = sqrt(Σ_{i=1..K} (μm[i] − x[i])²)
S3.2.3. Weight the distance from the sample to the centroid, obtaining: d′(x, μm) = (Vm/V) · d(x, μm)
where |Cm| denotes the number of samples in cluster Cm, μm denotes the centroid corresponding to cluster Cm, xi is a sample in the cluster, Vm denotes the number of samples in the class to which centroid μm belongs, V denotes the number of all samples, K is the dimension of the centroid, and μm[i] and x[i] respectively denote the i-th feature value of centroid μm and sample x.
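Reading the symbol definitions of claim 5 literally, one plausible form of the weighted distance of S3.2.3 scales the Euclidean distance by the class-size ratio Vm/V, so majority-class centroids attract samples less strongly (this is an assumption — the claim's formula images are not reproduced in this text):

```python
import numpy as np

def weighted_distance(x, centroid, v_m, v_total):
    # Euclidean distance from sample x to the centroid (S3.2.2),
    # scaled by the class-size ratio Vm/V (S3.2.3, assumed form).
    d = np.sqrt(((centroid - x) ** 2).sum())
    return (v_m / v_total) * d
```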
6. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 4, characterized in that the splitting condition is: if noise exists in the current cluster, the current cluster is split; otherwise, no splitting is needed.
7. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, characterized in that the process of S4 is: calculate the sensitivity SE and the specificity SP separately under the SVM classifier and under the semi-supervised clustering; calculate the Gmean value of SE and SP for each; obtain a weight μ from the Gmean values; normalize the confidence level C_SKAS(x_i) of the classification results with the weight μ; determine, according to the normalized results, whether to adopt the classification result of the SVM classifier or that of the semi-supervised clustering; fuse the determined classification results and output the final prediction result for the test text.
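The fusion of S4 rests on Gmean, the geometric mean of sensitivity and specificity, a standard imbalance-aware score; a sketch (the weight formula is a hypothetical stand-in for the patent's μ):

```python
import numpy as np

def gmean(y_true, y_pred):
    # Sensitivity SE = TP/(TP+FN), specificity SP = TN/(TN+FP);
    # Gmean = sqrt(SE * SP). Assumes both classes are present.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    return np.sqrt(se * sp)

def fusion_weight(g_svm, g_cluster):
    # Hypothetical fusion weight: the learner with the higher Gmean
    # gets proportionally more influence on the fused confidence.
    return g_svm / (g_svm + g_cluster)
```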
8. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 2 or 4, characterized in that the confidence level is calculated as the largest probability among the classification categories minus the second-largest.
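The confidence measure of claim 8 is the margin between the top two class probabilities, e.g.:

```python
import numpy as np

def confidence(probs):
    # Claim 8: confidence = largest class probability minus the
    # second-largest one (the "margin" of the prediction).
    top2 = np.sort(probs)[-2:]
    return top2[1] - top2[0]
```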
9. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, characterized in that the preprocessing process is: select the keywords from the text to be classified and remove the stop words; calculate weights from the keyword frequencies and vectorize the text to be classified; finally, normalize the vectors by min-max (deviation) standardization.
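The preprocessing of claim 9 can be sketched as term-frequency weighting plus min-max normalization (the stop-word list and vocabulary here are hypothetical placeholders):

```python
import numpy as np
from collections import Counter

STOP_WORDS = {"the", "a", "is"}  # illustrative stop-word list

def vectorize(doc, vocabulary):
    # Keep keywords, drop stop words, weight by term frequency.
    tokens = [t for t in doc.lower().split() if t not in STOP_WORDS]
    counts = Counter(tokens)
    return np.array([counts[w] for w in vocabulary], dtype=float)

def min_max_normalize(X):
    # Deviation (min-max) standardization: (x - min) / (max - min),
    # guarding against constant columns.
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)
    return (X - mn) / span
```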
10. A classification system based on the imbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, characterized by comprising a preprocessing unit, a training unit and a prediction unit;
the preprocessing unit vectorizes the text to be classified, normalizes the vectors to obtain a data set in vector format, and inputs the data set to the training unit;
the training unit comprises an SVM unit and a semi-supervised clustering unit, which are respectively used to classify the test set and obtain the test-set class labels and their confidence levels;
the prediction unit fuses the class labels and confidence levels output by the SVM classifier and the semi-supervised clustering unit to obtain the final result.
CN201910414208.0A 2019-05-17 2019-05-17 Unbalanced text classification method and system combining SVM and semi-supervised clustering Active CN110309302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910414208.0A CN110309302B (en) 2019-05-17 2019-05-17 Unbalanced text classification method and system combining SVM and semi-supervised clustering

Publications (2)

Publication Number Publication Date
CN110309302A true CN110309302A (en) 2019-10-08
CN110309302B CN110309302B (en) 2023-03-24

Family

ID=68075442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910414208.0A Active CN110309302B (en) 2019-05-17 2019-05-17 Unbalanced text classification method and system combining SVM and semi-supervised clustering

Country Status (1)

Country Link
CN (1) CN110309302B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886330A (en) * 2014-03-27 2014-06-25 西安电子科技大学 Classification method based on semi-supervised SVM ensemble learning
CN104318242A (en) * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 High-efficiency SVM active half-supervision learning algorithm
CN105677564A (en) * 2016-01-04 2016-06-15 中国石油大学(华东) Adaboost software defect unbalanced data classification method based on improvement


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SUGATO BASU et al.: "Semi-supervised Clustering by Seeding", 《PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING》 *
DAI LIN et al.: "Intrusion detection system based on semi-supervised learning", 《计算机技术与发展》 (Computer Technology and Development) *
CAO YAXI et al.: "Imbalanced data classification algorithm based on cost-sensitive large margin distribution machine", 《华东理工大学学报(自然科学版)》 (Journal of East China University of Science and Technology, Natural Science Edition) *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851596A (en) * 2019-10-11 2020-02-28 平安科技(深圳)有限公司 Text classification method and device and computer readable storage medium
CN110851596B (en) * 2019-10-11 2023-06-27 平安科技(深圳)有限公司 Text classification method, apparatus and computer readable storage medium
CN110955773A (en) * 2019-11-06 2020-04-03 中国科学技术大学 Discriminant text clustering method and system based on minimum normalized information distance
CN110955773B (en) * 2019-11-06 2023-03-31 中国科学技术大学 Discriminant text clustering method and system based on minimum normalized information distance
CN110930399A (en) * 2019-12-10 2020-03-27 南京医科大学 TKA preoperative clinical staging intelligent evaluation method based on support vector machine
CN113051462A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 Multi-classification model training method, system and device
CN111241286A (en) * 2020-01-16 2020-06-05 东方红卫星移动通信有限公司 Short text emotion fine classification method based on mixed classifier
CN111753874A (en) * 2020-05-15 2020-10-09 江苏大学 Image scene classification method and system combined with semi-supervised clustering
CN111738308A (en) * 2020-06-03 2020-10-02 浙江中烟工业有限责任公司 Dynamic threshold detection method for monitoring index based on clustering and semi-supervised learning
CN114077860A (en) * 2020-08-18 2022-02-22 鸿富锦精密电子(天津)有限公司 Method and system for sorting parts before assembly, electronic device and storage medium
CN112232416A (en) * 2020-10-16 2021-01-15 浙江大学 Semi-supervised learning method based on pseudo label weighting
CN112418289A (en) * 2020-11-17 2021-02-26 北京京航计算通讯研究所 Multi-label classification processing method and device for incomplete labeling data
CN112463964A (en) * 2020-12-01 2021-03-09 科大讯飞股份有限公司 Text classification and model training method, device, equipment and storage medium
CN112463964B (en) * 2020-12-01 2023-01-17 科大讯飞股份有限公司 Text classification and model training method, device, equipment and storage medium
CN112241454A (en) * 2020-12-14 2021-01-19 成都数联铭品科技有限公司 Text classification method for processing sample inclination
CN112883190A (en) * 2021-01-28 2021-06-01 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium
WO2022160449A1 (en) * 2021-01-28 2022-08-04 平安科技(深圳)有限公司 Text classification method and apparatus, electronic device, and storage medium
CN112860895A (en) * 2021-02-23 2021-05-28 西安交通大学 Tax payer industry classification method based on multistage generation model
CN112860895B (en) * 2021-02-23 2023-03-28 西安交通大学 Tax payer industry classification method based on multistage generation model
CN114281994A (en) * 2021-12-27 2022-04-05 盐城工学院 Text clustering integration method and system based on three-layer weighting model
CN114661903A (en) * 2022-03-03 2022-06-24 贵州大学 Deep semi-supervised text clustering method, device and medium combining user intention
CN114661903B (en) * 2022-03-03 2024-07-09 贵州大学 Deep semi-supervised text clustering method, device and medium combining user intention
CN115994527A (en) * 2023-03-23 2023-04-21 广东聚智诚科技有限公司 Machine learning-based PPT automatic generation system
CN116540316A (en) * 2023-07-06 2023-08-04 华设检测科技有限公司 Geological soil layer testing method based on SVM classification algorithm and clustering algorithm
CN116540316B (en) * 2023-07-06 2023-09-01 华设检测科技有限公司 Geological Soil Layer Testing Method Based on SVM Classification Algorithm and Clustering Algorithm
CN117253095A (en) * 2023-11-16 2023-12-19 吉林大学 Image classification system and method based on biased shortest distance criterion
CN117253095B (en) * 2023-11-16 2024-01-30 吉林大学 Image classification system and method based on biased shortest distance criterion

Also Published As

Publication number Publication date
CN110309302B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN110309302A Unbalanced text classification method and system combining SVM and semi-supervised clustering
CN103632168B (en) Classifier integration method for machine learning
CN109815492A Intention recognition method based on a recognition model, recognition device and medium
CN111126482B (en) Remote sensing image automatic classification method based on multi-classifier cascade model
CN107608999A Question classification method suitable for automatic question-answering systems
CN110969191B (en) Glaucoma prevalence probability prediction method based on similarity maintenance metric learning method
CN107798033B (en) Case text classification method in public security field
CN105261367B Speaker recognition method
CN110717554B (en) Image recognition method, electronic device, and storage medium
CN106611052A (en) Text label determination method and device
CN108647736A Image classification method based on perceptual loss and matching attention mechanism
CN105046195A (en) Human behavior identification method based on asymmetric generalized Gaussian distribution model (AGGD)
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN108932318A Intelligent analysis and precise push method based on policy-resource big data
CN104616319A (en) Multi-feature selection target tracking method based on support vector machine
CN113553906A (en) Method for discriminating unsupervised cross-domain pedestrian re-identification based on class center domain alignment
CN105975611A (en) Self-adaptive combined downsampling reinforcing learning machine
WO2021189830A1 (en) Sample data optimization method, apparatus and device, and storage medium
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN103235954A Ground-based cloud image recognition method based on an improved AdaBoost algorithm
CN101216886B (en) A shot clustering method based on spectral segmentation theory
Fu et al. Speaker independent emotion recognition based on SVM/HMMs fusion system
Zhang et al. Learn to adapt for generalized zero-shot text classification
CN103744958A (en) Webpage classification algorithm based on distributed computation
CN112200260B (en) Figure attribute identification method based on discarding loss function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant