CN110309302A - Imbalanced text classification method and system combining SVM and semi-supervised clustering - Google Patents
Imbalanced text classification method and system combining SVM and semi-supervised clustering
- Publication number
- CN110309302A CN110309302A CN201910414208.0A CN201910414208A CN110309302A CN 110309302 A CN110309302 A CN 110309302A CN 201910414208 A CN201910414208 A CN 201910414208A CN 110309302 A CN110309302 A CN 110309302A
- Authority
- CN
- China
- Prior art keywords
- semi-supervised clustering
- text
- SVM
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/35—Information retrieval of unstructured textual data; Clustering; Classification
- G06F18/214—Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/23213—Non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G06F18/2411—Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an imbalanced text classification method and system combining SVM and semi-supervised clustering. Text to be processed is preprocessed to obtain vector-format text data as a data set. An SVM classifier is trained on the training set to obtain a classification model, the test set is predicted with that model, and the predicted class and confidence of each test sample are obtained. The data set is also clustered with a semi-supervised clustering algorithm, yielding a second predicted class and confidence for each test sample. The predicted classes and confidences from the SVM classifier and the semi-supervised clustering algorithm are then fused to produce the final output. By combining different kinds of methods within the field of imbalanced text classification, the invention lets their advantages complement one another. Vectorization and normalization compensate for the inaccurate classification results that scarce labeled text would otherwise cause when handling high-dimensional sparse text data, and the method effectively addresses class imbalance.
Description
Technical field
The invention belongs to the field of natural language processing, in particular imbalanced text classification, and more particularly relates to an imbalanced text classification method and system combining SVM and semi-supervised clustering.
Background technique
Text classification is a classical problem in natural language processing, with wide applications in information filtering, mail classification, query intent prediction, text topic tracking, and other fields. Traditional text classification methods are designed primarily for balanced classification problems and work well when the data is small in scale, uniformly distributed, and dense. They nevertheless have clear limitations. In practical applications, class imbalance, scarce labeled text, and high-dimensional sparse samples increase the complexity of text classification and reduce classification accuracy, limiting the practical application of these methods.
At present, these problems are addressed by the following classes of methods:
1) For class imbalance in text classification, solutions such as changed evaluation metrics, resampling, and cost-sensitive learning have been proposed: ROC curves and the F-measure as metrics; over-sampling, under-sampling, and mixed-sampling resampling methods; and cost-sensitive learning that raises the misclassification cost of small classes. These methods handle class imbalance well in low-dimensional spaces, but in the high-dimensional spaces typical of text classification the learning cost is very high and the results are not very accurate.
2) For the scarcity of labeled text, two classes of semi-supervised algorithms have been proposed. The first adds to the original classification model a term that depends on the unlabeled text, so that the final classification result is determined jointly by labeled and unlabeled text; this alleviates the scarcity of labels, but if the model mismatches the text during training, performance degrades. The second trains a classifier on the labeled text, labels the unlabeled text to obtain pseudo-labeled text, then trains a new classifier on all the text, repeating until convergence; this also alleviates label scarcity, but noise in the pseudo-labeled text accumulates over repeated training and reduces classification accuracy.
3) For the high-dimensional sparsity of text, feature compression methods have been proposed, divided into two classes: feature selection and feature extraction. Feature extraction derives features from the text according to some criterion; feature selection picks, from the original features, the subset with the strongest class-discriminating power. Both reduce the time overhead of training and classification and the risk of the curse of dimensionality, but inevitably discard some useful text information during compression, making classification less accurate.
Given these shortcomings of existing methods for imbalanced text classification, a more effective algorithm is needed to achieve better classification.
Summary of the invention
To address the problems of the prior art, the present invention proposes an imbalanced text classification method and system combining SVM and semi-supervised clustering, which improves on the poor performance of single classifiers or algorithms on imbalanced text classification and achieves accurate classification of imbalanced text.
The technical solution adopted in the present invention is as follows:
An imbalanced text classification method combining SVM and semi-supervised clustering proceeds as follows:
S1. Preprocess the text to be processed to obtain vector-format text data as a data set; the data set is divided into a training set and a test set.
S2. Train an SVM classifier on the training set to obtain a classification model, predict the test set with the model, and obtain each test sample's predicted class and confidence.
S3. Cluster the data set with a semi-supervised clustering algorithm to obtain a second predicted class and confidence for each test sample.
S4. Fuse the predicted classes and confidences from the SVM classifier and the semi-supervised clustering algorithm to obtain the final output, producing the final classification of the imbalanced text.
Further, S2 proceeds as follows:
S2.1. On the training set, using the one-versus-one method, find a hyperplane between every pair of classes that separates their texts, so that the SVM-based classification model decomposes the multi-class problem into multiple binary problems.
S2.2. Attach a weight to each training sample's distance to the hyperplane, yielding a new decision function.
S2.3. Compute each sample's class and class probability from the new decision function; for multi-class classification, one-versus-one voting yields the final class of the test text.
S2.4. Compute the confidence from the probabilities.
Further, the new decision function is expressed as follows:
When classifying positive and negative samples, the class imbalance of the text must be considered. To this end a weight is added so that, after the decision function value is computed (the decision function value is positively correlated with the sample's distance to the hyperplane), the decision threshold is shifted. Here w+ and w- respectively denote the weights added when the label is positive or negative, N+ is the number of samples with a positive label, N- is the number of samples with a negative label, and f(x) is the SVM decision function.
Further, S3 proceeds as follows:
S3.1. Determine the number of clusters (the K value) and their classes from the training set, and assign each training sample to the corresponding cluster according to its label, giving the initial clusters.
S3.2. For each cluster, update the centroid, then reassign every sample to a cluster according to its distance to each centroid.
S3.3. Check whether each cluster meets the splitting condition; split the clusters that do, and update the centroids and the K value again.
S3.4. According to the distance between each test sample and the centroids, assign each test sample to the corresponding cluster and compute its confidence.
S3.5. Repeat steps S3.2-S3.4 until the iteration stopping condition is met.
S3.6. From each cluster's class, obtain the class and confidence of the test text.
Further, the centroid update in S3.2 proceeds as follows:
S3.2.1. Compute the centroid: μm = (1/|Cm|) Σ_{x∈Cm} x.
S3.2.2. Compute the distance from a sample to the centroid: d(x, μm) = sqrt(Σ_{i=1}^{K} (μm[i] - x[i])²).
S3.2.3. Weight the sample-to-centroid distance, obtaining the weighted distance.
Here |Cm| is the number of samples in cluster Cm, μm is the centroid of cluster Cm, xi is a sample in the cluster, Vm is the number of samples in the class to which centroid μm belongs, V is the total number of samples, K is the dimension of the centroid, and μm[i] and x[i] are the i-th feature values of μm and sample x, respectively.
Further, the splitting condition is: judge whether the training set contains noise (misclassified samples); if the current cluster contains noise, it is split; otherwise no split is needed.
Further, S4 proceeds as follows: compute the sensitivity SE and specificity SP under the SVM classifier and under semi-supervised clustering; compute the Gmean of SE and SP for each; derive a weight μ from the Gmean values; normalize the confidence C_SKAS(x_i) of the clustering result with the weight μ; decide from the normalized result whether to use the classification result of the SVM classifier or of semi-supervised clustering; fuse the chosen classification results and output the final prediction for the test text.
Further, the confidence is computed as the largest class probability minus the second-largest class probability.
Further, the preprocessing proceeds as follows: select the keywords in the text to be classified and remove stop words; compute weights from keyword frequencies to vectorize the text; then normalize the vectors by min-max (deviation) standardization.
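The deviation (min-max) standardization mentioned above can be sketched in a few lines of Python; the function name is illustrative, not from the patent.

```python
def min_max_normalize(values):
    """Deviation (min-max) standardization: map each value to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```

Each feature dimension of the text vectors would be normalized this way before training.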
The present invention also provides an imbalanced text classification system combining SVM and semi-supervised clustering, comprising a preprocessing unit, a training unit, and a prediction unit.
The preprocessing unit vectorizes the text to be classified and normalizes the vectors, finally obtaining a vector-format data set, which it passes to the training unit.
The training unit comprises an SVM unit and a semi-supervised clustering unit, each of which classifies the test set and outputs the test set's predicted classes and confidences.
The prediction unit fuses the classes and confidences output by the SVM unit and the semi-supervised clustering unit to obtain the final result.
Beneficial effects of the present invention:
The classification method of the present invention combines an SVM classifier with a semi-supervised K-means algorithm so that the advantages of the two methods complement each other. Vectorization and normalization compensate for the inaccurate classification results that scarce labeled text causes when handling high-dimensional sparse text data. The improved semi-supervised clustering algorithm addresses class imbalance, and the SVM classification results resolve the uncertainty of the initial K value and centroids in semi-supervised clustering. The invention also designs a splitting algorithm that effectively improves text classification accuracy. The invention markedly improves on the poor performance of single classifiers or algorithms on imbalanced text classification and achieves accurate classification of imbalanced text.
Detailed description of the invention
Fig. 1 is that the present invention is directed to class imbalance file classification method flow chart;
Fig. 2 is SVM classifier process flow diagram;
Fig. 3 is semi-supervised clustering algorithm process flow chart;
Fig. 4 is using Gmean value to SVM and semi-supervised clustering processing result treatment process schematic diagram
Fig. 5 is the schematic diagram of uneven file classification method;
Fig. 6 is a kind of frame diagram of the uneven Text Classification System of combination SVM and semi-supervised clustering.
Specific embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be appreciated that the specific embodiments described herein are only for explaining the present invention, not for limiting it.
As shown in Figs. 1 and 2, the imbalanced text classification method combining SVM and semi-supervised clustering of the present invention proceeds as follows:
S1. First, extract the keywords of the text to be classified and reject stop words, which contribute little to classification; compute weights from keyword frequencies to vectorize the text. Second, normalize the vectors by min-max (deviation) standardization. Finally, output the data set in the vector format used by libsvm as the standard, for subsequent processing and use.
The resulting data format is as follows:
[label] [index1]:[value1] [index2]:[value2] …
where label is the label value, index is the sequence index (usually consecutive integers, i.e. the feature numbers, which must be arranged in ascending order), and value is the feature value, a real number.
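As a concrete illustration of this format, a sample can be serialized into a libsvm-style line as follows; `to_libsvm_line` is a hypothetical helper, not part of libsvm itself.

```python
def to_libsvm_line(label, features):
    """Serialize one sample as 'label index:value ...'.

    Indices are 1-based and emitted in ascending order; zero-valued
    features are omitted, which keeps the representation sparse.
    """
    pairs = [f"{i}:{v:g}" for i, v in sorted(features.items()) if v != 0]
    return " ".join([str(label)] + pairs)

# Label 1; only features 1 and 3 are nonzero.
print(to_libsvm_line(1, {3: 0.25, 1: 0.5}))  # 1 1:0.5 3:0.25
```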
S2. The SVM classification algorithm, shown in Fig. 2, proceeds as follows:
S2.1. Training: on the training set, using the one-versus-one method, find a hyperplane between every pair of classes that separates their texts; for k classes this requires k(k-1)/2 hyperplanes, so the SVM-based classification model decomposes the multi-class problem into multiple binary problems.
Seek the hyperplane f(x) = w·x + b such that the objective function satisfies:
min (1/2)‖w‖² + c Σ_i ξ_i, subject to y_i (w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0,
where ξ_i is the slack variable on sample i, c is a given penalty factor, and w and b are the two parameters that describe the hyperplane.
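The one-versus-one decomposition in S2.1 trains one binary SVM per unordered class pair, k(k-1)/2 in total; a minimal sketch of the enumeration (class names are illustrative):

```python
from itertools import combinations

def ovo_pairs(classes):
    """Enumerate the class pairs for which one-versus-one SVM trains a binary classifier."""
    return list(combinations(classes, 2))

# With k = 4 classes there are 4 * 3 / 2 = 6 binary classifiers.
pairs = ovo_pairs(["sports", "finance", "tech", "health"])
print(len(pairs))  # 6
```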
S2.2. In a traditional SVM classifier, the original decision function assigns a text its class from the distance between the text and the hyperplane:
f(x) = Σ_{i=1}^{l} y_i α_i K(x_i, x) + b,
where y_i is the label of sample i, α_i is a Lagrange multiplier, K(x_i, x) is the kernel function, b is the hyperplane offset, and l is the number of training samples.
Since the per-class sample counts in the data set are severely imbalanced, training tilts toward the majority class. A weight is therefore added to each sample's distance to the hyperplane. The present invention computes a weight for each class from the number of positive samples N+ and the number of negative samples N-, and combines that weight with the original decision function f(x) to obtain the new decision function. As the expression shows, classes with fewer texts receive larger weights and classes with more texts receive smaller weights, which addresses the SVM classifier's poor prediction of minority-class texts when handling imbalanced text.
S2.3. The sigmoid function projects a real value onto [0, 1], i.e. converts a real-valued output into a probability:
sigmoid(x) = 1 / (1 + e^{-x}).
The present invention further improves the probability output function; the output probability is
p_i = 1 / (1 + exp(A·f_i + B)),
where p_i is the probability output for sample i, f_i is the value of the new decision function, and A and B are parameters that adjust the size of the mapped value. A and B are determined by splitting the target parameter t_i into a value t+ used when the sample is of the positive class and a value t- used when it is of the negative class.
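The sigmoid mapping of a decision value to a probability can be sketched as below; the values of A and B are illustrative placeholders, since in the patent they are fitted from the training targets t+ and t-.

```python
import math

def sigmoid(x):
    """Project a real value onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decision_to_probability(f, A, B):
    """Sigmoid-style calibration of a decision value f with parameters A and B."""
    return 1.0 / (1.0 + math.exp(A * f + B))

print(sigmoid(0.0))  # 0.5
# With A = -1 and B = 0 the calibration reduces to the plain sigmoid.
print(decision_to_probability(2.0, -1.0, 0.0) == sigmoid(2.0))  # True
```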
S2.4. After the probabilities are computed, the confidence is obtained from them. The existing method takes the largest predicted class probability as the confidence; the present invention instead subtracts the second-largest probability from the largest:
C_svm(x_i) = P_svm(y = c_max_i | x_i) - P_svm(y = c_sub_max_i | x_i)   (7)
Computing the confidence this way excludes data likely to lie in class-overlap regions from the SVM pseudo-label set, addressing the decline in SVM performance under class overlap.
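The margin-style confidence of equation (7), largest probability minus second largest, can be sketched as:

```python
def confidence(probs):
    """Confidence = top class probability minus the runner-up probability."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

print(round(confidence([0.7, 0.2, 0.1]), 2))    # 0.5 -> clear-cut prediction
print(round(confidence([0.45, 0.4, 0.15]), 2))  # 0.05 -> likely near a class-overlap region
```

A small value flags samples that sit near the boundary between two classes, which is exactly what the pseudo-label filtering above exploits.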
S3. The semi-supervised clustering algorithm, shown in Fig. 3, proceeds as follows:
S3.1. Determine the number of clusters (the K value) and their classes from the training samples, and assign each training sample to the corresponding cluster according to its class label, giving the initial clusters.
S3.2. For each cluster, update its centroid, then reassign every sample to a cluster according to its distance to each centroid. The centroid is updated as follows:
S3.2.1. The centroid is the mean of all sample vectors in the cluster:
μ_m = (1/|C_m|) Σ_{x∈C_m} x
where μ_m is the centroid of cluster C_m and |C_m| is its number of samples.
S3.2.2. The distance between a sample and each centroid is the Euclidean distance:
d(x, μ_m) = sqrt( Σ_{i=1}^{K} (μ_m[i] - x[i])² )
where K is the dimension of the centroid and μ_m[i] and x[i] are the i-th feature values of μ_m and sample x, respectively.
S3.2.3. Since the per-class sample counts in the text data set are severely imbalanced, training tilts toward the majority class. A weight is therefore added to the text-to-centroid distance to shift the threshold, giving a new distance formula, in which |C_m| is the number of samples in cluster C_m, μ_m is its centroid, x_i is a sample in the cluster, V_m is the number of samples in the class to which centroid μ_m belongs, V is the total number of samples, K is the dimension of the centroid, and μ_m[i] and x[i] are the i-th feature values of μ_m and sample x, respectively.
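The weighted distance formula itself appears only as an image in the source, so the sketch below assumes the simplest weighting consistent with the surrounding text: multiply the Euclidean distance by V_m/V, so that minority-class centroids (small V_m) pull samples in more easily. Treat the exact factor as an assumption.

```python
import math

def euclidean(x, mu):
    """Plain Euclidean distance between a sample and a centroid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu, x)))

def weighted_distance(x, mu, v_m, v_total):
    """Assumed class-size weighting: scale the distance by V_m / V.

    A small class (small v_m) yields a smaller scaled distance, shifting
    the assignment threshold toward minority-class centroids.
    """
    return (v_m / v_total) * euclidean(x, mu)

print(euclidean([0.0, 0.0], [3.0, 4.0]))                   # 5.0
print(weighted_distance([0.0, 0.0], [3.0, 4.0], 10, 100))  # 0.5
```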
S3.3. Judge whether each cluster meets the splitting condition; split the clusters that do, and update the centroids and the K value.
The splitting condition: if the current cluster contains noise (misclassified samples), it is split; otherwise no split is needed.
The splitting process: find the sample point x in the current cluster C_m farthest from the centroid, and let r be the distance from x to the centroid. The current cluster is then divided into two parts A and B:
A = {x ∈ C_m : d(x, μ_m) ≤ r}   (11)
B = {x ∈ C_m : d(x, μ_m) > r}   (12)
where d(x, μ_m) is the distance between sample point x and centroid μ_m.
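The partition of equations (11)-(12) can be sketched as follows; the distance function and the choice of the radius r are passed in by the caller, since the patent derives r from the farthest sample of the cluster being split.

```python
def split_at_radius(cluster, centroid, r, dist):
    """Partition a cluster into A = {x : d(x, centroid) <= r} and B = {x : d > r}."""
    A = [x for x in cluster if dist(x, centroid) <= r]
    B = [x for x in cluster if dist(x, centroid) > r]
    return A, B

# One-dimensional toy cluster around centroid 0.0, split at radius 4.0.
A, B = split_at_radius([1.0, 2.0, 5.0, 9.0], 0.0, 4.0, lambda x, c: abs(x - c))
print(A, B)  # [1.0, 2.0] [5.0, 9.0]
```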
S3.4. According to the distance between each test sample and the centroids, assign each test sample to the corresponding cluster and compute its confidence.
S3.4.1. Distances are computed with the Euclidean distance of S3.2.2.
S3.4.2. Before the confidence, the probability that the text belongs to each class must be computed:
P[i] = (1 / d[cluster[i]]) / sum   (14)
where P[i] is the probability that sample i belongs to its current cluster, cluster[i] is the label of the cluster that sample i belongs to, d[j] is the distance from the current text i to the centroid of the j-th cluster, and sum is the sum of the reciprocals of the distances from sample i to every centroid:
sum = Σ_j (1 / d[j])   (15)
S3.4.3. The confidence is measured by the difference between the largest class probability and the second-largest:
C_SKAS(x_i) = P_SKAS(y = C_max_i | x_i) - P_SKAS(y = C_sub_max_i | x_i)   (16)
S3.5. Repeat steps S3.2-S3.4 until the iteration stopping condition is met.
S3.6. From each cluster's class, obtain the class and confidence of the test text.
S4. The fusion algorithm, shown in Fig. 4:
S4.1. Compute the sensitivity SE and specificity SP under the SVM classifier and under semi-supervised K-means:
SE = TP / (TP + FN), SP = TN / (TN + FP)
where TP (true positives) is the number of samples correctly assigned to the positive class, FN (false negatives) is the number of positive samples mistakenly assigned to the negative class, TN (true negatives) is the number of samples correctly assigned to the negative class, and FP (false positives) is the number of negative samples mistakenly assigned to the positive class.
S4.2. Compute the Gmean of SE and SP under the SVM classifier and under semi-supervised K-means:
Gmean = sqrt(SE × SP)
S4.3. From the Gmean value W1 of the SVM classifier and the Gmean value W2 of semi-supervised K-means, compute a weight μ.
S4.4. Normalize the confidence C_SKAS(x_i) of the clustering result with the weight μ, and decide from the normalized result whether to use the classification result of the SVM classifier or of semi-supervised K-means.
S4.5. Fuse the classification results finally adopted and output the final prediction for the test text.
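The fusion statistics of S4.1-S4.3 can be sketched as below. The confusion counts are made up for illustration, and the way μ is formed from W1 and W2 is an assumption (a simple normalized ratio), since the patent gives that formula only as an image.

```python
import math

def se_sp(tp, fn, tn, fp):
    """Sensitivity SE = TP / (TP + FN); specificity SP = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def gmean(se, sp):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(se * sp)

def fusion_weight(w1, w2):
    """Assumed combination of the two Gmean values into a single weight mu."""
    return w1 / (w1 + w2)

se1, sp1 = se_sp(tp=40, fn=10, tn=80, fp=20)  # SVM confusion counts (illustrative)
se2, sp2 = se_sp(tp=35, fn=15, tn=90, fp=10)  # semi-supervised K-means (illustrative)
w1, w2 = gmean(se1, sp1), gmean(se2, sp2)
mu = fusion_weight(w1, w2)
print(round(w1, 3), round(w2, 3), round(mu, 3))  # 0.8 0.794 0.502
```

Gmean rewards classifiers that do well on both classes at once, which is why it is a common metric under class imbalance.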
Based on the imbalanced text classification method combining SVM and semi-supervised clustering described above, the present invention also proposes an imbalanced text classification system combining SVM and semi-supervised clustering. As shown in Fig. 6, the system comprises a preprocessing unit, a training unit, and a prediction unit.
The preprocessing unit vectorizes the text to be classified and normalizes the vectors, finally obtaining a vector-format data set, which it passes to the training unit.
The training unit comprises an SVM unit and a semi-supervised K-means unit, each of which classifies the test set and outputs the test set's predicted classes and confidences.
The prediction unit normalizes the classes and confidences output by the SVM unit and the semi-supervised K-means unit, fuses the two results, and obtains the final class of the test text according to the decision function.
The above embodiments merely illustrate the design ideas and features of the present invention, so that those skilled in the art can understand and implement it; the protection scope of the present invention is not limited to these embodiments. All equivalent variations or modifications made according to the principles and design ideas disclosed herein fall within the protection scope of the present invention.
Claims (10)
1. An imbalanced text classification method combining SVM and semi-supervised clustering, characterized by comprising the following steps:
S1. Preprocess the text to be processed to obtain vector-format text data as a data set; the data set is divided into a training set and a test set;
S2. Train an SVM classifier on the training set to obtain a classification model, predict the test set with the model, and obtain the predicted class and confidence of the test set;
S3. Cluster the data set with a semi-supervised clustering algorithm to obtain the predicted class and confidence of the test set;
S4. Fuse the predicted classes and confidences obtained by the SVM classifier and the semi-supervised clustering algorithm to obtain the final output, producing the final classification of the imbalanced text.
2. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, characterized in that S2 proceeds as follows:
S2.1. On the training set, using the one-versus-one method, find a hyperplane between every pair of classes that separates their texts;
S2.2. Attach a weight to each training sample's distance to the hyperplane, yielding a new decision function;
S2.3. Compute each sample's class and probability from the new decision function, where for multi-class classification one-versus-one voting yields the final class of the test text;
S2.4. Compute the confidence from the probabilities.
3. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 2, characterized in that the new decision function is expressed with weights w+ and w-, which respectively denote the weights added when the label is positive or negative, where N+ is the number of samples with a positive label, N- is the number of samples with a negative label, and f(x) is the SVM decision function.
4. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, characterized in that S3 proceeds as follows:
S3.1. Determine the number of clusters and their classes from the training set, and assign each training sample to the corresponding cluster according to its label, giving the initial clusters;
S3.2. For each cluster, update the centroid, then reassign every sample to a cluster according to its distance to each centroid;
S3.3. Check whether each cluster meets the splitting condition; split the clusters that do, and update the centroids and the K value again;
S3.4. According to the distance between each test sample and the centroids, assign each test sample to the corresponding cluster and compute its confidence;
S3.5. Repeat steps S3.2-S3.4 until the iteration stopping condition is met;
S3.6. From each cluster's class, obtain the class and confidence of the test text.
5. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 4, characterized in that the centroid update in S3.2 proceeds as follows:
S3.2.1. Compute the centroid: μm = (1/|Cm|) Σ_{x∈Cm} x;
S3.2.2. Compute the distance from a sample to the centroid: d(x, μm) = sqrt(Σ_{i=1}^{K} (μm[i] - x[i])²);
S3.2.3. Weight the sample-to-centroid distance to obtain the weighted distance;
where |Cm| is the number of samples in cluster Cm, μm is the centroid of cluster Cm, xi is a sample in the cluster, Vm is the number of samples in the class to which μm belongs, V is the total number of samples, K is the dimension of the centroid, and μm[i] and x[i] are the i-th feature values of μm and sample x, respectively.
6. The imbalanced text classification method combining SVM and semi-supervised clustering according to claim 4, characterized in that the splitting condition is: if the current cluster contains noise, it is split; otherwise no split is needed.
7. The unbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, wherein S4 comprises: separately computing the sensitivity SE and the specificity SP under the SVM classifier and under the semi-supervised clustering; separately computing the Gmean value of SE and SP for the SVM classifier and for the semi-supervised clustering; obtaining a weight μ from the resulting Gmean values; normalizing the confidence CSKAS(xi) of the classification results with the weight μ; determining, according to the normalized result, whether the classification result of the SVM classifier or of the semi-supervised clustering is adopted; and fusing the determined classification results to output the final prediction result for the test text.
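A hedged sketch of the S4 fusion rule. SE, SP, and Gmean are standard quantities (Gmean = sqrt(SE · SP)); the exact way the weight μ combines the two confidences is not spelled out in the claim, so the combination in `fuse` below is an assumption:

```python
import math

def gmean(se, sp):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(se * sp)

def fuse(conf_svm, conf_cluster, g_svm, g_cluster):
    """Weight each classifier's confidence by its Gmean and pick the winner.
    Assumed combination: the claim only states that a weight mu derived from
    the Gmean values normalizes the confidences before the decision."""
    mu = g_svm / (g_svm + g_cluster)
    return "svm" if mu * conf_svm >= (1 - mu) * conf_cluster else "cluster"

g_s = gmean(0.9, 0.8)   # SVM classifier
g_c = gmean(0.6, 0.7)   # semi-supervised clustering
print(fuse(0.7, 0.7, g_s, g_c))  # at equal confidence, the higher-Gmean model wins
```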
8. The unbalanced text classification method combining SVM and semi-supervised clustering according to claim 2 or 4, wherein the confidence is calculated as the largest probability among the classification categories minus the second-largest probability.
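The margin-style confidence of claim 8 (largest class probability minus the second largest) is simple to state in code; this sketch assumes the per-class probabilities are already available:

```python
def confidence(probs):
    """Largest class probability minus the second largest (claim 8)."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

print(confidence([0.5, 0.25, 0.25]))  # 0.25: a moderately confident prediction
```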
9. The unbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, wherein the preprocessing comprises: selecting the keywords of the text to be classified and removing stop words; computing weights from the keyword frequencies; forming the vector of the text to be classified; and normalizing the vector by min-max (deviation) standardization.
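A minimal sketch of the claim 9 pipeline: term-frequency weights over selected keywords, followed by min-max ("deviation") normalization. The stop-word list, the whitespace tokenization, and the vocabulary are placeholders, not the patented feature selection:

```python
STOP_WORDS = {"the", "a", "of", "and"}  # placeholder stop-word list

def text_to_vector(text, vocabulary):
    """Term-frequency weight for each keyword in the vocabulary."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [tokens.count(w) for w in vocabulary]

def min_max_normalize(vec):
    """Deviation (min-max) standardization: (x - min) / (max - min)."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0.0 for _ in vec]
    return [(x - lo) / (hi - lo) for x in vec]

vocab = ["svm", "cluster", "text"]
v = text_to_vector("the svm and the svm cluster text", vocab)  # [2, 1, 1]
print(min_max_normalize(v))  # [1.0, 0.0, 0.0]
```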
10. A classification system based on the unbalanced text classification method combining SVM and semi-supervised clustering according to claim 1, characterized by comprising a preprocessing unit, a training unit, and a prediction unit;
the preprocessing unit vectorizes the text to be classified, normalizes the vectors to obtain a data set in vector format, and inputs the data set to the training unit;
the training unit comprises an SVM unit and a semi-supervised clustering unit, which respectively classify the test set to obtain the class of each test sample and its confidence;
the prediction unit fuses the classes and confidences output by the SVM unit and the semi-supervised clustering unit to obtain the final result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910414208.0A CN110309302B (en) | 2019-05-17 | 2019-05-17 | Unbalanced text classification method and system combining SVM and semi-supervised clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110309302A true CN110309302A (en) | 2019-10-08 |
CN110309302B CN110309302B (en) | 2023-03-24 |
Family
ID=68075442
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910414208.0A Active CN110309302B (en) | 2019-05-17 | 2019-05-17 | Unbalanced text classification method and system combining SVM and semi-supervised clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110309302B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103886330A (en) * | 2014-03-27 | 2014-06-25 | 西安电子科技大学 | Classification method based on semi-supervised SVM ensemble learning |
CN104318242A (en) * | 2014-10-08 | 2015-01-28 | 中国人民解放军空军工程大学 | High-efficiency SVM active half-supervision learning algorithm |
CN105677564A (en) * | 2016-01-04 | 2016-06-15 | 中国石油大学(华东) | Adaboost software defect unbalanced data classification method based on improvement |
Non-Patent Citations (3)
Title |
---|
SUGATO BASU et al.: "Semi-supervised Clustering by Seeding", Proceedings of the 19th International Conference on Machine Learning * |
DAI Lin et al.: "Intrusion Detection System Based on Semi-supervised Learning", Computer Technology and Development * |
CAO Yaxi et al.: "Imbalanced Data Classification Algorithm Based on Cost-sensitive Large-margin Distribution Machine", Journal of East China University of Science and Technology (Natural Science Edition) * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851596A (en) * | 2019-10-11 | 2020-02-28 | 平安科技(深圳)有限公司 | Text classification method and device and computer readable storage medium |
CN110851596B (en) * | 2019-10-11 | 2023-06-27 | 平安科技(深圳)有限公司 | Text classification method, apparatus and computer readable storage medium |
CN110955773A (en) * | 2019-11-06 | 2020-04-03 | 中国科学技术大学 | Discriminant text clustering method and system based on minimum normalized information distance |
CN110955773B (en) * | 2019-11-06 | 2023-03-31 | 中国科学技术大学 | Discriminant text clustering method and system based on minimum normalized information distance |
CN110930399A (en) * | 2019-12-10 | 2020-03-27 | 南京医科大学 | TKA preoperative clinical staging intelligent evaluation method based on support vector machine |
CN113051462A (en) * | 2019-12-26 | 2021-06-29 | 深圳市北科瑞声科技股份有限公司 | Multi-classification model training method, system and device |
CN111241286A (en) * | 2020-01-16 | 2020-06-05 | 东方红卫星移动通信有限公司 | Short text emotion fine classification method based on mixed classifier |
CN111753874A (en) * | 2020-05-15 | 2020-10-09 | 江苏大学 | Image scene classification method and system combined with semi-supervised clustering |
CN111738308A (en) * | 2020-06-03 | 2020-10-02 | 浙江中烟工业有限责任公司 | Dynamic threshold detection method for monitoring index based on clustering and semi-supervised learning |
CN114077860A (en) * | 2020-08-18 | 2022-02-22 | 鸿富锦精密电子(天津)有限公司 | Method and system for sorting parts before assembly, electronic device and storage medium |
CN112232416A (en) * | 2020-10-16 | 2021-01-15 | 浙江大学 | Semi-supervised learning method based on pseudo label weighting |
CN112418289A (en) * | 2020-11-17 | 2021-02-26 | 北京京航计算通讯研究所 | Multi-label classification processing method and device for incomplete labeling data |
CN112463964A (en) * | 2020-12-01 | 2021-03-09 | 科大讯飞股份有限公司 | Text classification and model training method, device, equipment and storage medium |
CN112463964B (en) * | 2020-12-01 | 2023-01-17 | 科大讯飞股份有限公司 | Text classification and model training method, device, equipment and storage medium |
CN112241454A (en) * | 2020-12-14 | 2021-01-19 | 成都数联铭品科技有限公司 | Text classification method for processing sample inclination |
CN112883190A (en) * | 2021-01-28 | 2021-06-01 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and storage medium |
WO2022160449A1 (en) * | 2021-01-28 | 2022-08-04 | 平安科技(深圳)有限公司 | Text classification method and apparatus, electronic device, and storage medium |
CN112860895A (en) * | 2021-02-23 | 2021-05-28 | 西安交通大学 | Tax payer industry classification method based on multistage generation model |
CN112860895B (en) * | 2021-02-23 | 2023-03-28 | 西安交通大学 | Tax payer industry classification method based on multistage generation model |
CN114281994A (en) * | 2021-12-27 | 2022-04-05 | 盐城工学院 | Text clustering integration method and system based on three-layer weighting model |
CN114661903A (en) * | 2022-03-03 | 2022-06-24 | 贵州大学 | Deep semi-supervised text clustering method, device and medium combining user intention |
CN114661903B (en) * | 2022-03-03 | 2024-07-09 | 贵州大学 | Deep semi-supervised text clustering method, device and medium combining user intention |
CN115994527A (en) * | 2023-03-23 | 2023-04-21 | 广东聚智诚科技有限公司 | Machine learning-based PPT automatic generation system |
CN116540316A (en) * | 2023-07-06 | 2023-08-04 | 华设检测科技有限公司 | Geological soil layer testing method based on SVM classification algorithm and clustering algorithm |
CN116540316B (en) * | 2023-07-06 | 2023-09-01 | 华设检测科技有限公司 | Geological soil layer testing method based on SVM classification algorithm and clustering algorithm |
CN117253095A (en) * | 2023-11-16 | 2023-12-19 | 吉林大学 | Image classification system and method based on biased shortest distance criterion |
CN117253095B (en) * | 2023-11-16 | 2024-01-30 | 吉林大学 | Image classification system and method based on biased shortest distance criterion |
Also Published As
Publication number | Publication date |
---|---|
CN110309302B (en) | 2023-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110309302A (en) | Unbalanced text classification method and system combining SVM and semi-supervised clustering | |
CN103632168B (en) | Classifier integration method for machine learning | |
CN109815492A (en) | Intent recognition method based on a discriminative model, recognition device and medium | |
CN111126482B (en) | Remote sensing image automatic classification method based on multi-classifier cascade model | |
CN107608999A (en) | Question classification method suitable for automatic question-answering systems | |
CN110969191B (en) | Glaucoma prevalence probability prediction method based on similarity maintenance metric learning method | |
CN107798033B (en) | Case text classification method in public security field | |
CN105261367B (en) | Speaker recognition method | |
CN110717554B (en) | Image recognition method, electronic device, and storage medium | |
CN106611052A (en) | Text label determination method and device | |
CN108647736A (en) | A kind of image classification method based on perception loss and matching attention mechanism | |
CN105046195A (en) | Human behavior identification method based on asymmetric generalized Gaussian distribution model (AGGD) | |
CN110751027B (en) | Pedestrian re-identification method based on deep multi-instance learning | |
CN108932318A (en) | A kind of intellectual analysis and accurate method for pushing based on Policy resources big data | |
CN104616319A (en) | Multi-feature selection target tracking method based on support vector machine | |
CN113553906A (en) | Method for discriminating unsupervised cross-domain pedestrian re-identification based on class center domain alignment | |
CN105975611A (en) | Self-adaptive combined downsampling reinforcing learning machine | |
WO2021189830A1 (en) | Sample data optimization method, apparatus and device, and storage medium | |
CN116226785A (en) | Target object recognition method, multi-mode recognition model training method and device | |
CN103235954A (en) | Improved AdaBoost algorithm-based foundation cloud picture identification method | |
CN101216886B (en) | A shot clustering method based on spectral segmentation theory | |
Fu et al. | Speaker independent emotion recognition based on SVM/HMMs fusion system | |
Zhang et al. | Learn to adapt for generalized zero-shot text classification | |
CN103744958A (en) | Webpage classification algorithm based on distributed computation | |
CN112200260B (en) | Figure attribute identification method based on discarding loss function |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||