Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an optimized classification method for short texts. On one hand, the method expands and exploits richer semantic features to classify short texts, improving classification accuracy; on the other hand, the similar feature clusters obtained by two-stage clustering replace the original features for classification, reducing feature dimensionality and improving the efficiency of short text classification. The invention can better support related applications such as spam classification, mail classification, and microblog topic classification.
The specific technical scheme is as follows:
Step one, acquiring an original data set and preprocessing it;
A. the original data set comes from open source news corpora published on the web (e.g., by the Fudan and Sogou laboratories);
B. adding a collected and sorted network real word dictionary for improving the precision of subsequent word segmentation;
C. performing word segmentation on the original data set and removing stop words to finish preprocessing. The segmentation uses the Chinese Academy of Sciences segmentation tool ICTCLAS 2018.
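A minimal sketch of this preprocessing pipeline, assuming Python. jieba stands in for ICTCLAS 2018, whose API is not shown in the source; the dictionary and stop-word file names are hypothetical.

```python
# Hedged preprocessing sketch: jieba stands in for ICTCLAS 2018;
# "network_word_dict.txt" and "stopwords.txt" are hypothetical file names.
import jieba

jieba.load_userdict("network_word_dict.txt")   # collected network real-word dictionary

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def preprocess(text):
    """Segment one raw document and drop stop words."""
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

raw_documents = ["...", "..."]                 # the original news corpus
tokenized_docs = [preprocess(doc) for doc in raw_documents]
```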
Step two, selecting a feature item set from the preprocessed data set obtained in step one;
specifically, each word in the preprocessed original data set is traversed, and the distinct words whose frequency exceeds a set threshold are selected as feature words to construct the feature item set.
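A sketch of this selection step, assuming the `tokenized_docs` list from the preprocessing sketch above; the threshold value is illustrative (Example 1 uses 3).

```python
# Hedged sketch of step two: keep each distinct word whose corpus
# frequency exceeds a threshold.
from collections import Counter

def build_feature_set(tokenized_docs, min_freq=3):
    freq = Counter(w for doc in tokenized_docs for w in doc)
    return sorted(w for w, c in freq.items() if c > min_freq)

features = build_feature_set(tokenized_docs)
```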
Step three, training the collected large-scale corpus with a word vector tool to obtain a word vector model;
A. collecting open source news corpora from the web (the Fudan and Sogou laboratories) and Chinese corpora from Wikipedia, and preprocessing the data; the preprocessing is the same as in step one;
B. performing word vector training on the preprocessed corpus with the word2vec word vector tool. Word2Vec is an open-source word vector computation tool from Google. It trains efficiently on million-word dictionaries and billion-token data sets, and the resulting word vectors (word embeddings) measure the similarity between words well;
C. storing the word vector model obtained in step B.
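A sketch of this training step with gensim's word2vec implementation; hyperparameters other than the 300-dimension setting (used in the embodiments below) are illustrative assumptions.

```python
# Hedged sketch of step three using gensim's Word2Vec.
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=tokenized_docs,  # preprocessed corpus: one token list per document
    vector_size=300,           # 300-dimensional word vectors, as in the embodiments
    window=5,
    min_count=3,
    workers=4,
)
model.save("word2vec.model")   # step C: persist the trained model
```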
Step four, representing each feature item in the feature item set as a word vector using the word vector model, and performing one-stage preliminary clustering on the feature item word vectors to obtain a plurality of preliminary feature clusters;
the one-stage preliminary clustering is performed with a spectral clustering algorithm as follows:
A. determining the number K of cluster centers using the elbow method. As K increases, the samples are partitioned more finely, the cohesion of each cluster gradually improves, and the sum of squared errors (SSE) gradually decreases. When K is smaller than the true number of clusters, increasing K greatly improves the cohesion of each cluster, so the SSE drops steeply; once K reaches the true number of clusters, the return on cohesion from further increases in K diminishes rapidly, so as K keeps growing the SSE curve first drops sharply and then flattens out. The plot of SSE against K therefore has the shape of an elbow, and the K value at the elbow is the true number of clusters in the data.
B. calling the machine learning toolkit scikit-learn to perform spectral clustering on the feature items in the feature item set;
C. obtaining K preliminary feature clusters.
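A sketch of step four, assuming the `features` list and `model` from the earlier sketches. KMeans inertia is used here as the sum-of-squared-errors quantity for the elbow method (an assumption; the source does not name a tool for this step), and the final K = 11 is the value from Example 1.

```python
# Hedged sketch of step four: elbow method, then spectral clustering
# with scikit-learn. The K range and final K = 11 are illustrative.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

X = np.array([model.wv[w] for w in features])  # word vectors of the feature items

sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(2, 20)]                  # plot sse vs. k; pick the elbow

K = 11
labels = SpectralClustering(n_clusters=K, random_state=0).fit_predict(X)
preliminary_clusters = [[w for w, lab in zip(features, labels) if lab == c]
                        for c in range(K)]
```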
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in step four, thereby obtaining a plurality of similar feature clusters;
the two-stage loose clustering specifically comprises the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. the cosine similarity $Vsim(V_i, V_j)$ between the feature vectors in each preliminary feature cluster is calculated using the following formula:

$$Vsim(V_i, V_j) = \frac{\sum_{k=1}^{n} V_{ik}\, V_{jk}}{\sqrt{\sum_{k=1}^{n} V_{ik}^2}\,\sqrt{\sum_{k=1}^{n} V_{jk}^2}}$$

where $V_i$ is the vector of feature word $i$, $V_j$ is the vector of feature word $j$, $k$ indexes the vector dimensions, $n$ is the total dimension of the feature vectors, $V_{ik}$ is the $k$-th component of the vector of feature word $i$, and $V_{jk}$ is the $k$-th component of the vector of feature word $j$;
C. matching and connecting the feature words with cosine similarity larger than a set threshold;
D. within each preliminary feature cluster obtained in step four, constructing a feature-similarity adjacency list from the matched feature words and performing a depth-first traversal of the adjacency list; each traversal iteration forms one similar feature cluster, finally yielding a plurality of similar feature clusters. We call this way of clustering, in which similar features are connected directly, loose clustering: it is not based on any model-based clustering algorithm, but selects semantically similar feature words, matches them, and merges the matched pairs into similar feature clusters, as sketched below.
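A minimal sketch of the loose clustering just described, assuming word vectors are available as a word-to-array mapping `wv` (e.g., gensim's `model.wv`) and that `preliminary_clusters` comes from the step-four sketch; the threshold 0.5 is illustrative.

```python
# Hedged sketch of two-stage loose clustering inside one preliminary cluster:
# pair feature words whose cosine similarity exceeds a threshold, build an
# adjacency list, and read off connected components by depth-first traversal.
import numpy as np

def cosine(u, v):
    """Cosine similarity Vsim(Vi, Vj) from step five B."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def loose_cluster(words, wv, threshold=0.5):
    """Loosely cluster the feature words of one preliminary cluster."""
    adj = {w: [] for w in words}
    for i, wi in enumerate(words):          # step C: pairwise matching
        for wj in words[i + 1:]:
            if cosine(wv[wi], wv[wj]) > threshold:
                adj[wi].append(wj)
                adj[wj].append(wi)
    clusters, seen = [], set()
    for w in words:                         # step D: DFS over the adjacency list
        if w in seen:
            continue
        stack, component = [w], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(adj[node])
        if len(component) > 1:              # unmatched words stay single features
            clusters.append(component)
    return clusters

similar_clusters = [c for pc in preliminary_clusters
                    for c in loose_cluster(pc, model.wv)]
```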
Step six, replacing the feature words obtained in step four with the similar feature clusters obtained in step five, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the feature words obtained in step four; if a feature word belongs to a similar feature cluster, it is replaced by that cluster, otherwise it is retained, finally yielding a replacement data set T of the original data set;
B. calculating a cluster vector of each similar feature cluster, wherein the cluster vector is a mean vector of word vectors of all feature items in each similar feature cluster;
C. calculating a space vector for the replacement data set T by applying the TF-IDF algorithm to T;
D. the final vector representation of the replacement data set T is the concatenation of the cluster vectors from B and the space vector from C;
E. performing text classification on the final vectors of the replacement data set T with an SVM text classifier to obtain the classification result.
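A sketch of the whole of step six, assuming `similar_clusters`, `tokenized_docs`, and `model` from the earlier sketches plus a hypothetical per-document label array `y`; the cluster token names are invented for illustration.

```python
# Hedged sketch of step six: replace feature words with a cluster token,
# compute TF-IDF on the replaced corpus, append mean cluster vectors,
# and classify with a linear SVM.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

word2cluster = {w: f"CLUSTER_{i}"
                for i, cluster in enumerate(similar_clusters) for w in cluster}
replaced = [[word2cluster.get(w, w) for w in doc] for doc in tokenized_docs]  # A

cluster_vec = {f"CLUSTER_{i}": np.mean([model.wv[w] for w in cluster], axis=0)
               for i, cluster in enumerate(similar_clusters)}                 # B

tfidf = TfidfVectorizer(analyzer=lambda doc: doc)  # documents are token lists
X_tfidf = tfidf.fit_transform(replaced).toarray()                             # C

def doc_embedding(doc):
    vecs = [cluster_vec[t] for t in doc if t in cluster_vec]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

X = np.hstack([X_tfidf, np.array([doc_embedding(d) for d in replaced])])      # D
classifier = LinearSVC().fit(X, y)   # E: y is a hypothetical label array
```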
The invention discloses a short-text-oriented optimized classification method that replaces the features of the original data set with two-stage clustered features before short text classification. In the first stage, spectral clustering preliminarily groups the features into several large preliminary feature clusters; in the second stage, loose clustering inside each preliminary feature cluster yields several smaller similar feature clusters.
Compared with traditional short text classification, the invention has the following advantages:
First, the loose clustering used by the invention can control the similarity level of the feature clusters and cover more short text features, enhancing the semantic representation of short texts and improving the precision of subsequent classification. Specifically, the method controls cluster similarity by varying the similarity pairing threshold and the maximum cluster size. When the pairing threshold is low, as many similar features as possible are gathered into one feature sub-cluster; when it is high, the probability of outliers in each cluster is reduced, improving the quality of each similar feature cluster and hence the precision of subsequent classification. For example, in our word vector model, a pairing threshold of 0.35 yields feature clusters covering 66% of the whole feature item set; at a threshold of 0.6 the coverage drops to 25%. A sketch for measuring this coverage follows.
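To make the coverage claim concrete, a hedged sketch (reusing the `loose_cluster` helper from the step-five sketch) of how the covered fraction of the feature item set could be measured at a given pairing threshold. The figures above (66% at 0.35, 25% at 0.6) are the source's reported values, not outputs of this code.

```python
# Hedged sketch: fraction of the feature item set that ends up inside
# some similar feature cluster at a given pairing threshold.
def coverage(preliminary_clusters, wv, threshold):
    covered = sum(len(c)
                  for words in preliminary_clusters
                  for c in loose_cluster(words, wv, threshold))
    total = sum(len(words) for words in preliminary_clusters)
    return covered / total

# e.g. coverage(preliminary_clusters, model.wv, 0.35) vs. threshold 0.6
```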
Second, the two-stage clustering reduces both the computation needed to construct the similar feature clusters and the dimensionality of the feature space, improving the efficiency of subsequent classification. The reason is as follows: constructing similar feature sub-clusters requires a similarity calculation for every pair of words in the feature set; if sub-clusters were built directly on a large feature set, the number of pairwise comparisons, and with it the overall computational overhead, would grow dramatically (splitting n features into K roughly equal preliminary clusters cuts the roughly n² pairwise comparisons to about n²/K). Dividing the original feature item set into several preliminary feature clusters in the first stage and building sub-clusters inside them therefore reduces the amount of computation.
In addition, the vector space built by concatenating the cluster-replaced TF-IDF vector with word vectors has a much smaller dimension than one built directly from TF-IDF. For example, after clustering, the pairs "apple-banana", "apple-pear" and "banana-pear" merge into the similar feature cluster {apple, banana, pear}, whose feature dimension drops from 3 to 1. Replacing the feature words contained in a short text with their feature clusters also improves classification precision: a short text contains few feature words, and the limited semantic information those few words carry makes it hard to train a suitable classifier model. Representing text with feature clusters strengthens the semantic expression of each individual feature word and hence of the whole short text.
Third, the concatenation of the feature vectors is optimized. TF-IDF vectors have very high dimension, usually thousands or even tens of thousands, whereas the corresponding word vectors are comparatively low-dimensional, typically 200 to 400 dimensions. If the TF-IDF vector and the word vector are concatenated directly, the word vector's small share makes its effect hard to realize, so such concatenation rarely yields good classification results. Using the TF-IDF vector after feature cluster replacement reduces the feature vector dimension by 20%-66% and amplifies the effect of the word vector, so this concatenation achieves better results.
Detailed Description
Comparative example 1:
Chinese patent CN107368611A, "A short text classification method", first separates the two classes of samples with a hyperplane, then computes the geometric distance between each class of samples and the hyperplane, divides several sub-domains by that distance, and assigns each sub-domain interval a different weight, with sub-domains farther from the hyperplane receiving smaller weights; the data are under-sampled according to these weights in the under-sampling stage, and the sampled examples are finally fed to an SVM classifier. In practice, using only an SVM classifier has limited effect, because SVM classification of large-scale data is slow and inefficient, so the time efficiency of that invention needs improvement.
Comparative example 2:
Chinese patent CN108080206A, "A short text classification method based on semantic enhancement", performs semantic expansion of short texts using external corpus resources. Aimed at the small information content and sparse semantics of short texts, it uses a high-quality expansion corpus and high-precision word vectors for semantically enhanced representation, together with an efficient text classification algorithm that captures the limited text features as fully as possible and effectively shortens classifier training time.
Such semantic expansion based on an external corpus requires a large amount of external data, and the resource and time overhead of building the external corpus is huge. Moreover, the method does not design a correspondence between specific feature words and the external corpus, so its classification precision may be low.
Fig. 1 is a flow chart of the method of the present invention, fig. 2 is a diagram of an embodiment of the method of the present invention, and fig. 3 is a diagram of a similar feature cluster visualization of the method of the present invention.
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
Example 1:
This embodiment is a specific implementation of the short text optimization classification method based on feature clustering. The method comprises six steps:
Step one, acquiring training data and preprocessing it as follows:
A. the training data come from the open source news corpora released by the Fudan and Sogou laboratories, contain more than 200,000 items, and comprise 6 categories: sports, Internet, economy, politics, art, and military;
B. adding a collected and sorted network real word dictionary for improving the precision of subsequent word segmentation;
C. removing stop words;
D. performing word segmentation on the training data to finish preprocessing.
Step two, for the training data obtained in step one, traversing each word in the segmented data set and selecting the distinct feature words whose frequency exceeds a set threshold to construct the feature item set. The threshold is set here to 3;
Step three, training the collected large-scale corpus, specifically as follows:
A. collecting open source news corpora from the Fudan and Sogou laboratories and preprocessing them; the preprocessing is the same as in step one;
B. performing word vector training on the preprocessed corpus by using a word2vec tool, wherein the dimension of the trained word vector is 300 dimensions;
C. storing the word vector model obtained in step B and using it to produce word vector representations of the feature item set obtained in step two.
Step four, performing one-stage preliminary clustering on the feature item word vectors obtained in step three with a spectral clustering algorithm, as follows, to obtain a plurality of preliminary feature clusters:
A. determining the number K of cluster centers using the elbow method: the initial value of K is set to 5, K is then increased step by step on a grid, and the clustering evaluation index is checked at each value; for this data set, the appropriate K determined by the elbow method is 11.
B. calling the scikit-learn toolkit to perform spectral clustering on the feature items in the feature item set;
C. obtaining K = 11 preliminary feature clusters.
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in the step four to obtain a plurality of similar feature clusters, and specifically performing the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. the cosine similarity $Vsim(V_i, V_j)$ between the feature vectors in each preliminary feature cluster is calculated using the following formula:

$$Vsim(V_i, V_j) = \frac{\sum_{k=1}^{n} V_{ik}\, V_{jk}}{\sqrt{\sum_{k=1}^{n} V_{ik}^2}\,\sqrt{\sum_{k=1}^{n} V_{jk}^2}}$$

where $V_i$ is the vector of feature word $i$, $V_j$ is the vector of feature word $j$, $k$ indexes the vector dimensions, $n$ is the total dimension of the feature vectors, $V_{ik}$ is the $k$-th component of the vector of feature word $i$, and $V_{jk}$ is the $k$-th component of the vector of feature word $j$;
C. matching and connecting the feature words with cosine similarity larger than a set threshold;
D. first, within each preliminary feature cluster obtained in step four, a feature-similarity adjacency list is constructed from the matched feature words; the adjacency list is then traversed depth-first using the loose clustering strategy, each traversal iteration forming one similar feature cluster, finally yielding a plurality of similar feature clusters.
The sizes of the similar feature clusters of the invention are shown in Table 1. When the similarity pairing threshold is set small, the similar feature clusters become large, but such large clusters may have negative effects on text classification. When the threshold is too high, each similar feature cluster has the strongest internal semantic relation, but the overall size of the similar feature clusters drops sharply. For the word vector model used in the invention, the optimal similarity pairing threshold is 50%.
Table 1: Similar feature cluster results
Step six, replacing the feature words obtained in step four with the similar feature clusters obtained in step five, representing the features of step five by concatenating the TF-IDF vectors of the cluster-replaced corpus with the similar feature cluster word vectors, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the original data set, and replacing all feature words contained in the similar feature clusters with the similar feature clusters to obtain a replacement data set T of the original corpus;
B. calculating a cluster vector of each similar feature cluster, wherein the cluster vector is a vector mean value of all feature items in each corresponding similar feature cluster;
C. performing vector space calculation on T using the TF-IDF algorithm to obtain the corresponding vector space.
D. the final vector representation of the replacement data set T is the direct concatenation of the cluster vectors from B and the TF-IDF vectors from C. When the similarity pairing threshold is very high, only a few features are affected, so its impact on class recall and on feature reduction is very limited; as the threshold decreases, the similar feature clusters keep growing and the feature reduction improves, reaching at best 36%.
E. classifying the text topics of the replacement data set T with a text classifier to obtain the classification result. Experiments were run with several text classifiers: support vector machine, naive Bayes, and logistic regression. The TF-IDF classification accuracy baseline is 78.12%; on that baseline, the classification accuracy of the invention improves by 2.5%-5%, 1.6%-4.1%, and 4.1%-5.3%, respectively.
Example 2:
This embodiment is a specific implementation of the short text optimization classification method based on feature clustering. The method comprises six steps:
Step one, acquiring training data and preprocessing it as follows:
A. the training data come from a China Mobile SMS data set of about 100,000 items in 5 categories: normal, marketing, advertisement, credit card, and others;
B. removing stop words;
C. performing word segmentation on the training data to finish preprocessing.
Step two, for the training data obtained in step one, traversing each word in the segmented data set and selecting the distinct feature words whose frequency exceeds a set threshold (here 2) to construct the feature item set;
Step three, training the collected large-scale corpus, specifically as follows:
A. collecting open source Chinese corpora from Wikipedia and preprocessing them; the preprocessing is the same as in step one;
B. performing word vector training on the preprocessed corpus by using a word2vec tool, wherein the dimension of the trained word vector is 300 dimensions;
C. storing the word vector model obtained in step B and using it to produce word vector representations of the feature item set obtained in step two.
Step four, performing one-stage preliminary clustering on the feature item word vectors obtained in step three with a spectral clustering algorithm, as follows, to obtain a plurality of preliminary feature clusters:
A. determining the number K of cluster centers using the elbow method: the initial value of K is set to 3, K is then increased step by step on a grid, and the clustering evaluation index is checked at each value; for this data set, the appropriate K determined by the elbow method is 9.
B. calling a toolkit (here scikit-learn) to perform spectral clustering on the feature items in the feature item set;
C. obtaining K = 9 preliminary feature clusters.
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in the step four to obtain a plurality of similar feature clusters, and specifically performing the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. calculating cosine similarity between the feature vectors in each preliminary feature cluster;
C. matching and connecting the feature words with cosine similarity larger than a set threshold;
D. first, within each preliminary feature cluster obtained in step four, a feature-similarity adjacency list is constructed from the matched feature words; the adjacency list is then traversed depth-first using the loose clustering strategy, each traversal iteration forming one similar feature cluster, finally yielding a plurality of similar feature clusters.
When the similarity pairing threshold is set small, the similar feature clusters become large, but such large clusters may have negative effects on text classification. Because this data set is small, the appropriate pairing threshold may fluctuate widely, so several groups of experiments need to be set up for verification. At a threshold of 65% or higher, each similar feature cluster has the strongest internal semantic relation, but the overall size of the similar feature clusters drops sharply; at 50% or lower, the improvement in classification accuracy is not significant. For the word vector model used in the invention, the optimal similarity pairing threshold is 55%.
Step six, replacing the feature words obtained in step four with the similar feature clusters obtained in step five, representing the features of step five by concatenating the TF-IDF vectors of the cluster-replaced corpus with the similar feature cluster word vectors, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the original data set, and replacing all feature words contained in the similar feature clusters with the similar feature clusters to obtain a replacement data set T of the original corpus;
B. calculating a cluster vector of each similar feature cluster, wherein the cluster vector is a vector mean value of all feature items in each corresponding similar feature cluster;
C. performing vector space calculation on T using the TF-IDF algorithm to obtain the corresponding vector space.
D. the final vector representation of the replacement data set T is the direct concatenation of the cluster vectors from B and the TF-IDF vectors from C.
E. classifying the text topics of the replacement data set T with a text classifier to obtain the classification result. Experiments were run with several text classifiers: support vector machine (SVM), naive Bayes (NB), and logistic regression (LR). The classification accuracy baseline using TF-IDF is 85.12%; under the current data set, the classification accuracy of the invention improves by at most 2.9%-4%, 2.6%-3.1%, and 1.9%-3.3%, respectively. Table 2 shows the specific similar feature clustering results.
Table 2: Similar feature cluster results